Exclude specific URL patterns from search results returned by the web connector – Lucidworks

Issue

Search results are returning URLs that are not explicitly listed in the sitemap, including internal or staging URLs that should not be visible to end users. These documents may originate from internal paths such as /aemedge/ or /pm/.

Diagnosis

When the web connector crawls a site, Fusion treats the configured startLinks and sitemap URLs only as seed points. From those seeds, the connector recursively follows every internal link it discovers on each page unless explicitly restricted. As a result, content that is not listed in the sitemap can still be indexed.

To confirm this behavior:

Run a query in Query Workbench or via the Query API using the affected term.
Check the returned document URLs to see if they match paths not intended for indexing (e.g., URLs containing /aemedge/ or /pm/).
Use the id or parent_s fields in Fusion to confirm that the document was discovered via crawling, not from the sitemap directly.
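
The check described in these steps can be scripted once a query response has been saved. The sketch below is illustrative only. It assumes a Solr-style JSON response with documents under response.docs and the page URL stored in each document's id field. The helper name find_unwanted_docs and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Path fragments that should not appear in search results (from this article).
UNWANTED_FRAGMENTS = ("/aemedge/", "/pm/")


def find_unwanted_docs(response, fragments=UNWANTED_FRAGMENTS):
    """Return (id, parent_s) pairs for documents whose URL path contains an unwanted fragment.

    Assumes a Solr-style response: {"response": {"docs": [{"id": ..., "parent_s": ...}]}}.
    """
    flagged = []
    for doc in response.get("response", {}).get("docs", []):
        path = urlparse(doc.get("id", "")).path
        if any(fragment in path for fragment in fragments):
            flagged.append((doc["id"], doc.get("parent_s")))
    return flagged


# Hypothetical sample response for illustration.
sample_response = {
    "response": {
        "docs": [
            {"id": "https://www.example.com/products/overview.html"},
            {"id": "https://www.example.com/aemedge/fragment.html",
             "parent_s": "https://www.example.com/products/overview.html"},
            {"id": "https://www.example.com/pm/draft.html",
             "parent_s": "https://www.example.com/index.html"},
        ]
    }
}

flagged_docs = find_unwanted_docs(sample_response)
for doc_id, parent in flagged_docs:
    print(f"{doc_id} (linked from: {parent})")
```

A populated parent_s value on a flagged document suggests it was reached by following a link from that parent page rather than from a sitemap entry.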

Environment

Fusion 5.x
Web connector datasource with crawling via startLinks and sitemap
Applicable to both Lucidworks-hosted (Managed Fusion) and self-hosted deployments

Cause

The crawler follows every link it discovers unless explicitly limited. URLs such as those under /aemedge/ or /pm/ are most likely reached through internal links found during the crawl, even though they are not listed in the submitted sitemaps.
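
A toy model can make this behavior concrete. The sketch below is not the connector's implementation. It is a plain breadth-first traversal over a hypothetical in-memory link graph (all URLs are invented), showing how a page that is not in the seed list still gets visited because another page links to it.

```python
from collections import deque

# Hypothetical link graph: each page maps to the internal links found on it.
SITE_LINKS = {
    "https://www.example.com/index.html": [
        "https://www.example.com/products/overview.html",
    ],
    "https://www.example.com/products/overview.html": [
        "https://www.example.com/aemedge/fragment.html",
        "https://www.example.com/pm/draft.html",
    ],
    "https://www.example.com/aemedge/fragment.html": [],
    "https://www.example.com/pm/draft.html": [],
}


def crawl(seeds, links):
    """Breadth-first traversal: visit each seed, then every link reachable from it."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Only index.html is listed as a seed (standing in for the sitemap entries).
seeds = ["https://www.example.com/index.html"]
crawled = crawl(seeds, SITE_LINKS)
not_in_seeds = [url for url in crawled if url not in seeds]
print(not_in_seeds)
```

In this model, three of the four visited pages were never listed as seeds, including the /aemedge/ and /pm/ pages.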

Resolution

To prevent these URLs from being indexed in future crawls, configure exclusive regex filters in the Limit Documents section of the Web connector configuration.

Add exclusive regex filters

Go to the Fusion UI and navigate to:
Indexing > Datasources > [Your Web connector datasource]
Under the Limit Documents section, locate the Exclusive regexes field.
Add the following patterns to exclude paths like /aemedge/ and /pm/:

.*/aemedge/.*
.*/pm/.*

Save the datasource configuration.
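
Before running a crawl, it can help to check the patterns against a few representative URLs. The following is a minimal sketch, assuming the connector applies each pattern to the whole URL. It uses Python's re module, which behaves the same as Java regular expressions for patterns this simple. The sample URLs and the is_excluded helper are hypothetical.

```python
import re

# Patterns from the Exclusive regexes field above.
EXCLUSIVE_REGEXES = [r".*/aemedge/.*", r".*/pm/.*"]
COMPILED = [re.compile(pattern) for pattern in EXCLUSIVE_REGEXES]


def is_excluded(url):
    """Return True if the whole URL matches any exclusion pattern."""
    return any(pattern.fullmatch(url) for pattern in COMPILED)


# Hypothetical URLs for illustration.
sample_urls = [
    "https://www.example.com/products/overview.html",
    "https://www.example.com/aemedge/fragment.html",
    "https://www.example.com/pm/draft.html",
    "https://www.example.com/epm/report.html",
]

for url in sample_urls:
    label = "EXCLUDE" if is_excluded(url) else "keep"
    print(f"{label:8}{url}")
```

Because each pattern includes both slashes, a path segment such as /epm/ is not caught by the /pm/ rule.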

Reindex the datasource

After applying the filters:

Clear the existing content from the collection so that documents indexed before the change are removed:

curl -X POST http://<fusion-host>/api/apps/<app-name>/index-pipelines/<pipeline-id>/collections/<collection-name>/index-reset

Rerun the datasource to start a fresh crawl with the new exclusions in place.

Note: If the existing documents are not removed, they will continue to appear in query results even after the new rules are applied.

Optional: Update your `robots.txt`

To enforce exclusion at the source level and reduce unnecessary crawling, consider adding Disallow rules to your website's robots.txt file:

User-agent: *
Disallow: /aemedge/
Disallow: /pm/

This tells crawlers that honor robots.txt, including the Fusion web connector when it is configured to obey robots.txt, not to fetch those paths in the first place.
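
The rules can be checked locally before they are deployed. The sketch below uses the robots.txt parser from Python's standard library, which may not interpret every edge case exactly as the Fusion connector does. The sample URLs and the allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules proposed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /aemedge/
Disallow: /pm/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs for illustration.
for url in (
    "https://www.example.com/products/overview.html",
    "https://www.example.com/aemedge/fragment.html",
    "https://www.example.com/pm/draft.html",
):
    print(f"{'allowed' if allowed(url) else 'blocked'}  {url}")
```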

Additional considerations

Always test regex filters and reindexing in a non-production environment before applying them to production.
Validate the query output before and after reindexing to ensure undesired URLs no longer appear.
Review the startLinks and sitemap configuration to ensure they are scoped as narrowly as necessary.