Issue
Search results return URLs that are not explicitly listed in the sitemap, including internal or staging URLs that should not be visible to end users. These documents may originate from internal paths such as /aemedge/ or /pm/.
Diagnosis
When using the web connector to crawl a site, Fusion uses the provided startLinks and sitemap URLs only as seed points. From those points, the connector recursively follows all internal links discovered on the page unless explicitly restricted. This can lead to content being indexed that is not directly listed in the sitemap.
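The seed-and-follow behavior described above can be illustrated with a minimal sketch. This is not Fusion's crawler code: the link graph, the example.com URLs, and the crawl function are hypothetical stand-ins, used only to show how pages outside the sitemap become reachable once links are followed.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for a real site:
# each key is a page URL, each value is the list of links found on that page.
LINK_GRAPH = {
    "https://example.com/": [
        "https://example.com/products",
        "https://example.com/aemedge/draft",
    ],
    "https://example.com/products": ["https://example.com/pm/internal-spec"],
    "https://example.com/aemedge/draft": [],
    "https://example.com/pm/internal-spec": [],
}


def crawl(seeds, link_graph):
    """Breadth-first traversal: visit every page reachable from the seeds once."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# The sitemap lists only two pages, but the crawl reaches four.
sitemap_urls = ["https://example.com/", "https://example.com/products"]
crawled = crawl(sitemap_urls, LINK_GRAPH)
not_in_sitemap = [url for url in crawled if url not in sitemap_urls]
print(not_in_sitemap)
```

In this toy graph the two internal pages are crawled only because a sitemap page links to them, which is the situation the exclusion filters in the Resolution section are meant to address.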
To confirm this behavior:
Run a query in Query Workbench or via the Query API using the affected term.
Check the returned document URLs to see if they match paths not intended for indexing (e.g., URLs containing /aemedge/ or /pm/).
Use the id or parent_s fields in Fusion to confirm that the document was discovered via crawling, not from the sitemap directly.
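For a large result set, the URL check can be scripted. The sketch below assumes the query response has already been reduced to a list of dictionaries, that id holds the document URL, and that parent_s holds the page that linked to it. The sample documents and the flag_unwanted helper are hypothetical.

```python
# Path fragments treated as internal; taken from the examples in this article.
UNWANTED_FRAGMENTS = ("/aemedge/", "/pm/")


def flag_unwanted(docs, fragments=UNWANTED_FRAGMENTS):
    """Return id and parent_s for each doc whose URL contains an unwanted fragment."""
    flagged = []
    for doc in docs:
        url = doc.get("id", "")
        if any(fragment in url for fragment in fragments):
            flagged.append({"id": url, "parent_s": doc.get("parent_s")})
    return flagged


# Hypothetical documents, shaped like a trimmed query response.
sample_docs = [
    {"id": "https://example.com/products", "parent_s": "https://example.com/"},
    {"id": "https://example.com/aemedge/draft", "parent_s": "https://example.com/"},
    {
        "id": "https://example.com/pm/internal-spec",
        "parent_s": "https://example.com/products",
    },
]

for hit in flag_unwanted(sample_docs):
    print(hit["id"], "<- linked from", hit["parent_s"])
```

A non-empty parent_s on a flagged document suggests it was reached by following a link rather than read from the sitemap.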
Environment
Fusion 5.x
Web connector datasource with crawling via startLinks and sitemap
Applicable to both Lucidworks-hosted (Managed Fusion) and self-hosted deployments
Cause
The crawler follows discovered links unless explicitly limited. URLs such as those under /aemedge/ or /pm/ are likely reached through internal links discovered during the crawl, even though they are not listed in the submitted sitemaps.
Resolution
To prevent these URLs from being indexed in future crawls, configure exclusive regex filters in the Limit Documents section of the Web connector configuration.
Add exclusive regex filters
Go to the Fusion UI and navigate to: Indexing > Datasources > [Your Web connector datasource]
Under the Limit Documents section, locate the Exclusive regexes field.
Add the following patterns to exclude paths like /aemedge/ and /pm/:
.*/aemedge/.*
.*/pm/.*
Save the datasource configuration.
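Before running a crawl, the patterns can be checked against a few representative URLs. The sketch below uses Python's re module as a stand-in: Fusion connectors evaluate Java regular expressions, but patterns this simple behave the same in both engines. The sample URLs and the is_excluded helper are hypothetical, and matching against the full URL is an assumption about how the connector applies the filter.

```python
import re

# The two exclusion patterns from this article.
EXCLUSIVE_PATTERNS = [r".*/aemedge/.*", r".*/pm/.*"]
COMPILED = [re.compile(pattern) for pattern in EXCLUSIVE_PATTERNS]


def is_excluded(url):
    """True if the whole URL matches at least one exclusion pattern."""
    return any(pattern.fullmatch(url) for pattern in COMPILED)


# Hypothetical URLs chosen to probe the edges of the patterns.
samples = [
    "https://example.com/aemedge/draft",      # contains /aemedge/
    "https://example.com/docs/pm/spec.html",  # contains /pm/ below another path
    "https://example.com/products",           # ordinary public page
    "https://example.com/npm/readme",         # /npm/ is not /pm/
    "https://example.com/pm",                 # no trailing slash
]

for url in samples:
    print(is_excluded(url), url)
```

Note that a URL ending in /pm with no trailing slash is not matched by .*/pm/.*, so if such URLs exist on the site the pattern may need to be widened.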
Reindex the datasource
After applying the filters:
Clear the existing indexed content from the collection to remove already indexed documents:
curl -X POST http://<fusion-host>/api/apps/<app-name>/index-pipelines/<pipeline-id>/collections/<collection-name>/index-reset
Rerun the datasource to start a fresh crawl with the new exclusions in place.
Note: If the existing documents are not removed, they will continue to appear in query results even after the new rules are applied.
Optional: Update your robots.txt
To enforce exclusion at the source level and reduce unnecessary crawl depth, consider adding Disallow rules in your website's robots.txt file:
User-agent: *
Disallow: /aemedge/
Disallow: /pm/
This helps guide crawlers, including Fusion, not to follow or index those paths in the first place.
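The effect of these rules can be previewed with Python's standard urllib.robotparser module, which implements the usual prefix-based Disallow matching. This is a sketch of how a compliant crawler would read the rules, not a statement about Fusion's internals. The example.com URLs and the allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules from this article, parsed from a string instead of fetched over HTTP.
ROBOTS_TXT = """\
User-agent: *
Disallow: /aemedge/
Disallow: /pm/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="*"):
    """True if the parsed rules permit the given agent to fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "https://example.com/products",
    "https://example.com/aemedge/draft",
    "https://example.com/pm/internal-spec",
    "https://example.com/docs/pm/spec.html",
):
    print(allowed(url), url)
```

Because Disallow rules match from the start of the path, a nested path such as /docs/pm/ is still allowed here, whereas the regex .*/pm/.* would exclude it. The two mechanisms therefore complement each other and are not interchangeable.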
Additional considerations
Always test regex filters and reindexing in a non-production environment before applying them to production.
Validate the query output before and after reindexing to ensure undesired URLs no longer appear.
Review the startLinks and sitemap configuration to ensure they are scoped as narrowly as necessary.