Issue
Only one document is returned when crawling a website in a non-production Fusion environment, while the production environment successfully indexes multiple documents. This causes issues during development and testing, as the results do not reflect production behavior.
Diagnosis
To determine whether this issue applies:
Check if your non-production datasource is indexing only a single document despite having the same target site and configuration as production.
Compare Query Workbench results with API output to see whether expected results appear in one but not the other.
Examine datasource and pipeline configurations for subtle discrepancies.
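Subtle discrepancies are easier to spot with a mechanical comparison than by eye. The following is a minimal sketch, assuming both datasource configurations have been exported as JSON and parsed into plain objects. The property names used in the sample data (`parser.excludeTags`, `crawl.obeyRobots`) are hypothetical placeholders, not Fusion's actual configuration keys.

```javascript
// Recursively collect the dotted paths at which two parsed-JSON configs differ.
// Arrays are compared as whole values, so a changed list shows up as one entry.
function diffConfig(a, b, path) {
  path = path || "";
  var isPlainObject = function (v) {
    return v !== null && typeof v === "object" && !Array.isArray(v);
  };
  if (!isPlainObject(a) || !isPlainObject(b)) {
    return JSON.stringify(a) === JSON.stringify(b)
      ? []
      : [{ path: path || "(root)", prod: a, nonProd: b }];
  }
  // Union of keys from both sides, de-duplicated and sorted for stable output.
  var keys = Object.keys(a).concat(Object.keys(b)).filter(function (k, i, all) {
    return all.indexOf(k) === i;
  }).sort();
  var out = [];
  keys.forEach(function (k) {
    out = out.concat(diffConfig(a[k], b[k], path ? path + "." + k : k));
  });
  return out;
}

// Hypothetical sample data; real exports will use different key names.
var prod = { parser: { excludeTags: ["footer"] }, crawl: { obeyRobots: false } };
var nonProd = { parser: { excludeTags: ["header", "footer"] }, crawl: { obeyRobots: true } };

var differences = diffConfig(prod, nonProd);
differences.forEach(function (d) {
  console.log(d.path, JSON.stringify(d.prod), JSON.stringify(d.nonProd));
});
```

Each reported path is a setting to review in the non-production datasource or pipeline.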
Environment
Fusion 5.x
Applies to environments with web datasources indexing publicly available websites using custom indexing pipelines.
Cause
Multiple configuration differences can contribute to document indexing issues across environments:
HTML tag exclusions: If the Exclude tags setting in the datasource parser includes header and footer in non-production but only footer in production, valid content may be omitted.
JavaScript parsing errors: A JS indexing stage that uses doc.getId().split('/') can throw an error if doc.getId() is null, blocking further processing.
robots.txt compliance: If Obey robots.txt is enabled in non-production but disabled in production, the crawler may skip many pages even though they are accessible. This can occur if robots.txt blocks paths or if source site whitelisting differs between environments.
Field mapping differences: If the source field for rich content (e.g., body) is mapped to body_txt in production but a different field like description is used in non-production, API search behavior may differ even if results appear in Query Workbench.
Resolution
1. Align tag exclusions
In the Document Parsing > Exclude Tags section of the datasource:
Ensure only necessary tags are excluded.
Match settings exactly with the working environment.
Example:
Production: footer
Non-Production: header, footer ← Update to match Prod
2. Check JavaScript stage for errors
If using a JS stage to categorize pages, ensure it gracefully handles null values:
Problematic script:
var url = doc.getId();
var splitURL = url.split('/'); // will fail if url is null
Safer version:
if (doc.getId()) {
  var splitURL = doc.getId().split('/');
  var page = splitURL[3];
  if (page === "help") {
    doc.addField("domainPageType_s", "help");
  } else if (page === "whatson") {
    doc.addField("domainPageType_s", "blog");
  } else {
    doc.addField("domainPageType_s", "main");
  }
}
return doc;
3. Test with default _system pipeline and parser
Temporarily configure the datasource to use:
Pipeline ID: _system
Parser: _system
This helps isolate whether the issue is caused by custom pipeline stages or parser configurations.
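If the default pipeline indexes more documents, the custom JS stage is a likely suspect. Its logic can be exercised outside Fusion with a stand-in document object. The sketch below assumes the stage only calls getId and addField. `makeMockDoc` and `categorizePage` are hypothetical helper names introduced for illustration, and the example.com URLs are placeholder data.

```javascript
// Stand-in for the pipeline document; only getId and addField are mimicked.
function makeMockDoc(id) {
  var fields = {};
  return {
    getId: function () { return id; },
    addField: function (name, value) {
      (fields[name] = fields[name] || []).push(value);
    },
    fields: fields
  };
}

// The null-safe stage body from step 2, wrapped so it can be called directly.
function categorizePage(doc) {
  if (doc.getId()) {
    var splitURL = doc.getId().split('/');
    var page = splitURL[3]; // first path segment of an absolute URL
    if (page === "help") {
      doc.addField("domainPageType_s", "help");
    } else if (page === "whatson") {
      doc.addField("domainPageType_s", "blog");
    } else {
      doc.addField("domainPageType_s", "main");
    }
  }
  return doc;
}

// Placeholder IDs, including the null case that broke the original script.
var sampleIds = [
  "https://example.com/help/faq",
  "https://example.com/whatson/post",
  "https://example.com/",
  null
];
sampleIds.forEach(function (id) {
  var doc = categorizePage(makeMockDoc(id));
  console.log(String(id), JSON.stringify(doc.fields));
});
```

A document with a null ID passes through without a domainPageType_s field instead of throwing, so the remaining stages still run.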
4. Disable robots.txt compliance (if needed)
In the Advanced settings of the web datasource, try disabling Obey robots.txt to see if more documents are indexed.
Note: Use this with caution. Ensure disabling robots.txt compliance aligns with your organization's ethical crawling policies.
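Before disabling the setting, it can help to confirm that robots.txt really does block the missing pages. The sketch below is a deliberately simplified check, not the crawler's own logic: it handles only User-agent and Disallow lines with plain prefix matching, and ignores Allow rules and wildcards. The agent name `fusion-crawler` and the sample rules are hypothetical.

```javascript
// Return the Disallow prefixes that apply to the given user agent.
// A group naming the agent exactly wins over the "*" group.
function disallowedPrefixes(robotsTxt, userAgent) {
  var agent = userAgent.toLowerCase();
  var groups = [];
  var current = null;
  var lastWasAgent = false;
  robotsTxt.split(/\r?\n/).forEach(function (raw) {
    var line = raw.replace(/#.*/, "").trim(); // drop comments
    var sep = line.indexOf(":");
    if (sep === -1) return;
    var key = line.slice(0, sep).trim().toLowerCase();
    var value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // An empty Disallow value means "nothing is disallowed".
      if (key === "disallow" && current && value !== "") {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  });
  var matching = function (name) {
    return groups.filter(function (g) { return g.agents.indexOf(name) !== -1; });
  };
  var chosen = matching(agent);
  if (chosen.length === 0) chosen = matching("*");
  return chosen.reduce(function (acc, g) { return acc.concat(g.disallow); }, []);
}

function isBlocked(robotsTxt, userAgent, path) {
  return disallowedPrefixes(robotsTxt, userAgent).some(function (prefix) {
    return path.indexOf(prefix) === 0;
  });
}

// Hypothetical robots.txt content for illustration.
var sampleRobots = [
  "User-agent: *",
  "Disallow: /help/",
  "Disallow: /whatson/",
  "",
  "User-agent: examplebot",
  "Disallow:"
].join("\n");

console.log(isBlocked(sampleRobots, "fusion-crawler", "/help/faq"));
console.log(isBlocked(sampleRobots, "fusion-crawler", "/index.html"));
```

If the paths missing from the non-production index are reported as blocked, the Obey robots.txt difference between environments is the probable cause.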
5. Match advanced crawler parameters
Under JavaScript Evaluation > Advanced, match these values with the working environment:
Web Driver Quit Timeout (ms): 20000
Request counter min wait (ms): 10000
Extra Load Size Delta (bytes): 5000
6. Review field mapping for searchable content
Check the field mapping stage in your indexing pipeline:
Working config (Production):
Source field: body
Target field: body_txt
Operation: copy
Problem config (Non-Prod):
Source field: description
Target field: body_txt
Operation: copy
Ensure that body_txt is being populated with content relevant to your search queries. Otherwise, API queries against that field may return no results even though documents exist in the collection.
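One way to verify that body_txt is populated is to take the documents returned by a match-all query and check the field on each one. This is a rough sketch, assuming the documents have been copied out of a JSON query response into an array of plain objects with an id property. `partitionByField` is a hypothetical helper and the sample documents are made up.

```javascript
// Split documents into those with and without non-empty text in `field`.
// Handles both single-valued and multi-valued (array) fields.
function partitionByField(docs, field) {
  var result = { populated: [], missing: [] };
  docs.forEach(function (doc) {
    var value = doc[field];
    var values = Array.isArray(value) ? value : [value];
    var hasText = values.some(function (v) {
      return typeof v === "string" && v.trim() !== "";
    });
    (hasText ? result.populated : result.missing).push(doc.id);
  });
  return result;
}

// Hypothetical documents standing in for a real query response.
var sampleDocs = [
  { id: "https://example.com/help/faq", body_txt: ["How to reset a password"] },
  { id: "https://example.com/whatson/post", body_txt: [""] },
  { id: "https://example.com/" }
];

var report = partitionByField(sampleDocs, "body_txt");
console.log("populated:", report.populated.length, "missing:", report.missing.length);
```

A long missing list in non-production, compared with production, points to the field mapping stage rather than the crawl itself.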