Inconsistent document count when indexing the same site in different environments – Lucidworks

Issue

Crawling a website in a non-production Fusion environment indexes only one document, while the production environment indexes multiple documents from the same site. This causes issues during development and testing because the results do not reflect production behavior.

Diagnosis

To determine whether this issue applies:

Check if your non-production datasource is indexing only a single document despite having the same target site and configuration as production.
Compare Query Workbench results with API output to see whether expected results appear in one but not the other.
Examine datasource and pipeline configurations for subtle discrepancies.
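
One way to surface subtle discrepancies is to export the datasource or pipeline configuration from each environment as JSON and compare the two objects programmatically. The following is a minimal sketch, not a Fusion utility: diffConfigs and isPlainObject are hypothetical helpers, and the property names in the sample objects (parser.excludeTags, crawl.obeyRobots) are illustrative placeholders rather than actual Fusion property names.

```javascript
// Hypothetical helper: returns true for non-null, non-array objects.
function isPlainObject(v) {
  return v !== null && typeof v === "object" && !Array.isArray(v);
}

// Hypothetical helper: lists the dotted paths at which two config objects differ.
// Arrays and primitive values are compared by their JSON form, so array order matters.
function diffConfigs(a, b, prefix) {
  prefix = prefix || "";
  var diffs = [];
  var keys = {};
  Object.keys(a || {}).forEach(function (k) { keys[k] = true; });
  Object.keys(b || {}).forEach(function (k) { keys[k] = true; });
  Object.keys(keys).sort().forEach(function (k) {
    var path = prefix ? prefix + "." + k : k;
    var left = a ? a[k] : undefined;
    var right = b ? b[k] : undefined;
    if (isPlainObject(left) && isPlainObject(right)) {
      // Recurse into nested sections of the configuration.
      diffs = diffs.concat(diffConfigs(left, right, path));
    } else if (JSON.stringify(left) !== JSON.stringify(right)) {
      diffs.push({ path: path, left: left, right: right });
    }
  });
  return diffs;
}

// Illustrative sample data only; real exports will have different property names.
var prod = { parser: { excludeTags: ["footer"] }, crawl: { obeyRobots: false } };
var nonProd = { parser: { excludeTags: ["header", "footer"] }, crawl: { obeyRobots: true } };

var configDiffs = diffConfigs(prod, nonProd);
configDiffs.forEach(function (d) {
  console.log(d.path + ": prod=" + JSON.stringify(d.left) + " nonProd=" + JSON.stringify(d.right));
});
```

Each reported path is a setting to review by hand; a difference is not necessarily the cause of the missing documents.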

Environment

Fusion 5.x
Applies to environments with web datasources indexing publicly available websites using custom indexing pipelines.

Cause

Multiple configuration differences can contribute to document indexing issues across environments:

HTML tag exclusions: If the Exclude tags setting in the datasource parser includes header and footer in non-production but only footer in production, valid content may be omitted.
JavaScript stage errors: A JavaScript index stage that calls doc.getId().split('/') throws an error when doc.getId() returns null, blocking further processing.
robots.txt compliance: If Obey robots.txt is enabled in non-production but disabled in production, the crawler may skip many pages even though they are accessible. This can occur if robots.txt blocks paths or if source site whitelisting differs between environments.
Field mapping differences: If production maps the rich-content source field (e.g., body) to body_txt but non-production maps a different field, such as description, API search behavior may differ even if results appear in Query Workbench.

Resolution

1. Align tag exclusions

In the Document Parsing > Exclude Tags section of the datasource:

Ensure only necessary tags are excluded.
Match settings exactly with the working environment.

Example:

Production: footer
Non-Production: header, footer  ← Update to match Prod

2. Check JavaScript stage for errors

If using a JS stage to categorize pages, ensure it gracefully handles null values:

Problematic script:

var url = doc.getId();
var splitURL = url.split('/'); // will fail if url is null

Safer version:

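// Read the ID once and skip categorization when it is null or empty.
var url = doc.getId();
if (url) {
  // For an ID such as https://host/help/page, index 3 is the first path segment.
  var splitURL = url.split('/');
  var page = splitURL[3];
  if (page === "help") {
    doc.addField("domainPageType_s", "help");
  } else if (page === "whatson") {
    doc.addField("domainPageType_s", "blog");
  } else {
    doc.addField("domainPageType_s", "main");
  }
}
// Always return the document so later stages still receive it.
return doc;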
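
To check the null handling outside Fusion, the same logic can be exercised against a stand-in document object. This is a rough sketch: categorize and makeMockDoc are hypothetical names, the mock implements only the getId and addField calls used above, and the URLs are made up for illustration.

```javascript
// The safer stage logic, wrapped in a named function so it can be called directly.
function categorize(doc) {
  var url = doc.getId();
  if (url) {
    var page = url.split('/')[3];
    if (page === "help") {
      doc.addField("domainPageType_s", "help");
    } else if (page === "whatson") {
      doc.addField("domainPageType_s", "blog");
    } else {
      doc.addField("domainPageType_s", "main");
    }
  }
  return doc;
}

// Hypothetical stand-in for a pipeline document; not the real Fusion document class.
function makeMockDoc(id) {
  var fields = {};
  return {
    getId: function () { return id; },
    addField: function (name, value) { fields[name] = value; },
    fields: fields
  };
}

var helpDoc = categorize(makeMockDoc("https://example.com/help/faq"));
var blogDoc = categorize(makeMockDoc("https://example.com/whatson/news"));
var homeDoc = categorize(makeMockDoc("https://example.com/"));
var nullDoc = categorize(makeMockDoc(null));

console.log(helpDoc.fields.domainPageType_s);
console.log(blogDoc.fields.domainPageType_s);
console.log(homeDoc.fields.domainPageType_s);
console.log(Object.keys(nullDoc.fields).length);
```

The document with a null ID passes through without a domainPageType_s field and without throwing, which is the behavior the guard is meant to provide.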

3. Test with default `_system` pipeline and parser

Temporarily configure the datasource to use:

Pipeline ID: _system
Parser: _system

This helps isolate whether the issue is caused by custom pipeline stages or parser configurations.

4. Disable robots.txt compliance (if needed)

In the Advanced settings of the web datasource, try disabling Obey robots.txt to see if more documents are indexed.

Note: Use this with caution. Ensure disabling robots.txt compliance aligns with your organization's ethical crawling policies.
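
Before disabling the setting, it can help to confirm that robots.txt really blocks the paths you expect to be indexed. The sketch below is a simplified approximation and not how Fusion evaluates robots.txt: it reads only the rules under User-agent: *, treats rule values as plain path prefixes (no * or $ wildcards), and lets the longest matching prefix win. The function names and the sample robots.txt content are assumptions for illustration.

```javascript
// Collect Allow/Disallow prefixes from the "User-agent: *" group only.
function getWildcardRules(robotsTxt) {
  var rules = [];
  var inWildcardGroup = false;
  var lastWasUserAgent = false;
  robotsTxt.split(/\r?\n/).forEach(function (rawLine) {
    var line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    var sep = line.indexOf(":");
    if (sep === -1) { return; }
    var key = line.slice(0, sep).trim().toLowerCase();
    var value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise a new group starts.
      if (!lastWasUserAgent) { inWildcardGroup = false; }
      if (value === "*") { inWildcardGroup = true; }
      lastWasUserAgent = true;
    } else {
      lastWasUserAgent = false;
      // An empty Disallow value blocks nothing, so it is skipped.
      if (inWildcardGroup && (key === "allow" || key === "disallow") && value) {
        rules.push({ allow: key === "allow", prefix: value });
      }
    }
  });
  return rules;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isPathAllowed(rules, path) {
  var best = null;
  rules.forEach(function (r) {
    if (path.indexOf(r.prefix) !== 0) { return; }
    if (!best || r.prefix.length > best.prefix.length ||
        (r.prefix.length === best.prefix.length && r.allow)) {
      best = r;
    }
  });
  return best ? best.allow : true;
}

// Made-up robots.txt content for illustration.
var sampleRobots = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /private/press/",
  "",
  "User-agent: otherbot",
  "Disallow: /"
].join("\n");

var robotsRules = getWildcardRules(sampleRobots);
["/help/", "/private/data", "/private/press/2024"].forEach(function (p) {
  console.log(p + " allowed: " + isPathAllowed(robotsRules, p));
});
```

If the paths missing from the non-production index show up as disallowed here, the robots.txt setting is a likely contributor.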

5. Match advanced crawler parameters

Under JavaScript Evaluation > Advanced, match these values with the working environment:

Web Driver Quit Timeout (ms): 20000
Request counter min wait (ms): 10000
Extra Load Size Delta (bytes): 5000

6. Review field mapping for searchable content

Check the field mapping stage in your indexing pipeline:

Working config (Production):

Source field: body
Target field: body_txt
Operation: copy

Problem config (Non-Prod):

Source field: description
Target field: body_txt
Operation: copy

Ensure that body_txt is being populated with content relevant to your search queries. Otherwise, API queries against that field may return no results even though documents exist in the collection.
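
To see whether body_txt is actually populated, you can inspect a sample of indexed documents from each environment. The following sketch assumes the documents have already been retrieved and parsed into an array of plain objects; fieldCoverage is a hypothetical helper, and the sample documents are invented for illustration.

```javascript
// Count how many documents carry a non-empty value in the given field.
// Multi-valued fields (arrays) are joined before the emptiness check.
function fieldCoverage(docs, field) {
  var populated = docs.filter(function (d) {
    var v = d[field];
    if (Array.isArray(v)) { v = v.join(""); }
    return typeof v === "string" && v.trim().length > 0;
  }).length;
  return { total: docs.length, populated: populated };
}

// Invented sample documents; real ones would come from your collection.
var sampleDocs = [
  { id: "https://example.com/help/faq", body_txt: ["Frequently asked questions"] },
  { id: "https://example.com/whatson/news", body_txt: [""] },
  { id: "https://example.com/" }
];

var coverage = fieldCoverage(sampleDocs, "body_txt");
console.log(coverage.populated + " of " + coverage.total + " documents have body_txt content");
```

A low populated count in non-production, compared with production, points toward the field mapping rather than the crawl itself.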