Fix parsing errors and missing pipeline.complete in Web V2 datasources – Lucidworks

Issue

Datasource jobs using the Web V2 connector show inconsistent indexing behavior, including:

Parsing exceptions for sitemap.xml
Missing pipeline.complete field in the job history counters
Fluctuating document counts between runs (e.g., 523 vs 251 documents indexed)

These issues may occur even when the job history reports a status of success.

Diagnosis

To confirm this issue, inspect the Job History counters and the connector logs for the datasource run. Look for the following:

fetch.plugin-response.error: 1
Missing pipeline.complete
Log entries such as:

ParsingException: Could not find a parser for stream.
WARN [fetch-input-receiver...Retries exceeded for id=https://<your-site>/sitemap.xml

If these are present, the connector is incorrectly attempting to parse the sitemap file as a document.

Inconsistent document counts (e.g., 523 vs 251) across runs can also indicate stale job state data affecting incremental crawl behavior.
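
The checks above can be sketched as a small script. This is only an illustration: it assumes the Job History counters have already been exported into a plain dictionary and the connector log into a list of strings, and the function name diagnose_run is hypothetical rather than part of any Fusion API.

```python
def diagnose_run(counters, log_lines):
    """Return a list of findings for one datasource run.

    counters:  dict of counter name -> value (assumed shape, not a Fusion API)
    log_lines: list of log lines captured for the same run
    """
    findings = []

    # A non-zero fetch error counter is the first symptom listed above.
    if int(counters.get("fetch.plugin-response.error", 0)) > 0:
        findings.append("fetch.plugin-response.error is non-zero")

    # pipeline.complete should be present after a healthy run.
    if "pipeline.complete" not in counters:
        findings.append("pipeline.complete is missing")

    # The sitemap being parsed as a document shows up as a ParsingException.
    if any("Could not find a parser for stream" in line for line in log_lines):
        findings.append("ParsingException found in logs")

    return findings
```

An empty list means none of the symptoms were found. Any entries suggest checking whether the sitemap URL is in the Start Links field.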

Environment

Fusion 5.9.14
Web V2 Connector v2.1.0 and above

Cause

Improper configuration: Including the sitemap.xml URL in the Start Links field causes the connector to treat it as a document, leading to a ParsingException.
Residual job state: The job_state collection may contain metadata from previous runs, causing subsequent runs to behave as incremental crawls even when not intended.
Force Recrawl behavior: If Force Recrawl is disabled, the connector may skip documents with unchanged lastmod dates, even during troubleshooting runs.

Resolution

1. Update the datasource configuration

Ensure the sitemap URL is not listed in the Start Links field. Instead, use the Sitemap URLs field exclusively:

Incorrect:

Start Links: https://example.com/sitemap.xml

Correct:

Sitemap URLs: https://example.com/sitemap.xml
Start Links: [leave empty or include only actual crawl entry points]
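
A quick way to catch this misconfiguration before running a job is to scan the Start Links values for anything that looks like a sitemap file. The sketch below is a heuristic only: the function name misplaced_sitemaps is hypothetical, and it assumes the Start Links have been copied into a Python list.

```python
from urllib.parse import urlparse


def misplaced_sitemaps(start_links):
    """Return Start Links entries that look like sitemap files.

    Heuristic: the last path segment contains "sitemap" and ends in
    .xml or .xml.gz. Such URLs belong in the Sitemap URLs field instead.
    """
    flagged = []
    for link in start_links:
        path = urlparse(link).path.lower()
        filename = path.rsplit("/", 1)[-1]
        if "sitemap" in filename and filename.endswith((".xml", ".xml.gz")):
            flagged.append(link)
    return flagged
```

For example, passing ["https://example.com/sitemap.xml", "https://example.com/"] flags only the first entry.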

2. Clear job state after each run (for consistent testing)

Schedule or manually run the clear_job_state job for the datasource to remove residual crawl metadata:

Job name example: X_clear_job_state

This ensures each datasource run behaves as a clean full crawl.

3. Optional: Disable Force Recrawl during troubleshooting

Disabling Force Recrawl can help you test whether crawl behavior is being influenced by HTTP 304 (Not Modified) responses from the source server. For consistent results, combine this with clearing the job state.
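
To see how the source server itself answers conditional requests, independently of the connector, you can send a request with an If-Modified-Since header and check the status code. The sketch below uses only the Python standard library. The helper names are hypothetical, and whether the connector sends the same header is an assumption this article does not confirm.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified):
    """Build a GET request that asks the server to reply 304 if unchanged.

    last_modified is an HTTP date string, e.g. taken from a previous
    Last-Modified response header.
    """
    return urllib.request.Request(url, headers={"If-Modified-Since": last_modified})


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for non-2xx responses, including 304.
        return err.code


def interpret(status):
    """Map a status code to a short note for troubleshooting."""
    if status == 304:
        return "not modified: the server reports no change since the given date"
    if 200 <= status < 300:
        return "modified or unconditional: the server returned full content"
    return "unexpected status: inspect the server response"
```

Calling fetch_status(build_conditional_request(url, date)) against one of the pages listed in the sitemap shows whether the server returns 304 for unchanged content.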

4. Verify success using job counters

After running the datasource job:

Open Job History
Confirm pipeline.complete is present under counters
Ensure fetch.plugin-response.error is 0

Together, these counters indicate that the sitemap was parsed and the documents were indexed successfully.

5. Optional: Fully reset datasource index for testing

To validate consistency, delete all documents for the datasource in Solr and re-run with a cleared job state. This ensures full crawl indexing and helps isolate issues.
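
One way to delete a datasource's documents is a Solr delete-by-query sent to the collection's update handler. The sketch below only builds the URL and JSON body so they can be reviewed before anything is sent. It assumes documents carry the datasource ID in a field named _lw_data_source_s; verify the field name against your own index first. The function name and the example values are hypothetical.

```python
import json


def build_delete_request(solr_base_url, collection, datasource_id,
                         field="_lw_data_source_s"):
    """Return (url, body) for a Solr delete-by-query on one datasource.

    field is an assumption: confirm which field holds the datasource ID
    in your collection before using the result.
    """
    url = f"{solr_base_url.rstrip('/')}/{collection}/update?commit=true"

    # Escape backslashes and quotes so the ID is safe inside a quoted phrase.
    escaped = datasource_id.replace("\\", "\\\\").replace('"', '\\"')
    body = json.dumps({"delete": {"query": f'{field}:"{escaped}"'}})
    return url, body
```

The returned body would be sent as a POST with Content-Type: application/json. Running the same query as a normal search first confirms which documents would be removed.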

Additional notes

If the sitemap.xml content does not change between runs and the connector is working correctly, a consistent number of documents (e.g., 523) should be indexed each time.
If document counts continue to fluctuate after applying these steps, verify the source server's HTTP responses and consider upgrading to the latest Web V2 connector version.