Issue
Datasource jobs using the Web V2 connector show inconsistent indexing behavior, including:
- Parsing exceptions for sitemap.xml
- Missing pipeline.complete field in the job history counters
- Fluctuating document counts between runs (e.g., 523 vs 251 documents indexed)
These issues may occur even when the job history reports a status of success.
Diagnosis
To confirm this issue, inspect the Job History counters for the datasource run. If you see the following:
- fetch.plugin-response.error: 1
- Missing pipeline.complete
- Presence of log entries such as:
  ParsingException: Could not find a parser for stream.
  WARN [fetch-input-receiver...Retries exceeded for id=https://<your-site>/sitemap.xml
then the connector is incorrectly attempting to parse the sitemap file as a document.
Inconsistent document counts (e.g., 523 vs 251) across runs can also indicate stale job state data affecting incremental crawl behavior.
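The log check described above can be sketched as a small script. This is a minimal illustration, assuming the job logs have already been exported as plain text lines. The function name find_sitemap_parse_failures is hypothetical, and only the two message fragments quoted above are matched, since the full log layout is not known here.

```python
import re

# Message fragments taken from the log excerpts quoted in this article.
# The surrounding log layout is an assumption, so only substrings are matched.
PARSER_ERROR = "ParsingException: Could not find a parser for stream."
RETRY_PATTERN = re.compile(r"Retries exceeded for id=(\S+)")


def find_sitemap_parse_failures(log_lines):
    """Return (parser_error_count, xml_ids_with_exhausted_retries)."""
    parser_errors = 0
    xml_ids = []
    for line in log_lines:
        if PARSER_ERROR in line:
            parser_errors += 1
        match = RETRY_PATTERN.search(line)
        if match:
            # Drop trailing punctuation that may follow the URL in the log line.
            doc_id = match.group(1).rstrip(".,;")
            if doc_id.lower().endswith(".xml"):
                xml_ids.append(doc_id)
    return parser_errors, xml_ids
```

A non-zero error count together with a sitemap URL in the returned list matches the pattern described in this section.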
Environment
Fusion 5.9.14
Web V2 Connector v2.1.0 and above
Cause
- Improper configuration: Including the sitemap.xml URL in the Start Links field causes the connector to treat it as a document, leading to a ParsingException.
- Residual job state: The job_state collection may contain metadata from previous runs, causing subsequent runs to behave as incremental crawls even when not intended.
- Force Recrawl behavior: If Force Recrawl is disabled, the connector may skip documents with unchanged lastmod dates, even during troubleshooting runs.
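As a rough illustration of the first cause, the sketch below flags Start Links entries that look like sitemap files. It assumes the Start Links values have been copied out of the datasource configuration into a plain list. The function name and the "sitemap*.xml" filename heuristic are assumptions for illustration, not connector behavior.

```python
from urllib.parse import urlparse


def sitemap_like_start_links(start_links):
    """Return Start Links entries whose URL path looks like a sitemap file."""
    flagged = []
    for link in start_links:
        path = urlparse(link).path.lower()
        # Take the last path segment, e.g. "sitemap.xml" from "/sitemap.xml".
        filename = path.rsplit("/", 1)[-1]
        # Heuristic (assumption): sitemap files are named sitemap*.xml.
        if filename.startswith("sitemap") and filename.endswith(".xml"):
            flagged.append(link)
    return flagged
```

Any URL this returns is a candidate to move from Start Links to the Sitemap URLs field.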
Resolution
1. Update the datasource configuration
Ensure the sitemap URL is not listed in the Start Links field. Instead, use the Sitemap URLs field exclusively:
Incorrect:
Start Links: https://example.com/sitemap.xml
Correct:
Sitemap URLs: https://example.com/sitemap.xml
Start Links: [leave empty or include only actual crawl entry points]
2. Clear job state after each run (for consistent testing)
Use a scheduled job or manual invocation of the clear_job_state datasource to remove residual crawl metadata:
Job name example: X_clear_job_state
This ensures each datasource run behaves as a clean full crawl.
3. Optional: Disable Force Recrawl during troubleshooting
Disabling Force Recrawl can help test whether crawl behavior is being influenced by server-side 304 (Not Modified) responses. However, disabling it should be combined with job state clearing for consistency.
4. Verify success using job counters
After running the datasource job:
- Open Job History
- Confirm pipeline.complete is present under counters
- Ensure fetch.plugin-response.error is 0
These indicate a successful parse and indexing operation.
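The counter check can be expressed as a short function. This is a sketch only: it assumes the counters from Job History have been copied into a dictionary keyed by counter name. The function name check_counters is hypothetical, and treating an absent fetch.plugin-response.error counter as 0 is an assumption.

```python
def check_counters(counters):
    """Return a list of problems found in a job's counters (empty if none)."""
    problems = []
    if "pipeline.complete" not in counters:
        problems.append("pipeline.complete is missing")
    # Assumption: a counter that is absent is treated the same as 0.
    errors = int(counters.get("fetch.plugin-response.error", 0))
    if errors != 0:
        problems.append(f"fetch.plugin-response.error is {errors}")
    return problems
```

An empty list corresponds to the success conditions listed above. Anything else names the counter to investigate.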
5. Optional: Fully reset datasource index for testing
To validate consistency, delete all documents for the datasource in Solr and re-run with a cleared job state. This ensures full crawl indexing and helps isolate issues.
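For the Solr deletion, one option is a delete-by-query request sent to the collection's update handler. The sketch below only builds the JSON body. It assumes indexed documents carry the datasource ID in a field named _lw_data_source_s, so verify the field name against your own documents before using it. The function name is hypothetical.

```python
import json


def build_delete_by_datasource(datasource_id, field="_lw_data_source_s"):
    """Build a Solr JSON delete-by-query body for one datasource's documents."""
    # Escape backslashes and quotes so the ID is safe inside a quoted phrase.
    escaped = datasource_id.replace("\\", "\\\\").replace('"', '\\"')
    # The field name is an assumption; check it against an indexed document.
    return json.dumps({"delete": {"query": f'{field}:"{escaped}"'}})
```

The resulting body would be posted to the collection's /update handler with Content-Type application/json, followed by a commit. Run the same query as a search first to confirm it matches only the intended documents.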
Additional notes
- If the sitemap.xml content does not change between runs and the connector is working correctly, a consistent number of documents (e.g., 523) should be indexed each time.
- If document counts continue to fluctuate after applying these steps, verify the source server's HTTP responses and consider upgrading to the latest Web V2 connector version.