Issue
Duplicate text appears in indexed fields such as article_section_t when the same content, for example a section header, occurs more than once in the source HTML. Each occurrence is extracted, so the indexed output contains duplicate values.
Diagnosis
To confirm this issue:
- Review the indexed field for unexpected duplicate values.
- Inspect the HTML source for repeated elements (for example, two <h4> or <div> tags with identical or similar content within the target section).
- Confirm that the HTML parser is configured to extract content from these elements.
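The inspection of the HTML source can be partly automated. The following is a minimal sketch, not part of Fusion, that uses only the Python standard library to list text appearing in more than one <h4> or <div> element of a saved page. The class name, function name, and sample markup are hypothetical and exist only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class RepeatedTextFinder(HTMLParser):
    """Collect the text of selected tags so repeated content can be counted."""

    def __init__(self, tags=("h4", "div")):
        super().__init__()
        self.tags = set(tags)
        self._stack = []  # currently open elements of interest: [tag, text parts]
        self.texts = []   # (tag, normalized text) for each closed element

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._stack.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every open element of interest (covers nesting).
        for entry in self._stack:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, parts = self._stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.texts.append((tag, text))


def find_repeated_text(html, tags=("h4", "div")):
    """Return {text: count} for text found in more than one matching element."""
    finder = RepeatedTextFinder(tags)
    finder.feed(html)
    finder.close()
    counts = Counter(text for _, text in finder.texts)
    return {text: n for text, n in counts.items() if n > 1}


# Hypothetical sample: the same header appears twice in one section.
sample = """
<div class="section">
  <h4 id="ignore-lw">Overview</h4>
  <h4>Overview</h4>
  <p>Body text.</p>
</div>
"""

print(find_repeated_text(sample))  # {'Overview': 2}
```

This only catches exact matches after whitespace is collapsed. Similar but non-identical content still needs a manual review.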
Environment
Fusion 5.x
Cause
When multiple identical or similar elements exist in the HTML source, the parser extracts both, leading to duplicated content in the indexed field.
Resolution
Configure the HTML Parser Stage to exclude one of the duplicate elements using the Exclude Filters option.
Step 1: Identify the element to exclude
- Inspect the HTML of the affected content to locate a unique selector (such as an id or class) on the element you wish to exclude.
Step 2: Add the selector to Exclude Filters
- Open the pipeline that uses the relevant HTML parser stage.
- In the HTML parser stage settings, locate the Exclude Filters option.
- Add a CSS selector matching the element to be excluded. For example, to exclude an element with id="ignore-lw", add the following to the exclude filters: #ignore-lw
You may also use other Jsoup-compatible CSS selectors to target specific tags or classes.
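To see what excluding #ignore-lw should do to the extracted text, the effect can be approximated outside Fusion. The sketch below is an illustration only: it is not how Fusion or Jsoup implement exclude filters, it handles only an id match rather than general CSS selectors, and the class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWithoutId(HTMLParser):
    """Extract text, skipping the element with the given id and its children."""

    def __init__(self, excluded_id=None):
        super().__init__()
        self.excluded_id = excluded_id
        self._skip_depth = 0  # > 0 while inside the excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth += 1
        elif self.excluded_id is not None and dict(attrs).get("id") == self.excluded_id:
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html, excluded_id=None):
    """Return whitespace-normalized text, optionally excluding one element by id."""
    parser = TextWithoutId(excluded_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical sample: the first header carries the id used in the exclude filter.
sample = """
<div class="section">
  <h4 id="ignore-lw">Overview</h4>
  <h4>Overview</h4>
  <p>Body text.</p>
</div>
"""

print(extract_text(sample))               # Overview Overview Body text.
print(extract_text(sample, "ignore-lw"))  # Overview Body text.
```

The second call shows the expected result of the exclusion: the duplicate header is gone and the rest of the section text is kept.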
Step 3: Enable filter before mapping (if available)
- Ensure the Filter before mapping option is enabled to apply the filter before field mapping.
Step 4: Test the configuration
- Save the configuration and reindex sample content in a lower (non-production) environment.
- Confirm that the duplicate content no longer appears in the indexed field.
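The confirmation step can be scripted once the reindexed document has been retrieved. The following is a rough sketch that counts how often a known phrase occurs in a field value. The dictionaries are hypothetical stand-ins for documents returned by a query against the collection, and the function name is an assumption made for illustration.

```python
def occurrences(doc, field, phrase):
    """Count occurrences of phrase in a field that holds a string or a list of strings."""
    value = doc.get(field, "")
    values = [value] if isinstance(value, str) else list(value)
    # Collapse whitespace so line breaks in the indexed text do not hide a match.
    return sum(" ".join(v.split()).count(phrase) for v in values)


# Hypothetical documents as they might look before and after the fix.
before = {"id": "doc-1", "article_section_t": "Overview Overview Body text."}
after = {"id": "doc-1", "article_section_t": "Overview Body text."}

print(occurrences(before, "article_section_t", "Overview"))  # 2
print(occurrences(after, "article_section_t", "Overview"))   # 1
```

A count of 1 for the previously duplicated phrase suggests the exclude filter is taking effect. A count of 0 would mean the selector matched more than intended.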
Step 5: Apply to production
- Once verified, promote the configuration change to production according to your organisation’s standard change management process.