Remove duplicate text from indexed content using HTML parser exclude filters – Lucidworks

Issue

Duplicate text appears in indexed fields such as article_section_t when the same content is repeated in the source HTML. This can occur if a section header or similar content is present more than once within the HTML structure, resulting in duplicate values in the indexed output.

Diagnosis

To confirm this issue:

Review the indexed field for unexpected duplicate values.
Inspect the HTML source for repeated elements (for example, two <h4> or <div> tags with identical or similar content within the target section).
Confirm that the HTML parser is configured to extract content from these elements.
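
The source inspection above can be sketched as a small script. This is only an illustration using Python's standard library, not a Fusion tool: the helper names (HeadingCollector, repeated_texts) and the sample HTML are assumptions made for the example, and the script expects reasonably well-formed HTML.

```python
from collections import Counter
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of selected tags (hypothetical helper, not part of Fusion)."""

    def __init__(self, tags=("h4",)):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0      # greater than 0 while inside a tracked tag
        self.buffer = []    # text fragments of the element being read
        self.texts = []     # one normalized string per tracked element

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            if self.depth == 0:
                self.buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace so near-identical copies compare equal.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.texts.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def repeated_texts(html, tags=("h4",)):
    """Return {text: count} for tracked elements whose text occurs more than once."""
    parser = HeadingCollector(tags)
    parser.feed(html)
    parser.close()
    counts = Counter(parser.texts)
    return {text: n for text, n in counts.items() if n > 1}


# Sample page (assumed for illustration): the same heading appears twice.
sample = (
    '<div id="ignore-lw"><h4>Overview</h4></div>'
    "<section><h4>Overview</h4><p>Body</p><h4>Details</h4></section>"
)
print(repeated_texts(sample))
```

Any text reported here is a candidate for exclusion, provided the HTML parser stage is configured to extract those elements.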

Environment

Fusion 5.x

Cause

When multiple identical or similar elements exist in the HTML source, the parser extracts every occurrence, so the same text appears more than once in the indexed field.

Resolution

Configure the HTML parser stage to exclude one of the duplicate elements using the Exclude Filters option.

Step 1: Identify the element to exclude

Inspect the HTML of the affected content to locate a unique selector (such as an id or class) on the element you wish to exclude.

Step 2: Add the selector to Exclude Filters

Open the pipeline that uses the relevant HTML parser stage.
In the HTML parser stage settings, locate the Exclude Filters option.
Add a CSS selector matching the element to be excluded. For example, to exclude an element with id="ignore-lw", add the following to the exclude filters:

#ignore-lw

You may also use other Jsoup-compatible CSS selectors to target specific tags or classes.
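
To see what excluding #ignore-lw is expected to do, the following sketch imitates an id-based exclude filter with Python's standard library. It is not Fusion's or Jsoup's implementation: the class and function names and the sample HTML are assumptions for illustration, it handles only an id match (not general CSS selectors), and it assumes balanced tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdExcludingExtractor(HTMLParser):
    """Extract text, skipping the subtree of the element with a given id (hypothetical)."""

    def __init__(self, excluded_id=None):
        super().__init__()
        self.excluded_id = excluded_id
        self.skip_depth = 0   # greater than 0 while inside the excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth += 1
        elif self.excluded_id is not None and dict(attrs).get("id") == self.excluded_id:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, excluded_id=None):
    """Return the text chunks that survive the exclusion."""
    parser = IdExcludingExtractor(excluded_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Sample markup (assumed): the heading is repeated, and one copy carries the id.
page = (
    '<h4 id="ignore-lw">Overview</h4>'
    "<div><h4>Overview</h4><p>Body text</p></div>"
)
print(extract_text(page))               # no filter: both headings are extracted
print(extract_text(page, "ignore-lw"))  # filter applied: one heading remains
```

If the duplicate element has no id, a class or tag selector in standard CSS syntax (for example h4.section-title, where the class name is hypothetical) would go in the Exclude Filters field instead.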

Step 3: Enable filter before mapping (if available)

If the HTML parser stage exposes the Filter before mapping option, enable it so that the excluded elements are removed before field mapping runs.

Step 4: Test the configuration

Save the configuration and reindex sample content in a lower (non-production) environment.
Confirm that the duplicate content no longer appears in the indexed field.
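
The confirmation step can also be scripted once a reindexed document has been retrieved. The sketch below assumes the document is available as a Python dict and that the field holds either a single newline-separated string or a list of values; the function name and sample documents are hypothetical, and how you fetch the document from Fusion is left out.

```python
from collections import Counter


def duplicate_segments(doc, field="article_section_t"):
    """Return the segments that occur more than once in the given field."""
    value = doc.get(field, [])
    # A single string is split into lines; a multi-valued field is used as-is.
    segments = value.splitlines() if isinstance(value, str) else list(value)
    # Collapse whitespace so copies that differ only in spacing still match.
    normalized = [" ".join(s.split()) for s in segments]
    counts = Counter(s for s in normalized if s)
    return sorted(s for s, n in counts.items() if n > 1)


# Sample documents (assumed): before and after adding the exclude filter.
before = {"id": "doc-1", "article_section_t": "Overview\nOverview\nBody text"}
after = {"id": "doc-1", "article_section_t": "Overview\nBody text"}

print(duplicate_segments(before))
print(duplicate_segments(after))
```

An empty result for the reindexed document suggests the exclude filter is taking effect.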

Step 5: Apply to production

Once verified, promote the configuration change to production according to your organisation’s standard change management process.