Extract HTML content from a specific element on a web page – Lucidworks

Goal

Extract and preserve the full HTML markup inside a specific HTML tag (such as a <div> or <table>) from a web page crawled using a Fusion web connector.

Environment

Fusion 4.x and 5.x

Guide

Enable and configure the HTML parser

Ensure the HTML parser stage is enabled in the index parser configuration.

Navigate to the index parser assigned to the web datasource.
Enable the HTML parser stage if it is not already active.
Within the HTML parser configuration:
- Set Extract Body Text to true.
- Set Keep Parent to true.
- Ensure Metatags Prefix is blank or set as needed.
With these settings, the parser writes the entire HTML body, including all nested tags, to a field named body_formatted.

Add a regex field extraction stage

To isolate and extract the content inside a specific HTML element:

In your index pipeline, add a Regex Field Extraction stage at the beginning of the pipeline.
Stage reference: https://doc.lucidworks.com/docs/4.2/fusion-server/reference/pipeline-stages/indexing/regular-expression-extractor-index-stage
Configure the stage as follows:

Source Fields:
body_formatted

Target Field:
html_snippet (or any descriptive field name)

Write Mode:
overwrite

Regex Pattern:

.*(<table id='openBooks'>)(.*?)(</table>).*

The .*? expression is non-greedy, so the match ends at the first closing </table> tag instead of running on to a later one on the same page.

Regex Capture Group:
2
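
Capture group 2 holds the markup between the opening and closing tags; the tags themselves are in groups 1 and 3.
Adjust the tag name (table) and id value (openBooks) to match the element you wish to extract.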
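
Before crawling, the pattern can be tried against a sample page outside Fusion. The following is a minimal sketch in Python, assuming a made-up page body; the sample_body string and the extract_snippet helper are illustrative names, not part of Fusion. The (?s) prefix is an addition to the guide's pattern: by default "." does not match line breaks, so without it the pattern finds nothing when the element spans several lines. Whether the Fusion stage needs the same prefix depends on how it compiles the pattern, so treat that as something to test.

```python
import re

# Pattern from the guide, prefixed with (?s) so "." also matches line breaks.
# The (?s) prefix is an assumption added for multi-line input.
PATTERN = r"(?s).*(<table id='openBooks'>)(.*?)(</table>).*"

# Hypothetical stand-in for the body_formatted field of a crawled page.
sample_body = """<body>
<h1>Library</h1>
<table id='openBooks'>
<tr><td>Dune</td></tr>
</table>
<table id='closedBooks'><tr><td>Emma</td></tr></table>
</body>"""


def extract_snippet(body, pattern=PATTERN, group=2):
    """Return the requested capture group, or None when nothing matches."""
    match = re.match(pattern, body)
    return match.group(group) if match else None


html_snippet = extract_snippet(sample_body)
print(html_snippet)

# Same pattern without (?s): no match, because the element spans lines.
print(extract_snippet(sample_body, pattern=PATTERN[4:]))
```

The first call prints the table row surrounded by the line breaks that sat inside the element; the second prints None, which is the symptom to look for if html_snippet comes back empty after a crawl.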

Add a cleanup field mapping stage (optional)

To reduce index clutter, remove the now-redundant body_formatted field:

Add a Field Mapping stage after the Regex Field Extraction stage.

Configure it with the following rule:

Mapping:

{
  "source": "body_formatted",
  "operation": "delete"
}

With this rule in place, the full page markup is dropped and only the extracted HTML snippet is retained in the index.

Verify results

After crawling the target page:

Use the Solr Admin UI or Fusion Query Workbench to confirm that the new field (e.g., html_snippet) contains the expected HTML content extracted from the specified element.
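If the field is empty or incomplete, refine the regular expression to allow for optional whitespace, double-quoted or additional attributes, and line breaks inside the element. Because the match is non-greedy, it ends at the first closing tag, so an element that contains nested tags of the same name will be cut short.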
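
One possible refinement is sketched below in Python. The TOLERANT pattern is an assumption, not a pattern from Lucidworks: it accepts either quote style, extra attributes, spaces around the equals sign, upper-case tags, and line breaks. The constructs it uses are also valid in Java regular expressions, but the pattern should still be verified in the stage itself. It does not solve nesting: a table inside the target table still ends the match early.

```python
import re

# Assumed refinement of the guide's pattern (not taken from the source):
#   (?is)      case-insensitive, and "." also matches line breaks
#   [^>]*      allows other attributes before or after id
#   \s*=\s*    allows spaces around "="
#   ["']       accepts either quote style
# Note: (?i) also makes the id value itself case-insensitive.
TOLERANT = (
    r"""(?is).*?(<table\b[^>]*\bid\s*=\s*["']openBooks["'][^>]*>)"""
    r"""(.*?)(</table\s*>).*"""
)

# Hypothetical markup variations for the same element.
variants = [
    "<table id='openBooks'><tr><td>A</td></tr></table>",
    '<table class="grid" id="openBooks"><tr><td>A</td></tr></table>',
    '<TABLE ID = "openBooks" border="0">\n<tr><td>A</td></tr>\n</TABLE >',
]


def inner_html(body):
    """Return the trimmed content of capture group 2, or None."""
    match = re.match(TOLERANT, body)
    return match.group(2).strip() if match else None


results = [inner_html(v) for v in variants]
print(results)
```

All three variants yield the same inner row, while a table with a different id yields None.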

Additional notes

You can inspect which parser was used for a document by querying for the _lw_parser_type_s field.
This approach can be extended to extract any HTML section by modifying the tag and ID used in the regex pattern.
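
As a sketch of that extension, the pattern can be generated from a tag name and an id value rather than edited by hand. The build_pattern helper below is hypothetical, and the (?s) prefix is an added assumption so that multi-line elements match. re.escape guards against ids that contain regex metacharacters; its output is Python-flavored, so check the generated string before pasting it into the stage.

```python
import re


def build_pattern(tag, element_id):
    """Build a pattern in the guide's style for an arbitrary tag and id."""
    tag_re = re.escape(tag)
    id_re = re.escape(element_id)
    return (
        r"(?s).*(<" + tag_re + r" id='" + id_re + r"'>)"
        r"(.*?)(</" + tag_re + r">).*"
    )


# Reproduces the guide's pattern (plus the assumed (?s) prefix).
table_pattern = build_pattern("table", "openBooks")
print(table_pattern)

# Hypothetical page where the target is a div instead of a table.
body = "<div id='main-content'><p>Hello</p></div><div id='footer'>x</div>"
snippet = re.match(build_pattern("div", "main-content"), body).group(2)
print(snippet)
```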