Goal
Extract and preserve the full HTML markup inside a specific HTML tag (such as a <div> or <table>) from a web page crawled using a Fusion web connector.
Environment
Fusion 4.x and 5.x
Guide
Enable and configure the HTML parser
Ensure the HTML parser stage is enabled in the index parser configuration.
Navigate to the index parser assigned to the web datasource.
Enable the HTML parser stage if it is not already active.
Within the HTML parser configuration:
- Set Extract Body Text to true.
- Set Keep Parent to true.
- Ensure Metatags Prefix is blank or set as needed.
Once configured, this will produce a field named body_formatted containing the entire HTML body, including all nested tags.
Add a regex field extraction stage
To isolate and extract the content inside a specific HTML element:
In your index pipeline, add a Regex Field Extraction stage at the beginning of the pipeline.
Stage reference: https://doc.lucidworks.com/docs/4.2/fusion-server/reference/pipeline-stages/indexing/regular-expression-extractor-index-stage-
Configure the stage as follows:
- Source Fields: body_formatted
- Target Field: html_snippet (or any descriptive field name)
- Write Mode: overwrite
- Regex Pattern: .*(<table id='openBooks'>)(.*?)(</table>).*
  (The .*? expression is a non-greedy match, so the capture stops at the first closing </table> tag after the opening tag instead of running on to a later one.)
- Regex Capture Group: 2
- Adjust the tag name (table) and id value (openBooks) to match the element you wish to extract.
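To check the pattern before crawling, it can help to try it against a small piece of markup. The sketch below uses Python's re module as a stand-in for the regex engine used by the stage; the constructs in this pattern behave the same way in both, but that equivalence is an assumption here, and the sample markup and the extract_snippet helper are hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the stage configuration; capture group 2 holds the
# markup between the opening and closing tags.
PATTERN = re.compile(r".*(<table id='openBooks'>)(.*?)(</table>).*")

def extract_snippet(body_formatted: str) -> Optional[str]:
    """Return the markup inside the target element, or None if it is absent."""
    match = PATTERN.match(body_formatted)
    return match.group(2) if match else None

# Hypothetical single-line page body for illustration only.
sample = (
    "<body><h1>Books</h1>"
    "<table id='openBooks'><tr><td>Dune</td></tr></table>"
    "<p>footer</p></body>"
)

print(extract_snippet(sample))
```

For this sample, group 2 is the row markup without the surrounding table tags. If the opening and closing tags are wanted as well, groups 1 and 3 hold them.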
Add a cleanup field mapping stage (optional)
To reduce index clutter, remove the now-redundant body_formatted field:
Add a Field Mapping stage after the Regex Field Extraction stage.
Configure it with the following rule:
Mapping: { "source": "body_formatted", "operation": "delete" }
This removes the full page body from the indexed document, so that only the extracted HTML snippet is retained alongside the document's other fields.
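The order of the two stages matters: the delete rule has to run after the extraction, otherwise there is nothing left to extract from. The following Python sketch only simulates that ordering with plain dictionaries; the regex_extract, delete_body, and run functions are hypothetical stand-ins, not Fusion's implementation.

```python
import re
from typing import Callable, Dict, List

Doc = Dict[str, str]
PATTERN = re.compile(r".*(<table id='openBooks'>)(.*?)(</table>).*")

def regex_extract(doc: Doc) -> Doc:
    """Stand-in for the extraction stage: copy capture group 2 to html_snippet."""
    match = PATTERN.match(doc.get("body_formatted", ""))
    if match:
        doc["html_snippet"] = match.group(2)
    return doc

def delete_body(doc: Doc) -> Doc:
    """Stand-in for the field mapping rule that deletes body_formatted."""
    doc.pop("body_formatted", None)
    return doc

def run(stages: List[Callable[[Doc], Doc]], doc: Doc) -> Doc:
    """Apply the stages in order to a copy of the document."""
    result = dict(doc)
    for stage in stages:
        result = stage(result)
    return result

# Hypothetical crawled document.
page = {
    "id": "https://example.com/books",
    "body_formatted": "<body><table id='openBooks'><tr><td>Dune</td></tr></table></body>",
}

good = run([regex_extract, delete_body], page)  # extraction first, then cleanup
bad = run([delete_body, regex_extract], page)   # cleanup first: snippet is lost

print(sorted(good))
print(sorted(bad))
```

With extraction first, the result keeps id and html_snippet; with the order reversed, only id survives.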
Verify results
After crawling the target page:
Use the Solr Admin UI or Fusion Query Workbench to confirm that the new field (e.g., html_snippet) contains the expected HTML content extracted from the specified element.
If needed, refine the regular expression to account for optional whitespace, nested tags, or attribute variations.
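One possible refinement is sketched below in Python. It tolerates extra whitespace around the id attribute, either quote style, additional attributes after the id, and markup that spans several lines (via the (?s) flag, which lets . match newlines). The build_pattern helper and the sample markup are hypothetical; the sketch assumes id is the first attribute on the tag, and whether the stage already treats . as matching newlines is not confirmed here. Note that the capture group becomes 1 in this variant, and that a nested element of the same tag would still end the match early.

```python
import re

QUOTE = "[\"']"  # accept either single or double quotes around the id value

def build_pattern(tag: str, element_id: str) -> re.Pattern:
    """Build a more tolerant pattern; capture group 1 is the inner markup."""
    return re.compile(
        r"(?s).*?<" + re.escape(tag)
        + r"\s+id\s*=\s*" + QUOTE + re.escape(element_id) + QUOTE
        + r"[^>]*>(.*?)</" + re.escape(tag) + r"\s*>.*"
    )

# Hypothetical multi-line body with double quotes and an extra attribute.
sample = (
    "<body>\n"
    '<table  id = "openBooks" class="x">\n'
    "<tr><td>Dune</td></tr>\n"
    "</table>\n"
    "</body>"
)

strict = re.compile(r".*(<table id='openBooks'>)(.*?)(</table>).*")
tolerant = build_pattern("table", "openBooks")

print(strict.match(sample))                   # the strict pattern finds nothing
print(repr(tolerant.match(sample).group(1)))  # inner markup, newlines included
print(tolerant.pattern)                       # pattern text to adapt for the stage
```

For deeply nested or irregular markup, a regular expression may not be reliable, and the pattern should be tested against real pages from the crawl.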
Additional notes
You can inspect which parser was used for a document by querying for the _lw_parser_type_s field.
This approach can be extended to extract any HTML section by modifying the tag and ID used in the regex pattern.
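As a rough aid for the verification and parser check, the sketch below builds a standard Solr select URL that returns the extracted field together with _lw_parser_type_s. The base URL, the collection name my_collection, and the build_check_url helper are assumptions for illustration, and the q clause assumes the target field is indexed; it only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

def build_check_url(solr_base: str, collection: str, field: str = "html_snippet") -> str:
    """Build a Solr select URL listing documents that have the extracted field."""
    params = {
        "q": f"{field}:*",                       # documents where the field exists
        "fl": f"id,{field},_lw_parser_type_s",   # fields to return
        "rows": 5,
    }
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"

# Hypothetical host and collection name; replace with your own values.
url = build_check_url("http://localhost:8983/solr", "my_collection")
print(url)
```

The resulting URL can be opened in a browser or pasted into the Solr Admin UI query screen, adjusting the parameters as needed.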