Preserve HTML content in a separate field when using the Web Connector – Lucidworks

Goal

Configure a Web Connector to extract both plain text and full HTML content from a webpage, storing the original HTML (with tags intact) in a separate field for downstream indexing or processing.

Environment

Fusion 5.9.5 and above
Self-hosted on Kubernetes (EKS, AKS, GKE, or similar)
Web Connector v2

Guide

Add and configure the Web Connector

In the Fusion UI, go to Connectors and create or edit a Web Connector job.
Ensure the connector is configured to crawl the target HTML content you wish to preserve.
In the Index Pipeline section, select an index pipeline or verify that one is already assigned.

Add the HTML Parser stage to your pipeline

Navigate to Pipelines > Index Pipelines.
Select the pipeline attached to your connector.
Add a new stage of type HTML Parser.

Configure the HTML Parser stage to extract and preserve HTML

In the HTML Parser stage settings:

Set Output Field to a custom field name where you want to store the HTML content. Example: html_body
Set Output Format to html (this ensures tags are retained)
Use a CSS selector in the Selector field to define which part of the document to extract. Example: body

The body selector extracts the entire <body> element and its contents, preserving nested tags.

Optional: Extract plain text separately

If you also want to extract plain text (without HTML), you can either:

Add a second HTML Parser stage with the same selector, but set Output Format to text
Or, keep the existing stage for the HTML output and duplicate the content extraction logic in another pipeline stage that produces the plain-text output
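
As a rough sketch of the first option, a second stage could reuse the properties shown in the Example configuration in this article, changing only the output field and the output format. The field name body_text is a hypothetical choice, and the property names are assumed to be the same as for the HTML stage, so check them against the JSON of your own pipeline.

```json
{
  "type": "html-parser",
  "outputField": "body_text",
  "selector": "body",
  "outputFormat": "text"
}
```

Writing the two outputs to different fields keeps the tag-preserved HTML and the plain text from overwriting each other.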

Example configuration

{
  "type": "html-parser",
  "outputField": "html_body",
  "selector": "body",
  "outputFormat": "html"
}

This configuration ensures the entire <body> of the HTML is extracted and preserved as-is in the html_body field, with tags intact.

Notes

The HTML Parser stage uses jsoup under the hood for CSS selectors and HTML parsing.
To extract more specific sections, adjust the selector value accordingly (e.g., div.content, article, etc.).
Make sure the Solr schema (or Fusion’s managed fields) defines the field you’re writing to (html_body), with stored=true if the HTML must be returned at query time.
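
If the field does not exist yet, one way to define it is with a Solr Schema API add-field command, sent as a POST to the collection's schema endpoint. The request body below is a minimal sketch, not a required setup: it assumes the collection's schema provides the stock string field type, and it assumes the raw HTML only needs to be stored for retrieval, not searched. indexed and docValues are turned off because full-page HTML can exceed the size limit Lucene places on a single indexed term or docValues entry.

```json
{
  "add-field": {
    "name": "html_body",
    "type": "string",
    "stored": true,
    "indexed": false,
    "docValues": false
  }
}
```

If the HTML also needs to be searchable, a text field type suited to your analysis needs would be a better fit than string.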

Result

With the HTML stage and the optional plain-text stage in place, each document keeps a plain-text version for analysis and a tag-preserved HTML version for display, archiving, or rendering use cases.