Goal
Configure a Web Connector to extract both plain text and full HTML content from a webpage, storing the original HTML (with tags intact) in a separate field for downstream indexing or processing.
Environment
Fusion 5.9.5 and above
Self-hosted on Kubernetes (EKS, AKS, GKE, or similar)
Web Connector v2
Guide
Add and configure the Web Connector
In the Fusion UI, go to Connectors and create or edit a Web Connector job.
Ensure the connector is configured to crawl the target HTML content you wish to preserve.
In the Index Pipeline section, select an index pipeline or verify that one is already assigned.
Add the HTML Parser stage to your pipeline
Navigate to Pipelines > Index Pipelines.
Select the pipeline attached to your connector.
Add a new stage of type HTML Parser.
Configure the HTML Parser stage to extract and preserve HTML
In the HTML Parser stage settings:
Set Output Field to a custom field name where you want to store the HTML content. Example: html_body
Set Output Format to html (this ensures tags are retained).
Use a CSS selector in the Selector field to define what part of the document you want to extract. Example: body
This selector extracts the entire <body> tag and its contents, preserving nested tags.
Optional: Extract plain text separately
If you also want to extract plain text (without HTML), you can either:
Add a second HTML Parser stage with the same selector, but set Output Format to text and point Output Field at a different field so the HTML version is not overwritten.
Or, duplicate the content extraction logic in another pipeline stage that writes the text-only version to its own field.
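To make the difference between the two output formats concrete, the following is a minimal Python sketch that runs outside Fusion. It uses only the standard library's html.parser (not jsoup, which the stage itself uses), and the class name BodyExtractor, the function extract_body, and the sample page are illustrative assumptions. It shows roughly what a body selector yields when tags are kept versus when only text is kept; jsoup's whitespace handling may differ.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collects the <body> element twice: once as markup, once as plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.html_parts = []  # tag-preserving output (like Output Format: html)
        self.text_parts = []  # text-only output (like Output Format: text)

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if self.in_body:
            # get_starttag_text() returns the start tag as written in the source
            self.html_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.in_body:
            self.html_parts.append(f"</{tag}>")
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.html_parts.append(data)
            if data.strip():
                self.text_parts.append(data.strip())


def extract_body(page):
    """Return (html_version, text_version) of the page's <body>."""
    parser = BodyExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.html_parts), " ".join(parser.text_parts)


# Hypothetical sample page, used only for illustration
page = (
    "<html><head><title>T</title></head>"
    "<body><h1>Hi</h1><p>Some <b>bold</b> text.</p></body></html>"
)
html_body, text_body = extract_body(page)
print(html_body)  # <body><h1>Hi</h1><p>Some <b>bold</b> text.</p></body>
print(text_body)  # Hi Some bold text.
```

The first value corresponds to what you would store in html_body, and the second to a plain-text field used for analysis.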
Example configuration
{
  "type": "html-parser",
  "outputField": "html_body",
  "selector": "body",
  "outputFormat": "html"
}
This configuration ensures the entire <body> of the HTML is extracted and preserved as-is in the html_body field, with tags intact.
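If you use the two-stage approach, the pair of stage definitions can be sketched as follows. This is an illustration only: the key names are copied from the example configuration above, the field name text_body is a hypothetical choice, and the exact JSON your Fusion version expects for a stage may contain additional properties.

```python
import copy
import json

# Stage that keeps the markup, mirroring the example configuration
html_stage = {
    "type": "html-parser",
    "outputField": "html_body",
    "selector": "body",
    "outputFormat": "html",
}

# Same selector, but text output written to a separate field.
# "text_body" is a hypothetical field name chosen for this sketch.
text_stage = copy.deepcopy(html_stage)
text_stage["outputField"] = "text_body"
text_stage["outputFormat"] = "text"

stages = [html_stage, text_stage]
print(json.dumps(stages, indent=2))
```

Keeping the selector identical in both stages means the plain-text and tag-preserved fields always describe the same part of the page.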
Notes
The HTML Parser stage uses jsoup under the hood for CSS selectors and HTML parsing.
To extract more specific sections, adjust the selector value accordingly (e.g., div.content, article, etc.).
Make sure your schema in Solr or Fusion’s managed fields includes the field you’re writing to (html_body) with stored=true if needed for retrieval.
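As a rough sketch of the schema note above, the snippet below builds a request body for Solr's Schema API add-field command, which is sent as a POST to the collection's /schema endpoint. The helper name add_field_payload, the text_general field type, and the choice of indexed=false are assumptions made for illustration; pick the type and flags that match your own schema and how you intend to use the field.

```python
import json


def add_field_payload(name, field_type="text_general", stored=True, indexed=False):
    """Build a Solr Schema API 'add-field' body for a stored field.

    field_type and indexed are assumptions for this sketch: here the raw
    markup is only retrieved, not searched, so indexing is turned off.
    """
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }


payload = add_field_payload("html_body")
print(json.dumps(payload, indent=2))
```

If the field already exists, you can compare its current definition against this payload before deciding whether any change is needed.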
Result
This configuration allows you to maintain a plain-text version for analysis and a tag-preserved version of HTML for display, archiving, or rendering use cases.