Include specific pages while excluding all others in a web connector crawl – Lucidworks

Goal

Configure a Web v2 datasource in Fusion to crawl only a specific set of pages under a directory, while excluding all other pages within that directory.

Environment

Fusion 5.9.14
Component: Web v2 Connector
Applies to: Managed Fusion and self-managed deployments

Guide

Block all pages under a target path using `excludeRegexes`

To prevent all pages under a specific directory from being crawled, add a regular expression to the excludeRegexes configuration of the datasource. For example, to exclude all pages under /sample/, /test/, and /location/:

"excludeRegexes": [
  ".*www.your.org/sample/.*",
  ".*www.your.org/test/.*",
  ".*www.your.org/location/.*"
]

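This prevents all URLs under /sample/, /test/, and /location/ from being fetched by the crawler. The same pattern works for any other path, such as the /directory/ path used in the next step.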
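To see which URLs patterns like these would catch, the matching can be approximated offline. The following is a minimal sketch, not the connector's actual implementation: it uses Python's `re` module as a stand-in for Java regular expressions and assumes each pattern must match the entire URL. The helper name `is_excluded` is hypothetical.

```python
import re

# Patterns copied from the excludeRegexes example above.
EXCLUDE_REGEXES = [
    ".*www.your.org/sample/.*",
    ".*www.your.org/test/.*",
    ".*www.your.org/location/.*",
]


def is_excluded(url: str) -> bool:
    """Return True if the whole URL matches any exclude pattern.

    Assumption: full-string matching, which may differ from the connector.
    """
    return any(re.fullmatch(pattern, url) for pattern in EXCLUDE_REGEXES)


if __name__ == "__main__":
    for candidate in (
        "https://www.your.org/sample/page.html",
        "https://www.your.org/about.html",
        "https://www.your.org/sample",  # no trailing slash
    ):
        print(candidate, is_excluded(candidate))
```

Under these assumptions, a URL such as https://www.your.org/sample (without the trailing slash) is not matched, because each pattern requires the slash after the directory name.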

Allow only specific URLs using `includeRegexes`

To allow specific pages under an excluded path to be crawled, add them to the includeRegexes list. Include rules take precedence over exclude rules. For example, if /directory/ is excluded with a pattern like the ones above:

"includeRegexes": [
  "https://www.your.org/directory/abc.html",
  "https://www.your.org/directory/def.html",
  "https://www.your.org/directory/hij.html",
  "https://www.your.org/directory/lmn.html",
  "https://www.your.org/directory/opq.html"
]

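With this configuration, and an excludeRegexes entry that covers /directory/ (for example, ".*www.your.org/directory/.*"):

All URLs under /directory/ are blocked by default.
Only the five explicitly listed URLs under /directory/ will be crawled.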
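The interaction between the two lists can be sketched as a small classifier. This is an illustrative model under stated assumptions, not Fusion's code: it takes the article's statement that include rules win over exclude rules at face value, assumes full-string matching, and uses Python's `re` as a stand-in for Java regular expressions. The /directory/ exclude entry and the `classify` helper are hypothetical. How the connector treats URLs that match neither list is not modeled here.

```python
import re

# Hypothetical exclude entry for /directory/, written in the same style
# as the article's excludeRegexes example.
EXCLUDE_REGEXES = [".*www.your.org/directory/.*"]

# Copied from the includeRegexes example above.
INCLUDE_REGEXES = [
    "https://www.your.org/directory/abc.html",
    "https://www.your.org/directory/def.html",
    "https://www.your.org/directory/hij.html",
    "https://www.your.org/directory/lmn.html",
    "https://www.your.org/directory/opq.html",
]


def matches_any(patterns, url):
    """Return True if the whole URL matches at least one pattern."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


def classify(url):
    """Label a URL as 'included', 'excluded', or 'unmatched'.

    Include rules are checked first, mirroring the precedence
    described in the article.
    """
    if matches_any(INCLUDE_REGEXES, url):
        return "included"
    if matches_any(EXCLUDE_REGEXES, url):
        return "excluded"
    return "unmatched"


if __name__ == "__main__":
    print(classify("https://www.your.org/directory/abc.html"))
    print(classify("https://www.your.org/directory/xyz.html"))
```

In this model, abc.html is labeled as included even though it also matches the exclude pattern, while any other page under /directory/ is labeled as excluded.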

Additional notes

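Include and exclude regexes must be valid Java-style regular expressions. In the examples above, the unescaped . matches any character, not only a literal dot, so the patterns match slightly more than the intended host name.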
Make sure both lists are properly formatted JSON arrays.
You can test regex behavior by previewing the crawl or inspecting the crawl logs in Fusion.
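Before starting a crawl, the first two notes can be sanity-checked offline. The sketch below parses a JSON array and tries to compile each entry. It is an approximation: Python's `re` module stands in for Java's regex engine, and the two differ on some advanced syntax, so a pattern that passes here is not guaranteed to be valid in Fusion. The helper name `check_regex_list` is hypothetical.

```python
import json
import re


def check_regex_list(raw_json: str) -> list:
    """Parse a JSON array of regex strings and return a list of problems.

    An empty list means the text is a JSON array of strings that all
    compile as Python regular expressions.
    """
    try:
        patterns = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(patterns, list):
        return ["expected a JSON array"]

    problems = []
    for pattern in patterns:
        if not isinstance(pattern, str):
            problems.append(f"not a string: {pattern!r}")
            continue
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"invalid regex {pattern!r}: {exc}")
    return problems


if __name__ == "__main__":
    sample = '[".*www.your.org/sample/.*", ".*www.your.org/test/.*"]'
    print(check_regex_list(sample))
```

A trailing comma after the last array element is a common formatting mistake; `json.loads` rejects it, so this check reports it as invalid JSON.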

This configuration pattern is useful when the majority of content under a path should be ignored, but exceptions need to be preserved.
