Goal
Configure a Web v2 datasource in Fusion to crawl only a specific set of pages under a directory, while excluding all other pages within that directory.
Environment
Fusion 5.9.14
Component: Web v2 Connector
Applies to: Managed Fusion and self-managed deployments
Guide
Block all pages under a target path using excludeRegexes
To prevent all pages under a specific directory from being crawled, add a regular expression for that directory to the excludeRegexes configuration of the datasource. For example, to exclude all pages under /sample/, /test/, and /location/:
"excludeRegexes": [
  ".*www.your.org/sample/.*",
  ".*www.your.org/test/.*",
  ".*www.your.org/location/.*"
]
This prevents all URLs under the listed directories from being fetched by the crawler.
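Before saving the datasource, the exclude patterns can be checked locally against a few sample URLs. The following is a minimal sketch, not Fusion code: the class ExcludeCheck and the method isExcluded are hypothetical names, and it assumes a pattern must match the whole URL, which may differ from how the connector applies it.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for local testing; not part of Fusion.
final class ExcludeCheck {
    // Patterns copied from the excludeRegexes example.
    static final List<Pattern> EXCLUDES = List.of(
        Pattern.compile(".*www.your.org/sample/.*"),
        Pattern.compile(".*www.your.org/test/.*"),
        Pattern.compile(".*www.your.org/location/.*"));

    // Assumption: a URL is excluded when any pattern matches the entire URL.
    static boolean isExcluded(String url) {
        for (Pattern p : EXCLUDES) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("https://www.your.org/sample/page.html"));
        System.out.println(isExcluded("https://www.your.org/about.html"));
    }
}
```

Because each pattern ends the directory name with a slash, a URL such as https://www.your.org/samples/x is not caught by the /sample/ rule.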
Allow only specific URLs using includeRegexes
To permit specific pages under an excluded path to still be crawled, add them to the includeRegexes section. Include rules take precedence over exclude rules. For example, with /directory/ excluded by a pattern such as .*www.your.org/directory/.*:
"includeRegexes": [
  "https://www.your.org/directory/abc.html",
  "https://www.your.org/directory/def.html",
  "https://www.your.org/directory/hij.html",
  "https://www.your.org/directory/lmn.html",
  "https://www.your.org/directory/opq.html"
]
With this configuration:
All URLs under /directory/ are blocked by default.
Only the 5 explicitly listed URLs are crawled.
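The precedence rule described here can be modeled in a few lines to see which URLs would pass. This is only a sketch of the rule as this article states it, not the connector's implementation. The names CrawlFilterSketch, matchesAny, and shouldCrawl are hypothetical, the exclude pattern for /directory/ is an assumed example, and treating URLs that match neither list as allowed is also an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of "include takes precedence over exclude".
final class CrawlFilterSketch {
    // Two of the include entries from the example, used as-is.
    static final List<Pattern> INCLUDES = List.of(
        Pattern.compile("https://www.your.org/directory/abc.html"),
        Pattern.compile("https://www.your.org/directory/def.html"));

    // Assumed exclude pattern covering everything under /directory/.
    static final List<Pattern> EXCLUDES = List.of(
        Pattern.compile(".*www.your.org/directory/.*"));

    static boolean matchesAny(List<Pattern> patterns, String url) {
        return patterns.stream().anyMatch(p -> p.matcher(url).matches());
    }

    // Include is checked first; otherwise the URL passes unless excluded.
    static boolean shouldCrawl(String url) {
        if (matchesAny(INCLUDES, url)) {
            return true;
        }
        return !matchesAny(EXCLUDES, url);
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("https://www.your.org/directory/abc.html"));
        System.out.println(shouldCrawl("https://www.your.org/directory/other.html"));
    }
}
```

Running a handful of real URLs through a model like this before a crawl can catch a pattern that is too broad or too narrow, but the crawl preview and logs in Fusion remain the authoritative check.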
Additional notes
Include and exclude regexes must be valid Java-style regular expressions.
Make sure both lists are properly formatted JSON arrays.
You can test regex behavior by previewing the crawl or inspecting the crawl logs in Fusion.
This configuration pattern is useful when the majority of content under a path should be ignored, but exceptions need to be preserved.
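Since the patterns must be valid Java-style regular expressions, a quick local check can confirm that each one compiles and show how an unescaped dot behaves. The sketch below uses only java.util.regex; the class RegexSanity and its methods are hypothetical names, and whole-string matching is an assumption about how the patterns are applied.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for sanity-checking patterns before adding them to a datasource.
final class RegexSanity {
    // Returns true when the string compiles as a Java regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Whole-string match, as Pattern.matches performs it.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // An unclosed group is a syntax error.
        System.out.println(isValidRegex(".*www.your.org/(sample/.*"));
        // An unescaped dot matches any character, so this look-alike host matches.
        System.out.println(fullMatch(".*www.your.org/sample/.*", "https://wwwXyour.org/sample/a"));
        // With escaped dots the look-alike host no longer matches.
        System.out.println(fullMatch(".*www\\.your\\.org/sample/.*", "https://wwwXyour.org/sample/a"));
    }
}
```

If you escape dots in the datasource JSON, the backslash itself must be escaped there as well, so a literal dot is written as \\. inside the JSON string.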