Securely delete crawlDB across nodes in a multi-node Fusion 4 cluster – Lucidworks

Goal

Ensure reliable and safe deletion of a crawlDB used by Web Connector V1 with on-disk storage in a multi-node Fusion 4 deployment.

Environment

Fusion 4.2.6 (self-hosted)
Web Connector V1 using on-disk crawlDB
Multi-node cluster deployment

Guide

Use the REST API to delete crawlDB

Fusion 4 provides an endpoint to delete the crawlDB associated with a connector data source. To ensure deletion across all nodes, avoid using localhost in the API call.

Correct API format:

curl -X DELETE -u USERNAME:PASSWORD 'http://YOURFUSIONHOST:8764/api/apollo/connectors/datasources/YOUR_DATASOURCE_NAME/db'

Important:

Always replace localhost with the actual hostname or IP address of the Fusion node.
This API call can be executed from any node, but the hostname must resolve correctly to the Fusion service.
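The API response does not list the nodes on which the deletion occurred, and in multi-node environments the call is not guaranteed to behave the same way every time, so a successful response does not by itself confirm that every node was cleared.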
Ensure that the connector data source is not running when issuing this command.
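Because the response does not say which nodes were affected, one cautious approach is to send the same DELETE request to each node by its own hostname. The following is a minimal sketch of that idea, not a procedure confirmed by Fusion documentation. The function names, the DRY_RUN switch, and the FUSION_USER and FUSION_PASS environment variables are hypothetical; only the endpoint path and port come from the command above.

```shell
#!/usr/bin/env bash
# Sketch: send the crawlDB DELETE to each Fusion node by hostname.
# Assumptions (not from the article): function names, DRY_RUN,
# FUSION_USER / FUSION_PASS, and the idea of calling every node in turn.

FUSION_PORT=8764

# Build the crawlDB endpoint URL for one host and one datasource.
crawldb_url() {
  local host="$1" datasource="$2"
  printf 'http://%s:%s/api/apollo/connectors/datasources/%s/db\n' \
    "$host" "$FUSION_PORT" "$datasource"
}

# Issue the DELETE against every host listed after the datasource name.
# With DRY_RUN=1 the curl commands are printed instead of executed.
delete_crawldb_on_nodes() {
  local datasource="$1"; shift
  local host url
  for host in "$@"; do
    # Follow the article's advice: never target localhost.
    case "$host" in
      localhost|127.0.0.1)
        echo "skipping $host: use the node's real hostname or IP" >&2
        continue
        ;;
    esac
    url="$(crawldb_url "$host" "$datasource")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "curl -X DELETE -u USERNAME:PASSWORD '$url'"
    else
      curl -sS -X DELETE -u "${FUSION_USER}:${FUSION_PASS}" "$url"
    fi
  done
}
```

For example, running DRY_RUN=1 delete_crawldb_on_nodes YOUR_DATASOURCE_NAME fusion-node1 fusion-node2 (with hypothetical host names) prints the two curl commands without sending them, which makes it easy to review the targets before running the real deletion with the data source stopped.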

Alternative: Delete crawlDB directly from the file system

If REST deletion is unreliable, you can manually delete the crawlDB file structure from each node.

Path to delete:

<fusion-base-path>/data/connectors/connectors-classic/crawldb/lucid.web/YOUR_DATASOURCE_NAME

To ensure safe deletion:

Stop the connector job before removing files.
Repeat the deletion manually on all nodes in the cluster.

There is no system-wide propagation when deleting files manually; each node must be handled individually.
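The manual removal can be wrapped in a small script that is run on each node in turn. This is a sketch under stated assumptions: the function names and the argument checks are hypothetical additions, and the base path must be supplied by you, since it depends on where Fusion is installed. Only the directory layout below the base path comes from the path shown above.

```shell
#!/usr/bin/env bash
# Sketch: remove one datasource's on-disk crawlDB on the local node.
# Run on every node, and only after the connector job has been stopped.
# Function names and safety checks are illustrative, not part of Fusion.

# Build the crawlDB directory path for a given base path and datasource.
crawldb_path() {
  local base="$1" datasource="$2"
  printf '%s/data/connectors/connectors-classic/crawldb/lucid.web/%s\n' \
    "$base" "$datasource"
}

remove_local_crawldb() {
  local base="$1" datasource="$2" target
  # Refuse empty arguments so the path can never collapse to a parent directory.
  if [ -z "$base" ] || [ -z "$datasource" ]; then
    echo "usage: remove_local_crawldb <fusion-base-path> <datasource>" >&2
    return 2
  fi
  # Refuse names that would escape the lucid.web directory.
  case "$datasource" in
    */*|.|..)
      echo "invalid datasource name: $datasource" >&2
      return 2
      ;;
  esac
  target="$(crawldb_path "$base" "$datasource")"
  if [ ! -d "$target" ]; then
    echo "no crawlDB directory at $target" >&2
    return 1
  fi
  rm -rf -- "$target"
  echo "removed $target"
}
```

The function returns a non-zero status when the directory is missing, so a wrapper that loops over nodes can report which nodes had nothing to delete.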

What happens when using "Clear Datasource" in the UI?

When the "Clear Datasource" option is selected from the Fusion UI, two internal operations are triggered:

A Solr delete-by-query call is made to remove indexed documents associated with the data source.
The same REST API mentioned above is called to delete the crawlDB:

curl -X DELETE -u USERNAME:PASSWORD 'http://YOURFUSIONHOST:8764/api/apollo/connectors/datasources/YOUR_DATASOURCE_NAME/db'

Can the crawlDB be permanently deactivated?

No. The crawlDB is required for several core Web Connector features, including incremental crawling and dead URI detection. Deactivation is not supported.

Additional notes

The key to safe and reliable crawlDB removal in Fusion 4 is to target nodes by their actual hostname or IP address rather than localhost. Manual deletion is safe provided the connector job is stopped beforehand and the deletion is repeated on every node.