Goal
Replicate a subset of documents from one Fusion collection to another in the same environment, applying specific filters to control which documents are copied.
Environment
Fusion 5.5 and above
Applicable to environments running on Kubernetes with support for Spark and Parallel Bulk Loader jobs.
Guide
Use Parallel Bulk Loader (PBL) to replicate collection data
To copy data between collections or applications in the same Fusion instance:
Create a new Spark job using the Parallel Bulk Loader job type.
Set the format to solr.
Under Read Options, configure the following parameters:
parameter: collection
value: <source_collection_name>
Under Read Options, configure the filters parameter to limit which documents are copied:
parameter: filters
value: <filter_expression>
Under Write Options, configure the destination collection:
Output Collection: <target_collection_name>
Example filter expression
The filters parameter supports simple Solr query syntax. Example:
parameter: filters
value: id:"https://en.wikipedia.org/wiki/Patent"
Note: Not all Solr filter query (fq) expressions are guaranteed to work in the filters parameter. Trial and error may be required.
For more examples of Spark-native filtering parameters, see the documentation for the filters option.
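The example value above wraps the URL in double quotes because characters such as the colon have special meaning in Solr query syntax. As a minimal sketch, the hypothetical helper below (solr_term_filter is not part of Fusion or Solr) builds a quoted field:value expression and escapes embedded backslashes and double quotes. Whether a given expression is accepted by the filters parameter still needs to be verified by running the job.

```python
def solr_term_filter(field: str, value: str) -> str:
    """Build a quoted Solr term filter such as id:"some value".

    Hypothetical helper for illustration only. Inside a quoted phrase,
    backslashes and double quotes are escaped with a backslash.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Reproduces the example value from this article.
print(solr_term_filter("id", "https://en.wikipedia.org/wiki/Patent"))
# prints: id:"https://en.wikipedia.org/wiki/Patent"
```

The resulting string is what would be pasted into the value field of the filters read option.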
Alternative approach using index pipeline filtering
If the filters parameter does not produce the expected result, you can use a custom Index Pipeline in the target app to exclude documents that do not meet the filtering criteria.
Create an Index Pipeline that drops documents based on specified conditions.
Attach the pipeline to the Parallel Bulk Loader job using the Index Pipeline field.
This method allows complete control over which documents are written to the target collection, even when the read step pulls in a broader set of records.
Note: Clear the destination collection before running the job if you want to avoid duplicate or outdated data.
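The settings described in this article can be summarized as a single job definition. The sketch below assembles one as a Python dictionary and prints it as JSON. The field names (type, format, readOptions, outputCollection, outputIndexPipeline) are assumptions modeled on the job's JSON view and should be checked against your Fusion version. The function build_pbl_job and all job, collection, and pipeline names are hypothetical.

```python
import json


def build_pbl_job(job_id, source, target, filters=None, index_pipeline=None):
    """Assemble a Parallel Bulk Loader job definition as a plain dict.

    Field names are assumptions; compare them with the JSON view of a
    job created in the Fusion UI before relying on them.
    """
    # Read Options: the source collection, plus the optional filter.
    read_options = [{"key": "collection", "value": source}]
    if filters:
        read_options.append({"key": "filters", "value": filters})

    job = {
        "type": "parallel-bulk-loader",
        "id": job_id,
        "format": "solr",
        "readOptions": read_options,
        # Write Options: the destination collection.
        "outputCollection": target,
    }
    # Alternative approach: route documents through a filtering pipeline.
    if index_pipeline:
        job["outputIndexPipeline"] = index_pipeline
    return job


job = build_pbl_job(
    "copy-subset",                      # hypothetical job ID
    "source_collection",                # hypothetical source collection
    "target_collection",                # hypothetical target collection
    filters='id:"https://en.wikipedia.org/wiki/Patent"',
    index_pipeline="drop-unwanted-docs",  # hypothetical pipeline ID
)
print(json.dumps(job, indent=2))
```

Keeping filters and index_pipeline optional mirrors the two approaches above: use the filters read option where it works, and fall back to the pipeline where it does not.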