Replicate filtered records between Fusion applications using PBL and index pipelines – Lucidworks

Goal

Replicate a subset of documents from one Fusion collection to another in the same environment, applying specific filters to control which documents are copied.

Environment

Fusion 5.5 and above
Applicable to environments running on Kubernetes with support for Spark and Parallel Bulk Loader jobs.

Guide

Use Parallel Bulk Loader (PBL) to replicate collection data

To copy data between collections or applications in the same Fusion instance:

Create a new Spark job using the Parallel Bulk Loader job type.
Set the format to solr.

Under Read Options, configure the following parameters:

parameter: collection
value: <source_collection_name>

Also under Read Options, set the filters parameter to limit which documents are copied:

parameter: filters
value: <filter_expression>

Under Write Options, configure the destination collection:

Output Collection: <target_collection_name>
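
The settings above can be collected into a single job definition. The following Python sketch builds such a definition as a dictionary and prints it as JSON. The key names (id, type, format, readOptions, outputCollection), the job type identifier, and the job ID are assumptions chosen to mirror the UI labels in this guide. Compare them with the job's JSON view in your Fusion version before relying on them.

```python
import json


def build_pbl_job(job_id, source_collection, target_collection, filters=None):
    """Assemble a Parallel Bulk Loader job definition as a plain dict.

    Key names are assumptions that mirror the UI labels used in this guide.
    """
    read_options = [{"key": "collection", "value": source_collection}]
    if filters:
        # Optional: restrict which documents are read from the source.
        read_options.append({"key": "filters", "value": filters})
    return {
        "id": job_id,
        "type": "parallel-bulk-loader",  # assumed job type identifier
        "format": "solr",
        "readOptions": read_options,
        "outputCollection": target_collection,
    }


job = build_pbl_job(
    job_id="replicate-filtered-docs",  # hypothetical job name
    source_collection="source_collection_name",
    target_collection="target_collection_name",
    filters='id:"https://en.wikipedia.org/wiki/Patent"',
)
print(json.dumps(job, indent=2))
```

Keeping the filter optional means the same helper can describe a full copy (no filters entry) or a filtered copy.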

Example filter expression

The filters parameter supports simple Solr query syntax. Example:

parameter: filters
value: id:"https://en.wikipedia.org/wiki/Patent"

Note: Not every Solr filter query (fq) expression is guaranteed to work in the filters parameter. Test each expression and confirm that the target collection contains the expected documents.

More examples of Spark-native filtering parameters can be found in the documentation for the filters option.
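
Values that contain characters such as colons or slashes, like the URL in the example above, are wrapped in double quotes. The following Python sketch shows one way to build such a value. The helper name quoted_term_filter is hypothetical, and the sketch only covers quoting: whether the filters parameter accepts a given expression still has to be verified by running the job.

```python
def quoted_term_filter(field, value):
    """Return field:"value" with the value wrapped as a quoted phrase.

    Inside a quoted phrase, backslashes and double quotes are escaped
    so that they are read as literal characters.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Reproduces the filter value used in the example above.
print(quoted_term_filter("id", "https://en.wikipedia.org/wiki/Patent"))
```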

Alternative approach using index pipeline filtering

If the filters parameter does not produce the expected result, you can use a custom Index Pipeline in the target app to exclude documents that do not meet the filtering criteria.

Create an Index Pipeline that drops documents based on specified conditions.
Attach the pipeline to the Parallel Bulk Loader job using the Index Pipeline field.

This method allows complete control over which documents are written to the target collection, even when the read step pulls in a broader set of records.
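
Before building the pipeline, it can help to prototype the keep-or-drop condition against a few sample documents. The following Python sketch is only an offline model of that decision, not a Fusion pipeline stage. The function name should_keep, the sample documents, and the condition itself (keep documents whose id starts with a given URL prefix) are hypothetical. The same logic would then be ported to whatever stage your index pipeline uses to drop documents.

```python
def should_keep(doc):
    """Return True if the document should be written to the target collection.

    Hypothetical condition: keep only documents whose id starts with the
    Wikipedia URL prefix. Documents without an id are dropped.
    """
    doc_id = doc.get("id", "")
    return doc_id.startswith("https://en.wikipedia.org/wiki/")


sample_docs = [
    {"id": "https://en.wikipedia.org/wiki/Patent"},
    {"id": "https://example.com/other"},
    {"title": "no id field"},
]

kept = [d for d in sample_docs if should_keep(d)]
dropped = [d for d in sample_docs if not should_keep(d)]
print(f"kept={len(kept)} dropped={len(dropped)}")
```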

Note: Clear the destination collection before running the job if you want to avoid duplicate or outdated data.