Goal
Fusion Parallel Bulk Loader (PBL) jobs enable high-performance ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro. PBL can also be used to transfer data between different Fusion clusters. This article provides an overview of PBL configuration for cross-cluster data transfer.
Environment
Fusion 4.x & 5.x
Guide
In this specific scenario, we're copying data between two Fusion 4.2.5 instances, where the source cluster/collection is named "Source" and the destination cluster/collection is named "Destination".
The process is outlined below in detail.
Steps
- Create a new PBL job in the Source cluster via "Collections" -> "Jobs" -> "Add" -> "Parallel Bulk Loader".
- Provide a job name and a format for the load job. In this example, the format is "solr", since we're transferring Solr data between clusters. (Fig. 1)
- Under "Read Options", add a "collection" parameter specifying the name of the collection we're copying data from. In this case, it's "Source". (Fig. 1)
- Specify a collection name under "Output Collection". Because the actual destination collection lives on a separate cluster, a placeholder name is enough here; the real target is set under "Write Options" (see the field notes after Fig. 2). (Fig. 1)
Fig. 1
- Under "Write Options", add the parameters for the destination hostname and the destination collection name. In this case destination collection name is "Destination" and destination hostname/zk host is "localhost:9983/lwfusion/4.2.5/solr". (Fig. 2)
- If required, enable options such as "Clear Existing Documents" (which only works if the job is triggered from the Destination cluster) and Solr field definition. (Fig. 2)
- Once the configuration is complete, save the job and click Run to copy the data over to the destination collection. A sketch of the resulting job definition appears after Fig. 2.
Fig. 2
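For reference, the configuration above corresponds roughly to the following job definition. This is an illustrative sketch, not an exact dump: the job ID and the placeholder output collection name are made up, the precise JSON structure can vary between Fusion versions, and the option keys ("collection", "zkHost") follow the spark-solr data source that PBL uses to read from and write to Solr. Verify against your own saved job before reusing it.

{
  "type": "parallel-bulk-loader",
  "id": "copy-source-to-destination",
  "format": "solr",
  "readOptions": [
    { "key": "collection", "value": "Source" }
  ],
  "outputCollection": "placeholder",
  "writeOptions": [
    { "key": "collection", "value": "Destination" },
    { "key": "zkHost", "value": "localhost:9983/lwfusion/4.2.5/solr" }
  ]
}

The key fields are explained below.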
readOptions: Specifies the source collection to read the data from (e.g. "Source").
outputCollection: The Solr collection that receives the documents loaded from the input data source. Since the actual output collection is on a separate cluster (e.g. "Destination") and this field is mandatory, specify a placeholder collection name here.
writeOptions: Specifies the destination collection the data is copied into from the source collection (e.g. "Destination"), along with the destination cluster's ZooKeeper connect string.
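On Fusion 4.x, a saved job can also be started outside the UI through the Spark jobs REST API. The host, port, credentials, and job ID below are assumptions for a default local installation; adjust them to your environment, and note that the endpoint path may differ in other Fusion versions.

curl -u admin:password -X POST "http://localhost:8764/api/spark/jobs/copy-source-to-destination"

Issuing a GET against the same endpoint should return the job's current status, which is useful for checking on long-running transfers.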
Additionally, we have a custom-built collection transfer app that uses Spark to move data between collections, including across clusters. It is available on GitHub:
https://github.com/lucidworks/fusion-spark-job-workbench/tree/master/collection-transfer-app