Goal
Fusion Parallel Bulk Loader (PBL) jobs enable high-performance ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro. PBL can also be used to transfer data between different Fusion clusters. This article provides an overview of PBL configuration for cross-cluster data transfer.
Environment
Fusion 4.x & 5.x
Guide
In this specific scenario, we're copying data between two Fusion 4.2.5 instances, where the source cluster/collection is named "Source" and the destination cluster/collection is named "Destination".
The process is outlined below in detail.
Steps
- Create a new PBL job in the Source cluster via "Collections" -> "Jobs" -> "Add" -> "Parallel Bulk Loader".
- Provide a job name and a format for the load job. In this example, the format is "solr", since we're transferring Solr data between clusters. (Fig. 1)
- Under "Read Options", add a "collection" parameter specifying the name of the collection we're copying data from. In this case, it's "Source". (Fig. 1)
- Specify a collection name under "Output Collection". Because the actual destination collection lives on a separate cluster, a placeholder name is enough here; the real target is set under "Write Options" (see the field notes after Fig. 2). (Fig. 1)
Fig. 1
- Under "Write Options", add the parameters for the destination hostname and the destination collection name. In this case destination collection name is "Destination" and destination hostname/zk host is "localhost:9983/lwfusion/4.2.5/solr". (Fig. 2)
- If required, enable options such as "Clear Existing Documents" (which only works if the job is triggered from the Destination cluster) and Solr field definition. (Fig. 2)
- Once the configuration is complete, save the job and click Run to copy the data over to the destination collection. A sketch of the resulting job definition appears after Fig. 2.
Fig. 2
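For reference, the configuration above corresponds roughly to the following job definition. This is an illustrative sketch, not an exact dump: the job ID and the placeholder output collection name are made up, the precise JSON structure can vary between Fusion versions, and the option keys ("collection", "zkHost") follow the spark-solr data source that PBL uses to read from and write to Solr. Verify against your own saved job before reusing it.

{
  "type": "parallel-bulk-loader",
  "id": "copy-source-to-destination",
  "format": "solr",
  "readOptions": [
    { "key": "collection", "value": "Source" }
  ],
  "outputCollection": "placeholder",
  "writeOptions": [
    { "key": "collection", "value": "Destination" },
    { "key": "zkHost", "value": "localhost:9983/lwfusion/4.2.5/solr" }
  ]
}

The key fields are explained below.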
readOptions: Specifies the source collection to read the data from (e.g. "Source").
outputCollection: The Solr collection that receives the documents loaded from the input data source. Since the actual output collection is on a separate cluster (e.g. "Destination") and this field is mandatory, specify a placeholder collection name here.
writeOptions: Specifies the destination collection the data is copied into from the source collection (e.g. "Destination"), along with the destination cluster's ZooKeeper connect string.
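On Fusion 4.x, a saved job can also be started outside the UI through the Spark jobs REST API. The host, port, credentials, and job ID below are assumptions for a default local installation; adjust them to your environment, and note that the endpoint path may differ in other Fusion versions.

curl -u admin:password -X POST "http://localhost:8764/api/spark/jobs/copy-source-to-destination"

Issuing a GET against the same endpoint should return the job's current status, which is useful for checking on long-running transfers.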
Additionally, we have a custom-built collection transfer app that uses Spark to move data between collections, including across clusters. It is available on GitHub:
https://github.com/lucidworks/fusion-spark-job-workbench/tree/master/collection-transfer-app