Some testing was done to identify a process for migrating a SolrCloud installation from one DC (data center) to another. These are the notes from that testing (which was far from rigorous).
Testing was done with solr-4.10.4, but the process will likely work with a range of Solr versions, including some 5.x releases, since it doesn't rely on anything specific to 4.10.
Two DCs were simulated, and testing started with this infrastructure:
- functional zookeeper ensemble (3 nodes) and solrcloud cluster (3 nodes, all on port 8983) in DC1 (one collection, 2 shards, 2 replicas/shard)
- functional zookeeper ensemble (3 nodes) and solrcloud cluster (3 nodes, all on port 8984) in DC2 (no collections or data)
- Copy all configs from ZooKeeper in DC1 to ZooKeeper in DC2. This ensures that when the nodes in DC2 come up with their new data, they'll properly link to their associated configs.
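A sketch of the config copy using the zkcli.sh script that ships with Solr 4.x; the ZooKeeper connection strings and config-set name here are hypothetical placeholders:

```shell
# Hypothetical ZooKeeper connection strings and config-set name;
# substitute the real values for your ensembles.
ZK_DC1="zk1.dc1:2181,zk2.dc1:2181,zk3.dc1:2181"
ZK_DC2="zk1.dc2:2181,zk2.dc2:2181,zk3.dc2:2181"
CONF="myconf"                 # config set the collection links to
CONF_DIR="/tmp/${CONF}"

# zkcli.sh ships under example/scripts/cloud-scripts/ in Solr 4.x.
# Download the config set from DC1, then upload it to DC2:
#   zkcli.sh -zkhost "$ZK_DC1" -cmd downconfig -confdir "$CONF_DIR" -confname "$CONF"
#   zkcli.sh -zkhost "$ZK_DC2" -cmd upconfig   -confdir "$CONF_DIR" -confname "$CONF"
echo "copy config '$CONF' from [$ZK_DC1] to [$ZK_DC2] via $CONF_DIR"
```

The live zkcli.sh calls are commented out since they need real ensembles; repeat the pair for each config set the collections use.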
- Stop ingestion in DC1, and ensure all data is committed.
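Once ingestion is stopped, a hard commit can be forced through the update handler to make sure everything is flushed before the copy; the node address and collection name here are hypothetical:

```shell
# Hypothetical DC1 node and collection name.
DC1_NODE="http://192.168.200.20:8983"
COLLECTION="collection"

# An explicit hard commit via the update handler.
COMMIT_URL="${DC1_NODE}/solr/${COLLECTION}/update?commit=true"
echo "$COMMIT_URL"
# curl "$COMMIT_URL"    # run against a live DC1 node
```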
- Have all Solr nodes in DC2 shut down.
- In DC1, shut down one replica for each shard. This may not be strictly necessary, but it does ensure that nothing (such as background Lucene segment merges) will modify the index during the copy.
- Copy the data directories of the shut-down nodes from DC1 to DC2. In my test, that was:
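The copy step could be sketched with rsync as below; all paths, the DC2 host, and the core directory names are hypothetical and depend on your own install layout:

```shell
# Hypothetical source path, destination host, and destination path.
SRC="/opt/solr-dc1/example/solr/collection_shard1_replica2/data"
DST_HOST="192.168.201.20"
DST="/opt/solr-dc2/example/solr/collection_shard1_replica1/data"

# Trailing slashes copy the *contents* of data/ into the destination data/.
echo "rsync -a ${SRC}/ ${DST_HOST}:${DST}/"
# rsync -a "${SRC}/" "${DST_HOST}:${DST}/"   # repeat once per shard
```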
- Restart the DC1 nodes, and resume ingestion (if desired).
- Start the DC2 nodes. The cloud should come up with the appropriate collections and data, but with only 1 replica per shard.
- Add replicas using the Collections API when the full cluster is desired. But if you want to keep resynchronizing from DC1, you may not want to add the replicas until the last moment.
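Building out the cluster can be done with the Collections API's ADDREPLICA action (available since Solr 4.8); the DC2 node address and collection/shard names here are hypothetical:

```shell
# Hypothetical DC2 node; collection/shard names follow the notes above.
DC2_NODE="http://192.168.201.20:8984"

ADD_URL="${DC2_NODE}/solr/admin/collections?action=ADDREPLICA&collection=collection&shard=shard1"
echo "$ADD_URL"
# curl "$ADD_URL"    # repeat for each shard that needs another replica
```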
These are notes from a failed attempt to keep both clusters fully functional and synchronized. The failure does, however, point to a way to keep DC2 up to date with DC1, albeit with DC2 operating on minimal infrastructure until you're ready to build it out completely.
- Both DCs fully running (one collection, 2 shards, 2 replicas/shard)
- Added documents to DC1 and committed.
- Stopped ingestion in DC1
- Used the Replication API to trigger a fetchindex on the leader of shard1 in DC2.
- Note the use of masterUrl (properly URL-encoded) to specify the location of the source node in DC1. In this case that value is http://192.168.200.20:8983/solr/collection_shard1_replica2/replication
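The fetchindex call can be sketched as below, with masterUrl URL-encoded so it survives as a request parameter. The masterUrl is the DC1 node from the notes above; the DC2 host and core name are hypothetical:

```shell
# masterUrl points at the /replication handler of the source core in DC1.
MASTER_URL="http://192.168.200.20:8983/solr/collection_shard1_replica2/replication"

# URL-encode ':' and '/' so masterUrl is safe as a query parameter.
ENCODED=$(printf '%s' "$MASTER_URL" | sed -e 's|:|%3A|g' -e 's|/|%2F|g')

# Hypothetical DC2 leader core for shard1.
FETCH_URL="http://192.168.201.20:8984/solr/collection_shard1_replica1/replication?command=fetchindex&masterUrl=${ENCODED}"
echo "$FETCH_URL"
# curl "$FETCH_URL"    # run against the live DC2 leader
```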
This properly updated the leader in DC2 (GOOD!), but the other replica on that shard didn't update, even after a reload of it (BAD). Basically, the other replica isn't aware of the triggered replication, so we can't maintain both clusters as fully synchronized.
But if we follow the original process for the initial migration, and leave each shard with only 1 replica in DC2, then we should be able to issue fetchindex commands to keep them in sync with DC1, even while updates are being done in DC1.
Then, when ready to fully shift to DC2: stop ingestion in DC1, do a final replication to DC2, build out the other replicas in DC2, and bring DC2 online.
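That cutover sequence could be sketched as a loop over the shards; the node addresses are hypothetical, and the core names follow the naming pattern from the notes above:

```shell
# Hypothetical node addresses; core names follow the notes' naming pattern.
DC1_NODE="http://192.168.200.20:8983"
DC2_NODE="http://192.168.201.20:8984"

# 1. (Stop ingestion in DC1 first.)
# 2. Final replication pass: one fetchindex per shard.
for SHARD in shard1 shard2; do
  MASTER="${DC1_NODE}/solr/collection_${SHARD}_replica2/replication"
  ENC=$(printf '%s' "$MASTER" | sed -e 's|:|%3A|g' -e 's|/|%2F|g')
  echo "${DC2_NODE}/solr/collection_${SHARD}_replica1/replication?command=fetchindex&masterUrl=${ENC}"
done

# 3. Build out the remaining replicas, then bring DC2 online.
for SHARD in shard1 shard2; do
  echo "${DC2_NODE}/solr/admin/collections?action=ADDREPLICA&collection=collection&shard=${SHARD}"
done
# (curl each echoed URL in order when running this for real)
```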