High Availability and Disaster Recovery are very different now!
One of the concerns for any enterprise is how to insure that the system remains available in the face of various points of failure. In the past (pre Solr 4.0), this has been a pretty manual situation. Solr has been configured to be highly fault-tolerant, but recovery is usually a manual process. Solr 4.0 changes this pretty dramatically. This article outlines the old process and the differences in Solr 4.0.
Solr 4.0 was released on 12-October-2012
- HA/DR - High Availability/Disaster Recovery. HA is simply that the application is available to serve requests some percentage of the time. HA is commonly expressed in percentages, i.e. 90%, 99%, 99.99%. Modern applications are often required to be available over 99.99% of the time. DR is the ability of the system to come back on-line when Something Bad Happens (1).
- Leader - In Solr 4, a node with some additional responsibilities. The leader and replica (see below) roles can change dynamically for a given node based on system health. This is coordinated through Zookeeper. Both leaders and replicas index items and perform searches.
- Master - in pre Solr 4 terms, the machine that is responsible for indexing. In significantly-sized installations, the master was rarely used for searching, it is usually optimized for indexing.
- Near Real Time - The ability in Solr 4 to search documents soon after the index request is sent. This is usually < 5 seconds.
- Replica - In Solr 4 a node that receives updates as coordinated by the leader. Both indexing and searching are performed on replicas.
- Sharding - Splitting up logical indexes into multiple physical segments. This article will not address sharding in order to focus on HA/DR.
- Slave - in pre Solr 4 terms, the machine(s) that pulled copies of the index from the master and were used for searching. There can be many slaves per master.
- SolrCloud (or Solr 4) - in this article, this should be understood to refer to the Zookeeper-enabled Solr configuration only available in 4.0. It is not referring to Solr-in-the-cloud (e.g. Azure or AWS). The two are not incompatible, that is you can run Solr 4 with Zookeeper on AWS etc, but for the purposes of this document we'll let SolrCloud refer to running Solr with Zookeeper
- Zookeeper - An Apache open-source project allowing coordination between nodes and the adoption of various roles, notification of node health etc.
This article is not particularly intended to be a tutorial or "how to" for HA/DR. Rather it is a higher-level orientation to how the problem is solved in the two versions of Solr.
It is usually recommended that Solr not be used as the system-of-record. The ultimate fallback for true disaster is to re-index all your content. This may be a very painful process, sometimes impossible. The amount of effort put into HA/DR is often based on how painful complete re-indexing would be.
This article largely assumes a basic familiarity with Solr pre 4.0 replication. For a more thorough introduction to Solr replication see: Solr Replication.
HA/DR pre Solr 4
Note that for this discussion, we're ignoring sharding in order to focus on HA/DR. Sharding simply replicates this problem N times, once for each shard.
Solr 3.x HA/DR is primarily based ont he concept of a master/slave setup. In short indexing is done to the master and periodically the index is replicated by the slaves where searches are performed. By "periodically", we mean that the slave is configured with a "polling interval". Upon expiration, the slave asks the master "has the index changed since the version I have?". If the answer is "yes", the changed bits of the index are copied by the slave, and once autowarming is complete new queries are executed against the updated index. For purposes of discussion, let's label the master M and the slaves S1 and S2. There are several consequences of this model:
- There is a time lag between indexing a document and seeing that document in searches. This is bounded by (commit time) + (replication interval) + (replication time) + (searcher warmup time).
- Due to the fact that roles are configured in configuration files, switching a node from a master to a slave or vice-versa is a manual process.
- If a master goes down, it is usually unclear what documents have been indexed to the master, let alone whether the last N updates have gotten to the slave. So knowing how to bring an index up to date on the slave is problematical.
- Depending upon which slave a request went to, different users could see slightly different versions of the index. This can be ameliorated by locking the session to a particular slave.
- If the data ingestion rate (i.e. indexing) is very rapid, the time it takes to recover a master and get the index up to date can be unsatisfactory.
When speaking of HA, there are two dimensions: continuing to server searches and continuing to add fresh data. In the pre SolrCloud days, the way of insuring that HA would be guaranteed in terms of continuing to serve some content (even if stale) was to replicate to a number of slave searchers fronted by a load balancer. True, these slaves wouldn't receive any new content if the master went down, but at least your application would show something.
The second dimension is new data being available for search. In the case of having multiple slaves and one of the slaves goes down, the user sees uninterrupted service as the load balancer will route incoming requests to functional machines. But if a master goes down, no index up dates take place and none will until the problem is detected, a new master is put in place, and the backlog of items is indexed (and, a replication takes place!).
Say S1 goes down for whatever reason. Recovery here is simple,
- Bring up another slave, point it to the master (i.e. set it up to run Solr and copy the slave configs appropriately).
- Wait for replication to complete, which can be determined from the log files or the admin page.
- Let the load balancer know about the new machine.
All this while, the incoming requests will be served by S2, M is still indexing, etc. The supposition here is that there has been a bit of "over provisioning" such that the remaining slaves can adequately handle the load. Apart from the time it takes to "bring up another slave", which usually simply means installing Solr and copying a few configuration files, the time it takes for the replication to happen is approximately the time it takes to move the entire index over the wire to the new machine. At that point the machine can be made available through the load balancer.
One implication here is that it's better to over-provision your slaves by having enough machines so as to handle the query load if one (or perhaps more) should go down. Then, during the interval between the time the machine dies and you have a new machine in place, your query load will be handled by the remaining slaves.
If the master goes down, it's slightly more complex, but still not too bad.
- Take one of the slaves (let's say S1) out of the lineup.
- Reconfigure it to be a master (this is really just copying the solrconfig.xml file you archived for the master to the config directory).
- Point your indexing process at the new master and re-index from the last know good point. This is important because data indexed between the last hard commit and the time the machine went down is lost. Note that since the <uniqueKey> causes older documents with the same value in this field to be overwritten, it's usually sufficient to pick a starting point _known_ to be before the last (hard commit + polling interval) and re-index from that record, duplicates being overwritten. This is most often some kind of timestamp. You must also consider the polling interval since the data on the slave is only current as of the last time the replication happened.
- This is all kind of wordy, but in practical terms, it's usually something like (autocommit interval) + ((polling interval) * 2) + 1 hour) or something equally simple. Or even, in the case where indexes don't change all that often, something like "midnight last night". The only time it should get more complex is if "over-indexing" is expensive.
- Point your remaining slaves at the new master. This is also just a config change (on the slaves), and is often handled by having the slaves point not at a hard-coded IP, but at a load balancer fronting the master in which case nothing needs to be done for this step.
- To get the aggregate search capacity back up, bring up a new slave to replace the one turned into a new master as above in "slave case".
Observations for HA/DR in pre SolrCloud
The above doesn't tackle how you determine that a machine has bitten the dust. Solr is JMX enabled, so a common solution is to configure some monitoring process (Zabbix, Nagios, or whatever understands JMX) to periodically ping the machines in the cluster and issue the proper alerts.
The process outlined above has various refinements for splitting the machines in your cluster amongst physically separate data centers (the so-called "repeater" setup) that I'm not going to cover here. This is important, but the short form is that by locating slaves (or the "repeater" that partakes of both a master and slave role) in physically separated locations, you can protect against fire/flood/earthquake level disasters).
As you can see, though, there are multiple places where things can go wrong in the above cases, and manual intervention is required.
HA/DR in SolrCloud (aka Solr 4)
Note that as before we're skipping the entire "sharding" issue. Sharding is handled much more elegantly in SolrCloud than in pre SolrCloud, but is not relevant to the HA/DR discussion.
HA/DR drastically changes in SolrCloud, as has the problem of insuring that data sent to the indexing machine is up to date. This section outlines the general behaviors, more detailed information can be found at: Solr Cloud and Near Real Time.
In essence, you don't need to pay nearly as much attention any more. Assuming you have monitoring in place and a machine goes down, you just fire up another machine, and you're done.
Hmmmm, that's a little terse.
SolrCloud replaces "masters" and "slaves" with "replicas". One replica will have some additional responsibilities and we call that node a "leader". But a leader and a replica can change roles without manual intervention should the leader become unavailable so the distinction is much less sharp than between masters and slave in pre SolrCloud installations.
With SolrCloud, the process by which data becomes searchable is:
- Updates are sent to any machine in the cluster.
- The updates are forwarded to the leader if necessary.
- The leader sends the updates to all of the replicas (this is one of the added responsibilities of the leader).
- Once the leader receives acknowledgements that all the other replicas have received the updates, the original update request is responded to.
- The acknowledgement from the replica is not received until after the update has been written in the transaction log. At this point, even if the process is interrupted, the updates can be replayed when the machine starts back up. There are no longer any "lost updates" to contend with.
- If Near Real Time (NRT) is configured, the updates are immediately (almost) searchable.
- Its role in the cluster.
- What the state of the cluster is, i.e. what machines are currently functioning and any time a state changes (a node comes online or goes offline).
- Elect a leader if the leader should become unavailable.
Replica becomes unavailable
Nothing much happens at all from the outside. To replace the replica, you simply start up a new node in SolrCloud mode (start a regular Solr instance with the -DzkHost parameter specified). At that point:
- The Solr node asks the Zookeeper ensemble what role it should fill. Since this machine is just being "introduced" to the cluster, it will almost certainly be a replica.
- The new machine automatically gets a new version of the index. Note: this is efficiently carried out by an old-style replication, but it is not necessary to configure this, the SolrCloud node figures this out automatically.
- Any pending updates in the transaction log on the leader are sent to the new machine for indexing.
- Search requests are automatically routed to the new machine after it has caught up.
- The critical point is that "bringing up a new replica" is as simple as
- Installing Solr 4 on a new machine.
- Starting it up with a command like: "java -DzkHost=zk1[,zk#..] -jar start.jar" where zk# is a node where Zookeeper is running
- From this point on, everything is automatic. The configuration files (including schema.xml, solrconfig.xml, etc) are automatically installed on the new machine, the index is brought up to date and searches begin to get routed to this machine when appropriate.
Leader becomes unavailable
If the leader becomes unavailable, the following occurs.
- SolrCloud nodes detect that the leader is unavailable through Zookeeper.
- One of the remaining replicas is elected the leader and the cluster continues to behave as before.
- If you need to bring the capacity back up for the cluster, add another machine as above.
- Note that since index updates are atomic across all replicas, the new leader's transaction log is up to date, updates sent to the leader before it went down are already on the replica before the acknowledgement is returned from the leader.
Observations for HA/DR in SolrCloud
As you can see, the HA/DR process is vastly simpler in SolrCloud than in pre SolrCloud. Primarily, all that's required is to notice that a node has gone down (as in Solr 3.x, any JMX-savvy application can be used, with additional capabilities possible by taking advantage of Zookeeper) and bring up another node. Then sit back and watch. That said, there are a couple of situations to be aware of.
- The remote data center scenario should be thought about carefully. Glossed over in the outline above is quite a bit of communication between leaders and replicas for updates. If the nodes are separated by an expensive communications channel, where "expensive" is interpreted as slow, throughput will suffer. The current best practice is for the indexing process to send separate updates to each datacenter. Each datacenter is effectively its own isolated cluster without knowledge of the other datacenter.
- Zookeeper needs to be configured. Zookeeper is an Apache open source project and should be set up with at least three separate nodes since it operates on a quorum basis.
- We've completely left out shards. This to simplify the discussion, the HA/DR process is identical to the non-sharded process and SolrCloud automatically handles shard assignment and provisioning replicas on a round-robin basis for to the appropriate shards.
This article is mainly intended to be an introduction to HA/DR for both Solr 3.x and SolrCloud/4x. Many technical details have been left out in the interests of conveying the "big picture". Additionally, the SolrCloud functionality is currently (October, 2012) relatively new, although clients are in the process of going live with it as I write this. We're all on the front end of the learning curve in terms of how SolrCloud performs "in the wild", and as with any new capabilities, there will accure more refined "best practices" as more real-life installations come on-line. We at LucidWorks are currently engaged with several clients who are in process of going "live" with customer-facing applications based on SolrCloud, we'll keep you posted!
(1) I've seen the consequences of, for instance, the 110-psi, 2" fire main in the ceiling above a server room burst. On the weekend. It filled the elevator shafts and put 4" of water on the main floor of the entire building. What it did to the servers should't happen to your worst enemy.