Recently, we had a customer working with LucidWorks 2.1 in its "cloud" configuration, that is, the new SolrCloud capabilities (we sometimes refer to this as the "ZooKeeper" Solr Cloud to distinguish it from, say, AWS or Azure). The client was seeing very poor indexing performance and naturally wondered why.
Additionally, there were inconsistencies. LW/SolrCloud should, for instance, report the same number of hits for a given search no matter which node receives the query, but it wasn't. Worse, the kind of inconsistency changed depending on which node one looked at. Problems that aren't readily reproducible drive support people crazy; the first step is always getting to a point where you can reliably fail!
What was found
Through a marathon interactive session with the client, four things became apparent:
- There is a bug in SolrCloud (and consequently in LW 2.1.1) that was part of the problem. The client was adding documents to the LW/Solr index via a SolrJ program, which is quite common. The advice we've given for a long time is to "batch up" the Solr documents and use the "server.add(doclist);" call to send them to the Solr server. In SolrCloud only, the implementation could lead to documents being inconsistently updated. This is already fixed in Solr 4.0-BETA and will be included in the next LW release (after 2.1.1). The interim solution is to index a single document at a time, i.e. "server.add(doc);", and start enough clients to get the throughput you need. This client was indexing between 3,000 and 6,000 documents/second and still hadn't reached the limits of the cluster. The client was also seeing spurious updates: they'd index 1,000 documents and see, say, 500 extra updates (the difference between numDocs and maxDoc on the admin page).
- Scripts must be used when bringing complex clusters up and down. Especially in development, you're bouncing all your servers repeatedly for various reasons, not the least of which is starting over from scratch to ensure you're in a known state. It's almost impossible to do this accurately, reliably, and repeatably by hand, especially after 10 hours of frustration while trying to understand a problem. And it's virtually guaranteed that some of your puzzling results will come from missing the error message that scrolled past after the command you just hand-typed. Or you fail to shut down the external indexing program, so now an indexer is running that you don't expect. Or....
- A subtler issue was that the client had set autoSoftCommit to a relatively small number of documents (100). The documents were extremely short, so we estimate they were committing every 100 ms or less. While soft commits are explicitly intended to support near-real-time search and are thus "inexpensive", they are inexpensive relative to hard commits. They still have some cost, and doing a soft commit 10 times a second was too much. We recommend setting it to a time interval rather than a number of documents (1 second in the example), though longer is reasonable too.
- Several bugs fixed in the 2.1.1 release made the problem easier to find and solve, so think about upgrading when possible. They were not the root of this client's issue; the problems were resolved while still on LW 2.1. But LW is improving constantly, and it's often best to take advantage of those improvements.
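The per-document workaround from the first point above can be sketched in a few lines. This is a minimal illustration, not the client's actual code: the hypothetical `DocSink` class stands in for a real SolrJ server object so the sketch is self-contained; in a real SolrJ client, the `add` call would be `server.add(solrInputDocument)` against the cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleDocIndexer {
    // Hypothetical stand-in for a SolrJ server object; it only counts adds.
    // A real client would call SolrJ's server.add(doc) here, which is a
    // network round trip to the Solr cluster.
    static class DocSink {
        int added = 0;
        void add(Map<String, Object> doc) { added++; }
    }

    // Interim workaround: add one document per call instead of
    // server.add(doclist), which triggers the SolrCloud batching bug.
    static int indexOneAtATime(DocSink server, List<Map<String, Object>> docs) {
        for (Map<String, Object> doc : docs) {
            server.add(doc);
        }
        return server.added;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", Integer.toString(i));
            docs.add(doc);
        }
        System.out.println("added " + indexOneAtATime(new DocSink(), docs));
    }
}
```

The point is purely structural: one document per add() call until the batching fix ships, then batching becomes safe again.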
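For the soft-commit point above, in plain Solr the time-based setting lives in the updateHandler section of solrconfig.xml. A sketch with a 1-second interval follows (the way LucidWorks surfaces this setting in its UI may differ):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Soft commit every 1000 ms rather than every N documents -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```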
After scripting the server and indexing-program start/stop, specifying autoSoftCommit as a time interval (one second) rather than a number of documents, and moving from the server.add(doclist) method to server.add(doc), the client was satisfied. The results became perfectly consistent, and indexing throughput rose to 6,000 docs/second by running a number of clients all sending documents to the cluster at once. I should emphasize that using the server.add(doc) method is temporary; it will NOT be required once the next version of LW (after 2.1.1) is released, and if you're a pure SolrCloud user it isn't required on a trunk (or BETA) build.
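The "start enough clients" part can be sketched with an executor pool. Again, this is a self-contained illustration under stated assumptions: the hypothetical `DocSink` stub replaces real SolrJ server objects, and each worker thread plays the role of one indexing client issuing single-document adds.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexing {
    // Hypothetical thread-safe stand-in for a SolrJ server; real clients
    // would each call server.add(doc) against the SolrCloud cluster.
    static class DocSink {
        final AtomicInteger added = new AtomicInteger();
        void add(String id) { added.incrementAndGet(); }
    }

    // Run several "clients" concurrently, each adding one document per call.
    static int runClients(DocSink server, int clients, int docsPerClient) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            final int clientId = c;
            pool.execute(() -> {
                for (int i = 0; i < docsPerClient; i++) {
                    server.add(clientId + "-" + i); // one doc per add() call
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return server.added.get();
    }

    public static void main(String[] args) {
        System.out.println("added " + runClients(new DocSink(), 4, 1000));
    }
}
```

In the real setup these were separate client processes rather than threads in one JVM, but the throughput idea is the same: single-document adds, scaled out by running more senders in parallel.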