A very frequent question is what should be monitored on a Solr installation. First off, it is highly recommended that you use a monitoring tool such as Nagios, Zabbix, etc., because analysis of the raw JMX values over time will tell you much more than the numbers themselves. This article assumes that you can do that kind of analysis on the data. There are several layers you need to pay attention to: hardware resources, performance benchmarks, the Java virtual machine, and Lucene/Solr. A second frequent question is at what limits you should begin to worry. This is generally very customer specific: the expected performance of one customer's queries may be very different from another's. There are, however, some basic guidelines that we will cover as we go through the list.
Hardware & Resources: This is just knowing what is going on with your overall system resources.
- CPU Idle: This tells you whether you have run out of CPU resources at a given moment. It is best to create some averaged numbers, like avg(1 min). It is also good to monitor each individual CPU core, so you catch situations where everything is paused on one core (as would happen if you used the wrong garbage collector).
- Load Average: This tells you whether processes are queuing up to be processed. If it is higher than the number of CPU cores you have, you are in the processing equivalent of a traffic jam. That makes load average a core statistic; it may not point to 'the' problem, but it points at 'a' problem.
- Swap: You do not want to be in swap. If you are using swap, figure out a better configuration or add memory.
- Page In/Out: Shows whether you are actively paging in/out to swap. There may be a small amount of swap used with little page in/out activity; however, using swap is generally a sign that you need more memory, and if you care about performance you want to stay out of swap completely.
- Free Memory: This is actually a very important statistic. If it's low, it essentially means you are no longer going to add to the size of your disk cache as queries are run. If your disk cache is low and you are also low on free memory, most of your index accesses are going to come from disk (read: slow). You should always be paying attention to both the size of free memory and your disk cache in comparison to the overall index size.
- I/O Wait: Excess disk I/O is slow. This statistic tells you how much time your CPUs spend waiting on disk access.
- Free Disk Space: You of course always want to be alerted if you are running low on disk space for your indexing partition.
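As a concrete illustration of the load-average rule above, here is a minimal sketch in Python (assuming a Unix-like host where `os.getloadavg()` is available):

```python
import os

def load_is_saturated(window: int = 0) -> bool:
    """True when the load average exceeds the core count.

    window: 0 = 1-minute average, 1 = 5-minute, 2 = 15-minute.
    """
    load = os.getloadavg()[window]
    cores = os.cpu_count() or 1  # fall back to 1 if core count is unknown
    return load > cores

print(load_is_saturated())
```

A monitoring tool would run this check on every poll and alert when it stays true across several consecutive samples, rather than on a single spike.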
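The free-memory/disk-cache comparison can also be automated. Below is a hypothetical sketch that parses `/proc/meminfo`-style text (Linux format) and compares free plus cached memory against the index size; the sample values and the 12 GB index size are made up for illustration:

```python
def cache_headroom(meminfo_text: str, index_size_kb: int) -> float:
    """(MemFree + Cached) as a fraction of the index size.

    A value well below 1.0 means much of the index cannot be served
    from the OS disk cache, so reads will hit the disk.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    return (fields["MemFree"] + fields["Cached"]) / index_size_kb

# Made-up sample in /proc/meminfo format:
sample = """MemTotal:       16336132 kB
MemFree:          512004 kB
Cached:          9201344 kB"""
print(round(cache_headroom(sample, index_size_kb=12_000_000), 2))  # → 0.81
```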
Performance Benchmarks: You will get these benchmarks via JMX beans, found under the namespace: "solr/<corename>"
- Average Response Times: Monitoring tools can do very useful calculations by doing deltas on previous values. So you can say, 'get the average response time over the last five minutes'. You can get this information on every requestHandler you use. You use the Runtime JMX bean (specified below) as the divisor to get performance over time periods. The JMX bean for the standard request handler is found at (use jconsole to browse): "solr/<corename>:type=standard,id=org.apache.solr.handler.StandardRequestHandler", totalTime, requests
- nth Percentile Response Times: This lets you know whether, say, your 95th-percentile response times are actually slow. You should care less about averages than about whether there is a problem at the long end of the tail, and how far that tail deviates from normal behavior. 75th, 95th, 99th and 999th percentiles are available. JMX example: "solr/collection1:type=standard,id=org.apache.solr.handler.component.SearchHandler", 75thPcRequestTime (the same format is used for all percentiles).
- Average QPS: Knowing your peak QPS over, say, a 5-minute span shows how well you are handling query load and is vital when benchmarking toward a configuration that can handle that load (assuming your testing is valid, which is complex to get right because of caches). JMX bean found at: "solr/collection1:type=standard,id=org.apache.solr.handler.StandardRequestHandler", requests, divided by the delta of the java.lang Runtime Uptime (seen below).
- Average Cache Hit Ratios: This can point out potential problems in how you are warming searcher caches. Again, you can use this with the Runtime JMX bean to find changes over a time period. JMX beans are found at: "solr/<corename>:type=<cachetype>,id=org.apache.solr.search.<cache impl>", hitratio / cumulative hitratio
- External Query "HTTP Ping": It is vital to send a heartbeat query to Solr. Logs tell you how long Solr took to respond to queries, but not how long a request may have waited before Solr started responding.
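The delta technique used for the metrics above can be sketched in a few lines. This is a minimal sketch, assuming you have already polled the handler's cumulative totalTime and requests counters, the Runtime Uptime, and a cache's cumulative hit/lookup counters at two points in time; the snapshot dicts and all numbers below are made up for illustration.

```python
def avg_response_ms(prev: dict, cur: dict) -> float:
    """Average response time (ms) over the window between two JMX polls."""
    dreq = cur["requests"] - prev["requests"]
    if dreq == 0:
        return 0.0
    return (cur["totalTime"] - prev["totalTime"]) / dreq

def avg_qps(prev: dict, cur: dict) -> float:
    """Average queries per second over the window, using Uptime (ms) as divisor."""
    dms = cur["uptime"] - prev["uptime"]
    if dms == 0:
        return 0.0
    return (cur["requests"] - prev["requests"]) / (dms / 1000.0)

def window_hit_ratio(prev: dict, cur: dict) -> float:
    """Cache hit ratio over the window, from cumulative hit/lookup counters."""
    lookups = cur["cumulative_lookups"] - prev["cumulative_lookups"]
    if lookups == 0:
        return 0.0
    return (cur["cumulative_hits"] - prev["cumulative_hits"]) / lookups

# Two polls five minutes (300,000 ms) apart; all numbers are made up:
prev = {"totalTime": 1_200_000, "requests": 50_000, "uptime": 86_400_000,
        "cumulative_hits": 9_000, "cumulative_lookups": 10_000}
cur = {"totalTime": 1_530_000, "requests": 61_000, "uptime": 86_700_000,
       "cumulative_hits": 9_600, "cumulative_lookups": 11_000}

print(avg_response_ms(prev, cur))    # → 30.0
print(round(avg_qps(prev, cur), 2))  # → 36.67
print(window_hit_ratio(prev, cur))   # → 0.6
```

Most monitoring tools implement this delta logic for you (often called a "counter" or "rate" metric type), so in practice you only need to feed them the raw cumulative values.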
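A heartbeat can be as simple as timing a full HTTP round trip. Here is a minimal sketch; the URL and the /admin/ping handler path in the comment are assumptions about your deployment:

```python
import time
import urllib.request

def ping_ms(url: str, timeout: float = 5.0) -> float:
    """Time a full HTTP round trip in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # drain the body so the full response is timed
    return (time.monotonic() - start) * 1000.0

# Example (assumes a default ping handler mapping):
# print(ping_ms("http://localhost:8983/solr/collection1/admin/ping"))
```

Because this measures the round trip from outside the JVM, it captures queueing and network time that Solr's own request logs cannot see.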
Java Virtual Machine: These statistics are found via JMX beans as well.
- Runtime: This is an important piece of data that helps you establish time frames and derive statistics like performance over the last five minutes, because you can use the delta of Runtime over the last polling period as a divisor in other statistics. Its basic use is to tell you how long your server has been running. This JMX bean is found at: "java.lang:type=Runtime", Uptime
- Last Full Garbage Collection: You can get the total time your last full GC took. Your application is paused during a full GC, so knowing how long it pauses is an important piece of data. What duration you consider acceptable is up to you, but generally anything over a few seconds indicates a problem. JMX bean: "java.lang:type=GarbageCollector,name=<GC Name>", LastGcInfo.duration
- Full GC as % of Runtime: This lets you know whether you are spending too much time in full garbage collections. A lot of time spent in full GC is indicative of a heap that is too small.
- Total GC Time as % of Runtime: An indexing server will tend to spend much more time in GC because of all the new data coming at it. When tuning your heap and generation sizes, it is important to know how much time is being spent in GC. Too small and you will GC more frequently; too large and you will have longer pauses.
- Total Threads: This is important to know because each thread takes up memory space and may overload your system. Sometimes OOM errors occur not because your system needs more memory, but because it is overloaded with threads; this can be a warning sign before your system tips over. Ultimately, having too many threads open generally indicates a performance bottleneck or a lack of hardware resources. JMX bean: "java.lang:type=Threading", ThreadCount
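The two GC-percentage statistics above are simple ratios of cumulative JMX counters. A sketch, assuming you sum CollectionTime across the GarbageCollector beans (or take only the old-generation collector for the full-GC variant) and read Uptime from the Runtime bean; the numbers are made up:

```python
def gc_pct_of_runtime(gc_time_ms: int, uptime_ms: int) -> float:
    """Total GC time as a percentage of JVM uptime."""
    return 100.0 * gc_time_ms / uptime_ms

# 90 seconds of GC over a 1-hour uptime:
print(gc_pct_of_runtime(gc_time_ms=90_000, uptime_ms=3_600_000))  # → 2.5
```

As with the other cumulative counters, computing the ratio over deltas between polls (rather than since JVM start) gives a more useful "recent" view.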
Lucene/Solr: These statistics are also found via JMX.
- Autowarming Times: This tells you how long it takes a new searcher and its caches to warm. You can get autowarming times on both the caches and the searcher itself; the searcher's warmupTime is reported separately from the caches'. There is always a tradeoff between autowarming time and having prewarmed caches ready for search. Searcher JMX bean: "solr/collection1:type=searcher,id=org.apache.solr.search.SolrIndexSearcher", warmupTime. Cache beans: "solr/collection1:type=documentCache,id=org.apache.solr.search.LRUCache", "solr/collection1:type=fieldValueCache,id=org.apache.solr.search.FastLRUCache", "solr/collection1:type=filterCache,id=org.apache.solr.search.FastLRUCache", "solr/collection1:type=queryResultCache,id=org.apache.solr.search.LRUCache"