This article details the most common areas to check when Solr frequently runs out of memory. The causes can be diverse; here we list down the most common ones:
1. Solr Caches
2. Sorting/Faceting/Grouping on non-DocValued fields; ineffective use of the FieldCache
3. Insufficient heap memory allocated to the Solr JVMs
4. Memory leaks and other miscellaneous factors
Solr Caches - QueryResultCache, DocumentCache, FilterCache:
Refer to the official Solr cwiki page, Solr-Caches, to understand what the caches are and how to configure them in your solrconfig.xml. They are a powerful tool, but only when you understand how they fit your use-case on your current environment/infrastructure. Otherwise they can claim a big chunk of the heap memory allocated to Solr and eventually cause an out-of-memory error even when we aren't doing any heavy operations (indexing/searching/analytics) whatsoever.
Let's see a sample solr-cache configuration:
<documentCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
'autowarmCount' is explained wonderfully in Solr-Caches: whenever a new searcher is opened, the stated number of entries from the old searcher's cache are replayed into the new one. With today's demand for Near-Real-Time (NRT) search, commits open a new searcher at relatively short intervals. If autowarmCount is a sizeable number (128 in this example), those entries are replayed again and again, and a part of the allocated heap is already filled up before the new searcher serves a single query.
'autowarmCount' has no significance whatsoever for documentCache, and it doesn't matter what number you specify for it. One thing to emphasize: by and large, documentCache consumes heap in proportion to the size of the stored fields of the cached documents. This holds true no matter how many documents are in the index.
queryResultCache entries are actually usually quite small, and we should not be bothered unless the cache is outrageously sized. The rough size of an entry is the size of the query in bytes plus N integers, where N comes from queryResultWindowSize in solrconfig.xml (default 20).
For both of the above caches, adding a zillion more documents to the index doesn't change the JVM memory requirements.
filterCache, however, is sensitive to the maximum number of documents the index contains. Each entry is keyed by the "fq" clause, and the result-set is a bitmap with one bit per document, so an entry costs roughly (maximum documents in the index)/8 bytes plus the size of the query. If we have 64M docs, the bitmap is 8MB, and a 1K query is lost in the noise.
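To make that arithmetic concrete, here is a hypothetical filterCache configuration for a 64M-document index (the size value is illustrative, not a recommendation):

```xml
<!-- Hypothetical sizing for an index with maxDoc ~ 64,000,000.
     Each cached entry is a bitmap of maxDoc/8 ~ 8MB, so:
     512 entries x 8MB ~ 4GB of heap in the worst case.
     A cache this large can exhaust a modest heap all by itself. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
```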
So, should we use Solr-Caches? The answer is: it depends on the queries coming in. For a government/firm/private/low-content website, where the data is limited and users often hit a handful of specific queries, we can have the Solr-Caches enabled (what size? we will discuss shortly). For an ecommerce website, where we have millions or billions of documents and the incoming queries are dynamic and ever-different (yes, there can still be products which are searched very frequently), caching results/documents buys little; put size zero for every cache, or maybe a relatively small count (how much? we will discuss next).
Now, the rule of thumb: the size of each cache and its associated autowarmCount should be set only after a thorough analysis of the queries hitting the searchers over a decent timeframe of live traffic. What should you set the sizes to when you set Solr up for the first time? Start with zero!
Of the three caches mentioned, always be a touch more mindful while setting up filterCache, especially when our queries carry multiple 'fq' clauses.
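Putting the advice together, a conservative starting configuration could look like the sketch below (sizes are a starting point, not a recommendation; grow them only after analysing live traffic):

```xml
<!-- Start with everything at zero and increase sizes only when
     live-traffic analysis shows a useful hit ratio.
     autowarmCount="0" keeps new searchers cheap to open under
     frequent (NRT-style) commits. -->
<filterCache      class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"     size="0" initialSize="0" autowarmCount="0"/>
<documentCache    class="solr.LRUCache"     size="0" initialSize="0"/>
```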
Sorting/Faceting/Grouping on non-DocValued fields, ineffective utilization of FieldCache:
A better explanation than the one in the official Solr cwiki page, DocValues, cannot be written. It is highly recommended that we don't sort, facet, or group on non-DocValues fields. If we do, Solr falls back to the FieldCache, which caches values for all indexed documents and, if left unchecked, can end up consuming the whole heap. For example, faceting on multiValued fields this way builds a top-level (whole-index) structure, as opposed to the per-segment structures used for single-valued fields, resulting in inefficient near-real-time performance. And unlike the caches above, FieldCache is not configurable.
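As a quick illustration, enabling DocValues is a one-attribute change on the field definition in the schema (field names here are hypothetical):

```xml
<!-- schema.xml: hypothetical fields used for sorting and faceting.
     docValues="true" builds a column-oriented on-disk structure at
     index time, so sort/facet/group no longer un-invert the field
     into the FieldCache on the heap. -->
<field name="price"    type="pfloat" indexed="true" stored="true" docValues="true"/>
<field name="category" type="string" indexed="true" stored="true" docValues="true" multiValued="true"/>
```

Note that existing documents must be reindexed for a docValues change to take effect.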
There is one edge-case: when we have low traffic and a stable index (non-changing, non-growing), FieldCache can be more beneficial/faster than DocValues for the above-stated operations. Though again, we strictly recommend using DocValues for sorting, faceting, and grouping.
Insufficient Heap Memory allocated to Solr JVMs:
In other words, the heap was just not big enough! Suppose we have set up the Solr-Caches properly and are not sorting/faceting/grouping on non-DocValues fields, but still run out of memory now and then. We checked the other endpoints/configurations, and everything seems alright. We are indexing huge batches of documents consistently, and Solr is supposed to index everything smoothly; we request 300 rows in the result-set for a query, and it should return them. Red flag! We may have allocated just-not-enough memory to the Solr nodes, since the operations need working space of their own. The rule of thumb is to give the Solr node as much memory as possible without compromising the operating system's. It is best to fire up a separate machine for each Solr node and run no other application on it (yes, that is not possible every time, everywhere). If you have a machine with 16G of physical memory, don't allot the whole 16G, or even 15G or 14G, to Solr. Give the operating system some room to breathe so that you don't end up with unnecessary swapping of pages between primary and secondary memory. At Lucidworks, we have witnessed 8G of heap per node doing a decent job when the cluster has multiple collections, light-to-medium complex queries, light-to-medium indexing batches, and properly set-up Solr-Caches, though an optimal number can only be achieved by analysing the cluster on live traffic for a decent timeframe.
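For example, on a 16G machine the heap could be pinned to 8G via solr.in.sh (a sketch; the exact file location and the 8g figure depend on your install and workload):

```shell
# solr.in.sh -- give Solr an 8G heap and leave the rest to the OS,
# including the OS disk cache, which Lucene relies on heavily.
SOLR_HEAP="8g"
```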
Memory leaks and other miscellaneous factors:
Unintentional buggy code on the client side is one of the factors that can hold on to unnecessary heap memory. Such leaks are fairly difficult to locate, and after running out of ideas on the other factors which may cause OOM, there is nothing better than taking a heap dump of the Solr node and firing up an analyser to catch the culprit!
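To make sure a dump is available when the crash actually happens, the JVM can be told to write one automatically on OOM. In solr.in.sh this might look like the following (the flags are standard HotSpot options; the dump path is an assumption):

```shell
# solr.in.sh -- capture a heap dump automatically on OutOfMemoryError.
# /var/solr/dumps is a hypothetical path; the resulting .hprof file can
# be opened in an analyser such as Eclipse MAT.
SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"
```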
Don't miss out on reading The Seven Deadly Sins of Solr in the Lucidworks Blog section to avoid committing unforced errors while setting up Solr.
Please provide your feedback and suggestions on the above, and mention any significant reason(s) we have missed that can cause Solr to run out of memory; we can add them to the list. Cheers!