Optimize indexing and commit strategies for large Solr collections – Lucidworks

Goal

Provide best practices and recommendations for efficiently indexing large Solr collections and minimizing indexing errors such as timeouts and out-of-memory (OOM) exceptions. This article helps users choose appropriate commit strategies and batch sizes for high-volume indexing.

Environment

Apache Solr 8.11.2 and above
Applicable to large production collections (hundreds of millions to billions of documents; multi-terabyte indexes)
Assumes indexing is performed via SolrJ, HTTP APIs, or other batch processing frameworks

Guide

Index in smaller, manageable batches

Splitting a large indexing job into smaller batches (for example, loading one day or one week of data at a time rather than one quarter or one year) reduces the risk of memory and timeout errors.
Committing after each batch of 3–4 million documents is generally far easier for Solr to handle than indexing and committing very large groups (such as 70–100 million documents) at once.
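The batching approach above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names (batched, post_json, index_in_batches), the request_size of 1,000 documents, and the update URL are assumptions made for the example. Only the commit interval of roughly 3 million documents comes from this article. Documents are sent to the JSON update endpoint in small requests, and a hard commit is issued with the JSON {"commit": {}} command once enough documents have accumulated.

```python
import json
import urllib.request
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, object]


def batched(docs: Iterable[Doc], batch_size: int) -> Iterator[List[Doc]]:
    """Yield successive lists of at most batch_size documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def post_json(url: str, payload: object) -> None:
    """POST a JSON payload to Solr; urllib raises HTTPError on 4xx/5xx."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        resp.read()


def index_in_batches(
    docs: Iterable[Doc],
    update_url: str,
    request_size: int = 1000,           # hypothetical per-request size
    docs_per_commit: int = 3_000_000,   # commit interval suggested above
    post: Callable[[str, object], None] = post_json,
) -> int:
    """Send docs in small requests and hard commit every docs_per_commit docs.

    Returns the number of commits issued.
    """
    commits = 0
    pending = 0
    for batch in batched(docs, request_size):
        post(update_url, batch)
        pending += len(batch)
        if pending >= docs_per_commit:
            post(update_url, {"commit": {}})
            commits += 1
            pending = 0
    if pending:
        # Commit whatever is left after the final partial batch.
        post(update_url, {"commit": {}})
        commits += 1
    return commits
```

A call might look like index_in_batches(docs, "http://<solr-host>:<port>/solr/<collection>/update"). The post parameter is injectable so the loop can be exercised without a running Solr instance.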

Use frequent hard commits, but monitor impact

Frequent hard commits ensure data is regularly flushed to disk, which lowers the risk of losing uncommitted documents in the event of a crash.
A hard commit opens a new searcher unless it is issued with openSearcher=false. Opening a searcher temporarily increases memory usage and can impact performance. Ensure the JVM heap is sized appropriately for your indexing workload.
Monitor system resources and commit times to assess the impact of more frequent commits. If necessary, adjust batch sizes and commit frequency to avoid bottlenecks.
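One way to watch commit times is to time each commit call and compare recent commits against earlier ones. The sketch below is illustrative only: timed_commit and commits_are_slowing are hypothetical helpers, and the window of 5 commits and the 2x threshold are arbitrary starting points rather than Solr recommendations.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def timed_commit(
    commit: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run the given commit callable and return its duration in seconds."""
    start = clock()
    commit()
    return clock() - start


def commits_are_slowing(
    durations: Sequence[float],
    window: int = 5,
    factor: float = 2.0,
) -> bool:
    """Return True if the mean of the last `window` commit durations
    exceeds `factor` times the mean of all earlier durations."""
    if len(durations) <= window:
        return False  # not enough history to compare against
    baseline = mean(durations[:-window])
    recent = mean(durations[-window:])
    return recent > factor * baseline
```

If commits_are_slowing starts returning True during a load, that is a signal to look at heap usage and consider smaller batches or less frequent commits.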

Optimize hardware and configuration

Use SSD storage for Solr data directories to improve read/write performance during indexing and commits.
Ensure sufficient CPU cores and RAM are available on Solr nodes. Indexing is both CPU- and memory-intensive.
Monitor JVM heap usage and adjust Solr’s JVM heap settings as needed to avoid out-of-memory errors during large indexing jobs.
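Heap usage can be polled during an indexing job through Solr's Metrics API. The sketch below assumes the JSON response nests the JVM metrics under metrics → solr.jvm with the keys memory.heap.used and memory.heap.max. Verify that shape against your Solr version before relying on it. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def heap_usage_ratio(metrics_response: Dict[str, Any]) -> float:
    """Return used/max heap from a parsed Metrics API response.

    Assumes the response shape {"metrics": {"solr.jvm": {...}}}.
    """
    jvm = metrics_response["metrics"]["solr.jvm"]
    used = float(jvm["memory.heap.used"])
    maximum = float(jvm["memory.heap.max"])
    if maximum <= 0:
        raise ValueError("heap max is not reported")
    return used / maximum


def fetch_heap_usage(base_url: str, timeout: float = 10.0) -> float:
    """Query one Solr node's Metrics API and return its heap usage ratio."""
    url = (
        f"{base_url.rstrip('/')}/solr/admin/metrics"
        "?group=jvm&prefix=memory.heap&wt=json"
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return heap_usage_ratio(json.load(resp))
```

A ratio that stays close to 1.0 while batches are being indexed suggests the heap is undersized for the workload or that batches are too large.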

Monitor and tune system performance

Use monitoring tools to track commit latency, CPU load, memory usage, and disk I/O during indexing.
Review Solr’s autoCommit and autoSoftCommit settings in your solrconfig.xml and tune them for your workload: autoCommit controls how often data is hard committed to disk, and autoSoftCommit controls how often newly indexed documents become visible to searches. For very large loads, you may want to disable autoCommit and perform manual commits at controlled intervals.
Increase the number of Solr shards if a single shard or a small number of shards cannot keep up with the indexing load. Sharding distributes indexing work across more nodes, which can improve throughput and reliability.
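As an alternative to editing solrconfig.xml by hand, the autoCommit and autoSoftCommit settings mentioned above can be changed through Solr's Config API with set-property commands. The sketch below builds one command per property and posts each to the collection's /config endpoint. The function names and the example values are assumptions; the property names (updateHandler.autoCommit.maxTime and so on) are the ones the Config API documents as editable.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def commit_property_commands(
    auto_commit_max_time_ms: Optional[int] = None,
    auto_commit_open_searcher: Optional[bool] = None,
    auto_soft_commit_max_time_ms: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build one Config API set-property command per setting provided."""
    props: Dict[str, Any] = {}
    if auto_commit_max_time_ms is not None:
        props["updateHandler.autoCommit.maxTime"] = auto_commit_max_time_ms
    if auto_commit_open_searcher is not None:
        props["updateHandler.autoCommit.openSearcher"] = auto_commit_open_searcher
    if auto_soft_commit_max_time_ms is not None:
        props["updateHandler.autoSoftCommit.maxTime"] = auto_soft_commit_max_time_ms
    return [{"set-property": {name: value}} for name, value in props.items()]


def apply_commands(
    base_url: str,
    collection: str,
    commands: List[Dict[str, Any]],
    timeout: float = 30.0,
) -> None:
    """POST each command to /solr/<collection>/config, one request each."""
    url = f"{base_url.rstrip('/')}/solr/{collection}/config"
    for command in commands:
        req = urllib.request.Request(
            url,
            data=json.dumps(command).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
```

In solrconfig.xml, a maxTime of -1 disables the time-based trigger. Confirm on a test collection that setting the same value through the Config API has the effect you expect before using it in production.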

Example: Issuing manual hard commits

When performing batch indexing, use Solr’s HTTP commit API after each batch is complete:

POST http://<solr-host>:<port>/solr/<collection>/update?commit=true
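The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. The function names and the timeout value are assumptions, and the optional openSearcher=false parameter is included for loads where the new documents do not need to be searchable immediately after each commit.

```python
import urllib.parse
import urllib.request


def commit_url(base_url: str, collection: str, open_searcher: bool = True) -> str:
    """Build the update URL that triggers a hard commit."""
    params = {"commit": "true"}
    if not open_searcher:
        # Flush to disk without opening a new searcher.
        params["openSearcher"] = "false"
    path = f"{base_url.rstrip('/')}/solr/{urllib.parse.quote(collection)}/update"
    return f"{path}?{urllib.parse.urlencode(params)}"


def hard_commit(
    base_url: str,
    collection: str,
    open_searcher: bool = True,
    timeout: float = 600.0,
) -> int:
    """POST the commit request and return the HTTP status code."""
    req = urllib.request.Request(
        commit_url(base_url, collection, open_searcher),
        data=b"",
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

For example, hard_commit("http://<solr-host>:<port>", "<collection>") would be called once after each batch finishes. A generous timeout is used because commits on large indexes can take a long time.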

Additional recommendations

Avoid extremely large batches that accumulate excessive uncommitted data in memory.
Test indexing with different batch sizes to find the optimal trade-off between throughput and stability for your environment.
Regularly review Solr node hardware and JVM settings so that they keep pace with data and workload growth.
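Testing different batch sizes can be as simple as sending the same sample of documents with each candidate size and comparing documents per second. The harness below is a sketch under stated assumptions: measure_throughput and compare_batch_sizes are hypothetical names, and send stands in for whatever function posts a batch to Solr in your environment.

```python
import time
from typing import Callable, Dict, Iterable, List, Sequence

Doc = Dict[str, object]


def measure_throughput(
    docs: Sequence[Doc],
    batch_size: int,
    send: Callable[[List[Doc]], None],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Send docs in batches of batch_size and return documents per second."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = clock()
    for i in range(0, len(docs), batch_size):
        send(list(docs[i:i + batch_size]))
    elapsed = clock() - start
    return len(docs) / elapsed if elapsed > 0 else float("inf")


def compare_batch_sizes(
    docs: Sequence[Doc],
    sizes: Iterable[int],
    send: Callable[[List[Doc]], None],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[int, float]:
    """Return a mapping of batch size to measured throughput."""
    return {size: measure_throughput(docs, size, send, clock) for size in sizes}
```

Run the comparison against a test collection, since each pass re-sends the same documents. Throughput alone does not capture stability, so check heap usage and error logs for each batch size as well.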

References

For further guidance, check the following articles:

Apache-Solr-indexing-performance-guide

SolrCloud Shards and Indexing