Issue
After a system or JVM-level crash (such as an out-of-memory event), Solr may fail to restart due to corrupted index files. This may manifest in the logs as Lucene-related CorruptIndexException errors, such as:
Caused by: org.apache.lucene.index.CorruptIndexException: length should be 5377813198 bytes, but is 5378845390 instead
(resource=MMapIndexInput(path="/path/to/index/_XXXX.cfs"))
In this scenario, corrupted replicas will not start until the index files are restored or resynchronized.
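When triaging several nodes, it can help to pull the affected file path and the size discrepancy out of the log automatically. The following is a minimal sketch that handles only the length-mismatch message shown above; the function name and regular expression are illustrative assumptions, and other CorruptIndexException variants use different wording.

```python
import re

# Matches only the length-mismatch form of the message shown above.
# Other CorruptIndexException variants (for example checksum failures)
# are worded differently and will not match.
LENGTH_MISMATCH = re.compile(
    r'CorruptIndexException: length should be (?P<expected>\d+) bytes, '
    r'but is (?P<actual>\d+) instead\s*'
    r'\(resource=\w+\(path="(?P<path>[^"]+)"\)\)'
)


def parse_length_mismatch(log_text):
    """Return (expected_bytes, actual_bytes, path) for the first match, or None."""
    match = LENGTH_MISMATCH.search(log_text)
    if match is None:
        return None
    return int(match["expected"]), int(match["actual"]), match["path"]


sample = (
    'Caused by: org.apache.lucene.index.CorruptIndexException: '
    'length should be 5377813198 bytes, but is 5378845390 instead\n'
    '(resource=MMapIndexInput(path="/path/to/index/_XXXX.cfs"))'
)
expected, actual, path = parse_length_mismatch(sample)
print(path, actual - expected)
```

The path identifies which core's index directory is affected, which is the directory to target in the Resolution steps.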
Diagnosis
To confirm Solr index corruption:
Review the Solr server logs for CorruptIndexException entries
Attempt to restart the affected Solr node and note if startup fails consistently
Run bin/solr indextool (if available) with the --check flag to validate the index
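If the tool above is not available, Lucene itself ships an index checker, org.apache.lucene.index.CheckIndex, in the lucene-core JAR. The sketch below shows one way to invoke it in read-only mode from a script. The JAR location and index directory are assumptions that depend on the installation, the helper names are hypothetical, and some indexes may need additional Lucene JARs on the classpath.

```python
import subprocess

CHECK_INDEX_CLASS = "org.apache.lucene.index.CheckIndex"


def build_check_index_command(lucene_core_jar, index_dir):
    """Build a read-only CheckIndex invocation (no repair options are passed)."""
    return ["java", "-cp", lucene_core_jar, CHECK_INDEX_CLASS, index_dir]


def run_check_index(lucene_core_jar, index_dir):
    """Run CheckIndex and return (exit_code, combined_output)."""
    result = subprocess.run(
        build_check_index_command(lucene_core_jar, index_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode, result.stdout


# Example call (both paths are placeholders for this installation):
# code, report = run_check_index("/path/to/lucene-core.jar", "/path/to/index")
```

It is generally safest to run the check while the Solr node is stopped, so that nothing is writing to the index during the scan.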
Additional logs that may help:
OS-level crash logs (e.g., /var/log/messages, dmesg) for OOM or unclean shutdowns
Disk I/O errors or file system issues near the time of failure
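Scanning the OS logs for OOM activity can be scripted. The sketch below filters log lines that mention both an OOM event and the Java process. The marker strings are assumptions based on common Linux kernel messages ("invoked oom-killer", "Out of memory"), and the exact wording varies by kernel version.

```python
import re

# Assumed markers; adjust to match the wording used by the local kernel.
OOM_MARKERS = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(log_lines, process_name="java"):
    """Return lines that mention an OOM event and the given process name."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if OOM_MARKERS.search(line) and process_name in line
    ]


# Example call (reading the file usually requires elevated privileges):
# with open("/var/log/messages", errors="replace") as handle:
#     for hit in find_oom_lines(handle):
#         print(hit)
```

Comparing the timestamps of any hits against the first CorruptIndexException in the Solr log helps confirm whether the OOM event preceded the corruption.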
Environment
Solr 8.x
Cause
Solr index corruption typically results from an unexpected or uncontrolled shutdown of the Solr process or the host system. Common causes include:
Java heap exhaustion triggering an out-of-memory (OOM) kill of the JVM
Forced OS shutdown or crash without graceful termination of Solr
Insufficient disk space during index writing operations
Less commonly, disk write failures or file system corruption
Writes to Lucene index files are not atomic at the file level: if the process is terminated mid-write, a segment file can be left incomplete or inconsistent, which Lucene then reports as corruption on the next startup.
Resolution
To restore the affected Solr nodes and reduce future risk:
Restore from a healthy replica
Identify the healthy replica of the corrupted shard
On the affected node, remove the corrupted index directory (e.g., solr/data/<collection>/<shard>/data/index*)
Restart Solr to trigger replica recovery from the healthy node
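In a SolrCloud deployment, one way to identify a healthy replica is to inspect the output of the Collections API CLUSTERSTATUS action. The sketch below assumes the JSON response has already been fetched and parsed into a dictionary. The function name and the sample data are hypothetical, and the sample is trimmed to the fields the function reads.

```python
def healthy_replicas(cluster_status, collection, shard):
    """List (core_node, node_name, is_leader) for active replicas on live nodes."""
    cluster = cluster_status["cluster"]
    # A replica can still be listed as "active" after its node has gone away,
    # so the node is also checked against live_nodes.
    live_nodes = set(cluster.get("live_nodes", []))
    replicas = cluster["collections"][collection]["shards"][shard]["replicas"]
    healthy = []
    for core_node, info in replicas.items():
        if info.get("state") == "active" and info.get("node_name") in live_nodes:
            healthy.append((core_node, info["node_name"], info.get("leader") == "true"))
    return healthy


# Hypothetical, trimmed CLUSTERSTATUS response used only for illustration.
sample_status = {
    "cluster": {
        "live_nodes": ["solr1:8983_solr"],
        "collections": {"mycollection": {"shards": {"shard1": {"replicas": {
            "core_node1": {"state": "active", "node_name": "solr1:8983_solr", "leader": "true"},
            "core_node2": {"state": "down", "node_name": "solr2:8983_solr"},
        }}}}},
    }
}
print(healthy_replicas(sample_status, "mycollection", "shard1"))
```

If the function returns an empty list for a shard, no healthy replica is available to recover from, and the index would need to be restored from a backup instead.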
Prevent future corruption
Ensure the host has sufficient memory and is sized appropriately for Solr workloads
Monitor and alert on disk space utilization
Consider implementing process-level monitoring to detect and address OOM conditions
Where possible, gracefully stop Solr before any planned reboots or maintenance windows
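Disk space alerting can start as a simple scheduled check. The following minimal sketch uses Python's standard shutil.disk_usage. The 80 percent threshold and the function names are assumptions to adapt to the environment, and the path should point at the volume that holds the Solr index data.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_low(path, max_used_percent=80.0):
    """Return True when usage is at or above the alert threshold."""
    return disk_usage_percent(path) >= max_used_percent


if is_disk_low("."):
    print("WARNING: disk usage is above the alert threshold")
```

Index merges temporarily need extra free space on top of the current index size, so a conservative threshold is usually the safer choice.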
Also consider snapshot-based backups of Solr indexes to reduce recovery time after corruption events.
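One possible backup mechanism in SolrCloud is the Collections API BACKUP action. The sketch below only builds the request URL. The base URL, collection name, backup name, and location are placeholder assumptions, and the location typically needs to be on storage that every Solr node can reach.

```python
from urllib.parse import urlencode


def backup_url(solr_base_url, collection, backup_name, location):
    """Build a Collections API BACKUP request URL for the given collection."""
    params = urlencode({
        "action": "BACKUP",
        "name": backup_name,
        "collection": collection,
        "location": location,
    })
    return f"{solr_base_url.rstrip('/')}/admin/collections?{params}"


# All four values below are placeholders.
print(backup_url("http://localhost:8983/solr", "mycollection", "nightly", "/mnt/backups"))
```

The resulting URL can be requested from a scheduler such as cron. Whether this approach meets the recovery-time goal depends on index size and storage throughput.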