Issue
Solr pods in a Fusion Kubernetes deployment may experience 100% disk utilization due to the accumulation of large transaction log (tlog) files. This can result in pods becoming unresponsive or other system instability.
Diagnosis
To determine if disk utilization is caused by tlog file accumulation in Solr pods:
Access the affected Solr pod using kubectl:
kubectl exec -it <solr-pod-name> -n <namespace> -- /bin/bash
Navigate to the data directory for the relevant Solr core, typically at:
/var/lib/solr/data/<collection_shard>_<replica>/data/tlog
Note: The exact path may differ based on your deployment; adjust as needed for your Solr installation.
List the contents of the tlog directory and check for unusually large tlog files:
ls -lh /var/lib/solr/data/<collection_shard>_<replica>/data/tlog
Review disk usage:
du -sh /var/lib/solr/data/<collection_shard>_<replica>/data/tlog
Note: High disk usage, especially from files named tlog.xxxxxxxxxxxxx, indicates that transaction logs are not being cleared as expected.
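The size check above can be narrowed to only the files that matter. The following is a minimal sketch, not part of Fusion or Solr: the function name list_large_tlogs and the 100 MiB default threshold are assumptions chosen for illustration, and the script is meant to be run inside the Solr pod.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Solr or Fusion tool): print tlog files in a
# directory that are at least a given size, largest first.
# Usage: list_large_tlogs <tlog-dir> [min-size-in-MiB]
list_large_tlogs() {
  local dir="$1"
  # Assumed default threshold of 100 MiB; tune for your workload.
  local min_bytes=$(( ${2:-100} * 1024 * 1024 ))
  local f size
  for f in "$dir"/tlog.*; do
    [ -f "$f" ] || continue              # skip when the glob matches nothing
    size=$(( $(wc -c < "$f") ))          # file size in bytes
    if [ "$size" -ge "$min_bytes" ]; then
      printf '%s\t%s\n' "$size" "$f"     # "<bytes><TAB><path>"
    fi
  done | sort -rn                        # largest file first
}
```

For example, list_large_tlogs /var/lib/solr/data/<collection_shard>_<replica>/data/tlog 500 would print only tlog files of 500 MiB or more, using the same path placeholder as the commands above.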
Environment
Fusion 5.x and above, deployed in Kubernetes environments (GKE, AKS, EKS, or other supported platforms).
Note: Applies to any Fusion deployment using Solr as a search engine where disk-based persistence is used for Solr cores.
Cause
Solr uses transaction logs (tlogs) to record updates that have not yet been hard-committed to the index on disk. Under normal operation, a hard commit rolls over to a new tlog file, and older tlog files that are no longer needed are deleted. However, large tlog files may persist due to:
Uncommitted updates or delayed hard commits
Solr pod restarts or failures during write operations
Issues with Solr replication or leader election
Underlying storage latency or failures preventing tlog cleanup
Resolution
Follow these steps to safely resolve disk fill issues caused by large tlog files in Solr pods:
1. Verify tlog file persistence after commit
Issue a hard commit to the affected collection:
curl "http://<solr-pod-service>:8983/solr/<collection>/update?commit=true"
Wait a few moments, then check if tlog files are reduced in size or removed.
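The commit-then-check sequence can be scripted so that the before and after sizes are easy to compare. This is a sketch under stated assumptions: the function names tlog_dir_kib and commit_and_compare and the 30-second default wait are illustrative, and it assumes curl is available where the script runs and that the tlog directory is readable from there (for example, inside the Solr pod).

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default wait time are illustrative.

# Print the size of a directory in KiB.
tlog_dir_kib() {
  du -sk "$1" | awk '{ print $1 }'
}

# Usage: commit_and_compare <solr-base-url> <collection> <tlog-dir> [wait-seconds]
commit_and_compare() {
  local solr_url="$1" collection="$2" tlog_dir="$3" wait_s="${4:-30}"
  local before after
  before=$(tlog_dir_kib "$tlog_dir")
  # -f makes curl exit non-zero on an HTTP error, so a failed commit is not ignored.
  curl -sS -f "${solr_url}/solr/${collection}/update?commit=true" > /dev/null || return 1
  sleep "$wait_s"                        # give Solr time to roll over and clean up
  after=$(tlog_dir_kib "$tlog_dir")
  echo "tlog dir: ${before} KiB before commit, ${after} KiB after"
}
```

If the reported size does not drop after the commit, continue with the next step rather than repeating the commit.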
2. If tlog files persist, verify data persistence
Confirm that recent data is successfully persisted and that the collection is fully operational.
Ensure no ongoing indexing or update operations are active for the collection.
3. Manually delete the large tlog file (if safe)
Only after verifying that all data is committed and available, remove the offending tlog file(s):
rm /var/lib/solr/data/<collection_shard>_<replica>/data/tlog/tlog.*
Note: Removing tlog files while there are uncommitted writes may result in data loss. Always confirm data safety before deletion.
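Because the deletion cannot be undone, it may help to preview exactly which files the glob matches before removing anything. The sketch below is one possible approach, not an official procedure: remove_tlogs and its --delete flag are hypothetical names, and the function defaults to a dry run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list tlog files by default, delete only on request.
# Usage: remove_tlogs <tlog-dir> [--delete]
remove_tlogs() {
  local dir="$1" mode="${2:-dry-run}"
  local f
  for f in "$dir"/tlog.*; do
    [ -f "$f" ] || continue              # skip when the glob matches nothing
    if [ "$mode" = "--delete" ]; then
      rm -- "$f" && echo "removed $f"
    else
      echo "would remove $f"             # dry run: nothing is deleted
    fi
  done
}
```

Run it once without --delete, review the list, and pass --delete only after confirming that all data is committed.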
4. Recycle the Solr pod
Restart the affected Solr pod to reclaim disk space and ensure proper pod operation:
kubectl delete pod <solr-pod-name> -n <namespace>
Kubernetes will automatically recreate the pod.
5. Monitor pod status and disk utilization
After the pod is running, confirm that disk usage is within normal parameters and that Solr is serving requests as expected.
kubectl get pods -n <namespace>
kubectl exec -it <solr-pod-name> -n <namespace> -- df -h
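For repeated checks, the df output can be reduced to a single pass/fail result. This is a minimal sketch assuming POSIX df -P output and an 80% warning limit; disk_use_pct and check_disk are illustrative names, not existing tools, and the script is intended to run inside the pod against the Solr data mount.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the 80% default limit are assumptions.

# Print the used-capacity percentage (digits only) of the filesystem holding <path>.
disk_use_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Usage: check_disk <path> [limit-percent]
# Returns 0 when usage is below the limit, 1 when at or above it, 2 if usage is unreadable.
check_disk() {
  local path="$1" limit="${2:-80}" used
  used=$(disk_use_pct "$path")
  case "$used" in
    ''|*[!0-9]*) echo "ERROR: could not read usage for ${path}"; return 2 ;;
  esac
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: ${path} is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: ${path} is ${used}% full"
}
```

For example, check_disk /var/lib/solr/data 80 prints a warning and returns a non-zero status once the volume reaches 80% usage, which makes it easy to call from a periodic check.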
Note: If tlog accumulation recurs, investigate potential issues with Solr commit configuration, storage performance, or cluster stability.