Resolve disk fill issues caused by large tlog files in Solr pods – Lucidworks

Issue

Solr pods in a Fusion Kubernetes deployment may reach 100% disk utilization when large transaction log (tlog) files accumulate. This can leave the pods unresponsive or destabilize the wider system.

Diagnosis

To determine if disk utilization is caused by tlog file accumulation in Solr pods:

Access the affected Solr pod using kubectl:

```
kubectl exec -it <solr-pod-name> -n <namespace> -- /bin/bash
```

Navigate to the data directory for the relevant Solr core, typically at:
```
/var/lib/solr/data/<collection_shard>_<replica>/data/tlog
```
Note: The exact path may differ based on your deployment; adjust as needed for your Solr installation.
List the contents of the tlog directory and check for unusually large tlog files:
```
ls -lh /var/lib/solr/data/<collection_shard>_<replica>/data/tlog
```
Review disk usage:
```
du -sh /var/lib/solr/data/<collection_shard>_<replica>/data/tlog
```
Note: High disk usage, especially from files named tlog.xxxxxxxxxxxxx, indicates that transaction logs are not being cleared as expected.
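
When a pod hosts several cores, it can be quicker to rank every tlog directory by size than to inspect them one at a time. The following is a minimal sketch, assuming the data root is /var/lib/solr/data as in the path above; the function name `tlog_usage` is hypothetical and not part of Solr or Fusion.

```shell
# Print the size (in KB) of every tlog directory under a Solr data root,
# largest first. The default root is an assumption; adjust it to your deployment.
tlog_usage() {
  local root="${1:-/var/lib/solr/data}"
  find "$root" -type d -name tlog -exec du -sk {} + 2>/dev/null | sort -rn
}
```

Inside the pod, `tlog_usage | head -n 5` would show the five largest tlog directories.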

Environment

Fusion 5.x and above, deployed in Kubernetes environments (GKE, AKS, EKS, or other supported platforms).
Note: Applies to any Fusion deployment using Solr as a search engine where disk-based persistence is used for Solr cores.

Cause

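Solr writes every update to a transaction log (tlog) so that updates not yet hard-committed to the index can be replayed after a crash. Under normal operation, each hard commit closes the current tlog and starts a new one, and Solr deletes older tlogs it no longer needs for recovery. However, large tlog files may persist due to: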

Uncommitted updates or delayed hard commits
Solr pod restarts or failures during write operations
Issues with Solr replication or leader election
Underlying storage latency or failures preventing tlog cleanup

Resolution

Follow these steps to safely resolve disk fill issues caused by large tlog files in Solr pods:

1. Verify tlog file persistence after commit

Issue a hard commit to the affected collection:

```
curl "http://<solr-pod-service>:8983/solr/<collection>/update?commit=true"
```
Note: Quote the URL so that the shell does not interpret the ? character.

Wait a few moments, then check if tlog files are reduced in size or removed.
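
Rather than checking by eye, the wait can be expressed as a short polling loop run inside the pod. This is a minimal sketch; the function name `wait_for_tlog_below`, the 5-second interval, and the threshold and timeout arguments are illustrative assumptions, not Solr or Fusion defaults.

```shell
# Poll a tlog directory until its size drops below a threshold (in KB)
# or the timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_for_tlog_below() {
  local dir="$1" limit_kb="$2" timeout="${3:-60}" elapsed=0 size_kb
  while :; do
    size_kb=$(du -sk "$dir" | cut -f1)
    if [ "$size_kb" -lt "$limit_kb" ]; then
      echo "tlog size ${size_kb} KB is below ${limit_kb} KB"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "tlog size still ${size_kb} KB after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
}
```

For example, `wait_for_tlog_below /var/lib/solr/data/<collection_shard>_<replica>/data/tlog 102400 120` would wait up to two minutes for the directory to fall below roughly 100 MB.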

2. If tlog files persist, verify data persistence

Confirm that recent data is successfully persisted and that the collection is fully operational.
Ensure no ongoing indexing or update operations are active for the collection.
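
One way to check that the collection is fully operational is to confirm that every shard and replica reports an active state in the Collections API CLUSTERSTATUS response. The sketch below assumes that response is piped in as JSON; the function name `all_states_active` is hypothetical, and the check is deliberately strict (an inactive shard left over from a split would also fail it).

```shell
# Read Collections API CLUSTERSTATUS JSON on stdin and fail if any shard or
# replica reports a state other than "active". Uses grep so that it does not
# depend on jq being installed in the pod.
#
# Example input source (placeholders as used elsewhere in this article):
#   curl -s "http://<solr-pod-service>:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=<collection>"
all_states_active() {
  local states
  states=$(grep -Eo '"state"[[:space:]]*:[[:space:]]*"[a-z_]+"' || true)
  if [ -z "$states" ]; then
    echo "no state fields found in input" >&2
    return 2
  fi
  # Any line that does not contain "active" is a shard or replica in another state.
  if printf '%s\n' "$states" | grep -qv '"active"'; then
    echo "at least one shard or replica is not active" >&2
    return 1
  fi
  echo "all shards and replicas are active"
}
```

This only covers replica health. Confirming that recently indexed documents are searchable still requires a query against the collection.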

3. Manually delete the large tlog file (if safe)

Only after verifying that all data is committed and available, remove the offending tlog file(s):
```
rm /var/lib/solr/data/<collection_shard>_<replica>/data/tlog/tlog.*
```
Note: Removing tlog files while there are uncommitted writes may result in data loss. Always confirm data safety before deletion.
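
Because the deletion cannot be undone, it may be worth listing exactly what will be removed first. The following is a minimal sketch of a dry-run wrapper around the rm command above; the function name `remove_tlogs` and its `--delete` argument are hypothetical, and the pattern assumes Solr's tlog file names begin with tlog. followed by a number.

```shell
# List the tlog files in a directory; delete them only when the second
# argument is "--delete". Without it, this is a dry run.
remove_tlogs() {
  local dir="$1" mode="${2:-}"
  # Match only regular files whose names start with "tlog." and a digit.
  find "$dir" -maxdepth 1 -type f -name 'tlog.[0-9]*' -exec ls -lh {} +
  if [ "$mode" = "--delete" ]; then
    find "$dir" -maxdepth 1 -type f -name 'tlog.[0-9]*' -exec rm -f {} +
    echo "deleted tlog files in $dir"
  else
    echo "dry run: pass --delete to remove the files listed above"
  fi
}
```

Run it once without `--delete`, review the list, then repeat with `--delete` only after the checks in step 2 have passed.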

4. Recycle the Solr pod

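Restart the affected Solr pod. Space held by deleted files is not freed until the Solr process closes them, so the restart both reclaims the disk space and returns the pod to a clean state: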
```
kubectl delete pod <solr-pod-name> -n <namespace>
```
Kubernetes will automatically recreate the pod.
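
The delete-and-wait sequence can be scripted so that later checks only run once the replacement pod is Ready. This is a minimal sketch, assuming the Solr pod is managed by a StatefulSet so that the new pod reuses the same name; the function name `recycle_pod`, the retry count, and the timeouts are illustrative.

```shell
# Delete a pod, then wait for its replacement to report Ready.
# Assumes the pod is managed by a StatefulSet, so the new pod reuses the name.
recycle_pod() {
  local pod="$1" ns="$2" attempt=0
  kubectl delete pod "$pod" -n "$ns" || return 1
  # The replacement may not exist yet right after deletion, so retry the wait.
  until kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout=60s; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge 10 ]; then
      echo "pod $pod not Ready after $attempt attempts" >&2
      return 1
    fi
    sleep 5
  done
}
```

It would be called as `recycle_pod <solr-pod-name> <namespace>`.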

5. Monitor pod status and disk utilization

After the pod is running, confirm that disk usage is within normal parameters and that Solr is serving requests as expected.
```
kubectl get pods -n <namespace>
kubectl exec -it <solr-pod-name> -n <namespace> -- df -h
```
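
To turn the df output into a pass/fail check, the used-space percentage can be compared against a threshold. The sketch below is meant to run inside the pod; the function name `check_disk` and the default threshold of 80% are assumptions chosen for illustration.

```shell
# Print the used-space percentage of the filesystem holding a path, and
# return non-zero when it is at or above a threshold (default 80, illustrative).
check_disk() {
  local path="$1" limit="${2:-80}" used
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%".
  used=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  echo "$path is ${used}% full"
  [ "$used" -lt "$limit" ]
}
```

For example, `check_disk /var/lib/solr/data 80` succeeds only while the volume is under 80% full.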

Note: If tlog accumulation recurs, investigate potential issues with Solr commit configuration, storage performance, or cluster stability.
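
As a starting point for reviewing commit configuration, the collection's effective update handler settings (including autoCommit and autoSoftCommit) can be read through the Solr Config API. This is a minimal sketch; the function name `show_commit_settings` is hypothetical, and the exact fields returned may vary by Solr version.

```shell
# Fetch the updateHandler section of a collection's effective configuration
# through the Solr Config API. The base URL and collection are caller-supplied.
show_commit_settings() {
  local base_url="$1" collection="$2"
  curl -s "${base_url}/solr/${collection}/config/updateHandler"
}
```

It would be called as `show_commit_settings "http://<solr-pod-service>:8983" <collection>`. A long or disabled autoCommit interval in the output is one possible reason for tlogs growing between hard commits.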