Issue
After a sequential (rolling) microservice upgrade of the Fusion platform, certain internal system collections (such as system_history and system_blobs) show a degraded status in the Solr cluster layout. Specifically, a replica (for example, on node solr-search-0) enters a persistent DOWN state. Other data collections across the cluster remain unaffected, but attempts to delete replicas from the degraded system shards, or to add replicas to them, either time out or end with the replica returning immediately to a failed state.
Diagnosis
The observed behavior differs depending on how the collection is configured:
- For multi-replica collections (such as system_blobs), a healthy leader continues servicing requests while the replica on the upgraded pod fails to recover.
- For collections running a single replica per shard (such as system_history), the loss of the active node leaves the shard without an available leader.
When a replacement replica is added to the shard, it fails to synchronize metadata. In addition, the Solr logs contain warnings about unrecognized metrics or snitch tags:
WARN [zkCallback-1-thread-1] o.a.s.c.c.ZkStateReader Received unsolicited snitch tag totaldisk from node NodeImpl
Further analysis shows that although the collection management interface reports partial success for manual deletion commands, the underlying cluster state remains stuck. The collection's configuration metadata stays registered in the ZooKeeper znodes, which causes every fresh initialization attempt to abort.
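One way to confirm that metadata is still registered after a deletion is to list the collection's znodes directly. The following is a minimal sketch, not a supported tool: it assumes the pod name (zookeeper-0), the zkCli.sh location, and the two znode paths used later in this article, and the function names are hypothetical. Add a namespace flag to kubectl if your deployment needs one.

```shell
#!/usr/bin/env bash
# Assumptions: pod name and zkCli.sh path are the ones used in this article.
ZK_POD="${ZK_POD:-zookeeper-0}"
ZK_CLI="${ZK_CLI:-/opt/zookeeper/bin/zkCli.sh}"

# Print the znode paths that hold metadata for one collection
# (the standard Solr path and the Fusion-specific path).
collection_znodes() {
  printf '/collections/%s\n/lwfusion/5.0/core/collections/%s\n' "$1" "$1"
}

# Run a one-shot "ls" against each path and show the raw client output.
# The output includes zkCli connection log lines; read the last lines.
inspect_collection_znodes() {
  local path
  for path in $(collection_znodes "$1"); do
    echo "== $path"
    kubectl exec "pod/$ZK_POD" -- "$ZK_CLI" ls "$path" 2>&1
  done
}
```

Calling `inspect_collection_znodes system_history` after a delete shows whether either path still has children. A removed znode is typically reported by the client as a "Node does not exist" message, though the exact wording may vary by ZooKeeper version.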
Environment
Fusion Version: 5.9.12 to 5.9.15
Solr Version: 9.6.1
Kubernetes Version: 1.33
Cloud Platform: AKS / Kubernetes-native StatefulSets
Cause
When an upgrade occurs while internal system collections are actively processing tracking records, a transient interruption in pod availability or in local disk I/O can corrupt active shard definitions. If a shard contains only a single replica, no other replica is available to be elected leader, so a replacement replica has no leader to replicate from.
Furthermore, a failure in the automated cleanup cycle leaves stale metadata entries in the ZooKeeper znode tree. These orphaned znodes cause the cluster coordination layer to expect an older configuration, which produces version conflict errors and drives any newly allocated replica straight into a DOWN state.
Resolution
If deleting and replacing the individual replica keeps timing out, clear the stale ZooKeeper paths and let the Fusion internal management microservices regenerate the collection automatically.
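Because the steps that follow are destructive, it can be useful to record the collection's current shard and replica states first. The sketch below saves the output of the Solr Collections API CLUSTERSTATUS action to a local file. It assumes Solr is reachable at the URL in SOLR_URL (for example through a kubectl port-forward to the search pod on port 8983); that URL, the output file name, and the function names are illustrative assumptions, not part of the Fusion tooling.

```shell
#!/usr/bin/env bash
# Assumption: Solr is reachable here, e.g. after
#   kubectl port-forward pod/solr-search-0 8983:8983
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the Collections API URL that reports shard and replica states.
cluster_status_url() {
  printf '%s/admin/collections?action=CLUSTERSTATUS&collection=%s\n' "$SOLR_URL" "$1"
}

# Save the current state of one collection to a local JSON file.
# The default file name is a hypothetical choice.
snapshot_collection() {
  local collection="$1"
  local out="${2:-$1-clusterstatus-before.json}"
  curl -fsS "$(cluster_status_url "$collection")" -o "$out" && echo "saved $out"
}
```

Running `snapshot_collection system_history` before Step 1 leaves a file that can be compared with the state of the collection after it is recreated.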
Step 1: Purge the Corrupted Collection and Node Directories
- Open the Solr Admin UI, select the problematic system_history collection, and delete it.
- Authenticate to the affected Kubernetes cluster and clear the collection's local index data directories on the search pod. The command is wrapped in sh -c so that the wildcard is expanded inside the pod rather than by your local shell.
kubectl exec pod/solr-search-0 -- sh -c 'rm -rf /data/system_history*'
Step 2: Clear Orphaned Metadata Paths from ZooKeeper
Open the ZooKeeper command-line client on an active ZooKeeper pod.
kubectl exec -it pod/zookeeper-0 -- /opt/zookeeper/bin/zkCli.sh
Inside the interactive shell, remove the stale metadata nodes that reference the broken system collection from both the standard Solr path and the Fusion-specific path.
deleteall /collections/system_history
deleteall /lwfusion/5.0/core/collections/system_history
quit
Step 3: Trigger Core Service Reinitialization
Perform a rollout restart of the fusion-admin deployment to trigger the automated collection validation routine.
kubectl rollout restart deployment/fusion-admin
Monitor the rollout to verify that the service fully recovers.
kubectl rollout status deployment/fusion-admin
Once restarted, the fusion-admin microservice detects that the collection is missing and automatically recreates it with fresh, uncorrupted metadata. Verify the final health of the collection in the Solr Admin UI.
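The final health check can also be scripted. The sketch below polls the Solr Collections API CLUSTERSTATUS action until every shard and replica of the collection reports the state "active". It assumes Solr is reachable at SOLR_URL (for example through a port-forward), uses a crude text match on the "state" fields rather than a real JSON parser, and uses hypothetical function names and polling intervals.

```shell
#!/usr/bin/env bash
# Assumption: Solr is reachable at this URL from where the script runs.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Read CLUSTERSTATUS JSON on stdin. Succeed only if at least one "state"
# field is present and every "state" field has the value "active".
all_states_active() {
  local states
  states="$(grep -Eo '"state" *: *"[a-z_]+"')" || return 1
  ! printf '%s\n' "$states" | grep -qv '"active"'
}

# Poll the collection until it is fully active or the attempts run out.
wait_for_collection() {
  local collection="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$SOLR_URL/admin/collections?action=CLUSTERSTATUS&collection=$collection" \
        | all_states_active; then
      echo "$collection: all shards and replicas active"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "$collection: not fully active after $attempts checks" >&2
  return 1
}
```

For example, `wait_for_collection system_history 30` checks every 10 seconds for up to about five minutes and returns a non-zero status if the collection never becomes fully active.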