Issue
Fusion becomes unresponsive or experiences latency spikes during high-volume load testing. Clients may observe response times exceeding frontend timeouts (e.g., 8 seconds), and fusion-admin pods may enter a CrashLoopBackOff state. Query performance degrades sharply, and services such as query-pipeline and api-gateway become overloaded.
Diagnosis
To determine if your deployment is experiencing this issue:
- Monitor query-pipeline service metrics in Prometheus or Grafana.
  - Look for p95 or p99 latency values rising toward frontend timeout thresholds.
  - Observe rolling averages of queries per second (QPS) during the test period.
- Examine pod status using the following:
  kubectl get pods -n <fusion-namespace>
  Focus on:
  - fusion-admin pods in a CrashLoopBackOff or Pending state.
  - Inability to schedule pods due to node capacity constraints.
- Identify which stages of the query pipelines are responsible for latency:
  - Use pipeline stage breakdowns in the Fusion-provided Grafana dashboards to isolate high-latency components.
- Confirm Solr indexing and querying workloads are properly isolated using dedicated nodepools.
- Inspect disk types for Solr StatefulSets:
  - Ensure SSDs are used for high-throughput Solr workloads, as standard persistent disks may introduce I/O bottlenecks.
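The pod status checks above can be scripted for repeated use during a load test. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function names unhealthy_pods and scheduling_failures are illustrative and not part of Fusion.

```shell
#!/bin/sh
# Hypothetical helpers for the pod checks above; not part of Fusion.

# List pods whose STATUS column is CrashLoopBackOff or Pending.
# $1 = namespace
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers |
    awk '$3 == "CrashLoopBackOff" || $3 == "Pending" { print $1, $3 }'
}

# Show scheduler events for pods that could not be placed on a node,
# which usually points to node capacity constraints.
# $1 = namespace
scheduling_failures() {
  kubectl get events -n "$1" --field-selector reason=FailedScheduling
}

# Usage:
#   unhealthy_pods <fusion-namespace>
#   scheduling_failures <fusion-namespace>
```

Running unhealthy_pods at intervals during the test shows when pods start failing relative to the QPS ramp.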
Environment
Fusion 5.9.4 and later
Kubernetes version 1.29
Applies to Fusion deployed on GKE, EKS, AKS, or other Kubernetes environments using autoscaling and nodepool isolation.
Cause
The degradation is typically caused by one or more of the following:
- Insufficient compute capacity for fusion-admin or query-pipeline services when autoscaling reaches peak levels.
- Shared nodepools between indexing-heavy and query-serving workloads, creating contention.
- Use of standard persistent disks for Solr data, resulting in slow disk I/O under heavy read/write load.
- Query pipelines containing expensive stages or large per-request logic.
- Delta indexing jobs triggering during peak query times.
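To check the first cause, it can help to see whether any autoscaler is already pinned at its ceiling. The sketch below assumes the services are scaled by Kubernetes HorizontalPodAutoscaler objects, which may not hold for every deployment; hpas_at_max is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print HorizontalPodAutoscalers that are already
# running at their configured maximum replica count.
# Assumes autoscaling is done with HPA objects in the given namespace.
# $1 = namespace
hpas_at_max() {
  kubectl get hpa -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,MAX:.spec.maxReplicas,CURRENT:.status.currentReplicas |
    awk '$3 ~ /^[0-9]+$/ && $3 + 0 >= $2 + 0 {
           print $1 " is at max replicas (" $2 ")"
         }'
}

# Usage:
#   hpas_at_max <fusion-namespace>
```

An autoscaler that sits at its maximum while latency keeps rising suggests the replica ceiling or the nodepool size, rather than the scaling policy, is the limit.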
Resolution
Scale and isolate critical services
- Ensure fusion-admin pods have sufficient replicas and are scheduled on appropriately sized nodes:
  kubectl scale deployment fusion-admin -n <fusion-namespace> --replicas=<desired-count>
- Review nodepool assignments and autoscaler settings for services such as:
  - query-pipeline
  - api-gateway
  - fusion-indexing
- Separate TLOG and PULL Solr replicas into different nodepools to minimize contention.
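One way to review nodepool assignments is to list which node each pod landed on and compare that with the nodepool label on each node. This is a sketch only: pods_by_node and nodes_with_pool are hypothetical helpers, and the label key shown in the usage comment is the one GKE applies; other providers use different keys.

```shell
#!/bin/sh
# Hypothetical helpers for reviewing pod placement; not part of Fusion.

# Print "node pod" pairs sorted by node, so pods sharing a node are adjacent.
# $1 = namespace
pods_by_node() {
  kubectl get pods -n "$1" --no-headers \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName |
    awk '{ print $2, $1 }' | LC_ALL=C sort
}

# List nodes with an extra column showing the given label's value.
# $1 = node label key that identifies the nodepool (provider-specific)
nodes_with_pool() {
  kubectl get nodes -L "$1"
}

# Usage (the label key below is the GKE one; adjust for other providers):
#   pods_by_node <fusion-namespace>
#   nodes_with_pool cloud.google.com/gke-nodepool
```

Solr pods appearing on the same nodes as query-pipeline or indexing pods indicate that the workloads are not isolated.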
Tune query pipeline behavior
- Use Grafana dashboards to analyze the slowest pipeline stages.
- Modify or remove expensive stages that consistently impact performance.
- If using Lucidworks Predictive Merchandiser, Suggestions, or Recommendations pipelines, profile these separately so that each can be optimized in isolation.
Optimize Solr storage performance
- Verify that Solr StatefulSets are using SSD-backed volumes:
  - For example, on GKE use a storage class such as premium-rwo.
  - Update StatefulSet volume claim templates if needed, then recreate pods to apply changes.
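The storage class in use can be read from the PersistentVolumeClaims. The sketch below lists every claim in the namespace, not only the Solr ones, whose class differs from the expected one; pvcs_not_on_class is a hypothetical helper, and the expected class is passed in because it depends on the cloud provider.

```shell
#!/bin/sh
# Hypothetical helper: list PersistentVolumeClaims whose storage class
# differs from the expected SSD-backed class.
# $1 = namespace
# $2 = expected storage class (for example, premium-rwo on GKE)
pvcs_not_on_class() {
  kubectl get pvc -n "$1" --no-headers \
    -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName |
    awk -v want="$2" '$2 != want { print $1, $2 }'
}

# Usage:
#   pvcs_not_on_class <fusion-namespace> premium-rwo
```

An empty result means all claims in the namespace already use the expected class.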
Reset affected services
If services crash or enter a degraded state during testing:
- Restart affected pods manually:
  kubectl rollout restart deployment <deployment-name> -n <fusion-namespace>
- Commonly impacted deployments include:
  - fusion-admin
  - api-gateway
  - query-pipeline
  - solrcloud-node
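When several deployments need restarting, a small loop can restart them one at a time and wait for each rollout to finish. This is a sketch under assumptions: restart_and_wait is a hypothetical helper, the 300-second timeout is arbitrary, and every name passed in is assumed to be a Deployment. If a listed workload is a different resource type in your cluster, the resource type in the command would need to match.

```shell
#!/bin/sh
# Hypothetical helper: restart each named Deployment in turn and wait for
# its rollout to finish before moving on to the next one.
# $1 = namespace, remaining arguments = Deployment names
restart_and_wait() {
  ns="$1"
  shift
  for deploy in "$@"; do
    kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
    # The timeout value is an arbitrary choice for this sketch.
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s || return 1
  done
}

# Usage:
#   restart_and_wait <fusion-namespace> fusion-admin api-gateway query-pipeline
```

Restarting sequentially avoids taking several services down at once while the cluster is still recovering.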
After applying these changes, rerun the load tests to confirm that latency stays within acceptable thresholds.