Troubleshoot Fusion performance degradation under high query load – Lucidworks

Issue

Fusion becomes unresponsive or experiences latency spikes during high-volume load testing. Clients may observe response times exceeding frontend timeouts (e.g., 8 seconds), and fusion-admin pods may enter CrashLoopBackOff status. Query performance degrades sharply, and services such as query-pipeline and api-gateway become overloaded.

Diagnosis

To determine if your deployment is experiencing this issue:

Monitor query-pipeline service metrics in Prometheus or Grafana.
- Look for p95 or p99 latency values rising toward frontend timeout thresholds.
- Observe rolling averages of queries per second (QPS) during the test period.
Examine pod status using the following:

kubectl get pods -n <fusion-namespace>

Focus on:
- fusion-admin pods in a CrashLoopBackOff or Pending state.
- Pods that cannot be scheduled because of node capacity constraints (typically reported as FailedScheduling events in the output of kubectl describe pod).
Identify which stages of the query pipelines are responsible for latency:
- Use pipeline stage breakdowns in the Fusion-provided Grafana dashboards to isolate high-latency components.
Confirm Solr indexing and querying workloads are properly isolated using dedicated nodepools.
Inspect disk types for Solr StatefulSets:
- Ensure SSDs are used for high-throughput Solr workloads, as standard persistent disks may introduce I/O bottlenecks.
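
The pod status check above can be narrowed to just the unhealthy pods. The following is a minimal sketch, not a Lucidworks-provided tool: the function name unhealthy_pods is hypothetical, and it assumes the default table output of kubectl get pods, where STATUS is the third column.

```shell
# Hypothetical helper: print "<pod-name> <status>" for pods that are in
# CrashLoopBackOff or Pending. Reads `kubectl get pods` table output on stdin.
#
# Example usage:
#   kubectl get pods -n <fusion-namespace> | unhealthy_pods
unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && ($3 == "CrashLoopBackOff" || $3 == "Pending") { print $1, $3 }'
}
```

For any pod it reports, kubectl describe pod <pod-name> -n <fusion-namespace> shows the events that explain why the pod is crashing or cannot be scheduled.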

Environment

Fusion 5.9.4 and later
Kubernetes version 1.29
Applies to Fusion deployed on GKE, EKS, AKS, or other Kubernetes environments using autoscaling and nodepool isolation.

Cause

The degradation is typically caused by one or more of the following:

Insufficient compute capacity for fusion-admin or query-pipeline services when autoscaling reaches peak levels.
Shared nodepools between indexing-heavy and query-serving workloads, creating contention.
Use of standard persistent disks for Solr data, resulting in slow disk I/O under heavy read/write load.
Query pipelines containing expensive stages or large per-request logic.
Delta indexing jobs triggering during peak query times.

Resolution

Scale and isolate critical services

Ensure fusion-admin pods have sufficient replicas and are scheduled on appropriately sized nodes:

kubectl scale deployment fusion-admin -n <fusion-namespace> --replicas=<desired-count>

Review nodepool assignments and autoscaler settings for services such as:
- query-pipeline
- api-gateway
- fusion-indexing
Separate TLOG and PULL Solr replicas into different nodepools so that indexing load on TLOG replicas does not contend with query traffic served by PULL replicas.
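
One way to check whether query-serving and indexing workloads are actually isolated is to look for nodes that host both. The sketch below assumes pod names contain the service names used in this article (for example query-pipeline and fusion-indexing), so adjust the patterns to match your release. The function name shared_nodes is hypothetical.

```shell
# Hypothetical helper: list nodes that host pods matching BOTH name patterns.
# stdin: one "<pod-name> <node-name>" pair per line.
#
# Example usage:
#   kubectl get pods -n <fusion-namespace> --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName \
#     | shared_nodes query-pipeline fusion-indexing
shared_nodes() {
  awk -v a="$1" -v b="$2" '
    $2 == "<none>" { next }          # unscheduled pods have no node yet
    $1 ~ a { has_a[$2] = 1 }         # node hosts a pod matching pattern 1
    $1 ~ b { has_b[$2] = 1 }         # node hosts a pod matching pattern 2
    END { for (n in has_a) if (n in has_b) print n }
  ' | sort
}
```

An empty result suggests the two workloads do not share nodes. Any node it prints is a candidate for contention and is worth comparing against your nodepool selectors and autoscaler settings.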

Tune query pipeline behavior

Use Grafana dashboards to identify the slowest pipeline stages.
Modify or remove expensive stages that consistently impact performance.
If using Lucidworks Predictive Merchandiser, Suggestions, or Recommendations pipelines, profile these separately to ensure isolated optimization.

Optimize Solr storage performance

Verify that Solr StatefulSets are using SSD-backed volumes:
- For example, on GKE use a storage class such as premium-rwo.
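- If the storage class must change, note that Kubernetes does not allow editing a StatefulSet's volume claim templates or an existing PVC's storage class in place. Plan to recreate the StatefulSet and its PVCs, restoring or reindexing the data, rather than only recreating pods.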
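
To verify which storage class each volume claim uses, a short filter over the PVC list can help. This is a sketch under assumptions: the function name pvcs_not_on_class is hypothetical, and it checks every PVC in the namespace, not only those belonging to Solr.

```shell
# Hypothetical helper: print "<pvc-name> <storage-class>" for every PVC whose
# storage class differs from the expected one (passed as the first argument).
# stdin: one "<pvc-name> <storage-class>" pair per line.
#
# Example usage (premium-rwo is the GKE example from this article):
#   kubectl get pvc -n <fusion-namespace> --no-headers \
#     -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName \
#     | pvcs_not_on_class premium-rwo
pvcs_not_on_class() {
  awk -v want="$1" '$2 != want { print $1, $2 }'
}
```

No output means every claim in the namespace already uses the expected class.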

Reset affected services

If services crash or enter a degraded state during testing:

Restart affected pods manually:

kubectl rollout restart deployment <deployment-name> -n <fusion-namespace>

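Commonly impacted workloads include:
- fusion-admin
- api-gateway
- query-pipeline
- solrcloud-node

Because Solr runs as a StatefulSet, substitute statefulset for deployment in the restart command when restarting Solr pods.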
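
When several deployments need restarting, it can be safer to restart them one at a time and wait for each to become ready. The following sketch wraps the restart command above with kubectl rollout status. The function name restart_and_wait and the 300-second timeout are assumptions chosen for illustration, not recommended values.

```shell
# Hypothetical helper: restart each named deployment and wait for the rollout
# to finish before moving on. Stops at the first failure.
#
# Example usage:
#   restart_and_wait <fusion-namespace> fusion-admin api-gateway query-pipeline
restart_and_wait() {
  ns="$1"
  shift
  for d in "$@"; do
    kubectl rollout restart "deployment/$d" -n "$ns" || return 1
    # Block until the new pods are ready or the (assumed) timeout expires.
    kubectl rollout status "deployment/$d" -n "$ns" --timeout=300s || return 1
  done
}
```

Restarting sequentially avoids taking several overloaded services down at the same moment, at the cost of a longer total recovery time.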

After applying these changes, rerun the load tests to confirm that p95 and p99 latency stay below the frontend timeout threshold.