Spark job stuck in "Running" state and cannot be aborted – Lucidworks

Issue

A job submitted through Fusion's Jobs UI appears stuck in a "Running" state and cannot be aborted or deleted, even after attempting removal via both the Fusion UI and Kubernetes commands. This issue may occur after the job initially failed due to a misconfiguration or image error.

Diagnosis

This behavior may be observed when:

A Spark job initially fails due to a missing image or environment misconfiguration (e.g., ErrImagePull)
The image issue is corrected, but subsequent retries result in the job appearing permanently in "Running" state
Attempts to delete, recreate, or abort the job via UI or Kubernetes have no effect
Restarting relevant services does not clear the job state

In affected versions, the job state is not updated because of an internal error in the job-config service, so the job continues to be reported as "Running" even though it is no longer making progress.
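To check whether the first symptom above (an image pull failure) is present, a minimal sketch along the following lines may help. The helper name find_image_pull_failures is hypothetical, and the sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods whose STATUS column (3rd field) ends in ErrImagePull or
# ImagePullBackOff. Init-container variants such as "Init:ErrImagePull"
# also match because the pattern is anchored only at the end.
find_image_pull_failures() {
  awk '$3 ~ /(ErrImagePull|ImagePullBackOff)$/ { print $1 }'
}

# Usage (replace the placeholder with your namespace):
#   kubectl get pods -n <fusion-namespace> --no-headers | find_image_pull_failures
```

For any pod it reports, kubectl describe pod <pod-name> -n <fusion-namespace> shows the image reference and the pull error in the Events section.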

Environment

Fusion 5.9.14
Applies to self-hosted deployments running on Kubernetes (e.g., Amazon EKS, GKE, AKS)

Cause

A known issue in Fusion 5.9.14 prevents the job-config service from persisting job state transitions, so certain Spark jobs remain in the non-terminal "Running" state. This occurs most often when a job initially fails on image pull and is then resubmitted without its internal state having been properly cleared.

Resolution

To resolve this issue, install the engineering patch for job-config provided for Fusion 5.9.14:

Patch image:
lucidworks/job-config:5.9.14-SUST-1371-patch
https://hub.docker.com/layers/lucidworks/job-config/5.9.14-SUST-1371-patch

Lucidworks recommends applying this patch proactively in all Fusion 5.9.14 environments, whether or not the issue has already been observed.
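This article does not spell out the installation steps. One possible way to point the running deployment at the patch image is sketched below. It assumes that the Deployment is named job-config (as in the restart commands in this article) and that its container is also named job-config, which is an assumption to verify in your cluster. The function name apply_job_config_patch is hypothetical.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"
PATCH_IMAGE="lucidworks/job-config:5.9.14-SUST-1371-patch"

# Sets the patch image on the job-config Deployment and waits for the rollout.
# $1 = Fusion namespace
# Assumption: the container inside the Deployment is named "job-config".
# Check with:
#   kubectl get deploy job-config -n <fusion-namespace> \
#     -o jsonpath='{.spec.template.spec.containers[*].name}'
apply_job_config_patch() {
  $KUBECTL set image deployment/job-config "job-config=${PATCH_IMAGE}" -n "$1" &&
  $KUBECTL rollout status deployment/job-config -n "$1"
}

# Usage: apply_job_config_patch <fusion-namespace>
```

If Fusion is managed with Helm, an image changed this way is likely to be reverted by the next upgrade, so the image tag should also be set in your Helm values. The exact values key depends on your chart and is not given here.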

Optional: restart affected services

Before applying the patch, you can try restarting the following Fusion services to release any stuck job state:

kubectl rollout restart deploy job-launcher -n <fusion-namespace>
kubectl rollout restart deploy job-rest-server -n <fusion-namespace>
kubectl rollout restart deploy job-config -n <fusion-namespace>
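
The restart commands return as soon as the rollout is triggered, so it can be useful to wait until each deployment is ready before rechecking the job. The following is a sketch only. The function name wait_for_job_services and the timeout value are illustrative choices, not part of the original procedure.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Waits for each restarted deployment to finish rolling out and stops at the
# first one that fails or times out.
# $1 = Fusion namespace, $2 = per-deployment timeout (for example 5m)
wait_for_job_services() {
  for deploy in job-launcher job-rest-server job-config; do
    $KUBECTL rollout status "deploy/$deploy" -n "$1" --timeout="$2" || return 1
  done
}

# Usage: wait_for_job_services <fusion-namespace> 5m
```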

If the job state remains stuck after service restarts, proceed to apply the patch as described above.

After the patch is applied, newly submitted jobs should transition through their lifecycle states correctly and should no longer become stuck in "Running".
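
To confirm that the patch image is the one actually deployed, a check like the following could be used. It is a sketch that assumes the Deployment is named job-config. The helper names current_job_config_images and has_patch_image are hypothetical.

```shell
KUBECTL="${KUBECTL:-kubectl}"
EXPECTED_TAG="5.9.14-SUST-1371-patch"

# Prints the image(s) configured on the job-config Deployment, space-separated.
# $1 = Fusion namespace
current_job_config_images() {
  $KUBECTL get deployment job-config -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Succeeds if the space-separated image list in $1 contains a job-config image
# at the patch tag (with or without a registry prefix).
has_patch_image() {
  case " $1 " in
    *"job-config:${EXPECTED_TAG} "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   has_patch_image "$(current_job_config_images <fusion-namespace>)" \
#     && echo "patch image is deployed"
```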