Signal aggregation job fails due to missing JAR, CPU limits, or RBAC permissions – Lucidworks

Issue

When triggering a signal aggregation job (such as click signals), the job fails with one or more of the following errors:

FileNotFoundException related to a missing temporary JAR file.
Kubernetes 422 error due to executor CPU request exceeding namespace limits.
Forbidden error on shutdown hook when listing persistent volume claims (PVCs).

Diagnosis

Check the Spark driver logs for the following patterns:

Missing JAR file reference in spark.jars or spark.repl.local.jars.

Pod creation error:

Invalid value: "X": must be less than or equal to cpu limit of Y

Forbidden error in shutdown hook:

persistentvolumeclaims is forbidden: User "system:serviceaccount:<namespace>:<service-account>" cannot list resource "persistentvolumeclaims"

Use the following command to inspect job-related Spark properties:

kubectl -n <namespace> logs <driver-pod-name> | grep -iE 'spark\.jars|repl\.local\.jars'
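To check for all three failure patterns in one pass, the driver log can be piped through a small filter. The following is a minimal sketch: the function name and the labels it prints are illustrative, not Fusion or Spark terminology, and the match on FileNotFoundException is deliberately broad, so a hit there should still be confirmed against the JAR path in the log.

```shell
# Reads Spark driver log text on stdin and prints one label per failure
# pattern found. Function name and labels are hypothetical.
classify_driver_log() {
  _log=$(cat)
  # Missing temporary JAR (broad match; confirm the path in the log)
  if printf '%s\n' "$_log" | grep -q 'FileNotFoundException'; then
    echo "missing-jar"
  fi
  # Executor pod rejected because the CPU request exceeds the CPU limit
  if printf '%s\n' "$_log" | grep -q 'must be less than or equal to cpu limit'; then
    echo "cpu-limit"
  fi
  # Shutdown hook denied access to PVCs
  if printf '%s\n' "$_log" | grep -q 'persistentvolumeclaims is forbidden'; then
    echo "pvc-rbac"
  fi
}
```

Usage, with the same placeholders as above: kubectl -n <namespace> logs <driver-pod-name> | classify_driver_log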

Environment

Fusion 5.9.7
Kubernetes (EKS, version 1.29)
Self-hosted deployment

Cause

A stale or missing JAR reference left over from a previous job attempt.
Spark executor CPU request exceeds the namespace’s resource limit policy.
The Spark job's service account does not have permission to list or delete PVCs during shutdown.

Resolution

1. Remove stale or invalid JAR references

In the job's Spark configuration, clear the following fields if they reference non-existent JARs:

spark.jars
spark.repl.local.jars

Allow Fusion to manage the classpath internally instead of referencing specific uploaded JARs.

2. Align executor CPU requests with namespace policy

If the namespace enforces a CPU limit (e.g., 1 core), configure the job's executor to match:

spark.kubernetes.executor.request.cores=1
spark.executor.cores=1
spark.executor.instances=2

Alternatively, if more CPU per executor is required, first inspect the namespace limit:

kubectl -n <namespace> describe limitranges

Then explicitly configure both the request and the limit in the job, keeping both values at or below the maximum that the namespace allows:

spark.kubernetes.executor.request.cores=3
spark.kubernetes.executor.limit.cores=3
spark.executor.cores=3
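Before saving the job, it can help to confirm that the intended request does not exceed the limit, since Kubernetes rejects a pod whose CPU request is greater than its CPU limit. The sketch below compares two Kubernetes CPU quantities (for example 500m, 1, or 1.5) after converting them to millicores. The function names are hypothetical, and only plain numbers and the m suffix are handled.

```shell
# Convert a Kubernetes CPU quantity ("500m", "1", "1.5") to millicores.
# Only plain numbers and the "m" suffix are handled in this sketch.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeeds (exit status 0) when the request is less than or equal to the limit.
cpu_request_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}
```

For example, cpu_request_fits 3 1 fails, which corresponds to the 422 error above, while cpu_request_fits 1 1 succeeds.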

3. Grant RBAC permissions to clean up PVCs

If the Spark job fails on shutdown with a PVC access error, grant the service account named in the Forbidden message permission to manage PVCs in the namespace:

kubectl -n <namespace> create role spark-pvc-cleanup \
  --verb=get,list,watch,delete \
  --resource=persistentvolumeclaims

kubectl -n <namespace> create rolebinding spark-pvc-cleanup-binding \
  --role=spark-pvc-cleanup \
  --serviceaccount=<namespace>:<job-launcher-service-account>
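After creating the role and binding, the result can be checked with kubectl auth can-i while impersonating the service account. This is a sketch under two assumptions: the caller is allowed to impersonate service accounts (typically a cluster administrator), and the helper function names are hypothetical.

```shell
# Build the user name Kubernetes assigns to a service account.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Print yes/no for each PVC verb granted by the role above.
# Arguments: <namespace> <service-account>
check_pvc_access() {
  _ns=$1
  _subject=$(sa_subject "$1" "$2")
  for _verb in get list watch delete; do
    printf '%s: ' "$_verb"
    # "kubectl auth can-i" prints yes or no and exits non-zero on "no",
    # so the exit status is ignored to keep the loop going.
    kubectl -n "$_ns" auth can-i "$_verb" persistentvolumeclaims --as="$_subject" || true
  done
}
```

Usage: check_pvc_access <namespace> <job-launcher-service-account>. Each verb should report yes once the binding is in place.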

4. Review SQL rollup fields

If the final rollup query references a field not present in the aggregated dataset, either:

Remove the field from the query.
Ensure it is included in the first aggregation's SELECT and GROUP BY clauses so it carries through.

Example of valid rollup SQL if params_dv_mv_partners_ss is unnecessary:

SELECT concat_ws('|', query_s, doc_id_s, filters_s) AS id,
       query_s,
       query_s AS query_t,
       doc_id_s,
       filters_s,
       first(aggr_type_s) AS aggr_type_s,
       SPLIT(filters_s, ' \\$ ') AS filters_ss,
       SUM(weight_d) AS weight_d,
       SUM(aggr_count_i) AS aggr_count_i
FROM my_signals_aggr
GROUP BY query_s, doc_id_s, filters_s

If the field is required, it must be included in both aggregation layers.