Spark jobs are used for a variety of processing tasks in Fusion 5 environments. At times, you may observe that a Spark job reports a failure. This document covers common errors and troubleshooting scenarios that can cause an otherwise correctly configured Spark job to fail.
The following Spark documentation is referenced throughout this article and is relevant for configuring and troubleshooting Spark jobs:
Spark Administration In Kubernetes
How To Configure Spark Job Resource Allocation
How to Get Logs for a Spark Job
How To Configure Spark Jobs to Access Cloud Storage
Understanding the job history fields in a failed job
resource: "spark:JOBNAME"
runId: "86aeffe28c"
startTime: TIMESTAMP
endTime: TIMESTAMP
status: "failed"
extra: (dropdown arrow)
exception: (content such as a stack trace)
podStatus: (container status, relevant messages, etc.)
The “extra” section is particularly relevant here. The messages within it are referenced frequently throughout this document and contain much of the relevant information about why your job failed. It may contain fields such as ‘podStatus’ if the pod itself contributed to the failure of the job, or exception information if there was an execution error in the job's code.
This article does not cover execution errors (typically exit code 1), as those tend to be specific to the job, your code or data, and/or other implementation-specific scenarios.
You can also use the run ID in conjunction with How to Get Logs for a Spark Job to retrieve your Spark job logs.
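As a minimal sketch (the namespace here is hypothetical; replace it with the namespace where Fusion is installed), the driver pod for a run can also be located and inspected directly with kubectl, since Spark labels its pods by role:

# List Spark driver pods in the Fusion namespace
kubectl get pods -n fusion -l spark-role=driver
# View the logs of a specific driver pod
kubectl logs -n fusion <driver-pod-name>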
Configuring jobs to be fault tolerant
In general, it is always best practice to configure jobs in a fault-tolerant manner, so that a graceful termination of an in-progress job before completion does not compromise the data integrity of your environment and the job can simply be retried.
You can also use job trigger logic to set up an error-handling job which runs in the event of a primary job failing. This additional job may be configured to perform specific error-handling tasks, move unprocessed data into an error-processing location, etc. See more about triggering jobs here: Jobs API - Triggers
Scenario 1: Exit Code 137, message=OOMKilled
Reason: The executor with id 8 exited with exit code 137.
ContainerStatus(....... exitCode=137, finishedAt=TIMESTAMP,
message=null, reason=OOMKilled, signal=null, startedAt=TIMESTAMP)
These messages indicate that the Spark process was under-resourced. Spark jobs run inside containers with limited heap space according to their configuration. The default executor and driver memory in Fusion environments is 3GB. Jobs which load, parse, and/or process large amounts of data may need more memory.
A job that needs more memory than this will exit with Kubernetes exit code 137 (128 + signal 9, SIGKILL), which corresponds to Out of Memory.
The spark job settings mentioned in the article How To Configure Spark Job Resource Allocation can be used to adjust memory allocation for a spark job. In particular:
Property: spark.executor.memory Value: (memory amount, e.g. 4g)
Property: spark.driver.memory Value: (memory amount, e.g. 4g)
This allows the executor (the process which is executing your spark job) and the driver (the spark process controlling the executors) to request more memory. If increasing the memory does not resolve the issue, you may need to inspect your job logic for potential memory leaks.
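Note that the memory requested from Kubernetes is somewhat higher than these values: Spark adds a memory overhead for off-heap usage, by default roughly 10% of the executor/driver memory (with a minimum of 384MiB). For example, with spark.executor.memory set to 4g, each executor pod requests approximately 4g + 0.4g ≈ 4.4GiB from the node, so the node must have at least that much allocatable memory free.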
Resource exhaustion may also manifest as an “exit code 143”, where a pod is killed by Kubernetes because it is consuming more resources than are available to it. CPU allocation can be adjusted with spark.executor.cores and spark.driver.cores, respectively. These can be configured with whole cores or millicores.
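For example, CPU allocation can be specified in the same style as the memory properties above (the values here are purely illustrative and should be sized to your workload):

Property: spark.executor.cores Value: (number of cores, e.g. 2)
Property: spark.driver.cores Value: (number of cores, e.g. 1)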
Scenario 2: Exit Code 143 / message=Pod was terminated in response to imminent node shutdown
exitCode=143, finishedAt=TIMESTAMP, reason=Error, startedAt=TIMESTAMP
...
message=Pod was terminated in response to imminent node shutdown.,
phase=Failed, podIP=IPADDRESS, qosClass=Burstable, reason=Terminated,
It is normal for Kubernetes environments to sometimes recycle nodes (the underlying compute instances on which pods are scheduled, not to be confused with pods, which are the containers that run your job).
In particular, cloud computing environments often include nodes that are not guaranteed to be permanently available, such as Amazon EKS “Spot Nodes” or Google GKE “Preemptible nodes”. These have significantly lower costs, but are not guaranteed to have long uptime. Since Spark jobs are typically “run as needed” tasks that do not require compute resources when they are not scheduled, it is common to run them on such nodes to save on cloud computing costs: you are not paying for compute resources during the times when your Spark jobs are not running. This is actively recommended in the documentation: Workload Isolation with Multiple Node Pools
However, in some cases, particularly with jobs that are expected to have long running times, you may encounter scenarios where a preemptible or spot node is not appropriate for a particular job.
This can result in your job frequently terminating before completion with a reason such as the one above. Exit code 143 (128 + signal 15, SIGTERM) means the Spark job was asked by Kubernetes to terminate gracefully, but before completion. While you should always account for the possibility that a node shutdown could happen during a job, long-running jobs may encounter this more frequently if scheduled on preemptible or spot nodes.
You can use node selectors, as described in Spark Administration In Kubernetes, to schedule some Spark jobs on preemptible nodes and other jobs on non-preemptible nodes, which have different availability guarantees.
For example, you may have a preemptible node pool with the node selector “spark-only”, as used in the examples in Spark Administration in Kubernetes. However, for your long-running job, you may choose to allocate it to a non-preemptible node pool with a different node selector instead.
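As a minimal sketch (the label key and value here are hypothetical; use the labels actually applied to your node pools), a per-job node selector can be set with Spark's Kubernetes node selector property in the job's Spark settings:

Property: spark.kubernetes.node.selector.pool Value: spark-stable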
Scenario 3: Spark Jobs failing to start / FailedScheduling / spark job pods in pending status
This may be observed in a few different ways:
Running kubectl get events shows that your Spark pods are failing to be scheduled, with messages along the lines of:
Warning FailedScheduling 18h default-scheduler 0/10 nodes are available:
3 Insufficient memory, 7 node(s) didn't match Pod's node affinity/selector.
The reasons in the events explain what kind of resource isn’t available in your Kubernetes cluster. Matching the affinity/selector is relevant for your node pool (for example, restricting Spark jobs to a specific node pool). In the event of ‘Insufficient memory’, this means that 3 of the nodes in your Spark node pool do not have enough memory to allocate to the Spark job you are attempting to run, based on that job’s memory requirements as specified with spark.executor.memory and spark.driver.memory, plus an amount of overhead memory for the JVM itself (default 10%).
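To see how much memory your nodes actually have available, you can inspect them directly. This is a minimal sketch: the label selector is hypothetical and should match however your Spark node pool is labeled, and kubectl top requires the metrics-server to be installed in the cluster:

# Show allocatable resources for nodes in the Spark pool
kubectl describe nodes -l pool=spark-only | grep -A 6 "Allocatable"
# Show current CPU/memory usage per node (requires metrics-server)
kubectl top nodes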
Configuring a spark-only (and optionally preemptible) node pool can also address this scenario, since doing so provisions a dedicated pool of compute resources for Spark jobs. Your Spark jobs are then not competing for compute resources with Solr, query pipelines, etc., while still being able to access additional compute resources without impacting cloud computing costs as much as simply adding more standard nodes would.