Machine Learning Stage Failure: UNKNOWN Application Error Processing RPC – Lucidworks

Issue

Users may encounter a failure in Query Pipelines at the Machine Learning (ML) stage, specifically when using Named Entity Recognition (NER) or other custom models. The pipeline stops at the ML stage, and subsequent stages are not executed.

The following error is observed in the Query Pipeline logs:

ERROR [com.lucidworks.apollo.pipeline.query.stages.ml.MLQueryStage] - Failed to generate prediction in Machine Learning Stage [Stage_Name] due to Model execution error: UNKNOWN: Application error processing RPC

Diagnosis

To diagnose this issue, review the logs for the ml-model-service and the specific Seldon deployment pods.

Check the seldon-container-engine logs for readiness failures:

{"error":"dial tcp 127.0.0.1:9500: connect: connection refused", "level":"error", "logger":"SeldonRestApi", "msg":"Ready check failed"}
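
To pick readiness failures out of a long log stream, a small filter can help. The sketch below is illustrative only: the function name ready_check_failures and the <pod_name> placeholder are hypothetical, and it assumes the message text matches the log line shown above.

```shell
# Print only the Seldon readiness failures from a log stream on stdin.
ready_check_failures() {
  grep -F 'Ready check failed'
}

# Sample input: the log line quoted in this article. In a cluster, pipe the
# output of
#   kubectl logs <pod_name> -c seldon-container-engine
# into the function instead (<pod_name> is a placeholder).
sample='{"error":"dial tcp 127.0.0.1:9500: connect: connection refused", "level":"error", "logger":"SeldonRestApi", "msg":"Ready check failed"}'
printf '%s\n' "$sample" | ready_check_failures
```

A fixed-string match (grep -F) is used so that the JSON punctuation in the log line is never interpreted as a regular expression.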

Verify the environment variables for the ml-model-service deployment using kubectl:

kubectl get deployment ml-model-service -o yaml | grep -A2 "JAVA_TOOL_OPTIONS"

If the output does not contain the -Dcom.google.protobuf.use_unsafe_pre22_gencode=true flag, the service will fail to process gRPC calls due to a Protobuf runtime exception.
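
The presence check can also be scripted. This is a minimal sketch rather than an official tool: has_protobuf_flag is a hypothetical helper, -Xmx2g stands in for whatever options are already set, and the kubectl jsonpath lookup in the comment assumes the variable is defined on the first container of the deployment.

```shell
FLAG='-Dcom.google.protobuf.use_unsafe_pre22_gencode=true'

# Succeed (exit status 0) only if the given JAVA_TOOL_OPTIONS value contains
# the flag as a separate, space-delimited option.
has_protobuf_flag() {
  case " $1 " in
    *" $FLAG "*) return 0 ;;
    *)           return 1 ;;
  esac
}

# In a cluster, the value could be read with something like the following
# (assumes the variable is set on the first container):
#   opts="$(kubectl get deployment ml-model-service -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="JAVA_TOOL_OPTIONS")].value}')"
opts="-Xmx2g"   # hypothetical value with the flag missing

if has_protobuf_flag "$opts"; then
  echo "Protobuf flag present"
else
  echo "Protobuf flag missing"
fi
```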

Environment

Managed Fusion 5.9.16
Kubernetes (K8s)
Seldon Core

Cause

The issue is typically caused by two distinct factors:

Missing Protobuf Flag: Fusion 5.9.16 requires a specific JVM flag for the Protobuf runtime to handle gRPC calls correctly. If the JAVA_TOOL_OPTIONS environment variable is overwritten during deployment instead of being extended, the flag is dropped and gRPC calls fail with an UnsupportedOperationException.
Image Dependencies: Models may fail to start if the Docker image is missing the setuptools package, specifically the pkg_resources module required by the Seldon wrapper.

Resolution

Step 1: Patch ml-model-service JVM Options

Update the ml-model-service deployment to include the mandatory Protobuf flag.

Append the following flag to the existing JAVA_TOOL_OPTIONS value, keeping any options that are already set:

-Dcom.google.protobuf.use_unsafe_pre22_gencode=true

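This can be applied via kubectl edit deployment ml-model-service or by updating the Helm values and redeploying the service. A manual edit may be reverted by a later Helm upgrade, so recording the flag in the Helm values is the more durable option.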
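
Because the flag is lost when JAVA_TOOL_OPTIONS is replaced rather than extended, it can help to compute the new value from the current one. The following is a sketch under stated assumptions: with_protobuf_flag is a hypothetical helper and -Xmx2g is a placeholder for the options your deployment already carries.

```shell
FLAG='-Dcom.google.protobuf.use_unsafe_pre22_gencode=true'

# Print the given JAVA_TOOL_OPTIONS value with the flag appended.
# The value is returned unchanged if the flag is already present.
with_protobuf_flag() {
  case " $1 " in
    *" $FLAG "*) printf '%s\n' "$1" ;;        # already there
    "  ")        printf '%s\n' "$FLAG" ;;     # no existing options
    *)           printf '%s\n' "$1 $FLAG" ;;  # keep existing options, add flag
  esac
}

current="-Xmx2g"   # hypothetical existing value
with_protobuf_flag "$current"
```

The result could then be applied with, for example, kubectl set env deployment/ml-model-service JAVA_TOOL_OPTIONS="$(with_protobuf_flag "$current")", or copied into the Helm values.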

Step 2: Update Model Dockerfile (Optional)

Ensure the custom model image includes the necessary build dependencies. Update the Dockerfile to install setuptools before other Python dependencies.

Add the following line to the Dockerfile:

RUN pip install --no-cache-dir "setuptools>=65.0.0"
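
A note on quoting: a RUN instruction in shell form is executed by /bin/sh, where an unquoted > starts an output redirection. The version specifier therefore needs quotes to reach pip intact. The sketch below demonstrates the effect in a scratch directory, with echo standing in for pip purely for illustration.

```shell
workdir="$(mktemp -d)"
cd "$workdir"

# Unquoted: the shell parses ">" as redirection, so the command receives only
# "setuptools" and its output is written to a file named "=65.0.0".
echo setuptools>=65.0.0
ls    # lists the stray file: =65.0.0

# Quoted: the whole specifier reaches the command as a single argument.
echo "setuptools>=65.0.0"
```

With pip in place of echo, the unquoted form would install an unpinned setuptools and leave a stray file in the image.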

Rebuild and push the image:

docker build -t [image_name]:[tag] .
docker push [image_name]:[tag]

Step 3: Redeploy and Verify

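Delete the existing model pods so that they are recreated from the updated image. If the image tag did not change, the new image is pulled only when the pod's imagePullPolicy is Always: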

kubectl delete pod -l seldon-deployment-id=[model_id]

Validate that the model is generating predictions by testing the Query Pipeline in the Query Workbench.