Issue
The Synonym Detection job fails with a NullPointerException, or it runs successfully but does not generate any synonyms.
Diagnosis
Check job logs for the following exception:
java.lang.NullPointerException: Value at index 0 is null
at org.apache.spark.sql.errors.QueryExecutionErrors$.valueIsNullError
at org.apache.spark.sql.Row.getDouble
...
at com.lucidworks.dc.job.impl.ml.nlp.SynonymDetection$.findSynonym
This exception typically occurs when required fields are missing from the signals aggregation collection used as input. The following fields are required for Synonym Detection:
query_s
aggr_count_i
doc_id_s
To verify this, query the input signals collection in Query Workbench or directly via Solr:
q=-doc_id_s:* OR -aggr_count_i:* OR -query_s:*
If any documents are returned, the collection includes incomplete records that will cause the job to fail.
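The check above can also be scripted. The following is a minimal Python sketch, not part of Fusion: it builds a Solr query string that matches documents missing at least one of the three required fields. It assumes the standard Solr query parser, where a negated clause combined with OR is safest when anchored to *:* so that each clause stands on its own. The helper names missing_any_field_query and select_params are hypothetical.

```python
from urllib.parse import urlencode

# Required input fields named in this article.
REQUIRED_FIELDS = ("query_s", "aggr_count_i", "doc_id_s")


def missing_any_field_query(fields=REQUIRED_FIELDS):
    """Return a Solr query matching documents that lack at least one field.

    Each negated clause is anchored to *:* so that it stands on its own
    when combined with OR.
    """
    clauses = [f"(*:* -{name}:*)" for name in fields]
    return " OR ".join(clauses)


def select_params(fields=REQUIRED_FIELDS):
    """URL-encode the query; rows=0 asks only for the match count."""
    return urlencode({"q": missing_any_field_query(fields), "rows": 0})


print(missing_any_field_query())
```

Appending the output of select_params() to the collection's select endpoint should report a non-zero numFound when incomplete records exist.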
If the job succeeds but no synonyms are generated, the input signals may not meet the similarity thresholds required for synonym extraction.
Environment
Fusion 5.9.3 and 5.9.5
Managed Fusion on Kubernetes
Spark-based Synonym Detection jobs
Applicable in both Dev and Prod environments
Cause
The NullPointerException is caused by null values in critical fields within the signals aggregation collection. These fields are accessed without null-checking in the job logic.
When no synonyms are produced despite job success, it's often due to:
Poor signal diversity (e.g., identical queries producing identical results)
Overly strict parameter settings for synonym similarity or query similarity
Resolution
Patch the affected Spark component
A patch has been published that resolves the NPE by skipping or handling records with null values. This patch is version-specific.
For Fusion 5.9.5, apply:
gcr.io/lw-support-team/fusion-spark:5.9.5-sust-1130-synonym-detection-npe-backport
For Fusion 5.9.9, apply:
gcr.io/lw-support-team/fusion-spark:5.9.9-sust-1130-synonym-detection-npe-backport
Only the fusion-spark image needs to be updated. Ensure the patch is applied via ConfigMap updates and deployed to the job runner pods in your Kubernetes environment.
Clean or filter the input signals collection
If you cannot patch immediately, apply a data filter in the job configuration as a workaround:
"doc_id_s:* AND query_s:* AND aggr_count_i:*"
This ensures the Synonym Detection job only processes documents that have all required fields.
Alternatively, remove documents with null fields from the input collection entirely.
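When cleaning the collection rather than filtering it, it helps to identify the incomplete records first. The following is a minimal sketch, assuming the aggregated signals have been exported as a list of dictionaries (for example, from a Solr JSON response). The helper split_complete and the sample records are hypothetical, and deleting the incomplete records is left to your usual tooling.

```python
REQUIRED_FIELDS = ("query_s", "aggr_count_i", "doc_id_s")


def split_complete(docs, fields=REQUIRED_FIELDS):
    """Partition docs into (complete, incomplete) by required fields.

    A field counts as missing when the key is absent or its value is None.
    """
    complete, incomplete = [], []
    for doc in docs:
        if all(doc.get(name) is not None for name in fields):
            complete.append(doc)
        else:
            incomplete.append(doc)
    return complete, incomplete


# Illustrative records, not real signal data.
sample = [
    {"id": "1", "query_s": "laptop", "doc_id_s": "p-10", "aggr_count_i": 4},
    {"id": "2", "query_s": "notebook", "doc_id_s": "p-10"},  # no aggr_count_i
    {"id": "3", "query_s": None, "doc_id_s": "p-11", "aggr_count_i": 2},
]
complete, incomplete = split_complete(sample)
print([d["id"] for d in incomplete])  # prints ['2', '3']
```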
Tune the synonym detection parameters
If the job completes but does not produce synonyms, adjust the model tuning parameters:
Reduce synonymSimilarityThreshold (e.g., from 0.5 to 0.01)
Reduce querySimilarityThreshold (e.g., from 0.9 to 0.5)
These parameters can be modified in the Spark job configuration:
{
"synonymSimilarityThreshold": 0.01,
"querySimilarityThreshold": 0.5
}
Ensure your input signals contain sufficient variety in queries and result sets. If all signal entries are produced by the same query or lead to identical results, synonym generation will not be effective.
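One rough way to gauge signal variety before running the job is to count distinct queries and distinct result documents. The sketch below is only a heuristic written for this article, not the job's algorithm. It also counts documents reached by more than one query, on the assumption (not confirmed by the source) that such overlap is what lets queries be compared. The helper diversity_summary and the sample pairs are hypothetical.

```python
from collections import defaultdict


def diversity_summary(rows):
    """Summarize variety in (query, doc_id) pairs from aggregated signals."""
    queries_by_doc = defaultdict(set)
    queries = set()
    for query, doc_id in rows:
        queries.add(query)
        queries_by_doc[doc_id].add(query)
    # Documents that were reached by at least two different queries.
    shared = sum(1 for qs in queries_by_doc.values() if len(qs) > 1)
    return {
        "distinct_queries": len(queries),
        "distinct_docs": len(queries_by_doc),
        "docs_with_multiple_queries": shared,
    }


# Illustrative pairs, not real signal data.
rows = [("laptop", "p-10"), ("notebook", "p-10"), ("laptop", "p-11")]
print(diversity_summary(rows))
```

A summary with a single distinct query or a single distinct document corresponds to the low-variety case described above.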
Additional notes
The fix is included in Fusion releases starting with 5.9.13.
To avoid runtime issues, always validate signals collection quality prior to running ML-based jobs.