Spark Job error "Found duplicate column(s) in the data schema" – Lucidworks

Issue

When running a Spark job that reads from a collection, the job fails with an error similar to:

Error during job z2NHYM execution: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: 

followed by a list of field names, for example "params_f.productit_facet.facet.prefix_ss".

Diagnosis

When you investigate, you see that the actual field name in Fusion/Solr is "params_f.ProductIT_facet.facet.prefix_ss": the same name as in the error message, but with mixed-case characters.

Environment

Fusion 4.2.6 and 5+

Cause

Out of the box, Spark is configured to use the default value of spark.sql.caseSensitive, which is false. With case-insensitive matching, columns whose names differ only in letter case are treated as the same name, so they are reported as duplicates.
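To find out which fields collide before changing any settings, the case-insensitive comparison can be approximated outside Spark. The following is a minimal sketch in plain Python, not Fusion or Spark code: the helper name find_case_collisions and the sample field list are assumptions made for illustration, and the second, all-lowercase field name is hypothetical.

```python
from collections import defaultdict


def find_case_collisions(field_names):
    """Group field names that differ only by letter case.

    Hypothetical helper: it approximates a case-insensitive comparison
    by lowercasing each name and collecting names that share a key.
    """
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the keys that more than one original name maps to.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Sample field list (assumed for illustration). The second variant is
# hypothetical; substitute the field names from your own collection.
fields = [
    "id",
    "params_f.ProductIT_facet.facet.prefix_ss",
    "params_f.productit_facet.facet.prefix_ss",
]

collisions = find_case_collisions(fields)
for lowered, originals in collisions.items():
    print(lowered, "<-", originals)
```

Any key returned by the helper corresponds to a name that could show up in the "Found duplicate column(s)" list.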

Resolution

In the Spark job's configuration screen, turn on the Advanced settings and look for the Spark Settings section.

Click the Add button, and set spark.sql.caseSensitive to true.

This causes that Spark job to treat the two column names as distinct columns.
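For Spark code that runs outside the Fusion job configuration screen, the same property can be set when the session is created. This is a sketch only, assuming a standalone PySpark environment; the SPARK_SETTINGS dictionary, the build_session function, and the application name are hypothetical names introduced for illustration, and nothing here is part of Fusion itself.

```python
# Settings to apply to the session. The key is the Spark property named
# in this article; the value "true" enables case-sensitive column names.
SPARK_SETTINGS = {"spark.sql.caseSensitive": "true"}


def build_session(app_name="case-sensitive-read"):
    """Create a SparkSession with the settings above applied.

    Hypothetical helper. PySpark is imported lazily so that this module
    can be loaded on machines where PySpark is not installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session() returns a session in which columns that differ only by case are kept as separate columns.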
