Issue
When running a Spark job that reads from a collection, the job fails with an error similar to:
Error during job z2NHYM execution: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:
followed by a list of field names. One might be "params_f.productit_facet.facet.prefix_ss".
Diagnosis
When you investigate, you see that the actual field name in Fusion/Solr is "params_f.ProductIT_facet.facet.prefix_ss", i.e. the same name, but with mixed-case characters.
Environment
Fusion 4.2.6 and Fusion 5 or later
Cause
Out of the box, Spark is configured to use the default value of spark.sql.caseSensitive, which is false. With case-insensitive matching, columns whose names differ only in letter case are treated as the same column, so Spark reports them as duplicates.
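The effect of this setting can be illustrated with a small stand-alone Python sketch. It is a simplified model of the duplicate check, not Spark's actual code: the function name find_duplicate_columns is hypothetical, and the sample list assumes the collection contains a lower-case variant of the field alongside the mixed-case one.

```python
from collections import defaultdict


def find_duplicate_columns(names, case_sensitive=False):
    """Return groups of column names that collide under the given case rule.

    Simplified model only: with case_sensitive=False, names are compared
    after lower-casing, which mirrors the default spark.sql.caseSensitive=false.
    """
    groups = defaultdict(list)
    for name in names:
        key = name if case_sensitive else name.lower()
        groups[key].append(name)
    # Only groups with more than one member count as duplicates.
    return [group for group in groups.values() if len(group) > 1]


# Illustrative field list (assumed): two fields that differ only in case.
fields = [
    "params_f.ProductIT_facet.facet.prefix_ss",
    "params_f.productit_facet.facet.prefix_ss",
    "id",
]

insensitive = find_duplicate_columns(fields, case_sensitive=False)
sensitive = find_duplicate_columns(fields, case_sensitive=True)

print(insensitive)  # the two params_f fields are reported as one duplicate group
print(sensitive)    # no duplicates when case is significant
```

Running the same helper over the field names of your own collection is one way to find out which fields collide before changing any job settings.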
Resolution
In the Spark jobs configuration screen, turn on the Advanced settings and look for the Spark Settings.
Click the Add button, and set spark.sql.caseSensitive to true.
This causes that Spark job to treat column names that differ only in case as distinct columns.