How to avoid using Nested Queries from the JDBC Connector

Lucidworks Fusion has a JDBC Database connector so that you can index your database values.

This works based on a single sql query that gets the rows to index. For each result in that query will be indexed as a SOLR document.

Fusion also supports "Nested Queries" which are SQL queries that are run on each row result of the first sql query to get additional fields to add along to the SOLR document.

For example you might have a table that you want to index:

CREATE TABLE MainTable (
	MyKeyField varchar(10) NOT NULL,
	Field1 varchar(10) NOT NULL )

So you will make this your main sql query:

select MyKeyField, MyDateField, MyStrField from MainTable

But then lets say you have another table you want to query along with the first query.

CREATE TABLE AdditionalTable (
	MyKeyField varchar(10) NOT NULL,
	AdditionalField varchar(10) NOT NULL )
So we try adding a nested query:
select AdditionalField from AdditionalTable where MyKeyField = $
Note: The JDBC connector recognizes "$" as the primary key from the main sql query.

This is an easy and user-friendly way to join fields into your main sql query.

But I am here to caution you against using the "Nested Queries." Why? Because if you have a large number of results in the main sql query, the connection overhead of those extra sql queries dominates the performance. You will experience a very slow crawl speed if that first query has many rows.

To fix it, we instead of using Fusion's Nested Query element, we just add the nested queries into the main SQL query as a text-delimited column. To achieve this is slightly different on each database. See: http://stackoverflow.com/questions/194852/concatenate-many-rows-into-a-single-text-string 

For our example, SQL server uses the STUFF function thus the new query becomes:

SELECT [MyKeyField]
      , [Field1]
      , AdditionalField1 = STUFF((
          SELECT ',' + at.AdditionalField1
          FROM [dbo].[JoinTable] at
          WHERE at.MyKeyField = [MyKeyField]

 

This example returns the sub query result as a comma separated text field.

2) At this point our Solr document has a single text field with all the sub query results. We need to split the delimited text field into an array before indexing. Open the pipeline for your datasource, and add a javascript query stage after the Field Mapping Stage with the following code:

function(doc, ctx) {
  var fields = ["AdditionalField1"]
  for (i = 0; i < fields.length; i++ ) {
    var field = fields[i]
    var value = doc.getFirstFieldValue(field)
    doc.removeFields(field)
    var vals = value.split(",")
    for (var j=0; j<vals.length; j++) {
      doc.addField(field, vals[j])
    }
  }
  return doc
}

To use this, simply add the nested query field names to the fields array above, and if you are using a different delimiter than "," change that as well.

Now if you try an index, you should see your nested query results added into the solr document as a Text[] field.

Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.
Powered by Zendesk