How to avoid using Nested Queries from the JDBC Connector

Please see comment below the article.

Lucidworks Fusion has a JDBC Database connector so that you can index your database values.

This works based on a single sql query that gets the rows to index. For each result in that query will be indexed as a SOLR document.

Fusion also supports "Nested Queries" which are SQL queries that are run on each row result of the first sql query to get additional fields to add along to the SOLR document.

For example you might have a table that you want to index:

CREATE TABLE MainTable (
	MyKeyField varchar(10) NOT NULL,
	Field1 varchar(10) NOT NULL )

So you will make this your main sql query:

select MyKeyField, MyDateField, MyStrField from MainTable

But then lets say you have another table you want to query along with the first query.

CREATE TABLE AdditionalTable (
	MyKeyField varchar(10) NOT NULL,
	AdditionalField varchar(10) NOT NULL )
So we try adding a nested query:
select AdditionalField from AdditionalTable where MyKeyField = $
Note: The JDBC connector recognizes "$" as the primary key from the main sql query.

This is an easy and user-friendly way to join fields into your main sql query.

But I am here to caution you against using the "Nested Queries." Why? Because if you have a large number of results in the main sql query, the connection overhead of those extra sql queries dominates the performance. You will experience a very slow crawl speed if that first query has many rows.

To fix it, we instead of using Fusion's Nested Query element, we just add the nested queries into the main SQL query as a text-delimited column. To achieve this is slightly different on each database. See: http://stackoverflow.com/questions/194852/concatenate-many-rows-into-a-single-text-string 

For our example, SQL server uses the STUFF function thus the new query becomes:

SELECT [MyKeyField]
      , [Field1]
      , AdditionalField1 = STUFF((
          SELECT ',' + at.AdditionalField1
          FROM [dbo].[JoinTable] at
          WHERE at.MyKeyField = [MyKeyField]

 

This example returns the sub query result as a comma separated text field.

2) At this point our Solr document has a single text field with all the sub query results. We need to split the delimited text field into an array before indexing. Open the pipeline for your datasource, and add a javascript query stage after the Field Mapping Stage with the following code:

function(doc, ctx) {
  var fields = ["AdditionalField1"]
  for (i = 0; i < fields.length; i++ ) {
    var field = fields[i]
    var value = doc.getFirstFieldValue(field)
    doc.removeFields(field)
    var vals = value.split(",")
    for (var j=0; j<vals.length; j++) {
      doc.addField(field, vals[j])
    }
  }
  return doc
}

To use this, simply add the nested query field names to the fields array above, and if you are using a different delimiter than "," change that as well.

Now if you try an index, you should see your nested query results added into the solr document as a Text[] field.

Have more questions? Submit a request

1 Comments

  • 0
    Avatar
    Gerald Kanapathy

    A much better way to do this is to perform the JOIN inside the SQL database. This just means rewriting the main query, getting rid of the nested query, and skipping the JS stage. The way to do with your tables and fields as above to to write the query as:

     

        SELECT MyKeyField, MyDateField, MyStrField, AdditionalField

        FROM MainTable 

        JOIN JoinTable jt

        ON jt.MyKeyField = MyKeyField

     

    Then you can skip all the JS and others, you'll just get a document per row (efficiently) all the specified fields. More generically, if you have

     

        SELECT mkey,m1,m2,... FROM maintable

    and 

       SELECT jkey, j1,j2,... FROM jointable

    and you want the join, you can use:

        SELECT m.m1, m.m2,...j.j1, j.j2,...

        FROM maintable m

        JOIN jointable j ON m.mkey = j.jkey

    The "a.fN AS fN" is just there if you have conflicting field names between the two tables. If they are unambigious, you can just use "fN" instead. If you don't like the resulting column names, you can rename them with the SQL "AS" keyword:

        SELECT m.m1 AS name1, m.m2 AS name2,...

    Edited by Gerald Kanapathy
Please sign in to leave a comment.
Powered by Zendesk