Tweaking Fusion's commit strategy for multiple data sources

This recipe is for users going live with Fusion who want to tune how frequently documents become visible to searches without committing more often than necessary. It is also useful if you start seeing the "exceeded limit of maxWarmingSearchers=2" error in your Solr logs.

Currently, when you create a datasource in Fusion, any document that gets crawled becomes visible to searches within 10s by default. This happens via the "commitWithin" functionality that Solr provides.

This provides a great out-of-the-box experience while trying out Fusion, but to control when documents become visible to searches you can use the following strategy:

1. Disable commitWithin for the collection in Fusion.

To do this, first fetch the current configuration payload for the collection with this API call:

curl -X GET 'http://localhost:8765/api/v1/collections/<collection-name>'

Now edit the returned JSON and change the commitWithin value to -1. Send the edited JSON back with the following API call to disable commitWithin:

curl -H 'Content-type: application/json' -X PUT 'http://localhost:8765/api/v1/collections/<collection-name>' -d '{json-here}'
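The same fetch, edit, and PUT round trip can be scripted. The following is a minimal Python sketch, not an official Fusion client: the helper names are hypothetical, and it assumes that commitWithin is a top-level key of the collection JSON returned by the GET call. Check the actual payload from your Fusion version before relying on it.

```python
import json
import urllib.request

# Base URL taken from the curl commands in this article; adjust for your install.
FUSION_API = "http://localhost:8765/api/v1/collections"


def disable_commit_within(payload):
    """Return a copy of the collection payload with commitWithin set to -1.

    Assumption: commitWithin is a top-level key. If it is missing, fail loudly
    so the payload can be inspected by hand instead of silently adding a key.
    """
    if "commitWithin" not in payload:
        raise KeyError("commitWithin not found at the top level of the payload")
    updated = dict(payload)  # shallow copy, the caller's dict is left untouched
    updated["commitWithin"] = -1
    return updated


def update_collection(name):
    """GET the collection, disable commitWithin, and PUT the result back."""
    url = f"{FUSION_API}/{name}"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)

    request = urllib.request.Request(
        url,
        data=json.dumps(disable_commit_within(payload)).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as resp:
        resp.read()
```

With Fusion running, calling update_collection with your collection name performs both API calls in one go. Only disable_commit_within is pure and can be tried without a server.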

2. Increase the hard commit interval. By default a hard commit happens every 15s. It is better to trigger it every 10000 docs instead. You can lower that number if your documents are very big.

You can do that by going to the http://localhost:8764/admin/collections/<collection-name>/solr-config page, clicking on solrconfig.xml, and changing <autoCommit> to:

<autoCommit>
  <maxDocs>${solr.autoCommit.maxDocs:10000}</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

3. To make documents visible to searches, use the autoSoftCommit feature in Solr. In the same solrconfig.xml file, set its maxTime to 10s:

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime> <!-- Time is in milliseconds -->
</autoSoftCommit>
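The ${name:default} form in both snippets is Solr's property substitution: if a property with that name is set (for example as a JVM system property), its value is used, otherwise the default after the colon applies. The Python sketch below only mimics that lookup to show which value ends up in effect. The resolve helper is hypothetical and is not part of Solr or Fusion.

```python
import re

# Matches placeholders of the form ${property.name:default}
PLACEHOLDER = re.compile(r"\$\{([^:}]+):([^}]*)\}")


def resolve(text, props):
    """Replace each ${name:default} with props[name] if present, else the default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        return str(props.get(name, default))

    return PLACEHOLDER.sub(substitute, text)


line = "<maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>"

# No property set: the default of 10000 ms (10s) is used.
print(resolve(line, {}))

# Property overridden: soft commits would run every 60000 ms (60s).
print(resolve(line, {"solr.autoSoftCommit.maxTime": "60000"}))
```

This means the values edited in solrconfig.xml act as defaults, and a property set elsewhere under the same name would take precedence over them.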

Note: Regarding auto soft commits, if you don't need to make documents available to searches every 10s, feel free to increase the interval to the highest value you can tolerate. A soft commit is an expensive operation, so the less often it runs, the more efficient indexing is.

To wrap up: disabling commitWithin stops a commit from being scheduled within 10s every time a document is added (which can mean Solr is committing every few seconds if you have multiple data sources running in parallel). We then relaxed the hard commit setting, since the 15s default was too frequent, and finally used soft commits to make documents visible to searches.

Here is a great blog post by Erick Erickson explaining hard commits and soft commits: https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

 

 
