Sharding w/ joins SolrCloud question

Question>>

When does Sharding and replication for High Availability // scaling vs query patterns... joining between documents... will lots of shards make joining less efficient (is there a way for us to do "better" joining or schema design to optimize efficiency?)

Answer>>

Joins are something to use with great caution. First of all, quite apart from sharding etc, joins do not perform well when the "from" or "to" field has many unique values across the corpus. One of the most popular things to join on is something like id (or <uniqueKey>). This will kill performance.

Second, distributed search (aka sharding) doesn't support joins, see: https://issues.apache.org/jira/browse/LUCENE-3759

If you can turn the problem into a pure search issue by de-normalizing the data so the problem turns into a straight search you're usually better off. A couple of ways to go about that are:

1> throwing all the related fields from the secondary table into the document. e.g. go from a separate table listing, say, phone numbers that's joined with, say, a user table and just include all the phone numbers in a field in the user record. You can use some tricks with positionIncrementGap and multiValued fields to keep such "bags of attributes" separately searchable (i.e. let's say you have two phone numbers 123 4567 and 321 7654. You can structure the query to NOT match "4567 321" Or to match, depending on your needs.).

2> essentially index the cross product of the join as separate documents. Extending the above you'd have two user records, each would only have one of the phone numbers. This can be unacceptable if there is a combinatorial explosion though.

Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.
Powered by Zendesk