Multiple documents with the same doc-id in an index

This article addresses a scenario in which multiple documents with the same doc-id exist in the same index.

There are two known reasons for this particular anomaly:

1. Indexing parent-child documents alongside singleton documents in the same index.
2. Using the MERGEINDEXES tool, directly or indirectly (for example, through MapReduceIndexerTool).

Indexing parent-child documents alongside singleton documents in the same index

Lucene's document model is flat: it does not support nesting of documents while indexing. It does, however, support adding a list of documents atomically and contiguously, like a virtual block. Solr uses this feature to implement "nested objects".

When you add a parent document with n children, they appear contiguously in the index as:

child-1, child-2, child-3, .... , child-n, parent

All children of a parent document must be indexed together with the parent document. We cannot update either the parent or a child document individually while keeping the relationship intact. The entire block needs to be re-indexed if any changes need to be made.
No information is stored at the Lucene level that links parent to child, or that distinguishes this parent/child block from the other documents in the same index.
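
The block layout described above can be sketched in a few lines of Python. This is only an illustration of the ordering, not Solr or Lucene code; the function name flatten_block and the dictionary shape are hypothetical and chosen just for this example.

```python
def flatten_block(parent):
    """Return the documents of one parent/child block in index order.

    Children come first and the parent comes last, as described above.
    `parent` is a plain dict; its children live under "_childDocuments_".
    (Hypothetical helper for illustration, not part of Solr.)
    """
    children = parent.get("_childDocuments_", [])
    # Copy the parent without the nested list: the index stores flat documents.
    flat_parent = {k: v for k, v in parent.items() if k != "_childDocuments_"}
    return list(children) + [flat_parent]


block = flatten_block({
    "id": "book1",
    "type_s": "book",
    "_childDocuments_": [
        {"id": "book1_c1", "type_s": "review"},
        {"id": "book1_c2", "type_s": "review"},
    ],
})
print([doc["id"] for doc in block])  # ['book1_c1', 'book1_c2', 'book1']
```

Note that nothing in the flattened list itself says which document is the parent; only the position within the block carries that meaning.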

Solr committers have been working on the JIRA issues listed below to fix this particular anomaly:

SOLR-6596
SOLR-5211
SOLR-7672

In the example below, we index documents in three different ways into our collection 'books':

1. Index a singleton document with doc-id "book1"
curl http://localhost:8983/solr/books/update?commitWithin=3000 -d '
[{id : book1, type_s:book, title_t : "The Way of Kings"}]'

2. Index a parent document with doc-id "book1" and child documents with doc-ids "book1_c1" and "book1_c2"
curl http://localhost:8983/solr/books/update?commitWithin=3000 -d '
[{id : book1, type_s:book, title_t : "The Way of Kings",
_childDocuments_ : [{ id: book1_c1, type_s:review,stars_i:5},
{ id: book1_c2, type_s:review,stars_i:3}]}]'

3. Index a parent document with doc-id "book1_c1" and child documents with doc-ids "book1" and "book1_c2"
curl http://localhost:8983/solr/books/update?commitWithin=3000 -d '
[{id : book1_c1, type_s:book, title_t : "The Way of Kings",
_childDocuments_ : [{ id: book1, type_s:review,stars_i:5},
{ id: book1_c2, type_s:review,stars_i:3}]}]'

Ideally, there should be just 3 documents: the parent document with doc-id "book1_c1" and its children "book1" and "book1_c2". Querying the collection, however, returns:

{"response":{"numFound":7,"start":0,"docs":[
{"id":"book1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318141913497600},
{"id":"book1_c1","type_s":"review","stars_i":5},
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318146698149888},
{"id":"book1","type_s":"review","stars_i":5},
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1_c1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318151351730176}]}}

Surprise! We get numFound "7". Each update has been treated independently, with the first document indexed in the conventional manner and the other two as contiguous blocks. We get three documents with doc-id "book1", two with "book1_c1" and two with "book1_c2".
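
A quick way to spot such duplicates in a result set is to count how often each id occurs. The sketch below does this for the seven ids from the response above; the helper name duplicated_ids is hypothetical, and the id list is typed in by hand rather than fetched from Solr.

```python
from collections import Counter

# The seven documents returned above, reduced to their ids (in response order).
returned_ids = [
    "book1", "book1_c1", "book1_c2",
    "book1", "book1", "book1_c2",
    "book1_c1",
]


def duplicated_ids(ids):
    """Map each id that occurs more than once to its number of occurrences."""
    return {doc_id: n for doc_id, n in Counter(ids).items() if n > 1}


print(duplicated_ids(returned_ids))
# {'book1': 3, 'book1_c1': 2, 'book1_c2': 2}
```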

Now let's index the first document again:
curl http://localhost:8983/solr/books/update?commitWithin=3000 -d '
[{id : book1, type_s:book, title_t : "The Way of Kings"}]'

Before we look at the results, let's speculate about what should happen. Ideally, this should update the singleton document and leave the documents in the parent-child blocks unaffected.

However:

{"response":{"numFound":5,"start":0,"docs":[
{"id":"book1_c1","type_s":"review","stars_i":5},
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1_c1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318151351730176},
{"id":"book1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318459786166272}]}}

The number of documents dropped from 7 to 5: only one document with doc-id "book1" is listed, the one from the last update, and all three earlier documents with that doc-id are gone! Why? Before explaining this behavior, let's index one more document, a singleton document with doc-id "book1_c1":

curl http://localhost:8983/solr/books/update?commitWithin=3000 -d '[
{id : book1_c1, type_s:book, title_t : "The Way of Kings"}]'

{"response":{"numFound":4,"start":0,"docs":[
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1_c2","type_s":"review","stars_i":3},
{"id":"book1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318602367336448},
{"id":"book1_c1","type_s":"book","title_t":["The Way of Kings"],"_version_":1556318669163724800}]}}

numFound drops from 5 to 4, with only one document with doc-id "book1_c1" remaining.

As many readers may have already deduced, the parent-child blocks broke when we re-indexed document 1 (the singleton document with doc-id "book1"). As stated at the beginning of the article, a parent-child block can be updated only in the form in which it was indexed, that is, as a whole block. If we update an individual document (parent or child), not only is the parent-child relationship destroyed, but Solr also makes sure that only one document with that doc-id remains in the index.
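
The numFound values seen above (7, then 5, then 4) can be reproduced with a small toy model. This is a sketch under an assumption, not Solr's actual code: it assumes that a singleton update first deletes every document with the same doc-id, while a block update first deletes only documents that belong to a block with the same parent doc-id. The class ToyIndex and its methods are hypothetical names.

```python
class ToyIndex:
    """A toy model of the index; NOT Solr code.

    Assumption: a singleton update deletes every document whose id matches,
    while a block update deletes every document whose block root matches
    the parent id. Singleton documents carry no block root.
    """

    def __init__(self):
        self.docs = []  # list of (doc_id, root_id or None)

    def add_singleton(self, doc_id):
        # Delete by id, wherever the id occurs (singleton, parent or child).
        self.docs = [d for d in self.docs if d[0] != doc_id]
        self.docs.append((doc_id, None))

    def add_block(self, parent_id, child_ids):
        # Delete only members of an earlier block with the same parent id.
        self.docs = [d for d in self.docs if d[1] != parent_id]
        # Children first, parent last; every member records the parent as root.
        for doc_id in list(child_ids) + [parent_id]:
            self.docs.append((doc_id, parent_id))


index = ToyIndex()
index.add_singleton("book1")
index.add_block("book1", ["book1_c1", "book1_c2"])
index.add_block("book1_c1", ["book1", "book1_c2"])
counts = [len(index.docs)]        # after the three initial updates

index.add_singleton("book1")
counts.append(len(index.docs))    # after re-indexing singleton "book1"

index.add_singleton("book1_c1")
counts.append(len(index.docs))    # after indexing singleton "book1_c1"

print(counts)  # [7, 5, 4]
```

Under this assumption, the block updates never remove the singleton "book1" (it has no block root), which is how the duplicates arise, and the singleton updates remove matching ids from inside the blocks, which is how the blocks get broken.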

One should, therefore, be very careful when indexing parent-child documents in Solr!

Using MERGEINDEXES tool directly or indirectly (MapReduceIndexerTool)

Please refer to the official Solr cwiki page for details: MERGEINDEXES

When we trigger this Core API action, it merges the specified indexes into the target core without checking for duplicate doc-ids. MapReduceIndexerTool calls the MERGEINDEXES API internally, so it can run into the same anomaly.

De-duplicating the documents would be impossible to do correctly, given how Solr works. Say we have the same document (in two different versions) in two sub-indexes that are being merged. How would one know which one was the latest? In the Map/Reduce case we have no control over when documents get indexed, so relying on a timestamp won't work (imagine the "right" document happens to be processed on a faster machine than the "wrong" document; the "right" document would then get an earlier timestamp). Doc-ids are not required to be unique at the Lucene level; uniqueness is imposed at the Solr level, and since merging is a low-level operation, that constraint is not "known" there.

Before merging several indexes into one, one should be absolutely sure that they do not contain the same doc-ids. Using a different doc-id format for the indexes of different cores can be an effective habit (although this does not help when the indexes hold different versions of the same document).
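
As a rough sketch of such a pre-merge check, the snippet below reports doc-ids that occur in more than one index and shows a per-core id format. It works on plain Python lists; how the ids are exported from each core is left out, and the function names, core names and the "core-id" format are hypothetical choices for this example.

```python
def overlapping_ids(*id_collections):
    """Return the ids that appear in more than one of the given collections."""
    seen, overlap = set(), set()
    for ids in id_collections:
        unique = set(ids)
        overlap |= seen & unique   # ids already seen in an earlier collection
        seen |= unique
    return overlap


def with_core_prefix(core_name, ids):
    """Give ids a per-core format, e.g. 'core1-book1' (hypothetical scheme)."""
    return [f"{core_name}-{doc_id}" for doc_id in ids]


core1_ids = ["book1", "book2"]
core2_ids = ["book2", "book3"]

print(overlapping_ids(core1_ids, core2_ids))  # {'book2'}
print(overlapping_ids(with_core_prefix("core1", core1_ids),
                      with_core_prefix("core2", core2_ids)))  # set()
```

With prefixed ids the merged index has no collisions, but "core1-book2" and "core2-book2" are then two distinct documents, so this only suits indexes that are not meant to hold the same logical document.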

Please share your feedback and suggestions on the above in the comments section, and mention any other factors that can lead to multiple documents with the same doc-id. Cheers!
