Indications
After a sequence of datasource crawls, you may notice that the document count shown in the Collections Manager is smaller than the number of documents with an 'id' field as displayed in the Fields panel.
Since every document has an 'id' field, why aren't these values the same?
What's happening
The Total Documents figure in the Collections Manager is straightforward: it is the number of docs in the collection that can be retrieved (for example, with a query like "*:*"). This is the number to use when answering the question "how many documents are in the Solr collection?"
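As a rough illustration of that retrievable count, the sketch below asks Solr's standard select handler for "*:*" with zero rows and reads `numFound` from the JSON response. The base URL and the collection name `demo` are assumptions made for this example, not values from the article; adjust them for your installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(base_url, collection):
    """Build a select URL that matches every retrievable doc but returns no rows."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


def parse_num_found(body):
    """Extract the retrievable-document count from a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def count_retrievable(base_url, collection):
    """Fetch the count from a running Solr instance (requires network access)."""
    with urlopen(count_url(base_url, collection)) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


# Hypothetical usage; the URL and collection name are assumptions:
# print(count_retrievable("http://localhost:8983/solr", "demo"))
```

Because `rows` is 0, Solr returns only the count and no documents, so the call stays cheap even on a large collection.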
The Fields report gets its data from Lucene. It shows the number of documents containing that field across all segments of the collection, including documents that have been marked as deleted. When Solr overwrites or deletes a document, it only marks the old copy as not retrievable; the copy remains in its segment until that segment is merged away. In the collection used for this example, the segment view reports 19.01% deletions.
Those deletions are the reason you see 121 docs with an 'id' field but only 98 retrievable documents: 23 of the 121 (19.01%) are marked as deleted but still present in the segments.
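The arithmetic behind that percentage can be checked with a few lines of Python. This is only a worked calculation using the two counts from this example; the function name is made up for illustration.

```python
def deleted_stats(total_in_segments, retrievable):
    """Return (deleted count, deleted percentage) for a collection."""
    deleted = total_in_segments - retrievable
    pct = 100.0 * deleted / total_in_segments
    return deleted, round(pct, 2)


# 121 docs with an 'id' field in the segments, 98 retrievable docs
deleted, pct = deleted_stats(121, 98)
print(deleted, pct)  # 23 19.01
```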
As more incremental crawls take place, these segments will get merged with other segments, and the documents deleted in Solr will actually be removed from disk.
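To make the mark-then-merge behavior concrete, here is a deliberately simplified toy model. It is not Lucene's real data structure or merge policy; the `Segment` and `ToyIndex` classes are hypothetical and only mimic the counting effect described above.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    # Each entry is [doc_id, live]; live=False means "marked as deleted".
    docs: list = field(default_factory=list)


class ToyIndex:
    """Hypothetical toy model of segments; not Lucene's actual implementation."""

    def __init__(self):
        self.segments = []

    def crawl(self, doc_ids):
        """Write a batch into a new segment, hiding older copies of the same ids."""
        incoming = set(doc_ids)
        for seg in self.segments:
            for entry in seg.docs:
                if entry[0] in incoming:
                    entry[1] = False  # old copy stays in its segment, just hidden
        self.segments.append(Segment([[d, True] for d in doc_ids]))

    def retrievable(self):
        """Count live docs, like the Collections Manager total."""
        return sum(1 for s in self.segments for e in s.docs if e[1])

    def with_id_field(self):
        """Count every doc still stored in a segment, like the Fields panel."""
        return sum(len(s.docs) for s in self.segments)

    def merge_all(self):
        """Merge all segments into one, dropping docs marked as deleted."""
        live = [e for s in self.segments for e in s.docs if e[1]]
        self.segments = [Segment(live)]


index = ToyIndex()
index.crawl([f"doc{i}" for i in range(98)])  # initial crawl on an empty collection
index.crawl([f"doc{i}" for i in range(23)])  # incremental crawl overwrites 23 docs
before = (index.retrievable(), index.with_id_field())
index.merge_all()
after = (index.retrievable(), index.with_id_field())
print(before, after)  # (98, 121) (98, 98)
```

In this model the two counts diverge after the second crawl and agree again once the segments are merged, which mirrors the behavior described for the real collection.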
Summary
During an initial crawl on an empty collection, the number of documents reported by the Collections Manager and the Fields panel should be the same. If subsequent crawls are done on an existing collection, those counts may start differing from each other due to the way overwrites and deletions are handled in the Lucene segments.