How can I determine how many documents each Solr shard should have for acceptable performance?
Fusion (all versions), Solr (all versions)
The right number of documents per shard depends on your hardware, the complexity of your queries, and the query latency you can accept. There is no single answer, and finding the right number requires testing on each system, but 10 million to 100 million documents per shard is a good starting point.
To determine a more exact number of documents per shard, test what is important to you. Load (serving many queries at once) does not matter at this stage, because it is better handled later by adding replicas rather than shards. Only single-query performance matters when evaluating the optimal number of shards.
Does your system require fast updates? Do you need fast queries? Start with (virtual) hardware similar to what you plan to run in production. On that hardware, measure query speed or per-document indexing speed for your specific queries at different collection sizes, increasing the number of documents in the collection with each test.
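The testing loop described above could be sketched as follows. This is only a minimal illustration, not a Solr or Fusion tool: `load_to` and `run_query` are hypothetical callables you would supply yourself (for example, one that indexes documents until the collection reaches a given size, and one that sends a representative query to your collection's `/select` handler). The repeat and warm-up counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def median_latency_ms(run_query: Callable[[], object],
                      repeats: int = 20,
                      warmup: int = 3) -> float:
    """Return the median wall-clock time of run_query() in milliseconds."""
    # Warm-up calls are not timed, so cold caches do not skew the result.
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def benchmark_by_size(load_to: Callable[[int], object],
                      run_query: Callable[[], object],
                      doc_counts: Iterable[int],
                      repeats: int = 20) -> Dict[int, float]:
    """Measure median query latency at each collection size, smallest first.

    load_to(n) is a hypothetical hook that grows the test collection to
    n documents; run_query() is a hypothetical hook that runs one query.
    """
    results = {}
    for n in sorted(doc_counts):
        load_to(n)
        results[n] = median_latency_ms(run_query, repeats=repeats)
    return results
```

The same harness can be pointed at indexing speed by passing a `run_query` that indexes a fixed batch of documents instead of issuing a search. Keeping the timing code separate from the Solr calls means the measurements stay comparable across test runs.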
Find the number of documents at which performance deteriorates past what is acceptable. This is your baseline number of documents per shard. Sharding adds some overhead, and you should also estimate how much the collection will grow over time, so aim for somewhat fewer documents per shard than this baseline. The overhead increases with the number of shards. Sharding also means you need good performance across the whole cluster, since a distributed query is only as fast as its slowest shard.
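As a rough illustration of that arithmetic, the sketch below picks the largest tested size that stayed within a latency limit, then derives a per-shard target and a shard count. The measurements, the 150 ms limit, the 0.8 headroom factor, and the 1.5 growth factor are all made-up assumptions for the example, not recommendations; substitute values from your own tests.

```python
import math
from typing import Dict, Optional, Tuple


def max_acceptable_docs(latency_by_docs: Dict[int, float],
                        limit_ms: float) -> Optional[int]:
    """Largest tested document count before latency first exceeds limit_ms."""
    best = None
    for n in sorted(latency_by_docs):
        if latency_by_docs[n] > limit_ms:
            break
        best = n
    return best


def plan_shards(total_docs: int,
                base_docs_per_shard: int,
                growth_factor: float = 1.5,
                headroom: float = 0.8) -> Tuple[int, int]:
    """Return (target docs per shard, number of shards).

    headroom < 1 leaves room for sharding overhead; growth_factor > 1
    accounts for expected growth. Both defaults are assumptions.
    """
    target = round(base_docs_per_shard * headroom)
    projected = math.ceil(total_docs * growth_factor)
    shards = max(1, math.ceil(projected / target))
    return target, shards


# Illustrative (invented) measurements: docs in collection -> median ms.
measurements = {10_000_000: 40.0, 25_000_000: 70.0,
                50_000_000: 130.0, 100_000_000: 320.0}
base = max_acceptable_docs(measurements, limit_ms=150.0)
if base is not None:
    print(plan_shards(total_docs=120_000_000, base_docs_per_shard=base))
```

If even the smallest tested size exceeds the limit, `max_acceptable_docs` returns `None`, which suggests testing smaller collections or revisiting the hardware or queries before planning shards.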