Differences in search result order due to collection size variations – Lucidworks

Issue

Search queries return the same number of matching documents across two environments, but the ordering of results and the maximum score values differ. This occurs even when data, schema, query pipelines, and query formation are identical.

Diagnosis

To confirm whether collection size differences are affecting ranking:

In Fusion, open the Query Workbench for the collection in question.
Set the view mode to Debug to inspect scoring components.
Compare the idf (inverse document frequency) and tf (term frequency) values for the same document across the two environments.

explain
   https://someurl.com/page
   3.088692 = weight(_text_:something in 0) [SchemaSimilarity], result of:
     3.088692 = score(freq=4.0), computed as boost * idf * tf from:
       3.6557271 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
         11 = n, number of documents containing term
         444 = N, total number of documents with field
       0.8448913 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
         4.0 = freq, occurrences of term within document
         1.2 = k1, term saturation parameter
         0.75 = b, length normalization parameter
         344.0 = dl, length of field (approximate)
         712.8108 = avgdl, average length of field
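
The arithmetic in the explain output above can be checked by hand. The following is a minimal Python sketch that plugs the values from this example into the two formulas printed by the explain output. The function names are illustrative, and boost is assumed to be 1.0 because the output does not list a separate boost value.

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Values copied from the explain output above.
idf = bm25_idf(n=11, N=444)
tf = bm25_tf(freq=4.0, k1=1.2, b=0.75, dl=344.0, avgdl=712.8108)

# Assumption: boost is 1.0, since the explain output shows score = idf * tf.
boost = 1.0
score = boost * idf * tf

print(f"idf={idf:.4f} tf={tf:.4f} score={score:.4f}")
```

The results should agree with the explain output to about four decimal places. Small differences in the last digits are expected because Solr computes these values in single precision.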

Check the total document count in each collection.

Example command to view collection document counts in Solr:

curl -u USERNAME:PASSWORD "https://FUSION_HOST/api/solr/COLLECTION_NAME/select?q=*:*&rows=0"

In the JSON response, review the numFound value.

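If the idf values for the same document and term differ between environments, compare the N and n lines in the explain output. A different N with the same n indicates that the collections have different total document counts.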
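
The document count comparison can be scripted once the response from each environment has been saved. The sketch below is a minimal Python example under stated assumptions: the two JSON strings are hypothetical, abbreviated responses (the environment names and counts are made up), and `num_found` is an illustrative helper, not part of Fusion or Solr.

```python
import json

def num_found(select_response_text: str) -> int:
    """Extract response.numFound from a Solr select response in JSON format."""
    return json.loads(select_response_text)["response"]["numFound"]

# Hypothetical, abbreviated responses to the q=*:*&rows=0 request shown above.
production = '{"responseHeader": {"status": 0}, "response": {"numFound": 10000, "start": 0, "docs": []}}'
staging = '{"responseHeader": {"status": 0}, "response": {"numFound": 444, "start": 0, "docs": []}}'

counts = {"production": num_found(production), "staging": num_found(staging)}

if counts["production"] != counts["staging"]:
    print(f"Document counts differ: {counts}")
else:
    print(f"Document counts match: {counts}")
```

Note that numFound for q=*:* counts every document in the collection, while N in the explain output counts only documents that have the queried field, so the two numbers need not be identical.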

Environment

Any Fusion version. Applicable to all collections using Solr's default scoring, which combines idf and tf components (the BM25-style formula shown in the explain output above).

Cause

In Solr’s default similarity algorithm, the idf for a term is calculated from N, the total number of documents that have the field, and n, the number of documents containing the term. It does not depend on how many documents match the query. When two collections return the same matching documents but have different total document counts, N differs, so idf values for the same terms differ. The average field length (avgdl) used in the tf calculation can differ as well. This leads to different final scores and potentially different ranking order.
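
To illustrate how a change in N alone can reorder results, here is a minimal Python sketch using the idf formula from the explain output. All numbers are hypothetical: doc_a matches a more common term (n = 100) with a higher tf factor, and doc_b matches a rarer term (n = 11) with a lower tf factor. The tf factors are held fixed for simplicity, although in practice avgdl, and therefore tf, may also change between collections.

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical documents: n is the document frequency of the term each one
# matches, tf is a fixed tf factor assumed for illustration.
DOCS = {
    "doc_a": {"n": 100, "tf": 0.9},
    "doc_b": {"n": 11, "tf": 0.5},
}

def rank(N: int):
    """Score each document as idf * tf for a collection of size N, best first."""
    scores = {name: bm25_idf(d["n"], N) * d["tf"] for name, d in DOCS.items()}
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores

small_order, small_scores = rank(444)       # smaller collection
large_order, large_scores = rank(100_000)   # larger collection, same n values

print("N=444:    ", small_order, small_scores)
print("N=100000: ", large_order, large_scores)
```

With these assumed inputs, doc_b ranks first in the smaller collection and doc_a ranks first in the larger one, even though n is the same for both terms in both collections.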

Resolution

To achieve consistent ranking between environments:

Ensure that the collections being compared contain the same full set of documents, not just the same matching documents, so that N, n, and avgdl are identical.
If only a subset of documents is needed in one environment, expect score and rank differences due to idf variations.
For testing or staging environments, mirror the full production index when validating search order.
