Export and import data between Solr collections with large datasets – Lucidworks

Goal

Export data from an existing Solr collection and import it into a new collection—commonly done for archiving, data segmentation (such as by year), or testing in lower environments.

This article outlines how to correctly perform large-scale data exports and imports using Solr’s /export and /select handlers, especially when working with collections exceeding 10 million records.

Environment

Fusion environments running Solr 8.x (including Solr 8.11.2)

Applicable for:

Clients managing large datasets (>10M documents)
Use cases requiring year-wise export, schema compatibility, or full collection duplication

Guide

Prepare schema for export

The /export handler requires that:

All fields listed in the fl parameter have docValues="true"
Every field used in the sort parameter is single-valued and has docValues="true"

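If some fields do not meet these requirements, consider using <copyField> directives in the schema to copy their values into fields that do have docValues enabled, and list those copies in fl instead. Because copyField is applied at index time, existing documents must be reindexed before the new field holds any values.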

Example workaround:

<copyField source="legacy_field" dest="legacy_field_docval"/>
<field name="legacy_field_docval" type="string" docValues="true" stored="true"/>
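
Before exporting, it can help to confirm which of the intended fl fields actually report docValues. The following is a minimal sketch, assuming the Schema API is reachable at /schema/fields and that showDefaults=true includes properties inherited from the field type. The helper name fields_without_docvalues is hypothetical, and fields matched only by dynamic field rules will not appear in this listing.

```shell
# List the fields from a comma-separated fl list that do NOT report docValues=true.
# Reads a Schema API response (GET .../schema/fields?showDefaults=true) on stdin.
# fields_without_docvalues is a hypothetical helper name, not a Solr command.
fields_without_docvalues() {
  jq -r --arg fl "$1" '
    ($fl | split(",")) as $wanted
    | .fields[]
    | select(.name as $n | $wanted | index($n))
    | select(.docValues != true)
    | .name'
}

# Usage (placeholders as in the rest of this article):
# curl --user USER:PASSWORD \
#   "http://localhost:8983/solr/<collection_name>/schema/fields?showDefaults=true" \
#   | fields_without_docvalues "field1,field2,field3"
```

Any field name printed by the helper is a candidate for the copyField workaround above.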

Export documents from a collection

Use Solr's /export handler via curl. This handler supports efficient streaming of large result sets.

curl --user USER:PASSWORD "http://localhost:8983/solr/<collection_name>/export?q=*:*&fq=fiscal_year:2025&sort=id+asc&fl=field1,field2,field3" -o /path/to/output.json

Important notes:

Wildcards such as fl=* are not supported
All fields in fl must be explicitly listed and must have docValues=true
/export does not return a collection-wide numFound; any count in its response covers only the core that served the request
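
The notes above can be enforced before a long export starts. The sketch below assembles the /export URL from its parts and refuses an empty or wildcard fl. The helper name build_export_url is hypothetical, and the function assumes the values are already URL-safe (a filter containing spaces would need encoding first).

```shell
# Build an /export URL from its parts, refusing wildcard field lists.
# build_export_url is a hypothetical helper; arguments are:
#   $1 base core or collection URL   $2 filter query (fq)
#   $3 sort field                    $4 comma-separated fl
# Assumes all values are already URL-safe.
build_export_url() {
  case "$4" in
    ''|*'*'*)
      echo "fl must list explicit field names (no wildcards)" >&2
      return 1
      ;;
  esac
  printf '%s/export?q=*:*&fq=%s&sort=%s+asc&fl=%s\n' "$1" "$2" "$3" "$4"
}

# Usage:
# url=$(build_export_url "http://localhost:8983/solr/<collection_name>" \
#   "fiscal_year:2025" "id" "field1,field2,field3") \
#   && curl --user USER:PASSWORD "$url" -o /path/to/output.json
```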

Verify document counts

Because /export does not return total document counts (numFound), run a separate /select query to get the actual expected count:

curl --user USER:PASSWORD "http://localhost:8983/solr/<collection_name>/select?q=fiscal_year:2025&rows=0&wt=json"

This returns the numFound field in the response for validation.

To count the number of records actually exported to disk:

jq -c '.response.docs[]' /path/to/output.json | wc -l

Alternatively:

jq '.response.docs | length' /path/to/output.json
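
The two checks can be combined into one comparison. This is a sketch only: it assumes the /select response was saved to a file (for example with curl -o), and check_export_count is a hypothetical helper name.

```shell
# Compare numFound from a saved /select response with the number of docs
# in a saved /export file. check_export_count is a hypothetical helper.
#   $1 path to the /select response (rows=0, wt=json)
#   $2 path to the /export output
# Returns 0 when the counts match, 1 when they differ, 2 if jq cannot read a file.
check_export_count() {
  expected=$(jq '.response.numFound' "$1") || return 2
  actual=$(jq '.response.docs | length' "$2") || return 2
  echo "expected=$expected exported=$actual"
  [ "$expected" -eq "$actual" ]
}

# Usage:
# check_export_count /path/to/select_response.json /path/to/output.json \
#   || echo "count mismatch, re-check commits and the replica used" >&2
```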

Understand distributed export limitations

The /export handler does not perform a distributed search; each request reads only from the core that receives it. This means:

It must be run against one replica of each shard individually
Results must be consolidated manually if needed
Data mismatch may occur if soft commits are pending or if replicas are not fully synchronized

To avoid discrepancies:

Export only from leader replicas of each shard
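Issue a commit before exporting so that all pending updates are visible to searches
Run /export and /select against the same replica when verifying counts, and add distrib=false to the /select request so that it counts only that core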
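
Finding the leader of each shard can be scripted. The sketch below assumes a SolrCloud deployment where the Collections API CLUSTERSTATUS action is available and returns the Solr 8.x layout, in which each replica entry carries base_url, core, and a leader flag. The helper name leader_core_urls is hypothetical.

```shell
# Print one core URL per shard leader from a CLUSTERSTATUS response on stdin.
# leader_core_urls is a hypothetical helper; $1 is the collection name.
# Assumes the Solr 8.x response layout (replicas keyed by name, leader flag
# present only on the leader, usually as the string "true").
leader_core_urls() {
  jq -r --arg c "$1" '
    .cluster.collections[$c].shards[]
    | .replicas[]
    | select(.leader == "true" or .leader == true)
    | "\(.base_url)/\(.core)"'
}

# Usage sketch: export each shard from its leader into a separate file.
# curl --user USER:PASSWORD \
#   "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=<collection_name>&wt=json" \
#   | leader_core_urls "<collection_name>" \
#   | while read -r core_url; do
#       curl --user USER:PASSWORD \
#         "$core_url/export?q=*:*&fq=fiscal_year:2025&sort=id+asc&fl=field1,field2,field3" \
#         -o "/path/to/output_$(basename "$core_url").json"
#     done
```

Writing one file per shard keeps the consolidation step explicit and makes it easy to re-run a single shard if its count does not match.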

Import data into a new collection

Import data using the /dataimport handler or direct indexing (e.g., POST to /update/json).

Ensure:

Schema compatibility between source and target collections
Field mappings align for the import method used

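Example using curl to POST exported data. The /update handler expects an array of documents rather than the full export response, so extract the docs array first:

jq -c '.response.docs' /path/to/output.json > /path/to/docs.json
curl --user USER:PASSWORD "http://localhost:8983/solr/<new_collection>/update?commit=true" \
  -H "Content-Type: application/json" \
  --data-binary @/path/to/docs.json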

For structured ingestion, transform the data as needed to match the schema before importing.
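
For very large exports, a single POST may be impractical. One possible approach, sketched below, splits the exported documents into fixed-size batches that can each be sent to /update as a JSON array. The helper name split_export_docs, the part_ file prefix, and the batch size are illustrative assumptions, not Solr conventions.

```shell
# Split the docs of an /export file into JSON-array batch files.
# split_export_docs is a hypothetical helper:
#   $1 export file   $2 docs per batch   $3 output directory
split_export_docs() {
  mkdir -p "$3" || return 1
  # Write one JSON document per line, then cut into fixed-size line chunks.
  jq -c '.response.docs[]' "$1" | split -a 4 -l "$2" - "$3/part_"
  # Wrap each chunk of lines back into a single JSON array.
  for part in "$3"/part_*; do
    case "$part" in *.json) continue ;; esac
    [ -e "$part" ] || continue
    jq -c -s '.' "$part" > "$part.json" && rm "$part"
  done
}

# Usage sketch: index every batch, then commit once at the end.
# split_export_docs /path/to/output.json 10000 /path/to/batches
# for batch in /path/to/batches/part_*.json; do
#   curl --user USER:PASSWORD "http://localhost:8983/solr/<new_collection>/update" \
#     -H "Content-Type: application/json" --data-binary @"$batch"
# done
# curl --user USER:PASSWORD "http://localhost:8983/solr/<new_collection>/update?commit=true"
```

Committing once after the last batch, rather than on every request, avoids opening a new searcher for each batch.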

Additional considerations

Use the rows parameter with /select if performing test exports
/export reads field values from docValues only. A field used in fl or sort without docValues="true" causes Solr to return an error, regardless of its stored, uninvertible, or useDocValuesAsStored settings
When working with over 10 million records, break the export into year-based or other segment-based queries using filter queries (fq)
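
Segmenting by year can be scripted as a simple loop. The sketch below generates one fq value per year in a range, reusing the fiscal_year field from the earlier examples. The helper name year_filters and the output file naming are assumptions for illustration.

```shell
# Print one fq value per fiscal year in an inclusive range.
# year_filters is a hypothetical helper; fiscal_year is the example field
# used earlier in this article.
#   $1 first year   $2 last year
year_filters() {
  year=$1
  while [ "$year" -le "$2" ]; do
    printf 'fiscal_year:%s\n' "$year"
    year=$((year + 1))
  done
}

# Usage sketch: one export file per year.
# year_filters 2020 2025 | while read -r fq; do
#   curl --user USER:PASSWORD \
#     "http://localhost:8983/solr/<collection_name>/export?q=*:*&fq=$fq&sort=id+asc&fl=field1,field2,field3" \
#     -o "/path/to/output_${fq#fiscal_year:}.json"
# done
```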

Let us know if you need further help with schema changes or handling bulk ingestion workflows.