Resolving RecordFormatException when indexing large files using Tika async parsing – Lucidworks

Issue

When indexing large Microsoft Office files (such as Excel or Word documents over 100 MB) with the Tika asynchronous parser in Fusion 5.x, indexing may fail with errors similar to the following in the async-parsing pod logs:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 111,140,834, but the maximum length for this record type is 100,000,000.

This results in documents not being fully processed or indexed.

Diagnosis

To confirm this issue:

Review the logs for the async-parsing pod in the relevant Kubernetes namespace.
Look for RecordFormatException entries that report a failed array allocation and state the maximum length allowed for the record type.
Example log message:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length <number>, but the maximum length for this record type is 100,000,000.

If these errors are present during attempts to index large files, this article applies.
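The log check above can be narrowed with a small filter. This is a minimal sketch, not an official Lucidworks tool: the function name find_poi_limit_errors is hypothetical, and the pod name and namespace in the usage comment are placeholders you must supply.

```shell
#!/usr/bin/env bash
# Filter log text on stdin down to the POI array-limit errors described above.
# grep -F matches the fixed string literally, so no characters are treated as regex.
find_poi_limit_errors() {
  grep -F 'RecordFormatException: Tried to allocate an array of length'
}

# Hypothetical usage (placeholders, not real names):
#   kubectl logs <async-parsing-pod-name> -n <fusion-namespace> | find_poi_limit_errors
```

The function exits with a non-zero status when no matching line is found, which makes it usable in a script condition.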

Environment

Fusion 5.9.12
Connectors: SharePoint Optimized (v2.1.0)

Cause

This issue occurs because Apache POI enforces a default safety limit on byte array allocation for certain record types. When the Tika async parser processes a file containing a record that needs an array larger than the 100,000,000-byte limit, POI throws a RecordFormatException and the file cannot be parsed or indexed.

Setting the byteArrayMaxOverride parameter in tika-config.xml raises the allowed maximum array size so that large files can be parsed.

Resolution

To resolve the error and enable parsing of larger files, update the tika-config.xml used by the async-parsing pod to specify a higher value for the byteArrayMaxOverride parameter.

Step 1: Edit the tika-config.xml

Modify the tika-config.xml file used by the async-parsing service to include the byteArrayMaxOverride parameter. Set this value, in bytes, to a number larger than your largest expected file size (e.g., 200000000 for 200 MB). Because the parameter is typed as an int, the value cannot exceed 2147483647.
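One way to choose a value is to start from the length reported in the error message. The sketch below is illustrative only: the function names are hypothetical, and the 50% headroom is an arbitrary margin chosen for the example, not a Lucidworks or Apache recommendation. The cap reflects the largest value an int can hold.

```shell
#!/usr/bin/env bash
# Largest value an int-typed parameter can hold (Java Integer.MAX_VALUE).
INT_MAX=2147483647

# Read a RecordFormatException message on stdin and print the requested
# array length with thousands separators removed (111,140,834 -> 111140834).
requested_length() {
  sed -n 's/.*array of length \([0-9,]*\), but.*/\1/p' | tr -d ','
}

# Add 50% headroom (assumed margin, for illustration) and cap at INT_MAX.
suggest_override() {
  local requested=$1
  local suggested=$(( requested + requested / 2 ))
  if (( suggested > INT_MAX )); then
    suggested=$INT_MAX
  fi
  echo "$suggested"
}
```

For the error shown in this article, requested_length yields 111140834 and suggest_override 111140834 prints 166711251, so the 200000000 used in the example configuration leaves comfortable room.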

Example configuration snippet:

<parsers>
  <parser class="org.apache.tika.parser.DefaultParser">
    <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
  </parser>
  <parser class="org.apache.tika.parser.microsoft.OfficeParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">200000000</param>
    </params>
  </parser>
</parsers>

Ensure this block is added inside the root <properties> element of your tika-config.xml file.

Step 2: Apply the configuration

Update the ConfigMap that contains tika-config.xml in your Kubernetes cluster, using the method appropriate for your deployment.
For example, if the ConfigMap is defined in a YAML manifest, apply it with a command similar to:

kubectl apply -f <your-tika-configmap-file>.yaml -n <fusion-namespace>
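If you only have the edited tika-config.xml and not a manifest, kubectl can render one from the file. This is a sketch under assumptions: the function name render_tika_configmap is hypothetical, the ConfigMap name and namespace are values you must look up in your own cluster, and the key name tika-config.xml is assumed to match what the async-parsing service mounts.

```shell
#!/usr/bin/env bash
# Print a ConfigMap manifest built from an edited tika-config.xml.
# --dry-run=client renders the object locally and changes nothing in the cluster.
# Arguments: ConfigMap name, path to tika-config.xml, namespace.
render_tika_configmap() {
  local name=$1 file=$2 namespace=$3
  kubectl create configmap "$name" \
    --from-file=tika-config.xml="$file" \
    -n "$namespace" \
    --dry-run=client -o yaml
}

# Hypothetical usage (placeholders, not real names):
#   render_tika_configmap <configmap-name> ./tika-config.xml <fusion-namespace> > tika-configmap.yaml
```

Review the generated YAML before applying it, since it replaces the whole ConfigMap content rather than merging with what is already there.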

Step 3: Restart the async-parsing pod

After updating the configuration, restart the async-parsing pod to apply the changes:

kubectl delete pod <async-parsing-pod-name> -n <fusion-namespace>

Kubernetes automatically creates a replacement pod, which picks up the updated configuration.

Step 4: Validate resolution

Retry indexing a large file that previously triggered the error.
Confirm that no new RecordFormatException entries appear in the async-parsing logs.
Check that documents are now fully processed and indexed as expected.

Note: If the pod fails to start after editing tika-config.xml, verify that the XML is well-formed and that the parsers block sits inside the root <properties> element. A malformed configuration will prevent the Tika server from starting.
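A syntax check before applying the change can catch malformed XML early. The sketch below assumes the xmllint utility (part of libxml2) is installed on your workstation. The function name check_tika_config is hypothetical. It checks well-formedness only and does not confirm that the parsers block is in the right place.

```shell
#!/usr/bin/env bash
# Return 0 if the file is well-formed XML, non-zero otherwise.
# Returns 2 when xmllint is not installed, so callers can tell the cases apart.
check_tika_config() {
  local file=$1
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found; install the libxml2 utilities to run this check" >&2
    return 2
  fi
  # --noout suppresses the parsed document and prints only errors.
  xmllint --noout "$file"
}

# Hypothetical usage:
#   check_tika_config ./tika-config.xml && echo "tika-config.xml is well-formed"
```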