Count tokens in a query using Solr analysis API and a JavaScript query stage – Lucidworks

Goal

Determine the number of tokens in a user query during query processing and adjust the query logic based on that count, for example by requiring all terms to match (AND) for short queries and relaxing to OR with a minimum-match threshold for longer ones. This helps balance precision and recall, particularly in multilingual environments where each language is tokenized differently.

Environment

Fusion 5.5.1+

Guide

To adjust query behavior based on the number of tokens in the user query, use the Solr analysis API within a custom JavaScript query pipeline stage. This allows you to retrieve token information in real time and make conditional modifications to the query request.

Step 1: Make a call to the Solr Analysis API

Use the Solr /analysis/field endpoint to analyze the input text using a specific field type. This ensures tokenization aligns with Solr’s internal processing, including filters like stopwords and stemming.

Example API request:

GET /api/solr/<collection-name>/analysis/field?wt=json
&analysis.fieldvalue=How to find a splunk forwarder
&analysis.fieldtype=text_en

Replace:

<collection-name> with the name of the Solr collection
the value of analysis.fieldvalue with the URL-encoded user query
the value of analysis.fieldtype with the Solr field type that matches the query language
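The request above can be assembled programmatically so that spaces and reserved characters in the user query are always encoded. This is a minimal sketch: the helper name buildAnalysisUrl is hypothetical, and the /api/solr path prefix is taken from the example request, so adjust it if your deployment routes Solr differently.

```javascript
// Hypothetical helper: build the /analysis/field URL for a query string.
// encodeURIComponent keeps characters such as spaces, "&" and "+" from
// breaking the query string.
function buildAnalysisUrl(collection, fieldValue, fieldType) {
  return "/api/solr/" + encodeURIComponent(collection) + "/analysis/field" +
    "?wt=json" +
    "&analysis.fieldvalue=" + encodeURIComponent(fieldValue) +
    "&analysis.fieldtype=" + encodeURIComponent(fieldType);
}

var url = buildAnalysisUrl("my_collection", "How to find a splunk forwarder", "text_en");
```

Encoding the field type as well is defensive; names such as text_en pass through unchanged.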

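The API response returns the tokens produced at each stage of the analysis chain. Because the text is passed as analysis.fieldvalue, the stages appear under the index key of the field type and reflect its index-time analyzer. The index array alternates between the class name of an analysis component and the list of tokens that component emitted, so its last element is the token list from the final filter (e.g., the output of PorterStemFilter). Use that last element as the processed token list.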
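To make the response shape concrete, the sketch below extracts the final-stage tokens from a hand-written, abridged sample. The sample is illustrative only and is not captured from a real Solr instance; the actual stages and tokens depend on how text_en is defined in your schema. The function name finalStageTokens is hypothetical.

```javascript
// Hand-written, abridged sample in the shape of an /analysis/field response.
var sampleResponse = {
  analysis: {
    field_types: {
      text_en: {
        index: [
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [{ text: "How" }, { text: "to" }, { text: "find" },
           { text: "a" }, { text: "splunk" }, { text: "forwarder" }],
          "org.apache.lucene.analysis.core.StopFilter",
          [{ text: "How" }, { text: "find" }, { text: "splunk" }, { text: "forwarder" }],
          "org.apache.lucene.analysis.core.LowerCaseFilter",
          [{ text: "how" }, { text: "find" }, { text: "splunk" }, { text: "forwarder" }],
          "org.apache.lucene.analysis.en.PorterStemFilter",
          [{ text: "how" }, { text: "find" }, { text: "splunk" }, { text: "forward" }]
        ]
      }
    }
  }
};

// Return the token texts from the last analysis stage, or [] if the
// response does not have the expected shape.
function finalStageTokens(response, fieldType) {
  var types = response && response.analysis && response.analysis.field_types;
  var stages = types && types[fieldType] && types[fieldType].index;
  if (!Array.isArray(stages) || stages.length === 0) {
    return [];
  }
  var last = stages[stages.length - 1];
  if (!Array.isArray(last)) {
    return [];
  }
  return last.map(function (token) { return token.text; });
}

var tokens = finalStageTokens(sampleResponse, "text_en");
var tokenCount = tokens.length;
```

In this sample the six input words reduce to four tokens after stopword removal, which is why counting on the final stage can give a different result than splitting the raw query on whitespace.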

Step 2: Parse the token count in a JavaScript query pipeline stage

Add a JavaScript query stage to your pipeline that makes the Solr analysis API call, parses the JSON response, and counts the number of tokens from the final stage of analysis.

Example JavaScript snippet:

// Assumption: an HTTP helper whose get() returns the parsed JSON body.
// Adapt this line to the HTTP client available in your Fusion JavaScript stage.
var http = require("http");

var fieldValue = request.queryParams.q || ""; // Or wherever the query string is sourced
var lang = request.queryParams.lang || "en"; // Assumes language is passed in
var fieldTypeMap = {
  "en": "text_en",
  "de": "text_de",
  "fr": "text_fr"
};

var fieldType = fieldTypeMap[lang] || "text_en";
var analysisUrl = "/api/solr/my_collection/analysis/field?wt=json" +
  "&analysis.fieldvalue=" + encodeURIComponent(fieldValue) +
  "&analysis.fieldtype=" + encodeURIComponent(fieldType);

var response = http.get(analysisUrl);
var tokens = [];

if (response && response.analysis && response.analysis.field_types && response.analysis.field_types[fieldType]) {
  var indexArray = response.analysis.field_types[fieldType].index;
  var finalStageTokens = indexArray[indexArray.length - 1]; // Token list from the last filter stage

  if (Array.isArray(finalStageTokens)) {
    tokens = finalStageTokens.map(function (token) {
      return token.text;
    });
  }
}

// If the analysis call fails, tokens stays empty and the AND branch applies.
if (tokens.length < 4) {
  // Short query: require every term to match
  request.queryParams.mm = "100%";
  request.queryParams["q.op"] = "AND"; // "q.op" contains a dot, so bracket notation is required
} else {
  // Longer query: apply OR logic with a minimum-match threshold
  request.queryParams.mm = "70%";
  request.queryParams["q.op"] = "OR";
}

Notes:

Ensure the JavaScript stage is placed before any parsing or Solr query execution stages.
This approach works in environments with multiple languages by mapping query language to the appropriate Solr field type.
Field types must be configured in your Solr schema with appropriate analyzers and filters.
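The threshold decision in the snippet above can also be isolated into a small pure function, which makes it easy to test outside the pipeline. This is a sketch under assumptions: the name chooseMatchParams is hypothetical, and the default threshold of 4 and the 100% and 70% values are the ones used in the example rather than recommended settings.

```javascript
// Hypothetical helper: map a token count to the q.op and mm values
// used in the example stage. The threshold defaults to 4.
function chooseMatchParams(tokenCount, threshold) {
  var limit = (typeof threshold === "number") ? threshold : 4;
  if (tokenCount < limit) {
    // Short query: every term must match
    return { "q.op": "AND", "mm": "100%" };
  }
  // Longer query: allow partial matches
  return { "q.op": "OR", "mm": "70%" };
}

var shortQuery = chooseMatchParams(3);
var longQuery = chooseMatchParams(6);
```

The returned keys can then be copied onto the request parameters inside the stage, using bracket notation for "q.op".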

Step 3: Configure field type mapping

For multilingual support, maintain a mapping from language codes to Solr field types. Each supported language needs a field type in the schema with its own analyzer chain, for example text_en for English and text_de for German.

Update the field type map in the JavaScript stage to match your schema configuration.
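A slightly more defensive lookup is sketched below for cases where the language arrives as a locale such as en-US or in mixed case. The names FIELD_TYPE_BY_LANG and resolveFieldType are hypothetical, and the map entries mirror the example above rather than any particular schema.

```javascript
// Example mapping; replace the values with field types from your schema.
var FIELD_TYPE_BY_LANG = {
  en: "text_en",
  de: "text_de",
  fr: "text_fr"
};

// Hypothetical helper: resolve a language or locale code to a field type,
// falling back to a default when the language is missing or unmapped.
function resolveFieldType(lang, fallback) {
  var defaultType = fallback || "text_en";
  if (typeof lang !== "string" || lang.length === 0) {
    return defaultType;
  }
  // "de-AT" and "de_AT" both reduce to "de"
  var primary = lang.toLowerCase().split(/[-_]/)[0];
  // hasOwnProperty avoids matching inherited keys such as "toString"
  if (Object.prototype.hasOwnProperty.call(FIELD_TYPE_BY_LANG, primary)) {
    return FIELD_TYPE_BY_LANG[primary];
  }
  return defaultType;
}

var austrianGerman = resolveFieldType("de-AT");
var unmapped = resolveFieldType("ja");
```

Falling back to a single default keeps the stage working for unmapped languages, at the cost of analyzing them with the wrong rules, so the token count for those queries is only approximate.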

Additional tips

Use the REST Query pipeline stage if you prefer to move the analysis API call out of the JavaScript stage and keep only the token-counting logic there.
For debugging, inspect the full token output of the analysis API to understand how various filters (e.g., StopFilter, LowerCaseFilter, PorterStemFilter) transform the input text.
When extracting tokens, always use the last filter stage as the source of truth unless your use case requires otherwise.
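For the debugging tip above, a per-stage summary makes it easier to see where tokens are dropped or changed. The sketch below assumes the index array alternates between a component class name and its token list, as described in Step 1; the function names and the sample data are hypothetical and hand-written.

```javascript
// Pair each analysis component with the tokens it emitted.
function summarizeStages(stages) {
  var summary = [];
  for (var i = 0; i + 1 < stages.length; i += 2) {
    var className = String(stages[i]);
    var tokenList = Array.isArray(stages[i + 1]) ? stages[i + 1] : [];
    summary.push({
      // Keep only the simple class name, e.g. "StopFilter"
      stage: className.substring(className.lastIndexOf(".") + 1),
      tokens: tokenList.map(function (token) { return token.text; })
    });
  }
  return summary;
}

// Render one line per stage: "<stage> (<count>): <tokens>"
function formatStages(stages) {
  return summarizeStages(stages).map(function (entry) {
    return entry.stage + " (" + entry.tokens.length + "): " + entry.tokens.join(", ");
  });
}

// Hand-written sample data, not real Solr output.
var sampleStages = [
  "org.apache.lucene.analysis.standard.StandardTokenizer",
  [{ text: "The" }, { text: "forwarders" }],
  "org.apache.lucene.analysis.core.StopFilter",
  [{ text: "forwarders" }],
  "org.apache.lucene.analysis.core.LowerCaseFilter",
  [{ text: "forwarders" }]
];

var lines = formatStages(sampleStages);
```

Each entry in lines can be written to the stage's log output to compare token counts across filters.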