Solr: How to include delimiter characters such as + and # in queries

Question

I want to be able to search for C++ and C#, but the + and # characters appear to be removed.

Description

From SOLR-2059:

By default, WordDelimiterFilter assigns 'types' to each character (computed from Unicode Properties).
Based on these types and the options provided, it splits and concatenates text.
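
To make the idea concrete, here is a minimal sketch of type-based splitting with an overridable type table. It is a simplified model written for illustration, not Solr's implementation: the function names (char_type, split_on_delimiters) and the reduced set of types are assumptions, and the real filter has many more options (case changes, catenation, number parts).

```python
import unicodedata

# Simplified stand-ins for the filter's character types (assumption: the real
# filter has more types and many more options than are modelled here).
ALPHA, DIGIT, DELIM = "ALPHA", "DIGIT", "SUBWORD_DELIM"


def char_type(ch, overrides=None):
    """Return the type of ch: an explicit override wins, else Unicode decides."""
    if overrides and ch in overrides:
        return overrides[ch]
    category = unicodedata.category(ch)
    if category.startswith("L"):   # letters
        return ALPHA
    if category == "Nd":           # decimal digits
        return DIGIT
    return DELIM                   # punctuation, symbols, and so on


def split_on_delimiters(token, overrides=None):
    """Split token wherever a delimiter-typed character occurs and drop it."""
    parts, current = [], ""
    for ch in token:
        if char_type(ch, overrides) == DELIM:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


# Hypothetical override table, playing the role of a custom types file.
custom_types = {"+": ALPHA, "#": ALPHA}

print(split_on_delimiters("C++"))                # ['C']
print(split_on_delimiters("C#"))                 # ['C']
print(split_on_delimiters("C++", custom_types))  # ['C++']
print(split_on_delimiters("C#", custom_types))   # ['C#']
```

With the default types, + and # are classified as delimiters and dropped, so both C++ and C# collapse to C. Once they are mapped to ALPHA they stay part of the token.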

Solution

Follow these steps to explicitly define the '+' and '#' characters as alpha characters, so that they are not filtered out and are taken into account in search queries.

1. Edit the schema.xml file and find the solr.TextField field type that you are using (e.g. text_en).

2. Under both the "index" and "query" analyzers, modify the WordDelimiterFilterFactory filter and add types="wdfftypes.txt".
For instance:

<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="wdfftypes.txt"/>

3. Then create the wdfftypes.txt file with the following content and place it in the same folder as the schema.xml file.
NOTE: the # character must be written as its Unicode escape (\u0023), because a line starting with # is treated as a comment in this file.

# A customized type mapping for WordDelimiterFilterFactory
# the allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM
#
# the default for any character without a mapping is always computed from
# Unicode character properties

+ => ALPHA
\u0023 => ALPHA

4. Reload the core, or restart Solr
5. Re-index the data so that the missing characters are included in the index
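
After steps 4 and 5, it can help to confirm that the analyzer now keeps the characters. The sketch below only builds the two URLs involved: the CoreAdmin RELOAD action and a field analysis request. The base URL, the core name "mycore", and the helper function names are assumptions for illustration. It also assumes the /analysis/field handler is available on your core (in some Solr versions it has to be registered in solrconfig.xml).

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local install
CORE = "mycore"                            # hypothetical core name


def reload_core_url(base, core):
    """Build the URL for the CoreAdmin RELOAD action on the given core."""
    return f"{base}/admin/cores?" + urlencode({"action": "RELOAD", "core": core})


def field_analysis_url(base, core, field_type, text):
    """Build a URL asking the field analysis handler how text is tokenized."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    }
    return f"{base}/{core}/analysis/field?" + urlencode(params)


print(reload_core_url(SOLR_BASE, CORE))
print(field_analysis_url(SOLR_BASE, CORE, "text_en", "C++ C#"))
```

Note that urlencode escapes + as %2B and # as %23. The same escaping is needed when sending C++ or C# in a query URL, since an unescaped + is read as a space and an unescaped # starts a URL fragment. If the change worked, the analysis output for the WordDelimiterFilterFactory stage should list C++ and C# as whole tokens.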


1 Comment

    Promod George

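    Also:

    1) Make the change for both <analyzer type="index"> and <analyzer type="query">.

    2) Use <tokenizer class="solr.WhitespaceTokenizerFactory"/> for both analyzer types, so that + and # reach the WordDelimiterFilterFactory. A tokenizer that drops punctuation, such as solr.StandardTokenizerFactory, would remove them first.

    For example: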

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
      </analyzer>
    </fieldType>

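    Use text_general as the type of the field you intend to search in Solr. In this example, characters.txt plays the role of the wdfftypes.txt file described in the article.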
