Suppose the data set contains keywords such as ‘San Francisco’ and ‘SFO’, where each keyword may be a single term or a phrase, and their synonyms could be SF, San Fran, and San Francisco CA.
A user may search for 'SF' and expect the result set to contain documents that have SFO, San Francisco, San Francisco CA, and so on.
A user may also search for 'San Francisco' and expect the result set to contain documents that have SF, SFO, San Francisco, San Francisco CA, San, Francisco, Francisc, and so on.
One out-of-the-box (OOTB) solution that I know of for this use case is to have two separate fields: one whose text is analyzed with a general text analyzer as its field type, and another whose text is analyzed only for synonym search.
Warning: this solution works, but it has its pros and cons. Please read them below before you consider using it.
1. Create two fields.
One holds the source data, and the other receives a copy of it through Solr's copyField feature.
<field name="source_text" type="text_en" indexed="true" stored="true"/>
<field name="syn_text" type="text_synonym" indexed="true" stored="false"/>
2. Copy the source field into the synonym field:
<copyField source="source_text" dest="syn_text"/>
3. Create the respective field types: text_en and text_synonym. The field type definitions are in the attachment; please download and copy them from there.
4. A one-time re-index is needed (if you have already indexed), because the copy field has to be populated from the source field. Re-indexing is not needed when you later update the synonyms, because synonyms are applied at search time.
Search approach (while querying)
To obtain synonym results as well as general text search results, q should perform a boolean OR across both fields. Before we dive into the solution, let's consider some scenarios.
Say the q=text:("GB" OR GB)
The query parser analyzes this query and performs a term search as well as a phrase search. See the parsed query below.
"parsedquery_toString": "text:\"(gb gib gigabyte gigabytes giga) bytes\" ((text:gb text:gib text:gigabyte text:gigabytes text:giga) text:bytes)"
However, this does not work when the user's search contains multiple words.
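The parsedquery_toString value shown above is part of Solr's debug output. As a rough sketch, a request URL that asks for it could be built as follows. The host, port, and core name (mycore) are placeholders and not taken from this setup; no request is actually sent here.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def debug_url(q):
    """Build a select URL that asks Solr to include debug output."""
    params = {"q": q, "debugQuery": "true", "rows": 0, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parsed_query(response_json):
    """Read the parsed query from an already-decoded JSON response."""
    return response_json["debug"]["parsedquery_toString"]


print(debug_url('text:("GB" OR GB)'))
```

Checking the parsed query this way for each field type is a quick way to see whether a synonym was expanded before trusting the search results.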
Now, say q="giga byte" or q=giga byte.
Whether the user searches with or without quotes, the analyzer chain is applied in both cases and tokenizes giga byte into the tokens giga and byte. The phrase itself is not retained, so its synonyms are never looked up, and the query returns only documents that contain giga byte itself.
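To make the order-of-operations problem concrete, here is a toy model in Python. It is not Solr's analysis chain, and the synonym group is made up for illustration. It only shows that looking up synonyms after splitting on whitespace misses a multi-word entry, while looking up the untokenized phrase finds it.

```python
# Hypothetical synonym group, in the spirit of one line of a synonyms file.
SYNONYM_GROUP = {"gb", "gib", "gigabyte", "gigabytes", "giga byte"}


def expand(term):
    """Return the whole group if the term belongs to it, else just the term."""
    key = term.lower()
    return set(SYNONYM_GROUP) if key in SYNONYM_GROUP else {key}


def expand_after_tokenizing(query):
    """Split on whitespace first, then look up each token separately."""
    expanded = set()
    for token in query.split():
        expanded |= expand(token)
    return expanded


def expand_whole_phrase(query):
    """Look up the untokenized query string as a single term."""
    return expand(" ".join(query.split()))


# Tokenizing first: neither "giga" nor "byte" is in the group on its own.
print(sorted(expand_after_tokenizing("giga byte")))
# Keeping the phrase intact: the lookup hits the multi-word entry.
print(sorted(expand_whole_phrase("giga byte")))
```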
The workaround for this case, in my opinion, is to have two fields, one that does whitespace tokenization and one that does not, as described in the setup section above. While querying, search on both fields with OR.
So, q=source_text:(giga byte) OR syn_text:"giga byte". The parentheses matter: without them, only giga is searched in source_text and byte falls through to the default field.
This fetches results from both fields and so resolves the issue. Nevertheless, as mentioned, please have a look at the pros and cons section. Hope this helps.
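A minimal sketch of building that two-field query from raw user input on the client side. The helper names are made up for illustration; the field names are the ones from the setup section. The term clause is wrapped in parentheses so that every word is searched in source_text rather than only the first one, and how those words are combined still depends on the default operator configured in Solr.

```python
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr standard query syntax.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(term):
    """Backslash-escape query syntax characters inside a single term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def build_q(user_input, text_field="source_text", syn_field="syn_text"):
    """Return '<text_field>:(terms) OR <syn_field>:"phrase"' for the input."""
    terms = user_input.split()
    if not terms:
        raise ValueError("empty search input")
    term_clause = "{}:({})".format(
        text_field, " ".join(escape_term(t) for t in terms)
    )
    # Inside a quoted phrase only the backslash and the quote need escaping.
    phrase = " ".join(terms).replace("\\", "\\\\").replace('"', '\\"')
    phrase_clause = '{}:"{}"'.format(syn_field, phrase)
    return term_clause + " OR " + phrase_clause


def build_params(user_input):
    """URL-encode the query as the q parameter of a select request."""
    return urlencode({"q": build_q(user_input)})


print(build_q("giga byte"))
print(build_params("giga byte"))
```

This sketch does not handle input words such as AND, OR, or NOT, which the query parser would still treat as operators.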
I have attached a screenshot of the configurations for reference and to make copying easier.
Pros and Cons
Pros: It resolves the current issue. This is a use case that many of us have as a requirement, and there have been various ways to hack around it. This is one of the simplest, and it is out of the box.
Cons:
1. It increases the index size, since we create a copy field. Even though this field can be stored=false, it still contributes to the index size.
2. The query gains another boolean clause, which will affect query performance.
That said, it can still be used without much of an issue for small to medium data sets. Solr has proven to perform well on indexes larger than a terabyte, so this approach should work for smaller indexes in the gigabyte range.