Issue:
Can you shed some light on how sow=false (splitOnWhitespace=false) behaves with multi-word synonyms and ‘mm’ (minimum should match), and suggest a potential solution?
As the parameter name suggests, splitOnWhitespace=false stops the query parser from splitting the incoming text on whitespace and treating each token as a separate ‘OR’ clause; the whole text is passed to the field's analyzer instead. sow=false makes it possible for a fieldType that does not use KeywordAnalyzer to support multi-word operations, including synonyms. However, it does not behave the way we might expect once a minimum match (mm) criterion is introduced.
Let’s take an example.
We have a text field “description_t” and a string field “title_s” (the query below also searches a second text field, “title_txt”), with these multi-word synonyms:
white marble => granite
cream marble => levantina
We would like to search with mm=100% (every query term should appear in the field values of the document).
Our query looks like this:
"params":{
"mm":"100%",
"q":"white marble chopsticks",
"defType":"edismax",
"indent":"on",
"qf":"title_s description_t title_txt",
"sow":"false",
"wt":"json",
"debugQuery":"on"}
Note that splitOnWhitespace (sow) is set to false so that the multi-word synonyms take part in query expansion. Even without looking at the results, the debugQuery output makes the problem immediately visible:
(((title_s:white title_s:marble title_s:chopsticks)~3) | ((title_txt:granite title_txt:chopsticks)~2) | ((description_t:granite description_t:chopsticks)~2))
The “~” sign denotes the ‘mm’ value: the minimum number of terms that must match.
When we say mm=100%, we want all the terms to be present across the stated fields taken together.
In this case, the terms “white”, “marble” and “chopsticks” can be in either “title_s” or “description_t”, and we are fine as long as all three terms are present somewhere in these two fields.
In the debug query above, however, clauses such as ((description_t:granite description_t:chopsticks)~2) show that Solr applies ‘mm’ separately within each field listed in the “qf” parameter, so all terms must occur together in a single field. This leads to unexpected results.
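The difference between the two readings of mm=100% can be sketched with a toy model in Python. This is not Solr code: the sample document, its token sets, and the helper names are made up for illustration, and matching is reduced to simple set membership.

```python
# Toy model (not Solr): each field is represented as a set of indexed tokens.
# Hypothetical document: the synonym target sits in one text field and the
# remaining term in another.
doc = {
    "title_s": {"white marble slab"},  # string field: the whole value is one token
    "title_txt": {"granite", "slab"},
    "description_t": {"wooden", "chopsticks"},
}

# Per-field clauses as they appear in the debug query after synonym expansion.
field_clauses = {
    "title_s": ["white", "marble", "chopsticks"],
    "title_txt": ["granite", "chopsticks"],
    "description_t": ["granite", "chopsticks"],
}


def matches_per_field(doc, field_clauses):
    """mm=100% applied inside each field: one field must hold all its clauses."""
    return any(
        all(term in doc[field] for term in terms)
        for field, terms in field_clauses.items()
    )


def matches_across_fields(doc, terms, fields):
    """The intended behavior: every term is found in at least one of the fields."""
    return all(any(term in doc[field] for field in fields) for term in terms)


per_field = matches_per_field(doc, field_clauses)
across_text_fields = matches_across_fields(
    doc, ["granite", "chopsticks"], ["title_txt", "description_t"]
)
print(per_field)           # False
print(across_text_fields)  # True
```

In this model the document is rejected by the per-field reading even though every expanded term is present somewhere in the text fields, which is the kind of unexpected result described above.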
Environment:
Solr
Resolution:
Please note: several senior search and relevance engineers have already published articles on how to support multi-word synonyms on your platform, and I recommend reading them first:
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/#footnote6
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
https://dzone.com/articles/solution-multi-term-synonyms
Multiple solutions can be devised; we chose a simple one that is easy to understand.
"params":{
"mm":"100%",
"q":"
({!edismax qf='description_t subject_t title_txt' v=$queryX} OR
{!edismax qf='title_s' v=$queryX})",
"debug":"true",
"queryX":"white marble chopsticks",
"indent":"on",
"sow":"false",
"wt":"json"}
Define a separate edismax parser per field type: as shown above, we grouped all text fields under one edismax parser and the string field under another. Looking at the debug query:
(+(((subject_t:granite | title_txt:granite | description_t:granite) (subject_t:chopsticks | title_txt:chopsticks | description_t:chopsticks))~2)) (+(((title_s:white) (title_s:marble) (title_s:chopsticks))~3))
We still have a limitation here, since there are now two ‘mm’ values, one per parser. But string fields will either match exactly or not at all. We can at least group all text fields together and apply minimum match across them.
This effectively restricts multi-word synonyms to text fields.
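As a rough sketch, the request parameters for this approach could be assembled in Python as follows. The Solr URL, the collection name "products", and the helper name build_params are assumptions for illustration; the field names and parameter values are taken from the example above.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a collection named "products".
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_params(user_query, text_fields, string_fields):
    """Build one edismax parser for the text fields and one for the string fields."""
    text_parser = "{!edismax qf='" + " ".join(text_fields) + "' v=$queryX}"
    string_parser = "{!edismax qf='" + " ".join(string_fields) + "' v=$queryX}"
    return {
        "q": "(" + text_parser + " OR " + string_parser + ")",
        "queryX": user_query,  # referenced by both parsers through v=$queryX
        "mm": "100%",
        "sow": "false",
        "debug": "true",
        "wt": "json",
    }


params = build_params(
    "white marble chopsticks",
    text_fields=["description_t", "subject_t", "title_txt"],
    string_fields=["title_s"],
)
request_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(params["q"])
print(request_url)
```

Keeping the user's text in a single queryX parameter means it is written once and shared by both parsers, so the two field groups are always searched with the same input.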
Cause:
N/A