We unfortunately had to give up on our PhraseQuery with a custom ‘include stop words’ analyzer because of problems searching for non-alphanumeric characters inside strings.
What we are trying to build is an FTS query on a ‘full name’ field that finds exact matches of the search string, but the results must not include matches where the search string occurs as a substring of another word.
So if the document contains “SMITHS, JOHN JAY; JR.” then our desired search results are:
Search String              Match?
-------------------------  ------
"SMITHS"                   Y
"SMITH"                    N
"SMITHS,"                  Y
"SMITHS "                  N
"SMITH JOHN"               N
"SMITHS JOHN"              Y
"SMITHS, JOHN"             Y
"SMITHS J"                 N
"JOHN"                     Y
"JOHN JAY"                 Y
"JAY"                      Y
"JAY;"                     Y
"JR"                       Y
"JR."                      Y
"JOHN SMITHS"              N
"MITH"                     N
"SMITHS, JOHN JAY"         Y
"SMITHS, JOHN JAY;"        Y
"SMITHS, JOHN J"           N
"SMITHS, JOHN JAY; JR"     Y
"SMITHS, JOHN JAY; JR."    Y
"SMITHS, JOHN JAY JR"      N
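For reference, the rule behind most of the table can be sketched in a few lines of Python: the search string must appear verbatim in the field, with no letter or digit directly before or after it. This is only a model of the desired behaviour, not FTS code, and the function name `whole_word_match` is made up for illustration. It agrees with every row above except "SMITHS JOHN", which would additionally require ignoring the comma.

```python
import re

DOCUMENT = "SMITHS, JOHN JAY; JR."


def whole_word_match(search: str, text: str) -> bool:
    """Return True if `search` occurs verbatim in `text` and is not
    embedded in a longer run of letters or digits."""
    # re.escape treats characters such as "." and ";" literally.
    # The lookarounds reject a letter or digit right next to the match.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(search) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    for search in ["SMITHS", "SMITH", "SMITHS,", "SMITHS J", "JAY;", "JR."]:
        print(repr(search), whole_word_match(search, DOCUMENT))
```

Python's lookarounds are used here for brevity; whether the regex dialect behind RegexpQuery accepts them is not something this sketch establishes.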
Our research suggested that we should use a custom analyzer set to use the “single” tokenizer, so all words in the indexed field will be kept together.
Then we query the index with a RegexpQuery so that matches are restricted to whole words.
However, with that configuration we only get search results when the search expression contains the entire value of the indexed field.
Given a document that contains:
"fullName": "DLNAME, DFNAME DMNAME; DSF",
And an FTS index using a custom analyzer of:
"analyzers": {
  "singleTokenizer": {
    "tokenizer": "single",
    "type": "custom"
  }
}
Running a RegexpQuery that contains:
"regexp": "^(DLNAME, DFNAME)"
Returns no results.
[Yes, I know the above regex won’t restrict the results to ‘whole words’, but I first need to get a more basic expression working before dealing with that.]
Running a RegexpQuery that contains:
"regexp": "^(DLNAME, DFNAME DMNAME; DSF)"
Returns the correct result.
I have tried several other expressions that return the correct matches in Regex101.com’s expression tester, but do not return any results in RegexpQuery.
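One possible explanation, stated here as an assumption rather than confirmed RegexpQuery behaviour: Regex101 reports a match when the pattern is found anywhere in the input, whereas a regexp query run against an index may require the pattern to match an entire indexed term. With the "single" tokenizer the whole field value is one term, so the two modes would give different answers. The Python sketch below only models that difference with `re.search` and `re.fullmatch`; it does not call FTS, and the `.*` suffix in `WIDENED` is a hypothetical variation that has not been tried against RegexpQuery.

```python
import re

# With the "single" tokenizer the whole field value is one indexed term.
TERM = "DLNAME, DFNAME DMNAME; DSF"

PARTIAL = r"^(DLNAME, DFNAME)"              # returned no results
FULL = r"^(DLNAME, DFNAME DMNAME; DSF)"     # returned the document
WIDENED = r"^(DLNAME, DFNAME).*"            # hypothetical: let the rest of the term match


def found_anywhere(pattern: str, term: str) -> bool:
    """Regex101-style check: the pattern may match any part of the input."""
    return re.search(pattern, term) is not None


def matches_whole_term(pattern: str, term: str) -> bool:
    """Assumed index-style check: the pattern must consume the entire term."""
    return re.fullmatch(pattern, term) is not None


if __name__ == "__main__":
    for name, pattern in [("PARTIAL", PARTIAL), ("FULL", FULL), ("WIDENED", WIDENED)]:
        print(name, found_anywhere(pattern, TERM), matches_whole_term(pattern, TERM))
```

Under this model the partial expression is found by the unanchored check but rejected by the whole-term check, which is the same pattern of results described above.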
Are there any known problems or restrictions with RegexpQuery?
If not, what is wrong with our expression and/or index?
Or is there a different combination of tokenizer and query that would be a better fit?
Thanks very much (again) for your time.
Jeff