Is there any way to force the Standard analyzer to include stop words?
As documented here (https://issues.couchbase.com/browse/MB-18631), stop words are removed from the index when using the standard analyzer.
You can demonstrate this with Bleve’s online Text Analysis tool at http://analysis.blevesearch.com/analysis.
We have an FTS index on a string field, and have a requirement to include stop words in the queries & results.
We are using a PhraseQuery for the search, separating the search string into terms at spaces.
This worked fine, until we tested finding a stop word, which failed.
Removing stop words from the query terms also failed. I’m guessing this happens because the index still counts stop words when calculating the “position” of each indexed term.
We changed the analyzer of the FTS index to “simple”, which does not exclude stop words.
After that, searching for a stop word worked.
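To make the position guess above concrete, here is a minimal Python sketch of a positional index. It is not Bleve or Couchbase code: `STOP_WORDS`, `analyze`, and `phrase_match` are names made up for illustration, the stop list is a tiny stand-in for the real one, and the assumption that a removed stop word still consumes a position is exactly the guess stated above, not something we have confirmed.

```python
# Tiny illustrative stop list; the real analyzer's list is much longer.
STOP_WORDS = {"this", "of", "the", "a", "an", "and"}


def analyze(text, remove_stop_words):
    """Return (term, position) pairs; every word consumes a position, kept or not."""
    pairs = []
    for position, word in enumerate(text.lower().split(), start=1):
        if remove_stop_words and word in STOP_WORDS:
            continue  # term is dropped, but its position has already been used
        pairs.append((word, position))
    return pairs


def phrase_match(indexed, terms):
    """True if the terms occur at consecutive positions in the indexed field."""
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


field = "THIS GEORGE OF ENGLAND"
standard_like = analyze(field, remove_stop_words=True)
simple_like = analyze(field, remove_stop_words=False)

# "george" sits at position 2 and "england" at 4, so they are never adjacent.
print(phrase_match(standard_like, ["george", "england"]))      # False
print(phrase_match(simple_like, ["george", "of", "england"]))  # True
```

Under these assumptions the toy index gives the same Y/N pattern we observed for both analyzers, which is at least consistent with the position explanation.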
Given a document containing “THIS GEORGE OF ENGLAND” in the indexed field:
PhraseQuery
Search Term(s)             Analyzer   Match Found?
-------------------------  ---------  ------------
"george"                   standard   Y
"george"/"of"              standard   N
"george"/"england"         standard   N
"george"/"of"/"england"    standard   N
"george"                   simple     Y
"george"/"of"              simple     Y
"george"/"england"         simple     N
"george"/"of"/"england"    simple     Y
However, searching for a word that contains an apostrophe always fails with the “simple” analyzer.
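One possible cause, sketched in Python under two assumptions that we have not verified against the server: that the “simple” analyzer splits text on every non-letter character and lowercases the result, and that PhraseQuery terms are compared to indexed terms as-is, without being analyzed. `letter_tokenize` is a hypothetical stand-in, not the real tokenizer.

```python
def letter_tokenize(text):
    """Split on every non-letter character and lowercase each run of letters."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            # Any non-letter (space, apostrophe, digit) ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


indexed_terms = letter_tokenize("THIS G'EORGE OF ENGLAND")
print(indexed_terms)               # ['this', 'g', 'eorge', 'of', 'england']
print("g'eorge" in indexed_terms)  # False
```

If that is how the field is indexed, the term "g'eorge" never exists in the index, so no phrase containing it could match.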
Given a document containing “THIS G'EORGE OF ENGLAND” in the indexed field:
PhraseQuery
Search Term(s)             Analyzer   Match Found?
-------------------------  ---------  ------------
"g'eorge"                  standard   Y
"g'eorge"/"of"             standard   N
"g'eorge"/"england"        standard   N
"g'eorge"/"of"/"england"   standard   N
"g'eorge"                  simple     N
"g'eorge"/"of"             simple     N
"g'eorge"/"england"        simple     N
"g'eorge"/"of"/"england"   simple     N
Is there any way to customize/hack the query or index to use the standard analyzer and still include stop words?
Or can we create our own analyzer that behaves like “simple” but handles apostrophes, etc. properly?
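For illustration, this is roughly the kind of custom analyzer we have in mind, written as a Python dict that serializes to JSON. Everything here is an assumption to be checked against the server version in use: the analyzer name `standard_keep_stop_words` is made up, and the layout and the component names `unicode` and `to_lower` reflect our understanding of the analysis section of an index definition, not a confirmed configuration.

```python
import json

# Hypothetical custom analyzer: a Unicode-aware tokenizer plus lowercasing,
# with the stop-word filter deliberately left out.
custom_analysis = {
    "analyzers": {
        "standard_keep_stop_words": {  # made-up name
            "type": "custom",
            "char_filters": [],
            "tokenizer": "unicode",
            "token_filters": ["to_lower"],
        }
    }
}

print(json.dumps(custom_analysis, indent=2))
```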
Also, we have a RegexpQuery that works with the standard analyzer, but fails with the simple analyzer.
The document’s indexed field contains 578206327, and the regex is “^\d{5}6327”.
Using the standard analyzer finds a match.
Using the simple analyzer does not.
Is that proper behavior?
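A small Python sketch of what might be going on, assuming (unverified) that the simple analyzer keeps only runs of letters while the standard analyzer keeps digits as part of a term. `letters_only_tokens`, `word_tokens`, and `regexp_hits` are hypothetical helpers, and Python's `re` module stands in for whatever regex engine the server uses.

```python
import re


def letters_only_tokens(text):
    """Keep runs of letters, lowercased; digits and punctuation act as separators."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def word_tokens(text):
    """Keep runs of letters and digits, a rough stand-in for a word tokenizer."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def regexp_hits(tokens, pattern):
    """Return the indexed terms that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]


field = "578206327"
pattern = r"^\d{5}6327"

print(regexp_hits(word_tokens(field), pattern))          # ['578206327']
print(regexp_hits(letters_only_tokens(field), pattern))  # []
```

If the letters-only assumption holds, the numeric value produces no indexed terms at all under the simple analyzer, so the regex has nothing to match.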
Thanks very much for your time!
Jeff