FTS phrase match explanation

flaviu · April 17, 2023, 2:36pm

@abhinav, can you please explain this FTS phrase match to me?

The string is: “With zombie make up on Billy”

The search phrase is: “make out”

{"query": {"match_phrase": "make out", "field": "pp"}, "score": "none"}

I am trying to understand why this query matches the text above.

This is the index definition:

{
  "type": "fulltext-index",
  "name": "image_fts-v2",
  "uuid": "4e8a00692c191b1f",
  "sourceType": "gocbcore",
  "sourceName": "images_fts",
  "sourceUUID": "446eb8f81d4c800ccad037596d85a254",
  "planParams": {
    "maxPartitionsPerPIndex": 256,
    "indexPartitions": 4
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "m.t"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "en",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "default_analyzer": "en",
        "dynamic": false,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": false,
      "index_dynamic": false,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "fts": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "cd": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "cd",
                  "type": "number"
                }
              ]
            },
            "dl": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "dl",
                  "type": "number"
                }
              ]
            },
            "ih": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "ih",
                  "type": "number"
                }
              ]
            },
            "ii": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "index": true,
                  "name": "ii",
                  "type": "text"
                }
              ]
            },
            "im": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "index": true,
                  "name": "im",
                  "type": "text"
                }
              ]
            },
            "iw": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "iw",
                  "type": "number"
                }
              ]
            },
            "mi": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "mi",
                  "type": "number"
                }
              ]
            },
            "mt": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "mt",
                  "store": true,
                  "type": "number"
                }
              ]
            },
            "np": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "include_term_vectors": true,
                  "index": true,
                  "name": "np",
                  "type": "text"
                }
              ]
            },
            "nv": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "index": true,
                  "name": "nv",
                  "type": "number"
                }
              ]
            },
            "pi": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "pi",
                  "store": true,
                  "type": "boolean"
                }
              ]
            },
            "pp": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "include_term_vectors": true,
                  "index": true,
                  "name": "pp",
                  "type": "text"
                }
              ]
            },
            "pv": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "index": true,
                  "name": "pv",
                  "type": "number"
                }
              ]
            },
            "s": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "s",
                  "type": "number"
                }
              ]
            },
            "st": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "st",
                  "type": "boolean"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 15
    }
  },
  "sourceParams": {}
}

abhinav · April 17, 2023, 4:47pm

“out” is a stop word in the en language analyzer, so it is dropped from the search phrase. That is why the phrase match ends up looking only for “make” during your search.
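
To make that concrete, here is a minimal Python sketch of what an en-style analyzer does to both the indexed text and the search phrase. It is a simulation, not bleve code: the stop-word set is a small hand-picked subset of the stop_en list, stemming and possessive handling are skipped, and the function names (analyze, phrase_matches) are made up for illustration.

```python
import re

# Assumption: a tiny subset of the stop_en list; the real list is much longer.
STOP_WORDS = {"with", "up", "on", "out"}

def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, split into words, drop stop words, keep original positions."""
    words = re.findall(r"\w+", text.lower())
    return [(pos, w) for pos, w in enumerate(words, start=1) if w not in stop_words]

def phrase_matches(doc_text, phrase, stop_words=STOP_WORDS):
    """Simplified phrase match: surviving query terms must appear in the
    document with the same relative spacing they had in the phrase."""
    doc = dict(analyze(doc_text, stop_words))   # position -> term
    query = analyze(phrase, stop_words)
    if not query:
        return False                            # nothing left to search for
    first_pos = query[0][0]
    return any(
        all(doc.get(start + pos - first_pos) == term for pos, term in query)
        for start in doc
    )

doc_text = "With zombie make up on Billy"
print(analyze(doc_text))                     # [(2, 'zombie'), (3, 'make'), (6, 'billy')]
print(analyze("make out"))                   # [(1, 'make')]
print(phrase_matches(doc_text, "make out"))  # True

# Without any stop-word removal, both terms survive and the phrase no longer matches.
print(phrase_matches(doc_text, "make out", stop_words=set()))  # False
```

In this model the phrase “make out” collapses to the single term “make”, which is present in the document, hence the hit.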

flaviu · April 17, 2023, 5:25pm

Can I exclude it from the stop words?

abhinav · April 17, 2023, 5:45pm

Well, not while using the built-in en analyzer. What you can do is create a custom analyzer with all of the en analyzer’s components except the stop-word filter (stop_en), and use that for the field pp.

So your custom analyzer’s components would be:

unicode tokenizer
possessive_en token filter
to_lower token filter
stemmer_en_snowball token filter
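
The component list above could be wired into the index definition roughly as follows. This is a sketch under assumptions: the analyzer name en_no_stop and the helper add_custom_analyzer are made up, and the key layout under mapping.analysis.analyzers ("type": "custom", "tokenizer", "token_filters", "char_filters") reflects my understanding of how custom analyzers are stored in the index JSON, so compare it with what the UI generates before relying on it.

```python
import copy
import json

# Hypothetical name for the custom analyzer; any unused name would do.
CUSTOM_ANALYZER = "en_no_stop"

def add_custom_analyzer(index_def, type_name="fts", field="pp"):
    """Return a copy of index_def whose `field` uses a stop-word-free analyzer."""
    patched = copy.deepcopy(index_def)
    mapping = patched["params"]["mapping"]
    analyzers = mapping.setdefault("analysis", {}).setdefault("analyzers", {})
    # Same components as the en analyzer, minus the stop_en token filter.
    analyzers[CUSTOM_ANALYZER] = {
        "type": "custom",
        "char_filters": [],
        "tokenizer": "unicode",
        "token_filters": ["possessive_en", "to_lower", "stemmer_en_snowball"],
    }
    # Point every field entry of the chosen property at the new analyzer.
    for field_def in mapping["types"][type_name]["properties"][field]["fields"]:
        field_def["analyzer"] = CUSTOM_ANALYZER
    return patched

# Trimmed-down copy of the definition from the first post (only the parts touched here).
index_def = {
    "name": "image_fts-v2",
    "params": {"mapping": {"analysis": {}, "types": {"fts": {"properties": {
        "pp": {"dynamic": False, "enabled": True, "fields": [
            {"analyzer": "en", "include_term_vectors": True,
             "index": True, "name": "pp", "type": "text"}]}}}}}},
}

patched = add_custom_analyzer(index_def)
print(json.dumps(patched["params"]["mapping"]["analysis"], indent=2))
```

The helper works on a copy, so the original definition stays untouched and the two versions can be compared before anything is applied to the cluster.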

flaviu · April 17, 2023, 6:13pm

Is “out” the only stop word?

If I add this custom analyzer, it needs to reindex my entire database, right?

abhinav · April 17, 2023, 6:23pm

No, there are several others, and they will also end up getting indexed once the stop-word filter is removed. Here’s the list of stop words for en (remember, these are the ones that the en analyzer drops and does not index):

https://github.com/blevesearch/bleve/blob/v2.3.7/analysis/lang/en/stop_words_en.go#L15

	"github.com/blevesearch/bleve/v2/registry"
)

const StopName = "stop_en"

// EnglishStopWords is the built-in list of stopwords used by the "stop_en" TokenFilter.
//
// this content was obtained from:
// lucene-4.7.2/analysis/common/src/resources/org/apache/lucene/analysis/snowball/
// ` was changed to ' to allow for literal string
var EnglishStopWords = []byte(` | From svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt
 | This file is distributed under the BSD License.
 | See http://snowball.tartarus.org/license.php
 | Also see http://www.opensource.org/licenses/bsd-license.html
 |  - Encoding was converted to UTF-8.
 |  - This notice was added.
 |
 | NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

 | An English stop word list. Comments begin with vertical bar. Each stop
 | word is at the start of a line.

“If I add this custom analyzer, it needs to reindex my entire database, right?”

Correct. Any change to the index definition that alters the mapping will cause an index rebuild.

flaviu · April 17, 2023, 6:36pm

OK, I understand. I am now creating a clone of the index with the new analyzer. Thanks for the details. I will let you know how it goes.
