The word "Other" is not searchable with ngram, but "ther" is searchable

I have an index on 11 columns, of which 4 columns have an ngram index with min 2 and max 13.

A weird issue is coming up for some words; "Other" is one such case.

I'm using the following syntax to search with AND conditions:

{"query": {"query": "+apple +mobile"}}
{"query": {"query": "+Other"}}

Below is my index definition:


{
  "type": "fulltext-index",
  "name": "product-search-fts-c1",
  "uuid": "1a2692d6dd5f9270",
  "sourceType": "gocbcore",
  "sourceName": "test-bucket",
  "sourceUUID": "5c6b61f7be43d99d8efb1035c9b9e4e9",
  "planParams": {
    "maxPartitionsPerPIndex": 1024,
    "indexPartitions": 1
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "scope.collection.type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {
        "analyzers": {
          "substringMatcher": {
            "token_filters": [
              "to_lower",
              "substringMatcher"
            ],
            "tokenizer": "unicode",
            "type": "custom"
          }
        },
        "token_filters": {
          "substringMatcher": {
            "max": 13,
            "min": 1,
            "type": "ngram"
          }
        }
      },
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": false,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "_default.bridgeProduct": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "name",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "subtype": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "subtype",
                  "store": true,
                  "type": "text"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 15
    }
  },
  "sourceParams": {}
}

@akshaydhawle So I do see that you’ve created an analyzer substringMatcher which contains the following components -

  • tokenizer: unicode
  • tokenfilters: to_lower, substringMatcher
    • substringMatcher: ngram with min:1 and max:13

Per your index definition you’ve applied this analyzer over fields:

  • name
  • subtype

Now, you've included both fields in the composite field (_all), so your content can be searched without field scoping, but you haven't specified which analyzer to use in that situation. You can do this by updating the default_analyzer in your index definition from standard to substringMatcher. That is the analyzer applied to your search criteria whenever you do not provide a field while searching.
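For instance, the relevant fragment of your mapping would become the following (abridged to just the setting that changes; everything else in your definition stays as-is):

```
"mapping": {
  "default_analyzer": "substringMatcher"
}
```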

That's your first option here.
A second option is to field-scope your query, like this ...

{query:{query: "+name:apple +subtype:mobile"}}
{query:{query:"+subtype:Other"}}

Separately, it's worth mentioning that "Other" is considered a stopword by the standard analyzer (whose rules are pretty close to the english analyzer's), meaning it'll be dropped at analysis time.
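This also explains the behavior in your title: a non-field-scoped query like

{"query": {"query": "+Other"}}

runs through the standard analyzer, which drops the stopword and leaves nothing to match - whereas

{"query": {"query": "+ther"}}

survives analysis and matches the "ther" ngram token that substringMatcher emitted (into _all) while indexing "Other".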

Thanks @abhinav

I have one more question:
previously we were using the standard analyser on all columns,
and on 4 columns we were doing partial text matching using regex.

We have now replaced this regex with the ngram analyser.
Are you seeing any performance concerns here?

Below is my index:

{
  "type": "fulltext-index",
  "name": "product-search-fts-build",
  "uuid": "51b8345b83e423e6",
  "sourceType": "gocbcore",
  "sourceName": "buildCatalog",
  "sourceUUID": "796e48c0e626eff8e4d3e2be84148778",
  "planParams": {
    "maxPartitionsPerPIndex": 1024
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "scope.collection.type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {
        "analyzers": {
          "substringMatcher": {
            "char_filters": [
              "asciifolding"
            ],
            "token_filters": [
              "substringMatcher",
              "to_lower"
            ],
            "tokenizer": "unicode",
            "type": "custom"
          }
        },
        "token_filters": {
          "substringMatcher": {
            "max": 13,
            "min": 2,
            "type": "ngram"
          }
        }
      },
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": true,
      "type_field": "_type",
      "types": {
        "_default.bridgeProduct": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "attributes": {
              "dynamic": false,
              "enabled": true,
              "properties": {
                "effect": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "effect",
                      "store": true,
                      "type": "text"
                    }
                  ]
                },
                "flavor": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "flavor",
                      "store": true,
                      "type": "text"
                    }
                  ]
                },
                "general": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "general",
                      "store": true,
                      "type": "text"
                    }
                  ]
                },
                "ingredient": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "docvalues": true,
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "ingredient",
                      "store": true,
                      "type": "text"
                    }
                  ]
                }
              }
            },
            "brandName": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "brandName",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "classification": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "classification",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "name",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "size": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "standard",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "size",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "sku": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "sku",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "subtype": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "subtype",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "type": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "substringMatcher",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "type",
                  "store": true,
                  "type": "text"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 15
    }
  },
  "sourceParams": {}
}

Using an ngram token filter/analyzer moves the bulk of the compute to index time, as opposed to using the standard analyzer with a regex match, where the compute to determine the candidate terms occurs at query time.

So if query performance is what matters to your application, I recommend the ngram approach.
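For contrast, a regexp query of the sort you'd have paired with the standard analyzer might look like the following (the field and pattern here are just illustrative) - the server must enumerate every indexed term matching the pattern at query time before it can look up any hits:

```
{
  "query": {
    "field": "name",
    "regexp": "appl.*"
  }
}
```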


Thanks @abhinav

🙂

Hi @abhinav,

- I have one more question:

  • When we use ngram,
    how do we decide min and max?

- Use case: partial text matching.

  • Below is my understanding, if I want to support queries like these:
    1-gram: ["run 1", "run 2"] - in order to differentiate 1 from 2 I need a 1-gram.
    2-gram: ["run 12", "run 13"] - in order to differentiate 12 from 13 I need a 2-gram.
    3-gram: ["run 123", "run 124"] - in order to differentiate 123 from 124 I need a 3-gram.
    4-gram: ["run 1234", "run 1235"] - in order to differentiate 1234 from 1235 I need a 4-gram.
    5-gram: ["apple mobile", "apple cover"] - in order to differentiate these two words, like above.
    ...and the same goes up to
    12-gram: ["SwiftFusionX123", "SwiftFusionX124"]

  • If we set max to 12 and the search string is 13 characters long, then it is not searchable.

  • Do we need to review our customers' search history? But yes, a customer can type any continuous word.

  • Do we need to add some validation before creating a product, so that continuous words are limited to some length?

  • Or should we give an autocomplete list to users as they type, so they will not type longer words?

  • Or should we use some other techniques - like stemming, possessives, stopwords -
    to reduce word length?

  • Or should we use ngram for shorter words (< 8) and regex for longer words?

- I also want to understand:

  • Many of today's ecommerce platforms support partial text matching,
    and they have big data as well - how do they support it?
  • Do they use word mapping?
  • Or do they use ngram, up to some max like 20, and handle the resulting large data by
    scaling the resources?

If I've followed your comment properly, I don't think your understanding of partial text matching with ngrams is entirely correct.

An ngram token filter essentially gives the user the power to break up text into tokens based on the rules defined. So for the word sample, if you define an ngram token filter with min:3 and max:5, these are the tokens generated -

  • sam
  • samp
  • sampl
  • amp
  • ampl
  • ample
  • mpl
  • mple
  • ple

So these tokens are what are held within your index, meaning any of these tokens are searchable.
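For reference, a minimal analysis definition producing exactly those tokens would look like this - the name ngram35 is just illustrative, and the structure mirrors your own substringMatcher:

```
"analysis": {
  "analyzers": {
    "ngram35": {
      "token_filters": [
        "to_lower",
        "ngram35"
      ],
      "tokenizer": "unicode",
      "type": "custom"
    }
  },
  "token_filters": {
    "ngram35": {
      "max": 5,
      "min": 3,
      "type": "ngram"
    }
  }
}
```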
Now if you want to search for samples, notice that this token is not indexed as-is in your system.
So, it will be on you to make sure the query applies the analyzer you've defined within your index definition before searching -

  • First, you must use an analytic query - like match/match_phrase.
  • This can be done by "field scoping" your query, or by setting the default_analyzer in your index definition, which is used for all analytic queries that are not field-scoped.
  • An analytic query applies the analyzer defined for the field over your search criteria, and will then search for the tokens resulting from that operation instead.
  • So if I run ...
{"query": {"field": "X", "match": "samples"}}
  • All the ngram tokens for samples are generated: ["sam", "samp", "sampl", "amp", ...]
  • And a disjunction will ensue over all these tokens, many of which do exist in your index, and so the document will be returned as a hit.

Your choice of the ngram limits should be based on the minimum length of text you want from your customer before you start looking for results. Needless to say, the lower the min setting for the ngram token filter, the more tokens will be indexed - and the larger your index.
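To put a rough number on that: a token of length n yields (n - k + 1) ngrams for each size k from min to max. For sample (n = 6) with min:3/max:5 that works out to 4 + 3 + 2 = 9 tokens - the nine listed earlier - while dropping min to 1 would raise it to 6 + 5 + 4 + 3 + 2 = 20.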

The ngram approach is the faster approach to support autocomplete, as opposed to wildcard/regex, because you'll be moving the majority of the compute to index time as opposed to query time.
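So an autocomplete lookup against a field indexed with your substringMatcher can be a plain match query over whatever the user has typed so far (the field name here is illustrative):

```
{
  "query": {
    "field": "name",
    "match": "sampl"
  }
}
```

Query-time analysis breaks "sampl" into its ngrams, all of which are already sitting in the index, so the lookup reduces to cheap exact-term matches rather than a term enumeration.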
