Full Text Search index size 3 times bigger than the data in the bucket

I have a strange situation.

I am indexing only a few fields from each document (less than 1% of the JSON document), and I am not storing any of the optional data (docvalues, term vectors, stored fields). The default mapping is also disabled. Even so, the index grows to 3-4 times the size of all the data in the bucket, which also includes several other (large) document types that are not covered by the index in the first place.

Is this normal? Is it possible that I am doing something wrong, or is there some kind of bug?

Can you please share your FTS index definition?

Yes, here it is:

{
  "type": "fulltext-index",
  "name": "product_name",
  "uuid": "784b2727c74866a2",
  "sourceType": "couchbase",
  "sourceName": "app-live",
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "indexPartitions": 6
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "sub_type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "product": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "product_a_store": {
              "dynamic": false,
              "enabled": true,
              "properties": {
                "product_a_id": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_in_all": true,
                      "include_term_vectors": true,
                      "index": true,
                      "name": "product_a_id",
                      "type": "number"
                    }
                  ]
                },
                "product_name": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_in_all": true,
                      "index": true,
                      "name": "product_name",
                      "type": "text"
                    }
                  ]
                },
                "product_primary_genre_id": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_term_vectors": true,
                      "index": true,
                      "name": "product_primary_genre_id",
                      "type": "number"
                    }
                  ]
                },
                "product_provider_name": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_in_all": true,
                      "index": true,
                      "name": "product_provider_name",
                      "type": "text"
                    }
                  ]
                },
                "product_seller_name": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_in_all": true,
                      "index": true,
                      "name": "product_seller_name",
                      "type": "text"
                    }
                  ]
                },
                "bundle_id": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "include_in_all": true,
                      "index": true,
                      "name": "bundle_id",
                      "type": "text"
                    }
                  ]
                }
              }
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

@flaviu Your index definition looks OK to me.

I see that you’ve selected “Include in _all field” for several of your fields. This will push a copy of the indexed content for each of those fields into the “_all” composite field.

The “_all” composite field is necessary only if you intend to search without specifying the “field” to look within for your search terms. If the “field” can be specified in your queries, you don’t need “include in _all field” for your fields, and dropping it should give you some savings.
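As a rough illustration of what "specifying the field" looks like, here is a minimal Python sketch that builds the JSON body for a field-scoped match query. The search term "chess" and the helper name field_scoped_match are made up for the example, and the dotted path product_a_store.product_name assumes child fields are addressed by their full path in the mapping; adjust to whatever your index actually exposes.

```python
import json


def field_scoped_match(term, field, size=10):
    """Build a search request body that looks in one named field only.

    Because the query names its field, it does not depend on the
    "_all" composite field being populated.
    """
    return {"query": {"match": term, "field": field}, "size": size}


# Hypothetical term; the field path is an assumption based on the mapping above.
body = field_scoped_match("chess", "product_a_store.product_name")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the Search service for the product_name index in place of a query that omits "field".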

It’s hard to exactly quantify the size of an index with relation to your data. This could depend on …

  • The size of the content within your fields of interest
  • The frequency of the tokens as analyzed by your chosen analyzer
  • Whether numeric data is indexed: it is known to cause some bloat, taking up a larger footprint to assist in quick querying.

This is still very strange.

The bucket holds 1.2 TB in total and the Search index is more than 5 TB … this cannot be normal in any way. A typical document has 5.2k characters, of which I am indexing around 100. And I am indexing just 30% of the bucket (taking into consideration the JSON type field from the FTS index).

How is it possible for the index to be so big in this situation?
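To put those numbers side by side, here is a back-of-the-envelope calculation in Python. It is only a sketch: it assumes the 30% refers to a share of the bucket's bytes, that one character is roughly one byte, and that TB means 10^12 bytes. None of that is stated exactly above, so treat the result as an order of magnitude.

```python
# Figures quoted in the post above.
BUCKET_BYTES = 1.2e12          # total bucket size (1.2 TB)
INDEX_BYTES = 5e12             # Search index size ("more than 5 TB")
INDEXED_DOC_SHARE = 0.30       # share of the bucket covered by the index (assumed by bytes)
CHARS_PER_DOC = 5200           # typical document length
INDEXED_CHARS_PER_DOC = 100    # characters actually indexed per document

# Approximate amount of source text that feeds the index.
indexed_source_bytes = (
    BUCKET_BYTES * INDEXED_DOC_SHARE * INDEXED_CHARS_PER_DOC / CHARS_PER_DOC
)

# How many times larger the index is than the text it was built from.
ratio = INDEX_BYTES / indexed_source_bytes

print(f"indexed source text: {indexed_source_bytes / 1e9:.1f} GB")
print(f"index is about {ratio:.0f}x the indexed text")
```

Under these assumptions only a few gigabytes of text are being indexed, so the index is hundreds of times larger than its input, not just 3-4 times larger than the bucket.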

I have an enterprise license. How should I proceed with this?

Highlighting a couple of bullets from my previous comment

  • We’re aware of the issue with numeric content causing index bloat: the design trades index space for query latency. We’ll be improving the behavior in a future release.
  • Did you try disabling “include in _all field” for all your fields? This will require you to specify the field in the query, but should give you significant savings.
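For the second bullet, the change can be made by hand in the UI or by editing the index definition JSON. Below is a small Python sketch of the latter: it walks a definition and switches every "include_in_all": true to false. The helper name disable_include_in_all is made up for this example, and the sample is a trimmed-down fragment of the mapping posted earlier, not the full definition.

```python
import copy


def disable_include_in_all(node):
    """Recursively set include_in_all to False; return how many entries changed."""
    changed = 0
    if isinstance(node, dict):
        if node.get("include_in_all") is True:
            node["include_in_all"] = False
            changed += 1
        for value in node.values():
            changed += disable_include_in_all(value)
    elif isinstance(node, list):
        for item in node:
            changed += disable_include_in_all(item)
    return changed


# Trimmed fragment of the mapping from the index definition above.
sample = {
    "types": {"product": {"properties": {
        "product_name": {"fields": [
            {"include_in_all": True, "index": True,
             "name": "product_name", "type": "text"}]},
        "product_primary_genre_id": {"fields": [
            {"include_term_vectors": True, "index": True,
             "name": "product_primary_genre_id", "type": "number"}]},
    }}}
}

trimmed = copy.deepcopy(sample)   # leave the original definition untouched
count = disable_include_in_all(trimmed)
print(f"fields changed: {count}")
```

Queries against the updated index then have to name their field explicitly, as noted in the bullet.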

Feel free to get in touch with Couchbase support anytime for this.
You will be asked for logs and some details and we’ll be able to assist you better.

#1 I have only one field which is numeric, and its value can be between 1 and 100. Could this have such a big influence?
#2 I will try this.

Also, I wanted to ask if there is any way to speed up index creation. The servers’ CPU load is close to 3%, but the index is building very slowly. Is there a way to make the index build faster?

Increasing the number of index partitions will certainly boost your indexing speed if you have the CPU for it.

Note that there are two numeric fields in the definition: product_a_id and product_primary_genre_id.

There are some known issues with index size bloat that are fixed in upcoming releases of Couchbase Server.
One workaround you could explore is the suggestion specified here with the index definition.

So, you could try creating a new index with a higher number of partitions and the above store property overrides, to verify the impact on index size without affecting your current index.
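A sketch of how such a test index could be derived from the existing definition, in Python. Several things here are assumptions and not confirmed in this thread: the helper name clone_with_partitions and the new index name are invented, the uuid is dropped on the assumption that a new index must not reuse it, and maxPartitionsPerPIndex is recomputed as ceil(1024 / indexPartitions), which matches the 171 in the posted definition for 6 partitions on a bucket with 1024 vBuckets. The store property overrides mentioned above are not reproduced here and would need to be added separately.

```python
import copy
import math

VBUCKETS = 1024  # assumption: a standard Couchbase bucket with 1024 vBuckets


def clone_with_partitions(definition, new_name, index_partitions):
    """Return a copy of an index definition with a new name and partition count."""
    clone = copy.deepcopy(definition)
    clone["name"] = new_name
    clone.pop("uuid", None)  # assumption: a new index should not carry the old uuid
    clone["planParams"] = {
        "indexPartitions": index_partitions,
        "maxPartitionsPerPIndex": math.ceil(VBUCKETS / index_partitions),
    }
    return clone


# Only the keys relevant to this sketch, taken from the definition posted above.
current = {
    "type": "fulltext-index",
    "name": "product_name",
    "uuid": "784b2727c74866a2",
    "planParams": {"maxPartitionsPerPIndex": 171, "indexPartitions": 6},
}

# "product_name_test" and 12 partitions are placeholder choices.
test_index = clone_with_partitions(current, "product_name_test", 12)
print(test_index["planParams"])
```

Because the clone is a separate index with its own name, it can be built and measured next to the current one and dropped afterwards.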

As @abhinav mentioned, creating a customer support ticket would be the way to go.

Cheers!