We have already seen a handful of best practices for defining an FTS index in Part 1; now let's explore a few useful tips on the operational and maintenance aspects of an index.
A user who is new to the Information Retrieval (IR) domain may find the FTS index definition process a bit tedious, especially when configuring the text analysis pipeline for an index. It is easy to be overwhelmed by IR jargon like character filters, tokenisers, and the umpteen token filters like shingles, ngrams, edge_ngrams and so on. The choices seem endless when you start configuring a custom text analyser. And once you define an index with a custom or built-in analyser, these inherent complexities make it increasingly cumbersome to predict and verify the output of the configured analysers.
Now, you may ask: why does detailed insight into the text analysis output even matter?
It matters because most search users stumble over the following question during their initial tryst with any full text search system:
Qn. Why isn't the search returning any hits, even though I have defined the index as per the guidelines and I know the search tokens exist in the documents?
This mostly happens due to one of the following:
- A glitch in the analysers configured in the index definition, which tokenises or curates the data contrary to your expectations, i.e. the tokens emitted by the analysers (and hence indexed in the system) aren't the ones you expect them to be.
- You have unknowingly picked a different analyser at search time than the one defined in the field mapping of the index definition.
Document text is analysed when it is written to the index, and the search terms are analysed separately when a search is performed. By default, the field analyser specified in the index definition is also used to analyse the query contents if it is a field-scoped query.
To help you understand and fix the index analysis pipeline, FTS introduced a REST endpoint in the 6.5.0 release, /api/index/{indexName}/analyzeDoc, which accepts a JSON document in the request body and returns a JSON response containing the analysed tokens as per the analysers configured for the given index.
You can use this endpoint to explore and debug the analyser outputs in your development or staging environments, and it is even safe to try on a production system. The response shows exactly the tokens that would have been indexed if you had indexed similar documents. You may use the same endpoint to check the query analysis output by passing only the query field along with the query text as the JSON document, with a document type matching the one defined by the index type mapping.
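For instance, here is a minimal sketch of calling the endpoint with Python's requests library. The host, port, credentials, index name and sample document below are all placeholders, so substitute the values from your own environment:

```python
import requests

FTS_HOST = "http://localhost:8094"    # FTS service port on a Couchbase node
INDEX = "travel-index"                # hypothetical index name
AUTH = ("Administrator", "password")  # cluster credentials

# A document shaped like the ones your index's type mapping expects.
doc = {"type": "hotel", "description": "Quiet riverside B&B near the old town"}

resp = requests.post(f"{FTS_HOST}/api/index/{INDEX}/analyzeDoc",
                     json=doc, auth=AUTH)
resp.raise_for_status()

# The response lists the tokens that would have been indexed for this
# document under the analysers configured in the index definition.
print(resp.json())
```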
To correct an analyser mismatch between index time and query time, you can either use field-scoped queries, or override the analyser at query time, since many query types provide an option for this.
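As a sketch of the latter, the match query below is scoped to a field and also overrides the query-time analyser explicitly. The index name, field and analyser name ("en") are illustrative, so check which analysers your own index actually defines:

```python
import requests

FTS_HOST = "http://localhost:8094"
INDEX = "travel-index"
AUTH = ("Administrator", "password")

query = {
    "query": {
        "match": "riverside",
        "field": "description",  # field scope: uses the field's analyser by default
        "analyzer": "en",        # explicit query-time analyser override
    },
    "size": 10,
}

resp = requests.post(f"{FTS_HOST}/api/index/{INDEX}/query",
                     json=query, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```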
Qn. How do you know whether the FTS cluster is sized correctly? Are there any signposts for sensing an under-provisioned cluster?
Cluster sizing usually comprises configuring sufficient system resources like RAM/FTS memory quota and CPU processing power, along with the right cluster topology. Cluster topology in turn comprises the number of partitions and the number of nodes apt for a given data volume and query/indexing load. We have seen many of our customers end up using undersized clusters without following any sizing guidelines. Let's not get into the sizing details here, as that is a topic very specific to each customer's use case; we recommend contacting the Couchbase team for customised sizing guidelines for your respective requirements.
But there are a few easy indicators already built into the system which can hint that your cluster is under-provisioned:
- Really slow indexing progress, even in the absence of any query load or a steady stream of document mutations.
- Search queries getting rejected with HTTP status code 429, which shows that your system is undersized on the FTS memory quota.
- A lot of slow queries in the stats graphs, which may point to either poor sizing or suboptimal search queries.
Suboptimal queries here refer to queries which are not carefully written to get the best results with the right search clauses. We have sometimes seen customers trying complex compound queries with more than 250 sub-queries. The biggest inefficiency lurking here is that the search system has to scan a large swath of indexed data (mostly from disk) to fetch the right results, whereas a better-targeted query would return the same results while grazing only a small surface area of the index, resulting in much better use of system resources.
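To make this concrete, here is an illustrative contrast between the two query shapes using standard FTS query JSON; the field names and terms are hypothetical:

```python
# Suboptimal: a disjunction over hundreds of term clauses forces the
# engine to walk a posting list for every single term.
many_city_names = ["paris", "rome", "berlin"]  # imagine 250+ entries here
broad = {
    "query": {
        "disjuncts": [{"term": t, "field": "city"} for t in many_city_names]
    }
}

# Better targeted: a conjunction of a few selective, field-scoped
# clauses that touches a far smaller surface area of the index.
targeted = {
    "query": {
        "conjuncts": [
            {"match": "riverside", "field": "description"},
            {"term": "hotel", "field": "type"},
        ]
    }
}
```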