For all those who are restlessly eager, let me reveal the deep dark secret of all the performance optimisations in FTS in 6.5.0 is,
“ Do less And Do it less often ** ”
And I am sure you would concur with me as you scroll further down.
(For those who missed – Part 1)
Scatter Gather over gRPC
Usually an FTS index is partitioned to sub indexes and the index partitions are placed across nodes in a multi node cluster. Any typical incoming search query is then scattered to all those nodes/remote index partitions and the search results are gathered from different nodes, merged by rank and sent back to the user by the coordinating or the query handling node.
Until 6.5.0 release, all the scatter-gather in FTS happens over http/http2. Moving away from this REST based scatter gather was a long pending item in team’s backlog. Procrastinating did work to our benefit here.
With the onset of 6.5.0 features like N1QL-FTS integration, we realised that the time is ripe to experiment with gRPC for the internal scatter gather. Since much has already been talked about the pros of gRPC everywhere on the web, Iet me refrain from repetition here.
With FTS, we were mostly interested in exploring the streaming and request multiplexing capabilities of gRPC. Streaming back the search results to the clients helps in addressing the use case of – serving a search result with millions of non-ranked (constant ranked) hits in a memory friendlier way.
Interesting challenges were,
- Getting the correct interfaces for enabling the higher level cbft package to write the search results to the client stream in a configurable manner(eg: batch of hits or a single hit) as and when the indexing library iterators tread through the matched documents somewhere deep inside storage engine.
- Defining the right protobuf message types in context of the existing message types.
Getting the right protobuf message wasn’t very straight forward since FTS already had the established json message types which were knitted tightly with the internal Golang types and all our customers are using them. Hence any newer message format had to take into consideration the legacy message types. This introduces an extra level of marshal/unmarshal efforts between those Go types and the new protobuf types. And this indeed triggered the panic button during initial tests due to the significant dents in our performance graphs.
We immediately addressed that issue by cutting the parsing overheads of the legacy search message from the new protobuf message type. The idea was to embed the whole legacy search message as a byte slice within the new proto message which removed some roundabouts and created much lesser garbage.
Outcome
The gRPC based scatter gather has shown a 2X throughput improvement for low frequency term queries in our internal benchmarks with a 3 node cluster.
Numeric Range Query
The text indexing library we use (bleve) stores the numeric field values at different precision points, i.e, a single numeric field value is indexed as multiple shifted binary encoded tokens. Though this approach makes the index size bigger, at query time this will result in fewer terms to search for a given numeric range and thus makes these queries much faster.
Opportunity
Similar to the modus operandi of the geo queries mentioned in Part 1 of this series, the numeric range query implementation also derives the exact numeric terms (candidate terms) to be searched mathematically. The search candidate term filtering logic used for numeric range query is similar to that of the fuzzy/regex queries. It will try to filter the candidate terms by intersecting the two DFAs, ie intersection between the Terms Dictionary and that of the temporary DFA/FST built from the mathematically created numeric ranges. However, from the golang flame graph profiling, this approach turned out to be less efficient on memory and CPU.
We have improved this filtering logic with a cleaner and lighter Terms Dictionary look up API which helped tame the memory and cpu usage. And this removed those extra spikes that popped up in the flame graphs.
Outcome
This change has brought about 77% throughput improvements in FTS internal performance tests which had a steady stream of incoming mutations (10K sets/secs).
Prefix/Fuzzy/Wildcard Queries
Opportunity
All these query types make heavy use of FST traversals. Prefix queries leverage the range based FST iterators while the wildcard/fuzzy make use of the automaton based iterators. i.e, a regex automaton for wildcard/regex queries and levenshtein automaton for the fuzzy queries.
We identified the concurrently shareable structures across these queries and cached them at a segment snapshot level to improve the reusability across queries. This method reduced the amount of garbage created by a lot and helped us save on the GC cycles.
Outcome
Better caching and reuse helped improve throughput numbers for –
Wildcard by 25%, Fuzzy by 12%, Prefix by 9%.
{** There are only three optimisations: “Do less. Do it less often. Do it faster.”
The largest gains come from 1, but we spend all our time on 3. – Michael Fromberger}