Introducing vector search (KNN), with its distance-based similarity scoring, into the existing Search paradigm necessitated a shift in how we thought about “relevant” results and how to measure them. 

Text-based indexes use tf-idf as their scoring mechanism, and given a fixed corpus of words (here, the documents in a partition), the results remain the same across searches.

In contrast, a KNN search does not guarantee the same determinism. The results are approximate and often differ between searches. This article is about the Search team pivoting from exact to approximate. Along the way, we answer questions about why approximate results can become the new normal and how much approximation is acceptable.

Setting the Stage 

Each Search index partition is a Bleve index comprising multiple zap segments (a segmented architecture), with each segment containing a vector index. The segments are periodically compacted by the Bleve index's merger routine.

Search uses FAISS for vector index creation, training, searching, and related functionality.

The two broad classes of indexes currently used by Couchbase Search are:

    1. Flat indexes – perform exhaustive search, akin to storing the vectors in an array.
    2. IVF (inverted file) indexes – centroid-based indexes which cluster the dataset (using k-means in this case) and then assign each vector to its nearest cluster.
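
As a rough illustration of the two index classes, here is a minimal sketch using the FAISS Python API (purely for demonstration – the dimensions and data here are made up):

```python
import faiss
import numpy as np

d = 128                                            # vector dimensionality
xb = np.random.rand(10_000, d).astype('float32')   # base vectors (placeholder data)
xq = np.random.rand(5, d).astype('float32')        # query vectors

# 1. Flat index: exhaustive search over every stored vector, exact results.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D, I = flat.search(xq, 10)           # exact 10 nearest neighbours per query

# 2. IVF index: k-means clusters the dataset; a search visits only a few clusters.
nlist = 100                                        # number of centroids/clusters
quantizer = faiss.IndexFlatL2(d)                   # assigns vectors to centroids
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
ivf.train(xb)                                      # runs k-means on the dataset
ivf.add(xb)                                        # populates the clusters
ivf.nprobe = 10                                    # clusters visited per query
D, I = ivf.search(xq, 10)            # approximate 10 nearest neighbours per query
```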

A Brief Primer on KNN

As I touched upon earlier, the testware (and, equally important, our thinking!) was predisposed to exact scoring. Text-based search is fundamentally exhaustive, in that an inverted index includes every token in the partition’s documents: all the documents in an inverted index are eligible for a search and will be searched through.

A centroid-based vector index, by comparison, limits the pool of eligible vectors right off the bat – by only searching through specific clusters, which may or may not be the same for each query. This means that for a given query, potentially time-consuming exhaustive search is traded for approximation.

(If the word ‘recall’ threw you off, sit tight – we will get to it in just a bit.)

Considering that we limit our search space right at the beginning, it’s important to “cluster right”. 

One of the first steps in a search involves picking how many and which clusters to search through. Too few and you end up missing out on some potentially similar vectors. Too many and search latency increases significantly for a relatively small increase in search quality. 

The metric used for search quality is recall – the percentage of the returned results that are objectively the closest to the query vector. The set of vectors actually closest to the query vector is called the ground truth and serves as the baseline when measuring recall. Since the KNN score is the distance between two vectors, it is independent of the other documents in the partition (unlike tf-idf), and, more importantly, this allows an objective comparison between an independently computed ground truth and the returned results.
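
To make the metric concrete, here is a minimal sketch of how recall@K can be computed against a precomputed ground truth (an illustration, not the actual test harness):

```python
import numpy as np

def recall_at_k(found_ids: np.ndarray, ground_truth_ids: np.ndarray, k: int) -> float:
    """Fraction of the true K nearest neighbours present in the returned results.

    Both arguments are (num_queries, >=k) arrays of neighbour IDs; the ground
    truth comes from an exhaustive (exact) search over the same dataset.
    """
    hits = 0
    for found, truth in zip(found_ids, ground_truth_ids):
        hits += len(set(found[:k]) & set(truth[:k]))
    return hits / (found_ids.shape[0] * k)

# e.g. recall_at_k(I, ground_truth, 100), where I comes from index.search(xq, 100)
```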

How Much (Approximation) is Too Much (Approximation): Driving Recall from 0.06 to 90+

Armed with all this knowledge, we decided to start testing for recall. Our early tests showed a surprisingly low recall of close to 0 – 0.06 to be precise. 

While we had offloaded the vector index search to FAISS, there were some aspects we needed to handle on our end. One of them was mapping document IDs to vectors. Search maps each document number to a unique vector hash. These hashes are then passed to FAISS as custom IDs, leveraging its support for mapping vectors to caller-supplied IDs; searches then return those custom IDs. Since the vectors (each of the same dimensionality) are concatenated into one large array, and the IDs into another, getting the ordering right is what determines which ID maps to which vector. Internally, FAISS uses a hashmap to store the vectors and their IDs.

A closer look showed that we were mapping randomly ordered IDs to vectors when rebuilding indexes during a merge, which made the result set essentially random. This affected both flat and IVF indexes, since both rely on the ordering of the IDs when retrieving results.
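
To illustrate why the ordering matters, here is a minimal sketch using the FAISS Python API's ID-mapping support (the variable names are made up; the real code operates on the concatenated arrays described above):

```python
import faiss
import numpy as np

d = 128
vectors = np.random.rand(1_000, d).astype('float32')   # vectors, in document order
doc_ids = np.arange(1_000, dtype='int64')               # stand-ins for the vector hashes

# IndexIDMap attaches caller-supplied IDs to vectors; searches return these
# custom IDs instead of sequential positions.
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))

# The i-th ID is bound to the i-th vector, so the two arrays must be in the same
# order. Shuffling one of them (as happened on the merge path) still builds a
# valid index, but every returned ID then points at the wrong document – which
# is why recall collapsed to near zero.
index.add_with_ids(vectors, doc_ids)

_, ids = index.search(vectors[:1], 5)                    # returns doc_ids, not positions
```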

Once the ordering issue was solved, along with some other merge path fixes, the recall jumped to around 70. We were now on the right track – we didn’t have any fundamental bugs plaguing us. We started taking a look at the knobs we could tune. 

Turning Knobs – Centroids and Nprobe

The initial strategy used a fixed number (100) of centroids for all vector indexes with more than 10k vectors. In essence, this was treating 1M vectors the same as 20k vectors.

FAISS’s clustering defaults specify a minimum (39) and a maximum (256) number of training points per cluster; beyond the maximum, the training points are subsampled. 100 centroids may have been enough for at most 100 * 256 = 25,600 vectors, but for anything over that, excessive subsampling was taking place, as reflected in the recall. What we needed was a formula for the number of centroids that scaled with the dataset.

What we’re looking to optimise for: Recall@K, without indexing and search latency taking too much of a hit, if possible. 

Setup

The setup was fairly simple – scripts that create a FAISS index (training it and adding vectors with IDs) and then query it, with the ground-truth results known beforehand. I used the SIFT10K and SIFT1M datasets from the original paper, since they provide ground-truth nearest neighbours under Euclidean distance. Recall@K was the mean recall over 100 and 10,000 queries respectively.
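
In sketch form, such a script looks roughly like the following (the file names are placeholders for wherever the SIFT files are unpacked, and recall_at_k is the helper from the earlier sketch):

```python
import time
import numpy as np
import faiss

def fvecs_read(path):
    # Standard SIFT .fvecs layout: an int32 dimension followed by that many float32s.
    raw = np.fromfile(path, dtype='int32')
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy().view('float32')

def ivecs_read(path):
    # .ivecs is the same layout with int32 payloads (used for the ground truth).
    raw = np.fromfile(path, dtype='int32')
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy()

xb = fvecs_read("sift_base.fvecs")           # base vectors (1M for SIFT1M), d = 128
xq = fvecs_read("sift_query.fvecs")          # query vectors
gt = ivecs_read("sift_groundtruth.ivecs")    # exact 100-NN per query

d, k = xb.shape[1], 100
for nlist, nprobe in [(100, 1), (100, 10), (1000, 31), (4000, 64)]:
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

    t0 = time.time()
    index.train(xb)                          # k-means over the (possibly subsampled) data
    train_s = time.time() - t0

    index.add(xb)
    index.nprobe = nprobe                    # clusters probed per query

    t0 = time.time()
    _, found = index.search(xq, k)
    search_s = time.time() - t0

    print(f"nlist={nlist:5d} nprobe={nprobe:3d} "
          f"train={train_s:7.2f}s search={search_s:6.2f}s "
          f"recall@100={recall_at_k(found, gt, k):.3f}")
```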

Increasing centroids

The first phase involved tweaking the number of clusters.

Sift1M results – 10,000 queries

# centroids    Training time (s)    Search time (s)    recall@100
100            1.83                 20.72              0.61
200            1.856                14.423             0.558
500            4.75                 4.101              0.4833
1000           15.13                2.4113             0.43

Sift10k results – 100 queries

# centroids    Training time (ms)    Search time (ms)    recall@100
10             30.78                 1370                0.82
50             103                   368.9               0.69
100            100                   188.48              0.6

Insights

    1. The current baseline shows a recall of 0.61, which can definitely be improved.
    2. The recall decreases with an increase in the number of centroids. 
    3. Search time decreases due to increasing localization even as training time increases.

Conversely, with a lower number of centroids, search time increases despite the lower training time, since each query has to scan larger cells containing more vectors.

Now that it’s been established that increasing centroids has a negative impact on recall, let’s try to intuitively understand why that’s so.

With a fixed-size dataset, increasing the number of centroids decreases the number of vectors in each cluster. With smaller clusters (and the number of clusters probed unchanged), we search fewer vectors overall before picking the K closest.

Hence, an increase in the number of clusters should be accompanied by a corresponding increase in the number of clusters searched (nprobe).
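
A quick back-of-the-envelope sketch of that relationship, assuming roughly balanced clusters:

```python
# Vectors scanned per query is roughly nprobe * N / nlist for balanced clusters.
N = 1_000_000
for nlist, nprobe in [(100, 1), (200, 1), (200, 2)]:
    print(f"nlist={nlist:4d} nprobe={nprobe}: ~{nprobe * N // nlist:,} vectors scanned")

# Doubling nlist alone halves the vectors scanned (and with it the chance of
# catching the true neighbours); doubling nprobe as well restores the coverage.
```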

Increasing nprobe

Sift1M – 10,000 queries

nlist                     nprobe    Training time (s)    Total search time (s)    recall@100
100 (current baseline)    1         1.43                 21.24                    0.61
100                       10        0.778                119.5                    0.993
200                       14        1.12                 84.54                    0.99
500                       22        3.23                 52.80                    0.988
1000                      31        10.033               37.79                    0.988
2000                      44        36.36                27.61                    0.985
3000                      54        80.94                22.74                    0.985
3906 (1M/256)             62        134.61               20                       0.984
4000                      32        136.71               10.09                    0.956
4000                      64        138.57               20.36                    0.987

Sift10k – 100 queries

nlist             nprobe    Training time    Total search time    recall@100
10                1         33.85 ms         1.52 s               0.82
39 (10000/256)    6         70.6 ms          1.91 s               0.96
50                7         70.26 ms         1.68 s               0.99
100               5         91.5 ms          677.14 ms            0.9
100               10        99 ms            1.317 s              0.96
200               14        133.26 ms        930.2 ms             1.0

Insights

    1. For the 1M dataset, the baseline recall is low due to nprobe = 1.
      • Once that is fixed, the recall increases rapidly and is no longer a concern, despite subsampling.
      • However, with great recall comes greater search latency. 
    2. For the 10k dataset, baseline recall, while not as low as that of the 1M dataset, is still a matter of concern.
      • This, too, is easily remedied by increasing nprobe.

Based on these results, the defaults for determining nlist and nprobe were changed from fixed constants to formulas that scale with the number of vectors in the index.
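
The exact formulas aren't reproduced here, but the idea can be sketched as follows – note that these are common FAISS rules of thumb used purely for illustration, not necessarily the values Couchbase settled on:

```python
import math

def pick_nlist(num_vectors: int) -> int:
    # Scale the number of centroids with the dataset instead of fixing it at 100;
    # 4 * sqrt(N) is a commonly cited FAISS guideline (illustrative only).
    return max(1, int(4 * math.sqrt(num_vectors)))

def pick_nprobe(nlist: int) -> int:
    # Probe more clusters as the cluster count grows, so coverage doesn't shrink.
    return max(1, int(math.sqrt(nlist)))
```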

Tuning a Vector Index

Tuning a Couchbase vector index definitely isn’t something to sweat about.

Since the recall/latency tradeoff is a crucial one in vector search, we wanted to give the user some flexibility in which way to lean. Along with user-facing considerations (the setting should be easy to understand and intuitive), future-proofing (forward compatibility) and the segmented architecture imposed some constraints on the API.

Each segment is a vector index with a different number of vectors. At query time, a user isn’t aware of the data distribution at the partition level, let alone at the segment level. Depending on the nature of mutations (a large number of deletes, for example), the number of vectors can vary quite a bit when segments are merged. Hence, a one-size-fits-all(-segments) approach cannot be applied.

In addition, forward compatibility requires that the knobs not be specific to a particular FAISS index type, since this is an area that should be, for the most part, abstracted away from the user even if the index types used in the implementation change. For example, having the user specify nprobe or nlist would be very IVF-specific and would become a breaking change if the index types powering vector search changed.

Allowing the user to toggle which metric to optimise for (recall or latency) fits the bill. While tuning for recall is done differently for an IVF index than for, say, an HNSW index, recall applies to both and can be tuned for at the segment level. In the data above, doubling nprobe leads to a corresponding doubling of search time, while the corresponding increase in recall is only a few points. Hence, when optimising for latency, halving nprobe yields latency gains without too great a hit on recall.
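
As a hypothetical illustration of how such a toggle can map onto an IVF knob (this is not the actual per-segment logic, just the tradeoff from the data above expressed in code):

```python
def effective_nprobe(base_nprobe: int, optimized_for: str) -> int:
    # Optimising for latency probes fewer clusters, trading a small amount of
    # recall for roughly half the search time; otherwise keep the recall-friendly
    # default. Purely illustrative – names and values are made up.
    if optimized_for == "latency":
        return max(1, base_nprobe // 2)
    return base_nprobe
```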

The index definition setting that lets the user toggle between optimising for recall and optimising for latency is vector_index_optimized_for. This setting is documented in the official docs.

For more such deep dives and developments in the Couchbase Vector Search space, stay tuned!



Author

Posted by Aditi Ahuja, Software Engineer
