Check if a large number of keys exists in bucket

Hello. I am trying to check if a set of keys exists in the database. I don’t actually need their values, I just want to check if the key exists or not.

The array has millions of keys. I see two possible approaches:

SELECT meta().id 
FROM bucket USE KEYS ARRAY v FOR v IN myArray END

This one doesn’t work when my array is too big (it throws out an error).

The second one is to use the SDK and do a getMulti, but this one will return all of the document’s properties. Is there a function or an approach that will allow me to do something like a hasMulti, just checking for the existence of multiple keys?

Thank you

EDIT: getMulti will not work since I cannot identify the key of the document through the results. N1QL appears to be the only option now

CREATE PRIMARY INDEX ON bucket;
SELECT RAW META().id FROM bucket; 

Check above query using covered index (I think 4.51 onwards). Query gives array of META().id’s you can do the comparison in SDK/application.

Looking specific pattern of META().id add WHERE META().id LIKE "XYZ:%"
Looking for specific type = “product”, ADD WHERE type = "product"
CREATE INDEX ix1 on bucket(type);

Split array try in the Loop with primary index.
SELECT RAW META().id FROM bucket WHERE META().id IN […];

Make sure all the options uses covered index to make it faster. If some reason not possible USE index.
Also try with query parameter pretty=false

Hi @vsr1, thank you for the prompt reply. We are avoiding creating a primary index, as doing so will negatively affect performance (this was one Couchbase recommendation in the past), by listening to all changes made in the bucket.

Regarding your first approach:

Looking specific pattern of META().id add WHERE META().id LIKE “XYZ:%”
Looking for specific type = “product”, ADD WHERE type = “product”

Do you mean extracting all ids and compare them to the array in the server? This particular document does not have a type property.

Split array try in the Loop with primary index.
SELECT RAW META().id FROM bucket WHERE META().id IN […];

This is the approach we are currently following, and I guess the right one for what we want to do.

Make sure all the options uses covered index to make it faster. If some reason not possible USE index.

Covered indexes do make the queries faster, but adding indexes come at a (high) server cost. In this scenario, since we already have the keys, we would very much like to not use N1QL, as we found that good performance with N1QL has high requirements.

Thank you for your help, I am looking forward to your feedback

If your documents are JSON and of a regular format (i.e. all contain a common field), you can use the Sub-Document API exists request to return True or False if the key exists. For example (in Python SDK), if all your documents contain an id field:

#!/usr/bin/env python

from couchbase.bucket import Bucket
from couchbase.exceptions import NotFoundError

import couchbase.subdocument as SD

client = Bucket('couchbase://localhost/default'.format(host, port))

# Testing - create all even keys
for i in range(0, 10, 2):
    key = 'key_' + str(i)
    client.set(key, {"id": i})

# Check for existance (of the field "id") in each document
for i in range(10):
    key = 'key_' + str(i)
    print key,
    try:
        rv = client.lookup_in(key, SD.exists('id'))
        print rv.exists(0)
    except NotFoundError:
        print False

Outputs:

key_0 True
key_1 False
key_2 True
key_3 False
key_4 True
key_5 False
key_6 True
key_7 False
key_8 True
key_9 False

In terms of performance, this is a single key/value operation, and if you pipeline your requests (which most of SDKs will allow you to do) then it should be essentially as fast as the same number of GET operations, without the cost of the server having to send back the document body (and the client having to read it).

1 Like

The documents are JSON, but they don’t contain a common field. The key by itself contains the info we need, so we did not add it to the document itself. We can add it, if it will simplify operations on that document.

In terms of performance, this is a single key/value operation, and if you pipeline your requests (which most of SDKs will allow you to do) then it should be essentially as fast as the same number of GET operations, without the cost of the server having to send back the document body (and the client having to read it).

I am not sure I understand what you mean with pipeline my requests. Using a for cycle wouldn’t effectively make millions of different requests? In comparison with a getMulti, wouldn’t performance be much worse?

That might be your easiest option. In fact you don’t even need to store the whole key, you just need any path which is guaranteed to exist on all your document - e.g. “type: x” or similar. There’s no need to store a (possibly lengthy) key to use this technique.

Unfortunately there isn’t currently an EXISTS command at the document level (“does this key exist?”), only for the Sub-Document API as detailed above.

Here’s a secret about getMulti - there’s actually no binary protocol “multi” commands on the wire :wink:

The SDK getMulti APIs simply issue a series of GET requests in a pipeline (i.e. sends all the requests immediately without waiting for a response from the previous request).

As such as long as you pipeline (exact details depends on the SDK - see the Async methods in Java and .NET) then the requests will be issued as soon as possible.

Hey @manusyone,

Just chiming in with more details regarding the Node.js SDK. As @drigby indicated, there is no such thing as a getMulti operation on the network, it is always implemented using a sequence of independent GET operations that are dispatched together. You will note that we actually pipeline operations across the network together automatically by the Node.js SDK without any extra tricky required. You can take a look at the existing getMulti operation which is more or less just a loop here:
https://github.com/couchbase/couchnode/blob/master/lib/bucket.js#L1241-L1288

We provide getMulti as a convenience method (since it doesn’t have any difficult-to-map problems with CAS comparison and such), but you can easily implement the same logic for any other specific kinds of operations you wish to perform.

Cheers, Brett