Retrieving single tenant data in multi-tenant bucket

sebarys · April 15, 2021, 12:42pm

Hello,

We are considering of using Couchbase database. I have question about how we could achieve following use-case using this storage engine:

we would like to store many (millions) documents per company
each document has external unique identifier
there is dynamic list of companies - companies can be added but also removed
as list of companies can change dynamically we would prefer to have one bucket for all those documents and distinguish somehow to which company particular document belongs
documents are retrieved using unique identifier - main use case
any company can ask to export all its document - very rare operation
in case of company removal we should delete all documents that belong to given company - very rare operation

From above list of requirements we see that using single bucket where document is referred by its unique identifier would handle main use case by just using Key Value Couchbase operations.

Anyway for mentioned two very rare operations: company documents export and company documents removal we wouldn’t be probably able to achieve functionality without adding some indexes/views etc. am I right?

It leads us to two questions:

which indexing mechanism use: Global Secondary Index or View?
- Views couldn’t be queried using N1QL as far as I understand
- Primary index of N1QL will allow you to query any field of your JSON - as we would like to query only one property it looks like too wide and too expensive
- GSI - looks like option which would support mentioned requirements and provide lower overhead in compare with above features
- we could even consider creating ad-hoc partial SGI index dedicated to tenant just for the time of export/removal
- Is GSI still doesn’t self replicate among nodes? I’m referring to Differences between primary and secondary Indexes? - #17 by cihangirb “we don’t support replication/replicas with GSI but you can manually do this. The way to do this would be to create the 2 identical indexes on 2 separate nodes (a.k.a mirror the index). If one node fails and takes out one copy of the index, all traffic will hit the remaining available index.”
where store information about company name?
- We could store it as part of document identifier
```
doc-${COMPANY_NAME}-${DOCUMENT_ID}: 
{
  "someDocumentKey": "someDocValue"
}
```
it would allow us to query with N1QL LIKE statement (Indexing Metadata Information | Couchbase Docs)
```
// get all company document IDs
SELECT META().id FROM bucketName WHERE META().id LIKE "%COMPANY_NAME%"
```
- alternatively we can put info about company name in custom field
```
doc-${DOCUMENT_ID}: 
{
  "companyName": "XYZ",
  "someDocumentKey": "someDocValue"
}
```
and then query against this field

Is there any performance differentiator in case when we will query against part of document identifier instead of custom field in document?
Would both cases should result in the similar overhead in terms of additional storage/rebalance times/speed of solution?

Could you guide us which Indexing mechanism would be better and how we should approach storing info about company name to support such use case? Or maybe you see different solution that we could try?

Thanks in advance for any insights and info !

jon.strabala · April 19, 2021, 10:47pm

Hi, @sebarys,

Thanks for the details in our DM to recap

I asked many (millions) 50, 100, 500, 900 million? Can you be a bit more accurate?

depends on company, but in general between 0.5million and 50million documents - this number is dynamic and correlate to traffic that we’re receiving for given company

You went through documentation and used the follwoing for your rare removal request (less than 5 archive or removal requests per year)

on such rare request we will create conditional SGI
* CREATE INDEX company_xyz_documents_idx ON bucket_name(companyName) WHERE companyName='xyz';
data export - using created SGI and pagination Using OFFSET and Keyset in N1QL | The Couchbase Blog we will export all company document IDs and then using our API get all documents
data removal - using created SGI delete all company related documents using DELTE N1QL query
after operation will be completed index will be removed

ARCHIVING A COMPANY’S DOC SET VIA THE EVENTING SERVICE

Now that I know the above scale and pattern you could use the Eventing service to archive a company’s doc set to a different bucket (no INDEX needed). This is just 4 lines of JavaScript (sans comments).

// This function requires:
//
// 1. has a source bucket of 'source'
// 2. has an Eventing scratchpad bucket called of 'meta' 100MB is fine.
// 3. has a target bucket called of 'archive' with a bucket binding 
//    aliased to 'dst_bkt' this must be big enough to hold the archived 
//    data in mode read+write.
// 4. If more than 50M documents double the Eventing Memory Quota to 512 MB
// 5. In settings we use 24 workers for more performance (you need 24 vCPUs 
//    to take advantage of this parallelism, use less if you have less vCPUs)
//
// Use as follows:
//
// A. create or flush the bucket 'archive' 
// B. Edit this function perhaps to:
//        if (! meta.id.startsWith("doc-somecompany-")) return;
// C. Deploy this function wait for completion and then undeploy.
// D. Use /opt/data/cbexport
// E. drop or flush the bucket 'archive' 

function OnUpdate(doc, meta) {
    if (! meta.id.startsWith("doc-wigitcompany-")) return;
    dst_bkt[meta.id] = doc;
}

I did a quick test stuck 2 sets of company docs into a single bucket 10M each company. So the Bucket ‘source’ holds 20M docs total or 10M each company. I also made Bucket ‘archive’ and the Eventing scratchpad bucket ‘meta’ 100MB

doc-tirecompany-doc########
doc-widgetcompany-doc########

I deployed the above function (named “archive_all_keys_by_prefix”) on a 12 Core non-MDS single node test server and it archived one company at 85K/docs per seconds. This took less than 2 minutes. It should run at about 60K docs/sec if you had 50 companies because DCP is faster than KV (DCP is a firehose) and we only need to copy or archive data from a single relevant company.

Next I ran the cbexport utility as follows:

/opt/couchbase/bin/cbexport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b archive -f lines -o /tmp/tmp.out.json

This also took a bit less than 2 minutes. At this point we can delete the Bucket Archive.

FYI we could adjust the function to also archive the company doc with a minor tweak.

PURGING A COMPANY’S DOC SET VIA THE EVENTING SERVICE

Now that I know the above scale and pattern you could use the Eventing service to purge a company doc set directly from it’s source bucket (no INDEX needed). This is just 4 lines of JavaScript again (sans comments).

// This function requires:
//
// 1. has a source bucket of 'source'
// 2. has an Eventing scratchpad bucket called of 'meta' 100MB is fine.
// 3. has a bucket binding aliased to 'src_bkt' to the 'source' bucket
//    in mode read+write.
// 4. If more than 50M documents double the Eventing Memory Quota to 512 MB
// 5. In settings we use 24 workers for more performance (you need 24 vCPUs 
//    to take advantage of this parallelism, use less if you have less vCPUs)
//
// Use as follows:
//
// A. Make sure you already did an 'archive' if needed.
// B. Edit this function perhaps to:
//        if (! meta.id.startsWith("doc-somecompany-")) return;
// C. Deploy this function wait for completion and then undeploy.

function OnUpdate(doc, meta) {
    if (! meta.id.startsWith("doc-wigitcompany-")) return;
    delete src_bkt[meta.id];
}

I did a quick test stuck 2 sets of company docs into a single bucket 10M each company. So the Bucket ‘source’ holds 20M docs total or 10M each company. I also made the Eventing scratchpad bucket ‘meta’ 100MB

We are deleting all the docs related to

doc-widgetcompany-doc########

But leaving

doc-tirecompany-doc########

I deployed the above function (named “purge_all_keys_by_prefix”) on a 12 Core non-MDS single node test server and it archived one company at 90K/docs per seconds. This deletion also took less than 2 minutes.

FYI we could adjust the function to also delete the company doc with a minor tweak.

FINAL COMMENTS

The two Eventing Functions (attached) could also be combined, we are using the Eventing service as a one off utility I beleive it will fit your needs perfectly.

Best

Jon Strabala
Principal Product Manager - Server‌Functions_for_sebarys.zip (2.1 KB)

pvarley · April 20, 2021, 9:32am

Another option here once Couchbase Server 7.0 is release is to use scopes and collections. Each company could have its own collection.

any company can ask to export all its document - very rare operation

For exporting cbexport supporting exporting all the docs from a single collection. For example exporting all the documents for the couchbase collection in the companies scope:

/opt/couchbase/bin/cbexport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b $BUCKET -f lines -o /tmp/tmp.out.json --include-data companies.couchbase

in case of company removal we should delete all documents that belong to given company - very rare operation

A collection can be drop deleting all the documents in it.

Couchbase Server 7.0 is release later in the year .

sebarys · April 20, 2021, 3:23pm

@jon.strabala @pvarley thank you for ideas and your responses! Much appreciated!

I think that for now we will stay with indexing as it looks like the most natural way for handling such use case and don’
t require using very specific Couchbase features that would be hard to translate in case we will decide migrate to another DB engine:

Eventing Service - this stream-like approach looks very nice but require Enterprise version of Couchbase where we’re currently using Community edition
Collections - in the future when migration to Couchbase 7.0 would be possible we would definitely look on it but as we need something that is already available, so it is no-go solution for now

Thanks again for replies and alternatives - it definitely gave us a broader overview on what is possible when using Couchbase!

Topic		Replies	Views
Multiple different indexes per tenant - How? SQL++ n1ql , index	9	3001	January 25, 2018
Indexing and Search on Huge Bucket Analytics n1ql , view	4	2470	May 4, 2018
How to create or copy indexes from one couchbase server to couchbase another server Couchbase Server n1ql	3	2302	February 14, 2020
Global Secondary Index with null key Couchbase Server query , n1ql	3	2636	July 13, 2016
Need to retrieve document ID without using N1QL query(index) and views Couchbase Server query , java , n1ql	3	1252	April 20, 2020

Retrieving single tenant data in multi-tenant bucket

Related topics