Large scale query-multithreading and batching

We have to set up a batch process that runs a query to delete a specific portion of Couchbase docs (more than 35k docs at once). We have been told to use multi-threading and the 3.x version of the .NET SDK. I had read that batch processing is supported only in 2.7. Can anyone suggest any pointers here for multithreading and batch processing?

@krishnaa

It is supported, but via async Tasks, in 3.x. Basically, just run off a bunch of tasks and then await Task.WhenAll(tasks).

However, this approach does add a few complexities. For example, exceptions if any of the tasks fail are difficult to extract clearly. Also, if the batch size is very large (as yours is) it can actually cause performance issues if you don’t control the degree of parallelism.
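To make both caveats concrete, here is a minimal sketch (not the MultiOp library, just plain .NET) that caps how many operations are in flight with a SemaphoreSlim and records which key failed with which exception. The names BoundedBatch, RunAsync, operation and maxParallelism are made up for illustration; operation stands in for whatever per-document SDK call you end up using.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedBatch
{
    // Runs `operation` once per key with at most `maxParallelism` calls in flight.
    // Returns the keys that failed, paired with the exception each one threw.
    public static async Task<IReadOnlyList<(string Key, Exception Error)>> RunAsync(
        IEnumerable<string> keys,
        Func<string, Task> operation, // hypothetical per-document call (assumption)
        int maxParallelism)
    {
        var failures = new ConcurrentBag<(string Key, Exception Error)>();
        using var gate = new SemaphoreSlim(maxParallelism);

        var tasks = keys.Select(async key =>
        {
            // Wait for a free slot so the number of concurrent operations stays bounded.
            await gate.WaitAsync();
            try
            {
                await operation(key);
            }
            catch (Exception ex)
            {
                // Keep the failing key next to its exception instead of
                // letting Task.WhenAll surface only the first one.
                failures.Add((key, ex));
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
        return failures.ToList();
    }
}
```

You would call it along the lines of `await BoundedBatch.RunAsync(keys, key => collection.RemoveAsync(key), 16)`, where 16 is an arbitrary starting point to tune and the lambda is whatever mutation you actually need.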

I’ve been working on a library to help address this. It’s on GitHub, but not published to NuGet yet. I’d love it if you wanted to give it a try and give feedback on performance and API surface. Implement Couchbase.Extensions.MultiOp by brantburnett · Pull Request #92 · couchbaselabs/Couchbase.Extensions · GitHub

You should be able to pull the source from that branch and build it. Instructions for use are here: Couchbase.Extensions/docs/multi-op.md at 95b12bfcb0fee84281ae9c556d429c9c133185d0 · couchbaselabs/Couchbase.Extensions · GitHub


@btburnett3
My requirement is to delete a specific node from all the Couchbase documents that contain it. The number of documents containing that node might vary from 10 to 35k.
I can fetch the count of documents from which the node has to be deleted.

So how can I use await Task.WhenAll(tasks) to run a bunch of tasks? How can I make sure the first task is not overwriting or repeating the work done by the second task? I will not have the list of document IDs from which the node has to be deleted, and the document count is huge.

If you don’t have the list of document keys what’s your plan to know which documents to mutate? Are you just going through all documents in the bucket and inspecting them one by one?

@btburnett3
I'm currently doing it with this N1QL query, and I have added an index on the nested array:
UPDATE Test AS d
SET d.children = ARRAY l FOR l IN d.children WHEN l.childId != "123456" END
WHERE ANY v IN d.children SATISFIES v.childId = "123456" END;

CREATE INDEX ix1 ON Test (DISTINCT ARRAY v.childId FOR v IN children END);

But I wanted a better way of achieving this that improves performance.

What’s the performance like running that query just as a SELECT query to get the document keys? If you had that list, then you could use the keys to spool off the mutations.
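As a rough sketch of that idea: the statement below is a key-only SELECT that reuses the WHERE clause from your UPDATE, and Partition splits the returned keys into disjoint batches so no two tasks ever touch the same document. KeySpool, SelectKeysStatement, Partition and the $childId named parameter are names I made up for illustration; actually running the statement through the SDK is left out here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeySpool
{
    // Key-only SELECT reusing the WHERE clause of the UPDATE in this thread.
    // $childId is a named query parameter (an assumption; the thread hard-codes "123456").
    public const string SelectKeysStatement =
        "SELECT RAW META(d).id FROM Test AS d " +
        "WHERE ANY v IN d.children SATISFIES v.childId = $childId END";

    // Splits the keys into disjoint batches. Duplicates are dropped first,
    // so every key ends up in exactly one batch and tasks cannot overlap.
    public static List<List<string>> Partition(IEnumerable<string> keys, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var distinct = keys.Distinct().ToList();
        var batches = new List<List<string>>();
        for (var i = 0; i < distinct.Count; i += batchSize)
        {
            batches.Add(distinct.GetRange(i, Math.Min(batchSize, distinct.Count - i)));
        }
        return batches;
    }
}
```

In SDK 3.x I would expect the keys to come from something like the cluster's QueryAsync<string> call with this statement, but treat that wiring as an assumption and check it against the SDK docs. Each batch (or each key) then becomes one task.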

That said, depending on your goals, I’m not sure if it would improve performance. That’s basically what the query node is already doing for you. Seems like your major problem isn’t the mutations, but how you’re identifying documents which need to be mutated.