Batch Update statements

Hi there,

If I need to run about 1 million update statements on documents, can I run them in a batch instead of writing code to update them one by one with an SDK upsert? If so, how many per batch is safe? Can I do it in the UI, or should I use the command line? Also, will it be better without an index on the total field (so there's no index to update)?
Because this is just a one-time thing, I'm wondering if there's an easy way to do this.

statement example:

update table use keys ['row_id'] set total = 123; 

Do you have to run singleton updates? (i.e. you only have a list of keys and each document is going to have a different specific value set.) If you do, it will be most efficient to use the SDK and update directly (GET + REPLACE), to avoid the hop through the Query node. Furthermore, if you are simply setting a specific field to a simple value and have no need to read the rest of the document, then the sub-document operations (ref: (e.g.) Sub-Document Operations with the C SDK | Couchbase Docs) will be best of all, since the documents won't have to be read first.

If the update can be calculated from the rest of the document (or is some fixed value for multiple documents), then an SQL++ statement could be the right way to approach this and you may not need so many individual statements, e.g.

UPDATE my_collection SET total = ARRAY_SUM(array_field) WHERE...

(Documents could be selected by a WHERE clause, or multiple keys could be provided in a USE KEYS clause; if it is all documents in the collection then no such filtering is required.)

If selecting with a USE KEYS clause, 8k keys in a single clause is not unheard of, but it will likely come down to your needs and how unwieldy you wish your statements to be.
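For illustration, one such multi-key statement might look like this (a sketch - collection and field names taken from the example above, document keys hypothetical):

UPDATE my_collection
USE KEYS ["doc::0001", "doc::0002", "doc::0003"]
SET total = ARRAY_SUM(array_field);

Each batch of keys then costs one statement rather than one statement per document.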

From the server perspective there is no batching of individual update statements (we’re not talking transactions here); each is received and processed as an individual request.

An index on the update's target field wouldn't greatly impact this, unless it was also used to select the documents. Index updates happen asynchronously, so the update statement's runtime should be the same.

How and where you issue the statements is your choice - if you're going to have a large number of them, or large USE KEYS clauses, you'll probably be generating them programmatically, so issuing them from the same environment will probably be simplest. It shouldn't make any difference to the server.

HTH.


Thank you for your response @dh. The situation is that I need to update 1 million documents, each with a new field and its own total value, so it will be 1 million update statements, each with USE KEYS on one META().id. We are using SDK 2.6, and I don't think that version supports sub-document operations. I was trying to see if there's any way more efficient than using the SDK and looping through each doc to update it. Based on what you mentioned, it seems like that's the only way then.

Where do you get the 1 million META().id's? Are they in some sort of sequential order? If yes, you can use a WHERE clause on the document key with the primary index.

UPDATE my_collection SET total = ARRAY_SUM(array_field) WHERE META().id …
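For example (a sketch with a hypothetical key range, assuming the keys share a sequential prefix like the doc.u.NNNNNNNN ids shown later in this thread):

UPDATE my_collection
SET total = ARRAY_SUM(array_field)
WHERE META().id BETWEEN "doc.u.10000000" AND "doc.u.10999999";

A handful of such range statements could then cover all 1 million documents via the primary index.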

$Obj = {"doc1":"val1", "doc2":"val2"}
Construct an object of 1024 key-value pairs and batch each update:

UPDATE default AS s USE KEYS OBJECT_NAMES($Obj)
SET s.f1 = $Obj.[META(s).id];
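(OBJECT_NAMES() returns the field names of an object as an array - exactly what USE KEYS needs - and the .[expr] syntax is a dynamic field lookup, so each document picks up its own value. You can check the first part with a standalone query:

SELECT OBJECT_NAMES({"doc1":"val1", "doc2":"val2"}) AS keys;

which returns ["doc1", "doc2"].)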

In the UI it is doable but hard - set $Obj in Preferences as a named parameter and execute the query.

cbq shell
\set -$Obj {"doc1":"val1", "doc2":"val2"};
Execute the update, then change the value, set it again, and execute again.

OR store the statements in a file and feed them to the cbq shell (cbq < filename):

\set -$Obj {"doc1":"val1", "doc2":"val2"};
UPDATE default AS s USE KEYS OBJECT_NAMES($Obj)
SET s.f1 = $Obj.[META(s).id];
\set -$Obj {"doc1001":"val1001", "doc1002":"val1002"};
UPDATE default AS s USE KEYS OBJECT_NAMES($Obj)
SET s.f1 = $Obj.[META(s).id];

@vsr1 we know the ids of each document. So are you saying we can run, say, 4K of update statements like the ones below each time in the UI to do the batch updates? We do have the primary index, though - or did I misunderstand what you mean?

UPDATE table SET total = 1 where meta().id = 'doc.u.10908960';
UPDATE table SET total = 3 where meta().id = 'doc.u.10908961';
UPDATE table SET total = 2 where meta().id = 'doc.u.10908972';
UPDATE table SET total = 1 where meta().id = 'doc.u.10908983';

Currently in N1QL only INSERT/UPSERT supports batching; the other SQL statements do not. Each statement is executed separately by the UI (via a loop).

If you need batching you can follow the approach suggested in the previous post.
Instead of the UI you can also put all the statements in a file and execute it via the cbq shell (the shell will execute the statements serially). If you think that is too slow, put them in multiple files and execute those in parallel, changing the query host if required:

/opt/couchbase/bin/cbq -e=queryhost:8093 -u=user -p=password < filename

PREPARE p1 FROM UPDATE table USE KEYS $1 SET total = $2;
EXECUTE p1 USING ["doc.u.1098960",1];
EXECUTE p1 USING ["doc.u.1098961",3];

As the statements use USE KEYS, no index is needed.
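For example, the four singleton statements from your post could be collapsed into one statement per batch using the object technique from earlier (a sketch; values taken from your example, run in cbq):

\set -$Obj {"doc.u.10908960":1, "doc.u.10908961":3, "doc.u.10908972":2, "doc.u.10908983":1};
UPDATE table AS s USE KEYS OBJECT_NAMES($Obj)
SET s.total = $Obj.[META(s).id];

Each key in $Obj both selects a document and supplies that document's own total value.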


Thank you @vsr1 for the suggestions!

@vsr1 sorry, one more question: what would you suggest for the number of records per file? 10k?

The number of entries in a file doesn't matter, as they run serially; a bigger file just takes longer. A single file can hold all 1 million.

If you have 2 query nodes:
Split into 8 files.
Execute 4 files on node1 (i.e. 4 statements in parallel).
Execute 4 files on node2 (i.e. 4 statements in parallel).

It all depends on the load of your cluster, its size, etc.

In general the query service allows 4 concurrent queries per CPU. CE is limited to 4 CPUs per query node.
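(As a rough worked example using those numbers: 2 CE query nodes × 4 CPUs × 4 queries per CPU = up to 32 concurrent statements across the cluster, so the 4-files-per-node suggestion above stays well within capacity.)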

