Move documents between buckets efficiently

[using CB 4.5 Enterprise]

We are trying to implement a solution to reduce the load our queries put on our main application bucket(s).
What we decided to do is create an “archive” bucket and move documents from the main buckets into it when they are no longer needed (say, older than X days).

We want to implement that in N1QL.

Ideally, it can be done in a single N1QL statement, so it’s “atomic”.
However, we could not find a way to do it.

We thought of using “DELETE … RETURNING” on the main bucket and wrapping that in an INSERT.
Something like:

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    DELETE from `main` a
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
)

But that fails with “syntax error - at DELETE”

Or, as a second option, use “UPDATE … RETURNING” and again wrap it in the INSERT into the archive
(at least this way the docs that were copied are “marked” for deletion):

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    UPDATE `main` a
    SET `archive_candidate` = 1
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
) 

That too fails, with “syntax error - at UPDATE”.

For some reason, the “DELETE/UPDATE” statements with “RETURNING”, even though they return perfectly real-looking result sets, can’t be used in an inner SELECT statement (with or without the wrapping INSERT).

Which leaves us with the least favorable option - doing it in 3 steps (spelled out below):

  • PASS 1: Update candidates for deletion
  • PASS 2: “copy” candidates (INSERT with SELECT)
  • PASS 3: Delete those documents
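
Spelled out with the (illustrative) bucket and field names from the examples above, the three passes would look roughly like this - still three separate statements, so not atomic:

-- PASS 1: mark candidates for deletion on the main bucket
UPDATE `main` a
SET a.`archive_candidate` = 1
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');

-- PASS 2: copy the marked documents into the archive bucket
-- (the copies will carry the archive_candidate flag unless it is stripped here)
INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT META(a).id _k, a _v
FROM `main` a
WHERE a.`archive_candidate` = 1;

-- PASS 3: delete the marked documents from the main bucket
DELETE FROM `main` a
WHERE a.`archive_candidate` = 1;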

Is there any way to make this work in 1 (or 2) steps?
Is there a reason the inner UPDATE/DELETE is not allowed?

thanks

Hi @josh21, you need to use two statements. Here is the safest approach, which should also perform well.

UPSERT INTO dest_bucket ( ... )
SELECT ... FROM source_bucket ...;

DELETE FROM source_bucket
USE KEYS ( SELECT RAW META().id FROM dest_bucket ... );
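
For example, filled in with the bucket names and the 15-day cutoff from your original post (purely illustrative - the subquery in the DELETE also needs an index on the archive side to be efficient):

UPSERT INTO `archive` (KEY _k, VALUE _v)
SELECT META(a).id _k, a _v
FROM `main` a
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');

DELETE FROM `main`
USE KEYS (
    SELECT RAW META(d).id
    FROM `archive` d
    WHERE d.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
);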

Hi @geraldss - thanks for the reply.

However, I have two comments:

  1. You didn’t explain why an UPDATE/DELETE with a “RETURNING” can’t be used as an inner query (although it seems to return a result-set).
  2. Your option, although it should work just fine, doesn’t seem very good at scale. As described in the first post, this is meant to implement an “archive” bucket. Imagine we move documents older than X days every day - say 50k documents each time. Now think of the query you suggested running after a year: in order to perform the DELETE part on a mere 50k documents, the inner SELECT will scan AND return close to 20 million IDs. That sounds very inefficient. Am I missing something?

Thanks

Any update on this? Has something changed?

If you want to copy a whole bucket you can use backup/restore, renaming the bucket during restore.
Otherwise you need to use the options suggested above.

No, copying a whole bucket is not what we're after… you could also use XDCR for that.
What I’m (and I presume @josh21 was/is) looking for is a way to archive old(er) documents to a different bucket. That way the “main” bucket stays fast and small, with multiple indexes, and serves the day-to-day operation, while the “archive” bucket keeps growing and is only used if someone has to dive into old data.
So if I have a bucket that produces 50 million documents a month, I can maintain in it a moving window of, say, 60 days to work with.

Was this ever answered? Any updates in recent versions?

Again - is there an easy and efficient way to move some documents between two large buckets, using N1QL or otherwise?

Thanks

Nothing has changed in N1QL.

While it’s not possible at the moment to do this atomically, if you have an index that would let you retrieve the meta().id of the candidate items, I could see retrieving them with KV from an SDK of your choice, writing them to the other bucket (possibly with durability requirements specifying persistence, to guard against failures), and then, once that is successful, deleting them from the original bucket.

This should be quite efficient, and since indexes are updated asynchronously in Couchbase, the index shouldn’t impact any critical-path ops.
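
For example (index name and field are just illustrative, based on the earlier posts), the index and the query the SDK worker would run to pick up candidate ids could look something like:

CREATE INDEX idx_main_document_time ON `main`(`document_time`);

-- projecting only META().id lets this query be covered by the index above
SELECT RAW META(a).id
FROM `main` a
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');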

Hope that helps!

Thanks - this is some help, making it a 2-step instead of a 3-step process.

However… the risk is that the first step manages to copy only SOME of the documents (or, a smaller risk, that the second step fails to delete SOME of them). Either way you end up with inconsistent data; the biggest risk is the first case - you might end up deleting docs from the source which were never copied to the target.

It’s a shame that a MOVE command was never implemented.

But even sadder is the fact that RETURNING - which is a great mechanism - doesn’t work in the case above; it could have been such an easy way for users to archive old data.
(Or if XDCR could filter by document age, or by something other than a pattern in the doc ID, AND at the same time support deleting from the source… :slight_smile:)

The underlying challenge is that the source and target are effectively different things, distributed across the network. Moving things atomically and reliably is possible, but it takes work. For your case, if it works, it might be just as easy to do it in two phases: one is to copy a batch, the second is to walk the copied batch and delete from the source. While it’s true that N1QL could do this for you as an enhancement, it would in effect be doing much the same thing - the cbq-engine process (which executes N1QL statements) is a client too.

Something like this is under consideration. However, it depends a bit on the selectivity of your predicate: if you’re trying to move a large amount of data, something like XDCR that in effect visits everything is less costly. If, on the other hand, you’re moving a day’s worth of data out of a whole lot, then using an index of the things to be moved is probably cheaper.

By the way, if you want to learn any more about the couple of things that may help with this and you have an enterprise subscription, check in with your team. They may be able to tell you a bit more about what’s being considered.

Thanks for the reply.

But I believe you didn’t address the concern and question:

  • Concern - in the two-step approach, how can you be sure the first step was “fully” successful, and how do you handle partial success?
  • Question - why does RETURNING not work in these cases? It could solve the whole thing.

Lastly - are you saying there are other ways to do this that are not mentioned in this thread?

In the example I gave, I was suggesting that you only delete things after reading them back from the destination. During the write you’ll be able to do any error handling, and the suggested method is idempotent: if there are errors of any kind on the first run, running it again will get you there.

I believe the problem is that this isn’t legal SQL, but @vsr1 can probably answer that better and/or correct me.

There aren’t any other ways in anything currently available, no. Of course, there are always various enhancements being considered.

DELETE/INSERT/UPDATE are standalone statements. You can’t use them inside a SELECT as a subquery; it is not valid SQL syntax. As the first reply suggested, you need to use 2 SQL statements.
As the document key is immutable, 2 SQL statements should work.

Normally MERGE would be an option: when not matched, INSERT; when matched, UPDATE/DELETE. But in this case it is not possible, because when not matched you need to INSERT (into the target) and also DELETE (from the source).
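
Just to illustrate (lookup-MERGE syntax, names taken from the earlier examples, not tested here): a MERGE could do the copy half, but it has no action that deletes from the source keyspace, so you would still need a separate DELETE on `main`:

MERGE INTO `archive` a
USING `main` m
ON KEY META(m).id
WHEN NOT MATCHED THEN
    INSERT m
    WHERE m.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');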