Move documents between buckets efficiently

[using CB 4.5 Enterprise]

We are trying to implement a solution to reduce the load our queries put on our main application bucket(s).
What we decided to do is create an “archive” bucket and move documents from the main buckets into it when they are no longer needed (say, older than X days).

We want to implement that in N1QL.

Ideally, it can be done in a single N1QL statement, so it’s “atomic”.
However, we could not find a way to do it.

We thought of using “DELETE … RETURNING” on the main bucket and wrapping that in an INSERT.
Something like:

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    DELETE from `main` a
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
)

But that fails with “syntax error - at DELETE”

Or, as a second option, use “UPDATE … RETURNING” and again wrap it in the INSERT into the archive
(at least this way the docs that were copied are “marked” for deletion):

INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT _k, _v FROM 
(
    UPDATE `main` a
    SET `archive_candidate` = 1
    WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
    RETURNING meta(a).id _k, a _v
) 

That too fails, with “syntax error - at UPDATE”.

For some reason, the “DELETE/UPDATE” statements with “RETURNING”, even though they return perfectly real-looking result sets, can’t be used in an inner SELECT statement (with or without the wrapping INSERT).

Which leaves us with the least favorable option - doing it in 3 steps (spelled out below):

  • PASS 1: Update candidates for deletion
  • PASS 2: “copy” candidates (INSERT with SELECT)
  • PASS 3: Delete those documents
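
Spelled out with the (illustrative) bucket and field names from the examples above, the three passes would look roughly like this - still three separate statements, so not atomic:

-- PASS 1: mark candidates for deletion on the main bucket
UPDATE `main` a
SET a.`archive_candidate` = 1
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');

-- PASS 2: copy the marked documents into the archive bucket
-- (the copies will carry the archive_candidate flag unless it is stripped here)
INSERT INTO `archive` (KEY _k, VALUE _v)
SELECT META(a).id _k, a _v
FROM `main` a
WHERE a.`archive_candidate` = 1;

-- PASS 3: delete the marked documents from the main bucket
DELETE FROM `main` a
WHERE a.`archive_candidate` = 1;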

Is there any way to make this work in 1 (or 2) steps?
Is there a reason the inner UPDATE/DELETE is not allowed?

thanks

Hi @josh21, you need to use two statements. Here is the safest approach, which should also perform well.

UPSERT INTO dest_bucket ( ... )
SELECT ... FROM source_bucket ...;

DELETE FROM source_bucket
USE KEYS ( SELECT RAW META().id FROM dest_bucket ... );
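
For example, filled in with the bucket names and the 15-day cutoff from your original post (purely illustrative - the subquery in the DELETE also needs an index on the archive side to be efficient):

UPSERT INTO `archive` (KEY _k, VALUE _v)
SELECT META(a).id _k, a _v
FROM `main` a
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');

DELETE FROM `main`
USE KEYS (
    SELECT RAW META(d).id
    FROM `archive` d
    WHERE d.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day')
);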

Hi @geraldss - thanks for the reply.

However, I have two comments:

  1. You didn’t explain why an UPDATE/DELETE with a “RETURNING” can’t be used as an inner query (although it seems to return a result-set).
  2. Your option, although it should work just fine, doesn’t seem very good at scale. As described in the first post, this is meant to implement an “archive” bucket. Imagine we move documents older than X days every day - say 50k documents each time. Now think of the query you suggested running after a year: in order to perform the DELETE part on a mere 50k documents, the inner SELECT will scan AND return close to 20 million IDs. That sounds very inefficient. Am I missing something?

Thanks

Any update on this? Has something changed?

If you want to copy a whole bucket you can use backup/restore, renaming the bucket during restore.
Otherwise you need to use the options suggested above.

No, copying a whole bucket is not what we're after… you could also use XDCR for that.
What I’m (and I presume @josh21 was/is) looking for is a way to archive old(er) documents to a different bucket. That way the “main” bucket stays fast and small, with multiple indexes, and serves the day-to-day operation, while the “archive” bucket keeps growing and is only used if someone has to dive into old data.
So if I have a bucket that produces 50 million documents a month, I can maintain in it a moving window of, say, 60 days to work with.

Was this ever answered? Any updates in recent versions?

Again - is there an easy and efficient way to move some documents between two large buckets, using N1QL or otherwise?

Thanks

Nothing has changed in N1QL.

While it’s not possible at the moment to do this atomically, if you have an index that would let you retrieve the meta().id of the candidate items, I could see retrieving them with KV from an SDK of your choice, writing them to the other bucket (possibly with durability requirements specifying persistence, to guard against failures), and then, once that is successful, deleting them from the original bucket.

This should be quite efficient, and since indexes are updated asynchronously in Couchbase, the index shouldn’t impact any critical-path ops.
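
For example (index name and field are just illustrative, based on the earlier posts), the index and the query the SDK worker would run to pick up candidate ids could look something like:

CREATE INDEX idx_main_document_time ON `main`(`document_time`);

-- projecting only META().id lets this query be covered by the index above
SELECT RAW META(a).id
FROM `main` a
WHERE a.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');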

Hope that helps!

Thanks - this is some help, making it a 2-step instead of a 3-step process.

However… the risk is that the first step manages to copy only SOME of the documents (or, a smaller risk, that the second step fails to delete SOME of them). Either way you end up with inconsistent data; the biggest risk is the first case - you might end up deleting docs from the source which were never copied to the target.

It’s a shame that a MOVE command was never implemented.

But even sadder is the fact that RETURNING - which is a great mechanism - doesn’t work in the case above; it could have been such an easy way for users to archive old data.
(Or if XDCR could filter by document age, or by something other than a pattern in the doc ID, AND at the same time support deleting from the source… :slight_smile:)

The underlying challenge is that the source and target are effectively different things, distributed across the network. Moving things atomically and reliably is possible, but it takes work. For your case, if it works, it might be just as easy to do it in two phases: one is to copy a batch, the second is to walk the copied batch and delete from the source. While it’s true that N1QL could do this for you as an enhancement, it would in effect be doing much the same thing - the cbq-engine process (which executes N1QL statements) is a client too.

Something like this is under consideration. However, it depends a bit on the selectivity of your predicate: if you’re trying to move a large amount of data, something like XDCR that in effect visits everything is less costly. If, on the other hand, you’re moving a day’s worth of data out of a whole lot, then using an index of the things to be moved is probably cheaper.

By the way, if you want to learn any more about the couple of things that may help with this and you have an enterprise subscription, check in with your team. They may be able to tell you a bit more about what’s being considered.

Thanks for the reply.

But I believe you didn’t address the concern and question:

  • Concern - in the two-step approach, how can you be sure the first step was “fully” successful, and how do you handle partial success?
  • Question - why does RETURNING not work in these cases? It could solve the whole thing.

Lastly - are you saying there are other ways to do this that are not mentioned in this thread?

In the example I gave, I was suggesting that you only delete things after reading them back from the destination. During the write you’ll be able to do any error handling, and the suggested method is idempotent: if there are errors of any kind on the first run, running it again will get you there.

I believe the problem is that this isn’t legal SQL, but @vsr1 can probably answer that better and/or correct me.

There aren’t any other ways in anything currently available, no. Of course, there are always various enhancements being considered.

DELETE/INSERT/UPDATE are standalone statements. You can’t use them inside a SELECT as a subquery; it is not valid SQL syntax. As the first reply suggested, you need to use 2 SQL statements.
As the document key is immutable, 2 SQL statements should work.

Normally MERGE would be an option: when not matched, INSERT; when matched, UPDATE/DELETE. But in this case it is not possible, because when not matched you need to INSERT (into the target) and also DELETE (from the source).
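
Just to illustrate (lookup-MERGE syntax, names taken from the earlier examples, not tested here): a MERGE could do the copy half, but it has no action that deletes from the source keyspace, so you would still need a separate DELETE on `main`:

MERGE INTO `archive` a
USING `main` m
ON KEY META(m).id
WHEN NOT MATCHED THEN
    INSERT m
    WHERE m.`document_time` <= DATE_ADD_MILLIS(NOW_MILLIS(), -15, 'day');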