I have a situation in my production couchbase cluster having version 6.0.1.
I have a doc_type = “eventing_history” which is having 4 years customer event data. Size of this doc_type is 2TB and docs in this doc_type is approx 1 Billion.
Now I want to delete the data for this doc_type and leave only current month data.
I tried n1ql delete syntax but it was getting time out after some time. and I don’t think that N1QL can delete 1 billion docs.
Please suggest correct approach to delete 1 billion docs having same key prefix.
Best option will be use covered index and get the document keys.
Then USE SDK set/touch the document expiration to 1sec. Let the document expiration run through at his own time in background and delete.
Vishal_Sharma - Another method would be to use the Eventing Service to do this example (will work for 6.5+):
// 1. Bucket Alias
// 2. src_bkt
// 3. <<your bucket to purge data from>>
// 4. read+write
function OnUpdate(doc, meta) {
// filter out non-interesting docs
if (doc.type !== "eventing_history") return;
// this is a bucket DEL operation.
delete src_bkt[meta.id];
}
Now I want to delete the data for this doc_type and leave only current month data.
In recent versions of Couchbase the leading characters of the CAS string is a timestamp in millis since epoch. You can also use this to your advantage.
For example if you used Eventing you could ignore the newer CAS values based on some TBD datetime.
// 1. Bucket Alias
// 2. src_bkt
// 3. <<your bucket to purge data from>>
// 4. read+write
function OnUpdate(doc, meta) {
// filter out non-interesting docs
if (doc.type !== "eventing_history") return;
// get the milliseconds since Epoch (GMT) for some date e.g. 2020-08-01
var cutoff_millis = Date.parse('2020-08-01');
// get the milliseconds of the documents last update (you could parse a date from the doc itself)
var doc_millis = parseInt(meta.cas.substring(0,13));
// do not delete if GTE the cutoff
if (doc_millis >= cutoff_millis) return;
// this is a bucket DEL operation.
delete src_bkt[meta.id];
}
You deploy it and just leave the eventing function “live” and it will delete all old documents of type “eventing_history” that have not been updated after 2020-08-01. Note no indexes are required.
A few warnings to keep in mind all of which might cause the above function not to behave as expected.
The timestamp extracted from CAS is a time of a mutation (either to the doc or it’s XATTRs).
Mutations typically come from a document create, update, or delete/expiry.
Mutations to a document can also come from an XDCR replication or a restore from a backup.
The definition of CAS could change in the future releases.