I am trying to delete/ cleanup large number of documents (in the order of ~250 Million items) of certain age from a bucket using N1QL. The query looks like →
Delete from bucket1
where meta().id in
(Select b.Id from bucket1 as b
where creationTimeStamp BETWEEN T1 AND T2) .
Query explanation -
Basically, I have a bucket that comprises of lookup documents as well as another kind (type 2) of documents. I want to delete both kind of documents, but for deleting lookup document, I am selecting Ids (not meta().id, but a specific field) from all the type 2 documents that has created time stamp falling in between T1 and T2 and then delete all the docs that has meta().id same as the subquery-selected Id. (The meta().id of lookup docs will be the Id field in type2 documents). Eventually, I would be deleting the type 2 documents as well. But for deleting the lookup docs, the identification method is only from type 2 docs as the lookup docs is not having any other fields other than the type 2 document id and some other field.
I have created a primary index for doing this ONE TIME activity, but it takes long execution times. Is there a better way to handle this using N1QL. Please suggest.
On the client, you can do something like this (java version). Get the list of id’s using n1ql, then do the deletes using the kv api. The deletes will be done in up to 16 parallel executions.
ReactiveCluster reactiveCluster = cluster.reactive();
ReactiveCollection reactiveCollection = bucket.defaultCollection().reactive();
reactiveCluster.query("Select b.Id from bucket1 as b where creationTimeStamp BETWEEN T1 AND T2")
.flatMapMany( result -> result.rowsAsObject())
.parallel(16).runOn(Schedulers.parallel())
.map( r -> reactiveCollection.remove( r.getString("id"))).collectList().block();
@mreiche Thanks, we were more inclined to the N1QL approach rather than introducing an application layer in between, but this looks good for a try as it reduces the execution scope (based on the result of subquery and then direct delete, rather than doing full primary scan for results in subquery).