I am using Sync Gateway and Couchbase for a mobile app solution. When I delete documents through Sync Gateway they are stored in Couchbase with "_deleted": true, as described in the documentation.
I understand they are kept so the deletes can be synced across all the replicating databases. My question is: after some time (weeks, months, etc.) can I physically delete these documents from the Couchbase bucket to reduce the document count?
Should I delete them in Couchbase or through the Sync Gateway API?
The REST API has a _purge command for doing this — it physically deletes document(s) without leaving a trace. Unfortunately, for some reason it hasn’t been implemented yet in SG.
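(For reference, and purely as a sketch: a CouchDB-style purge request is just a POST of a map from document IDs to revision IDs. The host, port, database name, document ID and revision ID below are placeholders, and since SG doesn’t implement the endpoint yet this only shows what such a call would look like; it assumes Node 18+ for the global fetch.)
// Hypothetical sketch of a CouchDB-style _purge call. Sync Gateway does not
// expose this endpoint yet; the host, database, doc ID and rev ID are placeholders.
async function purgeTombstone() {
  var res = await fetch('http://localhost:4985/mydb/_purge', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    // body maps each document ID to the revision IDs that should be purged
    body: JSON.stringify({'doc-id-to-purge': ['2-abc123def456']})
  });
  console.log(await res.json()); // response reports which revisions were purged
}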
In principle you could safely purge the documents by deleting them via the Couchbase smart-client API. That might end up confusing Sync Gateway, though, so I really don’t want to recommend it … maybe one of the SG engineers can comment on that.
I agree that it’s risky to remove documents directly using the Couchbase smart-client API. If the document is still in Sync Gateway’s in-memory channel cache, there’s the potential for Sync Gateway to attempt to replicate the non-existing document. If the document is a tombstone (_deleted:true), though - particularly one that’s been deleted for weeks/months - the chance of problems should be relatively low.
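For anyone who does go this route on old tombstones, the removal itself is just a key-value delete. A minimal sketch, assuming the Couchbase Node.js SDK 2.x (the connection string, bucket name and document key are placeholders):
// Minimal sketch assuming the Couchbase Node.js SDK 2.x; connection string,
// bucket name and document key are placeholders. This bypasses Sync Gateway,
// so per the caveats above, only use it on long-dead tombstones.
var couchbase = require('couchbase');
var cluster = new couchbase.Cluster('couchbase://localhost');
var bucket = cluster.openBucket('sync_gateway_bucket');

bucket.remove('doc-id-of-old-tombstone', function (err, result) {
  if (err) {
    console.error('remove failed:', err);
  } else {
    console.log('removed, cas:', result.cas);
  }
});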
I’ve run into this issue as well, and from a developer’s point of view keeping tombstones around indefinitely doesn’t seem very efficient. In my opinion, though, you should only worry about purging documents if they become a bottleneck in your system. Couchbase Server can persist a very large amount of data, and those tombstone documents are very small.
As for purging documents on the clients (Android, iOS, …), that’s perhaps something you could look into as well, but again it’s worth reviewing how much space it would actually save.
@jens To follow up on this scenario, I wonder what would happen to a user who downloads the app, logs in, and gets access to a bunch of channels. Would the replicator pull the tombstone documents in those channels from Couchbase Server? If so, is there a recommended way to work around this when it’s preferable not to replicate them?
Yes, the tombstones will get pulled since they’re part of the channels. There is some logic in the replicator to try to pull them last (after existing docs) but it’s not perfect and in some cases it increases the latency of an initial replication.
This is a necessary aspect of the replication protocol — the tombstone needs to be part of the channel so existing clients can be notified the doc was deleted. And there are cases where even a new client needs to know, although these cases generally involve less-likely things like having the client also replicate with a second server or another client. It’s one of the aspects of multi-master replication that adds overhead that isn’t needed in the simple star-topology case.
The only workaround that comes to mind (for the current version of SG) is to have a server-side process that scans for these tombstones and updates them with newer tombstone revisions that don’t belong to any channel. Then they won’t be visible to clients anymore.
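To sketch the idea (hypothetical only; the "scrubbed" marker and the cleanup job below are assumptions, not existing features): the cleanup job would write a new tombstone revision that carries a marker but no channel data, and the sync function would simply decline to route marked tombstones:
// Hypothetical sync function fragment. Fresh deletions are still routed
// normally so existing clients get notified; tombstones rewritten by the
// (assumed) cleanup job carry a made-up "scrubbed" flag and get no channel,
// so they stop showing up for clients.
function (doc, oldDoc) {
  if (doc._deleted && doc.scrubbed) {
    return; // no channel() call: the rewritten tombstone is invisible
  }
  channel(doc.channels);
}
The cleanup job itself could write those new tombstone revisions through the admin REST API, as ordinary PUTs (against the current revision) of a body containing _deleted: true plus the marker.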
Thanks for the discussion. I have some large channels that pull down 400K+ documents, and a lot of them are deletes during the initial sync. Moving them to a new channel is a good idea and I will give it a try. It might also be nice to have a parameter to ignore tombstones, so that when the client knows it is doing the initial sync it can skip them; this could be integrated into CBL.
Moving them to a different channel did not help. They were still put into the changes feed with a _removed property set.
I took the approach of deleting them in the Sync Gateway Couchbase bucket instead, which worked.
I created a Couchbase view that returns only deleted documents and includes the time_saved attribute. I then set up a weekly process that pulls the documents from this view and, if time_saved is older than 10 days, deletes them using the Couchbase API. In case it helps, the view is below.
function (doc, meta) {
  // Skip documents without Sync Gateway metadata, and SG's own "_sync:*" docs
  var sync = doc._sync;
  if (sync === undefined || meta.id.substring(0, 6) == "_sync:")
    return;
  // Only emit tombstones (deleted documents)
  if (!doc._deleted)
    return;
  // Key: document ID; value: the time Sync Gateway saved the deletion
  emit(meta.id, {saved: sync.time_saved});
}
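For illustration only, a weekly job along these lines could look roughly as follows (assuming the Couchbase Node.js SDK 2.x; the design document, view and bucket names are placeholders):
// Sketch of a weekly cleanup job, assuming the Couchbase Node.js SDK 2.x.
// Design document, view and bucket names are placeholders; the view queried
// here is the one shown above.
var couchbase = require('couchbase');
var ViewQuery = couchbase.ViewQuery;

var cluster = new couchbase.Cluster('couchbase://localhost');
var bucket = cluster.openBucket('sync_gateway_bucket');

var TEN_DAYS_MS = 10 * 24 * 60 * 60 * 1000;
var query = ViewQuery.from('cleanup', 'deleted_docs');

bucket.query(query, function (err, rows) {
  if (err) throw err;
  rows.forEach(function (row) {
    // row.value.saved is the _sync.time_saved emitted by the view
    var age = Date.now() - new Date(row.value.saved).getTime();
    if (age > TEN_DAYS_MS) {
      bucket.remove(row.id, function (removeErr) {
        if (removeErr) console.error('failed to remove', row.id, removeErr);
      });
    }
  });
});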
I’m reluctant to mess with data in Couchbase behind Sync Gateway’s back but I currently have channels with thousands of deleted (tombstoned) documents and this makes the initial sync to the device excruciatingly slow.
We also see that our bucket size is growing continuously; it is over 18 GB as we speak, even though we only have around 300 "real" documents with attachments of just a few MB (all together I think they amount to around 2 GB). Does Couchbase keep attachments alive for tombstones even after compacting away the previous revisions?
You can look at the issue yourself (: It doesn’t appear there’s been any progress.
I’m reluctant to mess with data in Couchbase behind Sync Gateway’s back
Understandable, but in this case I think it’s the only thing we can recommend as a workaround. It’s pretty much exactly what an SG implementation of the purge operation would do, aside from updating internal caches.
If we use Couchbase along with Sync Gateway for live tracking, it will generate a large amount of data in Couchbase over time.
Most of that data is useless after a certain time, so how do we remove this unwanted data?
Do we need to use a DELETE SQL statement to delete these records?
Or do we need to set a TTL on all the unwanted records?
Or is there some other mechanism (utility) provided by Couchbase for this?
Also, is there any utility provided by Couchbase for backup? (I mean not backing up the entire bucket, but backing up certain data from the bucket and then removing the backed-up data from the bucket.)
@jens I know this is an old issue, but as far as I can tell it is still a problem in Sync Gateway 2.1.
When I have a large number of _deleted objects relative to the number of not-deleted objects, I see that the initial pull replication takes much longer than really necessary. I understand that older clients might need the tombstones, but a new client doing its initial pull replication definitely doesn’t.
Could you advocate for an optimization in CB Lite and SG to solve this?
Thanks.