We’re currently investigating a problem in our couchbase server clusters where a select set of documents seem to be ignored by replication - in one specific cluster.
No matter what we’ve tried, we can’t seem to make XDCR recognize that it should be replicating this document. We’ve tried:
upserting the document with the same content in the source datacenter
pausing and resuming the replication
restarting the couchbase-server service on both source and destination servers
deleting the problematic document from the source, waiting, and inserting the document back in
creating the document specifically in the datacenter where it is missing, then deleting from the source datacenter (it doesn’t get deleted from the destination cluster)
based on prior experience, we’ve created a different set of documents within the bucket and they all get replicated just fine (my script ensures that it inserts into every vBucket; see the sketch after this list)
we have multiple destination datacenters, and only one destination is missing the document
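For reference, the vBucket coverage check in that script works roughly like the sketch below. This is a minimal illustration assuming the commonly documented CRC32-based key-to-vBucket mapping and the default 1024 vBuckets; the key prefix is made up.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase default; an assumption for this sketch


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document key to a vBucket id using the commonly documented
    CRC32-based hashing (crc32, shift right 16 bits, mask, modulo vBucket count)."""
    crc = zlib.crc32(key.encode("utf-8"))
    return ((crc >> 16) & 0x7FFF) % num_vbuckets


# Generate hypothetical test keys until every vBucket has at least one.
coverage = {}
candidate = 0
while len(coverage) < NUM_VBUCKETS:
    key = f"xdcr-test::{candidate}"
    coverage.setdefault(vbucket_for_key(key), key)
    candidate += 1
print(f"covered all {NUM_VBUCKETS} vBuckets with {candidate} candidate keys")
```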
Is there anything we can look into to see why couchbase is ignoring replication for just this one document?
In case it helps, right around the time the document was being inserted in the source, we were rebalancing the destination cluster and a server being added was powered off unexpectedly. We were able to successfully rebalance the cluster once the server was back online.
creating the document specifically in the datacenter where it is missing, then deleting from the source datacenter (it doesn’t get deleted from the destination cluster)
That to me means the target bucket’s document is winning the conflict resolution.
I would check to see if there’s any reason why the target bucket document keeps winning and thus the source doc is not being replicated, such as perhaps an application that is updating the document outside of the replication topology, etc…
Can you elaborate on how to “check if there’s a reason why the target bucket document keeps winning”? I’d like to follow through on this lead.
To clarify (sorry if this wasn’t clear):
Under normal circumstances, the document is only ever created in the source datacenter. This document would have been added to the source datacenter around the time that the target datacenter’s rebalance got interrupted. We only added it directly to the target datacenter after repeated upserts of the document in the source datacenter proved unhelpful.
So, depending on the conflict resolution you are using (the default on a bucket is revId/sequence number), you can check the metadata of the documents to compare the revId and CAS values.
I grabbed the virtual extended attributes related to sequence number conflict resolution from each of our clusters. What’s interesting is that the sequence numbers are all over the place, so either I’m not getting the same sequence number that couchbase uses for conflict resolution, or the algorithm is more complex than “replicate if source[seqno] > dest[seqno]”. For example, “success1” had a higher seqno prior to the user upserting the document, while “success4” has a lower seqno even after the document is successfully replicated.
These are the xattrs prior to the user re-upserting the document into the source datacenter:
I reviewed the other post, and I do see that docs_failed_cr_source has increased starting on the same date as we started seeing this problem and is now at a constant 85 for this replication stream (it was 0). So it does seem that we’re on the right path.
If you are using the default conflict resolution mode on a bucket, then the simplest check is to look at the revid (a simple counter incremented on every mutation). The default conflict resolution mode is also called “most updates wins”. In the virtual extended attributes, see “revid” – e.g. ‘revid’: ‘2’
The revId can also be seen in the Admin UI → Documents
We are using the “Sequence Number” conflict resolution.
Unfortunately, I don’t see rev or revid anywhere. The admin UI doesn’t allow me to see this metadata - “Warning: Editing of binary document is not allowed”. I’ve also tried N1QL (SELECT META() FROM bucket USE KEYS ["key"]), but that doesn’t work either: I get an empty result set for the binary document, and rev/revid is not included even for a JSON object I inserted separately.
How can I get this revision ID, ideally through the Python SDK or a REST request? How can we resolve these conflicts? We are starting to see this pattern show up in other buckets and other remote datacenters, so we really need to find a solution to this problem.
I know for this specific document type, they only insert into the source cluster, and allow XDCR to replicate to each of our other datacenters. Yes, we manually attempted some upserts/deletes in the remote datacenter, but those were only AFTER we detected that the document wasn’t getting replicated.
The default conflict resolution (sequence number) is commonly referred to as the revId conflict resolution – just FYI. This is how I get the revid using the Python SDK.
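A minimal sketch of that lookup against the $document virtual xattr, assuming Python SDK 3.x; connection details, bucket name, and document key are placeholders, and exact import paths vary slightly between 3.x releases:

```python
import couchbase.subdocument as SD
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions

# Placeholder connection details -- adjust for your environment.
cluster = Cluster(
    "couchbase://source-node.example.com",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
collection = cluster.bucket("my-bucket").default_collection()

# Read the $document virtual xattr, which carries the metadata used for
# sequence-number (revId) conflict resolution: revid, CAS, seqno, deleted, ...
result = collection.lookup_in("problem-doc-key", [SD.get("$document", xattr=True)])
print(result.content_as[dict](0))
```

Which fields actually show up under $document depends on the server version.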
Thanks again for your reply. That is exactly how I got the results I had provided before. What version was revid added to $document? We are currently using couchbase server community 5.1. We want to upgrade, but we currently need moxi. I don’t want to derail this conversation, but I want to be clear that we can’t simply upgrade to solve this.
While I realize knowing the revid would be quite helpful in confirming this is definitely what’s happening here, is there anything we can do to force replication from the source to the target (bypassing conflict resolution)?
The concern we have right now is that we have a number of documents that are not getting replicated due to this conflict resolution (a few more have been recently detected). While we can manually create documents on the target side, that does nothing to allow the next update to get replicated.
Sounds like you need the target to purge the document metadata after deleting the document (i.e. purge tombstones) so that there will be nothing on the target for the source document (with same doc key) to be in conflict with. You can review the Couchbase docs for your version on when tombstones are purged.
I don’t see any documentation regarding tombstones in couchbase 5.1. Based on the documentation for 5.5, it would be the metadata purge interval, which we have set to 3 days.
That doesn’t seem to solve the problem, though. To keep this simple, I’ll refer to datacenters A, B, C, and D. We have XDCR replication from A->B, A->C, and A->D. All 4 datacenters have the same 3-day metadata purge interval.
2021-09-01 - document was inserted into A: document was missing in B and C, but successfully replicated to D.
2021-09-02 - document was manually upserted into B: document is now present in A, B and D, yet still missing from C.
2021-09-08 - document was upserted into A: document in B had not been changed, documents in A and D are the same, document is still missing in C.
With the timeline above, there would have been 2 metadata purge intervals somewhere between 2021-09-01 and 2021-09-08, so any tombstone present on C would have been purged prior to the upsert on 2021-09-08.
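To confirm that all four clusters really do report the same 3-day setting, a quick REST check along these lines should show it. This is a sketch assuming the cluster-wide /settings/autoCompaction endpoint; host and credentials are placeholders, and a per-bucket auto-compaction override (if one is set) would show up in the bucket’s details instead.

```python
import requests

BASE = "http://cluster-node.example.com:8091"  # placeholder host
AUTH = ("Administrator", "password")           # placeholder credentials

# Cluster-wide auto-compaction settings; the metadata purge interval
# (in days) should be included in this payload.
resp = requests.get(f"{BASE}/settings/autoCompaction", auth=AUTH)
resp.raise_for_status()
print(resp.json())
```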
Is there any way to see if a tombstone exists for a given document?
Is there any way to get a list of all tombstones for a bucket?
Is there any way to force metadata to be purged immediately?
Is there any way to see what documents are being skipped due to conflict resolution?
Why would the stats for docs_failed_cr_source only increase over time? It has not dropped (even by 1) in the past week.
Is this stat really a counter that is being displayed as if it were a gauge/rate in the admin UI?
Or is couchbase repeatedly trying and skipping the same documents over and over?
I really appreciate your help with this. The documentation seems to be lacking on how to gather this kind of detailed, technical information. It’s good at describing the concepts and how couchbase uses them, but not at how to inspect them.
A tombstone is just a deleted document (a doc with an empty/null body) – so, if the metadata for the doc key/id exists and “deleted”: true, then that would be a tombstone. I think that in one of your outputs, the $document showed “deleted”: false – so, clearly, that document was not a tombstone (since the metadata says the document has not been deleted).
docs_failed_cr_source should be showing you the total count (running total) since that replication spec started.
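If it helps to track this outside the UI, the same per-replication stat can be pulled over REST. The sketch below assumes the documented /pools/default/buckets/&lt;bucket&gt;/stats/&lt;stat&gt; form for XDCR stats; host, credentials, bucket names, and the remote-cluster selection are placeholders.

```python
from urllib.parse import quote

import requests

BASE = "http://source-node.example.com:8091"  # placeholder: a node in the source cluster
AUTH = ("Administrator", "password")          # placeholder credentials
SOURCE_BUCKET = "my-bucket"                   # placeholder
TARGET_BUCKET = "my-bucket"                   # placeholder

# The remote cluster UUID comes from the XDCR remote-cluster list.
remotes = requests.get(f"{BASE}/pools/default/remoteClusters", auth=AUTH).json()
remote_uuid = remotes[0]["uuid"]  # pick the remote cluster you care about

# XDCR stats are addressed as "replications/<uuid>/<source>/<target>/<stat>",
# URL-encoded into a single path segment.
stat = quote(
    f"replications/{remote_uuid}/{SOURCE_BUCKET}/{TARGET_BUCKET}/docs_failed_cr_source",
    safe="",
)
resp = requests.get(f"{BASE}/pools/default/buckets/{SOURCE_BUCKET}/stats/{stat}", auth=AUTH)
resp.raise_for_status()
print(resp.json())
```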
You can change the purge interval for tombstones but should be careful, as noted in the documentation. https://docs.couchbase.com/server/5.1/settings/configure-compact-settings.html#metadata-purge-interval
In the code this is typed as a Prometheus MetricTypeCounter, which is documented here:
" A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart."
There are also some comments in the code referencing this, saying:
docs failed source side conflict resolution. in this case the docs will be counted in docs_failed_cr_source stats
docs that will get rejected by target for other reasons, e.g., since target no longer owns the vbucket involved. in this case the docs will not be counted in docs_failed_cr_source stats
I just asked the user to re-upsert the document that has been missing from one of the remote clusters since 2021-09-02. The document is still not getting replicated. At that time, the docs_failed_cr_source stat did increment by one. To me, this suggests that a tombstone (or something like it) is not being cleaned up by the metadata purge interval.
Note that I do not seem to have a way to view tombstones (even immediately after deletion). Using a metadata lookup along the lines of the sketch below, I get a DocumentNotFoundException.
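This is a sketch with placeholder connection details and document key, assuming Python SDK 3.x (not the exact script):

```python
import couchbase.subdocument as SD
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.exceptions import DocumentNotFoundException

cluster = Cluster(
    "couchbase://target-node.example.com",  # placeholder
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
collection = cluster.bucket("my-bucket").default_collection()

try:
    # For a visible tombstone we would expect "deleted": true in $document,
    # but for this key the lookup fails outright.
    result = collection.lookup_in("deleted-doc-key", [SD.get("$document", xattr=True)])
    print(result.content_as[dict](0))
except DocumentNotFoundException:
    print("no document or tombstone metadata visible for this key")
```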