How do I resolve failed conflict resolutions? (with Revision-based strategy)

rogozhnikov.andrey · March 2, 2018, 8:54am

My setup includes 5 Couchbase clusters connected by a ring of unidirectional XDCR. Clients write to those clusters simultaneously by design, and documents have a limited TTL that the client app extends if a document shows regular usage. My expectation was that records would eventually be either replicated or expired, but that seems not to be the case. Out of 4.4 billion documents in every cluster, I see a significant number of failed conflict resolutions.

The values of set_failed_cr_source, expiry_failed_cr_source, and docs_failed_cr_source are each about 700 million, which is quite a lot, and they have been growing gradually for years. At the same time, deletion_failed_cr_source is no greater than 12 (twelve).

How is anyone supposed to resolve those accumulated conflicts? At the very least, how can I get the keys of the conflicting documents?

What do set_failed_cr_source, expiry_failed_cr_source, docs_failed_cr_source, and deletion_failed_cr_source mean exactly? I tried to follow the sources of goxdcr and haven’t come to a conclusion.

ysui6888 · March 8, 2018, 9:48pm

The failed_cr_source stats are not an indication that some documents failed to get replicated. Unless you see different document counts in your clusters, you need not be worried.

For large documents, before XDCR replicates them to the target, it checks whether a document of the same or higher revision already exists on the target. If so, replication is not necessary: XDCR will not replicate the document to the target and will increment the failed_cr_source stats.

If you have a ring topology like A->B->C->A, the failed_cr_source stats are guaranteed not to be zero. A document mutation originating from A will get replicated to B and then C. When XDCR on C tries to replicate the mutation back to A, it detects that A already has the mutation and increments its failed_cr_source counter.
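
The ring behaviour described above can be reproduced with a toy model. This is only an illustrative sketch, not how goxdcr is implemented: the `Cluster` class and the `replicate` function are hypothetical names, and a single integer stands in for the document revision (real revision-based resolution compares more metadata than that).

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster: key -> (revision, value)."""
    name: str
    docs: dict = field(default_factory=dict)
    failed_cr_source: int = 0  # counted on the *source* side, as in the stat name


def replicate(source: Cluster, target: Cluster, key: str) -> bool:
    """Try to send one mutation; return True if the target accepted it.

    Assumption: the target already "wins" when its revision is the same
    or higher, in which case nothing is sent and the source counter grows.
    """
    rev, value = source.docs[key]
    existing = target.docs.get(key)
    if existing is not None and existing[0] >= rev:
        source.failed_cr_source += 1
        return False
    target.docs[key] = (rev, value)
    return True


# Ring A -> B -> C -> A, with one mutation written on A.
a, b, c = Cluster("A"), Cluster("B"), Cluster("C")
a.docs["k"] = (1, "v1")
ring = [(a, b), (b, c), (c, a)]
results = [replicate(src, dst, "k") for src, dst in ring]

for cluster in (a, b, c):
    print(cluster.name, cluster.docs["k"], cluster.failed_cr_source)
```

In this model all three clusters end up with the same document, and only C's counter goes up, because the hop C->A finds that A already holds that revision. The counter grows even though nothing was lost.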

rogozhnikov.andrey · March 12, 2018, 12:19pm

Thank you, that sounds reassuring, as the item count is almost the same in all clusters.

However, we paid attention to these metrics because of some clearly unresolved conflicts, i.e., different clusters had different document content for the same key. We could not estimate the scale of the conflicts because we had to take emergency measures (we recreated the ring from scratch; the clearly visible decline in item count was replaced by rapid growth back to the level of 10 days earlier). But we know from clients of our application that those conflicts were massive, i.e., tens or hundreds of thousands, if not more.

Can we somehow estimate the current count of actual unresolved conflicts?
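
To make the question concrete, here is a rough sketch of the comparison I have in mind, assuming a sample of keys has already been fetched from each cluster into memory. The function `find_divergent_keys` and the snapshot layout are hypothetical; nothing here is a Couchbase API, and how the sample is obtained is left open.

```python
import hashlib


def digest(body: bytes) -> str:
    """Content fingerprint, so bodies can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def find_divergent_keys(snapshots: dict) -> dict:
    """Return keys whose content differs between clusters.

    snapshots: {cluster_name: {key: body_bytes}} (assumed layout).
    A key missing from a cluster is recorded as None and also counts
    as divergent.
    """
    all_keys = set().union(*(docs.keys() for docs in snapshots.values()))
    divergent = {}
    for key in all_keys:
        seen = {
            name: digest(docs[key]) if key in docs else None
            for name, docs in snapshots.items()
        }
        if len(set(seen.values())) > 1:
            divergent[key] = seen
    return divergent


# Made-up sample data: k1 agrees everywhere, k2 does not.
sample = {
    "A": {"k1": b"x", "k2": b"y"},
    "B": {"k1": b"x", "k2": b"z"},
    "C": {"k1": b"x"},
}
conflicts = find_divergent_keys(sample)
print(sorted(conflicts))
```

With 4.4 billion documents this would only be feasible on a sample, and a key that differs at one moment may simply be a mutation still in flight or a TTL that has not caught up yet, so any count obtained this way would be an upper bound at best.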
