- "When comparing these values, it's important to interpret them as unsigned 64-bit integers." - In the Kafka connector I get this as a long anyway. Hope I can directly use the long value for comparison.
If you're using Java you'll want to use Long.compareUnsigned(x, y) to see which of two sequence numbers is greater. If the language you're working with has native support for unsigned 64-bit integers then this isn't an issue at all.
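To illustrate why the signed/unsigned distinction matters, here is a minimal sketch. A sequence number with the high bit set appears negative when stored in a Java long, so a plain signed comparison orders it incorrectly; Long.compareUnsigned handles it properly.

```java
// Sequence numbers are unsigned 64-bit values, but Java's long is signed.
// A seqno with the high bit set (e.g. 0xFFFF...FFFF) looks like -1 to a
// signed comparison, so Long.compare gives the wrong answer here.
public class SeqnoCompare {
    public static void main(String[] args) {
        long small = 5L;
        long huge = 0xFFFFFFFFFFFFFFFFL; // the largest unsigned value, -1 as a signed long

        // Signed comparison: wrongly claims small > huge (result > 0)
        System.out.println(Long.compare(small, huge) > 0);

        // Unsigned comparison: correctly reports small < huge (result < 0)
        System.out.println(Long.compareUnsigned(small, huge) < 0);
    }
}
```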
- I get the fact that the same sequence numbers can be reassigned in a failover scenario (when a vBucket moves from one node to another), but you mentioned that they might not be directly comparable. Is there any alternative, or does it have to be handled in the application logic?
Without persistence polling, the application logic would need to look at the failover log. It can get complicated, which is why I tried to gloss over it.
When persistence polling is enabled you should be able to simply compare the sequence numbers.
3.1. What do you mean by persistence polling? I don't see any references in the connector documentation. How do I enable this?
Persistence polling is a rollback mitigation strategy where the DCP client waits for changes to be persisted to all replicas before telling the connector about the change. It's enabled by setting the couchbase.persistence_polling_interval connector config property to a non-zero duration.
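For example, the property might be set like this in the connector configuration (the exact value shown here is illustrative; consult the connector documentation for the supported duration formats, and use a zero value to disable polling):

```properties
# Wait for changes to be persisted before publishing them.
# A non-zero duration enables persistence polling.
couchbase.persistence_polling_interval=100ms
```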
3.2. And does this mean that the connector has a built-in way of adjusting the sequence numbers so that ordering is maintained and we can directly compare them?
That's persistence polling, yes.
- While reading the DCP link, I came across snapshots. If my understanding is correct, only if the connector's "use_snapshot" is set to true does it become resilient to Connect cluster failure.
The Kafka connector's use_snapshots config property doesn't do anything except cause OutOfMemoryErrors. It will be removed in a future release; in the meantime I'd recommend setting it to false.
When the same Couchbase document is modified twice, the DCP protocol allows the server to "de-duplicate" the event stream and send only the second version of the document. For example, let's say an application creates document A, then document B, and finally updates document A. The "real" sequence of events looks like this:
A1 B1 A2
The DCP protocol allows the server to de-duplicate the modifications to document A and send this instead:
B1 A2
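The de-duplication described above can be sketched in a few lines: for each document key, only the latest mutation survives, in the position where that latest mutation occurred. This is an illustrative model of the behavior, not connector code; the Event record and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

public class DedupSketch {
    // A simplified DCP event: a document key plus its sequence number.
    public record Event(String key, long seqno) {}

    // Keep only the latest event per document key, preserving the order
    // in which the surviving (latest) events appeared in the stream.
    public static List<Event> deduplicate(List<Event> stream) {
        LinkedHashMap<String, Event> latest = new LinkedHashMap<>();
        for (Event e : stream) {
            latest.remove(e.key()); // drop any older version of this document
            latest.put(e.key(), e); // the newer version takes the later slot
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        // The "real" stream from the example: A1 B1 A2
        List<Event> real = List.of(
                new Event("A", 1), new Event("B", 2), new Event("A", 3));
        // De-duplicated, this becomes: B1 A2 (A1 is superseded by A2)
        System.out.println(deduplicate(real));
    }
}
```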
If you're reading the stream one event at a time, there's a period when you would know about only document B, even though document A was created first. Snapshots are a way to retain a consistent view of all documents. In this case, the server presents B1 A2 in the same snapshot. The idea is that if you process an entire snapshot at once, you know you're looking at documents that all existed together at the same point in time.
For the Kafka connector, snapshots don't provide any value, since we send the events to the topic one at a time. The only thing the use_snapshots setting does is buffer an entire snapshot into memory before sending the messages to the topic. The messages are still published one event at a time, without any indication that they belong to the same DCP snapshot.
Incidentally, there's an open enhancement request, MB-26908, to allow disabling de-duplication (and eliminating the need for snapshots). This would be a boon for the connectors, but it's not clear whether a high-performance solution is feasible.