I have a Couchbase bucket that has been up for a long time and holds more than 2 billion documents.
I want to stream all the document change/remove events to HDFS.
My question is: will DCP aggregate several changes to a document into one event?
For example, suppose one of the documents has been changed 100 times; when I run DCP starting from the very beginning, will I get those 100 changes? Or will I get only 1 change? Or some other number?
If it is not 100, what is the aggregation strategy?
Hi Demshi, the changes might be rolled up or they might not be. It depends on whether you are connected to the DCP stream when the mutations to your document happen.
In your example, you would get 100 mutations if you open a DCP feed on the bucket and stay connected while the document changes.
On the other hand, if you create a document, change it 100 times, and then you open a DCP connection and ask for all documents starting from the beginning, you will only get the latest value of the document in a single aggregated DCP message that has the change history in it.
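If you want to see this for yourself, here is a rough sketch using the open-source java-dcp-client. Treat the host name, bucket name, and class name as placeholders, and note that the exact builder and handler signatures vary between client versions (newer releases use Client.builder(), a credentials(...) method, and Reactor types instead of the Rx-style await() shown here). It streams from the beginning and prints each mutation's key and revision sequence number; for a document that was rolled up, you would see a single mutation whose revision number still reflects how many times it was changed:

```java
import com.couchbase.client.dcp.Client;
import com.couchbase.client.dcp.StreamFrom;
import com.couchbase.client.dcp.StreamTo;
import com.couchbase.client.dcp.message.DcpDeletionMessage;
import com.couchbase.client.dcp.message.DcpMutationMessage;

import java.nio.charset.StandardCharsets;

public class DcpDedupExample {
    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your own cluster and bucket.
        // On RBAC-enabled clusters, newer client versions also need credentials here.
        Client client = Client.configure()
                .hostnames("127.0.0.1")
                .bucket("mybucket")
                .build();

        // Ignore control events (snapshot markers etc.) in this sketch.
        client.controlEventHandler((flowController, event) -> event.release());

        // Print every mutation and deletion the stream delivers.
        client.dataEventHandler((flowController, event) -> {
            if (DcpMutationMessage.is(event)) {
                String key = DcpMutationMessage.key(event).toString(StandardCharsets.UTF_8);
                long revSeqno = DcpMutationMessage.revisionSeqno(event);
                System.out.println("mutation key=" + key + " revSeqno=" + revSeqno);
            } else if (DcpDeletionMessage.is(event)) {
                String key = DcpDeletionMessage.key(event).toString(StandardCharsets.UTF_8);
                System.out.println("deletion key=" + key);
            }
            event.release();
        });

        client.connect().await();

        // Start from sequence number 0 ("the very beginning") and keep streaming forever.
        client.initializeState(StreamFrom.BEGINNING, StreamTo.INFINITY).await();
        client.startStreaming().await();
    }
}
```

On a bucket with billions of documents you would obviously persist the stream state and fan the events out to HDFS rather than print them, but the roll-up behaviour is visible even with this toy handler.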
Hope that helps.
-Will
Thank you so much, @WillGardella !! This really helps a lot!
One edge case: assume I already have millions of documents in Couchbase; then at some point I start DCP from the very beginning, and DCP starts streaming changes.
Between the time DCP starts and the time it catches up to the latest change, some documents get updated; will these updates be aggregated?
By the way, regarding "you will only get the latest value of the document in a single aggregated DCP message that has the change history in it": do you have a Java API or an example showing how to get the history from the DCP message?
They will be aggregated on the server as time passes, but a client that is already connected will just receive all states of the keys in the form of full documents. New clients that connect later would only see the latest versions.
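As far as I know the mutation message itself carries the latest full value plus metadata such as the revision sequence number rather than a per-change log. If you want to measure how much actually rolled up on your own cluster, one informal approach (just a sketch, with hypothetical class and method names, meant to be fed from the dataEventHandler in the earlier example) is to compare, per key, how many mutation events the stream delivered against the document's latest revision sequence number; when fewer events arrive than revisions, the intermediate changes were de-duplicated:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks, per document key, how many DCP mutation events arrived versus the
// document's latest revision number. If the two differ, some intermediate
// changes were rolled up (de-duplicated) before the stream delivered them.
public class RollupTracker {
    private final Map<String, Long> eventsSeen = new ConcurrentHashMap<>();
    private final Map<String, Long> latestRevSeqno = new ConcurrentHashMap<>();

    // Call this from the dataEventHandler for every mutation, passing the
    // key and revision seqno extracted from the DcpMutationMessage.
    public void record(String key, long revSeqno) {
        eventsSeen.merge(key, 1L, Long::sum);
        latestRevSeqno.merge(key, revSeqno, Math::max);
    }

    public void report(String key) {
        System.out.println(key
                + ": events delivered=" + eventsSeen.getOrDefault(key, 0L)
                + ", latest revision=" + latestRevSeqno.getOrDefault(key, 0L));
    }
}
```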