Sync Gateway version
Couchbase Sync Gateway/1.5.1(4;cb9522c) (multiple instances, dynamically scaled)
Operating system
Ubuntu 14.04.3 LTS
Expected behavior
When repeatedly requesting `GET /{db}/_changes` from a Sync Gateway database on the admin port with a `since` parameter set to the `last_seq` value returned in the previous request, the expectation is that 100% of updated documents will be reliably returned in the responses. In other words, this strategy ought to result in a guaranteed feed of every single upserted document in a Sync Gateway database over time.
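The polling strategy described above can be sketched roughly as follows. This is a minimal illustration, not our production monitor: the function names `fetch_changes` and `follow_changes` are hypothetical, and the base URL is an assumption you would replace with your own admin endpoint. The only parts taken from the feed itself are the `since` query parameter and the `results` and `last_seq` keys of the response body.

```python
import json
import urllib.parse
import urllib.request


def fetch_changes(base_url, db, since):
    """GET {base_url}/{db}/_changes?since=... and return the decoded JSON body.

    base_url is an assumption (for example an admin endpoint such as
    http://localhost:4985); adjust it for your deployment.
    """
    query = urllib.parse.urlencode({"since": since})
    url = f"{base_url}/{db}/_changes?{query}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def follow_changes(fetch, since, max_polls):
    """Poll `fetch` up to `max_polls` times, feeding each last_seq back as since.

    `fetch` is any callable that takes `since` and returns a dict with
    "results" and "last_seq" keys, so it can be stubbed without a server.
    Returns every change entry seen plus the final last_seq.
    """
    seen = []
    for _ in range(max_polls):
        body = fetch(since)
        seen.extend(body["results"])
        # The value is passed back verbatim, whether plain or compound.
        since = body["last_seq"]
    return seen, since
```

Against a live server this might be driven as `follow_changes(lambda s: fetch_changes("http://localhost:4985", "mydb", s), "0", 10)`, where the host, port, and database name are placeholders. Passing `fetch` in as a callable keeps the loop itself testable in isolation.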
Actual behavior
After extensive investigation following repeated anecdotal reports from users of a production implementation backed by Sync Gateway, it is demonstrably the case that some upserted documents are omitted from this changes feed, and that this happens with significant regularity. It appears to occur most frequently when the database sequence transitions to or from a compound sequence (`stable_seq::seq`) representation.
A Typical Example
A recent example that I investigated followed this pattern. The logs from the change monitor application reveal that:
- query made with `since: 11045496`
- response contains 1 document (`seq: 11045497`), and `last_seq: 11045497`
- subsequent query made with `since: 11045497`
- response contains 2 documents (`seq: 11045497::11045499` & `11045497::11045500`), and `last_seq: 11045497::11045500`
- subsequent query made with `since: 11045497::11045500`
- response contains 1 document (`seq: 11045497::11045501`), and `last_seq: 11045497::11045501`
- subsequent query made with `since: 11045497::11045501`
- response contains 26 documents (`seq: 11045497::11045502 … 11045497::11045527`), and `last_seq: 11045497::11045527`
- subsequent query made with `since: 11045497::11045527`
- response contains 1 document (`seq: 11045528`), and `last_seq: 11045528`
This is a total of 31 documents between `seq: 11045497` and `seq: 11045528`.
A short time later, once the compound sequence situation had resolved, the following manual confirmation query was made:
- query made with `since=11045496&limit=32`
- response contains 32 documents (`seq: 11045497 … 11045528`), and `last_seq: 11045528`
This is a total of 32 documents between `seq: 11045497` and `seq: 11045528`.
Analysis of the returned documents reveals that, in the first example, the response in step 4 was missing the document returned as `seq: 11045498` in the second example. In other words, the response to the query in step 3 (`since: 11045497`) is missing the expected document `seq: 11045497::11045498`. The log files surrounding this snippet include exhaustive data for several days both before and after this sequence, and the document at `seq: 11045498` is demonstrably missing from the change monitor.
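The comparison behind this analysis can be reproduced with a small script. The sketch below is illustrative only: the helper names `trailing_seq` and `missing_from_monitor` are hypothetical, and it assumes a compound sequence always has the `stable_seq::seq` shape seen in these logs, so that the number after the last `::` identifies the change. The sequence values are the ones listed in the two examples above. Note that a sequence absent from the incremental feed is only a candidate for a missed change until it is confirmed against the later query, as was done here.

```python
def trailing_seq(seq):
    """Return the integer after the last '::', or the whole value if plain."""
    return int(str(seq).split("::")[-1])


def missing_from_monitor(monitor_seqs, confirmation_seqs):
    """List sequences present in the confirmation query but never monitored."""
    seen = {trailing_seq(s) for s in monitor_seqs}
    return sorted(
        trailing_seq(s) for s in confirmation_seqs if trailing_seq(s) not in seen
    )


# Sequences reported to the change monitor across the five incremental polls.
monitor = (
    ["11045497", "11045497::11045499", "11045497::11045500", "11045497::11045501"]
    + [f"11045497::{n}" for n in range(11045502, 11045528)]
    + ["11045528"]
)

# Sequences returned by the later query with since=11045496&limit=32.
confirmation = [str(n) for n in range(11045497, 11045529)]

print(len(monitor), len(confirmation))  # 31 32
print(missing_from_monitor(monitor, confirmation))  # [11045498]
```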
Observations
- The occurrence rate of missed document changes appears to have increased significantly when we recently introduced replicating clients into our system (previously, the Sync Gateway database was only accessed by services directly over the REST API). The speculation is that these replicating clients (perhaps by now invoking the previously unused bulk “set” APIs on the Sync Gateway endpoints?) have increased the frequency of whatever root cause leads to the unreliability. Note that this is speculation only.
- We are using this changes feed to detect the presence of replicated “transactional” documents, and also to recognise changes to “state” documents within Sync Gateway. Our system depends entirely on the guaranteed detection of such document changes in order to trigger application-critical business logic and subsequent flow-on document mutations. The unreliability of this changes feed is a mission-critical problem for us, and our production system is currently unhealthy, with no immediately obvious mitigation strategy.
- Perhaps most alarmingly: on the assumption that the remote clients (using Couchbase-provided clients) essentially utilise the same changes feed mechanism to implement pull replication (albeit over the public port, and with user-context channel filtering), there is now a significant concern that such pull replication might also not be 100% reliable, and that document mutations might fail to replicate down to clients. Again, this constitutes a mission-critical failure mode for us.
- A search of the forum and issue history for Sync Gateway yields various similar-sounding issues, which may or may not be related, for example:
  - Sync Gateway _changes feed does not return all documents, sometimes
  - Sync gateway _changes feed and pull replication does not return all documents up to the latest sequence
  - https://github.com/couchbase/sync_gateway/issues/1090
None of those that I have found provides a satisfactory explanation.
Thanks for your attention to this issue.