"_changes" feed not 100% reliable

Sync Gateway version

Couchbase Sync Gateway/1.5.1(4;cb9522c) (multiple instances, dynamically scaled)

Operating system

Ubuntu 14.04.3 LTS

Expected behavior

When repeatedly requesting GET /{db}/_changes from a Sync Gateway database on the admin port, with the since parameter set to the last_seq value returned by the previous request, the expectation is that 100% of updated documents will be reliably returned in the responses. In other words, this strategy ought to yield a guaranteed feed of every single upserted document in a Sync Gateway database over time.
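
For concreteness, here is a minimal sketch of that polling strategy, assuming a local admin port of 4985, a database named db, and the Python requests library (none of which are prescribed by this report):

```python
import time
import requests

ADMIN_URL = "http://localhost:4985/db"  # assumed admin endpoint and database name


def poll_changes(since=""):
    """Poll GET /{db}/_changes forever, feeding last_seq back in as `since`."""
    while True:
        params = {"since": since} if since else {}
        resp = requests.get(f"{ADMIN_URL}/_changes", params=params)
        resp.raise_for_status()
        body = resp.json()
        for change in body.get("results", []):
            # Every upserted document should appear here exactly once over time.
            print(change["seq"], change["id"])
        since = body["last_seq"]  # resume from the last sequence seen
        time.sleep(1)             # polling interval is arbitrary for this sketch
```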

Actual behavior

After extensive investigation following repeated anecdotal reports from users of a production implementation backed by Sync Gateway, it is demonstrably the case that upserted documents are omitted from this changes feed with significant regularity. This most frequently seems to occur when the database sequence transitions from, or to, a compound sequence (stable_seq::seq) representation.
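
For readers unfamiliar with the compound form: it is two sequence values joined by "::", e.g. 11045497::11045501. A trivial helper such as the following (mine, not part of Sync Gateway) is enough to split it apart for the comparisons later in this report:

```python
def parse_seq(seq):
    """Split a Sync Gateway sequence value into (stable_seq, seq).

    Handles both plain sequences ("11045497") and compound
    sequences ("11045497::11045501", i.e. stable_seq::seq).
    """
    s = str(seq)
    if "::" in s:
        stable, current = s.split("::", 1)
        return int(stable), int(current)
    return None, int(s)
```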

A Typical Example

A recent example that I investigated followed this pattern. The logs from the change monitor application reveal that:

  1. query made with since: 11045496
  2. response contains 1 document (seq: 11045497), and last_seq: 11045497
  3. subsequent query made with since: 11045497.
  4. response contains 2 documents (seq: 11045497::11045499 & 11045497::11045500), and last_seq: 11045497::11045500
  5. subsequent query made with since: 11045497::11045500
  6. response contains 1 document (seq: 11045497::11045501), and last_seq: 11045497::11045501
  7. subsequent query made with since: 11045497::11045501
  8. response contains 26 documents (seq: 11045497::11045502 … 11045497::11045527), and last_seq: 11045497::11045527
  9. subsequent query made with since: 11045497::11045527
  10. response contains 1 document (seq 11045528), and last_seq: 11045528

This is a total of 31 documents between seq: 11045497 and seq: 11045528
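
(Per-response counts: 1 + 2 + 1 + 26 + 1 = 31, while the inclusive range 11045497 … 11045528 spans 11045528 - 11045497 + 1 = 32 sequence numbers.)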

A short time later, once the compound sequence situation has resolved, the following manual confirmation is performed:

  1. query made with since=11045496&limit=32
  2. response contains 32 documents (seq: 11045497 … 11045528), and last_seq: 11045528

This is a total of 32 documents between seq: 11045497 and seq: 11045528
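
Expressed against the same assumed admin endpoint as the polling sketch under "Expected behavior", that confirmation looks roughly like:

```python
resp = requests.get(f"{ADMIN_URL}/_changes",
                    params={"since": "11045496", "limit": 32})
results = resp.json()["results"]
print(len(results))                  # 32, once the compound-sequence window has passed
print([c["seq"] for c in results])   # 11045497 … 11045528
```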

Analysis of the returned documents reveals that, in the first example, the response in step 4 was missing the document returned as seq: 11045498 in the second example. In other words, the response to step 3 is missing the expected document at seq: 11045497::11045498. (The log files surrounding this snippet include exhaustive data for several days both before and after this sequence, and the document at seq: 11045498 is demonstrably missing from the change monitor.)
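
One way to surface the discrepancy, reusing the hypothetical parse_seq helper introduced above, is simply to diff the two sets of sequence numbers; in this example the difference is the single value 11045498:

```python
def missing_sequences(incremental_changes, confirmed_changes):
    """Return sequence numbers present in the confirmation query but absent
    from the incrementally polled feed (both are lists of _changes results)."""
    incremental = {parse_seq(c["seq"])[1] for c in incremental_changes}
    confirmed = {parse_seq(c["seq"])[1] for c in confirmed_changes}
    return sorted(confirmed - incremental)  # -> [11045498] in the example above
```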

Observations

  1. It would seem that the occurrence rate of missed document changes increased significantly when we recently introduced replicating clients into our system (previously, the Sync Gateway database was only accessed by services directly over the REST API). The speculation is that these replicating clients (perhaps by now invoking the previously unused bulk “set” APIs on the Sync Gateway endpoints?) have increased the frequency of whatever root cause leads to the unreliability. Note that this is speculation only.

  2. We are using this changes feed to detect the presence of replicated “transactional” documents, and also to recognise changes to “state” documents within Sync Gateway. Our system is absolutely contingent on the guaranteed detection of such document changes in order to trigger application-critical business logic and subsequent flow-on document mutations. The unreliability of this changes feed is a mission-critical situation for us, and our production system is currently unhealthy, with no immediately obvious mitigation strategy.

  3. Perhaps most alarmingly: on the assumption that remote clients (using Couchbase-provided client libraries) essentially utilise the same changes feed mechanism to implement pull replication (albeit over the public port, and with user-context channel filtering), there is now a significant concern that such pull replication might also not be 100% reliable, and might result in document mutations failing to replicate down to clients. Again, this constitutes a mission-critical failure mode for us.

  4. A search of the forum and issue history for Sync Gateway yields various similar-sounding issues, which may or may not be related.

Thanks for your attention to this issue.

Note: also raised as https://github.com/couchbase/sync_gateway/issues/3721

Tracking investigation of this issue on the GitHub ticket (https://github.com/couchbase/sync_gateway/issues/3721).

Same version of SG, 1.5.1(4;cb9522c), and a similar problem here, but we are just using 1 SG and 1 CBS.
When doing pull replication from CBL, CBL does not get the latest revision of documents, but an older revision.

sg config

{
  "log": ["*"],
  "adminInterface": ":4985",
  "interface": ":4984",
  "databases": {
    "db": {
      "allow_empty_password": false,
      "unsupported": {
        "user_views": {
          "enabled": true
        }
      },
      "server": "http://CB_host",
      "bucket": "db",
      "username": "admin",
      "password": "Password",
      "enable_shared_bucket_access": true,
      "import_docs": "continuous",
      "users": {
        "GUEST": { "disabled": true },
        "admin": { "password": "Password", "disabled": false, "admin_roles": ["admin"] },
        "user001": { "password": "Password", "disabled": false, "admin_roles": ["user"] }
      },
      "roles": {
        "admin": { "admin_channels": ["*"] },
        "user": { "admin_channels": ["c001"] }
      },
      "sync": "function(doc, meta) { if (doc.type == 'product') requireRole(\"admin\"); channel('c001'); }"
    }
  },
  "CORS": {
    "Origin": ["*"],
    "LoginOrigin": ["*"],
    "Headers": ["Content-Type"],
    "MaxAge": 1728000
  }
}

  1. We get a session from SG for user001 and perform replication on CBL to get the latest document.
  2. We then update a document with a PUT on SG using the admin account (a rough sketch of steps 1 and 2 follows after this list).
  3. We perform the replication on CBL again.
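
Roughly, steps 1 and 2 correspond to requests like the following (hostnames, credentials, and the document ID are placeholders based on the config above; the CBL replications themselves run on the client and are not shown):

```python
import requests

SG_PUBLIC = "http://sg_host:4984/db"  # placeholder public endpoint
SG_ADMIN = "http://sg_host:4985/db"   # placeholder admin endpoint

# Step 1: create a session for user001 via the admin REST API; the returned
# cookie is handed to CBL, which then performs the pull replication.
session = requests.post(f"{SG_ADMIN}/_session", json={"name": "user001"}).json()

# Step 2: update an existing document as the admin user over the public port.
doc_id = "product-001"  # placeholder document ID
doc = requests.get(f"{SG_PUBLIC}/{doc_id}", auth=("admin", "Password")).json()
doc["updated_by"] = "admin"
requests.put(f"{SG_PUBLIC}/{doc_id}", json=doc, auth=("admin", "Password"))
```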

Changes+: Changes POST request. URL: /db/_changes?feed=normal&heartbeat=30000&style=all_docs&since=287, feed: normal, options: {Since:287 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator: HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:false}, filter: , bychannel: , docIds: (to user001)
Changes+: Int sequence multi changes feed…
Changes: MultiChangesFeed(channels: {*}, options: {Since:287 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc42164bbc0 HeartbeatMs:30000 TimeoutMs:300000 ActiveOnly:false}) … (to user001)
Changes+: MultiChangesFeed: channels expand to "!:1,c001:27" … (to user001)
Cache: getCachedChanges("!", 287) → 0 changes valid from #1
Changes+: [changesFeed] Found 0 changes for channel !
Changes+: [changesFeed] Found 0 changes for channel c001
Changes: MultiChangesFeed done (to user001)
HTTP+: #198: → 200 OK (0.0 ms)

http://sg/db

{
  "committed_update_seq": 288,
  "compact_running": false,
  "db_name": "config",
  "disk_format_version": 0,
  "instance_start_time": 1535371130018573,
  "purge_seq": 0,
  "state": "Online",
  "update_seq": 288
}

The updated document is not being synced to CBL.

When we get the full _changes list from SG, the latest change seq is only 282, which is the seq from before we restarted SG.
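
For reference, a full listing of this kind can be pulled from the admin port without a since value, e.g.:

```python
import requests

resp = requests.get("http://sg_host:4985/db/_changes")  # full feed, no `since`
print(resp.json()["last_seq"])  # stuck at 282 here, even though update_seq is 288
```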

After restarting Sync Gateway, CBL can get the latest document correctly. It seems SG stops tracking changes after it is restarted. We only found the following warning in the log, and we don't know whether it is related:

WARNING: Error processing DCP stream - will attempt to restart/reconnect: worker connect, server: 172.19.0.2:11210, err: dial tcp 172.19.0.2:11210: getsockopt: connection timed out – base.(*DCPReceiver).OnError() at dcp_feed.go:136
WARNING: Error processing DCP stream - will attempt to restart/reconnect: worker connect, server: 172.19.0.2:11210, err: dial tcp 172.19.0.2:11210: getsockopt: connection timed out – base.(*DCPReceiver).OnError() at dcp_feed.go:136

We have used CB for a long time, but we have never met this problem before. It has only started happening recently, on 1.5.1, when we started a new project.