When are checkpoints set? Stuck and won't update

Trying to chase down an issue with a Swift app that connects to SG, initially replicates as expected, but the checkpoint sequences are never updated. The net result is every time the client connects, it’s pulling a very large number of changes from the feed, from an early point in time.

The checkpoint document is being mutated (revision goes up) and synced (getCheckpoint and setCheckpoint visible in the SG logs), but the values are always the same. The local and remote sequences agree in the sqlite table and the _sync:local:checkpoint document. {“local”:78017,“remote”:7021997}. This is consistent even with a checkpoint reset on Mobile, or a reinstall of the mobile app (the new _sync:local document will have those same seq. #'s).

Where should I look next? I’m still getting my feet wet looking under the hood.

The version numbers for SG, Couchbase server, and CBL would be helpful.

My apologies.

Server is 6.0.0
Sync Gateway is 2.5.0
Couchbase Lite is Swift 2.5.0

Another thing to note, if you mutate the document on mobile and save it will replicate to Server and the local checkpoint gets updated quickly and properly. The remote stays put. If you mutate the document on Server (or via SG from another client) the remote sequence also stays put.

Are there any warnings being logged by the Couchbase Lite replicator? If it fails to pull a document and the error is judged to be transient, the remote checkpoint won’t advance past that doc’s sequence, so it can be pulled again next time.

Ah, indeed there is, thank you. "Failed to insert ‘userPreferences::00098100’ #86- : LiteCore error 32 “Invalid delta”. That makes sense for ensuring a full and clean replication, so it looks like I need to figure out why the delta isn’t valid.

Deleted the offending document, and I no longer see any warnings (Lite or SG) but the remote sequence still doesn’t update.

The pull replication appears to be completing successfully, but I also noticed the progress completed and total counts don’t match:

12:11:44.605385| [Sync]: {13} Received 12 changes (seq ‘20311371’…‘20313191’)
12:11:44.606407| [Sync]: {13} Responded to ‘changes’ REQ#43226 w/request for 1 revs
12:11:44.606415| [Sync]: {13} activityLevel=idle: pendingResponseCount=0, eventCount=1, activeDocs=0
12:11:44.606476| [Sync]: {19} activityLevel=busy: pendingResponseCount=0, _caughtUp=0, _waitingForChangesCallback=0, _pendingRevMessages=1, _activeIncomingRevs=0
12:11:44.606519| [Sync]: {12} pushStatus=idle, pullStatus=busy, dbStatus=idle, progress=41847/42634
12:11:44.606529| [Sync]: {12} activityLevel=busy: connectionState=2
12:11:44.606614| [Sync]: CBLReplicator[<*> URL[wss://[URL]:4984/[DB]]] is busy, progress 41847/42634, error: (null)
12:11:44.607473| [Sync]: {19} Caught up with remote changes
12:11:44.607476| [Sync]: {19} activityLevel=busy: pendingResponseCount=0, _caughtUp=1, _waitingForChangesCallback=0, _pendingRevMessages=1, _activeIncomingRevs=0
12:11:44.609092| [Sync]: {19} activityLevel=idle: pendingResponseCount=0, _caughtUp=1, _waitingForChangesCallback=0, _pendingRevMessages=0, _activeIncomingRevs=0
12:11:44.609133| [Sync]: {12} pushStatus=idle, pullStatus=idle, dbStatus=idle, progress=41847/42634
12:11:44.609137| [Sync]: {12} activityLevel=idle: connectionState=2
12:11:44.609139| [Sync]: {12} now idle

Try again without delta sync enabled. We are currently addressing a few issues around it, some of which it appears you are running into.

Disabled delta sync and restarted SG, but unfortunately that didn’t work either.

I also tried deleting the app and reinstalling, so I’d get a fresh from zero sync and a new checkpoint doc. The checkpoint was very quickly set and stayed at the exact same point: 7021997.

I did log one error in the client, during the initial sync after app reinstallation. It has not returned on subsequent launches.

CouchbaseLite Replicator ERROR: {Pull#12} Threw C++ exception: mutex lock failed: Invalid argument

Try updating Couchbase Lite to the latest (2.5.2?) – that mutex error sounds like something we fixed.

Updated to CouchbaseLite/2.5.2 (Swift; iOS 12.2; iPad) Build/3 Commit/e613ef5b LiteCore/2.5.2 (3) (via CocoaPods). The replication checkpoint is still stuck at I’m still seeing the mutex lock fail, preceded by a connection reset. Keep in mind, the mutex lock failure only occurs on the first replication after a clean install of the client app, and it never returns (probably because whatever document triggers it has already replicated?)

nw_socket_handle_socket_event [C2.1:1] Socket SO_ERROR [54: Connection reset by peer]
CouchbaseLite Replicator ERROR: {Pull#12} Threw C++ exception: mutex lock failed: Invalid argument

This is accompanied by an entry in SG’s error log (logged at the same moment in time):

Error from bh.handleChangesResponse: use of closed BLIP sender – rest.(*blipSyncContext).Logf() at blip_sync.go:224

These errors only occur on the initial pull, and do not return when the app is relaunched and a new pull replication is started. The original issue of the checkpoint never progressing beyond 7021997 remains with the update to 2.5.2 as well.

Can you reproduce this while running the app in Xcode? If so, please set a breakpoint on exceptions, then when the exception is hit, capture the backtrace by entering bt in the debugger console. (Even better would be to get all threads’ backtraces, but I can never remember the command for that.) Then file an issue on Github and add the backtrace. Thanks!!

Weird that the checkpoint gets stuck but you’re not seeing any warnings. Please turn on full replicator logging (in a subsequent replication, so it won’t stop at the exception), then save the entire log output to a file and attach it to the issue. We will then have the “fun” of poring over it to find something wrong…

I’ve been running in Xcode since I first uncovered the issue, I’ll start on those backtraces right away.

Issue #813 has been filed. I’ll add more context as I continue digging.

Circling back to the original topic, of when checkpoints are set.

I’m still encountering a case where the mobile client (iOS-Swift) begins a pull replication, updates checkpoints for a period of time (new revisions being mutated in and synced properly), but then simply ceases. I was scratching my head for a while until I grabbed the Couchbase Lite source and added a little logging in LiteCore’s RevFinder::_findOrRequestRevs() method.

Long story short, I’m somehow getting documents in the SG changes feed that no longer exist (via SG or Server). When the pull replication runs, it will dutifully check the local DB for each document from _changes and add each missing doc/rev to the next request. Since the requested doc doesn’t exist and doesn’t get returned by SG, it never gets removed from the list of sequences, and bubbles to the top. Once that happens, RemoteSequenceSet::remove will never set wasEarliest to true, Puller::completedSequence will never call updateLastSequence, and the checkpoint never moves forward.

My question is mostly to understand how Sync Gateway builds the list of doc/rev for the changes feed, and how a purged document is included in the list. Any hints on how I can debug this?

Those no longer existent document you speak of, do they result in 404 results? Because in that case the replicator should be skipping them. If not, that’s another issue to investigate. The reason they are there in the first place is most likely due to the way that channels work. If a document is removed from a channel, then you no longer have access to any of the revisions after that, but still have access to the ones up to and including the removal. However, after a while that revision will probably be purged for being old and unneeded, however since channel access prevents the visibility of the newer revisions, the changes feed will only the show latest one that you had access to before the channel removal. The client doesn’t necessarily know that this is a removal, and will request it (which returns a placeholder that means the document was removed). However if the actual document behind that is gone it will return 404 instead. The replicator should mark this and skip it (setting it be no longer be in the list of sequences).

However, you indicate that you are cloning into a fresh database right? In that case you should enter a special mode called “active only” which skips all deleted and removed revisions. I haven’t been able to verify this in the sample that you gave us yet. Could you let me know how you are determining that the checkpoint is not advancing?

@bbrks Can probably fill in a little more detailed about how SG actually generates the changes feed.

In the Server UI, sticking the document ID (assortment::00092682) in the box on the Documents tab and hitting Retrieve Docs will respond "“Object not found”. Running a N1QL query with the where clause where meta().id == 'assortment::00092682' returns { results: [] }

But, Sync Gateway is returning this document in the changes feed:
{"seq":7192516,"id":"assortment::00092682","changes":[{"rev":"1-140f2f1e1be8f808ad664734c20c88c6"}]}

However, when requesting the doc via the REST API, Sync Gateway will respond {"error":"not_found","reason":"Not imported"}. The sg_debug.log does return a 404 Not imported, after attempting to import the document. I’m not sure what document it’s operating on, but the import function requires the document to have type and channels properties. I haven’t yet learned what the cas and syncCas refer to.

2019-08-13T09:32:44.720-05:00 [INF] HTTP:  #10170: GET /db/assortment::00092682 (as ADMIN)
2019-08-13T09:32:44.721-05:00 [DBG] CRUD+: Doc assortment::00092682 is not an SG write, based on cas and body hash. cas:158bdc1d47760000 syncCas:"0x000037b985641915"
2019-08-13T09:32:44.721-05:00 [DBG] Import+: Attempting to import doc "assortment::00092682"...
2019-08-13T09:32:44.721-05:00 [DBG] CRUD+: Doc assortment::00092682 is not an SG write, based on cas and body hash. cas:158bdc1d47760000 syncCas:"0x000037b985641915"
2019-08-13T09:32:44.721-05:00 [DBG] Import+: Doc assortment::00092682 excluded by document import function - will not be imported.
2019-08-13T09:32:44.721-05:00 [DBG] CRUD+: Did not update document "assortment::00092682" w/ xattr: Import cancelled based on import filter
2019-08-13T09:32:44.721-05:00 [ERR] 404 Not imported -- rest.(*handler).writeError() at handler.go:690
2019-08-13T09:32:44.721-05:00 [INF] HTTP: #10170:     --> 404 Not imported  (1.1 ms)

To answer your other question, this actually occurs regardless of whether the local DB is empty. Once you reach this document in the pull replication, the checkpoint will never advance past it because the requested missing doc isn’t returned. Thus, it can’t be removed from the LiteCore _sequences stack. The code only moves the checkpoint forward on the removal of the first entry in _sequences. So once this document sits in that first spot? No more checkpoints.

I’ve tested manually updating the _sync:local:checkpoint/cp-xxx= document and the record in the local sqlite table to a sequence after this document. As expected, the pull replication will pick up and continue properly.

In another example (Doc ID: order::32210477) there’s a 404 Missing, and it does not trigger the import warning. This gives (at least to me) a more clear indication the document just isn’t there, but is showing up in the changes feed anyway:

2019-08-13T10:35:28.769-05:00 [INF] HTTP:  #10187: GET /db/order::32210477 (as ADMIN)
2019-08-13T10:35:28.769-05:00 [ERR] key not found -- rest.(*handler).writeError() at handler.go:690
2019-08-13T10:35:28.770-05:00 [INF] HTTP: #10187:     --> 404 missing  (0.6 ms)

@bbrks @adamf Could you provide any insight on this scenario of the “404 not imported” error?

According to the error, the sync gateway cancelled the import based on the import function. So you may want to double check if the document in question matches the import criteria.
The other statements just indicate that the Sync Gateway has identified that the requested document has not yet been imported (or processed by SGW) and so it is attempting to import it.

I appreciate the comments, but… Please read the further comments I’ve posted.

“404 not imported” would make complete sense, if the document existed and was being passed through the import function. However, I cannot seem to locate the document, or what remnants of it, are actually being sent through the SG import function. In the next comment, I note an additional example with a different document that does not trigger the warning about the import function rejecting the document, but is still appearing in the SG _changes feed.

I really need to figure out how the _changes feed is constructed, so I can get to the bottom of why these documents are appearing there in the first place when they cannot be retrieved from the Server.