We have a problem replicating data from Couchbase Lite C to Couchbase Server.
The setup:
Couchbase Server Community 7.0~7.2, running on Machine A.
Sync Gateway Community 2.8~3.0.3, running on Machine A.
Couchbase Lite C Community 3.0~3.1, running on Machine B. Continuous replication mode; push only; documents expire after 48 hours.
Both Machine A and Machine B run Ubuntu Desktop 20.04 LTS with fixed IP address assignment.
Normal Operation:
We only push data from the local DB to the server; everything is OK.
Abnormal Behavior:
After confirming replication is working properly, we disconnect the Ethernet cable of Machine A,
then reconnect the Ethernet cable of Machine A.
If disconnected for a short time, replication works OK.
If disconnected for 0.5~2.5 hours, replication stops.
What we found:
The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, or BUSY.
In the Couchbase Lite C application, a replicator stop/start cycle does not resume replication.
Stopping and restarting the Couchbase Lite C application does resume replication.
Restarting Sync Gateway does not resume replication.
Here is an "ls -l ~sync_gateway/logs/bootstrap" from our test:
166 Sep 7 16:41 sg_error.log
3711825 Sep 7 17:18 sg_info.log
6870131 Sep 7 21:15 sg_stats.log
166 Sep 7 16:41 sg_warn.log
We cleared all logs and started sync_gateway at 16:41, as indicated by sg_error.log and sg_warn.log.
Ethernet was disconnected at 17:16; sg_info.log stopped recording at 17:18.
Ethernet was reconnected at 20:44; replication did not resume, as no new info was logged in sg_info.log.
sync_gateway survived the entire test, as sg_stats.log lasts until 21:15.
Have you seen similar problems? What might be going wrong? Is there any workaround?
Hope to hear from you soon.
Assuming that you are using continuous mode and that the error from disconnecting the Ethernet cable is a transient error, the expected behavior is that the replicator will be in its retry cycle, waiting to retry again. When disconnected for 0.5~2.5 hours, by default, the replicator will wait up to 5 minutes (300 seconds) between retries. However, if the error when trying to connect is a permanent error, the replicator will stop.
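To tell a permanent failure apart from the normal retry cycle, you can check the status error when the replicator stops. This is only a minimal sketch using the C status API that the C++ wrapper exposes; the function name is a placeholder:

    #include <stdio.h>
    #include "cbl/CBLReplicator.h"

    // Sketch: when the replicator reaches the Stopped level, inspect the error to see
    // whether it gave up on a permanent error (error.code != 0) or stopped normally.
    void checkStopped(CBLReplicator *repl) {
        CBLReplicatorStatus s = CBLReplicator_Status(repl);
        if (s.activity == kCBLReplicatorStopped && s.error.code != 0) {
            FLSliceResult msg = CBLError_Message(&s.error);
            printf("Replicator stopped with error %d/%d: %.*s\n",
                   s.error.domain, s.error.code, (int)msg.size, (const char *)msg.buf);
            FLSliceResult_Release(msg);
        }
    }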
Can you enable verbose logging and see the error when the replicator stops? Sharing the log would be very helpful.
The replicator activity level may get stuck at any one of OFFLINE, CONNECTING, IDLE, or BUSY.
Enabling verbose logging and sharing the log would be helpful. We have fixed some issues related to this problem over time, so updating to the latest CBL version might fix the problem.
Thanks for your prompt reply.
We are using the Lite C++ API, which comes with the DEB package.
On Ubuntu, Couchbase Lite C is installed by
"sudo apt install libcblite-community"
and
"sudo apt install libcblite-dev-comminity "
An embarrassing question:
How do we enable verbose logging in the C++ API?
There is no clear documentation on enabling logging.
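A minimal sketch of one way to do this, assuming the cbl/CBLLog.h header shipped with the deb package (the C++ API shares the same underlying C logging calls):

    #include "cbl/CBLLog.h"

    // Raise the console log level so that replicator [Sync]/[WS] activity is logged verbosely.
    // Call this once at startup, before starting the replicator.
    void enableVerboseLogging() {
        CBLLog_SetConsoleLevel(kCBLLogVerbose);
    }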
There are two zip files in the Google Drive shared folder:
sg_log_xxx.zip ==> log from Sync Gateway
sg_err.log
cbl_verb_log_xxx.zip ==> console logs of our Couchbase Lite application
cbl_verb.log ==> from application start, through breaking the network connection (for 2.5 hours) and reconnecting it (then waiting another 1 hour)
cbl_verb_restart.log ==> stop the previous CBL application, then start it again and log. Replication is back to normal.
psLog_after_netbreak ==> the output of the ps command after the network break.
psLog_after_reconnect ==> the output of the ps command after reconnecting the network.
One thing we must address:
The timestamps generated by the Couchbase Lite log are 8 hours ahead of our local time. As shown in the first line of the logs, local time (generated by the date command) is 2023-09-21, Thursday, 15:09:08, but the verbose log stamps the time as 23:09:08.
Here we list the timed events in "Couchbase Lite verbose" time, with local time inside parentheses.
23:24:05 (15:24:05) unplugged the Ethernet cable of the Couchbase Server machine (which also runs Sync Gateway)
02:02:00 (18:02:00) re-plugged the Ethernet cable of the Couchbase Server machine.
03:01:36 (19:01:36) restarted the Couchbase Lite application. Console log is in cbl_verb_restart.log.
As shown in the log file, 23:40:05 (16 minutes after unplugging the Ethernet cable) is the last [Sync]-related verbose log entry. Even after re-plugging the Ethernet cable, there are no [Sync]-related messages.
Please help us solve this issue.
Thank you very much.
I have looked at cbl_verb.log and I can see the same thing: the replicator seems to stop working after starting an attempt (attempt #4) to connect to SG. It's strange that there was no log indicating that a BuiltInWebSocket is trying to connect either.
The only workaround I can think of is to listen to the replicator change events. If no events are reported for a certain amount of time after the replicator goes to offline or connecting status, just restart the replicator (stop the current one and start it again).
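A rough sketch of that idea, using the CBL C change-listener API that the C++ wrapper is built on (the variable and function names here are just placeholders):

    #include <atomic>
    #include <chrono>
    #include "cbl/CBLReplicator.h"

    // Sketch: record the time of the most recent replicator change event. A housekeeping
    // thread can compare this against a timeout while the status stays offline/connecting
    // and restart the replicator when it has been silent for too long.
    static std::atomic<long long> g_lastChangeMs{0};

    static void onReplicatorChange(void *context, CBLReplicator *repl,
                                   const CBLReplicatorStatus *status) {
        using namespace std::chrono;
        g_lastChangeMs = duration_cast<milliseconds>(
                             steady_clock::now().time_since_epoch()).count();
        // status->activity and status->error can also be logged here.
    }

    // Registration (keep the token and remove the listener on shutdown):
    //   CBLListenerToken *token =
    //       CBLReplicator_AddChangeListener(repl, onReplicatorChange, nullptr);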
If no events are reported for a certain amount of time after
the replicator goes to offline or connecting status,
just restart the replicator (stop the current one and start it again).
We tried similar things in a housekeeping thread, as shown below.
It does not work. Neither setting host reachable nor stopping/starting the replicator works.
=========================== begin quote ==============
if (r.status().activity == kCBLReplicatorIdle) {
    idle_count++;
    if (idle_count > 20*10) {   // idle for more than 10 minutes, 20 counts per minute
        idle_count = 0;
        need_action = true;
    }
} else {
    idle_count = 0;
}
if (r.status().activity == kCBLReplicatorConnecting) {
    connecting_count++;
    if (connecting_count > 20*10) {   // connecting for more than 10 minutes, 20 counts per minute
        connecting_count = 0;
        need_action = true;
    }
} else {
    connecting_count = 0;
}
if (r.status().activity == kCBLReplicatorOffline) {
    offline_count++;
    if (offline_count > 20*10) {   // offline for more than 10 minutes, 20 counts per minute
        offline_count = 0;
        need_action = true;
    }
} else {
    offline_count = 0;
}
=========================== end quote ==============
That is really weird. I guess that some internal flag is off and that prevents the replicator from actually restarting. What if you re-create a new replicator?
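A rough sketch of that, expressed with the C API calls the C++ wrapper sits on (error handling elided; the config is assumed to be the same one used to create the original replicator):

    #include "cbl/CBLReplicator.h"

    // Sketch: tear down the stuck replicator and build a fresh one from the same config.
    CBLReplicator *recreateReplicator(CBLReplicator *oldRepl,
                                      const CBLReplicatorConfiguration *config) {
        CBLReplicator_Stop(oldRepl);           // stop is asynchronous
        CBLReplicator_Release(oldRepl);        // drop our reference to the old instance

        CBLError err;
        CBLReplicator *fresh = CBLReplicator_Create(config, &err);
        if (fresh)
            CBLReplicator_Start(fresh, false); // false = keep the existing checkpoint
        return fresh;
    }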
I picked up the issue last week and tried to reproduce it on my Mac, but I couldn't. I have reviewed the code against the log. I am guessing that the replicator is somehow waiting to get a lock to check for any pending conflicts that need to be resolved, but I just don't have enough info to support that guess.
Can you reproduce the issue in your dev environment so that you can get a full backtrace of all threads when the replicator hangs while trying to start?
Without being able to reproduce the issue or get the traces, it is hard to see where the problem is.
Can you reproduce the issue in your dev environment so that you can get a
full backtrace of all threads when the replicator hangs while trying to start?
Do you mean: run our application, repeat our testing procedure (unplug the network cable for 0.5~2.5 hours, then re-plug it), break the program execution in the debugger, and then print a backtrace of all replicator threads? If yes, we will do the test.
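If so, we will capture the backtraces with standard gdb commands along these lines (the process name below is just a placeholder for our application):

    gdb -p $(pidof our_cbl_app)     # attach to the running application (placeholder name)
    (gdb) set pagination off
    (gdb) thread apply all bt       # print a backtrace of every thread
    (gdb) detach
    (gdb) quit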