Q: Couchbase Lite WebSocket heartbeat blocked by Tencent load balancer

Hi,

I use a VPS cloud server for my development project. I have 3 nodes, and each node has Sync Gateway (SG) and Couchbase Server (CB) installed.
I use Tencent CLB for load balancing, but Couchbase Lite often disconnects from Sync Gateway.
The error thrown is:

2019-06-04 14:25:30.907 14542-14618/com.couchbase.todo E/CouchbaseLite/REPLICATOR: {Repl#1}==> N8litecore4repl10ReplicatorE /data/user/0/com.couchbase.todo/files/user1.cblite2/ ->ws://118.25.31.151:4984/todo/_blipsync @0x7fa4fd74c8
2019-06-04 14:25:30.908 14542-14618/com.couchbase.todo E/CouchbaseLite/REPLICATOR: {Repl#1} Got LiteCore error: POSIX error 104 “Connection reset by peer”

But when I connect directly to the server, Couchbase Lite works fine.
Then I built a chat room test based on Node.js (socket.io) that connects through the LB, and its WebSocket works fine.
I guess Couchbase Lite uses a custom WebSocket protocol, which makes the LB filter out the Lite heartbeat messages. Is that right?
I captured the network packets with Wireshark:
winshark_websocket_disconnect.pcapng.zip (45.7 KB)

How can I resolve this issue?
Thanks!

angular

  1. What is the platform (iOS, Android, .NET)? Each platform has a different WebSocket implementation, so the issue might be platform specific.
  2. Did any errors appear in the Sync Gateway log when the connection was reset?

Hi @pasin,

Thanks for your reply.

1. What is the platform (iOS, Android, .NET)? Each platform has a different WebSocket implementation, so the issue might be platform specific.

My environment is:
cb: 6.0.1
sg: 2.5
cb lite: android 2.5

2. Did any errors appear in the Sync Gateway log when the connection was reset?

SG error logs:

2019-06-05T01:52:56.534Z [INF] Cache: c:[7607c466] getCachedChanges(“task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399”, 11:0) → 0 changes valid from #15
2019-06-05T01:52:56.534Z [INF] Cache: Querying ‘channels’ for “task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399” (start=#1, end=#15, limit=0)
2019-06-05T01:52:56.535Z [INF] Cache: Got 1 rows from query for “task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399”: #11#11
2019-06-05T01:52:56.536Z [INF] Cache: Initialized cache of “task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399” with 1 entries from query (#11#11)
2019-06-05T01:52:56.536Z [INF] Cache: c:[7607c466] GetChangesInChannel(“task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399”) → 1 rows
2019-06-05T01:52:56.536Z [INF] Cache: Initialized cache for channel “!” with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
2019-06-05T01:52:56.536Z [INF] Cache: c:[7607c466] getCachedChanges(“!”, 0) → 0 changes valid from #15
2019-06-05T01:52:56.536Z [INF] Cache: Querying ‘channels’ for “!” (start=#1, end=#15, limit=0)
2019-06-05T01:52:56.538Z [INF] Cache: Got no rows from query for channel:“!”
2019-06-05T01:52:56.538Z [INF] Cache: c:[7607c466] GetChangesInChannel(“!”) → 0 rows
2019-06-05T01:52:56.562Z [INF] Sync: c:[7607c466] Sent 5 changes to client, from seq 8. User:user1
2019-06-05T01:52:56.562Z [INF] Sync: c:[7607c466] Sent all changes to client. User:user1
2019-06-05T01:52:56.616Z [INF] SyncMsg: c:[7607c466] #4: Type:setCheckpoint Client:cp-SVNtjuxHFu1uBPAI1fyWPVNBPWQ= User:user1
2019-06-05T01:58:56.395Z [INF] WS: c:[7607c466] Error: receiveLoop exiting with WebSocket error: read tcp 172.19.0.3:4984->222.173.43.58:19251: read: connection reset by peer
2019-06-05T01:58:56.396Z [INF] WS: c:[7607c466] BLIP/Websocket Handler exited: read tcp 172.19.0.3:4984->222.173.43.58:19251: read: connection reset by peer
2019-06-05T01:58:56.396Z [INF] HTTP: c:[7607c466] #004: → BLIP+WebSocket connection error: read tcp 172.19.0.3:4984->222.173.43.58:19251: read: connection reset by peer
2019-06-05T01:58:56.396Z [INF] HTTP: c:[7607c466] #004: → BLIP+WebSocket connection closed
2019-06-05T01:58:56.396Z [INF] Changes: c:[7607c466] MultiChangesFeed done (to user1)
2019-06-05T01:58:56.532Z [INF] HTTP: #005: GET /todo/_blipsync (as GUEST)
2019-06-05T01:58:56.532Z [ERR] 401 Login required – rest.(handler).writeError() at handler.go:690
2019-06-05T01:58:56.532Z [INF] HTTP: #005: → 401 Login required (0.4 ms)
2019-06-05T01:58:56.559Z [INF] HTTP: #006: GET /todo/_blipsync (as user1)
2019-06-05T01:58:56.559Z [INF] HTTP+: #006: → 101 [30ecf6c0] Upgraded to BLIP+WebSocket protocol. User:user1. (0.0 ms)
2019-06-05T01:58:56.559Z [INF] WS: c:[30ecf6c0] Start BLIP/Websocket handler
2019-06-05T01:58:56.586Z [INF] SyncMsg: c:[30ecf6c0] #1: Type:getCheckpoint Client:cp-SVNtjuxHFu1uBPAI1fyWPVNBPWQ= User:user1
2019-06-05T01:58:56.612Z [INF] SyncMsg: c:[30ecf6c0] #2: Type:subChanges Since:14 Continuous:true User:user1
2019-06-05T01:58:56.613Z [INF] Sync: c:[30ecf6c0] Sending changes since 14. User:user1
2019-06-05T01:58:56.613Z [INF] Changes: c:[30ecf6c0] MultiChangesFeed(channels: {
}, options: {Since:14 Limit:0 Conflicts:false IncludeDocs:false Wait:true Continuous:true Terminator:0xc0002087e0 HeartbeatMs:0 TimeoutMs:0 ActiveOnly:false Ctx:context.Background.WithValue(base.LogContextKey{}, base.LogContext{CorrelationID:“#006”}).WithValue(base.LogContextKey{}, base.LogContext{CorrelationID:“[30ecf6c0]”})}) … (to user1)
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“task-list.user1.eff46435-5bf7-4b14-ae39-a20ee4ae4df2.users”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“user1”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“task-list.user1.6647f6b8-d04e-49cd-b70f-3141da8c71c4.users”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“task-list.user1.4a4eda9c-e4b2-4389-b8ea-6050dec7d660.users”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“task-list.user1.0a2a41b7-8898-4299-a4a5-53529d5f9cb5.users”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“!”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Cache: c:[30ecf6c0] getCachedChanges(“task-list.user1.946e8b0f-f9b9-4b06-aced-68fb0d288399.users”, 14) → 0 changes valid from #1
2019-06-05T01:58:56.613Z [INF] Sync: c:[30ecf6c0] Sent all changes to client. User:user1
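
For reference, the gap between the last sync message on connection 7607c466 (setCheckpoint at 01:52:56.616Z) and the reset (01:58:56.395Z) in the excerpt above can be computed directly from the two timestamps. A minimal sketch in Java using only java.time; the class and method names are made up for illustration:

```java
import java.time.Duration;
import java.time.Instant;

final class IdleGap {
    /** Milliseconds between two ISO-8601 UTC timestamps as printed in the Sync Gateway log. */
    static long gapMillis(String earlier, String later) {
        return Duration.between(Instant.parse(earlier), Instant.parse(later)).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the log excerpt: last setCheckpoint, then the reset.
        long ms = gapMillis("2019-06-05T01:52:56.616Z", "2019-06-05T01:58:56.395Z");
        System.out.println(ms / 1000.0 + " s between last message and reset");
    }
}
```

The gap is just under 360 seconds, so in this excerpt the connection had roughly six minutes with nothing logged before it was reset. The log alone does not say which device on the path sent the reset.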

Couchbase Lite error logs:

2019-06-05 09:37:53.008 15184-15257/com.couchbase.todo E/CouchbaseLite/REPLICATOR: {Repl#2}==> N8litecore4repl10ReplicatorE /data/user/0/com.couchbase.todo/files/user1.cblite2/ ->ws://152.136.10.30:4984/todo/_blipsync @0x7f957c59c8
2019-06-05 09:37:53.008 15184-15257/com.couchbase.todo E/CouchbaseLite/REPLICATOR: {Repl#2} Got LiteCore error: POSIX error 104 “Connection reset by peer”

The Tencent CLB configuration is:

(Sorry, there is no English documentation; please use Google Translate.)

Thanks!

angular

I would suspect that your guess is correct and you should eliminate any logic that shuts off idle connections in your load balancer. This sort of stuff is documented at a high level in the Sync Gateway docs.

According to the SG doc suggested by @borrrden, the keep-alive timeout should be set to a value greater than the heartbeat interval of the replicator’s WebSocket implementation, which is 300 seconds. Can you first try setting the keep-alive timeout to a value such as 360 seconds instead of the default 75 seconds?

To keep a WebSocket connection open, the replicator sends a WebSocket PING message (also known as heartbeat) every 300 seconds (5 minutes). The keep alive timeout value of the load balancer must be configured to a higher value than the heartbeat interval. For example, 360 seconds. The following section demonstrates how to do that with NGINX.
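
For context on the heartbeat itself: a WebSocket PING is a standard control frame defined in RFC 6455 (opcode 0x9), and frames sent by a client must be masked. The sketch below builds such a frame by hand, only to illustrate the wire format. It is not taken from Couchbase Lite, and the class and method names are hypothetical.

```java
import java.security.SecureRandom;

final class WsPingFrame {
    static final int OPCODE_PING = 0x9;

    /** Builds a masked client-to-server PING frame as laid out in RFC 6455, section 5.2. */
    static byte[] build(byte[] payload, byte[] maskKey) {
        if (payload.length > 125) {
            throw new IllegalArgumentException("control frame payload must be at most 125 bytes");
        }
        if (maskKey.length != 4) {
            throw new IllegalArgumentException("masking key must be 4 bytes");
        }
        byte[] frame = new byte[2 + 4 + payload.length];
        frame[0] = (byte) (0x80 | OPCODE_PING);     // FIN = 1, opcode = 0x9 (ping)
        frame[1] = (byte) (0x80 | payload.length);  // MASK = 1, 7-bit payload length
        System.arraycopy(maskKey, 0, frame, 2, 4);  // masking key follows the header
        for (int i = 0; i < payload.length; i++) {
            frame[6 + i] = (byte) (payload[i] ^ maskKey[i % 4]);  // XOR-mask the payload
        }
        return frame;
    }

    public static void main(String[] args) {
        byte[] key = new byte[4];
        new SecureRandom().nextBytes(key);          // a client picks a fresh random key per frame
        StringBuilder hex = new StringBuilder();
        for (byte b : build(new byte[0], key)) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());  // first two bytes are always 89 80
    }
}
```

An empty ping is only six bytes on the wire, so it is easy to miss in a packet capture unless the capture is filtered on the WebSocket opcode.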

Hi @borrrden @pasin,

Thanks for your reply. I have already set keepalive_timeout to 360s and proxy_read_timeout to 3600s on the LB.
But SG and Couchbase Lite still disconnect after 3600s and then reconnect. The fundamental problem is that the LB filters out the Lite heartbeat.
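
The reasoning behind that conclusion can be sketched with a toy idle-timer model. It assumes (as a simplification, not a description of how Tencent CLB or NGINX actually count traffic) that a proxy restarts its idle timer whenever it observes a frame and closes the connection once the timer reaches the configured timeout. All names are hypothetical.

```java
final class IdleTimeoutModel {
    /**
     * Returns the second at which a proxy with the given idle timeout would close the
     * connection, or -1 if the connection survives until horizonSeconds.
     * A heartbeatSeconds value of 0 or less models heartbeats that never reach the proxy.
     */
    static long closedAt(long idleTimeoutSeconds, long heartbeatSeconds, long horizonSeconds) {
        long lastTraffic = 0;
        for (long t = 1; t <= horizonSeconds; t++) {
            if (heartbeatSeconds > 0 && t % heartbeatSeconds == 0) {
                lastTraffic = t;                       // heartbeat observed, idle timer restarts
            } else if (t - lastTraffic >= idleTimeoutSeconds) {
                return t;                              // idle for too long, proxy closes
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(closedAt(75, 300, 7200));   // default 75 s timeout vs 300 s heartbeat
        System.out.println(closedAt(3600, 300, 7200)); // heartbeats counted by the proxy
        System.out.println(closedAt(3600, 0, 7200));   // heartbeats never seen by the proxy
    }
}
```

Under this model a 3600 s timeout with a 300 s heartbeat never fires, while the same timeout with no visible heartbeat fires at exactly 3600 s. A disconnect at 3600 s is therefore consistent with the heartbeat not being counted by the LB, although the model cannot rule out other causes.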

Thanks again.

Best Regards

angular

Hi @borrrden @pasin,

I am still confused about why Couchbase Lite and SG often disconnect, and why Lite often cannot reconnect to SG.
Even when I connect directly to a server node, Lite still disconnects from SG. The Android logcat shows no errors, but SG logs this error:

2019-06-06T07:04:34.425Z [INF] WS: c:[592946d5] Error: receiveLoop exiting with WebSocket error: read tcp 172.18.0.3:4984->222.173.43.58:51929: read: connection reset by peer
2019-06-06T07:04:34.425Z [INF] WS: c:[592946d5] BLIP/Websocket Handler exited: read tcp 172.18.0.3:4984->222.173.43.58:51929: read: connection reset by peer
2019-06-06T07:04:34.425Z [INF] HTTP: c:[592946d5] #244533: → BLIP+WebSocket connection error: read tcp 172.18.0.3:4984->222.173.43.58:51929: read: connection reset by peer
2019-06-06T07:04:34.425Z [INF] HTTP: c:[592946d5] #244533: → BLIP+WebSocket connection closed
2019-06-06T07:04:34.425Z [INF] Changes: c:[592946d5] MultiChangesFeed done (to user1)

This is the Wireshark packet capture from when the Lite connection was dead and I then saved a new document:


I found this code on GitHub:

    public class CouchbaseLiteHttpClientFactory implements HttpClientFactory {
        private OkHttpClient client;
        private ClearableCookieJar cookieJar;
        private SSLSocketFactory sslSocketFactory;
        private HostnameVerifier hostnameVerifier;
        private boolean followRedirects = true;

        // deprecated
        public static int DEFAULT_SO_TIMEOUT_SECONDS = 40; // 40 sec (previously it was 5 min)
        // heartbeat value 30sec + 10 sec

        // OkHttp Default Timeout is 10 sec for all timeout settings
        public static int DEFAULT_CONNECTION_TIMEOUT_SECONDS = 10;
        public static int DEFAULT_READ_TIMEOUT = DEFAULT_SO_TIMEOUT_SECONDS;
        public static int DEFAULT_WRITE_TIMEOUT = 10;

Does this mean Lite sends a heartbeat to SG every 40 seconds?
With Wireshark I can capture the HTTP handshake between Lite and SG,
but after the connection is established I cannot capture any packets from Lite to SG.
Is this normal?
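
On the 40-second question: the constants in that excerpt are socket timeouts, and its own comment describes the 40 s value as the heartbeat (30 s) plus a 10 s margin. A read timeout only limits how long a read waits for incoming data; it does not send anything. A minimal loopback sketch with plain java.net sockets illustrates this. It is not Couchbase Lite code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {
    /** Returns true if a read on an idle loopback connection fails with a timeout. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(timeoutMillis);      // read timeout, comparable to SO_TIMEOUT
            try {
                client.getInputStream().read();      // the peer never writes, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                return true;                         // the read gave up; nothing was sent
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The timeout here governs how long the client waits for data from the other side. How often a heartbeat is sent is a separate setting.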
I use the official GitHub todo demo. The replicator code is as follows:

    Endpoint endpoint = new URLEndpoint(uri);
    ReplicatorConfiguration config = new ReplicatorConfiguration(database, endpoint)
            .setReplicatorType(ReplicatorConfiguration.ReplicatorType.PUSH_AND_PULL)
            .setContinuous(true);

The SG config is as follows:

    {
      "log": ["*"],
      "interface": ":4984",
      "adminInterface": "4985",
      "maxFileDescriptors": 250000,
      "databases": {
        "todo": {
          "server": "couchbase://cbsEE-6.0.1",
          "bucket": "todo",
          "username": "todo",
          "password": "123456",
          "enable_shared_bucket_access": true,
          "import_docs": true,
          "num_index_replicas": 0,
          "users": {
            "user1": {"password": "pass", "admin_channels": ["user1"]},
            "user2": {"password": "pass", "admin_channels": ["user2"]},
            "user3": {"password": "pass", "admin_channels": ["user3"]},
            "mod": {"password": "pass", "admin_roles": ["moderator"]},
            "admin": {"password": "pass", "admin_roles": ["admin"]}
          },
          "roles": {
            "moderator": {},
            "admin": {}
          },
          "sync": `
    function(doc, oldDoc){

Is there any error in it?

Another question: in version 2.5, does Lite send the heartbeat to SG, or does SG send the heartbeat to Lite?

Thank you for any pointers!

angular