LiteC issue on Windows: non-blocking socket operation error

Hello,

I am receiving the following error when attempting to begin replication. I am compiling couchbase-lite-C from the master branch using CMake.

This happens only when compiled and ran on Windows 10 – compiling and running the same code (and connecting to the same Websocket endpoint) on MacOS works just fine, no problems w/ database replication.

Any help is appreciated. Thanks in advance.

16:55:50.087897| [Sync]: {Repl#22}==> class litecore::repl::Replicator C:\Users\xxxxx\Desktop\xxxxx.cblite2\ → xxxxx:443/xxxxx/_blipsync @000001F74D4AD58816:55:50.088468| [Sync]: {Repl#22} Pull=continuous, Options={{}}
16:55:50.088659| [Sync]: {C4Replicator#23}==> class c4Internal::C4RemoteReplicator 000001F74D194250 @000001F74D194250
16:55:50.088876| [Sync]: {C4Replicator#23} Starting Replicator {Repl#22}
16:55:50.090899| [Sync]: {Repl#22} No local checkpoint ‘cp-2HRgYvzLOR9Sn1mQNvgNj1iyDhs=’
16:55:50.183987| [WS] WARNING: ClientSocket got POSIX error 140 "A non-blocking socket operation could not be completed immediately.
"
16:55:50.184435| [WS]: {BuiltInWebSocket#24}==> class litecore::websocket::BuiltInWebSocket xxxxx:443/xxxxx-x/_blipsync @000001F74D2F35C0
16:55:50.184781| [WS] WARNING: {BuiltInWebSocket#24} Unexpected or unclean socket disconnect! (reason=Network error, code=3)
16:55:50.185010| [Sync]: {Repl#22} Connection closed with Network error 3: “connection timed out” (state=1)16:55:50.185392| [Sync] ERROR: {Repl#22} Got LiteCore error: Network error 3 “connection timed out”

I’m guessing the error 140 is EWOULDBLOCK / EAGAIN, which we should be ignoring. Any idea, @borrrden?

@victorberger, would you mind trying the dev branch of couchbase-lite-c? It’s using a somewhat newer revision of couchbase-lite-core, which may have Windows-specific fixes. Let me know if it works any differently. Thanks.

@jens Thank you for your reply.

Using the “dev” branch:

  1. On MacOS, it caused SSL replications to only partially work (were working fine before). It seems that some of the contents are transferred but then the replications stops. These are the messages when it happens:

12:27:44.264699| [DB]: {DB#169} Opening database
12:27:44.267754| [Sync]: {Inserter#170}==> litecore::repl::Inserter ->wss://xxx.com:443/xxx-xxx/_blipsync @0x7ff8446605d8
12:27:44.267775| [Sync]: {Inserter#170} Inserted 9 revs in 3.20ms ( 2813/sec) of which 99.7% was commit
TLS: >>> BIO returning MBEDTLS_ERR_SSL_WANT_READ
TLS: >>> mbedtls_socket returning EWOULDBLOCK
12:27:44.330856| [WS]: {BuiltInWebSocket#171}==> litecore::websocket::BuiltInWebSocket wss://xxx-xxxx.com:443/xxxx/_blipsync @0x7ff84465f5f0
12:27:44.330906| [WS] WARNING: {BuiltInWebSocket#171} Unexpected or unclean socket disconnect! (reason=errno, code=0)
12:27:44.330989| [Sync]: {Repl#161} Connection closed with WebSocket status 1006: “” (state=2)
12:27:44.331069| [Sync] ERROR: {Repl#161} Got LiteCore error: WebSocket error 1006 “connection closed abnormally”

  1. On Windows, it seems to work correctly with “WS” endpoints, but not “WSS”. With WSS, I see the following message:

12:32:34.837887| [Sync]: {C4Replicator#87} Starting Replicator {Repl#86}
12:32:34.839104| [Sync]: {Repl#86} Scanning for pre-existing conflicts…
12:32:34.839239| [Sync]: {Repl#86} Found 0 conflicted docs in 0.001 sec
12:32:34.839834| [Sync]: {Repl#86} No local checkpoint ‘cp-iw9MseqhhVgzw1GeCMKoq+8UC48=’
12:32:35.063827| [WS] ERROR: mbedTLS(C): x509_verify_cert() returned -9984 (-0x2700)
TLS: mbedtls error -0x2700 from mbedtls_ssl_handshake: X509 - Certificate verification failed, e.g. CRL, CA or signature check failed
12:32:35.064661| [WS] ERROR: TCPSocket TLS handshake failed; cert verify status 0x08
12:32:35.064853| [WS]: {BuiltInWebSocket#88}==> class litecore::websocket::BuiltInWebSocket wss://x-xxxx.com:443/xxx/_blipsync @0000015D06ED3C00
12:32:35.065248| [WS] WARNING: {BuiltInWebSocket#88} Unexpected or unclean socket disconnect! (reason=Network error, code=11)
12:32:35.065671| [Sync]: {Repl#86} Connection closed with Network error 11: “The certificate is not correctly signed by the trusted CA” (state=1)
12:32:35.066091| [Sync] ERROR: {Repl#86} Got LiteCore error: Network error 11 “The certificate is not correctly signed by the trusted CA”

Hello, are there any updates on this ?

Well on Windows the error seems pretty straightforward. We don’t have a lot of time at the moment to spend on C so it will probably be a good amount of time before this gets addressed, assuming we can replicate it.

CBL-C uses mbedTLS, and mbedTLS does not come with the platform-specific code needed to read the set of trusted root certificates. I wrote some code for macOS and most Unices, and @borrrden added code for Windows. This is in the file mbedtls_context.cpp.

It sounds like either the Windows code is not working, or else CBL-C doesn’t yet include a new enough version of LiteCore including the Windows support.

Thank you for the replies.

Regarding the certificate error on Windows, I printed out the output from the read_system_root_certs() function on MacOS and on Windows, and found that the MacOS implementation returns regular looking certificate strings (with -----BEGIN CERTIFICATE----- blocks). On Windows, the function returns a string with seemingly gibberish characters (as if the certificate was encoded at that point). I do not know if this is a bug, or normal behavior.

Regarding the interruption found on MacOS, I now found that there’s more being printed out by the API:

10:38:55.456675| [Sync]: {Puller#168} Caught up with remote changes
10:38:55.478481| [DB]: {DB#174}==> litecore::SQLiteDataFile /Users/xxx.cblite2/db.sqlite3 @0x7f9353a648c0
10:38:55.478547| [DB]: {DB#174} Opening database
10:38:55.485327| [Sync]: {Inserter#175}==> litecore::repl::Inserter ->wss://xxx.com:443/xxx/_blipsync @0x7f9353957088
10:38:55.485364| [Sync]: {Inserter#175} Inserted   9 revs in   7.19ms ( 1252/sec) of which 99.8% was commit
TLS: >>> BIO returning MBEDTLS_ERR_SSL_WANT_READ
TLS: >>> mbedtls_socket returning EWOULDBLOCK
10:38:55.554829| [WS]: {BuiltInWebSocket#176}==> litecore::websocket::BuiltInWebSocket wss://xxx.com:443/xxx/_blipsync @0x7f934bcb5660
10:38:55.554862| [WS] WARNING: {BuiltInWebSocket#176} Unexpected or unclean socket disconnect! (reason=errno, code=0)
10:38:55.555156| [Sync]: {Repl#166} Connection closed with WebSocket status 1006: "" (state=2)
10:38:55.556379| [Sync] ERROR: {Repl#166} Got LiteCore error: WebSocket error 1006 "connection closed abnormally"

Thank you again

The fact that it’s binary is OK (it’s DER format, not PEM), but that implies the read_system_root_certs method is concatenating a bunch of DER together, and mbedTLS can’t parse that. It can parse concatenated PEM (ascii) certs, but only one DER.

@borrrden, does that sound like the problem? If it is, we’d either have to convert the Windows certs to PEM when concatenating, or refactor the code that calls read_system_root_certs to not require one concatenated string.

Hm. That does look like a problem down in our TLS glue code. Is this easily repeatable? I’m surprised it wouldn’t show up in our tests, since I run them on macOS all the time.

I don’t know, because Windows tests all pass on my machine so at least in some cases it must be doing something right. I’m guessing that the encoding type will need to be examined and handled. I’m in favor of not concatenating rather than trying to convert to PEM, but if it’s easy then maybe converting to PEM is fine.

As far as the other thing goes, I imagine you’d have to stress system resources for a while to get EWOULDBLOCK. It happens a lot more on Windows, but it is a valid case that needs to be handled.

Thank you for the replies @jens @borrrden and sorry for the late reply.

I tried this again today and ran into the same issues. It seems the problems may be coming from the TLS glue code and/or the new Websocket implementation.

All of this was found on MacOS, from building Lite-C from source code (dev branch) using CMake, and is easily reproducible:

  1. Sometimes but not always, a given TLS replication will be interrupted in the middle after partially completing, displaying the following error. In these cases, the replication never finishes.
    Other times, the same replication with the same exact parameters will finish completely. The DB being transferred is not large and system resources are not being strained.
14:43:27.625480| [DB]: {DB#106}==> litecore::SQLiteDataFile /Users/X/X.cblite2/db.sqlite3 @0x7fd77018b260
14:43:27.625514| [DB]: {DB#106} Opening database
14:43:27.629679| [Sync]: {Inserter#107}==> litecore::repl::Inserter ->wss://X:443/X-qa/_blipsync @0x7fd76dc70908
14:43:27.629706| [Sync]: {Inserter#107} Inserted  10 revs in   4.37ms ( 2287/sec) of which 99.7% was commit
TLS: >>> BIO returning MBEDTLS_ERR_SSL_WANT_READ
TLS: >>> mbedtls_socket returning EWOULDBLOCK
14:43:27.692882| [WS]: {BuiltInWebSocket#108}==> litecore::websocket::BuiltInWebSocket wss://X:443/X-qa/_blipsync @0x7fd76dc6a1a0
14:43:27.692905| [WS] WARNING: {BuiltInWebSocket#108} Unexpected or unclean socket disconnect! (reason=errno, code=0)
14:43:27.692958| [Sync]: {Repl#98} Connection closed with WebSocket status 1006: "" (state=2)
14:43:27.693015| [Sync] ERROR: {Repl#98} Got LiteCore error: WebSocket error 1006 "connection closed abnormally"
14:43:27.713996| [Sync]: {Inserter#107} Inserted   8 revs in   1.63ms ( 4906/sec) of which 99.3% was commit
14:43:27.726504| [Sync] ERROR: {C4Replicator#99} State: busy, progress=72.00%, error=WebSocket error 1006 "connection closed abnormally"
  1. If I try a TLS replication using any replication settings, however, replication does not start and the error is different. Here are examples using username/password authentication or replication for a specific channel. All of the endpoints being used are valid for sure.

Username/password:

14:45:03.302449| [Sync]: {Repl#127} Pull=continuous, Options={{auth:{password:"********", username:"X"}}}
14:45:03.302557| [Sync]: {C4Replicator#128}==> c4Internal::C4RemoteReplicator 0x7fd76fa1f6f0 @0x7fd76fa1f6f0
14:45:03.302574| [Sync]: {C4Replicator#128} Starting Replicator {Repl#127}
14:45:03.306847| [Sync]: {Repl#127} Scanning for pre-existing conflicts...
14:45:03.306888| [Sync]: {Repl#127} Found 0 conflicted docs in 0.004 sec
14:45:03.307861| [Sync]: {Repl#127} No local checkpoint 'cp-nEFf3hlcC8OkYXWTw6aTInjD0BQ='
14:45:03.904799| [WS]: {BuiltInWebSocket#129}==> litecore::websocket::BuiltInWebSocket wss://X:443/X/_blipsync @0x7fd76fa1faa0
14:45:03.904830| [WS] WARNING: {BuiltInWebSocket#129} Unexpected or unclean socket disconnect! (reason=WebSocket status, code=401)
14:45:03.904922| [Sync]: {Repl#127} Connection closed with WebSocket status 401: "(unknown HTTP status)" (state=1)
14:45:03.905021| [Sync] ERROR: {Repl#127} Got LiteCore error: WebSocket error 401 "(unknown HTTP status)"
14:45:03.905049| [Sync] ERROR: {C4Replicator#128} State: connecting, progress=0.00%, error=WebSocket error 401 "(unknown HTTP status)"

Specific channel:

14:47:03.952163| [Sync]: {Repl#137} Pull=continuous, Options={{channels:["X"]}}
14:47:03.952250| [Sync]: {C4Replicator#138}==> c4Internal::C4RemoteReplicator 0x7fd76dcbdff0 @0x7fd76dcbdff0
14:47:03.952264| [Sync]: {C4Replicator#138} Starting Replicator {Repl#137}
14:47:03.952473| [Sync]: {Repl#137} Scanning for pre-existing conflicts...
14:47:03.956809| [Sync]: {Repl#137} Found 0 conflicted docs in 0.004 sec
14:47:03.957219| [Sync]: {Repl#137} No local checkpoint 'cp-GMTjAMBpyDuxd0rl7szz4LWh0vw='
14:47:04.505299| [WS]: {BuiltInWebSocket#139}==> litecore::websocket::BuiltInWebSocket wss://X:443/X/_blipsync @0x7fd76dcd4720
14:47:04.505324| [WS] WARNING: {BuiltInWebSocket#139} Unexpected or unclean socket disconnect! (reason=WebSocket status, code=401)
14:47:04.505369| [Sync]: {Repl#137} Connection closed with WebSocket status 401: "(unknown HTTP status)" (state=1)
14:47:04.505419| [Sync] ERROR: {Repl#137} Got LiteCore error: WebSocket error 401 "(unknown HTTP status)"
14:47:04.505464| [Sync] ERROR: {C4Replicator#138} State: connecting, progress=0.00%, error=WebSocket error 401 "(unknown HTTP status)"

Please let me know how I can clarify anything, thanks in advance again.
Victor Berger
ON Semiconductor

Well we will keep an eye out for this certainly. The good news is that this implementation is not specific to C and will get tested implicitly as it gets into the other products (C#, Java, Objective-C) for the 2.8.0 release. I’m almost sure we will start hitting this the more testing we do because it deals with annoying socket peculiarities.

The username / password ones are just an HTTP 401, which means that it is considering the username / password combination that it received to be invalid.

Thanks for re-testing. I’ve filed an issue on Github; let’s continue the discussion there.