A few days ago we experienced a strange failure of a Couchbase node, and we have no idea what the root cause was or how to avoid it in the future. Maybe someone has encountered the same problem or can advise on it.
We are running a 3-node cluster of Couchbase CE 6.5 (6.5.1-6299 according to the crash log) on Ubuntu 18. The failure happened after a partial outage at our cloud provider, which took down one of the three nodes (which is fine and expected). Very shortly afterwards (~1 min), another node failed. This node was in a different availability zone from the one affected by the cloud provider's issues, so I’m pretty sure the outage itself was not the root cause. The logs in the Couchbase Web Console show this message:
Service 'memcached' exited with status 134. Restarting. Messages:
2021-06-10T20:21:40.107880+00:00 CRITICAL /opt/couchbase/lib/libstdc++.so.6() [0x7f4a3d82d000+0x971e3]
2021-06-10T20:21:40.107889+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x5ec62]
2021-06-10T20:21:40.107893+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x5d90e]
2021-06-10T20:21:40.107897+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x187fe7]
2021-06-10T20:21:40.107900+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0xddce5]
2021-06-10T20:21:40.107904+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x1305bc]
2021-06-10T20:21:40.107907+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x1314f9]
2021-06-10T20:21:40.107909+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x12afe4]
2021-06-10T20:21:40.107912+00:00 CRITICAL /opt/couchbase/lib/libplatform_so.so.0.1.0() [0x7f4a3f8e8000+0x95c7]
2021-06-10T20:21:40.107916+00:00 CRITICAL /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f4a3d059000+0x76db]
2021-06-10T20:21:40.107948+00:00 CRITICAL /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f4a3cc68000+0x12171f]
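A side note on the exit status, as a minimal sketch: many shells and process supervisors report a process killed by signal N as status 128 + N. Assuming the status shown in the Web Console follows that convention (an assumption on my part, not something the logs confirm), 134 would map to signal 6, which is SIGABRT on Linux. The helper name below is hypothetical.

```python
import signal

def signal_from_exit_status(status):
    """Return the signal number if the status follows the 128 + N
    convention, otherwise None. This is only a convention; the
    supervisor that reported the status may encode it differently."""
    return status - 128 if status > 128 else None

signum = signal_from_exit_status(134)
print(signum)                    # candidate signal number
print(signum == signal.SIGABRT)  # compare with the platform's SIGABRT
```

That reading would be consistent with the abort() frame in the memcached backtrace further down.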
After some time, memcached recovered automatically, but it seems to have taken an unreasonably long time to load the buckets (names and addresses changed for the purpose of this post):
Bucket "xxx" loaded on node 'ns_1@example.com' in 687 seconds.
Bucket "yyy" loaded on node 'ns_1@example.com' in 1256 seconds.
This is especially strange for one of the buckets (named “xxx” here), which stores only a few thousand documents. Even a full instance restart takes far less time than this. All in all, it took over 20 minutes for this node to be functional again.
The memcached logs don’t tell us anything useful either:
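For reference, this is roughly how the load times can be pulled out of the Web Console messages and converted to minutes. It is only a sketch: the regular expression is written against the two lines quoted above, and the function name is made up for this post.

```python
import re

# Pattern written against the two "Bucket ... loaded" lines quoted above.
LOAD_RE = re.compile(r'Bucket "([^"]+)" loaded on node \'([^\']+)\' in (\d+) seconds\.')

def bucket_load_times(lines):
    """Map bucket name -> load time in seconds for matching log lines."""
    times = {}
    for line in lines:
        match = LOAD_RE.search(line)
        if match:
            times[match.group(1)] = int(match.group(3))
    return times

log_lines = [
    "Bucket \"xxx\" loaded on node 'ns_1@example.com' in 687 seconds.",
    "Bucket \"yyy\" loaded on node 'ns_1@example.com' in 1256 seconds.",
]
times = bucket_load_times(log_lines)
for bucket, seconds in times.items():
    print(f"{bucket}: {seconds} s = {seconds / 60:.1f} min")
```

The slowest bucket determines how long the node stays unavailable, which is where the figure of over 20 minutes comes from.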
2021-06-10T20:21:38.449520+00:00 ERROR 112: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 10.xx.xx.xx:36822 - 10.xx.xx.xx:11210 (<ud>yyy</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":1219,"cas":0,"datatype":"raw","extlen":1,"key":"<ud>htp741201749_a935061c-f36e-4d46-b42a-3eb67203244a</ud>","keylen":49,"magic":"ClientRequest","opaque":2736820230,"opcode":"SUBDOC_MULTI_MUTATION","vbucket":362},"refcount":0}] - closing connection ([ 10.xx.xx.xx:36822 - 10.xx.xx.xx:11210 (<ud>yyy</ud>) ]): CheckpointManager::queueDirty: lastBySeqno not in snapshot range. vb:362 state:active snapshotStart:6839710 lastBySeqno:6839707 snapshotEnd:6839707 genSeqno:Yes checkpointList.size():1
2021-06-10T20:21:38.766857+00:00 CRITICAL *** Fatal error encountered during exception handling ***
2021-06-10T20:21:38.766900+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): snapshot_range_t(6839710,6839707) requires start <= end
2021-06-10T20:21:39.341749+00:00 WARNING (keystore) Slow runtime for 'DurabilityTimeoutVisitor on vb:369' on thread nonIO_worker_1: 381 ms
2021-06-10T20:21:40.107520+00:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.1-6299). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/5e26e9d2-d4d1-594c-76613b6f-55de0661.dmp before terminating.
2021-06-10T20:21:40.107539+00:00 CRITICAL Stack backtrace of crashed thread:
2021-06-10T20:21:40.107648+00:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x1338ed]
2021-06-10T20:21:40.107658+00:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ce) [0x400000+0x14b9ae]
2021-06-10T20:21:40.107662+00:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0x94) [0x400000+0x14bcc4]
2021-06-10T20:21:40.107669+00:00 CRITICAL /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f4a3d059000+0x12980]
2021-06-10T20:21:40.107695+00:00 CRITICAL /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7) [0x7f4a3cc68000+0x3efb7]
2021-06-10T20:21:40.107716+00:00 CRITICAL /lib/x86_64-linux-gnu/libc.so.6(abort+0x141) [0x7f4a3cc68000+0x40921]
2021-06-10T20:21:40.107766+00:00 CRITICAL /opt/couchbase/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f4a3d82d000+0x99165]
2021-06-10T20:21:40.107775+00:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x1472a2]
2021-06-10T20:21:40.107809+00:00 CRITICAL /opt/couchbase/lib/libstdc++.so.6() [0x7f4a3d82d000+0x96f56]
2021-06-10T20:21:40.107845+00:00 CRITICAL /opt/couchbase/lib/libstdc++.so.6() [0x7f4a3d82d000+0x96fa1]
2021-06-10T20:21:40.107880+00:00 CRITICAL /opt/couchbase/lib/libstdc++.so.6() [0x7f4a3d82d000+0x971e3]
2021-06-10T20:21:40.107889+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x5ec62]
2021-06-10T20:21:40.107893+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x5d90e]
2021-06-10T20:21:40.107897+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x187fe7]
2021-06-10T20:21:40.107900+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0xddce5]
2021-06-10T20:21:40.107904+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x1305bc]
2021-06-10T20:21:40.107907+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x1314f9]
2021-06-10T20:21:40.107909+00:00 CRITICAL /opt/couchbase/lib/ep.so() [0x7f4a382ad000+0x12afe4]
2021-06-10T20:21:40.107912+00:00 CRITICAL /opt/couchbase/lib/libplatform_so.so.0.1.0() [0x7f4a3f8e8000+0x95c7]
2021-06-10T20:21:40.107916+00:00 CRITICAL /lib/x86_64-linux-gnu/libpthread.so.0() [0x7f4a3d059000+0x76db]
2021-06-10T20:21:40.107948+00:00 CRITICAL /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f4a3cc68000+0x12171f]
2021-06-10T20:21:40.109501+00:00 INFO ---------- Closing logfile
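The one thing I can read out of these logs is that the snapshot range reported for vb:362 is inverted (start greater than end), which matches the "requires start <= end" text in the unhandled exception. Below is a minimal sketch for extracting those numbers from the error message. The field names are taken from the log line itself; the function names are hypothetical and this is not a Couchbase tool.

```python
import re

# Tail of the ERROR line from memcached.log quoted above.
MESSAGE = (
    "CheckpointManager::queueDirty: lastBySeqno not in snapshot range. "
    "vb:362 state:active snapshotStart:6839710 lastBySeqno:6839707 "
    "snapshotEnd:6839707 genSeqno:Yes checkpointList.size():1"
)

# Only the numeric fields that matter for the range check.
FIELD_RE = re.compile(r"\b(vb|snapshotStart|lastBySeqno|snapshotEnd):(\d+)")

def parse_snapshot_fields(message):
    """Extract the numeric vbucket and snapshot fields as integers."""
    return {name: int(value) for name, value in FIELD_RE.findall(message)}

def range_is_valid(fields):
    """Check the condition quoted in the exception text: start <= end."""
    return fields["snapshotStart"] <= fields["snapshotEnd"]

fields = parse_snapshot_fields(MESSAGE)
print(fields)
print("range valid:", range_is_valid(fields))
print("start exceeds end by:", fields["snapshotStart"] - fields["snapshotEnd"])
```

Running the same check over all such lines in memcached.log should show whether other vbuckets were affected, although it does not explain how the range ended up inverted.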
I found another topic here that describes a similar error occurring under different circumstances. Unfortunately, it was left without an answer: “Service 'memcached' exited with status 134. Restarting - keeps happening”
If anyone has an idea of how we can investigate this issue further, please advise.
Thanks.