Libcouchbase client crashes when stopping server

We are experiencing crashes of the libcouchbase C client on a dual-node cluster when one of the Couchbase servers is stopped while the application is accessing the bucket(s). Our application uses get-locked operations, which the client does not handle when processing network or timeout errors. See also issue CCBC-274.
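
For reference, the get-locked operations are ordinary lcb_get() calls with the lock flag set, released again with lcb_unlock(). A simplified sketch against the 2.x API (not our actual code; the key handling and the 5-second lock time are just examples):

    #include <string.h>
    #include <libcouchbase/couchbase.h>

    /* Acquire a pessimistic lock: sends GET_LOCKED (opcode 0x94) on the wire. */
    static lcb_error_t get_locked(lcb_t instance, const char *key)
    {
        lcb_get_cmd_t cmd;
        const lcb_get_cmd_t *cmds[1] = { &cmd };

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = key;
        cmd.v.v0.nkey = strlen(key);
        cmd.v.v0.lock = 1;       /* get-and-lock instead of a plain get */
        cmd.v.v0.exptime = 5;    /* lock timeout in seconds */
        return lcb_get(instance, NULL, 1, cmds);
    }

    /* Release the lock with the CAS delivered by the locked get's callback. */
    static lcb_error_t unlock_key(lcb_t instance, const char *key, lcb_cas_t cas)
    {
        lcb_unlock_cmd_t cmd;
        const lcb_unlock_cmd_t *cmds[1] = { &cmd };

        memset(&cmd, 0, sizeof(cmd));
        cmd.v.v0.key = key;
        cmd.v.v0.nkey = strlen(key);
        cmd.v.v0.cas = cas;
        return lcb_unlock(instance, NULL, 1, cmds);
    }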

In CCBC-274 I have also provided a patch to handle get-lock (and unlock, which was also missing); this improves the situation a lot.

However, yesterday we had a crash again when the CB server was stopped on one of the servers:

Program terminated with signal 6, Aborted.
(gdb) where
#0 0x00007f7407bbf8e5 in raise () from /lib64/libc.so.6
#1 0x00007f7407bc10c5 in abort () from /lib64/libc.so.6
#2 0x00007f740997269a in ringbuffer_consumed (buffer=<value optimized out>, nb=41) at src/ringbuffer.c:263
#3 0x00007f7409973c5c in lcb_purge_single_server (server=0x7f73b0002208, error=LCB_ETIMEDOUT) at src/server.c:338
#4 0x00007f74099758d1 in lcb_server_timeout_handler (sock=<value optimized out>, which=<value optimized out>, arg=<value optimized out>) at src/timeout.c:38
#5 0x00007f7409b8cb44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#6 0x000000000046e476 in con_thr (arg=0x1471218) at src/store_couchbase.c:2659
#7 0x00007f7407f27851 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f7407c7594d in clone () from /lib64/libc.so.6
(gdb) up
#1 0x00007f7407bc10c5 in abort () from /lib64/libc.so.6
(gdb) up
#2 0x00007f740997269a in ringbuffer_consumed (buffer=<value optimized out>, nb=41) at src/ringbuffer.c:263
263 src/ringbuffer.c: No such file or directory.
in src/ringbuffer.c
(gdb) p n
$2 = 0
(gdb) up
#3 0x00007f7409973c5c in lcb_purge_single_server (server=0x7f73b0002208, error=LCB_ETIMEDOUT) at src/server.c:338
338 src/server.c: No such file or directory.
in src/server.c
(gdb) p req
$1 = {request = {magic = 128 '\200', opcode = 148 '\224', keylen = 3328, extlen = 4 '\004', datatype = 0 '\000', vbucket = 52225, bodylen = 285212672, opaque = 1347961, cas = 0},
bytes = "\200\224\000\r\004\000\001\314\000\000\000\021y\221\024\000\000\000\000\000\000\000\000"}


336         ringbuffer_consumed(stream, packetsize);
337         if (mirror) {
338             ringbuffer_consumed(mirror, packetsize);
339         }

So the crash is caused by the following abort in ringbuffer_consumed():

lcb_size_t n = ringbuffer_read(buffer, NULL, nb);
if (n != nb) {
    abort();
}

(n = 0, nb = 41)
Apparently it is possible for ringbuffer_read() to return fewer bytes than requested, and that case should not trigger an abort.
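
For what it's worth, the nb value of 41 matches the request header dumped in frame #3 (lcb_purge_single_server): opcode 148 is 0x94 (GET_LOCKED), the header fields are still in network byte order, and lcb_purge_single_server() computes the packet size as sizeof(header) + ntohl(bodylen). A quick standalone check with the values from the gdb output above (assuming a little-endian host like the x86_64 machine in the backtrace):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* raw field values copied from the core dump; the struct stores
         * them in network byte order, hence the odd-looking host values */
        uint16_t keylen  = ntohs(3328);        /* 0x0D00 -> 13 */
        uint8_t  extlen  = 4;
        uint32_t bodylen = ntohl(285212672);   /* 0x11000000 -> 17 = extlen + keylen */
        uint32_t packetsize = 24 + bodylen;    /* 24-byte memcached binary header */

        printf("keylen=%u extlen=%u bodylen=%u packetsize=%u\n",
               (unsigned)keylen, (unsigned)extlen,
               (unsigned)bodylen, (unsigned)packetsize);  /* packetsize == 41 == nb */
        return 0;
    }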

Is there any effort ongoing to improve the stability of the libcouchbase C client, especially for failover situations?

CentOS 6.4, couchbase-server-2.1.1, libcouchbase2-2.0.6-2 (2.0.6-1 with patch CCBC-274)
(libcouchbase 2.1.3 behaves exactly the same)

The issue has been fixed in this patch: http://review.couchbase.org/29344

Well, that fixes CCBC-274, but the abort in ringbuffer_consumed() still occurs from time to time.

I’ve created issue CCBC-275 for this.

Interesting, I’ve been having the same crash for a little while. I observed it first on 2.0.6; when I first posted about it I was recommended to move to 2.0.7. That did fix a different crash for me, but the assert you’ve seen came back.

Here is my 2.0.7 backtrace:

#0 0x00799410 in __kernel_vsyscall ()
#1 0x00875df0 in raise () from /lib/libc.so.6
#2 0x00877701 in abort () from /lib/libc.so.6
#3 0x082a8476 in ringbuffer_consumed (buffer=0x9940d4ac, nb=82) at src/ringbuffer.c:271
#4 0x082a8796 in lcb_purge_single_server (server=0x9940d448, error=LCB_ETIMEDOUT) at src/server.c:331
#5 0x082aaef7 in lcb_server_timeout_handler (sock=-1, which=1, arg=0x9940d448) at src/timeout.c:38
#6 0x082b5ce8 in event_process_active (base=0x6f161000) at event.c:385
#7 0x082b5f3d in event_base_loop (base=0x6f161000, flags=0) at event.c:525
#8 0x082b55d1 in lcb_io_run_event_loop (iops=0x0) at plugins/io/libevent/plugin-libevent.c:320
#9 0x082ab860 in lcb_wait (instance=0x96d100c8) at src/wait.c:67
#10 0x082a9f83 in lcb_synchandler_return (instance=0x96d100c8, retcode=LCB_SUCCESS) at src/synchandler.c:263
#11 0x082a29f2 in lcb_arithmetic (instance=0x96d100c8, command_cookie=0x12df8690, num=1, items=0xa8224f28) at src/arithmetic.c:40

I’d been having trouble working out the cause, so eventually I wrote a function to walk the buffers in the same way as lcb_purge_single_server() but without making any modifications; it is pasted below.

I called this from some places I deemed interesting; after a few iterations this brought me as close as I could get with this method.

I had the following at the bottom of lcb_arithmetic():

    TRACE_ARITHMETIC_BEGIN(&req, key, nkey, delta, initial,
                           create ? exp : (lcb_time_t)0xffffffff);
    lcb_check_single_server(server);
    lcb_server_start_packet(server, command_cookie, req.bytes,
                            sizeof(req.bytes));
    lcb_server_write_packet(server, key, nkey);
    lcb_server_end_packet(server);
    lcb_server_send_packets(server);
    lcb_check_single_server(server);

I could see the assert() triggering in the second call to lcb_check_single_server().

I added some more code to grab the state of the buffers before the functions:

lcb_server_start_packet()
lcb_server_write_packet()
lcb_server_end_packet()
lcb_server_send_packets()
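
The dump code itself is nothing special; roughly the following, printing the ringbuffer_t fields (a sketch, not the exact code I used, dropped into src/ so the internal headers are available):

    #include <stdio.h>

    static void dump_ringbuffer(const char *name, const ringbuffer_t *rb)
    {
        fprintf(stderr, "%s = {root = %p, read_head = %p, write_head = %p, "
                "size = %lu, nbytes = %lu}\n",
                name, (void *)rb->root, (void *)rb->read_head,
                (void *)rb->write_head,
                (unsigned long)rb->size, (unsigned long)rb->nbytes);
    }

    static void dump_server_buffers(lcb_server_t *server)
    {
        dump_ringbuffer("output", &server->output);
        dump_ringbuffer("cmd_log", &server->cmd_log);
        dump_ringbuffer("output_cookies", &server->output_cookies);
        dump_ringbuffer("pending", &server->pending);
        dump_ringbuffer("pending_cookies", &server->pending_cookies);
        dump_ringbuffer("input", &server->input);
        fprintf(stderr, "connected = %d\n", server->connected);
    }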

The state of the buffers at each of these points was:

before lcb_server_start_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b8617e, size = 256, bytes = 82},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, bytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca1fd8, size = 128, bytes = 0},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd750, size = 128, bytes = 0},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, bytes = 0},
connected = 0

before lcb_server_write_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861aa, size = 256, bytes = 126},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, bytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2004, size = 128, bytes = 44},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, bytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, bytes = 0},
connected = 0

before lcb_server_end_packet()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861ca, size = 256, bytes = 158},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, bytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2024, size = 128, bytes = 76},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, bytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, bytes = 0},
connected = 0

before lcb_server_send_packets()

output = {root = 0xa9bfaba0, read_head = 0xa9bfaba0, write_head = 0xa9bfaba0, size = 128, nbytes = 0},
cmd_log = {root = 0xa4b860e0, read_head = 0xa4b8612c, write_head = 0xa4b861ca, size = 256, bytes = 158},
output_cookies = {root = 0x7ba8dc20, read_head = 0x7ba8dc20, write_head = 0x7ba8dc34, size = 128, bytes = 20},
pending = {root = 0xaaca1fd8, read_head = 0xaaca1fd8, write_head = 0xaaca2024, size = 128, bytes = 76},
pending_cookies = {root = 0x751dd750, read_head = 0x751dd750, write_head = 0x751dd764, size = 128, bytes = 20},
input = {root = (nil), read_head = (nil), write_head = (nil), size = 0, bytes = 0},
connected = 0

I think it’s the initial state of the buffers that looks unexpected, given that server->connected = 0. That’s pretty much as far as I’ve gotten with this today, and I’m trying to think about where to go from here. That’s work for next week; in the meanwhile I’m very interested in any further thoughts or suggestions on this.

Here is my checker function:

void lcb_check_single_server(lcb_server_t *server)
{
    /* Walks the command/cookie buffers the same way lcb_purge_single_server()
     * does, but on local copies, so the server state is not modified.
     * root and now are kept to mirror the original function even though
     * they are unused here. */
    protocol_binary_request_header req;
    struct lcb_command_data_st ct;
    lcb_size_t nr;
    char *packet;
    lcb_size_t packetsize;
    lcb_t root = server->instance;
    ringbuffer_t rest;
    ringbuffer_t stream_copy = server->cmd_log;
    ringbuffer_t *stream = &stream_copy;
    ringbuffer_t cookies_copy;
    ringbuffer_t *cookies;
    ringbuffer_t *mirror = NULL; /* mirror buffer should be purged with main stream */
    ringbuffer_t mirror_copy;
    ringbuffer_t output_copy;
    lcb_size_t send_size = ringbuffer_get_nbytes(&server->output);
    lcb_size_t stream_size = ringbuffer_get_nbytes(stream);
    hrtime_t now = gethrtime();

    if (server->connected) {
        cookies_copy = server->output_cookies;
        cookies = &cookies_copy;
    } else {
        /* not connected: packets sit in "pending", which is consumed in
         * lockstep with cmd_log as a mirror */
        cookies_copy = server->pending_cookies;
        cookies = &cookies_copy;
        mirror_copy = server->pending;
        mirror = &mirror_copy;
    }

    output_copy = server->output;
    assert(ringbuffer_initialize(&rest, 1024));

    do {
        nr = ringbuffer_peek(cookies, &ct, sizeof(ct));
        if (nr != sizeof(ct)) {
            break;
        }
        nr = ringbuffer_peek(stream, req.bytes, sizeof(req));
        if (nr != sizeof(req)) {
            break;
        }
        packetsize = (lcb_uint32_t)sizeof(req) + ntohl(req.request.bodylen);
        if (stream->nbytes < packetsize) {
            break;
        }

        ringbuffer_consumed(cookies, sizeof(ct));
        assert(nr == sizeof(req));
        packet = stream->read_head;

        if (server->connected && stream_size > send_size && (stream_size - packetsize) < send_size) {
            /* Copy the rest of the current packet into the
               temporary stream */

            lcb_size_t nbytes = packetsize - (stream_size - send_size);
            assert(ringbuffer_memcpy(&rest,
                                     &output_copy,
                                     nbytes) == 0);
            ringbuffer_consumed(&output_copy, nbytes);
            send_size -= nbytes;
        }
        stream_size -= packetsize;

        ringbuffer_consumed(stream, packetsize);
        if (mirror) {
            ringbuffer_consumed(mirror, packetsize);
        }

    } while (1); /* CONSTCOND */

    ringbuffer_destruct(&rest);
}

OK, looking at the patch to fix this issue:

 void ringbuffer_consumed(ringbuffer_t *buffer, lcb_size_t nb)
 {
     lcb_size_t n = ringbuffer_read(buffer, NULL, nb);
-    if (n != nb) {
-        abort();
-    }
 }

Is this considered the final solution for the issue, or more of a workaround?