One of my 6 servers went down while I was removing 3 of the 6 nodes and rebalancing the cluster.
This happens quite frequently during rebalancing, and not just on one specific server.
If I wait for the failed server to come back up (about 1 hour) and try rebalancing again, another server goes down with the same error.
I got the following error from the failed server:
Control connection to memcached on 'ns_1@10.10.36.122' disconnected: {{badmatch,{error,timeout}},
 [{mc_client_binary,cmd_vocal_recv,5,[{file,"src/mc_client_binary.erl"},{line,151}]},
  {mc_client_binary,select_bucket,2,[{file,"src/mc_client_binary.erl"},{line,346}]},
  {ns_memcached,ensure_bucket,2,[{file,"src/ns_memcached.erl"},{line,1269}]},
  {ns_memcached,handle_info,2,[{file,"src/ns_memcached.erl"},{line,744}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
  {ns_memcached,init,1,[{file,"src/ns_memcached.erl"},{line,171}]},
  {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
The rebalance then aborted after the following message:
Port server memcached on node 'babysitter_of_ns_1@127.0.0.1' exited with status 134. Restarting. Messages:
Fri Jan 23 16:48:26.112455 KST 3: (xxx) TAP (Consumer) eq_tapq:anon_310 - disconnected
Fri Jan 23 16:48:26.361575 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Schedule the backfill for vbucket 206
Fri Jan 23 16:48:26.361700 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0
Fri Jan 23 16:48:26.361717 KST 3: (xxx) TAP (Producer) eq_tapq:replication_ns_1@10.10.36.244 - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 206
asssertion failed [bySeqno >= 0] at /home/buildbot/buildbot_slave/ubuntu-1204-x64-301-builder/build/build/ep-engine/src/item.h:346
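As far as I can tell, exit status 134 follows the common 128 + signal-number convention, which would make it signal 6 (SIGABRT on Linux). That would be consistent with memcached aborting on the failed assertion above. A minimal sketch of that arithmetic in Python, assuming that convention applies here; the helper name decode_exit_status is my own and not part of any Couchbase tooling:

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe an exit status using the common 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with code {status}"


# 134 - 128 = 6, which is SIGABRT on Linux.
print(decode_exit_status(134))
```

If that reading is right, the process is not being killed from outside; it is calling abort() itself when the bySeqno >= 0 check fails.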
This time, the failed server is not coming back up at all.
What could cause this issue?