Swap rebalance failed and now REST API and admin console are not responding

kirkbcb · June 18, 2015, 9:31pm

OK, I have a two-node 3.0.1 cluster with ~350M objects running on m3.2xlarge instances in EC2. I tried to do a swap rebalance with an equivalently sized node, and the rebalance failed with the errors below. Any help would be greatly appreciated. The console is unable to maintain a connection to any of the nodes after initially connecting, and the REST APIs appear to be inaccessible. Otherwise the cluster appears to function: applications connected to the cluster are able to do CRUD operations just fine. I just cannot do any admin ops, and I’m afraid to try anything else. Anything I can try?

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.21262.1279>,
{bulk_set_vbucket_state_failed,
[{'ns_1@aaa.compute-1.amazonaws.com',
{'EXIT',
{{{{case_clause,
{error,
{{{badmatch,{error,badarg}},
[{dcp_replicator,init,1,
[{file,"src/dcp_replicator.erl"},
{line,48}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},
{line,304}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},
{line,239}]}]},
{child,undefined,
'ns_1@bbb.compute-1.amazonaws.com',
{dcp_replicator,start_link,
['ns_1@bbb.compute-1.amazonaws.com',
"cdi-master-catalog"]},
temporary,60000,worker,
[dcp_replicator]}}}},
[{dcp_sup,start_replicator,2,
[{file,"src/dcp_sup.erl"},{line,78}]},
{dcp_sup,
'-set_desired_replications/2-lc$^2/1-2-',
2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{dcp_sup,set_desired_replications,2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{replication_manager,handle_call,3,
[{file,"src/replication_manager.erl"},
{line,130}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
{gen_server,call,
['replication_manager-cdi-master-catalog',
{change_vbucket_replication,1022,
'ns_1@bbb.compute-1.amazonaws.com'},
infinity]}},
{gen_server,call,
[{'janitor_agent-cdi-master-catalog',
'ns_1@aaa.compute-1.amazonaws.com'},
{if_rebalance,<0.20994.1279>,
{update_vbucket_state,1022,replica,
undefined,
'ns_1@bbb.compute-1.amazonaws.com'}},
infinity]}}}}]}}} ns_orchestrator002 ns_1@ccc.compute-1.amazonaws.com

<0.21008.1279> exited with {unexpected_exit,
{'EXIT',<0.21262.1279>,
{bulk_set_vbucket_state_failed,
[{'ns_1@aaa.compute-1.amazonaws.com',
{'EXIT',
{{{{case_clause,
{error,
{{{badmatch,{error,badarg}},
[{dcp_replicator,init,1,
[{file,"src/dcp_replicator.erl"},
{line,48}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},
{line,304}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},
{line,239}]}]},
{child,undefined,
'ns_1@bbb.compute-1.amazonaws.com',
{dcp_replicator,start_link,
['ns_1@bbb.compute-1.amazonaws.com',
"cdi-master-catalog"]},
temporary,60000,worker,
[dcp_replicator]}}}},
[{dcp_sup,start_replicator,2,
[{file,"src/dcp_sup.erl"},{line,78}]},
{dcp_sup,
'-set_desired_replications/2-lc$^2/1-2-',
2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{dcp_sup,set_desired_replications,2,
[{file,"src/dcp_sup.erl"},{line,55}]},
{replication_manager,handle_call,3,
[{file,"src/replication_manager.erl"},
{line,130}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]},
{gen_server,call,
['replication_manager-cdi-master-catalog',
{change_vbucket_replication,1022,
'ns_1@bbb.compute-1.amazonaws.com'},
infinity]}},
{gen_server,call,
[{'janitor_agent-cdi-master-catalog',
'ns_1@aaa.compute-1.amazonaws.com'},
{if_rebalance,<0.20994.1279>,
{update_vbucket_state,1022,replica,
undefined,
'ns_1@bbb.compute-1.amazonaws.com'}},
infinity]}}}}]}}} ns_vbucket_mover000 ns_1@ccc.compute-1.amazonaws.com
Bucket "cdi-master-catalog" rebalance appears to be swap rebalance ns_vbucket_mover000 ns_1@ccc.compute-1.amazonaws.com

The error log contains the following entry for web requests. It looks like a bad argument, but I can’t fathom how this could get screwed up.

[ns_server:error,2015-06-18T22:53:48.530,ns_1@ec2-52-6-104-153.compute-1.amazonaws.com:<0.29870.1318>:menelaus_web:loop:170]Server error during processing: ["web request failed",
{path,"/pools/default/saslBucketsStreaming"},
{type,error},
{what,badarg},
{trace,
[{erlang,integer_to_list,[undefined],},
{ns_bucket,
'-json_map_with_full_config/3-fun-0-',3,
[{file,"src/ns_bucket.erl"},{line,527}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{ns_bucket,json_map_with_full_config,3,
[{file,"src/ns_bucket.erl"},{line,519}]},
{menelaus_web_buckets,
'-handle_sasl_buckets_streaming/2-fun-1-',
3,
[{file,"src/menelaus_web_buckets.erl"},
{line,343}]},
{lists,map,2,
[{file,"lists.erl"},{line,1224}]},
{menelaus_web_buckets,
'-handle_sasl_buckets_streaming/2-fun-2-',
2,
[{file,"src/menelaus_web_buckets.erl"},
{line,329}]}]}]
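
To see whether every failing web request hits the same endpoint with the same reason, the error log can be summarized with a short script. The following is only a rough sketch in Python: the regular expressions are assumptions derived from the single "web request failed" entry quoted above, not from a documented log format, and the function name `summarize_web_failures` and the shortened node name in the sample are hypothetical.

```python
import re
from collections import Counter

# Patterns for the {path,"..."} and {what,...} tuples seen in the entry above.
# Assumption: other entries follow the same shape. Entries whose "what" is not
# a plain atom (for example a nested tuple) are simply skipped.
PATH_RE = re.compile(r'\{path,\s*"([^"]+)"\}')
WHAT_RE = re.compile(r'\{what,\s*([A-Za-z_]\w*)\}')

def summarize_web_failures(log_text):
    """Count (path, reason) pairs across 'web request failed' entries."""
    counts = Counter()
    # Each error entry starts with a "[ns_server:error," header.
    for entry in log_text.split("[ns_server:error,")[1:]:
        if "web request failed" not in entry:
            continue
        path = PATH_RE.search(entry)
        what = WHAT_RE.search(entry)
        if path and what:
            counts[(path.group(1), what.group(1))] += 1
    return counts

# Shortened sample modelled on the entry above (node name is a placeholder).
sample = '''[ns_server:error,2015-06-18T22:53:48.530,ns_1@host:<0.29870.1318>:menelaus_web:loop:170]Server error during processing: ["web request failed",
{path,"/pools/default/saslBucketsStreaming"},
{type,error},
{what,badarg},
{trace,[]}]'''

for (path, reason), n in summarize_web_failures(sample).items():
    print(f"{path}: {reason} x{n}")
```

If every counted entry shows the same path and `badarg`, that would suggest one piece of bad bucket configuration breaking all of those requests rather than a general web server problem.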

martinesmann · June 19, 2015, 10:03am

Hi @kirkbcb,
Some of the issues you are facing look like known bugs that were fixed in version 3.0.3. See the release notes for more details:
http://docs.couchbase.com/admin/admin/rel-notes/rel-notes3.0.html

But the errors you describe could also happen if the cluster nodes are running low on resources. From what I can read on the Amazon EC2 product page, each node has 8 cores, 30 GB RAM and 2 x 80 GB SSD.
Cores and RAM seem okay, but how much free disk space do you have on each node?
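
One quick way to answer the disk space question on each node is a small Python check like the sketch below. It is only an illustration: the function name `free_space_report` and the 20% threshold are hypothetical, and the path you pass in should be the node's actual Couchbase data directory (the example call uses `/` only so that it runs anywhere).

```python
import shutil

def free_space_report(path, min_free_fraction=0.2):
    """Report free disk space for the filesystem that contains `path`.

    min_free_fraction is an arbitrary illustrative threshold, not a
    Couchbase recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gb": round(usage.total / 1e9, 1),
        "free_gb": round(usage.free / 1e9, 1),
        "free_fraction": round(free_fraction, 3),
        "low": free_fraction < min_free_fraction,
    }

# Replace "/" with the node's data directory before relying on the result.
report = free_space_report("/")
print(report)
```

Running this on every node, including the new one being swapped in, shows whether any of them is close to full.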
