Hi,
I’m setting up a brand new cluster and repeatedly get errors when adding views. It has happened on both a 2-server cluster and a 4-server cluster.
Googling and searching the forums turns up similar issues. The problem appears after I publish views to one of the servers. The server I publish the views to is fine, but the other node, 172.31.44.247, immediately goes offline and shows high CPU. Rebalancing produces this error:
Rebalance exited with reason {badmatch,
{error,
{failed_nodes,['ns_1@172.31.44.247']}}}
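To confirm the node state from the command line (rather than just the UI), I check the cluster roughly like this; the host, user and password below are placeholders for my setup:
# list the nodes and their health as the cluster manager sees them
~$ couchbase-cli server-list -c 172.31.44.248:8091 -u Administrator -p <password>
# the bad node is the one reported as unhealthy here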
Examining the bad node further, I see its CPU is almost maxed out by the "beam.smp" process.
There is also an extraordinary number of connections in the TIME_WAIT state from localhost:<random port> to localhost:11209:
tcp 0 0 localhost:54337 localhost:11209 TIME_WAIT
tcp 0 0 localhost:36798 localhost:11209 TIME_WAIT
tcp 0 0 localhost:39596 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55549 localhost:11209 TIME_WAIT
tcp 0 0 localhost:51018 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55473 localhost:11209 TIME_WAIT
tcp 0 0 localhost:39469 localhost:11209 TIME_WAIT
tcp 0 0 localhost:36328 localhost:11209 TIME_WAIT
tcp 0 0 localhost:60915 localhost:11209 TIME_WAIT
tcp 0 0 localhost:38081 localhost:11209 TIME_WAIT
tcp 0 0 localhost:40668 localhost:11209 TIME_WAIT
tcp 0 0 localhost:51617 localhost:11209 TIME_WAIT
~$ netstat | grep TIME_WAIT | wc -l
6032
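To confirm that nearly all of those are aimed at the memcached port rather than something else, a filter along these lines works (a rough sketch; the awk fields assume the default netstat -ant column order):
# count only the TIME_WAIT sockets whose peer address ends in :11209
~$ netstat -ant | awk '$6 == "TIME_WAIT" && $5 ~ /:11209$/' | wc -l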
Currently the bucket is completely empty.
I thought I’d run strace on beam.smp to see what was going on.
What I’m seeing is a lot of “resource temporarily unavailable” errors.
It seems it’s trying to connect to something and failing over and over again; most likely port 11209, which according to the netstat -tulpn output is the memcached process.
strace on memcached doesn’t show anything special, just a lot of connection activity.
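For anyone wanting to reproduce the tracing, the invocations were along these lines (a sketch only; PIDs looked up with pgrep, and the exact flags may have differed):
# attach to the Erlang VM (beam.smp) and log only network-related syscalls
~$ strace -f -p "$(pgrep -o beam.smp)" -e trace=network -s 128 -o /tmp/beam.strace
# same again for memcached
~$ strace -f -p "$(pgrep -o memcached)" -e trace=network -s 128 -o /tmp/memcached.strace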
In case it’s significant, the cluster is running on CoreOS. I’ve raised the resource limits so that inside the Couchbase container ulimit -a shows:
[root@ip-172-31-44-248 /]# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 29972
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1048576
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
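For reference, this is roughly how the limits were raised when starting the container (a sketch only; the image name and networking mode are stand-ins for my actual run command):
# raise the open-file and process limits for the Couchbase container
# (host networking assumed here; "couchbase/server" is a stand-in for the image I run)
~$ docker run -d --name couchbase --net=host \
      --ulimit nofile=1048576:1048576 \
      --ulimit nproc=1048576:1048576 \
      couchbase/server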
I grabbed the logs from the failed server, and ns_server.debug.log shows something interesting:
==============================================================================
couchbase logs (debug.log)
cbbrowse_logs debug.log
==============================================================================
[error_logger:info,2015-05-22T0:14:26.370,ns_1@172.31.44.247:error_logger<0.6.0>
:ale_error_logger_handler:do_log:203]
=========================PROGRESS REPORT=========================
supervisor: {local,'single_bucket_sup-manager'}
started: [{pid,<0.16213.9>},
{name,{failover_safeness_level,"manager"}},
{mfargs,
{failover_safeness_level,start_link,["manager"]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]
[user:info,2015-05-22T0:14:26.371,ns_1@172.31.44.247:ns_memcached-manager<0.1618
3.9>:ns_memcached:handle_cast:697]Bucket "manager" loaded on node 'ns_1@172.31.4
4.247' in 0 seconds.
[ns_server:debug,2015-05-22T0:14:26.372,ns_1@172.31.44.247:<0.16179.9>:ns_pubsub
:do_subscribe_link:136]Parent process of subscription {ns_config_events,<0.16178
.9>} exited with reason {bad_return_value,
{error,
{error,
{dcp_socket_connect_failed,
nxdomain}}}}
[ns_server:info,2015-05-22T0:14:26.373,ns_1@172.31.44.247:ns_log<0.1616.0>:ns_lo
g:handle_cast:189]suppressing duplicate log ns_memcached:undefined([<<"Bucket \"
manager\" loaded on node 'ns_1@172.31.44.247' in 0 seconds.">>]) because it's be
en seen 31 times in the past 9.278875 secs (last seen 0.245073 secs ago
[error_logger:error,2015-05-22T0:14:26.377,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
crasher:
initial call: capi_set_view_manager:init/1
pid: <0.16178.9>
registered_name: 'capi_set_view_manager-manager'
exception exit: {bad_return_value,
{error,{error,{dcp_socket_connect_failed,nxdomain}}}}
in function gen_server:init_it/6 (gen_server.erl, line 332)
ancestors: ['single_bucket_sup-manager',<0.16176.9>]
messages: [replicate_newnodes_docs,
{'$gen_cast',
{replicated_update,
{doc,<<"_design/ManagerService">>,
{1,<<"ôïE½">>},
<<"{\"views\":{\"WorkerByInstance\":{\"map\":\"fun
ction map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\
n if (doc.type === \\\"worker\\\" && doc.status && doc.status !== \\\"deleted
\\\" && doc.instance_id) {\\r\\n var addr = doc.address + \\\":\\\" + doc.p
ort;\\r\\n emit([doc.instance_id, doc.workerstatus, doc.laststatus], addr);
\\r\\n }\\r\\n }\\r\\n}\"},\"StatsPoolById\":{\"reduce\":\"function reduce(k
eys, values, rereduce) {\\r\\n var f, v; // declare loop variables so that JSH
int doesn't complain...\\r\\n var ifields = [\\\"count\\\", \\\"starting\\\", \
\\"running\\\", \\\"stopped\\\"];\\r\\n var wfields = [\\\"count\\\", \\\"activ
e\\\", \\\"idle\\\"];\\r\\n var out = {};\\r\\n out.load = { min: 100, max: 0,
tot: 0, cnt: 0, avg: 0};\\r\\n out.instances = {};\\r\\n for (f in ifields) {
\\r\\n out.instances[ifields[f]] = { min: 10000, max: 0, tot: 0, cnt: 0, avg:
0};\\r\\n }\\r\\n out.workers = {};\\r\\n for (f in wfields) {\\r\\n out.
workers[wfields[f]] = { min: 10000, max: 0, tot: 0, cnt: 0, avg: 0};\\r\\n }\\r
\\n\\r\\n for (v in values) {\\r\\n if (values[v].load.min < out.load.min) {
\\r\\n out.load.min = values[v].load.min;\\r\\n }\\r\\n if (values[v]
.load.max > out.load.max) {\\r\\n out.load.max = values[v].load.max;\\r\\n
}\\r\\n out.load.tot += values[v].load.tot;\\r\\n out.load.cnt += value
s[v].load.cnt;\\r\\n\\r\\n for (f in ifields) {\\r\\n if (values[v].inst
ances[ifields[f]].min < out.instances[ifields[f]].min) {\\r\\n out.instan
ces[ifields[f]].min = values[v].instances[ifields[f]].min;\\r\\n }\\r\\n
if (values[v].instances[ifields[f]].max > out.instances[ifields[f]].max) {\\r
\\n out.instances[ifields[f]].max = values[v].instances[ifields[f]].max;\
\r\\n }\\r\\n out.instances[ifields[f]].tot += values[v].instances[ifi
elds[f]].tot;\\r\\n out.instances[ifields[f]].cnt += values[v].instances[if
ields[f]].cnt;\\r\\n }\\r\\n\\r\\n for (f in wfields) {\\r\\n if (val
ues[v].workers[wfields[f]].min < out.workers[wfields[f]].min) {\\r\\n out
.workers[wfields[f]].min = values[v].workers[wfields[f]].min;\\r\\n }\\r\\n
if (values[v].workers[wfields[f]].max > out.workers[wfields[f]].max) {\\r\
\n out.workers[wfields[f]].max = values[v].workers[wfields[f]].max;\\r\\n
}\\r\\n out.workers[wfields[f]].tot += values[v].workers[wfields[f]].
tot;\\r\\n out.workers[wfields[f]].cnt += values[v].workers[wfields[f]].cnt
;\\r\\n }\\r\\n }\\r\\n\\r\\n out.load.avg = Math.round((out.load.tot / out
.load.cnt) * 100) / 100;\\r\\n for (f in ifields) {\\r\\n out.instances[ifie
lds[f]].avg = Math.round((out.instances[ifields[f]].tot / out.instances[ifields[
f]].cnt) * 100) / 100;\\r\\n }\\r\\n for (f in wfields) {\\r\\n out.workers
[wfields[f]].avg = Math.round((out.workers[wfields[f]].tot / out.workers[wfields
[f]].cnt) * 100) / 100;\\r\\n }\\r\\n return out;\\r\\n}\",\"map\":\"function
map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n i
f (doc.type === \\\"statistic\\\" && doc.statistictype && doc.statistictype ===
\\\"pool\\\" && doc.entity_id) {\\r\\n var datetime = new Date(doc.timestam
p);\\r\\n var year = datetime.getFullYear();\\r\\n var month = 1 + dat
etime.getMonth();\\r\\n var day = datetime.getDate();\\r\\n var hour =
datetime.getUTCHours();\\r\\n var out = {};\\r\\n out.load = { min: d
oc.load, max: doc.load, tot: doc.load, cnt: 1, avg: doc.load };\\r\\n out.i
nstances = {};\\r\\n out.instances.count = { min: doc.instance.count, max:
doc.instance.count, tot: doc.instance.count, cnt: 1, avg: doc.instance.count };\
\r\\n out.instances.starting = { min: doc.instance.starting, max: doc.insta
nce.starting, tot: doc.instance.starting, cnt: 1, avg: doc.instance.starting };\
\r\\n out.instances.running = { min: doc.instance.running, max: doc.instanc
e.running, tot: doc.instance.running, cnt: 1, avg: doc.instance.running };\\r\\n
out.instances.stopped = { min: doc.instance.stopped, max: doc.instance.sto
pped, tot: doc.instance.stopped, cnt: 1, avg: doc.instance.stopped };\\r\\n
out.workers = {};\\r\\n out.workers.count = { min: doc.worker.count, max:
doc.worker.count, tot: doc.worker.count, cnt: 1, avg: doc.worker.count };\\r\\n
out.workers.active = { min: doc.worker.active, max: doc.worker.active, tot:
doc.worker.active, cnt: 1, avg: doc.worker.active };\\r\\n out.workers.idl
e = { min: doc.worker.idle, max: doc.worker.idle, tot: doc.worker.idle, cnt: 1,
avg: doc.worker.idle };\\r\\n emit([doc.entity_id, year, month, day, hour],
out);\\r\\n }\\r\\n }\\r\\n}\"},\"InstanceByPool\":{\"map\":\"function map(
doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (d
oc.type === \\\"instance\\\" && doc.status && doc.status !== \\\"deleted\\\") {\
\r\\n emit([doc.pool_id, doc.instancestatus, doc.laststarted], doc.external
id);\\r\\n }\\r\\n }\\r\\n}\"},\"WorkitemByJob\":{\"map\":\"function map(doc
, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.
type === \\\"workitem\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\
\n emit([doc.job_id, doc.itemstatus], doc.percentage);\\r\\n }\\r\\n }\
\r\\n}\"},\"InstanceByStatus\":{\"map\":\"function map(doc, meta) {\\r\\n if (m
eta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.type === \\\"instance\
\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n var addr = doc
.address + \\\":\\\" + doc.port;\\r\\n emit([doc.instancestatus, doc.workty
pe, doc.laststatus], addr);\\r\\n }\\r\\n }\\r\\n}\"},\"RegionByName\":{\"ma
p\":\"function map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.typ
e) {\\r\\n if (doc.type === \\\"region\\\" && doc.status && doc.status !== \\
\"deleted\\\") {\\r\\n emit([doc.name, doc.location, doc.provider], doc.add
ress);\\r\\n }\\r\\n }\\r\\n}\"},\"PoolByName\":{\"map\":\"function map(doc,
meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.t
ype === \\\"pool\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n
emit([doc.name, doc.instancetype, doc.vmtype, doc.imageid]);\\r\\n }\\r\\n
}\\r\\n}\"},\"WorkitemByStatus\":{\"map\":\"function map(doc, meta) {\\r\\n i
f (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.type === \\\"worki
tem\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n emit([doc.
itemstatus, doc.job_id], doc.worker_id);\\r\\n }\\r\\n }\\r\\n}\"},\"WorkerB
yStatus\":{\"map\":\"function map(doc, meta) {\\r\\n if (meta.type === \\\"json
\\\" && doc.type) {\\r\\n if (doc.type === \\\"worker\\\" && doc.status && do
c.status !== \\\"deleted\\\") {\\r\\n var addr = doc.address + \\\":\\\" +
doc.port;\\r\\n emit([doc.workerstatus, doc.worktype, doc.laststatus], addr
);\\r\\n }\\r\\n }\\r\\n}\"}}}">>,
0,false,[]}}},
{'$gen_call',
{<0.16199.9>,#Ref<0.0.7.180181>},
{foreach_doc,
#Fun<capi_ddoc_replication_srv.1.125277061>}},
replicate_newnodes_docs]
links: [<0.16179.9>,<0.16181.9>,<0.16182.9>,<0.16180.9>,<0.16177.9>]
dictionary: [{ddoc_replication_proxy,<0.16180.9>}]
trap_exit: true
status: running
heap_size: 4185
stack_size: 27
reductions: 5137
neighbours:
neighbour: [{pid,<0.16180.9>},
{registered_name,'capi_ddoc_replication_srv-manager'},
{initial_call,{erlang,apply,['Argument__1','Argument__2']}},
{current_function,{capi_ddoc_replication_srv,proxy_loop,1}},
{ancestors,['capi_set_view_manager-manager',
'single_bucket_sup-manager',<0.16176.9>]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,4},
{reductions,25}]
neighbour: [{pid,<0.16182.9>},
{registered_name,[]},
{initial_call,
{capi_set_view_manager,
'-spawn_ddoc_replicator/1-fun-0-',[]}},
{current_function,
{capi_set_view_manager,ddoc_replicator_loop,2}},
{ancestors,
['capi_set_view_manager-manager',
'single_bucket_sup-manager',<0.16176.9>]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,9},
{reductions,10}]
neighbour: [{pid,<0.16181.9>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{capi_set_view_manager,nodeup_monitoring_loop,1}},
{ancestors,[]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,2},
{reductions,5}]
[error_logger:error,2015-05-22T0:14:26.378,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]** Generic server 'couch_stats_reader-mana
ger' terminating
** Last message in was refresh_stats
** When Server state == {state,"manager",undefined,undefined}
** Reason for termination ==
** {{bad_return_value,{error,{error,{dcp_socket_connect_failed,nxdomain}}}},
{gen_server,call,
['capi_set_view_manager-manager',
{foreach_doc,#Fun<capi_ddoc_replication_srv.1.125277061>},
infinity]}}
[error_logger:error,2015-05-22T0:14:26.378,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]
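The dcp_socket_connect_failed,nxdomain reason in that crash report looks like a name-resolution failure, so one thing I’m now double-checking is whether the node hostnames resolve from inside each container, roughly like this (the container name and hostnames are examples from my setup):
# does each node's hostname resolve from inside the Couchbase container?
~$ docker exec couchbase getent hosts ip-172-31-44-247
~$ docker exec couchbase getent hosts ip-172-31-44-248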
Is this a bug, or could it be a setup issue? If it’s a bug, please let me know whether I should be logging it somewhere else and whether I can provide more information.