Hi,
I’m setting up a brand new cluster and repeatedly get errors when adding views. It has happened on both a 2-server cluster and a 4-server cluster.
Googling and searching the forums turns up similar issues. The problem appears after I publish views to one of the servers. The server I publish the views to is fine, but the other node, 172.31.44.247, immediately goes offline and shows high CPU. Rebalancing produces this error:
Rebalance exited with reason {badmatch,
{error,
{failed_nodes,['ns_1@172.31.44.247']}}}
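To confirm the node state from the command line (rather than just the UI), I check the cluster roughly like this; the host, user and password below are placeholders for my setup:
# list the nodes and their health as the cluster manager sees them
~$ couchbase-cli server-list -c 172.31.44.248:8091 -u Administrator -p <password>
# the bad node is the one reported as unhealthy here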
Examining the bad node further, I see its CPU is almost maxed out by the "beam.smp" process.
There is also an extraordinary number of connections in the TIME_WAIT state from localhost:<random port> to localhost:11209:
tcp 0 0 localhost:54337 localhost:11209 TIME_WAIT
tcp 0 0 localhost:36798 localhost:11209 TIME_WAIT
tcp 0 0 localhost:39596 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55549 localhost:11209 TIME_WAIT
tcp 0 0 localhost:51018 localhost:11209 TIME_WAIT
tcp 0 0 localhost:55473 localhost:11209 TIME_WAIT
tcp 0 0 localhost:39469 localhost:11209 TIME_WAIT
tcp 0 0 localhost:36328 localhost:11209 TIME_WAIT
tcp 0 0 localhost:60915 localhost:11209 TIME_WAIT
tcp 0 0 localhost:38081 localhost:11209 TIME_WAIT
tcp 0 0 localhost:40668 localhost:11209 TIME_WAIT
tcp 0 0 localhost:51617 localhost:11209 TIME_WAIT
~$ netstat | grep TIME_WAIT | wc -l
6032
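To confirm that nearly all of those are aimed at the memcached port rather than something else, a filter along these lines works (a rough sketch; the awk fields assume the default netstat -ant column order):
# count only the TIME_WAIT sockets whose peer address ends in :11209
~$ netstat -ant | awk '$6 == "TIME_WAIT" && $5 ~ /:11209$/' | wc -l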
Currently the bucket is completely empty.
I thought I’d run strace on beam.smp to see what was going on.
What I’m seeing is a lot of “resource temporarily unavailable” errors.
It seems it’s trying to connect to something and failing over and over again; most likely port 11209, which according to the netstat -tulpn output is the memcached process.
strace on memcached doesn’t show anything special, just a lot of connection activity.
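For anyone wanting to reproduce the tracing, the invocations were along these lines (a sketch only; PIDs looked up with pgrep, and the exact flags may have differed):
# attach to the Erlang VM (beam.smp) and log only network-related syscalls
~$ strace -f -p "$(pgrep -o beam.smp)" -e trace=network -s 128 -o /tmp/beam.strace
# same again for memcached
~$ strace -f -p "$(pgrep -o memcached)" -e trace=network -s 128 -o /tmp/memcached.strace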
In case it’s significant, the cluster is running on CoreOS. I’ve raised the resource limits so that inside the Couchbase container ulimit -a shows:
[root@ip-172-31-44-248 /]# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 29972
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1048576
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
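For reference, this is roughly how the limits were raised when starting the container (a sketch only; the image name and networking mode are stand-ins for my actual run command):
# raise the open-file and process limits for the Couchbase container
# (host networking assumed here; "couchbase/server" is a stand-in for the image I run)
~$ docker run -d --name couchbase --net=host \
      --ulimit nofile=1048576:1048576 \
      --ulimit nproc=1048576:1048576 \
      couchbase/server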
I grabbed the logs from the failed server, and ns_server.debug.log shows something interesting:
==============================================================================
couchbase logs (debug.log)
cbbrowse_logs debug.log
==============================================================================
[error_logger:info,2015-05-22T0:14:26.370,ns_1@172.31.44.247:error_logger<0.6.0>
:ale_error_logger_handler:do_log:203]
=========================PROGRESS REPORT=========================
supervisor: {local,'single_bucket_sup-manager'}
started: [{pid,<0.16213.9>},
{name,{failover_safeness_level,"manager"}},
{mfargs,
{failover_safeness_level,start_link,["manager"]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]
[user:info,2015-05-22T0:14:26.371,ns_1@172.31.44.247:ns_memcached-manager<0.1618
3.9>:ns_memcached:handle_cast:697]Bucket "manager" loaded on node 'ns_1@172.31.4
4.247' in 0 seconds.
[ns_server:debug,2015-05-22T0:14:26.372,ns_1@172.31.44.247:<0.16179.9>:ns_pubsub
:do_subscribe_link:136]Parent process of subscription {ns_config_events,<0.16178
.9>} exited with reason {bad_return_value,
{error,
{error,
{dcp_socket_connect_failed,
nxdomain}}}}
[ns_server:info,2015-05-22T0:14:26.373,ns_1@172.31.44.247:ns_log<0.1616.0>:ns_lo
g:handle_cast:189]suppressing duplicate log ns_memcached:undefined([<<"Bucket \"
manager\" loaded on node 'ns_1@172.31.44.247' in 0 seconds.">>]) because it's be
en seen 31 times in the past 9.278875 secs (last seen 0.245073 secs ago
[error_logger:error,2015-05-22T0:14:26.377,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
crasher:
initial call: capi_set_view_manager:init/1
pid: <0.16178.9>
registered_name: 'capi_set_view_manager-manager'
exception exit: {bad_return_value,
{error,{error,{dcp_socket_connect_failed,nxdomain}}}}
in function gen_server:init_it/6 (gen_server.erl, line 332)
ancestors: ['single_bucket_sup-manager',<0.16176.9>]
messages: [replicate_newnodes_docs,
{'$gen_cast',
{replicated_update,
{doc,<<"_design/ManagerService">>,
{1,<<"ôïE½">>},
<<"{\"views\":{\"WorkerByInstance\":{\"map\":\"fun
ction map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\
n if (doc.type === \\\"worker\\\" && doc.status && doc.status !== \\\"deleted
\\\" && doc.instance_id) {\\r\\n var addr = doc.address + \\\":\\\" + doc.p
ort;\\r\\n emit([doc.instance_id, doc.workerstatus, doc.laststatus], addr);
\\r\\n }\\r\\n }\\r\\n}\"},\"StatsPoolById\":{\"reduce\":\"function reduce(k
eys, values, rereduce) {\\r\\n var f, v; // declare loop variables so that JSH
int doesn't complain...\\r\\n var ifields = [\\\"count\\\", \\\"starting\\\", \
\\"running\\\", \\\"stopped\\\"];\\r\\n var wfields = [\\\"count\\\", \\\"activ
e\\\", \\\"idle\\\"];\\r\\n var out = {};\\r\\n out.load = { min: 100, max: 0,
tot: 0, cnt: 0, avg: 0};\\r\\n out.instances = {};\\r\\n for (f in ifields) {
\\r\\n out.instances[ifields[f]] = { min: 10000, max: 0, tot: 0, cnt: 0, avg:
0};\\r\\n }\\r\\n out.workers = {};\\r\\n for (f in wfields) {\\r\\n out.
workers[wfields[f]] = { min: 10000, max: 0, tot: 0, cnt: 0, avg: 0};\\r\\n }\\r
\\n\\r\\n for (v in values) {\\r\\n if (values[v].load.min < out.load.min) {
\\r\\n out.load.min = values[v].load.min;\\r\\n }\\r\\n if (values[v]
.load.max > out.load.max) {\\r\\n out.load.max = values[v].load.max;\\r\\n
}\\r\\n out.load.tot += values[v].load.tot;\\r\\n out.load.cnt += value
s[v].load.cnt;\\r\\n\\r\\n for (f in ifields) {\\r\\n if (values[v].inst
ances[ifields[f]].min < out.instances[ifields[f]].min) {\\r\\n out.instan
ces[ifields[f]].min = values[v].instances[ifields[f]].min;\\r\\n }\\r\\n
if (values[v].instances[ifields[f]].max > out.instances[ifields[f]].max) {\\r
\\n out.instances[ifields[f]].max = values[v].instances[ifields[f]].max;\
\r\\n }\\r\\n out.instances[ifields[f]].tot += values[v].instances[ifi
elds[f]].tot;\\r\\n out.instances[ifields[f]].cnt += values[v].instances[if
ields[f]].cnt;\\r\\n }\\r\\n\\r\\n for (f in wfields) {\\r\\n if (val
ues[v].workers[wfields[f]].min < out.workers[wfields[f]].min) {\\r\\n out
.workers[wfields[f]].min = values[v].workers[wfields[f]].min;\\r\\n }\\r\\n
if (values[v].workers[wfields[f]].max > out.workers[wfields[f]].max) {\\r\
\n out.workers[wfields[f]].max = values[v].workers[wfields[f]].max;\\r\\n
}\\r\\n out.workers[wfields[f]].tot += values[v].workers[wfields[f]].
tot;\\r\\n out.workers[wfields[f]].cnt += values[v].workers[wfields[f]].cnt
;\\r\\n }\\r\\n }\\r\\n\\r\\n out.load.avg = Math.round((out.load.tot / out
.load.cnt) * 100) / 100;\\r\\n for (f in ifields) {\\r\\n out.instances[ifie
lds[f]].avg = Math.round((out.instances[ifields[f]].tot / out.instances[ifields[
f]].cnt) * 100) / 100;\\r\\n }\\r\\n for (f in wfields) {\\r\\n out.workers
[wfields[f]].avg = Math.round((out.workers[wfields[f]].tot / out.workers[wfields
[f]].cnt) * 100) / 100;\\r\\n }\\r\\n return out;\\r\\n}\",\"map\":\"function
map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n i
f (doc.type === \\\"statistic\\\" && doc.statistictype && doc.statistictype ===
\\\"pool\\\" && doc.entity_id) {\\r\\n var datetime = new Date(doc.timestam
p);\\r\\n var year = datetime.getFullYear();\\r\\n var month = 1 + dat
etime.getMonth();\\r\\n var day = datetime.getDate();\\r\\n var hour =
datetime.getUTCHours();\\r\\n var out = {};\\r\\n out.load = { min: d
oc.load, max: doc.load, tot: doc.load, cnt: 1, avg: doc.load };\\r\\n out.i
nstances = {};\\r\\n out.instances.count = { min: doc.instance.count, max:
doc.instance.count, tot: doc.instance.count, cnt: 1, avg: doc.instance.count };\
\r\\n out.instances.starting = { min: doc.instance.starting, max: doc.insta
nce.starting, tot: doc.instance.starting, cnt: 1, avg: doc.instance.starting };\
\r\\n out.instances.running = { min: doc.instance.running, max: doc.instanc
e.running, tot: doc.instance.running, cnt: 1, avg: doc.instance.running };\\r\\n
out.instances.stopped = { min: doc.instance.stopped, max: doc.instance.sto
pped, tot: doc.instance.stopped, cnt: 1, avg: doc.instance.stopped };\\r\\n
out.workers = {};\\r\\n out.workers.count = { min: doc.worker.count, max:
doc.worker.count, tot: doc.worker.count, cnt: 1, avg: doc.worker.count };\\r\\n
out.workers.active = { min: doc.worker.active, max: doc.worker.active, tot:
doc.worker.active, cnt: 1, avg: doc.worker.active };\\r\\n out.workers.idl
e = { min: doc.worker.idle, max: doc.worker.idle, tot: doc.worker.idle, cnt: 1,
avg: doc.worker.idle };\\r\\n emit([doc.entity_id, year, month, day, hour],
out);\\r\\n }\\r\\n }\\r\\n}\"},\"InstanceByPool\":{\"map\":\"function map(
doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (d
oc.type === \\\"instance\\\" && doc.status && doc.status !== \\\"deleted\\\") {\
\r\\n emit([doc.pool_id, doc.instancestatus, doc.laststarted], doc.external
id);\\r\\n }\\r\\n }\\r\\n}\"},\"WorkitemByJob\":{\"map\":\"function map(doc
, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.
type === \\\"workitem\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\
\n emit([doc.job_id, doc.itemstatus], doc.percentage);\\r\\n }\\r\\n }\
\r\\n}\"},\"InstanceByStatus\":{\"map\":\"function map(doc, meta) {\\r\\n if (m
eta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.type === \\\"instance\
\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n var addr = doc
.address + \\\":\\\" + doc.port;\\r\\n emit([doc.instancestatus, doc.workty
pe, doc.laststatus], addr);\\r\\n }\\r\\n }\\r\\n}\"},\"RegionByName\":{\"ma
p\":\"function map(doc, meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.typ
e) {\\r\\n if (doc.type === \\\"region\\\" && doc.status && doc.status !== \\
\"deleted\\\") {\\r\\n emit([doc.name, doc.location, doc.provider], doc.add
ress);\\r\\n }\\r\\n }\\r\\n}\"},\"PoolByName\":{\"map\":\"function map(doc,
meta) {\\r\\n if (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.t
ype === \\\"pool\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n
emit([doc.name, doc.instancetype, doc.vmtype, doc.imageid]);\\r\\n }\\r\\n
}\\r\\n}\"},\"WorkitemByStatus\":{\"map\":\"function map(doc, meta) {\\r\\n i
f (meta.type === \\\"json\\\" && doc.type) {\\r\\n if (doc.type === \\\"worki
tem\\\" && doc.status && doc.status !== \\\"deleted\\\") {\\r\\n emit([doc.
itemstatus, doc.job_id], doc.worker_id);\\r\\n }\\r\\n }\\r\\n}\"},\"WorkerB
yStatus\":{\"map\":\"function map(doc, meta) {\\r\\n if (meta.type === \\\"json
\\\" && doc.type) {\\r\\n if (doc.type === \\\"worker\\\" && doc.status && do
c.status !== \\\"deleted\\\") {\\r\\n var addr = doc.address + \\\":\\\" +
doc.port;\\r\\n emit([doc.workerstatus, doc.worktype, doc.laststatus], addr
);\\r\\n }\\r\\n }\\r\\n}\"}}}">>,
0,false,[]}}},
{'$gen_call',
{<0.16199.9>,#Ref<0.0.7.180181>},
{foreach_doc,
#Fun<capi_ddoc_replication_srv.1.125277061>}},
replicate_newnodes_docs]
links: [<0.16179.9>,<0.16181.9>,<0.16182.9>,<0.16180.9>,<0.16177.9>]
dictionary: [{ddoc_replication_proxy,<0.16180.9>}]
trap_exit: true
status: running
heap_size: 4185
stack_size: 27
reductions: 5137
neighbours:
neighbour: [{pid,<0.16180.9>},
{registered_name,'capi_ddoc_replication_srv-manager'},
{initial_call,{erlang,apply,['Argument__1','Argument__2']}},
{current_function,{capi_ddoc_replication_srv,proxy_loop,1}},
{ancestors,['capi_set_view_manager-manager',
'single_bucket_sup-manager',<0.16176.9>]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,4},
{reductions,25}]
neighbour: [{pid,<0.16182.9>},
{registered_name,[]},
{initial_call,
{capi_set_view_manager,
'-spawn_ddoc_replicator/1-fun-0-',[]}},
{current_function,
{capi_set_view_manager,ddoc_replicator_loop,2}},
{ancestors,
['capi_set_view_manager-manager',
'single_bucket_sup-manager',<0.16176.9>]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,9},
{reductions,10}]
neighbour: [{pid,<0.16181.9>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{capi_set_view_manager,nodeup_monitoring_loop,1}},
{ancestors,[]},
{messages,[]},
{links,[<0.16178.9>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,2},
{reductions,5}]
[error_logger:error,2015-05-22T0:14:26.378,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]** Generic server 'couch_stats_reader-mana
ger' terminating
** Last message in was refresh_stats
** When Server state == {state,"manager",undefined,undefined}
** Reason for termination ==
** {{bad_return_value,{error,{error,{dcp_socket_connect_failed,nxdomain}}}},
{gen_server,call,
['capi_set_view_manager-manager',
{foreach_doc,#Fun<capi_ddoc_replication_srv.1.125277061>},
infinity]}}
[error_logger:error,2015-05-22T0:14:26.378,ns_1@172.31.44.247:error_logger<0.6.0
>:ale_error_logger_handler:do_log:203]
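The dcp_socket_connect_failed,nxdomain reason in that crash report looks like a name-resolution failure, so one thing I’m now double-checking is whether the node hostnames resolve from inside each container, roughly like this (the container name and hostnames are examples from my setup):
# does each node's hostname resolve from inside the Couchbase container?
~$ docker exec couchbase getent hosts ip-172-31-44-247
~$ docker exec couchbase getent hosts ip-172-31-44-248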
Is this a bug, or could it be a setup issue? If it’s a bug, please let me know whether I should be logging it somewhere else and whether I can provide more information.