This is an admittedly general question, but I'm looking for some advice. We have a three-node Couchbase 2.1 Community cluster with big-memory nodes and big buckets. Each node has 192GB of RAM, and our largest bucket holds ~130 million keys and takes up about 100GB. We have never successfully failed over a node or rebalanced the cluster with the big bucket in place. We recently rebooted a node and now that bucket can't warm up. It tries for 10 minutes or so, the status shows it loading keys, and then it throws a bunch of these:
Control connection to memcached on 'ns_1@cclnxcouch1.pfizer.com' disconnected: {{badmatch, {error, timeout}},
  [{mc_client_binary, stats_recv, 4},
   {mc_client_binary, stats, 4},
   {ns_memcached, has_started, 1},
   {ns_memcached, handle_info, 2},
   {gen_server, handle_msg, 5},
   {ns_memcached, init, 1},
   {gen_server, init_it, 6},
   {proc_lib, init_p_do_apply, 3}]}
and then it starts the warmup process over again, so the node appears to be stuck in pending forever. Aside from this specific problem, we've found that with these large buckets we have a system that works very well UNTIL anything untoward happens. Was it a mistake to use big-memory nodes? The nodes are connected over local gigabit Ethernet; is that insufficient to support the cluster? Couchbase is running on RAID-ed SSDs, and we seem to do pretty well I/O-wise. Any suggestions on how to make rebalance functional?
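For what it's worth, this is roughly how we've been watching node status from the REST API while the node sits in pending. It's just a minimal sketch against the standard /pools/default endpoint on port 8091; the credentials are placeholders and the hostname is the node from the log above:

import base64
import json
import urllib.request

# Placeholder credentials -- substitute your cluster admin user/password.
CLUSTER = "http://cclnxcouch1.pfizer.com:8091"
AUTH = base64.b64encode(b"Administrator:password").decode()

# /pools/default reports each node's health ("healthy", "warmup", "unhealthy")
# and its cluster membership, so we can see the warmup restarting.
req = urllib.request.Request(CLUSTER + "/pools/default",
                             headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req) as resp:
    pools = json.loads(resp.read().decode())

for node in pools["nodes"]:
    print(node["hostname"], node["status"], node["clusterMembership"])

Running that in a loop is how we can tell the node never leaves warmup before the memcached control connection drops.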
Thanks!