We have had a customer in production since June with a 4-node cluster.
On each node we run a Couchbase instance and two Tomcat instances with a Java application (our frontends).
Until the end of September everything was fine; after that, we increased the load on the cluster by going live with another site.
Before October we were doing 6k-7k operations/sec (cluster-wide).
Now we have roughly 34k-35k operations/sec (cluster-wide).
It seems that every 5-7 days one of the nodes is no longer heard from by the other nodes and so, after 300 seconds, it is automatically failed over.
To solve the problem I only need to restart Couchbase on the node and click Rebalance in the GUI (I usually use Delta Recovery).
Sometimes the command /etc/init.d/couchbase-server stop does not work and I need to manually kill some of the remaining couchbase processes…
Once I have restored the cluster with all the nodes, another node goes down 20-30 minutes later and I have to repeat the process.
It seems like there is some kind of leak in the processes that the weekly restart then solves…
Since I initially suspected network problems, I set up a permanent ping every 5 seconds between each pair of nodes, and from another server to all the nodes. The pings sometimes show the latency between the nodes increasing, but it is almost always below 2 ms, and at the last occurrence of the problem it was below 1 ms (0.500 ms on average).
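A minimal sketch of how the output of such a ping loop could be scanned for latency spikes. It assumes the usual Linux ping reply format ("time=0.512 ms"); the function names and the 2 ms threshold are illustrative, not the actual script running on the nodes.

```python
import re

# Matches the round-trip time in a typical Linux ping reply line,
# e.g. "64 bytes from web2: icmp_seq=1 ttl=64 time=0.512 ms".
RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt_ms(line):
    """Return the RTT in milliseconds, or None if the line carries no RTT."""
    match = RTT_PATTERN.search(line)
    return float(match.group(1)) if match else None


def slow_replies(lines, threshold_ms=2.0):
    """Return the RTTs above threshold_ms (hypothetical 2 ms threshold)."""
    rtts = (parse_rtt_ms(line) for line in lines)
    return [rtt for rtt in rtts if rtt is not None and rtt > threshold_ms]
```

Lines without a reply (timeouts, headers) are skipped here, so lost packets would have to be counted separately.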
Has anyone ever seen this kind of behaviour?
This is the email that was sent for the last failover:
Node ('ns_1@web2.customer.local') was automatically failovered. [down,stale, {last_heard,{1448,452853,198762}}, {now,{1448,452853,189173}}, {active_buckets,["comments","video","poll","cmbucket"]}, {ready_buckets,["comments","video","poll","cmbucket"]}, {status_latency,8769}, {outgoing_replications_safeness_level, [{"cmbucket",green},{"poll",green},{"video",green},{"comments",green}]}, {incoming_replications_conf_hashes, [{"cmbucket", [{'ns_1@web1.customer.local',22178425}, {'ns_1@web3.customer.local',56464530}, {'ns_1@web4.customer.local',75568021}]}, {"poll", [{'ns_1@web1.customer.local',128744204}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',77926089}]}, {"video", [{'ns_1@web1.customer.local',14818100}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',53675363}]}, {"comments", [{'ns_1@web1.customer.local',14818100}, {'ns_1@web3.customer.local',118225207}, {'ns_1@web4.customer.local',53675363}]}]}, {local_tasks,[]}, {memory, [{total,602146168}, {processes,265593048}, {processes_used,264888368}, {system,336553120}, {atom,686993}, {atom_used,669601}, {binary,37829456}, {code,16371821}, {ets,269075200}]}, {system_memory_data, [{system_total_memory,33733103616}, {free_swap,16928206848}, {total_swap,16928206848}, {cached_memory,13174239232}, {buffered_memory,209772544}, {free_memory,466776064}, {total_memory,33733103616}]}, {node_storage_conf [{db_path,"/opt/couchbase"}, {index_path,"/opt/store/couchbase"}]}, {statistics, [{wall_clock,{215464559,5000}}, {context_switches,{1130090132,0}}, {garbage_collection,{263035644,1131734052301,0}}, {io,{{input,246185236379},{output,533850013308}}}, {reductions,{454055570955,7607743}}, {run_queue,0}, {runtime,{54159680,930}}, {run_queues,{0,0,0,0,0,0,0,0}}]}, {system_stats, [{cpu_utilization_rate,34.60076045627376}, {swap_total,16928206848}, {swap_used,0}, {mem_total,33733103616}, {mem_free,13851561984}]}, {interesting_stats, [{cmd_get,13701.0}, {couch_docs_actual_disk_size,1668786948}, 
{couch_docs_data_size,200836350}, {couch_views_actual_disk_size,74422496}, {couch_views_data_size,32329225}, {curr_items,114528}, {curr_items_tot,229222}, {ep_bg_fetched,0.0}, {get_hits,446.0}, {mem_used,335254192}, {ops,13701.0}, {vb_replica_curr_items,114694}]}, {per_bucket_interesting_stats, [{"comments", [{cmd_get,436.0}, {couch_docs_actual_disk_size,155695343}, {couch_docs_data_size,106937230}, {couch_views_actual_disk_size,70288793}, {couch_views_data_size,28225399}, {curr_items,79984}, {curr_items_tot,159978}, {ep_bg_fetched,0.0}, {get_hits,436.0}, {mem_used,156535344}, {ops,436.0}, {vb_replica_curr_items,79994}]}, {"video", [{cmd_get,0.0}, {couch_docs_actual_disk_size,78650103}, {couch_docs_data_size,61343873}, {couch_views_actual_disk_size,4133703}, {couch_views_data_size,4103826}, {curr_items,32559}, {curr_items_tot,65265}, {ep_bg_fetched,0.0}, {get_hits,0.0}, {mem_used,139843320}, {ops,0.0}, {vb_replica_curr_items,32706}]}, {"poll", [{cmd_get,0.0}, {couch_docs_actual_disk_size,30937326}, {couch_docs_data_size,28621824}, {couch_views_actual_disk_size,0}, {couch_views_data_size,0}, {curr_items,0}, {curr_items_tot,0}, {ep_bg_fetched,0.0}, {get_hits,0.0}, {mem_used,18708304}, {ops,0.0}, {vb_replica_curr_items,0}]}, {"cmbucket", [{cmd_get,13265.0}, {couch_docs_actual_disk_size,1403504176}, {couch_docs_data_size,3933423}, {couch_views_actual_disk_size,0}, {couch_views_data_size,0}, {curr_items,1985}, {curr_items_tot,3979}, {ep_bg_fetched,0.0}, {get_hits,10.0}, {mem_used,20167224}, {ops,13265.0}, {vb_replica_curr_items,1994}]}]}, {processes_stats, [{<<"proc/(main)beam.smp/cpu_utilization">>,0}, {<<"proc/(main)beam.smp/major_faults">>,0}, {<<"proc/(main)beam.smp/major_faults_raw">>,22}, {<<"proc/(main)beam.smp/mem_resident">>,4249673728}, {<<"proc/(main)beam.smp/mem_share">>,9895936}, {<<"proc/(main)beam.smp/mem_size">>,949330685952}, {<<"proc/(main)beam.smp/minor_faults">>,772}, {<<"proc/(main)beam.smp/minor_faults_raw">>,502473658}, 
{<<"proc/(main)beam.smp/page_faults">>,772}, {<<"proc/(main)beam.smp/page_faults_raw">>,502473680}, {<<"proc/beam.smp/cpu_utilization">>,0}, {<<"proc/beam.smp/major_faults">>,0}, {<<"proc/beam.smp/major_faults_raw">>,0}, {<<"proc/beam.smp/mem_resident">>,27287552}, {<<"proc/beam.smp/mem_share">>,2519040}, {<<"proc/beam.smp/mem_size">>,975695872}, {<<"proc/beam.smp/minor_faults">>,0}, {<<"proc/beam.smp/minor_faults_raw">>,9195}, {<<"proc/beam.smp/page_faults">>,0}, {<<"proc/beam.smp/page_faults_raw">>,9195}, {<<"proc/inet_gethost/cpu_utilization">>,0}, {<<"proc/inet_gethost/major_faults">>,0}, {<<"proc/inet_gethost/major_faults_raw">>,1}, {<<"proc/inet_gethost/mem_resident">>,430080}, {<<"proc/inet_gethost/mem_share">>,344064}, {<<"proc/inet_gethost/mem_size">>,7630848}, {<<"proc/inet_gethost/minor_faults">>,0}, {<<"proc/inet_gethost/minor_faults_raw">>,708}, {<<"proc/inet_gethost/page_faults">>,0}, {<<"proc/inet_gethost/page_faults_raw">>,709}, {<<"proc/memcached/cpu_utilization">>,0}, {<<"proc/memcached/major_faults">>,0}, {<<"proc/memcached/major_faults_raw">>,64}, {<<"proc/memcached/mem_resident">>,543154176}, {<<"proc/memcached/mem_share">>,6148096}, {<<"proc/memcached/mem_size">>,880947200}, {<<"proc/memcached/minor_faults">>,0}, {<<"proc/memcached/minor_faults_raw">>,323925}, {<<"proc/memcached/page_faults">>,0}, {<<"proc/memcached/page_faults_raw">>,323989}]}, {cluster_compatibility_version,196608}, {version, [{lhttpc,"1.3.0"}, {os_mon,"2.2.14"}, {public_key,"0.21"}, {asn1,"2.0.4"}, {couch,"2.1.1r-432-gc2af28d"}, {kernel,"2.16.4"}, {syntax_tools,"1.6.13"}, {xmerl,"1.3.6"}, {ale,"3.0.1-1444-rel-community"}, {couch_set_view,"2.1.1r-432-gc2af28d"}, {compiler,"4.9.4"}, {inets,"5.9.8"}, {mapreduce,"1.0.0"}, {couch_index_merger,"2.1.1r-432-gc2af28d"}, {ns_server,"3.0.1-1444-rel-community"}, {oauth,"7d85d3ef"}, {crypto,"3.2"}, {ssl,"5.3.3"}, {sasl,"2.3.4"}, {couch_view_parser,"1.0.0"}, {mochiweb,"2.4.2"}, {stdlib,"1.19.4"}]}, {supported_compat_version,[3,0]}, 
{advertised_version,[3,0,0]}, {system_arch,"x86_64-unknown-linux-gnu"}, {wall_clock,215464}, {memory_data,{33733103616,33252155392,{<14497.1546.0>,41050832}}}, {disk_data, [{"/",4787516,12}, {"/sys/fs/cgroup",4,0}, {"/dev",16459992,1}, {"/run",3294252,1}, {"/run/lock",5120,0}, {"/run/shm",16471240,1}, {"/run/user",102400,0}, {"/boot",240972,31}, {"/home",3869352,10}, {"/opt",4787516,20}, {"/usr",7742856,26}, {"/tmp",3869352,1}, {"/var",4787516,20}, {"/opt",51475068,43}, {"/opt/store",381754588,53}]}, {meminfo, <<"MemTotal: 32942484 kB\nMemFree: 455836 kB\nBuffers: 204856 kB\nCached: 12865468 kB\nSwapCached: 0 kB\nActive: 23031524 kB\nInactive: 8294088 kB\nActive(anon): 16519580 kB\nInactive(anon): 1791864 kB\nActive(file): 6511944 kB\nInactive(file): 6502224 kB\nUnevictable: 0 kB\nMlocked: 0 kB\nSwapTotal: 16531452 kB\nSwapFree: 16531452 kB\nDirty: 220 kB\nWriteback: 0 kB\nAnonPages: 18255292 kB\nMapped: 615604 kB\nShmem: 56156 kB\nSlab: 662580 kB\nSReclaimable: 501436 kB\nSUnreclaim: 161144 kB\nKernelStack: 85248 kB\nPageTables: 118444 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 33002692 kB\nCommitted_AS: 37566348 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 340092 kB\nVmallocChunk: 34359386136 kB\nHardwareCorrupted:! 0 kB\nAnonHugePages: 18432 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 61312 kB\nDirectMap2M: 33492992 kB\n">>}]
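The last_heard and now tuples in the email look like Erlang timestamps in {MegaSecs, Secs, MicroSecs} form. That is an assumption based on the usual Erlang convention, not something the email states. Under that assumption, a small Python sketch to turn them into UTC times and compare them (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def erlang_ts_to_datetime(mega, secs, micro):
    """Convert an assumed Erlang {MegaSecs, Secs, MicroSecs} tuple to UTC."""
    return EPOCH + timedelta(seconds=mega * 1_000_000 + secs, microseconds=micro)


# Values copied from the failover email above.
last_heard = erlang_ts_to_datetime(1448, 452853, 198762)
now = erlang_ts_to_datetime(1448, 452853, 189173)

# Difference between the two timestamps reported for the failed node.
gap = last_heard - now
print(last_heard.isoformat(), now.isoformat(), gap)
```

For the tuples in the email, the two values fall within the same second and differ by 9589 microseconds, with last_heard slightly later than now.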
We do have 4 buckets but we use only 3 of them: one is empty, one is doing less than 1 op/sec, one is below 2k ops/sec, and the remaining one is doing all the other operations.