Couchbase node dying when uploading documents using Java SDK

reVrost · June 30, 2015, 8:25am

Hi,
The nodes on my Couchbase server keep going into pending status, then go down, and then come back up again after some time.
This happens as I upload more and more documents to the nodes using the Java SDK. My bucket currently holds about 490 million documents with full eviction. It's running on 3 nodes with about 46.8 GB of memory and about 2.1 TB of HDD space in total. Below is what I got from error.log on one of the nodes.

  [stats:error,2015-06-30T8:03:28.331,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:<0.925.0>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                   {noproc,
                                    {gen_server,call,
                                     ['ns_memcached-Sample',{stats,<<>>},180000]}},
                                   [{gen_server,call,3,
                                     [{file,"gen_server.erl"},{line,188}]},
                                    {ns_memcached,do_call,3,
                                     [{file,"src/ns_memcached.erl"},{line,1399}]},
                                    {stats_collector,grab_all_stats,1,
                                     [{file,"src/stats_collector.erl"},{line,84}]},
                                    {stats_collector,handle_info,2,
                                     [{file,"src/stats_collector.erl"},
                                      {line,116}]},
                                    {gen_server,handle_msg,5,
                                     [{file,"gen_server.erl"},{line,604}]},
                                    {proc_lib,init_p_do_apply,3,
                                     [{file,"proc_lib.erl"},{line,239}]}]}
    
    [stats:error,2015-06-30T8:03:28.332,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:<0.925.0>:stats_collector:handle_info:124]Exception in stats collector: {exit,
                                   {noproc,
                                    {gen_server,call,
                                     ['ns_memcached-Sample',{stats,<<>>},180000]}},
                                   [{gen_server,call,3,
                                     [{file,"gen_server.erl"},{line,188}]},
                                    {ns_memcached,do_call,3,
                                     [{file,"src/ns_memcached.erl"},{line,1399}]},
                                    {stats_collector,grab_all_stats,1,
                                     [{file,"src/stats_collector.erl"},{line,84}]},
                                    {stats_collector,handle_info,2,
                                     [{file,"src/stats_collector.erl"},
                                      {line,116}]},
                                    {gen_server,handle_msg,5,
                                     [{file,"gen_server.erl"},{line,604}]},
                                    {proc_lib,init_p_do_apply,3,
                                     [{file,"proc_lib.erl"},{line,239}]}]}
    
    [ns_server:error,2015-06-30T8:03:37.090,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:ns_doctor<0.329.0>:ns_doctor:update_status:229]The following buckets became not ready on node 'ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com': ["Sample",
                                                                                                                    "office"], those of them are active ["Sample",
                                                                                                                                                         "office"]
    [ns_server:error,2015-06-30T8:04:24.298,ns_1@ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com:ns_log<0.277.0>:ns_log:handle_cast:210]unable to notify listeners because of badarg

I'm not sure what is causing this problem and would like to find out so that I can prevent it from happening in production.
Does anyone know why? Is the server possibly not big enough?
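
As a rough way to reason about the sizing question, the sketch below estimates how much RAM the keys and metadata alone would need if every item's metadata had to stay resident (as under value-only eviction). The 56-byte per-item overhead is the figure commonly quoted in Couchbase sizing guidance and should be checked against the docs for your version; the average key length and replica count are hypothetical placeholders, not values from this thread.

```java
final class MetadataEstimate {
    // Assumption: per-item metadata overhead in bytes; verify for your server version.
    static final long METADATA_BYTES_PER_DOC = 56;

    /**
     * Estimated bytes needed to keep keys and metadata for every item in RAM,
     * counting the active copy plus each replica copy.
     */
    public static long metadataBytes(long documents, int avgKeyBytes, int replicas) {
        return documents * (METADATA_BYTES_PER_DOC + avgKeyBytes) * (1 + replicas);
    }

    public static void main(String[] args) {
        long docs = 490_000_000L; // document count from the post
        int avgKeyBytes = 40;     // hypothetical average key length
        int replicas = 1;         // hypothetical replica count
        long bytes = metadataBytes(docs, avgKeyBytes, replicas);
        System.out.printf("~%.1f GB of keys and metadata%n", bytes / 1e9);
    }
}
```

With these placeholder inputs the estimate comes to roughly 94 GB, about twice the 46.8 GB of memory mentioned above. Full eviction exists so that this metadata does not all have to stay in RAM, so the number only illustrates the scale involved.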

daschl · June 30, 2015, 8:57am

@reVrost this looks like a server issue of some sort. I'll see if I can pull in someone from the server team to look at it. In the meantime, can you run cbcollect_info and upload the result?

Also, which server version are you running?

reVrost · June 30, 2015, 3:36pm

I am running Couchbase Server version 3.0.3-1716 Enterprise Edition (build-1716).

I couldn't upload the cbcollect_info output here due to the size restriction, so I've uploaded it to my Dropbox (94 MB).
Here is the link:

anil · June 30, 2015, 6:13pm

You can use the “cluster wide diagnostics” tool to collect and upload logs - http://docs.couchbase.com/admin/admin/Misc/cluster-wide-info-intro.html

reVrost · July 1, 2015, 3:11am

Sorry for the late reply,
Here are the logs:

https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-52-64-116-219.ap-southeast-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-52-64-98-58.ap-southeast-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-customers/reVrost/collectinfo-2015-07-01T030656-ns_1%40ec2-54-153-141-152.ap-southeast-2.compute.amazonaws.com.zip

reVrost · July 3, 2015, 1:58am

Bump. I restarted one of the servers that kept dying, and now I seem to be able to upsert more documents again without the server going down every so often. However, I do get some errors like:

Hard Out Of Memory Error. Bucket "office" on node ec2-52-64-98-58.ap-southeast-2.compute.amazonaws.com is full. All memory allocated to this bucket is used for metadata.

Even though that error popped up, the servers seem to be working as normal (nothing has gone down so far).
Still, I'm not sure how or why this error occurred. As far as I know the bucket is operating under full eviction, so why would there be an out-of-memory error due to metadata? Unless I'm not fully understanding what full eviction is…
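
While the cause of the out-of-memory alert is being tracked down, a loader can at least avoid hammering a node that is rejecting writes. The sketch below is a generic retry helper with capped exponential backoff; it is not part of the Couchbase Java SDK, and `BackoffRetry`, `withRetry`, and `delayMillis` are hypothetical names. It retries on any `RuntimeException` for simplicity; in a real loader you would likely narrow that to the SDK's temporary-failure error and wrap the upsert call in the supplier.

```java
import java.util.function.Supplier;

final class BackoffRetry {
    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so the multiplication cannot overflow for large attempt counts.
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }

    /** Runs op, retrying up to maxAttempts times in total with backoff between failures. */
    public static <T> T withRetry(Supplier<T> op, int maxAttempts, long baseMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // all attempts failed; surface the last error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

Backing off does not fix an undersized or misbehaving bucket, but it gives the server room to flush to disk and evict items instead of receiving the same write load immediately again.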

ingenthr · July 3, 2015, 2:50am

You're correct: you shouldn't see that if you are operating with full eviction.

Can you see if the stats show your bucket on one node operating much higher on memory usage than others? In the web UI, you’ll see a little blue arrow that will let you show details by server.

reVrost · July 3, 2015, 4:59am

The RAM usage seems to be roughly equal across all 3 nodes; the CPU usage varies a little, though.
Here are some screenshots from the web UI:
