Couchbase 5 doesn't restart after a "systemctl restart couchbase-server" command

I have a cluster of 3 servers running Couchbase 5.10 on CentOS 7.4. Everything works perfectly fine until I restart the Couchbase service on one of the servers. The error messages are not useful in any way. How can I figure out what the issue is?

Here is a snippet of the error:

=========================CRASH REPORT=========================
crasher:
initial call: gen_event:init_it/6
pid: <0.5001.0>
registered_name: bucket_info_cache_invalidations
exception exit: killed
in function gen_event:terminate_server/4 (gen_event.erl, line 320)
ancestors: [bucket_info_cache,ns_server_sup,ns_server_nodes_sup,
<0.4894.0>,ns_server_cluster_sup,<0.88.0>]
messages:
links:
dictionary:
trap_exit: true
status: running
heap_size: 376
stack_size: 27
reductions: 15952
neighbours:

[ns_server:debug,2018-01-14T13:27:24.505Z,ns_1@145.239.131.81:<0.4962.0>:remote_monitors:handle_down:158]Caller of remote monitor <0.4952.0> died with shutdown. Exiting
[ns_server:debug,2018-01-14T13:27:24.505Z,ns_1@145.239.131.81:ns_couchdb_port<0.4951.0>:ns_port_server:terminate:195]Shutting down port ns_couchdb

I see in another log file the following message:

=========================CRASH REPORT=========================
crasher:
initial call: application_master:init/4
pid: <0.87.0>
registered_name:
exception exit: {{shutdown,
{failed_to_start_child,ns_config_sup,
{shutdown,
{failed_to_start_child,ns_config,
{{badmatch,{error,"Unable to decrypt value"}},
[{ns_config_default,'-decrypt/1-fun-0-',1,
[{file,"src/ns_config_default.erl"},{line,657}]},
{misc,'-rewrite_tuples/2-fun-0-',2,
[{file,"src/misc.erl"},{line,701}]},
{misc,rewrite,2,[{file,"src/misc.erl"},{line,631}]},
{misc,do_rewrite,2,
[{file,"src/misc.erl"},{line,639}]},
{misc,do_rewrite,2,
[{file,"src/misc.erl"},{line,639}]},
{misc,'-rewrite_tuples/2-fun-0-',2,
[{file,"src/misc.erl"},{line,705}]},
{misc,rewrite,2,[{file,"src/misc.erl"},{line,631}]},
{misc,do_rewrite,2,
[{file,"src/misc.erl"},{line,639}]}]}}}}},
{ns_server,start,[normal,]}}
in function application_master:init/4 (application_master.erl, line 133)
ancestors: [<0.86.0>]
messages: [{'EXIT',<0.88.0>,normal}]
links: [<0.86.0>,<0.7.0>]
dictionary:
trap_exit: true
status: running
heap_size: 1598
stack_size: 27
reductions: 196
neighbours:

I have tried for the last 2 weeks to figure out what the problem is. I have reinstalled the servers at least 100 times. I have become so frustrated that I am very close to switching to another NoSQL DB. This is my final try to get it working.

Thanks,
F

@flaviu,

Are the other two nodes in your cluster still running? Did you rebalance the cluster to make it whole?
When you restarted the machine, did it get a new IP/hostname?

PUSH COMES TO SHOVE
You can pull the data from the files of the down node back into the cluster.

Hello Househippo,

The server has the same IP address/hostname. The other servers in the cluster are up, waiting for the third one to rejoin the cluster. The data on them was rebalanced before the systemctl restart couchbase-server.

The same behaviour happens with any of the 3 servers if I restart it. One thing I noticed is that when Couchbase restarts, it tries to bind to 127.0.0.1 instead of the public IP. I don't know why, or whether this is normal behaviour.
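As a quick sanity check on the loopback question, here is a minimal sketch. The path below is the default for a Linux package install and is an assumption on my part; as I understand it, ns_server reads the node's address from that file at startup, and a missing or loopback value makes the node come up on 127.0.0.1:

```shell
# Hypothetical check: inspect the address ns_server will start with.
# Path assumed for a default Linux package install; adjust if yours differs.
NODE_FILE=/opt/couchbase/var/lib/couchbase/ip
if [ -f "$NODE_FILE" ]; then
  echo "configured node address: $(cat "$NODE_FILE")"
else
  echo "no node address file found; the node may fall back to 127.0.0.1"
fi
```

If that file shows 127.0.0.1 (or is missing) on a node that was set up with a public IP, that mismatch would be worth chasing.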

Please let me know if I can help you with some additional information.

Thanks,
F

@flaviu
Are you looking for the RCA (Root Cause Analysis) of why the node went down,
or
is there data on the down node you want to get back into the cluster? (I can help with this over the forum)
or
do you just want to get the node back into the cluster, because auto-failover happened two weeks ago and your data is safe on the two remaining nodes?

Hello,

I cannot use a product which cannot reliably handle a simple node restart. I don’t have important data, but I need to know that I can handle the management of the DB.

I would really like to understand how to debug the problem and find the root cause of the failure to restart a node.

Thanks,
F

@flaviu [quote=“flaviu.radulescu, post:1, topic:15428”]
I have tried for the last 2 weeks to figure out what is the problem.
[/quote]

RCA
The above two log entries tell only a very small part of the ns_server (the cluster manager) story. It's going to be hard to tell you what's going on unless you do a full log dump of all the servers covering the time of the failure two weeks ago.

HELP FROM EXPERTS
Sounds like you're in the evaluation stage and you hit a road bump in your POC.
To short-circuit the learning curve of a new database, have you thought about contacting Couchbase Enterprise? I'm sure contacting them would be a better use of your time than spending another XYZ days or weeks in the forums trying to understand what happened to the cluster manager two weeks ago.
"Time is Money"

I can do a full log dump, that’s not a problem.

I have already talked to Couchbase about buying a license, but they don't have a pricing model affordable for a pre-revenue startup. The problem with Couchbase is that the logs are not clear enough to understand what the real problem is.

We could not find any mention of our problem on the entire internet. This is something that has never happened with any other software we have worked with in the past.

Being just a start-up, we cannot afford to buy a license at this stage; hence, no license, no support. No support, no working DB, so we will have to look at either Cassandra or CouchDB to see if they are more stable and better documented.

What would be your recommendation for getting to the bottom of this problem, or would you suggest just moving away from Couchbase?

Thanks,
F

@flaviu,

1. What is your use case? (This way I can understand if Couchbase is a good fit for you.)

2. Have you taken any of the free Couchbase online training? (This way you can understand better how Couchbase works, so that you can debug it better.) https://training.couchbase.com/

3. Can you run the cbcollect command to get the logs of all three machines? (Note: the logs can be big, so linking them from a public file server is best.) NOTE: cbcollect contains information like the IP addresses of your machines and other details about your servers (but no passwords). https://developer.couchbase.com/documentation/server/current/cli/cbcollect-info-tool.html
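For reference, a dry-run sketch of collecting from all three nodes in one go. The node names cb01–cb03 and the binary path are assumptions (the path is the default for a Linux package install); the loop only prints the commands so you can sanity-check them first:

```shell
# Dry run: print the collection command for each node instead of executing it.
# Remove the echo (keeping the ssh ... part) to actually run the collection.
for n in cb01 cb02 cb03; do
  echo "ssh $n sudo /opt/couchbase/bin/cbcollect_info /tmp/$n-collect.zip"
done
```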

Dear Househippo, thank you for your support.

I have been able to run cbcollect on all 3 servers. The reports are here: download link

The node cb02 is the one which is not starting.

Just to summarize the problem:

Install 3 CB servers on 3 identical physical machines:

1. create a cluster of all 3
2. put some data in and make sure they are replicating the data
3. restart one of the DB nodes with a standard systemctl restart couchbase-server command
4. that node will not start again. You don't have access to the web console on port 8091 no matter what you do. The logs are useless, and the only hint I have is that Couchbase fails to decrypt some configuration data, based on this error message:

{failed_to_start_child,ns_config,
{{badmatch,{error,"Unable to decrypt value"}},
[{ns_config_default,'-decrypt/1-fun-0-',1,
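As an aside on making these crash reports less opaque, here is a small grep sketch that pulls the failed-child chain and the final error reason out of a report, demonstrated on an inline sample. On a real node you would point it at the logs under /opt/couchbase/var/lib/couchbase/logs (a path assumed for a package install):

```shell
# Extract the chain of children that failed to start, plus the final
# {error,...} reason, from an ns_server-style crash report.
log='=========================CRASH REPORT=========================
{failed_to_start_child,ns_config_sup,
{failed_to_start_child,ns_config,
{{badmatch,{error,"Unable to decrypt value"}},'
printf '%s\n' "$log" | grep -o 'failed_to_start_child,[a-z_]*'
printf '%s\n' "$log" | grep -o '{error,"[^"]*"}'
```

The innermost (last printed) failed_to_start_child entry together with the {error,...} reason is the actual root cause; the lines above it are just supervisors cascading their children's failures upward.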

If you have any hint that would help me understand what the problem is, I would really appreciate it.

Thank you,
F