Unable to rebalance production cluster

I am unable to rebalance a production Couchbase 6.5 cluster. I've added a 7.1 node and attempted to rebalance. The first rebalance failed with:

{"completionMessage":"Rebalance stopped by janitor."}

A subsequent rebalance ran for a while, then failed with:

“Rebalance exited with reason {service_rebalance_failed,index,\n {agent_died,<29443.456.0>,\n {linked_process_died,<29443.1509.0>,\n {timeout,\n {gen_server,call,\n [<29443.1507.0>,\n {call,"ServiceAPI.GetTaskList",\n #Fun<json_rpc_connection.0.102434519>},\n 60000]}}}}}.”}

The rebalance button is now disabled. Any help is greatly appreciated.

If this is an Enterprise server, please open a case with Customer Support.
Otherwise, look in the server logs for more information about the ‘agent_died’ error.
It might be worthwhile to try adding a 6.5 server to eliminate one variable.
Ensure all the ports are accessible: Couchbase Server Ports | Couchbase Docs
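A quick way to confirm the port check from each node is a small TCP probe. This is only a minimal sketch: the port list below is an assumed, partial selection (cluster manager, index service, data service, Erlang port mapper, node-to-node), and `check_port` and `check_nodes` are hypothetical helper names, not part of any Couchbase tooling. Verify the ports against the Couchbase Server Ports documentation for the services you actually run.

```python
import socket

# Assumed, partial port list; confirm against the Couchbase ports documentation.
ASSUMED_PORTS = [4369, 8091, 9100, 9101, 9102, 9103, 9104, 9105, 11210, 21100]


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_nodes(hosts, ports=ASSUMED_PORTS, timeout=3.0):
    """Return a dict mapping each (host, port) pair to its reachability."""
    return {(h, p): check_port(h, p, timeout) for h in hosts for p in ports}
```

Running `check_nodes([...])` with the cluster's node addresses from every node, including the new 7.1 node, and printing the pairs that map to `False` shows which connections are blocked. A successful TCP connect only proves the port is open; it does not prove the service behind it is healthy.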

Can you please collect logs using cbcollect_info? It will give a complete picture of what is going on. https://docs.couchbase.com/server/current/manage/manage-logging/manage-logging.html describes the different ways to do this.

Thanks for your response and information. This is not an Enterprise server.

  1. I’ve collected redacted logs but am still working through my organization’s procedures for clearing them for distribution. If they are cleared, should I use the upload to Couchbase feature?

  2. Review of logs around the agent_died event hasn’t yielded much context to identify a root cause so far. Attached at the bottom is a section of the reports.log from a cluster node.

  3. We’ve ensured all ports are accessible and there is no firewall between nodes.

Thanks again for your help.

crash-snippet.reports.log.zip (3.4 KB)

Ok. In that file I find what you first posted:

    messages: [{'EXIT',<0.26359.955>,
                      {linked_process_died,<0.26276.955>,
                          {timeout,
                              {gen_server,call,
                                  [<0.26083.955>,
                                   {call,"ServiceAPI.GetTaskList",

I then searched issues.couchbase.com for “linked_process_died ServiceAPI.GetTaskList” and found an issue which says the problem is fixed in 7.0.0. So upgrade your existing servers to 7.1, and then add the new node.
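Locating that signature in a file such as reports.log can also be scripted, which helps when comparing several nodes. The following is a minimal sketch; `find_with_context` is a hypothetical helper, and the log path in the usage note is an assumption based on the default Linux install location.

```python
def find_with_context(lines, needle, before=2, after=5):
    """Yield (line_number, context) for every line that contains needle.

    line_number is 1-based; context is the list of lines from `before`
    lines ahead of the match through `after` lines past it.
    """
    lines = list(lines)
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            yield i + 1, lines[start:i + after + 1]


def search_file(path, needle, before=2, after=5):
    """Read a log file and return all matches with their context."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        stripped = [ln.rstrip("\n") for ln in fh]
    return list(find_with_context(stripped, needle, before, after))
```

For example, `search_file("/opt/couchbase/var/lib/couchbase/logs/reports.log", "linked_process_died")` returns each match with its surrounding lines, assuming the logs live at that path on your nodes. Searching for `ServiceAPI.GetTaskList` or `agent_died` works the same way.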
