My setup is three Docker containers, each running couchbase:latest with different port mappings to the Docker host. Here is the excerpt of my docker-compose.yaml:

    cdb1:
      ports:
        - "8091-8096:8091-8096"
        - "11210:11210"
    cdb2:
      ports:
        - "11091-11096:8091-8096"
        - "12210:11210"
    cdb3:
      ports:
        - "12091-12096:8091-8096"
        - "13210:11210"

The nodes are connected through a user-defined Docker network. All nodes have hostnames, a domain name, and a complete /etc/hosts file listing all three nodes. The cluster is built by referring to the nodes by name.
To access the nodes from the Docker host, I followed the manual and configured external access on all three nodes by posting to /node/controller/setupAlternateAddresses/external on each node, providing the hostname 127.0.0.1 and the appropriate port settings for kv, fts, cbas, n1ql, mgmt, capi and eventingAdminPort.
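The per-node settings described above can be sketched as follows. This is only an illustration, not the exact requests I sent: it assumes the Couchbase default in-container ports for each key (mgmt 8091, capi 8092, n1ql 8093, fts 8094, cbas 8095, eventingAdminPort 8096, kv 11210) and derives the host-side ports from the compose excerpt. The names `INTERNAL_PORTS`, `NODE_HOST_BASES` and `external_form` are hypothetical helpers.

```python
from urllib.parse import urlencode

# Assumed default in-container ports for the keys named in the post.
INTERNAL_PORTS = {
    "mgmt": 8091,
    "capi": 8092,
    "n1ql": 8093,
    "fts": 8094,
    "cbas": 8095,
    "eventingAdminPort": 8096,
    "kv": 11210,
}

# Host-side start of each published range, taken from the compose excerpt.
NODE_HOST_BASES = {
    "cdb1": {"services": 8091, "kv": 11210},
    "cdb2": {"services": 11091, "kv": 12210},
    "cdb3": {"services": 12091, "kv": 13210},
}


def external_form(node, hostname="127.0.0.1"):
    """Build the form fields for one node's external alternate address."""
    bases = NODE_HOST_BASES[node]
    form = {"hostname": hostname}
    for key, internal in INTERNAL_PORTS.items():
        if key == "kv":
            form[key] = bases["kv"]
        else:
            # Keep the same offset inside the published 8091-8096 range.
            form[key] = bases["services"] + (internal - 8091)
    return form


if __name__ == "__main__":
    # Form-encoded body for cdb2, ready to send to
    # /node/controller/setupAlternateAddresses/external on that node.
    print(urlencode(external_form("cdb2")))
```

For cdb2 this maps mgmt to 11091, which is the port that shows up in the log output further down.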
This setup works flawlessly until I test failover by stopping a node. Auto-failover happens, but when I restart the stopped node, it starts writing this to the logs:
> Service 'backup' exited with status 1. Restarting. Messages:
>
> 2022-09-26T15:04:31.826Z WARN (REST) (Attempt 2) (GET) Retrying request to endpoint '/pools': which failed due to error: failed to perform request: Get "http://127.0.0.1:11091/pools": dial tcp 127.0.0.1:11091: connect: connection refused
>
> 2022-09-26T15:04:31.826Z DEBUG (REST) (Attempt 3) (GET) Dispatching request to 'http://127.0.0.1:11091/pools'
>
> 2022-09-26T15:04:31.826Z ERROR (REST) (Attempt 3) (GET) Failed to perform request to 'http://127.0.0.1:11091/pools': Get "http://127.0.0.1:11091/pools": dial tcp 127.0.0.1:11091: connect: connection refused
>
> 2022-09-26T15:04:31.826Z WARN (REST) (Attempt 3) (GET) Request to endpoint '/pools' failed due to error: failed to perform request: Get "http://127.0.0.1:11091/pools": dial tcp 127.0.0.1:11091: connect: connection refused
>
> 2022-09-26T15:04:31.927Z ERROR (Main) Failed to run node {"err": "could not create REST client: failed to get cluster information: failed to get cluster metadata: failed to execute request: failed to execute request: exhausted retry count after 3 retries, last error: failed to perform request: Get \"http://127.0.0.1:11091/pools\": dial tcp 127.0.0.1:11091: connect: connection refused"
This log entry then repeats every minute, and rebalancing the cluster fails at “backup” with the completion message:
"completionMessage": "Rebalance exited with reason {service_rebalance_failed,backup,\n {agent_died,<17875.2285.0>,\n {linked_process_died,<17875.2286.0>,\n {'ns_1@172.24.0.4',\n {no_connection,\"backup-service_api\"}}}}}."
I read the log entry as the recovering node trying to obtain cluster info by accessing its own external interface, which looks like a bug to me.
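One way to check this reading is to probe both ports from inside the recovering container. Published ports only exist on the Docker host, so inside the container 127.0.0.1:11091 should be refused while 127.0.0.1:8091 answers, assuming the node listens on the default management port. The helper name `port_accepts_connections` is hypothetical; this is a minimal sketch, not a Couchbase tool.

```python
import socket


def port_accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Run inside the cdb2 container: internal port vs. host-mapped port.
    for port in (8091, 11091):
        print(port, port_accepts_connections("127.0.0.1", port))
```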
My workaround so far is to delete the external hostname and port assignments on the recovered node, rebalance the cluster, and then set the external hostname and ports again on the recovered node.
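The workaround above can be written down as an ordered plan of REST calls. This is a sketch under assumptions: the HTTP verbs and the rebalance parameters (`knownNodes`, `ejectedNodes`) should be checked against the Couchbase REST reference for the version in use, and `recovery_steps` is a hypothetical helper that only builds the plan without sending anything.

```python
ALT_ADDR_PATH = "/node/controller/setupAlternateAddresses/external"


def recovery_steps(external_settings, known_nodes, set_method="PUT"):
    """Return (method, path, form) tuples for the three workaround steps.

    external_settings: hostname and port fields to restore afterwards.
    known_nodes: otpNode names of all cluster nodes, e.g. 'ns_1@172.24.0.4'.
    set_method: verb used to set the alternate address (assumption).
    """
    return [
        # 1. Remove the external address on the recovered node.
        ("DELETE", ALT_ADDR_PATH, {}),
        # 2. Rebalance the cluster with no nodes ejected.
        ("POST", "/controller/rebalance",
         {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}),
        # 3. Restore the external address once the rebalance has finished.
        (set_method, ALT_ADDR_PATH, dict(external_settings)),
    ]


if __name__ == "__main__":
    settings = {"hostname": "127.0.0.1", "mgmt": 11091, "kv": 12210}
    for step in recovery_steps(settings, ["ns_1@172.24.0.4"]):
        print(step)
```

Step 3 must wait until the rebalance has actually completed; the plan only fixes the order, not the waiting.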