I have a three-node Couchbase cluster running in Google Kubernetes Engine. Apparently, one of the nodes crashed. When the Autonomous Operator spun up a replacement node, it had a different DNS name ("couchbase://cb-cluster-member-0003.cb-cluster-member.default.svc").
The problem is that my three nodes’ DNS names are hard-coded in my (.NET Core) application’s configuration file. When "cb-cluster-member-0002" was no longer available, I ran into all sorts of seemingly random problems. One was that a N1QL query my application relied on stopped working. When I ran that query through the Web Console, it complained that the index did not exist, even though I had created it several days earlier. Under “Bucket Insights” > “Queryable on Indexed Fields” > BucketName, it complained that it could not determine the schema from the existing documents. But the documents had not changed! I ended up having to flush the bucket (thankfully this was a development system).
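For context, my bootstrap code amounts to roughly this (a simplified sketch; the actual hostnames and credentials live in appsettings.json, and I'm on the 2.x-era Couchbase .NET SDK):

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

public static class Program
{
    public static void Main()
    {
        // Each pod's DNS name is pinned here, so when the Operator replaces
        // a pod (-0002 became -0003) this list silently goes stale.
        var config = new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://cb-cluster-member-0000.cb-cluster-member.default.svc:8091"),
                new Uri("http://cb-cluster-member-0001.cb-cluster-member.default.svc:8091"),
                // Gone after the crash; the replacement -0003 is unknown to this list.
                new Uri("http://cb-cluster-member-0002.cb-cluster-member.default.svc:8091")
            }
        };

        var cluster = new Cluster(config);
        cluster.Authenticate("app-user", "app-password"); // placeholder credentials
    }
}
```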
How do I compensate for this? First, I’d like to know how to avoid hard-coding the DNS names of my nodes, and second, I’d like to know why I encountered that weirdness with just one bucket.
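My best guess for the first question is to bootstrap off the stable headless-service record rather than the individual pod names, something like the sketch below, where cb-cluster-member.default.svc is inferred from the pod DNS names in my logs. Is that the supported pattern with the Operator?

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

public static class Program
{
    public static void Main()
    {
        // Bootstrap against the headless service that governs the pods.
        // Kubernetes DNS should resolve this one name to whichever member
        // pods currently exist, so a replacement pod like -0003 would be
        // picked up without touching application configuration.
        var config = new ClientConfiguration
        {
            Servers = new List<Uri>
            {
                new Uri("http://cb-cluster-member.default.svc:8091")
            }
        };

        var cluster = new Cluster(config);
        cluster.Authenticate("app-user", "app-password"); // placeholder credentials
    }
}
```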
EDIT
I have dug through the logs and found some odd things.
Out of nowhere, I see this occur today:
IP address seems to have changed. Unable to listen on 'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'. (POSIX error code: 'nxdomain')
This is followed by a similar message for each node in the cluster (0000, 0001, 0002):
IP address seems to have changed. Unable to listen on 'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'. (POSIX error code: 'nxdomain') (repeated 6 times)
Followed by several of these for each node:
Failed to add node cb-cluster-member-0002.cb-cluster-member.default.svc:8091 to cluster. Node already exists in cluster: ns_1@cb-cluster-member-0002.cb-cluster-member.default.svc (repeated 8 times)
Eventually this happens, followed by a rebalance:
Starting rebalance, KeepNodes = ['ns_1@cb-cluster-member-0000.cb-cluster-member.default.svc',
'ns_1@cb-cluster-member-0001.cb-cluster-member.default.svc'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@cb-cluster-member-0002.cb-cluster-member.default.svc']; no delta recovery nodes
Then I get more of these:
Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Failed to reach erlang port mapper. Failed to resolve address for "cb-cluster-member-0003.cb-cluster-member.default.svc". The hostname may be incorrect or not resolvable.
And then 0003 comes online, followed by a rebalance:
Node ns_1@cb-cluster-member-0003.cb-cluster-member.default.svc joined cluster
And then adding the node fails again:
Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Prepare join failed. Could not connect to "cb-cluster-member-0003.cb-cluster-member.default.svc" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers. (repeated 1 times)
Failed to add node cb-cluster-member-0003.cb-cluster-member.default.svc:8091 to cluster. Prepare join failed. Could not connect to "cb-cluster-member-0003.cb-cluster-member.default.svc" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers. (repeated 2 times)