Over the past couple of days, we’ve received an automatic email three times, saying:
"IP address seems to have changed. Unable to listen on ‘ns_1@10.10.4.15’."
We’ve got a 4-node cluster with static local IPs, and all 3 alerts came from the same node. The cluster has been running for a couple of months already, possibly with an increase in traffic, but no new usage patterns.
In the logs, we see this related entry:
[ns_server:error,2013-12-01T15:58:58.861,ns_1@10.10.4.15:<0.5021.5975>:menelaus_web_alerts_srv:can_listen:345]Cannot listen due to nxdomain from inet:getaddr
The node seems to be marked as down by the other nodes for a short time, and then comes back to life as if nothing happened.
What are the possible triggers for this error? Could it be from overload on the node?
Edit:
An additional message that appears just before the one above:
[error_logger:error,2013-12-02T23:34:23.142,ns_1@10.64.4.162:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]Detected time forward jump (or too large erlang scheduling latency). Skipping 10 samples (or 8000 milliseconds) ({{1386027254845, #Ref<0.0.62.54885>}, {repeat,800, <0.16858.17>},{timer2,send,[<0.16858.17>,{cascade,minute,hour,4}]}})
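Judging only by the can_listen log entry above, the alert seems to fire when the node cannot resolve its own host name (inet:getaddr returning nxdomain) or cannot open a listening socket on the resolved address. The following is a rough Python analogue of that kind of check, not the actual ns_server code; the function name and return shape are made up for illustration. Running something like it in a loop on the affected node might show whether resolution or binding fails intermittently.

```python
import socket

def can_listen(host):
    """Return (ok, reason): resolve `host`, then try to bind a socket on it.

    Hypothetical stand-in for the server-side check; it only mimics the
    two steps that the log message hints at (resolve, then bind).
    """
    try:
        infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Roughly what Erlang reports as nxdomain from inet:getaddr.
        return False, "resolve failed: %s" % exc

    family, socktype, proto, _canonname, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)
    try:
        # Port 0 asks the OS for any free port on that address.
        sock.bind((sockaddr[0], 0))
    except OSError as exc:
        return False, "bind failed: %s" % exc
    finally:
        sock.close()
    return True, "ok"
```

For example, can_listen("10.10.4.15") run on that node would be expected to succeed as long as the address is configured locally and ports are available.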
Thanks for the reply.
The servers are locally hosted in our internal network, on CentOS 6.4, with static IPs that haven’t changed. The nodes are already referenced by their IPs as their names. We could use hostnames, but since the IPs are static I don’t think it’d make much of a difference.
New cluster created last week. 6 nodes. All CentOS 6.4, static IPs, private network, our own datacenter.
In the host name field at setup we entered the IP (192.168.10.xxx) for each node, thinking the IP would be OK. Once the IP address change emails started, we set up the hosts files with the IPs and hostnames of the entire cluster.
Still getting the emails.
From what I’ve read, I don’t think I can change these to the actual hostnames without rebuilding the cluster.
HELP! Thoughts? Our NOC freaks out every time the email comes in. About 10 a day from various nodes.
Those checks can sometimes fail spuriously (a false positive) if Erlang is overloaded (e.g. by views and/or XDCR and/or smart clients torturing it with excessively frequent bucket metadata requests). So it can be an early sign of an undersized configuration or a wrongly configured host (e.g. check the recommendations for swappiness and for disabling transparent huge pages).
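As a quick way to inspect the two host settings mentioned above, here is a minimal Python sketch that reads the current swappiness and transparent huge pages mode on a Linux node. The helper names are made up for illustration. The second THP path is the location used by RHEL/CentOS 6 kernels; which values are actually recommended should be taken from the Couchbase documentation for your version.

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"
THP_PATHS = [
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",  # RHEL/CentOS 6
]

def active_thp_mode(text):
    """Extract the bracketed (active) value, e.g. '[always] madvise never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

def read_first(paths):
    """Return the stripped contents of the first path that exists, else None."""
    for candidate in paths:
        path = Path(candidate)
        if path.exists():
            return path.read_text().strip()
    return None

def host_report():
    """Collect the current swappiness and THP mode (None if unavailable)."""
    swappiness = read_first([SWAPPINESS_PATH])
    thp = read_first(THP_PATHS)
    return {
        "swappiness": int(swappiness) if swappiness is not None else None,
        "thp": active_thp_mode(thp) if thp is not None else None,
    }
```

Calling host_report() on each node gives a small dictionary that can be compared across the cluster, which makes a single misconfigured host easier to spot.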
Another possibility: due to an XDCR bug, it’s possible to exhaust the TCP port space with ports “occupied” in the TIME_WAIT state. In that state, creating or binding any socket (client or server) will not work. This condition is easy to verify by looking at netstat output and/or by examining the server logs (you would see tons of eaddrinuse errors in this case).
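To make the netstat check concrete, here is a small Python sketch that tallies TCP connection states from the text output of `netstat -ant`. It assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name is made up for illustration.

```python
from collections import Counter

def count_tcp_states(netstat_output):
    """Count TCP connections per state in `netstat -ant` style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with tcp/tcp6; header lines are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts

# Usage on a node (requires netstat to be installed):
#   import subprocess
#   out = subprocess.run(["netstat", "-ant"],
#                        capture_output=True, text=True).stdout
#   print(count_tcp_states(out)["TIME_WAIT"])
```

A TIME_WAIT count that approaches the size of the ephemeral port range (see /proc/sys/net/ipv4/ip_local_port_range) would point toward port exhaustion rather than plain overload.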
Thanks for the answer, I think you are right about the cause. The beam.smp process was quite busy (CPU and RAM intensive) at the time. We had quite a few XDCR streams going, which we consolidated into a single stream (per bucket) to a secondary cluster, where we do the XDCR stream splitting.
We also added a couple of nodes to the cluster, to be on the safe side.