So I’m at a loss. I’ve got a 5-node CE cluster running data only, with 2 views. The configuration runs fine, but give it roughly 90 days and the memcached process starts flailing, as if it’s stuck in a loop. This is on Amazon EC2, large i3 and i4 nodes, running Amazon Linux 2. There are no network errors and disk tests come back clean, but something happens that forces us to swap in new hardware; ~90 days seems to be as long as we get before the system starts timing out. I see nothing in the logs, and the node shows more reads/writes than normal for a single node, yet no single object is being requested more than others on that server. It just seems to be struggling, and I’m not finding the right log file or the right approach to figure out why.
In our monitoring, I/O wait starts climbing with no change in how we access the box and no increase in traffic load.
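If it helps, this is the sort of thing I can capture the next time it happens — just stock sysstat tools on the node, nothing Couchbase-specific, and the interval is arbitrary:

iostat -x 5                 # per-device utilization, await, and queue depth every 5s
pidstat -d -p 15933 5       # kB read/written per second by the memcached process itself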
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15933 couchba+ 20 0 153.9g 93.0g 0 S 114.9 75.0 12352:28 /opt/couchbase/bin/memcached -C /opt/couchbase/var/l+
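When it gets into that state, a per-thread view of the same PID would at least show whether it’s one worker thread pegged or everything spinning; this is just stock top, so I can grab it next time:

top -H -p 15933             # -H lists individual memcached threads with their own CPU share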
Has anyone experienced something similar, and how exactly do I track this down? If I wait the full 90 days, I start seeing increased timeouts. It really seems to start churning without any real added traffic or requests, so what is it doing? XDCR is outbound only, so it’s not an influx of XDCR data.
Also note: if I shut the service down and let it flush its memory, it seems happy again for a while. But I can’t really do that, because by that point the box is in bad shape and the process takes forever to stop, and by the time it does I’ve got a failover happening and/or a ton of failed requests.
I’m just trying to get a better understanding of where I should be looking, in the logs or in the processes, to see what it’s doing at the time of the chaos.
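For what it’s worth, these are the spots I know to check; if there’s a more useful stat or log for this situation, that’s exactly what I’m after. The cbstats flags may differ by version, and the bucket/credential names below are just placeholders:

ls /opt/couchbase/var/lib/couchbase/logs/                                         # memcached.log.*, babysitter.log, etc.
/opt/couchbase/bin/cbstats localhost:11210 timings -u Administrator -p password -b mybucket
/opt/couchbase/bin/cbcollect_info /tmp/node-report.zip                            # full support dump for the affected node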