Second post, since the first netted lots of eyeballs but no suggestions.
Feels like an SSD issue to me: at around 90 days, IOwait starts to climb. If I test the disks they come back fine, but Couchbase starts timing out and has trouble responding to queries in a timely fashion. So IOwait goes up, which pushes CPU up; the node normally runs at maybe 2% CPU, and once it hits the 90-day mark it's running at 12-20%.
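For context, by "test the disks" I mean quick latency spot checks along these lines (a simplified sketch, not our actual test; the file path is just an example, and reads here can be served from the page cache, so it's only a rough sanity check):

```python
import os
import random
import time

# Rough read-latency spot check on the data volume.
# The path is illustrative; point it at a large existing file on the same disk.
PATH = "/opt/couchbase/var/lib/couchbase/data/testfile"
BLOCK = 4096
SAMPLES = 1000

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
latencies = []
for _ in range(SAMPLES):
    offset = random.randrange(0, max(size - BLOCK, 1))
    start = time.perf_counter()
    os.pread(fd, BLOCK, offset)          # reads may hit the page cache
    latencies.append((time.perf_counter() - start) * 1000)
os.close(fd)

latencies.sort()
print("p50 %.3f ms  p99 %.3f ms  max %.3f ms" % (
    latencies[len(latencies) // 2],
    latencies[int(len(latencies) * 0.99)],
    latencies[-1],
))
```

Those numbers always look healthy, which is part of why I don't think the drive itself is dying.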
So I’m at a loss. I’ve got a 5-node CE cluster running data only, with 2 views. The configuration runs fine, but give it about 90 days and the memcached process starts flailing; it feels like it’s stuck in a loop or something. This is on Amazon EC2, large i3 and i4 instances, running Amazon Linux 2. No network errors, disk tests come back clean, but something happens that forces us to swap in new hardware; roughly 90 days seems to be as long as we get before the system starts timing out. Nothing in the logs that I can see. The node shows it’s writing/reading more data than normal, but no single object is being requested more than others on that server. It just seems to be having an issue, and I’m not finding the right log file or the right approach to figure it out.
In our monitoring we see IOwait start climbing with no change in how we access the box and no increase in traffic load.
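Here's roughly what the monitoring check boils down to (a minimal sketch using psutil; the real thing lives in our monitoring stack, and the intervals here are just illustrative):

```python
import psutil

# Find the Couchbase memcached process by its executable path.
memcached = next(
    p for p in psutil.process_iter(["pid", "exe"])
    if (p.info["exe"] or "").endswith("/opt/couchbase/bin/memcached")
)

# System-wide iowait over a 5 second window (Linux only).
cpu = psutil.cpu_times_percent(interval=5)

# Per-process CPU and cumulative disk IO counters.
proc_cpu = memcached.cpu_percent(interval=1)
io = memcached.io_counters()

print(f"iowait       : {cpu.iowait:.1f}%")
print(f"memcached cpu: {proc_cpu:.1f}%")
print(f"read bytes   : {io.read_bytes}")
print(f"write bytes  : {io.write_bytes}")
```

When the node goes bad, the iowait and memcached CPU numbers climb together, while the request rate we send it stays flat. Current `top` on the affected node: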
```
PID   USER     PR NI VIRT   RES   SHR S %CPU  %MEM TIME+    COMMAND
15933 couchba+ 20 0  153.9g 93.0g 0   S 114.9 75.0 12352:28 /opt/couchbase/bin/memcached -C /opt/couchbase/var/l+
```
Has anyone seen anything similar, or have an idea how I can track it down? Also note: if I shut the service down and let it flush its memory, it seems happy for a bit, but I can’t really do this, because the box is in bad shape and it takes forever for the process to stop; by that time I’ve got a failover happening and/or a ton of failed requests.
Just trying to get a better understanding of where I should be looking in the logs or processes to see what it’s doing at the time of the chaos.
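So far I've mostly just been scanning the default log directory for anything that looks like a slow-op or timeout warning around the incident window, roughly like this (a sketch; the path is the default install location on our nodes and the search terms are guesses at what might be relevant):

```python
import glob
import re

# Default log location on our installs; adjust if yours differs.
LOG_DIR = "/opt/couchbase/var/lib/couchbase/logs"

# Strings I've been grepping for; these are guesses, not a definitive list.
PATTERN = re.compile(r"slow|timeout|error", re.IGNORECASE)

for path in sorted(glob.glob(f"{LOG_DIR}/memcached.log*")):
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            if PATTERN.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")
```

Nothing obviously wrong shows up that way, so I may simply be looking at the wrong files or the wrong strings. Any pointers appreciated.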