I have 4 Couchbase nodes running version 5.0.1-5003 Community Edition with the following configuration:
32 GB memory per node
3 buckets with 2 replicas
Data service: 19 GB
Index service: 6 GB
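For context, the equivalent couchbase-cli command for these service quotas would look roughly like this (hostname and credentials are placeholders; values are in MB):

# Data service quota ~19 GB, index service quota ~6 GB
couchbase-cli setting-cluster -c node1:8091 -u Administrator -p password --cluster-ramsize 19456 --cluster-index-ramsize 6144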
Since we started using N1QL, we have been having a very serious issue: Couchbase nodes get killed by the OOM killer because the indexer and/or cbq processes go far beyond the configured memory size.
OOM Killer
Jun 7 07:46:50 node2 kernel: [243379.269217] Out of memory: Kill process 9185 (memcached) score 393 or sacrifice child
Jun 7 07:46:50 node2 kernel: [243379.270200] Killed process 9185 (memcached) total-vm:21640716kB, anon-rss:10934144kB, file-rss:0kB, shmem-rss:4kB
We were using N1QL indexes before, but they were simple ones and not heavily used. The new index that seems to trigger the memory overconsumption is the one in this post.
Hello @tchlyah,
Can you please give details about:
Number of documents?
Document size (average size)?
Working set residency required (e.g. 80% of data needs to be resident)?
Typically, if the cluster is undersized and comes under memory pressure, the OS will invoke the OOM killer, and in Couchbase's case memcached is usually the largest consumer, so it gets killed.
Thanks for giving details about the setup. I appreciate it.
Yes, the number of documents is not large in this case.
I see that you are using the N1QL/Query service. By any chance, do you have a primary index on your index nodes?
Is the Index service running separately on its own node, or is it shared with other services?
Are your current SLAs being met? And what are they?
We don’t recommend using primary indexes on production clusters.
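In case it is useful, a quick way to verify from the cbq shell that no primary index exists, and to drop one if found (the bucket name is a placeholder):

cbq> SELECT name, keyspace_id FROM system:indexes WHERE is_primary = true;
cbq> DROP PRIMARY INDEX ON `mybucket`;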
No, we don’t have any primary index! All our queries use indexes created specifically for them.
No, unfortunately we do not have Enterprise Edition yet, and I can’t do anything about that for now, so we can’t separate the index/query services from the data service.
Until now we didn’t have any issues with Couchbase; our SLA is being met.
Glad to know that you don’t have a primary index. You can separate individual services: while adding a new server, pick the services you need on that node, then hit rebalance.
The same thing can be done on an already existing cluster, but in a rolling fashion: remove a node, rebalance, re-add the node, and this time choose the services.
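A rough sketch of that rolling procedure with couchbase-cli (hostnames, credentials and the chosen services are placeholders):

# Remove one node from the cluster and rebalance it out
couchbase-cli rebalance -c node1:8091 -u Administrator -p password --server-remove node4:8091
# Add it back with only the services you want it to run, then rebalance again
couchbase-cli server-add -c node1:8091 -u Administrator -p password --server-add node4:8091 --server-add-username Administrator --server-add-password password --services index,query
couchbase-cli rebalance -c node1:8091 -u Administrator -p password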
I would need more context for the newly created index.
I see what is happening. Since each node ends up running multiple services, and those services are resource intensive even when they are not in use, memcached is getting killed as a result.
At this point, your options with CE are:
Give more resources to the cluster to offset the other services running on all the nodes.
With EE:
You will be able to assign individual services to specific nodes, thus providing resource isolation.
You get better support: our support organization gets access to the logs and analyzes them in a timely manner.
You get timely and thoroughly tested releases on a quick cadence.
I do want to switch to EE, but this isn’t my decision, and this kind of serious bug doesn’t encourage the business to do so; it even incites me to look at the competition…
I’ve doubled the RAM of the 4 nodes (64 GB each), and it doesn’t change anything: the hosts continue to swap like crazy, and after tweaking the Linux OOM killer so it doesn’t kill the memcached process (see the sketch below), it’s the indexer and cbq that are being killed.
This is clearly a memory leak! I understand that you offer support only for EE, but that doesn’t mean you should leave CE with serious bugs like this!
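For the record, the OOM killer tweak mentioned above was along these lines (a sketch only; the adjustment value and the pgrep lookup are illustrative):

# Make the kernel strongly prefer other processes over memcached when out of memory
for pid in $(pgrep -x memcached); do echo -1000 > /proc/$pid/oom_score_adj; done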
With a four-node cluster, two replicas (three copies of the data) is not optimal: the cluster will keep a working set of all three copies of your data in RAM. I would go down to one replica, allocate a larger percentage of RAM to the cluster, and increase the RAM allocation to the indexer to see if this alleviates the OOM issue.
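A rough sketch of those changes with couchbase-cli; the bucket name, credentials and quota values are placeholders to adapt to your sizing:

# Drop the replica count to 1 on a bucket (takes effect after a rebalance)
couchbase-cli bucket-edit -c node1:8091 -u Administrator -p password --bucket mybucket --bucket-replica 1
couchbase-cli rebalance -c node1:8091 -u Administrator -p password
# Raise the data and index service quotas (values in MB)
couchbase-cli setting-cluster -c node1:8091 -u Administrator -p password --cluster-ramsize 24576 --cluster-index-ramsize 10240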