We have a cluster in MDS with the following topology:
Data service nodes
Analytics service nodes
Index/Query service nodes
We are experiencing a daily failover of the Index/Query nodes, with Critical Memory consumption warnings.
We suspect that the main problem is the Query service (we have already seen query nodes consume too much memory and fail over).
The memory assigned to the services respects the threshold of the ASK node, but memory usage on the nodes keeps going over 90%.
Is there a best practice in the MDS architecture to avoid this type of disruption? Or some way to limit the memory used by the Query service?
Do you suspect any issue other than the Query service?
Yeah, it’s a production environment with 3 index/query nodes.
Indexes are in the standard storage mode (Plasma) and about 100 GiB is reserved for the Index service on each node.
We have hundreds of GSI indexes, some of them with 1 replica.
And typically a primary index for each bucket (about 28) + primary indexes for specific collections.
Do you suspect the problem resides in the indexes?
If you are on EE, try to find which queries are causing the memory consumption by checking system:completed_requests
(long request times, high fetch count, high indexScan count) and see if you can optimize those queries.
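Something like the following could be a starting point for that check. It is only a sketch: it builds a query against system:completed_requests that lists the ten logged requests with the most document fetches, and sends it to the Query service REST endpoint. The host, user and password are placeholders, 8093 is assumed to be the (non-TLS) query port, and sorting on the fetch count rather than on elapsed time is just one choice. Keep in mind that system:completed_requests only keeps requests that ran longer than the configured threshold, so short queries will not show up.

```python
import base64
import json
import urllib.parse
import urllib.request

# Ten logged requests with the most document fetches.
# `fetch` is backtick-quoted because FETCH is a reserved word in N1QL.
STATEMENT = (
    "SELECT statement, elapsedTime, serviceTime, resultCount, "
    "phaseCounts.`fetch` AS fetches, phaseCounts.`indexScan` AS indexScans "
    "FROM system:completed_requests "
    "ORDER BY phaseCounts.`fetch` DESC LIMIT 10"
)


def build_request(host, user, password, statement=STATEMENT, port=8093):
    """Build (but do not send) a POST to the Query service REST endpoint.

    host, user, password and port are placeholders for your own cluster.
    """
    body = urllib.parse.urlencode({"statement": statement}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(f"http://{host}:{port}/query/service", data=body)
    req.add_header("Authorization", f"Basic {token}")
    return req


def run_query(host, user, password, statement=STATEMENT):
    """Send the request and return the 'results' array of the JSON reply."""
    with urllib.request.urlopen(build_request(host, user, password, statement)) as resp:
        return json.load(resp)["results"]
```

Calling run_query("your-query-node", "user", "password") against one of the query nodes should return a list of dicts, one per logged request. Queries with a very high fetch count and a comparatively small resultCount are the usual candidates for a better (covering) index. You can also paste the statement straight into the Query Workbench or cbq instead of going through Python.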