Total mutations remaining / drain rate

We are using Enterprise 6.0.

We have an index node plus a replica, with a large set of indexes.

We are watching the total mutations remaining stat climb and climb (with any query requiring consistency timing out) and looking in dismay at the drain rate. The mutations remaining are in the tens of thousands and growing, and the drain rate is maybe in the teens when it isn’t zero.

I can find nothing at all online explaining the impact of this phenomenon or how to go about increasing the rate.

What is the performance significance of this, and how can we mitigate it? Is this a function of how many indexes we have per node? Can we get some guidance?

I’m also having difficulty reading the stats.

I navigate into an index node and inspect the stats provided per index. This is where I see the ever-growing total mutations remaining figure and the anemic drain rate. The numbers are the same for all the indexes, so I assume it’s an index-node-level figure. Is this correct?

The RAM profile for the server looks great: 30% of RAM used and north of 100 GB of index RAM remaining.

@naftali,

A slow drain rate can be observed in the following cases:

a. The indexer memory quota is full and the indexer is throttled on disk
b. The ingestion rate into the indexer is slow (possibly due to a large number of indexes or low network bandwidth)
c. The indexer is throttled on CPU

It looks like the first scenario is not the case here. Is the drain rate stuck at zero, or is it fluctuating slowly? Also, how many indexes are you building?

You can look at some of the checks below (a rough shell sketch for checks 1 and 3 follows the list):

  1. The CPU utilisation on the indexer node while the index build is happening - is there enough CPU, or does the CPU look saturated?

  2. Can you check the mutation_queue_size stat for the bucket for which the index is building? If the mutation queue size is close to zero, then it is possible that the network bandwidth is low. Can you also check the network bandwidth?

  3. In the indexer.log for the node on which the index is being built (located at /opt/couchbase/var/lib/couchbase/logs/), check whether you are seeing the "Indexer <> blocked" messages
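
For reference, checks 1 and 3 can be done with something like the following on the indexer node (a rough sketch assuming a default Linux install; the exact wording of the blocked message may vary by version):

# Check 1: is the indexer process saturating the CPU?
top -b -n 1 | grep -i indexer

# Check 3: recent "Indexer <> blocked" messages in indexer.log
grep -i "blocked" /opt/couchbase/var/lib/couchbase/logs/indexer.log | tail -n 20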

After you eliminate the above three possibilities, if you still see a low drain rate, we may need the cbcollect logs to understand why index building is slow. The procedure for running cbcollect is described in the following link:
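
For reference, cbcollect_info can also be run directly on each node from the command line; a rough sketch (default Linux install path; the output file name is arbitrary):

# Run on each node; produces a zip of logs and diagnostics for that node
/opt/couchbase/bin/cbcollect_info /tmp/cbcollect_$(hostname).zip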

Thanks,
Varun

I cannot find a mutation_queue_size stat among the bucket statistics. Can you please specify where I might find it?

You can find this stat in indexer.log (/opt/couchbase/var/lib/couchbase/logs). Check the mutation_queue_size for the bucket for which the number of documents pending is large. Please paste the output of both mutation_queue_size and the number of documents pending.
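
One quick way to pull the recent values out of indexer.log is something like the following (a sketch; it assumes the pending-documents counter appears as num_docs_pending in the periodic stats lines, which may differ by version):

# Most recent occurrences of the queue size and pending-documents stats
grep -E "mutation_queue_size|num_docs_pending" /opt/couchbase/var/lib/couchbase/logs/indexer.log | tail -n 20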

What do these "Indexer <> blocked" messages mean, and what does that tell us?

These messages mean that the indexer received the mutations, but the downstream part of the indexer process was busy with work. This caused the indexer's internal queues to fill up, hence these messages.

Does this stat include mutations that have not yet been ingested by the indexer?

Yes

And when the index node says, for example, 10k mutations remaining, is that number updated as mutation requests come into the indexer, or does it increase as mutations still need to be sent to it?

This number is the count of mutations that still need to be sent to the indexer.

From the indexer blocked messages, I am guessing that a downstream component in the indexer is slow. This will result in a slow ingestion rate into the indexer as well. I need the cbcollect logs to figure out what is happening. (If the cbcollect files are huge, can you at least share the indexer.log files from the time of the issue?)

Thanks,
Varun

@varun.velamuri
Thank you so much. I will work on providing everything you asked for.

But I really wish to know: what is downstream of the indexer?

@naftali

What is downstream of the indexer?

Storage is a downstream component of the indexer, i.e. the indexer receives the mutations and pushes them to storage. If writing to storage is slow, then the indexer queues will fill up. Depending on the storage type, this could be in-memory only (for memory-optimized indexes) or both in-memory and disk (for standard global secondary indexes).

Thanks,
Varun

@naftali,

I have received the logs. I see 137 indexes in total. Can you share the name of one particular index for a quick start?

Thanks,
Varun

@varun.velamuri
sure: xyz_234_idx

choosing one at random

@naftali,

I suspect a hang on the indexer side that could be causing the slow drain rate. To my knowledge, we have not seen such an issue so far. I need more time to debug and understand whether there really is a hang. Also, I need some more information:

a. Can you get goroutine dumps from both the indexer nodes? For each indexer node, I need 2 goroutine dumps separated by a span of 30 seconds. You can use the command below to get a goroutine dump (see the sketch after this list for one way to script it):

curl -u <username>:<password> 'http://127.0.0.1:9102/debug/pprof/goroutine?debug=2'

Collecting goroutine dumps 30 seconds apart will confirm whether any goroutine got stuck. Kindly share all 4 dumps (2 indexer nodes, 2 goroutine dumps 30 seconds apart on each)

b. After you have collected the goroutine dumps, can you restart the indexers one at a time and let us know your observations? (Note: restarting an indexer can cause scan failures until the indexer node bootstraps, if no replica index is available.)
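
One way to script step (a) on each indexer node, as a rough sketch (credentials and output file names are placeholders):

# Take two goroutine dumps 30 seconds apart on this indexer node
curl -u <username>:<password> 'http://127.0.0.1:9102/debug/pprof/goroutine?debug=2' > goroutine_$(hostname)_1.txt
sleep 30
curl -u <username>:<password> 'http://127.0.0.1:9102/debug/pprof/goroutine?debug=2' > goroutine_$(hostname)_2.txt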

Thanks,
Varun

thank you so much @varun.velamuri

I’m working on getting you what you need; hopefully that will be today, but given the permissions involved it may be tomorrow.

These two nodes are one index node plus its replica.

@naftali,

I was able to download the logs. I will analyse them and come back to you.

PS: If you are already an Enterprise customer, you can get in touch with the Couchbase Professional Services team for help.

Thanks,
Varun

@varun.velamuri

Sadly, the restart didn’t help.

What do you see in the logs?

@naftali,

I do not see any hang on the indexer side. From the logs, I see that a lot of garbage is being generated during mutation processing, because of which Go's stop-the-world GC is being invoked and is taking a long time to finish (around 3-7 seconds). This could be one of the reasons for the slowdown in the mutation processing loop.

I need CPU and memory profiles from both the indexer nodes to debug the issue further. You can get the CPU and memory profiles using the commands below:

curl -u <username>:<password> http://127.0.0.1:9102/debug/pprof/profile\?seconds\=60
curl -u <username>:<password> http://127.0.0.1:9102/debug/pprof/profile/heap

Also, can you please try changing the setting below and let us know the impact on the drain rate:

curl -u <username>:<password> http://127.0.0.1:9102/settings -d '{"indexer.settings.gc_percent":200}'

Please note that increasing the GC percent means the garbage collector won't be triggered as frequently as before, and in some cases this can cause an OOM kill.
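
To confirm the change took effect, the same endpoint can be read back (this assumes it also serves GET and returns the current settings as JSON, so treat it as a sketch):

# Read back the current indexer settings and look for indexer.settings.gc_percent in the output
curl -u <username>:<password> http://127.0.0.1:9102/settings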

Thanks,
Varun

@varun.velamuri
thank you.

I’m not completely sure how to make the settings change.

Do I run that on each of the problematic index nodes, or is it a cluster-wide setting that I can make on any (indexer?) node?

@naftali,

You need to change that on each of the indexer nodes. You can start with one node and observe the drain rate on that node. Also, before making the change, please collect the CPU and memory profiles.

Thanks,
Varun


@naftali If possible, it would be great if you could open a Couchbase support case. It would allow us to streamline troubleshooting the issue(s) you are facing, with better coordination. Thanks.

@jeelan.poola

Of course, how do I do that?

@varun.velamuri

We are trying to get hold of the heap dump but getting the below error:

http://127.0.0.1:9102/debug/pprof/profile/heap

“Unknown profile: profile/heap”

Please let me know how to get you the heap dump

@naftali,

Sorry for the typo. Please use the command below:

curl -u <username>:<password> http://127.0.0.1:9102/debug/pprof/heap

Please also do a cbcollect after collecting the profiles and share it as well, so that we can correlate with the stats.

Thanks,
Varun

@naftali,

Thanks for sharing the CPU and memory profiles. From the profiles, I see that most of the time is being spent indexing array items. This is also why so much garbage is being generated. Below are the top 10 consumers of CPU on cb12; the cb6 node looks similar.

File: indexer
Type: cpu
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 73.65s, 89.72% of 82.09s total
Dropped 513 nodes (cum <= 0.41s)
Showing top 10 nodes out of 99
      flat  flat%   sum%        cum   cum%
    45.98s 56.01% 56.01%     45.98s 56.01%  runtime.cmpbody
    19.17s 23.35% 79.36%     68.78s 83.79%  github.com/couchbase/indexing/secondary/indexer.CompareArrayEntriesWithCount
     3.86s  4.70% 84.07%      3.86s  4.70%  bytes.Compare
     1.07s  1.30% 85.37%      1.07s  1.30%  runtime.futex
     0.81s  0.99% 86.36%      1.77s  2.16%  runtime.mallocgc
     0.65s  0.79% 87.15%      0.87s  1.06%  github.com/couchbase/indexing/secondary/indexer.(*mutationStreamReader).maybeSendSync
     0.64s  0.78% 87.93%      0.66s   0.8%  syscall.Syscall
     0.54s  0.66% 88.59%      0.54s  0.66%  runtime.usleep
     0.53s  0.65% 89.23%      0.53s  0.65%  runtime.memmove
     0.40s  0.49% 89.72%      0.49s   0.6%  github.com/couchbase/indexing/secondary/pipeline.(*BlockBufferWriter).Put
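
For reference, this summary can be reproduced from the collected profile with Go's pprof tooling, e.g. (assuming the CPU profile was saved locally as cpu.pprof):

# Open the profile interactively, then run "top 10" at the (pprof) prompt
go tool pprof cpu.pprof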

This means the following:

a. The number of entries in the array is large. Is this the case? Can you kindly share a sample doc so that we can understand the JSON schema and why the CompareArrayEntriesWithCount method is taking almost 80% of the CPU?

b. You seem to have 18 array indexes in the cluster. You may have to rework the index definitions and see whether all of them are required. Dropping some of the unused array indexes can help improve the drain rate (see the example query after this list for one way to review the definitions).

c. Additionally, you may consider upgrading to the 6.0.4/6.5.0 version of couchbase-server because of a known issue that is also one of the reasons for large garbage generation when indexing array elements. We have fixed this in 6.0.3.
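
One way to review the existing index definitions is to query system:indexes through the query service; a rough sketch (the host/port and credentials are placeholders, and array indexes typically show ARRAY constructs in index_key):

# List index names, keyspaces and key expressions to spot array indexes
curl -u <username>:<password> http://127.0.0.1:8093/query/service \
  --data-urlencode 'statement=SELECT name, keyspace_id, index_key FROM system:indexes'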

Thanks,
Varun


@varun.velamuri

You absolutely nailed it.

Yes, a frequently updated array that’s covered by an index.

Very large array.

We were able to drop that index and we look healthy again.

I cannot overstate how well this back and forth went and can’t thank you enough.

The way you, and by extension Couchbase, handled this issue is a huge confidence builder for the platform.

Thank you so much.

(I will be deleting a number of posts on this thread in the interest of privacy)