Couchbase 6.0 Community edition
The indexer process periodically raises CPU usage to 100%, and we see a lot of goroutines and native threads.
We have to periodically remove the node from the cluster and rebuild the indices.
Is this a known issue with Community Edition 6.0?
What info should we gather?
Hi @kolchanov,
Please let me know the OS name, the number of CPU cores, and the RAM size.
Also, collect a CPU profile, a memory profile, and a goroutine dump at/around the time CPU usage reaches 100%.
The command to collect a CPU profile is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/profile > cpu_profile.pprof
The command to collect a memory profile is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/heap?debug=1 > memory_profile.pprof
The command to collect a goroutine dump is:
curl -X GET -u <Cluster_admin-username>:<password> http://<hostname>:9102/debug/pprof/goroutine?debug=1 > goroutine_dump.pprof
Also, check the indexer.log file to see whether any specific operation, for example ForestDB storage compaction, is running around the time CPU goes to 100%.
Thanks.
We have 3 nodes with 8 CPUs and 96 GB RAM per node.
I’ll update with profile info soon.
We observed a lot of ForestDB native threads.
No success with the goroutine dump - the file is 0 bytes.
curl -X GET -u :* http://172.31.200.16:9102/debug/pprof/goroutine?debug=1 > goroutine_dump.pprof
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:26 --:--:--     0
CPU profile
http://gofile.me/2R1hC/DZ6y4lI1F
CPU profile, 15 minutes later
http://gofile.me/2R1hC/iMqEPeYso
Goroutine dump
http://gofile.me/2R1hC/FW66RNSS2
Goroutine dump, 15 minutes later
http://gofile.me/2R1hC/fsR3udkdu
Indexer.log
http://gofile.me/2R1hC/el5SzrMYu
Any recommendations?
New indexer problem
2020-06-07T23:27:30.625+09:00 [Info] memstats {"Alloc":109856784, "TotalAlloc":71316733712, "Sys":215991384, "Lookups":46105, "Mallocs":1095795547, "Frees":1094855168, "HeapAlloc":109856784, "HeapSys":197689344, "HeapIdle":84148224, "HeapInuse":113541120, "HeapReleased":63586304, "HeapObjects":940379, "GCSys":9625600, "LastGC":1591539973586314423, "PauseTotalNs":1500641774, "PauseNs":[905011], "NumGC":1039}
2020-06-07T23:27:04.523+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39704" doReceive() …
2020-06-07T23:27:31.244+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39746" doReceive() …
2020-06-07T23:27:47.121+09:00 [Info] [Queryport ":9101"] connection "172.31.200.15:31498" doReceive() …
2020/06/07 23:28:02 revrpc: Got error (EOF) and will retry in 1s
2020-06-07T23:27:09.783+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39692" doReceive() …
2020-06-07T23:37:42.144+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:41951" doReceive() …
2020-06-07T23:32:28.549+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:39958" doReceive() …
2020-06-07T23:32:43.182+09:00 [Info] StorageMgr::handleCreateSnapshot Added New Snapshot Index: 10711316375339803713 PartitionId: 0 SliceId: 0 Crc64: 9887827889855871450 (SnapshotInfo: seqnos: 1745052, 1745052, 1499 committed:true) SnapCreateDur 4m20.163343113s SnapOpenDur 1m25.874407237s
2020/06/07 23:43:10 revrpc: Got error (EOF) and will retry in 1s
2020-06-07T23:43:12.797+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:45321" doReceive() …
2020-06-07T23:43:16.593+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:63649" doReceive() …
And then the indexer was killed:
2020-06-07T23:44:12.327+09:00 [Info] [Queryport ":9101"] connection "172.31.200.16:48227" doReceive() …
2020-06-07T23:44:12.327+09:00 [Info] [Queryport ":9101"] connection "172.31.200.17:41744" doReceive() …
2020-06-07T23:32:46.727+09:00 [Info] [Queryport ":9101"] connection "172.31.200.15:31684" doReceive() …
[goport(/u01/app/couchbase/bin/indexer)] 2020/06/07 23:46:19 child process exited with status 143
2020-06-07T23:46:19.807+09:00 [Info] Indexer started with command line: [/u01/app/couchbase/bin/indexer.bin -vbuckets=1024 -cluster=127.0.0.1:8091 -adminPort=9100 -scanPort=9101 -httpPort=9102 -streamInitPort=9103 -streamCatchupPort=9104 -streamMaintPort=9105 -storageDir=/u01/app/couchbase/var/lib/couchbase/data/@2i -diagDir=/u01/app/couchbase/var/lib/couchbase/crash -nodeUUID=17b60be0c1ffb7f18fa42291e49d6843 -ipv6=false -isEnterprise=false]
2020-06-07T23:46:19.827+09:00 [Info] Indexer::NewIndexer Status Warmup
Hi @kolchanov,
Thanks for sharing the information.
I checked the attached CPU profiles and goroutine dumps and did not see anything that stands out. When you collected these profiles, did you see CPU consumption reaching 100%?
Also, when CPU consumption reached 100%, were a lot of new documents being indexed? Or were a lot of scans being served by the indexer?
I will need more time to analyse the indexer.log file in detail.
Regarding the indexer process getting killed, did you check dmesg or /var/log/messages? Were there any logs about that? Sometimes, when memory sizing is not done properly, we have seen the indexer exit with an "Out Of Memory" error.
Thank you for your attention.
We have upgraded to 6.5.1 CE and the problem is gone.
I want to share a detailed analysis from my colleague:
It seems that during the last incidents the problem looked like this.
Couchbase tries to process some snapshots and initializes an iterator for the KV storage here:
32 @ 0x408a90 0x4bd770 0x4cbb6a 0x4ef352 0x501ac1 0x501852 0x5cb45b 0x5cadde 0x5c9f26 0x5ad7b0 0xb5da65 0x465261
# 0x4bd76f github.com/couchbase/indexing/secondary/fdb._Cfunc_fdb_iterator_init+0x4f ??:0
# 0x4cbb69 github.com/couchbase/indexing/secondary/fdb.(*KVStore).IteratorInit+0x419 goproj/src/github.com/couchbase/indexing/secondary/fdb/iterator.go:234
# 0x4ef351 github.com/couchbase/indexing/secondary/indexer.(*ForestDBIterator).Seek+0xc1 goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_iterator.go:128
# 0x501ac0 github.com/couchbase/indexing/secondary/indexer.(*fdbSnapshot).Iterate+0x180 goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_snapshot_reader.go:208
# 0x501851 github.com/couchbase/indexing/secondary/indexer.(*fdbSnapshot).Range+0xb1 goproj/src/github.com/couchbase/indexing/secondary/indexer/forestdb_snapshot_reader.go:180
# 0x5cb45a github.com/couchbase/indexing/secondary/indexer.scanSingleSlice+0x41a goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:206
# 0x5caddd github.com/couchbase/indexing/secondary/indexer.scanOne+0x11d goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:148
# 0x5c9f25 github.com/couchbase/indexing/secondary/indexer.scatter+0xc5 goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:55
# 0x5ad7af github.com/couchbase/indexing/secondary/indexer.(*IndexScanSource).Routine+0x80f goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_pipeline.go:336
# 0xb5da64 github.com/couchbase/indexing/secondary/pipeline.(*Pipeline).runIt.func1+0x44
This corresponds to calls into the following C code (in libforestdb.so, part of Couchbase):
#1 0x00007f2834faae5a in bcache_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#2 0x00007f2834fc2887 in filemgr_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#3 0x00007f2834fb3c97 in btreeblk_read(void*, unsigned long) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#4 0x00007f2834fad503 in _btree_next(btree_iterator*, void*, void*, int) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#5 0x00007f2834fada3e in _btree_next(btree_iterator*, void*, void*, int) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#6 0x00007f2834fb0ba0 in btree_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#7 0x00007f2834fddfea in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#8 0x00007f2834fde689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#9 0x00007f2834fde689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#10 0x00007f2834fdfed7 in hbtrie_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#11 0x00007f2834fe444a in fdb_iterator_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#12 0x00007f2834fe49e8 in fdb_iterator_init () from /u01/app/couchbase/bin/../lib/libforestdb.so
#13 0x0000000000cdc16c in _cgo_8f4f063755d2_Cfunc_fdb_iterator_init ()
#14 0x0000000000464050 in runtime.asmcgocall ()
#15 0x000000c428480688 in ?? ()
#16 0x0000000000408a6d in runtime.cgocall ()
#17 0x000000c428480698 in ?? ()
#18 0x000000c426d92100 in ?? ()
#19 0x000000c428480688 in ?? ()
#20 0x0000000000000950 in ?? ()
#21 0x000000c425e49520 in ?? ()
#22 0x0000000000462399 in runtime.systemstack ()
#23 0x0000000000436970 in _start ()
There are about a hundred similar threads consuming CPU in a spin lock.
However, one of them is doing useful work:
Thread 11 (Thread 0x7f05099fb700 (LWP 1206)):
#0 0x00007f077a480d69 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007f077ada3ecc in bcache_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#2 0x00007f077adbb887 in filemgr_read () from /u01/app/couchbase/bin/../lib/libforestdb.so
#3 0x00007f077adacc97 in btreeblk_read(void*, unsigned long) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#4 0x00007f077ada7b71 in btree_init_from_bid () from /u01/app/couchbase/bin/../lib/libforestdb.so
#5 0x00007f077add70a3 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#6 0x00007f077add7689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#7 0x00007f077add7689 in _hbtrie_next(hbtrie_iterator*, btreeit_item*, void*, unsigned long*, void*, unsigned char) () from /u01/app/couchbase/bin/../lib/libforestdb.so
#8 0x00007f077add8ed7 in hbtrie_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#9 0x00007f077addd44a in fdb_iterator_next () from /u01/app/couchbase/bin/../lib/libforestdb.so
#10 0x00007f077addd9e8 in fdb_iterator_init () from /u01/app/couchbase/bin/../lib/libforestdb.so
#11 0x0000000000cdc16c in _cgo_8f4f063755d2_Cfunc_fdb_iterator_init ()
#12 0x0000000000464050 in runtime.asmcgocall ()
#13 0x000000c425f94688 in ?? ()
#14 0x0000000000408a6d in runtime.cgocall ()
#15 0x000000c425f94698 in ?? ()
#16 0x000000c4276a9f80 in ?? ()
#17 0x000000c425f94688 in ?? ()
#18 0x0000000000000950 in ?? ()
#19 0x000000c428554820 in ?? ()
#20 0x0000000000462399 in runtime.systemstack ()
#21 0x0000000000436970 in _start ()
I think there is heavy contention on a single lock. I suppose that with an insufficient RAM quota, memory operations become slower, so processing falls behind the rate of incoming requests. The relevant function is bcache_read:
int bcache_read(struct filemgr *file, bid_t bid, void *buf)
{
    struct hash_elem *h;
    struct bcache_item *item;
    struct bcache_item query;
    struct fnamedic_item *fname;

    // Note that we don't need to grab bcache_lock here as the block cache
    // is already created and binded when the file is created or opened for
    // the first time.
    fname = file->bcache;

    if (fname) {
        // file exists
        // set query
        query.bid = bid;
        // Update the access timestamp.
        struct timeval tp;
        gettimeofday(&tp, NULL); // TODO: Need to implement a better way of
                                 // getting the timestamp to avoid the overhead of
                                 // gettimeofday()
        atomic_store_uint64_t(&fname->access_timestamp,
                              (uint64_t) (tp.tv_sec * 1000000 + tp.tv_usec));

        size_t shard_num = bid % fname->num_shards;
        spin_lock(&fname->shards[shard_num].lock);

        // search shard hash table
        h = hash_find(&fname->shards[shard_num].hashtable, &query.hash_elem);
        if (h) {
            // cache hit
            item = _get_entry(h, struct bcache_item, hash_elem);
            if (item->flag & BCACHE_FREE) {
                spin_unlock(&fname->shards[shard_num].lock);
                DBG("Warning: failed to read the buffer cache entry for a file '%s' "
                    "because the entry belongs to the free list!\n",
                    file->filename);
                return 0;
            }

            // move the item to the head of list if the block is clean
            // (don't care if the block is dirty)
            if (!(item->flag & BCACHE_DIRTY)) {
                // TODO: Scanning the list would cause some overhead. We need to devise
                // the better data structure to provide a fast lookup for the clean list.
                list_remove(&fname->shards[shard_num].cleanlist, &item->list_elem);
                list_push_front(&fname->shards[shard_num].cleanlist, &item->list_elem);
            }

            memcpy(buf, item->addr, bcache_blocksize);
            _bcache_set_score(item);

            spin_unlock(&fname->shards[shard_num].lock);
            return bcache_blocksize;
        } else {
            // cache miss
            spin_unlock(&fname->shards[shard_num].lock);
        }
    }
    // does not exist .. cache miss
    return 0;
}
At first glance, this code has at least one problem: the spin should have a limited iteration count, after which it should fall back to another lock type that does not consume CPU.
Also, there seems to be a problem with throttling tasks on the Go side to prevent spawning a large number of threads.
@sduvuru Could you please look into this from ForestDB perspective? Thank you!
Unfortunately, the 6.5.1 indexer hung again.
cpu_profile_2020-06-11_15-07-56.pprof
http://gofile.me/2R1hC/e3y44ySXg
cpu_profile_2020-06-11_15-09-00.pprof
http://gofile.me/2R1hC/9H753A0oM
goroutine_dump_2020-06-11_15-07-56.pprof
http://gofile.me/2R1hC/eC1K3phyb
goroutine_dump_2020-06-11_15-08-56.pprof
http://gofile.me/2R1hC/uuGDDAI9P