Hey there,
currently our Sync Gateway is failing to sync data to mobile Couchbase Lite clients. We believe that's due to internal queries timing out in Couchbase Server, since our sg_warn.log fills up with dozens of lines like the following whenever we try to sync:
2024-06-04T08:32:00.259Z [WRN] c:[1841182f] Error retrieving changes for channel "<ud>dn_f092c223-d485-4fe3-9d17-70cdcd4e51be</ud>": Timeout performing Query -- db.(*Database).buildRevokedFeed.func1() at changes.go:223
We can also observe that all our CPU cores on all machines go up to about 95% and stay there for about 30-60s when syncing.
It would be amazing if someone could give me ideas on how to debug this, since I'm hitting a wall with my limited knowledge of CB internals. I'll drop my findings so far below and would be grateful if someone could point us in the right direction. It would also help to know the exact query Sync Gateway uses for such a channel, so we can debug the query itself.
Our test setup:
- 3 machines with CB Server in a cluster, 4 cores / 8 GB RAM each
- Fairly small bucket with about 140k docs
- One Sync Gateway node (originally 3) to simplify debugging
Additional info:
- Longpolling changes with the same affected users seems to work without problems
- Using the SGW API does seem to work without problems (fetching docs etc.)
- A user with access to only one document is sometimes able to sync (I assume when there is no timeout), but even that pushes the cluster to nearly 100% CPU, and the sync often fails
The only observable queries in the CB query monitor interface are statements like the following:
PREPARE SELECT [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][1] AS seq,
        [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][2] AS rRev,
        [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][3] AS rDel,
        meta(`my-bucket`).xattrs._sync.rev AS rev,
        meta(`my-bucket`).xattrs._sync.flags AS flags,
        META(`my-bucket`).id AS id
FROM `my-bucket`
USE INDEX (sg_channels_x1)
UNNEST OBJECT_PAIRS(meta(`my-bucket`).xattrs._sync.channels) AS op
WHERE ([op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] BETWEEN [$channelName, $startSeq] AND [$channelName, $endSeq])
ORDER BY [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)]
LIMIT 5000
I tried running that query myself: with the PREPARE it completes in about 5 ms, and as a plain SELECT, using a channel from the timeout logs above, $startSeq = 1 and $endSeq = our current sequence number, it completes in about 7 ms. So no timeouts there.
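For reference, this is roughly what I ran in the Query Workbench (SELECT list slightly shortened here; the channel name is taken from the timeout line above, and the end sequence below is just a placeholder, not our real current sequence):

SELECT [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][1] AS seq,
        meta(`my-bucket`).xattrs._sync.rev AS rev,
        META(`my-bucket`).id AS id
FROM `my-bucket`
USE INDEX (sg_channels_x1)
UNNEST OBJECT_PAIRS(meta(`my-bucket`).xattrs._sync.channels) AS op
WHERE ([op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)]
        BETWEEN ["dn_f092c223-d485-4fe3-9d17-70cdcd4e51be", 1]       -- $startSeq = 1
            AND ["dn_f092c223-d485-4fe3-9d17-70cdcd4e51be", 999999]) -- placeholder for our current sequence
ORDER BY [op.name, LEAST(meta(`my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)]
LIMIT 5000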
There are also dozens of other entries in the query monitor attributed to the user "local:sync-gateway" where the query column is empty. When I click Edit on one of those, I get an empty query window (not sure if that's a UI bug). I'm assuming those are also prepared SELECT statements.
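I haven't found a way yet to see the actual statement text for those empty entries in the UI. I assume the query service's system catalogs would show them, something like the following (untested on my side, and system:completed_requests only records requests above the logging threshold):

-- statements currently executing on the query service
SELECT * FROM system:active_requests;

-- most recently completed statements (only those above the completed-requests threshold)
SELECT * FROM system:completed_requests
ORDER BY requestTime DESC
LIMIT 20;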
All Sync Gateway related indexes appear to be online and healthy (sg_access_x1, sg_allDocs_x1, sg_channels_x1, sg_roleAccess_x1, sg_syncDocs_x1, sg_tombstones_x1).
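Besides the Indexes tab in the UI, I assume their state can also be checked with something like this (again just a sketch):

-- list the Sync Gateway indexes and their current state
SELECT name, keyspace_id, state
FROM system:indexes
WHERE name LIKE "sg%";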
I would be very grateful if someone could tell me what's going on there. The setup is pretty fresh, although we did some testing with rebalancing etc. Settings should mostly be standard.
I could only include one picture, sorry for the lack of more screenshots.
Best
Marc