Concurrent SG queries cause QueryTimeoutException

Hi all,
I manage a Couchbase cluster infrastructure with 2 nodes, each running the Data/Query/Index services.

Recently we have been observing an issue with a very large number of queries sent by a Community Edition Sync Gateway (SG) in specific time windows (usually every day from 7 AM to 9 AM and from 7 PM to 9 PM).

The application uses the SG to enrich data and to set permissions on the documents with Channels and Roles.
The concurrent queries are the following:

SELECT meta(<bucket>).xattrs._sync.access.role:User as value
FROM <bucket> USE INDEX (sg_access_x1)
WHERE any op in object_pairs(meta(<bucket>).xattrs._sync.access)
      satisfies op.name = $userName end;

After 1m15s the queries time out, disrupting the application backend (BE), since it never receives a result.

We also tried increasing the SG query timeout, but we then observed even more queries piling up and stressing the entire cluster, so we set it back to the default value of 75s.

Any idea why this is happening?
Is there a related known issue, or is something misconfigured at the application level that is triggering too many queries?

So - when one of those queries executes by itself, there is no issue? It’s only when two or more execute concurrently that there’s an issue?
How long does the query take when only one is executed?

  1. When the query executes by itself there are no issues.
  2. When there’s a concurrent execution, up to 2000 queries in the last minute, we observe the timeouts.
  3. When only one query is executed it usually takes around 1s.

I want to add that typically the query is not executed by itself; there are usually around 10 or fewer queries executing concurrently (and this produces no errors and no timeouts).

We are observing a spike in the number of concurrent queries at several intervals, together with an increase in execution time. This seems to be what disrupts the application backend.

2000 queries per minute is roughly 33 queries per second. Given that each query takes 1 second to execute, and assuming serial execution (1 core), the last of those 33 queries will complete 33 seconds after the first one. Since another 33 requests/second keep arriving during the remaining 32 seconds of that interval, a backlog of 1056 requests builds up, which takes 1056 seconds to process, during which another 1056*33 requests arrive, and so on. Under ideal conditions with no other bottlenecks, 33 total cores (number_of_nodes * cores_per_node) are required to handle a continuous load of 33 one-second queries per second. Since the distribution is sometimes not uniform, allow for roughly 60 total cores.
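To make the arithmetic concrete, here's a minimal sketch of the same back-of-the-envelope calculation in plain JavaScript. The numbers are the ones from this thread; the headroom factor is just the non-uniform-load allowance mentioned above, not a measured value:

```js
// Back-of-the-envelope sizing: cores needed ≈ arrival rate × service time.
const queriesPerMinute = 2000;   // observed peak load
const serviceTimeSeconds = 1;    // single-query execution time

const arrivalRate = queriesPerMinute / 60;            // ≈ 33.3 queries/s
const coresNeeded = arrivalRate * serviceTimeSeconds; // ≈ 33 cores at full utilization

// Non-uniform (bursty) arrivals: allow extra headroom, which lands
// near the ~60 total cores suggested above.
const coresWithHeadroom = Math.ceil(coresNeeded * 1.8); // ≈ 60

console.log({ arrivalRate, coresNeeded, coresWithHeadroom });
```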

Thanks for the analysis.

Is the query a standard query performed by the Sync Gateway?
If so, what could trigger that query? A specific endpoint call?

Is this what you meant by the “application uses the SG to enrich data and to set permissions on the documents with Channels and Roles”?

OK, so if I understood correctly, it's not a periodic query triggered by the SG itself to check permissions or access to some channels. Right?

Is there any specific call to trigger that?

This appears to be a known issue related to managing users. Someone is looking into it.
Apologies for my previous response; I understood it to be an application query.

Many thanks. I'll wait for your update to check whether there's a fix.

This is usually related to how often you’re making access() calls in your sync function.

If you can share your sync function we may be able to determine if you’re doing this unnecessarily (i.e. are you issuing access() on every write / multiple times per write?)

But note that invoking it multiple times to grant the same user access to the same channel will have negative performance implications.
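For illustration, here's a minimal sync function sketch along those lines; the document type and field names (type, user, channel) are hypothetical, not from this thread. The idea is to issue access() only for documents that actually carry a grant, rather than on every write:

```js
// Hypothetical sync function sketch: only documents of type "grant"
// change access; all other writes are just routed to a channel.
// The type/user/channel field names are illustrative, not from this thread.
function (doc, oldDoc) {
  if (doc.type === "grant") {
    // access() invalidates and forces recomputation of the user's
    // channel set, so it is confined to dedicated grant documents.
    access(doc.user, doc.channel);
  } else if (doc.channel) {
    // Routing a document into a channel is cheap.
    channel(doc.channel);
  }
}
```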


Other users have had similar issues in non-public tickets (you won't be able to see them), CBSE-13514 and CBSE-10325, which are marked "solved", but I don't understand what the solution is. In addition to making fewer calls to access(), it seems that adding an index would help, but I don't know which index should be added. Maybe someone from couchbase-lite could help.

SG's indexes are already covering, where possible, for the queries we make.

It's almost certainly just the number of access() calls being made via document writes.


Sorry for the delayed feedback.
The sync function performs a lot of checks on users/roles/channels and is about 4,000 lines of code.

The issue seems to appear when multiple documents are edited by several users.

Anyway, the SG version is Community 2.7.0.
Is there any known issue with the channels/roles checks, or any improvement in later releases (e.g. 2.8.3)?

I can't find anything in the release notes. Have you looked to see if you can make fewer calls to access()?

Checks are fine (the require...() functions), and routing documents to channels is fine (channel()). It's the function calls that change access (access() and role()) that should be limited, because they cause channel recomputation for all affected users/roles.
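As a quick reference, here's a hedged sketch that groups the common sync function calls by cost; the specific checks and field names are illustrative only:

```js
// Illustrative cost grouping of common sync function calls;
// the specific checks and fields are examples only.
function (doc, oldDoc) {
  // Cheap checks: reject the write unless requirements are met.
  requireUser(doc.owner);   // throws unless the writer matches doc.owner
  requireRole("editor");    // throws unless the writer has the role

  // Cheap routing: assign the document to a channel.
  channel("region-" + doc.region);

  // Expensive: access() and role() change what users can see and
  // trigger channel recomputation for all affected users/roles,
  // so they should be called sparingly.
  // access(doc.user, doc.channel);
  // role(doc.user, "role:editor");
}
```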

With that said, Sync Gateway 2.7 was released over 4 years ago, and went end of life 2 years ago. I would strongly advise upgrading to a more recent (3.0 or 3.1) version. (Note: 2.8 is also end of life as of August 2023)

You can see our Enterprise Software Support Policy that our paying customers use here for more detail:


Hi Ben, Michael,

thanks for all the support you've given by answering my questions as thoroughly as possible!

I have found something in the sg_info.log file that could be interesting. The logs were collected during the concurrent-query issue, before the SG restart.

[INF] Access: Recomputed channels for “User”: !:1,channel_1, channel_2, …, channel_N

Who is “User” referring to here?
Is there any configuration, apart from the sync function, that can trigger this?

"User" in the log refers to an actual user with the name User.

The log line you see appears when the specified user logs in, or when another event causes channels to be recomputed lazily (e.g. an admin views the /db/_user/User endpoint).

Sync functions via document updates are the mechanism that invalidates the current set of user channels. The channels are then recomputed lazily the first time the user requires them.

SG queries use a scan consistency of request_plus, meaning the query waits for the latest mutations from the Couchbase Data node to be indexed before returning. It looks like the indexer/query hits the default 75-second limit and times out. The access query also runs when an SG user is newly created. A quick fix: when the user is newly created, do a fast GET of the user on the Admin API, because that runs the access query for you. Then, when the user comes to SG in the morning, SG doesn't have to run the access query, because it has already been done.
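Here's a minimal sketch of that pre-warm call, assuming Node 18+ (built-in fetch) and SG's Admin API on its default port 4985; the database name db and user name User are placeholders:

```js
// Pre-warm a newly created user's channels via Sync Gateway's Admin API.
// GET /{db}/_user/{name} triggers the lazy channel recomputation once,
// so the user's first real request doesn't pay for it.
// "db" and "User" are placeholders; run as an ES module (Node 18+).
const res = await fetch("http://localhost:4985/db/_user/User");
if (!res.ok) throw new Error(`Admin API returned ${res.status}`);
const user = await res.json();
console.log("channels recomputed:", user.all_channels);
```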