Is there a way to make advanced scheduling more robust?

Hello there,

I’m currently extending a proprietary API to support transactions to our NoSQL databases. One of the implementations uses the Libcouchbase C SDK 3.3.1.
The goal is to implicitly collect operations to these databases and, once a certain threshold is reached, flush them early, in order to avoid the network I/O of many single operations. Some tests on our side point to a 2-4x improvement in processing speed (limited only by serialization), so we’re quite excited.

What I had issues with is handling the boundaries of the scheduling commands, using the functions from the advanced scheduling section.

Long story short: our API sometimes (meaning I fixed some issues) called lcb_sched_enter multiple times without closing with lcb_sched_leave (or lcb_sched_flush, but we aren’t using async operations).
We basically wrap the C API with our C++ API, which uses RAII to ensure the lifetimes are ‘correct’. But the above-mentioned functions don’t return an lcb_STATUS or anything else that would let us observe the status of a ‘transaction’.

My main concern is that the Libcouchbase C API itself doesn’t guarantee that the above-mentioned functions are called pairwise, and that I’m missing some ‘special cases’ I’m not aware of (yet).
In our case this led to OOM errors in the Libcouchbase API (too many scheduled operations) and a lot of LCB_ERR_TIMEOUT statuses.

Of course it’s possible to manually wrap some sort of logic to avoid this (and I’ll probably end up doing this), but a standard approach in the SDK would be appreciated.
For example, a call to lcb_sched_enter might implicitly check for an open pair and close it, the functions could return an error code, or they could simply become no-ops (if not called pairwise, etc.).
In general, a function to observe the size of a ‘transaction’ would be appreciated as well.

Or maybe I’m missing something. In either case some feedback and/or advice on the advanced scheduling topic would be appreciated.

First benchmark Couchbase using the asynchronous API, without batching your updates. Your updates will be limited pretty much only by network latency. And since you are using the async API, all your updates will be initiated at the same time and complete in the time of one round-trip from the client to the server. If you batch the updates, the i-th update in the batch will need to wait for the previous i-1 updates to be sent before it is sent.

collect operations to these databases and, once a certain threshold is reached, flush them early, in order to avoid the network I/O of many single operations.

This won’t avoid an I/O for each operation with the Couchbase SDK.

So you would advise using the async API over the synchronous API? We currently use the latter, as our services (higher up the stack) aren’t asynchronous. And sadly that’s probably not so easy to change for our services.
But I can try to benchmark it locally.

If the advanced scheduling functions (mentioned in my initial message) don’t avoid I/O for each operation, what should these functions be used for?
IIRC the batches you’ve mentioned are basically what we want. Similar to a traditional SQL transaction.

Maybe to clarify some pseudo code:

{ // in some service
  auto guard = startTransaction(); // this would call `lcb_sched_enter`

  // ... doing some processing
  auto processedData = getData();
  
  for ( const auto &data : processedData )
  {
     storeData( data ); // database agnostic, but essentially calls `lcb_store( /*...*/ )` with the correct arguments etc.
  }

  // destruction of the 'guard' variable -> `lcb_sched_leave` + a `lcb_wait` is called
}

What’s currently happening is that the guard variable isn’t there and we’re not using transactions. So every call to storeData causes an I/O operation.
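To make the guard from the pseudo code concrete, here is a minimal sketch of the RAII idea. The libcouchbase calls are stubbed out with counters so the example is self-contained; in the real wrapper `sched_enter`/`sched_leave`/`wait_all`/`store` would be `lcb_sched_enter`, `lcb_sched_leave`, `lcb_wait`, and `lcb_store` on an `lcb_INSTANCE`. The guard class and `runBatch` are hypothetical names, not part of the SDK:

```cpp
#include <cassert>
#include <vector>

// Stand-ins for the libcouchbase calls, counting invocations so the
// pairing behavior can be observed.
struct FakeInstance {
    int enter_calls = 0;
    int leave_calls = 0;
    int wait_calls  = 0;
    int scheduled   = 0;
};
void sched_enter(FakeInstance &i) { ++i.enter_calls; }          // like lcb_sched_enter
void sched_leave(FakeInstance &i) { ++i.leave_calls; }          // like lcb_sched_leave
void wait_all(FakeInstance &i)    { ++i.wait_calls; }           // like lcb_wait
void store(FakeInstance &i, int /*doc*/) { ++i.scheduled; }     // like lcb_store

// RAII guard: opens the scheduling scope in the constructor and closes it
// (leave + wait) exactly once in the destructor, so enter/leave can never
// be mismatched by the calling code.
class TransactionGuard {
public:
    explicit TransactionGuard(FakeInstance &i) : inst_(i) { sched_enter(inst_); }
    ~TransactionGuard() { sched_leave(inst_); wait_all(inst_); }
    TransactionGuard(const TransactionGuard &) = delete;
    TransactionGuard &operator=(const TransactionGuard &) = delete;
private:
    FakeInstance &inst_;
};

int runBatch(FakeInstance &inst, const std::vector<int> &docs) {
    TransactionGuard guard(inst);       // opens the scope
    for (int d : docs) store(inst, d);  // schedule each document
    return static_cast<int>(docs.size());
}   // guard destroyed here: one leave + one wait per batch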

What you want to do is have your application submit the update(s) asynchronously (at once), and then wait on all of them to complete. The response time will be essentially the network round-trip time of one request. To reduce the network lag, move the client and the Couchbase server closer together. If your client and server are sufficiently close, you should see response times around 80 microseconds.

I don’t know. And maybe a libcouchbase developer needs to answer your questions. But I do know the underlying KV protocol for Couchbase does not have batches. There are no KV operations that operate on multiple documents. If there are libcouchbase operations that operate on batches, underneath they are still sending one operation at a time over the network.

I had a short look at the code that changed the behavior (CCBC-664: lcb_sched_enter/leave calls are optional · couchbase/libcouchbase@c2e114d · GitHub IIRC) and I agree my usage of these commands is redundant.

So, if I think about this, is the entire use of the function lcb_sched_enter obsolete in the C SDK? If I use the function and schedule any operation (at minimum the ones changed in the commit above), there will be an implicit call to lcb_sched_leave even if I want to react to possible failures etc.? How should one do this instead?

Consider this: I plan to schedule a few operations, assuming that the commands are pushed only once lcb_sched_leave is called. Then I experience some kind of user error, maybe a problem with the serialization or other things that might impact the continuation of the operation (more complex operations, memory allocation, …), and I want to clear the schedule. That wouldn’t be possible anymore, or?

Meaning there is no way for a user of the SDK to opt out of the async behavior. Since I’m doing the lcb_wait call after the destruction (see my pseudo code above), this should work as you’ve highlighted. In addition, that seemed to be the main point of the commit I referenced, right? That would also explain the speed-ups, as all I did was reduce the calls to lcb_wait for each operation, thanks to these ‘scopes’.
All I’m doing is key-value operations, simply storing and fetching documents.

I want to emphasize that I don’t want to judge or anything in that neighbourhood; I simply want to deepen my understanding of the SDK, as the documentation doesn’t highlight this point enough IMHO.

In addition, the transactions I ‘wanted’ seem to be the ones from here. How do these differ? Or is that a different question?

@avsej would be able to give a better answer regarding the lcb_sched* APIs.

From the documentation - Couchbase C Client: Advanced Scheduling - it looks like the API is still valid. And if you find it yields an increase in speed or throughput or otherwise satisfies your requirements, then by all means use it. But given that there is no “batch” API to the Couchbase server, I wouldn’t expect round-trips to be reduced. Therefore, developing something around lcb_sched_xxxx to improve speed or efficiency would not be fruitful.

Going back to your original post:

our API sometimes (meaning I fixed some issues) had called lcb_sched_enter multiple times without closing with lcb_sched_leave

Is that not an application error that would need to be fixed in your application?

Of course it’s possible to manually wrap some sort of logic to avoid this

Ok - you have already figured out the solution.

I want to clear the schedule. That wouldn’t be possible anymore, or?

Why not? Isn’t that what lcb_sched_fail does? Couchbase C Client: Advanced Scheduling
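The behavior under discussion can be sketched without libcouchbase at all. In this toy model, `sched_fail` drops everything queued since the scope was opened (as lcb_sched_fail is documented to do), while `sched_leave` hands the queue to the network layer; the `Pipeline` struct and `scheduleAllOrNothing` are hypothetical stand-ins, not SDK types:

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy pipeline standing in for the lcb_INSTANCE internals.
struct Pipeline {
    std::vector<int> queued;   // operations scheduled since the scope opened
    std::vector<int> flushed;  // operations handed to the network layer
};
void sched_fail(Pipeline &p)  { p.queued.clear(); }   // like lcb_sched_fail: drop, don't send
void sched_leave(Pipeline &p) {                       // like lcb_sched_leave: commit the batch
    p.flushed.insert(p.flushed.end(), p.queued.begin(), p.queued.end());
    p.queued.clear();
}

// Schedule all operations; on any user error, clear the schedule so that
// nothing already queued in this scope ever reaches the network.
bool scheduleAllOrNothing(Pipeline &p, const std::vector<int> &docs, bool failOnTwo) {
    try {
        for (int d : docs) {
            if (failOnTwo && d == 2)
                throw std::runtime_error("serialization error");
            p.queued.push_back(d);   // like lcb_store inside the scope
        }
    } catch (const std::exception &) {
        sched_fail(p);   // abort path: queued operations are discarded
        return false;
    }
    sched_leave(p);      // success path: flush the whole batch
    return true;
}
```

This is exactly the “clear the schedule on a user error” case from the earlier question: the partially built batch is dropped as a whole rather than half-sent.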

If the advanced scheduling functions (mentioned in my inital message) don’t avoid IO for each operation, what should these functions be used for?

The all-or-nothing behavior. From the documentation ‘These semantics exist primarily to support “all-or-nothing” scheduling’

So you would advise on using the async API over the synchronous API?

Yes. Because if you have n operations that each take t seconds, executing synchronously will take n*t seconds. But if you execute them all asynchronously, and wait for them to all finish, it will take t seconds.
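That arithmetic can be illustrated with plain `std::async`, no Couchbase involved. Each fake operation below sleeps for a fixed 50 ms “round-trip”; run sequentially, n of them would take roughly n times that, but launched concurrently and then awaited together, the total stays close to a single round-trip:

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

using namespace std::chrono;

// A fake operation with a fixed 50 ms "network round-trip" latency.
int fakeOp(int id) {
    std::this_thread::sleep_for(milliseconds(50));
    return id;
}

// Launch n operations concurrently, wait for all, and report elapsed time.
milliseconds runConcurrently(int n) {
    auto start = steady_clock::now();
    std::vector<std::future<int>> futures;
    for (int i = 0; i < n; ++i)
        futures.push_back(std::async(std::launch::async, fakeOp, i));
    for (auto &f : futures)
        f.get();  // block until every operation has completed
    return duration_cast<milliseconds>(steady_clock::now() - start);
}
```

With four operations, the sequential cost would be about 200 ms while the concurrent run finishes in roughly the latency of one operation.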

In addition the transaction I ‘wanted’ seem to be the one from here. How do these differ?

I’d have to go through the doc. The one thing I know is that transactions will take about three server round-trips to execute one operation. So if you’re looking for speed this is the wrong place. Just execute the operations as they are requested.

I’m still trying to figure out how you are exposing a synchronous API, " as our services (higher up the stack) aren’t asynchronous", yet you are batching together operations.

Looking at the code from here, that can’t be the case. IIRC, upon scheduling a command (like lcb_get) you’d call said function with the intent of flushing the pipeline. lcb_sched_fail would call the same function, but doesn’t flush the contents; it only clears the pipeline.

As mentioned in the quote before, the pipeline is cleared, but not flushed to the backend (IIRC). So for my purposes I’d need to be 100% sure an operation was actually queued to Couchbase. But that’s what my application has to do anyway. I’ll need to think about it a bit. From the looks of it, I can simplify things quite a bit again, as I only have to wait once now.

Yes, it’s an error from my application. But after reading your suggestions, I’ll remove said lcb_sched_X calls. Less code is better.

Yes, you are right. That’s a thing to consider in order to benchmark correctly. Especially the preempting is probably not needed. In my experience, batching operations sometimes has the advantage of causing less context switching on the CPU. I’ll benchmark it as mentioned.

I can understand that. But that’s a different topic for me. We are working on making our API asynchronous, but that’s not how it currently works, and I can’t go into the details.

From my point of view the answer is as follows:
Don’t use the advanced scheduling API; instead, try to design the processing such that one can take advantage of the async API of the SDK as much as possible.

Huh. Would you call lcb_sched_flush if you wanted them flushed?

Yes, my impression was that this function would flush/end the scheduled operations, and that lcb_sched_fail would abort them.
At least that was the initial plan.
