[Meta] 'Couchbase Commons'?

Hey,

I recently got some time to work on my CB-related projects and updated to CB 4.5.
While developing I ran into a situation, as several times before, where I needed a data structure like a stack, a queue/FIFO or, more CB-specific, a “list of item references” doc.
(In my case I need a render queue. Render servers regularly check this queue for new images to be rendered from stored data.)

I imagine this is a regular problem of other developers as well. These structures are not easy to write, especially making them thread-safe. As this is not a new problem my question would be:
Would it not be a good idea to maintain a ‘couchbase commons’ project following the path of Apache Commons, Guava Commons etc.?

Thoughts?

hey @Kademlia
This is an interesting idea :slight_smile:

Adding simple datastructures in all SDKs is something that is currently on our radar, but we still have to come up with a design and roadmap, so nothing set in stone yet.

Maybe there would be community interest in contributing to an opensource project building such features on top of the Java SDK? That could be an alternative solution, yes. I can’t promise we’d have the manpower to back it up fully, but if users get together and contribute we would definitely love seeing such a project pop up :heart_decoration:

:bulb: I’d name it “couchbase-java-extras” :smile:

This sounds really interesting. It would definitely be a good excuse to showcase the new sub-document commands in Couchbase Server 4.5 as they would make it a lot easier to add atomicity / thread-safety to datastructures.

@Kademlia,
am i right, you are talking about (for example) “java queue notified on any changes from server” ?

@egrep I think @Kademlia meant like a List/Queue implementation, but that is directly backed by a Couchbase document. Add an element to the Queue, it’ll get appended in the Couchbase document.

Getting notifications from the server on documents additions / deletions is another thing altogether, but is also on our radar (as a kind of simplified facade over DCP).

@simonbasle,

@egrep I think @Kademlia meant like a List/Queue implementation, but that is directly backed by a Couchbase document. Add an element to the Queue, it’ll get appended in the Couchbase document.

… and what about the 20 MB document size constraint in this case? And, anyway:

    Observable
        .defer(() -> b.get(id))
        .map(modifyFunction)
        .flatMap(b::replace)
        .retryWhen(RetryBuilder
            .anyOf(...) // needs at least CASMismatchException here, any others are your choice
            .delay(...)
            .max(...)
            .build())
        .subscribe(...);

… and implement Queue + local synchronization over this (i assume, that document is strictly-formatted within and modifyFunction() follows this format)

Of course, this will (most likely) suffer from poor performance and high memory usage with lots of large documents, but as a “persistent queue for document IDs” it’s good enough.
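To make the get / modify / replace loop above concrete without a running cluster, here is a minimal sketch against an in-memory stand-in. Everything in it (`CasStoreSketch`, `Doc`, `update`) is hypothetical and not part of the Couchbase SDK; it only illustrates the optimistic-locking idea: a replace succeeds only if the document is still the version that was read, otherwise the whole cycle is retried.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a bucket; NOT the Couchbase SDK. */
class CasStoreSketch {
    /** Snapshot of a document: its list content plus the CAS value it was stored with. */
    static final class Doc {
        final List<String> items;
        final long cas;
        Doc(List<String> items, long cas) { this.items = items; this.cas = cas; }
    }

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    void insert(String id, List<String> items) {
        docs.put(id, new Doc(new ArrayList<>(items), 1L));
    }

    Doc get(String id) { return docs.get(id); }

    /** Replaces the document only if it is still the exact version that was read. */
    boolean replace(String id, Doc expected, List<String> newItems) {
        return docs.replace(id, expected, new Doc(newItems, expected.cas + 1));
    }

    /** get -> modify -> replace, retried on a CAS mismatch up to maxAttempts times. */
    boolean update(String id, UnaryOperator<List<String>> modify, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Doc current = get(id);
            if (current == null) return false; // document does not exist
            List<String> changed = modify.apply(new ArrayList<>(current.items));
            if (replace(id, current, changed)) return true;
            // CAS mismatch: another writer won the race, so re-read and try again.
        }
        return false;
    }
}
```

A queue `offer` would then be `store.update(id, items -> { items.add(e); return items; }, maxAttempts)`. Note that every attempt still copies the whole list, which is where the performance and memory cost mentioned above comes from.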

Getting notifications from the server on documents additions / deletions is another thing altogether, but is also on our radar (as a kind of simplified facade over DCP).

Hm, this one (interface DCPRequest in com.couchbase.client.core.message.dcp):

Note that they can flow in both directions. For example, ConnectionType.CONSUMER connection, means that messages will flow from server to client.

from Couchbase JVM Core IO 1.2.9 API
makes me think that (probably):

  • something is already implemented, but not documented
  • the server-side part is not implemented (for the case of doc change notification)

Seems all this also has a “hard learning curve” due to the lack of docs for the underlying architecture (i.e. detailed tech docs about internals). Are there any tech-arch-docs?

Since: 1.1.0
Author: Sergey Avseyev

@avsej, could you please comment on this (as an author)?

Yes, DCP is designed for full-duplex replication, but at the moment the Java SDK implements only one direction: server -> client.

@avsej,
… hm.

Yes, DCP is designed for full-duplex replication, but at the moment the Java SDK implements only one direction: server → client.

This makes me a little bit confused. There could be “a long long discussion” about “how does it work at all” (and the first question is “is the client i.e. SDK notified about ANY changes for ANY doc on server OR in your terminology server == SDK ?”), but most likely this is going to be “talks between the one who knows about internals, and the one who does not”. So, let’s simplify: are there any public tech-internals-docs ?

Yes, there is documentation on DCP as a protocol, it lives here: GitHub - couchbaselabs/dcp-documentation: couchbase unified protocol for replication

the client i.e. SDK notified about ANY changes for ANY doc on server OR in your terminology server == SDK ?

At the moment, there is no server-side filtering, so once you start listening to a vBucket on the cluster, you (the SDK user) will receive all changes.

@avsej,
thanks!
And i suppose, that measured overhead due to “no filtering” is minimal (otherwise implementation would be different, with filtering), am i right ?

Well, not really. I guess the overhead can be large if, e.g., you’re only interested in brand new documents and the workload is more like thousands of mutations per second and a couple of additions (because you’d get all these mutations you’re not interested in).

“otherwise implementation would be different”: it was more like we didn’t have enough time and manpower to cover all of DCP and optimize things in terms of this kind of overhead, but rather we had to prioritize and implement the minimum viable product for DCP to cover the cases of Kafka and Spark.

Well … :expressionless:
But thanks for explanation anyway :wink:

@egrep While I agree more ‘redis-like’ functionality would be nice - getting that in a performant way is a big step and not to be expected in the near future.

What I was trying to argue is that for the current state (4.5+) some basic helper constructs would be very good. An example would be a basic Queue implementation. This can be achieved in multiple ways. The already mentioned document-based implementation has some serious performance drawbacks (at least pre 4.5). That’s why you would need at least two or three different implementations to suit the different needs.

Use-Cases could be:

  • Background mailing-servers that all use the same Queue interface
  • Some sort of bigger jobs that need to be handled by multiple processes. The Job-Status-Doc could be passed through multiple queues and worked on over a longer time in the correct order. This would allow nice observability from an external logging/system-load service as well.
  • Using ‘in processing’ queues could help to detect jobs that failed (maybe caused by a hardware failure of the currently processing unit). Regularly checking the processing queue would allow identifying unfinished jobs and re-adding them to the “unfinished work” queue.
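The ‘in processing’ idea from the last bullet can be sketched as follows. This is an in-memory illustration only, with hypothetical names (`InProcessingSketch`, `claim`, `requeueStale`) and an assumed timeout rule: a job that has sat in the processing set longer than the timeout is treated as failed and goes back to the pending queue. A Couchbase-backed version would keep both structures in documents instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a pending queue plus an 'in processing' set with claim timestamps. */
class InProcessingSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, Long> processing = new LinkedHashMap<>(); // jobId -> claim time (ms)

    synchronized void submit(String jobId) { pending.addLast(jobId); }

    /** A worker takes the next job; the job is remembered as 'in processing'. */
    synchronized String claim(long nowMillis) {
        String jobId = pending.pollFirst();
        if (jobId != null) processing.put(jobId, nowMillis);
        return jobId;
    }

    /** A worker reports success; the job leaves the processing set for good. */
    synchronized void complete(String jobId) { processing.remove(jobId); }

    /** Moves jobs claimed longer ago than timeoutMillis back to the pending queue. */
    synchronized int requeueStale(long nowMillis, long timeoutMillis) {
        int moved = 0;
        Iterator<Map.Entry<String, Long>> it = processing.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            String jobId = entry.getKey();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                it.remove();
                pending.addLast(jobId);
                moved++;
            }
        }
        return moved;
    }
}
```

The regular check then amounts to calling `requeueStale` on a timer; picking the timeout is the hard part, since a slow worker and a dead one look the same from the outside.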

I myself started working (pre 4.5) on a queue that uses atomic counters and prefix-based entries. The interface looked like this (commented out stuff i did not need while doing basic testing):

public interface ICBQueue<T> {
    
    void initialize();
    String getHeadKey();
    String getTailKey();
    String getPrefix();
    String getSuffix();

    boolean add(T e);
    boolean offer(T e);

//    T remove();
    T poll();
    T peek();
    
    long size();
    boolean isEmpty();
    boolean isFull();
//    boolean contains(T o);
//    boolean remove(T o);
//    boolean containsAll(Collection<T> c);
//    boolean addAll(Collection<T> c);
//    boolean removeAll(Collection<T> c);
    void clear();
}

[…]

    public CBQueue(CBQueueConfig config, Bucket bucket) {
        suffix = config.getSuffix();
        prefix = config.getPrefix();
        softLimit = config.getSoftLimit();
        hardLimit = config.getHardLimit();
        headKey = prefix + "head" + suffix;
        tailKey = prefix + "tail" + suffix;
        softLimitKey = prefix + "softlimit" + suffix;
        hardLimitKey = prefix + "hardlimit" + suffix;
        clazz = config.getClazz();
        this.bucket = bucket;
    }

Using this logic you have no 20MB problems and by internally using a soft and hardLimit you can tell producers to slow down.
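Since only the interface and the constructor are quoted here, the following is a guess at how such a counter-based queue could behave, not the actual CBQueue code. It is an in-memory sketch: two `AtomicLong`s stand in for the head and tail counter documents, a map stands in for the bucket, and the entry key layout `prefix + index + suffix` as well as `shouldSlowDown()` are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory sketch of a queue built from two counters and prefix-based entry keys. */
class CounterQueueSketch<T> {
    private final Map<String, T> entries = new ConcurrentHashMap<>(); // stand-in for the bucket
    private final AtomicLong head = new AtomicLong(); // index of the next entry to poll
    private final AtomicLong tail = new AtomicLong(); // index of the next free slot
    private final String prefix;
    private final String suffix;
    private final long softLimit;
    private final long hardLimit;

    CounterQueueSketch(String prefix, String suffix, long softLimit, long hardLimit) {
        this.prefix = prefix;
        this.suffix = suffix;
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
    }

    // Assumed key layout: one small entry per index, so no single document grows large.
    private String entryKey(long index) { return prefix + index + suffix; }

    long size() { return Math.max(0L, tail.get() - head.get()); }
    boolean isEmpty() { return size() == 0; }
    boolean isFull() { return size() >= hardLimit; }
    /** Producers can poll this to back off before the hard limit is reached. */
    boolean shouldSlowDown() { return size() >= softLimit; }

    boolean offer(T e) {
        if (isFull()) return false; // racy check: concurrent producers may overshoot slightly
        long index = tail.getAndIncrement(); // reserve a slot first, then write the entry
        entries.put(entryKey(index), e);
        return true;
    }

    T poll() {
        while (true) {
            long index = head.get();
            if (index >= tail.get()) return null; // queue is empty
            String key = entryKey(index);
            T value = entries.get(key);
            if (value == null) return null; // slot reserved but not written yet; try later
            if (head.compareAndSet(index, index + 1)) { // only one consumer wins this index
                entries.remove(key);
                return value;
            }
        }
    }
}
```

For example, `new CounterQueueSketch<String>("render::", "::job", 100, 1000)` gives a queue whose size is always just `tail - head`, which is why no view query is needed for the basic operations.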

On the other hand operations like “contains(T e)” would be easier with a doc-based queue and for an actual comparison we would first need both implementations to see if one outperforms the other or both have their useful situations.

@Kademlia,
for 4.5-based implementation of such queue, imho, it’s better to use other principles than “list of documents between headKey and tailKey + viewQuery” (if i understood you right). With 4.5+ you have a sub-document API. And that allows to make a simple thing:

  • define each document as fixed number (N) of unique addressable slots [1…N].
  • define an addressable slot fixed-size format (addressable header, corresponding addressable payload)
  • pre-create set of documents K
    … and now, implement everything you need over your K:N address space on per-slot basis. Arrays and Linked structures are the easiest things.

Looks like something familiar, eh ? :wink:
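One possible reading of the K:N address space, as a small sketch: a linear element index is split into a document number and a slot number, much like page and offset in paged memory. The document ID scheme `slots::<k>` and the slot path `s<n>` are made up for this example; only the arithmetic is the point.

```java
/** Hypothetical mapping of a linear index onto K pre-created documents with N slots each. */
class SlotAddressSketch {
    private final int docCount;    // K
    private final int slotsPerDoc; // N

    SlotAddressSketch(int docCount, int slotsPerDoc) {
        if (docCount <= 0 || slotsPerDoc <= 0) throw new IllegalArgumentException("K and N must be positive");
        this.docCount = docCount;
        this.slotsPerDoc = slotsPerDoc;
    }

    long capacity() { return (long) docCount * slotsPerDoc; }

    private void check(long index) {
        if (index < 0 || index >= capacity()) throw new IndexOutOfBoundsException("index " + index);
    }

    /** ID of the document holding the element (assumed naming scheme). */
    String docId(long index) {
        check(index);
        return "slots::" + (index / slotsPerDoc);
    }

    /** Slot number inside that document, numbered 1..N as in the list above. */
    int slot(long index) {
        check(index);
        return (int) (index % slotsPerDoc) + 1;
    }

    /** Path a sub-document operation could target (assumed field naming). */
    String slotPath(long index) { return "s" + slot(index); }
}
```

With K = 4 and N = 8, index 8 lands in document `slots::1`, slot 1, so an array is just this mapping, and a linked structure would store such indexes in the slot header.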

No, that’s not how the queue would work. A ViewQuery is costly and does not fit the queue principle (that’s why I mentioned that contains(T e) etc. is not that important). This is exactly why a project like “Couchbase Commons” would be a good idea - to use the best implementations instead of everyone building their own one, failing multiple times on the way there because of atomicity, concurrency and performance problems.

The example from above uses a head and tail key to define the current queue size, similar to the LinkedList implementation (plus additional coating to communicate with CB).

From my current understanding of the code I am pretty sure a sub-document implementation cannot beat the performance of that implementation. Sub-Doc path comparisons will always be more expensive than comparing pre-generated hashes and traveling down the B+Tree.

@Kademlia,

@simonbasle said:

Maybe there would be community interest in contributing to an opensource project building such features on top of the Java SDK? That could be an alternative solution, yes. I can’t promise we’d have the manpower to back it up fully, but if users get together and contribute we would definitely love seeing such a project pop up

@will.gardner said:

This sounds really interesting. It would definitely be a good excuse to showcase the new sub-document commands in Couchbase Server 4.5 as they would make it a lot easier to add atomicity / thread-safety to datastructures.

So, we can even imagine not only about

pre-generated hashes and traveling down the B+Tree

but even about “direct script-code execution via api” :wink:
IRL, let’s take a deep breath and summarize:

  1. fork the engine and make “your own moon with blackjack and B-Tree walking”
  2. find a way of using the underlying core API (I don’t understand yet whether this is possible at all)
  3. use the sub-document API and make something “not too fast, but working”

IMHO, that’s the way life is.
But maybe there are other options I’m not aware of? …

Hey,

reading your point 1 I think you misunderstood. I was comparing the internal work needed to fulfill the requests a queue concept requires. The quoted CBQueue implementation already works in a multi-process environment and has a very low overhead, as only pre-optimized CB structures are used (that’s what I was talking about with hashes and the B+Tree; see http://developer.couchbase.com/documentation/server/current/architecture/storage-architecture.html).

The Sub-Document commands are nice but as far as i can tell not optimal for this use-case. The goal here is to minimize network traffic by sacrificing some cpu usage on sub-doc searches. Direct key access to the needed information will still outperform those lookups. Sub-document commands are designed to save on large-document updates/queries but avoiding large docs if possible will still be better.