Performance of Couchbase on a single node for querying a range of data

Hello,

Our team is currently struggling with Couchbase performance. We are evaluating Couchbase for use in our application as the storage backend for sensor data. The sensor delivers roughly 4 to 5 results per second, which are transmitted as JSON strings describing the measurement. These messages all have the same format, and the different event types they represent are encoded in the message as a string. The main reasons for considering Couchbase for evaluation are:

  • measurement data is already transmitted as JSON formatted strings
  • we will need scalability in the future, as our system should be able to support multiple of these sensors
  • availability is of higher importance than consistency w.r.t. the CAP theorem

I ran some test queries on two different setups:

  1. Production environment which runs our live system

Hardware/System configuration:

  • 4 cores
  • 32 GB RAM
  • Couchbase installed on 500 GB SSD (16% of disk occupied)
  • CentOS Linux release 7.5.1804 (kernel 3.10.0-862.9.1.el7.x86_64)

Couchbase Community Edition 5.0.1 (build 5003) in a Docker container on 1 node

Memory quotas:

  • Data Service: 12500 MB
  • Index Service: 4096 MB
  • Search Service: 512 MB

Index Storage Mode: Standard Global Secondary

4 Buckets:

  • A (6 GB memory quota, bucket type: Couchbase), currently holds ~28,000,000 docs of three types, same format
  • B (2 GB memory quota, bucket type: Couchbase), currently holds ~1,000 docs
  • C (500 MB memory quota, bucket type: Couchbase), currently holds 1 doc (for testing purposes)
  • D (2 GB memory quota, bucket type: Couchbase), currently holds ~ 30 docs (rather new)

Documents in A have the following format:

{
  "_class": <string>,
  "latitude": <double>,
  "longitude": <double>,
  "positionMeasure1": <double>,
  "positionMeasure2": <double>,
  "eventType": <string>,
  "time": <string>, e.g. "2018-08-09T06:42:47.219Z",
  "originalMessage": <map<string, any>>
}

Each of these messages represents a single measurement of a sensor and contains two different measures for position (represented as doubles), GPS coordinates, the type of the measured event (in our application, eventType is an enum representing this) and a time stamp in the format "yyyy-mm-ddTHH:MM:SS.sssZ". The originalMessage is a map containing additional information on the event.

Currently, the eventType has three realizations, let’s call them eventType1, eventType2 and eventType3. The vast majority of the documents in A are of type eventType1, followed by eventType2 and a clear minority of eventType3.

The bucket A has been indexed as follows:

CREATE PRIMARY INDEX aIndex ON A
CREATE INDEX time ON A(time)
CREATE INDEX eventTypeAndTime ON A(eventType,time)

  2. Test environment, which runs in isolation

Hardware/System configuration:

  • 12 cores
  • 64 GB RAM
  • Couchbase installed on SSD
  • Debian 4.9.65-3+deb9u2

Couchbase Enterprise 5.5.0 running in a Docker container on this single node

Memory quotas (per default setup):

  • Data Service: 37900 MB
  • Index Service: 512 MB
  • Search Service: 512 MB

Index Storage Mode: Standard Global Secondary

Only one bucket A (others omitted as not relevant)

Same indexes created:

CREATE PRIMARY INDEX aIndex ON A
CREATE INDEX time ON A(time)
CREATE INDEX eventTypeAndTime ON A(eventType,time)

On these two systems I ran the following queries from the web UI:

  1. Query 1: SELECT COUNT(*) FROM A WHERE eventType == "eventType1" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
  2. Query 2: SELECT COUNT(*) FROM A WHERE eventType == "eventType2" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
  3. Query 3: SELECT * FROM A WHERE eventType == "eventType1" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
  4. Query 4: SELECT * FROM A WHERE eventType == "eventType2" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
  5. Query 5: SELECT time FROM A WHERE eventType == "eventType1" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
  6. Query 6: SELECT time FROM A WHERE eventType == "eventType2" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"

On the production environment I got the following results:

  1. uses IndexCountScan2 on the eventTypeAndTime index as expected, takes approx. 9.5 s.
  2. uses IndexCountScan2 on the eventTypeAndTime index as expected, takes approx. 640 ms.
  3. uses IndexCountScan2 on the eventTypeAndTime index as expected, either times out or the web UI can't be loaded (> 2,000,000 results).
  4. uses IndexCountScan2 on the eventTypeAndTime index as expected, takes approx. 1 min (> 150,000 results).
  5. uses IndexCountScan2 (covering) on the eventTypeAndTime index as expected, takes approx. 48 s.
  6. uses IndexCountScan2 (covering) on the eventTypeAndTime index as expected, takes approx. 1.4 s.

The test environment gave the following numbers:

  1. 1.3s (IndexScan3: 1.3s, Project: 0)
  2. 120ms (IndexScan3: 120ms, Project: 0)
  3. Can’t be evaluated this way, browser says web UI can’t be loaded.
  4. 7.5s (IndexScan3: 270ms, Fetch: 2.9s, Filter for eventType: 1.1s, Project: 50ms)
  5. 10s (IndexScan3 (covering): 4.3s, Filter for eventType: 2.5s, Project: 2.4s)
  6. 640ms (IndexScan3 (covering): 270ms, Filter for eventType: 160ms, Project: 150ms)

One of the main use cases we have in mind is to load data from bucket A for eventType2 for, e.g., two weeks, as demonstrated by the queries above, in order to do some statistical analysis on the data. However, in our opinion, the performance is not good enough in these cases. In particular, for those queries that cannot use a covering index, the timings we observe are not acceptable. We have gone over our setup and the DB configuration several times but have not found a way of substantially improving the situation. We are aware that Couchbase is designed to run on multiple nodes to which the services can be distributed. However, we have currently set up only one node, as we are monitoring only one sensor and thought this should do the job. So my questions are: Is there anything obvious we have done wrong in our setup? Should the data be stored in a different way? Maybe the indexing is not done properly? Is the hardware properly sized? (I have gone through the basic sizing instructions …)

Any help is highly appreciated and I’ll gladly supply additional information/data should that be needed.


You should increase the indexer memory and maybe use MOI with the EE version. cc @deepkaran.salooja
Also, with 4 cores and all services running, there may not be enough CPUs.

In Queries 3 and 4 you are projecting whole documents, and the qualifying documents number > 2M for Q3 and > 0.15M for Q4.
These take time. If you really need that many documents, you should try narrowing your time range and repeating the queries.

Also check out Index Aggregation in 5.5 EE.
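
For example, a grouped count like the following may be eligible for that pushdown, since it can be covered by the existing eventTypeAndTime index (just a sketch; whether the pushdown actually happens depends on the generated plan):

    /* Per-event-type count for the two-week window; covered by eventTypeAndTime(eventType, time) */
    SELECT eventType, COUNT(*) AS cnt
    FROM A
    WHERE eventType IN ["eventType1", "eventType2", "eventType3"]
      AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
    GROUP BY eventType;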

Thanks for the reply! I'll try increasing the indexer memory in our test environment and see if I can gain anything from this. I'm not sure about the hint with narrowing the time stamp. If we want to analyze the data for the past two weeks, I don't see a way of narrowing down the time range, except maybe for some sort of paging, e.g., first querying only the past week and then the week before that. But would this give us a significant performance improvement? I doubt it. We are aware that the number of documents we are querying is rather large, and we have in the meantime implemented what we consider a workaround: we precompute some of our statistical analysis on a daily basis and save the reduced data set in a separate bucket, from which it can be queried quite efficiently (see the sketch below). Is this sort of an accepted approach in the community?
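
To sketch that precompute step, a daily aggregation can be expressed as an INSERT … SELECT roughly along these lines (the DailyStats bucket name, the key pattern and the per-day count are only placeholders for illustration, not our actual implementation):

    /* Summarize one day of eventType2 measurements into a single document in a separate bucket */
    INSERT INTO DailyStats (KEY "eventType2::" || day, VALUE {"day": day, "eventType": "eventType2", "count": cnt})
    SELECT SUBSTR(time, 0, 10) AS day, COUNT(*) AS cnt
    FROM A
    WHERE eventType = "eventType2"
      AND time >= "2018-08-09T00:00:00.000Z" AND time < "2018-08-10T00:00:00.000Z"
    GROUP BY SUBSTR(time, 0, 10);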

I just wanted to report my findings from playing around with the indexer settings, following the suggestions from @vsr1.

  • I first increased the memory for the index service from 512 MB to 8096 MB and executed Query 4 once again, giving me basically the same timing of approx. 7.5 s.
  • I then changed the index storage mode from standard global secondary to MOI by creating a new cluster from scratch, rebuilt my eventTypeAndTime index and executed Query 4 once again. This time it took ~8.5 s to answer.
  • I also tried splitting the query into two parts, first querying the date range from 07-11 to 07-18 and then from 07-18 to 07-25, but, as expected, the total timing is roughly the same.

So basically none of the suggestions led to any improvement in our query performance. I also have the suspicion that our problem is NOT related to the indexer service itself (which I conclude from my last experiments but also from the fact that during my initial experiments the timings of the index scan were usually by far the smallest contribution to the overall timing).

I think that our problem is rather related to data modeling and how we use Couchbase (because it is not designed for our use case of obtaining rather large, contiguous chunks of documents) and that our approach of precomputing on a daily basis is actually going in the right direction.

Query 3 is getting more than 2 million documents. You can try projecting META().id and using the SDK asynchronous API to get the documents directly from the Data node, avoiding 2 hops. cc @ingenthr

SELECT RAW META().id FROM `A`
WHERE eventType == "eventType2" AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"

Hi @martibe, you did a very thorough job describing your situation, but I have a few additional questions to understand your requirements.

  1. Which fields are you using for analysis? I.e., do you really need ‘select *’ or can you select a smaller subset of fields?
  2. Along these lines, how large is the “originalMessage” field on average?
  3. Is the analysis you are doing something you could make part of the query? As @vsr1 notes, in 5.5 some aggregations can get pushed down into the indexer, which is likely faster than returning all the original documents.
  4. Do all the documents for bucket A fit in memory? I.e., if you go to the Buckets tab one of the columns will show you the percentage of documents in memory.
  5. For query #3, what is the size in bytes of the result? That should be shown to the right of the "Execute" button in the query workbench. The "table" view of results is currently the most scalable; it works reasonably well at result sizes of 100 MB and more.
  6. What query times are acceptable for your use case?

There may well be a data modeling solution to your problems. Having the right model is just as important in Couchbase as with traditional RDBMSs, though the specifics are of course different.

All of the queries are using the eventTypeAndTime index. Why do you need a separate PRIMARY index or the index on just time? It seems like a waste of resources.

Now, on to the index that is being used:

You can save significant memory by simply doing this:

create index ev1Time ON A(time) WHERE eventType = "eventType1";
create index ev2Time ON A(time) WHERE eventType = "eventType2";
create index ev3Time ON A(time) WHERE eventType = "eventType3";

Issue the same queries as before and you should see better throughput.

Your Query 4 has SELECT *. If you do need the whole doc, the engine needs to fetch the full document.

If you have a large number of documents being fetched from the index, increase scan_cap, and if you have a large number of documents fetched from the KV/Data service, increase pipeline_batch and pipeline_cap.


Thanks for the hint! Our application is currently implemented as a Spring Boot application and we are using a Spring Couchbase repository to retrieve the data. But this is not written in stone, so switching to using the SDK in a more custom way would definitely be possible. In that case we would also benefit from the async features, as far as I understand. I think I will give this a try in a small test application to see what we can get.

Thanks, @keshav_m. I am aware that the primary index is not needed here so I agree: Yes this can be removed. But we need the index on the time field currently because we have other queries operating purely on the time field.

I’ll try out your hint with the three more customized indexes. This sounds reasonable although this also means that we might get into some kind of “indexing hell” in the future if we make any further developments on our application and want to run many different kinds of queries.

Thanks for your time, @eben! I’ll try to answer all your questions.

  1. We are not using all the fields and a smaller subset would do. And in fact we have seen better performance when a covering index is used. This is also described in my examples above and at least the run time in our test environment is definitely coming close to what we expect in the end.
  2. The originalMessage field is basically a map structure with about 15 entries, all of them key-value pairs with rather short strings; no large data is contained in there. Does that answer your question?
  3. In this particular case I think the answer to your question is yes: we could probably also solve this problem using the right query. However, our intention was that the database should just be our background storage and not have too much logic implemented in it. I do understand that we might be forced to reconsider our position here to get the performance we expect.
  4. You are referring to the "resident" entry here, I assume? That's 100% on our test environment but only around 8% on our production environment.
  5. Well, that is hard to answer as those queries have always timed out. For the time being I would also like to put an emphasis on querying documents of eventType2 (so basically Queries 2, 4 and 6).
  6. The use case we are currently implementing is a user specifying a date-time range for which he/she wants to see a histogram representing the distribution of the measurement values in that date-time range (a sketch of such a query is below). So an acceptable query time would have to be < 1 s.
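
For concreteness, such a histogram could in principle be computed server-side with a query roughly like the one below; the bin width of 0.5 and the choice of positionMeasure1 are only assumed examples, not our final requirements:

    /* Count measurements per value bin over the selected date-time range */
    SELECT FLOOR(positionMeasure1 / 0.5) * 0.5 AS bin, COUNT(*) AS cnt
    FROM A
    WHERE eventType = "eventType2"
      AND time BETWEEN "2018-07-11T00:00:00.000Z" AND "2018-07-25T00:00:00.000Z"
    GROUP BY FLOOR(positionMeasure1 / 0.5) * 0.5
    ORDER BY bin;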

Thank you @martibe.

I think the critical issue here is the 8% resident for your production environment. As with any database, Couchbase is going to perform much better if the data required is in memory. In general this can be achieved by

  • covering indexes (so the query engine never needs to look at the documents in the first place), or
  • pushing parts of the query (i.e. aggregation) into the index, or
  • data modeling so that the documents required to answer the query are as small as possible, allowing your working set of documents to fit in memory,
  • and possibly increasing bucket or index memory allocations to help achieve that.

On the whole, Couchbase is not currently optimized for DGM (data greater than memory) scenarios (though we are working on improving that). If you really need to scan a bunch of records on disk, traditional relational databases have optimized that to the greatest extent (by packing records into pages on disk, and using sequential disk pages on spinning disks, etc.). Instead of that, Couchbase gives you tools to get as much into memory as possible.

If the date ranges for queries might cover any part of your data set, we need to find the right document structure and index definitions to get everything needed into memory.

For a first step, I would try a covering index.

  • What is the granularity of searches? I.e. is the user going to search between “2018-07-11” and “2018-07-25”? Or are they likely to search by hours, minutes, or seconds? If days are o.k., you can define your index on a subset of the time string, and end up with a smaller index with many fewer keys. E.g.,

    CREATE INDEX eventTypeAndTime ON A(eventType,substr(time,0,10), latitude, longitude)

    Of course to use this index you would need to make sure that the query also used “substr”, e.g.:

    select avg(latitude), avg(longitude) from A where eventType = "eventType2" AND substr(time,0,10) > "2018-07-11" AND substr(time,0,10) < "2018-07-25";

  • Another thing you can do is to define separate indexes for different event types, e.g., here is a separate index for just “eventType2”, thus a smaller index:

    CREATE INDEX eventTypeAndTimeB ON A(substr(time,0,10), latitude, longitude) where eventType = "eventType2"
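
For either variant, a quick way to confirm the index is actually covering is to prefix the query with EXPLAIN and check that the plan contains a "covers" section and no Fetch operator, e.g. (using the query from the first bullet):

    EXPLAIN select avg(latitude), avg(longitude) from A where eventType = "eventType2" AND substr(time,0,10) > "2018-07-11" AND substr(time,0,10) < "2018-07-25";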

I’ll be interested to see how the covered indexes perform.