Cbq-engine consuming massive memory for simple queries

The following very simple query makes the cbq-engine consume 10GB of memory. The data set inside the GROUP statement only returns approx. 4,000 records. The “drg-IDC-9” document is about 212KB in size.

Is it possible that it’s trying to tack on the entire 212KB json doc onto the end of each row returned inside the grouping? If so, is there a better way to write this? Inside the drg-IDC-9 document, there’s an array of codes. What we’re trying to do is join to the document to get the corresponding weight where the visits.drgCode matches the msDrg code in the drg-IDC-9 document.

SELECT SUBSTR(visits.dischargeDate,0,10) As dischargeDate,
AVG(FIRST c.weight for c IN codes.drgs WHEN c.msDrg = visits.drgCode END) As avgWeight
FROM bucket visits
JOIN bucket codes ON KEYS "drg-IDC-9"
WHERE meta(visits).id LIKE "visit-%" AND visits.patientClass LIKE "I%" AND visits.tenantName = "xxxxxxxx" 
AND visits.drgCode != "" AND visits.drgType LIKE "MS%"
AND SUBSTR(visits.dischargeDate,0,10) >= "2016-07-01" AND SUBSTR(visits.dischargeDate,0,10) < "2016-08-01"
GROUP BY SUBSTR(visits.dischargeDate,0,10)

The memory usage is not per query. It is for a steady state of processing many concurrent queries, continuously. As a first step, you can run a few clients concurrently, issue queries in a loop with a realistic pause, and then measure and report the memory usage.

The query engine and garbage collector may be holding on to memory for ongoing and future euse.

Thanks geraldss, that’s a self-remedying problem because the cbq-engine crashes if we increase the date boundary more than a few days :slight_smile: Plus, we’re also doing this in a test environment where there are not steady-state queries running.

This issue is reproducible from a freshly-started service. It will go from several KB of memory to 10GB (following by crashing) in just a few minutes.

What I believe is happening is it’s jamming that entire 212KB json document onto the end of each row returned by the join. Can you confirm that is what would take place?

HI @jeffhoward001,

The JOIN is being performed before the WHERE, so it may be loading unnecessary documents in the JOIN.

Try putting the left side + WHERE in a subquery:

SELECT ...
FROM (
    SELECT ...
    FROM bucket AS visits
    WHERE ...
) AS visits
JOIN bucket AS codes ...
GROUP BY ...

I hope you have all the necessary indexes on visits. How big is codes.drgs?

It is possible to rewrite this to only load codes.drgs, but it is complicated. Try the query rewrite first.

I’m not sure if I can properly rewrite the query in that way. This is hitting on a bigger topic that we’ve struggled with in N1QL. I know the indexed-joins allow you to join on individual properties instead of document keys, but is this possible for nested properties?

Here’s an example: We have a single document that contains all the DRG codes in a nested array:

{
  "docId": "drg-IDC-9",
  "version": "IDC-9",
  "drgs": [
    {
      "msDrg": "001",
      "weight": 26.2466
    },
    {
      "msDrg": "002",
      "weight": 14.6448
    },
    {
      "msDrg": "003",
      "weight": 17.6569
    }
}

In a more SQL-like psudeo-code, all we really want to do is join the visits.drgCode on codes.drgs.msDrg to do a “look-up” for the relative weight (simplified version):

SELECT visits.dischargeDate, visits.drgCode, codes.weight 
FROM bucket visits
JOIN bucket codes ON KEYS "drg-IDC-9" WHERE visits.drgCode = codes.drgs.msDrg

It seems like that would be a simple enough query to write with N1QL, but I can’t find any good examples of joining to a nested property.

If you post two sample documents here, visits and codes, and the ANSI SQL query, we can propose a N1QL query.

And yes, you can properly rewrite your original query as I suggested.

I rewrote the query with a sub-select, and the cbq-engine is using much less memory, although the query is pretty slow. The bigger question at this point is, is the FIRST/FOR/IN/WHEN really the best way to iterate over the codes.drgs array, or is there another way to write this where we’re actually “joining” directly to the codes in the codes.drgs array?

Sample of the “drg-IDC-9” document:

{
  "docId": "drg-IDC-9",
  "version": "IDC-9",
  "drgs": [
    {
      "msDrg": "001",
      "weight": 26.2466
    },
    {
      "msDrg": "002",
      "weight": 14.6448
    },
    {
      "msDrg": "003",
      "weight": 17.6569
    }
}

Sample of the “visits” document:

{
  "dischargeDate": "2016-09-20T15:57:00-07:00",
  "drgCode": "001",
  "visitNumber": "123456890"
}

and here’s the simplest form of what we’re tying to accomplish:

SELECT visits.dischargeDate, visits.drgCode, codes.weight 
FROM visits
INNER JOIN "drg-IDC-9" codes  ON visits.drgCode = drgs.msDrg

Here you go.

SELECT
        SUBSTR(visits.dischargeDate,0,10) As dischargeDate,
        AVG(FIRST c.weight for c IN drgs WHEN c.msDrg = visits.drgCode END) As avgWeight
FROM
        bucket visits
LET
        drgs = (SELECT RAW codes.drgs FROM bucket AS codes USE KEYS "drg-IDC-9")[0]
WHERE
        META(visits).id LIKE "visit-%" AND visits.patientClass LIKE "I%" AND visits.tenantName = "xxxxxxxx"
        AND visits.drgCode != "" AND visits.drgType LIKE "MS%"
        AND SUBSTR(visits.dischargeDate,0,10) >= "2016-07-01" AND SUBSTR(visits.dischargeDate,0,10) < "2016-08-01"
GROUP BY
        SUBSTR(visits.dischargeDate,0,10)
;

If you are using 4.5.1 or above, you should also set max-parallelism to 2 x number of cores.

Thanks Geraldss. How do we set the max parallelism? Is that a query hint, or a server level setting?

In this case, it’s a separate parameter outside the statement itself. Assuming you’re using one of the Couchbase SDKs @jeffhoward001, you’ll find a maxParallelism setting on N1QL parameters to set this on a per-statement basis. The N1QL from the SDK page for each language should cover this.

@geraldss, the provided query works, but it’s performance is significantly slower than an alternative version we created with individual documents for each msDrg using a traditional join (e.g. individual docs - drg-ms-001, drg-ms-002, etc). I assume this is because the index scan using the key name is more efficient than the FIRST/FOR/IN iterating over the drg-IDC-9 document?

Is there any plan for couchbase to be able to join on covering indexes where it doesn’t need to load the actual doc? For example, if we had a covering index for the “drg-IDC-9” document:

CREATE INDEX idx_drgs_covering ON bucket(drgs.msDrg,drgs.weight) WHERE meta().id = "drg-IDC-9"

Which would allow you to efficiently perform a direct look-up of the data without needing to load the actual document:

SELECT visit.dischargeDate, drgs.weight
FROM bucket visit
JOIN bucket drgs ON visit.drgCode = drgs.msDrg

@jeffhoward001, can you post the alternative version. Thanks.

Definitely, but to understand the new query it’s important to note that we took the contents of the previous nested “drgs” array out of the “drgs-IDC-9” document, and broke out each nested set of properties into their own document, e.g.:

{
  "docId": "drg-IDC-9",
  "version": "IDC-9",
  "drgs": [
    {
      "msDrg": "001",
      "weight": 26.2466
    },
    {
      "msDrg": "002",
      "weight": 14.6448
    },
    {
      "msDrg": "003",
      "weight": 17.6569
    }
}

was converted into:

{
  "docId": "drg-ms-001",
  "msDrg": "001",
  "weight": 26.2466
}
{
  "docId": "drg-ms-002",
  "msDrg": "002",
  "weight": 14.6448
}
{
  "docId": "drg-ms-003",
  "msDrg": "003",
  "weight": 17.6569
}

Then we did a traditional join like we normally would for non-nested content:

SELECT visits.dischargeDate, codes.weight
FROM bucket visits
JOIN bucket codes ON KEYS "drg-ms-" || visits.drgCode

This form is fine for the purposed of working through this project. But long-term we want to be sure we can write high-performing queries that can join to properties inside of nested arrays since we may not always have the liberty to expand out nested arrays into individual documents.

Ok, got it. For the query with a LET, did you set max-parallelism?

For the JOIN you want, we would need to provide non-key JOINs. Non-key JOINs are provided by Couchbase Analytics, which is in Developer Preview. For N1QL, non-key JOINs are a major undertaking and are not yet scheduled.

Your particular example with the index makes sense. We will use your example to think about this.