N1QL: Count query sometimes slow, sometimes not with different possible indexes

We have a simple count query that sometimes returns within 0.5s, sometimes takes 8s, and sometimes runs into a timeout (60s).
What can we check to find the root cause?

Environment
CB Version: 5.0.0 CE
Java SDK: v2.5.1
3 nodes (each 64GB RAM, 4 CPU cores, all running the Data, Index, and Query services)
522k Documents
There is currently no load on the cluster, as we are still evaluating. Only the query below is running.

Query:

SELECT COUNT(*) AS size FROM datastoreitems where datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee';

Expected result:

[
  {
    "size": 189947
  }
]

Possible indexes:

CREATE INDEX `idx_created_at` ON `datastoreitems`(`datastore_id`,(-str_to_millis(`created_at`)))
CREATE INDEX `idx_datastore` ON `datastoreitems`(`datastore_id`,`folder`)
CREATE INDEX `idx_flowrunid` ON `datastoreitems`(`datastore_id`,`flowrun_id`)
CREATE INDEX `idx_lastupdated_at` ON `datastoreitems`(`datastore_id`,(-str_to_millis(`lastupdated_at`)))

Here are some example query plans from Query Console:

Example 1: elapsed: 570.89ms | execution: 570.88ms | count: 1 | size: 38

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexCountScan2",
        "covers": [
          "cover ((`datastoreitems`.`datastore_id`))",
          "cover ((-str_to_millis((`datastoreitems`.`created_at`))))",
          "cover ((meta(`datastoreitems`).`id`))"
        ],
        "index": "idx_created_at",
        "index_id": "4f5568ba9dd1fbd6",
        "keyspace": "datastoreitems",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\"",
                "inclusion": 3,
                "low": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "IndexCountProject",
        "result_terms": [
          {
            "as": "size",
            "expr": "count(*)"
          }
        ]
      }
    ]
  },
  "text": "SELECT COUNT(*) AS size FROM  datastoreitems where datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee';"
}

Example 2: elapsed: 8.78s | execution: 8.78s | count: 1 | size: 38

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexCountScan2",
        "covers": [
          "cover ((`datastoreitems`.`datastore_id`))",
          "cover ((`datastoreitems`.`flowrun_id`))",
          "cover ((meta(`datastoreitems`).`id`))"
        ],
        "index": "idx_flowrunid",
        "index_id": "8cafd50daee76437",
        "keyspace": "datastoreitems",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\"",
                "inclusion": 3,
                "low": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "IndexCountProject",
        "result_terms": [
          {
            "as": "size",
            "expr": "count(*)"
          }
        ]
      }
    ]
  },
  "text": "SELECT COUNT(*) AS size FROM  datastoreitems where datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee';"
}

Example 3: elapsed: 8.74s | execution: 8.74s | count: 1 | size: 38

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexCountScan2",
        "covers": [
          "cover ((`datastoreitems`.`datastore_id`))",
          "cover ((`datastoreitems`.`folder`))",
          "cover ((meta(`datastoreitems`).`id`))"
        ],
        "index": "idx_datastore",
        "index_id": "857ea2aec9613055",
        "keyspace": "datastoreitems",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\"",
                "inclusion": 3,
                "low": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "IndexCountProject",
        "result_terms": [
          {
            "as": "size",
            "expr": "count(*)"
          }
        ]
      }
    ]
  },
  "text": "SELECT COUNT(*) AS size FROM  datastoreitems where datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee';"
}

Example 4: elapsed: 465.54ms | execution: 465.51ms | count: 1 | size: 38
Notice: same index as Example 3, but much faster.

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexCountScan2",
        "covers": [
          "cover ((`datastoreitems`.`datastore_id`))",
          "cover ((`datastoreitems`.`folder`))",
          "cover ((meta(`datastoreitems`).`id`))"
        ],
        "index": "idx_datastore",
        "index_id": "857ea2aec9613055",
        "keyspace": "datastoreitems",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\"",
                "inclusion": 3,
                "low": "\"cbbbfe8b-8f20-11e7-abf3-901b0ea49fee\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "IndexCountProject",
        "result_terms": [
          {
            "as": "size",
            "expr": "count(*)"
          }
        ]
      }
    ]
  },
  "text": "SELECT COUNT(*) AS size FROM  datastoreitems where datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee';"
}

All the indexes qualify and the plan looks good. The count is done by the indexer, so the index RAM quota might be playing a role. cc @deepkaran.salooja, @prathibha, @venkatb4u

@synesty, please make sure there is sufficient memory quota allocated to the index service (UI->Settings) so that all indexes fit in RAM.
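
A quick way to sanity-check this is to compare the indexer's memory usage against its quota. The sketch below assumes the indexer stats endpoint (typically `http://<node>:9102/stats`, authenticated) reports `memory_used` and `memory_quota` fields; verify the exact field names on your version before relying on them. The numbers here are canned for illustration.

```python
# Sketch: check how close the index service is to its memory quota.
# Field names "memory_used"/"memory_quota" are assumptions about the
# indexer stats endpoint -- confirm them against your cluster's output.

def index_memory_headroom(stats: dict) -> float:
    """Return the fraction of the index memory quota still free (0.0-1.0)."""
    used = stats["memory_used"]
    quota = stats["memory_quota"]
    return max(0.0, 1.0 - used / quota)

# Canned numbers standing in for a real stats response:
sample_stats = {"memory_used": 48 * 1024**3, "memory_quota": 50 * 1024**3}
headroom = index_memory_headroom(sample_stats)
print(f"index memory headroom: {headroom:.1%}")
```

If the headroom is consistently near zero, indexes are likely being paged in and out, which would explain variable scan times.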

Also, 4 cores do not seem sufficient to colocate the Data, Query, and Index services together. If a query request gets processed by the query and index services on the same node, there is going to be a lot of CPU contention. It would be good to look at the CPU utilization of the nodes while running your queries.

What is the storage mode being used (Standard or Memory Optimized)?

@deepkaran.salooja: note he says there is no other load on the system. That being the case, how would CPU contention be different from run to run? Agreed that it may be useful to try to eliminate CPU as a cause though.

@synesty: my guess, but it’s just that, is that you’re seeing the effects of garbage collection. You could see whether you can correlate the longer query times with increases in GC pause time from the vitals URI in the REST interface. Maybe script up the queries you’re running and check the vitals URI after each one returns. I could well be wrong, but it’d be good to eliminate that as a cause, as we’ve seen it cause non-deterministic increases in request latency in other tests I’ve been involved in.
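
A minimal sketch of that correlation loop: fetch the query service vitals (served at `http://<node>:8093/admin/vitals`) before and after each query, and record how much GC pause time accumulated in between. The `gc.pause.time` field name and its unit are assumptions here; check the vitals output of your own cluster. The snapshots below are canned JSON standing in for two real responses.

```python
# Sketch: measure GC pause time accumulated across a query by diffing two
# vitals snapshots.  The "gc.pause.time" key is an assumption about the
# vitals payload -- verify it against a real /admin/vitals response.

import json

def gc_pause_delta(before: dict, after: dict, key: str = "gc.pause.time") -> float:
    """GC pause time accumulated between two vitals snapshots (same unit as the field)."""
    return after[key] - before[key]

# Canned snapshots standing in for two real vitals responses:
before = json.loads('{"gc.num": 120, "gc.pause.time": 850.0}')
after = json.loads('{"gc.num": 124, "gc.pause.time": 1910.0}')

print(f"GC pause accumulated during query: {gc_pause_delta(before, after)}")
```

Logging this delta next to each query's elapsed time should make any GC/latency correlation obvious.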

Good luck!

Standard GSI

Ok, we will try that. I hope it is something else more obvious, as it worries me that one count(*) query could cause so much GC activity that another query times out or takes much longer.

I hope so too. It’s a possible explanation, so it’d be good to eliminate it as a cause. Also, at least in my experience, it can be solved with some tuning. That’ll be workload/environment dependent though.

@synesty:

Even in Couchbase 5.0 CE, the indexer uses the older ForestDB as the storage manager. That could be contributing to this performance variation.

You can certainly use Couchbase 5.0 EE for your evaluation.
It has a new storage engine (Plasma) with much better performance characteristics.
Please try the EE edition and let us know. Thank you.

@keshav_m: Given that this is a query in an otherwise quiet environment, how could @synesty get information from the system showing that the query engine’s request to the indexer is the cause? Changing to Plasma/EE may indeed help, though it may also be better to use the observation tools we have to find the root cause instead of changing a bunch of variables at once. :slight_smile:

Good point. They can PROFILE the queries.
See this: https://blog.couchbase.com/optimize-n1ql-performance-using-request-profiling/
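
For reference, profiling is enabled per request via the `profile` parameter of the query service (`http://<node>:8093/query/service`); the documented values are `off`, `phases`, and `timings`. This sketch just builds the POST body for such a request; with `profile=timings` the response carries a `profile` section with per-operator execution times.

```python
# Sketch: build a query service request body with profiling enabled.
# Endpoint assumed: POST http://<node>:8093/query/service (authenticated).

from urllib.parse import urlencode

statement = ("SELECT COUNT(*) AS size FROM datastoreitems "
             "WHERE datastore_id = 'cbbbfe8b-8f20-11e7-abf3-901b0ea49fee'")

# "profile": "timings" asks the engine to return per-operator timings
# alongside the normal result set.
body = urlencode({"statement": statement, "profile": "timings"})
print(body)
```

The same setting is available in the Query Workbench and in cbq, so you can flip it on only while chasing the slow runs.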

Thanks @ingenthr, @keshav_m, @deepkaran.salooja for the great suggestions and tools. We will try that for sure.
We had 5.0 EE beta2 running for some weeks (but with Standard GSI instead of Plasma), but we want to start with CE for now.

Today we found that one of the 3 nodes had high CPU load (not sure why; some CB-related process was constantly at high CPU and was probably slowing things down, but we are not sure which one exactly). We suspect the query sometimes hit that node and sometimes not.
A colleague restarted (re-installed) that node and re-created the indexes, and now the query response times seem to have stabilized around 0.5s.

We will check how to improve our monitoring so we can recognize this earlier and have a quick way to turn on the profiling tools you suggested. We will now continue adding more load and documents.

@synesty, as a side note: in 5.0 EE, the Standard GSI option actually selects and uses Plasma. We also automatically move our EE customers using ForestDB to Plasma when upgrading to 5.0.
