Couchbase performance issue - slowness

@samrecd
Whether 2 minutes is unrealistic or not is hard for me to say - it depends on a lot of factors: how many documents, of course, being a big one, but also your network latencies, how many operations are happening in parallel, your residency ratio (fetches from disk will be slower), your GC (the content().toMap() involves JSON decoding, which can hit CPU and memory/GC), and the server resources. There’s not much for me to go on at this stage, so here’s what I personally would check:

  • Add some quick, crude instrumentation to the code to check the number of in-flight ops, how long they are taking end-to-end, and how many docs I am fetching (there’s a rough sketch of this after the list).
  • Since you have a N1QL fetch plus a Key-Value fanout, increasing throughput on the fanout is low-hanging fruit (assuming you’ve already tuned the N1QL fetch). You can tune the flatMap parallelism with RxJava and see what difference that makes (second sketch below). I’d be looking to push the parallelism to the point where no obvious issues are occurring (timeouts, increased latency). I might need to consider other parts of the system here - e.g. if my Couchbase cluster is capable of X ops/sec, what % of X am I comfortable committing to this one bulk lookup?
  • If performance is now where you want it, feel free to stop. If not, it’s time to look into SDK tuning.
  • The main SDK tuning option that could help here is the kvEndpoints config setting. This controls the number of TCP connections from the SDK to the Key-Value service - for each bucket, and on each node. There’s not much of a downside to setting this higher, beyond using up more TCP connections, so I’d try out some higher settings and see what they do to overall throughput (third sketch below).
  • If performance is still not where desired, it’s time to check what is actually possible against this cluster and network topology. I’d measure my roundtrip network latency from the appserver to the cluster using normal networking tools. If this is high, I’d see if I can get my appserver closer to the cluster. (This is very dependent on your networking setup - on AWS it would involve using the same region, and VPC peering to go the extra mile.)
  • I would expect that latency to be close to the end-to-end figures my own code is reporting. If not, something somewhere is adding some latency - time to debug further.
  • (By now, it’s possible to start defining performance goals more formally/realistically, with some idea of the minimum and p99 network latencies, and how much parallelism I can afford to throw at this without potentially negatively affecting other parts of my system - e.g. how many batches I need to do the work in. E.g. if I want all docs read in < 10ms p99 but my network latency is 3ms p99 and I need to do 5 batches - I know now it’s not going to happen.)
  • I’d monitor my own application’s CPU, memory and GC stats, checking for sawtooth patterns, leaks, or anything else that looks clearly wrong. It might be time to bust out the appserver profiling. The d.content().toMap() involves JSON decoding, so it could be a hotspot. If so, maybe move to Sub-Document reads (fourth sketch below), which will reduce the amount of data that has to be fetched and decoded. With the JVM it’s always worth checking the GC logs to see how long the GC pauses are, and how frequent. Possibly other parts of the application are slowing down this bulk lookup, so I’d try to get this to an isolated test case (while remembering that JVM microbenchmarking has avoidable pitfalls due to HotSpot warmup - JMH or similar is recommended).
    - If CPU, memory and GC look fine, but latency is still substantially higher than the network roundtrip time, it’s time to check the server performance. Reads from disk will be much slower than from memory, so check that the residency ratio (the % of active data in memory) is high.
  • I’d crack out the OpenTelemetry tools now. Set up Jaeger, Zipkin, Lightstep or Honeycomb (the latter is my favourite, personally speaking, but is a commercial product). Set up the Couchbase SDK to output to it. None of this takes long.
  • The OpenTelemetry trace data will give you another version of the crude instrumentation added earlier. It will also give you a bundle of useful stats, including the “server_duration” attribute, which tells you how long the server is spending processing the item and should be negligible (a few microseconds) for in-memory reads. And it gives a lot of insight into what’s happening inside the SDK layers. If you see some big unexplainable gaps in the trace data inside the SDK, that could point to an overwhelmed appserver - it would be worth circling back to checking appserver CPU, memory and GC here.
  • If the OpenTelemetry data shows that the SDK is pushing requests out to the wire promptly, and the gap between that and the response is still much longer than the network RTT, it’s time for the next level of debugging - deep-diving into the server performance. The cbcollect_info tool can return a massive amount of very useful, very detailed profiling data - but I’m no expert on it, this post is already far too long, and I think your issues will very likely be addressed by one of the earlier steps in this list, so I’ll stop here for now :slight_smile:
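
A few rough sketches to make the above concrete. First, the crude instrumentation: a minimal sketch assuming the Java SDK 2.x async API and RxJava 1 (which the flatMap and content().toMap() references suggest you’re on) - the bucket, ids and the println logging are just placeholders for your own code:

```java
import com.couchbase.client.java.AsyncBucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class BulkFetchInstrumented {
    // Crude counter - just enough to see how many ops are in flight at once
    private static final AtomicInteger inFlight = new AtomicInteger();

    public static List<JsonDocument> fetch(AsyncBucket bucket, List<String> ids) {
        long start = System.nanoTime();
        List<JsonDocument> docs = Observable.from(ids)
                .flatMap(id -> {
                    long opStart = System.nanoTime();
                    System.out.println("in-flight=" + inFlight.incrementAndGet());
                    return bucket.get(id)
                            .doOnTerminate(() -> {
                                inFlight.decrementAndGet();
                                System.out.println("op took "
                                        + (System.nanoTime() - opStart) / 1_000_000 + "ms");
                            });
                })
                .toList()
                .toBlocking()
                .single();
        System.out.println("fetched " + docs.size() + " docs in "
                + (System.nanoTime() - start) / 1_000_000 + "ms");
        return docs;
    }
}
```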
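Second, the fanout parallelism: RxJava’s flatMap takes an optional second argument capping how many gets are in flight at once. A sketch - the value passed in is something you’d tune upwards while watching for timeouts and latency, not a recommendation:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;
import rx.Observable;

import java.util.List;

public class BoundedFanout {
    // parallelism caps the number of concurrent Key-Value gets
    public static List<JsonDocument> fetch(Bucket bucket, List<String> ids, int parallelism) {
        return Observable.from(ids)
                .flatMap(id -> bucket.async().get(id), parallelism)
                .toList()
                .toBlocking()
                .single();
    }
}
```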
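Third, kvEndpoints: in SDK 2.x it lives on the environment builder. A sketch with the hostname and the value of 4 purely illustrative - try a few values and measure throughput:

```java
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class TunedConnection {
    public static CouchbaseCluster connect() {
        // kvEndpoints raises the number of Key-Value TCP connections
        // per bucket, per node (the default is 1)
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .kvEndpoints(4)
                .build();
        return CouchbaseCluster.create(env, "10.0.0.1");
    }
}
```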
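And fourth, the Sub-Document option: a sketch of fetching a single field rather than pulling back and decoding the whole document - the docId and path arguments here are hypothetical:

```java
import com.couchbase.client.java.Bucket;

public class SubdocRead {
    // Fetches only the requested path, so less data crosses the wire
    // and less JSON has to be decoded on the appserver
    public static Object readField(Bucket bucket, String docId, String path) {
        return bucket.lookupIn(docId)
                .get(path)
                .execute()
                .content(path);
    }
}
```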