Using N1QL to upload some large documents via cbq takes a very long time

We have a file of approximately 270 MB containing around 1600 INSERT statements, each inserting a single JSON document. Each document is relatively large and includes arrays. Executing this file through cbq takes about 45 minutes, which seems excessively slow.

Here’s what we’ve tried so far:

  1. Parallel Execution: Splitting the file and running multiple cbq sessions in parallel reduced execution time, but productionizing this approach would require additional time and effort.
  2. Batch Inserts: We attempted batching by inserting 10 documents in a single INSERT statement, but this didn’t improve the performance.
  3. Using cbimport: While cbimport is significantly faster, it requires the file to be in JSON format, which involves additional preprocessing.
  4. Ruling Out Network and Hardware Constraints:
    • To eliminate network overhead, we ran the N1QL queries directly on the machine hosting Couchbase Server.
    • We placed the file on a RAM disk to minimize disk I/O overhead.

Despite these efforts, the performance improvement has been minimal. We are keen to understand why processing takes so long. The documents are large, but not so large as to justify such delays.

Is there any parameter or option in cbq that could help speed up the process? We’ve already tried disabling logging to stdout and other logs, but that didn’t help. Any insights would be greatly appreciated!

I’m not sure if it gives information for inserts, but for SELECT, there is a break-down of execution time in completed_requests - Manage and Monitor Queries | Couchbase Docs
My gut feeling is that there is no easy “speed-up” here and that if this needs to be done on a regular basis, the effort to use cbimport would be worthwhile.

If you want to investigate further, I think that concentrating on the performance of a single insert would be sufficient.

Each SQL statement has to be parsed and processed, then the document sent to the KV. It would be faster to send the documents directly to the KV using an SDK program.

45*60 / 1600 ~ 1.7s per statement. If you gather a profile for a single insert and check the “phaseTimes” section, you may find that most of the time is spent in the “parse” phase. If it is in the “insert” phase, then this is simply the time spent sending the data to the KV.
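If it helps, profiling can be switched on from the cbq shell before running a single insert, roughly like this (the keyspace and key here are just placeholders; the profile parameter accepts "phases" or "timings"):

cbq> \set -profile "phases";
cbq> INSERT INTO default VALUES("k_profile",{"a":"this is the doc"});

The response should then include a "profile" section containing the "phaseTimes" breakdown.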

To improve the performance of the INSERT SQL statement, make sure to use a prepared statement with the values as parameters and pass those - i.e.

cbq> prepare ins from insert into default values(?,?);
{
    "requestID": "b574cb04-93d2-4cf8-abab-092f017d1c92",
    "signature": "json",
    "results": [
    {
        "encoded_plan": "H4sIAAAAAAAA/wEAAP//AAAAAAAAAAA=",
...
cbq> \set -args ["k0",{"a":"this is the doc"}];
cbq> execute ins;
{
    "requestID": "350271d3-5162-4022-9be2-30da0856c170",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "15.068537ms",
        "executionTime": "14.907989ms",
        "resultCount": 0,
        "resultSize": 0,
        "serviceLoad": 3,
        "mutationCount": 1
    }
}
cbq> \set -args ["k1",{"a":"this is also a doc"}];
cbq> execute ins;
{
    "requestID": "eb254463-9899-486d-8b1c-f3a77abae304",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "1.456261ms",
        "executionTime": "1.215045ms",
        "resultCount": 0,
        "resultSize": 0,
        "serviceLoad": 3,
        "mutationCount": 1
    }
}

So basically transform your file that has:

INSERT INTO ks VALUES("key",{doc});
INSERT INTO ks VALUES("key",{doc});
...

into one with:

\set -args ["key",{doc}];
EXECUTE ins;
\set -args ["key",{doc}];
EXECUTE ins;
...

(Prepare “ins” ahead of running the file with cbq-shell.)
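If each INSERT sits on a single line, a GNU sed one-liner along these lines could do that rewrite (file names are placeholders and the keyspace is assumed to be ks; statements spread over multiple lines would need a bit more work):

$ sed 's/^INSERT INTO ks VALUES(\(.*\));$/\\set -args [\1];\nEXECUTE ins;/' /tmp/inserts.n1ql > /tmp/prepared.n1ql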

HTH.


Also, if your file is:

INSERT INTO ks VALUES("key",{
	"some":["large","document"]
});
INSERT INTO ks VALUES("key",{
	"some":["large"
	,"document"
	],
	"over":"multiple",
	"lines" : true
});

Couldn’t a bit of sed and jq get it into a form suitable for cbimport?

e.g.

$ cat /tmp/x
INSERT INTO ks VALUES("key",{
        "some":["large","document"]
});
INSERT INTO ks VALUES("key",{
        "some":["large"
        ,"document"
        ],
        "over":"multiple",
        "lines" : true
});
$ sed 's/INSERT INTO ks VALUES(//;s/);$//;s/\("[^"]*"\),{/{"_kf":\1,/' /tmp/x|jq -c
{"_kf":"key","some":["large","document"]}
{"_kf":"key","some":["large","document"],"over":"multiple","lines":true}

which leaves the output suitable for import with something like (assuming you’ve redirected that output to /tmp/transformed.json):

$ cbimport json --format lines -c http://localhost:8091 -u Administrator -p password -g %_kf% -d file:///tmp/transformed.json -b default

HTH.


As @dh mentioned, each SQL statement must go through parse, plan and execute; if the values are big and embedded in the statement text, they also have to be parsed.

One way to avoid that is to prepare once with parameters and then execute repeatedly.
There are 5 predefined generic statements: query/prepareds/prepareds.go at master · couchbase/query · GitHub

Execute something like this:

curl -u user:password http://localhost:8093/query/service -H "Content-Type: application/json" -d '{"prepared":"__insert", "args":["bucket.scope.collection", dockey, document, {}]}'
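If the file has already been flattened to one document per line (like the /tmp/transformed.json from the earlier post, with the key in _kf), that call can be scripted, roughly like this - the default._default._default target and the credentials are just placeholder assumptions:

$ while read -r line; do
    printf '%s' "$line" |
    jq -c '{prepared:"__insert", args:["default._default._default", ._kf, del(._kf), {}]}' |
    curl -s -u user:password http://localhost:8093/query/service \
      -H "Content-Type: application/json" -d @-
  done < /tmp/transformed.json

One request per document is still not as fast as cbimport, but it avoids re-parsing the statement text every time.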

Also, you can use the cbc tool to store directly in the KV.
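For reference, a single KV store via cbc might look roughly like this (check cbc create --help for the exact flags in your libcouchbase version; host, bucket and credentials are placeholders):

$ cbc create k0 -V '{"a":"this is the doc"}' -U couchbase://localhost/default -u Administrator -P password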

Thanks @dh, we had come up with a similar sed + jq command to transform the file into JSON, which could then be imported using cbimport. While that approach worked, it still required transforming the file, which we were hoping to avoid if possible.
Another reason for posting was our confusion about why it’s so slow—it’s been difficult to pinpoint. Based on the responses, it seems that the parsing is where most of the time is being spent.

Thanks again!

Thanks @dh, we’ll give this a try.
Just one quick question: I believe everything can go in a single file, with the prepared statement at the top followed by the \set -args lines with the values, right?
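i.e. something along these lines (keyspace and values copied from the earlier example, just as placeholders):

prepare ins from insert into default values(?,?);
\set -args ["k0",{"a":"this is the doc"}];
execute ins;
\set -args ["k1",{"a":"this is also a doc"}];
execute ins;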