N1QL: uploading some large documents via cbq takes a very long time

We have a file of approximately 270 MB containing around 1600 INSERT statements, each inserting a single JSON document. Each document is relatively large and includes arrays. Executing this file through cbq takes about 45 minutes, which seems excessively slow.

Here’s what we’ve tried so far:

  1. Parallel Execution: Splitting the file and running multiple cbq sessions in parallel reduced execution time, but productionizing this approach would require additional time and effort (see the sketch after this list).
  2. Batch Inserts: We attempted batching by inserting 10 documents in a single INSERT statement, but this didn’t improve the performance.
  3. Using cbimport: While cbimport is significantly faster, it requires the file to be in JSON format, which involves additional preprocessing.
  4. Ruling Out Network and Hardware Constraints:
    • To eliminate network overhead, we ran the N1QL queries directly on the machine hosting Couchbase Server.
    • We placed the file on a RAM disk to minimize disk I/O overhead.
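
For reference, the parallel approach in point 1 was roughly along these lines (a sketch only; the chunk count, file names, endpoint and credentials are placeholders):

# split the statement file into 8 chunks (GNU split) and run one cbq session per chunk
split -n l/8 inserts.n1ql chunk_
for f in chunk_*; do
  cbq -e http://localhost:8093 -u Administrator -p password -f "$f" &
done
wait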

Despite these efforts, the performance improvement has been minimal. We are keen to understand why processing takes so long. The documents are large, but not so large as to justify such delays.

Is there any parameter or option in cbq that could help speed up the process? We’ve already tried disabling logging to stdout and other logs, but that didn’t help. Any insights would be greatly appreciated!

I’m not sure if it gives information for inserts, but for SELECT there is a breakdown of execution time in completed_requests - see Manage and Monitor Queries | Couchbase Docs.
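
For example, something along these lines should show the timing breakdown for recent requests (assuming the standard fields exposed by system:completed_requests):

SELECT requestTime, statement, phaseTimes
FROM system:completed_requests
ORDER BY requestTime DESC
LIMIT 5;
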
My gut feeling is that there is no easy “speed-up” here and that if this needs to be done on a regular basis, the effort to use cbimport would be worthwhile.

If you want to investigate further, I think that concentrating on the performance of a single insert would be sufficient.

Each SQL statement has to be parsed and processed, and then the document is sent to the KV. It would be faster to just send the documents directly to the KV using an SDK programme.

45*60 / 1600 ≈ 1.7s per statement. If you gather a profile for a single insert and check the “phaseTimes” section, you may find that most of the time is spent in the “parse” phase. If it is in the “insert” phase, then it is simply the time spent sending the data to the KV.
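
To gather such a profile from the cbq shell, you can turn on profiling for the session before running a single insert, e.g. (the profile request parameter accepts “phases” or “timings”; “ks” and the document are placeholders):

cbq> \set -profile "phases";
cbq> INSERT INTO ks VALUES("k2",{"a":"another doc"});

The response should then include a “phaseTimes” breakdown in its “profile” section.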

To improve the performance of the INSERT SQL statement, make sure to use a prepared statement with the values passed as parameters, e.g.:

cbq> prepare ins from insert into default values(?,?);
{
    "requestID": "b574cb04-93d2-4cf8-abab-092f017d1c92",
    "signature": "json",
    "results": [
    {
        "encoded_plan": "H4sIAAAAAAAA/wEAAP//AAAAAAAAAAA=",
...
cbq> \set -args ["k0",{"a":"this is the doc"}];
cbq> execute ins;
{
    "requestID": "350271d3-5162-4022-9be2-30da0856c170",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "15.068537ms",
        "executionTime": "14.907989ms",
        "resultCount": 0,
        "resultSize": 0,
        "serviceLoad": 3,
        "mutationCount": 1
    }
}
cbq> \set -args ["k1",{"a":"this is also a doc"}];
cbq> execute ins;
{
    "requestID": "eb254463-9899-486d-8b1c-f3a77abae304",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "1.456261ms",
        "executionTime": "1.215045ms",
        "resultCount": 0,
        "resultSize": 0,
        "serviceLoad": 3,
        "mutationCount": 1
    }
}

So basically transform your file that has:

INSERT INTO ks VALUES("key",{doc});
INSERT INTO ks VALUES("key",{doc});
...

into one with:

\set -args ["key",{doc}];
EXECUTE ins;
\set -args ["key",{doc}];
EXECUTE ins;
...

(Prepare “ins” ahead of running the file with cbq-shell.)
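
For example, a complete input file (hypothetical keyspace, keys and documents) could look like:

prepare ins from insert into ks values(?,?);
\set -args ["k0",{"a":"first doc"}];
execute ins;
\set -args ["k1",{"a":"second doc"}];
execute ins;

and be run with something like:

$ cbq -e http://localhost:8093 -u Administrator -p password -f prep.n1ql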

HTH.

Also, if your file is:

INSERT INTO ks VALUES("key",{
	"some":["large","document"]
});
INSERT INTO ks VALUES("key",{
	"some":["large"
	,"document"
	],
	"over":"multiple",
	"lines" : true
});

Couldn’t a bit of sed and jq get it into a form suitable for cbimport?

e.g.

$ cat /tmp/x
INSERT INTO ks VALUES("key",{
        "some":["large","document"]
});
INSERT INTO ks VALUES("key",{
        "some":["large"
        ,"document"
        ],
        "over":"multiple",
        "lines" : true
});
$ sed 's/INSERT INTO ks VALUES(//;s/);$//;s/\("[^"]*"\),{/{"_kf":\1,/' /tmp/x|jq -c
{"_kf":"key","some":["large","document"]}
{"_kf":"key","some":["large","document"],"over":"multiple","lines":true}

which leaves the output suitable for import with something like (assuming you’ve redirected that output to /tmp/transformed.json):

$ cbimport json --format lines -c http://localhost:8091 -u Administrator -p password -g %_kf% -d file:///tmp/transformed.json -b default

HTH.

As @dh mentioned, each SQL statement must go through parse, plan and execute; if the values are large and embedded in the statement, those must also be parsed.

One way to avoid that is to prepare once with parameters and execute repeatedly.
There are 5 predefined generic prepared statements: query/prepareds/prepareds.go at master · couchbase/query · GitHub

Execute something like this:

curl -u user:password http://localhost:8093/query/service -H "Content-Type: application/json" -d '{"prepared":"__insert", "args":["bucket.scope.collection", dockey, document, {}]}'
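
For example, with placeholder values filled in (default bucket/scope/collection, key "k0" and a sample document; adjust to your keyspace):

curl -u user:password http://localhost:8093/query/service \
  -H "Content-Type: application/json" \
  -d '{"prepared":"__insert", "args":["default._default._default", "k0", {"a":"this is the doc"}, {}]}'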

You can also use the cbc tool to store the documents directly in the KV.
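
For example, something along these lines (check cbc create --help for the exact flags on your libcouchbase version; the key, document and connection details are placeholders):

$ cbc create k0 -V '{"a":"this is the doc"}' -U couchbase://localhost/default -u Administrator -P password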

Thanks @dh, we had come up with a similar sed + jq command to transform the file into JSON, which could then be imported using cbimport. While that approach worked, it still required transforming the file, which we were hoping to avoid if possible.
Another reason for posting was our confusion about why it’s so slow—it’s been difficult to pinpoint. Based on the responses, it seems that the parsing is where most of the time is being spent.

Thanks again!

Thanks @dh, we’ll give this a try.
Just one quick question: I believe everything can go in a single file, with the prepared statement at the top followed by the \set -args and EXECUTE statements, right?

@pccb

To summarize:

\set -args ...
EXECUTE prep

is OK to do, but the \SET command takes a long time when the args include a large JSON document like the one you have. This is due to
https://jira.issues.couchbase.com/browse/MB-65181

You can download the version of the tool with the fix in the upcoming 7.6.6/7.2.7 releases.

Also, another tip to reduce the time taken is to set the history file to /dev/null using the history-file setting (see the cbq shell documentation on Couchbase Docs).
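
For example, in the cbq shell or at the top of your input file (assuming the shell parameter is named histfile):

\set histfile /dev/null;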

Hope this helps:)

Thanks @gauravj10, setting the history file to /dev/null indeed helped to reduce the time by approx. 30%, much appreciated!

However, this is still not enough for us. Do you have any insights on how much the fix for the prepared statement approach is likely to reduce the time? Will it just eliminate the extra delay, or will the prepared statement perform faster compared to not using it?

If you gather a profile for a single insert and check the “phaseTimes” section, you may find that most of the time is spent in the “parse” phase.

Do you have any insights on how much the fix for the prepared statement approach is likely to reduce the time? Will it just eliminate the extra delay, or will the prepared statement perform faster compared to not using it?

The fix eliminates the extra delay.
The prepared statement itself runs in a few milliseconds; the remaining time is spent in the issue that the fix addresses and in writing to the history file.

Sharing an experiment I tried out earlier on a Mac M1 machine:
prep.n1ql has 1624 \set commands, each followed by execute prep (for an insert).

time taken without the fix: 25m42.133s
time taken with the fix: 2m7.97s
time taken with the fix and writing history to /dev/null: 53.125s

The test runs the \set followed by execute prep commands sequentially.
You could split the file and try parallel execution as suitable, as you mentioned in an earlier post, to further improve performance.

Happy to help:)

Thanks again, @gauravj10!

This should help. We’ll wait for the patch to be out.