A couple of frequently asked questions on N1QL for query service:
- When do we actually use a primary index?
- Why the index advisor does not recommend the primary index when that may the only choice?
Read on…
Couchbase is a distributed database. It supports flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during the inserting or updating of the data. Here’s an example document.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
select meta().id, travel from `travel-sample` travel where type = 'airline' limit 1; [ { "id": "airline_10", "travel": { "callsign": "MILE-AIR", "country": "United States", "iata": "Q5", "icao": "MLA", "id": 10, "name": "40-Mile Air", "type": "airline" } } ] |
1 2 3 4 5 6 7 |
INSERT INTO customer (key, value) VALUES(“cx:123”, {“name”:”joe”, “zip”: 94040, “state”:ca}); SELECT META().id FROM customer; cx:123 |
Each Couchbase bucket can store data multiple types: customer, order, catalog, etc. When you load the “travel-sample” dataset, you load five distinct types of documents: airline, airport, hotel, route, landmarks.
But, by default, Couchbase does not have the equivalent of “full table scan” to scan all the documents from start to finish. The primary index scan provides you the “full table scan” equivalent.
CREATE PRIMARY INDEX ix_customer_primary ON customer;
What is the Primary Index?
-
- a sorted list of all the document keys of every document type within the bucket customer.
- maintained asynchronously, just like other secondary indexes
- keeps just the document key and nothing else
- supports all the scan consistencies:
- Unbounded
- AT_PLUS
- REQUEST_PLUS
The primary index allows the query engine to access all the documents, then do the filtering, joining, aggregation, etc operations on them.
EXPLAIN SELECT * from customer WHERE zip = 94040 name name = “joe” and type = “cx”;
This is slow. Very slow. Unnecessary document fetches; Unnecessary filtering. Wasted memory and CPU. Primary scans will retrieve ALL the documents of all types in the bucket whether or not your query eventually returns them to the user. While I said primary scan is like a table scan, it’s much slower than your table scan since it has to scan all of the documents of all types.
You should not use primary indexes. Do not use them. Especially in production.
Then why do we have primary indexes, to begin with?
- When you’re starting to play with new sample data, you can run most queries without worrying about create specific indexes. At this point, your primary concern is to understand the data than tuning for throughput.
- When you know the range of primary keys you want to scan.
- WHERE META().id between “cx:123” and “cx:458”
- When you know the trailing META().id pattern like below
- WHERE META().id LIKE “cx:1%”
- DO NOT use: LIKE “%:123”. This will result in full scan
- When you do know the full META().id or list of META().id, you can use the USE KEYS to directly fetch the document without consulting the primary index
- FROM customer USE KEYS [“cx:123”]
- FROM customer USE KEYS [“cx:123”, “cx:359”, “cx:948”]
- FROM customer USE KEYS (SELECT raw docid FROM mylist WHERE zip = 94501)
Primary Index
create the primary index on ‘travel-sample’;
The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index in Couchbase, is maintained asynchronously. You set the recency of the data by setting the consistency level for your query.
Here is the metadata for this index:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
select * from system:indexes where name = ‘#primary’; "indexes": { "datastore_id": "http://127.0.0.1:8091", "id": "f6e3c75d6f396e7d", "index_key": [], "is_primary": true, "keyspace_id": "travel-sample", "name": "#primary", "namespace_id": "default", "state": "online", "using": "gsi" } |
The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state) and the indexing method (using).
The primary index is used for full bucket scans (primary scans) when the query does not have any filters (predicates) or no other index or access path can be used. In Couchbase, you store multiple keyspaces (documents of a different type, customer, orders, inventory, etc) in a single bucket. So, when you do the primary scan, the query will use the index to get the document-keys and fetch all the documents in the bucket and then apply the filter. So, this is VERY EXPENSIVE.
The document key design is somewhat like primary key design with multiple parts.
Lastname:firstname:customerid
Example: smith:john:X1A1849
In Couchbase, it’s a best practice to prefix the key with type of the document. Since this is a customer document, let’s prefix with CX. Now, the key becomes:
1 |
Example: CX:smith:john:X1A1849 |
So, in the same bucket, there will be other types of documents.
1 |
ORDERS type: OD:US:CA:294829 |
1 |
ITEMS type: IT:KD93823 |
These are simply best practices. There is no restriction on the format or structure of the document key in Couchbase, except they’ve to be unique within a bucket.
Now, if you have documents with various keys and have a primary index, you can use following queries to efficiently.
Example 1: Looking for a specific document key.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
SELECT * FROM sales WHERE META().id = “CX:smith:john:X1A1849”; { "#operator": "IndexScan2", "index": "#primary", "index_id": "4c92ab0bcca9690a", "keyspace": "sales", "namespace": "default", "spans": [ { "exact": true, "range": [ { "high": "\"CX:smith:john:X1A1849\"", "inclusion": 3, "low": "\"CX:smith:john:X1A1849\"" } ] } ], |
If you do know the full document key, you can use the following statement and avoid the index access altogether.
SELECT * FROM sales USE KEYS [“CX:smith:john:X1A1849”]
You can get more than one document in a statement.
1 |
SELECT * FROM sales USE KEYS [“CX:smith:john:X1A1849”, “CX:smithjr:john:X2A1492”] |
Example 2: Look for a pattern. Get ALL the customer documents.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
SELECT * FROM sales WHERE META().id LIKE “CX:%”; { "#operator": "IndexScan2", "index": "#primary", "index_id": "4c92ab0bcca9690a", "keyspace": "sales", "namespace": "default", "spans": [ { "exact": true, "range": [ { "high": "\"CX;\"", "inclusion": 1, "low": "\"CX:\"" } ] } ], |
Example 3: Get all the customers with smith as their last name.
The following query uses the primary index efficiently, only fetching the customers with a particular range. Note: This scan is case sensitive. To do a case insensitive scan, you’ve create a secondary index with UPPER() or LOWER() of the document key.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
SELECT * FROM sales WHERE META().id LIKE "CX:smith%"; { "#operator": "IndexScan2", "index": "#primary", "index_id": "4c92ab0bcca9690a", "keyspace": "sales", "namespace": "default", "spans": [ { "exact": true, "range": [ { "high": "\"CX:smiti\"", "inclusion": 1, "low": "\"CX:smith\"" } ] } ], |
Example 4: It’s common for some applications to use use email address as part of the document key since they’re unique. In that case, you need to find out all of customers with gmail.com. If this is a typical requirement, then, store the REVERSE of of the email address as the key and simply do the scan of leading string pattern.
Email:johnsmith@gmail.com; key: reverse("johnsmith@gmail.com") => moc.liamg@htimsnhoj
Email: janesnow@yahoo.com key: reverse("janesnow@yahoo.com") => moc.oohay@wonsenaj
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
SELECT * FROM sales WHERE meta().id LIKE (reverse("@yahoo.com") || "%"); "#operator": "IndexScan2", "index": "#primary", "index_id": "4c92ab0bcca9690a", "keyspace": "sales", "namespace": "default", "spans": [ { "range": [ { "high": "\"moc.oohayA\"", "inclusion": 1, "low": "\"moc.oohay@\"" } ] } ], |
Named Primary Index
In Couchbase 5.0, you can create multiple replica of any index with a simple parameter to CREATE INDEX. Following will create 3 copies of the index and there has to be minimum of 3 index nodes in the cluster.
1 2 |
CREATE PRIMARY INDEX ON 'travel-sample' WITH {"num_replica":2}; CREATE PRIMARY INDEX `def_primary` ON `travel-sample` ; |
You can also name the primary index. The rest of the features of the primary index are the same, except the index is named. A good side effect of this is that you can have multiple primary indices in Couchbase versions before 5.0 using different names. Duplicate indices help with high availability as well as query load distribution throughout them. This is true for both primary indices and secondary indices.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
select meta().id as documentkey, `travel-sample` airline from `travel-sample` where type = 'airline' limit 1; { "airline": { "callsign": "MILE-AIR", "country": "United States", "iata": "Q5", "icao": "MLA", "id": 10, "name": "40-Mile Air", "type": "airline" }, "documentkey": "airline_10" } |
Finally, in Couchbase 6.5, we’ve introduced the Index advisor. It can analyze a single N1QL statement or workload. Read the details at:
- N1QL Index Advisor: Improve Query Performance and Productivity
- Index Advisor for N1QL Query Statement
- Index Advisor For Query Workload
This index advisor only advises suitable secondary indexes and never a primary index. If you’ve read the article so far, you know why! Download Couchbase 6.5 and Try out all the new features!
Very nice article, you stayed on the topic and was very easy to follow great !!!
Couple of questions,
Like most SQL why is a primary index not created by default in couchbase on the type column if that exists ?
Creating multiple “named” indexes of the same type and combination (Primary / GSI) is that a good practice considering the cost tradeoff?
Hi Keshav,
Hope you are doing well@ in these difficult times.
Thanks for this wonderful article.
Just have one question, is there any other way, to retrieve the document ID without using indexes and views, if we don’t know the document ID in advance. Please let me know.
Best regards,
Amit.