A man returned home after walking around the globe for eleven years. Next day, when he told his wife he’s going to the corner store, she asked him: “are you taking the short route or the long one?”.
Queries can be executed in many different ways. All paths lead to the same query result. The NoSQL and query optimization tool evaluates the possibilities and selects the efficient plan. Efficiency is measured in latency and throughput, depending on the workload. The cost of Memory, CPU, disk usage is added to the cost of a plan in a cost-based optimizer.
Now, in most cases, a NoSQL database will have SQL-like query language support. So, a good optimizer is mandatory. When you don’t have a good optimizer, developers have to live with feature restrictions and DBAs have to live with NoSQL query performance issues.
Database Optimizer to Improve NoSQL Performance Searches
NosSQL query optimization allows you to choose an optimal index and access paths to execute the search. At a very high level, SQL optimizers decide the following before creating the execution tree:
- Query rewrite based on heuristics, cost or both.
- Just like an editor removes unnecessary words from a writing, query rewrite work can range from eliminating unnecessary predicates to subquery flattening, converting appropriate LEFT OUTER JOINS to INNER JOINS, folding of derived tables, etc.
- Index selection.
- Selecting the optimal index(es) for each of the table (keyspaces in Couchbase N1QL, collection in case of MongoDB performance best practices)
- Depending on the index selected, choose the predicates to push down, see the query is covered or not, decide on sort and pagination strategy.
- Join reordering
- (A INNER JOIN B INNER JOIN C) is equivalent to (B INNER JOIN C INNER JOIN A). The optimizer will have to determine the most optimal way to sequence these joins.
- Join type
- Databases can implement multiple types of join algorithms: nested loop, hash, sort merge, zigzag, star (snowflake), etc. Depending on the structure and cost, the optimizer will have to decide the type of join algorithm for each join operation.
Consider the case of MongoDB restriction. “A collection can have at most one text index.” https://docs.mongodb.com/manual/core/index-text/#restrictions It documents a few other restrictions along with this. For this article, explaining this one restriction will suffice.
Why should you care about this restriction?
- MongoDB and other NoSQL databases encourage you to denormalize (aggregate) your schema so you create a single large document representing an object: a customer, a partner, etc so the majority of your operations happen on a single (JSON) document. So, a single customer document can contain customer information, customers orders, customer shipping information, customer billing information. Having a single search index means, you need to create a single VERY LARGE index combining all the fields you’d ever want to search. Here’s the problem: when you search for customer address, you don’t want to see shipping address. When you search for shipping orderid, you don’t want to see returned orderid.
- You can create multiple indexes on scalars in MongoDB. Why the restriction on the text index?
Why does MongoDB Text Index is restricted to One index per Collection?
-
- Is it the number of the text indexes? Search indexes are typically built with an inverted-tree data structure. But, MongoDB has chosen to build it with the B-Tree index. This is unlikely to be the issue.
- Is it the size of the text indexes? Text indexes generated an array of tokens on text and indexes them. So, it’s an array index. Its size can grow exponentially when you use an array index. Size of the index increase linearly to the number of words indexes and not the number of documents. They may cause issues.
- Is it a problem with the optimizer? When you have multiple indexes, the optimizer will have to choose the right index for the query. If you restrict to one text index, choice easy. This is a symptom of a larger problem in MongoDB optimization techniques — It makes adhoc decisions resulting in restrictions like this.
MongoDB’s query plan language is simplistic, even if it’s trying to mimic the SQL operations.. Let’s see how MongoDB’s query optimization tool handles these.
- Query rewrite: Unsupported. MongoDB’s queries are simplistic in find(), save(), remove(), update() methods. The aggregation pipeline is procedural and verbose. While it’s theoretically possible to rewrite, there’s nothing in the documentation or plan to indicate any query rewrites.
- Index selection: Supported. MongoDB’s optimizer tries to pick up a suitable index for each portion of the query and index can/should be used. More on this below.
- Join reordering: Unsupported. MongoDB’s $lookup is part of the convoluted aggregation framework where the query is written like Unix pipeline, a procedural approach.
- Join type selection: Unsupported since there’s only one type join in MongoDB. MongoDB has a constrained left outer join support via $lookup operator — arrays are unsupported in the join condition. If you do use $lookup, the optimizer automatically uses the default join algorithm. There’s no mention of the type of join done.
Essentially, MongoDB query optimization only does index selection before creating the execution plan. But, query optimization in MongoDB seems to select the indexes an odd fashion — neither by rule not by statistics.
- Pick a random index on one or more qualified index.
- Use that plan if a subsequent query matches the query predicates, even if the constants, selectivities, and cardinalities are different.
- Then at runtime, if the index scan returns more than 100 keys (!), runs each of the alternative plans to see which one returns the keys first. At some point, it aborts the parallel execution and picks up one of them. It also replaces the plan in its plan cache.
1 2 3 4 5 6 7 8 9 10 |
Collection t1, with 3000 documenbts. Create the following indexes: Appendix 1 for the definition: MongoDB Enterprise > db.t1.createIndex({x:1}) MongoDB Enterprise > db.t1.createIndex({y:1}) MongoDB Enterprise > db.t1.createIndex({x:1, y:1}) MongoDB Enterprise > db.t1.createIndex({y:1, x:1}) |
This is a single collection with 4 indexes on (x), (y), (x, y) and (y, x). Now, see this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
MongoDB Enterprise > db.t1.find({x:{$gt:0}, y:99}).explain() { "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.t1", "indexFilterSet" : false, "parsedQuery" : { "$and" : [ { "y" : { "$eq" : 99 } }, { "x" : { "$gt" : 0 } } ] }, "winningPlan" : { "stage" : "FETCH", "filter" : { "x" : { "$gt" : 0 } }, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "y" : 1 }, "indexName" : "y_1", "isMultiKey" : false, "multiKeyPaths" : { "y" : [ ] }, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward", "indexBounds" : { "y" : [ "[99.0, 99.0]" ] } } }, |
Even on this simple document structure, MongoDB selects the index on (y) even though there the query has filters on x and y: ({x:{$gt:0}, y:99}).
To manage all of these uncertainties and the performance issues it will lead to, MongoDB provides a number of APIs to manage the query plan cache: flush specific cache entry, flush the whole plan cache. Instead of developing applications, MongoDB developers & DBAs, need to manage the plan cache. Developers and DBAs don’t need to manage the plan cache in other enterprise databases.
Back to the original question: Why you can’t create multiple text indexes on MongoDB?
Building multiple indexes shouldn’t be an issue if they simply allowed it. The real problem is that when you give a text predicate in your query, the MongoDB optimizer unable to choose the right index. It cannot validate these text indexes against the text predicate. MongoDB optimizer doesn’t follow a natural logic or a logical framework. Hence the restriction.
And, it could even hurt you!
I really like #MongoDB but writing queries in JSON is painful (esp aggregation). It’s a curly brackets hell with poor tooling support. Same goes for #ElasticSearch. Fortunately, due to Java/Kotlin libraries, we rarely have to write them directly. But when we have to, it hurts.
— Philipp Hauer (@philipp_hauer) April 18, 2019
Couchbase N1QL has added text index to N1QL for the upcoming release. See the details at https://www.couchbase.com/blog/n1ql-and-search-how-to-leverage-fts-index-in-n1ql-query/. Users can create any number of text indexes and the optimizer will choose a qualified (sargable) index and use it. It also supports searching during joins, post index scans, etc because the optimizer understands the search predicate and layers into its decision logic. There’s no new API, or new plan to manage. That’s the power of having Couchbase!
Resources:
- An Overview of Query Optimization in Relational Systems: https://web.stanford.edu/class/cs345d-01/rl/chaudhuri98.pdf
- A Deep Dive Into Couchbase N1QL Query Optimization: https://dzone.com/articles/a-deep-dive-into-couchbase-n1ql-query-optimization
- https://docs.mongodb.com/manual/reference/method/js-plan-cache/
- https://docs.mongodb.com/manual/core/query-plans/
- https://docs.mongodb.com/manual/reference/method/js-plan-cache/