One of the most frequent questions I receive when it comes to NoSQL is on the subject of joining data from multiple documents into a single query result. While this question is brought up more frequently from RDBMS developers, I also receive it from NoSQL developers.
When it comes to data joining, every database does it differently, some of which require it to be done through the application layer, rather than the database layer. We’re going to explore some data joining options between database technologies.
In this blog, we’ll be comparing the process of joining NoSQL documents using the MongoDB $lookup operator versus Couchbase‘s more intuitive N1QL query language.
The Sample Data
For this example, we’ll be basing both MongoDB and Couchbase off two sample documents. Assume we’re working with a classic order and inventory example. For inventory, our documents might look something like this:
1 2 3 4 5 6 |
{ "id": "product-1", "type": "product", "name": "Pokemon Red", "price": 29.99 } |
While flat, the above document can properly explain one particular product. It has a unique id which will be involved during the join process. For orders, we might have a document that looks like the following:
1 2 3 4 5 6 7 8 9 10 |
{ "id": "order-1", "type": "order", "products": [ { "product_id": "product-1", "quantity": 2 } ] } |
The goal here will be to join these two documents in a single query using both MongoDB and Couchbase. However, query language aside, these documents can always be joined via the application layer through multiple queries. This is not the result we’re after, though.
Joining Documents with MongoDB and the $lookup Operator
In recent versions of MongoDB, join queries involve a $lookup
operator that is part of the aggregation queries. Per the MongoDB documentation, this operator performs as the following:
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. The $lookup stage does an equality match between a field from the input documents with a field from the documents of the “joined” collection.
To use the $lookup
operator, you’d have something like this:
1 2 3 4 5 6 7 8 9 10 11 |
db.collection.aggregate([ { $lookup: { from: <collection to join>, localField: <field from the input documents>, foreignField: <field from the documents of the "from" collection>, as: <output array field> } } ]) |
Now this is great, but it doesn’t work on relationships found in arrays. This means that the $lookup
operation cannot join the product_id
found in the products
array to another document. Instead the array must be “unwound” or “unnested” first which adds extra complexity to our query:
1 2 3 4 5 6 7 8 9 10 11 |
db.orders.aggregate([ { $unwind: "$products" }, { $lookup: { from: "products", localField: "products.product_id", foreignField: "_id", as: "productObjects" } } ]) |
The $unwind
operator will flatten the array and then do a join on the now flat objects that were produced. The result of such query would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
{ "_id" : ObjectId("58a3869acbf64c4ace55e713"), "products" : { "product_id" : ObjectId("58a3851b2f14a900caa7a731"), "quantity" : 2 }, "productObjects" : [ { "_id" : ObjectId("58a3851b2f14a900caa7a731"), "name" : "Pokemon Red", "price" : 29.99 } ] } |
Had there been more than one reference in the array, there would have been more results returned. However, what is returned isn’t very attractive. We still have the old products
object and now a productsObject
array. Further manipulations to the data stream needs to happen.
The productsObject
array should be “unwound” and then reconstructed to how we want it. This can be accomplished by doing the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
db.orders.aggregate([ { $unwind: "$products" }, { $lookup: { from: "products", localField: "products.product_id", foreignField: "_id", as: "productObjects" } }, { $unwind: "$productObjects"}, { $project: { products: { "quantity": "$products.quantity", "name": "$productObjects.name", "price": "$productObjects.price" } } } ]) |
Notice that the aggregate
query is now getting more complex. After doing the join, the result is “unwound” and then the result is reconstructed using the $project
operator.
At this point further manipulations to the result can be made such as grouping the results so that the products
objects become a single array again. Each manipulation to the data set requires more aggregation code which can easily become messy, complicated, and difficult to read.
This is where Couchbase N1QL becomes so much more pleasant to work with.
Using Couchbase and N1QL to Join NoSQL Documents
Let’s use the same document example that we used for MongoDB. This time we’re going to write SQL queries with N1QL to get the job done.
The first thing that comes to mind might be to use a JOIN
in SQL. Our query might look something like this:
1 2 3 4 |
SELECT orders.*, product FROM example AS orders JOIN example AS product ON KEYS orders.products[*].product_id WHERE orders.type = 'order' |
In the above example, both the documents exist in the same Couchbase Bucket. A JOIN
against document ids happens based on the product_id
values found in the products
array. The above query would yield results that look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
[ { "id": "order-1", "product": { "id": "product-1", "name": "Pokemon Red", "price": 29.99, "type": "product" }, "products": [ { "product_id": "product-1", "quantity": 2 } ], "type": "order" } ] |
Like with MongoDB, there will be a result for every item of the products
array that matches. In fairness, while the N1QL version was easier to write, it wasn’t necessarily any more difficult than the MongoDB Query Language at this point. As we manipulate the data more, Couchbase becomes a lot easier by comparison.
For example, let’s say we wanted to clean up the results:
1 2 3 4 5 |
SELECT orders.id, orders.type, OBJECT_PUT(product, "quantity", products.quantity) AS product FROM example AS orders UNNEST orders.products AS products JOIN example AS product ON KEYS products.product_id WHERE orders.type = 'order' |
There are some major differences in what we’re doing in the above, but minor differences in how we’re doing them. Instead of joining directly on the array, we are first flattening or “unnesting” the array, like what we saw in the MongoDB $unwind
operator. The join is now happening on each of the flattened results. Finally, the quantity
from the original object is added to the new object.
The result to the above query would look something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
[ { "id": "order-1", "product": { "id": "product-1", "name": "Pokemon Red", "price": 29.99, "quantity": 2, "type": "product" }, "type": "order" } ] |
Let’s say that the original products
array had more than one product reference in it. Instead of returning several objects based on the JOIN
criteria we saw above, it might make sense to re-pack that original array.
1 2 3 4 5 6 |
SELECT orders.id, orders.type, ARRAY_AGG(OBJECT_PUT(product, "quantity", products.quantity)) AS products FROM example AS orders UNNEST orders.products AS products JOIN example AS product ON KEYS products.product_id WHERE orders.type = 'order' GROUP BY orders |
In the above query we’ve only added ARRAY_AGG
and a GROUP BY
, but as a result, each joined document shows up in the products
array instead of the id value.
Don’t want to use an actual JOIN
operator? Try using a SQL subquery instead.
Conclusion
In NoSQL, joining data is a very popular concern for developers that are seasoned RDBMS veterans. Because MongoDB is a very popular NoSQL technology, I thought it would be good to use for comparing to how Couchbase handles document joining. For light operations, $lookup
operator is tolerable, but as join queries in MongoDB become more complex, you may need to take a step back. With N1QL, writing complex queries that include joining operations becomes very easy and stays easy regardless of how complex the query is.
For more information on N1QL and Couchbase, visit the Couchbase Developer Portal.