If you’ve been keeping up, you’ll remember I wrote a few tutorials on converting your MongoDB-powered Node.js applications to Couchbase. These included a MongoDB Query Language to N1QL tutorial as well as a Mongoose to Ottoman tutorial. Those were great migration tutorials from an application perspective, but they didn’t really tell you how to take your existing MongoDB Collections, exported as JSON files, and import that data into Couchbase.
So, in this tutorial, we’re going to explore how to import MongoDB collection data into Couchbase. The development language doesn’t really matter, but Golang is very fast and very powerful, making it a perfect candidate for the job.
Before we worry about writing a data migration script, let’s settle on a sample dataset to work with. The goal is for the script to be universal, but it helps to have a concrete example.
The MongoDB Collection Model
Let’s assume we have a Collection called courses that holds information about courses offered by a school. The document model for any one of these documents might look something like the following:
{
    "name": "Basket Weaving 101",
    "students": [
        "nraboy",
        "mgroves",
        "hgreeley"
    ]
}
Each document would represent a single course with a list of enrolled students. Each document has an id value, and the enrolled students reference documents in another collection with matching id values.
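The students collection itself isn’t part of this migration example, but for context, a referenced student document might look something like the following. The fields here are hypothetical; only the id values matter for this tutorial:

{
    "_id": "nraboy",
    "name": "Nic Raboy",
    "enrolled": true
}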
With MongoDB installed, you have access to its mongoexport utility, which will allow us to export the documents in any Collection to a JSON data file.
For example, we could run the following command against our MongoDB database:
mongoexport --db example --collection courses --out courses.json
The database in question would be example, and we’re exporting the courses collection to a file called courses.json. If we open this JSON file, we’d see data that looks similar to the following:
{"_id":{"$oid":"course-1"},"name":"Basket Weaving 101","students":[{"$oid":"nraboy"},{"$oid":"mgroves"}]}
{"_id":{"$oid":"course-2"},"name":"TV Watching 101","students":[{"$oid":"jmichaels"},{"$oid":"tgreenstein"}]}
Each document will be on its own line in the file; however, it won’t look exactly like our schema was modeled. MongoDB takes every document reference and wraps it in an $oid property, which represents an object id.
So where does this leave us?
Planning the Couchbase Bucket Model
As you’re probably already aware, Couchbase does not use Collections; it uses Buckets. Buckets do not function the same as Collections, though. Instead of having one Collection per document type like MongoDB does, you’ll have one Bucket for the entire application.

This means we’ll need to make some changes to the MongoDB export so it makes sense inside of Couchbase.
In Couchbase it is normal to include a property in every document that identifies what type of document it is. Lucky for us, we know the name of the former Collection and can work some magic. As an end result, our Couchbase documents should look something like this:
{
    "_id": "course-1",
    "_type": "courses",
    "name": "Basket Weaving 101",
    "students": [
        "nraboy",
        "mgroves",
        "hgreeley"
    ]
}
In the above example we have compressed all the $oid values and added the _id and _type properties.
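This _type property is what lets us treat one Bucket as many logical collections. For example, assuming a Bucket named default and an index in place, a N1QL query along these lines would pull back only the former courses documents:

SELECT d.* FROM `default` d WHERE d._type = 'courses';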
Developing the Golang Collection Import Script
Now that we know where we’re headed, we can focus on the script that will do the manipulation and loading. First, let’s think through the Golang logic needed to accomplish the job.
We know we’re going to read the JSON file line by line. Every line read needs to be manipulated, then saved. Reading from a file and inserting into Couchbase are both blocking operations; while reading is quite fast, inserting one document at a time in a blocking fashion can be quite slow when there are terabytes of data. This means we should start goroutines to do the work in parallel.
Create a new project somewhere in your $GOPATH and create a file called main.go with the following code:
package main

import (
    "bufio"
    "encoding/json"
    "flag"
    "fmt"
    "os"
    "sync"

    "github.com/couchbase/gocb"
)

var waitGroup sync.WaitGroup
var data chan string
var bucket *gocb.Bucket

func main() {}

func worker(collection string) {}

func cbimport(document string, collection string) {}

func compressObjectIds(mapDocument map[string]interface{}) string {}
The above code is merely a blueprint of what we’re going to accomplish. The main function will be responsible for starting several goroutines and reading our JSON file. We don’t want the application to end as soon as the main function does, so we use a WaitGroup, which prevents the application from exiting until all goroutines have finished.
The worker function is what each goroutine runs. It calls cbimport, which in turn calls compressObjectIds to swap out any $oid with the compressed equivalent. By compressed, I mean the value won’t include a wrapping $oid property.
So let’s look at that main function:
func main() {
    fmt.Println("Starting the import process...")

    flagInputFile := flag.String("input-file", "", "file with path which contains documents")
    flagWorkerCount := flag.Int("workers", 20, "concurrent workers for importing data")
    flagCollectionName := flag.String("collection", "", "mongodb collection name")
    flagCouchbaseHost := flag.String("couchbase-host", "", "couchbase cluster host")
    flagCouchbaseBucket := flag.String("couchbase-bucket", "", "couchbase bucket name")
    flagCouchbaseBucketPassword := flag.String("couchbase-bucket-password", "", "couchbase bucket password")
    flag.Parse()

    cluster, _ := gocb.Connect("couchbase://" + *flagCouchbaseHost)
    bucket, _ = cluster.OpenBucket(*flagCouchbaseBucket, *flagCouchbaseBucketPassword)

    file, _ := os.Open(*flagInputFile)
    defer file.Close()

    data = make(chan string)

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    for i := 0; i < *flagWorkerCount; i++ {
        waitGroup.Add(1)
        go worker(*flagCollectionName)
    }

    for scanner.Scan() {
        data <- scanner.Text()
    }

    close(data)
    waitGroup.Wait()

    fmt.Println("The import has completed!")
}
The above function takes a set of command line flags used to configure the application, establishes the connection to the destination Couchbase Server and Bucket, and opens the input file.

Because we’re using goroutines, we use a channel to hand out work safely rather than sharing state that would need locking. Every line read is queued up in the channel, and each goroutine reads from it.

After spinning up the goroutines, the file is read and the channel is populated. Once the file has been completely read, the channel is closed, which lets each goroutine finish once it has drained the remaining data. The WaitGroup makes the application wait for all of the goroutines to end before exiting.
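Notice that the function above throws away every error to keep the listing short. In a real migration you’d at least want to know when the cluster connection or Bucket open fails. A minimal sketch of the same setup with error checks might look like this:

// Connect to the cluster and open the Bucket, aborting on failure instead
// of silently continuing with nil values.
cluster, err := gocb.Connect("couchbase://" + *flagCouchbaseHost)
if err != nil {
    fmt.Println("could not connect to the cluster:", err)
    os.Exit(1)
}
bucket, err = cluster.OpenBucket(*flagCouchbaseBucket, *flagCouchbaseBucketPassword)
if err != nil {
    fmt.Println("could not open the bucket:", err)
    os.Exit(1)
}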
Now let’s take a look at the worker function:
func worker(collection string) {
    defer waitGroup.Done()
    for {
        document, ok := <-data
        if !ok {
            break
        }
        cbimport(document, collection)
    }
}
The MongoDB Collection name is passed to each worker, and the worker keeps looping until the channel is closed.
For every document read from the channel, the cbimport function is called:
func cbimport(document string, collection string) {
    var mapDocument map[string]interface{}
    json.Unmarshal([]byte(document), &mapDocument)
    mapDocument["_type"] = collection
    compressObjectIds(mapDocument)
    bucket.Insert(mapDocument["_id"].(string), mapDocument, 0)
}
Each line of the file is a string that we need to unmarshal into a map of interfaces. We know the Collection name, so we can create a property to hold that particular name. Then we pass the entire map into the compressObjectIds function to get rid of any $oid wrappers. Finally, the document is inserted into the Bucket, using its _id value as the document key.
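One thing worth calling out: Insert returns an error if the document key already exists, so re-running the importer against the same Bucket will quietly skip everything, since the error is discarded. If you’d rather have re-runs overwrite existing documents, swapping Insert for gocb’s Upsert is a one-line change:

// Upsert creates the document if it's new and replaces it if it already
// exists, which makes the import safe to re-run.
bucket.Upsert(mapDocument["_id"].(string), mapDocument, 0)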
The compressObjectIds function looks like the following:
func compressObjectIds(mapDocument map[string]interface{}) string {
    var objectIdValue string
    for key, value := range mapDocument {
        switch value.(type) {
        case string:
            if key == "$oid" && len(mapDocument) == 1 {
                return value.(string)
            }
        case map[string]interface{}:
            objectIdValue = compressObjectIds(value.(map[string]interface{}))
            if objectIdValue != "" {
                mapDocument[key] = objectIdValue
            }
        case []interface{}:
            for index, element := range value.([]interface{}) {
                // Arrays can hold plain values too, so only recurse into
                // elements that are actually nested objects.
                if elementMap, ok := element.(map[string]interface{}); ok {
                    objectIdValue = compressObjectIds(elementMap)
                    if objectIdValue != "" {
                        value.([]interface{})[index] = objectIdValue
                    }
                }
            }
        }
    }
    return ""
}
In the above we are essentially looping through every key in the document. If a value is a nested object or a JSON array, we recursively do the same thing until we hit a string value with a key of $oid. When that happens, we make sure it is the only key at that level of the document, which tells us it is an id we can safely compress.
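If you want to convince yourself the function behaves as described, a quick sanity check like the following can be dropped into the same file, assuming the compressObjectIds function and the imports from earlier are in scope:

func sanityCheck() {
    // One line from the mongoexport file, with $oid wrappers intact.
    raw := `{"_id":{"$oid":"course-1"},"students":[{"$oid":"nraboy"}]}`
    var doc map[string]interface{}
    json.Unmarshal([]byte(raw), &doc)
    compressObjectIds(doc)
    result, _ := json.Marshal(doc)
    fmt.Println(string(result))
    // Prints: {"_id":"course-1","students":["nraboy"]}
}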
Not so bad right?
Running the MongoDB to Couchbase Importer
Assuming you have the Go programming language installed and configured, we need to build this application.
From the command line, you’ll need to get all the dependencies. With the project as your current working directory, execute the following:
go get -d -v
The above command will get any dependencies found in our Go files.
Now the application can be built and run, or just run directly with go run. The steps aren’t really any different, but we’re going to build it first; go build will produce a binary named after the project directory, which I’m assuming here is called collectionimport.
From the command line, execute the following:
./collectionimport \
    --input-file FILE_NAME.json \
    --collection COLLECTION_NAME \
    --couchbase-host localhost \
    --couchbase-bucket default \
    --workers 20
The above command passes our flags into the application: the input file, the Couchbase Server information, the number of worker goroutines, and so on.
If successful, the MongoDB export should now be present in your Couchbase NoSQL database.
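To spot-check the import without leaving Go, you could count the documents carrying the _type property we added. This is a rough sketch using the Go SDK’s N1QL support, and it assumes you’ve already created at least a primary index on the Bucket:

// Count the imported documents for a given former Collection name.
// Assumes a primary index exists on the default Bucket.
query := gocb.NewN1qlQuery("SELECT COUNT(*) AS total FROM `default` WHERE _type = 'courses'")
if rows, err := bucket.ExecuteN1qlQuery(query, nil); err == nil {
    var row map[string]interface{}
    for rows.Next(&row) {
        fmt.Println("imported documents:", row["total"])
    }
} else {
    fmt.Println("query failed:", err)
}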
Conclusion
You just saw how to work with Golang and MongoDB to import your Collection data into Couchbase. Sure, the code we saw can be optimized, but from a simplicity standpoint I’m sure you can see what we were trying to accomplish.
Want to download this importer project and try it out for yourself? I’ve gone and uploaded it to GitHub, with further instructions for running. Keep in mind that it is unofficial and hasn’t been tested for massive amounts of data. Treat it as an example for learning how to meet your data migration needs.
If you’re interested in learning more about Couchbase Server or the Golang SDK, check out the Couchbase Developer Portal.