Migrating Buckets to Collections & Scopes via Eventing: Part 2

Again (as I did in Part 1) I want to point out an excellent blog written by Shivani Gupta, How to Migrate to Scopes & Collections in Couchbase 7.0, which covers in great detail other methods of migrating bucket-based documents to Scopes and Collections in Couchbase. I encourage you to also read about the multiple non-Eventing methods that Shivani touches upon.

Whether you’re new to Couchbase or a seasoned vet, you’ve likely heard about Scopes and Collections. If you’re ready to try them, this article helps you make it happen.

Scopes and Collections are a new feature introduced in Couchbase Server 7.0 that allows you to logically organize data within Couchbase. To learn more, read this introduction to Scopes and Collections.

You should take advantage of Scopes and Collections if you want to map your legacy RDBMS to a document database or if you’re trying to consolidate hundreds of microservices and/or tenants into a single Couchbase cluster (resulting in much lower TCO).

Using Eventing for Scopes & Collections Migration

In the prior article (Part 1), I discussed the mechanics of a high-performance, Eventing-based method for migrating from an older Couchbase version to Scopes and Collections in Couchbase 7.0.

Only the Data Service (KV) and the Eventing Service are required to migrate from buckets to collections; no N1QL and no indexes are needed. In a well-tuned, large Couchbase cluster, you can migrate over 1 million documents a second.

In this follow-up article, I provide a simple, fully automated method for large migrations with dozens (or even hundreds) of data types, driven by a small Perl script.

Recap of the final Eventing Function: ConvertBucketToCollections

In Part 1 we had the following settings for the Eventing Function. Note that for each unique type, “beer” and “brewery”, we had to add a Bucket binding alias to the target collection in “read+write” mode. In addition, we had to create the target collections, in this case “bulk.data.beer” and “bulk.data.brewery”.

In Part 1 we had the following JavaScript code in our Eventing Function. Note that for each unique type, “beer” and “brewery”, we had to replicate a JavaScript code block and update its reference to the corresponding binding alias (the target collection in the Couchbase Data Service).

function OnUpdate(doc, meta) {
    if (!doc.type) return;
  
    var type = doc.type;
    if (DROP_TYPE) delete doc.type;
  
    if (type === 'beer') {
        if (DO_COPY) beer_col[meta.id] = doc;
        if (DO_DELETE) {
            if(!beer_col[meta.id]) { // safety check 
                log("skip delete copy not found type=" + type + ", meta.id=" + meta.id); // use saved type: doc.type may already be deleted
            } else {
                delete src_col[meta.id];
            }
        }
    }
    if (type === 'brewery') {
        if (DO_COPY) brewery_col[meta.id] = doc;
        if (DO_DELETE) {
            if(!brewery_col[meta.id]) { // safety check
                log("skip delete copy not found type=" + type + ", meta.id=" + meta.id); // use saved type: doc.type may already be deleted
            } else {
                delete src_col[meta.id];
            }
        }
    }
}
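To make the copy-then-delete flow concrete, here is a minimal sketch that runs under Node.js rather than inside Eventing. Plain objects stand in for the binding aliases, and the `targets` lookup table is a hypothetical shortcut that replaces the repeated per-type blocks. I have not verified that binding aliases can be stored in a Map inside Eventing, so treat that part as an assumption.

```javascript
// Plain objects stand in for the Eventing bucket bindings (an assumption made
// so this runs under Node.js; in Eventing these aliases are real keyspaces).
const src_col = {
  beer_1: { type: "beer", name: "Pale Ale" },
  brewery_1: { type: "brewery", name: "Hop House" },
  misc_1: { name: "document without a type" },
};
const beer_col = {};
const brewery_col = {};

// Stand-ins for the constant bindings used by the function.
const DROP_TYPE = true;
const DO_COPY = true;
const DO_DELETE = true;

// Hypothetical lookup table replacing the repeated per-type blocks.
const targets = new Map([
  ["beer", beer_col],
  ["brewery", brewery_col],
]);

function onUpdate(doc, meta) {
  if (!doc.type) return;
  const type = doc.type;
  const target = targets.get(type);
  if (!target) return; // no collection registered for this type
  if (DROP_TYPE) delete doc.type;
  if (DO_COPY) target[meta.id] = doc;
  if (DO_DELETE) {
    if (!target[meta.id]) { // safety check: only delete once the copy exists
      console.log("skip delete copy not found type=" + type + ", meta.id=" + meta.id);
    } else {
      delete src_col[meta.id];
    }
  }
}

// Replay each source document through the handler.
for (const id of Object.keys(src_col)) {
  onUpdate(src_col[id], { id: id });
}

console.log(Object.keys(src_col)); // documents left in the source
console.log(beer_col, brewery_col); // documents moved, with type removed
```

Only the document without a type property stays in the source; the other two land in their per-type stand-in collections with the type property dropped.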

The technique in Part 1 works but what if I have a lot of types?

Eventing can indeed handle migrations as shown in Part 1, but setting things up takes a fair amount of work.

With 80 different types, this technique would require an incredible amount of error-prone manual effort, both in writing the Eventing Function and in creating the needed keyspaces. I wouldn’t want to repeat the steps above by hand for each type.

Automate via CustomConvertBucketToCollections.pl

To solve this problem, I wrote a tiny Perl script, CustomConvertBucketToCollections.pl, that generates two files:

CustomConvertBucketToCollections.json: a complete Eventing Function that does all of the work described above.
MakeCustomKeyspaces.sh: a shell script that builds all the needed keyspaces and imports the generated Eventing Function.
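The idea behind generating both artifacts from a list of types can be sketched in a few lines of JavaScript. This is not the Perl script's actual implementation; `typeBlock`, `generateHandler` and `keyspacesFor` are hypothetical helpers, and the `<type>_col` alias naming simply follows the Part 1 example. It also assumes every type name is a valid JavaScript identifier fragment.

```javascript
// Hypothetical helper: emit the per-type block of an OnUpdate handler.
// Alias naming (<type>_col) follows the Part 1 example: beer_col, brewery_col.
function typeBlock(type) {
  const alias = `${type}_col`;
  return [
    `    if (type === '${type}') {`,
    `        if (DO_COPY) ${alias}[meta.id] = doc;`,
    `        if (DO_DELETE) {`,
    `            if (!${alias}[meta.id]) { // safety check`,
    `                log("skip delete copy not found type=" + type + ", meta.id=" + meta.id);`,
    `            } else {`,
    `                delete src_col[meta.id];`,
    `            }`,
    `        }`,
    `    }`,
  ].join("\n");
}

// Hypothetical helper: wrap one block per type in a complete handler.
function generateHandler(types) {
  return [
    "function OnUpdate(doc, meta) {",
    "    if (!doc.type) return;",
    "    var type = doc.type;",
    "    if (DROP_TYPE) delete doc.type;",
    types.map(typeBlock).join("\n"),
    "}",
  ].join("\n");
}

// Hypothetical helper: one target keyspace per type under bucket.scope.
function keyspacesFor(types, bucketScope) {
  return types.map((t) => `${bucketScope}.${t}`);
}

const types = ["t01", "t02", "t03"];
const source = generateHandler(types);
console.log(source);
console.log(keyspacesFor(types, "output.reorg"));
```

With 80 types the same loop yields 80 blocks and 80 keyspace names, which is exactly the repetitive work that is error-prone by hand.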

You can find this script in GitHub at https://github.com/jon-strabala/cb-buckets-to-collections.

Note, the script CustomConvertBucketToCollections.pl requires that both Perl (practical extraction and report language) and also jq (a lightweight and flexible command-line JSON processor) are installed on your system.

Example: Migrate 250M Records with 80 Different Types

We have 250M documents in the keyspace “input._default._default” with 80 different types, and we want to reorganize them into collections under the scope “output.reorg”, one collection per value of the type property. We have an AWS cluster of three r5.2xlarge instances, all running the Data Service and the Eventing Service.

The input bucket “input” in this example is configured with a memory quota of 16000 MB.

Below I use the CustomConvertBucketToCollections.pl Perl script from GitHub at https://github.com/jon-strabala/cb-buckets-to-collections. As you will see, an automated script makes the migration straightforward.

Step 1: One-time Setup

git clone https://github.com/jon-strabala/cb-buckets-to-collections
cd cb-buckets-to-collections
PATH=${PATH}:/opt/couchbase/bin
chmod +x CustomConvertBucketToCollections.pl big_data_test_gen.pl big_data_test_load.sh

Step 2: Create 250M test documents

Running the interactive big_data_test_load.sh command:

./big_data_test_load.sh

Input configuration parameters:

# This bash script, 'big_data_test_load.sh', will load <N> million test
# documents into a <bucket>._default._default in 1 million chunks as
# created by the perl script 'big_data_test_gen.pl'. The data will
# have 80 different document type values evenly distributed.

Enter the number of test docs to create in the millions    250
Enter the bucket (or target) to load test docs into        input
Enter the username:password to your cluster                admin:jtester
Enter the hostname or ip address of your cluster           localhost
Enter the number of threads for cbimport                   8

Will load 250 million test docs into keyspace input._default._default (the default for bucket input)
type ^C to abort, running in 5 sec.

Running ....
gen/cbimport block: 1 of 250, start at Mon 01 Nov 2021 11:06:01 AM PDT
JSON `file://./data.json` imported to `couchbase://localhost` successfully
Documents imported: 1000000 Documents failed: 0
** removed 23 lines **
gen/cbimport block: 250 of 250, start at Mon 01 Nov 2021 11:24:05 AM PDT
JSON `file://./data.json` imported to `couchbase://localhost` successfully
Documents imported: 1000000 Documents failed: 0

There should now be 250M test documents in the keyspace input._default._default.

Step 3: Generate Eventing Function and Keyspace script

Running the interactive CustomConvertBucketToCollections.pl command:

./CustomConvertBucketToCollections.pl

Input configuration parameters:

Enter the bucket (or source) to convert to collections [travel-sample]: input
Enter the username:password to your cluster [admin:jtester]:
Enter the hostname or ip address of your cluster [localhost]:
Enter the destination bucket.scope [mybucket.myscope]: output.reorg
Enter the Eventing storage keyspace bucket.scope.collection [rr100.eventing.metadata]:
Enter the number of workers (LTE # cores more is faster) [8]:
Probe the bucket (or source) to determine the set of types [Y]:
samples across the bucket (or source) to find types [20000]: 100000
maximum estimated # of types in the bucket (or source) [30]: 100


Scanning input for 'type' property this may take a few seconds

curl -s -u Administrator:password http://localhost:8093/query/service -d \
    'statement=INFER `input`._default._default WITH {"sample_size": 100000, "num_sample_values": 100, "similarity_metric": 0.1}' \
    | jq '.results[][].properties.type.samples | .[]' | sort -u

TYPES FOUND: t01 t02 t03 t04 t05 t06 t07 t08 t09 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25 t26 t27 t28 t29 
t30 t31 t32 t33 t34 t35 t36 t37 t38 t39 t40 t41 t42 t43 t44 t45 t46 t47 t48 t49 t50 t51 t52 t53 t54 t55 t56 t57 t58 t59 t60 t61 
t62 t63 t64 t65 t66 t67 t68 t69 t70 t71 t72 t73 t74 t75 t76 t77 t78 t79 t80

Generating Eventing Function: CustomConvertBucketToCollections.json

Generating Keyspace commands: MakeCustomKeyspaces.sh

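In the interactive Perl session above, four of the defaults were overridden: the source bucket, the destination bucket.scope, the number of samples, and the maximum estimated number of types.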
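The type probe shown in the session pipes the INFER result through jq and `sort -u`. As a rough JavaScript equivalent, the sketch below walks the same path, `results[][].properties.type.samples`, and returns the sorted unique values. The response shape is inferred from the jq filter above, and the `sample` object is a hypothetical, heavily trimmed response rather than real cluster output.

```javascript
// Mirrors the jq filter `.results[][].properties.type.samples | .[]`
// followed by `sort -u`. Unlike jq, this tolerates flavors that lack a
// `type` property instead of failing on them.
function extractTypes(response) {
  const found = new Set();
  for (const flavors of response.results || []) {
    for (const flavor of flavors) {
      const typeInfo = flavor.properties && flavor.properties.type;
      for (const value of (typeInfo && typeInfo.samples) || []) {
        found.add(value);
      }
    }
  }
  return Array.from(found).sort();
}

// Hypothetical, trimmed response body (shape assumed from the jq filter).
const sample = {
  results: [
    [
      { properties: { type: { samples: ["t02", "t01"] } } },
      { properties: { type: { samples: ["t03", "t01"] } } },
      { properties: { name: { samples: ["no type here"] } } },
    ],
  ],
};

console.log("TYPES FOUND: " + extractTypes(sample).join(" "));
```

Because INFER works on a sample of documents, a rare type can be missed; raising the sample size and the maximum number of sample values, as done in the session above, reduces that risk.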

Step 3b: Update MakeCustomKeyspaces.sh (as needed)

You can simply edit the file (for example, “vi MakeCustomKeyspaces.sh”) and alter any values as needed. I chose to use the Unix sed command to increase the RAM size of the bucket “output” from 100 MB to 16000 MB, matching the quota of the “input” bucket:

sed -e 's/\(^.*bucket=output.*ramsize=\)100 \(.*\)/\116000 \2/' MakeCustomKeyspaces.sh > tmp
mv tmp MakeCustomKeyspaces.sh
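If the sed expression is hard to read, the same edit can be sketched in JavaScript. `bumpRamsize` is a hypothetical helper, and the two `make_bucket` lines are made-up placeholders that only imitate the `bucket=... ramsize=...` pattern the sed expression matches; they are not the real contents of MakeCustomKeyspaces.sh.

```javascript
// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: on lines that mention bucket=<bucket>, change
// "ramsize=<from> " to "ramsize=<to> ". Other lines are left untouched.
function bumpRamsize(script, bucket, from, to) {
  const pattern = new RegExp(
    "^(.*bucket=" + escapeRegExp(bucket) + ".*ramsize=)" + from + " ",
    "gm"
  );
  // A replacer function avoids the ambiguity of writing "$1" directly
  // in front of the digits of the new size.
  return script.replace(pattern, (match, prefix) => prefix + to + " ");
}

// Made-up placeholder lines, not the real generated script.
const script = [
  "make_bucket --bucket=output --bucket-ramsize=100 --wait",
  "make_bucket --bucket=other --bucket-ramsize=100 --wait",
].join("\n");

console.log(bumpRamsize(script, "output", 100, 16000));
```

Only the line for the “output” bucket changes; the other line keeps its original size.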

Step 4: Run the MakeCustomKeyspaces.sh script

sh ./MakeCustomKeyspaces.sh

Output:

SUCCESS: Bucket created
SUCCESS: Scope created
SUCCESS: Collection created
SUCCESS: Bucket created
SUCCESS: Scope created
SUCCESS: Collection created
SUCCESS: Collection created
** removed 77 lines **
SUCCESS: Collection created
SUCCESS: Events imported

Step 5: Refresh your Couchbase UI on the Eventing Page

To find the new Eventing Function (or updated Function) in the Couchbase UI, go to the Eventing Page and refresh your web browser.

Step 6: Deploy CustomConvertBucketToCollections

In the Couchbase UI, go to the Eventing Page and deploy the Eventing Function “CustomConvertBucketToCollections“.

In about 45 minutes the reorganization should be completely done.

All the documents are indeed reorganized by type into collections. On this modest cluster, they were processed at 93K docs/sec.
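The 45-minute figure follows directly from the document count and the observed rate. A quick sketch of the arithmetic, where `estimateMinutes` is a hypothetical helper you can reuse with your own numbers:

```javascript
// Hypothetical helper: minutes needed to process totalDocs at a steady rate.
function estimateMinutes(totalDocs, docsPerSecond) {
  return totalDocs / docsPerSecond / 60;
}

// 250M documents at the observed 93K docs/sec.
const observed = estimateMinutes(250e6, 93e3);
console.log(observed.toFixed(1) + " minutes"); // 44.8 minutes

// The same workload at 1 million docs/sec, the rate quoted earlier
// for a well-tuned, large cluster.
const largeCluster = estimateMinutes(250e6, 1e6);
console.log(largeCluster.toFixed(1) + " minutes"); // 4.2 minutes
```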

Final Thoughts

If you found this article series helpful and are interested in continuing to learn about Eventing, read more about the Couchbase Eventing Service.

I hope you find the CustomConvertBucketToCollections.pl Perl script from GitHub at https://github.com/jon-strabala/cb-buckets-to-collections a valuable tool in your arsenal when you need to migrate a bucket with many types into a collections paradigm.

Feel free to improve the CustomConvertBucketToCollections.pl script so that it first writes an intermediate config file in which all the parameters can be adjusted, and then uses that config file to create the Eventing Function and the setup shell script.

Example intermediate config file:

[
  {
    "src_ks": "input._default._default",
    "dst_ks": "output.myscope.t01",
    "create_dst_ks": true,
    "dst_copy": true,
    "src_del": true,
    "dst_remove_type": true
  }, {
    "src_ks": "input._default._default",
    "dst_ks": "output.myscope.t02",
    "create_dst_ks": true,
    "dst_copy": true,
    "src_del": true,
    "dst_remove_type": true
  }, {
    "src_ks": "input._default._default",
    "dst_ks": "output.myscope.t03",
    "create_dst_ks": true,
    "dst_copy": true,
    "src_del": true,
    "dst_remove_type": true
  }
]
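As a starting point for that improvement, here is a minimal sketch of how such a config could be produced and consumed. The field names come from the example above; `buildConfig` and `keyspacesToCreate` are hypothetical helpers, not part of the existing tool.

```javascript
// Hypothetical helper: one config entry per type, using the field names
// from the example config file.
function buildConfig(types, srcKeyspace, dstBucketScope) {
  return types.map((type) => ({
    src_ks: srcKeyspace,
    dst_ks: dstBucketScope + "." + type,
    create_dst_ks: true,
    dst_copy: true,
    src_del: true,
    dst_remove_type: true,
  }));
}

// Hypothetical helper: the keyspaces a setup script would need to create.
function keyspacesToCreate(config) {
  return config
    .filter((entry) => entry.create_dst_ks)
    .map((entry) => entry.dst_ks);
}

const config = buildConfig(
  ["t01", "t02", "t03"],
  "input._default._default",
  "output.myscope"
);

// Adjust an individual entry before generating anything from the config.
config[2].create_dst_ks = false; // e.g. this collection already exists

console.log(JSON.stringify(config, null, 2));
console.log(keyspacesToCreate(config));
```

Keeping the per-type flags in a file like this would let you review or hand-edit the plan before any Eventing Function or keyspace is generated.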

Resources

Download: Download Couchbase Server 7.0
Eventing Scriptlet: Function: ConvertBucketToCollections
GitHub: Perl Tool: cb-buckets-to-collections.pl

References

I would love to hear from you on how you liked the capabilities of Couchbase and the Eventing service, and how they benefit your business going forward. Please share your feedback via the comments below or in the Couchbase forums.

Jon Strabala, Principal Product Manager, Couchbase
