Couchbase Mobile uses Multiversion Concurrency Control (MVCC) to handle conflicts. One challenge in an MVCC-based system is that, over time, a document can accumulate many revisions. If every revision of every document were retained indefinitely, the database could grow very large, which is undesirable.
This is the second part in our series on Conflict Resolution in Couchbase Mobile. In our first post, Demystifying Conflict Resolution, we took a behind-the-scenes look at how document revisions and conflicts are handled in Couchbase Mobile using MVCC. In this post, we discuss the techniques Couchbase Mobile uses to manage the size of document Revision Trees and the role applications play in that process.
Background
The Couchbase Mobile stack includes the Couchbase Lite embedded database running locally on devices and Sync Gateway in the cloud, which is typically backed by Couchbase Server persisting the data in the cloud. The Sync Gateway handles the replication of documents across the devices. It is conceivable that a document can be updated by multiple devices at the same time.
A document in Couchbase Mobile consists of a document ID, current revision ID, JSON body and Metadata. The Metadata, among other things, holds the revision history for the document. In the current production version of Couchbase Mobile, the Metadata is stored in a special _sync property embedded in the document. In V1.5 of Sync Gateway, the metadata has been moved out of the document into XATTRs.
This post assumes that you are familiar with Couchbase Mobile’s document Revision Tree structure and the concept of current revisions. If you would like to learn more, please refer to the post on Demystifying Conflict Resolution.
Techniques to manage Database size
To prevent documents from becoming too big, Couchbase Mobile uses three techniques:
– Compaction
– Pruning
– Document Expiration
The Sync Gateway and Couchbase Lite handle the cleanup process slightly differently. Wherever applicable, the differences are noted in the post.
Compaction
Compaction is defined as the process of purging the JSON bodies of non-leaf revisions. Leaf revisions are not purged during compaction.
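The definition above can be sketched on a toy in-memory revision tree. This is only an illustration of the idea, not how Couchbase Mobile stores revisions: the `Revision` class, the `leaf_ids` and `compact` helpers, and the sample revision IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Revision:
    rev_id: str                   # e.g. "3-a": generation, then a hypothetical suffix
    parent: Optional[str]         # parent revision ID, None for the root
    body: Optional[dict] = None   # JSON body; None once it has been compacted away


def leaf_ids(tree: Dict[str, Revision]) -> Set[str]:
    """A leaf is a revision that no other revision names as its parent."""
    parents = {rev.parent for rev in tree.values() if rev.parent is not None}
    return set(tree) - parents


def compact(tree: Dict[str, Revision]) -> int:
    """Drop the bodies of non-leaf revisions and return how many were dropped."""
    leaves = leaf_ids(tree)
    dropped = 0
    for rev in tree.values():
        if rev.rev_id not in leaves and rev.body is not None:
            rev.body = None       # the revision's metadata stays, only the body goes
            dropped += 1
    return dropped


# Hypothetical tree: 1-a -> 2-a -> {3-a, 3-b}, where 3-a and 3-b conflict.
tree = {
    "1-a": Revision("1-a", None, {"n": 1}),
    "2-a": Revision("2-a", "1-a", {"n": 2}),
    "3-a": Revision("3-a", "2-a", {"n": 3}),
    "3-b": Revision("3-b", "2-a", {"n": 30}),
}
dropped = compact(tree)
```

In this sketch the two non-leaf revisions lose their bodies, while both conflicting leaves keep theirs.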
On Sync Gateway
On the Sync Gateway, compaction is run by the system periodically in the background. In addition, applications can manually invoke the compaction process via the Compact API on the Admin Interface.
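A minimal sketch of a manual invocation follows. It assumes the Admin REST API exposes compaction as `POST /{db}/_compact`, that the admin interface listens on port 4985 of a local Sync Gateway, and that the database is named `mydb` (a hypothetical name). The request is only built here, not sent.

```python
import urllib.request

ADMIN_URL = "http://localhost:4985"  # assumption: admin interface of a local Sync Gateway
DB_NAME = "mydb"                     # assumption: hypothetical database name


def build_compact_request(admin_url: str, db: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the database's _compact admin endpoint."""
    return urllib.request.Request(f"{admin_url}/{db}/_compact", data=b"", method="POST")


request = build_compact_request(ADMIN_URL, DB_NAME)

# To actually trigger compaction against a running Sync Gateway:
#     with urllib.request.urlopen(request) as response:
#         print(response.status, response.read())
```

Because the Admin Interface is not meant to be publicly reachable, a call like this would normally be made from a trusted host rather than from a client app.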
On Couchbase Lite
On Couchbase Lite, compaction can only be invoked manually via the Compact API on the CBLDatabase object.
Impact of Conflict Resolution on Compaction
The Compaction process does not remove JSON bodies of leaf nodes. So if you have a large number of unresolved conflicting branches, then you will have a large number of leaf nodes with JSON bodies lying around.
- Case 1: When Conflicts are unresolved, Compaction retains the JSON bodies of all the leaf nodes.
- Case 2: When Conflicts are resolved (i.e. non-winning branches are tombstoned), Compaction removes the JSON bodies of non-leaf nodes.
Hence, resolving conflicts is important to ensure that all old and unused leaf revisions are pruned away.
Pruning
Pruning is the process that deletes the metadata and/or JSON bodies associated with old non-leaf revisions. Leaf revisions are not impacted. The process runs automatically every time a revision is added. Although fundamentally the same, the pruning algorithm works slightly differently between the Sync Gateway and Couchbase Lite.
On Sync Gateway
“Old revisions” are those that are older than the value specified by the revs_limit database configuration property. The revs_limit property defaults to 1000, which means that the metadata corresponding to the last 1000 revisions is stored in Sync Gateway. Although the Pruning process does not immediately remove the JSON bodies of old revisions, the JSON bodies of all non-leaf revisions that are older than the default TTL value of 5 minutes are automatically cleaned up by a background process that runs periodically.
Algorithm
- Find the minimum generation number, gmin (the generation is the first component of a revision ID), across all leaf revisions.
- Delete all non-leaf revisions whose generation number g ≤ gmin – revs_limit
Sync Gateway Pruning with Conflicts : An Example
In this example:
- Assume that the revs_limit configuration is set to 2 (for illustration purposes only)
- The document has unresolved conflicting revisions at generation 3
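The two-step rule from the Algorithm section can be sketched as follows, using a revs_limit of 2 and a conflict at generation 3. This is a simplified illustration, not Sync Gateway's implementation: the `prune` and `generation` helpers, the tree shape, and the revision IDs are hypothetical.

```python
from typing import Dict, Optional, Set


def generation(rev_id: str) -> int:
    """The generation is the numeric prefix of a revision ID such as '6-a'."""
    return int(rev_id.split("-", 1)[0])


def prune(parents: Dict[str, Optional[str]], revs_limit: int) -> Set[str]:
    """Return the IDs of the non-leaf revisions that the rule would delete.

    `parents` maps each revision ID to its parent ID (None for the root).
    """
    non_leaves = {p for p in parents.values() if p in parents}
    leaves = set(parents) - non_leaves
    g_min = min(generation(leaf) for leaf in leaves)   # step 1: find gmin
    cutoff = g_min - revs_limit
    return {rev for rev in non_leaves if generation(rev) <= cutoff}  # step 2


# Hypothetical tree: a main branch 1-a .. 6-a, plus a conflicting leaf 3-b under 2-a.
parents = {
    "1-a": None,
    "2-a": "1-a",
    "3-a": "2-a",
    "3-b": "2-a",   # unresolved conflict at generation 3
    "4-a": "3-a",
    "5-a": "4-a",
    "6-a": "5-a",
}
pruned = prune(parents, revs_limit=2)
```

Here the leaves are 6-a and 3-b, so gmin is 3 and only revisions at generation 1 or lower qualify. The short conflicting branch holds the cutoff down even though the main branch has reached generation 6.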
On Couchbase Lite
“Old Revisions” are those that are older than the value specified by the maxRevTreeDepth property of the database. The maxRevTreeDepth value defaults to 20, which means that the metadata and JSON bodies corresponding to the last 20 revisions are retained in Couchbase Lite.
Basically, unlike the Sync Gateway, which takes a temporary backup (TTL of 5 min) of the JSON bodies of old revisions, the Pruning process on Couchbase Lite removes the metadata as well as the JSON bodies of old non-leaf revisions.
Note that there may be slight differences in the way the pruning algorithm is handled across the various Couchbase Lite platforms. From the perspective of the user, it should suffice to note that the Pruning technique on Couchbase Lite will get rid of metadata and JSON bodies of old revisions based on the maxRevTreeDepth value.
Algorithm
- For each non-tombstoned branch in the tree, prune away revisions whose generation ID g < (Depth Of Branch – maxRevTreeDepth)
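As a rough sketch, the rule can be applied to a single branch like this. The `prune_branch` helper is hypothetical, the leaf's generation is assumed to stand in for the depth of the branch, and the handling of ancestors shared between several branches is deliberately left out, since it may differ across platforms.

```python
from typing import List, Tuple


def prune_branch(generations: List[int], max_rev_tree_depth: int) -> Tuple[List[int], List[int]]:
    """Split one branch's generations into (pruned, kept).

    `generations` lists the generation numbers on a single non-tombstoned
    branch, oldest first. The last entry is the leaf, and its generation is
    assumed to be the depth of the branch.
    """
    depth_of_branch = generations[-1]
    threshold = depth_of_branch - max_rev_tree_depth
    # The comparison is strict, as in the rule above: g < threshold is pruned.
    pruned = [g for g in generations if g < threshold]
    kept = [g for g in generations if g >= threshold]
    return pruned, kept


# A branch with generations 1..5 and a maxRevTreeDepth of 2 (illustration only).
pruned, kept = prune_branch([1, 2, 3, 4, 5], max_rev_tree_depth=2)
```

With these inputs the threshold is 3, so generations 1 and 2 are pruned and the leaf is always retained.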
After pruning, your document may end up with Disconnected Branches. The revision tree is not in a corrupted state and the logic that chooses the winning revision still applies. However, it may make it impossible to do certain merges to resolve conflicts, and should be avoided if that is an issue for you.
Couchbase Lite Pruning with Conflicts : An Example
In this example:
- Assume that the maxRevTreeDepth configuration is set to 2 (again, something you’d never want to do in practice)
- The document has unresolved conflicting revisions at generation 3
Impact of Conflict Resolution on Pruning
- The Pruning process does not prune away leaf revisions. So if you have a large number of unresolved conflicting branches, then you will have a large number of leaf nodes lying around.
- On the Sync Gateway, the Pruning algorithm is applied to the shortest, non-tombstoned branch in the revision tree. This means that in the case where there are many unresolved conflicting branches, Pruning may not delete some of the older revisions. Consider the following example:
- Scenario 1 : When conflicts are unresolved, Pruning may not remove all old revisions as the shortest branch may be a conflicting branch
- Scenario 2 : However, in the case when conflicts are resolved, gmin corresponds to the leaf node of the winning revision branch.
- On the Sync Gateway, in case of unresolved conflicts, it is also possible that the parent nodes of conflicting revisions are prematurely pruned away. This would result in a disconnected branch state as discussed earlier.
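The difference between the two scenarios comes down to which leaf determines gmin. The arithmetic can be sketched as below, assuming the default revs_limit of 1000. The `prune_cutoff` helper and the generation numbers are hypothetical and chosen only to make the contrast visible.

```python
from typing import List, Tuple

Leaf = Tuple[int, bool]  # (generation, is_tombstoned)

REVS_LIMIT = 1000  # the Sync Gateway default


def prune_cutoff(leaves: List[Leaf], revs_limit: int) -> int:
    """Non-leaf revisions with generation <= the returned value are prunable."""
    live = [gen for gen, tombstoned in leaves if not tombstoned]
    g_min = min(live)  # leaf of the shortest non-tombstoned branch
    return g_min - revs_limit


# Scenario 1: a long branch at generation 1500 and an unresolved conflict at generation 3.
unresolved = [(1500, False), (3, False)]

# Scenario 2: the conflicting branch has been tombstoned (the tombstone is generation 4).
resolved = [(1500, False), (4, True)]

cutoff_unresolved = prune_cutoff(unresolved, REVS_LIMIT)
cutoff_resolved = prune_cutoff(resolved, REVS_LIMIT)
```

In the unresolved case the cutoff is negative, so nothing qualifies for pruning. Once the conflict is resolved, the cutoff moves to generation 500 on the winning branch.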
Document Expiration
When a document expires, all traces of it, including all of its revisions, are removed from the system. Users have the option of controlling when documents expire. Note that users must exercise caution when changing the expiration value of a document because a document may expire prematurely. This feature is useful for creating ephemeral documents that may be required only for a short period.
On Sync Gateway
When writing a document to Sync Gateway via the PUT document or _bulk_docs API, users can set the _exp property in the body of the document to specify when the document must expire.
Supported formats for the _exp value include:
– JSON number. Interpreted as a TTL in seconds when the value is less than 30 days, and as Unix time when greater
– JSON string (numeric format) – same as JSON number
– JSON string (as ISO-8601 date)
– JSON null. Sets expiry to zero (no expiry)
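The formats listed above could be interpreted along these lines. This is a sketch that mirrors the list, not Sync Gateway's own parsing code: the `resolve_expiry` helper and the sample document are hypothetical, and treating a numeric zero as "no expiry" is an assumption drawn from the JSON null entry.

```python
from datetime import datetime
from typing import Union

THIRTY_DAYS = 30 * 24 * 60 * 60  # in seconds


def resolve_expiry(exp: Union[int, float, str, None], now: int) -> int:
    """Return the absolute expiry as Unix seconds, or 0 for "no expiry"."""
    if exp is None:
        return 0                                  # JSON null: no expiry
    if isinstance(exp, str):
        try:
            exp = int(exp)                        # numeric string: same as a number
        except ValueError:
            # ISO-8601 date string; include a UTC offset to avoid local-time surprises.
            return int(datetime.fromisoformat(exp).timestamp())
    seconds = int(exp)
    if seconds == 0:
        return 0                                  # assumption: zero also means no expiry
    # Below 30 days the number is a TTL relative to now; otherwise it is Unix time.
    return now + seconds if seconds < THIRTY_DAYS else seconds


NOW = 1_700_000_000  # a fixed "current time" so the example is reproducible

# Hypothetical document body that should expire one hour after it is written.
doc = {"type": "session", "_exp": 3600}
expires_at = resolve_expiry(doc["_exp"], NOW)
```

The same helper accepts `"3600"`, an absolute Unix timestamp, or a string such as `"2030-01-01T00:00:00+00:00"`.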
On Couchbase Lite
The expirationDate property on CBLDocument can be used to specify when the document must expire. Note that by specifying an expirationDate manually, you run the risk of expiring documents added to Couchbase Lite in offline mode before they get a chance to sync up with the Sync Gateway.
Configuration of the max depth of the Revision Tree
As discussed in the section on Pruning, the maxRevTreeDepth property on the Couchbase Lite database defaults to 20 and the revs_limit property on Sync Gateway defaults to 1000.
These values determine the number of revisions whose metadata survives the Pruning process. A very large value therefore means that the document metadata history can grow very large, resulting in increased database storage needs.
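On the Sync Gateway side, revs_limit is a database configuration property, so it would be set per database in the configuration file. The snippet below builds such a fragment as a sketch. It assumes the usual layout with a top-level `databases` object, uses a hypothetical database name `mydb`, and omits the other settings a real configuration needs.

```python
import json

# Assumption: per-database settings live under "databases" -> <database name>.
config = {
    "databases": {
        "mydb": {                 # hypothetical database name
            "revs_limit": 1000,   # the default, written out explicitly
        }
    }
}

config_json = json.dumps(config, indent=2)
print(config_json)
```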
One may be tempted to reduce these values in order to save space. But making them very small can have undesirable consequences, as discussed below:
– The parent node of conflicting branches may get pruned away before the conflict can be resolved leading to orphaned leaf nodes. This may be undesirable if the application wishes to n-way merge the concurrent changes. If you get into this state, then the only option to resolve the conflict is to pick a winning branch and tombstone all the non-winning conflicting branches.
– You may end up with disconnected branches as shown in the example below. Disconnected branches are branches that do not have a common ancestor. While it may not be a big issue in and of itself, the application is responsible for tombstoning non-winning branches.
In the example below, after syncing, the device ends up with two disconnected branches [16..19] and [35..38] without a common ancestor.
Note that the above scenarios can also happen with the default tree depth property values. However, the likelihood of getting into this state is significantly increased if the values are set to be too low.
What Next
One important consideration in an MVCC-based system is managing the size of the revision trees and preventing them from bloating. This post discussed the techniques available in Couchbase Mobile for controlling the size of documents and the impact of conflict resolution on database size.
If you have questions or feedback, please leave a comment below or feel free to reach out to me on Twitter at @rajagp or by email at priya.rajagopal@couchbase.com. The Couchbase Forums are another good place to reach out with questions.
Finally, a thank you to Traun Leyden Traun@couchbase.com from the Sync Gateway team for his in-depth review and to Jens Alfke jens@couchbase.com and Jim Borden jim.borden@couchbase.com from the Couchbase Lite team for their input.