Representing Revision Data in MongoDB

The need to track changes to web content and provide for draft or preview functionality is common to many web applications today. In relational databases it has long been common to accomplish this using a recursive relationship within a single table or by splitting the table out and storing version details in a secondary table. I’ve recently been exploring the best way to accomplish this using MongoDB.

A few design considerations

Data will be represented in three primary states: published, draft and history. Published and draft might also be called current and preview. From a workflow perspective it might be tempting to include states like in_review, pending_publish, rejected, etc., but that’s not necessary from a versioning perspective. Data is in draft until it is published. Workflow specifics should be handled outside the version control mechanism.

In code, it’s important to avoid making the revision mechanism a primary feature. In other words, you want to deal with the document stored in published, not published itself.

Historical documents need to have a unique identifier, just like the top level entity. These will be accessed less frequently and so performance is less of a consideration.

From a concurrency perspective, it’s important to make sure that updates operate against fresh data.
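One common way to get that guarantee is an optimistic check: the update only matches if the document is still at the revision that was read. The sketch below is a minimal illustration, not a definitive implementation. It assumes the top level document carries an integer `revision` field, which is a hypothetical name, and it only builds the filter and update documents rather than talking to a server.

```python
def build_publish_update(doc_id, expected_revision, new_content):
    """Return (filter, update) for a conditional publish.

    The filter matches only if the stored revision is still the one
    that was read, so a stale writer updates nothing.
    "revision" is an assumed field name, not part of the original design.
    """
    query = {"_id": doc_id, "revision": expected_revision}
    update = {
        "$set": {"published": new_content, "draft": None},
        "$inc": {"revision": 1},
    }
    return query, update


query, update = build_publish_update("page-1", 3, {"title": "Hello again"})
# With pymongo this pair would be passed to collection.update_one(query, update).
# A matched_count of 0 on the result means the document changed since it was read.
```

When nothing matches, the application can reload the document and decide whether to retry or report a conflict.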

Basic structure

The basic structure is a top level document that contains sub-documents accommodating the three primary states mentioned above.
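As a rough sketch of that shape, here is the top level document written as a Python dict, the form a driver such as pymongo would store. Only published, draft and history come from the description above. Every other name (`revision`, `title`, `body`) is hypothetical and chosen for illustration.

```python
def new_versioned_document(doc_id, content):
    """Build a top level document holding the three state sub-documents.

    Field names other than published/draft/history are assumptions.
    """
    return {
        "_id": doc_id,
        "revision": 1,               # assumed counter, bumped on each publish
        "published": dict(content),  # the live content
        "draft": None,               # transient working copy, if any
        "history": [],               # retired revisions, oldest first
    }


page = new_versioned_document("page-1", {"title": "Hello", "body": "First version"})
```

Application code would then read `page["published"]` and treat the surrounding wrapper as plumbing.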

In history, each retired document requires a few things. One is a unique identifier. Another is the time at which it was retired. It might also be useful to track which user caused it to be retired. The metadata stored with each retired document should therefore capture everything you would like to know about that revision.

There are a few options for uniquely identifying historical data. One is to calculate a unique value at the time an object is placed in history. This could be a combination of the top level object ID and a sequential version number. Another is to generate a hash when the object is loaded. The problem with the second approach is that queries for objects from a specific date become more complex.
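The first option can be sketched as follows. This is an illustration under assumptions: the entry field names (`id`, `version`, `retired_at`, `retired_by`, `document`) and the `_id:version` format are hypothetical, and history is taken to be a list ordered oldest first.

```python
from datetime import datetime, timezone


def retire_published(doc, retired_by, now=None):
    """Append the current published content to history and return the entry.

    The entry id combines the top level _id with a sequential version number.
    The next version is derived from the last entry rather than the list
    length, so ids stay unique even after old entries have been pruned.
    """
    now = now or datetime.now(timezone.utc)
    history = doc["history"]
    version = history[-1]["version"] + 1 if history else 1
    entry = {
        "id": f"{doc['_id']}:{version}",
        "version": version,
        "retired_at": now,
        "retired_by": retired_by,
        "document": doc["published"],
    }
    history.append(entry)
    return entry


page = {"_id": "page-1", "published": {"title": "Hello"}, "draft": None, "history": []}
first = retire_published(page, retired_by="alice")
page["published"] = {"title": "Hello again"}
second = retire_published(page, retired_by="bob")
```

Because the identifier is stored at retirement time, looking up a given revision is a plain comparison rather than a recomputation.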

Queries

As a rule of thumb, it’s probably best to always execute queries against published. Queries against history are unlikely at an application level. One reason for this is that any interest in historical revisions will almost universally be in the context of the published version. In other words, the top level object will already be in scope.
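A query against published can use dotted field paths, and a projection can leave the other states out of the result so they are not transferred with every read. The helper below is a hypothetical sketch that only builds the two documents. The `title` field is an assumed example.

```python
def published_query(field, value):
    """Return (filter, projection) that matches on published content only."""
    query = {f"published.{field}": value}
    # Exclude the states that application-level reads should not need.
    projection = {"history": 0, "draft": 0}
    return query, projection


query, projection = published_query("title", "Hello")
# With pymongo: collection.find(query, projection)
```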

Draft data should be considered transient. There should be little need to protect this data or save it. Until it is accepted and becomes the new published data, changes should have little impact. Querying for draft data would be unlikely, and is discouraged at the application level.

Historical Limits

The size of documents and the frequency with which they are changed must factor into the retention of historical data. Regulations and internal policies may also affect this retention. In some cases it may be sufficient to retain only the last revision. In other cases retention may be determined by a time period. In some cases it may be desirable to save as much history as is physically possible.

While MongoDB does provide a capped collection, that won’t help us with the structure above. All of the historical data is in a sub-document, not a separate collection. For that reason, any retention policy must be implemented at the application level.

It might be tempting to implement the revision history in a separate collection in order to manage retention with a capped collection. Some problems arise. The biggest problem would be that there is no way to cap the collection by versioned document. One way to look at this is that if you have one document that changed very frequently and another that changed rarely or never, historical data for the rarely changing document would eventually be pushed off the end of the collection as updates for the frequently changed object are added.

Retention model

As a baseline, it’s probably reasonable to define retention based on the following two metrics.

Minimum time retention

Maximum revisions retained

In other words, hang on to all revisions for at least the minimum time, up to the maximum number of revisions. This decision would be made at the time the document is modified. If a document is modified infrequently, it’s possible the retained revisions would be much older than the minimum retention time.
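One reading of that policy is that a revision is dropped only when it is both older than the minimum retention time and outside the newest maximum number of revisions. The sketch below implements that reading. If the maximum is meant as a hard cap instead, the condition would need to change. The `retired_at` and `version` field names are assumptions, and history is taken to be ordered oldest first.

```python
from datetime import datetime, timedelta, timezone


def prune_history(history, min_age, max_revisions, now=None):
    """Return the history entries to keep, oldest first.

    An entry is dropped only when it is BOTH older than min_age AND
    not among the newest max_revisions entries.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - min_age
    keep_from = max(len(history) - max_revisions, 0)
    return [
        entry
        for index, entry in enumerate(history)
        if index >= keep_from or entry["retired_at"] >= cutoff
    ]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
history = [
    {"version": v, "retired_at": datetime(2024, 1, day, tzinfo=timezone.utc)}
    for v, day in [(1, 1), (2, 10), (3, 20), (4, 30)]
]
kept = prune_history(history, timedelta(days=7), 2, now=now)
print([e["version"] for e in kept])  # [3, 4]
```

Running this at modification time, just before the new revision is written, matches the point at which the text says the decision is made.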

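Because the history lives inside the top level document, that document grows with every revision, and under MongoDB’s original MMAPv1 storage engine a document that outgrows its allocated space has to be moved on disk. To mitigate this, a paddingFactor can be applied to a collection. A padding factor is a multiplier that allocates additional space when a document is written. For example, with paddingFactor=2, the document is allocated twice the space its current size requires. Padding is specific to MMAPv1 and does not apply to the WiredTiger engine used by current releases.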

Note that operations like compact and repairDatabase will remove any unused padding added by paddingFactor.

Changes to document structure

It’s possible that the structure of documents changes throughout the life of an application. If information is added to or removed from the core document structure, it’s important to recognize that application code will need to be able to deal with both the old and the new shapes, since older shapes live on in history.

Application aspects that might be affected include JSON to object mapping and diffing algorithms.
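One way to cope, sketched here under assumptions, is to stamp each stored revision with a version marker and upgrade older shapes when they are read. The `schema_version` field and the `author` to `authors` change are both hypothetical examples, not part of the design described above.

```python
def upgrade_revision(content):
    """Return a copy of a stored revision in the current shape.

    Hypothetical change: version 1 stored a single "author" string,
    version 2 stores an "authors" list.
    """
    upgraded = dict(content)
    if upgraded.get("schema_version", 1) < 2:
        author = upgraded.pop("author", None)
        upgraded["authors"] = [author] if author else []
        upgraded["schema_version"] = 2
    return upgraded


old = {"title": "Hello", "author": "alice"}
current = upgrade_revision(old)
```

Normalizing on read keeps mapping and diffing code working against a single shape, at the cost of maintaining one upgrade step per structural change.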

3 Responses to “Representing Revision Data in MongoDB”

I don’t see where you store your actual document in the second listing’s structure. Also, wouldn’t you want to separate history into its own collection to keep it away from the data that’s accessed frequently, so you don’t pull all the history of a document when trying to see what is published?

The data is in the history element. It’s indexed by an ascending integer value.

You do bring up a good point that could be an alternative implementation, which is to store the history in a separate collection. I actually discussed this in the section on Historical Limits. You can read that section for some possible pitfalls.