
For each of the CMS problem domains mentioned in the introduction, I’m going to start out by outlining the core requirements that are most relevant to NoSQL.

Typically these requirements will focus on the “repository” underlying the CMS – how content is represented, how it is structured, how it is stored and retrieved, how the repository is scaled to large traffic volumes and so on. This is not to diminish the importance of other requirements (such as the editorial UI/UX); however, NoSQL has far less direct relevance to those facets of a CMS.

Richly Structured Content Types

In my experience, types in a WCM content model are generally more complex than those in other content management use cases (e.g. Document Management), with complex nested data structures within types being the norm rather than the exception.

While the specifics vary widely depending on the precise information architecture of the web site, some typical examples might be†:

News Article:

a number of singleton fields such as “title”, “summary”, “author”, “date”, “body” etc.

an unbounded set of related image files, each of which has a “thumbnail” and a “high fidelity” rendition. These images may have further fields associated with them (provenance information, for example).

Product:

a number of singleton fields such as “SKU”, “title”, “description”, etc.

† All of these are based on actual WCM content models I have seen used in live web sites.
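
To make the nesting concrete, here is a minimal sketch of the News Article example as a Python data structure. The class names, field names and sample values are illustrative assumptions only, not taken from any particular CMS; the point is simply that the type contains an unbounded list of structured sub-objects rather than a flat set of fields.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Image:
    # Each related image carries two renditions plus its own metadata.
    thumbnail: str
    high_fidelity: str
    provenance: str = ""


@dataclass
class NewsArticle:
    # Singleton fields.
    title: str
    summary: str
    author: str
    date: str
    body: str
    # Unbounded set of nested, structured image records.
    images: List[Image] = field(default_factory=list)


# Hypothetical sample content.
article = NewsArticle(
    title="Example headline",
    summary="One-line summary",
    author="A. Writer",
    date="2010-07-01",
    body="Body text goes here.",
)
article.images.append(
    Image(
        thumbnail="img/launch-thumb.jpg",
        high_fidelity="img/launch-full.jpg",
        provenance="Staff photographer",
    )
)

# asdict() recursively converts the nested dataclasses into plain dicts
# and lists, i.e. the shape a document-oriented store would hold.
document = asdict(article)
```

The resulting `document` is a single nested structure, which is the kind of shape a repository has to represent without flattening.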

Unstructured Binary Objects

A no-brainer really – any CMS (WCM or otherwise) that is unable to efficiently store binary objects (regardless of MIME type) isn’t worthy of the moniker. Enough said.

Relationships aka References aka Associations

WCM content models are inherently interlinked; after all, that’s what the “hyper” in “hypertext” refers to! Continuing our examples above: a Product may contain references to complementary Products and other content types such as Technical Specifications, White Papers etc.; a News Article might refer to related News Articles, and so on.

In fact, the most highly interlinked parts of a WCM content model are often the content types representing the navigational model of the site. Regardless of whether the site uses a traditional single-root hierarchy, a multi-hierarchy (“faceted”) navigation scheme, a tag cloud or some wacky newfangled model dreamt up by a genius information architect, the content type(s) representing the navigational data structures are always highly interlinked with the non-navigational content types (the Products, News Articles, Recipes, etc.) and are often interlinked with themselves. This latter case is particularly true of hierarchically based navigational schemes, which continue to be the dominant navigational paradigm used in content-rich web sites.

While it is possible to “manage” links via the humble hyperlink (and in fact this is the de facto approach in several CPSes), this is less than ideal for various reasons:

it’s difficult to inform an author that they’re about to break links on the site by moving or deleting a content item that is the target of a reference

it’s difficult to determine what needs to be deployed in order to ensure that all dependencies are met (i.e. so links won’t be broken, post-deployment)

the graph of links provides useful information to authors about the dependencies within their content set, possible navigation paths through the site etc.

coupled with usage analytics data, visualisations of the link graph can be a powerful tool for authors in revising, distilling and generally maintaining the relevance of the content they’re delivering
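
As a rough illustration of why explicitly managed references help, the sketch below keeps a two-way index of references. With it, the repository can answer “what would break if this item were moved or deleted?” and “what else must be deployed with this item?”. The class, method names and content identifiers are all hypothetical; a real CPS would persist this graph rather than hold it in memory.

```python
from collections import defaultdict


class ReferenceIndex:
    """In-memory index of references between content items (illustrative)."""

    def __init__(self):
        self._outbound = defaultdict(set)  # source -> targets it refers to
        self._inbound = defaultdict(set)   # target -> sources referring to it

    def add_reference(self, source, target):
        self._outbound[source].add(target)
        self._inbound[target].add(source)

    def referrers(self, item):
        """Items whose links would break if `item` were moved or deleted."""
        return set(self._inbound.get(item, set()))

    def deployment_closure(self, item):
        """`item` plus everything it transitively depends on."""
        seen = set()
        stack = [item]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            stack.extend(self._outbound.get(current, ()))
        return seen


# Hypothetical content identifiers.
index = ReferenceIndex()
index.add_reference("nav/products", "product/widget")
index.add_reference("product/widget", "whitepaper/widget-guide")
index.add_reference("news/launch", "product/widget")

# Warn the author before deleting "product/widget":
at_risk = index.referrers("product/widget")

# Work out what must ship together with "news/launch":
to_deploy = index.deployment_closure("news/launch")
```

Neither question can be answered cheaply when links exist only as hyperlinks embedded in body text.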

Schema Evolution

A general guiding principle that I have followed throughout my technical career has been to avoid (as far as possible) anything that requires what I refer to as “crystal ball gazing” – making decisions now that require prediction of the future and that may be difficult to correct when that prediction turns out to be incorrect (as inevitably happens).

Content models are a classic example of this – in the decade or so that I’ve been working in content management professionally, I don’t recall a single instance where the content model was defined perfectly first time, up front, prior to use.

Unfortunately some of today’s CPSes make it extremely difficult to change the definition of a content type once that type has instances in existence – requiring (for example) a full dump / reload of the entire content set for even the most trivial of changes to the model.

This is the crux of the “schema evolution” requirement – any CMS worth a damn must provide the ability for the content model to evolve over time, regardless of whether content that uses that model exists or not.
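
One common way of meeting this requirement (not necessarily how any given CPS does it) is to stamp each content item with a schema version and upgrade it lazily when it is read, so that existing instances never need a dump / reload. The sketch below assumes a hypothetical `schema_version` field and two made-up model changes: “author” becoming a list, and a new “summary” field.

```python
def _v1_to_v2(item):
    # Model change 1 (hypothetical): single "author" becomes a list of "authors".
    upgraded = dict(item)
    upgraded["authors"] = [upgraded.pop("author")]
    upgraded["schema_version"] = 2
    return upgraded


def _v2_to_v3(item):
    # Model change 2 (hypothetical): a "summary" field is added, default empty.
    upgraded = dict(item)
    upgraded.setdefault("summary", "")
    upgraded["schema_version"] = 3
    return upgraded


# Maps "version the item is at" -> "function that lifts it one version".
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}
CURRENT_VERSION = 3


def upgrade(item):
    """Return a copy of `item` lifted to CURRENT_VERSION, one step at a time."""
    while item.get("schema_version", 1) < CURRENT_VERSION:
        item = MIGRATIONS[item.get("schema_version", 1)](item)
    return item


# An item written under the original model is still readable today.
old_item = {"schema_version": 1, "title": "Hello", "author": "A. Writer"}
new_item = upgrade(old_item)
```

Because each step copies the item, the stored original is left untouched until the repository chooses to write the upgraded form back.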

Branch / Merge

This is the ability for an author (or set of authors) to spin off from the main “branch” of editorial activity, work independently for some period of time and then merge their changes back into the main “branch”.

This is an optional (though common) requirement – some CPSes don’t provide this capability and some editorial teams don’t require it either.

That said, any web site that has a lifecycle that involves both frequent incremental revisions and infrequent major revisions that are prepared in parallel will benefit from this kind of functionality. Anyone who’s ever managed multiple concurrent software releases will grasp the issue (and its solution) immediately.
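
For readers without the software release background, the core of a merge is a three-way comparison: each field on the two branches is compared against the common ancestor they were branched from. The sketch below is a deliberately simplified, field-level version for a single content item; the function name is hypothetical, and treating `None` as “field absent” is a simplifying assumption.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a content item against their common ancestor.

    Returns (merged, conflicts). A field is a conflict only when both sides
    changed it to different values; in that case "ours" is kept provisionally.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o            # both sides agree (or neither changed it)
        elif o == b:
            value = t            # only "theirs" changed it
        elif t == b:
            value = o            # only "ours" changed it
        else:
            conflicts.append(key)
            value = o            # both changed it differently
        if value is not None:    # None means the field was removed
            merged[key] = value
    return merged, conflicts


# Hypothetical item at the point the branch was created.
base = {"title": "Widget", "body": "Teh widget."}
# Main line: a minor maintenance edit (typo fix).
main = {"title": "Widget", "body": "The widget."}
# Branch: the major revision renames the product.
branch = {"title": "Widget Pro", "body": "Teh widget."}

merged, conflicts = three_way_merge(base, main, branch)
```

Here the typo fix and the rename touch different fields, so they merge cleanly; had both sides rewritten the body, it would be reported as a conflict for an author to resolve.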

Snapshot Based Versioning

By “snapshot based versioning” I mean a versioning system that captures the full state of the content set at a given point in time, and can resurrect that state at any point in the future, regardless of what operations are executed by authors in the meantime (including deletes, renames and moves of assets).

Anyone who suffered through RCS / CVS in the good old days and is now using a sane SCM (Subversion or Mercurial, for example) will know exactly what I’m referring to here!

Surprisingly, some CPSes continue to use RCS style per-asset versioning, which means they are unable to resurrect deleted assets – a serious problem if your web site happens to fall under one of the regulations (e.g. HIPAA, SEC, FTC, etc.) that require that the complete state of a site be “resurrectable” for quite significant periods of time (often 7 years).
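
The difference from per-asset versioning can be shown with a toy example. In the sketch below every commit records the state of the whole content set, so an asset that is later deleted or moved is still present when an earlier snapshot is checked out. The class and its methods are hypothetical, and copying the full set on every commit is a naive simplification; real systems share unchanged content between snapshots.

```python
class SnapshotStore:
    """Toy whole-tree versioning: each commit captures every asset."""

    def __init__(self):
        self._working = {}    # path -> content, the current editable state
        self._snapshots = []  # list of frozen copies of the working state

    def put(self, path, content):
        self._working[path] = content

    def delete(self, path):
        del self._working[path]

    def move(self, old_path, new_path):
        self._working[new_path] = self._working.pop(old_path)

    def commit(self):
        """Record the full current state and return its snapshot id."""
        self._snapshots.append(dict(self._working))
        return len(self._snapshots) - 1

    def checkout(self, snapshot_id):
        """Return the complete content set as it was at that snapshot."""
        return dict(self._snapshots[snapshot_id])


# Hypothetical paths and content.
store = SnapshotStore()
store.put("about.html", "About us")
store.put("news/launch.html", "We launched")
first = store.commit()

store.delete("about.html")
store.move("news/launch.html", "archive/launch.html")
second = store.commit()

then = store.checkout(first)   # the deleted and moved assets are both back
now = store.checkout(second)
```

A per-asset history has nowhere to record that “about.html” existed once the asset itself is gone, which is exactly the gap described above.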

ACID Transactions

Basically this boils down to the guarantee that modifications to the content set can be durably persisted to the CPS, either succeed or fail in their entirety and can be read back out in the case of success. To many this will seem a no-brainer, but when we move on to our review of NoSQL solutions we’ll find that some of them don’t necessarily provide this guarantee.

Note: while advantageous in some situations, I consider externally defined transactional boundaries (i.e. the ability to “batch up” numerous otherwise unrelated content modifications into an arbitrary ACID transaction) to be a “nice to have” requirement, rather than a hard requirement. Again we’ll see the impact of this when we review NoSQL technologies.
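
To illustrate the “succeed or fail in their entirety” guarantee, here is a small sketch using Python’s built-in `sqlite3` module, chosen purely because it is an easy way to demonstrate the behaviour, not because it is a CPS. The table layout and the `save_all` helper are hypothetical. The example batches two writes into one transaction, so it also shows the externally defined transactional boundary described in the note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id TEXT PRIMARY KEY, title TEXT NOT NULL)")
conn.execute("INSERT INTO content (id, title) VALUES ('home', 'Home page')")
conn.commit()


def save_all(conn, items):
    """Write every (id, title) pair, or none of them."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.executemany(
                "INSERT INTO content (id, title) VALUES (?, ?)", items
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]


# The second row clashes with the existing 'home' id, so the first row
# ('about') must not survive either.
failed = save_all(conn, [("about", "About us"), ("home", "Duplicate")])
rows_after_failure = count(conn)

# A batch with no clash is persisted as a whole.
succeeded = save_all(conn, [("about", "About us")])
rows_after_success = count(conn)
```

A store without this guarantee could leave “about” written after the failed batch, which is the kind of partial update the requirement rules out.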

Scalability to Large Data Sets

Interestingly, scalability in the presence of large amounts of traffic is the area where NoSQL technologies garner the most attention, yet it is one of the least important requirements for a CPS. This is because even large (several hundred person) editorial teams are unable to generate the kind of traffic load that even a moderately successful web site can receive.

However, what does matter is that the CPS can scale in the presence of large amounts of data – typical content-heavy web sites these days contain tens to hundreds of thousands of discrete content items, many of which will contain several media assets (images, video, Flash applets, etc.) that themselves may be heavyweight (MB to GB in size).

Geographic Distribution

Basically this requirement is for those organisations that have geographically distributed editorial teams who wish to ensure good performance of the editorial tool, no matter where the editors are physically located.

Although in essence a “nice to have” requirement, I threw it in here because I’m hearing it increasingly often and some NoSQL solutions cater to it quite nicely.

Next Up…

Next up I’ll give a quick overview of some (but not all!) of the more relevant NoSQL technologies currently on the market, and we’ll compare them against the requirements we’ve defined here to see to what degree they are relevant to the CPS use case.



Stéphane, the classic example of the branch / merge case is where there is a fairly continuous stream of minor “maintenance” edits being made to the current version of the site, and in parallel a major version of the site is being prepared. In this case you’d typically want to create the major revision by branching the existing site, and then as work on the major revision progresses, selectively merge across the minor maintenance edits as needed (typically content changes get merged across, but stylistic changes do not).

This is remarkably difficult to do without a good set of branch / merge tools, and perhaps even more remarkably, most of the popular WCM systems (including both the CPS and PMS flavours) don’t support this directly at all!