CMIS First Technical Committee Meeting Notes

Last week was the first face-to-face meeting of the OASIS CMIS Technical Committee (TC). From my perspective, it was a very productive gathering and allowed for a number of highly constructive discussions. In the following write-up I do my best to retrace the gist of the conversations around the topics I found most interesting, which in some cases were conversations I had with just one or two people.

The outlook from these discussions, and from the scope of the spec itself, is very positive. I believe that within a year CMIS will start to actively redefine the world of content management systems. This will be an opportunity both for big vendors, who will see easier adoption of their solutions by customers concerned about lock-in or interoperability issues, and for smaller vendors, whose products will be able to take advantage of a much broader spectrum of connectors to third-party systems.

Schedule

The most important news at the end of these three days is that there is enormous support in the TC for CMIS 1.0 to be released as soon as reasonably possible, as it is felt by all that a simple and solid spec that can be implemented and used ASAP by everyone is paramount.

Due to time constraints inherent in the OASIS standardization process, this is likely to be in late 2009 or early 2010 -- and that's if we can polish the current draft and fix the problems within something like two months!

Existing Capabilities

During this meeting it was stressed many times that the goal of CMIS is not to define the semantics for new features that a repository could implement, but to provide access to existing features of existing repositories, so that they can interoperate.

This implies that complex features, non-standard features, or features that are common but implemented with a wide variety of semantics, have to be out of scope for CMIS 1.0. When features are exposed through CMIS, there is a duty to make sure that this can be done by almost everyone without rethinking the repository's architecture.

Retention & Hold

For those not familiar with the terms, a retention policy describes the rules under which documents are kept for a certain time and then archived or destroyed. Holds are typically put on documents for legal purposes, to prevent their destruction when companies are being sued or subpoenaed.

A way to specify and discover documents that can have various holds put on them, or various retention policies, is critical to all the Records Management folks. It's not clear what can be standardized though, as there is a huge range of possible semantics for such policies.

Note also that records management features are explicitly out of scope for CMIS 1.0 (for the very reason that variations between repositories are enormous).

Tagging

While it can initially be seen as a very simple concept, tagging is more than just the setting of a multi-valued property (MVP) on a document (a la Dublin Core "subjects"). A complete tagging solution can involve the following features:

Adding metadata to a tag: tag date and author are typical of the use case here. The tags are then seen as either a "rich" property, or as a relationship (carrying metadata) to a concept in a taxonomy.

Tagging an object on which one doesn't have write access: this is quite common and typical of the bookmarking community, think "del.icio.us". Of course to allow this you can't use a basic MVP as you don't have write access to the document being tagged.

Does tagging an object change its last-modification date? This may be constrained by the implementation of the tags.

Changing many tags at the same time. This is typical of the "tag normalization" use case, where a TagMaster has determined that two tags have to be merged. Some batch modification features may be useful in this case.

Querying tags for things like: most common tags, least used tags, tags most used together. Also querying for documents in relation to their tags, for instance documents sorted by total tag weight, documents with the most tags, documents most recently tagged. Or even the people who have added the most tags, etc.

Maintaining the tag cloud, or taxonomy: in a system with many tags, maintaining the tag cloud becomes a full-time job. Having relationships between similar tags, merging them, weighting them, is important.
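
To make the difference from a plain MVP concrete, here is a minimal sketch of tags stored as separate records that carry their own metadata. It is not something discussed by the TC or defined by CMIS; the names Tagging, TagStore, merge and most_common are all hypothetical and chosen only for illustration. Because the taggings live outside the document, adding one needs no write access to the document and does not have to touch its last-modification date.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Tagging:
    """One tag applied to one document, with its own metadata (hypothetical)."""
    doc_id: str
    tag: str
    author: str
    date: datetime


class TagStore:
    """Keeps taggings outside the documents themselves (illustrative only)."""

    def __init__(self):
        self.taggings = []

    def add(self, doc_id, tag, author, date=None):
        # The document itself is never modified here.
        date = date or datetime.now(timezone.utc)
        self.taggings.append(Tagging(doc_id, tag, author, date))

    def merge(self, old_tag, new_tag):
        """Batch rename for tag normalization; returns how many taggings changed."""
        changed = 0
        merged = []
        for t in self.taggings:
            if t.tag == old_tag:
                t = Tagging(t.doc_id, new_tag, t.author, t.date)
                changed += 1
            merged.append(t)
        self.taggings = merged
        return changed

    def most_common(self, n):
        """The n most used tags, as (tag, count) pairs."""
        return Counter(t.tag for t in self.taggings).most_common(n)

    def docs_by_tag_count(self):
        """Document ids, most tagged first."""
        counts = Counter(t.doc_id for t in self.taggings)
        return [doc_id for doc_id, _ in counts.most_common()]
```

A real repository would also have to decide what happens when a merge leaves the same tag on a document twice; this sketch simply keeps both taggings, since each has its own author and date.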

For CMIS 1.0 this will be hard to standardize, but there's still some time left for something simple to be proposed by interested parties.

Transactions

The fact that transaction capabilities are not mentioned in the spec was surprising to some. This is due to the fact that too many vendors don't support them. In addition, WS-Transaction can be used to get transactions spanning several requests when using the SOAP bindings, so repositories having them can still expose them in this manner.

Events and Notifications

Having a CMIS repository notify the outside world would be very powerful, and has been mentioned as quite useful in the context of user email notification as well as unified search. However, CMIS is a protocol-based spec, where a client sends commands to a server and receives answers, so there is no simple way in CMIS 1.0 to expose direct notification capabilities.

Registering code that can be executed by the repository on certain events would be useful as well, but again CMIS is a language-neutral spec and cannot standardize this.

REST API

While what we have today with the CMIS AtomPub bindings may not be by-the-book REST, we need a simple protocol that simple tools (and many scripting languages) can use to access the repository in a few lines, or even that can be used directly by JavaScript in a browser.

The AtomPub bindings are here for this, and many clients can take advantage of them today, although the way some things are exposed may not be perfect (there was consensus on making sure that the bindings are as close as possible to the best practices of AtomPub).
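
As a rough illustration of the "few lines" idea, the sketch below lists the entry titles of an Atom feed using only the Python standard library. The feed is a hand-written stand-in, not actual output from any CMIS repository, and it omits the CMIS-specific extension elements a real feed would carry; SAMPLE_FEED and list_titles are hypothetical names. In practice the XML would come from an HTTP GET on a URL exposed by the repository.

```python
import xml.etree.ElementTree as ET

# The standard Atom namespace, in ElementTree's {uri}tag notation.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hand-written stand-in for a folder listing (assumption, not real output).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Root folder</title>
  <entry><id>urn:example:1</id><title>report.pdf</title></entry>
  <entry><id>urn:example:2</id><title>notes.txt</title></entry>
</feed>"""


def list_titles(feed_xml):
    """Return the title of every entry in an Atom feed, in document order."""
    root = ET.fromstring(feed_xml)
    return [entry.findtext(ATOM + "title") for entry in root.iter(ATOM + "entry")]


if __name__ == "__main__":
    for title in list_titles(SAMPLE_FEED):
        print(title)
```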

It was also noted that, for what it's worth, as a marketing term "REST" now carries a lot of weight and its presence in the spec has already been a significant factor in the adoption or interest (internal or not) in CMIS by various vendors.

There was discussion of using WebDAV, which would fit very well with the concept of navigating folders and finding documents, and already has many clients. The reason why this is not in the spec today instead of AtomPub is basically historical: there was no one in the group to push for WebDAV when the spec was initially created, and it seems that IBM is very pro-AtomPub :).

As we all want a CMIS 1.0 soon, AtomPub won't be replaced, but there may be side work going on so that post-1.0 we can find ways for different repositories to expose their CMIS features through WebDAV in a compatible manner.

Reference Implementation/Technology Compatibility Kit

It was not felt that a Reference Implementation (RI) would bring much, as it is expected that many vendors will have implementations of CMIS very soon, including several open source ones. In any case, it's not the job of an OASIS TC to write software.

Regarding a Technology Compatibility Kit (TCK), most people agree that it would be nice to have something, either in an abstract format or as executable test cases. Here I feel that the ball is in the court of open source vendors; we can easily get together and pool our unit testing resources to turn them into a nice TCK. It won't be a deliverable of the OASIS CMIS TC though, and won't be normative, although conceivably the TC can formally approve a given version.

Access Control Lists (ACLs)

Of all the points that really merit further work before a 1.0 version can be considered, ACLs ranked highest -- practically everyone agrees that ACLs should be in the spec in some form.

However, ACLs are also one of the features that vary most between repositories, so common ground will be hard to find. Nevertheless, ACLs are crucial to some use cases. Unified search was the most frequently mentioned, but many people also have the simple use case of being able to inform a repository that a given document is now readable by Bob.

A simple way of being able to express positive ACLs (but not blocking), and to give a hint that a given ACL is inherited or not (whatever the meaning of "inherited"), would be a good step toward interoperability. If ACLs find their way into the spec, it is likely that the separate notion of a Policy will disappear.
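
To show how small such a common denominator could be, here is a minimal sketch of positive-only ACL entries with an "inherited" hint. This is not the model the TC has agreed on (none exists yet); Ace, grant and is_allowed are hypothetical names, and permissions are plain strings purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ace:
    """A positive access control entry: a principal is granted a permission."""
    principal: str
    permission: str
    inherited: bool = False  # a hint only; what "inherited" means is repository-specific


def grant(acl, principal, permission):
    """Return a new ACL that also grants the permission locally (no duplicates)."""
    for ace in acl:
        if ace.principal == principal and ace.permission == permission:
            return acl
    return acl + [Ace(principal, permission)]


def is_allowed(acl, principals, permission):
    """True if any entry grants the permission to one of the given principals.

    There are no blocking entries in this sketch, so a single match is enough.
    """
    return any(
        ace.permission == permission and ace.principal in principals
        for ace in acl
    )
```

With this shape, the "document is now readable by Bob" use case is just grant(acl, "bob", "read"), and the absence of negative entries keeps the evaluation order irrelevant.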

Search

The use cases of Federated Search -- an engine that, when queried, delegates the search to many repositories and then aggregates the results -- and Unified Search -- an engine that somehow crawls many repositories to build a database of what's in them, and can then be directly queried -- have been discussed a lot, especially unified search as it impacts a number of other features.

One feature needed is something allowing the discovery of permissions, to be able to serve search results without having to check with the repository for each document if access can be granted; this will presumably involve some kind of ACLs.

Even if such permission discovery does not reflect the full security policy applicable to a document, it can still be useful to weed out some of the documents and improve the efficiency of the search.
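
The weeding-out step can be sketched as follows. The sketch assumes, and this is my assumption rather than anything in the spec, that the readers discovered during the crawl are a superset of those who can really read the document, so that a miss in the cached ACL is a safe reason to drop a hit, while a match still has to be confirmed by the repository. The names filter_hits, cached_readers and confirm are hypothetical.

```python
def filter_hits(hits, cached_readers, user_principals, confirm):
    """Keep the search hits the user may see, asking the repository only when needed.

    hits            -- document ids returned by the search index
    cached_readers  -- dict of document id -> set of principals seen at crawl time
    user_principals -- set of principals (user name, groups) of the searcher
    confirm         -- callable(doc_id) -> bool, the expensive repository check
    """
    visible = []
    for doc_id in hits:
        readers = cached_readers.get(doc_id)
        if readers is not None and not (readers & user_principals):
            # The cached ACL already rules this document out: no round trip.
            continue
        # Unknown ACL, or a possible match: the repository has the final word.
        if confirm(doc_id):
            visible.append(doc_id)
    return visible
```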

Another feature needed is a way to discover what has changed in the repository since a previous crawl. This can be done either through push/events (but as mentioned above this would be out of scope for CMIS 1.0), or through pull/polling/querying to retrieve some kind of journal of the latest changes, including deleted documents. This feature is sometimes called an Event Journal or a Transaction Log, and the problem is to make it available efficiently outside the repository for the benefit of search engines.
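
The pull variant can be sketched as a journal that a crawler polls with a token marking how far it has already read. Everything here is an assumption made for illustration: CMIS defines no such interface at this point, and Change, ChangeJournal, since and apply_changes are hypothetical names. The token is simply the sequence number of the last change seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int      # position in the journal, starting at 1
    doc_id: str
    kind: str     # "created", "updated" or "deleted"


class ChangeJournal:
    """In-memory stand-in for a repository's event journal."""

    def __init__(self):
        self._entries = []

    def record(self, doc_id, kind):
        self._entries.append(Change(len(self._entries) + 1, doc_id, kind))

    def since(self, token, limit=100):
        """Return up to `limit` changes after `token`, plus the next token."""
        batch = [c for c in self._entries if c.seq > token][:limit]
        next_token = batch[-1].seq if batch else token
        return batch, next_token


def apply_changes(index, batch):
    """Bring a crawler's set of known document ids up to date."""
    for change in batch:
        if change.kind == "deleted":
            index.discard(change.doc_id)
        else:
            index.add(change.doc_id)
```

A crawler would keep the token between runs and call since(token) until it gets an empty batch. Listing deletions explicitly is what lets the index drop documents without rescanning the whole repository.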

Next Steps

For the TC the coming weeks will be busy, but we hope that very soon a new draft closer to 1.0 will be available to try to resolve some of the issues listed above (and a few others I skipped over). Expect news very soon! And, of course, any feedback to the TC will be reviewed carefully, so please submit yours (the cmis-comment mailing list is listed at the bottom of the CMIS committee webpage).

About the Author

Florent Guillaume is Head of R&D at open source enterprise content management vendor Nuxeo. He is an expert software architect and has been a core developer of the Nuxeo content management platform since 2001. He actively participates in several standardization workgroups, including JSR 283 and OASIS CMIS, and regularly gives talks and courses on these topics as well as on Nuxeo content management technologies.