Aspects and XMDS: higher order metadata for datasheets

Recent progress on the OS X Molecular DataSheet (XMDS) app, which is currently in beta, has involved the inclusion of aspects into the user interface for editing datasheets. The core format used by the XMDS desktop app, as well as all the rest of the products from Molecular Materials Informatics, is the datasheet, which is a tabular format made up of rows and columns: like a spreadsheet, except that the columns are strongly typed, so you can’t just abuse it and put in whatever you want wherever you want. The minimalistic table data structure can be supplemented by any number of aspects, which are directives that impose higher order layers of interpretation, often composed of multiple columns (e.g. a chemical reaction, which cobbles together multiple molecules for its reactants, reagents and products). One of the fun software engineering challenges has been to make these complex representations fit seamlessly into the grid-based “spreadsheet” editor upon which XMDS operates.
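To make the idea of strongly typed columns plus a list of aspect codes a little more concrete, here is a minimal Python sketch. It is not the actual datasheet implementation: the class names, the set of column types and the has_aspect helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical column types; the real datasheet format defines its own set.
COLUMN_TYPES = {"string": str, "integer": int, "real": float, "boolean": bool}


@dataclass
class Column:
    name: str
    type_name: str  # must be a key of COLUMN_TYPES


@dataclass
class DataSheet:
    columns: List[Column] = field(default_factory=list)
    rows: List[List[Any]] = field(default_factory=list)
    # Recognition codes of the aspects that claim to apply to this datasheet.
    extensions: List[str] = field(default_factory=list)

    def add_row(self, values: List[Any]) -> None:
        """Append a row, rejecting any cell whose type does not match its column."""
        if len(values) != len(self.columns):
            raise ValueError("row length does not match the number of columns")
        for col, value in zip(self.columns, values):
            expected = COLUMN_TYPES[col.type_name]
            # None stands for a blank cell; anything else must match exactly.
            if value is not None and type(value) is not expected:
                raise TypeError(f"column {col.name!r} expects {col.type_name}")
        self.rows.append(list(values))

    def has_aspect(self, code: str) -> bool:
        """Report whether the header lists the given aspect recognition code."""
        return code in self.extensions
```

With this sketch, adding a row that puts text into a "real" column raises a TypeError instead of quietly storing it, which is the behaviour that distinguishes a datasheet from an ordinary spreadsheet.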

The screenshot above/right shows part of a datasheet being edited, which happens to have an aspect called “Assay Provenance”. What this means at the lowest level is that the list of extensions within the header of the datasheet includes a special recognition code (which happens to be com.mmi.aspect.AssayProvenance). Any software that can read the datasheet format (usually serialised using a very simple XML schema) may or may not have a corresponding implementation for this aspect. If not, that’s OK: the underlying columns are still ordinary columns that can be read as they are. The main value that this aspect provides is that it combines several columns together: Value, Error, Units and Relation are fused into a single “aspect column” called Activity. Without the presence of the aspect, the underlying datasheet looks like this when edited:

The above layout is more or less self-explanatory. The Assay Provenance aspect is a very simple one, and was implemented first for just that reason. It was designed originally for the purpose of describing the datasets being extracted from ChEMBL, keeping track of various relevant metadata about the target and assay, and also the source origins of individual molecules. At the moment all of the datafiles in existence that make use of this aspect are created by a script, so the initial value of adding the interactivity is to make it more pleasant to view the content (combining measurement values into 1 column is much nicer than spreading them across 4). More importantly, it makes sure that the editing mechanism validates the content at the time of entry.

When editing an Activity block, the user interface is nothing more than a text box, similar to editing an ordinary text value. If one were to type in “> 100 nM”, it would be translated into [Value=100, Error=0, Units=nM, Relation=greater], whereas “10 +/- 1” would be translated into [Value=10, Error=1, Units=none, Relation=equality], and “fubar” would be considered an error, and rejected. This kind of at-the-time validation is rather important for cheminformatics, since the corpus of open data is so heavily infested with mistakes of every imaginable kind, many of which are rather trivial and easy to detect. From a machine consumer’s point of view, an algorithm that recognises the aspect can be reasonably confident that it can interpret the underlying data in the way that it was originally intended, rather than just taking a much less well justified guess and hoping for the best.
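The translation from typed text into the four underlying fields can be sketched in a few lines of Python. This is only an illustration of the idea, not the parser that XMDS uses: the function name, the regular expression, the empty string standing in for “no units”, and every relation name other than “greater” and “equality” are assumptions.

```python
import re
from dataclasses import dataclass

# Only "greater" and "equality" come from the examples above; the other
# relation names are assumed for the sake of a complete sketch.
RELATIONS = {
    None: "equality",
    "=": "equality",
    ">": "greater",
    "<": "less",
    ">=": "greater_or_equal",
    "<=": "less_or_equal",
}

_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
_PATTERN = re.compile(
    rf"^\s*(?P<rel>>=|<=|>|<|=)?\s*"           # optional relation symbol
    rf"(?P<value>{_NUM})"                      # the measured value
    rf"(?:\s*\+/-\s*(?P<error>{_NUM}))?"       # optional "+/- error"
    rf"\s*(?P<units>[A-Za-zµ%]\S*)?\s*$"       # optional units, e.g. nM
)


@dataclass
class Activity:
    value: float
    error: float
    units: str      # empty string means no units were given
    relation: str


def parse_activity(text: str) -> Activity:
    """Split one line of typed text into value, error, units and relation."""
    match = _PATTERN.match(text)
    if match is None:
        # Input such as "fubar" is rejected at the time of entry.
        raise ValueError(f"not a valid activity: {text!r}")
    return Activity(
        value=float(match.group("value")),
        error=float(match.group("error") or 0),
        units=match.group("units") or "",
        relation=RELATIONS[match.group("rel")],
    )
```

Calling parse_activity("> 100 nM") gives a value of 100 with the relation “greater” and units “nM”, while parse_activity("fubar") raises a ValueError, so the bad entry never reaches the underlying columns.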

In addition to the general pattern of aspects co-opting a group of columns to compose their own higher order data, they also have the option to store datasheet-global metadata of their own. In the case of the Assay Provenance aspect, the datasheet is supposed to represent the structure & activity data of a single assay type, with the objective being to ensure that the SAR is ready to go for a modelling operation, i.e. all of the pre-processing has been done. This means that there is plenty of information about the source of the data (hence the use of the word Provenance), which is both stored by the aspect and made available via a dialog:

The importance of this particular aspect will be realised at a later date, as other technology components mature. It was chosen first because it is one of the least complicated. Internally, creating the software framework that allows any number of additional bolted-on features to hijack parts of the spreadsheet-like editor is a software engineering task that could be done quickly, and eventually result in an unmaintainable trainwreck that needs to be redesigned; or it could be done slowly and well, with due consideration given to all of the future needs and problems that are likely to arise. Needless to say, I’m aspiring toward the second scenario, having had more than enough practice with the former.

Many of the mobile apps from Molecular Materials Informatics, as well as the molsync.com sharing service, already make use of the aspect mechanism. However, the mobile apps are designed with almost a “one app per aspect” approach. The Mobile Molecular DataSheet (MMDS) app is primarily designed for aspect-less datasheets, but it also bifurcates its functionality for those which contain the Reaction aspect. The Green Lab Notebook (GLN) app operates primarily on the Experiment aspect, which is a multi-step reaction description, with quantities and other metadata. The SAR Table app operates on the SARSheet aspect, which ties together scaffolds & substituents with some number of properties, with a lot of advanced functionality for working with these specialty collections.

True to the principles of the platform, the XMDS desktop app prefers to roll all of this functionality into a more “monolithic” implementation. So rather than having a unique app for each kind of window onto chemical data, the standard datasheet editor for XMDS reacts intelligently to whatever kind of markup is present.

Stay tuned for more progress reports, and feel free to inquire about the beta programme if you’re interested (and have a Mac).