Thursday, September 12, 2013

Why I still prefer CML (was: #ACSIndy formats session)

At the #ACSIndy meeting there was a session on chemical formats (I hope the slides will all come online). Some key tweets (thanx to Tony for the coverage):

But I am still a fan of the Chemical Markup Language. In fact, I started using it when XML was not even standardized yet; CML even has an SGML background. Well, to be fair, that was only months before XML became a recommendation, and CML followed. CML is flexible, which to some is a downside; to me it is a big advantage, as it allows me to easily extend it. It supports ontologies to do this, and is therefore one of the most machine-readable chemical formats.
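To make the extension point a bit more concrete, here is a minimal sketch in Python of how a custom property can ride along in a CML document via a dictRef attribute. The "ex" dictionary, its URI, the notebookPage term and the read_properties helper are all made up for illustration; only the CML namespace and the element names (molecule, atomArray, atom, bondArray, bond, property, scalar) are meant to follow CML itself.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical document: the "ex" dictionary, its URI and the
# "notebookPage" term are assumptions, not part of any real dictionary.
CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema"
          xmlns:ex="http://example.org/dict#" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="D"/>
  </bondArray>
  <property dictRef="ex:notebookPage">
    <scalar>42</scalar>
  </property>
</molecule>"""


def read_properties(cml_text):
    """Return {dictRef: value} for properties directly under the root molecule."""
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}
    props = {}
    for prop in root.findall("cml:property", ns):
        scalar = prop.find("cml:scalar", ns)
        if scalar is not None and scalar.text:
            props[prop.get("dictRef")] = scalar.text.strip()
    return props


def read_elements(cml_text):
    """Return the element symbols of all atoms, in document order."""
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}
    return [a.get("elementType") for a in root.findall("cml:atomArray/cml:atom", ns)]


if __name__ == "__main__":
    print(read_elements(CML_DOC))
    print(read_properties(CML_DOC))
```

A reader that does not know the "ex" dictionary can still parse the atoms and bonds and simply skip the property, which is the kind of flexibility I mean.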

Of course, a lot depends on the libraries that you are using. For reading, there are various approaches I have taken. Originally, I wrote a library (Willighagen2011) that supported the convention idea in CML, which is a pain to many. This feature is still actively used in Bioclipse and the CDK! Of course, many cheminformaticians do not care too much about explicit semantics, and the community standard is MDL molfile V2000 (does someone have exact numbers?), even though the improved V3000 update is already 30 years old (see the first tweet!).

Of course, browsing through all tweets, I think the session nicely showed some of the newer requirements, many of which required the extensions presented in this session. These extensions may have been part of the original specification (is there an overview of specification documents of all industry standards?), but in many cases these will also be conventions. E.g. a common convention used by cheminformaticians is to use the bond order type 4 in MDL V2000 molfiles to reflect aromaticity, even though the specification defines it differently.
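As a rough illustration of that convention, the following Python sketch reads the fixed-width bond block of a V2000 molfile and reports the bonds written with type 4. It is a sketch under assumptions, not a full molfile reader: the aromatic_bonds helper is hypothetical, the benzene-like input is handwritten with placeholder all-zero coordinates, and atom lists, properties and error handling are ignored. As far as I know, the specification marks bond types 4 through 8 as meant for substructure queries only, which is why type 4 in a plain structure file is a convention rather than the standard.

```python
def aromatic_bonds(molfile):
    """Return (atom1, atom2) index pairs for bonds written with type 4."""
    lines = molfile.splitlines()
    counts = lines[3]                 # the counts line follows three header lines
    n_atoms = int(counts[0:3])        # fixed-width fields, three characters each
    n_bonds = int(counts[3:6])
    first_bond = 4 + n_atoms          # the bond block starts after the atom block
    found = []
    for line in lines[first_bond:first_bond + n_bonds]:
        a1, a2, btype = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if btype == 4:                # 4 is commonly used to mean "aromatic"
            found.append((a1, a2))
    return found


# Handwritten, benzene-like example; the coordinates are placeholders.
ATOM_LINE = "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0"
BENZENE = "\n".join(
    ["benzene (hypothetical example)", "  handwritten", ""]
    + ["  6  6  0  0  0  0  0  0  0  0999 V2000"]
    + [ATOM_LINE] * 6
    + [f"{i:3d}{i % 6 + 1:3d}{4:3d}{0:3d}" for i in range(1, 7)]
    + ["M  END"]
)

if __name__ == "__main__":
    print(aromatic_bonds(BENZENE))
```

A tool that follows the specification to the letter could reject or reinterpret such a file, so a reader like this one has to decide explicitly how to treat type 4.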

I hope all specifications of these updates and conventions will find their way to the web, with at least the rights to redistribute, allowing independent tools to properly implement these standards. (The right of modification is debatable for standards.)

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!
