The text field shows the SPARQL used to aggregate the data, which is then visualized in the plot below that field. You can edit the SPARQL and, for example, plot the boiling point (t) as a function of the number of carbons (p):

This work nicely shows some interesting McPrinciples: it shows what happens if we allow reuse and share our knowledge, and it shows that nice graphics and semantic access to the original data are very compatible. All in all, this is an important step forward toward the semantic publishing of chemical data! Orion, thanx for this really nice work!

The CDK project has by now grown so large that it is hardly possible to keep up, and I am particularly grateful to Chris and Rajarshi for actively keeping the project going, to all those who submit patches and bug reports, and to all who use the CDK in their software. This has created a healthy development and user community, as is visible from the blog aggregator Planet CDK.

But, reflecting on the past, it is also clear where the project needs help. The flow of CDK News papers has effectively dried up, the documentation needs serious updating, and we still need far more unit testing, as well as more in-depth validation of algorithm implementations. And we all know we are short on code reviewers to control the flow of patches going into the library. There is also still some functionality missing, like a simple force field (the Jmol LGPL UFF code could be ported, doi:10.1021/ja00051a040) and support for popular file formats like Symyx V3000 molfiles and the ChemDraw CDX format.

I am really positive about the future of the CDK project; its immediate future is mostly limited by the number of people working on maintenance, code quality, and releases. For example, I would love more frequent releases, but making a release takes about half a day. It is not merely creating the files to distribute, but also ensuring that the branch is in a releasable state, that it has no important outstanding bugs and at least no more unit test failures than the past release (preferably fewer...), and writing a release message.

This maintenance also involves writing unit tests for reported bugs, and ensuring that someone fixes the bug. This is a second important challenge to the project: how to keep the original code authors involved, and make them feel responsible for fixing bugs in the code they wrote. Cheminformatics is very much a field of write once, go off to another job, and forget about it. This is why I am so insistent on having unit tests, proper JavaDoc, and clean code, so that others can do this required code maintenance.

If we look at the current numbers, we see about 170 open bugs out of 1115 ever reported, and 24 open patch reports out of 276 reported. Those are acceptable numbers, though they need to go further down.
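To put those tracker numbers in perspective, here is a quick back-of-the-envelope calculation in plain Python (my own arithmetic on the figures above, nothing CDK-specific):

```python
# Open-versus-reported ratios for the bug and patch trackers mentioned above.
open_bugs, total_bugs = 170, 1115
open_patches, total_patches = 24, 276

bug_ratio = open_bugs / total_bugs        # fraction of bug reports still open
patch_ratio = open_patches / total_patches  # fraction of patches still open

print(f"open bugs: {bug_ratio:.0%}, open patches: {patch_ratio:.0%}")
```

So roughly 15% of reported bugs and under 9% of submitted patches are still open.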

I really hope that 2011 will be the year that commercial CDK support picks up, providing value for users through dedicated support. Right now, to get something fixed, you need to wait for someone to fix the problem; however, none of the CDK developers is actually working solely on the CDK, and many contributions are made in spare time. That nicely shows the power of Open Source, but also well illustrates the need for proper funding. That said, this is merely limited by the number of people actually willing to pay for such support, or even just to donate financial support to the project. If you are interested in that, please contact me offline, as we have the means in place to do this.

In short, I have no clue where the CDK will go, except that it will continue to grow. This is another power of Open Source: the accumulated effort cannot be lost. Seriously, back in 2004 I wrote a What's 2004 going to bring? post, and here's a lousy attempt for 2011:

a new stable series, 2.4 or 3.0 (versioning has not been decided on yet)

it will be faster and support parallel computing

we will have a UFF implementation

more extensive chirality support (EZ, ...)

rendering and editor will be integrated

we will use JExample for unit testing

cheminformatics in the webbrowser (using the CDK)

we will have books about the CDK

more molecular descriptors

But we will also have to overcome these issues, for which we need your help:

CDK News needs a new editorial board

we need a second release manager (one for the stable and one for the development branch)

Update: An observant reader will have noticed that the current CDKSourceCodeWriter actually produces code that does not compile. The CDK API has changed, but the created output was not updated accordingly. Apparently, no one is actually using this class, or those who did were not interested enough in that piece of functionality to file a bug report.

Agreeing with @citeulike and @abhishektiwari that, for a service provider, any bad news is good news too: it provides opportunities to improve. So, as encouraged, I reported my long list of things I miss in CiteULike:

@citeulike ok, one more. wish #18: get readermeter.org to also support citeulike

@citeulike wish #17: allow people linking between papers in their libs using CiTO to annotate how they cite papers, see http://ur.ly/lBUO

@citeulike wish #16: I think I saw images from some papers, right? how about doing that for #biomedcentral journals too?

As a follow up of the ACS RDF 2010 Symposium we just had in Boston, I can announce that we are preparing a thematic series in the Journal of Cheminformatics around this theme, and that at least six speakers will present their work in this series.

However, as it was the goal of the meeting to create an active and collaborating community, we feel it important to open up and encourage others too to submit papers to the series about the use of the Resource Description Framework in chemistry. The exact scope of the series is that of the symposium, of which all abstracts and some slide sets are available here.

Therefore, it is my pleasure to send around this encouragement to submit papers:

L.S.,

As organizers of the ACS RDF 2010 Symposium held at the American Chemical Society meeting in Boston in August 2010 (http://egonw.github.com/acsrdf2010/), we would like to encourage you to submit a paper to a Thematic Series of papers around the use of Resource Description Framework (RDF) in chemistry, in the Journal of Cheminformatics (http://www.jcheminf.com/). Six speakers have already agreed to participate.

Journal of Cheminformatics was launched in March 2009 as a fully Open Access cheminformatics journal. Papers published in this journal benefit the cheminformatics community through the free, widespread and unrestricted readership that Open Access offers. As organizers of the ACS RDF 2010 symposium, we believe that it is important to share research in this area within the cheminformatics community as much as possible. The journal is peer reviewed, has unrestricted article length, and unlimited use of color illustrations, supplementary files, etc., making it an excellent platform to both give an overview of scientific progress and go into detail at the same time. Authors retain complete copyright to their published paper.

We would be pleased to explore this opportunity further with you, so if you have any comments or questions e.g. about the scope of the thematic issue, article type, or otherwise, please don't hesitate to get in touch.

Wednesday, September 15, 2010

The list of changes is particularly long for this development release. Therefore, I will list the authors and reviewers first. Note that this release also includes the changes of the CDK 1.2.6 and 1.2.7 releases.

The authors
I am mildly impressed by this release's list of authors... it is certainly not the usual suspects anymore, and I would like to thank all contributors very much. Interestingly, we also see the number of places contributions come from increasing, with patches from 7 different institutes (6 if you only count current affiliations)!

The changes
The changes include bug fixes (see also the 1.2.6 and 1.2.7 release notes), but also an updated SMSD engine, a rename of the IAtomType method getHydrogenCount() to getImplicitHydrogenCount() and of MDLWriter to MDLV2000Writer, the addition of the signature code by Gilleain, and many, many small fixes.

Throw a CDKException when a QUADRUPLE bond order is in the input, which is not supported by the MDL/Symyx molfile format (fixes #3029352) 32cdc48

Deal with a special situation: pyridine N-oxide in the non-charge-separated representation, with a N.sp2.3 nitrogen, with two double bonds. Previously, any ring-outward double bond would disqualify the ring as aromatic. This compound is now an exception. c66039d

Sunday, September 12, 2010

CDK 1.2.7 is the latest bug fix release in the 1.2 series. It brings a number of JavaDoc fixes, but, importantly, also bug fixes in SMILES handling and atom type perception. I am really pleased to see the application domain of various algorithms in the CDK continuously grow: SMILES parsing for some transition metals has been fixed, as has SMILES generation for some types of ring closures. Additionally, an important bug was fixed in the atom type perception algorithm, which failed for custom atom types with formal charges. Everyone using the CDK 1.2 series is advised to upgrade to this version.

Friday, September 10, 2010

I am keen on RDFa and RDF in general; that should not be a surprise. RDFa is a serialization of RDF triples embedded in (X)HTML. I recently posted about chemical examples of XHTML+RDFa. Now, the reason for putting data in HTML as RDFa is that we can easily pull it out again, e.g. with this distiller. But the fun goes on, and we can actually also run SPARQL directly on it, for example with RDFaDev which I recently blogged about.

Now, considering that we have all these nice JavaScript visualization tools that can render data from JSON sources, a mashup requires a JSON serialization of that data embedded in HTML pages. Now, I have no experience with these cool JavaScript tools, and hope someone can help me out here, but with the JSON bit I already got help before on SemanticOverflow (thanx to Comment Bot!). The service mentioned there no longer works, but there are plenty of alternatives.
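To illustrate the JSON shape such a mashup could consume, here is a minimal Python sketch of my own (not part of any existing service): it reshapes the (p, w) pairs from the CSV output quoted further down in this post into the array-of-arrays table that JavaScript charting libraries, such as Google Chart via arrayToDataTable(), typically accept. The column labels are just illustrative.

```python
import json

# The (p, w) pairs from the CSV output quoted further down in this post.
pairs = [(1, 10), (2, 20), (3, 35), (4, 56), (5, 84),
         (6, 120), (7, 165), (8, 220), (9, 286)]

# A header row plus data rows: the array-of-arrays layout that
# e.g. Google Chart's arrayToDataTable() accepts.
table = [["p", "w"]] + [[p, w] for p, w in pairs]
print(json.dumps(table))
```

The resulting JSON string can be dropped straight into an HTML page for a JavaScript chart library to pick up.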

Now, Peter is creating this nice data set about green solvents from patents, and it would be great if that data ends up online as RDFa, so that we can easily visualize the trends in solvent use over the years. But as I do not have this data as XHTML+RDFa yet, you will have to do with another example: boiling points.

So, let's consider the data on this page, relating paraffin molecules to boiling points, and we'll take a complexity descriptor (w0, the Wiener descriptor) and the boiling point (t0). So we get this SPARQL query:

The point is, I am sure at least one of my readers knows how to visualize the data in this JSON with, for example, Google Chart, particularly because all the mashing up is embedded in the just linked-to, though obscure, URL. And, if it helps, you can otherwise use the CSV or TSV output. The output of that is even simpler (CSV):

w,p
56,4
286,9
35,3
220,8
20,2
84,5
10,1
165,7
120,6
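
For anyone who would rather script against this output than eyeball it, here is a minimal Python sketch (my own assumption about how one might consume it, not part of the SPARQL service) that parses the CSV with the standard library and sorts the (p, w) pairs into plotting order. As a sanity check, each w turns out to equal n(n² − 1)/6 with n = p + 3, the Wiener index of a linear chain of n atoms, so the data is internally consistent.

```python
import csv
import io

# The CSV output quoted above, verbatim.
data = """w,p
56,4
286,9
35,3
220,8
20,2
84,5
10,1
165,7
120,6
"""

# Parse and sort by p, the order a scatter plot needs on its x-axis.
rows = csv.DictReader(io.StringIO(data))
pairs = sorted((int(row["p"]), int(row["w"])) for row in rows)
print(pairs)

# Sanity check: each w equals n*(n*n - 1)/6 with n = p + 3, the Wiener
# index of a linear chain of n atoms.
assert all(w == (p + 3) * ((p + 3) ** 2 - 1) // 6 for p, w in pairs)
```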

The first one who can use one of the above URLs to extract the data from that XHTML+RDFa page to create a scatter plot in a HTML page with some JavaScript library, wins a free mention in my blog! ;)

Sunday, September 05, 2010

As I was originally waiting for an actual copy in my mailbox, which I still have not received, I had not blogged about it, but earlier this year the book "Handbook of Chemoinformatics Algorithms" by Jean-Loup Faulon and Andreas Bender was released, for which I wrote a chapter on 3D molecular representation. Just wanted you to know.

Saturday, September 04, 2010

Earlier this year I gave Mendeley a try, after having been a happy JabRef user, an unhappy Connotea user (the main problem was that any URI can be bookmarked, not just papers, so it is very noisy), and a happy CiteULike user (and still am). But the client did not bring me what I needed, and I canceled my account again.

Moreover, Mendeley has momentum and is starting to provide interesting apps around the API, such as readermeter.org. And since being a scientist is playing the publishing game, one simply must add one's papers to these systems, if only to advertise them:

This brings us to problem #1: author identity, which is a general problem and is addressed by projects like ORCID. So, besides the page shown above, I have a second page under an entry with just my first name.

But, as the title of the post suggests, Mendeley suffers from a second problem, which was recently brought up by Duncan in his How many unique papers are there in Mendeley? post. Mendeley, apparently, claims 36M papers, but the number of unique papers is much smaller, as outlined in detail by Duncan. Mr. Gunn replied that [d]uplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher (see this comment), but I do not buy that.

I replied in the blog about that claim and also made a suggestion: this dereplication should really be a crowd-sourcing effort. However, I found it impossible to find a place to report duplication, so I had to use a message to the support form with the uninformative category Other. If I were working at Mendeley, I would make this reporting a key technology behind their dereplication efforts.

Anyway, the duplication goes deep, very deep into the long tail. And really, my papers are fairly well received in general (many of my papers in BMC journals are 'Highly Accessed'; I did request some distinction there, using the StackOverflow gold, silver, bronze system), but incomparable with the highly bookmarked papers in Mendeley. I know this is probably not something Mendeley likes to hear, but the paper duplication goes deep, very deep too: a majority of my papers show duplicates. A semi-exhaustive scan showed me duplication for the XMPP paper (here and here), the Blue Obelisk paper (here, here, and here; yes, three copies), the CDK-Taverna paper (here and here), the Bioclipse 2 paper (here and here), the userscripts paper (here and here), the CDK I paper (here and here), and the CDK II paper (here and here).

Hopefully, by the time you read this post, at least some of the above links no longer work. In that respect, I would also like to request URIs based on the DOI instead.

The decision to mandate data deposition as a condition of publication is another decision best made by the scientific community concerned rather than a single journal or publisher as, for example, has been established in the microarray and evolutionary biology communities [19]. We will, therefore, support data publication when it is mandated, but will also enable, encourage and recognize [20] data sharing and publication on a voluntary basis for scientists wishing to show leadership in their field.

Now, as the journal already allows reuse of papers (CC-BY license), this also applies to data (and in at least several countries data cannot be copyrighted at all, but we need a world-wide solution; it's the 21st century). However, earlier this year the Panton Principles were introduced which formalize the idea behind public domain waiving, and suggest the CC0 waiver as one valid approach. This is where BioMed Central wants to go too; they write:

All research articles published in BioMed Central journals are published under the Creative Commons attribution licence [22] (CC-BY), with which authors retain the copyright to their work. This licence allows unrestricted distribution and re-use provided that the original article is cited. We support the Panton Principles for open data in science [23] and open data should therefore mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. We encourage the use of fully open file formats wherever possible.

...

Therefore, to eliminate potential legal impediments to integration and re-use of data, specifically, and to help enable long-term interoperability of data we believe an appropriate licence or waiver specific to data should be applied, and made explicit by the authors and publishers. There are a number of conformant licences [25] for open data, of which Creative Commons CC0 [26] is widely recognised. Under CC0, authors waive all of his or her rights to the work worldwide under copyright law and all related or neighboring legal rights he or she had in the work, to the extent allowable by law.

The text quoted above is extracted from the draft, and your comments are most welcome. You can leave them as a comment here, which I strongly encourage you to do, even if just to support the idea (see McPrinciple3).

The draft also touches on the issue of Open Standards, but I feel this problem will resolve itself. More interestingly, it is now time for journal editors to make a move, and let the community know if they will require these open data waivers for their journal. For example, cheminformatics as a field would benefit very much if the Journal of Cheminformatics would make this move. But at the same time, I fully understand that a young journal may not yet be in the position to do just that.


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
