Sunday, October 31, 2010

I promised the CiTO author, David, my use cases, but have been horribly busy in the past few weeks with my new position, wrapping up my past position, and thinking about my position after Cambridge. But finally, here it is. Based on source code I wrote and released earlier, the first use case I present is the Wordle one, which I showed with manual work in February.

Now that all the data is semantically marked up in CiteULike, I can easily extract all paper titles (or whatever is available in CiteULike) for all papers that cite the first CDK paper (doi:10.1021/ci025584y). Using the JSON interface, I have this Groovy script to extract all titles:

The output is two blocks which I can easily copy/paste into Wordle. Now, I think I heard one can actually download the Java code, so I am tempted to integrate it later, but for now copy/paste will do fine, now that the data handling is mostly automated: with a few extra lines I can make such visualizations for any paper I annotated in CiteULike with CiTO.

Given: A publication, describing specific method of property prediction (not a generic machine learning algorithm). An implementation of this publication.
Example: pKa. This is a decision tree with SMARTS in the nodes. There is a training set, which could be used in validation.
- Should it be exposed by OpenTox services as ot:Algorithm or ot:Model ?
- What is the right way to use / extend Blue Obelisk descriptors dictionary to describe this implementation?
- Would you classify this method as a descriptor calculation or as a predictive model?

I would say, an ot:Model is an ot:Algorithm, just a complex one.

The question shows one of the virtues of ontologies: they require us to carefully think about what we say. It is almost as if they put the scholar back into science.

On a different note, can we please start making an Open Data pKa database?!?

Thursday, October 28, 2010

Besides getting Oscar used by ChEBI (hopefully via Taverna), my main task in my three-month Oscar project is to refactor things to make it more modular, and remove some features no longer needed (e.g. an automatically created workspace environment). Clearly, I need to define a lot of new unit tests to ensure my assumptions on how the code works are valid.

So, what are the API requirements set out? These include (but are not limited to):

This week I worked on the dictionary refactoring, and talked with Lezan about the ChemicalTagger and trying to get this based on the newer Oscar code (I think we'll be able to finish that today). So, I cleaned up some code I did in the first week, and introduced an Oscar class providing a Java API to the Oscar functionality.

So, to get started with Oscar in your application, you only need to do:

Wednesday, October 27, 2010

A second API change lies deep in the IAtom interface. To reflect more accurately the meaning of the method, the IAtomType.getHydrogenCount() has been renamed to IAtomType.getImplicitHydrogenCount(), and likewise the setter methods.

CDK 1.2 code

carbon.setHydrogenCount(4);

CDK 1.4 code

carbon.setImplicitHydrogenCount(4);

Yeah, that's a simple one. Just to make clear, in both versions the count reflected the number of implicit hydrogens. The name getHydrogenCount(), however, suggested that the method returned the number of all hydrogens attached to that atom, that is, the sum of implicit and explicit hydrogens.
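To make the distinction between implicit and explicit hydrogens concrete, here is a minimal sketch in plain Java. The ToyAtom class and all of its methods are hypothetical and written only for illustration; they are not the CDK API. The sketch assumes that implicit hydrogens are stored as a simple count on the atom, while explicit hydrogens are separate atom objects bonded to it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class for illustration only; this is NOT the CDK IAtom API.
class ToyAtom {
    private final String symbol;
    private int implicitHydrogenCount;                       // hydrogens that exist only as a number
    private final List<ToyAtom> neighbors = new ArrayList<>(); // explicitly bonded atoms

    ToyAtom(String symbol) { this.symbol = symbol; }

    String getSymbol() { return symbol; }

    void setImplicitHydrogenCount(int count) { this.implicitHydrogenCount = count; }

    int getImplicitHydrogenCount() { return implicitHydrogenCount; }

    // Registers a bond in both directions.
    void bond(ToyAtom other) {
        neighbors.add(other);
        other.neighbors.add(this);
    }

    // Counts only hydrogens that are present as separate atom objects.
    int getExplicitHydrogenCount() {
        int count = 0;
        for (ToyAtom neighbor : neighbors) {
            if ("H".equals(neighbor.getSymbol())) count++;
        }
        return count;
    }

    // The sum that the old method name seemed to promise.
    int getTotalHydrogenCount() {
        return implicitHydrogenCount + getExplicitHydrogenCount();
    }
}
```

With three implicit hydrogens and one explicitly bonded hydrogen atom, the implicit count and the total differ, which is the ambiguity the rename removes.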

Later this year (planned) a new stable branch of the CDK library will be released. Time to look at some API changes, to ease migration. In this first post of the series, I will show how the IChemObjectBuilder functionality has changed.

Now, please note that the builder.newInstance() method may actually return null. This is not the case for the DefaultChemObjectBuilder, or the NoNotificationChemObjectBuilder, but future releases may have dedicated builders that do have such functionality. However, these builders are not supposed to be used for building molecules anyway.

The general pattern of newInstance() calls is that the first argument is the interface for which you want an instance. All further parameters are passed as parameters for the object's constructor. The builder maps the input to appropriate class constructors. To know what parameters you can pass when instantiating an IAtom with the DefaultChemObjectBuilder, you would look at the constructor of Atom. Therefore, we can also call:
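As a rough illustration of the mapping described above (interface first, constructor parameters after), the following is a toy builder in plain Java. It is only a sketch under stated assumptions: ToyBuilder and the IElement, IAtom and Atom types declared here are hypothetical stand-ins, not the CDK classes, and the reflection-based lookup is one possible way to map parameters to constructors, not necessarily how the CDK builders do it.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the real CDK types.
interface IElement { String getSymbol(); }
interface IAtom extends IElement { }

class Atom implements IAtom {
    private final String symbol;
    public Atom() { this.symbol = null; }
    public Atom(String symbol) { this.symbol = symbol; }
    public String getSymbol() { return symbol; }
}

// Toy builder: maps an interface to an implementation class and picks a
// public constructor whose parameter types match the given arguments.
class ToyBuilder {
    private final Map<Class<?>, Class<?>> implementations = new HashMap<>();

    ToyBuilder() {
        implementations.put(IAtom.class, Atom.class);
    }

    <T> T newInstance(Class<T> iface, Object... params) {
        Class<?> impl = implementations.get(iface);
        if (impl == null) return null; // this builder does not support the interface
        for (Constructor<?> constructor : impl.getConstructors()) {
            Class<?>[] types = constructor.getParameterTypes();
            if (types.length != params.length) continue;
            boolean matches = true;
            for (int i = 0; i < types.length; i++) {
                if (!types[i].isInstance(params[i])) { matches = false; break; }
            }
            if (!matches) continue;
            try {
                return iface.cast(constructor.newInstance(params));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalArgumentException("No matching constructor for " + iface.getName());
    }
}
```

In this sketch, newInstance(IAtom.class, "C") ends up in the Atom(String) constructor, newInstance(IAtom.class) in the no-argument one, and asking for an interface the builder does not know returns null, which is why callers should be prepared for a null result.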

The pattern here is that each test method returns one variable, so that any method depending on two other unit tests will have two parameters. The order is defined by the order in which they are given by the @Given clause.

About a year ago I wrote about free chemistry books. No, not illegally copied books, but really free books (though, not necessarily Open). Actually, the books I discussed last year are out of copyright, as they are old. But, just today I ran into an advertisement for free books by bookboon.com (twitter:bookboon), and these are new books. I have not browsed them yet, but you can download textbook PDFs from these sections: biology, chemistry, chemical engineering, math, nanotechnology, and several more. I do not know about the quality of these books yet, and the author names do not immediately ring a bell. Does anyone have experience with these books?

Saturday, October 23, 2010

Ten days ago I asked my readers if two molecules were the same or not. I guessed they were not, when I was asked Are these organic molecules the same? The people who replied to my post were quite convinced they were, and Peter gave the context of the pub quiz: assumptions may not be correct.

Indeed, I assumed there were hydrogens missing (implicit), and that line corners indicate places where carbons are. But the key to this problem was that I also assumed that the E/Z stereochemistry for the two double bonds was properly defined. Or, more accurately, I assumed that because I was comparing the two molecules, the E/Z stereochemistry for the double bond between the rings was identical in both drawings. We all did.

Under that assumption, these two molecules are indeed not the same. However, if the E/Z stereochemistry is actually not the same for that double bond, ... well, you get the point. Perhaps this was not the best of examples, as it is quite conventional to use 2D coordinates to determine E/Z stereochemistry... we even have a special drawing style to indicate the E/Z stereochemistry is unknown. Then again, how often does the organic chemist really use that.

A more convincing example was also drawn in the pub, and I should have given that one. Peter posted those later. These involve spiro compounds. Here too, I assumed that the stereochemistry around the spiro carbon was identical. My bad. There was one person in the pub who spotted the problem: David Jessop.

The underlying issue, of course, is those stupid 2D drawings. Jmol has been around for more than 10 years now (and non-free tools too), and we still use 2D drawings... why, oh why? 3D coordinates and explicit hydrogens, that is what our molecular data should be represented with. Henry does this right, over and over again, in his brilliant blog. Well, most of the time anyway. Look for the 'Click for 3D' statements behind the figures, and just give it a try, e.g. in this post on I(CN)7.

Thursday, October 21, 2010

One of the goals of my project in Cambridge is to make Oscar available as a Taverna plugin (source code, Hudson build). I have progressed somewhat, but am still struggling with getting the update site working. The plugin actually installs into Taverna 2.2.0, but the activities do not show up. While this is work in progress, and the other project goal is refactoring, a current demo workflow looks like:

Example input would be: This is a list of ethanol, methanol, and 2,4,6-trinitrotoluene.

The plain text input can be linked to the pdf2text SADI service, and the CML is suitable for the CDK-Taverna plugin, which is currently being updated by Andreas, Achim, and Christoph for Taverna 2.2. As soon as the update site is properly working, I will upload a demo workflow to MyExperiment.org.

I guess the next activity (node in the workflow) will be around the dictionaries, as the OPSIN activity converts only IUPAC names into connection tables. I was told OPSIN parses 97% of the IUPAC names it finds, and when it does, it does so almost 100% correctly. Want to challenge the code? Use this web service.

Saturday, October 16, 2010

Derek's blog pointed me to an editorial by Royce Murray, Science Blogs and Caveat Emptor (doi:10.1021/ac102628p). He is warning us, science scholars, about blogs. He is accusing bloggers of not being scholarly, not checking facts etc.

He did himself and the journal a big disservice with this editorial: in his blog he does precisely what he is accusing the blogger of: failing to check facts. Even worse, particularly for the 'Analytical Chemistry' journal, he proved inadequate at analyzing the problem, putting his scholarly skills at questionable levels: he failed to see what 'blogging' is and what it is not, and he failed to ascribe his concerns to the proper source; effectively, he failed to see the difference between correlation and cause-effect for 'blogging' (unworthy of any scholar, particularly if you start complaining). I invite Royce to blog his full analysis of the problem, with proper underlying data, facts, etc, so that I (and others) can explain to him the true factors involved in this problem he is noticing.

The editorial is a sad piece, and an editorial unworthy of the journal.

Actually, the fact that he mentions the Impact Factor is amusing. It must be noted that his editorial will have a huge impact, but not because the writing is any good, but because it is utterly wrong. And that reflects only one thing that is wrong with impact factors.

I strongly suggest Royce check his facts before he starts writing. The ethics expressed in the editorial seem only to apply to other scholars.

If you wonder about my strong language: that was triggered by these words from the editorial: In the above light, I believe that the current phenomenon of “bloggers” should be of serious concern to scientists. I consider myself a blogger, not unreasonably, given the fact that I blog, and feel personally attacked. Hence, the title of this post: Royce Murray and Caveat Emptor.

Friday, October 15, 2010

As Peter announced in his blog, and I tweeted earlier, I have started as postdoctoral research associate in Peter's group at the University of Cambridge, to work the next three months on Oscar, a chemical text mining tool. My tasks will focus on programmatical plumbing instead of method development, and I am aiming at integration with CDK-Taverna (see doi:10.1186/1471-2105-11-159, and which is currently being ported to Taverna 2.2 by Andreas). Sam and Lezan have been working on the refactoring as well, and will help me out with the gory details of the current code.

The source code of Oscar4 is available from this BitBucket project, and you can monitor the code state on this Hudson page. The project I will be working on is in collaboration with the ChEBI project, and today we met up with various people in the group, and set out some really interesting use cases.

Thursday, October 14, 2010

Cambridge pubs are not just good for the (Danish) beer, but also for the pub quizzes. Peter asked if the below molecules are the same. I did not think so, but... what do you think? He also asked if they are chiral. We have until tomorrow 20:00 BST.

Saturday, October 09, 2010

BioStar is a Q&A website for bioinformatics, just like the Blue Obelisk eXchange. Neil and Pierre have an ongoing struggle to gain the most karma, requiring Pierre to put in a formal complaint against people posting questions when he is asleep (the whole 3 hours). So, I coined the idea of mapping all BioStar users on a Google Map. Neil picked it up, and combined his coding skills with the various Open APIs, Open Standards, and Open Source solutions, to come up only hours later with this map. Here are the BioStar users from my region:

Now, who will be my new cheminformatics hero, and make a map for the Blue Obelisk eXchange? ;)

Friday, October 08, 2010

Very much overdue, but still in progress, is my book on CDK programming. I am in love with the writing environment, a mix of make, Groovy and LaTeX, where the code snippets are written in Groovy and embedded into LaTeX (see CDK - The Documentation). The Groovy script is actually run by the build system, allowing me to embed the output too.

In the LaTeX source code I, therefore, have something like:

The list of supported hybridization types can be listed with:
\codeverb{HybridizationTypes}
listing these types:
\codeout{HybridizationTypes}

We got a winner! Crabtree just published the paper An Open-Source Java Platform for Automated Reaction Mapping (doi:10.1021/ci100061d), and it is, according to Web of Science, the 100th paper to cite the CDK 2003 paper (doi:10.1021/ci025584y)!

The paper uses the rendering functionality of the CDK. The authors write:

The viewer application uses code from the Chemistry Development Kit(31) (CDK) to display graphical representations of the compounds involved in the reactions. The CDK source code was altered to enable the color-coded display of the bonds that were broken or formed during the reaction as shown in Figure 7. In addition, we created a “transition state” molecule that shows the transitory combination of reactant molecules that occurs at a potential energy maximum. The CDK source code was also modified to support the display of the transition state.

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
