Wednesday, July 28, 2010

Like all releases in the 1.2 series after CDK 1.2.0, release 1.2.6 is a bug fix release. Anyone running a CDK 1.2 version is advised to upgrade. New in this release is the availability of a torrent for the cdk-1.2.6.jar (see BitTorrents for Science). Please find below the changes and the authors that contributed to this release.

Apparently super.clone() copies only the reference to the IAtomContainer[] and not the array itself, so a clone() followed by changing containers in the clone overwrites the original IAtomContainer[]. Fixed by creating a new array. 4e5d6a1

Moved test from the specific class to the abstract tests, as the behavior should be the same for NNMoleculeSet and DebugMoleculeSet too 068fb3b

Two more tests for the issue: atom typing works fine; aromaticity detection fails: one ring is detected as aromatic (the one with two nitrogens), so that it does not consider the double ring, marking the other ring as non-aromatic 3be2367
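The clone() fix in the first change above is easy to reproduce outside the CDK. The following is only a minimal sketch of the general Java pitfall, not the actual CDK code: the ContainerSet class and its String[] field are hypothetical stand-ins for the real set class and its IAtomContainer[].

```java
import java.util.Arrays;

public class CloneSketch {

    // Hypothetical stand-in for a set class that keeps its members in an array.
    static class ContainerSet implements Cloneable {
        String[] containers = new String[2];

        // Shallow variant: super.clone() copies the array reference only,
        // so the clone and the original share one array.
        ContainerSet shallowClone() {
            try {
                return (ContainerSet) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new IllegalStateException(e);
            }
        }

        // Fixed variant: the clone gets a new array of its own.
        ContainerSet fixedClone() {
            ContainerSet copy = shallowClone();
            copy.containers = Arrays.copyOf(containers, containers.length);
            return copy;
        }
    }

    public static void main(String[] args) {
        ContainerSet original = new ContainerSet();
        original.containers[0] = "ethanol";

        ContainerSet shallow = original.shallowClone();
        shallow.containers[0] = "benzene";           // also changes the original
        System.out.println(original.containers[0]);  // prints "benzene"

        original.containers[0] = "ethanol";
        ContainerSet fixed = original.fixedClone();
        fixed.containers[0] = "benzene";             // the original is left alone
        System.out.println(original.containers[0]);  // prints "ethanol"
    }
}
```

Note that Arrays.copyOf only gives the clone its own array; the elements themselves are still shared, which is fine for this illustration but may or may not be what a real set class wants.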

Monday, July 19, 2010

Richard (Talis) wrote up a three-step tutorial on how to publish your data. I think I would be more than happy if scientists reached step 1. Related, Ola asked me a while ago if I was interested in using the computing facilities of UPPMAX, and I was. But until this weekend I did not have the time or energy to give it a spin. If you are puzzled how the heck I see those two items related, read on :)

Two days later, today, I ran my first analysis. Still a test run, but using the CDK to perceive atom types on the first 2.5 GB of PubChem data. The full data set is now 80 GB, and I will start doing this analysis today. You might remember that I already did this two years ago (see Wicked chemistry and unit testing) for a small subset, but only now do I have the power to analyze all compounds. The UPPMAX system I work on has 348 nodes, each with 8 cores. Each core has 3 GB of memory, but I am using the IteratingPCCompoundXMLReader class anyway. Analyzing the 2.5 GB of data was done using 50 nodes, and finished in about a minute. Nice :)

Now, this first run dumped the results as a plain text file, looking like:

Or? And this is where the two items outlined in the first paragraph meet. No, this is not useful. Since the output is from an analysis of PubChem, I'm sure you already figured out that the first two columns indicate the compound being analyzed. You might also work out that then the elements are given for which the atom type perception failed. You may even figure out that the number is likely to be the index in the connection table representation of the molecule. Right?

But what about machine readability? I could, of course, write the output as CSV, but then I would lose my ability to write the report in human readable format. Moreover, the list of failing atom types does not have a fixed length, as you can see in the example lines given earlier.

Now, this is where RDF comes in. If I create my output as HTML+RDFa, I can do fancy stuff. My results page could link directly to PubChem, so that I can inspect the actual compound. Though I could do that even with merely HTML. But with RDFa, I can actually make my free text log output machine readable. I can accurately annotate what bits are informative:

The file is not backed up by an OWL ontology, but where possible one would do that. Reuse of ontologies is a good thing (e.g. use a service like Schemapedia).
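To give a feeling for what such annotated log output could look like, here is a rough sketch that writes one report line as HTML+RDFa. Everything specific in it is an assumption for illustration only: the ex: vocabulary, the property names ex:cid and ex:failedAtom, the use of a single PubChem compound identifier per line, and the made-up values in main. It is not the format of my actual log file.

```java
import java.util.Arrays;
import java.util.List;

public class RdfaLogSketch {

    // Hypothetical vocabulary; a real report would preferably reuse an existing ontology.
    static final String PREFIX = "ex: http://example.org/atomtyping#";

    // Writes one human readable line whose informative bits carry RDFa attributes.
    // Each failure is a short text such as "N 5": element symbol plus atom index.
    static String reportLine(int cid, List<String> failures) {
        StringBuilder html = new StringBuilder();
        html.append("<p prefix=\"").append(PREFIX).append("\"")
            .append(" about=\"https://pubchem.ncbi.nlm.nih.gov/compound/").append(cid).append("\">");
        html.append("Compound <span property=\"ex:cid\">").append(cid).append("</span>");
        html.append(" has atoms without an atom type:");
        // A repeated property copes with a list that has no fixed length:
        // every span simply adds one more statement about the same compound.
        for (String failure : failures) {
            html.append(" <span property=\"ex:failedAtom\">").append(failure).append("</span>");
        }
        html.append("</p>");
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up example values, not real analysis results.
        System.out.println(reportLine(1234, Arrays.asList("N 5", "S 12")));
    }
}
```

The line still reads as plain text in a browser, while an RDFa-aware tool sees one statement per annotated span.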

Now, I can easily open up this file in a web browser (follow this link) and get the same view as above. But I can also import the file directly into Bioclipse (see Semantic Web features in Bioclipse 2.2), or in any other tool that supports RDFa. I can then use SPARQL to do some first analysis, for example, with:

Combine that with the RDFaDev tool I wrote about last week (see RDFaDev: HTML+RDFa development with FireFox). Now you should get some feeling of the advantages of using Open Standards: I can do some initial analysis of the results, just right there in the web browser you have open anyway:

Therefore, next time you ask your data analyst to perform some calculation, insist that he sends you HTML+RDFa log files with results. Better, ask him to put it online, and you immediately reach Step 3 in the analysis by David.

Sunday, July 18, 2010

I am writing some more educational material on cheminformatics, and wanted to link to some of the handbooks already around. I need the book details in BibTeX format, so CiteULike is my primary tool to create such content. One of those books is An Introduction to Chemoinformatics by Leach and Gillet. I looked up the ISBN on Amazon, and then I noted something weird:

So, the electronic copy is actually more expensive than the paperback?! Is this an artifact or a pattern? No way you can get the investment for the Kindle itself back then... :(

That's it. Not quite the 5 minutes that Samuel promised me, but I'm happy to have this available for my conference tour next month! I installed the default RDF, and got the nice default wiki page to which I can now start adding manual annotation:

Friday, July 16, 2010

Celso informed me in this old post about an alternative to Operator for RDFa handling in browsers, or Firefox in this case: the RDFaDev add-on. It works quite well, extracts the RDFa, reports common problems, and even allows running SPARQL directly on the web page, all from within a browser pop-up window:

The current default fingerprinter in the CDK depends on aromaticity, but that concept is algorithmically difficult to define, and even experimentally there are multiple dimensions to this concept. Moreover, calculating aromaticity is not cheap, as it requires detection of ring systems. The reason aromaticity is actually included is this: people expect an ethenol moiety to match phenol.

Now, an alternative is to not use aromaticity, but hybridization information instead: an aromatic bond is basically just a bond between two sp2-hybridized atoms. This removes some algorithmic complexity and speeds up the calculation:

The definition of the fingerprint has changed, and a bond between two sp2-hybridized atoms may not be aromatic. We can therefore expect that the fingerprint will give more false positives with substructure search. I'm hoping that Rajarshi can find some time to include this new fingerprint in the excellent analysis he did some time ago.
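The trade-off is easy to see in a toy example. The sketch below is not the CDK fingerprinter; the Bond and Hybridization types are hypothetical minimal stand-ins, and only bonds between heavy atoms are counted. It flags a bond when both of its atoms are sp2-hybridized, which needs no ring detection at all, and then shows that the six ring bonds of benzene and the three C-C bonds of the non-aromatic butadiene get the same flag.

```java
import java.util.ArrayList;
import java.util.List;

public class HybridizationSketch {

    enum Hybridization { SP, SP2, SP3 }

    // Hypothetical minimal bond: just the hybridization of its two atoms.
    static final class Bond {
        final Hybridization first;
        final Hybridization second;

        Bond(Hybridization first, Hybridization second) {
            this.first = first;
            this.second = second;
        }
    }

    // The cheap replacement for an aromaticity flag: no ring search needed.
    static boolean isSp2Sp2(Bond bond) {
        return bond.first == Hybridization.SP2 && bond.second == Hybridization.SP2;
    }

    static int countSp2Sp2(List<Bond> bonds) {
        int count = 0;
        for (Bond bond : bonds) {
            if (isSp2Sp2(bond)) count++;
        }
        return count;
    }

    // Builds n bonds that each connect two atoms of the given hybridization.
    static List<Bond> uniformBonds(int n, Hybridization h) {
        List<Bond> bonds = new ArrayList<Bond>();
        for (int i = 0; i < n; i++) {
            bonds.add(new Bond(h, h));
        }
        return bonds;
    }

    public static void main(String[] args) {
        List<Bond> benzene = uniformBonds(6, Hybridization.SP2);   // six ring C-C bonds
        List<Bond> butadiene = uniformBonds(3, Hybridization.SP2); // C=C-C=C, no ring
        List<Bond> ethane = uniformBonds(1, Hybridization.SP3);    // single C-C bond

        System.out.println("benzene: " + countSp2Sp2(benzene));     // prints "benzene: 6"
        System.out.println("butadiene: " + countSp2Sp2(butadiene)); // prints "butadiene: 3"
        System.out.println("ethane: " + countSp2Sp2(ethane));       // prints "ethane: 0"
    }
}
```

Butadiene is exactly the kind of structure where the new definition differs from aromaticity: its bonds are flagged although none of them is aromatic.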

Thursday, July 15, 2010

The Cb software is still holding... I jettisoned the old post cache, which sped up the processing of blogs considerably, but the system just doesn't scale right. Yet, Euan has done a great job, and the Cb site has now been online for some three years! Here are some new blogs included in the aggregation and analysis:

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
