Now, this example shows a simple yet powerful feature of how RDF is used nowadays: the ChEBI identifier was not part of the original Solubility spreadsheet at Google Docs. But, taking advantage of the unique and resolvable URIs for molecules, we can simply look them up.

Nice, isn't it?

Update: the embedded gist did not show up nicely, so replaced it with a pre block.

Sunday, February 22, 2009

RDF is swiftly becoming the lingua franca of life sciences (see for example [1,2]). Bioclipse is an excellent platform to visualize results from analysis of the network, both for graph visualization (see [3]), as well as for visualization of domain specific data types (e.g. sequences, molecules, ...).

Yesterday I uploaded a Bioclipse feature that adds an rdf manager to handle RDF content, which includes SPARQL support. The below snippet shows its application to the solubility data [3]:

Saturday, February 21, 2009

This week I have been porting the PubChem plugin for Bioclipse 1.2 to the new manager-based architecture. While I am still working on the Wizards, you can run the following JavaScript in Bioclipse2 from SVN and from the next beta (*):

*) There was some confusion on the two beta Bioclipse2 releases so far. Some people expected a release without any bugs left. That release is what we planned to call a Release Candidate. We agree that the first two betas at least turned out to be more alpha than we actually hoped, and we thank everyone who has given these releases a go. Those who tried several development releases of Bioclipse2 saw a lot of ongoing development, and we are fixing any bug reported on these releases. So, do not hesitate to report bugs!

R for cheminformatics

The fact that we can script everything makes Bioclipse an ideal platform for doing cheminformatics: we have access to a variety of cheminformatics libraries, and the means to visualize results via JChemPaint and Jmol. It is like R for cheminformatics: Bioclipse being the R command line, Bioclipse plugins the R packages. Eclipse provides a mechanism called Update Sites, which makes something like CRAN redundant. Back to the Chemistry Development Kit.

Over the next weeks, I will blog about scripts aimed at CDK developers and people who want to learn more about how the CDK internals work. This series assumes Bioclipse 2.0 beta2 (or better) and the CDK Feature installed. I'll be using the Gist widget to embed scripts in this blog, but you can always download the Gist directly into Bioclipse, with the GUI as described here.

Bioclipse uses JavaScript (and maybe other scripting languages in the future; file a wishlist report in the Bioclipse bug tracking system if you would like to see Jython, BeanShell or other support). Bioclipse managers are visible using special variables, such as:

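| Feature | Variable | Functionality |
| --- | --- | --- |
| Bioclipse Feature | ui | Bioclipse UI interaction |
| Cheminformatics Feature | cdk | CDK functionality |
| Cheminformatics Feature | jmol | Jmol functionality |
| CDK Feature | cdx | CDK Developer functionality |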

Bioclipse scripting has TAB completion support, so you can type cdk. (notice the dot at the end) to see which methods the cdk manager provides.

Debugging CDK's Atom Type

As I wrote last week with the email on the first CDK 1.2 release candidate, the new CDK atom typer is a core component of the new CDK. The new implementation covers all atom types used in CDK 1.0, and many more. In particular, Miguel boosted support for charged and radical atom types.

However, the atom types in your data set may not be covered, or perception fails otherwise. That happens. Bioclipse2 makes debugging of this important step in cheminformatics quite insightful. The following script reads a molecule from SMILES, visualizes the 2D diagram in JChemPaint, and perceives atom types:

The atom type perception results are returned to the JavaScript console, and if there are nulls given, then the CDK algorithm did not find a matching atom type for that atom. If you are sure your cheminformatics representation is in order, I welcome a bug report here.

CDK developers can take advantage of this functionality, to eliminate possible causes why a certain algorithm fails. CDK atom typing is used for a variety of algorithms, including counting implicit hydrogens, which many other algorithms need to know.
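To make the "look for nulls" step concrete, here is a minimal sketch in plain JavaScript. It does not call Bioclipse or the CDK; it assumes, purely for illustration, that the perception results have been collected into an array with one entry per atom (an atom type name, or null where nothing matched). The function name and the example values are hypothetical.

```javascript
// Sketch: list the atoms for which atom type perception returned nothing.
// Assumption: `perceived` is a plain array with one entry per atom, holding
// an atom type name or null; the actual Bioclipse return value may differ.
function findUntypedAtoms(perceived) {
  var missing = [];
  for (var i = 0; i < perceived.length; i++) {
    if (perceived[i] === null || perceived[i] === undefined) {
      missing.push(i); // remember the index of the atom without a type
    }
  }
  return missing;
}

// Hypothetical result for a three-atom molecule; the second atom has no type.
var example = ["C.sp3", null, "O.sp3"];
console.log(findUntypedAtoms(example));
```

An empty result means every atom got a type; any index that is listed points at an atom worth checking in the input, or worth a bug report.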

How does the CDK read a SMILES

A use case for people who want to know if a particular SMILES feature is read or to make sure it is read correctly:

This script uses the diff functionality introduced in CDK 1.2, and shows two aspects of the SMILES specification: 1. it picked up the isotope information given in the second SMILES; 2. the second SMILES does not include the implicit hydrogen count, which the SMILES specification then defaults to zero.

Summary

The CDK managers in Bioclipse (cdk and cdx) expose functionality of the CDK, and allow using it in Bioclipse's rich visual workbench environment.

Wednesday, February 11, 2009

On the DBPedia discussion mailing list there was a post on a nice web page which allows you to look up things, and which features an autocomplete edit field. The below screenshot shows lookup of molecular structures:

If you are not aware of this, adding content to DBPedia is as easy as adding something to WikiPedia. Literally: DBPedia is the RDF flavour of WikiPedia. It extracts the information from the info boxes, as I discussed before (see Molecules in Wikipedia).

BTW, one can take advantage of DBPedia to see what WikiPedia has to offer in terms of chemistry. For example, to list all molecules which have a SMILES, one can use this simple SPARQL query:

Or, to list those which have an InChI:

And this is actually quite useful, e.g. it can be used in quality control. Running the above queries will show up several broken SMILES and InChIs. I have not had time to fix those yet, so please go ahead and beat me to those fixes, and get some WikiPedia Fame :) Alternatively, invert the queries and add missing InChIs, PubChem CIDs or SMILES. When I have a bit more free time again, after the new stable CDK and Bioclipse releases, I'll run these analyses again, and summarize them in a web page.
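As a rough illustration of what such queries can look like, the sketch below builds them as strings in plain JavaScript. The property names (smiles, inchi) and the http://dbpedia.org/property/ namespace are assumptions about how the info box fields are exposed, and buildPropertyQuery is a hypothetical helper; check the live DBPedia data before relying on either.

```javascript
// Sketch: build a SPARQL query listing resources that have a given
// info box property. Assumption: info box fields are published under
// http://dbpedia.org/property/ with names such as "smiles" and "inchi".
function buildPropertyQuery(property, limit) {
  // Only accept simple names, so the query text stays well-formed.
  if (!/^[A-Za-z][A-Za-z0-9_]*$/.test(property)) {
    throw new Error("unexpected property name: " + property);
  }
  var lines = [
    "PREFIX dbpprop: <http://dbpedia.org/property/>",
    "SELECT ?compound ?value WHERE {",
    "  ?compound dbpprop:" + property + " ?value .",
    "}"
  ];
  // An optional positive limit keeps the result set small while exploring.
  if (typeof limit === "number" && limit > 0) {
    lines.push("LIMIT " + Math.floor(limit));
  }
  return lines.join("\n");
}

var smilesQuery = buildPropertyQuery("smiles", 100); // molecules with a SMILES
var inchiQuery = buildPropertyQuery("inchi", 100);   // molecules with an InChI
console.log(smilesQuery);
console.log(inchiQuery);
```

The resulting text can be pasted into a SPARQL endpoint form; the ?value column is where broken SMILES or InChIs would show up.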

Tuesday, February 10, 2009

I am happy to be able to announce the first Release Candidate for CDK 1.2.

Everyone using CDK 1.0 is advised to upgrade to this release, which has fewer bugs, is much better tested, and is faster too. It also comes with API changes, and a full changelog is not available (yet). However, the CDK developers are available on this mailing list and on IRC to help you port CDK 1.0 applications to CDK 1.2. Two differences in particular I would like to point out at this moment:

1. explicit atom typing

CDK 1.0 did atom typing at various places to perform its function, leading to inconsistencies and bugs. CDK 1.2 introduces a new atom typing module which isolates atom typing from other algorithms. Consequently, the CDK will be more critical of your code and your data: where the old code might have silently eaten incorrect input, the new implementation complains: expect exceptions! The actual atom type list used in CDK 1.2 is more complete than the ones used in CDK 1.0; however, it is not unlikely that you will find no atom type perceived for a clearly valid atom type. Please report such cases.

And I really want to stress this: in every instance where CDK 1.2 fails, CDK 1.0 would have failed too, though it might not have complained about it.

2. no rendering functionality

The new renderer under development (see the cdk-jchempaint mailing list) has not made the CDK 1.2.0 release. However, it is expected to be available in a later CDK 1.2.x release. If you really need the graphics functionality, please contact me. Bioclipse2 is an example project which combines CDK 1.2 with the new rendering code.

Contributions
-------------------

This release features contributions from a larger developer group than ever before. In particular, I would like to welcome those who have picked up JuniorJobs, and provided other smaller patches! A full list of authors is available from:

The number of failing unit tests is below 1%, and in the same range as the number of failing tests for CDK 1.0. Importantly, these are typically failures of unit tests which are not available in the CDK 1.0 unit test suite; that is, many of the failing unit tests in CDK 1.0 are *not* failing in CDK 1.2 (it really is rewarding to upgrade!)

However, if you find additional bugs (or just have wishlists), you can report these with our SourceForge bug tracker at:

Over the next weeks I hope to compose a somewhat useful list of changes. I have not made up my mind yet how that will take shape, maybe as a list of blogs, which I'll aggregate later. Dunno yet. Suggestions and contributions welcome :)

JavaDoc for the release is not yet available on SF for download (working on that), but available for the cdk1.2.x / branch at:

I agree that this still is a problem: where can (organic) chemists host their data? TROS hints at Wikipedia, but an encyclopedia is not always the most suited place for cutting edge chemistry (articles can easily be biased, contain (science) political views, etc.). I would suggest a blog would be a good start, and if proper markup is used, services like Chemical blogspace would automatically aggregate it.

However, something less volatile might be interesting. So, what we need is an overview of web databases where experimental chemistry data can be hosted. I'll start one, and annotate resources with license, on delicious.com, using the tags chemistry +web +database +open +submission, and regularly summarize things here.

In the below table, the last column indicates the most liberal license you can use to host your data:

I do not typically make complaints in my blog, so consider this a request for advice on good practices ;)

My problem is that I have to log in on SourceForge every day, even if I tick the 'Remember me' switch. I do understand that account logins do need some timeout... but less than a day? One cause of problems seems to be if I connect via a different network, but cookies should not be affected by that? Am I doing something wrong here, or does SourceForge?

There is an option to have ChemSpider link back to the blog, and I will have to figure out how to enable Chemical blogspace to extract the InChI from the underlying JavaScripts.

Update: I noticed that the ChemSpider server was a bit sluggish this morning, and that loading my blog page halts at loading the JavaScript... Tony, I suggest using some Ajax magic here, with a really fast JavaScript download (using an almost static bit of JavaScript), and then an Ajax call to access the slower bits, which might involve image generation and database lookup.

Update2: the feature was already under development before Cameron asked about it.

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
