Saturday, January 30, 2010

Bioclipse 2.0 introduced new, powerful molecular table support, and we have been eager to test it on large SD files. A recent ChEBI SD file failed to open, and all eyes immediately turned to the CDK, the cheminformatics library used in Bioclipse.

After careful investigation, it turned out that the ChEBI file contained a few entries which were not MDL molfiles, but queries for the ISISBase system. Those cannot be read by the CDK MDLV2000Reader. However, the reader crashed on them, instead of failing more safely. That's not nice, and has been fixed. But the problem is rather recurrent, and it is the reason why I like CML so much: invalid input. CML, based on XML, has several general validation approaches that give in-depth error messages about what is wrong with the file.

So, I asked on the BOx what the Open Source cheminformatics community had to offer for this. Turns out that several tools find problems in the files, but none could report where the error occurred.

Validation
Now, some time ago, I played with two reading modes, RELAXED and STRICT, as faulty files are core cheminformatics material, and the software is blamed if the QSAR model resulting from them is not good (seriously). Anyway, a small API change in the CDK would bring a validating MDLV2000Reader quite a step closer, but I had not followed up on it until last Friday, when a patch I was reviewing caused 6 new unit test failures. The new failures were caused by an assumption which turned out to be false in the test files used in those 6 unit tests.

The MDL molfile format defines fixed-width fields, but does not specify which fields are optional. And indeed, many tools save MDL molfiles with one or more fields missing, leading to shorter than expected lines. And, as you might have expected, the failing unit tests had files with lines missing the field introduced by the patch, causing exceptions to be thrown around. I have yet to make up my mind whether the lack of those fields is a problem in the file, or allowed by the format. In either case, the information from that field is not available, and the reader could safely ignore the missing information. Per user demand.
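The difference between the two reading modes on such short lines can be sketched in a few lines of Python. This is not the CDK code: the names Mode, FIELDS, FormatError and parse_atom_line are made up for illustration, the field table is a simplified subset of the atom line layout, and the choice of which fields count as required is purely an assumption.

```python
from enum import Enum


class Mode(Enum):
    RELAXED = "relaxed"
    STRICT = "strict"


class FormatError(ValueError):
    """Raised when a line cannot be accepted in the current mode."""


# (name, start, end, required): a simplified subset of the V2000 atom line.
# Which fields are 'required' is an assumption made for this sketch.
FIELDS = [
    ("x", 0, 10, True),
    ("y", 10, 20, True),
    ("z", 20, 30, True),
    ("symbol", 31, 34, True),
    ("mass_diff", 34, 36, False),
    ("charge", 36, 39, False),
]


def parse_atom_line(line, mode=Mode.RELAXED):
    """Parse one fixed-width atom line into a dict of field values."""
    atom = {}
    for name, start, end, required in FIELDS:
        raw = line[start:end].strip()
        if not raw:
            # A short line simply yields an empty slice for trailing fields.
            if required or mode is Mode.STRICT:
                raise FormatError(
                    f"missing field '{name}' at columns {start + 1}-{end}"
                )
            atom[name] = None  # RELAXED: ignore the missing information
            continue
        if name == "symbol":
            atom[name] = raw
        elif name in ("x", "y", "z"):
            atom[name] = float(raw)
        else:
            atom[name] = int(raw)
    return atom


if __name__ == "__main__":
    # A line that stops after the element symbol, as some tools write it.
    short = "%10.4f%10.4f%10.4f %-3s" % (0.0, 1.25, 0.0, "C")
    print(parse_atom_line(short, Mode.RELAXED))
    try:
        parse_atom_line(short, Mode.STRICT)
    except FormatError as error:
        print("STRICT mode rejected the line:", error)
```

In RELAXED mode the missing trailing fields come back as None, while STRICT mode refuses the same line and says which columns were expected.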

Now, personally, I would rather send the file back to the user with a proper error report and show them what is wrong with the file. Or better, provide them with an MDL V2000 text editor (e.g. in Bioclipse) which would graphically highlight errors, as many of us are used to with Eclipse:

CDK Patch
So, I am hacking up a patch for CDK master to allow error reporting by IChemObjectReaders. The initial versions of the API update and its use in the MDLV2000Reader are available as Gist 290659. They are not final yet, as I realized when making the above screenshot that merely int col is not enough, and that I actually need the startCol and endCol positions instead. Also, there is only an error level at this moment, and no warning level as in the screenshot.

That said, I created a jar (ant dist-large) and saved it as mdlCheck.jar, and wrote a bit of Groovy:

which defines a class implementing the new IChemObjectReaderErrorHandler and then reads an MDL molfile. And the output looks like it fulfills my needs:

Note to myself: that atom block does not look like an MDL molfile atom block at all! Every second line outputs the Exception passed to the error handler. I have to say, those messages are rather cryptic, but they result from a NumberFormatException, if I am not mistaken.
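To make the startCol/endCol idea and the two severity levels a bit more concrete, here is a minimal Python sketch of a handler that collects reports with a row and a column range. It is not the CDK API and not the Groovy script mentioned above: Level, Report, CollectingHandler and check_coordinates are hypothetical names, and the check only looks at the three 10-character coordinate fields at the start of each line.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    WARNING = "warning"
    ERROR = "error"


@dataclass
class Report:
    level: Level
    message: str
    row: int        # 1-based line number
    start_col: int  # 1-based, inclusive
    end_col: int    # 1-based, inclusive


class CollectingHandler:
    """Hypothetical error handler that just stores what it is told."""

    def __init__(self):
        self.reports = []

    def handle(self, level, message, row, start_col, end_col):
        self.reports.append(Report(level, message, row, start_col, end_col))


def check_coordinates(lines, handler):
    """Report coordinate fields that are empty or do not parse as numbers."""
    for row, line in enumerate(lines, start=1):
        for start in (0, 10, 20):
            field = line[start:start + 10]
            if not field.strip():
                handler.handle(Level.WARNING, "empty coordinate field",
                               row, start + 1, start + 10)
                continue
            try:
                float(field)
            except ValueError:
                handler.handle(Level.ERROR, f"not a number: {field!r}",
                               row, start + 1, start + 10)


if __name__ == "__main__":
    lines = [
        "%10s%10s%10s %s" % ("0.0000", "1.2500", "0.0000", "C"),
        "%10s%10s%10s %s" % ("0.0000", "abcd", "0.0000", "C"),
    ]
    handler = CollectingHandler()
    check_coordinates(lines, handler)
    for report in handler.reports:
        print(f"{report.level.value} at line {report.row}, "
              f"columns {report.start_col}-{report.end_col}: {report.message}")
```

With a column range instead of a single column, an editor has what it needs to underline the whole offending field rather than one character.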

Thursday, January 28, 2010

Ola is releasing Bioclipse 2.2.0 today, and asked me to showcase the semantic web functionality in Bioclipse. I realized that I do not have a nice page showing the semantic web overview. But I did blog a lot about RDF functionality, so here's a list of pointers:

We now invite papers for our symposium on the use of the Resource Description Framework (RDF) technologies in semantic knowledge representation and data exchange in chemistry at the 240th National Meeting & Exposition of the American Chemical Society (ACS) in Boston this fall.

Semantic Chemistry has been around for a while, but is seeing a revival with the adoption of the Resource Description Framework (RDF) and matching technologies in chemistry. RDF triples provide a simple structure that allows data and knowledge alike to be presented in a single framework. Derived technologies include the capturing of ontologies with the Web Ontology Language (OWL) and performing queries with SPARQL. A wide variety of free and open source products make it easy to set up servers with large amounts of RDF data, while integration with HTML is available too with RDFa.

The RDF symposium at the 240th ACS national meeting in Boston invites submissions of talks about the use of RDF in chemistry and cheminformatics. Topics could include the use of OWL ontologies, OWL axioms, reasoning and inference, RDF in user interfaces, such as RDFa in web front ends, visualization, querying systems, and applications thereof, such as linking data sets, compound classification, cloud computing, web services, data aggregation, semantic publishing, and literature mining.

Abstracts may be submitted via http://abstracts.acs.org. You’ll find the RDF session as part of the CINF division symposiums. Submissions open January 25, 2010, and the deadline is March 28, 2010. In case of questions, please email Egon Willighagen at egon.willighagen@farmbio.uu.se or Martin Braendle at braendle@chem.ethz.ch.

I believe there is quite some room for improvement, but it's a start :) Thanks to Joe for posting the public domain test file, so that other projects can start playing with the exciting new technology. I should note, however, that I am not running a Microsoft OS nor MS-Word, and the saved document's source is the only way I have access to the CML right now.

Sunday, January 17, 2010

He asked me to test it, and I installed a fresh Taverna install and the new plugin. After that, I used the MyExperiment plugin to download one of the CDK-Taverna workflows Thomas has on MyExperiment, and tuned it a bit to use some local input instead of the database. I took some screenshots while at it, and will use those now to talk you through the installation of Taverna and the CDK-Taverna plugin.

Download Taverna
Taverna 1.7.2 can be downloaded from this download page, but I took the Linux version from the SourceForge download site. I cannot detail the OS X or Windows installation, but on Linux you simply unzip the downloaded file, and you're ready to go:

$ cd taverna-1.7.2/
$ sh runme.sh

Plugin Installation
Plugins can be installed with the Plugin Manager, which can be accessed via the Tools menu:

Clicking the Find New Plugins button takes you to a second dialog listing known plugin sites, and the default download has several already:

The CDK-Taverna update site is available at http://cdk-taverna.de/plugin/, and we can make Taverna aware of this update site by clicking the Add Plugin Site button:

After filling out these values and approving them with the OK button, the site will show up in the dialog listing all available plugins, where you need to check the check box in front of the CDK-Taverna plugin name, as done in this screenshot:

You can then hit the Install button after which the plugin will be downloaded:

After it is done downloading the plugin, you can close the Plugin Sites and Plugin Manager dialogs. I shut down and restarted Taverna with sh runme.sh, but I am not entirely sure this is needed. After that, the CDK nodes showed up in the list of Taverna processors:

MyExperiment Plugin
Using the same Taverna Plugin Manager you can also install the MyExperiment plugin that allows you to search, browse, preview and download Taverna workflows from the MyExperiment website from within Taverna itself. I installed the plugin, and then used it to search for CDK workflows (and downloaded a QSAR workflow):

This is about everything you need to get going. It's not particularly rocket science, but I guess this howto is useful as you get to see what you should expect when setting up a CDK-Taverna environment. If you have further questions, please leave those in the comments section, and I'll try to merge in answers where possible, or otherwise answer in the reactions too.

Friday, January 15, 2010

This blog post is old and new news. The old news is that Warren passed away at the end of last year, after having successfully shown how Open Source cheminformatics (and/or bioinformatics) software can be developed in a commercial setting (DeLano Scientific); PyMol was a huge success. Warren had a SourceForge account (wdelano) for almost 10 years:

I had not blogged about it before, as the news hit me hard. Surely, Warren knew a lot of people and I was only one of many, but Warren's memory stuck well. I knew Warren from the Jmol project, where we talked in the past about coming to an Open Specification for exchanging scenes between Jmol and PyMol. Around the end of my PhD contract we even briefly, but seriously, explored me doing a post-doc in his group.

Schrödinger
Yesterday, I was pinged about Schrödinger acquiring PyMol. The press release is, as usual, short on details, but those have become clearer during the day. Schrödinger is not new to Open Source cheminformatics, and has a product based on KNIME, which is now GPL, but also has a proprietary license for those who wish to license it that way.

But, unless I missed another Open Source(-oriented) product, the acquisition of PyMol significantly changes the game for them: PyMol is a major Open Source product, bigger than KNIME at the moment, I'd guess. My immediate question about the acquisition was whether they acquired the copyrights, and they did, according to this commit:

This is important as it puts Schrödinger in charge of license changes. Fortunately, they seem rather serious about the Open Source thing, and hired an active PyMol developer (Jason), and kept the existing Open Source license:

Therefore, congratulations to Schrödinger for getting seriously into the Open Source community, making them the next Dr Who of PyMol, and congratulations to the family of Warren on ensuring continued development of the PyMol project! It's heart-warming to see that despite the bad times they are going through, and all the options they had with the PyMol code base, they find time for and strength in supporting Warren's ideas about the future of cheminformatics. My thoughts are with them!

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
