Friday, December 21, 2007

Our Christmas tree has not been decorated yet, but the presents are there: the BMC Bioinformatics paper on userscripts in life sciences, Bioclipse 1.2.0, a long list of blogs to rate, and a very nice overview from Wendy Warr on workflow environments, discussing and comparing different offerings like Pipeline Pilot, Taverna, and KNIME.

Userscripts
The paper on userscripts describes how Greasemonkey scripts can be used to combine different information sources (DOI:10.1186/1471-2105-8-487). A trailer:

Background
The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.

Bioclipse 1.2.0
The other present is the Bioclipse 1.2.0 release, for which the QSAR feature is a great new addition (see my blog the other day with an overview of blog items detailing my participation in that feature). Ola et al. have done a great job with the plot functionality, which makes it easy to scatter-plot calculated descriptors. This release is likely going to be the last one in the Bioclipse 1 series, except for bug fix releases, so this release also means I can start contributing to the Bioclipse 2 series. Recent items in the Bioclipse blog show a bright future, with project-based resource handling and better scripting (R, Ruby, JavaScript, BeanShell?).

Thursday, December 20, 2007

Pending the release of Bioclipse 1.2.0, Ola asked me to do some additional feature implementation for the QSAR feature, such as having the filenames as labels in the descriptor matrix. See also these earlier items:

But I ran into some trouble when both JOElib and CDK descriptors were selected (well, Ola did, really). Now, there is not much I plan to do on the JOElib code, but I could at least investigate the CDK code.

The QSAR descriptor framework has been published in the paper Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics (DOI:10.2174/138161206777585274). However, while most molecular descriptors had JUnit tests for at least the calculate() method, full and proper module testing was never set up. This involves rough coverage testing and test methods for all methods in the classes.

So, I set up a new CDK module called qsarmolecular, and added the coverage test class QsarmolecularCoverageTest. This class is really short and basically only requires a module to be set up, as reflected by the line:

Now, the tests for a lot of the methods in the IMolecularDescriptor and IDescriptor interfaces are actually identical for all descriptors. Therefore, I wrote a MolecularDescriptorTest and made all JUnit test classes for the molecular descriptors extend this new class. This means that by writing only 10 new tests, with 29 assert statements, for the 45 molecular descriptor classes, 450 new unit tests are run without special effort, pushing the total number of unit tests run each night by Nightly for trunk/ past 4500.
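The mechanics behind those numbers can be sketched without JUnit: the shared checks live once in an abstract superclass, and each concrete test class only supplies the descriptor under test, so every new shared check multiplies across all descriptor tests. A minimal illustration (the names below are simplified stand-ins, not the real CDK API):

```java
// Simplified sketch of the shared-test-superclass pattern; the interface
// and class names are illustrative, not the actual CDK types.
interface IDescriptorSketch {
    String[] getParameterNames();   // shared contract: never null
    String getSpecificationTitle(); // shared contract: non-empty
}

abstract class AbstractDescriptorTestSketch {
    abstract IDescriptorSketch getDescriptor();

    // shared checks, written once, inherited by every descriptor test
    int runSharedChecks() {
        IDescriptorSketch d = getDescriptor();
        int passed = 0;
        if (d.getParameterNames() != null) passed++;
        if (d.getSpecificationTitle() != null
                && !d.getSpecificationTitle().isEmpty()) passed++;
        return passed; // 2 shared checks in this sketch
    }
}

// one concrete descriptor and its test: only getDescriptor() is needed
class WeightDescriptorSketch implements IDescriptorSketch {
    public String[] getParameterNames() { return new String[0]; }
    public String getSpecificationTitle() { return "molecular weight"; }
}

class WeightDescriptorTestSketch extends AbstractDescriptorTestSketch {
    IDescriptorSketch getDescriptor() { return new WeightDescriptorSketch(); }
}

public class DescriptorTestDemo {
    public static void main(String[] args) {
        int passed = new WeightDescriptorTestSketch().runSharedChecks();
        System.out.println(passed + " shared checks passed");
    }
}
```

With 10 shared tests in the superclass and 45 subclasses each providing one getDescriptor(), you indeed get 450 tests essentially for free.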

Now, this turned out to be necessary. I count 52 new failing tests, which should hit Nightly in the next 24 hours.

Wednesday, December 19, 2007

The Chemistry Development Kit has never really been without any bugs, which is reflected in the number of failing JUnit tests. For trunk/ this is today 106 failing tests (live stats). In the stable cdk-1.0.x/ branch, however, the number of failing tests is not much lower: 64 failing tests today (live stats).

Overall, only a low percentage of the tests fails (<2% for cdk-1.0.x/ and <3% for trunk/), and, more importantly, it is typically particular algorithms that are broken. For example, in the structgen module 8 tests fail, for both CDK versions. In the cdk-1.0.x/ branch it is the valency checker code that causes quite a few fails, which I discussed in Atom typing in the CDK and which is the reason for the atom type perception refactoring in progress in trunk/ (see Evidence of Aromaticity). Not all code in trunk/ has been updated yet, and this causes quite a few failing tests for trunk/ in the reaction, qsarAtomic and qsarBond modules.

Back to the cdk-1.0.x/ branch. Previous CDK releases tended to have around 40 failing tests, so I was worried about the number of tests failing now. Maybe backported patches cause additional fails? To study that I had my machine run the JUnit tests for all revisions of the cdk-1.0.x/ branch since the branch was made in commit 8343. The result looks like this:

Indeed, it is a number of backports that cause the clear increase in bugs between commit 9044 and 9058. Nothing particular I can see, and worse, the intermediate revisions do not compile and do not have test results:

I should have taken more care when merging in these patches, even though they are supposed to fix issues:

Merged r8697: Add a method to the query atom container creator which creates a QueryAtomContainer. This replaces each pseudoatom with an anyatom.

Merged r8699 and r8700: Added test file by Volker (see cdk-user) for the shortest path problem; JUnit test provided by Volker Haehnke (haehnke - bioinformatik uni-frankfurt de), somewhat rewritten.

Merged r8701: Renamed a variable to comply with http://en.wikipedia.org/wiki/Dijkstra's_algorithm

Merged r8751: Bug fixes for bugs #1783367 'SmilesParser incorrectly assigns double bonds' and #1783381 'SmilesParser uses Molecule instead of IMolecule'. Test case for bug #1783367.

Merged r8754 and r8773: Fix and test case for bugs #1783547 and #1783546 'Lost aromaticity in SmilesParser with Biphenyl and Benzene'

Merged r8774: Add an MDL RXN reader which uses the MDLV2000Reader instead of the MDLReader

Merged r8775, r8776, r8777: bug fixes for #150354 #1783774 #1778479 in the SmilesParser, SmilesGenerator and MDLWriter/PseudoAtom.

Merged r8791: Code for v,mass atom two digits mass atom and exception handling

Merged r8800: Fixed reading of MDL molfiles with exactly 12 columns (==valid) in the bond block

Merged r8802: Made a little more memory efficient by removing unnecessary cloning operations

Merged r8803: Fixed it so that we make a deep copy of the input molecule

Merged r8809: Added code to work on a local copy of the input molecule

Merged r8811: Updated Javadocs

Merged r8824, r8821, r8820, r8819, r8817, r8816: Added code to properly work on a local copy

I'm quite sure it must be the deep-cloning fix ported from the commits 8800-8824. I already fixed a number of bugs in the IP calculation code, which still accounts for a good deal of the failing tests in the cdk-1.0.x/ branch (and affects trunk/ too), as can be seen by the drop in bugs just after the big increase:

r9079 | egonw | 2007-10-15 13:24:10 +0200 (Mon, 15 Oct 2007) | 1 line

Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was searched in the cloned atomcontainer. More bugs like this are in the code. Miguel is contacted about this problem.

------------------------------------------------------------------------
r9082 | egonw | 2007-10-15 13:48:15 +0200 (Mon, 15 Oct 2007) | 1 line

Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was searched in the cloned atomcontainer.

The big drop in number of fails is caused by the removal of the SMARTS code from the branch, which has been present since the start of the branch (see this page).

From this analysis I conclude that CDK 1.0.2 can soon be released, with the note that the ionization potential calculation is not safe to use.

Facts are free. The Rightsholder takes the position that factual information is not covered by Copyright. This Document however covers the Work in jurisdictions that may protect the factual information in the Work by Copyright, and to cover any information protected by Copyright that is contained in the Work.

I am looking forward to seeing how this license will be picked up by the community. PubChem may be a good candidate to use this license, to formalize their dump into the public domain. Not just yet, though, because things might still change. It is said that a wiki will be set up to ask for feedback. Paul has written a nice writeup on the history of this license.

One day soon, tomorrow's Richard Stallman will wake up and realize that all the software distributed in the world is free and open source, but that he still has no control to improve or change the computer tools that he relies on every day. They are services backed by collective databases too large (and controlled by their service providers) to be easily modified. Even data portability initiatives such as those starting today merely scratch the surface, because taking your own data out of the pool may let you move it somewhere else, but much of its value depends on its original context, now lost.

Thursday, December 13, 2007

The comment I left in the ChemSpider blog was probably a bit blunt. ChemSpider announced having licensed software from OpenEye. I have seen such announcements more often, but am intrigued about the nature of such announcements. Is it bad that ChemSpider is using OpenEye software? Certainly not. But it is surprising that they "announced today they had entered into an agreement that will allow the incorporation of a number of OpenEye’s products into ChemZoo’s online chemistry database and property prediction service, ChemSpider" (emphasis mine).

Is it really special that you buy software and then use it? Maybe, it increasingly is, with a number of good software products freely available. Even many proprietary products are freely available, sometimes to a selected group only, though. Or, is there some license behind this that restricts you in what you may and may not do with it?

Anyway, I made the somewhat inconsiderate comment:"Amazing! (Forgive me that I [have] not read every bit…) But, amazing! A press release for the fact that one may use software ;)".

Anthony replied with these lines: "Yes, I think it is amazing that companies of this caliber are willing to provide their tools at no cost to systems like ChemSpider". He read my sarcasm correctly. I find it absurd that the future of chemoinformatics is left to the goodwill of benevolent companies. Chemoinformatics is way too important, and in way too crappy a state, to be kept as a proprietary toy of industry; that's something I argued before.

Let me try to explain where my sarcasm is coming from.

I do not blame individuals in commercial chemoinformatics
There is nothing wrong with getting paid for what you do. I get paid for the software I develop too, though most of my contributions to the CDK, Jmol and even some of my contributions to Bioclipse I have made as a hobby, in my spare time, unpaid. Nothing wrong with a good hobby, I would say.

But I do not blame people for not doing the same. Neither do I blame myself for making a reasonable living in the Netherlands, unlike all those poor bastards who struggle to make it to the next month, like many in the United States. But I do not like the situation. Neither do I blame people for being religious, though I really dislike several of the things the Church is trying to make people believe (such as that the HIV virus can get through condoms). I hate the situation.

I do not dislike the commercial model
People have to make a living. I do; anyone does. I do feel, however, there is a difference between making a living because you work, and getting money because you happen to be at the right side of the money flow. There is a difference between a baker getting up at 5am every morning to feed a village, and someone selling a thin slice of bread via eBay to a poor African soul who just received his/her OLPC laptop. Not that I think this really applies to the ChemSpider/OpenEye deal; just to make a statement about commercialism.

The Bill Gates foundation spending a lot of money on scientific research is what the Dutch would call een sigaar uit eigen doos. This translates to something like getting a present you paid for yourself. Literally, 'a cigar from one's own box'. But that's another story.

I hate the situation
I hate the situation that research for new drugs is so expensive, and medicine likewise. I hate it that the pharmaceutical industry cannot sell these drugs cheaply to developing countries, because the same drugs are sold expensively in western markets. But I do not blame the scientists working in the pharma industry.

I hate the situation that scientific results cannot be reproduced independently, because software is being used as a black box. But I do not blame the guy who wrote the code.

I hate the situation that I cannot contribute to the excellent products around, because their licenses disallow me from discussing my work with others. But I do not blame the guy who sold me the license.

I hate the situation that many very qualified scientists have to find post-doc after post-doc before they give up and go to industry. I hate the situation that the better a scientist you are, the less science you actually do, because all time is spent on getting further funds. But I do not blame those who paid for those temporary post-doc positions.

I hate the situation that people have to use commercial models for their scientific contributions, just to make a living, even though they would have loved to contribute that to mankind. But I do not blame them for wanting to be able to fulfill their primary living requirements (and those of their families).

I hate the situation that I review papers for free for commercial publishers, just to help science progress. I do blame myself for not having stopped doing that yet.

But I do not blame ChemSpider for buying or using commercial products. I do not blame the people working at OpenEye for making a living. But I do find it absurd that we have to be amazed that scientific software is put to work.

I apologize for being blunt, but I cannot apologize for disliking the current situation chemoinformatics is in.

Monday, December 10, 2007

It's a really nifty piece of work, which goes into the differences between thesauri, controlled vocabularies, and, as such, ontologies, and social tagging systems. Both have their virtues; it is fuzzy logic versus ODEs all over again. Whether one is better than the other only depends on the problem at hand. For example, can you imagine social tagging in atom typing prior to performing force field calculations? Or a 150-term ontology to annotate the scientific content of your literature archive?

More from where they come from...
The video appears to be made by the Digital Ethnography group, which has made several more movies. Certainly something I'm going to check out over the winter holidays (I guess I am quite a bit more religious about ODOSOS than about gods).

Nico wrote: As long as we appreciate that there may be more than one top node…. I am not entirely sure, but he may be referring to thesauri, which are a particular form of ontology where basically the only relations are is-a or is-parent-of, resulting in a hierarchy of controlled terminology with one top node (such as the Gene Ontology). Ontologies can and should be much richer if we really want to take advantage of our information technologies, just like we do with any graph mining. Why mould reality into a tight hierarchy?

Ebs and Michael had reviewed CML and questioned why the key concepts were atoms, molecules, electrons, and substances, whereas they suggested it would have been better to start from reactions. I think that’s a very clear difference in orientation between endurants and perdurants. Although chemists publish reactions, most of the emphasis is on (new) substances and their properties. CML is designed to map directly onto the way chemists seem to think - at least in their public communication - e.g. through documents. Of course we can also do reactions in CML, but even there the emphasis is often on the components.

The suggestion by Ebs and Michael is indeed quite surprising: ontologies try to capture knowledge and express it in a small set of terms, each with an accurate and non-overlapping (orthogonal, if you wish) meaning. Now, the terms carbon, nitrogen, oxygen, and the other 104 elements are quite accurate and rather different from each other, at least from a chemical point of view. Sure, bonding is more difficult, and let's not start about aromaticity. But to question atoms, bonds or electrons as key concepts??

Friday, December 07, 2007

I was pleased to hear that Christoph will move to the EBI early next year. Christoph has been working on Open Source and Open Data chemoinformatics since at least 1997. I first got in contact with Christoph when I wrote code for JChemPaint (which Christoph developed) to be able to read the Chemical Markup Language (CML). This also got me into contact with Dan Gezelter, who is the original author of Jmol, to which I also added CML support. And, of course, with Henry and Peter, who first developed CML. This was before XML was an official recommendation, and I have worked with CML files which you would no longer recognize. It was in Dan's office that the CDK was founded, where Christoph, Dan and I designed data classes to replace the JChemPaint and Jmol data classes. Both JChemPaint and Jmol were rewritten afterwards, but for Jmol it was later decided that more tuned classes were needed to achieve the required performance for the live rendering of tens of thousands of atoms.

Well, Christoph has done much other Open Source and Open Data work, including the NMRShiftDB, Bioclipse and Seneca, a tool for computer-aided structure elucidation (CASE). The scientific impact of Christoph's work is considerable. When I realize that much of his past work was laying out foundations, and that these foundations have been found to be solid, I am happy to hear that he can now start to apply his work to life science problems, where current methods are failing.

Monday, December 03, 2007

The article discusses many of the things that have been happening in the field of chemical data. It touches Jean-Claude's work on Open Notebook Science, and then moves to Peter's Open Data, mentions a number of other blogs and the Chemical blogspace. Via some video efforts, it ends up with Mitch' Chemmunity, which has the coolest Captcha I have seen so far:

Tuesday, November 27, 2007

I recently saw that blogger.com blogs gained a poll feature. From now on, I will try to be a bit more Open Science, in addition to Open Source: from now on, you can be in my Advisory Board. To do so, vote on my next chemblaics (aka Open Source Chemoinformatics) project. The poll can be found on the left side of this blog. Associated with each poll, which I may run more or less frequently depending on the time of year, will be one blog post where I introduce the options. Options not mentioned, or completely different things you would like to suggest I do, can be left as comments to these items.

Finishing the new JChemPaint code
The goal of this option is to use the code written by Niels in his ProgrammeerZomer project to implement a new JChemPaint based on Java2D and independent of the widget set used (Swing/AWT/SWT/...).

CML-roundtripping of the CDK data model
The goal of this project is to ensure that all information the CDK data model can hold can be roundtripped in CML.

Integrating InChI-NestedVM in Bioclipse
Rich is, besides an excellent blogger, also someone who is not afraid to try new things. Recently, he experimented with compiling the InChI library into a Java executable. Bioclipse is already able to generate InChIs, using the code written by Sam Adams for the CDK, but an InChI/NestedVM plugin for Bioclipse could make a nice show case.

Writing CDK News articles
On the other hand, you might find that I should focus on getting a new CDK News issue out, for which we are still lacking (finished) contributions.

It's up to you. Deadline in about two weeks; still got some other things to finish :)

I am not sure if opening the workflow in your Taverna installation will automatically set up the WSDL scavenger for the ChemSpider services, which are available in an HTTP version too, btw. If not, right click on the Available Processors folder, pick Add new WSDL scavenger... and point it to the URL http://www.chemspider.com/MassSpecAPI.asmx?WSDL. The result should look like this:

Oh, and please note this comment:

These services are offered free of charge to our users during this period of testing, validation and feedback. Some of these services will be made available commercially in the future and we are proactively informing you of our intention to do this. It is likely that these services will remain available to academia at no charge. Please contact us at feedbackATchemspiderDOTcom with feedback and questions.

Monday, November 19, 2007

During my PhD I wrote a simple but effective genetic algorithm package for R. Because a bug was recently found, and there is interest in extending the functionality, I have set up a SourceForge project called genalg.

The package provides GA support for binary and real-value chromosomes (support for integer chromosomes will be added soon), and allows the use of custom evaluation functions. Here is some example code:
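The example code did not survive in this archive; below is a minimal sketch of what usage looks like, assuming the genalg package is installed (the bit-string target matching is just an illustrative evaluation function, which the GA minimizes):

```r
library(genalg)

# illustrative evaluation function: distance to a hidden target
# bit string (0 when the target is found); the GA minimizes this
target <- c(1, 0, 1, 1, 0, 0, 1, 0)
evaluate <- function(chromosome) {
    sum(abs(chromosome - target))
}

# binary chromosomes of length 8, evolved for 50 generations
result <- rbga.bin(size = 8, popSize = 100, iters = 50,
                   mutationChance = 0.05, evalFunc = evaluate)
summary(result)
```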

Monday, November 12, 2007

It is the Linux kernel, yes: TCP window scaling was switched on by default in kernels since about a year ago (and in Vista too, I think), and one of our routers or firewalls doesn't like it. We're trying to get them upgraded, but it takes a while...

There are two quick fixes. First you can simply turn off window scaling altogether by doing

echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre-2.6.17 kernel values, which means a wscale value of about 2 is used, which is acceptable to most broken routers.
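For that second fix, the commands look something like this (a sketch: the exact pre-2.6.17 buffer values are from memory and may need tuning for your kernel):

```shell
# shrink the TCP buffer maxima so the kernel advertises a small
# window scale (wscale ~2) that broken routers can cope with
echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
```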

Friday, November 09, 2007

On my desktop, the Scintilla and Postgenomic.com websites do not work. It is not a browser problem, but has something to do with TCP/IP packets not reaching their destination: the browser. Euan told me they are aware of the problem, but apparently have not found a solution yet.

However, my Wii does not have the problem, which makes me wonder if it is a disagreement between the Nature server and my Linux kernel... Anyway, this is what the two website look like (first Scintilla, then Postgenomic.com):

The only real disadvantage is that it does not integrate well with the things I do daily. If I see some interesting post, and would like to tag it on my del.icio.us account, I have to google for it on my desktop :(

Thursday, November 08, 2007

Right at this moment I am listening to Andrew Hopkins from Dundee on chemical opportunities in systems biology, at the Cytoscape conference in Amsterdam. Anyone who wants to meet up over lunch or a coffee break?

Wednesday, November 07, 2007

I have started using branches for non-trivial patches, like removing the HückelAromaticityDetector, in favor of the new CDKHückelAromaticityDetector. I am doing this in my personal remove-non-cdkatomtype-code branch, where I can quietly work on the patch until I am happy about it. I make sure to keep it synchronized with trunk with regular svn merge commands.

Now, the goal is that my branch only fixes failing JUnit tests, and does not create new regressions. To compare the results between two versions of the CDK, I use these commands:
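The commands themselves were lost from this archive, but the idea can be reconstructed: assuming each test run leaves the names of failing tests in a plain text file, one name per line (the file names here are made up), comm does the comparison:

```shell
# example inputs: failing tests in trunk and in the branch (sorted)
printf 'TestA\nTestB\nTestC\n' > trunk.fails
printf 'TestB\nTestD\n'        > branch.fails

# tests failing in trunk but no longer failing in the branch
comm -23 trunk.fails branch.fails | wc -l

# tests newly failing in the branch (ideally zero)
comm -13 trunk.fails branch.fails | wc -l
```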

The first gives me the number of JUnit tests which are no longer failing, while the second gives me the number of tests which are new fails. Ideally, the second is zero. Unfortunately, that is not yet the case :)

Tuesday, November 06, 2007

I have been working on a new atom type perception engine for the CDK, after having decided that the existing atom type lists were not sufficient for the algorithms we have in the CDK. The new list is growing in size, and basically contains four properties (besides element and formal charge):

number of bonded neighbors

number of pi bonds (or double bond equivalents)

number of lone pairs

hybridization state

This seems to be a minimal and accurate set to cover a rather good deal of chemoinformatics. I have yet to make the mappings of the new atom type list with existing lists for force fields, and radicals are missing too. However, the following algorithms in the CDK seem to translate rather well:

hydrogen adding

aromaticity detection (Hückel rules)

I still have to rework the double bond perception.

Aromaticity
Now, aromaticity is a fuzzy concept, and there is no general agreement on what it is. Some say it is smelly compounds; others say ring systems which obey the Hückel rule. Based on the new atom type list, I have rewritten the Hückel aromaticity detector and it applies these rules:

only single rings and two fused non-spiro rings

4n+2 electrons

no ring atoms with double bonds that are not part of the ring

This approach differs in two ways from the old code: it no longer tries to test all ring systems, which required the CDK AllRingsFinder algorithm, which combinatorially generates all possible ring systems. The new code only considers ring systems of up to two fused rings. Aromaticity beyond that is even less well defined than aromaticity in general.

The other difference is that the ring system must not have ring atoms with a double bond that is not part of the ring. The classical example is benzoquinone (InChI=1/C6H4O2/c7-5-1-2-6(8)4-3-5/h1-4H), which is not aromatic, even though it conforms to the 4n+2 rule (image from PubChem):

Evidence of Aromaticity
The final rule, of course, is what nature tells us what is aromatic and what is not. There are many other details to aromaticity than I just covered. For example, take azulene (InChI=1/C10H8/c1-2-5-9-7-4-8-10(9)6-3-1/h1-8H). All atoms are aromatic, but not all bonds (also PubChem):

These things are complex, but the rise of Open Data helps us out, as well as increasing computing power. Peter has been running two projects which may help us out: CrystalEye (Nick: no blog?) and OpenNMR.

NMR shifts will give us experimental backup on our notion of aromaticity, and so do bond lengths. I asked Peter about this, and whether OpenNMR predicted shifts could indeed confirm aromaticity of compounds, and he replied and showed that the predicted spectra could be used to distinguish between C-C and C=C bonds.

I commented the following (which was in moderation at the time of writing), and that gets us to experimental evidence for aromaticity:

Thanx for the elaborate answer. What I had in mind was the question whether NMR shift predictions can be used to tell me if a certain ring system is aromatic or not, and in case of fused rings, which atoms and which bonds are aromatic and which not. I’m sure the prediction error for 1H NMR shifts is well below 2ppm, and more in the order of 0.2ppm.

But maybe I should be asking, can I use CrystalEye to decide if ring systems are “aromatic”, and in case of two rings fused together (non-spiro), which atoms and bonds are aromatic and which not. Aromaticity is a fuzzy concept, with various definitions. I would be interested in linking what the expert considers ‘aromatic’ (or SMILES, or the CDK, or …) with what the QM chemistry (via bond lengths or NMR shift predictions) and crystal structures (via bond lengths) have to teach us. The null hypothesis being that the bonds are not delocalized (bond length) and that no ring current is found (NMR shifts, 1H in particular).

Regarding those bond lengths, ‘aromatic’ bonds show a bond length in between that of single and double bonds (e.g. see this random pick). The CrystalEye data does not really reflect that, and only a trimodal histogram shows up. Indeed, the C#C peak is *very* low, around 1.2A :) Apparently, the triple C#C bond order is underrepresented in present-day crystallography.

Maybe aromatic C:C bonds are underrepresented too, or can the absence of a peak around 1.40A be explained otherwise? I would at least have expected a shoulder or deviation in peak shape of the peak at 1.37A.

Wednesday, October 31, 2007

While Subversion is a significant improvement over CVS, they both require a central server. That is, they do not allow me to commit changes when I am not connected to that server. This is annoying when on a long train ride, or somewhere else without internet connectivity. I can pile up all my changes, but that would yield one big ugly patch.

Therefore, I tried Mercurial, where each client is a server too. The version I used, however, did not have the move command, which put me back into the old CVS days, where I lost the history of a file whenever I reorganized my archive.

Git
Then Git, the version control system developed by Linus Torvalds when he found that existing tools did not do what he wanted. It seems a rather good product, though with a somewhat larger learning curve, because of the far more flexible architecture (see this tutorial). Well, it works for the Linux kernel, so it must be good :)

The first git-svn command initializes a local Git repository based on the SVN repository. The git-svn fetch command makes a local copy of the SVN repository content defined in the previous command. Local changes are, by default, not committed, unless one explicitly git adds them to a patch. Once a patch is ready you can do all sorts of interesting things with it, among which committing it to the local Git repository with git commit.

Now, these kind of commits are on the local repository, and I do not require internet access for that. When I am connected again, I can synchronize my local changes with the SVN repository with the git-svn dcommit command.

A final important command is git-svn rebase, which is used to update the local Git repository with changes others made to the SVN repository.
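Put together, the offline workflow sketched above looks roughly like this (the SVN URL and file name are placeholders):

```shell
# one-time setup: mirror the SVN repository into a local Git repository
git-svn init https://svn.example.org/cdk/trunk
git-svn fetch

# offline: stage and commit locally, no network needed
git add src/SomeFile.java
git commit -m "fix failing unit test"

# back online: pull in others' SVN commits, then push local commits
git-svn rebase
git-svn dcommit
```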

Monday, October 29, 2007

I just ran into BioSpider. Unlike ChemSpider, BioSpider crawls the internet (well, this list of sources really) to find information, and depending on what it finds it continues the search. Below is a screenshot of an intermediate point after starting with the InChI of methane:

After the search it generates a long HTML page with all the information it found on the molecule you queried for. This approach is much more scalable than storing all in one database.

This crawling of information is something I was working on myself a bit too, and I think this is a good approach. However, I think the use of a central website is not the right approach. Instead, the search should be distributed too: the crawling should be done on the client machine; it should be done in Taverna or Bioclipse instead.

Friday, October 26, 2007

In this series I will introduce the technologies behind my FOAF network. FOAF means Friend-of-a-Friend and

[t]he Friend of a Friend (FOAF) project is creating a Web of machine-readable pages describing people, the links between them and the things they create and do.

My FOAF file (draft) will give you details on who I am, who I collaborate with (and other types of friends), which conferences I am attending, what I published, etc. That is, I'll try to keep it updated. BTW, FOAF is an RDF language.

Friday, October 19, 2007

Bob has set up a new interface between the data model and the Jmol renderer, which allows him to define other types of export too. One of these is a POV-Ray export, which allows the creation of high-quality images for papers. Jmol has had POV-Ray export for a long time now, but it never included the secondary structures or other more recent visual features. PyMOL is well-known for its POV-Ray feature, and often used to create publication-quality protein prints. The script command to create a POV-Ray input file takes the output image size as parameters:

In principle, someone could download an assortment of spectra for a given molecule, calculate some other spectra, and then write a paper without ever recording a single NMR spectrum of their own. Would they then include the individual who deposited the spectra as a co-author or even acknowledge the source of the spectra that they used? Who knows.

It is a misconception that releasing your Open Data will lead to your scientific work not being acknowledged (citation statistics are the crude mechanism we use for that). First of all, using results without acknowledgment is called plagiarism, which is ethically wrong by any standard. But plagiarism is not a feature of Open Data; it is found in any form of science. Recall Herr Schön.

Some months back I advised another chemical database with similar concerns, and I pointed its owners, as I commented to Gary, to the CC-BY license, which has an explicit Attribution (BY) clause:

Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

Using this license, plagiarism would not just be (scientifically) unethical; it would be illegal too, because it would break the license agreement. This even allows one to take the case to court, if you like. (BTW, I was recently informed that the database has switched to the CC-BY license!)

We discussed a number of things regarding the work we do. One of these was tagging molecules. Ian used http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H4 instead of http://rdf.openmolecules.net/?InChI=1/CH4/h1H4. The first was not yet picked up by rdf.openmolecules.net, but I fixed that.

We also discussed linking molecular structures with scientific literature. The discussions in blogspace this week show that doing this with computer programs is not appreciated by publishers (see here, here, here, here, here, and here). The publishers seem to prefer to send off a PDF to India or China.

I proposed that the InChI should be part of the publication, for all molecules mentioned in the article. If a journal can require exact bibliography and experimental section formats, it can certainly require InChIs too. There are few programs left which cannot autogenerate an InChI, and chemists draw the structures anyway. However, the software used in the editorial process does not support linking InChIs with a PDF (if only that software had been open source ...).

So, the best current option seems to be social tagging mechanisms, and this is what we talked about. Just use Connotea (or any other service) and tag your molecule with a DOI:

and

This tagging is done manually. No machines involved in that. Nothing the publishers can do about this. No ChemRefer needed. But this will allow us to start building a database with links between papers and molecules, which we badly need. BTW, we will not have to start from scratch. The NMRShiftDB already contains many links, which is open data!

Now, you might notice the informal semantics of the doi: prefix. That is something I hereby propose, as it allows services to pick up the content more easily. You might also note the incorrect DOI in Connotea. The reason is that Connotea does not yet support a '/' in a tag. I reported that problem.
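The point of the doi: prefix is exactly that services can pick the tags up with a trivial pattern match. A sketch of what such a hypothetical service might do (the tag list is made up):

```python
import re

# Tags as they might appear on a Connotea-style bookmark (made-up example).
tags = ["doi:10.1186/1471-2105-8-487", "cheminformatics",
        "doi:10.1002/cmdc.200700041"]

# A DOI starts with "10.", a numeric registrant code, a slash, and a suffix.
DOI_TAG = re.compile(r"^doi:(10\.\d{4,9}/\S+)$")

dois = [m.group(1) for tag in tags if (m := DOI_TAG.match(tag))]
print(dois)  # only the doi: tags survive, with the prefix stripped
```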

A molecular structure without any properties is meaningless. Structure generators can easily build up a database of molecules of unlimited size. 30 million in CAS, 20 million in ChemSpider, or 15 million in PubChem is nothing yet. The value comes in when those structures are linked with experimental properties.

Now, chemical industry, academia and publishers have done their best in the past 50 years to maintain such databases, and decided that a commercial model was the best option to do so. That was true 50 years ago, but no longer is. ICT has progressed so much that a 20M-compound database can be stored on a local hard disc, or on a site repository anyway. Moreover, and more importantly, creating such a database is much cheaper now. These ICT developments threaten the stone-age chemical databases around now. Current approaches can easily build cheap and Open chemical databases; if only we all wanted to.

ChemSpider is attempting to set up the largest free chemical database by mixing Open data with proprietary data. As such, they are attempting what SuSE and other commercial GNU/Linux distributions are trying to do: create a valuable product by complementing Open data with proprietary data where that adds value. At least, I think that is what they are doing. SuSE, for example, includes proprietary video drivers. ChemSpider, for example, contains proprietary molecular properties computed by ACD/Labs software (BTW, some of which can be computed with Open tools too, as I will show shortly).

Now, this poses quite a challenge: different licenses, different copyright holders, requirements to provide access to the source (for the Open data), etc., all in one system. Quite a challenge indeed, because ChemSpider is now required to track copyright and license information for each bit of information. GNU/Linux distributions do this using a package (.deb, .rpm) approach. And the sheer size of the database poses strong requirements if people start downloading the whole lot.

ChemSpider has had their share of critique, but they are learning, and trying to set up a sustainable environment for what they want to do. That might involve a revenue stream from clients if no governmental organization, academic institute or society steps in to provide financial means. A valid question would be why they did not set up a non-profit organization. But neither did SuSE, RedHat or Mandriva, and that has not stopped them from contributing to Open Source.

I have no idea where ChemSpider will end up (consider that a request for a copy of the full set of Open Data), but I am happy to help them distribute Open data, and even to help them replace proprietary bits with open equivalents, which I am sure they are open to. With respect to the proprietary bits they are redistributing, I understand they can only relay the ODOSOS message to the commercial partners from which they get those bits, and I hope they are doing so. ChemSpider has a great opportunity to show that releasing and contributing chemical data as Open Data does not conflict with a healthy, self-sustainable business model.

Sunday, October 14, 2007

CompLife 2007 was held 1.5 weeks ago in Utrecht, The Netherlands. The number of participants was much lower than last year in Cambridge. Ola and I gave a tutorial on Bioclipse, and Thorsten one on KNIME. Since a visit to Konstanz to meet the KNIME developers, I had not been able to develop a KNIME plugin, but this was a nice opportunity to finally do so: I wrote a plugin that takes InChIKeys and then goes off to ChemSpider to download MDL molfiles:

Why ChemSpider? Arbitrary. I had done PubChem in the past already. Moreover, ChemSpider has the largest database of molecular structures and is in that sense important to my research.

Why KNIME? I played with Taverna in the past, and expect to do much more work on Taverna in the coming year (see also this and this). Moreover, KNIME already has a CDK plugin, and the KNIME developers contributed valuable feedback to the CDK project in the last year. It was about time that I contributed something back, though the current functionality is quite limited. KNIME has a better architectural design than Taverna1, but will face tough competition from Taverna2, due next year.

The presentations

Heringa gave a presentation on network analysis, discussing scale-free networks, hub nodes, etc., after which he gave an example on the 14-3-3 PPI family, which has both promoting and inhibiting capabilities. Fraser presented work on improving microarray data analysis by reducing non-random background noise. Schroeter presented the use of Gaussian process modeling in QSAR studies, which allows the estimation of error bars (see DOI:10.1002/cmdc.200700041). I did not find the results very convincing, but the method sounds interesting. Larhlimi presented research on network analysis of metabolic networks. His approach finds so-called minimal forward direction cuts, which identify critical parts in the network if one is interested in repressing certain metabolic processes. Hofto presented work on the use of DFT for proteins, and pointed out that one has to proceed carefully to be able to reproduce binding affinities. Combinations of DFT or MM with QM are becoming popular to model binding sites. Van Lenthe presented such an approach on the second day of CompLife.

By far the most interesting talk at the conference was the insightful presentation by Paulien Hogeweg. She apparently coined the term bioinformatics. Anyway, she gave an exciting presentation on feed-forward loops in relation to evolution, and showed a correlation between jumps in FFL motifs and biodiversity. She also warned us about the Monster of Loch Ness syndrome, where computational models may indicate large underlying processes which do not really exist. But that should be a problem most of my readers are aware of. She introduced evolutionary modeling to put further restrictions on the models, to reduce the chance of finding monsters.

Hussong had an interesting presentation too, if one is interested in the analysis of GC/MS or LC/MS data. He introduced a hard-modeling approach for proteomics data using wavelet technology. His angle was to use a wavelet that represents the isotopic pattern of a protein mass spectrum. Interestingly, the wavelet had negative intensities, something one will never find in mass spectra. However, I seem to recall a mathematical restriction on wavelets that would forbid taking the squared version of the function. He indicated that the code is available via OpenMS.

Jensen, finally, presented his work at the UCC on Markov models for protein folding, where he uses the mean first passage time as an observable to analyze processes in folding state space. This allows him to compare different modeling approaches and, for example, to predict how many time steps are needed to reach folding. Being able to measure characteristics of certain modeling methods, one can make an objective comparison, which allows a fair competition.
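For readers unfamiliar with the observable: for a discrete-time chain with transition matrix P, the mean first passage time m_i to a target state t satisfies m_i = 1 + sum over j != t of P[i][j] * m_j, a small linear system. A toy sketch (a made-up 3-state chain, not Jensen's folding model):

```python
# Mean first passage time (MFPT) to a target state in a toy Markov chain.

def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Made-up 3-state chain; state 2 plays the role of the "folded" state.
P = [[0.5, 0.4, 0.1],
     [0.3, 0.5, 0.2],
     [0.0, 0.0, 1.0]]
target = 2
states = [i for i in range(len(P)) if i != target]

# Rearranged as (I - P_restricted) m = 1 and solved directly.
A = [[(1.0 if i == j else 0.0) - P[i][j] for j in states] for i in states]
b = [1.0] * len(states)
mfpt = solve(A, b)
print(mfpt)  # expected number of steps to reach state 2 from states 0 and 1
```

Comparing such numbers across different models of the same process is exactly the kind of objective comparison mentioned above.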

I value ODOSOS very highly: they are a key component of science and scientific research, though not every scientist sees their importance yet. I strongly believe that scientific progress is held back when scientific results are not open; it puts us back into the days of alchemy, when experiments were black boxes and procedures were kept secret. It was not until the alchemists started to properly write down their procedures that chemistry, as a science, took off. Now, with chemoinformatics in mind, we have the opportunity to write down our procedures in high detail.

I keep wondering what the state of drug research would be if the previous generation of chemoinformaticians had valued ODOSOS as much as I do. Now, with a close relative diagnosed last week with a form of cancer with low five-year survival rates, I could not be more angry at those who want to make (unreasonable) money by selling scientific research. A 1M bonus is unreasonable: for the same money I could have 10 post-docs work on chemoinformatics research for that period; I could have them work on drug design for various kinds of cancer.

Therefore, I will continue to use every opportunity to convince people of ODOSOS, and will continue to develop new methods to improve the accurate exchange of scientific data and experimental results. I will help people where I can to distribute open data, even if the whole project is not 100% ODOSOS. For example, the Chemistry Development Kit is open source itself (LGPL), which does allow embedding into proprietary software. This does not mean that I will contribute to the proprietary software; I am actually proud of not having done so in the last 10 years.

I will continue to advise people on how to make their work more ODOSOS, even if they cannot make the full transition. I will also continue to make sure that all my scientific results are ODOSOS, as there is no other kind of science. To set a good example, and, hopefully, to lead the way.

Monday, October 08, 2007

The second part of the morning session featured a presentation by Sirisha Gollapudi, who spoke about mining biological graphs, such as protein-protein interaction networks and metabolic pathways: detecting patterns like nodes with only one edge, cycles, etc., using Taverna. An example data set she worked on is the Palsson human metabolism (doi:10.1073/pnas.0610772104); she mentioned that this metabolite data set contains cocaine :) Neil Chue Hong finished with an introduction to the OMII-UK, which is co-hosting this meeting.
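Two of those patterns are simple enough to sketch in a few lines; this toy example (a made-up graph, not the Palsson network, and not Gollapudi's actual method) finds degree-one nodes and checks for a cycle:

```python
# Degree-one nodes and cycle detection in a small undirected graph.
from collections import defaultdict

edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Nodes with only one edge (the "leaves" of the network).
leaves = sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)

# A connected undirected graph is a tree iff it has |V| - 1 edges,
# so (assuming connectivity, which holds here) |E| >= |V| means a cycle.
has_cycle = len(edges) >= len(adj)

print(leaves, has_cycle)
```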

After lunch Mark Wilkinson introduced BioMoby, which we actually already use in Wageningen. I have tried to use jMoby to set up services based on the CDK, but have failed so far; I will talk with Mark about that. Next was my presentation: I spoke about CDK-Taverna, Bioclipse and some peculiarities of chemoinformatics workflows, like the importance of intermediate interaction, the need to visualize the data, and complex, information-rich data. Bioclipse is seeing an integration of BioMoby and of Taverna.

After the coffee break Marco Roos spoke about myExperiment and his work on text mining. I unfortunately missed this presentation, as I was meeting with people from the EBI who work on the MACiE database (see this blog item).

A discussion session afterwards introduced a few more Taverna uses, and the technical problems people encountered. Taverna2 is actually going to be quite interesting, with a data caching system between processors and a powerful scheme for annotating processors, which will allow rating, finding local services, etc. More on that tomorrow. Dinner time now :)

I arrived at the EBI last night for the Taverna workshop, during which the design of Taverna2 is being presented and workflow examples are discussed. Several 'colleagues' from Wageningen and the SARA computing center in Amsterdam are present, along with many other interesting people. This afternoon is my presentation.

Paul Fisher just presented his PhD work on using workflows to improve the throughput of QTL matching against pathway information and phenotype. One interesting note was the use of workflows to make bioinformatics studies more reproducible: he retrieves the versions of the online databases explicitly in the workflow, so that they get stored in the workflow output.

Monday, October 01, 2007

Peter is writing up a 1FTE grant proposal for someone to work on the question of how automatic agents and, more interestingly, the blogosphere are changing, no, improving, the dissemination of scientific literature. He wants our input. To make his work easy, I'll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote related to the four themes Peter identifies.

Important bad science cannot hide

I do not feel much like pointing to bad scientific articles, but I do want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs, respectively. (Peter, if you need more in-depth statistics, just let me know...)

Sunday, September 30, 2007

Two working days left before I'm off to two conferences. First, next Thursday/Friday, the two-day CompLife2007 in Utrecht/NL, with sessions on genomics, systems biology, medical information and data analysis, and, on the second day, tutorials on KNIME and CDK/Bioclipse. I will try to orient myself as much as possible towards MS-based metabolomics, and metabolite identification in particular. Last year the conference was very interesting.

The Monday/Tuesday after that, I will present the CDK-Taverna integration I worked on in 2005 (see e.g. Taverna on Classpath and CDK-Taverna fully recognized) at the Taverna meeting; Thomas later continued that work, leading to the cdk-taverna.de plugin website. If time permits, I will prepare an example workflow from metabolomics. Unlike previous times I went to Cambridgeshire, I won't fly in via Stansted, but take the Eurostar instead. I am very much looking forward to that. Unfortunately, I will not have time to visit Cambridge itself this time :(

Illustrative is my confusion about sp2-hybridized atoms, which use lower-case element symbols in SMILES. Very often this is seen as indicating aromaticity. I have written up the arguments supporting both views on the CDK wiki. I held the position that lower-case elements indicate sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent thread, however, stirred up the discussion once more (and led to the aforementioned wiki page).

You can imagine my excitement when I looked up the meaning in the new draft. It states: The formal meaning of a lowercase "aromatic" element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms.
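Whatever the semantics turn out to be, spotting the lowercase marks in a SMILES string is straightforward. A toy scanner for the organic subset only (no bracket atoms; an illustration, not a full SMILES parser):

```python
import re

# Organic-subset atom symbols; the lowercase forms carry the
# sp2/"aromatic" mark discussed above. Two-letter symbols first,
# so "Cl" is not read as carbon plus an unknown "l".
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def sp2_flags(smiles):
    """Return (symbol, is_lowercase) for each organic-subset atom."""
    return [(m.group(0), m.group(0).islower()) for m in ATOM.finditer(smiles)]

print(sp2_flags("c1ccccc1"))     # benzene: every carbon marked lowercase
print(sp2_flags("C1=CC=CC=C1"))  # Kekulé benzene: no lowercase marks
```

The draft's extra requirement, that the marks must survive a valence check after parsing, is of course exactly the part such a scanner does not do.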

Search This Blog

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.

Cookies

In the EU there is an upcoming directive requiring websites to warn people about HTTP cookies. This website uses the Blogger.com platform, Google Adsense (not that it is actually paying anything significant), and a few scripts to count how often a blog post was tweeted, using Topsy and LinkedIn. These services undoubtedly make use of cookies, which you can disallow in your browser.