Obviously, the problem originates from the lack of mathematical knowledge of chemists: current chemoinformatics depends heavily on graph theory, where each atom is a vertex and each bond an edge. This has the advantage that we can borrow all algorithms that work with graph representations, such as Dijkstra's algorithm to find the shortest path between two vertices. Or, in chemical language, an algorithm to calculate how many bonds two atoms are apart in a molecule.
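To make the "how many bonds apart" idea concrete, here is a minimal Python sketch. It is not taken from any toolkit: the function name and the atom numbering are made up for illustration. Because every bond counts as one step, a plain breadth-first search gives the same answer as Dijkstra's algorithm with unit edge weights, so that is what the sketch uses.

```python
from collections import deque


def bond_distance(bonds, start, goal):
    """Return the number of bonds on the shortest path between two atoms.

    `bonds` is a list of (atom, atom) index pairs; returns None when the
    two atoms are not connected.
    """
    # Build an adjacency list: each atom index maps to its bonded neighbours.
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search; with unit edge weights this matches Dijkstra.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for nxt in neighbours.get(atom, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Heavy-atom chain C-C-C-C-O (butan-1-ol), atoms numbered 0..4 for illustration.
butanol = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Six-membered carbon ring (benzene skeleton), atoms numbered 0..5.
benzene_ring = [(i, (i + 1) % 6) for i in range(6)]

print(bond_distance(butanol, 0, 4))
print(bond_distance(benzene_ring, 0, 3))
```

Note that the whole approach assumes every bond joins exactly two atoms, which is precisely the assumption the representations discussed next give up.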

When discussing FlexMol, Rich mentions the work by Dietz (DOI:10.1021/ci00027a001), and I would like to add the PhD thesis of S. Bauerschmidt (see DOI: 10.1021/ci9704423), done in Gasteiger's group. Dropping the 'two-atom bond' representation in favor of something that better describes compounds like ferrocene, as the Dietz and Bauerschmidt approaches do, has the unfortunate disadvantage of losing compatibility with graph theory algorithms. Nevertheless, in order to take chemoinformatics to the next level, we have to address these issues. But hope is not lost, and people are working on rewriting our toolkit of chemoinformatics algorithms to match such new representations.

CDK

I will postpone analyzing the CDK for compatibility with these more modern representations (look out for a CDK News article), and for now just describe how the CDK can be used for FlexMol/Dietz/Bauerschmidt representations. Consider the four examples Rich gives in his blog. Here are the CDK ways of doing the same.

Summarizing, the key thing is to use the IBond.setElectronCount() method. The call is somewhat redundant, as the CDK defaults to two electrons if no count is given explicitly. This compound is, of course, benzene, which we can represent like this too:

Now, you will note that this approach does not exactly follow Rich's FlexMol examples: it skips the atom pair concepts used in the FlexMol version of ferrocene. His example more closely follows what we are likely to draw, while the CDK code above more closely follows the molecular orbital concept. (I have to check how Dietz and Bauerschmidt did this.)
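For readers who want to play with the idea outside any toolkit, here is a minimal Python sketch of a bond that may span more than two atoms and carries its own electron count. This is not CDK code and not the FlexMol format: the `Bond` class and the helper function are hypothetical names of my own, and only the ring C-C bonds of benzene are modelled. The default of two electrons mirrors the CDK default mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    """Hypothetical bond record: any number of atoms sharing some electrons."""
    atoms: tuple        # indices of all atoms that share the electrons
    electrons: int = 2  # a classical single bond holds two electrons


def total_bonding_electrons(bonds):
    """Sum the electrons over all bonds in a representation."""
    return sum(bond.electrons for bond in bonds)


# The six ring positions of benzene as neighbouring atom pairs.
ring = [(i, (i + 1) % 6) for i in range(6)]

# Kekule picture: alternating double (4 electrons) and single (2 electrons) bonds.
kekule = [Bond(pair, 4 if i % 2 == 0 else 2) for i, pair in enumerate(ring)]

# Delocalised picture: six two-electron bonds plus ONE six-atom, six-electron bond.
delocalised = [Bond(pair) for pair in ring] + [Bond(tuple(range(6)), 6)]

print(total_bonding_electrons(kekule), total_bonding_electrons(delocalised))
```

Both pictures account for the same number of ring bonding electrons, but the delocalised one contains a bond with six atoms, which no ordinary graph edge can express.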

As said, the real trick is to have a chemoinformatics toolkit that can work with this representation, but I will save that for later. At least our algorithms to calculate the molecular mass should work ;)

Thursday, December 21, 2006

Last night I upgraded the software behind Chemicalblogspace to the version available on Google Code, though I needed help from Eaun to get paper titles picked up correctly for ACS journals. The number of working blogs is a bit down, now at 68, with on average 30 active blogs together posting more than 100 blog items each day (see Zeitgeist). The new design looks quite nice compared to the old one:

The current script only adds search links to PubChem and Google, but the possibilities are endless, and potentially very powerful. Here are some future ideas.

A link to predict NMR spectra using NMRShiftDB.org:

Making a link to the NMRShiftDB.org website to predict 13C or 1H NMR spectra from a SMILES, and likely from an InChI too, is easy if the website provides a URL to do this. (I will discuss this with Stefan.)

A popup window with the 3D structure in Jmol:

This would involve some more work, but it is most certainly possible too, provided that there is a website around that allows downloading 3D coordinates for a given SMILES or InChI. While a simple approach would be to make a popup with Jmol that takes the URL to that 3D coordinate website, it could be extended using Ajax to query the 3D structure first and, depending on success, show Jmol or a message "Could not find 3D coordinates".

Summarize molecular details hidden in CML:

This is likely the most exciting possibility. I have blogged about CMLRSS many times now (check the AVI, the article, etc), and combining these two technologies will take the semantic, chemistry internet to the next level. CMLRSS describes how CML can be embedded in blog items (e.g. Blogging chemistry on blogspot.com), but really works for any XHTML.

Consider this mockup: add CML content to your blog item, containing molecular properties of a compound, such as its NMR peaks, elemental analysis, etc. This will not show up in your blog item, so that the reader is not bothered with implementation details. Now, a userscript will know about the CML content, as it has access to the whole content of the page. The visible text will mention the molecule for which the CML contains experimental or other details. Using the <span class="chem:compound"/> technology shown above, it is possible to link that compound to this CML bit (details to follow in this blog in January 2007). The userscript will then, on the fly, create a popup for the compound name in the visible text to show those experimental details.

How about that? Comments and other ideas are more than welcome!

Server side scripts:

Greasemonkey allows users to decide which scripts to run on a website, and which not. If you, as blogger or XHTML editor, want to force a script like the above to be run, that should be possible too. Greasemonkey scripts are written in JavaScript, so including them on the server side should be possible too. I might explore this option soon too.

Sunday, December 17, 2006

As a follow-up to my Including SMILES, CML and InChI in blogs post of last week, I had a go at Greasemonkey. Some time ago already, Flags and Lollipops and Nodalpoint showed with two cool mashups (one Connotea/Postgenomic and one PubMed/Postgenomic) that user scripts are rather useful in science too. I can very much recommend the PubMed/Postgenomic mashup, as PubMed has several organic chemistry journals indexed too!

So, how does this relate to my blog of last week? Well, would it not be nice if, when a blog uses the markup suggested in that post, you automatically got links to PubChem and Google? That is now possible with a small GPL-ed Greasemonkey script called blogchemistry.user.js.

The Greasemonkey plugin requires Firefox to be installed. Once that is ready, install the script by clicking the link given earlier, and Greasemonkey will ask you whether you want to install the script. Afterwards, check the output for this RDFa markup content:

a SMILES: CCO

a CAS registry number: 50-00-0

and an InChI: InChI=1/CH4/h1H4

It should look like the output for this blog item. Note the superscript PubChem and Google links.

Update: there was something wrong with the download, which I just fixed (19th, at 8:45 CET). Please download once more to get it working properly.

We all know the combinatorial explosion when calculating the number of possible constitutional isomers (see wp:structural isomorphism) of a certain molecular formula. For example, C2H6 has only one constitutional isomer (ethane, InChI=1/C2H6/c1-2/h1-2H3), and C4H10 has only two. In particular, breaking symmetry by replacing one carbon by another element, or replacing a single bond by a double bond, increases the number sharply. For example, C7H16 has only nine constitutional isomers, while replacing two single bonds by two double bonds, creating C7H10, increases this number to 499! Then, replacing one carbon in the last formula by an oxygen adds a few more, for a total of 747 isomers.

Now, C8H8NBr has at least 649 thousand constitutional isomers, and I am quite interested in being able to know the number of isomers beforehand, without having to generate the structures themselves (for example, using CDK's GENMDeterministicGenerator). InChI=1/C8H8BrN/c9-7-1-2-8-6(5-7)3-4-10-8/h1-2,5,10H,3-4H2 is one of the isomers.

So, my question: is anyone aware of free code (in order of preference: 1. LGPL, 2. BSD/MIT, 3. opensource, 4. free) to calculate or estimate the number of constitutional isomers for a certain molecular formula? An estimate would already be nice. Ideally, I would implement this bit of code in the CDK, but otherwise, just knowing the number of isomers for C8H8NBr would be nice :)

Additionally, any relevant, recent literature recommendations are most welcome. I am aware of the use of polynomials, but the literature I have seen so far just focuses on molecules of a certain architecture, and is not able to come up with a guess based on the molecular formula alone.

Tuesday, December 12, 2006

I just found out that a review article that I wrote earlier this year got printed: Molecular Chemometrics (DOI: 10.1080/10408340600969601), with my personal view on the interplay between chemoinformatics and chemometrics. The review discusses interesting developments of the last five years, and was fun to write (and to read too, I think :). It has four major topics:

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.

Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic to index discussed literature, websites and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of <span class="chemicalcompound">aspirin</span>.

And this is the way SMILES, CAS numbers and InChIs can be tagged on blogs. The <span> element is HTML code to mark a bit of related content in HTML, which can, among many other things, be formatted differently from other text. However, it can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:

for SMILES: <span class="smiles">CCO</span>

for CAS registry numbers: <span class="casnumber">50-00-0</span>

for InChI: <span class="inchi">InChI=1/CH4/h1H4</span>
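To show how little is needed to pick up these microformats, here is a minimal Python sketch that pulls the tagged values out of a piece of HTML and builds a Google search link for each. It only illustrates the idea and is not the blogchemistry.user.js script, which is written in JavaScript. The class and function names are my own, nested spans and multi-valued class attributes are deliberately ignored, and the link format is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus

# The three class names suggested above.
CHEM_CLASSES = {"smiles", "casnumber", "inchi"}


class ChemSpanParser(HTMLParser):
    """Collect (class name, text) pairs for chemistry-tagged <span> elements."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None  # class of the span being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in CHEM_CLASSES:
                self._current = cls
                self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append((self._current, "".join(self._text).strip()))
            self._current = None


def extract_chem_spans(html):
    """Return all tagged identifiers found in an HTML fragment, in order."""
    parser = ChemSpanParser()
    parser.feed(html)
    parser.close()
    return parser.found


def google_link(identifier):
    """Build a Google search URL for an identifier (assumed link format)."""
    return "https://www.google.com/search?q=" + quote_plus(identifier)


post = ('Ethanol is <span class="smiles">CCO</span> and methane is '
        '<span class="inchi">InChI=1/CH4/h1H4</span>.')
for kind, value in extract_chem_spans(post):
    print(kind, value, google_link(value))
```

A userscript would do the same walk over the live page and append the links next to each span instead of printing them.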

The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:

for SMILES: <span class="chem:smiles">CCO</span>

for CAS registry numbers: <span class="chem:casnumber">50-00-0</span>

for InChI: <span class="chem:inchi">InChI=1/CH4/h1H4</span>

which requires you to register the namespace xmlns:chem="http://www.blueobelisk.org/chemistryblogs/" somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.

Therefore, I composed a list of h-indices of my own, ordered by value. The choice of authors is biased towards the Blue Obelisk and the CDK, has some personal touches (Buydens and Wehrens are my PhD supervisors) and includes some names that put the rest into perspective:

query             h-index   #pubs
BENDER A          41        222
WILLETT P         37        302
GASTEIGER J       33        212
RZEPA HS          25        236
BUYDENS LMC       18        108
GLEN RC           18        78
WEHRENS R         11        47
MURRAY-RUST P*    9         41
STEINBECK C       9         29
FECHNER U         6         12
GUHA R            4         24
WILLIGHAGEN E*    4         9
WEGNER JK         3         9
LUTTMANN E        2         4

Of course, there are many comments to make on this. As with any measurement, take the error into account. Sources of error include, but are not limited to, ambiguity in the query. The most notable example of this, I think, is Andreas Bender; I don't think he has been that successful :) Also, Rajarshi Guha's h-index was reported as 6, but the list included two articles from the 1970s and 1980s, which I do not think are actually his.
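For those who want to recompute a value after cleaning up a query result by hand: the h-index is the largest number h such that h of an author's papers have at least h citations each. A minimal Python sketch follows; the function name is my own and the citation counts in the example are invented for illustration, not taken from any of the queries above.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank the papers from most to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # every later paper has even fewer citations
    return h


# Invented citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Dropping wrongly attributed papers from the input list and running it again shows directly how much an ambiguous query can inflate the number.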

Feel free to suggest other names, query corrections, tips, and I will add or work on those too.

Wednesday, December 06, 2006

Because no one picked up my Chemo::Blogs suggestion, I will now officially claim the blog series title. However, unlike the original Bio::Blogs series, I will not summarize interesting blogs, but just spam you with websites I recently marked as toblog on del.icio.us.

Semantics and Text Mining

Evan Prodromou wrote about RDFa vs microformats. The latter are commonly used to enhance blog semantics, and are used, for example, by PostGenomic.com. While RDFa is more explicit, e.g. by using namespaced markup, we have to wait until XHTML2 to see it working. I do not think chemists are using tags a lot yet, but let me propose the following microformats: <span class="inchi">1/CH4/h1H4</span> and <span class="chemicalcompound">methane</span>. Standard JavaScripts and CSS scripts will then do the rest. (Think: addressing newlines, auto googling-for-inchi, etc).

A few EMBL PhD students are having the First Online EMBL PhD Symposium (catchy name, or ... ;) Anyway, discussions are held on IRC, and it has a rather interesting Web2.0 session. All media is available on the website but requires registration right now. After the conference it will become open access to all. Jean-Claude contributed The UsefulChem Project: Open Source Chemistry Research using Blogs and Wikis to the Participants' Contributions section, and I had a poster on Distributing molecular information over the Internet, discussing CMLRSS, blog aggregators, CML and other things. The IRC session was logged and is available here.

Contributions to open data do not have to be large, as long as many people are doing it. The Wikipedia is a good example, and PubChem accepts contributions of small databases too (I think). The result can still be large and rather useful, even scientifically.

The latter was recently written down in the paper Internet-based monitoring of influenza-like illness (ILI) in the general population of the Netherlands during the 2003–2004 influenza season by Marquet et al. (DOI: 1471-2458/6/242). The data was provided by Internet users via The Great Influenza Survey website. The article states that the sum of all those small contributions (anonymous website users are asked to fill out a weekly form), yields reliable data. The user is rewarded by colorful pictures, such as:

If all chemists and biochemists added information about, or properties of, one molecule or metabolite to the Wikipedia each month, one or more commercial database companies would have to change their business model soon. Oh, you can already start doing this here.

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
