Blogging and the chemical semantic web

This post will explain how chemically-aware blogs can be indexed and searched. If you're not a chemist, but still interested in the semantic web, this may be interesting.

I revealed in recent posts that molecules in blogs can be indexed on their chemical structure, thus making the web chemically semantic. (I use the lower-case version to show that we are not using the heavyweight Semantic Web (OWL, triples, etc.) but something much more akin to microformats.) Anyway, the idea is simple...

For any document containing chemistry, mark up each compound with its InChI, which is designed to be a unique identifier for that compound. I'm going to concentrate on blogs, but the idea extends to any web document. (I'll exclude most chemical papers, as they are generally closed: we can only access them with subscriptions, and we are often legally prevented from the indexing described below.)

The main ways of adding InChI tags are:

Persuade the author to do this when they create the post. Most current chemical software either creates InChIs or creates a file that can be converted into InChIs (e.g. with our WWMMservices). With practice this would probably take 1-2 extra minutes per compound, especially if we can create a drag-and-drop InChIfication service at Cambridge or elsewhere. The InChI (which is simply a text string) can either be added to the text of the blog or hidden in the alt attributes of the img tags for the chemical structures. Again, this is fairly straightforward (though I have had to fight my editor). And I think we can expect blog tools to become semantic, at least for microformats, over the next few months.

Extract the structure from the blog and turn it into InChI. This is harder (unless the authors use a robust format such as CML or possibly SMILES). One way is to interpret chemical names as structures; we'll explain our work on this later. But semantic authoring is far better.

Extract a known Open chemical ID from the site. PubChem is the only realistic approach (it has ca. 6 million compounds); CAS numbers are closed and copyrighted, so they cannot be used. If we do this, then I would suggest the PubChem entry is indexed like this: "CID: 2519". (This is very easily cut-n-pasted from the PubChem site.) I am normally hesitant to use IDs, but I think we can make an exception for PubChem.
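As a rough illustration of the first and third options, here is a minimal Python sketch of what the tagging might look like. The function names and the image filename are hypothetical, and the methanol InChI is only there as sample data; the check on the "InChI=" prefix is a sanity check, not a validation of the string.

```python
from html import escape


def img_with_inchi(src: str, inchi: str) -> str:
    """Return an img tag that carries the InChI in its alt attribute.

    Hypothetical helper: it only checks that the string starts with
    'InChI=', it does not validate the identifier itself.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    # Escape both values so they are safe inside double-quoted attributes.
    return '<img src="{}" alt="{}">'.format(
        escape(src, quote=True), escape(inchi, quote=True)
    )


def pubchem_tag(cid: int) -> str:
    """Return the plain-text PubChem tag in the 'CID: 2519' form suggested above."""
    return "CID: {}".format(cid)


# Sample data: methanol, with an assumed image filename.
tag = img_with_inchi("methanol.gif", "InChI=1S/CH4O/c1-2/h2H,1H3")
print(tag)
print(pubchem_tag(2519))
```

Because the InChI sits in an ordinary attribute and the CID in ordinary text, nothing special is needed from the blogging tool beyond the ability to edit the HTML of a post.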

A good example of an InChIfied site is the Carcinogenic Potency Database (CPDB) at Berkeley, which contains a list of chemicals; a typical entry shows the InChI (scroll to the bottom part of the page). This site consistently gets good hits on Google when searched by the InChI string (try it at our GoogleInchI server).

So, this post is to suggest to chemical bloggers that they add InChIs to their blogs. There are about 15 blogs that seem to have enough chemistry to make this worthwhile (I've taken these from Post Doc Ergo Propter Doc) and I'd be grateful for comments on what I have misrepresented or what I've left out. The loose criteria for inclusion are (a) are there frequent chemical structure diagrams or (b) are there enough chemical names that are worth tagging.

10 Responses to Blogging and the chemical semantic web

Our current use of the UsefulChem Molecules blog is a way to minimize the burden on human researchers while providing as much automation as possible. We use Blogger because it is free, hosted and very simple to use. This is important because we want people to be able to replicate what we do with ease. The only truly required information in the UsefulChem Molecules blog is the SMILES code and a pic. Dave has written a script that reads the blog every day and converts the SMILES to InChI (in addition to other useful info) on web pages that refer back to the blog entries for specific molecules. The point is that our molecules are FINDABLE because the InChI is indexed on Google (for example see this) and the page that comes up leads directly to the blog post with that molecule and with links pointing to it from our research.
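To make the findability idea concrete, here is a minimal sketch (not Dave's script, which I have not seen) of how a crawler might map InChI strings back to the posts that contain them. It assumes the InChI is already present in the alt attribute of an img tag; the SMILES-to-InChI conversion step needs a cheminformatics toolkit and is left out. The class name, function name and example URLs are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class InchiAltCollector(HTMLParser):
    """Collect alt attributes of img tags that look like InChI strings."""

    def __init__(self):
        super().__init__()
        self.inchis = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # Only the 'InChI=' prefix is checked; the string is not validated.
            if name == "alt" and value and value.startswith("InChI="):
                self.inchis.append(value)


def build_index(posts):
    """Map each InChI to the list of post URLs whose HTML contains it."""
    index = defaultdict(list)
    for url, html_text in posts.items():
        collector = InchiAltCollector()
        collector.feed(html_text)
        for inchi in collector.inchis:
            index[inchi].append(url)
    return dict(index)


# Made-up posts: one tagged molecule image and one ordinary image.
posts = {
    "http://example.org/post-1": '<p>Methanol</p><img src="m.gif" alt="InChI=1S/CH4O/c1-2/h2H,1H3">',
    "http://example.org/post-2": '<img src="logo.gif" alt="site logo">',
}
index = build_index(posts)
print(index)
```

Looking up an InChI in the resulting dictionary gives the posts that mention that molecule, which is the same lookup a general search engine performs once it has indexed the string.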

(1) Thanks J-C. Of course you are ahead of the game! but it's fun to have a little mystery. We now have several powerful ways of tagging molecules:

InChI - preferred of course
PubChem CID. Easy, quick and covers almost all known molecules. Obviously doesn't work for unpublished or unknown ones. Interestingly, quite a number of the recent synthetic targets in TotallySynthetic are not listed in PubChem. Is that because they are so esoteric they are not widely known? Certainly if they were of major medicinal use they would be included...
SMILES. Not ideal. Not canonical and not semantically identifiable as SMILES (e.g. CO could be methanol or Commanding Officer). Still much better than nothing.
Chemical name. Where possible use the IUPAC one (it can be autogenerated and can often be parsed) and any well-known trivial ones.
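The point about SMILES not being identifiable can be sketched in a few lines of Python. The regular expressions below are my own rough assumptions, not an official grammar for either identifier: an InChI is taken to be anything starting with "InChI=1/" or "InChI=1S/" and running to the next whitespace or quote, and a CID tag is taken to be "CID:" followed by digits. The sample sentence is made up.

```python
import re

# Rough patterns, assumed for illustration only.
INCHI_RE = re.compile(r"InChI=1S?/[^\s\"'<>]+")
CID_RE = re.compile(r"\bCID:\s*(\d+)")


def find_identifiers(text):
    """Return the InChI strings and PubChem CIDs found in free text."""
    return {
        "inchi": INCHI_RE.findall(text),
        "cid": [int(c) for c in CID_RE.findall(text)],
    }


text = (
    "The CO signed off on the sample InChI=1/CH4O/c1-2/h2H,1H3 "
    "and a separate entry is tagged CID: 2519."
)
ids = find_identifiers(text)
print(ids)
```

The InChI and the CID are picked out because each carries its own label. The "CO" in the same sentence is left alone, and there is no pattern that could reliably decide whether it means methanol or the Commanding Officer.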

The reason we use SMILES as the starting point is that most online databases use it as an input. So for example we can get the PubChem IDs via eMolecules. Of course a major problem with SMILES is that it is not unique for a molecule, which limits findability. For those who have missed it, QueryChem is a wonderful example of a chemical search engine that takes an input like SMILES and converts it to InChI, synonyms and all kinds of other identifiers and returns the Google results in aggregated format. The main problem is that the searches are usually slow.

I added the strings to the most recent entries; could you let me know if I'm doing it properly? I'll try and keep it up - let me know when someone comes up with a way to do it programmatically. Right now I'm saving .MOL files and putting them in the converter at inchi.info, then manually putting in alt tags.

(4) That looks great. It would be interesting to know how long it takes you (as part of a daily routine): (a) to create the molecule (b) save as MOL - do you do this anyway? and (c) create and edit in the InChI.

I think it takes about 2 days for Google to index a blog, so it's worth checking occasionally. Please let us know how you get on.

OK - having carefully and rather too obviously written in InChI and SMILES strings in a story about ozone at nexus.webelements.info, and being an inorganic chemist who might want to write about a few inorganic species, I wondered how to write strings for, say, metal coordination complexes like the salt [Cr(OH2)6]Cl3. This compound is listed at PubChem at

As far as creation of just the drawing and inchi string? Not long, unless it's a complicated molecule with a lot of chiral centers.

As it is I use ChemDraw; if the molecule is easy enough to draw in a few seconds, I might draw it by hand. If it's not, ChemDraw's name-to-structure tool is surprisingly robust (even with non-IUPAC names; i.e., it knows "ethidium," "ascorbic acid," and usually gets chirality right). If the non-IUPAC name doesn't work, I will usually spend a few minutes googling and trying to get a string with the IUPAC name, preferably with chiral centers. Name-to-structure works fine on these, always (except for some of the more esoteric spiro ring systems, and some other things I can't remember).

If and only if none of that works, I will draw by hand. If I've gotten to this point, it's probably a tricky molecule, and it'll take another 5-10 minutes. You can usually speed it up by using name-to-structure for the tedious bits (usually sugars). Saving as MOL and GIF is trivial (until recently, I was saving only as GIF).

At the end, I often (but I'm inconsistent with this) use structure-to-name to put an IUPAC (or close enough) name in the graphic too in case anyone's after that.

So, short answer: for the simplest molecules, about 2 minutes; for the most complex, about 15. Adding the InChI strings to the routine (including .mol, fetching the string, putting in the alt tag) can't take more than a minute.