Of the two, I think RDFa has the better future. Then I discovered Operator, written by Mike. While the Greasemonkey scripts already allow me to link to, for example, PubChem and eMolecules, the Operator Firefox add-on allows me to open vCards incorporated in HTML pages directly in my address book client. Thus, I could open chemistry directly in Bioclipse too!

That was the idea, at least. I contacted Mike, and he asked me to wait until the first 0.8 releases, which he announced earlier this month. This version allows user scripts to be written, which define how RDFa should be handled. And with his patience and help, this was the result:

<div about="#chem_123" xmlns:chem="http://www.blueobelisk.org/chemistryblogs/"> Methane has the following identifier: <span property="chem:inchi">InChI=1/CH4/h1H4</span></div>

</html>

It is important here to wrap the statement in a <div> element and to add the @about attribute to it, defining the Subject. Moreover, you need to use the @property attribute instead of @class. The content of this attribute defines the Predicate, and the content of the <span> element is the Object, completing the RDF triple.
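To make the Subject/Predicate/Object mapping concrete, here is a minimal sketch in Python that pulls the triple out of the snippet above. It is not how Operator does it; the class and function names are made up for this illustration, and it assumes well-formed markup in which the element carrying @property contains only text.

```python
from html.parser import HTMLParser

# The RDFa snippet from the post, as a single string.
SNIPPET = (
    '<div about="#chem_123" '
    'xmlns:chem="http://www.blueobelisk.org/chemistryblogs/">'
    "Methane has the following identifier: "
    '<span property="chem:inchi">InChI=1/CH4/h1H4</span></div>'
)


class TripleExtractor(HTMLParser):
    """Collect (subject, predicate, object) triples from simple RDFa markup.

    Illustrative only: assumes every start tag has a matching end tag and
    that an element with @property holds plain text, no nested elements.
    """

    def __init__(self):
        super().__init__()
        self.subjects = []     # one entry per open element: the subject in scope
        self.predicate = None  # @property of the element currently being read
        self.text = []
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The subject is inherited from the enclosing element unless @about is set.
        inherited = self.subjects[-1] if self.subjects else None
        self.subjects.append(attrs.get("about", inherited))
        if "property" in attrs:
            self.predicate = attrs["property"]
            self.text = []

    def handle_data(self, data):
        if self.predicate is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.predicate is not None:
            obj = "".join(self.text).strip()
            self.triples.append((self.subjects[-1], self.predicate, obj))
            self.predicate = None
        if self.subjects:
            self.subjects.pop()


def extract_triples(html):
    """Return the list of triples found in an HTML fragment."""
    parser = TripleExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for triple in extract_triples(SNIPPET):
        print(triple)
```

The <div> supplies the Subject through @about, the <span> inherits it, and the text content of the <span> becomes the Object.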

Operator detects these RDFa statements from the HTML, and creates a new menu item Search in Pubchem using this piece of code:

You can reproduce this by installing Operator 0.8a in Firefox, saving the script to a file in your home directory, and reading it via the Operator "Options" dialog. Make sure to also set the Display Style in the General tab of the dialog to Data formats. Only then will the RDFa magic kick in.

Adding support for eMolecules, ChemSpider and whatever else we like is easy now. What I still need to explore (or ask Mike), is how I can trigger the Open With/Save As dialog of Firefox.

Over the last few weeks I continued the work on getting (descriptor-based) QSAR/QSPR implemented in Bioclipse. JOELib (GPL) and the CDK (LGPL) are two prominent open source engines that can calculate molecular descriptors, and AMBIT is a front-end.

To be able to do QSAR/QSPR model building from start to end in Bioclipse, I worked in April on an architecture for selecting descriptors. Being busy with so many things, it took me some time to get around to completing that, but here are the screenshots: The funny characters and the whitespace are gone. Right now, it still only lists one provider, but I plan to add a JOELib plugin soon. The list of actual descriptors is provided by the extension.

What Bioclipse then does is have the extension calculate the descriptor values for the selected CDKResource in the BioNavigator using the selected descriptors. This will then create a new MatrixResource in the Bioclipse workspace (currently called qsarResult.jam), which is opened in the Matrix editor:

There is still enough work left to do. For example, the columns are not yet labeled according to the descriptor name, and selecting more than one CDKResource in the navigator does not give a multirow matrix yet.

Following a discussion on the mailing list earlier, a directory hierarchy has been set up, and each directory contains an index.xml to describe the content. In case of a directory with actual test files, it may look like:

To improve and ensure some quality, the XML must be valid in addition to just well-formed, so that I can set up XSLT stylesheets to create XHTML indices and summaries. Therefore, I wanted to set up a schema for the index.xml files. My first thought was to use XML Schema, which has XML Namespaces support and has well defined (and extensible) data types. I have hacked in it in the past, but the details have slipped my mind. Already in 1998 I worked with DTDs, around the time that the XML specification was declared a recommendation. Originating from the SGML era, the DTD format is not XML based, has no knowledge of namespaces, and offers only a limited number of data types.

Then there is RELAX NG. It is XML based, uses the same data types as XML Schema and has support for namespaces. Since I had to look up the specs for either DTD or XML Schema for the details anyway (e.g. on how to allow the DC namespace in the main namespace), why not try something new? Well, I was amazed. RELAX NG has a syntax simplicity like that of DTD, but the functionality of XML Schema. So, in 30 minutes I hacked up an XML spec for the test file repository, including a (too short) list of recognized MIME types. It is just a combination of some <element>, <attribute>, <oneOrMore>, etc. elements. The result is available as schema.relaxng in SVN.

Pedro suggested in Nature Network's What's Next forum that Nature should add a new service for scientists: hosting electronic lab notebooks. And I think this will be a killer application. I am rather excited about the idea, and feel ashamed for not putting one and one together myself. We have our chemoinformatics tools and RDF is just around the corner; combine that with semantic wikis, and we have science of the 21st century. This is my reply posted on Nature Network:

Pedro, that might be an interesting idea: Nature hosting ELNs. With much content, I have been maintaining a wiki in my previous postdoc, as a replacement for the old paper notebook. It allows me to make links etc. I plan to do this in my new postdoc too, maybe even with an RDF-enabled wiki, to have agents automatically verify what I enter for inconsistencies. These things are already possible; it is just a matter of doing it.

If Nature would host such a service (RDF-enabled, and integrated with their other pages), they have a true killer for me: I write my ELN items, and for each page I decide if I want to make it public; since it is a wiki, I can keep it private until happy about the results, or, simply, until the experiment has finished. Then, by clicking a button it would become CC+attribution and automatically end up in Nature Precedings. The full integration of Scintilla/Postgenomic/Connotea comes in when making links to background material.

The RDF is important for validating what I write, and I can imagine that Nature has an extensive set of default agents (of course, in addition to spell checking etc :). These agents check if the chemical reaction equations make sense (conservation of mass, atom count, etc), that NMR/MS spectra and other experimental properties are consistent with that equation, and whatever else we can come up with. The tools for this validation are available, and basically only the glue is missing.
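As an illustration of the simplest of these checks, here is a minimal sketch in Python of an atom-count test for a reaction equation. The function names and the plain-text equation format ("A + 2 B -> C") are assumptions made for this example; it handles only simple formulas without brackets or charges, and is no substitute for a real cheminformatics toolkit.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets, no charges)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum the atom counts over one side of an equation, e.g. 'CH4 + 2 O2'."""
    total = Counter()
    for term in side.split("+"):
        # An optional stoichiometric coefficient followed by a formula.
        match = re.fullmatch(r"\s*(\d*)\s*([A-Z][A-Za-z0-9]*)\s*", term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = int(match.group(1)) if match.group(1) else 1
        for symbol, n in atom_counts(match.group(2)).items():
            total[symbol] += coefficient * n
    return total


def is_balanced(equation):
    """Return True if both sides of 'reactants -> products' have equal atom counts."""
    left, right = equation.split("->")
    return side_counts(left) == side_counts(right)


if __name__ == "__main__":
    print(is_balanced("CH4 + 2 O2 -> CO2 + 2 H2O"))
    print(is_balanced("CH4 + O2 -> CO2 + H2O"))
```

If the atom counts match on both sides, mass is conserved as well, so an agent would only need this one test for both conditions.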

Friday, June 22, 2007

Use InChI

This reminded me of a discussion I had with Colin when he was at the CUBIC, which was about experimental sections. I proposed that the InChI should have a prominent place in the experimental section. An important argument for this is that it allows well-defined atom numbering to be used when writing down the NMR bits in that section: the InChI gives a unique numbering, so that the numbering used in the experimental section becomes author neutral. Because the InChI puts the carbons up front, the 13C NMR details get numbers from 1-13, or whatever the carbon count is. For proton NMR it is not difficult either: the protons are simply numbered according to the heavy atom to which they are attached. For situations where two hydrogens attached to the same heavy atom have different shifts, a and b can still be used. The numbers are easily added to 2D diagrams anyway.
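The "carbons up front" argument can be sketched in a few lines of Python: read the carbon count from the formula layer of the InChI, and the carbons get the numbers 1 up to that count. The function name is made up for this example, the ethanol InChI is my own illustration, and the sketch assumes a single-component InChI; it does not parse the connectivity layer at all.

```python
import re


def carbon_numbers(inchi):
    """Return the atom numbers the InChI assigns to carbons.

    Sketch only: assumes a single-component InChI, where the carbons are
    listed first in the formula layer and therefore numbered 1..n.
    """
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    formula = layers[1]
    if "." in formula or formula[:1].isdigit():
        raise ValueError("multi-component InChIs are not handled in this sketch")
    # 'C' followed by an optional count; the lookahead rejects Cl, Ca, Cu, etc.
    match = re.match(r"C(\d*)(?![a-z])", formula)
    if match is None:
        return []  # no carbon atoms in the formula
    count = int(match.group(1)) if match.group(1) else 1
    return list(range(1, count + 1))


if __name__ == "__main__":
    print(carbon_numbers("InChI=1/CH4/h1H4"))                   # methane
    print(carbon_numbers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))  # ethanol
```

For ethanol the carbons are atoms 1 and 2 and the oxygen is atom 3, so the 13C shifts would be reported against 1 and 2 regardless of how the author drew the molecule.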

Use CML

Even better is to use CML for this, or CMLSpect to be precise (paper is accepted, and should appear soon). This XML-based language allows the full semantic markup of all the experimental details and all the interesting assignments you want to archive. I would like to challenge ACD to follow Bioclipse's lead and provide export as CMLSpect for spectral assignments and markup of experimental details, in addition to the PDF in whatever format they prefer. Cheers for the work by Tobias and Stefan on spectrum support in Bioclipse!

Opendata makes such quality assurance possible, and I am happy that the NMRShiftDB was explored like this; the problems found can be reported and corrected. If correcting them upstream is difficult, opendata allows one to make a better derivative; that's what opendata is about. For example, BioMeta (DOI:10.1186/1471-2105-7-517) took data from KEGG and corrected a lot of molecular problems (like reaction balancing, stereochemistry, etc).

I have contributed almost 900 spectra to the NMRShiftDB, and I am sure I may have made a mistake here and there. But my submissions are verified by a reviewer, and furthermore, users of the database can report inconsistencies via the NMRShiftDB.org website. Now, I have focused on uncommon NMR nuclei, like 11B, 195Pt and 29Si (see the stats), which tend to have only one peak. Not much can go wrong there; still, one or two errors were caught by the reviewer.

Ensuring data quality

Humans make errors, and not only when data is entered; they make mistakes checking data too. Not much can be done about that, other than using computers to find patterns. This is exactly what Robien did: he used his software, which implements common patterns, to find entries in the database that did not comply with those patterns.

Automated quality assurance requires an easy-to-use, machine-readable interface. For example, CMLRSS (DOI:10.1021/ci034244p) can be used for running new entries in databases against known patterns. But other interfaces are most welcome too. Rich recently discussed the new PUG interface, which offers an interface to PubChem.

There is also, however, a good list of molecules in Wikipedia for which no CID or InChI is given:

http://www.en.wikipedia.org/wiki/Hafnium(IV)_oxide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cubane -> but no InChI/CID
http://www.en.wikipedia.org/wiki/water -> but no InChI/CID
http://www.en.wikipedia.org/wiki/oxidane -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Carminic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Alizarin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/AIBN -> but no InChI/CID
http://www.en.wikipedia.org/wiki/piperidine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/hydroxide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/tetrahydrocannabinol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Epibatidine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/cortisone -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Eschenmoser%27s_salt -> but no InChI/CID
http://www.en.wikipedia.org/wiki/pyrrole -> but no InChI/CID
http://www.en.wikipedia.org/wiki/anthracene -> but no InChI/CID
http://www.en.wikipedia.org/wiki/benzylbromide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Skatole -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Teicoplanin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Methyl_violet -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Penicillin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Aspartame -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Splenda -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Sucrose -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Rhodamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Ascorbic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tabun_(nerve_agent) -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Soman -> but no InChI/CID
http://www.wikipedia.org/wiki/Phosgene -> but no InChI/CID
http://www.en.wikipedia.org/wiki/AZD2171 -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Heavy_water -> but no InChI/CID
http://www.en.wikipedia.org/wiki/MTBE -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Biotin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Spermine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Silicon_carbide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/stilbene -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Methyl_salicylate -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Dmso -> but no InChI/CID
http://www.en.wikipedia.org/wiki/DMF -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Acetonitrile -> but no InChI/CID
http://www.en.wikipedia.org/wiki/HMPA -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Phenol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/TBHQ -> but no InChI/CID
http://www.en.wikipedia.org/wiki/MTBE -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Salvia_divinorum -> but no InChI/CID
http://www.en.wikipedia.org/wiki/salvinorin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tetrahydrocannabinol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Selenium_dioxide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Piperidine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Resveratrol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/P4O10 -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Dimethyl_sulfide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Folate -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Hydroxybenzotriazole -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Hydrogen_cyanide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Peroxyacetic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/epothilone -> but no InChI/CID
http://www.en.wikipedia.org/wiki/paraquat -> but no InChI/CID
http://www.en.wikipedia.org/wiki/N-butyllithium -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Nafion -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Boron_nitride -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Triclosan -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Hydrogen_peroxide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cholesterol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/DMAP -> but no InChI/CID
http://www.en.wikipedia.org/wiki/aniline -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Phenol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Ascorbic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Nicotine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tetra-ethyl_lead -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Acetophenone -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Ethanol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Acetaldehyde -> but no InChI/CID
http://www.en.wikipedia.org/wiki/EDTA -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Menthol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Formic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Octanitrocubane -> but no InChI/CID
http://www.en.wikipedia.org/wiki/VX_%28nerve_agent%29 -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tetraazidomethane -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Lawesson%27s_reagent -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Hexafluoroisopropanol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cellulose -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Bremelanotide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cellulose -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Dimethicone#Applications -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Shikimic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Methyl_amine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Dimethyl_amine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/DDT -> but no InChI/CID

I really would like to start adding InChIs for these molecules to Wikipedia, but someone needs to enlighten me about the state of ChemBox. Can the InChI be added to the template, or should the InChI be given elsewhere on the page? Adding such small bits is easier than writing a full entry.

It was therefore decided to create a Java application and applet, ‘JAva NOe and Coupling Calculator with Handy Interactive Operation’ (Janocchio), using the open source libraries of the molecular viewer Jmol and the Chemical Development Kit (CDK). It aims to provide a simple and intuitive way to calculate both the NOEs and couplings.

Release 1.0.1 of last May uses an old Jmol, and the CDK release from 26 August 2005. A bit outdated, and I am wondering if it would be a lot of work to integrate this into Bioclipse. Maybe a summer job?

I might request a test account; I do have an old half-finished manuscript that I never got around to finishing. While still relevant, it could use some community input; this preprint server would be the perfect tool. That's how my first manuscript ended up on CPS too :)

Friday, June 08, 2007

Dealing with scientific literature has been one important theme in Chemical blogspace. For example, ranking articles and how to store your personal PDF archive have been topics of discussion. In this blog I will summarize bits of the discussion, and my personal view on things.

Searching

Searching literature is traditionally done in systems like Chemical Abstracts and Web-of-Science. The open nature of a growing number of repositories (e.g. the Dutch DARE) and indexing facilities like PubMed make these proprietary tools obsolete.

It is incorrect to assume that these paid services are the only trustworthy sources. Even WoS fails to make all the links between entries in the database. For example, I am aware of two missing citations to articles I have written, even though both the cited and the citing articles are available in the system. One of the citing articles was in Angewandte Chemie!

Additionally, some search services, like Google Scholar, have the advantage that they find copies and close variants of articles from proprietary journals on home pages and in open repositories. Today, I learned about Scientific Commons, which indexes and links to a staggering 1.5M publications, using, among others, PubMed and university repositories. Where possible it makes direct links to PDF versions of the article.

Ranking

Mitch set up ChemRank, to which Peter, the ChemBlog and I replied. Afterwards, I learned that other services are available too, that allow, in addition to setting up an online personal literature database, voting and commenting on articles.

Apparently, CiteULike (CUL) supports this too. In contrast to ChemRank, CUL requires a login, which I personally see as an advantage, because I can browse literature bookmarked by other accounts I trust. There is also Connotea, but I never liked that site that much (e.g. it allows bookmarking any web page); Rich has his comments too. I would also like to mention BioWizard, which is based on the PubMed content, which actually covers a good deal of chemistry literature nowadays too.

Update: the RSS feed for a specific category was already available, but just not from the Firefox URL bar. Instead, it is given on the right side of the posts page when you select a category. Here is a shortcut for the RSS for posts from the Blue Obelisk category.

Sunday, June 03, 2007

Now that my CUBIC desktop machine is shutting down, I made the necessary backups, among them a mail.tar with my mail correspondence of about a year. It is about 500MB in size for almost 8700 files. Strigi is a perfect tool to help me find messages in this archive, as it will recurse into the .tar archive, and even into email attachments. I created an index just for the archive with:

strigicmd create -t clucene -d index/ mail.tar

It took Strigi about 30 seconds to index the whole archive. That's good performance!

Now, Strigi indexes content full text, but also uses a controlled vocabulary (among which one specifically for chemistry). So I can search for email messages which have article in the subject with:

strigicmd query -t clucene -d index/ email.subject:article
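For comparison, here is a rough sketch in Python of what such a subject search over a tar archive of messages amounts to, without any index. This is not how Strigi works internally; the function name is made up for this example, it assumes each regular file in the archive is one RFC 822 message, and it does not recurse into attachments or nested archives the way Strigi does.

```python
import tarfile
from email import message_from_bytes


def find_by_subject(archive, needle):
    """Yield (member name, subject) for messages whose Subject contains needle.

    `archive` is a binary file object for a tar archive in which every
    regular file is assumed to be a single email message.
    """
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member).read()
            subject = str(message_from_bytes(raw).get("Subject", ""))
            # Case-insensitive substring match on the Subject header only.
            if needle.lower() in subject.lower():
                yield member.name, subject
```

It could be used along the lines of `with open("mail.tar", "rb") as fh: hits = list(find_by_subject(fh, "article"))`. Every query rescans the whole archive, which is exactly the cost that building an index once avoids.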

However, From: and To: content was not yet extracted. That was easily patched. This allows me to find correspondence between me and, for example, Christoph:


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
