
Tuesday, August 28, 2007

If you, like me, have already upgraded to Ubuntu Gutsy and use nxclient for remote login (highly recommended, though proprietary code), you might run into the problem that the login no longer works, returning the message "Cannot find KDE environment.". Ubuntu's Launchpad (generally an excellent service) was rather uncooperative and disregarded a bug report about the problem, but I found the solution with grep -ri kde /usr/NX:

But I did not like that Firefly could do this, and JChemPaint could not. So, I started hacking. First I discovered I had to get rid of the use of JAI; then I had to adapt the JChemPaintPanel takeSnaphot() API to return a RendererImage; and finally, I had to figure out how to write the extra metadata. Now, Firefly is not open source (yet), so it took me some time to figure out how that was done, and this is how:

Another issue, unrelated to this patch, is that writing PNG images changes the location of the structure in the JChemPaint editor, and that the placing of the element symbol in image writing is seriously broken. But that will soon be solved with Niels' new renderer.

The metadata looks like:

(Newlines are lost in the XML display.)

JChemPaint does not yet write InChIs, and it also does not open PNG images for input yet (as Firefly does).

Clustering and classification of crystal structures is hot. Parkin hit the front cover of CrystEngComm with a story on Comparing entire crystal structures: structural genetic fingerprinting (DOI:10.1039/b704177b). Now, the story itself, while rather interesting and well written, has three major flaws:

the data set is way too small

the proposed proof-of-concept is not novel at all

they do not cite me

Well, the latter sounds a bit boohoo, and it is :) (BTW, I do like this paper.)

Now, you may wonder if I am in a position to criticize this shortcoming, but I think I am. As part of my PhD work, I analyzed this problem myself, and two years ago published the paper Method for the computational comparison of crystal structures (DOI:10.1107/S0108768104028344). Apparently, Parkin was not aware of this publication and did not cite it. I should have gone to a crystallography conference with a poster, and advertised my work more. In this paper, I analyzed a data set of 48 crystal structures, manually validated by visual inspection, which meant comparing 1128 (!) crystal structure pairs. It took me two full weeks behind a Silicon Graphics workstation. Yes, I really understand why they took only 12 structures :)
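The number 1128 is simply the count of unordered pairs among 48 structures, n(n-1)/2. A minimal Python sketch of that arithmetic (the function name pair_count is mine and does not come from the paper):

```python
from itertools import combinations
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_structures = 48
pairs = pair_count(n_structures)

# Cross-check against the standard library and against explicit enumeration.
assert pairs == comb(n_structures, 2)
assert pairs == sum(1 for _ in combinations(range(n_structures), 2))

print(pairs)           # 1128 pairs for 48 structures
print(pair_count(12))  # 66 pairs for 12 structures
```

The quadratic growth is what makes manual validation painful: four times as many structures means roughly seventeen times as many pairs to inspect.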

However, there is more prior art. While my approach was based on a new radial distribution function-based whole crystal structure descriptor, my supervisor (Ron) used the more common powder diffraction pattern and showed, in Representing Structural Databases in a Self-Organising Map (DOI:10.1107/S0108768105020331), that it is a good enough descriptor for clustering thousands of crystal structures using a self-organizing map (SOM).

Last week, my second paper in crystallography appeared: Supervised Self-Organizing Maps in Crystal Property and Structure Prediction (DOI:10.1021/cg060872y). In this paper, we show how supervised SOMs (see DOI:10.1016/j.chemolab.2006.02.003) can be used for supervised classification and even for property prediction. Note that these supervised SOMs are truly supervised, unlike many earlier modifications of the unsupervised SOMs: the training is supervised.

Finally, another advantage of this last work: the code is open source. The code for the unsupervised SOMs is available as an R package: kohonen; and for powder diffraction patterns: wccsom. Details can be found in this R News issue. The first package is not actually limited to crystal structures, and can be used for any clustering problem. However, the articles mentioned here make use of simulated diffraction patterns, and I am not sure there are open source tools to generate those.

BTW, I would still be interested in teaming up with CrystalEye in one way or another, and coupling these data analysis methods to live streams of new crystal structures. Nick, let me know if you are interested in exchanging ideas.

Getting back to Parkin's paper, I do like the work. Hirshfeld surfaces are an interesting tool to visualize packing characteristics, and using them to describe a crystal structure sounds like an interesting idea indeed. I just hope that the method scales properly.

Wednesday, August 22, 2007

The full name is (2S,3R,4R,5S,6R)-2-[4-chloro-3-(4-ethoxybenzyl)phenyl]-6-(hydroxymethyl)tetrahydro-2H-pyran-3,4,5-triol and the PDF reports the CAS number 461432-26-8, and InChI=1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3/t17?,18?,19?,20?,21-/m0/s1.

I have added this information to Wikipedia, see the Dapagliflozin entry.

The new Operator release (download) has one notable API change: it now uses "RDF" as key for semantic information; the add-on now supports eRDF too. So, when installing or updating to version 0.8, you also need to update the Sechemtic user script to version 1.1 or better.

Installing Operator scripts is a bit more work than Greasemonkey userscripts. Save the script to your home directory, or any other place you can easily find on the hard disk. After installing the Operator add-on, click the Options button:

For the RDFa script to work, you need to make sure that the Display style is set to Data formats:

Then you can go to the User Scripts tab, and use the New button to add the script you downloaded and saved to your hard disk earlier:

As you can see, plenty of blogspot bloggers around me, among which, in purple, Useful Chemistry. Funny thing is, each time I repeat the Google search, the output is different. Oh, and make sure to drag one of the halos around; that will keep you procrastinating for the whole afternoon :)

Both have advantages and disadvantages (everything does). Google has a huge experience with massive data, and is the centralized version of the distributed world wide web. Personally, I tend towards the decentralized version of things. Scales better. The chemical RDF community showed some concerns about scalability of triple stores (see e.g. Taylor et al. Bringing Chemical Data onto the Semantic Web, 2006, DOI 10.1021/ci050378m). Now, their tests went up to some 30M triples, which is barely enough to store the InChI, PubChem compound ID, and one chemical name.

So, how would this work for molecules then? I am leaning towards a system where one can query resources about one molecule, and work one's way through molecular space. Using KEGG, reaction databases, similarity stores, one could move from molecule to molecule, and add bits of RDF along the way, filling a local RDF store around the actual query I have in mind. For example, if I want to verify that the mass spectrum I found really belongs to the molecular structure I have in mind, I would look up in the resources I know about all triples that relate to the putative structure, and do my queries from there. That's what I would do... (and will do, but more on that later...)

Saturday, August 11, 2007

Rich blogged Never Draw the Same Molecule Twice: Viewing Image Metadata, in which he shows his molecular editor outputting images of molecular structures where the connection table of the structure is embedded in the image. His molecular editor can read the image again, and will automatically pick up the embedded connection table. Noel showed that this can be done not only in Java, but in Python too.
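To make the idea concrete, here is a minimal sketch of how text can be embedded in a PNG and read back, using only the Python standard library. It writes a standard PNG tEXt chunk; I do not know which chunk type or keyword Rich's editor actually uses, so the keyword "molfile" and the helper names below are assumptions for illustration only.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunk(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of png with a tEXt chunk inserted right after IHDR."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR always comes first: signature + 4 (length) + 4 (type) + 13 (data) + 4 (CRC).
    ihdr_end = len(PNG_SIGNATURE) + 4 + 4 + 13 + 4
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:ihdr_end] + _chunk(b"tEXt", payload) + png[ihdr_end:]


def read_text_chunks(png: bytes) -> dict:
    """Collect all tEXt chunks as a keyword -> text mapping."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found


def tiny_png() -> bytes:
    """Build a valid 1x1 grayscale PNG to experiment with."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


# "molfile" is a hypothetical keyword; the text stands in for a real connection table.
png = add_text_chunk(tiny_png(), "molfile", "placeholder connection table")
print(read_text_chunks(png))
```

Image viewers ignore chunks they do not understand, so the picture still displays normally while a chemistry-aware tool can recover the structure.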

This is important progress, though I would still like to see InChIs in the documents, and/or the data files as supplementary information. Actually, I would like even more to see all experimental sections not just list the structure name, but give the InChI. An important spin-off is that when giving spectral information, the atom numbering given by InChI can be used to associate NMR shifts and IR wavenumbers with atoms and atom groups, removing the ambiguity in those associations that we are used to finding in the literature.

Chemistry Central is looking into improving the submission process for molecular data, and I hereby request that they comment on the following approaches, take them into account in their ongoing internal discussions, and incorporate them in the editorial requirements for CC publications:

including the connection table as metadata in images

including the InChI in experimental sections for newly synthesized molecules

use InChI atom numbering to associate NMR shifts with atoms in these experimental sections

I will shortly blog an example experimental section incorporating the InChI.

http://www.en.wikipedia.org/wiki/Hydrogen_cyanide#Hydrogen_cyanide_as_a_chemical_weapon -> but no InChI/CID
http://www.en.wikipedia.org/wiki/P-Phenylenediamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Valence_%28chemistry%29 -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Nitrous_oxide -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cytisine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Disulfur_decafluoride -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Mescaline -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Lewisite -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Sulfur_mustard -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tryptamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Interferon_beta-1a -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Methyl_isocyanate -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Anthraquinone -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tocopherol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Cinnamic_acid -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Tryptamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Psilocybin -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Alphamethyltryptamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Alpha-ethyltryptamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Allylamine -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Ergosterol -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Squalene -> but no InChI/CID
http://www.en.wikipedia.org/wiki/Sulfur_hexafluoride -> but no InChI/CID

Strictly speaking, the list should be longer, as the code that produced this list actually is also happy when a PubChem compound identifier (CID) is given. The previous list is also still online.
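The script that produced this list is not shown here, so the following is only a rough sketch of what such a check could look like, assuming the page source is available as wikitext. The template field names InChI and PubChem, the helper names, and the sample pages are all assumptions for illustration.

```python
import re

# Assumed field names; the real template parameters may be spelled differently.
# [ \t]* (not \s*) around "=" keeps an empty field from matching the next line.
INCHI_FIELD = re.compile(r"\|\s*InChI[ \t]*=[ \t]*([^\s|}].*)", re.IGNORECASE)
CID_FIELD = re.compile(r"\|\s*PubChem[ \t]*=[ \t]*(\d+)", re.IGNORECASE)


def has_identifier(wikitext: str) -> bool:
    """True if the wikitext gives a non-empty InChI or a numeric PubChem CID."""
    return bool(INCHI_FIELD.search(wikitext) or CID_FIELD.search(wikitext))


def report(pages: dict) -> list:
    """Return the titles of pages that lack both an InChI and a CID."""
    return [title for title, text in pages.items() if not has_identifier(text)]


# Hypothetical page sources, made up for this example.
pages = {
    "With CID": "{{Chembox new\n| PubChem = 12345\n}}",
    "With InChI": "{{Chembox new\n| InChI = 1/CH4/h1H4\n}}",
    "Empty fields": "{{Chembox new\n| InChI = \n| PubChem = \n}}",
}

for title in report(pages):
    print(f"{title} -> but no InChI/CID")
```

A page with an empty field is treated the same as a page without the field, which matches the intent: the identifier has to actually be there.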

Thursday, August 02, 2007

I do not care about physical and chemical properties in Wikipedia, as I can easily extract them from other sources. The main value of Wikipedia for molecules is, I think, that it describes the history of a molecule. Additionally, the Wikipedia URL is a nice unique molecular identifier (for example http://en.wikipedia.org/wiki/Lactose) given certain conditions, and many bloggers are using it as such. But, it only is a useful identifier if one (and only one) InChI is stated on the wiki page.

Wikipedia Templates

I have spotted a couple of templates: Drugbox, Chembox, Chembox new, of which the last one seems to be the most recent, and has extensions for explosives and drugs. The WikiProject Chemicals does not mention it though. Does anyone know the status? Is chembox new the way forward and going to replace the older chembox? I hope so, because only the newer one has InChI in the list of official fields. Or is chembox new simply an extension of chembox itself?

Somewhere between 1000 and 1500 entries use the chembox new and another 1000 to 1500 use chembox, but I assume there is considerable overlap. Additionally, Christian noted that there still seem to be molecules in Wikipedia which do not use a template at all, and counted some 1900 molecules using various lists. If you want to keep a closer eye on chemistry in dbpedia, you should subscribe to the dbpedia-discussion mailing list.

Wednesday, August 01, 2007

Well, no wonder: Excel is meant to be used to process money flows. Anyway, greyarea pointed me to this nice blog item from March 2006. It discusses a 2004 article in BMC Bioinformatics, Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics by Barry Zeeberg et al. (DOI:10.1186/1471-2105-5-80). Hence, the importance of semantics and proper markup languages. The quotes are illustrative:

When we were beta-testing [two new bioinformatics programs] on microarray data, a frustrating problem occurred repeatedly: Some gene names kept bouncing back as "unknown." A little detective work revealed the reason: ... A default date conversion feature in Excel ... was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] was being converted to '1-DEC.' Figure 1 lists 30 gene names that suffer an analogous fate.

...

There is another default conversion problem for RIKEN clone identifiers of the form nnnnnnnEnn, where n denotes a digit. These identifiers are comprised of the serial number of the plate that contains the library, information on plate status, and the address of the clone. A search ... identified more than 2,000 such identifiers out of a total set of 60,770. For example, the RIKEN identifier "2310009E13" was converted irreversibly to the floating-point number "2.31E+13." A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern. Most important, these conversions to an internal date representation or floating-point number format are irreversible; the original gene name cannot be recovered.
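One cheap safeguard is to scan identifiers for these two patterns before a file ever gets near a spreadsheet. The sketch below is a heuristic of my own, not Excel's actual conversion rules: it flags names that look like a month abbreviation followed by a day number, and names that look like scientific notation. The function name excel_risk and the month list are assumptions.

```python
import re

# Month-like prefixes, including spellings such as SEPT and MARCH that
# occur in gene symbols; this list is an assumption, not Excel's own.
MONTHS = "JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"
DATE_LIKE = re.compile(rf"^(?:{MONTHS})-?\d{{1,2}}$", re.IGNORECASE)

# Digits, an E, then digits: the nnnnnnnEnn shape of RIKEN clone identifiers.
EXPONENT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def excel_risk(identifier: str):
    """Return 'date', 'number', or None for an identifier string."""
    name = identifier.strip()
    if DATE_LIKE.match(name):
        return "date"
    if EXPONENT_LIKE.match(name):
        return "number"
    return None


for name in ["DEC1", "2310009E13", "TP53"]:
    print(name, excel_risk(name))
```

Anything flagged this way is a candidate for being silently rewritten, so it is worth keeping such columns as explicit text or, better, in a format with proper semantics.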


This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.
