Tuesday, October 28, 2014

International Chemical Identifiers (InChIs): You should use them!

One easy and increasingly common way to make metabolite data more useful is to associate compounds with their corresponding International Chemical Identifier (InChI) (Heller et al., 2013). An InChI is a unique, standardized text representation of the structure of an organic molecule. Including InChIs in database records facilitates cross-referencing among databases.

The InChI system has a number of advantages over other kinds of identifiers. Some chemical identifiers, such as PubChem IDs, Chemical Abstracts Service (CAS) numbers, and ChemSpider IDs, are database-specific accession numbers with no direct relation to the structure of the molecule they describe. This means that a molecule must have been indexed by one of these services to have an identifier. An InChI, by contrast, is a database-independent structure description, so it can be generated for a molecule regardless of whether the molecule has been indexed by a major database. An InChI can be generated for a novel natural product structure, for example, whereas the other IDs cannot. InChI also has advantages over other linear text representations of molecular structure, such as SMILES. Unlike for SMILES, there is a single open source implementation of the InChI generation algorithm, so while a single structure may have multiple valid SMILES representations, it will have only one Standard InChI representation (Heller et al., 2013).

A fixed-length, hashed version of an InChI, an InChIKey, can be generated from any InChI. InChIKeys are more compatible than InChIs with web search engines such as Google (google.com) (Southan, 2013). However, multiple distinct structures may have, and have been observed to have, the same InChIKey (http://www.chemconnector.com/2011/09/01/an-inchikey-collision-is-discovered-and-not-based-on-stereochemistry/), so InChIKeys should not be used as a basis for cross-referencing. When unambiguous identification of a molecule is the priority, InChI should be preferred. When ease of indexing and searchability is the priority, InChIKey should be preferred. When possible, both identifiers should be listed. By listing InChIs and InChIKeys on websites, in databases, and in publications (Coles et al., 2005), chemists can make their data easier to index, search, and cross-reference. Free and easy-to-use software for generating InChIs and InChIKeys includes the InChI software available from the InChI Trust (http://www.inchi-trust.org) and MolConverter available from ChemAxon (http://www.chemaxon.com).
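
To make the cross-referencing idea concrete, here is a minimal Python sketch. The two record lists, their accession numbers, and their field names are invented for illustration; only the InChI strings (ethanol, caffeine, and benzene) are real. It matches records by exact equality of their Standard InChI strings, which assumes both sources generated them with the same standard options.

```python
# Hypothetical records from two unrelated databases. The IDs and field names
# are made up for this example; the InChI strings are real Standard InChIs.
database_a = [
    {"id": "A-0001", "name": "ethanol",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"id": "A-0002", "name": "caffeine",
     "inchi": "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"},
]
database_b = [
    {"id": "B-77", "name": "ethyl alcohol",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"id": "B-91", "name": "benzene",
     "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"},
]


def cross_reference(records_a, records_b):
    """Return (id_a, id_b) pairs for records sharing an identical InChI."""
    # Index the second source by InChI so each lookup is a dictionary hit.
    by_inchi = {}
    for rec in records_b:
        by_inchi.setdefault(rec["inchi"], []).append(rec)

    matches = []
    for rec in records_a:
        for other in by_inchi.get(rec["inchi"], []):
            matches.append((rec["id"], other["id"]))
    return matches


# The two ethanol records match even though their names and IDs differ.
print(cross_reference(database_a, database_b))
```

Neither the compound names nor the accession numbers agree between the two sources, so the InChI is the only field that lets the ethanol records find each other. The match is made on the full InChI rather than the InChIKey because of the collision issue mentioned above.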

Of course, even with the use of InChIs, inconsistencies can still arise in cross-referencing. Galgonek and Vondrášek provide an excellent (and Open Access) analysis of the kinds of inconsistencies that can arise and of their sources.

I originally wrote this as part of a draft of the manuscript that eventually became this review article. It was a bit outside the scope of that article, so we dropped it, but I'm posting it here because I still think it's a good analysis.