The properties and behaviors of chemical substances are generally interpreted and discussed in terms of their molecular structures, and to convey structural information, chemists use diagrammatic representations supplemented by verbal descriptions. Conventional chemical nomenclature was developed to provide a means of specifying or describing a chemical structure in words.

Systematic nomenclature provides an unambiguous description of a structure, a diagram of which can be reconstructed from its systematic name. However, there are other means of specifying molecular structures. Those based on “connection tables” (coded specifications of atomic connectivities) are more suitable than conventional nomenclature for processing by computer, as they are matrix representations of molecular graphs that are readily handled by graph theory. In parallel with its continued development of conventional nomenclature, IUPAC has developed a structural identifier that can be readily interpreted by computers, or more precisely, by computer algorithms.

The IUPAC International Chemical Identifier (InChI) is a freely available, nonproprietary identifier for chemical substances that can be used in both printed and electronic data sources. It is generated from a computerized representation of a molecular structure diagram, produced by chemical structure-drawing software. Its use enables linking of diverse data compilations and unambiguous identification of chemical substances. A full description of the Identifier and software for its generation are available from the IUPAC website.1 In addition, an unofficial but helpful set of answers to frequently asked questions has been compiled by Nick Day of the Unilever Centre for Molecular Science Informatics as part of his Ph.D. project on the Chemical Semantic Web.2 A full account of the InChI project is in preparation.3 Commercial structure-drawing software that generates the Identifier is available from several organizations, listed on the IUPAC website.1

The conversion of structural information to the Identifier is based on a set of IUPAC structure conventions, and rules for normalization and canonicalization (conversion to a single, predictable sequence) of an input structure representation. The resulting InChI is simply a series of characters that serve to uniquely identify the structure from which it was derived. This conversion of a graphical representation of a chemical substance into the unique InChI character string can be carried out automatically by any organization, and the facility can be built into any program dealing with chemical structures.

The InChI uses a layered format to represent all available structural information relevant to compound identity. InChI layers are listed below. Each layer in an InChI representation contains a specific type of structural information. These layers, automatically extracted from the input structure, are designed so that each successive layer adds further detail to the Identifier. The specific layers generated depend on the level of structural detail available and on whether allowance is made for tautomerism. Any ambiguities or uncertainties in the original structure will, of course, remain in the InChI.

This layered design offers a number of advantages. If two structures for the same substance are drawn at different levels of detail, the one with the lower level of detail will, in effect, be contained within the other. Specifically, if one substance is drawn with stereo-bonds and the other without, the layers in the latter will be a subset of those in the former. The same will hold for compounds treated by one author as tautomers and by another as exact structures with all H-atoms fixed. This can work at a finer level. For example, if one author includes double-bond and tetrahedral stereochemistry, but another omits stereochemistry, the latter InChI will be contained in the former.
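The containment property can be illustrated with a minimal Python sketch. It assumes that layers are the segments between ‘/’ characters, and it treats "contained in" as a plain subset test on those segments, which is a simplification of how real InChI software compares identifiers. The function names are hypothetical, and the two alanine strings are quoted as illustrative inputs, not generated by the code.

```python
def inchi_segments(inchi):
    """Split an InChI string on '/' and drop the leading 'InChI=<version>' piece."""
    return inchi.split("/")[1:]


def is_contained_in(less_detailed, more_detailed):
    """Return True if every segment of the first InChI also occurs in the second.

    Simplifying assumption: a plain subset test on the '/'-separated
    segments is enough to express "one InChI is contained in the other".
    """
    more = inchi_segments(more_detailed)
    return all(segment in more for segment in inchi_segments(less_detailed))


# Alanine drawn without stereochemistry, and L-alanine drawn with it.
ALANINE_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

print(is_contained_in(ALANINE_NO_STEREO, L_ALANINE))
print(is_contained_in(L_ALANINE, ALANINE_NO_STEREO))
```

The test is deliberately one-directional: the less detailed identifier is contained in the more detailed one, but not the reverse.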

Two examples of InChI representations are given below. It is important to recognize, however, that InChI strings are intended for use by computers, and end users need not understand any of their details. In fact, the open nature of InChI and its flexibility of representation, once implemented in software systems, may allow chemists to be even less concerned with the details of structure representation by computers.

The layers in the InChI string are separated by the ‘/’ character followed by a lowercase letter (except for the first layer, the chemical formula) with the layers arranged in predefined order. In the examples, the following segments are included:
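The separator rule can be illustrated with a short Python sketch that splits an InChI string into its version, its chemical formula, and its letter-prefixed layers. This is not the official InChI software: `parse_inchi` and `ParsedInChI` are hypothetical names, the code assumes a formula layer is present, and the ethanol string is quoted as an illustrative input.

```python
from typing import List, NamedTuple, Tuple


class ParsedInChI(NamedTuple):
    version: str                   # e.g. "1S"
    formula: str                   # first layer, which has no letter prefix
    layers: List[Tuple[str, str]]  # (prefix letter, layer content), in order


def parse_inchi(inchi: str) -> ParsedInChI:
    """Split an InChI string into version, formula, and prefixed layers.

    Assumption: the string has the form 'InChI=<version>/<formula>/...'.
    Layers are kept as an ordered list, not a dict, so that a prefix
    letter occurring more than once is not lost.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *segments = inchi[len(prefix):].split("/")
    layers = [(seg[0], seg[1:]) for seg in segments if seg]
    return ParsedInChI(version, formula, layers)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
parsed = parse_inchi(ETHANOL)
print(parsed.version, parsed.formula)
for letter, content in parsed.layers:
    print(letter, content)
```

For the ethanol input, the sketch yields the formula `C2H6O` followed by a `c` layer and an `h` layer, in the order they appear in the string.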

One of the most important applications of InChI is the facility to locate mention of a chemical substance using Internet-based search engines. This is made easier by using a shorter (compressed) form of InChI, known as InChIKey. The InChIKey is a 27-character representation that, because it is compressed, cannot be reconverted into the original structure, but it is not subject to the undesirable and unpredictable breaking of longer character strings by some search engines. The usefulness of the InChIKey as a search tool is enhanced by its derivation from a “standard” InChI (i.e., an InChI produced with standard option settings for features such as tautomerism and stereochemistry). An example is shown below; the “standard” InChI is denoted by the letter “S” after the version number.

InChIKey also allows searches based solely on atomic connectivity (first 14 characters). Software for generating InChIKey is available from the IUPAC website.1
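A connectivity-only comparison of this kind can be sketched in a few lines of Python. The sketch relies only on the two facts stated above (a 27-character key whose first 14 characters encode connectivity), plus the assumption that a hyphen follows the first block. The function names are hypothetical, and the two alanine keys are quoted as illustrative inputs; the code does not compute them.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first 14 characters of an InChIKey after a basic shape check."""
    if len(inchikey) != 27 or inchikey[14] != "-":
        raise ValueError("expected a 27-character InChIKey with '-' at position 15")
    return inchikey[:14]


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if two InChIKeys share the same connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


# Keys for L-alanine and for alanine drawn without stereochemistry.
L_ALANINE_KEY = "QNAYBMKLOCPYGJ-REOHSLBHSA-N"
ALANINE_KEY = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(same_connectivity(L_ALANINE_KEY, ALANINE_KEY))
print(L_ALANINE_KEY == ALANINE_KEY)
```

Matching on the first block alone retrieves records for the same skeleton regardless of how much stereochemical detail each source recorded, whereas a full-key match would miss them.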

The enormous databases compiled by organizations such as PubChem,4 the U.S. National Cancer Institute, and ChemSpider5 contain millions of InChIs and InChIKeys, which allow sophisticated searching of these collections. PubChem provides InChI-based structure-search facilities (for both identical and similar structures),6 and ChemSpider offers both search facilities and web services enabling a variety of InChI and InChIKey conversions.7 The NCI Chemical Structure Lookup Service8 provides InChI-based search access to over 39 million chemical structures from over 80 different public and commercial data sources.

In the age of the computer, the IUPAC International Chemical Identifier is an essential component of the chemist’s armory of information tools, enabling location and manipulation of chemical data with unprecedented ease and precision.

Alan McNaught <mcnaught@ntlworld.com>, retired from RSC, is one of InChI’s fathers; with a broad expertise in publication and nomenclature, he has been involved in IUPAC activities for many years (including ICTNS, CPEP, and Div VIII) and with InChI since day one. Steve Heller <steve@hellers.com>, from NIST, is also a father of InChI, stimulating development and making the identifier known to the community.