Wednesday, July 20, 2005

In support of PubChem: towards open chemical information

XML architecture provides a new way of publishing chemical information

An XML-based approach to the communication of chemical information in the biomedical literature would prevent the loss of crucial information and facilitate the re-use of data - and would be easily achievable using existing open tools and resources. A commentary article published today in the Open Access journal BMC Bioinformatics argues that it is time chemistry followed in the footsteps of bioinformatics and structural biology and moved towards the creation of an open semantic web facilitating access to chemical information.

In the article, Peter Murray-Rust, from the University of Cambridge, UK, and John Mitchell and Henry Rzepa from Imperial College London, UK, use three case studies to argue that conventional methods such as cutting-and-pasting chemical information are time-consuming and introduce errors. The authors argue in favour of an open XML architecture that links to connection tables or open databases such as PubChem to identify chemical compounds mentioned in the biomedical literature. This comes as additional support for open chemical databases like the NIH's PubChem, which is currently at the centre of a legal battle between the NIH and the American Chemical Society (ACS). The ACS runs the very lucrative Chemical Abstracts Service and is directly threatened by public databases.

Murray-Rust et al. explain that an open XML-based architecture would provide a cost-effective and user-friendly way to publish chemical information.

Such a structure would avoid the loss of data - currently 80-99% of chemical information is never published due to the lack of a simple technical protocol to access it. It would make chemical information easier to read, save time, and allow published data to be aggregated and re-used. Murray-Rust et al. recognise that implementing such a system might take time and money and might not be supported by all publishers. However, "if publishers adopt these tools and protocols, then the quality and quantity of chemical information available to bioscientists will increase and the authors, publishers and readers will find the process cost-effective", write the authors. They add that most chemical information already exists in electronic format on chemists' computers and could be converted into XML format very easily, without any loss.

Murray-Rust et al. took three recent articles containing chemical information, all published in BMC-series journals from BioMed Central, as the basis for case studies on the usefulness of an XML-based tool for identifying chemical compounds in the biomedical literature.

Chemical compounds can be identified by connection tables and associated chemical structure diagrams, or by structural identifiers such as the IUPAC-NIST Chemical Identifier (INChI). They can also be found through open, semantically free identifiers such as those provided by PubChem, through their common names using open lexicons, or through their systematic chemical names. XML-based information embedded in the text of digitally published chemistry documents could refer to one or more of these to help readers identify the compounds.
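
As a rough illustration of how one embedded XML record could carry several of these identifiers at once, here is a minimal Python sketch. It is not the authors' tooling: aspirin is used only as a familiar example compound, and the namespace URI and the `convention` attribute values (`pubchem:cid`, `iupac:inchi`) are assumptions made for this sketch rather than details taken from the article.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed for this sketch.
CML_NS = "http://www.xml-cml.org/schema"


def compound_element(name, pubchem_cid=None, inchi=None):
    """Build a <molecule> element carrying whichever identifiers are known."""
    mol = ET.Element(f"{{{CML_NS}}}molecule", {"title": name})
    # Common name, as a human-readable label.
    ET.SubElement(mol, f"{{{CML_NS}}}name").text = name
    if pubchem_cid is not None:
        # Semantically free identifier pointing into an open database.
        ET.SubElement(mol, f"{{{CML_NS}}}identifier",
                      {"convention": "pubchem:cid", "value": str(pubchem_cid)})
    if inchi is not None:
        # Structure-derived identifier.
        ET.SubElement(mol, f"{{{CML_NS}}}identifier",
                      {"convention": "iupac:inchi", "value": inchi})
    return mol


ET.register_namespace("cml", CML_NS)
aspirin = compound_element(
    "aspirin",
    pubchem_cid=2244,
    inchi="InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
)
xml_text = ET.tostring(aspirin, encoding="unicode")
print(xml_text)
```

Because each identifier is a separate child element, a record can hold only the ones an author actually knows, and a reader or program can pick whichever it is able to resolve.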

In their first case study, Murray-Rust et al. used a simple conversion protocol to code each molecule mentioned in the article in the XML-based Chemical Markup Language (CML), giving each molecule its PubChem ID. They estimate that the entire coding process took them about as long as it would take a reader to look up the molecules in chemical databases. In addition to the PubChem ID, CML could contain the INChI identifier and metadata for each molecule. For the second article, they show that, even with an automated system, looking up information about the chemical compounds mentioned in the article takes around 45 minutes. This could have been avoided if the compounds had been marked up and linked to connection tables and open databases. In the third article, the name of one compound had been misspelt and others were unclear. This made it difficult for text-mining robots to find information about the compounds, and not all the data needed was retrieved.
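
The problem seen in the third case study can be sketched in a few lines of Python. The snippet below is a toy illustration under stated assumptions, not the authors' text-mining system: the `<compound>` element, its `pubchemCid` attribute, the sample sentence and the two-entry lexicon are all hypothetical. It contrasts looking a compound up by its printed name with reading an identifier embedded in the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up sentence; element and attribute names are illustrative.
# The first compound name is deliberately misspelt.
MARKED_UP = """<p>Cells were treated with
<compound pubchemCid="2244">asprin</compound> and
<compound pubchemCid="2519">caffeine</compound>.</p>"""

# Toy stand-in for an open lexicon mapping common names to PubChem IDs.
LEXICON = {"aspirin": 2244, "caffeine": 2519}


def resolve(doc):
    """Return (printed name, ID found by name lookup, ID embedded in markup)."""
    root = ET.fromstring(doc)
    results = []
    for el in root.iter("compound"):
        name = el.text.strip()
        by_name = LEXICON.get(name.lower())   # None if the name is not recognised
        by_id = int(el.get("pubchemCid"))     # independent of how the name is spelt
        results.append((name, by_name, by_id))
    return results


for name, by_name, by_id in resolve(MARKED_UP):
    print(f"{name}: name lookup -> {by_name}, embedded ID -> {by_id}")
```

In this sketch the misspelt name is not found in the lexicon, while the embedded identifier still resolves, which is the kind of robustness the marked-up approach is meant to provide.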