Monday, February 16, 2015

Comparing data from open natural products chemical spectral databases

I recently wrote a review article called "Open-access metabolomics databases for natural product research: present capabilities and future potential", for the journal Frontiers in Bioengineering and Biotechnology, section Bioinformatics and Computational Biology (link).

Part of the review involved evaluating the contents of In this post, this post consists of the notes I took when generating those comparisons. The code I used is available from my Bitbucket account (I apologize for the poor organization and near-lack of documentation), some of the scripts depend on srj_chembiolib, and/or Pandas. InChI normalization relies on ChemAxon molconvert.

Factors to compare: how many unique compounds are present in the database, how many structures have the various kinds of associated analytical spectra, what compound classes are represented.

Approach:
I compared the contents of databases that allow bulk downloads. Those databases are: HMDB, GMD, BMRB, MassBank, ReSpect, Spektraris, NMRShiftDB, and GNPS. Because some of these databases are not limited to natural products, we also included UNPD, an extensive natural products structure database which does not, however, contain any spectra. Other important publicly searchable databases, such as METLIN, SDBS, and NAPROC-13 were left out of the comparison because they do not allow bulk downloads, so it is difficult or impossible to know precisely which compounds they contain.
To compare the databases, I first downloaded the data from the project websites. Data formats are not standardized, so I used Python scripts custom for each database to extract structure information in the form of SMILES or InChI strings, and spectrum type as one of: MS1, MSn, 1H NMR, 13C NMR, 1D NMR other, and 2D NMR. Structure information was normalized by converting SMILES and InChI strings into Standard InChI strings using ChemAxon molconvert. In the course of extracting and converting the data, a number of errors were noticed in some of the records. I fixed the most obvious and easily correctible errors, but made no attempt at systematic validation or curation.

In some databases, not all of the records are annotated with structural information that could be converted into a Standard InChI. For example, some compounds were identified only by name or CAS number. It also occurred in some cases that the records available for bulk download were out of date compared to those searchable through the web interface, or that the web interface allows searches against proprietary data in addition to the data made available for bulk download. Also, because stereochemical information was not always available, I based my definition of a unique structure only on the "Main" layer of the InChI, which includes formula and atom connectivity (before comparing the Main layer, I removed "?" characters, which are present in the Main layer, and are used to specifically identify unknown stereochemistry).

HMDB:
The metabolites xml file has enough data to know what the inchi is, and whether a spectrum is available.
Sadly, it does not indicate what kind of an NMR spectrum is available, so that has to be gotten by other means. (or we can appeal to the least common denominator and just distinguish between 1D and 2D nmr for all datasets).
To get spectra type, we have to either download the spectra (the download links are broken, so that won't work), or get them from the individual spectra pages, which is possible:
the url to the spectra pages is nice and formulaic:
"http://www.hmdb.ca/spectra/spectra/nmr_one_d/X"
where x is the name of the accession
just change "one" to "two" for 2d spectra.
I use urllib from python to get all the spectra pages.

From the downloaded spectra, the spectra type can be extracted from the page title. The extracted types can then be used to generate a lookup table correlating spectrum number to spectrum type.

NMRShiftDB:
reconstituting the whole database is a total pain. Much easier just to pull out what we want.
grep 'INSERT INTO `MOLECULE` VALUES' nmr_shift_db_dump.sql > molecules.txt
grep 'INSERT INTO `SPECTRUM` VALUES' nmr_shift_db_dump.sql > spectra.txt

ReSpect:
Download MassBank format files. They are apparently an older (or nonstandard) style, because they use the "INCHI" tag for InChIs, rather than the "IUPAC" tag, and there are some other differences as well.

chem_comp.csv
Experiment.csv
fix the InChI code for bmse010516 and bmse010122 in chem_comp.csv
Luckily they make the mol files available, so it's easy to generate inchis.

some InChI codes are non-standard, so we need to remember to deal with that also.
Normalize some of the spectrum types, I've (arbitrarily) chosen to group spectra as: 1H NMR, 13C NMR, 2D NMR, 1D NMR other

MassBank:
Download using wget
consolidate with
cp */* ../otherdir
This would be much faster if they'd just put it all in a single zip file....
Extract data from MassBank files using srj_chembiolib

UNPD:
get the CSV file "UNPD_unique_information_229358.csv"
I suspect that some of the structures in the database are not natural products.
Beware of multiline entries in the csv, such as "UNPD70379" (which I corrected to one line)

BML-NMR:
A list of compound names and InChI codes is found in the supplemental material of the paper (doi: 10.1007/s11306-011-0347-7). I copied the table into a text editor and then formatted it like the other ones. There are a lot of different kinds of spectra, but for the purposes of the generalizations I'm making with this data analysis, I consider them all to be either 1H NMR, or 2D NMR. So I copy all the compound names/Inchis, and after one set I put 1H NMR, and after the other set, 2D NMR.
Need to add InChI= to the beginning of all of the InChI strings. One of them (N-acetyllysine) is not a standard InChI, so convert it into a standard inchi with: molconvert inchi:"AuxNone SAbs" .\N-Acetyllysine.txt
Remove extra spaces in a few of the inchi codes.
Phosphocholine inchi:
InChI=1S/C5H14NO4P/c1-6(2,3)4-5(7)11(8,9)10/h5,7H,4H2,1-3H3,(H-,8,9,10)/p+1
appears to be incorrect, but I did not correct it.

Comparing compound classes

Assessing what kinds of compounds are in each database is not an easy task. Only a few databases even have compound class annotations. At first I thought the best way to assign classifications would be compare databases to ChEBI, and then transfer the ChEBI ontology annotations to the other databases. What I found was that the ChEBI ontology annotations were not really useful for knowing the biosynthetic origins of a compound. So for the paper, I did not use ChEBI-based annotations, but relied on the annotations provided by the individual databases where available, and giving up on classifying compounds from databases that don't provide annotations.

Terpenoids are, the best represented compound class, with an emphasis on sesqui-, di-, and tri-terpenoids. Other well represented classes are flavonoids, "aromatics", chromans, and lignans.

BML-NMR:
Maps well to ChEBI. Maps perfectly to HMDB, can use name to get cross-reference to compound classes.

BMRB:
maps pretty well to ChEBI (731/1167), check out what the remaining ones are though (probably mostly lignin precursors).

GMD:
maps well to ChEBI (702/914)

GNPS:
ChEBI: 2463/5581 (of compounds that can be given an InChI code, which is not that many). So, no estimation can be made.

HMDB:
Excellent overlap with ChEBI 930/1063. Also has its own internal chemical classification annotations.

MassBank:
Not good overlap with ChEBI (3135/11326), however, some records are annotated with compound classes. The vast majority of annotations are ambiguous, such as "N/A" (20474), or "N/A; Environmental Standard" (8646), or "Natural Product" (2397). Including annotations that are further specified, the total number annotated as "Natural Product" is 7995.

ReSpect:
decent overlap with ChEBI (645/722), however, records are heavily annotated with compound class.
According to internal annoations: Flavonoids are the best represented class (2,999 spectra). Other well represented classes are amino acids (888 spectra), terpenoids (743 spectra), phenylpropanoids (727 spectra), nucleotides (475 spectra), and alkaloids (420 spectra). The only way to convert these "spectra" numbers to "compound" numbers is to use the compound name as a unique identifier. Doing that, the numbers become: Flavonoids are the best represented class (1,360). Other well represented classes are terpenoids (519), phenylpropanoids (341), alkaloids (256), amino acids (236), and glucosinolates (93).

METLIN:
closed, and impossible to estimate.

MMCD:
Experimental NMR spectra shared with BMRB. I don't know how to deal with the rest of their data