Performing a Metagenomic Analysis of a Sargasso Sea Sample

This example illustrates a simple metagenomic analysis on a sample data set from the Sargasso Sea. It requires the taxonomy information included in the files gi_taxid_prot.dmp, names.dmp and nodes.dmp (see the compressed file taxdump), which you can download from the NCBI taxonomy FTP site.

Introduction

Metagenomics is the study of the taxonomic composition of a sample of organisms obtained from a common habitat. It usually consists of the comparison of the sequence samples against databases of known sequences and the use of taxonomy information to classify the sample species. The main goals of a metagenomic analysis include the quantification of the relative abundance of known species and the identification of unknown sequences for which no relatives have yet been identified.

Reading BLASTX Hit Report

In this example, we consider a small subset (100 reads) of the Sargasso Sea data set [1], which has been searched against the NCBI-NR database using BLASTX with default parameters. For convenience, the resulting BLAST report has been saved and compressed into the file sargasso-sample1-100.rpt.gz, and it is provided with Bioinformatics Toolbox™. We read the report content and extract relevant information such as the high-scoring pairs, their score, expectation value and percent identity.

Filtering BLAST Hits

Because we are interested only in significant hits, we filter the results based on their score, expectation value and percent identity with the query sequences. By using this filtering process, we reduce the number of hits to approximately one quarter of the original hits.

Memory-Mapping the Taxonomy Data File

The taxonomic classifications for all GenBank® sequences are stored in large files that are updated weekly as new sequences are submitted and the taxonomic information is refined. To retrieve this information in a quick and efficient way, we create a map between any possible gi number in the GenBank database and its associated taxonomic identifier (taxid). Because currently there are more than 100 million live gi numbers, the memory requirements for loading such a large data set can be very demanding. Thus, using the provided helper function mapTaxoFile, we read the data in blocks of 1MB, save it as a binary file and then use the function memmapfile to map into memory the content of the file itself, so that the data can be accessed using standard indexing operations. See memmapfile help for more details.

If you performed the BLAST search against a database that is outdated with respect to the taxonomy information included in the nodes.dmp file, some gi numbers might be superseded. Therefore, you need to exclude from the analysis those sequences associated with superseded entries.

During the search against the NCBI-NR Database, the first query (SHAA001TR) hit n sequences with significant expectation value and score. We can look at the taxonomic assignment of these hits using the array taxid.

Classifying BLAST Hits by Scientific Name

Every taxid corresponds to a specific taxon, which has been given a scientific name and possibly various synonyms. For our classification purposes, we are interested in the scientific names only. Thus, we extract this information and annotate each BLAST hit in the report using the scientific names, rather than the taxids.

Determining the Taxonomic Distribution of BLAST Hits

One reason to classify the sequence hits in a BLAST report is to study their taxonomic distribution. We can easily create a list of organisms that are represented in the report, their taxids and their frequency as follows:

From the simple statistics on the taxonomic distribution, we observe that the most represented taxon is the Burkholderia sp.383 (taxid 269483). The over-representation of this bacterium, which is usually found in terrestrial settings, in the Sample 1 of the Sargasso Sea data set is discussed in [1].

Filtering Out Isolated Assignments

Several taxa in the report appear to be isolated assignments because they are hit by only one sequence. These taxa are rarely true members of the environmental community under investigation. Thus it is useful to identify them and discard them if needed.

Limiting the Analysis to the Best Hit for Each Query

We can repeat the above procedure by limiting the analysis to only the best-scoring hit for each query sequence. Even though analyses limited to the best-scoring hits cannot depict a complete and accurate picture of the situation, they can be useful as a first approximation and overcome the difficulty inherent with large data sets.

In our example, when only the best-scoring hits are considered, Burkholderia, Candidatus pelagibacter ubique and Shewanella appear to be the most represented taxa in the report. While finding Candidatus pelagibacter ubique is not surprising, because it is a dominant form of life in the Sargasso Sea, Burkholderia and Shewanella are not expected to be present in this marine sample where nutrients and resources are low, because they live either in terrestrial settings or in aquatic, nutrient-rich environments respectively. For a detailed discussion regarding the presence of these bacteria in the Sargasso Sea, see [1].

Memory-Mapping the Taxon Node Information

Oftentimes, to gain a clear vision of the taxonomic distribution of a sequence set, Linnaean categories higher than species are considered. To perform this analysis, we need to create a map between each taxid and its assigned rank, as well as a map between each taxid and the taxid of its parent node, according to the NCBI Taxonomy Database schema. Files containing this information can be created with the helper function mapNodeFile.

Classifying BLAST Hits by Higher Taxonomic Rank

After the maps are created, for every hit that is associated with a taxid corresponding to a Linnaean category more specific than the target rank, we determine the parental taxid and its rank until the target is reached. Then we annotate the hit with the taxid of its more distant ancestor. Synthetic constructs or nodes with no rank are considered descendants of the root. This procedure of walking up the taxonomic hierarchy is performed by the helper function findTaxoRank.

Suppose we are interested in classifying our hits according to the superkingdom to which they belong. After assigning the superkingdom taxid to each hit, we group and count the occurrences as follows:

Representing the Taxonomic Distribution on a Graph

The taxonomic distributions at different levels are related to each other by hierarchy. Suppose we want to look at the distribution of hits across phyla and visualize them on a graph. After filtering out the counts of the low represented phyla (<5 counts), we create a connectivity matrix where all phyla are direct children of the root.

To represent the various phyla and the Proteobacteria class in the same graph, we need to create a connectivity matrix such that all the phyla are children of the root and all the Proteobacteria classes are children of the Proteobacteria node. In the graph, the labels include the class names and the number of occurrences in the BLAST report.