Abstract

We developed a fast method for denoising pyrosequencing data for community 16S rRNA analysis. Comparing denoised with non-denoised data, we observe a 2–4 fold reduction in the number of observed OTUs (operational taxonomic units). Approximately 50,000 sequences can be denoised on a laptop within an hour, two orders of magnitude faster than published techniques. We demonstrate the effects of denoising on alpha and beta diversity of large 16S rRNA datasets.

Pyrosequencing1 has revolutionized microbial community analysis by allowing the simultaneous assessment of hundreds of microbial communities in multiplex with sufficient depth to resolve meaningful biological patterns2. These techniques have been used to gain striking new insight into microbial processes on scales ranging from continents3 to within an individual’s body4.

Powerful new analysis tools such as GAST5, Mothur6, and QIIME7 greatly streamline the interpretation of microbial community information obtained by pyrosequencing, especially similarities and differences among communities. However, substantial questions remain about the suitability of pyrosequencing for addressing alpha diversity (the amount of diversity within each individual community) and non-phylogenetic beta-diversity measures; phylogenetic beta-diversity measures such as UniFrac, which measure similarities between different communities, are relatively robust to these issues8. In particular, noise introduced during pyrosequencing and the PCR amplification stage can inflate estimates of the number of OTUs (chosen at the 97% identity level) in a given habitat by orders of magnitude9, 10. The current state of the art is to reduce noise by clustering the flowgrams (patterns of intensities in each read) before conversion to sequences, in order to eliminate issues due to homopolymer read errors10. Yet this approach is exceedingly computationally expensive and beyond the reach of most individual investigators who do not have access to large-scale computing facilities.

Methods

Inability to accurately determine which sequences are present in a sample, and hence the abundances of rare taxa, greatly inhibits our ability to infer important ecological parameters such as rank-abundance curves. Ironically, the portion of the rank-abundance curve that can be inferred, i.e. that of the common taxa, provides a solution to the expense of denoising. Empirical rank-abundance curves, especially from human-associated samples, tend to be dominated by a relatively small number of abundant taxa. Given this feature of actual microbial communities, performing all-on-all comparisons for clustering is exceedingly inefficient: instead, a subset of reads suffices to identify the common OTUs, whose reads can then be iteratively removed by recruitment to an existing cluster. Consequently, we can rapidly determine the OTUs that are most likely to be abundant, concentrate initially on comparing reads to the small number of abundant OTUs (removing matches from the analysis), and then cluster only the leftover reads representing more divergent sequences.
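The saving can be made concrete with a toy count of pairwise comparisons. The sketch below is illustrative only: the function names and the label-based notion of a "match" are assumptions introduced here, not part of the published method, and it idealizes recruitment by assuming the abundance ranking of the OTUs is already known.

```python
from collections import Counter


def all_vs_all_comparisons(n_reads):
    """Number of pairwise comparisons needed to compare every read with every other."""
    return n_reads * (n_reads - 1) // 2


def greedy_comparisons(read_labels):
    """Comparisons needed when each read is checked against OTUs, most abundant first.

    read_labels is a hypothetical list giving the true OTU of each read;
    a read is "recruited" as soon as it is compared with its own OTU.
    """
    # Rank OTUs by descending abundance (assumed known from a subset of reads).
    ranked = [otu for otu, _ in Counter(read_labels).most_common()]
    rank = {otu: i for i, otu in enumerate(ranked)}
    # A read whose OTU sits at rank i is recruited after i + 1 comparisons.
    return sum(rank[label] + 1 for label in read_labels)


# Toy community dominated by one taxon, as in many human-associated samples.
toy_labels = ["A"] * 900 + ["B"] * 90 + ["C"] * 10
print(all_vs_all_comparisons(len(toy_labels)))
print(greedy_comparisons(toy_labels))
```

For this skewed toy community of 1,000 reads, the all-on-all count is 499,500 comparisons, whereas the abundance-ordered recruitment needs 1,110. The gap narrows as the abundance distribution becomes more even.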

We can thus reduce the total number of sequence comparisons using empirical features of the abundance distribution of real datasets as follows. First, we devised a fast pre-filter that removes reads that are strict prefixes of other reads, and we use it to compute an initial sequence distribution. We then sort the prefix clusters in descending order of abundance, and use this initial distribution to cluster similar reads, comparing each additional unclustered read to the most abundant clusters first because we expect the abundant clusters to yield a larger number of erroneous near-matching reads due to their numerical dominance alone. For a more detailed description of the algorithm, see Supplementary Methods. A similar method of pre-clustering on the sequence level and subsequent sequence clustering along the abundance distribution has been proposed recently11.
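The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the function names are hypothetical, identical reads are collapsed together with strict prefixes, a simple mismatch count on nucleotide sequences stands in for the flowgram-based distance of the actual method, and the quadratic prefix search is chosen for readability rather than speed.

```python
def prefix_filter(reads):
    """Collapse each read into the first longer (or identical) read it is a prefix of.

    Returns a dict mapping each representative read to the number of reads it absorbed.
    """
    counts = {}
    # Longest reads first, so a representative exists before its prefixes arrive.
    for read in sorted(reads, key=len, reverse=True):
        for rep in counts:
            if rep.startswith(read):
                counts[rep] += 1
                break
        else:
            counts[read] = 1
    return counts


def mismatches(a, b):
    """Stand-in distance (an assumption): mismatches over the shared length."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(prefix_counts, max_mismatches=1):
    """Assign prefix clusters to centroids, trying the most abundant centroids first."""
    # Sort prefix clusters in descending order of abundance.
    ordered = sorted(prefix_counts.items(), key=lambda kv: -kv[1])
    clusters = []  # [centroid, size] pairs, in order of creation (most abundant first)
    for seq, n in ordered:
        for cluster in clusters:
            if mismatches(seq, cluster[0]) <= max_mismatches:
                cluster[1] += n  # recruit to the first sufficiently close centroid
                break
        else:
            clusters.append([seq, n])  # no close centroid: start a new cluster
    return clusters


toy_reads = (["ACGTACGT"] * 5 + ["ACGTAC"] * 3 + ["ACGTACGA"]
             + ["TTGGCCAA"] * 2 + ["TTGG"])
print(greedy_cluster(prefix_filter(toy_reads)))
```

In this toy input, the truncated reads are absorbed by their full-length representatives during pre-filtering, and the single read differing at one position is recruited to the dominant cluster, leaving two clusters of sizes 9 and 3.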

The method introduced here is a major improvement over previous flowgram-based denoising routines10 in terms of compute resources, yet retains the advantage that singletons are not discarded entirely, allowing exploration of the rare biosphere12. Previously, a mid-size 24-core cluster was needed to analyze a small dataset of around 40,000 sequences in around 10 hours. Our method allows the same dataset to be denoised in less than an hour on a single laptop computer (Table S1). We can also denoise full 454 runs with 500,000 sequences on a mid-size cluster in 1 day. We can thus address questions in community ecology that were previously intractable.

Applying these new methods to the most comprehensive survey of human-associated body habitats yet performed4, we find that denoising produces a substantial decrease in the diversity both at the OTU level and in terms of the phylogenetic diversity (the total branch length associated with each sample on a phylogenetic tree14). However, the results from the non-denoised (but filtered) and denoised data are highly correlated (r² = 0.97, P < 10⁻³⁰⁰ for phylogenetic diversity), suggesting that relative results concerning diversity within each sample are robust to the types of errors introduced by pyrosequencing (Fig. 1a–f). Interestingly, in spite of this high correlation, denoising changes the relative order of OTU richness of individual body habitats. Although the gut exhibits the highest OTU richness without denoising, it falls back into the middle ranks after denoising. This holds true for both Chao1 estimates and the phylogenetic diversity (Fig. 1a,d and 1b,e). The drastic reduction after denoising might be an effect of the sequence composition of the dominant OTUs in the gut (see Supplementary Methods for a more detailed discussion).
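To illustrate why richness estimators such as Chao1 are sensitive to sequencing noise, the following sketch computes the estimate for a toy sample with and without spurious singleton OTUs. It assumes the commonly used bias-corrected form, S_obs + F1(F1 − 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs; the text does not state which variant was used, and the counts are invented for illustration.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts."""
    s_obs = sum(1 for c in otu_counts if c > 0)   # observed OTUs
    f1 = sum(1 for c in otu_counts if c == 1)     # singleton OTUs
    f2 = sum(1 for c in otu_counts if c == 2)     # doubleton OTUs
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample of 100 reads from three real taxa.
clean = [50, 30, 20]
# The same sample when four erroneous reads each form their own singleton OTU.
noisy = [48, 29, 19, 1, 1, 1, 1]

print(chao1(clean))
print(chao1(noisy))
```

Here four noisy reads out of 100 raise the estimate from 3 to 13, because the estimator extrapolates unseen richness from the number of singletons. This is consistent with the large reductions in estimated richness observed after denoising.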

Figure 1. Comparisons of non-denoised data (a–c) to denoised data (d–f) for alpha diversity for the Body Habitat study, and comparisons of beta diversity (g–h). Rarefaction plots of the “Body Habitat” study4 show a 3 to 4...

Similarly, when clustering the samples using UniFrac, the non-denoised and denoised reads produce very similar patterns (Fig. 1g–h), reinforcing the point that errors introduced into each sample by noise or chimeras have little effect on beta diversity because they inflate the distances among all samples rather than introducing artifactual similarities between specific pairs of samples15.

We conclude that the availability of these new methods will make more accurate assessments of alpha diversity available to a wide range of researchers (especially in conjunction with improved chimera-checking methods such as ChimeraSlayer, http://microbiomeutil.sourceforge.net/), and will greatly improve our understanding of microbial communities in habitats with scales ranging from global to extremely personal. The efficiency of the new techniques and the fact that they can change conclusions about the relative diversity in different habitats suggest that they should be applied routinely in all pyrosequencing studies where estimates of diversity within each sample are the goal.