How to choose a metagenomic sequencing strategy – input wanted

So … what goes around comes around. In 2003 and 2004, I spent a lot of time discussing and arguing with people about what would be the best strategy for making and sequencing Sanger libraries for metagenomic sequencing for the Sargasso Sea metagenome study coordinated by the Venter Institute (I worked at TIGR at the time and was collaborating with people at the Venter Institute on this project). Then, a short time later, we had similar arguments and discussions for the Global Ocean Sampling (GOS) project (again, with Sanger sequencing). Then as next gen sequencing came online, the argument changed a bit (since it was not quite as simple to make mate pair or paired end libraries or sequence them). And every year or two I have been involved in the same type of discussions. Without wanting to prejudice people’s ideas about this, I thought it would be a good time to solicit input from the community on metagenomic sequencing and library construction.

So – I would like to solicit such input with some questions:

What sequencing method(s) are people using for metagenomic studies?

What are your approaches to library construction (method, size of DNA pieces, etc)?

What are your approaches to sequencing (paired end? read length? overlaps or no overlaps?)

Do you use one approach or many combined together?

How did you / do you choose what approach to use?

How much do the goals of a study (e.g., assembling genomes, classifying organisms, comparative ecological analyses, functional discovery, etc) influence the choice of library and sequencing methods?

What papers, blogs, data sets, Tweets, etc have you found most useful in making such decisions?

If you’re doing high diversity without a reference (so, something environmental!), in my opinion you should plan to generate an assembly, probably with MEGAHIT. That means you will want the deepest possible sequencing (HiSeq); direct read overlaps don’t help with assembly, while longer insert sizes do.

Itai Sharon’s latest paper from the Banfield lab shows that high-accuracy synthetic long reads (so, Moleculo) can be a useful cross-check but they do not add a lot of depth. (That’s my interpretation of the paper.)

On the assembly side, the big remaining problem is that strain variation may kill parts of your assembly. I have suggestions for that but nothing proven (basically, “use digital normalization”…)

Titus, how much do you think larger insert sizes help with assemblies from communities on the diversity scale of soils? Deepest possible sequencing covers a lot of ground (one HiSeq run per sample?), so one useful outcome of this discussion might be some practical guidance on read depths for different types of goals (assembly, read-based comparisons).

Technology-wise, go for the longest reads, prepare long-insert libraries (the longer the better), do your size selection well, and always assemble your stuff. First try with raw reads, then try digital normalization, assembly subsampling, reference-based recruitment, and/or whatever else is out there that can help you get better assemblies. Try different assemblers, at least two (I like Ray and hear good things about IDBA-UD), and analyze your dataset with both in parallel (as if you were going to publish both results). Different assemblers make different mistakes, and it is very helpful to put two spotlights on your data instead of one. It is not easy to assess which assembly result is “more right”, but working with multiple assemblies of the same data helps you A LOT to learn about your data.

The same goes for mapping parameters. Use BWA, Bowtie2, or CLC for mapping; they are pretty much identical and in my experience do well with complex metagenomes and sparse assemblies compared to other alternatives. Map with reasonable stringency, and don’t trust defaults (with CLC, for instance, the default is 50% alignment and 50% identity). Variable mapping will ruin your binning process, as everything worth doing downstream will rely on average coverage values of your contigs, so one cannot put enough emphasis on appropriate mapping. If your purpose is to get well-put-together draft genomes, consider using proper binning tools that utilize both sequence composition and coverage, like CONCOCT, but always remain very sceptical about your draft genomes: check synteny if you can find anything relevant in reference genomes, check single-copy genes from multiple collections, etc.
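
Since everything downstream of binning leans on average contig coverage, here is a minimal sketch of computing per-contig mean coverage from `samtools depth -a` style output (the contig names and depths below are made up for illustration):

```python
import io

def mean_coverage_per_contig(depth_lines):
    """Average coverage per contig from `samtools depth -a` output
    (tab-separated: contig, position, depth). The -a flag matters:
    zero-depth positions must be present for the mean to reflect
    the whole contig length."""
    totals, lengths = {}, {}
    for line in depth_lines:
        contig, _pos, depth = line.rstrip("\n").split("\t")
        totals[contig] = totals.get(contig, 0) + int(depth)
        lengths[contig] = lengths.get(contig, 0) + 1
    return {c: totals[c] / lengths[c] for c in totals}

# Tiny made-up example standing in for a real depth file
example = io.StringIO(
    "contig_1\t1\t10\ncontig_1\t2\t14\ncontig_1\t3\t12\n"
    "contig_2\t1\t0\ncontig_2\t2\t4\n"
)
cov = mean_coverage_per_contig(example)
```

In practice you would stream `samtools depth -a sample.bam` into this rather than a string buffer, and feed the per-contig means into your binning tool.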

Maybe most important of all is to remember that in metagenomics there is a lot of room for creativity before and after you have the data, and prime opportunities for quite novel discoveries. So if you are coming from the marker-gene world, don’t just follow a protocol or two to get your “metagenomes done” and rely on your postdoc “who is not a bioinformatician but rocked that mothur/QIIME so they can do it”; instead, consider collaborating with someone from the _very_ beginning (even before collecting samples, really).

Extraction and sequencing of metagenomic DNA
Metagenomic DNA from prokaryote and girus-enriched size fraction filters, and from precipitated viruses, was extracted as described in (12), (59), and (19), respectively. DNA (30 to 50 ng) was sonicated to a 100- to 800-base pair (bp) size range. DNA fragments were subsequently end-repaired and 3′-adenylated before Illumina adapters were added by using the NEBNext Sample Reagent Set (New England Biolabs). Ligation products were purified by Ampure XP (Beckman Coulter), and DNA fragments (>200 bp) were PCR-amplified with Illumina adapter-specific primers and Platinum Pfx DNA polymerase (Invitrogen). Amplified library fragments were size-selected (~300 bp) on a 3% agarose gel. After library profile analysis using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and quantitative PCR (MxPro, Agilent Technologies, USA), each library was sequenced with 101-base-length read chemistry in a paired-end flow cell on Illumina sequencing machines (Illumina, USA).

Metagenomic sequence assembly and gene predictions
Using MOCAT (version 1.2) (18), high-quality (HQ) reads were generated (option read_trim_filter; solexaqa with length cut-off 45 and quality cut-off 20) and reads matching Illumina sequencing adapters were removed (option screen_fastafile with e-value 0.00001). Screened HQ reads were assembled (option assembly; minimum length 500 bp), and gene-coding sequences [minimum length 100 nucleotides (nt)] were predicted on the assembled scaftigs [option gene_prediction; MetaGeneMark (version 2.8) (60)], generating a total of 111.5 M gene-coding sequences (14). Assembly errors were estimated by testing for colinearity between assembled contigs and genes and unassembled 454 sequencing reads by using a subset of 11 overlapping samples (58). From this analysis, we estimate that 1.5% of contigs had breakpoints and thus may suffer from errors (14). This error rate is more than a factor of 6.5 less than previous estimates of contig chimericity in simulated metagenomic assemblies (9.8%) with similar N50 values (61).
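
As a rough illustration of what the trimming thresholds above mean (a simplified stand-in, not MOCAT's actual solexaqa routine, which keeps the longest contiguous high-quality stretch):

```python
def trim_read(seq, quals, qual_cutoff=20, length_cutoff=45):
    """Toy 3'-end quality filter: truncate at the first base whose
    quality falls below the cutoff and discard reads shorter than
    the length cutoff. Illustrates the Q20/45-bp thresholds only;
    it is not the solexaqa algorithm MOCAT uses."""
    end = len(seq)
    for i, q in enumerate(quals):
        if q < qual_cutoff:
            end = i
            break
    trimmed = seq[:end]
    return trimmed if len(trimmed) >= length_cutoff else None
```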

Generation of the ocean microbial reference gene catalog
Predicted gene-coding sequences were combined with those identified in publicly available ocean metagenomic data and reference genomes: 22.6 M predicted genes from the GOS expedition (6, 7), 1.78 M from the Pacific Ocean Virome study (POV) (62), 14.8 thousand from viral genomes from the Marine Microbiology Initiative (MMI) at the Gordon & Betty Moore Foundation (14), and 1.59 M from 433 ocean microbial reference genomes (14). The reference genomes were selected by the following procedure: An initial set of 3496 reference genomes (all high-quality genomes available as of 23 February 2012) was clustered into 1753 species (24), from each of which we selected one representative genome. After mapping all HQ reads against these genomes, a genome was selected if the base coverage was >1× or if the fraction of genome coverage was >40% in at least one sample. In addition, we included prokaryotic genomes for which habitat entries matched the terms “Marine” or “Sea Water” in the Integrated Microbial Genomes database (63) or if a genome was listed under the Moore Marine Microbial Sequencing project (64) as of 29 July 2013. Finally, we applied previously established quality criteria (24), resulting in a final set of 433 ocean microbial reference genomes (14). For data from GOS, POV, and MMI, assemblies were downloaded from the CAMERA portal (64). A total of 137.5 M gene-coding nucleotide sequences were clustered by using the same criteria as in (16); i.e., 95% sequence identity and 90% alignment coverage of the shorter sequence. The longest sequence of each cluster was selected, and after removing sequences <100 nt, we obtained a set of 40,154,822 genes [i.e., nonredundant contiguous gene-coding nucleotide sequences operationally defined as “genes”; see also (16, 17)] that we refer to as the Ocean Microbial Reference Gene Catalog (OM-RGC).
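
The clustering criteria above (95% identity, 90% alignment coverage of the shorter sequence, longest sequence kept as representative) amount to a greedy, longest-first loop. A toy sketch follows; `difflib` stands in for a real aligner, conflating identity and coverage into one crude score, so this is illustrative only:

```python
from difflib import SequenceMatcher

def approx_identity(a, b):
    """Fraction of the shorter sequence covered by exact matching
    blocks; a crude stand-in for identity x alignment coverage."""
    matched = sum(bl.size
                  for bl in SequenceMatcher(None, a, b).get_matching_blocks())
    return matched / min(len(a), len(b))

def greedy_cluster(seqs, threshold=0.95):
    """Greedy longest-first clustering in the spirit of the catalog
    build: each sequence joins the first representative it matches
    above the threshold, otherwise it founds a new cluster, so the
    longest member is always the representative."""
    clusters = {}  # representative -> members
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if approx_identity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters
```

Real catalogs use a dedicated tool (CD-HIT, USEARCH) for this step; the sketch only shows the selection logic.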

Taxonomic and functional annotation of the OM-RGC
We taxonomically annotated the OM-RGC using a modified dual BLAST-based last common ancestor (2bLCA) approach as described in (58). For modifications, we used RAPsearch2 (65) rather than BLAST to efficiently process the large data volume and a database of nonredundant protein sequences from UniProt (version: UniRef_2013_07) and eukaryotic transcriptome data not represented in UniRef. The OM-RGC was functionally annotated to orthologous groups in the eggNOG (version 3) and KEGG databases (version 62) with SmashCommunity (version 1.6) (46, 66, 67). In total, 38% and 57% of the genes could be annotated by homology to a KEGG ortholog group (KO) or an OG, respectively. Functional modules were defined by selecting previously described key marker genes for 15 selected ocean-related processes, such as photosynthesis, aerobic respiration, nitrogen metabolism, and methanogenesis (14).
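
The LCA step of a 2bLCA-style annotation reduces to finding the deepest rank shared by the lineages of all retained hits; a minimal sketch (the lineage data structure is assumed here, not taken from the paper):

```python
def last_common_ancestor(lineages):
    """Lowest common ancestor of taxonomic lineages given root-first,
    e.g. ["Bacteria", "Proteobacteria", ...]. In a 2bLCA-style scheme
    the lineages come from all hits scoring within range of the best
    hit; the gene is annotated at the deepest rank they all share."""
    if not lineages:
        return []
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            lca.append(ranks[0])
        else:
            break
    return lca
```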

Taxonomic profiling using 16S tags and metagenomic operational taxonomic units
16S fragments directly identified in Illumina-sequenced metagenomes (mitags) were identified as described in (12). 16S mitags were mapped to cluster centroids of taxonomically annotated 16S reference sequences from the SILVA database (23) (release 115: SSU Ref NR 99) that had been clustered at 97% sequence identity with USEARCH v6.0.307 (68). 16S mitag counts were normalized by the total sum for each sample. In addition, we identified protein-coding marker genes suitable for metagenomic species profiling using fetchMG (13) in all 137.5 M gene-coding sequences and clustered them into metagenomic operational taxonomic units (mOTUs) that group organisms into species-level clusters at higher accuracy than 16S OTUs as described in (13, 24). Relative abundances of mOTU linkage groups were quantified with MOCAT (version 1.3) (18).

Functional profiling using the OM-RGC
Gene abundance profiles were generated by mapping HQ reads from each sample to the OM-RGC (MOCAT options screen and filter with length and identity cutoffs of 45 and 95%, respectively, and paired-end filtering set to yes). The abundance of each reference gene in each sample was calculated as gene lengthâ€“normalized base and insert counts (MOCAT option profile). Functional abundances were calculated as the sum of the relative abundances of reference genes, or key marker genes (14), annotated to different functional groups (OGs, KOs, and KEGG modules). For each functional module, the abundance was calculated as the sum of relative abundances of marker KOs normalized by the number of KOs. For comparative analyses with the human gut ecosystem, we used the subset of the OM-RGC that was annotated to Bacteria or Archaea (24.4 M genes). Using a rarefied (to 33 M inserts) gene count table, an OG was considered to be part of the ocean microbial core if at least one insert from each sample was mapped to a gene annotated to that OG. Samples from the human gut ecosystem were processed similarly, and a list of all OGs that were defined in either the ocean or the gut as core is provided in (14).
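
The length normalization and module summarization described above can be sketched as follows (function and variable names are illustrative, not MOCAT's; only base-count-style normalization is shown, not insert counts):

```python
def gene_abundances(read_counts, gene_lengths):
    """Divide each gene's raw count by its length, then convert to
    relative abundances (sketch of the scheme described in the text,
    not MOCAT itself)."""
    norm = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(norm.values())
    return {g: v / total for g, v in norm.items()}

def module_abundance(ko_abundances, marker_kos):
    """Module abundance: sum of the marker KOs' relative abundances
    normalized by the number of KOs in the module."""
    return sum(ko_abundances.get(k, 0.0) for k in marker_kos) / len(marker_kos)
```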

MetaPhlAn first needs to compare each metagenomic read from a sample with the catalog of marker genes for finding sequence matches. Although the software performs this operation internally using the NCBI program BLASTn (default threshold on the e-value of 1 × 10^-6), any other sequence-matching method (including accelerated methods like MapX (Real Time Genomics), MBLASTX (Multicoreware) and USEARCH (ref. 18)) can be used by providing MetaPhlAn with the generated alignment results in a tabular format. MetaPhlAn also provides an option to perform the alignment operation with multiple processors, which involves internally splitting the input reads into multiple independent sets that can be aligned using different CPUs. The classifier then assigns read counts to the microbial clades according to the alignment prediction; in the rare case of multiple matches with markers from different clades, only the best hit is considered. For each clade, read counts can be assigned directly using the clade-specific markers or indirectly by considering the reads assigned to all direct descendants. Total read counts are estimated with the direct approach if the clade has a sufficient total size of the marker set (default 10,000 nt); otherwise the indirect approach is preferred. Moreover, for all clades except the leaves of the taxonomic tree (species), the two estimations are compared, and an ‘unclassified’ subclade is added when the clade-specific read count is larger than the sum of descendant counts; the difference between the two estimations is assigned to the new descendant. Relative abundances are estimated by weighting read counts assigned using the direct method with the total nucleotide size of all the markers in the clade and normalizing by the sum of all directly estimated weighted read counts.
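
The direct-versus-indirect choice can be sketched like this (the data structures are illustrative, not MetaPhlAn's internals):

```python
def clade_read_count(clade, marker_counts, clade_markers, children,
                     min_marker_nt=10000):
    """Direct vs. indirect estimation sketch: if a clade's own markers
    total at least min_marker_nt nucleotides, count the reads assigned
    to those markers directly; otherwise sum the estimates of the
    clade's direct descendants."""
    markers = clade_markers.get(clade, {})   # marker -> size in nt
    if sum(markers.values()) >= min_marker_nt:
        return sum(marker_counts.get(m, 0) for m in markers)
    return sum(clade_read_count(c, marker_counts, clade_markers,
                                children, min_marker_nt)
               for c in children.get(clade, []))
```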

Metagenomic data acquisition and processing are also detailed elsewhere (ref. 33) and are summarized here. Libraries were constructed for a subset of available samples (http://hmpdacc.org/tools_protocols/tools_protocols.php) and sequenced targeting ~10 Gb of total sequence per sample using 101-bp paired-end reads on the Illumina GAIIx platform. Human reads were identified and removed from each sample using 18-mer matches with the Best Match Tagger, BMTagger (http://cas-bioinfo.cas.unt.edu/sop/mediawiki/index.php/Bmtagger). Each resulting set of sequences was then subjected to three protocols: read-based metabolic reconstruction, assembly-based gene calling, and reference genome mapping.
Duplicate and low-quality reads (as defined by BWA (ref. 42), q=2) were removed from each sample, as were low-complexity reads and any resulting sequences of <60 nt (http://hmpdacc.org/HMIWGS). For metabolic reconstruction, remaining sequences were mapped using MBLASTX (MulticoreWare, St. Louis, MO) with default parameters against a functional sequence database including the KEGG orthology v54. Up to the 20
most significant hits at E<1 were provided as hits to HUMAnN (ref. 43), generating abundance and coverage results for each KEGG metabolic pathway and module of interest (http://hmpdacc.org/HMMRC). For gene cataloging, sequences from each sample were assembled using SOAPdenovo (ref. 44) v1.04 with parameters -K 25 -R -M 3 -d 1. Contigs were filtered using a custom Perl script fasta2apg.pl (http://hmpdacc.org/tools_protocols/tools_protocols.php) derived from AMOS (ref. 45), and scaffolds >300 nt were retained. Coding sequences were predicted from the metagenome assemblies using MetaGeneMark (ref. 46) and annotated using the JCVI metagenome annotation pipeline (ref. 47) and are available in the HMP Gene Index (http://hmpdacc.org/HMGI). A non-redundant gene catalog was made from the Gene Index using USEARCH (ref. 48) with 95% identity and 90% coverage thresholds (http://hmpdacc.org/HMGC).

Species-level taxonomic abundances were inferred for all samples using MetaPhlAn 1.1 (ref. 49) (http://hmpdacc.org/HMSMCP). Briefly, starting from all 2,887 genomes available from IMG (ref. 50) as of November 2011, we systematically identified the 400,141 genes most representative of each taxonomic unit due to their intra-clade universality and inter-clade uniqueness, resulting in a marker catalog spanning 1,221 species. Metagenomic reads were searched against this unique sequence database, and the relative abundance of taxa in each sample was estimated using the 10% truncated (robust) mean copy number and length-normalized marker relative abundance per species.
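
The 10% truncated (robust) mean used for the per-species estimate is simple to state in code (a generic sketch, not MetaPhlAn's implementation):

```python
def truncated_mean(values, trim=0.10):
    """10% truncated mean: drop the lowest and highest `trim` fraction
    of values, then average the rest. Outlying marker abundances (e.g.
    a marker cross-mapping from a close relative) are discarded this
    way rather than skewing the species estimate."""
    xs = sorted(values)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)
```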

The seven selected metagenomes were analyzed by BLASTX against the COG (Tatusov et al., 2003) and KEGG (Kyoto Encyclopedia of Genes and Genomes; Kanehisa et al., 2004) databases, applying a cut-off bit score of >40, and normalized to the number of genomic fragments for each metagenome. The numbers of hits to the functional categories from COG and KEGG with at least 10 and 1 matches, respectively, in all seven metagenomes were normalized to the number of matches in each metagenome and used for cluster analysis (831 COG functional categories, 148 KEGG functional groups). The percentage of each COG and KEGG functional category shared in the seven metagenomes was determined and shown in cross-comparative heat maps. The COG functional categories showing the highest variation among the seven metagenomes were determined by applying the following cut-off criterion: >10% standard deviation among the metagenomes within the functional categories. The remaining 74 COG functional categories are shown (see below, Figure 5). The seven metagenomes had matches to a total of 207 different KEGG functional groups. KEGG groups without matches in one of the metagenomes and <20 matches in at least one other metagenome were eliminated. The remaining 148 functional KEGG groups were used for cluster analysis. Unique, overrepresented, and underrepresented groups were identified using an s.d. cut-off of 8% and additional manual inspection. The application of different metrics and different sampling depths did not change the metagenome sample tree topology in the KEGG analysis (applied to 148 KEGG groups). Hierarchical cluster analysis was performed with MeV 4.4 (Saeed et al., 2003) applying Euclidean distance and Kendall's tau metrics and average linkage clustering.

Shotgun reads were filtered using custom Perl scripts and publicly available software to remove (1) all reads <60 nucleotides; (2) Titanium pyrosequencing reads with two continuous and/or three total degenerate bases (N); (3) all duplicates (a known artefact of pyrosequencing), defined as sequences in which the initial 20 nucleotides are identical and that share an overall identity of >97% throughout the length of the shortest read (ref. 37); and (4) all sequences with significant similarity to human reference genomes (Blastn E-value threshold ≤ 10^-5, bit score ≥ 50, percentage identity ≥ 75%) to ensure the continued de-identification of samples.
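
The duplicate definition in (3) (identical initial 20 nucleotides plus >97% overall identity over the shortest read) can be sketched as follows; position-wise matching stands in for a real alignment here:

```python
def is_duplicate(read_a, read_b, prefix=20, min_identity=0.97):
    """Pyrosequencing duplicate check sketch: the first 20 nt must be
    identical, and overall identity over the shorter read must exceed
    97%. Identity is approximated by position-wise matches rather
    than an alignment, so indels are not handled."""
    if read_a[:prefix] != read_b[:prefix]:
        return False
    n = min(len(read_a), len(read_b))
    matches = sum(1 for x, y in zip(read_a[:n], read_b[:n]) if x == y)
    return matches / n > min_identity
```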

Searches against the database of 126 human gut bacterial genomes were conducted with Blastn. A sequence read was annotated as the best hit in the database if the E-value was ≤ 10^-5, the bit score was ≥ 50, and the alignment was at least 95% identical between query and subject. Relative abundances of reads mapped to each of the 126 genomes were adjusted to genome sizes. Searches against the protein-coding component of the KEGG database (v58) and against COG (v8.3) were conducted with BLASTX. (Note that when we performed searches against a separate KEGG database of intergenic regions alone, very few hits were observed.) Counts were normalized to the mapped reads. In total, 40 ± 8% of reads were mapped to KEGG KOs and 56 ± 11% to COGs; 44 ± 16% of the reads mapped to the 126 gut genomes using a 95% sequence similarity cut-off. Unmapped reads were excluded from the analyses shown in the main text, although repeating the analyses including these reads had little effect on the results. To quantify the differences in KEGG EC profiles among fecal microbiomes, evenly rarefied matrices of EC counts were created with all samples, and Hellinger distances were calculated using QIIME.
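
As a reference point for the Hellinger calculation (a generic sketch, not QIIME's code; note that conventions differ on whether the 1/√2 factor is included, so values may differ from a given tool by that constant):

```python
import math

def hellinger(counts_a, counts_b):
    """Hellinger distance between two samples' EC count vectors:
    convert each to relative abundances, then take the Euclidean
    distance between their element-wise square roots, scaled by
    1/sqrt(2) so the result lies in [0, 1]."""
    ta, tb = sum(counts_a), sum(counts_b)
    return math.sqrt(sum(
        (math.sqrt(a / ta) - math.sqrt(b / tb)) ** 2
        for a, b in zip(counts_a, counts_b)
    ) / 2)
```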

Spearman rank correlations were carried out using the R statistical software (ref. 38). To identify bacterial taxa that change with increasing age in each population, the proportion of reads that map to each of the 126 reference sequenced human gut genomes in each fecal microbiome was identified. The relative abundance of reads from each genome was then correlated with age (years) for each geographic region. To identify genes encoding ECs that change with age, the proportion of reads annotated with each EC in each fecal microbiome was identified. The relative abundance of each EC was subsequently correlated with age (years) for each geographic region.
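
Spearman's rank correlation, as used here, is just the Pearson correlation of (tie-averaged) ranks; a dependency-free sketch of correlating per-genome relative abundance with host age:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via Pearson correlation of ranks,
    with average ranks assigned to ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice you would pair each sample's abundance for one genome with that subject's age and compute rho per genome, per region, as described above.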

Right now we are interested in assembly, to get full-length genomes, so we generally use 2×250-bp paired-end sequencing with a 500-800-bp insert. This facilitates longer and more accurate de novo assemblies of genetic material within a community metagenome.

We always go for genomes, and for that I find the combination of a large insert size (the longest your sequencing facility can do) + Illumina HiSeq to work the best. Longer reads are probably better, but I would think that as long as you have sufficient coverage, any read length above 100 bp is probably OK (that’s my unproven intuition). So the advantage of having 250-bp reads is probably the higher coverage that you get compared to sequencing the same number of reads at, say, 100 bp, and not because read length helps assembly in any way.
The downside of using short reads, as Titus mentioned, is that dealing with strain variation is difficult. Digital normalization and subsampling can definitely help, but if the different strains appear at similar abundances in the community then I suspect that assembly would still fail.

Accurate long reads (e.g. Moleculo) can help with strain variation, at least if the organisms are abundant enough so that you can get sufficient coverage for them. Long reads are also very useful for evaluating rare organisms in the community for which you do not have enough coverage from short reads. The downside of using long reads is that all currently available technologies that I know of do not provide sufficient coverage depth from a single run to comprehensively evaluate complex communities. So you’ll need to either do a lot of sequencing or use these together with short read technologies.

In the paper Titus mentioned (http://genome.cshlp.org/content/25/4/534.abstract) we described our experience of using HiSeq and Moleculo data for metagenomics. Minimus2 proved to be very useful for the assembly of the long reads (although not much assembly was achieved, due to the high complexity of the communities we studied). We came up with a graph-based method for reconstructing genome architecture from long reads coming from closely related species and strains whose genomes cannot be assembled. Overall I think that combining HiSeq with Moleculo was very useful and provided much better results than using either technology alone, but this conclusion may change depending on the environment.