This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Abstract

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination by non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini using approaches routinely employed by microbial ecologists who reconstruct bacterial and archaeal genomes from metagenomic data. We created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and that most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

Author Comment

This study is currently under peer review, and we wished to make our pre-print available to the community the way it was sent to reviewers. On the other hand, if we were not committed to providing a pre-print that is identical to the version under peer review, we would have revised it to better clarify one important point: We are not suggesting that the tardigrade genome from our curation of the Boothby et al. assembly represents a better genome than the genome curated by Koutsovoulos et al. The most likely explanation for the 47 Mbp size difference between the two genomes is the better resolved repeat regions due to the inclusion of Moleculo reads in Boothby et al.'s analysis. As two research parasites, we have the utmost respect for all of the research groups who raised funds, performed experiments, generated data, and made them publicly available.

Additional Information

Competing Interests

A. Murat Eren is an Academic Editor for PeerJ.

Author Contributions

Tom O Delmont conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

A. Murat Eren conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Data Deposition

The following information was supplied regarding data availability:

http://merenlab.org/data/

Funding

This work was supported by the Frank R. Lillie Research Innovation Award, and startup funds from the University of Chicago. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Thanks very much for this preprint. I think Anvi'o is a great way to visualise lots of different lines of information and evidence and I definitely want to use it in future genome projects.

As one of the authors of the bioRxiv preprint questioning the Boothby et al. results and methods (Koutsovoulos et al. 2015), I would like to clarify the following:

a. This paper should make it explicit somewhere that all the filtering / analyses on the Edinburgh assembly were on our initial, unoptimised, made-for-purposes-of-screening-only assembly (nHd.1.0). It does mention our final assembly (should preferably refer to it as version nHd.2.3 to avoid confusion) but doesn't show Anvi'o run on it and I don't think that is clear.

As it stands, the terms raw/curated/final/draft might be confusing for some readers. The terms could also inadvertently imply that the curated version of our raw assembly (nHd.1.0) is the same as our final version (nHd.2.3, which is also called curated: "These authors subsequently curated a 135 Mbp draft genome", even though ours was not just curated, but reassembled from filtered reads).

b. This paper could also acknowledge somewhere that the final 182 Mbp curated UNC/Boothby et al assembly was not optimised. The anvi’o process identifies which contigs are likely contaminants. But the remaining tardigrade-origin contigs are not optimised, by which we mean reassembled given more uniform coverage. Once we remove the contaminating reads from the data set, coverage-aware assemblers will do a better job as they will not be dealing with very low coverage bacterial genomes messing up median/mode coverage estimates. Our nHd.2.3 assembly is thus not simply a filtered subset of nHd.1.0, but a substantially improved assembly once contaminating reads were removed.
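The effect described in point b, where low-coverage bacterial contigs distort median coverage estimates, can be illustrated with a toy calculation. This is only a sketch: the coverage values below are hypothetical and are not taken from either data set, and `median_coverage` is an illustrative helper, not part of any assembler.

```python
from statistics import median

# Hypothetical per-contig mean coverage values (illustrative only).
# Assumed tardigrade-origin contigs, sequenced at roughly 100x.
target_contigs = [98, 102, 105, 95, 110, 100, 97, 103]
# Assumed contaminant contigs from low-abundance bacteria, at roughly 5x.
contaminant_contigs = [3, 5, 2, 8, 4, 6, 7, 3, 5, 4]


def median_coverage(coverages):
    """Return the median of a list of per-contig coverage values."""
    return median(coverages)


# Median over the unfiltered assembly (target plus contaminants).
mixed = median_coverage(target_contigs + contaminant_contigs)
# Median after the contaminant contigs have been removed.
filtered = median_coverage(target_contigs)

print(f"median coverage, unfiltered: {mixed}")
print(f"median coverage, filtered:   {filtered}")
```

In this toy case, many low-coverage contaminant contigs pull the median far below the coverage of the target genome, which is the kind of skew a coverage-aware assembler would no longer face once the contaminating reads are removed.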

The rest of the paper is clear, and shows the utility of Anvi'o as an excellent visualisation tool. I especially liked the insights into the bacterial genome that was common to both samples.

This is obviously not a comprehensive review, but these points also jumped out at me:

This sentence gives the impression that our final assembly was as contaminated as the UNC assembly. We state in our preprint that we think the remaining contamination is on the order of a few stray fragments. Also, I think this statement implies that this was the only test for contamination in our paper (so perhaps this paper could include a phrase to suggest that the 2d scatterplot/blobplot was one of many tests?)

2. "A larger draft genome for H. dujardini.... This finding is in agreement with Koutsovoulos et al.’s findings; however, our curated draft genome is 47 Mbp larger than the draft genome released by Koutsovoulos et al. The portion of scaffolds covered by RNA-Seq data suggests that the additional 47 Mbp still originate from the tardigrade genome. Thus, our selection is likely to be a more complete draft genome for H. dujardini than that of Koutsovoulos et al., most probably due to Boothby et al.’s inclusion of longer reads. Regardless, long reads considerably improved Boothby et al.’s assembly..."

I think this paper correctly points out that approx 70 Mbp of the Boothby et al 252 Mbp assembly is contaminating sequence (and includes >96% of the proposed HGT). However, it suggests that our final nHd.2.3 assembly is less complete, on the basis of size alone. I would argue that our 135 Mbp genome is MORE complete than any subset of the 252 Mbp Boothby et al genome because 92.8% of RNA-seq reads map to ours vs 89.5% of RNA-seq reads mapping to theirs (see Table 1 in the bioRxiv preprint).

It is possible that some of the contigs in the 182 Mbp curated Boothby et al assembly are longer and supersede contigs in the Edinburgh nHd.2.3 assembly. But we think our assembly is overall better "for our purpose" than the curated Boothby et al subset (our lab's goal is typically to discover what genes/gene families are present etc). Yes, we might have collapsed some repeats/haploid contigs. We don't claim we have the best assembly possible. We think it is a good assembly given short-read data. A reassembly with Boothby et al's longer reads and after filtering reads from contaminating bacteria would be better, obviously. However, Boothby et al themselves say neither Moleculo nor PacBio improved their N50 much (it stayed around 15 kb, compared to approx 50 kb for nHd.2.3). In fact, they didn't use their PacBio data at all in the final assembly according to Supp Info.

Our best guess is that the 182 Mbp is an expanded genome (flow cytometry suggests 75-110 Mb by Goldstein's own estimates, by T Ryan Gregory's, and by our own flow cytometry estimates). The extra assembled sequence is most likely because of better resolved repeat regions and uncollapsed haplo contigs (probably because of longer reads, we haven't checked). Ours is somewhat expanded too. But not as much.

3. "Although scatterplots can describe the organization of contigs in assembly results, they suffer from limited number of dimensions they can display, and their inability to depict complex supporting data that can improve the identification of individual genomes. These limitations are particularly problematic in sequencing projects covering multiple sequencing libraries, where displaying mapping results from each library can help detecting sources of contaminants. Despite their successful applications, two dimensional scatter plots limit researchers to the use of simple characteristics of the data that can be represented on an axis (such as GC-content). In contrast, clustering scaffolds, and overlaying multiple layers of independent information produce more comprehensive visualizations that display multiple aspects of the data."

Absolutely. Multiple-track based visualisations are fabulous for seeing lots of evidence at once for a particular region. However, I would suggest 2d scatterplots with additional seq identity info (i.e. blobplots) and Anvi'o are complementary rather than one being better than the other: one lets you see the whole picture quickly/intuitively, while the other lets you identify specific cases using lots of evidence. That's certainly how we plan to use Anvi'o in future. Thanks for a terrific, well-engineered tool.
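For readers unfamiliar with the 2d scatterplots discussed in point 3, the two axes are typically per-scaffold GC-content and coverage. The sketch below shows how such points could be derived. The scaffold names, sequences and coverage values are hypothetical, and `gc_content` is an illustrative helper rather than a function from Anvi'o or any blobplot tool.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        # No unambiguous bases (e.g. a run of Ns): report 0.0 by convention.
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Hypothetical scaffolds: name -> (sequence, mean coverage).
scaffolds = {
    "scaffold_a": ("ATGCGCGCAT", 100.0),
    "scaffold_b": ("ATATATATAT", 5.0),
}

# One (GC-content, coverage) point per scaffold, ready for a 2d scatterplot.
points = {
    name: (gc_content(seq), coverage)
    for name, (seq, coverage) in scaffolds.items()
}

for name, (gc, coverage) in points.items():
    print(f"{name}\tGC={gc:.2f}\tcoverage={coverage}")
```

Each scaffold is reduced to just two numbers here, which is what makes the scatterplot quick to read. A multi-track display keeps further layers per scaffold, such as coverage in each sequencing library.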
