Coverage

From ArachneWiki

Coverage is the term used to quantify the extent to which a large assembly object is covered by instances of smaller objects. The three contexts in which coverage is most often discussed are read coverage of a genome, read coverage of a contig, and contig coverage of a supercontig. Coverage may be evaluated at a particular location, in which case it is an integer value, or it may be averaged over a region, in which case it is generally a fractional value. Coverage may be nonexistent (0X), partial (less than 1X), single (1X), or multiple (2X, 3X, and so forth).
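
The distinction between coverage at a location and coverage over a region can be sketched in a few lines of Python. This is only an illustration, not Arachne code: the function names and the half-open interval convention `[start, end)` are assumptions made for the example.

```python
def coverage_profile(target_length, intervals):
    """Return per-position coverage of a target by half-open intervals [start, end)."""
    depth = [0] * target_length
    for start, end in intervals:
        # Clip each interval to the target so stray coordinates are ignored.
        for pos in range(max(start, 0), min(end, target_length)):
            depth[pos] += 1
    return depth


def region_coverage(depth, start, end):
    """Average coverage over [start, end); usually a fractional value."""
    return sum(depth[start:end]) / (end - start)


# Three small objects placed on a target of length 10.
profile = coverage_profile(10, [(0, 6), (4, 10), (5, 8)])
print(profile[5])                       # coverage at one location: an integer
print(region_coverage(profile, 0, 10))  # coverage over a region: an average
```

Coverage at a single position is a count of overlapping objects, while the regional value is the mean of those counts.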

Read coverage of a genome

The most common meaning of "coverage" is the average coverage of a genome by input reads. The read coverage of a genome project is the total length of all the reads that have been sequenced, divided by the total number of bases in the genome. Note that each insert contributes only the length of its read pair to coverage, not its own full length.
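
As a worked example of this definition, the following minimal sketch divides the total read length by the genome size. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def read_coverage(read_lengths, genome_size):
    """Total sequenced read length divided by the number of bases in the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical project: 500 inserts, each sequenced as a pair of 700-base reads.
# Each insert contributes 2 * 700 bases, regardless of the insert's full length.
reads = [700] * (2 * 500)
print(read_coverage(reads, genome_size=100_000))
```

With these made-up numbers the project has 700,000 read bases over a 100,000-base genome, that is, 7X coverage.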

Higher coverage generally produces a better assembly. Low-coverage assemblies have many gaps, because random sampling leaves some regions unsequenced, and confidence in the contig consensus is low. (However, even high-coverage assemblies may have gaps due to cloning bias.) Mammalian genomes are typically considered thoroughly covered when coverage is about 7X-8X, although this can be hard to pinpoint because it is difficult to know a genome's size a priori.
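
One way to get a feel for the effect of random chance is an idealized Poisson model, which this page does not itself describe: if read start positions are assumed to be uniformly random and independent, the expected fraction of bases with zero coverage at average coverage c is approximately e^(-c). The sketch below evaluates that expression. It ignores cloning bias, repeats, and end effects, so it should be read as a rough lower bound on gaps, not as a prediction for a real project.

```python
import math


def expected_uncovered_fraction(coverage):
    """Approximate fraction of bases left uncovered under an idealized
    uniform-random (Poisson) sampling model. This model is an assumption."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)


for c in (1, 2, 4, 8):
    print(f"{c}X: about {expected_uncovered_fraction(c):.4%} of bases uncovered")
```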

The module BasicAssemblyStats describes two types of genome coverage. Sequence coverage assumes the genome is exactly the size specified in genome.size. Assembly coverage is the coverage of the draft assembly itself; it therefore assumes a genome "size" equal to the sum of all supercontig lengths, and it depends on the choice of SUBDIR. The module also reports Q20 coverage, which is the coverage by bases with a quality score Q ≥ 20.
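
The difference between these quantities comes down to which denominator and which bases are counted. The following sketch illustrates this under stated assumptions; it is not the BasicAssemblyStats implementation, and the function name, the dictionary keys, and the representation of reads as lists of per-base quality scores are all hypothetical.

```python
def coverage_stats(read_quals, genome_size, supercontig_lengths, q_threshold=20):
    """Illustrative coverage figures.

    read_quals: one list of per-base quality scores per read (assumed input form).
    genome_size: the assumed genome size, as would be given in genome.size.
    supercontig_lengths: lengths of the supercontigs in the draft assembly.
    """
    total_bases = sum(len(quals) for quals in read_quals)
    good_bases = sum(1 for quals in read_quals for q in quals if q >= q_threshold)
    assembly_size = sum(supercontig_lengths)
    return {
        # Denominator is the stated genome size.
        "sequence_coverage": total_bases / genome_size,
        # Denominator is the size of the draft assembly itself.
        "assembly_coverage": total_bases / assembly_size,
        # Only bases with Q >= threshold count toward the numerator.
        "q20_sequence_coverage": good_bases / genome_size,
    }


# Five toy reads of 10 bases each, 8 of which have Q >= 20.
toy_reads = [[30] * 8 + [10] * 2 for _ in range(5)]
print(coverage_stats(toy_reads, genome_size=10, supercontig_lengths=[5, 3]))
```

In the toy data the assembly is smaller than the stated genome size, so assembly coverage comes out higher than sequence coverage.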

Read coverage of a contig

Read coverage of a contig is the number of reads that contribute to the contig consensus. It is generally on par with the overall assembly coverage, although it varies due to cloning bias. Regions of particularly high read coverage often indicate repeats, while regions of low coverage may indicate misassemblies, because the consensus there relies heavily on a single read-read alignment. Read coverage is easy to visualize with DisplaySupercontig, simply by eyeballing the height of the read stacks.

Sometimes, during the assembly process, a contig will develop a "bald spot", a region with no read coverage. The module ContigSanitizer finds these spots and will use them to split up contigs.
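
The idea of finding bald spots and splitting at them can be sketched as follows. This is not the ContigSanitizer code, whose internals are not described here; the function names, the half-open coordinate convention, and the choice to drop the uncovered bases from the resulting pieces are assumptions made for the example.

```python
def find_bald_spots(contig_length, read_placements):
    """Return maximal regions [start, end) of the contig with zero read coverage."""
    depth = [0] * contig_length
    for start, end in read_placements:
        for pos in range(max(start, 0), min(end, contig_length)):
            depth[pos] += 1

    spots = []
    spot_start = None
    for pos, d in enumerate(depth):
        if d == 0 and spot_start is None:
            spot_start = pos                 # a bald spot begins here
        elif d > 0 and spot_start is not None:
            spots.append((spot_start, pos))  # the bald spot ended just before pos
            spot_start = None
    if spot_start is not None:
        spots.append((spot_start, contig_length))
    return spots


def split_at_bald_spots(contig_length, spots):
    """Return the covered pieces [start, end) left after cutting out each bald spot."""
    pieces = []
    prev_end = 0
    for start, end in spots:
        if start > prev_end:
            pieces.append((prev_end, start))
        prev_end = end
    if prev_end < contig_length:
        pieces.append((prev_end, contig_length))
    return pieces


spots = find_bald_spots(20, [(0, 8), (5, 10), (13, 20)])
print(spots)
print(split_at_bald_spots(20, spots))
```

In this toy contig, positions 10 through 12 have no supporting reads, so the contig is split into two pieces on either side of that region.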

Contig coverage of a supercontig

Contig coverage of a supercontig is usually 1X, but it departs from 1X wherever there are gaps. Positive-size gaps create regions of zero contig coverage, while negative-size gaps create overlaps, that is, regions of multiple coverage. These regions are denoted in BasicAssemblyStats.
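
The effect of gap sign on contig coverage can be illustrated with a small sketch. The layout rule used here (each contig starts where the previous one ended, shifted by the gap size) and the function names are assumptions for the example, not a description of how Arachne stores supercontigs. The sketch also assumes no negative gap is large enough to push a contig before position 0.

```python
def contig_layout(contig_lengths, gaps):
    """Place contigs left to right; gaps[i] separates contig i and contig i + 1.

    A positive gap leaves uncovered bases; a negative gap makes contigs overlap.
    """
    if len(gaps) != len(contig_lengths) - 1:
        raise ValueError("need exactly one gap between each pair of adjacent contigs")
    placements = []
    pos = 0
    for i, length in enumerate(contig_lengths):
        placements.append((pos, pos + length))
        if i < len(gaps):
            pos = pos + length + gaps[i]
    return placements


def contig_coverage(placements):
    """Per-position contig coverage along the supercontig."""
    total_length = max(end for _, end in placements)
    depth = [0] * total_length
    for start, end in placements:
        for p in range(start, end):
            depth[p] += 1
    return depth


# Three contigs with a gap of +2 and then a gap of -1.
layout = contig_layout([5, 4, 3], gaps=[2, -1])
print(layout)
print(contig_coverage(layout))
```

The positive gap shows up as a run of zeros in the coverage profile, and the negative gap shows up as a position with coverage 2.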