Choosing the right read length for diagnostic sequencing

Whole exome sequencing (WES) plays an important role in research as well as in genetic diagnostics. In WES, coding regions of the genome are enriched and then sequenced in high throughput. We have previously shown how different enrichment methods perform with respect to covered regions, underrepresented regions and sequencing efficiency.

In this Tech Note, we focus on the most important sequencing parameter, read length. Modern next-generation sequencing platforms offer a range of read configurations, such as single-read (SR) and paired-end (PE) sequencing, with 75 bp, 100 bp, and 150 bp per read in routine use.

A good choice of read length is closely tied to the insert size of the sequencing library, i.e., the length of the individual DNA fragments that are sequenced. This size depends on the library preparation protocol and can be adjusted during library preparation.

During paired-end sequencing, two sequencing reads are generated for each library molecule, one from each end (see figure 1a). If the read length is significantly larger than half the average insert size, the ends of the reads will overlap (figure 1b).
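
The overlap condition can be illustrated with a small calculation. The following is a minimal sketch and not part of the original analysis: the function name `overlap_bases` is hypothetical, and it assumes an idealized library in which every fragment has exactly the stated insert size and bases read past the end of the insert (adapter read-through) are ignored.

```python
def overlap_bases(read_length: int, insert_size: int) -> int:
    """Number of insert bases covered by both reads of a pair.

    Assumption: the two reads start at opposite ends of the insert,
    and anything read beyond the insert (adapter) is not counted.
    """
    if read_length <= 0 or insert_size <= 0:
        raise ValueError("read_length and insert_size must be positive")
    # A read cannot cover more of the insert than the insert is long.
    covered_per_read = min(read_length, insert_size)
    # The reads overlap as soon as together they exceed the insert length.
    return max(0, 2 * covered_per_read - insert_size)


if __name__ == "__main__":
    for read_length in (75, 100, 150):
        overlap = overlap_bases(read_length, 220)
        print(f"PE{read_length}, 220 bp insert: {overlap} bp overlap")
```

Under these assumptions, a 220 bp insert gives no overlap for PE75 and PE100, and an 80 bp overlap for PE150.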

To choose optimal sequencing parameters, one has to find a balance that minimizes off-target reads while also minimizing overlapping read ends, given the size of coding regions in the human genome.

To make this point clear, we use the terms "overlapping coverage" and "diagnostic coverage". Overlapping coverage on target is obtained by simply counting all sequenced bases that overlap the target region. Diagnostic coverage, on the other hand, is obtained by counting only informative bases, i.e., by removing all bases that are sequenced twice due to overlapping read ends.
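
The two definitions can be sketched for a single read pair as follows. This is an illustrative sketch only: the names `clipped_length` and `pair_coverage` are hypothetical, reads and target are modeled as half-open intervals on one reference sequence, and alignment details such as soft clipping, indels and base quality are ignored.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end) in reference coordinates


def clipped_length(interval: Interval, target: Interval) -> int:
    """Number of bases of `interval` that fall inside `target`."""
    start = max(interval[0], target[0])
    end = min(interval[1], target[1])
    return max(0, end - start)


def pair_coverage(read1: Interval, read2: Interval, target: Interval) -> Tuple[int, int]:
    """Return (overlapping, diagnostic) on-target base counts for one read pair."""
    # Overlapping coverage: every on-target base of every read is counted.
    overlapping = clipped_length(read1, target) + clipped_length(read2, target)
    # Region covered by both reads; yields 0 below if the reads do not overlap.
    shared = (max(read1[0], read2[0]), min(read1[1], read2[1]))
    sequenced_twice = clipped_length(shared, target)
    # Diagnostic coverage: bases sequenced twice are counted only once.
    return overlapping, overlapping - sequenced_twice


if __name__ == "__main__":
    target = (1000, 1220)  # hypothetical 220 bp target
    print(pair_coverage((1000, 1150), (1070, 1220), target))  # PE150-like pair
    print(pair_coverage((1000, 1100), (1120, 1220), target))  # PE100-like pair
```

For the hypothetical 220 bp fragment in the example, the PE150-like pair contributes 300 bases of overlapping coverage but only 220 informative bases, whereas the PE100-like pair contributes 200 bases under both definitions.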

Using exome sequencing data, we evaluated the difference between overlapping and diagnostic coverage for PE100 and PE150 read lengths, based on 8 Gb of raw sequencing output each (see table 1).

The difference is already visible in the naive overlapping coverage (99x for PE100 versus 93x for PE150). Because fewer molecules are sequenced in the PE150 dataset, yielding fewer reads, off-target reads have a higher impact on overall coverage. For diagnostic coverage, the gap widens (78x versus 61x), simply because more of the sequencing data is informative at the shorter read length.

When read ends overlap, the nucleotides in the middle of the molecule are sequenced twice. While this adds to raw coverage, it provides no additional information on the sequenced genome. Only coverage resulting from different library molecules increases diagnostic sensitivity.

One could argue that a longer read length generates more output from the same number of input molecules, and that the insert size should therefore be increased accordingly to make use of this extra data. For instance, for a PE150 sequencing run, the insert size would have to be above 300 bp.

However, the average size of human coding exons is only 160 bp. Including flanking intronic regions of diagnostic significance (e.g., due to splice variants) of about 30 bp on either side, the average target of interest is about 220 bp long. Larger insert sizes lead to a higher proportion of the sequencing data falling outside the region of interest ("off-target" reads), wasting sequencing capacity (see figure 2).
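
The arithmetic behind this argument can be written out as a short sketch. The function names are hypothetical, and the off-target estimate is a best-case lower bound under strong simplifying assumptions: a single target of average size, a fragment that fully contains the target, and an insert that is sequenced over its entire length.

```python
AVG_EXON_BP = 160  # average human coding exon size, as stated in the text
FLANK_BP = 30      # flanking intronic region on each side, as stated in the text


def target_length(exon_bp: int = AVG_EXON_BP, flank_bp: int = FLANK_BP) -> int:
    """Exon plus flanking regions on both sides."""
    return exon_bp + 2 * flank_bp


def min_off_target_fraction(insert_size: int, target_bp: int) -> float:
    """Best-case fraction of an insert that lies outside a single target."""
    if insert_size <= 0:
        raise ValueError("insert_size must be positive")
    # Even a fragment that contains the whole target extends beyond it
    # by at least insert_size - target_bp bases.
    return max(0, insert_size - target_bp) / insert_size


if __name__ == "__main__":
    target_bp = target_length()
    for insert_size in (220, 300):
        fraction = min_off_target_fraction(insert_size, target_bp)
        print(f"{insert_size} bp insert, {target_bp} bp target: "
              f"at least {fraction:.0%} off target")
```

With these numbers, a 300 bp insert places at least 80 bp (about 27%) of each fragment outside a 220 bp target, while a 220 bp insert can in the best case lie entirely on target. Real libraries have a distribution of insert sizes and fragment positions, so actual off-target fractions can only be equal to or higher than this bound.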

Sequencing Data    No. of Reads    Sequencing Coverage    Diagnostic Coverage
8 Gb, PE100        80 million      99x                    78x
8 Gb, PE150        53 million      93x                    61x

Table 1: Difference between overlapping and diagnostic coverage for different read lengths
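
The read counts in Table 1 follow directly from the output volume and the read length, and the two coverage columns can be turned into a share of informative data. The sketch below uses only the values reported in Table 1. The function names are hypothetical, and it assumes that 1 Gb means 10^9 bases and that "No. of Reads" counts individual reads rather than read pairs.

```python
TOTAL_OUTPUT_BP = 8_000_000_000  # 8 Gb of raw output (assuming 1 Gb = 1e9 bases)

# read length -> (sequencing coverage, diagnostic coverage) as reported in Table 1
REPORTED = {100: (99, 78), 150: (93, 61)}


def read_count(total_bp: int, read_length: int) -> float:
    """Number of individual reads that make up the given output volume."""
    return total_bp / read_length


def informative_fraction(sequencing_cov: float, diagnostic_cov: float) -> float:
    """Share of on-target coverage that remains after removing overlaps."""
    return diagnostic_cov / sequencing_cov


if __name__ == "__main__":
    for read_length, (sequencing_cov, diagnostic_cov) in REPORTED.items():
        millions = read_count(TOTAL_OUTPUT_BP, read_length) / 1e6
        share = informative_fraction(sequencing_cov, diagnostic_cov)
        print(f"PE{read_length}: {millions:.0f} million reads, "
              f"{share:.0%} of on-target coverage is informative")
```

By this measure, roughly 79% of the on-target coverage is informative for PE100, compared with roughly 66% for PE150.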

Note that increasing the insert size to better suit the PE150 read length would reduce waste due to overlapping read ends, but would at the same time increase waste due to off-target bases being sequenced. As a result, no improvement in diagnostic coverage would be visible.

Given these considerations, we find that PE100 is the optimal sequencing read length for WES when combined with insert sizes around 220 bp.

Instead of maximizing raw sequencing output by increasing sequencing read length, one should maximize diagnostic output by generating the largest amount of usable information possible.