Shotgun Sequencing

1. Summary

Dideoxy chain-termination sequencing depends on synthetic DNA primer sequences to initiate the reaction. These primers must match a portion of the template whose sequence we are trying to determine. This gives us a 'chicken and egg' problem of needing to know a bit of the template sequence before we can read more of it.

One way to start sequencing an unknown sequence is to make a recombinant clone, putting the unknown insert into a vector of known sequence. Then primers from the vector can be used to begin reading the sequence of the insert. Once a portion of the new insert sequence is known, we can use that to design a new primer to let us read further. This process can be repeated until the whole insert is sequenced. This 'primer walking' process is inherently sequential, since each step must be completed before the results can be used to design the primer for the next step.

Shotgun sequencing is an approach that lets us run large numbers of reactions in parallel, rather than in series. Rather than primer walking through one large insert, we randomly fragment the insert and clone the pieces to create a library of smaller subclones. A large number of these clones are chosen at random and sequenced in parallel using primers matching the vector. The sequencing results are then 'assembled' on the computer into a contiguous sequence of overlapping fragments. This approach essentially trades much of the laborious laboratory work for a puzzle to be solved on the computer, and turns out to be much faster than pure primer walking.

Technologies: This exercise uses the Staden sequence data management software for assembly of reads into contigs. Using the sequence simulation module, students must design custom sequencing primers for primer walking through regions not adequately covered by random clones, and to resolve ambiguities.

Time required: approximately 4 hours.

2. Learning Objectives

After preparing for and completing this exercise, you should be able to:

Describe how ambiguities and discrepancies between sequencing reads arise and how they can be resolved.

Explain how sequencing data management software like the Staden package is used to handle sequencing projects involving large numbers of reads.

Describe sequence assembly and state its importance.

Explain why vector clipping and quality clipping must be performed on reads before assembly.

Explain how universal primers can be used to sequence large DNA molecules.

Compare the advantages and disadvantages of massively parallel shotgun sequencing strategies to the inherently sequential approach of primer walking.

Explain how read-pair information and knowledge about insert sizes can be used in the sequence assembly and finishing process.

Design custom primers to extend the sequenced region of a template ("primer walking"), and to sequence the complementary strand of a read.

Describe how reaction conditions such as temperature and primer concentration affect priming specificity and the quality of sequence data.

3. Background

In dideoxy chain-termination sequencing, a synthetic DNA primer is used to start the process of copying a template sequence into a set of labelled products. Each of these products ends with a particular base, depending on the sequence of the template, and has a particular size, depending on the position of the base in the template. Ordering these products by size (using gel electrophoresis) lets us determine the order of bases in the template. Unfortunately, we often need to determine DNA sequences that are longer than we can read in a single sequencing reaction. This means we need to collect and organize the results of many reactions into larger contiguous sequences in a process called sequence assembly.

Read how Sanger dideoxy chain termination DNA sequencing works in the Wikipedia article on DNA sequencing. Note that the Sanger method is a type of primer extension reaction. Other important types of primer extension reactions include cDNA synthesis, labelling of hybridization probes, and the Polymerase Chain Reaction (PCR).

Old-fashioned water pumps needed to be wet to make a good seal. Wetting it was called 'priming the pump'; you always had to be sure you kept a bit of water available to prime the pump, so you could pump more water later. Oligonucleotide primers are conceptually similar in that you need to know some of the target sequence so that you can make a primer to read more of the sequence.

Say you have a cloned template of 10,000 base pairs that you need to sequence. If you can read about 1,000 bp in a single sequencing reaction, you will need to perform about 10 reactions in series to read the whole thing. The first reaction can use a primer site on the cloning vector. After you get the results from that reaction, you can use them to design a new primer to read 1,000 bases further. If it takes one day to run a sequencing reaction, study the results, and design a primer, and another two days to have the new primer made, then each step takes about three days, and it will take you about a month to sequence the whole 10,000 bases.

Now, instead of using just one universal primer to read in from one end, consider how much you could speed up the process if you read in from both ends at the same time. With automated machines, it probably doesn't take much longer to do two reactions than to do one, since we can run them in parallel. So reading in from both ends, we could sequence the insert in about 15 days.

It might occur to you that the process would be even faster if we could break up the template into smaller fragments, and sequence each of them from both ends. For example, if we could break it into five pieces of 2,000 bases, and read 1,000 bases in from each end, we could obtain all of the sequence in one set of reactions, using only universal primers. This would take one day in our scenario, assuming we can run 10 reactions at once.
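The arithmetic behind these three scenarios can be written out as a short Python sketch. The function name and its parameters are made up for illustration, and the model is deliberately coarse: it charges every primer-walking round the full three days (one day for the reaction plus two for primer synthesis), even though the first round uses a universal primer that is already on hand.

```python
import math

def sequencing_days(template_bp, read_bp, reads_per_round, days_per_round):
    """Days needed when each round adds `reads_per_round` new reads of `read_bp` bases."""
    rounds = math.ceil(template_bp / (read_bp * reads_per_round))
    return rounds * days_per_round

# One day to run and analyse a reaction, plus two days to have a primer made.
WALK_CYCLE_DAYS = 1 + 2

# Primer walking in from one end: 10 rounds in series.
one_end = sequencing_days(10_000, 1_000, reads_per_round=1, days_per_round=WALK_CYCLE_DAYS)

# Primer walking in from both ends: 5 rounds in series.
both_ends = sequencing_days(10_000, 1_000, reads_per_round=2, days_per_round=WALK_CYCLE_DAYS)

# Five 2,000 bp subclones read from both ends with universal primers:
# 10 reads in a single round, and no custom primers to wait for.
subclones = sequencing_days(10_000, 1_000, reads_per_round=10, days_per_round=1)

print(one_end, both_ends, subclones)  # 30 15 1
```

The gain comes almost entirely from removing the wait between rounds, not from making any single reaction faster.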

The trouble is, we don't have a good way to break the template up into five non-overlapping pieces of 2,000 base pairs. Maybe if we had a good restriction map, we could invest some time in cloning smaller pieces, but that would take a considerable amount of work.

The good news is that the speedup from sequencing many small clones in parallel using universal primers is so great that it can be well worthwhile to do so, even if we have to use clones representing overlapping pieces, and we end up sequencing some parts multiple times. In fact, we do have good methods of generating random fragments from a large piece of DNA.
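To make the idea of assembling overlapping pieces concrete, here is a toy greedy assembler in Python. It is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and overlaps must match exactly. It is not the algorithm used by the Staden package, and the sequence and function names are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# A made-up target and three overlapping "reads" cut from it.
target = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = [target[16:], target[:12], target[8:20]]

print(greedy_assemble(reads, min_len=4))
```

If two reads do not overlap by at least `min_len` bases, they stay in separate contigs, which is the situation the exercise resolves by sequencing more clones or by primer walking.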

It also turns out that sequencing the same part several times from different clones, at different distances from the primer and on both strands, can let us determine the sequence more accurately. As with most experimental results, the interpretation of sequencing reactions often leaves us with some ambiguity. For example, sometimes peaks blur together on an electropherogram, and though we can tell there are several 'A's in a row, it can be difficult to tell whether there are five or six. Ideally, peaks would be spaced evenly, but in fact, peak separations can vary a bit due to the way the strands of DNA fold during electrophoresis. Since peaks tend to get shorter and broader farther from the primer, the signal-to-noise ratio drops, until eventually we can't read with confidence. Since peak migration is influenced by folding of DNA strands, artifacts due to strand folding are likely to be different on the two strands. This means that if sequence from both strands agrees, we can be fairly confident that it is correct.

Now read the Wikipedia article on Shotgun sequencing, and ponder the advantages of running many reactions in parallel. This may help prepare you for the shock of how much work it is to sort out all the sequence information you will be faced with, and manage the small bits of sequence from many random and probably overlapping clones. This is the focus of the exercise we are about to do. Just remind yourself that you are trading a few hours of computer work for almost a month of laboratory work.

4. Resources

This exercise uses the Staden sequence data management package to assemble simulated experimental sequence traces from an 8.5 kb fragment of an HIV genome. (The instructions focus on the 2003 Windows version.)

Traces from the first 25 clones have already been simulated using forward and reverse universal primers. The zip file also contains the Pregap4 configuration file "config.pg4" to assemble these traces. (old version, assembled sequences)

Use the sequence trace generator to run additional simulated sequencing reactions to finish the assembly and resolve ambiguities.

5. Initial Fragment Assembly

The first part of the shotgun sequencing experiment has been done for you. Twenty five clones were selected from a random insert library and sequenced on both ends with universal primers. These fifty sequence traces are in a zip file called "shotgun.zip", together with a configuration file to run the Staden fragment assembly package.

Unzip the shotgun.zip file.

In the new "shotgun" directory, double click the file named "config.pg4". This will launch the Pregap4 program.

Check to see that the following modules are selected (You can read the help documentation to see what these modules do):

Check that the following values are set:

Click "Run". When it finishes, close Pregap4.

Look in the "shotgun" directory again. You should find a new file named "HIV.0.aux": double click this file to launch Gap4.

Two windows will open, the main Gap4 window and the "Contig Selector".

In the main Gap4 window, select "View: Template Display". This opens a "Show templates" dialog box. Be sure the "all contigs" radio button is selected and the "Templates" and "Readings" checkboxes are checked. Click "OK" (see Figure 1: "original assembly").

The graphical template display shows how the sequencing "reads" from the 25 clones have been assembled into 7 contiguous blocks ("contigs"). The "reads" are arrows, and the lines between them are templates. Note that there are two reads per template. Each template was sequenced on each end using universal primers that read into the insert from the plasmid cloning vector. Two ends read from the same template are called a "read pair".

All of our templates are about 1200 bases long, plus or minus about 200 bases, and the read pairs both read in from the ends of the template. We can rearrange the contigs so that the display of read pairs is consistent with these facts.

Note that the reads in the two contigs at the right end of Figure 1 point out from their templates, rather than in. Right-click on the contig lines at the bottom of the Templates Display window to bring up a context menu that lets you "Complement" the contig. Figure 2 ("complemented contigs") shows what they should look like when you're done.

Notice the templates drawn in yellow. Each has one read in one contig, and the other read in a different contig. Later in the exercise we will sequence the middle parts of these templates, which will let us join the contigs that these templates span. But first, note that some of the contigs do not have templates that would let us connect them. We must go back to the clone library and sequence more clones.

Save your Gap4 database with a new version number (version 1) by choosing "File: Copy database" from the main window's top menu and entering "1" in the box marked "New version character". Exit Gap4.

6. Sequencing Additional Clones

The sequence fragments you have already assembled are from clones 1 through 25. We must have sequence from at least five more clones, so teams will be assigned clones to sequence, starting with "HIVsubclone026". Each clone should be sequenced using both "forward" and "reverse" primers. Copy and paste the appropriate primer sequence into the "Primer sequence" box. Be sure to select "forward" or "reverse", as appropriate, from the "primer strand" pull-down list. Since these are universal primers, select "vector (step 1)" from the "priming site" pull-down list.

The default reaction conditions should work for the universal primers. Be sure to enter the names of everyone on your team into the "user name" box. This will help us to troubleshoot your reactions if they do not work as expected. After you click "run sequencing reaction", it will take six or seven seconds for your virtual sequence trace to be created. Note the name of the result file: it will be something like "HIVsubclone026-p1t.scf". Be sure you have the correct subclone. Note the letter following the dash: it will be "p" if you told the program this primer was on the forward strand, and "q" for the reverse. The number "1" indicates that you said you were using a universal primer. Be sure these items are correct, and save the trace file.
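When many trace files are being collected from several teams, a small script can check that each name follows the convention just described. The Python sketch below is an illustration only: the regular expression is inferred from the single example name "HIVsubclone026-p1t.scf", the function name is made up, and the meaning of the trailing "t" is not explained here, so the script simply reports it without interpreting it.

```python
import re

# Pattern inferred from the example "HIVsubclone026-p1t.scf" (an assumption).
TRACE_NAME = re.compile(
    r"^(?P<clone>[A-Za-z]+\d+)-(?P<strand>[pq])(?P<primer>\d+)(?P<rest>[A-Za-z]*)\.scf$"
)

def parse_trace_name(filename):
    """Split a trace file name into clone, strand, and primer number."""
    match = TRACE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected trace file name: {filename}")
    primer_number = int(match["primer"])
    return {
        "clone": match["clone"],
        # "p" marks a forward-strand primer, "q" a reverse-strand primer.
        "strand": "forward" if match["strand"] == "p" else "reverse",
        "primer_number": primer_number,
        # The text states that "1" indicates a universal primer.
        "universal": primer_number == 1,
        "rest": match["rest"],  # trailing characters, left uninterpreted
    }

print(parse_trace_name("HIVsubclone026-p1t.scf"))
```

A name that does not fit the pattern raises an error, which is a convenient prompt to check the strand and priming-site settings before the trace is added to the assembly.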

Once all groups have sequenced their assigned clones, we will collect them and give everyone a copy of all the traces to use in the next round of assembly.

7. Adding New Traces to the Assembly

Copy the new sequence traces into your project folder. Open Pregap4 again by double clicking the "config.pg4" file. Select the "Files to process" tab, and click the "Add files" button. Select the new SCF files, starting with "HIVsubclone026-p1t.scf". Be sure to set "Files of type" to "SCF"!

On the "Configure modules" tab, click on the "Gap4 shotgun assembly" module. Enter "1" in the "Gap4 database version" field, and select the "Append to existing database" radio button. Click the "Run" button in the lower left corner of the window. When the program reports "processing finished", close Pregap4.

Now open Gap4 again, but this time by double-clicking the file "HIV.1.aux". This is the new version of the database where Pregap4 put the latest trace data. Open the Templates Display window ("View: Template Display", "OK"). It should resemble Figure 3.

Note that several of the templates do not seem to be displayed correctly. All the templates in our subclone library have inserts of roughly the same size (1200 +/- ~200 bp). Since each should have been sequenced from both ends using the forward and reverse universal primers, there should be an arrow representing the sequencing read from each end pointing in toward the middle of the insert. Because some of our templates are drawn much too long, and not all of the arrows point in from the ends, we need to rearrange the contigs so they are consistent with what we know about our templates and sequence reads.
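The two rules used here (reads point inward, and the template spans roughly 1200 +/- 200 bases) can be expressed as a simple check. The Python sketch below is a hypothetical illustration of that reasoning, not a description of how Gap4 decides which templates to flag; the `Read` class and the coordinates are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Read:
    start: int      # leftmost position of the read in the layout
    end: int        # rightmost position of the read in the layout
    forward: bool   # True if the arrow points left to right

def pair_is_consistent(a, b, expected=1200, tolerance=200):
    """True if two reads from one template point inward and span a plausible insert."""
    left, right = sorted((a, b), key=lambda r: r.start)
    points_inward = left.forward and not right.forward
    span = right.end - left.start
    return points_inward and abs(span - expected) <= tolerance

good = pair_is_consistent(Read(0, 600, True), Read(650, 1250, False))
outward = pair_is_consistent(Read(0, 600, False), Read(650, 1250, True))
too_long = pair_is_consistent(Read(0, 600, True), Read(3000, 3600, False))

print(good, outward, too_long)  # True False False
```

A pair that fails because its arrows point outward suggests that one contig needs to be complemented; a pair that fails on length suggests that a contig needs to be moved.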

As we saw earlier, clicking on a contig with the right mouse button brings up a context menu that lets you complement the contig. This will change the direction of all the read arrows in that contig. Click on a contig using the middle mouse button to drag it left or right to a new position (if your mouse has a wheel, it will probably work as a middle mouse button, too). You may have to click a few times in slightly different spots to grab the contig line successfully.

Use these operations to rearrange the contigs until all templates are drawn about the right length, with one read coming in from each end, as in Figure 4.

Note that the templates drawn in yellow or dark yellow all cross boundaries between contigs. The next part of the exercise will be to use custom primers to sequence the middle parts of some of these templates, to see if we can obtain enough sequence to join some of our contigs together. For example, the contigs named "HIVsubclone010-q1t" and "HIVsubclone021-p1t" would presumably be joined if we had better sequence from the middle part of clone 9, 19, or 21. (Point the mouse at a contig, template, or read to see its name. Each contig is named after the leftmost read that it contains.)

8. Sequencing Clones that Connect Contigs

At this point, the sequence is assembled into 7 contigs, which means that there are 6 boundaries between contigs. Each group of students will be assigned one of these boundaries, and will do additional sequencing reactions to try to get enough information to join the contigs.

1. Divide clones among groups of students

Group    Clones
1        9, 19 or 21
2        29
3        30 or 29
4        18, 24 or 30
5        26 or 28
6        15, 16, 27

2. Primer Walking to join contigs

Design custom sequencing primers to sequence the regions of these templates that span across contigs.

Use them in simulated sequencing reactions. Check your traces in the trace viewer (Trev) to be sure they worked (if not, you may need to check your primer design or adjust your reaction conditions).

Submit your reads to the web site (the instructor will demonstrate).

Once all teams have submitted their results, each group should download the trace files and add them to their own assembly.
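As a starting point for primer design, the Python sketch below picks a candidate primer from known sequence and estimates its melting temperature. Several things here are assumptions rather than requirements of the exercise: the 20-base length, the 50-base setback from the end of the known sequence (chosen because the ends of reads tend to be less reliable), and the use of the Wallace rule of thumb (2 degrees C per A or T, 4 degrees C per G or C), which is only a rough guide for short oligonucleotides. The function names and the example sequence are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def wallace_tm(primer):
    """Rule-of-thumb melting temperature in degrees C for a short primer."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def primer_site(known, length=20, offset=50):
    """The `length` bases that end `offset` bases before the 3' end of `known`."""
    if len(known) < length + offset:
        raise ValueError("known sequence is too short for this length and offset")
    end = len(known) - offset
    return known[end - length:end]

def walking_primer(known, length=20, offset=50):
    """Primer matching the known strand, to extend the read in the same direction."""
    return primer_site(known, length, offset)

def complement_primer(known, length=20, offset=50):
    """Primer on the opposite strand, to read back across the known sequence."""
    return reverse_complement(primer_site(known, length, offset))

# A made-up stretch of "known" sequence, used only to exercise the functions.
known = "ATGGCGTACGTTAGCCTAGGATCCGATTACA" * 3
for name, primer in (("walk", walking_primer(known)),
                     ("complement", complement_primer(known))):
    print(name, primer, wallace_tm(primer))
```

A candidate primer should still be checked by eye against the trace, since a primer designed from miscalled bases will not anneal as intended.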

9. Finishing and Editing the Sequence

At this point, all the templates should be joined into a single contig. On the Template Display window, choose "View: Quality Plot", then select the contig. A color-coded display will be drawn below the contig line; see the online help for an explanation of the colors. Blocks of green and blue represent areas that have been sequenced on only one strand. If time permits, we may run additional sequencing reactions to sequence the other strand in these regions.

Editing is the process of resolving inconsistencies among different reads, usually by going back to the traces and deciding what sequence is most consistent with the experimental results. Gap4 is extremely helpful in this process. Click on a problem area in the Quality Plot, and the corresponding sequences will be opened in the Contig Editor. Click on the consensus sequence in the Contig Editor and the original traces will be displayed together so you can compare them and decide which sequence to believe. Edit the sequences in the Contig Editor to record your choices. Given time, you should be able to resolve all of the ambiguities to determine a high-quality sequence for the entire target sequence.
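The idea of comparing reads column by column can be illustrated with a simple majority vote. This Python sketch is not how Gap4 computes its consensus (which also takes base quality into account); it assumes the reads are already aligned and padded with "-" where a read does not cover a position, and the reads themselves are invented.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority base per column, plus the positions where reads disagree."""
    result, disputed = [], []
    for pos, column in enumerate(zip(*aligned_reads)):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            result.append("N")  # no read covers this position
            continue
        ranked = counts.most_common()
        top_base, top_count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == top_count
        result.append("N" if tie else top_base)
        if len(counts) > 1:
            disputed.append(pos)  # worth checking against the traces
    return "".join(result), disputed

reads = [
    "ACGTTAGC--",
    "--GTAAGCTA",
    "ACGTTAGCTA",
]
sequence, to_check = consensus(reads)
print(sequence, to_check)  # ACGTTAGCTA [4]
```

The disputed positions are the ones an editor would inspect in the traces; a majority vote alone cannot tell which read is actually right.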

Figure 1: Original assembly of sequencing reads into contigs by Gap4.

Check the quality of the sequence data, and perform additional reactions to "finish" the sequence.

Figure 2: The two rightmost contigs have been complemented so that the read pairs point toward each other (into the template).

Figure 3: New traces added to the assembly.

Figure 4: Contigs have been rearranged so templates are within the expected size, and have reads coming in from both ends.

10. Assembly Errors

Here we show and explain a few common assembly errors.

1. Assembly Results Using Good Vector and Quality Clipping

Vector and Quality Clipping

Good Assembly for First Round

2. Poor Results Without Vector Clipping

No Vector Clipping

Vector clipping failed because Pregap4 could not find the vector sequence file when launched by double clicking "config.pg4" on a networked drive. Moving the Pregap4 configuration file to the local "C:" drive fixed the problem. (Thanks to Dr. George Carman, University of the Pacific).

Bad Assembly for Round 1

Short segments of vector at the beginning of the reads prevent assembly because these segments do not overlap with the target sequence.
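To see what vector clipping does in principle, consider the Python sketch below. It removes a leading stretch of a read when that stretch exactly matches the end of the vector sequence next to the cloning site. This is a simplified illustration, not the method Pregap4 uses: real reads contain errors, so practical tools tolerate mismatches. The vector and insert sequences and the function name are made up.

```python
def clip_vector(read, vector_end, min_match=8):
    """Remove leading vector sequence from `read`.

    `vector_end` is the vector sequence just upstream of the cloning site.
    The longest prefix of the read that equals a suffix of `vector_end`
    (at least `min_match` bases) is trimmed off.
    """
    for n in range(min(len(read), len(vector_end)), min_match - 1, -1):
        if read.startswith(vector_end[-n:]):
            return read[n:]
    return read  # no vector found at the start of the read

# Hypothetical sequences for illustration only.
vector_end = "GGATCCTCTAGAGTCGAC"
insert = "ATGGCGTACG"
raw_read = vector_end[-12:] + insert  # read begins with 12 bases of vector

print(clip_vector(raw_read, vector_end))  # ATGGCGTACG
```

Without this step, every read from the same vector begins with the same bases, and none of those bases belong to the target sequence.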

11. Review Questions

In DNA sequencing, what is a "read"?

What is a 'universal primer'?

Describe the process of 'primer walking'. Why is it inherently sequential?

What is a "fluorescent dye terminator"?

Why is it important to sequence both strands of DNA?

Briefly describe how shotgun sequencing experiments and data analysis are used to produce one continuous DNA sequence.

When might you use custom designed primers in a large shotgun sequencing project?

Why might a repeated sequence make sequence assembly more challenging?

Sequence trace for review questions

From the figure above, design a primer to continue the read in the same direction.

From the figure above, design a primer to sequence the complementary strand.