Announcement

A hands-on Bioinformatics Workshop led by Dr. Eric Lyons, University of California, Berkeley, USA, will be held immediately prior to the SIP annual meeting.

8000 Genomes At Your Fingertips:

Comparative Genomics using CoGe

9 am – 4 pm Sunday, July 11, 2010

Currently, there are genomes available for over 8000 organisms, with new ones being sequenced at an exponential rate. New DNA sequencing technologies make obtaining genome sequences easier and faster, and modern researchers must understand genome biology in order to take advantage of this wealth of data. CoGe, a web-based comparative genomics system, permits researchers anywhere in the world to easily analyze and compare genomes. CoGe maintains a repository of all publicly available genomes, and its suite of interconnected analytical and visualization tools allows researchers to rapidly understand genome structure and evolution from the level of a gene to the genome. This workshop, led by CoGe’s creator, Dr. Eric Lyons of the Department of Plant and Microbial Biology at UC Berkeley, USA, provides an introductory description of the structure, evolution, and dynamics of genomes, followed by a hands-on in silico training session using CoGe to characterize the structure and evolution of genomes. Workshop participants are encouraged to propose their invertebrate pathogens of interest for exploratory de novo research.

For the SynMap example, make sure that cross-hairs appear and follow your mouse cursor on the syntenic dotplot. Click on the syntenic dotplot to cause a zoomed-in popup of a chromosome-chromosome dotplot comparison to appear. When you select and mouse around in the popup dotplot, you should also see cross-hairs. The cross-hairs should change color to red when you mouse over a dot (gray or colored.) When that happens, click the left mouse button to cause a GEvo analysis to appear. This may appear in a new window or as a tab in the window with the master dotplot (depending on your browser's settings).

Introductions

Name, research area of interest, organisms of interest

CoGePedia account for SIP2010

If you are familiar with using a wiki and would like to make changes to this document during the workshop, please feel free to do so. The user name and password will be announced.

Background on CoGe

Understanding how CoGe is put together will help you understand how best to use CoGe. Its design and architecture allow you to create your own analysis workflows that are guided by your research questions and not necessarily by what the tools can do. There are several dozen tools in CoGe, each designed to perform one task. These tools are linked together so that the results from one analysis flow seamlessly into the next. At each point there is access to the underlying data and raw analyses.

Multi-genome support: concurrent storage of any version of any genome from any organism in any state of assembly

This tool is CoGe's dynamic tool for visualizing genomes. The graphics used in it to represent genomic data are also used by other tools analyzing genomic data. Understanding these graphics is important for understanding other analyses.

This tool is the central hub of CoGe for dealing with lists of genomic features. It is often used by other tools to handle a list of features, and it is itself used to send features to other tools for analysis.

CoGe's interface to blast algorithms permits the selection of any set of genomes of interest. Its results are presented in a highly graphical and interactive way to make the evaluation of blast hits easy. Because CoGe is linked to its genomes database, blast hits can be used to identify annotated genomic features that can subsequently be exported to other tools in CoGe. Overall, this tool makes it quick and easy to find homologs in any set of genomes for downstream analyses.

Visualization of hits in context of the query sequence and target genomic region (what is a good homolog?)

Blast hit details

Selecting and sending hit genomic sequences to other tools

Example Analysis 1: Building a phylogeny

A key part of comparative genomics is first understanding the evolutionary relationships among a set of organisms. Phylogenetic trees are a great map to have when trying to unravel the evolutionary history of a set of genomes. Often, such exploratory analyses begin with closely related genomes and then work out to more distantly related genomes. This exercise uses CoGe's tools to build a list of homologous genes, and then sends those gene sequences to phylogeny.fr for phylogenetic tree reconstruction. The example organism is Autographa californica nucleopolyhedrovirus (AcMNPV).

Get a list of all protein coding features in Autographa californica (quick link)

Search and find the genome of Autographa californica nucleopolyhedrovirus in GenomeView

Under "Genome Information" click on "Click for Features". This will display a table of all the genomic features near the bottom of the screen, below "Chromosome information".

In this feature list ("Features for . . . Autographa californica. . ."), click on "Feature List?" next to the entry for CDS. This will open FeatList with all the protein coding sequences for Autographa loaded.

When CoGeBlast loads, the sequence sent from FeatList will automatically be added to the query sequence submission box located near the bottom of the screen ("Sequence (fasta format)")

Search for baculoviruses by typing "baculo" into the text box next to "Organism Description:"

Add all baculovirus genomes to the "Genomes to BLAST:" list by pressing "Add all listed" located below the organism search list.

Since polyhedrin genes may evolve rapidly (and viruses evolve rapidly in general), you will want to do a protein blast search. Change from a nucleotide search to a protein search by selecting the button next to "Protein Sequence" located to the left of the query sequence. When protein sequence is selected, the sequence loaded from FeatList will automatically be translated into protein sequence.
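To make the reason for the protein search concrete, here is a minimal Python sketch of translating a coding sequence with the standard genetic code. It is an illustration only, not CoGe's own code; the function name `translate` and the toy sequences are assumptions made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two toy sequences that differ at a synonymous third-codon position
# give the same protein, so a protein search still sees a perfect match.
print(translate("ATGCCGGATTATTCATAA"))
print(translate("ATGCCAGATTATTCATAA"))
```

Because many nucleotide changes do not alter the encoded amino acid, diverged genes often remain detectable at the protein level after they have become hard to match at the nucleotide level.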

Run the blast search by pressing the button "Run CoGeBlast" located near the top of the screen

Extra: Compare the blast results using blastn and blastp. You will see that several genomes do not have matches using blastn that do have matches using blastp (e.g., Helicoverpa armigera granulovirus).

Evaluate the blast hits to identify those with a large amount of query sequence coverage. All genomes matched will have a single blast hit that overlaps nearly the entire query sequence. Genomic features in the Target genomes overlapping the blast hits are putative orthologs to Autographa californica's polyhedrin gene. The only exception is a second blast hit found in the genome of Agrotis segetum nucleopolyhedrovirus. This hit covers more than half the query sequence but does not overlap an annotated genomic feature in the Agrotis segetum nucleopolyhedrovirus genome.
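The idea of "query sequence coverage" used in this step can be sketched in a few lines of Python. This is a hedged illustration, not how CoGeBlast computes its display: the function name `query_coverage`, the coordinate convention (1-based, inclusive), and the example numbers are all assumptions.

```python
def query_coverage(hsps, query_length):
    """Return the fraction of the query covered by the union of HSP intervals.

    hsps: iterable of (start, end) query coordinates, 1-based and inclusive.
    Overlapping HSPs are only counted once.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    covered = 0
    current_end = 0  # right-most query position counted so far
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hsps):
        start = max(start, current_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_length


# Hypothetical hits against a 200-residue query
print(query_coverage([(1, 100), (50, 150)], 200))  # two overlapping HSPs
print(query_coverage([(1, 198)], 200))             # one near full-length HSP
```

A hit that spans nearly the whole query is a much better ortholog candidate than one that covers only a fragment, which is why the step above asks you to look at coverage and not only at the presence of a hit.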

Select all the blast hits by pressing "Select All" located under the "HSP Table", and then deselect the second blast hit for Agrotis segetum nucleopolyhedrovirus

Select "Phylogenetics" from the drop-down menu located next to "Send Checked Features to:" found below the "HSP Table", and press "Go". This will send the genomic features in the Target genomes overlapping blast hits to FastaView.

phylogeny.fr contains a tool set for performing various manipulations of sequences, algorithms, and visualizations for phylogenetic tree reconstructions. You can find information about all of their offerings at: phylogeny.fr's documentation. In addition, phylogeny.fr contains a pipeline for "one click phylogenetics" which will run a series of programs to go from a set of fasta sequences to an image of a phylogenetic tree using a very good set of algorithms and options. Their pipeline consists of:

Sequence Alignment: MUSCLE

Alignment Refinement: Gblocks

Phylogenetic tree reconstruction: PhyML

Tree rendering and visualization: TreeDyn

When sequences are sent from FastaView to phylogeny.fr, CoGe automatically sends them to its "one click" phylogenetics mode and clicks their button to start the analysis. All you need to do is wait.

GEvo is CoGe's tool for dynamically comparing multiple genomic regions. It allows you to specify genomic regions using a gene name, fetching a sequence from NCBI using a GenBank accession, or pasting in your own sequence. The extent of the genomic regions is up to you to define, but you can easily expand or contract the amount of sequence analyzed, reverse complement a sequence, and dynamically mask portions of the sequence based on annotated genomic features (e.g. non-CDS regions). There are a handful of sequence comparison algorithms to choose from and a variety of ways to manipulate the genomic regions.

In this example, we'll be using the results from Example 1 to extend our analysis of AcMNPV. Previously we identified homologs to the polyhedrin gene of AcMNPV and used phylogeny.fr to build a phylogenetic tree of their evolutionary relationships. We will use this gene tree as a proxy for the species tree of these viruses in order to identify the baculovirus most closely related to AcMNPV.

Previously, a list of putative homologs to AcMNPV's polyhedrin gene was identified in CoGeBlast. These were selected and sent to FastaView in order to generate their protein sequences and send them to phylogeny.fr for phylogenetics. There are several other programs in CoGe to which identified genomic features may be sent. Select "Feature List" and press the "Go" button located next to the option. You can save the URL for this page in order to regenerate this feature list in the future.

Using this feature list (or the original list in CoGeBlast), select the polyhedrin gene in the strain that is evolutionarily closest. To figure this out, use the phylogeny generated at phylogeny.fr (image). Note: You can use the "Filter Table Rows" text box to quickly find features, but remember that it is case-sensitive.

The organism that is closest to AcMNPV is Plutella xylostella multiple nucleopolyhedrovirus (PxMNPV). When the two polyhedrin genes from these organisms are selected, select "GEvo" from the drop-down menu located below the list next to "Send Checked Features to:" and press "Go".
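Two of the sequence manipulations mentioned above, reverse complementing and masking non-CDS sequence, are simple enough to sketch in Python. This is only an illustration of the operations, not GEvo's implementation; the function names, the 1-based inclusive coordinates, and the choice of "X" as the mask character are assumptions.

```python
# Translation table for complementing DNA; N stays N, case is preserved.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_non_cds(seq: str, cds_intervals, mask_char: str = "X") -> str:
    """Keep bases inside any CDS interval and replace everything else.

    cds_intervals: iterable of (start, end), 1-based and inclusive.
    """
    keep = [False] * len(seq)
    for start, end in cds_intervals:
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            keep[i] = True
    return "".join(base if kept else mask_char for base, kept in zip(seq, keep))


print(reverse_complement("ATGC"))
print(mask_non_cds("ACGTACGTAC", [(3, 6)]))
```

Masking everything outside annotated coding sequence means that only gene-to-gene similarity can produce hits, which can make a crowded comparison easier to read.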

GEvo will load with these two features automatically entered and selected in two sequence submission boxes. (quick link)

GEvo's default options are usually a good place to start. Without changing any of the parameters, press "Run GEvo Analysis!" and run the analysis.

GEvo's results show two panels, one for each of the genomic regions analyzed. (quick link)

The graphics show genes as green arrows located above and below a dashed line (top and bottom strand respectively). Note that one gene in each region is colored yellow. These are the genes used to anchor these genomic regions (polyhedrin). To check that, click on the gene to get a dialog box with its annotations. These dialog boxes can be moved around the screen by clicking and dragging on the header, and closed by pressing the "X" located in the upper right corner. The annotation information contains links to get more data about a gene such as its sequence.

Note the pink boxes above and below the genes. These represent regions of identified sequence similarity in the same or opposite orientation respectively. Click on a large pink box to draw a transparent wedge connecting the boxes between the two regions. This also causes a dialog to open (or changes the information if the dialog box is already open) with an overview of information about that HSP (aka blast hit). To get the full information about that HSP including an alignment, press the "full annotation" link at the top of the dialog box. The large HSP shows that these regions are 99.27% identical.

To draw more than one connecting wedge at the same time, click and drag on a panel to draw a box; all HSPs within that box will have wedges drawn. If you want to draw them all, press and hold the shift key while clicking on a pink box.
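The percent identity reported for an HSP can be understood with a small Python sketch. It assumes the common convention of dividing identical columns by the total alignment length (gap columns included); the exact convention used in GEvo's report is not stated here, and the function name and toy alignment are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences have the same residue.

    Both inputs are rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy alignment: 10 columns, one gap and one mismatch
print(percent_identity("ACGT-ACGTA", "ACGTTACGAA"))
```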

Draw wedges for all regions of sequence similarity by pressing and holding the shift key while clicking on a box.

Note that there are several overlapping wedges drawn near the center. At this region is a box outlined in blue and black. Such glyphs denote repeat regions. If you click on one, you'll see its annotation, which is a potential enhancer and origin of replication consisting of 30bp of imperfect palindromic repeats.

To examine the center repeat in more detail, drag the slider bars located at the ends of each genomic region's panel towards the repeat. Leave about a gene of sequence on either side of the slider bars. This will automatically adjust the amount of left and right sequence for each region in the sequence submission box. In addition, change the display options so that overlapping HSPs are automatically adjusted and matches in HSPs are colored so sequence differences are more easily visualized. These options are found under the "Result Parameters" tab located below the "Run GEvo Analysis" button. When adjusted, press the "Run GEvo Analysis" button to re-run the analysis. (quick link)

Next let's compare the entire genome. First turn off the auto-adjust overlapping HSP and color matches in HSPs options. Return to the sequence submission tab by clicking on it. Set the amount of sequence analyzed to 300000 for all regions by entering that number in the box next to "Apply distance to all CoGe submissions" located near the bottom of the sequence submission box. Note: we are requesting more sequence than contained in these genomes both up-stream and down-stream from these genes. GEvo will only return sequence up till the beginning and end of a genome, so over-shooting the ends will not result in an error. Press "Run GEvo Analysis" to re-run the analysis. (quick link)
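The reason over-shooting the ends is harmless can be shown with a short Python sketch of how a requested window might be clamped to the genome. The function name and all coordinates are hypothetical, and the sketch treats the sequence as linear, as the note above describes.

```python
def clamp_region(anchor_start, anchor_stop, extend, genome_length):
    """Extend an anchor feature by `extend` nt on both sides, staying inside the genome.

    Coordinates are 1-based and inclusive.
    """
    start = max(1, anchor_start - extend)
    stop = min(genome_length, anchor_stop + extend)
    return start, stop


# Hypothetical anchor gene at 60,000-61,000 in a 150,000 nt genome
print(clamp_region(60_000, 61_000, 50_000, 150_000))   # window fits inside the genome
print(clamp_region(60_000, 61_000, 300_000, 150_000))  # request exceeds both ends
```

Asking for 300000 nt on each side is therefore just a convenient way of saying "give me the whole genome" without having to look up its exact length.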

Overall, you'll find that these genomes are nearly identical. However, there is one region in AcMNPV that is not present in PxMNPV and two regions in PxMNPV that are not present in AcMNPV. To analyze these regions in more detail, you can adjust the slider-bars to zoom in on those regions with missing sequence.

Adjust the slider bars and press the "Run" button to analyze the region in AcMNPV that is not present in PxMNPV (quick link)

There are two possibilities for this difference: 1. a new insertion in AcMNPV or 2. a deletion in PxMNPV. Because there is a gene model in AcMNPV that spans across this missing region in PxMNPV, it is more likely that PxMNPV experienced a deletion than that an insertion in AcMNPV created a gene using pre-existing neighboring sequence. Also, insertions often leave behind a target site duplication due to staggered cuts in the DNA, and none is seen in AcMNPV.

Zoom back out and analyze the whole genome by typing 300000 in the "Apply distance to all CoGe submissions" box and pressing the "Run" button (quick link)

Now zoom in on the second region present in PxMNPV that is not present in AcMNPV. This is near position 50,000 in the PxMNPV genome. (quick link)

The sequence present in PxMNPV that is not present in AcMNPV is likely due to an insertion, as evidenced by direct repeat sequences flanking the region.
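The "direct repeats flanking the region" evidence can be checked programmatically. The Python sketch below looks for the longest stretch immediately before a candidate insertion that is repeated immediately after it, which is the signature a target site duplication would leave. The function name, the 0-based half-open coordinates, and the toy sequence are assumptions for illustration; real repeats may be imperfect, which this exact-match sketch does not handle.

```python
def flanking_direct_repeat(seq, start, end, max_len=30):
    """Return the longest exact direct repeat flanking seq[start:end].

    Compares the k bases just before the region with the k bases just after it,
    for k up to max_len, and returns the longest matching stretch ('' if none).
    """
    best = ""
    limit = min(max_len, start, len(seq) - end)
    for k in range(1, limit + 1):
        left = seq[start - k:start]
        right = seq[end:end + k]
        if left == right:
            best = left
    return best


# Toy example: a 10 nt insert (poly-A) flanked by a 5 nt duplication "TTAGC"
toy = "GGCA" + "TTAGC" + "AAAAAAAAAA" + "TTAGC" + "CGTT"
print(flanking_direct_repeat(toy, 9, 19))
```

In GEvo the same evidence shows up visually as a short HSP on each side of the extra sequence; the sketch only makes the logic explicit.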

Now zoom in on the first region present in PxMNPV that is not present in AcMNPV. This is near position 26,000 in the PxMNPV genome. Question: What do you think has happened to give rise to this difference in genomic structure? Try to take into account the structure of the gene models as well. (hint: look at the alignments of the blast hits.) (quick link)

Overall, this analysis provides evidence that since these viruses diverged, PxMNPV has had two new insertions and one deletion.

In the previous example, you found the baculovirus most closely related to AcMNPV with a genome sequence in CoGe, and used GEvo to perform several high-resolution analyses of their genomes in order to identify and characterize three regions that were present in one genome and not the other. This exercise extends those analyses by adding additional baculoviruses for comparisons. These will be at varying evolutionary distances in order to get an overall feeling of baculovirus genome evolution, and to use outgroup comparisons to validate the conclusions made in the previous example.

After genes are selected, select "GEvo" from the drop down menu next to "Send Checked Features to:" and press "Go"

When GEvo loads, change the extent of all regions by typing "300000" into the box next to "Apply distance to all CoGe submissions" (quick link)

WARNING: this analysis will be using 9 genomic regions, which is a lot of data. To make this more manageable, the following steps will help.

Order the genomic regions so that AcMNPV is the first sequence, and the others are placed according to their evolutionary distance from AcMNPV. You can change the relative order of the sequences by dragging and dropping their sequence submission boxes relative to one another. (quick link of genomes ordered by evolutionary distance from AcMNPV)

Since we are only interested in relationships to AcMNPV, press "Open all sequence option menus" and select "No" for "Reference Sequence" in all sequence submission boxes EXCEPT AcMNPV (quick link)

To make the visualization of these regions more manageable, shrink the height of the genomic region panels by selecting the "Results Parameters" tab and changing the "Feature Height" to 10 pixels and the "Padding between tracks" to 1 pixel.

Run the analysis by pressing "Run GEvo Analysis"

Looking at the results, you can see that there are a lot of regions similar to AcMNPV across all these genomes. However, some of the colored blocks representing these regions are drawn below the gene models in AcMNPV. These are blast hits that are in the opposite orientation. Some of these are due to inversions. To make interpreting these data somewhat easier, find those genomes that are mostly inverted with respect to AcMNPV and flip those genomes by selecting "Yes" for "Reverse complement" in the sequence options menu located in the appropriate sequence selection boxes. (quick link)

From the oriented genomes, it is now possible to see how well the overall structure of each genome compares to AcMNPV. You can explore this by clicking on the colored blocks and seeing which genome they match. For the most part, genome structure similarity decreases as evolutionary distance increases (as estimated by the polyhedrin gene). However, there are some exceptions:

RoMNPV appears to have greater continuity to AcMNPV than does PxMNPV

TnSNPV appears to be much more similar in overall structure to AcMNPV than does Xn granulovirus (XnGV).

We can inspect these comparisons in more detail by removing non-relevant genomes from the comparisons.

Let's compare AcMNPV, RoMNPV, and PxMNPV. To do this, select "Yes" for "Skip sequence" located in the sequence options menus of the sequence submission boxes for all other genomes. Next, turn both RoMNPV and PxMNPV into reference sequences so they will be compared to one another. (quick link)

Clicking on the colored boxes for regions of sequence similarity will bring up a dialog box with information about those matches. If you do that for the same region in these genomes, you'll find that AcMNPV-PxMNPV are ~98-99% identical at the nucleotide level. Both AcMNPV and PxMNPV are about 95% identical at the nucleotide level to RoMNPV. Thus, even though AcMNPV and RoMNPV have an overall more similar genome structure than PxMNPV to AcMNPV (fewer gene insertions), PxMNPV and AcMNPV are indeed more closely related.

Also, this three-way comparison permits validation of the claims we made at the end of Example 2. Namely, the differences in genome structure between AcMNPV and PxMNPV were the result of two new insertions and one deletion in PxMNPV. This pattern holds true with the addition of RoMNPV. Since RoMNPV is more distantly related to AcMNPV and PxMNPV than they are to each other, the changes we see that are unique to PxMNPV must have happened in its lineage.

Also, note that there is a likely insertion at the left side of the image that AcMNPV and PxMNPV share and that RoMNPV does not have.

Question: How can you determine if that is likely an insertion in the AcMNPV-PxMNPV lineage, or a deletion in the RoMNPV lineage?

Now let's take a look at the comparison with TnSNPV and XnGV. Obviously, polyhedroviruses should be more closely related to one another than they are to granuloviruses. Reconfigure and run an analysis to compare AcMNPV, TnSNPV, and XnGV. (quick link)

The percent sequence similarity for regions overlapping the polyhedrin gene is 84% for AcMNPV-TnSNPV and 62% for AcMNPV-XnGV. This is in agreement with what we assume about the evolutionary relationships of these viruses, but why did the phylogenetic tree place TnSNPV more distantly from AcMNPV than XnGV? If you look closely at the genes colored yellow (the ones we used in the phylogenetic analysis and to anchor positions in these genomes), you'll see what happened. The yellow gene in TnSNPV is not overlapped by the region with sequence similarity to AcMNPV! To see this in more detail, change the amount of genomic region analyzed to 2000nt for all these regions. (quick link)

The close-up analysis of the region around our anchor gene reveals the error that happened. The anchor gene used for TnSNPV is not polyhedrin! CoGeBlast made a mistake assigning the correct gene to the blast hit (which can happen as these are algorithms). The mistake happened because there is a small amount of overlap between the polyhedrin gene and the one next to it in TnSNPV, and CoGeBlast picked the wrong one. The important lesson here is to double-check suspicious results!

Question: Do the other genomes that clade with TnSNPV in the phylogenetic tree also suffer from this error?

Question: Can you identify all the correct polyhedrin genes, create a new FeatList, and send them to phylogeny.fr for phylogenetic analysis and tree reconstruction?

Let's now analyze some genomic inversions and translocations. AgMNPV has several of these in relation to AcMNPV. Do a comparison of these two genomes (quick link)

As before, to gain insight into the dynamics of these genomes (i.e. which one has had a particular change), comparison to an outgroup genome is necessary. Select any set of closely related genomes and compare them to these two to find one that helps place one of these inversion or translocation events in one lineage. After searching, I found that CfMNPV will be useful (http://genomevolution.org/r/7rt quick link).

Comparison of the complete genomes of AcMNPV, AgMNPV, and CfMNPV. Notice that AgMNPV and CfMNPV have more similar genome structure to one another than either does to AcMNPV. However, notice that AgMNPV has an additional inverted region when compared to either genome. This analysis can be regenerated at http://genomevolution.org/r/7rt

First, compare the whole genomes of AcMNPV, AgMNPV, and CfMNPV; look closely at the inversion seen on the left-hand side of these genomes. Overall, AgMNPV and CfMNPV are more similar to one another than either is to AcMNPV (~73% versus ~65% sequence identity, respectively), and their overall structure is more similar as well. However, if you look closely at the left-hand side of their genomes (where there are several inverted regions with respect to the AcMNPV genome), you will notice that there is one region of CfMNPV that is not inverted with respect to AcMNPV's genome while AgMNPV's genome is. Find this region and zoom in on it for another GEvo analysis (quick link).

GEvo analysis of three baculoviruses: AcMNPV, AgMNPV, CfMNPV showing an inversion specific to AgMNPV. Analysis can be regenerated at http://genomevolution.org/r/7rx

By zooming in on this region, the inversion in AgMNPV becomes obvious. The entire region is in the same orientation in both AcMNPV and CfMNPV, as indicated by the continuous colored block drawn above the gene models. The colored blocks between either AcMNPV or CfMNPV and AgMNPV are broken, with one block drawn below the gene models. This indicates that this region is in the opposite orientation in AgMNPV. We can deduce that this inversion happened in the lineage of AgMNPV after its divergence from CfMNPV because AcMNPV's genome is like CfMNPV's AND AcMNPV is more distantly related to either of the other genomes than they are to each other. This concept of using an outgroup genome for comparison in order to determine the ancestral state between two more closely related genomes is very important. Whichever genome has the same state as the outgroup most likely retains the ancestral state, which means that the change most likely happened in the other genome. The conclusion is "most likely" rather than definite because there is a chance that the outgroup genome and one of the related genomes both changed in the same way. However, adding more outgroup taxa to the analysis will help to strengthen the case.
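The outgroup reasoning in this step follows a simple rule that can be written down as a Python sketch. The function name and the state labels ("forward", "inverted") are hypothetical; the rule is the parsimony argument described above, not an algorithm taken from CoGe.

```python
def lineage_with_change(ingroup_states, outgroup_state):
    """Infer which of two sister genomes most likely changed.

    ingroup_states: dict mapping exactly two genome names to their character state.
    outgroup_state: the state observed in the outgroup genome.
    Returns the name of the genome that differs from the outgroup, or None when
    the ingroup genomes agree or the outgroup matches neither of them.
    """
    if len(ingroup_states) != 2:
        raise ValueError("expected exactly two ingroup genomes")
    (name_a, state_a), (name_b, state_b) = ingroup_states.items()
    if state_a == state_b:
        return None  # nothing to explain
    if state_a == outgroup_state:
        return name_b  # A keeps the outgroup state, so B most likely changed
    if state_b == outgroup_state:
        return name_a
    return None  # outgroup is uninformative for this character


# Orientation of the region relative to AcMNPV, which serves as the outgroup here
states = {"AgMNPV": "inverted", "CfMNPV": "forward"}
print(lineage_with_change(states, "forward"))
```

The `None` result for an uninformative outgroup is the programmatic version of "add more outgroup taxa and look again".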

Conclusion: This example analysis represents some of the more complicated types of analyses that can be done with CoGe: comparing genomes at both the structural level and sequence similarity level using outgroups to determine the timing and placement of evolutionary events.

Using a previously built phylogenetic tree of the evolutionary relationships of 50+ baculoviruses, we identified several baculoviruses at varying evolutionary distances from AcMNPV. We next used high-resolution sequence comparisons to examine some discrepancies between the pattern of genome structure conservation and the evolutionary relationships inferred from the phylogenetic tree. In one case, there were several insertions that appeared to break up the continuity of shared genome structure between AcMNPV and PxMNPV. By comparison to the close outgroup genome, RoMNPV, we could determine that these insertions were specific to the lineage of PxMNPV because AcMNPV and PxMNPV have a higher percent sequence identity to each other than either has to RoMNPV. These analyses showed how using outgroup genomes for comparison at both the genome structural level and sequence similarity level can determine where and when evolutionary events happened.

Using a similar kind of analysis (genome structure and percent sequence identity), we checked into why the phylogenetic tree was placing TnSNPV more distantly from AcMNPV than XnGV (the latter is a granulovirus and the other two are both polyhedroviruses). This turned out to be an error in how CoGeBlast assigned a genomic feature to an overlying blast hit. A neighboring gene was erroneously being used instead of the correct polyhedrin gene. The major lesson here is that it is important to check the results, look for obvious inconsistencies, and investigate possible reasons for the error.

Finally, we used the same kind of analyses to characterize an inversion detected between AcMNPV and AgMNPV. The outgroup CfMNPV had a region that was in the same orientation as AcMNPV. This means that it is most likely that an inversion happened at this location in the AgMNPV lineage because CfMNPV had the same state as AcMNPV.

While GEvo can compare multiple genomic regions, its power and utility become severely limited when comparing large expanses of genomes. For small genomes, such as baculoviruses (~150 kb), GEvo works well for comparing their entire genome. However, for comparing whole genomes of larger organisms such as bacteria (~5 Mb), yeast (~30 Mb), fruit flies (~120 Mb), or humans (~3 Gb), interacting with and interpreting GEvo's results is difficult to impossible. GEvo is still the right choice when comparing regions within those genomes, but for comparing whole genomes, SynMap is the tool.

SynMap is CoGe's tool for generating a syntenic dotplot between any two genomes. A syntenic dotplot is one of the best ways to view a whole genome comparison by permitting the rapid identification of a variety of large scale genome patterns such as homologous regions, large scale duplications, very high-copy gene families and repetitive elements, inversions, chromosome fusions and fissions, whole genome duplications, large-scale translocations, and centromeric regions.

Syntenic dotplot of two substrains of E. coli K12, MG1655 (x-axis) and BW2952 (y-axis). Notice the two incongruities in the diagonal green line. These are the results of either an insertion or deletion. Results can be regenerated at: http://genomevolution.org/r/8fw

Syntenic dotplot from SynMap between two substrains of E. coli K12: MG1655 (x-axis) and W3110 (y-axis). Notice the large inversion. Results can be regenerated at: http://genomevolution.org/r/8f9

Find an organism: Type in part of an organism's name or taxonomic description in the appropriate box and CoGe will find all matching organisms

Select a genome: There may be different versions available for a given organism

Choosing a masked sequence: Masked sequences usually have repeated sequences converted to "X"s. These are usually available for large genomes with lots of repeat sequences (e.g. plant genomes.) Using a masked genome (when available) decreases the amount of time the analysis initially takes to run.

CDS versus genomic sequence comparisons: Some genomes have not been annotated. When possible, use CDSs as it decreases the amount of time the analysis initially takes to run.

How to read a syntenic dotplot

What are the axes: Each axis represents one genome laid end-to-end. If multiple chromosomes (or contigs, plasmids) are present, each horizontal and vertical black line represents the beginning of another chromosome. These are ordered with the largest chromosomes placed closest to the lower left-hand corner.

What are the gray dots: These are each of the identified putative homologous matches between the genomes

What are the colored dots: These are putative homologous matches that have a collinear arrangement in each genome, and are likely syntenic with respect to one another
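The way the axes are built can be sketched in Python: chromosomes are sorted largest first, laid end-to-end, and each match is placed at its chromosome's offset plus its position. This is an illustration of the layout described above, not SynMap's code; the function names and the toy chromosome sizes are assumptions.

```python
def axis_offsets(chromosome_lengths):
    """Lay chromosomes end-to-end, largest first, and return name -> start offset."""
    offsets = {}
    position = 0
    for name, length in sorted(chromosome_lengths.items(), key=lambda item: -item[1]):
        offsets[name] = position
        position += length
    return offsets


def dot_position(hit, x_offsets, y_offsets):
    """Convert a match (chr_x, pos_x, chr_y, pos_y) into dotplot coordinates."""
    chr_x, pos_x, chr_y, pos_y = hit
    return x_offsets[chr_x] + pos_x, y_offsets[chr_y] + pos_y


# Toy genomes
x_offsets = axis_offsets({"chr2": 500, "chr1": 1000, "plasmid": 50})
y_offsets = axis_offsets({"chrA": 800})
print(x_offsets)
print(dot_position(("chr2", 120, "chrA", 30), x_offsets, y_offsets))
```

The offsets are also where the black chromosome boundary lines would be drawn, with the largest chromosome starting at the lower left-hand corner.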

How to visualize various genomic patterns:

Inversions: a colored line is broken, with a region flipped 90 degrees

Duplications: one region in one genome matches multiple regions in the other genome

Translocations (fission)/fusion: a contiguous series of colored dots in one genome is separated into two or more regions in the other genome

Insertions/deletions: a discontinuity in a series of colored dots

Analysis Options

Which algorithm to use: BlastZ

What is DAGChainer: The algorithm used to identify collinear series of homologous matches

Merging syntenic blocks: For some downstream analyses, joining together neighboring syntenic regions is desirable. This option will merge neighboring syntenic regions based on some distance. Overall, the results will appear identical, but the final output file (which can be downloaded) will have fewer and larger syntenic blocks.

Syntenic Depth: This option enforces a strict number of syntenic regions for covering the entire genome. If a 1:1 relationship is stated, this will pick the best syntenic block for each region of the genome and any other blocks overlapping a previously covered region will be missed. For example, if a 1:1 relationship is specified and there is a small region that has been duplicated, only the best region will be displayed.

Why calculate synonymous/non-synonymous (Ks/Kn) mutation values: (NOTE: only works for CDS-CDS comparisons!) These metrics can be used to estimate the evolutionary distance between syntenic regions relative to one another and identify genes under purifying or positive selection. If this option is used, each syntenic gene pair identified will have these metrics calculated. The resulting dotplot will have each syntenic gene-pair colored according to these values and a histogram of the values displayed. For more information on these metrics, please see: http://en.wikipedia.org/wiki/Ka/Ks_ratio
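To give a feel for what "identifying collinear series of homologous matches" means in the analysis options above, here is a deliberately simplified Python sketch. It is not the DAGChainer algorithm (which scores chains and penalizes gaps); it just finds the longest run of matches whose positions increase in both genomes with no step larger than a hypothetical `max_gap`. All names and numbers are assumptions.

```python
def longest_collinear_chain(hits, max_gap=10):
    """Return the longest chain of (x, y) matches that increases in both genomes.

    hits: list of (x, y) gene-order positions of putative homologous matches.
    Consecutive chain members may be at most max_gap apart on each axis.
    Only same-orientation (positive slope) chains are considered.
    """
    hits = sorted(hits)
    n = len(hits)
    if n == 0:
        return []
    best_len = [1] * n   # length of the best chain ending at each hit
    previous = [-1] * n  # back-pointer to rebuild the chain
    for j in range(n):
        xj, yj = hits[j]
        for i in range(j):
            xi, yi = hits[i]
            collinear = xi < xj and yi < yj
            close = xj - xi <= max_gap and yj - yi <= max_gap
            if collinear and close and best_len[i] + 1 > best_len[j]:
                best_len[j] = best_len[i] + 1
                previous[j] = i
    end = max(range(n), key=lambda k: best_len[k])
    chain = []
    while end != -1:
        chain.append(hits[end])
        end = previous[end]
    return chain[::-1]


# Four collinear matches plus two scattered ones (the "gray dots")
matches = [(1, 1), (2, 2), (3, 3), (4, 4), (2, 40), (30, 5)]
print(longest_collinear_chain(matches))
```

In dotplot terms, the members of the returned chain would be drawn as colored dots and the two leftover matches would stay gray. An inverted block could be handled the same way after negating one axis.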

Inversion: Syntenic regions are colored green and blue if they have a positive or negative slope respectively

Diagonal: A dozen colors are used to color each syntenic region differently

Ks values: This is set under the "Analysis Options" and syntenic pairs are colored based on their synonymous/nonsynonymous rates.

Drawing boxes around syntenic regions: This option draws a box around each syntenic region. If you would like to merge these together, you can select that under "Analysis Options" and then evaluate the results with this option.

Dotplot axis metrics: This specifies whether the axis metrics are in nucleotides or genes. While nucleotides are often used, switching to genes sometimes helps make dotplots easier to read if one genome is vastly larger than the other due to rampant transposon insertions.

Master image size: How large the dotplot is. The dynamic setting adjusts the size based on the number of chromosomes being displayed.

Minimum chromosome size: Don't display any chromosome under a certain size. This is useful if comparing a contig-level assembly with many very small contigs that don't contain enough sequence data to be useful.

Order by syntenic path: This option is used to assemble contig-level genome assemblies against a reference genome. Each contig is placed and oriented such that a continuous syntenic path is traced along the reference genome. This option will automatically determine which genome is the reference by using the genome with fewer chromosomes (or contigs).
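The placing and orienting step described for "Order by syntenic path" can be sketched in a few lines of Python. This is a simplified illustration, not CoGe's procedure. It assumes a hypothetical input in which each contig name maps to a list of (reference_position, contig_position) syntenic pairs; each contig is placed at the median reference position of its pairs and flipped when its positions run opposite to the reference.

```python
from statistics import median


def order_contigs(hits):
    """Order and orient contigs along a reference genome.

    `hits` maps a contig name to a list of (ref_pos, contig_pos)
    pairs; this layout is an assumption made for the example.
    """
    placed = []
    for name, pairs in hits.items():
        if not pairs:
            continue  # a contig without syntenic pairs cannot be placed
        ordered = sorted(pairs)
        anchor = median(ref for ref, _ in ordered)
        # If contig positions fall as reference positions rise,
        # the contig is reverse-oriented relative to the reference.
        strand = "-" if ordered[-1][1] < ordered[0][1] else "+"
        placed.append((anchor, name, strand))
    return [(name, strand) for _, name, strand in sorted(placed)]


hits = {
    "contig_2": [(5000, 10), (6000, 900)],
    "contig_1": [(100, 800), (900, 50)],
}
print(order_contigs(hits))
```

In this toy input, contig_1 is placed first and reverse-complemented, and contig_2 follows in its original orientation.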

Interacting with results:

Clicking on a chromosome-chromosome comparison in the master dotplot creates a zoomed-in popup of that comparison. If selected, Ks colors and histograms are dynamically rescaled for that comparison. The size of the comparison can be adjusted using the parameters box displayed above the master dotplot.

When mousing around in the popup syntenic dotplot, you will notice that the cross-hairs turn red when over a gray or colored dot. There is a draggable dialog box above the dotplot that will dynamically display the names of the genes under the cross-hairs. Also, when the cross-hairs are red, you can click on the dot and link directly to GEvo.

When linked to GEvo, the comparison will be anchored on the selected regions with 50,000nt of sequence up and downstream automatically selected.
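The region that ends up in the comparison is simply the selected feature plus the padding on each side, clipped to the ends of the chromosome. A small Python sketch of that arithmetic follows. The function and parameter names are hypothetical, and 1-based inclusive coordinates are assumed.

```python
def padded_window(anchor_start, anchor_stop, chromosome_length, pad=50_000):
    """Return (start, stop) of a region extended by `pad` nt each side.

    Coordinates are assumed to be 1-based and inclusive; the window is
    clipped so it never runs off either end of the chromosome.
    """
    start = max(1, anchor_start - pad)
    stop = min(chromosome_length, anchor_stop + pad)
    return start, stop


print(padded_window(574_000, 575_000, chromosome_length=5_000_000))
```

Passing a larger `pad` mirrors widening the amount of sequence compared in GEvo.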

SynMap popup dotplot between two substrains of E. coli K12: DH10B (x-axis) and W3110 (y-axis). Various interface options and genomic patterns are labeled. Results can be regenerated at http://genomevolution.org/r/9v5 .

With GenomeView you can check how many Bacillus thuringiensis (Bt) genomes there are in CoGe. There are 32 organisms with genome sequence for Bt. However, if you scroll through them with GenomeView, you'll notice that several of them only contain plasmid sequences. Also, several of them are not fully assembled into pseudomolecules.

Select these two genomes by typing part of their names in the "Organism 1" and "Organism 2" search boxes. Select the appropriate genome and press "Generate SynMap" (quick link).

Syntenic dotplot between Bacillus thuringiensis strains konkukian (x-axis) and Al Hakam (y-axis). Note that these two genomes are nearly identical in their overall genome structure. Results can be regenerated at http://genomevolution.org/r/a36

The syntenic dotplot of these two genomes shows that they are nearly identical in their overall genome structure. This is evidenced by the nearly continuous green line running from the lower left-hand corner of the graph to the upper right-hand corner. The dots in the green line identify syntenic gene pairs. If you look closely at the graph, you'll notice that there are many discontinuities in the green line. These represent regions where something has been either deleted from or inserted into one of the genomes. Also, note that each genome has a small region separated by a horizontal or vertical black line. These are plasmids.

To analyze one of these discontinuities in more detail, click anywhere on the large chromosome-chromosome comparison in the dotplot. This will pop up a window with just a dotplot of those chromosomes. Using this dotplot, move the cross-hairs over a discontinuity. If the cross-hairs move over a neighboring gene pair, they will turn red. When they turn red, click your mouse button to launch GEvo. (quick link)

GEvo provides a high-resolution analysis of this syntenic region between these genomes and lets us see all of the changes that have occurred in these regions since the genomes diverged from one another. While the large discontinuity is quite obvious, perhaps equally interesting are the numerous smaller insertion/deletion (indel) changes that have also occurred. These indels can be investigated as described in a prior example analysis, by looking for direct sequence repeats, to determine whether each is a new insertion or a deletion. Also, GEvo provides ways to determine what the annotations are for the "new" genes and to extract their sequences.

Question: For the large discontinuity, do you think this is due to an insertion or a deletion? Why?

Question: How many of the smaller insertions are due to transposons?

In order to determine what is in the large discontinuity, you can click on each gene model and look at its annotation, or you can build a feature list of all the features. Remember, once you have a list of features, you can easily get their fasta sequences, annotations, GC content, etc. This tool lets you easily save or send the data for future work or analysis. To extract all the genes from the new large insertion:

Move the slider-bars for the lower panel until they border the insertion

Click "Get Sequence" in the Sequence 2 submission box. This will launch SeqView (quick link). Note: SeqView is a program that is very similar to FastaView except that instead of dealing with getting sequence for genomic features, it deals with getting sequence for a genomic region.

To extract all the CDS features from this sequence, select "CDS" from the scroll-down menu located below the displayed sequence, next to the "Extract Features" button. Once it is selected, press "Extract Features". This will launch FeatList to view the list of features. (quick link)
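If you have downloaded a feature table and want to reproduce this selection offline, the same filtering can be sketched in Python. This is only an illustration of the idea. The feature records, their field names, and the coordinates below are hypothetical, and the sketch keeps only features that lie entirely inside the chosen region.

```python
def extract_features(features, region_start, region_stop, feature_type="CDS"):
    """Return features of one type lying entirely inside a region.

    `features` is a list of dicts with 'name', 'type', 'start' and
    'stop' keys; this layout is an assumption made for the example.
    """
    return [
        f for f in features
        if f["type"] == feature_type
        and region_start <= f["start"] <= f["stop"] <= region_stop
    ]


# Made-up records, used only to show the call.
features = [
    {"name": "geneA", "type": "CDS", "start": 1200, "stop": 2100},
    {"name": "trnX", "type": "tRNA", "start": 2500, "stop": 2575},
    {"name": "geneB", "type": "CDS", "start": 9000, "stop": 9900},
]
for feature in extract_features(features, 1000, 5000):
    print(feature["name"], feature["start"], feature["stop"])
```

Only geneA is reported here: trnX is inside the region but is not a CDS, and geneB lies outside the region.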

With FeatList, you can:

view all the annotations by pressing "Get all" under the Annotation column header.

select all the features to send them to FastaView to get their nucleotide or protein sequences.

Having this list of features will help you with the next question: what are these genes? However, looking at the annotations shows that most are annotated as "hypothetical protein", with nothing obvious as to their origin. You can take these protein sequences, blast them against the [NCBI] nr (nonredundant sequence) database, and see whether you find any interesting and annotated matches.

Author's Note: I've done this analysis and was not able to learn anything new about these genes. It is my experience that genes from very high GC or very low GC bacterial genomes have not been well characterized and many (if not most) have no annotations. However, given that this large region is a new insertion, is ~40kb, and contains proteins with a few highly conserved domains such as "resolvase", "transposase", and "DNA-binding response regulator", and given that I have viewed similar types of insertions in better characterized bacteria (e.g. E. coli), my general feeling is that this is a prophage insertion.
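GC content, mentioned in the note above and reported by CoGe's feature tools, is simply the fraction of G and C bases among the unambiguous bases of a sequence. For sequences you have saved locally, a minimal Python sketch of that calculation looks like this. The function name is hypothetical, and ambiguous characters such as "N" are ignored by choice.

```python
def gc_content(sequence):
    """Return the fraction of G and C among A, C, G and T bases.

    Ambiguous characters (for example 'N') are left out of the count.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


print(gc_content("GGCCAATT"))
```

Comparing the GC content of an inserted region with that of the surrounding genome is one quick way to look for sequence of foreign origin.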

Syntenic dotplot between two strains of Bacillus thuringiensis: konkukian (x-axis) and pondicheriensis (y-axis). Notice the high number of inversions. Results can be regenerated at http://genomevolution.org/r/a2z

Note that while there is still a high degree of synteny between these two genomes, a number of inversions have happened since their lineages diverged. These apparent inversions can be the result of two phenomena: a real biological inversion or a misassembly of contigs. One tell-tale signature of a true inversion in bacterial genomes is that its break-points occur at homologous (or very similar) sequences. For example, inversions often happen between pairs of ribosomal operons, tRNA clusters, prophages, and transposons. However, these types of repetitive sequences also cause problems in genome assembly (depending on the sequencing strategy and technology used). For an example of how significant these types of genome assembly errors can be, see this comparison of two different versions of the Medicago genome. Let's examine the large inversion in the lower left-hand corner of the syntenic dotplot between these genomes.

First, open the popup chromosome-chromosome comparison by clicking on it in the master syntenic dotplot.

Center the cross-hairs in the middle of the large inversion. The approximate position is konkukian 574,000nt and pondicheriensis 634,000nt. Click in that region when the cross-hairs turn red to launch GEvo. (quick link)

When GEvo launches, adjust the amount of sequence to be compared for both regions by typing "150000" in the text box "Apply distance to all CoGe submissions" located below the sequence submission boxes. (quick link)

Run the analysis by pressing "Run GEvo Analysis!"

GEvo comparison of two Bacillus thuringiensis strains konkukian and pondicheriensis showing an inversion. Results can be regenerated at: http://genomevolution.org/r/af6

The GEvo analysis of this region clearly shows the large inversion. While it may be difficult to see, at the break-points of the inversion are groups of tightly drawn gray arrows. These are tRNA clusters, and they are present near each end of the inversion except the left side of pondicheriensis. At that region in pondicheriensis, the tRNA clusters are inset ~15kb and border what appears to be a second inversion.

GEvo comparison of two Bacillus thuringiensis strains konkukian and pondicheriensis showing an inversion breakpoint with a cluster of tRNAs. Results can be regenerated at http://genomevolution.org/r/af8

By zooming in around one of the tRNA clusters in konkukian, you can see that it is present in two regions in pondicheriensis. Interestingly, the neighboring rRNA genes are missing in both regions of pondicheriensis. This makes these regions suspect as these sequences are notoriously difficult to assemble by whole genome shotgun approaches using de novo assembly algorithms. Let's look at one of these regions more closely in pondicheriensis.

GEvo comparison of two Bacillus thuringiensis strains konkukian and pondicheriensis showing one end of an inversion breakpoint with a cluster of tRNAs. Note the orange bands in the image for pondicheriensis. These are unsequenced regions of the genome ("N"s) and represent missing sequence. These often occur in repetitive sequences that are difficult to assemble. Also, they mark where two contigs were joined together, and perhaps incorrectly placed. Results can be regenerated at http://genomevolution.org/r/afa

Zooming in on a tRNA cluster from both Bt genomes shows something new in pondicheriensis: three orange bands. When the background is colored orange in a GEvo panel, this represents genomic sequence made up of "N"s. "N"s are often used to represent unsequenced regions of a genome, as well as places where two contigs were joined together during an assembly and the intervening sequence is unknown. As such, the rRNA genes present in konkukian may be present in pondicheriensis as well (they probably are!), within one of the sequence gaps represented by the orange bands. rRNA sequences are quite large (in this case ~5kb for both the 16S and 23S), occur in several copies within a genome, and are nearly identical to one another, so they cause many problems for sequence assembly algorithms. Therefore, it is no great surprise to see that they are missing in the pondicheriensis genome and that we instead see a gap represented by orange bands. These gaps also mean that we cannot be confident that these contigs were correctly oriented; they may have been placed backwards. This means that this inversion might not be real.
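If you download the sequence of such a region, the gaps that GEvo draws as orange bands can be located directly by searching for runs of "N". The following Python sketch shows one way to do this. The function name and the `min_length` parameter are assumptions made for the example, and positions are reported as 1-based inclusive coordinates.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, stop) for each run of N's in a sequence.

    Coordinates are 1-based and inclusive; runs shorter than
    `min_length` are ignored.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            # match.start() is 0-based; match.end() is exclusive.
            gaps.append((match.start() + 1, match.end()))
    return gaps


print(find_gaps("ACGTNNNNACGTNNA"))
print(find_gaps("ACGTNNNNACGTNNA", min_length=3))
```

Gaps that coincide with the break-points of an apparent inversion are a reason to treat that inversion with caution.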

Question: Are any of the other inversions in this genome suspect of being due to assembly errors?

Question: Using the E. coli genome shown at the beginning of this analysis, what sequences flank the inversion? Is this inversion suspect as well?

Video of Example Analysis 4

Conclusion:

This tutorial gave a walk-through of many of the tools in CoGe. You learned how to search for genomes of interest, extract sequences, find homologs, build phylogenetic trees, compare multiple genomic regions, identify various genomic changes, and compare whole genomes using syntenic dotplots. However, there are many more tools in CoGe. Remember that each one is designed to do one thing and is linked to many other tools. Together, this creates an open-ended network of analysis options and allows your research questions to drive where you take CoGe.