This post is part of a series exploring the evolution of a duplicated gene in the genus Drosophila. Links to the previous posts are above. Part 3 of this series (Obtaining Sequences) can be found below.

Obtaining Sequences

In the previous post I described the aldolase gene family, whose members encode proteins involved in cellular respiration. There are two aldolase genes in the Drosophila melanogaster genome. If we want to study the evolution of these genes, we can obtain the sequences from the 12 Drosophila species that have had their complete genomes sequenced. We’ll start by getting the sequence of one aldolase gene from Drosophila melanogaster using FlyBase.

In the image above I have entered the query “aldolase” into a search of all D. melanogaster genes. You can follow along by visiting FlyBase and entering the same query. If you perform this search you get two results:

We want the first result (Aldolase), so we’ll click on the link to “Ald”. That will take us to the FlyBase Gene Report for Aldolase. Part of that report includes a SUMMARY:

Within the SUMMARY are links to various other pages: one will take us to a genome browser focused on the region containing the Aldolase gene; one will show us the various alleles of Aldolase; and one will allow us to download the various transcripts encoded by Aldolase. Notice that there are eight reported transcripts for the gene (this is possible thanks to alternative splicing). I previously mentioned that there are three different protein coding sequences encoded by this gene (Shaw-Lee et al. 1992). How can we reconcile that with the fact that there are eight transcripts? We can view each of those transcripts by clicking on the link to the region in GBrowse (either by clicking “3R:22080404..22087313” or by clicking the “GBrowse” link further up the page). This is what the GBrowse page for the Aldolase gene region looks like:

Each splice-form has a unique messenger RNA (mRNA), named Ald-RA through Ald-RH, that is produced via alternative splicing of the initial RNA transcribed from the gene. The splice-forms differ in which protein coding exons they contain and in which 5′ untranslated regions (5′ UTRs) they contain. Below the eight mRNAs are the coding sequences (CDS) encoded by each mRNA (named Ald-PA through Ald-PH). Note that some of the CDS encoded by different mRNAs are identical (e.g., Ald-PB, Ald-PC, and Ald-PD). That’s because each of those mRNAs encodes the same protein, but they differ in which 5′ UTR becomes part of the processed transcript. Those 5′ UTRs are involved in regulating the translation of the mRNA into a protein.

For the purposes of this experiment, we’ll only concern ourselves with the protein coding sequence of the gene. That means we only need one copy of each transcript that encodes an identical CDS (i.e., we only need Ald-PB, and not Ald-PC or Ald-PD). Additionally, we can ignore CDS that are merely a subset of other CDS (i.e., Ald-PB makes up part of Ald-PF). That leaves us with three CDS to download: Ald-PA, Ald-PF, and Ald-PH. I have a feeling that these are the three different protein coding sequences that were previously described (Shaw-Lee et al. 1992).
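The two filtering rules above (keep one copy of each identical CDS, then drop any CDS that is contained in a longer one) can be sketched in Python. This is only an illustration under stated assumptions: the sequences and names below are made-up toy data, not the real Aldolase CDS, and "subset" is simplified to mean one contiguous stretch of a longer sequence. The function name nonredundant_cds is hypothetical.

```python
def nonredundant_cds(cds_by_name):
    """Return a dict of CDS with duplicates and contained sequences removed."""
    # Rule 1: keep only the first name seen for each distinct sequence.
    unique = {}
    for name, seq in cds_by_name.items():
        if seq not in unique.values():
            unique[name] = seq

    # Rule 2 (simplifying assumption): treat a CDS as a "subset" if it
    # appears as a contiguous substring of another, different CDS.
    kept = {}
    for name, seq in unique.items():
        contained = any(seq != other and seq in other
                        for other in unique.values())
        if not contained:
            kept[name] = seq
    return kept


# Toy sequences, invented for illustration only.
toy = {
    "cds_1": "ATGAAACCCTAA",
    "cds_2": "ATGAAACCC",     # contained in cds_1
    "cds_3": "ATGAAACCC",     # identical to cds_2
    "cds_4": "ATGGGGTTTTAA",
}

print(sorted(nonredundant_cds(toy)))  # ['cds_1', 'cds_4']
```

For real alternatively spliced transcripts, a shared CDS may not be a single contiguous stretch, so comparing the annotated exons in GBrowse by eye (as done above) remains the safer check.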

To download those CDS, we’ll return to the Gene Report for Aldolase (shown above). In the SUMMARY, we can click on the link to the “8 annotated transcripts”, which takes us to this page:

From this page, we can do a batch download of the CDS we’re interested in: Ald-PA, Ald-PF, and Ald-PH. First, select the three transcripts we want (A, F, and H) by using the check boxes on the left. Next, click the “HitList Conversion Tools” button, and select “Batch Download” from the drop-down menu. That takes you to the batch download page:

We want our data in the FastA format, a simple format for DNA or protein sequence files that is readable by most software used to analyze sequence data. In the Output Options dropdown select “CDS” (we only want the protein coding sequence of the gene). Finally, click the green “Get FastA” button at the bottom of the page, and you will be taken to a webpage that contains three FastA sequences, one for each CDS. Copy and paste those into a text file because we’ll be using them in subsequent analyses.
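If you want to check that the pasted text really contains three sequences, a few lines of Python are enough to read FastA. The sketch below assumes only the basic layout of the format (a header line starting with ">" followed by one or more lines of sequence). The headers and sequences in the example are invented placeholders, not what FlyBase actually returns, and parse_fasta is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FastA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Placeholder records for illustration; real FlyBase headers will differ.
example = """>toy_cds_1 hypothetical header
ATGAAA
CCCTAA
>toy_cds_2 hypothetical header
ATGGGGTAA
"""

for header, seq in parse_fasta(example).items():
    # Print the first word of the header and the sequence length.
    print(header.split()[0], len(seq))
```

To use it on your own file, read the file's contents into a string and pass that to parse_fasta; you should get one entry per CDS you downloaded.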

You can download the other aldolase gene (CG5432) in the same manner. There is only one transcript of this gene, so it should be a bit easier. Next time, we’ll try to find these two genes in the genomes of some other species of Drosophila.