I'm having trouble finding a way to locate the exons in the original DNA sequence from which an mRNA was transcribed, even given the sequence of the mRNA, because I cannot reliably identify where introns and exons begin and end. Is there an algorithm or method that I can use to do this? What information am I missing to be able to find the beginning and end of exons and introns?

2 Answers

The only information you are missing is a way to identify the splice sites. There are many ways of doing what you need. The simplest, assuming you are sure of the origins of the mRNA, is to use a BLAST flavor, either plain BLASTn or, even better, BLAT, to compare your mRNA sequence to the genome of interest. BLAT really should be all you need if the mRNA comes from the genome you are aligning it to.
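
As a rough sketch of how to turn BLAT output into candidate exon coordinates: the default PSL output lists every gapless aligned block, and for an mRNA aligned to its own genome those blocks correspond roughly to exons. The helper below is only an illustration. psl_exons is a made-up name, the file names are placeholders, and it assumes a nucleotide-versus-nucleotide PSL with 21 tab-separated columns.

```shell
# psl_exons: read BLAT PSL output on stdin and print one line per aligned
# block (candidate exon): query, target, strand, 1-based start, end.
# Assumption: untranslated PSL, where column 19 holds the block sizes and
# column 21 holds the 0-based block starts on the target.
psl_exons() {
    awk -F'\t' '
        NF == 21 && $1 ~ /^[0-9]+$/ {        # skips the PSL header lines
            n = split($19, bs, ",")          # block sizes, e.g. "100,90,100,"
            split($21, ts, ",")              # target starts, 0-based
            for (i = 1; i <= n; i++) {
                if (bs[i] == "") continue    # trailing comma leaves an empty item
                printf "%s\t%s\t%s\t%d\t%d\n", $10, $14, $9, ts[i] + 1, ts[i] + bs[i]
            }
        }'
}

# Possible usage (file names are placeholders):
#   blat dna.txt mrna.txt out.psl
#   psl_exons < out.psl
```

Keep in mind that a short gap between two blocks may be a small insertion or deletion rather than an intron, so very close blocks probably belong to the same exon.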

BLAT, however, has only limited knowledge of splice sites and will mostly cut your mRNA so as to maximize local alignment scores, so exon boundaries can end up a few bases off. If you need something more sophisticated than BLAT, use a specialized aligner as @SteveLianoglou suggested. These programs include a model of splice sites (both canonical and non-canonical) and will place the exon boundaries more accurately. My personal favorite is exonerate.
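
To make the idea of a splice-site model a bit more concrete: the vast majority of introns start with GT and end with AG when read on the transcribed strand, and that is the canonical signal these aligners look for. The sketch below simply checks those two dinucleotides for one candidate intron. It is not what exonerate does internally. intron_ends is a hypothetical helper name, and it assumes a FASTA file with a single sequence, a gene on the forward strand, and 1-based inclusive intron coordinates.

```shell
# intron_ends FASTA START END
# Print the first two and last two bases of the interval START..END and
# say whether they match the canonical GT...AG pattern.
# Assumptions: one sequence in the FASTA file, forward-strand gene,
# 1-based inclusive coordinates.
intron_ends() {
    awk -v s="$2" -v e="$3" '
        /^>/ { next }                        # skip the FASTA header
        { seq = seq $0 }                     # join the wrapped sequence lines
        END {
            seq = toupper(seq)
            donor = substr(seq, s, 2)        # first two bases of the intron
            acceptor = substr(seq, e - 1, 2) # last two bases of the intron
            verdict = (donor == "GT" && acceptor == "AG") ? "canonical" : "non-canonical"
            print donor, acceptor, verdict
        }' "$1"
}

# Possible usage (coordinates are placeholders):
#   intron_ends dna.txt 1101 1400
```

For a gene on the reverse strand the same intron reads CT...AC in the forward sequence, so you would need to reverse-complement first.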

Assuming you are using Linux (which you should be if you are planning on doing a lot of this type of analysis), you can install it on a Debian-based distribution (such as Ubuntu and its variants) with this command:

sudo apt-get install exonerate

Then, to align the mRNA to its gene do:

exonerate -m est2genome -n 1 mrna.txt dna.txt > out.txt

-m is the model that exonerate will use to align your two sequences. Since the important thing in your case is to correctly model splice sites, you should use est2genome. If you are comparing different species, use coding2genome instead, and if you have a full-length cDNA including UTRs, use cdna2genome.

-n 1 (short for --bestn 1) means report the best match only. Assuming you are aligning an mRNA to its coding gene, this is all you are interested in.
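
If you want the exon and intron coordinates in a form you can process further, one option is to ask exonerate for GFF output on the target with --showtargetgff and filter it. This is only a sketch: exonerate_features is a hypothetical helper name, and it assumes the GFF lines are tab-separated with the feature type (exon, intron) in column 3, the coordinates in columns 4 and 5, and the strand in column 7.

```shell
# exonerate_features: read exonerate output on stdin and print the exon and
# intron features as: type, target name, start, end, strand.
# Assumption: output was produced with --showtargetgff, giving tab-separated
# GFF lines mixed in with the normal alignment text.
exonerate_features() {
    awk -F'\t' '
        /^#/ { next }                        # skip GFF comment lines
        NF >= 8 && ($3 == "exon" || $3 == "intron") {
            printf "%s\t%s\t%d\t%d\t%s\n", $3, $1, $4, $5, $7
        }'
}

# Possible usage (file names are placeholders):
#   exonerate -m est2genome -n 1 --showtargetgff yes mrna.txt dna.txt > out.txt
#   exonerate_features < out.txt
```

The alignment text that exonerate prints by default has no tab-separated fields, so it is ignored by the filter and can stay in the same file.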