I have a university assignment and I need to understand a few things
before figuring out how to do it.

The task is the following:

Find matches of known proteins (DNA-PolyI, II, III) to the specific E. coli DNA sequence.

I downloaded, in FASTA format, the protein sequences of DNA-Poly3 and DNA-Poly1 of E. coli (strain K-12) and the entire DNA sequence of E. coli.

I've studied a bit online and, using the BioRuby gem and the Ruby programming language, I wrote a program that translates DNA to a protein sequence.
Then I tried to match the known DNA-Poly3 sequence, but it did not match. After searching a bit online again, I learned about ORFs
and the 6 possible reading frames of each sequence. The longest ORF, in terms of codons, is usually chosen, but there's no way of telling for sure that the protein was made using
this frame.

Then I read about TATA boxes, but I can't use those since they are found only in eukaryotes and archaea.

So how should I proceed in order to solve this problem: How can I prove that the DNA-Poly3 gets produced by a specific area (gene) in the DNA sequence?

Thanks for your time,

ps. Insights and hints are very much welcomed as this is just the tip of the iceberg for me and I'm very willing to study bioinformatics :-)

Do you need to write a program that does this or will existing tools be OK?
– terdon, Feb 12 '13 at 17:07

I should write the program and put up a web interface.
– atmosx, Feb 12 '13 at 17:19


OK, but do you need a totally new algorithm or can you use existing tools like BLAST? If this is a computer science course, you should implement the algorithms yourself (I guess); if it is a bioinformatics course, you should not reinvent the wheel. There are many very, very good programs that already do what you need. Is it OK to integrate them into your web interface?
– terdon, Feb 12 '13 at 17:20

Yes I should not (by any means) re-invent anything. I can use algorithms already in use. I know there are very sophisticated tools for the job. What I really want to know to begin with is how it works, then I can use ready algorithms.
– atmosx, Feb 12 '13 at 17:24

To give a clear example, all my data comes from NCBI (BLAST). How come I cannot match it? I've tried all 6 possible reading frames and still I can't "produce" a sequence that matches the DNA-Poly3! :-/
– atmosx, Feb 12 '13 at 17:26

3 Answers

IMPORTANT EDIT : In your particular case, if you are working with bacterial genes, splicing is not an issue since bacteria do not have introns. I am leaving the information here since it may be useful to someone else. However, I recommend you focus on the UTRs since they are probably what is causing you problems.

There are several things that could be causing you problems, and I will briefly touch on each one. I will talk about genes in general; bear in mind that bacteria have no introns, so any discussion of splicing, introns and exons is not directly relevant to your problem.

1. UTRs

Untranslated Regions (UTRs) are sequences at the beginning and end of a gene that are not translated into protein. UTRs are part of the original genomic sequence and also part of the mature mRNA (indeed, UTRs are sometimes modified by splicing events; they are exons, not introns), but they do not get translated into protein. To illustrate, have a look at this simplified representation of an mRNA molecule:

Only the green exons will make it into the final protein. Introns are spliced out and UTRs are not translated.

Therefore, if you translate the entire gene, you will not get the correct protein.

2. Reading frames

Genes are read in words of three letters (the codons). The sequence ATGTGTACCTGA has six possible reading frames (three on each strand) which can be read and translated as follows:

5'3' Frame 1

ATG TGT ACC TGA
M C T Stop

5'3' Frame 2

a TGT GTA CCT ga
C V P

5'3' Frame 3

at GTG TAC CTG a
V Y L

3'5' Frame 1

TCA GGT ACA CAT
S G T H

3'5' Frame 2

t CAG GTA CAC at
Q V H

3'5' Frame 3

tc AGG TAC ACA t
R Y T

DNA is double stranded. The sequence of one strand is complementary to that of the other, so if you have one strand you can infer the sequence of the other. Genes can be found on either strand; the two are biologically equivalent. However, sequencing projects choose one of the two strands (arbitrarily), call it the plus (+) strand, and then save all sequences with respect to that strand. This means that the genomic sequence you download from a database might be the reverse complement of the actual sequence you are looking for.
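As a minimal sketch of the six reading frames described above, the following Python translates a DNA string in all three offsets on both strands. The function names are mine, the standard genetic code is assumed, and incomplete trailing codons are simply ignored:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map (strand, frame number) to the translation of that reading frame."""
    frames = {}
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset + 1)] = translate(dna[offset:])
    return frames


for (strand, frame), protein in six_frames("ATGTGTACCTGA").items():
    print(strand, frame, protein)
```

For the example sequence this reproduces the six translations listed above, with `*` standing for Stop.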

3. Names

I once heard someone say in a conference that

Biologists would rather share a toothbrush than a gene name.

While that might be a little exaggerated, naming conventions vary between research communities and species and databases. So, are you sure that you have downloaded the correct gene? Where did you get it from? How did you identify it? Does the sequence also contain up/downstream regulatory regions, promoters, enhancers and the like? If you post the exact sequence you are attempting to use I can give you more specific help.

For example, the first 20 hits when searching for the E. coli DNA Polymerase 3 in NCBI's nucleotide database are whole genome shotgun sequences. These do not correspond to the gene sequence you are looking for. They are huge pieces of the genome (or even the entire genome) that will contain your gene and many others. Look at the Tools section below for suggestions on extracting your gene from the whole genome.

4. Splicing (irrelevant to bacteria)

Another possible problem is splicing. Let's start with the basics: the process of producing a eukaryotic protein (bacteria have no introns) from a genomic sequence is summarized in the image below (modified slightly from here):

Transcription begins at the transcription start site (TSS), but not all the transcribed sequence is translated into protein. First, the introns are spliced out of the mRNA to produce the mature mRNA (other things like capping and poly-A addition also occur but are not relevant here). So, the mature mRNA contains only the exons of the coding gene. This means that a linear translation of the gene's sequence will not correspond to the protein produced. You will need to take splicing into account.

Now, if the sequence ATGT were spliced at, for example, AT/gt (most splice events cut/join at GT/AG sites) and joined with the sequence agATTATT, the resulting (spliced) sequence would be (the splicing process will remove the gt from the first sequence and the ag from the second):

ATATTATT

As you can see, the reading frame has now changed. Where before, in the first reading frame, we had the codon ATG, the canonical translation initiation codon, we now have ATA which codes for isoleucine (I). I hope that is clear, the main point is that splicing can change the reading frame.
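The toy splice above can be reproduced in a few lines of Python. This is only an illustration of the example, not a splice-site model; the `splice` helper is hypothetical and assumes the first piece ends in a GT donor and the second starts with an AG acceptor:

```python
def splice(upstream, downstream):
    """Join two pieces, dropping the GT donor that ends the first piece
    and the AG acceptor that starts the second."""
    up, down = upstream.upper(), downstream.upper()
    if not up.endswith("GT") or not down.startswith("AG"):
        raise ValueError("expected a GT donor and an AG acceptor")
    return up[:-2] + down[2:]


before = "ATGT"
after = splice(before, "agATTATT")
# First codon before and after splicing.
print(before[:3], "->", after[:3])
```

The first codon goes from ATG (Met) to ATA (Ile), which is the change described above.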

5. Tools

OK, that was the background. Now, what you will need to do is use existing programs that model splice sites and can correctly align a protein sequence to genomic DNA. My personal favorites are exonerate and genewise. On a Debian-based Linux distribution, you can install them with this command:

sudo apt-get install exonerate wise

Then, to align the protein to its gene do:

exonerate -m protein2genome -n 1 prot.fa dna.fa > out.txt

or

genewise -pep -pretty -gff -cdna prot.fa dna.fa > out.txt

In my experience exonerate is (much) faster but genewise is a little more accurate. I usually use exonerate if I am dealing with a whole genome and genewise if I only have a few kilobases of sequence. Both are very good and both will be able to align a protein to its genome of origin.

I will not explain all these options because that is beyond the scope of this site. Have a look at their documentation (which is quite good and clear) and if you still have problems, you could ask a question over at biostars.org.

Thanks for the answer! I will study it in a couple of hours when I have time! Thanks for your time!
– atmosx, Feb 12 '13 at 18:49

Just to let you know, the way to give thanks on the Stack Exchange network sites is to upvote the answer and, if (and only if) it answers your question to accept it. Thanks are always appreciated of course, I'm just letting you know how these sites work. Anyway, you're very welcome :).
– terdon, Feb 12 '13 at 18:59

The OP asked about an E. coli protein and the E. coli genome. I don't think his problem is anything to do with splicing.
– Alan Boyd, Feb 12 '13 at 18:59

@AlanBoyd ah, you make a good point. Probably down to the UTRs then. I'll edit my answer.
– terdon, Feb 12 '13 at 19:01

Hello, thanks for the explanation, it was enlightening. I will try to work on it tonight. Do you know what the main procedure is by which the protein is aligned to the DNA? :-) Thanks again!
– atmosx, Feb 12 '13 at 19:14

Rather than doing it all from scratch, if you had your own instance of BLAST, you could make a BLAST database from your E. coli sequence and run tblastn with your putative polymerase protein sequence as the query.

This would find the best matching sequence in the genome, and will work even if there are a fair number of differences between the protein you gave it, and what your DNA sequence actually translates to.
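A rough sketch of how that could be driven from Python, assuming the NCBI BLAST+ command-line programs (`makeblastdb`, `tblastn`) are installed and on the PATH. The file names, database name and E-value cut-off are placeholders of mine, not values from this thread:

```python
import subprocess


def tblastn_commands(genome_fasta, protein_fasta,
                     db_name="ecoli_genome", out_file="hits.tsv"):
    """Build the two BLAST+ command lines: index the genome, then search it."""
    make_db = ["makeblastdb", "-in", genome_fasta,
               "-dbtype", "nucl", "-out", db_name]
    search = ["tblastn", "-query", protein_fasta, "-db", db_name,
              "-evalue", "1e-10",   # assumed cut-off, adjust as needed
              "-outfmt", "6",       # tab-separated hit table
              "-out", out_file]
    return make_db, search


def run_tblastn(genome_fasta, protein_fasta):
    """Run both steps; raises CalledProcessError if either command fails."""
    for cmd in tblastn_commands(genome_fasta, protein_fasta):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_tblastn("genome.fa", "polI.fa")  # hypothetical input files
```

In the tabular output the subject start and end columns give the genomic coordinates of each hit; a start larger than the end indicates a match on the minus strand.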

For what it's worth, I have replicated what you are trying to do using a Python script. This is not elegant, but I just wanted to check for you that it is possible, and that there really is a match.

The pseudocode is:

1. Take the genome sequence.

2. Make its reverse complement.

3. For each of the two DNA sequences, and for each of the three reading frames:

   a. Translate the DNA into a single string of amino acids, with "*" at stop codons.

   b. Split the string at the "*" characters; call the pieces words.

   c. Find the first Met residue in each word; the string from that Met to the end of the word is an ORF.

   d. If the ORF is longer than 99 residues (an arbitrary cut-off), put it in a big list of ORFs.

4. You now have a list of all ORFs in all 6 reading frames.

5. Search this list for a match to the polI sequence (I actually just looked for the first line of the FASTA sequence).

The hit is identical to the entire polI sequence in a CLUSTAL alignment.

Note that this algorithm does not detect any ORFs that cross the breakpoint in the linear sequence representing the circular genome of E. coli. It also assumes that all initiator codons are ATG/Met, but I seem to recall that some E. coli initiation codons are GTG/Val.
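The pseudocode above could look roughly like this in Python. This is my own sketch, not the script used for this answer; it assumes the standard genetic code and shares the limitations just mentioned (ATG-only starts, no ORFs spanning the ends of the linear sequence). Function names and the `min_len` parameter are illustrative:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(genome, min_len=100):
    """Return (strand, frame, protein) for each Met-initiated ORF
    of at least min_len residues, over all six reading frames."""
    orfs = []
    for strand, dna in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            # Split the frame's translation at stop codons into "words".
            for word in translate(dna[frame:]).split("*"):
                start = word.find("M")  # first Met in the word, if any
                if start != -1 and len(word) - start >= min_len:
                    orfs.append((strand, frame + 1, word[start:]))
    return orfs


def find_protein(genome, protein, min_len=100):
    """List the (strand, frame) of every ORF that contains the protein."""
    return [(s, f) for s, f, orf in find_orfs(genome, min_len) if protein in orf]
```

A substring test is used rather than equality because an in-frame Met upstream of the real start would make the ORF longer than the annotated protein. With a real genome you would read the two FASTA files, strip the header lines and newlines, and call `find_protein(genome, protein)`.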