The machinery of the cell actually starts at some point along the RNA and "reads" the sequences codon after codon, attaching the encoded amino acid to the end of the growing protein sequence. Example 8-1 simulates this, reading the string of DNA three bases at a time and concatenating the symbol for the encoded amino acid to the end of the growing protein string. In the cell, the process stops when a codon is encountered.

8.3.2 Translating Codons to Amino Acids

The first task is to enable the following programs to do the translation from the three-nucleotide codons to the amino acids. This is the most important step in implementing the genetic code, which is the encoding of amino acids by three-nucleotide codons.

Here's a subroutine that returns an amino acid (represented by a one-letter abbreviation) given a three-letter DNA codon:

This code is clear and simple, and the layout makes it obvious what's happening. However, it can take a while to run. For instance, given the codon GGT for glycine, it has to check each test until it finally succeeds on the last one, and that's a lot of string comparisons. Still, the code achieves its purpose.

There's something new happening in the code's error message. Recall filehandles from Chapter 4 and how they access data in files. From Chapter 5, remember the special filehandle STDIN that reads user input from the keyboard. STDOUT and STDERR are also special filehandles that are always available to Perl programs. STDOUT directs output to the screen (usually) or another standard place. When a filehandle is missing from a print statement, STDOUT is assumed. The print statement accepts a filehandle as an optional argument, but so far, we've been printing to the default STDOUT. Here, error messages are directed to STDERR, which usually prints to the screen, but on many computer systems they can be directed to a special error file or other location. Alternatively, you sometimes want to direct STDOUT to a file or elsewhere but want STDERR error messages to appear on your screen. I mention these options because you are likely to come across them in Perl code; we don't use them much in this book (see Appendix B for more information).

8.3.3 The Redundancy of the Genetic Code

I've remarked on the redundancy of the genetic code, and the last subroutine clearly displays this redundancy. It might be interesting to express that in your subroutine. Notice that groups of redundant codons almost always have the same first and second bases and vary in the third. You've used character classes in regular expressions to match any of a set of characters. Now, let's try to redo the subroutine to make one test for each redundant group of codons:

# codon2aa

#

# A subroutine to translate a DNA 3-character codon to an amino acid

# Version 2sub codon2aa {

my($codon) = @_;

if ( $codon =~ /GC./i) { return 'A' } # Alanine

elsif ( $codon =~ /TG[TC]/i) { return 'C' } # Cysteine

elsif ( $codon =~ /GA[TC]/i) { return 'D' } # Aspartic Acid

elsif ( $codon =~ /GA[AG]/i) { return 'E' } # Glutamic Acid

elsif ( $codon =~ /TT[TC]/i) { return 'F' } # Phenylalanine

elsif ( $codon =~ /GG./i) { return 'G' } # Glycine

elsif ( $codon =~ /CA[TC]/i) { return 'H' } # Histidine

elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine

elsif ( $codon =~ /AA[AG]/i) { return 'K' } # Lysine

elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine

elsif ( $codon =~ /ATG/i) { return 'M' } # Methionine

elsif ( $codon =~ /AA[TC]/i) { return 'N' } # Asparagine

elsif ( $codon =~ /CC./i) { return 'P' } # Proline

elsif ( $codon =~ /CA[AG]/i) { return 'Q' } # Glutamine

elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine

elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine

elsif ( $codon =~ /AC./i) { return 'T' } # Threonine

elsif ( $codon =~ /GT./i) { return 'V' } # Valine

elsif ( $codon =~ /TGG/i) { return 'W' } # Tryptophan

elsif ( $codon =~ /TA[TC]/i) { return 'Y' } # Tyrosine

elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop

else {

print STDERR "Bad codon \"$codon\"!!\n";

exit;

}

}

Using character classes and regular expressions, this code clearly shows the redundancy of the genetic code. Also notice that the one-character codes for the amino acids are now in alphabetical order.

A character class such as [TC] matches a single character, either T or C. The . is the regular expression that matches any character except a newline. The /GT./i expression for valine matches GTA, GTC, GTG, and GTT, all of which are codons for valine. (Of course, the period matches any other character, but the $codon is assumed to have only A,C,G, or T characters.) The i after the regular expression means match uppercase or lowercase, for instance /T/i matches T or t.

The new feature in these regular expressions is the use of the vertical bar or pipe (|) to separate two choices. Thus for serine, /TC.|AG[TC]/ matches /TC./ or /AG[TC]/. In this program, you need only two choices per regular expression, but you can use as many vertical bars as you like.

You can also group parts of a regular expression in parentheses, and use vertical bars in them. For example, /give me a (break|meal)/ matches "give me a break" or "give me a meal."

8.3.4 Using Hashes for the Genetic Code

If you think about using a hash for this translation, you'll see it's a natural way to proceed. For each codon key the amino acid value is returned. Here's the code:

#

# codon2aa

#

# A subroutine to translate a DNA 3-character codon to an amino acid

# Version 3, using hash lookup
sub codon2aa {

my($codon) = @_;

$codon = uc $codon;

my(%genetic_code) = (

'TCA' => 'S', # Serine

'TCC' => 'S', # Serine

'TCG' => 'S', # Serine

'TCT' => 'S', # Serine

'TTC' => 'F', # Phenylalanine

'TTT' => 'F', # Phenylalanine

'TTA' => 'L', # Leucine

'TTG' => 'L', # Leucine

'TAC' => 'Y', # Tyrosine

'TAT' => 'Y', # Tyrosine

'TAA' => '_', # Stop

'TAG' => '_', # Stop

'TGC' => 'C', # Cysteine

'TGT' => 'C', # Cysteine

'TGA' => '_', # Stop

'TGG' => 'W', # Tryptophan

'CTA' => 'L', # Leucine

'CTC' => 'L', # Leucine

'CTG' => 'L', # Leucine

'CTT' => 'L', # Leucine

'CCA' => 'P', # Proline

'CCC' => 'P', # Proline

'CCG' => 'P', # Proline

'CCT' => 'P', # Proline

'CAC' => 'H', # Histidine

'CAT' => 'H', # Histidine

'CAA' => 'Q', # Glutamine

'CAG' => 'Q', # Glutamine

'CGA' => 'R', # Arginine

'CGC' => 'R', # Arginine

'CGG' => 'R', # Arginine

'CGT' => 'R', # Arginine

'ATA' => 'I', # Isoleucine

'ATC' => 'I', # Isoleucine

'ATT' => 'I', # Isoleucine

'ATG' => 'M', # Methionine

'ACA' => 'T', # Threonine

'ACC' => 'T', # Threonine

'ACG' => 'T', # Threonine

'ACT' => 'T', # Threonine

'AAC' => 'N', # Asparagine

'AAT' => 'N', # Asparagine

'AAA' => 'K', # Lysine

'AAG' => 'K', # Lysine

'AGC' => 'S', # Serine

'AGT' => 'S', # Serine

'AGA' => 'R', # Arginine

'AGG' => 'R', # Arginine

'GTA' => 'V', # Valine

'GTC' => 'V', # Valine

'GTG' => 'V', # Valine

'GTT' => 'V', # Valine

'GCA' => 'A', # Alanine

'GCC' => 'A', # Alanine

'GCG' => 'A', # Alanine

'GCT' => 'A', # Alanine

'GAC' => 'D', # Aspartic Acid

'GAT' => 'D', # Aspartic Acid

'GAA' => 'E', # Glutamic Acid

'GAG' => 'E', # Glutamic Acid

'GGA' => 'G', # Glycine

'GGC' => 'G', # Glycine

'GGG' => 'G', # Glycine

'GGT' => 'G', # Glycine

);

if(exists $genetic_code{$codon}) {

return $genetic_code{$codon};

}else{
print STDERR "Bad codon \"$codon\"!!\n";

exit;

}

}

This subroutine is simple: it initializes a hash and then performs a single lookup of its single argument in the hash. The hash has 64 keys, one for each codon.

Notice there's a function exists that returns true if the key $codon exists in the hash. It's equivalent to the else statement in the two previous versions of the codon2aa subroutine.[3][3] A key might exist in a hash, but its value can be undefined. The defined function checks for defined values. Also, of course, the value might be 0 or the empty string, in which case, it fails a test such as if ($hash{$key}) because, even though the key exists and the value is defined, the value evaluates to false in a conditional test.

Also notice that to make this subroutine work on lowercase DNA as well as uppercase, you translate the incoming argument into uppercase to match the data in the %genetic_code hash. You can't give a regular expression to a hash as a key; it must be a simple scalar value, such as a string or a number, so the case translation must be done first. (Alternatively, you can make the hash twice as big.) Similarly, character classes don't work in the keys for hashes, so you have to specify each one of the 64 codons individually.

You may wonder why bother wrapping this last bit of code in a subroutine at all. Why not just declare and initialize the hash and do the lookups directly in the hash instead of going through the subroutine? Well, the subroutine does do a little bit of error checking for nonexistent keys, so having a subroutine saves doing that error checking yourself each time you use the hash.

Additionally, wrapping the code in a subroutine gives a little insurance for the future. If all the code you write does codon translation by means of our subroutine, it would be simplicity itself to switch over to a new way of doing the translation. Perhaps a new kind of datatype will be added to Perl in the future, or perhaps you want to do lookups from a database or a DBM file. Then all you have to do is change the internals of this one subroutine. As long as the interface to the subroutine remains the same—that is to say, as long as it still takes one codon as an argument and returns a one-character amino acid—you don't need to worry about how it accomplishes the translation from the standpoint of the rest of the programs. Our subroutine has become a black box. This is one significant benefit of modularization and organization of programs with subroutines.

There's another good, and biological, reason why you should use a subroutine for the genetic code. There is actually more than one genetic code, because there are differences as to how DNA encodes amino acids among mammals, plants, insects, and yeast—especially in the mitochondria. So if you have modularized the genetic code, you can easily modify your program to work with a range of organisms.

One of the benefits of hashes is that they are fast. Unfortunately, our subroutine declares the whole hash each time the subroutine is called, even for one lookup. This isn't so efficient; in fact, it's kind of slow. There are other, much faster ways that involve declaring the genetic code hash only once as a global variable, but they would take us a little far afield at this point. Our current version has the advantage of being easy to read. So, let's be officially happy with the hash version of codon2aa and put it into our module in the file BeginPerlBioinfo.pm (see Chapter 6).

Now that we've got a satisfactory way to translate codons to amino acids, we'll start to use it in the next section and in the examples.

8.4 Translating DNA into Proteins

Example 8-1 shows how the new codon2aa subroutine translates a whole DNA sequence into protein.

# Translate each three-base codon into an amino acid, and append to a protein

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

$codon = substr($dna,$i,3);

$protein .= codon2aa($codon);

}
print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";
exit;

To make this work, you'll need the BeginPerlBioinfo.pm module for your subroutines in a separate file the program can find, as discussed in Chapter 6. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine condon2aa directly to the program in Example 8-1 and remove the reference to the BeginPerlBioinfo.pm module.

You've seen all the elements in Example 8-1 before, except for the way it loops through the DNA with this statement:

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter: my $i=0 statically scopes the $i variable so it's visible only inside this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but it can happen) is now invisible inside the block. The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:

$i += 3

Since you're trying to march through the DNA three bases at a shot, you increment by three.

The second, middle part of the for loop tests whether the loop should continue:

$i < (length($dna) - 2)

The point is that if there are none, one, or two bases left, you should quit, because there's not enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So if the position counter $i has reached length-2, there's only two more bases (at positions length-2 and length-1), and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So the test succeeds only if:

$i < (length($dna) -2)

(Notice also how the whole expression to the right of the less-than sign is enclosed in parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.)

The line of code:

$codon = substr ($dna, $i 3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna at position $i of length 3, and saves it in the variable $codon.

If you know you'll need to do this DNA-to-protein translation a lot, you can turn Example 8-1 into a subroutine. Whenever you write a subroutine, you have to think about which arguments you may want to give the subroutine. So you realize, there may come a time when you'll have some large DNA sequence but only want to translate a given part of it. Should you add two arguments to the subroutine as beginning and end points? You could, but decide not to. It's a judgment call—part of the art of decomposing a collection of code into useful fragments. But it might be better to have a subroutine that just translates; then you can make it part of a larger subroutine that picks endpoints in the sequence, if needed. The thinking is that you'll usually just translate the whole thing and always typing in 0 for the start and length($dna)-1 at the end, would be an annoyance. Of course, this depends on what you're doing, so this particular choice just illustrates your thinking when you write the code.

You should also remove the informative print statement at the end, because it's more suited to a main program than a subroutine.

Anyway, you've now thought through the design and just want a subroutine that takes one argument containing DNA and returns a peptide translation:

# dna2peptide

#

# A subroutine to translate DNA sequence into a peptide

sub dna2peptide {
my($dna) = @_;
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables

my $protein = '';

# Translate each three-base codon to an amino acid, and append to a protein

for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

$protein .= codon2aa( substr($dna,$i,3) );

}
return $protein;

}

Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module.

Notice that you've eliminated one of the variables in making the subroutine out of Example 8-1: the variable $codon. Why?

Well, one reason is because you can. In Example 8-1, you were using substr to extract the codon from $dna, saving it in variable $codon and then passing it into the subroutine codon2aa. This new way eliminates the middleman. Put the call to substr that extracts the codon as the argument to the subroutine codon2aa so that the value is passed in just as before, but without having to copy it to the variable $codon first.

This has somewhat improved efficiency and speed. Since copying strings is one of the slower things computer programs do, eliminating a bunch of string copies is an easy and effective way to speed up a program.

But has it made the program less readable? You be the judge. I think it has, a little, but the comment right before the loop seems to make everything clear enough, for me, anyway. It's important to have readable code, so if you really need to boost the speed of a subroutine, but find it makes the code harder to read, be sure to include enough comments for the reader to be able to understand what's going on.

For the first time use function calls are being included in a subroutine instead of the main program:

use strict;

use warnings;

use BeginPerlBioinfo;

This may be redundant with the calls in the main program, but it doesn't do any harm (Perl checks and loads a module only once). If this subroutine should be called from a module that doesn't already load the modules, it's done some good after all.

Now let's improve how we deal with DNA in files.

8.5 Reading DNA from Files in FASTA Format

Over the fairly short history of bioinformatics, several different biologists and programmers invented several ways to format sequence data in computer files, and so bioinformaticians must deal with these different formats. We need to extract the sequence data and the annotations from these files, which requires writing code to deal with each different format.

There are many such formats, perhaps as many as 20 in regular use for DNA alone. The very multiplicity of these formats can be an annoyance when you're analyzing a sequence in the lab: it becomes necessary to translate from one format to another for the various programs you use to examine the sequence. Here are some of the most popular:
FASTA

The FASTA and Basic Local Alignment Search Technique (BLAST) programs are popular; they both use the FASTA format. Because of its simplicity, the FASTA format is perhaps the most widely used of all formats, aside from GenBank.

Genetic Sequence Data Bank (GenBank)

GenBank is a collection of all publicly released genetic data. It includes lots of information in addition to the DNA sequence. It's very important, and we'll be looking closely at GenBank files in Chapter 10.

European Molecular Biology Laboratory (EMBL)

The EMBL database has substantially the same data as the GenBank and the DDBJ (DNA Data Bank of Japan), but the format is somewhat different.

Simple data, or Applied Biosystems (ABI) sequencer output

This is DNA sequence data that has no formatting whatsoever, just the characters that represent the bases; it is output into files by the sequencing machines from ABI and from other machines and programs.

Protein Identification Resource (PIR)

PIR is a well-curated collection of protein sequence data.

Genetics Computer Group (GCG)

The GCG program (a.k.a. the GCG Wisconsin package) from Accelrys is used at many large research institutions. Data must be in GCG format to be usable by their programs.

Of these six sequence formats, GenBank and FASTA are by far the most common. The next few sections take you through the process of reading and manipulating data in FASTA.