When DNA is copied to ‘messenger RNA’ as part of the process of making proteins, the T gets copied to uracil, or U. The other three bases stay the same.

A protein is made of lots of amino acids. A sequence of three bases forms a ‘codon’, which codes for a single amino acid. Here’s some messenger RNA with the codons indicated:

But here’s where it gets tricky: while there are 4³ = 64 codons, they code for only 20 amino acids. Typically more than one codon codes for the same amino acid. There are two exceptions. One is the amino acid tryptophan, which is encoded only by UGG. The other is methionine, which is encoded only by AUG. AUG is also the ‘start codon’, which tells the cell where the code for a protein starts. So, methionine shows up at the start of every protein (or maybe just most?), at least at first. It’s usually removed later in the protein manufacturing process.
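The transcription step and the start codon are easy to see in code. Here’s a minimal sketch in Python; the DNA fragment is made up purely for illustration:

```python
# Transcribe a DNA coding strand to messenger RNA: T becomes U,
# the other three bases stay the same.
dna = "GGCATGGCCATTGTAATGGGCCGC"   # hypothetical coding-strand fragment
mrna = dna.replace("T", "U")

# AUG is the start codon (and also codes for methionine):
# translation would begin at the first AUG.
start = mrna.find("AUG")
print(mrna)    # GGCAUGGCCAUUGUAAUGGGCCGC
print(start)   # 3
```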

There are also three ‘stop codons’, which mark the end of a protein. They have cute names:

• UAG (‘amber’)
• UAA (‘ochre’)
• UGA (‘opal’)

But look at the actual pattern of which codons code for which amino acids:

It looks sort of regular… but also sort of irregular! Note how:

• Almost all amino acids either have 4 codons coding for them, or 2.
• If 4 codons code for the same amino acid, it’s because we can change the last base without any effect.
• If 2 codons code for the same amino acid, it’s because we can change the last base from U to C or from A to G without any effect.
• The amino acid tryptophan, with just one codon coding for it, is right next to the 3 stop codons.

And so on…
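The counting claims above are easy to check with a short script. Here’s a sketch in Python; the codon table is the standard genetic code, entered by hand from the usual 16-row layout, with ‘*’ marking the stop codons:

```python
from collections import Counter

# The standard genetic code, keyed by the first two bases of the codon;
# each 4-letter string gives the amino acid (one-letter code, '*' = stop)
# for third base U, C, A, G in that order.
rows = {
    "UU": "FFLL", "UC": "SSSS", "UA": "YY**", "UG": "CC*W",
    "CU": "LLLL", "CC": "PPPP", "CA": "HHQQ", "CG": "RRRR",
    "AU": "IIIM", "AC": "TTTT", "AA": "NNKK", "AG": "SSRR",
    "GU": "VVVV", "GC": "AAAA", "GA": "DDEE", "GG": "GGGG",
}
code = {}
for first_two, aas in rows.items():
    for third, aa in zip("UCAG", aas):
        code[first_two + third] = aa

assert len(code) == 64 == 4**3           # 4^3 = 64 codons

degeneracy = Counter(code.values())
del degeneracy["*"]                      # discard the 3 stop codons
print(len(degeneracy))                   # 20 amino acids
print(degeneracy["M"], degeneracy["W"])  # 1 1 -- Met and Trp have one codon each
print(sorted(Counter(degeneracy.values()).items()))
# [(1, 2), (2, 9), (3, 1), (4, 5), (6, 3)]
```

Note that counted this way, three amino acids (Leu, Ser, Arg) actually have 6 codons and isoleucine has 3; in the table layout these show up as a 4-block plus a 2-block, which is why the patterns above talk mostly about 4s and 2s.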

This is what attracts the mathematical physicists I’m talking about. They’re wondering: what is the pattern here? Saying the patterns are coincidental—a “frozen accident of history”—won’t please these people.

Though I certainly don’t vouch for their findings, I sympathize with the impulse to find order amid chaos. Here are some papers I’ve seen:

After a very long review of symmetry in physics, starting with the Big Bang and moving up through the theory of Lie algebras and Cartan’s classification of simple Lie algebras, the authors describe their program:

The first step in the search for symmetries in the genetic code consists in selecting a simple Lie algebra and an irreducible representation of this Lie algebra on a vector space of dimension 64: such a representation will in the following be referred to as a codon representation.

There turn out to be 11 choices. Then they look at Lie subalgebras of these Lie algebras that have codon representations, and try to organize the codons for the same amino acid into irreducible representations of these subalgebras. This follows the ‘symmetry breaking’ strategy that particle physicists use to organize particles into families (but with less justification, it seems to me). They show:

There is no symmetry breaking pattern through chains of subalgebras capable of reproducing exactly the degeneracies of the genetic code.

This is not the end of the paper, however!

Here’s another paper, which seems to focus on how the genetic code might be robust against small errors:

But these three papers seem rather ‘Platonic’ in inspiration: they don’t read like biology papers. What papers on the genetic code do biologists like best? I know there’s a lot of research on the origin of this code.

Maybe some of these would be interesting. I haven’t read any of them! But they seem a bit more mainstream than the ones I just listed:

It has long been conjectured that the canonical genetic code evolved from a simpler primordial form that encoded fewer amino acids (e.g. Crick 1968). The most influential form of this idea, “code coevolution” (Wong 1975) proposes that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. It further proposes that a comparison of modern codon assignments with the conserved metabolic pathways of amino acid biosynthesis can inform us about this history of code expansion. Here we re-examine the biochemical basis of this theory to test the validity of its statistical support. We show that the theory’s definition of “precursor-product” amino acid pairs is unjustified biochemically because it requires the energetically unfavorable reversal of steps in extant metabolic pathways to achieve desired relationships. In addition, the theory neglects important biochemical constraints when calculating the probability that chance could assign precursor-product amino acids to contiguous codons. A conservative correction for these errors reveals a surprisingly high 23% probability that apparent patterns within the code are caused purely by chance. Finally, even this figure rests on post hoc assumptions about primordial codon assignments, without which the probability rises to 62% that chance alone could explain the precursor-product pairings found within the code. Thus we conclude that coevolution theory cannot adequately explain the structure of the genetic code.

The genetic code appears to be optimized in its robustness to missense errors and frameshift errors. In addition, the genetic code is near-optimal in terms of its ability to carry information in addition to the sequences of encoded proteins. As evolution has no foresight, optimality of the modern genetic code suggests that it evolved from less optimal code variants. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination that can encode the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding, as well as the discovery of organisms with ambiguous and dual decoding, suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. To explore this possibility we designed a mathematical model of the evolution of primitive digital coding systems which can decode nucleotide sequences into protein sequences. These coding systems can evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point-mutations. The replication rates of such coding systems depend on the accuracy of the generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of the complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.

What’s the “recent experimental evidence on quadruplet decoding”, and what organisms have “ambiguous” or “dual” decoding?

The genetic code maps the sixty-four nucleotide triplets (codons) to twenty amino-acids. Some argue that the specific form of the code with its twenty amino-acids might be a ‘frozen accident’ because of the overwhelming effects of any further change. Others see it as a consequence of primordial biochemical pathways and their evolution. Here we examine a scenario in which evolution drives the emergence of a genetic code by selecting for an amino-acid map that minimizes the impact of errors. We treat the stochastic mapping of codons to amino-acids as a noisy information channel with a natural fitness measure. Organisms compete by the fitness of their codes and, as a result, a genetic code emerges at a supercritical transition in the noisy channel, when the mapping of codons to amino-acids becomes nonrandom. At the phase transition, a small expansion is valid and the emergent code is governed by smooth modes of the Laplacian of errors. These modes are in turn governed by the topology of the error-graph, in which codons are connected if they are likely to be confused. This topology sets an upper bound – which is related to the classical map-coloring problem – on the number of possible amino-acids. The suggested scenario is generic and may describe a mechanism for the formation of other error-prone biological codes, such as the recognition of DNA sites by proteins in the transcription regulatory network.


22 Responses to The Genetic Code

I remember seeing an empirical computer simulation which seemed similar to the Tsvi Tlusty paper: given that the structures have to be “close enough to form bonds”, particularly during copying, in 3D within the “jostling” cellular environment, the coding for codons is most “robust” in balancing two effects:

1. using a coding with shorter codons and more single-variant assignments decreases DNA molecule length, so fewer “steps” need to be decoded for a given cell’s usage, but then a minor error is more likely to stop transcription or make it erroneous.

2. using a coding in which each amino acid is coded by multiple longer sequences makes instantaneous errors less likely, but the DNA molecule length increases.

So the argument was about finding the point which overall maximises the number of fully correct transcriptions in the presence of both effects.

Don’t know if this helps, but put the column for the second base “C” first in order, then the column for the base “U” second in order, then the column for the base “G” third in order, and the column for the base “A” last in order, so the ordering for the second base columns is “C”, “U”, “G”, “A”. Even so, if there’s a pattern I don’t see it. Maybe if someone were able to calculate electrostatic potential surfaces for the bases, and the resultant codons, patterns might be visible… same thing for the amino acids, see if there’s symmetry in there someplace.

Is there a reason why exchanging U and C as third base never makes a difference, and only rarely in the case of A and G? Yet there’s no sign of a greater affinity of U with C than with G in other places.

OK, I can’t remember where I might’ve read that bit about a “primordial genetic code” with shorter codons (it’s not in Stryer’s Biochemistry, which I just pulled off the shelf and checked), but I have found discussions of related ideas.

Speaking to the ambiguity of the third base in the codon, there’s also the wobble base pair hypothesis (Wikipedia summary, with a couple links to papers in the reference section). The idea is that, when tRNA binds to mRNA, the first two bases of the codon are more strongly selective about their binding than the third base, for which (due to the structure of the tRNA) different pairings might have similar thermodynamic stability. For example, a 5′-GAA-3′ tRNA anticodon would ideally (following traditional Watson-Crick base pairing) bind a 5′-UUC-3′ codon, but due to wobble at the third base it could also bind to a 5′-UUU-3′ codon with about equal stability. The U-C and A-G groupings appear to reflect more common wobble substitutions. (Note that UUC and UUU both code for Phe!)
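Crick’s wobble rules can be written down as a little lookup table. A sketch, assuming the pairing rules as summarized on that Wikipedia page; the helper function and its name are my own invention:

```python
# Wobble pairing at the third codon position: the 5' base of the tRNA
# anticodon can pair with more than one codon base ('I' is inosine).
wobble = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}

def codons_read(anticodon):
    """Codons (5'->3') readable by a tRNA anticodon (given 5'->3')."""
    # Pairing is antiparallel: anticodon bases 3 and 2 pair with codon
    # positions 1 and 2 by strict Watson-Crick rules...
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    first_two = comp[anticodon[2]] + comp[anticodon[1]]
    # ...while the anticodon's 5' base pairs with codon position 3 by wobble.
    return [first_two + third for third in wobble[anticodon[0]]]

print(codons_read("GAA"))   # ['UUC', 'UUU'] -- both code for Phe
print(codons_read("CAU"))   # ['AUG'] -- methionine's single codon
```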

Remember, also, that the primary nucleotide bases can be modified, which affects base pairing. For example, inosine (I, modified from A) can pair about equally well with A, C, and U. Ah, chemistry. :-P

Thus, we have two different redundancies at the third base of the codon: protection from point mutation, and protection from alternate tRNA binding during translation.

That’s just using standard cellular techniques with ribosomes etc., not even counting other unnatural isomers, D amino acids, etc. Many more can be produced using other techniques, e.g. solid state synthesis: http://en.wikipedia.org/wiki/Peptide_synthesis

JB: “So, methionine shows up at the start of every protein (or maybe just most?)”

Bacteria use a modified form of methionine, but otherwise this is universal. (Molecular Biology of the Cell, Fifth Edition, Alberts et al, p380).

JB: What’s the “recent experimental evidence on quadruplet decoding”

I think that is a reference to a technique in synthetic biology where they’ve managed to make artificial ribosomes and tRNAs which use a quadruplet code. I think this stuff is in its very very early stages.

I guess this is referring to frame shifts (http://en.wikipedia.org/wiki/Frame_shift). There are also cases where both strands of DNA are protein coding (overlapping, though not coinciding exactly).

I think frame shifts are really cool. For those not in the know, this is the idea: when reading DNA to create proteins, the cell reads 3 bases at a time and turns each 3-letter ‘word’ into an amino acid, as sketched in this blog entry. So, if a mistake happens and the cell starts reading 1 or 2 bases down from where it should, a completely different protein is made. For example,

ATA CCG CGA TCC …

may become:

TAC CGC GAT CC…

And this can cause nasty diseases like Tay-Sachs.

But life, being the wonderfully flexible thing it is, has also figured out ways to exploit this possibility to its advantage! The same stretch of DNA can be deliberately read in 3 different ways to create different proteins!
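The three possible reading frames of a single strand can be shown in a few lines. A sketch using the example sequence from the comment above:

```python
def codons(seq, frame):
    """Split a sequence into 3-base codons, starting at offset `frame`;
    a trailing incomplete codon is dropped."""
    s = seq[frame:]
    return [s[i:i+3] for i in range(0, len(s) - len(s) % 3, 3)]

seq = "ATACCGCGATCC"
for frame in range(3):
    print(frame, codons(seq, frame))
# 0 ['ATA', 'CCG', 'CGA', 'TCC']
# 1 ['TAC', 'CGC', 'GAT']
# 2 ['ACC', 'GCG', 'ATC']
```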

Sort of related to this: this news report says that other amino acids (outside the 20) can be added to proteins, changing some of their biochemical properties. (This particular example appears to be “weighing down” a molecule so that it gets processed by the kidneys more slowly, enabling more of the biologically active part to be processed.) So this suggests it’s not simply that “there are 20 worthwhile amino acids, so evolution figured out how to code for those 20”: having 20 synthesisable amino acids was as much a decision as how to represent them.
