Here are my insights into why this paper is fundamentally relevant for anyone working with genetic sequence data in an evolutionary context:

Scientific frontiers appear when we integrate analyses from the micro and the macro scale. Examples of this include how biology is informed by chemistry, chemistry is informed by physics, and classical physics is informed by quantum physics. This trend is true for EvoDevo: we are rapidly arriving at an understanding of evolution from increasingly scientific first principles. To be specific, we are beginning to understand how mutations in protein sequence and structure — at the biophysical scale — have consequences for the function and phenotype of cells, species, and individuals — at the macro scale [see Dean and Thornton, Nature Reviews 2007].

In order to reveal the evolutionary trajectory of a particular protein structure, we need to examine ancient forms of that protein. However, the simple acquisition of ancestral molecules can be a major obstacle when we examine evolutionary histories over millions of years because the ancestral forms are typically extinct. As a computational alternative, we can time travel via statistical inference [see Thornton, Nature Reviews 2004].

I study computational and phylogenetic methods that make it possible for us to probabilistically infer phylogenies and reconstruct ancestral gene sequences. One of the most important inventions in the history of phylogenetic methods is the use of Markov models to approximate the evolution of gene sequences. Markov models are used all over the place in information science: to model natural language, radio transmissions, and white noise. Markov models are used in speech recognition, your email’s spam filter, and global weather prediction. Google’s core search algorithm is fundamentally just a complex Markov model.

The core idea of the Markov model concerns characters transitioning (i.e. mutating) over time. Suppose we have some character (like a single nucleotide or an amino acid) that is currently in state X, where X is one of the letters in our nucleotide or amino acid alphabet. Over a time interval of length t, X will mutate to state Y with a probability determined by a matrix of relative substitution rates. This model follows the Markov property: the probability of Y later mutating to state Z over time t2 depends only on Y and is independent of its prior state X.
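To make the Markov property concrete, here is a minimal sketch. It assumes the Jukes-Cantor (JC69) nucleotide model, which I picked only because its transition probabilities have a simple closed form; the function names and the branch lengths are my own illustrative choices. The check at the end shows that mutating for time t1 and then for time t2 gives the same probabilities as mutating once for time t1 + t2, which is what we expect if the process has no memory of earlier states.

```python
import math

ALPHABET = "ACGT"

def jc69_matrix(t, mu=1.0):
    """Return the 4x4 JC69 transition probability matrix P(t).

    mu is the expected number of substitutions per site per unit time
    (an assumed parameterisation for this sketch).
    """
    decay = math.exp(-4.0 * mu * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting state
    diff = 0.25 - 0.25 * decay   # probability of ending in each other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# X -> Y over t1 = 0.3, then Y -> Z over t2 = 0.5 (summing over every Y)...
p_chained = matmul(jc69_matrix(0.3), jc69_matrix(0.5))
# ...should match X -> Z over the combined time 0.8.
p_total = jc69_matrix(0.8)

max_gap = max(abs(p_chained[i][j] - p_total[i][j])
              for i in range(4) for j in range(4))
print(f"largest difference between the two matrices: {max_gap:.2e}")
```

Richer models such as WAG work the same way, except that the matrix usually has to be computed numerically rather than from a closed form.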

If we calculate transition probabilities for all branches in a phylogenetic tree, we can calculate the likelihood of that tree and infer the maximum a posteriori ancestral protein sequence. In this discussion, I will avoid articulating all the mathematical minutiae of how we calculate probabilities for trees and ancestral sequences; you can learn more by reading this excellent book edited by Olivier Gascuel. Instead, I want to focus on the substitution matrix: it is an approximation of molecular evolution and it makes critical assumptions about evolutionary forces.
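For readers who want a concrete picture before moving on, here is a minimal sketch of that calculation on the smallest possible tree: one unknown ancestor with two observed descendants, at a single nucleotide site. It assumes the Jukes-Cantor model and a uniform prior over ancestral states; the function names, observed states, and branch lengths are hypothetical choices for illustration, not anything taken from the book.

```python
import math

STATES = "ACGT"

def jc69_prob(start, end, t):
    """JC69 probability of observing `end` after time t, starting from `start`."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay

def root_posterior(leaf_a, leaf_b, t_a, t_b):
    """Return (site likelihood, posterior over the ancestral state).

    The tree is a single ancestor with two leaves, reached by branches
    of length t_a and t_b. The prior over ancestral states is uniform.
    """
    prior = 1.0 / len(STATES)
    joint = {s: prior * jc69_prob(s, leaf_a, t_a) * jc69_prob(s, leaf_b, t_b)
             for s in STATES}
    site_likelihood = sum(joint.values())
    posterior = {s: p / site_likelihood for s, p in joint.items()}
    return site_likelihood, posterior

# Leaf 1 shows "A" on a short branch; leaf 2 shows "G" on a longer branch.
likelihood, posterior = root_posterior("A", "G", t_a=0.1, t_b=0.4)
map_state = max(posterior, key=posterior.get)
print("site likelihood:", likelihood)
print("maximum a posteriori ancestral state:", map_state)
```

The leaf on the shorter branch has had less time to change, so its state carries more weight in the posterior. Real trees repeat this calculation recursively at every internal node and multiply the result across all sites.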

In their simplest form (a 4×4 nucleotide matrix or a 20×20 amino acid matrix), substitution matrices assume that all residues with the same state are in a homogeneous biophysical environment, and are thus exposed to the same mutational forces. For example, the WAG matrix assumes that all glutamic acids (E) can be treated equally, and thus the relative substitution rate for any glutamic acid mutating into aspartic acid (D) is 6.174, while the relative rate of any glutamic acid mutating to cysteine (C) is 0.021. The assumption of structural homogeneity is often invalid; for example, as illustrated in this week’s review by Worth et al., residues buried in the solvent-inaccessible core of a protein tend to be more conserved than residues located on the exterior. This insight implies that we need a secondary substitution matrix expressing relative mutation rates for residues located in protein cores. As an example, if E stands for an external glutamic acid and E’ stands for a core glutamic acid, we should expect the relative substitution rate for E-to-D to be larger than the relative rate for E’-to-D’.
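Here is a rough sketch of what a lookup into environment-specific tables might look like. The two exposed rates are the WAG values quoted above; the damping factor for core residues is a number I made up purely for illustration, and real tables would be estimated from structural alignments rather than produced by uniformly scaling another table.

```python
# Relative WAG rates quoted above, with glutamic acid (E) as the source residue.
WAG_FROM_E = {"D": 6.174, "C": 0.021}

# HYPOTHETICAL damping factor for buried residues. This is not a value
# from Worth et al.; it only encodes "core residues change more slowly".
CORE_DAMPING = 0.25

# One substitution table per structural environment.
TABLES = {
    "exposed": dict(WAG_FROM_E),
    "core": {target: rate * CORE_DAMPING for target, rate in WAG_FROM_E.items()},
}

def relative_rate(environment, target):
    """Relative rate of E mutating to `target` in the given environment."""
    return TABLES[environment][target]

print("E  -> D  (exposed):", relative_rate("exposed", "D"))
print("E' -> D' (core):   ", relative_rate("core", "D"))
```

The design point is simply that the Markov model selects its rate by both the residue and its structural environment, instead of by the residue alone.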

The article by Worth et al. reviews a large historical body of results concerning protein structure conservation. The article further describes how we can use environment-specific substitution tables (ESSTs) to explicitly capture information about structural conservation into our Markov model of evolution. The insights from this paper are fundamental for anyone working with genetic sequence data in an evolutionary context.

Today, Sean gave a technical talk (for the EvoDevo crowd) titled “Endless Flies Most Beautiful: Cis-Regulatory Sequences and the Evolution of Animal Form.” Sean focused on the central EvoDevo question: How do forms (i.e. morphologies) evolve? He thinks an examination of mosaic pleiotropy is the key to answering this question. Historically, gene duplication was thought to be the primary mechanism by which new forms evolved. Sean cited Susumu Ohno’s classic book “Evolution by Gene Duplication.” However, Sean countered Ohno’s thesis by showing evidence that evolution might actually select against gene duplication. As an example, the evolutionary history of arthropod and tetrapod Hox genes (a gene family known to drive some morphologies) is a story of gene loss, not gene duplication.

Later approaches to the EvoDevo question examined the role of protein sequence evolution, and then eventually King and Wilson examined the role of protein sequence expression. Essentially, King and Wilson reduced the question “how do forms evolve?” to the micro-question of “how do cis-regulatory elements evolve?” For the remainder of Sean’s talk, he focused on “cis-regulatory elements as the units of evolution.”

Before the EvoDevo community was examining regulatory elements, inter-species genetic analysis was typically occurring over large taxonomic distances. This approach proved problematic because transcription factor binding sites are rarely conserved over large phylogenetic distances. Consequently, the EvoDevo community was forced to find new systems for study. Sean Carroll’s lab, for example, shifted focus away from studying butterflies and began investigating pigmentation diversity in Drosophila (see Nature, Trends in Genetics, and PNAS). Unlike butterflies, Drosophila offered the ability to explore evolutionary mechanisms at a deeper mechanistic/genetic level. Among many subsequent results, Sean’s lab discovered that the Tan gene locus is responsible for mosaic pleiotropy in Drosophila santomea’s wing pigmentation.

Based on results from the Tan gene, and several other studies, Sean concluded that regulatory sequence evolution is a more likely mechanism of morphological change than evolution of the coding sequence itself (see PLoS Biology 2005). Sean gave several examples to support this theory, including a story about the Engrailed gene: an ancient regulatory protein that was recently co-opted to control development of Drosophila wing spots.

Overall, this was an enlightening visit and I feel fortunate to be studying at a university that can engage this caliber of science. For more information, check out The Carroll Lab.

Here are two good overview articles on evolutionary computation. The first article is more recent and is targeted primarily at computer scientists; the second article is slightly outdated and targeted primarily at ecologists.

On November 8th, Nature published two cool articles about metagenomic studies of twelve Drosophila (“fruit flies”) species. In the first paper (click here), The Drosophila 12 Genomes Consortium (D12GG) compared the complete genomic sequences of the twelve Drosophila species, which included the model organism species Drosophila melanogaster. Although the twelve species are related, they exhibit a surprising amount of genetic diversity. For example, the evolutionary distance between D. grimshawi and D. melanogaster is the same as the distance between humans and lizards. As a side note, six months earlier (in May 2007), PLoS Genetics published a similar metagenomic comparison of Drosophila (click here for the paper). In the PLoS paper, Hahn et al. present the (somewhat obvious) conclusion: “the apparent stasis in total gene number among species has masked rapid turnover in individual gene gain and loss.”

On November 8, Nature also published this paper (click here), in which Stark et al. (including Hahn) used the data from D12GG’s research to demonstrate a truly novel insight about the connection between conserved metagenomic sequence motifs and functional elements. The result of this paper allows us to infer the presence of functional elements with an accuracy far surpassing previous methods. Specifically, Stark et al. show how to infer the following functional elements, based on a metagenomic sample:

Protein-coding regions: have highly constrained codon substitution patterns, and indels are biased toward lengths that are multiples of three.

RNA genes: tolerate substitutions that preserve base pairing.

miRNA: can be detected by looking for conserved palindromic stem sequences, with mutable loop sub-sequences between the two palindrome pieces.
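Two of these signatures are easy to sketch in a few lines. The code below is only a toy illustration, not the method used by Stark et al.: the function names and example sequences are hypothetical, and the pairing check considers only Watson-Crick pairs (it ignores G-U wobble pairs, which real RNA tolerates).

```python
# Watson-Crick pairing for RNA; G-U wobble pairs are deliberately ignored.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def frame_preserving_fraction(indel_lengths):
    """Fraction of indels whose length is a multiple of three.

    A high fraction hints at a protein-coding region, because such
    indels add or remove whole codons without shifting the reading frame.
    """
    if not indel_lengths:
        return 0.0
    in_frame = sum(1 for length in indel_lengths if length % 3 == 0)
    return in_frame / len(indel_lengths)

def stems_pair(five_prime_arm, three_prime_arm):
    """True if the 3' arm is the reverse complement of the 5' arm."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all(COMPLEMENT[a] == b
               for a, b in zip(five_prime_arm, reversed(three_prime_arm)))

# Indel lengths observed across an alignment (made-up numbers).
print(frame_preserving_fraction([3, 6, 1, 9]))

# A stem that pairs, and a variant in which both arms changed together
# (G->A on one arm, C->U on the other) so that pairing is preserved.
print(stems_pair("GGAC", "GUCC"))
print(stems_pair("AGAC", "GUCU"))
# A single uncompensated change breaks the pairing.
print(stems_pair("GGAC", "GUCA"))
```

The second and third calls illustrate the RNA gene signature: substitutions are tolerated when a compensating change on the opposite arm keeps the stem paired.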