In the previous course in the Specialization, we learned how to compare genes, proteins, and genomes. One way to use these methods is to construct a "Tree of Life" showing how a large collection of related organisms has evolved over time.
In the first half of the course, we will discuss approaches for evolutionary tree construction that have been the subject of some of the most cited scientific papers of all time, and show how they can resolve quandaries from finding the origin of a deadly virus to locating the birthplace of modern humans.
In the second half of the course, we will shift gears and examine the old claim that birds evolved from dinosaurs. How can we prove this? In particular, we will examine a result that claimed that peptides harvested from a T. rex fossil closely matched peptides found in chickens. We will use methods from computational proteomics to ask how we could assess whether this result is valid or due to some form of contamination.
Finally, you will learn how to apply popular bioinformatics software tools to reconstruct an evolutionary tree of ebolaviruses and identify the source of the recent Ebola epidemic that made global headlines.

Reviews

PO

Good course for improving algorithmic skills and keep learning something new

ZX

Jul 21, 2019

5 out of 5 stars

In depth and comprehensive coverage of the topics in genetic data analysis.

From the lesson

Week 5: Resolving the T. rex Peptides Mystery?

<p>Welcome to week 5 of class!</p>

<p>Last week, we asked whether it is possible for dinosaur peptides to survive locked inside of a fossil for 65 million years. This week, we will see what this question has to do with statistics; in the process, we will see how a monkey typing out symbols on a typewriter can be used to address it.</p>

Taught by:

Pavel Pevzner

Phillip Compeau

Transcript

To analyze the statistical significance of the identified peptide-spectrum matches, we will introduce the concept of spectral dictionaries. Imagine that a PSM search of 1,000 spectra from a human sample against the human proteome results in 100 peptide-spectrum matches whose score exceeds a threshold. What is the fraction of erroneous peptide-spectrum matches among these 100 peptide-spectrum matches? I'll give you a hint of how it may be possible to answer this question, and here's the hint. Let's repeat the same experiment for a randomly generated DecoyProteome of the same size as the human proteome. Of course, we don't care about the peptide-spectrum matches identified in DecoyProteome: they are simply statistical artifacts. What we care about is the number of such hits in DecoyProteome. For example, if you identify 5 peptide-spectrum matches in DecoyProteome, then you expect that 5 out of 100, or 5%, of the PSMs identified in the real proteome are incorrect. Therefore, we define the "false discovery rate" (FDR) as simply the ratio of the number of peptide-spectrum matches identified in DecoyProteome to the number of peptide-spectrum matches identified in the real proteome.
When we run this experiment on the T. rex spectra, we identify 27 peptide-spectrum matches in the real UniProt+ database and only 1 peptide-spectrum match in a DecoyProteome of the same size, for a threshold of 100. This means that, in this experiment, the FDR is 1/27, a respectable 3.7%. But does it mean that we just found approximately 27 T. rex peptides? Not quite, because many of the peptides that we identify are simply laboratory contaminants that are present in every experiment. For example, keratin from human skin. There are currently millions of tiny particles of my skin, and of the skin of the people who pass through this room, floating in the air in this room.
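The false discovery rate described above is a single ratio, and a minimal Python sketch makes the two examples from the lecture concrete. The function and parameter names are illustrative assumptions, not part of any particular software package.

```python
def false_discovery_rate(decoy_psms, target_psms):
    """Estimate the FDR as (PSMs found in the decoy proteome) / (PSMs found in the real proteome).

    Both counts are assumed to come from searches at the same score threshold
    against databases of the same size.
    """
    if target_psms <= 0:
        raise ValueError("need at least one PSM in the real proteome")
    return decoy_psms / target_psms


# The two examples from the lecture.
print(f"{false_discovery_rate(5, 100):.1%}")  # 5.0%
print(f"{false_discovery_rate(1, 27):.1%}")   # 3.7%
```

Note that this is a bulk estimate for the whole set of matches; it says nothing about which individual matches are the erroneous ones.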
The question we have to answer, to figure out which of the identified peptide-spectrum matches are correct, is how to estimate the statistical significance of individual peptide-spectrum matches rather than the bulk false discovery rate for the entire sample. To answer this question, we will bring in a monkey. Give this monkey a typewriter and let the monkey type random keys on this typewriter for a very long time. Afterwards, let's check how many correctly spelled English words the monkey generated. We can use Webster's dictionary to check, or some monkey dictionary. In this particular case, the monkey generated 13 correctly spelled English words. Does it mean that the monkey can spell? Well, to answer this question, we need to evaluate the expected number of words from the dictionary that appear in a randomly generated text. In other words, we need to solve the following Monkey and the Typewriter Problem: find the expected number of strings from a dictionary appearing in a randomly generated text. But what does it have to do with mass spectrometry? Well, at the same time, we want to solve the following mass spectrometry problem, the Expected Number of High-Scoring Peptides Problem: find the expected number of high-scoring peptides against a given spectrum in a decoy proteome. The input to this problem is a spectrum Spectrum, an integer n, and a score threshold. The output is the expected number of peptides in a decoy proteome of length n that score at least threshold against Spectrum. You may be wondering whether there is any similarity between the Expected Number of High-Scoring Peptides Problem and the Monkey and the Typewriter Problem. It is not obvious that these problems are equivalent, but they are. To explain why this is actually the same problem, I will introduce the notion of a spectral dictionary.
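One way to make the Monkey and the Typewriter Problem concrete is the sketch below. It assumes, as a simplification the lecture does not state, that every key is struck independently and with equal probability. Under that assumption a word of length k matches at any given position with probability 1 / alphabet_size^k, and by linearity of expectation the expected counts can be summed over positions and over words. The function name and parameters are hypothetical.

```python
def expected_dictionary_hits(dictionary, text_length, alphabet_size):
    """Expected number of occurrences of dictionary words in a random text.

    Assumes each character of the text is drawn independently and uniformly
    from an alphabet of `alphabet_size` symbols (an assumption of this sketch).
    """
    expected = 0.0
    for word in set(dictionary):  # count each distinct word once
        if not word:
            continue  # ignore empty strings
        # Number of positions where the word could start (0 if it is too long).
        positions = max(text_length - len(word) + 1, 0)
        # Probability that the word appears at one fixed position.
        expected += positions / alphabet_size ** len(word)
    return expected


# Hypothetical usage: a two-letter alphabet, a text of length 3, one word.
print(expected_dictionary_hits(["ab"], 3, 2))  # 0.5
```

Comparing the monkey's 13 words with this expected value is what tells us whether 13 is surprising or simply what random typing produces.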
The dictionary of Spectrum under a given threshold is simply the set of all peptides with a score of at least threshold against Spectrum. There will be many of these peptides, and as soon as we generate the dictionary for a given spectrum, we can reformulate the Expected Number of High-Scoring Peptides Problem as follows: we want to find the expected number of peptides from Dictionary occurring in a decoy proteome of length n. Let's take one more step to reformulate this problem. We've reformulated its output. Let's now reformulate its input. The new input is simply all peptides from the dictionary of Spectrum and an integer n, and the output is the expected number of strings from the dictionary occurring in a decoy proteome of length n. Take a look at this problem. This is exactly the Monkey and the Typewriter Problem for a specific set of peptides, given by the dictionary of Spectrum under a given threshold.
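The reformulation above can be sketched in two steps: build the spectral dictionary, then treat it as the monkey's dictionary. The sketch assumes that peptide scores against Spectrum have already been computed by some scoring function and are supplied as a mapping, and that the decoy proteome is a random string over 20 equally likely amino acids. Both are assumptions of this example, and all names in it are hypothetical.

```python
def spectral_dictionary(scores, threshold):
    """Return the set of peptides whose score against the spectrum is at least threshold.

    `scores` maps each candidate peptide string to its precomputed score
    (how the scores are computed is outside this sketch).
    """
    return {peptide for peptide, score in scores.items() if score >= threshold}


def expected_decoy_hits(dictionary, n, alphabet_size=20):
    """Expected number of dictionary peptides occurring in a random decoy proteome of length n.

    Assumes each amino acid is drawn independently and uniformly from
    `alphabet_size` symbols (an assumption of this sketch).
    """
    expected = 0.0
    for peptide in dictionary:
        if not peptide:
            continue  # ignore empty strings
        positions = max(n - len(peptide) + 1, 0)  # possible start positions
        expected += positions / alphabet_size ** len(peptide)
    return expected


# Hypothetical toy scores for three short peptides against one spectrum.
toy_scores = {"GA": 12, "AG": 9, "GAG": 15}
dictionary = spectral_dictionary(toy_scores, threshold=10)
print(sorted(dictionary))  # ['GA', 'GAG']
print(expected_decoy_hits(dictionary, n=100))
```

The second function is the same calculation as for the monkey's typewriter, with peptides in place of English words and amino acids in place of keys.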