Are you interested in learning how to program (in Python) within a scientific setting?
This course will cover algorithms for solving various biological problems along with a handful of programming challenges helping you implement these algorithms in Python. It offers a gently-paced introduction to our Bioinformatics Specialization (https://www.coursera.org/specializations/bioinformatics), preparing learners to take the first course in the Specialization, "Finding Hidden Messages in DNA" (https://www.coursera.org/learn/dna-analysis).
Each of the four weeks in the course will consist of two required components. First, an interactive textbook provides Python programming challenges that arise from real biological problems. If you haven't programmed in Python before, not to worry! We provide "Just-in-Time" exercises from the Codecademy Python track (https://www.codecademy.com/learn/python). And each page in our interactive textbook has its own discussion forum, where you can interact with other learners. Second, each week will culminate in a summary quiz.
Lecture videos are also provided that accompany the material, but these videos are optional.

講師

Pavel Pevzner

Professor

Phillip Compeau

Visiting Researcher

字幕

[MUSIC] You may be surprised but the most popular algorithms for finding motifs in DNA sequences, roll dice and take chances to find such motifs. They're called randomized algorithms. How do they work? We have now learned how to construct Profile given Motifs, but can we construct Motifs given Profile? We can, because given an arbitrary Profile and a set of sequences Dna, we can construct the set Motifs(Profile,DNA) by simply finding Profile-most probable k-mers in each sequence from Dna. And therefore, we can further iterate this process by finding profile of motifs, motifs of profiles, profile of motifs, until our motifs keep improving. So this is the key idea of the algorithm called RandomizedMotifSearch. We simply iterate and see whether the score of motif continues to improve. And stop when it doesn't improve anymore. The only question about this algorithm is how in the world it can find anything useful. Think about this, we started from a completely random motifs. If the motifs are completely random, and if they're taken from random sequences, then presumably. expected probability of every element in the profile matrix is 1/4. There is absolutely no information in such a motif. How possibly it can lead us to the correct motif in the sequence? Let's just keep our fingers crossed and see how this approach works on an example. Here's my example. These are very short DNA strings and I implanted 4-nucleotide long motifs (one motif on each string). They're shown in blue, so ACGT is implanted in every sequence with a single mutation. So how randomized motif search will proceed? First, it will randomly select motifs shown in bold here. And, of course, it has missed most of the motifs. They are randomly selected. Afterwards, it will form Motifs matrix, shown here on the slide, and afterwards it will construct Profile(Motifs). Using Profile(Motifs) it will calculate the probability of every k-mer in the set of strings, shown here, and will simply select the most probable k-mers in every string. The results are in this matrix, and I should probably start iterating. But if you look carefully at this matrix, I don't have to iterate. It already found all blue inserted motifs right there. It almost sounds like a magic trick, because we started from something completely random and, in a single iteration, found our implanted motifs! Well, what happened? So let's try to think about this. When we form k-mers randomly, it results in a uniform expected profile, as we discussed before. But strings in Dna are not random because w have inserted implanted patterns in our Dna strings. Therefore, this implanted motif resulted in a biased profile; it's not a completely random profile. Where does this statistical bias point to? If it points in the direction of the implanted motif, then there is no miracle that we start from something random (which is not really random), and, applying iterative procedure, arrived to a hidden part. So, let's see how it actually works. I return back to the case that we considered before. You saw that when we randomly selected motifs, in the first sequence, we missed the motif (the correct, the blue one). In the second sequence, we also missed it. In the third sequnce, we missed it. In the fourth sequence, we missed it. But in the fifth one, simply by pure chance, we actually found it. It's not a big deal. There is always a probability that we simply by chance select one of the implanted motifs. But it results in a biased profile. If you look at the profiles that resulted from this implantation, then you see that the largest elements in this profile actually correspond to A, C, G and T of our implanted pattern ACGT. Therefore, even a single correctly found k-mer in the set of Dna strings creates a bias in the profile, and this bias leads us to the correct sequence. Of course, it doesn't always happen. It depends on how we have chosen initial set of randomly selected k-mers in the sequence. But if we try this algorithm, a thousand or a million times, there is a good chance that eventually one of these attempts will bring us to the correct motif. Even if this procedure sounds ridiculous. We now move to the next topic: "How do bacteria hibernate?" Our goal is to apply our motifs finding algorithm to real biological problems. To figure out how our algorithms work on real biological data, we will start by analyzing tuberculosis. Tuberculosis is a deadly disease that killed quarter of population of Europe in the 19th century. Unfortunately, humanity is not done with tuberculosis yet since tuberculosis strains that resist to all antibiotics are now emerging. Tuberculosis is caused by Mycobacterium tuberculosis also called MTB. An interesting thing about MTB is that it can persist in for decades in humans and one third of the world population is affected by MTB. Of course, it usually doesn't lead to tuberculosis nut, in some cases. bacteria switches from its latent state to the active state and it's when the tuberculosis begins. The question that we will be asking is how MTB switches from the latent to the active state and how does it survive during latency? MTB has an ability to form dormant spores where metabolism is shut down. In such sporulated form it can survive in very tough conditions, for example, in conditions of food shortage or oxygen shortage. And oxygen shortage, called hypoxia, is associated with these latent forms of tuberculosis. So, how can we figure out what makes MTB switch into the latent state and return back to the active states? Biologists have performed a DNA array experiment and identified roughly 25 genes that are activated in the hypoxic condition. So when bacteria is deprived from oxygen, these gene dramatically change their expression. And our goal today is to discover what regulatory motif controls these genes when they're activated or repressed. We now have quite an arsenal of algorithms for finding motifs. Let's try them. But the first problem is that we have no clue of what size of motif to try. Should we try k equal to 8 or k equal to 20? Let's try all of them using median string problem and randomized motif search. And you will see that, for different k, we are getting quite different results. It is not clear which one to choose. for median string, we actually will have difficulties going beyond k=12 because it is an exponential algorithm whose running time grows as four to the power k. But for randomized motif search we can search for very long motif and, for k=20, we actually find this rather long motif. You can see that there are similarities Between the 12-mer found by the Median String. the 12-mer found by the RandomizedMotifSearch, and the 20-mer found by the RandomizedMotifSearch. Which one is correct? To answer this question, we need to introduce the notion of the motif entropy. And entropy of motif is defined in the following way. First, we construct the profile matrix. Every column of the profile matrix correspond to a biased four-sided dice. Or, in other words, to a probability distribution, (p1, p2, ...p_n). Sum of p_i is equal to 1, of course. And entropy is simply defined as - p_i * log (p_i) and summed up over all indices i. For example, we can compute the entropy of the first column, and it will be 1.147. We can compare it with the minimal possible entropy, which corresponds to the extremely conserved motif (where all symbols in the column are the same), and with the maximal entropy (where symbola in the column are equally likely). So the entropy of a motif is defined as the sum of the entropies of its columns. And now, the moment of truth. Let's see how the algorithms that we developed find out whether the motifs that we find are close to the biological one. Here is the motif logo of the dormancy survival regulator site. The motif logo is constructed in such a way that the total length of four nucleotides in the motif logo correspond to two minus entropy. So for completely conserved column, like nucleotide G in this motif, we will have height two. It is the most conserved columns of the motif. The relative sizes of four nucleotide in each position correspond to frequencies of these nucleotides in the column. This is an example of motif logo. Now, let's see how it works for the tuberculosis example. You can see that our randomized motif search has actually captured a good portion of the motif. But you can also see that we missed in some places. We need to work harder and to figure out what to do to develop better regulatory motif finding algorithms. We move next to another randomized algorithm called Gibbs sampler.