Are you interested in learning how to program (in Python) within a scientific setting?
This course will cover algorithms for solving various biological problems along with a handful of programming challenges helping you implement these algorithms in Python. It offers a gently-paced introduction to our Bioinformatics Specialization (https://www.coursera.org/specializations/bioinformatics), preparing learners to take the first course in the Specialization, "Finding Hidden Messages in DNA" (https://www.coursera.org/learn/dna-analysis).
Each of the four weeks in the course will consist of two required components. First, an interactive textbook provides Python programming challenges that arise from real biological problems. If you haven't programmed in Python before, not to worry! We provide "Just-in-Time" exercises from the Codecademy Python track (https://www.codecademy.com/learn/python). And each page in our interactive textbook has its own discussion forum, where you can interact with other learners. Second, each week will culminate in a summary quiz.
Lecture videos are also provided that accompany the material, but these videos are optional.

Instructors

Pavel Pevzner

Professor

Phillip Compeau

Visiting Researcher

Transcript

Since failure is not an option, we need to figure out what to do next, and we will try to do it by learning a little bit of biology. And biology will give us a hint on how we should implement our algorithms. Some of you may find the description of somewhat convoluted biological concepts that I am about to present difficult to digest. If you feel like this, don't worry; if you believe me that DNA replication is an asymmetric process, then you can simply skip this part and go to the point where we derive from this biological knowledge how to design efficient algorithms. The first thing to remind ourselves is that DNA strands have directions, and the two strands of DNA run in opposite directions. Here on this slide, the blue strand runs clockwise, and the green strand runs counterclockwise. If you were a DNA polymerase, how would you replicate a genome? If I was a DNA polymerase, I would do something very simple. I would wait until DNA unwinds a little bit, recruit 4 DNA polymerases, and I would just move them along the genome, trying to replicate. It looks like just 4 DNA polymerases are enough to replicate the whole genome. And when the replication fork enlarges, I continue and continue the replication process. This is simple but completely wrong. And if there are biology professors attending this lecture, they are probably already writing a petition to fire me and send me to a "Biology 101" camp. The reason why it does not work is that DNA polymerases are unidirectional. They can only copy DNA in the direction that is opposite to the direction of DNA, which means that when we want to recruit four DNA polymerases, two of them, this one and this one, will be working just fine, but the two others won't be able to move because they cannot move in the same direction as the direction of DNA. Then we can classify DNA strands into four half-strands. 
The strand that I showed you (the blue strand that goes from origin to terminus) is the reverse half-strand, and I have no problem replicating it because moving from the origin to terminus goes in the opposite direction to DNA. Likewise, this thick green line that I show right now also does not present any problem replicating: one DNA polymerase can accomplish it. But the two other half-strands present a big, big problem because we cannot move in the same direction that they go. So, if you were a unidirectional DNA polymerase, how would you replicate a genome? Here is a potential solution. Wait until the fork enlarges, and when it enlarges, start replicating backward, in the direction opposite to DNA. When the fork enlarges a little bit more, you put another DNA polymerase and continue replicating. Four DNA polymerases won't be enough because each of these DNA polymerases that I recruited copied just approximately 3000 nucleotides, so we need a huge number of DNA polymerases to proceed this way. You can see that the many resulting fragments that are being built (called Okazaki fragments) complicate our life a little bit. But, in the end, after this process is over, we will have the genome copied in many, many little Okazaki fragments. And the only thing we need afterwards is to repair the gaps between the Okazaki fragments to finish replicating the genome. So the only thing we need to learn about this process to proceed with our algorithms afterwards is that the reverse and forward half-strands (the thick half-strands and thin half-strands) have very different lifestyles. The reverse half-strand lives a double-stranded life because it is constantly replicated; there is hardly a moment when it lives single-stranded. But the forward half-strands spend a large portion of their lives single-stranded because they have to wait until the replication fork opens and replication starts.
I hope this is clear, but there is one looming question: "Why would a computer scientist care?" And we will learn in the next segment why a computer scientist should care about this. So, asymmetry of replication affects nucleotide frequencies. Why? Let's think about this. Single-stranded DNA has a much higher mutation rate than double-stranded DNA. That's why, if one nucleotide has a greater mutation rate, then we should observe a shortage of this nucleotide on the forward half-strand because it lives a single-stranded life. Which nucleotide, A, C, G, or T, has a higher mutation rate? And why? It turns out that there is a peculiar statistic of mutation rates of nucleotides, and cytosine (C) rapidly mutates into T. What is quite amazing is that, through this deamination process, the mutation rate rises 100-fold when DNA is single-stranded, which means that the strand that lives the single-stranded life very quickly gets depleted of C. What does it mean for us as algorithm designers? Forward half-strands that live the single-stranded life have a shortage of C and normal G. Reverse half-strands that live the double-stranded life have a shortage of G and normal C. Now, keeping this in mind, let's take a walk along the genome. We start at the terminus and, let's say we move, according to the red line, from the terminus to the origin. In this case, we move along the strand in which C is high and G is low, which means that #G - #C is decreasing as we walk. But when we walk along this half-strand, C is low and G is high, which means that #G - #C, the total number of nucleotides G minus the total number of nucleotides C, is increasing as we walk. Again, this sounds like a peculiar and not very important thing, so why do we care? I ask one more question: If you walk along a genome and you count the number of G minus the number of C that you saw, and you have been seeing that #G - #C has been decreasing, and suddenly starts increasing...
Imagine, just imagine, you walk through the genome, you count the difference #G - #C, and it has been decreasing, and suddenly starts increasing. My question is: "Where in the genome are you?" And to figure out where in the genome you are, we need once again to look at peculiar statistics. The only place in the genome where the behavior of #G - #C switches from decreasing to increasing is the origin of replication. Which means that if you walk along the genome and see that #G - #C has been decreasing and suddenly starts increasing, it means you just passed the origin of replication. And this is the hint for our algorithm.
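The walk described in the transcript can be sketched in a few lines of Python. This is only a minimal illustration, not the course's reference solution: the function names `skew_values` and `minimum_skew_positions` are hypothetical, the genome is assumed to be a plain string over A, C, G, T, and the walk starts at an arbitrary position of a linearized genome rather than at the terminus of a circular one. The idea is to keep a running value of #G - #C and report where it reaches its minimum, since that is where it stops decreasing and starts increasing.

```python
def skew_values(genome):
    """Return the running #G - #C after each prefix of genome.

    The list starts with 0 (nothing read yet), so it has len(genome) + 1 entries.
    """
    skew = [0]
    for nucleotide in genome:
        if nucleotide == "G":
            skew.append(skew[-1] + 1)   # one more G seen
        elif nucleotide == "C":
            skew.append(skew[-1] - 1)   # one more C seen
        else:
            skew.append(skew[-1])       # A and T leave the difference unchanged
    return skew


def minimum_skew_positions(genome):
    """Return every position where the running #G - #C is lowest.

    These positions are candidates for the origin of replication
    under the assumptions stated in the lead-in.
    """
    skew = skew_values(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


# Toy example: the difference falls over "CC" and rises over "GG".
print(skew_values("CCGG"))             # [0, -1, -2, -1, 0]
print(minimum_skew_positions("CCGG"))  # [2]
```

On a real bacterial genome the curve is noisy, so the minimum is best read as an approximate location of the origin rather than an exact coordinate.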