FAQ: Simulation

This FAQ describes how Folding@home simulations work and why our methods benefit from distributed computing. This FAQ may be slightly technical, but aims for people who’d like to learn more about how this project works and how their computer is involved.

Introduction – Why do proteins fold?

Proteins are trying to get into their most “comfortable” position, that is, where they are at the best energy equilibrium with their environment. Some proteins contain areas that are hydrophobic (hate water), so those sections of the protein will end up away from the aqueous environment by hiding in the middle of the folded protein. There are many other factors that drive the protein, but there are several different analogies that can be used to explain the general process.

For one, think of a huge beach ball bouncing down the side of a steep mountain. The ball will bounce many times as it descends and it will eventually stop. If you throw the beach ball down it again, there will be random variations in its path and it won’t end up at the same place. If you repeat that process many, many times, you can determine that there is a statistical pattern to the final resting points. You can also see a statistical scattering in the amount of time it takes for the ball to stop. Most of the time the ball will end up at the bottom of the mountain, but occasionally it may end up in another nearby depression and never reach the lowest possible stopping point. There is a significant statistical nature to atomic motions much like this motion of the bouncing ball down the mountain. Normal folding is like all the times that the ball ends up at the lowest point. Misfolding is like when the ball ends up somewhere else.

In some respects, it’s also similar to parallel-parking a car in a crowded street. At first, the car is exposed, and it usually takes several steps to properly park the car in the proper position. Sometimes it may be necessary to back out slightly and then try again. A protein does the same thing. An observer can watch a hundred similar cars being parked in that space, and they would come to understand the common ways of parking, and which methods work and which don’t.

Like both examples, it’s important for us to know about the motion of a folding protein, although we also want to know the intermediate steps along the way. Our simulation methods construct models of both of these properties. One way that makes Folding@home different from some other distributed computing projects (Rosetta@home for example) is that we want to see how the car parks, not just the end state of seeing it parked. While that’s an important result, it doesn’t shed any light on how or why a protein sometimes misfolds. By attempting to study all of the possible paths that the bouncing ball can take down the mountain, we learn a lot about the question “How did we get here?” It also gives us the opportunity to introduce changes – such as drugs – into the process that modify the probability of misfolded results.

How does Folding@home simulate protein folding?

Two key aspects to our simulations are adaptive sampling and Markov State Models (MSMs). The two are used together and are very important as they allow us to run efficient simulations.

What are Markov State Models?

Protein folding is statistical in nature, so a protein can fold in many ways. We need a map to be able to see the bigger picture. Markov State Models (MSMs) are a way of describing all the conformations (shapes) a protein – or other biomolecule for that matter – explores as a set of states (i.e. distinct structures) and the transition rates between them. They also map out the protein’s motion and energy properties as it folds from one shape to another. Once we have all this information, we can observe the factors that influenced folding, which is especially important if the protein misfolds. Much of the theory underlying these methods is quite old but their use has been limited by the challenges inherent to identifying a reasonable set of states.

MSMs are particularly useful to us as they facilitate parallelization across many computer processors by allowing for the statistical aggregation of short, independent simulation trajectories. This replaces the need for single long trajectories, and thus has been widely employed by distributed computing networks such as Folding@home and GPUGRID. Further, through adaptive sampling, MSMs provide a way to increase the efficiency of simulation without introducing artificial biases or approximations. We’ve been making a lot of progress with developing Markov state model (MSM) methods for analyzing the data we generate with the help of the FAH community. Several Pande Group members include Drs. Xuhui Huang and Gregory Bowman have developed MSMBuilder, an open-source software package used to build, analyze, and visualize MSMs. Since its release in 2009, it’s been download over 1,600 times across five continents and has been used in at least 40 publications to date.

Formally, MSMs are a specific application of discrete-space master equations parameterized from simulation. They consist of two parts: a state space partitioning X, typically chosen to divide the system into a set of metastable states; and a master equation describing kinetics on X, represented by either a transition matrix T or rate matrix. Both the state space and master equation are found from molecular simulation. The precise manner in which this is done varies widely.

What do MSMs look like?

Here are two examples:

MSM showing 14 out of 2000 macrostates for MSM for the NTL9 protein. States that are in better equilibrium are drawn larger, and more likely transitions are represented by larger arrows. Unfolded protein show in red, native state is in green. (Voelz et al)

MSM for the ACBP protein, illustrating some of the primary transitions. (Voelz et al)

What is adaptive sampling, and how is it related to MSMs?

When researchers are using computers to study protein conformational dynamics (how the protein changes shape as it’s folding), the conventional approach for unbiased all-atom molecular dynamics is two-step. First, they run a set of simulations, and second, after the simulations have completed, they analyze the resulting data. The adaptive sampling Markov State Model approach involves breaking this paradigm by interleaving these two steps. Instead of building the model only after the data has been collected, it is instead built on the fly as the data is being generated. A feedback loop can then be set up where the current state of the model is used to inform the progress of further simulations.

Imagine, for example, that you were exploring a maze for the first time. Although you have no map, you do have a GPS which is able to track your progress and display the parts of the maze you’ve explored. One approach is to put the GPS in your purse and walk around blindly — bumping off walls — for as long as possible. Once you’re tired, you take out the GPS and analyze the path your trajectory took; by looking at your path on the GPS you’re able to see the structure of the maze and have effectively built a map. Unfortunately, you notice that you’ve wasted a lot of time stuck in various parts of the maze. Instead, the smarter strategy is to watch the GPS as you walk around — to try to build your map of the maze incrementally. Using your map, you’re able to identify when you’re “stuck” in a certain part of the maze, and to avoid re-exploring parts of the maze that you’re confident that you’ve already discovered.

In many ways, these two approaches to exploring a maze are analogous to the two approaches to collecting and analyzing molecular simulations. Due to the incremental nature of building the model on the fly in the adaptive sampling approach, it is possible to increase the efficiency of simulations.

How do you build an MSM using adaptive sampling?

To start a simulation project, we first must choose some initial conformations (a protein’s shape) to begin with. The heuristic methods we use so far include running high-temperature simulations, employing Rosetta’s Monte Carlo algorithm, and shooting off related MSMs of similar proteins. Once we have a set of conformations, each of them becomes the starting point for some simulations which together we call a Run. Within each Run, we launch many trajectories, each called a Clone. Thus, all of the Clones in a Run start from the same initial protein shape, but they have a different initial velocity, i.e. the atoms are given a different initial push in one direction or another. The Clones from a Run may find additional conformations, in which case that Run ends and several more Runs are started from them. This process continues with a lot of Runs branching out to other conformations, perhaps merging back together to a common shape with other Runs. In the end, we end up with a model with tens of thousands of different conformations, (terabytes of data!) and we can see all the shapes and energy states that the protein can take on while its folding towards its “native state”, the chances of all the transitions occurring, and how long it takes the protein to complete a transition from one conformation to another. More importantly, we can identify the places where the protein misfolds and gets stuck, which then leads to more research and models on how we can prevent this from happening. The more computers we have participating, the faster we can complete the Markov State Model.

Aren’t these the PRCG numbers?

Yes. Work Units are labeled with four distinct numbers in the format: Project (Run, Clone, Generation). We just described the first three; Project is the protein under study, a Run is a simulation started from a particular conformation, and Runs contain many Clones which have different initial velocities. Although Folding@home processes many different Projects, Runs, and Clones all at the same time, Clones themselves are serial in nature. They have to be simulated from start to finish, but it would be impractical for one computer to complete one by itself. Instead, your computer is given a piece of a Clone. We identify the piece using the Generation (Gen) number. One computer will start out with Generation 0, and when it finishes another computer is given Generation 1, etc. We cannot start Gen 1 until Gen 0 finishes, and there may be hundreds of Gens. This is why the Work Units have deadlines, and why speed is so important to us.

Why is this approach particularly useful?

This approach can be powerful because not only is it very amendable to distributed computing, but the available computational resources can be used more efficiently. A protein spends most of its folding time “stuck” in an energetically-favorable position, with transitions – the processes largely of interest – taking place only rarely. Likewise, any straightforward simulation of protein folding will also waste valuable time generating data with little information content. However, using the adaptive sampling concept, the model can identify when the simulation is stuck, and then reinitialize new simulations starting from potentially more fruitful areas, avoiding the wasteful process of re-exploring areas that are already well understood.

In a recent paper, we compared MSMs to more traditional simulation methods. We compared some very long folding trajectories from the Anton supercomputer to an MSM built from the same folding data. Although our MSM “chops up” the simulation into a bunch of short trajectories, it was able to reproduce their simulations very well. Moreover, we also found that the MSM approach revealed new insights into the folding process (a new folding pathway) that was missing in ANTON’s more traditional approach. See this blog post for more information.