Lots of short simulations combined into one large analysis of a protein.

We know how to get snapshots of what proteins look like. These static pictures tell us where all the atoms of a protein reside within a crystal, which gives us a sense of its structure and lets us design drugs that fit neatly within that structure, altering its activity.

But, in actual cells, proteins are nothing like the static, rigid structures found in crystals. Instead they writhe, buffeted by Brownian motion and constantly shifting among similar energy states. Until we develop a microscope that can resolve all this motion, the best we can do is to run molecular simulations on our computers. Unfortunately, most proteins have a lot of atoms to keep track of, which makes those simulations extremely computationally expensive.

Now, some researchers have figured out how to run the simulations on Google's cloud computing architecture. Although each of the individual simulations is short, they can be aggregated to provide a picture of long-term behavior. And, with this method of aggregating them in place, the system should be able to work with just about any cloud service available.

Typically, it's difficult to split a molecular simulation up into smaller jobs. Distant parts of a protein remain physically connected through chains of chemical bonds, and the structure can involve folds and turns that bring distant parts of the protein close together in space. As a result, each step of the simulation typically has to consider the entire protein at once, and each step depends on the output of the one before it. That essentially makes any simulation a single large, serial computation. Getting more than a few milliseconds of simulated time takes an enormous amount of computational power.
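To make that data dependency concrete, here's one timestep of a standard velocity Verlet integrator, sketched in Python. It's purely illustrative - not the authors' code - and compute_forces stands in for the expensive force-field evaluation:

import numpy as np  # positions/velocities/forces: (N, 3) arrays; masses: (N,)

def velocity_verlet_step(positions, velocities, forces, masses, dt, compute_forces):
    """One velocity Verlet timestep. It needs the entire system's
    previous positions and forces, which is why a single trajectory
    is inherently serial."""
    # Half-kick: advance velocities half a step using the current forces.
    velocities = velocities + 0.5 * dt * forces / masses[:, None]
    # Drift: advance all positions a full step.
    positions = positions + dt * velocities
    # Recompute forces at the new positions; bonded and non-bonded terms
    # couple every atom to the rest of the protein.
    forces = compute_forces(positions)
    # Second half-kick with the new forces.
    velocities = velocities + 0.5 * dt * forces / masses[:, None]
    return positions, velocities, forces

# With a ~2 femtosecond timestep, one millisecond of simulated time is
# about 5 * 10**11 sequential steps, none of which can be skipped.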

And it's important to get more than a few milliseconds. Proteins may shift back and forth among hundreds of states, and the differences between any two can determine whether the protein is active or inactive, susceptible to a drug or not. So you need to run the simulation long enough to be sure that it has time to sample a lot of these states.

Or maybe you don't. Some researchers at Stanford, collaborating with a pair of Googlers, have moved the simulation code over to Google's Exacycle cloud computing system. The cloud still can't run a single simulation as one massively parallel computation, so instead the system ran tens of thousands of simulations at once, each for a relatively short amount of time. Individually, they were short (on the order of nanoseconds to microseconds each), but collectively they added up to roughly 2.5 milliseconds and explored much of the potential energy landscape available to the protein. The trick was writing the code to merge all the individual simulations into a single picture of the protein's behavior.

Why does this work? Because a protein typically has a limited number of highly stable states and tends to shift back and forth between those and other states that are only occupied briefly. As one of the authors, Russ Biagio Altman, put it to Ars, "there is not a very long 'memory' of previously visited states." There are very few cases where an important structural state is reached by a series of unlikely intermediates, so cutting the simulation short doesn't miss all that much. In fact, Altman said that the system could be set up to prioritize any simulations that come across a rare state.
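That merging step is, in essence, a Markov state model: discretize each short trajectory into a sequence of states, count the observed transitions, and read the long-timescale behavior out of the resulting transition matrix. Here's a minimal sketch in Python - illustrative only, with made-up state labels; a real analysis first clusters the MD frames into states, as the MSM tooling mentioned in the comments below does:

import numpy as np

def build_msm(trajectories, n_states, lag=1):
    """Aggregate many short, discretized trajectories into one Markov
    state model. Each trajectory is a sequence of state labels (ints
    in [0, n_states)), e.g. from clustering raw MD frames."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1  # one observed i -> j transition at this lag
    # Row-normalize the counts into transition probabilities.
    return counts / counts.sum(axis=1, keepdims=True)

def stationary_distribution(T):
    """Long-run occupancy of each state: the eigenvector of T's
    transpose with eigenvalue 1. This is the 'long simulation' answer,
    recovered from nothing but short runs."""
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

# Toy usage: three short runs over three hypothetical states.
T = build_msm([[0, 0, 1, 1, 2], [2, 1, 0, 0], [1, 2, 2, 1]], n_states=3)
print(stationary_distribution(T))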

In this case, the authors looked at a G-protein coupled receptor (GPCR), a member of a huge class of proteins that's involved in many key biological processes and implicated in a number of diseases. The crystal structures of the active and inactive states have been solved, and the authors started a number of simulations in each. Many of these evolved into a common intermediate state, and the simulations made short excursions from all three of these states into other, briefly occupied states. The authors also showed that the simulations could incorporate things like drugs or the normal chemicals that interact with the receptor.

It's still not clear that this system will work for every protein; some may have transitions that are simply too slow for the short simulations to capture. But it seems likely that a lot of proteins could be handled with this technique. And the approach should be more general than simply running it on Google's system: Altman says it would work on dedicated computing clusters, and "we should be able to do this on other cloud infrastructures going forward."

Promoted Comments

Hi folks - I'm one of the paper authors and would like to add a few minor clarifying comments:

1) The computer time was provided free to the scientists.

2) Kai Kohlhoff, who was a postdoc in Vijay Pande and Russ Altman's labs, joined Google as a Visiting Faculty member. He used roughly half a billion CPU hours for this calculation (and some others).

3) Much of the work was based on software developed by the Folding@Home team. In many ways, Exacycle resembles F@H in design and ran a binary resembling F@H's core code (gromacs). Further, we used MSM (https://simtk.org/home/msm-database), although rewritten in Google Flume, to do the data analysis.

4) As pointed out, we did not carry out a single 2 ms simulation. The results were derived from many shorter simulations. Some of us believe this is actually a better sampling method than single, long trajectories, although that's an open question. I'll note that my PhD work, done 13 years ago, consisted of four 10 ns simulations.

5) Regarding the comments that this is mainly a computational advance: we didn't pick GPCRs or simulate drug binding by chance, and our predictions are testable in a lab. In general, I would agree that pure folding simulations are unlikely to produce pharma-accurate predictions in the near future. I *hope* that our technology and models will become increasingly accurate over time, and potentially help in the development of new drugs that reduce the cost of health care and reduce side effects. But we have a lot of work to do.

To add a little context, it's probably worth noting that this is an advance in computational methodology, not necessarily in biochemistry or drug discovery. Many scientists - especially discovery chemists - are cynical about the predictive powers, or lack thereof, of computational chemistry. (If you read In The Pipeline, you may remember the "spirited" discussion of the usefulness or otherwise of molecular dynamics simulations following the Nobel announcements.)

Computational chemistry is still at the point where calculating a distribution coefficient ab initio is basically a guess, never mind understanding the behaviour of an entire protein. So while the ability to deploy cloud resources on this kind of problem is undoubtedly welcome, and will extend the kinds of interactions that can be practically simulated, we're not at the point where using Google's computing might really allows us to understand or predict protein interactions with any conviction. (Truth be told, we're not even that close to knowing how far away we are.)

"I believe you have your units wrong. 2 ms would be a very, very long protein MD simulation."

You are correct. In the paper, they state that the final aggregate simulation amounts to 2.5 milliseconds, and each individual run was on the order of nanoseconds to microseconds.
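For scale, the arithmetic is simple. A back-of-envelope sketch in Python (the average run length is an assumed illustrative figure, not one from the paper):

# Many short runs aggregate to milliseconds of simulated time.
n_runs = 25_000          # "tens of thousands of simulations"
ns_per_run = 100         # hypothetical average run length, in nanoseconds
print(n_runs * ns_per_run / 1e6, "ms aggregate")  # -> 2.5 ms

# And the compute budget from the author's comment above:
cpu_hours = 5e8          # "roughly half a billion CPU hours"
print(round(cpu_hours / (24 * 365)), "CPU-years")  # ~57,000 CPU-years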

They also estimate that Anton (a supercomputer with custom hardware designed to do only molecular dynamics calculations) would take about five times as long to do the same simulation. I think it'd be worth doing to see whether there are significant differences between the Exacycle run and the equivalent Anton run. These could arise, for example, from small inaccuracies in the force fields (the set of physicochemical parameters in the simulation) that only surface when the simulation is long enough.

Of course, if you really want to be cynical, you could interpret the discovery chemists' open posts denigrating computational chemistry ("don't go here, it does not work") as a method of slowing their competitors.

Computational chemistry, like experimental chemistry, has artifacts and difficult-to-interpret results. On the other hand, given the size of the data sets in current biochemistry, it is clear that computational approaches are crucial and will become more so. The less cynical would say that no one experiment, whether in a wet laboratory or in a computer, is definitive, and that both need to be verified by other experiments. In addition, Russ Altman is usually worth listening to, even if the results from the simulation may be incomplete models of reality.

Lest anybody else get confused by the picture, it looks like the simulation was on the beta-2 adrenergic receptor, not the histamine receptor.

And yeah - even microsecond runs on a GPCR - embedded in a lipid bilayer, with water on both sides - are not short, though GPU codes help. Milliseconds are forever. Quite an accomplishment, even if not as a continuous trajectory.

As far as (some) medicinal chemists dissing comp chemistry is concerned, some of it is professional status and turf wars; some of it is ignorance of the limits of any prediction (which, honestly, it falls on the comp chemist to make clear at the time; and if the comp chemist is young, inexperienced, or hasn't been burned enough times, they may not). The med chemists who are really good - at least the ones I've worked with - and who are clued in definitely want comp chem support, and a proper med chem/comp chem collaboration can be a tremendously synergistic activity - even more so when you can fold structural biology into it.

Also, referring to computational chemistry in a monolithic sense, especially within the pharma industry drug discovery paradigm, is painting with an exceptionally broad brush. It can refer to statistical modeling of activity or ADMET data, property prediction, or other QSAR methods. It could refer to ab initio QM for predicting whether a ligand can assume a binding conformation with a reasonable energetic penalty, or whether a compound might cause photosensitivity. Molecular dynamics can be used to sample loop motions, to relax homology models of proteins (built by the comp chemist), or to create multiple binding site states for docking studies - which appears (sorry, too cheap to pierce the paywall right now) to be what they were doing in the third panel of the linked abstract page, where they seem to be showing the sorts of compounds enriched in different protein conformational states. And those are just the setups for docking and compound library design activities that are bread and butter for structure-based drug design.

In the end, it's about comp chem improving the hit rate for med chem (raising the batting average, not hitting a home run on every pitch), and being able to design things you simply wouldn't dare try blindly - or otherwise think to. And for that it works - scarily well sometimes. Browse through the Journal of Medicinal Chemistry sometime if you want to see how pervasive it is, and then take the dissing with a grain of salt.

But why would folding@home want to pay for expensive cloud computing when the entire project is designed around allowing common people to be their cloud?

They didn't. This project was a collaboration between Stanford and Google, where Google provided the hardware and Stanford provided the software. One of the goals was to be a demonstration project to show the value of their cloud platform, so they provided all the processing time for free.

I would definitely suggest extending this technique to other systems that transition among a finite, albeit large, number of states. This is the way I've always envisioned the universe working, up and down the scale. [I thought it was obvious. Guess not.]

Have you looked at modeling any viral targets, particularly ones that undergo dramatic rearrangements over the course of processing and infection? I'm thinking of HIV GP120, influenza HA and Dengue M here mostly.

I think dramatic rearrangements would be a harder problem, as would simulating the cell accurately. We're really doing sampling around a relatively well-defined folding basin.

Of course the dream is to use MD sims to study cellular processes in detail, but for that, we tend to use coarse-grained simulations. For example, the work on NPC (Nuclear Pore Complex) by Dan Russel included protein structures at atomic resolution, but modelled interactions very coarsely.
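For readers unfamiliar with the term: coarse-graining replaces groups of atoms with single interaction sites. A minimal sketch in Python - illustrative only, not taken from the NPC work - that collapses each residue to a single center-of-mass bead:

import numpy as np

def coarse_grain(atom_positions, atom_masses, residue_ids):
    """Collapse an all-atom structure ((N, 3) positions) to one bead
    per residue, placed at the residue's center of mass. Having fewer
    particles is what makes assemblies as big as the NPC tractable."""
    beads = []
    for res in np.unique(residue_ids):
        mask = residue_ids == res
        m = atom_masses[mask]
        com = (m[:, None] * atom_positions[mask]).sum(axis=0) / m.sum()
        beads.append(com)
    return np.array(beads)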

Dr Konerding: Thank you very much for posting here. One of the best aspects of Ars Technica is how often we get first-hand information in the comments section from the people whose work is covered in the articles.

You can't beat the Ars readership - thanks for really interesting comments, everybody, and in particular to the author for engaging. This will be a slightly overlong post simply because lots of good stuff has come up.

For those less familiar with molecular simulations, to whom a few milliseconds may not sound like a long time, it's probably worth restating that the published work we're discussing here describes an absolutely epic feat of computation.

Quote:

Of course, if you really want to be cynical, you could interpret the discovery chemists' open posts denigrating computational chemistry ("don't go here, it does not work") as a method of slowing their competitors.

Hah! Now that's cynical; I'm not sure I'd go quite that far. In The Pipeline has an excellent comments section, one of the few that's reliably worth reading (along with Ars, obviously). If you were to criticise it, though, it would be for an overall slightly reactionary, over-defensive flavour - which arises at least in part from the dismal job security most medchem folk have had for the last while.

Quote:

As far as (some) medicinal chemists dissing comp chemistry is concerned, some of it is professional status and turf wars; some of it is ignorance of the limits of any prediction (which, honestly, it falls on the comp chemist to make clear at the time; and if the comp chemist is young, inexperienced, or hasn't been burned enough times, they may not).

Agree wholeheartedly. The "ignorance of limits" issue arises in particular when a medicinal chemist without adequate expertise is "having a go" at doing a bit of comp chem on the side and overinterprets the results.

Quote:

The med chemists who are really good - at least the ones I've worked with - and who are clued in definitely want comp chem support

Well, I hope I'm one of those. But then, I don't suppose anyone actually thinks "hey, I like being obtuse." And for the avoidance of doubt, I'm emphatically not one of those "this is all nonsense" medchem types. Rightly or wrongly, I tend to think of compchem as helping with prioritisation: you do more of the stuff that the various models point you towards, but you try to keep an open mind about testing the limitations of the models.

Quote:

Browse through the Journal of Medicinal Chemistry sometime if you want to see how pervasive it is, and then take the dissing with a grain of salt.

Or what those same cynical types would call "The Journal of Unsuccessful Medicinal Chemistry", where projects that didn't work out in terms of patentability end up, to give a consolation prize of a career-assisting publication or two. (Again, this isn't a view I endorse, but it's one regularly advanced with varying degrees of seriousness.)

In all seriousness though - as you're well aware, but some readers may not be - you necessarily get a neatly narrativised account of a medchem project by the time it makes a journal. It's generally presented as a logical, rational progression from hypothesis A to end result B, leaving out random things that were done by mistake but ended up working out, post facto pattern recognition, lucky guesses, etc. I'd hazard that the published accounts of medchem projects probably tend to place more weight on modelling predictions than the team actually did when the work was in progress.

I constantly see modeling papers of proteins, but oftentimes those papers seem to miss one absolutely critical feature of proteins found on the surface of a cell - glycosylation. You can bet that almost 100% of all GPCRs are glycosylated, and nature glycosylates proteins for a reason. Glycosylation significantly alters protein function, folding, and signaling, and it restricts the protein conformers that are attainable. Glycan structures on proteins are themselves constantly in a state of vibration and movement, which is how they affect protein biology. There's a reason carbohydrates have been called the third alphabet of life. Oftentimes, crystal structures of glycoproteins are obtained only after the glycan structures have been removed. The problem is that this decreases the value of the protein structure obtained, since glycan structures are critical features that control so much biology. Where are the glycans in models of GPCRs?

The major reason people choose to ignore something like glycosylation is that it is easier to do experiments and get results when you ignore it. Glycosylation is not template-driven like DNA and proteins. The same GPCR protein in different types of cells is often glycosylated differently, which is why the same protein on different cell types behaves differently - there are microheterogeneities.

The effort the authors put into the paper is certainly commendable, but always take these things with a grain of salt when a key feature of almost all proteins on the surface of cells, such as glycosylation, is basically ignored. Life is much more complicated than just DNA and protein.

You are absolutely correct that glycosylation is extremely important for in vivo protein interactions, no question. Given a choice, including it would be preferable.

However, it's not so much that it's ignored - it's that the entire experiment can become untenable if the sugars are on there. It's vastly more difficult to purify and crystallize proteins that are glycosylated, for example. So the deglycosylated crystal structure you get out is, in effect, a model system. But I would propose that it's less a matter of regarding the result as incorrect or deficient than one of keeping in mind the limits of the model system. Maybe that's merely semantics, but I'd view it as a different mindset. If you're looking at protein-protein interactions, glycosylation state is a big deal. If you're looking at a binding cleft, it may or may not actually matter much.

It's by no means limited to glycosylation. For example, kinases can exist in multiple phosphorylation states and conformational states. But if you can get that first crystal structure of a target kinase in ANY state, you're a lot happier than if you can't. You keep in mind the limits of the system, and you keep pushing against those limits to test it.

(For that matter, people looking at pretty pictures of precisely positioned atoms in a crystal structure tend to forget that even the structure itself is a model of atom positioning within a fitted electron density, subject to constraints and errors, averaged over an ensemble of states in unnatural conditions of solvent, pH, and crystal contacts. It's a model, with limits - but that doesn't make it less useful.)

We're almost always dealing with models in the biological sciences, whether the experiment is taking place in a computer or a plate well or a cell system or a mouse. They're all model systems, with limits.

In the present case, this is a pretty damn awesomely audacious attempt to start mapping out the conformational space between GPCR agonist and antagonist conformations, with a potential payoff being the ability to specifically target and stabilize any state along the spectrum and thus tune the biological response. Maybe it's "just" a first step in a model system, but as first steps go it's a doozie. Kudos to the authors.