Abstract

In this work, we compare the results of molecular dynamics simulations involving the application of three generalized Born (GB) models to 10 different proteins. The three GB models, the Still, HCT, and modified analytical generalized Born models, were implemented in the computationally efficient gromacs package. The performance of each model was assessed from the backbone rms deviation from the native structure, the number of native hydrogen bonds retained in the simulation, and the experimental and calculated radius of gyration. Analysis of variance (ANOVA) was used to analyze the results of the simulations. The rms deviation measure was found to be unable to distinguish the quality of the results obtained with the three different GB models, whereas the number of native hydrogen bonds and radius of gyration yielded a statistically meaningful discrimination among models. Our results suggest that, of the three, modified analytical generalized Born yields the best agreement between calculated and experimentally derived structures. More generally, our study highlights the need both to evaluate the effects of different variables on the results of simulations and to verify that the results of molecular dynamics simulations are statistically meaningful.

Molecular dynamics (MD) simulations are widely used in the study of protein structure and functions (1, 2). The results of a given simulation depend on a number of factors, such as the quality of the molecular force field, the treatment of solvent, the timescale of the simulation, and the sampling efficiency of the simulation protocol. There has been an enormous investment in the underlying technology in each of these areas, and the range of application of MD simulations has expanded greatly since the method was first introduced (3). However, by their very nature, MD simulations are not easy to compare to experiment. Simulations of protein systems involve a considerable amount of computer time, and the results in general cannot be directly compared with the experimental observables without additional processing. When such comparisons are made, the results are often very encouraging, but, given the multiplicity of parameters and computational methodologies that are used, it is hard to know whether the success of a particular protocol, e.g., using one of the available force fields, indicates that all other force fields will yield results of comparable quality. Fortunately, the growth in available computer power combined with the development of highly optimized computer code has made careful comparison of methods more feasible and, in addition, should make it increasingly possible to test whether a particular result is robust in terms of its sensitivity to the parameters of the calculation. Moreover, it would be highly desirable to know whether a particular methodology or combination of parameters is best suited to a particular application and to understand the reasons for performance differences, to the extent that they exist. This goal requires that different methods be compared so that their strengths and weaknesses can be critically evaluated.

The most straightforward validation of MD simulations of proteins is based on comparisons between experimentally determined structures and the conformations that are generated in a simulation that uses the experimental structure as an initial condition. However, there are a number of complications in even this kind of test. The most accurate experimental structures are obtained from x-ray crystallography, but most simulations do not mimic crystalline conditions. Comparisons with NMR structures obtained in solution would appear more appropriate, but NMR structures tend to be less accurate than x-ray structures and are more sensitive to the energy model (e.g., force field) used in generating coordinates. Much remains to be done in comparing theory and experiment and in comparing the one theoretical approach to another. An issue that naturally arises in such comparisons is whether a particular conclusion is statistically meaningful. Has a large enough data set been used, has the simulation been run for a long enough time, and have appropriate statistical tests been applied to the output of the simulations?

In this article, we address some of these issues by using a comparison of continuum solvent models as an example. Specifically, we evaluate the ability of three generalized Born (GB) models, used in conjunction with the gromos96 force field (4, 5) and stochastic dynamics (SD) (6) simulations, to reproduce experimentally determined protein structures. The three models are those of Still and coworkers (7), the HCT model from the Cramer/Truhlar group (8), and a modified AGB (mAGB) model of Levy and coworkers (9) developed in our laboratory (10). The three models have been implemented in the gromacs 3.0 package (11–13). Five-nanosecond simulations using each of the three models were carried out on 10 different proteins (Table 1), and the structural properties averaged over the last 1 ns of each simulation were compared with the corresponding properties obtained from the experimentally determined structures. The extent to which differences in the results obtained from different GB modes were statistically meaningful was determined through analysis of variance (ANOVA). Our analysis reveals that some structural measures are better than others in their ability to discriminate among models, and, in addition, it emphasizes the importance of applying statistical analysis to the results of MD simulations. Proper analysis also requires the use of a sufficiently large data set, something that is no longer a problem given the computational efficiency of modern MD technologies and the availability of large-scale computational resources.

The results reported here and in a previous study (10) indicate that there are important differences between GB models. In addition, model performance is found to depend on simulation protocol; for example, the friction coefficient used in a simulation or initial conditions. Some of the implications of our results are discussed below.

Methods

Protein Data Set.Table 1 provides information about the 10 proteins used in the simulations. Seven of the structures were solved with x-ray crystallography, and seven were solved with NMR. For the latter, the first structure in each PDB file was used, except for 1lea, for which only one energy-minimized average structure is available. All 10 proteins are small (between 56 and 76 residues), have no disulfide bonds, and have different compositions of secondary structure elements.

Model Implementation. The HCT, modified analytical generalized Born (mAGB), and Still models were implemented in the gromacs 3.0 package (11–13). The precision of the implementation was examined by evaluating both energies and forces. The energy calculation was tested through a comparison with values obtained from the same models within the gromos package. The force was tested by comparing the GB force calculated from analytic derivatives with that obtained from finite interpolation. In finite interpolation, the energy change upon a perturbation in a coordinate should approach the analytic derivative when the perturbation is sufficiently small. A coordinate increment of 0.01 Å was used, and the protein 1pgb was tested. The nonpolar solvation term, estimated from an analytical surface-area term calculation (10, 14, 15), was combined with each GB model (GB/surface area model, GB/SA model) to represent the complete solvent potential of mean force. It should be emphasized that the nonpolar term used in the original AGBNP model (9) has been replaced in the mAGB implementation (10) by a simpler surface-based term for convenience of comparison between methods.

Simulation Methodology. The gromos96 force field (4, 5) was used in all simulations. All ionizable amino acids were assumed to be in protonation states appropriate to a pH of 7 assuming standard pKa. Each GB/SA model was combined with a SD algorithm (6) to represent the effect of the solvent. Bound ions and crystal waters were removed, and no counter ions were added to neutralize the protein/implicit solvent systems. All protein structures were first optimized through steepest descent energy minimization and then subjected to 10 ps of positionally restrained dynamics with a force constant of 2.4 kcal/molÅ2. Then, 5-ns SD simulation with each GB/SA model was performed. The equations of motion were integrated with a leapfrog SD scheme and a time step of 0.002 ps. The temperature was kept at 300 K, and, except where noted (see below), the friction coefficient was set to 10 ps–1. Covalent bonds were constrained with the LINCS algorithm (16). Nonbonded interactions were calculated every step without cutoffs. The nonpolar surface-area term and its derivative were updated every 10 time steps. The surface coefficient σ was set to 0.005 kcal/molÅ2 for all atom types except for sulfur (–0.005 kcal/molÅ2) and polar hydrogens (0.0 kcal/molÅ2) (10, 15).

Three additional 5-ns simulations were performed on 1pgb by using different initial atomic velocities, resulting in a total of four runs for this protein. Another four simulations by using otherwise identical conditions were performed on 1pgb by using a friction coefficient of 50 ps–1. All simulation trajectories were analyzed with tools in the gromacs3.0 package. Different properties were extracted from the trajectories and averaged over the last period (1 ns) of the simulation. These average properties were subjected to the statistical analysis.

ANOVA. ANOVA (17) is a statistical technique that can evaluate the significance of differences between two or more groups. Two-way ANOVA is appropriate to cases where there are two independent variables. The “null hypotheses” to be tested are that the effect of the first (or second) variable is not significant and that the two variables are not correlated. In this work, the GB model and the protein correspond to the two independent variables, and the null hypothesis of interest is that three GB models are equivalent. ANOVA will calculate the probability (P) of the null hypothesis being true. A sufficiently low P value indicates that a particular property can differentiate among the GB models. A high P value suggests that the property in question should not be used to infer differences in model quality. ANOVA results are valid only when the samples are normally distributed. This conclusion was confirmed through an analysis of residual. Detailed descriptions of ANOVA can be found online or in most statistics textbooks. We used a linux version of the statistical package, r (www.r-project.org) for ANOVA and validation tests.

Results

rms Deviation (RMSD) of Backbone Atoms.Table 2 lists the RMSDs of the backbone of secondary structures averaged over the last 1 ns of a given simulation. The RMSDs are in the range observed for previously reported explicit (15, 18) and implicit (15, 19–26) solvent simulations. Many of the values are <2 Å, and for only one protein (1pgb) is the RMSD close to 3 Å. Table 2 (multiple simulations, 10-ps–1 friction coefficient) shows results for the four sets of simulations carried out on 1pgb with different initial velocities. The first of these simulations corresponds to the one also listed in column 6. As can be seen, the RMSDs for the three additional simulations are lower than for the first. The average RMSD of the 10 proteins is 1.9, 1.6, and 1.8 Å for the HCT, mAGB, and Still models, respectively. The lowest RMSD for each protein (marked in bold) is scattered among the three GB models, and visual inspection does not identify a best performing GB model.

Two-way ANOVA confirms this conclusion; the probability of the null hypothesis being true (i.e., that the three GB models are equivalent) is 0.32, which indicates that differences among models detected by RMSD results are not statistically significant. This finding suggests that RMSD is not a useful discriminator of model quality in the current study. Although we have not analyzed the effect of the protein on the results in detail, it is of interest to consider whether there are differences between proteins that lead to different RMSDs. Possible factors include the location of salt bridges for which the subtle balance of relevant forces (e.g., Coulomb and desolvation) is difficult to calculate or, perhaps, the energetics of secondary structure formation.

Number of Native Hydrogen Bonds. We evaluated all hydrogen bonds involving residues that are part of secondary structure elements. Tables 3 lists the number of native hydrogen bonds observed in the experimentally determined structures, the average number of these hydrogen bonds in the last 1 ns (obtained by averaging the number seen in each snapshot), and the relative deviation of the two values (DEV). DEV is defined as Qsim/Qexp – 1, where Qsim is the value for a particular quantity obtained from the calculation and Qexp is the value of the same quantity obtained from the experimental structure. [The mAGB model is seen to have the lowest deviations for every protein except 1pgb, where, based on the results shown in Table 3 (multiple simulations with a 10-ps–1 friction coefficient), the performance of all three models is essentially the same.] The average value of DEV for the 10 proteins is –0.368, –0.245, and –0.331 for the HCT, mAGB, and Still models, respectively. The HCT simulations, in particular, tend to lose a relatively large number of native hydrogen bonds. Two-way ANOVA indicates that the probability of the null hypothesis being correct is 1.8 × 10–4, suggesting that differences among model performance, as measured by the number of native hydrogen bonds, are statistically significant.

Radius of Gyration.Table 4 lists the radius of gyration obtained from the experimental structures and from different simulations together with the corresponding DEV values. All GB models yield comparable values with a maximal deviation from the experiment of 0.5 Å. The average DEV value for the 10 proteins is –0.016, 0.001, and 0.012 for the HCT, mAGB, and Still models, respectively. The lowest DEV values, indicated in bold in the table, are evenly distributed and do not indicate a preference for a particular model. Nevertheless, two-way ANOVA reveals that the probability of the null hypothesis being correct is 9.5 × 10–5, suggesting that the radius of gyration is a good discriminator of GB models. Indeed, the data in Table 4 suggest that the HCT model tends to make proteins slightly too compact and that the Still model tends to make them expand slightly, whereas mAGB exhibits no particular preference and has the smallest average deviations from experimental values.

Friction Coefficient. Four simulations using the mAGB model and a friction coefficient of 50 ps–1 were performed on 1pgb with different initial atomic velocities. (Results are presented in the last five columns of Tables 2, 3, 4). The average RMSD is reduced from 2.4 to 1.5 Å when the larger friction coefficient is used (Table 2). More native hydrogen bonds are retained (24 instead of 18; Table 3), but the radius of gyrations is not affected (Table 4). The results suggest that the structural deviations (RMSDs) observed in the simulations, in particular the large deviation of 1pgb, may be strongly influenced by the simulation protocol (e.g., the friction coefficient) and not just changes in the interaction function, in this case the GB model. The improved agreement with experiment of course does not imply that the GB model performs better with a higher fiction coefficient. The friction coefficient affects only dynamic properties of a system, not thermodynamic properties such as the equilibrium structure. In this case, increasing the friction coefficient reduces the structural deviation by restricting the mobility of the protein, thus lowering the sampling efficiency.

Discussion

Criteria for Model Evaluation. Despite the wide use of RMSD as a measure of the success of a particular set of parameters in applications to proteins, two-way ANOVA indicates that RMSD is in fact a poor discriminator of different GB models. Strictly speaking, this result is valid only for the comparison of the GB models used in this study, but its reduced sensitivity relative to factors such as the number of native hydrogen bonds or radius of gyration is noteworthy. Although this result does not prove that RMSD will always fail to discriminate among models, the calculated RMSD clearly results from a combination of factors that may reduce the amount of useful information this measure provides.

In addition to RMSD, MD studies of protein dynamics often report structural features such as radius of gyration, hydrophilic and hydrophobic solvent-accessible surface area, hydrogen bond, and the number of residue–residue contacts. Our results suggest that some of these properties are more sensitive to changes in the energy function than RMSD. Indeed, ANOVA indicates that hydrophilic and hydrophobic surface areas also provide significant tests of GB models. In contrast, the number of residue–residue contacts was not sensitive to the model differences. It may well prove to be the case that different tests are appropriate to different questions about the energy function. For example, hydrogen bonding is clearly sensitive to solvent model and, thus, is likely to be affected by changes in the GB model. In contrast, it may well be the case that the number of residue–residue contacts will provide a useful means of evaluating van der Waals potentials.

Evaluation of GB Models. In a previous study, we compared the Still, HCT, mAGB, and GBMV2 models in terms of their ability to reproduce Born radii obtained from the Poisson equation (delphi, refs. 27 and 28). We also compared the effectiveness of each model, used in conjunction with the gromos96 force field (4, 5), in the ab initio folding of two peptides. The GBMV2 model (29, 30) (which is computationally more demanding than the other three models) was found to be most effective in reproducing the delphi results, but the Still and mAGB models were most effective in the peptide folding, yielding some trajectories that produced native-like structures without the use of enhanced sampling techniques. This finding was attributed to the fact that the latter two models allowed for more efficient sampling even though they are somewhat less accurate in their ability to reproduce the results of a rigorous continuum model. This study also finds that mAGB is the most successful of the three fast GB models. In particular, mAGB maintains the largest number of native hydrogen bonds and yields radii of gyration in best agreement with experiment. In a previous study we found that, of the three models considered here, mAGB was the most effective in reproducing delphi-derived Born radii. In our implementation, mAGB is between 10% and 15% slower than the Still and is faster than HCT by about the same factor.

General Implications. Continuing improvements in access to computer resources and algorithms (programs) are enabling MD simulations of biomolecular systems to approach experimentally relevant time and length scales. With these factors, it is becoming increasingly possible to rigorously evaluate the available computational methodology against experimental data. In this paper, we have described the results of SD simulations on 10 proteins with the specific goal of evaluating the performance of three fast GB models. A second goal was to demonstrate the importance of using statistical techniques in the analysis of the results of MD simulations. The more general purpose of both aspects of this study is to help establish standards that enable the comparison and validation of parameters used in the simulations. A number of our findings clearly indicate the need to go beyond the consideration of just individual molecules and/or well chosen test cases. Instead, we must strive to employ data sets that are sufficiently large and diverse to enable rigorous statistical analysis. For example, the B1 domain of protein G (1pgb) has been used as a model system in a variety of simulations. However, in our study, the results obtained for this protein appear atypical, and, indeed, agreement with calculated structural parameters appears better for other proteins in our test set.

A second finding of our study is the value of using statistical analysis in assessing the significance of conclusions that are derived from simulations. For example, we find that, despite its wide use in evaluating the relative quality of different simulations, RMSD does not appear to be an optimal measure for discriminating among different GB models. In contrast, the number of native hydrogen bonds and the radius of gyrations appear to be better discriminators of GB solvent models. We note in passing that the number of native hydrogen bonds that are maintained in our simulations is significantly lower than the number observed experimentally. The situation appears to improve when a larger friction coefficient is used, simply highlighting the need to carefully evaluate the performance of a particular simulation protocol before specific conclusions are drawn. Still, the extent to which hydrogen bonds are broken raises concerns as to conclusions that are often made in MD simulations as to the biological importance of individual hydrogen bonds.

Finally, it should be pointed out that the major goal of this work was not to comment directly on the accuracy of a particular set of parameters but rather to introduce criteria as to how comparisons should be carried out. In particular, we have shown that it is possible to use appropriate statistical tools such as ANOVA in the analysis of computational studies of proteins. As mentioned above, truly detailed experimental evaluation of simulation accuracy requires, on the one hand, that the crystal environment be properly accounted for and, on the other hand, that the accuracy of NMR structures continues to improve. The problem is particularly complex for side chains, whose conformations can be strongly influenced by crystal packing and can be difficult to determine accurately with NMR methods. However, in this work we have used backbone hydrogen bonds as a means of evaluation so that this issue should not be significant. Indeed, the overall level of agreement with experiments did not appear to be affected by the method used for structure determination.

We hope that our results indicate that the true merits of different interaction functions and methodologies may be assessed in an objective fashion so that we may progress beyond the current situation, where results are often presented without careful validation. We feel that the available programs, algorithms, and computer power now make it possible to carry out such tests that were often difficult to implement in the past.

Acknowledgments

We acknowledge Dr. Erik Lindahl for his help in the implementation of GB/SA models in gromacs and Dr. Tsjerk Wassenaar for his help with the ANOVA. Support from National Institutes of Health Grant GM-30518 (to B.H.) is also acknowledged.

Footnotes

↵‡ To whom correspondence may be addressed. E-mail: a.e.mark{at}chem.rug.nl or bh6{at}columbia.edu.

Similar Articles

You May Also be Interested in

Researchers report links between warming and predator-prey interactions in the Arctic and suggest that predator activity can influence carbon and nitrogen dynamics in the Arctic, but that warming may alter or reverse such effects.

A study finds that individuals with major depressive disorder had lower blood levels of acetyl-L-carnitine (LAC) than healthy controls, suggesting that LAC might aid the diagnosis of severe, trauma-associated depression.

A study explores historical fire activity associated with bison hunting by indigenous groups in North America, and suggests that fire use by indigenous hunters might have amplified the effect of climate variability on fire activity in the North American Great Plains.