We aim to create and test computational methods capable of refining comparative protein structure models to an accuracy comparable to that of moderate to high resolution experimental structures. Our overall strategy is to dramatically improve the efficiency of sampling, which has limited the success of prior efforts at comparative model refinement, by a combination of (a) identifying which degrees of freedom are critical to sample, (b) developing new algorithms for making large moves along these degrees of freedom, and (c) using experimental data, if available, to help constrain the search space. We will:

Implement a method for refining homology models based on methods from kinematics. The approach combines two powerful strategies for improving the efficiency of sampling. The first strategy is to replace the small, femtosecond motions of molecular dynamics sampling with large, concerted motions of critical structural elements, such as loops. These Monte Carlo (MC) moves occur in dihedral angle space, and are guided by kinematics, which has been used to solve similar problems in the field of robotics. The second strategy is to adopt the powerful multiple temperature sampling schemes (replica exchange), which dramatically increase sampling efficiency and are simple to implement with inexpensive Linux clusters.

Implement a method for refining comparative models based on remote homologs by combining the “zippers” strategy for sampling protein backbones with restraints derived from bioinformatics. The zippers method was originally developed to increase the sampling efficiency for all-atom protein folding by preferentially sampling local contacts. Here, the method will be adapted to refine comparative models by focusing the sampling on those portions of the structure that are poorly constrained by the sequence alignment. Two key features of this approach are that it will be (i) tolerant of modest sequence alignment errors, and (ii) capable of providing models for large insertions, which are not aligned to template residues.

Integrate information in template structures, a molecular mechanics force field, and crystallographic data to maximize the number and accuracy of protein structures determined by molecular replacement. Crystallographic structure determination by molecular replacement is limited in two model building aspects: (i) it is sensitive to the inaccuracy of the starting models and (ii) the ensuing refinement of the molecular replacement solution has a relatively small radius of convergence. We will address the first problem by an iterative process of target-template alignment, model building, and model assessment based partly on crystallographic data; and the second problem by several algorithms that refine a given model based on crystallographic data, a molecular mechanics force field, and information in the template structures. This work will involve close collaborations with scientists participating in the Protein Structure Initiative.

Evaluate the methods developed in Aims 1–3, including the utility of the models for computer-aided inhibitor discovery. We will subject the methods in Aims 1–3 to blind predictions in CASP and CAFASP, and on an ongoing basis using EVA-CM. In addition, we will directly assess the utility of the models generated for structure-based inhibitor design, an important but challenging application. The ability to create comparative models with accuracy comparable to experimental structures would open up a vast new set of protein targets to structure-based drug design methods. Common methods of assessing comparative models focus on geometrical accuracy, and do not directly assess the aspects of binding sites required for inhibitor design applications. We will directly assess the quality of the models for this purpose by performing docking enrichment studies on comparative models generated by Aims 1–3 for proteins with known inhibitors. This work will involve close collaborations with experimental groups who will test our methods in inhibitor discovery projects.
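As a rough illustration of what a docking enrichment study measures, the sketch below computes a commonly used enrichment factor: how much more concentrated the known inhibitors are in the top-ranked fraction of a docked library than in the library as a whole. The function name, the convention that lower docking scores are better, and the example numbers are assumptions made for illustration; they are not taken from the proposed protocol.

```python
def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Enrichment of known actives in the top-ranked fraction of a docked library.

    scores       -- docking scores, one per compound (assumed: lower is better)
    is_active    -- booleans, True where the compound is a known inhibitor
    top_fraction -- fraction of the ranked library to inspect (e.g. 0.01 for 1%)
    """
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and is_active must be non-empty and equal length")
    n_total = len(scores)
    n_actives = sum(1 for flag in is_active if flag)
    if n_actives == 0:
        raise ValueError("at least one known active is required")

    # Number of compounds in the inspected slice; always look at one or more.
    n_top = max(1, round(n_total * top_fraction))

    # Rank compounds from best (lowest) to worst (highest) score.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    hits_in_top = sum(1 for _, flag in ranked[:n_top] if flag)

    # Ratio of the hit rate in the top slice to the hit rate in the whole library.
    return (hits_in_top / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    # Hypothetical ten-compound library with two known inhibitors ranked first.
    example_scores = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
    example_flags = [True, True] + [False] * 8
    print(enrichment_factor(example_scores, example_flags, top_fraction=0.2))
```

A value near 1 means the ranking is no better than random selection; larger values mean the model's binding site ranks known inhibitors ahead of the rest of the library.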

Aims 1–3 address both Comparative Modeling Goals identified by RFA-GM-05-008, “High Accuracy Protein Structure Modeling”. Our approach integrates methods grounded in bioinformatics (Sali), physics (Jacobson and Dill), and applied mathematics (Coutsias). Close collaboration among these researchers, as well as with Dr. Shoichet (Aim 4), is facilitated by most of the researchers being located together at UCSF; our budget also allocates funds for Dr. Coutsias (U. New Mexico, Dept. of Mathematics and Statistics) to spend several months a year at UCSF. The tangible outcome of this research will be a set of freely available modular source codes and executable programs that implement the methods in Aims 1–3.

B. Background and Significance

Comparative structure prediction, in conjunction with the Protein Structure Initiative (PSI), has the potential to bridge the gap between the number of available protein sequences (>1 million) and structures (>20,000). The experience of the Sali group in constructing ModBase, a database currently containing over one million protein comparative models, highlights this potential. The fraction of sequences with comparative models for at least one domain is currently 57%.1 This number will continue to grow as the PSI enters its production phase. The New York Structural Genomix Research Consortium documented the number and quality of the comparative models that could be built based on their new structures. On average, about 100 protein sequences without any prior structural characterization could be modeled for each new structure.1 The accuracy of these models, however, varies significantly, with many accurately representing the overall tertiary structure, but relatively few (<10%, i.e., those with >50% sequence identity) expected to be as accurate as moderate resolution experimental structures (1–2 Å RMSD). While the CASP and CAFASP competitions have shown some measurable progress over time in the accuracy of comparative models2,3, they indicated little ability to refine comparative models to an accuracy better than the template protein. The research we propose addresses this critical bottleneck to high accuracy protein comparative modeling.

The three major sources of inaccuracy in comparative protein models are 1) incorrect choice of template protein, 2) inaccuracy in aligning the target sequence to the template, and 3) inability to routinely refine comparative models, i.e., to predict conformations of residues that do not align to the template, structural differences between the target and template proteins in aligned regions, and critical details such as side chain conformations. Improved methods of sequence alignment, fueled by the ever-growing databases of protein sequences4 and structures as well as algorithmic improvements, will help address the first two of these challenges. The research we propose focuses on the third of these challenges, model refinement, but will also contribute to identifying correct templates and alignments. Specifically, we propose methods for conformational sampling and scoring of comparative protein models, generated by any alignment and model building protocols. The new algorithms we develop will be capable of refining individual models to improve their accuracy, and of choosing the most accurate model among several generated from different templates and/or different alignments.

In short, we aim to improve the accuracy of comparative models by identifying the global free energy minimum of the protein sequence. This is a very challenging undertaking, despite the fact that the initial model(s) should be “close” to the global minimum, in the sense that at least the tertiary fold is correct, as long as the correct template is chosen. Success requires both adequate sampling and accurate scoring, and these two imperatives work against each other: more accurate scoring functions generally entail greater computational expense, reducing the amount of sampling that can be accomplished with fixed computer time. Our strategy is to develop methods that dramatically improve the efficiency of sampling, by a combination of 1) identifying which degrees of freedom are critical to sample, 2) developing new algorithms for making large moves along these degrees of freedom, and 3) using experimental data, if available, to help constrain the search space.
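The multiple temperature sampling (replica exchange) named in the first aim rests on a standard Metropolis test for swapping configurations between neighboring temperatures. The sketch below shows that test in isolation, under stated assumptions: energies are in kcal/mol, temperatures in kelvin, and the function names and the list-based bookkeeping are hypothetical rather than taken from any of the codes discussed here.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis acceptance probability for exchanging two replicas.

    p = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(config_at_temp, energies, temps, rng, offset=0):
    """Attempt swaps between neighboring temperatures (k, k+1), k = offset, offset+2, ...

    config_at_temp[k] -- index of the configuration currently held at temps[k]
    energies[c]       -- potential energy of configuration c
    Returns a new assignment list; the inputs are not modified.
    """
    result = list(config_at_temp)
    for k in range(offset, len(temps) - 1, 2):
        ci, cj = result[k], result[k + 1]
        p = swap_probability(energies[ci], temps[k], energies[cj], temps[k + 1])
        if rng.random() < p:
            result[k], result[k + 1] = cj, ci
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    temps = [300.0, 350.0, 400.0, 450.0]
    energies = [-95.0, -100.0, -80.0, -85.0]  # hypothetical energies, kcal/mol
    print(attempt_swaps([0, 1, 2, 3], energies, temps, rng))
```

A swap that moves the lower-energy configuration to the colder temperature is always accepted; the reverse is accepted with a probability that shrinks as the energy gap and the temperature spacing grow, which is why the temperature ladder has to be spaced closely enough for neighboring energy distributions to overlap.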

In Section B.1, we argue that existing energy models, particularly those that treat the protein at an atomic level of detail, are accurate enough to be useful in refining comparative models. Then in Section B.2, we review available methods for protein sampling, and outline our approach to improving sampling efficiency for comparative model refinement. In Section B.3 we discuss the role that experimental data can play in aiding comparative model refinement, and identify low-resolution data from X-ray crystallography as a neglected but potentially very useful source of data for this purpose. Finally, in Section B.4, we return to the potential impact of new methods for high-accuracy comparative modeling, highlighting the role that comparative models can play in structure-based inhibitor discovery, and the challenges that confront this goal.

B.1 Refinement of Comparative Models: Scoring

Several lines of evidence suggest that currently available all-atom energy functions, although they have limitations5, are capable of the accuracy required to achieve our goal (i.e., refining comparative models to 1–2 Å RMSD). We focus attention on scoring functions composed of all-atom force fields and implicit solvent models. These are used as the primary scoring functions in Aims 1 and 2 due to the attractive balance they provide between accuracy and computational efficiency. However, the sampling methods that we develop can be used with virtually any all-atom scoring function, whether physics- or knowledge-based. Studies that provide grounds for optimism that all-atom energy functions can provide the necessary accuracy include:

Decoy studies, where energy functions are evaluated by their ability to distinguish native from non-native protein structures6,7, including studies that have focused on force fields such as CHARMM8, OPLS9, and Amber10 in combination with Generalized Born solvent models. Feig and Brooks also demonstrated that the CHARMM force field with GB or PB implicit solvent performed well in identifying the most accurate models among those submitted to CASP411.

All-atom protein folding simulations. Although these results are limited to relatively small systems (<60 residues), they nonetheless represent a stringent test of energy functions, since they involve sampling a huge number of diverse protein conformations. For example, Dill et al. have used the AMBER/GB energy function to successfully fold three small proteins. In extensive replica exchange molecular dynamics (REMD) sampling of the conformational space of GB1, a 56-mer protein, the state of lowest free energy throughout that space was within 1.5 Å RMSD of the native structure. Other successful folding simulations with similar energy functions include several studies by Pande12-14.

Side chain and loop prediction. Numerous papers have tested sampling and scoring schemes for placing side chains on a native protein backbone, or reconstructing protein loops in a native protein. These tests provide an upper limit on the accuracy that could be expected for side chains and loops in more realistic homology modeling applications. Jacobson and co-workers, using the all-atom OPLS force field15-17 and a Generalized Born implicit solvent model18,19, demonstrated the highest accuracy yet reported for predicting the conformations of loops with 4–12 residues20; other high-accuracy results in the literature have used similar molecular mechanics energy functions21. In these tests, thousands of loop conformations are generated, and the energy function shows a robust ability to identify native-like conformations, as shown in Figure 1. In an extension of this work, Li et al. combined rigid sampling of helices with prediction of the adjacent loops, to permit partial refinement of tertiary packing22. In tests analogous to those employed to validate loop prediction, i.e., predicting the loop-helix-loop region keeping the rest of the protein fixed at the native conformation, the average RMSD for the helix regions was 0.9 Å over 36 test cases.
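The tests listed above share one success criterion: the conformation that the energy function ranks lowest should also lie close to the native structure in RMSD. A minimal sketch of that check follows, assuming the two coordinate sets have already been superimposed and list the same atoms in the same order; the function names, decoy labels, and numbers are hypothetical and serve only to illustrate the bookkeeping.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed coordinate sets.

    Each argument is a sequence of (x, y, z) tuples in angstroms, with atoms
    listed in the same order. No superposition is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def lowest_energy_decoy(decoys):
    """Return the (label, energy, rmsd_to_native) tuple with the lowest energy."""
    if not decoys:
        raise ValueError("decoy list is empty")
    return min(decoys, key=lambda decoy: decoy[1])


if __name__ == "__main__":
    # Hypothetical decoy set: (label, energy in kcal/mol, RMSD to native in angstroms).
    decoy_set = [("decoy_a", -50.0, 4.2), ("decoy_b", -80.0, 1.1), ("decoy_c", -60.0, 2.5)]
    label, energy, deviation = lowest_energy_decoy(decoy_set)
    print(label, energy, deviation)
```

In this toy set the lowest-energy decoy is also the one nearest the native structure, which is the outcome a useful scoring function should produce; in practice the coordinates would first be optimally superimposed before the RMSD is computed.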

Given that it is possible to identify accurate structures among decoys and reconstruct portions of proteins with high accuracy with currently available energy functions, why is it not possible to routinely refine homology models to comparable levels of accuracy? One critical bottleneck is sampling. A typical homology model requires refinement of several loops simultaneously, and may also exhibit distortions or incorrect lengths of secondary structure elements. Sampling of all of these degrees of freedom simultaneously represents a major challenge, but we argue in the next section that this challenge is not insurmountable.