Abstract

We present a method to reconstruct a three-dimensional protein structure from an atomic pair distribution function derived from the scattering intensity profile from SAXS data by flexibly fitting known x-ray structures. This method uses a linear combination of low-frequency normal modes from an elastic network description of the molecule in an iterative manner to deform the structure to conform optimally to the target pair distribution function derived from SAXS data. For computational efficiency, the protein and water molecules included in the protein first hydration shell are coarse-grained. In this paper, we demonstrate the validity of our coarse-graining approach to study SAXS data. Illustrative results of our flexible fitting studies on simulated SAXS data from five different proteins are presented.

Introduction

Dynamical conformational changes of biological molecules are critical for their function. X-ray crystallography has been used to study such conformational changes by providing the structures of distinct conformational states. However, it is often difficult to observe such conformational changes, as it is challenging to trap molecules in a certain conformational state that would crystallize. This phenomenon is particularly true for large biological molecules. Therefore, alternative techniques such as Cryo-Electron Microscopy (cryo-EM), Small Angle X-ray Scattering (SAXS) and Fluorescence Resonance Energy Transfer (FRET) experiments are often used to identify conformational changes of molecules and are becoming a primary tool to obtain information about the dynamics of biological systems.1–8

However, those techniques can only provide low-resolution data and atomic details of the structure cannot be directly obtained. A common approach to infer atomic detail about conformational changes is to construct an atomic model consistent with the low-resolution data, using known X-ray structures.9–17

In this paper, we specifically aim to develop a quantitative tool to extract conformational changes of biological systems from SAXS data. SAXS is a well-established and interesting fundamental tool to study the structure of molecules.18,19 Its strength arises from its simpleness, i.e. it does not require special sample preparation and is relatively easy to perform. SAXS can supply a rich perspective on the structure and dynamics of proteins and macromolecular assemblies under a range of solution conditions, in particular physiological conditions that are not accessible to other structural methods. However, SAXS data only provides the overall size of the molecule, thus reconstruction of 3D structures is inherently degenerate tough distinct conformational states of the same molecule can be identified if their difference is sufficiently large.

A substantial amount of work has been performed to reconstruct the shape of biological molecules from SAXS data using bead models.14,17,20–22 This shape information can be further complemented by X-ray data, where X-ray structures are rigidly docked into the envelope in a similar way as is actually done with cryo-EM data.23 While current programs provide overall information about the shape of the biological molecules, they cannot describe conformational changes of the molecules with atomic details. Here we are specifically interested in these conformational changes. For a more explicit interpretation of the conformational transitions observed by SAXS, we propose its integration with already known X-ray structures.

Such fitting of an X-ray structure into low-resolution data of a different conformational state is often done by domain segmentation followed by fitting each domain of the system as an independent rigid body block. These procedures have been useful for the understanding, at a near atomic level of detail, of several conformational changes in major biological systems as obtained from cryo-EM data or SAXS data.4,24–29 However, such procedures are subjective because while each domain may move as a rigid unit, large conformational changes often involve tightly coupled motions between domains of the biological system. When domains are fitted as independent rigid body blocks, those correlated motions are not reproduced. Thus, despite the success of these previous studies, the development of more advanced quantitative techniques with data reproducibility is necessary.

We have recently introduced a new algorithm to reconstruct an atomic level structure from a given atomic pair distribution function.30 This new algorithm takes advantage of an already known X-ray structure to predict a different conformation embedded in the atomic pair distribution function. Normal mode theory was used to construct a model by deforming the original X-ray structure to ensure that the new conformation is energetically accessible. The atomic pair distribution function is 1 dimensional (1D) data and it has far less information than 3D structural data obtained from techniques such as cryo-EM. Nevertheless, our protocol is successful in predicting structure from these low-resolution data.

In this paper, we present the extension of this methodology for fitting into experimental data derived from SAXS. Since SAXS experiments on biological molecules are performed in solution, the first layer of ordered water molecules around the protein is also captured in the data.31–33 Our approach takes into account the contribution from such a solvation layer into the fitting. In addition, we employ a coarse grained model for computational efficiency where the protein is represented by its Cα atoms and water molecules are treated as dummy atoms. We present illustrative results of our studies on five different proteins with large conformational changes occur and discuss the accuracy of the method as well as the limitations.

Theory

Intensity profile and pair distribution function

SAXS measures the intensity of X-ray scattered at small angles. Information derived from SAXS data is embedded into a 1D profile of the scattering intensity I(s) measured in the experiments. Here, s = 4πsin(θ)/λ is the momentum transfer as a function of the scattering angle 2θ and λ is the X-ray wavelength. The theoretical I(s) can be calculated from a 3D structure of a given macromolecule using coordinates and structure factors of atoms.31 Most studies to reconstruct low-resolution models from SAXS experiments use the scattering intensity as the target data. One can also relate the measured scattering intensity to the distribution of electron pair distances within the molecule, which can be directly calculated through a Fourier transform of the scattering intensity curve:18

p(r)=r2π2∫0∞I(q)qsin(qr)dq

Such P(r) represents a real space approximation of the data. Since it can be readily related to atomic structure, it is more useful when trying to incorporate SAXS and X-ray data. The pair distribution function P(r) derived from I(s) has been used to discuss conformational changes of biological molecules observed by SAXS experiments.28,34–37 We will also use P(r) for our scoring function to characterize conformational changes.

Modeling water molecules

We have shown that a simple interatomic pair distribution function, P(r), calculated directly from a protein structure is sufficient to predict conformational changes of biological molecules.30 However, experimental SAXS profiles include the effect of water layers surrounding the protein, while it is not included when P(r) is directly calculated from a protein structure. Therefore, as the next step toward using experimental data, a water shell around the full-atom initial structure is modeled using the program Solvate from Helmut Grubmüller’s group. In order to reproduce experimental data, only the first solvent shell of organized water molecules is considered.

Coarse-graining model

In some previous works, several coarse-grained models have been introduced to speed up calculations of theoretical intensity scattering patterns.14,21,32,38 One practical approach is to approximate a given molecular structure by a finite number of homogenous spheres of equal radii and scattering density. This approximation is valid within the resolution limit of SAXS.39 Therefore, in the present study only Cα atoms are taken into account. Such a coarse-grained model corresponds to about ~1/8 of the original number of non-hydrogen atoms, because on average each residue consists of 8 non-hydrogen atoms. Consequently, we randomly remove 7/8 of the water molecules that surround the protein so that the density of atoms of the protein and solvent shell are comparable.

Correspondence between coarse-grained model and experimental data

The P(r) derived from the experimental data reflects the all-atom structure of the protein surrounded by a layer of water. Therefore, in order to perform fitting using coarse-grained models, a normalization is necessary, i.e. the P(r) from the coarse-grained model needs to be scaled to match the experimental P(r).

The integral of P(r) is equal to N(N-1)/2, where N is the total number of atoms. Theoretically, while the shape of the P(r) is different between two protein conformations, their integrals should remain the same, assuming that the solvation is almost similar for both conformations. Therefore the target experimental data can be scaled so that its integral matches the integral of the initial coarse-grained model.

Pair distribution function

P(r) is approximated by evaluating all interatomic distances in the molecule, including surrounding water molecules that are represented as dummy atoms. The atomic pair distribution function P(r), denoted gk in the equations, is represented as a histogram:

gk=∑i<jN∫rk−1rkδ(r−rij)dr

which is the number of atomic pairs whose distances are between rk–1 and rk. The data is a discrete profile. With such a definition, the derivative of gk as a function of atomic coordinates cannot be calculated, which is disadvantageous for the optimization process. To make gk a continuous function of atomic coordinates, a Gaussian kernel is used. Replacing the delta function by a Gaussian, gk can be rewritten as follows:

gk=12πσ2∑i<jN∫rk−1rke−(r−rij)22σ2

(1)

where rij is the distance between atoms i and j, and σ is a parameter for the kernel width. gk is now analytically differentiable as a function of atom coordinates, which is suitable for calculating its derivatives. The parameter σ was calibrated to minimize the error between the exact distance distribution function and the one obtained using eq. 1 while maintaining a smooth profile of the distance distribution function. We have used a value of 2.0 Å throughout our studies.30

Scoring function

Several successful scoring functions have been used to optimize the shape of a molecule toward SAXS data using the scattering intensity.14,20,27,38,40 For computational efficiency, the scattering intensity can be replaced by the inter atomic distance distribution function.21,41–43

Here, the scoring function, f, which is used to measure the agreement between the modeled structure and the P(r) profile is the deviation between the two profiles including a weight function:30

f=∑kwk(gkmodel−gktar)2

where k is the index number of a bin in the profile,
gktar is the target data and
gkmodel is the P(r) calculated from a modeled structure. The weight function, wk, is introduced to reduce the sensitivity of small deviations at the large distribution regions.30

Methods

Normal mode analysis (NMA) using an elastic network model

NMA has proven to be useful for studies of collective motions of biological systems that may occur on time scales not easily accessible to conventional MD simulations. NMA consists of the decomposition of motions into vibrational modes. Each normal-mode vector represents a collective motion of the molecule. The modes with the lower frequencies represent preferential global motions.44

It has been shown that iterative deformations using small steps along a combination of normal modes is effective to provide transitions between two conformational states.45 Such a property has been exploited for refinement of low-resolution data.46 Thus, in this approach, we use the normal-mode coordinates associated with low-frequency normal modes as the optimization variables in place of Cartesian atomic coordinates. By limiting the deformation of the molecule to the low-frequency normal modes, we ensure that generated structural models are energetically accessible. In addition, by limiting the number of variables, i.e. normal-mode coordinates, to be optimized, we minimize the possibility of over-fitting. We have previously shown that using the 10 lowest frequency normal modes, conformational changes could be described within 2–3 Å RMSD.30

Normal modes were calculated using a simplified representation of the potential energy function, which was first introduced by Tirion.47 In this representation, the biological system is described as a three dimensional elastic network based on the equilibrium distribution of atoms. Particles or atoms are used to identify the junctions of the network. These junctions are representative of the mass distribution of the system and are connected together via a simple harmonic restoring force. Studies have shown that Cα carbons are sufficient to reproduce mechanical properties of the molecules44,48 and are successful in reproducing large and collective motions of macromolecules.49

In this study, the Cα atoms of the proteins and oxygen atoms of the water molecules were used to identify the junctions of the network. A cut-off of 8 Å was used. The strength of the potential is a phenomenological constant assumed to be the same for all interacting pairs.

Trusted region method (TRM)

The goal of the refinement process is to maximize the scoring function by following the direction of a few low-frequency normal modes. The optimization method used here was originally proposed to find transition states for chemical reactions and relies on the use of the Newton Raphson approach within a trusted region.50

In the TRM, one defines a radius within which the quadratic model equation is “trusted” to be a reasonable representation of the function (the simple Newton Raphson method trusts it for the entire space). Then, TRM tries to find the constrained minimum within the trusted region (Δ) using the analytical formula,

max{f=12qTHq+qTgsuchthat||q||<Δ}

where H = 2f /qlqm is the Hessian and g = {f /ql} the gradient. The amplitudes q = {ql} that maximize f within the trusted region, Δ, are obtained using the Lagrange multiplier method. If the original point is sufficiently close to a minimum, then often the minimum lies within the circle of the trusted region and convergence is quite rapid, whereas near the start of a search, the constrained minimum will most likely lie on the boundary. Once the constrained minimum is found, the structure is deformed and a new iteration begins. The process continues until convergence of f is observed. This algorithm is conceptually well suited to the iterative normal mode method, as in both we allow only limited movement. Previously, we have found that a step size of 0.1 Å and displacement along 2 modes was most appropriate in term of conserving structural quality and computational efficiency.30

Simulating SAXS target experimental data

To test our algorithm, we need to use “experimental” data as the fitting target but we also need to know the structure that corresponds to the data. In order to mimic experimental data we use the program suite Crysol/Gnom.31 Crysol takes into account details of both the protein itself and the water molecules surrounding it and generate very accurate scattering intensity profiles from known structures when compared to experimental data.51 In order to obtain the corresponding P(r), the theoretical SAXS intensity profile of the full-atom target structure obtained from Crysol is taken as input for Gnom which translates it into P(r) which will be used as target data for refinement.

Summary of procedures

The inputs for the refinement protocol are an experimental P(r) of a protein target structure and a high-resolution structure (x-ray or NMR) that the user wants to flexibly fit to the P(r) data. A layer of water molecules is added to the structure, which is further coarse-grained to its Cα atoms and 1/8 of the water molecules. The NMA is first performed on this initial coarse-grained structure and the TRM is used to find the optimal structural deformations. Details of the algorithm can be found in Gorba et al.30

Results and Discussion

Test proteins

We have tested our procedure on simulated SAXS data obtained from the program package Crysol/Gnom. We choose proteins (Table 1) that undergo a conformational change and where high-resolution structures of the two states are available.

Protein studied and refinement results with and without first solvent shell around proteins.

The purpose of the method presented here is to fit the X-ray structure to raw experimental (in this case simulated) SAXS data of a different conformation of the same molecule. To examine the fitness between the atomic structure and the P(r) from experimental data during the optimization, we examine the scoring function between the structure and the target SAXS data. In the following we also examine the Root Mean Square Deviation (RMSD) between the deformed structure and the structure from which the target P(r) was created to monitor the progress of the fitting. Ideally, when the SF reaches a maximum, the RMSD should reach a minimum. However, two different structures with different RMSD could have the same scoring function because SAXS data is 1D and detailed information of the atomic coordinates is lost.

Fit without water inclusion

In order to show that it is critical to take into account water molecules, refinements were first run using only a coarse-grained model consisting of the Cα atoms of the protein, while simulated experimental data obtained from Crysol had the effects of solvent included. The scattering intensity profile was converted to a P(r), which was used as our target data. In most cases, the decrease in the RMSD is relatively small or close to none (Table 1). Compared to our ideal case, in which no influence of water is considered in the target data,30 we can see that the prediction of the conformational changes is limited, and the influence of water molecules needs to be considered.

Solvation protocol

In the following we demonstrate that our approach to include the effects of the solvent shell on P(r) for biological molecules is adequate for the fitting.

The all-atom structure of adenylate kinase (4ake) is first surrounded by several layers of water molecules. Subsequently, only water molecules that are within a certain cut-off distance of any protein atom are kept to represent the first layer of ordered water molecules that is captured by SAXS data (Fig. 1a). After testing several cut-off values, we found that a 3.5 Å cut-off produces a P(r) in good agreement with a simulated experimental one obtained from crysol/gnom31 (Fig. 1b).

Coarse-graining process and scaling for Adenylate Kinase. (a) Adenlynate Kinase surrounded with several layers of water molecules (colored in gray). Only water molecules shown in light blue are considered to estimate the Intensity Scattering profile and...

Our refinement uses a coarse-grained approach. Figure 1c shows adelynate kinase (4ake), which is our initial structure that needs to be fitted to a target-simulated experimental data set generated from a different conformation (1ake). The first layer of water molecules needs to be coarse-grained, as only the Cα atoms of the protein are considered. To attain a level of coarse-graining comparable to the protein, only 1/8 of the total number of water molecules are kept for the fitting. Our procedure randomly picks up water molecules, and as is evident from Figure 1c, such water molecules uniformly surround the protein. From this coarse-grained model, i.e. protein Cα model and 1/8 of the first layer of water molecule, the interatomic pair distribution function of our initial structure, Pini(r), is calculated using Eq 1. The target data, Pexp(r) for adenylate kinase (1ake), as derived from the simulated experimental scattering intensity profile, is shown in Figure 1d. For the purpose of comparison, the P(r) from the initial structure (4ake) as obtained from Crysol/Gnom is also shown.

The magnitude of Pini(r) and Pexp(r) is different since one is obtained from a coarse-grained model and the other one is derived from an all-atom model. Our experimental Pexp(r), as simulated using Crysol/Gnom, is scaled so that its integral matches the one observed for Pini(r) (Fig. 1e). The two P(r) are now on the same scale and can be used for estimating the scoring function as well as its derivative. As seen from Figure 1d and ​and1e,1e, our scaling approach conserves the difference observed between the two P(r) as obtained from Crysol/Gnom. Therefore, this allows us to flexibly fit structures using a coarse-grained model which is computationally more efficient than using an all atom model.

Refinement process

We illustrate some of our results using the LAO binding protein, as we have previously shown that the simulation of its conformational change by iterative NMA was most successful.30 The refinement was performed using parameters that were previously found optimal, i.e. two normal modes for structure displacement with a radius of 0.1 Å for the TRM.30

Figure 2a shows the behavior of the scoring function as a function of iteration step. The scoring function converges slowly toward 0. However, after the RMSD value decreases to a minimum, it starts to increase again (Fig. 2b) even though the scoring function does not improve much. This indicates that the data is over fitted. Due to the low-resolution nature of the SAXS data, the scoring function may continue to increase while additional displacements do not necessarily decrease the RMSD, i.e. a continued increase of the scoring function does not always improve the quality of the model. Therefore, one needs to define a more stringent parameter than the convergence of the scoring function in order to decide whether the refinement has found a reasonable model, or if the additional displacements observed are only artifacts.

(a) Progress of fitting process measured by fitness scoring function for the LAO binding protein. (b) Progress of RMSD and relative change in the scoring function over the fitting process. The progress of the RMSD is shown in black and the relative change...

Instead of using the convergence of the scoring function as a stop criterion, we monitor the slope or the relative change of the scoring function between two consecutive steps of the refinement procedure. The minimum of the RMSD approximately coincides with the value of the slope approaching zero (Fig. 2b). Thus we use the slope as the stop criterion for refinement, which can be easily applied to real-case refinement in the future.

For the LAO binding protein, the RMSD decreases from 4.7 Å to 2 Å for the best fit. Figure 3a shows the P(r) for the initial, target and best predicted structure. Although the difference in P(r) between initial and target is rather small, i.e. information for refinement is limited, the agreement between the target P(r) and the one for the best predicted structure is good. Superposition of the three structures also reveals good agreement between the modeled structure and the target (Fig. 3b). In particular the opening/closing of the two domains is well captured.

Conformational change of the LAO binding protein obtained from the normal mode flexible fitting. (a) The P(r) of initial (), predicted (gray x) and target (black +) structures are shown. The P(r) of the predicted structure is in close agreement...

Figure 3c and ​and3d3d show the behavior of surrounding water molecules during the fitting. The water molecules followed the protein deformations and remain in the vicinity of the protein in the modeled structure. With this method, conformational dynamics of water molecules is not represented, i.e. the water molecules only follow the protein conformational change but cannot rearrange around the protein as such motion cannot be described by low-frequency normal modes. However, we believe that dynamics of water molecules would not significantly alter the overall shape of the protein and its first hydration layer. Such structured water molecules only render the protein surface and volume larger. Therefore, as an approximation, the conformational dynamics of water can be ignored.

Tests for others proteins

Using the relative change of the scoring function as the convergence parameter, additional fitting processes were performed on several proteins (see Table 1). Except for Lactoferrin, the refinement shows a significant decrease in RMSD. Even when large initial conformational differences are present, such as in the case of the Elongation Factor 2 (~15 Å), the RMSD can be decreased considerably.

Figure 4 shows two different conformations of the Elongation Factor along with the modeled structure (Fig. 4c). In the two known X-ray structures, domains 2 and 5 have different positions. In the modeled structure, domains 2 and 5 are close to each other, which is comparable to their position in the target structure. Although our method only predicts the structure within ~5 Å RMSD (see Table 1), most of the conformational change is accurately described by correctly predicting major rearrangements between domains.

(a) The initial structure of EF2, (b) the target structure from which the simulated experimental data is created and (c) the modeled structure predicted from the fitting algorithm. Domain 2 has a different placement in target and modeled structure compared...

Overall, the performance of the prediction depends on the difference between initial and target P(r). Lactoferrin, is a case for which the fitting was unsuccessful. For Lactoferrin, the maxima of both P(r)s are approximately the same and only slight differences in the shape of the P(r) are noticeable (see Figure 5). Consequently, for Lactoferrin as opposed to Adenylate Kinase, for which the initial RMSD is also ~7 Å, the information provided by the target P(r) as compared to the initial P(r) is so small that the refinement fails. Normal modes that describe the conformational change are not selected, instead, random modes are chosen leading to a faulty model. In the case of the Maltodextrin binding protein, the difference between initial and target structures measured by RMSD is rather small. In such a case, the difference in the P(r) between the two conformations is also rather minimal, and therefore one would expect that only limited information can be extracted from the refinement. Nonetheless, as illustrated by Figure 6 the overall domain motion of the protein is well captured by our approach. The direction of the displacement obtained is accurate; however, there is not sufficient information from the low-resolution data to fully reproduce the full conformational change. It indicates that even though our approach may fail to construct a model close to the final structure, insights on the directionality of conformational changes can be reliably obtained.

P(r) of the initial (1lfg) and target (1lfh) for Lactoferrin are shown. For this system, only small difference is observed between the initial and target P(r) even tough the conformational difference is large, 7 Å RMSD

The initial structure of the maltodextrin binding protein (blue), the structure from which the simulated experimental data is created (silver), and the modeled structure predicted from the fitting algorithm (red) are shown. For this system, since the...

Advantages over direct fitting methods

The interpretation of conformational changes from SAXS data has been usually performed by domain segmentation, which is followed by fitting each domain as an independent rigid body block in order to match the P(r) or scattering intensity profile.4,27–29 Such procedures are subjective because while each domain may move as a rigid unit, large conformational changes often involve tightly coupled motions between domains of the biological system. For example a hinge bending motion is defined as a relative dislocation of one domain with respect to another. These properties are poorly reproduced when each domain is independently fitted as a rigid block, which can lead to misinterpretation of the data.52 On the other hand, it has been shown that normal modes can reproduce conformational changes of molecules that have been observed experimentally. With such quantitative methods, concerted motions are naturally taken into account, which reduces error in interpreting low resolution data.

Conclusions

We have presented a new method to predict conformational changes of biological molecules as observed from SAXS experiments. This method takes advantage of an already known X-ray structure to predict a different conformation embedded in the pair distribution function derived from SAXS experiments. Normal mode theory is used to construct a model by deforming an original X-ray structure to ensure that the new conformation is energetically accessible. As a consequence, concerted motions occurring during conformational changes are naturally taken into account to provide a more accurate description than rigid-body approaches. Water molecules surrounding proteins are included in the model to accurately represent the SAXS experimental data. The protein and water molecules are coarse-grained to speed up the calculations. Our approach allows prediction of the directions of conformational changes and the construction of structures within 3 Å of the target data when sufficiently large difference in the P(r) are observed. In terms of accuracy, our method is close to what can be observed for structures obtained from homology modeling. In cases where two conformations cannot be well distinguished by their SAXS profiles, characteristics of the conformational change can still be extracted from our approach.

Acknowledgments

We thank Dr. Osamu Miyashita for helpful discussions about this project. Financial support from the National Science Foundation grant 0744732 (Molecular Cellular and Biosciences) is greatly appreciated.

Footnotes

Disclosures

This manuscript has been read and approved by all authors. This paper is unique and not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.