Abstract

Chemical cross-links identified by mass spectrometry generate distance restraints that reveal low-resolution structural information on proteins and protein complexes. The technology to reliably generate such data has become mature and robust enough to shift the focus to the question of how these distance restraints can be best integrated into molecular modeling calculations. Here, we introduce three workflows for incorporating distance restraints generated by chemical cross-linking and mass spectrometry into ROSETTA protocols for comparative and de novo modeling and protein-protein docking. We demonstrate that the cross-link validation and visualization software Xwalk facilitates successful cross-link data integration. Besides the protocols we introduce XLdb, a database of chemical cross-links from 14 different publications with 506 intra-protein and 62 inter-protein cross-links, where each cross-link can be mapped on an experimental structure from the Protein Data Bank. Finally, we demonstrate on a protein-protein docking reference data set the impact of virtual cross-links on protein docking calculations and show that an inter-protein cross-link can reduce on average the RMSD of a docking prediction by 5.0 Å. The methods and results presented here provide guidelines for the effective integration of chemical cross-link data in molecular modeling calculations and should advance the structural analysis of particularly large and transient protein complexes via hybrid structural biology methods.

Funding: This project was funded in part by ETH Zurich, the Commission of the European Communities through the PROSPECTS consortium (EU FP7 projects 201648, 233226) and by SystemsX.ch – The Swiss Initiative for Systems Biology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Conventional structural biology techniques like X-ray crystallography or Nuclear Magnetic Resonance (NMR) spectroscopy have solved the structures of a large number of macromolecular complexes [1]. High-resolution data from those techniques provide detailed insights into the working principles of proteins and their malfunction [2] and support drug discovery projects [3]. However, the requirement of these techniques for relatively large amounts of pure and highly concentrated protein samples has biased structural elucidation towards monomeric proteins and homomeric protein complexes. The PISA database [4] lists around 13,600 heteromeric protein complexes compared to around 62,000 mono- and homomeric proteins as of June 2013. Structural data on large protein complexes or transient interactions in cell signaling processes have therefore remained rather elusive [5].

Chemical cross-linking in combination with mass spectrometric analysis (XL-MS) has emerged as a viable tool for probing the structure of many protein complexes without such bias. Although XL-MS only provides comparatively low-resolution data [6], [7], it is highly complementary to conventional methods, has less stringent sample purity requirements and few restrictions concerning complex size. Recent applications of XL-MS have elucidated the structure and topology of RNA polymerases [8], [9], proteasomes [10], [11] and the chaperonins GroEL [12] and TRiC [13], [14]. Most often, homobifunctional cross-linking reagents, for example amine-reactive succinimide esters, are used in XL-MS studies. They react predominantly with primary amine groups on lysine side chains and N-termini [15]. The read-out of such an experiment is a list of modified sites, which are obtained from the analysis of fragment ion spectra generated by the mass spectrometric analysis of cross-linked peptides. The peptides, in turn, are generated by the enzymatic digestion of a cross-linked protein complex. Reaction products can be broadly classified into mono-links, i.e. single peptides where only one end of the cross-linker has reacted, and different types of cross-linked peptides. Of particular interest are intra-protein and inter-protein cross-links that originate from connecting two peptide chains from a single or two different polypeptides, respectively, by the cross-linking reagent. Note that different nomenclatures have been proposed in the literature to describe the various types of products of cross-linking reactions [16], [17]. Recent advances in the protocols to generate and process cross-linked samples [18], [19], the mass spectrometric methods to generate fragment ion spectra of cross-linked peptides [20] and the development of software tools for the identification of the cross-linked peptides [21], [22] have contributed to the increasing maturity and robustness of the XL-MS technology.

The distance restraints generated by XL-MS can be used to guide molecular modeling simulations towards native-like conformations. However, two important points need to be addressed prior to the incorporation of XL-MS distances. First, cross-linker molecules are flexible and can covalently link lysine residues over a large range of inter-residue distances [23]. In the current work, all calculations are based on data using disuccinimidyl suberate (DSS) as a reagent. DSS has a spacer length of approximately 11.4 Å, but was experimentally found to bridge lysine residues that are 30.0 Å or more apart (calculated as the Euclidean distance between Cα protein backbone carbon atoms). This distance restraint takes into account the length of two extended lysine side chains (∼5.5 Å each) and some conformational flexibility of the protein complex [6]. Second, cross-linker molecules can be assumed not to penetrate the protein surface and to be located on solvent accessible surface patches [24]. When cross-links are simulated as distance restraints in modeling calculations, the linear Euclidean distance measure becomes inappropriate, as the straight line between two residues can pass through the protein. Both issues can be addressed either by explicitly modeling the cross-linker molecule [25] or by implementing a non-linear distance measure [24]. We previously introduced the Xwalk (“Crosswalk”) algorithm [26] to calculate the shortest path between two cross-linked amino acids, where the path must not penetrate the protein surface and must lead only through solvent occupied space. The algorithm is based on a cubic grid around the cross-linked amino acids, a distance calculator that fills the grid cells with distances following a breadth-first search algorithm, and a trace-back method that selects the shortest path through the grid between cross-linked amino acids. The length of the shortest path is a distance measure that we termed Solvent Accessible Surface (SAS) distance, which represents a more reasonable measure of cross-link distances in modeling calculations.
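The grid-based idea behind the SAS distance can be illustrated with a minimal sketch. It is not the Xwalk implementation: the boolean grid layout, the six-cell neighbourhood, the `spacing` parameter and the function name `shortest_solvent_path` are assumptions made for illustration only. The sketch fills a cubic grid by breadth-first search through solvent cells and then traces the path back from the goal cell.

```python
from collections import deque

# Six face-sharing neighbours of a grid cell (an assumption of this sketch).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def shortest_solvent_path(solvent, start, goal, spacing=1.0):
    """Breadth-first search through solvent cells of a cubic grid.

    solvent[x][y][z] is True for solvent cells and False for cells
    occupied by protein. Returns (path_length, path) or None if the
    goal cannot be reached without crossing protein cells.
    """
    nx, ny, nz = len(solvent), len(solvent[0]), len(solvent[0][0])
    previous = {start: None}          # predecessor of every visited cell
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for dx, dy, dz in NEIGHBOURS:
            x, y, z = cell[0] + dx, cell[1] + dy, cell[2] + dz
            inside = 0 <= x < nx and 0 <= y < ny and 0 <= z < nz
            if inside and solvent[x][y][z] and (x, y, z) not in previous:
                previous[(x, y, z)] = cell
                queue.append((x, y, z))
    if goal not in previous:
        return None
    # Trace back from the goal to the start along the stored predecessors.
    path = [goal]
    while previous[path[-1]] is not None:
        path.append(previous[path[-1]])
    path.reverse()
    return (len(path) - 1) * spacing, path
```

On a 3 × 3 × 1 grid in which the cells (1, 0, 0) and (1, 1, 0) are occupied by protein, the path from (0, 0, 0) to (2, 0, 0) has to go around the obstacle and takes six grid steps, whereas the straight line between the two cells spans only two grid units.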

We recently applied the Xwalk algorithm in combination with the ROSETTA molecular modeling suite to affinity purified protein complexes from the Protein Phosphatase 2A (PP2A) network [6]. The modeling calculations were guided by 176 inter-protein cross-links and 570 intra-protein cross-links. Within this study we were able to verify comparative models of all PP2A subunits, predict the location of an intrinsically disordered C-terminal domain of the PP2A interactor IgBP1, define the binding interface between IgBP1 and the catalytic phosphatase subunit, and determine the topology of the regulatory subunit 2ABG in complex with the TCP1 Ring Complex (TRiC) chaperonin.

Here we describe in detail three computational modeling workflows that we applied in the above study [6] as a hybrid structural biology method for de novo prediction of protein structures, comparative modeling of proteins and protein-protein docking. A demo version of the docking protocol is available in the ROSETTA protocol capture archive: XL_guided_protein_docking/run_demo.sh. We also introduce a database with literature-curated cross-links, which we exploited for computing probability distributions of cross-link distances. And finally, we demonstrate in a systematic study the association between the accuracy of a cross-link guided protein docking calculation and the number of employed inter-protein cross-links.

Methods

All modeling calculations were performed using the ROSETTA molecular modeling suite. Protocols applied within the workflows are available in the public release of ROSETTA starting from version 3.1, with the exception of the protocol nonlocal, which was downloaded from ROSETTA’s trunk at revision 42791. The reader is kindly referred to the help pages of the individual software programs for a detailed explanation of the employed application parameters. Parameter arguments shown in curly brackets require case specific input.

All workflows have in common that they utilize XL-MS data to guide modeling calculations towards native-like conformations with the assumption that all identified cross-links were formed on native structures. The guidance was achieved by two means. First, the cross-link distance restraints were incorporated into the ROSETTA scoring function, where they penalize models with cross-link distances above the distance threshold of 30.0 Å, which leads to a preferential sampling of the conformational space around the native conformation. Second, XL-MS distances were applied as post-modeling filters in the candidate selection stage, where they remove models that violate cross-link data.

Comparative Modeling

The structure of a target protein can be predicted with comparative modeling, if experimental X-ray crystallography or NMR structures of a homologous template protein exist. Out of the 94 proteins that were purified from the PP2A interaction network [6], only 8 (2AAA_HUMAN, 2A5G_HUMAN, 2ABA_HUMAN, PP2AA_HUMAN, SGOL1_HUMAN, SET_HUMAN, MST4_HUMAN, DYL1_HUMAN) had partial or complete structural information from X-ray crystallography experiments. For a subset of 15 proteins comprising all remaining PP2A core subunits and the eight TRiC chaperonin proteins, high quality comparative models were generated using various homologous template structures (see Table 1) and the following workflow (see also Figure 1A). The typical execution time of the workflow is about 5 CPU days per protein.

1.3 Extract the sequence alignment between target and template protein from the HHpred output. Note that the template sequence in HHpred corresponds to the SEQRES sequence and not to the ATOM coordinate sequence. In cases in which the template structure is predetermined and has a high sequence identity to the target protein, as was the case for the PP2A and TRiC proteins, one can also use the global sequence alignment application needle from the EMBOSS package v6.2.0 [28] with default command-line flags.

where the 1st column is an incremental index, the 2nd column is the PDB file name and the 3rd and 4th columns are dash delimited PDB information about first and second cross-linked atom, respectively. The PDB information should list the residue name, residue number, chain ID and atom name. Mono-links are described with the first three columns only.

1.9 Calculate the SAS distance for each intra-protein cross-link using Xwalk [26] (http://www.xwalk.org) on each model with the following command

1.10 Choose models that satisfy the largest number of intra-protein cross-links and mono-links.

1.11 If multiple models satisfy the same number of cross-links and mono-links, or if none of the models satisfy any cross-link, select the model with the lowest Root Mean Square Deviation (RMSD) to the template structure as the best model.
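The cross-link list format and the selection rule of steps 1.10 and 1.11 can be sketched as follows. This is an illustrative sketch only: whitespace-separated columns, the names `Atom`, `parse_crosslink_line` and `select_best_model`, and the tuple layout of the model list are assumptions, and the parser assumes non-negative residue numbers and non-empty chain IDs so that the dash-delimited fields split cleanly.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    residue_name: str
    residue_number: int
    chain_id: str
    atom_name: str

def parse_atom(field):
    """Parse a dash-delimited field such as 'LYS-12-A-CB'."""
    name, number, chain, atom = field.split("-")
    return Atom(name, int(number), chain, atom)

def parse_crosslink_line(line):
    """Return (index, pdb_file, first_atom, second_atom).

    second_atom is None for mono-links, which use only three columns.
    Columns are assumed to be separated by whitespace.
    """
    cols = line.split()
    second = parse_atom(cols[3]) if len(cols) > 3 else None
    return int(cols[0]), cols[1], parse_atom(cols[2]), second

def select_best_model(models):
    """models: list of (name, n_satisfied_links, rmsd_to_template).

    Prefer the largest number of satisfied cross-links and mono-links;
    break ties (including the case of zero satisfied links) by the
    lowest RMSD to the template structure.
    """
    return min(models, key=lambda m: (-m[1], m[2]))
```

Sorting on the pair (negative count, RMSD) covers both branches of step 1.11 with a single key: if no model satisfies any cross-link, all counts are zero and the RMSD alone decides.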

Alternatively, distance information from XL-MS data can also be exploited as distance restraints within the scoring function (see next section), which for the PP2A project was omitted to keep the modeling unbiased for validation purposes.

De Novo Modeling

If experimental structures are missing for a target protein and its homologs, the structure of the target protein can be predicted from its primary amino acid sequence using de novo modeling. The human IgBP1 protein had no crystal structure available in the Protein Data Bank (PDB) [31]. Structural coordinates existed only for the N-terminal domain from a mouse homolog (PDB-ID: 3QC1). To generate a full-length model of IgBP1, we applied the following de novo modeling workflow (see also Figure 1B), which required about 14 CPU years of computation.

The flat harmonic function guarantees that models are penalized only if the Euclidean distance between two cross-linked atoms exceeds 30.0 Å. The function takes the Euclidean distance dist and three parameters that for a DSS cross-link were heuristically chosen to be x0 = 15.0, tolerance = 15.0 and standard deviation σ = 1.0:

2.3 Run ROSETTA’s nonlocal application with the input files described above and the following command line flags:

2.10 Pick the lowest scoring model from the largest cluster as the best model. If the clustering should retrieve only singletons, pick models that have the smallest RMSD value compared to the template structure as the best models.
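The behavior of the flat harmonic restraint described in the workflow above can be illustrated with a short sketch. The functional form below is a common flat-bottomed harmonic penalty and is an assumption of this sketch; the exact expression used by ROSETTA is not reproduced in this section. With the parameters given for DSS (x0 = 15.0, tolerance = 15.0, σ = 1.0), the penalty is zero for distances up to 30.0 Å and grows quadratically beyond.

```python
def flat_harmonic(dist, x0=15.0, tolerance=15.0, sd=1.0):
    """Flat-bottomed harmonic penalty (assumed form, for illustration).

    Zero while |dist - x0| <= tolerance; otherwise the squared excess
    distance, scaled by the standard deviation sd.
    """
    excess = abs(dist - x0) - tolerance
    if excess <= 0.0:
        return 0.0
    return (excess / sd) ** 2
```

Under these assumptions a cross-link spanning 32.0 Å receives a penalty of 4.0, while any distance between 0 and 30.0 Å is not penalized at all.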

Protein-protein Docking

The catalytic subunits of PP2A were docked against best full-length models of IgBP1 (see previous section) using the following workflow (see also Figure 1C). The workflow requires a computation time of around 100 CPU days.

3.1.1 Prior to docking calculations, crystal structures from the PDB (e.g. PP2AA) need to be relaxed. For relaxation, run ROSETTA’s relax protocol with following command line flags:

3.1.6 Assess the number of cross-links and mono-links satisfied by each model. To speed up calculations, apply first Xwalk’s Euclidean distance measure and subsequently the SAS distance measure (see step 1.9) on each model.

3.1.7 Select models satisfying the highest number of cross-links. Should their number be higher than 500 or should none of the models satisfy any cross-link, select those 500 models with the lowest ROSETTA energy score.

3.1.8 Analyze the binding interface size of the selected models, which is equivalent to the buried surface area (BSA):

3.1.9 Retain only those models that have a sufficiently large binding interface, i.e. BSA(complex) ≥900 Å² [34].

3.1.10 Choose a large number of models with the shortest mean SAS distance over all cross-links. (In the case of the IgBP1-PP2AA, we selected the 300 models with the shortest mean SAS distance). The shortest mean distance leads to a preference for models with overall shorter cross-link distances that in most cases show distance ranges similar to Figure S1.

3.1.11 Calculate an all-against-all Cα coordinate RMSD matrix among the top models. Note that the RMSD should be computed only on the smaller protein (ligand) and not on the larger structure (receptor), as the latter remains fixed during the docking calculations and has an RMSD of 0 among all selected models. The RMSD among the smaller proteins is also known as the ligand RMSD (L-RMSD).

3.1.14 Pick the lowest scoring model from the largest clusters as best models.
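The filter cascade of steps 3.1.7 to 3.1.10 can be sketched as follows. The data class `DockingModel`, the function name `filter_models` and the 34.0 Å cutoff used to count a cross-link as satisfied are assumptions of this sketch; the cap of 500 models, the 900 Å² interface threshold and the 300 retained models are taken from the steps above. When more than 500 models share the highest cross-link count, the sketch keeps the 500 lowest scoring among them, which is one possible reading of step 3.1.7.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class DockingModel:
    name: str
    score: float          # ROSETTA energy score, lower is better
    sas_distances: list   # one SAS distance (in Å) per cross-link
    bsa: float            # buried surface area of the complex in Å^2

def filter_models(models, max_dist=34.0, cap=500, min_bsa=900.0, keep=300):
    """Apply the cross-link, energy, interface and mean-distance filters."""
    def satisfied(m):
        return sum(d <= max_dist for d in m.sas_distances)

    # Step 3.1.7: models satisfying the highest number of cross-links.
    best = max(satisfied(m) for m in models)
    selected = [m for m in models if satisfied(m) == best]
    # Fall back to the lowest-energy models if too many remain or none
    # of the models satisfies any cross-link.
    if best == 0 or len(selected) > cap:
        selected = sorted(selected, key=lambda m: m.score)[:cap]
    # Step 3.1.9: keep only sufficiently large binding interfaces.
    selected = [m for m in selected if m.bsa >= min_bsa]
    # Step 3.1.10: prefer the shortest mean SAS distance over all cross-links.
    selected.sort(key=lambda m: mean(m.sas_distances))
    return selected[:keep]
```

The sketch expects a non-empty model list in which every model carries at least one SAS distance.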

We would like to emphasize that the workflow described above is only one of two strategies for cross-link guided protein docking. It was developed to highlight the impact of XL-MS data on docking calculations and to visualize the set of conformations that satisfy a large amount of distance information from XL-MS. For a biophysically more meaningful prediction, the workflow above can alternatively be extended after step 3.1.6 with a high-resolution refinement docking stage as follows:

3.2.1 Run steps 3.1.1–3.1.6.

3.2.2 Select models that satisfy most cross-links. Should their number be higher than 500 or should none of the models satisfy any cross-link, select those 500 models with the lowest ROSETTA energy score.

3.2.4 Apply Quality Threshold (QT) clustering on models passing the interface size filter. A QT cluster is defined by the translational and rotational distances between a reference and referred ligand structures, where the distance must be smaller than 3 Å and 8°, respectively. The distance thresholds correspond to translational and rotational perturbations that will be applied during local-refinement docking, and thus, in contrast to RMSD-based clustering, will allow a seamless transition between global and local docking calculations. Rotational and translational distances can be calculated with the Superimpose application (http://cleftxplorer.googlecode.com/files/cX.zip), which is part of the CleftXplorer software package [35], [36]:

3.2.5 Complete the QT clustering by searching within the output file for the model that has the highest number of similar models (i.e. largest number of 0 in a row), declare it as the cluster representative and remove it and all its cluster members from the dissimilarity matrix. If two models have the same number of similar models, then prefer the one with the lower ROSETTA score. Repeat this step two additional times to obtain the cluster representatives of the largest three clusters from the global docking calculations.
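The greedy selection of cluster representatives in step 3.2.5 can be sketched as follows. The sketch assumes that the dissimilarity matrix is available as a nested list in which 0 marks a pair of similar models and any other value a dissimilar pair; the function name `qt_cluster_representatives` and the return format are hypothetical, and the output format of the Superimpose application is not reproduced here.

```python
def qt_cluster_representatives(dissimilar, scores, n_clusters=3):
    """Greedy cluster extraction from a 0/1 dissimilarity matrix.

    dissimilar[i][j] == 0 means models i and j are similar.
    scores[i] is the ROSETTA score of model i (lower is better).
    Returns a list of (representative, sorted_members) tuples.
    """
    remaining = set(range(len(scores)))
    clusters = []
    while remaining and len(clusters) < n_clusters:
        # Members a model would gather among the models still available.
        members = {
            i: {j for j in remaining if dissimilar[i][j] == 0} | {i}
            for i in remaining
        }
        # Largest neighbourhood first; ties go to the lower score.
        rep = min(remaining, key=lambda i: (-len(members[i]), scores[i], i))
        clusters.append((rep, sorted(members[rep])))
        # Remove the representative and all of its cluster members.
        remaining -= members[rep]
    return clusters
```

Each representative is always counted as a member of its own cluster, so the loop terminates even if the diagonal of the matrix is not zero.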

Reference Data Sets for the Modeling Workflows

Comparative and de novo modeling.

We chose 6 proteins from a recent work of Leitner and co-workers [18] as a reference data set for our comparative modeling workflow. Each protein in the data set had experimental structural data in the form of an X-ray structure, a list of experimental chemical cross-links and a homologous experimental template structure. The template structures were selected such that they covered sequence identities between 50% and 90% to the native structures (see Table 2), similar to the sequence identities in the PP2A network (see Table 1).

One of the reference proteins, the rabbit pyruvate kinase (UniProt Entry name: KPYM_RABIT) was also chosen as a reference for the de novo modeling workflow. Our choice for KPYM_RABIT was due to its multidomain structure. According to the Pfam database [38], the pyruvate kinase consists of an N-terminal (amino acid positions 21–100) and C-terminal (amino acid positions 120–367) ATP:guanido phosphotransferase domain. We decided to model the former domain without any template structure and the latter domain with a creatine kinase template structure from chicken (PDB-ID: 1qh4). The sequence of the pyruvate kinase was shortened by 20 and 14 amino acids from the N- and C-terminus, leading to a final sequence length of 347 amino acids, which was comparable to the 339 amino acids in IgBP1. The simulation was supported with 14 experimental intra-protein cross-links (see Table S1).

Protein-protein docking.

Protein complexes from the protein-protein docking benchmark dataset version 4.0 from the Weng lab [39] were used to compare the performance of ab initio docking and cross-link guided protein-protein docking. This benchmark data set features high-resolution protein complexes that are non-homologous and for which X-ray or NMR models in unbound conformations of the complex constituents exist. As docking methods struggle most with proteins that undergo large conformational changes upon association, we focused our performance analysis only on 16 binary protein complexes from the “medium difficult” and “difficult” category (see Table 3). Protein complexes in both categories show medium to large conformational changes at their interface with Interface Root Mean Square Deviations (I-RMSD) >1.5 and >2.2 Å, respectively, where the I-RMSD corresponds to the Cα RMSD of the interface residues after the unbound forms of the protein structures were superimposed on the bound form of the proteins [39].

To assess the impact of the number of cross-links on docking calculations, predicted (virtual) inter-protein cross-link distances were calculated on the bound conformation of all 16 protein complexes using the Xwalk application. A virtual cross-link was assumed to form between a pair of lysine residues if the residue pair had an SAS distance ≤34.0 Å. The 34.0 Å threshold corresponds to the distance that around 80% of DSS and BS3 cross-links exhibit in a number of published XL-MS experiments (see section below). Among all virtual cross-links (see Table 3), one to seven were randomly chosen among 5 distance bins, namely 0–10 Å, 10–15 Å, 15–20 Å, 20–25 Å and 25–34 Å. The probability of choosing an inter-protein cross-link from a particular distance bin was predetermined by a probability distribution that was calculated on a large number of intra- and inter-protein cross-links from our new database, XLdb (see section below and Figure S2). The probabilities corresponded to 9%, 18%, 34%, 22% and 16% for the aforementioned distance bins. Given various lists of virtual cross-links for each of the 16 protein complexes, we next performed cross-link guided docking calculations to assess the impact of cross-links on the docking predictions.
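The bin-weighted random selection of virtual cross-links can be sketched as follows. The bin edges and probabilities are those given above; treating the upper bin edges as inclusive, drawing without replacement and redistributing the weights over the bins that still contain cross-links are assumptions of this sketch, as are the function names.

```python
import random

UPPER_EDGES = [10.0, 15.0, 20.0, 25.0, 34.0]   # upper bin edges in Å
WEIGHTS = [9, 18, 34, 22, 16]                  # bin probabilities in percent

def bin_index(distance):
    """Index of the distance bin; upper edges are treated as inclusive."""
    for i, upper in enumerate(UPPER_EDGES):
        if distance <= upper:
            return i
    raise ValueError("distance exceeds the 34.0 Å threshold")

def pick_virtual_crosslinks(crosslinks, n, rng=random):
    """Draw up to n distinct cross-links from (label, sas_distance) pairs."""
    pool = list(crosslinks)
    chosen = []
    while pool and len(chosen) < n:
        # Only bins that still contain cross-links can be drawn.
        occupied = sorted({bin_index(d) for _, d in pool})
        weights = [WEIGHTS[i] for i in occupied]
        picked_bin = rng.choices(occupied, weights=weights)[0]
        candidates = [xl for xl in pool if bin_index(xl[1]) == picked_bin]
        pick = rng.choice(candidates)
        pool.remove(pick)
        chosen.append(pick)
    return chosen
```

Passing a seeded `random.Random` instance as `rng` makes a selection reproducible, which is convenient when many random selections per complex are generated.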

The unbound conformations of the 16 protein pairs were first relaxed and subsequently docked at low resolution (see steps 3.1.1 to 3.1.4). Around 100,000 docking models were generated for each protein pair and subsequently tested for their ability to satisfy one to seven virtual inter-protein cross-links. As all complexes had more than seven virtual cross-links (see Table 3), we generated 100 random selections for each number of cross-links. For each random selection, the model with the shortest mean distance was chosen as the best model for each protein complex. The quality of the predictions was assessed with the ligand RMSD (L-RMSD) measure, which corresponds to the Cα coordinate based RMSD between the smaller protein (ligand) in the “unbound complex” and in the predicted docking model after superposing them on the larger protein. Note that the L-RMSD value will always be notably larger than 0, as the unbound complexes in the difficult and medium difficult category show substantial atomic clashes between both binding partners.
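The L-RMSD measure can be written down in a few lines. The sketch below assumes that both complexes have already been superposed on the larger protein (receptor), so that the ligand Cα coordinates can be compared directly without any further fitting; the function name and the coordinate format (lists of x, y, z tuples in matching residue order) are assumptions.

```python
import math

def ligand_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) Cα coordinates.

    No superposition is performed on the ligand itself: the structures
    are assumed to share a common frame defined by the receptor.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For example, a ligand that is shifted rigidly by 3 Å along one axis relative to the reference has an L-RMSD of 3.0 Å. The same function can be applied pairwise to build the all-against-all RMSD matrix used for clustering docking models.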

XLdb: A Database of Literature Curated Intra- and Inter-Protein Cross-links

We collected a non-comprehensive list of 506 intra-protein and 62 inter-protein cross-links from 14 recent publications. In the current form, all cross-links are stored in a database called XLdb, which is based on a Microsoft Excel sheet (Table S1). Figures S1, S2 and S3 provide frequency and probability plots for various distance ranges in XLdb (Text S1). Only cross-links that fulfill the following criteria were included in the database: 1. XL-MS experiments must have been conducted with the DSS or BS3 cross-linking reagent. 2. An experimental structure of the cross-linked proteins must exist in the PDB. 3. The XL-MS data must have been published. Besides the recently published Xlink-DB [40], XLdb is the only database that allows the mapping of cross-links on protein structures and the testing of cross-link guided molecular modeling algorithms on a large scale. We hope that as a reference database, it will encourage new method developments in the field of data-driven molecular modeling.

Results

Comparative Modeling for Cross-link Validation

We calculated comparative models using the workflow illustrated in Figure 1A for 15 proteins in the PP2A interaction network (Table 1) and 6 proteins from the reference data set (Table 2). Figure 2A shows the RMSD vs. ROSETTA energy scatter plots for all 15 PP2A proteins, where the RMSDs were calculated between the models and the template structure. The models satisfying most DSS based cross-links were furthermore highlighted in green. Two trends can be seen in the plots. First, models satisfying DSS cross-links can span a large RMSD range as evident from the box plots that are located below each scatter plot. There is, however, the tendency that the RMSD of the model with the lowest RMSD score, which nonetheless satisfies the largest number of cross-links, drops with increasing number of cross-links (see Figure 2B). Second, the model that satisfies the largest number of cross-links while being the closest to the template structure is in almost all cases among the top 3 models within the simulations (see Table 1). Thus, comparative modeling can be employed to validate XL-MS data while XL-MS data itself, in combination with a sophisticated scoring function, can aid in identifying native-like conformations. Similar conclusions can be drawn for the reference data set, where a similar range of RMSD values (with respect to their native structures) and rank positions for the best model satisfying most cross-links were observed (see Figure S4 and Table 2).

(A) ROSETTA energy score versus RMSD plots for all proteins. Template structures (see Table 1) served as a reference for the RMSD calculations. Green colored dots highlight those models that satisfy most chemical cross-links; their numbers are indicated at the top right corner of each scatter plot. (B) For each protein from (A), only the model with the lowest RMSD value is plotted, demonstrating the prediction improvement with the increasing number of chemical cross-links.

Two exceptions, namely TRFL_BOVIN and 2ABG_HUMAN, showed large RMSD values of 8.5 Å and 19.5 Å, respectively, to their native or template structure. TRFL_BOVIN consists of two flexible C- and N-lobe domains [41]. As a result, some cross-links could have been formed on a conformation that is distinct from the one found in the PDB structure 1BLF. Interestingly, 2ABG_HUMAN is a regulatory subunit of PP2A and has a WD40 propeller fold. It was found co-purified and cross-linked to the TRiC chaperonin complex as a substrate. We therefore speculate that some of the intra-protein cross-links stemmed from B regulatory subunits that were in a stable intermediate folding state while bound to the TRiC chaperonin complex. Indeed, the lowest scoring model that satisfies most cross-links (13/18) shows only a partially folded WD40 propeller with an RMSD of 19.5 Å to the folded template structure (see Figure 3). Additional XL-MS experiments on affinity purified TRiC subunits, which would provide a clean list of only intra-protein cross-links from 2ABG while in complex with TRiC, remain to be conducted.

Figure 3. Chemical cross-links on the regulatory subunit 2ABG of PP2A might have originated from a stable intermediate folding state.

(A) The comparative model that is most similar to its template structure 2ABA satisfies only 6 of 18 intra-protein cross-links. (B) In contrast, the comparative model that satisfies most of the cross-link data (13 cross-links) has an RMSD of 19.5 Å and is partially unfolded. Green chains of spheres indicate the shortest paths between cross-linked lysine pairs that have an SAS distance ≤34.0 Å.

Cross-link Guided De Novo Modeling

Immunoglobulin binding protein 1 (IgBP1) interacts with the catalytic subunit of PP2A, rendering it inactive and protecting it from proteasomal degradation [42]. As inter-protein chemical cross-links were found between IgBP1 and the catalytic subunit of PP2A, we constructed a partial de novo full-length model of human IgBP1 (see section De novo modeling) and predicted the interface between both proteins using cross-link guided protein-docking calculations (see section Protein-protein docking). For the partial de novo prediction of IgBP1, Cα distance restraints from the N-terminal region of the mouse homolog and 65 intra-protein cross-link distance restraints from our XL-MS experiments were applied [6]. Of the 65 intra-protein cross-links, 18 were found within the C-terminal region, while 32 were found between the N- and C-terminal domains of IgBP1. Crucial for the structure prediction was the application of Xwalk’s SAS distance as a post-processing filter. From around 157,000 structural models that were predicted, over 113,000 models satisfied at least 60 cross-links by means of the Euclidean distance. In contrast, only around 190 models satisfied the same number of cross-links using the SAS distance measure (see Figure 4A). Nevertheless, the structure prediction calculations did not converge, as assessed by an RMSD-based clustering attempt on the 190 models, and thus did not result in an unambiguous fold prediction for the C-terminal domain (see Figure 4). However, the 32 intra-protein cross-links between the two domains facilitated the localization of the C-terminal domain with respect to the N-terminal domain. Five models that had the lowest RMSD (≤10.0 Å) to the N-terminal domain of the template structure (PDB-ID: 3QC1) had their C-terminal domain co-localized at the same region (see Figure 4B). These five models were chosen as best models of IgBP1 and docked to the catalytic subunits of PP2A as described in the next section.

Figure 4. Localization of the C-terminal domain of IgBP1 with chemical cross-link data.

(A) ROSETTA energy score versus RMSD plot for full-length models of IgBP1, with one of the best models acting as a reference structure for the RMSD calculation. Only models below an energy score of 650 are shown. Grey empty circles are models that satisfy more than 60 cross-links by Euclidean distance measure. Blue circles depict models that satisfy more than 60 cross-links by means of the SAS distance measure. The five red circles have been chosen as best models with RMSD ≤10.0 Å to the N-terminal template structure of mouse IgBP1 (PDB-ID: 3QC1). (B) Structure of the five best models. The structures are colored from blue to red between the N and C-terminus. The models were superimposed on their N-terminal domain highlighting the co-location of their C-terminal domain.

A similar structure prediction for the pyruvate kinase from the reference data set (see section Comparative and de novo modeling) with only 14 experimental intra-protein cross-links produced models with RMSD values down to 5.93 Å to the native structure while satisfying all 14 intra-protein cross-links. In contrast to the IgBP1 calculation, the simulation for the pyruvate kinase did converge, producing 5 clusters with at least 24 members, of which the lowest scoring models are shown in Figure S5. Thus, our workflow coupled with chemical cross-link data enables the prediction of useful structural information for large proteins with only partial structural data.

Cross-link Guided Protein-protein Docking

We developed a protein-protein docking workflow [6] (see Figure 1C) and applied it to predict the conformation and the binding interface between the catalytic subunits of PP2A and its interactor IgBP1. For the docking calculations 7 inter-protein cross-links, 11 intra-protein cross-links and 10 mono-links were utilized to guide the calculations between PP2AA and IgBP1.

Prior to the docking calculations, we first tested the impact of the number of cross-links on protein-protein docking calculations. For this purpose, 16 protein complexes with experimental structural coordinates in the PDB from a docking benchmark data set were docked and the impact of 1 to 7 randomly chosen inter-protein cross-links on the docking results was assessed (see section Reference data sets for the modeling workflows). The boxplots in Figure 5 show the docking performance of any model that satisfies a certain number of randomly selected virtual cross-links. The performance was assessed by the ligand’s RMSD values (L-RMSD). A clear trend towards higher quality predictions with increasing number of cross-links was observed. On average, docking predictions with the SAS distance measure improved by 5 Å L-RMSD per cross-link, and the improvement leveled off at 5 cross-links in total, which agrees with a similar observation made elsewhere [43]. The Euclidean distance measure showed the same tendencies, although less pronounced and with higher median L-RMSD values than the SAS distance measure.

Figure 5. Box plots showing the improvement of the docking predictions with an increasing number of cross-links (XLs).

The data was collected on 16 protein complexes that were docked using 100 random selections of 1 to 7 virtual cross-links. For each random selection the model satisfying all cross-links and having the shortest mean cross-link distance was selected and its ligand RMSD (L-RMSD) value selected for plotting. Distances were measured with the Solvent Accessible Surface (SAS) distance (green boxes) or the Euclidean distance (blue boxes). White box corresponds to blind docking without distance restraints.

The application of the cross-link guided docking protocol to the IgBP1-PP2AA protein complex (see steps 3.1.x) revealed 4 large clusters of predicted complex models. Despite the high L-RMSD values among the cluster representatives (see Figure 6B), all models revealed similar interface residues as highlighted by the similar location of IgBP1 with respect to PP2AA in Figure 6A. Three amino acids that had been shown in previous studies to form the interface can indeed be found at the interface of the cluster representatives (see Figure 6A).

(A) Structural model of the lowest scoring models from the 4 largest clusters, showing the PP2AA protein in purple and the IgBP1 protein in dark green. The solid cartoon representation corresponds to the cluster representative of the largest cluster, while the transparent IgBP1 models are cluster representatives of the 2nd, 3rd and 4th largest clusters. Intra-links with their shortest SAS distance path are shown as green chains of spheres, inter-links are shown in red and mono-links are highlighted as blue spheres. In addition, black spheres indicate previously mutated amino acids that were shown to be involved in forming the interface of IgBP1 and PP2AA. (B) ROSETTA energy scores of all models that satisfied at least 6 inter-protein cross-links by means of the Euclidean distance measure, shown as empty grey circles. The RMSD was calculated to the cluster representative of the largest cluster. Models satisfying at least 6 inter-protein cross-links by means of the SAS distance measure and having a binding interface size ≥900 Å² are highlighted in blue, while the cluster representatives of the 4 largest clusters are highlighted as red circles.

Discussion

Distance restraints derived from XL-MS experiments are useful for driving modeling calculations towards native-like conformations. In this manuscript, we have described three different computational protocols for comparative and de novo structure prediction and protein-protein docking and demonstrated the added value of the distance restraints on the computational predictions. Each workflow utilizes the ROSETTA molecular modeling suite at several steps in its calculation. The three workflows provide predictions that complement other structure determination methods like X-ray crystallography and NMR spectroscopy. They place fewer restrictions on the size and rigidity of proteins and are well suited to determine the topology of protein complexes. As XL-MS provides rather low-resolution information, it is particularly useful for gaining structural information on larger protein complexes. In addition, structural models of protein-protein binding interfaces can be generated in cases in which structural information on the subunits exists.

Protein-protein docking methods benefit most from XL-MS data. Native-like conformations can be predicted even in cases in which homology models or de novo models for the binding partners are used. It is often possible to predict the topology of a complex with as few as two cross-links to an accuracy of less than 10 Å L-RMSD (see Figure 5). The increased accuracy in the docking predictions raises the question of the extent to which multimeric protein docking calculations could benefit from XL-MS data. This important question remains to be addressed.

Besides inter-protein cross-links, an XL-MS experiment also produces intra-protein cross-links and mono-links. However, because the Euclidean distance measure has been employed as the means of simulating cross-link data, the latter two types of modifications have so far been mostly ignored and not incorporated in docking calculations. Xwalk, however, allows these modifications to be exploited under the assumption that intra-protein cross-links and mono-links are formed outside the protein-protein binding interface. Xwalk's shortest path over the protein surface between modified amino acids mimics intra- and inter-protein cross-links. Models that display such cross-links within the predicted binding interface can therefore be removed from further analysis. One drawback of Xwalk is its high computational expense when calculating SAS distances, which can take up to a second per cross-link. It would therefore be desirable to develop new, faster algorithms for simulating cross-links on protein surfaces. Faster algorithms could facilitate the inclusion of the SAS distance measure in scoring functions, where it could directly impact conformational sampling routines rather than acting as a post-modeling filter.
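The post-modeling filter implied by this assumption can be illustrated with a minimal Python sketch. It is a deliberate simplification and not Xwalk's algorithm: instead of tracing the SAS path of a cross-link, it only checks whether any residue carrying a mono-link or an intra-protein cross-link belongs to the predicted interface. The residue representation as (chain, residue number) tuples and the function name `outside_interface` are hypothetical.

```python
def outside_interface(interface_residues, intra_links, mono_links):
    """Return True if no mono-linked or intra-linked residue lies in the
    predicted binding interface, i.e. the docking model can be kept.

    interface_residues: iterable of (chain, residue_number) tuples
    intra_links: iterable of residue pairs, each pair two such tuples
    mono_links: iterable of (chain, residue_number) tuples
    """
    modified = set(mono_links)
    for res_a, res_b in intra_links:
        modified.add(res_a)
        modified.add(res_b)
    # A model is rejected as soon as one modified residue is in the interface.
    return modified.isdisjoint(set(interface_residues))

# Hypothetical example: one intra-link and one mono-link on chain A.
interface = {("A", 45), ("A", 46), ("B", 102)}
intra = [(("A", 10), ("A", 88))]
mono = [("A", 46)]
keep = outside_interface(interface, intra, mono)  # False: the mono-link sits in the interface
```

A path-based check, as performed with SAS distances, would additionally reject models in which the linker would have to pass through the interface even though both modified residues lie outside it.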

It should be clear, however, that DSS-based XL-MS data provide low-resolution structural information, which makes them less appropriate for fold prediction in comparative and de novo modeling. For example, despite using over 60 intra-protein cross-links for the full-length structure of IgBP1, we were unable to pinpoint the fold of the C-terminus. The main problem remains the difficulty of distinguishing between near-native and non-native conformations based purely on DSS cross-links, as is apparent from the large RMSD ranges in Figure 2. At the same time, the low-resolution information might be sufficient to probe large conformational changes in proteins (see Figure 3). When applying DSS-based XL-MS data to structure prediction, it is important to keep in mind that structural features are probed with a 34.0 Å long “distance ruler” (see section Reference data sets for the modeling workflows).

Chemical cross-links that cannot be mapped on experimental structures or high-quality comparative models could have emerged as a result of alternative protein conformations or false positive identifications. The former cause might be promoted by experimental cross-linking conditions such as buffer solution, pH, salt concentration or absence of ligands, which can differ from the conditions found in crystallization or NMR experiments. The different experimental conditions can lead to conformational changes of the proteins or even induce different oligomeric states, which might result in unsatisfied cross-links on experimental X-ray structures. The second cause is likely less relevant, as the false positive rate in cross-linking experiments is often found to be around 5% or lower [22]. We therefore believe that the main source of apparently unsatisfied cross-links remains the conformational variability of proteins, especially in solution, as in the case of 2ABG (see Figure 3).

In conclusion, we have introduced three computational workflows for XL-MS data-driven structural modeling of proteins and protein complexes. In combination with available structural models of proteins, these workflows strengthen XL-MS as a complementary approach for gaining structural insights into protein complexes and generating testable predictions for biologically relevant protein-protein interactions. The type, quality and coverage of the restraints are likely to increase with the ongoing efforts of the mass spectrometry community to improve XL-MS technology. In addition, a better understanding of protein folding kinetics and interaction mechanisms as well as more sophisticated algorithms for simulating XL-MS data will likely improve the prediction accuracy of cross-link guided molecular modeling. Taken together, data-driven structural modeling of proteins and protein complexes as a hybrid structural biology method will likely have an increasing impact on the protein structure and modeling fields.

Supporting Information

Probabilities for observing a cross-link between 0 and 34.0 Å SAS distance. The probabilities were calculated with an empirical cumulative distribution function that was applied to all cross-links from the cross-link database XLdb (see Table S1) having a distance between 0 and 34 Å.
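As a brief illustration of how such probabilities can be obtained, the following Python sketch builds an empirical cumulative distribution function (ECDF) from a list of cross-link distances. It is a minimal sketch under stated assumptions: the function name `ecdf` and the example distances are hypothetical and are not taken from XLdb.

```python
from bisect import bisect_right

def ecdf(distances, max_dist=34.0):
    """Build an empirical cumulative distribution function from all
    distances between 0 and max_dist (in Å); larger values are ignored."""
    data = sorted(d for d in distances if 0.0 <= d <= max_dist)
    if not data:
        raise ValueError("no distances within the allowed range")
    n = len(data)

    def probability(x):
        # Fraction of observed cross-links with a distance <= x.
        return bisect_right(data, x) / n

    return probability

# Hypothetical SAS distances in Å; the 40.0 Å entry exceeds the cut-off.
prob = ecdf([5.0, 10.0, 20.0, 30.0, 40.0])
p10 = prob(10.0)  # 2 of the 4 retained distances are <= 10 Å
```

By construction, the function returns 1.0 at the 34.0 Å cut-off, because only distances within the range enter the distribution.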

Cross-Link Guided Comparative Modeling on a Benchmark Data Set. The performance of the modeling calculations was assessed by the Cα RMSD similarity between the predicted models and the native protein structure (see Table 2). Green colored dots show those models that satisfy most chemical cross-links; their numbers are indicated at the top right corner of each scatter plot.

Cross-Link Guided De Novo Modeling on the Benchmark Protein KCRM_RABIT. (A) ROSETTA energy score versus RMSD plot for full-length models of KCRM_RABIT. Grey empty circles are all 105,294 models. Black circles depict models that satisfy all 14 intra-protein cross-links by means of the SAS distance measure. The five red circles are the lowest scoring models from the 5 largest clusters after clustering the lowest scoring 500 black circled models with a 10.0 Å RMSD cut-off. Compared to the five blue circles that represent the 5 largest clusters in a non-guided de novo prediction, the mean RMSD value drops from 12.5 Å to 9.7 Å. (B) The native KCRM_RABIT structure (PDB-ID: 1U6R) is shown on the left, while the five best models are shown on the right. The structures are colored from blue to red from the N- to the C-terminus. The de novo modeled N-terminal domain is encircled, while the C-terminal domain, for which a template structure was provided, is shown in transparent surface representation. Note the co-localization of the de novo modeled N-terminal domain.

Acknowledgments

We would like to thank Daisuke Kuroda, Ph.D., for testing the protein docking demo version (see Text S2), which is available in the ROSETTA protocol capture archive under the directory XL_guided_protein_docking. Special thanks go also to Eric D. Merkley, Ph.D., for comments on the XLdb. We would also like to extend our thanks to the SyBIT project of the SystemsX.ch initiative and the Brutus system administrators for support with compute infrastructure and other IT-related resources. All figures containing molecular structures were rendered with PyMOL [44].

Author Contributions

Conceived and designed the experiments: RA LM. Performed the experiments: AK. Analyzed the data: AK GR. Contributed reagents/materials/analysis tools: FH AL. Wrote the paper: AK AL.