Abstract

One of the most popular and simple models for the calculation of pKas from a protein structure is the semi-macroscopic electrostatic model MEAD. This model requires empirical parameters for each residue to calculate pKas. Analysis of current, widely used empirical parameters for cysteine residues showed that they did not reproduce expected cysteine pKas; thus, we set out to identify parameters consistent with the CHARMM27 force field that capture both the behavior of typical cysteines in proteins and the behavior of cysteines which have perturbed pKas. The new parameters were validated in three ways: (1) calculation across a large set of typical cysteines in proteins (where the calculations are expected to reproduce expected ensemble behavior); (2) calculation across a set of perturbed cysteines in proteins (where the calculations are expected to reproduce the shifted ensemble behavior); and (3) comparison to experimentally determined pKa values (where the calculation should reproduce the pKa within experimental error). Both the general behavior of cysteines in proteins and the perturbed pKa in some proteins can be predicted reasonably well using the newly determined empirical parameters within the MEAD model for protein electrostatics. This study provides the first general analysis of the electrostatics of cysteines in proteins, with specific attention paid to capturing both the behavior of typical cysteines in a protein and the behavior of cysteines whose pKa should be shifted, and validation of force field parameters for cysteine residues.

Introduction

In recent years, the importance of cysteine residues to protein function and cellular activity has become increasingly apparent. Cysteines, often due to post-translational modifications including oxidation, nitrosylation, or disulfide bond formation, have been shown to play significant biological roles in enzyme catalysis, in sensing oxidative and nitrosative stress, in the modulation of some transcriptional regulators, in homocysteine binding and in signaling, aging, and disease processes.1–6 The nucleophilic character of cysteine is often an important feature of its reactivity; therefore, stabilization of the deprotonated, thiolate form through lowering of the pKa would promote this reactivity.7 These studies highlight the importance of electrostatics in the mechanism of cysteine reactivity in many cysteine-containing proteins.

As our understanding of the cellular roles that modifiable cysteines can play improves, the ability to accurately predict when cysteines can be modified is becoming increasingly important. Furthermore, there is a need to strike a balance between simple, computationally inexpensive models, which can be used on many protein structures—or even an ensemble of structures—and physically realistic models that can capture wide variation in protein environment.

There have been a few studies of the role of electrostatics in proteins containing active site cysteines7–10; these studies employed simple Poisson–Boltzmann-based models, together with parameters from the literature,11–13 to calculate cysteine pKa shifts and to identify the pKa shift origin. However, in none of these studies was an attempt made either to optimize the cysteine parameters used or to survey the ability of the parameters to capture cysteine behavior across a diverse set of proteins. We do both in this article.

Several different approaches to electrostatic calculations of pKas are available. These range from detailed density functional theory calculations, which can only include portions of the protein, to simpler semi-macroscopic models, which can include the entire protein structure. Most methods involve either single static structures,12,13 or averages over a few structures,11,14 although molecular dynamics with variable protonation states is an emerging, albeit computationally expensive, technique.15–17 The simplest, commonly used semi-macroscopic simple continuum models are applied here.11–13 The general idea of these calculations is to use an implicit water model, such as the Poisson–Boltzmann-based MEAD model, to estimate the pKa shifts of residues in a static structure resulting from electrostatic interactions in the protein environment. The energetics of changing protonation state then involves the electrostatic work needed to alter atom-based charges embedded in an irregularly shaped low dielectric medium (the protein), which is immersed in a high dielectric medium (the solvent).

Four assumptions underpin the semi-macroscopic models: (1) the free energy of ionization can be divided into a local component involving electronic structural changes, for example, bond breaking, and a non-local component which includes distant interactions; (2) model compounds having the same local component as the titratable residues can be used to calculate the local component; (3) the non-local component can be modeled as purely electrostatic; and (4) the electrostatics of the non-local component can be calculated classically using the finite difference linearized Poisson–Boltzmann equation. These assumptions allow for the calculation of an intrinsic pKa, which, when combined with site–site interaction energies, generate full pKas (see Methods for further details). Approximations must be made to calculate the 2N different possible protonation states for N titratable residues, including the reduced-site model,12 Monte Carlo sampling,11 and clustering analyses.18

With adequate parameters and suitable (ad hoc) adjustments of the dielectric constant of the protein, these semi-macroscopic models, along with either reduced-site or Monte Carlo sampling, typically achieve pKa values within ~ 1–2 pH unit (on average) of experimental results.13,15 Semi-macroscopic methods, especially the MEAD model and package, have been successfully used in a variety of situations.12,13,15,19 Here we employ these methods to: (1) optimize the cysteine parameters to reproduce the expected ensemble behavior across a control group of proteins; (2) determine the ability of the new parameters to reproduce the expected ensemble pKas in a set of proteins whose cysteine pKas are expected to be perturbed; (3) compare calculated and experimental pKa values in a set of cysteine-dependent proteins for which such measurements have been made; and (4) evaluate in detail those proteins for which unexpected pKa values are calculated.

Our long-term goals are to probe the origins of cysteine pKa shifts and to use this information to develop and incorporate electrostatic measures into functional analyses so as to predict reactive cysteine sites in protein sequences. This methodological study is a crucial step toward that goal.

Materials and Methodology

Protein data sets

The control protein set is composed of 60 proteins and 137 cysteine residues (Supporting Information Table 1) and was used for calibrating the cysteine pKa calculations. It was selected to be unbiased and general enough to incorporate a variety of protein folds, with few enough titratable sites to calculate the pKas directly without resorting to Monte Carlo sampling or other more approximate methods. Thus, this control data set was identified as those protein chains in the PDB-Select list20 exhibiting the following features: (1) lacking disulfide bonds; (2) containing fewer than 30 titratable sites; and (3) containing at least one unmodified cysteine residue in the crystal structure. (PDB-Select is used so that each protein exhibits less than 25% pairwise sequence identity to any other; PDB-Select release 2003.20)

The modifiable set of proteins containing cysteine sulfenic acid modification sites was compiled as described elsewhere.21 Cysteines that are modifiable to cysteine sulfenic acid are expected to exhibit a decreased Cys pKa. The modifiable protein set is composed of 47 proteins and 49 modifiable cysteine sites.

Calculation of cysteine pKas

The pKa calculations were performed with the MEAD program suite,11,13 an approach that is semi-macroscopic so that the solvent and the protein have different dielectric constants, taken to be 80 and 20, respectively, in this work, as has been previously determined to be optimal in this model.13 The dielectric boundary is defined as the protein solvent-accessible surface, determined using a water-sized probe radius of 1.4 Å. The linearized Finite Difference Poisson–Boltzmann equations, with an ionic radius of 2.0 Å and ionic strength of 0.15M at 300 K, were used to generate the electrostatic potentials. For each protein in the control and modifiable data sets, a single protein conformation was extracted from the experimental crystal structure and hydrogens were added using the HBUILD facility from CHARMM.22,23 Modified cysteine residues (Cys—SOH or Cys—SO2H) were reduced to the standard sulfhydryl (Cys—SH) form. Structures in which the cysteine had been mutated to a serine for crystallization were computationally changed to cysteine. All structures were subjected to brief restrained minimizations to relieve steric clashes prior to electrostatic calculations; 100 steps of conjugate gradient minimization with force constants of 30 kcal/(mol Å2) on all non-hydrogen atoms, then 100 steps of conjugate gradient minimization with force constants of 20 kcal/(mol Å2) on all non-hydrogen atoms, and finally 100 steps of conjugate gradient minimization with force constants of 10 kcal/(mol Å2) on all non-hydrogen atoms. These minimized proteins were then subjected to the electrostatic calculations.

The reduced site titration model was used to calculate pKas from the electrostatic energies using the MEAD multiflex package11 for the proteins in both the control and modifiable protein sets. For the proteins in the modifiable set with a large (>30) number of titratable sites, where exhaustive calculation was infeasible, a Monte Carlo approach was taken to calculate the pKas.11

Methods for modification of cysteine electrostatic parameters

Every Ser, Asp, Arg, Glu, His, Cys, Lys, Tyr, and Thr residue was considered titratable in the computational model. The model parameters required for these calculations, except for Thr, were originally taken from the parameter files distributed with the MEAD package, along with the radii from the CHARMM22 force field.23,24 These parameters are those developed by Bashford and coworkers in previous studies.25,26 There have been multiple studies using finite-difference Poisson–Boltzmann calculations to study pKa effects with different charge and radii models. These particular parameters were chosen as the most likely consistent parameters as they were developed for use with the CHARMM22 force field models of particular proteins. However, the existing MEAD parameters for the deprotonated Cys produced an unreasonable distribution of pKas for cysteine residues in the control protein set (see Results), hence the need for optimization. Those parameters did come from a different study26 than the majority of the other parameters. The Thr parameters were developed by the current authors in a previous study21 where the particular effects of Thr were found to be important.

The use of CHARMM radii was inspired by previous work25 and by the desire to develop a model for use with MD simulations using the CHARMM force field.16 However, it is possible that the use of different radii may lead to different and perhaps better results or at least to different parameterizations.

It is likely unnecessary to consider all of these residues as titratable as the overall effect of a particular residue on the distribution of another residue's pKas is likely to be modest; however, what we describe here should be a comprehensive set of titratable residues.

Trial charges for the protonated side chain were originally determined using a charge fit to the electrostatic potential obtained from a PM3 calculation,27 a semi-empirical quantum chemical method, on the model compounds using the Gaussian 03 package.28

The charges and model pKas were then modified—both to a single charge model and to distributed charge models based on model PM3 calculations—until a realistic Cys pKa distribution was obtained, that is, one centered around pH 8–8.5. Given the approximate nature of the fitting, it is probable that more accurate parameters could be developed; however, given the paucity of experimental data, this accuracy of the simplest set was considered sufficient. Details of the resulting charge model for cysteine can be found in Table I.

Results and Discussion

Current parameters do not reproduce expected distributions for Cys pKas

A control set of 60 proteins, containing 137 cysteines in diverse environments, was identified (Supporting Information Table 1). The pKas of the cysteinyl residues in this control set were calculated using standard charge and van der Waals sets associated with MEAD11,12 and the CHARMM22 force field.23,24 We expected these parameters to be able to accurately capture the behavior of an “average” cysteine in a protein.

The initial distribution of cysteine pKas for this set was centered at 10.9, a rather unrealistic mean for a random set of cysteines in proteins (Fig. 1, white bars), where one would expect the mean pKa to range from 8 or 8.5, based on standard textbook values for the free amino acid,29 to 9.5, a pKa based on a “model compound” for Cys30 (where the model compound was OH—CH2—CH2—SH argued to be characteristic of a broad range of alkyl thiols31). Recent experimental measurements of the pKa values of ionizable groups in alanine pentapeptides suggest that 8.55 is a realistic value for Cys pKa.32 This is the first time, to our knowledge, that anyone has demonstrated that a particular parameter set does (or does not) reproduce the average behavior of cysteines in proteins. In our calculations, the pKas of other titratable residues (Ser, Asp, Arg, Glu, His, Lys, and Tyr) exhibit expected distributions, that is, narrow distributions around their standard text book values, without modifications of the parameters, though shifted to the appropriate mean (data not shown).

Distributions of calculated cysteine pKas. The distributions of the calculated cysteine pKas for the control data set with original parameters (white bars), control set with modified parameters (gray bars), and modifiable cysteine pKas (black bars) exhibit...

Modified cysteine parameters reproduce expected Cys pKa distributions

In order to obtain a more realistic distribution, the model compounds' parameters were adjusted (Methods and Table I) until the distribution of the calculated Cys pKas exhibited a reasonable mean of 8.14 (Fig. 1, gray bars), which is in general agreement with textbook values. The distribution exhibits a standard deviation of 2.0 and a standard error of 0.17 and is close to a Gaussian distribution. These results indicate that, overall, calculations with the modified cysteine parameters produce biologically reasonable results. Subsequently, we show that these new cysteine parameters can also capture much of the behavior of reactive cysteines, as well.

In the control set, significant subpopulations exhibit pKas shifted below 7 or above 9 (Fig. 1 and Supporting Information Table 1) and we next explored these outliers. All control set proteins that exhibit a cysteine pKa lower than 5.1 are metalloproteins; furthermore, the observed cysteine is at or near the metal binding site. The proteins with pKas above 12 are all also at or near metal binding sites, except for 2avi, a structure of the avidin-biotin complex. In this complex, the cysteine sulfur atom exhibiting the elevated pKa is 2.04 Å away from the sulfur in a second cysteine, which suggests the presence of an unannotated disulfide bond. If correct, it would make our current calculation of pKa meaningless as the reference state and parameterization would not be appropriate for the disulfide-bonded cysteine. In the remaining structures, the metal ions are not included in the pKa calculations, which suggest that the environment around the metal ions is unusual even in the absence of the metal. This is a promising topic of future study, but is not germane here, except that it suggests that significant shifts in cysteine pKas can be calculated and explained.

Overall, the new cysteine parameters reproduce the expected pKa distribution for a random set of cysteines. Outliers are readily explained as metal binding sites or unannotated disulfide bonds.

Validation of new cysteine parameters by calculation of pKas for modifiable or reactive cysteines

The next validation step is to compare the calculations on another set of proteins where there is biochemical evidence that the cysteine pKa should be decreased compared to those of the control data set. In a separate study,21 a set of proteins was identified that were known to contain a cysteine residue that could be modified to a cysteine sulfenic acid (Cys—SOH). Evidence was based on either biochemical evidence or the presence of this species in the X-ray crystal structure.21 For modification to Cys—SOH by oxidants such as hydrogen peroxide, the reactive form of the cysteine is the deprotonated, negatively charged form. Thus, the pKa of the free, modifiable cysteine residue is expected to be lower than the average.

The cysteine pKa values were calculated for these reactive cysteines in their reduced states. The distribution of calculated pKas for the modifiable cysteines is downward shifted compared to the distribution for pKas of the control set (Fig. 1, compare grayand black bars). The mean pKa of the distribution for modifiable cysteines is 6.9, which is 0.6 standard deviations (using the standard deviation for the control set) lower than the mean for the control set of cysteines of 8.14. (The standard deviations are 1.4 and 2.0 for the modifiable and control sets, respectively.) Overall, this distribution is entirely consistent with the expected enhanced reactivity of these sites.1

While the ensemble distribution for modifiable cysteines is appropriate, we looked at some of the outliers, specifically. Only two modifiable cysteines exhibit pKas shifted more than 1.4 pH units above the control group mean of 8.14 (discussed below). Nearly half, 20 cysteines, exhibit pKas shifted more than 1.4 below the mean of the control set (Fig. 1, black bars). In particular, the modifiable cysteines in two protein structures, protein tyrosine phosphatase 1B (1oet) and RNA triphosphatase domain of mRNA capping enzyme (1i9t), exhibit extraordinarily large pKa shifts greater than 7 pH units, with cysteine pKas of less than 1. Of course, at such a pH these proteins would likely unfold and so, a low pKa would never be observed biochemically. Instead the low pKa indicates the extremely low probability of cysteine protonation of either protein in a physiological environment, which is in agreement with experimental observation for phosphatases.26,33,34

Several of the calculated pKa values are higher than 8, including 2 whose pKas were shifted more than 1.4 pH units above the control group mean of 8.14. We asked whether these values were consistent with what is known about these proteins or whether they were too high, suggesting problems with the parameters or calculations. Cysteine deprotonation is a first step in the reaction, so a lower pKa enhances reactivity of cysteines; however, a decreased pKa is certainly not the only factor which influences reactivity. Unfortunately, we are at an early stage in understanding the features of these reactive cysteines.35,36 In principle, there may be some proteins with unperturbed pKa values (~8.3–8.5) for which other factors dominate enhanced reactivity. In such cases, the ~9% of deprotonated form present at a physiological pH of ~7.4 might be sufficient to support their enhanced reactivity. A second potential explanation for calculated pKa values of highly reactive cysteinyl sites would be the error inherent in such calculations (of around 1 pH unit) that could lead to overestimation of this value.

A third potential explanation is that a different mechanism of Cys—SOH formation, such as oxidation upon exposure to synchrotron radiation, might not require a deprotonated cysteine. Several proteins in the original data set possess Cys—SOH groups that have only been observed in X-ray structures elucidated using synchrotron radiation. In at least some cases, these sites may not be especially reactive toward oxidants such as hydrogen peroxide, but rather, may be sites where hydroxyl radicals generated during exposure of the crystals to intense radiation have caused oxidative damage.37,38 The protein with the highest pKa in this category (at 9.7), with PDB identifier 1vhq (gene name elbB), has not been studied functionally and is tentatively linked to lycopene biosynthesis in E. coli, representing an early step in isoprenoid biosynthesis in these bacteria.39 A more recent computational analysis of the functional sites of this protein suggests that the function is not so clear.40 The other protein, 1hku (C-terminal binding protein 3/BARS, calculated pKa of 9.0), is a corepressor of various transcription regulators involved in the fission of Golgi membranes during mitosis.41,42 This protein is structurally related to a family of 2-hydroxyacid dehydrogenases and possesses NAD+ -dependent dehydrogenase activity. There is no indication in the literature that it undergoes oxidation at cysteine, only that this residue is oxidized in the crystal structure.41 Both proteins have pKas shifted more than 1 standard deviation (1.4 pH units) above the mean of the control set. Other proteins exhibiting unshifted pKas might also contain cysteines modified due to synchrotron radiation damage. 1qvz is another protein of unknown biochemical function in the DJ-1/ThiJ/PfpI superfamily.43,44 This protein is related to the E. coli heat shock protein Hsp31; the putative conserved Cys-His-Glu catalytic triad may therefore be involved in chaperone activity. Two crystal structures have been solved with this Cys (Cys138) oxidized to sulfenic or sulfinic acid, but in both cases this oxidation may have been induced by the synchrotron radiation and the biological relevance of this modification is uncertain.43,44

Qualitatively, the electrostatic calculations on the modifiable cysteine sites are consistent with what is expected: an overall decrease in the pKa of the modifiable cysteines. Further details are consistent with published and experimental information. While the calculations may not calculate the exact cysteine pKa, reactive cysteines for which the pKa is expected to be shifted can be identified.

Comparison to experimentally determined pKas

To further validate the parameters and methods, we studied a set of proteins for which structures are known and pKas were experimentally measured by one or more methods (Table II). The emphasis was on CXXC proteins which, in addition to any biological relevance they may have,9,45 are methodologically interesting as they have two cysteines in close proximity and often occur at the ends of alpha-helices; thus, they present both potentially significant background interactions (from the helix dipole) and residue–residue interactions (from the second cysteine as well as other residues). Cysteine pKas were calculated for this set of proteins and these calculated cysteine pKas were compared with the experimentally determined pKas. As mentioned in the introduction, the results of these calculations are not expected to be more accurate than ~1–2 pH units, even with a well-developed parameterization.13,15 Only static protein structures are used in these calculations, thus the calculated pKas could potentially be worse if the protein undergoes significant conformational changes around the cysteine sites (discussed subsequently).

For these select proteins, the calculation results are qualitatively correct (Table II). Quantitatively, the mean error is 1.5 pH units, if one averages cases where there are multiple results and if 5.5 and 4.5 are used when the experimental results are <5.5 and <4.5, respectively; thus the new parameters are sufficient to capture the pKa trend. The error is significantly better than from a null hypothesis. The mean error calculated from an “average” cysteine pKa value would be 3.4 pH units using the average cysteine parameterized value of 8.14.

Quantitative agreement between the calculated and experimental pKas is especially significant for the experimental pKas that are particularly low—less than 4.0. The average calculated pKa value across the entire experimental set is 6.3, whereas the average experimental value for this same set is 4.8. This suggests that while our parameters capture the majority of the pKa shifts from average, the methodology does not yet capture the full shift, at least not while accurately modeling a typical cysteine.

One difficulty in this comparison is the lack of agreement in some of the experimental pKa measurements themselves. For instance, we found multiple pKa determinations for the E. coli thioredoxin reactive cysteine in the literature, with measurements ranging from 6.7 to 7.5 (Table II).

Despite the observation that experiment and theory are not in precise quantitative agreement, our current methods are sufficiently accurate to identify the trends toward an increased or decreased pKa and are certainly more accurate than previous parameters. As more pKas are measured and experimental measures of pKas improve, further refinements will be possible. We also expect that if more quantitative accuracy is necessary it might be achieved by combining this model with molecular dynamics simulations.15–17

Electrostatics applied to static crystal structures: the need to include dynamics

An additional difficulty in computationally reproducing low pKas probably lies with a failure to fully model protein dynamics or conformational changes that may occur as the solution pH is lowered. Several examples are observed here. The protein with the most significantly increased cysteine pKa on the basis of our calculations is methionine sulfoxide reductase (Msr; 1fva69) with a pKa of 10.7. At first glance, this pKa would seem inconsistent with Cys—SOH formation—and this protein does form Cys—SOH. To identify potential reasons for the increased pKa, several Msr structures were compared. Related crystal structures for Msr (1ff369; 1fvg70) show the modifiable cysteine in the same conformation as 1fva; however, a nearby active site loop contains another cysteine in varying conformations (Fig. 2). This loop is exceptionally mobile and is not visible in one crystal structure. pKa values were calculated for the modifiable cysteine in each Msr structure. Depending on the conformation of the loop, the calculated cysteine pKas vary by several pH units, with the pKa of the modifiable cysteine ranging from a low of 8.2 to a high of 10.7 in these static structures. This result illustrates a not unexpected limitation to the current analysis: protein conformational variability, especially with local flexible loops, can modulate the pKa in a manner not captured by a simple model in which conformational relaxation is only modeled via a dielectric constant and with a single structure.

Comparison of Msr conformations illustrates conformational variability. Three structures of MsrA are superimposed (1fva, 1fvg, and 1ff3), illustrating the conformational variability that exists in the C-terminal region forming an active site loop (shown...

Another set of calculations points to the limitation of applying the simple electrostatics model to a static crystal structure. Calculations on the two modifiable sites in the homodimer from the crystal structure 1prx,64 a peroxiredoxin (Prx), result in different pKas (6.4 and 4.4) due to structural heterogeneity near the modifiable cysteine. While this may provide insight into the subtle local conformational changes that help activate this cysteine, for the purposes of pKa calculation this result again suggests that some conformational averaging, such as through molecular dynamics simulations, 15–17 might be important for producing more accurate calculations in this protein. Of course, it could be that such heterogeneity is real in that the cysteine sites might be nonequivalent enough to give rise to different pKas and presumably different activities (as suggested in recent work16); however, this is far too speculative to conclude from these limited calculations.

Thus, protein dynamics and conformational changes can significantly affect residue pKas. In addition, the conformation does not change as the pH is changed during the calculation. The current calculations are performed on static crystal structures. Although the use of a dielectric constant greater than unity for the protein in the Poisson–Boltzmann calculations assumes that there is a response by the protein to changes in charge states, it does not consider specific changes in conformations that might occur due to specific changes in charges, that is, specific responses by the protein to side chain protonation/deprotonation.

Conclusions

This study provides the first general analysis of the electrostatics of cysteines in proteins, with specific attention paid to capturing both the behavior of a typical cysteine in a protein and the behavior of cysteines whose pKa should be shifted. We have now shown that this behavior can be captured, at least qualitatively, by parameterization of the simple MEAD model through calculations of both “average” and modifiable cysteines. Our results demonstrate that dynamics and conformational changes significantly impact pKa calculations. In at least two proteins observed here, Prx and Msr, there is conformational variability that critically affects the cysteine pKa. Finally, we have shown the limitation of the method in quantitatively predicting pKas that are shifted significantly downward, presumably due to incomplete inclusion of the protein response in the calculation. The parameterization and validation we have conducted will allow for further study of reactive cysteines using the MEAD model with static structures, but also suggests that incorporation of protein conformational changes will be necessary to obtain more quantitative accuracy. Future work will involve probing the origin of cysteine pKa shifts in modifiable cysteines and developing methods to identify and then predict reactive cysteines using these parameters.