Open Access: This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Abstract

The Biological Magnetic Resonance Data Bank contains NMR chemical shift depositions for 132 RNAs and RNA-containing complexes. We have analyzed the 1H NMR chemical shifts reported for non-exchangeable protons of residues that reside within A-form helical regions of these RNAs. The analysis focused on the central base pair within a stretch of three adjacent base pairs (BP triplets), and included both Watson–Crick (WC; G:C, A:U) and G:U wobble pairs. Chemical shift values were included for all 4³ (64) possible WC-BP triplets, as well as 137 additional triplets that contain one or more G:U wobbles. Sequence-dependent chemical shift correlations were identified, including correlations involving terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs), despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A computer program (RNAShifts) was developed that enables convenient comparison of RNA 1H NMR assignments with database predictions, which should facilitate future signal assignment/validation efforts and enable rapid identification of non-canonical RNA structures and RNA-ligand/protein interaction sites.

Electronic supplementary material

The online version of this article (doi:10.1007/s10858-012-9683-9) contains supplementary material, which is available to authorized users.

NMR chemical shifts have been widely utilized for NMR signal assignment and structural studies of proteins (for examples see: Grzesiek and Bax 1993; Wishart and Sykes 1994; Wishart et al. 1991, 1992; Cavalli et al. 2007; Shen et al. 2008; Wishart et al. 2008). Although relationships between 13C chemical shifts and RNA structure have been identified (Ebrahimi et al. 2001; Fares et al. 2007; Ohlenschlager et al. 2008), and 15N NMR chemical shifts have been incorporated into a probabilistic approach for automated assignment of RNA imino groups (Bahrami et al. 2012), heteronuclear NMR chemical shifts have not been widely exploited for RNA studies (Lam and Chi 2010; Aeschbacher et al. 2012). On the other hand, Wijmenga and co-workers showed that non-exchangeable 1H NMR chemical shifts for A-form helical residues could be back-calculated from a given 3D RNA structure (Cromsigt et al. 2001). For 28 examples tested, the back-calculated shifts were in good agreement with shifts reported in the Biological Magnetic Resonance Data Bank (BMRB; www.bmrb.wisc.edu), and some general 1H NMR chemical shift trends were identified (Cromsigt et al. 2001). Here we report a detailed analysis of the H8, H2, H6, H5, H1′, H2′, and H3′ proton NMR chemical shifts that have been deposited in the BMRB. After correcting for differences in chemical shift referencing, excellent correlations were observed, despite the fact that the data were obtained over a wide range of sample conditions. Our findings confirm and quantify previously identified trends and identify new sequence- and structure-dependent chemical shift correlations that can be used for assignment and/or validation of non-exchangeable 1H NMR chemical shifts and for the identification of non-canonical RNA structural features and intermolecular interaction sites.

Methods

NMR data were analyzed using “RNAShifts”, a program designed to download and analyze RNA 1H NMR chemical shifts that have been deposited in the BMRB. (Locally derived shifts that have yet to be deposited can also be analyzed.) All 132 depositions available in the BMRB were used in the current analysis except BMRB ID 5170, 6814, 4816, 15697, 15915, 5023, 4253, 4894, and 15257, which could not be reliably used either because the BMRB assignments did not match the published PDB assignments, or because there was no associated publication or PDB file that could be used to identify RNA secondary structure. As additional input, files were manually generated for each deposition, based on published structural studies, that identify for each residue (1) whether or not the residue is base-paired, (2) the nature of the base-pairing partner, (3) any long-range intra- and/or inter-molecular interactions (e.g., sites of protein binding or participation in A-minor or other RNA–RNA contacts), and (4) participation in structured (e.g., GNRA; G/g = guanosine, N/n = any nucleotide; R/r = purine; A/a = adenosine) or unstructured loops. A representative input file is shown in Supplementary Table S1.

The analysis focused on shifts reported for the non-exchangeable H8, H2, H6, H5, H1′, H2′ and H3′ protons of the central base pair of three consecutive canonical Watson–Crick base pairs (WC-BPs), here called WC-BP triplets: [5′−n(i−1)−Ni−n(i+1)]:[5′−n(j−1)−nj−n(j+1)], where Ni is the nucleotide for which the NMR shifts are being evaluated and n denotes the neighboring nucleotides (Fig. 1a). As additional parameters, we denoted whether the n(i−1):n(j+1) or n(i+1):n(j−1) base pairs were at terminating positions in the RNA, and we identified the secondary structural elements adjacent to the WC-BP triplets (canonical or non-canonical WC-BP, bulges, loops, long-range RNA–RNA interactions, and RNA–protein/ligand interactions), Table S1.
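The triplet definition above can be illustrated with a short sketch that scans a sequence and its base-pairing table for residues at the centre of three stacked base pairs. The function name, the 0-based pairing dictionary, and the example hairpin are assumptions made for illustration; this is not the RNAShifts implementation.

```python
WC = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}

def bp_triplets(seq, partner):
    """Return (i, label) for each residue i at the centre of three stacked pairs.

    seq: RNA sequence; partner: dict mapping a 0-based index to its paired index.
    The label follows the n-N-n convention used in the text, e.g. "gAc".
    """
    allowed = WC | WOBBLE
    found = []
    for i in range(1, len(seq) - 1):
        j = partner.get(i)
        if j is None:
            continue
        # Flanking pairs must continue the same helix: (i-1):(j+1) and (i+1):(j-1).
        if partner.get(i - 1) != j + 1 or partner.get(i + 1) != j - 1:
            continue
        pairs = [(seq[a], seq[partner[a]]) for a in (i - 1, i, i + 1)]
        if all(p in allowed for p in pairs):
            found.append((i, seq[i - 1].lower() + seq[i] + seq[i + 1].lower()))
    return found

# Hypothetical hairpin: a five-base-pair stem closed by a UUCG loop.
sequence = "GGCACUUCGGUGCC"
stem = {0: 13, 1: 12, 2: 11, 3: 10, 4: 9}
pairing = {**stem, **{j: i for i, j in stem.items()}}
triplets = bp_triplets(sequence, pairing)
```

Residues that close the loop (positions 4 and 9 in this example) are not reported as triplet centres because one of their neighbors is unpaired; in the analysis described here such neighbors are instead recorded as attributes of the adjacent triplet.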

Fig. 1 a Definitions used for base pair triplets. The chemical shifts of the N(i) residue are analyzed in this work, and this strand may be preceded by a base-paired (WC or GU wobble) nucleotide (pre_n) or a non-base paired residue (5loop), or followed by a...

We chose a relatively conservative approach in modeling the effect of the neighborhood of each central base pair. This was done because there are still, especially in comparison to proteins, relatively few chemical shift assignment sets for RNA deposited at the BMRB. Rather than using any non-linear or neural network approach we used an approach similar to the chemical shift increment method of Pretsch as used in predicting spectra of small organic molecules (Pretsch et al. 2009). Thus, for the central residue of each WC-BP triplet, we defined the attributes describing the neighborhood of the central nucleotide as described above, and calculated the contribution that each attribute makes to the predicted chemical shift. The predicted chemical shift is then a base chemical shift plus the linear contribution of the value corresponding to each attribute present in that nucleotide’s environment. The contribution of each attribute was calculated by linear regression of the chemical shifts in our database of RNA chemical shifts with the set of explanatory variables represented by the neighborhood attributes. The constant term of our regression model corresponds to a nucleotide embedded in a triplet of Watson–Crick base pairs with a U (uridine) flanking it on both the 5′ and 3′ sides and Watson–Crick base-paired nucleotides at the 5′ and 3′ ends of the triplet.
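In symbols, the increment model described above can be written as follows. The notation is introduced here for illustration only and does not appear in the original text.

```latex
\delta_{\mathrm{pred}} = \delta_{0} + \sum_{k=1}^{K} c_{k}\, x_{k},
\qquad x_{k} \in \{0, 1\}
```

Here $\delta_{0}$ is the constant term of the regression (the reference environment with U on both sides and Watson–Crick pairs at both ends of the triplet), $x_{k}$ indicates whether attribute $k$ is present in the environment of the central nucleotide, and $c_{k}$ is the fitted contribution of that attribute.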

Our analysis included a total of 15 potential variables, Table 1, of which only some might potentially contribute significantly to the shift of a specific atom in a given central nucleotide. Because the approach includes a large number of independent variables relative to the chemical shift datasets, there was a significant danger of over-fitting using a conventional linear regression algorithm. Over-fitting can lead to excellent prediction of the training set, but poor predictive capability on novel datasets. To minimize the risk of over-fitting we chose an algorithm, Pace Regression (Projection Adjustment by Contribution Estimation), that is capable of assessing the importance of each of the parameters. Calculations were performed using the Weka Machine Learning and Data Mining Library system, which allowed us to perform a statistical analysis of the prediction model (Witten et al. 2011). Pace Regression is a linear regression system that uses various information criteria to assess the degree of importance of the regression variables (Wang and Witten 2002). Thus it provides one solution to the subset selection problem: which subset of a set of potential regressors is the appropriate set to explain the data, and thereby minimize the risk of overfitting and maximize the predictive capability on previously unseen data.

Use of Weka provided not only access to Pace Regression, but also various assessments of the quality of the predictions. In particular, we used 10-fold stratified cross-validation during our analysis. Rather than providing correlation coefficients and root mean squared (rms) deviations of the predictions using all the data in the prediction, this technique trains the model on 90 % of the data and then assesses the results of predicting the remaining 10 % of the data. The process is repeated 10 times, using a different subset of the data each time and derives the correlation coefficients and rms deviations based on the whole process. Pace regression was used independently on each atom type present in each of the four central nucleotides for a total of 19 regression calculations.
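The cross-validation procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses a plain (unstratified) k-fold split, a single hypothetical 0/1 attribute fitted by ordinary least squares in place of Pace Regression, and synthetic data; none of the numbers come from the BMRB.

```python
import math
import random

def fit(train):
    """Least-squares fit of shift = base + inc * flag for one 0/1 attribute."""
    off = [s for flag, s in train if flag == 0]
    on = [s for flag, s in train if flag == 1]
    base = sum(off) / len(off)
    inc = sum(on) / len(on) - base
    return base, inc

def cross_validated_rms(data, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, and pool the errors."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    sq_errors = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        base, inc = fit(train)
        sq_errors += [(s - (base + inc * flag)) ** 2 for flag, s in held_out]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Synthetic data: base shift 7.0 ppm, increment 0.5 ppm, Gaussian noise (sd 0.05 ppm).
rng = random.Random(1)
data = [(i % 2, 7.0 + 0.5 * (i % 2) + rng.gauss(0.0, 0.05)) for i in range(40)]
cv_rms = cross_validated_rms(data, k=10)
```

Because every prediction is made for data that the model did not see during fitting, the pooled rms deviation is a more conservative estimate of predictive accuracy than one computed on the training data.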

We were unable in our analysis to adequately identify and control for sample conditions (pH, temperature, ionic strength, etc.) and unusual molecular conformation, and there is a significant possibility of misassignment, especially of some atom types. Therefore, after dropping a single obvious major outlier, we minimized these effects by automatically trimming outliers and automatically adjusting the reference for the chemical shift sets. Automated outlier elimination was performed by running two passes of the Pace Regression for each atom/central nucleotide. In the first pass, the rms deviations between the experimental and predicted values were calculated using all of the data. Any data values that deviated from the predicted values by more than three times the rms deviation value were dropped, and a second pass of the Pace Regression was performed on the now trimmed dataset. Automatic re-referencing was achieved by performing the above analysis (including outlier detection) twice. In the first of these passes, the mean error of prediction was calculated for all the shifts from each BMRB file. Prior to the second pass, each shift was corrected by the mean deviation calculated for the corresponding BMRB file. The chemical shift corrections determined by this approach are listed in Table S3.
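The trimming and re-referencing steps can be sketched as below. This is a simplified illustration: the `predict` callable stands in for the fitted regression model, the refit between passes is omitted, and the record layout and entry names are hypothetical.

```python
import math
from collections import defaultdict

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def trim_outliers(records, predict, n_sigma=3.0):
    """Drop records whose residual exceeds n_sigma times the rms residual."""
    residuals = [r["shift"] - predict(r) for r in records]
    cutoff = n_sigma * rms(residuals)
    return [r for r, e in zip(records, residuals) if abs(e) <= cutoff]

def reference_offsets(records, predict):
    """Mean prediction error (observed - predicted) for each deposition."""
    errors = defaultdict(list)
    for r in records:
        errors[r["entry"]].append(r["shift"] - predict(r))
    return {entry: sum(e) / len(e) for entry, e in errors.items()}

def rereference(records, offsets):
    """Subtract each deposition's mean error from its shifts."""
    return [dict(r, shift=r["shift"] - offsets[r["entry"]]) for r in records]

# Hypothetical data: entryB is mis-referenced by +0.05 ppm; entryA has one gross outlier.
records = [{"entry": "entryA", "shift": s} for s in [7.01, 6.99] * 6]
records.append({"entry": "entryA", "shift": 8.50})
records += [{"entry": "entryB", "shift": s} for s in [7.06, 7.04] * 3]

predict = lambda r: 7.0  # stand-in for the regression model
kept = trim_outliers(records, predict)
offsets = reference_offsets(kept, predict)
corrected = rereference(kept, offsets)
```

In the full procedure the regression would be refit on the trimmed and re-referenced data before the next pass, so the offsets and the model coefficients are determined together rather than from a fixed predictor.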

The RNAShifts program was written using JTcl (http://jtcl.kenai.com) and Swank (http://swank.kenai.com), which are the Java implementations of the Tcl programming language and Tk graphical user interface toolkit (Ousterhout and Jones 2010). The analysis mode is run in three stages. The first loads BMRB files (fetching them from http://bmrb.wisc.edu if necessary), extracts chemical shifts, and then uses the input template to assign attributes to each shift. The second stage reads the output of the first stage and generates input files in the format used by Weka. The third executes Weka multiple times for each proton type, manages the two passes used for outlier detection and generates various statistical output files. The graphical interface module allows plotting predicted and experimental data subject to various criteria for choosing subsets of the data and attributes for plotting. The RNAShifts program is available upon request from the author (BAJ).
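The second stage writes files in Weka's ARFF format. The sketch below shows what such a file might look like for one atom type; the relation name, attribute names, and data rows are hypothetical and are not taken from RNAShifts.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, kind), where kind is "numeric" or a list of
    nominal values; rows: sequences of values in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"

# Illustrative attributes: flanking nucleotides, a terminal-pair flag, and the shift.
attributes = [
    ("prev", list("acgu")),
    ("next", list("acgu")),
    ("five_ter", ["0", "1"]),
    ("shift", "numeric"),
]
rows = [("g", "c", 0, 7.85), ("u", "u", 0, 7.03)]
arff_text = to_arff("A_H2", attributes, rows)
```

Writing one relation per atom type and central nucleotide would match the way the regressions are run independently for each proton.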

Results and discussion

Outlier chemical shifts

The statistical analysis described above identified 64 chemical shift assignments from the full BMRB database that, after automated re-referencing, deviated from expected values by more than 3 standard deviations. Seven of these assignments were associated with earlier publications from the M.F.S. laboratory, and inspection of the original NMR spectra revealed that these signals had been erroneously assigned (corrections to BMRB files 15113 and 17083 have now been made). We also discovered relatively large systematic chemical shift variations for one of our earlier depositions (BMRB ID 6094) that were associated with improper chemical shift referencing (the residual water signal at 35 °C was erroneously assigned a chemical shift of 4.792 ppm). We therefore updated the BMRB with the modified values, which were used in the present analysis. Based on examination of published NMR spectra, we were able to correct 19 additional assignments in the BMRB; in many cases, the signals had been properly assigned in the published spectra but improperly recorded in the BMRB files. In all cases, the re-assigned (or typo-corrected) shifts were well within the 3-standard deviation cutoff. We were unable to determine the nature of the deviations observed for the remaining 38 outliers because relevant regions of the NMR spectra were not provided in the original publications, and these 38 assignments were not used in subsequent analyses. The majority of these outliers were associated with ribose protons, of which 17 were for highly overlapping H2′ and H3′ proton signals. Thus, of the 3,796 available chemical shifts, 3,758 were retained for analysis and 38 (1 %; mostly ribose assignments) were excluded.

Chemical shifts that were either re-assigned or excluded are summarized in Supplementary Table S2, and referencing corrections employed for all of the utilized depositions are summarized in Supplementary Table S3. The final dataset included values for the central base pairs of all of the 4³ (64) possible combinations of WC-BP triplets, with as few as one, and as many as 23, assignments for each of the possible combinations. A total of 137 additional triplets that contain G:U base pairs were also included in the analysis. As shown in Fig. 1b, the retained and re-referenced BMRB shifts (δ) were in good agreement with predicted shifts (δpred) (rms deviation for the entire dataset = 0.056 ppm). Good agreement was also obtained when training was performed using a two-fold cross-validation analysis, in which half of the data were used for training and half for validation (rms = 0.069 ppm), and when training was performed with 60 % of randomly-ordered BMRB entries and validation assessed with the remaining 40 % of the data (rms = 0.063 ppm, averaged over all atom types).

Chemical shift trends for canonical triplets

The re-referenced NMR chemical shifts (δ) were generally in good agreement with the mean shifts calculated for each unique sequence/atom type (δ̄). For example, excellent correlations were observed in a plot of δ versus δ̄ for the central residues of “canonical triplets,” defined here as triplets that contain only GC and/or AU base pairs and are both preceded and followed by canonical GC or AU base pairs (rms deviation = 0.043), Fig. 1c. The database utilized does not contain chemical shift values for aAa and uCa canonical triplets, nor for the H2′ and/or H3′ protons of the following canonical triplets: aAu (H2′, H3′), uGa (H2′, H3′), aUu (H2′, H3′), gGu (H3′), aCc (H3′). (Note that data were available for non-canonical forms of these triplets and were included in the analysis.) There were no significant differences in correlation coefficients obtained upon fitting δ versus δ̄ for the A, G, C and U nucleotides, but as observed in plots of δ versus δpred, greater scatter was generally observed for the H2′ and H3′ protons, Fig. 2a.

1H NMR chemical shift trends were readily observed in plots that compare δ with the mean shift calculated for canonical triplets (δcan), and with the coefficients obtained with the Pace Regression analysis. Plots of δ versus δcan for the n-A-n canonical triplets are shown in Fig. 2, and data for the n-G-n, n-C-n and n-U-n canonical triplets are plotted in Fig. 3. The contributions of the attributes calculated by Pace Regression are plotted in Fig. 4. The observed trends are consistent with several generalized correlations identified by Wijmenga and co-workers (Cromsigt et al. 2001). For example, δ values for purine-H8 protons in canonical triplets are highly sensitive to the nature of the 5′-residue within the triplet, with 5′-purines associated with more upfield chemical shifts. We further observe that 5′-uridines induce a larger downfield H8 shift than 5′-cytidines (Figs. 2c, 3a), and that the H8 chemical shift is also sensitive to the nature of the 3′-residue, Figs. 2c and 3a. For example, the A-H8 δcan values observed for n-A-a canonical triplets are consistently downfield relative to those observed for n-A-g canonical triplets, Fig. 2c, and a similar trend is observed for n-G-a versus n-G-g triplets, Fig. 3a.

The adenosine-H2 proton is sensitive to the nature of both the 5′- and 3′-nucleotides (Cromsigt et al. 2001) and exhibits a large chemical shift range of ~6.4–8.0 ppm. Importantly, the simultaneous presence of a 5′-pyrimidine and 3′-purine is associated with a significant upfield A-H2 NMR chemical shift, to a less crowded region of the RNA NMR spectrum (6.4–7.1 ppm, Fig. 2b), where these signals are potentially useful for structural characterization of large RNAs (Lu et al. 2011). In contrast, significant downfield shifts are observed for the H2 protons of adenosines that are preceded by a purine and followed by a pyrimidine, Fig. 2b. The H5 protons of C and U are sensitive to the nature of the preceding residue of the triplet but exhibit almost no detectable sensitivity to the nature of the following residue, Fig. 3c, d. The pyrimidine H6 protons are also more sensitive to the nature of the 5′ residue, but exhibit some sensitivity to the 3′ residue as well (Fig. 3c, e). The ribose protons appear to be sensitive to the nature of both the 5′ and 3′ residues, although the limited chemical shift dispersion and uncertainties regarding some of the H2′ and H3′ assignments make it more difficult to identify clear chemical shift trends.

Influence of 5′- and 3′-terminal base pairs within the WC-BP triplet

The presence of 5′- and/or 3′-terminating base pairs within the WC-BP triplet has a significant influence on the chemical shifts of the central residue. As shown in Fig. 5a, the aromatic, H1′, H2′ and H3′ protons of the central residue exhibit small but significant downfield shifts relative to δcan values when adjacent to a 5′-terminating base-paired residue (the single H3′ outlier is most likely due to a misassignment or typo). The most significant perturbations are observed for the aromatic protons, which exhibit deviations in the range of 0.15–0.45 ppm. In contrast, most signals for residues that reside next to a 3′-terminal WC-BP exhibit smaller but nevertheless consistent upfield shifts relative to the δcan values, Fig. 5b. The most significant shifts are observed for H2′ protons which have a mean upfield shift of 0.2 ppm.

Influence of non-canonical elements adjacent to the WC-BP triplets

Our analysis assessed the influence of non-canonical structural elements that reside immediately upstream (5loop) or downstream (3loop) of the WC-BP triplets. We defined these elements to include internally stacked residues that are not involved in Watson–Crick base pairing, looped or bulge residues believed to be flexible or structured (e.g., K-turns), and residues involved in base-triples or long-distance RNA–RNA interactions. As shown in Fig. 5d, the presence of non-canonical RNA structures at the 3′-end of the WC-BP triplet does not appear to significantly influence any of the proton shifts associated with the central residue of the triplet. On the other hand, the presence of non-canonical structure on the 5′-side of the WC-BP triplet results in small but significant upfield shifts relative to δcan values for the aromatic and H1′ protons, Fig. 5c.

Influence of G:U base pairing within the triplet

Because GU base pairs are both prevalent and functionally important (Varani and McClain 2000), we also assessed the influence of this class of base pairing on 1H NMR chemical shifts. Systematic variations are apparent for some protons of the central U of triplets when they are base paired with G. Considering only canonical triplets in which the central U:A base pair is substituted by U:G, the H5 protons exhibit a downfield shift and the H1′ and H2′ protons exhibit small upfield shifts, whereas the H6 and H3′ chemical shifts appear to be relatively unperturbed, Fig. 6a. If the central residue of the canonical triplet is a G, base pairing with U results in a small downfield shift of the H2′ NMR signal and upfield shift of H3′ (relative to base pairing with C) but does not significantly affect the shifts of the other G protons, Fig. 6b.

Fig. 6 Plots showing the sensitivity of the 1H NMR chemical shifts to GU and UG wobble pairing within the canonical WC-BP triplet. a The central U of an otherwise canonical triplet is paired with G. b The central G of an otherwise canonical triplet is paired...

The presence of GU (or UG) base pairs at the n(i−1) or n(i+1) positions can significantly influence the signals of the central residue, and data for otherwise canonical triplets are shown in Fig. 6c–f. For triplets in which the central residue is a pyrimidine, the H1′ and H3′ signals are relatively unaffected by the presence of a preceding GU wobble, Fig. 6c. However, the H6, H5 and H2′ protons are systematically perturbed, with the u(wob)-U/C-n H6 signal shifted downfield, the g(wob)-U-n H6 signal shifted upfield, and the g(wob)-C-n C-H6 signal shifted downfield relative to the average canonical shifts, Fig. 6c. Interestingly, the u(wob)-C/U-n H5 shifts are relatively unperturbed relative to canonical shifts, whereas g(wob)-C/U-n H5 shifts are generally shifted downfield relative to the signals observed for the canonical triplets, Fig. 6c. Also, H2′ shifts of the central pyrimidine are shifted downfield when preceded by a UG wobble, but are shifted upfield when preceded by a GU wobble, Fig. 6c. When the central residue is a purine, the H1′ and H3′ proton shifts are relatively unaffected by a preceding wobble, but the H8, H2, and H2′ protons generally exhibit systematic downfield shifts, with the magnitude of the shift being somewhat greater for a preceding U(wob) compared to a preceding G(wob), Fig. 6d.

The presence of a subsequent GU wobble can also result in systematic chemical shift perturbations. For triplets in which the central residue is a pyrimidine followed by a U(wob) mismatch, the H6 and H3′ signals exhibit small upfield shifts but the remaining signals do not appear to be significantly perturbed, Fig. 6e. In contrast, the presence of a subsequent G(wob) mismatch does not appear to lead to any detectable perturbations, Fig. 6e. For triplets in which the central residue is a purine, a subsequent G(wob) leads to a small systematic downfield shift of the H1′ proton but does not significantly perturb the other NMR signals, whereas a subsequent U(wob) pair results in small upfield shifts of the H8 and H2 signals and a small downfield shift of the H2′ signal, Fig. 6f.

Chemical shift predictions

The Pace regression approach described above provided predicted chemical shift values for all possible combinations of WC-BP triplet parameters used in the present study, Table 1. 1H NMR chemical shifts observed for the canonical triplets are in good agreement with the shifts predicted using the Pace regression approach described above (δpred), Fig. 7a (rms deviation = 0.050). Excellent agreement was also observed for triplets that contained only a single modifying element (e.g., only a 5ter but no other non-canonical elements), with the greatest deviations observed for a few of the H2′ and H3′ assignments, Fig. 7b–h (rms deviation in the range 0.057–0.057). Good fits were also observed for triplets that contained more than one modifying element (rms deviation for all canonical and non-canonical triplets = 0.056), Fig. 7i. As observed in other fits, the largest deviations are observed for the H2′ and H3′ proton assignments.

Fig. 7 Plots of δ versus predicted chemical shift (δpred), calculated by Pace regression as described in the text. a–h Data for triplets that are fully canonical (a) (rmsd = 0.050) or include a single non-canonical element,...

The data in Table 1 can be used in computer programs such as NMRView (Johnson 2004; Johnson and Blevins 1994) or in ad hoc calculations to predict chemical shifts. The constant term represents the shift of the given atom in nucleotide i when the i − 1 and i + 1 nucleotides are both U and all nucleotides from i − 2 through i + 2 are present and in canonical Watson–Crick base pairs. For example, the predicted shift of an A-H2 proton in a canonical uAu triplet is 7.0299 ppm. The shift of the A-H2 proton in a different environment is calculated by adding to the constant term the contributions from any applicable columns in the A-H2 row of Table 1. For example, the chemical shift of an A-H2 proton in a gAc triplet, in which the i − 2 residue is in a loop, would be: 7.8469 ppm (7.0299 + 0.6899 + 0.0658 + 0.0622). If the i − 1 G is in a GU (rather than GC) base pair, the A-H2 proton chemical shift would be: 7.9217 ppm (7.0299 + 0.7637 + 0.0658 + 0.0622).
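The ad hoc calculation described above amounts to a single sum. The sketch below uses only the constant term and increments quoted in the text; the function and variable names are illustrative, and the increments are not labeled by attribute because the text does not state which Table 1 column each one comes from.

```python
A_H2_BASE = 7.0299  # constant term quoted in the text: A-H2 in a canonical uAu triplet (ppm)

def predict_shift(base, increments):
    """Predicted shift = constant term + sum of the applicable attribute contributions."""
    return base + sum(increments)

# gAc triplet with the i-2 residue in a loop (increments as quoted in the text).
gAc_loop = predict_shift(A_H2_BASE, [0.6899, 0.0658, 0.0622])

# Same environment, but with the i-1 G in a GU rather than a GC base pair.
gAc_loop_gu = predict_shift(A_H2_BASE, [0.7637, 0.0658, 0.0622])

print(f"{gAc_loop:.4f} ppm, {gAc_loop_gu:.4f} ppm")
```

Summing the quoted four-decimal values reproduces the totals given in the text to within about 0.001 ppm, which is far smaller than the rms deviations reported for the predictions.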

Conclusions

The present studies provide the first quantitative analysis of the RNA non-exchangeable 1H NMR chemical shifts in the BMRB. Our studies identify sequence-dependent chemical shift correlations and establish the influence of terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs). Excellent correlations were observed despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A relatively small number of outliers that were not utilized in the analysis, mainly ribose H2′ and H3′ assignments, are likely due to assignment or typographical errors and should be re-examined. Assignments for some triplet combinations were either limited or lacking; for example, the database does not include assignments for two of the 64 possible “canonical triplets.” Although shifts for these triplets could be predicted from assignments made for non-canonical triplets (e.g., WC-BP triplets adjacent to non-canonical structures or that contain terminal or GU base pairs), future studies of oligonucleotides with the missing sequences are clearly in order.

The statistics indicate that the protocol employed for chemical shift predictions, assigning attributes to different triplet environments and then conducting selection and linear model fitting with Pace Regression, performed very well for the data used in this study. However, as we move forward with this research and the number of attributes is expanded, alternative fitting methods such as Neural Networks and allowing attributes to contribute in non-linear modes may be required. The protocol used here can, of course, also be applied to nitrogen and carbon nuclei, and it will be interesting to determine if these nuclei exhibit similar environment- and structure-dependent sensitivities.

In the course of these studies, chemical shift trends were tentatively identified for a number of non-A-form helical structures that are well represented in the BMRB, particularly those of conserved tetraloops (e.g., GNRA). Future studies that include parameterizations for tetraloops, base triples, and other conserved and well-defined RNA sub-structures will likely lead to the identification of additional trends useful for 1H NMR assignment and verification. In addition, it should now be possible to incorporate the approach into software programs to enable semi-automated assignment of RNA, including large RNAs with different combinations of 2H-labeled or segmentally-labeled nucleotides (underway).

Acknowledgments

Support from the National Institute for General Medical Sciences (GM42561 to M.F.S. and GM103297 to B.J.) is gratefully acknowledged. S.B. was supported by grants from the National Institute of General Medical Sciences for enhancing minority access to research careers (MARC U*STAR 2T34 GM008663) and the Howard Hughes Medical Institute for enhancing undergraduate research training.

Abbreviations

BMRB: Biological Magnetic Resonance Data Bank

NDB: Nucleic Acid Database

PDB: Protein Data Bank

WC-BP: Watson–Crick base pair

G/g: Guanosine

N/n: Any nucleotide

R/r: Purine

Py/py: Pyrimidine

A/a: Adenosine

C/c: Cytidine

U/u: Uridine

δ̄: Mean NMR chemical shift

δcan: Mean NMR chemical shift determined for a canonical triplet, defined here as a stretch of three sequential canonical base pairs that is both preceded and followed by at least one canonical base pair