Pumilio and zinc-binding domains. (A) Human Pumilio1 in complex with RNA (PDB code: 1M8Y). (B) Complex structure of Tis11d (PDB code: 1RGO). (C) Zinc knuckle of the MMLV nucleocapsid protein in complex with RNA (PDB code: 1U6P). The proteins are shown as grey ribbons; individual protein side-chains are shown in green. Repeat 6 of Pumilio is represented by a red ribbon, the C-terminal zinc finger of Tis11d is represented as a light blue ribbon and the zinc coordinating side-chains in (B and C) are in red. The RNA molecules are in blue and yellow, individual phosphate atoms are shown as purple spheres. Intermolecular hydrogen-bonds are depicted as purple dashed lines. Figures were generated with MOLMOL (88).

The nucleocapsid protein of MMLV contains a 28 amino acid zinc knuckle (Arg16-Pro43) of the type CX2CX4HX4C. Several structures of this protein in complex with various ssRNA sequences have been determined (8,9). Although the zinc knuckle binds with highest affinity to a UCUG sequence, binding to other 4 nt sequences occurs as long as they contain a guanine at the 3′ end (9). As for Tis11d, two aromatic residues of the zinc knuckle are involved in RNA binding. Tyrosine 28 (between the first and second cysteine) stacks with U306 and contacts C307, and tryptophan 35 (between the histidine and the third cysteine) stacks with G309 (Figure 1C). Base-specific contacts to U306, C307 and U308 are mediated by several protein side-chains, while specific recognition of G309 is achieved by three hydrogen bonds involving the protein main-chain (Figure 1C) (8,9). Hence, the fold of this CCHC zinc knuckle appears to be specific for an NNNG ssRNA tetranucleotide, while side-chains decide on the preferred identity of the three 5′ nucleotides. Interestingly, a G-specific binding pocket is found in other CCHC zinc knuckles as well, even though the domain fold in these cases is different and a smaller number of nucleotides is bound (10–12).

Two structures of type I KH domains (15,17) and a structure of two type II KH domains (16), both in complex with ssRNA, have been determined. In addition, five type I KH domain structures in complex with ssDNA have been solved (18–21). In all these structures, the ssRNA or ssDNA is bound in a cleft formed by the GXXG loop, the two consecutive helices, the following β-strand (β2 for type I and β3 for type II) and the so-called ‘variable loop’ (the β2β3 loop in type I and the β3α2 loop in type II) (Figure 2). Each KH domain binds at least 4 nt (referred to as N1 to N4 in Figure 2 and Table 1). The first 3 nt, N1, N2 and N3, are spread on the surface of the domain. The base of N1 is stacked onto a peptide bond within α1 (α2 in type II) between a conserved glycine and the following residue, while N2 and N3 lie on a hydrophobic surface made up of two side-chains, one from α1 and one from β2 (α2 and β3 in type II), that act as a wedge between the 2 nt (not shown in Figure 2) (15–18,21). The backbone carbonyl oxygen and amide group of the same conserved hydrophobic residue in β2 are also hydrogen-bonded to the N3 base (Figure 2A and B). These two hydrogen bonds favour an adenine or a cytosine in the N3 position (Table 1). The conformation is further maintained by contacts between the sugar-phosphate backbone of N1 and N2 and the highly conserved GXXG loop, which run almost parallel to one another (Figure 2). In particular, the phosphate group between N1 and N2 is hydrogen bonded to the backbone amide of the third residue of the GXXG loop (not shown in Figure 2). Finally, N4 stacks over N3 and interacts with side-chains of β2 (β3 in type II) (Figure 2A and B).

Table 1. Register of the RNA or DNA sequences in complex structures of KH domain containing proteins

Position on the KH domain     5′ flank   N1   N2   N3   N4   3′ flank

Protein and sequences bound
Nova1 (15)                    AGA        U    C    A    C    C
SF1 (17)                      AUAC       U    A    A    C    AA
NusA KH1 (16)                 -          A    G    A    A    -
NusA KH2 (16)                 CU         C    A    A    U    A
hnRNPK KH3 (18)               -          C    C    C    C    -
hnRNPK KH3 (18)               -          T    C    C    C    -
PCBP2 KH1 (21)                -          A    C    C    C    -

Number of bases in each position
A                             -          2    2    4    1    -
C                             -          2    4    3    5    -
U/T                           -          3    0    0    1    -
G                             -          0    1    0    0    -
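The per-position base counts of Table 1 can be re-derived from the bound sequences. The following is a minimal sketch, assuming the N1 to N4 register given in the dictionary below; that register is an assumption inferred from the counts and sequences listed in Table 1, and the names `KH_CORE` and `count_bases` are hypothetical.

```python
from collections import Counter

# N1-N4 tetranucleotides for each KH domain. This register is an assumption
# inferred from the per-position counts in Table 1, not a published alignment.
KH_CORE = {
    "Nova1": "UCAC",
    "SF1": "UAAC",
    "NusA KH1": "AGAA",
    "NusA KH2": "CAAU",
    "hnRNPK KH3 (first entry)": "CCCC",
    "hnRNPK KH3 (second entry)": "TCCC",
    "PCBP2 KH1": "ACCC",
}


def count_bases(cores):
    """Return one Counter per position (N1..N4), pooling U and T."""
    counts = [Counter() for _ in range(4)]
    for seq in cores.values():
        for i, base in enumerate(seq):
            counts[i]["U/T" if base in "UT" else base] += 1
    return counts


if __name__ == "__main__":
    for i, c in enumerate(count_bases(KH_CORE), start=1):
        print(f"N{i}:", dict(c))
```

With this register, N3 is occupied only by adenine or cytosine, in line with the preference described for the two backbone hydrogen bonds.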

Outside the canonical binding of these 4 nt, binding of additional nucleotides is mediated either by the variable loops [e.g. Nova-1 (15) and SF1 (17)] or by an extension of the domain (e.g. the long helix 3 in Nova-1, Figure 2A). In SF1, the presence of an additional small domain (QUA2 domain) C-terminal to the KH domain allows the binding of three additional nucleotides (Figure 2C) (17). Finally, in NusA, the juxtaposition of two type II KH domains leads to binding of two additional nucleotides (Figure 2D) (16).

RRM domains. (A) The RRM of Fox-1 (PDB code: 2ERR). (B) RRM3 of PTB (PDB code: 2ADC). (C) The tandem RRMs of Sex-lethal (PDB code: 1B7F). (D) RRMs 3 and 4 of PTB (PDB code: 2ADC). The proteins are depicted as grey ribbons, except for the C-terminal RRMs of Sex-lethal and PTB, which are in light blue, and the fifth β-strand of PTB RRM3 and the interdomain linkers, which are in red. Individual side-chains that contact the RNA are represented by green sticks. The RNA nucleotides N1 and N2 are shown in yellow and purple, respectively. Other nucleotides are in blue. Individual hydrogen bonds are shown as purple dashed lines. Figures were generated with MOLMOL (88).

In most RRM complexes, 1 or 2 nt are bound in addition to this dinucleotide. N0, the nucleotide 5′ to N1, is either bound to β4 (8 RRMs, see PTB RRM3 in Figure 3B) or resides in a binding pocket formed by the β1α1 and β2β3 loops (6 RRMs, see Fox-1 RRM in Figure 3A). N3, the nucleotide 3′ to N2, is frequently found in contact with the RRM but can be bound in several different locations. For example, in 5 RRMs, N3 stacks with N2 and is recognized by the protein region C-terminal of the RRM, while in another 4 RRMs, N3 resides on the β2 strand (see Fox-1 RRM in Figure 3A). Hence, like the KH domain and the zinc-binding domains, a typical RRM contains 4 nt binding sites (Table 2).

Table 2. Register of the RNA or DNA sequences in complex structures of RRM domain containing proteins

Position on the RRM domain: N−3, N−2, N−1, N0, N1, N2, N3

Protein and sequence bound (5′ to 3′)
U1A (24,30): A U U G C A C
Sex-lethal RRM1 (26): U U U U U U U
Sex-lethal RRM2 (26): U G U
PABP RRM1 (25): A A A A
PABP RRM2 (25): A A A A
U2B″ (31): A U U G C A G U
hnRNPA1 RRM1 (36): T A G G
hnRNPA1 RRM2 (36): T T A G G
Nucleolin RRM1 (27,35): C G A
Nucleolin RRM2 (27,35): U C C
HuD RRM1 (33): U U A U U U
HuD RRM2 (33): U U
HuD RRM2 (33): U A U
CBP20 RRM (28,34): G N
PTB RRM1 (29): U C U
PTB RRM2 (29): C U N
PTB RRM3 (29): U C U N N
PTB RRM4 (29): U C N
Fox-1 RRM (23): U G C A U G U
hnRNPD RRM (37): T A G G

Number of bases in each position
A: 3, 6 (1 syn), 4, 3
C: 0, 7, 1, 2
U/T: 11, 4, 5, 5
G: 2, 2, 5 (all syn), 4 (1 syn)

In addition to this canonical RNA binding surface, binding sites for another three nucleotides 5′ to N0 are found in the RRMs of U1A (30), U2B″ (31), Sex-lethal RRM1 (26), HuD RRM1 (33) and Fox-1 (23) (Table 2). In all these complexes, RNA binding of these nucleotides is mediated by loops β1α1, β2β3 and α2β4. Nevertheless, the structures adopted by these nucleotides reveal three different topologies. In U1A and U2B″, N−2 stacks over N−3 and N0 stacks over N−1 with almost a 90° angle between the two stacks, while in Sex-lethal and HuD only N−1 and N−2 stack (Figure 3C), and finally, in Fox-1, no intra-RNA stacking is found but a base pair between N−2 and N0 is formed (Figure 3A). In Sex-lethal and HuD, a tyrosine in the first position of the β1α1 loop stacks with N−3, and in Fox-1, a phenylalanine in the third position of the β1α1 loop stacks with both N−3 and N−1 (Figure 3A), whereas in U1A and U2B″, no aromatic rings are found in this loop. Thus, it appears that, as on the surface of the β-sheet, aromatic rings in the β1α1 loop can shape the structure of the RNA. Interestingly, in the case of Fox-1, binding mediated by the β-sheet and by the loops is independent, since phenylalanine to alanine mutations in either the loop or the β-sheet abolish binding to one site, but not the other (23).

Binding of additional nucleotides 3′ to N3 is much less common and has so far only been observed for U2B″ (31) and PTB RRM2 and RRM3 (29) (Figure 3B) (Table 2). The additional nucleotides (two for U2B″ and PTB RRM3 and one for PTB RRM2) are bound beyond the β2 strand. In U2B″, binding is mediated by the β2β3 loop and the N-terminus of helix 1; in PTB RRM2 and RRM3, it is achieved by the β2β3 loop and the loop between β4 and an additional β5 strand unique to these two RRMs (Figure 3B). These additional RNA-binding sites originate from extensions of the RRM: an additional β-strand for PTB RRM2 and RRM3 and an elongated α-helix 1 for U2B″.

Sequence-specific versus non-sequence-specific ssRNA-binding proteins

Examination of these sequence-specific ssRNA-binding domains reveals a few common structural features. The binding surface of the protein is primarily hydrophobic in order to maximize intermolecular contact with the bases of the RNA. The RNA bases are usually spread on the surface of the protein domains while the RNA phosphates point away toward the solvent. Only a few intramolecular RNA stacking interactions are observed, while many intermolecular stacking interactions, often mediated by aromatic amino acids, are observed (with the notable exception of the KH domain). This mode of binding contrasts with how non-sequence-specific RNA binding proteins recognize ssRNA. For example, in the structures of RNA polymerases bound with DNA–RNA hybrids (41,42) and in the recently determined structures of the DEAD-box protein Vasa (43) and of two viral nucleoproteins (44,45) bound with ssRNA, RNA binding is mostly mediated by positively charged side-chains that contact the sugar-phosphate backbone of the RNA (Figure 4). As a consequence, the RNA bases are exposed to the solvent and are stacked with neighbouring RNA bases rather than with protein side-chains.

Figure 4

Structures of (A) the DEAD-box protein Vasa (43) and (B) the rabies virus nucleoprotein (44), two non-sequence-specific ssRNA binding proteins, in complex with RNA (PDB codes: 2DB3 and 2GTT). The protein is shown as a grey ribbon and the RNA is in dark blue or in colour (yellow, green and red) with the phosphate atoms shown as purple spheres. The ATP analogue AMPPNP is shown in orange.

THE INTERMOLECULAR INTERACTIONS RESPONSIBLE FOR ssRNA BINDING

Aromatic interactions of the RNA bases

π−π Interactions

A common feature of complexes of proteins with ssRNA is the so-called ‘stacking’ of aromatic moieties. In such a stack, the planes of the aromatic rings are in parallel orientation with an average distance of ∼3.3 Å between the planes (46). At protein–RNA interfaces, stacks can be either intermolecular, i.e. formed by rings of the nucleic acid bases with the aromatic side-chains of phenylalanine, tyrosine, tryptophan and histidine, or within the RNA, involving two or more bases. In the zinc-binding domains mentioned above, for example, only intermolecular stacking is observed (Figure 1B and C). In RRMs, on the other hand, both intra-RNA and intermolecular stacking are frequently encountered, e.g. N2 often stacks simultaneously on an aromatic protein side-chain and with N3 [see U1A Phe56, A11 and C12 (30) in Figure 5C]. Finally, in KH domains, only intra-RNA stacking has so far been observed (see N3 and N4 in Figure 2).
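The two geometric criteria for a stack, near-parallel ring planes and an interplanar distance of roughly 3.3 Å, can be checked from atomic coordinates. The sketch below is illustrative only: the function names are hypothetical, the ring normal is taken from the first three ring atoms (which assumes a planar ring), and the test rings are idealized hexagons rather than coordinates from any PDB entry.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit_normal(ring):
    """Unit normal of the ring plane, from its first three atoms (planar ring assumed)."""
    n = cross(sub(ring[1], ring[0]), sub(ring[2], ring[0]))
    length = math.sqrt(dot(n, n))
    return tuple(x / length for x in n)


def stacking_geometry(ring_a, ring_b):
    """Return (angle between ring planes in degrees, distance of ring_b centroid from plane of ring_a)."""
    na, nb = unit_normal(ring_a), unit_normal(ring_b)
    cos_angle = min(1.0, abs(dot(na, nb)))  # clamp against rounding above 1
    angle = math.degrees(math.acos(cos_angle))
    distance = abs(dot(sub(centroid(ring_b), centroid(ring_a)), na))
    return angle, distance


def hexagon(z, radius=1.4):
    """Idealized six-membered ring in a plane of constant z (illustrative geometry)."""
    return [(radius * math.cos(math.radians(60 * k)),
             radius * math.sin(math.radians(60 * k)), z) for k in range(6)]


if __name__ == "__main__":
    angle, distance = stacking_geometry(hexagon(0.0), hexagon(3.3))
    print(f"interplanar angle {angle:.1f} deg, distance {distance:.2f} A")
```

For real structures, a least-squares plane through all ring atoms would be more robust than three atoms; the three-atom normal is used here only to keep the sketch short.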

Figure 5

The energies associated with intermolecular stacking interactions. (A) Stacking of U11 and A9 on top of Tyr85 in the MS2 coat protein complex and the effect of Tyr85 mutants on affinity and binding free energy. (B) Contacts between Phe126 and U1, G2 and C3 in the Fox-1 complex and the changes in affinity and binding free energy upon mutating Phe126. (C) Stacking contacts at the U1A RNA binding interface and energetic effects of mutating Phe56. RNA bases are shown in yellow, protein side-chains in green and intermolecular hydrogen bonds as red dashed lines. The table shows dissociation constants (KDs), ratios of KDs and corresponding differences in binding free energy (ΔΔG). Data are taken from (23,50,51). PDB accession codes are 1ZDI, 2ERR and 1URN. Figures were generated with MOLMOL (88).

Additionally, in the cases of Fox-1, U1A and MS2 coat protein, mutant proteins have been studied in which the stacking amino acid was replaced by either another aromatic residue or various other side-chains (Figure 5). A general trend is apparent from these measurements. Replacement by another aromatic side-chain generally leads to a fairly small loss in binding affinity. However, this small loss of affinity is always present in these complexes, indicating that the binding pockets have been optimized evolutionarily for a particular aromatic side-chain such that, for example, the hydroxyl group of tyrosine might be required in one case (MS2 coat protein, where it makes a hydrogen bond to a phosphate group in the RNA) and might be sterically disfavoured in another case (Fox-1) (Figure 5A and B). However, an aromatic side-chain always provides higher affinity than replacement by non-aromatic side-chains. Leucine seems to play an intermediate role, being an amino acid with a fairly large van der Waals interaction surface and being sterically similar to the aromatic side-chains. Cysteine and serine mutants also have intermediate binding affinities in the MS2 coat protein, which might be due to the fact that they can hydrogen bond with the RNA (Figure 5A). The largest loss in affinity occurs when the entire side-chain is removed, i.e. in the alanine mutants (23,50,51).

Cation–π interactions

Another protein side-chain that can be found to make stacking interactions with RNA bases is the guanidino group of arginine residues. The guanidinium moiety is protonated at physiological pH, which leads to a planar, positively charged, resonance-stabilized structure capable of engaging in stacking interactions. Interestingly, statistical analyses hint at a sequence preference for arginine stacking with the order of preference being U, A, C > G (49,54). Energetically, in the case of the positively charged guanidinium group, electrostatic interactions play an important role in the attractive forces (55). Consequently, a larger spectrum of angles between the planes is observed as compared to the stacking of neutral species. In fact, in analyses of protein structures and ATP-binding proteins, almost all possible angles between the planes of arginine and aromatic side-chains or adenine bases could be found (55,56). Nevertheless, the parallel and the T-shaped orientation seem to represent energy minima (55). Hence, van der Waals forces as well as electrostatic forces between the electron-negative center of the aromatic ring and the positively charged side-chain (cation–π interactions) play a role in arginine-base interactions. The parallel conformation, however, can have the additional energetic advantage of a better hydrogen-bond network with the surroundings. Other cation–π interactions at the protein–RNA interface involve interactions between the RNA bases and lysine and even histidine residues, as histidine can be either neutral or positively charged at physiological pH, depending on its chemical environment within the complex. For lysine, the interaction is dominated by electrostatic forces, whereas van der Waals terms play a negligible role (57).

Cation–π interactions are a very common feature of nucleic acid recognition. In statistical analyses of protein–DNA complexes and ATP-binding proteins, cation–π interactions are seen in more than half of the known structures (56,58). This is also true for protein–RNA complexes, the most striking example being the recently determined structure of a splicing endonuclease where a bulged adenine near the cleavage site is found sandwiched between two arginines (Figure 6A) (59). In the ssRNA-binding domains described above, interactions between arginine side-chains and RNA bases can be seen, for example, in Pumilio repeat 3 (Figure 1A) and in all RRMs of PTB in complex with pyrimidine tracts (5,29). Furthermore, a lysine–adenine interaction has been shown to be important for RNA binding by SF1 (17) (see its interaction with N2 in Figure 2C), a lysine stacking on top of a base was found in many RRMs including PTB (Figure 3B), and histidines are commonly found as stacking partners on RNA-binding proteins (Figures 1 and 3).

Figure 6

Arginine and peptide bond stacking. (A) General view and close-up view of the splicing endonuclease in complex with RNA (PDB code: 2GJW). At the splicing endonuclease active site, A13 is sandwiched between two arginine side-chains. (B) In the Nova KH domain, N1 stacks on a peptide bond within α1. (C) The N0 nucleotide stacks on a peptide bond that lies at the end of β1 of the RRM of hnRNP A1. The colour scheme is as in Figures 2 and 3. PDB accession codes are 1EC6 (Nova) and 2UP1 (hnRNPA1). Figures were generated with MOLMOL (88).

Other π interactions

The amino groups of asparagine and glutamine are also frequently found to be in contact with aromatic moieties. Again, there are two possible interaction modes. Either the amino group is oriented perpendicularly to the aromatic ring, pointing a δ+ hydrogen atom towards the electron-rich aromatic ring and forming what is in essence a hydrogen bond, or the planar sp² nitrogen stacks on top of the aromatic ring due to favourable van der Waals energies, as is seen, e.g. in Pumilio repeat 6 (Figure 1A) or for the RRMs of U1A, U2B″ and PTB RRM 1 and 4 (5,29–31). Calculations suggest that these unusual hydrogen bonds are rather weak compared to conventional hydrogen bonds, and an analysis of amino–π interactions in protein structures, as well as in structures of adenine binding proteins, shows that the parallel conformation is generally preferred (57,60,61). Again, this could be due to the fact that the parallel conformation allows the amino-bearing side-chains to engage in a larger number of conventional, energetically more favourable hydrogen bonds.

Finally, even peptide bond planes can serve as stacking platforms. In KH domains, the N1 residue stacks on the peptide bond between a conserved glycine and the following residue within an α-helix (Figure 6B), whereas in several RRMs, the N0 nucleotide stacks on a peptide bond between a glycine and the following residue within a β-strand (26,33,36,37) (Figure 6C).

Electrostatic interactions

Electrostatic attraction, the attractive force between two particles of opposite charge, plays a crucial role in protein–nucleic acid interactions, as nucleic acids are highly negatively charged molecules. For many proteins that bind to double-stranded DNA or RNA molecules, there are extensive positively charged patches on the protein surface so that it is often fairly easy to predict where the nucleic acid will bind from the protein structure alone (Figure 7A). Furthermore, in the recognition of RNA molecules with a characteristic tertiary structure, electrostatic interactions can play a role in specific recognition of their shape (63,64). Sequence-specific protein contacts to single-stranded nucleotides, on the other hand, commonly occur via the accessible nucleic acid bases, while the phosphate moieties point towards the bulk solution. Hence, the protein surface that contacts the nucleotide is often not extensively positively charged but rather hydrophobic, and direct contacts to the nucleic acid backbone can be rare (Figure 7B). Nevertheless, some studies have shown that even in these cases, electrostatic interactions play a highly important role in binding of the RNA (23,65,66). However, since the distribution of charges on an ssRNA is independent of its sequence, they are not important in providing sequence-specificity (53).

Figure 7

Surface potential of RNA binding proteins. Blue areas indicate a positive potential, red areas a negative potential. (A) Vts1, a protein that recognizes a structured RNA loop. The RNA binding surface of the protein is a highly positive patch. (B) Fox-1 RRM, which binds ssRNA. Positive and negative potentials surround the RNA and the area where most contacts are made is primarily apolar. Figures were generated with PyMOL (http://www.pymol.org) and the surface potential was calculated according to (89). PDB accession codes are 2ESE and 2ERR.

Two methods are typically employed to test the contribution of electrostatic interactions to a biomolecular binding process. Either charged groups are removed from the binding partners (usually by site-directed mutagenesis of charged amino acids or by varying the number of phosphate groups in an oligonucleotide) or the salt dependence of the dissociation constant is measured. If the binding is favoured by electrostatic attraction, increasing the salt concentration of the buffer will reduce affinity. The first approach has revealed, for example, that at 10 mM NaCl, the nucleocapsid zinc knuckle of MMLV shows ∼250 times higher affinity for a UCUG sequence if it carries a phosphate group at the 5′ end and prefers UAUCUG-P over UAUCUG by a factor of ∼2.5 (9). Furthermore, lysine to alanine mutations of residues that are close but not in hydrogen-bond contact to the RNA backbone in U1A reduce the affinity for U1hpII ∼15- to 40-fold at 150 mM NaCl (66). Finally, increasing the number of phosphate groups of cap analogues increases their affinity for eukaryotic translation initiation factor 4E (eIF4E) by ∼6-fold per phosphate group, or even more when comparing m7GMP to m7GDP (67). The second approach shows that in the case of the Fox-1 complex, binding at 150 and 75 mM NaCl is ∼70 and 500 times stronger, respectively, than at 600 mM (23) (Table 4). Similarly, a ∼80-fold decrease of affinity was determined for the U1A U1hpII interaction when the salt concentration was increased from 150 to 500 mM NaCl (65) (Table 4). A particularly thorough way of testing the contribution of individual positive amino acids is a combination of the two methods: the charged amino acid side-chain is mutated and the salt dependence of the affinity of mutant and wild-type is compared (65–69). Studies of this kind can provide information about the exact electrostatic contributions of individual charged residues to RNA binding.
In conclusion, all the above measurements show that even for ssRNA-binding proteins, electrostatic interactions strongly contribute to the overall affinity. However, the exact contribution of a particular charged group depends on its location in the complex. Interestingly, close proximity of a charged side-chain to a phosphate of the RNA backbone does not necessarily correspond to a strong contribution, as other factors such as flexibility or solvent accessibility play a role; and vice versa, some charged residues that are rather far away from the RNA can still have a strong electrostatic effect on binding (68,69).

The favourable free energy for binding of protein to RNA is believed to originate mainly from an entropic effect. When the binding partners are free in solution, the charges on their surfaces attract counterions that are released into bulk solution when the macromolecules bind to one another and find the countercharges on the surface of the binding partner. The polyanion RNA has a very high charge density and therefore buffer cations are thought to condense on its surface (counterion condensation theory). Binding of a protein that carries positive charges will release some of these cations from the high local concentration around the RNA so that they will fall down a concentration gradient into bulk solution. The bulk salt concentration determines the size of this gradient and hence the entropy gain associated with the binding event will be greater at low buffer salt concentrations [reviewed in (47)].
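The salt dependence quoted above for the Fox-1 and U1A complexes can be condensed into a single number, the slope of log affinity versus log salt concentration. The sketch below computes that slope from the fold-changes given in the text; the function name `loglog_slope` is hypothetical, and reading the slope as an approximate count of released counterions is an assumption borrowed from standard polyelectrolyte analyses, not a statement made in the cited studies.

```python
import math


def loglog_slope(conc_low_mM, conc_high_mM, affinity_ratio):
    """Slope -d(log K_A)/d(log [salt]) between two salt concentrations.

    affinity_ratio: how many times tighter binding is at the lower
    salt concentration than at the higher one.
    """
    return math.log10(affinity_ratio) / math.log10(conc_high_mM / conc_low_mM)


# Fold-changes taken from the text; concentrations are NaCl in mM.
fox1_150_vs_600 = loglog_slope(150, 600, 70)
fox1_75_vs_600 = loglog_slope(75, 600, 500)
u1a_150_vs_500 = loglog_slope(150, 500, 80)

if __name__ == "__main__":
    print(f"Fox-1, 150 vs 600 mM: {fox1_150_vs_600:.2f}")
    print(f"Fox-1,  75 vs 600 mM: {fox1_75_vs_600:.2f}")
    print(f"U1A,   150 vs 500 mM: {u1a_150_vs_500:.2f}")
```

Both Fox-1 intervals give a slope close to 3, which suggests that the log-log relationship is roughly linear over this salt range.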

Kinetics

Interestingly, kinetic measurements on ssRNA binding have shown that the salt dependence of the association rate constant kon is larger than that of the dissociation rate constant koff, suggesting that electrostatic interactions in ssRNA recognition are largely long-range effects (23,65,66) (see Fox-1 and U1A wild type in Table 4). Opposite charges on protein and RNA lead to a strong attraction, but once the RNA is bound, the complex seems to be stabilized primarily by other factors, as the salt dependence of the koff is rather small, albeit present (23,65,66) (Table 4). In this context, it is also interesting to estimate the kon at zero ionic strength. For the Fox-1/RNA complex, extrapolation of a curve of log kon versus the ionic strength suggests a kon of ∼10¹⁰ M⁻¹ s⁻¹ in the absence of salt (23). This is as high as the maximum rate constant for collision of molecules in aqueous solutions, the diffusion-limited association rate (70). Biomolecules usually have association rates that are considerably smaller, because not every collision leads to a productive encounter [(71) and references therein]. For binding of ssRNA, however, long-range electrostatic attraction and steering (the pre-orienting of binding partners that enhances the rate of productive encounters) seem to allow association rates that reach the diffusion limit. This behaviour has also been observed for protein–protein complexes like the Barnase/Barstar complex in which electrostatics play a highly important role in the recognition process (72,73). Furthermore, for the U1A/U1hpII complex, mutations of lysine side-chains to alanine or glutamine show a slightly reduced salt dependence of the association rate constant kon, while the salt dependence of the kon of lysine to arginine mutants is similar or even higher as compared to the wild-type protein. For a triple-glutamate mutant, the effect is actually reversed and high salt allows a faster association (65) (Table 4).
This confirms the importance of these side-chains for electrostatic attraction of the RNA.

Many ssRNA-binding proteins recognize sequences that are presented in loops. Laird-Offringa and co-workers (74) have evaluated the association and dissociation differences between U1hpII, in which the U1A binding sequence is presented in a loop, and an RNA containing the same binding sequence in an ssRNA of equal length. The effect on kon is moderate (∼3-fold), while the effect on koff is substantial (590-fold). Hence, the overall loss in affinity is close to 2000-fold. This might reflect the higher entropy loss when an ssRNA as compared to a stem–loop is bound. Additionally, however, there are certain stabilizing interactions with the stem that might be lost when binding the single-stranded target (74).
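The overall affinity loss follows from the relation KD = koff/kon: a 3-fold slower association and a 590-fold faster dissociation multiply. The sketch below makes this arithmetic explicit. The absolute rate constants are hypothetical placeholders chosen only for illustration; only the two fold-changes come from the text.

```python
def dissociation_constant(kon, koff):
    """K_D = koff / kon (in M), with kon in M^-1 s^-1 and koff in s^-1."""
    return koff / kon


# Hypothetical reference rate constants for the stem-loop complex.
# These values are illustrative assumptions, not measured data.
KON_LOOP = 1.0e7     # M^-1 s^-1
KOFF_LOOP = 1.0e-3   # s^-1

# Fold-changes reported for the single-stranded target (74).
KON_FOLD_DECREASE = 3
KOFF_FOLD_INCREASE = 590

kd_loop = dissociation_constant(KON_LOOP, KOFF_LOOP)
kd_linear = dissociation_constant(KON_LOOP / KON_FOLD_DECREASE,
                                  KOFF_LOOP * KOFF_FOLD_INCREASE)
fold_loss = kd_linear / kd_loop

if __name__ == "__main__":
    print(f"overall affinity loss: {fold_loss:.0f}-fold")
```

The product is 1770, which the text rounds to close to 2000-fold; the result does not depend on the placeholder rate constants because they cancel in the ratio.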

Conventional hydrogen bonds

In proteins, the side-chains of tryptophan, lysine and arginine can act as hydrogen-bond donors, aspartate and glutamate can act as hydrogen-bond acceptors, and tyrosine, serine, cysteine, threonine, asparagine, glutamine and histidine can act as both donors and acceptors. Furthermore, each amide linkage in the protein backbone includes a hydrogen-bond donor (NH) and a hydrogen-bond acceptor (C=O). Each RNA base comprises both hydrogen-bond donors and acceptors which are characteristic of each base. The purine bases, for example, can be easily differentiated as adenine features a donor, an acceptor and a CH group at ring positions 6, 1 and 2, respectively, while guanine has an acceptor, donor, donor pattern at the same positions. Similarly, pyrimidines can be discriminated as cytosine comprises an acceptor and a donor at positions 3 and 4, respectively, while uracil has the opposite arrangement.
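The donor and acceptor patterns described above can be written down as a small lookup table, which makes the discrimination logic explicit. This is a minimal sketch restricted to the ring positions named in the text; the names `BASE_PATTERNS` and `bases_matching` are hypothetical, and real recognition involves further groups and geometric constraints that are ignored here.

```python
# Role of the group at each ring position, as described in the text.
BASE_PATTERNS = {
    "A": {6: "donor", 1: "acceptor", 2: "CH"},
    "G": {6: "acceptor", 1: "donor", 2: "donor"},
    "C": {3: "acceptor", 4: "donor"},
    "U": {3: "donor", 4: "acceptor"},
}

# A protein donor pairs with a base acceptor and vice versa.
COMPLEMENT = {"donor": "acceptor", "acceptor": "donor"}


def bases_matching(protein_groups, candidates):
    """Return the candidate bases that complement the protein groups.

    protein_groups maps a base ring position to the role ('donor' or
    'acceptor') of the protein group facing that position.
    """
    hits = []
    for base in candidates:
        pattern = BASE_PATTERNS[base]
        if all(pattern.get(pos) == COMPLEMENT[role]
               for pos, role in protein_groups.items()):
            hits.append(base)
    return hits


if __name__ == "__main__":
    # A protein acceptor facing position 6 selects the purine with a donor there.
    print(bases_matching({6: "acceptor"}, "AG"))
    # A protein acceptor facing position 4 selects the pyrimidine with a donor there.
    print(bases_matching({4: "acceptor"}, "CU"))
```

A single well-placed protein group is enough to tell the two purines or the two pyrimidines apart in this simplified picture, because the patterns are mirror images at the positions listed.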

The contribution of a hydrogen bond to sequence-specificity can be estimated by disrupting individual intermolecular hydrogen-bonds by either mutating the hydrogen-bonding side-chains of the protein or by using modified ligands in which individual donor or acceptor groups have been removed. Early studies of this kind on tyrosyl-tRNA synthetase/substrate complexes yielded stabilizing energies of 2.1–6.3 kJ/mol for neutral hydrogen bonds, and ∼15–19 kJ/mol for hydrogen bonds in which one partner is charged (76). For neutral hydrogen bonds, this corresponds to a factor of ∼2–15 in specificity, i.e. a ligand that engages in a particular hydrogen bond binds ∼2–15 times more tightly than a ligand that cannot form this hydrogen bond. Similar energies have been measured recently for hydrogen bonds at the interfaces of protein–ssRNA complexes. For the N-terminal RRM of U1A, for example, elimination of a single, neutral, intermolecular hydrogen-bond by using different adenine analogues resulted in free energy differences of ∼4.6–10.5 kJ/mol (51) (Table 3). Similarly, disrupting one and two neutral hydrogen bonds in the Fox-1/RNA complex gave ΔΔG values of 3.9–5.2 kJ/mol and 13 or 14 kJ/mol, respectively, while disruption of four intermolecular hydrogen bonds, including a charged one to an arginine side-chain, resulted in an elevation of the free energy of the complex of 19 kJ/mol (23) (Table 3). The interpretation of affinity constants measured when several hydrogen bonds that recognize one base are disrupted can be tricky, however, since in these cases the base and the protein side-chains in the complex might rearrange. Nevertheless, these data show that individual neutral hydrogen bonds at protein–RNA interfaces are worth 4–10 kJ/mol and hence can sometimes have only small effects on specificity. A whole hydrogen-bond network, however, gives a substantial contribution to binding affinity differences between different RNA sequences and hence to sequence-specificity.
It should be kept in mind, however, that the energies measured are not the energies of hydrogen bonds themselves, but rather ‘discrimination energies’ between a complex that features a particular hydrogen bond and a complex that does not (77). Hydrogen-bond interactions in an aqueous surrounding always have to be considered as exchange reactions: hydrogen bonds to water are given up for hydrogen bonds in the complex. This is the reason why they are often associated with rather small energies. Why they are associated with favourable energies at all has been attributed to the fact that upon formation of an intermolecular hydrogen bond, the water molecules that were hydrogen bonded to the donor and acceptor groups of protein and RNA are released into bulk solution, which is entropically favoured (76,77). However, part of the reason might also be that the strength of a hydrogen bond depends on the hydrophobicity of the environment. Hydrogen bonds in the hydrophobic core of a protein seem to be associated with significantly higher energies than those in more accessible parts of the protein (78). Hence, H-bonds that are buried at the protein–RNA interface might be enthalpically more favourable than those to water. Furthermore, a statistical analysis shows that there exist strong geometrical preferences for hydrogen-bonds at protein–RNA interfaces, which in turn suggests that the precise energy of a hydrogen bond depends strongly on the exact relative orientation of donor and acceptor (79). Hence, exact complementarity is required for effective binding, which in turn enhances sequence-specificity.
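The link between a discrimination energy and a specificity factor is the standard relation ΔΔG = RT ln(KD ratio). The sketch below converts in both directions; the temperature of 298.15 K is an assumption, since the text does not state the temperature of the measurements, and the function names are hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def fold_from_ddg(ddg_kj_per_mol, temperature_k=298.15):
    """Ratio of dissociation constants corresponding to a given ΔΔG."""
    return math.exp(ddg_kj_per_mol / (R_KJ * temperature_k))


def ddg_from_fold(fold, temperature_k=298.15):
    """ΔΔG in kJ/mol corresponding to a given ratio of dissociation constants."""
    return R_KJ * temperature_k * math.log(fold)


if __name__ == "__main__":
    # Neutral hydrogen bonds from the tyrosyl-tRNA synthetase studies.
    for ddg in (2.1, 6.3):
        print(f"{ddg} kJ/mol -> {fold_from_ddg(ddg):.1f}-fold")
    # A 250-fold affinity difference expressed as a free energy.
    print(f"250-fold -> {ddg_from_fold(250):.1f} kJ/mol")
```

At this assumed temperature, 2.1 and 6.3 kJ/mol correspond to factors of roughly 2 and 13, consistent with the ∼2–15 range given above.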

The CH…O hydrogen bond

The importance of the conventional hydrogen bonds described above for biomolecular recognition has been well established. However, even though the existence of hydrogen bonds involving a CH as a donor group had been evidenced by crystal structures of organic molecules more than 40 years ago (82), the importance of these unconventional hydrogen bonds for biomolecular stability and recognition has been recognized only recently, again due to the analysis of crystal structures [reviewed in (83)]. It is believed that the strongest hydrogen bond in that group is the CH…O bond formed between a CH donor group and an oxygen acceptor. However, the energies of these unconventional hydrogen bonds depend on the acidity of the hydrogen and are particularly strong when the CH group is adjacent to a nitrogen atom. Recently, the importance of CH…O hydrogen bonds in protein–RNA recognition has been pointed out by a computational study: in a structural analysis of 45 protein–RNA complex structures, the authors find that 33% of all potential intermolecular hydrogen bonds are of the CH…O type (84). Interestingly, a large number of these intermolecular CH…O bonds originate from the sugars, in particular from C4′ and C5′ atoms. Within the bases, by far the highest number of CH…O bonds are provided by the C2 of adenine, as observed, e.g. at the protein–RNA interface of Pumilio and PABP. In Pumilio, the contact is made between the adenine bound to repeat 3 and the thiol group of a cysteine side-chain (Figure 1A), while in PABP RRM1, the adenine in the N1 position is hydrogen bonded to a carbonyl of the protein main-chain (5,25). Strikingly, however, in ∼70% of the cases observed, the adenine H2 contact is made with the hydroxyl group of a serine side-chain (84).
The C8 of adenine and guanine, as well as the C6 of uracil and cytosine, are potent CH…O hydrogen-bond donors as well, but are not frequently involved in hydrogen bonds with the protein as they tend to hydrogen bond with the O5′ of their own ribose when they are in the anti conformation (84).

Surface complementarity

Though the experimentally determined binding affinities described above indicate an important role for hydrogen bonds in providing sequence-specificity, it should not be forgotten that surface complementarity in general is an extremely important prerequisite for sequence-specific recognition. When the RNA fits perfectly into binding pockets provided by the protein, favourable dispersion interactions (van der Waals bonding) are maximized. On the other hand, if there are holes, possibly filled with highly constrained and entropically unfavourable water, or steric clashes, which lead to overly close contacts that are strongly disfavoured by van der Waals repulsion, the binding affinity will be reduced and the binding partner will be disadvantaged as compared to a ligand that has a perfectly complementary binding surface. Shape recognition plays a particularly important role in the binding of structured RNA molecules and has been reviewed elsewhere (63).


TOWARDS A CODE FOR ssRNA RECOGNITION

Two ways to recognize RNA sequence-specifically

In analysing the molecular basis of how protein domains recognize ssRNA, one can differentiate two basic modes by which sequence-specificity is achieved. For some protein domains, hydrogen bonds to the RNA bases originate from the protein main-chain carbonyl and amide groups and therefore the fold of the protein domain determines the RNA sequence-specificity. This is the case, for example, for the tandem CCCH zinc fingers of Tis11d (7), where each finger recognizes a UAUU sequence. Such an arrangement provides a very rigid and hence highly specific scaffold for RNA binding. However, it also means that small variations in the amino acid sequence could indirectly influence the backbone architecture and change the RNA binding specificity. This makes it virtually impossible to predict which RNA sequence is recognized by these proteins in the absence of a structure.

For other proteins, like Pumilio, sequence-specificity is exclusively provided by hydrogen bonds between the protein side-chains and the RNA bases (5). With such a recognition mode, predicting the RNA sequence that is bound based on the protein primary sequence appears possible. As mentioned earlier, the recognition mode of Pumilio is highly modular. Each Puf repeat recognizes one base and in addition serves as a binding platform for the following base. In each repeat, three amino acid side-chains, all located in helix two, are crucial for RNA recognition (Figure 1A). Different combinations of the amino acids in positions 3, 4 and 7 of this helix specify the binding to the bases, which makes it possible to design a Pumilio-derived specific binder for ssRNAs of distinct sequence. A first attempt of this kind was made by Wang et al. (5), who mutated the asparagine, tyrosine and glutamine at α-helix positions 3, 4 and 7 of repeat 6 (Figure 1A) into serine, asparagine and glutamic acid, respectively, to generate a repeat that specifically recognizes a guanine instead of a uracil. Indeed, the mutant protein binds a U-to-G mutant RNA at least 12 times more strongly than the wild-type RNA.
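The modular recognition mode described above lends itself to a simple lookup, sketched here in Python. The table contains only the two side-chain combinations stated in the text for repeat 6 (wild-type and the engineered variant); the names `PUF_CODE`, `predict_base` and `predict_sequence` are hypothetical, and a usable code would need entries for the other repeats that are not given here.

```python
# Side-chains at helix-two positions 3, 4 and 7 of one Puf repeat,
# mapped to the base that the repeat recognizes.
# Only the combinations stated in the text for repeat 6 are listed.
PUF_CODE = {
    ("N", "Y", "Q"): "U",  # wild-type repeat 6
    ("S", "N", "E"): "G",  # engineered repeat 6 (Wang et al.)
}


def predict_base(pos3, pos4, pos7):
    """Return the recognized base, or None if the combination is unknown."""
    return PUF_CODE.get((pos3, pos4, pos7))


def predict_sequence(repeats):
    """Predict one base per repeat, in the order the repeats are given.

    Unknown side-chain combinations are reported as 'N' (any nucleotide).
    """
    return "".join(predict_base(*repeat) or "N" for repeat in repeats)


print(predict_base("N", "Y", "Q"))
print(predict_sequence([("N", "Y", "Q"), ("S", "N", "E"), ("A", "A", "A")]))
```

Returning an explicit placeholder for unknown combinations, rather than guessing, keeps the sketch honest about how little of the code is specified in this section.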

Role of the protein main-chain of KH and RRM in sequence-specific recognition

The other RNA-binding domains described here (RRM, KH and the MMLV nucleocapsid) achieve sequence-specificity with a combination of both binding modes, i.e. with hydrogen bonds to both the protein main-chain and side-chains. In the KH and nucleocapsid domains, one of the four bound nucleotides is recognized specifically by the protein main-chain. In the MMLV nucleocapsid, this is the guanine at the 3′ end (8,9) (Figure 1C), while in KH domains, the adenine or cytosine in the N3 position is recognized by the backbone of the β2 strand for type I KH domains (15,17,18,21) or the β3 strand for type II KH domains (16) (Figure 2A and B). This indicates that the MMLV nucleocapsid protein and the KH domain have within their fold an inherent preference for specific nucleotide types in one of their binding pockets.

In the case of the RRM, proteins with binding specificity for A-, G- or pyrimidine tracts have been observed. Nevertheless, in examining all known RRM–RNA complex structures, one can see a bias towards particular nucleotide types at certain positions (Table 2). In position N1 of the RRM, a cytosine is found seven times, adenine six times, uracils or thymines four times and guanines only twice. In position N2, on the other hand, guanine and uracil occur five times, adenine four times and cytosine only once. In position N0, there is a strong preference for uracils (11 U or T found). Finally, in position N4, uracils are the most common nucleotide (five times), but the other bases are found at least twice as well. Although not enough complex structures have been solved to make a proper statistical analysis, one can see a certain bias toward a uracil at N0, a cytosine or adenine in N1 and a guanine or a uracil in N2. In fact, a U/G-A/C dinucleotide bound at N1–N2 is never observed, whereas five A/C-U/G sequences are bound in these positions.
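The positional bias described above can be tabulated directly from the counts quoted in this paragraph. The Python sketch below uses only those counts for N1 and N2 (uracils and thymines are pooled as "U"); the helper names and the 30% cut-off are arbitrary assumptions made for illustration, and, as noted in the text, the sample is too small for a proper statistical analysis.

```python
from collections import Counter

# Occurrences of each base in the RRM-RNA complexes, as quoted in the text.
N1_COUNTS = Counter({"C": 7, "A": 6, "U": 4, "G": 2})
N2_COUNTS = Counter({"G": 5, "U": 5, "A": 4, "C": 1})


def frequencies(counts):
    """Convert raw counts into fractions of all observations."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def favoured(counts, threshold=0.3):
    """Bases whose observed fraction reaches the (arbitrary) threshold."""
    return sorted(b for b, f in frequencies(counts).items() if f >= threshold)


print("N1 favours:", favoured(N1_COUNTS))
print("N2 favours:", favoured(N2_COUNTS))
```

With this cut-off the tally reproduces the bias stated in the text, namely A or C at N1 and G or U at N2.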

A detailed analysis of the interactions in positions N1 and N2 partly explains the origin of this sequence bias (Figure 8). Recognition of the RNA base N1 involves one or two hydrogen bonds between the Watson–Crick edge of the base and the main-chain atoms of the last β4 residue and of the residues just C-terminal to it. For almost all cytosines and adenines, the carbonyl oxygen of the last β4 residue [e.g. Y86 in U1A (30)] is hydrogen bonded with one amino proton of the base and the backbone amide two residues after (β4+2, e.g. K88 in U1A) is hydrogen bonded to N3 of cytosine or N1 of adenine (Figure 3B, 5C and 8B). If N1 is a uracil, it is also contacted by atoms of the protein main-chain (Figure 3A), but with more variations in the binding mode (23,26,33). Binding of a guanine in N1 is also quite different in the two RRMs where such an interaction is found, namely CBP20 (28,34) and Sex-lethal RRM2 (26). From this analysis, it appears that the N1 binding pocket of an RRM is readily shaped for binding a C or an A, whereas adaptations seem to be necessary when binding a U or a G.

Figure 8

Recognition of AG by hnRNPA1 RRM1. (A) Details of the non-sequence-specific contacts to the RNA. (B) Sequence-specific contacts mediated by the protein main-chain. (C) Sequence-specific contacts mediated by the protein side-chains. The colour scheme is as in Figures 2 and 3. PDB accession code is 2UP1. Figures were generated with MOLMOL (88).

Recognition of the RNA base identity in position N2 can also involve hydrogen bonds from the protein main-chain but only when a guanine is bound. In all five complexes with a guanine bound in N2, the base adopts a syn conformation that is stabilized by two hydrogen bonds between the carbonyl oxygen in position β4+2 and both the 2-amino proton and the imino H1 of the guanine (23,35–37). In this syn conformation, the guanine is further stabilized by an intramolecular hydrogen bond between its 2-amino and one of the phosphate oxygens (Figure 3A). As the guanine base is the only base that can engage in these two hydrogen bonds, one could speculate that the default binding sequence for an RRM might be a dinucleotide A/C-G located in N1-N2. When binding A/C-G, no side-chain needs to be involved in the recognition and yet four intermolecular hydrogen bonds with the RNA bases would be formed (Figure 8B). This suggests that the RRM fold might have an inherent binding preference for certain RNA bases, just like the KH domain or the MMLV nucleocapsid zinc knuckle.

Role of the protein side-chains of KH and RRM in sequence-specific recognition

The protein side-chains in the RRM, the KH and the MMLV nucleocapsid zinc knuckle clearly play the major role in discriminating different RNA sequences. For the N1 nucleotide in the RRM, the main side-chains involved in discriminating between different bases appear to be the penultimate residue of β4 (β4−1) and the first residue following β4 (β4+1). Residue β4−1 helps discriminate between A/C and G/U, as E, Q or M side-chains are found in this position hydrogen bonded with an A or a C amino proton, whereas K or R are found in this position hydrogen bonded to uracil O4 or guanine O6. Residue β4+1 appears to help discriminate between A and C. Indeed, an Ala that interacts with A H2 is found in this position in several complexes (Figure 8C) (36,37), while a Ser correlates with the presence of a C (Figure 3B, contact to O2) (29). However, there are exceptions to this rule, as PABP RRM1 has a Ser in the β4+1 position and still accommodates an adenine in N1 (25). Similarly, U1A (30) and U2B″ (31) both contain an alanine in the β4+1 position although a cytosine is bound in N1.
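The tendencies for the N1 pocket described above can be written down as a small rule set. The Python sketch below is a hypothetical encoding: the function name and the choice to return all four bases when no rule applies are assumptions, and the exceptions named in the text (PABP RRM1, U1A and U2B″) show that the β4+1 rule in particular is a tendency and not a reliable predictor.

```python
ALL_BASES = {"A", "C", "G", "U"}


def n1_candidates(beta4_minus1, beta4_plus1):
    """Candidate bases for the N1 pocket of an RRM (one-letter residue codes).

    beta4_minus1: penultimate residue of strand beta4
    beta4_plus1:  first residue following strand beta4
    """
    if beta4_minus1 in {"E", "Q", "M"}:
        candidates = {"A", "C"}  # side-chain contacts an A or C amino proton
    elif beta4_minus1 in {"K", "R"}:
        return {"G", "U"}        # side-chain contacts guanine O6 or uracil O4
    else:
        return set(ALL_BASES)    # no rule stated in the text

    # Within A/C, beta4+1 tends to discriminate (exceptions are known).
    if beta4_plus1 == "A":
        return {"A"}
    if beta4_plus1 == "S":
        return {"C"}
    return candidates


print(n1_candidates("Q", "A"))
print(n1_candidates("K", "S"))
```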

If guanine is bound as N2 on the RRM, specific binding is usually further stabilized by contacts to R or K side-chains from the most N-terminal residue of β1 or from β2 that interact with the O6 and N7 of the guanine (Figures 3B and 8C). It was indeed shown by several crystal structures of hnRNPA1 in complex with various RNAs that an R or K at this position is the determining side-chain for selecting a guanine at N2 (85). For all uracils bound to N2, the most N-terminal residue of β1 is always an asparagine that interacts with O4 of the U. In addition, an arginine of β2 interacts with the O2 of the uracil [in all RRMs except Sex-lethal RRM2, where a glutamine of β2 contacts the U O2 (26)]. Binding of adenine in N2 appears to be more versatile, as the base is not in the same position in the different complexes. In U1A (30) and U2B″ (31), the adenine bound in N2 is recognized by a hydrophobic residue (L or V) of β2 that contacts the A H2 and by a serine located five residues after the end of β4 that interacts with both a 6-amino proton and N1 of the adenine. In PABP, however, binding specificity for adenine in N2 is achieved quite differently (25). In RRM1, N58 of β3 is hydrogen bonded with both the N1 and one of the 6-amino protons of the adenine Watson–Crick edge, whereas in RRM2, N100 from β1 is hydrogen bonded with the N7 and one 6-amino proton of the Hoogsteen edge of the A. In the only case where a cytosine is located in N2, it is recognized by two hydrogen bonds with an arginine side-chain of β2. All in all, it appears that a guanine can be considered the default binding nucleotide in the binding pocket for N2, because it involves the β4+2 backbone carbonyl. Yet, with the presence of an asparagine at the beginning of β1 and of an arginine or lysine in β2, a uracil would be preferred, while with a hydrophobic side-chain (L, V or I) in β2 or an asparagine in β3, an adenine would be preferred. There is an exception to this suggestion, as an adenine is recognized in PABP RRM2 with an asparagine in β1 and a lysine in β2, but in this case the stacking of the adenine over the aromatic ring of the RNP1 motif is quite reduced (25). This indicates that the binding pocket for N2 is very adaptable.
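The suggested preferences for the N2 pocket can likewise be sketched as a decision rule. In the Python sketch below, the function name, the representation of the contacting β2 and β3 residues as strings of one-letter codes, and the order in which the rules are tested are all assumptions. The PABP RRM2 exception mentioned above (asparagine in β1, lysine in β2, yet adenine bound) is deliberately not captured, so this rule set would mispredict that case.

```python
def n2_preference(beta1_first, beta2_residues, beta3_residues=""):
    """Suggested preferred base for the N2 pocket of an RRM.

    beta1_first:    most N-terminal residue of strand beta1 (one-letter code)
    beta2_residues: residues of beta2 that face the base, e.g. "RL"
    beta3_residues: residues of beta3 that face the base (optional)
    """
    beta2 = set(beta2_residues)

    # Asn at the start of beta1 together with Arg or Lys in beta2: uracil.
    if beta1_first == "N" and beta2 & {"R", "K"}:
        return "U"

    # Hydrophobic side-chain in beta2, or Asn in beta3: adenine.
    if beta2 & {"L", "V", "I"} or "N" in beta3_residues:
        return "A"

    # Default: guanine, recognized through the beta4+2 backbone carbonyl.
    return "G"


print(n2_preference("N", "R"))
print(n2_preference("R", "L"))
print(n2_preference("R", "K"))
```

Treating guanine as the fall-through case mirrors the argument in the text that the fold itself, through the β4+2 backbone carbonyl, already favours this base when no side-chain overrides it.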

As discussed earlier, the N0 and N3 binding pockets in the RRM take on several forms, which makes predictions for these binding sites rather difficult. Furthermore, binding specificity in the N0 position can be influenced by neighbouring RNA bases through intramolecular RNA hydrogen bonds. Examples for this are found in PTB, where the uracil in N0 interacts with the cytosine in N1 (29) (Figure 3B), in Fox where the adenine in N0 forms a base pair with the guanine in position N−2 (23) (Figure 3A) or in U1A and U2B″ where a guanine in N0 interacts with a uracil in N−2 (30,31).

In proteins containing KH domains, the side-chains are important to discriminate nucleotide base identity in positions N1, N2 and N4. Although only a few KH domain structures in complex with RNA or DNA are available as compared to RRMs, one can still see where specific side-chains play an important role in sequence recognition. For example, when N2 is a cytosine, such as in Nova1, hnRNPK KH3 and PCBP2 KH1, the base is contacted via two hydrogen bonds by an arginine side-chain from the central β-strand (R54 in Figure 2A). In the other KH domains, this arginine is absent. The identity of N4 that stacks over N3 appears to be discriminated by side-chains from β2 in type I KH domains (β3 in type II, see Figure 2A and B), but no clear rules are apparent from the different structures. The same is true for N1. An interesting additional feature is found in NusA KH3 (16). The adenine in position N5 folds back and forms a similar H-bond interaction with the β-strand backbone as the adenine in N3. Therefore, an extensive network of polar interactions is created between the three nucleotides N3, N4 and N5 and the β-strand (Figure 2B).

Engineering a specific binder based on RRM or KH scaffolds

Based on the above analysis, it is obvious that rational design of an RRM or KH domain with a novel and defined sequence-specificity based on structural analysis is not as straightforward as it has proven to be with Pumilio (5). Nevertheless, the set of binding rules proposed above might represent a basis for attempts along this line, and a solution to the problem might become even more tractable as more RRM and KH domain structures in complex with RNA become available.

Alternative approaches to the design of novel RNA binders could be computational design or in vitro selection techniques. Both approaches have in principle been successfully applied to the U1A protein. More than 10 years ago, Laird-Offringa and Belasco successfully identified amino acid residues important for the specific interaction of U1A with its natural target, the U1hpII RNA, using phage display (86). Interestingly, they were able to generate U1A-derived proteins with an affinity that was even higher than that of wild-type U1A. Hence, repeating this in vitro selection process with a foreign RNA might lead to the generation of novel proteins with high affinity and specificity for any given RNA sequence. Furthermore, this approach might also be applied to derive further binding rules.

More recently, the Rosetta Design algorithm has been used to generate a protein that reproduces the U1A backbone structure to within <1 Å (root mean square deviation) while sharing only ∼30% sequence identity. The design of this U1A-mimic was based on the backbone coordinates of U1A and consequently, the RNA-binding properties of U1A were not retained (87). In the future, however, it might become possible to extend such an approach to protein/RNA interfaces and hence to design novel RNA binders in silico.


CONCLUSIONS

The most important chemical interactions that guide ssRNA recognition by proteins are stacking, electrostatics and hydrogen bonding. Generally, stacking and electrostatic interactions play a role in providing affinity (Figure 8A), whereas hydrogen bonds contribute to sequence-specificity as well as affinity (Figure 8B and C). However, although electrostatics are responsible for the initial attraction that brings RNA and protein together, stacking and hydrogen bonds lock the RNA in its proper orientation within the complex. Interestingly, specific hydrogen bonds can be provided either by the backbone or the side-chains. Specificity established by the backbone implies that the overall fold of the protein is readily shaped for the recognition of an RNA of specific sequence. This inherent sequence-specificity of the fold can be seen, for example, for the two zinc-binding domains of Tis11d described in this review (7). On the other hand, the protein Pumilio establishes sequence-specificity solely via side-chains, which allows RNA binding of almost any single-stranded sequence (5). RRMs and KH domains represent an intermediate, where specificity is provided by both the main-chain and side-chains of the domains. Hence, these folds have an inherent preference for certain bases at specific positions, but this intrinsic specificity is modulated by additional side-chain interactions which enlarge the spectrum of possible bases recognized. Nature has apparently favoured this latter mode of binding since RRMs and KH domains are the two most common types of RNA-binding domains. The reason for this might be that these RNA-binding domains are extremely versatile. In particular, the core RRM domain contains just two consensus binding pockets, which can recognize any given nucleotide, while the rest of the protein is highly adaptable. Furthermore, several of these relatively small domains can be combined within a single polypeptide chain, can be separated by linkers of varying length and structure, and can be employed to recognize short ssRNA stretches within loops. Despite these variations, one can distill some of the rules that determine RNA recognition by RRM and KH domains. This is exciting because it promises that in the future, when we have access to more structures of protein–RNA complexes, we might be able to predict which RNA sequences are bound by RRM or KH domains and possibly to design novel RNA-binding proteins with defined sequence-specificity.


References

Towards therapeutic applications of engineered zinc finger proteins

A. Klug

FEBS Lett., 2005

Structural basis of single-stranded RNA recognition

A.C. Messias et al.

Acc. Chem. Res., 2004

The PUF family of RNA-binding proteins: does evolutionarily conserved structure equal conserved function?

D.S. Spassov et al.

IUBMB Life, 2003

Mechanisms of translational control by the 3′ UTR in development and differentiation

C.H. de Moor et al.

Semin. Cell Dev. Biol., 2005

Modular recognition of RNA by a human pumilio-homology domain

X. Wang et al.

Cell, 2002

Crystal structure of a Pumilio homology domain

X. Wang et al.

Mol. Cell, 2001

Recognition of the mRNA AU-rich element by the zinc finger domain of TIS11d
