This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The pre-translational modification of messenger ribonucleic acids (mRNAs) by alternative promoter usage and alternative splicing is an important source of pleiotropy. Despite intensive efforts, our understanding of the functional implications of this dynamically created diversity is still incomplete. Using the available knowledge of interaction modules, particularly within intrinsically disordered regions (IDRs), we analysed the occurrences of protein modules within alternative exons. We find that regions removed or included by pre-translational variation are enriched in linear motifs suggesting that the removal or inclusion of exons containing these interaction modules is an important regulatory mechanism. In particular, we observe that PDZ-, PTB-, SH2- and WW-domain binding motifs are more likely to occur within alternative exons. We also determine that regions removed or included by alternative promoter usage are enriched in IDRs suggesting that protein isoform diversity is tightly coupled to the modulation of IDRs. This study, therefore, demonstrates that short linear motifs are key components for establishing protein diversity between splice variants.

INTRODUCTION

The major pre-translational mechanisms for expanding the repertoire of gene function are alternative splicing, known to occur in at least 86% of human genes (1), and alternative promoter usage, known to occur in 30–50% of human genes (2), with other mechanisms such as ribonucleic acid (RNA) editing (3) also contributing to the diversification of the human proteome. Alternative gene products increase the signalling and regulatory complexity of the proteome in both a temporal- and tissue-specific manner (4). This observed proteomic complexity, enabled by the one-to-many relationship between most genes and their protein products, raises the question of how these isoforms confer distinct functionality.

A potential explanation for this functional diversity at the protein level is modulation of domain–domain interactions (5); however, proteome-wide studies indicate that the removal of a globular domain is a relatively rare event (6,7). Instead, studies have shown that intrinsically disordered regions (IDRs) are preferentially found within alternative exons (8,9). This enrichment for IDRs does not explain the functional diversity found between many alternative protein products of a gene. For example, alternative splicing is known to determine the binding properties, stability, subcellular localisation and post-translational modifications (PTMs) of a large number of proteins (4,10). As short linear-motif (SLiM) interaction modules are enriched within IDRs (11,12), we hypothesised that the removal or addition of SLiM-containing exons could confer distinct functions to a splice variant, as these interaction modules are associated with a diverse array of cellular processes (11,13). These include promoting transport [e.g. the nuclear localisation signal (NLS)], directing cleavage (e.g. caspase-3 scission sites), acting as sites for PTMs (e.g. phosphorylation sites), mediating ligand binding (e.g. the PxxP SH3-binding motif) and marking proteins for degradation (e.g. KEN-box motif) (14).

SLiMs (~3–10 amino acids in length) are typically associated with low-affinity interactions [generally in the 1–150μM range (14)], predisposing them to reversible and transient associations (11). Although the context of a motif is important for binding (15), the majority of the binding affinity and specificity arises from a limited number of amino acids, 2–5 on average (11). This ensures that only a few stochastic mutations are required to convergently evolve a functional motif. For example, in neuronal cells, a single mutation of an innocuously exposed TQG sequence can create a TQT dynein-binding motif resulting in the synaptic transport of the protein (16). In this manner, motifs can arise by convergent evolution (11), with stochastic mutations more likely to occur in regions with high substitution rates, such as IDRs (17) and alternative exons (18). The presence of motifs within alternative exons can, in turn, create novel functionality for splice variants. For example, the inclusion of an alternative exon 3 amino acids in length creates a dynein-binding motif in a splice variant of myosin Va, enabling splice variant-specific cargo recognition (19). The presence of SLiMs in alternative exons can also create splice variants with novel cellular localisations, as occurs in human 8-oxoguanine deoxyribonucleic acid (DNA) glycosylase when an alternative exon containing an NLS targeting motif is removed, leading to the exclusion of the splice variant from the nucleus (20).

In this article, we investigate whether the removal or addition of exons containing SLiMs is a common regulatory mechanism used by the cell. We analyse the experimentally validated SLiM instances annotated in the Eukaryotic Linear Motif (ELM) (13) and Domino (21) resources, along with other functional units (globular domains, phosphorylation sites, transmembrane regions and signal peptides), for their presence in sequences altered between known protein isoforms (AltSeqs). We demonstrate that SLiMs are enriched within AltSeqs and confirm that partial domains are under-represented within AltSeqs (6). We also demonstrate that exons excluded or included by alternative promoter usage are enriched with IDRs demonstrating that these unstructured regions of proteins are a recurring property of non-constitutive exons.

MATERIALS AND METHODS

Data sets

SLiM instances are extracted from the ELM resource (version 08/2011) (13), a database of manually annotated experimentally verified SLiMs. These SLiMs are divided by ELM into different functional classes (160 in total), each describing a unique molecular function. Each ELM class is described by a regular expression defined using experimentally validated SLiM instances. These 1595 instances represent a gold standard for SLiM annotation and were collected independently of whether they were present in alternative exons.
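As an illustration of how a regular expression can describe a motif class, the following minimal Python sketch scans a sequence for the PxxP core of the SH3-binding motif. The pattern, the helper name `find_motifs` and the sequence are assumptions made for illustration only; actual ELM class expressions are more specific than this core pattern.

```python
import re

# Hypothetical pattern: the PxxP core only, not an actual ELM class expression.
# The lookahead lets overlapping matches be reported.
PXXP = re.compile(r"(?=(P..P))")

def find_motifs(sequence, pattern=PXXP):
    """Return (start, end, text) for each match; 0-based, end exclusive."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(sequence)]

# Invented sequence, not a real protein.
seq = "MKAPLSPPRPAG"
hits = find_motifs(seq)
print(hits)  # [(3, 7, 'PLSP'), (6, 10, 'PPRP')]
```

A pattern match alone does not establish function, which is why the analysis relies on experimentally validated instances rather than on pattern matches.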

An additional data set is also derived from the Domino peptide interaction database (version 10/2009) (21) to validate the results produced using data from the ELM resource. Domino annotates high-quality experimental data on globular domain-peptide interactions independently of ELM and, therefore, can be used as a cross-validation data set. A total of 848 protein isoforms produced from 274 genes are extracted from the Domino resource with peptides shorter than 30 amino acids. A maximal length of 30 amino acids is chosen, as this is shorter than all known linear-motif interaction domains (the shortest being the WW domain) (22). Five linear-motif classes, whose interactions have been analysed in greater detail by high-throughput (HTP) studies and/or curated by experimental annotation databases, are investigated in detail. These linear-motif instances bind to PDZ (23), PTB (24), SH2 (13,25), SH3 (26) and WW (13,27) domains and together create a data set of 408 motif instances within 302 genes.

As additional annotation, for each canonical protein sequence with a known motif instance, globular domains are extracted from Pfam v25 (28), phosphorylation sites from the low-throughput annotation of Phospho.ELM (03/2011) (29) and functional elements (transmembrane domains and signal peptides) from UniProt annotation. These features are mapped onto the canonical protein sequence as defined by UniProt.

Isoform data are retrieved from UniProt (05/2011) (30), a manually annotated, non-redundant protein sequence database. This resource curates annotated protein splice variants of genes only if there is experimental evidence that they exist or if they have at least one messenger RNA (mRNA) with correct intron/exon boundaries. It, therefore, represents a high-quality resource of validated protein isoforms. The analyses use the canonical isoform as chosen by UniProt. All protein products of a gene are extracted from the UniProt resource for protein sequences with at least one ELM-annotated SLiM instance and more than one annotated UniProt protein product.

All data sets are filtered for proteins of high similarity using UniRef90 (31) to limit bias introduced by homologous proteins with greater than 90% sequence identity.

Methods

The enrichment of functional units (SLiMs, phosphorylation sites, globular domains and functional elements [transmembrane domains and signal peptides]) within alternative sequences (AltSeqs) is assessed based on an approach outlined by Kriventseva et al. (6). This method aims to evaluate whether there is a preference for certain functional units to be altered between protein isoforms. It compares the expected number of instances (calculated with the assumption that there are no biases in the data set towards certain functional units being altered) with the observed number of instances altered between protein isoforms. AltSeqs are used in this approach, as they are continuous sections present in canonical protein sequences, as prescribed by UniProt, but missing in another protein isoform. AltSeqs, therefore, reflect the consequences of transcript changes at the protein level. For example, an AltSeq may represent two alternative exons that are always removed together, ensuring that a globular domain is never only partially present.

The calculation of the expected number of occurrences (e) of functional units within AltSeqs uses a sliding window method. This approach requires that AltSeqs are randomly distributed within protein sequences. To test this assumption, the distribution of annotated UniProt AltSeqs is assessed. This analysis finds no strong positional bias for AltSeqs [Supplementary Figure S1 and Kriventseva et al. (6)]. A sliding window approach is, therefore, used to calculate the expected number of occurrences of functional units. For each AltSeq, a window of equal length to the AltSeq (Window) scans the AltSeq-containing protein, progressing one amino acid at a time and counting the functional units (FUNCAS) overlapping (partial hits) or within (full hits) the window. The expected occurrences of partial/full domains, phosphorylation sites, transmembrane regions, signal peptides, PTMs and linear motifs are calculated using the following equation:

$$e_j = \frac{\mathrm{FUNCAS}_j}{\mathrm{Window}} \times \mathrm{AltSeqsCount} \qquad (1)$$

where FUNCAS=number of instances of a functional unit, j, in sliding windows; Window=number of sliding windows; AltSeqsCount=number of AltSeqs. A goodness of fit χ2 test is then used to compare the expected and observed proportions.
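The sliding window procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: it reads the definitions above as expected count = (FUNCAS / Window) × AltSeqsCount, pools windows over all AltSeqs, and uses invented coordinates (0-based, end exclusive). The function names and the input layout are hypothetical.

```python
def count_hits(units, start, end):
    """Count units fully inside, and units partially overlapping, [start, end)."""
    full = partial = 0
    for u_start, u_end in units:
        if start <= u_start and u_end <= end:
            full += 1
        elif u_start < end and start < u_end:
            partial += 1
    return full, partial

def expected_counts(proteins):
    """Return (expected full hits, expected partial hits) over all AltSeqs.

    Each protein is a dict with 'length', 'altseqs' and 'units'; the last two
    are lists of (start, end) pairs. Pooling windows over all AltSeqs is an
    assumption of this sketch.
    """
    funcas_full = funcas_partial = windows = altseqs = 0
    for protein in proteins:
        for a_start, a_end in protein["altseqs"]:
            width = a_end - a_start
            altseqs += 1
            # Slide a window of AltSeq length one residue at a time.
            for start in range(protein["length"] - width + 1):
                full, partial = count_hits(protein["units"], start, start + width)
                funcas_full += full
                funcas_partial += partial
                windows += 1
    return funcas_full / windows * altseqs, funcas_partial / windows * altseqs

# Invented toy protein: 10 residues, one AltSeq of length 4, one 2-residue unit.
toy = [{"length": 10, "altseqs": [(2, 6)], "units": [(3, 5)]}]
exp_full, exp_partial = expected_counts(toy)
print(exp_full, exp_partial)
```

In the toy case, three of the seven windows contain the unit entirely and two overlap it partially, so the expected counts are 3/7 and 2/7 for a single AltSeq.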

The expected number of occurrences of functional units within regions of intrinsic disorder (e_DIS) is also assessed. A protein sequence is assessed for disorder using the IUPred algorithm (32), with the assumption that amino acids with IUPred scores over 0.4 are disordered (11,12). The following equation is used:

$$e_{\mathrm{DIS},j} = \frac{\mathrm{FUNCAS}_{\mathrm{DIS},j}}{\mathrm{Window}_{\mathrm{DIS}}} \times \mathrm{AltSeqsCount}_{\mathrm{DIS}} \qquad (2)$$

where FUNCASDIS=number of functional units both in a disordered region and an AltSeq; WindowDIS=sliding window count only including windows with an average IUPred score of over 0.4; AltSeqsCountDIS=number of AltSeqs in disordered regions (average IUPred score>0.4). A goodness of fit χ2 test is then used to compare the expected and observed proportions.

The assessment of the individual ELM classes for enrichment within AltSeqs by χ2 test is infeasible due to the limited number of instances in each class. An adaptation of the log-odds ratio calculation [LOG-odds domain (LOD) (5)] is, therefore, used to compare individual ELM classes with the observed occurrence of linear-motif removal, taking into account the number of instances in each ELM class:

(3)

IC=instance count, Pyy=observed probability of an instance in ELM class being in AltSeq; Pyx=observed probability of an instance in ELM class not being in AltSeq; Pxx=observed probability of an ELM instance in AltSeq, Pxy=Observed probability of an ELM instance not being in AltSeq.
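For orientation, a plain log-odds ratio built from the four probabilities defined above can be sketched in Python as below. This is a generic log-odds calculation, not the adapted LOD score itself: the weighting by instance count (IC) is not reproduced, and the base-2 logarithm, the function name and the example counts are all assumptions.

```python
import math

def log_odds(class_in, class_total, all_in, all_total):
    """Generic log2 odds ratio of a class being in AltSeqs versus all instances.

    Not the IC-adjusted score used in the study. Undefined (raises an error)
    when a class has none or all of its instances in AltSeqs.
    """
    p_yy = class_in / class_total   # class instance in AltSeq
    p_yx = 1.0 - p_yy               # class instance not in AltSeq
    p_xx = all_in / all_total       # any ELM instance in AltSeq
    p_xy = 1.0 - p_xx               # any ELM instance not in AltSeq
    return math.log2((p_yy / p_yx) / (p_xx / p_xy))

# Invented counts: 6 of 10 class instances and 30 of 100 instances overall.
score = log_odds(6, 10, 30, 100)
print(round(score, 3))
```

A positive score indicates that the class is found in AltSeqs more often than ELM instances in general, and a negative score the opposite.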

The count of recurring SLiMs, that is, SLiM instances that have at least one other instance of the same ELM class in the same protein, is calculated from the ELM resource’s annotated data. A recent survey of ELM identified that 34.9% of ELM-annotated instances were recurring (11). A goodness of fit χ2 test is used to assess whether recurring motifs occur within AltSeqs at a higher rate than expected (34.9%).

Structural disorder is assessed using the IDR predictor IUPred (32) with exons having an average score of greater than 0.4 considered as unstructured (11,12). The expected proportion of intrinsic disorder within an exon is calculated based on an analysis of the protein-coding exons annotated by EnsEMBL (33) found within the canonical UniProt human proteins. A goodness of fit χ2 test compares the intrinsic disorder of the average exon with the observed intrinsic disorder of the exons altered by alternative promoter usage or alternative splicing.
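The exon-level disorder call described above amounts to averaging per-residue scores and applying the 0.4 cut-off. The Python sketch below illustrates this with invented scores; it does not run IUPred, and the function names and exon coordinates are hypothetical.

```python
from statistics import mean

DISORDER_CUTOFF = 0.4  # cut-off stated in the text

def is_disordered(scores, cutoff=DISORDER_CUTOFF):
    """True when the average per-residue score exceeds the cut-off."""
    return mean(scores) > cutoff

def classify_exons(per_residue_scores, exon_ranges):
    """Map each exon name to True (disordered) or False (ordered).

    exon_ranges maps a name to a (start, end) pair, 0-based, end exclusive.
    """
    return {
        name: is_disordered(per_residue_scores[start:end])
        for name, (start, end) in exon_ranges.items()
    }

# Invented scores standing in for predictor output on a 6-residue protein.
scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.3]
calls = classify_exons(scores, {"exon_a": (0, 3), "exon_b": (3, 6)})
print(calls)  # {'exon_a': False, 'exon_b': True}
```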

RESULTS

Alternative promoter exons are enriched in IDRs

AltSeqs produced by alternative splicing are enriched for IDRs (8); however, no investigation of protein-encoding AltSeqs specifically produced by alternative promoter usage has been undertaken. We therefore compare the proportion of IDRs within the average human exon with the proportion of IDRs within exons removed or included by alternative promoter usage. We extract from the UniProt database (30) a non-redundant set of 188 altered splice variants, derived from 124 genes, produced solely by alternative promoter usage. The analysis of these data using the IUPred algorithm (scores >0.4 considered disordered) identifies an enrichment of IDRs within exons altered by alternative promoter usage (χ2 P value: 0.033) (59 observed and 38 expected) (Figure 1). In addition, a significant under-representation of ordered regions is noted within those AltSeqs altered by alternative promoter usage (χ2 P value: 1.32e−3) (61 observed and 102 expected). The exons removed or included by alternative splicing are also enriched for IDRs (χ2 P value: 2.20e−16), as previously shown (8).

Figure 1. A comparison of intrinsic disorder between exons. The proportion of exons predicted as intrinsically disordered, defined as residues that the IUPred algorithm predicted with a score above 0.4. Exons altered by alternative splicing and exons altered by...

SLiMs are enriched in AltSeqs

The enrichment of IDRs within AltSeqs raises the question of whether known functional regions within IDRs occur at a higher or lower rate than expected within sequences altered between protein isoforms. The initial analysis of functional site enrichment within AltSeqs is undertaken using a data set of 1421 protein isoforms produced from 404 genes, which is limited to those genes with a protein isoform containing an annotated and experimentally validated SLiM instance from the ELM resource. As shown in Figure 2a, the proportion of AltSeqs (average length 112.3 residues, equivalent to 15.5% of average UniProt sequence length) containing a SLiM is higher than expected (χ2 P value: 5.13e−5), with 196 SLiMs (30.3% of SLiMs in proteins with alternative products or 12.1% of total ELM instances) observed in AltSeqs compared with the 123 expected. Phosphorylation sites are similarly enriched (χ2 P value: 5.76e−4), with 61 more sites found in AltSeqs than the 128 expected. There is, however, a potential bias in this analysis, as IDRs are enriched in alternative exons (8,9) and SLiMs are enriched in IDRs (11,12). We, therefore, also assess whether the aforementioned enrichment still occurs when only regions predicted as disordered are investigated [IUPred (32) scores >0.4 considered as an IDR]. In this case, SLiMs are the sole functional unit significantly enriched (χ2 P value: 2.40e−4) (138 observed and 83 expected) (Figure 2b), suggesting a preference for SLiMs in AltSeqs. These results are validated using the independently annotated data from the Domino database of peptide-mediated interactions (21), consisting of 848 protein isoforms produced from 274 genes. Peptides, likely to contain SLiMs, are highly enriched within AltSeqs (χ2 P value: 4.74e−5) (163 observed and 97 expected). This enrichment of SLiMs is again observed when only functional units within IDRs are investigated (χ2 P value: 5.71e−3) (106 observed and 69 expected) (Supplementary Figure S3).
For further assessment of functional site enrichment, additional instances of PTMs are extracted from the PhosphoSite Plus database (34). However, no enrichment is identified for these other PTMs in AltSeqs (Supplementary Table S1). This suggests that SLiMs represent a key regulatory element altered between protein isoforms.

Figure 2. The distribution of functional units within AltSeqs. The observed and expected counts of AltSeqs disrupting an entire or partial SLiM, a phosphorylation site, an entire or partial globular domain, an entire or partial functional element (transmembrane...

The analysis does not show a bias towards a particular type of SLiM, for example, targeting motifs, to be in an AltSeq (Supplementary Figure S2). The observation that SLiMs are enriched within AltSeqs but no particular ELM type is significantly enriched raises the question of what type of SLiMs are present within these regions. We, therefore, assess the individual ELM functional classes for enrichment within alternative exons (Table 1). We identify a number of classes whose instances occur at a much higher frequency than expected in non-constitutive exons. The majority of these ELM classes bind to domains found within intracellular signal-transduction proteins (e.g. SH2 or PTB domains). However, the instances annotated in the ELM resource are limited in number, as only examples identified by low-throughput experimentation are included. To validate the observation that motif instances associated with domains in signal-transduction proteins are enriched in protein-encoding alternative exons, motif instances identified as binding to PDZ (23), PTB (24), SH2 (13,25), SH3 (26) and WW (13,27) domains in HTP experiments and by specialist annotation are investigated further. As shown in Figure 3, the aforementioned enrichment of motifs binding to the SH2 (χ2 P value: 0.027) and PTB (χ2 P value: 0.033) domains is confirmed, and PDZ- (χ2 P value: 0.025) and WW- (χ2 P value: 0.092) domain-binding motifs are also found to have an increased likelihood of being removed or included between protein isoforms.

Figure 3. A comparison of the occurrences of five highly studied binding motifs within alternative exons. The observed and expected occurrences of linear motifs identified in HTP experimental studies of proteins with known isoforms (30). The PDZ, PTB and SH2-binding...

SLiMs tend to recur, that is, to have one or more other instances of the same ELM class in the same protein. This is highlighted by the fact that 34.9% of ELM-annotated SLiM instances are recurring (11). When we analyse the occurrence of these recurring motifs within alternative exons, we find that SLiMs known to recur multiple times in a protein sequence are significantly enriched within AltSeqs (50.7% are recurring; χ2 P value: 0.013) (107 observed and 73.6 expected). This suggests that the inclusion or removal of SLiM-containing exons may tune the multivalent cooperativity of an isoform’s interactions (35).
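The expected count quoted above follows from applying the 34.9% background rate to the SLiM instances found in AltSeqs. The Python sketch below reproduces that arithmetic and shows a generic two-category goodness of fit χ2 calculation. The total of 211 instances is an assumption inferred from 107 being 50.7%, and the two-category layout is also an assumption, so the P value printed here is not meant to reproduce the reported one.

```python
import math

def chi_square_gof(observed, expected):
    """Pearson goodness of fit statistic over matching categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(statistic):
    """Upper-tail P value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

total_in_altseqs = 211      # assumption: inferred from 107 / 0.507
background_rate = 0.349     # proportion of recurring ELM instances (11)
observed_recurring = 107

expected_recurring = background_rate * total_in_altseqs
print(round(expected_recurring, 1))  # 73.6

# Categories: recurring versus non-recurring instances within AltSeqs.
observed = [observed_recurring, total_in_altseqs - observed_recurring]
expected = [expected_recurring, total_in_altseqs - expected_recurring]
statistic = chi_square_gof(observed, expected)
print(statistic, p_value_df1(statistic))
```

With two categories there is one degree of freedom, which is why the P value can be obtained from the complementary error function without a statistics library.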

The removal or inclusion of complete globular domains also displays a weak statistical enrichment within AltSeqs (χ2 P value: 1.29e−3) (204 observed and 144 expected), reflecting similar results by Kriventseva et al. (6). Conversely, splice variants with partial domains, that is, domains that are only partially encoded by an AltSeq, are under-represented (χ2 P value <1.59e−3) (161 observed and 223 expected) (6,7). A similar finding is also observed for functional elements truncated by the removal or inclusion of AltSeqs (15 observed and 85 expected) (Figure 2a) (6).

SLiM prediction can aid understanding of protein isoforms

The diverse and often conflicting functions of the different protein products of a gene are frequently attributed to a single protein isoform. The use of bioinformatics to discriminate these differences, in particular by identifying isoform-specific SLiMs, may facilitate an understanding of the distinct properties of these splice variants, such as half-life, interaction partners and cellular localisation.

An apt example of this is p53, whose varied and often opposing functions have puzzled researchers for many years (36). The recent expansion in the number of known alternative protein products of this gene has given a tantalising opportunity to uncover the source of this functional diversity (37). A series of articles focusing on the transcriptional regulation of these splice isoforms [e.g. (37,38)] has enabled some of this diversity to be explored. In Figure 4, the alternative products of p53 are displayed along with their DNA-binding domains and SLiMs. The different phenotypes of p53 isoforms can often be attributed to the removal or inclusion of SLiM-containing exons. For example, the increased half-life of Δ40p53 (9.5 h compared with 5–20 min for full-length p53) (38) has been attributed to the loss of the MDM2-binding site (39), which marks full-length p53 for degradation by the attachment of ubiquitin (40). Similarly, the absence of the nuclear export signal in Δ40p53γ explains its exclusive nuclear localisation (in a similar manner to p53β and Δ133p53γ) (38). Other putative explanations of phenotypic observations include attributing the shorter half-life of p53β to the absence of the LIG_USP7_1 (41) and 14-3-3 (42) binding sites. Novel motifs can also arise by addition of an alternative exon, such as the putative KEN-box degron in p53γ, which could indicate a novel method of degradation for p53γ by the APC/C complex during anaphase. The expression and half-life of p53 are, therefore, carefully regulated by pre- and post-translational mechanisms that alter the availability of this protein’s interaction surfaces, resulting in subtle but important phenotypic differences (37,38). Observations based on the interpretation of phenotypic data can help direct further experimentation, which may further elucidate the often-enigmatic differences between protein isoforms.

Figure 4. Bioinformatics can identify functional differences between protein variants. (A) The exon sequence of the TP53 gene. The coloured (non-grey) exons are alternative exons that vary between protein isoforms. The yellow exons are absent in Δ40p53,...

DISCUSSION

The importance of pre-translational variation within the cell for facilitating cell signalling and regulation is becoming increasingly apparent (43–45). The inclusion or removal of non-protein coding regions, for example, is known to influence mRNA stability, translational efficiency and mRNA localisation (46,47). In this article, we have investigated how the removal/inclusion of functional modules between protein isoforms can lead to functional diversity. In particular, we have focused on the inclusion/removal of SLiM-containing alternative exons known to create protein isoforms of differing functions (10). These differences include the targeting of protein splice variants to different sub-cellular locations [e.g. to the peroxisome rather than the mitochondria (48)], changes in interaction partners [e.g. PDZ SLiMs within membrane receptors (49)] or more dramatic changes such as altering a protein’s function from pro-apoptotic to anti-apoptotic (50).

In this article, we have confirmed previous observations that alternatively spliced exons are enriched for IDRs (8,9) as well as demonstrating a similar enrichment for IDRs in exons generated by alternative promoter usage (Figure 1). This observation prompted us to investigate the propensity of known functional protein modules to occur in regions altered by alternative splicing and/or by variable promoter usage. We identified an enrichment of SLiMs within AltSeqs indicating that the inclusion or removal of motif-containing exons is an important mechanism for modifying the functional properties of alternative protein products. In particular, exons containing SLiMs that bind to SH2 domains are commonly altered by pre-translational mechanisms (Table 1 and Figure 3). SH2-binding motifs are often present in the cytoplasmic tails of membrane receptors, and their inclusion or removal is, for example, known to affect the multivalent assembly of regulatory complexes important for signal propagation (51,52). Similarly, the inclusion or removal of PDZ motifs, also found enriched in AltSeqs, is known to create functional diversity. For example, in neurons, splice variants differing in their C-terminal PDZ motifs play specific roles in the regulation of neurotransmission, ion channel function and development (49).

The small footprint of linear motifs confers a number of advantages in terms of cell regulation and signalling (11,53,54). First, the limited number of residues in a SLiM that contribute to binding usually leads to a binding affinity for SLiM-mediated interactions in the micromolar range. Consequently, motif-mediated interactions are predominantly both transient and reversible (14). This reliance on a limited number of amino acids means that SLiM-mediated interactions can be weakened (or strengthened) by PTMs, whose bulk and charge can disrupt (or enhance) this weak binding affinity. Similarly, the short length of linear motifs means these interaction modules can often occur multiple times in a single protein (11). This can facilitate mutually exclusive binding, when two motifs share a binding surface (for example, when they overlap), or promote high-avidity interactions, when motifs recur in separate positions along a protein sequence. These switching mechanisms act to regulate SLiM-mediated interactions. Alternative splicing and other pre-translational mechanisms can therefore alter the regulation of a protein by including or removing SLiM-containing exons. For example, altering the number of recurring motifs in a protein by exon removal/inclusion can change the avidity of SLiM-mediated interactions, tuning the sensitivity of signalling pathways in a temporal- and tissue-specific manner [e.g. (55)]. In this article, we have demonstrated that these recurring motifs are enriched in AltSeqs, suggesting that the inclusion/removal of recurring SLiMs is a mechanism commonly used by the cell. Similarly, an exon boundary intersecting two overlapping SLiMs can facilitate the production of one isoform with an overlapping pair of motifs capable of acting as a regulatory switch and another isoform with just a single motif [e.g. (56,57)].
Linear motifs are, therefore, susceptible to a multiplicity of regulatory mechanisms that are important in regulating signalling within the cell. These regulatory features can be manipulated by the inclusion or removal of non-constitutive exons to create important but often subtle differences in the regulation and function of a protein.

The high false-positive rate of SLiM prediction (11) means the scope of this analysis is limited to the annotated SLiM data sets available from the ELM and Domino resources as well as data from HTP experimental studies. Despite this limitation, we have still been able to demonstrate a statistical enrichment of SLiMs within AltSeqs, suggesting an important role for motifs in the functional diversification and regulation of alternative protein products. An appreciation of how functional differences can arise between protein isoforms is key to our understanding of proteomic diversity. This is important, as up to one-half of disease-causing mutations affect splicing (58), with several examples of the inclusion/exclusion of SLiM-containing exons producing disease-specific isoforms (59–61). An example of this is Hoyeraal-Hreidarsson syndrome, a rare genetic disorder characterised by premature ageing, in which an aberrant splice variant of the Apollo gene is expressed that lacks a telomeric repeat-binding factor 2 (TRF2)-binding motif. This Apollo splice variant is unable to bind the TRF2 protein, leading to telomeric dysfunction and cellular senescence (61). Approaches are being developed to target this type of aberrant splicing event by redirecting alternative splicing. The principle of this approach is to redirect the splicing of a transcript to promote the production of a favourable isoform in preference to the unfavourable splice variant (62). This could have therapeutic potential, as demonstrated in Duchenne muscular dystrophy (63) and a melanoma model (64). An appreciation of the protein interaction modules most commonly altered between protein isoforms can help target these problems more precisely.