From transcription start site to cell biology

Abstract

The regulation of transcription is a complex process. Recent novel insights concerning the in vivo regulation and expression of protein-coding and non-coding RNAs have added previously unimagined levels of complexity to these processes.

Knowledge of the exact position of a 5' transcriptional start site (TSS) of an RNA molecule is crucial for the identification of the regulatory regions that immediately flank it. Traditionally, the most reliable method of identifying a TSS is to map a nucleotide to which a 5' cap structure is added in the RNA. Over the past few years this approach has been used in a number of genome-wide surveys aimed at unbiased identification of TSSs (see [1, 2] and references therein). These surveys identified many more sites where 5' ends of capped RNAs could be mapped than those TSSs belonging to annotated genes. At the same time, large amounts of unannotated transcription had been detected in mammalian genomes [2–4] and numerous transcription factor binding sites found outside annotated promoter regions [5, 6]. In addition, multiple start sites are often found for annotated, protein-coding genes very far from their 'official' start sites [2, 7, 8].

Three papers published recently in Nature Genetics by members of the FANTOM (Functional Annotation of Mouse) consortium [9–11] reveal yet further complexity of transcription initiation in animal genomes. Taft et al. [9] describe a new class of short RNAs made at promoters, while Faulkner et al. [10] show that repetitive elements can be a rich source of novel promoters. A study from the FANTOM consortium and the RIKEN Omics Science Center [11] shows how information on the precise positions of TSSs can be used to characterize global gene regulatory networks operating during cell differentiation.

How to identify a transcription start site

The critical issue in mapping a true site of transcription initiation is to be able to distinguish it from a 5' end generated by RNA cleavage or degradation and from a 5' end generated by incomplete copying of RNA into cDNA. The conventional hallmark of TSSs in most eukaryotes is addition of a 7-methyl guanosine cap structure to the 5'-triphosphate of the first base transcribed by RNA polymerase II. This unique feature of the transcription initiation nucleotide is the basis of several methods aiming to enrich and identify capped messages and subsequently to map the exact positions in the genome of the nucleotides to which the cap is added. The main methods used are cap analysis of gene expression (CAGE) [12], oligo-capping [13] and robust analysis of 5'-transcript ends (5'-RATE) [14]. CAGE is the most commonly used and exploits the 2',3'-diol structure of the cap nucleotide, which is only present in only one other place on an RNA molecule besides the cap - its extreme 3' end. The diol structure is susceptible to a specific chemical oxidation which can be followed by biotinylation, enabling selection of capped messages by immunoprecipitation with streptavidin. The enriched capped RNA fraction is then converted into cDNAs that span the entire lengths of the capped RNA molecules. Oligo-capping and 5'-RATE take advantage of the fact that the 5' cap is resistant to phosphatase treatment, which removes mono-, di- or triphosphates from cleaved or degraded RNA. Subsequent removal of the cap using tobacco acid pyrophosphatase leaves a 5'-monophosphate, which is amenable to ligation with a specific linker nucleotide that marks the position of the native 5' end of RNA and can later be used to select and sequence the 5' ends of capped cDNAs [13, 14].

Full-length cDNAs generated by the techniques described above can be further converted into short DNA tags derived from their 5' ends [12, 13, 15], which are very suitable for next-generation sequencing [16]. The combination of cap-selection and next-generation sequencing can generate sequence information about the exact positions of cap-addition sites for millions of RNA molecules [4, 15, 17], thus making it possible to obtain digital information about the number of transcriptional initiation events occurring at any genomic position. This information can be used to infer the positions, as well as the relative strengths, of different promoter elements [15], as exemplified in the recent articles from the FANTOM consortium [9–11]. It can also be correlated with information on the positions of other annotated genomic elements, such as repetitive elements [10] or short RNAs [9, 18], to identify any association between these elements and transcription initiation.

Complex transcriptional activity around TSSs

The immediate vicinity of a TSS is active ground for the production of a number of RNAs other than those destined to become full-length, protein-coding mRNAs. These RNAs can be transcribed from both DNA strands [19, 20] and tend to be either short [19, 18, 21] or short-lived and are quickly degraded by the exosomal complex [22, 23]. Working with the Drosophila, human and chicken genomes, Taft et al. [9] have now added a new class of promoter-related small RNAs, dubbed 'tiny RNAs', which map within -60 to +120 nucleotides around a TSS, with a peak density at 10-30 nucleotides downstream of the TSS. The size of the tiny RNAs, whose length distribution peaks at 18 nucleotides, distinguishes them from the larger promoter-associated short RNAs (PASRs) [19] and other RNAs generated at or near a promoter [21, 22]. The tiny RNAs can be mapped mainly to the sense strand of the longer transcript and, like PASRs, they tend to be found in the promoters of expressed genes and associated with active chromatin marks [9].

An important question is whether any of the non-coding RNAs found at or near promoters and TSSs have any biological function, or whether they simply represent byproducts of stalled polymerases or the degradation of longer mRNAs. Several lines of evidence argue against the latter two explanations. First, the observation by Taft et al. [9] in Drosophila that only a fraction of tiny RNAs associate with promoters that show evidence of stalled RNA polymerase argues against abortive transcription as their sole source. Taft et al. [9] also establish that production of tiny RNAs and PASRs at promoters is common in organisms as diverse as humans and flies, and that their relative positions in the genome tend to be syntenically conserved between between humans and chickens, similarly to PASRs that are syntenically conserved between humans and mice [19]. Third, synthetic single-stranded PASR RNA sequences transfected into human cells can affect the expression of the genes with which they associate [18]. Fourth, small RNAs are found associated with 5' ends of RNAs generated both by transcriptional initiation and by cleavage [18]. In both cases, the 5' ends of these small RNAs are modified by the addition of the cap, a modification known to protect RNAs against degradation [24], and this is inconsistent with their being mere degradation products on a path to complete removal from the cell.

Repetitive elements: parasites or building blocks of the genome?

Over the past few years, unbiased transcriptional surveys have revealed that a large fraction of the genome can be detected as stable transcripts [1, 2, 4]. However, these experiments, often microarray-based, typically avoided interrogating the repetitive element fraction of genomes as hybridization signals could not be assigned to a unique region. The advent of next-generation sequencing has made it possible to uniquely assign an RNA sequence to a particular repetitive element as long as there is some divergence from other copies of the element in the genome. Faulkner et al. [10] have now shown that a significant fraction of all CAGE tag clusters found in their study of human and mouse could be uniquely mapped to repetitive regions of the genome: 18.1% for mouse and 31.4% for human, represented by 44,264 and 275,185 clusters, respectively. Transcription within repetitive elements, specifically within retrotransposons, is apparently driven by their own promoters, which are surprisingly different from those previously characterized for these elements, and is highly tissue- and condition-specific. Faulkner et al. [10] find that overall, 35% of retrotransposon-associated TSSs show a restricted pattern of expression, compared to 17% of the other TSSs. Conversely, different tissues express different levels and types of repetitive elements, with human embryonic tissues having the highest levels of CAGE tags in these elements - 30% of all CAGE tags.

The big question raised by this study is whether the large contribution of repetitive elements, and retrotransposons in particular, to a cell's transcriptome translates into a major influence on its phenotype. In this respect, an important aspect of the study of Faulkner et al. [10] is the finding that retrotransposons might provide alternative or tissue-specific promoters for protein-coding genes. In fact, 15,518 (in mouse) and 117,165 (in human) of the putative novel TSSs within retrotransposons were identified as being associated with protein-coding transcripts, and the activity of 154 mouse and 579 human putative retrotransposon promoters was confirmed from existing expressed sequence tag (EST) data. Also, when Faulkner et al. [10] profiled 24 annotated protein-coding genes with suspected alternative retrotransposon promoters by rapid-amplification of cDNA ends (RACE), eight were indeed found to have sequences associating them with these promoters. Taken together, these results show that repetitive elements could in fact drive the production of a wide array of novel isoforms of protein-coding genes whose regulation and coding potential could be different from the isoforms annotated so far. It will be interesting to see how many of these putative protein-coding transcripts initiating within repetitive elements are actually translated.

This question could be phrased as part of a more general question: what is the complexity of polypeptides made in human cells, given the apparently high transcriptional complexity of RNAs made from a protein-coding locus? Analysis of available EST data has shown that, on average, a protein-coding locus can produce 5.7 different isoforms [25]. Furthermore, unbiased profiling of every protein-coding locus within the ENCODE regions has revealed that around 90% of them have either a novel internal exon or a novel TSS that is used in at least one tissue tested, and that most of the novel isoforms are tissue-specific [8]. It is not known, however, what fraction of these novel transcripts is actually translated and what fraction of such novel proteins would be functional.

Global regulation of the transcriptome

Precise knowledge of the TSSs used in a given biological condition is indispensable for understanding how that transcription is regulated. This is made abundantly clear by the study from the FANTOM Consortium and the Riken Omics Science Center [11], which modeled the transcriptional regulatory networks of a differentiating human cell. The authors used information on the genomic positions of the regulatory regions for each transcript and changes in transcript copy number during differentiation. Promoters were identified as regions flanking clusters of CAGE tags representing putative TSSs. For each promoter, known motifs for transcription factor binding sites were identified and this information was linked to changes in expression levels of the downstream transcript to infer the activity of the relevant transcription factors. From this, the authors identified 30 motifs whose activity explained most of the observed variation in gene expression; many of these motifs correspond to known regulators of the differentiation of macrophages - the particular cell type under study. The main conclusion reached is that a large number of different transcriptional regulators are required for differentiation, as opposed to the model in which the process is controlled by a small number of 'master regulators'.

A similar strategy could be applied to identify transcription factors involved in regulation of other developmental or disease systems. The information on the expression levels of transcripts linked to individual TSSs is particularly important, as the study described above [11] shows that empirical mapping of TSSs can explain expression data better than existing annotated TSSs can.

A caveat that must, however, be applied to techniques that use an RNA cap to identify TSSs, is the recent discovery that CAGE tags could represent 5' ends of RNAs generated by cleavage and subsequent re-capping [18], and that cytoplasmic enzyme complexes can add caps to 5'-monophosphate RNA molecules generated by ribonuclease cleavage [26]. This means that mere knowledge of the position of a capped nucleotide is not sufficient to define a TSS. Additional information, such as the distribution of putative initiation sites within a promoter region [27], chromatin hallmarks associated with active promotors, the presence of RNA polymerase II initiation complexes and transcription factors [2, 28] and appropriate sequence content [29], will be required to prove that a true initiation site has been identified and to re-evaluate the number of TSSs in human and other genomes.