Abstract

Background

Global RNA sequencing technologies have revealed widespread RNA polymerase II (Pol II) transcription outside of gene promoters. Small 5′-capped RNA sequencing (Start-seq) originally developed for the detection of promoter-proximal Pol II pausing has helped improve annotation of Transcription Start Sites (TSSs) of genes as well as identification of non-genic regulatory elements. However, apart from the most well studied genomes of human and mouse, mammalian transcription has not been profiled with sufficiently high precision.

Results

We prepared and sequenced Start-seq libraries from rat (Rattus norgevicus) primary neural progenitor cells. Over 48 million uniquely mappable reads from two independent biological replicates allowed us to define the TSSs of 7365 known genes in the rn6 genome, reannotating 2503 TSSs by more than 5 base pairs, characterize promoter-associated antisense transcription, and profile Pol II pausing. By combining TSS data with polyA-selected RNA sequencing, we also identified thousands of potential new genes producing stable RNA as well as non-genic transcripts representing possible regulatory elements.

Conclusions

Our study has produced the first Start-seq dataset for the rat. Apart from profiling transcription initiation, our data reaffirm the prevalence of Pol II pausing across the rat genome and indicate conservation of pausing mechanisms across metazoan genomes. We suggest that pausing location, at least in mammals, is constrained by a distance from initiation of transcription, whether it occurs at or outside of a gene promoter. Abundant antisense transcription initiation around protein coding genes indicates that Pol II recruited to the vicinity of a promoter is distributed to available start sites of transcription at either DNA strand. Transcriptome profiling of neural progenitors presented here will facilitate further studies of other rat cell types as well as other organisms.

Background

Transcription of genes was thought to be regulated mainly through recruitment of the RNA polymerase to promoters. However, work over the last several years [1,2,3] has demonstrated that mRNA production requires additional inputs even after the RNA polymerase has engaged a promoter and initiated RNA synthesis. Promoter-proximal Pol II pausing takes place within the first 100 nucleotides of many genes and, following a number of seminal studies (reviewed in [4, 5]), is now accepted as a common step in metazoan Pol II transcription. Regulated release of paused polymerase into productive transcription elongation accompanies key biological events including organism development and cellular responses to stimuli [2, 6,7,8,9,10,11,12,13]. Better understanding of transcription initiation, promoter-proximal pausing, and their contributions to transcription regulation is limited by lack of high-resolution datasets especially in commonly used model organisms like the rat. Discrepancies by a few nucleotides do not affect Chromatin Immunoprecipitation-sequencing (ChIP-seq) or mRNA-sequencing (RNA-seq) analyses because their effective resolution is relatively low. However, even single base pair inaccuracies impede analyses relying on the sequence context of promoters and other elements, including CRISPR/Cas9 based targeting [14]. With new technologies being rapidly developed, the demand for nucleotide-level precision of transcriptome annotations is only expected to increase. In addition to refining the Transcription Start Sites (TSSs) of known genes, there is also increasing interest in mapping non-genic transcription that does not produce stable RNA, but delineates non-genic regulatory elements [15,16,17,18].

Thus far, Pol II TSSs have been profiled with high depth and resolution only for relatively well-studied genomes of human, mouse, C. elegans, and Drosophila [18,19,20,21,22]. Here we turned to the rat, an important model organism that still has few available genome-wide datasets and, based on the experience with other genomes, is likely to have incomplete annotation of transcriptomic features. Small capped RNA sequencing, also referred to as Start-seq, captures short 5′-capped RNAs (TSS-RNAs) that are produced by Pol II during early transcription elongation [22, 23]. TSS-RNAs yield dual information: their 5′-ends precisely delineate the sites of transcription initiation, whereas their 3′-end positions indicate the locations of promoter-proximal pausing [22]. We report Start-seq in primary neural progenitors alongside poly-A selected high-coverage RNA-sequencing from the same RNA. We define high-confidence, base-pair resolution TSSs for 7365 of the ~ 24,000 currently annotated genes in the rn6 genome using the RefSeq annotation database, report the relationship of pausing with gene expression, and identify transcription start sites of new genes and potential non-genic regulatory elements. We identify general features of antisense transcription around gene promoters and characterize properties of Pol II pausing. The work outlines a high-resolution landscape of transcription initiation and Pol II pausing in rat neural progenitors of the rat and provides a workflow for transcriptional profiling of other cell types in the rat as well as in other organisms.

Results

Start-seq in rat neuronal progenitors

We isolated and sequenced small 5′-capped RNAs from neural progenitors of embryonic day 14 Sprague Dawley rats (see Materials and Methods). Preparation of Illumina Start-seq libraries is based on our earlier procedure that eliminates RNA species lacking the 5′-cap followed by preparation of small RNA libraries from the 5′-cap-enriched RNA pool [20,21,22] (Fig. 1a, Additional file 1: Figure S1, also see Methods). Compared with the published procedure, rather than excluding non-capped RNAs from ligation by treating with alkaline phosphatase, we directed these RNA species for degradation by adding a 5′-monophosphate with T4 polynucleotide kinase (PNK) prior to treatment with 5′-monophosphate-specific exonuclease (Fig. 1). The 3′-phosphatase-minus version of PNK was used to avoid removal of 3′-phosphates from RNA degradation products, which prevents their ligation to adapters and excludes them from libraries (Fig. 1a). For 5′-decapping, RppH enzyme (NEB) was used instead of the obsoleted Tobacco Acid Pyrophosphatase [19]. The resultant Start-seq procedure is shortened by at least 1 day by removing an extra gel size selection step, at which loss of RNA material commonly occurs [22].

Fig. 1

Validation of TSS-RNA sequencing in rat neural precursors. a. A scheme of an updated protocol for the preparation of Start-seq libraries for Illumina sequencing. b. UCSC browser shot of highly expressed Actb gene showing tracks from this study including 5′-tracks for TSS-RNA for each strand (red and blue) and mRNA-sequencing (black), alongside Pol II ChIP-seq (using an antibody against the Pol II N-terminus) track from mature rat neurons [10] representing a rat Pol II dataset that is most closely related to the current cell type. c. A zoomed-in view of Actb promoter-proximal region showing 5′- (red) and 3′-end (gold) of TSS-RNAs on this gene. The annotated Actb TSS is shown in blue bar and is located 2 bp downstream of the 5′ TSS-RNA peak. d. Correlation plot for promoter-proximal counts between two independent biological Start-Seq replicates. TSS-RNAs on the gene (sense) strand were counted in a +/− 500 bp interval from each RefSeq-annotated TSS

Two independent biological replicates produced 27.9 M and 19.4 M Start-seq reads uniquely mappable to rn6 genome. Of these reads, 16,380,972 for replicate 1 and 7,395,642 for replicate 2 mapped within +/− 500 base pairs of annotated TSSs [24] of known genes. Selectivity of scRNAs for TSSs is also illustrated by examining individual genes. Even on a highly active Actin B (Actb) gene, a majority of Start-seq reads at Actb gene map within the gene promoter (Fig. 1b, c). When the numbers of TSS-RNA hits within +/− 500 bp from the same set of TSSs were compared, Spearman correlation between the replicates was 0.97 overall (Fig. 1d) and profiles of transcription initiation were very similar on individual genes (Fig. 1d, Additional file 1: Figure S2, and data not shown), attesting to consistency of the Start-seq procedure.

Refinement of gene transcription start sites in the rat

As the 5′-ends of TSS-RNAs pinpoint the sites of transcription initiation [18, 20, 22], we compared TSSs of mRNA genes defined with Start-seq to the current (rn6) rat gene annotations. To identify genes on which we could annotate TSSs, we first discarded genes with fewer than 10 TSS-RNA sense strand reads in either replicate as noise. To avoid impinging on neighboring transcripts, we also included a distance filter for maximum distance to RefSeq TSS (Fig. 2d, Additional file 1: Figure S3). Out of the 9158 rn6-annotated genes with TSS-RNA signal above the noise threshold, 7365 met our location criteria and 7112 of these genes had unique gene ids. Reannotated candidate TSSs locations were exactly identical for 4730 genes and were within +/− 10 nt of each other for 5508 genes. Among the qualifying genes, we determined the nucleotide position with the largest number of 5′-ends of reads mapped to the sense strand. Only 1842 genes showed Start-seq TSS locations within +/− 5 nt of the annotated TSS. There was no bias toward upstream versus downstream shift of RefSeq versus Start-seq TSSs, suggesting that no single mechanism accounts for these differences (Fig. 2d and Additional file 1: Figure S3). There was overall sharpening of TSS-RNA distributions when arranged against their peak positions instead of RefSeq TSSs (Fig. 2a, b, left pane).

Fig. 2

TSS RNA-based refinement of gene TSSs. a. Heatmaps of sense-strand TSS-RNA on 7112 rat genes ordered by decreasing TSS-RNA count within the promoter region, centered around annotated (left panel) and TSS-RNA peak-centered locations (right panel). b. Left pane. Metaplots of TSS-RNA 5′-ends for the same genes as in A, centered on TSSs defined using RefSeq (black) and TSS-RNA (red) TSS annotations. Right pane. Pol II ChIP-sequencing traces based on previously published data [10] centered against the same TSSs. c. Weblogo visualization of sequence enrichment of the genes centered around RefSeq annotation (top) and TSS-RNA reannotation (bottom panel). d. Correspondence between TSS annotations between two independent biological replicates, with each dot representing a gene and color coding indicating the number of TSS-RNA hits per gene as indicated. Genes are plotted against RefSeq-annotated TSSs. Strikethrough lines enclose genes considered for TSS reannotations based on each replicate, with the center quadrant containing genes that were reannotated

To validate the TSS reannotations, we compared the DNA sequence context around the observed Start-seq gene TSSs to those around existing TSSs in rn6 genome. While RefSeq positions showed no DNA sequence enrichment, following reannotations (Fig. 2), a clear Pol II initiator (Inr) sequence motif [25] centered around TSS-RNA defined locations was observed (Fig. 2c), similar to what was previously found in human and mouse datasets [20, 21]. Apart from validating the TSS annotations, the data also reaffirm conservation of Pol II initiation sequence context in mammals. To verify the sensitivity of Start-seq based TSS annotations, especially at lower TSS-RNA coverage loci, we divided the 7365 TSSs into quartiles based on the number of mapped TSS-RNA reads. Enrichment of the Inr motif persists even at low (between 10 and 300 reads per 1000 bp TSS window, Additional file 1: Figure S4) coverage, indicating that our Start-seq noise threshold is conservative. In contrast, the existing rn6 RefSeq annotations do not contain sequence motif information even for the most highly expressed gene quartile (Additional file 1: Figure S4). To avoid a potential bias of RNA-based readout datasets, we also utilized previously published RNA Pol II ChIP-seq data, obtained from mature rat neurons (GSM565202) [10]. We observed sharpening of Pol II ChIP-seq metagene signal (Fig. 2b, right pane), suggesting that TSS-RNA method of reannotating start sites is not biased for a specific technique. Reannotated TSSs are listed in Additional file 2.

Antisense transcription around mRNA genes

Transcription initiated from the strand opposite to the promoter has been described around human and mouse genes [26,27,28,29,30], but its distribution in the rat genome remains uncharacterized. The so-called divergent transcription starts upstream of the promoter and is directed away from it [26, 27, 29]. We defined the divergent TSS for each gene as the base pair with the most Start-seq hits on the strand opposite to the gene within a 500-bp interval upstream of the promoter. TSS-RNAs in rat neuronal progenitors indicate prevalent divergent transcription initiation (Fig. 3a, b). While there is no set distance between sense and divergent start sites for genes across the genome - similar to the mouse data [20] - a majority of divergent peaks are found ~ 80–200 nt upstream of the gene TSS (Fig. 3a, b), with good correlation of divergent transcription signal magnitude between individual Start-seq replicates (Fig. 3c). Locations of the highest peaks of divergent transcription were less consistent (Fig. 3a), indicating the limiting sequencing coverage for these events or, more likely, intrinsically lower precision of divergent transcription initiation.

Fig. 3

Antisense transcription in the rat genome. a. A heatmap representing antisense transcription sorted by the distance of divergent peak from the gene TSS, indicated by arrow. Of the 7112 genes, 601 genes that did not contain divergent signal within 600 bp from the TSS were removed from the heatmap. Due to low signal from antisense, and especially convergent transcription, the image was enhanced with Pixelmator to highlight convergent transcription. b. Metagene plot of antisense transcription relative to the reannotated gene TSSs. c. Spearman r correlation matrix of sense and antisense transcription, per replicate. Counts for convergent and divergent transcription were defined on the appropriate strand within +/− 50 nt from the location with the highest signal. d. Weblogo representation of DNA sequence context centered around Convergent (top) and Divergent (bottom) peak locations

Convergent transcription initiates downstream of the TSS and is directed head-on into the promoter. After applying the same 10-count TSS-RNA noise threshold, we detected convergent transcription on 2531 genes within a 500-bp window downstream of the main promoter (Fig. 3b). Because convergent transcription is even lower in intensity than divergent, this threshold under-reports transcription at current sequencing coverage. The site of convergent initiation was defined similar to the divergent TSS, that is, using the base pair position with the highest signal downstream of the main TSS. The convergent signal was lower than that from divergent initiation, and convergent initiation sites were also concentrated further away, ~ 200–250 nt downstream of the sense TSS than divergent transcription.

Just like the gene TSSs, divergent and convergent TSSs are enriched with an Inr-like sequence motif (Fig. 3d), indicating a common mechanism of Pol II initiation both at and outside of gene promoters. There is a modest yet positive correlation between the magnitude of gene TSS-RNA signal and its associated antisense, both divergent and convergent, transcription initiation signal ((Spearman, 0.36–0.46) Fig. 3c). Taken together, these data reinforce the notion that antisense Pol II initiation is common throughout mammalian transcription [26,27,28,29,30,31] and may be co-regulated with the main promoter.

Promoter-proximal pol II pausing is ubiquitous across the rat genome

While TSS-RNAs are generated by Pol II pausing, their levels on a gene are determined through dynamic interplay between mechanisms that establish pausing and those that release paused Pol II into elongation (reviewed in [4, 32, 33]). To quantify pausing, TSS RNA signal is commonly normalized for gene expression by calculating the Pausing index (PI), also referred-to as traveling ratio [2, 6, 26]. RNA-sequencing is the best currently available gene expression measurement for these cells and has been used before to estimate PI [18, 22]. Comparing PI against mRNA levels of the same gene reaffirms positive but modest (0.60, Spearman, Fig. 4a) correlation of PI with gene expression. Gene Ontology (GO) enrichment analysis indicated that, consistent with earlier work in Drosophila and mouse [20, 22], that higher PI tends to favor genes involved in development and stimulus response, whereas low PI genes are skewed toward metabolism (Additional file 3). We did not detect statistically significant enrichment of cell lineage specific genes related to neural development. Thus, pausing likely represents an intrinsic, rather than conditional, property of genes [34, 35].

Fig. 4

Pol II pausing across the rat genome. a. Scatter plot showing the distribution of TSS-RNA versus RNA-seq FPKM signal for all genes, with pausing index indicated by color. c. Metaplot of TSS-RNA 3′-ends relative to the reannotated TSS. b. Metaplot of TSS-RNA lengths around the TSSs. All RNAs on the sense strand from a gene within +/− 150 bp interval were considered regardless of their initiation site. All genes versus top 25% highest expressed genes (based on RNA-seq signal FPKM) are shown, respectively, in grey and dark red. d. A UCSC browser shot showing Pol II pausing and transcription on Fabp7 gene, which has the highest expression in rat neural progenitors based on RNA-seq analysis. Inset shows a zoomed-in view of the gene’s promoter region

As the 3′-ends of TSS-RNAs define the locations of Pol II pausing [22], we next determined the positions of TSS-RNA 3′-ends. Metagene analysis around the TSSs shows that TSS-RNA 3′-ends peak around + 35 nt downstream of the TSS (Fig. 4b, Additional file 1: Figure S6). This is similar to our previously defined distribution of TSS-RNAs in other organisms, including Drosophila, human, and mouse [20,21,22], pointing to commonality of mechanisms that establish Pol II pausing across metazoans.

Ever since the original discovery of promoter-proximal Pol II pausing [36,37,38,39], there remains a question about pervasiveness of pausing and, particularly, existence of non-paused genes (for example, [2, 7, 8, 12, 40]). In genome-wide datasets, paused genes are normally defined through threshold-based cutoffs in global PI distribution [1, 26, 35], which under-reports paused and over-represent non-paused genes. Even then, a majority of genes have Pol II accumulation at promoters indicative of pausing [3]. Apart from completely inactive genes, low PI values should stem from active genes. However, these genes still show detectable Start-seq signal and have the same RNA size distribution as the rest of the genes, both overall (Fig. 4c) and on a representative gene with a high expression level based on RNA-seq signal (Fig. 4d). Examining individual genes, we failed to find an active gene without TSS-RNA signal (data not shown). While quantitative differences probably reflect genome-specified differential duration of premature transcription termination, presence of scRNA at the right location is detected on all active genes we examined. These observations indicate that Pol II pausing occurs on most if not all genes and that there would be few, if any, “nonpaused” active genes, at least in steady-state cells.

Two mechanisms can, in principle, define the location of Pol II pausing: the DNA sequence context or distance from transcription initiation [41,42,43]. To determine the potential contributions of each mechanism, we focused on initiation events that occur just outside of gene TSSs in the intervals between -10 to -5 nt from the peak TSS (upstream interval) and +5 to +40 nt from the peak TSS (downstream interval). If pausing is driven entirely by the underlying DNA sequence, then pausing should happen at the same location irrespective of the initiation position. TSS-RNAs should thus be longer for – in reference to the TSS – upstream and shorter for downstream initiation events. In contrast, sequence-independent mechanisms would result in similar RNA lengths regardless of the initiation site and accordingly shifted positions of 3′-RNA ends. TSS-RNAs initiated immediately upstream of the TSSs in a metagene plot show that the 3′-ends of upstream-initiating reads are shifted upstream, and downstream-initiating reads have 3′-ends shifted accordingly downstream of the events initiating precisely at the TSS (Fig. 5a). Examination of individual genes (Fig. 5b and data not shown) also points to upstream-initiating reads ending at more upstream locations, indicating that pausing for upstream-initiating events is shifted upstream accordingly. These observations argue for a major contribution of sequence-agnostic mechanisms to defining pausing location across the genome.

Fig. 5

Pol II pausing and distance from transcription initiation. a. Metaplot of TSS-RNAs initiating outside of the exact TSS region +/− 10 bp (greyed out), which was excluded from this analysis. The 5′-ends of RNAs are shown in dashed lines and their 3′-ends in solid lines. RNA initiating at more upstream regions are shown in red and at downstream regions in blue. b. 2-d scatter plot of initiation on two genes with top 20 FPKM values as in [31]. c. TSS-RNA length versus their initiation location around the TSS. Circles show mean (with SEM) for TSS-RNA lengths metaplot for RNAs initiating at the indicated locations (positions) relative to the TSS (vertical line). Grey line (right Y-axis) shows the number of RNAs mapping (metagene) to each location in Replicate 1. d. Weblogo plot of DNA sequence context around locations of Pol II pausing based on distance from the reannotated TSS (top) and relative to 3′-end of each RNA

Analysis of RNA lengths relative to each RNA 5′-end rather than the gene TSS [31] shows a distribution of lengths peaking around ~ 35 nucleotides, similar to TSS-centric analysis. However, we noted that upstream initiated RNAs are, on average, 2–3 nucleotides longer than RNAs initiating downstream of the TSS. The TSS itself appears to be the inflection point (Fig. 5c and Additional file 1: Figure S5). While the reasons for this difference remain to be investigated, we suggest that this may be due to different availability, or activity, of factors such as NELF or TFIIS for initiation at different locations. Notably, this extra length does not compensate for the additional distance upstream from the TSS, retaining the RNA length constraints. Analysis of the sequence context around the RNA 3′-ends (but not distance from the TSS) shows preference for G/C nucleotides (as also determined recently [31]), indicating that generation of TSS-RNAs (either the initial pausing or subsequent Pol II backtracking) does to some extent depend on the sequence context (Fig. 5d) [43]. Initiation events outside of promoters, namely, divergent transcription, where sequences are not likely to have specifically evolved to control transcription, had similar distributions of RNA sizes (Additional file 1: Figure S6). Taken together, these data indicate that pausing is a common, likely requisite step of Pol II transcription regardless of whether it initiates at or outside of gene promoters [30]. The data also suggest that the location of pausing, while to some extent sensitive to the sequence context, is ultimately constrained by a distance from transcription initiation.

Transcription initiation events can indicate genes producing stable RNA or non-genic regulatory elements such as transcriptional enhancers. By combining Start-seq with polyA RNA-sequencing, we sought to identify transcription initiation events that are not annotated in the databases [18, 20]. To identify new genes, we used our RNA seq dataset for transcript assembly with stringtie [44]. We identified 100% (17,175) of the known genes and 100% of the exons in the reference annotation, suggesting that the RNA-seq coverage is sufficiently high for annotation of gene transcripts. In addition, stringtie identified 6219 novel intergenic transcripts with 4651 (74.7%) of those containing more than one exon. These transcripts represent potentially novel genes in the rat genome (Additional file 4). Because a combination of HISAT2 and stringtie can over-report single exon transcripts [45], we considered new genes only among multi-exon transcripts, even at a cost of under-representing bona-fide single-exon genes. Figure 6b shows one such gene with a typical assembled RNA transcript and Start-seq signal at the would-be promoter location. This transcript is homologous to the AUTS2 locus in the mouse and human, a gene implicated in neurodevelopment processes [46] (Fig. 6b). The mean RNA expression of new genes is ~ 2-fold lower, but the overall distribution of expression levels is compatible to expression of known genes (Additional file 1: Figure S7), suggesting that low level expression is not the main reason for incomplete annotations and pointing that exploration of transcriptome in other cell types is worthwhile. Start-seq coverage of these genes was accordingly lower than that of the known genes (Additional file 1: Figure S7).

Fig. 6

Examples of transcriptome elements identified in the rat genome. a. A potential regulatory element (enhancer) upstream of Sox2 gene defined based on bidirectional TSS-RNA signal and low RNA-seq signal. b. Example of a new annotation in the rat genome showing homology to Auts2 gene. Transcripts assembled from RNA-seq data are shown inside a bracket. Mouse and H. sapiens genes are shown underneath

To identify potential regulatory elements, we used Homer [47] to find Start-seq peaks on both strands across the genome after excluding the known genic start sites (+/− 3 kb from the gene’s start site) found in either the Ref-Seq or our assembly. This resulted in 29,481 homer peaks. Because accessible genomic elements of including transcriptional enhancers are characterized by bidirectional transcription [20, 48,49,50], we used these peaks to identify regions of bidirectional TSS signal enrichment (Additional file 5, see Methods). Figure 6a shows one of those regions approximately 7 kb upstream of the Sox2 gene. This region represents an active enhancer near a transcriptionally productive developmental gene in these neural progenitor cells. The identified new genes and non-genic TSSs are listed in the Additional file 1.

Discussion

Using small capped RNA sequencing (Start-seq), we profiled Pol II transcription start sites and pausing in neural progenitors of the rat. Compared with human and mouse, the rat genome appears to be even more misannotated for gene TSSs and likely other genomic elements as well. By refining TSSs of known genes and identifying thousands of new TSSs of potential genes and non-genic elements, the first Start-seq datasets in the rat reported here will facilitate transcriptome profiling in other cell types of the rat as well as other organisms. Our definitions of new genes and non-genic elements are likely conservative, so that additional datasets are expected to further improve the scope and confidence of rat transcriptome annotations in various cell types.

Nascent RNA analysis methods such as Global Run-On and Precision Run-On sequencing (GRO-seq and PRO-seq) are powerful tools for transcriptome profiling [19, 26]. Start-seq is not able to measure expression of genes, but unlike the Run-On methods, it can profile transcription initiation events in specimens with inactivated RNA polymerase. Broader adoption of Start-seq has been limited by technical complexity. We have streamlined the Start-seq method by reducing the bench time it takes to complete the protocol. Future iterations of Start-seq method development will increase the specificity of 5′-capped RNA recovery and reduce the requirement for starting RNA material. For example, ribosomal RNAs constituted 15.7 and 44.9% of reads, respectively, in each replicate; the difference in the abundance of TSS-RNAs across replicates is consistent with the relative abundance of ribosomal RNA reads in each sample, indicating that rRNA reads constitute a major variable in Start-seq libraries, presumably at the level of RNA size selection during library preparation (Additional file 1: Figure S1). Combining Start-seq with rRNA depletion, as was recently done in Drosophila [18], should help circumvent this issue. Given the growing affordability of sequencing, it may also be prudent to opt for a higher sequencing depth instead of extra steps in Start-seq library preparation.

Start-seq allowed us to visualize the overall Pol II initiation landscape across the rat genome. We reaffirmed the prevalence of convergent and divergent initiation around Pol II-transcribed genes in the rat. The distances of antisense initiation sites to main TSSs vary widely among genes and, therefore, rather than specific sequences, are likely defined by topological features of the genome such as chromatin looping and/or sequence features such as CpG islands. Convergent initiation in general is shifted further away from the main TSS than divergent initiation, probably because the former takes place next to the + 1 nucleosome [35, 51, 52]. The magnitudes of sense and antisense signals show modest but positive correlation with transcription of the gene, which is comparable to correlation between pausing and gene expression, indicating that these events are co-regulated. We suggest that the Pol II machinery is commonly brought to the vicinity of the promoter (or a transcription factory [53, 54]) and then distributed according to its affinity to each potential start site within the local environment.

Pol II pausing involves a complex interplay of processes that include RNA capping, initial pausing, backtracking, and premature termination (reviewed in [31, 33, 55]). The location of pausing on genes in relation to the start site appears to be highly conserved across metazoans and peaks around ~ 35 nt from a gene TSS. Using Start-seq, we did not detect the bimodal distribution of TSS-RNA lengths observed in PRO-seq based experiments [31], consistent with earlier TSS-RNA data in mammalian organisms [20,21,22], although individual genes such as Actb do show that (Figs. 1b,c and 5b). This may be because TSS-RNA and PRO-seq detect nonidentical populations of RNA generated at different stages of Pol II pausing including processing and backtracking. Because pausing appears to take place during transcription at and outside of promoters [30], likely through the same underlying mechanisms, pausing may be better termed as initiation-proximal rather than promoter-proximal pausing.

Contribution of sequence-dependent and sequence-independent mechanisms to the establishment of Pol II pausing and subsequent Pol II release remain to be fully understood. We suggest that positioning of Pol II from the TSS determines where promoter-proximal pausing would occur. Conservation of pausing among different organisms and at sequence contexts throughout the genome, at and outside of gene promoters, indicates sequence-independent, likely universal mechanisms. Pausing establishing factor NELF (Negative Elongation Factor) [56, 57] and DRB Sensitivity Inducing Factor (DSIF) govern pausing on most, if not all, transcription events [58,59,60]. For example, NELF, through its multiple RNA binding sites [61, 62] or DSIF [59], may serve as a “ruler” to measure the distance of initial pausing or to define the location of subsequent backtracking. Given that pausing is also constrained by the sequence context [43], at least within up to five nucleotides, multiple mechanisms are likely at play. We suggest that the length-based universal constraints define the upper limit for pausing whereas DNA sequence, or balance of promoter activity and pause release, can alter that within these limitations [63]. Indeed, locations of 3′ ends vary on individual genes from 25 to 50 + nt (Figs. 4 and 5 and data not shown). Small RNAs reflect the complex processes during Pol II pausing and release [64], and their analysis in different systems and under different conditions will help shed light on these mechanisms.

By combining Start-seq and RNA-seq data from the same cells, we performed an initial profiling of genic and non-genic TSSs of the rat. This approach can be used for other systems, especially to map the noncoding transcription landscape. While our RNA-seq data detected 100% of known rn6 mRNAs and 100% of known exons, we are unlikely to have fully saturated the rat transcriptome by analyzing one cell type because some genes have low activity in these cells, especially for noncoding transcripts. The number of identified non-genic elements based on TSS-RNA in the rat is on a lower side of the numbers of enhancers reported based on histone marks [65,66,67]. Future analyses of RNA datasets will advance transcriptome annotations in various cell types of the rat as well as other, less studied organisms.

Conclusions

Applying an improved Start-seq procedure for rat neuronal progenitors and combining it with polyA RNA-sequencing from the same sample sets, we report the transcription initiation landscape in these cells that includes (i) refinement of known gene transcription start sites; (ii) profiling of antisense (divergent and convergent) transcription initiation; (iii) genome-wide profiling of Pol II pausing at and outside of gene promoters and (iv) identification of new genes and potential regulatory elements. The work presented here will help fine-tune DNA sequence-based approaches (e.g., CRISPR targeting) in rats and facilitate transcriptome profiling of other rat cell types as well as analyses of other organisms.

Methods

Animals and derivation of neuronal progenitors

All animal procedures were performed in accordance with the National Institute of Environmental Health Sciences (NIEHS) and the University of California Merced animal care committee’s regulations [NIEHS Institutional Animal Care and Use Committee (IACUC) approval: ASP#01–21; and University of California Merced IACUC approval: ASP#13–0007 and ASP#16–0004]. Time-pregnant rats were obtained from a commercial resource (Charles River). Pregnant dams were sacrificed by first deeply anesthetizing them (to minimize pain sensation during decapitation) by intraperitoneal injection of pentobarbital solution and then decapitated using a sharp guillotine. Embryonic animals were decapitated with a sharp pair of scissors, cortices were isolated and subsequently digested in Accutase (Gibco) for 5 min at room temperature. Cultures of cortical neural progenitors were prepared from embryonic day 14 (E14) Sprague Dawley rats of either sex. Single cell suspension was achieved by triturating digested tissue through fire-polished Pasteur pipettes. Cells were washed with Hank’s Balanced Salt Solution (HBSS) without calcium and magnesium, and then plated onto dishes coated with CELLstart (Gibco) in Knockout DMEM/F-12 (Gibco) supplemented with 2% StemPro Neural Supplement (Gibco), 2 mM GlutaMAX (Gibco), 20 ng/ml bFGF (Gibco), and 20 ng/ml EGF (Gibco). Cells were passaged every 2–3 days and routinely collected after the second passage.

Total RNA preparation

PolyA selected RNA libraries were prepared from 50 ng of total RNA extracted from frozen cell pellets using Trizol reagent. In addition to the standard Trizol procedure, we included a chloroform extraction step of the aqueous phase after chloroform-induced separation of phases, to remove traces of phenol. RNA Integrity Numbers were calculated by Bioanalyzer and were always > 6.8, to meet the RNA quality guidelines for RNA sequencing service (Novogene). PolyA-selected RNA-sequencing libraries were prepared using NEBNext Ultra II library preparation kit with NEB beads, using 12 cycles of amplification. Libraries were quantified on Bioanalyzer prior to sequencing.

Start-seq library preparation

scRNAs were prepared based on our earlier procedure [21, 22], with modifications. In brief (5*10^7) cells were used to extract nuclei by washing with hypotonic lysis buffer [22] followed by preparation of total RNA using Trizol reagent, size selection on 15% Urea-TBE gel (Novex), and crush and soak elution using cellulose acetate spin filters (Agilent cat# 5185-5990). After ethanol precipitation, size selected RNA was treated, successively, with T4 Polynucleotide Kinase 3′ phosphatase minus (New England Biolabs, NEB), 5′-polyphosphatase, terminator exonuclease (both Epicentre), followed by ligation of 3′-Illumina small RNA Tru-Seq adapter using T4 RNA Ligase 2, truncated K225Q (NEB). Reactions were then purified on 15% Urea-TBE gel (Novex) to select 45-100 nt RNA sizes, extracted from the gel as above and treated with Rpph (NEB) in Thermopol reaction buffer. After ligating the Illumina 5′-Tru-Seq small RNA adapter with WT T4 ssRNA Ligase 1 (NEB) in the presence of ATP, reverse transcription was done per Tru-Seq Illumina Small RNA kit and libraries were amplified for 18 PCR cycles. Phenol-chloroform, chloroform, and ethanol precipitation was used between each enzymatic treatment. PCR-amplified libraries were purified on a 6% TBE gel to remove linker dimers, extracted from the gel as above, and quantified using Bioanalyzer (Agilent) and droplet digital PCR (Bio-Rad) prior to sequencing.

Sequencing and initial data processing

Start-Seq libraries were sequenced on a MiSeq instrument for quality control locally and re-sequenced on HiSeq2500 using small RNA option (SE50) commercially (Novogene) to the depth of ~ 100 M raw reads per sample. Raw files for Start-seq were mapped to rn6 genome using Hisat2. To filter out highly abundant species with multiple genomic copies such as tRNAs, only uniquely mappable Start-seq reads (Hisat2 mapping score > 3) were considered for analysis. Mapped reads were assigned to annotated genes using rn6 RefSeq annotation.

RNA-sequencing analysis

PolyA-selective RNA sequencing was done to an average ~ 140 M raw reads per replicate using a commercial company (Novogene) from Trizol-extracted RNA. Reads were aligned to the rn6 genome using STAR and expression levels (FPKM) were obtained using the Rsubread [68] and DESeq2 [69] R packages. Transcripts were assembled using stringtie with default parameters.

Small RNA data analysis

Rn6 annotated gene TSS locations were obtained from UCSC and deduplicated to produce a list of unique start site coordinates for each gene. Contaminating RNAs (tRNA, rRNA, etc) and micro RNA species were removed from consideration for this study. The deeptools package [70] was used to convert alignment files to bigwig (bamCoverage) and to count reads +/− 500 base pairs around the TSS locations defined previously (computeMatrix). Strand information was preserved, and reads were counted in the sense direction for all genes both on the 5’ and 3’ of the reads. After fitting the location of the highest peak in each (annotated TSS-centered) gene window to normal distribution, the range of 1 SD from the mean (Additional file 1: Figure S1) (146 nt for replicate 1 and 149 nt for replicate 2 (Additional file 1: Figure S3) was used as the maximum distance to define genes on which we could reannotate TSSs. Custom R scripts were used to analyze transcription around these sites. Pausing Index (PI) was calculated as the ratio of scRNA signal within the TSS +/− 500 bp window in the sense direction and RNA-seq-derived expression level (FPKM) of the same gene. Metagene plots and heatmaps were made using MakeHeatmap or custom R scripts. Due to sequencing read length of 50, the maximum length insert we could identify by adapter trimming was 47, and therefore sequences longer than 48 nt are not represented in RNA length-based analyses, although lower-coverage paired end sequencing of the same Start-Seq libraries (Supplement) shows that the size distribution calculated from paired end read sequencing is the same. Individual Start-seq replicates were processed independently and, unless indicated otherwise, replicate 1, which contained higher coverage, is shown in main Fig. 2-d plots for Start-seq RNA were made with the R package ggplot using coordinates of individual genes relative to their TSS-RNA reannotated TSSs. For identification of new TSS elements, peaks called by Homer using “factor” and “separate strand” flags were filtered to exclude peaks inside all gene promoter regions (+/− 3 kb from each promoter) using annotatePeaks from the same package. To identify bidirectional regions of TSS-RNA enrichment, peaks called by homer were filtered to exclude those near gene start sites (+/−3kbp from each TSS). Among the remaining peaks, those that were within 3000 bp from each other were merged using bedtools if at least two of the adjacent peaks were on opposite strands. This resulted in ~ 8600 bidirectional TSS regions.

Availability of data and materials

Data generated in this work are deposited to the GEO database under accession # GSE130338. ChIP-sequencing data analyzed here are from GEO accession # GSM565202.

Acknowledgements

We thank Bony De Kumar and Archana Dhasarathy for critical reading of the manuscript.

Funding

The work was supported by the University of California Cancer Research Coordinating Committee CRN-18-524978 and NIH R01ES028738 grants to RNS, as well as P20GM104360 CoBRE, R21CA217751, and NSF CAREER 1750379 grants to NS. CRN-18-524978 grant: supported animal acquisition and cell culture, and CJD salary. R01ES028738 grant: supported RNS salary. P20GM104360 grant: provided salary support for AS (first author), AS (third author) and DP, and provided compute infrastructure for the project. R21CA217751 grant: provided support for library preparation supplies and Illumina sequencing. NSF CAREER 1750379 grant: provided salary support for SN and NK and Illumina sequencing. The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Affiliations

Department of Biomedical Sciences, University of North Dakota School of Medicine, Grand Forks, ND, 58202, USA

Contributions

AS1 designed the bioinformatics data analyses, conceived the manuscript structure based on analyzed data, and was a major contributor to writing of the manuscript. CJD collected rat tissues and cultured the neural progenitors. AS2 extracted RNA from neural progenitors and prepared TSS-RNA libraries. NK designed the analysis for determination of TSS-RNA sizes. DP performed bioinformatics data analysis. RNS Designed the manuscript and was responsible for rat neural progenitors. SN designed and performed data analyses and wrote the manuscript together with AS1. All authors have read and approved the manuscript.

Corresponding author

Ethics declarations

Ethics approval

All animal procedures were performed in accordance with the National Institute of Environmental Health Sciences (NIEHS) and the University of California Merced animal care committee’s regulations [NIEHS Institutional Animal Care and Use Committee (IACUC) approval: ASP#01–21; and University of California Merced IACUC approval: ASP#13–0007 and ASP#16–0004].

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Regions of bidirectional transcription outside of known genes identified based on TSS-RNA. (TXT 211 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.