This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Many eukaryotic genomes encode cis-natural antisense transcripts (cis-NATs). Sense and antisense transcripts may form double-stranded RNAs that are processed
by the RNA interference machinery into small interfering RNAs (siRNAs). A few so-called
nat-siRNAs have been reported in plants, mammals, Drosophila, and yeasts. However, many questions remain regarding the features and biogenesis
of nat-siRNAs.

Results

Through deep sequencing, we identified more than 17,000 unique siRNAs corresponding
to cis-NATs from biotic and abiotic stress-challenged Arabidopsis thaliana and 56,000 from abiotic stress-treated rice. These siRNAs were enriched in the overlapping
regions of NATs and exhibited either site-specific or distributed patterns, often
with strand bias. Out of 1,439 and 767 cis-NAT pairs identified in Arabidopsis and rice, respectively, 84 and 119 could generate at least 10 siRNAs per million reads
from the overlapping regions. Among them, 16 cis-NAT pairs from Arabidopsis and 34 from rice gave rise to nat-siRNAs exclusively in the overlap regions. Genetic
analysis showed that the overlapping double-stranded RNAs could be processed by Dicer-like
1 (DCL1) and/or DCL3. The DCL3-dependent nat-siRNAs were also dependent on RNA-dependent
RNA polymerase 2 (RDR2) and plant-specific RNA polymerase IV (PolIV), whereas only
a fraction of DCL1-dependent nat-siRNAs was RDR- and PolIV-dependent. Furthermore,
the levels of some nat-siRNAs were regulated by specific biotic or abiotic stress
conditions in Arabidopsis and rice.

Conclusions

Our results suggest that nat-siRNAs display distinct distribution patterns and are
generated by DCL1 and/or DCL3. Our analysis further supported the existence of nat-siRNAs
in plants and advanced our understanding of their characteristics.

Background

Gene regulation through small non-coding RNAs has been recognized as an important
mechanism for almost all cellular processes in eukaryotes. Most small RNAs (sRNAs)
are processed by RNase III-type ribonuclease Dicer-like (DCL) proteins, and incorporated
into Argonaute (AGO) proteins to guide gene silencing at the transcriptional level
through DNA or histone modification, and at the posttranscriptional level by mRNA
cleavage and degradation or translational repression [1-5]. Based on their precursor structures and biogenesis, sRNAs can be categorized into
microRNAs (miRNAs) and small-interfering RNAs (siRNAs). miRNAs are derived from hairpin-structured
precursors and do not depend on RNA-dependent RNA polymerases (RDRs) or RNA amplification
mechanisms, whereas endogenous siRNAs (endo-siRNA) are generated from long double-stranded
RNAs (dsRNAs) as a result of antisense transcription or the activity of plant-specific
RNA polymerases such as RNA polymerase (Pol)IV and RDRs [6-8].

Different classes of endogenous siRNAs have been described in Arabidopsis based on their distinct characteristics and biogenesis pathways [3,9]. The 24-nucleotide heterochromatic siRNAs (hc-siRNAs) or repeat-associated siRNAs
(ra-siRNAs) are generated by the PolIV/RDR2/DCL3 pathway and function in RNA-directed
DNA methylation and histone modifications [2,10-12]. Trans-acting siRNAs (ta-siRNAs) are 21 nucleotides in length and require DCL4, RDR6,
and miRNA-mediated cleavage that triggers the phased processing [3]. Long siRNAs (lsiRNAs) are 30 to 40 nucleotides in length and are dependent on DCL1,
HEN1, RDR6 and PolIV [13]. lsiRNAs most likely induce decapping and 5' to 3' mRNA degradation of their targets
[13]. Several cis-natural antisense transcripts (cis-NATs) have been reported to also generate siRNAs [14-18]. These so-called nat-siRNAs are induced by abiotic and biotic stresses [16,17,19] or accumulate in specific developmental stages [14,15]. The biogenesis of salt- and bacterium-induced nat-siRNAs appeared to require DCL1
and/or DCL2, RDR6, and PolIV in Arabidopsis [16,17]. Moreover, the expression of ARIADNE14 is de-repressed in dcl1, hen1, hyl1, sde4, rdr2 and sgs3, suggesting that the nat-siRNAs generated from the ARIADNE14/KOKOPELLI overlapping pair is dependent on DCL1, HEN1, HYL1, RDR2, SGS3 and PolIV [14]. The overlapping regions of some cis-NATs in mouse and Drosophila also produce endo-siRNAs, which are processed by Dcr-2 and loaded into AGO2 [20-24]. nat-siRNAs have also been found in budding yeast and Schistosoma japonicum [25,26]. Despite these reports, there remains a need for conclusive evidence for the occurrence
of nat-siRNAs in plants [27], because their features are not well understood and a general model for their generation
is still not available.

Here we report the identification of 33,826 and 162,397 siRNAs corresponding to 17,141
and 56,209 unique sequencing reads that were derived from cis-NATs of biotic and abiotic stress-challenged Arabidopsis and abiotic stress-treated rice, respectively. These siRNAs, displaying either a distributed
or a site-specific pattern, were enriched in the overlapping regions and often exhibited
strand bias. Genetic analysis revealed that nat-siRNAs in Arabidopsis were mainly generated by DCL1 and/or DCL3, and a subgroup of them was RDR- and PolIV-dependent.
Furthermore, accumulation of some nat-siRNAs was regulated by biotic and abiotic stress
conditions, which may contribute to plant responses to environmental stresses.

Results

Natural antisense transcripts and cis-NAT-derived siRNAs in Arabidopsis and rice

We sequenced a total of 21 sRNA libraries prepared from biotic and abiotic stress-treated
Arabidopsis and abiotic stress-challenged Oryza sativa plants. Biotic stress datasets for Arabidopsis included infection of non-pathogenic Pseudomonas syringae pv tomato (Pst) DC3000 hrcC-, a virulent strain Pst DC3000 carrying an empty vector, and an avirulent strain Pst DC3000 carrying an effector gene, avrRpt2, as previously described [28-30]. Abiotic stresses on Arabidopsis and rice included salt, cold and drought treatments (see Materials and methods). We
obtained a total of more than 24.6, 23.2 and 30.9 million raw sequencing reads from
these biotic and abiotic stress-treated Arabidopsis and abiotic stress-treated rice small-RNA libraries, respectively (Additional file
1). After trimming adaptor sequences and discarding reads shorter than 17 nucleotides
or longer than 28 nucleotides or of low quality, we obtained 13,985,938, 14,664,923
and 17,446,592 reads from the biotic and abiotic stress-treated Arabidopsis libraries and the abiotic stress-treated rice libraries that perfectly match to the
Arabidopsis and rice genomes or cDNAs, respectively (Additional file 1). Among all the small-RNA species profiled, 24-nucleotide sRNAs were the most abundant
size class (Additional file 2). The 5'-first nucleotides of sRNAs preferentially are adenosine or uracil (Additional
file 2). In addition to a large number of miRNAs [28,29], we discovered many new endogenous siRNAs belonging to different siRNA categories.
In this study, we focused on siRNAs that correspond to cis-NATs.

Additional file 1.Distributions of the sequencing reads from small RNA libraries of abiotic and biotic
treated Arabidopsis and abiotic challenged rice. Shown in the table are the total number of raw sequencing reads (total), the number
of qualified reads that can map perfectly to the corresponding Arabidopsis or rice genome (mapped), intergenic regions (intergenic), intronic sequences (introns),
transposons or repeats (repeats/mobile), tRNA, rRNA, snoRNA and snRNA sequences (ncRNA),
trans-acting siRNA (ta-siRNA), microRNA sequences (miRNAs), intron-exon junctions
(intron-exon junction) and exon-exon junctions (exon-exon junction). No mismatches
were allowed in the mapping. The second number for each condition (column) is the
percentage of reads relative to the total mapped reads.

Using the Arabidopsis genome annotation (TAIR/NCBI version 8.0) and the rice genome annotation (MSU RGAP
6.1), we searched for pairs of genes in Arabidopsis and rice that overlapped more than 25 nucleotides at the same genomic loci. We thus
identified 1,439 and 767 pairs of cis-NATs in Arabidopsis and rice, respectively (Table 1). cis-NATs can be further categorized into three groups: convergent (3'-3' overlap), divergent
(5'-5' overlap), and enclosed, in which one transcript is entirely encompassed by
the other. Most cis-NATs (1,126 and 560 pairs in Arabidopsis and rice, respectively) are arranged in convergent orientation (Table 1). We excluded 65 and 10 cis-NAT pairs in which at least one gene encodes rRNA, tRNA, small nuclear RNA (snRNA),
small nucleolar RNA (snoRNA), miRNA, ta-siRNA, or transposons in Arabidopsis and rice, respectively, from subsequent analyses. These RNA genes or transposons are
considered 'hot spots' of RNA degradation or small RNA generation and may mask the
real distribution of nat-siRNAs. We also analyzed the NAT genes targeted by miRNAs
to eliminate the possibility that the sRNAs that matched to the transcripts are secondary
siRNAs triggered by 22-nucleotide miRNAs [31,32]. We found that At2g33810 in the At2g33810/At2g33815 pair was a target of miR156 and
At1g53230 in the At1g53230/At1g53233 pair was a target of miR319. However, neither
miR156 nor miR319 is 22 nucleotides, and the siRNAs generated from At2g33810 and At1g53230
were not arranged in a phasing pattern, which is a characteristic pattern of miRNA-triggered
secondary siRNAs. Therefore, we retained At2g33810/At2g33815 and At1g53230/At1g53233
in our list. A total of 1,374 NAT pairs in Arabidopsis were used in this study. Among them, 186 genes (6.8% of 2,748 NAT genes) are annotated
as 'other RNAs'. They are most likely non-coding RNAs or RNAs with potential to encode
very short proteins or peptides.

Table 1. Different types of cis-NATs identified in Arabidopsis and rice and cis-NATs that generated cis-nat-siRNAs

We searched our small RNA deep-sequencing libraries for reads that matched perfectly
(that is, no mismatches) to the cis-NATs. We found that 84 (6.1% of 1,374) and 119 (15.7% of 757) of the Arabidopsis and rice cis-NAT pairs, respectively, gave rise to at least 10 sRNAs per one million total genome-mapped
reads from the overlapping regions of the cis-NATs in all the libraries (Table 1; Additional file 3). Interestingly, among the 84 Arabidopsis NAT pairs, 54 pairs have one transcript (32.1% of 168 genes) annotated as 'other RNAs'
(Additional file 4), suggesting that non-coding antisense RNAs are more likely to generate siRNAs. A
total of 33,826 and 162,397 siRNAs corresponding to 17,059 and 56,209 unique reads
were derived from cis-NATs in Arabidopsis and rice, respectively (Table 2). As shown in Additional file 2, nat-siRNAs are mainly 21 and 24 nucleotides in length, whereas in the total small
RNA population, the 24-nucleotide siRNAs are the most abundant species. The first-nucleotide
distribution for the nat-siRNAs was similar to that of the total sRNAs (Additional
file 2).

Additional file 3.Normalized reads of siRNAs mapped to the overlapping regions of Arabidopsis and rice cis-NAT pairs, respectively (a) siRNAs mapped to the overlapping regions of 84 Arabidopsis cis-NAT pairs. (b) siRNAs mapped to the overlapping regions of 119 rice cis-NAT pairs. Each small RNA library was normalized to one million reads with 100% matching
to the genome. The sum of normalized reads from each library was listed for each cis-NAT pair.

Additional file 4.Arabidopsis cis-NATs with two classes of small RNAs mapped to the overlapping regions. 'gene1' and 'gene2' represent the first and second transcript in each cis-NAT pair. Listed are 84 Arabidopsis cis-NATs with siRNAs mapped to the overlapping region. Raw reads of 20- to 22-nucleotide
and 23- to 28-nucleotide classes of siRNAs matching 100% to the genome were analyzed.
The percentage of 20- to 22-nucleotide and 23- to 28-nucleotide siRNAs are presented.

Table 2. The numbers of nat-siRNAs from the entire regions of cis-NATs, overlap regions of cis-NATs, introns in entire regions of cis-NATs and introns in overlap regions of cis-NATs in Arabidopsis and rice

Within the three types of cis-NATs, enclosed cis-NATs have the highest percentages for producing siRNAs, that is, 39 (27.1% of 144)
and 39 (36.4% of 107) of the enclosed cis-NATs in Arabidopsis and rice, respectively, generated siRNAs (Table 1). Among the 84 and 119 cis-NAT pairs that produce more than 10 siRNAs per million reads in the overlapping regions
in Arabidopsis and rice, respectively, 16 and 34 pairs (19.0% and 28.6%) have siRNAs exclusively
in the overlapping regions (Table 1, Figures 1a-i and 2a). To determine if siRNAs were truly enriched in the overlapping regions of NATs,
we took into consideration the length differences between the overlapping and non-overlapping
regions of cis-NATs by comparing the densities of siRNAs in these two regions. A statistical test
showed that the density of siRNAs in the overlapping regions was more than six times
that in the non-overlapping regions, with P-values of 0.0077 and 0.0242 in Arabidopsis and rice, respectively (see Materials and methods). To further explore whether cis-NATs have a higher likelihood to give rise to siRNAs than other protein-coding genes,
we computed the densities of sRNAs in the 3'-UTR overlapping regions of convergent
cis-NATs and 5'-UTR overlapping regions of divergent cis-NATs, and the siRNA densities in the 3'-UTRs and 5'-UTRs of protein-coding genes
that were not involved in NATs, respectively (see Materials and methods). We then
compared the density of siRNAs in the convergent cis-NATs with the density of siRNAs in the 3'-UTR of non-NAT genes in Arabidopsis; similarly, we compared divergent cis-NATs with the 5'-UTR of non-NAT genes. The densities of siRNAs in divergent 5' and
convergent 3' cis-NATs were statistically greater than that of the 5'- and 3'-UTRs of non-NAT genes,
with P-values of 0.0366 and 0.0438, respectively. These results confirmed that cis-NATs were more likely than non-NAT genes to generate siRNAs. This is further supported
by the study of Argonaute-associated sRNAs, which indicated that cis-NATs generated 2.29 times more sRNAs than other gene transcripts [33]. Our results were also consistent with that of animal endo-siRNAs, which are also
enriched in the overlapping regions of cis-NATs, for example, more than ten-fold enrichment in Drosophila [21-23,25,34].

Figure 1.Distinct distribution patterns of Arabidopsis nat-siRNAs. (a-i) Arabidopsis nat-siRNAs displaying a distributed pattern and exclusively derived from the overlapping
regions. (j-o) Arabidopsis nat-siRNAs displaying a distributed pattern but derived from both the overlap region
and non-overlap region. sRNAs matching the upper and lower strands are displayed above
and below the NAT pairs, respectively. The red regions on the gene model represent
exons, whereas the blue regions represent introns. The region between the green lines
represents the overlapping region of the NATs. siRNAs probed are indicated by black
arrows.

Figure 2.Distribution of rice nat-siRNAs with distribution pattern OS03G02762/OS03G02770 (a)
and OS08G06220/OS08G06230 (b) or site-specific pattern OS04G57140/OS04G57160 (c). Strands above or under the NAT pairs represent sRNAs positively or negatively mapping
to the upper strands. In the gene model, exons and introns are indicated by red regions
and blue regions, respectively. The overlapping regions of the NATs are indicated
by green lines.

The distribution of siRNAs along the cis-NATs display two distinct patterns, one is the distributed pattern with siRNAs scattered
along the overlapping regions or across the entire transcripts (Figures 1 and 2a, b), and the other is the site-specific pattern with siRNAs derived from one or a few
specific sites (Figures 2c and 3) (see Materials and methods for additional details). We identified 9 site-specific
patterns from Arabidopsis and 23 from rice (Additional files 5 and 6). Because RNA secondary structure analysis at these specific sites using RNAFold
[35] revealed no obvious stem-loop structures, these regions were unlikely to harbor new
miRNA genes or nat-miRNA genes as described [36], although we are aware of the limitation of RNAfold. Moreover, 46 and 63 (50.5% and
52.9%) cis-NAT pairs in Arabidopsis and rice, respectively, exhibited strand bias in generating nat-siRNAs with different
directions (with more than two-fold change) (Table 1).

Figure 3.Site-specific patterns of Arabidopsis nat-siRNAs At4g35850/At4g35860 (a), At2g33793/At2g33796 (b), and At1g51400/At1g51402
(c). siRNAs that positively or negatively map to the upper genes are present above or
under the gene pair, respectively. The introns of cis-NATs are represented by blue strands and overlapping regions of cis-NATs are shown by green lines. The black arrow represents the siRNAs detected by
Northern blot.

Additional file 5.(a) Distributions of the number of small RNA clusters in 84 Arabidopsis cis-NATs. The red line represents the separation between site-specific (left) and distributed
(right) patterns. (b) The plot of two metrics of 84 cis-NATs in Arabidopsis. Each dot represents the number of clusters within the cis-NAT whole region and the percentage of small RNA reads in all clusters. The red dots
in the rectangle were classified as site-specific patterns, whereas the blue dots
were distributed patterns, represented by a linear correlated line.

Requirement of DCL1 and/or DCL3 for the accumulations of nat-siRNAs

To experimentally characterize the nat-siRNAs, we examined the expression of 15 new
nat-siRNAs identified in Arabidopsis with relatively high numbers of reads. In addition to nat-siRNAATGB2, another 6 out
of the 15 nat-siRNAs that we examined showed clear siRNA signals by small RNA Northern
blot analysis using locked nucleic acid (LNA) probes (indicated by black arrows in
Figures 1 and 3; probes used are listed in Additional file 7), which enabled us to study their biogenesis by examining their expression in small
RNA pathway mutants. The biogenesis of the site-specific nat-siRNA generated from
the divergent pair At1g51400/At1g51402 (referred to as nat-siRNA1g51400; Figure 3c) was dependent on DCL1 and HEN1, but did not require any RDRs or PolIV (Figure 4a). Similarly, the nat-siRNA generated from the convergent pair At2g41590/At2g41600
with a distributed pattern (referred to as nat-siRNA2g41590; Figure 1j) was also DCL1-dependent but RDR- and PolIV-independent (Figure 4b). Both nat-siRNAs were 21 nucleotides in length. Previous studies have shown that
the biogenesis of bacterial-induced nat-siRNAATGB2, and the accumulation of some of
the salt-induced nat-siRNASRO5 and development-associated nat-siRNAKPL also depended
on DCL1 [14,16,17].

Additional file 7.Oligos used in this study. A plus sign ('+') before the nucleotide depicts LNA residues.

Figure 4.Biogenesis of nat-siRNAs that depend on DCL1. (a-c) Accumulation of nat-siRNA1g51400 (a), nat-siRNA2g41590 (b) and nat-lsiRNA4g34070 (c)
is shown in various dcl, rdr, polIV and polV mutants. Total RNA (75 to 100 μg) was used for Northern blot analysis. LNA probes
complementary to each nat-siRNA were used. U6 was an internal control to show equal
loading. The same results were obtained from two biological replicates.

We found that the convergent NAT pair At1g13330/At1g13340 (Figure 1o) could produce two classes of siRNAs, 21-nucleotide and 24-nucleotide siRNAs (referred
to as nat-siRNA1g13340) from the same site (Figure 5a). Genetic analysis revealed that the 21-nucleotide siRNAs were DCL1-dependent, whereas
the 24-nucleotide siRNAs were DCL3-dependent. Furthermore, we found that two other
24-nucleotide nat-siRNAs with a distributed pattern, nat-siRNA1g28100 generated from
NAT pair At1g28100/At1g28110 (Figure 1l) and nat-siRNA1g17745 generated from NAT pair At1g17744/At1g17745 (Figure 1n), were also dependent on DCL3 (Figure 5b, c). Interestingly, all the DCL3-dependent 24-nucleotide nat-siRNAs examined are dependent
on RDR2 and PolIV (Figure 5), while only some of the 21-nucleotide nat-siRNAs are dependent on RDRs and PolIV
[14,16,17]. Deep sequencing results showed that 82 out of the 84 cis-NAT pairs gave rise to two classes of siRNAs, 20- to 22-nucleotide and 23- to 28-nucleotide
siRNAs in the overlapping regions, while the remaining two pairs gave rise to the
20- to 22-nucleotide class only (Additional file 4). Thus, our results suggested that the cis-NATs have the potential to be processed by both DCL1 and DCL3.

Figure 5.Biogenesis of nat-siRNAs that depend on DCL3. (a-c) Accumulation of nat-siRNA1g13340 (a), nat-siRNA1g28100 (b) and nat-siRNA1g17745 (c)Total
RNA (75 to 100 μg) of distinct dcl, rdr, polIV and polV mutants and corresponding WT are probed by LNA probes indicated in Figures 1 and 3.
U6 was used as a loading control. Similar results were obtained in two biological
repeats.

nat-siRNAs generated from introns

In both plants and animals, nat-siRNAs are derived from NAT mRNAs and mainly function
post-transcriptionally [16,17,23]. This was also supported by our observation that some NATs with distributed nat-siRNAs
predominantly appeared in exonic regions of protein coding genes, for example, At2g33810/At2g33815,
At1g13330/At1g13340, OS08g06220/OS08g06230, and At4g35850/At4g35860 (Figures 1g, o, 2b and Figure 3a). However, based on the Arabidopsis genome annotation (TAIR version 8), we found that 383 of the 1,374 cis-NAT pairs (27.9%) contain introns in the overlapping regions. Among them, 45 (11.7%)
have siRNAs mapped to introns. Examples are shown for At2g41590/At2g41600, AT434070/At4g34071
and At2g02795/At2g02800 (Figure 1j, k, m). In rice, 346 of 757 (45.7%) cis-NAT pairs contain introns in the overlapping regions; one example is shown for OS04g57140/OS04g57160
in Figure 2c. A substantial number of nat-siRNAs originated from such introns within NATs, as
listed in Table 2 and Additional files 8 and 9. Similarly, it has been reported that a large number of siRNAs are also generated
from the introns of several cis-NATs in Drosophila [23]. It is possible that the intronic nat-siRNAs are generated from pre-mRNAs in the
nucleus before the introns are spliced out.

Additional file 8.Small RNAs mapped to the introns or intron-exon junction regions of Arabidopsis cis-NATs. Copy number indicates the number of raw reads in the combination of all libraries.
(a) siRNAs mapping to introns in the overlapping region of cis-NATs in Arabidopsis. (b) siRNAs mapping to introns in the whole region of cis-NATs in Arabidopsis.

Additional file 9.Small RNAs mapped to the introns or intron-exon junction regions of rice cis-NATs. Raw reads from all the libraries were analyzed. (a) siRNAs mapping to introns in the overlapping region of cis-NATs in rice. (b) siRNAs mapping to introns in the whole region of cis-NATs in rice

To confirm these intron-derived nat-siRNAs, we validated the candidates in small RNA
biogenesis mutants. Although the total number of siRNAs from the entire overlapping
regions might be large, the reads for individual siRNAs were still low. Most of the
siRNAs examined were below the detection level of Northern blot analysis except one
siRNA that is derived from an intron-exon junction of At4g34070 in the overlapping
region of the divergent NAT pair At4g34070/At4g34071 (Figure 1k). We detected a distinct 30-nucleotide siRNA band using a 24-nucleotide probe. Because
we used a 18- to 26-nucleotide size-fractionated small-RNA population for constructing
our small-RNA libraries, lsiRNAs were excluded from our small-RNA sequencing dataset.
To determine whether it was a new lsiRNA like the ones we detected before [13], or just a degradation product, we examined the dependence of this 30-nucleotide
siRNA on components of the small RNA biogenesis pathways. We found that the generation
of this siRNA depended on DCL1 and HEN1, but not on DCL3, or any RDRs, PolIV or PolV,
indicating that it was a new lsiRNA but not a hc-siRNA (Figure 4c). It is worth noting that the reported AtlsiRNA-1 is also dependent on DCL1 and HEN1
[13]. These results suggested that a subgroup of nat-siRNAs was likely to be generated
from pre-mRNAs.

Some nat-siRNAs regulate the expression of their cognate NAT mRNAs in cis

We found that siRNAs from some cis-NATs accumulated to different levels in different libraries (Additional files 10 and 11). Eighty cis-NAT pairs in Arabidopsis and 37 cis-NAT pairs in rice displayed a more than two-fold difference of siRNA accumulation
in the overlap regions under at least one stress condition as compared with mock-treated
samples. This result suggested that these nat-siRNAs were likely to regulate expression
of the NATs under stress conditions, although more experiments are needed to validate
and confirm these changes.

Additional file 10.Normalized reads of siRNAs mapped to 84 Arabidopsis NATs under different conditions. Each library was normalized to a million reads that perfectly mapped to the genome.
The fold change with maximum absolute value between treated sample and corresponding
control is listed as a supplemental value. The pairs with more than two-fold change
were annotated in red (up) or in blue (down). (a) Reads of small RNAs mapped to overlapping region of NATs under different conditions.
(b) Reads of small RNAs mapped to whole region of NAT under different conditions.

Additional file 11.Normalized reads of siRNAs mapped to 119 rice NATs under different conditions. siRNAs matching 100% to the genome were analyzed and each library was normalized
to a million reads before analyzing. The fold change with maximum absolute value between
treated and untreated libraries is listed. The pairs with more than a two-fold change
are displayed in red (up, +) or blue (down, -). (a) Reads of small RNAs mapped to overlapping region of NATs under different conditions.
(b) Reads of small RNAs mapped to whole region of NATs under different conditions.

To determine the regulatory function of nat-siRNAs on their NAT transcripts in cis, we performed computational analysis using dcl1-7 and dcl3-1 (inflorescence tissue) microarray data generated by the Carrington lab [37] (see Materials and methods), because DCL1 and DCL3 are responsible for nat-siRNA
biogenesis. Of the 84 pairs of cis-NAT genes, 93 genes from 78 pairs were present on the 22-K microarray chip. Of the
93 cis-NAT genes, 22 (23.7%) exhibited a greater than 1.5-fold up-regulation in the dcl1-7 mutant compared to the wild type (WT) control (P-value < 0.05; Additional file 12). As a comparison, only 176 out of 2,215 (7.9%) total NAT genes were up-regulated.
Although we cannot absolutely rule out the possible small RNA-independent function
of DCL1 on the accumulation of miRNA targets [38], the result that more siRNA-associated NATs (23.5%) were regulated by DCL1 than total
NATs (7.9%) strongly suggests that these small RNA-producing cis-NATs were more prone to down-regulation by DCL1-dependent nat-siRNAs. However, we
did not find any of the 93 NAT genes that were up-regulated more than 1.5-fold in
dcl3, suggesting that the DCL3-dependent nat-siRNAs may not be directly involved in the
expression regulation of the NAT transcripts.

Additional file 12.Arabidopsis cis-NATs that are up-regulated in the dcl1-7 mutant. (a) List of probes of the 93 genes of the 84 cis-NAT pairs. Annotations of the Affymetrix ATH1-121501 chip were used and no probes
were found for the rest of the cis-NAT genes. (b) The 23 cis-NAT genes up-regulated in dcl1-7.

To confirm the results from the microarray analysis and to examine the expression
of some siRNA-targeting NAT transcripts that were not present on the chip or were
not identified from the analysis, we used real-time RT-PCR to examine their expression
in the WT, dcl3-1 and dcl1-7 fwf2 double mutant plants. We chose to use the double mutant instead of the dcl1-7 single mutant because the fwf2 mutant, which carries a mutation in the ARF8 gene [39,40], largely rescued the pleiotropy and fertility defects of dcl1-7 [41,42]. We found that the accumulation of At1g51402 and At1g13330 increased six and four
times more, respectively, in the dcl1-7 fwf2 mutant than in WT plants (Figure 6), indicating that the antisense transcripts were no longer suppressed when nat-siRNA1g51400
and nat-siRNA1g13340 were not produced. For At2g41590/At2g41600 and At4g34070/At4g34071
pairs that produce siRNAs from both strands and have less strand bias, both sense
and antisense transcripts were up-regulated in the dcl1-7 fwf2 mutant compared to the WT, suggesting that both transcripts were under the regulation
of nat-siRNAs (Figure 6). Although it has been shown that the arf8 mutation does not interfere with miRNA-mediated gene silencing or RNAi or small RNA
generation [39,40], it is still possible that up-regulation of these NAT transcripts in the dcl1-7 fwf2 double mutant is due to the fwf2 mutation that is sRNA-independent. To test this possibility, we also examined the
expression levels of these NAT genes in the fwf2 single mutant. As shown in Additional file 13, in the fwf2 single mutant, there was no clear induction of these NAT transcripts except for At1g13340.
Both transcripts of the At1g13340/Ag1g13330 pair were induced in the dcl1-7/fwf2 double mutant and At1g13340 was induced to a similar extent in both single and double
mutants (about two-fold), suggesting that the up-regulation of At1g13340 was mainly
due to the arf8 mutation whereas induction of At1g13330 was attributable to the de-repression of small
RNA-mediated silencing by the mutation in DCL1. sRNAs generated from this pair had
a strong strand bias (Figure 1o), were predominantly generated from At1g13340 and targeted At1g13330. These results
correlated well with our real-time RT-PCR results. In summary, these data indicated
that the expression of NAT transcripts was regulated by the small RNAs generated from
DCL1.

Figure 6.Expression of the NAT transcripts is increased in dcl1-7 fwf2 and dcl3-1 mutants. Expression was examined by quantitative RT-PCR and Actin2 was used as an internal control. Total RNA (5 μg) was treated with DNaseI and then
subjected to reverse transcription. Error bars indicate standard deviations derived
from three technical replicates. Similar results were obtained from two biological
replicates.

Additional file 13.Expression analysis of NAT transcripts in the fwf2 single mutant. The expression of NAT transcripts was analyzed by quantitative RT-PCR. Total RNA
(5 μg) was used for DNase treatment and reverse transcription. Error bars indicate
the technical replicates and similar results were obtained from two biological repeats.

We found that only three out of the ten transcripts that we tested showed an increase
in the dcl3-1 mutant compared to WT (Figure 6), suggesting that some 24-nucleotide nat-siRNAs are still able to regulate the expression
of NAT transcripts, although a majority of the NATs were not affected by the 24-nucleotide
siRNAs.

To further analyze the function of these nat-siRNAs, we searched the publicly available
deep sequencing datasets of the AGO-associated small-RNAs in Arabidopsis for the nat-siRNAs in the overlap regions we identified [43,44]. As shown in Additional file 14, the 20- to 22-nucleotide nat-siRNAs were mainly loaded into AGO2, the Argonaute
protein that predominately binds 21-nucleotide sRNAs and contributes to anti-viral
and anti-bacterial resistances [45-48]. Interestingly, the 23- to 26-nucleotide long nat-siRNAs were loaded into both AGO1
and AGO4, the Argonaute proteins responsible for mRNA cleavage and DNA methylation,
respectively [49,50].

Additional file 14.Association of 20- to 22-nucleotide and 23- to 26-nucleotide classes of nat-siRNAs
with different AGOs in Arabidopsis. Unique reads of nat-siRNAs associated with AGO1, AGO2, AGO4 and AGO7 were analyzed.
Loaded reads represent unique reads of nat-siRNAs that were loaded into distinct AGOs.
The percentage in total nat-siRNA-AGO libraries indicates the distribution of loaded
nat-siRNAs in different AGOs.

Discussion

We identified 17,141 and 56,209 unique nat-siRNA sequences originating from cis-NATs in biotic and abiotic stressed Arabidopsis plants and abiotic stress-challenged rice, respectively. These siRNAs are enriched
in the overlapping regions of cis-NAT pairs and display distributed or site-specific patterns.

Our biogenesis analysis suggested that the siRNAs from cis-NATs were dependent on DCL1 and/or DCL3. Deep sequencing results showed that 82 out
of the 84 cis-NAT pairs that generated siRNAs in the overlapping regions gave rise to two classes
of siRNAs, 20- to 22-nucleotide and 23- to 28-nucleotide nat-siRNAs, while only two
pairs of NATs gave rise to one class of 20- to 22-nucleotide siRNAs (Additional file
4). The 20- to 22-nucleotide nat-siRNAs are generated by DCL1, whereas the 23- to 28-nucleotide
nat-siRNAs were DCL3-dependent (Figures 4 and 5). We also provide an example that a DCL1-dependent 21-nucleotide nat-siRNA and a
DCL3-dependent 24-nucleotide nat-siRNA were generated from the same site in the overlapping
region of a NAT pair (Figure 5a). These two classes of sRNAs have also been observed in transgene silencing, viral
RNA silencing and endogenous inverted repeat (IR) loci, where long double-stranded
precursors or intermediates are formed [51-55]. Different kinds of transgenic constructs can generate different types of sRNAs:
sense transgenes mainly generate 21- and 22-nucleotide siRNAs by DCL2 [56]; IR transgenes generate both 21- and 24-nucleotide siRNAs by DCL4 and DCL3, respectively,
or mainly 21-nucleotide siRNAs by DCL4 [55-58]; DCL1 and DCL2 are also involved in the generation of siRNAs induced by IR transgenes
[56]. Similarly, the 21-nucleotide, 22-nucleotide and 24-nucleotide endogenous siRNAs
are processed by DCL4, DCL2 and DCL3, respectively, from endogenous IR loci [52,53]. DCL1 is required for the accumulation of 21- and 24-nucleotide siRNAs processed
from IR transgenes or 22- and 24-nucleotide siRNAs processed from certain endogenous
IR loci [53,59]. Viral RNAs are processed into 21- and 24-nucleotide siRNAs by DCL4 and DCL3, respectively,
and DCL2 processes viral RNAs into 22-nucleotide siRNAs when DCL4 is suppressed or
inactivated [60-64]. Furthermore, some endogenous miRNAs in Arabidopsis and rice also generate two distinct sizes of DCL1-dependent short miRNAs and DCL3-dependent
long siRNAs or miRNAs at the same sites [28,65]. Our results suggest that, like the endogenous IR loci, transgene loci and viral
replicons as well as cis-NAT sense and antisense transcripts form double-stranded RNAs that can potentially
be processed by DCL1 and/or DCL3 to generate 20- to 22-nucleotide short siRNAs and
23- to 28-nucleotide long siRNAs, respectively. In this regard, the previously reported
apparent DCL2 dependence of a 24-nucleotide nat-siRNA [17] is intriguing and needs to be re-examined.

By examining the expression of NAT transcripts that are complementary to the siRNAs
and analyzing published microarray data, we found that many 21-nucleotide nat-siRNAs
were able to regulate the accumulation of their target transcripts (Figure 6; Additional file 12), which was consistent with the function of reported 21-nucleotide nat-siRNAs processed
by DCL1 [14,16,17]. However, only a small number of NATs targeted by siRNAs were up-regulated in the
dcl3 mutant, suggesting that only a fraction of the 24-nucleotide class of siRNAs may regulate
gene expression directly. Similar results were observed in the viral-derived 24-nucleotide
siRNAs. Although 24-nucleotide viral siRNAs are as abundant as 21-nucleotide and/or
22-nucleotide siRNAs, they can neither silence viral RNAs nor suppress viral infection
[60-64]. Endogenous 24-nucleotide siRNAs are mainly generated by the DCL3/RDR2/PolIV pathway,
and are often associated with AGO4 and AGO6 and mainly direct DNA methylation or chromatin
modification [49,50,54,66]. We found that the generation of the DCL3-dependent 23- to 28-nucleotide nat-siRNAs
also required RDR2 and PolIV. Further studies are necessary to determine whether these
23- to 28-nucleotide nat-siRNAs can also direct DNA methylation or histone modification
at their NAT loci.

We observed two major distribution patterns of nat-siRNAs - distributed and site-specific
- in Arabidopsis and rice. We found that the site-specific nat-siRNAs tested were generated by DCL1.
Because of its predominant role in processing hairpin structures into miRNAs, DCL1
is generally assumed to be capable of recognizing certain RNA secondary structures.
The sense and antisense transcripts in a NAT pair are transcribed separately, and
their overlapping regions may interact to form a dsRNA. In this partially dsRNA duplex,
complex secondary structures may arise, which are recognized by the versatile DCL1
enzyme, and thus are processed to yield site-specific nat-siRNAs, or possibly distributed
nat-siRNAs, depending on what secondary structures are formed and recognized by DCL1.
It is also possible that these apparently site-specific nat-siRNAs may be generated
across the whole overlapping region first as the distributed ones, but only specific
siRNAs from particular sites are stable and thus accumulated, while the rest are degraded.

We showed that RDRs were not always required for nat-siRNA formation (Figure 4a-c), which suggested that the dsRNAs formed within the overlapping regions of the sense
and antisense transcripts were sufficient for producing nat-siRNAs in such cases.
As for the nat-siRNAs that do require RDRs [13,16,17], they were not completely eliminated in the rdr mutants. The partial dependence on RDRs suggested that these nat-siRNAs could be primarily
generated from the dsRNA regions formed within overlapping sense-antisense transcripts,
and then may be amplified by RDRs in a secondary amplification step. However, because
the enrichment of cis-NAT-generated siRNAs compared to all genic siRNAs is relatively weak, one should
not automatically assume that siRNA clusters corresponding to the cis-NAT genes are always caused by the cis-NAT configuration; it is possible that some siRNAs may coincidentally be in a cis-NAT region. Nevertheless, our observation of some siRNA clusters that appear exclusively
in the overlapping regions of NATs strongly supports that at least some siRNAs are
indeed caused by the cis-NAT configuration.

We also found that many nat-siRNAs were derived from introns or intron-exon junctions.
It has been shown that the sense/antisense dsRNA pairs in vertebrates are processed
in the nucleus but not the cytoplasm [67,68]. Therefore, some plant nat-siRNAs may also be produced in the nucleus before introns
are spliced out from the NATs. Moreover, we found a new Arabidopsis lsiRNA, lsiRNA4g34070, which was generated from the exon-intron junction of a NAT
pair. Similar to AtlsiRNA-1, lsiRNA4g34070 was processed by DCL1.

The extent of transcript overlap within each cis-NAT may differ from the TAIR annotations and requires careful examinations. For example,
recent 3' rapid amplification of cDNA ends (RACE) experiments revealed that the major
SRO5 transcripts (data not shown) are shorter than previously indicated [17], although the experiments also showed a longer SRO5 transcript that extends beyond the detected nat-siRNAs (GenBank accession JQ513374). Some genes such as AGO3 were annotated in the TAIR database as having no 3'-UTRs, but longer 3'-UTRs for some
of these genes have been recovered by 3'-RACE in our labs and others. It appears that
many genes have multiple forms of transcripts with various lengths of 3'-UTRs.

Plant nat-siRNAs have been shown to regulate gene expression in stress responses or
under specific developmental stages [16,17]. In this study, we identified new nat-siRNAs that appeared to regulate the expression
of their NAT mRNAs, which may contribute to gene expression reprogramming and fine-tuning
in response to environmental stresses. NATs are widespread in eukaryotic organisms
[19,69-72]. The expression of antisense transcripts may differ among different cell types or
under different environmental conditions [19,73]. It is likely that different sets of nat-siRNAs may be generated from NATs in response
to different developmental and environmental cues, and therefore play a broad regulatory
role in gene expression under specific conditions. Several reported nat-siRNAs, such
as nat-siRNASRO5, nat-siRNAATGB2 and nat-siRNAKPL, are induced under stress or appear
at specific developmental stages and suppress the expression of their antisense transcripts
post-transcriptionally [14,16-18,69]. Similar examples have also been observed in animal systems, for example, the siRNAs
from the SIC34a1 gene pair are only generated from mouse kidney and testis [67]. Moreover, many cis-NAT gene pairs in mammalian brain, including the genes involved in Alzheimer's disease
and LRRTM1/Ctnna2, a pair of cis-NAT genes that participates in schizophrenia, can generate nat-siRNAs [74]. A lot of these nat-siRNAs are up-regulated in olfactory discrimination training
[74]. Therefore, most of the reported nat-siRNAs are generated under specific developmental
or environmental conditions. In this study, we detected bacterial-induced nat-siRNAATGB2
[16], as well as another siRNA that is in phase with nat-siRNAATGB2 (Figure 3a; Additional file 10) by deep sequencing, whereas the developmental stage-specific nat-siRNA nat-siRNAKPL
was not present in our database [14]. The salt-induced nat-siRNAs from the SRO5-P5CDH pair were also present in this dataset,
but the number of reads was below our cutoff threshold (minimum ten reads per million
total match reads in the overlapping region), so it was not included in our list.
In order to be consistent with other abiotic stress treatment, the salt treatment
condition in this study was different from that in the published work [17]. However, more SRO5-P5CDH siRNAs were detected from two other salt-treated small
RNA libraries in a separate study (Additional file 15; Gene Expression Omnibus (GEO) accession number GSE33642). It is worth noting that
our genome-wide analysis on nat-siRNA-targeting NAT transcripts used the microarray
result generated from healthy inflorescences (GEO accession number GSE2473), which
may miss some NAT transcripts that are regulated by siRNAs at a different developmental
stage or under specific environmental conditions. In addition, we cannot rule out
the possibility that some of the nat-siRNAs regulate the expression of their targets
by translational inhibition and therefore cannot be directly detected at the mRNA
level by microarray or real-time RT-PCR experiments.

Additional file 15.SRO5-P5CDH siRNAs derived from salt and cold stress challenged Arabidopsis. The siRNAs were identified from GEO database accession number GSE33642. (a) Reads of siRNAs positively/negatively match to At5G62520 (SRO5) in the overlap/non-overlap
region. The '+ strand' and '- strand' indicate positively or negatively matching.
(b) Distribution pattern of SRO5-P5DCH siRNAs. siRNAs positively or negatively matched
to At5G62520 are displayed above or below the SRO5-P5DCH gene pair. Exons and introns
are represented by red and blue dishes, respectively. The overlapping region is indicated
by the two green lines.

Conclusion

Our studies of plant sRNAs showed that those derived from plant cis-NATs were enriched in their overlap regions and cis-NATs are more likely to generate sRNAs than other transcripts. Taken together with
previously published data on plants, Drosophila and yeast, our results strongly support the existence of nat-siRNAs in eukaryotic
organisms. Our genetic analysis indicated that cis-NATs were processed by DCL1 and/or DCL3. By analyzing a large number of small RNA
libraries from stress-challenged plants, we found that some nat-siRNAs were likely
to respond to different environmental stresses, and therefore contributed to plant
stress resistance, development or other cellular processes by regulating corresponding
NAT mRNAs. nat-siRNA-mediated gene regulation is one of the regulatory mechanisms
for at least a subgroup of cis-NATs.

Small-RNA library construction and deep sequencing

Bacteria infiltration was carried out on the leaves of 4-week-old Arabidopsis WT and mutant plants as described previously [46]. Briefly, leaves of 4-week-old Arabidopsis plants growing at 23°C with 12 h light were infiltrated with mock (10 mM MgCl2), non-pathogenic strain Pst DC3000 hrcC-, Pst DC3000 (empty vector) and Pst DC3000 (avrRpt2) at a concentration of 2 × 107 cfu/ml. The leaves were collected at 6 and 14 hours post-inoculation.

For abiotic stress treatment in Arabidopsis, plants grew at 23°C with 16 h light for 19 days, and were then divided into four
groups. One of the groups was water-deprived (drought treatment) while the others
were maintained with regular irrigation. At day 29, one group of plants was treated
with salt solution (200 mM NaCl as irrigation solution for 24 h). Another group of
plants was transferred to a growth chamber at 5°C and 16 h light for 24 h for cold
treatment. The fourth group of plants was always maintained under regular conditions
as a normal control. At day 30, all four groups of plants were collected. For small
RNA library construction, the whole shoots of plants (stems, leaves and inflorescences)
were used.

Three-month-old rice growing at 28°C and approximately 13 h of natural light were
treated with drought (water withholding for 3 days), cold (5°C for 24 h) and salt
(400 mM NaCl for 24 h, added as irrigation solution). The inflorescences of these
plants were collected and used for sRNA library construction.

sRNA extraction and library construction were carried out as described previously
[75,76]. Briefly, total RNA was isolated from infiltrated leaves and fractionated on 15%
denaturing polyacrylamide gel. RNA molecules ranging from 18 to 26 nucleotides were
excised and ligated to 5'- and 3'-RNA adaptors using T4 RNA ligase followed by RT-PCR
and gel purification as described in the instructions from Illumina Inc. The small
RNA libraries were sequenced by Illumina Inc. and UCR core facility.

Northern blot analysis of siRNAs

We used LNA probes for small RNA Northern blot analysis to detect nat-siRNAs. A total
of 100 μg total RNA was separated on a 17% acrylimide gel. The blots were probed at
39°C with PerfectHyb Plus Hybridization Buffer (Sigma-Aldrich, Saint Louis, MO, USA).
The sequences of the oligos were listed in Additional file 7.

Quantitative RT-PCR analysis of small RNA targets

For QRT-PCR analysis, 5 μg total DNaseI-treated (Invitrogen, Grand Island, NY, USA)
RNA was used for synthesizing cDNA using Oligo dT and SuperScript II (Invitrogen).
Amplification of small RNA targets was carried out using a real-time PCR machine (MyiQ,
Bio-Rad, Hercules, CA, USA). The sequences of primers used here are listed in Additional
file 7.

Pre-processing of deep sequencing data

Raw sequence reads were parsed to remove 3'-adaptors using an in-house program. Unique
sequences from Arabidopsis small-RNA libraries were mapped to the Arabidopsis chromosome sequences [77]. To produce candidate datasets of sRNAs, we removed sequencing reads that matched
to rRNAs, tRNAs, snRNAs, snoRNAs, transposons, retrotransposons, or repeats. rRNA,
snRNA and snoRNA sequences were obtained from TAIR8 genome annotation, and tRNA sequences
were downloaded from the Arabidopsis tRNA database [78] and TAIR [77]. Transposons, retrotransposons, and repeat sequences were downloaded from the TIGR
v2 Arabidopsis Repeat Database [79] and RepBase [80].

The rice genome and annotation were retrieved from the MSU Rice Genome Annotation
Project V6.1 (RGAP 6.1). Non-coding RNA (tRNA, rRNA, snoRNA and snRNA) were collected
from fRNAdb [81] and the UCSC tRNA database. Transposons, retrotransposons and repeat sequences were
based on transposable elements of MSU RGAP 6.1 and RepBase [80].

Analysis of enrichment of endogenous sRNAs in the cis-NATs

The enrichment of sRNAs in the overlapping regions of NATs was examined by the method
outlined in [69]. The sRNAs in the combined libraries for Arabidopsis or rice were used here. Briefly, for an l-nucleotide genic or genomic region that gives rise to n sRNAs, we defined n/l as the density of small RNA loci in this region. For each pair of cis-NATs, we counted the number of sRNAs mapping to the overlapping region, No, and the total number of sRNAs matching the non-overlapping regions of the two genes,
Ng, and measured the length of the overlapping region, Lo, and the total length of non-overlapping regions of the two genes, Lg. Then we computed the density of small-RNA loci in the overlapping region, No/Lo, and that of the non-overlapping regions of the two NAT genes, Ng/Lg. The one-tail paired two-sample t-test was applied to the two-paired samples of siRNA density, No/Lo and Ng/Lg. The average densities in the overlapping regions (Ao) and in the non-overlapping regions of the NAT genes (Ag) that spawn sRNAs were computed. The ratio Ao/Ag was considered as the enrichment score.

In order to explore whether cis-NATs were more likely to give rise to siRNAs than other protein-coding genes, we
calculated the average densities in the overlapping regions (Ao) of cis-NATs. We then calculated the average densities Au of a set of randomly chosen regions from the other protein-coding genes, which have
the same size as the cis-NATs. Statistical significance was measured by a P-value computed as the frequency of the occurrences that Au was greater than Ao. We applied this method to compare the siRNA density of convergent cis-NATs with that of 3'-UTRs and the siRNA density of divergent cis-NATs with that of 5'-UTRs.

To analyze the regulation of nat-siRNAs under a specific condition, the sRNA reads
in a single library that matched 100% to the genome were normalized to one million.
Then the normalized sRNA abundances were compared between different libraries. Furthermore,
we calculated the maximum fold change among all treated samples compared to corresponding
control conditions as a supplemental value (Additional files 10 and 11).

Classification of nat-siRNAs

To classify the cis-NAT pairs into distributed and site-specific patterns, we first clustered sequencing
reads mapped to regions of cis-NATs. Starting from the first reads closest to the 5' end of a cis-NAT, reads were clustered if the positions of their first nucleotides were within
a ten-nucleotide long segment. Clusters with more than five reads were retained for
further analysis. We calculated two metrics for each cis-NAT: 1) the number of small-RNA clusters and 2) the percentage of the number of reads
within these clusters relative to the total number of reads mapped to the whole cis-NAT. We then categorized a cis-NAT to have a site-specific pattern if 1) it had no more than ten siRNA clusters
and 2) the percentage of the reads within the clusters was greater than 50%; otherwise
we categorized the cis-NAT to have a distributed pattern. The criterion of 10 siRNA clusters was chosen
based on the observation that the frequency of clusters of 84 Arabidopsis cis-NATs has a clear separation between 10 and above (Additional file 5). The percentage criterion (more than 50% sequence reads) was determined based on
a visual exanimation of the relationship between the number of clusters and the frequencies
of reads within the clusters in the 84 cis-NATS, as plotted Additional file 5. As shown in the figure, this relationship seemed to follow two distributions, one
is a linear correlation represented by the line in Additional file 5 and the other is a group with no more than ten clusters that retain more than 50%
of the total sequence reads. The second distribution thus formed site-specific patterns
of siRNAs. We thus used these two criteria because a site-specific pattern should
have relatively more reads appearing in a relatively smaller number of clusters than
a distributed pattern.

Microarray data and data analysis

The processed microarray data of dcl1-7 and Ler WT Arabidopsis were downloaded from GEO accession number GSE2473 (samples GSM47014, GSM47015, GSM47016,
GSM47017, GSM47018, GSM47019, GSM47023, GSM47024, GSM47025, GSM47026 and GSM47027).
We identified differentially expressed genes using Rank Product [82]. The rationale behind Rank Product is that it is unlikely for a gene to be ranked
in the top position in all replicates if the gene was not differentially expressed.
It has been shown that Rank Product is less sensitive to noise and has a better performance
than other methods when sample size is small [83]. We selected differentially expressed genes with a false discovery rate no greater
than 0.05 for 1,000 permutations. Differentially expressed probe sets were then mapped
to corresponding genes according to the annotation of the Affymatrix ATH1-121501 chip.

Analysis of association of AGOs and nat-siRNAs

The processed AGO data for Arabidopsis were downloaded from the GEO databases under accession numbers GSE10036 and GSE12037.
The number of AGO-associated nat-siRNAs was obtained by perfectly mapping the identified
nat-siRNAs to each of the AGO pull-down datasets separately. The number was then normalized
by the total number of AGO reads in an AGO dataset that can be perfectly mapped to
the genome raised to per million for normalization. The enrichment of nat-siRNAs in
an AGO pull-down relative to the other AGOs was calculated as the percentage of the
normalized nat-siRNAs in the AGO of the total normalized nat-siRNAs in all AGO pull-downs.

Identification of nat-siRNA regulated cis-NATs

The nat-siRNA regulated cis-NATs were identified from the set of differentially expressed genes from the microarray
data of the dcl1-7 and Ler WT control Arabidopsis plants. The analysis was carried out based on two criteria: 1) more than 1.5-fold
up-regulation with a P-value less than 0.05 in the dcl1-7 mutant and 2) presence of nat-siRNAs produced by the cis-NATs on the strands opposite to the NAT transcripts. The rationale behind these criteria
is that the main function of DCL1-generated nat-siRNAs is to target cis-NATs on the opposite strands to down-regulate their expression. Thus, in a DCL1 mutant
where no such nat-siRNAs are produced, the expressions of the targeted NATs are expected
to be up-regulated.

Authors' contributions

HJ initiated the project; HJ, WZ and JZ designed the experiments; XiaomingZ and BB
carried out the experiments; JX, XuefengZ and CZ performed the computational analyses;
SG, LL, DN, HZ, CL and TW participated in the experimental analyses. HJ, WZ, JZ, RL,
YiL, WL and JG coordinated the research. HJ, XiaomingZ, YL, JX, JZ and WZ wrote the
paper. All authors have read and approved the manuscript for publication.

Acknowledgements

We thank David Baulcombe, Jim Carrington, Herve Vaucheret, Xuemei Chen, John Clarke
and Zhixin Xie for providing seeds of various mutants. This work was supported by
NIH grant R01GM093008, a NSF Career Award MCB-0642843, and an AES-CE Research Allocation
Award PPA-7517H to H Jin, by NSF grant DBI-0743797, NIH grants RC1AR058681 and R01GM086412
to W Zhang, NIH grants R01GM070795 and R01GM059138 to J-K Zhu and by National Science
Foundation of China (30900102) to X Zhang.