Abstract

Variation in the presence or absence of transposable elements (TEs) is a major source of genetic variation between individuals. Here, we identified 23,095 TE presence/absence variants between 216 Arabidopsis accessions. Most TE variants were rare, and we find these rare variants associated with local extremes of gene expression and DNA methylation levels within the population. Of the common alleles identified, two thirds were not in linkage disequilibrium with nearby SNPs, implicating these variants as a source of novel genetic diversity. Many common TE variants were associated with significantly altered expression of nearby genes, and a major fraction of inter-accession DNA methylation differences were associated with nearby TE insertions. Overall, this demonstrates that TE variants are a rich source of genetic diversity that likely plays an important role in facilitating epigenomic and transcriptional differences between individuals, and indicates a strong genetic basis for epigenetic variation.

Introduction

Transposable elements (TEs) are mobile genetic elements present in nearly all studied organisms, and comprise a large fraction of most eukaryotic genomes. The two types of TEs are retrotransposons, which transpose via an RNA intermediate requiring a reverse transcription reaction, and DNA transposons, which transpose via either a cut-paste or, in the case of Helitrons, a rolling circle mechanism with no RNA intermediate (Wicker et al., 2007). TE activity poses mutagenic potential as a TE insertion may disrupt functional regions of the genome. Consequently, safeguard mechanisms have evolved to suppress this activity, including the methylation of cytosine nucleotides (DNA methylation) to produce 5-methylcytosine (mC), a modification that can induce transcriptional silencing of the methylated locus. In Arabidopsis thaliana (Arabidopsis), DNA methylation occurs in three DNA sequence contexts: mCG, mCHG, and mCHH, where H is any base but G. Establishment of DNA methylation marks can be carried out by two distinct pathways—the RNA-directed DNA methylation pathway guided by 24 nucleotide (nt) small RNAs (smRNAs), and the DDM1/CMT2 pathway (Zemach et al., 2013; Matzke and Mosher, 2014). A major function of DNA methylation in Arabidopsis is in the transcriptional silencing of TEs. Mutations in genes essential for DNA methylation establishment or maintenance can lead to a decrease in DNA methylation levels, expression of previously silent TEs, and in some cases transposition (Mirouze et al., 2009; Miura et al., 2001; Saze et al., 2003; Lippman et al., 2004; Jeddeloh et al., 1999; Zemach et al., 2013). In Arabidopsis, TEs are often methylated in all cytosine sequence contexts, in a pattern distinct from DNA methylation in other regions of the genome. Conversely, DNA methylation often occurs in gene bodies exclusively in the CG context and is correlated with gene expression, although this gene-body methylation appears dispensable (Bewick et al., 2016). Many thousands of regions of the Arabidopsis genome have been identified as differentially methylated between different wild Arabidopsis accessions, although the cause and possible function of these differentially methylated regions remains unclear (Schmitz et al., 2013).

TEs are thought to play an important role in evolution, not only because of the disruptive potential of their transposition. The release of transcriptional and post-transcriptional silencing of TEs can lead to bursts of TE activity, rapidly generating new genetic diversity (Vitte et al., 2014). TEs may carry regulatory information such as promoters and transcription factor binding sites, and their mobilization may lead to the creation or expansion of gene regulatory networks (Hénaff et al., 2014; Bolger et al., 2014; Ito et al., 2011; Makarevitch et al., 2015). Furthermore, the transposase enzymes required and encoded by TEs have frequently been domesticated and repurposed as endogenous proteins, such as the DAYSLEEPER gene in Arabidopsis, derived from a hAT transposase enzyme (Bundock and Hooykaas, 2005). Clearly, the activity of TEs can have widespread and unpredictable effects on the host genome. However, the identification of TE presence/absence variants in genomes has remained difficult to date. It is challenging to identify the structural changes in the genome caused by TE mobilization using current short-read sequencing technologies as these reads are typically mapped to a reference genome, which has the effect of masking structural changes that may be present. However, in terms of the number of base pairs affected, a large fraction of genetic differences between Arabidopsis accessions appears to be due to variation in TE content (Cao et al., 2011; Quadrana et al., 2016). Therefore, identification of TE variants is essential in order to develop a more comprehensive understanding of the genetic variation that exists between genomes, and of the consequences of TE movement on genome and cellular function.

In order to accurately map the locations of TE presence/absence variants with respect to a reference genome, we have developed a novel algorithm, TEPID (Transposable Element Polymorphism IDentification), which is designed for population studies. We tested our algorithm using both simulated and real Arabidopsis sequencing data, finding that TEPID is able to accurately identify TE presence/absence variants with respect to the Col-0 reference genome. We applied our TE variant identification method to existing genome resequencing data for 216 different Arabidopsis accessions (Schmitz et al., 2013), identifying widespread TE variation amongst these accessions and enabling exploration of TE diversity and links to gene regulation and epigenomic variation.

Results

Computational identification of TE presence/absence variation

We developed TEPID, an analysis pipeline capable of detecting TE presence/absence variants from paired end DNA sequencing data. TEPID integrates split and discordant read mapping information, read mapping quality, sequencing breakpoints, as well as local variations in sequencing coverage to identify novel TE presence/absence variants with respect to a reference TE annotation (Figure 1; see Materials and methods). This typically takes 5–10 min per accession for Arabidopsis genomic DNA sequencing data at 20-40x coverage, excluding the read mapping step. After TE variant discovery has been performed, TEPID then includes a second refinement step designed for population studies. This examines each region of the genome where there was a TE presence identified in any of the analyzed samples, and checks for evidence of this insertion in all other samples. In this way, TEPID leverages TE variant information for a group of related samples to reduce false negative calls within the group. Testing of TEPID using simulated TE variants in the Arabidopsis genome showed that it was able to reliably detect simulated TE variants at sequencing coverage levels commonly used in genomics studies (Figure 1—figure supplement 1).

TE variant discovery pipeline.

Principle of TE variant discovery using split and discordant read mapping positions. Paired end reads are first mapped to the reference genome using Bowtie2 [Langmead and Salzberg, 2012]. Soft-clipped or unmapped reads are then extracted from the alignment and re-mapped using Yaha, a split read mapper [Faust and Hall, 2012]. All read alignments are then used by TEPID to discover TE variants relative to the reference genome, in the 'tepid-discover’ step. When analyzing groups of related samples, these variants can be further refined using the ‘tepid-refine’ step, which examines in more detail the genomic regions where there was a TE variant identified in another sample, and calls the same variant for the sample in question using lower read count thresholds as compared to the ‘tepid- discover’ step, in order to reduce false negative variant calls within a group of related samples.

In order to further assess the sensitivity and specificity of TE variant discovery using TEPID, we identified TE variants in the Landsberg erecta (Ler) accession, and compared these with the Ler genome assembly created using long PacBio sequencing reads (Chin et al., 2013). Previously published 100 bp paired-end Ler genome resequencing reads (Schneeberger et al., 2011) were first analyzed using TEPID, enabling identification of 446 TE presence variants (Figure 1—source data 1) and 758 TE absence variants (Figure 1—source data 2) with respect to the Col-0 reference TE annotation. Reads providing evidence for these variants were then mapped to the Ler reference genome, generated by de novo assembly using Pacific Biosciences P5-C3 chemistry with a 20 kb insert library (Chin et al., 2013), using the same alignment parameters as were used to map reads to the Col-0 reference genome. This resulted in 98.7% of reads being aligned concordantly to the Ler reference, whereas 100% aligned discordantly or as split reads to the Col-0 reference genome (Table 1). To find whether reads mapped to homologous regions in both the Col-0 and Ler reference genomes, we conducted a BLAST search (Camacho et al., 2009) using the DNA sequence between read pair mapping locations in the Ler genome against the Col-0 genome, and found the top BLAST result for 80% of reads providing evidence for TE insertions, and 89% of reads providing evidence for TE absence variants in Ler, to be located within 200 bp of the TE variant reported by TEPID. Thus, reads providing evidence for TE variants map discordantly or as split reads when mapped to the Col-0 reference genome, but map concordantly to homologous regions of the Ler de novo assembled reference genome, indicating that structural variation is present at the sites identified by TEPID, and that this is resolved in the de novo assembled genome.

Note: Discordant and split read categories are not mutually exclusive, as some discordant reads may have one read in the mate pair split-mapped.

To estimate the rate of false negative TE absence calls made using TEPID, we compared our Ler TE absence calls to the set of TE absences in Ler genome identified previously by aligning full-length Col-0 TEs to the Ler reference using BLAT (Quadrana et al., 2016). We found that 89.6% (173/193) of these TE absences were also identified using TEPID, indicating a false negative rate of ~10% for TE absence calls. To determine the rate of false negative TE presence calls, we ran TEPID using 90 bp paired-end Col-0 reads (Col-0 control samples from [Jiang et al., 2014]), aligning reads to the Ler PacBio assembly. As TEPID requires a high-quality TE annotation to discover TE variants, which is not available for the Ler assembly, we looked for discordant and split read evidence at the known Col-0-specific TEs (Quadrana et al., 2016), and found evidence reaching the TEPID threshold for a TE presence call to be made at 89.6% (173/193) of these sites, indicating a false negative rate of ~10%. However, it should be noted that this estimate does not take into account the TEPID refinement step used on large populations, and so the false negative rate for samples analyzed in the population from Schmitz et al. (2013) is likely to be lower than this estimate, as each accession gained on average 4% more presence calls following this refinement step (Figure 2—figure supplement 1).

TEPID was used to analyze previously published 100 bp paired-end genome resequencing data for 216 different Arabidopsis accessions (Schmitz et al., 2013), and identified 15,007 TE presence variants (Figure 2—source data 1) and 8088 TE absence variants (Figure 2—source data 2) relative to the Col-0 reference accession, totalling 23,095 unique TE variants. A recent study focused on identifying recent TE insertions containing target site duplications in this population (Quadrana et al., 2016). Our goal was to provide a comprehensive assessment of TE presence/absence variation in Arabidopsis. In most accessions TEPID identified 300–500 TE presence variants (mean = 378) and 1000–1500 TE absence variants (mean = 1279), the majority of which were shared by two or more accessions (Figure 2—figure supplement 2). Although more TE absences were found on an accession-by-accession basis, overall TE presence variants were more common in the population as the TE absences were often shared between multiple accessions. PCR validations were performed for a random subset of 10 presence and 10 absence variants in 14 accessions (totalling 280 validations), confirming the high accuracy of TE variant discovery using the TEPID package, with a false positive rate for both TE presence and TE absence identification of ~9%, similar to that observed using simulated data and the Ler genome analysis (Figure 2—figure supplement 3). The number of TE presence variants identified was positively correlated with sequencing depth of coverage, while the number of TE absence variants identified had no correlation with sequencing coverage (Figure 2—figure supplement 4A,B), indicating that the sensitivity of TE absence calls is not limited by sequencing depth, while TE presence identification benefits from high sequencing depth. However, accessions with low coverage gained more TE presence calls during the TEPID refinement step (Figure 2—figure supplement 4C), indicating that these false negatives were effectively reduced by leveraging TE variant information for the whole population.

As TE presence and TE absence calls represent an arbitrary comparison to the Col-0 reference genome, we sought to remove these arbitrary comparisons and classify each variant as a new TE insertion or true deletion of an ancestral TE in the population. To do this, the minor allele frequency (MAF) of each variant in the population was examined, under the expectation that the minor allele is the derived allele. Common TE absences relative to Col-0, absent in ≥80% of the accessions examined, were re-classified as TE insertions in Col-0, and common TE presences relative to Col-0, present in ≥80% of accessions, as true TE deletions in Col-0. Cases where the TE variant had a high MAF (>20%) were unable to be classified, as it could not be determined if these were cases where the variant was most likely to be a true TE deletion or a new TE insertion. While these classifications are not definitive, as there may be rare cases where a true TE deletion has spread through the population and becomes the common allele, it should correctly classify most TE variants. Overall, 72.3% of the TE absence variants identified with respect to the Col-0 reference genome were likely due to a true TE deletion in these accessions, while 4.8% were due to insertions in Col-0 not shared by most other accessions in the population (Table 2). High allele frequency TE presence variants relative to Col-0, representing true deletions in Col-0, were much more rare, with 97.8% of initial TEPID TE presence calls being subsequently classified as true insertions. The rarity of true deletions identified in Col-0 is likely due to a reference bias in the TE variant identification method using short read data, as false negative presence calls in the population will reduce the number of true deletions identified in Col-0 due to a reduction in the allele frequency for that variant, causing the frequency of TE presence variants in non-Col-0 accessions to fall below the required 80% threshold for some variants. This is not expected to have a large impact on subsequent population-scale analyses, as Col-0 is only one accession out of the 216 analyzed. Accessions were found to contain on average ~240 true deletions and ~300 true insertions (Figure 2—figure supplement 5). Overall, we identified 15,077 TE insertions, 5856 true TE deletions, and 2162 TE variants at a high MAF that were unable to be classified as an insertion or deletion (Figure 2—source data 3).

While TE deletions were strongly biased towards the pericentromeric regions where TEs are found in high density, TE insertions had a more uniform distribution over the chromosome. This suggests that TE insertion positions are largely random but may be eliminated from chromosome arms through selection, and accumulate in the pericentromeric regions where low recombination rates prevent their removal (Figure 2A). TE deletions and common TE variants were found in similar chromosomal regions, as deletion variants represent the rare loss of common variants. Among TE deletions, DNA TEs were slightly less biased towards the centromeres in comparison to the distribution of RNA TEs (Figure 2—figure supplement 6). The distribution of rare (<3% minor allele frequency [MAF], <7 accessions; see Materials and methods) TE variants and TE insertions was similar to that observed for regions of the genome previously identified as being differentially methylated in all DNA methylation contexts (mCG, mCHG, mCHH) between the wild accessions (population C-DMRs) (Schmitz et al., 2013). In contrast, population CG-DMRs (differentially methylated in the mCG context) less frequently overlapped with all types of TE variants identified and instead closely followed the chromosomal distribution of genes. This was expected, as CG-DMRs are associated with gene bodies whereas C-DMRs are associated with TEs (Schmitz et al., 2013). Furthermore, genes and DNase I hypersensitivity sites (putative regulatory regions) (Sullivan et al., 2014) rarely contained a TE variant, whereas ~20–35%% of gene flanking regions, pseudogenes, intergenic regions, and other TEs were found to contain a TE variant (Figure 2B). This again suggests that TE insertions occur randomly across the genome, with deleterious insertions that occur in functional regions of the genome being subsequently removed through selection. TE deletions and common TE variants were enriched within the set of TE variants found in gene bodies, indicating that TE deletions within genes may be better tolerated than new TE insertions within genes (Figure 2C,D). No significant enrichment was found for TE variants within the KNOT ENGAGED ELEMENT (KEE) regions, previously identified as regions that may act as a 'TE sink' (Grob et al., 2014) (Figure 2—figure supplement 7). This may indicate that these regions do not act as a 'TE sink' as has been previously proposed, or that the 'TE sink' activity is restricted to very recent insertions, as the insertions we analysed in this population were likely older than those used in the KEE study (Grob et al., 2014).

Extensive novel genetic diversity uncovered by TE variant analysis.

(A) Distribution of identified TE variants on chromosome 1, with distributions of all Col-0 genes, Col-0 TEs, and population DMRs. (B) Proportion of different genomic features containing one or more TE variants. (C) Proportion of TE variants within each genomic feature classified as deletions or insertions. (D) Proportion of TE variants within each genomic feature classified as rare (<3% MAF) or common (≥3% MAF). (E) Enrichment and depletion of TE variants categorized by TE superfamily compared to the expected frequency due to genomic occurrence.

Among the identified TE variants, several TE superfamilies were over- or under-represented compared to the number expected by chance given the overall genomic frequency of different TE types (Figure 2E). In particular, both TE insertions and deletions in the RC/Helitron superfamily were less numerous than expected, with an 11.5% depletion of RC/Helitron elements in the set of TE variants. In contrast, TEs belonging to the LTR/Gypsy superfamily were more frequently deleted than expected, with a 17% enrichment in the set of TE deletions. This was unlikely to be due to a differing ability of the detection method to identify TE variants of different lengths, as the TE variants identified had a similar distribution of lengths as all Arabidopsis TEs annotated in the Col-0 reference genome (Figure 2—figure supplement 8). These enrichments suggest that the RC/Helitron TEs have been relatively dormant in recent evolutionary history, while the LTR/Gypsy TEs, which are highly enriched in the pericentromeric regions, are frequently lost from the Arabidopsis genome. At the family level, we observed similar patterns of TE variant enrichment or depletion (Figure 2—figure supplement 9;Figure 2—source data 4). As certain TEs present in Col-0 have previously been genotyped in 47 different accessions, allele frequency data was available for some TEs (Hollister and Gaut, 2007), and we compared these previous allele frequency estimates with our estimates based on the short read data. We found a weakly positive linear correlation (r2 = 0.3) between the previous allele frequency estimates for Basho family TEs and our allele frequency estimates, which may not be unexpected given the differing population sizes and TE variant detection methods used (Figure 2—figure supplement 10).

We further examined Arabidopsis (Col-0) DNA sequencing data from a transgenerational stress experiment to investigate the possible minimum number of generations required for TE variants to arise (Jiang et al., 2014). In one of the three replicates subjected to high salinity stress conditions, we identified a single potential TE insertion in a sample following 10 generations of single-seed descent, while no TE variants were identified in any of the three control single-seed descent replicate sets. However, without experimental validation it remains unclear if this represents a true variant. Therefore, we conclude that TE variants may arise at a rate less than 1 insertion in 60 generations under laboratory conditions. Further experimental work will be required to precisely determine the rate of transposition in Arabidopsis.

Relationship between TE variants and single nucleotide polymorphisms

Although many thousands of TE variants were identified, they may be linked to the previously identified single nucleotide polymorphisms (SNPs), or unlinked from SNPs across the accessions. This distinction is important, as studies aiming to link epigenetic diversity to genetic variants using only SNPs would fail to detect such a link caused by TE variants if the TE variants are not in LD with SNPs. We tested how frequently common TE variants (>3% MAF; see Materials and methods) were linked to adjacent SNPs to determine when they would represent a previously unassessed source of genetic variation between accessions. SNPs that were previously identified between the accessions (Schmitz et al., 2013) were compared to the presence/absence of individual TE variants. For the common TE variants in the population, the nearest flanking 300 SNPs upstream and 300 SNPs downstream of the TE variant site were analyzed for local linkage disequilibrium (LD, r2; see Materials and methods). TE variants were classified as being either ‘low’, ‘mid’, or ‘high’ LD variants by comparing ranked r2 values of TE variant to SNPs against the median ranked r2 value for all between SNP comparisons (SNP-SNP) to account for regional variation in the extent of SNP-SNP LD (Figure 3A,B) due to recombination rate variation or selection (Horton et al., 2012). The majority (61%) of common TE variants had low LD with nearby SNPs, and represent a source of genetic diversity not previously assessed by SNP-based genotype calling methods (Figure 3C). 29% of TE variants displayed high levels of LD and are tagged by nearby SNPs, while only 10% had intermediate levels of LD. We observed a positive correlation between TE variant MAF and LD state, with variants of a high MAF more often classified as high-LD (Figure 3D). While the proportion of TE variants classified as high, mid, or low-LD was mostly the same for both TE insertions and TE deletions, TE variants with a high MAF (>20%) that were unable to be classified as either true deletions or as new insertions had a much higher proportion of high-LD variants (Figure 3E). This was consistent with the observation that the more common alleles were more often in a high-LD state. TE variants displayed a similar distribution over chromosome 1 regardless of linkage classification (Figure 3—figure supplement 1). Overall, this analysis revealed an abundance of previously uncharacterized genetic variation that exists amongst Arabidopsis accessions caused by the presence or absence of TEs, and illustrates the importance of identifying TE variants alongside other genetic diversity such as SNPs.

TE variants affect gene expression

To determine whether the newly discovered TE variants may affect nearby gene expression, the steady state transcript abundance within mature leaf tissue was compared between accessions with and without TE insertions or deletions, for genes with TE variants located in the 2 kb gene upstream region, 5’ UTR, exons, introns, 3’ UTR or 2 kb downstream region (Figure 4A). While the steady state transcript abundance of most genes appeared to be unaffected by the presence of a TE, 168 genes displayed significant differences in transcript abundance linked with the presence of a TE variant, indicating a role for these variants in the local regulation of gene expression (1% false discovery rate; >2-fold change in transcript abundance; Figure 4A, Figure 4—source data 1). No functional category enrichments in this set of differentially expressed genes were identified. As rare TE variants may also be associated with a difference in transcript abundance, but were unable to be statistically tested due to their rarity, a burden test for enrichment of rare variants in the extremes of expression was performed (Zhao et al., 2016). Briefly, this method counts the frequency of rare variants within each gene expression rank in the population, and aggregates this information over the entire population to determine whether an enrichment of rare variants exists within the gene expression extremes for the population. A strong enrichment for gene expression extremes was observed for TE variants in all gene features tested (Figure 4B). While TE variants in gene upstream regions showed a strong enrichment of both high and low gene expression ranks, TE variants in exons or gene downstream regions were more skewed towards low expression ranks than high ranks. Randomization of the accession names removed these enrichments completely (Figure 4—figure supplement 1), and there was little difference between TE insertions and TE deletions in the gene expression rank enrichments found (Figure 4—figure supplement 2). This rare variant analysis further indicates that TE variants may alter the transcript abundance of nearby genes, with TE variants in exons or gene downstream regions being mostly associated with gene downregulation, whereas TE variants in gene upstream regions appear to be associated with gene activation and gene repression equally often.

As both increases and decreases in transcript abundance of nearby genes were observed for TE variants within each gene feature, it appears to be difficult to predict the impact that a TE variant may have on nearby gene expression based on TE insertion position alone. Furthermore, gene-level transcript abundance measurements may fail to identify potential positional effects of TE variants upon transcription. To more closely examine changes in transcript abundance associated with TE variants among the accessions, we inspected a subset of TE variant sites and identified TE variants that appear to have an impact on transcriptional patterns beyond simply a change in total transcript abundance of a nearby gene. For example, the presence of a TE insertion within an exon of AtRLP18 (AT2G15040) was associated with truncation of the transcripts at the TE insertion site in accessions possessing the TE variant, as well as silencing of a downstream gene encoding a leucine-rich repeat protein (AT2G15042) (Figure 5A,B). Both genes had significantly lower transcript abundance in accessions containing the TE insertion (p < 5.8 × 10-10, Mann-Whitney U test). As four accessions that were predicted to contain the TE insertion within AtRLP18 appeared to have the non-insertion RNA expression pattern (Figure 5A), we performed additional PCR validations on two of these four accessions, as well as two accessions with truncated RNA expression. These validations showed that the accessions predicted to contain the TE insertion but also expressing AtRLP18 were false positive calls (Figure 5—figure supplement 1). However, the false positive rate for this site (~3%) was still lower than our global estimate for TEPID. AtRLP18 has been reported to be involved in bacterial resistance, with the disruption of this gene by T-DNA insertion mediated mutagenesis resulting in increased susceptibility to the bacterial plant pathogen Pseudomonas syringae (Wang et al., 2008). Examination of pathogen resistance phenotype data (Aranzana et al., 2005) revealed that accessions containing the TE insertion in the AtRLP18 exon were more often sensitive to infection by Pseudomonas syringae transformed with avrPpH3 genes (Figure 5C). This suggests that the accessions containing this TE insertion within AtRLP18 may have an increased susceptibility to certain bacterial pathogens.

Effects of TE variants on local gene expression.

(A) Genome browser representation of RNA-seq data for genes AtRLP18 (AT2G15040) and a leucine-rich repeat family protein (AT2G15042). All accessions predicted to contain the TE insertion are shown. Inset shows magnified view of the TE insertion site for two accessions. (B) AtRLP18 and AT2G15042 RNA-seq FPKM values for all accessions. (C) Percentage of accessions with resistance to Pseudomonas syringae transformed with different avr genes, for accessions containing or not containing a TE insertion in AtRLP18. (D) Genome browser representation of RNA-seq data for a PPR protein-encoding gene (AT2G01360) and QPT (AT2G01350), showing transcript abundance for these genes in accessions containing a TE insertion variant in the upstream region of these genes, as well as in Col-0. (E) RNA-seq FPKM values for QPT and a gene encoding a PPR protein (AT2G01360), for all accessions. Note that scales are different for the two heatmaps shown in E, due to the higher transcript abundance of QPT compared to AT2G01360. Scale maximum for AT2G01350 is 3.1 × 105, and for AT2G01360 is 5.9 × 104.

Some TE variants were also associated with increased expression of nearby genes. For example, the presence of a TE within the upstream region of a gene encoding a pentatricopeptide repeat (PPR) protein (AT2G01360) was associated with higher transcript abundance of this gene (Figure 5D,E). Transcription appeared to begin at the TE insertion point, rather than the transcriptional start site of the gene (Figure 5D). Accessions containing the TE insertion had significantly higher AT2G01360 transcript abundance than the accessions without the TE insertion (p < 1.8 × 10−7, Mann-Whitney U test). The apparent transcriptional activation, linked with the presence of a TE belonging to the HELITRON1 family, indicates that this element may carry regulatory information that alters the expression of genes downstream of the TE insertion site. Importantly, this variant was classified as a low-LD TE insertion, as it is not in LD with surrounding SNPs, and therefore the associated changes in gene transcript abundance would not be linked to genetic differences between the accessions using only SNP data. This TE variant was also upstream of QPT (AT2G01350), involved in NAD biosynthesis (Katoh et al., 2006), which did not show alterations in transcript abundance associated with the presence of the TE insertion, indicating a potential directionality of regulatory elements carried by the TE (Figure 5D,E). This TE insertion occurred at the border of a non-syntenic block of genes thought to be a result of a transposition event in Arabidopsis (Freeling et al., 2008). This transposition event likely predates the TE insertion discovered here, and it is interesting that multiple transposition events appear to have occurred in close proximity in the genome. Overall, these examples demonstrate that TE variants can have unpredictable, yet important, effects on the expression of nearby genes, and these effects may be missed by studies focused on genetic variation at the level of SNPs.

As TEs are frequently highly methylated in Arabidopsis (Zhang et al., 2006; Zilberman et al., 2007; Cokus et al., 2008; Lister et al., 2008), the DNA methylation state surrounding TE variant sites was assessed to determine whether TE variants might be responsible for differences in DNA methylation patterns previously observed between the wild accessions (Schmitz et al., 2013). TE variants were often physically close to DMRs (Figure 6A). Furthermore, C-DMRs were more often close to a TE variant than expected, whereas CG-DMRs were rarely close to TE insertions or TE deletions (Table 3). Again, this was expected as DNA methylation solely in the CG context is associated with gene bodies, whereas DNA methylation in all contexts is associated with TEs. Overall, 54% of the 13,482 previously reported population C-DMRs were located within 1 kb of a TE variant (predominantly TE insertions), while only 15% of CG-DMRs were within 1 kb of a TE variant (Table 3). For C-DMRs, this was significantly more than expected by chance, while it was significantly less than expected for CG-DMRs (p < 1 × 10−4, determined by resampling 10,000 times). Of the C-DMRs that were not close to a TE variant, 3701 (27% of all C-DMRs) were within 1 kb of a non-variable TE. Thus, 81% of C-DMRs are within 1 kb of a TE when considering both fixed and variable TEs in the population. Of the remaining 19% of C-DMRs, most were found in genes or intergenic regions.

TE variants are associated with nearby DMR methylation levels.

(A) Distribution of distances from TE variants to the nearest population DMR, for TE deletions and TE insertions, C-DMRs and CG-DMRs. (B) Pearson correlation between DMR DNA methylation level and TE presence/absence, for all DMRs and their closest TE variant, versus the distance from the DMR to the TE variant (log scale). Blue lines show a linear regression between the correlation coefficients and the log10 distance to the TE variant. (C) Empirical cumulative distribution of Pearson correlation coefficients between TE presence/absence and DMR methylation level for TE insertions, TE deletions, C-DMRs and CG-DMRs. The Kolmogorov–Smirnov statistic is shown in each plot, indicated by D. (D) Relationship between rare TE variant counts and nearby DMR DNA methylation level ranks, for TE insertions, deletions, C-DMRs, and CG-DMRs. Plot shows the cumulative number of rare TE variants in equal-sized bins of DMR methylation level ranks, from the lowest ranked accession (left) to the highest ranked accession (right). Lines indicate the fit of a quadratic model, and the corresponding R2 and p values are shown in each plot.

To determine whether DMR methylation levels were associated with the presence/absence of nearby TE variants, Pearson correlation coefficients were calculated between the DNA methylation level at each C- or CG-DMR and the presence/absence of the nearest TE variant, to produce a numerical estimate of the association between TE presence/absence and DNA methylation level at the nearest DMR. Further analysis showed that for C-DMRs the strength of this association was dependent on the distance from the C-DMR to the TE insertion, whereas this was not true for CG-DMRs or TE deletions (Figure 6B, Figure 6—figure supplement 1). This suggested a distance-dependent effect of TE insertion on C-DMR methylation. DNA methylation levels at C-DMRs located within 1 kb of a TE insertion (TE-DMRs) were more often positively correlated with the presence of a TE insertion than the DNA methylation levels at C-DMRs further than 1 kb from a TE insertion (non-TE-DMRs). This was evident from the distribution of correlation coefficients for non-TE-DMRs being centred around zero, whereas for TE-DMRs this distribution was skewed to the right (Figure 6C, D = 0.24). For TE deletions, such a difference was not observed in the distributions of correlation coefficients between TE-DMRs and non-TE-DMRs, nor for CG-DMRs and their nearby TE insertions or deletions (Figure 6C, D = 0.07–0.10). These results strongly suggest a relationship between the presence of a TE insertion and formation of a nearby C-DMR.

As the above correlations between TE presence/absence and DMR methylation level rely on the TE variants having a sufficiently high MAF, this precludes analysis of the effect of rare variants on DMR methylation levels. To determine the effect that these rare TE variants may have on DMR methylation levels, a burden test for enrichment of DMR methylation extremes at TE-DMRs was performed, similar to the analysis undertaken to test the effect of rare variants on gene expression. A strong enrichment was observed for high C-DMR and CG-DMR methylation level ranks for TE insertions, while TE deletions were associated with both high and low extremes of DNA methylation levels at C-DMRs, and less so at CG-DMRs (Figure 6D). This further indicates that the presence of a TE insertion is associated with higher C-DMR methylation levels, while TE deletions appear to have more variable effects on DMR methylation levels. This enrichment was completely absent after repeating the analysis with randomized accession names (Figure 6—figure supplement 2). A slight enrichment was also observed for low DMR methylation ranks for TE insertions near CG-DMRs, indicating that the insertion of a TE was sometimes associated with reduced CG methylation in nearby regions (<1 kb from the TE). Closer examination of these TE insertions revealed that some TE insertions were associated with decreased transcript abundance of nearby genes, with a corresponding loss of gene body methylation, offering a potential explanation for the decreased CG methylation observed near some TE insertions (Figure 6—figure supplement 3).

To further assess the effects of TE variants upon local DNA methylation patterns, the levels of methylation were examined in regions flanking all TE variants regardless of the presence or absence of a population DMR call. While DNA methylation levels around pericentromeric TE insertions and deletions (<3 Mb from a centromere) seemed to be unaffected by the presence of a TE insertion (Figure 7A), TE insertions in the chromosome arms were associated with an increase in DNA methylation levels in all sequence contexts (Figure 7A,B). In contrast, TE deletions in the chromosome arms did not affect patterns of DNA methylation, as the flanking methylation level in all contexts appeared to remain high following deletion of the TE (Figure 7A,C). As the change in DNA methylation levels around most TE variant sites appeared to be restricted to regions <200 bp from the insertion site, DNA methylation levels in 200 bp regions flanking TE variants were correlated with the presence/absence of TE variants. DNA methylation levels were often positively correlated with the presence of a TE insertion when the insertion was distant from a centromere (Figure 7D). TE deletions were more variably correlated with local DNA methylation levels, but also showed a bias towards positive correlations for TE deletions distant from the centromeres. However, for TE variants in the chromosome arms the mean correlation between TE insertions and flanking DNA methylation was significantly higher than the mean correlation between TE deletions and flanking DNA methylation (Mann-Whitney U test, p<0.002). As methylome data were available for both leaf and bud tissue for 12 accessions, this analysis was repeated comparing between tissue types, but no differences were observed in the patterns of methylation surrounding TE variant sites between the two tissues (Figure 7—figure supplement 1). This suggests that the effect of TE variants upon patterns of DNA methylation may be tissue-independent.

These results indicate that local DNA methylation patterns are influenced by the differential TE content between genomes, and that the DNA methylation-dependent silencing of TEs may frequently lead to the formation of DMRs between wild Arabidopsis accessions. TE insertions appear to be important in defining local patterns of DNA methylation, while DNA methylation levels often remain elevated following a TE deletion, and so are independent from the presence or absence of TEs in these cases. Importantly, the distance from a TE insertion to the centromere appears to have a strong impact on whether an alteration of local DNA methylation patterns will occur. This is likely due to flanking sequences being highly methylated in the pericentromeric regions, and so the insertion of a TE cannot further increase levels of DNA methylation. Overall, a large fraction of the population C-DMRs previously identified between wild accessions are correlated with the presence of local TE variants. CG-DMR methylation levels appear to be mostly independent from the presence/absence of common TE variants, while rare TE variants have an impact on DNA methylation levels at both C-DMRs and CG-DMRs, perhaps due to their more frequent occurrence within the chromosome arms, closer to genes and where CG-DMRs are more abundant (Figure 2A).

Genome-wide association scan highlights distant and local control of DNA methylation

To further investigate the effects of TE variants upon local and distant DNA methylation levels in the genome, an association scan was conducted for all common TE variants (>3% MAF) and all population C-DMRs for the 124 accessions with both DNA methylation and TE variant data available. To test the significance of each pairwise correlation, bootstrap p-value estimates were collected based on 500 permutations of accession labels. TE-DMR associations were deemed significant if they had an association more extreme than all 500 permutations (p<1/500). A band of significant associations was observed for TE insertions and their nearby C-DMRs, signifying a local association between TE insertion presence/absence and C-DMR methylation (Figure 8A). This local association was not as strong for TE deletions (Figure 8B), consistent with our above findings. While TE variants and DNA methylation showed a local association, it is also possible that TE variation may influence DNA methylation states more broadly in the genome, perhaps through the production of trans-acting smRNAs or inactivation of genes involved in DNA methylation establishment or maintenance. To identify any potential enrichment of C-DMRs regulated in trans, the total number of significant associations was summed for each TE variant across the whole genome (Figure 8A and B, top panels). At many sites, far more significant associations were found than expected due to the false positive rate alone. This suggested the existence of many putative trans associations between TE variants and genome-wide C-DMR methylation levels. These C-DMRs that appeared to be associated with a TE insertion in trans were further examined, checking for TE insertions near these C-DMRs that were present in the same accessions as the trans associated TE, as these could lead to a false trans association. These were extremely rare, with only four such cases for TE insertions, and 38 cases for TE deletions, and so were unable to explain the high degree of trans associations found. Overall, this suggests that certain TE variants may affect DNA methylation levels more broadly in the genome, as their effects upon DNA methylation are not necessarily limited to nearby DNA sequences.

Association scan between TE variants and C-DMR methylation variation.

(A) Significant correlations between TE insertions and C-DMR DNA methylation level. Points show correlations between individual TE-DMR pairs that were more extreme than all 500 permutations of the DMR data. Top plots show the total number of significant correlations for each TE insertion across the whole genome. (B) As for (A), for TE deletions.

Discussion

Here we have discovered widespread differential TE content between wild Arabidopsis accessions, and explored the impact of these variants upon transcription and DNA methylation at the level of individual accessions. Most TE variants were due to the de novo insertion of TEs, while a smaller subset was likely due to the deletion of ancestral TE copies, mostly around the pericentromeric regions. A subset (32%) of TE variants with a minor allele frequency above 3% were able to be tested for linkage with nearby SNPs. The majority of these TE variants exhibited only low levels of LD with nearby SNPs, indicating that they represent genetic variants currently overlooked in genomic studies. A marked depletion of TE variants within gene bodies and DNase I hypersensitivity sites (putative regulatory regions) is consistent with the more deleterious TE insertions being removed from the population through selection. Of those TE variants found in gene bodies, TE deletions were overrepresented, indicating that the loss of ancestral TEs inserted within genes may be more frequent, or perhaps less deleterious, than the de novo insertion of TEs into genes.

A previous study focused on recent TE insertions in the Arabidopsis population (Quadrana et al., 2016), thus the extensive variation between accessions due to older TE insertions or TE deletions has not been explored. We identified clear cases where TE variants appear to have an effect upon gene expression, both in the disruption of transcription and in the spreading or disruption of regulatory information leading to the transcriptional activation of genes, indicating that these TE variants can have important consequences upon the expression of protein coding genes (Figure 5). In one case, these changes in gene expression could be linked with phenotypic changes, with accessions containing a TE insertion more frequently sensitive to bacterial infection. Further experiments will be needed to establish a causal link between this TE insertion and the associated phenotype. An analysis of rare TE variants, present at a low MAF, further strengthened this relationship between TE presence/absence and altered transcript abundance, as a strong enrichment of rare TE variants in accessions with extreme gene expression ranks in the population was identified. Therefore, the effects of TE insertions appear to be long-lasting, as there was little difference between common (old) and rare (young) variants in the impact upon gene expression (Figure 4).

Perhaps most importantly, we provide evidence that differential TE content between genomes of Arabidopsis accessions underlies a large fraction of the previously reported population C-DMRs. Thus, the frequency of pure epialleles, independent of underlying genetic variation, may be even more rare than previously anticipated (Richards, 2006). Overall, 81% of all C-DMRs were within 1 kb of a TE, when considering both fixed and variable TEs in the population. We did not find evidence of CG-DMR methylation, associated with gene bodies, being altered by the presence of common TE variants. However, rare TE variants may be more important in shaping patterns of DNA methylation at some CG-DMRs, perhaps due to their higher frequency in regions close to genes. The level of local DNA methylation changes associated with TE variants was also related to the distance from a TE variant to the centromere, with variants in the chromosome arms being more strongly correlated with DNA methylation levels. This seems to be due to a higher baseline level of DNA methylation at the pericentromeric regions, which prevent any further increase in DNA methylation level following insertion of a TE. Furthermore, we found an important distinction between TE insertions and TE deletions in the effect that these variants have on nearby DNA methylation levels. While flanking DNA methylation levels increase following a TE insertion, the deletion of an ancestral TE was often not associated with a corresponding decrease in flanking DNA methylation levels (Figure 7). This indicates that high levels of DNA methylation, once established, may be maintained in the absence of the TE insertion that presumably triggered the original change in DNA methylation level. It is then possible that TE variants explain more of the inter-accession variation in DNA methylation patterns than we find direct evidence for, if some C-DMRs were formed by the insertion of an ancestral TE that is now absent in all the accessions analysed here. These DMRs would then represent the epigenetic 'scars' of past TE insertions.

Finally, a genome-wide scan of common TE variant association with C-DMR methylation levels provides further evidence of a strong local association between TE insertion presence/absence and C-DMR methylation level (Figure 8). The identification of some TE variants that appeared to be associated with changes in DNA methylation levels at multiple loci throughout the genome indicates a possible trans regulation of DNA methylation state linked to specific TE variants. Further experiments will be required to confirm and examine the role of these TE variants in determining genome-wide patterns of DNA methylation. Overall, our results show that TE presence/absence variants between wild Arabidopsis accessions not only have important effects on nearby gene expression, but can also have a role in determining local patterns of DNA methylation, and explain many regions of differential DNA methylation previously observed in the population.

Materials and methods

TEPID development

Mapping

FASTQ files are mapped to the reference genome using the ‘tepid-map’ algorithm (Figure 1). This first calls bowtie2 (Langmead and Salzberg, 2012) with the following options: ‘–local’, ‘–dovetail’, ‘–fr’, ‘-R5’, ‘-N1’. Soft-clipped and unmapped reads are extracted using Samblaster (Faust and Hall, 2014), and remapped using the split read mapper Yaha (Faust and Hall, 2012), with the following options: ‘-L 11’, ‘-H 2000’, ‘-M 15’, ‘-osh’. Split reads are extracted from the Yaha alignment using Samblaster (Faust and Hall, 2014). Alignments are then converted to bam format, sorted, and indexed using samtools (Li et al., 2009).

TE variant discovery

The ‘tepid-discover’ algorithm examines mapped bam files generated by the ‘tepid-map’ step to identify TE presence/absence variants with respect to the reference genome. Firstly, mean sequencing coverage, mean library insert size, and standard deviation of the library insert size is estimated. Discordant read pairs are then extracted, defined as mate pairs that map more than four standard deviations from the mean insert size from one another, or on separate chromosomes.

To identify TE insertions with respect to the reference genome, split read alignments are first filtered to remove reads where the distance between split mapping loci is less than 5 kb, to remove split reads due to small indels, or split reads with a mapping quality (MAPQ) less than 5. Split and discordant read mapping coordinates are then intersected using pybedtools (Dale et al., 2011; Quinlan and Hall, 2010) with the Col-0 reference TE annotation, requiring 80% overlap between TE and read mapping coordinates. To determine putative TE insertion sites, regions are then identified that contain independent discordant read pairs aligned in an orientation facing one another at the insertion site, with their mate pairs intersecting with the same TE (Figure 1). The total number of split and discordant reads intersecting the insertion site and the TE is then calculated, and a TE insertion predicted where the combined number of reads is greater than a threshold determined by the average sequencing depth over the whole genome (1/10 coverage if coverage is greater than 10, otherwise a minimum of 2 reads). Alternatively, in the absence of discordant reads mapped in orientations facing one another, the required total number of split and discordant reads at the insertion site linked to the inserted TE is set higher, requiring twice as many reads.

To identify TE absence variants with respect to the reference genome, split and discordant reads separated >20 kb from one another are first removed, as 99.9% of Arabidopsis TEs are shorter than 20 kb, and this removes split reads due to larger structural variants not related to TE diversity (Figure 2—figure supplement 8). Col-0 reference annotation TEs that are located within the genomic region spanned by the split and discordant reads are then identified. TE absence variants are predicted where at least 80% of the TE sequence is spanned by a split or discordant read, and the sequencing depth within the spanned region is <10% the sequencing depth of the 2 kb flanking sequence, and there are a minimum number of split and discordant reads present, determined by the sequencing depth (1/10 coverage; Figure 1). A threshold of 80% TE sequence spanned by split or discordant reads is used, as opposed to 100%, to account for misannotation of TE sequence boundaries in the Col-0 reference TE annotation, as well as TE fragments left behind by DNA TEs during cut-paste transposition (TE footprints) that may affect the mapping of reads around annotated TE borders (Plasterk, 1991). Furthermore, the coverage within the spanned region may be more than 10% that of the flanking sequence, but in such cases twice as many split and discordant reads are required. If multiple TEs are spanned by the split and discordant reads, and the above requirements are met, multiple TEs in the same region can be identified as absent with respect to the reference genome. Absence variants in non-Col-0 accessions are subsequently recategorized as TE insertions present in the Col-0 genome but absent from a given wild accession.

TE variant refinement

Once TE insertions are identified using the ‘tepid-map’ and ‘tepid-discover’ algorithms, these variants can be refined if multiple related samples are analysed. The ‘tepid-refine’ algorithm is designed to interrogate regions of the genome in which a TE insertion was discovered in other samples but not the sample in question, and check for evidence of that TE insertion in the sample using lower read count thresholds compared to the ‘tepid-discover’ step. In this way, the refine step leverages TE variant information for a group of related samples to reduce false negative calls within the group. This distinguishes TEPID from other similar methods for TE variant discovery utilizing short sequencing reads. A file containing the coordinates of each insertion, and a list of sample names containing the TE insertion must be provided to the ‘tepid-refine’ algorithm, which can be generated using the ‘merge_insertions.py’ script included in the TEPID package. Each sample is examined in regions where there was a TE insertion identified in another sample in the group. If there is a sequencing breakpoint within this region (no continuous read coverage spanning the region), split reads mapped to this region will be extracted from the alignment file and their coordinates intersected with the TE reference annotation. If there are split reads present at the variant site that are linked to the same TE as was identified as an insertion at that location, this TE insertion is recorded in a new file as being present in the sample in question. If there is no sequencing coverage in the queried region for a sample, an 'NA' call is made indicating that it is unknown whether the particular sample contains the TE insertion or not.

While the above description relates specifically to use of TEPID for identification of TE variants in Arabidopsis in this study, this method can be also applied to other species, with the only prerequisite being the annotation of TEs in a reference genome and the availability of paired-end DNA sequencing data.

TE variant simulation

To test the sensitivity and specificity of TEPID, 100 TE insertions (50 copy-paste transpositions, 50 cut-paste transpositions) and 100 TE absence variants were simulated in the Arabidopsis genome using the RSVSim R package, version 1.7.2 (Bartenhagen and Dugas, 2013), and synthetic reads generated from the modified genome at various levels of sequencing coverage using wgsim (Li et al., 2009) (https://github.com/lh3/wgsim). These reads were then used to calculate the true positive, false positive, and false negative TE variant discovery rates for TEPID at various sequencing depths, by running ‘tepid-map’ and ‘tepid-discover’ using the simulated reads with the default parameters (Figure 1—figure supplement 1).

Estimation of sensitivity

Previously published 100 bp paired end sequencing data for Ler (http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/Ler-1/Reads/; [Schneeberger et al., 2011]) was downloaded and analyzed with the TEPID package to identify TE variants. Reads providing evidence for TE variants were then mapped to the de novo assembled Ler genome (Chin et al., 2013). To determine whether reads mapped to homologous regions of the Ler and Col-0 reference genome, the de novo assembled Ler genome sequence between mate pair mapping locations in Ler were extracted, with repeats masked using RepeatMasker with RepBase-derived libraries and the default parameters (version 4.0.5, http://www.repeatmasker.org). A blastn search was then conducted against the Col-0 genome using the following parameters: ‘-max-target-seqs 1’, ‘-evalue 1e-6’ (Camacho et al., 2009). Coordinates of the top BLAST hit for each read location were then compared with the TE variant sites identified using those reads. To estimate false negative rates for TEPID TE absence calls, Ler TE absence calls were compared with a known set of Col-0-specific TE insertions, absent in Ler (Quadrana et al., 2016). For TEPID TE presence calls, we mapped Col-0 DNA sequencing reads (Jiang et al., 2014) to the Ler PacBio assembly, and identified sites with read evidence reaching the TEPID threshold for a TE insertion call to be made.

Arabidopsis TE variant discovery

We ran TEPID, including the insertion refinement step, on previously published sequencing data for 216 different Arabidopsis populations (NCBI SRA SRA012474; [Schmitz et al., 2013-03]), mapping to the TAIR10 reference genome and using the TAIR9 TE annotation. The ‘–mask’ option was set to mask the mitochondrial and plastid genomes. We also ran TEPID using previously published transgenerational data for salt stress and control conditions (NCBI SRA SRP045804; [Jiang et al., 2014]), again using the ‘–mask’ option to mask mitochondrial and plastid genomes, and the ‘–strict’ option for highly related samples.

TE variant / SNP comparison

SNP information for 216 Arabidopsis accessions was obtained from the 1001 genomes data center (http://1001genomes.org/data/Salk/releases/2013_24_01/; [Schmitz et al., 2013-03]). This was formatted into reference (Col-0 state), alternate, or NA calls for each SNP. Accessions with both TE variant information and SNP data were selected for analysis. Hierarchical clustering of accessions by SNPs as well as TE variants were used to identify essentially clonal accessions, as these would skew the SNP linkage analysis. A single representative from each cluster of similar accessions was kept, leading to a total of 187 accessions for comparison. For all other analyses, the full set of accessions were used in order to maximize sample sizes. For each TE variant with a minor allele frequency greater than 3% (>5 accessions for the SNP linkage analysis), the nearest 300 upstream and 300 downstream SNPs with a minor allele frequency greater than 3% were selected. Pairwise genotype correlations (r2 values) for all complete cases were obtained for SNP-SNP and SNP-TE variant states. r2 values were then ordered by decreasing rank and a median SNP-SNP rank value was calculated. For each of the 600 ranked surrounding positions, the number of times the TE rank was greater than the SNP-SNP median rank was calculated as a relative LD metric of TE to SNP. TE variants with less than 200 ranks over the SNP-SNP median were classified as low-LD insertions. TE variants with ranks between 200 and 400 were classified as mid-LD, while TE variants with greater than 400 ranks above their respective SNP-SNP median value were classified as variants in high LD with flanking SNPs.

PCR validations

Selection of accessions to be genotyped

To assess the accuracy of TE variant calls in accessions with a range of sequencing depths of coverage, we grouped accessions into quartiles based on sequencing depth of coverage and randomly selected a total of 14 accessions for PCR validations from these quartiles. DNA was extracted for these accessions using Edward’s extraction protocol (Edwards et al., 1991), and purified prior to PCR using AMPure beads.

Selection of TE variants for validation and primer design

Ten TE insertion sites and 10 TE absence sites were randomly selected for validation by PCR amplification. Only insertions and absence variants that were variable in at least two of the fourteen accessions selected to be genotyped were considered. For insertion sites, primers were designed to span the predicted TE insertion site. For TE absence sites, two primer sets were designed; one primer set to span the TE, and another primer set with one primer annealing within the TE sequence predicted to be absent, and the other primer annealing in the flanking sequence (Figure 2—figure supplement 3). Primer sequences were designed that did not anneal to regions of the genome containing previously identified SNPs in any of the 216 accessions (Schmitz et al., 2013-03) or small insertions and deletions, identified using lumpy-sv with the default settings (Layer et al., 2014) (https://github.com/arq5x/lumpy-sv), had an annealing temperature close to 52°C calculated based on nearest neighbor thermodynamics (MeltingTemp submodule in the SeqUtils python module; [Cock et al., 2009]), GC content between 40% and 60%, and contained the same base repeated not more than four times in a row. Primers were aligned to the TAIR10 reference genome using bowtie2 (Langmead and Salzberg, 2012) with the ‘-a’ flag set to report all alignments, and those with more than five mapping locations in the genome were then removed.

PCR

PCR was performed with 10 ng of purified Arabidopsis DNA using Taq polymerase. PCR products were analysed by agarose gel electrophoresis. Col-0 was used as a positive control, water was added to reactions as a negative control.

mRNA analysis

Processed mRNA data for 144 wild Arabidopsis accessions were downloaded from NCBI GEO GSE43858 (Schmitz et al., 2013). To find differential gene expression dependent on TE presence/absence variation, we first removed transposable element genes from the set of TAIR10 gene models, then filtered TE variants to include only those where the TE variant was shared by at least seven accessions with RNA data available. We then grouped accessions based on TE presence/absence variants, and performed a Mann-Whitney U test to determine differences in RNA transcript abundance levels between the groups. We used q-value estimation to correct for multiple testing, using the R qvalue package v2.2.2 with the following parameters: lambda = seq(0, 0.6, 0.05), smooth.df=4 (Storey and Tibshirani, 2003). Genes were defined as differentially expressed where there was a greater than two fold difference in expression between the groups, with a q-value less than 0.01. Gene ontology enrichment analysis was performed using PANTHER (http://pantherdb.org).

DNA methylation data analysis

Rare variant analysis

To assess the effect of rare TE variants on gene expression or DMR DNA methylation levels, we tested for a burden of rare variants (<3% MAF, <7 accessions) in the population extremes, essentially as described previously (Zhao et al., 2016). For each rare TE variant near a gene or DMR, we ranked the gene expression level or DMR DNA methylation level for all accessions in the population, and tallied the ranks of accessions containing a rare variant. These rank counts were then binned to produce a histogram of the distribution of ranks. We then fit a quadratic model to the counts data, and calculated the R2 and p-value for the fit of the model.

TE variant and DMR genome-wide association analysis

Accessions were subset to those with both leaf DNA methylation data and TEPID calls. Pairwise correlations were performed for observed data pairs for each TE variant and a filtered set of population C-DMRs, with those C-DMRs removed where more than 15% of the accessions had no coverage. This amounted to a final set of 9777 C-DMRs. Accession names were then permuted to produce a randomized dataset, and pairwise correlations again calculated. This was repeated 500 times to produce a distribution of expected Pearson correlation coefficients for each pairwise comparison. Correlation values more extreme than all 500 permutations were deemed significant.

Decision letter

Daniel Zilberman

Reviewing Editor; University of California, Berkeley, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Population scale mapping of transposable element diversity reveals links to gene regulation and epigenomic variation" for consideration by eLife. Your article has been favorably evaluated by Detlef Weigel (Senior Editor) and two reviewers, one of whom is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal his identity: Jeffrey Ross-Ibarra (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Stuart et al. present a compelling and comprehensive description of TE variation in Arabidopsis thaliana, highlighting the influence of TE polymorphism on expression of nearby genes and methylation of nearby regions. This paper persuasively argues that TE variation is the genetic basis for most DNA methylation polymorphisms and makes an important contribution to our understanding of the genomic impacts of TEs. However, the reviewers raised several points that should be addressed before publication. Most generally, this manuscript focuses a bit too much on the process and not enough on the biology. Dense statistical descriptions that will be difficult to understand for most readers are common, yet some results are explained either very broadly or not at all, and some important biological issues are ignored. A greater emphasis on biological meaning would substantially improve this paper.

Essential revisions:

1) The authors report that TEPID identified 300-500 insertions and 1000-1500 deletions per ecotype relative to Col. However, the overall number of insertions identified was 15,007 vs. 8,088 deletions. The reason that more deletions than insertions were identified per ecotype but fewer overall is that some of the deletions are present in nearly all or all tested ecotypes. Because this analysis is relative – insertions are both insertions in other ecotypes and deletions in Col, whereas deletions are deletions in other ecotypes and insertions in Col, such 'fixed deletions' are just insertion in Col. The authors get to this issue later in the analysis, but they don't explain the asymmetry. Why aren't there 'fixed insertions' relative to Col – meaning deletion in Col? Is TEPID unable to identify such events, or is there a biological meaning to this? An explanation would be very helpful.

2) The authors mention that TE insertions and deletions are both biased for pericentric regions, without discussing the observation that deletions are far more biased and what this likely means for actual TE insertion biases. The data strongly suggest that TE insertions are largely random, but are eliminated by selection from the chromosome arms. However, reading this paper, one could easily get the impression that TE insertion and deletion location biases are similar (which is stated in the Results), when they are actually not similar at all except in the most general sense.

3) C-DMRs and CG-DMRs are introduced without any description of the underlying biology (gene body vs. TE methylation; C-methylation mostly in TEs, CG mostly in genes) that would help to make sense of the different correlations of these DMRs with TE insertions. Gene body methylation doesn't even come up until the next-to-last Results section, and it's not explained there either. A keen reader might notice that CG-DMR distribution tracks genes, but to most the biological meaning of the presented analyses with respect to the two types of DMRs will be very obscure.

4) There is some concern about the inferences made about gene expression, as it seems that TE genes are not removed from the gene models used in the differential expression calls. Separating the impact of TEs inserting into TEs from the effect of TEs inserting into host genes would strengthen the argument that TEs are associated with extreme expression changes.

6) As any definition of a rare allele is somewhat arbitrary, the paper should be clearer about how TE variants are categorized into rare and common. Probably due to additions in the text, the definition of common (>3% MAF) comes in the subsection “Relationship between TE variants and single nucleotide polymorphisms”, long after Figure 2 is introduced. Can the common/rare definition either be added to the Figure 2 legend, or introduced when polarization is described? Similarly, it seems that the translation of MAF cutoffs to numbers of individuals are not consistent throughout the paper. While the definition of MAF cutoff states this is in >7 accessions, this doesn't obviously make sense with 217 accessions (shouldn't it be >=7?). Also, it seems that filtering reduces the number of accessions to 184, where 3% would be >5. Since there are so many low frequency TEs (Figure 2—figure supplement 2), small shifts in this definition might change conclusions qualitatively.

7) Gene At2G01360 used as an example in Figure 5 is highlighted in Figure 2 of Freeling et al. (2008) (http://genome.cshlp.org/content/18/12/1924.full) as transposing to this position in Arabidopsis by an unknown mechanism. Please comment about the possible relationship between this observation and the presence of a polymorphic TE within Arabidopsis.

[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]

Thank you for submitting your work entitled "Population scale mapping of transposable element diversity reveals links to gene regulation and epigenomic variation" for consideration by eLife. Your article has been favorably evaluated by Detlef Weigel (Senior Editor) and four reviewers, one of whom is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal their identity: Magnus Nordborg (Reviewer #2).

Our decision has been reached after extensive consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife in its present form. We would, however, welcome a resubmission that substantively addresses reviewer concerns. Senior Editor Detlef Weigel suggests that you contact him directly if you have questions regarding the decision and recommendations. Magnus Nordborg is also happy to discuss his comments.

Although the reviewers found the subject of the manuscript interesting and clearly presented, three major deficiencies were identified:

1) The paper appears to conflate transposition with TE content variation, which are different things. TE variation – the actually measured feature – can occur through new insertions, but also through deletion of old copies. Given the large number of ecotype genomes and an understanding of the underlying phylogeny, new insertions could be distinguished from deletions of ancestral TEs in most cases. This would allow new insertion evens to be considered separately from deletions.

2) As explained by Magnus Nordborg in his review, the use of LD to measure TE insertion age is questionable and likely produces inaccurate measurements. This might explain why the conclusion reached in this manuscript that even recent TE insertions are biased for pericentric heterochromatin is inconsistent with studies of individual plant TEs that do not show such a bias. A more appropriate estimate of TE insertion age may be through measurement of allele frequency. Because estimated TE age is used in analyses throughout the manuscript, a robust measure is essential.

3) The section on seed and gamete methylation contains numerous questionable statements and analyses. Statements linking demethylation in pollen to imprinting are very confusing, because imprinting is associated with demethylation in the female gametophyte. The correlations with TEs demethylated (more accurately hypomethylated) in the CHH context in sperm ignore the global nature of this phenomenon, which affects most heterochromatic TEs. The small RNA analysis rests on the unwarranted assumption that the small RNAs are produced in the endosperm. The paper from which the small RNA data are derived makes a strong case that the small RNAs are derived from the maternal seed coat, which would provide a trivial explanation for the observation that TEs absent in the maternal genome have lower levels of small RNA in hybrid seeds. The most promising result – the correlation between TE content and methylation in Col x Cvi hybrids – didn't hold up in Col x Ler hybrids.

We would welcome a resubmission that clearly identifies TE insertions, provides a robust measure of the false positive and false negative rates (please see Magnus Nordborg's review below), provides a robust measure of TE age, and ideally deepens the analysis of the correlations between TE variation, DNA methylation polymorphism, and gene expression as suggested by the reviewers.

Reviewer #2 (Magnus Nordborg):

This is a very clear and interesting paper. It is one of the first genome-wide descriptions of TE polymorphism (there will be more…), and, although the conclusions are a bit limited, it should be of broad interest. Most importantly, it really suggests strongly that methylation variation is mostly genetic.

I do have a few substantive concerns, however. The first concerns data quality, which has been a major reason these types of studies have not done before. Calling TE polymorphisms from short-read data is not trivial. What is done here is sensible, but I would still like to have some idea of error rates and biases. Using simulations (subsection “Computational identification of TE presence/absence variation”) is not QC – it is at best debugging of software. What is needed is comparison with real data, preferably a "gold standard". The obvious choice is the PacBio Ler assembly, and it is used here, but it seems to me it is used in the wrong direction. The TE polymorphisms identified in the short-read data are confirmed in the PacBio data, rather than the other way around. It is the usual false-positive / false-negative issue that plagues all these data. What is done shows that the identified polymorphism are reliable, but it tells us nothing about what we don't see. Which leads to biases of all kinds. Why not do the right thing? And if there is a good reason for not doing the right thing, please justify this, and say a few words about how why the inevitable biases do not affect your results.

My second concern is the entire missing heritability section (Figure 2), which I think is based on two misunderstandings, one of linkage disequilibrium (LD), one of the nature of "missing heritability". Take LD first. It is theoretically impossible for TEs variants to show a different LD behavior than SNPs unless they are due to repeat mutations (which have to be very frequent indeed to affect LD). As long as identity in state = identity by descent, it doesn't matter how mutations arise. Of course it is possible that some of the TE indels could be non-unique, but there is no evidence for this in this paper (you could look for it: does the same insertion occur on unrelated haplotypes?). I'm fairly convinced that the difference is r2 distributions you see are largely due to differences in the allele frequency. To confirm this, compare r2 between TEs and flanking SNPs to r2 between a SNP and flanking SNPs after making sure that the polymorphism being tested has the same derived allele frequency. This can be done by binning. I think you will see that there is no difference. And if you do see a difference, it is probably due to some other artifact (genotyping error?), because, as I said, it is theoretically impossible for there to be a difference (unless repeat mutations are very different, selection is strong, etc.).

Don't say things like "this variant was classified as a young TE insertion, as it is not in linkage disequilibrium with surrounding SNP" – the younger a mutation is, the stronger the LD (because there has been no time for recombination). Sure, r2 will be low, but this is a trivial consequence of the definition of r2.

As I see it, the relevance of TEs in the "missing heritability" debate is simply that they could be an important source of rare deleterious alleles. Such alleles, regardless of whether they are TEs, SNPs, or CNVs, will have the property that they can be mapped in pedigrees in which they segregate, but will individually explain very little of the population variation and thus cannot be mapped using GWAS. This has nothing to do with LD: it is a simple consequence of their frequency. LD is a red herring in this context. I would get rid of Figure 2. and simply mention this in the Discussion.

But speaking of rare variants, you seem to argue against TEs having an effect, at least on transcript abundance, although you note that perhaps the rare variants are the ones that matter (subsection “TE variants affect gene expression”, first paragraph). Did you try testing this in aggregate (analogously to what human geneticists do to establish that rare alleles matter)? Rather than carrying out a SNP-by-SNP test, which clearly doesn't work when frequencies are less than 3% (is that 2 alleles?), ask whether these rare TEs are, in aggregate, more likely to show extreme phenotypes than two (or three, or whatever the right number is) randomly chosen individuals? You could calculate some kind of two-tailed rank statistic, and simply ask whether they are more likely to be extreme (feel free to contact me if this is not clear).

As I said from the outset, the strength of this paper is confirming that a large fraction of DMRs are due to TEs. This could be made even stronger:

A) When estimating the fraction based on published DMR data (subsection “TE variants drive DNA methylation differences between accessions”, end of first paragraph), why not distinguish between types of DMRs?

B) When looking for DMR around identified TEs (subsection “TE variants drive DNA methylation differences between accessions”, first paragraph), why not using random regions as control?

C) The correlations between TE and methylation are nice, but is there any clear example of strong methylation without a TE? This would be of obvious interest.

D) have you considered trans-acting TEs? Even if there is no TE variant in cis, there could be one in trans. How many more DMRs can you explain if you consider nearby TEs, variable or not?

Trans-acting TEs could also be important when interpreting the crosses, but I'm guessing other reviewers will comment more extensively on that.

Reviewer #3:

This is a very nice analysis of transposon activity in Arabidopsis accessions and will serve as a platform for similar analyses in more complex genomes.

Reviewer #4:

The manuscript by Stuart and colleagues describes the development of a software pipeline for detection of variations in transposons insertions (TEPID). The authors applied this pipeline to the available genomic sequences of over 200 Arabidopsis accessions. They verified efficiency and accuracy of the TEPID and estimated ages of transposons based on linkage disequilibrium between transposon and flanking sequences. In addition, they assessed the degree of transposon derived regulation of transcription of the neighbouring genes, which turned out to be surprisingly limited and only two experimentally verified examples could illustrate TE derived transcriptional suppression or activation. Interestingly, suppression of transcription of a gene encoding a LRR protein seems to cause a change in the level of resistance to a bacterial pathogen when accessions with or without the transposon insertion were compared. The authors also studied the influence of transposon insertions on the local variation in DNA methylation claiming that especially old insertions affect variation in DNA methylation. Finally, in reciprocal inter-accession crosses they examined the epigenetic interaction between accessions differing in the distribution of particular transposable elements, comparing loci with presence of corresponding transposons in one or both accessions.

Author response

Summary:

Stuart et al. present a compelling and comprehensive description of TE variation in Arabidopsis thaliana, highlighting the influence of TE polymorphism on expression of nearby genes and methylation of nearby regions. This paper persuasively argues that TE variation is the genetic basis for most DNA methylation polymorphisms and makes an important contribution to our understanding of the genomic impacts of TEs. However, the reviewers raised several points that should be addressed before publication. Most generally, this manuscript focuses a bit too much on the process and not enough on the biology. Dense statistical descriptions that will be difficult to understand for most readers are common, yet some results are explained either very broadly or not at all, and some important biological issues are ignored. A greater emphasis on biological meaning would substantially improve this paper.

We have made numerous changes throughout the text in order to place more emphasis on the biology rather than the process. Below we address each specific point raised by the reviewers.

Essential revisions:

1) The authors report that TEPID identified 300-500 insertions and 1000-1500 deletions per ecotype relative to Col. However, the overall number of insertions identified was 15,007 vs. 8,088 deletions. The reason that more deletions than insertions were identified per ecotype but fewer overall is that some of the deletions are present in nearly all or all tested ecotypes. Because this analysis is relative – insertions are both insertions in other ecotypes and deletions in Col, whereas deletions are deletions in other ecotypes and insertions in Col, such 'fixed deletions' are just insertion in Col. The authors get to this issue later in the analysis, but they don't explain the asymmetry. Why aren't there 'fixed insertions' relative to Col – meaning deletion in Col? Is TEPID unable to identify such events, or is there a biological meaning to this? An explanation would be very helpful.

We do in fact identify nearly fixed insertions that we classify as a Col-0 deletion. These are however a very small fraction of all the insertion variants identified, likely due to some bias in the TE variant discover method using the short read data. TE insertions relative to the reference genome appear to be slightly more difficult to identify that TE absences (as shown in our simulations, Figure 1—figure supplement 1), and any missed TE insertions will decrease the allele frequency for that variant, resulting in a reduced number of TE insertions that are at a high enough minor allele frequency to be subsequently classified as a TE deletion in Col-0. This effect is restricted to Col-0, as that is the only accession where all true TE deletions must be identified as TE insertions in other accessions in the population. As this bias is restricted to Col-0 (one accession out of the 216 examined), we do not believe this affects conclusions made in our manuscript. We have added the following sentence to our revised manuscript to address this question:

“High allele frequency TE presence variants relative to Col-0, representing true deletions in Col-0, were much more rare, with 97.8% of initial TEPID insertion calls being subsequently classified as true insertions. […] Accessions were found to contain on average ~240 true deletions and ~300 true insertions (Figure 2—figure supplement 5).”

We have added an additional plot showing the number of true deletions and true insertions detected for each accession as Figure 2—figure supplement 5:

To further clarify for readers, we have also added the following text:

“Although more TE absences were found on an accession-by-accession basis, overall TE presences were more common in the population as TE absences were often shared between multiple accessions, indicative of a TE insertion unique to the Col-0 reference genome.”

2) The authors mention that TE insertions and deletions are both biased for pericentric regions, without discussing the observation that deletions are far more biased and what this likely means for actual TE insertion biases. The data strongly suggest that TE insertions are largely random, but are eliminated by selection from the chromosome arms. However, reading this paper, one could easily get the impression that TE insertion and deletion location biases are similar (which is stated in the Results), when they are actually not similar at all except in the most general sense.

Our data do suggest that TE insertions are largely random, whereas TE deletions are more biased towards the pericentromeric regions, likely due to the distribution of old TEs throughout the genome. Thank you, we now realize that this point was not clear in our manuscript and have accordingly made clarifications to the text in order to make this important point more obvious to readers. Specifically, we have added the following text to the revised manuscript:

“While TE deletions were strongly biased towards the pericentromeric regions where TEs are found in high density, TE insertions had a more uniform distribution over the chromosome. This suggests that TE insertion positions are largely random but may be eliminated from chromosome arms through selection, and accumulate in the pericentromeric regions where low recombination rates prevent their removal (Figure 2A).”

3) C-DMRs and CG-DMRs are introduced without any description of the underlying biology (gene body vs. TE methylation; C-methylation mostly in TEs, CG mostly in genes) that would help to make sense of the different correlations of these DMRs with TE insertions. Gene body methylation doesn't even come up until the next-to-last Results section, and it's not explained there either. A keen reader might notice that CG-DMR distribution tracks genes, but to most the biological meaning of the presented analyses with respect to the two types of DMRs will be very obscure.

We apologize for this oversight, and have added a brief explanation of the biological meaning of gene body methylation to the Introduction:

“In Arabidopsis, TEs are often methylated in all cytosine sequence contexts, in a pattern distinct from DNA methylation in other regions of the genome. Conversely, DNA methylation often occurs in gene bodies exclusively in the CG context and is correlated with gene expression, although this gene-body methylation appears dispensable (Bewick et al. 2016)”

We also draw attention to the CG-DMR distribution track in the third paragraph of the subsection “Abundant TE positional variation among natural Arabidopsis populations” to highlight the fact that the distribution of CG-DMRs closely follows the distribution of genes.

4) There is some concern about the inferences made about gene expression, as it seems that TE genes are not removed from the gene models used in the differential expression calls. Separating the impact of TEs inserting into TEs from the effect of TEs inserting into host genes would strengthen the argument that TEs are associated with extreme expression changes.

We have now repeated all analysis of the effect of TE variants upon gene expression removing all TE genes from the set of gene models used. The results are mostly the same, as most TE genes were lowly or not expressed in both cases of TE presence and TE absence, and so this change in the analysis does not affect any of our original conclusions.

We thank the reviewers for this suggestion, and have examined the allele frequencies for the Basho elements examined by Hollister and Gaut to compare with our data, as this dataset is the most straightforward to compare (the Lockton and Gaut paper uses TE display, which complicates the comparison of individual TE copy presence/absence calls). We found a weakly positive correlation between the TE occupation frequency for those elements examined by Hollister and Gaut, and the occupation frequency that we have estimated for the same elements using short read data in the larger population (Figure 2—figure supplement 9). We have added such information on the expected allele frequencies to the revised text:

“As certain TEs present in Col-0 have previously been genotyped in 47 different accessions, allele frequency data was available for some TEs (Hollister and Gaut 2007), and we compared these previous allele frequency estimates with our estimates based on the short read data. We found a weakly positive correlation (r2 = 0.3) between the previous allele frequency estimates for Basho family TEs and our allele frequency estimates, which may not be unexpected given the differing population sizes and TE variant detection methods used (Figure 2—figure supplement 9)”.

6) As any definition of a rare allele is somewhat arbitrary, the paper should be clearer about how TE variants are categorized into rare and common. Probably due to additions in the text, the definition of common (>3% MAF) comes in the subsection “Relationship between TE variants and single nucleotide polymorphisms”, long after Figure 2 is introduced. Can the common/rare definition either be added to the Figure 2 legend, or introduced when polarization is described? Similarly, it seems that the translation of MAF cutoffs to numbers of individuals are not consistent throughout the paper. While the definition of MAF cutoff states this is in >7 accessions, this doesn't obviously make sense with 217 accessions (shouldn't it be >=7?). Also, it seems that filtering reduces the number of accessions to 184, where 3% would be >5. Since there are so many low frequency TEs (Figure 2—figure supplement 2), small shifts in this definition might change conclusions qualitatively.

We apologize for this error, and the number of accessions used should read >6 (not >7). Furthermore, we now realize that the explanation of MAF cutoffs used may have been unclear in our original manuscript. As the LD analysis performed is very sensitive to having low SNP variation between accessions, some accessions were filtered out for that analysis, giving a set of 184 accessions. In this case, our 3% MAF cutoff amounted to >5 accessions. For all other analyses, we used the full set of accessions in order to maximize sample sizes, and so the cutoff for rare variants was >6 accessions. We have made alterations to the text in order to clarify this point, under the Methods section titled “TE variant / SNP comparison”. Furthermore, we have added a definition of the MAF used to the Figure 2 legend and again in the main text when the distinction between rare and common alleles is introduced (subsection “Abundant TE positional variation among natural Arabidopsis populations”, third paragraph).

7) Gene At2G01360 used as an example in Figure 5 is highlighted in Figure 2 of Freeling et al. (2008) (http://genome.cshlp.org/content/18/12/1924.full) as transposing to this position in Arabidopsis by an unknown mechanism. Please comment about the possible relationship between this observation and the presence of a polymorphic TE within Arabidopsis.

We thank the reviewers for pointing this out, and have added some discussion of the transposition events at this point to the text. Specifically:

“This TE insertion occurred at the border of a non-syntenic block of genes thought to be a result of a transposition event in Arabidopsis (Freeling et al. 2008). This transposition event likely predates the TE insertion discovered here, and it is interesting that multiple transposition events appear to have occurred in close proximity in the genome.”

[Editors’ note: the author responses to the first round of peer review follow.]

Although the reviewers found the subject of the manuscript interesting and clearly presented, three major deficiencies were identified:

1) The paper appears to conflate transposition with TE content variation, which are different things. TE variation – the actually measured feature – can occur through new insertions, but also through deletion of old copies. Given the large number of ecotype genomes and an understanding of the underlying phylogeny, new insertions could be distinguished from deletions of ancestral TEs in most cases. This would allow new insertion evens to be considered separately from deletions.

In our revised manuscript, we now distinguish between new TE insertions and deletions of ancestral TE copies. We used the minor allele frequency of TE variants to guide these classifications, and have carried out all subsequent analyses on these insertion and deletion groups separately, as they are fundamentally different events. We identify important differences between TE insertions and TE deletions in the patterns of DNA methylation surrounding these sites. While the level of DNA methylation surrounding TE insertions was often elevated, we do not see DNA methylation decrease following the deletion of a TE. Rather, these regions flanking the original insertion site remain highly methylated. We believe that making this distinction between TE insertions and TE deletions has greatly improved our manuscript, and thank the reviewers for their suggestion.

2) As explained by Magnus Nordborg in his review, the use of LD to measure TE insertion age is questionable and likely produces inaccurate measurements. This might explain why the conclusion reached in this manuscript that even recent TE insertions are biased for pericentric heterochromatin is inconsistent with studies of individual plant TEs that do not show such a bias. A more appropriate estimate of TE insertion age may be through measurement of allele frequency. Because estimated TE age is used in analyses throughout the manuscript, a robust measure is essential.

We agree that the use of LD as a measure of TE insertion age may produce inaccurate results, and have removed any statements linking TE LD with TE insertion age. Instead, we have focused on allele frequency as a proxy for TE insertion age, as suggested by the reviewers. We now analyse rare and common TE variants separately, and conclude that many of the TE variants we find are not tagged by SNPs.

3) The section on seed and gamete methylation contains numerous questionable statements and analyses. Statements linking demethylation in pollen to imprinting are very confusing, because imprinting is associated with demethylation in the female gametophyte. The correlations with TEs demethylated (more accurately hypomethylated) in the CHH context in sperm ignore the global nature of this phenomenon, which affects most heterochromatic TEs. The small RNA analysis rests on the unwarranted assumption that the small RNAs are produced in the endosperm. The paper from which the small RNA data are derived makes a strong case that the small RNAs are derived from the maternal seed coat, which would provide a trivial explanation for the observation that TEs absent in the maternal genome have lower levels of small RNA in hybrid seeds. The most promising result – the correlation between TE content and methylation in Col x Cvi hybrids – didn't hold up in Col x Ler hybrids.

We agree with the reviewers’ comments, and this section has been removed from the revised manuscript.

We would welcome a resubmission that clearly identifies TE insertions, provides a robust measure of the false positive and false negative rates (please see Magnus Nordborg's review below), provides a robust measure of TE age, and ideally deepens the analysis of the correlations between TE variation, DNA methylation polymorphism, and gene expression as suggested by the reviewers.

We believe each of these points have been addressed in the revised manuscript. We provide a robust measure of false positive and false negative rates, described in detail in the subsection “Computational identification of TE presence/absence variation” and subsection “TE variant simulation”, and find that our rate of false negatives is substantially lower than that of Quadrana et al., published recently in eLife. By using the same methodology as Quadrana et al. (2016), we enable a direct comparison between the two TE variant detection methods, and find that the method we have developed has a much greater sensitivity. This key difference allows us to detect 8x more TE variants, and this larger sample size further allows us to explore the effect of these TE variants at the level of individual accessions, rather than at the species level, as was the focus of Quadrana et al. (2016). To provide a measure of TE age, we have focussed on the minor allele frequency of TE variants, and make many comparisons between the rare and common alleles identified. We further deepened the analysis of correlations between TE variation, DNA methylation and gene expression by testing the effects of rare variants on both gene expression and DNA methylation, and we perform a genomewide scan for significant correlations between TE variants and genomewide patterns of DNA methylation, in order to identify any potential trans effects of TE variants on DNA methylation. Furthermore, we now analyse TE insertions and TE deletions separately, and find important differences between insertions and deletions in the effect such TE variants have on nearby patterns of DNA methylation.

Reviewer #2 (Magnus Nordborg):

This is a very clear and interesting paper. It is one of the first genome-wide descriptions of TE polymorphism (there will be more…), and, although the conclusions are a bit limited, it should be of broad interest. Most importantly, it really suggests strongly that methylation variation is mostly genetic.

I do have a few substantive concerns, however. The first concerns data quality, which has been a major reason these types of studies have not done before. Calling TE polymorphisms from short-read data is not trivial. What is done here is sensible, but I would still like to have some idea of error rates and biases. Using simulations (subsection “Computational identification of TE presence/absence variation”) is not QC – it is at best debugging of software. What is needed is comparison with real data, preferably a "gold standard". The obvious choice is the PacBio Ler assembly, and it is used here, but it seems to me it is used in the wrong direction. The TE polymorphisms identified in the short-read data are confirmed in the PacBio data, rather than the other way around. It is the usual false-positive / false-negative issue that plagues all these data. What is done shows that the identified polymorphism are reliable, but it tells us nothing about what we don't see. Which leads to biases of all kinds. Why not do the right thing? And if there is a good reason for not doing the right thing, please justify this, and say a few words about how why the inevitable biases do not affect your results.

We have now included a robust estimate of TE false negative calls, using the Ler assembly and the same methodology as was recently published by Quadrana et al. To estimate the rate of false negatives for TE absence calls, we compared the set of TEs identified by Quadrana et al. as being absent from the Ler genome (by BLAT alignment) to our set of TE absence calls for Ler using TEPID. To estimate the rate of false negative TE insertion calls, we mapped Col0 short read data to the Ler PacBio assembly, and then looked for read evidence at the sites containing a Col0specific TE insertion meeting the thresholds for a TE insertion call to be made by TEPID. For both TE absences and TE insertions, this resulted in a false negative estimate of approximately 10%. This is described in more detail in the last paragraph of the subsection “Computational identification of TE presence/absence variation” and subsection “TE variant discovery”, last paragraph.

My second concern is the entire missing heritability section (Figure 2), which I think is based on two misunderstandings, one of linkage disequilibrium (LD), one of the nature of "missing heritability". Take LD first. It is theoretically impossible for TEs variants to show a different LD behavior than SNPs unless they are due to repeat mutations (which have to be very frequent indeed to affect LD). As long as identity in state = identity by descent, it doesn't matter how mutations arise. Of course it is possible that some of the TE indels could be non-unique, but there is no evidence for this in this paper (you could look for it: does the same insertion occur on unrelated haplotypes?). I'm fairly convinced that the difference is r2 distributions you see are largely due to differences in the allele frequency. To confirm this, compare r2 between TEs and flanking SNPs to r2 between a SNP and flanking SNPs after making sure that the polymorphism being tested has the same derived allele frequency. This can be done by binning. I think you will see that there is no difference. And if you do see a difference, it is probably due to some other artifact (genotyping error?), because, as I said, it is theoretically impossible for there to be a difference (unless repeat mutations are very different, selection is strong, etc.).

We agree that the discussion of missing heritability in our manuscript was misguided, and have removed those statements. We now simply analyse TE LD in order to make the point that most of these variants are untagged by SNPs, and so some differences between accessions that we detect could not previously have been attributed to any genetic differences between accessions. One example of this is the TE insertion that we identify in AtRLP18, which is untagged by SNPs, and so the differing sensitivities of accessions to Pseudomonas infection would appear random in the absence of our TE variant data. We also include a comparison between LD state and minor allele frequency, and we see that the two are indeed well correlated, indicating that our TE variant data is likely free of bias.

Don't say things like "this variant was classified as a young TE insertion, as it is not in linkage disequilibrium with surrounding SNP" – the younger a mutation is, the stronger the LD (because there has been no time for recombination). Sure, r2 will be low, but this is a trivial consequence of the definition of r2.

We have now removed all statements linking TE LD state with TE variant age.

As I see it, the relevance of TEs in the "missing heritability" debate is simply that they could be an important source of rare deleterious alleles. Such alleles, regardless of whether they are TEs, SNPs, or CNVs, will have the property that they can be mapped in pedigrees in which they segregate, but will individually explain very little of the population variation and thus cannot be mapped using GWAS. This has nothing to do with LD: it is a simple consequence of their frequency. LD is a red herring in this context. I would get rid of Figure 2. and simply mention this in the Discussion.

We agree that the TE variants identified are simply a source of rare variants, and have removed any discussion of “missing heritability”.

But speaking of rare variants, you seem to argue against TEs having an effect, at least on transcript abundance, although you note that perhaps the rare variants are the ones that matter (subsection “TE variants affect gene expression”, first paragraph). Did you try testing this in aggregate (analogously to what human geneticists do to establish that rare alleles matter)? Rather than carrying out a SNP-by-SNP test, which clearly doesn't work when frequencies are less than 3% (is that 2 alleles?), ask whether these rare TEs are, in aggregate, more likely to show extreme phenotypes than two (or three, or whatever the right number is) randomly chosen individuals? You could calculate some kind of two-tailed rank statistic, and simply ask whether they are more likely to be extreme (feel free to contact me if this is not clear).

This is an excellent suggestion, and we have now carried out this analysis. We do find a clear enrichment for rare TE variants near genes with expression extremes in the population, suggesting that these variants do indeed have a strong impact on gene expression. Similarly, we find a strong enrichment for nearby rare TE insertions in accessions with high DMR DNA methylation. Rare TE deletions were enriched for both high and low DMR methylation ranks in the population, suggesting that the DNA methylation level at DMRs sometimes does not return to preinsertion levels following a TE deletion.

As I said from the outset, the strength of this paper is confirming that a large fraction of DMRs are due to TEs. This could be made even stronger:

A) When estimating the fraction based on published DMR data (subsection “TE variants drive DNA methylation differences between accessions”, end of first paragraph), why not distinguish between types of DMRs?

We now include both types of DMR in our analysis (CDMRs and CGDMRs). We find that the DNA methylation level at CDMRs is dependent on nearby TE presence/absence, but CGDMR methylation level is usually (but not always) independent from any nearby TE variants.

B) When looking for DMR around identified TEs (subsection “TE variants drive DNA methylation differences between accessions”, first paragraph), why not using random regions as control?

In this section, we aim to compare DNA methylation levels in the same region of the genome with the presence or absence of TE insertions, and so feel the comparisons made fit our purpose.

C) The correlations between TE and methylation are nice, but is there any clear example of strong methylation without a TE? This would be of obvious interest.

We do find that many CDMRs (and most CGDMRs) are not associated with a TE variant. We find that TEs can explain ~54% of the CDMRs. The remaining CDMRs that are apparently independent from TE presence/absence did not seem to have any defining features that we could discern (genomic location, distance to genes, distance to TEs). However, one of our key findings is that the level of DNA methylation surrounding TEs often remains high following a TE deletion, suggesting that DNA methylation, once established near a TE, often remains after the TE has been deleted. This suggests that more CDMRs could be a result of TE insertions than we find direct evidence for, if these TEs were deleted in all accessions analysed in this population. These CDMRs would then represent the epigenetic “scars” of past TE insertions. It is also possible that some DMRs are regulated in trans, and we performed a genomewide scan to identify these DMRs, discussed below.

D) have you considered trans-acting TEs? Even if there is no TE variant in cis, there could be one in trans.

This is an excellent suggestion. We have reanalysed our TE variant and DMR data to perform pairwise associations across the genome. We identified the previously noted strong local (cis) effect of TE variants upon DNA methylation, and also identified many putative transeffect associations in which a single TE variant displays a high level of association with hundreds of DMRs. We have substantially edited the manuscript to include a section describing our association results and highlight the possibility that specific TE variants may influence distant methylation states beyond local effects.

How many more DMRs can you explain if you consider nearby TEs, variable or not?

Of the CDMRs that were not within 1 kb of a TE variant, 3,701 (27% of all CDMRs) are within 1 kb of a fixed TE. Therefore, 81% of all CDMRs are within 1 kb of a TE when taking into account both variable and fixed TEs in the population. Of the remaining 19% of CDMRs, most are within genes or intergenic regions. This information has been added to the first paragraph of the subsection “TE variants explain many DNA methylation differences between accessions”.

Trans-acting TEs could also be important when interpreting the crosses, but I'm guessing other reviewers will comment more extensively on that.

As detailed above, the section has been removed from the revised manuscript.

For correspondence

Competing interests

Funding

Centre of Excellence in Plant Energy Biology, Australian Research Council (CE140100008)

Tim Stuart

Steven R Eichten

Jonathan Cahn

Yuliya Karpievitch

Justin Borevitz

Ryan Lister

Australian Research Council

Tim Stuart

Steven R Eichten

Jonathan Cahn

Yuliya Karpievitch

Justin Borevitz

Ryan Lister

Sylvia and Charles Viertel Charitable Foundation

Ryan Lister

Australian Research Council (FT120100862)

Ryan Lister

Australian Research Council (DE150101206)

Steven R Eichten

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by the Australian Research Council (ARC) Centre of Excellence program in Plant Energy Biology CE140100008 (JOB, RL). RL was supported by an ARC Future Fellowship (FT120100862) and Sylvia and Charles Viertel Senior Medical Research Fellowship, and work in the laboratory of RL was funded by the Australian Research Council. TS was supported by the Jean Rogerson Postgraduate Scholarship. SRE was supported by an Australian Research Council Discovery Early Career Research Award (DE150101206). We thank Robert J Schmitz, Mathew G Lewsey, Ronan C O’Malley, and Ian Small for their critical reading of the manuscript, and Kevin Murray for his helpful comments regarding the development of TEPID. We would also like to thank Brandon Gaut for kindly providing Basho TE allele frequency estimates.

eLife is a non-profit organisation inspired by research funders and led by scientists. Our mission is to help scientists accelerate discovery by operating a platform for research communication that encourages and recognises the most responsible behaviours in science.eLife Sciences Publications, Ltd is a limited liability non-profit non-stock corporation incorporated in the State of Delaware, USA, with company number 5030732, and is registered in the UK with company number FC030576 and branch number BR015634 at the address:
eLife Sciences Publications, Ltd
Westbrook Centre, Milton Road
Cambridge CB4 1YG
UK