Significance

CRISPR-Cas is an adaptive immunity system that protects bacteria and archaea from mobile genetic elements. We present comparative genomic and phylogenetic analysis of minimal CRISPR-Cas variants associated with distinct families of transposable elements and develop the hypothesis that such repurposed defense systems contribute to transposable element propagation by facilitating transposition into specific sites. Thus, these transposable elements are predicted to propagate via RNA-guided transposition, a mechanism that has not been previously described for DNA transposons.

Abstract

A survey of bacterial and archaeal genomes shows that many Tn7-like transposons contain minimal type I-F CRISPR-Cas systems that consist of fused cas8f and cas5f, cas7f, and cas6f genes and a short CRISPR array. Several small groups of Tn7-like transposons encompass similarly truncated type I-B CRISPR-Cas. This minimal gene complement of the transposon-associated CRISPR-Cas systems implies that they are competent for pre-CRISPR RNA (precrRNA) processing yielding mature crRNAs and target binding but not target cleavage that is required for interference. Phylogenetic analysis demonstrates that evolution of the CRISPR-Cas–containing transposons included a single, ancestral capture of a type I-F locus and two independent instances of type I-B loci capture. We show that the transposon-associated CRISPR arrays contain spacers homologous to plasmid and temperate phage sequences and, in some cases, chromosomal sequences adjacent to the transposon. We hypothesize that the transposon-encoded CRISPR-Cas systems generate displacement (R-loops) in the cognate DNA sites, targeting the transposon to these sites and thus facilitating their spread via plasmids and phages. These findings suggest the existence of RNA-guided transposition and fit the guns-for-hire concept whereby mobile genetic elements capture host defense systems and repurpose them for different stages in the life cycle of the element.

Mechanisms for recognizing specific nucleic acid sequences are essential to accessing and maintaining the genome in all life forms. The most widespread molecular systems based on sequence recognition involve dedicated nucleic acid-binding proteins (1, 2). In particular, promoter recognition by transcription factors and recognition of chromosomal replication origins by initiation proteins are fundamental, universal processes central to normal cell function (3, 4). Additionally, recognition of nucleic acids by proteins is the basis of self vs. nonself discrimination that is essential for defense functions, such as restriction modification in prokaryotes (5, 6). However, there is a growing appreciation of how nucleic acids themselves are harnessed for the task of sequence recognition. A key advantage of these systems is their flexibility, whereby a guide nucleic acid molecule can be adapted to recognize any target sequence with high specificity. Thanks to this inherent capacity, nucleic acid-based machinery has been exploited extensively in evolution for defense of the genome against mobile genetic elements (MGE) as well as for regulatory functions (7). A major case in point is the vast RNAi network, apparently the most conserved, ancestral innate immunity system in eukaryotes (8–10). The RNAi machinery takes advantage of dsRNA produced by viruses and transposons to generate specific guide RNAs for defense and has also spawned a variety of regulatory mechanisms.

Prokaryotes possess a system of innate immunity centered around the Argonaute proteins that appears to be the evolutionary antecedent of eukaryotic RNAi (7, 11, 12) as well as CRISPR-Cas systems of adaptive immunity (13–15). The CRISPR-Cas systems provide guide RNA-based defense against viruses and other MGE in nearly all archaea and about one third of bacteria (16).

CRISPR-Cas systems possess a modular organization that roughly reflects the three main functional stages of the CRISPR immune response: (i) spacer acquisition (known as “adaptation”), (ii) pre–CRISPR RNA (precrRNA) processing, and (iii) interference (14). CRISPR-Cas systems are highly diverse but can be partitioned into two distinct classes based on the organization of the effector module that is responsible for processing and interference (15, 16). Class 1 CRISPR-Cas systems are further divided into three types and 12 subtypes in all of which the effector modules are multisubunit complexes of Cas proteins (16). In contrast, in the currently identified three types and 12 subtypes of class 2, the effector modules are represented by a single multidomain protein, such as the thoroughly characterized Cas9 (15, 17, 18).

At the adaptation stage, the Cas1–Cas2 protein complex, in some instances with additional involvement of accessory adaptation proteins and/or effector module proteins, captures a segment of the target DNA (known as the “protospacer”) and inserts it at the 5′ end of a CRISPR array (19–23). In the second, processing stage, a CRISPR array is transcribed into a long transcript known as “precrRNA” that is bound by Cas proteins and processed into mature, small crRNAs. In most class 1 systems, the precrRNA processing is catalyzed by the Cas6 protein that, in some cases, is loosely associated with the effector complex (14, 24). The final interference step involves binding of the mature crRNA by the effector complex, scanning a DNA or RNA molecule for a sequence matching the crRNA guide and containing a protospacer adjacent motif (PAM), and cleavage of the target by a dedicated nuclease domain(s) (14, 24–26). The identity of the nuclease(s) differs between type I and type III CRISPR-Cas systems. In type I, the protein responsible for target cleavage is Cas3, which typically consists of a superfamily II helicase and HD-family nuclease domains (27). After the effector complex, which is denoted “Cascade” [CRISPR-associated complex for antiviral defense (28)] in type I systems, recognizes the cognate protospacer in the target DNA, it recruits Cas3, after which the helicase unwinds the target DNA duplex, and the HD nuclease cleaves both strands (29, 30). Type III systems lack Cas3, and the protein responsible for target cleavage is Cas10, which contains polymerase-cyclase and HD-nuclease domains that are both required for the target degradation (31, 32).

In some of the CRISPR-Cas systems the adaptation genes are encoded separately or even are missing from the genome containing effector complex genes. Among these nonautonomous CRISPR-Cas systems, those of type III have been characterized in most detail (14). It has been shown that type III effector complexes can use crRNA originating from CRISPR arrays associated with type I systems and thus do not depend on their own adaptation modules (33–37). Furthermore, the CRISPR-Cas systems of type IV, which are often encoded on plasmids, typically consist of the effector genes only (16). No adaptation genes and no associated nuclease domains could be found in the type IV loci, although occasionally CRISPR arrays and cas6-like genes are present. The type IV systems have not yet been studied experimentally, so their mode of action remains unknown. Finally, several variants of type I systems, similarly to type IV, lack adaptation genes and genes for proteins involved in DNA cleavage. A “minimal” variant of subtype I-F has been identified in the bacterium Shewanella putrefaciens, with an effector module that consists only of Cas5f, Cas6f, and Cas7f proteins and lacks the large and small subunits present in other Cascade complexes (38). Even more dramatic minimization of subtype I-F has been observed in another variant of subtype I-F that lacks the adaptation module and consists solely of three effector genes, namely a fusion of cas8f (large subunit) with cas5f, that is unique for this variant, cas7f, and cas6f (Fig. 1A) (16). Given the composition of their Cascade complex, these Cas1-less minimal subtype I-F systems can be predicted to process precrRNA, yielding mature crRNAs, and to recognize the target. However, they lack the Cas3 protein and therefore cannot be expected to be competent for target cleavage. Here we report a comprehensive in silico analysis of this system showing that it is linked to a specialized group of transposons related to the well-studied Tn7.

Schematic representation of the complete and minimal type I-F CRISPR-Cas systems and Tn7 transposition. (A) Gene organization of a complete and a minimal type I-F CRISPR-Cas system lacking the genes for proteins responsible for adaptation and target cleavage. Minimal I-F systems contain fused cas8f and cas5f genes that are characteristic of this group (16). Together, these proteins can be predicted to be subunits of a minimal Cascade complex. (B) Gene structure of the Tn7 genes flanked by left (L) and right (R) end sequences. Transposition catalyzed by the TnsABC+TnsD proteins directs the transposon into a single chromosomal site (attTn7) in bacterial genomes. Transposition catalyzed by the TnsABC+TnsE proteins preferentially directs transposition into actively conjugating DNA and filamentous bacteriophage (shown by a red circle with arrows). The transposon is denoted by a rectangle in the attachment site. The DNA sequence omitted in the graphic is denoted by two slashes. See text for details.

As genomic parasites, transposons have evolved to limit the negative effects they exert on the host. A variety of regulatory mechanisms are used to maintain transposition at a low frequency and sometimes coordinate transposition with various cell processes. Some prokaryotic transposons also can mobilize functions that benefit the host or otherwise help maintain the element. Certain transposons also evolved mechanisms of tight control over target site selection, the most notable example being the Tn7 family (39). Three transposon-encoded proteins form the core transposition machinery of Tn7: a heteromeric transposase (TnsA and TnsB) and a regulator protein (TnsC) (Fig. 1B). In addition to the core TnsABC transposition proteins, Tn7 elements encode dedicated target site-selection proteins, TnsD and TnsE. In conjunction with TnsABC, the sequence-specific DNA-binding protein TnsD directs transposition into a conserved site referred to as the “Tn7 attachment site,” attTn7 (40). TnsD is a member of a large family of proteins that also includes TniQ, a protein found in other types of bacterial transposons. TniQ is incompletely characterized at the molecular level but has been shown to target transposition into the resolution sites of plasmids (41). Transposition into the attTn7 site shows no negative impact on the host, providing a “safe haven” for these elements that appear to be universally maintained in bacteria. Transposition mediated by TnsABC + TnsE is preferentially directed into plasmids and bacteriophages owing to the ability of TnsE to recognize complexes formed during specific types of DNA replication (42–45). The TnsE-mediated transposition that preferentially directs insertion into other MGE is likely responsible for the wide distribution of Tn7 elements among bacteria.

Here we show that minimal subtype I-F CRISPR-Cas systems are specifically associated with a distinct group of Tn7-like elements. These transposons encode TnsD(TniQ)-like proteins and use previously uncharacterized attachment sites but lack TnsE-like proteins that normally promote horizontal transfer of the elements. Several identified matches for the spacers from the transposon-associated CRISPR arrays suggest that this system might function by directing transposition to sites specified by guide crRNAs. We hypothesize that the CRISPR-Cas machinery recruited by these elements facilitates their horizontal dissemination, mostly via plasmids and/or phages. Thus, this group of MGE is likely to possess a functionality that has not been described previously for DNA transposons, namely, RNA-guided transposition.

Results and Discussion

A Variant of the Type I-F CRISPR-Cas System Is Specifically Associated with a Distinct Family of Tn7-Like Elements.

For the purpose of comprehensive identification of type I-F CRISPR-Cas loci, we chose the Cas7f protein as the probe, given that it is the most conserved component in all systems of this subtype including the minimal variant lacking cas1, cas2, and cas3 genes. Using a PSI-BLAST search started with Cas7f profiles, we obtained 2,905 Cas7f protein sequences, mapped them onto the respective genomes, and annotated the genes in the neighborhoods 10 kb up- and downstream of the cas7f genes using PSI-BLAST against the conserved domain database (CDD). These 20-kb loci are long enough to cover a typical complete I-F system that consists of six genes (16). We then reconstructed a phylogenetic tree from all identified Cas7f protein sequences (Fig. 2A and Dataset S1; see the respective Newick tree at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). Mapping gene neighborhoods on the tree revealed a single, monophyletic, strongly supported branch that included all cas1-less I-F variants. As of this analysis, the branch encompassed 423 sequences from 19 genera of Gammaproteobacteria and appears to derive from a typical, complete I-F system (Figs. 1A and 2A). Indeed, all other branches in the tree consist of Cas7f homologs from complete I-F systems containing a cas1 gene within the locus. A few exceptions that are scattered in the tree are from either small contigs or disrupted cas loci. In the vast majority of the loci corresponding to the cas1-less branch, a tnsD(tniQ) gene is located next to the cas genes (Fig. 3).
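The neighborhood-extraction step described above can be illustrated with a minimal Python sketch. It assumes that gene annotations are already available as coordinates on a contig. The `Gene` record, the function names, the toy coordinates, and the choice to measure the 10-kb flanks from the ends of the cas7f gene are all illustrative assumptions, not the pipeline used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str   # annotation label, e.g. "cas7f" (hypothetical labels)
    start: int  # 1-based inclusive start on the contig
    end: int    # 1-based inclusive end on the contig


def neighborhood(genes, anchor, flank=10_000):
    """Return genes overlapping the window that extends `flank` bp
    upstream and downstream of the anchor gene."""
    lo = max(1, anchor.start - flank)
    hi = anchor.end + flank
    return [g for g in genes if g.end >= lo and g.start <= hi]


def is_cas1_less(locus):
    """True if the locus contains cas7f but no cas1 annotation."""
    names = {g.name for g in locus}
    return "cas7f" in names and "cas1" not in names


# Toy contig: coordinates are invented for illustration only.
genes = [
    Gene("tnsA", 1_000, 1_800),
    Gene("tniQ", 9_000, 10_200),
    Gene("cas8f-cas5f", 10_300, 12_400),
    Gene("cas7f", 12_450, 13_500),
    Gene("cas6f", 13_550, 14_100),
    Gene("cas1", 40_000, 41_000),
]

anchor = next(g for g in genes if g.name == "cas7f")
locus = neighborhood(genes, anchor)
locus_names = [g.name for g in locus]
print(locus_names, is_cas1_less(locus))
```

In this toy case the distant cas1 gene falls outside the window, so the locus is flagged as cas1-less. On real data, a locus near the edge of a small contig could be flagged for the same reason, which is consistent with the exceptions noted above for small contigs and disrupted loci.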

Schematic evolutionary trees for the Cas7f, TnsA, and TnsD(TniQ) protein families. (A) The dendrogram was built using 2,905 Cas7f proteins as described in Methods (see the complete tree at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). The major subtrees are collapsed and shown by triangles. The branch corresponding to the minimal I-F variant is colored in orange, and the bootstrap value for this subtree is shown. (B) The dendrogram was built using 7,023 TnsA protein sequences (see the complete tree at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). The branch corresponding to TnsA in the loci containing I-F variant cas genes is colored in orange, and I-B subtype cas genes are colored in green. The CRISPR-Cas subtypes are indicated next to the respective branches. Distinct cyanobacterial strains are indicated next to the respective I-B systems. The bootstrap value for the TnsA branch associated with I-F cas genes is shown. (C) The dendrogram was built using 7,963 TnsD(TniQ) proteins (see the complete tree at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). The outgroup consists of the TnsD(TniQ)-like proteins that form the sister group of those associated with the type I-F CRISPR-Cas systems but encoded by Tn6022 elements lacking CRISPR-Cas (see the complete tree for the full information). The designations are as in B.

Schematic representation of Tn7, Tn6022, and selected Tn7-like transposons containing cas genes. Genomic features recognized by the transposon-encoded TniQ protein are indicated on the left (glmS, yifB, IMPDH, yciA, and SRP-RNA). Color coding and labeling are as in Fig. 1. Elements other than Tn7 and Tn6022 are denoted by the respective TnsA tree leaves (#XX) (Tn6022 = Tree node #582) (Dataset S2). Other genes are shown in gray, and known Tn7 cargo genes are indicated. Black vertical bars indicate repeats in the element-encoded arrays. DNA sequences omitted in the graphic are indicated by two slashes. See text for details.

To determine whether the association of the Cas1-less I-F systems with Tn7-like elements was unique or emerged independently on several occasions, we analyzed the TnsD(TniQ) and TnsA families. TnsA is the most highly conserved protein of the Tn7-like elements and is responsible for the unique behavior of the elements with heteromeric transposases (46–49). We collected and annotated 10,349 loci containing at least tniQ/tnsD or tnsA (Dataset S2) and reconstructed a tree for each protein family (Fig. 2 B and C and see respective Newick trees at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). In both trees, the loci containing cas genes of the cas1-less I-F variant mapped to strongly supported clades (Fig. 2 B and C). Thus, phylogenetic analysis of both Cas7f and the associated transposon-encoded proteins reveals a strong link between a specific group of Tn7-like elements and a distinct variant of the subtype I-F CRISPR-Cas systems. The Tn7-like elements in the clade that includes Tn6022 were identified as the outgroups to the respective branches in both the TnsA and TnsD(TniQ) trees, suggesting that a member of the Tn6022 family is the ancestor of the CRISPR-associated variety of Tn7-like transposons (Fig. 2 B and C). Both clades include multiple, deep branches that are not associated with cas genes in the respective loci, indicating that the link with the I-F system evolved relatively late in the history of this group of Tn7-like elements (see respective Newick trees at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). In several cases, however, the distribution of the cas genes among the tree branches implies that these were lost from the vicinity of the conserved transposon genes (e.g., Shewanella baltica OS678 and Thiomicrospira crunogena XCL_2), which suggests that the CRISPR-Cas system is not essential for transposon survival. Notably, however, the converse is not the case: We detected no intact cas1-less I-F systems outside this transposon neighborhood, with the implication that this CRISPR-Cas variant is functional only when associated with a Tn7-like element.

We further investigated the tnsD and tnsA loci to identify any other CRISPR-Cas systems that might be linked to Tn7-like transposons. Only a few such instances were detected, mostly complete loci containing the adaptation genes. The respective tnsA and/or tniQ/tnsD genes are scattered in the phylogenetic trees, suggesting that most of these associations are effectively random and might be transient (Dataset S2). However, some such loci do show a degree of evolutionary coherence. Specifically, they form two small, unrelated branches in both the TnsA and the TnsD(TniQ) trees (see I-B in Fig. 2 B and C). All these CRISPR-cas loci are present in different cyanobacteria, belong to the I-B subtype, and lack adaptation genes as well as the cas3 gene that is required for DNA cleavage in type I systems. Thus, to a large extent, these type I-B variants mimic the organization of the more common transposon-associated, cas1-less I-F variant (see below).

The cas1-Less Type I-F CRISPR-Cas System Is Mobilized Together with Conserved Transposition Genes.

We analyzed the transposon end sequences in the loci containing the I-F and I-B CRISPR-Cas variants to determine whether the cas genes were located within the boundaries of these elements or are simply adjacent to the transposon. The structure of the left and right ends of canonical Tn7 has been defined previously (Fig. S1). Tn7 ends are marked by a series of 22-bp TnsB-binding sites (50–52). Flanking the most distal TnsB-binding sites is an 8-bp terminal sequence ending with 5′-TGT-3′/3′-ACA-5′. Tn7 contains four overlapping TnsB-binding sites in the ∼90-bp right end of the element and three dispersed sites in the ∼150-bp left end of the element, but the number and distribution of TnsB-binding sites can vary among Tn7-like elements (39, 49). End sequences of Tn7-related elements can be determined by identifying the directly repeated 5-bp target site duplication, the terminal 8-bp sequence, and 22-bp TnsB-binding sites (Fig. S1). Compared with the canonical Tn7 and Tn6022, Tn7-like elements show extensive variation in size and gene complements as illustrated by a representative set of 12 complete elements ranging in size from 22 kb to almost 120 kb (Fig. 3 and Table S1) (53, 54). One of these elements has been previously identified in Vibrio parahaemolyticus RIMD2210633 as a member of the Tn7 superfamily and encodes the Vibrio pathogenicity determinant, thermostable direct hemolysin (TDH) (55). It should be emphasized that in closely related bacterial genomes (e.g., different strains of V. parahaemolyticus), CRISPR-Cas–carrying Tn7-like elements are often inserted in different sites (Fig. 4), which is indicative of recent mobility of these elements.
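Two of the end-defining features listed above, the 5-bp target site duplication and the terminal TGT/ACA trinucleotide, lend themselves to a simple check of a candidate element boundary. The Python sketch below is only an illustration under stated assumptions: the function names and the toy sequence are hypothetical, the candidate coordinates are taken as given, and the search for 22-bp TnsB-binding sites is omitted because it would require a binding-site model that is not described in this section.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def check_ends(contig, left, right, tsd_len=5):
    """Check a candidate element occupying contig[left:right] (0-based,
    half-open) for a flanking direct repeat and for terminal TGT on each
    end when that end is read inward from its tip."""
    element = contig[left:right]
    tsd = None
    if left >= tsd_len and right + tsd_len <= len(contig):
        upstream = contig[left - tsd_len:left]
        downstream = contig[right:right + tsd_len]
        if upstream == downstream:
            tsd = upstream
    return {
        "tsd": tsd,                                       # duplicated target sequence, if any
        "left_tgt": element.startswith("TGT"),            # top strand starts with TGT
        "right_tgt": revcomp(element).startswith("TGT"),  # top strand ends with ACA
    }


# Toy sequence: an invented element flanked by the GCGGG duplication
# reported for canonical Tn7 at attTn7; everything else is made up.
flank_left, flank_right = "AA", "TT"
duplication = "GCGGG"
toy_element = "TGTGGGCGGACA" + "TTTTT" + "TGTCCGCCCACA"
contig = flank_left + duplication + toy_element + duplication + flank_right

left = len(flank_left) + len(duplication)
right = left + len(toy_element)
result = check_ends(contig, left, right)
print(result)
```

In practice the boundaries are unknown, so a check of this kind would be applied to each candidate pair of coordinates suggested by the positions of the transposon genes.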

Phylogenetic tree of selected representatives of type I-F-associated TnsD(TniQ)-like proteins. A maximum likelihood phylogenetic tree was built as described in Methods for a selected set of TnsD(TniQ)-like proteins associated with the type I-F CRISPR-Cas variant and the same outgroup as in Fig. 2C. The numbers at internal branches indicate percent bootstrap support; only values greater than 70% are indicated. Elements located in one of the three attachment sites identified in this work are shown by color as indicated (yciA, IMPDH, and SRP-RNA); random sites are in black. The leaves of the tree for the TnsD(TniQ)-like proteins (#XX) (Dataset S2) are shown in green.

Schematic representation of the end structure of Tn7-like elements: anatomy of a Tn7 insertion. Typically, the insertion occurs at a single site about 25 bp downstream from the last codon of glmS. The Tn7 end proximal to tnsA is closest to glmS (by convention it is referred to as the “right” end). Transposition generates a target site duplication (shown in red) of the chromosomal sequence that now forms a direct repeat on either side of the element. In the case of insertion of the canonical Tn7 element at the attTn7 site, this sequence is GCGGG; an 8-bp end sequence starts with TGT/ACA, and immediately after this end sequence is the first 22-bp binding site for TnsB.

In our analysis of CRISPR-Cas systems, two groups of type I-B variants were identified in association with Tn7-like elements (Fig. 2 B and C). Similar to the type I-F CRISPR-Cas variant, these I-B systems are expected to be functional for maturing CRISPR transcripts and forming crRNA complexes at protospacers but lack adaptation genes and Cas3 and, accordingly, are likely to be defective for interference. Furthermore, these type I-B CRISPR-Cas variants are associated with short CRISPR arrays (Fig. 3).

Taken together, these findings indicate that the type I-F and I-B CRISPR-Cas variants identified in this work are part of the core gene repertoire in multiple clades of Tn7-like elements.

The canonical Tn7 element and especially the transposition pathway that directs the element into the attTn7 site located downstream of the conserved glmS gene have been studied extensively. The Tn7 TnsD(TniQ) protein is a sequence-specific DNA-binding protein that recognizes a highly conserved 36-bp sequence in the downstream region of the glmS gene-coding sequence (40, 56). Transposition events promoted by TnsABC+D are directed into a position 23 bp downstream of the region bound by TnsD. Tn7 transposition is orientation specific in all transposition pathways; the transposon end proximal to the tnsA gene (the “right” end of the element) is adjacent to the DNA sequence or a specific protein complex recognized in each pathway (44, 56–58).

We analyzed the region adjacent to the point of insertion of the Tn7-like elements and identified three previously uncharacterized attachment sites for the cas1-less, type I-F–associated transposons. Similar to Tn7 insertions, one subgroup of the elements occurred downstream of a gene, but instead of glmS, these insertions were found downstream of an inosine-5′-monophosphate dehydrogenase gene (Figs. 3 and 4 and Table S1). The configurations found with the other recognizable attachment sites have not been described previously for Tn7-like elements. In one case, the attachment site was located upstream of the yciA gene, which encodes an acyl-CoA thioester hydrolase (Figs. 3 and 4 and Table S1). The third attachment site identified for the cas1-less type I-F–associated elements is in a non–protein-encoding gene, namely, the gene for the signal-recognition particle RNA (SRP-RNA), another configuration not reported previously (Figs. 3 and 4 and Table S1). The concordance between the phylogeny of the TnsD(TniQ) proteins and the attachment site used by the element is consistent with the hypothesis that each attachment site is recognized by a cognate TnsD(TniQ) protein (Fig. 4). However, many transposons appear to be inserted in random sites (Fig. 4). It remains unclear how insertions were directed into these sites because they are unlikely to be specifically recognized by TnsD(TniQ) proteins encoded by these elements, and these elements lack a homolog of the TnsE protein found in typical Tn7 transposons.

Analysis of CRISPR Arrays Associated with the cas1-Less I-F Systems.

The great majority of the transposon-associated I-F and I-B systems encompass a CRISPR array downstream of the cas6 gene (see Fig. 3 for examples). In most cases, this array contains only one or two spacers, suggesting that spacer acquisition in these arrays occurs only rarely (Fig. 3 and Table S2). Nevertheless, the spacers are typically unrelated, even in closely related bacterial genomes, indicating that, occasionally, new spacers are incorporated, and old ones are lost. Because these loci lack cas1 and cas2, only adaptation genes acting in trans can insert new spacers into these arrays. Among the 14 complete bacterial genomes containing Tn7-like elements with the I-F CRISPR-Cas, only two encompass other CRISPR-Cas loci containing adaptation genes, namely, Vibrio fluvialis ATCC 33809 and Pseudoalteromonas rubra SCSIO6842, which possess I-F and I-C systems, respectively. Among draft genomes, there are more cases where additional, complete CRISPR-Cas systems, mostly I-F and I-E, are present in the same genomes. Nevertheless, most of the genomes that contain the Tn7-associated I-F lack other CRISPR-Cas systems that would be able to provide for adaptation, which might account for the short CRISPR arrays. All four complete genomes containing elements associated with I-B systems encompass additional CRISPR-Cas loci containing adaptation genes, often of subtype I-D, which is abundant in cyanobacteria (16).
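For a short array such as those described above, spacers can be recovered by splitting the array sequence at each repeat. The following Python sketch assumes that the repeat is known and occurs as exact, identical copies, which real arrays often violate (terminal repeats are frequently degenerate). The function name and both the repeat and spacer sequences are placeholders chosen for illustration and are not taken from Table S2.

```python
def split_spacers(array_seq, repeat):
    """Return the sequences lying between consecutive exact copies of
    `repeat`. Sequence before the first and after the last repeat is
    treated as flanking DNA and discarded."""
    parts = array_seq.split(repeat)
    return parts[1:-1]


# Placeholder sequences for illustration only.
repeat = "GTTCACTGCCGTATAGGCAGCTAAGAAA"        # 28 nt
spacer_1 = "ACGTACGTTAGCCATGGTCAAGCTTGACCTAG"  # 32 nt
spacer_2 = "TTGACCGATTACGGCATCAGGTCCATAGCTTA"  # 32 nt

array_seq = "TTTT" + repeat + spacer_1 + repeat + spacer_2 + repeat + "GGGG"
spacers = split_spacers(array_seq, repeat)
print(len(spacers), spacers)
```

A sequence with fewer than two copies of the repeat yields an empty list, so a lone repeat is not reported as an array.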

Altogether, more than 800 spacers were identified in the transposon-associated I-F and I-B CRISPR arrays (see automatically and manually identified spacers at ftp://ftp.ncbi.nih.gov/pub/makarova/supplement/Peters_et_al_2017). As in most analyses of CRISPR spacers, including a recent comprehensive survey (59–62), only a small fraction of these spacers yielded significant matches to sequences in public databases. However, the matches that could be detected were informative because they were to plasmids and bacteriophages associated with the same bacterial genera in which the respective elements are found (Table S2). We identified two cases (in Photobacterium kishitanii and Photobacterium leiognathi) of special interest, in which spacers matched the region adjacent to the tnsA-gene–proximal side of the element (Table S2), i.e., the specific region where complexes involved in targeting transposition events interact with the target DNA (44, 56, 58). An additional spacer match was found inside the transposon boundaries in several V. parahaemolyticus strains (Table S2). A similar situation might have also occurred in a Tn7-like transposon associated with a type I-B CRISPR-Cas variant in a Cyanothece PCC 7822 plasmid, although end sequences could not be unambiguously defined for this element (Table S2).
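The idea of matching a spacer to a candidate protospacer can be illustrated with a small Python sketch that scans both strands of a target sequence and tolerates a few mismatches. This is a simplified stand-in for a database search and does not reproduce the search procedure or significance criteria used in this work. The function names, the mismatch threshold, and the sequences are assumptions made for the example, and the PAM is not considered.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_protospacers(spacer, target, max_mismatches=2):
    """Scan `target` for ungapped matches to the spacer ("+") or to its
    reverse complement ("-"). Returns (offset, strand, mismatches)."""
    hits = []
    n = len(spacer)
    for strand, query in (("+", spacer), ("-", revcomp(spacer))):
        for i in range(len(target) - n + 1):
            mm = count_mismatches(query, target[i:i + n])
            if mm <= max_mismatches:
                hits.append((i, strand, mm))
    return hits


# Invented 32-nt spacer and a toy target carrying two copies of it:
# one with a single substitution, one as the reverse complement.
spacer = "ATGCCGTAACTGGACTTCAGGCATTCGAGTCA"
mutated = spacer[:5] + "A" + spacer[6:]  # G -> A at position 5
target = "GGG" + mutated + "CCC" + revcomp(spacer) + "TT"

hits = find_protospacers(spacer, target)
print(hits)
```

A hit located just outside the tnsA-proximal end of an element, as in the two Photobacterium cases, would then be recognized by comparing the hit offset with the element boundaries.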

A Potential Role for CRISPR-Cas in Targeting Transposition.

Taking into account all the observations on the transposon-associated CRISPR-Cas systems and previous studies on the mechanism of target site activation, we propose a model for the involvement of Cas1-less CRISPR-Cas systems in targeting transposition to facilitate cell–cell transfer of the element (Fig. 5). Canonical Tn7 encodes two targeting pathways that are both mediated by the same set of TnsABC proteins (Fig. 1B). The TnsABC+TnsD(TniQ) pathway appears to be broadly conserved, allowing high-frequency transposition into an attachment site recognized by a cognate TnsD(TniQ) protein (Figs. 1 and 4 and Table S1) (49). The cas1-less I-F CRISPR-Cas variant is encoded in the same location where the tnsE gene that promotes transposition into conjugal plasmids and filamentous bacteriophages is typically located (Fig. 3). Thus, it appears likely that the CRISPR-Cas system functionally replaces TnsE as a mechanism facilitating horizontal transfer of the element. Support for this possibility comes from the observation that the transposon-associated CRISPR arrays largely carry plasmid and phage-specific spacers and could direct the transposon to the respective elements (Table S2).

Model of the two targeting pathways for Tn7 elements containing CRISPR-Cas system. Designations are as in Fig. 1.

Distortions in B-Form DNA Induced By Cas-crRNA Could Play a Role in Recruiting Transposition.

Transposition into attTn7 is well understood at the molecular level; the DNA structure in the vicinity of the attachment site plays a central role in transposition (Fig. 6 A–C). TnsD binding induces an asymmetric distortion in the attTn7 target DNA that is essentially responsible for attracting TnsC for target site selection during transposition (Fig. 6A) (56, 63). The TnsABC proteins are normally insufficient for Tn7 transposition in vivo or in vitro (64); however, certain gain-of-function mutations in the regulator protein TnsC (TnsC*) allow untargeted transposition in the absence of TnsD or TnsE (47, 65, 66). Notably, transposition in this case is attracted to a specific location adjacent to a short segment of triplex-forming DNAs (58, 67). Analogous to transposition events found in attTn7, these events are targeted to a position on one side of the triplex-forming DNA in a unique orientation owing to the ability of TnsC to recognize the distortion formed at the triplex–to–B-form DNA transition (Fig. 6C). Distortions induced in the target DNA are also implicated in transposition targeting by TnsABC+E (Fig. 6B) (68). Given that distortions in B-form DNA are also expected adjacent to crRNA-bound effector complexes that generate R-loops through duplex formation between the crRNA and the protospacer (26, 69), there could be a mechanistic link between the well-understood Tn7 targeting process and DNA targeting by the CRISPR-Cas effector complexes (Fig. 6D).

Models of the three previously described Tn7 targeting pathways and the proposed CRISPR-Cas–facilitated transposition pathway. Representations of TnsABC+TnsD (A) and TnsABC+TnsE (B) transposition pathways, the synthetic transposition pathway that targets triplex DNA complexes with a mutant form of TnsC, TnsABC* (C), and the proposed targeting pathway mediated by Cas interference complexes (D) are shown. Known host factors that participate in the TnsD (ACP, L29) and TnsE (DnaN) pathways are also shown. See text for details and references.

Evolution of the Association Between CRISPR-Cas Variants and Tn7-Like Elements.

Given that type I CRISPR-Cas systems have been shown to selectively integrate spacers from plasmids and phages (19, 70), an attractive hypothesis is that the CRISPR-cas loci that randomly became associated with the transposon were fixed through selection for their ability to facilitate dissemination of transposons. As discussed above, because changes in DNA structure play a key role in target site selection by Tn7, relatively little evolutionary adaptation might be needed to allow the core TnsABC machinery to recognize crRNA-bound effector complexes for targeting. In this light, it is intriguing that association between CRISPR-Cas systems and Tn7-like elements occurred on multiple, independent occasions. The consistent minimalist features in the organization of the transposon-associated type I-F and I-B variants imply that they coevolved with Tn7-like elements along parallel paths of reductive evolution. Both type I-F and type I-B systems have lost the adaptation module (cas1 and cas2) and the cas3 gene, which is required to cleave the target DNA in other type I systems (14). The absence of Cas3 implies that these CRISPR-Cas systems recognize but do not cleave the target, a mode of action that would allow the targeted DNA to serve as a vehicle for horizontal transfer of the respective Tn7-like transposon.

The transposon-associated CRISPR arrays are short, and the respective bacterial genomes often lack CRISPR adaptation modules. Thus, the majority of the CRISPR-containing transposons are likely to be relatively recent arrivals in the respective genomes, conceivably brought in by the plasmid or phage against which they carry a spacer. Once integrated into a new host attachment site, such transposons could “lie in wait” for a horizontal transfer vehicle, either as a result of in trans acquisition of a new spacer that is specific to an endogenous plasmid or prophage or via the entry of an element that is already represented by a cognate spacer in the transposon-encoded CRISPR array. In some cases, an incoming plasmid or phage recognized by the CRISPR-Cas system and targeted for transposition would be incapacitated by the integration event. Nevertheless, even such unproductive integrations would still benefit the CRISPR-carrying transposon by protecting the host. In such cases, CRISPR-directed integration, although consistent with selfish behavior of the transposon, would also qualify as altruistic behavior toward the host. Occasionally, the Tn7-encoded CRISPR-Cas systems appear to acquire spacers from the host chromosome, conceivably stimulating ectopic transposition within the same genome. This mechanism could provide for transposition in hosts that lack attachment sites recognized by the element-encoded TnsD(TniQ) protein.

Concluding Remarks

Here we identify three distinct groups of Tn7-like transposons that encode minimal variants of type I CRISPR-Cas systems. The transposon-encoded CRISPR-Cas variants lack the interference nucleases, whereas the transposons themselves lack the TnsE protein that directs transposition to MGE. Therefore, we hypothesize that these CRISPR-Cas systems functionally replace TnsE and comprise an RNA-guided transposition machinery. To the best of our knowledge, such a mechanism has not been identified or proposed previously for DNA transposons. However, homology between the MGE RNA and the integration region in the host genome is exploited during group II intron retrohoming (71, 72), suggesting that RNA-guided target recognition evolved more than once in MGE evolution.

Many questions remain regarding the functioning of the CRISPR-Cas in Tn7-like transposons, including the possibility of direct interaction between the CRISPR effector complexes and TnsD(TniQ), TnsABC, or other transposon-encoded accessory proteins. It is also unclear if these CRISPR-Cas variants might perform alternative or additional functions beyond facilitation of transposition, such as gene silencing or protection of the transposon.

From the evolutionary standpoint, the transposon-associated CRISPR-Cas systems fit the guns-for-hire paradigm (73). Under this concept, MGE genes are often recruited by host defense systems, and, conversely, defense systems or components thereof can be captured by MGE and repurposed for counter defense or other roles in the life cycle of the element. Recruitment of MGE apparently was central to the evolution of CRISPR-Cas, contributing to the origin of both the adaptation module and the class 2 effector modules (15, 17, 74). On the other side of the equation, virus-encoded CRISPR-Cas systems have been identified and implicated in inhibition of host defense (75). The observations described here, if validated experimentally, seem to “close the circle” by demonstrating recruitment of CRISPR-Cas systems by transposons, conceivably for a role in targeting transposition, a key step in transposon propagation.

This work also raises the possibility that other, complicated molecular machines may be identified that use RNA or DNA guides to recognize specific nucleotide sequences in different functional contexts. Finally, it has not escaped our notice that the transposon-encoded CRISPR-Cas systems described here potentially could be harnessed for genome-engineering applications, namely, precise targeting of synthetic transposons encoding selectable markers and other genes of interest.

Methods

Prokaryotic Genome Database and ORF Annotation.

Archaeal and bacterial complete and draft genome sequences were downloaded from the National Center for Biotechnology Information (NCBI) FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/) in March 2016. For incompletely annotated genomes (coding density less than 0.6 coding DNA segments/kbp), the existing annotation was discarded and replaced with the MetaGeneMark 1 (76) annotation using the standard model MetaGeneMark_v1.mod (heuristic model for genetic code 11 and GC 30). Altogether, the database includes 4,961 completely sequenced and assembled genomes and 43,599 partially sequenced genomes.
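The reannotation criterion can be illustrated with a minimal Python sketch. The function names and the per-genome inputs (a count of annotated coding segments and a genome length in base pairs) are hypothetical and are not taken from the original pipeline; only the 0.6 segments/kbp cutoff comes from the text.

```python
def coding_density(n_cds: int, genome_length_bp: int) -> float:
    """Return the number of annotated coding segments per kilobase pair."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return n_cds / (genome_length_bp / 1000.0)


def needs_reannotation(n_cds: int, genome_length_bp: int,
                       threshold: float = 0.6) -> bool:
    """Flag a genome as incompletely annotated.

    The 0.6 segments/kbp threshold is the one stated in the text;
    everything else here is an illustrative assumption.
    """
    return coding_density(n_cds, genome_length_bp) < threshold
```

Under these assumptions, a 4-Mbp genome with 1,000 annotated coding segments (0.25 segments/kbp) would be flagged for reannotation, whereas the same genome with 4,000 segments (1.0 segments/kbp) would keep its existing annotation.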

Profiles for three protein families, namely Cas7f (cd09737, pfam09615), TnsA (pfam08722, pfam08721), and TnsD(TniQ) (pfam06527), which are available in the NCBI CDD database (77), were used as queries for PSI-BLAST searches (E-value: 10⁻⁴; other parameters were default) to find respective homologs. All ORFs within 10-kb regions up- and downstream of cas7f genes (to cover the potential complete I-F system) and 20-kb regions up- and downstream of tnsD(tniQ) and tnsA (to cover potential Tn7-like elements) were further annotated using RPS-BLAST searches with 30,953 profiles (COG, pfam, cd) from the NCBI CDD database and 217 custom Cas protein profiles (16). The CRISPR-Cas system (sub)type identification for all loci was performed using previously described procedures (16).
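The selection of ORFs in the neighborhood of an anchor gene can be sketched as follows. This is an illustration only: the `Gene` record, the function name, and the choice to include ORFs that only partially overlap the window are assumptions, whereas the 10-kb and 20-kb flank sizes are those given above.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Flank sizes from the text: 10 kb around cas7f, 20 kb around tnsA and tnsD(tniQ).
CAS7F_FLANK_BP = 10_000
TNS_FLANK_BP = 20_000


def neighborhood(anchor: Gene, genes: List[Gene], flank_bp: int) -> List[Gene]:
    """Return all genes overlapping the window that extends flank_bp
    upstream and downstream of the anchor gene (assumption: a partial
    overlap is enough for a gene to be included)."""
    lo = max(1, anchor.start - flank_bp)
    hi = anchor.end + flank_bp
    return [g for g in genes if g.end >= lo and g.start <= hi]
```

For example, with a cas7f gene at positions 50,000 to 51,000, the window spans positions 40,000 to 61,000, and any ORF that overlaps this interval would be passed on to the RPS-BLAST annotation step.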

Protospacer Analysis.

The CRISPRfinder (78) and PILER-CR (79) programs were used with default parameters to identify CRISPR arrays in Cas7f and TnsA/TnsD loci. The MEGABLAST program (80) (word size 18; otherwise default parameters) was used to search for protospacers in the virus subset of the NR (nonredundant) database and the prokaryotic genome database. Matches were considered only if they showed at least 95% identity and at least 95% length coverage in the case of the NR database and at least 80% identity and 80% length coverage for the self-hits (hits were classified as “self” if they matched the same genome or a genome of the same species, disregarding the strain information). Because the automatic approach missed several short CRISPR arrays, loci initially found to lack CRISPR were analyzed manually by examining the intergenic region downstream of the cas6f gene for repeats and using the BLASTN program with the default parameters to find matches to the spacers identified.
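The two sets of match thresholds can be written down as a small filter. The sketch below is illustrative: the `Hit` record and its field names are hypothetical, and length coverage is assumed to mean the aligned length divided by the spacer length, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    spacer_len: int      # length of the query spacer, in nucleotides
    align_len: int       # length of the aligned region, in nucleotides
    identity_pct: float  # percent identity over the aligned region
    is_self: bool        # True if the hit is in the same genome or species


def passes_thresholds(hit: Hit) -> bool:
    """Apply the cutoffs from the text: 95%/95% for hits outside the
    source species, 80%/80% for self-hits."""
    # Assumed definition of coverage: aligned length over spacer length.
    coverage_pct = 100.0 * hit.align_len / hit.spacer_len
    cutoff = 80.0 if hit.is_self else 95.0
    return hit.identity_pct >= cutoff and coverage_pct >= cutoff
```

Under this definition, a 32-nt spacer aligned over 28 nt at 100% identity (87.5% coverage) would be retained as a self-hit but rejected as a match outside the source species.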

Clustering and Phylogenetic Analysis.

To construct a nonredundant, representative sequence set, protein sequences within families of interest were clustered using the NCBI BLASTCLUST program (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html) with the sequence identity threshold of 90% and length coverage threshold of 0.9. Short fragments or disrupted sequences were discarded. Multiple alignments of protein sequences were constructed using MUSCLE (81) or MAFFT (82) programs. Sites with the gap character fraction values >0.5 and homogeneity <0.1 were removed from the alignment. Phylogenetic analysis was performed using the FastTree program (83), with the WAG evolutionary model and the discrete gamma model with 20 rate categories. The same program was used to compute bootstrap values.
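The gap-based part of the alignment filtering can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the function name is hypothetical, gaps are assumed to be encoded as "-", and the homogeneity criterion is omitted because its definition is not given in this section.

```python
from typing import List


def drop_gappy_columns(alignment: List[str],
                       max_gap_fraction: float = 0.5) -> List[str]:
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    Only the gap criterion (>0.5) from the text is implemented; the
    homogeneity filter is deliberately left out of this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    keep = [
        col for col in range(length)
        if sum(row[col] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]
```

Note that a column with a gap fraction of exactly 0.5 is retained, in keeping with the strict inequality stated above.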

Relationships within diverse sequence families were established using the following procedure: Initial sequence clusters were obtained using UCLUST (84) with the sequence similarity threshold of 0.5; sequences were aligned within clusters using MUSCLE (81). Then, cluster-to-cluster similarity scores were obtained using HHsearch (85) (including trivial clusters consisting of a single sequence each), and an unweighted pair-group method with arithmetic mean (UPGMA) dendrogram was constructed from the pairwise similarity scores. Highly similar clusters (pairwise score to self-score ratio >0.1) were aligned to each other using HHALIGN (85), and the procedure was repeated iteratively. At the last step, sequence-based trees were reconstructed from the cluster alignments using the FastTree program (83) as described above and rooted by midpoint; these trees were grafted onto the tips of the profile similarity-based UPGMA dendrogram.
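The UPGMA step of this procedure can be illustrated with a compact pure-Python sketch. The function below is a generic textbook UPGMA, not the implementation used here; it operates on pairwise distances, so the conversion of HHsearch similarity scores into distances is an assumption left to the caller, and the function name and input layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def upgma(labels: List[str], dist: Dict[Tuple[str, str], float]):
    """Build a UPGMA topology as nested tuples.

    dist maps each unordered pair of labels to a distance. How similarity
    scores are turned into distances is not specified here (assumption).
    """
    sizes = {label: 1 for label in labels}  # active cluster -> leaf count
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Join the two closest active clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted mean of the
        # distances from the two merged clusters (arithmetic mean over leaves).
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

In the procedure described above, the leaves of such a dendrogram would be sequence clusters rather than individual sequences, with the FastTree trees grafted onto the tips afterward.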

Analysis of Tn7-Like Elements.

End sequences of Tn7-like elements were determined by identifying the directly repeated 5-bp target site duplication, the terminal 8-bp sequence, and 22-bp TnsB-binding sites as described in the text using Gene Construction Kit 4.0 to manipulate DNA sequences and search for DNA repeats. Sequence files were derived from matches to cas7f, tnsA, and tniQ as described above.
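The check for a directly repeated 5-bp target site duplication can be expressed programmatically, although the analysis reported here was performed interactively in Gene Construction Kit. The following Python sketch is therefore only an illustration; the function name and the 0-based, half-open coordinate convention are assumptions.

```python
def has_target_site_duplication(seq: str, start: int, end: int,
                                tsd_len: int = 5) -> bool:
    """Test whether the tsd_len bases immediately left of an element are
    directly repeated immediately right of it.

    start and end are 0-based, half-open coordinates of the candidate
    element within seq (an assumed convention for this sketch).
    """
    if start < tsd_len or end + tsd_len > len(seq):
        return False  # not enough flanking sequence to compare
    left = seq[start - tsd_len:start].upper()
    right = seq[end:end + tsd_len].upper()
    return left == right
```

Scanning candidate boundary pairs with such a test, and then confirming the terminal 8-bp sequence and the 22-bp TnsB-binding sites, would mirror the order of the criteria listed above.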

Acknowledgments

J.E.P. was supported by US Department of Agriculture National Institute of Food and Agriculture Hatch Project NYC-189438. K.S.M., S.S., and E.V.K. are supported by the intramural program of the US Department of Health and Human Services (to the National Library of Medicine).

Footnotes

¹To whom correspondence may be addressed. Email: joe.peters@cornell.edu or koonin@ncbi.nlm.nih.gov.

Author contributions: J.E.P., K.S.M., and E.V.K. designed research; J.E.P., K.S.M., and S.S. performed research; J.E.P., K.S.M., and E.V.K. analyzed data; and J.E.P., K.S.M., and E.V.K. wrote the paper.

Reviewers: N.L.C., Johns Hopkins University School of Medicine; J.F., CNRS; and B.W., Montana State University.
