Abstract

A high-throughput genetic screening platform in a single-celled photosynthetic eukaryote would be a transformative addition to the plant biology toolbox. Here, we present ChlaMmeSeq (Chlamydomonas MmeI-based insertion site Sequencing), a tool for simultaneous mapping of tens of thousands of mutagenic insertion sites in the eukaryotic unicellular green alga Chlamydomonas reinhardtii. We first validated ChlaMmeSeq by in-depth characterization of individual insertion sites. We then applied ChlaMmeSeq to a mutant pool and mapped 11,478 insertions, covering 39% of annotated protein coding genes. We observe that insertions are distributed in a manner largely indistinguishable from random, indicating that mutants in nearly all genes can be obtained efficiently. The data reveal that sequence-specific endonucleolytic activities cleave the transforming DNA and allow us to propose a simple model to explain the origin of the poorly understood exogenous sequences that sometimes surround insertion sites. ChlaMmeSeq is quantitatively reproducible, enabling its use for pooled enrichment screens and for the generation of indexed mutant libraries. Additionally, ChlaMmeSeq allows genotyping of hits from Chlamydomonas screens on an unprecedented scale, opening the door to comprehensive identification of genes with roles in photosynthesis, algal lipid metabolism, the algal carbon-concentrating mechanism, phototaxis, the biogenesis and function of cilia, and other processes for which C. reinhardtii is a leading model system.

INTRODUCTION

The ability to phenotype genome-wide collections of single-celled mutants has revolutionized our understanding of cellular processes in bacteria, yeast, and animal cells (Winzeler et al., 1999; Ozawa et al., 2005; van Opijnen et al., 2009; Cipriani and Piano, 2011). An analogous platform in a photosynthetic eukaryote would open doors to rapid identification of all the genes required for photosynthesis and any other phenotype of interest and would allow grouping of genes into pathways using chemical genomics (Hillenmeyer et al., 2008). As a first step toward these goals, we present ChlaMmeSeq (Chlamydomonas MmeI-based insertion site Sequencing), a tool that enables high-throughput genotyping in the single-celled green alga Chlamydomonas reinhardtii.

C. reinhardtii has immense potential as a functional genomics platform for studying photosynthesis and other processes: (1) The similarity of its photosynthetic apparatus to that of land plants has enabled the discovery and characterization of key components of photosynthesis (Schmidt et al., 1977; Shepherd et al., 1979; Niyogi et al., 1997; Fleischmann et al., 1999). (2) Mutants deficient in photosynthesis can be maintained in the dark on acetate, unlike in most photosynthetic organisms. (3) It has a carbon-concentrating mechanism, which facilitates CO2 fixation (Wang et al., 2011) and is of interest as a potential source of genes to enhance photosynthesis in C3 plants. (4) It accumulates lipids under stress conditions, which makes it an excellent model for studying pathways in algal lipid metabolism (Wang et al., 2009; Merchant et al., 2012), which is of interest in biofuel research. (5) Its eyespot and two flagella make it a great model to study phototaxis (Pazour et al., 1995). (6) Vegetative cells are haploid, so mutant phenotypes are visible immediately. (7) It has a short doubling time (6 to 8 h) and sexual life cycle (2 weeks). (8) Its nuclear and organellar genomes are transformable (Boynton et al., 1988; Kindle et al., 1989; Randolph-Anderson et al., 1993) and have been sequenced (GenBank accession number U03843; Maul et al., 2002; Merchant et al., 2007).

Here, we present ChlaMmeSeq, a robust strategy for simultaneous genotyping of tens of thousands of C. reinhardtii insertional mutants. We validated the approach by in-depth analysis of 15 individual mutants. We then applied ChlaMmeSeq to a pool of mutants and identified 11,478 distinct insertions, covering 39% of the 17,737 protein-coding genes annotated in the v5.3 C. reinhardtii nuclear genome (Merchant et al., 2007; Goodstein et al., 2012). The data reveal that insertion sites are distributed in the genome in a manner largely indistinguishable from random, which allows us to predict genome coverage for larger collections of mutants. Analysis of flanking sequences from tandem cassette insertions reveals that an endonucleolytic activity acts on the transforming DNA, leading us to propose an improved model for events during transformation. The abundance of mutants in a pool measured by ChlaMmeSeq is quantitatively reproducible, opening the door to genome-wide biological enrichment screens (Carette et al., 2011) and the generation of indexed mutant collections (Goodman et al., 2009) in green algae.

To map insertion sites, we developed ChlaMmeSeq, a strategy for extracting genomic flanking sequences in parallel from tens of thousands of mutants in a pooled sample (Figure 1). ChlaMmeSeq builds upon technologies that were previously demonstrated in bacteria (Goodman et al., 2009; van Opijnen et al., 2009), with modifications to overcome challenges due to the larger genome and different insertion mechanism (see Methods for details). Genomic DNA from mutants is digested with the Type IIS restriction enzymes MmeI and BsgI, which yield fragments containing the ends of the cassettes and 20 to 21 bp of flanking genomic DNA. An adaptor is ligated to the digested genomic DNA, and the genomic DNA flanking the cassettes is amplified by PCR and sequenced. The sequences are then used to map the genomic locations of the cassettes in mutants.

(F) The 20- to 21-bp flanking DNA sequences are mapped to the C. reinhardtii genome to identify the site of each insertion.

We Validated ChlaMmeSeq by Characterizing Insertion Sites in Individual Mutants

To evaluate the mapping accuracy of the tool, we analyzed 15 randomly picked and individually isolated mutants. DNA gel blot analyses with the AphVIII probe and two different restriction enzymes (Supplemental Figure 3) indicated that 14/15 mutants had one AphVIII copy per genome, and one mutant had two AphVIII copies.

We applied ChlaMmeSeq to extract flanking sequences from the 5′ and 3′ sides of the cassette in each of these 15 mutants. Eight mutants yielded 5′ side flanking sequences, and nine mutants yielded 3′ side flanking sequences (Figure 2A). These flanking sequences were aligned to the published v5.3 C. reinhardtii genome (Merchant et al., 2007; Goodstein et al., 2012). Of these 17 flanking sequences, 10 mapped to unique sites in the genome, two mapped to multiple sites in the genome, four mapped to the cassette sequence, and one could not be aligned (Figure 2B). Guided by the 10 flanking sequences that were mapped to unique sites in the genome, we characterized the structure of insertion sites (Figure 2C; Supplemental Figure 4).

(A) Flanking sequences were PCR amplified from 5′ and 3′ sides in 15 mutants. The wild type was included as a negative control; plasmid pMJ013b (containing the transforming cassette) served as a positive control.

(B) The mapping location of each extracted flanking sequence is given: a chromosome number if the flanking sequence was mapped uniquely to the genome, “multiple” if it was mapped to multiple genome locations, “cassette” if it was mapped to the transforming cassette, and “unaligned” if it had no alignment. A dash indicates no extractable flanking sequences. Chromosome locations confirmed by PCR are highlighted in green; those determined to inaccurately represent the insertion sites are in gray.

(C) The gene models for each mutant (bottom) and its corresponding wild type (top) are shown for the six insertion sites confirmed by PCR. Supplemental Figure 4 provides details and supporting evidence.

We verified that 7/10 uniquely aligned genomic flanking sequences correctly identified the genomic insertion sites by PCR from the cassette into the flanking genome ∼1 kb on either side (Figure 2; Supplemental Figure 4). The remaining three flanking sequences did not correctly identify the insertion loci: The expected genome-cassette junctions could not be amplified; instead, the corresponding wild-type loci could be amplified in the mutants (Supplemental Figures 4C, 4G, and 4H).

Nonextractable or misleading flanking sequences are likely the result of commonly observed events at insertion sites in C. reinhardtii. Others have reported that the ends of transformation cassettes are frequently removed during insertion into the genome (Dent et al., 2005; González-Ballester et al., 2011). Consistent with this, we observed that 2/6 characterized insertion sites carried truncated cassettes (Figure 2C). Cassette truncation removes the MmeI binding site and makes flanking sequence extraction impossible. Others have also observed fragments of genomic DNA from distant loci inserted next to mutagenic cassettes (Meslet-Cladière and Vallon, 2012). We observed such fragments in 3/6 characterized insertion sites. Only one of these fragments caused a misleading mapping, as the other two did not yield uniquely mapping flanking sequences.

Four out of the six mutants characterized had no genomic deletions or rearrangements at the site of insertion (Figure 2C). One mutant had a 21-bp deletion (Supplemental Figure 4E). Additionally, we observed an apparent inversion of ∼6 kb of genomic DNA flanking the 5′ side of the cassette in one mutant (Supplemental Figure 4F).

In summary, we expect that ∼70% of the flanking sequences obtained from this collection of mutants correctly indicate the insertion sites.

We Simultaneously Mapped 11,478 Insertion Sites in a Pool of Mutants

We pooled ∼40,000 mutants and extracted their 5′ flanking sequences using ChlaMmeSeq. Sequencing of the resulting sample on an Illumina Genome Analyzer IIx yielded 47 million reads total, of which 37 million contained the expected adapter and cassette sequences (the remaining 10 million reads were most likely due to nonspecific PCR amplification or sequencing errors). Similarly to our analysis of individual mutants, 9% of reads could not be mapped, 24% aligned to the cassette, 17% aligned to multiple genomic positions, and 50% aligned uniquely to a single genomic location (the ratio of multiple to unique genomic alignments is consistent with the ratio of nonunique to unique 20-bp sequences in the genome). Of these uniquely aligned reads, 96% aligned to the nuclear genome, while 3.9% aligned to the chloroplast and 0.44% to the mitochondrial genome. We have not further investigated insertions in the latter two groups because of their rarity. We identified 11,478 insertion sites distributed throughout the nuclear genome.

The Global Distribution of Insertion Positions Is Largely Indistinguishable from Random

The unprecedented scale of our insertion site data allowed us to evaluate the distribution of insertion sites in the genomes of C. reinhardtii mutants generated by electroporation-based transformation. We looked for regions with densities of observed insertion sites higher (hot spots) or lower (cold spots) than would be expected if the insertion sites were randomly distributed over the genome.

We compared our observed distribution of insertion sites to simulations of random insertion of cassettes into the nuclear genome. Our simulations only used insertions with uniquely mappable flanking sequences (i.e., ones that only align to a single site in the genome). This requirement made the simulation results comparable to the real insertion data, which were generated using only uniquely mappable flanking sequences. By visual inspection, it appears that most extended regions with few insertions were replicated in the simulated data, indicating that they correspond to regions of the genome with low density of uniquely mappable insertion sites, rather than being genuine cold spots (Figure 3A). To identify statistically significant insertion cold spots and hot spots, we scanned the genome with moving windows of size ranging from 1 kb to 1 Mb and for each region compared the number of observed insertion sites to the number of uniquely mappable positions. This analysis yielded only one potential cold spot and 15 potential hot spots with P values < 0.05 after adjustment for multiple testing (Figure 3A; Supplemental Data Set 4). Overall, ∼2% of all insertions are in hot spots; cold spots cover <2% of the genome. We conclude that the distribution of insertion locations in the genome is largely indistinguishable from random, on the scale observed in this study.

The Genomic Distribution of Insertions Is Largely Random, and Many Genes of Interest Are Represented.

(A) For each chromosome, four columns are shown: the first, wider green column depicts the observed insertion density; the next three columns show insertion densities for three simulated data sets. The blue and red marks on the left of each chromosome indicate statistically significant hot spot and cold spot locations.

(B) The dotted rectangles contain zoomed-in portions of two significant hot spots in (A). The dashed rectangles show additional data for the same two genomic regions: the density of all possible uniquely mappable positions based on the reference strain genome sequence, and the density of observed uniquely mappable 20- to 21-bp sequences from whole-genome sequencing of our background strain aligned to the reference strain sequence (Supplemental Figure 5 contains the same plot for the full genome).

(C) Insertion density differs between gene features and intergenic regions. Density is normalized to the density over the entire genome (dotted line) and is based on uniquely mappable positions only. All categories are different from the overall mutant density (P values < 0.003, exact binomial test); all category pairs except intergenic versus 3′ UTR are different from each other (P values < 0.02, χ2 test of independence). Genes with multiple splice variants are ignored when looking at gene features.

(D) The fraction of genes with one allele or two or more alleles in our data set is shown for each of several data sets of interest. For some of the data sets, the Joint Genome Initiative protein IDs for our insertions had to be determined. The data could not be obtained for some of the genes, which are omitted from the figure (see Supplemental Methods for details.)

(E) The fraction of genes with 1+, 2+, and 3+ independent mutant alleles is shown as a function of the number of mapped insertions. The observed data (with 100 randomly chosen subsets for lower insertion numbers) and data from 10 simulations are plotted.

To investigate whether any of the potential hot spots were due to amplifications of genomic regions in our background strain in comparison to the reference genome, we sheared the genome of our background strain and sequenced the resulting fragments using Illumina sequencing. We mapped ∼38 million 21-bp reads to the reference genome, using the same method as for mapping insertion flanking regions (Supplemental Figure 5). Fewer than 1% of 20-kb regions had >2× the median read density normalized to the number of uniquely mappable positions, suggesting that the technique yielded even coverage of the genome. Strikingly, we observed that 4 of the 15 potential insertion hot spots, including the most prominent one on chromosome 3, correspond precisely to regions of high read counts in the background genome sequencing data (Figure 3B; Supplemental Figure 5). This indicates that those four apparent hot spots are artifacts due either to local amplification of the genome sequence in our background strain (in comparison to the reference genome) or to possible inaccuracies in the reference genome assembly.

On a finer scale, we found that the density of insertions is higher in intergenic regions, introns, and 3′ untranslated regions (UTRs) and lower in genes, 5′ UTRs, and exons (Figure 3C). This could be due to an increased likelihood of lethality if the cassette inserts into the latter elements. Gene essentiality appears to decrease the likelihood of recovering a mutant: The 711 best BLAST hits of yeast essential genes had fewer insertions per mappable length than remaining genes (P < 10−5, χ2 test of independence; Supplemental Methods). We did not detect a significant difference between insertions based on position in gene (Supplemental Figure 6) or between sense and antisense insertions into genes. As expected, on average there are more insertions into longer genes (Supplemental Figure 7).

Many Genes of Interest Are Represented

Of the 11,478 insertions mapped uniquely to the nuclear genome, 10,391 (90%) are in genes. Of the insertions in genes, 3032 (29%) are in exons, 3826 (37%) in introns, 446 (4.3%) in 5′ UTRs, 1939 (19%) in 3′ UTRs, 31 (0.3%) at feature boundaries, 1104 (11%) in genes with multiple splice variants, and 13 (0.12%) in multiple features due to overlapping genes. A total of 6955 genes were represented by at least one insertion (39% of the 17,737 protein-coding genes annotated in the v5.3 C. reinhardtii nuclear genome; Goodstein et al., 2012). The insertion sites included 3735 (45%) genes highly conserved between C. reinhardtii and Arabidopsis thaliana (Merchant et al., 2007), 215 (37%) of the most highly conserved plant genes as defined in the GreenCut2 (Karpowicz et al., 2011), and 370 (35%) genes encoding components of cilia identified by mass spectrometry (Pazour et al., 2005) (Figure 3D).

∼100,000 Correctly Mapped Mutants Will Be Needed to Cover 80% of the Genome with Two Alleles

The agreement of the observed data with our random insertion model allowed us to use the model to predict the number of C. reinhardtii mutants needed to achieve any desired coverage of the genome with 1+, 2+, or 3+ mutant alleles (Figure 3E). The model predicts that in a collection of ∼100,000 correctly mapped mutants, ∼90% of genes will be represented by 1+ allele, ∼80% by 2+ alleles, and ∼70% by 3+ alleles. In a screen, multiple mutant alleles in a gene are an advantage because they increase the confidence in a genotype-phenotype link.

The Flanking Sequence Abundances Are Quantitatively Reproducible

To evaluate the quantitative reproducibility of ChlaMmeSeq, we compared the flanking sequence abundances obtained from two parallel applications of the protocol on the same pool of mutants (Figure 4). Ninety-nine percent of flanking sequences with ≥200 reads in one replicate were also observed in the other replicate. The abundance of 92% of those flanking sequences was reproduced within 3× in the other replicate. This suggests that ChlaMmeSeq could be used to quantitatively track changes in abundances of mutants in a pool or for quantitative enrichment screens of mutants of interest.

Reproducibility of mutant abundances between technical replicates is shown. Each dot is a mutant; the two axes are read abundance in the two replicates. The dots are 50% transparent.

Site-Specific Endonucleases Cleave the Cassette during Transformation

In previous reports (Dent et al., 2005; Aksoy et al., 2013) and in our small-scale mutant analyses (Figure 2B), fragments of the transformation cassette were found adjacent to intact cassettes in some of the mutants; however, the origin of these fragments has remained unknown. We reasoned that analysis of the large number of flanking sequences mapping to the cassette in our high-throughput data could yield insights into the mechanism of cassette fragmentation. Such insights could enable strategies to reduce fragmentation, which could increase the fraction of insertions that yield genomic flanking sequences with ChlaMmeSeq and more generally increase transformation efficiency in C. reinhardtii.

The transforming DNA appeared as a sharp band when evaluated by Bioanalyzer (Supplemental Figure 8), suggesting that any cassette fragments are likely generated during transformation.

We sought to determine whether some cassette positions were subject to fragmentation more often than others. We used ChlaMmeSeq on a different mutant pool with more uniform mutant abundance, to detect cassette fragments that were present in multiple mutants. A total of 105 flanking sequences mapping to the cassette were found to be more abundant than the most abundant flanking sequence mapping to the genome (Figures 5A and 5B; Supplemental Figure 9). This strongly suggests that each of these 105 flanking sequences originated from more than one mutant. Flanking sequences mapping to the cassette 5′ and 3′ termini were by far the most abundant (16 and 11% of all cassette-mapped flanking sequences, respectively).

Site-Specific Endonucleolytic Activities Act on the Cassette during Transformation.

(A) In our model, the cassette is fragmented during transformation. A resulting fragment can be ligated to an intact cassette and inserted into the genome along with it; the end of the fragment will then be sequenced as a flanking region. We categorize cassette fragments as “upstream” (from the 5′ side of the fragmentation site) and “downstream” (from the 3′ side). Each type of fragment can be ligated to either a 5′ or a 3′ end of an intact cassette; we do not distinguish between those cases.

(B) For each position along cassette length, the number of reads mapped to that position is plotted for upstream and downstream fragments. For comparison, we show the median and maximum read count for genomic insertions from the same mutant pool (Supplemental Figure 9 contains a full comparison of cassette and genomic read count distributions).

(C) Each dot represents one cassette position; the two axes show the read counts of upstream and downstream fragments mapping to that position. Positions flanked by the five most enriched 4-bp motifs are black; all other positions are gray.

(D) Normalized binned histograms of the sum of upstream and downstream read counts are shown for all cassette positions and for positions flanked by the five most enriched fragmentation site motifs. The total number of positions matching each motif is given in parentheses.

A single cassette fragmentation event can yield two fragments, each containing one side of the fragmentation site; we refer to them as the upstream and downstream fragments (Figure 5A). We observed a correlation between the abundance of flanking sequences originating from the upstream and downstream cassette fragments for a given fragmentation position (Figure 5C). This result suggests that both upstream and downstream fragments from frequent fragmentation sites are present at similar abundance after fragmentation. This finding indicates that the frequent fragmentation sites are not due to exonuclease activity, which one would expect to produce an uneven ratio of upstream and downstream fragments.

We observed an enrichment of specific sequence motifs flanking the fragmentation site. Fragmentation positions flanked by the motifs CA|TA and CA|TG and several one-base variants yielded significantly more flanking sequence reads than other positions (Figures 5C and 5D; Supplemental Data Set 5). The five most enriched motifs (false discovery rate–adjusted P values < 0.00001, Kruskal-Wallis test) are responsible for 15% of the reads, and all 26 statistically significantly enriched motifs (P values < 0.05) are responsible for 61% of the reads. The two independent data sets show similar cassette cleavage motifs, underscoring the reproducibility of our findings (Supplemental Figures 10A and 10B). The fact that the motifs span the fragmentation site suggests again that most fragmentation sites are not due to exonuclease activity. Instead, this result is consistent with a model in which the cassette is cleaved by site-specific endonucleolytic activities during transformation, prior to ligation into the genome.

It is worth noting that the flanking sequences mapping to the genome also show enrichment for a CA|TG insertion site motif (Supplemental Figure 10C). This could be due to action of the same endonucleolytic activities on genomic DNA from lysed cells present in the culture medium before or during electroporation. This genomic DNA would be electroporated into cells along with the transformation cassette and could yield some of the observed fragments of exogenous genomic DNA inserted next to some cassettes (Figures 2C and 6).

(B) During transformation, some of the cassette and genomic DNA molecules are cleaved by site-specific endonucleases.

(C) After electroporation, DNA from the extracellular medium (including intact cassettes, fragmented cassettes, and fragmented genomic DNA) is ligated into a double-stranded break in the genome. In some cases, multiple fragments are ligated into a single site in the genome. When an intact cassette end is present, this cassette end can yield flanking sequences containing the neighboring DNA. In some cases, this flanking DNA will contain cassette DNA or extracellular genomic DNA that was inserted following endonuclease cleavage; in those cases, the end of the DNA neighboring the cassette would be enriched for the endonuclease's sequence preference.

We Propose a Model for Events during Electroporation of C. reinhardtii

Our observation of sequence-specific fragmentation of the transformation cassette is consistent with the following model for events during transformation of C. reinhardtii by electroporation (Figure 6): (1) During transformation, the cassette and genomic DNA from lysed cells are partially digested by sequence-specific endonucleases, before or during entry of DNA into recipient cells. (2) The DNA that entered cells is ligated into double-stranded breaks in the recipient genomes. Sometimes, multiple DNA fragments are ligated together into one insertion site.

This model explains the following observations: (1) the presence of flanking regions mapping to cassette sequence and the pattern of cassette fragmentation that we observed from them; (2) insertions of DNA from other genomic loci at cassette insertion sites observed by us and others (González-Ballester et al., 2011); (3) the fact that our mutants with misleading flanking sequences had an intact copy of the genomic region corresponding to these flanking sequences; (4) improvement of transformation efficiency by adding carrier DNA (Shimogawara et al., 1998), since carrier DNA could reduce inactivation of the transforming cassette by endonucleolytic digestion (we do not recommend using carrier DNA, as it could cause additional insertions).

This model suggests possible avenues for improving transformation in C. reinhardtii: (1) Removal of endonuclease recognition sequences from the transforming DNA may reduce susceptibility to the endonucleases. (2) Smaller cassettes may be less susceptible to endonuclease activities. (3) Means of inactivating the endonucleases, e.g., mutation or inhibitors, would improve transformation efficiency. (4) Improved washing of cells before electroporation to remove DNA from lysed cells may reduce insertion of exogenous genomic DNA next to the cassette.

DISCUSSION

We demonstrated that ChlaMmeSeq enables mapping of insertion sites for tens of thousands of C. reinhardtii mutants. Insertions are randomly distributed, with remarkably few hot spots and cold spots. The high-throughput nature of our approach facilitates identification of multiple independent alleles of a gene, which alleviates commonly encountered challenges caused by incorrect flanking sequences and second-site mutations.

The reproducible quantification of flanking sequence abundances could enable the use of ChlaMmeSeq for simultaneous measurements of fitness of thousands of mutants in pooled culture under conditions of interest (van Opijnen et al., 2009) or quantitative enrichment screens (Carette et al., 2011). Our data suggest that this approach could detect a 15% growth defect after seven generations with P < 0.08.

We are presently using ChlaMmeSeq to generate an indexed genome-wide collection of C. reinhardtii mutants with known insertion sites, by means of combinatorial pooling (Goodman et al., 2009). Based on the analysis presented here, we expect that 100,000 correctly mapped mutants will cover 80% of the C. reinhardtii genes with at least two independent mutants per gene. Such a permanent collection could transform plant biology by increasing the phenotyping throughput of mutants in nearly all genes in C. reinhardtii.

More immediately, ChlaMmeSeq enables genotyping of pools of hits from C. reinhardtii screens on an unprecedented scale. This opens the door to comprehensive identification of genes with roles in regulation and biogenesis of the photosynthetic apparatus, algal lipid metabolism, the algal carbon concentrating mechanism, phototaxis, and other processes for which C. reinhardtii is a leading model system. Now that mapping flanking sequences is no longer a rate-limiting step, saturating identification of these important cellular components is within reach.

METHODS

Growth of C. reinhardtii Strains

Chlamydomonas reinhardtii strains were grown in Tris acetate phosphate (TAP) medium with modified trace element solution (Kropat et al., 2011) (pH 7.5). For experiments with Tris phosphate medium, the acetate was omitted and the media was buffered to pH 7.5 using HCl stock. For growth as colonies on plates, the medium was supplemented with 1.5% (w/v) agar. Unless otherwise indicated, cool white fluorescent lights (F34CW/RS/WM/ECO; Ecolux) were used for illumination.

Isolation of Background Strain CMJ030

C. reinhardtii strains D66+ and 4A– were crossed (Jiang and Stern, 2009). Progeny were screened for photoautotrophic and heterotrophic growth, greening in the dark, normal swimming ability, high transformation efficiency (Shimogawara et al., 1998), and efficient recovery from cryogenic storage in liquid nitrogen. One progeny strain that showed all of these properties was isolated and named CMJ030. The cell wall phenotype of CMJ030 was evaluated by treatment with 1% (w/v) Nonidet P-40 and 1% (w/v) SDS (separately), and in both cases the cells swelled and exploded, leaving behind no visible sacks, suggesting that the cells are at least cw15 wall-less. The mating type of CMJ030 was determined by PCR (Werner and Mergenhagen, 1998). CMJ030 appeared to mate efficiently.

Large-Scale Transformation

The transformation cassette was prepared by digestion of plasmid pMJ013b (GenBank accession number KJ572788) with MlyI (New England Biolabs) at 37°C for 1 h. Digestion products were separated on a 1% (w/v) agarose gel, and the 2660-bp digested DNA fragment was extracted using a QIAquick gel purification kit (Qiagen), according to the manufacturer’s instructions. The purified fragment was analyzed with an Agilent Bioanalyzer (Supplemental Figure 8) and used for transformation.

CMJ030 was grown in TAP medium in a 20-liter container under 100 μmol photons m–2 s–1 cool white fluorescent light, with continuous stirring and bubbled air, until it reached a cell density of 1.5 × 106 cells/mL. Cells were collected as follows. Bubbling was stopped and the 20-liter container was transferred into a 32-gallon garbage bin and illuminated from the top by four cool white fluorescent bulbs for 2 h. This caused the cells to settle to the bottom of the 20-liter container. The top 15 liters were removed by aspiration, and the lower 5 liters were centrifuged in RC5C centrifuges (Sorvall Instruments) with GS3 rotors for 4 min at 1000g. Pellets were resuspended in TAP supplemented with 40 mM sucrose at 2 × 108 cells/mL. Transformation was performed by electroporation according to Shimogawara et al. (1998) with some modifications. Transforming DNA (144 μL) at 52 ng/μL was added to a sterile 50-mL Falcon tube with 50 mL of concentrated cells (37.5 ng DNA per 250 μL concentrated cells) in 40 mM sucrose. The concentrated cells were incubated with transforming DNA at 16°C for at least 20 min before electroporation. The cell/DNA mix was then aliquoted into sterile electroporation cuvettes (4-mm gap, 1.5-mL Micro Cuvette, two Clear Sides, E&K Scientific) at 250 μL/cuvette. Cells were electroporated (Bio-Rad; Gene Pulser2 electroporation system) with pulse settings of 800 V and 25 μF, followed by immediate decanting into a 15-mL Falcon tube containing 13 mL of TAP supplemented with 40 mM sucrose. The 15-mL Falcon tubes were shaken gently under low light (5 μmol photons m–2 s–1) for 6 h. Cells were then collected by centrifugation at 1000g for 4 min, most of the supernatant was decanted, and the cells were resuspended in the remaining 500 μL of supernatant. Resuspended cells were gently plated onto 2% (w/v) TAP agar plates containing 20 μg/mL paromomycin. These plates were stored at 5 μmol photons m–2 s–1 light for 2 weeks, until transformant colonies appeared.

Flanking Sequence Extraction from Pooled Mutants

Our protocol for flanking sequence extraction from pooled C. reinhardtii mutants was built upon technologies that were previously demonstrated in bacteria (Goodman et al., 2009; van Opijnen et al., 2009) with modifications to overcome the following challenges: (1) The bacterial genomes (6.3 and 2 Mb, respectively) are smaller than the 121 Mb C. reinhardtii genome; and (2) both previous methods used in vitro transposon mutagenesis of genomic DNA, followed by homologous recombination of the mutagenized DNA into the recipient genomes, whereas our C. reinhardtii mutants were generated by random insertion of linear transforming DNA (likely by nonhomologous end joining).

Our protocol is most similar to that of Goodman et al. (2009) with the following major changes: (1) We used phenol/chloroform to extract DNA, whereas they used DNeasy columns. (2) We performed digestions with both MmeI and BsgI to generate the same size fragments from both full and truncated cassettes, whereas they only did MmeI digestion. (3) Our PCR protocol was optimized for GC-rich DNA templates of C. reinhardtii. (4) Placement of the MmeI sequence at the very ends of the cassette allowed us to extract 20/21 bp of flanking sequences to map insertion sites, whereas their sites were recessed and yielded only 16/17 bp (which is adequate for small genomes but insufficient for the C. reinhardtii genome). (5) Each step of our protocol was optimized several times to increase the quantitative character of the tool. Below is the detailed protocol.

Transformants were scraped from 80 transformation plates when colonies were ∼1 mm in diameter, pooled together, and grown in TAP in the dark for 1 week in a 1-liter photobioreactor from Photon Systems Instruments at a constant cell density of 1 × 106 cells/mL, with constant bubbling with air. Samples of 50 mL were harvested by centrifugation at 3000g for 10 min. The pellet was used for extraction of genomic DNA by phenol:chloroform:isoamyl alcohol (Phenol:CIA, 25:24:1; Sigma-Aldrich).

Genomic DNA was digested as follows. The 500-μL reactions were assembled with 3.5 μg DNA, 50 μL 10× NEB4 buffer, 1.25 μL 32 mM S-adenosyl methionine, 5 μL 10 mg/mL BSA, 40 μL 2 units/μL MmeI, and 1.25 μL 5 units/μL BsgI (NEB). Reactions were incubated at 37°C for 1 h. Digestion products were phenol/chloroform-extracted, ethanol-precipitated, and dissolved in water. Double digestion of genomic DNA by MmeI and BsgI yielded 1100-bp DNA fragments containing either the 5′ or 3′ end of the cassette, and 20 to 21 bp of flanking genomic DNA. The digestion products were run on a 1% (w/v) agarose gel in an Owl D3 tray (BioExpress) at 100 V for 4 h, and DNA fragments in the range of 1 to 1.2 kb were cut out and gel extracted by D-Tube Dialyzer Maxi (MWCO 3.5 kD; EMD Biosciences). DNA was precipitated at –20°C for at least 30 min with 9 μL 20 mg/mL glycogen, 210 μL 3 M NaOAC at pH 5.2 (0.1 volume), and 2100 μL isopropanol, and then centrifuged at 4°C at 16,000g for 20 min, followed by a wash with 1 mL 70% (v/v) ethanol, and then by a wash with 1 mL of 100% ethanol. The DNA pellet was dissolved in water and quantified by Qubit.

Adaptors (50 μM) were prepared by mixing equal volumes of oMJ082 and oMJ083 at 100 μM each in water, placing the mixture in a heat block (E&K Scientific D-1200 AccuBlock Digital Dry Bath) at 96°C for 2 min, then placing the metal insert of the heat block containing the samples on the bench at room temperature and letting it cool for 1 h.

Ligations were performed as follows. Thirty-microliter ligation reactions contained 16 μL DNA template (7 ng/μL) from the previous step, 3 μL T4 DNA ligase buffer, 10 μL 50 μM adaptors, and 1 μL 2000 units/μL T4 DNA ligase. The reactions were incubated at 16°C overnight. The high concentration of adaptors increased ligation efficiency, but also interfered with PCR. As a result, a second gel extraction using a D-Tube Dialyzer Maxi was performed to remove extra adaptors.

PCR was performed as follows. Ninety percent of the DNA from the previous step was used as template for each sample. The DNA template was diluted to 0.1 to 0.3 ng/μL and heated at 65°C for 20 min before PCR. We found that this step helped reduce nonspecific amplification. For each sample, 48 PCR reactions of 50 μL each were assembled. Each reaction contained 10 μL 5× Phusion GC buffer, 1 μL 10 mM deoxynucleotide triphosphates, 0.5 μL 100% DMSO, 0.2 μL 50 mM MgCl2, 0.25 μL of each primer (oMJ079, oMJ081, and oMJ191) at 100 μM, 10 μL DNA template at 0.1 to 0.3 ng/μL, 25.55 μL water, and 2 μL 2 units/μL Phusion Hot Start II High-Fidelity DNA Polymerase (New England Biolabs). Cycling parameters were as follows: 3 min at 98°C, 10 cycles of 10 s at 98°C, 25 s at 61°C, 15 s at 72°C; then 14 cycles of 10 s at 98°C, 40 s at 72°C; then a final extension of 2 min at 72°C. Primer oMJ079 binds to the adaptor, oMJ081 binds the 5′ end of the cassette to amplify the 5′ side flanking sequence, and oMJ191 binds to the 3′ end of the cassette to amplify the 3′ side flanking sequence. The products were run on a 1.8% (w/v) agarose gel at 100 V for 2 h. PCR products of the expected size (190 bp for flanking sequences from both sides) were gel extracted with QIAquick and submitted for deep sequencing by Illumina Genome Analyzer IIx at the Stanford Sequencing Service Center.

Data in Figure 5 are from a second collection of mutants generated by a similar protocol, with the following differences, which were applied to reduce the variation in mutant abundance in the pool. Transformants were picked individually from transformation plates using a CP7200 colony picking robot (Norgren Systems), arrayed at 384 colonies/plate on fresh TAP plates containing 20 μg/mL paromomycin, and grown up under 5 μmol photons m–2 s–1 light for 3 weeks. The plates were then propagated by Singer Rotor HDA Robot to fresh TAP plates containing 20 μg/mL paromomycin. After growth under 5 μmol photons m–2 s–1 light for 1 week, most mutants on these 384 plates had approximately similar colony sizes. Mutants were scraped from plates and pooled together without further growth, and flanking sequences were extracted using the protocol above for pooled mutants.

PCR-Based Characterization of the Insertion Sites Indicated by the Flanking Genomic DNA

PCRs were attempted only for the insertions that yielded flanking sequences that were mapped uniquely (10 out 17 total flanking sequences from 15 individual mutants). Primers were designed to amplify the cassette-genome junction on both sides of each insertion site indicated by the 20- to 21-bp flanking sequences, based on the genome sequence v5.3 from Phytozome (Goodstein et al., 2012). Primer binding sites were 1 to 1.3 kb away from the flanking sequences. PCR products were amplified using the Taq PCR core kit (Qiagen). The 25-μL PCR reactions included: 5 μL 5× Q-solution, 2.5 μL 10× PCR buffer, 1.25 μL 100% DMSO, 0.5 μL 10 μM deoxynucleotide triphosphates, 12.65 μL water, 0.1 μL Taq DNA polymerase, 1.25 μL of each primer at 10 μM, and 0.5 μL 50 ng/μL C. reinhardtii genomic DNA. PCR cycling parameters were: 5 min at 95°C, 40 cycles of 30 s at 95°C, 45 s at 58°C, 2 min at 72°C, followed by a final extension of 10 min at 72°C. For oligos and template used for each check PCR, see Supplemental Data Set 2. PCR products of the expected size were gel extracted and submitted for Sanger sequencing by ELIM Biopharmaceuticals. The resulting sequences were aligned against the v5.3 genome sequence from Phytozome.

Code Availability

All custom software written for this project is available at github.com/Jonikas-Lab/Zhang-Patena-2014.git and as Supplemental Data Set 6.

Inferring Insertion Details from Pooled Mutant Deep-Sequencing Data

The workflow is summarized in Supplemental Figure 11. Adaptor and cassette sequences were removed from reads to obtain the flanking sequences; reads without the expected adaptor or cassette sequences were discarded. Flanking sequences were aligned to the C. reinhardtii nuclear, chloroplast, and mitochondrial genomes and to the insertion cassette sequence using bowtie (Langmead et al., 2009), allowing at most one mismatch. Flanking sequences with no alignments or with multiple genomic alignments were discarded. Flanking sequences aligned to the same position in the same orientation were combined into single insertion positions and annotated with gene and feature information based on C. reinhardtii v5.3 genome data from Phytozome.

Insertion positions with very low abundance were removed. We believe that these low abundance reads are likely the result of sequencing or PCR errors. The genomic read count histograms in Supplemental Figures 12A and 12B show a tall peak at around one read per insertion and then a second peak at around 50 reads per insertion. We suspect that the second peak is the number of reads resulting from a single DNA molecule as a PCR template, so any “insertions” below that peak are likely to be due to PCR and/or sequencing errors. Therefore, we removed any insertions below a custom cutoff positioned between the first two peaks (15 reads for replicate 1 and 20 reads for replicate 2, marked on Supplemental Figures 12A and 12B). The data used in Figure 5 was processed in a similar way (the cutoff is marked on Supplemental Figures 12C and 12D).

The two technical replicate data sets were merged together at this point. The single replicates were used to show replicate correlation (Figure 4) and reproducibility; the merged data set was used for all other analyses.

Adjacent insertion positions that probably came from a single real insertion were merged. In some cases, a single insertion could generate two flanking sequences that map to different (adjacent) positions. The main scenarios where this could happen are (Supplemental Figure 13): (1) 1-bp insertions/deletions during PCR or sequencing; (2) a single insertion of two cassette copies ligated together, in opposite orientations. We merged such pairs until their level was reduced to that expected randomly. This merging process impacted <100 insertions, but omitting it can cause bias in the distribution of insertion positions and inflate the number of insertions per gene. The merging was not done for the cassette-aligned flanking sequences.

Determining Genome Mappability

To meaningfully compare the observed and expected insertion positions and densities, we determined which of all possible genomic insertion positions would yield sequencing reads uniquely mapped to that position. Each 20- and 21-bp slice of the genome sequence was categorized as unique or nonunique (77% of 20 bp slices and 78% of 21 bp slices in the C. reinhardtii genome are unique). The results were summed to yield “mappable lengths” for the whole genome, each gene, and other regions of interest; these are proportional to the expected density of insertions in a region if insertion positions were purely random.

Generating Simulated Random Insertion Data Sets

We generated 10 simulated data sets with the same number of insertions as the real data set, with the location of each insertion randomly chosen out of all the mappable positions in the genome. Furthermore, in order to estimate how much of the genome would be covered by larger numbers of insertions (Figure 3E), we used the same method to generate 10 simulated data sets of one million mappable insertions each.

We used the binomial test with correction for multiple testing to detect regions of the genome with more (hot spots) or fewer (cold spots) insertions than would be expected if the insertion positions were random. We looked for hot spots/cold spots in a large range of sizes: 1 kb, 5 kb, 20 kb, 100 kb, 200 kb, 400 kb, and 1 Mb. We sliced the genome into windows of each size, using evenly spaced offsets to get two to four overlapping sets of windows. For each region, we used the exact binomial test to determine the probability of obtaining the observed number of insertions given that region's mappable length and the total number of observed insertions in the genome, assuming that the insertions were uniformly randomly distributed over the mappable genome positions. The resulting P values were corrected for multiple testing using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) (as implemented in R with the p.adjust function) separately for each region size and offset combination.

A single real hot spot or cold spot is likely to generate multiple statistically significant results in overlapping windows with different sizes or offsets. Therefore, the results were clustered into 15 distinct groups of overlapping hot spots and 1 group of cold spots. For plotting (Figures 3A and 3B; Supplemental Figure 5), each overlapping cluster was reduced to a nonredundant subset of one to three regions with the lowest P values. Supplemental Data Set 4 contains the full data.

We performed the same analysis on the simulated data sets to ensure that the method worked as expected. The 10 simulated data sets had only 0 to 2 distinct hot spot and cold spot groups, with the lowest adjusted P value much higher than those of most of the hot spots detected in the real data set. This shows that a few of the potential hot spots and cold spots in the real data set may be due to noise, but the majority of the hot spots are almost definitely nonrandom.

Background Strain Genome Sequencing and Analysis

Genomic DNA from our background strain was extracted in technical duplicate and then submitted for Illumina library preparation and sequencing. To mimic the process of mapping insertion flanking regions, we extracted the first 20 or 21 bp from one end of each read and aligned it to the reference C. reinhardtii genome using bowtie. The differences between the two replicates and the 20/21-bp versions were minimal; they were added together for the final display of sequenced fragment density (Supplemental Figure 5).

Cassette Fragmentation Motif Enrichment

For each possible 4-bp motif, we determined the number of times it occurs in the cassette, and the observed read counts for upstream and downstream fragments for each of the positions in which it occurred. We calculated the average enrichment by dividing the average read count for those fragments by the average read count for all observed cassette fragments. We used the nonparametric Kruskal-Wallis test to determine whether the read counts for each motif originate from a different distribution than all the cassette fragment read counts and corrected the resulting P values for multiple testing using the Benjamini-Hochberg procedure. The full list of motifs, average read counts, P values, and other information are available in Supplemental Data Set 5.

Accession Numbers

Sequence data from this article can be found in the Arabidopsis Genome Initiative or GenBank/EMBL databases under accession number KJ572788 for the insertion cassette pMJ013b along with its cloning vector, annotated with gene features, primer binding sites, and relevant restriction sites.

Supplemental Data

The following materials are available in the online version of this article.

Acknowledgments

We thank K. Wemmer for help with assessing the swimming phenotypes of strains; P. Lefebvre for assessing the cell wall phenotype of CMJ030; G. Euskirchen and Z. Weng for help with Illumina sequencing; members of the Jonikas lab for helpful discussion; M. Terashima for help with mutant pooling; G. Watrous for help with mapping insertion sites in individual mutants; A. Grossman, W. Yang, M. Heinnickel, and C. Catalanotti (Carnegie Institution for Science) for strains D66+ and 4A– as well as protocols and advice on DNA extraction and crossing; and W. Frommer, V. Walbot, M. Terashima, L. Mackinder, L. Pallesen, E. Freeman, and X. Li for critical reading of the article. We thank the Barton, Frommer, and Ehrhardt labs for allowing us to use their lab space. This project was funded by the Carnegie Institution for Science, by U.S. Air Force Office of Scientific Research Grant FA9550-11-1-0060, and National Science Foundation Grant EF-1105617 to M.C.J. as well as by Deutsche Forschungsgemeinschaft research fellowship AR 808/1-1 to U.A.

AUTHOR CONTRIBUTIONS

R.Z. and M.C.J. designed the experiments and developed the initial version of the ChlaMmeSeq protocol. R.Z. optimized ChlaMmeSeq. R.Z. and S.S.G. isolated and characterized the CMJ030 strain. U.A., S.S.G., and S.R.B. optimized the transformation protocol. S.R.B. and S.S.G. generated the mutant pools. R.Z. applied ChlaMmeSeq to mutant pools to generate deep sequencing data and characterized individual mutants by PCR to validate ChlaMmeSeq. U.A. performed DNA gel blots, spot tests, and mating-type PCRs. W.P. designed and implemented computational data analyses. R.Z., W.P., and M.C.J. wrote the article.

Footnotes

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantcell.org) is: Martin C. Jonikas (mjonikas{at}carnegiescience.edu)