Abstract

Genetic variation between individuals has been extensively investigated, but differences between tissues within individuals are far less understood. It is commonly assumed that all healthy cells that arise from the same zygote possess the same genomic content, with a few known exceptions in the immune system and germ line. However, a growing body of evidence shows that genomic variation exists between differentiated tissues. We investigated the scope of somatic genomic variation between tissues within humans. Analysis of copy number variation by high-resolution array-comparative genomic hybridization in diverse tissues from six unrelated subjects reveals a significant number of intraindividual genomic changes between tissues. Many (79%) of these events affect genes. Our results have important consequences for understanding normal genetic and phenotypic variation within individuals, and they have significant implications for both the etiology of genetic diseases such as cancer and for immortalized cell lines that might be used in research and therapeutics.

Genetic variation between individuals has been extensively explored (1, 2), but far less effort has been made to ascertain somatic genetic variation in healthy individuals. This information is important for understanding how somatic mutations arise de novo that lead to both phenotypic variation and somatic genetic human diseases such as cancer. There is presently considerable interest in generation of induced pluripotent stem cells (iPS cells) from the cells of adult tissues because these cells have the potential to be used for therapeutic purposes. Evidence of genomic alterations has been found in human embryonic stem cells (3) and human iPS cells (4–7). Recent findings suggest that immortalized cell lines (8) contain significant copy number differences, although it is unclear whether these variations arise in vitro during cell culturing or originate in vivo in somatic cell populations.

On the basis of results from the 1,000 Genomes Project and other analyses, it is estimated that there are upward of 2,500 structural variations (duplications, insertions, and inversions) between any two individuals, 700–1,000 of which are greater than 1 kb (9, 10). The 1,000 Genomes Project analyses estimate upward of 3 million single nucleotide polymorphisms between any two humans (2). Sequence analysis of parent–offspring trios and quartets has revealed a mutation rate corresponding to ∼70 de novo single nucleotide variations that occur between parents and offspring, presumably through germline mutations (11, 12). Structural variations between parents and offspring were not measured.

Somatic mosaicism is the presence of genetically distinct populations of somatic cells within an organism and even within the same tissue. Somatic rearrangements in various cancer cell types are well documented, and more than 30 Mendelian diseases have been associated with somatic mosaicism (13). Mosaicism is well known to occur in normal germline cells through the process of meiotic recombination and in cells of the immune system within Ig and T-cell receptor genes as a result of V(D)J recombination, which provides diversity in immune responses. Other studies have revealed age-related structural variation in human blood cells (14), aneuploidy and retrotransposition in the human brain (15, 16), and aneuploidy in human preimplantation embryos (17). Evidence of copy number variation (CNV) in monozygotic twins (18) also reveals the presence of somatic mosaicism in tissues arising from the same zygote.

High-resolution genomic arrays enable the monitoring of genetic variation with an accuracy not previously applied to somatic tissues. Somatic mosaicism has been studied in cultured cells (8), but only one study has documented copy number variation between unfixed human tissues within individuals, using BAC arrays for array-comparative genomic hybridization (aCGH) (19). BAC arrays produce low-resolution data and have high error rates, leaving the extent of copy number variation between tissues uncertain. Although deep whole-genome sequencing provides the highest resolution and the ability to resolve rearrangements to the base pair, mapping sequences from mosaic samples remains a challenge using available sequencing technologies. aCGH using arrays that can resolve rearrangements as small as ∼2 kb provides a sensitive and well-established technology for this investigation and can detect mosaic events in tissues.

Results

Identification of Copy Number Variations (CNVs) in Somatic Tissues.

The presence of somatic CNVs was investigated using organ tissues from subjects obtained during routine autopsy. In total, tissues from six individuals (age range, 45–85 y) not known to be affected by any disease with a genetic component (e.g., hereditary disorders or cancer) were analyzed using NimbleGen 2.1M oligonucleotide whole-genome arrays (20). Between three and 11 tissues were tested for each of the six subjects. The tissue that yielded the most abundant and high-quality DNA for each subject was selected as the reference. The remaining tissues, referred to as test tissues, were each hybridized against the reference for each subject. DNA from the test tissues was labeled with Cy3 fluorescent dye and hybridized to DNA from the reference tissue labeled with Cy5 fluorescent dye. For each tissue comparison, “dye swap” experiments were also performed in which the reference DNA was labeled with Cy3 and the test DNA was labeled with Cy5. We required that an event be observed in both “dye swap” experiments to be included in the list of total somatic CNVs (Fig. 1A and Table S1). As a control, the reference DNA of one subject was labeled with Cy3 and hybridized to the same DNA labeled with Cy5. Positive events called using Nexus Copy Number software were detected when different tissue types were hybridized, and the number varied with the threshold used (Table S2). CNVs above the threshold chosen for calling somatic CNVs were not detected when the reference DNA was hybridized to itself.
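The dye-swap requirement can be pictured as a simple concordance filter. The following is a minimal sketch, not the Nexus Copy Number procedure: the `Call` record, the any-overlap criterion, and the assumption that calls from the swapped experiment have already been re-expressed as test relative to reference are all illustrative choices that the text does not specify.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """Hypothetical CNV call, always expressed as test tissue vs. reference."""
    chrom: str
    start: int  # bp, inclusive
    end: int    # bp, exclusive
    kind: str   # "gain" or "loss" in the test tissue


def overlaps(a: Call, b: Call) -> bool:
    """True if two calls share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def dye_swap_concordant(forward: list[Call], swapped: list[Call]) -> list[Call]:
    """Keep forward calls matched by an overlapping swap call of the same kind.

    Assumption: the sign of the swapped experiment was inverted beforehand,
    so that "gain" means a gain in the test tissue in both lists.
    """
    return [
        f for f in forward
        if any(overlaps(f, s) and f.kind == s.kind for s in swapped)
    ]


# Made-up coordinates for illustration only.
forward = [Call("chr2", 1_000, 15_000, "gain"), Call("chr9", 500, 3_000, "loss")]
swapped = [Call("chr2", 1_200, 14_800, "gain"), Call("chr9", 500, 3_000, "gain")]
kept = dye_swap_concordant(forward, swapped)
print(kept)
```

In this toy input the chr9 call is dropped because the two labelings disagree on direction, which is the kind of dye-specific artifact the replicate design is meant to remove.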

Fig. 1. (A) Locations of somatic CNVs for each of six subjects. (B) Ratios of selected somatic CNVs by NimbleGen aCGH and NanoString validation. A control region with no CNV between somatic tissues is also included. HC, hippocampus; IPL, inferior parietal lobe; LM, leptomeninges; MFG, middle frontal gyrus; SI, small intestine.

To set a threshold for calling CNVs, we analyzed a number of CNVs in various tissues using an initial set of 50 NanoString probes. A threshold was established that minimized the false discovery rate (10%; Materials and Methods). Two hundred twenty-nine CNVs passed the threshold. CNVs from telomeric and centromeric regions were removed, because these regions are prone to cause cross-hybridization signals on aCGH. CNVs that appeared in the same location in multiple test tissues from the same individual were ascribed to the reference tissue. This resulted in a set of 178 candidate somatic CNVs. Of the 178 regions, 168 were amenable to NanoString probe design and subjected to a second round of validation. Of the 168 regions tested by NanoString technology, 73 were validated across six people and 36 tissue comparisons (Fig. 1A and Table S1). We refer to this list of 73 events as the high-confidence set.
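The rule that events seen in multiple test tissues of one individual are ascribed to the reference tissue can be sketched as follows. This is an illustration under stated assumptions: events are matched by an exact region label (real calls would need overlap matching), the function name and the two-tissue cutoff are hypothetical, and the direction is inverted when an event is reassigned, since a gain in every test tissue relative to the reference is read as a loss in the reference.

```python
from collections import defaultdict

FLIP = {"gain": "loss", "loss": "gain"}


def ascribe_events(calls, min_tissues=2):
    """Assign each event of one subject to a test tissue or to the reference.

    calls: iterable of (test_tissue, region_label, kind), all hybridized
    against the same reference tissue. An event seen in at least
    `min_tissues` test tissues is reassigned to the reference with the
    opposite direction (illustrative rule, not the authors' code).
    """
    tissues_by_event = defaultdict(set)
    for tissue, region, kind in calls:
        tissues_by_event[(region, kind)].add(tissue)

    assigned = []
    for (region, kind), tissues in tissues_by_event.items():
        if len(tissues) >= min_tissues:
            assigned.append(("reference", region, FLIP[kind]))
        else:
            assigned.append((next(iter(tissues)), region, kind))
    return assigned


# Toy input loosely modeled on the subject 6 examples discussed in Results.
calls = [
    ("liver", "NR4A2 region", "gain"),
    ("IPL", "NR4A2 region", "gain"),
    ("liver", "NFIL3 region", "gain"),
]
assigned = ascribe_events(calls)
print(assigned)
```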

The frequency of high-confidence events varied from 0 to 45 per individual, with most events detected within the two individuals with the most tissues tested. Reference tissues often yielded the greatest numbers of CNVs per individual. The increased number of tests performed with the reference tissues presumably increased detection of CNVs in those samples. It is plausible that all 178 events that were reproducibly detected by the arrays are bona fide events but that many were not confirmed using the NanoString technology, either owing to inadequate probe design or because they were present at a level below the stringent threshold used in our validation process. Regardless, these results indicate that there are a large number of somatic CNVs in the tissues of adults.

CNVs Occur in Many Tissue Types and Can Be Large.

Somatic CNVs were detected by aCGH in a variety of tissues (Fig. 1A, Table S1, and Table S3). The greatest number of somatic CNVs was detected in the pancreas of subject 6 (43 events), followed by those observed in pancreas of subject 3 (35 events) and kidney of subject 1 (21 events). Events in brain tissues were detected by aCGH in the leptomeningeal tissue (two events) of subject 1, as well as in tissues of the inferior parietal lobe (10 events) and middle frontal gyrus (one event) of subject 6. There were five events in subject 6 that appeared in more than one brain tissue and likely affect the entire brain. Events in the small intestine tissues were observed in subjects 1 and 5 (seven and six events, respectively). Somatic CNVs were detected by aCGH in all of the liver tissues tested (four subjects). Tissues of the reproductive system were obtained for the two female subjects analyzed. Ten ovary-specific events were detected in subject 3, and eight uterus-specific events were detected in subject 4. Examples of CNVs discovered by NimbleGen and validated by NanoString are shown in Fig. 1B and Fig. 2. The validated events in Fig. 1B demonstrate that somatic CNVs have signals less than 1.5 (a 3:2 ratio for a duplication) or greater than 0.5 (a 1:2 ratio for a deletion). In part, this is to be expected because the tissues and organs contain a substantial fraction of nonparenchymal cells from blood vessels and connective tissue. However, the signals suggest that the events may not be homogeneous throughout the parenchymal cells of the tissue.
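The relationship between signal ratio and the fraction of affected cells can be made explicit with a small calculation. The sketch below assumes a diploid locus, a one-copy gain or loss, and a reference tissue that carries two copies in every cell; the function names are illustrative. Under those assumptions a fully penetrant duplication gives 3:2 (1.5) and a fully penetrant deletion gives 1:2 (0.5), as stated above, and intermediate ratios correspond to mosaic tissue.

```python
def expected_ratio(fraction: float, kind: str) -> float:
    """Expected test/reference ratio when `fraction` of test-tissue cells
    carry a one-copy duplication or deletion on a diploid background."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    delta = {"duplication": 1, "deletion": -1}[kind]
    # Mean copy number in the test tissue divided by 2 copies in the reference.
    return (2 + delta * fraction) / 2


def implied_fraction(ratio: float) -> float:
    """Fraction of cells carrying a one-copy change implied by a ratio."""
    return abs(ratio - 1.0) * 2


print(expected_ratio(1.0, "duplication"))  # 1.5
print(expected_ratio(1.0, "deletion"))     # 0.5
print(expected_ratio(0.5, "duplication"))  # 1.25
print(implied_fraction(1.25))              # 0.5
```

Under this simplified model, a ratio of 1.25 would correspond to a one-copy gain in half of the cells sampled, whether the remaining cells are unaffected parenchymal cells or nonparenchymal cells.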

The size distribution of somatic CNVs was analyzed (Fig. 3A). Somatic CNVs with sizes smaller than 10 kb were binned into 1-kb intervals, somatic CNVs with sizes between 10 kb and 100 kb were binned into 10-kb intervals, and somatic CNVs with sizes larger than 100 kb were binned into 100-kb intervals. The sizes of the total somatic CNVs discovered by aCGH ranged from 2.0 kb to 184 kb, and those of the NanoString-validated set ranged from 3.2 kb to 184 kb. Larger events were more likely to be validated by NanoString: 30 of 32 CNVs (93.8%) larger than 20 kb were validated, whereas only 43 of 146 CNVs (29.5%) smaller than 20 kb were validated. It is possible that the smaller CNVs were either false positives or that the validation probes were not optimal for the more limited target regions.
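The variable-width binning described here can be sketched in a few lines. The handling of sizes falling exactly on 10 kb or 100 kb is an assumption (they are placed in the wider bin), and the example sizes are taken from values quoted in the text rather than from Table S1.

```python
from collections import Counter


def size_bin(size_bp: int) -> tuple[int, int]:
    """Return the (start, end) of the size bin in bp, end exclusive.

    <10 kb: 1-kb bins; 10 kb to <100 kb: 10-kb bins; >=100 kb: 100-kb bins.
    Boundary handling is an illustrative choice.
    """
    if size_bp < 10_000:
        width = 1_000
    elif size_bp < 100_000:
        width = 10_000
    else:
        width = 100_000
    start = (size_bp // width) * width
    return start, start + width


# Sizes mentioned in the text: 2.0, 3.2, 13.7, 21.2, and 184 kb.
sizes = [2_000, 3_200, 13_700, 21_200, 184_000]
histogram = Counter(size_bin(s) for s in sizes)
for (lo, hi), n in sorted(histogram.items()):
    print(f"{lo / 1000:g}-{hi / 1000:g} kb: {n}")
```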

Many Somatic CNVs Affect Genes Involved in Regulation.

We determined the number of high-confidence CNVs that affect genes and found 58 (79%) intersect genes, including introns, exons, or upstream sequence; 51 of these (70% of the total) directly affect exons (Fig. 3B). Fig. 2 shows examples of CNVs that affect the ACOXL, BCL2L11, TMEM132D, NFIL3, and NR4A2 genes. Fig. 2A shows a ∼13.7-kb increase in signal in DNA from liver tissue when hybridized against DNA from reference kidney tissue, suggesting a duplication in liver of a region overlapping both the BCL2L11 gene and the ACOXL gene. BCL2L11 mRNA is expressed in several human tissues, including liver (21). Fig. 2C shows an increase in signal of a ∼21.2-kb region on chromosome 9 of subject 6 in liver tissue DNA when hybridized against pancreas reference tissue DNA, suggesting a duplication of a region encompassing the NFIL3 gene in liver tissue. Expression of NFIL3 was previously reported in human liver tissue (22). Fig. 2 D and E show examples of somatic CNVs observed in the DNA of reference tissues. Fig. 2E shows an increase in signal for DNA from tissues of three brain regions and liver when hybridized against pancreas tissue DNA of subject 6 in a region encompassing the NR4A2 gene. Because this event is observed in all hybridizations against the pancreas reference tissue, the event is likely a deletion in the pancreas. More examples are shown in Fig. S1. These results indicate that most somatic CNVs affect genes. Some of these genes have previously been reported to be expressed in the affected tissues.

Investigation of the types of genes affected by high-confidence somatic CNVs was performed by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses (23, 24) (Fig. 4 and Table S4). GO analysis revealed modest enrichments for genes within somatic CNVs in ∼30 GO categories. Many of the GO categories relate to regulation of cellular processes such as regulation of phosphorylation, regulation of primary metabolic processes, and regulation of gene expression. Sequence-specific DNA binding and protein binding were also enriched in genes within somatic CNVs. KEGG pathway analysis revealed modest enrichments in the Wnt signaling pathway and the MAPK signaling pathway. Thus, many of the CNVs are likely to affect regulatory processes in the cell.

Further analysis was performed on genes within the somatic CNVs discovered to investigate whether any of the affected genes within the same subject or tissue are known to interact. Deletion events affecting the MAP2K3 gene on chromosome 17 and the SMAD7 gene on chromosome 18 were observed in pancreas tissue of subject 6. MAP2K3 and SMAD7 were previously shown to be expressed in pancreas tissue (25), and SMAD7 has been shown to be overexpressed in pancreatic cancer (26). The SMAD7 and MAP2K3 genes were previously reported (27) to interact as part of the TGF-β–dependent activation of p38 MAPK pathway inducing apoptosis in prostate cancer cells. Thus, in at least one case the CNVs affect interacting components.

Several Somatic CNVs Occur in the Same Location in Multiple Individuals.

Nearly all events were unique; only seven regions overlap between two individuals, three of which were validated by NanoString. Fig. 5 A and B show regions on chromosomes 12 and 14, respectively, where CNVs were observed in both subjects 1 and 6. Fig. 5 A and B show increased signal in DNA from liver tissue when hybridized against DNA from kidney tissue of subject 1. An increased signal is observed for subject 6 in the same regions in DNA from liver tissue and tissue from brain regions when hybridized against DNA from reference pancreas tissue. Increased signal in all of the test tissues of subject 6 corresponds to deletions in the reference pancreas tissue for both chromosomal regions. The events on chromosome 12 occur in a region encompassing the GABARAPL1 gene (Fig. 5A). The events on chromosome 14 overlap both the ANG gene and the RNASE4 gene (Fig. 5B). ANG and RNASE4 were previously shown to be expressed in pancreas and liver (28, 29). Thus, a few CNVs (4%) lie in the same regions and contain genes.

Fig. 5. aCGH signal for somatic CNVs that appear in more than one subject. (A) Increased signal in liver tissue hybridized against kidney tissue of subject 1 corresponding to a duplication of chromosome 12 DNA in liver. Increased signal in liver tissue and superior temporal gyrus–middle temporal gyrus (STG-MTG) tissue hybridized against pancreas tissue of subject 6 corresponding to a deletion in the pancreas reference tissue on chromosome 12. (B) Increased signal in liver tissue hybridized against kidney tissue of subject 1 corresponding to a duplication of chromosome 14 DNA in liver. Increased signal in liver tissue and MFG tissue hybridized against pancreas tissue of subject 6 corresponding to a deletion in pancreas reference tissue on chromosome 14. Controls show no signal change between hybridized tissues. Blue and red dots represent dye-swap experiments.

Somatic CNV Breakpoints Analysis.

Analysis of the breakpoint regions of 73 high-confidence somatic CNVs was undertaken to assess the frequency of repetitive elements near somatic CNV breakpoints. aCGH does not provide base pair resolution that is necessary for determining exact CNV formation mechanisms (e.g., nonallelic homologous recombination, nonhomologous end-joining, and transpositions of mobile genomic elements). However, repetitive sequence is known to be integral in many of the CNV formation mechanisms. NimbleGen 2.1M oligonucleotide microarrays can achieve breakpoint resolution to ∼2 kb based on probe spacing of ∼1 kb. Therefore, a 2-kb window was analyzed around each breakpoint for the presence of six classes of repetitive sequence: L1, L2, Alu, LTR, segmental duplications, and microsatellites (Table 1). To test for significant enrichment or depletion for each class of repetitive element, P values were calculated by performing permutations of 10,000 sets of random intervals. Within each set of random intervals the number and size of the intervals was kept equal to those of the validated somatic CNVs being analyzed. The locations of the intervals were randomized within the regions probed by the NimbleGen 2.1M microarray. The number of somatic CNVs that intersect a repetitive element class was compared with the average number of random intervals that intersect the element to calculate an enrichment/depletion ratio. Thirty-nine of 73 somatic CNVs have a breakpoint that intersects a microsatellite, an enrichment of ∼1.29 over the genomic background (P < 3.8e−2). Microsatellites have been suggested to play a role in chromosomal rearrangement (30), and a colocalization of microsatellites with CNVs was previously reported (31). Twenty-five of 73 somatic CNVs have a breakpoint that overlaps an L1, a significant depletion of ∼0.53 compared with the genomic background (P < 1.2e−7). Twenty-nine of 73 somatic CNVs have a breakpoint that intersects an LTR, a significant depletion of ∼0.77 compared with random intervals (P < 4.8e−2). Segmental duplications, Alu, and L2 were not significantly enriched or depleted near somatic CNV breakpoints compared with average random intervals.
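The permutation scheme described above can be illustrated on a toy linear genome. This is a sketch under several assumptions that the text does not fix: the 2-kb window is taken as ±1 kb centered on each breakpoint, the probed regions are modeled as one contiguous stretch, and the empirical P value is one-sided in the direction of the observed effect with a +1 correction. All names and coordinates are hypothetical.

```python
import random


def hits_repeat(start, end, repeats, half_window=1_000):
    """True if a window around either breakpoint overlaps any repeat."""
    for breakpoint in (start, end):
        lo, hi = breakpoint - half_window, breakpoint + half_window
        if any(r_start < hi and lo < r_end for r_start, r_end in repeats):
            return True
    return False


def count_hits(intervals, repeats):
    """Number of intervals with at least one breakpoint near a repeat."""
    return sum(hits_repeat(s, e, repeats) for s, e in intervals)


def permutation_test(cnvs, repeats, probed_length, n_perm=10_000, seed=0):
    """Compare the observed hit count with size-matched random intervals."""
    rng = random.Random(seed)
    observed = count_hits(cnvs, repeats)
    sizes = [e - s for s, e in cnvs]
    null = []
    for _ in range(n_perm):
        # Same number and sizes of intervals, random locations.
        shuffled = []
        for size in sizes:
            s = rng.randrange(0, probed_length - size + 1)
            shuffled.append((s, s + size))
        null.append(count_hits(shuffled, repeats))
    mean_null = sum(null) / n_perm
    ratio = observed / mean_null if mean_null else float("inf")
    if observed >= mean_null:   # test for enrichment
        extreme = sum(n >= observed for n in null)
    else:                       # test for depletion
        extreme = sum(n <= observed for n in null)
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, mean_null, ratio, p_value


# Toy data: a 500-bp repeat every 50 kb on a 1-Mb probed region.
probed_length = 1_000_000
repeats = [(i, i + 500) for i in range(0, probed_length, 50_000)]
cnvs = [(50_000, 60_000), (100_200, 125_000), (300_000, 303_000)]
observed, mean_null, ratio, p_value = permutation_test(
    cnvs, repeats, probed_length, n_perm=2_000
)
print(observed, round(mean_null, 2), round(ratio, 2), p_value)
```

A ratio above 1 indicates enrichment and a ratio below 1 indicates depletion relative to the size-matched random intervals, mirroring the enrichment/depletion ratios reported for microsatellites, L1, and LTR elements.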

Table 1. Analysis for enrichment or depletion of several classes of repetitive elements at somatic CNV breakpoints

In summary, these results show that somatic CNVs occur in humans in a variety of tissues. Although the sample size is small, the number of events detected was often higher in dividing tissues (e.g., liver, small intestine, and pancreas) relative to nondividing tissues (e.g., brain).

Discussion

This study confirms that humans have extensive genetic variation in somatic tissues. aCGH and the validation method used can detect heterogeneous cell populations within tissue samples. Many of the events detected have aCGH signals corresponding to less than a full-copy duplication or deletion, corroborating that there are indeed heterogeneous cell populations within tissues. To be detected, these events must be present in a substantial fraction of the parenchymal cells. In proliferating tissues, such as in the cases of p53 mutations in UV-irradiated skin (32) and some of the CNVs seen in normal blood cells (14), it seems probable that the somatic mutation conferred a growth advantage to the affected cells. In organs that undergo little or no proliferation in the adult, one possibility is that the observed CNV may have given a growth or differentiation advantage to the affected precursor cells during development. It is also likely that many more CNVs exist in small numbers of parenchymal cells but do not give any growth advantage to the cells.

These results have important implications for the detection of disease formation and treatment. A stepwise theory of cancer has been proposed to explain the origin of tumor populations (33–36). This theory posits that genetic instability in precancerous cells leads to an accumulation of mutations that eventually result in cancer. Our results suggest that CNVs may occur commonly during somatic cell growth and division. In some cases, these might result in predilection for diseases such as cancer. Detection of such events in cells of healthy individuals may have prognostic value for the risk of malignancy. These results also indicate that choosing the relevant tissue for comparison of normal and disease tissues is important. Use of tissues from distant lineages may find events that are not relevant to formation or maintenance of the disease state and that might simply be “background” events in another tissue.

In developmental biology, there is great interest in understanding how common progenitor cells differentiate into distinct tissue types. Our results show that somatic tissues can genetically vary. Although our sample size is small, several of the tissues with the highest numbers of observed events are dividing tissues (e.g., liver and intestine). The pancreas, liver, and small intestine are all derived from the endoderm germ layer. Studies of differentiation have revealed that the ventral endoderm can give rise to each of these tissues, depending on signaling from various pathways (37–39). We observed seven events that occurred in the same locations in two individuals, suggesting that there are potential hotspots for somatic genomic variation. These types of events were observed in several different tissues, including pancreas, liver, and small intestine. It remains to be seen whether the somatic genetic differences observed here play a role in the differentiation process.

Finally, this work is particularly important for therapies that use iPS cells. iPS cells are typically derived from somatic tissues and may contain somatically altered DNA. Indeed, CNVs have been observed in iPS cells relative to the starting material from which they were derived (4–7). Our findings indicate that many of these events (and perhaps all) are not induced during iPS cell formation but perhaps exist in somatic cells in vivo. Although genetic alterations are generally viewed as negative, somatic variation could be beneficial. For example, in tissues that frequently encounter pathogens, CNVs that eliminate viral receptors might enhance host survival. Somatic genomic variation may have wide-reaching implications for diverse aspects of human biology and health, making it an important area of future research.

Materials and Methods

aCGH.

Tissues that were identified as noncancerous were obtained from routine autopsy. DNA was extracted from tissues using the Qiagen DNeasy Blood & Tissue Kit (Qiagen). For each subject, the tissue that yielded the greatest amount of DNA was selected as the reference tissue. Each of the other tissues from a subject was compared against the reference tissue. aCGH was performed using NimbleGen (Roche NimbleGen) 2.1M microarrays according to the NimbleGen protocol. Dye swap replicates were carried out for each tissue comparison to control for dye-specific bias. Arrays were scanned using the Roche MS 200 Microarray Scanner. Images were analyzed using NimbleScan 2.6 software. CNV calls were made with Nexus Copy Number version 6.0 (BioDiscovery) using the Fast Adaptive States Segmentation Technique 2 (FASST2) segmentation algorithm, a hidden Markov model-based approach.

Validation.

NanoString probes were designed to 50 regions of possible copy number change. Half of the regions had two probes designed, whereas the other half had three probes designed. Hybridization, sample prep, and scanning were performed according to the NanoString protocol. Data analysis was performed using the nCounter CNV Collector tool. Copy number was determined by averaging over two to three probes per region. To validate the NimbleGen CNV calls, test/reference NanoString ratios were compared with NimbleGen ratios. For ratios below 1 (i.e., deletions), the ratios were inverted for a direct comparison. Events with NanoString (validation) ratios of 1.25 or greater were considered to be valid CNVs. A false-positive rate of 10% was used as a cutoff, which corresponded to a NimbleGen ratio of 1.3 for subjects 1–5 and 1.35 for subject 6 (Figs. S2 and S3). All thresholding analyses were performed using the ROCR package for the R statistical computing environment. These ratios were used as the thresholds for calling CNVs from the NimbleGen data for each tissue comparison using the FASST2 segmentation algorithm of the Nexus Copy Number (BioDiscovery) software. A second round of NanoString probes was designed to the candidate CNVs that had not been previously validated. One probe was used for each of 168 regions. Candidate events with NanoString ratios of 1.25 or greater were considered validated.
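The validation rule can be written out as a short calculation. This sketch is not the nCounter CNV Collector tool: it assumes that per-probe test/reference ratios are averaged for each region (the text does not say whether counts or ratios were averaged), and the function names and example ratios are hypothetical. Only the inversion of ratios below 1 and the 1.25 cutoff come from the text.

```python
from statistics import mean

VALIDATION_THRESHOLD = 1.25  # NanoString ratio cutoff stated in the text


def fold_change(probe_ratios):
    """Average per-probe test/reference ratios; invert ratios below 1
    (deletions) so gains and losses are compared on the same scale."""
    ratio = mean(probe_ratios)
    if ratio <= 0:
        raise ValueError("ratios must be positive")
    return 1 / ratio if ratio < 1 else ratio


def is_validated(probe_ratios, threshold=VALIDATION_THRESHOLD):
    """True if the averaged, direction-normalized ratio meets the cutoff."""
    return fold_change(probe_ratios) >= threshold


print(is_validated([1.30, 1.40]))        # candidate duplication, two probes
print(is_validated([0.70, 0.75, 0.80]))  # candidate deletion, three probes
print(is_validated([1.05, 1.10]))        # near 1:1, not validated
```

For the second validation round, in which one probe was used per region, the same function applies with a single-element list.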

Note Added In Proof.

Our speculation that many of these events are not induced during iPS cell formation but perhaps exist in somatic cells in vivo received support from complementary data in a recent paper (40).

Acknowledgments

We thank Alexander Vortmeyer and colleagues at the Yale Pathology Department for providing tissue samples. This work was supported by National Institutes of Health Grants 5R01MH094740 (to M.P.S. and A.E.U.) and 5P50HG002357 (to M.P.S.).

Footnotes

¹To whom correspondence may be addressed. E-mail: sherman.weissman@yale.edu or mpsnyder@stanford.edu.
