There has been an explosion of data describing newly recognized structural variants in the human genome. In the flurry of reporting, there has been no standard approach to collecting the data, assessing its quality or describing identified features. This risks becoming a rampant problem, in particular with respect to surveys of copy number variation and their application to disease studies. Here, we consider the challenges in characterizing and documenting genomic structural variants. From this, we derive recommendations for standards to be adopted, with the aim of ensuring the accurate presentation of this form of genetic variation to facilitate ongoing research.

The association of DNA copy-number variation (CNV) with specific gene function and human disease has long been known, but the wide scope and prevalence of this form of variation have only recently been fully appreciated. The latest studies using microarray technology have demonstrated that as much as 12% of the human genome and thousands of genes are variable in copy number, and this diversity is likely to be responsible for a significant proportion of normal phenotypic variation. Current challenges involve developing methods not only for detecting and cataloging CNVs in human populations at increasingly higher resolution but also for determining the association of CNVs with biological function, recent human evolution, and common and complex human disease.

Human genome sequencing has transformed our understanding of genomic variation and its relevance to health and disease, and is now starting to enter clinical practice for the diagnosis of rare diseases. The question of whether and how some categories of genomic findings should be shared with individual research participants is currently a topic of international debate, and development of robust analytical workflows to identify and communicate clinically relevant variants is paramount.

Methods

The Deciphering Developmental Disorders (DDD) study has developed a UK-wide patient recruitment network involving over 180 clinicians across all 24 regional genetics services, and has performed genome-wide microarray and whole exome sequencing on children with undiagnosed developmental disorders and their parents. After data analysis, pertinent genomic variants were returned to individual research participants via their local clinical genetics team.

Findings

Around 80 000 genomic variants were identified from exome sequencing and microarray analysis in each individual, of which on average 400 were rare and predicted to be protein altering. By focusing only on de novo and segregating variants in known developmental disorder genes, we achieved a diagnostic yield of 27% among 1133 previously investigated yet undiagnosed children with developmental disorders, whilst minimising incidental findings. In families with developmentally normal parents, whole exome sequencing of the child and both parents resulted in a 10-fold reduction in the number of potential causal variants that needed clinical evaluation compared to sequencing only the child. Most diagnostic variants identified in known genes were novel and not present in current databases of known disease variation.
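The variant-exclusion step described above can be illustrated with a minimal sketch. It covers only the de novo case (segregating variants are not modelled), and the `Variant` fields, the key format and the gene names are hypothetical choices made for illustration, not the study's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str                    # gene symbol (hypothetical names in the example)
    key: str                     # assumed identifier, e.g. "chrom:pos:ref:alt"
    rare_protein_altering: bool  # assumed to be precomputed upstream


def candidate_de_novo(child, mother, father, known_genes):
    """Keep child variants that are rare and protein altering, lie in a known
    developmental disorder gene, and are absent from both parents."""
    parental_keys = {v.key for v in mother} | {v.key for v in father}
    return [
        v for v in child
        if v.rare_protein_altering
        and v.gene in known_genes
        and v.key not in parental_keys
    ]


if __name__ == "__main__":
    child = [
        Variant("GENE_A", "6:100:A:T", True),
        Variant("GENE_B", "2:500:G:C", True),   # not a known gene
        Variant("GENE_A", "6:900:C:G", True),   # inherited from the mother
    ]
    mother = [Variant("GENE_A", "6:900:C:G", True)]
    father = []
    for v in candidate_de_novo(child, mother, father, {"GENE_A"}):
        print(v.gene, v.key)
```

Sequencing both parents is what allows the inherited variant to be excluded here, which is consistent with the reduction in candidate variants reported for trio analysis.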

Interpretation

Implementation of a robust translational genomics workflow is achievable within a large-scale rare disease research study to allow feedback of potentially diagnostic findings to clinicians and research participants. Systematic recording of relevant clinical data, curation of a gene–phenotype knowledge base, and development of clinical decision support software are needed in addition to automated exclusion of almost all variants, which is crucial for scalable prioritisation and review of possible diagnostic variants. However, the resource requirements of development and maintenance of a clinical reporting system within a research setting are substantial.

Funding

Health Innovation Challenge Fund, a parallel funding partnership between the Wellcome Trust and the UK Department of Health.

The analytical resolution of individual chromosome peaks in the flow karyotype of cell lines is dependent on sample preparation and the detection sensitivity of the flow cytometer. We have investigated the effect of laser power on the resolution of chromosome peaks in cell lines with complex karyotypes. Chromosomes were prepared from a human gastric cancer cell line and a cell line from a patient with an abnormal phenotype using a modified polyamine isolation buffer. The stained chromosome suspensions were analyzed on a MoFlo sorter (Beckman Coulter) equipped with two water-cooled lasers (Coherent). A bivariate flow karyotype was obtained from each of the cell lines at various laser power settings and compared to a karyotype generated using laser power settings of 300 mW. The best separation of chromosome peaks was obtained with laser powers of 300 mW. This study demonstrates the requirement for high laser powers for the accurate detection and purification of chromosomes, particularly from complex karyotypes, using a conventional flow cytometer.

Zebrafish have become a popular organism for the study of vertebrate gene function [1,2]. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and increasingly, the study of human genetic disease [3–5]. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes [6], the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.

The trials performed worldwide towards Non-Invasive Prenatal Diagnosis (NIPD) of Down syndrome (or Trisomy 21) have demonstrated the great commercial and medical potential of NIPD compared to the currently used invasive prenatal diagnostic procedures. Extensive investigation of methylation differences between the mother and the fetus has led to the identification of Differentially Methylated Regions (DMRs). In this study, we present a strategy using the Methylated DNA immunoprecipitation (MeDiP) methodology in combination with real-time qPCR to achieve fetal chromosome dosage assessment which can be performed non-invasively through the analysis of fetal-specific DMRs. We achieved non-invasive prenatal detection of trisomy 21 by determining the methylation ratio of normal and trisomy 21 cases for each tested fetal-specific DMR present in maternal peripheral blood, followed by further statistical analysis. The application of the above fetal-specific methylation ratio approach provided correct diagnosis of 14 trisomy 21 and 26 normal cases.

Patients with developmental disorders often harbour sub-microscopic deletions or duplications that lead to a disruption of normal gene expression or perturbation in the copy number of dosage-sensitive genes. Clinical interpretation for such patients in isolation is hindered by the rarity and novelty of such disorders. The DECIPHER project (https://decipher.sanger.ac.uk) was established in 2004 as an accessible online repository of genomic and associated phenotypic data with the primary goal of aiding the clinical interpretation of rare copy-number variants (CNVs). DECIPHER integrates information from a variety of bioinformatics resources and uses visualization tools to identify potential disease genes within a CNV. A two-tier access system permits clinicians and clinical scientists to maintain confidential linked anonymous records of phenotypes and CNVs for their patients that, with informed consent, can subsequently be shared with the wider clinical genetics and research communities. Advances in next-generation sequencing technologies are making it practical and affordable to sequence the whole exome/genome of patients who display features suggestive of a genetic disorder. This approach enables the identification of smaller intragenic mutations including single-nucleotide variants that are not accessible even with high-resolution genomic array analysis. This article briefly summarizes the current status and achievements of the DECIPHER project and looks ahead to the opportunities and challenges of jointly analysing structural and sequence variation in the human genome.

Down syndrome (DS) is caused by trisomy of chromosome 21 (Hsa21) and presents a complex phenotype that arises from abnormal dosage of genes on this chromosome. However, the individual dosage-sensitive genes underlying each phenotype remain largely unknown. To help dissect genotype–phenotype correlations in this complex syndrome, the first fully transchromosomic mouse model, the Tc1 mouse, which carries a copy of human chromosome 21, was produced in 2005. The Tc1 strain is trisomic for the majority of genes that cause phenotypes associated with DS, and this freely available mouse strain has become widely used to study DS, the effects of gene dosage abnormalities, and the effect on the basic biology of cells when a mouse carries a freely segregating human chromosome. Tc1 mice were created by a process that included irradiation microcell-mediated chromosome transfer of Hsa21 into recipient mouse embryonic stem cells. Here, the combination of next generation sequencing, array-CGH and fluorescence in situ hybridization technologies has enabled us to identify unsuspected rearrangements of Hsa21 in this mouse model, revealing one deletion, six duplications and more than 25 de novo structural rearrangements. Our study is not only essential for informing functional studies of the Tc1 mouse but also (1) presents for the first time a detailed sequence analysis of the effects of gamma radiation on an entire human chromosome, which gives some mechanistic insight into the effects of radiation damage on DNA, and (2) overcomes specific technical difficulties of assaying a human chromosome on a mouse background where highly conserved sequences may confound the analysis. Sequence data generated in this study is deposited in the ENA database, Study Accession number: ERP000439.

We report on a 17-year-old patient with midline defects, ocular hypertelorism, neuropsychomotor development delay, neonatal macrosomy, and dental anomalies. DNA copy number investigations using a Whole Genome TilePath array consisting of 30K BAC/PAC clones showed a 6.36 Mb deletion in the 9p24.1–p24.3 region and a 14.83 Mb duplication in the 20p12.1–p13 region, which derived from a maternal balanced t(9;20)(p24.1;p12.1) as shown by FISH studies. Monosomy 9p is a well-delineated chromosomal syndrome with characteristic clinical features, while chromosome 20p duplication is a rare genetic condition. Only a handful of cases of monosomy 9/trisomy 20 have been previously described. In this report, we compare the phenotype of our patient with those already reported in the literature, and discuss the role of DMRT, DOCK8, FOXD4, VLDLR, RSPO4, AVP, RASSF2, PROKR2, BMP2, MKKS, and JAG1, all genes mapping to the deleted and duplicated regions.

The recently described DNA replication-based mechanisms of fork stalling and template switching (FoSTeS) and microhomology-mediated break-induced replication (MMBIR) were previously shown to catalyze complex exonic, genic and genomic rearrangements. By analyzing a large number of isochromosomes of the long arm of chromosome X (i(Xq)), using whole-genome tiling path array comparative genomic hybridization (aCGH), ultra-high resolution targeted aCGH and sequencing, we provide evidence that the FoSTeS and MMBIR mechanisms can generate large-scale gross chromosomal rearrangements leading to the deletion and duplication of entire chromosome arms, thus suggesting an important role for DNA replication-based mechanisms in both the development of genomic disorders and cancer. Furthermore, we elucidate the mechanisms of dicentric i(Xq) (idic(Xq)) formation and show that most idic(Xq) chromosomes result from non-allelic homologous recombination between palindromic low copy repeats and highly homologous palindromic LINE elements. We also show that non-recurrent-breakpoint idic(Xq) chromosomes have microhomology-associated breakpoint junctions and are likely catalyzed by microhomology-mediated replication-dependent recombination mechanisms such as FoSTeS and MMBIR. Finally, we stress the role of the proximal Xp region as a chromosomal rearrangement hotspot.

Motivation: The careful normalization of array-based comparative genomic hybridization (aCGH) data is of critical importance for the accurate detection of copy number changes. The difference in labelling affinity between the two fluorophores used in aCGH—usually Cy5 and Cy3—can be observed as a bias within the intensity distributions. If left unchecked, this bias is likely to skew data interpretation during downstream analysis and lead to an increased number of false discoveries.

Results: In this study, we have developed aCGH.Spline, a natural cubic spline interpolation method followed by linear interpolation of outlier values, which is able to remove a large portion of the dye bias from large aCGH datasets in a quick and efficient manner.

Conclusions: We have shown that removing this bias and reducing the experimental noise has a strong positive impact on the ability to detect accurately both copy number variation (CNV) and copy number alterations (CNA).

Contact: l.larcombe@cranfield.ac.uk; tf2@sanger.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.

Array painting is a technique that uses microarray technology to rapidly map chromosome translocation breakpoints. Previous methods to map translocation breakpoints have used fluorescence in situ hybridization (FISH) and have consequently been labor-intensive, time-consuming and restricted to the low breakpoint resolution imposed by the use of metaphase chromosomes. Array painting combines the isolation of derivative chromosomes (chromosomes with translocations) and high-resolution microarray analysis to refine the genomic location of translocation breakpoints in a single experiment. In this protocol, we describe array painting by isolation of derivative chromosomes using a MoFlo flow sorter, amplification of these derivatives using whole-genome amplification and hybridization onto commercially available oligonucleotide microarrays. Although the sorting of derivative chromosomes is a specialized procedure requiring sophisticated equipment, the amplification, labeling and hybridization of DNA is straightforward, robust and can be completed within 1 week. The protocol described produces good quality data; however, array painting is equally achievable using any combination of the available alternative methodologies for chromosome isolation, amplification and hybridization.

Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3× coverage) and two Asian genomes (AK1, with 27.8× coverage and AK2, with 32.0× coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine.
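The idea of converting relative array CGH measurements into absolute copy numbers can be sketched as follows. This is an illustration under a simple assumption, not the published method: the array reports a log2 ratio of test to reference signal, and the reference genome's absolute copy number at the locus is known from sequencing. The function name and the rounding rule are hypothetical.

```python
def absolute_copy_number(log2_ratio, reference_copy_number):
    """Estimate the test sample's integer copy number at one locus.

    Assumes log2_ratio = log2(test / reference) and that the reference
    copy number is a positive integer inferred from sequencing depth.
    """
    if reference_copy_number <= 0:
        # A ratio against zero reference copies carries no usable information.
        raise ValueError("reference copy number must be positive")
    return round(reference_copy_number * 2 ** log2_ratio)


if __name__ == "__main__":
    # Same log2 ratio, different reference copy number, different absolute call.
    print(absolute_copy_number(-1.0, 2))  # heterozygous loss relative to 2 copies
    print(absolute_copy_number(-1.0, 4))  # loss relative to a 4-copy reference
```

The example shows why a relative ratio alone is ambiguous: the same log2 ratio maps to different absolute values depending on the copy number carried by the reference.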

Congenital malformations involving the Müllerian ducts are observed in around 5% of infertile women. Complete aplasia of the uterus, cervix, and upper vagina, also termed Müllerian aplasia or Mayer–Rokitansky–Küster–Hauser (MRKH) syndrome, has an incidence of around 1 in 4500 female births and occurs in both isolated and syndromic forms. Previous reports have suggested that a proportion of cases, especially syndromic cases, are caused by variation in copy number at different genomic loci.

Methods

In order to obtain an overview of the contribution of copy number variation to both isolated and syndromic forms of Müllerian aplasia, copy number assays were performed in a series of 63 cases, of which 25 were syndromic and 38 isolated.

Results

A high incidence (9/63, 14%) of recurrent copy number variants in this cohort is reported here. These comprised four cases of microdeletion at 16p11.2, an autism susceptibility locus not previously associated with Müllerian aplasia, four cases of microdeletion at 17q12, and one case of a distal 22q11.2 microdeletion. Microdeletions at 16p11.2 and 17q12 were found in 4/38 (10.5%) cases with isolated Müllerian aplasia, and at 16p11.2, 17q12 and 22q11.2 (distal) in 5/25 cases (20%) with syndromic Müllerian aplasia.
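The reported proportions can be checked with a few lines of arithmetic. The counts are taken from the text; the helper name `percent` is introduced only for this illustration.

```python
def percent(cases, total):
    """Return cases / total expressed as a percentage."""
    return 100.0 * cases / total


if __name__ == "__main__":
    isolated_total, syndromic_total = 38, 25
    isolated_cnv, syndromic_cnv = 4, 5

    # The subgroup counts should add up to the cohort-level figures.
    cohort_total = isolated_total + syndromic_total   # 63 cases
    cohort_cnv = isolated_cnv + syndromic_cnv         # 9 recurrent CNVs

    print(f"all cases: {cohort_cnv}/{cohort_total} = {percent(cohort_cnv, cohort_total):.1f}%")
    print(f"isolated:  {isolated_cnv}/{isolated_total} = {percent(isolated_cnv, isolated_total):.1f}%")
    print(f"syndromic: {syndromic_cnv}/{syndromic_total} = {percent(syndromic_cnv, syndromic_total):.1f}%")
```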

Conclusion

The finding of microdeletion at 16p11.2 in 2/38 (5%) of isolated and 2/25 (8%) of syndromic cases suggests a significant contribution of this copy number variant alone to the pathogenesis of Müllerian aplasia. Overall, the high incidence of recurrent copy number variants in all forms of Müllerian aplasia has implications for the understanding of the aetiopathogenesis of the condition, and for genetic counselling in families affected by it.

The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.


We have systematically compared copy number variant (CNV) detection on eleven microarrays to evaluate data quality and CNV calling, reproducibility, concordance across array platforms and laboratory sites, breakpoint accuracy and analysis tool variability. Different analytic tools applied to the same raw data typically yield CNV calls with <50% concordance. Moreover, reproducibility in replicate experiments is <70% for most platforms. Nevertheless, these findings should not preclude detection of large CNVs for clinical diagnostic purposes because large CNVs with poor reproducibility are found primarily in complex genomic regions and would typically be removed by standard clinical data curation. The striking differences between CNV calls from different platforms and analytic tools highlight the importance of careful assessment of experimental design in discovery and association studies and of strict data curation and filtering in diagnostics. The CNV resource presented here allows independent data evaluation and provides a means to benchmark new algorithms.
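One way to quantify concordance between two CNV call sets is sketched below. The 50% reciprocal-overlap criterion and the concordance definition (matched calls from both sets divided by all calls) are assumptions chosen for illustration; the comparison described above may have used different rules.

```python
def reciprocal_overlap(a, b, threshold=0.5):
    """True if calls a and b, given as (chrom, start, end) with half-open
    coordinates, each cover at least `threshold` of the other."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return False
    return (shared >= threshold * (end_a - start_a)
            and shared >= threshold * (end_b - start_b))


def concordance(calls_a, calls_b, threshold=0.5):
    """Fraction of all calls (from both sets) that have a match in the other set."""
    matched_a = sum(any(reciprocal_overlap(a, b, threshold) for b in calls_b)
                    for a in calls_a)
    matched_b = sum(any(reciprocal_overlap(b, a, threshold) for a in calls_a)
                    for b in calls_b)
    total = len(calls_a) + len(calls_b)
    return (matched_a + matched_b) / total if total else 0.0


if __name__ == "__main__":
    tool_1 = [("1", 100, 200), ("1", 500, 600)]
    tool_2 = [("1", 120, 210), ("2", 500, 600)]
    print(concordance(tool_1, tool_2))
```

The quadratic pairwise comparison is acceptable for small call sets; a sorted sweep or interval tree would be a natural replacement for genome-wide data.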

Mutations in the transcription factor encoding TFAP2A gene underlie branchio-oculo-facial syndrome (BOFS), a rare dominant disorder characterized by distinctive craniofacial, ocular, ectodermal and renal anomalies. To elucidate the range of ocular phenotypes caused by mutations in TFAP2A, we took three approaches. First, we screened a cohort of 37 highly selected individuals with severe ocular anomalies plus variable defects associated with BOFS for mutations or deletions in TFAP2A. We identified one individual with a de novo TFAP2A four amino acid deletion, a second individual with two non-synonymous variations in an alternative splice isoform TFAP2A2, and a sibling-pair with a paternally inherited whole gene deletion with variable phenotypic expression. Second, we determined that TFAP2A is expressed in the lens, neural retina, nasal process, and epithelial lining of the oral cavity and palatal shelves of human and mouse embryos—sites consistent with the phenotype observed in patients with BOFS. Third, we used zebrafish to examine how partial abrogation of the fish ortholog of TFAP2A affects the penetrance and expressivity of ocular phenotypes due to mutations in genes encoding bmp4 or tcf7l1a. In both cases, we observed synthetic, enhanced ocular phenotypes including coloboma and anophthalmia when tfap2a is knocked down in embryos with bmp4 or tcf7l1a mutations. These results reveal that mutations in TFAP2A are associated with a wide range of eye phenotypes and that hypomorphic tfap2a mutations can increase the risk of developmental defects arising from mutations at other loci.

Cancer is driven by somatically acquired point mutations and chromosomal rearrangements, conventionally thought to accumulate gradually over time. Using next-generation sequencing, we characterize a phenomenon, which we term chromothripsis, whereby tens to hundreds of genomic rearrangements occur in a one-off cellular crisis. Rearrangements involving one or a few chromosomes crisscross back and forth across involved regions, generating frequent oscillations between two copy number states. These genomic hallmarks are highly improbable if rearrangements accumulate over time and instead imply that nearly all occur during a single cellular catastrophe. The stamp of chromothripsis can be seen in at least 2%–3% of all cancers, across many subtypes, and is present in ∼25% of bone cancers. We find that one, or indeed more than one, cancer-causing lesion can emerge out of the genomic crisis. This phenomenon has important implications for the origins of genomic remodeling and temporal emergence of cancer.
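The copy number signature described above, frequent oscillation between two states, can be summarised with a small sketch. It assumes an ordered list of integer copy numbers for consecutive segments along a chromosome and is not a chromothripsis caller; the function name is hypothetical.

```python
def oscillation_summary(segment_copy_numbers):
    """Return (number of distinct copy number states, number of state switches)
    for an ordered list of per-segment copy numbers."""
    states = set(segment_copy_numbers)
    switches = sum(
        1
        for previous, current in zip(segment_copy_numbers, segment_copy_numbers[1:])
        if previous != current
    )
    return len(states), switches


if __name__ == "__main__":
    # Many switches confined to two states, as in the pattern described above.
    print(oscillation_summary([2, 1, 2, 1, 2, 1, 2]))
    # Few switches spread over many states, as expected from stepwise gains.
    print(oscillation_summary([2, 3, 4, 5]))
```

A region with many switches but only two states fits the single-event pattern, whereas progressive accumulation of rearrangements would be expected to produce a wider range of copy number states.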

Highlights

► 2%–3% cancers show 10s–100s of rearrangements localized to specific genomic regions
► Genomic features imply chromosome breaks occur in one-off crisis (“chromothripsis”)
► Found across all tumor types, especially common in bone cancers (up to 25%)
► Can generate several genomic lesions with potential to drive cancer in single event

Microarray-based Comparative Genomic Hybridization (array-CGH) has been applied for a decade to screen for submicroscopic DNA gains and losses in tumor and constitutional DNA samples. This method has become increasingly flexible with the integration of new biological resources generated by genome sequencing projects. In this chapter, we describe alternative strategies for whole genome screening and high resolution breakpoint mapping of copy number changes by array-CGH, as well as tools available for accurate analysis of array-CGH experiments. Although most methods listed here have been designed for microarrays composed of large-insert clones, they can be adapted easily to other types of microarray platforms, such as those constructed from printed or synthesized oligonucleotides.

The spatial resolution of microarray-based comparative genomic hybridization (array-CGH) is dependent on the length and density of target DNA sequences covering the chromosomal region of interest. Here we describe the methods developed at the Wellcome Trust Sanger Institute (Cambridge, UK) to construct microarrays composed of large-insert clones available through genome sequencing projects. These methods are applicable to Bacterial and Phage Artificial Chromosomes (BAC and PAC) as well as fosmid and cosmid clones. The protocols are scalable for the construction of microarrays composed of several hundred up to several tens of thousands of clones.

Array-CGH involves the comparison of a test to a reference genome using a microarray composed of target sequences with known chromosomal coordinates. The test and reference DNA samples are used as templates to generate two probe DNAs labeled with distinct fluorescent dyes. The two probe DNAs are co-hybridized on a microarray in the presence of Cot-1 DNA to suppress nonspecific hybridization of repeat sequences. After slide washes and drying, microarray images are acquired on a laser scanner and fluorescent intensities from every target sequence spot on the array are extracted using dedicated computer programs. Intensity ratios are calculated and normalized to enable data interpretation. Although the protocols explained in this chapter correspond primarily to the use of large-insert clone microarrays in either manual or automated fashion, necessary adaptations for hybridization on microarrays composed of shorter target DNA sequences are also briefly described.
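The ratio calculation and normalization step can be sketched as follows. Global median centering of log2 ratios is assumed here as one common choice; the chapter's own normalization procedure may differ, and the function name is hypothetical.

```python
import math
import statistics


def normalized_log2_ratios(test_intensities, reference_intensities):
    """Compute per-spot log2(test / reference) and centre the values on their
    median, so that the bulk of unchanged targets sits at zero."""
    if len(test_intensities) != len(reference_intensities):
        raise ValueError("intensity lists must have the same length")
    raw = [
        math.log2(test / reference)
        for test, reference in zip(test_intensities, reference_intensities)
    ]
    center = statistics.median(raw)
    return [value - center for value in raw]


if __name__ == "__main__":
    # Hypothetical background-subtracted spot intensities for five targets.
    test = [200.0, 400.0, 200.0, 100.0, 200.0]
    reference = [100.0, 100.0, 100.0, 100.0, 100.0]
    print(normalized_log2_ratios(test, reference))
```

Centering on the median rather than the mean keeps the baseline robust when a minority of targets carry genuine gains or losses.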

Hypoplastic left heart (HLH) occurs in at least 1 in 10 000 live births but may be more common in utero. Its causes are poorly understood but a number of affected cases are associated with chromosomal abnormalities. We set out to localize the breakpoints in a patient with sporadic HLH and a de novo translocation. Initial studies showed that the apparently simple 1q41;3q27.1 translocation was actually combined with a 4-Mb inversion, also de novo, of material within 1q41. We therefore localized all four breakpoints and found that no known transcription units were disrupted. However, we present a case, based on functional considerations, synteny and position of highly conserved non-coding sequence elements, and the heterozygous Prox1+/− mouse phenotype (ventricular hypoplasia), for the involvement of dysregulation of the PROX1 gene in the aetiology of HLH in this case. Accordingly, we show that the spatial expression pattern of PROX1 in the developing human heart is consistent with a role in cardiac development. We suggest that dysregulation of PROX1 gene expression due to separation from its conserved upstream elements is likely to have caused the heart defects observed in this patient, and that PROX1 should be considered as a potential candidate gene for other cases of HLH. The relevance of another breakpoint separating the cardiac gene ESRRG from a conserved downstream element is also discussed.