Program in Genetics and Genomic Biology, The Hospital for Sick Children and Department of Molecular and Medical Genetics, University of Toronto and The Centre for Applied Genomics, MaRS Centre, Toronto, Ontario, M5G 1L7, Canada.

Abstract

Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs and intermediate-sized variants (ISVs). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.

Overview of the different types of alignments and assembly differences extracted from the R27c and Build 35 genome assemblies. (a) Matched alignments account for the majority of the sequence. (b) Mismatches are small intra-alignment differences ≤10 bp in length. (c) Unmatched sequences are sequences that are present in one assembly but absent in the other. These sequences are candidates for insertion/deletion polymorphism. (d) Copy-unmatched sequences. This category contains sequences that are present in both assemblies but that have additional copies in one of the assemblies. Here we focus on regions >1 kb in size for which the additional copy has at least 98% identity. These sequences are candidates for copy number variation. (e) Inversions are sequences that appear in different orientation in the two assemblies. (f) Gaps are sequences represented by Ns. These can be aligned either to sequence or to gaps in the other assembly.
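The categories in this overview can be read as a simple decision procedure. The following is a minimal sketch of that procedure, not the pipeline used in the study: the `Segment` record, its field names, the `classify` function and the "unclassified" fallback are all hypothetical, and only the thresholds (≤10 bp for mismatches, >1 kb and at least 98% identity for copy-unmatched sequences) are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds stated in the overview above.
MISMATCH_MAX_BP = 10        # mismatches are intra-alignment differences <= 10 bp
COPY_MIN_BP = 1000          # copy-unmatched regions considered are > 1 kb
COPY_MIN_IDENTITY = 0.98    # additional copy must have at least 98% identity


@dataclass
class Segment:
    """Hypothetical record for one segment of an assembly comparison."""
    length_bp: int
    aligned: bool = False        # lies inside an alignment between the assemblies
    identical: bool = False      # aligned bases agree in both assemblies
    inverted: bool = False       # opposite orientation in the two assemblies
    is_gap: bool = False         # sequence represented by Ns
    extra_copy_identity: Optional[float] = None  # identity of an additional copy, if any


def classify(seg: Segment) -> str:
    """Assign a segment to one of the categories (a)-(f); the order of checks is an assumption."""
    if seg.is_gap:
        return "gap"
    if seg.inverted:
        return "inversion"
    if seg.aligned:
        if seg.identical:
            return "matched"
        if seg.length_bp <= MISMATCH_MAX_BP:
            return "mismatch"
        return "unclassified"
    if seg.extra_copy_identity is not None:
        # Present in both assemblies, with an additional copy in one of them.
        if seg.length_bp > COPY_MIN_BP and seg.extra_copy_identity >= COPY_MIN_IDENTITY:
            return "copy-unmatched"
        return "unclassified"
    # Present in one assembly and absent from the other.
    return "unmatched"
```

Under these assumptions, `classify(Segment(85_000))` yields "unmatched", and `classify(Segment(5_000, extra_copy_identity=0.99))` yields "copy-unmatched".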

Genome-wide overview of insertion points of unmatched and copy-unmatched sequences present in R27c with no corresponding match to Build 35. Each bar represents an insertion point, and the length of each bar indicates the size of the unmatched fragment (log scale). Green and red bars represent unmatched and copy-unmatched sequences, respectively. The data shown in this figure are based on anchored unmatched data and copy-unmatched data encompassing 13,837,593 bp.

Fosmid probes were used for FISH experiments to confirm the R27c mapping of unmatched sequences to Build 35 or to find a location for sequences with inconsistent or no mapping information. (a) Unmatched region, with no gap in Build 35. The human COPG2 locus on 7q32.2 contains 85 kb of unmatched sequence, which includes 11 exons of the COPG2 gene. The FISH results confirm a unique location at 7q32 for this sequence. There is no gap in Build 35 corresponding to this missing sequence. (b) Unmatched region, with no gap in Build 35. The FISH results verify the location of an unmatched sequence of ~100 kb at 6p12, where no gap is present in Build 35. (c) Unmatched region, with gap in Build 35. Confirmation of an unmatched sequence mapping to a Build 35 assembly gap at 4p16. (d) FISH results for a sequence mapped to chromosome 7 in R27c and chromosome 12 in Build 35. The sequence does not correspond to an annotated segmental duplication. The results indicate that the sequence is present on both chromosomes in each tested individual. (e) An unanchored scaffold assigned to chromosome 6 in R27c, with no match in Build 35. The results show localization to centromeric regions on chromosomes 3 and 6. (f) An unanchored scaffold mapping to chromosome 12 in R27c and chromosome 20 in Build 35, with segmental duplications mapping to chromosomes 7, 12, 15 and 20. The result confirms multiple mapping locations. Only one homolog of chromosome 22 consistently showed a signal, indicating that this sequence may be polymorphic.