Segmental duplications: organization and impact within the current human genome project assembly.

Department of Genetics and Center for Human Genetics, Case Western Reserve School of Medicine and University Hospitals of Cleveland, Cleveland, Ohio 44106, USA.

Abstract

Segmental duplications play fundamental roles in both genomic disease and gene evolution. To understand their organization within the human genome, we have developed the computational tools and methods necessary to detect identity between long stretches of genomic sequence despite the presence of high-copy repeats and large insertion-deletions. Here we present our analysis of the most recent genome assembly (January 2001), in which we focus on the global organization of these segments and the role they play in the whole-genome assembly process. Initially, we considered only large recent duplication events that fell well below levels of draft sequencing error (alignments 90%-98% similar and ≥1 kb in length). Duplications (90%-98%; ≥1 kb) comprise 3.6% of all human sequence. These duplications show clustering and up to 10-fold enrichment within pericentromeric and subtelomeric regions. In terms of assembly, duplicated sequences were found to be over-represented in unordered and unassigned contigs, indicating that duplicated sequences are difficult to assign to their proper position. To assess coverage of these regions within the genome, we selected BACs containing interchromosomal duplications and characterized their duplication pattern by FISH. Only 47% (106/224) of chromosomes positive by FISH had a corresponding chromosomal position by comparison. We present data that indicate that this is attributable to misassembly, misassignment, and/or decreased sequencing coverage within duplicated regions. Surprisingly, if we consider putative duplications >98% identity, we identify 10.6% (286 Mb) of the current assembly as paralogous. The majority of these alignments, we believe, represent unmerged overlaps within unique regions. Taken together, the above data indicate that segmental duplications represent a significant impediment to accurate human genome assembly, requiring the development of specialized techniques to finish these exceptional regions of the genome.
The identification and characterization of these highly duplicated regions represent an important step in the complete sequencing of a human reference genome.

Detection Method. The method combines DNA sequence analysis software and a suite of Perl scripts that are optimized for the detection of large, highly similar duplications. Briefly, the genome assembly (2.6 Gb) is broken into tractable 400-kb segments. For each segment, common repeats (blue) are identified with RepeatMasker. Repetitive sequence is then removed (“fuguized”), leaving putatively unique DNA. All fuguized pieces are then compared by BLAST. Repeats internal to an individual 400-kb segment are detected with BLASTZ. Relaxed affine gap parameters are used, allowing gaps up to 1 kb in size to be traversed. Fuguized pairwise alignments (>0.87 similarity and >500 aligned bp) have their common repeats reinserted, and the alignment ends then undergo heuristic trimming, which refines alignment end points that may lie within common repetitive sequence. The program ALIGN generates optimal global alignments from which final alignment statistics are calculated. Global alignments with >1000 bases aligned and >90% identity were selected in this analysis.
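
The fuguization step and the two thresholds described above can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation, which is not shown here. It assumes the repeats have been soft-masked (written in lowercase), and the function names `fuguize`, `to_original_span`, `passes_seed_threshold`, and `passes_final_threshold` are hypothetical. The sketch removes masked bases while recording where each retained base came from, so that an alignment found on fuguized sequence can be mapped back to a span of the original sequence with its repeats reinserted.

```python
from typing import List, Tuple


def fuguize(seq: str) -> Tuple[str, List[int]]:
    """Remove soft-masked (lowercase) bases from seq.

    Returns the putatively unique sequence and, for each retained base,
    its 0-based index in the original sequence.
    Assumption: repeats are marked in lowercase, unique DNA in uppercase.
    """
    kept = [(i, base) for i, base in enumerate(seq) if base.isupper()]
    unique = "".join(base for _, base in kept)
    index_map = [i for i, _ in kept]
    return unique, index_map


def to_original_span(index_map: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a half-open interval [start, end) on the fuguized sequence back to
    a half-open interval on the original sequence. The returned span includes
    any repeats that lie between the first and last retained base."""
    if not 0 <= start < end <= len(index_map):
        raise ValueError("interval lies outside the fuguized sequence")
    return index_map[start], index_map[end - 1] + 1


def passes_seed_threshold(similarity: float, aligned_bp: int) -> bool:
    """Threshold quoted for fuguized pairwise alignments."""
    return similarity > 0.87 and aligned_bp > 500


def passes_final_threshold(identity: float, aligned_bp: int) -> bool:
    """Threshold quoted for the final global alignments."""
    return identity > 0.90 and aligned_bp > 1000


if __name__ == "__main__":
    # Toy sequence: uppercase is unique DNA, lowercase is a masked repeat.
    toy = "ACGTacgtacgtTTGGCCaaaaGATTACA"
    unique, index_map = fuguize(toy)
    print(unique)
    print(to_original_span(index_map, 2, 12))
```

Keeping an explicit index map is one simple way to make repeat reinsertion reversible. The heuristic trimming of alignment ends and the global realignment with ALIGN are not modeled here.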

Example of pericentromeric duplication using the fuguization method. (A) A graphical view of the output of our method as displayed in the program PARASIGHT (J.A. Bailey, unpubl.). Compared to miropeats (B; Parsons 1995), all of the positions of similarity have been captured as continuous large alignments (C). An example of a large insertion-deletion in an alignment (D) demonstrates the ability of fuguization to traverse such regions, returning larger, more meaningful alignments. Lower thresholds (>500 aligned bases; >90% identity) were used for this test case compared to our genome analysis.

Genome-wide view of segmental duplications. The positions of alignments are depicted in red for each of the 24 chromosomes. Panels separate alignments on the basis of similarity: (A) 90%–98% identity and (B) 98%–100% identity. Purple bars depict centromeric gaps as well as the p-arms of acrocentric chromosomes (13, 14, 15, 21, and 22). Because of scale constraints, only alignments >5 kb are visible. Views were generated with the program PARASIGHT (J.A. Bailey, unpubl.), a graphical pairwise alignment viewer.

Integration of segmental duplications into assembly. The two pie charts divide the assembly contigs into ordered contigs and unordered (random and unlocated) contigs. Random contigs have chromosomal assignment but no specific position in the chromosome. Unlocated contigs have no chromosome position. Duplicated sequence represents 3% and 25% of the sequence in the ordered and unordered bins, respectively.
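
The per-bin percentages above are ratios of duplicated bases to total bases. A minimal Python sketch of that calculation follows. It assumes each contig comes with a list of half-open duplicated intervals, and it merges overlapping intervals so that a base covered by several alignments is counted once. The function names and the toy contigs are hypothetical. The toy numbers were chosen only to reproduce ratios of 3% and 25% and are not data from the analysis.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in contig coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def duplicated_fraction(contigs: List[Tuple[int, List[Interval]]]) -> float:
    """Fraction of bases in a bin of contigs covered by duplications.

    Each contig is given as (length, duplicated_intervals)."""
    total = sum(length for length, _ in contigs)
    covered = sum(
        end - start
        for _, intervals in contigs
        for start, end in merge_intervals(intervals)
    )
    return covered / total


if __name__ == "__main__":
    # Hypothetical toy bins, not data from the assembly.
    ordered = [(100_000, [(0, 2_000), (1_000, 3_000)])]
    unordered = [(8_000, [(0, 1_500), (1_000, 2_000)])]
    print(duplicated_fraction(ordered))
    print(duplicated_fraction(unordered))
```

Merging before summing matters because a duplicated segment typically aligns to several paralogs, and summing raw alignment lengths would overstate the duplicated fraction.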