Analysis of somatic alterations in cancer genomes has been accelerated by the rapid growth in the quantity, quality, and depth of data generated by next-generation sequencing (NGS). Previously, most cancer genome studies focused on single nucleotide variations (SNVs), small insertions and deletions (INDELs), or somatic copy number alterations (SCNAs). Recently, the field has shifted, with more effort devoted to characterizing large-scale structural variations (SVs) in various cancer genomes. However, there is still a pressing need for computational algorithms designed specifically to tackle the challenges posed by the complexity of cancer genomes.
The first part of this thesis develops a novel computational method called Weaver, which takes whole-genome sequencing (WGS) alignments as its core input and generates a precise rearrangement map for cancer genomes. Weaver identifies SVs at base-pair resolution and applies a probabilistic graphical model to simultaneously quantify the allele-specific copy number of SVs (ASCNS) and of genomic regions (ASCNG). In evaluations on simulated datasets with different parameter settings, Weaver was highly accurate and significantly refined the analysis of complex cancer genomes.
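To make the notion of allele-specific copy number concrete, the following is a minimal sketch of how a total copy number and its split into two alleles could be read off a single genomic region. It is not Weaver's probabilistic graphical model; it is a naive point estimate that assumes a pure sample (as in a cell line), a diploid normal baseline, and noise-free inputs. The function name `allele_specific_copy_numbers` and its parameters are hypothetical and introduced here only for illustration.

```python
def allele_specific_copy_numbers(depth_ratio, baf, normal_ploidy=2):
    """Naive allele-specific copy number estimate for one region.

    Illustrative assumptions (not taken from Weaver):
      depth_ratio   -- read depth in the region divided by the depth
                       expected for a normal diploid region
      baf           -- B-allele frequency at heterozygous germline sites
                       in the region, between 0 and 1
      normal_ploidy -- copy number of the normal baseline (2 for autosomes)

    Returns (major, minor): copy numbers of the more and less abundant allele.
    """
    if depth_ratio < 0 or not 0.0 <= baf <= 1.0:
        raise ValueError("depth_ratio must be >= 0 and baf must be in [0, 1]")

    # Total copy number: scale the baseline ploidy by the depth ratio and
    # round half up to the nearest integer.
    total = int(normal_ploidy * depth_ratio + 0.5)

    # The less abundant allele accounts for min(baf, 1 - baf) of the reads.
    minor_fraction = min(baf, 1.0 - baf)
    minor = int(total * minor_fraction + 0.5)
    major = total - minor
    return major, minor


# A region with 1.5x the normal depth and a BAF near 1/3 is consistent
# with three copies in total: two of one allele and one of the other.
print(allele_specific_copy_numbers(1.5, 0.33))  # (2, 1)
```

A model-based approach such as the one described above would instead infer these values jointly across regions and SV breakpoints, which this per-region sketch does not attempt.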
The second part of this thesis applies Weaver to two widely used cancer cell lines, MCF-7 and HeLa. For both cell lines, we generated base-pair resolution ASCNS and ASCNG for the first time. This detailed characterization of the MCF-7 and HeLa genomes may serve as a valuable resource for future studies based on these two cell lines, by allowing the reference genome to be replaced with cancer-specific genomes. We found that allele-specific expression can be explained by the profiled ASCNG in both cell lines. We also discovered that a large portion of the promoter-promoter interactions detected by ChIA-PET are formed between distal genomic regions that somatic translocations have made adjacent in the MCF-7 genome. This shows that phased SV analysis by Weaver enables the study of interactions between genomic rearrangements and long-range gene regulation at a much broader scale.
The last part of this thesis applies Weaver to large-scale primary tumor data comprising 600 TCGA WGS samples. To our knowledge, this is the largest whole-genome SV and base-pair resolution ASCNG analysis of primary cancer genomes to date. We analyzed two mechanisms of recurrent focal amplification, breakage-fusion-bridge (BFB) and tandem duplication (TD), and found that different frequently amplified focal regions are enriched in different tumor types. We proposed a new pan-cancer classification method, the first to utilize SV patterns, that categorizes the 600 TCGA samples across 17 tumor types into five subtypes with potential clinical relevance. This pan-cancer classification has potential for prognostic assessment of future patients regardless of their tumor type.
In summary, to better understand the landscape of structural alterations in cancer genomes, this thesis developed an algorithm that handles WGS data and specifically tackles the complexity of aneuploid cancer genomes. This integrative method, which combines the analysis of SVs and SCNAs, enabled novel findings when applied to cancer cell lines and primary tumors.