Publications

Walker et al., Aging (2015). We previously reported the unusual case of a teenage girl stricken with multifocal developmental dysfunctions whose physical development was dramatically delayed resulting in her appearing to be a toddler or at best a preschooler, even unto the occasion of her death at the age of 20 years. Her life-long physician felt that the disorder was unique in the world and that future treatments for age-related diseases might emerge from its study.

Hiltemann et al., Genome Research (2015). Tumor analyses commonly employ a correction with a matched normal (MN), a sample from healthy tissue of the same individual, in order to distinguish germline mutations from somatic mutations. Since the majority of variants found in an individual are thought to be common within the population, we constructed a set of 931 samples from healthy, unrelated individuals, originating from two different sequencing platforms, to serve as a virtual normal (VN) in the absence of such an associated normal sample.

Acuna-Hidalgo et al., AJHG (2015). De novo mutations are recognized both as an important source of genetic variation and as a prominent cause of sporadic disease in humans. Mutations identified as de novo are generally assumed to have occurred during gametogenesis and, consequently, to be present as germline events in an individual. Because Sanger sequencing does not provide the sensitivity to reliably distinguish somatic from germline mutations, the proportion of de novo mutations that occur somatically rather than in the germline remains largely unknown. To determine the contribution of post-zygotic events to de novo mutations, we analyzed a set of 107 de novo mutations in 50 parent-offspring trios.

Gilissen et al., Nature (2014). Severe intellectual disability (ID) occurs in 0.5% of newborns and is thought to be largely genetic in origin. The extensive genetic heterogeneity of this disorder requires a genome-wide detection of all types of genetic variation. Microarray studies and, more recently, exome sequencing have demonstrated the importance of de novo copy number variations (CNVs) and single-nucleotide variations (SNVs) in ID, but the majority of cases remains undiagnosed. Here we applied whole-genome sequencing to 50 patients with severe ID and their unaffected parents.

Fagny et al., Molecular Biology and Evolution (2014). Genome-wide scans for selection have identified multiple regions of the human genome as being targeted by positive selection. However, only a small proportion has been replicated across studies, and the prevalence of positive selection as a mechanism of adaptive change in humans remains controversial. Here we explore the power of two haplotype-based statistics – the integrated haplotype score (iHS) and the Derived Intra-allelic Nucleotide Diversity (DIND) test – in the context of next-generation sequencing data, and evaluate their robustness to demography and other selection modes.

Li et al., PLOS Genetics (2014). The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries.

Hiltemann et al., GigaScience (2014). Complete Genomics provides an open-source suite of command-line tools for the analysis of their CG-formatted mapped sequencing files. Determination of, for example, the functional impact of detected variants requires annotation with various databases that often require command-line and/or programming experience, thus limiting their use by the average research scientist. We have therefore implemented this CG toolkit, together with a number of annotation, visualisation and file manipulation tools, in Galaxy as CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy).

Ye et al., Twin Research and Human Genetics (2013). It has been postulated that aging is the consequence of an accelerated accumulation of somatic DNA mutations and that subsequent errors in the primary structure of proteins ultimately reach levels sufficient to affect organismal functions. The technical limitations of detecting somatic changes and the lack of insight about the minimum level of erroneous proteins to cause an error catastrophe hampered any firm conclusions on these theories. In this study, we sequenced the whole genome of DNA in whole blood of two pairs of monozygotic (MZ) twins, 40 and 100 years old, by two independent next-generation sequencing (NGS) platforms.

Huang et al., Computers in Biology and Medicine (2013). We present NGSPE, a pipeline for variation discovery and genotyping of pair-ended Illumina next generation sequencing (NGS) data (http://ngspeanalysis.sourceforge.net/). This pipeline not only describes a set of sequential analytical steps, such as short reads alignment, genotype calling and functional variation annotation that can be conducted using open-source software tools, but also provides users a set of scripts to install the dependent software and resources and implement the pipeline on their data.

Wang et al., Genome Medicine (2013). Whole-exome sequencing has identified the causes of several Mendelian diseases by analyzing multiple unrelated cases, but it is more challenging to resolve the cause of extremely rare and suspected Mendelian diseases from individual families. We identified a family quartet with two children, both affected with a previously unreported disease, characterized by progressive muscular weakness and cardiomyopathy, with normal intelligence. During the course of the study, we identified one additional unrelated patient with a comparable phenotype.

Moore et al., BMC Medical Genomics (2013). With the recent decreasing cost of genome sequence data, there has been increasing interest in rare variants and methods to detect their association to disease. We developed BioBin, a flexible collapsing method inspired by biological knowledge that can be used to automate the binning of low frequency variants for association testing.

Rieber et al., PLoS One (2013). The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics.

Petrini et al., PLoS One (2013). Molecular pathology of thymomas is poorly understood. Genomic aberrations are frequently identified in tumors but no extensive sequencing has been reported in thymomas. Here we present the first comprehensive view of a B3 thymoma at whole genome and transcriptome levels.

O’Rawe et al., Genome Medicine (2013). To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be.

Cui et al., JCI (2013). Anorexia nervosa and bulimia nervosa are common and severe eating disorders (EDs) of unknown etiology. Although genetic factors have been implicated in the psychopathology of EDs, a clear biological pathway has not been delineated. DNA from two large families affected by EDs was collected, and mutations segregating with illness were identified by whole-genome sequencing following linkage mapping or by whole-exome sequencing.

Schaaf et al., Nature Genetics (2013). Prader-Willi syndrome (PWS) is caused by the absence of paternally expressed, maternally silenced genes at 15q11-q13. We report four individuals with truncating mutations on the paternal allele of MAGEL2, a gene within the PWS domain.

Florisson et al., AJMG (2013). We describe a family that segregated an autosomal dominant form of craniosynostosis characterized by variable expression and limited extra-cranial features. Linkage analysis and genome sequencing were performed to identify the underlying genetic mutation.

Ma et al., PNAS (2013). Acute lymphoblastic leukemia (ALL) is the major pediatric cancer. At diagnosis, the developmental timing of mutations contributing critically to clonal diversification and selection can be buried in the leukemia’s covert natural history. Concordance of ALL in monozygotic, monochorionic twins is a consequence of intraplacental spread of an initiated preleukemic clone. Studying monozygotic twins with ALL provides a unique means of uncovering the timeline of mutations contributing to clonal evolution, pre- and postnatally.

Stubbs et al., Journal of Clinical Bioinformatics (2012). Next generation sequencing provides clinical research scientists with direct read out of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons.

Schraiber et al., Genetics (2012). We examine the distribution of heterozygous sites in nine European and nine Yoruban individuals whose genomic sequences were made publicly available by Complete Genomics. We show that it is possible to obtain detailed information about inbreeding when a relatively small set of whole-genome sequences is available.

Rosenfeld et al., PLoS One (2012). Data from the 1000 genomes project (1KGP) and Complete Genomics (CG) have dramatically increased the numbers of known genetic variants and challenge several assumptions about the reference genome and its uses in both clinical and research settings.

Yang et al., Molecular Biology and Evolution (2012). Neanderthals have been shown to share more genetic variants with present-day non-Africans than Africans. Recent admixture between Neanderthals and modern humans outside of Africa was proposed as the most parsimonious explanation for this observation. However, the hypothesis of ancient population structure within Africa could not be ruled out as an alternative explanation. We use simulations to test whether the site frequency spectrum, conditioned on a derived Neanderthal and an ancestral Yoruba (African) nucleotide (the doubly conditioned site frequency spectrum [dcfs]), can distinguish between models that assume recent admixture or ancient population structure.

Kiel et al., JEM (2012). Splenic marginal zone lymphoma (SMZL), the most common primary lymphoma of spleen, is poorly understood at the genetic level. In this study, using whole-genome DNA sequencing (WGS) and confirmation by Sanger sequencing, we observed mutations identified in several genes not previously known to be recurrently altered in SMZL.

Berg et al., Genetics in Medicine (2012). Next-generation sequencing has transformed genetic research and is poised to revolutionize clinical diagnosis. However, the vast amount of data and inevitable discovery of incidental findings require novel analytic approaches. We therefore implemented for the first time a strategy that utilizes an a priori structured framework and a conservative threshold for selecting clinically relevant incidental findings.

Nishiguchi and Rivolta, PLOS One (2012). Retinitis pigmentosa and other hereditary retinal degenerations (HRD) are rare genetic diseases leading to progressive blindness. Recessive HRD are caused by mutations in more than 100 different genes. Laws of population genetics predict that, on a purely theoretical ground, such a high number of genes should translate into an extremely elevated frequency of unaffected carriers of mutations. In this study we estimate the proportion of these individuals within the general population, via the analyses of data from whole-genome sequencing.

Lachance et al., Cell (2012). To reconstruct modern human evolutionary history and identify loci that have shaped hunter-gatherer adaptation, we sequenced the whole genomes of five individuals in each of three different hunter-gatherer populations at >60× coverage: Pygmies from Cameroon and Khoesan-speaking Hadza and Sandawe from Tanzania. We identify 13.4 million variants, substantially increasing the set of known human variation.

Su et al., BMC Bioinformatics (2012). Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies.

Veeramah et al., AJHG (2012). Individuals with severe, sporadic disorders of infantile onset represent an important class of disease for which discovery of the underlying genetic architecture is not amenable to traditional genetic analysis. Full-genome sequencing of affected individuals and their parents provides a powerful alternative strategy for gene discovery.

Molenaar et al., Nature (2012). Neuroblastoma is a childhood tumour of the peripheral sympathetic nervous system. The pathogenesis has for a long time been quite enigmatic, as only very few gene defects were identified in this often lethal tumour.

Jiang et al., Genome Research (2012). Hepatitis B virus (HBV) infection is a leading risk factor for hepatocellular carcinoma (HCC). HBV integration into the host genome has been reported, but its scale, impact and contribution to HCC development are not clear. Here, we sequenced the tumor and non-tumor genomes (>80X coverage) and transcriptomes of four HCC patients and identified 255 HBV integration sites.

Lam et al., Nature Biotechnology (2011). Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. Here we sequenced the genome of an individual with both technologies.

Yokoyama et al., Nature (2011). So far, two genes associated with familial melanoma have been identified, accounting for a minority of genetic risk in families. Mutations in CDKN2A account for approximately 40% of familial cases, and predisposing mutations in CDK4 have been reported in a very small number of melanoma kindreds.

Funk et al., Stem Cell Research (2011). Copy number variation (CNV) is a common chromosomal alteration that can occur during in vitro cultivation of human cells and can be accompanied by the accumulation of mutations in coding region sequences.

Roach et al., The American Journal of Human Genetics (2011). Assignment of alleles to haplotypes for nearly all the variants on all chromosomes can be performed by genetic analysis of a nuclear family with three or more children. Whole-genome sequence data enable deterministic phasing of nearly all sequenced alleles.

Nieminen et al., The American Journal of Human Genetics (2011). Craniosynostosis and supernumerary teeth most often occur as isolated developmental anomalies, but they are also separately manifested in several malformation syndromes.

Rios et al., Human Molecular Genetics (2010). Whole-genome sequencing is a potentially powerful tool for the diagnosis of genetic diseases. Here, we used sequencing-by-ligation to sequence the genome of an 11-month-old breast-fed girl with xanthomas and very high plasma cholesterol levels (1023 mg/dl).

Lee et al., Nature (2010). Lung cancer is the leading cause of cancer-related mortality worldwide. Non-small-cell lung carcinomas in smokers are the predominant form of the disease. Although previous studies have identified common somatic mutations in lung cancers, they primarily focused on a small set of genes.

Roach et al., Science (2010). We analyzed the whole-genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors (resulting in > 99.999% accuracy), and identify very rare single-nucleotide polymorphisms.

Cao et al., Nature Biotechnology (2015). The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome.

Peters et al., Genome Research (2015). Currently, the methods available for preimplantation genetic diagnosis (PGD) of in vitro fertilized (IVF) embryos do not detect de novo single-nucleotide and short indel mutations, which have been shown to cause a large fraction of genetic diseases. Detection of all these types of mutations requires whole-genome sequencing (WGS).

Drmanac et al., Clinical Chemistry (2014). Even 30 years ago, it was obvious that Sanger sequencing had limited throughput, and a more efficient process could replace many tedious gene and genome mapping projects. It would take until the mid-2000s for massively parallel sequencing (MPS) technologies to demonstrate they could overtake the Sanger sequencing hegemony. Our paper was not the first description of a viable MPS technology, but it firmly established that human whole genome sequencing could be done affordably (US$5000 in reagent cost), with high accuracy (1 error in 100 kb), and with high throughput, thus heralding the arrival of personal genome sequencing.

Peters et al., Frontiers in Genetics (2014). Next generation sequencing (NGS) technologies, primarily based on massively parallel sequencing (MPS), have touched and radically changed almost all aspects of research worldwide. These technologies have allowed for the rapid analysis, to date, of the genomes of more than 2,000 different species. In humans, NGS has arguably had the largest impact. Over 100,000 genomes of individual humans (based on various estimates) have been sequenced, allowing for deep insights into what makes individuals and families unique and what causes disease in each of us. Despite all of this progress, the current state of the art in sequence technology is far from generating a “perfect genome” sequence and much remains to be understood in the biology of human and other organisms’ genomes. In the article that follows, we outline why the “perfect genome” in humans is important, what is lacking from current human whole genome sequences, and a potential strategy for achieving the “perfect genome” in a cost-effective manner.

Peters et al., Nature (2012). Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes.