Study Characteristics

A primary goal of the TOPMed program is to improve scientific understanding of the fundamental biological processes that underlie heart, lung, blood, and sleep (HLBS) disorders. TOPMed is providing deep WGS and other omics data to pre-existing ‘parent’ studies having large samples of human subjects with rich phenotypic characterization and environmental exposure data.

Study Designs

As of October 2018, TOPMed consists of ~144k participants from >80 different studies with varying designs. Prospective cohorts provide large numbers of disease risk factors, subclinical disease measures, and incident disease cases; case-control studies provide large numbers of prevalent disease cases; extended family structures and population isolates provide improved power to detect rare variant effects. The phenotype pie chart below shows the numbers and percentages of participants in studies with a focus on HLBS, as well as the percentage belonging to cohort studies that have collected many different phenotypes. It also shows areas of focus within each of the major HLBS categories.

Sample numbers by phenotype area (N=144k total)

Participant Diversity

Achieving ancestral and ethnic diversity was a priority in selecting contributing studies. Currently, approximately 60% of the 144k participants have substantial non-European ancestry (see pie chart below, based on participant self-identification and study inclusion criteria). Discovery of genotype-phenotype associations frequently includes pooled analysis across ancestry groups and studies, using statistical models that account for population structure and relatedness.

Sample numbers by ancestry/ethnicity (N=144k total)
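
The pooled analyses mentioned above rely on models that account for population structure and relatedness. The text does not name a specific model; one common choice, shown here only as an illustrative assumption, is a linear mixed model in which ancestry principal components enter as fixed-effect covariates and a kinship matrix captures relatedness. All symbols below are introduced for illustration and do not come from the source.

```latex
\begin{aligned}
y &= X\alpha + g\beta + u + \varepsilon, \\
u &\sim \mathcal{N}\!\left(0,\ \sigma_g^2 \Phi\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I\right)
\end{aligned}
```

Here y is the vector of phenotype values, X holds covariates (for example age, sex, study, and ancestry principal components), g is the genotype vector for the variant being tested with effect β, and Φ is a kinship or genetic relationship matrix. Testing β = 0 while modeling u is what allows related and ancestrally diverse participants to be analyzed together.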

Whole Genome Sequencing

WGS was performed by several sequencing centers to a median depth of 39X, using DNA from blood, PCR-free library construction, and Illumina HiSeq X technology. A Support Vector Machine quality filter was trained with known variants and Mendelian-inconsistent variants. The Informatics Research Center conducts joint genotype calling across all available samples to produce genotype data “freezes.” In freeze 5, with variant discovery on ~65k samples, 438 million single nucleotide variants and 33 million short insertion/deletion variants were identified.
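
To make the term "Mendelian-inconsistent" concrete, the following minimal sketch checks whether a child's genotype at a biallelic autosomal variant could have been inherited from the two parental genotypes. It is an illustration only: the function names and the 0/1/2 alternate-allele encoding are assumptions, and the source does not describe how TOPMed actually detects inconsistencies or feeds them to the filter.

```python
from itertools import product


def transmittable_alleles(alt_count):
    """Alleles (0 = reference, 1 = alternate) that a parent carrying
    `alt_count` alternate alleles (0, 1, or 2) can pass to a child."""
    if alt_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 alternate alleles")
    return {0: {0}, 1: {0, 1}, 2: {1}}[alt_count]


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's alternate-allele count can be formed by
    one allele from the father plus one allele from the mother."""
    if child not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 alternate alleles")
    possible_child_genotypes = {
        paternal + maternal
        for paternal, maternal in product(
            transmittable_alleles(father), transmittable_alleles(mother)
        )
    }
    return child in possible_child_genotypes
```

For example, `is_mendelian_consistent(1, 0, 0)` is false, because a heterozygous child cannot arise from two homozygous-reference parents without a new mutation or a genotyping error. Variants that show many such violations across families are plausible examples of poor-quality calls.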

Resources for the Scientific Community

TOPMed data are being made available to the scientific community as a series of “data freezes”: genotypes and phenotypes via dbGaP; read alignments via the SRA; and variants via the Bravo variant server (see figure below) and dbSNP. Genotypes for a set of 18.5k samples have been released on dbGaP, another freeze of 55k samples is currently being released, and a freeze of >100k samples is planned starting early in 2019. TOPMed WGS data are contained in study-specific accessions with names containing “NHLBI TOPMed”, while most phenotypic data are in parent study accessions. The TOPMed accessions can be identified by searching the dbGaP web site for “TOPMed”. More information about which data are available and how to access them can be found on the Data Access page.
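
The dbGaP search described above can also be scripted. The sketch below uses the NCBI E-utilities ESearch endpoint and assumes that dbGaP is exposed there under the Entrez database name `gap` and that a plain term search returns results comparable to the web site search; neither assumption comes from the source. The function names are hypothetical, and the identifiers returned are Entrez UIDs rather than dbGaP accession numbers.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_dbgap_search_url(term="TOPMed", retmax=20):
    """Build an ESearch URL for the (assumed) dbGaP Entrez database 'gap'."""
    params = {"db": "gap", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_dbgap(term="TOPMed", retmax=20):
    """Run the search and return (total hit count, list of Entrez UIDs).

    Requires network access; the JSON layout used here is the standard
    ESearch response with an 'esearchresult' object.
    """
    with urlopen(build_dbgap_search_url(term, retmax), timeout=30) as response:
        payload = json.load(response)
    result = payload["esearchresult"]
    return int(result["count"]), result["idlist"]
```

Calling `search_dbgap("TOPMed")` performs the request. Building the URL separately makes it easy to inspect the query or paste it into a browser before running it.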

TOPMed is currently adding other omics assays to samples that have been whole-genome sequenced; these include RNAseq, metabolomics, proteomics, and epigenomics. These data will become available via dbGaP starting in 2019.

Overview of Bravo variant server resources

This content was adapted from a poster presented at the 2018 American Society of Human Genetics (ASHG) meeting, “Overview of the NHLBI Trans-Omics for Precision Medicine (TOPMed) program: Whole genome sequencing of >100,000 deeply phenotyped individuals” (Poster 3145/T).