Although recent Genome-Wide Association Studies have identified novel associations for common variants, there has been no comprehensive exome-wide search for low-frequency variants that affect the risk of venous thromboembolism (VTE). We conducted a meta-analysis of 11 studies comprising 8,332 cases and 16,087 controls of European ancestry and 382 cases and 1,476 controls of African American ancestry genotyped with the Illumina HumanExome BeadChip. We used the seqMeta package in R to conduct single variant and gene-based rare variant tests...

Given the rapid pace with which genomics and other -omics disciplines are evolving, it is sometimes necessary to shift down a gear to consider more general scientific questions. In this vein, my presidential address formulates six questions for genetic epidemiologists to ponder. These cover the areas of reproducibility, statistical significance, chance findings, precision medicine, and related fields such as bioinformatics and data science. Possible hints at responses are presented to foster further discussion of these topics...

Analysis of the X chromosome has been largely neglected in genetic studies mainly because of complex underlying biological mechanisms. On the other hand, the study of human microbiome data (typically over-dispersed counts with an excess of zeros) has generated great interest recently because of advancements in next-generation sequencing technologies. We propose a novel approach to infer the association between host genetic variants in the X-chromosome and microbiome data. The method accounts for random X-chromosome inactivation (XCI), skewed (or nonrandom) XCI (XCI-S), and escape of XCI (XCI-E)...

When interpreting genome-wide association peaks, it is common to annotate each peak by searching for genes with plausible relationships to the trait. However, "all that glitters is not gold": one might interpret apparent patterns in the data as plausible even when the peak is a false positive. Accordingly, we sought to see how human annotators interpreted association results containing a mixture of peaks from both the original trait and a genetically uncorrelated "synthetic" trait. Two of us prepared a mix of original and synthetic peaks of three significance categories from five different scans along with relevant literature search results and then we all annotated these regions...

When testing genotype-phenotype associations using linear regression, departure of the trait distribution from normality can impact both Type I error rate control and statistical power, with worse consequences for rarer variants. Because genotypes are expected to have small effects (if any) investigators now routinely use a two-stage method, in which they first regress the trait on covariates, obtain residuals, rank-normalize them, and then use the rank-normalized residuals in association analysis with the genotypes...
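The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single covariate, uses ordinary least squares for the first stage, and uses the rank offset (r - 0.5)/n for the inverse normal transform, which is one of several common choices. The function names are hypothetical.

```python
from statistics import NormalDist, mean


def ols_residuals(y, x):
    """Stage 1: residuals from a simple linear regression of y on one covariate x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def rank_normalize(values):
    """Stage 2: rank-based inverse normal transform (tied values share the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    # Assumed offset: (rank - 0.5) / n keeps every quantile strictly inside (0, 1).
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


def two_stage_trait(trait, covariate):
    """Regress the trait on the covariate, then rank-normalize the residuals."""
    return rank_normalize(ols_residuals(trait, covariate))
```

The output of `two_stage_trait` would then replace the raw trait in the genotype association model.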

In Mendelian randomization (MR), inference about the causal relationship between a phenotype of interest and a response or disease outcome can be obtained by constructing instrumental variables from genetic variants. However, MR inference requires three assumptions, one of which is that the genetic variants only influence the outcome through the phenotype of interest. Pleiotropy, that is, the situation in which some genetic variants affect more than one phenotype, can invalidate these genetic variants for use as instrumental variables; thus a naive analysis will give biased estimates of the causal relation...
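For orientation, a naive summary-statistic MR analysis of the kind that pleiotropy can bias might look like the sketch below. It is an assumption of this example, not something stated in the abstract, that the naive analysis is the per-variant Wald ratio combined by inverse-variance weighting (IVW) with first-order weights. The function and argument names are hypothetical.

```python
def wald_ratio(beta_outcome, beta_exposure):
    """Per-variant causal estimate: variant-outcome effect over variant-exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted combination of Wald ratios (first-order weights).

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator
```

If every variant is a valid instrument, the Wald ratios agree up to sampling noise and the IVW estimate is close to each of them. A pleiotropic variant shifts its own ratio and pulls the combined estimate with it.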

Whole-exome sequencing (WES) and whole-genome sequencing (WGS) studies are underway to investigate the impact of genetic variants on complex diseases and traits. It is customary to perform single-variant association tests for common variants and region-based association tests for rare variants. The latter may target variants with similar or opposite effects, interrogate variants with different frequencies or different functional annotations, and examine a variety of regions. The large number of tests that are performed necessitates adjustment for multiple testing...
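As a baseline for the multiple-testing adjustment mentioned above, the sketch below applies a Bonferroni correction. This is only the simplest possible adjustment and is an assumption of this example; the abstract does not say which adjustment the study uses. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni threshold for this set of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]
```

With one million tests and alpha = 0.05, `bonferroni_threshold` gives the familiar genome-wide threshold of about 5e-8.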

One of the most important research areas in case-control Genome-Wide Association Studies is to determine how the effect of a genotype varies across the environment or to measure the gene-environment interaction (G × E). We consider the scenario when some of the "healthy" controls actually have the disease and when the frequency of these latent cases varies by the environmental variable of interest. In this scenario, performing logistic regression with the clinically diagnosed disease status as an outcome variable will result in biased estimates of G × E interaction...

In metagenomic studies, testing the association between microbiome composition and clinical outcomes translates to testing the nullity of variance components. Motivated by a lung human immunodeficiency virus (HIV) microbiome project, we study longitudinal microbiome data by using variance component models with more than two variance components. Current testing strategies only apply to models with exactly two variance components and when sample sizes are large. Therefore, they are not applicable to longitudinal microbiome studies...

Single-cell microscopy image analysis has proved invaluable in protein subcellular localization for inferring gene/protein function. Fluorescent-tagged proteins across cellular compartments are tracked and imaged in response to genetic or environmental perturbations. Because high-content microscopy generates a large number of images and manual labeling is both labor-intensive and error-prone, machine learning offers a viable alternative for automatic labeling of subcellular localizations. Meanwhile, in recent years applications of deep learning methods to large datasets in natural images and other domains have become quite successful...

The transmission disequilibrium test (TDT) is the gold standard for testing the association between a genetic variant and disease in samples consisting of affected individuals and their parents. In practice, more complex pedigree structures, such as siblings with no parents or three-generational pedigrees with possibly missing genotypes, are common. There are several generalizations of the TDT that are suitable for use with arbitrary pedigree structures. We consider three such frequently used generalizations, the family-based association test, the pedigree disequilibrium test, and the generalized disequilibrium test, that have accompanying software and compare them regarding validity and power in the single variant setting...
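For reference, the classical TDT for a biallelic variant in parent-offspring trios can be sketched as below. It covers only the basic test, not the three generalizations compared in the study. The statistic is the McNemar-type quantity (b - c)^2 / (b + c), where b and c count heterozygous parents who did and did not transmit the allele of interest to an affected child. The function name is hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """Classical TDT from transmission counts of heterozygous parents.

    Returns the chi-square statistic (1 degree of freedom) and its
    asymptotic p-value.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, 60 transmissions against 40 non-transmissions gives a statistic of 4.0.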

Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits...

Understanding the genetic and metabolic bases of obesity is helpful in planning and developing health strategies. Therefore, the first family-based joint linkage and linkage disequilibrium study was conducted in Iranian pedigrees to assess the relationship between obesity and single-nucleotide polymorphisms (SNPs) located in the 16q12.2 region. In the present study, a total of 13,344 individuals were included, of whom 12,502 individuals were within 3,109 pedigrees and 842 were unrelated singletons. To investigate the relationship between obesity and genetic variants, a joint model of linkage and linkage disequilibrium was applied...

The reproducibility of scientific processes is one of the paramount problems of bioinformatics, an engineering problem that must be addressed to perform good research. The System for Quality-Assured Data Analysis (SyQADA), described here, seeks to address reproducibility by managing many of the details of procedural bookkeeping in bioinformatics in as simple and transparent a manner as possible. SyQADA has been used by persons with backgrounds ranging from expert programmer to Unix novice, to perform and repeat dozens of diverse bioinformatics workflows on tens of thousands of samples, consuming over 80 CPU-months of computing on over 300,000 individual tasks of scores of projects on laptops, computer servers, and computing clusters...

We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association...

Loss of function variants in NOTCH1 cause left ventricular outflow tract obstructive defects (LVOTO). However, the risk conferred by rare and noncoding variants in NOTCH1 for LVOTO remains largely uncharacterized. In a cohort of 49 families affected by hypoplastic left heart syndrome, a severe form of LVOTO, we discovered predicted loss of function NOTCH1 variants in 6% of individuals. Rare or low-frequency missense variants were found in 16% of families. To make a quantitative estimate of the genetic risk posed by variants in NOTCH1 for LVOTO, we studied associations of 400 coding and noncoding variants in NOTCH1 in 1,085 cases and 332,788 controls from the UK Biobank...

When evaluating a newly developed statistical test, an important step is to check its type 1 error (T1E) control using simulations. This is often achieved by the standard simulation design S0 under the so-called "theoretical" null of no association. In practice, whole-genome association analyses scan through a large number of genetic markers (Gs) for the ones associated with an outcome of interest (...
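A T1E check under a null of no association can be sketched as below. The details are assumptions of this example rather than the paper's design: genotypes are drawn as Binomial(2, MAF), the outcome is standard normal and generated independently of the genotype, and the test is the asymptotic score test n·r², referred to a 1-df chi-square. All function names and default parameter values are hypothetical.

```python
import math
import random


def score_test_p(g, y):
    """Asymptotic p-value of the score test n * r^2 for linear association of y with g."""
    n = len(g)
    mg, my = sum(g) / n, sum(y) / n
    sgg = sum((a - mg) ** 2 for a in g)
    syy = sum((b - my) ** 2 for b in y)
    if sgg == 0 or syy == 0:
        return 1.0  # monomorphic marker or constant outcome: nothing to test
    sgy = sum((a - mg) * (b - my) for a, b in zip(g, y))
    stat = n * sgy * sgy / (sgg * syy)
    return math.erfc(math.sqrt(stat / 2))  # upper tail of a 1-df chi-square


def empirical_t1e(n=200, maf=0.3, alpha=0.05, n_rep=1000, seed=1):
    """Fraction of null replicates in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        # Genotype = number of minor alleles across two independent draws.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        # Outcome generated independently of g, so the null holds by construction.
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if score_test_p(g, y) < alpha:
            rejections += 1
    return rejections / n_rep
```

A well-calibrated test should give an empirical rate close to `alpha`, up to Monte Carlo error of roughly sqrt(alpha(1 - alpha)/n_rep).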

Observational studies find an association between increased body mass index (BMI) and short self-reported sleep duration in adults. However, the biological mechanisms that underpin these associations are unclear. Recent findings from the UK Biobank suggest a weak genetic correlation between BMI and self-reported sleep duration. However, the potential shared genetic aetiology between these traits has not been examined using a comprehensive approach. To investigate this, we created a polygenic risk score (PRS) of BMI and examined its association with self-reported sleep duration in a combination of individual participant data and summary-level data, with a total sample size of 142,209 individuals...
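In its simplest form, a PRS is a weighted sum of risk-allele dosages, with weights taken from per-variant effect estimates for the trait (here BMI). The sketch below shows that generic construction only. The study's variant selection and weighting are not given in the abstract, and the function name and inputs are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 per variant) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))
```

For instance, dosages `[0, 1, 2]` with weights `[0.1, 0.2, 0.3]` give a score of 0.8. The per-individual scores would then be tested for association with the second trait, here self-reported sleep duration.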

It is unclear whether insertions and deletions (indels) are more likely to influence complex traits than abundant single-nucleotide polymorphisms (SNPs). We sought to understand which category of variation is more likely to impact health. Using the SardiNIA study as an exemplar, we characterized 478,876 common indels and 8,246,244 common SNPs in up to 5,949 well-phenotyped individuals from an isolated valley in Sardinia. We assessed association with 120 traits, resulting in 89 nonoverlapping associated loci...