Revision as of 11:29, 21 August 2012

Introduction

Genome-wide association studies (GWAS) search for correlations between genetic markers (usually single nucleotide polymorphisms, or SNPs) and a measurable trait in a population of individuals. The motivation is that such associations could provide new candidates for causal variants in genes (or their regulatory elements) that play a role in the phenotype of interest. In the clinical context this may eventually lead to a better understanding of the genetic components of diseases and their risk factors.

Our current focus is on the Cohorte Lausanne (CoLaus), a population-based sample of more than 6,000 individuals from the Lausanne area. The CoLaus phenotypic dataset covers a wide range of measurements, including extensive blood chemistry, anatomical and physiological measures, as well as parameters related to lifestyle and medical history. Genotypes have been measured for ~500,000 SNPs using Affymetrix 500k SNP arrays. Regressing the various phenotypes onto these SNPs has already revealed a number of highly significant associations (see our publications).

Current GWAS usually include the following steps:

genotype calling from the raw chip-data and basic quality control

principal component analysis (PCA) to detect and possibly correct for population stratification

testing for association between a single SNP and continuous or categorical phenotypes

global significance analysis and correction for multiple testing

data presentation (e.g. using quantile-quantile and Manhattan plots)

cross-replication and meta-analysis for integration of association data from multiple studies
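
The association-testing and multiple-testing steps above can be sketched on simulated data. This is a minimal illustration, not the pipeline used for CoLaus: the sample size, number of SNPs, allele frequency and effect size are made-up values, the function name `single_snp_test` is hypothetical, the p-value uses a normal approximation to the t-distribution (reasonable only for large samples), and no correction for population stratification is attempted.

```python
import math
import random


def single_snp_test(dosages, phenotype):
    """Regress a phenotype on allele dosage (0/1/2) for one SNP.

    Returns (beta, standard_error, two_sided_p). The p-value uses a
    normal approximation, which is an assumption suited to large samples.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0.0:
        return 0.0, float("inf"), 1.0  # monomorphic SNP: nothing to test
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0.0:
        return beta, 0.0, 0.0  # perfect fit (degenerate case)
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return beta, se, p


# Simulated toy data (all values below are illustrative assumptions).
random.seed(0)
n_individuals, n_snps, maf = 2000, 200, 0.3


def draw_dosage():
    # Two independent allele draws give a dosage of 0, 1 or 2.
    return (random.random() < maf) + (random.random() < maf)


genotypes = [[draw_dosage() for _ in range(n_individuals)] for _ in range(n_snps)]
# Only SNP 0 has a true effect on the simulated phenotype.
phenotype = [0.5 * g + random.gauss(0.0, 1.0) for g in genotypes[0]]

results = [single_snp_test(snp, phenotype) for snp in genotypes]

# Bonferroni correction: divide the nominal level by the number of tests.
alpha = 0.05
bonferroni_threshold = alpha / n_snps
hits = [i for i, (_, _, p) in enumerate(results) if p < bonferroni_threshold]
print("Bonferroni threshold:", bonferroni_threshold)
print("SNPs passing the threshold:", hits)
```

In practice a tool such as PLINK would run these tests, typically with covariates such as sex, age and principal components added to the regression.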

From the many GWAS performed in recent years it has become apparent that even well-powered (meta-)studies with many thousands (and even tens of thousands) of samples could at best identify a few (dozen) candidate loci with highly significant associations. While many of these associations have been replicated in independent studies, each locus explains only a tiny (<1%) fraction of the genetic variance of the phenotype (as predicted from twin studies). Remarkably, models that pool all significant loci into a single predictive scheme still fall short by at least one order of magnitude in explained variance. Thus, while GWAS already provide new candidates for disease-associated genes and potential drug targets, very few of the currently identified (sets of) genotypic markers are of any practical use for assessing the risk of predisposition to any of the complex diseases that have been studied.
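
To see why a single locus typically explains so little, here is a small worked calculation. It assumes a purely additive effect and Hardy-Weinberg equilibrium, in which case the variance contributed by a SNP is 2p(1-p)β², with p the allele frequency and β the per-allele effect. The numbers (allele frequency 0.3, effect of 0.1 phenotypic standard deviations, 20 independent loci) are hypothetical and chosen only for illustration.

```python
def variance_explained(maf, beta, phenotype_variance=1.0):
    """Fraction of phenotypic variance explained by one additive SNP.

    Assumes Hardy-Weinberg equilibrium, so the dosage variance is
    2 * maf * (1 - maf).
    """
    return 2.0 * maf * (1.0 - maf) * beta ** 2 / phenotype_variance


# Hypothetical locus: allele frequency 0.3, effect of 0.1 standard deviations.
single_locus = variance_explained(maf=0.3, beta=0.1)

# Pooling 20 such loci, assuming they are independent and equally strong.
pooled = 20 * single_locus

print(f"one locus: {single_locus:.2%} of the variance")
print(f"20 loci:   {pooled:.2%} of the variance")
```

Under these assumptions a locus with a clearly detectable effect still accounts for well under 1% of the variance, and even 20 such loci together stay below 10%.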

Various solutions to this apparent enigma have been proposed. First, it is important to realize that the expected heritabilities have usually been estimated from twin studies, often several decades ago. It has been argued that these estimates entail problems of their own (independently raised twins shared a common prenatal environment and may have undergone intrauterine competition, etc.).

Second, the genotypic information is still incomplete. Most analyses used microarrays probing only around half a million SNPs, which is almost one order of magnitude fewer than the current estimate of about 4 million common variants from the HapMap CEU panel. While many of these SNPs can be imputed accurately using information on linkage disequilibrium, there remains a significant fraction of SNPs that are poorly tagged by the measured SNPs. Furthermore, rare variants with a minor allele frequency (MAF) of less than 1% are not assessed at all by SNP chips, but may nevertheless be the causal agents for many phenotypes. Finally, other genetic variants such as copy number variations (CNVs) (or even epigenetics) may also play an important role.
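
The two quantities used in this argument, minor allele frequency and how well one SNP tags another, can be computed directly from allele dosages. The sketch below is illustrative only: the function names and toy genotype vectors are made up, the squared correlation of dosages is used as an approximation to the haplotype-based r² measure of linkage disequilibrium, and the cutoff of 0.8 for "well tagged" is a common convention rather than something stated in the text.

```python
def minor_allele_frequency(dosages):
    """MAF from allele dosages (0/1/2) of diploid individuals."""
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1.0 - freq)


def genotype_r2(a, b):
    """Squared Pearson correlation between the dosages of two SNPs."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a monomorphic SNP carries no tagging information
    return cov * cov / (var_a * var_b)


# Toy dosages for eight individuals (hypothetical data).
typed_snp = [0, 1, 2, 0, 1, 0, 0, 1]
well_tagged = [2 - g for g in typed_snp]   # same SNP, opposite allele coding
poorly_tagged = [0, 0, 0, 1, 0, 0, 1, 0]   # largely unrelated variant

RARE_MAF = 0.01       # threshold for rare variants mentioned in the text
TAGGING_R2 = 0.8      # conventional cutoff, an assumption of this sketch

for name, snp in [("well_tagged", well_tagged), ("poorly_tagged", poorly_tagged)]:
    maf = minor_allele_frequency(snp)
    r2 = genotype_r2(typed_snp, snp)
    print(name, "MAF:", maf, "rare:", maf < RARE_MAF, "tagged:", r2 >= TAGGING_R2)
```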

Third, it is important to realize that current analyses usually employ only additive models considering one SNP at a time with few, if any, covariates, such as sex, age and principal components reflecting population substructure. This obviously covers only a small subset of all possible interactions between genetic variants and the environment. Even more challenging is taking into account purely genetic interactions, since the number of all possible pairwise interactions alone scales as the square of the number of genetic markers.
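
The quadratic scaling can be made concrete with a short calculation for a 500k array. The marker count is taken from the text; the nominal significance level of 0.05 and the use of a Bonferroni correction are assumptions of this sketch.

```python
import math

n_snps = 500_000                  # markers on a 500k array
n_pairs = math.comb(n_snps, 2)    # unordered SNP pairs, n * (n - 1) / 2

alpha = 0.05                      # nominal significance level (assumed)
single_threshold = alpha / n_snps   # Bonferroni threshold, single-SNP scan
pair_threshold = alpha / n_pairs    # Bonferroni threshold, all-pairs scan

print("single-SNP tests:", n_snps)
print("pairwise tests:  ", n_pairs)
print("single-SNP threshold:", single_threshold)
print("pairwise threshold:  ", pair_threshold)
```

An exhaustive pairwise scan therefore involves roughly 1.25 × 10^11 tests, and the significance threshold drops from about 10^-7 to about 4 × 10^-13, which costs statistical power as well as computing time.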

Further reading

For an introduction to GWAS, with an emphasis on human studies, you could start with a nice tutorial article [1], and a review of more recent issues [2]. There is also a nice review about approaches for rodent studies [3].

More Advanced Statistical Methodology

An important and widely used approach to dealing with cryptic population structure [4], and key references on genotype imputation [5][6].

A powerful approach to deal with strain structure or relatedness between individuals [7].

Software

PLINK is an excellent data handling tool and implements many useful statistical methods. It's the Swiss Army knife for GWAS.

EIGENSOFT is widely used for population structure analysis and correction.