The relationship between imputation error and statistical power in genetic association studies in diverse populations.

Abstract

Genotype-imputation methods provide an essential technique for high-resolution genome-wide association (GWA) studies with millions of single-nucleotide polymorphisms. For optimal design and interpretation of imputation-based GWA studies, it is important to understand the connection between imputation error and power to detect associations at imputed markers. Here, using a 2x3 chi-square test, we describe a relationship between genotype-imputation error rates and the sample-size inflation required for achieving statistical power at an imputed marker equal to that obtained if genotypes at the marker were known with certainty. Surprisingly, typical imputation error rates (approximately 2%-6%) lead to a large increase in the required sample size (approximately 10%-60%), and in some African populations whose genotypes are particularly difficult to impute, the required sample-size increase is as high as approximately 30%-150%. In most populations, each 1% increase in imputation error leads to an increase of approximately 5%-13% in the sample size required for maintaining power. These results imply that in GWA sample-size calculations investigators will need to account for a potentially considerable loss of power from even low levels of imputation error and that development of additional genomic resources that decrease imputation error will translate into substantial reduction in the sample sizes needed for imputation-based detection of the variants that underlie complex human diseases.

Genotype Misclassification Rates at Imputed Loci, in Each of 29 Populations. Each bar plot presents a particular error rate ɛij, in which ɛij represents the probability that genotype i is imputed as genotype j (1, minor allele homozygote; 2, heterozygote; 3, major allele homozygote). For each population, the greatest of the six error rates is shown in a color characteristic of the geographic region of the population. For convenience in interpreting the figure, the vertical dashed line indicates 15% error. The values plotted in the figure appear together with the overall imputation error rate in .

Sample-Size Inflation Factor f Required for Maintaining Statistical Power at Imputed Loci, as a Function of the True Difference in the Frequency of the Minor Allele between Cases and Controls. Each plot utilizes the estimated imputation error rates in for a specific population. For each population, the inflation factor is plotted for five choices of the true minor allele frequency in controls (0.05, 0.15, 0.25, 0.35, and 0.45). Note that MAFcontrols ranges from 0 to 0.5, whereas MAFcases, representing the frequency in cases of the minor allele in controls, ranges from 0 to 1. We used a step size of 0.001 for MAFcases and disregarded points with MAFcases = MAFcontrols.
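A minimal sketch of how an inflation factor of this kind can be computed is given here. It rests on assumptions not stated in the caption: Hardy-Weinberg genotype proportions in both cases and controls, equal numbers of cases and controls, and f taken as the ratio of the noncentrality of the 2x3 chi-square test with true genotypes to that with imputed genotypes. The function names and the error values in the usage example are hypothetical and are not estimates for any population.

```python
def hwe_genotype_freqs(maf):
    """Frequencies of (minor homozygote, heterozygote, major homozygote),
    assuming Hardy-Weinberg proportions."""
    return [maf * maf, 2.0 * maf * (1.0 - maf), (1.0 - maf) ** 2]


def apply_imputation_error(freqs, eps):
    """Map true genotype frequencies to imputed genotype frequencies.

    eps[i][j] (i != j) is the probability that true genotype i is imputed
    as genotype j; diagonal entries of eps are ignored and replaced by
    one minus the off-diagonal sum of the row.
    """
    observed = [0.0, 0.0, 0.0]
    for i in range(3):
        off_diag = sum(eps[i][j] for j in range(3) if j != i)
        for j in range(3):
            p_ij = 1.0 - off_diag if i == j else eps[i][j]
            observed[j] += freqs[i] * p_ij
    return observed


def ncp_per_pair(case_freqs, control_freqs):
    """Noncentrality of the 2x3 chi-square test per case-control pair,
    assuming equal numbers of cases and controls."""
    total = 0.0
    for a, b in zip(case_freqs, control_freqs):
        if a + b > 0.0:  # skip genotype classes that are empty in both groups
            total += (a - b) ** 2 / (a + b)
    return total


def inflation_factor(maf_cases, maf_controls, eps):
    """Sample-size inflation factor f: ratio of the noncentrality with true
    genotypes to the noncentrality with imputed genotypes."""
    if maf_cases == maf_controls:
        raise ValueError("f is undefined when MAFcases equals MAFcontrols")
    true_cases = hwe_genotype_freqs(maf_cases)
    true_controls = hwe_genotype_freqs(maf_controls)
    ncp_true = ncp_per_pair(true_cases, true_controls)
    ncp_imputed = ncp_per_pair(
        apply_imputation_error(true_cases, eps),
        apply_imputation_error(true_controls, eps),
    )
    return ncp_true / ncp_imputed


# Hypothetical error matrix: every off-diagonal rate set to 2%.
example_eps = [
    [0.00, 0.02, 0.02],
    [0.02, 0.00, 0.02],
    [0.02, 0.02, 0.00],
]
f_example = inflation_factor(0.35, 0.30, example_eps)
```

Because the noncentrality grows linearly with sample size and the degrees of freedom are unchanged, multiplying the sample size by this ratio restores the noncentrality, and hence the power, of the error-free test under the stated assumptions.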

Maximal and Minimal Sample-Size Inflation Factors at Imputed Loci as Functions of the True Minor Allele Frequency in Controls, in Each of 29 Populations. For each value of MAFcontrols from 0.005 to 0.5 with a step size of 0.005, the value plotted is the maximal or minimal value of the inflation factor f obtained across choices of MAFcases ranging from 0 to 1 with a step size of 0.001 (MAFcases ≠ MAFcontrols). Graphs for individual populations are color-coded by geographic region. (A) Maximal sample-size inflation factor. (B) Minimal sample-size inflation factor.

Maximal and Minimal Sample-Size Inflation Factors as Functions of the Overall Imputation Error Rate, for an Imputed Disease Locus with a True Minor Allele Frequency of 0.3 in Controls. Populations are color-coded by geographic region, and two data points appear for each population: a maximum and a minimum. Best-fit linear-regression lines for the maxima and minima, forced through the point (0,1), indicate the increase in the inflation factor with increasing imputation error rate. For example, the lines indicate that in most populations, at MAFcontrols = 0.3, imputation error rates of 2%–6% correspond to sample-size inflation factors of ∼14%–53%, and each additional 1% increase in imputation error corresponds to a ∼7%–10% increase in the inflation factor.
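A regression line forced through a fixed point such as (0,1) has a closed-form least-squares slope, sketched here. The function name and the data points are hypothetical illustrations and are not values read from the figure.

```python
def slope_through_anchor(xs, ys, x0=0.0, y0=1.0):
    """Least-squares slope of a line constrained to pass through (x0, y0).

    Minimizes sum((y - y0 - b * (x - x0)) ** 2) over b, which gives
    b = sum((x - x0) * (y - y0)) / sum((x - x0) ** 2).
    """
    numerator = sum((x - x0) * (y - y0) for x, y in zip(xs, ys))
    denominator = sum((x - x0) ** 2 for x in xs)
    if denominator == 0.0:
        raise ValueError("at least one x must differ from the anchor x0")
    return numerator / denominator


# Hypothetical points: overall imputation error (in %) and inflation factor.
error_pct = [2.0, 3.5, 5.0, 6.0]
inflation = [1.15, 1.30, 1.42, 1.55]
slope = slope_through_anchor(error_pct, inflation)

# Predicted inflation factor at a 4% error rate under the fitted line.
predicted_at_4pct = 1.0 + slope * 4.0
```

With the error rate expressed in percent, the slope is the increase in the inflation factor per additional 1% of imputation error, which is the quantity the caption reports.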

Cost Coefficients as Functions of MAFcases for the Fixed Value MAFcontrols = 0.3. The coefficient Cij provides an approximation for the relative magnitude of the sample-size inflation that is due to the error parameter ɛij. Thus, a small increase of x in the imputation error parameter ɛij adds approximately Cijx to the sample-size inflation factor. The sum of the six cost coefficients, Csum, has the interpretation that Csumx is added to the sample-size inflation factor when all six of the ɛij are simultaneously set to x. Each of the cost coefficients was evaluated for values of MAFcases from 0.005 to 0.995 at intervals of 0.01.
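Read as a first-order expansion around zero imputation error, where the inflation factor equals 1, the description of the cost coefficients can be written as follows. Identifying Cij with a partial derivative evaluated at zero error is an interpretation assumed here, not a definition quoted from the text.

```latex
f(\varepsilon) \approx 1 + \sum_{i \neq j} C_{ij}\,\varepsilon_{ij},
\qquad
C_{ij} = \left.\frac{\partial f}{\partial \varepsilon_{ij}}\right|_{\varepsilon = 0},
\qquad
C_{\mathrm{sum}} = \sum_{i \neq j} C_{ij}
\;\Longrightarrow\;
f \approx 1 + C_{\mathrm{sum}}\,x
\quad \text{when } \varepsilon_{ij} = x \text{ for all } i \neq j .
```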