Bottom Line:
This naïve method is straightforward but is valid only when the missingness is random. However, a given assay often has a different capability in genotyping heterozygotes and homozygotes, causing the phenomenon of "differential dropout" in the sense that the missing rates of heterozygotes and homozygotes are different. Compared with the naïve method, our method provides more accurate allele frequency estimates when the differential dropout is present.

Affiliation: Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan.

ABSTRACT: The presence of missing single-nucleotide polymorphism (SNP) genotypes is common in genetic studies. For studies with low-density SNPs, the most commonly used approach to dealing with genotype missingness is to simply remove the observations with missing genotypes from the analyses. This naïve method is straightforward but is valid only when the missingness is random. However, a given assay often has a different capability in genotyping heterozygotes and homozygotes, causing the phenomenon of "differential dropout" in the sense that the missing rates of heterozygotes and homozygotes are different. In practice, differential dropout among genotypes exists even in carefully designed studies, such as the HapMap project and the Wellcome Trust Case Control Consortium. Under the assumption of Hardy-Weinberg equilibrium and no genotyping error, we propose a statistical method to model the differential dropout among different genotypes. Compared with the naïve method, our method provides more accurate allele frequency estimates when the differential dropout is present. To demonstrate its practical use, we further apply our method to the HapMap data and a scleroderma data set.

Figure 1: Box-and-whiskers plots of 1,000 estimates of MAF, given MAF = 0.1. The panels are arranged so that the overall genotype missing rate (P.drop) is 0.02, 0.05, 0.1, and 0.15 (from top to bottom) and the DDR (r.drop) is 0.25, 0.5, 1, 2.5, 5, and 10 (from left to right). Each box is constructed from the median (here, very close to the mean) and the first and third quartiles. Outliers are data points outside the range (first quartile − 1.5 × IQR, third quartile + 1.5 × IQR), where IQR is the inter-quartile range (third quartile − first quartile). The end of the upper whisker is the largest data point below the third quartile + 1.5 × IQR, while the end of the lower whisker is the smallest data point above the first quartile − 1.5 × IQR.

Mentions:
where αHet and αHom are the missing rates of heterozygotes and homozygotes, respectively. We simulated a SNP with a minor allele frequency (MAF) of 0.1 and assumed Hardy-Weinberg equilibrium (HWE) at this SNP. The overall genotype missing rate was set at 0.02, 0.05, 0.1, or 0.15, and the DDR at 0.25, 0.5, 1, 2.5, 5, or 10. The total sample size was 2,000. We compared our method with the naïve method, which simply removes the observations with missing genotypes from the analyses. Figure 1 presents box-and-whiskers plots of the allele frequency estimates from 1,000 replications. When DDR = 1 (αHom = αHet, no differential dropout), both the naïve method and our new method give unbiased estimates of allele frequencies (in our simulation results, the medians and means are very close). When DDR ≠ 1, the naïve method gives biased estimates while the new method still generates unbiased results. The further the DDR departs from 1, the more biased the naïve estimates become. This bias is especially prominent when the overall genotype missing rate is 0.05 or larger. We also simulated a SNP with a MAF of 0.2, and the results were very similar to those shown in Figure 1 (with the boxes centered at 0.2 instead).
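To make the source of the naïve method's bias concrete, the following is a minimal sketch of the simulation design described above, covering only the naïve (complete-case) estimator and not the proposed method. It rests on one assumption that the excerpt does not confirm: the DDR is taken to be αHet / αHom. If the ratio is defined the other way around, the direction of the bias flips. All function names are hypothetical and chosen for illustration.

```python
import random


def dropout_rates(maf, p_drop, ddr):
    """Return (alpha_het, alpha_hom) implied by the overall missing rate and DDR.

    ASSUMPTION: ddr = alpha_het / alpha_hom (direction not stated in the excerpt).
    Under HWE the overall missing rate is f_hom * alpha_hom + f_het * alpha_het.
    """
    q = maf
    p = 1.0 - q
    f_het = 2.0 * p * q      # HWE heterozygote frequency
    f_hom = p * p + q * q    # HWE frequency of both homozygotes combined
    alpha_hom = p_drop / (f_hom + ddr * f_het)
    alpha_het = ddr * alpha_hom
    if alpha_het > 1.0 or alpha_hom > 1.0:
        raise ValueError("settings imply a missing rate above 1")
    return alpha_het, alpha_hom


def expected_naive_maf(maf, p_drop, ddr):
    """Large-sample value of the naive MAF estimate (missing genotypes removed)."""
    q = maf
    p = 1.0 - q
    a_het, a_hom = dropout_rates(maf, p_drop, ddr)
    obs_het = 2.0 * p * q * (1.0 - a_het)      # observed heterozygotes
    obs_minor_hom = q * q * (1.0 - a_hom)      # observed minor-allele homozygotes
    obs_major_hom = p * p * (1.0 - a_hom)      # observed major-allele homozygotes
    total = obs_major_hom + obs_het + obs_minor_hom
    return (obs_minor_hom + 0.5 * obs_het) / total


def simulate_naive_maf(maf, p_drop, ddr, n=2000, rng=None):
    """One replicate: draw n genotypes under HWE, apply dropout, estimate MAF."""
    if rng is None:
        rng = random.Random()
    a_het, a_hom = dropout_rates(maf, p_drop, ddr)
    minor_alleles = 0
    observed = 0
    for _ in range(n):
        # Two independent allele draws give a genotype in HWE proportions.
        g = int(rng.random() < maf) + int(rng.random() < maf)
        alpha = a_het if g == 1 else a_hom
        if rng.random() < alpha:
            continue  # genotype call is missing and is dropped
        minor_alleles += g
        observed += 1
    return minor_alleles / (2 * observed)


if __name__ == "__main__":
    rng = random.Random(2024)
    for ddr in (0.25, 1.0, 10.0):
        reps = [simulate_naive_maf(0.1, 0.1, ddr, rng=rng) for _ in range(200)]
        mean_est = sum(reps) / len(reps)
        print(ddr, round(mean_est, 4), round(expected_naive_maf(0.1, 0.1, ddr), 4))
```

Under this assumed definition, heterozygotes carry most copies of a rare allele, so losing them preferentially (DDR > 1) pulls the naïve estimate below the true MAF, while DDR < 1 pushes it above. At DDR = 1 the dropout terms cancel and the naïve estimate is centered on the true value, in line with the pattern reported for Figure 1.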
