Background

It is well known that population stratification (PS) may cause the usual test in case-control studies to produce spurious gene-disease associations. However, the joint impact of PS and sample selection (SS) is less well understood. In this paper, we provide a systematic study of the joint effect of PS and SS under a more general risk model containing genetic and environmental factors. We provide simulation results to show the magnitude of the bias and its impact on the type I error rate of the usual chi-square test under a wide range of PS levels and selection biases.

Results

The biases in the estimation of the main and interaction effects are quantified and their bounds derived. The estimated bounds can be used to compute conservative p-values for the association test. If the conservative p-value is smaller than the significance level, we can safely claim that the association test is significant regardless of whether PS or selection bias is present. We also identify conditions under which the bias is null. The bias depends on the allele frequencies, exposure rates, gene-environment odds ratios and disease risks across subpopulations, and on the sampling of the cases and controls.

Conclusion

Our results show that the bias cannot be ignored even when the case and control data are matched in ethnicity. A real example is given to illustrate the application of the conservative p-value. These results are useful for genetic association studies of main and interaction effects.

In the search for causative agents of human disease, both environmental and genetic risk factors have been identified. Overwhelming evidence suggests that relatively common polymorphisms in a wide spectrum of genes may modify the effect of environmental agents [1, 2]. Several studies have also demonstrated the presence of gene-gene interaction in complex human diseases [3–7]. Gene-gene interaction, or epistasis, is also a basic genetic concept that has long been widely used by biologists [8].

Many association designs have been proposed for studying gene-environment or gene-gene interactions. Recently, Wang and Zhao [9] found that in the study of gene-gene interactions, the unmatched case-control association design is more powerful than both the matched case-control design and the case-parents design. They also found that when a logistic regression model is fitted to assess gene-environment interactions based on a case-parents sample, the approach may be susceptible to PS bias [10]. However, the case-control design is also well known to be susceptible to PS bias in the study of genetic effects if the gene under study shows marked variation in allele frequency across subgroups of the population and these subgroups also differ in their baseline disease risks [11–17]. Wang et al. [18] recently provided numerical examples showing that when the correlation between genetic and environmental factors is small or the linkage disequilibrium is weak, and the case-control data are collected according to a simple random sampling (SRS) scheme (that is, with no selection bias), the PS bias in testing a null interaction odds ratio is also small. However, selection bias often occurs in case-control studies, and more work is needed to better understand the joint impact of PS and SS.

In this paper, we investigate the joint effect of population stratification and sample selection in testing null main or interaction effects. Under general sampling, we quantify the magnitude of the PS-SS bias in terms of the baseline disease risks, genotype frequencies, exposure rates, their odds ratios (linkage disequilibrium coefficients), and the effect sizes of the risk factors. Based on this result, we find that matching in ethnicity cannot eliminate bias in association studies. Using the bias, we are also able to derive important conditions under which it is null.

The PS-SS bias cannot be estimated, since we do not know how many subpopulations are involved in the studied population and/or which subpopulation a person belongs to. Although adjusting for covariates such as principal components can be used to account for PS in genome-wide association studies [19], it is not clear whether the same approach can be applied in studies of interaction, since, for example, the bias level also depends on the effect size of the environmental factor. In this paper, we also derive useful bounds to measure the maximal impact of the bias. Sometimes these bounds can be estimated, so that tests robust to the joint effect of PS and SS can be derived; see Lee and Wang [20] for a similar suggestion in studies of gene-disease association. We use theoretical formulas and simulation results to show the general properties of the usual association test in the presence of PS or selection bias. We also provide a real example to demonstrate the computation of a conservative p-value in studying the interaction effect of maternal smoking and a GSTT1 variant on the risk of orofacial cleft.

The Magnitude of the Bias

We begin this section with the notation that will be used throughout this work. Disease status is denoted by D, with D = 1 and D = 0 indicating the presence and absence of the disease, respectively. Let G = 1 (0) represent the presence (absence) of the genotype of interest, and let H = 1 (0) represent the presence (absence) of the environmental exposure or another genotype of interest. Although we focus only on the 2 × 2 × 2 table, all results can be extended to any number of risk factors or any number of levels. We also assume that the population under study consists of K subpopulations and denote by S the stratification variable, taking values s = 1,..., K. However, K is unknown and S is not observable in our discussion of the PS effect.

To quantify the PS effect, we assume that the risk model is given by

$$\operatorname{logit} P(D=1\mid G=g,H=h,S=s)=\mu'+\alpha'_s+\beta g+\gamma h+\delta gh,$$

where the genetic and environmental data are obtained from subpopulation s. As usual, we use s = 1, g = 0, and h = 0 to represent the referent subpopulation, genotype and environmental exposure, respectively. For identifiability, we define α′_1 = 0. The α′_s, s = 1,..., K, are subpopulation-specific parameters representing the potential heterogeneity of disease risk across subpopulations. In this model, the log-odds-ratio β measures the association between the genotype and the risk of disease, and the log-odds-ratio γ measures the association between the environmental exposure (or another genotype) and the risk of disease. The multiplicative interaction δ measures the change in the disease-genotype log-odds-ratio across levels of the risk factor H. Similar risk models for studying genetic effects under PS can be found in, for example, Satten et al. [21] and Cheng and Lin [17]. For subpopulation s, we use OR_s to represent the baseline G-H odds ratio (given D = 0). Define

$$G_s=\frac{P(G=1\mid S=s,D=0,H=0)}{P(G=0\mid S=s,D=0,H=0)}$$

as the baseline G-frequency odds; the baseline H-frequency odds H_s is defined similarly. Also define D_s as the baseline disease frequency odds, given by

$$D_s=\frac{P(D=1\mid S=s,G=0,H=0)}{P(D=0\mid S=s,G=0,H=0)}.$$

In the discussion of the PS effect, one often assumes that case and control data are sampled according to the SRS design. Let P(S = s|D = 1) and P(S = s|D = 0) represent the corresponding proportions of subpopulation s in the cases and controls, respectively. However, in real applications, selection bias often occurs and sampling may not follow the SRS scheme for various reasons. Let the actual proportion of subjects among the sampled cases (controls) that are from subpopulation s be denoted by P#(S = s|D = 1) (P#(S = s|D = 0)). We use DS_s = [P#(S = s|D = 1)/P(S = s|D = 1)] / [P#(S = s|D = 0)/P(S = s|D = 0)] to measure the effect of the sample selection for subpopulation s. If there is no selection bias, DS_s = 1.
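As a small illustration of this definition, the following sketch computes the selection ratio for each subpopulation from four sets of proportions. The function name `selection_ratios` and all numerical inputs are hypothetical and chosen only for illustration.

```python
def selection_ratios(sample_cases, srs_cases, sample_controls, srs_controls):
    """Return DS_s for each subpopulation s.

    sample_*: proportions of subpopulation s actually present among the
              sampled cases/controls (the P# quantities).
    srs_*:    proportions expected under simple random sampling (the P quantities).
    """
    return [
        (pc_obs / pc_srs) / (pk_obs / pk_srs)
        for pc_obs, pc_srs, pk_obs, pk_srs
        in zip(sample_cases, srs_cases, sample_controls, srs_controls)
    ]


# Hypothetical two-subpopulation study: cases follow SRS, controls do not.
ds = selection_ratios(
    sample_cases=[0.6, 0.4], srs_cases=[0.6, 0.4],
    sample_controls=[0.5, 0.5], srs_controls=[0.7, 0.3],
)
print(ds)  # both entries would equal 1 if the controls also followed SRS
```

In this made-up setting subpopulation 2 is over-represented among the controls, so its ratio falls below 1 while that of subpopulation 1 rises above 1.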

Since in the population level we only observe factors G and H, we show in the Methods section that given the presence of PS and general sampling, the main effects and interaction are given by

Here exp(β*), exp(γ*) and exp(δ*) are the bias levels. We note that if D_s DS_s is constant with respect to s, then K(g, h) is also constant and there is no bias of any kind. A sufficient condition for this is that the baseline disease risk is identical across all subpopulations and the sampling of the study follows a SRS design. Further, since

therefore, if the disease prevalence P(D = 1|S = s) and the baseline disease risk P(D = 1|G = H = 0, S = s) are approximately equal in each subpopulation, then the bias depends on D_s DS_s only through the degree of matching P#(S = s|D = 1)/P#(S = s|D = 0). Accordingly, if the cases and controls are matched in ethnicity, then the bias should be very small. However, P(D = 1|S = s) ≈ P(D = 1|G = H = 0, S = s) for all subpopulations often fails when an environmental factor, such as smoking, contributes to the disease risk. Under this scenario, even if the cases and controls are perfectly matched, the bias can still be large. This conclusion differs from that for gene-disease association studies; see, for example, Cheng, Lee and Chen [22]. We discuss this issue further in later sections.
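One way to see why the product of the baseline disease odds and the selection ratio governs the bias is the following sketch. It is our own reading of the risk model and of the definitions above, and is not necessarily the form derived in the Methods section. Applying Bayes' rule within each subpopulation, the odds of being a case at (g, h) in the pooled sample satisfy

$$
\frac{\sum_{s} P^{\#}(S=s\mid D=1)\,P(G=g,H=h\mid S=s,D=1)}
     {\sum_{s} P^{\#}(S=s\mid D=0)\,P(G=g,H=h\mid S=s,D=0)}
\;\propto\;
e^{\beta g+\gamma h+\delta gh}\,
\frac{\sum_{s} D_s\,DS_s\,P^{\#}(S=s\mid D=0)\,P(G=g,H=h\mid S=s,D=0)}
     {\sum_{s} P^{\#}(S=s\mid D=0)\,P(G=g,H=h\mid S=s,D=0)},
$$

where the proportionality constant does not depend on (g, h) and therefore cancels in any odds ratio. The last factor is a weighted average of D_s DS_s over subpopulations, with weights that depend on (g, h); under the assumption that K(g, h) is defined in this way, it is constant in (g, h) whenever D_s DS_s does not vary with s, which agrees with the no-bias condition stated above.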

Maximal bias and conditions for the null bias

Here, we give conditions for the null bias and bounds for the bias. The bias exp(β*) in the estimation of the genetic main effect depends on the variation of the genotype frequencies, measured by G† = max_s G_s / min_s G_s, the variation of the disease prevalence, measured by D† = max_s D_s / min_s D_s, and the sampling variation, measured by DS† = max_s DS_s / min_s DS_s. The bias exp(δ*) in the estimation of the interaction depends additionally on the variation of the baseline odds ratio, measured by OR† = max_s OR_s / min_s OR_s, and the variation of exposure rates, measured by H† = max_s H_s / min_s H_s.
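The variation measures are simple max-to-min ratios of odds, as the following sketch shows. The helper names `odds` and `variation` and the input frequencies are hypothetical; only the definitions come from the text.

```python
def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)


def variation(values):
    """Max-to-min ratio across subpopulations (the 'dagger' measures)."""
    return max(values) / min(values)


# Hypothetical baseline quantities for two subpopulations.
baseline_g_freq = [0.20, 0.50]   # P(G=1 | S=s, D=0, H=0)
baseline_risk = [0.01, 0.05]     # P(D=1 | S=s, G=0, H=0)
selection_ratio = [1.4, 0.6]     # DS_s

g_dagger = variation([odds(p) for p in baseline_g_freq])
d_dagger = variation([odds(p) for p in baseline_risk])
ds_dagger = variation(selection_ratio)
print(g_dagger, d_dagger, ds_dagger)
```

Each measure equals 1 exactly when the corresponding quantity is the same in every subpopulation, which is the situation in which that source of heterogeneity cannot contribute to the bias.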

Note that the bias β* depends only on K(g, 0). We first present some conditions for the null bias β* = 0, when the true genetic main effect is null: (1) if the baseline genotype frequency is constant across subpopulations, then the bias β* is zero (can be proved using equation (1) in the Methods section); (2) if the sample selection follows a SRS scheme (DS† = 1), and the disease risk is constant, then the bias is also null. (However, if the sampling is not SRS, the bias may be non-null; see Tables 1 and 2.); (3) if the case and control data are matched in ethnicity, and γ = δ = 0 (both H-main effect and interaction are null), then the bias is null.
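Condition (3) can be checked numerically without any closed-form bias expression. The sketch below builds the pooled case and control cell probabilities directly from the risk model and compares the pooled odds ratios with the true ones. It is an illustration under stated assumptions, not the computation used for the tables: the function names are ours, G and H are taken to be independent among controls within each subpopulation, and the frequencies of the second subpopulation (0.80 and 0.40) are made up.

```python
from math import exp


def independent_cells(p_g, p_h):
    """Control cell probabilities P(G=g, H=h | S=s, D=0), assuming G-H independence."""
    return {(g, h): (p_g if g else 1 - p_g) * (p_h if h else 1 - p_h)
            for g in (0, 1) for h in (0, 1)}


def case_cells(control_cells, beta, gamma, delta):
    """Case cell probabilities implied by the logistic risk model within one subpopulation."""
    weights = {(g, h): p * exp(beta * g + gamma * h + delta * g * h)
               for (g, h), p in control_cells.items()}
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def pooled(cells_by_subpop, mix):
    """Mix subpopulation-specific cell probabilities with sampling proportions `mix`."""
    return {cell: sum(m * cells[cell] for m, cells in zip(mix, cells_by_subpop))
            for cell in cells_by_subpop[0]}


def pooled_biases(controls_by_subpop, case_mix, control_mix, beta, gamma, delta):
    """Return (exp(beta*), exp(delta*)): pooled odds ratios divided by the true ones."""
    cases_by_subpop = [case_cells(c, beta, gamma, delta) for c in controls_by_subpop]
    p_case = pooled(cases_by_subpop, case_mix)
    p_ctrl = pooled(controls_by_subpop, control_mix)
    odds = {cell: p_case[cell] / p_ctrl[cell] for cell in p_case}
    main = odds[(1, 0)] / odds[(0, 0)]
    interaction = odds[(1, 1)] * odds[(0, 0)] / (odds[(1, 0)] * odds[(0, 1)])
    return main / exp(beta), interaction / exp(delta)


controls = [independent_cells(0.51, 0.19), independent_cells(0.80, 0.40)]
matched = [0.7, 0.3]  # same subpopulation mix in cases and controls
no_h_effect = pooled_biases(controls, matched, matched, beta=0.0, gamma=0.0, delta=0.0)
with_h_effect = pooled_biases(controls, matched, matched, beta=0.0, gamma=1.0, delta=0.0)
print(no_h_effect, with_h_effect)
```

With perfect matching and γ = δ = 0 the pooled case and control distributions coincide, so both bias factors equal 1. Setting γ = 1 while keeping the matching perfect moves the main-effect factor away from 1 in this example, in line with the remark that matching alone does not remove the bias when H affects the disease risk.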

Table 1

Biases and the true type I errors of the chi-square tests when G† = 5 and LD = (0,0)

When the interaction effect is null, some conditions for the null bias δ* = 0 are: (1) if the baseline G-H odds ratios and G(or H)- frequency odds are constant across subpopulations, then the bias δ* is null (can be proved using equation (2) in the Methods section); (2) if the sample selection of the study follows SRS, and the disease risk is constant, then the bias δ* is also null. However, see Tables 1 and 2 for the presence of bias when the SRS condition fails.

Next, we present a bound to measure the largest bias in the estimation of the main effect. In the Methods section, we show that the bias exp(β*) can be expressed as

where the w_s are constants satisfying 0 ≤ w_s ≤ 1 and Σ_{s=1}^{K} w_s = 1. The bias is greatest when the number of subpopulations is 2. The bias is also bounded below by L_β ≡ U_β^{-1}. These bounds give the maximal impact of the bias in making inference about the genetic main effect. For a rare disease, the background disease rate is approximately equal to the background disease odds. We find that the bound under SRS (DS† = 1) is similar to that given by Lee and Wang [20]. However, our result is more general, in that their risk model is a special case of ours and selection bias was not considered in their paper.

In the Methods section, we also show that under SRS, the bias exp(δ*) is bounded above by U_δ^(1) = (D†)^2 and below by L_δ^(1) = (D†)^-2. These are the same bounds derived by Wang et al. [18]. Unfortunately, these bounds are not valid when there is selection bias. Under general sample selection, we show that the bias exp(δ*) is bounded above by

OR†×G†H†+13G†H†+G†G†H†+H†×G†H†+G†H†G†+H†2≡Uδ(2),

(2)

and bounded below by L_δ^(2) ≡ 1/U_δ^(2). Using these bounds, we can conclude that if the genetic factors are in linkage equilibrium within each subpopulation and the variation of the G (or H) frequency odds is small, then the bias is also expected to be small.

True type I errors

In case-control studies, one often expects that the type I errors of the association tests can be approximately controlled at some predetermined level. However, in the presence of PS or selection bias, the usual test statistic does not have a chi-square distribution under the null hypothesis. Instead, it has a non-central chi-square distribution, with non-centrality parameter depending on the level of the bias. Thus, the usual chi-square test tends to have inflated type I errors.

Suppose that the intended type I error rate of the chi-square test is α, and let χ²_{1;1-α} denote the 100(1-α)th percentile of the chi-square distribution with one degree of freedom. Let χ²_1(Δ) denote a non-central chi-square random variable with one degree of freedom and non-centrality parameter Δ. In the case of testing null interaction, the non-centrality parameter is given by

where n_gh^(d) is the number of observations with G = g, H = h and disease status d. Then the true type I error of the usual chi-square test of null interaction is given by α_δ = P(χ²_1(Δ_δ) ≥ χ²_{1;1-α}), which is always ≥ α. In the case of testing null genetic main effect, the non-centrality parameter is given by

$$\Delta_\beta=\beta^{*2}\left(\frac{1}{n_{10}^{(1)}}+\frac{1}{n_{00}^{(1)}}+\frac{1}{n_{10}^{(0)}}+\frac{1}{n_{00}^{(0)}}\right)^{-1}.$$

The corresponding true type I error of the chi-square test is given by α_β = P(χ²_1(Δ_β) ≥ χ²_{1;1-α}), which is also ≥ α.
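The true type I error can be evaluated with the standard library alone, because a non-central chi-square variable with one degree of freedom is the square of a normal variable with mean √Δ and unit variance. The sketch below takes the non-centrality parameter to be the squared log-bias divided by the sum of reciprocal cell counts, which is the usual large-sample variance of a log odds ratio. The function names and the cell counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentral_chi2_1df_sf(x, noncentrality):
    """P(chi2_1(noncentrality) >= x), using chi2_1(d) = (Z + sqrt(d))**2 with Z ~ N(0, 1)."""
    root_x, root_nc = sqrt(x), sqrt(noncentrality)
    upper = 1.0 - STD_NORMAL.cdf(root_x - root_nc)   # Z + sqrt(d) >= sqrt(x)
    lower = STD_NORMAL.cdf(-root_x - root_nc)        # Z + sqrt(d) <= -sqrt(x)
    return upper + lower


def true_type1_error(log_bias, cell_counts, alpha=0.05):
    """True type I error of the nominal level-alpha chi-square test under a given log-bias."""
    variance = sum(1.0 / n for n in cell_counts)     # large-sample variance of the log odds ratio
    noncentrality = log_bias ** 2 / variance
    critical = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) ** 2  # chi-square(1) critical value
    return noncentral_chi2_1df_sf(critical, noncentrality)


# Hypothetical cell counts n_10^(1), n_00^(1), n_10^(0), n_00^(0).
counts = [120, 130, 125, 125]
for log_bias in (0.0, 0.1, 0.3):
    print(log_bias, round(true_type1_error(log_bias, counts), 4))
```

With zero bias the function returns the nominal level, and the returned value grows with the absolute log-bias and with the sample size, which is why even a moderate bias can produce a badly inflated error rate in large studies.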

Conservative p-values

In most practical applications, the true value of the non-centrality parameter is unknown, so it is difficult to calculate the true p-value of the chi-square test when PS and/or selection bias is present. However, we are able to develop a bound for the non-centrality parameter, and this bound may be estimable in many cases. Define Δ*_δ (Δ*_β) as Δ_δ (Δ_β) but with δ* (β*) replaced by its upper bound log U_δ^(2) (log U_β). Let χ²_δ (χ²_β) be the usual statistic for testing null interaction (main effect). Then, following Cheng, Lee and Chen [22], a conservative p-value of the chi-square test is given by P(χ²_1(Δ*_δ) ≥ χ²_δ) (P(χ²_1(Δ*_β) ≥ χ²_β)). By the properties of the non-central chi-square distribution, the test based on the conservative p-value always has a true type I error rate smaller than or equal to the significance level, which in turn is always smaller than or equal to the true type I error rate of the usual chi-square test. If a test has a conservative p-value less than or equal to the designated significance level, it is significant even if there is PS or selection bias.

Examples of true biases and type I error rates

Tables 1 and 2 show some values of the biases β* and δ* and the true type I error rates αβ and αδ of the usual chi-square tests when the significance level is 0.05. We assumed that there are two subpopulations (K = 2), with β = δ = 0 and γ = 0 or 1. The G (H) frequency of the first subpopulation was P(G = 1|S = 1) = 0.51 (P(H = 1|S = 1) = 0.19), the disease risk in the first subpopulation was P(D = 1|S = 1) = 0.05, the proportion of subpopulation 1 in the overall population was 0.7, and the case and control sample sizes were both n = 500. We defined LD = (LD_1, LD_2), where LD_s is the linkage disequilibrium coefficient between loci G and H in subpopulation s, and considered LD_s = 0 or 0.05. We also assumed that the sampling proportions of the cases followed SRS but those of the controls might not. The remaining parameter values were determined from the values of the variations G†, H†, D† and DS† given in the tables, under the assumption that subpopulation 2 has the maximal baseline G (or H) frequency odds, disease risk, and sampling deviation (this implies that P#(S = 2|D = 0) ranges from 0.0585 to 0.7163). Finally, we note that in computing the non-centrality parameters, the sample frequencies n_gh^(d) were replaced by n × P(G = g, H = h|D = d). The results for G† = 5 are given in Tables 1 and 2, and those for G† = 3 can be found in Tables S1 and S2 in Additional file 1.

According to the results in Table 1, the true type I error αβ ranges from 0.05 to 0.9998 under linkage equilibrium. If the SRS condition holds and γ = 0, αβ ranges from 0.05 to 0.9602 with mean 0.4377 and standard error 0.3298. Under the same conditions but with γ = 1, the corresponding range becomes (0.05, 0.9326) with mean 0.3822 and standard error 0.2969. On the other hand, if the sampling is not SRS (DS† = 3 or 5) and γ = 0, the range of αβ is (0.05, 0.9998) with mean 0.6871 and standard error 0.317. Under non-SRS sampling with γ = 1, the corresponding range becomes (0.05, 0.9992) with mean 0.6291 and standard error 0.3117. These results indicate that the bias can be quite large and that its level may be modified by the sample selection and the size of the H-main effect. We also observe that the bias β* may be nonzero under perfect matching. For example, if matching is perfect and the H-main effect is γ = 1, the largest true type I error is 0.1064, which occurs when G† = H† = D† = 5. This is contrary to the usual belief that matching cases and controls in ethnicity can eliminate the PS bias. However, except in some special cases, the biases under a perfectly matched design are smaller than those under other sampling designs.

Wang et al. [18] suggested that the bias δ* in the interaction effect is small when the linkage disequilibrium coefficient is small and the sampling is SRS. Table 1 shows that under the same conditions, the true type I error αδ in testing null interaction ranges from 0.05 to 0.0659, which agrees with their finding. However, if there is selection bias (DS† = 3 or 5), αδ has range (0.05, 0.2656), mean 0.101, and standard error 0.056 when γ = 0, and range (0.05, 0.2750), mean 0.1053, and standard error 0.0597 when γ = 1. The means and standard errors given here and later were computed from the results shown in Tables 1 and 2, and Tables S1 and S2 in Additional file 1. These results indicate that PS and SS can also cause serious bias in case-control studies of gene-gene interactions, even when the two genes are in linkage equilibrium. Under this scenario, the best way of reducing the bias is to match cases and controls in ethnicity. We note that under perfect matching and linkage equilibrium, αδ ranges only between 0.05 and 0.0541.

Linkage disequilibrium between two genes, or correlation between genetic and environmental factors, plays an important role in determining the bias level in studies of interaction. According to the results presented in Table 2, the bias in the estimation of the genetic main effect becomes smaller when the linkage disequilibrium coefficient increases from 0 to 0.05. When γ = 0, the mean of αβ is 0.3377 under SRS and 0.5514 under non-SRS (selection bias); when γ = 1, the mean becomes 0.2716 and 0.4597 under SRS and non-SRS, respectively. In contrast, the bias in the estimation of the interaction effect increases when the linkage disequilibrium coefficient increases from 0 to 0.05. When γ = 0, the mean of αδ is 0.1642 under SRS and 0.5512 under non-SRS; when γ = 1, the mean becomes 0.1706 and 0.5555 under SRS and non-SRS, respectively. Overall, the bias δ* appears to grow as the linkage disequilibrium coefficient grows. Under stronger linkage disequilibrium, the true type I error αδ can be as large as 0.1101 even when the cases and controls are perfectly matched.

An application

Shi et al. [23] studied the interaction effects of maternal smoking and maternal or fetal pharmacogenetic variants on the risk of orofacial cleft based on 1244 subjects from Denmark and Iowa, USA with facial clefting and 4183 parents, siblings or unrelated population controls. We considered the combined Denmark and Iowa case-control data with H = 1 if the mother smoked (0 if not) and G = 1 if the GSTT1 genotype was null (0 if it was not null); see Table A6 of [23]. Based on these data, we found that the G × H interaction estimate was 3.2499 and the chi-square test had a p-value of 5.5676 × 10^-4, indicating a strong interaction effect. Also, from [24] we found that GSTT1 genotype frequencies in Caucasian populations were between 0.129 and 0.276, giving the variation of the genotype frequencies G† = 4.8762. The maternal smoking rate ranged between 0.101 and 0.244 (see [25–27]), giving the variation of exposure rates H† = 1.968. Since maternal smoking and GSTT1 were independent in the unrelated control population (p-values of the independence test for the Denmark and Iowa data were 0.0942 and 0.0976, respectively), our upper bound for the bias exp(δ*) (see equation 2) equals 1.6149, leading to a conservative p-value of 2.0353 × 10^-2. This suggests that the effect of maternal smoking on cleft risk can be modified by the GSTT1 genotype even if population stratification and selection bias are both present in the study.
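The conservative p-value in this example can be approximated from the three reported summary numbers alone, as the sketch below illustrates. Two assumptions are made that are not stated in the text: the reported interaction value 3.2499 is treated as an odds ratio, and the test is treated as a Wald-type chi-square test, so that the standard error of the log odds ratio can be recovered from the reported p-value. The function name is hypothetical, and the cell counts of Table A6 in [23] are not used.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conservative_p_value(odds_ratio, p_value, bias_upper_bound):
    """Conservative p-value P(chi2_1(delta*) >= observed chi-square), one degree of freedom."""
    # Square root of the observed chi-square statistic, recovered from the p-value.
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    # ASSUMPTION: Wald-type test, so chi-square = (log OR / SE)**2.
    std_error = abs(log(odds_ratio)) / z_obs
    # Square root of the bounding non-centrality parameter: log(upper bound) / SE.
    shift = log(bias_upper_bound) / std_error
    upper = 1.0 - STD_NORMAL.cdf(z_obs - shift)
    lower = STD_NORMAL.cdf(-z_obs - shift)
    return upper + lower


p_cons = conservative_p_value(odds_ratio=3.2499, p_value=5.5676e-4, bias_upper_bound=1.6149)
print(p_cons)
```

Under these assumptions the result is about 0.02, close to the value reported above; a small discrepancy is to be expected because the reported p-value need not come from a Wald statistic. Setting the upper bound to 1 (no bias allowed for) returns the original p-value.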

The impact of population stratification is considered by many to be important in case-control studies of gene-disease association. Many authors have suggested quantitative methods to control the type I errors of the usual association test. The most popular treatments include the "genomic control" method [28–33] and the "structured association" method [34–37]. Each of the proposed methods requires typing extra polymorphic markers to generate an estimate of PS, which can be used to adjust the test statistic. The impact of PS in case-control studies of gene-gene (or gene-environment) interaction is considered to be less important when the genes under study are in linkage equilibrium or when the gene-environment correlation is weak [18, 38]. However, this conclusion holds only when the sampling of the case and control data follows a SRS design, that is, when there is no selection bias. Unfortunately, there is no formal method for testing the validity of the SRS condition when PS is present.

In practical applications, selection bias is not unusual. For example, the SRS condition may fail when hospital-based cases (controls) are used and they are not representative of the population-based cases (controls), when there is substantial non-response among the cases and/or controls, or when subjects self-select into the study. In this paper, we show that under slight selection bias (DS† = 3), the bias in the estimation of the main or interaction effect may become unacceptable. Our suggestion is that the bias should be treated seriously, even when the genetic factors are in linkage equilibrium or the genetic and environmental factors are uncorrelated. Large correlation or strong linkage disequilibrium could make the bias even larger. Also, small variation in disease risk cannot guarantee small bias unless the selection bias is also small. In applications, it is important to be able to measure the impact of the bias. In this paper, we derive some bounds for the bias. If these bounds are estimable, they can be used to make conservative inference. We give one real example in which a conservative p-value for testing null interaction can be computed and a significant conclusion can be reached even in the presence of bias. Genotype frequencies of SNPs and their LDs are readily available from the International HapMap Project. Further, disease prevalence is available from many nations or from the World Health Organization, for example. This information allows us to compute the bounds, and then the conservative p-values, easily.

We note that matching cases and controls in ethnicity has been suggested by epidemiologists as an effective method to control the PS bias in case-control gene-disease association studies. However, in a more complicated risk model such as the one discussed here, the bias β* (see equation 1) in the genetic main effect also depends on the effect sizes of the other risk factors. We found that if γ = δ = 0, then the residual bias after matching is small. However, if γ = 1 and δ = 0, the residual bias after matching is still quite substantial. A sufficient condition to ensure β* = 0 under perfect matching is γ = δ = 0. Tables 1 and 2 also show that matching cannot remove the bias in the estimation of the interaction effect.

Since the presence of PS and selection bias may cause unacceptable bias in the usual interaction analysis, it is important to have an efficient method to control the bias. Unfortunately, no effective method exists so far. The major difficulty is that the level of the bias depends on the effect sizes of other related factors, which are in general unknown or not estimable under PS. However, in some special cases, for example when the genetic main effects are null (or weak) and testing gene-gene interaction is the main focus, one may follow the idea of genomic control, typing extra pairs of null markers and applying the computed interaction levels to control the bias. In principle, if the candidate markers are in linkage equilibrium, the selected pairs of null markers also need to be in linkage equilibrium so that the important characteristics of the bias can be captured. On the other hand, if the candidate markers are in linkage disequilibrium, the paired null markers also need to be correlated. We are currently working on this important problem. Another approach for reducing bias is to match the cases and controls in ethnicity. According to our simulations, under perfect matching and weak linkage disequilibrium the bias in the estimation of the interaction effect is small. However, more study is needed to understand the impact of the residual bias when the matching is not perfect.

In this paper, the biases in the estimation of genetic main and interaction effects are quantified and their bounds are derived. We find that if there is an environmental effect or interaction, the bias in the genetic main effect cannot be ignored even when cases and controls are matched in ethnicity. The bias in the estimation of the interaction effect has the same problem. The estimated bound can be used to compute a conservative p-value for the association test. The computation of the conservative p-value does not require knowledge of the number of subpopulations involved in the study or the subpopulation membership of each study subject. In real applications, it is usually not clear whether there is PS, selection bias, or both. However, if appropriate information such as the variation of genotype frequencies is known, we can always compute the conservative p-value. If the conservative p-value is smaller than the designated significance level, we can safely claim that the test is significant regardless of the presence of PS or non-SRS sampling.

Here G_M (G_m) is the largest (smallest) value of G_s; D_M, D_m, DS_M, and DS_m are similarly defined. Also note that under SRS, DS_s = 1, and therefore from the definition of exp(δ*) it is easy to show that it is bounded above by (D†)^2 and below by (D†)^-2. However, under a general sampling design, the bias is expressed as

Acknowledgements

This research was supported in part by a grant from the National Science Council and a joint research grant from China Medical University and Asia University. The authors are grateful to Jin-Hua Chen for discussion and would like to thank two reviewers for their comments, which greatly improved the presentation of this paper.

Electronic supplementary material

Additional file 1 (12863_2011_964_MOESM1_ESM.DOC): Biases and the true type I errors of the chi-square tests. The file contains two tables showing the biases and true type I errors of the chi-square tests when G† = 3 and LD = (0,0) or LD = (0,0.5). (DOC 159 KB)

Authors' contributions

KFC designed the study, performed the analysis and wrote the paper. JYL performed the computation and contributed to the discussion. All authors read and approved the final manuscript.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.