Abstract

The differences in common genetic polymorphism frequencies by willingness to participate in epidemiologic studies are unexplored, but the same threats to internal validity operate as for studies with nongenetic information. We analyzed single nucleotide polymorphism genotypes, haplotypes, and short tandem repeats among control groups from three studies with different recruitment designs that included early, late, and never questionnaire responders, one or more participation incentives, and blood or buccal DNA collection. Among 2,955 individuals, we compared 108 genotypes, 8 haplotypes, and 9 to 15 short tandem repeats by respondent type. Among our main comparisons, single nucleotide polymorphism genotype frequencies differed significantly (P < 0.05) between respondent groups in six instances, with 13 expected by chance alone. When comparing the odds of carrying a variant among the various response groups, 19 odds ratios were ≤0.70 or ≥1.40, levels that might be notably different. Among the various respondent group comparisons, haplotype and short tandem repeat frequencies were not significantly different by willingness to participate. We observed little evidence to suggest that genotype differences underlie response characteristics in molecular epidemiologic studies, but a greater variety of genes should be examined, including those related to behavioral traits potentially associated with willingness to participate. To the extent possible, investigators should evaluate their own genetic data for bias in response categories.

Descriptive, Risk Factor, and Methodologic Studies

Introduction

Loss of information because of nonresponse can compromise the validity of risk estimates from epidemiologic studies, which is a growing concern in light of declining participation rates (1-3). For various behaviors, exposures, and outcomes, numerous studies have investigated the potential effects of nonresponse (2, 4-9), but corresponding threats due to genetic variation are unexplored; validity in genetic studies is not assured simply because we assume that a genetic variant is unrelated to response (3).

Although genetic variation with “true” nonresponse (i.e., those who did not provide genetic material) is impossible to address, genetic studies with recruitment waves provide a unique opportunity to investigate genetic frequency differences by participation. We examined frequencies of single nucleotide polymorphism genotypes, haplotypes, and short tandem repeat alleles by response status in control subjects from three studies with different recruitment designs allowing comparisons of early, late, and never questionnaire responders, one or more participation incentives, and blood or buccal DNA donation.

Materials and Methods

Subjects

Study A participants were controls in a nested case-control study of breast cancer among the U.S. Radiologic Technologists cohort (10, 11). All controls were female cohort members who provided consent and a blood sample for genetic analyses and to whom a study survey had previously been mailed. Among them were early responders (n = 679), late responders who required an extra incentive (a one dollar bill) to participate (n = 54), and nonresponders (n = 50) to the previously mailed questionnaire. Because sampling for the breast cancer case-control study occurred independently of questionnaire response, nonrespondents were included in the biospecimen recruitment effort.

Study B participants consisted of non-Hispanic Caucasian controls (516 males and 466 females) recruited for a case-control study of non-Hodgkin's lymphoma from within four areas of the Surveillance, Epidemiology, and End Results cancer registry of the National Cancer Institute (12). Of these, 554 controls chose to provide a blood sample for genetic analyses, whereas 209 controls who did not provide blood samples did provide saliva (buccal) samples; 741 of these subjects agreed to biospecimen donation at the time of study questionnaire administration (regular responders), whereas 22 subjects who initially refused to provide blood or buccal cell samples provided buccal cells after a final mail query at the end of the study (late responders).

Study C participants were controls who provided blood samples (232 females and 958 males) from a case-control study of lung cancer from the Lombardy region of Italy. Two hundred fifty-two controls (less incentivized group) were recruited by mail and telephone follow-up; the invitation was accompanied by cash or gas coupons and by a letter endorsing the study signed by the subjects' family physician. Nine hundred thirty-eight controls (highly incentivized group) were recruited using a letter of invitation, accompanied by a direct call by the subjects' family physician, a letter from the mayor of the participating cities supporting the research, and gas coupons to the subjects and family physicians; a toll-free number through which potential participants could obtain information about the study was also established, and television advertisements were aired.

Laboratory Methods

Study A participants were genotyped for 36 single nucleotide polymorphisms in DNA repair and growth factor genes (13). All samples from studies B and C were analyzed at the Core Genotyping Facility of the National Cancer Institute (http://cgf.nci.nih.gov/home.cfm). Study B participants were genotyped for 103 single nucleotide polymorphisms in genes involved in immune, oxidative stress, metabolism, cell cycle, and DNA repair pathways. For short tandem repeat analysis in all three studies, samples were quantified using PicoGreen and reverse transcription-PCR analysis and profiled using the Applied Biosystems Identifiler kit. Fifteen short tandem repeat loci were analyzed in studies A and C, and nine were analyzed in study B.

Statistical Analysis

We considered only those single nucleotide polymorphisms with a minor allele frequency of ≥5% for analyses; 15 single nucleotide polymorphisms from study A and 16 single nucleotide polymorphisms from study B were too infrequent for inclusion. We reconstructed haplotypes for APEX, BACH1, BRCA2, TGFβ1, XRCC1, and ZNF350 for study A and for IL10 and LTA/TNF for study B (separately for each comparison group) using the PHASE software package (14). Haplotypes were not reconstructed for regular versus late responder analyses in study B because of the small number of late responders. We analyzed single nucleotide polymorphism and haplotype frequencies among categories of study recruitment using contingency table analyses in SAS release 8.02 (SAS Institute, Inc., Cary, NC); in addition to χ² analyses and odds ratios comparing the frequency of single nucleotide polymorphism carriers and noncarriers between the various comparison groups, single nucleotide polymorphism genotypes (homozygous wild type, heterozygous, homozygous variant) were analyzed among participation categories using the Mantel-Haenszel test for trend. We noted odds ratios that were ≤0.70 or ≥1.40 because these cutoffs are approximately symmetrical around 1.0 on the ratio scale (1/0.70 ≈ 1.43), and values outside this range could conceivably impact genotype-disease associations. For short tandem repeat analysis, we used SAS release 8.02 to calculate short tandem repeat genotype means and ranges at each locus for the various comparison groups. In addition, using Arlequin version 2000 (15), we estimated the standardized fixation index or FST (ratio of the number of different alleles observed between two individuals in two different samples compared with the number of different alleles observed between two individuals in the same sample; ref. 16). FST provides a single measure of genetic differentiation when multiallelic loci are being considered, such as short tandem repeats. All tests for significance were two-sided with α set at 0.05.
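The contingency table comparisons described above can be illustrated with a minimal sketch. It is not the SAS code used in the study: the function names and the genotype counts are hypothetical, the χ² statistic is computed without continuity correction, and the trend statistic uses the common (N − 1)r² form of the Mantel-Haenszel χ² with genotype scores 0, 1, and 2, which is assumed here rather than confirmed by the text.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def carrier_odds_ratio(carriers_a, noncarriers_a, carriers_b, noncarriers_b):
    """Odds of carrying a variant in group A relative to group B."""
    return (carriers_a * noncarriers_b) / (noncarriers_a * carriers_b)


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_sf_1df(stat)


def mh_trend(group0, group1, scores=(0, 1, 2)):
    """Mantel-Haenszel chi-square for trend, (N - 1) * r**2 on 1 df.

    group0 and group1 hold genotype counts ordered to match `scores`
    (here, the number of variant allele copies).
    """
    n = sum(group0) + sum(group1)
    sum_x = sum(s * (g0 + g1) for s, g0, g1 in zip(scores, group0, group1))
    sum_xx = sum(s * s * (g0 + g1) for s, g0, g1 in zip(scores, group0, group1))
    sum_y = sum(group1)  # group indicator is 0 or 1, so sum of y**2 equals sum of y
    sum_xy = sum(s * g1 for s, g1 in zip(scores, group1))
    cov = sum_xy - sum_x * sum_y / n
    var_x = sum_xx - sum_x ** 2 / n
    var_y = sum_y - sum_y ** 2 / n
    stat = (n - 1) * cov ** 2 / (var_x * var_y)
    return stat, chi2_sf_1df(stat)


def is_notable(odds_ratio, low=0.70, high=1.40):
    """Flag odds ratios at or beyond the cutoffs used in the text."""
    return odds_ratio <= low or odds_ratio >= high


# Hypothetical genotype counts (wild type, heterozygous, homozygous variant);
# only the group sizes (679 early, 54 late responders) come from the text.
early = (400, 230, 49)
late = (25, 22, 7)

or_late = carrier_odds_ratio(late[1] + late[2], late[0], early[1] + early[2], early[0])
chi2, p_chi2 = pearson_chi2_2x2(late[1] + late[2], late[0], early[1] + early[2], early[0])
trend, p_trend = mh_trend(early, late)
print(or_late, is_notable(or_late), p_chi2, p_trend)
```

With only two groups and three ordered genotypes, the trend statistic reduces to a single-degree-of-freedom test of whether the proportion of late responders changes linearly with the number of variant alleles.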

Results

When comparing late responders to early responders and nonresponders to early responders in study A (Table 1), we found that seven and eight odds ratios were ≤0.70 or ≥1.40, respectively. The TGFβ1 P25R variant differed significantly (trend test, P = 0.03) among the questionnaire response groups; one statistically significant trend was expected by chance. Haplotype frequencies for the various genes did not differ significantly (χ² test; not shown).

Table 1. Odds ratios for late and never responders compared with early responders to a mailed questionnaire in study A

Two of the variant frequencies were significantly different between the blood and buccal groups in study B: EPHX1 H139R (P = 0.0064) and CYP1B1 V432L (P = 0.027; Fig. 1); at least four were expected to differ by chance. The Mantel-Haenszel test for trend revealed significant frequency differences between the biospecimen groups with increasing copies of EPHX1 H139R (P = 0.0031). Significant trends were also found with IL8RB 1235T>C (P = 0.024) and MPO 642G>A (P = 0.011). Four of the 87 single nucleotide polymorphisms in Fig. 1 had point estimates that were ≤0.70 or ≥1.40. Haplotype frequencies did not differ significantly between the biospecimen groups (IL10, P = 0.25; LTA/TNF, P = 0.45). Among the respondent groups, six single nucleotide polymorphism frequencies were significantly different: IL1A A114S, IL1A 12G>A, IL4R 28120T>C, NQO1 P187S, TYMS 157C>T, and MGMT L84F (not shown); eight statistically significant differences were expected by chance.
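The "expected by chance" counts for single comparisons can be approximated as the number of tests multiplied by α. The sketch below is illustrative only: the function names are hypothetical, the test counts are derived from the Methods (36 − 15 = 21 polymorphisms retained in study A; 103 − 16 = 87 in study B), and the binomial tail assumes independent tests, which polymorphisms in the same gene are unlikely to satisfy.

```python
from math import comb


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of nominally significant results when every null is true."""
    return n_tests * alpha


def prob_at_least(k, n_tests, alpha=0.05):
    """Binomial probability that at least k of n_tests independent null tests
    reach significance at level alpha (independence is an assumption)."""
    return sum(
        comb(n_tests, i) * alpha ** i * (1 - alpha) ** (n_tests - i)
        for i in range(k, n_tests + 1)
    )


# Test counts derived from the Methods, after the minor allele frequency filter.
snps_study_a = 36 - 15
snps_study_b = 103 - 16

print(expected_false_positives(snps_study_a))  # about one, as stated for study A
print(expected_false_positives(snps_study_b))  # a little over four for study B
print(prob_at_least(2, snps_study_b))  # chance of two or more hits under the null
```

Under these assumptions, observing two significant differences among 87 tests is unremarkable, which agrees with the interpretation given in the text.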

Table 2 shows short tandem repeat results for all three studies. In study A, we found two loci (D21S11, TH01) that were statistically significantly different between early and late responders; one difference was expected by chance. We found no statistically significant FST values in studies B and C; we also found no statistically significant FST values when considering the early and late respondent groups in study B (not shown).

Discussion

To our knowledge, this is the first exploration of the threat to internal validity from genotype frequency differences by participation status in cancer genetic epidemiology. In the present analysis, we did not find that genotype frequency differences between categories of respondents and incentive groups significantly exceeded the number expected by chance. The biases that occur in epidemiologic studies of the effects of genetic variants correspond to the general framework for any exposure (17). That is, biases may be related to inclusion in the study (selection bias), to availability or accuracy of response (recall or ascertainment bias), and to correlation with other factors (confounding, model misspecification, population stratification). Although commentators have recently focused much attention on population stratification (18-21), a form of confounding, selection bias and response bias have received less attention, in part because genetic data on nonresponders are, by definition, difficult to obtain.

Because polymorphism frequencies in nonresponders are unknown, investigators have assumed that participation in genetic studies was unrelated to genotype. This may not be true when variants in genes related to behavioral characteristics are under investigation or if a variant may be related to family disease history; willingness to participate has been associated with family history of the particular disease under study (9). In our analyses, there were a few statistically significant differences by participation status, but the number of such observations was consistent with chance expectation, and no statistically significant difference was consistent within and between studies. We also found no evidence of differences, beyond those expected by chance, between subjects opting to provide mouthwash samples for genetic analysis instead of blood samples.

As with confounding, a statistically significant association between a genetic variant and response is not necessary for bias to occur; a sufficient relationship must simply exist in the data (3). Thus, we have identified those single nucleotide polymorphisms in Table 1 and Fig. 1 with response differentials (odds ratios ≤0.70 or ≥1.40) that may result in substantial bias under certain circumstances. Although not examined in this series of analyses, for biased odds ratios to occur in case-control studies, response differentials must themselves be different between cases and controls. The mathematics of participation bias has been described elsewhere (22).
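The requirement that response differentials differ between cases and controls can be illustrated with the standard selection bias factor for a case-control odds ratio. This is a minimal sketch rather than the formulation in ref. 22: the function name and the participation probabilities are hypothetical, and each differential is expressed here as the ratio of participation probabilities for carriers versus noncarriers.

```python
def observed_odds_ratio(true_or, p_case_carrier, p_case_noncarrier,
                        p_ctrl_carrier, p_ctrl_noncarrier):
    """Odds ratio expected among participants, given the true odds ratio and
    the participation probability in each genotype-by-disease cell."""
    case_differential = p_case_carrier / p_case_noncarrier
    ctrl_differential = p_ctrl_carrier / p_ctrl_noncarrier
    return true_or * case_differential / ctrl_differential


# Hypothetical participation probabilities, for illustration only.
# Same differential in cases and controls: the odds ratio is unchanged.
same = observed_odds_ratio(1.0, 0.60, 0.80, 0.60, 0.80)

# Carriers under-respond among controls only: a null association is inflated.
different = observed_odds_ratio(1.0, 0.80, 0.80, 0.56, 0.80)

print(same, different)
```

In the second scenario the control differential is 0.70, so a true odds ratio of 1.0 would appear as roughly 1.43 among participants, which shows why differentials of the magnitude flagged in the text could matter if they were confined to one group.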

The present analysis had several strengths: multiple studies with polymorphisms in common permitted exploration and confirmation of study-specific findings, and each study provided data on plausible surrogates for nonresponse, such as reaction to incentives. Study A, in particular, provided a rare opportunity to assess genetic profiles of questionnaire nonresponders. A limitation was that the polymorphisms we examined were already available in these three studies; they were selected based on a priori disease associations, not as candidate variants in genes potentially related to willingness to participate.

Despite the apparent conundrum of assessing genetic characteristics of “true” nonresponders, we show there are opportunities to approach the question of response bias in molecular epidemiologic studies. Our findings, while reassuring, cannot exclude that differences by response exist in other genes. The potential for bias due to the “genetics of response” should continue to be evaluated, when possible, within the wider molecular epidemiologic research community.

Acknowledgments

We thank Sholom Wacholder and Lindsay Morton for their thoughtful comments on the manuscript.

Footnotes

Wacholder S, Rothman N, Caporaso N. Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev 2002;11:513–20.