Abstract

Background

Genome-wide association studies (GWAS) have identified numerous common SNPs associated with prostate cancer (CaP) risk in men of European descent. This study evaluates GWAS SNPs associated with CaP in African Americans (AA) and European Americans (EA).

Methods

800 SNPs were genotyped, including 32 from European-based GWAS and 35 flanking SNPs, in 417 AA and 455 EA cases from the NC-LA Prostate Cancer Project (PCaP) and compared to 925 AA and 1,687 EA controls from Illumina's iControlDB. The 32 GWAS SNPs were evaluated for their predictive power to discriminate between cases and controls using ROC curves.

Results

Of the 32 GWAS SNPs, 13 were significant at P < 0.05 in EA and 4 in AA (rs6983267, rs7017300, rs1859962, rs6501455). Three of 35 flanking SNPs, all from chromosome 8q, reached study-wide significance (P < 3.5×10−5): two in AA (rs10505476 and rs6985504) and one in EA (rs16901970). Among the remaining 656 SNPs, two were associated with CaP (P < 3.5×10−5): rs1472606 (OR: 1.43 in EA) and rs9351265 (OR: 1.48 in AA), both in intergenic regions. For the 32 GWAS SNPs, ROC plots yielded AUC estimates too low for clinical use (EA AUC = 0.60 and AA AUC = 0.56).

Conclusions

This study confirms a large proportion of the CaP-associated regions implicated by European-based GWAS and provides evidence that some of these regions may be important in AA CaP risk. Although a large panel of GWAS-replicated SNPs for CaP has been identified, this panel is not appropriate for clinical screening.

Introduction

Despite widespread screening and improved treatment, prostate cancer (CaP) remains a major public health concern with more than 192,290 cases diagnosed in the US annually [1]. Significant racial disparities exist, with African Americans (AA) facing a 70% higher incidence of CaP compared to Caucasian/European Americans (EA), higher than any other ethnic group in the US [2]. Quantitative estimates from twin studies indicate that 42% of the variation in CaP risk may be attributed to genetic components, higher than for any other type of common human cancer [3]. Genome-wide association studies (GWAS) have identified a number of common SNPs that are associated with CaP risk in populations of European descent [4-7], although these SNPs explain only a small fraction of the heritable component. Replication and fine mapping studies of these SNPs have focused largely on Europeans [8-11]. AAs have been evaluated less frequently despite their higher risk of CaP [5,8,12-15].

While the SNPs identified from GWAS show association with disease, there is little evidence that they are causal; rather, they are likely markers in linkage disequilibrium (LD) with as yet unidentified causal variants. Identifying the SNPs responsible for CaP susceptibility is particularly important because they may provide insight into the biology of the disease and a basis on which to develop new treatments. In addition, such causal SNPs provide the potential to improve screening and identification of high-risk patients, and an avenue toward personalized medicine. Extending replication studies and investigating populations with different LD structure may help identify the full complement of causal/functional genetic variation explaining CaP heritability. AA populations are of particular interest toward this end because of their high risk, higher genetic diversity, and finer-grained LD structure.

Publicly available genetic data sets and new tools for integrating linkage, molecular, and GWAS data make it possible to better inform SNP selection for genetic investigation. Here we use a newly available bioinformatics tool to construct a SNP panel based on GWAS results and functional prediction. Using this panel, we describe an analysis that compared publicly available iControlDB data (http://www.illumina.com/science/icontroldb.ilmn) to newly generated data from EA and AA CaP cases from North Carolina. Finally, a set of a priori SNPs from previous GWAS was assessed for predictive accuracy in discriminating CaP cases from controls.

Subjects and Methods

Study population

PCaP is a multi-disciplinary, population-based, case-only study, designed to address racial differences in CaP, which has been described previously [16]. Briefly, 1,133 AA and 1,128 EA incident CaP cases were rapidly ascertained and recruited in North Carolina and Louisiana, respectively. Subjects self-reporting as black/African American or white/Caucasian American, between 40-79 years of age at diagnosis with histologically confirmed adenocarcinoma of the prostate were eligible to participate. Informed consent was obtained from all subjects prior to blood and questionnaire collection. The study was approved by the University of North Carolina at Chapel Hill (UNC-CH) and Louisiana State University Health Sciences Center (LSUHSC) Institutional Review Boards and the Department of Defense Human Subjects Research Review Board. Because of delays in recruitment in Louisiana following Hurricane Katrina, this analysis is limited to North Carolina cases enrolled from September 2004 through December of 2007 with available DNA.

Study questionnaire information was collected via a structured in-home interview conducted by trained study nurses that included information on self-described race and ethnicity, family history of CaP, and detailed information on demographic characteristics, diet, and health history.

We used genotype data from Illumina iControlDB (http://www.illumina.com) as controls. Illumina's iControlDB is a freely available, online database of genotype and phenotype data from individuals that can be used as controls in association studies [17]. Control genotype data generated using the Illumina HumanHap550 chip were downloaded from Illumina's web site for 1260 EA and 925 AA men who had > 90% complete genotype data.

SNP selection

We designed a panel of SNPs for prostate cancer using our previously described SNPinfo web server (www.niehs.nih.gov/snpinfo) [18]. SNPinfo provides an integrated SNP-selection platform that incorporates primary data from previous GWAS studies, population-specific linkage disequilibrium, and detailed functional predictions based on coding, transcription factor binding, splicing, and miRNA binding. We selected SNPs using three pipelines: GenePipe for candidate genes (121 SNPs), LinkPipe for linkage regions (225 SNPs), and GenomePipe for the overall genome (200 SNPs) with details of parameter settings of SNPinfo provided in the supplemental materials. In addition, a set of “a priori” SNPs included 205 SNPs from published CaP GWAS and validation studies available at the time of panel selection [4-8,19-22] and 55 SNPs that had association P<0.01 in both CGEMS GWAS [6] and Framingham GWAS [23].

These a priori SNPs supplemented the SNPs chosen via SNPinfo and together included a total of 32 “replicated SNPs”, i.e. SNPs that were the primary findings of GWAS and replication studies, representing 21 different chromosomal regions, (see Supplementary Table 1). Not all of these SNPs had been established as replicated SNPs at the time of our selection. Additionally, our SNP set included 35 “flanking SNPs” that are located near replicated SNPs.

To control for population stratification, we selected 50 ancestry informative SNP markers (AIMs) based on allele frequency data in HapMap phase I+II populations (http://hapmap.ncbi.nlm.nih.gov). Twenty-five of these SNPs are monoallelic (variant allele frequency (VAF) = 0) in CEU (US residents with ancestry from Northern and Western Europe, collected by the Centre d'Etude du Polymorphisme Humain (CEPH)), rare in Asians (VAF < 0.01), but very common in populations of African ancestry (YRI, Yoruban from Nigeria, VAF > 0.65 and AA VAF > 0.25). The other 25 SNPs are monoallelic (VAF = 0) in YRI, rare in Asians (VAF < 0.05), but very common in CEU (VAF > 0.5).
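The frequency criteria above amount to a simple two-class filter. The following is a minimal Python sketch of that filter, not the selection code used in the study. The dictionary layout, the population keys (`CEU`, `YRI`, `ASN`, `AA`), and the SNP names are hypothetical and chosen only for illustration.

```python
def classify_aim(vaf):
    """Classify one SNP from its variant allele frequencies (VAF).

    `vaf` is a hypothetical dict keyed by population label; "ASN" stands
    in for the Asian HapMap samples. Returns "african", "european", or None.
    """
    ceu, yri, asn = vaf["CEU"], vaf["YRI"], vaf["ASN"]
    # Class 1: absent in CEU, rare in Asians, common in African ancestry.
    if ceu == 0 and asn < 0.01 and yri > 0.65 and vaf.get("AA", 0.0) > 0.25:
        return "african"
    # Class 2: absent in YRI, rare in Asians, common in CEU.
    if yri == 0 and asn < 0.05 and ceu > 0.5:
        return "european"
    return None


# Illustrative (made-up) frequencies for three candidate SNPs.
candidates = {
    "snp_a": {"CEU": 0.0, "YRI": 0.80, "ASN": 0.0, "AA": 0.60},
    "snp_b": {"CEU": 0.70, "YRI": 0.0, "ASN": 0.02},
    "snp_c": {"CEU": 0.30, "YRI": 0.40, "ASN": 0.35, "AA": 0.38},
}

selected = {}
for snp, vaf in candidates.items():
    label = classify_aim(vaf)
    if label is not None:
        selected[snp] = label
```

In this toy input only `snp_a` and `snp_b` pass the filter; `snp_c` is polymorphic in every population and is therefore uninformative for ancestry.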

Genotyping

DNA was extracted from blood samples (n=796) by the UNC-CH Biospecimen Processing Facility, or from peripheral blood mononuclear cells immortalized by the UNC-CH Tissue Culture Facility (n=89). Genotyping was performed by the Center for Inherited Disease Research (CIDR) at Johns Hopkins University using a custom designed 1,536 SNP Illumina GoldenGate array. Genotyping included a set of 22 blind duplicates, and a set of HapMap controls comprised of 8 CEU and 11 YRI trios.

Statistical Analysis

Associations of individual SNPs with CaP were tested using unconditional logistic regression assuming a log-additive genotypic model where genotypes were coded as 0, 1, or 2 according to the number of “risk alleles” identified in European-based GWAS for the 32 replicated SNPs (Supplemental Table 1), or the number of minor alleles (for all other SNPs). A P value threshold of P < 0.05 was used for the set of 32 a priori replicated SNPs and a Bonferroni-corrected study-wide P < 3.55 × 10−5 (based on 723 SNP comparisons in both EA and AA populations) was used for all other SNPs. The proportion of European or West African ancestry was used to adjust for population stratification. Ancestry proportion was estimated using STRUCTURE[24] based on AIMs. HapMap genotype data for 209 independent individuals from CEU, YRI, CHB and JPT populations were used to assist with individual ancestry estimation.
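To make the log-additive coding concrete, the sketch below fits a logistic model with a single genotype term (0, 1, or 2 risk alleles) by Newton-Raphson and reports the per-allele odds ratio with a Wald P value. It is an illustration under simplifying assumptions: the ancestry covariate used in the study is omitted, the function name `fit_log_additive` is hypothetical, and the toy counts are invented so that the true per-allele odds ratio is exactly 2.

```python
import math


def fit_log_additive(genotypes, is_case, iterations=25):
    """Fit logit P(case) = b0 + b1 * genotype by Newton-Raphson.

    genotypes: allele counts coded 0, 1, or 2; is_case: 1 for case, 0 for control.
    Returns (per-allele odds ratio, two-sided Wald P value for b1).
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for g, y in zip(genotypes, is_case):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += g * (y - p)    # score for the genotype term
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * g
            h11 += w * g * g
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) * score, written out for 2x2.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # After convergence the last information matrix gives Var(b1) = h00 / det.
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return math.exp(b1), p_value


# Toy data: odds of being a case are 1, 2, and 4 for 0, 1, and 2 risk alleles.
genotypes = [0] * 100 + [1] * 60 + [2] * 50
is_case = [1] * 50 + [0] * 50 + [1] * 40 + [0] * 20 + [1] * 40 + [0] * 10

odds_ratio, p_value = fit_log_additive(genotypes, is_case)
```

Under this coding each additional risk allele multiplies the odds of disease by the same factor, which is why a single coefficient summarizes the SNP effect.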

The “risk allele” reported in European-based GWAS was identified for each of the 32 replicated SNPs. The number of risk alleles was summed for each research subject. Receiver Operating Characteristic (ROC) curves were used to estimate the predictive accuracy of risk allele counts as a method to discriminate cases from controls.
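The risk-allele count and its discriminatory accuracy can be sketched as follows. The AUC is computed here in its Mann-Whitney form, that is, the probability that a randomly chosen case has a higher count than a randomly chosen control, with ties counted as one half. This equals the area under the empirical ROC curve, but it is only one way to obtain it and is not necessarily the procedure used in the study. The function names and the example scores are hypothetical.

```python
def risk_allele_count(per_snp_counts):
    """Sum the number of risk alleles (0, 1, or 2 per SNP) for one subject."""
    return sum(per_snp_counts)


def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise case-control comparisons."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties contribute one half
    return wins / (len(case_scores) * len(control_scores))


# Made-up genotype vectors for one subject across four SNPs.
example_count = risk_allele_count([0, 1, 2, 1])

# Made-up cumulative risk-allele counts for three cases and three controls.
example_auc = auc([3, 4, 5], [2, 3, 4])
```

An AUC of 0.5 means the count carries no information about case status, and 1.0 means every case has a higher count than every control.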

Simulation

Simulation was used to construct expected P value distributions after case-control sampling under different allele frequency scenarios. Two sets of 10,000 simulations were conducted: one where the risk (minor) allele frequency differed in cases (MAF = 0.29) and controls (MAF = 0.25), and a second, under the null, where risk (minor) allele frequencies were identical (MAF = 0.27). In each simulation, 500 cases and 500 controls were sampled randomly. Only those simulations where the major allele appeared to be the risk allele (discordant) were considered. Fisher's exact test was used to generate P values.
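A simulation of this kind can be sketched in Python as below. The sketch makes assumptions not stated in the text: alleles are drawn independently (two per subject), the test is applied to a 2x2 table of allele counts, and "discordant" is taken to mean that the minor allele is observed less often in cases than in controls. The function names, the random seed, and the reduced number of simulations (2,000 rather than 10,000, to keep the run short) are illustrative choices.

```python
import math
import random


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def log_p(x):
        # Hypergeometric log-probability of x in the top-left cell.
        return (log_comb(col1, x) + log_comb(total - col1, row1 - x)
                - log_comb(total, row1))

    lo, hi = max(0, row1 - (total - col1)), min(row1, col1)
    observed = log_p(a)
    log_ps = [log_p(x) for x in range(lo, hi + 1)]
    # Sum all tables that are no more probable than the observed one.
    return min(1.0, sum(math.exp(v) for v in log_ps if v <= observed + 1e-9))


def discordant_p_values(case_maf, control_maf, n_sims, n_per_group=500, seed=0):
    """P values from simulations where the major allele looks like the risk allele."""
    rng = random.Random(seed)
    n_alleles = 2 * n_per_group
    p_values = []
    for _ in range(n_sims):
        case_minor = sum(rng.random() < case_maf for _ in range(n_alleles))
        control_minor = sum(rng.random() < control_maf for _ in range(n_alleles))
        if case_minor < control_minor:  # discordant with the true risk allele
            p_values.append(fisher_two_sided(
                case_minor, n_alleles - case_minor,
                control_minor, n_alleles - control_minor))
    return p_values


associated = discordant_p_values(0.29, 0.25, n_sims=2000, seed=1)
null = discordant_p_values(0.27, 0.27, n_sims=2000, seed=1)
mean_associated = sum(associated) / len(associated)
mean_null = sum(null) / len(null)
```

Comparing `mean_associated` with `mean_null` shows the effect of conditioning on discordance: when a true frequency difference exists, the rare samples that point the wrong way can only do so weakly, so their P values tend to be large.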

Results

Allele calling was conducted using Illumina's Genotyping Module version 1.0.10 in GenomeStudio 1.0.2.20706. The genotype intensity cluster plots were visually inspected for each SNP. Individual genotypes with an Illumina GenCall (GC) score below 0.25 were assigned as missing. Six PCaP study subjects were excluded due to poor genotyping performance. Seventy-seven (9.6%) SNPs were excluded from association analysis due to a poor clustering pattern or parent-parent-child (P-P-C) heritability errors identified based on HapMap trios. The overall subject genotyping call rate was 99.95%. The reproducibility rate was 99.99% based on blind duplicates and the overall P-P-C heritability based on HapMap trios was 99.95%.

Quality control analysis was performed on the 723 SNPs that were typed in both our panel and the iControlDB. iControlDB genotype data were checked for erroneous duplicates and related subjects, and we excluded 48 EA and 42 AA men with an identity by state (IBS) score > 1.6 with another iControlDB subject [17]. Hardy-Weinberg Equilibrium (HWE) was assessed at each SNP locus in iControlDB using Fisher's exact test. Thirteen SNPs in EAs and 18 SNPs in AAs were excluded because HWE P values were ≤ 0.01.

For the remaining PCaP cases (n = 879) and iControlDB control subjects (n = 2095) that passed quality control criteria, individual ancestry was estimated using AIMs and the software STRUCTURE[24]. We excluded 5 PCaP cases and 41 iControlDB controls that self-identified as EA but had less than 85% European ancestry, and excluded 2 PCaP cases and 13 iControlDB controls that self-identified as AA but had more than 10% Asian ancestry. The ancestry estimates for the remaining individuals are shown in Supplementary Figure 1. PCaP EA men (n=455) had an average proportion of European ancestry of 0.98±0.02, which is almost identical to that observed in EA iControlDB men (n=1171, European ancestry 0.98±0.02). PCaP AA men (n=417) had an average proportion of African ancestry of 0.89±0.12 and European ancestry of 0.10±0.12; which is almost identical to iControlDB AA men (n=870, African ancestry 0.89±0.16, European ancestry 0.10±0.16).

Our SNP panel included 32 replicated SNPs from European-based GWAS and an additional 35 flanking SNPs (Figure 1 and Supplementary Table 1). Of the 32 replicated SNPs, 13 were significantly associated (P < 0.05) with CaP in EA and 4 in AA, with 3 SNPs significantly associated with CaP in both populations (Table 1). Associations with 3 of the 35 flanking SNPs reached the study-wide significance level (P < 3.55 × 10−5), including one in EA and two in AA; all 3 are located on chromosome 8q24 in regions 2, 4, and 5 (Figure 1, Table 1, and Supplementary Table 1). Associations with two of the remaining 656 SNPs reached the study-wide significance level (P < 3.55 × 10−5): SNP rs9351265 among AA and SNP rs1472606 among EA (Table 1). These SNPs are located in intergenic regions on chromosomes 6q16.1 and 5q35.2, respectively.

For the 32 replicated SNPs we also examined whether the risk allele identified from European-based GWAS was concordant with the risk allele identified in our study populations, independent of statistical significance (Supplemental Table 2). We found that PCaP EA risk alleles were concordant with GWAS reported European risk alleles for 27 of 32 SNPs, including all 13 SNPs with association P values <0.05 (Figure 2a).

Rank-order of association P values for 32 replication-SNPs. Risk alleles were identified from European-based GWAS and concordance with PCaP risk alleles is shown as solid circles (concordant) or open triangles (discordant). Under the null hypothesis that...

We also explored concordance between European-based GWAS risk alleles and the risk alleles for AA in our study (Supplementary Table 2). Under the null hypothesis that these SNPs are not associated with CaP in AA, equal numbers of concordant and discordant SNPs would be randomly distributed along the diagonal when rank-ordered by association P value (Figure 2B). Although we observed approximately equal numbers of concordant (n = 18) and discordant (n = 14) SNPs that roughly followed the diagonal, an unusual pattern emerged in which concordant SNPs were distributed across lower P values (mean P value = 0.20) and discordant SNPs were distributed across higher P values (mean P value = 0.67) (Wilcoxon rank sum test, P = 1.5×10−5; Student's t test, t = 3.17, P = 0.0074). To explore this pattern, we simulated samples of cases and controls from populations with minor allele frequencies that were either identical or slightly different (4%). Under the null (where simulated case and control populations had identical allele frequencies), discordant SNPs had the expected random distribution of association P values ranging from 0 to 1 (Supplementary Figure 2A). When simulated case and control populations had slightly different allele frequencies (i.e., where SNPs have a “true” association with disease), discordant SNPs had a distribution of association P values that was significantly skewed toward higher values, similar to that observed in our data (Supplementary Figure 2B).

Receiver Operating Characteristic (ROC) curve plots for EA and AA are shown in Figure 3. Each point represents a different observed cumulative risk allele count. Sensitivity (true positive fraction) is plotted against 1 – Specificity (false positive fraction). An optimal test maximizes sensitivity and minimizes the false positive fraction. An AUC of 0.5, shown by the dotted diagonal, represents a test with no ability to differentiate between cases and controls. AUCs were 0.60 (95% CI: 0.57-0.63) and 0.56 (95% CI: 0.53-0.60) for EA and AA, respectively.

Discussion

Prostate cancer genetics is relatively advanced in that several large GWAS and multiple large-scale replication studies have been published to date; however, the majority of these studies were of men of European descent [5-8,10,19,21,22,25-28]. To validate previous GWAS findings and identify additional SNPs associated with CaP, we selected and analyzed 800 SNPs with the assistance of SNPinfo web tools [18]. This panel included 32 replicated SNPs representing 21 distinct chromosomal regions reported by previous GWAS, and 35 flanking SNPs in these regions. The genotypes for AA and EA CaP cases were compared to iControlDB controls, a publicly available dataset of individual genotypes from selected racial groups that has been established for use in genetic association studies. This database has been used in 19 published association studies and shown to produce results that are comparable to those reported in matched case-control analyses (see supplementary text for the list of related peer-reviewed publications). Similar to the methods employed by other studies, we controlled for population stratification by removing outliers based on ancestry proportion estimates from STRUCTURE analysis. Both PCaP AA and EA cases are genetically well-matched to iControlDB AA and EA controls (Supplementary Figure 1). Although use of iControlDB controls has been established in multiple publications, these men were not explicitly screened for prostate cancer and thus may harbor undetected disease. Such misclassification of controls would be expected to lead to a slight bias toward the null and could reduce the number of GWAS hits confirmed in this case-control association analysis.

Given that 32 SNPs in our panel had already been established by previous GWAS, we used the 0.05 significance level when testing them for association. Nearly half of the 32 SNPs achieved nominal significance at the P < 0.05 level in EA men. Despite AA men having a higher incidence of prostate cancer, no CaP GWAS of AA has been published to date, and AA men have been underrepresented in replication studies. A total of 5 replication studies examining European GWAS hits have included African Americans [8,12-15]. These studies collectively examined 24 of the 32 replicated SNPs (Supplementary Table 2) surveyed in our panel and reported 6 SNPs (rs2660753 chr 3, rs6983267 chr 8, rs10896449 chr 11, rs4430796 chr 17, rs2735839 chr 19 and rs5945572 chr X) that showed significant evidence of CaP association in at least one study of AA. In our study, 4 of the 32 SNPs demonstrated CaP associations in AA, including one (rs6983267 on chr 8) of the 6 SNPs previously reported. The remaining 3 SNPs (rs7017300 chr 8, rs1859962 chr 17 and rs6501455 chr 17) are here identified as risk factors for CaP in AA for the first time. Thus, there are now a total of 9 SNPs that have been associated with CaP in AA.

The lack of confirmation in AA for many of the 32 European-based GWAS and the inconsistency of associated genetic variants identified in AA populations may be explained in part by the relatively small number of studies of AA reported to date as well as the relatively small sample size within each study. But more importantly, the lack of association may be related to the fact that LD structure often differs between EA and AA populations and GWAS hits are typically marker SNPs in linkage disequilibrium (LD) with causal alleles. Therefore, differences in LD structure between EA and AA populations may diminish the strength of associations when European-based marker SNPs are applied to AA populations. Thus, even if EA and AA share common causal alleles, the set of marker SNPs that show strong association in EA may show little or no association in AA. It is interesting to note that in our analysis we found an unexpected pattern: SNPs with discordant risk alleles between previous EA GWAS and PCaP AA tend to have large association P values and SNPs with concordant risk alleles tend to have small association P values. We demonstrate through simulation that when there is no difference in allele frequency between cases and controls, associations with discordant risk alleles will produce the expected random distribution of P values. However when allele frequencies differ between cases and controls (indicating an association with disease), associations with discordant risk alleles will be skewed toward high P values. Thus, although we only observe associations between CaP and 4 SNPs among AA, the distribution of discordant risk alleles may suggest that this set of 32 SNPs define loci important for CaP risk in AA.

Given the different LD structure of the AA population we also included a set of 35 SNPs adjacent to the 32 GWAS SNPs. Two SNPs reached study-wide significance in AA and one in EA, and all came from different subregions of 8q24. These 3 flanking SNPs produce a stronger signal than their corresponding replication SNPs (which were also significant), however, the flanking and replication SNPs are not in strong LD. Thus, these flanking SNPs may provide additional information for fine mapping of causal alleles in these chromosomal regions.

In addition to examining GWAS hits and related flanking SNPs, additional CaP SNPs were sought using our web-based SNP selection software tool SNPinfo [18]. This program allows researchers to combine GWAS information with linkage and functional data along with population-specific LD information for SNP selection. In constructing our panel, 5 SNPs (rs11649743, rs4857841, rs12543663, rs8102476 and rs620861) were included in 5 chromosome regions that were later reported as GWAS hits [10,25,27,28], highlighting the utility of our selection approach. However, 2 of these SNPs (rs8102476 and rs620861) were subsequently excluded because of poor Illumina design scores. We found two SNPs that reached study-wide significance, one in EA (rs1472606) and one in AA (rs9351265). SNP rs1472606 is located on chromosome 5q35 within a reported copy number variant [29] about 90 kb from the transcription start site of HRH2, a G protein-coupled histamine receptor gene. This SNP was previously demonstrated to have strong evidence of linkage in 606 CaP families with early age at diagnosis (≤ 65 years) [30]. To our knowledge, this is the first population-based study to identify this SNP as having a strong association with CaP risk, although the CGEMS GWAS showed some signal for this SNP as well (P = 0.0016, rank = 1098) [6]. The second SNP, rs9351265, is located at chromosome 6q16.1 in a gene-poor region 800 kb upstream from the transcription start site of MAP3K7. Although this SNP was not previously examined in AA, both the CGEMS follow-up study and Thomas et al. [7] found evidence of association with CaP in Europeans (CGEMS P = 0.00067, rank = 135; Thomas et al. P < 0.001, rank = 184). In addition, Liu et al. [31] found a deletion 820 kb from rs9351265 associated with high-grade prostate cancers.

There has been growing interest in the use of genetic profiles for personalized medicine. Existing genetic panels are being marketed for prediction of disease risk, although the predictive power of many of these has yet to be clearly demonstrated [32]. For CaP, Zheng et al. [33] suggested that an individual's allele counts for 5 SNPs correlated with increasing risk in Swedish men. In a subsequent study of US men, Salinas et al. confirmed that these 5 SNPs were significantly associated with risk, but the ROC curves obtained using clinical variables (AUC = 0.63) were not improved by inclusion of SNP information (AUC = 0.66) [34]. It has been suggested that an AUC > 0.75 may provide an appropriate threshold for screening tools in high-risk populations, while an AUC > 0.99 may be required for general population screening [32]. In our study using a much larger panel of 32 SNPs whose association with prostate cancer had been established a priori in previous GWAS replication studies, risk allele counts differed significantly between cases and controls (Europeans P = 1.7 × 10−11). Despite the profound difference in allele counts, the ROC curve analysis of our data shows poor discriminatory power for both EA (AUC = 0.60; 95% CI: 0.57-0.63) and AA (AUC = 0.56; 95% CI: 0.53-0.60). In part, this may be due to disease heterogeneity, if multiple subtypes of CaP have distinct genetic and environmental risk determinants. In addition, genetic heterogeneity is likely in CaP, meaning that different variations in the same gene, or variations across multiple genes (a number of which have yet to be identified), may also contribute to genetic susceptibility. While finer mapping may provide better characterization of causative SNPs and improve the clinical utility of CaP genetic panels, it is clear that our current panel is not adequate for general clinical use, even among high-risk individuals.

Supplementary Material

Supplementary Figures S1-S2 and Supplementary Tables S1-S2

Acknowledgments

The authors thank the North Carolina Central Cancer Registry and the PCaP staff, advisory committees and participants for their important contributions. The Prostate Cancer Project (PCaP) is carried out as a collaborative study supported by the Department of Defense contract DAMD 17-03-2-0052. This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences and the NIH National Center on Minority Health and Health Disparities.

Footnotes

Disclosure of Potential Conflicts of Interest: No potential conflicts of interest were disclosed.