Abstract

Genetic factors have been implicated in stroke risk but few replicated associations have been reported. We conducted a genome-wide association study (GWAS) in ischemic stroke and its subtypes in 3,548 cases and 5,972 controls, all of European ancestry. Replication of potential signals was performed in 5,859 cases and 6,281 controls. We replicated reported associations of variants close to PITX2 and ZFHX3 with cardioembolic stroke, and of a 9p21 locus with large vessel stroke. We identified a novel association between a SNP within the histone deacetylase 9 (HDAC9) gene on chromosome 7p21.1 and large vessel stroke, which was supported by additional replication in a further 735 cases and 28,583 controls (rs11984041, combined P = 1.87×10^−11, OR = 1.42, 95% CI 1.28-1.57). All four loci exhibit evidence for heterogeneity of effect across the stroke subtypes, with some, and possibly all, affecting risk for only one subtype. This suggests differing genetic architectures for different stroke subtypes.

Cerebrovascular disease (stroke) is one of the three most common causes of death and the major cause of adult chronic disability (1). Stroke represents an increasing health problem throughout the world as the proportion of elderly increases, and is an important cause of dementia and age-related cognitive decline. While conventional risk factors such as hypertension account for a significant proportion of stroke risk, much remains unexplained (2). Twin and family history studies suggest genetic factors are responsible for some of this unexplained risk (3). Stroke is a syndrome rather than a single disease, and subtypes of stroke are caused by a number of different specific disease processes. About 80% of stroke is ischemic; the three most common ischemic stroke subtypes are large vessel, cardioembolic and small vessel (lacunar) stroke. Genetic epidemiological studies show heterogeneity between stroke subtypes, the large vessel subtype being more strongly associated with family history (4). SNPs associated with atrial fibrillation were found only to be significantly associated with cardioembolic stroke (5,6), and a 9p21 variant initially associated with coronary artery disease and atherosclerosis only associated with large vessel stroke (7). This suggests that different genetic variants can predispose to different subtypes of ischemic stroke.

To date there have been few genome-wide association studies (GWAS) in ischemic stroke and few replicable associations have been identified (8). To further understand the genetic basis of ischemic stroke, we undertook a GWAS as part of the Wellcome Trust Case Control Consortium 2 (WTCCC2). We hypothesised that associations might be present only with specific stroke subtypes. To investigate this, cases were classified into stroke subtypes according to the pathophysiological TOAST classification (9), using clinical assessment as well as brain and vascular imaging where available (see Online Methods). Association analyses were performed on all ischemic stroke combined (including individuals not further classified by stroke subtype), and also the three major stroke subtypes: large vessel, small vessel and cardioembolic stroke. Discovery samples were of European ancestry and were genotyped on Illumina arrays (see Online Methods). Following quality control, the discovery set consisted of 3,548 cases (2,374 British, 1,174 German) and 5,972 controls (5,175 British WTCCC2 common controls, and 797 German controls) genotyped on an overlapping set of 495,851 autosomal SNPs (Table 1 and Online Methods). Within the British and German data, cases and controls were well matched for ancestry (see Online Methods and Supplementary Figure 1). We therefore performed association analysis separately in the two groups and combined them using a fixed effect meta-analysis approach. A two-stage replication study was performed in 5,859 cases (3,863 European, 1,996 American) and 6,281 controls (4,554 European, 1,727 American), all of self-reported European ancestry (Table 1 and Online Methods). Full details of the cohorts are available in the Supplementary material and Supplementary Table 1.

Post quality control breakdown of case and control by cohort and ischaemic stroke subtype

Table 2 shows results at previously reported loci and Figure 1 shows the association analysis results across the autosomes. We replicated an association between cardioembolic stroke and variants close to the PITX2 gene and also a SNP in the ZFHX3 gene, both of which were initially associated with atrial fibrillation, a well recognised risk factor for stroke (5,6,10). We also replicated a previously reported association between large vessel stroke and the 9p21 region (7). As we, and others, already reported (11,12), we did not confirm the previously published association between all stroke and variants in the 12p13 region (13, 14).

Association signals at the newly associated locus (upper tier) and at loci previously reported as associated with stroke or one of the stroke subtypes (lower tier)

Thirty-eight previously unreported loci showed potential association for all stroke or one of the stroke subtypes in the discovery samples, and we further investigated these loci in the European replication samples by genotyping 43 SNPs covering these loci as well as 7 SNPs to cover the previously reported loci (Supplementary Table 2). Thirteen of these previously unreported loci and the previously reported loci were taken forward to replication in the American samples with genotyping of 20 SNPs covering these regions (Supplementary Table 3). Most replication samples were genotyped using Sequenom assays; for those previously typed with GWAS chips we used genotype imputation where the SNP was not directly typed (see Supplementary Tables 2 and 3 and Online Methods). A SNP at chromosome 7p21.1 (rs11984041) showed evidence of association with large vessel stroke in the discovery data (P=1.07×10^−5) and in the joint European and US replication data in the same direction (one-sided P=7.9×10^−5). As a further check, we investigated this SNP in three further collections of large vessel cases and matched controls (735 cases, 28,583 controls in total), which we refer to as Stage 3 replication (see Online Methods for details). The Stage 3 data also showed evidence in the same direction (one-sided P=2.25×10^−4). Together, the combined discovery and three-stage replication data provide strong evidence for association (P=1.87×10^−11) and suggest each copy of the A allele increases risk of large vessel stroke by approximately 1.4-fold (Table 2 and Figure 2). This SNP is within the final intron of the gene HDAC9. The risk allele (A) frequency was 9.29% and 8.78% in the UK and German discovery controls respectively.

Forest plot for the associations between rs11984041 and large vessel stroke in discovery and replication collections

Standard statistical tests of association between rs11984041 and each of cardioembolic and small vessel stroke are not significant (discovery plus 2-stage replication p = 0.12, OR = 1.10, 95% CI = 0.98 – 1.23, and p = 0.06, OR = 1.13, 95% CI = 1.00 – 1.28 respectively). A non-significant result could simply be due to a lack of power: lack of significance in itself cannot rule out an effect in these subtypes. We investigated this potential genetic heterogeneity further by formally comparing different statistical models for the effect of the SNP on the different stroke subtypes. The models we compared were: (i) a model in which the variant has no effect on risk for any of the subtypes (“null” model); (ii) a model in which the SNP has the same effect on each subtype (“same effects” model); (iii) three models, in each of which the SNP has an effect on one subtype, and no effect for the other two subtypes (“LVD”, “SVD” and “CE” models respectively for the effect only in large vessel, small vessel, and cardioembolic stroke); and (iv) a “correlated effects” model allowing different, but correlated, effects for each subtype. We undertook the model comparison in a Bayesian statistical framework (see Online Methods for details), for our new association around HDAC9, as well as for the previously reported associations we confirmed as listed in Table 2. The results, based on the discovery and the first two stages of the replication, are shown in Figure 3.
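The kind of model comparison listed above can be illustrated with a minimal sketch. It is not the implementation used in the study (which is described in the Online Methods); it assumes approximate Bayes factors from per-subtype log odds ratio estimates and standard errors, independent estimation errors across subtypes, zero-mean normal priors on effects, and equal prior weight on each model. The prior standard deviation `sigma`, the prior correlation `rho`, and the input numbers are all hypothetical.

```python
import math

def log_mvn_zero_mean(x, cov):
    """Log density of N(0, cov) at x, via a Cholesky factorisation (cov must be positive definite)."""
    n = len(x)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution solves L y = x, so that x' cov^-1 x = y'y
    y = []
    for i in range(n):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + sum(v * v for v in y))

def model_posteriors(beta_hat, se, sigma=0.2, rho=0.5):
    """Posterior model probabilities under equal prior model weights (an assumption).

    beta_hat, se: log odds ratios and standard errors, ordered LVD, SVD, CE.
    sigma, rho:   hypothetical prior standard deviation and prior correlation.
    """
    names = ["LVD", "SVD", "CE"]
    n = len(names)

    def prior(shape):
        # Prior covariance of the true effects: sigma^2 times a shape matrix
        return [[sigma ** 2 * shape[i][j] for j in range(n)] for i in range(n)]

    priors = {
        "null": prior([[0.0] * n for _ in range(n)]),
        "same effects": prior([[1.0] * n for _ in range(n)]),
        "correlated effects": prior(
            [[1.0 if i == j else rho for j in range(n)] for i in range(n)]),
    }
    for k, name in enumerate(names):
        # Effect in subtype k only: a single non-zero prior variance
        priors[name] = prior(
            [[1.0 if i == j == k else 0.0 for j in range(n)] for i in range(n)])

    log_ml = {}
    for name, P in priors.items():
        # Marginal distribution of the estimates: N(0, prior covariance + diag(se^2))
        cov = [[P[i][j] + (se[i] ** 2 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        log_ml[name] = log_mvn_zero_mean(beta_hat, cov)

    top = max(log_ml.values())
    weights = {name: math.exp(v - top) for name, v in log_ml.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Illustrative (made-up) inputs, not the study's data
post = model_posteriors(beta_hat=[0.35, 0.12, 0.10], se=[0.05, 0.06, 0.06])
for name, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {prob:.3g}")
```

In practice, cases of different subtypes compared against shared controls give correlated estimates, so the independence assumption here is a simplification, and the results are sensitive to the choice of priors.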

Genetic heterogeneity of different stroke sub-types for the 4 loci with significant associations: HDAC9, PITX2, 9p21(CDKN2A/CDKN2B) and ZFHX3

For rs11984041 at HDAC9 there is very strong evidence against the null model and both the SVD and CE models (unsurprisingly given we ascertained this SNP on the basis of evidence for an effect in LVD) and also strong evidence against the model in which the SNP has the same effect in each subtype, thus demonstrating genetic heterogeneity across stroke subtypes at this SNP. The greatest posterior weight rests on the model in which there is only an effect for large vessel disease, with some weight on the correlated effects model, and in this model the posterior distributions on effect size for SVD and CE stroke are concentrated on much smaller effect sizes than for LVD.

In our data, heterogeneity is also seen at rs2383207 in the 9p21 region, a locus associated with heart disease and related phenotypes, and previously associated with large vessel stroke. Most support is for the model in which the effect sizes for the three stroke subtypes are correlated but there is also substantial weight on the model in which there is only an effect for large vessel stroke. The same analyses in our data for the top SNPs in the regions previously associated with cardioembolic stroke (PITX2 region, rs1906599, and ZFHX3 region, rs12932445) show strong support for the model in which these SNPs only affect risk for cardioembolic stroke. Together these analyses provide compelling evidence for heterogeneity of genetic effects between stroke subtypes.

The association with rs11984041, in the gene HDAC9, implicates a novel region of the genome in an individual’s susceptibility to stroke. Any association with stroke could be mediated via associations with intermediate cardiovascular risk factors that themselves increase large vessel stroke risk. Our study design does not allow a direct assessment of this, as such risk factors were not available for control individuals. However, to date no associations have been reported between rs11984041 or correlated SNPs and hypertension (15), hyperlipidaemia (16), or diabetes (17) from large-scale GWAS of these risk factors.

Association signals for genetic variants surrounding HDAC9 are shown in Figure 4. All variants showing an association signal reside within a peak between two recombination hotspots and encompass the tail end of HDAC9. The downstream genes TWIST1 and FERD3L are physically relatively close to the identified peak and cannot be excluded as possible mechanisms via which genetic variants may exert cis-effects on the large vessel stroke phenotype. HDAC9 is a member of a large family of genes that encode proteins responsible for deacetylation of histones, and therefore regulation of chromatin structure and gene transcription (18). HDAC9 is ubiquitously expressed, with high levels of expression in cardiac tissue, muscle and brain (19). Although known as histone deacetylases, these proteins also act on other substrates (20) and lead to both upregulation and downregulation of genes (21).

Plot of association signals around rs11984041 for large vessel stroke in the combined British and German discovery samples. SNPs are coloured based on their correlation (r2) with the labelled hit SNP which has the smallest P-value in the region. r2 is...

The mechanism by which variants in the HDAC9 region increase large vessel stroke risk is not immediately clear. The specific association with this stroke subtype would be consistent with the association acting via accelerating atherosclerosis. The HDAC9 protein inhibits myogenesis and is involved in heart development (19) although deleterious effects on systemic arteries have not yet been reported. Alternatively it could increase risk by altering brain ischaemic responses and therefore have effects on neuronal survival. The protein has been shown to protect neurons from apoptosis, both by inhibiting JUN phosphorylation by MAPK10 and by repressing JUN transcription. HDAC inhibitors have been postulated as a treatment for stroke (22).

It is informative that a large GWAS (~3,500 cases, ~6,000 controls) failed to find any novel associations for the combined phenotype of ischemic stroke. It may be that the genetic architecture of the disease involves fewer variants of more moderate effect than many other diseases, and/or that these happen not to be well tagged by the Illumina 660-W chip used in the study. On the other hand, as our data demonstrate, all the known loci exhibit genetic heterogeneity across the stroke subtypes, with at least some, and possibly all, affecting only a single subtype. This supports the possibility that distinct subtypes of the disease have differing genetic architectures. However, this is based on only four loci and does not exclude the possibility that future loci associated with stroke may predispose to all ischaemic stroke. Clinical classification of disease into subtypes is not perfect. Since errors in classification would reduce power to detect heterogeneity, our findings of homogeneity within classes indirectly reinforce the value of current classification methods. Because GWAS to date, including the one reported here, have had relatively small sample sizes for each disease subtype (and hence are underpowered for common variants of small effect), it remains possible, and indeed a priori likely, that the range of effect sizes for each subtype will be similar to those for other common diseases. This suggests that future genetic studies should study adequate sample sizes for particular subtypes of ischaemic stroke, rather than for the disease as a whole.

In summary, in this, the largest GWAS of ischemic stroke conducted to date, we identified a novel association with the HDAC9 gene region in large vessel stroke with an estimated effect size which is at the larger end for GWAS loci (OR 1.38, 95% CI 1.22-1.57 from replication data). We also replicated three other known loci, and showed genetic heterogeneity across subtypes of the disease for all four stroke loci. This genetic heterogeneity seems likely to reflect heterogeneity in the underlying pathogenic mechanisms, and reinforces the need for separate consideration of stroke subtypes in the research and clinical context.


Acknowledgements

The principal funding for this study was provided by the Wellcome Trust, as part of the Wellcome Trust Case Control Consortium 2 project (085475/B/08/Z and 085475/Z/08/Z and WT084724MA). For details of other funding support see Supplementary Material.

We thank S. Bertrand, J. Bryant, S.L. Clark, J.S. Conquer, T. Dibling, J.C. Eldred, S. Gamble, C. Hind, M.L. Perez, C.R. Stribling, S. Taylor and A. Wilk of the Wellcome Trust Sanger Institute’s Sample and Genotyping Facilities for technical assistance. We acknowledge use of the British 1958 Birth Cohort DNA collection, funded by the Medical Research Council grant G0000934 and the Wellcome Trust grant 068545/Z/02, and of the UK National Blood Service controls funded by the Wellcome Trust. We thank W. Bodmer and B. Winney for use of the People of the British Isles DNA collection, which was funded by the Wellcome Trust.

APPENDIX

Methods

Study subjects

All subjects were of self-reported European ancestry. Patients were classified into mutually exclusive etiologic subtypes according to the Trial of Org 10172 in Acute Stroke Treatment (TOAST) (9). TOAST classification was performed in all stroke cases. The TOAST system has a category of “etiology unknown” which includes cases in which no cause has been found due to insufficient investigation, as well as cases where no cause is found despite full investigation. This “unknown” group was not analysed in the subtype analyses described in this paper, which focussed only on those patients in whom there were appropriate investigations to assign one of three subtypes: large vessel disease, cardioembolic and small vessel disease. The unknown cases were only included in the analyses of all ischaemic stroke, which did not take subtype into account.

Our main analyses were of associations with all ischemic stroke and with the three main subtypes: large vessel, cardioembolic and small vessel stroke. We performed additional analyses in the discovery populations with young stroke (age <70 years at first stroke), and with the presence of large vessel stenosis and, separately, the presence of cardioembolic source, irrespective of assigned subtype. These last two analyses allowed inclusion of patients whose data was excluded from individual subtype analysis because they had more than one potential stroke subtype. Details of individual populations are given in Table 1 and in supplementary material.

DNA sample preparation

GWA genotyping

Samples from the cases were genotyped at the WTSI on the Human660W-Quad (a custom chip designed by WTCCC2 comprising Human550 and circa 6000 common CNVs from the Structural Variation Consortium (24)). Samples from British control collections were genotyped on the Human1.2M-Duo (a WTCCC2 custom array comprising Human1M-Duo and the CNV content described above). Bead intensity data was processed and normalized in BeadStudio; data for successfully genotyped samples was extracted and genotypes called within collections using Illuminus (25). German controls were typed on the Illumina Human 550k platform, and intensity data was processed and normalized for each sample in GenomeStudio using the Illumina cluster file HumanHap550v3.

GWA quality control

Samples

As previously described (26,27), we removed samples whose genome-wide patterns of diversity differed from those of the collection at large, interpreting these differences as likely to be due to biases or artefacts. To do so we used a Bayesian clustering approach (28) to infer outlying individuals on the basis of call rate, heterozygosity, ancestry, and average probe intensity. To obtain a set of putatively unrelated individuals we used a hidden Markov model (HMM) to infer identity by descent and then iteratively removed individuals to obtain a set with pair-wise identity by descent <5%. To guard against sample mishandling we removed samples if their inferred gender was discordant with recorded gender, or if <90% of the SNPs typed by Sequenom on entry to sample handling (see above) agreed with the genome-wide data. Our final discovery dataset consisted of 3,548 cases (2,374 British, 1,174 German) and 5,972 controls (5,175 British, 797 German) following sample quality control (Supplementary Table 4). A full breakdown of samples by cohort and subtype is in Table 1.
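The iterative relatedness filter described above (pair-wise identity by descent <5%) could be sketched as follows. The order in which individuals were removed is not stated here, so the greedy rule used below (drop the sample involved in the most related pairs, breaking ties by sample ID) is an assumption, as are the function name and the input format; estimation of identity by descent by the HMM is taken as given.

```python
def prune_related(ibd_pairs, threshold=0.05):
    """Greedily remove samples until no remaining pair has IBD >= threshold.

    ibd_pairs: dict mapping (sample_a, sample_b) -> estimated IBD proportion
    Returns the set of removed sample IDs.
    """
    related = {pair for pair, ibd in ibd_pairs.items() if ibd >= threshold}
    removed = set()
    while related:
        counts = {}
        for a, b in related:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        # Drop the sample in the most related pairs; ties go to the smallest ID
        worst = max(sorted(counts), key=lambda s: counts[s])
        removed.add(worst)
        related = {(a, b) for a, b in related if worst not in (a, b)}
    return removed

# Illustrative (made-up) IBD estimates
pairs = {
    ("s1", "s2"): 0.50,
    ("s2", "s3"): 0.25,
    ("s4", "s5"): 0.01,
    ("s6", "s7"): 0.12,
}
removed = prune_related(pairs)
print(sorted(removed))
```

Removing the most connected sample first tends to retain more individuals than removing one member of each related pair arbitrarily, although other tie-breaking rules (for example, preferring to keep the sample with the higher call rate) are equally plausible.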

SNPs

A measure of (Fisher) information for allele frequency at each SNP was calculated using SNPTEST (see URLs). Autosomal SNPs were excluded if this information measure was below 0.98, if minor allele frequency was <0.01%, if the SNP had >5% missing data, or if Hardy Weinberg p-value was <1×10^−20 in the case or control collections. In the 58C, UKBS and case data sets, association between SNP and the plate on which samples were genotyped was calculated and SNPs with a plate effect p-value <1×10^−6 were also excluded. An additional 45 SNPs were removed following visual inspection of cluster plots. A breakdown of the number of SNPs excluded is provided in Supplementary Table 5. Only SNPs genotyped on all the case and control collections were considered, leaving 495,851 autosomal SNPs after quality control. Hardy Weinberg p-values for the SNPs taken to replication are given in Supplementary Table 6.
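As an illustration, the automated exclusion thresholds listed above can be expressed as a simple filter. This is a sketch only: the record type `SnpStats` and its field names are hypothetical, the summary statistics are assumed to have been computed beforehand (for example by SNPTEST), and the manual cluster-plot inspection step is not represented.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    # Hypothetical per-SNP summary record; field names are illustrative
    info: float            # information measure for allele frequency
    maf: float             # minor allele frequency, as a proportion
    missing: float         # proportion of missing genotype calls
    hwe_p_cases: float     # Hardy Weinberg p-value in cases
    hwe_p_controls: float  # Hardy Weinberg p-value in controls
    plate_p: float = 1.0   # plate-effect p-value, where it was computed

def passes_qc(s: SnpStats) -> bool:
    """Return True if the SNP survives the exclusion thresholds quoted in the text."""
    if s.info < 0.98:
        return False
    if s.maf < 0.0001:     # 0.01% expressed as a proportion
        return False
    if s.missing > 0.05:
        return False
    if min(s.hwe_p_cases, s.hwe_p_controls) < 1e-20:
        return False
    if s.plate_p < 1e-6:
        return False
    return True

# Illustrative (made-up) records
good = SnpStats(info=0.995, maf=0.12, missing=0.01, hwe_p_cases=0.3, hwe_p_controls=0.6)
low_info = SnpStats(info=0.90, maf=0.12, missing=0.01, hwe_p_cases=0.3, hwe_p_controls=0.6)
plate_artefact = SnpStats(info=0.995, maf=0.12, missing=0.01,
                          hwe_p_cases=0.3, hwe_p_controls=0.6, plate_p=1e-9)
print([passes_qc(s) for s in (good, low_info, plate_artefact)])
```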

Initial replication genotyping and quality control

Genotyping of European replication samples was carried out at the WTSI using the Sequenom iPLEX Gold assay, and genotyping of the US samples was carried out at the Broad Institute, Boston, USA using the Sequenom platform, with the exception of the GEOS Study, in which genotyping was carried out using the Illumina Human Omni1-Quad. Imputation to HapMap3 was performed using the BEAGLE software program (29). Individual samples were excluded from analysis if they had call rates <80% or if reported gender was discordant with gender specific markers. We removed pairs of samples showing concordance indicative of being duplicates.

The PoBI samples were genotyped on the custom Human1.2M-Duo array using Illumina’s Infinium platform and subjected to similar quality control as described above; for each SNP used in replication the cluster plot was visually inspected.

The PROCARDIS controls were genotyped with the Illumina HumanHap 610 Quad beadchip. PCA with HapMap2 reference population data allowed exclusion of individuals with non-European ancestry. Subsequent PCA with HapMap3 on German stroke samples with GWAS data and additional European reference population data, showed German PROCARDIS controls had similar ancestry to German stroke cases (data not shown).

Third stage replication of rs11984041 SNP

For the deCODE cases and controls, genotyping was performed on 317K or 370K Illumina chips. The SNP rs11984041 was imputed using HapMap. ASGC case and control samples were genotyped on the Illumina HumanHap610-Quad array, on which SNP rs11984041 was directly genotyped. Milan case samples were genotyped using the Illumina Human610-Quadv1_B or Human660W-Quad_v1_A beadchip; both include the rs11984041 SNP. Milan controls were genotyped with the Illumina HumanHap 610 Quad beadchip. PCA with HapMap3 on the Italian stroke samples showed that Italian PROCARDIS controls had similar ancestry to Italian stroke cases.

Genotype imputation

Association analysis

We performed single SNP analysis separately in the British and German discovery data sets under an additive model (on the log-odds scale) using missing data likelihood score tests as implemented in SNPTEST. We conducted a fixed effect meta-analysis in R to combine the evidence of association, averaging the estimated effect size parameters associated with genotype risk across the two data sets, weighting the effect size estimates by the inverse of the square of corresponding standard errors. P-values were calculated assuming the combined data z-score to be normally distributed. The British and German cohorts had an inflation factor ranging from 1.014 to 1.058 and from 1.011 to 1.044 respectively, depending on the stroke subtype considered (Supplementary Figure 1). This analysis was also performed separately in males and females.
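The fixed effect meta-analysis described above (inverse-variance weighting of per-cohort effect estimates, with a normal approximation for the combined z-score) can be sketched as follows. The study's analysis was carried out in R; this Python version is an illustrative equivalent, and the function name and input values are assumptions rather than anything taken from the study.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    betas: per-cohort log odds ratio estimates
    ses:   corresponding standard errors
    Returns (pooled beta, pooled SE, z-score, two-sided P-value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided P-value for a standard normal z: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Illustrative (made-up) inputs for two cohorts, not values from the study
beta, se, z, p = fixed_effect_meta([0.30, 0.40], [0.10, 0.20])
print(f"OR = {math.exp(beta):.2f}, z = {z:.2f}, P = {p:.2e}")
```

Because the weights are the reciprocals of the squared standard errors, the more precisely estimated cohort dominates the pooled estimate; with the inputs above the pooled log odds ratio is 0.32, closer to the first cohort's 0.30.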

We also conducted a genome-wide scan analysis based on a Bayesian model which allows each stroke subtype to have its own effect and models relationships between these effects using a hierarchical prior specification. The same effects were assumed for the corresponding stroke subtype in both British and German populations (Supplementary Table 2).

Finally we performed a genome-wide scan using GENECLUSTER (31). This estimates the genealogical tree of the case-control sample at a position of interest based on the genealogy of a reference panel (HapMap2 CEU in our case), by simultaneously phasing and clustering the case and control haplotypes to the tips of the reference genealogy. The method detects signals of association in the form of differential clustering of cases and controls underneath a branch, or a number of branches, in the estimated genealogy, which is equivalent to associations due to haplotypic effects or allelic heterogeneity (Supplementary Table 2).

Replication

Replication of potential associations found in the GWAS of the discovery cohorts was conducted in two stages in independent European and American samples. In the European replication cohorts we investigated 50 SNPs that either were in loci reported in the literature from previous GWAS, or showed potential associations (P<1×10^−5) with all stroke or one of the stroke subtypes in analysis of the discovery data set, and showed consistent direction of effect in both British and German cohorts (Supplementary Table 2). This threshold was chosen based on resources available for replication. After analysis of the combined results of the discovery and European replication populations, 20 of these SNPs were taken forward to second stage replication in the American samples (Supplementary Table 3).
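The first-stage selection rule described above can be written as a short filter. The sketch below reflects one reading of the criteria, in which the P-value threshold and the direction-of-effect check apply to novel signals while previously reported loci are carried forward regardless; the dictionary keys and the example records are hypothetical.

```python
def select_for_replication(snps, p_threshold=1e-5):
    """Return rsIDs meeting the selection rule, under one reading of the criteria."""
    selected = []
    for s in snps:
        # Consistent direction: the two cohort effect estimates share a sign
        same_direction = s["beta_british"] * s["beta_german"] > 0
        novel_signal = s["p_discovery"] < p_threshold and same_direction
        if s["previously_reported"] or novel_signal:
            selected.append(s["rsid"])
    return selected

# Illustrative (made-up) candidates
candidates = [
    {"rsid": "rsA", "p_discovery": 2e-6, "beta_british": 0.30,
     "beta_german": 0.25, "previously_reported": False},
    {"rsid": "rsB", "p_discovery": 3e-6, "beta_british": 0.30,
     "beta_german": -0.10, "previously_reported": False},
    {"rsid": "rsC", "p_discovery": 0.02, "beta_british": 0.10,
     "beta_german": 0.12, "previously_reported": True},
    {"rsid": "rsD", "p_discovery": 4e-4, "beta_british": 0.20,
     "beta_german": 0.15, "previously_reported": False},
]
chosen = select_for_replication(candidates)
print(chosen)
```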

Association analysis was performed in each replication cohort separately via logistic regression assuming an additive genetic model. Evidence of association across the replication data was combined using a fixed effect meta-analysis as previously described. Data on the presence or absence of a cardioembolic source or large vessel stenosis (irrespective of assigned TOAST subtype) were not available in all of the replication cohorts. For replication of SNPs identified due to association with these phenotypes in the discovery cohorts, we assessed association in the replication cohorts with the cardioembolic or large vessel stroke subtypes respectively.
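A logistic regression under an additive genetic model, as used for the per-cohort replication analysis above, can be sketched on genotype counts alone. The version below is a minimal illustration that fits an intercept and a per-allele log odds ratio by Newton-Raphson; it assumes directly typed genotypes with no covariates and no genotype uncertainty, and the function name and example counts are hypothetical.

```python
import math

def additive_logistic(cases, controls, max_iter=50, tol=1e-10):
    """Fit logit P(case) = a + b*g by Newton-Raphson on genotype counts.

    cases, controls: counts for g = 0, 1, 2 copies of the coded allele.
    Returns (b, se_b): per-allele log odds ratio and its standard error.
    """
    a = b = 0.0
    for _ in range(max_iter):
        ga = gb = iaa = iab = ibb = 0.0
        for g in (0, 1, 2):
            n = cases[g] + controls[g]
            p = 1.0 / (1.0 + math.exp(-(a + b * g)))
            resid = cases[g] - n * p   # observed minus expected cases
            w = n * p * (1.0 - p)      # contribution to the Fisher information
            ga += resid
            gb += g * resid
            iaa += w
            iab += g * w
            ibb += g * g * w
        # Newton step: solve the 2x2 system (information matrix) * delta = score
        det = iaa * ibb - iab * iab
        da = (ibb * ga - iab * gb) / det
        db = (iaa * gb - iab * ga) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    # Variance of b is the corresponding element of the inverse information matrix
    return b, math.sqrt(iaa / det)

# Illustrative (made-up) counts in which the odds of being a case double per allele
b, se_b = additive_logistic(cases=[100, 200, 400], controls=[100, 100, 100])
print(f"per-allele OR = {math.exp(b):.2f}, SE(log OR) = {se_b:.3f}")
```

The resulting per-cohort estimates and standard errors are the quantities that would then be combined across cohorts by the fixed effect meta-analysis.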