Lumbar disc degeneration (LDD) is associated with both genetic and environmental factors and affects many people worldwide. A hallmark of LDD is loss of proteoglycan and water content in the nucleus pulposus of intervertebral discs. While some genetic determinants have been reported, the etiology of LDD is largely unknown. Here we report the findings from linkage and association studies on a total of 32,642 subjects consisting of 4,043 LDD cases and 28,599 control subjects. We identified carbohydrate sulfotransferase 3 (CHST3), an enzyme that catalyzes proteoglycan sulfation, as a susceptibility gene for LDD. The strongest genome-wide linkage peak encompassed CHST3 from a Southern Chinese family–based data set, while a genome-wide association was observed at rs4148941 in the gene in a meta-analysis using multiethnic population cohorts. rs4148941 lies within a potential microRNA-513a-5p (miR-513a-5p) binding site. Interaction between miR-513a-5p and mRNA transcribed from the susceptibility allele (A allele) of rs4148941 was enhanced in vitro compared with transcripts from other alleles. Additionally, expression of CHST3 mRNA was significantly reduced in the intervertebral disc cells of human subjects carrying the A allele of rs4148941. Together, our data provide new insights into the etiology of LDD, implicating an interplay between genetic risk factors and miRNA.

Low back pain (LBP) is one of the most common symptoms of spinal abnormalities, with an annual point prevalence averaging 30% (1). It is a major factor affecting quality of life and has a significant social and economic impact worldwide (2). One cause of LBP is lumbar disc degeneration (LDD) (3–6), a subset of intervertebral disc (IVD) degeneration (IDD; OMIM 603932), which includes lumbar disc herniation (LDH) and sciatica (7).

The etiology of LDD is not fully understood. While factors such as body weight, mechanical loading, physical activities, and smoking may have a role (8–11), familial predisposition (12, 13) and twin studies (6, 14–16) have demonstrated a significant genetic contribution, with heritability estimates as high as 74% (16).

LDD presents with a complex cascade of degenerative events. The current gold standard for phenotype analysis is an assessment from T2-weighed MRI of the spine, which provides key information on disc dehydration as an indicator of degeneration, and associated changes such as reduced disc height, herniations of cartilaginous end-plate (Schmorl’s nodes), or herniations of the nucleus pulposus (NP) characterized by sciatica. Numerous genetic risk factors for LDD have been reported and recently reviewed (17–19). There is moderate statistical evidence for ASPN (20), COL11A1 (21), GDF5 (22), SKT (23), THBS2 (24), and MMP9 (24), based on criteria for the assessment of genetic outcomes (25), with studies performed in cohorts of reasonable sample size and/or follow-up via replication or functional studies. Most of the genetic risk factors identified so far are from candidate gene studies, selected on the basis of our limited understanding of IVD biology in health and disease. PARK2 was recently identified as a risk factor through meta-analysis of 4,600 northern European subjects (26).

The present study is the largest-scale investigation thus far of the genetic determinants of LDD. We considered evidence from both a genome-wide linkage analysis of families with early-onset LDD and a large genome-wide association study (GWAS) meta-analysis using several population samples. From these studies, we identified a novel candidate gene for LDD, carbohydrate sulfotransferase 3 (CHST3), which we followed up with functional analyses that linked the putative risk allele with the function of microRNAs (miRNAs).

A region in chromosome 10 with significant linkage to early-onset LDD. To identify genetic loci associated with LDD, we first carried out a 2-stage linkage analysis in families of Southern Chinese origin with early-onset LDD. The first-stage genome-wide linkage analysis was performed with 89 individuals from families 1–10 (Supplemental Figure 1; supplemental material available online with this article; doi:
10.1172/JCI69277DS1), using 400 microsatellite markers with an average resolution of 10 cM. Cases and controls were defined based on the age-adjusted LDD score (20): individuals with age-adjusted LDD scores 0.5 SD above the mean value for their age group were assigned as cases; all others were assigned as controls. Nonparametric linkage (NPL) analysis of the data identified candidate regions on chromosomes 1, 5, 8, 10, and 20 (Figure 1A).

Linkage and association analyses for LDD. (A) NPL Z-score plot of the 2-stage genome-wide linkage analysis. Scores of the first-stage analysis using 10 Southern Chinese families are plotted against the position of the microsatellite markers across the different chromosomes (gray dotted line). Suggestive regions (gray arrows) were then analyzed in the second stage using additional markers and an additional 8 families; the total scores of these markers are denoted by blue circles. A score of 3 was used as the threshold for significance. (B) Workflow of the 4-stage GWAS and the 3-stage fine-mapping of CHST3 in the identification of potential causal variants reaching genome-wide (GW) significance. The number of SNPs selected for analysis at each stage, the cohort used, and the total number of cases and controls in the study are indicated. (C) Position of the SNPs in relation to CHST3. Purple diamonds, –log10(P) values; red squares, P values of the 2 significant SNPs within the 3′UTR in a meta-analysis using 3 population cohorts; blue circle, original SNP from the GWAS and replication studies. The physical map of the gene structure of CHST3 is shown in relation to the physical position in chromosome 10. The LD map (r2) was generated from the 13 SNPs genotyped from the SC-1 + SC-2 cohort (n = 2,999).

In the second-stage linkage analysis, we added 37 individuals from 8 new families (families 11–18; Supplemental Figure 1), with 19 new microsatellite markers added at 5-cM intervals around the candidate regions. Only markers in the chromosome 10 region showed a substantial increase in NPL Z-score, while markers in other chromosomal regions showed no change or a decrease in NPL Z-score (Figure 1A). The maximum NPL Z-score was 3.72 at 98.96 cM (D10S569) within the 10q21.2-q23.1 region (72.065–79.223 Mb on chromosome 10).

Case-control GWAS identified variants in CHST3 (in 10q21.2-q23.1) at P < 5 × 10–8.
We performed case-control association studies using 7 independent cohorts (Table 1) in a GWAS composed of 4 stages, in which the number of SNPs was reduced and the sample size increased at each stage (Figure 1B). In stage 1, we examined a Japanese cohort (J1; 366 LDD cases, 3,331 controls) using the HumanHap550v3 Genotyping BeadChip (Illumina). After data quality assessment, we performed the Cochran-Armitage trend test on these case-control data at 464,775 SNPs genome-wide. Principal component analysis found no strong genetic heterogeneity (Supplemental Figure 2A), and the genomic control inflation factor (λGC) was 1.03 (Supplemental Figure 2B), indicative of weak population substructure. We then selected the top 1,500 SNPs according to P value (7.59 × 10–7 ≤ P ≤ 3.53 × 10–3) for stage 2, which consisted of genotyping in a second independent Japanese cohort (J2; 544 cases, 15,800 controls), and analyzed 1,349 SNPs after data quality assessment. However, none of the SNPs achieved genome-wide significance after meta-analysis using Mantel-Haenszel combined minimum P. Therefore, we took the top 10 SNPs according to P value to stage 3, which included an additional Japanese cohort (J3; 242 cases, 622 controls) and a Northern Chinese cohort (NC; 572 cases, 776 controls). The smallest P value from stage 3 was for rs1245582 (P = 1.46 × 10–6; Table 2). While the association at rs1245582 was not genome-wide significant, the SNP lies within the 10q21.2-q23.1 region identified by the linkage study, so we took this SNP forward for genotyping in stage 4.

rs1245582 is located within an LD block of approximately 85 kb, according to the Japanese information in the International HapMap Project database (release 24) (Supplemental Figure 3A). Only 3 SNPs (rs1245582, rs751450, and rs4148917) in the region showed low P values (P < 2.10 × 10–4) in stage 2 of our study, and each is located within the LD block, which overlaps with CHST3. Carbohydrate sulphotransferase-3 catalyzes the transfer of sulphate from 3′-phosphoadenosine 5′-phosphosulphate (PAPS) to position 6 of the N-acetylgalactosamine (GalNAc) residue of chondroitin (29). The genome-wide significant association observed here was supported by our family linkage study, which found CHST3 within the strongest linkage peak genome-wide.

Candidate causal variants in the 3′ untranslated region (3′UTR) of CHST3 identified from fine-mapping study. To look for candidate causal variants within CHST3, we performed a 3-stage fine-mapping study (Figure 1B). In stage 1, we analyzed all 69 SNPs identified by the Japanese HapMap Phase II resource across CHST3 and the relevant LD block (Supplemental Figure 3A) using our SC-1 cohort (n = 1,379; Supplemental Table 2 and ref. 20). Similar to the Japanese HapMap data, the LD map of CHST3 from this dense set of SNPs in the SC-1 cohort showed 1 large LD block (Supplemental Figure 3B). Association testing of the 1,379 SC-1 individuals using linear regression at the 69 SNPs identified 14 SNPs with P < 0.05 (Supplemental Table 2). These were taken forward to stage 2 for genotyping in the combined Southern Chinese cohort (SC-1 + SC-2; n = 2,999). After data quality control analysis, 13 SNPs were tested for association (Supplemental Table 3), which revealed a cluster of 8 SNPs at the 3′ region of the gene with elevated association signals; these were contained within an LD block (rs6480592 to rs1245582) (Figure 1C) with pairwise correlation (r2 > 0.49). Of these SNPs, 5 were within the 3′UTR of CHST3, while the others were intronic or downstream of the gene, which suggests that the causal variant may be in the 3′UTR.

We hypothesized the involvement of miRNA binding sites and performed an analysis using the PolymiRTS database (30). We found that rs4148941 and rs4148949 were within predicted miRNA binding sites for miR-513a-5p (also known as miR-513) and miR-626, respectively. With these SNPs as the best candidate causal variants, we performed replication studies using cohorts from Japan and Finland (J1′, J2′, and F3; Table 1). This was followed by a final meta-analysis using only severely affected individuals (top 20th percentile) from the SC-1S + SC-2S cohorts as a more appropriate subgroup, with both rs4148941 and rs4148949 achieving genome-wide significance using an allelic model (Table 3 and Figure 1C). Analysis using a dominant model gave similar P values, although not genome-wide significance, with a Bonferroni corrected threshold set at 1.07 × 10–7 for a total of 467,709 SNPs analyzed throughout the study.

CHST3 expression in human IVDs and the effect of miRNA on allelic variants.
(A–C) Comparison of CHST3 expression by quantitative RT-PCR, (A) between control and degenerated samples and between samples differentiated according to genotypes of (B) CC/CT and TT for rs4148949 and (C) AA/AC and CC for rs4148941. (D) Comparison of COL2A1 for samples differentiated according to genotypes of AA/AC and CC for rs4148941. (A–D) Levels are expressed as ΔCt values above GAPDH (endogenous housekeeping gene for mRNA). Data are presented as dot plots; red bars denote mean values. (E) Predicted alignment of allelic variation at rs4148941 and rs4148949 as recognition sites for the binding of miR-513a-5p and miR-626, respectively. Both SNPs (shaded) occurred in the 7-bp seed sequence of complementarity (underlined) at the 5′ end of the miRNAs. (F and G) Luciferase reporter assays were performed using C28I2 cells to determine the effect of the A/C allele at rs4148941 and C/T allele at rs4148949 in the absence (F) and presence (G) of miR-513a-5p and miR-626, respectively. (H) Relative allelic expression difference of CHST3 mRNA, determined by pyrosequencing for human IVD tissues with heterozygous genotype at rs4148941. Data are mean ± SD. AF, annulus fibrosus; EP, cartilage end-plate.

We then genotyped the tissue samples, and CHST3 mRNA levels were tested against rs4148941 and rs4148949 under both additive and dominant models. Differential expression was significant under the dominant model, but not the additive model. Individuals with the risk genotypes for rs4148941 (AA/AC) and rs4148949 (CC/CT) had significantly lower levels of CHST3 mRNA than did individuals with the CC and TT genotypes, respectively (Figure 2, B and C), which suggests the risk alleles directly affect CHST3 mRNA levels. We detected no significant difference in expression of COL2A1, another gene expressed in disc cells, with respect to the genotypes for rs4148941 (Figure 2D), which suggests the effect is specific to CHST3.

rs4148941 affects the interaction with miR-513a-5p, reducing CHST3 mRNA levels. rs4148941 and rs4148949 are within potential binding sites for the seed region of miR-513a-5p and miR-626, respectively (Figure 2E).
We hypothesized that the risk allele would function through interaction with miR-513a-5p or miR-626. We therefore tested the effect of the CHST3 3′UTR sequence containing rs4148941-A/C or rs148949-C/T alleles using a luciferase reporter system (Figure 2F) in an immortalized human chondrocyte cell line, C28I2. We detected little or no expression of miR-513a-5p or miR-626 in C28I2 cells (data not shown), and C28I2 cells transfected with the rs4148941-A/C or rs1418949-C/T reporter constructs showed no differences in luciferase activity (Figure 2F). When we repeated the experiments with exogenous miR-513a-5p, the luciferase activity for the rs4148941-A reporter was 27% less than that in cells transfected with the counterpart rs4148941-C reporter (P = 0.02; Figure 2G). In contrast, cells transfected with either the rs4148949-C or -T reporter in the presence of miR-626 showed no significant difference in luciferase activity. This is consistent with mRNAs transcribed from the risk allele for rs4148941 being less stable.

To assess whether this allelic difference occurs in human IVD samples, we quantified CHST3 allelic-specific transcripts in individuals with the AC genotype for rs4148941 by pyrosequencing. There was a significant difference between the relative levels of the A and C allelic products as a percentage of total CHST3 mRNA for all tissues of the IVD (Figure 2H). These data support rs4148941-A being the functional risk allele and causing a preferential reduction in CHST3 mRNA. Finally, we showed that miR-513a-5p was expressed in the IVD tissues that could influence the level of CHST3 mRNA. There was no significant difference in the expression level of miR-513a-5p between control and LDD samples (Supplemental Figure 5A), nor according to the risk genotype (Supplemental Figure 5B), which suggests that the allelic difference is primarily the consequence of the interaction between miR-513a-5p and the risk allele of CHST3.

Rare mutations in CHST3 that disrupt its enzymatic activity have been reported in patients with recessive skeletal abnormalities, including spondyloepiphyseal dysplasia Omani type, Larsen syndrome, humero-spinal dysostosis, and chondrodysplasia with multiple dislocation (31–36). Despite the different diagnostic labels, patients with CHST3 mutations have similar clinical characteristics and can be generally classified as spondyloepiphyseal dysplasia with congenital joint dislocation and vertebral changes as the principal features (OMIM 603799). By late childhood, these features manifest and lead to arthritis of the hips and spine with IVD degeneration, rigid kyphoscoliosis, and trunk shortening.

Numerous mutations have been identified, including missense, nonsense, small insertions, and deletions. Biochemical analysis showed the majority of these mutations result in loss-of-function or severe reduction in the enzymatic activity (31, 33–36). Recessive inheritance would be consistent with most rare mutations that affect enzymes. While detailed clinical information for the heterozygous parents was not available, it is conceivable that a reduction of 50% or less in CHST3 activity causes a milder phenotype that may have preferential defects of the spine, leading to early-onset disc degeneration. We obtained and reviewed MRI scans of 3 parents with heterozygous missense mutations in CHST3, 2 of Chinese ethnicity (35) and 1 from Oman (33). Each showed evidence of IVD degeneration and associated disc abnormalities, such as herniation (Schmorl’s nodes) of the cartilaginous end-plate, in the lumbar region (Figure 3). With respect to our Southern Chinese population cohort, their LDD scores with age adjustment were considered to be mild for their age group (30–35 years old; ref. 20). They would therefore be considered to have earlier onset of disc degeneration resulting from reduced CHST3 activity. This is consistent with our hypothesis that reduced expression of CHST3 mRNA is a risk factor for LDD.

The phenotype definition for LDD is complex. In the present study, because LDH is caused by disc degeneration or other traumatic events in the lumbar region of the spine, we included it as a subset of LDD. Therefore, when performing the meta-analysis, we treated all cohorts as a single population with common heritable factors. Diagnostic heterogeneity in the case samples is unlikely, because heterogeneity tests were not significant. The fact that independent linkage and GWAS studies homed in on the same region in chromosome 10 provides strong evidence for a susceptibility locus for LDD in the region.

We adopted the strategy of initially searching for linkage and association loci by means of genome-wide approaches in more severe cases that may have a larger effect on the disease, followed by association analysis to fine-tune regions within significant linkage/association signals. Complex diseases often involve multiple genetic variants, each with small or moderate effect, reducing the power of linkage analysis compared with association studies (37). Our selection of early-onset cases with disc degeneration was aimed at reducing the heterogeneity, working on the hypothesis that some variants are common within these families. Furthermore, the relatively high heritability from familial aggregation and twin studies supports the notion that useful information can be obtained from linkage analysis for early-onset families. Our 2-stage approach enabled the filtering of potential false positives and provided a candidate region within chromosome 10.

Using a GWAS comprising 4 stages allowed the identification of a SNP (rs1245582) closely associated with LDD, with CHST3 as the nearest gene. This approach minimized the possibility of false-positive associations due to population stratification in stage 1 and reduction in number of SNPs required for genotyping in stage 2. Although there was an age difference between case and control groups in stages 1 and 2 (data not shown), it was not a confounding factor, as association remains significant after adjustment for age, gender, and BMI using the logistic regression model. In stages 3 and 4, while rs1245582 did not show significant association except in the Chinese study of replication 2, when we set the significance threshold to P < 0.05, the risk allele frequency in each of the cases and controls showed a common trend throughout all phases of the association studies. Importantly, this multistep association study and meta-analysis provided the first evidence associating LDD with rs1245582 by an unbiased approach. The fact that rs1245582 was within a large LD block that contained only CHST3 provided us with a match in the gene list from our linkage analysis and the confidence to carry out a detailed association study that substantiated CHST3 as a novel risk factor with genome-wide statistical significance.

CHST3 plays a key role in maintaining the hydration and function of cartilaginous tissues. Aggrecan is the most abundant proteoglycan in the disc matrix and contains abundant chondroitin sulphate glycosaminoglycan side chains. Proper sulfation of glycosaminoglycan side chains is critical for water retention within the disc to maintain proper osmotic pressure to resist compressive forces. Thus, the activity of CHST3 would be important for IVD function.

CHST3 mutations result in recessive forms of spondyloepiphyseal dysplasia with congenital joint dislocation and vertebral changes as the principal features (OMIM 603799). Our finding that 3 heterozygous parents with previously known mutations in CHST3 that disrupt its enzymatic activity showed disc abnormalities is consistent with earlier onset of degeneration in the corresponding age group, providing further support for CHST3 as a genetic risk factor for disc degeneration.

Although we did not find a significant difference in the level of CHST3 mRNA in IVD tissues from control and LDD patients, this could be due to the limited number of samples and consequent lack of statistical power or other unknown mechanisms. Clarification will require a much large set of IVD samples. However, we did find evidence supporting rs4148941-A as a functional risk allele that reduces CHST3 mRNA level. We speculate that mild CHST3 reduction caused by the susceptibility SNP could result in disc degeneration in adults in conjunction with other risk factors.

A recent publication has demonstrated that miRNAs are likely to be involved in the degenerative process of IVD, with several either up- or downregulated in cells isolated from degenerated tissues (38). Although miR-513a-5p was not among the reported miRNAs with changed expression levels (38), we found that it interacted with certain alleles of CHST3, reducing its expression. miR-513a-5p also has a negative effect on the mRNA level of B7-H1 (also known as CD274 or programmed death receptor-1 [PD-1] ligand-1) in biliary epithelial cells (cholangiocytes) (39). B7-H1 is a key member of the B7 costimulator family with important regulatory functions in cell-mediated immune responses (40). miR-513a-5p binds to a site in the B7-H1 3′UTR, resulting in translational repression (39). The expression of miR-513a-5p itself can be downregulated by IFN-γ (39), a proinflammatory cytokine that can mediate many cellular events in cholangiocytes (41). Thus, it is possible that there is a dynamic interaction among miR-513a-5p and other miRNAs, regulated by proinflammatory cytokines in the degenerative process, that cannot be captured readily in disc tissue samples. The expression of miR-513a-5p in response to various cytokines needs to be studied carefully in disc cells or IVD explants in bioreactors.

In summary, the present study represents the largest systematic investigation thus far of the genetic risk factors for LDD. Using a combination of genome-wide linkage analysis and association studies, we identified CHST3 as a novel risk factor, emphasizing the importance of CHST3 as a musculoskeletal disease gene, and the involvement of miRNA, shedding light on the molecular pathogenesis of LDD.

Study populations. Case-control association studies were carried out using multiple cohorts from 3 different populations consisting of Chinese, Japanese, and Finnish subjects (see Supplemental Table 1 for composition and phenotype inclusion criteria). All recruited subjects in the Southern Chinese cohort underwent MRI scanning of the lumbar spine. LDD was diagnosed on the basis of signal intensity changes within the NP of the IVDs of the lumbar spine and graded using Schneiderman classification (42) for signal intensity within the NP (43). Because age is a confounding factor in LDD, we adjusted for this using a sliding-window method (20, 27). In brief, the LDD score based on summation of the Schneiderman score in the lumbar region was normalized by logarithmic transformation. The age band for each individual was defined as age ± 5 years. The age-adjusted LDD score for each individual was calculated by subtracting the mean and then dividing by the SD in the individual’s age band.

The NC, J1, J2, J3, F1 (27), and F3 (23) cohorts were recruited on the basis of LDH characterized by sciatica or LBP requiring surgical treatment, but not necessarily having had surgical treatment. The F2 cohort (28) consisted of individuals from the Northern Finland Birth Cohort 1966 (NFBC 66) who were hospitalized owing to sciatica. 18 families with early-onset LDD identified from recruitment of the Southern Chinese cohort were used for whole-genome linkage analysis (Supplemental Figure 1). See Supplemental Methods for details of the sample sets.

Laboratory methods. PRISM human linkage mapping set v2.5-MD10 (Applied Biosystems) was used for the linkage analysis. Illumina HumanHap550v3 Genotyping BeadChip was used for stage 1 of the GWAS. For genotyping of specific SNPs, multiple methods were used, depending on cohort origin (see Supplemental Methods).

Quantitative assessment of gene expression was performed using quantitative RT-PCR on total RNA extracted from disc tissues collected postoperatively with informed consent, while pyrosequencing was used to determine the relative allelic mRNA products. Expression of specific miRNA was determined using TaqMan miRNA assays (Applied Biosystems). miRNA binding prediction of SNP variants was performed using the PolymiRTS database (30). Luciferase assays were carried out in transfected C28I2 chondrocytes (provided by M.B. Goldring, Hospital for Special Surgery, New York, New York, USA). See Supplemental Methods and Supplemental Table 4 for details of gene and miRNA expression analyses, reporter construct generation, cell transfection, luciferase assay conditions, and primer sequences.

Statistics. Multipoint NPL analysis (44) was performed using the Sall scoring function of MERLIN (45); NPL Z-scores and corresponding P values are reported. For quantitative-trait linkage analysis, the Deviates method of MERLIN was used to test for excess sharing among individuals in the same tail of trait distribution, as it makes no assumptions regarding trait distribution and is implemented in the general framework of Whittemore and Halpern (44) and Kong and Cox (46). Thus, the Deviates method was employed when using age-adjusted and raw LDD scores. Association and Hardy-Weinberg equilibrium analyses were assessed using the χ2 test. Odds ratios and other statistical measures were calculated using SPSS 15.0. HapMap data were downloaded from
http://www.hapmap.org/index.html.en. Linkage disequilibrium blocks were estimated, and Hardy-Weinberg equilibrium was tested using Haploview v4.2 software (47). GraphPad Prism 4 (GraphPad Software) was used for data presentation. Meta-analyses were carried out using R (R Foundation for Statistical Computing) and PLINK (48) using fixed-effect model analyses at both the GWAS and the fine-mapping stages. Multiple testing using the Bonferroni correction was used in all statistical analyses. A P value less than 0.05 was considered significant.

Study approval. The local ethics committees from the relevant institutions approved the study designs. Written informed consent was obtained from all participants in the study.

Supported by the University Grants Council of Hong Kong (AoE04/04), AOSPINE (AOSBRC-07-2), by Research Grants Council of Hong Kong (HKU7509/03M), by a grant-in-aid from the Ministry of Education, Culture, Sports, and Science of Japan (grant no. 21249080), and by the Academy of Finland (grant nos. 121620 and 129504). The authors thank the individuals and families who participated in this study as well as Walter Boldmer, Peter Goodfellow, Dong Yan Jin, and Kin Hang Kok for study design; Florence Mok for phenotyping; Pei Yu for data entry and patient recruitment; Sandy C.S. Poon for technical assistance; and Jockey Club MRI Centre (LKS Faculty of Medicine, The University of Hong Kong) and The Hong Kong Sanatorium and Hospital for MRI.