Abstract

Purpose: Breast cancer in young women is associated with poor prognosis. We aimed to define the role of gene expression signatures in predicting prognosis in young women and to understand biological differences according to age.

Experimental Design: Patients were assigned to molecular subtypes [estrogen receptor (ER)+/HER2−, HER2+, and ER−/HER2−] using a three-gene classifier. We evaluated whether previously published proliferation, stroma, and immune-related gene signatures added prognostic information to Adjuvant! Online and tested their interaction with age in a Cox model for relapse-free survival (RFS). Furthermore, we evaluated the association between candidate age-related genes or gene sets and age in an adjusted linear regression model.

Results: A total of 3,522 patients (20 data sets) were eligible. Patients aged 40 years or less had a higher proportion of ER−/HER2− tumors (P < 0.0001) and had poorer RFS after adjustment for breast cancer subtype, tumor size, nodal status, and histologic grade and stratification for data set and treatment modality (HR = 1.34, 95% CI = 1.10–1.63, P = 0.004). The proliferation gene signatures showed no significant interaction with age in ER+/HER2− tumors after adjustment for Adjuvant! Online. Further analyses suggested that breast cancer in the young is enriched with processes related to immature mammary epithelial cells (luminal progenitors, mammary stem, c-kit, RANKL) and growth factor signaling in two independent cohorts (n = 1,188 and 2,334).

Translational Relevance

Young age (i.e., ≤40 years) at breast cancer diagnosis has long been associated with a poor prognosis. In this work, we address two key questions: (i) whether prognostic gene expression signatures can also discriminate prognostic subgroups in young patients, as young age alone can be an indication for adjuvant chemotherapy, and (ii) whether breast cancers arising in this age group are biologically distinct. We report that proliferation-related prognostic gene signatures retain their prognostic power independent of age. These results are important because, although the clinical utility of these signatures is currently being evaluated in prospective trials, many young patients are not likely to be recruited in these studies. Next, we report that breast cancer arising at a young age is enriched with unique molecular processes that may explain its poor outcomes. These data, from more than 3,500 patients, provide further rationale for investigating separate therapeutic approaches for breast cancer diagnosed in young women.

Introduction

Around 7% of patients in the developed world and 25% of patients in the developing world are diagnosed with breast cancer below the age of 40 (1, 2). These women have poorer survival and higher risk of relapse than their older counterparts (3, 4). Several factors have been linked to the poor prognosis associated with developing breast cancer at a young age. These include large tumor size at diagnosis, higher tumor grade, mitotic rate, lymphovascular invasion, increased expression of HER2, and lower estrogen and progesterone receptor expression (4, 5). However, even after correction for stage and tumor characteristics, young age at diagnosis remains an independent risk factor for relapse and breast cancer–related death (6–10) and also an indication for aggressive systemic therapy (6).

Using gene expression profiling, at least 3 distinct molecular subtypes have been observed, each associated with a different clinical outcome: “luminal”, HER2+, and “basal-like” breast cancer (11–14). Recent studies have examined the distribution of breast cancer molecular subtypes and found that breast cancer in young women is enriched with aggressive subtypes (15–17). This has led some investigators to question whether breast cancer diagnosed at a young age has a unique biology or whether it is just a surrogate for a higher incidence of aggressive molecular subtypes (15).

The use of gene expression profiling in breast cancer has also contributed to the development of biomarkers for improved prognostication (18–23). Several multigene expression signatures have been reported to provide a better indication of clinical outcome than the traditional clinical and pathologic standards (19, 20, 24). Some of these gene signatures are currently commercially available as diagnostic tests, though prospective clinical trials are ongoing to evaluate their exact clinical utility (25, 26).

In this study, we aimed to conduct a comprehensive analysis of breast cancer with respect to age, taking advantage of the large compendium of publicly available gene expression data sets from more than 3,500 breast cancer patients. We specifically wanted to clarify the relevance of several published prognostic gene signatures in young women (≤40) and to determine whether young age is truly associated with unique disease biology.

Materials and Methods

Patient clinical and gene expression data

We searched 39 publicly available data sets and retrieved all clinical and gene expression data. We excluded all patients with missing information on age and cross-checked the different data sets to identify patients who were included in more than 1 data set. Duplicate patients were removed. As stroma and lymphocyte content are known to be altered by different tissue sampling procedures, neoadjuvant data sets in which fine needle aspirates were known to have been used were also excluded (27; Fig. 1). Eligible patients were divided into 2 cohorts according to whether they had received systemic adjuvant therapy or not (“untreated,” cohort 1 and “treated,” cohort 2; Supplementary Table S1). We used normalized microarray data (log2 intensity in single-channel platforms or log2 ratio in dual-channel platforms), as published by the original studies. Hybridization probes were mapped to Entrez GeneID with Entrez database version 2007.01.21. When multiple probes were mapped to the same GeneID, the probe with the highest variance in a particular data set was selected to represent the GeneID.
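The probe-selection rule at the end of this paragraph can be illustrated with a short sketch. It is not the code used in the study: the function name, the probe identifiers, and the use of population variance are assumptions made for illustration, and ties are resolved by keeping the first probe encountered.

```python
from statistics import pvariance


def select_probes(expr, probe_to_gene):
    """Pick one probe per GeneID: the probe with the highest variance.

    expr          -- dict: probe id -> list of expression values (one data set)
    probe_to_gene -- dict: probe id -> Entrez GeneID
    Returns a dict: GeneID -> selected probe id.
    Hypothetical helper; probes without a GeneID mapping are skipped.
    """
    best = {}  # GeneID -> (probe id, variance)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unmapped probe
        var = pvariance(values)
        # Keep the first probe seen on ties (an arbitrary choice).
        if gene not in best or var > best[gene][1]:
            best[gene] = (probe, var)
    return {gene: probe for gene, (probe, _) in best.items()}
```

Because the variance is computed within a single data set, the selection has to be repeated for each data set, and the probe chosen for a given GeneID may differ between platforms.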

Flow diagram summarizing the gene expression data sets used for the various analyses.

We calculated the estimated 10-year relapse-free survival (RFS) for the untreated cohort with Adjuvant! Online (AOL, version 8.0). The AOL (http://www.adjuvantonline.com) risk was calculated for patients with available tumor size and nodal status. Patients with unknown histologic grade were entered with “undefined” histologic grade in the AOL risk assessment model. Estrogen receptor (ER) status was missing in 12 patients (1%), and in these cases the dichotomized ER gene (ESR1) mRNA value (positive and negative) was used instead. For patients with node-positive disease, we searched the original publication to retrieve the exact number of positive nodes to accurately estimate AOL risk. When this information was unavailable, we classified the patient as N1 (i.e., node positive with 1–3 positive nodes).

Microarray analysis

Molecular subtypes.

Patients were assigned to molecular subtypes with a 3-gene classifier (ESR1, ERBB2 [HER2], and AURKA). The cutoffs for ESR1 and ERBB2 expression from microarray data were derived by fitting 2 normal distributions to the observed distribution of expression values in a single training study of 286 patients (23) and were subsequently applied to all other data sets, using a previously described method (28). Three main molecular subtypes were defined: ER−/HER2−, HER2+, and ER+/HER2− (i.e., luminal). The luminal subtype group was further divided into phenotypes “A” and “B” on the basis of median AURKA expression: ER+/HER2−/low proliferation (luminal A) and ER+/HER2−/high proliferation (luminal B). The AURKA cutoff was similarly trained on the same data set and then applied to all the other data sets as previously described (28, 29). To ensure compatibility of expression values across multiple data sets, ESR1, ERBB2, and AURKA gene expression values were rescaled before applying the 3-gene classifier (see below). This method is fully documented and was implemented with the R/Bioconductor package genefu (version 1.3.6; http://www.bioconductor.org/packages/release/bioc/html/genefu.html).
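The decision logic of the 3-gene classifier can be sketched as a sequence of hard thresholds. This is a deliberate simplification and not the genefu implementation, which the text describes as being based on fitted normal distributions; the function name, label strings, and cutoff arguments are hypothetical, and the cutoffs are assumed to have been trained beforehand on rescaled expression values.

```python
def assign_subtype(esr1, erbb2, aurka, esr1_cut, erbb2_cut, aurka_cut):
    """Assign a molecular subtype from three rescaled expression values.

    Simplified illustration with hard cutoffs (an assumption); the cutoffs
    are taken as given, e.g. trained once on a reference data set.
    """
    # HER2+ is called first, regardless of ER status.
    if erbb2 > erbb2_cut:
        return "HER2+"
    # Among HER2-negative tumors, low ESR1 defines ER-/HER2-.
    if esr1 <= esr1_cut:
        return "ER-/HER2-"
    # Remaining tumors are luminal; AURKA separates low from high proliferation.
    if aurka > aurka_cut:
        return "ER+/HER2- luminal B"
    return "ER+/HER2- luminal A"
```

Checking ERBB2 first mirrors the subtype definition above, in which HER2+ forms a single group and only HER2-negative tumors are split by ER status.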

Evaluation of previously published prognostic gene signatures according to age.

For this analysis, we used the untreated cohort (cohort 1) so that treatment would not confound prognostic outcomes. In the 3 breast cancer subtypes (ER+/HER2−, HER2+, and ER−/HER2−), within the overall patient series and in the different age groups (≤40, 41–52, 53–64, ≥65 years), we evaluated (i) 3 proliferation-related prognostic gene signatures (GGI, GENE70, and GENE76; refs. 18, 19, 23); (ii) 3 stroma-related gene signatures (DCN, SDPP, and PLAU; refs. 29–31); and (iii) 3 immune-related gene signatures (IRM, immunomodulatory cluster, STAT1; refs. 29, 32, 33). A summary of the 9 evaluated signatures is provided in Supplementary Table S2.

Evaluation of candidate age-related genes and gene sets identified from the literature.

We conducted a MEDLINE search using the terms “breast cancer, young, biology,” “breast cancer, young, gene expression,” and “breast cancer, young, prognosis” to retrieve publications related to the biology of breast cancer in young women until February 2011. Any gene or protein alterations that were suggested to be related to the biology of breast cancer in young women were identified. We then evaluated the expression of these genes and gene sets in a linear regression as a function of age and after adjustment for potential confounders to determine whether breast cancer arising in the young is associated with unique cancer biology (see below).

Statistical analysis

The different prognostic gene signatures and age-related gene sets were treated as continuous variables, and they were defined as a weighted average of the included genes using the following formula:

$$s = \frac{\sum_i w_i x_i}{\sum_i |w_i|}$$

where $x_i$ is the expression of a gene in the gene set or gene signature that is present in the data set platform and $w_i$ is either +1 or −1 depending on the sign of the gene-specific statistic from the original studies. Each risk score was scaled such that quantiles 2.5% and 97.5% equaled −1 and +1, respectively, to allow for comparison between data sets using different microarray technologies and normalization procedures.
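The score described above, a signed average over the signature genes present on the platform followed by rescaling so that the 2.5% and 97.5% quantiles map to −1 and +1, might be sketched as follows. The function names are hypothetical, dividing by the sum of absolute weights is an assumption about the normalization, and the quantiles use simple linear interpolation, which may differ slightly from the quantile definition used in the original analysis.

```python
import math


def signature_score(expression, weights):
    """Signed average of signature genes for one sample.

    expression -- dict: gene -> expression value on this platform
    weights    -- dict: gene -> +1 or -1 (sign taken from the original study)
    Genes missing from the platform are ignored.
    """
    present = [g for g in weights if g in expression]
    if not present:
        raise ValueError("no signature gene is present on this platform")
    numerator = sum(weights[g] * expression[g] for g in present)
    denominator = sum(abs(weights[g]) for g in present)
    return numerator / denominator


def _quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def rescale_scores(scores, q_low=0.025, q_high=0.975):
    """Map the q_low and q_high quantiles of the scores to -1 and +1."""
    ordered = sorted(scores)
    low, high = _quantile(ordered, q_low), _quantile(ordered, q_high)
    if high == low:
        raise ValueError("scores have no spread between the chosen quantiles")
    return [2.0 * (s - low) / (high - low) - 1.0 for s in scores]
```

Anchoring the scale on the 2.5% and 97.5% quantiles rather than on the minimum and maximum makes the rescaling less sensitive to outlying samples. Values outside these quantiles fall outside the interval from −1 to +1.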

RFS was the primary survival endpoint, which is defined as the time elapsing between breast cancer diagnosis and date of local or systemic relapse, or death. RFS was evaluated with a Cox regression model stratified by data set and adjuvant treatment modality (hormonal only, chemotherapy, or no therapy), and adjusted for age as a binary (≤40 vs. >40 years) or continuous variable, the 3 breast cancer subtypes, tumor size, nodal status, and histologic grade. Survival plots according to the age groups were drawn using the Kaplan–Meier method, and the differences were evaluated with a log-rank test. Only patients with relapse information available were included in these analyses. When RFS data were not reported, distant metastasis-free survival (DMFS) information was used if available. The median follow-up was calculated with the reversed Kaplan–Meier method (34).
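The reversed Kaplan–Meier method for median follow-up treats censoring as the event of interest and relapse as censoring. A minimal sketch, assuming one follow-up time and one relapse indicator per patient, is given below; the function names are hypothetical and exact fractions are used only to avoid rounding issues at the 50% threshold.

```python
from fractions import Fraction


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at event times.

    times  -- follow-up time per patient
    events -- True if the event occurred at that time, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = Fraction(1)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_censored = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                n_events += 1
            else:
                n_censored += 1
            i += 1
        if n_events:
            survival *= Fraction(at_risk - n_events, at_risk)
            curve.append((t, survival))
        at_risk -= n_events + n_censored
    return curve


def median_follow_up(times, relapsed):
    """Reverse Kaplan-Meier median: censoring becomes the event."""
    curve = kaplan_meier(times, [not r for r in relapsed])
    for t, survival in curve:
        if survival <= Fraction(1, 2):
            return t
    return None  # median not reached
```

For example, with follow-up times of 1 to 5 years and a relapse only in the first patient, `median_follow_up([1, 2, 3, 4, 5], [True, False, False, False, False])` returns 3.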

Prognostic gene signatures were evaluated for their ability to provide further prognostic information to AOL. The additional prognostic value of the gene signatures to AOL was assessed using the change in the likelihood ratio χ2 value. We also examined whether there was any interaction between age as a continuous variable and the prognostic performance of the different gene signatures across the breast cancer subtypes.
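The added prognostic value can be expressed as a likelihood ratio test between two nested Cox models: one containing AOL risk alone and one containing AOL risk plus the gene signature. The sketch below assumes that the signature enters as a single continuous term, so the test has 1 degree of freedom; the function names are hypothetical and the log-likelihoods would come from whatever software fits the Cox models.

```python
import math


def lr_chi2_change(loglik_aol_only, loglik_aol_plus_signature):
    """Change in likelihood ratio chi-square when the signature is added."""
    return 2.0 * (loglik_aol_plus_signature - loglik_aol_only)


def chi2_p_value_1df(statistic):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))
```

A statistic of about 3.84 corresponds to P = 0.05 with 1 degree of freedom. Adding several terms at once would require the chi-square distribution with the matching number of degrees of freedom.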

Differences in the incidence of breast cancer subtypes according to age group were assessed by the χ2 test. To evaluate the association of candidate age-related genes and gene sets with age, we built a linear regression model for each candidate gene expression score as a function of age as a continuous variable, after controlling for potential confounding factors. The first set of variables entered into the model was age and data set, followed by a second set that added histologic grade, tumor size (<2 cm, 2–5 cm, and >5 cm), nodal status (positive and negative), and the 3 main breast cancer molecular subtypes. The model was applied to the untreated cohort first (i.e., cohort 1), and then we attempted to replicate the significant findings in the treated one (i.e., cohort 2).
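The two-step adjustment can be written out explicitly. The notation below is introduced only for illustration and does not come from the original analysis; each categorical covariate is shown as a single term, although in practice it would be expanded into indicator variables.

```latex
\begin{align*}
\text{Step 1:}\quad S_i &= \beta_0 + \beta_{\text{age}}\,\text{age}_i + \gamma_{d(i)} + \varepsilon_i \\
\text{Step 2:}\quad S_i &= \beta_0 + \beta_{\text{age}}\,\text{age}_i + \gamma_{d(i)}
  + \delta_{\text{grade}(i)} + \delta_{\text{size}(i)} + \delta_{\text{node}(i)} + \delta_{\text{subtype}(i)} + \varepsilon_i
\end{align*}
```

Here $S_i$ is the scaled gene or gene set score of patient $i$, $\gamma_{d(i)}$ is the effect of the data set to which the patient belongs, and the $\delta$ terms are the effects of the patient's grade, size, nodal, and subtype categories. The coefficient of interest is $\beta_{\text{age}}$: a negative value indicates higher expression in younger patients after adjustment.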

To visualize the differences in the prognostic performance of the gene signatures across age, distribution of breast cancer molecular subtypes and breast cancer biology, patients were divided into 4 age groups (≤40, 41–52, 53–64, ≥65 years).

As each of the analyzed data sets represented a series of patients from different hospitals and countries, treated heterogeneously, with varying sample collection techniques and profiled on different platforms, we opted to adjust the linear regression analysis for data set and stratify the RFS analysis by data set to avoid potential biases. To control for multiple testing, we used a false discovery rate (FDR) approach as defined by Benjamini and colleagues (35). Reported P values are 2-sided. Statistical analyses were conducted with SPSS (version 15.0; SPSS Inc.) and R software (version 2.9.2; http://www.r-project.org).
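The FDR control mentioned above can be illustrated with adjusted P values, assuming that the step-up procedure of Benjamini and Hochberg is the one intended by reference 35. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A candidate gene or gene set would then be declared significant when its adjusted value falls below the chosen FDR threshold.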

Results

Patient characteristics

We compiled clinical and gene expression profiling data from 39 published data sets of early breast cancer. After exclusion of patients with missing information on age and those who received neoadjuvant treatment, a total of 3,522 patients from 20 data sets were eligible for this study (Fig. 1; Supplementary Table S1).

When comparing the distribution of breast cancer subtypes across the whole series, as expected, patients aged 40 years or less had nearly double the proportion of ER−/HER2− tumors (34.3% vs. 17.9%) and half the proportion of luminal A tumors (17.2% vs. 35.4%) compared with the oldest group (i.e., ≥65 years; P < 0.0001; Supplementary Fig. S3).

We used the untreated cohort (cohort 1) to evaluate the performance of the previously published prognostic gene signatures across age to avoid treatment as a confounder of clinical outcome. Both untreated (cohort 1: n = 1,188; Table 1) and treated cohorts (cohort 2: n = 2,334; Table 1) were used to examine the biological differences according to age.

Characteristics of patients in the untreated (cohort 1) and treated (cohort 2) cohorts

Differences in RFS according to age and breast cancer molecular subtype

Out of 3,522 patients identified for this study, 621 (17.6%) did not have information available on RFS, and thus were not included in this analysis. Of the remaining 2,901 patients, 1,697 patients (52.8%) had DMFS and not RFS available, and hence DMFS values were used in these patients. A total of 952 patients relapsed (33%) at a median follow-up of 5.2 years (interquartile range 1.5–8.6 years). We observed a significantly higher risk of relapse in patients of 40 years or less than in older age groups (P < 0.0001, Fig. 2A). As a binary variable, age less than or equal to 40 years was significantly associated with a poor outcome after adjustment compared with ages older than 40 at diagnosis (HR = 1.34, 95% CI = 1.10–1.63, P = 0.004). A subgroup analysis per breast cancer molecular subtype suggested an inferior RFS with young age, particularly in the ER+/HER2− phenotypes (Fig. 2B–D). Similar results were observed on restricting the analysis to the untreated cohort (Supplementary Fig. S4).

Differences in RFS according to age using all patients with available relapse data. A, all available patients with relapse data; B, ER+/HER2− luminal A subtype; C, ER+/HER2− luminal B subtype; D, HER2-overexpressing subtype; E, ER−/HER2− subtype.

In a multivariate analysis stratified for data set and treatment, and adjusted for tumor size, nodal status, histologic grade, and breast cancer subtype, a 1-year increase in age was associated with a 1% reduction in the risk of relapse (HR = 0.99, 95% CI = 0.98–0.99; P = 0.043).

Clinical relevance of previously published prognostic gene signatures according to age and breast cancer subtype

The prognostic value of 3 proliferation-related, 3 stroma-related, and 3 immune-related gene signatures was evaluated according to age group and breast cancer subtype in the untreated cohort (cohort 1). All analyses were adjusted for the estimated risk of relapse by AOL at 10 years, which was computed for all eligible patients (n = 1,150). AOL could not be precisely calculated in 163 (13.7%) patients as 116 (9.8%) had an undefined histologic grade (though AOL adjusts its risk assessment for missing histologic grade values) and 47 (3.9%) had an unknown number of positive lymph nodes. One patient had both variables missing.

Prognostic evaluation of proliferation gene signatures in the ER+/HER2− subtype according to age groups. Dotted line represents a hazard ratio (HR) of 1.0, and error bars represent 95% CIs. All HRs shown have been adjusted for AOL. A, GENE70; B, GGI; C, GENE76. The P value for the interaction between age as a continuous variable and the gene signature in a Cox model and the corresponding FDR value are shown for each gene signature.

Prognostic evaluation of stroma-related gene signatures in the ER−/HER2− subtype according to age groups. Dotted line represents a hazard ratio (HR) of 1.0, and error bars represent 95% CIs. All HRs shown have been adjusted for AOL. A, PLAU; B, DCN; C, SDPP. The P value for the interaction between age as a continuous variable and the gene signature in a Cox model and the corresponding FDR value are shown for each gene signature.

Is breast cancer arising in young women associated with unique disease biology?

To understand whether breast cancer in young women is biologically distinct from that diagnosed in older age groups and not just a surrogate for a higher incidence of aggressive breast cancer subtypes, we conducted a MEDLINE search to identify candidate age-related genes and pathways that have been suggested to characterize breast cancer arising at a young age. Out of 280 potentially relevant articles, we identified a total of 41 genes and 13 gene sets related to these aberrations (Supplementary Table S8). We then evaluated the differences in the gene expression values of these candidates using a linear regression model adjusted first for age as a continuous variable and data source and then other potential confounding variables such as breast cancer subtype, tumor size, nodal status, and histologic grade.

Within the untreated cohort (cohort 1), the expression of 16 genes and gene sets was found to be significantly age dependent after adjustment. We proceeded to replicate these findings in the treated cohort (cohort 2) and found that 12 of the 16 were still significantly associated with age after adjustment (Table 2). The common themes associated with young age were enrichment of biological processes related to immature mammary cell populations (RANKL, c-kit, BRCA1-mutated phenotype, mammary stem cells, and luminal progenitor cells) and growth factor signaling [mitogen-activated protein kinase (MAPK) and phosphoinositide 3-kinase (PI3K) related]. There was also downregulation of apoptosis-related genes.

Discussion

To the best of our knowledge, this is the largest work using gene expression data to investigate the biology and prognosis of breast cancer in young women. Notably, we found that breast cancer in the young seems to be associated with a unique biology irrespective of being enriched with more ER−/HER2− tumors. We also found that the proliferation-related gene signatures seemed to be clinically relevant in patients aged 40 years or less as well as in the older age groups, imparting prognostic information beyond that provided by AOL (36). This is despite the observation that young breast cancer patients with ER+/HER2− disease have an inferior recurrence-free survival compared with older women with ER+/HER2− disease. As this is the first report that specifically addresses the relevance of prognostic gene signatures in young patients, our data suggest that they may be helpful in treatment decision-making given that young age alone is considered by many to warrant aggressive systemic therapy.

Interestingly, in the ER−/HER2− subtype, we observed a prognostic value of the stroma-related gene signatures DCN and PLAU only in patients aged 40 years or less. However, although there was consistency in the prognostic performance of the proliferation-related gene signatures, this was not observed with the 3 stroma-related ones. Of note, SDPP was developed by microdissection of tumor-associated stroma, which was not the case for DCN and PLAU, which were both developed in silico and are highly correlated (R = 0.88). Regardless, these results highlight a potential role of the microenvironment in mediating breast cancer growth in young women, particularly for those with ER−/HER2− breast cancer. As breast stroma in young women is highly responsive to growth factor stimulation, potentially to accommodate pregnancy and lactation during child-bearing years, it is not inconceivable that this microenvironment could also be advantageous for aggressive tumor growth (37–40). Therefore, it may be worth developing therapeutic approaches to target the microenvironment and stroma for the ER−/HER2− subgroup in young women.

Consistent with recent publications using both gene expression-defined molecular subtypes and immunohistochemistry, we found that breast cancer in young women is enriched with ER−/HER2− tumors, with a lower incidence of luminal-A type tumors (15, 16). However, one of the most controversial questions is whether young age is associated with unique cancer biology. Recently, Anders and colleagues concluded that age alone did not seem to induce biological influence beyond that of breast cancer subtype and grade (15). These results were in direct contrast with the poor outcome of young breast cancer patients after adjusting for ER, grade, and HER2 status documented in several studies, including this one (6–10).

There are several differences between our study and that of Anders and colleagues (15). We studied candidate genes and gene sets based on a literature search, thereby reducing the potential bias associated with multiple testing. In addition, we included 873 patients aged 45 years or less compared with 130 patients in their analysis. We also investigated trends across age as a continuous variable, rather than dichotomizing age at its extremes. This allowed us potentially to detect subtle biological and clinical differences in gene expression across age.

The results of our biological analyses suggest several hypotheses that warrant further validation. We confirm a previous finding that suggested that breast cancer in young women is enriched with genes involved in extracellular signal-regulated kinase and PI3K signaling (5). Similarly, we also found that the single gene BRCA1 and a gene set developed from BRCA1 germ line mutant breast tumors were significantly associated with breast cancer arising at a young age, suggesting similar biological processes (18). Of note, in all the analyzed series, there were only 15 known germ line mutant BRCA1 carriers documented in 4 data sets. In addition, we observed significant enrichment of gene sets representing luminal progenitor cells, mammary stem cells, and high levels of RANKL and c-kit. These gene sets were strongly correlated with young age, independent of breast cancer subtype, in 2 large independent and heterogeneous cohorts of patients.

Although these age-specific findings could be difficult to functionally validate in an experimental model, the RANKL results are particularly interesting. The normal breast in young women is enriched with immature mammary cell subpopulations (stem cells and progenitors), which have been shown to increase significantly with pregnancy, menstrual cycles, and lactation. RANKL has been shown to be a key mediator of this effect, and RANKL inhibition could be antiproliferative as well as mediate reductions in the mammary stem cell compartment, which is thought to predispose to cancer (41–43). Similarly, Lim and colleagues recently proposed that the cell of origin of BRCA1-associated and “basal-like” tumors was probably the luminal progenitor cell rather than the stem cell-enriched population, with c-kit identified as a key marker (44). Our results may help to explain the higher incidence of ER−/HER2− tumors and also suggest a common shared biology in the young. This could account for the worse clinical outcome long associated with young age and imply that approaches such as suppression of mammary stem cell function or RANKL signaling may need to be explored in the young population. These and other data now provide significant scientific rationale to take these concepts forward into the clinical setting. Inhibition of these pathways with specific drugs such as a RANKL inhibitor, and the effects on mammary epithelial populations and tumor growth, could be initially examined in preoperative “window of opportunity” studies in young women with newly diagnosed tumors to show “proof-of-concept” (for example, EudraCT number 2011-006224-21).

Our study has potential limitations that should be considered while interpreting its results. We chose to use a 3-gene classifier for breast cancer subtyping rather than the intrinsic gene list (12, 13, 45–47) as we felt that this would more closely approximate clinically used subgrouping and that results would therefore be comparable with previous publications using IHC. Currently, no gene expression method for the assignment of molecular subtypes is considered the “gold standard,” as hierarchical clustering allocation is subjective and interobserver reproducibility remains modest (48, 49). Nevertheless, the 3-gene classifier used in the current study has been shown to provide a robust classification of the major molecular breast cancer subtypes using gene expression of the ER (ESR1), ERBB2, and a proliferation gene (AURKA) and to provide prognostic information similar to that of PAM50 (46). We also acknowledge that the AOL risk could not be calculated precisely for all patients because of missing clinical variables. However, we do not believe that the missing information could have significantly overestimated the value of prognostic gene signatures after adjustment. One should also note that for many of the prognostic gene signatures, while the exact published algorithms were not used, the approximated versions still produced a strong prognostic signal. Another limitation was the minimal information available on survival and specific adjuvant treatment modalities given to these women, as it would have been interesting to look at regimen-specific treatment effects according to age and subtype.

In summary, we conclude that proliferation-related prognostic gene signatures could aid in treatment decision-making independent of age. This may be particularly clinically relevant for the young given the potential long-term side effects of adjuvant systemic chemotherapy. Furthermore, we find that young age adds extra biological complexity, which is independent of differences in breast cancer subtype distribution. Although these results require further validation, either experimentally or in other clinical data sets, we suggest that separate therapeutic approaches may need to be specifically designed to improve outcomes for breast cancer arising in young women.

Disclosure of Potential Conflicts of Interest

C. Sotiriou is a named inventor of the Genomic Grade Index (GGI). S. Loi and C. Sotiriou are named co-inventors of a PI3K prognostic gene signature (PIK3CA-GS). No potential conflicts of interest were disclosed by the other authors.

Grant Support

H.A. Azim Jr is supported by an ESMO translational research grant.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Acknowledgments

The authors thank Carolyn Straehle for editorial assistance and all of the patients who have generously donated their tumor tissue for research.

Footnotes

Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).