Abstract

Purpose: The purpose of this study is to predict breast cancer recurrence and metastases and to identify gene signatures indicative of clinicopathologic characteristics using gene expression patterns derived from cDNA microarray.

Experimental Design: Expression profiles of 7,650 genes were investigated on an unselected group of 99 node-negative and node-positive breast cancer patients to identify prognostic gene signature of recurrence and metastases. The identified gene signature was validated on independent 78 patients with primary invasive carcinoma (T1/T2 and N0) and on 58 patients with locally advanced breast cancer (T3/T4 and/or N2). The gene predictors were identified using a combination of random forests and linear discriminant analysis function.

Conclusions: This study has established a population-based approach to predicting breast cancer outcomes at the individual level exclusively based on gene expression patterns. The 28-gene recurrence signature has been validated as quantifying the probability of recurrence and metastases in patients with heterogeneous histology and disease stage.

population-based study

molecular signature

breast cancer prognosis

feature selection

classification

Breast cancer patients with the same stage of disease can have markedly different clinical outcomes. Traditional diagnostic and prognostic factors may stratify patients with molecularly distinct diseases into the same group based on morphologic assessments. It remains a critical issue to reliably identify specific high-risk breast cancer patients for recurrent and metastatic diseases. Molecular prediction is the future direction of personalized cancer care. Microarray technologies have fostered tremendous advances in molecular diagnosis and prognosis of breast cancer (1–11). A recent report has described a clinically applied multigene assay to predict recurrence of tamoxifen-treated, node-negative, and estrogen receptor (ER)–positive breast cancer (12). A population-based approach to molecular prognosis of breast cancer and rational individualization of therapy is needed.

In this work, we present a population-based study to predict recurrence and metastases of breast cancer by using gene expression patterns in tumors. Sotiriou et al. (8) originally undertook the population-based study from a regional cancer center where 350 new patients a year were referred in from a population of 1.5 million. Within the years 1993 to 1995, 700 new breast cancer cases were seen, of which 99 cases were representative of the population with comparable overall survival adjusted for tumor size and nodal status. Almost all of these 99 patients received adjuvant therapy after surgery, according to accepted standards of practice at that time. Based on the gene profiling on these 99 tumors, we identified a new 28-gene signature to predict recurrence of breast cancer. Furthermore, the accuracy of the 28-gene signature in disease-free survival prediction was validated (P < 0.001) on two independent patient cohorts (1, 7). The cohort of van't Veer et al. (1) contained 78 patients with sporadic cancer all under the age of 55 years with no lymph node involvement, who were not treated with adjuvant chemotherapy. The cohort of Sorlie et al. (7) included 78 cases, 51 of which were with locally advanced T3/T4 and/or N2 breast cancer treated with primary chemotherapy. In this study, a recurrence score function was used to define a patient's risk for relapse and metastases for individualized clinical decision-making. Patients categorized into high risk, low risk, and intermediate risk had distinct disease-free survival (P < 0.005, Kaplan-Meier analysis, log-rank test) in all three cohorts. A strong association (P < 0.05) was identified between risk groups and tumor size, tumor grade (13), ER and progesterone receptor (PR) status, and HER2/neu overexpression in the studied patient cohorts (1, 7, 8). The recurrence score was also predictive of overall survival (P < 0.001) in the cohorts from Sotiriou et al. (8) and Sorlie et al. (7). To further reveal the molecular pathogenesis of breast cancer, we also identified 14-gene predictors of nodal status and 9-gene predictors of tumor grade using the data from Sotiriou et al. (8) in this population-based study.

Materials and Methods

Patients. Three patient cohorts from the previous publications (1, 7, 8) were analyzed in this study. The cohort from Sotiriou et al. (8) contained the tumor samples from 99 patients with primary local breast carcinoma obtained from the John Radcliffe Hospital from January 1993 to December 1994. All of the 99 tumors were invasive ductal carcinomas: 46 patients were node negative and 53 were node positive. The data were publicly available at the Proceedings of the National Academy of Sciences Web site.7

The cohort from Sorlie et al. (7) contained 78 breast carcinomas (71 ductal, 5 lobular, and 2 ductal carcinomas in situ obtained from 77 different individuals; two independent tumors from one patient diagnosed at different times). Fifty-one of the patients were part of a prospective study on locally advanced breast cancer (T3/T4 and/or N2 tumors) treated with preoperative doxorubicin monotherapy. A total of 58 patients with complete follow-up information were included in this study. The detailed clinical information of all samples is available at the Proceedings of the National Academy of Sciences journal Web site and the author Web site.8

The clinical data of 78 patients from van't Veer et al. (1) were analyzed in this study, including 34 patients developed metastasis within 5 years and 44 patients continued to be disease-free after 5 years. Twenty patients with hereditary breast cancer (BRCA1 or BRCA2 carriers) were excluded from this study. All of these patients were diagnosed between 1983 to 1996 with primary invasive carcinoma (T1 or T2), no axillary metastases (N0), and all under the age of 55 years. The complete patient data are publicly available at the Nature journal Web site and the author Web site.9

Acquisition of gene expression profiles. There were 7,650 genes assayed by cDNA microarrays on 99 patient samples as described previously by Sotiriou et al. (8). RNA was isolated by using the Trizol method (Invitrogen, Carlsbad, CA). Total RNA from the Universal Human Reference (Stratagene, La Jolla, CA) was amplified and used as a reference for cDNA microarray analysis (8). The detailed protocols for RNA amplification and cDNA probe labeling and hybridization are available online.10

The data set from Sorlie et al. (7) includes 9,216 genes screened on 78 patient samples. Total RNA was isolated by phenol-chloroform extraction (Trizol, Life Technologies), and mRNA was purified by either magnetic separation using Dynabeads (Dynal) or the Invitrogen FastTrack 2.0 kit (7). All experiments and microarray assays were described previously (7) and the detailed protocols are available at Stanford University Web site.11,12

The data provided by van't Veer et al. (1) contain 24,500 genes screened on 98 patient samples. The detailed protocols for RNA isolation, cRNA labeling, and gene expression profiling using microarray were described in the original publication (1).

Missing value replacement and gene matching. The data from Sotiriou et al. (8) were used as training set. The genes that had missing values in more than five samples were excluded from the study. For the remaining genes, missing values were replaced by using the EMV package in software R.13 The missing values were estimated based on a k-nearest neighbor algorithm (k = 20). This algorithm first selects k nearest genes that do not contain any missing values to the one with at least one missing value, based on the Euclidean distance. Then, the missing values are replaced by the average of the neighbors. Genes screened on different microarray platforms were matched by their Unigene Cluster IDs with the interactive MatchMiner (14) Web site interface.14

Feature selection for biomarker identification. Feature selection is important to identify relevant and important genes and to remove irrelevant genes and noise from large scale microarray data sets. A combination of random forests (15) and linear discriminant analysis (LDA) was used to identify gene signatures for predicting breast cancer recurrence/metastases, tumor grade, and nodal status. Random forests of software R was first used to identify a small subset of genes from the original microarray data. LDA of software SAS15 was used to further refine the gene signature.

Random forests is a generalization of the standard tree algorithms (16). The basic step of random forests is to form diverse tree classifiers from a single training set. Each tree is built on a bootstrap sample from the training set. The variables used for splitting the tree nodes are a random subset of the whole variables set. The classification decision of a new case is obtained by majority voting (unless the cutoff value is user defined) over all trees. In random forests, about one third of the cases in the bootstrap sample are not used in growing the tree. These cases are called “out-of-bag” cases and are used to evaluate the algorithm performance. A very important function of random forests is variable importance evaluation. The importance of a variable is defined in terms of its contribution to classification accuracy. Based on the variable importance measure, backward elimination was used to identify the gene subset with the smallest out-of-bag error rate. Here, the out-of-bag error rate was not used to assess the prediction accuracy of the identified gene subsets. Instead, it served as a stopping rule for feature selection. The varSelRF package of software R (17) was used according to the following steps: (1) build a forest with N trees and obtain a ranking of variable importance; (2) remove 20% of the least important variables; (3) construct a new forest with K trees; (4) repeat steps 2 and 3 until two genes are left; and (5) select the gene subset with the smallest out-of-bag error rate.

In the experiments, we chose N = 3,000 and K = 1,000 because the large number of trees in the initial forests are likely to produce stable importance measures (17). We did not follow the “1-SE rule” as suggested by Diaz-Uriarte et al. (17). This rule chooses the smallest gene subset, whose error rate is within one SE of the minimum error rate of all forests. We used the “0-SE rule,” which identifies the gene subset with the smallest out-of-bag error rate. The “0-SE rule” usually selects more genes than the “1-SE rule” does. Because further gene filtering would be done by using LDA, we chose the gene subsets with the lowest prediction error using random forests.

Discriminant analysis was used to determine which variables discriminate two or more naturally occurring groups in prognosis. Given several variables as the data representation, each class is modeled as multivariate normal distribution with a covariance matrix and a mean vector. Instances are classified to the label of the nearest mean vector based on Mahalanobis distance. The decision surfaces between classes become linear if the classes have a common covariance matrix. When the distribution within each group is assumed to be multivariate normal, a parametric method can be used to develop a discriminant function. Such function is determined by a measure of generalized square distance, which is based on the pooled covariance matrix as well as the prior probabilities of group membership. The generalized squared distance from input x to class i is where is the squared distance from x to group I; mi is the p-dimensional mean vector for group I; V is the pooled covariance matrix and g(i) depends on the prior probability of class i. In practice, the prior probability can be assumed as equal for all groups (refer to SAS Users' Manual). In this study, we assumed equal prior probability and thus g(i) = 0. x is classified into class I, if is the smallest among all the distance measures. We selected the gene markers using backward selection of stepwise discriminant analysis with software SAS.

Classifier evaluation. LDA of software SAS was used to refine the gene signature obtained from random forests and assess the classification accuracy of models in predicting 5-year relapse-free survival, tumor grade, and nodal status based on the identified gene signatures. Leave-one-out cross-validation was used in the evaluation (18).

To evaluate the accuracy of survival prediction, time-dependent receiver operating characteristic (ROC) analysis for censored data (19, 20) was done with software R. Time-dependent ROC analysis extends the concepts of sensitivity, specificity, and ROC curves for time-dependent binary disease variables in censored data. In our study, the binary disease variable Ri(t) = 1, if patient i has recurrent or metastatic breast cancer before time t; otherwise, Ri(t) = 0. For a diagnostic marker M, both sensitivity and specificity are defined as a function of time t:A ROC(t) is a function of t at different cutoffs c. A time-dependent ROC curve is a plot of sensitivity(c, t) versus 1 − specificity(c, t). The area under the ROC curve (AUC) can be used as an accuracy measure of the ROC curve. A higher prediction accuracy is evidenced by a larger AUC(t) (19, 20).

Statistical methods. To estimate a patient's recurrent and metastatic potential, risk scores were generated by fitting the identified gene predictors in a Cox regression model as covariates. The distribution of the risk scores was used to classify the patients into three groups: high risk, low risk, and intermediate risk. Kaplan-Meier analysis was used to assess the disease-free survival probability of three risk groups in the studied patient cohorts (1, 7, 8). To evaluate the association between risk groups and clinicopathologic variables in three patient cohorts (1, 7, 8), the χ2 test or Fisher's exact test was used. All statistical testing was done with software R.

Results

Gene expression–based prediction of recurrence and metastases. The expression profiles and clinical data from Sotiriou et al. (8) were used as the training set to predict recurrence in individual breast cancer patients. To identify gene predictors of breast cancer recurrence, 96 tumor samples were used in the classification, and three samples were excluded because their 5-year relapse status could not be determined. In this population-based cohort, 59 patients had relapse within 5 years after surgery, and 37 remained relapse-free for a 5-year interval. To reliably identify good and poor prognostic tumors, a powerful three-step supervised classification method was used. In brief, based on the expression profiles of 7,091 genes after data preprocessing, 66 genes were first identified by using random forests. Then, LDA analysis further refined the recurrence signature to 28 genes (Table 1
; Supplementary Fig. S1). Based on the expression profiles of these 28 genes, LDA classified 5-year relapse status for the 96 patients, achieving an overall accuracy of 0.92 (88 of 96), a sensitivity of 0.90 (53 of 59), and a specificity of 0.95 (35 of 37). To evaluate relapse-free survival prediction in the complete cohort between the years 1 and 11 after surgery, a Cox proportional hazards model (21) was built on the 28-gene signature and the risk score was used to construct the time-dependent ROC curve. The AUC (5-year) was 0.983 (Fig. 1A
) and remained 0.92 between years 8 and 11 during the follow-up (Fig. 1C).

Time-dependent ROC analyses of the 28-gene signature in disease-free survival prediction in three patient cohorts. A, time-dependent ROC (t = 5 y) curve of the 28-gene signature on the training set from Sotiriou et al. (8) AUC of 0.983. B, time-dependent ROC (t = 5 y) curves of the 28-gene signature on two validation sets. AUC of 0.843 with 25 overlapping genes on data from van't Veer et al. (1) and AUC of 0.764 with eight overlapping genes on data from Sorlie et al. (7). C, AUC in years 1 to 11 during follow-up after surgery in the patient cohort from Sotiriou et al. (8). D, AUC in years 1 to 13 during follow-up after surgery on two independent patient cohorts (1, 7).

To evaluate the prognostic power of our identified gene signature, two independent validation sets were used (1, 7). From each validation set, we identified the overlapped genes with our 28-gene signature. Eight genes were found in the data generated by Sorlie et al. (7), including one unknown gene. Twenty-five genes were obtained from the data generated by van't Veer et al. (1), in which 4 genes were duplicated. Using these overlapped genes, time-dependent ROC analyses were done to evaluate relapse/metastases prediction on two independent patient cohorts (Fig. 1B and D). The AUC (5-year) on the data from van't Veer et al. (1) was 0.843 with 25 overlapped genes in predicting metastatic potential. The AUC (5-year) was 0.764 on the data from Sorlie et al. (7) with eight overlapped genes in the relapse-free survival prediction (Fig. 2B
). The patients from van't Veer et al. (1) were with primary invasive carcinoma (T1 or T2) and no lymph node involvement (N0); whereas 88% (51 of 58) of the patients from Sorlie et al. (7) were with locally advanced (T3/T4 and/or N2) breast cancer. The results showed that our identified gene predictors from the population-based study achieved accurate disease-free survival prediction in patients with heterogeneous histology and disease stages.

Time-dependent ROC analyses of the 28-gene signature in overall survival prediction. A, time-dependent ROC curves at t of 5 y. AUC of 0.927 on data from Sotiriou et al. (8) and AUC of 0.808 on data from Sorlie et al. (7). B, the AUC of overall survival prediction during the follow-up after surgery.

Time-dependent ROC analysis showed that the identified 28-gene signature was also predictive of overall survival (P < 0.001; Fig. 2). In the prediction of overall survival, the AUC (5-year) was 0.927 on the data from Sotiriou et al. (8) with all 28 genes and 0.808 on data from Sorlie et al. (7) with 8 overlapped genes.

The identified 28-gene signature of recurrent and metastatic potential is unique. There is no overlap between our gene signature and previously reported survival signatures (1, 8, 12). A literature search found that 17 genes of this 28-gene signature are related to tumorigenesis, and 9 genes are directed linked to breast cancer pathogenesis (Table 1). Furthermore, among the nine breast cancer–related genes, five genes are the established breast cancer biomarkers [i.e., MCM2 (22), RAD50 (23), PDGFRA (24, 25), S100P (26, 27), and FGF2 (28; Table 1).

Significant association of expression profile–defined risk groups with clinicopathologic characteristics. The clinical variables, such as nodal status, tumor size, tumor grade, ER status, and HER2/neu overexpression in breast cancer patients affect the disease outcomes. In three studied cohorts (1, 7, 8), the distribution of the recurrence risk scores in Cox modeling was used to stratify patients into three groups: high risk, low risk, and intermediate risk (the details are provided in Supplementary Materials). Kaplan-Meier analysis showed that disease-free survival was significantly different (P < 0.005, log-rank tests) among the risk groups in all patient cohorts (Fig. 3
). The clinical characteristics of each risk group in the studied cohorts were reported in Supplementary Tables S5 to S7, including average disease-free survival days, ER and PR status, HER2/neu overexpression, nodal status, age, tumor size, grade, and treatment received. The results indicated that the expression profiles of the identified 28-gene signature were strongly associated with the clinicopathologic variables, including tumor size, tumor grade, ER and PR status, and HER2/neu overexpression (P < 0.05; Table 2
). There was insufficient evidence to support the association between ER status and risk groups in Sorlie et al. (7). The reason might be that 80% of the patients in this study exhibited ER positive. No strong evidence was found that age (representing menopausal status) of the patients was associated with the expression profiles of the tumors (Table 2).

The association of expression profile–defined risk groups and clinicopathologic variables in three patient cohorts

Gene predictors of tumor grade and nodal status. To further reveal the molecular pathogenesis of breast cancer, we also identified 14-gene predictors of nodal status and 9-gene predictors of tumor grade (Table 3
; Supplementary Fig. S1; Supplementary Table S1) using the combination of random forests and LDA. Based on the expression profiles of the 14-gene signature, LDA correctly classified 80% (79 of 99) of nodal status (node positive versus node negative) in the patients from Sotiriou et al. (8), with a sensitivity of 83% (44 of 53) and a specificity of 76% (35 of 46). Among these 14 nodal status predictor genes, the functions of four genes are unknown, whereas the remaining 10 genes are all related to cancer development and progression. Furthermore, six genes are directly related to breast cancer pathogenesis (Table 3). It has also been found that VEGFB is a well-established breast cancer nodal status biomarker (29), whereas STK12 (30, 31) and BIRC3 (32) are the biomarkers for breast cancer risk and survival.

Based on the expression profiles of the nine-gene signature, LDA correctly classified 85% (84 of 99) of tumor grade (grade 1 or 2 versus grade 3; ref. 13) in the patients from Sotiriou et al. (8), with a sensitivity of 87% (39 of 45) and a specificity of 83% (45 of 54). Interestingly, for these nine-gene predictors, all except ALDH3A2 (aldehyde dehydrogenase 3 family for fat metabolism) are related to cancer development and progression. More importantly, six genes are directly related to breast cancer pathogenesis. Furthermore, RUNX1 was identified as a molecular target of breast cancer, which is associated with the transforming growth factor-β signal transduction pathway (Table 3; ref. 33).

Discussion

In this population-based study, we were able to quantify the likelihood of local and distant recurrence in breast cancer patients with heterogeneous histologies and stages. The use of the gene expression–based prognostic model provides an accurate estimate of the risk of recurrence and metastases in individual patients. The patients in different recurrence risk categories had distinct disease-free survival (P < 0.001) in three studied cohorts. The recurrence gene signature can also accurately predict overall survival. This feature is remarkable because ∼50% of patients died in the absence of recurrent breast cancer (1). In addition, the recurrence gene signature predicts the relapse-free interval and metastases-free interval. Therefore, the expression profile–defined prognostic model robustly predicts all the outcomes we examined.

Microarray analyses of breast cancer have identified gene expression profiles associated with patient survival. Sorlie et al. (7) found the expression profiles to distinguish ER-positive from ER-negative tumors with distinct outcomes in a locally advanced cohort treated with primary chemotherapy. van't Veer et al. (1) established a 70-gene signature to predict metastatic potential in an untreated, node-negative cohort. Sotiriou et al. (8) showed the concordance with these previous analyses in node-positive and node-negative patients with the majority receiving adjuvant treatment. Here, we present a powerful systems biology approach to identify unique gene predictors of breast cancer recurrence and metastases, which achieved highly accurate prognosis at the individual level in these three cohorts (1, 7, 8). Our results are remarkable in their high prognostic accuracy for the data from these previous studies despite the differences in patient populations, treatments used, and technology platforms used.

The variables, such as patient's age, tumor size, measurement of ER/PR (by ligand-binding assay), and HER2/neu status (by fluorescence in situ hybridization), are routinely used as predictors of recurrence in breast cancer and are incorporated into current treatment guidelines (34, 35). The results of this study show that the expression profiles of our recurrence gene predictors were strongly associated (P < 0.05) with tumor size, tumor grade, ER, PR, and HER2/neu in the studied patient cohorts. The gene expression profiles in tumors were not significantly associated with the patient's age in all three cases.

We evaluated the recurrence gene signature in the context of the interobserver variability in tumor grading that is typical in oncology practice. Tumor grade correlates with the likelihood of recurrence when analyzed in large populations of patients (12). However, previous studies have also reported that the grading of breast cancer entails subjectivity, leading to considerable variability among pathologists (12, 36–40). Our results indicate that the expression of a recurrence gene signature is strongly associated (P < 0.05) with tumor grade in the patients from all three cohorts.

A recent report by a Breast Task Force serving the American Joint Committee on Cancer did not add tumor grade to its staging criteria because of the interobserver variability problem in the current grading system (39). To produce a consistent and standard system for tumor grading, we identified a nine-gene signature to predict tumor grade. The expression profiles of the nine genes (Table 3) accurately classified 85% of the tumor grades in the population-based cohort from Sotiriou et al. (8), with a sensitivity of 87% and a specificity of 83%. The prediction accuracy of our identified nine-gene grade signature was notably high.

The high dimensionality of microarray data has complicated major diagnostic and prognostic breakthroughs in cancer treatment and puts a premium on innovative data mining methods. This article presents a novel computational gene selection system for the identification of molecular signatures. An integration of random forests and LDA was used in this study. Random forests is characterized as an effective machine learning method for processing noisy large-scale data sets. Therefore, we used this algorithm to filter out noninformative genes sequentially from the original microarray data until a small subset of genes was obtained. Then, LDA was used to further filter out more genes. Our previous studies have shown that an integrative feature selection system based on random forests and other state-of-the-art techniques enables the identification of important biomarkers (41, 42). This study is another successful example of the combinatorial gene selection system, providing a general guideline for developing robotic prognosis toward personalized medicine.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.