ABSTRACT Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs).
We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at ten years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates.
Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our data set. Three risk groups with probability cut offs for low, intermediate and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients.
RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or 8 genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment.

recurrence within 10 years after surgery. Earlier ap-proaches to this problem were developed using rela-tively small datasets. Here we have explored the extentto which larger and more current meta-datasets can beused to improve predictor performance. Our approachwas to develop a multi-gene transcription-level-basedclassifier of 10-year-relapse (disease recurrence within10 years) using a large database of existing, publiclyavailable microarray datasets. Existing solutions havebeen implemented on costly platforms that are slow toreturn results to the patient. However, new assaytechnologies exist which could make breast cancerprognostics much more accessible and timely. Some ofthese potential assay technologies can only measuretranscription of a relatively small number of geneswhile still optimizing cost and efficiency. To this end,we developed a robust prognostic panel offering flexi-bility in both the number and identity of genes utilizedfrom thousands to as few as eight genes, each withmultiple alternatives to maximize the chance of suc-cessful migration to other assay technologies.MethodsLiterature search and curationStudies were collected which provided gene expressiondata for ER+, LN-, HER2- patients with no systemicchemotherapy (hormonal therapy was allowed). Eachstudy was required to have a sample size of at least 100,report LN status, and include time and events for eitherrecurrence-free survival or distant metastasis-free sur-vival. The latter were grouped together for survivalanalysis where all events represent either a local ordistant relapse of disease. If ER or HER2 status were notreported, they were determined by array, but preferencewas given to studies with clinical determination first. Aminimum of 10 years follow-up was required for train-ing the classifier. However, patients with shorter follow-up were included in survival analyses. Patients withimmediately postoperative events (time =0) were ex-cluded. Nine studies [3–11] meeting the above criteriawere identified by searching PubMed and the GeneExpression Omnibus (GEO) database [12]. To allowcombination of the largest number of samples, only thecommon Affymetrix U133A gene expression platformwas used. A total of 2,175 breast cancer samples wereidentified. After filtering for only those samples whichwere ER+, node-negative, and had not received systemicchemotherapy, 1,403 samples remained. Duplicate ana-lysis removed a further 405 samples due to the signifi-cant amount of redundancy between studies (Figure 1).Filtering for ER +and HER2- status using array determi-nations eliminated another 140 samples (51 ER-, 72HER2+, 17 ER-/HER2+, Additional file 1: Figure S1).Some ER- samples were from the Schmidt (2008) dataset(31/201) which did not provide clinical ER status andthus for that study we relied solely on arrays for deter-mination of ER status. However, there were also a smallnumber (37/760) from the remaining studies, which rep-resent discrepancies between array status and clinicaldetermination. In such cases, both the clinical and array-based determinations were required to be positive for in-clusion in further analysis. A total of 858 samples passedall filtering steps including 487 samples with 10-yearfollow-up data (213 relapse; 274 no relapse). TheFigure 1 Duplicate analysis showing approximate relationship between studies in analysis. The VennMaster diagram shows theapproximate overlap between GEO datasets used in the current study. Three studies show zero overlap while the other six showsignificant overlap.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 2 of 14

Page 3

remaining 371 samples had insufficient follow-up for10-year classification analysis but were retained for usein survival analysis. None of the 858 samples weretreated with systemic chemotherapy but 302 (35.2%)were treated with adjuvant hormonal therapy of which95.4% were listed as tamoxifen. The 858 samples werebroken into two-thirds training and one-third testingsets resulting in: a training set of 572 samples for use insurvival analysis and 325 samples with 10 year-follow-up (143 relapse; 182 no relapse) for classification ana-lysis; and a testing set of 286 samples for use in survivalanalysis and 162 samples with 10-year follow-up (70relapse; 92 no relapse) for classification analysis. Thenumbers of samples available for classification analysisand survival analysis differed because the former wasperformed only on those samples that could be binarizedinto ‘relapse’ or ‘no relapse’ with 10 years of follow-upwhereas the latter made use of all patients (with censor-ing), even those with no relapse but <10 years follow-up. Table 1 outlines the datasets used in the analysisand Figure 2 illustrates the breakdown of samples foranalysis.PreprocessingAll data processing and analyses were completed withopen source R/Bioconductor packages. Raw data (Celfiles) were downloaded from GEO. Duplicate sampleswere identified and removed if they had the same data-base identifier (for example, GSM accession), same sam-ple/patient ID, or showed a high correlation (r >0.99)compared to any other sample in the dataset. Raw datawere normalized and summarized using the affy [13]and gcrma [14] packages in R/Bioconductor. Probe setswere mapped to Entrez gene symbols using both stand-ard and custom annotation files [15]. ER and HER2 ex-pression status was determined using standard probesets. For the Affymetrix U133A array we and others havefound the probe set ‘205225_at’ to be most effective fordetermining ER status [16]. The rank sum of the bestprobe sets for ERBB2 (216835_s_at), GRB7 (210761_s_at),STARD3 (202991_at), and PGAP3 (55616_at) was used todetermine HER2 amplification status. This was calculatedby taking the sum of individual expression level ranks ofeach of the four probes. Cutoff values for ER and HER2 sta-tus were chosen by mixed model clustering (mclust [17,18]package). Mclust performs normal mixture modeling, fittedvia an expectation maximization algorithm. A mixturemodel is a probabilistic model for representing the presenceof subpopulations within an overall population. In this casewe hypothesized that two populations exist in the dataset(ER+and ER- or HER2+ and HER2-) and would be repre-sented by two distinct distributions of expression values.This is clearly visible in Additional file 1: Figure S1 whereboth ER and HER2 expression levels apparently manifest astwo distinct distributions. In both cases, mclust successfullyidentified two distributions and we used the maximumvalue of the first distribution as an unbiased, objective cut-off to separate the two distributions. Unsupervised cluster-ing was performed to assess the extent of batch effects.Once all prefiltering was complete, data were randomlysplit into training (two-thirds) and test (one-thirds) datasetswhile balancing for study of origin and number of relapseswith 10-year follow-up. The test dataset was renormalized,put aside, left untouched, and only used for final validation,once each for the full-gene, 17-gene, and eight-gene classi-fiers. Training data was renormalized together as above.Probe sets were then filtered for a minimum of 20% sam-ples with expression above background threshold (rawvalue >100) and coefficient of variation (COV) between 0.7and 10. A total of 3,048 probe sets/genes passed this filter-ing and formed the basis for the ‘full-gene-set’ model de-scribed below. The clinical annotations and preprocessedTable 1 Studies included in analysisStudyGSETotalsamplesER+/LN-/ untreateda/outcomeDuplicatesremovedER+/HER2-array10-yearrelapse10-yearno relapseDesmedt, 2007 [3] GSE7390198 13513511642 60Ivshina, 2006 [4]GSE4922290 1332202Loi, 2007 [5]GSE6532327 170 4340 105Miller, 2005 [6] GSE3494251132200b115 100 3052Schmidt, 2008 [7]GSE11121 200200 155 2546Sotiriou, 2006 [8]GSE2990189 113484512 15Symmans, 2010 [9] GSE17705298175 110 102 1241Wang, 2005 [10]GSE2034286209 209173 6729Zhang, 2009 [11]GSE12093136136 1361251524Nine studies2,175 1403 998858213 274aUntreated includes some patients with adjuvant endocrine therapy alone or radiation.bER status not available, filtered later based on array expression data.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 3 of 14

Page 4

data for both training and test datasets (before filtering forpercent expression and COV) are provided as Additionalfiles 2, 3, 4, and 5.Classification (training)Classification was performed on only training sampleswith either a relapse or no relapse after 10-year follow-up using the randomForest [19] package. Forests werecreated with at least 100,001 trees (odd number ensuresfully deterministic model) and otherwise default settings.Performance was assessed by area under the curve(AUC) of a receiver-operating characteristic (ROC)curve, calculated with the ‘ROCR’ package, from Ran-dom Forests internal out-of-bag (OOB) testing results.In Random Forests, this OOB testing takes the place ofcross-validation to get an unbiased estimate of the testset error. Each tree is constructed using a different ran-dom bootstrap sample of about two-thirds of cases fromthe original data. Each case left out in the constructionof the tree is run through that tree to get a classification.In this way, a test set classification is obtained for eachcase in about one-third of the trees. At the end of therun, the software takes j to be the class that got the mostvotes every time case n was OOB. The proportion oftimes that j is not equal to the true class of n averagedover all cases is the OOB error estimate. By default, RFperforms a binary classification (for example, relapseversus no relapse). However it also reports a probability(proportion of ‘votes’) for relapse, which we call theRandom Forests Relapse Score (RFRS). Risk groupthresholds were determined from the distribution of re-lapse probabilities using mixed model clustering to setcutoffs for low, intermediate, and high-risk groups.Determination of optimal 17-gene and eight-gene setsInitially an optimal set of 20 genes was selected by re-moving redundant probe sets and extracting the top 100genes (by reported Gini variable importance), k-meansclustering (k =20) these genes and selecting the bestgene from each cluster (again by variable importance)(Figure 3). Additional genes in each cluster will serve asrobust alternates in case of failure to migrate primarygenes to an assay platform. A gene might fail to migratedue to problems with prober/primer design or differ-ences in the sensitivity of a specific assay for that gene.The top 100 genes/probe sets were also manuallychecked for sequence correctness by alignment to thereference genome (see Additional file 6). Seven genes/probe sets with ambiguous or erroneous alignmentswere excluded. Three genes/probe sets were also ex-cluded because of their status as hypothetical proteins(KIAA0101, KIAA0776, KIAA1467). After these re-movals, three clusters were lost, leaving a set of 17 pri-mary genes and 73 alternate genes. All but two primarygenes have two or more alternates (TXNIP is without al-ternate, and APOC1 has a single alternate). Table 2 listsFigure 2 Sample breakdown. A total of 858 samples passed all filtering steps including 487 samples with 10-year follow-up data (213 relapse;274 no relapse). The remaining 371 samples had insufficient follow-up for 10-year classification analysis but were retained for use in survivalanalysis. The 858 samples were broken into two-thirds training and one-third testing sets resulting in: a training set of 572 samples for use insurvival analysis and 325 samples with 10-year follow-up (143 relapse; 182 no relapse) for classification analysis; and a testing set of 286 samplesfor use in survival analysis and 162 samples with 10-year follow-up (70 relapse; 92 no relapse) for classification analysis.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 4 of 14

Page 5

the final gene set, their top two alternate genes (whereavailable) and their variable importance values (SeeAdditional file 1: Table S1 for complete list). The aboveprocedure was repeated to produce an optimal set of eightgenes, this time starting from the top 90 non-redundantprobe sets (excluding the 10 genes with problems identi-fied above), k-means clustering (k=8) these genes andselecting the best gene from each cluster (Additionalfile 1: Figure S2). All eight genes were also included inthe 17-gene set and have at least two alternates (Table 3,Additional file 1: Table S2). Using the final optimized17-gene and eight-gene sets as input, new RF modelswere built on training data. Actual probe sequences forthe top 100 probe sets are provided in Additional file 7.Figure 3 Heatmap showing top 100 probe sets after k-means clustering (k= 20). Training data (n=325) were clustered by expression valueusing k-means clustering (k =20) for the top 100 probe sets identified by random forest classification variable importance. The first color side baron the left indicates cluster number and the second indicates relative variable importance within the cluster (darker blue=greater importance).The top side bars indicate risk group (low, intermediate, and high from left to right) and relapse status (red= relapse; yellow= no relapse). Genes(probe sets) are indicated on the right axis. Genes highlighted in yellow represent the primary genes in the model (best in each cluster). Genesnot highlighted represent alternates to primary genes in each cluster. Genes highlighted in pink represent genes excluded from the modelbecause of probe set sequence ambiguity or status as a hypothetical protein.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 5 of 14

Page 6

Validation (testing and survival analysis)Survival analysis on all training data, now also includingthose patients with <10 years of follow-up, was performedwith risk group as a factor, for the full-gene, 17-gene, andeight-gene models, using the ‘survival’ package. Note, therisk scores and groups for samples used in training wereassigned from internal OOB cross-validation. Only thosepatients not used in initial training (without 10-yearfollow-up) were assigned a risk score and group by denovo classification. Significance between risk groups wasdetermined by Kaplan-Meier log rank test (with test forlinear trend). To directly compare relapse rates at 10 yearsfor each risk group to that reported in the OncotypeDXpublication [20], the overall relapse rates in our patientcohort were randomly down-sampled to the same rate(15%) as in their cohort [20] and results averaged >1,000iterations. To illustrate, the training dataset includes 572samples with 143 relapse events (that is, 25.0% relapserate). Samples with relapse events were randomly elim-inated from the cohort until only 15% of remainingsamples had relapse events (76/505=15%). This ‘down-sampled’ dataset was then classified using the RFRS modelto assign each sample to a risk group and the rates ofrelapse determined for each group. The entire down-sampling procedure was then repeated 1,000 times toobtain average estimated rates of relapse for each riskgroup given the overall rate of relapse of 15%. Settingthe overall relapse rate to 15% is also useful becausethis more closely mirrors the general population rateof relapse. Without this down-sampling, expected relapserates in each risk group would appear unrealistically high.See Figure 2 for explanation of the breakdown of samplesinto training and test sets used for classifier building andsurvival analysis.Next, the full-gene, 17-gene, and eight-gene RF modelsalong with risk group cutoffs were applied to the inde-pendent test data. The same performance metrics, sur-vival analysis, and estimates of 10-year relapse rateswere performed as above. The 17-gene model was alsotested on the independent test data, stratified by treat-ment (untreated vs. hormone therapy treated), to evalu-ate whether performance of the signature was biasedtowards one patient subpopulation or the other. Finally,for direct performance comparison on the same datasets,the Oncotype DX algorithm [20] was implemented in Rand applied to both training and test datasets.The independent test data were not used in any wayduring the training phase. However, these samples repre-sent a random subset of the same overall patient popula-tions that were used in training. Therefore, they are notas fully independent as recommended by the Institute ofMedicine ‘committee on the review of omics-based testsfor predicting patient outcomes in clinical trials’ [21].Therefore, additional independent validations were per-formed against the NKI dataset [22] obtained from theNetherlands Cancer Institute [23] and the METABRICdataset [24] accessed through Synapse [25]. The NKIdata represent a set of 295 consecutive patients with pri-mary stage I or II breast carcinomas. The dataset was fil-tered down to the 89 patients who were node-negative,ER-positive, HER2-negative, and not treated by systemicchemotherapy. Relapse times and events were defined byany of distant metastasis, regional recurrence or local re-currence. The METABRIC data represent a set of 1,992primary breast tumors. The dataset was filtered down tothe 315 patients who were node-negative, ER-positive,HER2-negative, and not treated by systemic chemother-apy. Relapse times and events were used as provided. Ex-pression values from the NKI Agilent array data andMETABRIC (Illumina HT 12v3) array were rescaled tothe same distribution as that used in training using theTable 2 The 17-gene RFRS signaturePrimary predictorAlternate 1 Alternate 2CCNB20.785 MELK0.739 GINS10.476TOP2A 0.590MCM2 0.428CDK1 0.379RACGAP1 0.588LSM1 0.139 SCD0.125CKS20.515 NUSAP1 0.491 ZWINT0.272AURKA 0.508PRC1 0.499CENPF 0.306FEN1 0.403FADD 0.313SMC40.170EBP 0.341RFC40.264 NCAPG0.234TXNIP0.292N/AN/A N/AN/ASYNE2 0.270SCARB2 0.225PDLIM5 0.167DICER10.209CALD10.129 SOX90.125AP1AR 0.201 PBX2 0.134WASL 0.126NUP107 0.197FAM38A0.165 PLIN20.110APOC10.176 APOE 0.121 N/AN/ADTX40.164AQP1 0.141 LMO40.120FMOD 0.154RGS50.120 PIK3R10.103MAPKAPK2 0.151MTUS1 0.136DHX90.136SUPT4H10.111 PHB0.106 CD440.105Table 3 The eight-gene RFRS signaturePrimary predictorAlternate 1Alternate 2CCNB20.785 MELK0.739 TOP2A0.590RACGAP10.588 TXNIP 0.292 APOC10.176CKS20.515 NUSAP10.491 FEN10.403AURKA0.508 PRC10.499 CENPF0.306EBP0.341 FADD0.313 RFC40.264SYNE20.270 SCARB20.225 PDLIM50.167DICER10.209 FAM38A0.165 FMOD0.154AP1AR0.201 MAPKAPK20.151 MTUS10.136Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 6 of 14

Page 7

‘preprocessCore’ package. Values for the eight-gene and17-gene-set RFRS models were extracted for further ana-lysis. If more than one Agilent or Illumina probe setcould be mapped to an RFRS gene then the probe setwith best performance was used. The full-gene-setmodel was not applied because not all Affymetrix-defined genes (probe sets) in the full-gene-set could bemapped to Agilent or Illumina-defined genes (probesets). However, the 17-gene and eight-gene RFRS modelswere applied to NKI and METABRIC data to calculatepredicted probabilities of relapse. Patients were dividedinto low-, intermediate-, and high-risk groups by rankingaccording to probability of relapse and then dividing sothat the proportions in each risk group were identical tothat observed in training. ROC AUC, survival P values,and estimated rates of relapse were then calculated asabove. It should be noted that while the NKI clinicaldata described here (n=89) had an average follow-uptime of 9.55 years (excluding relapse events), 34 patientshad a follow-up time of <10 years without an event(range, 1.78-9.83 years). Similarly, the METABRIC clin-ical data (n=315) had an average follow-up time of9.85 years (excluding relapse events) but 118 patientshad a follow-up time of <10 years without an event(range, 0.05-9.80 years). These patients with <10 years offollow-up but no events would not have met our criteriafor inclusion in the training dataset and likely includesome events which have not occurred yet. If anything,this is likely to reduce the AUC estimate and underesti-mate P value significance in survival analysis.Selection of reference genesWhile not necessary for Affymetrix, migration to otherassay technologies (for example, RT-PCR approaches)may require highly and invariantly expressed genes toact as a reference for determining accurate gene expres-sion level estimates. To this end, we developed two setsof reference genes. The first was chosen by the followingcriteria: (1) filtered if not expressed above backgroundthreshold (raw value >100) in 99% of samples; (2) filteredif not in top fifth percentile (overall) for mean expres-sion; (3) filtered if not in top 10th percentile (remaininggenes) for standard deviation; (4) ranked by coefficientof variation. The top 25 reference genes are listed inAdditional file 1: Table S3 along with reference genesused by OncotypeDX. The second set of reference geneswere chosen to represent three ranges of mean expres-sion levels encompassed by genes in the 17-gene signa-ture (low, 0-400; medium, 500-900; high, 1,200-1,600).For each mean expression range, genes were: (1) fil-tered if not expressed above background threshold(raw value >100) in 99% of samples; and (2) ranked bycoefficient of variation. The top five genes from eachrange are listed in Additional file 1: Table S4. Referencegenes underwent the same manual checks for sequencecorrectness by alignment to the reference genome asabove (see Additional file 8). Five genes were markedfor exclusion in the first set and three genes excludedfrom the second set. Actual probe sequences for all ref-erence probe sets are provided in Additional file 9.Implementation of algorithmThe RFRS algorithm is implemented in the R programminglanguage and can be applied to independent patient data.Input data are tab-delimited text files of normalized expres-sion values with 17 or eight transcripts/genes as columnsand patient(s) as rows. A sample patient data file (patient_-data.txt) is presented in Additional file 10. A sample R pro-gram (RFRS_sample_code.R) for running the algorithm ispresented in Additional file 10. The RFRS algorithm con-sists of a Random Forest of 100,001 decision trees. This isprecomputed, provided as an R data object (see Additionalfile 11) and must be included in the working directory.Each node (branch) in each tree represents a binary deci-sion based on transcript levels for transcripts describedabove. Based on these decisions, the patient is assigned to aterminal leaf of each decision tree, representing a vote foreither ‘relapse’ or ‘no relapse’. The fraction of votes for‘relapse’ to votes for ‘no relapse’ represents the RFRS - ameasure of the probability of relapse. If RFRS is ≥0.606the patient is assigned to the ‘high-risk’ group, if ≥0.333and <0.606 the patient is assigned to the ’intermediate-risk’group, and if <0.333 the patient is assigned to the ‘low-risk’group. These cutoffs were chosen by mixed-model cluster-ing during the training phase. The patient’s RFRS value isalso used to determine a likelihood of relapse by compari-son to a loess fit of RFRS versus likelihood of relapse for thetraining dataset. Precomputed R data objects for the loessfit (RelapseProbabilityFit.Rdata) and summary plot (Relap-seProbabilityPlot.Rdata) are loaded from file (see Additionalfile 12). The patient’s estimated likelihood of relapse isdetermined, added to the summary plot, and output as anew report (see Additional file 13). A complete list of GEOsample identifiers, used for training and testing (Additionalfile 14), and additional sample R code (Additional file15), are provided to help others reproduce the RFRSimplementation.Results and discussionInternal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of0.704. This was comparable or better than those re-ported by Johannes et al. (2010) who tested a numberof different classifiers on a smaller subset of the samedata and found AUCs of 0.559 to 0.671 [26]. It alsocompares favorably to the AUC value of 0.688 whenthe OncotypeDX algorithm was applied to this sametraining dataset.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 7 of 14

Page 8

Mixed model clustering analysis identified three riskgroups with probabilities for low-risk <0.333; intermediate-risk 0.333-0.606; and high-risk ≥0.606 (Figure 4). Survivalanalysis determined a highly significant difference in relapserate between risk groups (P=3.95E-11; Figure 5A). Afterdown-sampling to a 15% overall rate of relapse, approxi-mately 46.7% (n=235) of patients were placed in the low-risk group and were found to have a 10-year risk of relapseof only 8.0%. Similarly, 38.6% (n=195) and 14.9% (n=75)of patients were placed in the intermediate- and high-riskgroups with rates of relapse of 17.6% and 30.3%, re-spectively. These results are very similar to those forOncotypeDX which were reported as 51% of patientsin the low-risk category with a rate of distant recur-rence at 10 years of 6.8% (95% CI, 4.0-9.6); 22% inintermediate-risk category with recurrence rate of14.3% (95% CI, 8.3-20.3); and 27% in high-risk categorywith recurrence rate of 30.5% (95% CI, 23.6-37.4) [20].Our goal was to identify a potential signature for mi-gration to a more affordable and faster platform. We canexpect a certain amount of attrition when migrating tosuch platforms. Therefore we developed a rational ap-proach in which approximately 100 candidate signaturegenes were identified but organized into signatures ofeight- and 17-gene sets with each gene having multiplealternates in case of platform migration failures. The sig-nature sizes were chosen for practical reasons, with theassumption that high-throughput, low-cost platformsmay have limitations to approximately 10 or 20 featuresand need space for one or more control/reference genes.Validation of the models against the independent testdataset also showed very similar results to training esti-mates. The full-gene-set model had an AUC of 0.730and the 17-gene and eight-gene optimized models hadminimal reduction in performance with AUC of 0.715and 0.690, respectively. Again, this compared favorablyto the AUC value of 0.712 when the OncotypeDX algo-rithm was applied to the same test dataset. Survival ana-lysis again found very significant differences between therisk groups for the full-gene (P =6.54E-06), 17-gene (P =9.57E-06), and eight-gene (P =2.84E-05; Figure 5B)models. For the 17-gene model, approximately 38.2%(n=97) of patients were placed in the low-risk groupand were found to have a 10-year risk of relapse of only7.8%. Similarly, 40.5% (n=103) and 21.3% (n=54) of pa-tients were placed in the intermediate and high-riskgroups with rates of relapse of 15.3% and 26.8%, respect-ively. Very similar results were observed for the full-gene and eight-gene models (Table 4). We have alsocompared performance to random gene sets of the samesizes (that is, eight- and 17-gene sets). Genes werechosen randomly from the total set of 3,048 genes whichpassed basic percent expression and COV filtering. Onaverage, AUC values of 0.649 and 0.608 were estimatedFigure 4 Risk group threshold determination. The distribution of RFRS scores was determined for patients in the training dataset (n=325)comparing those with a known relapse (right side; in blue) versus those with no known relapse (left side; in red). As expected, patients without aknown relapse tend to have a higher predicted likelihood of relapse (by RFRS) and vice versa. Mixed model clustering was used to identifythresholds (0.333 and 0.606) for defining low-, intermediate-, and high-risk groups as indicated.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 8 of 14

Page 9

for random sets of 17 and eight genes compared to0.715 and 0.690 for RFRS gene sets on the same inde-pendent test dataset. Only 2.2% of 17-gene and 3.6% ofeight-gene random sets performed as well as those wechose. AURKA alone had an AUC of 0.623.Validation against the additional, independent, NKIdataset also had very similar results. The 17-gene andeight-gene models had AUC values of 0.688 and 0.699,respectively, nearly identical to the results for theprevious independent dataset. Differences between riskgroups in survival analysis were also significant for both17-gene (P= 0.023) and eight-gene (P =0.004, Figure 5C)models. Similar results were also obtained from the val-idation against METABRIC data with 17-gene (AUC =Figure 5 Likelihood of relapse according to RFRS group. Kaplan-Meier survival analysis shows a significant difference in relapse-free survivalfor low-, intermediate-, and high-risk groups as defined by: (A) the full-gene-set model on training data (n= 572, P=3.95E-11); (B) the eight-gene-set model on independent test data (n=286, P=2.84E-05); (C) the eight-gene-set model on independent NKI data (n=89, P= 0.004); (D) the17-gene-set model on independent METABRIC data (n=315, P <1.99E-04). Note, for panel A, the risk scores and corresponding groups for samplesused in classifier training (n=325) were assigned from internal OOB cross-validation. Only those patients not used in initial training (training datawithout 10-year follow-up; test data) were assigned a risk score and group by de novo classification. Significance between risk groups was determinedby Kaplan-Meier log rank test (with test for linear trend).Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 9 of 14

Page 10

0.628, P <1.99E-04, Figure 5D) and eight-gene models(AUC =0.642, P <0.04). It should be noted that any levelof performance observed on independent test datasets issomewhat disadvantaged if they make use of differentexpression platforms (as NKI and METABRIC datasetsdo). Even with renormalizing or rescaling, the cutoffsand inter-variable relationships will likely be sensitive tochanges in the inherent distribution and dynamic rangethat accompany another technology. When migrating toa new platform, an additional training phase should beperformed. Our signatures and their alternates provide aconvenient candidate list for development on such anew platform.The trend between risk group and rate of relapsecontinued if groups were broken down further (usingtraining data) into five equal groups instead of the threegroups defined above (Additional file 1: Figure S4). Thisobservation is consistent with the idea that the RFRS is aquantitative, linear measure directly related to probabil-ity of relapse. Figure 6 shows the likelihood of relapse at10 years, calculated for 50 RFRS intervals (from 0 to 1),with a smooth curve fitted, using a loess function and95% confidence intervals representing error in the fit.The distribution of RFRS values observed in the trainingdata is represented by short vertical marks just abovethe x axis, one for each patient. From this training datawe can estimate that 18.9% of patients from a populationwith an overall 15% relapse rate will be predicted to havea 5% or lower risk of relapse. Similarly, 37.0% will have a10% or lower risk of relapse.Table 4 Comparison of validation results in independent test data for full-gene-set, 17-gene, and eight-geneRFRS modelsRelapse-free survivalRFRS performanceLow risk Intermediate risk High riskModel AUC RRn (%)RRn (%) RRn (%)KM (P)Full-gene-set0.7306.9 78 (30.7)15.8 133 (52.4) 26.843 (16.9)6.54E-0617-gene0.7157.8 97 (38.2)15.3103 (40.5) 26.8 54 (21.3)9.57E-06Eight-gene0.690 9.7101 (39.8) 13.9105 (41.3)28.3 48 (18.9)2.84E-05KM (P), p-value for Kaplan-Meier logrank test (with test for linear trend) from survival analysis; RR, relapse rate.0.00.2Random Forests Relapse Score (RFRS)0.4 0.60.81.001020304050Likelihood of Relapse at 10 years (%)Low IntermediateHighFigure 6 Estimated likelihood of relapse at 10 years for any RFRS value. The likelihood of relapse was calculated in the training dataset(n=505) for 50 RFRS intervals (from 0 to 1). A smooth curve was fitted using a loess function and 95% confidence intervals plotted to representthe error in the fit. Short vertical marks just above the x axis, one for each patient, represent the distribution of RFRS values observed in thetraining data. Thresholds for risk groups are indicated. The plot shows a linear relationship between RFRS and likelihood of relapse at 10 yearswith the likelihood ranging from approximately 0 to 40%.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 10 of 14

Page 11

In order to maximize the total size of our trainingdataset we allowed samples to be included from bothuntreated patients and those who received adjuvant hor-monal therapy such as tamoxifen. Since outcomes likelydiffer between these two groups, and they may representfundamentally different subpopulations, it is possiblethat performance of our predictive signatures is biasedtowards one group or the other. To assess this issue weperformed validation against the independent test data-set, stratified by treatment status, using the 17-genemodel. Both groups were found to have comparableAUC values with the slightly better value of 0.740 forhormone-treated versus 0.709 for untreated. Survivalcurves were also highly similar and significant with Pvalue of 0.004 and 3.76E-07 for treated and untreated re-spectively (Additional file 1: Figure S5A and S5B). Thedifference in P value appears more likely due to differ-ences in the respective sample sizes than actual differ-ence in survival curves. This is important because itpredicts equal or better performance of the signature inhormonal-therapy treated patients.Several prognostic signatures in breast cancer identifythe same or very similar risk groups of patients [27,28].While the genes utilized in the RFRS model have onlyminimal overlap with those identified in other breastcancer outcome signatures (Additional file 1: Figure S6),the prognostic ability of the RFRS model compares wellwith other assays. The 17-gene and eight-gene optimizedsets have only a single gene (AURKA) in common withOncotypeDX [20], a single gene in common withVeridex [10] (FEN1, 17-gene set only), and none withMammaprint [22]. The similar level of performance be-tween our signature and others together with the lack ofoverlap in gene members suggests that there are manyequivalent solutions to this problem. We provide an op-timal subset for our training data but acknowledge thatmany other effective subsets are also possible and pro-vide a method for identifying such combinations via thepredictor alternates derived from k-mean clustering ana-lysis. Gene Ontology categorization was performed usingDAVID [29,30] and revealed that genes in the 17-genelist are involved in a wide range of biological processesknown to be involved in breast cancer biology includingcell cycle, hormone response, cell death, DNA repair,transcription regulation, wound healing, and others(Figure 7). Since the eight-gene set is entirely containedin the 17-gene set it would be involved in many of thesame processes. GO categorization of the 17-gene setFigure 7 Gene Ontology categorization of 17-gene model. A Gene Ontology (GO) categorization was performed using DAVID to identify theassociated GO biological processes for the 17-gene model. A VennMaster diagram represents the approximate overlap between GO terms. Tosimplify, redundant terms were grouped together. Genes in the 17-gene list are involved in a wide range of biological processes known to beinvolved in breast cancer biology including cell cycle, hormone response, cell death, DNA repair, transcription regulation, wound healing, andothers. Since the eight-gene set is entirely contained in the 17-gene set it would be involved in many of the same processes.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 11 of 14

Page 12

was provided to characterize this specific set of genes.However, when training our models, we required onlythat the chosen genes/probes be robustly and reprodu-cibly predictive of relapse. We did not require the signa-tures to be a comprehensive list of such correlatedgenes/probes. In fact, we show that there are likely manyessentially equivalent subsets of genes that would workequally well as predictors of outcome. For practical rea-sons we have attempted to choose a relatively small listof genes to facilitate migration to other technologies.With only 17 genes as an input list we would not expectstatistically significant enrichment of genes for anycategory. Therefore, we caution against interpretation ofthese GO categorizations in that sense.ConclusionsWhile most breast cancers are diagnosed at a resect-able and thus potentially curable stage, all carry somerisk of relapse. As such, most patients are in theorycandidates for adjuvant maneuvers to lessen the risk ofrelapse. We present a meta-analysis of gene expressiondata in 858 lymph-node-negative, ER-positive, HER2-negative, chemotherapy-naive breast tumors from pub-lished datasets. This dataset supports a method ofpredicting the likelihood of long-term survival withoutrelapse for such breast cancers. The method involvesassaying the expression level of one or more RNA tran-scripts or their expression products in a breast cancersample, from a set of eight or 17 genes each with oneor more alternate genes. We determine a RFRS andrisk group by applying the RFRS algorithm to tumorderived gene expression. Furthermore, by comparingthe RFRS value to outcomes in training data we canestimate a likelihood of long-term survival withoutbreast cancer relapse in a patient. In theory, thosebreast cancer patients with tumors at high risk of re-lapse could be treated more aggressively whereas thoseat low risk of relapse could more safely avoid the risksand side effects of systemic chemotherapy. The bene-fits of this type of approach are expected to be signifi-cant and are currently being evaluated in the TailorRX(USA) and MINDACT (Europe) trials [31]. Our hopeis that this method, together with a rapid assay plat-form could provide rapid (<1 h) and useful informationto inform clinical decision-making. A test developedfrom this method could be applied to resected breasttumor tissue (either frozen or fresh), formalin-fixed,paraffin-embedded breast cancer tissue or tissue iso-lated from a core biopsy or fine needle aspirate. Wealso provide a list of reference genes for use in normal-izing gene expression measurements if necessary.We tested the hypothesis that a larger dataset with morecurrent clinical data would allow development of a signa-ture with better performance than existing methods. Oursignature performed comparably but not significantly bet-ter, perhaps suggesting that we have reached the upperlimits of accuracy for array-based transcriptional signaturesfor recurrence in ER+/LN- breast cancer. However, RFRScompares favorably to previously described, clinically avail-able products that predict outcome in breast cancer (for ex-ample, OncotypeDX and Mammaprint) in several respects:(1) the signature was built from the largest training datasetavailable to date; (2) patients with HER2+ tumors were ex-cluded, thus focusing only on patients without an existingclear treatment course; (3) the gene signature predicts re-lapse with equal success for both patients that went on toreceive adjuvant hormonal therapy and those who did not;(4) the gene signature was designed for robustness with (inmost cases) several alternate genes available for eachprimary gene; (5) probe set sequences have been validatedby alignment and manual assessment. These features,particularly the latter two, make this signature an especiallystrong candidate for efficient migration to low-cost plat-forms for use in a clinical setting. Development of a panelfor use in the clinic could take advantage of not only pri-mary genes but also some number of alternate genes to in-crease the chance of a successful migration. Given thesmall but significant number of discrepancies observed be-tween clinical and array-based determination of ER statuswe also recommend inclusion of standard biomarkers suchas ER, PR, and HER2 on any design. We provide a list ofconsistently expressed genes, specific to breast tumor tis-sue, for use as control genes for those platforms that re-quire them. Finally, we make available a large, carefullycurated gene expression dataset for ER+, LN- breast canceralong with clinical annotations for use in the researchcommunity.Additional filesAdditional file 1: Contains all supplementary figures, tables, and amore detailed explanation of all additional files.Additional file 2: Contains clinical annotations for the 572 patientsin the training dataset.Additional file 3: Contains clinical annotations for the 286 patientsin the test dataset.Additional file 4: Contains GCRMA normalized mRNA expressionvalues for the 572 patients in the training dataset.Additional file 5: Contains GCRMA normalized mRNA expressionvalues for the 286 patients in the test dataset.Additional file 6: Describes alignment of the top 100 probe sets tothe reference genome.Additional file 7: Contains actual probe sequences for the top 100probe sets.Additional file 8: Describes alignment of all reference probe sets tothe reference genome.Additional file 9: Contains actual probe sequences for all referenceprobe sets.Griffith et al. Genome Medicine 2013, 5:92http://genomemedicine.com/content/5/10/92Page 12 of 14

Data provided are for informational purposes only. Although carefully collected, accuracy cannot be guaranteed. The impact factor represents a rough estimation of the journal's impact factor and does not reflect the actual current impact factor. Publisher conditions are provided by RoMEO. Differing provisions from the publisher's actual policy or licence agreement may be applicable.

[Show abstract][Hide abstract]ABSTRACT: Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis.
The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets.
Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate - adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA).
Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed.
To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.