Abstract

Purpose: The categorical definition of response assessed via the Response Evaluation Criteria in Solid Tumors has documented limitations. We sought to identify alternative metrics for tumor response that improve prediction of overall survival.

Experimental Design: Individual patient data from three North Central Cancer Treatment Group trials (N0026, n = 117; N9741, n = 1,109; and N9841, n = 332) were used. Continuous metrics of tumor size based on longitudinal tumor measurements were considered in addition to a trichotomized response [TriTR: response (complete or partial) vs. stable disease vs. progression]. Cox proportional hazards models, adjusted for treatment arm and baseline tumor burden, were used to assess the impact of the metrics on subsequent overall survival, using a landmark analysis approach at 12, 16, and 24 weeks postbaseline. Model discrimination was evaluated by the concordance (c) index.

Results: The overall best response rates for the three trials were 26%, 45%, and 25%, respectively. Although nearly all metrics were statistically significantly associated with overall survival at the different landmark time points, the concordance indices (c-index) for the traditional response metrics ranged from 0.59 to 0.65; for the continuous metrics from 0.60 to 0.66; and for the TriTR metrics from 0.64 to 0.69. The c-indices for TriTR at 12 weeks were comparable with those at 16 and 24 weeks.

Translational Relevance

The high failure rate (e.g., 50%–60%) of phase III trials in oncology, attributable, in part, to less than optimal predictions of effectiveness from hypothesis-generating phase II trials, presents a major obstacle to the drug discovery process. Historically, phase II trials have used (categorical) tumor response rate as the primary endpoint, an approach that has documented concerns. In this article, we explore longitudinal tumor measurement–based continuous metrics and alternative categorical response metrics. Our results suggest that an unconfirmed trichotomized objective status assessed as early as 12 weeks after treatment initiation predicts subsequent survival as well as or better than traditional response or our continuous metrics. This trichotomized objective status metric, if validated, may positively impact the drug development process by (i) accurately reflecting the intended goals of current therapies, (ii) shortening assessment time, and (iii) improving prediction of subsequent survival.

Introduction

The high failure rate (e.g., 50%–60%) of phase III trials in oncology presents a major obstacle to the drug discovery process (1). Understanding and addressing the potential reasons for these high failure rates are crucial to making progress. Possible reasons for the high failure rate in phase III trials include (i) suboptimal choice of patient population and (ii) inaccurate predictions of effectiveness from the hypothesis-generating prior phase II trials. This work addresses the second of these reasons, focusing on the choice of endpoints used in phase II trials in patients with measurable disease to identify agents worthy of further evaluation. Our goal was to consider alternatives to tumor response, the standard phase II endpoint, and in particular endpoints based on continuous tumor measurements.

Historically, phase II trials have used tumor response rate as the primary endpoint, where response is assessed via the Response Evaluation Criteria in Solid Tumors (RECIST; ref. 2). RECIST was implemented in an effort to standardize assessment of tumor response and has been widely used in cancer clinical trials since 2000. Per RECIST, measurable target lesions representative of all involved organs are identified, recorded, and measured at baseline by unidimensional tumor measurements. The overall patient-level objective status is then determined on the basis of the assessment of the target lesions, nontarget lesions, and new lesions. Best response is defined as the best objective status [i.e., complete response (CR) or partial response (PR), stable disease, or progression; each of which is based on relative change in tumor size] on treatment. Confirmed response is defined as 2 consecutive assessments of complete or partial response at least 4 weeks apart. Two observations are worth noting about RECIST. First, by definition, confirmed response, in contrast to best response, requires that the response status of the patient be sustained for a period of at least 4 weeks, thus avoiding to some extent possible overestimation of the observed response rate due to one-time measurement error. This is particularly important in nonrandomized trials in which tumor response is the primary endpoint. Second, these definitions of response are categorical and specifically dichotomous (CR/PR or not). Modified RECIST guidelines, RECIST 1.1, were introduced in 2009. These changes do not materially impact the subject matter of this article.

The concerns over the response rate as a primary endpoint are well documented. First, there is a demonstrated lack of concordance between response rates in single-center phase II trials and subsequent multicenter phase III studies (3). More fundamentally, tumor measurements are continuous and their categorization may result in loss of information (4). A related concern is the use of an arbitrary cutoff to determine “response” and “no response” (5) and timing of assessments. With the advent of targeted therapies that prolong disease stabilization, patients may experience stable disease rather than tumor shrinkage. It has been shown that patients with stable disease also achieve clinical benefit (6) and hence it is not appropriate to ignore stable disease when assessing treatment efficacy.

Nonprogression rate (also known as disease control rate) has become one accepted alternate endpoint in assessing treatment efficacy, as it counts patients who achieve stable disease for an extended period of time as successes, in addition to those who achieve complete or partial response. Disease control rate was shown to be better than response rate in predicting survival in the setting of non–small cell lung cancer (NSCLC; ref. 7). A trichotomous response (TriTR) has also been considered, in which response is categorized into CR/PR versus stable versus progression (6). Bradbury and Seymour (8) and Dhani and colleagues (9) provide recent reviews of the many proposed alternate endpoints for phase II trials.

Actual tumor measurements are relatively simple to obtain and have previously been explored by others for use as a phase II endpoint. Karrison and colleagues (10) considered log change in the sum of tumor measurements from baseline to 8 weeks as a phase II trial endpoint. Wang and colleagues (11) developed a model both for tumor size and for survival; the primary goal of the tumor model was to account for missing tumor measurement data. Claret and colleagues (12) developed a mathematical model to predict overall survival from baseline and 7-week predicted tumor size, using a simulation study to compare observed and predicted data.

In this article, we propose several continuous metrics on the basis of the tumor measurements recorded over the course of treatment. We hypothesized that these continuous metrics would more fully capture a patient's tumor lesion experience over the course of the treatment than traditional dichotomous or trichotomous response categorization. The goal was to identify an appropriate metric that can be assessed relatively early during treatment and that is predictive of longer-term clinical outcomes such as overall survival.

Materials and Methods

Data

We obtained individual patient tumor measurement and survival data from 3 North Central Cancer Treatment Group cancer clinical trials: a phase II first-line pemetrexed plus gemcitabine study in advanced NSCLC (N0026, n = 157; ref. 13), a phase III randomized study of IFL, FOLFOX4, and IROX as first-line therapy for advanced colorectal cancer (N9741, n = 1,691; ref. 14), and a phase III randomized study of CPT-11 versus OXAL/5-FU/CF as second-line therapy for advanced colorectal cancer (N9841, n = 491; ref. 15). N0026 had 3 treatment arms, none of which was found superior (overall response rate = 19%). N9741 randomized patients to 3 treatment arms: IFL (arm A), FOLFOX (arm F), and IROX (arm G). Arm F was found to be the most effective treatment, arm A was the previous standard of care, and arm G was found to have some efficacy but high toxicity. Overall response rate for N9741 was 38%. N9841 had 2 treatment arms between which no difference in overall survival was found (overall response rate = 22%).

All patients with measurable disease who had a baseline measurement as well as at least 1 postbaseline measurement were included in our analysis. Patients who progressed or went off study for any reason prior to their first postbaseline measurement were excluded. The analysis data set therefore included 117, 1,109, and 332 patients from N0026, N9741, and N9841, respectively, with associated response rates of 26%, 45%, and 25%, respectively. Baseline characteristics for the patients included in the analyses are summarized in Table 1.

Both N9841 and N0026 used RECIST measurement for collection and assessment. N9741 opened prior to RECIST and instead collected and assessed tumor measurements according to the World Health Organization criteria. Applying RECIST to N9741 data, we used the maximum of the 2 measurements recorded for each lesion. The number of lesions measured at each assessment varied over time within patients; thus, only lesion measurements that were available across all assessments for the patient were utilized. Each study was designed to assess the lesions at 4- to 6-week intervals and assessed up to 10 lesion measurements.
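The lesion handling described above can be illustrated with a minimal sketch. It is not the authors' code: the data layout (one dictionary per assessment mapping a lesion label to its two recorded diameters), the function name, and the example values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one dict per assessment, mapping a lesion label to the
# two perpendicular diameters recorded under WHO criteria.
Assessment = Dict[str, Tuple[float, float]]


def unidimensional_sums(assessments: List[Assessment]) -> List[float]:
    """Return the sum of lesion sizes at each assessment for one patient.

    Only lesions measured at every assessment are used, and each lesion is
    reduced to a single dimension by taking the larger of its two diameters.
    """
    if not assessments:
        return []
    # Lesions available across all assessments (intersection of lesion labels).
    common = set(assessments[0])
    for assessment in assessments[1:]:
        common &= set(assessment)
    # Unidimensional size per lesion = maximum of the two recorded diameters.
    return [sum(max(a[lesion]) for lesion in common) for a in assessments]


# Illustrative (made-up) patient: node_1 is missing at the second assessment,
# so it is dropped from every assessment.
patient = [
    {"liver_1": (40.0, 30.0), "lung_1": (20.0, 25.0), "node_1": (15.0, 10.0)},
    {"liver_1": (35.0, 28.0), "lung_1": (18.0, 22.0)},
    {"liver_1": (30.0, 20.0), "lung_1": (15.0, 21.0), "node_1": (12.0, 9.0)},
]
sums = unidimensional_sums(patient)  # [65.0, 57.0, 51.0]
```

Dropping a lesion that is missing at any one assessment keeps the sums comparable over time, at the cost of discarding some measured data.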

Metrics

The continuous metrics we considered are based on the tumor measurements recorded over the course of treatment. As such, these metrics differ from previously considered continuous tumor measurements (e.g., tumor size at k weeks for some fixed k) in that they capture overall tumor burden over the course of study. Table 2 summarizes the metrics. The total sum of measurements (TSM) was calculated simply as the sum of measurements at each assessment starting from baseline. To account for total time on study, we considered the average sum of measurements (ASM), which is calculated by dividing the TSM by the total number of assessments. As another metric, we considered the relative change from baseline (RCB), calculated for each assessment by dividing the sum of measurements at that assessment by the sum of the baseline measurements and subtracting 1 from this ratio, and then summing this quantity over all assessments. With this definition, negative values of the RCB indicate a decrease in tumor measurements. We also considered the average RCB (ARCB), which is the RCB divided by the number of assessments. In addition to these continuous metrics, we also considered TriTR, defined as CR/PR versus stable versus progression, by RECIST. In particular, we considered TriTR status at a predefined time point and best trichotomized response (best TriTR), as well as the traditional confirmed response.
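As a worked illustration of the four continuous metrics, the following sketch computes TSM, ASM, RCB, and ARCB from one patient's per-assessment sums of measurements. It rests on two assumptions not stated in the text: the baseline assessment is counted in the number of assessments for both ASM and ARCB, and the baseline term of the RCB (which is always zero) is included in the sum. The function name and example values are hypothetical.

```python
import math
from typing import Dict, List


def continuous_metrics(sums: List[float]) -> Dict[str, float]:
    """Compute TSM, ASM, RCB, and ARCB for one patient.

    sums[0] is the baseline sum of measurements; sums[1:] are the
    postbaseline sums, in assessment order.
    """
    if len(sums) < 2 or sums[0] <= 0:
        raise ValueError("need a positive baseline and at least 1 postbaseline sum")
    n = len(sums)                    # assumption: baseline counts as an assessment
    baseline = sums[0]
    tsm = sum(sums)                  # total sum of measurements
    asm = tsm / n                    # average sum of measurements
    rcb = sum(s / baseline - 1.0 for s in sums)  # baseline term contributes 0
    arcb = rcb / n                   # average relative change from baseline
    return {"TSM": tsm, "ASM": asm, "RCB": rcb, "ARCB": arcb}


# Made-up patient whose tumor burden shrinks from 100 to 60 over 2 assessments.
m = continuous_metrics([100.0, 80.0, 60.0])
# TSM = 240, ASM = 80, RCB = 0 + (-0.2) + (-0.4) = -0.6, ARCB = -0.2

# The analysis log transforms TSM and ASM before modeling.
log_tsm, log_asm = math.log(m["TSM"]), math.log(m["ASM"])
```

The negative RCB in this example reflects shrinkage relative to baseline, consistent with the sign convention described above.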

Statistical analysis

Both the TSMs and the ASMs were log transformed to normalize the distribution in order to satisfy model assumptions. Distributions of other continuous metrics seemed approximately symmetric and unimodal and therefore were not transformed. A Cox proportional hazards model adjusting for the metric, treatment arm, and sum of baseline measurements (i.e., baseline tumor burden) was fit, where the metrics were calculated from tumor measurements available at randomization (baseline) until (i) 12 weeks postrandomization (12-week landmark analysis); (ii) 16 weeks postrandomization (16-week landmark analysis); and (iii) 24 weeks postrandomization (24-week landmark analysis). In each analysis, we considered the continuous metrics, confirmed response, best TriTR, and TriTR status at the predefined time points. For this last metric, the objective status at the assessment closest to that time point, that is, within 3 weeks from the expected assessment time, was used. When using a landmark analysis approach, TSM and ASM are theoretically equivalent because they differ only by a constant factor, that is, TSM divided by the number of assessments yields the ASM. However, not all patients have the same number of assessments in a given time period so that ASM and TSM do not necessarily differ by a constant factor across all patients. We have therefore included both metrics for consideration. A similar comment applies to RCB and ARCB.
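The landmark restriction and the rule for TriTR status at a predefined time point can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: assessments are represented as hypothetical (week, status) pairs, the closest assessment may fall on either side of the landmark, and ties in distance are resolved in favor of the earlier-listed assessment.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (weeks since randomization, objective status).
Assessment = Tuple[float, str]


def measurements_up_to(assessments: List[Assessment],
                       landmark_weeks: float) -> List[Assessment]:
    """Keep only assessments recorded at or before the landmark time."""
    return [a for a in assessments if a[0] <= landmark_weeks]


def status_at_landmark(assessments: List[Assessment],
                       landmark_weeks: float,
                       window_weeks: float = 3.0) -> Optional[str]:
    """Objective status at the assessment closest to the landmark.

    Returns None when no assessment lies within the window (3 weeks here).
    """
    if not assessments:
        return None
    week, status = min(assessments, key=lambda a: abs(a[0] - landmark_weeks))
    return status if abs(week - landmark_weeks) <= window_weeks else None


# Made-up assessment history for one patient.
visits = [(6.0, "SD"), (11.5, "PR"), (18.0, "PR")]
at_12 = status_at_landmark(visits, 12.0)   # closest visit: week 11.5
at_16 = status_at_landmark(visits, 16.0)   # closest visit: week 18.0 (2 weeks away)
at_24 = status_at_landmark(visits, 24.0)   # closest visit is 6 weeks away -> None
```

A patient with no assessment inside the window has no defined status at that landmark; how such patients were handled in the analysis is not specified here.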

To understand the associations between each metric and survival, we considered HRs for each metric. As a measure of model fit and a means to compare nonnested models, we considered the Akaike Information Criteria (AIC).
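For readers unfamiliar with the AIC, a minimal sketch of the standard definition follows: twice the number of estimated parameters minus twice the maximized log-likelihood (for a Cox model, the partial log-likelihood), with lower values indicating better fit. The log-likelihood values below are made up purely for illustration and are not results from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Two hypothetical nonnested models with the same number of covariates
# (e.g., metric, treatment arm, baseline sum); values are illustrative only.
aic_model_a = aic(log_likelihood=-100.0, n_params=3)   # 206.0
aic_model_b = aic(log_likelihood=-102.5, n_params=3)   # 211.0
preferred = "A" if aic_model_a < aic_model_b else "B"
```

Because the penalty term is identical when two models have the same number of parameters, the comparison then reduces to comparing their log-likelihoods.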

Discrimination and calibration were considered to better understand the predictive utility of our response metrics, our primary goal. To determine how well a model (and by extension, a metric) discriminates among patients with different outcomes, we used the concordance index (or c-index; refs. 16, 17). The c-index in the context of survival considers all pairs of individuals. If one can determine which individual in the pair died first, then this pair is evaluable. If the patient with a higher estimated HR dies prior to the one with a lower estimated HR, then the pair is deemed concordant; otherwise, it is considered discordant. The index is the fraction of all evaluable pairs that are concordant. A value of 0.5 indicates no discrimination beyond chance and 1.0 indicates perfect concordance; values below 0.5 are possible but indicate predictions that run opposite to the observed outcomes. We assessed calibration by comparing expected and observed survival probabilities at 1 year as follows: patients were grouped into deciles of their predicted probabilities from a Cox model. Within each decile, we calculated the average predicted probability (“expected”) and the Kaplan–Meier estimate (“observed”) and compared these in plots and via an informal measure calculated as the sum of the squared differences between expected and observed probabilities. In comparing metrics, our focus was on discrimination.
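The pairwise definition of the c-index can be made concrete with a short sketch. It follows the description above, with one added assumption: pairs with tied predicted risk are counted as half concordant, a common convention that the text does not specify and that matters for categorical metrics with many ties. Tied survival times are treated as not evaluable. The function name and data are hypothetical.

```python
from typing import List, Sequence


def c_index(times: Sequence[float],
            events: Sequence[int],
            risk: Sequence[float]) -> float:
    """Fraction of evaluable pairs in which the higher-risk patient dies first.

    times:  follow-up time for each patient
    events: 1 if death was observed, 0 if censored
    risk:   model-based risk score (e.g., estimated hazard ratio)
    """
    concordant = 0.0
    evaluable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Evaluable when patient i is known to have died before patient j.
            if events[i] and times[i] < times[j]:
                evaluable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5   # assumption: tied risks count as half
    if evaluable == 0:
        raise ValueError("no evaluable pairs")
    return concordant / evaluable


# Made-up data: 4 patients, the third is censored at 12 months.
times: List[float] = [5.0, 8.0, 12.0, 20.0]
events: List[int] = [1, 1, 0, 1]
risk: List[float] = [2.0, 1.0, 1.5, 0.5]

c = c_index(times, events, risk)                        # 4 of 5 pairs concordant
c_reversed = c_index(times, events, [-r for r in risk]) # 1 of 5 pairs concordant
```

Reversing the risk ordering in this example drops the index below 0.5, which shows why 0.5 is the chance level rather than the minimum possible value.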

Results

Table 3 provides results from the 12-week landmark analysis, adjusting for metric, treatment arm, and sum of baseline measurements. Figure 1 presents the c-indices for all 7 metrics (represented by different symbols) for each study. In this section, unless otherwise noted, we present results from the 12-week landmark analysis. Results for the 16- and 24-week landmark analyses were similar and are included as Supplementary Tables.

Table 3. Up to 12-week assessments adjusting for treatment arm and sum of baseline tumor measurements

All of the categorical and continuous response metrics were found to be statistically significantly associated with overall survival (HR, P < 0.05; Table 3). The AIC values ranged from 799 (for the model with ASM) to 815 (for confirmed response) in N0026; from 11,872 (for TriTR) to 12,699 (for confirmed response) in N9741; and from 2,987 (for ARCB) to 3,058 (for RCB) in N9841.

On the basis of the plots and measure of observed versus expected 1-year survival probabilities, calibration across models was reasonable and comparable (Fig. 2; Supplementary Figs. S1 and S2). Plots for 2- and 3-year survival probabilities (not shown) revealed similar findings. The c-indices for confirmed response ranged from 0.59 to 0.65; from 0.60 to 0.66 for the continuous metrics; and from 0.64 to 0.69 for TriTR metrics (Table 3). Three observations about the c-indices hold across all 3 studies. First, although continuous metrics were all statistically significantly associated with overall survival (via HRs; Table 3), they provided minimal (if any) improvement in prediction for overall survival compared with the various categorical response metrics based on the c-index. In Fig. 1, the c-indices for categorical response metrics (open circles, open point-down triangles, and solid circles; range: 0.59–0.69 from Table 3) are as high as or higher than those for continuous metrics (other symbols; range: 0.60–0.66 from Table 3). This suggests that, in general, the categorical response metrics may be better predictors of overall survival than the continuous metrics. Second, the c-indices for TriTR status at 12 weeks (solid circles; range: 0.64–0.69 from Table 3) are as high as or higher than those for best TriTR (open circles; range: 0.64–0.67 from Table 3) and confirmed response (open point-down triangles; range: 0.59–0.65 from Table 3) within the same time frame. Third, the c-indices for TriTR at 12 weeks (solid circles; range: 0.64–0.69) are comparable with those for TriTR at 16 and 24 weeks (range: 0.65–0.70 and 0.65–0.66, respectively, from Supplementary Tables). For reference, we also calculated the c-indices for models that included only treatment and baseline sum of measurements based on measurement data available at 12 weeks. These were 0.59, 0.61, and 0.64 for N0026, N9741, and N9841, respectively.

Figure 2. Plots of observed versus expected 1-year survival probabilities for N9741 based on models for each metric and a model with no metric, adjusting for treatment arm and sum of baseline tumor measurements.

All of these analyses were also repeated within each arm for N9741, the large randomized trial, and the results were similar (data not shown). Of note is that the c-indices for response at 12 weeks were consistently as high as or higher than those for other metrics. Figure 3 shows Kaplan–Meier survival curves by TriTR status at 12 weeks and by ARCB (dichotomized at the median) for each study.

Discussion

Contrary to our hypothesis, we found that the continuous metrics we assessed provide no predictive advantage over the categorical response metrics. However, we do recommend further study of the TriTR at early time points (e.g., 12, 16, and 24 weeks), with particular attention to the 12-week status. This metric has at least 2 advantages. First, it addresses the concern over ignoring stable disease by including stable disease as a separate category; second, it can be assessed earlier because it does not require confirmation and does not require data from the entire study period.

It is interesting to note that N0026 and N9841 had relatively low response rates (25%–26%), yet the TriTR metric for these studies performed as well as it did for N9741, which had a higher response rate (45%). A likely explanation is that the TriTR appropriately recognizes the survival benefit associated with stable disease by placing such patients into their own category rather than combining them with patients who progress. A natural extension to the TriTR would be a 5-level metric (CR vs. PR vs. stable vs. increasing vs. progression). However, this also has some inherent limitations, specifically, (i) the need to specify a cutoff point to distinguish between increasing and progression, where the choice for this is not obvious, and (ii) the CR rate is often small in oncology studies; for example, in our data, the CR rates for N0026, N9741, and N9841 were 0%, 4.2%, and 3.3%, respectively.

The inability of the continuous metrics we assessed to improve survival prediction may be due to several factors. First, when considered over an entire study population, tumor growth may be sufficiently “regular” that measurements at a fixed time point postbaseline adequately characterize tumor activity. Second, imaging may have been too infrequent to capture the changes in tumor size. Alternatively, unidimensional tumor size may not be the most accurate measure of disease aggressiveness; functional imaging, volumetric assessment, or other advanced imaging methods may offer improvements. Finally, it may be too much to expect any early tumor measurement–related endpoint to predict overall survival in settings where second-line and later-line therapies are used (18). An important assumption for the validity of endpoints based on our continuous metrics is that patient tumors are measured at regular intervals that do not differ by arm. This is to eliminate the possible bias that could arise in the following situation: 2 patients have similar tumor growth trajectories, but one patient has a tumor measurement at j weeks and the other has a tumor measurement at j + i weeks (for i > 0), by which time the tumor is a different size than at week j. As a result, these patients may have different tumor response profiles based on our continuous metrics.

An additional limitation of the current data is the inability to effectively assess the impact of missing measurement data due to clinical progression, new lesions, and missing assessments. Moreover, the number of lesions measured at each assessment was variable, and the current analysis used only the lesion measurements that were available across all assessments for the patient. Because not all lesion measurements at each assessment were used, the measurement data from each cycle used to compute the metrics could be biased. Future work should consider further exploration of TriTR as well as alternative continuous metrics because simple scalar summaries such as those we considered are unlikely to capture the key features of the tumor growth curve. For example, it is possible to have one patient for whom the tumor decreases over time and another patient for whom the tumor increases over time, but for these 2 patients to have identical sums of measurements. Furthermore, tumor growth curves often exhibit nonlinearity, for example, initial tumor shrinkage followed by progression. To capture key features of the tumor growth curve and thereby to improve prediction, a metric will likely need to be composite, for example, a linear combination of multiple scalar summaries such as those considered in this article. Longitudinal modeling, for example, mixed models, is another option others have previously considered (e.g., ref. 12).
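The point about identical sums can be illustrated numerically. In the following sketch, 2 made-up patients have mirror-image trajectories: one shrinks and one grows. Their TSM and ASM coincide, while a baseline-relative summary such as the RCB separates them. The helper names and values are hypothetical, and the RCB includes the (zero) baseline term as an assumption.

```python
from typing import List


def total_sum(sums: List[float]) -> float:
    """Total sum of measurements (TSM) over all assessments, baseline included."""
    return sum(sums)


def relative_change_sum(sums: List[float]) -> float:
    """RCB: sum over assessments of (sum at assessment / baseline sum - 1)."""
    baseline = sums[0]
    return sum(s / baseline - 1.0 for s in sums)


# Made-up trajectories (baseline first): one shrinking, one growing.
shrinking = [12.0, 10.0, 8.0]
growing = [8.0, 10.0, 12.0]

tsm_shrinking = total_sum(shrinking)            # 30.0
tsm_growing = total_sum(growing)                # 30.0, identical to the above
rcb_shrinking = relative_change_sum(shrinking)  # about -0.5
rcb_growing = relative_change_sum(growing)      # 0.75
```

In a model that also adjusts for baseline tumor burden, the 2 patients would still differ through the baseline covariate, but the TSM itself carries no information about the direction of change.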

In conclusion, our data suggest that categorical response metrics predict survival as well as or better than the continuous tumor measurement–based metrics considered in this work. Furthermore, TriTR at early time points, possibly as early as 12 weeks, is worthy of further study as an alternative endpoint in phase II trials.

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

Grant Support

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Footnotes

Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).

This work was presented, in part, at the 2007 and 2008 Annual Meetings of the American Society of Clinical Oncology.

This article was written while M-W. An was on leave from Vassar College in the Department of Health Sciences Research at the Mayo Clinic.