Abstract

There is no shortage of treatment approaches offered to people with pain. The maze of options presents patients and clinicians with difficult choices. Key to making those choices is evidence of treatment effectiveness provided by clinical trials and systematic reviews. Recent growth in the number of clinical trials and systematic reviews, of both high and low quality, makes it vital that users of this evidence—clinicians, researchers, patients, and policy makers—have the skills and knowledge to critically interpret these studies. In this review, we discuss some contemporary issues regarding evidence of effectiveness derived from clinical trials and systematic reviews—issues that we think are critical to understanding the field. We focus on evidence of treatment effectiveness in pain, although many of these issues are relevant to and transferable across the spectrum of evidence-based practice.

People with pain and the clinicians who help them are faced with a maze of treatment options, each backed by enthusiastic and highly motivated advocates, all of whom lay claim to “evidence.” Negotiating the treatment maze has never been more difficult. How can patients make an informed choice about their own care and how can clinicians best inform that choice?

Clinical trials remain the best tool for reducing uncertainty about the effects of treatment. The recent growth in the number of clinical trials and systematic reviews, of both high and low quality, makes it vital that clinicians, researchers, patients, and policy makers have the skills and knowledge to critically interpret the available evidence. Here, we discuss some contemporary issues regarding evidence of effectiveness from clinical trials and systematic reviews in pain—issues that we think are critical to understanding the field.

Clinical trials can be designed to test efficacy (whether an intervention delivers an effect in ideal conditions) or effectiveness (whether an intervention delivers an effect in the real world). In reality, many trials test something that falls somewhere on a continuum between the two.1 We focus on evidence of effectiveness of treatments for pain, particularly chronic pain. We also examine evidence from the world of pharmacological interventions for pain to consider what lessons there may be for interpreting nonpharmacological evidence. Many of these issues also are relevant to evidence of efficacy of treatments for pain and are transferable across the spectrum of evidence-based practice.

Beyond “P”: The Search for Importance

In effectiveness research, the P value has long been a critical determinant of whether a treatment is thought to work. We contend that the P value has been a significant barrier to efforts to establish which treatments are truly effective and, therefore, worthwhile. There has been an implicit acceptance among researchers, and users of evidence, that statistical significance represents clinical effectiveness. Unfortunately, this position suffers from substantial conceptual flaws.2,3 It is possible for a treatment effect to be statistically significant and clinically meaningless. Conversely, although perhaps less commonly in the field of pain research, a treatment might provide important benefits to patients but be unjustly ignored because it does not cross this arbitrary statistical threshold.

Recently, overreliance on the P value to determine treatment effectiveness has come under further scrutiny.4 Although it is commonly held that a P value of <.05 suggests a type I error rate of less than 5%, the actual false discovery rate is dependent on the prior probability of a treatment having an effect.4 For example, if we assume that just 10% of the various interventions that we test might provide a real treatment effect, an alpha level of <.05 would actually translate to a false discovery rate of 36% rather than the nominal 5% (for a review of this concept, see Colquhoun4). This finding is particularly pertinent for chronic pain trials where we recruit participants who have proven refractory to interventions, and hence the prior probability of intervention success is likely to be low. Aside from this problem, inappropriate, or perhaps innovative, statistical analyses can yield supportive-looking P values.5
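The arithmetic behind the 36% figure can be sketched in a few lines. The sketch below is illustrative only: the 10% prior and the .05 alpha come from the text, whereas the 80% statistical power is our assumption (the text does not state it), and the function name `false_discovery_rate` is hypothetical.

```python
def false_discovery_rate(prior, alpha=0.05, power=0.8):
    """Share of 'statistically significant' results that are false positives.

    prior: assumed proportion of tested treatments that truly work
    alpha: significance threshold
    power: assumed probability of detecting a real effect (not given in the text)
    """
    true_positives = prior * power            # real effects that reach significance
    false_positives = (1.0 - prior) * alpha   # null effects that reach significance
    return false_positives / (true_positives + false_positives)


# With 10% of tested treatments truly effective, roughly a third of
# "significant" findings are false discoveries under these assumptions.
print(f"{false_discovery_rate(prior=0.10):.0%}")
# A more favorable prior (50%) brings the rate close to the nominal 5%.
print(f"{false_discovery_rate(prior=0.50):.0%}")
```

Lower assumed power or a lower prior would push the false discovery rate higher still, which is the scenario the text suggests for refractory chronic pain.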

When assessing the effectiveness of a treatment, the size, precision, and subsequent clinical importance of the treatment's effects matter more than whether the apparent effect could have occurred by chance. Most patients with chronic pain want a cure for their pain,6 and treatments are routinely promoted and marketed as delivering large benefits quickly. Unfortunately, the prospect of such an outcome remains very unlikely. So the question shifts to one of how much improvement would be needed to be meaningful to a patient. This minimal clinically important difference (MCID)7 or smallest worthwhile effect (SWE)8 should represent the minimum treatment effect with which patients would be satisfied. Recognizing this metric is a step forward because it allows us to classify a treatment as imparting enough of an effect to be of value in the real world.

What exactly constitutes an MCID, or an SWE, on any given outcome measure remains contentious, although various methodological approaches are being applied to the problem (for a review with regard to back pain, see Ferreira et al8). Remarkably, it seems that many of these approaches do not actively consider the patient's perspective.8 It is likely that what a patient would be satisfied with might differ substantially between individuals, patient groups, interventions,9 the point in the care pathway at which the patient arrives, and a range of other possible factors. In addition, people almost certainly have variable thresholds of what level of risk, inconvenience, or cost associated with the intervention they would consider to be prohibitive—an issue that, to our knowledge, has received very little attention. This variability—and the requirement that any potential benefit of an intervention must be weighed against its potential harms (including cost and inconvenience)—suggests that the construct of a generic MCID for chronic pain interventions is problematic.

Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials

The Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) has offered some provisional benchmarks for important change in chronic pain. These benchmarks are based on studies that compared pain scores with global impression of change in patients with neuropathic pain,10 arthropathies,11 and pain following spinal cord injury or amputation.12 According to the IMMPACT, a 30% reduction in pain in an individual patient represents the lower threshold for considering an effect to be moderately clinically important, and a 50% reduction represents a substantially clinically important change.13 There are obvious problems with applying cutoffs arbitrarily and across-the-board. For example, it does not seem reasonable to assume that we might require the same degree of change when agreeing to receive a short educational booklet as we might when agreeing to undergo an invasive surgical procedure. These cutoffs also are sensitive to baseline levels of symptoms. A 30% improvement in a severe intractable pain is probably quite a different proposition from a 30% change in a mildly bothersome ache or twinge.
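The baseline sensitivity of percentage cutoffs can be illustrated with a minimal sketch. The 30% and 50% thresholds are the IMMPACT benchmarks quoted above; the 0 to 10 pain scale, the example scores, and the function names are assumptions made purely for illustration.

```python
def classify_change(baseline, follow_up):
    """Label a within-patient pain reduction against the 30% and 50% benchmarks."""
    if baseline <= 0:
        raise ValueError("baseline pain must be greater than zero")
    reduction = (baseline - follow_up) / baseline
    if reduction >= 0.50:
        label = "substantial"
    elif reduction >= 0.30:
        label = "moderate"
    else:
        label = "below benchmark"
    return reduction, label


def points_needed(baseline, fraction=0.30):
    """Absolute change (in scale points) that a percentage cutoff demands."""
    return baseline * fraction


# Hypothetical scores on an assumed 0-10 scale.
print(classify_change(baseline=8, follow_up=5))   # 37.5% reduction
print(classify_change(baseline=8, follow_up=4))   # 50% reduction
# The same 30% cutoff asks for very different absolute changes:
print(points_needed(8), points_needed(2))
```

A patient starting at 8 must improve by several points to cross the threshold, whereas a patient starting at 2 crosses it with a change smaller than a single scale point.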

Rather than focusing on the SWE, a number of studies have used the Patient-Centered Outcomes Questionnaire (PCOQ) to investigate the threshold of symptom improvement required for people with chronic pain to consider treatment successful.14–16 These studies suggest that around 54% to 58% improvement in pain intensity and 63% to 68% improvement in pain interference are required for treatment success. However, as with the SWE, we might expect judgments of success to be specific to the intervention and other contextual factors. Zeppieri et al16 investigated participants about to start physical therapy, whereas Robinson et al14 and O'Brien et al15 sampled patients from pain clinics and a rheumatology department, respectively, with no intervention specifically identified. Thus, although estimates suggest that large changes may be necessary, it is not appropriate to assume that these estimates should apply across all interventions.

The Elusive “Average” Patient and the Elusive “Responder”

The IMMPACT benchmarks and the thresholds derived from them reflect within-patient change from baseline. This is an appealing concept because it has a real-world resonance, being the amount of change experienced by an individual undergoing the intervention of interest. Unfortunately, within-patient change from baseline provides a poor measure of the effects of the intervention because it includes the influences of natural recovery, statistical regression, and the nonspecific effects associated with clinical contact, including but not limited to “placebo effects”17 (see Moseley,18 however, for an alternative understanding of placebo). Within-patient change in outcome might tell us how much an individual's condition improved, but it does not tell us how much of this improvement was due to treatment. In most common randomized trial designs, the only value that can help us estimate the actual effect of the intervention is the average between-group difference after treatment.19 It is only recently that this important principle has been applied to MCID or SWE research.

Ferreira et al9 used the benefit-harm trade-off method to try to determine the SWE for physical therapy in people with chronic low back pain based on intervention-control between-group comparison, attempting to capture change due to treatment, not simply change over time. Participants were informed that their pain and disability were likely to improve 30% without intervention and were asked to estimate how much additional improvement would be needed to make the intervention worthwhile. The results of this study suggest that, on average, people with chronic low back pain would need to experience an additional 20% improvement in pain and disability compared with no treatment to perceive that the effect of physical therapy was worthwhile, that is, an overall 50% change.
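The distinction between change over time and change due to treatment can be made concrete with a small sketch. The 30%, 20%, and 50% figures are those reported above; the assumption that the two components simply add, and the function names, are ours and are used only for illustration.

```python
def split_improvement(treated_pct, untreated_pct):
    """Separate the improvement expected without treatment from the extra
    improvement attributable to treatment (assumes the two simply add)."""
    return {
        "without_treatment": untreated_pct,
        "attributable_to_treatment": treated_pct - untreated_pct,
    }


def is_worthwhile(additional_pct, smallest_worthwhile_pct=20):
    """Compare the between-group (additional) improvement with the SWE."""
    return additional_pct >= smallest_worthwhile_pct


# A 50% within-patient change contains only 20 points of treatment effect
# when 30% improvement would have occurred anyway.
parts = split_improvement(treated_pct=50, untreated_pct=30)
print(parts)
print(is_worthwhile(parts["attributable_to_treatment"]))

# A 40% within-patient change looks large but adds only 10 points.
print(is_worthwhile(split_improvement(40, 30)["attributable_to_treatment"]))
```

Judging the within-patient change against the SWE, rather than the between-group difference, would credit the treatment with improvement it did not cause.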

There are limitations inherent in interpreting the average effects of interventions in clinical trials. The question arises of who, if anyone, experiences the average treatment effect. It has been argued that, in the world of pharmaceutical trials for chronic pain, the response pattern is often bimodally distributed.20–22 Simply put, some patients do very well with the intervention, some have minimal to no effects, and very few experience intermediate (moderate) effects. In this instance, the average effect might be the effect that the fewest participants actually demonstrate.20 The commonly proposed solution to this problem is to conduct a “responder analysis,” which compares the proportion achieving a clinically important improvement from baseline in the treatment and control groups. It has been proposed that this type of analysis better quantifies individual participant responses to treatment20 and that it enables the calculation of easily interpreted measures such as the number needed to treat (NNT). The NNT is the number of people we would need to treat with the intervention instead of the control condition for one more participant to achieve the outcome of interest (often a predefined MCID).
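As a minimal sketch of the calculation, the NNT is the reciprocal of the between-group difference in the proportion of participants achieving the outcome. The responder counts below are hypothetical and chosen only to show the arithmetic; the function name is ours.

```python
import math


def number_needed_to_treat(responders_treat, n_treat, responders_control, n_control):
    """NNT = 1 / (proportion responding on treatment - proportion on control)."""
    risk_difference = responders_treat / n_treat - responders_control / n_control
    if risk_difference <= 0:
        return math.inf  # treatment shows no advantage; an NNT is not meaningful
    return 1.0 / risk_difference


# Hypothetical trial: 40 of 100 treated and 25 of 100 control participants
# reach the predefined improvement from baseline.
nnt = number_needed_to_treat(40, 100, 25, 100)
print(round(nnt, 2))   # raw reciprocal of the 15-point difference
print(math.ceil(nnt))  # conventionally rounded up to a whole person
```

Note that the 25 control participants who reached the threshold did so without the intervention, which is why the NNT depends on the difference between groups and not on the treated proportion alone.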

This approach also has important limitations. The term “responder analysis” is a misnomer and is frequently misunderstood.23 In this type of analysis, “responders” are identified by within-person change from baseline. For many participants in each group, we are not really measuring treatment “response”; we are measuring “good outcome,” which, as mentioned above, might be due to natural recovery, nonspecific treatment effects, and regression to the mean, as well as (or instead of) the effects of the intervention. Also, it is possible that some individuals who responded strongly to the intervention might not be counted as responders. If the natural history of individuals during the treatment period would have been significant worsening, yet with treatment their condition remains stable, they will be counted as nonresponders despite receiving significant benefit from the intervention. So, even though the between-group difference in the proportion of participants who experience a good outcome reflects the net increase in the proportion of patients who responded during the treatment period, it does not get any closer to telling us about the effects of intervention on individual people.

Methods for distinguishing true responders from those who improved regardless of the treatment have their own substantial difficulties.24 Responder analysis for a subjective outcome measured on a continuous scale (eg, pain severity) may be sensitive to the cutoffs used to define clinical importance, and these cutoffs are often arbitrary. Moreover, because the outcome is measured imperfectly and responders may be frequently misclassified, responder analyses might underestimate true effects.25 This approach also potentially introduces the problem of only detecting positive change, not negative change. That is, all “nonresponders” are considered equal. In reality, the response within this group might vary from mild improvement to severe deterioration. Finally, the dichotomization of outcomes in responder analyses greatly reduces the precision of estimates of effect.
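Two of these problems, sensitivity to the cutoff and the lumping together of all nonresponders, can be shown with a toy example. The percentage changes below are invented for illustration and do not come from any trial; the function names are ours.

```python
def responder_proportion(percent_changes, cutoff):
    """Proportion of participants whose improvement from baseline meets the cutoff."""
    return sum(1 for change in percent_changes if change >= cutoff) / len(percent_changes)


def nonresponder_range(percent_changes, cutoff):
    """Best and worst outcomes hidden inside the single 'nonresponder' label."""
    nonresponders = [change for change in percent_changes if change < cutoff]
    return min(nonresponders), max(nonresponders)


# Hypothetical percent improvement from baseline (negative values = worsening).
treatment = [60, 55, 45, 35, 32, 28, 10, 0, -20, -40]
control = [52, 40, 31, 29, 25, 15, 5, 0, -10, -15]

for cutoff in (30, 50):
    difference = responder_proportion(treatment, cutoff) - responder_proportion(control, cutoff)
    print(cutoff, round(difference, 2))

# "Nonresponders" in the treatment arm span near-misses and marked deterioration.
print(nonresponder_range(treatment, cutoff=30))
```

In this invented data set, moving the cutoff from 30% to 50% halves the apparent advantage of treatment, and the treatment arm's nonresponders range from a 28% improvement to a 40% deterioration.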

Although the use of responder analysis is growing, currently such data remain scarce, particularly for nonpharmacological interventions.26 A good case for responder analyses in rehabilitation trials has not been clearly established. The observation that patterns of outcome may be bimodal for some specific interventions is not evidence that they are necessarily bimodal for others. More importantly, evidence of bimodal outcomes is not evidence of bimodal treatment effects. The belief that responder analysis will demonstrate treatment effects on individuals that are not apparent in other analyses may be unfounded. Data from drug trials in chronic pain, where such analyses are more common, rarely show NNTs below 6.20

Do Clinical Trials Underestimate Effectiveness?

It is commonly argued that clinical trials are not fit for the purpose of evaluating physical interventions because they fail to capture the true effects of physical therapy treatments. Such arguments seem common at physical therapy conferences, particularly from therapists who find the disappointing results from clinical trials to be at odds with their clinical experience. The most common criticisms are that treatments are inadequately targeted in clinical trials because they are shoehorned into a one-size-fits-all approach; that therapies in clinical trials differ from real-world therapy, which is complex, tailored, and often multimodal; and that the effects of treatment are diluted by the application of single interventions to a complex, heterogeneous group with diverse treatment needs. These criticisms are certainly justified in some, but not all, trials. In the field of chronic pain, additional difficulties are presented in establishing meaningful diagnoses. Existing diagnostic labels (eg, chronic nonspecific low back pain, complex regional pain syndrome, fibromyalgia) often identify heterogeneous cohorts of people who share similar symptom profiles but not necessarily similar disease mechanisms.

The one-size-fits-all criticism is arguably an unfair characterization of many modern therapy trials. Indeed, in recent years, many, if not most, trials allowed therapists some discretion to tailor their approach to the individual, usually within a specific theoretical framework and often in a way that closely modeled existing clinical practices. For example, in the manipulation arm of the UK back pain exercise and manipulation trial,27 therapists were free to deliver a range of soft tissue, joint, and neural manual therapy techniques. In addition, therapists could prescribe various exercises for the spine and hip, provide education on activity and return to work, and address simple psychological issues.28 In the recent PROMISE trial of exercise for chronic whiplash,29 therapists were able to tailor multimodal exercise, manual therapy, and cognitive-behavioral techniques to the individual patient. Furthermore, in the SWIFT trial, participants randomized to the physical therapy arm received a combination of individualized education or advice, exercise therapy, and manipulative therapy at the discretion of the treating physical therapist based on usual practice.30

Currently, there are no firmly established, robust, and widely accepted models for subgrouping patients with chronic pain to facilitate better targeting of treatment. Efforts at subgrouping have largely returned mixed outcomes.31 Much of this work has focused on the management of low back pain, both acute and chronic, for which numerous approaches to subgrouping have been developed and tested. The picture that emerges is one in which positive trials32,33 tend to demonstrate small, positive effects on primary outcomes, although these trials often fail to be replicated34–37 or are currently awaiting independent replication.38 Subgrouping algorithms are frequently based on retrospective analysis of trial data rather than on prospective tests of predictions based on theoretical frameworks or biological mechanisms. Moreover, some subgroup analyses have been shown to be dependent on the cutoff points used to determine MCID,39 and many subgroup analyses conducted within trials have been severely underpowered and poorly reported.40 Better tailoring or subgrouping of cohorts to treatments may still improve outcomes, but so far the promise of subgrouping remains largely unfulfilled.

A further assertion is that the true effects of an intervention are lost in the cacophony of competing real-world variables, including social and psychological factors, competing therapies, adherence, and participation. This assertion maintains that the signal of effective treatment cannot always be detected in the presence of noise. Again, there may be some truth in this assertion, but the best way around it is to conduct large trials that can provide precise estimates of average treatment effects. Therein lies the challenge facing all health interventions: to demonstrate clear benefit in the chaos of the real world. The “noise” may be particularly loud in chronic pain, but we should recognize that, in both clinical practice and research, interventions cannot be provided in the clinical equivalent of a soundproofed room.

Recently, specifically in the case of cognitive-behavioral therapy interventions, Morley41 argued for greater integration of “practice-based evidence,” in which data generated from routine clinical practice are afforded greater importance. In this approach, clinical outcome data are compared to “benchmark” effect sizes generated from the treatment and control arms of clinical trials. This comparison allows a degree of control over the effects of natural history and nonspecific effects of treatment, although it does not offer the high level of control offered by randomization. One possible risk associated with this approach is that where effect sizes are sufficiently low, it may encourage the celebration of possibly dubious successes. As such, it seems best suited to demonstrating “proof of concept” of new hypotheses regarding treatment innovation for subsequent testing in RCTs.

Exaggeration, Misreporting, and Spin

It also is worth considering the alternative possibility that clinical trials might generate exaggerated estimates of effectiveness. In the context of clinical trials for physical interventions, treatments are often provided by more experienced clinicians, patients are given more time, and greater steps are taken to ensure treatment adherence than would be the case in routine clinical practice, potentially offering a more effective package of care than patients would ordinarily receive. More importantly, the observed effectiveness of a treatment is represented by the true effect of treatment plus the effect of biases that also can positively influence outcome. Because these biases are often suboptimally controlled in clinical trials, the observed effect reflects both the intervention and bias. Many readers will be familiar with the conventional risk of bias criteria by which trials are assessed in systematic reviews. Meta-epidemiological evidence shows that these criteria are associated with treatment effect sizes, particularly for subjective outcome measures such as pain.42,43 Blinding of patients and care providers is often not achieved in trials of physical or psychological treatments,44 and it is notable that trials of physical interventions commonly fall short on a number of other criteria. Although quality is improving,45 it is likely that the effect sizes reported in most clinical trials represent more than just the effects of treatment on patient outcomes.

In clinical trials, size matters. Small studies increase the risk of false negatives by virtue of their low statistical power, but they also tend to produce false positives and inflated effect sizes.46–48 There are a number of possible reasons for this phenomenon: small studies may include more homogeneous clinical groups for which effects are more consistent,49 and it is easier to deliver high-quality interventions in smaller trials.49 Small trials also are often more loosely controlled and of lower methodological quality. Negative small studies have a tendency not to be published, rendering the available sample of published small trials unrepresentatively positive. Large and significant effects arising from small, underpowered studies are at higher risk of being false positives than if they arose from large, well-powered studies.50 Meta-analysis does little to correct this problem: even where a pooled estimate includes a large number of participants, it may be prone to small study bias if it is dominated by small studies.
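The way selective publication of small trials inflates effect sizes can be demonstrated with a simple simulation. Everything in this sketch is an assumption made for illustration: a true standardized effect of 0.2, 20 participants per group, outcomes with a known standard deviation of 1, a z test in place of a t test, and the rule that only significantly positive trials are published.

```python
import random
import statistics


def simulate_published_effects(true_effect=0.2, n_per_group=20, n_trials=2000, seed=1):
    """Simulate many small two-arm trials and 'publish' only the significant ones."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each arm has SD 1.
    standard_error = (2.0 / n_per_group) ** 0.5
    all_effects, published = [], []
    for _ in range(n_trials):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        effect = statistics.mean(treated) - statistics.mean(control)
        all_effects.append(effect)
        if effect / standard_error > 1.96:  # simplified significance filter
            published.append(effect)
    return all_effects, published


all_effects, published = simulate_published_effects()
print("mean effect, all trials:      ", round(statistics.mean(all_effects), 2))
print("mean effect, published trials:", round(statistics.mean(published), 2))
print("share of trials published:    ", round(len(published) / len(all_effects), 2))
```

Under these assumptions, a trial of this size can reach significance only if its observed effect is roughly three times the true effect, so every published estimate is inflated, and pooling the published trials alone would inherit that inflation.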

Managing loss to follow-up of participants and protocol violation during trials is difficult. Traditionally, we look for an intention-to-treat analysis, in which all participants are analyzed by the treatment to which they were allocated, regardless of what follows that allocation. Currently, the application and reporting of intention-to-treat analyses in analgesic trials are inconsistent,51 reflecting a common risk of bias in this field. Common methods for dealing with missing data themselves introduce bias. Evidence from drug trials in chronic pain suggests that the commonly used “last observation carried forward” approach to imputing data inflates effect sizes.52 This is often an issue with adverse event withdrawal, where the last observation precedes the adverse event, but also might hold true for other reasons for withdrawal—withdrawal may be associated with worsening symptoms or a realization that the treatment is not really helping, both of which may occur after the last formal observation. New methods for analyzing clinical trials, particularly multiple imputation, may improve estimates of effect in the presence of substantial loss to follow-up,53 although further data are needed to formally evaluate this perspective.
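A toy example shows how “last observation carried forward” can flatter a treatment. The pain scores are invented, and the “unobserved” trajectory is purely hypothetical, because by definition it is never recorded in a real trial; the function name `locf` is ours.

```python
def locf(series):
    """Replace missing visits (None) with the most recent observed value."""
    filled, last_seen = [], None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical participant who withdraws after visit 3 (0-10 pain scale assumed).
observed = [7, 5, 4, None, None]
# What might have happened had follow-up continued (never actually measured).
hypothetical_truth = [7, 5, 4, 6, 7]

imputed = locf(observed)
print(imputed)                 # the improved score is carried to the end of the trial
print(imputed[-1], hypothetical_truth[-1])
```

In this invented case the analysis records a final score of 4, although the participant's pain may have returned toward baseline after the last formal observation.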

Beyond these threats related to methodology are challenges to the balanced conduct and communication of trials. Selective outcome reporting is considered a risk of bias in many assessment tools and involves the selective presentation of results that are more positive or statistically significant, as well as the withholding of negative or nonsignificant results.54 This selective reporting can be achieved through poor practices such as deviating from the trial protocol by switching the primary outcomes in light of the trial results. A recent review of analgesic trials55 compared records in international trials registers with the final published study reports and showed discrepancies between the primary outcomes in 79% of the available data, with 30% of trials containing what were defined as “unambiguous” discrepancies, where a registered primary outcome was either not reported in the published trials or was demoted to a secondary outcome. A similar review of acupuncture trials56 showed inconsistency in the primary outcomes in 45% of available trials, of which 71% had a discrepancy that favored a statistically significant “positive” result on the primary outcome.

There is also evidence of a strong positivity bias in the interpretation and presentation of results from clinical trials. Boutron et al57 found evidence of “spin”—presenting an experimental treatment as beneficial—in 40% of statistically negative trials. In rheumatology trials, Mathieu et al58 found that 23% of the trials had conclusions that were misleading and that the only predictor of misleading conclusions was a statistically negative result. This pattern also is apparent in the analgesic trial literature. Worryingly, some type of positive spin was identified in at least one part of the abstract of 61% of analgesic trials with statistically nonsignificant results in their primary analysis, most commonly the placing of undue emphasis on statistically significant results from secondary analyses.59 It seems that beyond the difficulty of getting negative results published, researchers do not like to accept negative results in the first place. Perhaps this attitude is partially motivated by the “publish or perish” culture of modern research. Notwithstanding that, it clearly represents a failure of the scientific process in which there is a bias toward one possible answer to the research question. For consumers of research articles, the message is that simply looking to the abstract or conclusions of a trial for the truth carries risk—an issue we have touched on previously.60

Pursue Success, Expect Failure?

Looking across the Cochrane Library at reviews of common interventions for chronic pain, and being somewhat selective by avoiding interventions where the evidence suggests no effect at all, reveals that most “effective” therapies appear to provide only very small, short-term effects on pain or other important patient-centered outcomes such as function, distress, and quality of life. We must bear in mind that, particularly for complex interventions, the meta-analyses that produce these estimates contain multiple sources of clinical heterogeneity that have the potential to influence effect size; they combine interventions that are often quite different in terms of content and dose61; the quality of the intervention is often hard to determine, although of great potential importance49; the theories underpinning the interventions often vary significantly among studies or are not clearly established62; the contextual equivalence of the control group interventions is variable63; and adherence levels vary, and patients are drawn from diverse sources. But accepting these limitations, we suggest that, when we do not currently have robust means of identifying a priori those patients who might respond to treatment, it is the average between-group effects that represent our best estimate of the intervention-specific benefit for any individual.

For drug therapies, treatment response, when it comes, is usually rapid. Moore and colleagues20 recommend that when we introduce a new therapy, we should expect failure, be alert to a lack of treatment response, and switch quickly to another agent if outcomes are poor. Such an approach might maximize the chance of finding an effective option as quickly as possible while minimizing the risks of adverse events from drugs that confer no individual benefit, although it makes the potentially tenuous assumption that, without intervention, the patient's symptoms would not have changed substantially.

Could this approach be applied to nonpharmacological interventions? We think so, but to avoid pushing patients through a mill of ineffective therapies, we also think that we should limit the potential options to interventions that possess at least biological plausibility (a foundation stone that can be difficult to find in our field) and rigorous evidence of effectiveness.64

Reviewing this evidence can leave one with a somewhat negative impression. We acknowledge that there is a danger here—that such focused attention on rigor and bias can appear hypercritical and unduly negative and take away whatever desire clinicians and patients may have had to negotiate the evidence maze. That, however, is our challenge: to be dispassionate, recognize bias, and make balanced appraisals of the strength and direction of the evidence, and that must in the end be a positive step. In the words of the physicist Richard Feynman, “For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled.”65

These issues are not unique to the field of chronic pain research; many of them apply across the range of clinical disciplines. This applicability is important because there are examples from other clinical fields of the development and validation of clearly successful interventions, investigated by high-quality clinical trials and systematic reviews. Such compelling evidence of effectiveness from other, comparably complex fields offers genuine hope for our own field. For example, we can now be confident that stress urinary incontinence can be prevented and treated with pelvic-floor muscle training66 and the risk of falling in the elderly population can be decreased with exercise programs.67

We should reflect that, in chronic pain treatment and research, we all have some sort of vested interest.68 If we offer assessments of the evidence for our treatments without due diligence regarding bias and limitations, we will not serve our patients well. Our patients may be given a choice but not the choice they need. By its very nature, clinical research should threaten current practice. Acknowledging what does not work, as well as what does (and by how much), is of great value and will force us to innovate. In fair tests,69 if our treatments achieve their goals meaningfully and consistently, the effect sizes will reflect that truth. An appreciation of how to interpret evidence of effectiveness is a critical skill not only for those engaged with research but also for those who want to use it in clinical practice.

Altering the natural course of any clinical condition is a difficult and complex challenge. In the words of epidemiologist Archie Cochrane, after whom the Cochrane Collaboration is named: “[O]ne should…be delightfully surprised when any treatment at all is effective, and always assume that a treatment is ineffective unless there is evidence to the contrary.”70 This statement has genuine resonance in chronic pain, in which we set ourselves the substantial challenge of changing symptoms in a group defined by the fact that those symptoms have so far proven unchangeable. We suggest that we should always have that perspective in mind, while remaining ready to be delightfully surprised.
