In Chapter 2, the committee recommends a framework for the US Food and Drug Administration (FDA) regulatory decision-making process in which scientific evidence plays a critical role, together with other factors including ethical considerations and the perspectives of patients and other stakeholders. This chapter focuses on the evaluation of the scientific evidence and on how FDA should use evidence in its decisions. Just as courts determine when evidence is admissible and which standard of proof to apply in a given case, scientific evidence must be evaluated for its quality and applicability to the public health question that is the focus of regulatory decision-making. FDA needs to base its decisions on the best available scientific evidence related to that question. Different people, however, can interpret and judge scientific evidence in various ways. Decisions in which there is disagreement among experts about what decisions are best supported by a given body of evidence are among the most difficult that FDA must make. For these decisions to properly incorporate all the relevant uncertainties and values, the regulators need to understand the bases of the various judgments that the experts are making. As has been shown in many difficult cases that FDA has had to decide, evidence does not speak for itself.

This chapter will categorize and discuss the sources of technical disagreements between experts about the kinds of data that FDA typically deals with. It will start with a short primer on approaches to statistical inference, with an introduction to Bayesian methods, followed by a discussion of the distinctions between scientific data and evidence. It then discusses why scientists sometimes disagree about the evidence of a drug’s benefits and risks and how their disagreements may affect regulatory decision-making.

Committee on Ethical and Scientific Issues in Studying the Safety of Approved Drugs, Institute of Medicine. “3 Evidence and Decision-Making.” In Ethical and Scientific Issues in Studying the Safety of Approved Drugs. Washington, DC: The National Academies Press, 2012.


STATISTICAL INFERENCE AND DECISION-MAKING
Evidence
Although the terms data and evidence are often used interchangeably, data is not a synonym for evidence. The Compact Oxford English Dictionary defines data as “facts and statistics collected together for reference or analysis” and evidence as “the available body of facts or information indicating whether a belief or proposition is true” (Oxford Dictionaries, 2011). The difference is whether or not the information is being used to draw scientific conclusions about a specific proposition. In the context of a drug study, the “proposition” is a hypothesis about a drug effect, often stated in the form of a scientific question, such as “Do broad-spectrum antibiotics increase the risk of colitis?” In the broader context of FDA’s regulatory decisions, the proposition may be implicit in the public health question that prompts the need for a regulatory decision, such as, “Does the risk of colitis caused by broad-spectrum antibiotics outweigh their benefits to the public’s health?” In this way, evidence is defined with respect to the questions developed in the first step of the decision-making framework described in Chapter 2.
Statistical methods help to ascertain the “strength of the evidence” supporting a given hypothesis by measuring the degree to which the data support one hypothesis rather than another. The evidence in turn affects the likelihood that either hypothesis is true. The most common scientific hypothesis in the realm of drug evaluation is the “null hypothesis”: that in a given treated population, the drug has no effect relative to a comparator treatment. For the concept of evidence to have meaning, however, there must be at least one other hypothesis under consideration, such as that the drug has some effect.
A small change in the scientific hypotheses being compared can change the strength of the evidence provided by a given set of data. For example, if the question above changed from whether broad-spectrum antibiotics produce any increase in the risk of colitis to whether broad-spectrum antibiotics produce a clinically important increase in the risk of colitis (say, an increase of more than 10 percent), the strength of the evidence provided by the same data could change. Where one observer might see a 4 percent increase in risk as strong evidence of some excess risk, another could regard it as strong evidence against a 10 percent increase in risk.1 Agreement on the strength of the evidence therefore requires agreement on the hypotheses being contrasted and on the public health questions that give rise to them.
1 Confusion can result from use of the word significant to describe an effect that is both statistically significant and clinically relevant; the latter is often termed clinically significant. The two uses should remain separate.
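This sensitivity of evidential strength to the hypotheses being contrasted can be sketched numerically. The figures below are invented for illustration (an observed 4 percent absolute risk increase with an assumed standard error of 1.5 percentage points), not data from any study discussed in this chapter:

```python
import math

def normal_sf(z):
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical result: observed absolute risk increase of 4 percent,
# with an assumed standard error of 1.5 percentage points.
observed, se = 0.04, 0.015

# Evidence against "no increase at all" (null difference = 0):
z_null0 = (observed - 0.0) / se
p_some_excess = normal_sf(z_null0)

# Evidence against "a clinically important (10-point) increase":
z_null10 = (0.10 - observed) / se
p_at_least_10 = normal_sf(z_null10)

# The same data are strong evidence FOR some excess risk and strong
# evidence AGAINST a 10-percentage-point increase.
print(f"P against 'no increase':       {p_some_excess:.4f}")
print(f"P against '10-point increase': {p_at_least_10:.6f}")
```

Both one-sided P values are small, so the same observation counts as strong evidence for one hypothesis contrast and against the other; which conclusion an observer draws depends entirely on which null is being tested.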

Inference
Good science, together with proper statistics, has a dual role. The first role is to decrease uncertainty about which hypotheses are true; the second is to properly measure the remaining uncertainty. Both are carried out in part through a process called statistical inference: summarizing data, estimating the uncertainty around the summary, and using the summary to reach conclusions about the underlying truth that gave rise to the data.
The two main approaches to statistical inference are the standard “frequentist” approach and the Bayesian approach. Each has distinctive strengths and weaknesses when used as a basis for decision-making; including both approaches in the technical and conceptual toolbox can be extraordinarily important in making proper decisions in the face of complex evidence and substantial uncertainty. The frequentist approach to statistical inference is familiar to medical researchers and is the basis for most FDA rules and guidance. The Bayesian approach is less widely used and understood; however, it has many attractive properties that can both elucidate the reasons for disagreements and provide an analytic model for decision-making. This model allows decision-makers to combine the chance of being wrong about risks and benefits with the seriousness of those errors to support optimal decisions.
The frequentist approach employs such measures as P values, confidence intervals, and type I and type II errors, as well as practices such as hypothesis-testing. Evidence against a specified hypothesis is measured with a P value. P values are typically used within a hypothesis-testing paradigm that declares results “statistically significant” or “not significant,” with the threshold for significance usually being a P value less than 0.05. By convention, type I (false-positive) error rates in individual studies are set in the design stage at 5 percent or lower, and type II (false-negative) rates at 20 percent or below (Gordis, 2004).
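As a concrete illustration of this paradigm, the sketch below runs a conventional two-proportion z-test against the 0.05 threshold. The colitis case counts and sample sizes are hypothetical, invented purely for illustration:

```python
import math

def normal_sf(z):
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Invented counts: colitis cases on a broad-spectrum antibiotic vs. comparator.
cases_drug, n_drug = 18, 1000
cases_comp, n_comp = 7, 1000

p1, p2 = cases_drug / n_drug, cases_comp / n_comp
pooled = (cases_drug + cases_comp) / (n_drug + n_comp)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_drug + 1 / n_comp))

z = (p1 - p2) / se
p_two_sided = 2 * normal_sf(abs(z))

alpha = 0.05  # conventional type I error rate, fixed at the design stage
print(f"z = {z:.2f}, two-sided P = {p_two_sided:.3f}")
print("statistically significant" if p_two_sided < alpha else "not significant")
```

With these invented counts the test crosses the conventional threshold, so the result would be declared “statistically significant”; note that the verdict says nothing about whether the excess is clinically important.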
In the colitis example, if the null hypothesis posits that broad-spectrum antibiotics do not increase the risk of colitis, a P value less than 0.05 would lead one to reject that null hypothesis and conclude that broad-spectrum antibiotics do increase the risk of colitis. The range of that elevation statistically consistent with the evidence would be captured by the confidence interval. If the P value exceeded 0.05, several conclusions could be supported, depending on the location and width of the confidence interval: either that a clinically negligible effect is likely, or that the study cannot rule out either a null or a clinically important effect and thus is inconclusive. In the drug-approval setting, the FDA regulatory threshold of “substantial evidence”2 for effectiveness is generally defined as two well-controlled trials that have achieved statistical significance on an agreed-upon endpoint, although there can be exceptions (Carpenter, 2010; Garrison et al., 2010).
2 21 USC § 355(d) (2010).

OCR for page 121
124 STUDYING THE SAFETY OF APPROVED DRUGS
Hypothesis-testing provides a yes-or-no verdict that is useful for regulatory purposes, and its value has been demonstrated over time, both procedurally and inferentially. Its emphasis on pre-specification of endpoints, study procedures, and analytic plans has regulatory and often inferential benefits. But hypothesis tests, P values, and confidence intervals do not provide decision-makers with an important measure: the probability that a hypothesis is right or wrong. In settings where a difficult balancing of various decisional consequences must be made in the face of uncertainty about both the presence and magnitude of benefits and risks, the probability that a given hypothesis is true plays a central role. The failure to assign a degree of certainty to a conclusion is a weakness of the frequentist approach when it is used for regulatory decisions (Berry et al., 1992; Etzioni and Kadane, 1995; IOM, 2008; Parmigiani, 2002).
In contrast, the Bayesian approach to inference allows a calculation, on the basis of results from an experiment, of how likely a hypothesis is to be true or false. However, this calculation is premised on an estimated probability that a hypothesis is true prior to the conduct of the experiment, a probability that is not uniquely scientifically defined and about which scientists can differ. Both in spite of this and because of it, Bayesian approaches can be very useful complements to traditional frequentist analyses and can yield insights into the reasons why scientists disagree, a topic that will be discussed in more depth later in this chapter.
The use of Bayesian approaches is not new to FDA. FDA’s Center for Devices and Radiological Health (CDRH) has published guidance for the use of Bayesian statistics in medical device clinical trials (FDA, 2010a), and FDA has used Bayesian approaches in regulatory decisions. A 2004 FDA workshop on the use of Bayesian methods for regulatory decision-making included extensive discussion by FDA scientists, as well as Center for Drug Evaluation and Research (CDER) and CDRH leadership, of ways in which Bayesian approaches could enhance the science of premarketing approval.3 Campbell (2011), director of the CDRH Biostatistics division, discussed the uses of Bayesian methods for FDA decision-making and presented 17 requests for premarketing approval submitted to and approved by the CDRH for medical devices that used Bayesian methods. Although Bayesian methods have been little used by CDER, Berry (2006) discusses how a Bayesian meta-analysis served as the basis for a CDER approval of Pravigard™ Pac (co-packaged pravastatin and buffered aspirin) to lower the risk of cardiovascular events. Bayesian sensitivity analyses were used to help evaluate the literature investigating the possible association between antidepressants and suicidal outcomes (Laughren, 2006; Levenson and Holland, 2006), elaborated later in Kaizar (2006). Finally, FDA staff has recently proposed Bayesian methodology for analysis of safety endpoints in clinical trials (McEvoy et al., 2012).
3 Published papers from the workshop are available in the August 2005 issue of Clinical Trials (2:271-378).

OCR for page 121
125
EVIDENCE AND DECISION-MAKING
The Bayesian approach does not use a P value to measure evidence; rather, it uses an index called the Bayes factor (Goodman, 1999; Kass and Raftery, 1995). The Bayes factor encodes mathematically the principle presented earlier: that the role of evidence is to help adjudicate between two or more competing hypotheses. The Bayes factor modifies the probability of whether a hypothesis is true. Decision-makers can then use that probability to characterize the likelihood that their decisions will be wrong. In its simplest form, Bayes’ theorem can be written as the following equation (Goodman, 1999; Kass and Raftery, 1995):

    (odds that a hypothesis is true after new evidence)
        = (odds that a hypothesis is true before new evidence)
        × (strength of the new evidence: the Bayes factor)
The Bayes factor is sometimes regarded as the “weight of the evidence,” comparing how strongly the data support one hypothesis (or combination of hypotheses) relative to another (Good, 1950; Kass and Raftery, 1995). Most important is the role that the Bayes factor plays in Bayes’ theorem: it modifies the probability that a given hypothesis is true. This concept that a hypothesis has a certain “truth probability” has no counterpart in standard frequentist approaches.
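The identity above translates directly into code. This is a minimal sketch of the odds form of Bayes’ theorem; the prior probability and Bayes factor below are arbitrary placeholders, with the factor expressed as the likelihood ratio favoring the hypothesis:

```python
def posterior_probability(prior_prob, bayes_factor):
    """Odds form of Bayes' theorem.

    Here bayes_factor is the likelihood ratio favoring the hypothesis:
    values above 1 strengthen it, values below 1 weaken it.
    """
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor  # the identity in the text
    return posterior_odds / (1 + posterior_odds)

# Arbitrary illustration: a 25 percent prior, updated by evidence that
# favors the hypothesis 10-to-1, rises to roughly 77 percent.
print(round(posterior_probability(0.25, 10), 3))  # → 0.769
```

Evidence that favors neither hypothesis (a Bayes factor of 1) leaves the probability unchanged, which is one way to see that the prior alone drives the conclusion when the data are uninformative.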
There is not a one-to-one relationship between P values and Bayes factors, because the magnitude of an observed effect and the prior probabilities of hypotheses can also affect the Bayes factor calculation itself. But in most common statistical situations there exists a strongest possible Bayes factor, and it can be defined as a function of the observed P value. That relationship can be used to calculate the maximum chance that the non-null hypothesis is true as a function of the P value and a prior probability (Goodman, 2001; Royall, 1997).
Assume that the null hypothesis is that a given drug does not cause a given harm and that the alternative hypothesis is that it does elevate the risk of that harm. Table 3-1 shows how a given P value (translated into the strongest Bayes factor) alters the probability of the hypothesis of harm. For example, if a new randomized controlled trial (RCT) yields a P value of 0.03 for a newly reported adverse effect of a drug, and there was deemed to be only a 1 percent chance before the RCT that the unsuspected adverse effect was caused by the drug, the new evidence increases the chance of the causal relationship to at most 10 percent (see Table 3-1). A regulatory decision predicated on the harm being real would therefore be wrong more than 90 percent of the time.
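The arithmetic behind this example, and behind Table 3-1, can be reproduced with Goodman’s bound on the strongest possible Bayes factor, exp(-z^2/2), where z corresponds to the two-sided P value. The sketch below assumes that bound:

```python
import math
from statistics import NormalDist

def min_bayes_factor(p_value):
    # Goodman's bound on the strongest Bayes factor against the null:
    # exp(-z^2 / 2), with z taken from the two-sided P value.
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return math.exp(-z * z / 2)

def max_posterior_probability(prior_prob, p_value):
    # Upper bound on the posterior probability of the alternative: the
    # strongest Bayes factor shifts the prior odds by 1 / min_bayes_factor.
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds / min_bayes_factor(p_value)
    return posterior_odds / (1 + posterior_odds)

# The worked example: P = 0.03 with a 1 percent prior chance of harm.
print(f"{min_bayes_factor(0.03):.3f}")                 # → 0.095, the ~0.10 of Table 3-1
print(f"{max_posterior_probability(0.01, 0.03):.3f}")  # → 0.096, at most about 10 percent
```

With a 25 percent prior the same call gives roughly 0.78, matching the corresponding row of Table 3-1.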
Without a formal Bayesian interpretation, that high probability of error would not be apparent from any standard analysis. Using conventional measures, such a study might report that “a previously unreported association of tinnitus was observed with the drug, OR [odds ratio] = 3.5, 95% CI [confidence interval] 1.1 to 11.1, P = 0.03.” This statement does not actually indicate how likely it is that the drug actually raises the risk of tinnitus. For that, a prior probability and the Bayes factor are needed. If the mechanism or some preliminary observations justified a 25 percent prior chance of a harmful effect, the same evidence would raise that to at most a 78 percent chance of harm, that is, leave at least a 22 percent chance that the drug does not cause that harm. Table 3-1 shows that, after observing P = 0.03 for an elevated risk of harm, in order to be 95 percent certain that this elevation was true, the prior probability of a risk elevation would have to have been at least 67 percent before the study. That might be the case if there was an established mechanism for the adverse effect, if other drugs in the same class were known to produce this effect, or if a prior study showed the same effect.

TABLE 3-1 Maximum Change in the Probability of a Drug Effect as a Function of P Value and Bayes Factor, Calculated by Using Bayes’ Theorem

P Value in   Strongest      Strength of    Prior Probability   Maximum Probability
New Study    Bayes Factor   Evidence^a     of an Effect, %^b   After the New Study, %
0.10         0.26           Weak            1                    2.5
                                           25                   46
                                           50                   79
                                           83                   95
0.05         0.15           Moderate        1                    6
                                           25                   69
                                           50                   87
                                           76                   95
0.03         0.10           Moderately      1                   10
                            Strong         25                   78
                                           50                   91
                                           67                   95
0.01         0.04           Strong          1                   21
                                           25                   90
                                           40                   95
                                           50                   96.5
0.001        0.005          Very Strong     1                   75
                                            8                   95
                                           25                   99
                                           50                   99.5

^a The qualitative descriptor of the strength of the evidence is made on the basis of the quantitative change in the probability of truth of a non-null drug effect.
^b The prior truth probabilities of 1%, 25%, or 50% are arbitrarily chosen to span a wide range of strength of prior evidence. The final prior probability listed for each P value illustrates the minimum prior probability required to provide a 95% probability of a drug effect after observing a result with the reported P value.
SOURCE: Modified from Goodman (1999).
In practice, however, there exist no conventions or empirical data to determine exactly how to assign such prior probabilities, although the elicitation of prior probabilities from experts has been much studied (Chaloner, 1996; Kadane and Wolfson, 1998). FDA informally incorporated the notion of a prior by building “biologic plausibility” into its decision-making about how to respond to drug safety signals that arise in the course of pharmacovigilance, in March 2012 draft guidance (FDA, 2012):

    CDER will consider whether there is a biologically plausible explanation for the association of the drug and the safety signal, based on what is known from systems biology and the drug’s pharmacology. The more biologically plausible a risk is, the greater consideration will be made to classifying a safety issue as a priority.

As demonstrated in the above paragraph, biologic plausibility and other forms of external evidence are currently accommodated qualitatively; Bayesian approaches allow that to be done quantitatively, providing a formal structure by which both prior evidence and other sources of information (for example, on common mechanisms underlying different harms or their relationship to disease processes) should affect decisions.
This discussion illustrates a number of important issues:

• Given new evidence, the probability that a drug will be harmful can vary widely depending on the strength of the prior or external information, represented as a prior probability distribution.
• The chance that a drug will be harmful, based on P values for a harmful effect in the borderline-significant range (0.01–0.05), is often far lower than is suspected, unless there are fairly strong reasons to believe in the harm before the study.
• The Bayesian approach allows the calculation of intermediate levels of certainty (for example, less than 95 percent) that might be sufficient for regulatory action, particularly for drug harms.
• Without agreed-upon conventions or empirical bases for assigning prior probabilities, the prior probabilities derived from a given body of evidence will differ among scientists, resulting in different conclusions from the same data.

The probability that a given harm will be caused by a drug is a key attribute in regulatory decision-making. How sure regulators must be to take a given action varies according to the consequences of decisions. In some cases 95 percent certainty might be needed, in others 75 percent, and in still others less than 50 percent. The Bayesian approach provides numbers that feed into that judgment (Kadane, 2005).
Despite these advantages, one of the weaknesses of Bayesian calculations is that there is no unique way to assign a prior probability to the strength of external evidence, particularly if that evidence is difficult to quantify, such as biologic plausibility. Although it may be impossible to assess subtle differences in prior probability, even crude distinctions can be helpful, such as whether the prior evidence justifies probability ranges of 1–5 percent, 15–50 percent, 60–80 percent, or 90+ percent. Such categorizations often provide fine enough discrimination to be useful for decision-making. In the absence of agreement on prior probabilities, “non-informative” prior distributions can be used that rely almost exclusively on the observed data, and sensitivity analyses with different kinds of prior probabilities from different decision-makers can be conducted (Emerson et al., 2007; Greenhouse and Wasserman, 1995). At a minimum, these prior probabilities should be elicited and their evidential bases made explicit so that this potential source of disagreement can be better understood and perhaps diminished.
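Such a sensitivity analysis over crude prior ranges is straightforward to automate. The sketch below assumes a hypothetical Bayes factor of 0.15 against the alternative (roughly the strongest factor for P = 0.05) and propagates each prior range through the odds calculation:

```python
def posterior(prior, bayes_factor):
    # bayes_factor measures support for the null; dividing the prior odds
    # by it shifts them toward the alternative hypothesis.
    odds = (prior / (1 - prior)) / bayes_factor
    return odds / (1 + odds)

BAYES_FACTOR = 0.15  # assumed for illustration

# The crude prior ranges suggested in the text, mapped to posterior ranges.
for lo, hi in [(0.01, 0.05), (0.15, 0.50), (0.60, 0.80), (0.90, 0.99)]:
    print(f"prior {lo:.0%}-{hi:.0%} -> posterior "
          f"{posterior(lo, BAYES_FACTOR):.0%}-{posterior(hi, BAYES_FACTOR):.0%}")
```

With the same data, a 1–5 percent prior yields a posterior well under 30 percent, while a 60–80 percent prior yields one above 90 percent; laying the ranges side by side makes visible how much of a disagreement is attributable to priors rather than to the new evidence.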
The difference between Bayesian and frequentist approaches can go well beyond the incorporation of prior evidence, extending to more complex aspects of how the analytic problem is structured and analyzed. Madigan et al. (2010) provide a comprehensive suite of Bayesian methods to analyze safety signals arising from a broad range of study designs likely to be employed in the postmarketing setting.
WHY SCIENTISTS DISAGREE
When new information arises that puts into question a drug’s benefits and risks, FDA’s decision-makers often face sharp disagreements among scientists over how to interpret that information in the context of pre-existing information and over what regulatory action, if any, should be taken in response to the new information. Such disagreements are often unavoidable, and moving forward with appropriate decision-making is difficult if the underlying reasons for them are unknown or misunderstood. The committee identified a number of reasons for the disagreements about scientific evidence that occur among scientists. Those reasons, which are listed in Box 3-1, are discussed below.
BOX 3-1
Why Scientists Disagree About the Strength of Evidence Supporting Drug Safety

Prior Evidence
1. Different weights given to pre-existing mechanistic or empirical evidence supporting a given benefit or risk.

Quality of the New Study
2. Different views about the reliability of the data sources.
3. Different confidence in the design’s ability to eliminate the effect of factors unrelated to drug exposure.
4. Different views on the appropriateness of statistical models.

Relevance of the New Evidence to the Public Health Question
5. Different views of the hypotheses needing evaluation.
6. Different assessments of the transportability of results.

Synthesizing the Evidence
7. Different ideas about how to weigh and combine all the available evidence from disparate sources relevant to the public health question.

Appropriate Regulatory Response to the Body of Evidence
8. Different opinions among scientists regarding the thresholds of certainty to justify concern or regulatory action, which can affect how they view the evidence.

Different Prior Beliefs About the Existence of an Effect
People’s beliefs about the plausibility of an effect of a drug are determined, in part, by their knowledge and interpretation of prior evidence about the drug’s benefits and risks (Eraker et al., 1984). That knowledge shapes their responses to new evidence. Prior evidence can come directly from earlier clinical studies of the drug’s effects, from studies of drugs in the same class that demonstrate the effect, and from information about the drug’s mechanism of action. Newly observed evidence might be interpreted as resulting in a higher chance that a drug is harmful if earlier studies have also demonstrated the harm. If other drugs in the same class have been associated with a particular adverse effect, the drug has a higher prior probability of causing that effect than a drug in a class whose members have not produced such an effect. If a drug has a mechanism of action that has been implicated in a particular adverse effect, it has a higher prior probability of causing that effect than a drug for which such a mechanism is implausible. For example, the prior probability that a topical steroid would produce significant internal injury would be very low because what is known about the absorption, metabolism, and physiologic actions of topical steroids makes it difficult to imagine how such an injury could occur, but the prior probability of an adverse dermatologic effect would be much higher.
Evidential bases of prior probability can take two forms: an assessment of the evidence supporting the mechanistic explanation of a proposed effect and the cumulative weight of previous empirical studies. Marciniak, in the FDA Office of New Drugs (OND) Division of Cardiovascular and Renal Products, discussed mechanism directly in a letter that was provided for a July 2010 FDA Advisory Committee meeting related to Avandia (Marciniak, 2010):

OCR for page 121
130 STUDYING THE SAFETY OF APPROVED DRUGS
    Others have speculated that rosiglitazone could increase MI [myocardial infarction] rates through its effects upon lipids or by the same mechanism whereby it increases HF [heart failure] rates. There are no clinical studies establishing these mechanisms. We propose that there is a third mechanism for which there is some evidence from clinical studies. The third possible mechanism is the following: The Avandia label states that “In vitro data demonstrate that rosiglitazone is predominantly metabolized by Cytochrome P450 (CYP) isoenzyme 2C8, with CYP2C9 contributing as a minor pathway.” The published literature suggests that rosiglitazone may also function as an inhibitor of CYP2C8. . . . Allelic variants of the CYP2C9 gene have been associated in epidemiological studies with increased risk of myocardial infarction and atherosclerosis. . . . Recently, CYP2C8 variants has also been associated with increased risk of MI. . . . CYP2C9 and 2C8 catalyze the metabolism of arachidonic acid to vasoactive substances, providing one potential mechanism for affecting cardiac disease. Interference with cigarette toxin metabolism is another. . . . Rosiglitazone effects upon CYP2C8 and CYP2C9 could be the mechanism for its CV adverse effects. Regardless, there are several possible mechanisms for CV toxicity of rosiglitazone.
The above paragraph describes a mechanism that is fairly speculative, as labeled. There is no suggestion or claim that such a mechanism would definitely or even probably produce adverse cardiovascular effects. Rather, this particular exposition is exploratory and aimed at establishing that such an effect is possible rather than probable. Those who have a good understanding of this particular set of pathways might interpret the explanation differently and establish a different starting point for the probability of such an effect. It is unlikely, though, that on the basis of such evidence general consensus could be garnered for a high prior probability of effect.
Mechanistic explanations generally provide weak evidence when they are offered post hoc to support an observed result. They carry more weight when they are proposed before such an effect is observed. Misbin (2007) raised questions about the safety of rosiglitazone on the basis of its effects on body weight and lipids, both well-established risk factors for cardiovascular disease, long before any risk of myocardial infarction (MI) was seen in any studies.
Another, more subtle way in which mechanistic considerations can affect inferences is in the choice of endpoints, as illustrated in Marciniak’s discussion of the wisdom of combining silent and clinical MIs into a single endpoint (Marciniak, 2010):

    There is additional evidence from RECORD [the Rosiglitazone Evaluated for Cardiac Outcomes and Regulation of Glycemia in Diabetes trial] that the MI risk for rosiglitazone is real rather than a random variation: We prospectively excluded silent MIs from our primary analysis because we had concerns that silent MIs might represent a different disease mechanism than symptomatic MIs, e.g., could they represent gradual necrosis from diabetic microvascular disease rather than an acute event with coronary thrombosis in an epicardial coronary artery?

Whether or not silent and clinical MIs should be combined, a critical decision in assessing the evidence, is framed here as contingent on whether or not they represent different manifestations of the same pathophysiologic process. What is important to recognize is that the numbers arising from an analysis that excludes silent MIs are only as credible as the underlying mechanistic explanation. This example shows how a mechanistic explanation can affect the analyses, especially exploratory analysis, even if it is not explicitly invoked as an evidential basis of a claim.
Even if two scientists agree about what evidence new data provide, they might disagree about the probability of a higher drug risk if they have different assessments of the strength of prior evidence. Such a disagreement might appear outwardly to be about the new evidence when in fact it is about the prior probability. That phenomenon is captured quantitatively by Bayes’ theorem, as previously noted (Fisher, 1999), which can use sensitivity analyses with different priors to illustrate the plausible range of chances that the drug induces unacceptable safety risks.
Quality of the New Study
Standard approaches to evaluating evidence rely on the use of evidence
hierarchies, which traditionally emphasize the type of study design as the main
determinant of evidential quality; an example is the US Preventive Services Task
Force guidance (AHRQ, 2008). Many scientists judge a study on the basis of its
type of design above all other considerations. The type of study design, however,
is only one of the factors that should be taken into account in assessing the
quality of a study and thereby the quality of the evidence from the study. In
addition to the type of study, such other aspects as the source and reliability
of the data, study conduct, whether there are missing or misclassified data, and
the data analyses influence the quality of the evidence generated by a study.
Some of these factors are reflected in the Grading of Recommendations
Assessment, Development and Evaluation (GRADE) approach to evidence assessment
(Guyatt et al., 2008).
and their role in disagreements among scientists are discussed below.
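The logic of a GRADE-style assessment can be sketched in a few lines. This is a deliberately simplified illustration, not the official GRADE algorithm: evidence from RCTs starts at "high" and observational evidence at "low," and the rating is moved down for limitations (risk of bias, inconsistency, indirectness, imprecision, publication bias) or up for strengths such as a large effect:

```python
# Simplified, illustrative GRADE-style grading (not the official algorithm).
LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(design: str, downgrades: int = 0, upgrades: int = 0) -> str:
    # RCT evidence starts "high" (index 3); observational starts "low" (index 1).
    start = 3 if design == "rct" else 1
    # Each serious limitation moves the rating down one level; each
    # strengthening factor (e.g., a large effect) moves it up one level.
    score = max(0, min(3, start - downgrades + upgrades))
    return LEVELS[score]

# An RCT with serious risk of bias and imprecision (two downgrades):
print(grade_quality("rct", downgrades=2))          # "low"
# An observational study with a large effect (one upgrade):
print(grade_quality("observational", upgrades=1))  # "moderate"
```

The point of the sketch is that the study-design starting point is only one input; the final grade depends equally on the judgments about each downgrading and upgrading factor, which is where experts often diverge.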
Different Views about the Reliability of the Data Source
Most evidence hierarchies assume that data in a study are generated for
research purposes and that outcome measures are specified in advance. Much
postmarketing research about a drug’s benefits and risks, however, whether an
RCT or an observational study, depends at least in part on data gathered with
systems developed for other purposes. For example, billing data that happen to
[…]
statistical code, and information about how decisions were made to produce the
analytic dataset from the raw measured data. Optimally, it involves some form
of data-sharing. Such data sharing permitted the reanalysis of the RECORD trial
that was presented to FDA in the rosiglitazone case. The reanalysis revealed
numerous discrepancies and judgment calls in the original study, from the
definition of a clinical event to the choice of analytic method, and those
discrepancies and judgments affected the weight that the results were given in the
regulatory decision-making process. For critical research that is to be the
basis of regulatory decisions, whether primary studies like RECORD or
meta-analyses, FDA should develop standards for adherence to
reproducible-research principles so that the basis of the many judgments can be
examined and adjudicated by scientists and regulators when disputes arise over
data interpretation and its implications.
Going a step beyond reproducibility, FDA is well-positioned to help assure
the accurate public reporting of risk information submitted to it as part of the
premarketing approval process. Such data are often, but not always, published
after approval and included in postmarketing safety assessments. FDA scientists
themselves have identified the discordance of published data with data submitted
to FDA as a problem for the validity of postmarketing safety meta-analyses
(Hammad et al., 2011), and there are numerous examples of underreporting or
delayed reporting of harms that had previously been reported to regulatory
authorities (for example, Carragee et al., 2011; Lee et al., 2008; Melander
et al., 2003; Vedula et al., 2009). FDAAA addressed this problem by requiring
that all clinical trials
submitted for new drug approval or for new labeling be registered at inception
at ClinicalTrials.gov, and that the summary results of all pre-specified outcomes
be posted within one year of drug approval for new drugs, or three years for new
indications (Miller, 2010; Wood, 2009). However, recently reported evidence
has shown that compliance with this aspect of FDAAA has been low (Law et al.,
2011). In addition, the FDA policy on the reporting of studies submitted for
non-approved drugs has not been settled (Miller, 2010). Finally, publishing summary
results is not equivalent to sharing primary data, which allows reanalyses. New
approaches are needed to facilitate the publication of safety data submitted to
FDA for approved drugs and to find ways to release similar data for drugs that
are not approved but whose information might be extremely valuable for the
interpretation of safety information from approved drugs in the same class.
FINDINGS AND RECOMMENDATIONS
Finding 3.1
Some of FDA’s most difficult decisions are those in which experts disagree about
how compelling the evidence bearing on the public health question is.
Understanding the nature and sources of those disagreements and their
implications for FDA’s decisions is key to improving the agency’s
decision-making process. For example, experts can disagree about the
plausibility of a new risk (or decreased benefit) on the basis of different
assessments of prior evidence, the quality of new data, the adequacy of
confounding control in the relevant studies, the transportability of results,
the appropriateness of the statistical analysis, the relevance of the new
evidence to the public health question, how the evidence should be weighed and
synthesized, or the threshold for regulatory actions.
Recommendation 3.1
FDA should use the framework for decision-making proposed in Recommendation 2.1
to ensure a thorough discussion and clear understanding of the
sources of disagreement about the available evidence among all participants
in the regulatory decision-making process. In the interest of transparency,
FDA should use the BRAMP document proposed in Recommendation 2.2 to
ensure that such disagreements and how they were resolved are documented
and made public.
Finding 3.2
Such methods as Bayesian analyses or other approaches to integrating external
relevant information with newly emerging information could provide decision-
makers with useful quantitative assessments of evidence. An example would be
sensitivity analyses of clinical-trial data that illustrate the influence of
prior probabilities on estimates of the probability that an intervention has
unacceptable safety
risks. These approaches can inform judgments, allow more rational decision-
making, and permit input from multiple stakeholders and experts.
Recommendation 3.2
FDA should ensure that it has adequate expertise in Bayesian approaches, in
combination with expertise in relevant frequentist and causal-inference methods,
to assess the probability that observed associations reflect actual causal
effects, to incorporate multiple sources of uncertainty into the
decision-making process, and to evaluate the sensitivity of those conclusions to
different representations of external evidence. To facilitate the use of
Bayesian approaches, FDA should develop a guidance document for the use of
Bayesian methods for assessing a drug’s benefits, risks, and benefit–risk profile.
Finding 3.3
Traditionally, the main criteria for evaluating a study are ones that contribute
to its internal validity. A well-conducted RCT typically has higher internal
validity than a well-conducted observational study. Results of observational
studies, however, can have greater transportability if their participants are
more similar to the target clinical population than are the participants in a
clinical trial. In some circumstances, such as an evaluation of the association
between a drug and an uncommon unexpected adverse event, observational studies
may produce estimates closer to the actual risk in the general population than
can be achieved in clinical trials. In assessing the relevance of study findings
to a public health question, the transportability of the study results is as
important as the determinants of its internal validity.
Recommendation 3.3
In assessing the benefits and risks associated with a drug in the postmarketing
context, FDA should develop guidance and review processes that ensure that
observational studies with high internal validity are given appropriate weight
in the evaluation of drug harms and that transportability is given emphasis
similar to that given to bias and other errors in assessing the weight of evidence
that a study provides to inform a public health question.
Finding 3.4
The principles of reproducible research are important for ensuring the integrity
of postmarketing research used by FDA. Those principles include providing
information on the provenance of data (from measurement to analytic dataset)
and, when possible, making available properly annotated analytic datasets, study
protocols (including the statistical analysis plan) and their amendments, and
statistical code.
Recommendation 3.4
All analyses, whether conducted independently of FDA or by FDA staff,
whose results are relied on for postmarketing regulatory decisions should use
the principles of reproducible research when possible, subject to legal
constraints. To that end, FDA should present data and analyses in a fashion that
allows independent analysts either to reproduce the findings or to understand
how FDA generated the results in sufficient detail to understand the strengths,
weaknesses, and assumptions of the relevant analyses.
Finding 3.5
The ability of researchers in and outside FDA to analyze new information about
the benefits and risks associated with a marketed drug and to design appropriate
postmarketing research, including individual-patient meta-analyses, is enhanced
by access to data and analyses from all studies of the drug and of others in the
same drug class that were reported in the preapproval process. Although
disclosure of such information is likely to advance the public’s health, such
disclosures raise concerns about the privacy of participants in the research
that generated the information and may threaten industry’s interest in
maintaining proprietary information, which is deemed important for innovation.
New approaches to resolving this tension are needed.
Recommendation 3.5
FDA should establish and coordinate a working group, including industry and
patient and consumer representatives, to find ways that appropriately balance
public health, privacy, and proprietary interests to facilitate disclosure of data
for trials and studies relevant to postmarketing research decisions.
Finding 3.6
The elements of the benefit–risk profile of a drug are best estimated by using all
the available high-quality data, and meta-analysis is a useful tool for summarizing
such data and evaluating heterogeneity. However, because the reporting of harms
in published RCTs and observational studies is often poor or inconsistent and
because there is often substantial publication bias in studies of drug risk, steps
are needed to improve both the reporting of harms and the design of studies of
harm. That can be done through prospective planning for selected meta-analyses
and by monitoring compliance with the FDAAA requirement that summary trial
results for all primary and secondary outcomes be published at ClinicalTrials.gov.
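The kind of synthesis this finding describes can be sketched with a standard inverse-variance meta-analysis. The log odds ratios and within-study variances below are invented for illustration; the pooling and the DerSimonian-Laird heterogeneity estimate are the standard textbook formulas:

```python
import math

# Hypothetical log odds ratios and within-study variances from five trials.
log_or = [0.8, 0.1, 0.9, -0.2, 0.5]
var = [0.10, 0.08, 0.20, 0.12, 0.15]

# Fixed-effect (inverse-variance) pooled estimate.
w = [1.0 / v for v in var]
fixed = sum(wi * yi for wi, yi in zip(w, log_or)) / sum(w)

# Cochran's Q statistic and the DerSimonian-Laird between-study variance
# tau^2, which quantifies heterogeneity beyond sampling error.
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_or))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(log_or) - 1)) / c)

# Random-effects pooled estimate and 95% confidence interval.
w_re = [1.0 / (v + tau2) for v in var]
pooled = sum(wi * yi for wi, yi in zip(w_re, log_or)) / sum(w_re)
se = math.sqrt(1.0 / sum(w_re))
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print(f"tau^2 = {tau2:.3f}; pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

A nonzero tau^2 widens the random-effects interval relative to the fixed-effect one; such a synthesis is only as trustworthy as the harms reporting feeding into it, which is why the finding ties meta-analysis to improved reporting.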
Recommendation 3.6
For drugs that are likely to have required postmarketing observational studies
or trials, FDA should use the BRAMP to specify potential public health
questions of interest as early as possible; should prospectively recommend
standards for uniform definition of key variables and complete ascertainment
of events among studies or convene researchers in the field to suggest such
standards and promote data-sharing; should prospectively plan meta-analyses
of the data with reference to specified exposures, outcomes, comparators, and
covariates; should conduct the meta-analyses of the data; and should make
appropriate regulatory decisions in a timely fashion. FDA can also improve
the validity of meta-analyses by monitoring and encouraging compliance
with FDAAA requirements for reporting to ClinicalTrials.gov.
Finding 3.7
FDA produced a high-quality guidance document on the use of the noninferiority
design for the study of efficacy. Increasingly, FDA is using the noninferiority
design to evaluate drug-safety endpoints as the primary outcomes in randomized
trials. The use of noninferiority analyses to establish the acceptability of the
benefit–risk profile of a drug can take the decision about how to balance the risks
and benefits of two drugs out of the hands of regulators. Noninferiority trials also