Statistics for the Clinician

A look at the statistical structure and design of a clinical trial.

BY SARAH ROSNER, MPH, AND BERNARD ROSNER, PhD

Recently, Genaera Corporation reported limited clinical results for Envizon (squalamine lactate),
a drug under investigation for wet AMD. The trial included 6 patients and all 12
eyes enrolled in the study showed preserved or improved vision.1 These
results are encouraging, but what should a clinician conclude from this data? What
about the data reported for other CNV treatments nearing an approval decision, such
as Retaane (15 mg anecortave acetate, Alcon Labs, Inc.), Lucentis (ranibizumab,
Genentech) or recently approved Macugen (pegaptanib sodium, Eyetech/Pfizer)? The
studies behind these drugs represent a small part of the flood of clinical data
that clinicians must wade through in seeking the best care for their patients. The
statistical designs and analyses that are applied to a particular drug program are
important to understand and deserve a thorough review given the new drugs that will
potentially be available to ophthalmologists in the coming months and years.

When a drug enters clinical
evaluation it is investigated using a prospective design. Clinical studies can use
a retrospective structure; however, this is typically performed in epidemiologic
studies and is not applicable to the drug development process. A prospective clinical
trial is the most powerful experimental design used to test the effectiveness of
an intervention in human populations. A prospective study design means that subjects
are followed from a well-defined baseline point and forward in time. The key feature
of a clinical trial is that it involves an intervention, such as a device or therapeutic
agent that is assigned according to a randomization code. A proper design always
includes a control, which can be a placebo agent and/or a standard therapy. Including
a proper control limits the potentially confounding effect of natural disease progression,
seasonality, and patient effects such as the Placebo Effect, Hawthorne Effect, and
regression to the mean. This article will review basic statistical concepts needed
to interpret data from clinical trials. For simplicity, a 2-group study consisting
of an intervention and a control group with a continuous outcome variable will be
considered.

STATISTICAL ANALYSIS FOR A CLINICAL TRIAL

When interpreting results from
a clinical trial, many clinicians will simply focus on the P-value. What
exactly is the P-value and why is 0.05 often used to define statistical significance?
By convention, if the P-value for a test statistic is less than 0.05, the results are
called statistically significant and are treated as a finding of interest. When the
P-value is greater than or equal to 0.05, the findings are called not statistically
significant and are often disregarded by the clinician. However, one must consider
that the value of 0.05 is an arbitrary cutoff. Studies with "significant" results
may represent false positives and studies with "nonsignificant" results can still
provide extremely valuable information. A brief review of some basic statistical
concepts will give the clinician tools for evaluating the results of a clinical
trial beyond simply looking at the P-value.

Hypothesis Test

It is important to understand the
concept of a hypothesis test. When conducting a clinical trial, the investigators
must first decide if they are interested in testing a 1-sided or 2-sided hypothesis.
A 1-sided hypothesis would look for differences in only 1 direction,
for example if the intervention is better than the control. A 2-sided hypothesis
looks for differences in both directions: it would test if the intervention is either
better or worse than the control. Unless there is a strong justification for why
the investigator would expect to see a difference in only 1 direction, 2-sided hypothesis
tests are typically used. It is important to designate the null and alternative
hypotheses at the start of a study. For example, if one were comparing the mean
response level between the treatment (intervention) and control groups,
one would test the null hypothesis $H_0: \mu_{\text{intervention}} = \mu_{\text{control}}$,
or equivalently $\mu_{\text{intervention}} - \mu_{\text{control}} = 0$. If there is no effect
of the treatment, it is assumed that the mean responses for the 2 groups are equal,
or that the difference between the mean responses is 0. The appropriate 2-sided
alternative hypothesis would be $H_A: \mu_{\text{intervention}} \neq \mu_{\text{control}}$,
or equivalently $\mu_{\text{intervention}} - \mu_{\text{control}} \neq 0$. The alternative
hypothesis states that the intervention is either better or worse than the control,
but is not the same.

The goal of a study is to test
the null hypothesis and decide whether or not it can be rejected. If it is rejected,
the data are taken as evidence in favor of the alternative hypothesis. There are 4 possible outcomes
of a hypothesis test (Table 1):

(1) H0 true, H0 is not rejected
(2) H0 true, H0 is rejected (Type I error)
(3) HA true, H0 is rejected
(4) HA true, H0 is not rejected (Type II error)

It is possible that a study can
incorrectly reject the null hypothesis, therefore having false positive findings.
The probability of this type of result is termed the significance level or the
Type I error rate, normally designated as alpha. The significance level of a study
is normally set at 0.05, but this is an arbitrary choice. The significance level
relates to the P-value of a statistical test. The P-value is the probability
of observing an absolute difference between the mean responses of the groups as
large as, or larger than, the difference observed, given that the null hypothesis is
true. If a P-value is very small, then it is unlikely that you would
observe the difference between the intervention and control groups that was seen
if there is truly no difference between the 2 groups. If the P-value is less
than the significance level (0.05), then the null hypothesis is rejected. It is also
possible for a study to have false negative findings, where the null hypothesis is not
rejected when it is in fact false. The probability of failing to reject a false null
hypothesis is termed the Type II error rate, normally designated as beta.
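
The definition of the P-value above can be made concrete with a short calculation. The following is a minimal sketch, assuming a large-sample z-test on summary statistics; the function name and every number in it are hypothetical and chosen only for illustration, and a real trial would follow its prespecified analysis plan (often a t test).

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_trt, mean_ctl, sd_trt, sd_ctl, n_trt, n_ctl):
    """Two-sided P-value for H0: mu_intervention - mu_control = 0.

    Uses a normal (z) approximation, which is an assumption that is
    reasonable only when both groups are fairly large.
    """
    diff = mean_trt - mean_ctl
    # Standard error of the difference between two independent means
    se = sqrt(sd_trt ** 2 / n_trt + sd_ctl ** 2 / n_ctl)
    z = diff / se
    # Probability of a difference at least this large in either direction
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p


# Hypothetical summary data: mean change of 5 vs 2 units, SD 10, 100 per group
diff, z, p = two_sided_p_value(5.0, 2.0, 10.0, 10.0, 100, 100)
alpha = 0.05
print(f"difference = {diff:.1f}, z = {z:.2f}, P = {p:.3f}")
print("reject H0" if p < alpha else "do not reject H0")
```

With these made-up inputs the P-value falls between 0.03 and 0.04, so the null hypothesis would be rejected at the 0.05 level but not at a stricter level such as 0.01, which illustrates how much the conclusion depends on the chosen cutoff.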

Table 1. Four Possible Outcomes of a Hypothesis Test.

Hypothesis Test Result*    "True State": H0 True    "True State": HA True
Reject H0                  Type I error (α)         No error (1-β)
Do not reject H0           No error (1-α)           Type II error (β)

*Reject H0 when P-value < α

Confidence Interval

When designing a clinical trial
it is important to be able to reject the null hypothesis when the alternative hypothesis
is true. In other words, if there is truly a difference between the treatment and
control groups, you want your study to be able to detect it.

Generally, a confidence interval
for the sample difference is reported along with the P-value. For example,
if α is 0.05 then a 95% confidence interval will be reported. So what does a 95% confidence
interval really mean? If the study were repeated many times and a 95% confidence interval
were calculated each time, about 95% of those intervals would contain the true population
difference. For example, if you were to repeat your study 100 times, you would expect about
95 of the confidence intervals calculated to contain the true population difference and about
5 of them not to.
It gives you an idea of the range of different values that are consistent with your
data. Ideally, you do not want the null value (0) to be contained in the confidence
interval. For example, if given a 95% confidence interval of (–2, 3), you
would conclude that there could be a difference between your groups of anywhere
from –2 to 3 units, including no difference at all. You would want to obtain
a confidence interval that does not include the null value, such as (–4, –2) or (3, 5).
The P-value relates directly to the confidence interval. For example, if P
is greater than or equal to 0.05, then the null value will be contained in the
95% confidence interval. If P<0.05, then there is a significant difference between
the 2 groups and the null value will not be contained in the 95% confidence interval.
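
The link between the P-value and the confidence interval can be checked numerically. This is a minimal sketch, assuming the same normal approximation is used for both quantities; the observed differences and standard errors are hypothetical values picked to resemble the intervals quoted above.

```python
from statistics import NormalDist


def diff_confidence_interval(diff, se, alpha=0.05):
    """(1 - alpha) confidence interval for a difference (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z_crit * se, diff + z_crit * se


def two_sided_p(diff, se):
    """Two-sided P-value for H0: true difference = 0 (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))


# Hypothetical (difference, standard error) pairs
for diff, se in [(0.5, 1.3), (-3.0, 0.5)]:
    low, high = diff_confidence_interval(diff, se)
    p = two_sided_p(diff, se)
    contains_null = low <= 0 <= high
    print(f"diff={diff:+.1f}  95% CI=({low:.2f}, {high:.2f})  "
          f"P={p:.3g}  CI contains 0: {contains_null}")
```

In the first case the interval spans 0 and P is well above 0.05; in the second the interval lies entirely below 0 and P is far below 0.05. Under this approximation the two statements always agree.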

Sample Size

Power is the ability of a study
to detect a true difference between the treatment groups of a specific size. Power
is the probability of concluding that the alternative hypothesis is true when it
is actually true. The concept of power is related to the Type II error rate in that
power is equal to 1 - β. For example, if β is 0.2 then the study has an
80% chance of detecting a difference of the specified size between treatment and control groups
if the difference truly exists. One of the challenges in designing a clinical trial
is to determine the smallest difference between groups that is still of clinical
significance.
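
To see how power depends on the size of the study, the relationship can be sketched for the simple 2-group case considered in this article. The sketch assumes equal group sizes, a common known standard deviation, and a 2-sided z-test; the function name and the numbers (a 5-unit difference with a standard deviation of 10) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a 2-sided z-test comparing two means.

    delta: true difference between group means (assumed)
    sd: common within-group standard deviation (assumed known)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n_per_group)  # SE of the difference in means
    shift = abs(delta) / se          # true difference in SE units
    # Chance the test statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (20, 63, 100):
    print(f"n per group = {n:3d}  power = {power_two_group(5.0, 10.0, n):.2f}")
```

Under these assumptions, roughly 63 subjects per group gives about 80% power (β of about 0.2), while 20 per group leaves the study with well under a 50% chance of detecting the same difference.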

The
concept of power is directly related to the sample size of a study. The size of
a study should be determined in the planning phase of the trial. It is essential
that a study have a large enough sample size to detect a clinically meaningful difference
between treatment groups with high probability. If the sample size is too small,
it is possible that a study may not have enough power to detect a true effect. The
calculation of sample size depends on 4 parameters: 1) power (1-β), 2) significance
level (α), 3) the detectable difference between groups, and 4) the standard deviation
within groups. Typically, the significance level is set at 0.05 and the power is
set at 0.8 or 0.9. Although it may seem desirable to set the power as close as possible
to 100%, the extremely large sample sizes that are required to attain that power
make it an unfeasible choice. The budget of a study will limit the sample size.
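
As an illustration of how the 4 parameters combine, one widely used textbook approximation for comparing 2 means with equal group sizes is n per group = 2σ²(z(1-α/2) + z(1-β))² / Δ², where Δ is the detectable difference and σ the within-group standard deviation. The sketch below applies that approximation; it is not necessarily the formula a given trial used, and the inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a 2-sided, 2-group comparison of means.

    Normal-approximation formula; assumes equal group sizes and a common,
    known standard deviation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # significance level
    z_beta = std_normal.inv_cdf(power)           # power = 1 - beta
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical design: detect a 5-unit difference when the SD is 10
for target_power in (0.80, 0.90, 0.99):
    n = n_per_group(5.0, 10.0, power=target_power)
    print(f"power = {target_power:.2f}  n per group = {n}")
```

Running the loop shows the required sample size climbing steeply as the target power approaches 100%, which is the practical reason powers of 0.8 or 0.9 are the usual choice.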

There are numerous formulas available
for calculating the sample size for a study with a given power, significance level,
and detectable difference between groups depending on the study design. The details
of these formulas are beyond the scope of this article; however, they can be
found in most texts on biostatistics.2 It is important to note that these
formulas only provide estimates of the sample size needed for the study. Therefore
it is advisable to use a sample size slightly larger than what is determined by
the formula. Given these considerations, statistical powering of a study is heavily
influenced by expected differences, which can be taken from previous studies found
in the literature, or earlier proof-of-concept clinical studies done in the same
patient population.

In the field of ophthalmology,
the calculation of sample size is not as straightforward as in other disciplines.
Since each person contributes 2 eyes to the study population, does that mean that
you need half as many people? Unfortunately, no, since a person's 2 eyes are not
independent units they cannot be considered as 2 separate "subjects." Special sample
size formulas are available for ophthalmic data where the eye is the unit of analysis.3
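
A rough way to see why 2 eyes do not count as 2 subjects is the standard design-effect approximation from cluster sampling, in which a person's 2 eyes are treated as a cluster of size 2 with intraclass correlation ρ. This is a simplified sketch and is not the specialized formula cited above; the function name and the correlation values are hypothetical.

```python
def effective_independent_eyes(n_persons, icc):
    """Approximate number of independent observations from 2 eyes per person.

    Uses the design effect 1 + (m - 1) * icc with m = 2 eyes per person,
    where icc is the assumed correlation between a person's two eyes.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    design_effect = 1 + icc
    return 2 * n_persons / design_effect


# Hypothetical study of 100 people (200 eyes) under different correlations
for icc in (0.0, 0.5, 1.0):
    n_eff = effective_independent_eyes(100, icc)
    print(f"between-eye correlation = {icc:.1f}  effective n = {n_eff:.0f}")
```

When the eyes are uncorrelated, 200 eyes behave like 200 subjects; when they are perfectly correlated, the second eye adds nothing and the study is effectively 100 subjects. Real between-eye correlations fall somewhere in between.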

COMMON STATISTICAL TESTS

Once the study has been designed
and implemented, the data can be analyzed using statistical tests. The investigator
has a choice of 2 categories of test statistics: parametric and nonparametric. Parametric
tests assume that the data are sampled from populations that follow a Gaussian,
or bell-shaped, distribution. Nonparametric tests do not assume a particular form for the
distribution of the study populations. Nonparametric tests perform calculations
based on the rank of the values rather than on the actual data values. Therefore,
there is less influence from outliers than with parametric tests.
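
The effect of working with ranks can be demonstrated with a small hand-rolled calculation. The sketch below computes the Mann-Whitney U statistic from ranks (the statistic only, not its P-value) and compares it with the difference in means when one value is replaced by an extreme outlier; all data values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: its rank sum minus the minimum possible rank sum."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    return sum(ranks[:n_a]) - n_a * (n_a + 1) / 2


control = [12, 15, 11, 18, 14]
treated = [16, 19, 13, 22, 20]
treated_outlier = [16, 19, 13, 220, 20]  # same data with one extreme value

for label, grp in (("original", treated), ("with outlier", treated_outlier)):
    print(f"{label:13s} mean difference = {mean(grp) - mean(control):5.1f}  "
          f"U = {mann_whitney_u(grp, control):.0f}")
```

The outlier inflates the difference in means roughly tenfold, yet U is unchanged because the extreme value keeps the same rank (the largest) in both data sets.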

So how do you know if a particular
test statistic calculated from your study population follows a Gaussian distribution?
It is often the case that if your sample size is large, then the test statistic will
approximately follow a Gaussian distribution and a parametric test can be used. If the
sample size is small or the outcome variable is on an ordinal scale (e.g., pain, comfort,
or slit-lamp exam scores) it is better to use a nonparametric test. Table 2 summarizes
some common statistical tests. For each parametric test, there is a nonparametric
test equivalent. The choice of a statistical test will depend on how many groups
you want to compare and whether or not you have paired data.

Table 2. Common Statistical Tests.

Goal of Test                     Parametric Test                 Nonparametric Test
To compare 2 unpaired groups     2-sample t test                 Mann-Whitney U test (equivalent to Wilcoxon rank sum test)
To compare 2 paired groups       Paired t test                   Wilcoxon signed rank test
To compare 3+ unpaired groups    ANOVA (analysis of variance)    Kruskal-Wallis test

STUDY DESIGN

In addition to looking at the statistical
results of a trial, it is also important to consider study design. Traditionally,
clinical trials are superiority trials where the purpose is to show that a new treatment
is more effective than placebo. However, more recently there have been many equivalence
or noninferiority trials where the goal is to show that 2 treatments have the same
therapeutic effect.

Since it can be difficult to show
that a new treatment is superior to other treatments, many development programs
try to show equivalence across drugs. Although 2 drugs may have the same therapeutic
effect, a newer treatment may have other desirable properties, such as fewer side
effects or extended duration of action. One of the challenges involved in an equivalence
trial is that a "region of therapeutic equivalence" must be defined. Before starting
a trial the investigator must determine which range of values corresponds to therapeutic
equivalence. This can be a subjective process, which can dramatically influence
the results of the trial if the range selected is too large or too small.

A recent example is the Retaane
15 mg (anecortave acetate suspension, Alcon Labs) versus Visudyne Photodynamic Therapy
(verteporfin, QLT/Novartis, Vancouver, Canada) Study that reported results in late
2004.4 This trial defined rigorous noninferiority parameters, based on previous studies,
at the initiation of the trial, which greatly affected the results. Alcon defined
equivalence as the percentage of patients who maintained vision (subjects who lost fewer
than 3 lines of vision) in the Retaane group being within 7% of that in the Visudyne
group; however, this number is highly subjective and it has been argued that a larger
window could have been justified for this study.

Choosing this range of equivalence
is a vitally important part of the study design: as it was initially reported, Retaane
would have met its predefined endpoint if a slightly larger window had been used; however,
since a 7% window was used the endpoint was not met. Thus, it is possible to obtain
a different study outcome while using the same results. It is also interesting to
note that a Chi-square analysis showed no statistically significant difference between
the Retaane and Visudyne treatment groups.
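
How the same results can pass or fail depending on the chosen window can be sketched with a confidence interval for the difference in proportions. The counts below are entirely hypothetical and are not the Retaane or Visudyne data; the simple Wald interval is also an assumption, and an actual trial would use its prespecified method.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(success_new, n_new, success_ref, n_ref, alpha=0.05):
    """Wald confidence interval for (new - reference) proportion of responders."""
    p_new = success_new / n_new
    p_ref = success_ref / n_ref
    diff = p_new - p_ref
    se = sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z_crit * se, diff + z_crit * se


def is_noninferior(ci_lower, margin):
    """Noninferiority is shown only if the whole interval lies above -margin."""
    return ci_lower > -margin


# Hypothetical trial: 130/200 maintain vision on the new drug, 140/200 on reference
low, high = risk_difference_ci(130, 200, 140, 200)
print(f"95% CI for difference: ({low:.3f}, {high:.3f})")
print("interval contains 0:", low <= 0 <= high)
for margin in (0.07, 0.15):
    print(f"margin = {margin:.0%}  noninferiority shown: {is_noninferior(low, margin)}")
```

In this made-up example the interval includes 0, so a conventional test would find no significant difference between the groups, yet noninferiority is shown with a 15% window and not with a 7% window. The data are identical in both cases; only the predefined margin differs.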

Several subanalyses were also performed
looking at the effect of controllable factors such as treatment interval and drug
reflux on the results (Table 3). These subanalyses reduced the number of
subjects included in the analysis; however, they showed numeric differences, which
led Alcon to further investigate potentially controllable factors in the study.
Table 3 shows that the Retaane group had 57% of patients with preserved vision.
This analysis is valuable in showing an overall trend toward better efficacy after
excluding patients who experienced reflux during drug administration and/or were
retreated after 6 months.

After analyzing the results above,
Alcon initiated a small clinical study to specifically investigate the effect of
drug reflux and whether a specially designed counter-pressure device (CPD) could prevent
drug reflux. Results reported at the Macula Society in February indicated that the
CPD was effective in preventing drug reflux in 100% of patients and serum levels
of the drug confirmed that eliminating reflux correlated with a higher level of
drug absorption.

a Maintained vision is defined
as patients who lost fewer than 3 lines of visual acuity.

b This subgroup excludes patients
who experienced reflux during drug administration and patients treated after the
6-month dosing interval.

CROSSOVER VERSUS PARALLEL DESIGN

In addition to deciding whether
to perform a superiority or equivalence trial, the investigator must also decide
whether to have a parallel group design or a crossover design. In a parallel group
design, each subject is assigned to a single treatment group for the duration of
the study. At the end of the study, the experience of each treatment group is compared.
In contrast, with the crossover design, each subject receives both treatments albeit
at different times. They receive 1 treatment for a period of time and then they
switch to the other treatment. The major advantage of crossover trials is that each
subject can serve as their own control. However, if there is any carryover effect
of the first treatment that the patient receives, then one must use only the first
period results to compare treatments, which will generally provide less power than
a standard parallel design in which both a baseline and a follow-up period are used.

When interpreting the results of
a clinical trial it is important to consider not only the P-value, but also
the power, sample size, and study design of the trial.

Sarah Rosner, MPH, is an ScD candidate in epidemiology at the Harvard School of Public Health.
Bernard Rosner, PhD, is professor in the Department of Biostatistics, Channing Laboratory,
Harvard Medical School. Neither author has a financial interest in the information
contained in this article. Dr. Bernard Rosner can be reached by e-mail at stbar@channing.harvard.edu.