Abstract

Background

Physicians reading the medical literature attempt to determine whether research studies
are valid. However, articles with negative results may not provide sufficient information
to allow physicians to properly assess validity.

Methods

We analyzed all original research articles with negative results published in 1997
in the weekly journals BMJ, JAMA, Lancet, and New England Journal of Medicine as well
as those published in the 1997 and 1998 issues of the bimonthly Annals of Internal
Medicine (N = 234). Our primary objective was to quantify the proportion of studies
with negative results that comment on power and present confidence intervals. Secondary
objectives were to quantify the proportion of these studies with a specified effect
size and a defined primary outcome. Stratified analyses by study design were also
performed.

Results

Only 30% of the articles with negative results commented on power. The reporting of
power (range: 15%–52%) and confidence intervals (range: 55%–81%) varied significantly
among journals. Observational studies of etiology/risk factors addressed power less
frequently (15%; 95% CI, 8%–21%) than did clinical trials (56%; 95% CI, 46%–67%; p <
0.001). While 87% of articles with power calculations specified an effect size that
the authors sought to detect, only a minority gave a rationale for that effect size.
Only half of the studies with negative results clearly defined a primary outcome.

Conclusion

Prominent medical journals often provide insufficient information to assess the validity
of studies with negative results.

Background

Physicians are faced with the challenge of assessing whether the conclusions of research
studies are valid. Power, the probability that a study will detect an effect of a
specified size, is analogous to the sensitivity of a diagnostic test. [1] Just as a negative result does not rule out disease when the test applied has low
sensitivity, a negative study with inadequate power cannot disprove a research hypothesis.
Power/sample size calculations play an important role in study planning, give readers
an idea of the adequacy of the investigation, and help readers assess the validity
of studies with negative results. [2-4] Effect size (delta) is a critical component of power calculations. Investigators
choose from a wide range of possible deltas when calculating sample size. Clinicians
and investigators also often struggle to determine what effect size is reasonable
to expect.[2,5-8] Consequently, it is important for investigators to report the effect size they wish
to detect. However, this is often neglected.[8]
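The influence of delta on study planning can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. The sketch below is illustrative only: the effect sizes are hypothetical, and alpha and power are fixed at the conventional 0.05 and 80%; it is not the method of any study reviewed here.

```python
import math

def n_per_group(p1: float, p2: float) -> int:
    """Per-group sample size to detect proportions p1 vs p2 with a
    two-sided z-test at alpha = 0.05 and 80% power (normal approximation)."""
    z_alpha = 1.96  # z for two-sided alpha = 0.05
    z_beta = 0.84   # z for power = 0.80 (beta = 0.20)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    delta = abs(p1 - p2)  # the effect size the study seeks to detect
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Halving delta roughly quadruples the required sample size:
print(n_per_group(0.30, 0.20))  # delta = 0.10
print(n_per_group(0.30, 0.25))  # delta = 0.05
```

Shrinking the detectable difference from 10 to 5 percentage points roughly quadruples the per-group sample size, which is why the chosen delta, and the rationale behind it, matters to readers.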

Sample size calculations alone are insufficient for the interpretation of studies
with negative results; power and confidence intervals complement each other and should
both be reported.[6,9] Confidence intervals take into account the data actually collected, define the upper
and lower range consistent with a study's data, provide an estimate of precision,
and can give readers some indication of the clinical significance of the results.
[10-13]
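As a toy illustration of the information an interval conveys, the normal-approximation (Wald) interval for a single proportion can be computed from counts this paper itself reports (e.g., the 122 of 234 negative studies with a defined primary outcome). This is a sketch, not a claim about how any reviewed study computed its intervals.

```python
import math

def wald_ci_95(successes: int, n: int) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval
    for a binomial proportion."""
    p = successes / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = wald_ci_95(122, 234)  # studies with a defined primary outcome
print(f"52% (95% CI, {lo:.0%} to {hi:.0%})")
```

The width of the interval (here roughly 46% to 59%) is what tells a reader how precisely the proportion has been estimated.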

Our work adds to the literature in several ways. Previous authors have found that many
randomized controlled trials were underpowered, that is, they had an unacceptable risk
of missing an important effect due to inadequate sample size. [14-21] Because power calculations are often complicated,[21] many readers are unlikely to have the statistical sophistication necessary to perform
a power analysis. Therefore, we were interested in whether articles provided information
necessary for readers to assess the validity of studies with negative results. We
looked for evidence of power/sample size calculations and effect size. In addition,
unlike prior work, we examined studies for documentation of confidence intervals.[22] Finally, because the calculation of sample size is applicable to all comparative
studies, we did not limit our study to randomized controlled trials.[23]

Our primary objective was to quantify the proportion of studies with negative results
within prominent general medical journals[24] that comment on power and present confidence intervals. Secondary objectives were to
quantify the proportion of these studies with a specified delta and a defined primary
outcome.

Methods

All articles from the 1997 issues of the British Medical Journal (BMJ), Journal of
the American Medical Association (JAMA), Lancet, and the New England Journal of Medicine
(NEJM) were reviewed. Because the Annals of Internal Medicine (Annals) is published
bimonthly, all articles from 1997 and 1998 were reviewed so as to include a comparable
number of articles. One investigator (RSH) manually searched the journals and reviewed
all articles for eligibility. Review articles, meta-analyses, modeling studies, decision
and cost-effectiveness analyses, case reports, editorials, letters, and studies without
inferential statistics (i.e. descriptive studies) were excluded. Equivalence trials
(studies designed to show equivalent efficacy of treatments) were included because
power analysis, confidence intervals, and delta are particularly important to their
design. Methodological issues involved in the design and analysis of these studies
have been described elsewhere.[25,26]

Articles were classified as having negative results if 1) the primary outcome(s) was
not statistically significant (i.e. the article had an explicit statement that the
comparison between two groups did not reach statistical significance) or 2) in those
articles with no primary outcome(s), any of the first three outcomes were not statistically
significant. Other outcomes were not evaluated. A second author (TAE) reviewed the
full text of a simple random sample of 50 articles, and the kappa statistic was calculated
to assess the interobserver variability of our classification scheme.
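Kappa measures agreement beyond what chance alone would produce. The counts below are hypothetical (the agreement table for the 50-article sample is not reported), chosen only to show the calculation:

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa from a 2x2 agreement table:
    a = both reviewers say negative, d = both say not negative,
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement computed from each reviewer's marginal totals
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for a 50-article reliability sample
print(round(cohen_kappa(10, 2, 2, 36), 2))
```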

We examined articles to see if the authors named a primary outcome variable. We employed
a decision rule, modified from Moher and colleagues, to define the primary outcome
in those articles where none was specified.[19] If an article reported a sample size calculation, this was assumed to be the primary
outcome.[27] If calculations were not performed, a total of three outcomes, if present, were examined.
In those articles with multiple outcomes and none defined as primary, the three outcomes
evaluated were the first three listed in the abstract (or the results section if fewer than
three outcomes were listed in the abstract).

The full text of included articles was systematically reviewed. Data were abstracted
by a single author (RSH) and recorded in a standardized fashion. Information was recorded
on whether the article had a primary outcome(s), commented on power, sample size calculations,
and confidence intervals pertaining to the outcomes evaluated, a projected delta,
and a reason for this delta. A paper was given credit for addressing power if sample
size calculations or comments on power/sample size were present. Power, sample size
calculations, and confidence intervals could pertain to any one of the three outcomes
evaluated and were not required for all outcomes.

Comparisons were made across journals by Chi-square analysis. We also assessed articles
for comment on power and/or presentation of confidence intervals while stratifying
by study design (clinical trials, observational studies of etiology/risk factors,
screening/diagnosis, prognosis, and other). Responses were summarized as proportions
and 95% confidence intervals. All data were analyzed using STATA 6.0 (Stata Corp.,
College Station, TX).
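The across-journal comparison can be reconstructed from the counts reported in the Results. The sketch below performs a 2 × 5 chi-square test of homogeneity by hand; for df = 4 the chi-square survival function has the closed form used on the last line. This is a reconstruction from the published counts, not the authors' code:

```python
import math

# Negative / total articles per journal (counts from the Results)
journals = {"Annals": (41, 203), "BMJ": (57, 256), "JAMA": (44, 191),
            "Lancet": (46, 205), "NEJM": (46, 183)}

total_neg = sum(neg for neg, _ in journals.values())
total_n = sum(n for _, n in journals.values())
overall_rate = total_neg / total_n  # 234/1038, about 23%

stat = 0.0
for neg, n in journals.values():
    for observed, expected in ((neg, n * overall_rate),
                               (n - neg, n * (1 - overall_rate))):
        stat += (observed - expected) ** 2 / expected

# Chi-square survival function in closed form for df = 4
p_value = math.exp(-stat / 2) * (1 + stat / 2)
print(f"chi2 = {stat:.2f} (df = 4), p = {p_value:.2f}")
```

The resulting p-value is non-significant, consistent with the reported p = 0.857 (small differences reflect rounding of the published counts).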

Results

One thousand thirty-eight articles were eligible for analysis. Two hundred thirty-four
(23%) were classified as negative. There was good agreement between observers
in the classification of articles (kappa = 0.74). The percentage of negative articles per
journal was: Annals 20% (41/203), BMJ 22% (57/256), JAMA 23% (44/191), Lancet 22%
(46/205), and NEJM 25% (46/183) (p = 0.857).

Of the negative articles that included information about sample size, 87% (61/70) specified
a delta, i.e., the effect size that the authors sought to detect. A minority, 43% (26/61),
explained the rationale behind the delta chosen. Of these, 77% (20/26) cited references
or pilot studies to support their rationale.

Only 52% (122/234) of articles with negative results had a clearly defined primary
outcome(s).

Discussion

Many articles underreport power/sample size calculations and confidence intervals.
Significant variation exists among journals. Our work demonstrates that power was
reported more often in clinical trials than in observational studies of etiology/risk
factors. Investigators involved in randomized clinical trials may be more familiar
with the importance of power and sample size calculation.[28] Also, investigators conducting observational studies often do not have the ability
to determine sample size prior to beginning their work. Most articles with sample
size calculations reported a projected effect size, but only a minority shared the
rationale behind this delta, and even fewer provided empiric evidence to support that
rationale.

While this manuscript describes an analysis of a large body of studies with negative
results, several limitations must be considered. First, although most negative studies
did not list power/sample size calculation, we cannot be certain this had not been
performed a priori. It is also possible that, for the sake of brevity, authors and/or
editors omitted power/sample size calculations from the final text when preparing
manuscripts for submission. While it is possible that these calculations were performed but
simply not reported, this may not be the case.[29] Second, our definition of a negative study may seem unduly broad. We examined three
outcomes in order to classify articles because articles frequently report several
outcomes, often with none defined as primary. [30-33] Previous authors who limited their work to randomized controlled trials and encountered
multiple outcomes have defined the primary outcome as "the most clinically important"[19] or the outcome that was the "primary focus of the article".[20] These outcomes are often not possible to discern in observational studies. Nonetheless,
our results may represent a best-case scenario given the publication bias against
articles with negative results and the fact that we examined the more prominent general
medical journals.[34]

Conclusions

In summary, this study demonstrates that prominent medical journals often provide
insufficient information to assess the validity of studies with negative results.
Authors and journal editors need to include this information so readers can be informed
consumers of the medical literature.

Competing interests

1. The research was not supported by external funds.

2. None of the authors has any competing interests, including financial interests (stocks,
honoraria, speaker's fees) or competing academic, religious, moral, or personal
interests.

3. We have no financial interest in the material contained in the manuscript.

4. The manuscript is neither under review by another publisher nor previously published.

5. All authors have participated in the design, analysis, and writing of the accompanying
manuscript.

6. All authors have approved the final manuscript and have taken care to ensure the
integrity of the work.

Authors' contributions

1. Randy S Hebert MD MPH

• Conception and design

• Acquisition of data

• Analysis and interpretation of data

• Drafted and revised the article

• Gave final approval of the version for publication

2. Scott M Wright MD

• Analysis and interpretation of data

• Revised the article for important intellectual content

• Gave final approval of the version for publication

3. Robert S Dittus MD MPH

• Analysis and interpretation of data

• Revised the article for important intellectual content

• Gave final approval of the version for publication

4. Tom A Elasy MD MPH

• Conception and design

• Analysis and interpretation of data

• Drafted and revised the article

• Gave final approval of the version for publication

References

1. Browner WS, Newman TB: Are all significant P values created equal? The analogy between diagnostic tests and clinical research.