Biostatistical Methods in Epidemiology

An introduction to classical biostatistical methods in epidemiology

Biostatistical Methods in Epidemiology provides an introduction to a wide range of methods used to analyze epidemiologic data, with a focus on nonregression techniques. The text includes an extensive discussion of measurement issues in epidemiology, especially confounding. Maximum likelihood, Mantel–Haenszel, and weighted least squares methods are presented for the analysis of closed cohort and case-control data. Kaplan–Meier and Poisson methods are described for the analysis of censored survival data. A justification for using odds ratio methods in case-control studies is provided. Standardization of rates is discussed, and the construction of ordinary, multiple decrement, and cause-deleted life tables is outlined. Sample size formulas are given for a range of epidemiologic study designs. The text ends with a brief overview of logistic and Cox regression. Other highlights include:

* Many worked examples based on actual data
* Discussion of exact methods
* Recommendations for preferred methods
* Extensive appendices and references

Biostatistical Methods in Epidemiology provides an excellent introduction to the subject for students, while also serving as a comprehensive reference for epidemiologists and other health professionals.


Biostatistical Methods in Epidemiology
STEPHEN C. NEWMAN
A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York • Chichester • Weinheim • Brisbane • Singapore • Toronto
This book is printed on acid-free paper. ∞
Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as
permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978)
750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department,
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212)
850-6008. E-Mail: PERMREQ@WILEY.COM.
For ordering and customer service, call 1-800-CALL-WILEY.
Library of Congress Cataloging-in-Publication Data:
Newman, Stephen C., 1952–
Biostatistical methods in epidemiology / Stephen C. Newman.
p. cm.—(Wiley series in probability and statistics. Biostatistics section)
Includes bibliographical references and index.
ISBN 0-471-36914-4 (cloth : alk. paper)
1. Epidemiology—Statistical methods. 2. Cohort analysis. I. Title. II. Series.
RA652.2.M3 N49 2001
614.4′07′27—dc21
2001028222
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Sandra
Contents
1. Introduction, 1
1.1 Probability, 1
1.2 Parameter Estimation, 21
1.3 Random Sampling, 27
2. Measurement Issues in Epidemiology, 31
2.1 Systematic and Random Error, 31
2.2 Measures of Effect, 33
2.3 Confounding, 40
2.4 Collapsibility Approach to Confounding, 46
2.5 Counterfactual Approach to Confounding, 55
2.6 Methods to Control Confounding, 67
2.7 Bias Due to an Unknown Confounder, 69
2.8 Misclassification, 72
2.9 Scope of this Book, 75
3. Binomial Methods for Single Sample Closed Cohort Data, 77
3.1 Exact Methods, 77
3.2 Asymptotic Methods, 82
4. Odds Ratio Methods for Unstratified Closed Cohort Data, 89
4.1 Asymptotic Unconditional Methods for a Single 2 × 2 Table, 90
4.2 Exact Conditional Methods for a Single 2 × 2 Table, 101
4.3 Asymptotic Conditional Methods for a Single 2 × 2 Table, 106
4.4 Cornfield’s Approximation, 109
4.5 Summary of Examples and Recommendations, 112
4.6 Asymptotic Methods for a Single 2 × I Table, 112
5. Odds Ratio Methods for Stratified Closed Cohort Data, 119
5.1 Asymptotic Unconditional Methods for J (2 × 2) Tables, 119
5.2 Asymptotic Conditional Methods for J (2 × 2) Tables, 129
5.3 Mantel–Haenszel Estimate of the Odds Ratio, 132
5.4 Weighted Least Squares Methods for J (2 × 2) Tables, 134
5.5 Interpretation Under Heterogeneity, 136
5.6 Summary of 2 × 2 Examples and Recommendations, 137
5.7 Asymptotic Methods for J (2 × I) Tables, 138
6. Risk Ratio Methods for Closed Cohort Data, 143
6.1 Asymptotic Unconditional Methods for a Single 2 × 2 Table, 143
6.2 Asymptotic Unconditional Methods for J (2 × 2) Tables, 145
6.3 Mantel–Haenszel Estimate of the Risk Ratio, 148
6.4 Weighted Least Squares Methods for J (2 × 2) Tables, 149
6.5 Summary of Examples and Recommendations, 150
7. Risk Difference Methods for Closed Cohort Data, 151
7.1 Asymptotic Unconditional Methods for a Single 2 × 2 Table, 151
7.2 Asymptotic Unconditional Methods for J (2 × 2) Tables, 152
7.3 Mantel–Haenszel Estimate of the Risk Difference, 155
7.4 Weighted Least Squares Methods for J (2 × 2) Tables, 157
7.5 Summary of Examples and Recommendations, 157
8. Survival Analysis, 159
8.1 Open Cohort Studies and Censoring, 159
8.2 Survival Functions and Hazard Functions, 163
8.3 Hazard Ratio, 166
8.4 Competing Risks, 167
9. Kaplan–Meier and Actuarial Methods for Censored Survival Data, 171
9.1 Kaplan–Meier Survival Curve, 171
9.2 Odds Ratio Methods for Censored Survival Data, 178
9.3 Actuarial Method, 189
10. Poisson Methods for Censored Survival Data, 193
10.1 Poisson Methods for Single Sample Survival Data, 193
10.2 Poisson Methods for Unstratified Survival Data, 206
10.3 Poisson Methods for Stratified Survival Data, 218
11. Odds Ratio Methods for Case-Control Data, 229
11.1 Justification of the Odds Ratio Approach, 229
11.2 Odds Ratio Methods for Matched-Pairs Case-Control Data, 236
11.3 Odds Ratio Methods for (1 : M) Matched Case-Control Data, 244
12. Standardized Rates and Age–Period–Cohort Analysis, 249
12.1 Population Rates, 249
12.2 Directly Standardized Death Rate, 251
12.3 Standardized Mortality Ratio, 255
12.4 Age–Period–Cohort Analysis, 258
13. Life Tables, 263
13.1 Ordinary Life Table, 264
13.2 Multiple Decrement Life Table, 270
13.3 Cause-Deleted Life Table, 274
13.4 Analysis of Morbidity Using Life Tables, 276
14. Sample Size and Power, 281
14.1 Sample Size for a Prevalence Study, 281
14.2 Sample Size for a Closed Cohort Study, 283
14.3 Sample Size for an Open Cohort Study, 285
14.4 Sample Size for an Incidence Case-Control Study, 287
14.5 Controlling for Confounding, 291
14.6 Power, 292
15. Logistic Regression and Cox Regression, 295
15.1 Logistic Regression, 296
15.2 Cox Regression, 305
Appendix A Odds Ratio Inequality, 307
Appendix B Maximum Likelihood Theory, 311
B.1 Unconditional Maximum Likelihood, 311
B.2 Binomial Distribution, 313
B.3 Poisson Distribution, 320
B.4 Matrix Inversion, 323
Appendix C Hypergeometric and Conditional Poisson Distributions, 325
C.1 Hypergeometric, 325
C.2 Conditional Poisson, 326
C.3 Hypergeometric Variance Estimate, 327
C.4 Conditional Poisson Variance Estimate, 328
Appendix D Quadratic Equation for the Odds Ratio, 329
Appendix E Matrix Identities and Inequalities, 331
E.1 Identities and Inequalities for J (1 × I ) and J (2 × I ) Tables, 331
E.2 Identities and Inequalities for a Single Table, 336
E.3 Hypergeometric Distribution, 336
E.4 Conditional Poisson Distribution, 337
Appendix F Survival Analysis and Life Tables, 339
F.1 Single Cohort, 339
F.2 Comparison of Cohorts, 340
F.3 Life Tables, 341
Appendix G Confounding in Open Cohort and Case-Control Studies, 343
G.1 Open Cohort Studies, 343
G.2 Case-Control Studies, 350
Appendix H Odds Ratio Estimate in a Matched Case-Control Study, 353
H.1 Asymptotic Unconditional Estimate of Matched-Pairs Odds Ratio, 353
H.2 Asymptotic Conditional Analysis of (1 : M) Matched Case-Control Data, 354
References, 359
Index, 377
Preface
The aim of this book is to provide an overview of statistical methods that are important in the analysis of epidemiologic data, the emphasis being on nonregression
techniques. The book is intended as a classroom text for students enrolled in an epidemiology or biostatistics program, and as a reference for established researchers.
The choice and organization of material is based on my experience teaching biostatistics to epidemiology graduate students at the University of Alberta. In that setting I emphasize the importance of exploring data using nonregression methods prior
to undertaking a more elaborate regression analysis. It is my conviction that most of
what there is to learn from epidemiologic data can usually be uncovered using nonregression techniques.
I assume that readers have a background in introductory statistics, at least to the
stage of simple linear regression. Except for the Appendices, the level of mathematics used in the book is restricted to basic algebra, although admittedly some of the
formulas are rather complicated expressions. The concept of confounding, which is
central to epidemiology, is discussed at length early in the book. To the extent permitted by the scope of the book, derivations of formulas are provided and relationships
among statistical methods are identified. In particular, the correspondence between
odds ratio methods based on the binomial model and hazard ratio methods based
on the Poisson model is emphasized (Breslow and Day, 1980, 1987). Historically,
odds ratio methods were developed primarily for the analysis of case-control data.
Students often find the case-control design unintuitive, and this can adversely affect
their understanding of the odds ratio methods. Here, I adopt the somewhat unconventional approach of introducing odds ratio methods in the setting of closed cohort
studies. Later in the book, it is shown how these same techniques can be adapted
to the case-control design, as well as to the analysis of censored survival data. One
of the attractive features of statistics is that different theoretical approaches often
lead to nearly identical numerical results. I have attempted to demonstrate this phenomenon empirically by analyzing the same data sets using a variety of statistical
techniques.
I wish to express my indebtedness to Allan Donner, Sander Greenland, John Hsieh,
David Streiner, and Stephen Walter, who generously provided comments on a draft
manuscript. I am especially grateful to Sander Greenland for his advice on the topic
of confounding, and to John Hsieh who introduced me to life table theory when I was
a student. The reviewers did not have the opportunity to read the final manuscript
and so I alone am responsible for whatever shortcomings there may be in the book.
I also wish to acknowledge the professionalism and commitment demonstrated by
Steve Quigley and Lisa Van Horn of John Wiley & Sons. I am most interested in
receiving your comments, which can be sent by e-mail using a link at the website
www.stephennewman.com.
Prior to entering medicine and then epidemiology, I was deeply interested in a
particularly elegant branch of theoretical mathematics called Galois theory. While
studying the historical roots of the topic, I encountered a monograph having a preface
that begins with the sentence “I wrote this book for myself.” (Hadlock, 1978). After
this remarkable admission, the author goes on to explain that he wanted to construct
his own path through Galois theory, approaching the subject as an enquirer rather
than an expert. Not being formally trained as a mathematical statistician, I embarked
upon the writing of this book with a similar sense of discovery. The learning process
was sometimes arduous, but it was always deeply rewarding. Even though I wrote
this book partly “for myself,” it is my hope that others will find it useful.
Stephen C. Newman
Edmonton, Alberta, Canada
May 2001
CHAPTER 1
Introduction
In this chapter some background material from the theory of probability and statistics is presented that will be useful throughout the book. Such fundamental concepts
as probability function, random variable, mean, and variance are defined, and several of the distributions that are important in the analysis of epidemiologic data are
described. The Central Limit Theorem and normal approximations are discussed,
and the maximum likelihood and weighted least squares methods of parameter estimation are outlined. The chapter concludes with a discussion of different types of
random sampling. The presentation of material in this chapter is informal, the aim
being to give an overview of some key ideas rather than provide a rigorous mathematical treatment. Readers interested in more complete expositions of the theoretical
aspects of probability and statistics are referred to Cox and Hinkley (1974), Silvey
(1975), Casella and Berger (1990), and Hogg and Craig (1994). References for the
theory of probability and statistics in a health-related context are Armitage and Berry
(1994), Rosner (1995), and Lachin (2000). For the theory of sampling, the reader is
referred to Kish (1965) and Cochran (1977).
1.1 PROBABILITY
1.1.1 Probability Functions and Random Variables
Probability theory is concerned with mathematical models that describe phenomena
having an element of uncertainty. Problems amenable to the methods of probability theory range from the elementary, such as the chance of randomly selecting an
ace from a well-shuffled deck of cards, to the exceedingly complex, such as predicting the weather. Epidemiologic studies typically involve the collection, analysis,
and interpretation of health-related data where uncertainty plays a role. For example,
consider a survey in which blood sugar is measured in a random sample of the population. The aims of the survey might be to estimate the average blood sugar in the
population and to estimate the proportion of the population with diabetes (elevated
blood sugar). Uncertainty arises because there is no guarantee that the resulting esti-
mates will equal the true population values (unless the entire population is enrolled
in the survey).
Associated with each probability model is a random variable, which we denote by
a capital letter such as X . We can think of X as representing a potential data point for
a proposed study. Once the study has been conducted, we have actual data points that
will be referred to as realizations (outcomes) of X . An arbitrary realization of X will
be denoted by a small letter such as x. In what follows we assume that realizations
are in the form of numbers so that, in the above survey, diabetes status would have
to be coded numerically—for example, 1 for present and 0 for absent. The set of all
possible realizations of X will be referred to as the sample space of X . For blood
sugar the sample space is the set of all nonnegative numbers, and for diabetes status
(with the above coding scheme) the sample space is {0, 1}. In this book we assume
that all sample spaces are either continuous, as in the case of blood sugar, or discrete,
as in the case of diabetes status. We say that X is continuous or discrete in accordance
with the sample space of the probability model.
There are several mathematically equivalent ways of characterizing a probability model. In the discrete case, interest is mainly in the probability mass function,
denoted by P(X = x), whereas in the continuous case the focus is usually on the
probability density function, denoted by f (x). There are important differences between the probability mass function and the probability density function, but for
present purposes it is sufficient to view them simply as formulas that can be used to
calculate probabilities. In order to simplify the exposition we use the term probability
function to refer to both these constructs, allowing the context to make the distinction clear. Examples of probability functions are given in Section 1.1.2. The notation
P(X = x) has the potential to be confusing because both X and x are “variables.”
We read P(X = x) as the probability that the discrete random variable X has the
realization x. For simplicity it is often convenient to ignore the distinction between
X and x. In particular, we will frequently use x in formulas where, strictly speaking,
X should be used instead.
The correspondence between a random variable and its associated probability
function is an important concept in probability theory, but it needs to be emphasized that it is the probability function which is the more fundamental notion. In a
sense, the random variable represents little more than a convenient notation for referring to the probability function. However, random variable notation is extremely
powerful, making it possible to express in a succinct manner probability statements
that would be cumbersome otherwise. A further advantage is that it may be possible to specify a random variable of interest even when the corresponding probability
function is too difficult to describe explicitly. In what follows we will use several
expressions synonymously when describing random variables. For example, when
referring to the random variable associated with a binomial probability function we
will variously say that the random variable “has a binomial distribution,” “is binomially distributed,” or simply “is binomial.”
We now outline a few of the key definitions and results from introductory probability theory. For simplicity we focus on discrete random variables, keeping in mind
that equivalent statements can be made for the continuous case. One of the defining
properties of a probability function is the identity
$$\sum_x P(X = x) = 1 \tag{1.1}$$
where here, and in what follows, the summation is over all elements in the sample
space of X . Next we define two fundamental quantities that will be referred to repeatedly throughout the book. The mean of X , sometimes called the expected value
of X , is defined to be
$$E(X) = \sum_x x\,P(X = x) \tag{1.2}$$
and the variance of X is defined to be
$$\operatorname{var}(X) = \sum_x [x - E(X)]^2\,P(X = x). \tag{1.3}$$
It is important to note that when the mean and variance exist, they are constants,
not random variables. In most applications the mean and variance are unknown and
must be estimated from study data. In what follows, whenever we refer to the mean
or variance of a random variable it is being assumed that these quantities exist—that
is, are finite constants.
Example 1.1 Consider the probability function given in Table 1.1. Evidently
(1.1) is satisfied. The sample space of X is {0, 1, 2}, and the mean and variance of X
are
E(X ) = (0 × .20) + (1 × .50) + (2 × .30) = 1.1
and
var(X) = [(0 − 1.1)² × .20] + [(1 − 1.1)² × .50] + [(2 − 1.1)² × .30] = .49.
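As a small illustration of definitions (1.2) and (1.3), the following Python sketch (standard library only; the function names are mine, not from the text) recovers the mean and variance computed in Example 1.1.

```python
# Minimal sketch of (1.2) and (1.3): mean and variance of a discrete
# probability function, using the distribution of Table 1.1.

def mean(pf):
    # E(X) = sum over x of x * P(X = x)
    return sum(x * p for x, p in pf.items())

def variance(pf):
    # var(X) = sum over x of [x - E(X)]^2 * P(X = x)
    m = mean(pf)
    return sum((x - m) ** 2 * p for x, p in pf.items())

pf_x = {0: .20, 1: .50, 2: .30}                # Table 1.1
assert abs(sum(pf_x.values()) - 1) < 1e-12     # identity (1.1)
print(mean(pf_x), variance(pf_x))              # 1.1 0.49
```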
Transformations can be used to derive new random variables from an existing
random variable. Again we emphasize that what is meant by such a statement is that
we can derive new probability functions from an existing probability function. When
the probability function at hand has a known formula it is possible, in theory, to write
down an explicit formula for the transformed probability function. In practice, this
TABLE 1.1 Probability Function of X
x           0     1     2
P(X = x)   .20   .50   .30
TABLE 1.2 Probability Function of Y
y           5     7     9
P(Y = y)   .20   .50   .30
may lead to a very complicated expression, which is one of the reasons for relying
on random variable notation.
Example 1.2 With X as in Example 1.1, consider the random variable Y =
2X + 5. The sample space of Y is obtained by applying the transformation to the
sample space of X, which gives {5, 7, 9}. The values of P(Y = y) are derived as
follows: P(Y = 7) = P(2X + 5 = 7) = P(X = 1) = .50. The probability function
of Y is given in Table 1.2.
The mean and variance of Y are
E(Y ) = (5 × .20) + (7 × .50) + (9 × .30) = 7.2
and
var(Y) = [(5 − 7.2)² × .20] + [(7 − 7.2)² × .50] + [(9 − 7.2)² × .30] = 1.96.
Comparing Examples 1.1 and 1.2 we note that X and Y have the same probability
values but different sample spaces.
Consider a random variable which has as its only outcome the constant β, that
is, the sample space is {β}. It is immediate from (1.2) and (1.3) that the mean and
variance of the random variable are β and 0, respectively. Identifying the random
variable with the constant β, and allowing a slight abuse of notation, we can write
E(β) = β and var(β) = 0. Let X be a random variable, let α and β be arbitrary
constants, and consider the random variable α X + β. Using (1.2) and (1.3) it can be
shown that
$$E(\alpha X + \beta) = \alpha E(X) + \beta \tag{1.4}$$
and
$$\operatorname{var}(\alpha X + \beta) = \alpha^2 \operatorname{var}(X). \tag{1.5}$$
Applying these results to Examples 1.1 and 1.2 we find, as before, that E(Y ) =
2(1.1) + 5 = 7.2 and var(Y ) = 4(.49) = 1.96.
Example 1.3 Let X be an arbitrary random variable with mean µ and variance
σ², where σ > 0, and consider the random variable (X − µ)/σ. With α = 1/σ and
β = −µ/σ in (1.4) and (1.5), it follows that

$$E\left(\frac{X - \mu}{\sigma}\right) = 0$$

and

$$\operatorname{var}\left(\frac{X - \mu}{\sigma}\right) = 1.$$
In many applications it is necessary to consider several related random variables.
For example, in a health survey we might be interested in age, weight, and blood
pressure. A probability function characterizing two or more random variables simultaneously is referred to as their joint probability function. For simplicity we discuss
the case of two discrete random variables, X and Y . The joint probability function of
the pair of random variables (X, Y ) is denoted by P(X = x, Y = y). For the present
discussion we assume that the sample space of the joint probability function is the
set of pairs {(x, y)}, where x is in the sample space of X and y is in the sample space
of Y . Analogous to (1.1), the identity
$$\sum_x \sum_y P(X = x, Y = y) = 1 \tag{1.6}$$
must be satisfied. In the joint distribution of X and Y , the two random variables are
considered as a unit. In order to isolate the distribution of X , we “sum over” Y to
obtain what is referred to as the marginal probability function of X ,
$$P(X = x) = \sum_y P(X = x, Y = y).$$

Similarly, the marginal probability function of Y is

$$P(Y = y) = \sum_x P(X = x, Y = y).$$
From a joint probability function we are able to obtain marginal probability functions, but the process does not necessarily work in reverse. We say that X and Y are
independent random variables if P(X = x, Y = y) = P(X = x) P(Y = y), that is,
if the joint probability function is the product of the marginal probability functions.
Other than the case of independence, it is not generally possible to reconstruct a joint
probability function in this way.
Example 1.4 Table 1.3 is an example of a joint probability function and its associated marginal probability functions. For example, P(X = 1, Y = 3) = .30. The
marginal probability function of X is obtained by summing over Y , for example,
P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) + P(X = 1, Y = 3) = .50.
TABLE 1.3 Joint Probability Function of X and Y

                      y
x           1      2      3     P(X = x)
0          .02    .06    .12      .20
1          .05    .15    .30      .50
2          .03    .09    .18      .30
P(Y = y)   .10    .30    .60       1
It is readily verified that X and Y are independent, for example, P(X = 1, Y = 2) =
.15 = P(X = 1) P(Y = 2).
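The independence check in Example 1.4 is mechanical enough to automate. Below is a minimal Python sketch (the helper names are mine, not from the text) that computes the marginals of Table 1.3 and verifies that every cell factors into the product of its marginals.

```python
# Marginals and independence check for the joint probability function
# of Table 1.3.

joint = {(0, 1): .02, (0, 2): .06, (0, 3): .12,
         (1, 1): .05, (1, 2): .15, (1, 3): .30,
         (2, 1): .03, (2, 2): .09, (2, 3): .18}

# Marginal of X: sum over y; marginal of Y: sum over x.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

independent = all(abs(p - px[x] * py[y]) < 1e-12
                  for (x, y), p in joint.items())
print(px, py, independent)   # marginals of Tables 1.1-style, and True
```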
Now consider Table 1.4, where the marginal probability functions of X and Y are
the same as in Table 1.3 but where, as is easily verified, X and Y are not independent.
TABLE 1.4 Joint Probability Function of X and Y

                      y
x           1      2      3     P(X = x)
0          .01    .05    .14      .20
1          .06    .18    .26      .50
2          .03    .07    .20      .30
P(Y = y)   .10    .30    .60       1

We now present generalizations of (1.4) and (1.5). Let $X_1, X_2, \ldots, X_n$ be arbitrary random variables, let $\alpha_1, \alpha_2, \ldots, \alpha_n, \beta$ be arbitrary constants, and consider the random variable $\sum_{i=1}^n \alpha_i X_i + \beta$. It can be shown that

$$E\left(\sum_{i=1}^n \alpha_i X_i + \beta\right) = \sum_{i=1}^n \alpha_i E(X_i) + \beta \tag{1.7}$$

and, if the $X_i$ are independent, that

$$\operatorname{var}\left(\sum_{i=1}^n \alpha_i X_i + \beta\right) = \sum_{i=1}^n \alpha_i^2 \operatorname{var}(X_i). \tag{1.8}$$

In the case of two independent random variables $X_1$ and $X_2$,

$$E(X_1 + X_2) = E(X_1) + E(X_2)$$
$$E(X_1 - X_2) = E(X_1) - E(X_2)$$
and

$$\operatorname{var}(X_1 + X_2) = \operatorname{var}(X_1 - X_2) = \operatorname{var}(X_1) + \operatorname{var}(X_2). \tag{1.9}$$
If X 1 , X 2 , . . . , X n are independent and all have the same distribution, we say the
X i are a sample from that distribution and that the sample size is n. Unless stated otherwise, it will be assumed that all samples are simple random samples (Section 1.3).
With the distribution left unspecified, denote the mean and variance of $X_i$ by $\mu$ and $\sigma^2$, respectively. The sample mean is defined to be

$$\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i.$$

Setting $\alpha_i = 1/n$ and $\beta = 0$ in (1.7) and (1.8), we have

$$E(\overline{X}) = \mu \tag{1.10}$$

and

$$\operatorname{var}(\overline{X}) = \frac{\sigma^2}{n}. \tag{1.11}$$
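Results (1.10) and (1.11) are easy to see empirically. The sketch below (Python standard library only; the distribution of Table 1.1, with µ = 1.1 and σ² = .49, is used for concreteness) draws repeated samples of size n = 25 and compares the empirical variance of the sample means with σ²/n.

```python
import random

# Empirical check of (1.10) and (1.11) for the distribution of Table 1.1.
random.seed(1)
values, probs = [0, 1, 2], [.20, .50, .30]
n, reps = 25, 20000

means = []
for _ in range(reps):
    sample = random.choices(values, weights=probs, k=n)
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
var_of_mean = sum((m - grand_mean) ** 2 for m in means) / reps
print(grand_mean, var_of_mean, .49 / n)   # ~1.1, ~.0196, .0196
```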
1.1.2 Some Probability Functions
We now consider some of the key probability functions that will be of importance in
this book.
Normal (Gaussian)
For reasons that will become clear after we have discussed the Central Limit Theorem, the most important distribution is undoubtedly the normal distribution. The
normal probability function is
$$f(z|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[\frac{-(z - \mu)^2}{2\sigma^2}\right]$$
where the sample space is all numbers and exp stands for exponentiation to the
base e. We denote the corresponding normal random variable by Z . A normal distribution is completely characterized by the parameters µ and σ > 0. It can be shown
that the mean and variance of $Z$ are $\mu$ and $\sigma^2$, respectively.
When $\mu = 0$ and $\sigma = 1$ we say that $Z$ has the standard normal distribution. For $0 < \gamma < 1$, let $z_\gamma$ denote that point which cuts off the upper $\gamma$-tail probability of the standard normal distribution; that is, $P(Z \ge z_\gamma) = \gamma$. For example, $z_{.025} = 1.96$. In some statistics books the notation $z_\gamma$ is used to denote the lower $\gamma$-tail. An important property of the normal distribution is that, for arbitrary constants $\alpha$ and $\beta > 0$, $(Z - \alpha)/\beta$ is also normally distributed. In particular this is true for $(Z - \mu)/\sigma$ which, in view of Example 1.3, is therefore standard normal. This explains why statistics
books only need to provide values of $z_\gamma$ for the standard normal distribution rather than a series of tables for different values of $\mu$ and $\sigma$.
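In practice $z_\gamma$ can also be computed directly. A minimal sketch, assuming only Python's standard library: the standard normal distribution function can be written in terms of the error function as $\Phi(z) = [1 + \operatorname{erf}(z/\sqrt{2})]/2$, so the upper-tail probability at 1.96 should come out near .025.

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Upper gamma-tail probability P(Z >= z) = 1 - phi(z).
print(1 - phi(1.96))   # approximately .025, i.e. z_.025 = 1.96
```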
Another important property of the normal distribution is that it is additive. Let $Z_1, Z_2, \ldots, Z_n$ be independent normal random variables and suppose that $Z_i$ has mean $\mu_i$ and variance $\sigma_i^2$ $(i = 1, 2, \ldots, n)$. Then the random variable $\sum_{i=1}^n Z_i$ is also normally distributed and, from (1.7) and (1.8), it has mean $\sum_{i=1}^n \mu_i$ and variance $\sum_{i=1}^n \sigma_i^2$.
Chi-Square
The formula for the chi-square probability function is complicated and will not be
presented here. The sample space of the distribution is all nonnegative numbers.
A chi-square distribution is characterized completely by a single positive integer r ,
which is referred to as the degrees of freedom. For brevity we write $\chi^2_{(r)}$ to indicate
that a random variable has a chi-square distribution with r degrees of freedom. The
mean and variance of the chi-square distribution with r degrees of freedom are r and
2r , respectively.
The importance of the chi-square distribution stems from its connection with the normal distribution. Specifically, if $Z$ is standard normal, then $Z^2$, the transformation of $Z$ obtained by squaring, is $\chi^2_{(1)}$. More generally, if $Z$ is normal with mean $\mu$ and variance $\sigma^2$ then, as remarked above, $(Z - \mu)/\sigma$ is standard normal and so $[(Z - \mu)/\sigma]^2 = (Z - \mu)^2/\sigma^2$ is $\chi^2_{(1)}$. In practice, most chi-square distributions with 1 degree of freedom originate as the square of a standard normal distribution. This explains why the usual notation for a chi-square random variable is $X^2$, or sometimes $\chi^2$.
Like the normal distribution, the chi-square distribution has an additive property. Let $X_1^2, X_2^2, \ldots, X_n^2$ be independent chi-square random variables and suppose that $X_i^2$ has $r_i$ degrees of freedom $(i = 1, 2, \ldots, n)$. Then $\sum_{i=1}^n X_i^2$ is chi-square with $\sum_{i=1}^n r_i$ degrees of freedom. As a special case of this result, let $Z_1, Z_2, \ldots, Z_n$ be independent normal random variables, where $Z_i$ has mean $\mu_i$ and variance $\sigma_i^2$ $(i = 1, 2, \ldots, n)$. Then $(Z_i - \mu_i)^2/\sigma_i^2$ is $\chi^2_{(1)}$ for all $i$, and so

$$X^2 = \sum_{i=1}^n \frac{(Z_i - \mu_i)^2}{\sigma_i^2} \tag{1.12}$$

is $\chi^2_{(n)}$.
Binomial
The binomial probability function is
$$P(A = a|\pi) = \binom{r}{a} \pi^a (1 - \pi)^{r-a}$$

where the sample space is the (finite) set of integers $\{0, 1, 2, \ldots, r\}$. A binomial distribution is completely characterized by the parameters $\pi$ and $r$ which, for convenience, we usually write as $(\pi, r)$. Recall that, for $0 \le a \le r$, the binomial coefficient is defined to be

$$\binom{r}{a} = \frac{r!}{a!\,(r - a)!}$$

where $r! = r(r-1) \cdots 2 \cdot 1$. We adopt the usual convention that $0! = 1$. The binomial coefficient $\binom{r}{a}$ equals the number of ways of choosing $a$ items out of $r$ without regard to order of selection. For example, the number of possible bridge hands is $\binom{52}{13} = 6.35 \times 10^{11}$. It can be shown that

$$\sum_{a=0}^r \binom{r}{a} \pi^a (1 - \pi)^{r-a} = [\pi + (1 - \pi)]^r = 1$$
and so (1.1) is satisfied. The mean and variance of A are πr and π(1 − π)r , respectively; that is,
$$E(A) = \sum_{a=0}^r a \binom{r}{a} \pi^a (1 - \pi)^{r-a} = \pi r$$

and

$$\operatorname{var}(A) = \sum_{a=0}^r (a - \pi r)^2 \binom{r}{a} \pi^a (1 - \pi)^{r-a} = \pi(1 - \pi) r.$$
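For readers who want to verify these identities numerically, here is a minimal Python sketch (requires Python 3.8+ for math.comb) that evaluates the binomial probability function with parameters (.3, 10) and checks (1.1) together with the mean πr and variance π(1 − π)r.

```python
from math import comb

def binom_pmf(a, pi, r):
    # P(A = a | pi) = C(r, a) * pi^a * (1 - pi)^(r - a)
    return comb(r, a) * pi**a * (1 - pi)**(r - a)

pi, r = .3, 10
pmf = [binom_pmf(a, pi, r) for a in range(r + 1)]
mean = sum(a * p for a, p in enumerate(pmf))
var = sum((a - mean) ** 2 * p for a, p in enumerate(pmf))
print(sum(pmf), mean, var)   # 1.0, 3.0, 2.1  (pi*r and pi*(1-pi)*r)
```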
Like the normal and chi-square distributions, the binomial distribution is additive.
Let $A_1, A_2, \ldots, A_n$ be independent binomial random variables and suppose that $A_i$ has parameters $\pi_i = \pi$ and $r_i$ $(i = 1, 2, \ldots, n)$. Then $\sum_{i=1}^n A_i$ is binomial with parameters $\pi$ and $\sum_{i=1}^n r_i$. A similar result does not hold when the $\pi_i$ are not all equal.
The binomial distribution is important in epidemiology because many epidemiologic studies are concerned with counted (discrete) outcomes. For instance, the binomial distribution can be used to analyze data from a study in which a group of r
individuals is followed over a defined period of time and the number of outcomes of
interest, denoted by a, is counted. In this context the outcome of interest could be,
for example, recovery from an illness, survival to the end of follow-up, or death from
some cause. For the binomial distribution to be applicable, two conditions need to
be satisfied: The probability of an outcome must be the same for each subject, and
subjects must behave independently; that is, the outcome for each subject must be
unrelated to the outcome for any other subject. In an epidemiologic study the first
condition is unlikely to be satisfied across the entire group of subjects. In this case,
one strategy is to form subgroups of subjects having similar characteristics so that,
to a greater or lesser extent, there is uniformity of risk within each subgroup. Then
the binomial distribution can be applied to each subgroup separately. As an example
where the second condition would not be satisfied, consider a study of influenza in a
classroom of students. Since influenza is contagious, the risk of illness in one student
is not independent of the risk in others. In studies of noninfectious diseases, such as
cancer, stroke, and so on, the independence assumption is usually satisfied.
Poisson
The Poisson probability function is
$$P(D = d|\nu) = \frac{e^{-\nu} \nu^d}{d!} \tag{1.13}$$
where the sample space is the (infinite) set of nonnegative integers {0, 1, 2, . . .}. A
Poisson distribution is completely characterized by the parameter ν, which is equal
to both the mean and variance of the distribution, that is,
$$E(D) = \sum_{d=0}^\infty d\,\frac{e^{-\nu} \nu^d}{d!} = \nu$$

and

$$\operatorname{var}(D) = \sum_{d=0}^\infty (d - \nu)^2\,\frac{e^{-\nu} \nu^d}{d!} = \nu.$$
Similar to the other distributions considered above, the Poisson distribution has
an additive property. Let $D_1, D_2, \ldots, D_n$ be independent Poisson random variables, where $D_i$ has the parameter $\nu_i$ $(i = 1, 2, \ldots, n)$. Then $\sum_{i=1}^n D_i$ is Poisson with parameter $\sum_{i=1}^n \nu_i$.
Like the binomial distribution, the Poisson distribution can be used to analyze data
from a study in which a group of individuals is followed over a defined period of time
and the number of outcomes of interest, denoted by d, is counted. In epidemiologic
studies where the Poisson distribution is applicable, it is not the number of subjects
that is important but rather the collective observation time experienced by the group
as a whole. For the Poisson distribution to be valid, the probability that an outcome
will occur at any time point must be “small.” Expressed another way, the outcome
must be a “rare” event.
As might be guessed from the above remarks, there is a connection between the
binomial and Poisson distributions. In fact the Poisson distribution can be derived as
a limiting case of the binomial distribution. Let D be Poisson with mean ν, and let
A1 , A2 , . . . , Ai , . . . be an infinite sequence of binomial random variables, where Ai
has parameters (πi , ri ). Suppose that the sequence satisfies the following conditions:
πi ri = ν for all i, and the limiting value of πi equals 0. Under these circumstances
the sequence of binomial random variables “converges” to D; that is, as i gets larger
the distribution of Ai gets closer to that of D. This theoretical result explains why
the Poisson distribution is often used to model rare events. It also suggests that the
Poisson distribution with parameter ν can be used to approximate the binomial distribution with parameters (π, r ), provided ν = πr and π is “small.”
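This convergence is easy to observe numerically. The sketch below (Python, standard library only) holds ν = πr = 2 fixed while π shrinks, reproducing the first rows of Table 1.5 below.

```python
from math import comb, exp, factorial

def binom(a, pi, r):
    return comb(r, a) * pi**a * (1 - pi)**(r - a)

def poisson(d, nu):
    return exp(-nu) * nu**d / factorial(d)

# Hold nu = pi * r = 2 fixed and let pi shrink: the binomial probabilities
# approach the Poisson probabilities (compare Table 1.5).
for x in range(6):
    row = [binom(x, pi, r) for pi, r in [(.2, 10), (.1, 20), (.01, 200)]]
    print(x, [round(100 * p, 2) for p in row], round(100 * poisson(x, 2), 2))
```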
TABLE 1.5 Binomial and Poisson Probability Functions (%)

              Binomial                                        Poisson
x     π = .2, r = 10   π = .1, r = 20   π = .01, r = 200      ν = 2
0         10.74            12.16             13.40            13.53
1         26.84            27.02             27.07            27.07
2         30.20            28.52             27.20            27.07
3         20.13            19.01             18.14            18.04
4          8.81             8.98              9.02             9.02
5          2.64             3.19              3.57             3.61
6           .55              .89              1.17             1.20
7           .08              .20               .33              .34
8           .01              .04               .08              .09
9         < .01              .01               .02              .02
10        < .01            < .01             < .01            < .01
Example 1.5 Table 1.5 gives three binomial distributions with parameters
(.2, 10), (.1, 20), and (.01, 200), so that in each case the mean is 2. Also shown
is the Poisson distribution with a mean of 2. The sample spaces have been truncated
at 10. As can be seen, as π becomes smaller the Poisson distribution provides a
progressively better approximation to the binomial distribution.
1.1.3 Central Limit Theorem and Normal Approximations
Let $X_1, X_2, \ldots, X_n$ be a sample from an arbitrary distribution and denote the common mean and variance by $\mu$ and $\sigma^2$. It was shown in (1.10) and (1.11) that $\overline{X}$ has mean $E(\overline{X}) = \mu$ and variance $\operatorname{var}(\overline{X}) = \sigma^2/n$. So, from Example 1.3, the random variable $\sqrt{n}(\overline{X} - \mu)/\sigma$ has mean 0 and variance 1. If the $X_i$ are normal then, from the properties of the normal distribution, $\sqrt{n}(\overline{X} - \mu)/\sigma$ is standard normal. The Central Limit Theorem is a remarkable result from probability theory which states that, even when the $X_i$ are not normal, $\sqrt{n}(\overline{X} - \mu)/\sigma$ is “approximately” standard normal, provided $n$ is sufficiently “large.” We note that the $X_i$ are not required to be continuous random variables. Probability statements such as this, which become more accurate as $n$ increases, are said to hold asymptotically. Accordingly, the Central Limit Theorem states that $\sqrt{n}(\overline{X} - \mu)/\sigma$ is asymptotically standard normal.
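A short simulation makes this concrete. Assuming only the standard library, the sketch below standardizes sample means drawn from the decidedly non-normal distribution of Table 1.1 and checks that roughly 2.5% of the standardized means exceed 1.96, as the standard normal approximation predicts.

```python
import random
from math import sqrt

# Central Limit Theorem demonstration: standardized sample means from the
# discrete distribution of Table 1.1 (mu = 1.1, sigma^2 = .49).
random.seed(2)
mu, sigma = 1.1, sqrt(.49)
n, reps = 50, 40000

exceed = 0
for _ in range(reps):
    xbar = sum(random.choices([0, 1, 2], weights=[.2, .5, .3], k=n)) / n
    z = sqrt(n) * (xbar - mu) / sigma
    if z >= 1.96:
        exceed += 1
print(exceed / reps)   # close to .025
```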
Let A be binomial with parameters (π, n) and let A1 , A2 , . . . , An be a sample
from the binomial distribution with parameters (π, 1). Similarly, let D be Poisson
with parameter ν, where we assume that ν = n, an integer, and let D1 , D2 , . . . , Dn be
a sample from the Poisson distribution with parameter 1. From the additive properties of binomial and Poisson distributions, $A$ has the same distribution as $\sum_{i=1}^n A_i$, and $D$ has the same distribution as $\sum_{i=1}^n D_i$. It follows from the Central Limit Theorem
that, provided n is large, A and D will be asymptotically normal. We illustrate this
phenomenon below with a series of graphs.
Let D1 , D2 , . . . , Dn be independent Poisson random variables, where Di has the
parameter νi (i = 1, 2, . . . , n). From the arguments leading to (1.12) and the Central
Limit Theorem, it follows that
$$X^2 = \sum_{i=1}^n \frac{(D_i - \nu_i)^2}{\nu_i} \tag{1.14}$$

is approximately $\chi^2_{(n)}$. More generally, let $X_1, X_2, \ldots, X_n$ be independent random variables where $X_i$ has mean $\mu_i$ and variance $\sigma_i^2$ $(i = 1, 2, \ldots, n)$. If each $X_i$ is approximately normal then

$$X^2 = \sum_{i=1}^n \frac{(X_i - \mu_i)^2}{\sigma_i^2} \tag{1.15}$$

is approximately $\chi^2_{(n)}$.
Example 1.6 Table 1.6(a) gives the exact and approximate values of the lower
and upper tail probabilities of the binomial distribution with parameters (.3, 10). In
statistics the term “exact” means that an actual probability function is being used to
perform calculations, as opposed to a normal approximation. The mean and variance
of the binomial distribution are .3(10) = 3 and .3(.7)(10) = 2.1. The approximate
values were calculated using the following approach. The normal approximation to $P(A \le 2\,|\,.3)$, for example, equals the area under the standard normal curve to the left of $[(2 + .5) - 3]/\sqrt{2.1}$, and the normal approximation to $P(A \ge 2\,|\,.3)$ equals the area under the standard normal curve to the right of $[(2 - .5) - 3]/\sqrt{2.1}$. The continuity
correction factors ±.5 have been included because the normal distribution, which is
continuous, is being used to approximate a binomial distribution, which is discrete
(Breslow and Day, 1980, §4.3). As can be seen from Table 1.6(a), the exact and
approximate values show quite good agreement. Table 1.6(b) gives the results for the
TABLE 1.6(a) Exact and Approximate Tail Probabilities (%) for the Binomial
Distribution with Parameters (.3, 10)

         P(A ≤ a | .3)               P(A ≥ a | .3)
a      Exact    Approximate       Exact    Approximate
2      38.28       36.50          85.07       84.97
4      84.97       84.97          35.04       36.50
6      98.94       99.21           4.73        4.22
8      99.99       99.99            .16         .10
TABLE 1.6(b) Exact and Approximate Tail Probabilities (%) for the Binomial
Distribution with Parameters (.3, 100)

          P(A ≤ a | .3)               P(A ≥ a | .3)
a       Exact    Approximate       Exact    Approximate
20       1.65        1.91          99.11       98.90
25      16.31       16.31          88.64       88.50
30      54.91       54.34          53.77       54.34
35      88.39       88.50          16.29       16.31
40      98.75       98.90           2.10        1.91
binomial distribution with parameters (.3,100), which shows even better agreement
due to the larger sample size.
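The calculations in Example 1.6 can be reproduced in a few lines. A sketch (Python; the helper names are mine) computes the exact tails with math.comb and the normal approximations with the ±.5 continuity correction, matching Table 1.6(a).

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + erf(z / sqrt(2)))

def exact_le(a, pi, r):
    # Exact P(A <= a | pi) from the binomial probability function.
    return sum(comb(r, k) * pi**k * (1 - pi)**(r - k) for k in range(a + 1))

pi, r = .3, 10
mean, sd = pi * r, sqrt(pi * (1 - pi) * r)       # 3 and sqrt(2.1)
for a in (2, 4, 6, 8):
    approx_le = phi(((a + .5) - mean) / sd)      # continuity correction +.5
    approx_ge = 1 - phi(((a - .5) - mean) / sd)  # continuity correction -.5
    exact_ge = 1 - exact_le(a - 1, pi, r)
    print(a, round(100 * exact_le(a, pi, r), 2), round(100 * approx_le, 2),
          round(100 * exact_ge, 2), round(100 * approx_ge, 2))
```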
Arguments were presented above which show that binomial and Poisson distributions are approximately normal when the sample size is large. The obvious question
is, How large is “large”? We approach this matter empirically and present a sample
size criterion that is useful in practice. The following remarks refer to Figures 1.1(a)–
1.8(a), which show graphs of selected binomial and Poisson distributions. The points
in the sample space have been plotted on the horizontal axis, with the corresponding probabilities plotted on the vertical axis. Magnitudes have not been indicated on
the axes since, for the moment, we are concerned only with the shapes of distributions. The horizontal axes are labeled with the term “count,” which stands for the
number of binomial or Poisson outcomes. Distributions with the symmetric, bell-shaped appearance of the normal distribution have a satisfactory normal approximation.
The binomial and Poisson distributions have sample spaces consisting of consecutive integers, and so the distance between neighboring points is always 1.
Consequently the graphs could have been presented in the form of histograms (bar
charts). Instead they are shown as step functions so as to facilitate later comparisons
with the remaining graphs in the same figures. Since the base of each step has a
length of 1, the area of the rectangle corresponding to that step equals the probability
associated with that point in the sample space. Consequently, summing across the
entire sample space, the area under each step function equals 1, as required by (1.1).
Some of the distributions considered here have tails with little associated probability
(area). This is obviously true for the Poisson distributions, where the sample space
is infinite and extreme tail probabilities are small. The graphs have been truncated at
the extremes of the distributions corresponding to tail probabilities of 1%.
The binomial parameters used to create Figures 1.1(a)–1.5(a) are (.3,10), (.5,10),
(.03,100), (.05,100), and (.1,100), respectively, and so the means are 3, 5, and 10.
The Poisson parameters used to create Figures 1.6(a)–1.8(a) are 3, 5, and 10, which
are also the means of the distributions. As can be seen, for both the binomial and
Poisson distributions, a rough guideline is that the normal approximation should be
satisfactory provided the mean of the distribution is greater than or equal to 5.
[Figures 1.1(a)–1.5(a): binomial distributions with parameters (.3, 10), (.5, 10), (.03, 100), (.05, 100), and (.1, 100); panels (b) and (c) of each figure show the corresponding odds and log-odds transformations.]
[Figures 1.6(a)–1.8(a): Poisson distributions with parameters 3, 5, and 10; panel (b) of each figure shows the corresponding log transformation.]
1.2 PARAMETER ESTIMATION
In the preceding section we discussed the properties of distributions in general, and
those of the normal, chi-square, binomial, and Poisson distributions in particular.
These distributions and others are characterized by parameters that, in practice, are
usually unknown. This raises the question of how to estimate such parameters from
study data.
In certain applications the method of estimation seems intuitively clear. For example, suppose we are interested in estimating the probability that a coin will land
heads. A “study” to investigate this question is straightforward and involves tossing
the coin r times and counting the number of heads, a quantity that will be denoted
by a. The question of how large r should be is answered in Chapter 14. The proportion of tosses landing heads a/r tells us something about the coin, but in order
to probe more deeply we require a probability model, the obvious choice being the
binomial distribution. Accordingly, let A be a binomial random variable with parameters (π, r ), where π denotes the unknown probability that the coin will land heads.
Even though the parameter π can never be known with certainty, it can be estimated
from study data. From the binomial model, an estimate is given by the random variable A/r which, in the present study, has the realization a/r . We denote A/r by π̂
and refer to π̂ as a (point) estimate of π. In some of the statistics literature, π̂ is
called an estimator of π, the term estimate being reserved for the realization a/r . In
keeping with our convention of intentionally ignoring the distinction between random variables and realizations, we use estimate to refer to both quantities.
The theory of binomial distributions provides insight into the properties of π̂ as
an estimate of π. Since A has mean E(A) = πr and variance var(A) = π(1−π )r , it
follows that π̂ has mean E(π̂) = E(A)/r = π and variance var(π̂) = var(A)/r² =
π(1 − π)/r. In the context of the coin-tossing study, these properties of π̂ have the
following interpretations: Over the course of many replications of the study, each
based on r tosses, the realizations of π̂ will tend to be near π; and when r is
large there will be little dispersion of the realizations on either side of π. The latter
interpretation is consistent with our intuition that π will be estimated more accurately
when there are many tosses of the coin.
With the above example as motivation, we now consider the general problem of
parameter estimation. For simplicity we frame the discussion in terms of a discrete
random variable, but the same ideas apply to the continuous case. Suppose that we
wish to study a feature of a population which is governed by a probability function
P(X = x|θ ), where the parameter θ embodies the characteristic of interest. For example, in a population health survey, X could be the serum cholesterol of a randomly
chosen individual and θ might be the average serum cholesterol in the population.
Let X 1 , X 2 , . . . , X n be a sample of size n from the probability function P(X = x|θ ).
A (point) estimate of θ , denoted by θ̂ , is a random variable that is expressed in terms
of the X i and that satisfies certain properties, as discussed below. In the preceding
example, the survey could be conducted by sampling n individuals at random from
the population and measuring their serum cholesterol. For θ̂ we might consider using $\overline{X} = (\sum_{i=1}^n X_i)/n$, the average serum cholesterol in the sample.
There is considerable latitude when specifying the properties that θ̂ should be
required to satisfy, but in order for a theory of estimation to be meaningful the properties must be chosen so that θ̂ is, in some sense, informative about θ . The first
property we would like θ̂ to have is that it should result in realizations that are “near”
θ . This is impossible to guarantee in any given study, but over the course of many
replications of the study we would like this property to hold “on average.” Accordingly, we require the mean of θ̂ to be θ , that is, E(θ̂) = θ . When this property is
satisfied we say that θ̂ is an unbiased estimate of θ , otherwise θ̂ is said to be biased.
The second property we would like θ̂ to have is that it should make as efficient use of
the data as possible. In statistics, notions related to efficiency are generally expressed
in terms of the variance. That is, all other things being equal, the smaller the variance
the greater the efficiency. Accordingly, for a given sample size, we require var(θ̂) to
be as small as possible.
In the coin-tossing study the parameter was θ = π. We can reformulate the earlier
probability model by letting $A_1, A_2, \ldots, A_n$ be independent binomial random variables, each having parameters $(\pi, 1)$. Setting $\overline{A} = (\sum_{i=1}^n A_i)/n$ we have $\hat\pi = \overline{A}$, and so $E(\overline{A}) = \pi$ and $\operatorname{var}(\overline{A}) = \pi(1 - \pi)/n$. Suppose that instead of $\overline{A}$ we decide to use $A_1$ as an estimate of $\pi$; that is, we ignore all but the first toss of the coin. Since $E(A_1) = \pi$, both $\overline{A}$ and $A_1$ are unbiased estimates of $\pi$. However, $\operatorname{var}(A_1) = \pi(1 - \pi)$ and so, provided $n > 1$, $\operatorname{var}(A_1) > \operatorname{var}(\overline{A})$. This means that $\overline{A}$ is more efficient than $A_1$. Based on the above criteria we would choose $\overline{A}$ over $A_1$ as an estimate of $\pi$.
The decision to choose A in preference to A1 was based on a comparison of
variances. This raises the question of whether there is another unbiased estimate of
π with a variance that is even smaller than π(1 − π)/n. We return now to the general
case of an arbitrary probability function P(X = x|θ ). For many of the probability
functions encountered in epidemiology it can be shown that there is a number b(θ )
such that, for any unbiased estimate θ̂ , the inequality var(θ̂) ≥ b(θ ) is satisfied.
Consequently, b(θ ) is at least as small as the variance of any unbiased estimate of θ .
There is no guarantee that for given θ and P(X = x|θ ) there actually is an unbiased
estimate with a variance this small; but, if we can find one, we clearly will have
satisfied the requirement that the estimate has the smallest variance possible.
For the binomial distribution, it turns out that b(π) = π(1 − π)/n, and so
b(π) = var(π̂). Consequently π̂ is an unbiased estimate of π with the smallest variance possible (among unbiased estimates). For the binomial distribution, intuition
suggests that π̂ ought to provide a reasonable estimate of π, and it turns out that π̂
has precisely the properties we require. However, such ad hoc methods of defining
an estimate cannot always be relied upon, especially when the probability model is
complex. We now consider two widely used methods of estimation which ensure that
the estimate has desirable properties, provided asymptotic conditions are satisfied.
1.2.1 Maximum Likelihood
The maximum likelihood method is based on a concept that is intuitively appealing
and, at first glance, deceptively straightforward. Like many profound ideas, its apparent simplicity belies a remarkable depth. Let X 1 , X 2 , . . . , X n be a sample from
the probability function P(X = x|θ ) and consider the observations (realizations)
x1 , x2 , . . . , xn . Since the X i are independent, the (joint) probability of these observations is the product of the individual probability elements, that is,
$$\prod_{i=1}^n P(X_i = x_i|\theta) = P(X_1 = x_1|\theta)\,P(X_2 = x_2|\theta) \cdots P(X_n = x_n|\theta). \tag{1.16}$$
Ordinarily we are inclined to think of (1.16) as a function of the xi . From this
perspective, (1.16) can be used to calculate the probability of the observations provided the value of θ is known. The maximum likelihood method turns this argument
around and views (1.16) as a function of θ . Once the data have been collected, values
of the xi can be substituted into (1.16), making it a function of θ alone. When viewed
this way we denote (1.16) by L(θ ) and refer to it as the likelihood. For any value of
θ , L(θ ) equals the probability of the observations x1 , x2 , . . . , xn . We can graph L(θ )
as a function of θ to get a visual image of this relationship. The value of θ which is
most in accord with the observations, that is, makes them most “likely,” is the one
which maximizes L(θ ) as a function of θ . We refer to this value of θ as the maximum
likelihood estimate and denote it by θ̂ .
Example 1.7 Let A1 , A2 , A3 , A4 , A5 be a sample from the binomial distribution with parameters (π, 1), and consider the observations a1 = 0, a2 = 1, a3 = 0,
a4 = 0, and a5 = 0. The likelihood is
$$L(\pi) = \prod_{i=1}^5 \pi^{a_i} (1 - \pi)^{1 - a_i} = \pi(1 - \pi)^4.$$
From the graph of L(π), shown in Figure 1.9, it appears that π̂ is somewhere in the
neighborhood of .2. Trial and error with larger and smaller values of π confirms that
in fact π̂ = .2.
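The trial and error can equally be delegated to a grid search. A minimal sketch (Python, standard library only; the grid resolution is an arbitrary choice):

```python
# Grid search for the value of pi maximizing L(pi) = pi * (1 - pi)^4,
# the likelihood of Example 1.7.
def likelihood(pi):
    return pi * (1 - pi) ** 4

grid = [i / 1000 for i in range(1, 1000)]
pi_hat = max(grid, key=likelihood)
print(pi_hat)   # 0.2
```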
The above graphical method of finding a maximum likelihood estimate is feasible
only in the simplest of cases. In more complex situations, in particular when there
are several parameters to estimate simultaneously, numerical methods are required,
such as those described in Appendix B. When there is a single parameter, the maximum likelihood estimate θ̂ can usually be found by solving the maximum likelihood
equation,
$$L'(\hat\theta) = 0 \tag{1.17}$$

where $L'(\theta)$ is the derivative of $L(\theta)$ with respect to $\theta$.

[FIGURE 1.9 Likelihood for Example 1.7]
Example 1.8 We now generalize Example 1.7. Let A1 , A2 , . . . , Ar be a sample
from the binomial distribution with parameters (π, 1), and denote the observations
by a1 , a2 , . . . , ar . The likelihood is
$$L(\pi) = \prod_{i=1}^r \pi^{a_i} (1 - \pi)^{1 - a_i} = \pi^a (1 - \pi)^{r-a} \tag{1.18}$$

where $a = \sum_{i=1}^r a_i$. From the form of the likelihood we see that it is not the individual $a_i$ which are important but rather their sum $a$. Accordingly we might just as well have based the likelihood on $\sum_{i=1}^r A_i$, which is binomial with parameters $(\pi, r)$. In this case the likelihood is

$$L(\pi) = \binom{r}{a} \pi^a (1 - \pi)^{r-a}. \tag{1.19}$$
As far as maximizing (1.19) with respect to π is concerned, the binomial coefficient is irrelevant and so (1.18) and (1.19) are equivalent from the likelihood perspective. It is straightforward to show that the maximum likelihood equation (1.17)
simplifies to a − π̂r = 0 and so the maximum likelihood estimate of π is π̂ = a/r .
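For completeness, here is the short calculus step behind this result, using the standard device of maximizing log L(π) rather than L(π) (equivalent, since the logarithm is increasing):

$$\log L(\pi) = a \log \pi + (r - a) \log(1 - \pi), \qquad
\frac{d \log L}{d\pi} = \frac{a}{\pi} - \frac{r - a}{1 - \pi}.$$

Setting the derivative to zero gives $a(1 - \hat\pi) = (r - a)\hat\pi$, that is, $a - \hat\pi r = 0$ and $\hat\pi = a/r$.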
Maximum likelihood estimates have very attractive asymptotic properties. Specifically, if θ̂ is the maximum likelihood estimate of θ then θ̂ is asymptotically normal
with mean θ and variance b(θ ), where the latter is the lower bound described earlier.
As a result, θ̂ satisfies, in an asymptotic sense, the two properties that were proposed above as being desirable features of an estimate—unbiasedness and minimum
variance. In addition to parameter estimates, the maximum likelihood approach also
provides methods of confidence interval estimation and hypothesis testing. As discussed in Appendix B, included among the latter are the Wald, score, and likelihood
ratio tests.
It seems that the maximum likelihood method has much to offer; however, there
are two potential problems. First, the maximum likelihood equation may be very
complicated and this can make calculating θ̂ difficult in practice. This is especially
true when several parameters must be estimated simultaneously. Fortunately, statistical packages are available for many standard analyses and modern computers are
capable of handling the computational burden. The second problem is that the desirable properties of maximum likelihood estimates are guaranteed to hold only when
the sample size is “large.”
1.2.2 Weighted Least Squares
In the coin-tossing study discussed above, we considered a sample A1 , A2 , . . . , An
from a binomial distribution with parameters $(\pi, 1)$. Since $E(A_i) = \pi$ we can denote $A_i$ by $\hat\pi_i$, and in place of $\overline{A} = \sum_{i=1}^n A_i/n$ write $\hat\pi = \sum_{i=1}^n \hat\pi_i/n$. In this way we
can express the estimate of π as an average of estimates, one for each i. More generally, suppose that θ̂1 , θ̂2 , . . . , θ̂n are independent unbiased estimates of a parameter
θ , that is, E(θ̂i ) = θ for all i. We do not assume that the θ̂i necessarily have the
same distribution; in particular, we do not require that the variances $\operatorname{var}(\hat\theta_i) = \sigma_i^2$ be
equal. We seek a method of combining the individual estimates θ̂i of θ into an overall
estimate θ̂ which has the desirable properties outlined earlier. (Using the symbol θ̂
for both the weighted least squares and maximum likelihood estimates is a matter of
convenience and is not meant to imply any connection between the two estimates.)
For constants wi > 0, consider the sum
$$\frac{1}{W} \sum_{i=1}^n w_i (\hat\theta_i - \hat\theta)^2 \tag{1.20}$$

where $W = \sum_{i=1}^n w_i$. We refer to the $w_i$ as weights and to an expression such as (1.20) as a weighted average. It is the relative, not the absolute, magnitude of each $w_i$ that is important in a weighted average. In particular, we can replace $w_i$ with $w_i' = w_i/W$ and obtain a weighted average in which the weights sum to 1. In this way, means (1.2) and variances (1.3) can be viewed as weighted averages.
Expression (1.20) is a measure of the overall weighted “distance” between the
θ̂i and θ̂ . The weighted least squares method defines θ̂ to be that quantity which
minimizes (1.20). It can be shown that the weighted least squares estimate of θ is
$$\hat\theta = \frac{1}{W} \sum_{i=1}^n w_i \hat\theta_i \tag{1.21}$$

which is seen to be a weighted average of the $\hat\theta_i$. Since each $\hat\theta_i$ is an unbiased estimate of $\theta$, it follows from (1.7) that

$$E(\hat\theta) = \frac{1}{W} \sum_{i=1}^n w_i E(\hat\theta_i) = \theta.$$
So θ̂ is also an unbiased estimate of θ , and this is true regardless of the choice
of weights. Not all weighting schemes are equally efficient in the sense of keeping the variance $\operatorname{var}(\hat\theta)$ to a minimum. The variance $\sigma_i^2$ is a measure of the amount of information contained in the estimate $\hat\theta_i$. It seems reasonable that relatively greater weight should be given to those $\hat\theta_i$ for which $\sigma_i^2$ is correspondingly small. It turns out that the weights $w_i = 1/\sigma_i^2$ are optimal in the following sense: The corresponding weighted least squares estimate has minimum variance among all weighted averages of the $\hat\theta_i$ (although not necessarily among estimates in general). Setting $w_i = 1/\sigma_i^2$, it follows from (1.8) that

$$\operatorname{var}(\hat\theta) = \frac{1}{W^2} \sum_{i=1}^n w_i^2 \operatorname{var}(\hat\theta_i) = \frac{1}{W}. \tag{1.22}$$
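Formulas (1.21) and (1.22) amount to an inverse-variance weighted average and take only a few lines to compute. In the sketch below (Python) the estimates and variances are purely illustrative numbers, not from the text.

```python
# Weighted least squares (inverse-variance) pooling of independent unbiased
# estimates, per (1.21) and (1.22). Inputs are illustrative only.
estimates = [1.8, 2.3, 2.0]
variances = [0.40, 0.10, 0.25]

weights = [1 / v for v in variances]          # w_i = 1 / sigma_i^2
W = sum(weights)
theta_hat = sum(w * t for w, t in zip(weights, estimates)) / W   # (1.21)
var_theta_hat = 1 / W                                            # (1.22)
print(theta_hat, var_theta_hat)
```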
Note that up to this point the entire discussion has been based on means and
variances. In particular, nothing has been assumed about distributions or sample size.
It seems that the weighted least squares method has much to recommend it. Unlike
the maximum likelihood approach, the calculations are straightforward, and sample
size does not seem to be an issue. However, a major consideration is that we need
to know the variances σi2 prior to using the weighted least squares approach, and in
practice this information is almost never available. Therefore it is usually necessary
to estimate the σi2 from study data, in which case the weights are random variables
rather than constants. So in place of (1.21) and (1.22) we have
θ̂ = (1/Ŵ) Σ_{i=1}^n ŵi θ̂i    (1.23)
and
var(θ̂) = 1/Ŵ    (1.24)
where ŵi = 1/σ̂i² and Ŵ = Σ_{i=1}^n ŵi. When the σi² are estimated from large samples
the desirable properties of (1.21) and (1.22) described above carry over to (1.23) and
(1.24), that is, θ̂ is asymptotically unbiased with minimum variance.
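As a concrete illustration, the following is a minimal Python sketch of inverse-variance pooling in the spirit of (1.23) and (1.24); the function name and the numerical values are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of weighted least squares pooling with estimated
# weights, as in (1.23) and (1.24). Inputs are assumed: independent
# estimates theta_hat_i and their estimated variances.
def pooled_estimate(estimates, variances):
    weights = [1.0 / v for v in variances]   # w_i = 1 / sigma_i^2
    W = sum(weights)
    theta_hat = sum(w * t for w, t in zip(weights, estimates)) / W  # (1.23)
    var_theta_hat = 1.0 / W                  # (1.24)
    return theta_hat, var_theta_hat

# Example with three estimates of differing precision; the most precise
# estimate (smallest variance) receives the greatest weight.
theta_hat, var_hat = pooled_estimate([0.52, 0.48, 0.55], [0.01, 0.04, 0.02])
```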
1.3 RANDOM SAMPLING
The methods of parameter (point) estimation described in the preceding section, as
well as the methods of confidence interval estimation and hypothesis testing to be
discussed in subsequent chapters, are based on the assumption that study subjects
are selected using random sampling. If subjects are a nonrandom sample, the above
methods do not apply. For example, if patients are enrolled in a study of mortality
by preferentially selecting those with a better prognosis, the mortality estimates that
result will not reflect the experience of the typical patient in the general population.
In this section we discuss two types of random sampling that are important in epidemiologic studies: simple random sampling and stratified random sampling. For
illustrative purposes we consider a prevalence study (survey) designed to estimate
the proportion of the population who have a given disease at a particular time point.
This proportion is referred to as the (point) prevalence rate (of the disease), and an
individual who has the disease is referred to as a case (of the disease). The binomial
distribution can be used to analyze data from a prevalence study. Accordingly, we
denote the prevalence rate by π.
1.3.1 Simple Random Sampling
Simple random sampling, the least complicated type of random sampling, is widely
used in epidemiologic studies. The cardinal feature of a simple random sample is
that all individuals in the population have an equal probability of being selected. For
example, a simple random sample would be obtained by randomly selecting names
from a census list, making sure that each individual has the same chance of being
chosen. Suppose that r individuals are sampled for the prevalence study and that
a of them are cases. The simple random sample estimate of the prevalence rate is
π̂srs = a/r, which has the variance var(π̂srs) = π(1 − π)/r.
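A two-line computation (with made-up counts) mirrors these formulas; note that in practice π is unknown, so π̂srs is substituted for π when estimating the variance.

```python
# Simple random sample estimate of a prevalence rate; r and a are
# assumed illustrative counts.
r, a = 500, 60
pi_hat = a / r                        # estimate of pi: a/r
var_hat = pi_hat * (1 - pi_hat) / r   # pi(1 - pi)/r, with pi_hat in place of pi
```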
1.3.2 Stratified Random Sampling
Suppose that the prevalence rate increases with age. Simple random sampling ensures that, on average, the sample will have the same age distribution as the population. However, in a given prevalence study it is possible for a particular age group to
be underrepresented or even absent from a simple random sample. Stratified random
sampling avoids this difficulty by permitting the investigator to specify the proportion of the total sample that will come from each age group (stratum). For stratified
random sampling to be possible it is necessary to know in advance the number of individuals in the population in each stratum. For example, stratification by age could
be based on a census list, provided information on age is available. Once the strata
have been created, a simple random sample is drawn from each stratum, resulting in
a stratified random sample.
Suppose there are n strata. For the ith stratum we make the following definitions:
Ni is the number of individuals in the population, πi is the prevalence rate, ri is
the number of subjects in the simple random sample, and ai is the number of cases among the ri subjects (i = 1, 2, . . . , n). Let N = Σ_{i=1}^n Ni, a = Σ_{i=1}^n ai, and

r = Σ_{i=1}^n ri.    (1.25)
For a stratified random sample, along with the Ni , the ri must also be known prior
to data collection. We return shortly to the issue of how to determine the ri , given an
overall sample size of r . For the moment we require only that the ri satisfy the constraint (1.25). Since a simple random sample is chosen in each stratum, an estimate
of πi is π̂i = ai /ri , which has the variance var(π̂i ) = πi (1 − πi )/ri . The stratified
random sample estimate of the prevalence rate is
π̂str = Σ_{i=1}^n (Ni/N) π̂i    (1.26)
which is seen to be a weighted average of the π̂i . Since E(π̂i ) = πi , it follows from
(1.7) that
E(π̂str) = Σ_{i=1}^n (Ni/N) πi = π
and so π̂str is unbiased. Applying (1.8) to (1.26) gives
var(π̂str) = Σ_{i=1}^n (Ni/N)² πi(1 − πi)/ri.    (1.27)
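The following short Python sketch evaluates (1.26) and (1.27); the stratum counts are assumptions for illustration, and the stratum prevalences are replaced by their estimates when evaluating (1.27).

```python
# Stratified random sample estimate (1.26) and its variance (1.27).
# N_i, r_i, a_i below are assumed illustrative values for three strata.
N = [1000, 2000, 3000]   # stratum population sizes N_i
r = [50, 100, 150]       # stratum sample sizes r_i
a = [5, 20, 45]          # cases observed in each stratum a_i

N_total = sum(N)
pi_hat = [a_i / r_i for a_i, r_i in zip(a, r)]                   # pi_hat_i = a_i / r_i
pi_str = sum((N_i / N_total) * p for N_i, p in zip(N, pi_hat))   # (1.26)
var_str = sum((N_i / N_total) ** 2 * p * (1 - p) / r_i           # (1.27), with the
              for N_i, p, r_i in zip(N, pi_hat, r))              # pi_i estimated
```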
We now consider the issue of determining the ri . There are a number of approaches
that can be followed, each of which places particular conditions on the ri . For example, according to the method of optimal allocation, the ri are chosen so that
var(π̂str ) is minimized. It can be shown that, based on this criterion,
ri = [Ni √(πi(1 − πi)) / Σ_{i=1}^n Ni √(πi(1 − πi))] r.    (1.28)
As can be seen from (1.28), in order to determine the ri it is necessary to know, or
at least have reasonable estimates of, the πi . Since this is one of the purposes of the
prevalence study, it is therefore necessary to rely on findings from earlier prevalence
studies or, when such studies are not available, have access to informed opinion.
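A brief sketch of (1.28), assuming the πi are prior guesses taken from earlier studies or informed opinion (all numerical values here are invented for illustration):

```python
import math

# Optimal allocation (1.28): stratum sample sizes proportional to
# N_i * sqrt(pi_i (1 - pi_i)).
N = [1000, 2000, 3000]          # stratum population sizes N_i
pi_guess = [0.10, 0.20, 0.30]   # anticipated prevalence rates pi_i (assumed)
r_total = 300                   # overall sample size r

terms = [N_i * math.sqrt(p * (1 - p)) for N_i, p in zip(N, pi_guess)]
r_i = [t / sum(terms) * r_total for t in terms]   # (1.28); round to integers in practice
```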
Stratified random sampling should be considered only if it is known, or at least
strongly suspected, that the πi vary across strata. Suppose that, unknown to the investigator, the πi are all equal, so that πi = π for all i. It follows from (1.28) that
ri = (Ni /N )r and hence, from (1.27), that var(π̂str ) = π(1 − π)/r . This means that
the variance obtained by optimal allocation, which is the smallest variance possible
under stratified random sampling, equals the variance that would have been obtained
from simple random sampling. Consequently, when there is a possibility that the πi
are all equal, stratified random sampling should be avoided since the effort involved
in stratification will not be rewarded by a reduction in variance.
Simple random sampling and stratified random sampling are conceptually and
computationally straightforward. There are more complex methods of random sampling such as multistage sampling and cluster sampling. Furthermore, the various
methods can be combined to produce even more elaborate sampling strategies. It will
come as no surprise that as the method of sampling becomes more complicated so
does the corresponding data analysis. In practice, most epidemiologic studies use relatively straightforward sampling procedures. Aside from prevalence studies, which
may require complex sampling, the typical epidemiologic study is usually based on
simple random sampling or perhaps stratified random sampling, but generally nothing more elaborate.
Most of the procedures in standard statistical packages, such as SAS (1987) and
SPSS (1993), assume that data have been collected using simple random sampling or
stratified random sampling. For more complicated sampling designs it is necessary to
use a statistical package such as SUDAAN (Shah et al., 1996), which is specifically
designed to analyze complex survey data. STATA (1999) is a statistical package that
has capabilities similar to SAS and SPSS, but with the added feature of being able
to analyze data collected using complex sampling. For the remainder of the book it
will be assumed that data have been collected using simple random sampling unless
stated otherwise.
CHAPTER 2
Measurement Issues in Epidemiology
Unlike laboratory research where experimental conditions can usually be carefully
controlled, epidemiologic studies must often contend with circumstances over which
the investigator may have little influence. This reality has important implications for
the manner in which epidemiologic data are collected, analyzed, and interpreted.
This chapter provides an overview of some of the measurement issues that are important in epidemiologic research, an appreciation of which provides a useful perspective on the statistical methods to be discussed in later chapters. There are many
references that can be consulted for additional material on measurement issues and
study design in epidemiology; in particular, the reader is referred to Rothman and
Greenland (1998).
2.1 SYSTEMATIC AND RANDOM ERROR
Virtually any study involving data collection is subject to error, and epidemiologic
studies are no exception. The error that occurs in epidemiologic studies is broadly of
two types: random and systematic.
Random Error
The defining characteristic of random error is that it is due to “chance” and, as such,
is unpredictable. Suppose that a study is conducted on two occasions using identical
methods. It is possible for the first replicate to lead to a correct inference about the
study hypothesis, and for the second replicate to result in an incorrect inference as a
result of random error. For example, consider a study that involves tossing a coin 100
times where the aim is to test the hypothesis that the coin is “fair”—that is, has an
equal chance of landing heads or tails. Suppose that unknown to the investigator the
coin is indeed fair. In the first replicate, imagine that there are 50 heads and 50 tails,
leading to the correct inference that the coin is fair. Now suppose that in the second
replicate there are 99 heads and 1 tail, leading to the incorrect inference that the coin
is unfair. The erroneous conclusion in the second replicate is due to random error,
and this occurs despite the fact that precisely the same study methods were used both
times.
Since the coin is fair, based on the binomial model, the probability of observing the data in the second replicate is \binom{100}{99}(1/2)^99 (1/2)^1 = 7.89 × 10^−29, an exceedingly small number. Although unlikely, this outcome is possible. The only way to
completely eliminate random error in the study is to toss the coin an “infinite” number of times, an obvious impossibility. However, as intuition suggests, tossing the
coin a “large” number of times can reduce the probability of random error. Epidemiologic studies are generally based on measurements performed on subjects randomly
sampled from a “population.” A population can be any well-defined group of individuals, such as the residents of a city, individuals living in the catchment area of a
hospital, workers in a manufacturing plant, or patients attending a medical clinic, just
to give a few examples. The process of random sampling from a population introduces random error. In theory, such random error could be eliminated by recruiting
the entire population into the study. Usually populations of interest are so large or
otherwise inaccessible as to make this option a practical impossibility. As a result,
random error must be addressed in virtually all epidemiologic studies. Much of the
remainder of this book is devoted to methods for analyzing data in the presence of
random error.
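The coin-tossing probability quoted above can be checked directly; a one-line Python computation (math.comb is available in Python 3.8 and later):

```python
from math import comb

# Probability of 99 heads and 1 tail in 100 tosses of a fair coin,
# under the binomial model.
p = comb(100, 99) * (1 / 2) ** 99 * (1 / 2) ** 1
print(p)   # about 7.89e-29, matching the value given in the text
```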
An epidemiologic study is usually designed with a particular hypothesis in mind,
typically having to do with a purported association between a predictor variable and
an outcome of interest. For example, in an occupational epidemiologic study it might
be hypothesized that exposure to a certain chemical increases the risk of cancer.
The classical approach to examining the truth of such a hypothesis is to define the
corresponding “null” hypothesis that no association is present. The null hypothesis
is then tested using inferential statistical methods and either rejected or not. In the
present example, the null hypothesis would be that the chemical is not associated
with the risk of cancer. Rejecting the null hypothesis would lead to the inference that
the chemical is in fact associated with this risk.
The null hypothesis is either true or not, but due to random error the truth of the
matter can never be known with certainty based on statistical methods. The inference
drawn from a hypothesis test can be wrong in two ways. If the null hypothesis is
rejected when it is true, a type I error has occurred; and if the null hypothesis is not
rejected when it is false, there has been a type II error. The probability of a type I
error will be denoted by α, and the probability of a type II error will be denoted by β.
In a given application the values of α and β are determined by the nature of the study
and, as such, are under the control of the investigator. It is desirable to keep α and β
to a minimum, but it is not possible to reduce either of them to 0. For a given sample
size there is a tradeoff between type I error and type II error, in the sense that α can
be reduced by increasing β, and conversely (Chapter 14).
Systematic Error
The cardinal feature of systematic error, and the characteristic that distinguishes it
from random error, is that it is reproducible. For the most part, systematic error occurs as a result of problems having to do with study methodology. If these problems
are left unattended and if identical methods are used to replicate the study, the same
systematic errors will occur. As can be imagined, there are an almost endless number
of possibilities for systematic error in an epidemiologic study. For example, the study
sample could be chosen improperly, the questionnaire could be invalid, the statistical
analysis could be faulty, and so on. Certain epidemiologic designs are, by their very
nature, more prone to systematic error than others. Case-control studies, discussed
briefly in Chapter 11, are usually considered to be particularly problematic in this
regard due to the reliance on retrospective data collection. With careful attention to
study methods it is possible to minimize systematic error, at least those sources of systematic error that come to the attention of the investigator. In this chapter we focus
on two types of systematic error which are particularly important in epidemiologic
studies, namely, confounding and misclassification.
Ordinarily the findings from an epidemiologic study are presented in terms of a
parameter estimate based on a probability model. In the coin-tossing example the
focus would typically be on the parameter π from a binomial distribution, where
π is the (unknown) probability of the coin landing heads. When systematic error is
present, the parameter estimate will usually be biased in the sense of Section 1.2, and
so it may either over- or underestimate the true parameter value. Epidemiology has
borrowed the term “bias” from the statistical literature, using it as a synonym for systematic error. So when an epidemiologic study is subject to systematic error we say
that the parameter estimate is biased or, rather more loosely, that the study is biased.
2.2 MEASURES OF EFFECT
In this book we will mostly be concerned with analyzing data from studies in which
groups of individuals are compared, the aim being to determine whether a given exposure is related to the occurrence of a particular disease. Here “exposure” and “disease” are used in a generic sense. The term exposure can refer to any characteristic
that we wish to investigate as potentially having a health-related impact. Examples
are: contact with a toxic substance, treatment with an innovative medical therapy,
having a family history of illness, engaging in a certain lifestyle practice, and belonging to a particular sociodemographic group. Likewise, the term disease can refer
to the occurrence of any health-related outcome we wish to consider. Examples are:
onset of illness, recovery following surgery, and death from a specific cause. In the
epidemiologic literature, “risk” is sometimes used synonymously with probability, a
convention that tends to equate the term with the probability parameter of a binomial
model. Here we use the term risk more generally to connote the propensity toward
a particular outcome, whether or not that tendency is modeled using the binomial
distribution.
2.2.1 Closed Cohort Study
There are many types of cohort studies, but the common theme is that a group of
individuals, collectively termed the cohort, is followed over time and monitored for
the occurrence of an outcome of interest. For example, a cohort of breast cancer
patients might be followed for 5 years, with death from this disease as the study
endpoint. In this example, the cohort is a single sample which is not being contrasted
with any comparison group. As another example, suppose that a group of workers
in a chemical fabrication plant is followed for 20 years to determine if their risk of
leukemia is greater than that in the general population. In this case, the workers are
being compared to the population at large.
A reality of cohort studies is that subjects may cease to be under observation
prior to either developing the disease or reaching the end of the planned period of
follow-up. When this occurs we say that the subject has become “unobservable.”
This can occur for a variety of reasons, such as the subject being lost to follow-up by
the investigator, the subject deciding to withdraw from the study, or the investigator
eliminating the subject from further observation due to the development of an intercurrent condition which conflicts with the aims of the study. Whatever the reasons,
these occurrences pose a methodological challenge to the conduct of a cohort study.
For the remainder of this chapter we restrict attention to the least complicated type of
cohort study, namely, one in which all subjects have the same maximum observation
time and all subjects not developing the disease remain observable throughout the
study. A study with this design will be referred to as a closed cohort study.
In a closed cohort study, subjects either develop the disease or not, and all those
not developing it necessarily have the same length of follow-up, namely, the maximum observation time. For example, suppose that a cohort of 1000 otherwise healthy
middle-aged males is monitored routinely for 5 years to determine which of them
develops hypertension (high blood pressure). In order for the cohort to be closed, it is
necessary that all those who do not develop hypertension remain under observation
for the full 5 years. Once a subject develops hypertension, follow-up for that individual ceases. In a closed cohort study involving a single sample, the parameter of
interest is usually the binomial probability of developing disease. In some of the epidemiologic literature on closed cohort studies, the probability of disease is referred
to as the incidence proportion or the cumulative incidence, but we will avoid this
terminology. In most cohort studies, at least a few subjects become unobservable for
reasons such as those given above, and so closed cohort studies are rarely encountered in practice. However, the closed cohort design offers a convenient vehicle for
introducing a number of ideas that are also important in the context of cohort studies
conducted under less restrictive conditions.
Consider a closed cohort study in which the exposure is dichotomous and suppose
that at the start of follow-up there are r1 subjects in the exposed cohort (E = 1) and
r2 subjects in the unexposed cohort (E = 2). At the end of the period of follow-up
each subject will have either developed the disease (D = 1) or not (D = 2). Someone who develops the disease will be referred to as a case, otherwise as a noncase.
The development of disease in the exposed and unexposed cohorts will be modeled
using binomial random variables A1 and A2 with parameters (π1 , r1 ) and (π2 , r2 ),
respectively. As discussed in Section 1.2.1, we assume that subjects behave independently with respect to developing the disease. Tables 2.1(a) and 2.1(b) show the
observed counts and expected values for the study, respectively. We do not refer to
the entries in Table 2.1(b) as expected counts, for reasons that will be explained in
Section 4.1.
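As a small illustration of this model, the following sketch simulates the observed counts of Table 2.1(a) under assumed parameter values (the cohort sizes and probabilities are invented for the example):

```python
import random

# Simulating a closed cohort study: A_1 ~ binomial(pi_1, r_1) for the
# exposed cohort and A_2 ~ binomial(pi_2, r_2) for the unexposed cohort.
r1, r2 = 100, 200          # cohort sizes (assumed)
pi1, pi2 = 0.30, 0.15      # probabilities of disease (assumed)

a1 = sum(random.random() < pi1 for _ in range(r1))   # exposed cases
a2 = sum(random.random() < pi2 for _ in range(r2))   # unexposed cases
b1, b2 = r1 - a1, r2 - a2                            # noncases
```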
TABLE 2.1(a) Observed Counts: Closed Cohort Study

            D = 1    D = 2
E = 1         a1       b1       r1
E = 2         a2       b2       r2

TABLE 2.1(b) Expected Values: Closed Cohort Study

            D = 1          D = 2
E = 1       π1 r1       (1 − π1)r1      r1
E = 2       π2 r2       (1 − π2)r2      r2
2.2.2 Risk Difference, Risk Ratio, and Odds Ratio
When an exposure is related to the risk of disease we say that the exposure has an
“effect.” We now define several measures of effect which quantify the magnitude of
the association between exposure and disease in a closed cohort study.
The risk difference, defined by RD = π1 − π2 , is an intuitively appealing measure
of effect. Since π1 = π2 + RD, the risk difference measures change on an additive
scale. If RD > 0, exposure is associated with an increase in the probability of disease;
if RD < 0, exposure is associated with a decrease in the probability of disease; and
if RD = 0, exposure is not associated with the disease.
The risk ratio, defined by RR = π1 /π2 , is another intuitively appealing measure
of effect. In some of the epidemiologic literature the risk ratio is referred to as the
relative risk, but this terminology will not be used in this book. Since π1 = RRπ2 ,
the risk ratio measures change on a multiplicative scale. Note that RR is undefined
when π2 = 0, a situation that is theoretically possible but of little interest from an
epidemiologic point of view. If RR > 1, exposure is associated with an increase in
the probability of disease; if RR < 1, exposure is associated with a decrease in the
probability of disease; and if RR = 1, exposure is not associated with the disease. A
measure of effect that has both additive and multiplicative features is (π1 −π2 )/π2 =
RR − 1, which is referred to as the excess relative risk (Preston, 2000). A related
measure of effect is (π1 − π2 )/π1 = 1 − (1/RR), which is called the attributable risk
percent (Cole and MacMahon, 1971). These measures of effect are closely related to
the risk ratio and will not be considered further.
For a given probability π ≠ 1, the odds ω is defined to be

ω = π/(1 − π).
Solving for π gives

π = ω/(1 + ω)
and so probability and odds are equivalent ways of expressing the same information.
Although appearing to be somewhat out of place in the context of health-related
studies, odds terminology is well established in the setting of games of chance. As
an example, the probability of picking an ace at random from a deck of cards is
π = 4/52 = 1/13. The odds is therefore ω = (4/52)/(48/52) = 1/12, which
can be written as 1:12 and read as “1 to 12.” Despite their nominal equivalence,
probability and odds differ in a major respect: π must lie in the interval between 0
and 1, whereas ω can be any nonnegative number. An important characteristic of the
odds is that it satisfies a reciprocal property: If ω = π/(1 − π) is the odds of a given
outcome, then (1 − π)/[1 − (1 − π)] = 1/ω is the odds of the opposite outcome. For
example, the odds of not picking an ace is (48/52)/(4/52) = 12, that is, “12 to 1.”
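These conversions and the reciprocal property are easy to verify numerically; a small sketch using the card-drawing example (function names are illustrative):

```python
# Converting between probability and odds, and checking the
# reciprocal property.
def odds(pi):
    return pi / (1 - pi)          # omega = pi / (1 - pi)

def prob(omega):
    return omega / (1 + omega)    # pi = omega / (1 + omega)

omega_ace = odds(4 / 52)          # 1/12, read as "1 to 12"
omega_not_ace = odds(48 / 52)     # 12, the reciprocal 1 / omega_ace
```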
Returning to the discussion of closed cohort studies, let ω1 = π1 /(1 − π1 ) and
ω2 = π2 /(1 − π2 ) be the odds of disease for the exposed and unexposed cohorts,
respectively. The odds ratio is defined to be
OR = ω1/ω2 = [π1(1 − π2)]/[π2(1 − π1)].    (2.1)
Since ω1 = ORω2 , the odds ratio is similar to the risk ratio in that change is measured on a multiplicative scale. However, with the odds ratio the scale is calibrated
in terms of odds rather than in terms of probability. If OR > 1, exposure is associated with an increase in the odds of disease; if OR < 1, exposure is associated with
a decrease in the odds of disease; and if OR = 1, exposure is not associated with
the disease. It is easily demonstrated that ω1 > ω2 , ω1 < ω2 , ω1 = ω2 are equivalent to π1 > π2 , π1 < π2 , π1 = π2 , respectively, and so statements made in terms
of odds are readily translated into corresponding statements about probabilities, and
conversely.
When the disease is “rare,” 1 − π1 and 1 − π2 are close to 1 and so, from (2.1),
OR is approximately equal to RR. In some of the older epidemiologic literature the
odds ratio was viewed as little more than an approximation to the risk ratio. More
recently, some authors have argued against using the odds ratio as a measure of effect
in clinical studies on the grounds that it cannot substitute for the clinically more
meaningful risk difference and risk ratio (Sinclair and Bracken, 1994). In this book
we regard the odds ratio as a measure of effect worthy of consideration in its own
right and not merely as a less desirable alternative to the risk ratio. As will be seen
shortly, the odds ratio has a number of attractive measurement properties that are not
shared by either the risk difference or the risk ratio.
2.2.3 Choosing a Measure of Effect
We now consider which, if any, of the risk difference, risk ratio, or odds ratio is the
most desirable measure of effect for closed cohort studies. One of the most contentious issues revolves around the utility of RD and RR as measures of etiology
(causation) on the one hand, and measures of population (public health) impact on
the other. This is best illustrated with some examples. First, suppose that the probability of developing the disease is small, whether or not there is exposure; for example,
π1 = .0003 and π2 = .0001. Then RD = .0002, and so exposure is associated with
a small increase in the probability of disease. Unless a large segment of the population has been exposed, the impact of the disease will be small and so, from a public
health perspective, this particular exposure is not of major concern. On the other
hand, RR = 3 and according to usual epidemiologic practice this is large enough
to warrant further investigation of the exposure as a possible cause of the disease.
Now suppose that π1 = .06 and π2 = .05, so that RD = .01 and RR = 1.2. In
this example, the risk difference will be of public health importance unless exposure is especially infrequent, while the risk ratio is of relatively little interest from an
etiologic point of view.
The above arguments have been expressed in terms of the risk difference and risk
ratio, but are in essence a debate over the merits of measuring effect on an additive
as opposed to a multiplicative scale. This issue has generated a protracted debate
in the epidemiologic literature, with some authors preferring additive models (Rothman, 1974; Berry, 1980) and others preferring the multiplicative approach (Walter
and Holford, 1978). Statistical methods have been proposed for deciding whether
an additive or multiplicative model provides a better fit to study data. One approach
is to compare likelihoods based on best-fitting additive and multiplicative models
(Berry, 1980; Gardner and Munford, 1980; Walker and Rothman, 1982). An alternative method is to fit a general model that has additive and multiplicative models as
special cases and then decide whether one or the other, or perhaps some intermediate
model, fits the data best (Thomas, 1981; Guerrero and Johnson, 1982; Breslow and
Storer, 1985; Moolgavkar and Venzon, 1987).
Consider a closed cohort study where π1 = .6 and π2 = .2, so that ω1 = 1.5
and ω2 = .25. Based on these parameters we have the following interpretations:
Exposure increases the probability of disease by an increment RD = .4; exposure
increases the probability of disease by a factor RR = 3; and exposure increases the
odds of disease by a factor OR = 6. This simple example illustrates that the risk
difference, risk ratio, and odds ratio are three very different ways of measuring the
effect of exposure on the risk of disease. It also illustrates that the risk difference
and risk ratio have a straightforward and intuitive interpretation, a feature that is not
shared by the odds ratio. Even if ω1 = 1.5 and ω2 = .25 are rewritten as “15 to 10”
and “1 to 4,” these quantities remain less intuitive than π1 = .6 and π2 = .2. It seems
that, from the perspective of ease of interpretation, the risk difference and risk ratio
have a distinct advantage over the odds ratio.
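The three measures for this example can be computed in a few lines:

```python
# Measures of effect for the example pi_1 = .6, pi_2 = .2.
pi1, pi2 = 0.6, 0.2
RD = pi1 - pi2                               # 0.4
RR = pi1 / pi2                               # 3.0
OR = (pi1 * (1 - pi2)) / (pi2 * (1 - pi1))   # 6.0, as defined in (2.1)
```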
Suppose we redefine exposure status so that subjects who were exposed according
to the original definition are relabeled as unexposed, and conversely. Denoting the
resulting measures of effect with a prime (′), we have RD′ = π2 − π1, RR′ = π2/π1, and OR′ = [π2(1 − π1)]/[π1(1 − π2)]. It follows that RD′ = −RD, RR′ = 1/RR, and OR′ = 1/OR, and so each of the measures of effect is transformed into a reciprocal
quantity on either the additive or multiplicative scale. Now suppose that we redefine
disease status so that subjects who were cases according to the original definition are
relabeled as noncases, and conversely. Denoting the resulting measures of effect with a double prime (″), we have RD″ = (1 − π1) − (1 − π2), RR″ = (1 − π1)/(1 − π2), and OR″ = [(1 − π1)π2]/[(1 − π2)π1]. It follows that RD″ = −RD and OR″ = 1/OR, but RR″ ≠ 1/RR. The failure of the risk ratio to demonstrate a reciprocal property when disease status is redefined is a distinct shortcoming of this measure of effect. For example, in a randomized controlled trial let “exposure” be active treatment (as compared to placebo) and let “disease” be death from a given cause. With π1 = .01 and π2 = .02, RR = .01/.02 = .5 and so treatment leads to an impressive decrease in the probability of dying. Looked at another way, RR″ = .99/.98 = 1.01 and so treatment results in only a modest improvement in the probability of surviving.
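A short numerical check of these reciprocal properties, using the trial example:

```python
# Reciprocal properties under relabeling, for pi_1 = .01, pi_2 = .02.
pi1, pi2 = 0.01, 0.02
RR = pi1 / pi2                               # 0.5
OR = (pi1 * (1 - pi2)) / (pi2 * (1 - pi1))   # about 0.495

# Redefine disease status: cases become noncases, and conversely.
RR2 = (1 - pi1) / (1 - pi2)                  # about 1.01, not 1/RR = 2
OR2 = ((1 - pi1) * pi2) / ((1 - pi2) * pi1)  # about 2.02, which is 1/OR
```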
Since 0 ≤ π1 ≤ 1, there are constraints placed on the values of RD and RR.
Specifically, for a given value of π2 , RD and RR must satisfy the inequalities 0 ≤
π2 + RD ≤ 1 and 0 ≤ RRπ2 ≤ 1; or equivalently, −π2 ≤ RD ≤ (1 − π2 ) and
0 ≤ RR ≤ (1/π2). In the case of a single 2 × 2 table, such as the one being considered
here, these constraints do not pose a problem. However, when several tables are
being analyzed and an overall measure of effect is being estimated, these constraints
have greater implications. First, there is the added complexity of finding an overall
measure that satisfies the constraints in each table. Second, and more importantly,
the constraint imposed by one of the tables may severely limit the range of possible
values for the measure of effect in other tables. The odds ratio has the attractive
property of not being subject to this problem. Solving (2.1) for π1 gives
π1 = ORπ2 / [ORπ2 + (1 − π2)].    (2.2)
Since 0 ≤ π2 ≤ 1 and OR ≥ 0, it follows that 0 ≤ π1 ≤ 1 for any values of OR and
π2 for which the denominator of (2.2) is nonzero. Figures 2.1(a) and 2.1(b), which are based on (2.2), show graphs of π1 as a function of π2 for OR = 2 and OR = 5.

FIGURE 2.1(a) π1 as a function of π2, with OR = 2
FIGURE 2.1(b) π1 as a function of π2, with OR = 5
As can be seen, the curves are concave downward in shape. By contrast, for given
values of RD and RR, the graphs of π1 = π2 + RD and π1 = RRπ2 (not shown) are
both linear; the former has a slope of 1 and an intercept of RD, while the latter has a
slope of RR and an intercept of 0.
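The curves in Figures 2.1(a) and 2.1(b) can be reproduced from (2.2); a minimal sketch (the function name is illustrative):

```python
# pi_1 as a function of pi_2 for a fixed odds ratio, from (2.2).
def pi1_from_odds_ratio(OR, pi2):
    return OR * pi2 / (OR * pi2 + (1 - pi2))

pi1_from_odds_ratio(2, 0.5)   # 0.667: pi_1 stays in [0, 1] for any OR >= 0
pi1_from_odds_ratio(5, 0.2)   # 0.556
```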
When choosing a measure of effect for a closed cohort study, it is useful to
consider the properties discussed above—that is, whether the measure of effect is
additive or multiplicative, intuitively appealing, exhibits reciprocal properties, and
imposes restrictions on the range of parameter values. However, a more fundamental
consideration is whether the measure of effect is consistent with the underlying
mechanism of the disease process. For example, if it is known that a set of exposures
exert their influence in an additive rather than a multiplicative fashion, it would
be appropriate to select the risk difference as a measure of effect in preference to
the risk ratio or odds ratio. Unfortunately, in most applications there is insufficient
substantive knowledge to help decide such intricate questions. It might be hoped that
epidemiologic data could be used to determine whether a set of exposures is operating additively, multiplicatively, or in some other manner. However, the behavior
of risk factors at the population level, which is the arena in which epidemiologic
research operates, may not accurately reflect the underlying disease process (Siemiatycki and Thomas, 1981; Thompson, 1991).
Walter (2000) has demonstrated that models based on the risk difference, risk
ratio, and odds ratio tend to produce similar findings, a phenomenon that will be illustrated later in this book. Currently, in most epidemiologic studies, some form of
multiplicative model is used. Perhaps the main reason for this emphasis is a practical
consideration: In most epidemiologic research the outcome variable is categorical
(discrete) and the majority of statistical methods, along with most of the statistical
packages available to analyze such data, are based on the multiplicative approach
(Thomas, 2000). In particular, the majority of regression techniques that are widely
used in epidemiology, such as logistic regression and Cox regression, are multiplicative in nature. For this reason the focus of this book will be on techniques that are
defined in multiplicative terms.
2.3 CONFOUNDING
One of the defining features of epidemiology as a field of inquiry is the concern
(some might say preoccupation) over a particular type of systematic error known as
confounding. In many epidemiologic studies the aim is to isolate the causal effect of
a particular exposure on the development of a given disease. When there are factors
that have the potential to result in a spurious increase or decrease in the observed
effect, the possibility of confounding must be considered. Early definitions of confounding were based on the concept of collapsibility, an approach which has considerable intuitive appeal. The current and widely accepted definition of confounding
rests on counterfactual arguments that, by contrast, are rather abstract. As will be
shown, the collapsibility and counterfactual definitions of confounding have certain
features in common. We will develop some preliminary insights into confounding
using the collapsibility approach and then proceed to a definition of confounding
based on counterfactual arguments (Greenland et al., 1999).
2.3.1 Counterfactuals, Causality, and Risk Factors
The concept of causality has an important place in discussions of confounding (Pearl,
2000, Chapter 6). The idea of what it means for something to “cause” something else
is a topic that has engaged philosophers for centuries. Holland (1986) and Greenland
et al. (1999) review some of the issues related to causality in the context of inferential statistics. A helpful way of thinking about causality is based on the concept of
counterfactuals. Consider the statement “smoking causes lung cancer,” which could
be given the literal interpretation that everyone who smokes develops this type of
tumor. As is well known, there are many people who smoke but do not develop
lung cancer and, conversely, there are people who develop lung cancer and yet have
never smoked. So there is nothing inevitable about the association between smoking
and lung cancer, in either direction. One way of expressing a belief that smoking is
causally related to lung cancer is as follows: We imagine that corresponding to an
individual who smokes there is an imaginary individual who is identical in all respects, except for being a nonsmoker. We then assert that the risk of lung cancer in
the person who smokes is greater than the risk in the imaginary nonsmoker. This type
of argument is termed counterfactual (counter to fact) because we are comparing an
individual who is a known smoker with the “same” individual minus the history of
smoking.
Epidemiologists are usually uncomfortable making claims about causality, generally preferring to discuss whether an exposure and disease are associated or related.
The term “risk factor” imparts a sense of causality and at the same time is appropriately conservative for an epidemiologic discussion. So instead of referring to smoking as a cause of lung cancer, it would be usual in an epidemiologic context to say
that smoking is a risk factor for this disease. The term risk factor is also used for
any condition that forms part of a causal chain connecting an exposure of interest
to a given disease. For example, a diet deficient in calcium can lead to osteoporosis, and this can in turn result in hip fractures. We consider both calcium deficiency
and osteoporosis to be risk factors for hip fractures. Sometimes the definition of what
constitutes a risk factor is broadened to include characteristics that are closely associated with a causal agent but not necessarily causal themselves. In this sense, carrying
a lighter can be considered to be a risk factor for lung cancer. We will restrict our
use of the term risk factor to those characteristics that have a meaningful etiologic
connection with the disease in question.
2.3.2 The Concept of Confounding
The type of problem posed by confounding is best illustrated by an example. Imagine
a closed cohort study investigating alcohol consumption as a possible risk factor for
lung cancer. The exposed cohort consists of a group of individuals who consume
alcohol (drinkers) and the unexposed cohort is a group who do not (nondrinkers).
Setting aside the obvious logistical difficulties involved in conducting such a study,
suppose that at the end of the period of follow-up the proportion of drinkers who
develop lung cancer is greater than the corresponding proportion of nondrinkers.
This might be regarded as evidence that alcohol is a risk factor for lung cancer,
but before drawing this conclusion we must consider the well-known association
between drinking and smoking. Specifically, since smoking is a known cause of lung
cancer, and smoking and drinking are lifestyle habits that are often associated, there
is the possibility that drinking may only appear to be a risk factor for lung cancer
because of the intermediate role played by smoking.
These ideas are captured visually in Figure 2.2(a), which is referred to as a causal
diagram. In the diagram we use E, D and F to denote drinking (exposure), lung
cancer (disease) and smoking (intermediate factor), respectively. The unidirectional
solid arrow between smoking and lung cancer indicates a known causal relationship,
the bidirectional solid arrow between drinking and smoking stands for a known noncausal association, and the unidirectional dashed arrow between drinking and lung cancer represents an association that results from smoking acting as an intermediate factor.

FIGURE 2.2(a) Causal diagram for drinking as a risk factor for lung cancer [nodes: Drinking (E), Smoking (F), Lung cancer (D)]
A quantitative approach to examining whether smoking results in a spurious association between drinking and lung cancer involves stratifying (dividing) the cohort
into smokers and nonsmokers, and then reanalyzing the data within strata. Stratification ensures that the subjects in each stratum are identical with respect to smoking
status. So if the association between drinking and lung cancer is mediated through
smoking, this association will vanish within each of the strata. In a sense, stratifying
by smoking status breaks the connection between drinking and lung cancer in each
stratum by blocking the route through smoking. In fact, drinking is not a risk factor
for lung cancer and so, random error aside, within each smoking stratum the proportion of drinkers who develop lung cancer will be the same as the proportion of
nondrinkers. So after accounting (controlling, adjusting) for smoking we conclude
that drinking is not a risk factor for this disease. In the crude (unstratified) analysis,
drinking appears to be a risk factor for lung cancer due to what we will later refer
to as confounding by smoking. The essential feature of smoking which enables it to
produce confounding is that it is associated with both drinking and lung cancer.
Now imagine a closed cohort study investigating calcium deficiency (E) as a risk
factor for hip fractures (D). We have already noted that calcium deficiency leads to
osteoporosis (F) and that both calcium deficiency and osteoporosis cause hip fractures. These associations are depicted in Figure 2.2(b). By analogy with the previous
example it is tempting to regard osteoporosis as a source of confounding. However,
the situation is different here in that osteoporosis is a step in the causal pathway
between calcium deficiency and hip fractures. Consequently, osteoporosis does not
induce a spurious risk relationship between calcium deficiency and hip fractures but
rather helps to explain a real causal connection. For this reason we do not consider
osteoporosis to be a source of confounding.
As with any mathematical construct, the manner in which confounding is operationalized for the purposes of data analysis is a matter of definition; and, as we will
see, different definitions are possible. The process of arriving at a definition of confounding is an inductive one, with concrete examples examined for essential features
which can then be given a more general formulation. The preceding hypothetical
studies illustrate some of the key attributes that should be included as part of a definition of confounding, and these requirements will be adhered to as we explore the concept further.

FIGURE 2.2(b) Causal diagram for calcium deficiency as a risk factor for hip fractures [nodes: Calcium deficiency (E), Osteoporosis (F), Hip fractures (D)]

Specifically, for a variable F to be a source of confounding (confounder) we require that F satisfy the following conditions: F must be a risk factor
for the disease, and F must be associated with the exposure. To these two conditions
we add the requirement that F must not be part of the causal pathway between the
exposure and the disease.
2.3.3 Some Hypothetical Examples of Closed Cohort Studies
As illustrated in the preceding section, stratification plays an important role in the
analysis of epidemiologic data, especially in connection with confounding. In this
section we examine a series of hypothetical closed cohort studies in order to develop
a sense of how the risk difference, risk ratio, and odds ratio behave in crude and stratified 2×2 tables. This will motivate an analysis that will be useful in the discussion of
confounding. In an actual cohort study, subjects are randomly sampled from a population, a process that introduces random error. For the remainder of this chapter it is
convenient to avoid issues related to random error by assuming that the entire population has been recruited into the cohort and that, for each individual, the outcome
with respect to developing the disease is predetermined (although unknown to the
investigator). In this way we replace the earlier probabilistic (stochastic) approach
with one that is deterministic. Strictly speaking, we should now refer to π1 and π2
in Table 2.1(b) as proportions rather than probabilities because there is no longer a
stochastic context. However, for simplicity of exposition we will retain the earlier
terminology. In what follows, we continue to make reference to the population, but
will now equate it with the cohort at the start of follow-up.
Tables 2.2(a)–2.2(e) give examples of closed cohort studies in which there are
three variables: exposure (E), disease (D), and a stratifying variable (F). We use
E = 1, D = 1, and F = 1 to denote the presence of an attribute and use E =
2, D = 2, and F = 2 to indicate its absence. Here, as elsewhere in the book, a
dot • denotes summation over all values of an index. We refer to the tables with the
TABLE 2.2(a) Hypothetical Closed Cohort Study: F Is Not a Risk Factor for the Disease and F Is Not Associated with Exposure

            F = 1            F = 2            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1       70     40       140     80       210    120
D = 2       30     60        60    120        90    180
           100    100       200    200       300    300

RD          .30              .30              .30
RR          1.8              1.8              1.8
OR          3.5              3.5              3.5
TABLE 2.2(b) Hypothetical Closed Cohort Study: F Is Not a Risk Factor for the Disease and F Is Not Associated with Exposure

            F = 1            F = 2            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1       70     40       160     80       230    120
D = 2       30     60        40    120        70    180
           100    100       200    200       300    300

RD          .30              .40              .37
RR          1.8              2.0              1.9
OR          3.5              6.0              4.9
TABLE 2.2(c) Hypothetical Closed Cohort Study: F Is Not a Risk Factor for the Disease and F Is Associated with Exposure

            F = 1            F = 2            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1       70     80       160     40       230    120
D = 2       30    120        40     60        70    180
           100    200       200    100       300    300

RD          .30              .40              .37
RR          1.8              2.0              1.9
OR          3.5              6.0              4.9
TABLE 2.2(d) Hypothetical Closed Cohort Study: F Is a Risk Factor for the Disease and F Is Not Associated with Exposure

            F = 1            F = 2            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1       90     60        80     20       170     80
D = 2       10     40       120    180       130    220
           100    100       200    200       300    300

RD          .30              .30              .30
RR          1.5              4.0              2.1
OR          6.0              6.0              3.6
TABLE 2.2(e) Hypothetical Closed Cohort Study: F Is a Risk Factor for the Disease and F Is Associated with Exposure

            F = 1            F = 2            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1       90    120        30     10       120    130
D = 2       10     80       170     90       180    170
           100    200       200    100       300    300

RD          .30              .05             −.03
RR          1.5              1.5              .92
OR          6.0              1.6              .87
TABLE 2.2(f) Hypothetical Closed Cohort Study: F Is a Risk Factor for the Disease and F Is Associated with Exposure

            F = 1            F = 2            F = 3            F = •
          E = 1  E = 2     E = 1  E = 2     E = 1  E = 2     E = 1  E = 2
D = 1      140     50       120     20        70     90       330    160
D = 2       60     50       180    180        30    210       270    440
           200    100       300    200       100    300       600    600

RD          .20              .30              .40              .28
RR          1.4              4.0              2.3              2.1
OR          2.3              6.0              5.4              3.4
headings “F = 1” and “F = 2” as the stratum-specific tables and refer to the table
with the heading “F = •” as the crude table. The crude table is obtained from the
stratum-specific tables by collapsing over F—that is, summing over strata on a cell-by-cell basis. The interpretation of the subheadings of the tables will become clear
shortly.
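The stratum-specific and crude values reported in these tables can be reproduced with a short sketch; the counts below are those of Table 2.2(a), and the helper function is illustrative:

```python
# Stratum-specific and crude measures of effect from 2 x 2 counts,
# laid out as (exposed cases, exposed noncases, unexposed cases,
# unexposed noncases).
def measures(a1, b1, a2, b2):
    pi1, pi2 = a1 / (a1 + b1), a2 / (a2 + b2)
    RD = pi1 - pi2
    RR = pi1 / pi2
    OR = (pi1 * (1 - pi2)) / (pi2 * (1 - pi1))
    return RD, RR, OR

strata = [(70, 30, 40, 60), (140, 60, 80, 120)]   # F = 1 and F = 2
crude = tuple(sum(c) for c in zip(*strata))       # collapse over F, cell by cell
print([measures(*s) for s in strata])   # RD = .30, RR = 1.75, OR = 3.5 in each
print(measures(*crude))                 # identical crude values (RR rounds to 1.8)
```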
In Table 2.2(a), for each measure of effect, the stratum-specific values are equal
to each other and to the crude value. In fact, the entries in stratum 2 are, cell by cell,
double those in stratum 1. There would seem to be little reason to retain stratification
when analyzing the data in Table 2.2(a). In Tables 2.2(b) and 2.2(c), for each measure
of effect,