We propose random-effects models to summarize and quantify the accuracy of the diagnosis of multiple lesions on a single image without assuming independence between lesions. The number of false-positive lesions was assumed to follow a Poisson mixture, and the proportion of true-positive lesions a binomial mixture. We considered both univariate and bivariate mixture models, in parametric and nonparametric form. We applied our tools to simulated data and to data from a study assessing the diagnostic accuracy of virtual colonography with computed tomography in 200 patients suspected of having one or more polyps.

Diagnostic accuracy is usually summarized with the true-positive rate (TPR) or sensitivity and the false-positive rate (FPR) or one minus specificity. Some diagnostic tasks are more complicated than simple detection of a single occurrence of the (abnormal) condition, and in such cases calculating the TPR and FPR may prove difficult. In this paper, we discuss as an example the diagnostic accuracy of virtual colonography in locating colonic polyps and documenting their size or severity. Accurate, sensitive diagnosis is essential because polyps may develop into tumor tissue, but diagnosis must also be sufficiently specific to minimize invasive procedures. Patients often have more than one polyp, any of which may be seen or missed. Other examples of such a diagnostic situation are the detection of multiple lesions on mammography, or of multiple infarcts on computed tomography (CT) or magnetic resonance imaging in a patient suspected of having a stroke.

Zhou and others (2002, p. 43) describe several methods of summarizing diagnostic accuracy for such data. A commonly used method is the so-called “per-lesion” approach, in which sensitivity is calculated as the fraction of correctly identified polyps. It seems clinically meaningful to assess sensitivity per lesion because of the importance of diagnosing each polyp. Specificity is, however, difficult to define in this case because there is an almost infinite number of locations in the colon where there could be a polyp. Zhou and others (2002) suggested a simple way of addressing this difficulty by dividing the image (the colon in this case) into segments and calculating sensitivity and specificity per segment. Unfortunately, it is not always obvious how to define the segments. The approach also weights segments with one or with many true-positive (TP) or false-positive (FP) polyps equally, whereas, intuitively, a segment with many polyps contains more information than a segment with only one. Egan and Schulman (1961), Bunch and others (1978), and Chakraborty and Winder (1990) developed the “free-response receiver-operating-characteristic (ROC) curve” (FROC) to analyze multiple lesions. The y-axis of the FROC curve is the probability of both detecting and correctly locating the lesions; the x-axis is the average number of FPs per patient. The summary measure of the FROC curve may be interpreted as the average fraction of lesions detected on an image before an observer makes an FP error. Unfortunately, the approach assumes independence between multiple findings on the same image of a patient.

In this paper, we propose random-effects models to summarize and quantify the accuracy of the diagnosis of multiple lesions on a single image without the independence assumption. In Section 2, we describe the statistical models underlying our approach. In Section 3, we illustrate the methods with simulated data, and in Section 4, we apply them to the data of a study assessing the diagnostic accuracy of virtual colonography with CT in 200 patients suspected of having one or more polyps.

2. METHODS

We consider the situation where N patients undergo a diagnostic test to detect and localize possibly multiple lesions. For each patient, a gold standard is required such that for each lesion identified by the diagnostic test, it is known whether it is an FP or a TP diagnosis.

2.1 Specificity

Since the exact locations of the FPs are of lesser importance, we consider the number of FP lesions in patient i, Xi say, and assume that Xi follows the Poisson distribution with expectation μ. The specificity is then defined as the probability of zero FP lesions in a patient: Pr(X=0|μ) = e^{−μ}. Given a sample of observations x1,…,xN, the maximum likelihood estimator of the parameter μ is μ̂ = ∑_i x_i/N with standard error se(μ̂) = √(μ̂/N), and the 95% confidence interval of the specificity is given by exp(−μ̂ ± 1.96 × se(μ̂)) (Johnson and others, 1992).
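
As a minimal sketch of this calculation, the snippet below computes the Poisson-based specificity and its Wald-type 95% confidence interval; the simulated counts and intensity are only illustrative stand-ins for observed FP counts.

```r
## Sketch: specificity under the simple Poisson model with the Wald-type 95% CI
## described above (x stands in for the observed FP counts; here simulated).
set.seed(8)
x <- rpois(200, lambda = 0.47)

mu_hat <- mean(x)                                  # MLE of the Poisson intensity
se_mu  <- sqrt(mu_hat / length(x))                 # se of the MLE
spec   <- exp(-mu_hat)                             # Pr(X = 0)
ci     <- exp(-(mu_hat + c(1.96, -1.96) * se_mu))  # lower, upper 95% limits
c(specificity = spec, lower = ci[1], upper = ci[2])
```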

The assumption of a common FPR across patients is often too restrictive: often the variance of the number of FPs is larger than the mean, violating the Poisson assumption. A common method to relax this assumption is to assume that the intensity parameter varies randomly between patients, according to a distribution with density g(μ|θ). Specificity is now defined as

Pr(X=0|θ) = ∫ e^{−μ} g(μ|θ) dμ, (2.1)

a weighted average of specificity values across patients with weights given by g(μ|θ).

The density function g(·) is unknown and must be estimated from the data. The most commonly used approach is to assume a gamma distribution g(μ|a,b), with shape parameter a and rate parameter b. The integral then has a closed-form solution, and the specificity is

Pr(X=0|a,b) = (b/(b+1))^a. (2.2)

Maximum likelihood estimates of (a,b) are easy to calculate (see, e.g. Johnson and others, 1997) but have no closed-form expression. Estimating the 95% confidence interval of the specificity is also awkward, but a nonparametric bootstrap procedure performs well.
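
A hedged sketch of such a fit is given below. It uses the fact that the gamma(shape a, rate b)–Poisson mixture is a negative binomial distribution; the toy data, starting values, and number of bootstrap replicates are assumptions, not values from the study.

```r
## Sketch: ML fit of the gamma-Poisson mixture (negative binomial with size = a,
## mean = a/b) and the specificity (2.2), with a nonparametric bootstrap CI.
neg_loglik <- function(par, x) {
  a <- exp(par[1]); b <- exp(par[2])        # shape and rate on the log scale
  -sum(dnbinom(x, size = a, mu = a / b, log = TRUE))
}

fit_gamma_poisson <- function(x) {
  opt <- optim(c(0, 0), neg_loglik, x = x)
  a <- exp(opt$par[1]); b <- exp(opt$par[2])
  c(a = a, b = b, specificity = (b / (b + 1))^a)
}

set.seed(7)
x <- rpois(200, rgamma(200, shape = 0.6, rate = 1.2))   # toy over-dispersed counts
fit_gamma_poisson(x)

## Nonparametric bootstrap for a 95% CI of the specificity
boot <- replicate(200, fit_gamma_poisson(sample(x, replace = TRUE))["specificity"])
quantile(boot, c(0.025, 0.975))
```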

Other parametric functions for g(·) can be used (e.g. a log-normal distribution), but in general there is little evidence to justify any particular choice. One might therefore prefer to specify a fully nonparametric distribution for g(·) (Carlin and Louis, 1996, p. 51). Laird (1978) proved that the nonparametric maximum likelihood estimate of g(·) is a discrete distribution with at most N mass points. Assume that g(·) has mass points θ = (θ1,…,θJ) with probabilities w = (w1,…,wJ), where J ≤ N; then the likelihood of the observations x1,x2,…,xN is given by

L(x1,…,xN|θ,w) = ∏_{i=1}^{N} ∑_{j=1}^{J} wj e^{−θj} θj^{xi}/xi!. (2.3)

This likelihood is easily maximized iteratively as a function of θ and w using an expectation–maximization (EM) algorithm (Carlin and Louis, 1996). Given the maximum likelihood estimates θ̂ and ŵ, the specificity is estimated as Pr(X=0) = ∑_j ŵj e^{−θ̂j}. Calculating the 95% confidence interval is very awkward in this case too, but we have found that a nonparametric bootstrap approach works well.
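
The following is a simplified sketch of such an EM fit. To keep it short it updates only the weights over a fixed grid of mass points (the full nonparametric MLE would also move the mass points); the grid and the toy data are assumptions.

```r
## Minimal sketch: EM for a nonparametric Poisson mixture on a fixed grid of
## support points, returning the estimated specificity of (2.3).
np_poisson_mixture <- function(x, theta = seq(0.01, max(x), length.out = 20),
                               n_iter = 200) {
  J <- length(theta)
  w <- rep(1 / J, J)                       # initial weights
  for (it in seq_len(n_iter)) {
    # E-step: posterior probability that observation i belongs to component j
    dens <- outer(x, theta, dpois)         # N x J matrix of Poisson densities
    num  <- sweep(dens, 2, w, `*`)
    post <- num / rowSums(num)
    # M-step: update the mixing weights (the mass points stay on the grid)
    w <- colMeans(post)
  }
  specificity <- sum(w * exp(-theta))      # Pr(X = 0) under the fitted mixture
  list(theta = theta, w = w, specificity = specificity)
}

## Example use on simulated over-dispersed counts
set.seed(1)
x <- rpois(50, lambda = rgamma(50, shape = 0.5, rate = 0.02))
np_poisson_mixture(x)$specificity
```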

2.2 Sensitivity

In contrast to the FPs, the exact locations of the TP lesions are of primary importance, and we therefore consider the outcome of the diagnostic test for each lesion in patient i separately. Define Yij as the result of the diagnostic test at the location of the jth lesion in patient i: Yij = 1 when the diagnostic test is positive and Yij = 0 if it is negative.

Sensitivity is defined as the probability of a TP test result. As with the specificity, we allow the sensitivity γi to vary between patients. Given γi, the outcomes of the diagnostic test in patient i at locations j and l, yij and yil, are assumed to be independent for all j, l. The likelihood of the test results of the ki lesions, yi1,…,yiki, therefore equals L(yi|γi) = ∏_{j=1}^{ki} γi^{yij}(1−γi)^{1−yij} = γi^{si}(1−γi)^{ki−si}, with si = ∑_j yij. When ki is small, the maximum likelihood estimate of γi is very unstable. For this reason, we again assume that in the population of patients γ follows a distribution with density g(γ|θ). A common choice for g(·) is the beta distribution with parameters θ = (a,b) (Ryea and others, 2007). The likelihood of the observations yi given θ = (a,b) is L(yi|(a,b)) = ∫_0^1 L(yi|γ) g(γ|(a,b)) dγ. The beta distribution is again a convenient choice because this integral has a closed-form solution.

Again there is usually little evidence to prefer a particular distribution, and therefore one might again prefer a nonparametric specification. As in the case of the specificity, we use a discrete distribution with J classes with mass points θ = (θ1,…,θJ) and probabilities w = (w1,…,wJ), where J ≤ N. The likelihood of the diagnostic testing of the ki lesions in patient i is given by

L(yi|θ,w) = ∑_{j=1}^{J} wj θj^{si} (1−θj)^{ki−si}. (2.4)

This likelihood is also maximized iteratively as a function of θ and w using an EM algorithm.

Given the maximum likelihood estimates θ̂ and ŵ, the (per-lesion) sensitivity is estimated as ∑_j ŵj θ̂j. Again we use a bootstrap approach to estimate the 95% confidence interval. The probability that there is at least one TP lesion in a patient (the “per-patient sensitivity”) equals 1 − ∑_j ŵj (1−θ̂j)^k, where k is the (mean) number of lesions in the patients.
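
As a small sketch of these two summaries, the snippet below evaluates the per-lesion and per-patient sensitivity from a fitted binomial mixture; the mass points and weights used here are illustrative values loosely based on the simulation results in Section 3, not estimates from the colonography data.

```r
## Sketch: per-lesion and per-patient sensitivity from a fitted binomial mixture.
theta <- c(0.09, 0.47, 0.90)       # illustrative mass points
w     <- c(0.45, 0.11, 0.44)       # illustrative weights
k     <- 4                          # number of lesions in a patient

per_lesion  <- sum(w * theta)               # marginal lesion sensitivity
per_patient <- 1 - sum(w * (1 - theta)^k)   # Pr(at least one lesion detected)
c(per_lesion = per_lesion, per_patient = per_patient)
```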

2.3 A generalization and an alternative

The model can be generalized in several ways. First, both specificity and sensitivity may depend on observed patient characteristics; including this dependence in the model may explain part of the variation between patients. For the FP lesions, the logarithm of the Poisson intensity parameter μi in patient i may depend in a linear fashion on a set of covariate values observed for patient i: log μi = β0 + β1 zi1 + … + ei, where the β's are regression parameters and ei is a residual component. The lesion-specific sensitivity of patient i may also be modeled as a function of covariates, logit(γi) = β0 + β1 zi1 + … + ei.

Instead of modeling the correlations between (false) lesions in the same patient with a random effect, these correlations may be ignored. The analysis then boils down to simple Poisson and binomial regressions, and the associated standard errors can be corrected with a generalized estimation equation approach (Martus and others, 2004).
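
A rough sketch of this GEE alternative for the per-lesion results is given below. The geepack package is our choice for illustration (the paper cites Martus and others, 2004, for the general approach), and the toy data frame and its columns are hypothetical.

```r
## Sketch: binomial regression for the per-lesion results with GEE-corrected
## (sandwich) standard errors that account for clustering within patients.
library(geepack)
set.seed(2)

## Hypothetical long-format data: one row per true lesion, with a 0/1 detection
## indicator `y` and a patient identifier (toy values, not study data).
lesions <- data.frame(patient = rep(1:30, each = 3),
                      y = rbinom(90, 1, 0.6))

fit_tp <- geeglm(y ~ 1, id = patient, family = binomial,
                 corstr = "exchangeable", data = lesions)
summary(fit_tp)

## The FP counts (one count per patient) can be handled analogously with a
## Poisson GEE; the random-effect route could instead use, e.g.,
## lme4::glmer(y ~ 1 + (1 | patient), family = binomial, data = lesions).
```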

2.4 Bivariate relation between specificity and sensitivity

Multiple-lesion data allow estimation of the relation between the per-patient specificity and the per-lesion sensitivity, even with only one cutpoint. The relation can be modeled directly in a bivariate random-effects model in order to obtain the ROC curve. In such a model, the Poisson parameter υi = log(μi) and the logit-transformed parameter ξi = logit(γi) are assumed to have a bivariate distribution g(υ,ξ) across patients, where it helps to model ξi as ξi = a + b υi + ϵi. An obvious choice for g(·,·) is the bivariate normal distribution with expectations E(υ) = μυ and E(ξ) = a + b × μυ and covariance matrix Σ. The likelihood of the observations of patient i is

L(xi,yi) = ∫∫ Pr(xi|e^{υ}) L(yi|logit^{−1}(ξ)) g(υ,ξ) dυ dξ, (2.5)

where Pr(xi|μ) is the Poisson probability of xi FP lesions and L(yi|γ) is the binomial likelihood defined above. This model can be estimated in standard software such as SAS proc NLMIXED, and a Bayesian version is easily implemented in WinBUGS.
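
As a sketch of what (2.5) involves, the following evaluates the integral for one patient by crude Monte Carlo over the bivariate normal random effect; the parameter values and the patient's data are illustrative, and a full fit would maximize the sum of such terms over patients (e.g. by adaptive quadrature in NLMIXED).

```r
## Sketch: Monte Carlo evaluation of the bivariate likelihood (2.5) for one patient.
library(MASS)  # for mvrnorm

loglik_patient <- function(x_i, s_i, k_i, mu, Sigma, n_draws = 1e4) {
  re <- mvrnorm(n_draws, mu = mu, Sigma = Sigma)   # draws of (upsilon, xi)
  lambda <- exp(re[, 1])                           # Poisson intensity
  gamma  <- plogis(re[, 2])                        # lesion sensitivity
  contrib <- dpois(x_i, lambda) * dbinom(s_i, k_i, gamma)
  log(mean(contrib))                               # MC estimate of log L_i
}

## Illustrative call: 2 FP lesions, 3 of 4 true lesions detected
loglik_patient(x_i = 2, s_i = 3, k_i = 4,
               mu = c(log(0.5), 1), Sigma = diag(c(0.5, 1)))
```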

Instead of the bivariate normal distribution, a nonparametric distribution can be used, defined on a 2-dimensional grid of points θrs = (υr, ξs) with weights wrs (r = 1,…,R; s = 1,…,S). The likelihood of patient i is then

L(xi,yi) = ∑_{r=1}^{R} ∑_{s=1}^{S} wrs Pr(xi|e^{υr}) L(yi|logit^{−1}(ξs)). (2.6)

This model is not readily available in standard software, but the log-likelihood is easily maximized with an EM algorithm.

3. SIMULATION

We simulated data loosely based on the example of diagnosing colon polyps (lesions) with virtual colonography, which we discuss in Section 4. We performed several simulations, and we show here an extreme and slightly pathological example. For ease of simulation, lesions (and nonlesions) were assumed to be characterized by a single variable only, which we might call the lesion thickness, and each patient was assumed to consist of 50 locations. Data for 50 patients at the 50 locations were simulated as follows. First, the number of true lesions per patient was sampled from the Poisson distribution with an expectation of 5 lesions. The thickness of each lesion in a patient was sampled from a normal distribution; the means and variances of this distribution varied between patients, and these patient-specific parameters were themselves sampled from a normal distribution (mean) and a gamma distribution (variance) with fixed parameters. The thickness variable at the nonlesion locations was also sampled from a normal distribution with means and variances sampled from normal and gamma distributions. Finally, a threshold was chosen such that the overall fraction of identified true lesions was fixed (30%, 50%, or 70%), and afterwards the number of FP lesions and the number of TP lesions were counted for each patient.
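
A condensed sketch of this data-generating process is given below; the specific distributional parameters are illustrative assumptions, since the paper does not report them all.

```r
## Condensed sketch of the simulation design (illustrative parameter values).
set.seed(3)
n_pat <- 50; n_loc <- 50

sim_one_patient <- function() {
  n_lesion <- min(rpois(1, 5), n_loc)                  # true lesions per patient
  mu_les <- rnorm(1, 2, 0.5); sd_les <- sqrt(rgamma(1, 2, 2))
  mu_non <- rnorm(1, 0, 0.5); sd_non <- sqrt(rgamma(1, 2, 2))
  thick <- c(rnorm(n_lesion, mu_les, sd_les),          # lesion locations
             rnorm(n_loc - n_lesion, mu_non, sd_non))  # non-lesion locations
  data.frame(lesion = rep(c(1, 0), c(n_lesion, n_loc - n_lesion)),
             thickness = thick)
}

dat <- do.call(rbind,
               lapply(seq_len(n_pat), function(i) cbind(patient = i, sim_one_patient())))

## Threshold chosen so that 50% of all true lesions are called positive,
## then count TP and FP lesions per patient.
thr <- quantile(dat$thickness[dat$lesion == 1], probs = 0.5)
dat$positive <- dat$thickness > thr
with(dat, table(lesion, positive))
```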

We discuss results from one simulation with a threshold chosen such that 50% of all lesions were identified. In the 50 patients, the number of lesions varied between 0 (in 1 patient) and 9 with a mean of 4.8 and standard deviation 2.0. There were 1124 FP lesions, and the number of FP lesions varied between 0 and 49 per patient with mean 22.48 and variance 388.87, see Figure 1. The histogram seems to consist of 3 subgroups of patients with few, moderate, or large numbers of FP lesions. The gamma and normal distributions for the random effect will probably not fit well for these data, and this example will therefore favor the nonparametric approach.

Fig. 1.

Histogram of the number of FP lesions per patient in the simulated data.

There were 14 patients without FP lesions, and the specificity could therefore be calculated as 14/50 = 28%. According to the simple Poisson model (log-likelihood ℓ ∝ −644.97), the intensity μ is estimated as 22.48 (se = 0.67) and the specificity is therefore e^{−22.48} ≈ 0%, which is clearly much too low. The generalized estimation equation (GEE) approach, implemented in SAS proc genmod, does not correct this bias but does increase the standard error of the estimate to 2.76. The parameters of the gamma–Poisson model (log-likelihood ℓ ∝ −198.38) were estimated as log(a) = −0.84 and log(b) = −3.95, yielding an intensity estimate of a/b = e^{log(a)−log(b)} = 22.42 and specificity 18%, which is still too low. The log-normal–Poisson mixture model gave a similar estimate of the specificity (15%). The nonparametric Poisson mixture (log-likelihood ℓ ∝ −164.63) resulted in a mixture of 4 patient subgroups: patients with an intensity of 0.01 FPs (weight w = 0.30), patients with about 7 FPs (w = 0.13), patients with about 18 FPs (w = 0.11), and patients with about 42 FPs (w = 0.46). The estimated specificity is 28% with 95% confidence interval (17–42%). Note that this estimate corresponds to the ratio of the number of patients without FPs to the total number of patients. This correspondence was seen in almost all simulations.

A total of 234 lesions were present in 49 of the 50 simulated patients, and the number per patient varied between 1 and 9 with mean 4.8 (standard deviation 2.0). With the chosen threshold, 117 lesions were identified, yielding a lesion sensitivity of 117/234 = 50%. The log-likelihood of this binomial model was ℓ = −130.21. There were 32 patients in whom at least one lesion was identified (32/49 = 65%). For a patient with 4 lesions, the probability of observing at least one lesion under independence is therefore 1 − (1 − 0.50)^4 = 94%. The observed proportion of identified lesions per patient varied between 0 and 1 with mean 0.49 and standard deviation 0.44. This variation cannot be explained by sampling alone, since the beta-binomial model (ℓ = −74.15) and the logit-normal–binomial model fitted better than the binomial model. The mean lesion sensitivity was 50% in both models, but the probability of observing at least one lesion in a patient with 4 lesions is much smaller: about 65%. The nonparametric binomial mixture model had log-likelihood ℓ = −76.29 and yielded 3 subgroups of patients with lesion sensitivities of 9%, 47%, and 90% and weights 45%, 11%, and 44%, respectively. The probability of observing at least one lesion in a patient with 4 lesions was 65%. Note the close correspondence between the “estimated” probabilities from the random-effect models of identifying at least one lesion in a patient and the “observed” rate of patients with at least one lesion identified. Again this was observed in almost all simulations.

The log-likelihood of the nonparametric bivariate model was −234.17, which was slightly larger than the sum of the log-likelihoods of the 2 univariate nonparametric mixture models ((−164.63) + (−76.29) = −240.92). The correlation between the logit-transformed sensitivity and the Poisson parameter was estimated as −0.20.

4. APPLICATION: THE VIRTUAL COLONOGRAPHY STUDY

We now apply our methods to data from a study evaluating the test characteristics of CT colonography at different levels of radiation dose. In this study, 200 patients at risk for colorectal cancer were evaluated for the presence of one or more polyps, with colonoscopy as the reference standard. Here only lesions > 6 mm are considered. Colonographic lesions were identified in a 3D display mode, and a lesion was defined as a TP if it was located in the same segment and had similar size and appearance to a lesion identified by colonoscopy. Detailed information can be found in van Gelder and others (2004).

Of the 200 patients, there were 174 patients without polyps and 26 patients with between 1 and 7 lesions; in total there were 44 lesions. Of these 44 lesions, 32 were identified by virtual colonography (73%); the percentage of lesions identified by virtual colonography varied across the 26 patients between 0% and 100% (mean 60%, standard deviation 47%).

FP lesions were observed in 68 patients. There were 93 FP lesions in total, and in these patients the number of FP lesions varied between 1 and 3; over all 200 patients the mean number of FP lesions was 0.5 and the variance 0.6. Note that the variance is about the same as the mean, which suggests that the differences between patients in the number of FPs are largely due to chance and not to systematic differences between patients.

4.1 Specificity

The different estimates of the specificity are given in Table 1. Of the 174 patients without lesions, there were 118 without FP lesions, and hence the specificity could be defined as 118/174 = 68%. Alternatively, assuming that the number of FP lesions follows a Poisson distribution with the same intensity μ for all patients, including the 26 with lesions, the maximum likelihood estimate is the mean number of FP lesions: μ̂ = 0.47, with standard error se(μ̂) = √(μ̂/N) = 0.048. The specificity is then estimated as e^{−0.47} = 63% with 95% confidence interval exp(−0.47 ± 1.96 × 0.048), that is, 57–69%. Using a GEE approach, assuming exchangeable covariance between the FPs, the standard error of μ̂ is slightly increased to 0.054, yielding the same estimate but a slightly wider confidence interval for the specificity: 56–70%.

Table 1.

Results of the different models to estimate patient-specific specificity values

Model                             Log-likelihood     AIC       Specificity
Poisson                               -183.97       369.95     0.63 (0.57–0.69)
Poisson + GEE                            —             —        0.63 (0.56–0.70)
Gamma–Poisson mixture                 -182.17       368.34     0.66 (0.59–0.74)
Nonparametric Poisson mixture         -181.93       369.86     0.66 (0.58–0.73)

Both random-effect models have higher log-likelihood values than the Poisson model, pointing to (small) systematic differences between patients with respect to the number of FP lesions. The mixing distribution of the nonparametric mixture model was reduced to 2 mass points (one at almost 0 with weight 0.56 and one at 1.41 with weight 0.44).

4.2 Sensitivity

The different estimates of the sensitivity are given in Table 2. Of the 26 patients with lesions, there were 17 in whom at least one lesion was recognized. Hence, the patient-specific sensitivity is 17/26=65% (95% confidence interval: 47−84%). Of the 44 lesions, there were 32 identified by colonography, and if all lesions are independent, the lesion sensitivity can be estimated as 32/44=73% (95% confidence interval: 58−84%). The Akaike information criterion (AIC) value of the beta-binomial model is smaller than that of the binomial model, again pointing to systematic differences between patients with respect to lesion sensitivity. The nonparametric binomial mixture distribution was reduced to 3 points (at γ about 0.36, about 0.44, and 0.85 with weights 0.30, 0.13, and 0.57).

Table 2.

Results of the different models to estimate lesion-specific sensitivity values

Model                              Log-likelihood     AIC       Lesion sensitivity
Binomial                                -24.40        50.79     0.73 (0.58–0.84)
Binomial + GEE                             —             —       0.73 (0.57–0.85)
Beta-binomial                           -20.99        45.98     0.63 (0.50–0.77)
Nonparametric binomial mixture          -20.89        51.78     0.65 (0.47–0.79)

4.3 Bivariate model

The deviance of the model with bivariate normal random effects υ = log(μ) and ξ = logit(γ) was 369.8, which was slightly smaller than the sum of the deviances of the log-normal–Poisson mixture and the logit-normal–binomial mixture models (deviance = 371.3), although, with one parameter more, the AIC value was slightly larger: 379.8 versus 379.3. The dependence between the logit-transformed sensitivity (ξ) and the Poisson parameter υ = log(μ) was estimated as E(ξ) = 0.93 + 0.007 × log(μ) (see Figure 2), but the correlation between log(μ) and ξ was estimated as close to 0 (0.0002). This was also found in the nonparametric bivariate model; the AIC value of the nonparametric bivariate model was slightly larger than the sum of the AIC values of both univariate nonparametric models (422.28 versus 421.64). The distribution consisted of 6 points with weights almost equal to the product of the weights found in the univariate random-effects models; the estimated correlation was 0.001.

Fig. 2.

Estimated regression line between sensitivity and specificity derived from the analysis using the bivariate normal random-effects model.

Although the association is very weak in the present case, the figure illustrates the classical inverse relationship between sensitivity and specificity, and the model can be used to evaluate specific choices for sensitivity/specificity.

5. CONCLUSION

There are 2 statistical problems with multiple-lesion diagnostic data. The first is the correlation between the multiple lesions in the same patient. If the correlation is ignored, then the diagnostic yield is often overestimated, and the estimated standard errors and confidence intervals are almost certainly too small. There are 2 general statistical methods to take this correlation into account in the statistical analysis, either with a marginal model with a generalized estimation equation approach or with random-effect models. We chose the latter approach.

The second problem is that in principle an infinite number of FP lesions might be detected, which means that a lesion-specific specificity parameter cannot be assessed. Our approach therefore models the “lesion-specific” sensitivity, because it is often important to diagnose all true lesions, and the “patient-specific” specificity. We defined the patient-specific specificity as the probability of having zero FP lesions.

We modeled the number of FP lesions and the number of TP lesions with the Poisson and the binomial distributions, respectively. The associated parameters were considered to be patient-specific random effects sampled from some distribution g(·|·). Specific and convenient parametric choices can be made for g(·|·) or a nonparametric density can be chosen; we found comparable results for the empirical data of the colonography study. However, in simulated data the nonparametric approach gave better results. The parametric choices for g(·|·) are available in SAS proc NLMIXED; the nonparametric approach is available in a collection of R-functions that can be obtained from the first author. A reviewer has pointed out that all parametric models ignore the possible spatial correlations between lesions with respect to the outcomes of the diagnostic testing. This will affect the standard error of the estimated sensitivity value, but it is easy to calculate a robust standard error (Williams, 2000).

The primary effect of the random-effect models is that the model-based estimates of the marginal patient-specific specificity and sensitivity values are much closer to the observed values. This effect is seen best in the simulated examples. Of the 50 simulated patients, 14 had no FPs, suggesting a specificity of 28%. Using the simple Poisson model, the probability of having zero FPs was estimated to be 0%; thus, assuming independence between FPs leads to underestimation of the specificity. According to the nonparametric random-effect model, the specificity was estimated as 28%, and this close correspondence was seen in almost all simulations. The Poisson–gamma random-effects model also has this effect, but there it depends on the goodness of fit of the assumed gamma distribution. A similar effect was seen for the probability of identifying at least one true lesion in a patient. If we define this as the “patient-specific” sensitivity, then the random-effects models correspond much better to the observed rate, whereas when independence between lesions is assumed, this probability is overestimated.

Multiple-lesion data allow direct estimation of the relationship between patient-specific specificity and lesion-specific sensitivity using a bivariate random-effects model, in a similar fashion to that described by van Houwelingen and others (2002); this can be done with SAS proc NLMIXED and with our R-functions. The availability of this regression curve allows us to evaluate the effects of different cutoff values on sensitivity and specificity.


We present a new method to efficiently estimate very large numbers of p-values using empirically constructed null distributions of a test statistic. The need to evaluate a very large number of p-values is increasingly common with modern genomic data, and when interaction effects are of interest, the number of tests can easily run into billions. When the asymptotic distribution is not easily available, permutations are typically used to obtain p-values but these can be computationally infeasible in large problems. Our method constructs a prediction model to obtain a first approximation to the p-values and uses Bayesian methods to choose a fraction of these to be refined by permutations. We apply and evaluate our method on the study of association between 2-way interactions of genetic markers and colorectal cancer using the data from the first phase of a large, genome-wide case–control study. The results show enormous computational savings as compared to evaluating a full set of permutations, with little decrease in accuracy.

We consider here the problem of estimating empirical p-values for a very large number of test statistics, when the true distributions of the test statistics are unknown and when the true distributions may vary across the test statistics. The crucial assumption behind our work is that accurate approximations to the p-values are most important for the small p-values; we are willing to tolerate imprecise estimates of large p-values. Conceptually, therefore, we are building on the work of Besag and Clifford (1991) who proposed a sequential strategy for permutation tests, in which permutations were simulated until a fixed number of simulated test statistics had exceeded the observed one, so that small p-values received more permutations than did large ones.

There is one subtle difference between the motivation for our approach and that of previous sequential approaches. Others (e.g. Besag and Clifford, 1991) define the p-value to be the Monte Carlo or empirical p-value x/N, where the observed test statistic is the xth largest when included in a random sample of N − 1 test statistics simulated under the null hypothesis. We adopt the point of view that this is just an approximation to the true p-value, that is, the value that would be obtained by enumerating the complete permutation distribution, or equivalently the limiting value of the empirical p-value as N→∞. One of the key ideas of this paper is to treat the true p-values as parameters to be estimated using computationally efficient methods.

Our work was motivated by a desire to test for interactions between haplotypes in a case–control study of genetic predictors of colorectal cancer (the ARCTIC study). This study includes approximately 1200 colorectal cases and 1200 controls from Ontario, Canada; for this research, we focus on marker genotypes at 1363 markers in 212 candidate genes. Haplotypes are sequences of marker alleles on the same chromosome for a chosen set of markers (see Table 1). We use a test of interaction similar to a recently proposed general goodness-of-fit statistic (Becker, Schumacher, and others, 2005).

Table 1.

Genotypes and potential haplotypes for 3 SNP markers. The 2 true haplotypes are ACG and TCC. However, only the genotype data are observed. Two possible haplotype pairs are consistent with the observed genotypes

                                    Marker 1   Marker 2   Marker 3
True haplotype pair
  Chromosome 1                          A          C          G
  Chromosome 2                          T          C          C
Observed genotypes                     AT         CC         CG
Second potential haplotype pair
  Chromosome 1                          A          C          C
  Chromosome 2                          T          C          G

Asymptotic estimates of statistical significance for tests of haplotype interactions are unlikely to be valid for 3 reasons. First, the haplotypes are not directly observed but are estimated from genotype data, and therefore the haplotype counts may contain fractional entries associated with the probabilities of each haplotype combination. Second, the tables tend to be very sparse, with many haplotype combinations observed rarely or at low probabilities. Third, each individual has 2 chromosomes and can therefore contribute at least twice to the counts, violating the assumption of independence. In simulations, the null distributions of our test statistics are highly variable depending on the number of sparse cells and the number of possible haplotypes.

In the statistical genetics literature, several approaches have been developed for obtaining empirical significance levels for functions of a set of p-values or test statistics. For example, Dudbridge and Koeleman (2004) showed that −∑_{k=1}^{R} log p(k), summing over the smallest R p-values, should follow an extreme-value distribution. The parameters of this distribution can be estimated from a reasonably sized set of permutations and can lead to increased accuracy at very small significance levels. Another approach is to work directly with the distribution of minP, the smallest p-value in a group (Becker, Cichon, and others, 2005; Becker, Schumacher, and others, 2005), especially when there is a small region of interest with a limited number of markers (and hence tests). When asymptotically normal score tests can be used, Monte Carlo simulation of standard normal variates can be combined with the score tests to obtain a null distribution with the same correlation pattern as the original data (Lin, 2005; Seaman and others, 2005); this can lead to more efficient ways of estimating significance levels. All these methods depend on the ability to accurately calculate small p-values for the individual tests in the set, and this is precisely our focus in this paper.

In Section 2.1, we describe the test statistic used in our example. Section 2.2 describes a method for quickly obtaining approximate estimates of the p-values associated with the test statistics, using a Random Forest (RF) model. Then, Section 2.3 describes a Bayesian scheme for deciding which tests are of most interest and where permutations could be effectively used to improve the p-value estimates. Section 2.4 describes how this approach is evaluated. Results are shown in Section 3. The ideas behind this approach are applicable to different test statistics, and in fact, to any context where massive numbers of tests are being performed yet asymptotic significance levels are not appropriate and permutation is needed to estimate significance.

2. METHODS

For illustration of our methods, we examine tests of interactions among a very large set of haplotypes. Since haplotypes are unobserved, the popular PHASE algorithm of Stephens and others (2001) was used to estimate haplotypes within overlapping “windows” consisting of 3 adjacent markers within each gene. This choice is arbitrary, and our approaches would work for a variety of window lengths. For each window of 3 markers, there could be a maximum of 8 possible haplotypes among the cases and controls, although for any individual, only a few haplotypes are likely to have nonzero probability. We restrict our attention to interactions between those haplotype windows, where the triplets lie in different genes, and the term “window pair” will be used to refer to a particular haplotype pair, with 1 haplotype from each gene. For the data we consider here, the data from separate genes can be considered independent since 2 genes rarely lie close to each other. For each individual, the probabilities will sum to 4 (or slightly less due to rounding and truncation by PHASE), as each triplet window occurs on both copies of the chromosome, and we create all possible pairings of the windows in the 2 genes.

As described in Becker, Schumacher, and others (2005), for each interaction of 2 haplotype windows, a table of estimated counts, with a maximum dimension of 64 × 2, can be constructed from the haplotype probability distributions. Table 2 gives an example and it can be easily seen that many of the counts are very small and few of them are integers. In Section 2.1, we will make extensive use of row totals: these refer to 64 totals of probabilities for all haplotype pairs (5th and 10th columns in Table 2). We will refer to tables like Table 2 as “haplotype-pair count” tables.

Table 2.

Sample table of haplotype-pair counts, rounded to 1 decimal place

Win1  Win2      Case   Control     Total     Win1  Win2      Case   Control     Total
CCC   CAC        1.2       0.2       1.4     TCC   CAC        0.0       0.1       0.1
CCC   CAG        0.0       0.0       0.1     TCC   CAG        0.0       0.0       0.1
CCC   CGC        9.5       6.9      16.4     TCC   CGC        4.6       2.9       7.4
CCC   CGG        4.9       2.0       6.9     TCC   CGG        1.7       0.9       2.6
CCC   TAC        5.8       3.4       9.2     TCC   TAC        1.5       1.7       3.3
CCC   TAG        1.1       1.9       3.0     TCC   TAG        1.0       0.5       1.5
CCC   TGC        6.5       4.4      10.9     TCC   TGC        1.6       2.2       3.8
CCC   TGG        2.5       3.3       5.8     TCC   TGG        0.4       0.4       0.8
CCT   CAC        8.4      11.1      19.5     TCT   CAC        7.6       3.7      11.3
CCT   CAG        3.5       3.1       6.7     TCT   CAG        3.3       0.4       3.7
CCT   CGC       79.1      93.4     172.4     TCT   CGC       57.8      57.2     115.0
CCT   CGG       26.6      28.7      55.3     TCT   CGG       24.9      24.1      49.0
CCT   TAC       69.0      74.1     143.1     TCT   TAC       47.0      46.5      93.5
CCT   TAG       46.3      37.0      83.3     TCT   TAG       27.1      29.0      56.1
CCT   TGC       77.9      68.5     146.4     TCT   TGC       41.4      44.6      86.1
CCT   TGG       41.7      34.8      76.4     TCT   TGG       28.9      27.9      56.8
CGC   CAC       80.4      77.8     158.2     TGC   CAC       13.1       9.6      22.7
CGC   CAG       22.0      17.7      39.7     TGC   CAG        2.9       2.2       5.1
CGC   CGC      731.9     789.5    1521.4     TGC   CGC      107.3     110.9     218.2
CGC   CGG      293.8     305.0     598.9     TGC   CGG       39.5      43.9      83.5
CGC   TAC      705.6     660.0    1365.6     TGC   TAC       90.3      95.7     186.1
CGC   TAG      397.9     366.6     764.5     TGC   TAG       52.0      48.3     100.3
CGC   TGC      609.2     616.0    1225.3     TGC   TGC       81.3      75.5     156.8
CGC   TGG      373.7     355.4     729.1     TGC   TGG       39.3      43.7      83.0
CGT   CAC        7.6      11.4      19.0     TGT   CAC        8.6       8.8      17.4
CGT   CAG        2.9       2.4       5.3     TGT   CAG        2.6       2.8       5.4
CGT   CGC       83.0      89.5     172.5     TGT   CGC       73.5      78.3     151.8
CGT   CGG       36.5      25.0      61.5     TGT   CGG       39.2      25.8      65.0
CGT   TAC       64.8      86.9     151.8     TGT   TAC       59.7      77.8     137.5
CGT   TAG       46.7      42.2      89.0     TGT   TAG       49.8      44.5      94.3
CGT   TGC       65.5      60.2     125.7     TGT   TGC       52.9      50.8     103.7
CGT   TGG       45.0      28.0      73.0     TGT   TGG       38.0      30.3      68.3
Total         4949.6    4897.5    9847.1

2.1. A test statistic for detecting interactions

The statistic (2.1) below is proposed to test for association between the haplotype pairs and the disease state in each window pair k = 1,…,K. It is a modification of a chi-square test for independence and can detect both marginal associations for either one of the haplotypes and interactions between the 2 haplotypes and the disease. Let nijk be the count for haplotype pair j of cases (i = 1) or controls (i = 2) in window pair k. Then

TSk = ∑_{i=1}^{2} ∑_{j} (nijk − E(nijk))² / (E(nijk) + c), (2.1)

where E(nijk) stands for the expectation of nijk under the independence hypothesis, which is calculated as

E(nijk) = (∑_{j'} nij'k)(∑_{i'} ni'jk) / ∑_{i',j'} ni'j'k.

The constant c in the denominator of (2.1) was set to 0.5. This has the effect of reducing the contribution of rare haplotype pairs to TSk, similar to the effect of pooling cells with low counts.
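
A minimal sketch of this calculation is given below, using a few rows of Table 2 to stand in for a full 64 × 2 haplotype-pair count table (rows are haplotype pairs, columns case/control).

```r
## Sketch: the interaction statistic (2.1) for one window pair, given a
## (haplotype pair) x (case/control) matrix of estimated counts `n`.
ts_stat <- function(n, c = 0.5) {
  expected <- outer(rowSums(n), colSums(n)) / sum(n)   # independence expectation
  sum((n - expected)^2 / (expected + c))
}

## A few case/control rows taken from Table 2 as a small example
n <- matrix(c(1.2, 0.2,
              9.5, 6.9,
              4.6, 2.9), ncol = 2, byrow = TRUE)
ts_stat(n)
```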

A very similar statistic was used by Becker, Schumacher, and others (2005). Their statistic did not use the constant c in the denominator and divided the overall result by 2.0. The method described below will be appropriate for any choice of test statistic defined on data like that in Table 2; the general principles will apply to many large collections of tests.

2.2. Machine learning approach for estimating p-values

In this section, we describe how to use permutations on a small training set of window pairs to obtain initial approximations p̂k to the p-values for all window pairs k = 1,…,K, based on the observed summary table of nijk counts for each pair. These estimates will then be improved by using the algorithms described in Section 2.3 to run permutations on selected window pairs.

A simple prediction strategy is used for obtaining p̂k. For each of the window pairs in the training set, a true null distribution is obtained by permutation of the case/control labels. This constructs a “reference set” of null distributions. A prediction rule is then defined by modeling the empirical distributions as functions of marginal characteristics of the haplotype-pair count table; this rule is then used to obtain p̂k from the observed statistic for all tests.

Figure 1 summarizes the p-value prediction algorithm. The motivation for using the margin counts of Table 2 as predictors is as follows. If the haplotype-pair count table contained actual counts based on classifying independent individuals and all cells had nonzero expected counts, then (2.1), with c = 0, would be a standard Pearson chi-square statistic testing for independence, with asymptotic null distribution χ² on 63 degrees of freedom. However, if some haplotype combinations do not occur or occur with very low frequency, then the null distribution will be better approximated with fewer degrees of freedom. The distribution might also be influenced by the proportion of individual haplotype counts to which PHASE attributed low probabilities (the dispersion of the distribution). Finally, we order the sums by their magnitudes separately for each haplotype-count table, since there is no meaningful matching of haplotypes across such tables. We use the Random Forest (RF; Breiman, 2001) machine learning tool as implemented by Liaw and Wiener (2002) in R (R Development Core Team, 2007) for deriving predictions.
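
The following is a deliberately simplified stand-in for the prediction step: it regresses training p-values directly on the ordered margin totals, whereas the Figure 1 algorithm models characteristics of the permutation null distributions and then derives p̂k from the observed statistic. The toy data objects and the log10 response transformation are assumptions made only to keep the sketch self-contained.

```r
## Simplified sketch of RF-based p-value prediction from ordered row totals.
library(randomForest)
set.seed(4)

## Toy stand-ins (hypothetical): 64 x 2 haplotype-pair count tables and their
## permutation p-values for a training set of window pairs.
make_table   <- function() matrix(rexp(128, rate = 0.05), ncol = 2)
train_tables <- replicate(300, make_table(), simplify = FALSE)
train_p      <- runif(300)

features <- function(tab) sort(rowSums(tab), decreasing = TRUE)  # ordered row totals
X <- t(vapply(train_tables, features, numeric(64)))

rf    <- randomForest(x = X, y = log10(train_p))
p_hat <- 10^predict(rf, newdata = X)   # predictions (here back on the training pairs)
```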

Fig. 1.

p-value prediction algorithm.

2.3. Bayesian updating of p-value estimates

We have K window pairs to examine, k = 1,2,…,K. If we wished to evaluate all p-values using N permutations of the case–control labels, we would require KN constructions of tables of nijk values and evaluations of (2.1). However, most of the K window pairs are of limited interest to us; we want to estimate accurately only the small p-values.

We start by representing the RF predictions from Section 2.2 as a prior distribution πk(·) for the true p-value pk, with details described later. We will obtain permutation draws TSkl*, l = 1,…,nk, from the null distribution of TSk. After nk permutation draws, resulting in xk simulated statistics exceeding the observed one, the posterior distribution for pk is

πk(p|TSk·*) ∝ p^{xk} (1 − p)^{nk − xk} πk(p). (2.2)

We want to minimize the sum of the nk's over all window pairs, without compromising the “inferential characteristics” of the procedure.

We assume that nk may be capped by a fixed N; if we ran N permutations on all pairs, we would have sufficiently accurate determinations of the p-values to proceed with inference. We also assume that there is some target p-value, p0, and those window pairs with pk < p0 are of particular interest. (In fact, the procedure we describe below is relatively insensitive to the precise value of p0, provided it is not too large.) The relation between N and p0 is based on our precision requirements. For example, if we are interested in estimating p-values in the neighborhood of a small p0 with a standard error of p0/10, we would set N ≈ 100/p0. In our calculations, we chose N = 10^4 and p0 equal to 5×10^−5, 10^−4, and 10^−3.

The update algorithm is presented in Figure 2. In each iteration, it attempts to maximally decrease the total estimated number of “missed” p-values, that is, p-values that have not been estimated with full precision:

Kmiss = ∑_{k: nk < N} P(pk < p0), (2.3)

where the probabilities in (2.3) are taken with respect to the corresponding posterior density πk(·|TSk·*). By targeting window pairs with nk < N and the largest P(pk < p0)/(N − nk), we are performing a greedy minimization of Kmiss.

Fig. 2.

p-value Bayesian update algorithm. K is the number of p-values to consider, N is a cap for the number of permutations for each p-value, p0 is a target p-value, b is a batch size for the updates, and nk is the number of permutations done for window pair k so far.

We are interested in the p-values with pk < p0. Assuming a uniform distribution of pk values, we would expect p0K elements in this set; the algorithm stops when we have done the full set of permutations on all but a proportion α of these, that is, when Kmiss < αp0K. We used α = 0.01 in our simulations.

The prior distribution πk(·) needs to take into account the information coming from the RF estimation and to facilitate the computations described above. An ideal approach would be to determine the distribution of p̂k conditional on the true pk and combine that with prior knowledge about window pair k to compute a true posterior distribution of pk given the data p̂k. However, this calculation is difficult and would likely yield intractable integrals in (2.2). We sought a computationally convenient approximation.

A Beta prior is conjugate to the Bernoulli(pk) updates, but we found that it did not match the observations. A mixture of 2 Beta distributions, however, worked well.
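
A short sketch of the conjugate update for such a mixture prior is given below, together with the posterior probability P(pk < p0) that drives the greedy selection in (2.3). The prior parameters here are illustrative (the component precisions are chosen near the 40 and 5.9 reported in Section 3.2); the actual priors in the paper depend on the RF prediction p̂k.

```r
## Sketch: conjugate updating of a two-component Beta mixture prior after
## observing x exceedances in n permutation draws.
beta_mix_posterior <- function(x, n, a, b, g) {
  # a, b: length-2 vectors of Beta parameters; g: prior component weights.
  # Marginal likelihood of the data under each component (Beta-binomial kernel);
  # the common binomial coefficient cancels between components.
  log_m  <- lbeta(a + x, b + n - x) - lbeta(a, b)
  g_post <- g * exp(log_m - max(log_m))
  g_post <- g_post / sum(g_post)
  list(a = a + x, b = b + n - x, g = g_post)
}

prob_below <- function(post, p0)
  sum(post$g * pbeta(p0, post$a, post$b))   # P(p < p0 | data)

post <- beta_mix_posterior(x = 1, n = 500,
                           a = c(2, 1), b = c(38, 4.9), g = c(0.5, 0.5))
prob_below(post, p0 = 1e-4)
```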

We used the following process to choose the parameters. First, we selected Hp = 3000 pairs at random from the full set of pairs and ran 10 000 permutations on each of them. We also computed the RF predictions p̂k for each. Then, in an exploratory step, we fit the parameters of a 2-component Beta-binomial mixture with Beta parameters (αik, βik) and component proportions γik, i = 1,2, k = 1,…,3000, all depending smoothly on p̂k. As described in Section 3, this resulted in fits where the parameters appeared to have fairly simple parametric relations to p̂k; we then fit those relations and used them to assign priors to the full set of pairs.

2.4. Evaluation of the methods

We need a common basis of comparison to evaluate these approaches. We suppose that one use for a p-value will be simple comparison against one or more fixed levels (or thresholds) pT. If we knew the true p-values pk and each method produced fixed estimates p̂k, we could define the sensitivity of a procedure at threshold pT as P(p̂k < pT | pk < pT) and the specificity as P(p̂k ≥ pT | pk ≥ pT).

However, we do not know the true pk values and our approach produces posterior distributions for pk, not simple estimates. But both the Besag(n) and the Classical methods allow easy posterior calculations with a uniform prior: xk successes out of nk trials produces a Beta (xk + 1,nk − xk + 1) posterior.

Thus, we adopted the following approach. We approximated the true p-values by running 10 000 permutations for all window pairs and an additional 30 000 permutations for those window pairs which had fewer than 1200 successes in the first run. (We call these the “reference permutations.”) We then imagined the following experiment: draw a window pair k at random, draw p̂k from the posterior distribution of pk under the method being evaluated, and independently draw pk from the posterior based on the reference permutations. The sensitivity is the conditional probability P(p̂k < pT | pk < pT) under this sampling scheme, which is evaluated as

sensitivity(pT) = ∑_k Ptest(p̂k < pT) Pref(pk < pT) / ∑_k Pref(pk < pT), (2.4)

where Ptest is the posterior arising from the approach being examined and Pref is the posterior arising from the reference permutations. Specificity is defined analogously.
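
A small sketch of evaluating (2.4) is shown below for the simple case in which both posteriors are plain Beta distributions (uniform-prior posteriors of the form Beta(x + 1, n − x + 1), as noted above); the toy counts are made up for illustration.

```r
## Sketch: sensitivity at threshold pT from per-pair Beta posteriors, as in (2.4).
sens_at_threshold <- function(pT, a_test, b_test, a_ref, b_ref) {
  p_test <- pbeta(pT, a_test, b_test)   # P_test(p-hat_k < pT), one value per pair
  p_ref  <- pbeta(pT, a_ref,  b_ref)    # P_ref(p_k < pT)
  sum(p_test * p_ref) / sum(p_ref)
}

## Toy example: 1000 window pairs with uniform-prior posteriors Beta(x+1, n-x+1)
set.seed(5)
n      <- 1000
x_ref  <- rbinom(1000, n, runif(1000, 0, 0.01))
x_test <- pmax(0, x_ref + sample(-2:2, 1000, replace = TRUE))
sens_at_threshold(1e-3, x_test + 1, n - x_test + 1, x_ref + 1, n - x_ref + 1)
```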

3. RESULTS

3.1. Simple validation of RF p-value estimates

The RF model for predicting p-values was built on 3000 randomly chosen window pairs, for which reference p-values were obtained using 10 000 permutations. The model was then used to predict all 410 108 p-values and these predictions were compared to p-value estimates xk/nk on all window pairs based on the reference permutations described in Section 2.4. The RF procedure has a number of tuning parameters which could be used to optimize model performance. As our main goal is to provide a sensible initial p-value estimate, we have not attempted to fully optimize the RF model and have used the default settings in the R implementation of the model. We used h = 10 nearest neighbors for p-value prediction.

Figure 3 shows reference versus predicted p-values for all window pairs using the BUaP approach. The plot shows that the p-value prediction scheme works well; there are very few significant window pairs which would have been missed. For example, in the BUaP run described below with the p0 target equal to 10^−4, all pairs with predicted p-values below about 0.02 were targeted at least once for permutation updates. Among all window pairs with predicted p-values over 0.02, none had a true p-value below the p0 level. The same result holds for the BUaP runs with other p0 targets. The circle in the plot displays one of the most significant overpredictions; here, a p-value was predicted over 0.001 while its true value is below 0.0001. Such instances are rare and a more common bias is to underpredict the p-value. This is likely a result of focusing the RF model on the 95th percentile and reflects our desire to optimize sensitivity of the prediction step with a consequent loss of specificity.

The window pair from Table 2 had a test statistic equal to 35.66 with a predicted p-value of 0.785. Its reference permutation p-value was 0.85 (with 8537 successes in the first 10 000 permutations).

Fig. 3.

p-value estimates based on the reference permutations and as predicted by the RF model for all 410 108 window pairs.

3.2. Prior selection in the Beta mixture model

Using locally weighted maximum likelihood, we found that the Beta mean parameters αik/(αik + βik), i = 1,2, were approximately linearly related to the RF predictions p̂k. The sums αik + βik were highest at the extremes p̂k = 0 or 1, indicating more concentrated distributions there, but rather than attempting a precise fit, we took a conservative approach and fixed the more precise component to have α1k + β1k = 40 (near the low end of the observed range). We then refit the model with the component means linear in p̂k and with the precisions and mixing proportions γik held constant, and used the resulting fitted parameters for πk(p). One interpretation of these parameters is that the RF fit was taken to be as informative as running somewhere between α2k + β2k = 5.9 and α1k + β1k = 40 Bernoulli trials. We emphasize that this is likely a conservative approximation to the information provided by RF.

3.3. Sensitivity and specificity of the methods

This section shows the main results of our study. Our hypothesis is that the BUaP strategy has a strong computational advantage without a large decrease in accuracy.

In Figure 4, we compare sensitivity across a range of p-value thresholds for the different strategies. The sensitivity curves in these plots were calculated as explained in Section 2.4. All methods have very similar sensitivities up to a threshold around 2×10^−4; BUaP with p0 = 10^−4 continues to rise in sensitivity up to about 10^−3. In the crucial range of small p-values, it exceeds the sensitivity of the Besag methods.

Fig. 4.

Sensitivities of different strategies.

Table 3 compares the number of permutations needed by each method. (This does not include those used in training the BUaP methods: that is a relatively fixed cost, whereas the numbers shown in Table 3 could be expected to be roughly proportional to the number of tests.) We see here that BUaP with p0 = 10^−4 required fewer permutations than Besag(3), even though its sensitivity is much higher in the low range. We also show the effect of changing p0: larger values rapidly increase the required number of permutations. Sensitivities (not shown) followed the same pattern as for p0 = 10^−4, but the peak sensitivity moved according to the value of p0.

Table 3.

Numbers of permutations used by each strategy

Method                     Permutations (millions)
Besag(3)                            15.2
Besag(6)                            28.7
BUaP (p0 = 0.00005)                  8.4
BUaP (p0 = 0.0001)                  13.4
BUaP (p0 = 0.001)                   31.4
Classic                           4101.8
PO                                   0.0

We also computed specificities, which were very high for all methods shown in Figure 4 (greater than 0.995 for thresholds up to 0.01).

3.4. Note on running times and computational complexity

Our method has 2 advantages: it provides the ability to focus the p-value estimation process on interesting cases, and it greatly reduces the total number of permutations compared to the Classical and Besag approaches. We focus the discussion of computational complexity on the total number of permutations, since this is common to all methods and contributes the lion's share of the running time. Building the RF model on the 3000 pairs took about a minute, and producing all 410 108 predictions from this model took about 30 min once the full data set (i.e. 128 predictors for each pair and their corresponding test statistics) was assembled. In contrast, running BUaP (p0 = 10^−4) took about 4 h, while Classical took about 9 days, after careful hand optimization of the permutation code used by Classical. As with Classical, running times of the Besag family of methods are roughly proportional to the total number of permutations required.

The BUaP method has another important computational advantage. With a large genomic data set, such as the ARCTIC data considered in this paper, one will need to process the data in chunks, repeatedly accessing a remote database server for reading and writing. Since the vast majority of pairs (about 98% for BUaP [p0 = 0.0001] in our case) do not require any permutations at all, one saves a huge amount of database traffic while still predicting all p-values. The RF predictions require only a summary table for each pair which can be precomputed and stored within the database or computed on the fly using a stored procedure. In contrast, running permutations will require transferring large amounts of individual-level data from the database.

4. CONCLUSIONS

In this paper, we have used the RF algorithm to estimate p-values and then used the empirical behavior of those p-value estimates to construct prior distributions in the Bayesian estimation step. We see 2 advantages of this 2-stage approach: it modularizes the procedure so that different approaches to both prediction and Bayesian update could be used; it is much easier to use a large number of covariates to predict a single response than to predict several parameters of the prior distribution.

The output of our procedure is a posterior distribution of the p-value for each window pair. These posteriors could be used in further computations. For example, to estimate the false-discovery rate (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003) of the testing procedure, we would need an estimate of the distribution of all the p-values. This may be obtained by averaging (and perhaps smoothing) the posteriors across the full set of window pairs.

This paper has concentrated on the computation of p-values in one part of the ARCTIC project. Readers who are interested in the conclusions of that study about the relationship between genetic markers and colon cancer are referred to Zanke and others (2007).

A software package implementing the algorithm is available from the first author on request.

FUNDING

Natural Sciences and Engineering Research Council Discovery Grants to R.K. and D.J.M.; Genome Canada through the Ontario Genomics Institute, by Génome Québec, the Ministère du Développement Économique et Régional et de la Recherche du Québec and the Ontario Cancer Research Network to ARCTIC project; National Program on Complex Data Structures to R.K. and X.S. Funding to pay the Open Access publication charges for this article was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC).

The authors acknowledge the support of the Centre for Applied Genomics, Hospital for Sick Children. This work was made possible through collaboration and cooperative agreements with the Colon Cancer Family Registry (CFR) and PIs (RFA CA-95-011). The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating institutions or investigators in the Colon CFR, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government or the Colon CFR. Conflict of Interest: None declared.


The problem of testing for genotype–phenotype association with loci on the X chromosome in mixed-sex samples has received surprisingly little attention. A simple test can be constructed by counting alleles, with males contributing a single allele and females 2. This approach assumes not only Hardy–Weinberg equilibrium in the population from which the study subjects are sampled but also, perhaps, an unrealistic alternative hypothesis. This paper proposes 1 and 2 degree-of-freedom tests for association which do not assume Hardy–Weinberg equilibrium and which treat males as homozygous females. The proposed method remains valid when the phenotype distribution varies between sexes, provided the allele frequency does not, and avoids the loss of power resulting from stratification by sex in such circumstances.

1. INTRODUCTION

Association between genetic markers and disease is most commonly demonstrated by case–control studies, in which the frequency distributions of genotype in cases and controls are compared. The most widely useful markers are single nucleotide polymorphisms (SNPs), which are chromosomal loci that have only 2 forms, or alleles. Since most human chromosomes occur in pairs (autosomes), there are 3 possible genotypes at such a locus. In the simplest case, the test for association commonly used is the conventional chi-squared 2 degree-of-freedom test for association in the 3×2 contingency table or the Cochran–Armitage 1 degree-of-freedom trend test. The former test makes no strong assumptions about the disease association, but the latter is sensitive to departures from the null in which the case–control ratio, reflecting risk in the underlying population, varies monotonically with genotype, ordered by the number of copies of a nominated allele (0, 1, or 2). An alternative method has been to carry out the 1 degree-of-freedom test for association in the 2×2 table which counts chromosomes, or alleles, in cases and controls. Unlike tests at the genotype level, this test assumes that the 2 chromosomes carried by each individual can be regarded as independently sampled from a population of chromosomes—the assumption of Hardy–Weinberg equilibrium (Sasieni, 1997). This test is closely related to the Cochran–Armitage test; both contrast the observed number of alleles in cases with the expected number under the null hypothesis, but the 2 tests use different variance estimates for this (O − E) statistic.

For SNPs on the X chromosome, females carry 2 copies but males carry only one. At first sight, it is obvious how the simple allele-counting method can be extended to this case: if the allele frequency in males and females can be assumed to be equal, we would count alleles in a 2×2 table and calculate a chi-squared test on 1 degree of freedom as before. However, 2 criticisms can be leveled at this approach: first, it assumes Hardy–Weinberg equilibrium in females, and second, males have only half the impact on the analysis that females have. The latter problem reflects an implicit alternative hypothesis that the effect of 1 copy of a variant allele on phenotype is the same in males as in females. This may not be a realistic assumption.

These difficulties can be addressed with the usual method for the analysis of case–control studies in epidemiology, that is, by treating case–control status as the dependent variable in a logistic regression analysis (Prentice and Pyke, 1979), with genotype entering as a predictor variable. However, if the sex ratio differs between cases and controls, this necessitates inclusion of sex as a covariate—whether or not the allele frequency varies between sexes. This is equivalent to stratification of the analysis by sex and can lead to considerable loss of power. But when the allele frequency does not differ between sexes, stratification by sex, with its attendant loss of power, would seem unnecessary and undesirable.

The work described in this paper was motivated by a genome-wide association study in which a common control group was used for several groups of cases of different diseases. Inevitably, for some comparisons, the sex ratio differed markedly between cases and controls (Wellcome Trust Case Control Consortium, 2007). However, the problem is not unique to the case–control setting; it extends to any test for genotype–phenotype association for loci on the X chromosome—particularly when the distribution of phenotype varies substantially between the sexes.

In Section 2, the standard 1 and 2 degree-of-freedom tests for genotype–phenotype association for autosomal loci will be reviewed. The subsequent section discusses the modifications necessary for a locus on the X chromosome. Later sections discuss some extensions and alternative approaches.

2. AUTOSOMAL LOCI

In this section, the derivation of the 1 and 2 degree-of-freedom tests for association with autosomal loci will be briefly reviewed. These test statistics are based on genotype–phenotype covariances and can be derived as score tests in the context of generalized linear models (GLMs) (McCullagh and Nelder, 1989) which relate the expectation of the phenotype, transformed by a “link” function, to a linear model which may include “additive” and “dominance” components. The score statistics (Cox and Hinkley, 1974) are defined by the first derivatives of the log-likelihood function with respect to additive and dominance effect parameters, evaluated under the null hypothesis, H0, of no association. In the simplest case, only an additive effect is assumed; this will be discussed first.

2.1. Additive genetic model

For a general phenotype, the score statistic for testing for an additive effect of a diallelic locus on phenotype is the genotype–phenotype covariance
$$U_A = \sum_{i=1}^{N} (Y_i - \bar{Y})A_i,$$
where $Y_i$ is the phenotype for subject $i$ and $A_i$ codes the corresponding genotype 1/1, 1/2, or 2/2 as 0, 1, or 2, respectively. $\bar{Y}$ is the arithmetic mean of $Y$ in the whole sample. (If there are additional covariates in the model and a link function other than the “canonical” link is used in the GLM, then additional weights are needed. This represents a minor extension and will not be discussed further here.)

For reasons that will become clear later, although the test statistic has been introduced in terms of a model for the effect of genotype on phenotype, it is convenient, initially, to consider its distribution based on $\Pr(A_i\mid Y_i)\,(i = 1,\ldots,N)$. Then, the statistic is asymptotically normally distributed under $H_0$ with zero mean and variance
$$\operatorname{Var}(U_A) = V_A \sum_{i=1}^{N} (Y_i - \bar{Y})^2,$$
where $V_A$ is the variance of $A_i$ (assumed constant for all $i$); this can be estimated by
$$\hat{V}_A = \frac{1}{N}\sum_{i=1}^{N} (A_i - \bar{A})^2.$$
Under $H_0$, the ratio $U_A^2/\widehat{\operatorname{Var}}(U_A)$ is asymptotically distributed as chi-squared on 1 degree of freedom. A well-known special case of this test is the Cochran–Armitage test for a dichotomous phenotype or case–control data (Cochran, 1954), (Armitage, 1955), but it is equally applicable for a quantitative phenotype, even when the sample is selected by extremes of phenotype (Wallace and others, 2006). If Hardy–Weinberg equilibrium in the population can be assumed, the estimate of $V_A$ may be replaced by $2P(1-P)$, where $P$ is the allele frequency, although this would not usually be recommended.
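To make the calculation concrete, the 1 degree-of-freedom additive score test can be computed directly from the genotype and phenotype vectors. The following is a minimal numpy sketch based on the statistic and variance estimator given above; the function and variable names are illustrative rather than taken from the original.

```python
import numpy as np

def additive_score_test(y, a):
    """1 degree-of-freedom additive (trend-type) score test.

    y : phenotypes (binary for case-control data, or quantitative)
    a : genotype codes 0, 1, 2 (number of copies of the nominated allele)
    Returns the chi-squared statistic on 1 degree of freedom.
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    u_a = np.sum((y - y.mean()) * a)           # score: genotype-phenotype covariance
    v_a = np.mean((a - a.mean()) ** 2)         # sample variance of the genotype code
    var_u = v_a * np.sum((y - y.mean()) ** 2)  # estimated Var(U_A) under H0
    return u_a ** 2 / var_u

# Example with simulated (null) data
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)   # 0/1 case-control status
a = rng.integers(0, 3, size=500)   # genotype codes
chi2_1df = additive_score_test(y, a)
```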

2.2. Dominance

The test above is locally most powerful against GLMs for genotype–phenotype association in which genotype enters as a linear term. Under such models, the heterozygous genotype, 1/2, falls midway between the 1/1 and 2/2 homozygous genotypes on the linear predictor scale. A broader class of alternatives is obtained by entering a “dominance” term in the linear model. A convenient way to do this is by a heterozygosity indicator, $D$ say, taking the value 1 for heterozygotes and 0 for homozygotes. An additional score test statistic for the dominance effect is then
$$U_D = \sum_{i=1}^{N} (Y_i - \bar{Y})D_i.$$
The 2 degree-of-freedom test combines $U_A$ and $U_D$. Under $H_0$, $U_D$ also has zero mean and
$$\operatorname{Var}(U_D) = V_D \sum_{i=1}^{N} (Y_i - \bar{Y})^2, \qquad \operatorname{Cov}(U_A, U_D) = V_{AD} \sum_{i=1}^{N} (Y_i - \bar{Y})^2,$$
where $V_D$ is the variance of each $D_i$ and $V_{AD}$ is the covariance between $A_i$ and $D_i$. These can be estimated by
$$\hat{V}_D = \frac{1}{N}\sum_{i=1}^{N} (D_i - \bar{D})^2, \qquad \hat{V}_{AD} = \frac{1}{N}\sum_{i=1}^{N} (A_i - \bar{A})(D_i - \bar{D}).$$
(Again, alternative estimates can be used if one is prepared to assume Hardy–Weinberg equilibrium.) Then, writing $U = (U_A, U_D)^{T}$ and $V$ for its estimated variance–covariance matrix, the statistic $U^{T}V^{-1}U$ is asymptotically distributed under $H_0$ as chi-squared with 2 degrees of freedom. In the special case of a dichotomous phenotype, this test is identical to the conventional Pearsonian chi-squared test for association in the 3×2 contingency table.
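A corresponding sketch of the 2 degree-of-freedom calculation, under the same assumptions and again with illustrative naming, might look as follows.

```python
import numpy as np

def two_df_score_test(y, a):
    """2 degree-of-freedom score test combining additive and dominance components."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    d = (a == 1).astype(float)                      # heterozygosity indicator
    yc = y - y.mean()
    u = np.array([np.sum(yc * a), np.sum(yc * d)])  # scores U_A and U_D
    v = np.cov(np.vstack([a, d]), bias=True) * np.sum(yc ** 2)  # estimated Var(U)
    return float(u @ np.linalg.solve(v, u))         # chi-squared on 2 df

# chi2_2df = two_df_score_test(y, a)   # y, a as in the previous sketch
```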

3. THE X CHROMOSOME

Loci on the pseudo-autosomal part of the X chromosome can be treated in exactly the same way as autosomal loci, but others generally require different treatment. For these, males carry only 1 copy, while, in females, most loci are subject to X inactivation (Chow and others, 2005), so that a female will have approximately half her cells with 1 copy active while the remainder of her cells have the other copy active. Thus, in the absence of interaction with other loci or environmental factors, males should be equivalent to homozygous females with respect to such loci. This suggests that, for X loci in males, Ai should be coded 0 or 2, while Di should be coded 0. This has several consequences, some of which require modifications to the theory outlined above.

If the allele frequency does not vary between sexes, the expectation of A is also equal (at 2P) for the 2 sexes. Thus, the expectation of UA will remain at 0 under H0 even when the phenotype, Y, is related to sex.

However, the variance of A differs between males and females. For example, under Hardy–Weinberg equilibrium, its variance is 2P(1 − P) in females and 4P(1 − P) in males. This means that, in general, an alternative variance estimate for UA must be used.

Only females contribute to the dominance score, $U_D$. For notational simplicity, assume that subjects are arranged so that subjects $1,\ldots,F$ are female and subjects $(F+1),\ldots,N$ are male. Then,
$$U_D = \sum_{i=1}^{F} (Y_i - \bar{Y}_F)D_i,$$
where $\bar{Y}_F$ is the mean of $Y$ in females.

A modified estimator for the variance–covariance matrix of $U$ can now be derived. The variance–covariance matrix for $A$ and $D$ for females can be estimated by
$$\hat{V}_{\mathrm{F}} = \frac{1}{F}\sum_{i=1}^{F}\begin{pmatrix}(A_i - \bar{A})^2 & (A_i - \bar{A})(D_i - \bar{D}_F)\\ (A_i - \bar{A})(D_i - \bar{D}_F) & (D_i - \bar{D}_F)^2\end{pmatrix},$$
where $\bar{D}_F$ is the mean of $D_i$ in females. (Since allele frequencies are assumed to be equal between males and females, $\bar{A}$ may be calculated from the entire sample rather than from females alone.) In males, since there is only a single copy of the allele, this variance–covariance matrix can be estimated by
$$\hat{V}_{\mathrm{M}} = \begin{pmatrix} 4P(1-P) & 0\\ 0 & 0\end{pmatrix}.$$
Again, $P$ can be estimated in the entire sample, perhaps by allele counting or, alternatively, by $\bar{A}/2$. The variance–covariance matrix of the 2-vector of scores, $U$, is then estimated by
$$\widehat{\operatorname{Var}}(U) = \hat{V}_{\mathrm{F}}\sum_{i=1}^{F}(Y_i - \bar{Y})^2 + \hat{V}_{\mathrm{M}}\sum_{i=F+1}^{N}(Y_i - \bar{Y})^2.$$
As before, the 2 degree-of-freedom chi-squared test is then given by $U^{T}\widehat{\operatorname{Var}}(U)^{-1}U$, while the 1 degree-of-freedom test is given by $U_A^2/\widehat{\operatorname{Var}}(U_A)$. It should perhaps be emphasized that $\bar{Y}$ in the above expressions refers to the overall sample mean of the phenotype and not to the sex-specific means.
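The sex-specific variance construction can be illustrated as follows, assuming males are coded A = 0 or 2 with D = 0 and that the estimators are as reconstructed above; all names are illustrative.

```python
import numpy as np

def x_chromosome_score_tests(y, a, d, female):
    """1 and 2 df score tests for an X-linked locus with sex-specific variance estimates.

    y      : phenotype vector
    a      : allele count, coded 0/1/2 in females and 0/2 in males
    d      : heterozygosity indicator (always 0 in males)
    female : boolean indicator of female sex
    """
    y, a, d = (np.asarray(v, dtype=float) for v in (y, a, d))
    female = np.asarray(female, dtype=bool)
    yc = y - y.mean()                                          # deviations from the OVERALL mean

    u_a = np.sum(yc * a)                                       # additive score
    u_d = np.sum((y[female] - y[female].mean()) * d[female])   # dominance score, females only
    u = np.array([u_a, u_d])

    # female variance-covariance of (A, D); A is centered at the whole-sample mean
    a_f = a[female] - a.mean()
    d_f = d[female] - d[female].mean()
    v_female = np.array([[np.mean(a_f**2), np.mean(a_f*d_f)],
                         [np.mean(a_f*d_f), np.mean(d_f**2)]])

    p = a.mean() / 2.0                                         # allele frequency, whole sample
    v_male = np.array([[4*p*(1-p), 0.0], [0.0, 0.0]])          # males: single allele copy

    var_u = v_female*np.sum(yc[female]**2) + v_male*np.sum(yc[~female]**2)
    chi2_1df = u_a**2 / var_u[0, 0]
    chi2_2df = float(u @ np.linalg.solve(var_u, u))
    return chi2_1df, chi2_2df
```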

It has been stated above that the above modifications to UD and to the estimator for the variance–covariance matrix are necessary “in general.” The exception is when sampling is carried out in such a way that the sample distributions of phenotype, Y (or at least their first 2 moments), are equal between males and females. Then, UD, as defined for autosomal loci, will continue to have zero expectation under H0, and the autosomal variance–covariance estimator will be unbiased (see Appendix A). This would occur, for example, in a case–control study in which cases and controls are frequency matched by sex.

4. STRATIFIED TESTS

Stratified score tests, in which the alternative hypothesis is one of equal effects of genotype on phenotype across strata, are constructed by calculating the 2-vector of scores, U, and its estimated variance–covariance matrix separately in each stratum. Both are then summed over strata. The final stratified chi-squared tests are then calculated in exactly the same way as before. This mirrors the classical Mantel–Haenszel generalizations of the standard 2×2 table association tests and the Mantel extension of the Cochran–Armitage test (Mantel and Haenszel, 1959), (Mantel, 1963). (It should be noted, however, that this assumes that the GLM which forms the alternative hypothesis uses a “canonical” link function; otherwise, the different stratum contributions would need to be weighted appropriately.)
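A stratified version is then simply a matter of summing scores and variance–covariance matrices over strata before forming the chi-squared statistic; a minimal sketch, assuming the per-stratum score vector and its variance–covariance matrix have been computed as above:

```python
import numpy as np

def stratified_two_df_test(strata):
    """Combine per-stratum scores and variance-covariance matrices.

    strata : iterable of (u, var_u) pairs, where u is the 2-vector of scores for one
             stratum and var_u is its estimated 2x2 variance-covariance matrix.
    """
    u_total = np.zeros(2)
    v_total = np.zeros((2, 2))
    for u, var_u in strata:          # sum scores and variances over strata
        u_total += u
        v_total += var_u
    return float(u_total @ np.linalg.solve(v_total, u_total))  # 2 df chi-squared
```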

The test outlined in Section 3 was derived under the assumption that the allele frequency does not vary with sex and, if this assumption cannot be made, it will be necessary to stratify by sex in the analysis. Note, however, that in the event of strong association between sex and phenotype, this will result in loss of power (perhaps considerable). An extreme example is provided by the unlikely case in which all the cases are female and all the controls male; stratification by sex leaves no information for testing association, whereas, if allele frequencies can be assumed to be equal between the sexes, a valid test can be carried out as described in Section 3.

5. CONDITIONING ON GENOTYPE

If only the 1 degree-of-freedom test were to be derived, a rather simpler derivation might have followed from taking the phenotype, $Y$, as the random response and deriving the distribution of $U_A$ in sampling from $\Pr(Y_i\mid A_i)\,(i = 1,\ldots,N)$. For an autosomal locus, this leads to precisely the same score test statistic, $U_A$, and
$$\operatorname{Var}(U_A) = V_Y \sum_{i=1}^{N} (A_i - \bar{A})^2,$$
where $V_Y$ is the variance of $Y_i$. Estimating $V_Y$ by $\frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$ then leads to an identical asymptotic test. Similarly, the 2 degree-of-freedom test can be derived in the same way.

It should be noted that a test based on UA and the above expression for Var(UA) remains valid in the presence of a relationship between phenotype, Y, and sex, provided there is no relationship between genotype and sex. This follows from an argument which mirrors that presented in Appendix A but with the roles of genotype and phenotype reversed. However, in the presence of sex differences in the distribution of phenotype, a difference in the mean of A between sexes leads to a nonzero expectation for UA under H0, while a difference in the variance of A between sexes invalidates the above expression for Var(UA). Either eventuality would render the standard test invalid. For autosomal loci, it will usually be safe to assume equality between the sexes for both the mean and the variance of A, but for loci on the X chromosome, while equality of the mean of A can usually be assumed, equality of the variances cannot.

Dependency of the variance of Y on sex can be allowed for by using separate estimates for males and females, in the same way as was used for the variance of A in the earlier derivation. Alternatively, a variance estimator in the spirit of the Huber–White “sandwich” estimator (Huber, 1967), (White, 1980) could be used. A valid Wald test for additive effect of a locus on the X chromosome could be carried out by simply regressing Y on A in a GLM and testing for a nonzero regression coefficient of A using a Huber–White estimate for the variance–covariance matrix of coefficients. The stratified version of the test would be obtained by additionally including the stratifying factor in the GLM. Generalized score tests which do not make the equal variance assumption have been discussed by Boos (1992).

At first, it seems natural to derive the 2 degree-of-freedom test in exactly the same way—by simply adding the heterozygosity indicators, D, into the model and testing for nonzero coefficients for both A and D. However, this is incorrect, since D is confounded with sex—males are always coded as homozygous, while females are only sometimes homozygous. Thus, when the phenotype varies with sex, this will generate a false dominance effect. Putting sex in the model corrects this but at the expense of power to detect the additive effect. There would seem to be no way to obtain the 2 degree-of-freedom test in 1 step by simple regression methods. It can, however, be done in 2 steps:

1. Calculate a 1 degree-of-freedom chi-squared test for additive effect, using a GLM in which neither dominance nor sex effects are included as predictors. To allow for the omission of sex from the model, Huber–White “robust” variance estimates must be used.

2. Calculate a 1 degree-of-freedom chi-squared test for dominance using a GLM which includes additive and dominance effects, together with sex. There is no need to use robust variance estimates at this stage.

Adding the 2 chi-squared tests yields a 2 degree-of-freedom test.
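One possible rendering of this two-step recipe, for a quantitative phenotype and using ordinary least squares with statsmodels' Huber–White ("HC0") covariance option in the first step, is sketched below; the function and variable names are illustrative and not part of the original.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def two_step_x_test(y, a, d, sex):
    """Two-step 2 df test for an X-linked locus via regression Wald statistics."""
    # Step 1: additive effect, sex omitted, Huber-White ("robust") variance
    fit1 = sm.OLS(y, sm.add_constant(a)).fit(cov_type="HC0")
    chi2_add = (fit1.params[1] / fit1.bse[1]) ** 2

    # Step 2: dominance effect, adjusting for the additive effect and sex
    x2 = sm.add_constant(np.column_stack([a, d, sex]))
    fit2 = sm.OLS(y, x2).fit()
    chi2_dom = (fit2.params[2] / fit2.bse[2]) ** 2

    chi2_total = chi2_add + chi2_dom                 # 2 degree-of-freedom test
    return chi2_total, stats.chi2.sf(chi2_total, df=2)
```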

It would also be possible to test for a dominance effect by discarding males altogether. This would be less powerful since the additive effect would be less precisely estimated. However, this approach would be less reliant on the assumption that the effect of genotype in males mirrors that in homozygous females.

6. DISCUSSION

It has been argued that, when testing for genotype–phenotype association for loci on the X chromosome, males should be treated as homozygous females. If allele frequencies do not vary between the sexes, the additive genetic effect is not confounded with sex and there is no need to stratify by sex in the analysis. Indeed, to do so could seriously reduce power. However, if the first 2 moments of the distributions of phenotype are not equal in males and females, it becomes necessary to modify the variance calculations. In contrast, the dominance component of the genetic effect is, in general, confounded with sex, and testing for its presence requires allowance to be made for this fact.

This argument has been presented from the point of view of both probability models for genotype conditional on phenotype and for models for phenotype conditional on genotype. These are asymptotically (and sometimes algebraically) equivalent. However, the latter approach is more flexible in that it more naturally allows for the presence of further covariates.

FUNDING

The Wellcome Trust; the Juvenile Diabetes Research Foundation to D.C. Funding to pay the Open Access publication charges for this article was provided by The Wellcome Trust.

APPENDIX A

A.1. Ignoring sex in analyses of autosomes

The variance estimator for the score statistic $U_A$ for autosomal loci is
$$\widehat{\operatorname{Var}}(U_A) = \hat{V}_A \sum_{i=1}^{N} (Y_i - \bar{Y})^2.$$
When the variance of $A$ differs between the sexes, it can be shown that
$$E(\hat{V}_A) \approx \frac{F\sigma_F^2 + M\sigma_M^2}{N},$$
where $\sigma_F^2$ and $\sigma_M^2$ are the variances of $A$ in females and males, respectively, and $F$ and $M$ are the numbers of females and males in the sample. If the first and second sample moments of $Y$ are equal for males and females, then
$$\sum_{\text{females}} (Y_i - \bar{Y})^2 = \frac{F}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \quad\text{and}\quad \sum_{\text{males}} (Y_i - \bar{Y})^2 = \frac{M}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^2,$$
so that
$$E\big[\widehat{\operatorname{Var}}(U_A)\big] \approx \sigma_F^2 \sum_{\text{females}} (Y_i - \bar{Y})^2 + \sigma_M^2 \sum_{\text{males}} (Y_i - \bar{Y})^2.$$
This is the true variance of $U_A$ when the variance of $A$ differs between the sexes. Thus, when the first 2 sample moments of the phenotype $Y$ do not differ between the sexes, the usual variance estimate will be unbiased even when the variance of $A$ does differ between the sexes.

A similar argument shows that the usual variance estimator can be used when the distribution of phenotype varies between the sexes, provided that the first 2 moments of A do not vary between the sexes. This justifies ignoring sex (even when it has a strong effect) in analyses of autosomal loci.


This manuscript describes a novel, linear mixed-effects model–fitting technique for the setting in which correlated data indicators are not completely observed. Mixed modeling is a useful analytical tool for characterizing genotype–phenotype associations among multiple potentially informative genetic loci. This approach involves grouping individuals into genetic clusters, where individuals in the same cluster have similar or identical multilocus genotypes. In haplotype-based investigations of unrelated individuals, corresponding cluster assignments are unobservable since the alignment of alleles within chromosomal copies is not generally observed. We derive an expectation conditional maximization approach to estimation in the mixed modeling setting, where cluster assignments are ambiguous. The approach has broad relevance to the analysis of data with missing correlated data identifiers. An example is provided based on data arising from a cohort of human immunodeficiency virus type-1–infected individuals at risk for antiretroviral therapy–associated dyslipidemia.

Mixed-effects modeling is a well-established method for the analysis of correlated data where correlation among observations can arise from repeated measures or clustering. Since the landmark paper of Laird and Ware (1982), an extensive literature has developed that spans a range of model-fitting techniques and applications, including Diggle and others (1994), Vonesh and Chinchilli (1997), Pinheiro and Bates (2000), Verbeke and Molenberghs (2000), McCulloch and Searle (2001), Demidenko (2004), and Fitzmaurice and others (2004), among others. Together, these provide a clear and comprehensive discussion of state-of-the-art methods for estimation, testing, and prediction in the context of linear, generalized linear, and nonlinear mixed-effects modeling. In addition, a broad array of applications are presented with complete discussion of available software tools for implementation of existing methods. To our knowledge, a fully likelihood–based method that specifically addresses unobservable correlated data indicators, that is, missing individual or cluster identifiers, has not been described.

The data settings motivating our research are population-based genetic association studies of unrelated individuals for whom haplotypic phase, that is, the alignment of alleles on a single chromosome, is unobservable. In a recent manuscript, we describe a multistage approach for this setting that involves (1) estimating haplotype frequencies using only the available genetic information, (2) multiply imputing cluster membership identifiers, (3) for each of these imputations, fitting a mixed-effects model for the outcome of interest, such as a measure of disease progression, using existing analytical tools, and (4) combining the results across imputations to make inference (Foulkes and others, 2007). While this approach is straightforward to implement, it does not allow knowledge of the outcome to inform the haplotype frequency estimation. That is, estimation of haplotype frequencies (step 1 above) is done independently of the mixed-effects model–fitting procedure (step 3). In the present manuscript, we derive a novel, likelihood-based approach that incorporates the haplotype estimation component into the model-fitting procedure. Our approach has the marked advantage of drawing strength from a clinical measure (outcome in the model framework) to update the haplotype frequency estimates.

Specifically, we derive an expectation conditional maximization (ECM) algorithm for this missing data setting. Expectation–maximization (EM)-type algorithms have been described for fitting mixed-effects models (Laird and Ware, 1982), (Jennrich and Schluchter, 1986), (Laird and others, 1987), (Jamshidian and Jennrich, 1993). In its original formulation, the model random effects are treated as missing data (McCulloch and Searle, 2001, p. 264). We extend this for our setting by letting both the random effects and the correlated data indicators together constitute the missing component. We also distinguish our setting from the more common missing data settings in which covariate or response data are missing and/or there is imbalance in the design, that is, unevenly spaced measurements over time. Methods for these settings are well described, as noted in Fitzmaurice and others (2004, p. 375) and McCulloch and Searle (2001, p. 94).

The ECM approach originally proposed by Meng and Rubin (1993) extends the EM algorithm of Dempster and others (1977) to reduce complexities in the maximization step by partitioning the set of parameters into disjoint and exhaustive subsets with likelihood functions that are easier to maximize. Two alternative maximization algorithms are well described in the context of fitting mixed models, Newton–Raphson and Fisher scoring (FS) (Lindstrom and Bates, 1988), (Wolfinger and others, 1994), (Pinheiro and Bates, 2000), (Demidenko, 2004), and combinations of each with EM-type algorithms provide both efficiency and stability. A combination of FS and the EM was recently proposed for missing covariate and response data by Schafer and Yucel (2002). While further extensions for missing cluster identifiers are tenable, the ECM algorithm is efficient, provides simple interpretable solutions at each step, and converges reliably to maximum likelihood (ML) estimates by guaranteeing an increase in the likelihood function at each iteration (Little and Rubin, 1987).

Unobservable cluster identifiers can arise in a variety of settings. For example, hospital records may have incomplete information on patients’ local area identifiers such as ZIP codes, which may be desirable in modeling treatment patterns (Chiu and others, 2005). Alternatively, clusters may define underlying biological constructs that are not observable. In general, investigators can identify a subset of clusters that are consistent with the observed data. For example, additional information available from either census records or hospital records may identify a set of possible ZIP codes. In the context of characterizing biological states, genetic indicators can inform us about the set of possible groupings of individuals (Foulkes and DeGruttola, 2002).

The data motivating our research arise from a cohort of human immunodeficiency virus type-1 (HIV-1)–infected individuals on highly active antiretroviral therapy (HAART). Long-term exposure to HAART has been associated with an array of lipid abnormalities that can lead to early onset of cardiovascular disease in this population. Our investigation aims to characterize the associations among genetic polymorphisms and lipids, controlling for the effects of drug exposures and other relevant clinical and demographic factors. Ultimately, understanding the pharmacogenomic underpinnings to complex diseases, such as cardiovascular disease, will have broad implications for tailoring therapy decisions to patient-specific characteristics.

In general, the pair of single nucleotide polymorphisms (SNPs) at each locus within a gene is observed; however, the alignment of these nucleotides across loci for a given chromosomal copy is unobservable. This unobservable information, commonly referred to as haplotypic phase, can be biologically and clinically informative and ignoring it may lead to a loss of power to detect associations. In this manuscript, we describe how the ECM approach accounting for uncertainty in cluster identifiers can be applied to the setting of ambiguous-phase haplotype data to discover clinically relevant biological associations. This approach represents a contribution to existing methodology since it addresses simultaneously the need to consider multiple genetic indicators and the unobservable aspect of haplotypic phase using a fully likelihood–based approach.

Recently, Foulkes and others (2005) proposed applying mixed-effects models to data arising from genetic association studies of unrelated individuals. A primary strength of this approach is that it allows for assessing overall variability across combinations of multiple genetic polymorphisms using a single, omnibus test while controlling for potential confounding by environmental and clinical characteristics. Empirical Bayes estimates of multilocus genotype effects and corresponding prediction intervals lend additional insight into the specific polymorphisms contributing to measures of disease progression. The method proposed herein extends this approach to handle the setting in which genetic information is unobservable.

The proposed method also extends the generalized linear modeling approach of Lake and others (2003) and Lin and Zeng (2006) that both describe implementation of an EM algorithm for unobservable haplotype data. Notably, both the mixed modeling approach and the methods given in Lin and Zeng (2006) can accommodate specific departures from Hardy–Weinberg equilibrium (HWE). The primary difference between the approaches is that the mixed modeling approach we present assumes that haplotype effects are random, arising from an underlying probability distribution. This provides a flexible analytic framework for characterizing a large number of genetic indicators and may offer a solution to the degrees-of-freedom problem inherent in tests of haplotype–trait associations as described by Tzeng and others (2006).

Finally, mixed-effects models have been described as a special case of structural equation models (SEMs) or latent class models (Sanchez and others, 2005). Here, we introduce a doubly latent class structure since there are latent cluster random effects as well as unobservable cluster identifiers. Notably, the inclusion of both latent class indicators and latent random effects has been described in the SEM literature. For example in Muthen and Shedden (1999), alcohol dependency classes have latent indicators while person-specific random effects are included to account for repeated measures over time. In our setting, clusters similarly have latent indicators but it is the same clusters (and not individuals) that are assumed to have random effects. This renders our setting distinct. The heterogeneity model of Verbeke and Lesaffre (1996), on the other hand, assumes an unobservable mixture distribution on the random effects, that is, that the random cluster effects are themselves clustered. This is again different from our setting since we assume that the cluster effects arise from a single distribution while the membership to these clusters is potentially unobservable. These are subtle distinctions but important ones requiring novel associated methods.

We begin in Section 2 by outlining our notation, the assumed underlying model, and a brief summary of estimation via the EM algorithm in the usual linear mixed modeling setting in which cluster assignments are fully observed. We then describe a novel estimation approach in the general context of cluster ambiguity in Section 3. Extensions for investigation of genetic associations are provided in Section 4. Finally, in Section 5 we present a summary of results from a simulation study and from applying this approach to a study of HAART-associated dyslipidemia in HIV-1-infected individuals. A comprehensive discussion of the simulation approach and corresponding findings is provided in Appendix 3.

2. BACKGROUND

2.1 Notation and model

Consider the linear mixed-effects model given in (2.1) for $i = 1,\ldots,M$, where $Y_i$ is an $n_i\times 1$ vector with $j$th element equal to the response for the $j$th observation in cluster $i$, $n_i$ is the number of observations in cluster $i$, $X_i$ is a corresponding matrix of covariates, and $Z_i$ is the design matrix for random cluster effects. We assume $b_i\sim N(0, D)$ and $\epsilon_i\sim N(0, \sigma_\epsilon^2 I_{n_i})$. In the general mixed modeling setting, $Z_i$ is observed and an EM approach is used to estimate $\beta$ and $\theta$, where $\beta$ is the vector of mean parameters and $\theta = (D, \sigma_\epsilon^2)$ is a vector of variance components. This approach is described in Laird and Ware (1982) and summarized in Section 2.2 below:
$$Y_i = X_i\beta + Z_ib_i + \epsilon_i. \qquad (2.1)$$

Now, suppose $Z_{N\times M} = \mathrm{blkdiag}[Z_i]$, where $N = \sum_{i=1}^{M} n_i$. In the ambiguous cluster setting, both $Z$ (the indicator for cluster membership) and $n_i$ are potentially unobserved. In addition, the elements of $Y_i$ and $X_i$ will vary depending on cluster assignments. Let the observed data relevant to cluster assignments be $G$. We define $\mathcal{S}$ to be the set of all design matrices $Z$ that are consistent with these observed data. For simplicity of notation in subsequent sections, we let $Y = (Y_1^T, Y_2^T,\ldots,Y_M^T)^T$, $X = [X_1^T, X_2^T,\ldots,X_M^T]^T$, $\mathcal{D} = \mathrm{blkdiag}[D]$, $b = (b_1^T,\ldots,b_M^T)^T$, and $\epsilon = (\epsilon_1^T,\ldots,\epsilon_N^T)^T$. The model in (2.1) can be rewritten in complete matrix notation as described in (2.2). The variance of $Y$ is given by $W = Z\mathcal{D}Z^T + \sigma_\epsilon^2 I_{N\times N}$:
$$Y = X\beta + Zb + \epsilon. \qquad (2.2)$$

2.2 Estimation in the fully observed cluster setting

First, consider the usual linear mixed modeling setting in which cluster assignments are fully observed and the traditional EM approach to estimation of Laird and Ware (1982). This approach proceeds by first calculating the ML estimate of β assuming the current estimate of θ. This calculation is straightforward since a closed-form solution exists. Second, we update the estimate of θ assuming the current estimates of β. Estimation at this step proceeds using an EM algorithm, which involves first determining the conditional expectation of the complete-data log-likelihood (E-step) and then maximizing this to arrive at new parameter estimates (M-step). This process is then repeated iteratively until a convergence criterion is met.

If the variance parameters are known, the ML estimate of $\beta$ is given by
$$\hat{\beta} = (X^T W^{-1} X)^{-1} X^T W^{-1} Y. \qquad (2.3)$$
In general, $\theta$ is not known and we replace $W$ in this equation with its ML estimate $\hat{W} = Z\hat{\mathcal{D}}Z^T + \hat{\sigma}^2 I$. Based on the complete-data likelihood, where the complete data consist of $Y$, $b$, and $\epsilon$, the sufficient statistics for $\theta = [\sigma^2, D]$ are given by $t_1 = \sum_{i=1}^{M}\epsilon_i^T\epsilon_i$ and $t_2 = \sum_{i=1}^{M} b_ib_i^T$. The M-step of the EM algorithm consists of arriving at the ML estimates $\hat{\sigma}^2 = \hat{t}_1/N$ and $\hat{D} = \hat{t}_2/M$, assuming the current estimate of $\Omega = (\beta,\theta)$.
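For reference, the generalized least squares update in (2.3) can be written directly; a minimal numpy sketch (names are illustrative, and a naive matrix inverse is used for brevity):

```python
import numpy as np

def gls_beta(y, x, w):
    """ML estimate of the fixed effects given the marginal covariance W (equation (2.3))."""
    w_inv = np.linalg.inv(w)                   # W^{-1}; prefer linear solves for large N
    xtw = x.T @ w_inv
    return np.linalg.solve(xtw @ x, xtw @ y)   # (X'W^{-1}X)^{-1} X'W^{-1} Y
```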

The E-step involves setting the sufficient statistics $t_1^{(k)}$ and $t_2^{(k)}$ equal to their expectations conditional on the observed data $Y$, as summarized in (2.4):
$$\hat{t}_1^{(k)} = E\big(t_1 \mid Y, \hat{\Omega}^{(k)}\big), \qquad \hat{t}_2^{(k)} = E\big(t_2 \mid Y, \hat{\Omega}^{(k)}\big). \qquad (2.4)$$
It is straightforward to show that $E(b_i\mid Y) = DZ_i^TW_i^{-1}(Y_i - X_i\beta)$ and $E(\epsilon_i\mid Y) = \sigma^2 W_i^{-1}(Y_i - X_i\beta)$. Restricted maximum likelihood (REML) estimates are obtained by adding correction terms to each equation; these additional terms account for estimation of the mean parameter $\beta$.

3. METHODS

In this section, we extend the methods described in Section 2.2 to handle ambiguity in the correlated data indicators. That is, we assume that the $Z_i$ of (2.1) is not observed. Since estimation of β requires knowledge of the unobserved cluster assignments, an additional implementation of the EM algorithm is required at the first step. Specifically, estimation of β will depend on weights equal to the estimated posterior probabilities of each potential cluster assignment given the observed data. This requires first assuming a distribution for the number of observations in each cluster as described in Section 3.1 below. We let this distribution be a function of the parameter vector α = (α1,…,αM), where αi is the population frequency of cluster i. We then proceed similarly to the unambiguous setting by first estimating the mean components Φ = [β,α] assuming θ is known and, second, estimating the variance parameters θ given the current estimate of Φ.

3.1 Defining a distribution for cluster counts

We assume that the probability of a particular configuration of cluster assignments (represented by the design matrix $Z$) follows a multinomial distribution. This probability is given explicitly in (3.1), where $n_{Z,i}$ is the number of observations in cluster $i$ for the given $Z$, $\alpha = (\alpha_1,\ldots,\alpha_M)$, $\alpha_i$ is the population frequency of cluster $i$, and $\sum_{i=1}^{M}\alpha_i = 1$ or, equivalently, $\alpha_M = 1 - \sum_{i=1}^{M-1}\alpha_i$. Note that the usual constant term $N!/(n_1!\cdots n_M!)$ is not included in this formula since the probability is for a single configuration $Z$:
$$\Pr(Z\mid\alpha) = \prod_{i=1}^{M}\alpha_i^{\,n_{Z,i}}. \qquad (3.1)$$

The number of clusters, given by M, is assumed to be known. This is a reasonable assumption in most data settings. For example, in Section 4 clusters are formulated based on pairs of haplotypes; the number of possible pairs is a fixed number that depends on the number of SNPs under investigation within a gene. Alternatively, clusters may represent hospitals or schools and the number of such units is generally fixed at the onset of a study.

3.2 Estimating mean parameters, conditional on θ

The complete data consist of $Y$, $Z$, $b$, and $\epsilon$ and are denoted $Y_{\mathrm{complete}}$. In estimating the mean parameters, we treat $b$ and $\epsilon$ as known and write the complete-data likelihood for $\Phi = (\beta,\alpha)$ given $\theta$ as
$$L_c(\Phi\mid Y_{\mathrm{complete}},\theta) = \Pr(Y\mid Z,\beta,\theta)\,\Pr(Z\mid\alpha). \qquad (3.2)$$
Here, $\Pr(Y\mid Z,\beta,\theta)$ is the marginal conditional density for the observed data $Y$ and is given by
$$\Pr(Y\mid Z,\beta,\theta) = (2\pi)^{-N/2}\,|W_Z|^{-1/2}\exp\!\Big\{-\tfrac{1}{2}(Y - X\beta)^T W_Z^{-1}(Y - X\beta)\Big\}. \qquad (3.3)$$
Note that the particular configuration of cluster assignments will contribute to $W$, and we therefore include an additional $Z$ subscript in our notation.

The E-step involves calculating the conditional expectation of the complete-data log-likelihood. This conditional expectation is given in (3.4), where $p_Z(\Omega)$ is the posterior probability of the combination of cluster assignments (again denoted by the design matrix $Z$) given $Y$ and $\Omega = (\Phi,\theta)$. Recall that $\mathcal{S}$ is the set of all design matrices $Z$ that are consistent with the observed data:
$$Q(\Phi) = \sum_{Z\in\mathcal{S}} p_Z(\Omega)\big[\log\Pr(Y\mid Z,\beta,\theta) + \log\Pr(Z\mid\alpha)\big]. \qquad (3.4)$$
A formulation of this posterior probability is
$$p_Z(\Omega) = \frac{\Pr(Y\mid Z,\beta,\theta)\,\Pr(Z\mid\alpha)}{\sum_{Z'\in\mathcal{S}}\Pr(Y\mid Z',\beta,\theta)\,\Pr(Z'\mid\alpha)}. \qquad (3.5)$$
At this step, we update our estimate of $p_Z(\Omega)$ assuming the current estimate of $\Omega$. That is, we calculate $p_Z(\hat{\Omega}^{(k)})$, where $\hat{\Omega}^{(k)}$ is the vector of current parameter estimates.
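The E-step weights can be computed by evaluating the marginal normal density of Y under each candidate configuration and normalizing, as in (3.5). The sketch below assumes a single random intercept per cluster and that the set 𝒮 is small enough to enumerate; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_configuration_weights(y, x, beta, configurations, alpha, sigma2_b, sigma2_e):
    """Posterior probability p_Z for each candidate cluster-assignment configuration.

    configurations : list of integer arrays; element k gives the cluster label of each
                     subject under candidate configuration k (an element of S).
    alpha          : population frequencies of the M clusters.
    """
    alpha = np.asarray(alpha, dtype=float)
    mean = x @ beta
    log_weights = []
    for z in configurations:
        # marginal covariance W_Z = Z D Z' + sigma2_e I for one random intercept per cluster
        same_cluster = (z[:, None] == z[None, :]).astype(float)
        w_z = sigma2_b * same_cluster + sigma2_e * np.eye(len(z))
        log_prior = np.sum(np.log(alpha[z]))                    # log Pr(Z | alpha), eq. (3.1)
        log_lik = multivariate_normal.logpdf(y, mean=mean, cov=w_z)
        log_weights.append(log_lik + log_prior)
    log_weights = np.array(log_weights)
    log_weights -= log_weights.max()                            # stabilize before exponentiating
    weights = np.exp(log_weights)
    return weights / weights.sum()                              # normalized p_Z, eq. (3.5)
```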

The M-step involves maximizing the conditional expectation of the complete-data log-likelihood, conditional on the current estimates of the posterior probabilities calculated in the E-step. Maxima can be obtained by solving the system of equations given in (3.6) and (3.7). Here, we use the relationship that the derivative of the conditional expectation is equal to the conditional expectation of the score function:
$$\sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, X^T W_Z^{-1}(Y - X\beta) = 0 \qquad (3.6)$$
and
$$\sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\left(\frac{n_{Z,i}}{\alpha_i} - \frac{n_{Z,M}}{\alpha_M}\right) = 0, \qquad i = 1,\ldots,M-1. \qquad (3.7)$$
The resulting closed-form solutions for $\hat{\beta}$ and $\hat{\alpha}$ are
$$\hat{\beta} = \Big[\sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, X^T W_Z^{-1} X\Big]^{-1}\sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, X^T W_Z^{-1} Y \qquad (3.8)$$
and
$$\hat{\alpha}_i = \frac{1}{N}\sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, n_{Z,i}. \qquad (3.9)$$

In the case that the variance parameters are known, we iterate between updating our estimates of pZ(Ω) and updating our estimates of Φ. The EM algorithm ensures that we will increase the likelihood at each iteration. In general, the variance components θ are not known; however, we can obtain ML estimates of θ and condition on these estimates. This amounts to substituting these ML estimates into the above equations. In the following paragraphs, we describe a modification of the EM algorithm for estimation of θ that additionally incorporates posterior probabilities associated with each combination of cluster assignments.

3.3 Estimating variance components, conditional on Φ

For the purpose of estimating the variance parameters, we define the complete-data likelihood $L_c(\theta\mid Y_{\mathrm{complete}},\Phi)$ in (3.10). Note that $f(Y\mid b,\epsilon,\beta)$ is a Dirac function (equal to 1 under the model) and depends only on $\beta$, so it is ignored in the estimation of $\theta$:
$$L_c(\theta\mid Y_{\mathrm{complete}},\Phi) = f(Y\mid b,\epsilon,\beta)\, f(b\mid D)\, f(\epsilon\mid \sigma_\epsilon^2)\,\Pr(Z\mid\alpha). \qquad (3.10)$$
Using the same approach as described in Section 2.2, we set the sufficient statistics for $\theta$ equal to their expectations. Here, we additionally sum over the set of all design matrices $Z$ that are consistent with the observed data and weight by the corresponding posterior probabilities $p_Z(\hat{\Omega})$:
$$\hat{t}_1 = \sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, E\big(t_1\mid Y, Z, \hat{\Omega}\big), \qquad \hat{t}_2 = \sum_{Z\in\mathcal{S}} p_Z(\hat{\Omega})\, E\big(t_2\mid Y, Z, \hat{\Omega}\big). \qquad (3.11)$$
Again, the maximization step involves setting the variance parameters equal to functions of these weighted expectations, where $M_Z$ is the number of clusters corresponding to the specific configuration given by $Z$. Adjustments to these equations to arrive at REML estimates proceed as described in Section 2.2.

3.4 Summary of approach

In summary, ML estimation proceeds by iterating between 2 EM algorithms: (1) estimation of Φ and (2) estimation of θ. For computational efficiency, we implement one iteration of the first EM algorithm conditional on the current estimate of θ and then one iteration of the second EM algorithm conditional on the current estimate of Φ. This is then repeated until a convergence criterion is met. Initial values for the parameter estimates are arrived at by randomly assigning clusters in the case of ambiguity and fitting the usual mixed-effects model. This approach is summarized in the following step-by-step procedure:

1. Obtain initial values of Φ and θ by randomly assigning ambiguous individuals to clusters and fitting the usual mixed-effects model.

2. Conditional on the current estimate of θ, perform one E-step and one M-step of the first EM algorithm to update Φ = (β, α).

3. Conditional on the current estimate of Φ, perform one E-step and one M-step of the second EM algorithm to update θ = (D, σϵ2).

4. Repeat steps 2 and 3 until the convergence criterion is met.

Data arising from genetic association studies of unrelated individuals are generally composed of 3 components: (1) one or more outcomes (commonly referred to as phenotypes, these can be continuous or a binary indicator for case–control status), (2) covariates and potential confounders, including clinical and demographic factors, and (3) genotypes, consisting of the pair of nucleotides present at each locus within and across the candidate genes under consideration. In general, the alignment of nucleotides on a single chromosomal copy, commonly referred to as haplotypic phase, is not observable. For example, if an individual is heterozygous at 2 loci within a gene so that their observed genotype is (Aa,Bb), then the corresponding possible haplotype pairs for this individual are (AB,ab) or (Ab,aB).
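To make the phase ambiguity concrete, the haplotype pairs compatible with an observed multilocus genotype can be enumerated directly; the following small sketch is illustrative only and not part of the original method.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Enumerate haplotype pairs consistent with an unphased genotype.

    genotype : list of 2-character strings, one per locus, e.g. ["Aa", "Bb"].
    Returns a set of unordered haplotype pairs (each haplotype is a string).
    """
    pairs = set()
    # choose, at each locus, which of the 2 observed alleles goes to the first haplotype
    for assignment in product(range(2), repeat=len(genotype)):
        h1 = "".join(g[k] for g, k in zip(genotype, assignment))
        h2 = "".join(g[1 - k] for g, k in zip(genotype, assignment))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs

# An individual heterozygous at 2 loci has 2 possible phase resolutions: (AB,ab) and (Ab,aB)
print(compatible_diplotypes(["Aa", "Bb"]))
```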

4.1 Defining clusters

The mixed modeling approach to the analysis of genetic association studies begins by grouping individuals into clusters so that individuals within the same cluster have similar or identical underlying genetic compositions. For example, in Foulkes and others (2005) individuals with identical multilocus genotypes (i.e. the same pattern of SNPs across multiple loci within or across a gene) are deterministically assigned to a corresponding cluster. Once these clusters are defined, analysis proceeds using the mixed-effects modeling framework just as it would in a typical clustered data setting. In the context of studying haplotypic variations, such a grouping based on genetic compositions is generally unobservable. That is, just as haplotypic phase is unobservable, cluster assignments based on this information must also be unobservable.

More formally, suppose ℋ = (h1,…,hK) represents the set of all haplotype pairs (diplotypes) consistent with the observed genotypes for a given gene. We define clusters C1,…,CM such that an individual with haplotype combination contained in ℋi belongs to cluster Ci, where ℋi⊂ℋ. In the most general case, we assume that the number of clusters equals the number of haplotype pairs, that is, K = M and ℋi = hi so that all individuals within the same cluster have identical diplotypes. For example, in the case of 2 SNPs within a gene, there are 4 possible haplotypes and K = 10 possible diplotypes. In the general case, we define 10 corresponding clusters.

Several alternative formulations of the clusters are tenable, and these can reflect the biological hypothesis under investigation. For example, returning to the simple case above, we can group all diplotypes that contain at least one copy of the rare haplotype into a single cluster. This would result in 7 clusters and is consistent with a dominant genetic model in which one copy of the disease allele results in an altered phenotype. A visual representation of these 2 approaches to defining clusters is given in Figure 1.

Fig. 1.

Sample approaches to defining clusters. For the 2 SNP example in which the observed genotypes are (AA, Aa, or aa) and (BB, Bb, or bb), there are 4 possible haplotypes, AB, Ab, aB, and ab, and 10 possible diplotypes. The most general approach to defining clusters results in 10 clusters consisting of all these possible combinations of 2 haplotypes. These are indicated by shaded rectangles. An alternative approach groups all diplotypes with at least one copy of the rare ab haplotype into a single cluster. This is indicated by the dashed rectangle that combines 4 of the previously defined clusters into a single cluster. In this case, there are a total of 7 clusters.

4.2 Estimation

Returning to the model in (2.1), Zi again indicates membership to cluster i (or equivalently for the general case, presence of haplotype pair hi) and is potentially unobservable. The observed data G that inform the cluster memberships are the observed genotypes. Recall that the population frequency of each cluster is given by αi for i = 1,…,M. Under the assumption of HWE, the probability of a pair of haplotypes is equal to the product of the corresponding marginal frequencies.

In our setting, the HWE assumption is not required since we estimate cluster-level (diplotype) probabilities. This results from the fact that we allow for 2 components of the data to inform our estimation of cluster frequencies. The first is those individuals whose cluster membership is unambiguous and the second is the phenotype (Y). In the special case that we are estimating the frequencies of clusters that are completely ambiguous, that is, all individuals within the clusters are ambiguous, then we rely solely on Y for this purpose, unless we make an additional assumption such as HWE. In this extreme case, while we are able to estimate cluster frequencies and calculate corresponding empirical Bayes estimates of random effects, it is not possible to distinguish which values correspond to which clusters without additional assumptions. Notably, the omnibus test for overall variability in the random cluster effects is still valid.

If the HWE assumption is reasonable, the proposed method can be refined further to define cluster probabilities as the product of the corresponding 2 haplotype probabilities. That is, (3.1) can be reexpressed as
$$\Pr(Z\mid\delta) = \prod_{j=1}^{M^*}\delta_j^{\,m_{Z,j}}, \qquad (4.1)$$
where $H_j$ represents a single haplotype with population frequency $\delta_j$, $M^*$ is the number of unique haplotypes, and $m_{Z,j}$ is the number of copies of $H_j$ observed across all clusters for the configuration given in $Z$. The ML estimate of $\delta_j$ is as given in (3.9), with $n_{Z,i}$ replaced by $m_{Z,j}$. For the simple example described in Figure 1, under the HWE assumption the number of frequency parameters reduces from M = 10 to M* = 4. Note, however, that in both cases the number of random effects is 10 since there are 10 clusters. Sensitivity of this approach to violations of HWE is described in Section 5.1 and Appendix 3.
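Under HWE, cluster (diplotype) frequencies follow from haplotype frequencies as simple products; a small sketch for a 4-haplotype example in the spirit of Figure 1, with hypothetical frequencies:

```python
from itertools import combinations_with_replacement

def diplotype_frequencies(hap_freqs):
    """Diplotype frequencies implied by haplotype frequencies under HWE."""
    freqs = {}
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        p = hap_freqs[h1] * hap_freqs[h2]
        freqs[(h1, h2)] = p if h1 == h2 else 2 * p   # heterozygous pairs count both orderings
    return freqs

haps = {"AB": 0.4, "Ab": 0.3, "aB": 0.2, "ab": 0.1}  # hypothetical haplotype frequencies
dips = diplotype_frequencies(haps)                   # 10 diplotype frequencies
assert abs(sum(dips.values()) - 1.0) < 1e-12         # they sum to 1
```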

4.3 Testing and prediction

For the case in which we are interested in overall genetic effects and not interactions between genes and other covariates, $b_i$ reduces to a scalar and $Z_i = 1_{n_i}$ is an $n_i\times 1$ vector of 1s. In the application of mixed modeling to genetic data, interest may lie in testing for significant variability across random effects (e.g. $H_0{:}\ \sigma_b^2 = 0$). A likelihood ratio test comparing the expected complete-data log-likelihood for the full model (with random cluster effects) to the reduced model (without random cluster effects) can be applied. Finally, empirical Bayes estimates of the random effects inform us about the cluster-specific effects on the phenotype under consideration. These are calculated in the usual way, with additional weights equal to the posterior probabilities of cluster assignments, as given in (4.2).

4.4 Computational and identifiability considerations

Suppose heterozygosity is observed in our sample at exactly $r$ sites for the gene under investigation. In this case, the number of possible haplotypes is $2^r$ and the number of haplotype pairs (clusters) is $R = 2^{r-1}(2^r + 1)$. Notably, some clusters consist only of individuals whose haplotypic phase is fully determined. For example, consider the clusters illustrated in Figure 1. The cluster consisting of the haplotype pair (AB,Ab) corresponds uniquely to individuals with the genotype (AA,Bb). Since this genotype has heterozygosity at only a single site, the corresponding haplotypic phase is known. That is, the haplotypic phase of individuals within the (AB,Ab) cluster is completely observed. On the other hand, for the example provided, the true cluster assignment for individuals with uncertainty in phase will be either (AB,ab) or (Ab,aB). In general, only a subset, $R^*$, of the $R$ clusters consists of individuals for whom haplotypic phase is ambiguous.

If there are $K$ individuals with ambiguous phase, then the number of possible cluster assignment configurations is $|\mathcal{S}| = (R^*)^K$. This is the number of elements $Z$ of the set $\mathcal{S}$ in (3.4)–(4.2). For the simple case in which $r = 2$, we have $R^* = 2$ and $|\mathcal{S}| = 2^K$. Thus, the computational burden of the proposed modeling approach is clearly quite large; however, a few matrix identities help to reduce the computational intensity. Specifically, suppose each individual has a single random effect, so that $Z_i = 1_{n_{Z,i}}$ and $\mathcal{D} = \sigma_b^2 I$. In this case, we can write the within-cluster inverse as
$$W_{Z,i}^{-1} = \frac{1}{\sigma_\epsilon^2}\left(I_{n_{Z,i}} - \frac{\sigma_b^2}{\sigma_\epsilon^2 + n_{Z,i}\sigma_b^2}\,J_{n_{Z,i}}\right), \qquad (4.3)$$
where $J_{n_{Z,i}} = 1_{n_{Z,i}}1_{n_{Z,i}}^T$. A formal derivation of this identity is given in Appendix 1(a). Note that $W_Z^{-1}$ depends only on the number of individuals within each cluster and not on the specific configuration of the individuals. Since multiple elements of $\mathcal{S}$ yield the same numbers of observations per cluster, use of (4.3) reduces the number of calculations of $W^{-1}$ from $2^K$ to $K + 1$.
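The compound-symmetry inverse in (4.3) is easy to verify numerically; a short sketch with arbitrary values:

```python
import numpy as np

n, sigma2_b, sigma2_e = 5, 0.4, 1.3                 # arbitrary cluster size and variances
J = np.ones((n, n))
W = sigma2_b * J + sigma2_e * np.eye(n)             # within-cluster marginal covariance

# closed form (4.3): avoids a generic matrix inversion for every configuration
W_inv = (np.eye(n) - sigma2_b / (sigma2_e + n * sigma2_b) * J) / sigma2_e

assert np.allclose(W_inv, np.linalg.inv(W))         # agrees with the direct inverse
```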

If we further assume $X = 1_N$, so that the model given in (2.1) consists of an intercept and no covariates, then $\hat{\beta}$ reduces to the form given in (4.4). A detailed derivation is provided in Appendix 1(b). The sum over $Z\in\mathcal{S}$ then reduces to a sum over all combinations of cluster sizes, with the weight for each combination equal to the sum of $p_Z(\hat{\Omega})$ over all $Z$ consistent with that combination. Note that $\sum_{j=1}^{n_{Z,i}}Y_{ij}$ does depend on the particular configuration of cluster assignments and thus must be determined for each $Z\in\mathcal{S}$. Again, calculation of the inverse (the first term in the product in (4.4)) depends only on the number of individuals per cluster, thus reducing the number of computations substantially.

Gains in computational efficiency can also be achieved by partitioning $Y$, $X$, and $Z$ into their ambiguous and unambiguous components. Let $Y^T = [Y_a^T\,|\,Y_u^T]$, $X^T = [X_a^T\,|\,X_u^T]$, and $Z^T = [Z_a^T\,|\,Z_u^T]$, where the subscripts $a$ and $u$ indicate subsets of the observed data corresponding to individuals whose cluster assignments are ambiguous and unambiguous, respectively. Since $W$ is block diagonal (clusters are assumed independent), $W^{-1}$ can be partitioned similarly into $\mathrm{blkdiag}[W_a^{-1}, W_u^{-1}]$, where $W_a^{-1} = (Z_a\mathcal{D}Z_a^T + \sigma_\epsilon^2 I)^{-1}$ and $W_u^{-1} = (Z_u\mathcal{D}Z_u^T + \sigma_\epsilon^2 I)^{-1}$. We can now write (3.8), (3.9), and (3.11) in terms of sums of ambiguous and unambiguous components, as described in Appendix 2(a)–(c). The unambiguous components need only be calculated once, while the ambiguous components depend on $Z\in\mathcal{S}$.

The most general approach to defining clusters in the haplotype setting, as described in Figure 1, results in all individuals with ambiguity in phase belonging to a subset of clusters while no fully observed individuals belong to clusters in this subset. As noted in Section 4.2, estimation of cluster-level frequencies in this setting relies solely on the response variable and is not identifiable in the sense that we cannot distinguish which frequencies and empirical Bayes estimates correspond to which particular clusters. While the omnibus test for overall variability is valid in this setting, estimation of haplotype frequencies under the HWE assumption may be more relevant. In the general setting of missing correlated data identifiers, this would represent an extreme case.

5. DATA RESULTS

5.1 Simulation study

In order to evaluate the mixed modeling approach for characterizing haplotype–trait associations, we conducted a simulation study that includes the following components: (1) a sensitivity analysis of the mixed modeling approach to the number of clusters (haplotypes) and to model misspecification, considering both founder effect models (assuming dominant and recessive traits) and departures from HWE; (2) a comparison to alternative methods, including a traditional analysis of variance (ANOVA) approach and a 2-stage multiple imputation (MI) approach (Foulkes and others, 2007); and (3) detailed simulation findings, including power, coverage rates (CRs), and false-discovery rates (FDRs) for varying effect sizes (ratios of standard deviations) and degrees of ambiguity in cluster assignments. Details of the simulation study and corresponding results are provided in Appendix 3.

Briefly, we found that the mixed modeling approach has reasonable power for a sample size of n = 200 and between 10 and 36 clusters (4–8 haplotypes). Performance is relatively poor, however, under misspecification of the random-effects distribution. The reduction in power is especially pronounced for the recessive founder model, in which only a single cluster effect arises from a normal distribution with nonzero variance and the remaining cluster effects have zero variability. On the other hand, power appears stable under moderate deviations from HWE when HWE is assumed. For a ratio of standard deviations (σb/σϵ) of 0.4 and a sample size of n = 200, power for the ANOVA and mixed modeling approaches is comparable when the number of clusters is less than 20. For more than 20 clusters, the power of the mixed modeling approach (based on the single degree-of-freedom test) is greater. Power for the 2-stage MI approach is comparable to that of the fully likelihood–based approach described herein for one data example. Modest reductions in power are observed with increasing ambiguity in cluster identifiers, while corresponding CRs for variance parameters are lower. Finally, FDRs increase from 5% to 7–9% as cluster ambiguity increases up to 20%.

5.2 Example

Recent studies indicate that long-term exposure to certain combination antiretroviral therapies (ARTs) may lead to a host of lipid abnormalities including increases in triglycerides and total cholesterol and a reduction in high-density lipoprotein cholesterol (HDL-C). In turn, this can lead to accelerated onset of cardiovascular disease and death, presenting a grave concern for HIV-1-infected individuals receiving continuous long-term therapy. However, the large number of available ARTs provides a great potential to tailor treatments to individual-level characteristics. Furthermore, understanding the characteristics of individuals at greatest risk for cardiovascular complications will provide clinicians the opportunity to target interventions, such as administration of lipid-lowering therapies.

The data motivating our research arise from a cohort of N = 626 HIV-1-infected individuals at risk for ART-associated dyslipidemia. These data were collected as part of multiple AIDS Clinical Trials Group studies and combined under New Work Concept Sheet 224. The primary aim of this study is to identify genetic factors that predict lipid abnormalities after controlling for traditional risk factors and other clinical parameters, including age, sex, use of lipid-lowering therapy, and current ART exposure. First-stage analysis results and general descriptive information on the cohort are provided in Foulkes and others (2006). This analysis revealed potential effect modification by race/ethnicity, and so for the purpose of illustration, we describe here application of the above method within Hispanics (N = 109).

The effects of haplotypic variation in endothelial lipase (EL) on HDL-C are considered. The SNPs chosen for analysis are rs12970066, Asn396Ser, and rs3829632 (-1309A/G) and were determined based on prior knowledge of association with plasma lipoproteins and for capturing genetic variability within this gene. A haplotype-based analysis can be advantageous if the observed SNPs are in linkage disequilibrium with the disease-causing variant and are not themselves functional; in general and in this setting, the functionality of specific SNPs is not fully characterized, and thus, a haplotype-based analysis can provide new insight. We assume HWE within the single race/ethnicity group and apply the ECM approach described in Section 4.2. A summary of genotype frequencies is given in Table 1. In this sample, N = 13 individuals have uncertainty in haplotypic phase due to heterozygosity at rs12970066 (AG) and Asn396Ser (CG). Notably, variability is not observed in the third SNP (rs3829632) within Hispanics; however, we include this SNP in our presentation for completeness. Covariates included in model fitting are age, gender, CD4 count, current ART exposure, use of lipid-lowering therapy, and study. N = 100 individuals with complete data are included in the analysis.

Table 1. EL genotypes within Hispanics. Genotype counts for combinations of 3 SNPs in EL. Although variability in rs3829632 is not observed within the subset of Hispanics, this SNP is included in the presentation for completeness.

       rs12970066   Asn396Ser   rs3829632 (-1309A/G)   Count (%)
  1    AA           CC          AA                     23 (0.21)
  2    AA           CG          AA                     24 (0.22)
  3    AA           GG          AA                      4 (0.04)
  4    AG           CC          AA                     31 (0.28)
  5    AG           CG          AA                     13 (0.12)
  6    GG           CC          AA                     14 (0.13)

  Total: 109

A convergence criterion of a maximum absolute percentage change in parameter estimates from one iteration to the next of less than 1×10⁻⁵ is used. Convergence is met after 20 iterations. Resulting haplotype frequency estimates are given in Table 2. The estimated variance of the random haplotype effects is 0.013, with a corresponding likelihood ratio test statistic of 6.69 (p < 0.05, based on the 50–50 mixture of χ0² and χ1² reference distributions). The estimated error variance is 0.057. Empirical Bayes estimates of the random haplotype pair effects are given in Figure 2. These results suggest overall variability in the haplotypic effects of EL on HDL-C. The cluster (ACA,AGA) has the largest absolute estimated effect, suggesting that individuals with this pair of haplotypes will have a lower predicted HDL-C level. Since HDL-C is considered the “good” cholesterol, these individuals may be at greatest risk for ART-associated lipid complications and may be candidates for targeted intervention.

In this manuscript, we describe an ECM approach to finding ML parameter estimates in the linear mixed-effects model setting when the correlated data indicators are ambiguous. This research was motivated by interest in characterizing genetic effects on a phenotype when haplotypic phase is unobservable. The proposed approach, however, has broader relevance to other settings in which cluster identifiers are not known with certainty. Notably, a similar approach can be applied to missing genotype data, where multilocus genotype group identifiers are treated as ambiguous.

We focused on the simple linear mixed-effects model with a single random cluster effect. Alternative formulation of the design matrix for the random effects allows for assessing interactions between patient-specific characteristics and haplotypes. For example, inclusion of an ART drug indicator in the Z matrix of (2.1) would allow us to investigate drug-by-gene effects in a pharmacogenomic study. Extensions to settings with alternative, noncontinuous outcomes and semiparametric mixed models that relax the normality assumption on the random effects require additional consideration. As expected, our simulation study suggests relatively poor performance in the context of a recessive, founder model in which the normality assumption of the random effects is severely violated. Further investigations of performance under alternative model formulations, as well as explorations of the utility of semiparametric procedures and model diagnostics in these settings, would be interesting.

In another recent manuscript, we describe an MI approach for this setting in which the haplotype frequencies are estimated independently of the outcome (Foulkes and others, 2007). An EM algorithm as described by Excoffier and Slatkin (1995) can be applied for haplotype reconstruction and then multiple imputed data sets derived by repeated weighted sampling based on the estimated posterior probabilities. The primary advantage of the joint approach we describe herein is that it incorporates information about the phenotype in the estimation procedure. For example, if ambiguity rests between 2 clusters with effects bi and bi′, where bi > bi′, then this approach will tend to assign individuals with higher observed phenotypes to Ci and individuals with lower phenotypes to Ci′. Notably, in one simple simulation study, the 2 approaches yielded similar results while the computational burden associated with the joint approach is much greater. In light of the theoretical advantages, however, further and extensive consideration of alternative settings and the extent to which incorporating this additional layer of information indeed results in greater efficiency is warranted.

A primary limitation of the mixed modeling approach for haplotype–trait association studies is that as the number of SNPs increases, the number of haplotypes (and therefore clusters) can quickly approach the number of individuals under investigation. In genome-wide association studies, taking a random sample of SNPs within a known disease pathway or genetic region may be tenable and appropriate for the random-effects modeling framework. Paradoxically, increasing the number of variables (SNPs) can also lead to greater phase ambiguity in the data, suggesting an important trade-off between information gained from more accurate haplotype reconstruction and potential power loss associated with increasing ambiguity.

As mentioned in Section 1, the proposed approach represents an extension of SEMs with a doubly latent class structure defined by both latent random effects and unobservable class identifiers. Further extensions that draw on the literature of SEMs may provide additional tools for incorporating known biological function, such as gene-specific pathways to disease. For example, multiple random effects based on sets of genes with similar, known functionality may provide additional insight into the determinants of complex diseases. The framework we describe allows for this multivariable investigation while accounting for the unobservable nature of haplotypic phase in association studies.

FUNDING

National Institute of Allergy and Infectious Diseases (NIAID) (AI056983); National Institute of Diabetes and Digestive and Kidney Diseases (DK021224); Adult AIDS Clinical Trials Group funded by the NIAID (AI38858); CRI: Computational Biology Facility for Western Massachusetts (CNS 0551500) for the computing cluster.

We thank the AIDS Clinical Trials Group New Works Concept Sheet 224 study team for helpful discussions and providing access to data. Conflict of Interest: None declared.

APPENDIX 1

APPENDIX 2

APPENDIX 3

We begin by describing the results of a simulation study aimed at assessing the sensitivity of the mixed modeling approach to both the number of clusters and model misspecification. Founder effect models as well as departures from HWE are considered. A comparison of the mixed modeling approach to a more traditional ANOVA approach for identifying haplotype associations, as well as to a more recently described 2-stage MI approach (Foulkes and others, 2007), is also provided. We then summarize precision and power for a range of percentages of individuals with ambiguous cluster membership. Finally, estimates of the FDRs associated with varying degrees of missingness are presented.

Performance and sensitivity results for varying numbers of clusters and models are provided in Figures 3(a) and (b). These results are based on 400 iterations per condition, samples of size n = 200, and fully observed haplotypes. Cluster assignments are resampled within each iteration based on assumed frequencies. The ratio of standard deviations is defined as σb/σϵ, and for simplicity, we set σϵ = 1 across all simulations. Power is defined as the proportion of simulations for which the likelihood ratio test comparing the mixed model (with a random cluster effect) to a fixed-effects model (intercept only) is significant at the 0.05 level. That is, power is the proportion of times we reject the omnibus null hypothesis H0: σb² = 0. A significance cutoff is chosen based on a 50–50 mixture of a χ1² and a χ0² distribution since we are testing a variance parameter at a boundary.
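For illustration, this boundary-corrected decision rule can be written in a few lines of R; the vector lrt of likelihood ratio statistics is hypothetical and for illustration only:

    # p-value under a 50-50 mixture of chi-square(0) and chi-square(1):
    # the chi-square(0) component is a point mass at 0, so for lrt > 0 the
    # mixture tail probability is half the chi-square(1) tail probability
    mixture_pvalue <- function(lrt) 0.5 * pchisq(lrt, df = 1, lower.tail = FALSE)

    # equivalent fixed cutoff at the 0.05 level
    cutoff <- qchisq(1 - 2 * 0.05, df = 1)    # approximately 2.71

    # power = proportion of simulated statistics exceeding the cutoff
    lrt <- c(0.0, 1.3, 4.2, 2.9, 6.7)         # illustrative values only
    mean(mixture_pvalue(lrt) < 0.05)
    mean(lrt > cutoff)                        # the same decision rule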

Fig. 3.

Performance and sensitivity of the mixed modeling approach. (a) Power for detecting haplotype effect variability by number of clusters (n = 200). (b) Power under dominant and recessive founder models (n = 200). For recessive model, population frequencies of founder haplotype (Hd) equal to (1) 0.20 and (2) 0.40 are considered. For dominant model, a frequency of 0.20 is illustrated. (c) Power for mixed model and ANOVA approaches (n = 200, σb/σϵ = 0.40).

Figure 3(a) illustrates power for a range of variance ratios where the number of clusters ranges from 3 to 36. For the most general case in which clusters are defined by unique haplotype pairs, this corresponds to 2–8 observed haplotypes. Assumed cluster frequencies are determined based on population haplotype frequencies. For the case of 36 clusters (8 haplotypes), haplotype frequencies are set equal to (0.20, 0.15, 0.15, 0.12, 0.10, 0.10, 0.10, 0.08) and corresponding cluster probabilities are calculated assuming independence (HWE). For 21, 10, and 3 clusters, the corresponding haplotype probabilities are set equal to (0.20, 0.20, 0.20, 0.15, 0.15, 0.10), (0.40, 0.20, 0.20, 0.20), and (0.60, 0.40), respectively, and again HWE is assumed to determine cluster frequencies. For example, in the case of 2 haplotypes, the cluster frequencies are (0.60², 2 × 0.60 × 0.40, 0.40²) = (0.36, 0.48, 0.16).
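As an illustration of this calculation, a small R sketch (with user-supplied haplotype frequencies; the function name is hypothetical) is:

    # cluster (haplotype-pair) frequencies under HWE from haplotype frequencies
    cluster_freqs <- function(hap_freq) {
      out <- outer(hap_freq, hap_freq)      # products delta_H * delta_H'
      # homozygous pairs from the diagonal; each heterozygous pair counted once, doubled
      c(diag(out), 2 * out[upper.tri(out)])
    }

    # the 2-haplotype example from the text: (0.60, 0.40)
    cluster_freqs(c(0.60, 0.40))    # 0.36 0.16 0.48, i.e. {0.36, 0.48, 0.16}
    # the 8-haplotype case gives 8 + 8*7/2 = 36 clusters
    length(cluster_freqs(c(0.20, 0.15, 0.15, 0.12, 0.10, 0.10, 0.10, 0.08)))   # 36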

These results suggest that for σb/σϵ > 0.4, the omnibus test has reasonable power ( > 80%) for detecting variability in the cluster effects in the case of at least 10 clusters (4 haplotypes). Power increases to greater than 90% for a ratio of standard deviations of more than 0.5. The difference in power between differing numbers of clusters is more marked for smaller effect sizes. For example, M = 10 and M = 21 clusters yield comparable power for σb/σϵ ≥ 0.4, while the power differential is 15–20% for σb/σϵ = 0.3 and 0.2. Interestingly, the power gains for a larger number of clusters diminish entirely for effect sizes of greater than 0.5. This may be due in part to the decreasing number of observations per cluster for a fixed sample size of n = 200 across simulations.

In the second case, illustrated in Figure 3(b), data are generated according to a founder effect model in which a single haplotype (Hd) is associated with the disease phenotype. Both recessive and dominant genetic models are considered. In the case of the recessive model, the presence of 2 copies of Hd is required to affect the phenotype, while in the dominant model, a single copy of Hd affects the phenotype. More specifically, for the recessive model bi ∼ N(0, σb²) for the unique i such that Ci = (Hd, Hd) and bi = 0 otherwise. For the dominant model, on the other hand, bi ∼ N(0, σb²) for all i such that Hd ∈ Ci and bi = 0 otherwise. In the case of the dominant model, the frequency of Hd is set equal to 0.20. In the recessive model examples, founder haplotype frequencies of 0.20 and 0.40 are considered, corresponding to single cluster frequencies of 0.04 and 0.16, respectively. Model fit is based on the assumption of normality of all the random effects, so in both cases the fitted model is misspecified.

As expected, power is dramatically lower under model misspecification. Under the recessive founder model, only a single cluster affects the phenotype while the variability in the effect of the other clusters is assumed to equal 0. If this cluster has a low population frequency (0.16 and 0.04 are illustrated), then power is less than 80% for ratios of standard deviations of as high as 1. Performance is improved for the dominant model, in which all clusters with at least one copy of the founder haplotype have an effect on the phenotype while the remaining cluster effects are assumed to have no variability. In the examples provided for the dominant model, 4 of the 10 cluster effects and 6 of the 21 cluster effects arise from a normal distribution while the remaining in each case are set to equal 0. In these cases, greater than 80% power is observed for ratios of standard deviations of 0.7 and greater.

Power under specific departures from HWE is also estimated. Consistent with the approach of Satten and Epstein (2004), we let the joint probability of the haplotype pair (H, H′) be given by αHH′ = I(H = H′)[FδH + (1 − F)δH²] + I(H ≠ H′)·2(1 − F)δHδH′, where I(·) is the indicator function, F is a scalar measuring departure from HWE, and δH is the population-level frequency of haplotype H. Notably, F = 0 corresponds to the HWE setting. In this case, we assume wild-type and variant SNP frequencies of 0.80 and 0.20 for each of 2 SNPs. This results in 4 haplotypes with frequencies of (0.64, 0.16, 0.16, 0.04) and 10 corresponding clusters with frequencies under HWE of (0.4096, 0.2048, 0.2048, 0.0512, 0.0256, 0.0512, 0.0256, 0.0128, 0.0128, 0.0016). Ambiguity lies between 2 of these clusters, and all individuals within these 2 clusters are ambiguous. We apply the mixed modeling approach under the HWE assumption to 100 simulated data sets of size n = 100, where σb/σϵ = 0.60. Values of F = −0.05, 0, and 0.05 are considered as described in Satten and Epstein (2004). The resulting power for the omnibus test of no variability in cluster effects is 84%, 84%, and 86%, respectively, suggesting reasonable performance under moderate departures from HWE.
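For illustration, the following R sketch (hypothetical code, written to the form of αHH′ given above) computes the joint haplotype-pair probabilities for a given F and checks that they form a proper distribution, with F = 0 recovering the HWE frequencies:

    # joint probability of the haplotype pair (H, H') under the model above
    pair_prob <- function(dH, dHp, F, homozygous) {
      if (homozygous) F * dH + (1 - F) * dH^2 else 2 * (1 - F) * dH * dHp
    }

    delta <- c(0.64, 0.16, 0.16, 0.04)    # haplotype frequencies from the text

    all_pairs <- function(delta, F) {
      K <- length(delta)
      probs <- c()
      for (i in 1:K) for (j in i:K)
        probs <- c(probs, pair_prob(delta[i], delta[j], F, homozygous = (i == j)))
      probs
    }

    sum(all_pairs(delta, F = 0))       # 1: the HWE cluster frequencies
    sum(all_pairs(delta, F = 0.05))    # 1: still a proper distribution
    sum(all_pairs(delta, F = -0.05))   # 1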

Figure 3(c) illustrates power both for the omnibus test of variability in random cluster effects and for an F-test based on a one-way ANOVA model in which clusters are treated as fixed factor levels. The ANOVA approach is extended for unobservable haplotypes in Lake and others (2003) and easily implemented in R with the haplo.glm() function of the haplo.stats package. Here, we focus on the observed haplotype setting to characterize overall performance, though a reduction in power is expected using both approaches in the context of missingness in phase, as described in more detail below for the mixed modeling setting. A ratio of standard deviations of 0.40 is assumed. Power is comparable between the 2 approaches when the number of clusters (factor levels) is less than 15, corresponding to 5 haplotypes; however, it tends to deviate as the number of clusters increases to 36 (corresponding to 8 haplotypes). While power is maintained at near-constant values for the mixed modeling approach, it tends to decrease with increasing haplotypes using ANOVA. This is consistent with reports in the literature suggesting that the increase in degrees of freedom associated with consideration of more haplotypes can reduce power using an ANOVA approach (Tzeng and others, 2006). Ultimately, power will also decline using the mixed modeling approach since as the number of clusters (haplotypes) increases, the number of individuals within the clusters will decrease for a fixed sample size.

An additional comparison of power between the ECM approach described herein and a 2-stage MI approach is provided. The latter approach and its performance are described in detail in Foulkes and others (2007). In this example, the trait is simulated for a sample of size n = 200 for each of 100 simulations, and model fitting assumes HWE with missingness between 2 clusters (representing about 8% ambiguity for this data example). Power for detecting variability is comparable for the ECM and MI approaches, averaging 79% and 78.6%, respectively, across a range of standard deviation ratios from 0.4 to 0.8; however, an approximate 1.5-fold increase in the 25th, 50th, and 75th quantiles of the test statistic distribution is observed for the ECM versus MI approach. Notably, the computational burden associated with increasing amounts of haplotype ambiguity for ECM (on the order of 2^Na, where Na is the number of ambiguous individuals between 2 clusters) far exceeds that associated with the MI approach (on the order of B × Na, where B is the number of imputations). While the theoretical advantage of ECM is that it incorporates information on the trait of interest (as reflected in our simulation study by the overall distribution of test statistics being greater for ECM compared to MI), we believe that a complete investigation of the relative performance is warranted in light of the trade-off in computational efficiency.

More detailed simulation results are given in Table 3 for varying degrees of cluster ambiguity and ratios of standard deviations. In all cases, a sample size of n = 200 is again assumed. The ECM approach described in this manuscript is applied for settings in which the ambiguity is greater than 0%. Bias is defined as the absolute difference between the median parameter estimate over the simulations and the true value, and the standard error of each parameter estimate is estimated by the empirical standard deviation of the estimates across simulations. Average biases are reported separately over the ambiguous and unambiguous clusters (subscripts a and u, respectively). β0 is a fixed intercept and is set equal to 0 across all simulations. CR is defined as the percentage of simulations for which the true parameter value is within the 95% confidence interval; these intervals are constructed for each simulation based on the current parameter estimate and the standard error estimated across all simulations. Although the variance estimates appear to be slightly skewed to the right, transformations are not applied since they result in more pronounced skewness to the left. Both logarithmic and square root transformations were considered (results not shown).

Results are based on *400 and **200 simulations per condition (σb/σϵ) with samples of size n = 200 and m = 21 clusters.

†Bias is defined as the absolute difference between the median of the estimate over the simulations and the true parameter value. Subscripts a and u denote the average bias across the ambiguous and unambiguous clusters, respectively. Standard errors are calculated based on all simulations within a condition.

‡CR is defined as the proportion of simulations for which the true parameter value is within the corresponding 95% confidence interval.
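As a concrete illustration of how these summaries can be computed from simulation output, a minimal R sketch follows; est is a hypothetical vector holding one estimate of a given parameter per simulation and truth is the corresponding true value:

    # est: vector of parameter estimates across simulations (hypothetical input)
    # truth: true parameter value used to generate the data
    summarize_sims <- function(est, truth) {
      bias <- abs(median(est) - truth)      # bias: |median estimate - truth|
      se   <- sd(est)                       # empirical standard error across simulations
      # coverage rate: proportion of 95% intervals containing the truth, each
      # interval built from the per-simulation estimate and the common se
      cr   <- mean(truth >= est - 1.96 * se & truth <= est + 1.96 * se)
      c(bias = bias, se = se, CR = cr)
    }

    set.seed(1)
    summarize_sims(est = rnorm(400, mean = 0.5, sd = 0.1), truth = 0.5)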

Cluster-level ambiguity is sampled as follows. For 5% ambiguity, 10 of the n = 200 observations are ambiguous between a pair of clusters. In the case of 10% ambiguity, 5% of the sample is ambiguous between a pair of clusters and another 5% is ambiguous between a second pair of clusters. Likewise for 20% ambiguity, 4 sets of individuals, each consisting of 5% of the sample, are ambiguous between different pairs of clusters. In all cases, 21 clusters are assumed and frequencies ranging from 0.01 to 0.08 are determined as described for Figure 3 above. In total, 200–400 simulations are performed per condition as indicated, and model fitting is based on the general case in which we do not make any assumption about HWE. Standard deviation ratios of 0.2–1.0 are presented. Finally, FDRs are reported. In this case, 100 simulations are performed for each level of ambiguity under the assumption that σb² = 0, and we report the percentage of times we reject the null hypothesis H0: σb² = 0 based on the likelihood ratio test.

These results suggest modest reductions in power for detecting overall variability in cluster effects with ambiguity as high as 20%. While CRs are maintained at between 0.93 and 0.97 for the fixed effect, β0, and the haplotype frequencies, αa and αu, CRs for the variance parameters decline with increasing ambiguity. This is most marked for σb², for which the CRs drop to between 0.88 and 0.92 for 20% ambiguity. In general, σb² is slightly overestimated when there is ambiguity in the cluster assignments. This is reflected further in the FDRs that tend to increase as the rate of ambiguity increases from 0% to 20%. Specifically, for 0%, 5%, 10%, and 20% ambiguity, the estimated FDRs are 5%, 6%, 9%, and 7%, respectively.


Background correction is an important preprocessing step for microarray data that attempts to adjust the data for the ambient intensity surrounding each feature. The “normexp” method models the observed pixel intensities as the sum of 2 random variables, one normally distributed and the other exponentially distributed, representing background noise and signal, respectively. Using a saddle-point approximation, Ritchie and others (2007) found normexp to be the best background correction method for 2-color microarray data. This article develops the normexp method further by improving the estimation of the parameters. A complete mathematical development is given of the normexp model and the associated saddle-point approximation. Some subtle numerical programming issues are solved which caused the original normexp method to fail occasionally when applied to unusual data sets. A practical and reliable algorithm is developed for exact maximum likelihood estimation (MLE) using high-quality optimization software and using the saddle-point estimates as starting values. “MLE” is shown to outperform heuristic estimators proposed by other authors, both in terms of estimation accuracy and in terms of performance on real data. The saddle-point approximation is an adequate replacement in most practical situations. The performance of normexp for assessing differential expression is improved by adding a small offset to the corrected intensities.

Fluorescence intensities measured by microarrays are subject to a range of different sources of noise, both between and within arrays. Background correction aims to adjust for these effects by taking account of ambient fluorescence in the neighborhood of each microarray feature.

Ritchie and others (2007) compared a range of background correction methods for 2-color microarrays. A method normexp was introduced which models the observed intensities as the sum of exponentially distributed signals and normally distributed background values. The corrected intensities are obtained as the conditional expectations of the signals given the observations. The normexp method is an adaptation of the background correction method proposed by Irizarry and others (2003) for Affymetrix single-channel arrays, as the first step of the popular “robust multi-array average (RMA)” algorithm for preprocessing Affymetrix expression data. Ritchie and others (2007) showed that normexp, followed by a started-log transformation (i.e. log(x + c), for constant c), gave the lowest false-discovery rate of any commonly available background correction method for 2-color microarrays.

The convolution model underlying the normexp method involves 3 unknown parameters, all of which must be estimated before the method can be applied. In the 2-color context, the parameters must be estimated for each channel on each array, by fitting the convolution model to the observed intensities for that channel. Ritchie and others (2007) suggested an approximate likelihood method for estimating the parameters, based on a saddle-point approximation, but did not give mathematical details.

This article develops the normexp method further by improving the estimation of the parameters. First, a complete mathematical development is given of the normexp model and the associated saddle-point approximation. Second, some subtle numerical programming issues are solved which caused the original normexp method to fail occasionally when applied to unusual data sets. Third, we show how exact maximum likelihood estimation (MLE) of the parameters can be made practical and reliable. Fourth, we compare exact and approximate MLE with estimators proposed by other authors.

MLE has previously proved difficult because of numerical sensitivity of the likelihood function (Irizarry and others, 2003), (Bolstad, 2004), (McGee and Chen, 2006). Instead of MLE, the RMA algorithm, implemented in the affy software package for R (Gautier and others, 2004), uses simple heuristic estimators obtained by smoothing the histogram of observed intensities and partitioning the distribution about its mode (Bolstad, 2004), (Irizarry and others, 2003). McGee and Chen (2006) observed that the RMA estimators are highly biased and proposed 2 new estimators. These methods are based on the RMA kernel smoothing approach but partition the distribution about its mean (the “RMA-mean” method) or 75th percentile (the “RMA-75” method) and then apply a 1-step correction. The RMA-mean and RMA-75 estimators are far less biased than those of RMA but apparently do not improve the performance of the RMA algorithm on real data (McGee and Chen, 2006).

The saddle-point approximation avoids the sensitivity of the likelihood function by providing a closed-form expression for the probability density on the log-scale, ensuring good relative accuracy. However, the saddle point itself must first be found for each data value. This article provides a globally convergent iterative scheme that locates the saddle point to full accuracy in floating-point arithmetic in all cases.

The accuracy of the different estimators is compared in a simulation study. The estimators are also compared using the extensive battery of calibration data sets assembled by Ritchie and others (2007). This allows the estimators to be compared according to their ability to estimate fold changes and to detect differential expression on real data. As in Ritchie and others (2007), the assumed context is that of a small microarray experiment in which popular differential expression methods are to be applied. MLE is shown to have markedly better performance than the heuristic estimators.

Section 2 describes the normexp convolution model, presents the MLE and “saddle” procedures, and addresses some challenges in their implementation. Section 3 briefly describes the 3 test data sets with known levels of differential expression. Section 4 compares the 4 estimation schemes both by simulation and by performance on the test data sets.

2. CORRECTION METHODS

2.1 The normal–exponential convolution model

Image analysis software for 2-color microarrays produces red foreground and background intensities Rf and Rb and green foreground and background intensities Gf and Gb for each spot on each array. Our aim is to adjust the foreground intensities Rf and Gf for the ambient intensities represented by Rb and Gb.

The normexp model for the red channel assumes Rf = Rb + B + S, where S is the true expression intensity signal and B is the residual background not captured by Rb. The model for the green channel is similar. The signal S is assumed exponentially distributed with mean α, while B is normally distributed with mean μ and variance σ2. The parameters μ, σ2, and α are assumed different for each channel on each array. All variables are assumed independent.

Write X = Rf − Rb for the background-subtracted observed intensity. The normexp model becomes

X = S + B.    (2.1)

The joint density of B and S is just the product of densities

f(b, s) = (1/α) exp(−s/α) (1/σ) φ((b − μ)/σ),  s > 0,    (2.2)

where φ(·) is the Gaussian density function. A simple transformation gives the joint density of X and S as

f(x, s) = (1/α) exp[(μ − x)/α + σ²/(2α²)] (1/σ) φ((s − μS·X)/σ),  s > 0,

where μS·X = x − μ − σ²/α. Integrating over s gives the marginal density of X:

fX(x) = (1/α) exp[(μ − x)/α + σ²/(2α²)] Φ(μS·X/σ),    (2.3)

where Φ(·) is the Gaussian distribution function. Dividing the joint by the marginal gives the conditional density of S given X as

f(s|x) = (1/σ) φ((s − μS·X)/σ) / Φ(μS·X/σ)

for s > 0, which is a truncated Gaussian distribution. Our estimate of the signal given the observed intensity is the conditional expectation

E(S|X = x) = μS·X + σ φ(μS·X/σ) / Φ(μS·X/σ).    (2.4)
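A direct R transcription of the correction formula (2.4) is given below as a sketch; limma's own implementation additionally guards against numerical underflow in the Gaussian tail:

    # background-corrected signal E(S | X = x) from (2.4)
    normexp_signal <- function(x, mu, sigma, alpha) {
      mu.sx <- x - mu - sigma^2 / alpha      # mean of the truncated Gaussian
      mu.sx + sigma * dnorm(mu.sx / sigma) / pnorm(mu.sx / sigma)
    }

    x <- c(-20, 5, 80, 400)                  # illustrative background-subtracted intensities
    normexp_signal(x, mu = 0, sigma = 20, alpha = 1000)   # strictly positive corrected values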

2.2 Saddle-point approximation

MLE requires the marginal density (2.3), which turns out to be difficult to compute with full relative accuracy in floating-point arithmetic, due to subtractive cancelation affecting both factors in the expression. As an alternative, the saddle-point approximation, or tilted Edgeworth expansion, provides a means of approximating the density of any random variable from its cumulant generating function (Barndorff-Nielsen and Cox, 1981, p. 104). The approximation is attractive because it typically remains accurate far into the tails of the distribution.

The cumulant generating function of X is immediately available as the sum of those for B and S,

KX(θ) = μθ + σ²θ²/2 − log(1 − αθ),

where θ < 1/α. The definition of the cumulant generating function implies that g(x; θ) = fX(x) exp[xθ − KX(θ)] integrates to unity for all θ. Here, we suppress the dependence of fX on μ, σ, and α for notational simplicity. The density g(x; θ) defines a linear exponential family with canonical parameter θ and rth cumulant κr = KX^(r)(θ).

The second-order Edgeworth expansion for g (Barndorff-Nielsen and Cox, 1981, p. 106), evaluated at the mean κ1 of the tilted density, is

log g(κ1; θ) ≈ −(1/2) log(2πκ2) + ρ4/8 − 5ρ3²/24,

where ρr = κr/κ2^(r/2), yielding the approximation

f̂X(x) = (2πκ2)^(−1/2) exp[KX(θ) − θx + ρ4/8 − 5ρ3²/24].

The key feature which makes the saddle-point approximation so effective is its ability to choose θ to make the Edgeworth expansion as accurate as possible for each x, by choosing θ so that x is the mean of the tilted distribution, that is, θ is chosen to solve the saddle-point equation

KX′(θ) = μ + σ²θ + α/(1 − αθ) = x    (2.5)

for θ < 1/α. Although this equation has a simple analytic solution, computing the solution is subject to catastrophic subtractive cancelation for certain values of σ and α. Details of how we avert this numerical issue are provided in Section A of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
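To make the construction concrete, the cumulant generating function, the saddle-point equation (2.5), and the leading-order density approximation can be sketched in R as follows; this naive version solves (2.5) by generic root finding and so does not address the cancelation issue discussed above:

    # cumulant generating function of X = B + S and its first two derivatives
    K  <- function(theta, mu, sigma, alpha) mu * theta + sigma^2 * theta^2 / 2 - log(1 - alpha * theta)
    K1 <- function(theta, mu, sigma, alpha) mu + sigma^2 * theta + alpha / (1 - alpha * theta)
    K2 <- function(theta, mu, sigma, alpha) sigma^2 + alpha^2 / (1 - alpha * theta)^2

    # leading-order saddle-point approximation to the density of X at x
    saddle_density <- function(x, mu, sigma, alpha) {
      # solve K'(theta) = x numerically on theta < 1/alpha (naive root finding)
      theta <- uniroot(function(t) K1(t, mu, sigma, alpha) - x,
                       lower = -1e3, upper = 1 / alpha - 1e-9)$root
      exp(K(theta, mu, sigma, alpha) - theta * x) / sqrt(2 * pi * K2(theta, mu, sigma, alpha))
    }

    saddle_density(50, mu = 0, sigma = 20, alpha = 1000)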

2.3 Optimization

Given a set of observed intensities xi, i = 1,…,n, the unknown parameters μ, σ, and α must be estimated before the correction formula (2.4) can be applied. Starting values are obtained as follows. The initial estimate μ̂0 of μ is the 5% quantile of the xi. The initial variance estimate σ̂0² is the mean of (xi − μ̂0)² over those xi < μ̂0. The initial estimate of α is α̂0 = x̄ − μ̂0, the mean of the xi minus the initial estimate of μ.

Next, the saddle-point approximation to the likelihood is maximized using the Nelder–Mead (1965) simplex algorithm. Finally, using the saddle-point estimates as starting values, the exact likelihood is maximized using the nlminb function of R, which performs unconstrained minimization using PORT routines (Gay, 1981), (Gay, 1983), (Gay, 1990). First and second derivatives of fX with respect to μ, logα, and logσ² are supplied. Optimizing the likelihood with respect to logα and logσ², rather than α and σ², avoids parameter constraints and improves convergence.
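For illustration, the estimation strategy can be sketched end to end in R roughly as follows; this is a simplified sketch (not the limma production code) that works directly with the exact log-likelihood implied by (2.3), parametrizes on the log scale as described above, and omits the supplied derivatives and the intermediate saddle-point maximization; all object names are hypothetical:

    # negative exact log-likelihood from (2.3), parametrized as (mu, log sigma^2, log alpha)
    negloglik <- function(par, x) {
      mu <- par[1]; sigma <- sqrt(exp(par[2])); alpha <- exp(par[3])
      mu.sx <- x - mu - sigma^2 / alpha
      ll <- -log(alpha) + (mu - x) / alpha + sigma^2 / (2 * alpha^2) +
        pnorm(mu.sx / sigma, log.p = TRUE)
      -sum(ll)
    }

    normexp_mle <- function(x) {
      mu0    <- unname(quantile(x, 0.05))            # initial estimate of mu
      sigma0 <- sqrt(mean((x[x < mu0] - mu0)^2))     # initial sigma from the left tail
      alpha0 <- max(mean(x) - mu0, 1)                # initial alpha
      start  <- c(mu0, log(sigma0^2), log(alpha0))
      fit <- nlminb(start, negloglik, x = x)         # exact MLE on the log scale
      c(mu = fit$par[1], sigma = sqrt(exp(fit$par[2])), alpha = exp(fit$par[3]))
    }

    set.seed(2)
    x <- rnorm(20000, mean = 100, sd = 20) + rexp(20000, rate = 1 / 1000)
    normexp_mle(x)    # should be close to (100, 20, 1000)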

The algorithm is implemented in the limma software package for R (Smyth, 2005). Saddle-point parameter estimation takes about 1 s per channel for arrays with 20 000 probes on a 2 GHz Windows PC. Exact MLE takes about 50% longer. Time taken is roughly linear in the number of probes.
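For users, the correction is available through limma. Assuming a recent limma version (the argument and function names below reflect our understanding of the package interface and should be checked against its documentation), with RG a two-color intensity object and x a hypothetical vector of background-subtracted intensities for one channel, a typical call is:

    library(limma)

    # background-correct both channels with normexp, using exact MLE for the
    # parameters and adding the offset discussed in Section 2.4
    RG.bc <- backgroundCorrect(RG, method = "normexp",
                               normexp.method = "mle", offset = 50)

    # the lower-level fitting and correction functions can also be called
    # directly on a single channel of background-subtracted intensities x
    par <- normexp.fit(x, method = "mle")    # parameter estimates (limma's internal scale)
    y   <- normexp.signal(par$par, x)        # corrected (strictly positive) signal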

2.4 Transformation and offset

The normexp background correction (2.4) is performed for each channel on each array, yielding adjusted strictly positive red and green intensities R and G for each spot on each array. These are then converted to log-ratios, M = log2(R/G), and log-averages, A = (log2R + log2G)/2 (Yang and others, 2001).

It also proves useful to offset the intensities by a small positive value k, giving offset log-ratios M = log2[(R + k)/(G + k)]. This simple transformation shifts the intensities away from 0 and serves to stabilize the variance of the log-ratios at low intensities (Rocke and Durbin, 2003), (Ritchie and others, 2007). The value k = 50 was chosen for this study on the basis of our previous experience with cDNA microarray experiments (Ritchie and others, 2007).
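In R, with hypothetical vectors R and G of background-corrected channel intensities, the offset quantities are simply:

    k <- 50                                  # offset used in this study
    M <- log2((R + k) / (G + k))             # offset log-ratios
    A <- (log2(R + k) + log2(G + k)) / 2     # offset log-averages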

3. TEST DATA

3.1 Spike-in experiment

We use the same 3 calibration data sets as Ritchie and others (2007). The first uses Lucidea Universal ScoreCard controls (Amersham Biosciences) to assess bias. Twelve copies of the control probe set were printed in duplicate on 9 cDNA microarrays, along with a 13K clone library. Only the control probes are analyzed here. Prior to labeling, test and reference control RNA were spiked into RNA samples to produce known fold changes (Supplementary Table 1 available at Biostatistics online). All 8 background correction methods (RMA, RMA-75, saddle, and MLE, with and without the offset) were applied. The resulting log-ratios were normalized and duplicate spots were combined to give an estimate of the log2-fold change as described by Ritchie and others (2007).

3.2 Mixture experiment

The second data set is from Holloway and others (2006). Six RNA mixtures consisting of mRNA from MCF7 and Jurkat cell lines in known relative concentrations (100%:0%, 94%:6%, 88%:12%, 76%:24%, 50%:50% and 0%:100%) were compared to pure Jurkat reference mRNA on 12 cDNA microarrays printed with a Human 10.5K clone set. Dye-swap pairs were performed for each of the 6 mixtures. All 8 background correction methods were applied and the data were normalized using print-tip loess (Yang and others, 2001). Probe-wise nonlinear regression equations were fitted to the normalized log-ratios (Holloway and others, 2006). This produced for each probe a reliable consensus estimate of the MCF7 to Jurkat fold change and a standard deviation that estimates the between-array measurement error.

3.3 Quality control study

The final data set is from Ritchie and others (2006) and comprises 111 replicate arrays printed with the same 10.5k clone set as in the mixture study and hybridized with MCF7 (Cy3) and Jurkat mRNA (Cy5). Spot image data were morph background corrected and print-tip loess normalized. This very large data set enables genes truly differentially expressed (DE) between MCF7 and Jurkat to be identified with a high degree of confidence.

4. RESULTS

4.1 Reliability

The estimation scheme outlined in Section 2.3 has proved to be extremely reliable. It has converged successfully for all data sets the authors have encountered so far, including thousands of simulated and real microarrays. This contrasts with earlier experiences reported by McGee and Chen (2006), whose optimization algorithm, using Newton's method, converged in only 15% of cases, even when initial estimates were equal to the true parameter values.

RMA estimation also returned usable values for all data sets. The RMA-mean and RMA-75 methods each failed for some simulated data sets, the former slightly more often than the latter. Since the two are otherwise similar in performance, results will be presented here only for RMA-75. RMA-75 returned NaNs for 32% of simulated data sets with σ = 5 and α = 10⁴ and for 0.3% of data sets with σ = 20 and α = 10⁴.

4.2 Estimation accuracy

Data were simulated for all combinations of μ ∈ {30, 100, 500}, σ ∈ {5, 20, 100}, and α ∈ {10², 10³, 10⁴}. These values represent a very wide range of scenarios in terms of the distribution of foreground values typically observed in microarray data. For each combination of parameter values, 1000 replicate samples of 20 000 observed intensities X were generated. Results are presented only for μ = 100 as the other results are almost identical.
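Simulating one replicate data set from the convolution model is straightforward; a short R sketch using representative parameter values is:

    set.seed(3)
    n  <- 20000
    mu <- 100; sigma <- 20; alpha <- 1000
    B <- rnorm(n, mean = mu, sd = sigma)    # normal background
    S <- rexp(n, rate = 1 / alpha)          # exponential signal with mean alpha
    X <- B + S                              # observed background-subtracted intensity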

The MLE bias and standard deviation were the smallest, followed closely by those of saddle (Tables 1–3). RMA-75 is much more biased and RMA is by far the worst. Parameter estimates for individual data sets for the representative parameter values σ = 20 and α = 1000 are plotted in Figure 1; the estimates from RMA fall outside the range of this plot. MLE is the most precise with almost no bias. Saddle is equally precise but with some bias, tending to underestimate σ. RMA-75 and RMA on the other hand overestimate σ.

Table 1.

Bias and standard deviation (shown in brackets) in estimating μ for the 4 estimation methods in 9 different scenarios. The true values of α and σ in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures

σ     α     MLE              Saddle           RMA-75        RMA
5     10²   0.0079 (0.22)    −0.25 (0.22)     1.7 (1.6)     12 (2.7)
20    10²   0.0024 (0.47)    0.013 (0.50)     5.4 (2.3)     25 (2.6)
100   10²   0.013 (1.6)      11.0 (1.5)       4.8 (11)      47 (9.0)
5     10³   −0.023 (0.67)    −0.37 (0.65)     4.2 (7.5)     44 (23)
20    10³   −0.025 (1.4)     −1.3 (1.4)       6.6 (11)      69 (24)
100   10³   −0.098 (3.1)     −3.4 (3.1)       26.0 (18)     170 (24)
5     10⁴   0.022 (2.3)      −0.36 (2.2)      32.0 (64)     380 (230)
20    10⁴   0.20 (4.2)       −1.3 (4.0)       32.0 (66)     390 (220)
100   10⁴   0.069 (9.2)      −6.5 (9.0)       41.0 (85)     520 (240)

Table 2.

Bias and standard deviation (shown in brackets) in estimating σ for the 4 estimation methods in 9 different scenarios. The true values of α and σ in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures. ∞a and ∞b indicate, respectively, where 32.4% and 0.3% of replicates yielded infinite estimates

σ     α     MLE               Saddle           RMA-75        RMA
5     10²   0.00059 (0.20)    −0.40 (0.19)     1.5 (0.71)    7.0 (1.9)
20    10²   −0.0069 (0.40)    −0.46 (0.43)     5.6 (1.0)     15.0 (1.6)
100   10²   0.003 (1.0)       7.3 (0.99)       25.0 (5.0)    45.0 (5.1)
5     10³   −0.067 (0.62)     −0.56 (0.56)     3.2 (4.7)     32.0 (19)
20    10³   −0.11 (1.2)       −1.9 (1.1)       6.0 (5.3)     44.0 (19)
100   10³   −0.00048 (2.8)    −5.9 (2.7)       27.0 (7.8)    100.0 (16)
5     10⁴   −0.72 (2.4)       −1.2 (2.1)       ∞a (∞a)       310.0 (190)
20    10⁴   −0.40 (4.0)       −2.5 (3.6)       ∞b (∞b)       300.0 (180)
100   10⁴   −0.52 (8.5)       −10.0 (7.8)      36.0 (46)     360.0 (190)

Table 3.

Bias and standard deviation (shown in brackets) in estimating α for the 4 estimation methods in 9 different scenarios. The true values of α and σ in each scenario are shown in the first 2 columns, and μ = 100 for all scenarios. All values are given to 2 significant figures

σ     α     MLE               Saddle           RMA-75         RMA
5     10²   −0.00013 (0.75)   0.25 (0.75)      −1.1 (1.5)     −80 (0.31)
20    10²   −0.013 (0.82)     −0.023 (0.84)    −2.5 (1.9)     −79 (0.40)
100   10²   −0.046 (1.6)      −11.0 (1.5)      27.0 (8.1)     −69 (4.4)
5     10³   0.021 (6.8)       0.37 (6.8)       −2.8 (10)      −800 (2.9)
20    10³   0.11 (6.8)        1.4 (6.8)        −4.6 (12)      −800 (2.9)
100   10³   −0.16 (7.5)       3.2 (7.5)        −15.0 (16)     −790 (3.2)
5     10⁴   0.50 (72)         1.0 (72)         −28.0 (100)    −8000 (28)
20    10⁴   −3.2 (69)         −1.6 (69)        −29.0 (100)    −8000 (28)
100   10⁴   3.1 (71)          9.5 (71)         −23.0 (110)    −8000 (30)

Fig. 1.

Box plots of parameter estimates for the 3 best-performing methods. The true values of the parameters are indicated by dashed vertical lines. Estimates from RMA were so far from those of the other methods that they do not appear when plotted on this scale (see Tables 1–3).

Another way to view accuracy is in terms of ability to return the correct signal values. The left panel of Figure 2 shows the bias with which E(S|X) estimates S on the log2-scale, for μ = 0, σ = 20, and α = 1000. Here, RMA-75 and especially RMA yield far more biased estimates of the signal than MLE or saddle, which are relatively accurate. Although MLE and saddle do tend to overestimate the true signal at lower intensities, this is indistinguishable from the bias that arises from inserting the true parameter values into E(S|X).

Fig. 2.

Left panel: smoothed log2-ratio of the true to the estimated signal versus the true signal. The black line shows this relationship if the true parameter values are used instead of estimates. The data used for this figure include 100 000 observations simulated with μ = 0, σ = 20, and α = 1000. Quantiles for the signal distribution are marked. The curves were smoothed using the lowess function in R (Cleveland, 1979). Right panel: smoothed residual variances (on the log2 scale) from the nonlinear fits versus intensity for the mixture experiment. The A-values have been standardized between methods and plotted from the 5th to the 95th percentiles. The quantiles of the A-values are marked.

4.3 Implicit offsets

The normalized M- and A-values for one array from the mixture experiment are shown in Figure 3. This array has 100% Jurkat on both channels, so there is no true differential expression.

Fig. 3.

MA-plots obtained using different background correction methods for a self–self hybridization from the mixture experiment.

Some fanning of M-values is apparent at low A-values in the MLE and saddle panels. This fanning is essentially eliminated in the corresponding offset panels at the cost of compressing the range of A-values. Compared with MLE, RMA-75 and especially RMA show a somewhat compressed range of A- and M-values even before the offset is applied. Our interpretation is that these estimation schemes implicitly incorporate offsets, which arise from the fact that they tend to overestimate the quantity μS·X. Adding an offset to RMA is therefore in effect a double offset.

For this array, high offset and low M-value variability is desirable because the true M-values are zero. For arrays with genuine differential expression, compression of the M-values might appear as bias. We examine this in Section 4.5.

4.4 Precision of expression values

We now examine the precision of the background-corrected intensities, using results from the mixture experiment. The residual standard deviation for each probe i from the nonlinear fits is a measure of the precision with which the M-values returned by the microarrays follow the pattern of the mixing proportions. The right panel of Figure 2 shows the trend in variability for each background method as a function of intensity. The vertical scale is log2-variance, so each unit on the vertical axis corresponds to a 2-fold change in variance.

As expected, precision improves with intensity for all the background correction methods prior to applying an offset. MLE and saddle have the best precision of the 4 methods for most of the intensity range. RMA-75 is relatively poor at higher intensities. After adding an offset, MLE and saddle have roughly constant variance across the intensity range, whereas the offset seems overdone for RMA and RMA-75, which now show a reversed trend in precision.

4.5 Bias of expression values

It is to be expected that higher precision, purchased by compressing the intensity range, will also result in attenuated signal. This is confirmed by examining the MCF7–Jurkat log-fold changes estimated from the mixture experiment. Supplementary Figure 1 available at Biostatistics online shows box plots of the log-fold changes arising from each method. The spread of fold changes narrows when offsets are added, although the largest fold changes remain nearly as great.

To confirm whether attenuated fold changes can be interpreted as bias, we turn to the spike-in experiment data. Supplementary Figure 2 available at Biostatistics online shows the M-values for a typical slide for the non-DE calibration controls and for the DE D03Med ratio controls, which theoretically have a 3-fold change down (−log2 3 = −1.58). All methods give log-ratios which are slightly biased towards 0, and the bias increases when offsets are added. There is surprisingly little difference between the 4 estimation algorithms, all leading to broadly similar bias.

4.6 Assessing differential expression

We now assess the ability of background corrected expression values to identify DE genes correctly. Apart from the self–self hybridizations, the mixture experiment consists of 5 dye-swap pairs of arrays. We assessed differential expression between MCF7 and Jurkat using each pair of arrays separately. The RNA mixtures vary from 100% to 50% MCF7, so the magnitude of the fold changes will vary from one pair of the arrays to another, but the set of DE genes should be the same in each case.

Using only 2 arrays to find DE genes presents a challenging problem because there is only one degree of freedom available to estimate gene-wise standard deviations. The level of difficulty further increases with the concentration of Jurkat in the MCF7:Jurkat RNA mixture. The use of ordinary t-tests or other traditional univariate statistics to assess differential expression would be disastrous (Smyth, 2004). Instead, we use two of the most popular algorithms for microarray differential expression, which have the characteristic of “borrowing” information between genes and so enable statistical inferences with some confidence even for small numbers of replicate arrays. Genes were ranked in terms of evidence for differential expression using significance analysis of microarrays (SAM) regularized t-statistics (Tusher and others, 2001) and using empirical Bayes moderated t-statistics (Smyth, 2004). The statistics were calculated using the samr (http://www-stat.stanford.edu/~tibs/SAM/) and limma (Smyth, 2005) software packages, respectively.
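For reference, a minimal limma analysis of a single dye-swap pair might look as follows; this is a sketch only, with M a hypothetical genes-by-2 matrix of normalized log-ratios for the pair:

    library(limma)

    # M: genes x 2 matrix of normalized log-ratios for one dye-swap pair
    design <- matrix(c(1, -1), ncol = 1)     # second array is the dye swap
    colnames(design) <- "MCF7vsJurkat"

    fit <- lmFit(M, design)                  # gene-wise linear models
    fit <- eBayes(fit)                       # empirical Bayes moderated t-statistics
    topTable(fit, coef = "MCF7vsJurkat", number = 10)    # top-ranked genes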

To assess the success of the differential expression analyses, an independent determination of which genes are truly DE is required. The top 30% of genes, as ranked by moderated t-statistics, from the quality control study were selected as unambiguously DE and the bottom 40% as unambiguously non-DE. This gave 3098 DE and 4130 non-DE genes. The remaining 30% of genes were treated as indeterminate and are not used in the analysis.

Figure 4 shows the number of false discoveries for each method versus the number of genes selected by ranking the genes using absolute t-statistics, from largest to smallest, for (a) limma and (b) SAM. The curves have been averaged over the 5 dye-swap pairs. The limma curves show that adding an offset reduces the false-discovery rate, with the best performance achieved by MLE and saddle, followed by RMA-75 and then RMA. For SAM, the advantage of “MLE + offset” and “saddle + offset” over the other methods is even more marked. SAM appears to penalize the methods which do not stabilize the variance.

Fig. 4.

Number of false discoveries from the mixture data set using moderated t-statistics from (a) limma and (b) SAM. Each curve is an average over the 5 mixtures.

5. DISCUSSION

In this article, we have shown that exact MLE gives the most accurate estimation of the normexp parameters, which translates into higher precision for the computed log-ratios of expression. The saddle-point approximation is a very close competitor. The heuristic normexp estimators are markedly poorer in estimation accuracy. Furthermore, RMA-mean and RMA-75 fail occasionally and even frequently for some simulated scenarios. However, MLE and saddle converged successfully in all of our tests.

The performance of normexp for assessing differential expression on real data is improved when combined with an offset, as a result of stabilizing the variance as a function of intensity. MLE + offset and saddle + offset gave the lowest false-discovery rates. Although exact MLE does slightly better, the saddle-point approximation could be considered an adequate replacement in most practical situations.

Estimation accuracy did not directly translate to practical performance in all cases. RMA gives easily the most biased parameter estimates. Yet when we turned to the real data examples, RMA yielded higher precision and fewer false positives than RMA-75. Prior to offset, RMA is the best of all the methods when used with SAM significance analysis. This can be understood in terms of noise–bias trade-off. It appears that the biased RMA estimators have the fortuitous effect of introducing an implicit offset into the corrected intensities, and this has a variance stabilizing effect. This partly explains why the RMA algorithm has been so successful for Affymetrix data. RMA also tends to return roughly similar parameter estimates regardless of the data, producing more consistent parameter estimates between arrays than the other methods. We speculate that this consistency may also help its performance on real data.

Since our study was completed, Ding and others (2008) developed a normexp-type background correction method for Illumina microarray data. They proposed a Markov chain Monte Carlo (MCMC) simulation method to approximate the maximum likelihood parameter estimates. Their method is not directly applicable to non-Illumina data because it requires Illumina negative controls. MCMC is far more computationally intensive than our Newton–Raphson MLE and returns estimates which vary stochastically from run to run.

Our algorithm is the first to return reliable, exact maximum likelihood estimates for the normexp model. This was only achieved after careful attention to a number of numerical analysis issues. In initial attempts, numerical issues including subtractive cancelation prevented us from computing the likelihood for some data sets. Several ingredients were required before reliable success was achieved including: (1) good initial estimates provided by the saddle procedure, (2) optimizing with respect to logα and logσ instead of α and σ (to enforce α > 0 and σ > 0), and (3) optimizing using both first and second derivatives. Note that the Nelder–Mead algorithm was used first with the saddle-point likelihood, then a pseudo-Newton–Raphson algorithm was used on the exact likelihood once a focused parameter range was established. The Nelder–Mead algorithm could not have been used directly on the exact likelihood because of the much wider range of parameter values under which the likelihood would need to be evaluated. Nor could the Newton–Raphson have been applied to the saddle-point approximation because of the lack of good starting values.

Although we have focused exclusively here on 2-color microarrays, our algorithmic development has obvious applications to other microarray platforms as well.


Routinely collected administrative data sets, such as national registers, aim to collect information on a limited number of variables for the whole population. In contrast, survey and cohort studies contain more detailed data from a sample of the population. This paper describes Bayesian graphical models for fitting a common regression model to a combination of data sets with different sets of covariates. The methods are applied to a study of low birth weight and air pollution in England and Wales using a combination of register, survey, and small-area aggregate data. We discuss issues such as multiple imputation of confounding variables missing in one data set, survey selection bias, and appropriate propagation of information between model components. From the register data, there appears to be an association between low birth weight and environmental exposure to NO2, but after adjusting for confounding by ethnicity and maternal smoking by combining the register and survey data under our models, we find there is no significant association. However, NO2 was associated with a small but significant reduction in birth weight when birth weight was modeled as a continuous variable.

Studies based on synthesis of data sets of different designs are becoming more common in environmental epidemiology. Observational studies in epidemiology are susceptible to a variety of potential biases, as discussed by Greenland (2005), who recommended that the effect of each potential bias on the conclusions should be routinely and jointly assessed. Typically, the biases are not identified by the study data, but information can often be gained by incorporating external data. At the same time, precision can be increased by combining data. We consider studies of the relationship between an exposure and an outcome using a combination of 2 commonly used forms of data:

A large administrative data set, such as a census or disease register, which represents the whole population and enables the study of small-scale geographical variations. This may only be published as aggregate data, leading to ecological bias (Greenland and Morgenstern, 1989), and variables of interest may not be recorded.

A small individual-level data set containing all key variables but lacking power, in particular, information on geographical variations.

Ecological bias from aggregate administrative data can be alleviated by incorporating surveys of individual exposures (Prentice and Sheppard, 1995, Wakefield and Salway, 2001), exposures and outcomes (Jackson et al., 2006, Jackson et al., 2008) or case–control data (Haneuse and Wakefield, 2007). In this paper, instead of an aggregate data set, we consider the situation where the large administrative data set is an “individual-level” register which includes the exposure of interest but omits important confounders. The register data are complemented by a smaller survey data set which contains all relevant variables. We use multiple imputation methods, within a Bayesian graphical modeling framework, to analyze jointly the combined data.

Gelman et al. (1998) described similar methods for simultaneously analyzing multiple survey data sets in which some questions are not asked in some surveys. That article was focused on producing a set of multiply imputed data sets for later analysis, with multivariate normal observed and missing data. In this paper, we describe a joint model for imputing the data and fitting a regression model to the imputed data. Our application involves a binary outcome and categorical missing data, but the methods can be implemented for general forms of data using general purpose software.

1.2. General model for jointly analyzing data sets with different variables

We are interested in a regression of an outcome y on a set of N covariates x1,…,xN when we have 2 or more individual-level data sets. Suppose that observations of y are made in every data set, but only a subset of the covariates is observed in each data set. The idea is to predict the missing covariates in one data set using completely observed variables in the others.

Fig. 1.

General model for regression of y on x using a combination of data sets with different observed covariates. Circles represent unknown quantities and squares represent observed data. Covariates x(M1) missing in data set 1 are predicted from a regression fitted using the observed values of x(M1) in data set 2 and variables x(C) common to both. Covariates x(M2) missing in data set 2 are predicted in a similar way using information from data set 1.

This is illustrated for the simplest case of 2 data sets by a graphical model (Figure 1). To impute the missing covariates x(M1) in data set 1, we require that there are some covariates x(C) observed in both data sets and that x(M1) are observed in data set 2. Firstly, we fit a regression of the x(M1) on x(C) using data set 2. Using the x(C) in data set 1, we predict from this regression to impute the x(M1) missing in data set 1. Similarly, to predict the x(M2) missing in data set 2, if x(M2) are observed in data set 1, we can use data set 1 to estimate a regression model for x(M2) in terms of x(C). The regression coefficients of interest β governing the relationship between y and x1,…,xN in all data sets, the regression coefficients γd governing the imputation model for data set d, and the missing covariates in each data set are estimated simultaneously. In practice, Markov chain Monte Carlo (MCMC) posterior simulation will usually be necessary.
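As a deliberately simplified sketch of the structure in Figure 1, the following R code (using a BUGS-language model string for JAGS via the rjags package) fits a joint model for two data sets with a binary outcome y, one shared covariate xc, and a binary covariate xm observed only in data set 2; all variable names, priors, and the data list dat are hypothetical and purely illustrative:

    library(rjags)    # requires a JAGS installation

    model_string <- "
    model {
      # data set 2: xm observed, informs both the imputation and outcome models
      for (i in 1:n2) {
        xm2[i] ~ dbern(pi2[i])
        logit(pi2[i]) <- gamma0 + gamma1 * xc2[i]
        y2[i] ~ dbern(p2[i])
        logit(p2[i]) <- beta0 + beta1 * xc2[i] + beta2 * xm2[i]
      }
      # data set 1: xm missing, so xm1[j] are unobserved nodes imputed during MCMC
      for (j in 1:n1) {
        xm1[j] ~ dbern(pi1[j])
        logit(pi1[j]) <- gamma0 + gamma1 * xc1[j]
        y1[j] ~ dbern(p1[j])
        logit(p1[j]) <- beta0 + beta1 * xc1[j] + beta2 * xm1[j]
      }
      beta0 ~ dnorm(0, 0.01)
      beta1 ~ dnorm(0, 0.01)
      beta2 ~ dnorm(0, 0.01)
      gamma0 ~ dnorm(0, 0.01)
      gamma1 ~ dnorm(0, 0.01)
    }"

    # dat is a hypothetical list with y1, xc1, n1 (register-like data, xm missing)
    # and y2, xc2, xm2, n2 (survey-like data, fully observed)
    jm   <- jags.model(textConnection(model_string), data = dat, n.chains = 2)
    post <- coda.samples(jm, variable.names = c("beta1", "beta2"), n.iter = 5000)
    summary(post)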

These principles can immediately be generalized to 3 data sets or more. To predict the missing covariates x(Md) in data set d, we require that these covariates are observed in at least one other data set, in which there also exist variables observed in data set d to inform a regression model for x(Md). Indeed, similar principles can be used, if necessary, to impute missing outcomes y. Or, by separating complete from incomplete records, covariates which are missing intermittently within one data set can be imputed.

1.3. Low birth weight and air pollution

This model will be illustrated by a study of the association between low birth weight and exposure to ambient air pollution. Some studies have suggested that exposure to air pollution increases the risk of low birth weight, either as a result of preterm delivery or intrauterine growth retardation. Most of these have been based on births registers covering a single city or region in 1 year, for example, São Paulo (Gouveia et al., 2004), Vancouver (Liu et al., 2003), Sydney (Mannes et al., 2005), and California (Parker et al., 2005). The study in this paper is based on the population of England and Wales, which should enable us to examine a relatively wide range of exposures. In the UK, the Office for National Statistics maintains a register of all births, to which we can link modeled pollution exposures by postcode of residence (see Section 2.3). However, many major risk factors for low birth weight, such as ethnicity or maternal age, are either not recorded or not made available in the register. Therefore, we obtain detailed confounder information from a survey of births, the Millennium Cohort Study (MCS).

There are 2 important sources of potential bias. Bias due to confounding by variables absent from the register data can be alleviated by an imputation model for the missing covariates, constructed using the survey data. Inferences from the survey data are also subject to selection bias. This can be alleviated by adjusting models for known factors affecting selection or by weighting observations by the inverse probability of selection, which is known by design. We could consider analyzing the survey data alone, since all relevant variables are observed, and pollution exposure estimates can be linked by postcode. However, inferences from the survey data alone lack power to detect the small risk increases expected for environmental exposures, typically a few percent for a population quartile of a continuous exposure (e.g. Gouveia et al., 2004). We will demonstrate how the graphical modeling framework described in Section 1.2 can incorporate all data sources, controlling biases and improving precision. Bayesian estimation of the joint model ensures that uncertainties are propagated appropriately between different model components.

Section 2 describes our data sources in more detail. Section 3 describes how the model presented in Section 1.2 is specified and implemented to estimate the association between low birth weight and pollution. Section 4 presents the results, including a range of sensitivity analyses to assess the influence of each source of data and each component of the model. Finally, we discuss the advantages and drawbacks of the methods and suggest some ideas for further development.

2. DATA

2.1. National births register

The UK register of births (Office for National Statistics) recorded 579 267 singleton births in England and Wales between September 1, 2000, and August 30, 2001. Information available for every birth includes date of birth, birth weight, sex, and postcode. Social class and employment status of the mother are available only for a 10% systematic sample (every 10th registered birth), and we study only this subset of 57 844 births. A total of 1231 individuals who also appeared in the MCS, ascertained by a match on postcode, date of birth, sex, and birth weight, were excluded from the register data. Also, 88 births with missing birth weight were excluded, leaving 56 525 births for analysis.

2.2. Millennium Cohort Study

The MCS (Centre for Longitudinal Studies, 2000–2005) covers 18 819 babies born between September 1, 2000, and August 30, 2001, in the UK. We include only the 14 100 singleton births from England and Wales. Births with postcodes of residence which could not be linked to pollution data, due to incomplete pollution mapping or inaccurate recording of postcodes in the MCS, were excluded, leaving 13 131 births for analysis. The distribution of birth weight was similar between the included and the excluded births.

The MCS was cluster-sampled by electoral wards, areas containing an average of around 5000 individuals. Wards in England were stratified into mutually exclusive categories labeled “advantaged”, “disadvantaged” and “high ethnic minority”, and wards in Wales were stratified into categories labeled advantaged and disadvantaged. A different proportion of wards were sampled from each stratum to achieve adequate representation of each stratum. All families in the sampled wards with children born in the relevant period, resident in the UK at 9 months, were invited to participate. Response rates averaged 70%, but varied by stratum, with the lowest response rate of about 60% in the ethnic minority wards. Variables from the MCS which we consider include birth weight, sex, ethnicity, tobacco smoking during pregnancy, maternal age, parity (number of previous births), height and weight, and socioeconomic characteristics of the mother, including social class, employment status, lone parent, and education over age 16.

2.3. Pollution exposure

Estimated background maps of ambient concentrations of NO2 and SO2, for 2001, on a 1-km grid, were obtained from the National Environmental Technology Centre (Stedman et al., 2002). These were modeled from point sources such as power stations, line sources such as road traffic, and monitoring sites, using a dispersion matrix approach. Pollution estimates from 54 517 grid squares in England and Wales were attributed to 566 932 postcodes using area-weighting techniques. The births register and MCS data were linked by postcode to the annual mean pollution concentration for the year (2000 or 2001) in which the nominal date of the middle of pregnancy (140 days prior to birth) falls. Concentrations for the year 2000 were estimated by adjusting the 2001 concentrations by published scaling factors (Department for Environment, Food and Rural Affairs, UK, 2003), calculated from estimated changes in road traffic emissions, which decreased from 2000 to 2001.

2.4. Aggregate data

Some important risk factors for low birth weight are not available from the births register, in particular, ethnic group and tobacco smoking, which are likely to be confounded with air pollution exposure. An imputation model is fitted to individual ethnicity and smoking from the MCS data and subsequently used to predict these variables for the births in the register. To inform this model, geographical aggregate data on these variables were obtained. Neighborhood smoking behavior and ethnicity are expected to be good predictors of their individual-level equivalents. The proportion of the resident population in each of 4 ethnic groups (white, South Asian, black, other) for 46 548 census output areas (areas containing around 200–300 individuals) were obtained from the 2001 UK census. Estimated annual tobacco expenditures, by 2001 census output areas, were obtained from consumer classification data (CACI Information Solutions, Limited). These were linked by postcode to all individuals in the MCS and register.

2.5. Consistency of data sources: selection bias

Table 1 presents a summary of the variables common to the MCS and administrative (either register or aggregate) data. The differences between the administrative and the survey data reflect how the MCS data are not a random sample of the population. However, when the MCS data are summarized using the published survey weights, which are inversely proportional to the proportion of wards sampled in each stratum, the distributions of all variables except social class and employment are consistent with the population summary. The different distribution of social class and employment between the MCS and the register, even after reweighting the MCS, is likely to be caused by inaccurate recording of these variables in the register, rather than selection bias—since the summary of social class and employment from the reweighted MCS was consistent with 1991 UK census data for women between the ages of 12 and 60. The binary smoking status reported by the MCS cannot be directly compared to area-level mean tobacco expenditures.

We aim to adjust by regression for all factors governing selection (Gelman and Carlin, 2001). The data we study are a combination of a random 10% sample of the register with a selective 2% sample (MCS). The combined data set therefore has a 2-stage selection mechanism. Firstly, the MCS subjects are sampled within 5 strata. Secondly, we assume the remaining subjects (ignoring the 2% of those who also appeared in the MCS) are selected at random from the remaining population, which we consider to be a sixth sampling stratum. This sampling design is accounted for in the model for low birth weight by adjusting for the stratum as a covariate. Other covariates in our model, including ethnicity and social class, are assumed to be sufficient to adjust for nonresponse within the MCS sampling strata.

The sample selection mechanism must also be accounted for in the model we use to impute the missing ethnicity and smoking data in the register. The combination of the MCS and register is considered as a single data set in which the births which came from the register have these covariates missing. These can be modeled using multiple imputation. By adjusting the imputation model for the variables governing selection into the MCS, we can assume a “missing-at-random” mechanism for these variables since missingness is equivalent to inclusion in the portion of the data set which came from the register rather than the MCS.

3.MODELS

Two regression models are estimated in parallel using the combined data: a “model of interest” for the relationship of low birth weight to pollution exposure, and an “imputation model” for 2 potential confounders of this relationship, ethnicity and smoking, which are missing from the register data but available from the MCS.

3.1.Model of interest for low birth weight

Suppose baby i from ward k in the MCS has low–birth weight indicator yik. Let xik(C) be a vector of covariates which are observed in both the register and the MCS, and let xik(M) be a vector of confounders which are missing in the register but available in the MCS. The model for this individual's risk pik of low birth weight is a random-effects logistic regression:

logit(pik) = μsk + xik(C)βC + xik(M)βM + Uk,    Uk ~ N(0, σsk²).    (3.1)

In (3.1), μsk represent different baseline risks of low birth weight for the stratum sk in which ward k is classified, defined by the sampling design of the MCS, and Uk are ward-level random effects, assumed exchangeable within each stratum, with a different variance within each stratum sk.

Similarly, for baby j, resident in ward l, in the register, where xjl(M) are unknown,

logit(pjl) = m + xjl(C)βC + xjl(M)βM + Ul.    (3.2)

Ward-level random effects Ul are included, with the same distribution as Uk, to account for any small-area clustering in the risk of low birth weight that is not explained by covariates included in the regression model. The intercept m represents the sixth sampling stratum, discussed in Section 2.5. As in hierarchical related regression (Jackson et al., 2008), the log-odds ratios βC and βM are assumed to be the same between the MCS and the register data.
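To make the structure of (3.1) and (3.2) concrete, the sketch below simulates low–birth weight indicators from a ward-level random-intercept logistic model of this form. It is an illustration only: the covariate choices, coefficient values, and random-effect standard deviation are assumptions, not quantities from the study.

```python
import numpy as np

rng = np.random.default_rng(1)

n_wards, babies_per_ward = 50, 30
beta_c = np.array([0.0, 0.1])     # assumed coefficients for the shared covariates (e.g. NO2, SO2)
beta_m = np.array([0.7, 0.9])     # assumed coefficients for the MCS-only confounders (smoking, ethnicity)
mu_s = -3.0                       # assumed baseline log-odds for one sampling stratum
sigma_s = 0.5                     # assumed ward-level random-effect SD within that stratum

u = rng.normal(0.0, sigma_s, n_wards)                 # ward-level random intercepts U_k
ward_data = []
for k in range(n_wards):
    x_c = rng.normal(size=(babies_per_ward, 2))       # standardized pollution covariates
    x_m = rng.binomial(1, [0.25, 0.15], size=(babies_per_ward, 2))  # smoking, minority ethnicity
    eta = mu_s + x_c @ beta_c + x_m @ beta_m + u[k]   # linear predictor of (3.1)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))   # low-birth-weight indicators y_ik
    ward_data.append(y)

print("simulated low-birth-weight rate:", np.concatenate(ward_data).mean().round(3))
```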

The covariates xik(C), available in both data sets and included in the final model for low birth weight, are NO2 and SO2, which are continuous, and the mother's social class, which is categorical. Covariates xik(M), available in the MCS only, included smoking during pregnancy (binary) and ethnic group (4 categories representing white, South Asian, black and other). Other covariates were either not significant predictors of low birth weight, such as employment status of the mother, or assumed not to be confounded with pollution, such as maternal age, parity, height, and weight. The latter were all found to have negligible correlation with NO2 and SO2 in the MCS data. The results of regression models which included pollution exposure as a categorical variable suggested that it was appropriate to treat the effect of pollution as linear. We assume that the covariates we include, in particular the MCS design strata, ethnic group and social class, are sufficient to adjust for all factors governing MCS selection and nonresponse. In Section 4.3, we perform sensitivity analyses to assess this assumption.

3.2.Imputation model for missing smoking and ethnicity

Each xik(M) indicates the combined smoking status and ethnic group of individual i in ward k from the MCS. This has 8 categories with probabilities qik = (qik1,…,qik8). A regression model is fitted to xik(M) in the MCS data and used to predict the missing xjl(M) in the register. As recommended by Little (1992), all completely observed variables are used for this prediction. These include the individual-level variables xik(C) in the model of interest common to the MCS and register (NO2, SO2, social class) and additional variables xik(P) specific to the imputation model (individual employment status and aggregate covariates). The aggregate covariates, describing the census output area in which individual i is resident (Section 2.4), include the average annual tobacco expenditure per person for each output area and the log-relative proportions of ethnic minorities, defined as log(ψms/ψm1) (s = 2, 3, 4), where ψms is the proportion of the population of output area m in ethnic group s. A random-effects multinomial logistic regression is fitted for xik(M) in terms of xik(P) and xik(C):

log(qikr/qik1) = νr + xik(P)γrP + xik(C)γrC + Vk,    r = 2,…,8.    (3.3)

This is fitted to the MCS data and used to predict the missing smoking and ethnicity in the register data. Classical likelihood ratio tests suggested that all covariates, especially individual NO2 exposure, aggregate ethnicity, aggregate tobacco and individual social class, significantly improved the prediction model. Including further interaction terms did not significantly improve fit. The sampling design of the MCS in model (3.3) is again represented by cluster-level random effects Vk. We assume this model contains all factors governing selection, as discussed in Section 2.5. Different intercepts within each sampling stratum were not used since the strata, based on ward-level child poverty and ethnicity, were highly correlated with the aggregate ethnicity and tobacco data. The low–birth weight outcome also influences this prediction, as described in Section 3.3.
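As a rough illustration of how a multinomial logistic model such as (3.3) converts covariates into imputation probabilities, the following sketch computes the 8-category probabilities qik by a softmax over category-specific linear predictors. All coefficient values and covariate choices here are placeholders rather than estimates from the MCS.

```python
import numpy as np

def imputation_probs(x_p, x_c, v_k, nu, gamma_p, gamma_c):
    """Multinomial-logit probabilities for the 8 smoking-by-ethnicity categories.

    Category 1 is the reference; nu, gamma_p, gamma_c each have one row per
    non-reference category (7 rows here)."""
    eta = nu + gamma_p @ x_p + gamma_c @ x_c + v_k   # linear predictors for categories 2..8
    eta = np.concatenate([[0.0], eta])               # reference category has linear predictor 0
    w = np.exp(eta - eta.max())                      # softmax, numerically stabilized
    return w / w.sum()

rng = np.random.default_rng(0)
x_p = rng.normal(size=3)   # placeholder x^(P): employment, aggregate tobacco, aggregate ethnicity
x_c = rng.normal(size=3)   # placeholder x^(C): NO2, SO2, social class (coded numerically)
nu = rng.normal(size=7)
gamma_p = rng.normal(scale=0.3, size=(7, 3))
gamma_c = rng.normal(scale=0.3, size=(7, 3))

q = imputation_probs(x_p, x_c, v_k=0.2, nu=nu, gamma_p=gamma_p, gamma_c=gamma_c)
print(q.round(3), q.sum())
```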

3.3.Graphical model implementation

The model is fully specified by (3.1–3.3). Figure 2 shows the directed acyclic graph for this model, which forms the basis of an MCMC algorithm (Gilks et al., 1996). The joint distribution of the set of all quantities V in the graph is expressible as the product ∏v∈V p(v|pa[v]) of the conditional distributions of each node given its parents, where pa[v] denotes the parent nodes of v. MCMC estimation of the model proceeds by iterative sampling from the full conditional distributions p(v|·) of each node, where · indicates all nodes other than v. Each full conditional distribution is, up to proportionality, the product of a prior term and a likelihood term: p(v|·) ∝ p(v|pa[v]) ∏w:v∈pa[w] p(w|pa[w]), where the product runs over the nodes w of which v is a parent.

The right-hand side of the graph illustrates that the prior distribution of the unknown confounders xjl(M) in the register is defined by the imputation model, parameterized by the νr,τsk and γr = (γrP,γrC). Information to estimate these parameters comes from their likelihood, which depends on xik(M) in the MCS. Also, xjl(P) denotes variables used in the model for imputing xjl(M) which do not appear in xjl(C). In this graph, the low–birth weight outcome yjl is implicitly involved in the prediction of xjl(M) since the likelihood term of yjl, defined by (3.2), involves the unknown xjl(M).

3.4.Approximation to the full graphical model

In the full probability model illustrated by Figure 2, we would sample directly from the posterior distribution of the imputation coefficients νr,γr to calculate the probabilities qjl governing xjl(M), thus accounting for the uncertainty about the imputation model while estimating the model for low birth weight. However, we found that the calculation of qjl on the register data, using the WinBUGS software (Spiegelhalter et al., 2003) for MCMC sampling, was computationally infeasible. Therefore, we proceeded in 2 stages, firstly fitting the imputation model (3.3) to the MCS data to derive a hierarchical prior distribution for xjl(M), then using this prior distribution to impute missing values of xjl(M) in Bayesian estimation of the model of interest (3.1) and (3.2).

In the first stage, the posterior distributions of the coefficients of model (3.3) were estimated from the MCS data using MCMC sampling. Variables xjl(P), xjl(C), yjl for each individual j and output area l in the register, and samples of 100 from the posterior distributions of νr, γr and Vl were used to predict a sample of 100 replicates of the vector of prior probabilities qjl = (qjl1,…,qjl8) for each individual's unknown smoking status and ethnic group xjl(M). A Dirichlet distribution was then fitted to these replicate vectors, for each j, l, by maximum likelihood (Yee and Wild, 1996). These Dirichlet distributions were then used as priors for (qjl1,…,qjl8) in the second-stage model for low birth weight, so that the uncertainty about νr, γr is propagated through to the second stage. This is represented by the graphical model illustrated in Figure 3:

xjl(M) | qjl ~ Multinomial(1, qjl),    qjl ~ Dirichlet(δjl1,…,δjl8).    (3.4)

Fig. 3. Graphical model (2-stage imputation and regression). In Stage 1, the imputation model with parameters γ, ν is fitted to the ethnicity and smoking data xik(M) in the MCS and used to predict probabilities qjl governing the missing data xjl(M) in the register. In Stage 2, the model of interest is fitted to the low–birth weight outcomes yik in the MCS and yjl in the register, using a Dirichlet prior distribution for qjl parameterized by δjl.
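The Dirichlet-fitting step of Stage 1 can be sketched as follows: given the replicate probability vectors for one individual, a Dirichlet distribution is fitted by maximum likelihood. This is a minimal illustration using scipy with simulated stand-in replicates; the study cites Yee and Wild (1996) for this step, and the direct optimization below is simply one straightforward way to do it.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_dirichlet(p, eps=1e-9):
    """Maximum-likelihood Dirichlet parameters for an (n x K) array of probability vectors."""
    p = np.clip(p, eps, None)
    p = p / p.sum(axis=1, keepdims=True)
    mean_log_p = np.log(p).mean(axis=0)

    def neg_loglik(log_alpha):                 # average negative log-likelihood per replicate
        a = np.exp(log_alpha)                  # log-parameterization keeps alpha positive
        return -(gammaln(a.sum()) - gammaln(a).sum() + ((a - 1.0) * mean_log_p).sum())

    res = minimize(neg_loglik, x0=np.zeros(p.shape[1]),
                   method="Nelder-Mead", options={"maxiter": 5000})
    return np.exp(res.x)

# Simulated stand-in for the 100 posterior-predictive replicates of q_jl (8 categories)
rng = np.random.default_rng(42)
true_delta = np.array([8.0, 2.0, 1.0, 0.5, 3.0, 1.0, 0.5, 0.5])
replicates = rng.dirichlet(true_delta, size=100)
print("fitted delta_jl:", fit_dirichlet(replicates).round(2))
```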

For the Stage 2 model, the prior distributions for the covariate effects comprising βC and βM are independent normal with mean 0 and variance 100. Logistic(0, 1) priors were used for the logit baseline risk parameters m, μ1,…, μ5. Truncated positive N(0, 1) priors were used for σ1²,…,σ5² (Gelman, 2006). The data are sufficient to dominate the influence of this choice of priors.

Note that yjl is included as an explicit predictor of xjl(M) in the Stage 1 model, through the model for qjl, instead of implicitly influencing the prediction through its “likelihood” term which depends on xjl(M). The “valve” on the arrow from xjl(M) to yjl in Figure 3, Stage 2, indicates that this likelihood term is omitted from the full conditional distribution of that node. That is, the dependence of yjl on xjl(M) is “cut,” so that prior information on yjl flows in the direction of the arrow, but likelihood information on xjl(M) does not flow in the reverse direction (Lunn et al., 2008). In the WinBUGS software, this is achieved by “the cut function.” Without this cut, yjl would effectively have been adjusted for twice.
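A toy illustration of the effect of the cut (not WinBUGS code; the prior probabilities and outcome risks below are invented) is to compare two ways of updating a missing covariate: the cut update draws it from its Dirichlet-based prior alone, whereas the uncut full-conditional update would also weight each category by the likelihood of the observed outcome.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_x_cut(q):
    """Cut update: the missing covariate is drawn from its prior q_jl only; y_jl is ignored."""
    return rng.choice(len(q), p=q)

def sample_x_uncut(q, y, risk_given_x):
    """Uncut update: the full conditional also weights each category by p(y_jl | x_jl)."""
    lik = risk_given_x if y == 1 else 1.0 - risk_given_x   # Bernoulli likelihood of the outcome
    w = q * lik
    return rng.choice(len(q), p=w / w.sum())

q = np.array([0.55, 0.15, 0.10, 0.05, 0.08, 0.03, 0.02, 0.02])  # prior probabilities, 8 categories
risk = np.linspace(0.05, 0.40, 8)                               # assumed P(y=1 | category)

cut_draws = np.array([sample_x_cut(q) for _ in range(5000)])
uncut_draws = np.array([sample_x_uncut(q, 1, risk) for _ in range(5000)])
print("category frequencies, cut:  ", np.bincount(cut_draws, minlength=8) / 5000)
print("category frequencies, uncut:", np.bincount(uncut_draws, minlength=8) / 5000)
```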

4.RESULTS

We aim to assess the influence of each source of data on the conclusions and the benefit of each model elaboration. The model defined by (3.1), (3.2) and (3.4) and the various simplifications of it are fitted using all available data and various subsets. In particular, the impact of confounding and selection bias, the benefit gained by combining the MCS and register data, the choice of predictors for the imputation model, the influence of the imputed data, and the benefit of cutting the graphical model are assessed.

Figure 4 presents the posterior mean odds ratios of low birth weight associated with NO2 and SO2 exposure and the odds ratios associated with ethnicity, smoking and the 6 categories of social class in graphical form for 3 important cases. Additional results are given in the supplementary material, available at Biostatistics online (available from http://www.biostatistics.oxfordjournals.org).

Fig. 4.

Odds ratios of low birth weight associated with pollution, smoking, ethnicity, and social class, estimated using 3 different combinations of data. In all cases, the fitted model included pollution, smoking, ethnicity, and social class. Horizontal axis is on the log-scale.

4.1.Impact of confounding

Using the register data alone, a logistic regression for low birth weight on pollution exposures, adjusted for individual social class (available for every mother in the register) but not ethnicity or smoking status, gives an odds ratio of 1.15 for a change in NO2 equal to its interquartile range across England and Wales (95% credible interval 1.07 to 1.23). Fitting a similar model (model (3.1)) to the MCS data, adjusting for ethnicity and smoking status, suggests that this apparent association is the result of confounding: there is no association of low birth weight with NO2 conditionally on ethnicity and smoking. Most studies of low birth weight and pollution have been conducted using birth registers. Our analysis suggests that a misleading result would have arisen given only the UK register, in which we are unable to control for confounding. However, the lack of association with pollution in the MCS may just reflect lack of power; we therefore strengthen our conclusions by combining the MCS and register data under the imputation model.

4.2.Benefit of combining the administrative and survey data

The estimated odds ratio for NO2 under the full model (3.1), (3.2), and (3.4), which combines the MCS and register data and integrates over the missing individual ethnicity and smoking data in the register, is 0.98 (0.91, 1.04), demonstrating an increase in precision compared to the MCS alone. Similar increases in precision are shown for all other covariate effects (Figure 4). Note that the increasing risk of low birth weight with decreasing levels of social class now appears significant (Figure 4, “Combined” result compared to “MCS”). A similar odds ratio for the NO2 effect of 0.97 (0.90, 1.03) is obtained using the outcome data from the register alone (model 3.2), but controlling for confounding using the imputation model (3.4) (Figure 4). This suggests that the main role of the MCS is to inform the imputation model, while the model of interest is dominated by the register data. For SO2, the odds ratio from the combined data is 1.02 (0.98, 1.05). Thus, the evidence for lack of an association of either pollutant with low birth weight has been strengthened by combining the register and survey data.

The posterior distribution of the deviance (−2 × log-likelihood) was calculated for the MCS and register outcomes separately in the model which combined the two, as a measure of model fit. The posterior mean deviances for the MCS and register were 6288 and 25 130, respectively (standard deviations 11 and 57), demonstrating a good fit when compared with the expected deviances for a saturated model of 13 131 and 56 525, respectively (the number of observations in the data).

4.3.Impact of selection bias and data inconsistency

When the differential selection and cluster sampling of the MCS are not accounted for, so that μsk is replaced by a constant μ in (3.1) and the random effects Uk and Ul are removed, the combined model yields an odds ratio of 1.00 (0.94, 1.06) for NO2, implying that the effect of selection bias would not have been great if the sampling design of the MCS had been ignored. An additional model was fitted to the MCS data which ignored confounding by smoking and ethnicity. The estimated odds ratios are similar to those obtained from the same model fitted to the register data alone (first row of Table 3, supplementary material, available at Biostatistics online), but with wider credible intervals. The consistency between the MCS and the population register results suggests that the selection and nonresponse mechanisms in the MCS do not bias the association between pollution and low birth weight. A further model was fitted to the combined data and the MCS alone, excluding social class as a predictor of low birth weight. In both cases, the posterior mean odds ratios and credible limits for NO2 and SO2 (not presented) were less than 1% different from those obtained from the model including social class, suggesting that the poor quality of the social class data from the register (discussed in Section 2.5) did not affect the conclusions.

4.4.Influence of the imputation model

The main assumption of this data synthesis is that the imputation model is able to impute the ethnicity and smoking data in the register with sufficient accuracy to control for their confounding effects. We now assess the influence of the imputation model, the choice of predictors in the imputation model, and the amount of power lost by propagating the imputation uncertainty.

Firstly, we assess roughly how many predictors of individual ethnicity and smoking are required from the population data to control for their confounding effects. Our main imputation model uses all available predictors. When the aggregate ethnicity and tobacco data are omitted from this model, so that it depends only on individual-level variables (low birth weight, NO2 exposure, SO2 exposure, social class, and unemployment), there is a 5% change in the odds ratio for NO2 and a greater change in the confounder odds ratios (fifth row of Table 3, supplementary material available at Biostatistics online). With NO2 also omitted from the imputation model, the biases are greater: the odds ratios in the model of interest are close to those from the register data unadjusted for confounders. Thus, confounding can be controlled to some extent without the auxiliary aggregate data on the confounders, by including a sufficient number of individual predictors of the confounders, provided that the exposures of interest are included in the imputation model.

Secondly, the relative influence of the “observed” and “imputed” confounder data on the model of interest is assessed. The full model was fitted to the combined data, but with the observed confounders in the MCS replaced by multiply imputed values. The odds ratios (ninth row of Table 2, supplementary material available at Biostatistics online) are similar to those with the observed confounders (third row) suggesting that the imputations are consistent with the observed data.

By combining the data, uncertainty is reduced by increasing the sample size, but at the cost of extra uncertainty about the imputed covariate data, which is propagated by the MCMC scheme. The posterior variance of the log-odds ratio for NO2 is 0.00840 from the MCS only. If the data are combined but imputation uncertainty is ignored, using a single random imputation of the confounders in the register, this variance reduces to 0.00102, about 12% of the variance under the MCS. Propagating the uncertainty only increases this variance to 0.00109, about 13% of the variance under the MCS.

4.5.Benefit of cutting the dependency on birth weight

The full model was also fitted to the combined data with the graph not cut as described in Section 3.4. Here, the low–birth weight outcome is allowed to influence this confounder imputation indirectly through the graph, as well as being implicitly accounted for in the prior parameters δjl of xjl(M). This seems to result in large biases in the odds ratios for NO2, ethnicity and smoking (Tables 2 and 3, supplementary material available at Biostatistics online). This warns against applying a graphical model naïvely without considering whether its structure implicitly provides information about certain nodes.

4.6.Substantive interpretation

We conclude that in England and Wales there is a large increase in risk of low birth weight associated with maternal smoking (odds ratio [OR] 1.93 [1.79, 2.09]), South Asian ethnic groups (OR 2.6 [2.3, 2.91]), Black ethnic groups (OR 1.78 [1.5, 2.1]), other ethnic minorities (OR 1.55 [1.28, 1.84]), and decreasing social class. Conditionally on these factors, there does not seem to be an effect of exposure to environmental NO2 or SO2. These results are not inconsistent with the literature on the effects of pollution exposure on birth weight. While there are several studies suggesting associations between NO2, PM10, CO, and SO2 exposure and adverse birth outcomes, these vary in the definition of the exposure and outcome studied and the nature of the association. For example, Mannes et al. (2005) found an association of CO and NO2 exposure in the first trimester of pregnancy with birth weight in Sydney, and Gouveia et al. (2004) found an association of PM10 and CO exposure (but not NO2) in the first trimester with low birth weight for gestational age in São Paulo, whereas Hansen et al. (2007) found no association of NO2 or PM10 exposure with a reduction in birth weight in Brisbane.

Birth outcomes.

The outcome used in our study was low birth weight at all gestational ages. However, the aetiology of preterm birth and intrauterine growth restriction (resulting in low full-term birth weight) is different. The gestational age of each birth is required to distinguish between these 2 outcomes. This is available from the MCS but not from register data. To investigate possible effects on each of these outcomes separately, we fitted standard logistic regression models to the MCS data alone, adjusting for individual ethnicity, smoking, and social class. Around 43% of low–birth weight babies in the MCS were full term ( ≥ 37 weeks gestational age). The associations of NO2 and SO2 with low full-term birth weight are similar to those for all low birth weight, and the effects of smoking and ethnicity are stronger (Tables 2 and 3, supplementary material available at Biostatistics online). The only significant predictor of preterm birth was maternal smoking. There does not appear to be an association of preterm birth with pollution exposure. The findings of lack of an association of either outcome with pollution are inconclusive, although there is no strong evidence to suggest important differences in effect according to gestational age. In Molitor et al. (2008), we propose an extension of the current modeling framework to impute missing information on gestational age in the register.

We study low birth weight, defined as less than 2.5 kg, since this is established as an important public health indicator (United Nations Children's Fund and World Health Organization, 2004). As an alternative to a dichotomous outcome, we also considered modeling birth weight as a continuous variable. Wilcox and Russell (1983) characterized the population distribution of birth weight as a mixture of a predominant normal distribution and a heavy tail, representing full-term and preterm births, respectively. We fitted a mixture of 2 normal distributions to our combined birth weight data, adjusted for the same variables as models (3.1) and (3.2). Uninformative priors were used for the component membership probability and the component-specific means and variances, with an ordering constraint on the component means. The regression coefficients were constrained to be the same between components; allowing them to vary did not improve fit, judging from an increase in the posterior mean deviance. Under this model, there is a change of −31 g (−40 g, −23 g) associated with a change in NO2 equal to its interquartile range; the corresponding change for SO2 is 1.1 g (−3.6 g, 5.8 g). The significant association of NO2 with reduction in birth weight contrasts with the results obtained when dichotomizing birth weight. However, the association is small compared to the population mean birth weight of 3374 g and the “low–birth weight” threshold of 2500 g (below which lie about 6.3% of births).
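A stripped-down version of this continuous-outcome analysis is an unadjusted 2-component normal mixture fitted by EM. The sketch below uses simulated birth weights and omits the covariates, random effects, ordering constraint, and Bayesian formulation of the actual analysis; it only illustrates the mixture idea of Wilcox and Russell (1983).

```python
import numpy as np
from scipy.stats import norm

def em_two_normals(y, n_iter=200):
    """EM for a 2-component normal mixture (roughly, preterm and full-term components)."""
    mu = np.array([y.mean() - y.std(), y.mean() + 0.2 * y.std()])  # crude starting values
    sd = np.array([y.std(), y.std() / 2.0])
    pi1 = 0.2                                                      # weight of the first component
    for _ in range(n_iter):
        d1 = pi1 * norm.pdf(y, mu[0], sd[0])                       # E-step: responsibilities
        d2 = (1.0 - pi1) * norm.pdf(y, mu[1], sd[1])
        r = d1 / (d1 + d2)
        pi1 = r.mean()                                             # M-step: weighted updates
        mu = np.array([np.average(y, weights=r), np.average(y, weights=1.0 - r)])
        sd = np.sqrt([np.average((y - mu[0]) ** 2, weights=r),
                      np.average((y - mu[1]) ** 2, weights=1.0 - r)])
    return pi1, mu, sd

rng = np.random.default_rng(3)
# Simulated birth weights (g): a small low-weight component and a dominant full-term component
y = np.concatenate([rng.normal(2200, 600, 500), rng.normal(3400, 450, 9500)])
pi1, mu, sd = em_two_normals(y)
print(f"weights ({pi1:.2f}, {1 - pi1:.2f}), means {mu.round(0)}, SDs {sd.round(0)}")
```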

Exposure measurement error and variability.

Now, consider the nature of the exposure data in our study. Firstly, the potential impact of measurement error should be considered. The only exposure data we have are modeled annual pollution concentrations in 2000 and 2001 by postcode of residence. These are proxies for the true individual exposures. The true exposures are likely to have higher variance than the observed data (Berkson error), and there is no reason to believe that errors are differential. Thus, while measurement error is likely to reduce power, it is not expected to cause bias in estimated exposure effects (Armstrong, 1998; Zeger et al., 2000). To investigate these impacts, we performed a sensitivity analysis. In the model for the combined data, the observed NO2 exposure xik1 was replaced by the unknown true exposure xik1(true) and a Berkson error model was assumed:

xik1(true) = xik1 + eik,    eik ~ N(0, ω²).

The measurement error standard deviation was defined as ω = 0.5λx̄1, where x̄1 is the empirical mean exposure in the combined data, representing the belief that the true value varies within about ±100λ% of the observed value. The observed SO2 exposure was modeled in the same way. For values of λ up to 1, the estimated odds ratios and their credible limits were within 1% of the estimates with λ = 0, suggesting that measurement error within plausible limits did not materially affect the results of our analysis.
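The mechanics of this sensitivity analysis can be imitated outside the MCMC with a crude plug-in version: draw "true" exposures around the observed values with standard deviation ω = 0.5λx̄1, refit a simple (non-hierarchical) logistic regression, and compare the exposure effect across values of λ. The data and the model below are simulated and simplified, so the sketch shows the mechanics only, not the study's analysis.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 20_000
x_obs = rng.gamma(shape=5.0, scale=4.0, size=n)          # simulated observed NO2-like exposure
p0 = 1.0 / (1.0 + np.exp(-(-2.5 + 0.0 * x_obs)))         # no true exposure effect by construction
y = rng.binomial(1, p0)
iqr = np.subtract(*np.percentile(x_obs, [75, 25]))

for lam in [0.0, 0.5, 1.0]:
    omega = 0.5 * lam * x_obs.mean()                      # omega = 0.5 * lambda * mean exposure
    x_true = x_obs + rng.normal(0.0, omega, n)            # one Berkson draw of the "true" exposure
    fit = sm.Logit(y, sm.add_constant(x_true)).fit(disp=0)
    print(f"lambda = {lam}: OR per IQR = {np.exp(fit.params[1] * iqr):.3f}")
```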

Secondly, the impact of temporal variations in the exposure should be considered. While our annual mean exposure data only enable us to determine the effect of long-term exposure rather than specific effects in different months of pregnancy, we can investigate seasonal variations. Concentrations of NO2 and SO2 were lower in 2001 and are generally higher in winter months (December to February in UK) when the air is cool and stable. Births are approximately uniformly distributed in the data by season. To assess whether there is some seasonal component to the risk of low birth weight after adjusting for annual background concentrations, we fitted the combined model including an extra term for season of birth (categorized as September 2000–November 2000, December 2000–February 2001, March 2001–May 2001, June 2001–August 2001). Small seasonal variations were observed with the lightest births in winter and summer. Relative to a baseline of September–November, the odds ratio for low birth weight for birth in December–February was 1.09 (1.00, 1.18), for March–May 1.04 (0.95, 1.14), and for June–August 1.07 (0.99, 1.17). Lower birth weights in summer may be related to pollution exposure during pregnancy in the winter, although it has been suggested that low temperatures during mid-pregnancy may directly affect foetal growth (Murray et al., 2000).

5.DISCUSSION

Multiple imputation methods, which are more commonly used for intermittent nonresponse within single data sets, can also be used to combine data in situations where some variables are missing by design in particular data sets. In this paper, we presented and applied a model for combining data sets with different sets of variables, generalizing the model presented by Gelman et al. (1998) to include estimation of regression relationships on the imputed data and general forms of observed and missing data, both discrete and continuous. The graphical modeling framework provides a joint probability model for the combined data, in which uncertainties from one model component are propagated to the others. It is easily extensible and can be implemented in general-purpose software. For example, the multiple imputation methods could be extended, in the way described by Gelman et al. (1998), to deal with situations in which several individual data sets are modeled, some with certain variables completely absent and others with intermittent nonresponse. Survey-level covariates may be needed to explain systematic biases from each survey, and a hierarchical model may be needed to represent the correlation structure. However, in routine application of graphical models, the structure of the influence relationships must be considered carefully, as we showed by demonstrating the need for “cutting” the dependency of the missing covariates on the observed outcome.

A similar situation of synthesizing data sets with different sets of covariates arises in “2-phase” or 2-stage designs (White, 1982). These are used to improve efficiency, commonly of case–control studies, in situations where covariate collection is expensive. Individuals are classified into strata defined by combinations of an outcome and an exposure of interest, and samples of individuals are selected from each stratum for further covariate collection. By oversampling from the smaller strata and using appropriate methods for inference (Breslow and Holubkov, 1997), efficiency can be increased. Only the smaller of the 2 data sets, with full covariate information, is analyzed directly, but the information on the exposure-outcome relationship in the larger data set is used indirectly when constructing a model to account for the sampling design. In this paper, we have described how this information can be used directly, in a situation where the design of the smaller data set is not based on the larger data set. The improvement in power comes from constructing an expanded data set on which to estimate the model, rather than from the design of the sample.

By synthesizing data from different sources, inferences can be improved. In our application, we were able to make the most of the strengths of each data set: the large sample size of the administrative data and the more detailed covariate collection of the survey data. However, any analysis of combinations of data, including meta-analysis, is not recommended when the data sets being combined are too heterogeneous. Here, “heterogeneity” is used as a general term encompassing differences in study design, different variables collected, differences in the underlying populations, or systematically different responses to variables which are nominally the same. Ideally, the reasons for heterogeneity should be represented as extra parameters in the model. If these are not identifiable from data, then hierarchical models, as in Gelman et al. (1998), can often help to account for the extra uncertainty incurred by combining the data sets. But if the data sets are too heterogeneous, this extra uncertainty will lose any advantage gained by combining them. For example, if covariates are missing in one data set, then there needs to be sufficient complete data in other data sets to enable their imputation. In our application, if sufficient predictors of individual smoking and ethnicity had not been available from population data, then data synthesis would have been futile. Further work in this area should focus on “calibrating” specific methods of data synthesis to assess the potential benefit of the synthesis before analysis. For example, this may involve determining the amount of covariate information required to inform a multiple imputation before the imputation gives any benefit.

FUNDING

Economic and Social Research Council (RES-576-25-5003); Department of Health (PHI/03/C1/045). Funding to pay the Open Access publication charges for this article was provided by the Medical Research Council (U.1052.00.008, U.1052.00.001).


We are grateful to the Small Area Health Statistics Unit at Imperial College for access to the births register, postcoded MCS data, and CACI data; Heather Joshi and Jon Johnson at the Centre for Longitudinal Studies at the Institute of Education for their assistance in obtaining access to the postcoded MCS data; and Kees de Hoogh for the pollution exposure data. This work was carried out as part of the Imperial College node (http://www.bias-project.org.uk) of the Economic and Social Research Council National Centre for Research Methods. This work was carried out while at the Department of Epidemiology and Public Health, Imperial College London, UK. Conflict of Interest: None declared.


Following the recent success of genome-wide association studies in uncovering disease-associated genetic variants, the next challenge is to understand how these variants affect downstream pathways. The most proximal trait to a disease-associated variant, most commonly a single nucleotide polymorphism (SNP), is differential gene expression due to the cis effect of SNP alleles on transcription, translation, and/or splicing (gene expression quantitative trait loci, eQTL). Several genome-wide SNP–gene expression association studies have already provided convincing evidence that such eQTLs are widespread. As a consequence, some eQTL associations are found in the same genomic region as a disease variant, either by coincidence or because of a causal relationship. Cis-regulation of RPS26 gene expression and a type 1 diabetes (T1D) susceptibility locus have been colocalized to the 12q13 genomic region. A recent study has also suggested RPS26 as the most likely susceptibility gene for T1D in this genomic region. However, it is still not clear whether this colocalization is the result of chance alone or whether RPS26 expression is directly correlated with T1D susceptibility, and therefore potentially causal. Here, we derive and apply a statistical test of this hypothesis. We conclude that RPS26 expression is unlikely to be the molecular trait responsible for T1D susceptibility at this locus, at least not in a direct, linear connection.

Keywords: Association studies; Gene expression; RPS26; T1D

1.INTRODUCTION

Genome-wide association studies have successfully linked a large number of genetic variants with susceptibility to common diseases (McCarthy and others, 2008). However, these findings need to be followed up in order to understand the functional role of these susceptibility alleles at the molecular level. One way to address this follow-up is to correlate measures of gene expression, or alternative measurements at the protein level, with common susceptibility variants. Owing to the development of affordable genome-wide gene expression analysis and single nucleotide polymorphism (SNP) genotyping technologies, it has recently become feasible to scan the genome for variants that correlate with the expression of nearby genes (eQTLs), affecting either the overall transcription levels or the relative amounts of splice variants. Recent studies have already provided evidence that such eQTLs are widespread (Morley and others, 2004; Dixon and others, 2007; Goring and others, 2007; Moffatt and others, 2007; Emilsson and others, 2008). As a consequence, some eQTL associations are found in the same genomic region as a disease variant.

An illustration of this example is the colocalization of association signals for type 1 diabetes (T1D) (Todd and others, 2007) and the RPS26 eQTL (Morley and others, 2004; Dixon and others, 2007) in the 12q13 genomic region. A recent study of eQTLs using human liver tissue samples combined with bioinformatic network analyses (Schadt and others, 2008) suggested RPS26 as the most likely T1D susceptibility gene in this chromosome region. Therefore, an obvious question is whether the T1D association signal is a consequence of the RPS26 eQTL. If RPS26 expression is the driving factor for T1D susceptibility, then a genetic variant that affects RPS26 expression should also explain or correlate precisely with the T1D association. In that case, provided that the sample size is sufficient and that the causal variant has been identified, both T1D and eQTL association peaks should coincide. Given the limited fine mapping of both of these traits in this genomic region, the most likely hypothesis is that the actual causal variants, either for T1D or RPS26, remain unknown. Moreover, discrepancies between T1D and RPS26 association maps could result from limited sample sizes. We have designed a statistical test of the presence of a sole, shared causal variant for two overlapping association signals. Using combined data from approximately 4 000 case and 4 000 control British Juvenile Diabetes Research Foundation/Wellcome Trust (JDRF/WT) T1D samples and 387 Epstein Barr virus-transformed lymphoblastoid cell lines (LCLs) measured for RPS26 expression (Dixon and others, 2007), we conclude that the RPS26 eQTL expression is unlikely to be the molecular trait responsible for T1D susceptibility at this locus.

2.METHODS

2.1Gene expression data

We obtained data from a genome-wide eQTL expression study (Dixon and others, 2007) using RNA from LCLs from unrelated individuals of British descent. RPS26 expression measurements were available for 387 samples. Gene expression measurements were obtained using the Affymetrix HG-U133 Plus 2.0 gene expression chip. Robust multi-array averaging (Irizarry and others, 2003) was applied to the data, providing a log-scale estimate of the gene expression level.

2.2Case–control samples

Case–control samples were obtained from the UK JDRF/WT T1D case–control collection. Excluding missing data, the full genotype was available for 3 988 healthy controls and 4 141 T1D patients. Patients and healthy controls originated from England, Scotland, and Wales and were matched across 12 subregions of Great Britain.

2.3SNP selection

For both the RPS26 gene expression study and the T1D case–control collection, genome-wide genotyping data were available. Additional fine-mapping SNP genotyping data were available for the T1D case–control samples. Initially, we restricted our study to SNPs present in both studies with highly significant p-values for T1D and the RPS26 eQTL (p < 10−10 in both cases). This resulted in a set of 4 SNPs (rs705704, rs705699, rs1131017, and rs877636) with full genotype data in a case–control set of approximately 4 000 cases and 4 000 controls. In addition, full genotype data for these 4 SNPs combined with RPS26 expression were available for the 387 unrelated individuals.

To apply the statistical test presented in this study, we further restricted our analysis to the subset of SNPs with a significant joint contribution to either the T1D status or the RPS26 gene expression measurement (and not simply to SNPs with marginally significant p-values). This was done using a stepwise forward regression approach, adding at each step the most significant SNP and stopping the procedure when the p-value associated with adding a SNP was higher than 0.05. We found that a single SNP could explain the T1D association (rs705704). Similarly, a single SNP (rs1131017) explained the RPS26 eQTL association.
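The forward selection described here can be sketched as follows: at each step add the SNP giving the largest likelihood improvement, and stop when the likelihood ratio p-value for entry exceeds 0.05. The sketch below uses a plain logistic regression on simulated genotypes and is not the code used for the study.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def loglik(y, X):
    return sm.Logit(y, sm.add_constant(X, has_constant="add")).fit(disp=0).llf

def forward_select(y, G, alpha=0.05):
    """Forward stepwise logistic regression over the SNP columns of G, with LR entry tests."""
    selected, remaining = [], list(range(G.shape[1]))
    ll_current = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).llf   # intercept-only model
    while remaining:
        ll_new, best = max((loglik(y, G[:, selected + [j]]), j) for j in remaining)
        p = chi2.sf(2.0 * (ll_new - ll_current), df=1)               # 1-df likelihood ratio test
        if p > alpha:
            break
        selected.append(best)
        remaining.remove(best)
        ll_current = ll_new
    return selected

rng = np.random.default_rng(5)
n = 4000
g1 = rng.binomial(2, 0.4, n)                                         # "causal" SNP genotype
g2 = np.where(rng.random(n) < 0.85, g1, rng.binomial(2, 0.4, n))     # SNP in strong LD with g1
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.4 * g1))))
print("selected SNP columns:", forward_select(y, np.column_stack([g1, g2])))
```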

2.4Statistical test

We assumed a linear relationship between the unobserved causal variant Z, treated as a continuous trait, and the observed genotypes X. This assumption is valid in the context of a region in perfect linkage disequilibrium (Clayton and others, 2004), which is the case in this study: D′ = 1 between our best T1D susceptibility (rs705704) and RPS26 gene expression (rs1131017) markers. Hence,

𝔼(Z | X) = α0 + ∑i αiXi.

This unknown genotype relates to both traits as follows:

ℙ(Y = 1 | Z) = a + bZ,    𝔼(E | Z) = a′ + b′Z,

where Y denotes the disease status and E the RPS26 expression level. These equations together imply the following:

ℙ(Y = 1 | X) = γ0 + ∑i γiXi,    𝔼(E | X) = γ0′ + ∑i γi′Xi,    with γi = bαi and γi′ = b′αi.

Assuming that the link between the observed genotypes X and the unobserved causal variant Z is identical in both studies, the parameters γi, i = 1, 2, are a constant multiple of the γi′, i = 1, 2. Using the asymptotically Normal distribution of the estimated parameters, we computed the likelihood of the data under the null hypothesis of a sole causal variant and under the alternative and derived a likelihood ratio test statistic. The resulting statistic is distributed as χ2 with n − 1 degrees of freedom, where n is the number of SNPs involved in the analysis (n = 2 in the case of this study).

In addition, in the common situation where the genotype effect on disease susceptibility is small, the link between Y and the observed genotypes X can be replaced with a logit function, thus avoiding a complex maximization under the constraint that ℙ(Y = 1 | X) = γ0 + γ1X1 + γ2X2 lies in [0, 1].
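Our reading of this construction can be sketched as follows (illustrative code, not the QTLMatch implementation): treat the two studies' estimated coefficient vectors as approximately normal around their true values, minimize the combined quadratic form under the constraint that one vector is a scalar multiple of the other, and refer the minimized value to a χ² distribution with n − 1 degrees of freedom. The numerical values below are invented.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def proportionality_test(g_disease, V_disease, g_eqtl, V_eqtl):
    """LR-type test that two estimated coefficient vectors are proportional (shared variant)."""
    Wd, We = np.linalg.inv(V_disease), np.linalg.inv(V_eqtl)
    n = len(g_disease)

    def h0_quadform(theta):                       # -2 log-likelihood (up to a constant) under H0
        gamma, c = theta[:n], theta[n]
        rd, re = g_disease - gamma, g_eqtl - c * gamma
        return rd @ Wd @ rd + re @ We @ re

    start = np.concatenate([g_disease, [1.0]])
    res = minimize(h0_quadform, start, method="Nelder-Mead")
    stat = res.fun                                # under H1 the fit is exact, so the LR is this minimum
    return stat, chi2.sf(stat, df=n - 1)

# Illustrative numbers only, not the study's estimates
g_t1d = np.array([0.25, 0.02]); V_t1d = np.diag([0.05 ** 2, 0.05 ** 2])
g_rps = np.array([0.05, 0.60]); V_rps = np.diag([0.04 ** 2, 0.04 ** 2])
stat, p = proportionality_test(g_t1d, V_t1d, g_rps, V_rps)
print(f"LR statistic = {stat:.1f}, p = {p:.2g}")
```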

2.5Extension of the test for nonlinear genotype/phenotype correlations

Our test extends naturally to the nonlinear case by linking both traits to RPS26 expression using nonlinear (here, quadratic) relationships, with coefficients λ = (λ0, λ1, λ2) in the eQTL study and λ′ = (λ0′, λ1′, λ2′) in the case–control study.

However, as a result of nonlinearity, both regressions (observed to unobserved genotypes and unobserved genotype to phenotype) could not be combined into a single regression. Therefore, we used a missing data approach and incorporated the causal genotype Z in our model. Our modified null hypothesis becomes λi = λi′ for i = 0, 1, 2.

An additional complexity originated from the fact that when explaining the T1D association, the full model was nearly unidentifiable, owing to the difficulty of differentiating a rare variant with a large effect from a common variant with a small effect. To address this issue, we used an informative Bayesian prior distribution on the minor allele frequency of the causal variant. For each set of values (λ0, λ1, λ2), we estimated the allele frequency in the study and used as prior for the minor allele frequency p a beta distribution centered at 0.43 (a = 150, b = 200, mean 0.43, and standard deviation 0.026). Consequently, we report Bayes’ factors (Kass and Raftery, 1995) rather than p-values when using our extended test. Altering the prior distribution of the minor allele frequency p did not affect our estimates of the Bayes’ factors.

Estimating the Bayes’ factors amounted to estimating three versions of the marginal likelihood ∫ℙ(D | λ)π(λ)dλ (for the eQTL study, the case–control study, and both studies jointly), where π denotes the normalized prior on the three-dimensional parameter λ and D the corresponding data. This estimation used Markov chain Monte Carlo (MCMC) to sample the parameters λi, i = 0, 1, 2, from ℙ(λ | D). The MCMC algorithm was a random-walk Metropolis–Hastings sampler (Hastings, 1970). Proposal distributions were independent N(0, 0.001²).
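A generic random-walk Metropolis–Hastings sampler of the kind described above is sketched below. The target here is a toy log-posterior and the proposal scale is chosen for the toy problem; in the actual analysis the target is ℙ(λ | D) with the beta prior on the allele frequency.

```python
import numpy as np

def rw_metropolis(log_post, init, n_iter=20_000, step=0.01, seed=0):
    """Random-walk Metropolis-Hastings with independent normal proposals of SD `step`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(init, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_iter, x.size))
    for t in range(n_iter):
        prop = x + rng.normal(0.0, step, size=x.size)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with probability min(1, posterior ratio)
            x, lp = prop, lp_prop
        samples[t] = x
    return samples

# Toy target: independent normals for (lambda0, lambda1, lambda2); the real sampler targets P(lambda | D)
centre = np.array([0.10, -0.20, 0.30])
draws = rw_metropolis(lambda lam: -0.5 * np.sum(((lam - centre) / 0.05) ** 2), init=np.zeros(3))
print("posterior means:", draws[10_000:].mean(axis=0).round(3))
```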

2.6Software

All computations were done using the programming language R (http://www.r-project.org/). We wrote an R package QTLMatch which implements the statistical procedures proposed in this paper (available at http://www-gene.cimr.cam.ac.uk/todd/).

3.RESULTS

We first used single marker analysis to identify SNPs highly correlated with both T1D status and RPS26 expression. Using p = 10−10 for the T1D association and p = 10−50 for RPS26 expression as significance thresholds, we selected an initial set of 4 highly correlated SNPs with genotypes available in both studies. A visual illustration of the strong correlations between RPS26 expression and rs1131017 is shown in Figure 1. We also observed clear departure from a linear trend model when comparing a 1-degree-of-freedom trend model with a 2-degree-of-freedom model (p≪0.001).

Subsequent analysis relies on the pattern of linkage disequilibrium to be identical in both data sets. To check this assumption, we computed the pairwise squared correlation coefficients r2 and minor allele frequencies for the 4 SNPs in our analysis. We found these measures of linkage disequilibrium to be highly similar (see Table 1), consistent with the fact that all samples are of British ancestry.

Table 1.

Pairwise squared correlation coefficient r2 between markers and minor allele frequencies in the T1D case–control samples and the RPS26 gene expression study. The first value relates to the JDRF/WT British T1D case–control collection and the second value to the eQTL study

            rs877636    rs1131017   rs705699    rs705704    MAF (RPS26 study)   MAF (T1D study)
rs877636    —           0.64/0.67   0.56/0.57   0.88/0.88   0.37                0.37
rs1131017   0.64/0.67   —           0.88/0.89   0.69/0.71   0.46                0.45
rs705699    0.56/0.57   0.88/0.89   —           0.61/0.66   0.45                0.44
rs705704    0.88/0.88   0.69/0.71   0.61/0.66   —           0.37                0.37

MAF, minor allele frequency.

We then compared estimated regression coefficients between the T1D case–control collection and the RPS26 gene expression study. Assuming a sole, shared causal variant for T1D and RPS26, regression coefficients should be proportional between both data sets. However, we found a clear deviation from the proportionality (see Figure 2).

Fig. 2.

Comparison of estimated regression coefficients in the T1D case–control and the eQTL RPS26 expression studies, analyzing 1 SNP at a time. Assuming a sole causal variant for both traits, the estimated regression coefficients should be proportional between both studies.

To further describe these differences, we used a stepwise regression approach. The most significantly associated SNP for RPS26 expression was rs1131017 (p ≪ 10−50), and the most significantly associated SNP for T1D status was rs705704 (p = 10−13). When correlating T1D status with rs705704, adding rs1131017 did not improve the model (p = 0.3). Conversely, when correlating RPS26 expression with rs1131017, adding rs705704 did not improve the model (p = 0.85). On the other hand, when explaining the T1D variable, the best T1D SNP rs705704 significantly added to the best RPS26 SNP rs1131017 (p < 0.001). Reciprocally, when explaining RPS26 expression, the best RPS26 SNP rs1131017 significantly added to the best T1D SNP rs705704 (p ≪ 10−50). We estimated from our data that the pairwise r2 between rs1131017 and rs705704 is r2 = 0.7.

These results indicated discrepancies between the genetics of T1D and RPS26 expression. We therefore devised a formal multilocus statistical test for the existence of a sole, shared causal variant. Because rs705704 captured all the T1D association and rs1131017 captured all the RPS26 expression association, we restricted our study to these 2 SNPs. We found that confidence intervals of jointly estimated coefficients are not consistent with the proportionality predicted under the null hypothesis (see Figure 3). As a consequence, our test rejected the null hypothesis of a sole, shared common causal variant (p = 0.001).

Fig. 3.

(A) Jointly estimated regression coefficients and confidence intervals for rs1131017 and rs705704 under the null hypothesis H0 of a sole common variant and under the alternative H1. Under the null hypothesis H0, estimated regression coefficients have to be located on a line passing through the origin. (B) Rescaled version of (A) highlighting the fact that when compared on the same unit scale (i.e. regression estimates located on the unit circle), the variance of the estimated coefficients is higher for the T1D study. Therefore, regression estimates under the null are driven by the RPS26 expression data.

We then extended our analysis to account for the observed nonlinearity between observed genotypes and RPS26 expression. Incorporating a quadratic model to link the causal variant with the RPS26 expression, we also found substantial evidence against the null hypothesis of a shared common variant (Bayes' factor of 10 against the null). These results indicate that it is highly unlikely that both phenotypes share a common causal variant.

4.DISCUSSION

We have derived a statistical procedure to test for the presence of a sole causal variant. Using this test, we could reject the hypothesis of a sole, shared common causal variant for RPS26 expression and T1D status. If the effect of this T1D susceptibility locus was mediated through the expression of RPS26, the RPS26 variant would also be causal for T1D. Therefore, we conclude that the overlap between both associations is probably the result of chance alone and RPS26 expression level is unlikely to be related to T1D susceptibility.

Even if both T1D/RPS26 associations shared a common variant, the expression of RPS26 could still have no etiological effect on T1D susceptibility. This would be the case if the unique variant had an effect on 2 unrelated pathways. On the other hand, T1D susceptibility cannot be the direct consequence of RPS26 expression level if the genetics of both traits do not match, which is the case in our analysis. We note, however, that our analysis relies on RPS26 expression pattern in LCLs. We cannot exclude that the genetics of RPS26 expression and T1D are in fact concordant in a different cell population or under different conditions.

Our statistical analysis relies on the pattern of linkage disequilibrium being identical in both data sets. A better design would consist of sampling individuals from the same population or ideally obtaining gene expression measurements from the same case–control samples. However, the samples in the gene expression study originated from Great Britain and we expect the pattern of linkage disequilibrium to be very similar to the T1D British case–control collection, as indeed we observed (Table 1).

A limitation of our approach is the assumption of a sole causal variant. A more complex scenario, involving several loci and allelic heterogeneity in the 12q13 region, could be invoked in which SNPs do affect T1D susceptibility via RPS26 expression differences. Owing to missing information and complex genetic architecture, we could not confirm this relationship. However, stepwise regression analysis shows that, based on currently available data, both the T1D association (Todd and others, 2007) and the RPS26 eQTL can be summarized using a single SNP for each trait. Therefore, there is no evidence of allelic heterogeneity either for T1D or for the RPS26 eQTL, and this more complex scenario appears unlikely.

Nonlinear genotype/phenotype relationships can also affect the outcome of our statistical procedure. The Bayesian extension of our test is designed to address such nonlinear effects. Its main drawback is its reliance on iterative estimation procedures: convergence issues can lead to misestimated Bayes’ factors. On the other hand, we expect our first likelihood ratio test to be robust to limited departure from linearity. Our rationale is the fact that in a generalized regression model where the function linking genotype and phenotype is unknown or misspecified, the intercept parameter is not identifiable but the slope coefficients can nevertheless be robustly estimated up to a proportionality constant (Li and Duan, 1989). Given that proportionality between 2 sets of estimated regression slopes is the focus of our test, the consequence of a nonlinear genotype/phenotype relationship should be limited.

Our approach can be used in future studies to test the existence of causal relationships between eQTLs, or any other phenotype, and disease susceptibility. The same methodology could also be used to compare 2 different disease associations located in the same genomic region. However, our example shows that the most accurately estimated regression parameters come from the gene expression analysis, in spite of a much larger sample size for the case–control study (4 000 controls and 4 000 cases compared to 387 data points for the expression study). This is a consequence of the much stronger genotype/phenotype correlation observed for the eQTL. Therefore, comparing 2 disease associations with small effects will provide large confidence intervals and limited power to separate both signals.

Statistical power to confirm these causal relationships will be increased if the causal variant is tagged effectively, thus motivating ongoing fine-mapping efforts of disease susceptibility loci. Increased case–control sample size will also contribute to greater statistical power to confirm such causal relationships.

FUNDING

Juvenile Diabetes Research Foundation; the National Institute for Health Research, Cambridge Biomedical Research Centre; the Wellcome Trust. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.

The Cambridge Institute for Medical Research is in receipt of a Wellcome Trust Strategic Award (079895). V. Plagnol is a Juvenile Diabetes Research Foundation postdoctoral fellow. We thank Matthew Stephens, Thomas Lumley and John Whittaker for helpful comments. Conflict of Interest: None declared.


Semicontinuous data in the form of a mixture of zeros and continuously distributed positive values frequently arise in biomedical research. Two-part mixed models with correlated random effects are an attractive approach to characterize the complex structure of longitudinal semicontinuous data. In practice, however, an independence assumption about random effects in these models may often be made for convenience and computational feasibility. In this article, we show that bias can be induced for regression coefficients when random effects are truly correlated but misspecified as independent in a 2-part mixed model. Paralleling work on bias under nonignorable missingness within a shared parameter model, we derive and investigate the asymptotic bias in selected settings for misspecified 2-part mixed models. The performance of these models in practice is further evaluated using Monte Carlo simulations. Additionally, the potential bias is investigated when artificial zeros, due to left censoring from some detection or measuring limit, are incorporated. To illustrate, we fit different 2-part mixed models to the data from the University of Toronto Psoriatic Arthritis Clinic, the aim being to examine whether there are differential effects of disease activity and damage on physical functioning as measured by the health assessment questionnaire scores over the course of psoriatic arthritis. Some practical issues on variance component estimation revealed through this data analysis are considered.

Psoriatic arthritis (PsA) is a chronic inflammatory arthritis associated with psoriasis. The University of Toronto Psoriatic Arthritis Clinic has developed a prospective longitudinal observational cohort of patients with PsA since 1978 (Gladman and others, 1987). In a recent study, the investigators were interested in examining whether there are differential effects of disease activity and damage on physical functioning as measured by the health assessment questionnaire (HAQ) over PsA duration (Husted and others, 2007).

The HAQ is a self-report functional status (disability) measure that has become the dominant instrument in many disease areas, including arthritis (Bruce and Fries, 2003). It produces a measure that can take the value zero with positive probability, while nonzero values vary continuously in the range 0 (no disability) to 3 (completely disabled). Since June 1993, the HAQ has been administered annually to patients in the PsA clinic and, as of March 2005, 440 patients had completed at least one HAQ, with 382 (87%) completing 2 HAQs and comprising the study group (Husted and others, 2007). In addition, at clinic visits, scheduled at 6–12 month intervals, demographic and other clinical information was obtained. There were 2107 HAQ observations available for our analyses. As shown in Figure 1, a notable feature of these data is the observation cluster at zero (645/2107 = 30.6%). This presents a challenge in characterizing the relationship between the HAQ scores and the explanatory variables.

Fig. 1.

Bar plot for the HAQ data in Section 1.1.

1.2Models for longitudinal semicontinuous data

When an outcome variable is a mixture of true zeros and continuously distributed positive values, the data generated are termed “semicontinuous” (Olsen and Schafer, 2001). Various methods have been proposed for analyzing cross-sectional and longitudinal semicontinuous data (Olsen and Schafer, 2001; Berk and Lachenbruch, 2002; Tooze and others, 2002; Moulton and others, 2002; Hall and Zhang, 2004). It is natural to view a semicontinuous variable as the result of 2 processes, one determining whether the outcome is zero and the other determining the actual value if it is nonzero; for convenience, we refer to the data arising from these 2 processes as the “binary part” and the “continuous part” of the data, respectively. Two-part models are therefore attractive. In a 2-part model, it is assumed that explanatory variables influence the outcome through their role in the different processes. For example, for the HAQ data, interest may be in characteristics that distinguish PsA patients who had no difficulty in physical functioning (HAQ score = 0) from those who had at least mild difficulty (HAQ score > 0), and in what characteristics have an impact on the actual level of difficulty represented by positive HAQ scores, given that the patients had at least mild difficulty (HAQ score > 0). In other words, the targets of inference are the distribution of the binary HAQ indicators and the conditional distribution of the HAQ scores given they are positive. In econometrics, 2-part models have been well developed for cross-sectional semicontinuous data (Duan and others, 1983; Zhou and Tu, 1999; Tu and Zhou, 1999). For longitudinal semicontinuous data, 2 approaches have been proposed recently. One is based on 2-part mixed models with correlated random effects in both parts of the model (Olsen and Schafer, 2001; Berk and Lachenbruch, 2002; Tooze and others, 2002). The other is based on 2-part marginal models using generalized estimating equation methodology (Moulton and others, 2002; Hall and Zhang, 2004). Here, we focus on the former approach.

It is natural to conjecture that the 2 processes that generate semicontinuous data may be related, especially if the outcome is observed at multiple time points. For example, since no disability and a low level of disability can both be features of mild PsA, clinically we would expect a low level of disability (positive HAQ score) on one occasion to be positively associated with the probability of having no disability (zero HAQ score) on another occasion. The introduction of correlated random effects is a means to account for both the dependence between observations within subjects and the dependence between the 2 processes in semicontinuous data. However, it can also lead to severe computational problems. For example, with many unstandardized explanatory variables and a long sequence of unbalanced longitudinal data (Husted and others, 2007), it may not be possible to obtain a fit using the SAS NLMIXED procedure (SAS Institute, Cary, NC, Version 9.1) within a reasonable time frame, probably due to the complexity of the specified model. In the analysis reported in Husted and others (2007), 2 of us (Brian D. M. Tom and Vernon T. Farewell) uncritically conjectured further that an incorrect assumption of independent random effects would not prevent consistent estimation of regression coefficients. Here, we revisit this conjecture and examine the impact of this correlation on the estimation of 2-part mixed models. The correlation is important because parameters in the model for the binary part determine the cluster size (e.g. the number of observations with positive HAQ score within subjects) for the continuous part of the model. Therefore, we are faced with an “informative cluster size” problem. Thus, the assumption of independence between random effects may produce bias in the estimation of both regression coefficients and variance components in the continuous part of the model for semicontinuous data.

The remainder of this article is organized as follows. Section 2 briefly summarizes 2-part mixed models for longitudinal semicontinuous data, including an extension to accommodate artificial zeros due to left censoring, and derives the asymptotic bias of parameter estimators when random effects are incorrectly assumed independent and other variance component parameters are fixed. In Section 3, we investigate the factors that influence the asymptotic bias derived in Section 2. The performance of 2-part mixed models in practice is considered in Section 4 using Monte Carlo simulations. The HAQ data are analyzed in Section 5, and some practical issues regarding variance component estimation are addressed in Section 6. We conclude with a discussion in Section 7.

2.BIAS IN 2-PART MIXED MODELS FOR SEMICONTINUOUS DATA

In this section, we briefly describe 2-part mixed models for semicontinuous data and their extension to accommodate artificial zeros Olsen and Schafer (2001), Berk and Lachenbruch (2002), Tooze and others (2002). We also discuss the potential bias for parameters in the continuous part.

2.1Model assumptions

Olsen and Schafer (2001) first extended the 2-part model to the longitudinal setting by introducing correlated random effects into both the binary and the continuous parts of the model. Tooze and others (2002) discussed a similar 2-part mixed model.

Let Yij be a semicontinuous variable for the ith (i = 1,…,N) subject at time tij (j = 1,…,ni). This outcome variable can be represented by 2 variables, the occurrence variable Zij = I(Yij > 0) and the intensity variable g(Yij) given that Yij > 0, where g(·) is a transformation that makes Yij∣Yij > 0 approximately normally distributed with a subject-time-specific mean.

Instead of focusing on the marginal distribution of Yij, in a 2-part mixed model we are interested in both the distribution of the occurrence variable Zij and the conditional distribution of the intensity variable g(Yij) given that Yij > 0. Specifically, it is assumed that Zij follows a random-effects logistic regression model

logit{Pr(Zij = 1∣Ui)} = Xijθ + Ui,    (2.1)

where Xij is a 1×q explanatory variable vector, θ is a q×1 regression coefficient vector, and Ui is the subject-level random intercept. The intensity variable g(Yij) given Yij > 0 follows a linear mixed model

g(Yij)∣(Yij > 0) = Xij*β + Vi + εij,    (2.2)

where Xij* is a 1×p explanatory variable vector, β is a p×1 regression coefficient vector, and Vi is again a subject-level random intercept. The error term εij is assumed to be distributed as N(0,σe2). Note that this 2-part mixed model can be extended to include additional random effects. For simplicity, we restrict attention here to 2-part mixed models with random intercepts; extensions to models with random slopes will be discussed in Section 3.2.

An important assumption is that the random intercepts, (Ui,Vi), are jointly normal and possibly correlated,

(Ui, Vi)′ ∼ N(0, Σ),   Σ = [σu2, ρσuσv; ρσuσv, σv2].    (2.3)

In the context of the HAQ analysis introduced in Section 1.1, for example, the correlation aspect of this assumption can be interpreted as the presence or absence of disability at one occasion being related to the level of disability, if any, at that and other occasions.
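For concreteness, the following sketch (in Python, with illustrative parameter values that are our own assumptions rather than those of the HAQ analysis) generates data from the 2-part mixed model (2.1)-(2.3), taking g = log:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (assumed) parameter values, not those of the HAQ analysis
N, n_i = 250, 4                      # subjects, occasions per subject
theta = np.array([-1.0, 0.5])        # binary part: intercept, time effect
beta = np.array([0.3, 0.05])         # continuous part: intercept, time effect
sigma_u, sigma_v, rho, sigma_e = 2.0, 0.4, 0.9, 0.28

# Correlated subject-level random intercepts (U_i, V_i), equation (2.3)
cov = np.array([[sigma_u**2, rho * sigma_u * sigma_v],
                [rho * sigma_u * sigma_v, sigma_v**2]])
U, V = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

t = np.tile(np.arange(n_i), (N, 1))           # time covariate t_ij
X = np.stack([np.ones_like(t), t], axis=-1)   # design rows (1, t_ij) for both parts

# Binary part (2.1): occurrence indicator Z_ij
eta = X @ theta + U[:, None]
Z = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Continuous part (2.2): intensity g(Y_ij) = log Y_ij, used only when Z_ij = 1
logY = X @ beta + V[:, None] + rng.normal(0.0, sigma_e, size=(N, n_i))
Y = np.where(Z == 1, np.exp(logY), 0.0)       # semicontinuous outcome
```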

In this model, the explanatory variable vectors Xij, Xij* may coincide, but this is not required. The data can be unbalanced by design or due to ignorable missingness. The primary targets of inference are the regression coefficients θ and β, while variance components, including the correlation parameter ρ, are usually treated as nuisance parameters.

2.2Model fitting

Generally, the estimation of θ, β, σu2, σv2, ρ, and σe2 is based on maximization of the likelihood

L(θ, β, σu2, σv2, ρ, σe2) = ∏i=1N ∫∫ [∏j=1ni πijZij(1 − πij)1−Zij] [∏j:Yij>0 σe−1φ{(g(Yij) − Xij*β − Vi)/σe}] φ2(Ui, Vi; Σ) dUi dVi,    (2.4)

where πij = Pr(Zij = 1∣Ui) is given by (2.1), φ is the standard normal density, and φ2(·,·;Σ) is the bivariate normal density in (2.3). Maximization of (2.4) presents the same computational challenges as with generalized linear mixed models (GLMM) Stiratelli and others (1984), Breslow and Clayton (1993), Wolfinger and O'Connell (1993). Olsen and Schafer (2001) proposed an approximate Fisher scoring procedure based on high-order Laplace approximations for obtaining maximum likelihood estimates. Tooze and others (2002) used quasi-Newton optimization of the likelihood approximated by adaptive Gaussian quadrature and implemented it in the SAS PROC NLMIXED procedure. In the simulations and HAQ analysis in Sections 4 and 5, we use the same estimation procedure (SAS, 9.1) as in Tooze and others (2002).
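We do not reproduce the adaptive quadrature of NLMIXED. Purely as an illustration of the integral in (2.4) as reconstructed above (and assuming g = log is known), the following sketch evaluates one subject's log-likelihood contribution with non-adaptive Gauss-Hermite quadrature:

```python
import numpy as np

def loglik_subject(z, logy, X, Xstar, theta, beta,
                   sigma_u, sigma_v, rho, sigma_e, n_quad=20):
    """Gauss-Hermite approximation of one subject's contribution to (2.4).

    z     : (n_i,) occurrence indicators Z_ij
    logy  : (n_i,) values of g(Y_ij); entries at occasions with z == 0 are ignored
    X, Xstar : (n_i, q) and (n_i, p) design matrices of the two parts
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    # Map standard Gauss-Hermite nodes to the correlated pair (U_i, V_i) via the
    # Cholesky factor of the covariance matrix in (2.3).
    L = np.linalg.cholesky(np.array([[sigma_u**2, rho * sigma_u * sigma_v],
                                     [rho * sigma_u * sigma_v, sigma_v**2]]))
    a, b = np.meshgrid(nodes, nodes, indexing="ij")
    UV = np.sqrt(2.0) * np.stack([a.ravel(), b.ravel()])   # 2 x n_quad^2
    U, V = L @ UV
    w = np.outer(weights, weights).ravel() / np.pi          # quadrature weights

    lik = 0.0
    for u, v, wk in zip(U, V, w):
        p = 1.0 / (1.0 + np.exp(-(X @ theta + u)))           # Pr(Z_ij = 1 | U_i)
        f_bin = np.prod(np.where(z == 1, p, 1.0 - p))
        resid = (logy - (Xstar @ beta + v)) / sigma_e
        dens = np.exp(-0.5 * resid**2) / (np.sqrt(2.0 * np.pi) * sigma_e)
        f_cont = np.prod(np.where(z == 1, dens, 1.0))
        lik += wk * f_bin * f_cont
    return np.log(lik)
```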

2.3Potential bias in 2-part mixed models

In practice, the multidimensional integration that is necessary to obtain the likelihood in (2.4) induces difficulties in fitting 2-part mixed models. In our HAQ analysis, we found that, even with properly standardized explanatory variables and the simplest model with 2 correlated random intercepts, it can take several hours to fit using the SAS NLMIXED procedure (1.5-GHz CPU, 1-Gb RAM, and SUN workstation). This is probably linked to the number of explanatory variables included in the model and the amount of data available for analysis. As a result, it may be impractical to conduct model assessment and selection procedures when a number of potentially important explanatory variables are available. However, if we assume independence between random effects, the likelihood components for the binary and continuous parts become separable Tooze and others (2002) and maximization of the likelihood is computationally much simpler and faster.

Nevertheless, as noted earlier, if the random effects are correlated, there is an informative cluster size aspect to the data structure since parameters in the binary part influence the number of observations in the continuous part of the model. Essentially, with a positive correlation, subjects with larger random effects Vi will have more observations contributing to estimation of the continuous part of the model; there will be an overrepresentation of larger values in this part of the data. Since we assume that E(Vi) = 0, an incorrect assumption of independence between random intercepts and the consequent analysis of the continuous part of the data separately from the binary part will produce positive bias in estimating the intercept term in β. The impact on estimation of other elements in β will depend on θ, σu2, σv2, ρ, σe2, and the true value for β.

This scenario parallels the nonignorable missingness problem characterized in a class of “shared parameter models” Wu and Carroll (1988), Wu and Bailey (1989), Henderson and others (2000), Saha and Jones (2005). The model for the binary part in semicontinuous data corresponds to the logistic random effects model for missing indicators in shared parameter models, and the continuous part is similar to the partly unobserved outcome data modeled (typically) by linear mixed models. Underlying random effects in the shared parameter models link the models for missing indicators and outcomes, while in our case, the shared parameters are exactly those controlling correlated random intercepts (Ui,Vi) in (2.3). The only difference between these 2 scenarios is that in 2-part mixed models, both θ and β are primary targets of inference, whereas in shared parameter models only β in the outcome model is of interest.

For shared parameter models, Saha and Jones (2005) provided a useful procedure to quantify the asymptotic bias for estimating regression parameters in the outcome model when missingness is nonignorable and the missing data mechanism is not modeled jointly. Following Saha and Jones (2005), we can derive the asymptotic bias (as N goes to infinity) for estimating β in 2-part mixed models when the correlation ρ is nonzero but ignored (i.e. set to be zero) in estimation. We adopt the following notation:

(A) ni = J, the fixed number of observations within subjects;

(B) Xij = Xij* = (1,tij,Gi,Gitij) such that the explanatory variable vectors Xij and Xij* both follow a group-by-time design and Gi∈{0,1} is a group membership indicator;

Further, for illustration, we assume that subjects have equal probability of being in the 2 groups, in other words, Pr(Gi = g) = 1/2 (g = 0,1), and that variance component parameters σu2, σv2, ρ, and σe2 are known. It follows by equation (12) in Saha and Jones (2005) that the separate maximization of the likelihood for the continuous part (ρ = 0) will give estimates of β converging to a limit β*:

(2.5)

Therefore, the absolute asymptotic bias of this estimation procedure is β* − β, which is a function of θ and σu2, σv2, ρ, and σe2. Because we assume that the continuous part of the model is specified by a linear mixed model and the variance components are known, the asymptotic bias derived here is independent of the true value of β. In practice, variance components also need to be estimated, and the asymptotic bias for estimating β in misspecified 2-part mixed models will depend on the true value of β. In that case, iterative methods are necessary to evaluate the asymptotic bias, as no analytical expression is available Saha and Jones (2005).

To compute (2.5), we need to evaluate Pr(Mi = m∣Gi = g) and E(Vi∣Mi = m,Gi = g), where Mi denotes the number of nonzero observations for subject i. These can be shown to be

Pr(Mi = m∣Gi = g) = ∫ Pr(Mi = m∣Ui = u, Gi = g) φ(u; 0, σu2) du    (2.6)

and

E(Vi∣Mi = m, Gi = g) = ρ(σv/σu) ∫ u Pr(Mi = m∣Ui = u, Gi = g) φ(u; 0, σu2) du / Pr(Mi = m∣Gi = g),    (2.7)

where Pr(Mi = m∣Ui = u, Gi = g) follows from the J conditionally independent Bernoulli indicators Zij with success probabilities given by (2.1) and φ(·; 0, σu2) is the N(0,σu2) density. The integrals in (2.6) and (2.7) are analytically intractable. In Section 3.1, we use a 30-point Gaussian quadrature Stroud and Secrest (1966) to evaluate them.
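As a hedged illustration of this quadrature (a sketch of ours, not the authors' code), the function below evaluates Pr(Mi = m∣Gi = g) and E(Vi∣Mi = m,Gi = g) for one binary-part design matrix, using the fact that E(Vi∣Ui) = ρ(σv/σu)Ui under (2.3) and treating Mi as the sum of independent Bernoulli indicators given Ui:

```python
import numpy as np

def cluster_size_moments(theta, sigma_u, sigma_v, rho, X, m, n_quad=30):
    """Gauss-Hermite evaluation of Pr(M_i = m) and E(V_i | M_i = m) for one subject.

    X : (J, q) design matrix of the binary part (already including the group indicator)
    m : cluster size of interest (number of occasions with Z_ij = 1)
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    u_vals = np.sqrt(2.0) * sigma_u * nodes          # U_i ~ N(0, sigma_u^2)
    w = weights / np.sqrt(np.pi)

    num_p, num_ev = 0.0, 0.0
    for u, wk in zip(u_vals, w):
        p = 1.0 / (1.0 + np.exp(-(X @ theta + u)))   # per-occasion Pr(Z_ij = 1 | u)
        # Poisson-binomial recursion for Pr(sum of Z_ij = m | u)
        dist = np.zeros(len(p) + 1)
        dist[0] = 1.0
        for pj in p:
            dist[1:] = dist[1:] * (1.0 - pj) + dist[:-1] * pj
            dist[0] *= (1.0 - pj)
        pm = dist[m]
        num_p += wk * pm
        num_ev += wk * pm * rho * (sigma_v / sigma_u) * u   # uses E(V | U) = rho*(sv/su)*U
    return num_p, num_ev / num_p
```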

2.4Artificial zeros

In practice, zero values from observed data can be a mixture of true zeros and artificial zeros due to left censoring. Berk and Lachenbruch (2002) discussed 2-part mixed models for dealing with this type of data. Specifically, following the notation in Section 2.1 and assuming that there is a detection limit d for the continuous part, the likelihood for the 2-part mixed model with additional artificial zeros is

L = ∏i ∫∫ ∏j:Yij=0 [(1 − πij) + πij F{g(d)}] ∏j:Yij>0 πij σe−1φ{(g(Yij) − Xij*β − Vi)/σe} φ2(Ui, Vi; Σ) dUi dVi,    (2.8)

where πij = Pr(Zij = 1∣Ui) as in (2.1) and F is the cumulative distribution function for g(Yij)∣Yij > 0.
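Under this reading of (2.8), in which an observed zero is either a true zero or a positive value falling below the detection limit d, the per-occasion factor inside the integral can be sketched as follows (an illustration of ours, not the formulation of Berk and Lachenbruch, 2002):

```python
import numpy as np
from scipy.stats import norm

def occasion_contribution(y, x, xstar, theta, beta, u, v, sigma_e, d, g=np.log):
    """Per-occasion likelihood factor inside the integral of (2.8),
    given random intercepts (u, v) and detection limit d; g assumed to be log."""
    p = 1.0 / (1.0 + np.exp(-(x @ theta + u)))          # Pr(Z_ij = 1 | u)
    if y > 0:                                            # observed positive value
        return p * norm.pdf(g(y), loc=xstar @ beta + v, scale=sigma_e)
    # observed zero: a true zero, or a positive value censored below d
    return (1.0 - p) + p * norm.cdf(g(d), loc=xstar @ beta + v, scale=sigma_e)
```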

The same argument for potential bias as before can be applied to this 2-part mixed model with artificial zeros when the correlation between random intercepts is ignored. However, the derivation of asymptotic bias in Section 2.3 is no longer directly applicable. In Section 4, we will investigate bias using Monte Carlo simulations. It should be noted that, unlike for the model with true zeros only, there is minimal computational gain from assuming independence between random effects here, because in this case the likelihood contributions for the binary and continuous parts cannot be disentangled and higher-dimensional numerical integration is necessary for maximum likelihood estimation.

3.FACTORS INFLUENCING THE ASYMPTOTIC BIAS

3.1Two-part mixed model with random intercepts only

In this section, we quantify the asymptotic bias in the estimation of β in the misspecified 2-part mixed models with random intercepts only, assuming that all variance component parameters are known. Let tij = 0,1 denote the 2 measurement times for each subject and Gi = 0,1 denote a treatment indicator. We assume that subjects are equally likely to be assigned to the 2 groups and that

logit{Pr(Zij = 1∣Ui)} = θ0 + θ1tij + θ2Gi + Ui,   [log(Yij)∣Yij > 0] ∼ N(β0 + β1tij + β2Gi + Vi, σe2).

Recall that in (2.5), the asymptotic bias for estimating β depends on θ (or equivalently, the proportion of nonzero values for a typical subject in the subject groups), the correlation parameter ρ, the between-subject variability of occurrence variables σu2, the between-subject variability of nonzero values σv2, and the error variance of nonzero values σe2. Given that the variance components are fixed in this specific scenario, the bias for β is independent of the true value of β.

For simplicity, we fix θ1 = − 1 and θ2 = log(2). Also, we fix σe2 = 0.08 based on the HAQ analysis reported in Section 5. We then investigate how the asymptotic bias varies as a function of θ0, σu2, σv2, and the correlation parameter ρ.

Figure 2 presents the contour plots of absolute asymptotic bias in estimation of the intercept term β0 by σu2 and the intraclass correlation ψ = σv2/(σv2 + σe2) at different combinations of (θ0,ρ). The axes for σu2 and ψ are centered at 4 and 0.4, respectively, based on the HAQ analysis reported in Section 5. It is apparent from Figure 2 that β0 is overestimated and the magnitude of the bias is positively related to ρ, σu2, and σv2 (or equivalently ψ). On the other hand, as θ0 (the proportion of nonzero values in a control subject) increases, the bias in the estimation of β0 decreases.

We also investigated absolute asymptotic bias in estimating the time effect β1 and treatment effect β2. A positive bias for β1 and a negative bias for β2 are observed, but the magnitudes of both biases are much smaller than for β0. Details are given in Section 1.1 of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

3.2Two-part mixed model with random intercept and slope

As pointed out by a referee, it may be of interest to go beyond the simple 2-part model with random intercepts only and investigate the extended model where a random slope for time is included in the continuous part. Following the notation in Section 2.3, we now assume that

g(Yij)∣(Yij > 0) = Xij*β + V0i + V1itij + εij,

where Xij* = (1,tij,Gi,Gitij), and V0i and V1i are the random intercept and random time slope, respectively. Similarly to (2.3), we assume that the random intercepts and the additional random slope follow a trivariate normal distribution,

(Ui, V0i, V1i)′ ∼ N(0, Σ3).    (3.1)

In our HAQ example, the correlation ρ1 under this assumption can be interpreted as the presence or absence of disability at one occasion being related to the rate of change in the disability level over time. For example, we would expect that patients who usually report no disability are unlikely to have large changes in the disability level when any disability is actually reported.

We use the same data structure as in Section 3.1 except that [log(Yij)∣Yij > 0]∼N(β0 + β1tij + β2Gi + V0i + V1itij,σe2), and the random intercepts and slope (Ui,V0i,V1i) follow a trivariate normal distribution as in (3.1). We draw contour plots similar to those in Section 3.1 to examine the asymptotic bias for the intercept term β0, the time effect β1, and the treatment effect β2. We find that there are large positive biases for β0 and β1 and a smaller negative bias for β2 when the positive correlations increase and θ0 decreases. Details are given in Section 1.2 of the supplementary material available at Biostatistics online.

4.MONTE CARLO SIMULATION

Simulation studies were done to investigate the performance of different 2-part mixed models in practice. For semicontinuous data with true zeros only, biases of different magnitude are observed for the regression coefficients in the continuous part of the model when the positive correlation of the random effects is ignored. In addition, the variance component in the continuous part is underestimated. For the data with additional artificial zeros, we observe biases for regression coefficients and variance components in both the binary and the continuous parts when the correlation is set to zero. Details are given in Section 2 of the supplementary material available at Biostatistics online.

5.ANALYSIS OF THE HAQ DATA

The HAQ data described in Section 1.1 can be modeled using a 2-part mixed model. The random-intercept logistic model (2.1) is used to model a binary indicator of a nonzero HAQ score, and the random-intercept linear mixed model (2.2) is used for nonzero HAQ scores. For the linear mixed model, residual plots suggest a symmetric error distribution. Thus, no transformation is applied to the nonzero HAQ scores and the results are therefore comparable to those in Husted and others (2007), where these data were modeled with an assumption of independent random intercepts. We refit this simple model and term it the “misspecified model.”

The same set of explanatory variables is included in both model parts, but the coefficients are allowed to differ. These include age at onset of PsA (standardized), sex, PsA disease duration in years, total number of actively inflamed joints, total number of clinically damaged joints, psoriasis area and severity index (PASI) score (standardized), morning stiffness (coded as either present or absent), standardized erythrocyte sedimentation rate (ESR), and highest medication level ever used prior to a visit, grouped based on a medication pyramid Gladman and others (1995), Munro and others (1998). Since there is particular interest in differential effects of both the number of actively inflamed joints and the number of clinically deformed joints on physical functioning over PsA duration, interaction terms for PsA duration with both variables are included in the model.

Prior to formal model fitting, an empirical check casts doubt on the assumption of independent random effects. When the empirical Bayes estimates of the random intercepts in the binary part are introduced as an additional explanatory variable in the linear mixed model for the continuous part, the associated coefficient is significantly positive (p < 0.001). Thus, we also fit a 2-part mixed model with correlated random intercepts (referred to as the “full model”). For estimation, the SAS NLMIXED procedure was used with the maximum number of adaptive Gaussian quadrature points in the quasi-Newton algorithm held at 31. The results are given in Tables 1 and 2.

Table 1.

Parameter estimates in the binary part of the model for the HAQ data

| Parameters | Misspecified model, Estimate (SE) | p | Full model, Estimate (SE) | p | Latent process model, Estimate (SE) | p |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -1.0199 (0.4079) | 0.0129 | -1.0015 (0.3746) | 0.0078 | -0.9909 (0.3556) | 0.0056 |
| Age at onset of PsA | 0.6031 (0.1743) | 0.0006 | 0.6266 (0.1611) | 0.0001 | 0.6392 (0.1538) | < 0.0001 |
| Sex: Female (reference: Male) | 1.9944 (0.3603) | < 0.0001 | 2.0080 (0.3276) | < 0.0001 | 2.0037 (0.3149) | < 0.0001 |
| PsA disease duration | -0.0027 (0.0259) | 0.9169 | 0.0156 (0.0232) | 0.5027 | 0.0166 (0.0220) | 0.4501 |
| Actively inflamed joints | 0.1758 (0.0513) | 0.0007 | 0.1566 (0.0495) | 0.0017 | 0.1380 (0.0465) | 0.0032 |
| Clinically deformed joints | -0.0161 (0.0321) | 0.6165 | 0.0120 (0.0260) | 0.6441 | 0.0179 (0.0238) | 0.4531 |
| PASI score | 0.1941 (0.1257) | 0.1233 | 0.1754 (0.1086) | 0.1071 | 0.1543 (0.1017) | 0.1299 |
| Morning stiffness: Yes (reference: No) | 1.5953 (0.2319) | < 0.0001 | 1.5777 (0.2112) | < 0.0001 | 1.5691 (0.2018) | < 0.0001 |
| ESR | 0.3030 (0.1310) | 0.0213 | 0.2988 (0.1164) | 0.0106 | 0.2971 (0.1103) | 0.0074 |
| Medications: NSAIDs (reference: None) | 0.2998 (0.2743) | 0.2751 | 0.2955 (0.2529) | 0.2435 | 0.2960 (0.2439) | 0.2257 |
| Medications: DMARDs | 0.3074 (0.2508) | 0.2211 | 0.3100 (0.2295) | 0.1776 | 0.3138 (0.2197) | 0.1541 |
| Medications: Steroids | 0.9945 (0.4698) | 0.0350 | 0.9946 (0.4458) | 0.0263 | 0.9927 (0.4355) | 0.0232 |
| Interaction of actively inflamed joints with arthritis duration | 0.0002 (0.0034) | 0.9502 | -0.0003 (0.0033) | 0.9403 | 0.0003 (0.0031) | 0.9300 |
| Interaction of clinical deformed joints with arthritis duration | 0.0032 (0.0016) | 0.0442 | 0.0022 (0.0013) | 0.0844 | 0.0018 (0.0011) | 0.1102 |
| σu2 | 4.2519 (0.8549) | < 0.0001 | 4.3930 (0.8924) | < 0.0001 | 4.2641 (0.9001) | < 0.0001 |
| ρ | (ρ = 0) |  | 0.9423 (0.0373) | < 0.0001 | (ρ = 1) |  |

SE, standard error.

Table 2.

Parameter estimates in the continuous part of the model for the HAQ data

| Parameters | Misspecified model, Estimate (SE) | p | Full model, Estimate (SE) | p | Latent process model, Estimate (SE) | p |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 0.3176 (0.0567) | < 0.0001 | 0.2149 (0.0556) | 0.0001 | 0.1748 (0.0555) | 0.0018 |
| Age at onset of PsA | 0.1011 (0.0242) | < 0.0001 | 0.1009 (0.0245) | < 0.0001 | 0.0984 (0.0250) | 0.0001 |
| Sex: Female (reference: Male) | 0.1811 (0.0505) | 0.0004 | 0.2225 (0.0512) | < 0.0001 | 0.2461 (0.0523) | < 0.0001 |
| PsA disease duration | 0.0039 (0.0033) | 0.2272 | 0.0035 (0.0032) | 0.2726 | 0.0044 (0.0032) | 0.1719 |
| Actively inflamed joints | 0.0219 (0.0028) | < 0.0001 | 0.0239 (0.0027) | < 0.0001 | 0.0243 (0.0027) | < 0.0001 |
| Clinically deformed joints | 0.0058 (0.0031) | 0.0627 | 0.0052 (0.0031) | 0.0957 | 0.0051 (0.0031) | 0.1034 |
| PASI score | 0.0128 (0.0140) | 0.3636 | 0.0247 (0.0134) | 0.0667 | 0.0257 (0.0134) | 0.0553 |
| Morning stiffness: Yes (reference: No) | 0.1502 (0.0274) | < 0.0001 | 0.1573 (0.0263) | < 0.0001 | 0.1620 (0.0262) | < 0.0001 |
| ESR | 0.0395 (0.0132) | 0.0028 | 0.0388 (0.0127) | 0.0024 | 0.0374 (0.0126) | 0.0033 |
| Medications: NSAIDs (reference: None) | -0.0240 (0.0289) | 0.4065 | -0.0177 (0.0281) | 0.5288 | -0.0181 (0.0280) | 0.5194 |
| Medications: DMARDs | 0.0224 (0.0280) | 0.4252 | 0.0235 (0.0272) | 0.3889 | 0.0226 (0.0272) | 0.4064 |
| Medications: Steroids | 0.0457 (0.0453) | 0.3135 | 0.0493 (0.0441) | 0.2641 | 0.0481 (0.0441) | 0.2761 |
| Interaction of actively inflamed joints with arthritis duration | -0.0004 (0.0002) | 0.0290 | -0.0004 (0.0002) | 0.0072 | -0.0005 (0.0002) | 0.0042 |
| Interaction of clinical deformed joints with arthritis duration | 0.0002 (0.0001) | 0.1122 | 0.0003 (0.0001) | 0.0330 | 0.0003 (0.0001) | 0.0351 |
| σv2 | 0.1587 (0.0154) | < 0.0001 | 0.1732 (0.0166) | < 0.0001 | — | — |
| σv/σu | — | — | — | — | 0.2074 (0.0210) | < 0.0001 |
| σe2 | 0.0785 (0.0040) | < 0.0001 | 0.0774 (0.0039) | < 0.0001 | 0.0779 (0.0039) | < 0.0001 |
| ρ | (ρ = 0) |  | 0.9423 (0.0373) | < 0.0001 | (ρ = 1) |  |
| -2 log-likelihood (both parts) | 2116.0 |  | 2018.1 |  | 2022.2 |  |
| AIC | 2178.0 |  | 2082.1 |  | 2084.2 |  |

SE, standard error.

As shown in Table 1, the estimated coefficients in the binary part are approximately the same in the full and the misspecified models and identify the same explanatory variables as being associated with functional difficulty. There is no evidence of a differential effect of actively inflamed joints on functional difficulty over PsA duration, but there is some evidence that the effect of deformed joints increases with disease duration. The parameter estimates for the random-intercept distribution in the binary part are also similar.

The estimated correlation between the random intercepts of the 2 parts of the full model is positive and close to one (ρ̂ = 0.9423, standard error 0.0373). This large estimate suggests that there might be a single unmeasured latent process which influences the 2 processes of the mixed model, corresponding to perfectly correlated random intercepts. Therefore, we also fit a 2-part model in which the correlated random intercepts satisfy Vi = αUi and σv2 = α2σu2 and refer to this model as the "latent process model." A similar approach is implemented in the Mplus software Brown and others (2005), Muthén and Muthén (1998–2007). The estimates from the binary part of this model are listed in the last 2 columns of Table 1 and are similar to those from the other 2 models.

As expected, the misspecified model overestimates the intercept term and underestimates the time-invariant sex effect in the continuous part (Table 2). For other time-varying explanatory variables, the estimates are approximately the same except that the coefficients for PASI score and the interaction between clinically deformed joints and PsA duration are larger in the full model, with correspondingly smaller p-values. The random-intercept variance of the continuous part in the misspecified model is underestimated and error variance estimates are similar, consistent with our simulation results. Thus, the qualitative conclusions do not change across models. In particular, the positive effects of actively inflamed joints and clinically deformed joints differ over PsA duration: the effect of the former decreases while the effect of the latter increases over time.

The deviance and Akaike information criterion (AIC) values in Table 2 indicate that the full model and the latent process model provide a better fit to the data than the misspecified model. A likelihood ratio test of the hypothesis of zero correlation gives a p-value less than 0.001.

6.REMARKS ON VARIANCE COMPONENT ESTIMATION IN 2-PART MIXED MODELS

In preliminary analysis, we observed that, with some important explanatory variables omitted (e.g. age at onset of PsA, sex, and ESR) in the binary part of the model, estimation of the random-intercept variance σu2 becomes unstable. For example, its point estimate can increase from 6.9 in a misspecified model (ρ = 0) to 10.8 in a full model with estimated correlation ρ close to one. As a result, estimates of the subject-specific regression coefficients θ are inflated in the full model. However, the corresponding standard error estimate of σu2 also increases, and ratio-based statistics are approximately the same in both models. This behavior was not evident in our simulation results. We suspect that the reason for this instability is that the unaccounted variability represented by the variance component is large, so that the likelihood surface is too flat for the estimation procedure to locate the maximum reliably. This can be investigated further through examination of the profile likelihood for σu2 under scenarios where σu2 is large.

We simulated data with N = 250 subjects, with ni = 2, from the same logistic-lognormal mixture distribution as in (2.1) and (2.2) of the supplementary material available at Biostatistics online. The true values for the parameters were set to θ = (3,0,0,0) (or θ = (0,0,0,0)), β = (0.5,0,0,0), σu2 = 4.5 (or σu2 = 10.5), σv2 = 0.2, σe2 = 0.08, and ρ = 0.9. In obtaining the profile likelihood for σu2 and ρ, we fixed σv2 and σe2 at their true values and let θ and β be estimated.

Figure 3 presents the contour plots of the profile likelihood (in terms of the deviance) for σu2 and ρ from 4 simulated data sets. The top-left panel in Figure 3 displays flat profile likelihoods for σu2 at different levels of ρ when the true between-subject heterogeneity is large (σu2 = 10.5) and the proportion of zeros in the data is small (θ0 = 3). The black dots, which are the corresponding restricted maximum likelihood estimates for σu2, show an increasing trend as ρ increases. With σu2 = 10.5 still, but the proportion of zeros now increased (θ0 = 0), the profile likelihood surface shows slightly more curvature. The situation improves further when the true variance decreases to σu2 = 4.5, but restricted maximum likelihood estimates for σu2 when θ0 = 3 still vary considerably. In contrast, with θ0 = 0, the likelihood appears to be well behaved and estimates for σu2 are relatively constant. Therefore, the sparseness of the occurrence indicator data also impacts on variance component estimation in the binary part of the mixed model.

Fig. 3.

Contour plots of the profile likelihood (in terms of the deviance) for the occurrence random-intercept variance σu2 and correlation ρ from 4 simulated data sets (N = 250) with different combinations of true values for σu2 and θ0; other variance components are fixed at their true values σv2 = 0.20 and σe2 = 0.08; the true value for β0 is set as β0 = 0.5; the black dots are maximum likelihood estimates of σu2 at different values of ρ

These results help to explain the instability observed in our preliminary analyses. With important explanatory variables omitted from the binary part, the unexplained variability in the indicator of a positive HAQ score was unduly large, estimation of σu2 was unstable, and point estimates and standard errors changed as the correlation ρ increased. Consequently, the estimates for the subject-specific regression coefficients θ differed across the models. With a reasonable set of important explanatory variables in the final HAQ analysis, the estimates for both σu2 and θ were more stable.

In summary, careful modeling of mean relationships is necessary to avoid unstable estimation of variance components and subject-specific regression coefficients when fitting 2-part mixed models. When the number of zeros in longitudinal semicontinuous data is small, caution is advised in fitting 2-part mixed models. Simpler alternatives, such as standard regression methods for the marginal distribution of outcomes, either truncated or bounded, should be considered.

7.DISCUSSION

For 2-part mixed modeling of longitudinal semicontinuous data, with true zeros only or with additional artificial zeros due to left censoring, an incorrect assumption of independence between random effects can induce bias in the estimation of regression coefficients and variance components in the continuous part of the model. This arises due to differential representation of nonzero values in the continuous part of the data. For illustration, we examined linear mixed models for the continuous part of the model, but the same issues apply to other GLMM. Model fitting with correlated random effects is computationally expensive, and the availability of more efficient software would therefore be welcome.

As pointed out by an associate editor, the extreme computing time experienced in the HAQ analysis might be alleviated by adopting a marginal approach for a 2-part model. As shown in Section 6, variance component estimation in the binary part can be unstable when the unexplained variability is large. Computing time can be considerable due to the difficulty of locating the maximum of a flat likelihood surface. In this case, we may choose marginal 2-part models such as in Moulton and others (2002) and Hall and Zhang (2004) rather than the mixed model approach. However, we emphasize that for marginal 2-part models of longitudinal or even cross-sectional semicontinuous data, bias can also be induced if important explanatory variables determining both the binary process and the process of nonzero values are excluded in the model for the continuous part. These important explanatory variables in marginal models are similar to the unmeasured explanatory variables represented by correlated random effects in mixed models. Therefore, the same problem of differential representation of nonzero values in the continuous part can arise even when these omitted explanatory variables are independent of other included explanatory variables in the continuous part. Thus, when building a model for mean structures in these marginal models, any important explanatory variables in the binary part should be included in the continuous part, at least initially, to reduce the possibility of bias.

The HAQ data analysis presented in this article is primarily illustrative. Alternative models might be preferred. The normality assumption of random intercepts was examined using empirical Bayes estimates. However, as with shared parameter models Tsonaka and others (2008), diagnostic checks based on empirical Bayes estimates are unreliable due to shrinkage (Verbeke and Molenberghs, 2001, Section 7.8). In practice, investigators might be only interested in the continuous part of the data and thus fit regression models ignoring the zeros. The bias illustrated in this article is then still present due to the differential representation of nonzero values across patients. The change of the primary inference target from (β,θ) to β does not solve the problem.

FUNDING

Funding to pay the Open Access publication charges for this article was provided by Medical Research Council (UK) (U.1052.00.009).


The authors thank Dafna Gladman, Janice Husted, Patty Solomon, the referees, associate editor, and editor for helpful comments and patients in the University of Toronto Psoriatic Arthritis Clinic.


Classification studies with high-dimensional measurements and relatively small sample sizes are increasingly common. Prospective analysis of the role of sample sizes in the performance of such studies is important for study design and interpretation of results, but the complexity of typical pattern discovery methods makes this problem challenging. The approach developed here combines Monte Carlo methods and new approximations for linear discriminant analysis, assuming multivariate normal distributions. Monte Carlo methods are used to sample the distribution of which features are selected for a classifier and the mean and variance of features given that they are selected. Given selected features, the linear discriminant problem involves different distributions of training data and generalization data, for which 2 approximations are compared: one based on Taylor series approximation of the generalization error and the other on approximating the discriminant scores as normally distributed. Combining the Monte Carlo and approximation approaches to different aspects of the problem allows efficient estimation of expected generalization error without full simulations of the entire sampling and analysis process. To evaluate the method and investigate realistic study design questions, full simulations are used to ask how validation error rate depends on the strength and number of informative features, the number of noninformative features, the sample size, and the number of features allowed into the pattern. Both approximation methods perform well for most cases but only the normal discriminant score approximation performs well for cases of very many weakly informative or uninformative dimensions. The simulated cases show that many realistic study designs will typically estimate substantially suboptimal patterns and may have low probability of statistically significant validation results.

Recent years have seen an explosion of work on classification problems where the number of measured features per sample is vastly greater than the number of samples. For biological classification problems, such data arise from genomic DNA microarrays and proteomic mass spectrometry assays, from which investigators try to classify disease categories, tumor types, response to drugs, or other categories (Ludwig and Weinstein, 2005). Most of the efforts in method development have appropriately focused on what to do with real data sets (Wang and Shen, 2006, Adam and others, 2002). Generally speaking, various methods must select features (sometimes called biomarkers) to be used for classification and estimate a classifier without over-fitting to the many available data dimensions.

Because of the complexity of the algorithms involved, it is not straightforward to answer questions about study design. For example, if there are 10 informative and 5000 noninformative features and the best possible classification error rate is 5%, how many samples are necessary to have an 80% chance of estimating a classifier with less than 10% error rate for independent validation samples? Or, how many samples are necessary so that with probability 95%, the estimated classifier will perform statistically significantly better than a 50% error rate for independent validation samples, that is, conclude the study has at least found something nonrandom? Investigators planning studies have access to sound statistical principles but few specifics to serve as guideposts in evaluating sample sizes relative to hypothesized outcomes. Analysis of study design for high-dimensional classification studies has been identified as an important problem for genomics and proteomics because significant resources are required to execute such studies (Dobbin and Simon, 2007, Allison and others, 2006, Pusztai and Hess, 2004, Hwang and others, 2002).

Issues of sample size for genomic and proteomic pattern discovery studies are potentially quite important. Over 60 proteomics discovery studies have been published in recent years (Coombes and others, 2005, Baker, 2005). Many have sample sizes in the approximately 10–20 range; some notable cases with higher sample sizes (e.g. Adam and others, 2002, Petricoin, Ardekani, and others, 2002, Petricoin, Ornstein, and others, 2002, Zhang and others, 2004, Rogers and others, 2003) reveal that in broad terms, sample sizes of ∼50 per group are rare and of ∼100 per group are very rare. Implicit in some rationales for biomarker discovery studies is the possibility that multiple, individually weak biomarkers could combine to form a collectively strong diagnostic pattern. The observation that discovery studies often find nonspecific markers (Baker, 2005) also suggests that disease-specific patterns may require multiple, individually weak biomarkers. Detecting patterns of multiple weak biomarkers amid many noninformative data dimensions may require substantially greater sample sizes than detecting individually strong biomarkers.

In proteomics, early biomarker discovery and validation studies (Petricoin, Ardekani, and others, 2002, Petricoin, Ornstein, and others, 2002, Petricoin and Liotta, 2003, Rogers and others, 2003, Adam and others, 2002, Li and others, 2002, Adam and others, 2001) led to renewed attention toward potential pitfalls of design and analysis methods. These include low discovery and validation sample sizes, uncertainty about data preprocessing and statistical methods, low sample processing and measurement reproducibility within and between study sites, uncertainty about the biological nature and consistency of patterns, and lack of independent validation studies (Sorace and Zhan, 2003, Diamandis, 2004a, Diamandis, 2004b, Listgarten and Emili, 2005, Coombes and others, 2005, Ebert and others, 2006, Wilkins and others, 2006). Similar issues have been raised for genomic studies (e.g. Pusztai and Hess, 2004, Ludwig and Weinstein, 2005). Two important studies notable for their independent validation trials highlight the possibility—among many possible reasons for low validation success—that small sample sizes have been fundamentally limiting. Rogers and others (2003) saw sensitivity for renal cancer decline from ∼100% in discovery to ∼40% in validation, and Zhang and others (2004) saw specificity decline from ∼90% in discovery to ∼65% in validation.

For prospective analysis of pattern discovery study designs, purely simulation approaches quickly become cumbersome because there are many scenarios of interest, but purely analytical results are not easy to obtain. We take a middle road between simulations and approximations, with Monte Carlo methods for the feature-selection step and approximations for generalization error rates given each feature set. We use multivariate normal data and linear discriminant classification of features selected by univariate tests. While biologically simplistic, this framework captures the key impacts of both inaccurate feature selection and inaccurate classifier estimation. Related studies that use multivariate normal models include Pepe and others (2003), Hu and others (2005), Jung (2005), and Dobbin and Simon (2007), among others. Our approach gives order-of-magnitude faster estimation of generalization error compared to direct simulations, which are given for comparison. Both full simulation and simulation–approximation results are useful, but the latter can facilitate more practical exploration of study designs. Our approach also gives insight into which sources of variation are most important and suggests directions for future improvements.

We evaluate the simulation–approximation approach by comparing it to complete simulations that address meaningful study design questions (supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). We ask how validation error rate depends on the strength and number of informative features (and hence the minimum possible error rate), the number of noninformative features, the patient sample size, and the number of features allowed into the pattern. We find that typical sample sizes may perform poorly when there is a true pattern composed of many individually weak features. This result is not surprising based on general principles, but moving from principles to specific examples as guideposts is important for design of real studies.

We also give 2 approximations of the generalization (or test, or validation) error of a linear discriminant classifier when the training and validation samples do not follow the same distributions. The first is a delta approximation, from Taylor expansions of generalization error around the expected discriminant boundary. The second, and more successful, approximates the discriminant scores as normally distributed. Approximations of linear discriminant analysis with training and generalization samples from the same distributions have been reviewed by McLachlan (1992) and Wyman and others (1990). According to Wyman and others (1990) and Viollaz and others (1995), normal approximations of discriminant scores seem to be more accurate than other approaches, consistent with our results.

A related approach was given by Dobbin and Simon (2007), but ours appears to be more general and accurate (at the expense of being more computational). Theoretical bounds on generalization error from machine learning theory give another path of investigation (Hastie and others, 2001). For the related goal of identifying individually significant data dimensions (features), much study design work has built on feature-by-feature false discovery rate ideas (Benjamini and Hochberg, 1995, Storey, 2002, Efron, 2007). Feature-by-feature metrics of study design efficacy include the expected discovery rate (Gadbury and others, 2004), anticipated average power (Pounds and Cheng, 2005), expected number of false discoveries (Tsai and others, 2005), and probability of informative features ranking highly (Pepe and others, 2003). Numerous recent studies give methods for feature selection or estimation of generalization error given real data, as opposed to prospective study design (e.g. Mukherjee and others, 2003, Fu and others, 2005, Wang and Shen, 2006).

2.PROBLEM DEFINITION

Consider samples of size nj for each of J classes (j∈{1,…,J}), with each sample having M dimensions. By a high-dimensional classification problem, we mean M≫n, where n=∑j=1Jnj is the total sample size. For the training samples, from which the classifier will be estimated, let xij = (xij1,…,xijM) be the data vector for the ith sample of class j. Let Xj be all the data for class j and X be all the training data.

Let the number of dimensions of the data distributions that are truly informative (i.e. differ between classes) be MI and those that are truly uninformative be MU, with M = MI + MU. In the examples below, we will for simplicity use J = 2 and group means centered around 0 with all variances equal to 1. Let Δ be the vector of differences between class means for the informative dimensions, so the means from group 1 are (−0.5 Δ, 0MU) and the means from group 2 are (0.5 Δ, 0MU), where 0MU is a length MU vector of zeros. In this notation, a true pattern is defined by (Δ, MU) and a study design scenario is defined by (Δ, MU, n), where n = (n1, n2).
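As an illustration of this setup, the following sketch draws one training data set for a scenario (Δ, MU, n); the particular values of Δ, MU, and n used in the example call are our own assumptions, not those of the scenarios analyzed in Section 4:

```python
import numpy as np

def simulate_training_data(delta, m_u, n1, n2, rng):
    """Draw one training data set for a scenario (Delta, M_U, n).

    delta : (M_I,) vector of true mean differences for the informative dimensions
    m_u   : number of uninformative dimensions
    Group means are -0.5*delta and +0.5*delta on the informative dimensions and
    0 elsewhere; all variances are 1, as described in Section 2.
    """
    delta = np.asarray(delta, dtype=float)
    mean1 = np.concatenate([-0.5 * delta, np.zeros(m_u)])
    mean2 = np.concatenate([+0.5 * delta, np.zeros(m_u)])
    x1 = rng.normal(mean1, 1.0, size=(n1, mean1.size))   # class 1 samples
    x2 = rng.normal(mean2, 1.0, size=(n2, mean2.size))   # class 2 samples
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = simulate_training_data(delta=np.full(12, 0.8), m_u=2000, n1=50, n2=50, rng=rng)
```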

A classifier ψ(xG|X) predicts the class, j∈{1,…,J}, of a new (generalization or validation) sample xG based on the training data, X. The generalization sample comes from one of the same distributions (for its unknown class) as the training samples. Define the conditional generalization error for class j as the expected fraction of incorrect classifications for a new sample, xGj, from class j given a training sample X,

CGj(Δ, MU|X) = E[I{ψ(xGj|X) ≠ j}],    (2.1)

where the expectation is over xGj sampled from true distribution j and the indicator function I(·) is 1 if its argument is true and 0 otherwise.

Define the conditional generalization error across all classes as

CG(Δ, MU|X) = ∑j=1J P(j) CGj(Δ, MU|X),    (2.2)

where P(j) is the probability that a new sample is from class j.

The generalization error for a new sample from group j is the conditional generalization error averaged over training samples:

Gj(Δ, MU, n) = ET[CGj(Δ, MU|X)],    (2.3)

where ET denotes expectation over training samples, X, with sample sizes n. Finally, the overall generalization error is

G(Δ, MU, n) = ∑j=1J P(j) Gj(Δ, MU, n).    (2.4)

Given a generalization sample XG, with replicate data xGj from groups j=1,2, and a classification procedure ψ(xG|X), define the “pattern discovery power” as the expected probability of rejecting the null hypothesis that the predictions ψ(xGj|X) are independent of the true class labels, using an appropriate statistical test, with expectations over both the training and generalization samples. This is the probability that the independent validation step of an entire study concludes that the estimated classifier is at least better than random. This paper focuses on calculating generalization error rather than pattern discovery power, but the latter relates to one of the ultimate judgments about a study—whether something nonrandom has been independently validated—and is represented graphically with the simulation results.
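The "appropriate statistical test" is left open above; purely as an illustration, and assuming Fisher's exact test on the 2 × 2 table of predicted versus true class (an assumption of ours, not a prescription from the text), the validation p-value could be computed as follows. Pattern discovery power would then be estimated as the fraction of simulated (training, validation) replicates with p-value below the chosen significance level.

```python
import numpy as np
from scipy.stats import fisher_exact

def validation_pvalue(pred, truth):
    """P-value for independence of predicted and true class labels
    in a validation sample (classes coded 1 and 2)."""
    table = np.array([[np.sum((pred == 1) & (truth == 1)),
                       np.sum((pred == 1) & (truth == 2))],
                      [np.sum((pred == 2) & (truth == 1)),
                       np.sum((pred == 2) & (truth == 2))]])
    _, pval = fisher_exact(table)
    return pval
```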

3.SIMULATION–APPROXIMATION OF GENERALIZATION ERROR

Next, we give a joint simulation and approximation approach to estimate efficiently the generalization error rates CGj and Gj for multivariate normal data analyzed with linear discriminant analysis. Define a partition of the space of X samples into R nonoverlapping regions, Ω1,…,ΩR, that determine which dimensions of X are selected to estimate the classifier, that is, the feature selection. Define δ = (δ1, …, δM) to be a vector of 0s and 1s, with δk=1 if dimension k will be used for classification and 0 if not. For all X∈Ωr, the same dimensions of X are used by the classifier (so R ≥ 2^M), so it makes sense to write δ as a function of Ωr: δr ≡ δ(Ωr).

The generalization error for class j can be factored as

Gj(Δ, MU, n) = ∑r=1R P(X∈Ωr) ET[CGj(Δ, MU|X)|X∈Ωr],    (3.1)

where P(·) is the probability indicated by its argument.

We develop approximations for ET[CGj(Δ,MU|X)|X∈Ωr] based on the first 2 moments of P(X|X∈Ωr), the probability density of training data sets given that they lead to feature selection δr. This is an expected generalization error given that the training and generalization samples do not come from the same distributions. We use Monte Carlo samples to estimate P(X∈Ωr) and the first 2 moments of P(X|X∈Ωr), which can be generated efficiently. In what follows, Ω∈{Ω1,…,ΩR}.

In a real analysis, feature selection is intertwined with the problem of how many features to include, which is one type of regularization parameter that may be optimized over data-based estimates of generalization error, such as cross-validation. From the study design point of view, the goal is to provide insight into typical study outcomes under various scenarios. Instead of trying to include optimization of the number of features within each approximation, we calculate the approximation across a range of the feature-selection thresholds. This does not include variation or suboptimality in the feature-selection threshold in our estimates of generalization error distributions, but it does offer insight about the sensitivity of generalization error to the feature-selection threshold, which provides context and builds intuition for interpreting results with real data.

3.1.Monte Carlo approximation of feature selection

Next, we show how P(X∈Ω) and the mean and variance of P(X|X∈Ω) can be estimated with Monte Carlo methods. In the examples here, we assume feature selection is based on feature-by-feature univariate t-tests, which, when the data dimensions really are independent, makes the analysis optimistic because it “knows” this aspect of the “truth.” It is common to use feature-by-feature hypothesis tests to estimate false discovery rates as part of analyzing a high-dimensional study, so this simplification allows our results to stand side-by-side with expected false discovery rates and related ideas in considering study designs.

Consider a single data dimension, k, which may or may not be truly informative, for which δk will be 1 if the dimension is selected for the pattern and 0 if not. Let xijk be the kth dimension of sample i from class j. Let the n1 and n2 samples from groups j = 1 and j = 2, respectively, be normally distributed in dimension k: xijk ∼ N(μjk, σ2). Suppose the decision to include feature k in classification is based on the P-value of a t-test. One calculates the sample means x̄jk = nj−1∑i=1nj xijk for j=1,2; the pooled variance sk2 = {∑i(xi1k − x̄1k)2 + ∑i(xi2k − x̄2k)2}/dfs, where dfs=n1+n2−2 are the degrees of freedom of sk2; and the t-statistic tk = (x̄2k − x̄1k)/{sk(1/n1 + 1/n2)1/2}. The feature is included if |tk| > t1−Pc/2,dfs, where Pc is a threshold significance level for choosing δk=1 and t1−Pc/2,dfs is the inverse cumulative t-density at 1−Pc/2 with dfs degrees of freedom.
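A minimal sketch of this feature-by-feature selection rule (assuming data matrices x1 and x2 with samples in rows, for instance those generated in the earlier sketch):

```python
import numpy as np
from scipy.stats import t as t_dist

def select_features(x1, x2, p_c):
    """Feature-by-feature two-sample t-tests with pooled variance; returns the
    0/1 selection vector delta described in Section 3."""
    n1, n2 = x1.shape[0], x2.shape[0]
    dfs = n1 + n2 - 2
    mean_diff = x2.mean(axis=0) - x1.mean(axis=0)
    pooled_var = (((x1 - x1.mean(axis=0)) ** 2).sum(axis=0) +
                  ((x2 - x2.mean(axis=0)) ** 2).sum(axis=0)) / dfs
    t_stat = mean_diff / np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t_crit = t_dist.ppf(1.0 - p_c / 2.0, dfs)            # t_{1 - Pc/2, dfs}
    return (np.abs(t_stat) > t_crit).astype(int)

delta = select_features(x1, x2, p_c=0.001)
```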

It is equivalent to consider the 2 independent random variables

z = x̄2k − x̄1k ∼ N(Δk, σ2(1/n1 + 1/n2))    (3.2)

and e2 = dfs·sk2/σ2, which follows a χ2 distribution with dfs degrees of freedom; here Δk is the true mean difference in dimension k (zero for an uninformative dimension). Then,

P(δk = 1) = P(|tk| > t1−Pc/2,dfs)    (3.3)

and

E[g(z,e2) ∣ δk = 1] = E[g(z,e2) ∣ |tk| > t1−Pc/2,dfs].    (3.4)

Using g(z,e2)=z or g(z,e2)=σ2e2/dfs in (3.4) gives an estimate of the mean difference between groups 1 and 2 or the within-group variance, respectively, given that the t-test is significant.

Working with the densities of z and e2 allows more efficient numerical methods to estimate (3.3) and (3.4) than if one worked with the densities of xijk directly. Next, 2 possible Monte Carlo implementations are given, but a variety of numerical methods could be used. For the case of a t-test, (3.3) is simply a cumulative density of a noncentral t-distribution with noncentrality parameter Δk/{σ(1/n1 + 1/n2)1/2} and dfs degrees of freedom. For a Monte Carlo estimate of (3.4), define {z(l),e2,(l)},l=1,…,m, to be a simulated sample from P(z,e2 ∣ δk = 1), which can be generated efficiently with a Markov chain Monte Carlo (MCMC) algorithm. Then, a Monte Carlo estimate of (3.4) is

(1/m)∑l=1m g(z(l),e2,(l)).    (3.5)

Even a small sample (by MCMC standards) of say m=100 can be reasonable for (3.5).
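MCMC is one option; as a simpler alternative that we use only for illustration (our assumption, adequate when the selection probability is not extremely small), rejection sampling from the unconditional distributions of z and e2 also yields the selection probability and the conditional moments:

```python
import numpy as np
from scipy.stats import t as t_dist

def conditional_moments(delta_k, sigma, n1, n2, p_c, m=100_000, rng=None):
    """Monte Carlo estimates of Pr(select feature k), E[mean difference | selected],
    and E[within-group variance | selected] for a single dimension."""
    rng = rng or np.random.default_rng(0)
    dfs = n1 + n2 - 2
    se = sigma * np.sqrt(1.0 / n1 + 1.0 / n2)
    z = rng.normal(delta_k, se, size=m)                  # difference in sample means
    e2 = rng.chisquare(dfs, size=m)                      # dfs * s_k^2 / sigma^2
    s = np.sqrt(sigma**2 * e2 / dfs)                     # pooled standard deviation s_k
    t_stat = z / (s * np.sqrt(1.0 / n1 + 1.0 / n2))
    keep = np.abs(t_stat) > t_dist.ppf(1.0 - p_c / 2.0, dfs)
    return keep.mean(), z[keep].mean(), (sigma**2 * e2[keep] / dfs).mean()
```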

If one chose to extend the basic idea here for a test for which values of (3.3) are not as easily available as a noncentral t-distribution, then both (3.3) and (3.4) could be estimated by Monte Carlo. For that case, redefine {z(l),e2,(l)},l=1,…,m, to be a Monte Carlo sample of size m from P(z,e2). Then, the natural estimates of (3.3) and (3.4) are

(1/m)∑l=1m I(|tk(l)| > t1−Pc/2,dfs)    (3.6)

and

∑l=1m g(z(l),e2,(l)) I(|tk(l)| > t1−Pc/2,dfs) / ∑l=1m I(|tk(l)| > t1−Pc/2,dfs),    (3.7)

where tk(l) is the t-statistic computed from (z(l),e2,(l)).

Extensions based on other Monte Carlo numerical integration techniques (such as importance sampling) are straightforward and not our focus here.

3.2.Approximations for generalization error

Let α be the parameter vector of the classification function ψ, a linear discriminant function in the examples here. An estimated classifier ψ(xGj|X) is defined by estimated parameters α̂ = α̂(X). For more concise notation, we view generalization error as a function of α̂, that is, CGj(Δ, MU|X) = CGj(α̂).

Delta approximation. A delta approximation for the class generalization error given X∈Ω is

ET[CGj(α̂)|X∈Ω] ≈ CGj(ᾱ) + (1/2)∑r∑s [∂2CGj(ᾱ)/∂αr∂αs] Cov(α̂r, α̂s|X∈Ω),    (3.8)

where ᾱ = E(α̂|X∈Ω), ∂2CGj(ᾱ)/∂αr∂αs is the second derivative of CGj with respect to αr and αs evaluated at ᾱ, Cov(α̂r, α̂s|X∈Ω) is the covariance between the r and the s dimensions of α̂|X ∈ Ω, and p is the number of features selected due to X∈Ω. The delta approximation is derived by Taylor series expansion of the expectation integral around ᾱ. Note that although the dimensions (or features) are assumed to be independent for feature selection, after they are selected they are approximated as multivariate normal, so the covariances in (3.8) are not necessarily zero.

Normal score approximation. Classifiers typically involve a continuous score function, s(xG|X), with prediction of group 1, ψ(xGj|X)=1, if s(xGj|X) < 0 (by convention here) and prediction of group 2, ψ(xGj|X)=2, if s(xGj|X) > 0. The normal score approximation is to treat s(xGj|X) as normally distributed with mean E[s(xGj|X)] and variance V[s(xGj|X)]. Then,

ET[CGj(α̂)|X∈Ω] ≈ Φ(uj E[s(xGj|X)] / V[s(xGj|X)]1/2),    (3.9)

where u1=+1, u2=−1, and Φ(·) is the standard normal cumulative density function.
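A direct transcription of (3.9), as reconstructed above, for given score moments:

```python
from scipy.stats import norm

def normal_score_error(mean_score, var_score, j):
    """Class-j generalization error under the normal score approximation (3.9):
    u_1 = +1 (class 1 is misclassified when its score exceeds 0), u_2 = -1."""
    u = 1.0 if j == 1 else -1.0
    return norm.cdf(u * mean_score / var_score ** 0.5)
```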

Relation to linear discriminant theory. For the case that ψ is a linear discriminant function, we need to calculate ᾱ and Cov(α̂|X∈Ω) for the delta approximation and E[s(xGj|X)] and V[s(xGj|X)] for the normal score approximation. Define xFij to be the selected training features (i.e. given X∈Ω) of the ith sample from class j. It is convenient to arrange the signs of the data in a consistent manner, so we assume (without loss of generality) that whenever dimension k is included in the classifier, its group-mean difference is positive (i.e. if it is negative, reverse the signs of the data for that dimension). Then, define μFj and ΣF to be the mean vector and covariance matrix of xFij, respectively. The difference between means is ΔF=μF2−μF1. By symmetry, μF1+μF2=0. The distributions of the xFij will not typically be normal because they are conditioned on a significant difference between normal sample means, but the approximation below uses exact expressions (see supplementary material available at Biostatistics online) for the required moments, including E[s(xGj|X)] and V[s(xGj|X)], under the assumption that the distributions are normal. The expressions use results of Siskind (1972) on the second moments of inverse Wishart distributions, which are related to the sampling distribution of the inverse of the pooled sample covariance matrix of the selected features. This allows full incorporation of multivariate sampling variability in estimating the linear discriminant classifier and uses the principle that second moment-based approximations derived from normal theory are often reasonable. Thus, there are really 2 approximations happening: an approximation of training features (given they have been selected) as multivariate normally distributed and either the delta approximation or normal score approximation of generalization error.

3.3.Linear discriminant analysis when the training and validation samples follow different distributions

As above, define a training sample of xFi1, i=1,…,n1, from class 1 and xFi2, i=1,…,n2, from class 2. Define ΔF=μF2−μF1, with ΔF > 0 in every dimension. Define the "true" parameters of ψ in the linear discriminant case as α=(w,a), where w=ΣF−1ΔF and a=0.5(μF1+μF2)=0. These are estimated by α̂=(ŵ,â), where ŵ=SF−1(x̄F2−x̄F1), â=0.5(x̄F1+x̄F2), x̄Fj is the sample mean vector of class j, and

SF = {∑i=1n1(xFi1−x̄F1)(xFi1−x̄F1)′ + ∑i=1n2(xFi2−x̄F2)(xFi2−x̄F2)′}/(n1+n2−2)    (3.10)

is the pooled unbiased estimate of ΣF. This is the setup of standard linear discriminant analysis (McLachlan, 1992).

Define a validation sample from class j as xGj. The discriminant score for a value xG is

s(xG) = (xG − â)′ŵ − log{P(1)/P(2)},    (3.11)

with prediction of class 1 for s(xG) < 0 and class 2 for s(xG) > 0. If the training and validation samples came from the same distributions, then w and a would give the optimal discriminant function.

We maintain the generality of the prior log-odds ratio, log(P(1)/P(2)), in the derivations. In the simulations below, we assume P(1)=P(2). These values may be very different for a population screening test, where only a very small fraction is expected to have a disease condition, compared to a problem such as disease classification given disease presence. Consideration of P(1)≠P(2) is standard in balancing sensitivity and specificity of medical tests.

To use the delta approximation (3.8), we need the first 2 moments of ŵ and â and the derivatives of the generalization error with respect to the elements of ŵ and â. To use the normal score approximation (3.9), we need the first 2 moments of s(xGj). These are given exactly in the supplementary material available at Biostatistics online for the approximation that the training samples are normally distributed given that the selected dimensions were individually significant.
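The exact moment expressions in the supplementary material are not reproduced here. As a hedged stand-in, the sketch below fixes a set of selected feature means (an assumption purely for illustration, with no feature-selection conditioning), estimates the score moments by brute-force Monte Carlo over training sets, and compares the normal score approximation with the directly simulated class-1 error:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# Assumed (illustrative) means of the selected features: 2 informative, 1 noise
mu1, mu2 = np.array([-0.4, -0.4, 0.0]), np.array([0.4, 0.4, 0.0])
n1 = n2 = 25
n_train_reps, n_val = 500, 2000

scores_c1, mc_err1 = [], []
for _ in range(n_train_reps):
    x1 = rng.normal(mu1, 1.0, size=(n1, mu1.size))        # training sample, class 1
    x2 = rng.normal(mu2, 1.0, size=(n2, mu2.size))        # training sample, class 2
    d1, d2 = x1 - x1.mean(0), x2 - x2.mean(0)
    S = (d1.T @ d1 + d2.T @ d2) / (n1 + n2 - 2)           # pooled covariance, cf. (3.10)
    w = np.linalg.solve(S, x2.mean(0) - x1.mean(0))       # estimated discriminant direction
    a = 0.5 * (x1.mean(0) + x2.mean(0))                   # estimated midpoint
    xg1 = rng.normal(mu1, 1.0, size=(n_val, mu1.size))    # validation sample from class 1
    s = (xg1 - a) @ w                                     # discriminant scores, cf. (3.11), P(1)=P(2)
    scores_c1.append(s)
    mc_err1.append(np.mean(s > 0))                        # class-1 errors: score above 0

scores_c1 = np.concatenate(scores_c1)
approx_err1 = norm.cdf(scores_c1.mean() / scores_c1.std())   # normal score approximation (3.9), u1 = +1
print("normal-score approximation:", approx_err1, "direct Monte Carlo:", np.mean(mc_err1))
```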

3.4.Summation over feature spaces

It remains to complete the calculation (3.1) efficiently by combining the Monte Carlo estimates of (3.3) and (3.4) and the approximations (3.8) or (3.9). If the space of features that might be selected is relatively simple, then one might directly enumerate cases where P(X∈Ω) is appreciably greater than zero; this is not stated mathematically here. More generally, one can use a Monte Carlo sample from the space of selected features to approximate (3.1).

Let {Ω(l)},l=1,…,m, be a sample from P(Ω)≡P(X∈Ω). Corresponding to each partition piece Ω, there is a distribution P(X|X∈Ω). Since this is characterized by ΔF and ΣF (estimated by (3.5)), the conditional expectation ET[CGj(Δ,MU|X)|X∈Ω] can be evaluated by (3.8) or (3.9). Then, the Monte Carlo approximation of (3.1) is

Gj(Δ,MU,n) ≈ (1/m)∑l=1m ET[CGj(Δ,MU|X)|X∈Ω(l)].    (3.12)

For feature-by-feature selection as discussed above, the relationship δr=δ(Ωr),r=1,…,R, is one-to-one, so we can identify P(δr)≡P(Ωr). Then, sampling from P(Ω) in practice amounts to simulating on a feature-by-feature basis whether each feature is selected.
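A sketch of this feature-by-feature sampling, assuming unit within-group variances so that the selection probability of an informative feature follows from the noncentral t-distribution of Section 3.1 and that of an uninformative feature equals Pc (the example arguments are illustrative assumptions):

```python
import numpy as np
from scipy.stats import nct, t as t_dist

def sample_feature_sets(delta, m_u, n1, n2, p_c, n_draws, rng):
    """Draw feature-selection indicator vectors delta^(l), l = 1..n_draws,
    one Bernoulli draw per feature, as described in Section 3.4."""
    dfs = n1 + n2 - 2
    t_crit = t_dist.ppf(1.0 - p_c / 2.0, dfs)
    ncp = np.asarray(delta) / np.sqrt(1.0 / n1 + 1.0 / n2)      # sigma = 1 assumed
    # Pr(select) for informative features: P(|T| > t_crit), T noncentral t
    p_info = nct.sf(t_crit, dfs, ncp) + nct.cdf(-t_crit, dfs, ncp)
    p_all = np.concatenate([p_info, np.full(m_u, p_c)])         # uninformative: level Pc
    return rng.random((n_draws, p_all.size)) < p_all

rng = np.random.default_rng(3)
omegas = sample_feature_sets(np.full(12, 0.5), m_u=2000, n1=50, n2=50,
                             p_c=0.001, n_draws=200, rng=rng)
# Each row of `omegas` indexes one Omega^(l); the approximations (3.8)/(3.9)
# would then be evaluated for each row and averaged as in (3.12).
```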

3.5.Choice of feature-selection thresholds

The above simulation and approximation steps require a choice for the P-value cutoff, Pc, used for feature selection. In practice, one can consider a range of Pc-values based on heuristic considerations to encompass the value of Pc that minimizes the expected validation error. In the simulation results here (supplementary material available at Biostatistics online, summarized below), the following heuristics perform well. The lower bound PL of Pc is set to the value at which the probability of zero true discoveries is 30% because excluding most or all informative features will not lead to good patterns. The upper bound PU of Pc is the minimum of 2 values. The first is the Pc level at which the probability of including all informative features equals 80%, on the rationale that after including most or all informative features, error rates will only get worse as false features are added. The second is the Pc such that the expected number of uninformative features is N/2−MI, that is, the expected total number of features if all truly informative features are included should not exceed N/2. In scenarios where the second bound was lower than the first, higher Pc would lead to worse validation error rates due to many uninformative features.

3.6.Summary of simulation–approximation method

In summary, the simulation–approximation procedure uses the following steps:

1. Choose Δ, MU, and n to define a study scenario.

2. Choose a useful range of feature-selection thresholds, Pc, which influence how many features are chosen in the feature-selection stage.

3. For each (unique) dimension of Δ and each Pc, use the noncentral t-distribution and/or Monte Carlo methods to estimate

a) the probability that the feature will be selected,

b) the expected within-group variances and difference between group means given that the feature is selected.

4. For the Monte Carlo approximation (3.12), generate a sample of training feature combinations, {Ω(l)}, for which the generalization error will be approximated.

5. For each training feature combination, use the variances and mean differences given that the features are selected to approximate the generalization error using either (3.8) or (3.9) with the calculations in the supplementary material available at Biostatistics online.

6. Sum the terms in (3.12).

4.SIMULATION STUDY

Results of simulations of 7 realistic study designs are detailed in the supplementary material available at Biostatistics online. The first 6 scenarios consider optimal (i.e. Bayes) error rates of 0.05, 0.10, and 0.20 with either 3 (few strong) or 12 (many weak) truly informative dimensions, while the seventh considers optimal error of 0.05 from 46 (very many, very weak) dimensions. All scenarios use equal discovery sample sizes for control and disease groups, n1=n2, with the same mean difference for all informative dimensions and 10 patients per group for validation power. The simulation–approximation is accurate with the normal score approximation in all scenarios and with the delta approximation in all scenarios except for very many, very weak true features. Both methods are most accurate when most of the variation in generalization error is due to variation in which features are selected rather than in discriminant parameters given the feature space. Much larger numbers of truly informative dimensions would render the approximations inaccurate, and, moreover, suggest methods beyond basic linear discriminant analysis (LDA), such as shrinkage methods to constrain high variances in estimated patterns.

Several realistic scenarios have limited statistical power for validation and lead to substantially suboptimal patterns. With 12 informative and 2000 uninformative features and optimal error rate of 20%, sample sizes of 20, 50, and 100 give median validation error rates around 48%, 40%, and 30–35%, respectively, with only sample sizes of 100 giving better than 50% power for validation. If the features give optimal error rate of 10%, then 50 patients per group give high validation power but with median error rates of roughly 18–22% for 1000–5000 uninformative dimensions. With an optimal error rate of 5%, 20 samples would give roughly 50–80% validation power at 5% significance for 1000–5000 uninformative dimensions.

For a given optimal error rate, it is much harder to find patterns from many weak than from few strong informative features. Given optimal error rate of 20%, 50 patients per group for 3 strong features give better results than 100 patients per group with 12 weak features. For optimal error rate of 10% or 5%, 20 patients per group for 3 strong features give roughly comparable performance to 50 patients per group for 12 weak features. In summary, by far the strongest factors in pattern discovery power are sample size and individual feature strength. Some of these results are sobering in light of sample sizes in typical studies. It is plausible that some real studies to discover diagnostic patterns from high-dimensional assays could have low power for independent validation and find patterns far from the best true pattern.

5.DISCUSSION

Prospective analysis of study design for high-dimensional pattern discovery is important to plan studies with reasonable expectations of success based on scientific guesswork about the types of real patterns that might exist. The complexity of feature selection and pattern analysis methods raises many challenges for prospective study design. Here, we have explored a middle road between simulation and approximation, with simulations to handle variability in the selected features and an approximation of linear discriminant analysis given that the selected features appear to be informative in training data.

One important way in which the scenarios here may be optimistic is that they contain no genuinely multivariate patterns and use no multivariate pattern recognition methods. Multivariate patterns could include correlated features that appear to be individually weak but are collectively strong, or even harder possibilities such as the classic “XOR” (checkerboard) problem, where each marginal distribution has no information and only models more complicated than LDA can represent the pattern. In such problems, the hazard of over-fitting is greater than for the simulations here and would likely produce less favorable results. Other directions for further exploration of the relationships between sample size, numbers of informative and noninformative features, true optimal error rate, and discovery and generalization error rates include the following: generation of data from distributions that are unknown to the learning method (i.e. non-normal), further development of the relationship between false discovery rates and pattern discovery power, and further theoretical development of accurate approximations and/or efficient simulations.

Supplementary Material


This work was initiated while all authors were employed at Predicant Biosciences. We thank our colleagues at Predicant for insightful discussions and support. Conflict of Interest: None declared.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The incidence of nasopharyngeal carcinoma (NPC) varies widely according to age at diagnosis, geographic location, and ethnic background. On a global scale, NPC is common among specific populations primarily living in southern and eastern Asia and northern Africa, but in most areas, including almost all western countries, it remains a relatively uncommon malignancy. A feature specific to these low-risk populations is the possible bimodality of the observed age-incidence curves. We have developed a multiplicative frailty model that allows for the demonstrated peaks at ages 15–24 and 65–74. The bimodal frailty model has 2 independent compound Poisson-distributed frailties and gives a significant improvement in fit over a unimodal frailty model. Applying the model to population-based cancer registry data worldwide, 2 biologically relevant estimates are derived, namely the proportion of susceptible individuals and the number of genetic and epigenetic events required for the tumor to develop. The results are critically compared and discussed in the context of existing knowledge of the epidemiology and pathogenesis of NPC.

1.INTRODUCTION

There are remarkable and well-defined geographical and ethnic variations in the incidence of nasopharyngeal carcinoma (NPC) worldwide. Rates are high to intermediate in certain areas of south-eastern China, southern Asia, northern Africa, and among Inuit populations of the Arctic region. With the exception of migrant populations from high-risk areas, rates of this malignancy tend to be uniformly low elsewhere. The aetiology of NPC is rather complex, with causal pathways involving the Epstein–Barr virus (EBV) as well as factors related to both the environment (often lifestyle related) and the host (genetic susceptibility) (Chang and Adami, 2006), (Hildesheim and Levine, 1993).

Age-incidence curves of certain cancers often exhibit a single peak in rates followed by a subsequent decline. Among alternative explanations, this unimodality may be interpreted as a frailty phenomenon, whereby most individuals are nonsusceptible to the disease, but a subset of individuals has an increased risk at a given age. The risk at the population level must decline once those susceptible individuals have acquired the disease, leaving behind a population that, at a given age, is in theory largely nonsusceptible.

Frailty modeling provides an opportunity to take individual heterogeneity in disease susceptibility into account. For reviews of frailty theory, see for example the introductions by Aalen (1988), Aalen (1994) or Hougaard (2000). Frailty is an unobservable quantity modeled as a random variable over the population of individuals, with a high (low) value of the frailty variable associated with a large (small) risk of acquiring the disease. If the frailty variable is 0, the individual is nonsusceptible or `immune'.

The age-incidence curve of NPC for low-risk countries is somewhat atypical amongst cancer types. In Bray and others (2008), it was shown that for most, if not all populations in this category, rates exhibit a small peak within the age range 15–24, with rates steadily increasing to a second peak at ages 65–74 years, and then declining subsequently. The aims of this study were firstly to identify a frailty model that provides an adequate fit to this more complex instance of bimodality in the age-incidence structure, secondly to assess the significance of the first peak, and thirdly to interpret the resulting parameter estimates in the context of the current epidemiologic and biological knowledge of NPC.

Using a number of published data sets from population-based cancer registries worldwide, we include 2 frailties, one per peak, and 2 basic rates in the multiplicative frailty model. The frailties are assumed independent and compound Poisson distributed. This distribution has a discrete part of 0 frailty (i.e. nonsusceptible) and a continuous part of positive frailties. Covariates are included in the underlying Poisson parameters. We present the NPC hazard ratios by sex and geographical area in the analysis, together with 95% confidence intervals for these ratios. The observed and estimated age-specific incidence rates are plotted, and we examine the fit of the bimodal frailty model. Estimates of the proportion of susceptible individuals and the number of genetic and epigenetic events required to attain malignancy are given, with 95% confidence intervals.

This paper is organized as follows: in Section 2, the data sources and the model are described, together with some theoretical results. Section 3 presents the main results following application of the model to the data. Finally, in Section 4, the assumptions of the model are stated, and the results are discussed in light of our present understanding of the biology and aetiology of NPC.

2.MATERIAL AND METHODS

2.1Material

The Cancer Incidence in Five Continents (CI5) Vol. I to VIII ADDS database (Parkin and others, 2005) was used to extract incident cases of nasopharyngeal cancer (ICD-10 C11) for 72 population-based cancer registries, together with the corresponding population data by year of diagnosis, sex, and age. Although all nasopharyngeal cancers were extracted, rather than only NPCs, the term NPC is used here to identify carcinomas, given that they represent the vast majority of nasopharyngeal tumors and the subset on which most epidemiological studies have focused.

The inclusion and exclusion criteria are provided in detail in Bray and others (2008). Briefly, we restricted analyses to the period 1983–1997 and, to remove some of the inherent random variability, excluded populations with a mean annual coverage of less than 1 million inhabitants. For the remaining 23 registry populations, incidence data were available by eighteen 5-year age groups (0–4, 5–9, …, 80–84, 85+) and sex for each of the years of diagnosis 1983–1997 (see footnotes of Table 1 for exceptions). Regional registries were aggregated to national or larger area levels on the basis of geographical area, thus enabling sufficient numbers for meaningful age-specific analyses. Five aggregated low-risk areas were defined: North America, Japan, north and west Europe, Australia, and India. To examine the effect of calendar time, the data were further divided into three 5-year diagnostic periods (1983–1987, 1988–1992, 1993–1997).

Table 1.

Number of NPC cases and corresponding number of person–years at risk (in millions) for males and females in 1983–1997

Area (contributing registries)         Cases (M/F)    Person–years (M/F)
North America                          2705/1227      345.99/354.05
    Canada; Surveillance Epidemiology and End Results white
Japan                                  587/232        80.67/83.21
    Miyagi; Osaka
North and west Europe                  1424/709       270.98/285.46
    Denmark; Estonia; Switzerland, Zürich (a); UK, Birmingham and West Midlands;
    UK, Merseyside and Cheshire; UK, North western; UK, Oxford (b);
    UK, South Thames region; UK, Yorkshire; UK, Scotland
Australia                              814/310        86.25/87.37
    New South Wales; South; Victoria
India                                  539/219        110.63/93.04
    Chennai (c); Mumbai (c)

(a) Incidence data available for the years of diagnosis 1983–1996.
(b) Incidence data available for the years of diagnosis 1985–1997.
(c) Population data available in 16 age groups (0–4, 5–9, …, 70–74, 75+).

Table 1 gives an overview of the countries/regions included in the analysis, with the number of NPC cases and corresponding number of person–years at risk (in millions) for males and females in the aggregated areas in 1983–1997. In total, there were 6069 cases among males and 2697 among females. The total number of person–years at risk (in millions) was 894.53 for males and 903.14 for females.

2.2Statistical methods

Standard frailty theory makes use of the multiplicative frailty model. In this model, the individual hazard rate is the product of an unobservable frailty variable Z and an unobservable basic rate λ(t) common to all individuals; that is, h(t|Z) = Zλ(t) (Aalen, 1994), where t throughout denotes age. The population hazard rate is the net result for a number of individuals with different frailties and is observable as the age-incidence rate. The basic rate specifies how the hazard changes with age. The level of the hazard for a given individual is specified by the frailty, which follows a specific statistical distribution. Common distributions are the power variance function (PVF) distributions, which include the gamma and the compound Poisson distributions as special cases.

To accommodate the bimodality in the age-incidence curve of NPC, we make a minor modification to the multiplicative frailty model by including 2 frailties, assumed, for simplicity, to be independent. The first frailty, Z1, represents the risk of developing NPC in very early adulthood, postulated to be a result of genetic and viral factors (Ayan and others, 2003). Later lifestyle factors (including smoking) probably influence the risk of getting NPC for individuals aged 65–74 years, represented by the second frailty term, Z2. We let the individual hazard rate be a linear combination of these 2 frailties,

(2.1) h(t|Z1, Z2) = Z1λ1(t) + Z2λ2(t).

NPC is a rare form of cancer, and to allow individuals to be nonsusceptible, we use the compound Poisson distribution for the frailties Z1 and Z2. This distribution has been successfully applied to testicular cancer and colorectal cancer (Aalen and Tretli, 1999), (Moger and others, 2004), (Svensson and others, 2006). For i = 1,2, let Xi,1, Xi,2, …, Xi,Ni be independent gamma-distributed random variables with scale and shape parameters νi and ηi, respectively. The frailty variables are then given by Zi = Xi,1 + Xi,2 + ⋯ + Xi,Ni (with Zi = 0 if Ni = 0), where Ni is a Poisson-distributed random variable with expectation ρi. The Poisson parameters ρi (i = 1,2) determine the proportion of susceptible individuals through P(Zi≠0) = 1 − exp(−ρi).
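As a small illustration, the compound Poisson frailty can be simulated directly from this definition. The sketch below (R, illustrative parameter values) treats νi as a rate parameter of the gamma distribution, that is, density proportional to x^(ηi−1)exp(−νix); whether the paper's "scale" νi is a rate or a scale in the modern sense is an assumption of this sketch.

    # Simulate Z = X_1 + ... + X_N with N ~ Poisson(rho) and X_j ~ Gamma(shape = eta,
    # rate = nu); Z = 0 when N = 0. Parameter values are illustrative only.
    r_compound_poisson <- function(n, rho, eta, nu) {
      N <- rpois(n, rho)
      vapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape = eta, rate = nu)),
             numeric(1))
    }

    set.seed(1)
    Z <- r_compound_poisson(1e5, rho = 0.01, eta = 2, nu = 5)
    mean(Z > 0)      # proportion susceptible, close to 1 - exp(-0.01)
    mean(Z[Z > 0])   # mean frailty among susceptible individuals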

The age-specific incidence rates of NPC vary by sex and geographic location and, in some populations, with time. Hence, we allowed ρi to change over sex, area, and diagnostic period by including covariates in this parameter. The Poisson parameters can therefore be written as

(2.2) ρi = exp(ρi0 + βi1·sex + Σj βi2j·areaj + Σk βi3k·periodk), i = 1,2,

with indicator variables for the nonreference categories of sex, area, and diagnostic period.

The process of carcinogenesis can be described by different multistage models, among which the Armitage–Doll (AD) multistage model (Armitage and Doll, 1954) is well known. In this model, cells go through an irreversible process, transforming normal cells into malignant cells via many intermediate states. The AD model does not take into account that cells can replicate, die, or differentiate. The Moolgavkar–Venzon–Knudson (MVK) model is a 2-stage model which allows for clonal expansion of intermediate cells. Both these multistage models are illustrated in Portier and Kopp-Schneider (1991), who also give an expansion of the MVK model to include DNA damage, cell replication, and DNA repair, the damage-fixation multistage model. Little (1995) proposes a generalization of the MVK model which allows an arbitrary number of mutational stages.

Armitage and Doll (1954) justify the use of the Weibull distribution for the basic rates, while Kopp-Schneider (1997) states that the Weibull model is the most commonly used parametric model for carcinogenesis. If we let ki be the shape parameter of this distribution, we obtain λi(t) = ki t^(ki−1), i = 1,2. Usually these hazard rates are written as ai ki t^(ki−1), where the ai are scale parameters. To avoid overparameterization, these parameters are subsumed in the frailty variables, that is, a1 = a2 = 1.

The individual survival function, given the frailties, is S(t|Z1,Z2) = exp(−Z1Λ1(t) − Z2Λ2(t)), where Λi(t) = ∫0^t λi(s) ds = t^ki, i = 1,2, are the cumulative basic rates. Integrating out the unknown frailty variables gives the population survival function (2.3).

By differentiating the natural logarithm of (2.3) with respect to t and changing the sign, we find the population hazard rate (2.4).

The function in (2.4) is bimodal, as opposed to the individual hazard rate in (2.1), which is monotonic. It is an extension of the population hazard rate given in Aalen and Tretli (1999). With only one peak in the age-incidence curve, only one of the terms in (2.4) would have been necessary; the Poisson parameter ρ would then have been a proportionality factor, and including covariates in this parameter only would have given a proportional hazards model. However, it is possible for ρ1 and/or ρ2 to be proportionality parameters also in the bimodal model. Figure 1(a) shows an example of the hazard function in (2.4). The plot in Figure 1(b) shows the population hazard rates for the 2 peaks separately, that is, for the 2 terms in (2.4). These hazard rates increase up to a certain age, after which the curves start to decrease. Adding them together gives the bimodal curve in Figure 1(a). The first peak (from Z1) in the bimodal curve decreases less than the long-dashed line in Figure 1(b), but the second peak (from Z2) is in accordance with the dashed line in Figure 1(b). At all ages where one of the 2 curves is approximately 0, the corresponding term in (2.4) is negligible. Hence, in this example, the Poisson parameter ρ2 acts as a proportionality factor at, for example, the second peak, since the hazard contribution from Z1 is approximately 0 at these ages, but this is not the case for ρ1 at the first peak, where both Z1 and Z2 contribute to the total curve.

Fig. 1.

(a) Bimodal population hazard rate (2.4) for certain parameter values. (b) Hazard function for each of the 2 peaks separately, with the same parameter values as in (a).
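A hedged numerical sketch of such a bimodal population hazard is given below (R). It assumes the compound Poisson–gamma Laplace transform E[exp(−sZi)] = exp(−ρi[1 − (νi/(νi + s))^ηi]) together with Λi(t) = t^ki, which makes the population hazard a sum of 2 unimodal terms. The parameter values are illustrative, chosen only to produce a small early peak and a larger late peak; they are not the fitted values reported in Table 3.

    # Population hazard under the assumed Laplace transform: each term is
    # rho_i * eta_i * lambda_i(t) * nu_i^eta_i / (nu_i + Lambda_i(t))^(eta_i + 1),
    # with lambda_i(t) = k_i * t^(k_i - 1) and Lambda_i(t) = t^k_i.
    pop_hazard <- function(t, rho, eta, nu, k) {
      h <- 0
      for (i in seq_along(rho)) {
        Lam <- t^k[i]                       # cumulative basic rate
        lam <- k[i] * t^(k[i] - 1)          # basic rate
        h <- h + rho[i] * eta[i] * lam * nu[i]^eta[i] / (nu[i] + Lam)^(eta[i] + 1)
      }
      h
    }

    age <- seq(1, 85, by = 0.5)
    h <- pop_hazard(age, rho = c(2e-5, 1e-4), eta = c(20, 1.5),
                    nu = c(4.7e4, 3.6e9), k = c(2.5, 5))
    plot(age, 1e5 * h, type = "l", log = "y",
         xlab = "Age", ylab = "Rate per 100 000 person-years")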

The parameters for the frailty distributions and the basic rates are assumed equal for both sexes in all age intervals, areas, and diagnostic periods. From (2.4), we see that the population hazard rates for males and females in area j and diagnostic period k (denoted later as hMjk(t) and hFjk(t), respectively) differ only in the values of the Poisson parameters. Let ρiMjk and ρiFjk be the Poisson parameters for males and females, respectively, in peak i, area j, and diagnostic period k. Further, let A1(t) and A2(t), defined in (2.5), be the parts of the population hazard rate in (2.4) that are equal for the sexes in all age intervals, areas, and diagnostic periods, so that, for example, hMjk(t) = ρ1Mjk A1(t) + ρ2Mjk A2(t). Combining (2.4) and (2.5), the hazard ratio between males and females in area j and diagnostic period k becomes

(2.6) hMjk(t)/hFjk(t) = [ρ1Mjk A1(t) + ρ2Mjk A2(t)] / [ρ1Fjk A1(t) + ρ2Fjk A2(t)],

and the hazard ratio between males in area j and reference area j′, in the same diagnostic period k, is

(2.7) hMjk(t)/hMj′k(t) = [ρ1Mjk A1(t) + ρ2Mjk A2(t)] / [ρ1Mj′k A1(t) + ρ2Mj′k A2(t)].

The hazard ratios in (2.6) and (2.7) depend on age and are quite complex because the population hazard rate in (2.4) consists of 2 terms, so in general common terms cannot be cancelled. Parametric bootstrapping is required to obtain the corresponding confidence intervals.

The proportion of susceptible individuals follows from the underlying Poisson parameters. Specifically, the probabilities of the individual being susceptible in peak 1 and peak 2 are 1 − exp( − ρ1) and 1 − exp( − ρ2), respectively.

2.3Estimation procedure

The method is the same as in Aalen and Tretli (1999). Let μjklm and Rjklm be, respectively, the expected and the observed number of NPC cases in area j, diagnostic period k, and age interval l for sex m. Let Tjklm be the corresponding number of person–years at risk. From a Poisson model, the likelihood function is

L = ∏jklm μjklm^Rjklm exp(−μjklm) / Rjklm!.

The midpoints of the age intervals are denoted by t1,…,t16 or t1,…,t18, depending on the number of age groups. The expected number of NPC cases μjklm is defined as the average hazard rate per year for area j, diagnostic period k, age interval l, and sex m, multiplied by the corresponding number of person–years Tjklm.

The likelihood function depends on the parameters through the population survival function given in (2.3). We assume that the Weibull shape parameters ki and the scale and shape parameters νi and ηi of the underlying gamma distributions are the same for both sexes in all age intervals, areas, and diagnostic periods. This keeps the shapes of the distributions the same and reduces the number of parameters. The Poisson parameters ρi (i = 1,2) are allowed to change over sex, area, and diagnostic period according to (2.2). This gives 11 parameters per peak and a total of 22 parameters in the model, which we estimate by maximizing the natural logarithm of the likelihood function, ln(L). The R function “nlminb” is used for the maximization, and standard errors are calculated from the Hessian matrix obtained with the R function “optim.” The parameter estimate divided by its standard error gives the Wald statistic, which is used to test the effect of each covariate by computing 2-sided p-values.
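The estimation machinery can be sketched on a much smaller problem. The toy example below (R, simulated data, illustrative names) fits a single Weibull-type rate a·k·t^(k−1) to Poisson counts by minimizing the negative log-likelihood with nlminb and takes standard errors from the numerically estimated Hessian returned by optim; it is not the 22-parameter model of this section, and the parameters are estimated on the log scale.

    set.seed(1)
    t_mid <- seq(2.5, 87.5, by = 5)              # age-interval midpoints
    T_py  <- rep(2e6, length(t_mid))             # person-years per interval
    rate  <- function(par, t) exp(par[1]) * exp(par[2]) * t^(exp(par[2]) - 1)
    true  <- c(log_a = log(1e-9), log_k = log(4))
    R_obs <- rpois(length(t_mid), rate(true, t_mid) * T_py)

    negloglik <- function(par) {                 # Poisson -log L, up to a constant
      mu <- rate(par, t_mid) * T_py
      sum(mu - R_obs * log(mu))
    }
    fit  <- nlminb(start = c(log(1e-8), log(3)), objective = negloglik)
    hess <- optim(fit$par, negloglik, hessian = TRUE, method = "BFGS")$hessian
    se   <- sqrt(diag(solve(hess)))
    cbind(estimate = fit$par, se = se)           # Wald statistic: estimate / se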

The confidence intervals for the hazard ratios are based on the percentile method, which takes as confidence limits the (α/2)(B + 1)th and (1 − α/2)(B + 1)th values of the bootstrap sample sorted in ascending order, provided these positions are integers (Carpenter and Bithell, 2000). For simplicity, we use B = 999 and a significance level α = 0.05.
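For concreteness, the percentile interval can be computed as follows (R; the bootstrap replicates here are random stand-ins, not output from the fitted model):

    B <- 999; alpha <- 0.05
    hr_boot <- sort(exp(rnorm(B, mean = log(2.5), sd = 0.05)))  # stand-in bootstrap HRs
    idx <- round(c(alpha / 2, 1 - alpha / 2) * (B + 1))         # positions 25 and 975
    hr_boot[idx]                                                # percentile 95% CI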

3.RESULTS

The reference level for the covariate diagnostic period is 1983–1987. Two-sided p-values for the test of no effect of this covariate in peak 1, adjusted for the covariates sex and area, are 0.25 and 0.24 for the periods 1988–1992 and 1993–1997, respectively. For the second peak, the p-values are 0.14 and 0.09. Hence, there is no significant difference in the age incidence for the three 5-year diagnostic periods. In the following, we therefore analyze data for the aggregated 15-year diagnostic period 1983–1997.

The left part of Table 2 shows the 2-sided p-values for the test of no effect of the covariates sex and area, unadjusted for diagnostic period. For these covariates, Table 2 also gives the hazard ratios, as given in (2.6) and (2.7), at the mean value of the age intervals for the 2 peaks (t = 19.5 and t = 69.5, respectively) with 95% bootstrap confidence intervals. The confidence intervals are much wider at age 19.5 than at age 69.5 because of fewer cases. The covariate sex is significant in both peaks with an increased risk for males compared to females. Corresponding to the example in Figure 1, both terms in (2.6) contribute to the hazard ratio at age 19.5, and the effect of sex therefore depends on area of residence. We present the mean hazard ratio over areas to get one combined estimate of 1.89 with (1.50, 2.20) as the 95% confidence interval. At age 69.5, the hazard ratio for sex is 2.56 (2.53,2.74) regardless of area, as the first hazard in (2.6) is approximately 0 at this age. In most areas from which data are available, the reported male:female ratio in the population of individuals who acquire the disease is in the range of 2–3:1 (Hildesheim and Levine, 1993).

Table 2.

P-values for both peaks and hazard ratios at ages t = 19.5 (mean of age interval peak 1) and t = 69.5 (mean of age interval peak 2), with 95% bootstrap confidence intervals, for sex and area

                                   P-value                HR(19.5)             HR(69.5)
                                   Peak 1     Peak 2
Sex (reference level: women)
    Sex                            < 0.001    < 0.001     1.89 [1.50, 2.20]    2.56 [2.53, 2.74]
Area (reference level: North America)
    Japan                          0.45       < 0.001     1.02 [0.63, 1.23]    0.81 [0.79, 0.85]
    N/W Europe                     0.81       < 0.001     0.86 [0.64, 0.97]    0.59 [0.58, 0.60]
    Australia                      0.07       < 0.001     1.29 [1.13, 1.84]    1.13 [1.06, 1.15]
    India                          < 0.001    < 0.001     1.83 [1.30, 2.09]    0.84 [0.79, 0.90]

Correspondingly, for the area covariate, we present the mean hazard ratio over sex at age 19.5. From the p-values and the hazard ratios, India is the only area with a significantly higher risk than North America at the first peak. The other possible differences are not significant according to the Wald test, although unity is not included in the confidence intervals for north and west Europe and Australia. The results of these 2 tests differ because the hazard ratio in (2.7) is influenced by the parameters in both peaks: the function A2(t) in (2.5) is approximately 0 for small values of t, but this is not the case at t = 19.5. At the second peak (age 69.5), we see significant differences between North America and all 4 other areas, and the 95% confidence intervals support this conclusion. For t = 69.5, the hazard ratio is influenced mostly by the parameters in the second peak, since the function A1(t) in (2.5) is approximately 0 for large values of t; the p-values and hazard ratios therefore give consistent conclusions. North America has a higher risk than all the other areas except Australia.

Figure 2 presents 25 bootstrap age-incidence curves, used to calculate the bootstrap confidence intervals, together with the observed values. The estimated incidence rates are obtained by replacing the parameters in (2.4) with their estimated values. These graphs are presented on a semilog scale to highlight the bimodality. We see less variation for North America than for Japan, especially up to the first peak, and the fit is also somewhat better for the former area. This is expected, as North America has the highest number of person–years at risk and Japan the lowest (see Table 1). North America therefore contributes the most to the likelihood function and hence to the parameter estimates.

Fig. 2.

Observed (discrete points) and 25 bootstrap (continuous curves) age-specific incidence rates per 100 000 person–years for both sexes in North America and Japan. Vertical lines are included to emphasize the rates in age groups 15–24 and 65–74.

The estimates of the other parameters in the compound Poisson model are given in Table 3. The underlying Weibull hazard rate has a shape parameter of 2.48 with 95% confidence interval (2.16,2.80) for the first peak and 4.65 (4.28,5.03) for the second. These confidence intervals are based on a normal approximation and are calculated from the estimates and standard errors in Table 3. Note that exp(β) for the second peak is equal to the hazard ratios given in the last column of Table 2, since the underlying Poisson parameter ρ2 given in (2.2) is a proportionality factor.

Table 3.

Maximum likelihood estimates with standard errors of the parameters

                        ν             η        k       ρ0
Peak 1   Estimates      2.81 × 10^4   23.48    2.48    −11.32
         se             4.33 × 10^4   36.70    0.16    0.15
Peak 2   Estimates      6.16 × 10^8   1.39     4.65    −7.54
         se             2.19 × 10^8   0.95     0.19    0.12

                        β1       β21      β22      β23      β24
Peak 1   Estimates      0.50     0.15     0.03     0.34     0.92
         se             0.10     0.20     0.14     0.18     0.13
Peak 2   Estimates      0.94     −0.21    −0.52    0.12     −0.17
         se             0.02     0.04     0.03     0.04     0.05

se, standard error.

To check the improvement in goodness of fit of the bimodal model over a unimodal model, we also fitted a standard unimodal compound Poisson frailty model with a Weibull baseline hazard to the data. This model has a total of 9 parameters and yielded a log-likelihood of 30067.52. The bimodal model yielded a log-likelihood of 30372.82, a significant improvement over the single-peaked model by the likelihood ratio test (p-value < 0.001). A comparison of the observed and estimated incidence rates for these models illustrates this; in Figure 3, graphs of rates versus age are presented on a semilog scale. The modified multiplicative frailty model provides an acceptable fit to the data, and the improvement over the unimodal fit is clearly visible. Again, we see a better fit for North America and north and west Europe than for the other areas.

Fig. 3.

Observed (discrete points) and estimated (continuous curves) age-specific incidence rates per 100 000 person–years for both sexes in 5 low-risk areas. Solid (dashed) line is from a bimodal (unimodal) fit. Vertical lines are included to emphasize the rates in age groups 15–24 and 65–74.
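As a worked check of the likelihood ratio comparison above (R): the difference in degrees of freedom is taken here as the 18 parameters of the aggregated-period bimodal model (Table 3) minus the 9 of the unimodal model, which is an assumption of this sketch, but the conclusion p < 0.001 is insensitive to the exact value.

    lrt <- 2 * (30372.82 - 30067.52)               # likelihood ratio statistic, 610.6
    pchisq(lrt, df = 18 - 9, lower.tail = FALSE)   # p-value far below 0.001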

In Figure 4, we have plotted the estimated proportion of susceptible males and females per 100 000 person–years, with error bars giving the 95% confidence intervals. These intervals are log transformed since the proportions of susceptible individuals are relatively small and the coefficients of variation for these values are relatively large. In all 5 aggregated low-risk areas, for both peaks, there is a higher frailty proportion among males than among females, reflecting the higher incidence among males. In peak 1, North America has the lowest proportion of frail individuals and India the highest; correspondingly, the hazard ratio at age 19.5 gave a significantly higher risk for India than for North America. In the second peak, north and west Europe has the lowest proportion of frail individuals and Australia the highest.

4.DISCUSSION

The principal finding of the present study is that NPC incidence rates in low-risk populations are well described by a bimodal frailty model, in both males and females diagnosed over the period 1983–1997. It is necessary to discuss the relevance of the assumptions of the model, since other models built on an alternative set of assumptions may also fit the data.

The key assumption of a frailty model is that only a certain proportion of individuals are susceptible to developing NPC at a given age during their lifetime. Both genetic and environmental factors contribute to the development of this disease. The link between NPC and EBV is well known (Chang and Adami, 2006), (Hildesheim and Levine, 1993). EBV belongs to the herpes virus family and is one of the most common human viruses. The virus is ubiquitous worldwide, and many individuals are infected during their lifetime. Only a small proportion of individuals develop NPC, however, so EBV is not a sufficient cause of NPC. In high-risk populations, where undifferentiated carcinomas or lymphoepitheliomas (type-III NPC tumors) are common, genetic events appear to occur early in NPC pathogenesis and may cause predisposition to subsequent EBV infection. It may be speculated that EBV is a necessary factor for those histological types of NPC where stable infection of epithelial cells by EBV requires such an altered, undifferentiated cellular environment (Lo and Huang, 2002), (Young and Rickinson, 2004).

In the low-risk settings studied here, however, type-I tumors (keratinizing squamous cell carcinomas) dominate, particularly at older ages (the late peak in age incidence), and there is an inconsistent relationship between EBV infection and the development of these tumors (Chang and Adami, 2006).

Genetic and/or other environmental cofactors must additionally contribute to the risk of NPC. The first peak, in individuals diagnosed in late adolescence or early adulthood, would imply a role for germline mutations (major genes) and gene polymorphisms (minor genes); see Chan and others (2005) and Bray and others (2008). EBV infection seems likely to contribute to NPC in this young age group (Ayan and others, 2003), where type-III cancer is the more commonly diagnosed type (linked with the early peak in age incidence). The second, later peak relates more to lifestyle-related risk determinants, including tobacco and alcohol consumption and, more speculatively, occupational exposures to carcinogens such as formaldehyde (Chang and Adami, 2006).

Another assumption of our model is independence between the frailties. This assumption simplifies the model, for example the population survival function in (2.3). Usually, bimodal age-incidence curves are the integrated effect of 2 different underlying unimodal population distributions, corresponding to the early and late peaks. In such cases, the 2 distributions tend to represent different aetiologies, as discussed for Hodgkin's lymphoma (MacMahon, 1966). In this instance, it seems reasonable that the 2 peaks of the NPC age-incidence curves differ substantially in terms of aetiology. This argument, together with the fact that the distance between the age intervals for the peaks is large, makes the assumption of independence sensible. If EBV infection is a common factor in the pathway of both populations, the shape of the age-incidence curves may also be influenced by the timing of events, including age at infection, specific genetic events, and, possibly, environmental exposures.

The underlying assumption of the frailty modeling is the mechanistic understanding of cancer as a result of accumulated genetic damage, generally formulated as the multistage clonal expansion model of carcinogenesis. The biological interpretation of the k parameter is the number of genetic and epigenetic events required, on average, for a cell to become malignant (Armitage and Doll, 1954), although this interpretation should be treated with caution for more complex multistage models. In previous frailty model studies, the estimated parameter values have been in accordance with current knowledge regarding carcinogenesis of the specific neoplasm, that is, testicular cancer (Aalen and Tretli, 1999) and colorectal cancer (Svensson and others, 2006). In the current study, the estimated k-values of 2.5 and 4.7 for the first and second peaks, respectively, compare with a k-value of 3.0 from a previous simulation study on a sample of low-risk western populations reported by Doll (1971). At that time, the uniformity of bimodality among NPC cases in low-risk populations was certainly not recognized, and Doll's estimate (assuming a unimodal distribution) lies between our estimates derived using a bimodal distribution.

The k parameter of 2–3 for the early peak in the age-incidence curve may be interpreted biologically as a reflection of the 2 crude 'hits' in carcinogenesis, that is, the genetic alterations involving major or minor susceptibility genes and a promoting effect of EBV infection. The pathogenesis leading to the late peak in the age-incidence curve is thought to be related more to the effect of environmental carcinogens, possibly interacting with EBV infection; this is illustrated in Figure 4 of Bray and others (2008). It is quite plausible that environmental cofactors in the population as a whole may provide (on average) 2 more 'hits', for example, loss of heterozygosity in certain genes and/or other genetic changes as described by Young and Rickinson (2004) and Chan and others (2005).

Earlier studies have concluded that, among individuals who acquire the disease, males outnumber females by a factor of 2–3 (Chang and Adami, 2006), (Hildesheim and Levine, 1993). We have found a similar increased risk for males compared with females up to the second peak (t = 69.5 years). A general explanation could be the tendency for less favorable smoking and alcohol consumption patterns among males. The nearly doubled risk for susceptible males up to the first peak (t = 19.5 years) is intriguing but not readily explained given present knowledge.

Finally, the bimodal frailty model developed in this paper was applied to NPC age incidence to examine susceptibility among low-risk populations. However, the model may be applied to any disease condition for which bimodality of the age-occurrence pattern can be demonstrated at the population level. For cancer, such a phenomenon is not unique to NPC; a number of cancer forms exhibit 2 peaks in incidence rates, each followed by a decline, and a frailty approach to their study would certainly seem warranted. Examples from cancer often involve a putative early viral component. These include Hodgkin's lymphoma, which has long been established as bimodal (MacMahon, 1966), with a relatively high proportion of cases occurring in adolescents and young adults, particularly in higher-resource countries. More recent candidates include hairy cell leukaemia (Dores and others, 2008), female breast carcinoma (Anderson and others, 2006), and Ewing's sarcoma (Cope, 2000).

FUNDING

Statistics for Innovation (sfi)2 to M.H.

We are grateful to the population-based cancer registries worldwide that submitted their data to successive volumes of Cancer Incidence in Five Continents. The authors thank Bjarte Aagnes at the Cancer Registry of Norway for providing the data. Conflict of Interest: None declared.

Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning the chromosomal region 1 single nucleotide polymorphism (SNP) at a time may not fully exploit linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters and thus potentially lose power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes by 1 SNP at a time in a stepwise fashion and comparing the prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.

Adaptive regression; Haplotype; Multilocus analysis; SNP

1.INTRODUCTION

Owing to the availability of high-throughput genotyping technologies and the comprehensive coverage of common genetic variants by the HapMap project (The International Hapmap Consortium, 2005), (The International Hapmap Consortium, 2007), association studies are widely used to dissect the genetic basis of complex diseases in a scope varying from a number of candidate genes to the whole genome. A typical association study involves initial prioritization of single nucleotide polymorphism (SNP) genotypes in a small subsample or a selection of tagSNPs derived from an existing database such as the HapMap project. These tagSNPs are subsequently genotyped for a sample of cases and controls (Smith and others, 2007). While the causal variants may not be interrogated directly, it is hoped that linkage disequilibrium (LD) mapping could narrow the search down to a small neighborhood around the causal variants. However, despite the explosion of genetic information available, challenges remain for statistical analyses due to the diversity of LD patterns in the human genome (The International Hapmap Consortium, 2005), the sheer number of SNPs being genotyped, and the complex nature of common disorders. Currently, the single-SNP scan and multiple-SNPs haplotype analyses are 2 commonly used approaches. The power comparison between these 2 approaches is somewhat inconclusive, as it depends on underlying disease models and local LD patterns (Morris and Kaplan, 2002), (Roeder and others, 2005). It has been suggested that a single-SNP scan is an effective method to detect common disease alleles, while haplotype-based methods are useful to map more recent, relatively rare mutations (Lin and others, 2004), (Schaid, 2004), though strategies to construct informative haplotypes (clusters) are far from mature. This paper pertains to adaptive SNP/haplotype analysis exploiting LD among SNPs in a candidate chromosomal region.

When many SNPs in a targeted chromosomal region are under investigation, a naive haplotype analysis using all SNPs is often ineffective due to the large number of haplotypes and hence too many degrees of freedom in an omnibus test. Instead, one may first divide SNPs into haplotype blocks of high LD and then perform a haplotype analysis within each block (Barrett and others, 2005). However, the block definition itself is arbitrary, and, typically, substantial correlation between blocks is not captured. An alternative strategy is to construct a genealogical tree of haplotypes, known as a cladogram, and study the correlation between the disease phenotype and the clusters (clades) of haplotypes, thereby reducing the dimensionality of haplotype analyses (Templeton and others, 1987), (Seltman and others, 2001), (Molitor and others, 2003), (Durrant and others, 2004), (Morris, 2006). The motivation is that the causal allele should be embedded within the cladogram that describes the evolution of the sampled chromosomes. However, an accurate construction of the underlying cladogram typically relies on the assumption that there is no recombination. This is hardly true for any given region because of background recombination in the human genome, particularly for regions near or within recombination hot spots. To this end, a sliding window approach was proposed in the hierarchical clustering algorithm CLADHC (Durrant and others, 2004), yet the optimal window size cannot be universal due to the diversity of local LD across the human genome. Even in an extreme scenario with complete LD, it has been pointed out that cladistic approaches cannot be optimal in all disease models (Clayton and others, 2004), since the rule for clustering haplotypes is based solely on genotypic data.

Other strategies for multilocus analyses exist (e.g. Browning, 2006, Yu and Schaid, 2007, Li and others, 2007). These methods generally assume that local LD structures are somewhat contiguous, so the order of SNP locations is critical. However, SNPs that are far apart can display strong LD, so a contiguous scan might miss signals. Similarly, multiple nonsynonymous mutations in a gene may jointly disrupt the function of its coded protein, possibly with interactions, regardless of their order on the chromosome. Furthermore, none of the aforementioned methods accounts for the extra variability incurred by phase ambiguity in the model-searching process, except the computationally intensive MCMC approach (Morris, 2006).

In this article, we propose SNP-Haplotype Adaptive REgression (SHARE), an adaptive algorithm that searches for a subset of SNPs that fully captures the genetic association in a candidate chromosomal region. The selected set of SNPs is the most informative in a heuristic sense: adding more SNPs introduces noise, and excluding any SNP in the set may lose information. Contrary to the cladistic approaches, where the clustering process depends solely on haplotypes, in our algorithm both the trait and the genotypes guide the model selection process, and the SNP selection does not depend on the order of the SNPs. Depending on the genealogy and the ancestral recombination among disease liability mutations and markers, the most informative set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses, thereby effectively integrating a single-locus scan and a haplotype analysis into 1 unified framework. Furthermore, our algorithm stands apart from existing methods in that it accommodates phase ambiguity seamlessly by treating the inference of haplotypes as part of the procedure. The method is tailored to genetic association studies with a fair number of tagSNPs genotyped in a candidate gene approach but, as we discuss in Section 4, it can be extended to genome-wide association studies.

2.METHODS

2.1.Rationale

We use an example to introduce the main idea: there generally exists a subset of SNPs that is sufficient to capture the genetic association. Figure 1 shows the genealogical tree of 5 genotyped SNPs, labeled ABCDE, and the unscored disease susceptibility SNP X. The genotyped SNPs can be tagSNPs that preserve maximal LD information with minimum redundancy. The haplotypes based on all 6 SNPs are displayed as strings of 0s and 1s, labeled numerically. Depending on where the susceptibility SNP arises, different subsets of SNPs are required to differentiate haplotypes that do and do not carry disease risk. In Figure 1(a), X occurs before A in lineage, so 1 SNP (A) is sufficient to capture the disease risk. In Figure 1(b), the functional variant X descended from 2, generating a new haplotype that parallels 5 and 6 in lineage. Instead of including all haplotypes based on the 5 SNPs in an analysis, if we restrict the haplotype analysis to A, D, and E, then haplotype 100 carries an increased disease risk while all other haplotypes do not. In this case, a cladistic approach will collapse 2, 5, and 6 and therefore dilute the disease signal. In the presence of recombination, adjacent SNPs can have different genealogies, so the genealogy of a sample of haplotypes is usually a graph with loops rather than a tree. Figure 1(c) depicts a situation where there is recombination between 6 and 3, occurring between the fourth and the fifth locus, creating a new recombinant haplotype 8. The functional variant X later arose in 6. An inspection of SNPs before and after the break point suggests that either A and E or B and E will be adequate to discern the normal and risk-carrying haplotypes. For example, if we use SNPs A and E to construct haplotypes, the haplotype 11 carries increased disease risk, while the other 3 haplotypes do not. This example sheds light on the effect of ancestral recombination on association mapping: it weakens the LD between the functional variant and the “proxy” in its lineage, and in consequence haplotypes across the break point become useful in mapping the functional variant. This is in the same spirit as previous results showing that long haplotypes crossing the recombination break point can help to map recent rare mutations (Lin and others, 2004). Note that the SNPs selected in Figure 1 are those before and after the functional variant in evolution, thus forming an evolutionary pocket surrounding the disease variant.

Fig. 1.

An example to show that there generally exists an optimal set of SNPs for association analysis. The order of SNPs in a haplotype is ABCDE(X). (a) The disease-causal locus X occurs before A in lineage. The optimal set for genetic association is just A. (b) The disease-causal locus X occurs after A and in parallel to D, E. The optimal set for genetic association is A, D, E. (c) The disease-causal locus X occurs after E in lineage. There is recombination between haplotypes 6 and 3, generating a recombinant 8. The vertical arrow on the top of haplotype 8 points to the break point of recombination. The optimal set for genetic association contains A, E or B, E.

To find the most informative set, ideally, we would search over all possible subsets of the available SNPs using, for example, the generalized Akaike information criterion

(2.1) AICa = −2 ln ℓ + a·p,

where ℓ is the likelihood for a model, a is a penalty parameter, and p is the number of parameters in the model. The best penalty parameter can be chosen by cross-validation. In reality, however, searching all possible subsets quickly becomes infeasible as the number of SNPs grows beyond 20. We instead propose a stepwise algorithm to identify the most informative set: we sequentially select the current best set by adding or deleting 1 SNP at a time from the previous best set, thereby substantially simplifying the search paths. While stepwise algorithms have limitations, in the genetic context, where LD structure is present in adjacent SNPs, stepwise selection is a natural choice, as opposed to more elaborate searches. The rationale is that the fundamental unit of inheritance, the haplotype, is formed by sequential (stepwise) mutations during history. Recombination shuffles the haplotypes at break points (such as hot spots), but the majority of genomic regions should be highly structured. With the nearly complete coverage of whole-genome common variation by the HapMap project, it is hard to imagine that an underlying disease locus does not exhibit any marginal association at all.

Unlike a stepwise logistic regression treating SNPs as covariates (Cordell and Clayton, 2002), our algorithm iteratively constructs haplotypes based on the SNPs in the current set. If we consider the sample space to be a population of haplotypes, our algorithm resembles recursive partitioning (classification and regression trees [CART]; Breiman and others, 1984). We use the example in Figure 1(b) to illustrate this point. In Figure 1(b), a 3-SNP haplotype best captures the disease risk. One potential search path, shown in Figure 2, is that we first find SNP A as the most significant SNP by a single-locus scan, next detect the haplotypes constructed from A and D as the best 2-SNP haplotypes, and finally reach the most informative set {A, D, E}, so that a 3-SNP haplotype concentrates the disease risk. Note that adding 1 SNP partitions the sample space of haplotypes: any particular haplotype can be sent down the tree just as an observation is sent down in CART. While CART is effective at dissecting high-order interactions, growing haplotypes essentially refines high-order interactions between loci, as a haplotype effect is a linear combination of locus main effects and high-order interactions (Schaid, 2004).

Fig. 2.

The tree illustration of the sequential partition of haplotypes in Figure 1(b). The left panel shows the growing set of SNPs used in the analysis, and the right panel shows the partitions resulting from the haplotypes based on the current set of SNPs. The minimal set of SNPs that captures the genetic association is (A, D, E), with the disease risk concentrated on haplotype 100. The path leading to its discovery could be 1 → 10 → 100. The corresponding order of SNPs in the haplotypes is A → AD → ADE.

2.2.Notation

For ease of exposition, we consider a sample of n unrelated affected cases and unaffected controls. Continuous traits can be accommodated using the generalized linear model framework. Let Yi=1 if the ith individual is a case and Yi=0 otherwise, i=1,2,…,n. Let Gi=(gi1,gi2,…,gik,…,giK) be the SNP genotypes of individual i at K loci on some chromosomal region of interest, coded as 0, 1, 2 for the number of the minor alleles at the kth locus. These SNPs could be tagSNPs selected to represent genetic polymorphisms in the targeted region and some of them may be missing for some individuals. Suppose that in addition to the genetic data, we also have information on r covariates Zi=(zi1,zi2,…,zir), containing demographic and environmental factors.

Let ΩK denote the complete set of all K SNPs, and let Ωl denote the most informative set of l SNPs that adequately captures the genetic association. By definition, Ωl⊆ΩK and 0≤l≤K. When l = 0, Ωl is an empty set, and there is no genetic association in the chromosomal region. Let GΩk be the observed genetic data on a set Ωk. Assume that, in the population, there are m distinct haplotypes h1Ωk, h2Ωk, …, hmΩk based on the SNPs in Ωk, with (unknown) population frequencies pΩk = (p1Ωk, p2Ωk, …, pmΩk). If Ωk contains a single SNP, the haplotype is simply the genotype of the (single) locus; hereafter, we generalize the definition of “haplotype” to include single-SNP genotypes. For the ith individual, let HiΩk = {Hi1Ωk, Hi2Ωk} be the haplotype pair based on Ωk. We assume that the underlying probabilistic model describing the association between Yi and (HiΩk, Zi) is

(2.2) logit P(Yi = 1 | HiΩk, Zi) = α + f(HiΩk)β + Ziγ,

where f(HiΩk) is a function that delineates the haplotype effect model, α is the intercept, and β and γ are regression parameters for genetic and environmental effects, respectively. For instance, in an additive model, f(HiΩk) is a vector of m integers in {0,1,2} indicating the number of copies of each of the m possible haplotypes. In a dominant model, having 1 or 2 copies of a haplotype has the same effect; in a recessive model, only having 2 copies of the causal haplotype affects the trait. Gene–environment interactions can be added to (2.2). For the observed data, we can compute the maximum likelihood estimators of the parameters in (2.2) and obtain the fitted probability of disease. Note that we have genotypes for all SNPs (ΩK); however, we only select a subset to be used in the regression model (Ωk). The best subset Ωl, with its associated model based on (HiΩl, Zi), is selected by minimizing the prediction error. Let p̂i be the predicted probability for yi, based on the fitted model, when a new independent subject comes in. We define a loss function, namely the deviance or cross-entropy (Hastie and others, 2001), L(yi, p̂i) = −2[yi log p̂i + (1 − yi) log(1 − p̂i)], which quantifies the accuracy of p̂i. Our goal is to minimize the expected loss (or the expected prediction error) over Ωk to find the best set Ωl. This can be expressed as

(2.3) Ωl = arg min over Ωk of E[L(Y, p̂Ωk)].
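A minimal sketch of model (2.2) with an additive haplotype effect is given below (R), assuming known phase and simulated data; all names and values are illustrative. Here hap_counts holds the per-subject counts (0/1/2) of each haplotype, and the last line evaluates the deviance (cross-entropy) loss of (2.3) in-sample (cross-validation would be used for the actual model selection).

    set.seed(1)
    n <- 500
    hap_counts <- t(rmultinom(n, size = 2, prob = c(0.5, 0.3, 0.2)))  # 3 haplotypes
    Z   <- rnorm(n)                                                   # one covariate
    eta <- -0.5 + 0.6 * hap_counts[, 2] + 0.2 * Z     # haplotype 2 carries the risk
    y   <- rbinom(n, 1, plogis(eta))

    fit   <- glm(y ~ hap_counts[, -1] + Z, family = binomial)  # haplotype 1 as baseline
    p_hat <- fitted(fit)
    mean(-2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat)))      # deviance loss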

2.3.The algorithm

If we search for the most informative set by 10-fold cross-validation estimates of the above objective function, the algorithm is as follows:

For i = 1 to 10,

• In the ith training set, grow a sequence of nested sets Ω0⊂Ωi1⊂Ωi2⋯⊂ΩiM, where M is the largest number of SNPs in a candidate subset, specified by the investigator. Here Ω0 indicates a model without any genetic effect.

• In the ith training set, prune ΩiM back 1 SNP at a time to obtain a sequence of nested sets ΩiM⊃ΩiM−1′⊃ΩiM−2′⋯⊃Ω0 .

• In the ith training set, evaluate the prediction deviance for the models associated with Ω0⊂Ωi1⊂Ωi2⋯⊂ΩiM⊃ΩiM−1′⊃ΩiM−2′⋯⊃Ω1′.

Sum up the prediction deviances from the 10 cross-validations for each model path, and choose the number of SNPs, l, with the smallest prediction deviance.

Use all data to search for the model formed by l SNPs. If l was achieved in the growing stage, grow the subset up to l SNPs; if l was achieved in the pruning stage, grow the subset up to M SNPs and prune back to l SNPs.

Let mΩt be the number of haplotypes given the current best subset Ωt with t SNPs. Starting from the best single SNP, we select the next best subset containing t + 1 SNPs, Ωt+1, with the maximal statistic φ, defined as the gain in twice the maximized log-likelihood per additional parameter,

(2.4) φt+1 = 2[ln ℓ(Ωt+1) − ln ℓ(Ωt)] / (mΩt+1 − mΩt),

taken to be 0 when the number of haplotypes does not increase.

This involves fitting regression model (2.2) to each candidate subset, computing the maximal likelihood, and evaluating the φ statistic. Note that the statistic φ incorporates both the information from the LD between SNPs and the regression of the trait on the SNPs. If adding 1 SNP does not increase the number of unique haplotypes, that is the new SNP is in perfect LD with the rest of SNPs, there is no contribution in model fitting and thus φ equals 0. The maximum of φt+1 represents the largest penalty parameter in AICa (2.1), so that the model with 1 extra SNP is still preferable. For all a < max(φt+1), the set associated with max(φt+1) has the minimal AICa among all candidate sets with t + 1 SNPs.

If the best model size is 0, there appears to be no genetic association in the region of interest. The lower the prediction deviance of the final model compared with that of the null model, the stronger the evidence of association. To assess the significance of the associations, we perform a permutation test to correct for the over-optimism incurred by the greedy model-searching process. We first compute a nominal p value for the global haplotype effect in the final model using a Wald test. We then permute the trait 1000 times, independently of the genotypic data, carry out the model search for each permuted data set, and compute the nominal p value using a Wald test. When environmental factors are present, we permute the trait within strata defined by the environmental factors. Finally, the experimentwise p value is computed by comparing the observed p value with its null distribution.
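The permutation correction can be expressed generically as below (R); search_and_test is a hypothetical placeholder standing for the entire model search followed by the Wald test for the selected model, and is not a function provided by the SHARE package.

    # Experimentwise p value: compare the observed nominal p value with the nominal
    # p values obtained after permuting the trait (model search repeated each time).
    permutation_pvalue <- function(y, X, search_and_test, n_perm = 1000) {
      p_obs  <- search_and_test(y, X)
      p_perm <- replicate(n_perm, search_and_test(sample(y), X))
      mean(p_perm <= p_obs)
    }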

When the haplotype phase is unknown, as is usually the case, our algorithm treats phasing as a part of the learning procedure. The full-scale haplotype phasing is carried out only once, using all SNPs under investigation, to obtain the maximal resolution. Because each training set consists of 9/10 of the full data, the haplotype frequencies estimated from each training set are usually a slight modification of those estimated from the full data and hence can be computed rather quickly by an Expectation-Maximization algorithm (Excoffier and Slatkin, 1995). An individual can have multiple possible haplotype pairs, each with an estimated conditional probability given all genotypes. For a model associated with Ωk, the expected deviance for a subject can be expressed as

(2.5) ΣHiΩk P̂(HiΩk | GiΩK) L(yi, p̂i(HiΩk, Zi)),

where P̂(HiΩk | GiΩK) is the estimated conditional probability of the haplotype pair for the SNPs in the model given all the genotypes. For each HiΩk, this conditional probability equals the sum of the corresponding probabilities P̂(HiΩK | GiΩK), since each haplotype formed by Ωk represents a cluster of haplotypes formed by ΩK. For the final model, robust sandwich variance estimates are used to compute the nominal p value. Note that, under the null hypothesis of no genetic effect, the estimation of haplotype frequencies is independent of the estimation of the regression parameters, so using a sandwich variance estimate yields a valid test for any global genetic effect.

The core of the SHARE algorithm is written in C with an R interface. An R package can be downloaded from the first author's homepage, http://www.scharp.org/faculty/jdai, as well as from CRAN (The Comprehensive R Archive Network). Currently, a model search for settings with 10–30 SNPs and 2000 subjects takes about half a second on a Dell workstation with a 3.0-GHz processor. In situations where the best model after selection is the null model, no permutation test is required, as this clearly indicates nonsignificance. Otherwise, the permutation test can be sped up by giving up early on clearly nonsignificant results (Besag and Clifford, 1991).

3. RESULTS

3.1. Simulations

The details of the simulations are given in the supplementary document available at Biostatistics online. We compared the empirical type I errors and the power of detecting the global genetic effect in the simulated chromosomal region when a hypothesis test is performed at a type I error of 0.05. As benchmarks, we used the single-locus scan and the naive haplotype analysis using all SNPs, assuming known phase. The haplotype score test (Schaid and others, 2002) was included to evaluate the impact of haplotype ambiguity when model selection is not employed. We used CLADHC as an example of cladistic approaches in the comparison; its window size was set to 6 SNPs throughout the simulations. We performed the SHARE analysis both with and without known haplotype phase, with a maximum of 6 SNPs in the candidate sets and 10-fold cross-validation. For all methods except the 2 using full haplotypes, permutation tests were used to correct for multiple testing.

We first simulated 2 disease models using the empirical haplotype frequencies of the F11 gene in the PGA database. Among the 45 SNPs in this gene, 11 tagSNPs (2, 3, 5, 6, 9, 11, 22, 24, 30, 42, and 45) are observed in 800 cases and 800 controls. For our first model, we simulated an unscored functional variant with a minor allele frequency (MAF) of around 0.05 that is strongly tagged by 2 tagSNPs. Table 1 displays the haplotype frequencies formed by the 3 SNPs. Both tagSNPs 3 and 5 are correlated with SNP 12, and the haplotype 11 formed by these 2 SNPs perfectly predicts SNP 12. When an additive disease risk is assigned to SNP 12 using a logistic penetrance function with odds ratios (ORs) of 1.5, 1.75, and 2.0, a haplotype analysis using just SNPs 3 and 5 is expected to best capture the disease signal. In the upper panel of Table 2, all methods yield valid tests, as the type I errors are all reasonably close to 0.05. CLADHC clearly has the worst power, since the recombination between SNP 3 and SNP 5 makes it hard to construct a correct cladogram. There is a clear advantage of SHARE over the full haplotype analysis because of SNP selection. SHARE only slightly outperforms the single-locus scan, since the r2 between tagSNP 3 and SNP 12 is fairly high (0.63). Among the final models selected by SHARE with p < 0.05, the median size of the best SNP set is 2. Approximately one-third of the significant models contain only 1 SNP, again because of the marginal correlation between tagSNP 3 and SNP 12. Comparing the full haplotype test with known phase to the score test with phase ambiguity shows that phase ambiguity diminishes the power by at most 5% (for OR = 1.75). The impact of haplotype ambiguity on SHARE is even smaller: model selection leads to far fewer SNPs being used in the final model and hence far less haplotype ambiguity to be resolved.

Table 1.

The first model in the simulations based on empirical data: 2 tagSNPs display strong LD with the unscored causal locus. TagSNPs 3 and 5 are genotyped and SNP 12 is the unscored functional locus. Haplotype 11 formed by SNPs 3 and 5 perfectly tags SNP 12. The haplotype frequencies are estimated from 23 Americans of European descent in the PGA database

Haplotype    SNP 3    SNP 5    SNP 12    Haplotype frequency
1            0        0        0         0.847
2            0        1        0         0.087
3            1        0        0         0.022
4            1        1        1         0.044

Table 2.

Simulations based on empirical data: a comparison of type I errors and power for various methods under 2 disease models in 500 simulations. Standard errors are given in parentheses. For the first model, we generate data for 800 cases and 800 controls with ORs 1.5, 1.75, and 2; for the second model, we generate data for 400 cases and 400 controls with ORs 1.25, 1.5, and 1.75

Model 1†                                    Power
  Method                   Type I error    OR = 1.5        OR = 1.75       OR = 2
  Single-locus scan        0.048 (0.010)   0.318 (0.015)   0.648 (0.015)   0.914 (0.013)
  Phase known
    Full haplotype         0.052 (0.010)   0.284 (0.020)   0.590 (0.022)   0.870 (0.015)
    CLADHC                 0.056 (0.010)   0.256 (0.020)   0.546 (0.022)   0.854 (0.016)
    SHARE                  0.050 (0.010)   0.336 (0.021)   0.654 (0.021)   0.928 (0.012)
  Phase unknown
    Haplotype score        0.035 (0.008)   0.288 (0.020)   0.544 (0.022)   0.863 (0.015)
    SHARE                  0.054 (0.010)   0.326 (0.021)   0.650 (0.021)   0.900 (0.013)

Model 2‡                                    Power
  Method                   Type I error    OR = 1.25       OR = 1.5        OR = 1.75
  Single-locus scan        0.046 (0.007)   0.176 (0.017)   0.608 (0.015)   0.916 (0.012)
  Phase known
    Full haplotype         0.046 (0.009)   0.184 (0.017)   0.616 (0.022)   0.920 (0.012)
    CLADHC                 0.062 (0.011)   0.138 (0.015)   0.548 (0.022)   0.882 (0.014)
    SHARE                  0.046 (0.009)   0.182 (0.017)   0.678 (0.021)   0.952 (0.010)
  Phase unknown
    Haplotype score        0.050 (0.010)   0.158 (0.016)   0.586 (0.022)   0.900 (0.013)
    SHARE                  0.044 (0.009)   0.190 (0.018)   0.666 (0.021)   0.942 (0.010)

† The unscored disease-causing locus is best captured by haplotypes based on 2 tagSNPs.
‡ Two tagSNPs located far apart carry disease risk additively.

For the second model, we simulated 2 tagSNPs (5 and 45) contributing to the disease risk independently rather than through a haplotype effect. TagSNP 5 has MAF 0.13 and tagSNP 45 has MAF 0.087. The disease risk was imposed via a logistic function with ORs 1.25, 1.5, and 1.75, and we generated 400 cases and 400 controls. The bottom panel of Table 2 shows that SHARE yields the best power regardless of whether the phase is known. The cladistic approach has the worst power because a moving window of 6 SNPs cannot cover both tagSNP 5 and tagSNP 45; a longer window does not help much, since it is more likely to introduce recombination among the SNPs. Neither the single-locus scan nor the full haplotype analysis outperforms SHARE, since the single-locus scan does not combine the effects of the 2 loci and the full haplotype analysis uses too many parameters. Again, phase ambiguity does not substantively affect the power of SHARE. Due to small-sample variation, SHARE with phase ambiguity appears to outperform SHARE without phase ambiguity when the OR is 1.25, although this difference is not statistically significant.

We now evaluate the average performance of SHARE across a variety of models generated by sampling haplotypes based on coalescence theory (Hudson, 2002) and randomly assigning 1 relatively rare SNP (MAF ≈ 0.05) to carry the disease risk. Table 3 shows the comparison of type I error and power for the various methods over 500 simulations. The recombination rates are set to reflect regions with background recombination and regions with high recombination, such as regions near or within hot spots. When the LD is high, the median number of common haplotypes is nearly the same as the number of tagSNPs. Because the disease locus is left out before tagSNP selection, the average maximal r2 between any tagSNP and the underlying disease locus is merely 0.36, and in only 12% of the simulations is this maximal r2 larger than 0.8. In this case, the haplotype-based methods generally yield higher power than the single-locus scan, particularly when the signal is strong (OR = 2). CLADHC performs only slightly better than the naive full haplotype approach. It appears that not every tagSNP is necessarily useful in detecting association, as SHARE outperforms all other methods, with 5–15% more power, particularly when the OR is more than 1.75. Since the LD is strong, haplotype ambiguity has little effect on power, even for the naive haplotype analysis. On the other hand, the lower panel of Table 3 suggests that a high recombination rate drastically reduces the power of all methods considered. The average maximal r2 between tagSNPs and the underlying disease locus is only 0.25, and in only 4.4% of the simulations does this maximal r2 exceed 0.8. The full haplotype method yields much lower power than the single-locus scan because of the increased number of haplotypes. Although SHARE retains the best power, its advantage over CLADHC is not as large as in the high-LD scenario, suggesting that there is limited LD structure to be exploited in regions spanning recombination hot spots. In contrast to the naive haplotype analysis, phase ambiguity has little effect on the performance of SHARE in this scenario. Note that the power shown in Table 3 is the average over 500 different models. Overall, the model selection procedure of SHARE has more power than the other approaches, particularly in regions with high LD.

Table 3.

Simulations based on coalescence by ms: a comparison of type I errors and power for various methods in 500 simulations. Standard errors are given in parentheses. High and low LD correspond to recombination rates per site per generation of 10^−9 and 10^−7, respectively. The sample consists of 1000 cases and 1000 controls

LD [#SNP†, #Hap‡]                                              Power
                    Method               Type I error    OR = 1.5        OR = 1.75       OR = 2
High [15, 16]       Single-locus scan    0.052 (0.010)   0.254 (0.019)   0.416 (0.022)   0.592 (0.022)
  Phase known       Full haplotype       0.050 (0.010)   0.242 (0.019)   0.456 (0.022)   0.648 (0.021)
                    CLADHC               0.042 (0.009)   0.246 (0.019)   0.464 (0.022)   0.678 (0.021)
                    SHARE                0.050 (0.010)   0.312 (0.021)   0.548 (0.022)   0.728 (0.020)
  Phase unknown     Haplotype score      0.034 (0.008)   0.216 (0.018)   0.462 (0.022)   0.664 (0.021)
                    SHARE                0.050 (0.010)   0.308 (0.021)   0.528 (0.022)   0.738 (0.020)
Low [15, 30]        Single-locus scan    0.038 (0.009)   0.154 (0.016)   0.316 (0.021)   0.420 (0.022)
  Phase known       Full haplotype       0.040 (0.009)   0.116 (0.014)   0.232 (0.019)   0.370 (0.022)
                    CLADHC               0.044 (0.009)   0.146 (0.016)   0.328 (0.021)   0.510 (0.022)
                    SHARE                0.042 (0.009)   0.174 (0.017)   0.382 (0.022)   0.516 (0.022)
  Phase unknown     Haplotype score      0.032 (0.009)   0.084 (0.012)   0.148 (0.016)   0.322 (0.021)
                    SHARE                0.048 (0.010)   0.160 (0.016)   0.354 (0.021)   0.498 (0.022)

† The median number of tagSNPs in the 500 simulated data sets.
‡ The median number of common haplotypes in the 500 simulations, where common haplotypes are defined as those with frequencies larger than 1%.

3.2. Data application

We used SHARE to re-analyze data from a published case–control genetic association study (Smith and others, 2007). This study investigated the association of common genetic variation in 24 coagulation, anti-coagulation, fibrinolysis, and antifibrinolysis candidate genes with the risk of incident nonfatal venous thrombosis in postmenopausal women. The participants were selected from a large integrated health care system in Washington State and consisted of 349 cases and 1680 controls matched on age, hypertension status, and calendar year. In the original analysis, the single-locus scan and a full haplotype analysis were applied to each of the 24 genes, assuming an additive genetic effect and adjusting for race and other matching variables (e.g. age and hypertension status). We illustrate our algorithm using the data on the tissue factor pathway inhibitor (TFPI) gene. The LD among the 5 genotyped tagSNPs in the TFPI gene is quite strong, and the highest correlation occurs between 2 adjacent tagSNPs, rs2192824 and rs2300412 (r2 = 0.4). This region has been shown to have a significant global haplotype effect (Smith and others, 2007). Figure 3 shows the prediction deviances, estimated by 10-fold cross-validation, of models with different sets of SNPs; lower deviances indicate better prediction accuracy. It is clear that there is a genetic effect in this gene, as all models yield a smaller prediction deviance than the null model. We performed a permutation test within the strata defined by the other covariates; the p value for the global null hypothesis is 0.023. It is interesting to observe that, although SNP rs2192824 alone predicts the disease status fairly well, a haplotype model based on SNPs rs2192824 and rs2300412 further improves the prediction accuracy, whereas inclusion of additional tagSNPs no longer helps. Inspecting the haplotype analysis based on SNPs rs2192824 and rs2300412 (Table 4), we found that the increased disease risk is concentrated on haplotypes "01" and "10." This pattern implies that there might be an underlying disease-causing allele that is tagged by these 2-SNP haplotypes. We searched the SeattleSNPs database for SNPs that were not genotyped in the study. Based on 23 individuals of European descent, there is a total of 54 SNPs with MAF larger than 0.05, covering a distance of 3.9 kb in this region. We found 2 SNPs (rs8676500 and rs8176531), in perfect LD with each other, that display a pattern of correlation with rs2192824 and rs2300412 similar to the results of the SHARE analysis. The highest pairwise r2 between scored and unscored SNPs is only 0.17; however, using the multilocus r2 defined in Hao and others (2007), the highest r2 jumps to 0.72. Both rs8676500 and rs8176531 are located in an intron of the TFPI gene. Although it is too early to interpret this finding, the analysis gives useful hints for future studies, for example genotyping more SNPs correlated with rs2192824 and rs2300412, such as rs8676500 and rs8176531.

Table 4.

The results of a SHARE analysis on 5 SNPs in the TFPI gene. SNPs 1–5 are rs2192824, rs2300412, rs8176597, rs8176612, and rs3771059, respectively. The left table shows the 10 haplotypes formed by all 5 SNPs. The right table shows that, using SHARE, the association was narrowed down to haplotypes based on rs2192824 and rs2300412

Haplotype    SNP 1    SNP 2    SNP 3    SNP 4    SNP 5    Frequency
1            0        0        0        0        0        0.2254
2            0        0        0        0        1        0.0018
3            0        0        0        1        0        0.0002
4            0        1        0        0        0        0.0593
5            0        1        0        0        1        0.2644
6            0        1        0        1        0        0.0003
7            0        1        1        0        0        0.0515
8            1        0        0        0        0        0.2989
9            1        0        0        0        1        0.0376
10           1        0        0        1        0        0.0605

Haplotype    SNP 1    SNP 2    Frequency    OR (95% confidence interval)
I            0        0        0.227        —
II           0        1        0.376        1.28 (1.02, 1.63)
III          1        0        0.397        1.43 (1.16, 1.87)

Fig. 3.

The prediction deviances of different models for the TFPI gene. The horizontal axis is the number of SNPs included in the sequence of best subsets during model growing and pruning. The horizontal dashed line at the top represents the deviance of a null model without a genetic effect. The vertical dashed line indicates the switch from model growing to pruning. Each deviance is calculated from a model with haplotypes constructed from the SNPs in the corresponding set. A lower deviance indicates better model prediction.

This data example shows a further benefit beyond the power enhancement that we focused on in the simulations: a parsimonious model with fewer SNPs helps us hunt for the "true" underlying "causal" mutation. The naive haplotype analysis using all 5 tagSNPs yields a significant association for a global genetic effect, yet it is not clear how to pursue this analysis further based on all 10 haplotypes (shown in Table 4). On the other hand, if we are willing to take the most likely haplotype pair for subjects with haplotype phase ambiguity, we can perform a CLADHC analysis, which here suggests that the best partition of the 10 haplotypes is formed by 2 clusters: a cluster with haplotypes 8 and 9 and a cluster with the remaining haplotypes. However, it is not clear how to interpret and follow up on the clusters produced by CLADHC.

4. DISCUSSION

Many strategies have been proposed to perform haplotype-based multilocus analyses. The majority focus on how to cluster haplotypes given a set of predefined SNPs; there have been only a few attempts to select the SNPs that form the haplotypes. For instance, to evaluate the coverage of the Affymetrix GeneChip and Illumina BeadChip on the HapMap project data, Pe'er and others (2006) used multimarker predictors that capture an additional 9–25% of SNPs in the ENCODE region or in the HapMap Phase II data; the search for multimarker predictors is limited to SNPs in strong LD. Alternatively, it has been proposed to exhaustively search for windows of contiguous SNPs that form haplotypes (Lin and others, 2004); the computational burden can be insurmountable for large regions, and it is not clear whether considering only contiguous SNPs in a window is an effective strategy. In this article, we propose a novel strategy to select the most informative set of SNPs, which in turn forms the basis for the haplotype analysis. The advantage of the SHARE algorithm is its adaptivity: it exploits both the LD structure and the underlying disease-generating mechanism, and the addition or deletion of SNPs need not be contiguous. In a variety of simulation settings, SHARE consistently outperforms existing methods.

Imputation procedures have been shown to be useful for capturing the association of a phenotype with unmeasured genotypes (e.g. Nicolae, 2006; Servin and Stephens, 2007). Rather than choosing haplotypes that "tag" untyped variants, as implemented in the SHARE algorithm, these procedures impute the "missing" (untyped but cataloged) variants based on the LD estimated from public databases (e.g. the HapMap project). For common alleles already cataloged in these databases, the imputation procedures are likely to be powerful since they incorporate external information. For rare alleles (MAF ≤ 5%), the imputation methods could miss the signal (depending on the size of the database), whereas our method may still capture it by constructing rare haplotypes. Furthermore, if there are multiple loci with interactive effects on a disease phenotype, for example multiple nonsynonymous mutations, a haplotype analysis with model selection can be useful to integrate the joint and interactive effects of the multiple loci.

Our method is motivated by, but not limited to, candidate gene studies. To scale our haplotype analyses up to genome-wide association studies, our strategy is to first divide the genome into long haplotype blocks and then perform an adaptive haplotype analysis in each block. We define blocks as long chromosomal regions between recombination hot spots, which may stretch several hundred kb and contain a fair number of tagSNPs (10–50). This is a rather loose criterion compared with the definition of haplotype blocks used in Gabriel and others (2002) or Wang and others (2002). In phase II of the HapMap project, 32,996 recombination hot spots were identified, of which 68% are localized to a region of ≤5 kb; the spacing between adjacent hot spots is 100–200 kb on average (The International HapMap Consortium, 2007). A permutation test at the genome-wide level would be computationally demanding; however, it is unnecessary, since there should be at most a handful of blocks that suggest genetic association. In the supplementary material (available at Biostatistics online), we discuss strategies to reduce the computation involved in assessing genome-wide significance. Further details on extending SHARE to a version that can handle genome-wide association studies will be pursued in future work.

Leptospirosis is the most widespread zoonosis throughout the world and human mortality from severe disease forms is high even when optimal treatment is provided. Leptospirosis is also one of the most common causes of reproductive losses in cattle worldwide and is associated with significant economic costs to the dairy farming industry. Herds are tested for exposure to the causal organism either through serum testing of individual animals or through testing bulk milk samples. Using serum results from a commonly used enzyme-linked immunosorbent assay (ELISA) test for Leptospira interrogans serovar Hardjo (L. hardjo) on samples from 979 animals across 12 Scottish dairy herds and the corresponding bulk milk results, we develop a model that predicts the mean proportion of exposed animals in a herd conditional on the bulk milk test result. The data are analyzed through use of a Bayesian latent variable generalized linear mixed model to provide estimates of the true (but unobserved) level of exposure to the causal organism in each herd in addition to estimates of the accuracy of the serum ELISA. We estimate 95% confidence intervals for the accuracy of the serum ELISA of (0.688, 0.987) and (0.975, 0.998) for test sensitivity and specificity, respectively. Using a percentage positivity cutoff in bulk milk of at most 41% ensures that there is at least a 97.5% probability of less than 5% of the herd being exposed to L. hardjo. Our analyses provide strong statistical evidence in support of the validity of interpreting bulk milk samples as a proxy for individual animal serum testing. The combination of validity and cost-effectiveness of bulk milk testing has the potential to reduce the risk of human exposure to leptospirosis in addition to offering significant economic benefits to the dairy industry.

Keywords: Bayesian; Latent class analysis; Leptospirosis

1. INTRODUCTION

Leptospirosis is the most widespread zoonosis throughout the world (Meites and others, 2004) and carries with it implications for both human and animal health. Human mortality from severe disease forms, Weil's disease and severe pulmonary hemorrhage syndrome, is high with mortality rates in excess of 10% and 50%, respectively, even when optimal treatment is provided (McBride and others, 2005). In regard to animal health, leptospirosis is one of the most common pathogen-related infections responsible for reproductive losses in cattle worldwide (Grooms, 2006) and is associated with significant economic costs to the dairy farming industry (Bennett and Ijpelaar, 2005). Studies from Ireland (Leonard and others, 2004) found that infection with Leptospira interrogans serovar Hardjo (L. hardjo), the most prominent strain found in Europe, was present in some 79% of 347 dairy herds sampled. Screening programs are commonplace across Europe and the United States and use a range of laboratory testing kits for detecting the presence of leptospira antibodies in either serum or milk.

Leptospirosis can be effectively controlled by annual vaccination. The risk of zoonosis is a main motivating factor behind full herd vaccination and has resulted in significant decreases in occupationally acquired infection (Thornley and others, 2002). Herds are tested for the presence of L. hardjo antibodies by either (i) sampling from a bulk milk tank, comprising a joint contribution of milk from multiple animals in a herd, or else (ii) serum samples taken from individual animals. The ability to predict herd prevalence through bulk milk sampling is economically very attractive to dairy producers compared with the significant veterinary and laboratory costs involved with individually testing all animals in a herd. Scottish Agricultural College (SAC) veterinary services use the Ceditest L. hardjo enzyme-linked immunosorbent assay (ELISA) kit for diagnostic testing of exposure to L. hardjo in milk and serum. The kit protocol provides cutoffs, which classify the presence of L. hardjo antibodies in either sera or milk as negative, inconclusive, or positive. The recommendation following an inconclusive bulk milk result is that follow-up serum testing of all individual animals may be appropriate.

Using bulk milk samples collected from 12 unvaccinated farms and corresponding serum samples from all animals contributing to each of the bulk milk tanks, our goal was to develop a robust statistical model, which predicts the proportion of animals exposed to L. hardjo in a herd conditional on the concentration of L. hardjo antibodies present in the herd's bulk milk sample. For a number of years, SAC veterinary services have provided such model-based predictions as part of their diagnostic testing service. This value-added interpretation has proved popular and appears to be of significant value to dairy farmer clients. The modeling subsequently discussed is an attempt to improve the approach currently used by removing the crucial, yet unsupported, assumption that the serum ELISA test currently used is sufficiently accurate to be considered error free in its predicted classifications. Introducing even a small probability of error into the model of this diagnostic test could greatly affect our confidence in its predictions. We are unaware of any peer-reviewed work that supports high accuracy of the Ceditest L. hardjo kit and it is not validated as a gold standard test by the OIE—the World Organisation for Animal Health—which validates and certifies all animal diagnostic tests used in the European Union.

Through our analyses, we aim to provide improved guidance on the interpretation of bulk milk test results with a view to avoiding unnecessary follow-up serum testing of individual animals. Additional by-products of this work are estimates of the sensitivity and specificity of the Ceditest L. hardjo serum ELISA kit, which are of general interest as this is a commonly used ELISA test.

2. DATA AND METHODS

Bulk milk samples were collected from 34 dairy herds from distinct farms across Scotland and screened using the Ceditest indirect ELISA for Leptospira interrogans serovar Hardjo antibodies. The ELISA uses antigens to capture and quantify the amount of target antibody present in a serum or milk sample. The test results in a color reaction measured in terms of optical density values by an ELISA reader. Optical densities provide a numerical quantification of the amount of L. hardjo antibody present in the sample being tested. The final numerical output is a standardized percentage positivity (PP) relative to a fixed reference sample (PP = optical density of sample being tested/optical density of reference sample). To ensure robustness, each test result is required to meet extensive validation criteria set out in the manufacturer guidelines in addition to SAC internal standard operating procedures for laboratory quality assurance. The recommended Ceditest interpretation for PP from sera is PP < 20%—negative for L. hardjo specific antibodies, 20% ≤ PP ≤ 45%—inconclusive, and PP > 45%—positive. The Ceditest interpretation for PP from bulk milk is PP < 40%—negative for L. hardjo specific antibodies, 40% ≤ PP ≤ 60%—inconclusive, and PP > 60%—positive.
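For reference, the kit cutoffs just quoted can be encoded directly; the small R helpers below (function names are ours) classify PP values for sera and bulk milk accordingly.

```r
# Classify Ceditest percentage positivity (PP) values using the cutoffs quoted
# above; the function names are ours, the cutoffs are the kit interpretation.
classify_serum_pp <- function(pp)
  ifelse(pp < 20, "negative", ifelse(pp <= 45, "inconclusive", "positive"))

classify_bulk_milk_pp <- function(pp)
  ifelse(pp < 40, "negative", ifelse(pp <= 60, "inconclusive", "positive"))

classify_serum_pp(c(15, 30, 50))      # "negative" "inconclusive" "positive"
```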

Our initial study of 34 herds collected only bulk milk samples, as opposed to matched samples from the bulk milk tank and all individual animals contributing to this tank. From these 34 herds, a subset was selected for follow-up whole-herd serum testing. The empirical distribution of bulk milk PP values across all 34 herds was stratified into 3 blocks, and from within each of these blocks, farms were recruited on the basis of practical considerations such as geographical location and perceived willingness of the farmer to take part in the study. A total of 12 herds was recruited for matched bulk milk and whole-herd serum testing. Section A in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) compares the empirical bulk milk PP distribution from all 34 herds in the initial study with the follow-up subset of 12 herds, with the latter appearing representative of the former.

A single sample of blood was taken from each individual animal and a single milk sample was collected from each herd's bulk milk tank. The total volume of each serum/bulk milk sample collected was more than that required to fill a single well in the ELISA plate. This was to enable laboratory quality control measures where each serum/bulk milk sample is assayed in duplicate to ensure the ELISA kit is functioning properly. The bulk milk PP values reported in Table 1 are the means across both replicates as is typical laboratory practice. All 12 bulk milk samples met all quality control criteria. A small number of tests on sera from individual animals showed unusually large variation between replicates and were discarded (there was no evidence to reject an assumption of independence of discarded samples from farms or laboratory batches).

Table 1.

Observed number of animals in each herd, which tested negative or nonnegative for Leptospira interrogans serovar Hardjo using Ceditest serum ELISA test, and the bulk milk PP from each herd using Ceditest milk ELISA test. Serum PP cutoff criteria were as per test manufacturer guidelines: 0 ≤ negative < 20, 20 ≤ inconclusive ≤ 45, and positive > 45. Due to the very small number of animals testing inconclusive (3% of total), this category was combined with the positive category

Farm     Bulk milk PP    No. negative cows    No. inconclusive/positive cows
1        14.28           200                  2
2        17.78           51                   0
3        20.35           125                  1
4        34.12           47                   1
5        45.50           52                   1
6        73.99           64                   5
7        80.68           107                  9
8        109.60          19                   27
9        115.08          21                   56
10       121.36          35                   32
11       122.83          28                   55
12       144.27          5                    36
Total                    754                  225

The Ceditest L. hardjo ELISA for sera provides 3 categories of diagnosis; however, due to the paucity of inconclusive responses, we present analyses in which inconclusive and positive responses are collapsed into a single combined class; in total, only 3% of animals tested inconclusive, compared with 77% negative and 20% positive. There is a good practical rationale for collapsing the inconclusive and positive categories. From the perspective of the dairy farmer, either a sufficiently high level of disease is present in the herd for action to be required, or else the herd is considered disease free, that is, the level of disease is estimated to be sufficiently low to be ignored on both economic and welfare grounds, in which case no action is required. Therefore, a pragmatic and conservative approach given the available data is to combine inconclusive and positive responses into a single nonnegative class.

Using Bayesian inference, we develop a generalized linear model with a single binary response denoting the presence (or absence) of exposure to disease, as indicated by a positive test for the presence of antibodies in each individual animal based on serum test results, with bulk milk PP as an explanatory covariate. We adopt a Bayesian approach as this provides a robust and numerically tractable way of fitting our model to data given the presence of latent variables (the unobserved diagnostic test error and herd prevalence, as discussed later). Bayesian methods are not the only available methodology; however, they have been shown in some cases to provide more stable estimates than alternative techniques, such as the expectation-maximization algorithm (Dempster and others, 1977), in the estimation of diagnostic test accuracy when the true disease state is unknown (Enoe and others, 2000). The adoption of Bayesian methods is also increasingly common in this area (Branscum and others, 2005).

Our model is intended as an aid in supporting disease management on dairy farms, and as such we require that it be as robust as possible in its predictions. Hence, we do not assume a priori that the serum ELISA is a gold standard test, and we allow for the possibility that the data may exhibit clustering and hence overdispersion. A random-effect term at farm level is considered as a means of dealing with overdispersion. Our general model is defined in (2.1)–(2.4), where Yi denotes the number of animals that tested positive for the presence of antibodies on farm i, qi is the probability of each of the ni animals independently testing positive, φi is a farm-level random effect, and f(·) denotes the link function, either f(p) = log{p/(1 − p)} or f(p) = log[−log{1 − p}].

We follow the "no gold standard" parameterization set out by Joseph and others (1995), hence q = pS + (1 − p)(1 − C), where p denotes the true within-herd prevalence of disease, S is the test sensitivity, and C the specificity. These are the latent variables in our model which, given sufficient degrees of freedom, are estimated indirectly from the data. We refer to p as the true within-herd prevalence, distinguishing it from q, the probability that an animal will test positive. We use noninformative priors for all parameters, specifically β∼N(0,1000), θ∼N(0,1000), S∼U(0,1), and C∼U(0,1), where N(μ,σ2) denotes a Gaussian density with mean μ and variance σ2, and U(a, b) is a uniform density on the interval (a, b). The standard deviation of the farm-level random effect, σf, was given a U(0, 100) prior. Posterior distributions for all model parameters were estimated using an implementation of the slice sampler of Neal (2003), written in C using the GNU scientific library (Galassi and others, 2006). This code was validated against a range of models for which the posterior distributions could be calculated analytically or were known from existing published studies. Many chains were run using different initial seeds, and the typical burn-in period appeared to be very short (several thousand iterations) for all models examined. All results presented are based on output from chains that were run for a large number of iterations, with output from multiple chains combined so that all parameter estimates were based on samples with effective sample sizes (for a definition, see Gelman and others, 2004) of at least 10 000. Our justification for this is 2-fold: (i) there exists very high correlation between the intercept and gradient parameters, θ and β, respectively, which can result in slow traversal of the parameter space; and (ii) due to the nature of our models, there is the potential for complex mixing, specifically oscillation between multiple stationary distributions, as discussed below.
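Because the display equations (2.1)–(2.4) did not survive extraction, the following LaTeX sketch records one plausible reading of the hierarchy described in the text; it combines the binomial counts, the Joseph and others (1995) parameterization, the link function with bulk milk PP as covariate, and a Gaussian farm-level random effect, and should be read as a reconstruction under assumptions rather than the authors' exact displays.

```latex
% A plausible reconstruction (ours) of the model hierarchy described in the text,
% not the authors' exact equations (2.1)-(2.4):
\begin{align}
Y_i \mid q_i &\sim \mathrm{Binomial}(n_i, q_i), \\
q_i &= p_i S + (1 - p_i)(1 - C), \\
f(p_i) &= \theta + \beta\,\mathrm{bulkmilk}_i + \varphi_i, \\
\varphi_i &\sim \mathrm{N}(0, \sigma_f^2),
\end{align}
% with priors $\theta, \beta \sim \mathrm{N}(0, 1000)$, $S, C \sim \mathrm{U}(0,1)$
% and $\sigma_f \sim \mathrm{U}(0, 100)$, as stated in the text.
```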

3. RESULTS

Our analysis has 2 main objectives: (i) to assess how well within-herd prevalence can be predicted from bulk milk PP and (ii) to estimate the accuracy of the Ceditest serum ELISA. It is important to note that these 2 objectives are not independent: the model developed in (i) also estimates the accuracy of the serum ELISA. Hence, our subsequent estimates of test sensitivity and specificity are conditional on the assumption that our chosen model is a good fit to the observed data. For this reason, we examine a number of competing models to identify an appropriate optimal model given the available data.

3.1. Model selection

A series of models of increasing complexity were fitted to the data to assess the statistical support for the use of bulk milk PP as a predictor of mean herd prevalence. Models with logistic and complementary log–log link functions, with and without overdispersion, were all examined. Model complexity was increased systematically from the null model (comprising only a constant term without overdispersion) up to the most general model defined in (2.1)–(2.4). As is typical in Bayesian model comparison, we used Bayes factors (Gelman and others, 2004), specifically a comparison of log marginal likelihoods, as the goodness of fit criterion when comparing the various models.

A potential complication in estimating the parameters in our various models is the presence of multiple solutions. The existence of multiple solutions can easily be explained by analogy with the Hui–Walter model (Hui and Walter, 1980), a standard model for estimating disease prevalence in the absence of a gold standard test. The Hui–Walter model has 2 optimal solutions in a maximum likelihood sense. If the set of parameters {p, S, C} represents a solution, then {1−p,1−C,1−S} is also a solution. Assessing convergence and estimation of stationary distributions in respect of the Hui–Walter model are discussed by Toft and others (2007).
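A quick numerical illustration (ours) of this label switching: the probability that an animal tests positive is unchanged when {p, S, C} is replaced by {1 − p, 1 − C, 1 − S}.

```r
# Numerical illustration (ours) of the dual solution in Hui-Walter type models:
# the probability of a positive test is identical under the two parameter sets.
apparent_prev <- function(p, S, C) p * S + (1 - p) * (1 - C)

p <- 0.30; S <- 0.85; C <- 0.98
apparent_prev(p, S, C)                       # 0.269
apparent_prev(1 - p, 1 - C, 1 - S)           # 0.269 as well
```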

Parameter estimation can become problematic if multiple solutions exist, as the sampler may jump between solutions, then requiring the disentanglement of the posterior distributions for each respective solution. Despite extensive tuning of the slice sampler parameters (see Neal, 2003) and running extremely long chains of up to 2 × 10^7 iterations, we were unable to force such jumping to occur when sampling from any of our models. Further details can be found in section B in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

Table 2 details the goodness of fit, using log marginal likelihoods, for the different models explored. We additionally include in our analyses a variant parameterization of our general model in which f(pi) = θ + β(bulkmilki + φi); we use the same noninformative priors as previously. The 2 parameterizations are mathematically equivalent; however, extensive simulations have shown that scaling the random effect term by the bulk milk regression coefficient, β, results in improved mixing when the sample size is small, giving parameter estimates with lower variances and increased goodness of fit. As the sample size increases, the 2 parameterizations give indistinguishable results, as should be the case. Section C in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) contains a comparison of the mixing of each parameterization using simulated data with parameter estimates similar to those from the observed data.

Table 2.

Model selection using log marginal likelihood as the goodness of fit criterion. Including a gradient term (bulk milk PP coefficient) greatly improves the model fit. The inclusion of a random effect term is also strongly supported; however, the precise parameterization has little effect, as does the choice of link function. Of the parameterizations explored, scaling the farm-level random effect by the gradient parameter maximizes the marginal likelihood, as does the use of a logistic link function

                                     log (marginal likelihood)
Model                                logistic        cloglog
θ                                    −528.24         −528.25
θ + β bulkmilki                      −281.05         −282.15
θ + β bulkmilki + φi                 −273.58         −274.20
θ + β(bulkmilki + φi)                −272.98         −273.11

Multiple chains were run for each model and the log-likelihood values calculated every 1000 steps, with the output from the various chains pooled (after allowing sufficient burn-in) until the combined effective sample size exceeded 10 000. To ensure robust estimation of the log marginal likelihood, we follow Congdon (2001) and divide the output into batches, calculate the harmonic mean in each batch, and then take the mean of these values. Up to 8 batches were tried, along with the median rather than the mean over batches; such variations had a negligible impact on the resulting marginal likelihood values, giving confidence in the robustness of our estimates. The difference in log marginal likelihoods between the models with and without a bulk milk term is large. Congdon (2001), table 10.1, provides guidelines on the magnitude of the differences between Bayes factors required to be notable, ranging from weak support, denoting the smallest difference between log marginal likelihoods, through to very strong support, denoting a difference in log marginal likelihoods of at least 5. The choice of link function in the various models has little effect (weak support). The inclusion of a random effect term to allow for overdispersion at the farm level has very strong support; however, the precise parameterization of this term has little effect (only weak support) on the overall model fit.
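A minimal R sketch (ours) of the batched harmonic-mean estimator just described: the sampled log-likelihood values are split into batches, the harmonic mean is computed within each batch (here on the log scale, for numerical stability), and the batch values are then averaged.

```r
# Batched harmonic-mean estimate of the log marginal likelihood (sketch, ours).
# `loglik` is a vector of log-likelihood values recorded from the posterior
# sample (e.g. every 1000 MCMC steps, after burn-in).
log_marginal_likelihood <- function(loglik, n_batches = 8) {
  batch  <- cut(seq_along(loglik), n_batches, labels = FALSE)
  log_hm <- tapply(loglik, batch, function(ll) {
    # log of the harmonic mean of the likelihoods: -log(mean(exp(-ll))),
    # computed with a log-sum-exp shift for stability
    m <- max(-ll)
    -(m + log(mean(exp(-ll - m))))
  })
  mean(log_hm)                     # average the per-batch estimates
}
```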

We have a number of models with comparable goodness of fit, and it is therefore informative to compare the predictions of mean within-herd prevalence and diagnostic test accuracy between these models. Section D in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) contains a detailed comparison. Of the parameterizations explored, scaling the farm-level random effect by the bulk milk regression coefficient maximizes the marginal likelihood, and from the comparisons between the alternative models and parameterizations, we choose as our optimal model log{p/(1 − p)} = θ + β(bulkmilki + φi). Parameter estimates for this model, including test sensitivity and specificity, are detailed in Table 3. Finally, to investigate the robustness of our chosen model, we fitted it to jackknife samples (Efron and Tibshirani, 1993) from our data set of 12 farms (see section E in the supplementary material available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). The model appears relatively robust to the choice of farms, with the exception of farm 12 when estimating the test sensitivity S: exclusion of farm 12 from the data has a substantial impact on the resulting estimate of S. This can be explained by the relative position of farm 12 in Figure 1. Generally, the higher the proportion of test-positive animals in a farm, the greater the influence it exerts on estimates of S; in this case, farm 12 also has by far the largest bulk milk PP value compared with farms 8–11, which have lower but relatively uniform bulk milk PP values. However, farms 8–11 do exhibit substantial variance in prevalence, which affects the estimation of S.

Fig. 1.

Observed data and predicted values. A comparison of the observed proportion of animals testing positive in each farm, the predicted mean proportion of positive tests using the optimal model, and the predicted mean true prevalence of exposure to disease within each herd via a latent variable.

Figure 1 shows a comparison of the observed data (the proportion of positive tests and corresponding observed bulk milk PP), fitted values from our optimal model (the predicted mean proportion of positive tests conditional on bulk milk PP), and predictions of the mean prevalence of exposure to disease in the herd (the latent variable in our model denoting true exposure status). We find that as the bulk milk PP increases, mean prevalence in the herd also increases, as must intuitively be the case.

Of particular interest is the comparison between the predicted proportion of positive tests and the mean prevalence in the context of a high specificity and moderate sensitivity. In herds with low prevalence, the proportion of positive tests was a good estimator of true prevalence; however, in herds with higher prevalence, using the serum ELISA test alone underestimates the true prevalence. This can be seen in Figure 1 in the difference between the proportion of positive tests and the true prevalence of exposure in the 5 farms with the highest prevalence. As the bulk milk PP increases, more antibodies are present in the milk tank and, by implication, more of the herd is likely to have been exposed to disease; test accuracy in detecting exposed animals therefore becomes relatively more important. Plotting the predicted proportion of positive tests against predicted mean prevalence (not illustrated) shows increasing divergence as bulk milk PP increases. This is to be expected given our estimates of test sensitivity (see Section 3.3).

Figure 2 shows the posterior distribution for the mean prevalence of exposure to disease predicted by our model. The Ceditest kit interpretation for bulk milk is that a PP of less than 40% is indicative of an unexposed herd. This is of particular interest as we find from our model that using a PP cutoff of 41% or less ensures that there is at least a 97.5% probability of less than 5% of the herd being exposed to L. hardjo.
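The 41% figure can be reproduced, in outline, from posterior samples of the chosen model; the R sketch below (ours) scans bulk milk PP values and returns the largest value for which at least 97.5% of the posterior draws of within-herd prevalence fall below 5%, drawing the random effect of a new farm from its prior.

```r
# Sketch (ours): find the largest bulk-milk PP such that, under the chosen model
# logit(p) = theta + beta * (PP + phi), at least 97.5% of posterior draws of the
# within-herd prevalence are below 5%.  `theta`, `beta` and `sigma_f` are vectors
# of posterior samples; phi is drawn from its prior for a new, unobserved farm.
pp_cutoff <- function(theta, beta, sigma_f, pp_grid = seq(0, 150, by = 1),
                      prev_limit = 0.05, prob = 0.975) {
  ok <- sapply(pp_grid, function(pp) {
    phi  <- rnorm(length(theta), 0, sigma_f)       # new-farm random effect
    prev <- plogis(theta + beta * (pp + phi))      # posterior prevalence draws
    mean(prev < prev_limit) >= prob
  })
  max(pp_grid[ok])                                 # largest PP meeting the criterion
}
```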

Fig. 2.

Posterior distribution for mean prevalence in the optimal model.

3.3. Serum ELISA test accuracy

Figure 3 shows estimates of the posterior densities for the serum ELISA specificity (C) and sensitivity (S) from our optimally fitting model (see Table 3 for 95% confidence intervals). The test is extremely good at correctly predicting unexposed animals; however, there is considerably more uncertainty regarding the correct classification of exposed animals. The uncertainty in our estimate of sensitivity could be due, at least in part, to the relatively small proportion of animals that tested nonnegative.

We have developed a method to predict the within-herd prevalence of exposure to an important endemic and zoonotic pathogen using estimates of bulk milk PP. Commonly used ELISA kits were used for both milk and serum testing, with observed data on 979 animals split across 12 bulk milk samples, each collected from a distinct farm. A Bayesian latent variable generalized linear mixed model was used to estimate the accuracy of the serum ELISA test and to provide a robust predictive model. Our goal was to provide a method able to evaluate the bulk milk interpretation given in the test manufacturer's guidelines on a "live" data set and to provide additional value in terms of robust predictions of within-herd prevalence of exposure to disease.

The test interpretation guidelines from Ceditest for their L. hardjo ELISA in bulk milk state that a PP of less than 40% is indicative of an unexposed herd. We estimate, with 97.5% probability, that less than 5% of a herd is exposed if the PP is less than or equal to approximately 41%. The latter interpretation is an entirely reasonable and practical measure of an unexposed herd. The consistency between this interpretation and the interpretation provided with the ELISA kit, given that each was based on different data sources and different estimation methods, provides very strong evidence in support of bulk milk testing as a means of identifying herds that have not been exposed to disease.

Our method provides value-added interpretation in the form of predictions of the mean prevalence of exposure to disease conditional on bulk milk PP. However, these predictions do suffer from a relatively high degree of uncertainty (see Figure 3), a significant contributor being the large observed variation in the proportion of positive tests between farms with relatively similar bulk milk PP values (see farms 9 and 10 in Table 1). Despite extensive checks, we were unable to identify satisfactory reasons for these variations. This additional variance between farms necessitated the inclusion of sizeable random effects in our model and thus a loss in predictive precision.

We hope that our analyses will further support the use of bulk milk testing as an effective disease surveillance tool and encourage rigorous statistical validation of commonly used diagnostic tests.

FUNDING

This work was supported by Schering–Plough, who sponsored the data collection and veterinary fieldwork. Statistical analyses were undertaken as part of the Scottish Government funded Centre of Excellence in epidemiology, population health, and infectious disease control. Biomathematics and Statistics Scotland and the Scottish Agricultural College both receive financial support from the Scottish Government (RERAD). Funding to pay the Open Access publication charges for this article was provided by the Scottish Agricultural College.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.


The authors thank 2 anonymous reviewers and the editor for constructive comments that have improved the article. Conflict of Interest: None declared.

Interval-censored longitudinal data taken from a Norwegian study of individuals with Parkinson's disease are investigated with respect to the onset of dementia. Of interest are risk factors for dementia and the subdivision of total life expectancy (LE) into LE with and without dementia. To estimate LEs using extrapolation, a parametric continuous-time 3-state illness–death Markov model is presented in a Bayesian framework. The framework is well suited to allowing for heterogeneity via random effects and to investigating additional quantities computed from the model parameters. In the estimation of LEs, microsimulation is used to take the random effects into account. Intensities of moving between the states are allowed to change in a piecewise-constant fashion by linking them to age as a time-dependent covariate. Possible right censoring at the end of the follow-up can be incorporated. The model is applicable in many situations where individuals are followed over a long time period. In describing how a disease develops over time, the model can help to predict future need for health care.

Many population studies now have longitudinal follow-up and mortality information that can be combined to investigate transitions between health and ill health prior to death. Multistate models can be used to describe these transitions when states reflect health status and a death state is included. In describing how a disease develops over time, the models can help to predict future need for health care. Examples of applications are studies where patients are followed after a surgical operation, studies where stages of a disease are monitored after an infection, and longitudinal epidemiological studies. Our research was initiated by a Norwegian study into dementia and survival among patients with Parkinson's disease (Buter and others, 2008). In this study, the health states are “no dementia, dementia, and death,” with the assumption that transitions from dementia to no dementia are not possible. Individuals with Parkinson's disease are more likely to develop dementia than individuals without the disease (see, e.g. de Lau and others, 2005). For individuals with Parkinson's disease, the onset of dementia is an important predictor of health and need for care.

This paper presents Bayesian inference for a continuous-time 3-state illness–death Markov model. The intensities of moving between the states are related to covariates and random effects. By including age as a time-dependent covariate, time dependency of intensities can be taken into account in a piecewise-constant fashion. Relaxing the assumption of constant intensities makes the model applicable in many situations where individuals are followed over a long time period. One way to make use of the random-effect structure is to model possible heterogeneity that is not captured by observed covariates. The model in this paper assumes that transition times are interval censored except for known death times. Right censoring at the end of follow-up can be incorporated.

In addition, we show how to apply the Markov model to estimate life expectancy (LE). Using a parametric model for the time-dependent intensities makes it possible to extrapolate the model beyond the follow-up time and to estimate how many years of total LE will be spent in a disease state and which risk factors are important. In the presence of a random-effect structure, microsimulation can be used to estimate LE.

For the Norwegian study, Buter and others (2008) fitted a 3-state fixed-effect model using maximum likelihood estimation. They presented LEs conditional on baseline state and used a nonparametric bootstrap to estimate the variance of the estimated LEs. The Bayesian approach makes it easy to incorporate random effects, and there is no need for an additional stage for the estimation of the distribution of LEs because the approach is well suited to investigate derived variables from posterior distributions of the model parameters. In addition to the LEs conditional on baseline state, we estimate marginal LEs that do not require a specified baseline state but instead require the distribution of the baseline state.

One of the first publications on Bayesian inference for continuous-time multistate models is Sharples (1993). In recent years, additional work has been presented by, for example, Welton and Ades (2005) and Pan and others (2007). Our work can be seen as an extension to the Bayesian inference in Pan and others (2007) in that we allow for time-dependent intensities and exact death times.

Laditka and Wolf (1998) used microsimulation to estimate active LE given a fixed-effect discrete-time Markov model. The variance of estimated LE, however, was not discussed. To estimate the variance of quantities that are derived by using microsimulation, first- and second-order uncertainty have to be distinguished (Halpern and others, 2000). We will show how this is handled in our setting.

An important aspect of the data that determine the model chosen is whether or not transition times are known. For Bayesian inference given a multistate model with known transition times, see Kneib and Hennerfeind (2008). In our setting, as is common in many studies, known transition times are not available except for the transition into the death state. The disease status of the individuals is observed at prescheduled interviews, and as a consequence, observations are interval censored. Our multistate model is defined using transition intensities, but the likelihood contributions of observed time intervals are defined using transition probabilities. In this way, we account for the interval censoring in line with frequentist multistate models (see, e.g. Kay, 1986).

The paper is organized as follows. Section 2 introduces the model, and Section 3 shows how to estimate LEs given the model with random effects. In Section 4, an extension of the model is presented to take right censoring into account. Section 5 discusses the application, and Section 6 concludes the paper.

2. MARKOV MODEL

The first-order Markov assumption in a multistate model implies that the probability of moving to another state only depends on the current state. Our 3-state Markov model assumes that there is no recovery from State 2 back to State 1 and that known transition times are only available for transitions into the death state. We regress intensities on age as a time-dependent covariate to model possible change in the intensities.

In the model, the time interval between 2 consecutive measurements is not fixed but is allowed to vary between and within individuals. Although the model is formulated for individually observed time intervals, the subscript denoting individuals will be suppressed in order to keep notation simple. Let t denote time since entry to the study. At time t ≥ 0, the state of an individual is xt ∈ {1,2,3}. A transition at time t from State r to State s, r ≠ s, occurs with intensity qrs(t), where qrs(t) ≥ 0 for (r,s) ∈ {(1,2),(1,3),(2,3)} and qrs(t) = 0 for (r,s) ∈ {(2,1),(3,1),(3,2)}. Intensities are regressed on covariates and random effects by the log-linear model (2.1), qrs(t) = exp{βrs⊤z(t) + τrs}, where v⊤ denotes the transpose of vector v, βrs = (β0.rs, β1.rs, …, βp.rs)⊤, z(t) = (1, z1(t), …, zp(t))⊤, and the random-effect parameter τ = (τ12, τ13, τ23)⊤ is multivariate normally distributed with mean zero and unknown covariance matrix Σ. The model is flexible enough to accommodate other random-effect structures, as will be illustrated in the application.

The time dependency of the intensities is taken into account by a piecewise-constant approximation: intensities are assumed to be constant within individually observed time intervals but may vary between intervals. In our model, the covariate vector z(t) is deterministic in the sense that we know its values at every time t ≥ 0; a typical example is the time-dependent covariate age. The constant intensities qrs(t) for an observed interval (t, u] are defined using the covariate values midway, that is, at time (t + u)/2. In case observed time intervals are wide and/or the time dependency is strong, this approach can be fine-tuned by subdividing observed intervals into shorter intervals and assuming that intensities are constant within the shorter intervals (Van den Hout and Matthews, 2008), but this refinement is not required here.

The statistical model is defined using transition probabilities, that is, probabilities of moving between the states. Transition probabilities for an observed time interval (t, u] are given by the 3×3 matrix P(t, u) = exp[(u − t)Q(t)], where for t ≥ 0 the generator matrix in (2.2) is

Q(t) = [ −(q12(t) + q13(t))   q12(t)    q13(t)
          0                   −q23(t)   q23(t)
          0                    0         0      ],

and where exp[M] is defined for any square matrix M as the limit of the series ∑k=0,…,∞ Mk/k!. The rs-entry P(t, u)[r, s] is given by ℙ(Xu = s|Xt = r). See, for example, Norris (1997) for details on continuous-time Markov chains with constant intensities. Matrix P(t, u) is available in closed form for the 3-state model without recovery.
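A small R sketch (ours) of the generator matrix and of P(t, u) = exp[(u − t)Q(t)] for this 3-state model; the matrix exponential is computed by a truncated power series purely for illustration, since, as noted above, a closed form exists for the model without recovery.

```r
# Generator of the 3-state illness-death model without recovery and the
# corresponding transition-probability matrix over an interval of length d.
# The matrix exponential is computed by a truncated power series purely for
# illustration (a closed form exists for this model).
make_Q <- function(q12, q13, q23) {
  matrix(c(-(q12 + q13), q12,  q13,
           0,           -q23,  q23,
           0,            0,    0),
         nrow = 3, byrow = TRUE)
}

trans_prob <- function(Q, d, order = 30) {
  P    <- diag(3)
  term <- diag(3)
  for (k in 1:order) {
    term <- term %*% (d * Q) / k   # (dQ)^k / k!
    P    <- P + term
  }
  P
}

Q <- make_Q(q12 = 0.10, q13 = 0.05, q23 = 0.20)   # illustrative intensities
round(trans_prob(Q, d = 2), 3)                    # rows: from state; columns: to state
```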

Assume that an individual has observations at times t1 = 0, t2, …, tM. Using the Markov assumption, the probability of the observed trajectory xt1, xt2, …, xtM conditional on the first observation xt1 is given by (2.3), the product over j = 1, …, M − 1 of P(tj, tj+1)[xtj, xtj+1], where the conditioning on covariates and random effects is ignored in the notation. The contribution of this individual to the marginal likelihood is given by (2.4), the integral of (2.3) weighted by f(τ) over the random effects, where f is the density of the multivariate normal distribution with mean zero and covariance matrix Σ. To obtain the likelihood for the data, the individual contributions are multiplied together.

Given that the goal is Bayesian inference, we will use Markov chain Monte Carlo (MCMC) methods for approximating posterior distributions of model parameters. To do this, observed intervals (t_1, t_2], …, (t_{M−1}, t_M] are modeled independently by using the multinomial distribution for the transition to state X_{t_{j+1}} given state X_{t_j}, j = 1, …, M − 1 (Pan and others, 2007).

Assume that the individual with observation times t_1 = 0, t_2, …, t_M has random-effect parameter vector τ. We recode States 1, 2, and 3 as (1,0,0), (0,1,0), and (0,0,1), respectively. Accordingly, we change the notation for the random variable that denotes the state from scalar X to vector X. For intervals (t_j, t_{j+1}] that an individual survives, we assume that
\[
\mathbf{X}_{t_{j+1}} \sim \mathrm{Multinomial}(1, \mathbf{p}_j),
\]
where
\[
\mathbf{p}_j = \bigl(P(t_j, t_{j+1})[x_{t_j}, 1],\; P(t_j, t_{j+1})[x_{t_j}, 2],\; P(t_j, t_{j+1})[x_{t_j}, 3]\bigr), \qquad (2.5)
\]
for j ∈ {1, …, M − 1}.
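As a sketch of (2.5), the next state can be drawn from the multinomial distribution defined by row x_{t_j} of the transition matrix; P is again the hypothetical matrix from the earlier sketch.

```r
## Sketch of (2.5): the state at t_{j+1} as a multinomial draw.
x_j    <- 1                                        # current state x_{t_j}
p_j    <- P[x_j, ]                                 # probability vector for the next state
x_next <- which(rmultinom(1, size = 1, prob = p_j) == 1)
```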

In case an individual does not survive (t_{M−1}, t_M], we assume that
\[
\mathbf{X}_{t_M} \sim \mathrm{Multinomial}(1, \mathbf{p}_{M-1}),
\]
where
\[
\mathbf{p}_{M-1} = \Bigl( (1-\pi_d)\,\frac{P(t_{M-1}, t_M)[x_{t_{M-1}}, 1]}{c},\;\; (1-\pi_d)\,\frac{P(t_{M-1}, t_M)[x_{t_{M-1}}, 2]}{c},\;\; \pi_d \Bigr), \qquad (2.6)
\]
with c = P(t_{M−1}, t_M)[x_{t_{M−1}}, 1] + P(t_{M−1}, t_M)[x_{t_{M−1}}, 2], given
\[
\pi_d = \sum_{s=1}^{2} P(t_{M-1}, t_M)[x_{t_{M-1}}, s]\; P(t_M, t_M + \epsilon)[s, 3]. \qquad (2.7)
\]

This means that in case of death at t_M, we assume an unknown state just before death and then a transition into the death state within a small time interval ϵ (cf. Sharples, 1993). The probability of this event is denoted by π_d. Because of the ϵ-approximation of the exact death time in (2.7), 1 − π_d ≠ P(t_{M−1}, t_M)[x_{t_{M−1}}, 1] + P(t_{M−1}, t_M)[x_{t_{M−1}}, 2]; hence the adjustment in (2.6) to ensure a proper distribution. Even though exact death times are known, we adhere to the ϵ-approximation. In maximum likelihood estimation, instantaneous intensities can be used, and in that case the likelihood contribution of an observed death time would be P(t_{M−1}, t_M)[x_{t_{M−1}}, 1] q_{13}(t) + P(t_{M−1}, t_M)[x_{t_{M−1}}, 2] q_{23}(t) (see, e.g., Van den Hout and Matthews, 2009). In the Bayesian framework, we have to ensure that π_d is a probability, that is, 0 ≤ π_d ≤ 1, and we use the ϵ-approximation.
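A sketch of the ϵ-approximation, following the reconstruction of (2.6) and (2.7) above and reusing the hypothetical trans_prob() and intensity values from the earlier sketches, is:

```r
## Sketch of (2.6)-(2.7): known death time handled via a small interval epsilon.
eps     <- 0.001                                     # small time interval (years)
P_int   <- trans_prob(0.05, 0.02, 0.10, delta = 3)   # P(t_{M-1}, t_M), hypothetical
P_eps   <- trans_prob(0.05, 0.02, 0.10, delta = eps) # P(t_M, t_M + eps)
x_prev  <- 1                                         # state at t_{M-1}
pi_d    <- sum(P_int[x_prev, 1:2] * P_eps[1:2, 3])   # probability of the death event
c_live  <- sum(P_int[x_prev, 1:2])
p_last  <- c((1 - pi_d) * P_int[x_prev, 1:2] / c_live, pi_d)   # proper distribution
```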

To estimate the total LE, the distribution of the state at baseline t_1 = 0 is needed. At baseline, the state is either State 1 or State 2. We propose a logistic regression model for the distribution of X_0 = X_{t_1}. Defining θ = ℙ(X_0 = 2 | z(0)), this model is given by logit(θ) = α^⊤ z(0), where α = (α_0, α_1, …, α_p)^⊤. It follows that
\[
\theta = \frac{\exp\bigl(\alpha^\top z(0)\bigr)}{1 + \exp\bigl(\alpha^\top z(0)\bigr)}.
\]

For ease of exposition, we use the same covariate vector for the baseline distribution and for the intensities. This is not a necessary condition.
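A sketch of the baseline-state model, with hypothetical coefficient values, is:

```r
## Sketch: logistic model for the baseline state; coefficient values hypothetical.
alpha <- c(-2.0, 0.03)              # intercept and age effect (hypothetical)
z0    <- c(1, 70)                   # covariate vector z(0)
theta <- plogis(sum(alpha * z0))    # P(X_0 = 2 | z(0))
c(1 - theta, theta)                 # baseline distribution over States 1 and 2
```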

The following priors are used for the model parameters in the application. For the regression coefficients in the model for the baseline state and the regression coefficients in the Markov model, we specify vague univariate normal distributions with mean zero and large variance. For the inverse of covariance matrix Σ, we specify a Wishart distribution. These choices will be illustrated in Section 5.
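For illustration, draws from priors of the kind described above can be generated as follows; the hyperparameters here are hypothetical placeholders, and the actual specification is given in Section 5.

```r
## Sketch: draws from vague normal and Wishart priors (hypothetical hyperparameters).
n          <- 1000
beta_prior <- rnorm(n, mean = 0, sd = sqrt(1000))   # vague normal for a regression coefficient
Sigma_inv  <- rWishart(n, df = 4, Sigma = diag(3))  # Wishart prior for the inverse of Sigma
Sigma_1    <- solve(Sigma_inv[, , 1])               # one implied draw of Sigma
```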

With both the distribution of the data and the prior distribution of the parameters specified, MCMC methods can be used to estimate the model. The models in this paper were programmed and run in OpenBUGS (Thomas and others, 2006). Output of OpenBUGS can be imported into R (R Development Core Team, 2008) using the R package coda (Plummer and others, 2006).

3. LES AND MICROSIMULATION

For the case without random effects, the LE in State s ∈ {1,2} given initial state r ∈ {1,2} is given by
\[
e_{rs}(\mathcal{Z}) = \int_0^\infty \mathbb{P}\bigl(X_t = s \mid X_0 = r, \mathcal{Z}\bigr)\, dt, \qquad (3.1)
\]
where 𝒵 = {z(t) | t ≥ 0}. Expected LE in State s irrespective of the initial state (marginal LE) is given by
\[
e_s(\mathcal{Z}) = (1-\theta)\, e_{1s}(\mathcal{Z}) + \theta\, e_{2s}(\mathcal{Z}), \qquad (3.2)
\]
where θ = ℙ(X_0 = 2 | 𝒵) is as defined in Section 2. Expected total LE is given by e_tot(𝒵) = e_1(𝒵) + e_2(𝒵). For applications of LEs given frequentist illness–death Markov models, see, for example, Izmirlian and others (2000) and Lièvre and others (2003).
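As a sketch of (3.1), the integral can be approximated by applying the trapezoidal rule to the transition probabilities; for simplicity the sketch keeps the intensities constant over the whole horizon (hypothetical values reused from the earlier sketches), whereas in the model they change with age.

```r
## Sketch of (3.1): LE in State 1 given baseline State 1, trapezoidal rule on a yearly grid.
grid <- seq(0, 60, by = 1)                                        # years since baseline
P0t  <- lapply(grid, function(t) trans_prob(0.05, 0.02, 0.10, delta = t))
occ  <- sapply(P0t, function(P) P[1, 1])                          # P(X_t = 1 | X_0 = 1)
e11  <- sum(diff(grid) * (head(occ, -1) + tail(occ, -1)) / 2)
## The marginal LE (3.2) combines e_{1s} and e_{2s} with weights 1 - theta and theta.
```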

Two methods can be used to estimate the posterior distribution of LEs. The first approximates the integral in (3.1) numerically, for example, by the trapezoidal rule. The second is microsimulation, in which individual trajectories are simulated and the corresponding simulated survival is used to estimate LEs. Both methods can be applied within each MCMC run or, after convergence, using the posterior distribution of the model parameters. Straightforward numerical evaluation is possible if the covariate vector z(t) is deterministic in the sense that, conditional on z(0), 𝒵 is completely specified. Microsimulation is more flexible in that it can also deal with random effects and with time-dependent covariates such as information about visited states. Micro