Abstract

Aims

The main focus of this study is to illustrate the importance of statistical analysis in evaluating the accuracy of malaria diagnostic tests without assuming a reference test, using a dataset of 3317 blood samples collected in São Tomé and Príncipe.

Methods

Bayesian Latent Class Models (without and with constraints) are used to estimate the malaria infection prevalence, together with the sensitivities, specificities, and predictive values of three diagnostic tests (RDT, Microscopy and PCR), in four subpopulations simultaneously, based on a stratified analysis by age group (<5, ≥5 years old) and fever status (febrile, afebrile).

Results

In afebrile individuals at least five years old, the posterior mean of the malaria infection prevalence is 3.2%, with a highest posterior density interval of [2.3–4.1]. The other three subpopulations (febrile ≥5 years, afebrile or febrile children less than 5 years) present a higher prevalence, around 10.3% [8.8–11.7]. In afebrile children under five years old, the sensitivity of microscopy is 50.5% [37.7–63.2]. In children under five, the estimated sensitivities/specificities of RDT are 95.4% [90.3–99.5]/93.8% [91.6–96.0] – afebrile – and 94.1% [87.5–99.4]/97.5% [95.5–99.3] – febrile. In individuals at least five years old, they are 96.0% [91.5–99.7]/98.7% [98.1–99.2] – afebrile – and 97.9% [95.3–99.8]/97.7% [96.6–98.6] – febrile. PCR yields the most reliable results in the four subpopulations.

Conclusions

The utility of this RDT in the field seems to be relevant. However, in all subpopulations, the data provide enough evidence to suggest caution with the positive predictive values of the RDT. Microscopy has poor sensitivity compared to the other tests, particularly in afebrile children less than 5 years old. Findings of this type reveal the danger of statistical analyses that take microscopy as a reference test. Bayesian Latent Class Models provide a powerful tool to evaluate malaria diagnostic tests, taking into account different groups of interest.

Funding: This study was supported by STP malaria control programme financed by the International Cooperation and Development Fund of Taiwan and RIDES/Malaria CPLP. Statistical research was partially sponsored by national funds through the Fundação Nacional para a Ciência e Tecnologia (FCT), Portugal, under the projects PTDC/SAU-ESA/81240/2006 and PEst-OE/MAT/UI0006/2011. A. Subtil has an FCT PhD grant SFRH/BD/69793/2010. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Malaria is caused by Plasmodium parasites that infect humans through the bites of infected female mosquitoes of the genus Anopheles. Plasmodium falciparum, P. vivax, P. ovale and P. malariae are the main species of malaria parasites. The first two species cause the most infections worldwide [1]. The World Malaria Report 2010 [2] summarizes information from 106 malaria-endemic countries (and 2 countries that were certified as free of malaria in 2010: Morocco and Turkmenistan). This report estimated that the number of cases of malaria changed from 233 million in 2000 to 225 million in 2009. The number of deaths due to malaria is estimated to have decreased from 985 000 in 2000 to 781 000 in 2009. As pointed out by Wongsrichanalai et al. [3], the discrepancy found in worldwide malaria statistics (values range from 300 to 500 million cases a year) emphasizes the importance of correctly diagnosing malaria to better understand its true extent.

Good clinical practice recommends the parasitological confirmation of the diagnosis of malaria through microscopy. There are some exceptions, namely children under the age of 5 years in high prevalence areas, where there is no evidence that the benefits of microscopy confirmation exceed the risk of not treating false negatives; cases of fever in established malaria epidemics where resources are limited; and locations where good quality microscopy is not feasible [1]. Microscopy is cheap, but time-consuming and labor intensive, and it depends on the quality of the blood films and the expertise of the lab technicians.

In recent years, a variety of rapid diagnostic tests (RDTs) have been explored (e.g. [4]–[8]). RDTs are often more costly than microscopy and this should be borne in mind when deciding purchase quantities and level of use in a health care system [1]. Rapid diagnostic tests may have a crucial role in malaria control in poor countries [3]. On the other hand, even in the United States, according to Stauffer et al. [9], approximately 4 million travelers to developing countries seek health care, with cases of malaria reported annually. These authors explored the performance of an RDT approved by the US Food and Drug Administration, pointing out that the diagnosis of malaria is frequently delayed by physicians who have no tropical medicine experience and by a lack of technical expertise.

Molecular techniques such as polymerase chain reaction (PCR) and quantitative nucleic acid sequence-based amplification are also available, but are not widely used in resource-limited settings [10].

In this work, a statistical analysis will be carried out to explore the performance of three diagnostic tests – a Rapid Diagnostic Test (RDT), the Microscopy and a Polymerase Chain Reaction (PCR) technique – applied in 3317 blood samples collected in São Tomé and Príncipe. In 2005, this country began an initiative aimed at reducing malaria-related mortality to zero [11]. Lee et al. [12], [13] present some results on pre-elimination of malaria on the island of Príncipe and show a remarkable decline in malaria morbidity and mortality after the implementation of an integrated malaria control programme in 2004. According to World Malaria Report 2010 [2], São Tomé and Príncipe belongs to a group of 11 African countries that showed a reduction of more than 50% in either confirmed malaria cases or malaria admissions and deaths in recent years due to intense malaria control interventions. However, the World Malaria Report 2010 [2] points out that in 2009 there was evidence of an increase in malaria cases in São Tomé and Príncipe. This report notes that “the increases in malaria cases highlight the fragility of malaria control and the need to maintain control programmes even if numbers of cases have been reduced substantially”.

Statistical analysis is crucial to validate diagnostic tests. The development of classic (frequentist) and Bayesian statistical approaches for the evaluation of diagnostic tests in the absence of a gold standard test has been an active field of biostatistical research applied to many areas, including tropical diseases (e.g. [14], [15]), oncology [16], [17] and veterinary medicine [18]. Latent class models with two latent classes are widely used to estimate the prevalence, sensitivities and specificities in the absence of a gold standard. Microscopy has been considered the gold standard for malaria diagnosis. However, taking microscopy as a reference technique impairs the estimation of the sensitivity and specificity of other diagnostic techniques [19]. Bayesian approaches are increasingly being used in the analysis of parasitological data, including in the evaluation of diagnostic tests. Menten et al. [20] present several Bayesian latent class models for the diagnosis of visceral leishmaniasis. Limmathurotsakul et al. [21] explore some diagnostic tests for melioidosis. In malaria, to the best of our knowledge, few papers explore latent class models. Speybroeck et al. [14] present a contribution of a Bayesian approach to estimate the prevalence of malaria, applying ELISA, PCR and microscopy to datasets from Peru, Vietnam, and Cambodia. Ochola et al. [19] use a Bayesian formulation of the latent class model of Hui and Walter to estimate the diagnostic accuracy of the malaria diagnostic techniques and microscopy in the absence of a gold standard, based on a systematic review. Fontela et al. [22] point out the poor methodological quality and/or poor reporting of published diagnostic accuracy studies on commercial tests for three major infections: tuberculosis, malaria and human immunodeficiency virus.

In this work, our goal is to explore the accuracy of three diagnostic tests for malaria, using Bayesian Latent Class Models (BLCM), considering their performance in four populations based on the combination of age groups (less than 5 years, greater than or equal to 5 years) and fever status (febrile, afebrile). BLCM without and with restrictions (also called constraints) are used to estimate the disease prevalence, together with the sensitivities, specificities, and predictive values of the diagnostic tests. The choice of this type of model is discussed in the next sections.

Materials and Methods

Malaria Diagnosis Data

In São Tomé and Príncipe (STP), a malaria programme was officially initiated in 2004, and a molecular diagnostic laboratory was set up on the main island of São Tomé in 2007, following STP government directives for malaria control and for ethical clearance throughout the implementation of the programme. In the context of this programme, between July 2008 and August 2009, a household survey provided data on the three diagnostic tests mentioned above, applied to 3317 blood samples. The households were selected randomly. Ethical approval was obtained from the Ministry of Health of the Democratic Republic of STP. Informed verbal consent was obtained from residents who answered a short questionnaire, which included information on the use of bed nets. Parents responded on behalf of infants and children [13]. It was expected, and observed, that the participants in the research were illiterate or semi-literate and therefore could not sign a written consent form. This study took into account that the principles of verbal informed consent were the same as those of written informed consent. This procedure was approved by the Ministry of Health of STP. The body temperature (ear) was also collected and recoded into fever status (Febrile and Afebrile), taking into account a cut-off of 37.5°C for fever. The age groups were defined according to WHO recommendations (<5 and ≥5 years old), considering the importance of under-five children, who are mainly affected by anaemia and mortality [23].

The central statistical analysis of the three diagnostic tests –1. RDT, 2. Microscopy and 3. PCR – (binary variables taking the values: 1. positive versus 0. negative) will take into account a unique dataset of the four subsamples defined by the combination of age groups and fever status, as indicated in Table 1, and it will be presented in the next section.

All cases were tested by rapid diagnostic tests (RDTs, ICT Diagnostics, Cape Town, South Africa), with blood films prepared for microscopic examination, and with dry blood spots collected on filter papers (FTA Classic Cards, Whatman, Newton, MA) for PCR as previously reported [24], [25]. Two technicians for each of six teams carried out the RDT together and decided on the reading. Three technicians recording the microscopic result were unaware of the corresponding RDT results. The technicians that performed the microscopic examination and PCR did not know the age groups and the fever status of the patients. However, the technicians that applied the RDT knew the age groups and the fever status of the individuals because they also made the demographic record of all cases including sex, age, body weight, and temperature.

Some Points regarding the Statistical Analysis

Confidence intervals in the classical analysis - using a reference test.

The classical statistical approach, which takes microscopy as a gold standard, is still common in the medical literature. This approach has also been criticized in the malaria context [14], [19]. A related issue is the confidence interval that accompanies the point estimate of the sensitivity or the specificity (or of other proportions). This problem is seldom addressed in the medical literature, but it is present even when a true gold standard is considered. Usually, a 95% confidence interval (95% CI) is obtained by the Wald method, which has been strongly criticized due to its poor coverage probability, even for large sample sizes, and the possibility of lower and upper limits outside [0, 1] [26]–[28]. To avoid the latter drawback, we recommend the version of the Wald CI, and other methods, given by Pires and Amado [26]. However, as poor coverage probability remains, alternative methods for constructing confidence intervals should be used. Many alternative methods are re-emerging, for example the Clopper-Pearson (or exact binomial), Wilson (or score), Agresti-Coull and Jeffreys methods, which provide more reliable coverage probabilities than the Wald method. In the context of diagnostic tests, the Wilson method was recommended by [29]. In risk situations, when a coverage probability must be guaranteed, a conservative method (e.g. Clopper-Pearson) may present advantages. Nevertheless, these and other recommended methods may also present coverage problems near the boundaries (0 or 1) [26]–[28]. Some of these are available from R packages or Epitools [30] (caution should be taken regarding the problem of limits outside [0, 1]). The key to avoiding trouble is to use several recommended methods and check whether they provide consistent information. The mathematical expressions of the methods used in this work can be found in Table 2.
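As an illustration of how such intervals can disagree, the following minimal sketch computes the Wald, Wilson and Agresti-Coull intervals for a single proportion. It uses the standard textbook expressions, which are assumed here and may differ in detail from the versions listed in Table 2 (for instance, the Wald interval is left untruncated); the counts are hypothetical.

```python
from math import sqrt

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_ci(x, n, z=Z95):
    """Wald interval; limits are left untruncated and may fall outside [0, 1]."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_ci(x, n, z=Z95):
    """Wilson (score) interval; its limits stay inside [0, 1]."""
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def agresti_coull_ci(x, n, z=Z95):
    """Agresti-Coull interval: a Wald-type interval around an adjusted proportion."""
    n_adj = n + z**2
    p_adj = (x + z**2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


# Hypothetical example: 48 of 50 samples known to be infected test positive.
x, n = 48, 50
for name, method in [("Wald", wald_ci), ("Wilson", wilson_ci),
                     ("Agresti-Coull", agresti_coull_ci)]:
    low, high = method(x, n)
    print(f"{name:14s} [{low:.3f}, {high:.3f}]")
```

With a proportion this close to 1, the untruncated Wald upper limit exceeds 1, which is the drawback discussed in the text.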

Latent class models.

Ignoring the limited precision of a reference test can incur serious bias in the performance of other medical diagnostic tests and also in the prevalence estimation. Frequentist and Bayesian latent class models are important mathematical frameworks to study the prevalence and the performance of diagnostic tests in the absence of a gold standard test. In a Bayesian analysis, data are combined with the prior information that expresses expert opinions and other sources of knowledge. The elicitation of an informative prior is a hard and subjective process that needs a careful dialogue between statisticians and experts. Despite the existence of a broad and diverse literature in elicitation of prior distributions, it is mainly oriented to statisticians and not to experts in other fields. However, a vast literature has emphasized the importance of prior information. Speybroeck et al. [14] and the references therein point out the merits of the Bayesian paradigm in the estimation of the parameters associated with three diagnostic tests and the prevalence of malaria infection.

From a frequentist perspective, the parameters of latent class models can be obtained by the well-known Expectation Maximization (EM) algorithm. In a Bayesian approach, the parameters are usually estimated by Markov Chain Monte Carlo (MCMC) methods, via Gibbs sampling. The simplest model is the Two Latent Class Model (2 LCM). In this model, the true disease/infection status of an individual is considered a latent variable with two mutually exclusive categories (1. diseased/infected and 0. non-diseased/non-infected). The manifest binary variables, which express the results of the diagnostic tests, only give an indication of the disease/infection status. The 2 LCM assumes that, given the true state of the disease or infection, the results of the diagnostic tests are independent. This assumption is known as the Hypothesis of Conditional Independence (HCI) and it will be discussed in the next subsection. Frequentist and Bayesian latent class models, and their extensions to more complex settings, require a careful analysis of several points to ensure reliable results.

Hypothesis of conditional independence.

According to the parsimony principle, mathematical models with the smallest number of parameters are preferred to more complex ones. However, to investigate whether the simplest and most parsimonious 2 LCM describes the data adequately, we need to check whether or not the HCI is violated. The HCI may not be a realistic assumption in some medical problems, for example, when two tests are based on a similar biological phenomenon (e.g. [20], [31]). The diagnosis of local dependence has been discussed by several authors [31]–[35] and different methods have been proposed. Among others, Hagenaars [32] suggests the analysis of the standardized residuals for each pair of manifest variables. Garrett and Zeger [34] developed a graphical method, the log odds ratio check (LORC) plot, to compare the log odds ratios for the observed and predicted two-way cross-classification tables for each pair of manifest variables. Qu et al. [35] also propose a graphical method, the correlation residual plot, which is obtained by plotting residuals of pairwise correlation coefficients, defined as the difference between the observed and expected correlations. Sepúlveda et al. [31] propose the use of Biplot representations based on generalized linear models to identify conditional dependence between pairs of manifest variables within each latent class. In this field, Subtil et al. [36] simulated data incorporating local dependence between pairs of manifest variables, applied different local dependence diagnostic methods, and found some problems in the detection of violations of conditional independence. When the HCI fails, there are alternative approaches to the 2 LCM. Alternative models that accommodate conditional dependencies have been proposed in the last decade. Albert and Dodd [37] present an overview of some modeling approaches to incorporate conditional dependence between tests. Qu et al. [35] and Hadgu and Qu [38] developed a general latent class model with random effects to incorporate possible conditional dependencies among diagnostic tests. Additionally, Dendukuri and co-authors presented models from a Bayesian perspective [39], [40]. The accessibility of MCMC methods provides solutions to complex models in the evaluation of diagnostic tests [41]. On the other hand, knowledge transfer from other areas may also contribute to this medical field. In particular, sociology and psychology offer solid methodological developments in latent class models that may be useful in the context of multiple diagnostic tests, as pointed out by Formann [42].
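The idea behind pairwise checks such as the log odds ratio comparison can be sketched as follows. This is a simplified illustration, not the full LORC plot of [34]: it compares a single observed log odds ratio with the one implied by a two-class model under conditional independence. The counts, the parameter values and the 0.5 continuity correction are all assumptions made for the example.

```python
from math import log


def pair_probs(prev, se1, sp1, se2, sp2):
    """Joint probabilities of two test results under a two-class model
    with conditional independence given the true status."""
    probs = {}
    for a in (0, 1):
        for b in (0, 1):
            infected = (se1 if a else 1 - se1) * (se2 if b else 1 - se2)
            healthy = ((1 - sp1) if a else sp1) * ((1 - sp2) if b else sp2)
            probs[(a, b)] = prev * infected + (1 - prev) * healthy
    return probs


def log_odds_ratio(table, correction=0.5):
    """Log odds ratio of a 2x2 table keyed by (result 1, result 2).
    The correction is added to every cell to avoid taking log(0)."""
    c = {key: value + correction for key, value in table.items()}
    return log(c[(1, 1)] * c[(0, 0)] / (c[(1, 0)] * c[(0, 1)]))


# Hypothetical counts for one pair of tests and hypothetical parameter values.
observed = {(1, 1): 40, (1, 0): 12, (0, 1): 8, (0, 0): 440}
expected = pair_probs(prev=0.10, se1=0.90, sp1=0.97, se2=0.85, sp2=0.99)

lor_observed = log_odds_ratio(observed)
lor_expected = log_odds_ratio(expected, correction=0.0)
print(round(lor_observed, 2), round(lor_expected, 2))
```

A large gap between the two values for some pair of tests would hint at local dependence; in a Bayesian fit the comparison would be repeated over the posterior draws rather than at a single parameter value.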

Non-identifiability and label-switching problem.

The non-identifiability of latent class models is a sensitive issue that requires careful attention. If models are not identified, there will not be a unique computational solution. Jones et al. [43] and the references therein give an overview of the identifiability of models for multiple binary diagnostic tests in the absence of a gold standard. Apart from checking trivial conditions, such as that the number of parameters has to be smaller than the number of different patterns, in general, for complex models it is not possible to say a priori whether a model is identifiable or not [42].

An advantage of the Bayesian approach is the incorporation of prior information to avoid non-identifiability. When the model is identifiable, non-informative prior distributions can be used for all parameters [40]. When the model is not identifiable, it may still be possible to obtain a solution by adding constraints on the parameters and/or by using informative prior distributions for some parameters (e.g. [44] and [40] and the references therein). In practice, special attention should be given to the non-identifiability under symmetric priors that leads to label switching in the MCMC output [45] produced in the parameter estimation process. Label switching occurs when the latent classes change meaning over the estimation chain in the context of MCMC. Other types of estimation (e.g., maximum likelihood estimation) can also exhibit this problem [46]. Machado et al. [47] show the graphical behavior of the traceplots and posterior densities for the latent class probabilities with a label-switching problem. Stephens [48] points out that the common strategy of removing label switching by imposing artificial identifiability constraints on the model parameters does not always provide a satisfactory solution. In fact, there is an active scientific debate in many fields and other solutions have been proposed in the literature (e.g. [45], [48]–[51]). On the other hand, for situations in which two diagnostic tests are applied in two populations (the Hui-Walter paradigm), Gustafson [52], [53] demystifies the conventional view of identifiability – “identifiability good, non-identifiability bad” – presenting realistic scenarios where a moderate amount of prior information leads to reasonable inferences from a non-identified model, and scenarios where large sample sizes may be required to obtain reasonable inferences from an identified model.
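The source of label switching can be checked numerically: in a two-class model with conditional independence, swapping the class labels (prevalence replaced by its complement, and each sensitivity exchanged with one minus the corresponding specificity) leaves every pattern probability unchanged, so the likelihood cannot tell the two labelings apart. The sketch below illustrates this with hypothetical parameter values; the function and variable names are ours.

```python
from itertools import product


def pattern_probs(prev, sens, spec):
    """Probabilities of all test-result patterns (1 = positive, 0 = negative)
    under a two-class model with conditional independence."""
    probs = {}
    for pattern in product((1, 0), repeat=len(sens)):
        infected, healthy = prev, 1 - prev
        for t, se, sp in zip(pattern, sens, spec):
            infected *= se if t else 1 - se
            healthy *= (1 - sp) if t else sp
        probs[pattern] = infected + healthy
    return probs


# Hypothetical parameter values for three tests.
prev, sens, spec = 0.10, [0.95, 0.60, 0.98], [0.97, 0.99, 0.99]
original = pattern_probs(prev, sens, spec)

# Swap the class labels: prevalence -> 1 - prevalence,
# sensitivity -> 1 - specificity, specificity -> 1 - sensitivity.
swapped = pattern_probs(1 - prev,
                        [1 - sp for sp in spec],
                        [1 - se for se in sens])

max_gap = max(abs(original[p] - swapped[p]) for p in original)
print(max_gap)  # numerically zero, up to floating-point rounding
```

Priors or constraints that rule out one of the two labelings (for example, sensitivities and specificities above 0.5) are one common way of breaking this symmetry.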

Sampling strategies or stratified analysis.

The product-multinomial model appears naturally when we collect independent samples on a number of subpopulations corresponding to the traditional stratified sampling [54]. Gustafson [52] explores one way to develop an identifiable model through pre- or post-stratification of the sample/population according to some categorical variable. Dohoo [55] and Gardner et al. [56] argue that it is acceptable to artificially construct populations with a practical meaning. The post-stratification is a way to overcome some situations, where it is inconvenient or impossible to stratify a population into strata before sampling because the value of the variable of interest is only observed after the individual is sampled. In our application, described before, the variable fever could be an example of this type.

In medical problems, the relevance of distinguishing between subsets is very important to understand if the performance of a diagnostic test varies across smaller groups. As an example, the World Malaria Report 2010 [5] emphasizes that “the clinical sensitivity of an RDT to detect malaria is highly dependent on the local conditions, including parasite density in the target population, and so will vary between populations with differing levels of transmission”. In order to estimate the prevalence and the performance measures of several diagnostic tests in the absence of a gold standard, in two or more distinct populations, BLCM are widely used by the veterinary community [57], [58], where subpopulations (e.g. herds) appear naturally or are created (e.g. [59]). On the other hand, Martinez et al. [16] present a Bayesian approach to estimate the disease prevalence, and the accuracy of three screening tests in the presence of two covariates (age, pregnancy) in the absence of a gold standard for cervical cancer. A logit link function was used to relate the covariates linearly to the screening performance measures to provide a meaningful and well-known measure of association - odds ratio. Posterior odds ratios as association measures between pregnancy and age and the performance measures of the three tests and prevalence are presented. This approach is of great importance in the discovery of potential effects of covariates in the sensitivity, specificity and prevalence. If these effects are already known, it seems to be appropriate to choose a stratified analysis, providing the performance measures of each test in each stratum. In a first view the study design - a random survey - seems to suggest a Latent Class Model with covariates [60], however, as described before, the technicians that applied the RDT knew the age groups and the fever status of the individuals. Thus, the stratified analysis is the chosen approach to the malaria dataset.

Bayesian Latent Class Models without and with Constraints

In biomedical sciences, data from multiple dichotomous diagnostic tests arise from multinomial or product-multinomial distributions depending upon the number of populations [43]. The well-known Hui-Walter model involves splitting the population into two or more populations ($K \ge 2$) and assuming conditional independence of the tests given the disease status; the sensitivity and specificity of each test should be constant across populations, while the prevalence of the disease differs between populations. With $J$ tests applied in $K$ populations, this model becomes identifiable whenever $K \ge J/(2^{J-1}-1)$ (see [19], [61], [62]).
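A quick way to see when a Hui-Walter-type model has enough information is to compare the degrees of freedom in the data with the number of parameters. The sketch below does this counting; it is a necessary check only and does not by itself prove identifiability. The function name is ours.

```python
def hui_walter_counts(n_tests, n_populations):
    """Degrees of freedom and number of parameters for a Hui-Walter model.

    Each population contributes 2**n_tests - 1 free pattern probabilities;
    the model has one sensitivity and one specificity per test (shared by
    all populations) plus one prevalence per population.
    """
    degrees_of_freedom = n_populations * (2**n_tests - 1)
    n_parameters = 2 * n_tests + n_populations
    return degrees_of_freedom, n_parameters


for n_tests, n_populations in [(2, 1), (2, 2), (3, 1), (3, 4)]:
    df, n_par = hui_walter_counts(n_tests, n_populations)
    print(f"tests={n_tests} populations={n_populations} "
          f"df={df} parameters={n_par} enough={df >= n_par}")
```

Two tests in a single population leave 3 degrees of freedom for 5 parameters, whereas two tests in two populations give 6 for 6, which is the classical Hui-Walter setting.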

In this work, the subsamples 1, 2, 3 and 4 are drawn from subpopulations 1, 2, 3, and 4, where the corresponding malaria infection prevalence is denoted by $\pi_k$ ($k = 1, \dots, 4$). The sensitivity of test $j$ ($j = 1, 2, 3$) in subpopulation $k$ ($k = 1, \dots, 4$) is denoted by $S_{jk}$. Similarly, $C_{jk}$ represents the specificity of test $j$ in subpopulation $k$. We continue to assume conditional independence of the tests given the disease status; however, the prevalence, the sensitivities, and the specificities may vary across subpopulations. For cancer ascertainment data, Bernatsky et al. [17] considered this situation but used a latent class hierarchical model. Here, we adopt a different approach, considering constraints on the general model to obtain other simpler models for our data set.

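In this work, we assume that the subpopulation counts $\mathbf{y}_k = (y_{1k}, \dots, y_{8k})$ of the different patterns of test results (in a total of $2^3 = 8$ possible patterns) follow a multinomial distribution:

$$\mathbf{y}_k \mid \mathbf{p}_k \sim \text{Multinomial}(n_k, \mathbf{p}_k), \quad k = 1, \dots, 4, \quad \text{and} \quad n_k = \sum_{i=1}^{8} y_{ik},$$

where $\mathbf{p}_k$ is a vector of probabilities of observing each individual pattern of test results $(t_1, t_2, t_3)$, $t_j \in \{0, 1\}$, in population $k$ (the patterns are listed in the first column of Table 1), given by

$$P(T_1 = t_1, T_2 = t_2, T_3 = t_3 \mid k) = \pi_k \prod_{j=1}^{3} S_{jk}^{t_j} (1 - S_{jk})^{1 - t_j} + (1 - \pi_k) \prod_{j=1}^{3} C_{jk}^{1 - t_j} (1 - C_{jk})^{t_j},$$

with $\pi_k$ the prevalence and $S_{jk}$, $C_{jk}$ the sensitivity and specificity of test $j$ in subpopulation $k$.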

To analyze the four subpopulations simultaneously, a product-multinomial distribution is considered, simply using the product of four multinomial distributions, since the subpopulations are independent. This general model may be simplified to obtain other simpler models, using constraints. For example, writing $S_{jk}$ and $C_{jk}$ for the sensitivity and specificity of test $j$ in subpopulation $k$, the notation $S_{1k} = S_1$ ($k = 1, \dots, 4$) means that the RDT presents the same sensitivity across the four subpopulations of interest. In general, $S_{jk} = S_j$ and $C_{jk} = C_j$ mean that the sensitivity and specificity of test $j$ are constant over subpopulations. The simplest model with constraints (denoted by M1 in the next section) considers a different prevalence for each subpopulation, while the sensitivities and specificities of each test are the same across subpopulations. This model is commonly used to evaluate diagnostic tests in two or more populations (see [19], [58], [61]). The general model (no constraints are imposed on prevalence, sensitivities and specificities across subpopulations; M2 in the next section) has 28 parameters, whereas the simplest model has only 10 parameters to be estimated using a Bayesian approach. Introducing different constraints into M2, several other Bayesian latent class models were fitted via MCMC techniques, using Gibbs sampling, to explore the accuracy of the three diagnostic tests in the four defined subpopulations (Table 1) simultaneously.
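The product-multinomial structure and the effect of constraints can be sketched in a few lines. The example below evaluates the log-likelihood (up to an additive constant) for two hypothetical subpopulations under an M1-type constraint, where all subpopulations share one vector of sensitivities and one of specificities, and then reproduces the parameter counts quoted above. All counts and parameter values are invented for illustration, and the function names are ours.

```python
from itertools import product
from math import log


def pattern_prob(pattern, prev, sens, spec):
    """Probability of one pattern of results (1 = positive, 0 = negative)
    under a two-class model with conditional independence."""
    infected, healthy = prev, 1.0 - prev
    for t, se, sp in zip(pattern, sens, spec):
        infected *= se if t else 1.0 - se
        healthy *= (1.0 - sp) if t else sp
    return infected + healthy


def log_likelihood(counts, prev, sens, spec):
    """Product-multinomial log-likelihood, omitting the multinomial constant.
    counts[k] maps each pattern to its frequency in subpopulation k."""
    total = 0.0
    for k, counts_k in enumerate(counts):
        for pattern, n in counts_k.items():
            total += n * log(pattern_prob(pattern, prev[k], sens[k], spec[k]))
    return total


patterns = list(product((1, 0), repeat=3))  # (1,1,1), (1,1,0), ..., (0,0,0)

# Hypothetical pattern counts for two subpopulations.
counts = [dict(zip(patterns, [30, 5, 8, 4, 3, 2, 6, 400])),
          dict(zip(patterns, [12, 2, 3, 1, 1, 1, 2, 600]))]

# M1-type constraint: the same sensitivities/specificities in every subpopulation.
shared_sens, shared_spec = [0.95, 0.60, 0.98], [0.97, 0.99, 0.99]
ll = log_likelihood(counts, prev=[0.10, 0.03],
                    sens=[shared_sens] * 2, spec=[shared_spec] * 2)

# Parameter counts for three tests and four subpopulations.
n_par_general = 4 + 2 * 3 * 4  # prevalences + subpopulation-specific test parameters
n_par_m1 = 4 + 2 * 3           # prevalences + shared test parameters
print(round(ll, 2), n_par_general, n_par_m1)
```

Relaxing a constraint amounts to passing a different sensitivity or specificity vector for the corresponding subpopulation.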

Berkvens et al. [44] consider two types of constraints, deterministic and probabilistic. Both types of constraints express previous knowledge about the parameters of a model and/or are imposed to overcome the non-identifiability of a model. The latter type appears in a Bayesian context to reflect the available knowledge and uncertainty, by specifying a prior distribution for a parameter. Informative priors are based on historical information, expert opinions, beliefs based on the repetition of similar experiments, and so on. If previous information is not available, non-informative or vague prior distributions are commonly used.

The elicitation of an informative prior is a hard and subjective process that needs a careful dialogue with experts. Despite the existence of a broad and diverse literature on the elicitation of prior distributions, it is mainly oriented to statisticians and not to experts in other fields. In practice, user-friendly graphical tools are essential to deal with this sensitive issue. In this process, we used Epitools [30] to summarize Beta distributions for specified $\alpha$ and $\beta$ parameters, Beta($\alpha$, $\beta$). In our opinion, the flexibility of the Beta distribution makes it more natural than the Uniform distribution for describing this type of performance parameter probabilistically. Note that a Uniform distribution over the interval [0,1] is equivalent to a Beta(1,1). If previous studies have pointed out that the sensitivity of a particular test is mostly concentrated near 1, a Beta distribution with most of its mass near 1, a standard deviation of 0.053 and theoretical 0.025, 0.50 and 0.975 quantiles equal to 0.789, 0.931 and 0.990 expresses a better performance than another such Beta distribution with a standard deviation of 0.062 and 0.025, 0.50 and 0.975 quantiles equal to 0.740, 0.895 and 0.975. A distribution with most of its mass near 0 suggests a trend towards poor performance of a test or a low prevalence. For example, according to an expert, the probability that the malaria infection prevalence is lower than 0.15 is equal to 0.95. Additionally, he/she considers that the mean, mode and median are approximately 0.10. A Beta(15,131) seems to be a good candidate to express this information.
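The summaries used in this kind of elicitation are easy to check numerically. The sketch below computes the mean, mode and standard deviation of a Beta distribution, and its cumulative probability through a binomial identity that holds only for positive integer parameters. The Beta(15,131) prior comes from the text; Beta(23, 2) and Beta(23, 3) are our own assumptions, chosen because they roughly reproduce the standard deviations and quantiles quoted above, and are not parameters stated by the authors.

```python
from math import comb, sqrt


def beta_summary(a, b):
    """Mean, mode and standard deviation of a Beta(a, b) with a, b > 1."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, mode, sd


def beta_cdf(x, a, b):
    """P(X <= x) for X ~ Beta(a, b) with positive integer a and b, using
    P(X <= x) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, i) * x**i * (1 - x) ** (n - i) for i in range(a, n + 1))


# Prior for the prevalence discussed in the text.
mean, mode, sd = beta_summary(15, 131)
p_below = beta_cdf(0.15, 15, 131)
print(f"Beta(15,131): mean={mean:.3f} mode={mode:.3f} P(X<0.15)={p_below:.3f}")

# Assumed parameters (not given in the text) for priors concentrated near 1.
for a, b, q025 in [(23, 2, 0.789), (23, 3, 0.740)]:
    _, _, sd_ab = beta_summary(a, b)
    print(f"Beta({a},{b}): sd={sd_ab:.3f} P(X<{q025})={beta_cdf(q025, a, b):.3f}")
```

For non-integer parameters the binomial identity no longer applies, and a numerical routine for the regularized incomplete beta function would be needed instead.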

The BLCMs were implemented in WinBUGS 1.4.3 [63]. Appendix S1 shows an example of the code corresponding to model M5. The R statistical software, version 2.8.0 [64], was also used to benefit from the package R2WinBUGS. In general, inferences were based on 100,000 iterations after discarding an initial burn-in of 5,000 iterations, with convergence assessed by running multiple chains from various starting values [65]. All parameters were estimated with 95% credible intervals (the Bayesian counterpart of confidence intervals). Additionally, the highest probability density (HPD) intervals for the parameters of interest were obtained using BOA 1.1.7-2 [66]. These results will be presented later. Convergence was monitored using the standard diagnostic procedures, based on a visual assessment of the long chains for each parameter and on the Gelman-Rubin and Raftery-Lewis measures. The former is judged through the corrected scale reduction factor and the latter through the dependence factor [66].
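To make the Gelman-Rubin idea concrete, the following sketch computes the basic potential scale reduction factor for one parameter from several chains of equal length. It is the plain, uncorrected textbook form, assumed here for illustration; BOA reports a corrected version, so its values will differ slightly. The chains are toy numbers.

```python
from math import sqrt
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one parameter."""
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)          # W
    between = n * variance([mean(chain) for chain in chains])   # B
    pooled = (n - 1) / n * within + between / n                 # pooled variance estimate
    return sqrt(pooled / within)


# Toy chains: the first pair explores the same region, the second pair does not.
agreeing = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
disagreeing = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]

print(potential_scale_reduction(agreeing))
print(potential_scale_reduction(disagreeing))
```

Values close to 1 are compatible with convergence, while values well above 1 indicate that the chains have not yet mixed over the same region.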

In terms of model selection, the Deviance Information Criterion (DIC) [67], which penalizes goodness of fit by “complexity” (the latter measured by the effective number of parameters), was used. The model with the smallest DIC should be selected. However, if two competing models differ in DIC by less than three units, the models are not considered statistically different [62], [67]. The BUGS project [68] gives some guidelines, suggesting that differences of more than 10 might definitely rule out the model with the higher DIC and that differences between 5 and 10 are substantial, but that if the difference in DIC is less than 5 and the models produce very different inferences, then it could be misleading just to report the model with the lowest DIC. This criterion is a generalization of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), which are also presented (Tables 3 and 7). In models with negligible prior information, DIC will be approximately equivalent to AIC [67]. Note that we use these measures as comparison criteria to select a model from a set of two-latent-class models fitted to a particular dataset. In the literature on latent class models, some criticism has been reported when DIC, AIC and BIC are used to choose the number of latent classes [69], [70], and some variants have been proposed. To assess the adequacy of the selected model, the Bayesian p-value [71], based on Pearson statistics, was also calculated as described in detail by Nérette et al. [62]. This version of the Bayesian p-value suggests lack of fit when the p-value is near 0 or 1 [62], [72]. Other versions and interpretations of the Bayesian p-value can be found, including in the context of latent class models [44]. As pointed out by Neelon et al. [72], there is some subjectivity in the choice of a cut-off to indicate the adequacy of a model: by analogy to the frequentist p-value, a Bayesian p-value in (0.05, 0.95) suggests an adequate fit, although, in some cases, a stricter criterion might be more appropriate and the values should be in (0.20, 0.80). Ideally, the p-value should be close to 0.5 to express an adequate model fit [20], [72], [73].
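Both quantities can be computed directly from MCMC output. The sketch below assumes that the deviance has been recorded at each draw, that the deviance at the posterior means is available, and that a discrepancy statistic (for example, a Pearson statistic) has been evaluated for the observed and the replicated data at each draw. The numbers are hypothetical and the function names are ours.

```python
from statistics import mean


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD), where pD = mean deviance - deviance at the posterior
    means and DIC = mean deviance + pD."""
    d_bar = mean(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d


def bayesian_p_value(observed_discrepancies, replicated_discrepancies):
    """Proportion of draws in which the replicated-data discrepancy is at
    least as large as the observed-data discrepancy."""
    pairs = list(zip(observed_discrepancies, replicated_discrepancies))
    return sum(rep >= obs for obs, rep in pairs) / len(pairs)


# Hypothetical MCMC output, for illustration only.
dic_value, p_d = dic([105.0, 98.0, 102.0, 99.0], deviance_at_posterior_mean=95.0)
p_value = bayesian_p_value([7.1, 6.4, 8.0, 5.9], [6.8, 7.2, 9.1, 5.0])
print(dic_value, p_d, p_value)
```

In practice thousands of draws would be used, and WinBUGS reports DIC and pD directly; the functions above only spell out the definitions.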

Results and Discussion

Results with Non-informative Priors

The hypothesis of conditional independence was checked using LORC, correlation residual plots, bivariate residuals and biplots. No statistical evidence of local dependence was detected. However, because in certain situations some of these tools may fail to detect departures from the HCI [36], multinomial fixed effects models with conditional dependence modeled by covariances between tests within classes (e.g. [39], [62]) were also explored, motivated by biological reasons. The fit of such models (based on DIC, p-value and predictive frequencies of each pattern) did not show a relevant improvement compared with simpler models. Even if these models offer a closer description of reality, the balance with parsimony remains an important issue. Interpretability and identifiability problems may arise from models with a larger number of parameters. Moreover, it is well known that models assuming different dependency structures can provide different parameter estimates and lead to very different interpretations, in spite of their similarity in terms of adjustment measures [37]. In addition, in a Bayesian context, it might be difficult to elicit prior distributions for the covariances or random effects coefficients [20].

Following the exhaustive validation of the HCI and the analysis of the other topics described, we gave special attention to the results of the two models mentioned, M1 and M2, and to three related models with constraints: M3, M4 and M5. Briefly,

M1.

The typical model (with constraints) admits a different prevalence for each subpopulation, while the sensitivity and the specificity of each test are the same across the four subpopulations,

M2.

The general model (without constraints) that assumes possible differences across subpopulations in terms of prevalence, sensitivities and specificities of each test,

M3.

This model with constraints admits a different prevalence across subpopulations; the specificity of microscopy is equal across subpopulations, and so is the specificity of PCR. All the remaining parameters vary across subpopulations,

M4.

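The general model - M2 - adding the constraint that the prevalence is the same in three subpopulations (afebrile children under five, febrile children under five, and febrile individuals at least five years old),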

M5.

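The same constraints as M3, adding the constraint that the prevalence is the same in the three subpopulations other than the afebrile individuals at least five years old.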
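To make the structure shared by these models concrete, the sketch below computes the probability of each pattern of test results in a single subpopulation under a two-latent-class model with conditionally independent tests. It is a minimal illustration and not the fitted model: the function name and the numerical values are hypothetical, and the constraints of M1 to M5 would correspond to reusing the same prevalence, sensitivity or specificity values across subpopulations.

```python
from itertools import product


def pattern_probabilities(prevalence, sensitivities, specificities):
    """Return P(pattern) for every 0/1 pattern of test results.

    Assumes two latent classes (infected, not infected) and conditional
    independence of the tests given the latent class.
    """
    probs = {}
    for pattern in product((0, 1), repeat=len(sensitivities)):
        p_infected = prevalence
        p_not_infected = 1.0 - prevalence
        for result, se, sp in zip(pattern, sensitivities, specificities):
            # Positive result: true positive if infected, false positive otherwise.
            p_infected *= se if result == 1 else 1.0 - se
            p_not_infected *= 1.0 - sp if result == 1 else sp
        probs[pattern] = p_infected + p_not_infected
    return probs


# Hypothetical values for (RDT, microscopy, PCR); not estimates from this study.
probs = pattern_probabilities(
    prevalence=0.10,
    sensitivities=[0.95, 0.60, 0.97],
    specificities=[0.97, 0.99, 0.98],
)
```

With three tests there are eight patterns, and their probabilities are the cell probabilities of the multinomial likelihood for that subpopulation.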

In a first step, we explored our malaria dataset with non-informative prior distributions for all parameters related to test characteristics (the sensitivities and specificities of the three tests), using Beta(1,1) distributions, equivalent to Uniform distributions over the interval [0,1]. For the prevalence in each of the four subpopulations, the Uniform distribution U(0, 0.5) was considered. For the five selected models (M1, M2, M3, M4, and M5), no convergence problems were found, and some of the measures discussed above are presented in Table 3.

The assumption of constant test accuracy across subpopulations with different malaria infection prevalence was evaluated through model M1, which seems to reveal a poorer fit. M2 admits differences across subpopulations in terms of prevalence, sensitivities and specificities of each test and seems to fit better than M1. M4 only adds the possibility that febrile and afebrile children under five and febrile individuals at least five years old have a similar prevalence, while the tests' characteristics vary across subpopulations. This model presents a DIC similar to that of M2. M3 and M5 present even better DICs. However, the DIC of M3 is not substantially different from that of M4. Between M3 and M5, the difference in DICs is also less than 5. Following the recommendations of the BUGS Project [68], we present the estimated parameters according to the three models (M3, M4 and M5) to investigate possible discrepancies in the estimates given by these models (see Table 4).

The posterior inferences, which combine prior information (or lack of it) with data information via Bayes’ theorem, are summarized in Table 4, which presents the posterior means and 95% credibility intervals. In addition to the original parameters of the models, the positive predictive values and negative predictive values were also indirectly estimated using their relationship with the prevalence, sensitivities and specificities (see expressions, for example, in [29]). M3 and M5 produce similar results. M4 presents some discrepancies, at least in some predictive values. According to the parsimony principle, M5 is the simplest model and all criteria of selection and goodness-of-fit are satisfactory; consequently, it is elected as the final model to fit the malaria dataset. Further analysis is needed to see how the inferences change with different types of informative priors.
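The relationship between the predictive values and the prevalence, sensitivity and specificity follows from Bayes’ theorem and can be sketched as below. The function name is our own. As an illustration only, the input values are the posterior means quoted in the abstract for the RDT in afebrile individuals at least five years old; plugging posterior means into the formula only approximates the posterior mean of a predictive value, which in a Bayesian analysis would be computed at each MCMC iteration.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) from prevalence, sensitivity and specificity."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    true_neg = (1.0 - prevalence) * specificity
    false_neg = prevalence * (1.0 - sensitivity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Plug-in illustration: prevalence 3.2%, RDT sensitivity 96.0%, specificity 98.7%.
ppv, npv = predictive_values(0.032, 0.960, 0.987)
```

With these inputs the plug-in positive predictive value is about 0.71, even though sensitivity and specificity both exceed 95%, which shows how a low prevalence depresses the positive predictive value.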

Results with Informative Priors

Some information was collected from published works to help us choose the prior distributions for each parameter of our elected model. An accurate estimate is not necessary: this process is flexible and seeks only some general knowledge. Additionally, expert opinions were considered in the final informative prior distributions.

RDT.

Table 5 shows a range of values for sensitivities and specificities of the RDT test (ICT Diagnostics, Cape Town, South Africa), according to local area and age groups or fever status. Bendezu et al. [74] describe that the same RDT used in different places showed different results (probably related to different conditions like temperature, humidity, characteristics of the malaria parasites, etc.). The study design, the sample size and statistical analysis of each study also contribute to different findings across different studies.

Microscopy.

As microscopy is usually the reference test, very few papers present its sensitivity and specificity. Speybroeck et al. [14], using a Bayesian approach, found the following posterior means and 95% credibility intervals for sensitivity by survey: 53.0% [42.0–70.0] in Vietnam, 90.0% [72.0–100.0] in Peru Iquitos, 89.0% [71.0–100.0] in Peru Jaen, Cambodia - Survey 1, and Cambodia - Survey 2. In terms of specificities, the lower bounds of the credibility intervals were higher than 94%. These authors give details about their prior distributions (Uniform), which were based on expert opinion. Through a classical analysis, using PCR as a reference test, Batwala et al. [75] explored the performance of microscopy as a function of laboratory experience: health centre (HC) microscopy and expert microscopy. The point estimates and the 95% CI are reported in both cases. In patients aged five years or older, the specificity of HC microscopy was 95.7% [90.8–98.4] and that of expert microscopy was 98.6% [94.9–99.8]. In children under five, the specificities of HC microscopy and expert microscopy were 89.0% [79.5–95.1] and 94.5% [86.6–98.5], respectively. The overall sensitivity of HC microscopy was 47.2% [36.5–58.1] and the sensitivity of expert microscopy was 46.1% [35.4–57.0].

PCR.

The Bayesian analysis performed by Speybroeck et al. [14], in the absence of a reference test, highlighted that PCR is more sensitive than microscopy, with estimates for sensitivity varying from 95.0% [89.0–100.0] in Vietnam to 98.0% [95.0–100.0] in Peru Iquitos and Peru Jaen. In terms of specificity, the results are the following: Vietnam, 97.0% [95.0–100.0]; Peru Iquitos, 99.0% [98.0–100.0]; and Peru Jaen, 100.0% [99.0–100.0]. Coleman et al. [76] studied the performance of PCR at different parasite densities relative to expert laboratory microscopy, for active surveillance of Plasmodium falciparum and Plasmodium vivax, and reported that PCR was sensitive (95.7% [84.3–99.3]) and specific (98.1% [97.8–98.4]) for malaria at parasite densities ≥500/μl. However, the sensitivity of PCR dropped off markedly for parasite densities <500/μl. The specificity was consistently high, with a minimum lower bound of the CI equal to 97.4%.

Based on expert opinions on malaria diagnosis in STP and published works focusing on similar diagnostic tests, we consider Beta distributions to represent a pessimistic (or skeptical) prior, an optimistic prior and our own prior beliefs. The theoretical parameters and quantiles of these distributions are presented in Table 6. Our prior distributions for each subpopulation express that: (i) in general, the RDT test (ICT Diagnostics) presents a similar behavior in each subpopulation; (ii) PCR is usually more sensitive than microscopy; (iii) microscopy is usually more specific than PCR; (iv) the sensitivity of microscopy is slightly better in febrile individuals. Except for the specificity of microscopy, we consider the same prior distribution for each parameter across the four subpopulations (see Table 6), even though M5 only admits that the specificity of microscopy and the specificity of PCR are equal across subpopulations.

In Table 7, we present again the posterior means and HPD intervals for each parameter through model M5 with a skeptical prior, an optimistic prior and our prior beliefs distribution. We checked the convergence of all parameters, and not just those of interest, before making any inference, using trace plots and the Gelman-Rubin and Raftery-Lewis convergence diagnostics (the latter is presented in Table 7). DIC, AIC, BIC and Bayesian p-values are also indicated in Table 7. These measures favor our prior beliefs distribution, but the more optimistic prior is still admissible. The Bayesian p-value (0.007) associated with model M5 under the skeptical prior distribution reveals a prior-data conflict. Compared with the results obtained using M5 with non-informative priors, it can be seen (last columns in Table 4) that our prior information contributes to an increase in the sensitivities of microscopy and PCR in afebrile children under five. In febrile children under five, the sensitivities of RDT and PCR are also improved. The rest of the parameters are quite similar. In afebrile children under five, the sensitivity of microscopy (even under an optimistic prior) is very low. This finding is not unexpected, since previous studies have reported low values when this test is not considered as a gold standard, pointing out that asymptomatic cases often have malaria parasites undetectable by microscopy [14], [75].

Our study concerns a small region composed of two islands, where an intensive malaria control programme aimed at pre-elimination of malaria was successfully developed, prevalences were greatly reduced in general, and many positive cases had no associated clinical signs of malaria. Therefore, the data and results may not be comparable to those of other regions with higher prevalences obtained by microscopy or RDT. As mentioned before, different results may reflect different factors, and the word “comparison” may be too strong. Nevertheless, Chinkhumba et al. [77] state that malaria RDTs must have both sensitivity and specificity above 95% in field settings. These authors report that the sensitivities of the RDTs evaluated in their study are similar to the results of other published studies. However, they found a low specificity in febrile patients above 5 years of age. Kyabayinze et al. [78] warn of the low specificity of the ICT rapid test, especially in children below 5 years of age. In our study, for the RDT test in afebrile children under five, the specificity estimated by the posterior mean is 93.8%, and in the remaining subpopulations it is above 97.5%. In terms of sensitivity, for febrile children under five, we find 94.1% [87.5–99.4]. In all subpopulations, the positive predictive values of RDT are lower than those of the other tests. The PCR yields reliable results in the four subpopulations.

Comparison with Other Approaches

The sensitivity and specificity of several rapid malaria diagnostic tests have been estimated using microscopy as a gold standard. However, these measures may change substantially when the polymerase chain reaction is taken as the reference (e.g. [6]). Solely to understand the implications of the classical statistical approach (which is still common in the medical literature), in this subsection we present the performance measures of RDT, admitting microscopy as a gold standard (Table 8). The use of interval estimation for reporting performance measures is recommended, but the Wald method may not be appropriate. Thus, the Clopper-Pearson (or exact binomial), Wilson (or score), Agresti-Coull and Jeffreys methods are also used to obtain confidence intervals (see Table 8).

Table 8. Point estimates and 95% confidence intervals, through five different methods, for the sensitivity and the specificity of RDT in each subpopulation and overall, using microscopy as a gold standard.

The sensitivities are estimated based on smaller denominators and the corresponding Wald interval tends to provide higher lower bounds for the 95% CI than the other recommended methods. The specificities in all subpopulations are estimated from larger sample sizes and the five methods give similar results. The Wald method is not appropriate to report the performance of a diagnostic test, in particular, when the prevalence of an infection is small (high) because erratic values for the sensitivity (specificity) may occur.
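The behaviour of the Wald interval with small denominators can be illustrated with three of the five methods, which need only the standard library. This is a sketch using the textbook formulas; the function names and the counts in the example (28 positives out of 30) are hypothetical and are not taken from Table 8. The Clopper-Pearson and Jeffreys intervals require Beta quantiles and are omitted here.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_interval(x, n, z=Z_95):
    """Wald interval: p +/- z * sqrt(p(1-p)/n), without truncation to [0, 1]."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(x, n, z=Z_95):
    """Wilson (score) interval."""
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def agresti_coull_interval(x, n, z=Z_95):
    """Agresti-Coull interval: Wald formula applied to adjusted counts."""
    n_adj = n + z**2
    p_adj = (x + z**2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


# Hypothetical example: 28 RDT positives among 30 reference positives.
wald = wald_interval(28, 30)
wilson = wilson_interval(28, 30)
agresti_coull = agresti_coull_interval(28, 30)
```

In this hypothetical case the Wald lower bound is higher than the Wilson lower bound and the untruncated Wald upper bound exceeds 1, which is the kind of erratic behaviour mentioned above.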

Using the classical analysis (see Table 8), the sensitivities of RDT are lower than the Bayesian estimates in three subpopulations. The exception is afebrile children under five years old. The 95% HPD intervals are narrower than the 95% confidence intervals. Paradoxically, in our application, the Wald confidence intervals are the ones that most resemble the Bayesian credibility intervals, particularly for the sensitivities. Nevertheless, this method is not recommended for the typical values of sensitivities and specificities. One reviewer suggested a simple ad-hoc method, assessing the sensitivity of a test as the percentage of positive responses in the group that has positive values for both other tests, and the specificity as the percentage of negative responses among those that have negative values for both other tests. There is some proximity between this approach and the composite reference standards proposed by Alonzo and Pepe [79]. Particularly for the sensitivities, as the sample size decreases because the discrepant results between the two reference tests are discarded, the 95% confidence intervals are wider (data not shown). However, for the RDT test, combining the PCR and microscopy results, the point estimates are closer to the posterior means obtained by the Bayesian analysis. In terms of specificities, the reduction of the sample size has less effect because the samples are already larger, leading to smaller differences between the three approaches.
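The ad-hoc method suggested by the reviewer can be sketched as follows. The function name, the record layout and the counts are hypothetical and serve only to show the mechanics, in particular how the individuals with discordant reference results are dropped from both denominators.

```python
def composite_reference_accuracy(records, index):
    """Ad-hoc sensitivity and specificity of one test against the two others.

    records: list of tuples of 0/1 results, one tuple per individual.
    index: position in each tuple of the test under evaluation.
    Individuals whose two other tests disagree are discarded.
    """
    both_pos = [r for r in records
                if all(v == 1 for i, v in enumerate(r) if i != index)]
    both_neg = [r for r in records
                if all(v == 0 for i, v in enumerate(r) if i != index)]
    sens = sum(r[index] for r in both_pos) / len(both_pos) if both_pos else None
    spec = sum(1 - r[index] for r in both_neg) / len(both_neg) if both_neg else None
    return sens, spec, len(both_pos), len(both_neg)


# Hypothetical results ordered as (RDT, microscopy, PCR); not study data.
records = (
    [(1, 1, 1)] * 9 + [(0, 1, 1)] * 1      # both references positive
    + [(1, 0, 1)] * 4                      # references disagree: discarded
    + [(1, 0, 0)] * 3 + [(0, 0, 0)] * 83   # both references negative
)
sens, spec, n_pos, n_neg = composite_reference_accuracy(records, index=0)
```

Here the sensitivity rests on only 10 of the 100 individuals, which illustrates why the corresponding confidence intervals become wider.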

In addition to the philosophical perspective that prior information is an important source for characterizing a problem in a more realistic way, in this application the major advantage of the Bayesian approach is that the subpopulation parameters are estimated with narrower intervals than with the other approaches. Analyzing the four subpopulations at once and using informative priors could prevent identifiability problems. Furthermore, the use of constraints enhances the versatility of the modeling because it makes it possible to explore the differences and similarities between subpopulations.

Conclusions

The accuracy of diagnostic tests for malaria diagnosis based on optical microscopy as a gold standard has been criticized, and alternative statistical approaches have emerged that do not wrongly assume any of the diagnostic tests to be a perfect gold standard. Some studies have reported the performance measures in different populations, exhibiting some differences. Here, we have addressed this problem with a novel Bayesian approach, in the malaria context, which avoids defining a gold standard and provides estimates of the malaria infection prevalence and performance measures in different subpopulations simultaneously. Some deterministic and probabilistic constraints were considered to express available knowledge or suppositions of experts and published literature about the laboratory diagnosis of malaria.

Different models were explored, some of them providing similar results. The elected model was the one that considers a different prevalence in the afebrile individuals at least five years old and the same prevalence in the remaining three groups. This model assumes that the specificity of microscopy and the specificity of PCR are equal across subpopulations, while their sensitivities differ. In terms of the performance measures of RDT, no constraints are imposed in any subpopulation.

The data information collected in STP seems to be dominant, since the main findings were quite stable when we used different prior distributions. Whether we consider a positive expectation, using an optimistic prior, or a more skeptical position (pessimistic prior), the results were the same in terms of the order in which the tests were arranged and even in terms of the magnitude of some performance measures.

In the afebrile individuals at least five years old, the posterior estimate of the malaria infection prevalence was around 3.2% [2.3–4.1], and in the remaining studied groups it was around 10.3% [8.8–11.7]. Microscopy had poor sensitivity compared to the other tests, particularly in afebrile children under five years old: 50.5% [37.7–63.2]. The PCR yielded reliable results in the four subpopulations. However, in resource-limited settings, the PCR is not yet accepted as a primary diagnostic test in malaria diagnosis. According to Chinkhumba et al. [77], malaria RDTs must have both sensitivity and specificity above 95% in field settings. In STP, the results seem to satisfy these conditions in adults and children at least five years old. In children under five, the sensitivity was lower than this target. In all subpopulations, the data provide enough evidence to suggest caution with the positive predictive values of the RDT.

Supporting Information

Acknowledgments

The authors are grateful to the staff of CNE and the delegates of district health centers for their close cooperation in field operations, and to the assistants of the Center for Disease Control of Taiwan for their technical assistance in the laboratory work. We acknowledge the RIDES/Malaria CPLP. We also thank the reviewers for their constructive suggestions.