Screening without a "Gold Standard": The Hui-Walter Paradigm Revisited

Wesley O. Johnson1, Joseph L. Gastwirth2 and Larry M. Pearson3

1 Department of Statistics, University of California, Davis, CA.
2 Department of Statistics, George Washington University, Washington, DC.
3 Department of Mathematics and Statistics, Minnesota State University, Mankato, MN.

Correspondence to Dr. Wesley O. Johnson, Division of Statistics and Graduate Group in Epidemiology, University of California, One Shields Avenue, Davis, CA 95616-8705 (e-mail: wojohnson@ucdavis.edu).


    ABSTRACT
 
The authors consider screening populations with two screening tests but where a definitive "gold standard" is not readily available. They discuss a recent article in which a Bayesian approach to this problem is developed based on data that are sampled from a single population. It was subsequently pointed out that such inferences will not necessarily be accurate in the sense that standard errors for parameters may not decrease as n increases. This problem will generally occur when the data are insufficient to estimate all of the parameters as is the case when screening a single population with two tests. If both tests are applied to units sampled from two populations, however, this particular difficulty disappears. In this article the authors further examine this issue and develop an approach based on sampling two populations that yields increasingly accurate inferences as the sample size increases.

Bayesian approach; diagnostic test; Gibbs sampler; likelihood; prevalence; sensitivity and specificity


    INTRODUCTION
 
Consider the problem of estimating the prevalence of a disease and the accuracies of two screening tests when no "gold standard" is readily available. The two screening tests will be assumed to be conditionally independent, as in Joseph et al. (1) and Hui and Walter (2). The data consist of a 2 × 2 table of counts, indicating the number of individuals out of a sample of size n that test ++, +-, -+, or -- on the two tests (table 1). The statistical difficulty here is that there are only three independent cells in the table of data (the four cell counts must add to n) for estimating five parameters (the prevalence, the two sensitivities, and the two specificities). In statistical jargon, the problem lacks identifiability, as discussed by Neath and Samaniego (3).


 
TABLE 1. Two-test one-population screening data array

 
Andersen (4) essentially makes this point in his criticism of the Bayesian approach of Joseph et al. (1). The problem is that, even with a specific prior distribution on the five parameters of interest, posterior distributions need not become concentrated around the true values of the parameters as n increases. Thus, in spite of the fact that the probabilities of the four categories (++, +-, -+, --) will be estimated precisely with sufficiently large n, there is no such guarantee for the parameters of interest. Gastwirth et al. (5) and Johnson and Gastwirth (6) note that this occurs in the analysis of single-test screening data.

In the section, Two Tests, One Population, we demonstrate that, even with large n, low prevalence, and high accuracy, the marginal posteriors for the two sensitivities will be approximately the same as their prior distributions; that is, the data do not provide extra information about these parameters. If we include a second population with a different prevalence, all of the parameters are identifiable provided the tests are presumed independent, conditional on the true disease state (2). The Bayesian analysis of this procedure is given in the section, An Alternative Design Using Two Tests and Two Populations, and an illustration is presented using the Strongyloides infection data that were analyzed by Joseph et al. (1). These data are augmented with data from a second population.


    BACKGROUND MATERIAL
 
Let $T_1$ and $T_2$ denote the two screening tests that can be applied to a sample of n individuals from a population, and let $D$ denote the presence of the characteristic of interest and $\bar{D}$ its absence. The data are summarized by a 2 × 2 table of counts, where the left margin corresponds to $T_1$ and the top margin to $T_2$ and where the first row and column correspond to a + and the second to a - on the respective tests. Let $x_{ij}$ denote the count for row i and column j. A schematic of the data is given in table 1.

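The prevalence, sensitivities, and specificities are defined as

\[
\pi = \mathrm{pr}(D), \qquad \eta_i = \mathrm{pr}(T_i = + \mid D), \qquad \theta_i = \mathrm{pr}(T_i = - \mid \bar{D}), \qquad i = 1, 2.
\]

The conditional independence of $T_1$ and $T_2$ implies that $\mathrm{pr}(++ \mid D) = \eta_1 \eta_2$, and so on.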

Joseph et al. (1) developed a Bayesian approach, which assumes that prior uncertainty for these parameters can be represented by independent beta prior distributions, that is, $\pi \sim \mathrm{beta}(a_\pi, b_\pi)$, $\eta_i \sim \mathrm{beta}(a_{\eta_i}, b_{\eta_i})$, and $\theta_i \sim \mathrm{beta}(a_{\theta_i}, b_{\theta_i})$ for i = 1, 2. Then the joint posterior distribution of the parameters can be numerically approximated by using the Gibbs sampler discussed by Gelfand and Smith (7) and Tanner (8).


    TWO TESTS, ONE POPULATION
 
In this section, we review the lack of identifiability that occurs when the two tests are applied to any single population. When $\pi$ is very small and both tests are quite accurate, that is, $(\eta_1, \eta_2, \theta_1, \theta_2)$ near one, the technique developed by Gastwirth et al. (5) and Johnson and Gastwirth (6) yields the following approximate likelihood:

(1)
(see Appendix 1 for details). Notice that the parameters $\eta_1$, $\eta_2$ do not appear in expression 1, indicating that the data contain no information about them. As a result, the posterior information for the two sensitivities will be approximately the same as the prior information.
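
One way to see why the sensitivities can drop out of a low-prevalence, high-accuracy approximation is the leading-order sketch below. It is an illustration only: the constants $\lambda$, $\mu_1$, and $\mu_2$ and the scalings attached to them are assumptions introduced here, and the result is not claimed to match expression 1 term for term.

```latex
% Assumed scalings (hypothetical constants \lambda, \mu_1, \mu_2 > 0):
%   n\pi = \lambda, \quad n(1-\theta_1) = \mu_1, \quad n(1-\theta_2) = \mu_2,
% with \eta_1, \eta_2 close enough to one that \pi(1-\eta_i) is negligible.
\begin{align*}
p_{11} &= \pi\eta_1\eta_2 + (1-\pi)(1-\theta_1)(1-\theta_2) \approx \pi, \\
p_{12} &= \pi\eta_1(1-\eta_2) + (1-\pi)(1-\theta_1)\theta_2 \approx 1-\theta_1, \\
p_{21} &= \pi(1-\eta_1)\eta_2 + (1-\pi)\theta_1(1-\theta_2) \approx 1-\theta_2, \\
p_{22} &= \pi(1-\eta_1)(1-\eta_2) + (1-\pi)\theta_1\theta_2
        \approx 1 - \pi - (1-\theta_1) - (1-\theta_2).
\end{align*}
% Using \log(1-\varepsilon) \approx -\varepsilon and x_{22} \approx n:
\[
L \;\approx\; \pi^{x_{11}} (1-\theta_1)^{x_{12}} (1-\theta_2)^{x_{21}}
  \exp\bigl\{-n\,[\pi + (1-\theta_1) + (1-\theta_2)]\bigr\},
\]
% which involves \pi, \theta_1, \theta_2 only.
```

Under these assumed scalings, the discordant cells are driven almost entirely by false positives among the many nondiseased individuals, which is why they carry information about the specificities but essentially none about the sensitivities.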

To give a specific illustration, we generated the data ($x_{11} = x_{12} = x_{21} = 2$, $x_{22} = 194$) with n = 200 by fixing $(\pi, \eta_1, \eta_2, \theta_1, \theta_2) = (0.01, 0.9, 0.9, 0.99, 0.99)$ and then finding the table of integers that most closely matched the resulting expected values. For example, $E(x_{11}) = n\{\pi\eta_1\eta_2 + (1 - \pi)(1 - \theta_1)(1 - \theta_2)\} = 1.64$, $E(x_{12}) = E(x_{21}) = 2.14$, and $E(x_{22}) = 194.1$. We assume a prior for the parameters with $a_\pi = 1$, $b_\pi = 9$, $a_{\eta_1} = a_{\eta_2} = a_{\theta_1} = a_{\theta_2} = 9$, and $b_{\eta_1} = b_{\eta_2} = b_{\theta_1} = b_{\theta_2} = 1$.
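
The expected counts quoted above can be checked with a few lines of arithmetic. The sketch below is only a verification aid; the function name `expected_counts` is ours and does not come from the original analysis.

```python
def expected_counts(n, pi, eta1, eta2, theta1, theta2):
    """Expected 2 x 2 cell counts (x11, x12, x21, x22) for one population,
    assuming the two tests are conditionally independent given disease status."""
    p11 = pi * eta1 * eta2 + (1 - pi) * (1 - theta1) * (1 - theta2)
    p12 = pi * eta1 * (1 - eta2) + (1 - pi) * (1 - theta1) * theta2
    p21 = pi * (1 - eta1) * eta2 + (1 - pi) * theta1 * (1 - theta2)
    p22 = pi * (1 - eta1) * (1 - eta2) + (1 - pi) * theta1 * theta2
    return [n * p for p in (p11, p12, p21, p22)]


# Parameter values and sample size taken from the illustration in the text.
counts = expected_counts(200, 0.01, 0.9, 0.9, 0.99, 0.99)
print([round(c, 2) for c in counts])  # [1.64, 2.14, 2.14, 194.08]
```

Rounding each expected value to the nearest integer gives the table (2, 2, 2, 194) used in the illustration.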

In addition, we analyzed data vectors equal to 10 and 100 times the above data vector, while holding the priors fixed. Results are given in table 2. Note that the standard deviations for $\pi$, $\theta_1$, and $\theta_2$ decline substantially as the sample size is increased, while those for $\eta_1$ and $\eta_2$ do not.


 
TABLE 2. Posterior means and standard deviations (in parentheses) for the low prevalence/high accuracy data (presented in the section, Two Tests, One Population) for sample sizes 1, 10, and 100 times the data

 
Simply stated, the screening data resulting from one or two tests in the absence of a gold standard test for confirmatory testing are insufficient to estimate all of the parameters, even under the assumption of conditionally independent tests. While the Bayesian approach takes advantage of any current scientific knowledge about the accuracies of the tests and the prevalence of the population, Bayesian inferences for this problem will not generally converge to the "true" values, regardless of how large the sample size n is.


    AN ALTERNATIVE DESIGN USING TWO TESTS AND TWO POPULATIONS
 
The identifiability problem can be overcome by sampling from a second population with a different prevalence and then applying both tests to each person sampled. Following the method of Hui and Walter (2), we assume that the tests have the same accuracy rates in both populations, which have respective prevalences $\pi_1$ and $\pi_2$. The data are presented as the 2 × 2 × 2 table of counts $\{x_{ijk}\}$, as exemplified in table 3, with the first subscript, i, denoting the outcome of $T_1$, the second subscript, j, denoting the outcome of $T_2$, and the third subscript, k, denoting the population. There are now six independent cells and six parameters, so under the conditional independence assumption, the identifiability problem no longer exists.
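
The parameter counting above can be illustrated numerically by looking at the rank of the Jacobian of the cell probabilities with respect to the parameters: a rank below the number of parameters signals that some directions in parameter space leave the distribution of the data unchanged. The following is a minimal sketch, not part of the original analysis; the parameter values, the helper names, and the finite-difference tolerance are all arbitrary choices made for illustration.

```python
import numpy as np


def cell_probs(pi, eta1, eta2, theta1, theta2):
    """Cell probabilities [p11, p12, p21, p22] for one population
    under conditional independence of the two tests."""
    dis = np.array([eta1 * eta2, eta1 * (1 - eta2),
                    (1 - eta1) * eta2, (1 - eta1) * (1 - eta2)])
    non = np.array([(1 - theta1) * (1 - theta2), (1 - theta1) * theta2,
                    theta1 * (1 - theta2), theta1 * theta2])
    return pi * dis + (1 - pi) * non


def one_pop(p):
    return cell_probs(*p)


def two_pop(p):
    """Two populations sharing test accuracies but with separate prevalences."""
    pi1, pi2, eta1, eta2, theta1, theta2 = p
    return np.concatenate([cell_probs(pi1, eta1, eta2, theta1, theta2),
                           cell_probs(pi2, eta1, eta2, theta1, theta2)])


def jacobian(f, x, h=1e-5):
    """Central-difference Jacobian of f at x (one column per parameter)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        cols.append((f(x + step) - f(x - step)) / (2 * h))
    return np.column_stack(cols)


# Arbitrary illustrative parameter values (not taken from any data set).
# The explicit tolerance absorbs finite-difference rounding error.
rank_one = np.linalg.matrix_rank(
    jacobian(one_pop, [0.3, 0.8, 0.7, 0.9, 0.85]), tol=1e-8)
rank_two = np.linalg.matrix_rank(
    jacobian(two_pop, [0.2, 0.6, 0.8, 0.7, 0.9, 0.85]), tol=1e-8)
rank_equal = np.linalg.matrix_rank(
    jacobian(two_pop, [0.4, 0.4, 0.8, 0.7, 0.9, 0.85]), tol=1e-8)

print("one population: rank", rank_one, "for 5 parameters")
print("two populations, different prevalences: rank", rank_two, "for 6 parameters")
print("two populations, equal prevalences: rank", rank_equal, "for 6 parameters")
```

At these values the one-population Jacobian has rank 3 for five parameters, while the two-population Jacobian reaches full rank 6 only when the two prevalences differ, in line with the requirement of different prevalences.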


 
TABLE 3. Appended Strongyloides infection data

 
We illustrate the method assuming independent beta priors for the parameters. Details of the appropriate Gibbs sampling approach are given in Appendix 2. We consider an augmented data set where the sample from population 1 is the Strongyloides infection data analyzed by Joseph et al. (1), and an additional sample is constructed that is presumed to be from a second population. These data are given in table 3. To construct data from population 2, we first selected $n_2 = 201$. We then assumed that the data were generated from a model where the observed cell relative frequencies, $x_{ij2}/n_2$, were approximately equal to their corresponding expectations; for example, $x_{112}/n_2 \approx \pi_2\eta_1\eta_2 + (1 - \pi_2)(1 - \theta_1)(1 - \theta_2)$, and so on. We considered the collection of all possible parameter vectors $(\pi_1, \pi_2, \eta_1, \eta_2, \theta_1, \theta_2)$ satisfying these four constraints and chose the vector (0.4, 0.65, 0.955, 0.607, 0.351, 0.993). These values were chosen specifically so that they would not cohere with the information used by Joseph et al. (1). We then selected the second population data to fit as closely as possible to the given choice of parameters.

To illustrate the role of the prior distribution, we analyzed the data using the prior from Joseph et al. (1), with a uniform prior for $\pi_2$, and also with a prior that coheres with the data; for example, $\eta_1 \sim \mathrm{beta}(4, 6)$, $\theta_1 \sim \mathrm{beta}(19, 1)$, $\pi_1 \sim \mathrm{beta}(4, 6)$, $\pi_2 \sim \mathrm{beta}(7, 3)$, $\eta_2 \sim \mathrm{beta}(6, 4)$, $\theta_2 \sim \mathrm{beta}(99, 1)$. In both cases we considered increased sample sizes, keeping the relative proportions, $x_{ijk}/n_k$, constant for the selected sample sizes. The results in table 4 show that, when the prior conforms with the data (cases 4 and 5), the estimates converge faster to their true values than when the prior knowledge fails to conform with it. In either case, the effect of the prior diminishes as $n \to \infty$ (compare with Gelman et al. (9)).


 
TABLE 4. Posterior means and standard deviations (in parentheses) for the augmented Strongyloides infection data

 

    DISCUSSION AND CONCLUSIONS
 
We have illustrated how Bayesian inferences based on single population data for the prevalence and accuracies of two screening tests may be imprecise regardless of the sample size. This problem inevitably arises in nonidentifiable situations.

We showed that this problem is eliminated when one can apply the tests to two populations with different prevalences. Provided that the assumptions of the Hui-Walter (2) paradigm are satisfied, the Bayesian inferences will be consistent; that is, estimates will converge to the true values of the underlying parameters even when the prior information turns out not to be in good agreement with the observed data.

We also considered a large-sample approach via the expectation-maximization algorithm (compare with Dempster et al. (10)), which was used to obtain the posterior mode. This method is somewhat simpler to implement and is more stable when sample sizes are large, so it would be preferred to the Gibbs sampling approach in this instance. See Singer et al. (11) for details on maximum likelihood estimation via the expectation-maximization algorithm in the Hui-Walter model. To make interval inferences, a method of obtaining standard errors is required (8). This involves inverting the negative of the second-derivative matrix of the log posterior evaluated at the mode. With uniform priors, this matrix is the observed Fisher information, which would be obtained in the context of standard large-sample maximum likelihood estimation.
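
As a rough illustration of the expectation-maximization idea for this model, the sketch below treats the unknown disease status as missing data. It assumes uniform priors, so that the posterior mode coincides with the maximum likelihood estimate; the function names, starting values, and the artificial "data" (exact expected counts from made-up parameter values) are ours and are not taken from the analyses reported here or in reference 11.

```python
import numpy as np


def class_terms(pi, eta, theta):
    """Diseased and nondiseased parts of the 2 x 2 cell probabilities.
    Index 0 = test positive, index 1 = test negative."""
    dis = pi * np.outer([eta[0], 1 - eta[0]], [eta[1], 1 - eta[1]])
    non = (1 - pi) * np.outer([1 - theta[0], theta[0]],
                              [1 - theta[1], theta[1]])
    return dis, non


def em_hui_walter(x, n_iter=2000):
    """EM for the two-test, K-population model; x has shape (2, 2, K)."""
    x = np.asarray(x, dtype=float)
    K = x.shape[2]
    # Arbitrary starting values on the "tests are informative" side.
    pi, eta, theta = np.full(K, 0.5), np.full(2, 0.8), np.full(2, 0.8)
    for _ in range(n_iter):
        # E-step: expected number of diseased individuals in each cell.
        z = np.empty_like(x)
        for k in range(K):
            dis, non = class_terms(pi[k], eta, theta)
            z[:, :, k] = x[:, :, k] * dis / (dis + non)
        w = x - z  # expected number of nondiseased individuals
        # M-step: complete-data proportions.
        pi = z.sum(axis=(0, 1)) / x.sum(axis=(0, 1))
        eta = np.array([z.sum(axis=(1, 2))[0], z.sum(axis=(0, 2))[0]]) / z.sum()
        theta = np.array([w.sum(axis=(1, 2))[1], w.sum(axis=(0, 2))[1]]) / w.sum()
    return pi, eta, theta


# Artificial check: feed in exact expected counts from made-up parameters.
true_pi, true_eta, true_theta = [0.2, 0.6], [0.9, 0.85], [0.95, 0.9]
x = np.empty((2, 2, 2))
for k in range(2):
    dis, non = class_terms(true_pi[k], true_eta, true_theta)
    x[:, :, k] = 1000 * (dis + non)

pi_hat, eta_hat, theta_hat = em_hui_walter(x)
print(np.round(pi_hat, 3), np.round(eta_hat, 3), np.round(theta_hat, 3))
```

With informative beta priors the M-step would use posterior modes rather than raw proportions; that variation is omitted here to keep the sketch short.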

The problems described in the single-population setting are due to the lack of identifiability that affects both frequentist and Bayesian inference. The utility of the Bayesian method as a partial resolution to the one population problem depends on the quality of available prior information, because reliable and accurate prior information in conjunction with good data can only improve inferences.


    APPENDIX 1
 
Low Prevalence High Accuracy
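The likelihood, L, for our observed data is

\[
\begin{aligned}
L \propto{} & \{\pi\eta_1\eta_2 + (1-\pi)(1-\theta_1)(1-\theta_2)\}^{x_{11}}
              \{\pi\eta_1(1-\eta_2) + (1-\pi)(1-\theta_1)\theta_2\}^{x_{12}} \\
            & \times \{\pi(1-\eta_1)\eta_2 + (1-\pi)\theta_1(1-\theta_2)\}^{x_{21}}
              \{\pi(1-\eta_1)(1-\eta_2) + (1-\pi)\theta_1\theta_2\}^{x_{22}}.
\end{aligned}
\]
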
We make the assumptions , where , , are positive, as in Johnson and Gastwirth (6). These assumptions are completely analogous to those made in the usual Poisson approximation to the binomial; see, for example, p. 195 of Larsen and Marx (12). So we see that = $n\pi$ and that the assumptions imply that we are assuming $\pi$ to be small and the sensitivities and specificities to be near one with large n. The likelihood simplifies to expression 1, where we note that $x_{22}$ behaves like n in large samples.


    APPENDIX 2
 
The Gibbs Sampler for the Hui-Walter Model
We use the notation '.' to indicate various subtotals. For example, $x_{1\cdot\cdot}$ denotes the total number of positive outcomes on test 1, $x_{\cdot\cdot k} = n_k$ for k = 1, 2 denotes the two population sample sizes, and so on.

The missing or latent data are the 2 × 2 × 2 table of counts for those persons who are D's, say $\{z_{ijk}\}$. Given the observed counts $\{x_{ijk}\}$, the table of missing counts consists of independently distributed binomial variates with $z_{ijk} \mid \{x_{ijk}\} \sim \mathrm{Bin}(x_{ijk}, p_{ijk})$, where $p_{ijk}$ is the conditional probability of being a D given the person is from row i, column j, and population k. For example, using Bayes' theorem,

\[
p_{11k} = \frac{\pi_k\eta_1\eta_2}{\pi_k\eta_1\eta_2 + (1-\pi_k)(1-\theta_1)(1-\theta_2)};
\]

compare with Brookmeyer and Gail (13) and Gastwirth (14). Furthermore, the posterior distribution of the parameters, given the data $\{x_{ijk}\}$ and the missing data $\{z_{ijk}\}$, is the product of independent beta posteriors for each parameter. For example, the augmented data posterior for $\pi_k$ is $\mathrm{beta}(a_{\pi_k} + z_{\cdot\cdot k},\ b_{\pi_k} + n_k - z_{\cdot\cdot k})$; the corresponding distributions for $\eta_1$ and $\eta_2$ are $\mathrm{beta}(a_{\eta_1} + z_{1\cdot\cdot},\ b_{\eta_1} + z_{2\cdot\cdot})$ and $\mathrm{beta}(a_{\eta_2} + z_{\cdot 1\cdot},\ b_{\eta_2} + z_{\cdot 2\cdot})$, respectively; and for $\theta_1$ and $\theta_2$, they are $\mathrm{beta}(a_{\theta_1} + x_{2\cdot\cdot} - z_{2\cdot\cdot},\ b_{\theta_1} + x_{1\cdot\cdot} - z_{1\cdot\cdot})$ and $\mathrm{beta}(a_{\theta_2} + x_{\cdot 2\cdot} - z_{\cdot 2\cdot},\ b_{\theta_2} + x_{\cdot 1\cdot} - z_{\cdot 1\cdot})$.

Thus, given starting values for the parameters, one can alternately sample from these two sets of distributions to obtain a Gibbs sample from the joint distribution.
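
The two alternating steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used for the analyses in this article: the counts are made up, the priors are taken to be uniform beta(1, 1) for every parameter, and the function name, starting values, and chain length are arbitrary.

```python
import numpy as np


def gibbs_hui_walter(x, prior, n_iter=3000, burn_in=1000, seed=0):
    """Gibbs sampler for two tests applied to K populations.

    x     : integer array of shape (2, 2, K); index 0 = test positive.
    prior : dict with keys "pi", "eta", "theta"; each value is a pair (a, b)
            of arrays holding the beta hyperparameters.
    Returns draws with columns (pi_1..pi_K, eta_1, eta_2, theta_1, theta_2).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=int)
    K = x.shape[2]
    n_k = x.sum(axis=(0, 1))
    x_row, x_col = x.sum(axis=(1, 2)), x.sum(axis=(0, 2))
    pi, eta, theta = np.full(K, 0.5), np.full(2, 0.8), np.full(2, 0.8)
    draws = np.empty((n_iter - burn_in, K + 4))
    for it in range(n_iter):
        # Step 1: latent diseased counts z_ijk ~ Bin(x_ijk, p_ijk).
        z = np.empty_like(x)
        for k in range(K):
            dis = pi[k] * np.outer([eta[0], 1 - eta[0]], [eta[1], 1 - eta[1]])
            non = (1 - pi[k]) * np.outer([1 - theta[0], theta[0]],
                                         [1 - theta[1], theta[1]])
            z[:, :, k] = rng.binomial(x[:, :, k], dis / (dis + non))
        z_k = z.sum(axis=(0, 1))
        z_row, z_col = z.sum(axis=(1, 2)), z.sum(axis=(0, 2))
        # Step 2: independent beta draws given the augmented data.
        a, b = prior["pi"]
        pi = rng.beta(a + z_k, b + n_k - z_k)
        a, b = prior["eta"]
        eta = rng.beta(a + np.array([z_row[0], z_col[0]]),
                       b + np.array([z_row[1], z_col[1]]))
        a, b = prior["theta"]
        theta = rng.beta(a + np.array([x_row[1] - z_row[1], x_col[1] - z_col[1]]),
                         b + np.array([x_row[0] - z_row[0], x_col[0] - z_col[0]]))
        if it >= burn_in:
            draws[it - burn_in] = np.concatenate([pi, eta, theta])
    return draws


# Made-up counts for two populations (NOT the Strongyloides data).
x = np.zeros((2, 2, 2), dtype=int)
x[:, :, 0] = [[30, 10], [15, 145]]
x[:, :, 1] = [[110, 25], [20, 45]]
uniform = (np.ones(2), np.ones(2))  # beta(1, 1) priors, assumed for illustration
draws = gibbs_hui_walter(x, {"pi": uniform, "eta": uniform, "theta": uniform})
print("posterior means:", draws.mean(axis=0).round(3))
print("posterior SDs:  ", draws.std(axis=0).round(3))
```

Because the likelihood is unchanged when the roles of diseased and nondiseased are swapped, a chain started elsewhere or run under weak priors may visit the mirror-image mode; informative priors or a constraint such as $\eta_i + \theta_i > 1$ would be one way to guard against that.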


    ACKNOWLEDGMENTS
 
The first author acknowledges support by the NRI competitive grants program/US Department of Agriculture award 98-35204-6535. The second author acknowledges support from the National Science Foundation and that some of the work was completed while visiting the Biostatistics Branch of the Division of Cancer and Epidemiology and Genetics of the National Cancer Institute.


    References
 

  1. Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and parameters for diagnostic tests in the absence of a gold standard. Am J Epidemiol 1995;141:263–72.
  2. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics 1980;36:167–71.
  3. Neath A, Samaniego FJ. On the efficacy of Bayesian inference for nonidentifiable models. Am Statistician 1997;51:225–32.
  4. Andersen S. Re: "Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard." (Letter). Am J Epidemiol 1997;145:290–1.
  5. Gastwirth JL, Johnson WO, Reneau DM. Bayesian analysis of screening data: application to AIDS in blood donors. Can J Stat 1991;19:135–50.
  6. Johnson WO, Gastwirth JL. Bayesian inference for medical screening tests: approximations useful for the analysis of acquired immune deficiency syndrome. J R Stat Soc (B) 1991;53:427–39.
  7. Gelfand AE, Smith AF. Sampling-based approaches to calculating marginal densities. J Am Stat Assoc 1990;85:398–409.
  8. Tanner MA. Tools for statistical inference. New York, NY: Springer-Verlag, 1993.
  9. Gelman A, Carlin JB, Stern HS, et al. Bayesian data analysis. London, UK: Chapman and Hall, 1995.
  10. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc (B) 1977;39:1–38.
  11. Singer R, Boyce W, Gardner I, et al. Evaluation of blue-tongue virus diagnostic tests in free-ranging bighorn sheep. Prev Vet Med 1998;35:265–82.
  12. Larsen RJ, Marx ML. An introduction to mathematical statistics and its applications. 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1986.
  13. Brookmeyer R, Gail MH. AIDS epidemiology, a quantitative approach. New York, NY: Oxford University Press, 1993.
  14. Gastwirth JL. The statistical precision of medical screening procedures: application to polygraph and AIDS antibodies test data. Stat Sci 1987;2:213–22.
Received for publication April 6, 1998. Accepted for publication July 27, 2000.