Accuracy of Alternative Approaches to Capture-Recapture Estimates of Disease Frequency: Internal Validity Analysis of Data from Five Sources

Ernest B. Hook1,2 and Ronald R. Regal3

1 School of Public Health, University of California, Berkeley, Berkeley, CA.
2 Department of Pediatrics, School of Medicine, University of California, San Francisco, CA.
3 Department of Mathematics and Statistics, University of Minnesota, Duluth, MN.


    ABSTRACT
The authors used "internal validity analysis" to evaluate the performance of various capture-recapture methods. Data from studies with five overlapping, incomplete lists generated subgroups whose known sizes were compared with estimates derived from various four-source capture-recapture analyses. In 15 data sets unanalyzed previously (five subgroups of each of three new studies), the authors observed a trend toward mean underestimation of the known population size by 16–25%. (Coverage of the 90% confidence intervals associated with the method found to be optimal was acceptable (13/15), despite the downward bias.) The authors conjectured that (with the obvious exception of geographically disparate lists) most data sets used by epidemiologists tend to have a net positive dependence; that is, cases captured by one source are more likely to be captured by some other available source than are cases selected randomly from the population, and this trend results in a bias toward underestimation. Attempts to ensure that the underlying assumptions of the methods are met, such as minimizing (or adjusting adequately) for the possibility of loss due to death or migration, as was undertaken in one exceptional study, appear likely to improve the behavior of these methods. Am J Epidemiol 2000;152:771–9.

Akaike information criterion; Bayesian information criterion; benzodiazepines; cerebrovascular disorders; Down syndrome; loglinear model; narcotics; scleroderma, systemic

Abbreviations: AIC, Akaike information criterion; AICC, Akaike information criterion corrected; BIC, Bayesian information criterion; DIC, Draper's modification of the Schwarz' information criterion; EB, Evans and Bonett; HR, Hook and Regal; IC, information criterion; SD, standard deviation; SIC, Schwarz' information criterion.


    INTRODUCTION
Capture-recapture methods are used widely in epidemiology to estimate the prevalence of disease (1–3). An apparent attraction of these methods is that an investigator may use preexisting, overlapping, incomplete lists of affected persons to generate a potentially useful estimate and thus, it is hoped, avoid the need to attempt an exhaustive and expensive complete census. Indeed, another application of capture-recapture methods is evaluation of the likely completeness of any such attempted census (4), by either treating each major type of ascertainment as a "source" of an overlapping, incomplete list or undertaking an independent sample of the target population to use as a second source in a capture-recapture analysis of the pooled census results.

Application of these methods, especially those involving preexisting lists of cases, has been subject to some criticism; often, the mechanisms involved in ascertaining cases from one or more sources violate underlying assumptions of the statistical methods used (5–7). The most prevalent approach to capture-recapture analysis in epidemiology, introduced by Fienberg (8) and by Bishop et al. (9) as an extension of the work of Wittes (10), is some application of log-linear methods. For example, if there are k lists or sources, the investigator estimates the number in the "missing" or unobserved cell of a 2^k table. This unobservable cell count corresponds to those persons missed by all k sources. Estimating the missing cell enables the size of the entire population to be estimated. However, one must presume no "variable catchability" of cases in the population studied, that is, that all persons are equally likely to be captured by any given source. This presumption requires, for instance, that the population be "closed": there is no loss or gain of cases from death, travel, or migration during the time interval analyzed. But most populations on which data are available from overlapping, incomplete lists and that are of interest to epidemiologists are "open" to a greater or lesser extent. For this reason alone, some variable catchability is almost always present in the data sets usually available to epidemiologists (1, 11).

Thus, practical concerns about these methods indicate the importance of evaluating their behavior under the almost certain violation of underlying assumptions in epidemiologic application (1). Often, the investigator cannot estimate even the extent of such violations present in any data set or the likely direction of the overall effect on the bias of the derived estimate, that is, whether it is an underestimate or overestimate. For instance, for cases ascertained by using death certificates (as in one study included here (12)), less time will have been available for ascertainment during the prevalence period studied; however, these cases are more likely than more mildly affected cases to have been ascertained by other medical sources, making it very difficult to predict the overall direction of the net bias when such a source is used. For this reason, we focused on evaluating the actual behavior of these methods with various sets of "real" data gathered by epidemiologists and the extent to which their application may lead to a seriously misleading estimate in particular instances in which the true size is known. For these purposes, we applied various capture-recapture methods to estimation of the known sizes of particular lists (or sources) of cases generated within such studies. With k overlapping, incomplete lists of cases in one population, one may use the information from any k - 1 of the lists to attempt an estimate of the size of the singled-out source and compare the estimate with the known size. We have designated this approach as a method of internal validity evaluation of the various methods used.

In a previous analysis of 20 subpopulations of known size reported in five separate studies, we found, by using this approach, that most capture-recapture methods involving log-linear techniques likely to be adopted by investigators resulted in a mean underestimate of about 10–20 percent of the true population size (13). That analysis applied primarily to estimates reached from three overlapping lists of cases (i.e., three-source estimates), limiting the complexity and range of the approaches considered. Here, we extended these analyses to 15 additional subpopulations in data from three separate reports, each of which enabled evaluation of estimates reached from four overlapping, incomplete lists of cases. Doing so enabled some extension and expansion of the methods considered. We also reevaluated one data set considered earlier.


    BACKGROUND AND METHODS
Capture-recapture analysis with only two sources generates one estimate associated with a single model. Log-linear methods with three sources enable eight different hierarchic models to be constructed that include main effects for all sources (i.e., data from all of the sources are used), each associated with a potentially different estimate. (In each instance, the missing cell count is treated as if it were a parameter and is estimated by the maximum likelihood method.) These eight are the independent model (with 3 df), three different models that allow for one two-way interaction (each with 2 df), three models that allow for two two-way interactions (1 df), and one model that includes all three two-way interactions. As the latter has zero degrees of freedom, it is designated "saturated" (1, 8, 9). Four sources are associated with 113 and five sources with 6,893 different possible models (1). Proposed methods of model selection are discussed below. Because of the sensitivity of the estimate associated with some models (especially the saturated model) to small, especially null, cells, use of a small sample adjustment may enable derivation of a finite or plausible estimate that could not be obtained otherwise.
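The model counts quoted above (8 for three sources, 113 for four) can be checked by direct enumeration. The sketch below, in Python with a function name of our own choosing, identifies each hierarchic model with a set of interaction terms of order 2 through k - 1 that is closed under taking sub-interactions; the k-way interaction is excluded because the missing cell makes it inestimable.

```python
from itertools import combinations

def count_hierarchic_models(k):
    """Count hierarchic log-linear models for k sources that include
    all main effects and exclude the k-way interaction.  A model is a
    set of interaction terms (orders 2..k-1) closed under taking
    sub-interactions of order >= 2."""
    terms = [frozenset(c) for r in range(2, k)
             for c in combinations(range(k), r)]

    def closed(model):
        # every sub-interaction (order >= 2) of an included term
        # must itself be included in the model
        return all(frozenset(sub) in model
                   for t in model
                   for r in range(2, len(t))
                   for sub in combinations(sorted(t), r))

    return sum(
        closed({terms[i] for i in range(len(terms)) if bits >> i & 1})
        for bits in range(2 ** len(terms))
    )
```

With three sources the interaction terms are just the three two-way terms, so all 2^3 = 8 subsets qualify; with four sources, 113 of the 2^10 candidate subsets are downward closed.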

For any data set with three or more sources, the combination of the various proposed alternatives for model selection, adjustment (if any) for model uncertainty, and small sample adjustment (if any) generates a number of different possible approaches to log-linear estimation and, consequently, estimates that may differ considerably (1). A major issue of practical importance, then, is deciding which of the many possible approaches performs optimally.

For this analysis, information was available from three new studies (12, 14–18), each with five incomplete, partially overlapping sources, labeled A, B, C, D, and E (a generic source is denoted i). Five such sources generate 2^5 - 1 = 31 different "cells," each of which includes some number (or zero) of observed persons. By using varying subscripts, we denoted each cell (or the number observed within it) as x11111 through x00001, as indicated in table 1, rows 1–31. We denoted the total number in any source i as Ni, the number unique to source i as xi, and the number in source i also found in other sources (i.e., the sum of the cells that represent persons observed in source i and in at least one other source) as (Ni)o for any source i:

(Ni)o = Ni - xi. (1)


TABLE 1. Data structures analyzed to determine the accuracy of capture-recapture estimates of disease frequency

 
For sources A, B, C, D, and E, xi = x10000, x01000, x00100, x00010, and x00001, as indicated in table 1, rows 16, 24, 28, 30, and 31, respectively. For any source in a five-source data set, 2^4 - 1 = 15 intersections are possible with the other four sources. (Cells in rows 1–15 of table 1 give these values for source A, for example.) We regarded these 15 as "observed" cells from which we could construct a four-source "internal" capture-recapture analysis of the total in i, as if we did not know the total Ni or the number xi unique to source i. From the numbers in the observed cells, we used one of the capture-recapture methods described below to derive an estimate of the total in i and, to evaluate the method, compared that estimate with the known value of Ni. Thus, for source A, we undertook a four-source capture-recapture analysis by using data on cases in sources B, C, D, and E also present in source A (cells in rows 1–15) to generate an estimate of the size of source A and compared that estimate with the known total of A, given by the sum of cells in rows 1–16.
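As a concrete sketch of this construction (in Python, with a hypothetical helper name and data layout of our own), the 15 observed cells and the known total for a singled-out source can be extracted from the 31 five-source cell counts as follows:

```python
def internal_validity_table(cells, source_index, k=5):
    """Given a dict of cell counts keyed by k-character indicator
    strings (e.g., '10110'), return (observed, known_total) for one
    singled-out source: the (k-1)-source cells for cases captured by
    that source, with the cell unique to the source withheld as if it
    were unobserved."""
    observed, known_total = {}, 0
    for key, n in cells.items():
        if key[source_index] == '1':        # case captured by source i
            known_total += n
            sub = key[:source_index] + key[source_index + 1:]
            if sub != '0' * (k - 1):        # x_i itself is treated as unknown
                observed[sub] = n
    return observed, known_total
```

A four-source capture-recapture estimate computed from `observed` can then be compared with `known_total`.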

We evaluated the adequacy of each estimate by using the logarithm of its relative bias (the log relative bias), where

log relative bias = ln(estimated Ni / known Ni). (2)

We calculated confidence intervals by using a likelihood-based method described earlier (13, 19). A value of zero indicates an accurate estimate, a negative value indicates an underestimate, and a positive value indicates an overestimate.

We evaluated estimates produced by 10 different methods of model selection or model weighting. These methods are described in the paragraphs that follow.

We looked at the following two methods that used an estimate associated with a prespecified model, which previous simulations and/or theoretical considerations suggested might be useful (1, 13):

  1. All two-way interactions present
  2. The saturated model

The estimates associated with the saturated model were derived from closed-form expressions, as given by Bishop et al. (9). A major disadvantage of this method is the sensitivity of the estimate to null values in any cell. Furthermore, however good the estimate may be, the complexity of the saturated model tends to result in very wide, sometimes uselessly large if not infinite, confidence intervals.
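For reference, the closed-form saturated-model estimate follows from setting the (inestimable) k-way interaction to zero, which forces the alternating product of cell counts to equal one. A sketch, in our own phrasing of the Bishop et al. expression and assuming all observed cells are nonzero:

```python
from math import prod

def saturated_missing_cell(observed):
    """Closed-form saturated-model estimate of the unobserved cell:
    (product of cells captured by an odd number of sources) divided by
    (product of cells captured by an even number of sources).  A null
    cell in the denominator makes the estimate infinite, the
    sensitivity noted in the text."""
    odd = prod(n for key, n in observed.items() if key.count('1') % 2)
    even = prod(n for key, n in observed.items() if key.count('1') % 2 == 0)
    return odd / even
```

With two sources this reduces to the familiar x10 x01 / x11; with four sources the denominator comprises the all-ones cell and the six pair cells.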

With more than three sources, there is no closed-form expression for estimates associated with the presence of all two-way interactions. To derive an estimate, one may use an iterative proportional-fitting algorithm, for example, the one given by Bishop et al. (9), or a modification of the Newton-Raphson method described by Haberman (20), which computes maximum likelihood estimates for any log-linear model.

We also evaluated four methods that select the model with the minimum value of an information criterion (IC). Each IC considered is some variant or extension of

IC = G2 - (c)(df), (3)

where G2 is the likelihood ratio statistic (-2 times the logarithm of the ratio of the likelihood of the fitted model to the likelihood of the saturated model) (9), df is the number of degrees of freedom for the comparison of any fitted model with the saturated model, that is, the number of degrees of freedom not "consumed" by the model (0 for the saturated model up to 4 for the independent model with four sources), and c is a constant that varies with each method, as follows:

1. Minimum AIC: c = 2, the Akaike information criterion (21)

2. Minimum SIC: c = ln((Ni)o), sometimes known as the Bayesian information criterion (BIC) or Schwarz' information criterion (22)

3. Minimum DIC: c = ln((Ni)o/(2π)), which is Draper's modification of the Schwarz' Bayesian information criterion (23)

We also define a "corrected AIC" or minimum AICC:

4. Minimum AICC = AIC + (2p(p + 1))/((Ni)o - p - 1), a correction to the Akaike information criterion, originally designated AICc, where, with k sources for a model with df degrees of freedom, p is the number of parameters; therefore, p = 2^k - df - 1 (24, 25).

For any data set, any two criteria may of course "choose" the same model as optimal (and thus imply the same estimate). If they all choose different models, then, with No denoting the number of observed cases, for No > 46 (as is true for almost all of the data here and in most investigations involving three or more sources), the order of complexity of the models selected, from least to highest, is SIC, DIC, AICC, and AIC. (The threshold arises because DIC's penalty constant, ln(No/(2π)), exceeds AIC's constant of 2 exactly when No > 2πe^2, which is approximately 46.)
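The four criteria can be written in a few lines. The sketch below (our own helper, with the observed count No standing in for (Ni)o and k sources) returns each criterion for a single candidate model:

```python
import math

def information_criteria(G2, df, N_obs, k=4):
    """Compute the four information criteria discussed above for one
    fitted model.  Each has the form G2 - (c)(df); AICC adds its
    small-sample correction to AIC, with p = 2**k - df - 1 parameters."""
    p = 2 ** k - df - 1
    aic = G2 - 2 * df
    return {
        'AIC': aic,
        'AICC': aic + (2 * p * (p + 1)) / (N_obs - p - 1),
        'SIC': G2 - math.log(N_obs) * df,
        'DIC': G2 - math.log(N_obs / (2 * math.pi)) * df,
    }
```

Applying this to every candidate model and taking the minimum of a given criterion implements the corresponding model-selection method.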

We also reviewed four related, informal Bayesian methods, as implied by the methods of Draper (23). These methods weight each estimate by a function of the IC of its associated model. For the analyses of four sources conducted here, each weighted estimate is derived from a weighted combination of 113 different estimates. If, for source i, Nij is the estimate of Ni derived from model j and ICij is the value of the IC associated with model j, then the weighted estimate derived from all 113 models is

Ni (weighted) = [Σj exp(-ICij/2) × Nij] / [Σj exp(-ICij/2)]. (4)
Draper originally suggested this approach for his modification of the SIC, and previous work suggested that it might be usefully applied with other information criteria as well (13).

If null (empty) cells are present, then a model may result in an impossibly large ("infinite") estimate. If any single model of the 113 known to be possible in a four-source analysis (1) results in an impossible estimate, then no weighted estimate can be derived. If there are no null cells, or if an appropriate small sample adjustment is introduced, then a finite estimate associated with all 113 models, and thus a weighted estimate, can be derived.

For these reasons, we also considered two candidate small sample adjustments from the literature. One, implied by an Evans and Bonett proposal (add 1/2^(k-1) to each of the 2^k - 1 cells that occur with k sources), adds 0.125 to each cell in the four-source analyses undertaken here (26). Thus, for source A, this amount would be added to each cell in the first 15 rows of table 1 and a total of (7)(0.25) = 1.75 subtracted from the final estimate. We denoted this as the Evans and Bonett (EB) adjustment. The other adjustment, which we suggested based on results with simulations, adds 1.0 to all cells in the denominator of the expression for the saturated-model estimate of the missing cell (and derives the final estimate by subtracting from the calculated value the sum of the amounts added to undertake the correction) (1). For source A, these are cells x11111, x11100, x11010, x11001, x10110, x10101, and x10011, as given in table 1, rows 1, 4, 6, 7, 10, 11, and 13, respectively. We denoted this as the Hook and Regal (HR) adjustment. As suggested by the results of our simulations (1), for those methods that used an IC to select a single model, we applied the correction before calculating the IC, an approach that subsequent work using internal validity analysis has indicated is preferable to applying the correction after selecting a model with an IC (unpublished observations).
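The EB adjustment can be sketched as follows (the function name is ours; the constant 1/2^(k-1) follows the Evans and Bonett proposal described above, and cells observed as zero also receive the increment):

```python
def eb_adjust(observed, k):
    """Evans-Bonett small-sample adjustment: add 1/2**(k-1) to each of
    the 2**k - 1 observable cells (0.125 per cell when k = 4),
    including cells with an observed count of zero."""
    delta = 1 / 2 ** (k - 1)
    keys = [format(i, '0{}b'.format(k)) for i in range(1, 2 ** k)]
    return {key: observed.get(key, 0) + delta for key in keys}
```

Model fitting then proceeds on the adjusted table, with the added totals subtracted from the final estimate.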

The three new data sources (data sets 1–3) are defined in table 1; each has five data (sub)sets, for a total of 15 (12, 14, 15, 27). Our earlier analysis (13) considered data from many other separate sources, but only one enabled four-source internal validity analyses to be undertaken, as was conducted here (data set 4, table 1) (9). We reevaluated the results of that particular data set for comparison with the four-source validity analyses in which these new data sets were used. Note that our analyses and inferences apply to our estimates of known subsets of these populations only, not to any estimates derived or published from these data on the total target population.


    RESULTS
Estimates
Summary results are presented in table 2. The trend, as in past analysis, was systematically toward a negative value of the log relative bias, that is, to an underestimate of known population sizes, with some exceptions as noted below. The magnitude of the trend varied with the method of estimation used and the approach to small sample adjustment, if any.


TABLE 2. Summary results of analysis of accuracy of alternative approaches to capture-recapture estimates of disease frequency

 
In the absence of a small sample adjustment, derivation of an estimate and confidence interval associated with the saturated model was not possible for any of the 15 data (sub)sets. Consequently, no weighted estimates (or intervals) could be derived either. The remaining methods of model selection all resulted in finite intervals, with two exceptions. No useful interval could be derived by using the "all two-way" method with no adjustment (there was no upper limit) for sources B and E of data set 3. For source B of this data set as well, the "all two-way" associated estimate with no adjustment was implausibly large: only 19 cases were observed (of 46 known), but 4,344 were estimated (log relative bias = 4.55). (In these two instances, when no small sample adjustment was used, the other single-model methods (minimum SIC, minimum DIC, minimum AIC, and minimum AICC) selected simpler models than "all two-way" and were all associated with plausible estimates and intervals.) Therefore, some results in table 2 are presented excluding the lists from sources B and E of data set 3.

Similarly, for source A of data set 4, results of which appear separately in table 3, the minimum DIC method (with no adjustment) selected a model associated with no useful upper limit and with an implausibly large estimate, more than 100-fold larger than the observed number of cases (32 observed (36 known), 4,354 estimated; log relative bias = 4.88). In this particular instance, the minimum DIC method selected a more complex model than the other IC methods did; indeed, it was more complex than even "all two-way," a consequence of the relatively small number of cases in the source—36—distributed among (2^4 - 1) = 15 cells. This finding illustrates a general trend: in the presence of null cells and without a small sample correction, the more complex the model considered, the more likely it is to be associated with an infinite or implausibly large estimate, an infinite upper limit, or both. (Note that a model chosen as optimal by an IC, even if relatively simple, may result in an infinite estimate depending on the presence and location of null cells.)


TABLE 3. Coverage by 90% confidence intervals and estimated log relative bias, by data set, for estimates of the accuracy of alternative approaches to capture-recapture estimates of disease frequency

 
Only with a small sample adjustment were we able to compare all available data on all proposed methods considered. With or without this adjustment, the saturated model performed uniformly poorly compared with the other approaches, and we excluded it from further consideration. With adjustment, among the remaining approaches, the most widely used method of model selection (minimum AIC) produced a mean log relative bias of about -0.25 (i.e., about a 22 percent underestimate), with a standard deviation of about the same magnitude as the log relative bias. Without a small sample adjustment, among the 13 data (sub)sets on which we could compare all single-model methods directly, the mean log relative bias of the estimates associated with the minimum AIC method was about the same, -0.24 (standard deviation (SD), 0.21).

Small sample adjustment
The EB adjustment performed consistently, although not always markedly, better than the HR adjustment. For example, with the minimum AIC method, the results with the EB and HR adjustments were -0.24 (SD, 0.24) and -0.26 (SD, 0.25), respectively. With the EB adjustment, among the nine different methods of model selection or model adjustment, the log relative bias varied from -0.17 (SD, 0.24) for the all-two-way-interactions model to -0.25 (SD, 0.24) for the minimum SIC method. For the HR adjustment, the values of log relative bias ranged from -0.26 to -0.30 (the standard deviations of both were 0.24). Again, these summaries exclude the poor saturated-model estimate.

Model selection methods
The minimum AIC and minimum AICC methods performed about the same. Both were slightly better than the SIC and DIC methods, which tend to select simpler models.

Weighting
Because of the presence of null values, we could not evaluate the effect of weighting without using some small sample adjustment. With the preferable EB small sample adjustment, weighting the 113 different models improved matters slightly; for example, the weighted AIC log relative bias was -0.22 (SD, 0.21) compared with -0.24 (SD, 0.24) for the minimum AIC method. With the less-optimal HR adjustment, however, the trend was in the other direction. Because the EB adjustment results in a lower and "smoother" alteration of cell entries than the HR adjustment does, we predict that, in data sets with no null cells (for which no small sample adjustment is required to derive weighted estimates), analogous evaluation will show that weighting improves estimates even without such an adjustment.

Method of model selection
All information criteria resulted in about the same magnitude of log relative bias. Only the all-two-way-interactions model resulted in an improvement (again using the EB adjustment): mean log relative bias = -0.17 (SD, 0.24). The all-two-way-interactions model was also an improvement over any method of model weighting.

Coverage by confidence intervals
Table 2 also presents the results with regard to coverage. Only the all-two-way-interactions model (using either no adjustment or the EB correction) resulted in anything close to acceptable coverage by the calculated 90 percent confidence intervals.

Heterogeneity among data sets
Table 3 presents data for each data set, with the two small sample adjustments, for the two optimal methods of model selection (all-two-way-interactions models and minimum AIC). The tendency toward an underestimate was found for all three data sets analyzed. Performance was most variable in data set 3.


    DISCUSSION
These results confirm the previously reported trend of a bias toward underestimation of populations of known size. In fact, the magnitude of the bias tended to be higher in these studies than in the pooled results of those analyzed previously (13).

The EB adjustment and the all-two-way-interactions model performed the best of all candidate approaches considered: mean log relative bias = -0.17 (SD, 0.24) or about a mean 16 percent underestimate. Coverage by 90 percent intervals was adequate: 13/15 = 87 percent.

The results also tend to confirm the earlier trend that 1) with many approaches, use of the optimal method of model selection works better in the absence of the small sample adjustments; and 2) of the two adjustments evaluated, EB is preferable to HR (but refer to the discussion below).

Earlier simulations (with three sources) (1) indicated that the HR adjustment performed notably better than the EB adjustment. The more complex data here altered this inference, at least for these data sets. The HR adjustment only adds values (1.0) to cell entries in the denominator of the expression for the saturated estimate. Therefore, it can only deflate estimates associated with the saturated model. It tends to do the same for less-complex models, resulting in lower estimates than those reached by using the EB adjustment. If both methods tend to result in underestimates, as was observed here, then one may consequently expect the EB adjustment to tend to perform better, as was observed.

The general trend toward underestimation suggests that most data sets available for study by epidemiologists (at least as exemplified by those in the literature that we have been able to evaluate) tend to have positive net dependence. That is, a source tending to be typical of that used by epidemiologists tends to be more likely to capture a case found by some other source or sources than some randomly selected case in the population. Certainly, sources with different geographic catchment areas (e.g., clinics in different areas of a jurisdiction) may produce exceptions to this trend. (Note that sources A and B of data set 1 are likely to be negatively dependent for geographic reasons and thus to result in two-source overestimates if the other sources are ignored. Yet keeping them separate and considering all five sources separately, as was done here, still resulted in a tendency toward underestimation.) Investigators can anticipate bias resulting from such an expected negative geographic dependency among two sources, and they can take steps to circumvent it by pooling them and treating them as a single source. Such a tactic (source pooling) appears unlikely to address as readily biases toward overall positive dependencies, if only because it is difficult to decide which sources not to pool, and, of course, pooling all sources will prevent derivation of any estimate.

Our previous report primarily examined data sets with k = 4, for which validity analyses were undertaken on (k - 1) = 3 sources (13). In that study, the "saturated model" performed relatively well with regard to estimates as compared with the results of the analysis here, for which (k - 1) = 4 in all data sets. This apparent discrepancy may be explained by noting that, with three sources, the saturated model, that is, the one with 0 df, is the all-two-way model, the optimal model found here. With four sources, the saturated model is more complex than "all two-way" is. Thus, "all two-way" tends to perform optimally in both. We are searching for sources with k = 6 to examine this inference at higher levels of complexity.

Results of evaluation of the performance of the minimum AIC method (the most popular model selection procedure) and of the use of the all-two-way-interactions model in relation to the subpopulations in each data set are shown in table 3. Both in this and our previous study (13), no obvious characteristic of a data set investigated, for example, total number observed, could explain the trend in performance with regard to its subpopulations.

We did note one result of interest, however, for data set 4, which we included in our earlier analysis (13). The results (table 3) indicate trends at marked variance with those of the new data sets presented in the tables here (discussed above) and, by implication, the internal validity analyses included in our previous results (1). (Our earlier analysis did not examine heterogeneity in results among studies in this way, as we do here (table 3).) With data set 4 and use of either of the small sample adjustments, most approaches give estimates with values of log relative bias much closer to the optimal value of zero. Moreover, the HR adjustment for these data tends to perform better than the EB adjustment.

Conceivably, the results for data set 4 might derive only from chance deviation from a general trend. We searched for aspects of that study that might have contributed to its exceptionality. These data originated in an intensive multisource survey by Fabia (28) of Down syndrome in Massachusetts, reanalyzed in part by Wittes (10). (Refer to Hook and Regal (13) for further comments.) Wittes restricted her analysis to those cases born in the catchment area during a 4-year period and known to be still alive on a particular day some years later. (Death certificate records were not used, although, if available, they might have contributed information on cases who died after the cutoff date.) By virtue of the latter restriction, Wittes ensured formally that she analyzed a population in which no losses occurred because of death. (We suspect also, but cannot substantiate, that here the data set was limited only to cases known to be still within Massachusetts and thus was formally "closed" in a statistical sense.) Therefore, Wittes removed an important potential source of "variable catchability" in the population analyzed, a source that almost certainly is present in other data sets on which we have analyzed internal validity. This better performance does not establish the absence of other sources of variable catchability in these data or its presence in other sets we have analyzed. However, it seems likely to explain why, in contrast to the other data sets we have evaluated to date, most methods worked relatively well here.

A number of formal assumptions underlie statistical application of capture-recapture methods, including the presence of a "closed" population (1–3, 6). Rarely if ever can an epidemiologist formally establish that all of these underlying conditions hold. Indeed, in many circumstances all too familiar to an epidemiologist using readily available data, they are unlikely to hold (5–7). The main assumption investigators make in the usual application of log-linear methods of capture-recapture analysis to data from k sources is that any variable catchability and/or source dependency present in the population analyzed results in no more than a "net" k - 1 source interaction that may be "modeled" with the observed data (1). This assumption (similar to the assumption of "randomness" or at least "unbiasedness" in observational studies) must remain almost always an unprovable "act of faith." Certainly, the closer the underlying data structure comes to meeting the statistical assumptions, the better the expected performance of the methods, the lower the degree of likely complexity of the model that describes the underlying data structure, and the greater the plausibility of this act of faith. Moreover, when the investigator has data on covariates, then, by stratification of the population and/or a more direct adjustment for covariates to derive the total population size, he or she may be able to correct some of the deviation from the underlying assumptions usually invoked in analyses.

In any event, whatever the "wrinkle" in the methods we evaluated, we found that with only one data set to date did these methods appear to work relatively well. With the other seven data sets to which we applied internal validity analysis (three from this study, four earlier (1)), the methods used tended to produce underestimates of appreciable magnitude. (The only exception was use of the saturated model in the earlier analysis, which tended to result in relatively good estimates but confidence intervals so wide that the method was not useful in practice.) Use of the model that includes all two-way interactions, with the EB adjustment, appears to give the least-biased estimates with the best coverage, although the average underestimate was still 16 percent (mean log relative bias = -0.17).
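The correspondence between the reported mean log relative bias and the quoted percentage underestimate can be checked directly: a mean log relative bias of -0.17 implies that the estimate averages exp(-0.17) of the true size.

```python
import math

# Translate the reported mean log relative bias (-0.17) into the
# average fractional underestimate of the true population size.
mean_log_relative_bias = -0.17
ratio = math.exp(mean_log_relative_bias)      # estimate / true size, on average
underestimate_pct = (1 - ratio) * 100
print(f"average estimate = {ratio:.3f} x true size "
      f"(~{underestimate_pct:.0f}% underestimate)")
```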

This and other conclusions reached here must be regarded cautiously as still-preliminary inferences, even though they were derived from "real" populations and not simulations. Moreover, no rote approach to capture-recapture estimation of human populations abolishes the need to closely attend to the nature of the sources of ascertainment used, to attempt to understand the structure of the population studied, and, most critically, in the spirit of the "W. Edwards Deming" perspective, to interpret the results from the perspective of the eventual intended use of the estimates (6, 7, 27).


    ACKNOWLEDGMENTS
 
The authors thank S. G. Hay, B. W. Gillespie, J. V. Lacey, K. Cruickshank, and X. L. Du for communication and/or comments pertinent to data and S. T. Buckland for useful discussion.


    NOTES
 
Correspondence to Dr. Ernest B. Hook, 303 Warren Hall, School of Public Health, University of California, Berkeley, Berkeley, CA 94720-7360 (e-mail: ebhook@socrates.berkeley.edu).


    REFERENCES
 

  1. Hook EB, Regal RR. Capture-recapture methods in epidemiology: methods and limitations. Epidemiol Rev 1995;17:243–64. (Erratum published in Am J Epidemiol 1998;148:1219).
  2. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: history and theoretical development. Am J Epidemiol 1995;142:1047–58.
  3. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation II: applications in human diseases. Am J Epidemiol 1995;142:1059–68.
  4. Hook EB, Regal RR. The value of capture-recapture methods even for apparent exhaustive surveys. The need for adjustment for source of ascertainment intersection in attempted complete prevalence studies. Am J Epidemiol 1992;135:1060–7.
  5. Cormack R. Problems with using capture-recapture in epidemiology: an example of a measles epidemic. J Clin Epidemiol 1999;52:909–14.
  6. Hook EB, Regal RR. Recommendations for presentation and evaluation of capture-recapture estimates in epidemiology. J Clin Epidemiol 1999;52:917–26.
  7. Hook EB, Regal RR. On the need for a 16th and 17th recommendation for capture-recapture analysis. J Clin Epidemiol (in press).
  8. Fienberg SE. The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika 1972;59:591–603.
  9. Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, MA: MIT Press, 1975:125, 189, 247, 253.
  10. Wittes JT. Estimation of population size: the Bernoulli census. Doctoral dissertation. Department of Statistics, Harvard University, Cambridge, MA, 1970.
  11. Hook EB, Regal RR. Effect of variation in probability of ascertainment by sources ("variable catchability") upon "capture-recapture" estimates of prevalence. Am J Epidemiol 1993;137:1148–66.
  12. Du XL, Sourbutts J, Cruickshank K, et al. A community based stroke register in a high risk area for stroke in north west England. J Epidemiol Community Health 1997;51:474–8.
  13. Hook EB, Regal RR. Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capture-recapture estimation. Am J Epidemiol 1997;145:1138–44.
  14. Burns CJ, Laing TJ, Gillespie BW, et al. The epidemiology of scleroderma among women: assessment of risk from exposure to silicone and silica. J Rheumatol 1996;23:1904–11.
  15. Laing TJ, Gillespie BW, Toth MB, et al. Racial differences in scleroderma among women in Michigan. Arthritis Rheum 1997;40:734–42.
  16. Du XL. A community-based stroke register and case-control study of stroke and the quality of hypertension control in northwest England. Doctoral dissertation. University of Manchester, Manchester, England, 1997.
  17. Hay G. The selection from multiple data sources in epidemiological capture-recapture studies. Statistician 1997;46:516–20.
  18. Hay G, McKeganey N. Estimating the prevalence of drug misuse in Dundee, Scotland: an application of capture-recapture methods. J Epidemiol Community Health 1996;50:469–72.
  19. Regal RR, Hook EB. Goodness-of-fit based confidence intervals for estimates of the size of a closed population. Stat Med 1984;3:287–91.
  20. Haberman SJ. Analysis of qualitative data. Vol 2. New York, NY: Academic Press, 1979:571.
  21. Sakamoto Y, Ishiguro M, Kitagawa G. Akaike information criterion statistics. Tokyo, Japan: KTK Scientific, 1986.
  22. Schwarz G. Estimating the dimension of a model. Ann Stat 1978;6:461–4.
  23. Draper D. Assessment and propagation of model uncertainty. J R Stat Soc (B) 1995;57:45–70.
  24. Hurvich CM, Tsai C. Model selection for extended quasi-likelihood models in small samples. Biometrics 1995;51:1077–84.
  25. Buckland ST, Burnham KP, Augustin NH. Model selection: an integral part of inference. Biometrics 1997;53:603–18.
  26. Evans MA, Bonett DG. Bias reduction for multiple-recapture estimators of closed population size. Biometrics 1994;50:388–95.
  27. Deming WE. Quality, productivity and competitive position. Cambridge, MA: MIT Center for Advanced Engineering Study, 1982.
  28. Fabia JJ. Down's syndrome (mongolism)—a study of 2421 cases born alive to Massachusetts residents 1958–1966. Doctoral dissertation. Department of Epidemiology (Public Health), Harvard University, Boston, MA, 1970.
Received for publication May 3, 1999. Accepted for publication December 22, 1999.