1 School of Public Health, University of California, Berkeley, Berkeley, CA.
2 Department of Pediatrics, School of Medicine, University of California, San Francisco, CA.
3 Department of Mathematics and Statistics, University of Minnesota, Duluth, MN.
ABSTRACT
Akaike information criterion; Bayesian information criterion; benzodiazepines; cerebrovascular disorders; Down syndrome; loglinear model; narcotics; scleroderma, systemic
Abbreviations: AIC, Akaike information criterion; AICC, Akaike information criterion corrected; BIC, Bayesian information criterion; DIC, Draper's modification of the Schwarz' information criterion; EB, Evans and Bonett; HR, Hook and Regal; IC, information criterion; SD, standard deviation; SIC, Schwarz' information criterion.
INTRODUCTION
Application of these methods, especially those involving preexisting lists of cases, has been subject to some criticism; often, the mechanisms involved in ascertaining cases from one or more sources violate underlying assumptions of the statistical methods used (5, 7). The most prevalent approach to capture-recapture analysis in epidemiology, introduced by Fienberg (8) and by Bishop et al. (9), which extended the work of Wittes (10), is some application of log-linear methods. For example, if there are k lists or sources, the investigator estimates the number in the "missing" or unobserved cell of a 2^k table. This unobservable cell count corresponds to those persons missed by all k sources. Estimating the missing cell enables the entire population to be estimated. However, one must presume no "variable catchability" of cases in the population studied, that is, that all persons are equally likely to be captured by any given source. This presumption requires, for instance, that the population be "closed": there is no loss or gain of cases from death, travel, or migration during the time interval analyzed. But most populations on which data are available from overlapping, incomplete lists and that are of interest to epidemiologists are "open" to a greater or lesser extent. For this reason alone, some variable catchability is almost always present in the data sets usually available to epidemiologists (1, 11).
Thus, practical concerns about these methods indicate the importance of evaluating their behavior under the almost certain violation of underlying assumptions in epidemiologic application (1). Often, the investigator cannot estimate even the extent of such violations present in any data set or the likely direction of the overall effect on the bias of the derived estimate, that is, whether it is an underestimate or overestimate. For instance, for cases ascertained by using death certificates (as in one study included here (12)), less time will have been available for ascertainment during the prevalence period studied; however, these cases are more likely than more mildly affected cases to have been ascertained by other medical sources, making it very difficult to predict the overall direction of the net bias when such a source is used. For this reason, we focused on evaluating the actual behavior of these methods with various sets of "real" data gathered by epidemiologists and the extent to which their application may lead to a seriously misleading estimate in particular instances in which the true size is known. For these purposes, we applied various capture-recapture methods to estimation of the known sizes of particular lists (or sources) of cases generated within such studies. With k overlapping, incomplete lists of cases in one population, one may use the information from any k - 1 of the lists to attempt an estimate of the size of the singled-out source and compare the estimate with the known size. We have designated this approach as a method of internal validity evaluation of the various methods used.
In a previous analysis of 20 subpopulations of known size reported in five separate studies, we found, by using this approach, that most capture-recapture methods involving log-linear techniques likely to be adopted by investigators resulted in a mean underestimate of about 10-20 percent of the true population size (13). This analysis was applied primarily to estimates reached from three overlapping lists of cases (i.e., three-source estimates), limiting the complexity and range of the approaches considered. Here, we extended these analyses to 15 additional subpopulations in data from three separate reports, each of which enabled evaluation of estimates reached from four overlapping, incomplete lists of cases. Doing so enabled some extension and expansion of the methods considered. We also reevaluated one data set considered earlier.
BACKGROUND AND METHODS
For any data set with three or more sources, the combination of various proposed alternatives to model selection, adjustment (if any) for model uncertainty, and small sample adjustment (if any) generate a number of different possible approaches obtained by using log-linear methods and, consequently, estimates that may differ considerably (1). A major issue of practical importance then is deciding which of the many possible approaches performs optimally.
For this analysis, information was available from three new studies (12, 14, 18), each with five incomplete, partially overlapping sources (labeled i, where i is one of the sources A, B, C, D, and E). Five such sources generate 2^5 - 1 = 31 different "cells," each of which includes some number (possibly zero) of observed persons. By using varying subscripts, we denoted each cell (or the number observed within it) as x_11111 through x_00001, as indicated in table 1, rows 1-31. We denoted the total number in any source i as N_i, the number unique to source i as x_i, and the number in source i also found in other sources (i.e., those cells that represent the values observed in source i and at least one other source) as (N_i)_o, so that, for any source i:
N_i = x_i + (N_i)_o    (1)
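To make the notation concrete, the decomposition in equation (1) can be sketched in code. All counts below are hypothetical, chosen only to illustrate the bookkeeping.

```python
from itertools import product

# Enumerate the 2^5 - 1 = 31 observable capture patterns for sources A-E.
# A pattern such as (1, 0, 1, 0, 0) means "seen by A and C only"; the
# all-zero pattern (missed by every source) is unobservable and excluded.
patterns = [p for p in product([0, 1], repeat=5) if any(p)]
assert len(patterns) == 31

# Hypothetical cell counts x_11111 ... x_00001, here all set to 1.
counts = {p: 1 for p in patterns}

def source_totals(counts, i):
    """Return (N_i, x_i, (N_i)_o) for source index i (0 = A, ..., 4 = E)."""
    n_total = sum(n for p, n in counts.items() if p[i] == 1)
    unique = counts[tuple(1 if j == i else 0 for j in range(5))]
    overlap = n_total - unique  # cases in source i also found elsewhere
    return n_total, unique, overlap

N_A, x_A, N_A_o = source_totals(counts, 0)
assert N_A == x_A + N_A_o  # equation (1)
```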
We evaluated the adequacy of each estimate by using the logarithm of its relative bias (log relative bias), where

log relative bias = ln(estimated N_i / true N_i)    (2)
We calculated confidence intervals by using a likelihood-based method described earlier (13, 19). A value of zero indicates an accurate estimate, a negative value indicates an underestimate, and a positive value indicates an overestimate.
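As a minimal illustration, assuming the natural-log form of equation (2) and using hypothetical numbers:

```python
import math

def log_relative_bias(estimate, true_size):
    """ln(estimate / true size): zero means accurate, negative an
    underestimate, positive an overestimate."""
    return math.log(estimate / true_size)

# Hypothetical: a source with 36 known cases, estimated at 30.
assert log_relative_bias(36, 36) == 0.0
assert log_relative_bias(30, 36) < 0  # underestimate
assert log_relative_bias(50, 36) > 0  # overestimate
```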
We evaluated estimates produced by 10 different methods of model selection or model weighting. These methods are described in the paragraphs that follow.
We first considered two methods that use an estimate associated with a prespecified model, which previous simulations and/or theoretical considerations suggested might be useful (1, 13):
The estimates associated with the saturated model were derived from closed-form expressions, as given by Bishop et al. (9). A major disadvantage of this method is the sensitivity of the estimate to null values in any cell. Furthermore, however good the estimate may be, the complexity of the saturated model tends to result in very wide, sometimes uselessly large or even infinite, associated confidence intervals.
With more than three sources, there is no closed-form expression for estimates associated with the presence of all two-way interactions. To derive an estimate, one may use an iterative proportional-fitting algorithm, for example, the one given by Bishop et al. (9), or a modification of the Newton-Raphson method described by Haberman (20), which computes maximum likelihood estimates for any log-linear model.
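The core of iterative proportional fitting is short enough to sketch. The fragment below fits the all-two-way-interactions model to a complete 2 x 2 x 2 table of hypothetical counts by cycling through the three two-way margins; in an actual capture-recapture analysis, the fitted model would additionally be projected into the unobserved cell.

```python
import numpy as np

def ipf_two_way(table, tol=1e-10, max_iter=1000):
    """Fit the all-two-way-interactions log-linear model to a 2x2x2 table
    by iterative proportional fitting: repeatedly rescale the fitted table
    so that each of its three two-way margins matches the observed one."""
    fitted = np.ones_like(table, dtype=float)
    for _ in range(max_iter):
        prev = fitted.copy()
        for axis in range(3):
            scale = table.sum(axis=axis) / fitted.sum(axis=axis)
            fitted *= np.expand_dims(scale, axis)
        if np.abs(fitted - prev).max() < tol:
            break
    return fitted

# Hypothetical 2x2x2 table of capture patterns for three sources.
obs = np.array([[[10., 4.], [6., 3.]], [[5., 2.], [7., 1.]]])
fit = ipf_two_way(obs)
# At convergence, every two-way margin of the fit matches the data.
assert np.allclose(fit.sum(axis=0), obs.sum(axis=0))
assert np.allclose(fit.sum(axis=1), obs.sum(axis=1))
```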
We also evaluated four methods that select the model with a minimum information criterion (IC). We considered criteria of the general form

IC = G^2 + cp    (3)

where G^2 is the model deviance, p is the number of parameters in the model, and c is a penalty multiplier that distinguishes the criteria:
1. Minimum AIC: c = 2, the Akaike information criterion (21)
2. Minimum SIC: c = ln((N_i)_o), sometimes known as the Bayesian information criterion (BIC) or Schwarz information criterion (22)
3. Minimum DIC: c = ln((N_i)_o / 2), which is Draper's modification of the Schwarz information criterion (23)
We also defined a "corrected AIC," or minimum AICC:

4. Minimum AICC = AIC + (2p(p + 1)) / ((N_i)_o - p - 1), a correction to the Akaike information criterion, originally designated AICc, where, with k sources for a model with df degrees of freedom, p is the number of parameters; therefore, p = 2^k - df - 1 (24, 25).
For any data set, any two criteria may of course "choose" the same model as optimal (and thus imply the same estimate). If they all choose different models, then, where N_o denotes the number of observed cases, for N_o > 46 (as is true with almost all of the data here and in most investigations involving three or more sources) the order of complexity of the models selected, from least to greatest, is SIC, DIC, AICC, and AIC.
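Under the assumption that each criterion has the form IC = G^2 + c*p with the penalty constants listed above, the criteria and this complexity ordering can be sketched as follows (the deviance and counts are hypothetical):

```python
import math

def info_criteria(G2, p, n):
    """Return the four criteria for a model with deviance G2 and p
    parameters, where n = (N_i)_o enters the sample-size penalties."""
    aic = G2 + 2 * p
    sic = G2 + math.log(n) * p        # BIC / Schwarz
    dic = G2 + math.log(n / 2) * p    # Draper's modification
    aicc = aic + (2 * p * (p + 1)) / (n - p - 1)
    return {"AIC": aic, "SIC": sic, "DIC": dic, "AICC": aicc}

# With n > 46 the penalties order SIC > DIC > AICC > AIC, so minimizing
# SIC favors the simplest models and minimizing AIC the most complex.
ic = info_criteria(G2=10.0, p=5, n=100)
assert ic["SIC"] > ic["DIC"] > ic["AICC"] > ic["AIC"]
```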
We also reviewed four related, informal Bayesian methods, as implied by the methods of Draper (23). These methods weight each estimate by a function of the IC of its associated model. For analyses of four sources, as were conducted here, each weighted estimate is derived from a weighted combination of 113 different estimates. If, for source i, N̂_ij is the estimate derived from model j and IC_ij is the value of the IC associated with model j, then the weighted estimate derived from all 113 models is

N̂_i = sum_j [exp(-IC_ij / 2) / sum_l exp(-IC_il / 2)] N̂_ij    (4)
If null (empty) cells are present, then a model may result in an impossibly large ("infinite") estimate. If any single model of the 113 known to be possible in a four-source analysis (1) results in an impossible estimate, then no weighted estimate can be derived. If there are no null cells, or if an appropriate small sample adjustment is introduced, then a finite estimate is associated with all 113 models and thus a weighted estimate can be derived.
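A sketch of such weighting, assuming Draper-style weights proportional to exp(-IC/2); the estimates and IC values below are hypothetical, and only three models are shown rather than 113:

```python
import math

def weighted_estimate(estimates, ics):
    """Combine per-model estimates with weights proportional to
    exp(-IC/2); the minimum IC is subtracted first for numerical
    stability (this does not change the weights)."""
    m = min(ics)
    weights = [math.exp(-(ic - m) / 2) for ic in ics]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

ests = [120.0, 150.0, 400.0]   # hypothetical model estimates
ics = [10.0, 10.0, 30.0]       # hypothetical IC values
wavg = weighted_estimate(ests, ics)
# The two well-fitting models dominate; the poor one barely counts.
assert 120.0 < wavg < 150.0
```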
For these reasons, we also considered two candidate small sample adjustments from the literature. One, implied by an Evans and Bonett proposal (add 1/2^(k-1) to each of the 2^k - 1 cells that occur with k sources), adds 0.125 to each cell in the four-source analyses undertaken here (26). Thus, for source A, this amount would be added to each cell in the first 15 rows of table 1 and a total of (15)(0.125) = 1.875 subtracted from the final estimate. We denoted this as the Evans and Bonett (EB) adjustment. The other adjustment, which we suggested based on results of simulations, adds 1.0 to all cells in the denominator of the expression for the saturated-model estimate of the missing cell (and derives the final estimate by subtracting from the calculated value the sum of the amounts added to undertake the correction) (1). For source A, these are cells x_11111, x_11100, x_11010, x_11001, x_10110, x_10101, and x_10011, as given in table 1, rows 1, 4, 6, 7, 10, 11, and 13, respectively. We denoted this as the Hook and Regal (HR) adjustment. As suggested by the results of our simulations (1), for those methods that used an IC to select a single model, we applied the correction before calculating the IC; subsequent work using internal validity analysis has indicated that this is preferable to applying the correction after selecting a model with an IC (unpublished observations).
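The EB adjustment is simple enough to sketch with hypothetical counts (the HR adjustment instead adds 1.0 only to the cells appearing in the denominator of the saturated-model expression):

```python
from itertools import product

def eb_adjust(counts, k):
    """Evans-Bonett adjustment: add 1 / 2**(k - 1) to each of the
    2**k - 1 observable cells (0.125 per cell when k = 4)."""
    delta = 1 / 2 ** (k - 1)
    return {cell: n + delta for cell, n in counts.items()}, delta

# Hypothetical four-source counts keyed by capture pattern.
counts = {p: 2 for p in product([0, 1], repeat=4) if any(p)}
adjusted, delta = eb_adjust(counts, k=4)
assert delta == 0.125 and len(adjusted) == 15
# The total amount added (15 * 0.125 = 1.875) is subtracted back out of
# the final population estimate.
assert sum(adjusted.values()) - sum(counts.values()) == 1.875
```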
The three new data sources (data sets 1-3) are defined in table 1; each has five data (sub)sets, for a total of 15 (12, 14, 15, 27). Our earlier analysis (13) considered data from many other separate sources, but only one enabled four-source internal validity analyses of the kind conducted here (data set 4, table 1 (9)). We reevaluated the results of that particular data set for comparison with the four-source validity analyses in which these new data sets were used. Note that our analyses and inferences apply only to our estimates of known subsets of these populations, not to any estimates derived or published from these data on the total target population.
RESULTS
Similarly, for source A of data set 4, results of which appear separately in table 3, the minimum DIC method (with no adjustment) selected a model associated with no useful upper limit and with an implausibly large estimate, more than 100-fold larger than the observed number of cases (32 observed (36 known), 4,354 estimated; log relative bias = 4.88). In this particular instance, the minimum DIC method selected a more complex model than the other IC methods did; indeed, it was more complex than even "all two-way," a consequence of the relatively small number of cases in the source (36) distributed among 2^4 - 1 = 15 cells. This finding illustrates a general trend: in the presence of null cells and without a small sample correction, the more complex the model considered, the more likely it is to be associated with an infinite or implausibly large estimate, an infinite upper limit, or both. (Note that a model chosen as optimal by an IC, even if relatively simple, may result in an infinite estimate depending on the presence and location of null cells.)
Small sample adjustment
The EB adjustment performed consistently, although not always markedly, better than the HR adjustment. For example, with the minimum AIC method, the results with the EB and HR adjustments were -0.24 (SD, 0.24) and -0.26 (SD, 0.25), respectively. With the EB adjustment, among the nine different methods of model selection or model adjustment, the mean log relative bias varied from -0.17 (SD, 0.24) for the all-two-way-interactions model to -0.25 (SD, 0.24) for the minimum SIC method. With the HR adjustment, the values of log relative bias ranged from -0.26 to -0.30 (the standard deviations for both were 0.24). Again, these summaries ignore the poor saturated-model estimate.
Model selection methods
The minimum AIC and minimum AICC methods performed about the same. Both were slightly better than the SIC and DIC methods, which tend to select simpler models.
Weighting
Because of the presence of null values, we could not evaluate the effect of weighting without using some small sample adjustment. With the preferable EB small sample adjustment, weighting the 113 different models improved matters slightly; for example, the weighted AIC log relative bias was -0.22 (SD, 0.21) compared with -0.24 (SD, 0.24) for the minimum AIC method. With the less-optimal HR adjustment, however, the trend was in the other direction. Because the EB adjustment results in a smaller and "smoother" alteration to cell entries than the HR adjustment does, we predict that, in data sets with no null cells (for which no small sample adjustment is required to derive weighted estimates), analogous evaluation will establish that weighting improves estimates even without such an adjustment.
Method of model selection
All information criteria resulted in about the same magnitude of log relative bias. Only the all-two-way-interactions model resulted in an improvement (again using the EB adjustment): mean log relative bias = -0.17 (SD, 0.24). The all-two-way-interactions model was also an improvement over any method of model weighting.
Coverage by confidence intervals
Table 2 also presents the results with regard to coverage. Only the all-two-way-interactions model (using either no adjustment or the EB correction) resulted in anything close to acceptable coverage by the calculated 90 percent confidence intervals.
Heterogeneity among data sets
Table 3 presents data for each data set, with the two small sample adjustments, for the two optimal methods of model selection (all-two-way-interactions models and minimum AIC). The tendency toward an underestimate was found for all three data sets analyzed. Performance was most variable in data set 3.
DISCUSSION
The EB adjustment and the all-two-way-interactions model performed the best of all candidate approaches considered: mean log relative bias = -0.17 (SD, 0.24) or about a mean 16 percent underestimate. Coverage by 90 percent intervals was adequate: 13/15 = 87 percent.
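The conversion from mean log relative bias to the quoted percent underestimate is direct:

```python
import math

# exp(-0.17) ~ 0.84, so estimates average about 84 percent of the true
# size, i.e., roughly a 16 percent underestimate.
pct_under = (1 - math.exp(-0.17)) * 100
assert 15 < pct_under < 17
```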
The results also tend to confirm the earlier trend that 1) with many approaches, use of the optimal method of model selection works better in the absence of the small sample adjustments; and 2) of the two adjustments evaluated, EB is preferable to HR (but refer to the discussion below).
Earlier simulations (with three sources) (1) indicated that the HR adjustment performed notably better than the EB adjustment. The more complex data here altered this inference, at least for these data sets. The HR adjustment only adds values (1.0) to cell entries in the denominator of the expression for the saturated estimate. Therefore, it can only deflate estimates associated with the saturated model. It tends to do the same for less-complex models, resulting in lower estimates than those reached by using the EB adjustment. If both methods tend to result in underestimates, as was observed here, then one may consequently expect the EB adjustment to tend to perform better, as was observed.
The general trend toward underestimation suggests that most data sets available for study by epidemiologists (at least as exemplified by those in the literature that we have been able to evaluate) tend to have positive net dependence. That is, a typical epidemiologic source is more likely to capture a case already found by some other source or sources than to capture a randomly selected case in the population. Certainly, sources with different geographic catchment areas (e.g., clinics in different areas of a jurisdiction) may produce exceptions to this trend. (Note that sources A and B of data set 1 are likely to be negatively dependent for geographic reasons and thus to result in two-source overestimates if the other sources are ignored. Yet keeping them separate and considering all five sources separately, as was done here, still resulted in a tendency toward underestimation.) Investigators can anticipate bias resulting from such an expected negative geographic dependency between two sources, and they can take steps to circumvent it by pooling them and treating them as a single source. Such a tactic (source pooling) appears unlikely to address biases toward overall positive dependence as readily, if only because it is difficult to decide which sources not to pool; and, of course, pooling all sources would prevent derivation of any estimate.
Our previous report primarily examined data sets with k = 4, for which validity analyses were undertaken on (k - 1) = 3 sources (13). In that study, the "saturated model" performed relatively well with regard to estimates as compared with the results of the analysis here, for which (k - 1) = 4 in all data sets. This apparent discrepancy may be explained by noting that, with three sources, the saturated model, that is, the one with 0 df, is the all-two-way model, the optimal model found here. With four sources, the saturated model is more complex than "all two-way" is. Thus, "all two-way" tends to perform optimally in both. We are searching for sources with k = 6 to examine this inference at higher levels of complexity.
Results of evaluation of the performance of the minimum AIC method (the most popular model selection procedure) and of the use of the all-two-way-interactions model in relation to the subpopulations in each data set are shown in table 3. Both in this and our previous study (13), no obvious characteristic of a data set investigated, for example, total number observed, could explain the trend in performance with regard to its subpopulations.
We did note one result of interest, however, for data set 4, which we included in our earlier analysis (13). The results (table 3) indicate trends at marked variance with those of the new data sets presented in the tables here (discussed above) and, by implication, with the internal validity analyses included in our previous results (1). (Our earlier analysis did not examine heterogeneity in results among studies in this way, as we do here (table 3).) With data set 4 and use of either of the small sample adjustments, most approaches give estimates with values of log relative bias much closer to the optimal value of zero. Moreover, the HR adjustment tends to perform better than the EB adjustment for these data.
Conceivably, the results for data set 4 might derive only from chance deviation from a general trend. We searched for aspects of that study that might have contributed to its exceptionality. These data originated in an intensive multisource survey by Fabia (28) of Down syndrome in Massachusetts, reanalyzed in part by Wittes (10). (Refer to Hook and Regal (13) for further comments.) Wittes restricted her analysis to those cases born in the catchment area during a 4-year period and known to be still alive on a particular day some years later. (Death certificate records were not used, although, if available, they might have contributed information on cases who died after the cutoff date.) By virtue of the latter restriction, Wittes ensured formally that she analyzed a population in which no losses occurred because of death. (We suspect also, but cannot substantiate, that the data set was limited to cases known to be still within Massachusetts and thus was formally "closed" in a statistical sense.) Therefore, Wittes removed an important potential source of "variable catchability" in the population analyzed, a source that almost certainly is present in the other data sets on which we have analyzed internal validity. This better performance does not establish the absence of other sources of variable catchability in these data or their presence in other sets we have analyzed. However, it seems likely to explain why, in contrast to the other data sets we have evaluated to date, most methods worked relatively well here.
A number of formal assumptions underlie statistical application of capture-recapture methods, including the presence of a "closed" population (1, 3, 6). Rarely if ever can an epidemiologist formally establish that all of these underlying conditions hold. Indeed, in many circumstances all too familiar to an epidemiologist using readily available data, they are unlikely to hold (5, 7). The main assumption investigators make in the usual application of log-linear methods of capture-recapture analysis to data from k sources is that any variable catchability and/or source dependency present in the population analyzed results in no more than a "net" k - 1 source interaction that may be "modeled" with the observed data (1). This assumption (similar to the assumption of "randomness" or at least "unbiasedness" in observational studies) must almost always remain an unprovable "act of faith." Certainly, the closer the underlying data structure comes to meeting the statistical assumptions, the better the expected performance of the methods, the lower the likely complexity of the model that describes the underlying data structure, and the greater the plausibility of this act of faith. Moreover, when the investigator has data on covariates, then, by stratification of the population and/or a more direct adjustment for covariates in deriving the total population size, he or she may be able to correct some of the deviation from the underlying assumptions usually invoked in analyses.
In any event, whatever the "wrinkle" in the methods we evaluated, we found that with only one data set to date did these methods appear to work relatively well. With the other seven data sets to which we applied internal validity analysis (three from this study, four earlier (1)), the methods used tended to produce underestimates of appreciable magnitude. (The only exception was use of the saturated model in the earlier analysis, which tended to result in relatively good estimates but confidence intervals so wide that the method was not useful in practice.) Use of the model that includes all two-way interactions and of the EB adjustment appears to give the least-biased estimates with the best coverage, although the average underestimate was still 16 percent (mean log relative bias = -0.17).
This and other conclusions reached here must be regarded cautiously as still-preliminary inferences, even though they were derived from "real" populations and not simulations. Moreover, no rote approach to capture-recapture estimation of human populations abolishes the need to attend closely to the nature of the sources of ascertainment used, to attempt to understand the structure of the population studied, and, most critically, in the spirit of the W. Edwards Deming perspective, to interpret the results in light of the eventual intended use of the estimates (6, 7, 27).