a Division of Tropical Medicine, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK.
b University Department of Medicine, University Hospital Aintree, Liverpool L9 7AL, UK.
c School of Health, Liverpool John Moores University, 79 Tithebarn Street, Liverpool L2 2ER, UK.
Reprint requests to: Dr GV Gill, Department of Medicine, University Hospital Aintree, Lower Lane, Liverpool L9 7AL, UK. E-mail: G.Gill{at}liv.ac.uk
Abstract |
---|
Methods Six lists of known diabetic patients attending different medical settings during the study year were obtained. The effects on total enumeration after aggregation of these lists were examined using increasing numbers of demographic data items as patient identifiers. The CR estimates of prevalence were obtained using 15 different combinations of two lists. Estimates were obtained after log-linear modelling for interdependence between different combinations of three and four lists, and after combining the six available lists into three logical lists.
Results For matching patients, adding date of birth to first name and family name as matching criteria increased the total of identified patients from 2500 to 2585 (3% increase), corresponding to a period prevalence of 1.5% (95% CI : 1.41–1.52). Addition of further identifiers, such as partial postcode, only increased the estimate by a further 15 patients (0.5%), and more detailed matching with full postcode introduced uncertainty. The use of two-list CR yielded widely varying estimates of the total diabetic population from 1379 (95% CI : 435–2273) to 9554 (95% CI : 7291–10 983). Log-linear modelling using different combinations of three and four lists produced estimates of 5074 (95% CI : 4417–5947) and 5578 (95% CI : 4918–7081), respectively, after compensating for statistical interdependence between the lists used. The appropriate condensation of six available lists into three lists for modelling yielded estimates of 5492 (95% CI : 4870–6285), corresponding to a CR-adjusted period prevalence of 3.1% (95% CI : 3.03–3.19%).
Conclusions In a Western population, the only demographic data required for matching patients on lists used for CR methods are first name, family name and date of birth, if unique identifiers such as social security numbers are not available. Two lists alone do not produce reliable data, and at least three lists are needed to allow for modelling for dependence between datasets. The use of more than three lists does not substantially alter the absolute value or confidence of enumeration, and multiple lists (if available) should be condensed into three lists for use in CR calculations.
Accepted 1 December 1999
Introduction |
---|
Issues sometimes arise on how to fulfil the mathematical assumptions, and what will happen if any are violated.5 The optimal number of data sources that should be used has not been adequately explored. The aim of this study was to examine the effects of using different numbers of data sources, and the effects of using dependent data sources, on CR population estimates, using data collected to estimate the prevalence of diabetes.
Subjects and Methods |
---|
The information collected was entered on to a computer using the database software Epi-Info version 6.04.8 The analysis used SPSS for Windows9 and Generalised Linear Interactive Modelling (GLIM) 4.10 The data were checked and corrected for key-punching errors. Records without the information mentioned above were removed from the list.
Cases were matched by using the patient identifiers (i.e. family name, first name, date of birth and postcode). This was done by using a sort and aggregate command in SPSS for Windows. This process identified cases that appeared on single, double or multiple lists.
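The matching step can be illustrated with a short Python sketch. It is an analogue of the sort and aggregate operation, not the SPSS procedure used in the study; the function names, the record field names (`family_name`, `first_name`, `date_of_birth`) and the assumption that dates are already in a consistent text form are all hypothetical.

```python
from collections import defaultdict


def matching_key(record):
    """Build a matching key from family name, first name and date of birth.

    Field names are hypothetical; dates are assumed to be consistently formatted.
    """
    return (
        record["family_name"].strip().lower(),
        record["first_name"].strip().lower(),
        record["date_of_birth"],
    )


def aggregate(lists):
    """Map each matching key to the set of list names on which it appears."""
    seen = defaultdict(set)
    for list_name, records in lists.items():
        for record in records:
            seen[matching_key(record)].add(list_name)
    return dict(seen)


def capture_histories(seen, list_names):
    """Count patients for each presence (1) / absence (0) pattern across lists."""
    counts = defaultdict(int)
    for sources in seen.values():
        pattern = tuple(int(name in sources) for name in list_names)
        counts[pattern] += 1
    return dict(counts)
```

The length of `aggregate(...)` gives the number of distinct identified patients, and `capture_histories` gives the counts of patients on single, double or multiple lists in the form needed for CR calculations.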
Using two-list CR, the total diabetic population with 95% CI can be determined using formulae described elsewhere.11,12 For three or more lists, log-linear modelling13,14 was used to determine the number of missing cases. The observed number of cases was used as the dependent variable, and presence or absence on each list as the independent variables. The analysis began by fitting the simplest model and continued until a model was found whose residual deviance (referred to the χ² distribution) showed no significant departure from the observed data. GLIM provides fitted values under the model for the number of individuals in each observed category and in the unobserved category. If the model provides a reasonable fit for all the observed data, then the expected value of the 'missing cases' may be extracted directly. The 'missing cases' are the cases in the population that have not been captured by any source. These statistical techniques have been fully described and validated elsewhere.14 The total diabetic population was calculated by adding the number of missing cases to the aggregated cases identified by all the lists.
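The log-linear approach can be sketched in Python as follows. This is a minimal illustration, not the GLIM analysis used in the study: it fits a Poisson log-linear model to the observed capture patterns by Newton-Raphson, and takes the exponential of the intercept as the expected count in the unobserved cell (absent from every list). The function names, the `interactions` argument and the example counts are hypothetical; the example counts were constructed to satisfy independence exactly.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poisson(X, y, iterations=50, tol=1e-10):
    """Fit a Poisson log-linear model; return coefficients and fitted means."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # start at the mean count

    def means(coef):
        return [math.exp(sum(b * x for b, x in zip(coef, row))) for row in X]

    for _ in range(iterations):
        mu = means(beta)
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, means(beta)


def estimate_missing(counts, interactions=()):
    """Estimate the unobserved cell from counts keyed by 0/1 capture patterns.

    `interactions` is a sequence of (list index, list index) pairs to add as
    two-way interaction terms; empty means the independence model.
    """
    patterns = sorted(counts)
    X = [[1.0] + [float(v) for v in pat]
         + [float(pat[a] * pat[b]) for a, b in interactions] for pat in patterns]
    y = [float(counts[pat]) for pat in patterns]
    beta, mu = fit_poisson(X, y)
    deviance = 2.0 * sum((yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
                         for yi, mi in zip(y, mu))
    missing = math.exp(beta[0])  # all covariates are 0 in the unobserved cell
    return {"missing": missing, "total": sum(y) + missing,
            "deviance": deviance, "df": len(y) - len(beta)}


# Hypothetical counts for three lists, keyed by (list 1, list 2, list 3).
example = {(1, 0, 0): 240, (0, 1, 0): 160, (0, 0, 1): 60, (1, 1, 0): 160,
           (1, 0, 1): 60, (0, 1, 1): 40, (1, 1, 1): 40}
result = estimate_missing(example)
```

The residual deviance and its degrees of freedom can be compared with the χ² distribution to judge fit; if the independence model fits poorly, interaction terms can be added, for example `interactions=[(0, 1)]` for dependence between the first two lists.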
The 95% CI for missing cases was estimated using a software macro in GLIM. The limits of the interval are the values at which the deviance (−2 × log-likelihood) is 3.84 above its minimum.15 A corresponding interval was then calculated for the total estimated population. The prevalence of diabetes was determined by dividing the number of cases by the total population in the group or subgroup.
The study was approved by the Ethics Committee of the Sefton Health Authority. Confidentiality of information was maintained at all times. Only one author (AAI) had access to the full identity of patients, and this information was removed after the matching procedure.
Results |
---|
Estimation of known diabetes population
Two lists
Using various pairs of lists, the total number of the known diabetes population was estimated. Data from Table 1 were rearranged into 2 × 2 contingency tables and the total known diabetes population (N) with 95% CI was determined. There were 15 possible combinations of two lists, which provided different estimates of the total N and the number of missing cases (m), as shown in Figure 1. It was not possible to estimate the number of diabetic patients using the Children's Hospital list in combination with the other hospital-based lists because children with diabetes in the area are looked after exclusively at the local Children's Hospital and there were no duplicates between these lists. The remaining combinations yielded estimates of N varying from 9554 (95% CI : 7291–10 983) to 1379 (95% CI : 485–2273). As the number of overlapping cases (duplicates) decreased, both the estimated N and the number of missing cases (m) increased, and were considerably larger than the number of cases identified by all lists (2585), with wide CI.
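The dependence of the two-list estimate on the number of duplicates can be seen in a short Python sketch. It uses Chapman's nearly unbiased form of the two-list estimator with a normal-approximation 95% CI, which is one commonly used formulation; the formulae actually applied in this study are those of references 11 and 12 and may differ in detail. The function name and the example counts are hypothetical.

```python
import math


def two_list_estimate(n1, n2, both):
    """Two-list capture-recapture estimate (Chapman's form) with 95% CI.

    n1, n2: number of cases on each list; both: cases found on both lists.
    """
    if both <= 0:
        # With no duplicates the lists share no information about the
        # total population, so no meaningful estimate can be made.
        raise ValueError("at least one duplicate is needed")
    if both > min(n1, n2):
        raise ValueError("duplicates cannot exceed either list size")
    n_hat = (n1 + 1) * (n2 + 1) / (both + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - both) * (n2 - both)
                / ((both + 1) ** 2 * (both + 2)))
    half_width = 1.96 * math.sqrt(variance)
    identified = n1 + n2 - both  # distinct cases seen on either list
    return {
        "N": n_hat,
        "low": n_hat - half_width,
        "high": n_hat + half_width,
        "missing": n_hat - identified,
    }


# Hypothetical list sizes: fewer duplicates give a larger estimate of N.
many_duplicates = two_list_estimate(100, 80, 20)
few_duplicates = two_list_estimate(100, 80, 10)
```

Because the number of duplicates enters the denominator, a small change in overlap between two lists produces a large change in the estimated N and in the width of its CI.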
Prevalence rates
The estimated mid-year population of South Sefton in 1994 was 176 682 (Merseyside Information Services, 1996). The CR estimated diabetic population ranged from 5074 to 5578, depending on the number of sources used in the calculation (Table 4). The crude period prevalence rates expressed per 100 of known diabetes, using three, four and combinations of six lists were 1.4 (95% CI : 1.32–1.43), 1.4 (95% CI : 1.38–1.49), and 1.5 (95% CI : 1.41–1.52), respectively, compared with CR estimated period prevalences of 2.9 (95% CI : 2.79–2.95), 3.2 (95% CI : 3.08–3.24) and 3.1 (95% CI : 3.03–3.19), respectively. Although there were slight differences in the estimated total diabetic population after increasing the number of lists, there were no significant differences in the prevalence rate estimates. The number of diabetic patients estimated using the first three lists was the closest to the aggregated cases (5074 versus 2585) and had the highest ascertainment rate (47.7%). These lists, however, do not include diabetic patients registered at the Children's Hospital, and patients attending the Retinal Clinic and on the Stroke database. It was important to include these lists in the calculation, in order to have a homogeneous age distribution. Using the combination of six lists produced a prevalence rate of 3.1% (95% CI : 3.03–3.19).
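The prevalence figures follow directly from the case counts and the mid-year population, as the following Python sketch shows. The function names are hypothetical; the numbers are those quoted in the text.

```python
def prevalence_percent(cases, population):
    """Period prevalence expressed per 100 of the population."""
    return 100.0 * cases / population


def ascertainment_percent(identified, estimated_total):
    """Share of the estimated total that was actually identified."""
    return 100.0 * identified / estimated_total


POPULATION = 176682  # estimated mid-year population of South Sefton, 1994

crude = round(prevalence_percent(2585, POPULATION), 1)       # 1.5
three_lists = round(prevalence_percent(5074, POPULATION), 1)  # 2.9
four_lists = round(prevalence_percent(5578, POPULATION), 1)   # 3.2
condensed = round(prevalence_percent(5492, POPULATION), 1)    # 3.1
```

The same division applied to the CI limits of the estimated population gives the CI limits of the prevalence.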
Discussion |
---|
We used family name, first name and date of birth for the matching process. An attempt was made to use fewer criteria (family name and first name), but this resulted in some patients having similar matching criteria. Overall, adding the date of birth as a matching criterion increased the estimate of cases by 3.6%, and adding the first part of the postcode made little further difference apart from ensuring that occasional patients from outside the study area were not included. This is a smaller difference than was observed in a study in Thailand, performed using data on drug misuse and police arrest. In that study the total number of cases varied by a factor of 45% overall, depending on whether two, three, four or five specific identifiers were used.20
The choice and quality of lists is a very important factor in obtaining good estimates, but our data show that increasing the number of lists does not necessarily produce significant improvements in prevalence estimates. Analysis using a two-list CR technique can be used as a guide to determining whether lists are independent or not,21 but two lists on their own are not sufficiently accurate in most circumstances. As our two-list results show, widely varying and erroneous estimates of population may be obtained, depending on excessive or inadequate overlap. Fulfilling the second assumption of valid CR, i.e. that subjects have an equal chance of appearing on each list, is always difficult, and represents the problem of dependency, which is very important. Our Children's Hospital list is a good example, as this list is highly independent from all but the GP lists. Children are unlikely to develop retinopathy and therefore will not appear on the Retinal Clinic list, and in our area are admitted to their own hospital rather than the local adult hospital. Similarly, the lists of Hospital Admission and Stroke Database, and Hospital Admission and Retinal Clinic, were positively dependent, so that CR estimates using these were far less than the actual identified cases.
Our results demonstrate interactions between lists, the degree of which may differ in different age groups. In the sex-specific analyses, the simplest model that fitted the data was also the model which had an interaction between GP, Diabetes Centre and Hospital Admission lists. When data were further divided into age groups, the interaction between lists diminished. The simplest model that fitted most age group data was the model in which there was no interaction between lists.
If lists are dependent, three choices can be made: to discard the lists, to combine them, or to use log-linear modelling.14 In our study, we combined several lists to produce three lists for CR, which we believe is appropriate in a medical setting, because of difficulties in finding other independent and reliable lists, and also because calculations are simpler. However, even combined three-list CR cannot entirely remove dependency problems, and thus the use of log-linear modelling is vital to overcome such difficulties.
In summary, our analysis has shown that relatively few patient identifiers are needed to match patients in the lists used for CR in Western populations.22,23 Full name and date of birth appear sufficient, though of course a unique identification number (such as national insurance number or social security code) would make identification much simpler. Our data suggest that three lists are sufficient. Interdependence is still a potential problem, but appropriate combination of lists and log-linear modelling can overcome this. Complex analysis of more than three lists does not appear to add more precision to estimates of population size. The narrow CI of numbers of cases and prevalence estimates we obtained would appear to support this. Although we have concerned ourselves specifically with the enumeration of a diabetic population, we believe that our findings are applicable to the use of CR methods in other medical areas.
Acknowledgments |
---|
References |
---|
2 McCarty DJ, Tull ES, Moy CS, Kwoh CK, LaPorte RE. Ascertainment corrected rates: application of capture-recapture methods. Int J Epidemiol 1993;22:559–65.
3 Bruno G, LaPorte RE, Merletti F, Biggeri A, McCarty D, Pagano G. National Diabetes Programs, application of capture-recapture to count diabetes? Diabetes Care 1994;17:538–55.
4 Wadsworth E, Shield J, Hunt L, Baum D. Insulin dependent diabetes in children under 5: incidence and ascertainment validation for 1992. Br Med J 1995;310:700–03.
5 Papoz L, Balkau B, Lellouch J. Case counting in epidemiology: limitation of methods based on multiple data sources. Int J Epidemiol 1996;25:474–78.
6 Office for National Statistics (ONS). ONS Population and Health Monitor. London: HMSO, 1998.
7 WHO. Diabetes mellitus: report of a WHO Study Group. WHO Tech Rep Ser 727. Geneva: WHO, 1985.
8 Dean AG, Dean JA, Coulombier D et al. EpiInfo Version 6: A Word Processing Database and Statistics Program for Epidemiology on Microcomputers. Atlanta: Centers for Disease Control and Prevention, 1994.
9 Norusis MJ. SPSS for Windows Base System User's Guide release 6.0. Chicago: SPSS Inc., 1993.
10 Francis B, Green M, Payne C. GLIM 4: The Statistical System for Generalized Linear Interactive Modelling. New York: Oxford Science Publications, 1993.
11 LaPorte RE, McCarty D, Bruno G, Tajima N, Baba S. Counting diabetes in the next millennium: application of capture-recapture technology. Diabetes Care 1993;16:528–34.
12 LaPorte RE. Assessing the human condition: capture-recapture techniques. Br Med J 1994;308:5–6.
13 Bishop YMM, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975, pp.229–56.
14 Cormack RM. Log-linear models for capture-recapture. Biometrics 1989;45:395–413.
15 Cormack RM. Interval estimation for mark-recapture studies of closed populations. Biometrics 1992;48:567–76.
16 Unwin N, Alberti KGMM, Bhopal R, Harland J, Watson W, White M. Comparison of the current WHO and new ADA criteria for the diagnosis of diabetes mellitus in the three ethnic groups in the UK. Diab Med 1998;15:554–57.
17 McKeigue PM, Pierpoint P, Ferrie JE, Marmot MG. Relationship of glucose intolerance and hyperinsulinaemia to body fat pattern in South Asian and Europeans. Diabetologia 1992;35:785–91.
18 Currie CJ, Peters JR. Estimation of unascertained diabetes prevalence: different effects on calculation rates and resource utilisation. Diab Med 1997;13:477–81.
19 Gatling W, Hill RD. General characteristics of a community-based diabetic population. Pract Diabetes 1988;5:104–07.
20 Mastro TD, Kitayaporn D, Weniger BG. Estimating the number of HIV-infected injection drug users in Bangkok: a capture-recapture method. Am J Public Health 1994;84:1094–99.
21 International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: history and theoretical development. Am J Epidemiol 1995;142:1047–58.
22 Squires NF, Beeching NJ, Schlecht BJM, Ruben SM. An estimate of the prevalence of drug misuse in Liverpool and a spatial analysis of known addiction. J Public Health 1995;17:103–09.
23 Devine M, Syed Q, Tocque K, Bellis M. Capture-recapture estimates of whooping cough in the Merseyside area. Commun Dis Public Health 1998;1:121–25.