How many data sources are needed to determine diabetes prevalence by capture-recapture?

AA Ismail^a, NJ Beeching^a, GV Gill^b and MA Bellis^c

a Division of Tropical Medicine, Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK.
b University Department of Medicine, University Hospital Aintree, Liverpool L9 7AL, UK.
c School of Health, Liverpool John Moores University, 79 Tithebarn Street, Liverpool L2 2ER, UK.

Reprint requests to: Dr GV Gill, Department of Medicine, University Hospital Aintree, Lower Lane, Liverpool L9 7AL, UK. E-mail: G.Gill@liv.ac.uk


    Abstract
 
Background Capture-recapture (CR) methods are increasingly used to estimate the size of human populations, including those with diabetes. Few studies have examined the demographic details needed to match patients on the lists used in these techniques, or to determine the optimum number of lists.

Methods Six lists of known diabetic patients attending different medical settings during the study year were obtained. The effects on total enumeration after aggregation of these lists were examined using increasing numbers of demographic data items as patient identifiers. The CR estimates of prevalence were obtained using 15 different combinations of two lists. Estimates were obtained after log-linear modelling for interdependence between different combinations of three and four lists, and after combining the six available lists into three logical lists.

Results For matching patients, adding date of birth to first name and family name as matching criteria increased the total of identified patients from 2500 to 2585 (3% increase), corresponding to a period prevalence of 1.5% (95% CI : 1.41–1.52). Addition of further identifiers, such as partial postcode, only increased the estimate by a further 15 patients (0.5%), and more detailed matching with full postcode introduced uncertainty. The use of two-list CR yielded widely varying estimates of the total diabetic population from 1379 (95% CI : 485–2273) to 9554 (95% CI : 7291–10 983). Log-linear modelling using different combinations of three and four lists produced estimates of 5074 (95% CI : 4417–5947) and 5578 (95% CI : 4918–7081), respectively, after compensating for statistical interdependence between the lists used. The appropriate condensation of six available lists into three lists for modelling yielded an estimate of 5492 (95% CI : 4870–6285), corresponding to a CR-adjusted period prevalence of 3.1% (95% CI : 3.03–3.19%).

Conclusions In a Western population, the only demographic data required for matching patients on lists used for CR methods are first name, family name and date of birth, if unique identifiers such as social security numbers are not available. Two lists alone do not produce reliable data, and at least three lists are needed to allow for modelling for ‘dependence’ between datasets. The use of more than three lists does not substantially alter the absolute value or confidence of enumeration, and multiple lists (if available) should be condensed into three lists for use in CR calculations.

Accepted 1 December 1999


    Introduction
 
Capture-recapture (CR) methods are gaining popularity for determining the incidence and prevalence of diabetes [1,2], and are supported by the World Health Organization [1]. The techniques use multiple data sources to estimate the size of the total population concerned, and rest on four basic assumptions: the sources are independent; each individual has the same chance of being included in each source; all individuals can be identified and matched; and the population does not change during the study. In CR studies, two, three or four data sources are commonly used [3,4]. Estimation of the prevalence of Type 2 diabetes by CR methods is particularly appropriate, as care of patients is typically ‘split’ between hospital clinics and primary care facilities. To estimate the prevalence of diabetes, the data sources usually used include lists from general practitioners, family physicians or other health professionals, hospital admissions or discharges, specialist clinic lists, diabetes registers, prescription lists and membership lists from local diabetic associations. The choice of data sources is important in determining the accuracy of the estimates of prevalence.

Questions sometimes arise about how to fulfil the mathematical assumptions, and what happens if any are violated [5]. The optimal number of data sources that should be used has not been adequately explored. The aim of this study was to examine the effects of using different numbers of data sources, and the effects of using dependent data sources, on CR population estimates, using data collected to estimate the prevalence of diabetes.


    Subjects and Methods
 
The study was conducted in an urban area of North Liverpool with an estimated mid-year population of 176 682 [6]. The target population was all people with known diabetes who were alive during a one-year study period from October 1995 to September 1996. Diabetes was diagnosed clinically using WHO criteria [7]. Patients with diabetes were identified from six sources: general practices (GP) in the area, outpatients attending the hospital Diabetes Centre, hospital discharge data, patients attending a hospital Retinal Clinic, a research list of stroke inpatients with diabetes, and a list of children with diabetes attending a local children's hospital. The information used to create each of the six lists of patients was family name, first name, postcode, date of birth and sex.

The information collected was entered on to a computer using the database software Epi-Info version 6.04 [8]. The analysis used SPSS for Windows [9] and Generalised Linear Interactive Modelling (GLIM) 4 [10]. The data were checked and corrected for key-punching errors. Records lacking any of the identifiers listed above were removed from the lists.

Cases were matched by using the patient identifiers (i.e. family name, first name, date of birth and postcode). This was done by using a sort and aggregate command in SPSS for Windows. This process identified cases that appeared on single, double or multiple lists.
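The matching step can be sketched in a few lines. The study used a sort and aggregate command in SPSS; the Python below is only an illustrative stand-in, and the record layout, the function name `aggregate_cases` and the sample patients are all hypothetical.

```python
from collections import defaultdict

def aggregate_cases(lists, criteria):
    """Group records from several lists by the chosen identifiers.

    lists    -- dict mapping a list name to a list of record dicts
    criteria -- tuple of field names used as the matching key
    Returns a dict mapping each distinct key to the set of lists it appears on.
    """
    matched = defaultdict(set)
    for list_name, records in lists.items():
        for rec in records:
            # Normalise case and surrounding spaces so that trivial typing
            # differences do not create spurious extra cases.
            key = tuple(str(rec[f]).strip().lower() for f in criteria)
            matched[key].add(list_name)
    return dict(matched)

# Hypothetical records: two different people share a name but not a birth date.
lists = {
    "GP": [
        {"family": "Smith", "first": "John", "dob": "1931-02-11"},
        {"family": "Smith", "first": "John", "dob": "1958-07-30"},
    ],
    "Diabetes Centre": [
        {"family": "SMITH", "first": "John ", "dob": "1931-02-11"},
    ],
}

by_name = aggregate_cases(lists, ("family", "first"))
by_name_dob = aggregate_cases(lists, ("family", "first", "dob"))
print(len(by_name), len(by_name_dob))  # distinct cases under each key
```

With name alone the two John Smiths collapse into a single case; adding date of birth separates them while still linking the GP and Diabetes Centre records of the same person.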

Using two-list CR, the total diabetic population with 95% CI can be determined using formulae described elsewhere [11,12]. For three or more lists, log-linear modelling [13,14] was used to determine the number of missing cases. The observed number of cases was used as the dependent variable, and presence or absence on each list as the independent variables. The analysis began by fitting the simplest model and continued with progressively more complex models until one was found whose fitted values did not differ significantly from the observed data, as judged by the residual deviance (on the χ² distribution). GLIM provides fitted values under the model for the number of individuals in each observed category and in the unobserved category. If the model provides a reasonable fit for all the observed data, then the expected value of the ‘missing cases’ may be extracted directly. These statistical techniques have been fully described and validated elsewhere [14]. The ‘missing cases’ are the cases in the population that have not been captured by any source. The total diabetic population was calculated by adding the number of missing cases to the aggregated cases identified by all the lists.

The 95% CI for missing cases was estimated using a software macro in GLIM. For each limit of the interval, the change in log-likelihood above the minimum is equal to 3.84 [15]. A similar range of intervals was then calculated for the total estimated population. The prevalence of diabetes was determined by dividing the number of cases by the total population in the group or subgroup.

The study was approved by the Ethics Committee of the Sefton Health Authority. Confidentiality of information was maintained at all times. Only one author (AAI) had access to the full identity of patients, and this information was removed after the matching procedure.


    Results
 
The total number of people identified as having diabetes from all six lists was 2585. This was determined from a pool of 25 GP lists (1469 cases), the Diabetes Centre (1252 cases), hospital admissions (454 cases), the Retinal Clinic (351 cases), the Children's Hospital (64 cases) and the stroke database (38 cases). The distribution of the cases between lists is shown in Table 1.


Table 1 Aggregated cases of known diabetic patients identified by the six lists
 
Matching criteria
The aggregate command in SPSS for Windows was used to match cases between lists, and the number of criteria used for matching affected the number of cases obtained. Using two criteria (family name and first name), the aggregated number of cases was 2500. Three criteria (family name, first name and date of birth) produced 2585 cases, four criteria (adding the first part of the postcode) produced 2600, and five criteria (adding both parts of the postcode) produced 2720. It was decided to use three criteria (as above) for the matching in this study. Some cases (85 of 2585, or 3%) will be missed if only the first two criteria (family name and first name) are used. Adding the first part of the postcode made little difference to the aggregated case estimate, and adding the second part of the postcode added 100 to the aggregated case estimate, but this may introduce inaccuracy, as some patients moved house within the area during the study period.

Estimation of known diabetes population
Two lists
Using various pairs of lists, the total number of the known diabetes population was estimated. Data from Table 1 were rearranged into 2 × 2 contingency tables and the total known diabetes population (N) with 95% CI was determined. There were 15 possible combinations of two lists, which provided different estimates of the total N and the number of missing cases (m), as shown in Figure 1. It was not possible to estimate the number of diabetic patients using the Children's Hospital list in combination with the other hospital-based lists, because children with diabetes in the area are looked after exclusively at the local Children's Hospital and there were no duplicates between these lists. The remaining combinations yielded estimates of N varying from 9554 (95% CI : 7291–10 983) to 1379 (95% CI : 485–2273). As the number of overlapping cases (duplicates) decreased, both the estimated N and the number of missing cases (m) increased, and were considerably larger than the number of cases identified by all lists (2585), with wide CI.
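As a rough illustration of the two-list calculation, the sketch below uses Chapman's nearly unbiased estimator with a normal-theory interval. The paper cites its formulae rather than stating them, so the choice of estimator is an assumption (the symmetric interval reported for the smallest estimate, 1379 with limits 485 and 2273, is at least consistent with an interval of this kind). The function name and the counts are hypothetical and are not taken from Table 1.

```python
import math

def two_list_estimate(n1, n2, m, z=1.96):
    """Chapman's nearly unbiased two-list estimator with a normal-theory CI.

    n1, n2 -- number of cases on each list
    m      -- number of cases found on both lists (duplicates)
    Returns (N_hat, lower, upper).
    """
    if m == 0:
        # With no duplicates the overlap carries no information about
        # coverage, so no useful estimate can be made.
        raise ValueError("no overlap between the two lists")
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    half_width = z * math.sqrt(var)
    return n_hat, n_hat - half_width, n_hat + half_width

# Hypothetical counts, not taken from Table 1.
n_hat, lo, hi = two_list_estimate(n1=200, n2=150, m=30)
missing = n_hat - (200 + 150 - 30)  # estimated cases on neither list
print(round(n_hat), round(lo), round(hi), round(missing))
```

Reducing `m` while holding `n1` and `n2` fixed inflates both the estimate and its interval, which is the pattern described above for pairs of lists with few duplicates.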



Figure 1 Known diabetic patients estimated by two-list capture-recapture techniques. The wide variation in the number of cases and the range of 95% CI is demonstrated, depending on which lists are used. The horizontal lines represent the 2585 cases actually identified from aggregating the six lists. (See Table 1 for description of lists.)

 
Three lists
Using the General Practice, Diabetes Centre and Hospital Admission lists, the aggregated number of the known diabetes population was 2422. Using log-linear modelling, the first model tested, permitting no interaction between the lists, fitted the data poorly (P = 0.00) (Table 2). Models permitting one interaction were then tested, and also fitted the data poorly. However, a model permitting two interactions (model 3b), incorporating dependency between the GP and Diabetes Centre lists and between the Diabetes Centre and Hospital Admission lists, fitted the data well (P > 0.05). This was confirmed by examination of the predicted values, which were all extremely close to the observed values. This showed the interdependence of the Diabetes Centre list with the GP and Hospital Admission lists (i.e. patients seen at the Diabetes Centre were also often on GP lists, and diabetic patients admitted to the hospital were also seen at the Diabetes Centre). Having selected model 3b, the number of missing cases was estimated as 2652 (95% CI : 1995–3525), yielding an estimate of the total diabetic population of 5074 (95% CI : 4417–5947).
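A model of this shape, with two pairwise interactions that both involve the same list, treats the other two lists as independent among patients absent from the shared list. For that particular structure the fitted value of the unobserved cell has a simple closed form, so it can be illustrated without iterative fitting. The sketch below is not the GLIM procedure used in the study; the function name and the cell counts are hypothetical, and it gives a point estimate only, with no interval.

```python
def missing_cases_two_interactions(counts):
    """Estimate the unobserved cell under a three-list log-linear model
    with interactions L1.L2 and L2.L3 (L1 and L3 independent given L2).

    counts -- dict keyed by capture history (l1, l2, l3), each 0 or 1,
              for the seven observable cells.
    """
    only_l1 = counts[(1, 0, 0)]
    only_l3 = counts[(0, 0, 1)]
    l1_and_l3_not_l2 = counts[(1, 0, 1)]
    if l1_and_l3_not_l2 == 0:
        raise ValueError("no cases on both L1 and L3 but not on L2")
    # Among people absent from L2, L1 and L3 are independent, so the
    # empty cell is estimated as in a two-list calculation.
    return only_l1 * only_l3 / l1_and_l3_not_l2

# Hypothetical cell counts (L1 = GP, L2 = Diabetes Centre, L3 = Hospital Admission).
counts = {
    (1, 0, 0): 900, (0, 1, 0): 600, (0, 0, 1): 120,
    (1, 1, 0): 450, (1, 0, 1): 40, (0, 1, 1): 110, (1, 1, 1): 80,
}
observed = sum(counts.values())
missing = missing_cases_two_interactions(counts)
total = observed + missing
print(observed, missing, total)
```

The total is obtained exactly as in the text: the estimated missing cases are added to the aggregated cases observed on at least one list.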


Table 2 Models for list interdependence tested for goodness-of-fit, with a likelihood ratio χ² using log-linear modelling, with three lists
 
Four lists
The total aggregated number of patients identified by four lists (GP, Diabetes Centre, Hospital Admission and Retinal Clinic) came to 2538. Using log-linear modelling, the first model with four lists independent fitted the data poorly, as did models with one, two or three pairs related. The simplest model that fitted the data was given by an equation with the terms L1, L2, L3, L4 and the interactions L1.L2 and L2.L3.L4 (Table 3). The estimated number of missing cases was 3040 (95% CI : 2380–3883), which gave an estimate of the total diabetic population of 5578 (95% CI : 4918–7081).


Table 3 Models for list interdependence tested for goodness-of-fit, with a likelihood ratio χ² using log-linear modelling, with four lists
 
Six lists
With more than four lists, CR calculations become increasingly complex, so we preferred to ‘collapse’ the six lists into three by combining the Diabetes Centre list with the Children's Hospital list (known as Lb), and combining the Hospital Admission list with the Retinal Clinic list and the Stroke Database (known as Lc), while the GP list remained as L1. These lists were combined based on the results of the two-list analysis, and on the basis that they had similar characteristics and would produce populations that were more homogeneous in terms of age. The aggregated number of cases identified by the combination of lists was 2585. These data were cross-tabulated according to the presence or absence of cases in each of the three new lists into a 2³ contingency table. Once again, a model with two interactions (like 3b above) fitted the data best (P > 0.05). The estimated number of missing cases was 2907 (95% CI : 2285–3700), and this gave an estimate of the total diabetic population of 5492 (95% CI : 4870–6285).
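The collapsing and cross-tabulation described above amount to taking unions of the source lists and counting capture histories. A minimal sketch follows, assuming each list has already been reduced to a set of matched patient keys; the function name and the integer keys are hypothetical.

```python
from collections import Counter

def collapse_and_tabulate(lists, groups):
    """Merge source lists into grouped lists and count capture histories.

    lists  -- dict mapping source name to a set of matched patient keys
    groups -- dict mapping new list name to the source names it combines
    Returns a Counter keyed by a tuple of 0/1 flags, one per new list.
    """
    merged = {name: set().union(*(lists[s] for s in sources))
              for name, sources in groups.items()}
    everyone = set().union(*merged.values())
    order = list(groups)
    return Counter(tuple(int(p in merged[name]) for name in order)
                   for p in everyone)

# Hypothetical patient keys; real keys would come from the matching step.
lists = {
    "GP": {1, 2, 3, 4},
    "Diabetes Centre": {3, 4, 5},
    "Children's Hospital": {6},
    "Hospital Admission": {4, 7},
    "Retinal Clinic": {2, 5},
    "Stroke Database": {7},
}
groups = {
    "L1": ["GP"],
    "Lb": ["Diabetes Centre", "Children's Hospital"],
    "Lc": ["Hospital Admission", "Retinal Clinic", "Stroke Database"],
}
table = collapse_and_tabulate(lists, groups)
print(sum(table.values()), dict(table))
```

The resulting counter holds the seven observable cells of the 2 × 2 × 2 table; the eighth cell (absent from all three lists) is the one the log-linear model estimates.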

Prevalence rates
The estimated mid-year population of South Sefton in 1994 was 176 682 (Merseyside Information Services, 1996). The CR estimated diabetic population ranged from 5074 to 5578, depending on the number of sources used in the calculation (Table 4). The crude period prevalence rates of known diabetes, expressed per 100 population, using three, four and combinations of six lists were 1.4 (95% CI : 1.32–1.43), 1.4 (95% CI : 1.38–1.49), and 1.5 (95% CI : 1.41–1.52), respectively, compared with CR estimated period prevalences of 2.9 (95% CI : 2.79–2.95), 3.2 (95% CI : 3.08–3.24) and 3.1 (95% CI : 3.03–3.19), respectively. Although there were slight differences in the estimated total diabetic population after increasing the number of lists, there were no significant differences in the prevalence rate estimates. The number of diabetic patients estimated using the first three lists was the closest to the aggregated cases (5074 versus 2585) and had the highest ascertainment rate (47.7%). These lists, however, do not include diabetic patients registered at the Children's Hospital, patients attending the Retinal Clinic or patients on the Stroke database. It was important to include these lists in the calculation, in order to have a homogeneous age distribution. Using the combination of six lists produced a prevalence rate of 3.1% (95% CI : 3.03–3.19).
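The prevalence arithmetic can be checked directly. The sketch below assumes a normal-approximation (Wald) binomial interval; the paper does not state which interval was used, but this choice reproduces the six-list figures quoted above. The function name is hypothetical.

```python
import math

def prevalence_with_ci(cases, population, z=1.96):
    """Period prevalence with a normal-approximation (Wald) 95% CI."""
    p = cases / population
    half_width = z * math.sqrt(p * (1 - p) / population)
    return p, p - half_width, p + half_width

POPULATION = 176_682  # mid-year population quoted in the text

crude = prevalence_with_ci(2585, POPULATION)     # aggregated cases, six lists
adjusted = prevalence_with_ci(5492, POPULATION)  # CR estimate, six lists

for label, (p, lo, hi) in [("crude", crude), ("CR-adjusted", adjusted)]:
    print(f"{label}: {100 * p:.1f}% ({100 * lo:.2f}-{100 * hi:.2f})")
# crude: 1.5% (1.41-1.52)
# CR-adjusted: 3.1% (3.03-3.19)

# Ascertainment for the three-list analysis: observed cases / CR estimate.
ascertainment = 2422 / 5074
print(f"{100 * ascertainment:.1f}%")  # 47.7%
```

An interval of this kind reflects only binomial sampling of the population; it treats the CR estimate of cases as if it were an exact count and so does not carry over the uncertainty in the estimated number of missing cases.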


Table 4 The number of diabetic patients identified by different combinations of lists (95% CI given in parentheses)
 

    Discussion
 
It is important to emphasize that our study is one of known diabetes cases; we cannot, of course, estimate the number of Type 2 diabetic patients who are as yet undiagnosed. We found that the crude period prevalence of known diabetes in South Sefton ranged from 1.4% (95% CI : 1.32–1.43) to 1.5% (95% CI : 1.41–1.52), and CR-adjusted rates ranged between 2.9% (95% CI : 2.79–2.95) and 3.1% (95% CI : 3.03–3.19), depending on the number of lists used. The CR-adjusted period prevalence in those aged <30 years was 0.4% (95% CI : 0.34–0.41) and the CR-adjusted period prevalence amongst adults aged ≥30 years was 5.2% (95% CI : 5.09–5.36). These figures are within the expected ranges of prevalence in the UK and in some parts of Europe, which range between 1.6% and 5% [16–19].

We used family name, first name and date of birth for the matching process. An attempt was made to use fewer criteria (family name and first name), but this resulted in distinct patients sharing identical matching criteria. Overall, adding the date of birth as a matching criterion increased the estimate of cases by about 3% (from 2500 to 2585), and adding the first part of the postcode made little further difference apart from ensuring that occasional patients from outside the study area were not included. This is a smaller difference than was observed in a study in Thailand, performed using data on drug misuse and police arrest. In that study the total number of cases varied by 45% overall, depending on whether two, three, four or five specific identifiers were used [20].

The choice and quality of lists is a very important factor in obtaining good estimates, but our data show that increasing the number of lists does not necessarily produce significant improvements in prevalence estimates. Analysis using a two-list CR technique can be used as a guide to determining whether lists are independent or not [21], but two lists on their own are not sufficiently accurate in most circumstances and, as with our results (see ‘Two lists’ in the Results section), widely varying and erroneous estimates of population may be obtained, depending on excessive or inadequate overlap. Fulfilling the second assumption of valid CR, i.e. that subjects have an equal chance of appearing on each list, is always difficult, and represents the problem of ‘dependency’, which is very important. Our Children's Hospital list is a good example, as this list is highly independent from all but the GP lists. Children are unlikely to develop retinopathy and therefore will not appear on the Retinal Clinic list, and in our area are admitted to their own hospital rather than the local adult hospital. Similarly, the Hospital Admission and Stroke Database lists, and the Hospital Admission and Retinal Clinic lists, were positively dependent, so that CR estimates using these pairs were far lower than the number of cases actually identified.

Our results demonstrate interactions between lists, the degree of which may differ in different age groups. In the sex-specific analyses, the simplest model that fitted the data was also the model which had an interaction between GP, Diabetes Centre and Hospital Admission lists. When data were further divided into age groups, the interaction between lists diminished. The simplest model that fitted most age group data was the model in which there was no interaction between lists.

If lists are dependent, three choices can be made: to discard the lists, to combine them, or to use log-linear modelling [14]. In our study, we combined several lists to produce three lists for CR, which we believe is appropriate in a medical setting, because of difficulties in finding other independent and reliable lists, and also because calculations are simpler. However, even combined three-list CR cannot entirely remove dependency problems, and thus the use of log-linear modelling is vital to overcome such difficulties.

In summary, our analysis has shown that relatively few patient identifiers are needed to match patients in the lists used for CR in Western populations [22,23]. Full name and date of birth appear sufficient, though of course a unique identification number (such as national insurance number or social security code) would make identification much simpler. Our data suggest that three lists are sufficient. Interdependence is still a potential problem, but appropriate combination of lists and log-linear modelling can overcome this. Complex analysis of more than three lists does not appear to add more precision to estimates of population size. The narrow CI of numbers of cases and prevalence estimates we obtained would appear to support this. Although we have concerned ourselves specifically with the enumeration of a diabetic population, we believe that our findings are applicable to the use of CR methods in other medical areas.


    Acknowledgments
 
We thank Miss Liz Harsnape, Mr Keith Jones, Ms Susan Kerr, Dr Colin Smith, Dr Anil Sharma, Mr Kevin McDonald, and Professor Ronald LaPorte, as well as the staff of the Walton Hospital Diabetes Centre. This work was part of a PhD project undertaken by Dr A Ismail at the University of Liverpool, funded by the University of Science, Malaysia.


    References
 
1 WHO Study Group. Prevention of diabetes mellitus. WHO Tech Rep Ser 844; Geneva: WHO, 1994.

2 McCarty DJ, Tull ES, Moy CS, Kwoh CK, LaPorte RE. Ascertainment corrected rates: application of capture-recapture methods. Int J Epidemiol 1993;22:559–65.

3 Bruno G, LaPorte RE, Merletti F, Biggeri A, McCarty D, Pagano G. National Diabetes Programs, application of capture-recapture to count diabetes? Diabetes Care 1994;17:538–55.

4 Wadsworth E, Shield J, Hunt L, Baum D. Insulin dependent diabetes in children under 5: incidence and ascertainment validation for 1992. Br Med J 1995;310:700–03.

5 Papoz L, Balkau B, Lellouch J. Case counting in epidemiology: limitation of methods based on multiple data sources. Int J Epidemiol 1996;25:474–78.

6 Office for National Statistics (ONS). ONS Population and Health Monitor. London: HMSO, 1998.

7 WHO. Diabetes mellitus: report of a WHO Study Group. WHO Tech Rep Ser 727. Geneva: WHO, 1985.

8 Dean AG, Dean JA, Coulombier D et al. EpiInfo Version 6: A Word Processing Database and Statistics Program for Epidemiology on Microcomputers. Atlanta: Centers for Disease Control and Prevention, 1994.

9 Norusis MJ. SPSS for Windows Base System User's Guide release 6.0. Chicago: SPSS Inc., 1993.

10 Francis B, Green M, Payne C. GLIM 4: The Statistical System for Generalized Linear Interactive Modelling. New York: Oxford Science Publications, 1993.

11 LaPorte RE, McCarty D, Bruno G, Tajima N, Baba S. Counting diabetes in the next millennium: application of capture-recapture technology. Diabetes Care 1993;16:528–34.

12 LaPorte RE. Assessing the human condition: capture-recapture techniques. Br Med J 1994;308:5–6.

13 Bishop YMM, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975, pp.229–56.

14 Cormack RM. Log-linear models for capture-recapture. Biometrics 1989;45:395–413.

15 Cormack RM. Interval estimation for mark-recapture studies of closed populations. Biometrics 1992;48:567–76.

16 Unwin N, Alberti KGMM, Bhopal R, Harland J, Watson W, White M. Comparison of the current WHO and new ADA criteria for the diagnosis of diabetes mellitus in the three ethnic groups in the UK. Diab Med 1998;15:554–57.

17 McKeigue PM, Pierpoint P, Ferrie JE, Marmot MG. Relationship of glucose intolerance and hyperinsulinaemia to body fat pattern in South Asian and Europeans. Diabetologia 1992;35:785–91.

18 Currie CJ, Peters JR. Estimation of unascertained diabetes prevalence: different effects on calculation rates and resource utilisation. Diab Med 1997;13:477–81.

19 Gatling W, Hill RD. General characteristics of a community-based diabetic population. Pract Diabetes 1988;5:104–07.

20 Mastro TD, Kitayaporn D, Weniger BG. Estimating the number of HIV-infected injection drug users in Bangkok: a capture-recapture method. Am J Public Health 1994;84:1094–99.

21 International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: history and theoretical development. Am J Epidemiol 1995;142:1047–58.

22 Squires NF, Beeching NJ, Schlecht BJM, Ruben SM. An estimate of the prevalence of drug misuse in Liverpool and a spatial analysis of known addiction. J Public Health 1995;17:103–09.

23 Devine M, Syed Q, Tocque K, Bellis M. Capture-recapture estimates of whooping cough in the Merseyside area. Commun Dis Public Health 1998;1:121–25.