Department of Public Health, Wellington School of Medicine, University of Otago, PO Box 7343, Wellington, New Zealand. E-mail: tblakely{at}wnmeds.ac.nz
Abstract |
---|
Method We describe a duplicate method to calculate the positive predictive value (PPV) of record linkage when each record can only be involved in one match (e.g. linking population files to death files). The method does not require a validation subset of records from both files with detailed personal information (e.g. name and address), and is therefore ideal for linkage projects using anonymous data. The duplicate method assumes that the number of records from one file with zero, one, two, etc., links from the other file is distributed in a manner predicted by combinatorial probabilities. Having made this assumption, the number of false positive links, and hence the PPV, is estimable. We demonstrate this duplicate method using output from anonymous and probabilistic record linkage of census and mortality records in New Zealand.
Results The PPV estimates conformed to the pattern expected based on the underlying theory of probabilistic record linkage, and were robust to sensitivity analyses. We encourage other researchers to further assess the accuracy of this method.
Keywords Medical record linkage, predictive value of tests, sensitivity and specificity, epidemiological methods, censuses, mortality
Accepted 12 August 2002
Introduction |
---|
Record linkage methodology |
---|
Deterministic record linkage
Deterministic record linkage is where we look for exact (dis)agreement on one or more matching variables between files. For example, we might simply use a social security number common to two files. However, coding errors of the social security number on one file mean that some true matches (a comparison pair of two records from different files for the same person) will be missed.
Probabilistic record linkage
Probabilistic record linkage uses information on a greater number of matching variables, and allows for the amount of information provided by any (dis)agreement on matching variables. For example, agreement on social security number is more suggestive of a match than is agreement on sex. Also, agreements on rare values of a given matching variable (e.g. surname Blakely) are more suggestive than agreements on common values (e.g. Smith).
At the heart of probabilistic record linkage are u probabilities and m probabilities. Consider the matching variable month of birth. The probability of this variable agreeing purely by chance for a comparison pair of two records not belonging to the same individual (i.e. a non-match) is about 1/12 = 0.083. This value is the u probability. (For a matching variable that has an uneven distribution of values in the files [e.g. country of birth], the u probability will vary by value.) The m probability is the probability of agreement for a given matching variable when the comparison pair is a match. As all matching variables are prone to mis-coding, the m probability is less than 1.0. The value of the m probability is estimated (sometimes iteratively) during the specification of the record linkage strategy based upon prior information and the proportion of agreements among the comparison pairs accepted as links. (As we never know which comparison pairs are actually the matches, we use the links we accept during the record linkage process to iteratively estimate the m probability.) In this example, assume the m probability was 0.95. These u and m probabilities are then used to determine frequency ratios or (dis)agreement weights (Table 2). In this example, a comparison pair that agreed on month of birth would be assigned a weight of 3.51 and a comparison pair that disagreed on month of birth would be assigned a weight of -4.20. The setting of u and m probabilities and the corresponding weights is repeated for all matching variables, and possibly additionally for all values of each/some of the matching variables. The total weight for a given comparison pair is simply the sum of the (dis)agreement weights for each matching variable. The total weight will be a large positive number if all/most matching variables agree, or a large negative number if all/most matching variables disagree.
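The month-of-birth weights quoted above are consistent with taking base-2 logarithms of the frequency ratios m/u (agreement) and (1-m)/(1-u) (disagreement). The following is a minimal sketch under that assumption; the function name is illustrative only and is not part of any linkage package.

```python
import math

def agreement_weights(m, u):
    """Return (agreement, disagreement) weights as log2 frequency ratios.

    Assumes weights are base-2 logs of m/u and (1-m)/(1-u). This reproduces
    the month-of-birth figures in the text but is a sketch, not the
    documented formula of any particular linkage software.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Month of birth: u = 1/12 by chance, m = 0.95 as assumed in the text.
agree, disagree = agreement_weights(m=0.95, u=1 / 12)
print(round(agree, 2), round(disagree, 2))
```

Summing such weights over all matching variables gives the total weight for a comparison pair.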
Record linkage from an epidemiological perspective |
---|
These parameters will vary depending on the cut-off weight: moving it to the left in Figure 1 will increase the sensitivity, but also increase the number of false positives; moving it to the right will decrease the sensitivity, but also decrease the number of false positives.
When record linkage is used to determine the outcome in a cohort study, what effect do errors in the record linkage have on subsequent analyses of the association of exposure with the outcome? False positives incurred during the record linkage will bias both the risk ratios and risk differences to the null, so long as the specificity is non-differential by the exposure variable(s) measured for the cohort study-base (i.e. a non-differential misclassification bias of the mortality outcome).1,6,7 However, the effect of false negatives incurred during the record linkage (i.e. imperfect sensitivity) is to cause an underestimate of the risk difference only; the risk ratio remains unaffected so long as the sensitivity is non-differential by the exposure variable(s).1,8 Thus, when trade-offs are required between the number of false positives and false negatives incurred in a record linkage project, a sensible strategy is to sacrifice the sensitivity (and incur many false negatives or missed matches) but maintain a high specificity (and incur few false positives or incorrect links). With this strategy the measured risk ratio in subsequent cohort analyses should be unbiased, although statistical power will be somewhat reduced.1 (An additional strategy is to actually adjust the observed risk ratios and risk differences for misclassification bias of the outcome incurred during the record linkage process. A description of these adjustment procedures using estimates of the sensitivity and specificity or positive predictive value is beyond the scope of this paper, but they are well described elsewhere.6,9–11)
Minimizing the number of false positive links requires first quantifying their number by values of the total weight score, to permit an informed decision about what value to set the final cut-off weight. There are several examples in the published literature where the cut-off was determined by manual inspection of a subset of the comparison pairs that had matching variables which were not available for all the records.12–18 For example, Muse et al. linked anonymous human immunodeficiency virus data but for a sub-sample of records had names, allowing a validation of the larger anonymous record linkage project.18 In the absence of such a gold standard, practitioners are forced to rely more on the "art" of record linkage.19 For example, comparison pairs in the grey zone (i.e. the zone either side of the dotted line in Figure 1) are manually reviewed and a decision on linkage status made on the basis of what "looks alright". In probabilistic record linkage, it is also possible to estimate the absolute odds (and thereby the PPV) of a comparison pair being a match for a given weight score.3,19–21 However, this method is prone to bias due to correlated agreements and disagreements between matching variables for a given comparison pair. For example, if sex was coded incorrectly for a given record, the chance of another coding error for that particular record is probably greater than for any randomly selected record. Also, age-related bias due to the alteration in prior probability of death for any cohort followed over time may bias the absolute odds method for calculating the PPV.3,20
Duplicate method for determining false positives |
---|
The duplicate method involves simultaneously solving the combinatorial probabilities for zero, one, or two census links for a given mortality record. Assume that above a given total weight score, there is a uniform probability, p, that any one mortality record will have a purely chance link with any one census record. Let t be the probability that a mortality record has a true link or match, and n be the number of census records (trials) compared to each mortality record. Thus:
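$$
\begin{aligned}
P_1 &= (1-t)\,(1-p)^{n} \\
P_3 &= (1-t)\,n\,p\,(1-p)^{n-1} \\
P_5 &= (1-t)\,\frac{n(n-1)}{2}\,p^{2}\,(1-p)^{n-2} \\
P_7 &= (1-t)\,\frac{n(n-1)(n-2)}{6}\,p^{3}\,(1-p)^{n-3}
\end{aligned}
$$
etc.
$$
\begin{aligned}
P_2 &= t\,(1-p)^{n-1} \\
P_4 &= t\,(n-1)\,p\,(1-p)^{n-2} \\
P_6 &= t\,\frac{(n-1)(n-2)}{2}\,p^{2}\,(1-p)^{n-3}
\end{aligned}
$$
etc.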
Note that the sum of the odd-numbered probabilities is just (1-t) since the terms in the second brackets are the binomial probabilities of observing 0, 1, 2,... n false links in n comparisons and thus sum to unity. Similarly, the even-numbered probabilities sum to t. Thus the sum of all possible probabilities is (1-t) + t = 1.
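The summation argument can be checked numerically. The sketch below builds both series from the binomial terms and confirms that they sum to (1-t) and t; the function name and the input values are illustrative and are not taken from the study.

```python
from math import comb

def link_probabilities(p, t, n):
    """Probabilities of k chance links, split by whether a true match exists.

    no_match[k] is the odd-numbered series (P1, P3, P5, ...): no true match
    and k false links among n comparisons.
    match[k] is the even-numbered series (P2, P4, P6, ...): a true match plus
    k false links among the remaining n - 1 comparisons.
    """
    q = 1 - p
    no_match = [(1 - t) * comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    match = [t * comb(n - 1, k) * p**k * q**(n - 1 - k) for k in range(n)]
    return no_match, match

# Illustrative values only (not study data).
no_match, match = link_probabilities(p=0.001, t=0.6, n=100)
print(sum(no_match), sum(match))  # approximately 0.4 and 0.6
```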
In practice, at and above a given total weight score we may observe the proportion of mortality records with zero, one, and two census record links at the specified weight cut-off in the linkage as X, Y, and Z, where:
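$$
\begin{aligned}
X &= P_1 = (1-t)\,(1-p)^{n} \\
Y &= P_2 + P_3 = t\,(1-p)^{n-1} + (1-t)\,n\,p\,(1-p)^{n-1} \\
Z &= P_4 + P_5 = t\,(n-1)\,p\,(1-p)^{n-2} + (1-t)\,\frac{n(n-1)}{2}\,p^{2}\,(1-p)^{n-2}
\end{aligned}
$$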
Multiplying the equation for Y by $(n-1)(1-(1-p))/(1-p)$, subtracting the equation for Z, and then substituting $X/(1-p)^{n}$ for $(1-t)$ (from the equation for X), we get a quadratic in $(1-p)$:
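$$
\left[\frac{n(n-1)}{2}X + (n-1)Y + Z\right](1-p)^{2} - (n-1)(nX + Y)\,(1-p) + \frac{n(n-1)}{2}X = 0 \tag{1}
$$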
where n is the number of census records that can possibly be compared to each mortality record. The equation has two roots. Back substitution gives values for p and t. The correct one of these two roots will give t < 1 and 0 <(1-p) < 1.
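As an illustration, carrying out the elimination steps described above gives a quadratic a·q² + b·q + c = 0 in q = 1-p, with a = n(n-1)X/2 + (n-1)Y + Z, b = -(n-1)(nX + Y) and c = n(n-1)X/2. The sketch below solves it and back-substitutes for t. The function names, the additional requirement that t ≥ 0 when choosing the root, and the input values are assumptions made for this example.

```python
import math

def expected_xyz(p, t, n):
    """Forward model: proportions of mortality records with 0, 1 and 2 links."""
    q = 1 - p
    x = (1 - t) * q**n
    y = t * q**(n - 1) + (1 - t) * n * p * q**(n - 1)
    z = (t * (n - 1) * p * q**(n - 2)
         + (1 - t) * n * (n - 1) / 2 * p**2 * q**(n - 2))
    return x, y, z

def solve_p_t(x, y, z, n):
    """Solve the quadratic in q = 1 - p and back-substitute for t.

    Returns every (p, t) pair with 0 < q < 1 and 0 <= t <= 1. Requiring
    t >= 0 is an added assumption here (t is a probability).
    """
    a = n * (n - 1) / 2 * x + (n - 1) * y + z
    b = -(n - 1) * (n * x + y)
    c = n * (n - 1) / 2 * x
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real root for these X, Y, Z and n")
    solutions = []
    for sign in (1, -1):
        q = (-b + sign * math.sqrt(disc)) / (2 * a)
        if 0 < q < 1:
            t = 1 - x / q**n  # from X = (1 - t) * q**n
            if 0 <= t <= 1:
                solutions.append((1 - q, t))
    return solutions

# Round trip with illustrative values (not study data).
x, y, z = expected_xyz(p=0.001, t=0.6, n=100)
solutions = solve_p_t(x, y, z, n=100)
print(solutions)
```

In this round trip the second root of the quadratic implies a negative t and is therefore rejected, leaving a single admissible solution.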
When a mortality record agreed exactly with two or more census records (therefore each link scores exactly the same total weight), one of these duplicate links was almost certainly the match and the other(s) a false-positive link. As they were indistinguishable we discarded both links to prevent false positive links. When the duplicate links had different total weight scores we assumed the highest scoring link was the match (a reasonable assumption when the majority of matches [if present] agree on all matching variables, as was the case in this study), and rejected the remaining lower scoring duplicate links. Given these two decision rules, none of the even-numbered probabilities above contribute false positive links. The proportion of all mortality records involved in false positive links can thus be approximated from the odd-numbered probabilities in {P_i, i ≥ 3}, where each P_i is estimated by substitution of the derived values for p and t.
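Because the odd-numbered probabilities from the third onwards are (1-t) times the binomial probabilities of one or more chance links, their sum has a closed form. The sketch below computes it, together with a hypothetical PPV that assumes exactly one link is kept per linked mortality record; that PPV formula is an assumption made for illustration (it ignores links discarded as exact ties) and is not stated in the text.

```python
def false_positive_proportion(p, t, n):
    """Sum of the odd-numbered probabilities P3, P5, ... in closed form.

    These are (1 - t) times the binomial probabilities of one or more chance
    links in n comparisons, so they sum to (1 - t) * (1 - (1 - p)**n).
    """
    return (1 - t) * (1 - (1 - p) ** n)

def approximate_ppv(p, t, n):
    """Hypothetical PPV: share of linked mortality records whose link is a match.

    Assumes one link is kept per linked mortality record, so the linked
    proportion is 1 - P1. Links discarded as exact ties are ignored.
    """
    linked = 1 - (1 - t) * (1 - p) ** n
    return 1 - false_positive_proportion(p, t, n) / linked

# Illustrative values only (not study data).
fp = false_positive_proportion(p=0.001, t=0.6, n=100)
ppv = approximate_ppv(p=0.001, t=0.6, n=100)
print(fp, ppv)
```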
Two refinements may be used with this duplicate method, first to improve efficiency, and second to recognize that not all mortality records are eligible to have a comparison pair as the cut-off becomes very high.
Efficiency is improved by blocking, that is by comparing records on the two files only when a highly discriminating variable already agrees. For example, we might block the census and mortality files by geocode and thus only compare census and mortality records when they come from the same neighbourhood. This blocking dramatically reduces the number of comparisons between the two files, but also reduces the sensitivity (a match with disagreeing geocode would be missed or skipped) and increases the PPV (the number of false positives is a function of how many census records are compared to any given mortality record). In the above equations, n becomes the average number of census records in each block, not the total number of census records in the file. (The effect of using an average n is explored below.)
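A minimal sketch of blocking on toy data follows. The record layout, the field names and the helper function are hypothetical; the point is only that comparison pairs are generated within blocks, and that n is then the average number of census records compared per mortality record.

```python
from collections import defaultdict

def candidate_pairs(census, deaths, block_key):
    """Yield (death, census) comparison pairs that share the same block value."""
    blocks = defaultdict(list)
    for rec in census:
        blocks[rec[block_key]].append(rec)
    for death in deaths:
        for rec in blocks.get(death[block_key], []):
            yield death, rec

# Toy records; ids and geocodes are made up for illustration.
census = [
    {"id": "c1", "geocode": 1}, {"id": "c2", "geocode": 1},
    {"id": "c3", "geocode": 2}, {"id": "c4", "geocode": 3},
]
deaths = [{"id": "d1", "geocode": 1}, {"id": "d2", "geocode": 3}]

pairs = list(candidate_pairs(census, deaths, "geocode"))
unblocked = len(census) * len(deaths)   # comparisons without blocking
average_n = len(pairs) / len(deaths)    # average census records per death
print(len(pairs), unblocked, average_n)
```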
Second, very high total weight scores will only be possible for exact agreements between records with uncommon values of the matching variables (e.g. born in Asia). In order for the duplicate method to work at these very high total weights, allowance must be made for the decreasing number of records able to score this high (a method for which is presented below). However, as most record linkage projects will accept all exact agreements this problem is not critical.
Illustrating the duplicate method in the New Zealand Census-Mortality Study (NZCMS) |
---|
The linkage of the 1986 census and 1986–1989 mortality records in the NZCMS involved eight passes using Automatch®.24 In the first pass the census and mortality records were blocked into approximately 32 000 meshblocks, the smallest administrative geographical area in New Zealand with an average of around 100 people. In all, 39 515 mortality records and 3 131 176 census records were submitted to the first pass. Among other things, the output from Automatch® includes the number of highest-scoring pairs and duplicate pairs (i.e. MP and DA Pairs, respectively, in Automatch® jargon). (Automatch® does not produce values for X, Y and Z directly.) A highest-scoring pair is the highest total weight scoring comparison pair for a given mortality record. A duplicate pair is any other comparison pair involving a mortality record that is already involved in a highest-scoring pair. Thus, above any given cut-off:
Note that:
![]() |
We used an iterative process to estimate X, Y, and Z. Equation (1) was first solved using the number of highest-scoring pairs for X, and the number of duplicate pairs for Y (and consequently Z was initially set at zero). Next, P1, P2, P3, P4, and P5 were calculated using the p and t estimates from the first iteration, and then revised estimates of X (P1), Y (P2 + P3), and Z (P4 + P5) were made and used in the second iteration. This process was repeated until convergence was achieved.
The number of highest-scoring pairs and duplicate pairs above varying cut-off weights is shown in the first two columns of Table 3. In this project the majority of comparison pairs above a total weight of 14 (calculated probabilistically by Automatch®) agreed exactly on all matching variables. For any cut-off below 14 we assumed that all 39 515 submitted mortality records had a chance of being involved in a false positive link. However, for any cut-off above 14 we adjusted downwards the number of submitted mortality records to approximate the number that could have actually had a link above the given weight. We used the distribution of highest-scoring pairs by weight score to approximate that number. For example, above a cut-off of 17 there were 7205 highest-scoring pairs, or 29.6% of all the 24 352 highest-scoring pairs above 14. Thus we assumed that the number of mortality records with values of their matching variables that permitted a weight score above 17 was 29.6% of 39 515, i.e. 11 691. This adjusted number of mortality records was used in combination with the number of highest-scoring pairs and duplicate pairs to calculate X, Y and Z.
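The adjustment for a cut-off of 17 can be reproduced directly from the figures quoted above; note that the unrounded share, rather than the rounded 29.6%, gives the stated 11 691. The variable names below are illustrative.

```python
submitted = 39_515        # mortality records submitted to the first pass
pairs_above_14 = 24_352   # highest-scoring pairs above a total weight of 14
pairs_above_17 = 7_205    # highest-scoring pairs above a total weight of 17

# Share of highest-scoring pairs that scored above 17, applied to all
# submitted mortality records to approximate the number eligible.
share = pairs_above_17 / pairs_above_14
eligible = share * submitted
print(f"{share:.1%}", round(eligible))
```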
The calculations so far determine the PPV above different total weights. Of more relevance in setting the cut-off weight is the PPV at the margin, i.e. at or about the potential cut-off weight. We estimated this marginal PPV by determining the number of highest-scoring pairs and estimated false positives for each 1-point range of the total weight score. Results are shown in the final columns of Table 3. For example, we estimated that 70.9% of links with a total weight score between 7 and 8 were matches, i.e. the PPV was 70.9% for this narrow range of total weight scores. The marginal PPV increased rapidly from close to 0% at a weight score of about 3.5 to 90% for a weight score of about 9.5. Thus, to ensure that the marginal PPV was always greater than 90%, a cut-off score of 9 was indicated in this project.
Whilst we were unable to validate our duplicate method for calculating the PPV against a gold-standard sub-sample of comparison pairs with more discriminating matching variables (e.g. names and text addresses), two additional methods provided reassuringly similar patterns of results. (See ref. 22 for details.) First, for each 1-point increase in the weight score the odds of being a false positive link approximately halved, exactly as would be predicted by the absolute odds method.3,19–21 Second, PPV calculations using the duplicate method for very high total weight scores (i.e. where most comparison pairs were exact agreements) were similar to calculations using a method based on the probability of any one mortality record agreeing exactly with a census record purely by chance. However, there are two advantages of the duplicate method compared to the absolute odds method and the latter chance method. Unlike the absolute odds method, the duplicate method is not prone to bias from correlated coding errors; and unlike the chance method, it is applicable to weight scores for non-exact agreements.
We conducted sensitivity analyses of the effect of variations about the average block size (i.e. n), assuming that false positive links only arose for P3, P5, and P7, and assuming that p was constant for all mortality records. For the situation encountered in the NZCMS, it appeared that the duplicate method was not particularly sensitive to moderate violations of the assumptions described above. (See reference 22 for details.)
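The halving per weight point can be made explicit if, as the month-of-birth weights of 3.51 and -4.20 quoted earlier suggest, total weights are base-2 logarithms of likelihood ratios. Under that assumption, and writing $O_0$ for the prior odds of a match (a symbol introduced here for illustration only), the absolute odds method implies:

$$
\text{odds}(\text{match} \mid W) = O_0 \, 2^{W},
\qquad
\text{odds}(\text{false positive} \mid W) = \frac{1}{O_0 \, 2^{W}},
\qquad
\frac{\text{odds}(\text{false positive} \mid W+1)}{\text{odds}(\text{false positive} \mid W)} = \frac{1}{2}.
$$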
Conclusion |
---|
KEY MESSAGES
Acknowledgments |
---|
References |
---|
2 Gill L, Goldacre M, Simmons H, Bettley G, Griffith M. Computerised linking of medical records: methodological guidelines. J Epidemiol Community Health 1993;47:316–19.
3 Newcombe H. Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford: Oxford University Press, 1988.
4 Jaro M. Probabilistic linkage of large public health data files. Stat Med 1995;14:491–98.
5 Baldwin J, Acheson E, Graham W. Textbook of Medical Record Linkage. Oxford: Oxford University Press, 1987.
6 Copeland K, Checkoway H, McMichael A, Holbrook R. Bias due to misclassification in the estimation of relative risk. Am J Epidemiol 1977;105:488–95.
7 Rothman K, Greenland S. Modern Epidemiology. 2nd Edn. Philadelphia: Lippincott-Raven, 1998.
8 Rodgers A, McMahon S. Systematic underestimation of treatment effects as a result of diagnostic test inaccuracy: implications for the interpretation and design of thromboprophylaxis trials. Thromb Haemost 1995;73:167–71.
9 Brenner H, Gefeller O. Use of the positive predictive value to correct for disease misclassification in epidemiologic studies. Am J Epidemiol 1993;138:1007–15.
10 Green M. Use of predictive value to adjust relative risk estimates biased by misclassification of outcome status. Am J Epidemiol 1983;117:98–105.
11 Blakely T. Socio-economic factors and mortality among 25–64 year olds: The New Zealand Census-Mortality Study. (Also at http://www.wnmeds.ac.nz/nzcms-info.html) [Doctorate]. University of Otago, 2001.
12 Muse A, Mikl J, Smith P. Evaluating the quality of anonymous record linkage using deterministic procedures with the New York State Aids Registry and a hospital discharge file. Stat Med 1995;14:499–509.
13 van den Brandt P, Schouten L, Goldbohm R, Dorant E, Hunen P. Development of a record linkage protocol for use in the Dutch cancer registry for epidemiological research. Int J Epidemiol 1990;19:553–58.
14 Jamieson E, Roberts J, Browne G. The feasibility and accuracy of anonymized record linkage to estimate shared clientele among three health and social service agencies. Meth Inform Med 1995;34:371–77.
15 Goldberg M, Carpenter M, Theriault G, Fair M. The accuracy of ascertaining vital status in a historical cohort study of synthetic textiles workers using computerised record linkage to the Canadian mortality data base. Canadian J Public Health 1993;84:201–04.
16 Mi M, Kagawa J, Earle M. An operational approach to record linkage. Meth Inform Med 1983;22:77–82.
17 Calle E, Terrell D. Utility of the National Death Index for ascertainment of mortality among Cancer Prevention Study II Participants. Am J Epidemiol 1993;137:235–41.
18 Brenner H, Schmidtmann I. Effects of record linkage errors on disease registration. Meth Inf Med 1998;37:69–74.
19 Roos LJ, Wajda A, Nicol J. The art and science of record linkage: methods that work with few identifiers. Comput Biol Med 1986;16:45–57.
20 Newcombe H. Age-related bias in probabilistic death searches due to neglect of the Prior Likelihoods. Computers and Biomedical Research 1995;28:87–99.
21 Newcombe H, Smith M, Howe G, Mingay J, Strugnell A, Abbatt J. Reliability of computerized versus manual death searches in a study of the health of Eldorado uranium workers. Comput Biol Med 1983;13:157–69.
22 Blakely T, Salmond C, Woodward A. Anonymous record linkage of 1991 census records and 1991–94 mortality records: The New Zealand Census-Mortality Study (Also at http://www.wnmeds.ac.nz/nzcms-info.html). Wellington: Department of Public Health, Wellington School of Medicine, University of Otago, 1999.
23 Blakely T, Salmond C, Woodward A. Anonymous linkage of New Zealand mortality and Census data. Aust NZ J Public Health 2000;24:92–95.
24 MatchWare Technologies Inc. Automatch Generalised Record Linkage System, Version 4.2: User's Manual. Kennebunk, Maine: MatchWare Technologies, Inc, 1998.