Statistical Inference for the Area under the Receiver Operating Characteristic Curve in the Presence of Random Measurement Error

Enrique F. Schisterman1, David Faraggi2, Benjamin Reiser2 and Maurizio Trevisan1

1 Department of Social and Preventive Medicine, University at Buffalo, Buffalo, NY.
2 Department of Statistics, Haifa University, Haifa, Israel.


    ABSTRACT
 
The area under the receiver operating characteristic curve is the most commonly used measure of the ability of a biomarker to distinguish between two populations. Some markers are subject to substantial measurement error. Under normality assumptions, the authors develop a confidence interval procedure for the area under the receiver operating characteristic curve that adjusts for measurement error. This procedure assumes the availability of data from a reliability study of the biomarker. A simulation study was used to check the validity of the proposed confidence interval. Furthermore, it was shown that not adjusting for measurement error could result in a serious understatement of the effectiveness of the biomarker.

cardiovascular diseases; reliability; repeated measures; thiobarbituric acid reaction substances

Abbreviations: ROC, receiver operating characteristic; TBARS, thiobarbituric acid reaction substances


    INTRODUCTION
 
Receiver operating characteristic (ROC) analysis was originally developed for use with radar technology to separate observer variability from the innate detectability of the signal (1). It has gained popularity in laboratory medicine over the past 30 years as a tool for evaluating test performance. ROC curves simultaneously show the proportion of both "abnormal" and "normal" subjects correctly diagnosed at various test cutoff points. This graphic display not only facilitates the selection of an optimal threshold but also enables easy comparison of different tests. The tool has become popular in epidemiologic research as a statistical instrument for evaluating the discriminative abilities of different biomarkers, such as biomarkers of oxidative stress and antioxidants.

The area under the ROC curve, introduced to the medical literature by Hanley and McNeil (2), is a frequently used criterion for examining the discriminative effectiveness of biomarkers. With X and Y representing the biomarker values for the controls and cases, respectively, Bamber (3) showed that the area under the ROC curve is A = P(Y > X); thus, A is a global measure of how well the biomarker distinguishes between cases and controls. Assuming that X and Y follow the normal distribution, Lloyd (4) provided a point estimator for A, while Reiser and Guttman (5) developed confidence intervals.

Measurement error and biases may be attributed to laboratory equipment, variation between technicians, temporal changes, biologic variability, and so on. Random measurement error can bias observed relations toward the null hypothesis (6). Many methods are available for correcting correlation and regression coefficients for random measurement error by using the reliability approach (7, 8). In the context of ROC curves, Coffin and Sukhatme (9, 10) provided a bias correction approximation for the point estimator of A under random measurement error in both parametric and nonparametric cases. Faraggi (11) considered the effect of random measurement error on the confidence interval for A in the context of the parametric normal model. He showed numerically that not taking measurement error into account can give seriously misleading results that understate the diagnostic effectiveness of the marker and lead to stated confidence intervals for A whose actual coverage is substantially less than the nominal value. Assuming normality, Reiser (12) developed a corrected confidence interval for A, adjusted for measurement error. This method assumes the availability of replicated observations on the study subjects. However, such replications are not frequently collected in epidemiologic research because of cost and feasibility considerations. In epidemiologic and medical research, reliability and validation studies are frequently used to assess the magnitude of different sources of error in the exposure of interest (7).

In this paper, we present both an estimator and a confidence interval for the area under the ROC curve under normality assumptions in the presence of random normal measurement error when replications are not available in the main study and an external reliability study is performed to evaluate measurement error. In the section Statistical Inference in the Presence of Random Measurement Error, we use the Delta method (13) to develop an approximate confidence interval for the area when random measurement error is present. A simulation study examining how the bias of the estimator and the coverage properties of the approximate confidence interval are influenced by different sample sizes and variances is discussed in the Simulation Study section. In the Example section, we provide details of a study of a biomarker of oxygen free radicals and show the importance of our methodology; namely, we show how incorporating corrections for random measurement error affected the results.


    STATISTICAL INFERENCE IN THE PRESENCE OF RANDOM MEASUREMENT ERROR
 
Assume that a biomarker follows the normal distribution $X_i \sim N(\mu_X, \sigma_X^2)$ ($i = 1,\ldots,n_X$) in the control population and the normal distribution $Y_i \sim N(\mu_Y, \sigma_Y^2)$ ($i = 1,\ldots,n_Y$) in the case population. The area under the ROC curve in this case is $A = \Phi(\delta)$, where

$$\delta = \frac{\mu_Y - \mu_X}{\sqrt{\sigma_X^2 + \sigma_Y^2}}$$

and $\Phi$ denotes the standard normal cumulative distribution function. The almost maximum likelihood estimate (5) of $A$ is $\hat{A} = \Phi(\hat{\delta})$, where

$$\hat{\delta} = \frac{\bar{Y} - \bar{X}}{\sqrt{S_X^2 + S_Y^2}}$$

and $\bar{X}$, $\bar{Y}$, $S_X^2$, and $S_Y^2$ denote the sample means and variances for the two populations (controls and cases), respectively.
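As a concrete illustration of this estimator, the following Python sketch (not the authors' code; the function name auc_binormal and the simulated data are ours) computes $\hat{A} = \Phi(\hat{\delta})$ from samples of control and case marker values.

```python
# Minimal sketch of the binormal (uncorrected) AUC estimate
# A-hat = Phi(delta-hat), delta-hat = (ybar - xbar)/sqrt(Sx^2 + Sy^2).
import numpy as np
from scipy.stats import norm

def auc_binormal(x, y):
    """Naive estimate of A = P(Y > X) under normality, ignoring measurement error."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    delta_hat = (y.mean() - x.mean()) / np.sqrt(x.var(ddof=1) + y.var(ddof=1))
    return norm.cdf(delta_hat)

# Hypothetical data: 50 controls and 50 cases with unit variance.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)   # controls
y = rng.normal(1.0, 1.0, size=50)   # cases
print(auc_binormal(x, y))           # roughly Phi(1/sqrt(2)), about 0.76
```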

In the presence of measurement error, the "true" values of the biomarker in the case and control groups ($Y$ and $X$) are not available. Instead, we assume that we actually observe

$$x_i = X_i + \varepsilon_{ix}, \qquad y_i = Y_i + \varepsilon_{iy}, \qquad (1)$$

where $\varepsilon_{ix} \sim N(0, \sigma_\varepsilon^2)$ and $\varepsilon_{iy} \sim N(0, \sigma_\varepsilon^2)$. We further assume that $X$, $Y$, $\varepsilon_{ix}$, and $\varepsilon_{iy}$ are independent of each other. In many situations, it is reasonable to assume that the variations leading to measurement error are due to factors connected with the laboratory or measurement process and do not depend on the person's risk status or the true biomarker value. Consequently, the variances of $\varepsilon_{ix}$ and $\varepsilon_{iy}$ are taken to be equal.

In addition, we assume the availability of data from a reliability study designed to estimate measurement error. If we let $w_{ij}$, $i = 1,\ldots,n_0$, $j = 1,\ldots,p_i$, denote the $j$th observation on the $i$th subject, we assume that

$$w_{ij} = W_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2),$$

where $W_i$ is the "true" value of the biomarker for the $i$th subject of the reliability study. Let $\bar{w}_i = \sum_{j=1}^{p_i} w_{ij}/p_i$ and $n_f = \sum_{i=1}^{n_0} (p_i - 1)$. Consequently,

$$\hat{\sigma}_\varepsilon^2 = \frac{1}{n_f}\sum_{i=1}^{n_0}\sum_{j=1}^{p_i} \left(w_{ij} - \bar{w}_i\right)^2$$

is an unbiased estimator of $\sigma_\varepsilon^2$.
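For illustration, a short sketch of this pooled within-subject estimator follows (the function name error_variance and the replicate values are ours; the number of replicates per subject need not be equal).

```python
# Sketch: sigma_eps^2-hat = sum_i sum_j (w_ij - wbar_i)^2 / n_f, with n_f = sum_i (p_i - 1).
import numpy as np

def error_variance(reliability_data):
    """reliability_data: one array of replicate measurements per subject."""
    ss, n_f = 0.0, 0
    for w in reliability_data:
        w = np.asarray(w, float)
        ss += np.sum((w - w.mean()) ** 2)   # within-subject sum of squares
        n_f += len(w) - 1                   # degrees of freedom
    return ss / n_f, n_f

# Hypothetical replicates for three subjects:
sigma2_eps_hat, n_f = error_variance([[1.1, 1.3, 1.2], [0.9, 1.0], [1.4, 1.5, 1.6, 1.4]])
```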

Let $\bar{x}$ and $\bar{y}$ denote the sample means of the observed $x_i$ and $y_i$, and set $S_x^2 = \sum_{i=1}^{n_X}(x_i - \bar{x})^2/(n_X - 1)$ and $S_y^2 = \sum_{i=1}^{n_Y}(y_i - \bar{y})^2/(n_Y - 1)$.

Consequently, from equation 1, since $\mathrm{Var}(x_i) = \sigma_X^2 + \sigma_\varepsilon^2$, an unbiased estimator of $\sigma_X^2$ is $S_x^2 - \hat{\sigma}_\varepsilon^2$. Similarly, $S_y^2 - \hat{\sigma}_\varepsilon^2$ is unbiased for $\sigma_Y^2$.

Thus, when random measurement error is present, a natural measurement-error-corrected estimate of $A$ is $\hat{A}_c = \Phi(\hat{\delta}_c)$, where

$$\hat{\delta}_c = \frac{\bar{y} - \bar{x}}{\sqrt{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2}}.$$

Note that the estimators $S_x^2 - \hat{\sigma}_\varepsilon^2$ and $S_y^2 - \hat{\sigma}_\varepsilon^2$ are not constrained to be positive, implying that either may be negative, which can yield a negative value of $S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2$ and an undefined $\hat{\delta}_c$. This situation is similar to the classical variance components problem discussed by Searle (14). We follow Puduri and Rao (15), who recommend replacing negative variance estimates by a very small number.
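A sketch of the corrected estimate that also applies this replacement rule follows; the floor value 1e-8 simply stands in for the "very small number" and is our choice, as are the function and argument names.

```python
# Sketch of the measurement-error-corrected AUC estimate A-hat_c = Phi(delta-hat_c).
import numpy as np
from scipy.stats import norm

def auc_corrected(x, y, sigma2_eps, floor=1e-8):
    """sigma2_eps: error variance estimate from the reliability study."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx = max(x.var(ddof=1) - sigma2_eps, floor)   # estimate of sigma_X^2, floored if negative
    vy = max(y.var(ddof=1) - sigma2_eps, floor)   # estimate of sigma_Y^2, floored if negative
    delta_c = (y.mean() - x.mean()) / np.sqrt(vx + vy)
    return norm.cdf(delta_c)
```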

Consequently, taking measurement error into account results in a larger estimate of $\delta$ and, correspondingly, of $A$. Ignoring measurement error will underestimate $A$ and will therefore understate the effectiveness of biomarkers in discriminating between cases and controls.

The amount of measurement error needs to be judged relative to the inherent marker variability. A commonly used index, termed the "reliability index" in the social science literature (16), is extended here to the two-population case.

This natural extension results in

$$R = \frac{\sigma_X^2 + \sigma_Y^2}{\sigma_X^2 + \sigma_Y^2 + 2\sigma_\varepsilon^2}.$$

This index ranges between 0 and 1, with values close to 1 indicating that the measurement error is relatively small and values close to 0 indicating a large relative measurement error. Related reliability indexes are used in somewhat different situations by Faraggi (11) and Reiser (12).
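A one-line sketch of the estimated index, obtained by substituting $S_x^2 - \hat{\sigma}_\varepsilon^2$ and $S_y^2 - \hat{\sigma}_\varepsilon^2$ for $\sigma_X^2$ and $\sigma_Y^2$ (the function name is ours):

```python
def reliability_index_hat(sx2, sy2, sigma2_eps):
    # R-hat = (Sx^2 + Sy^2 - 2*sigma_eps^2-hat) / (Sx^2 + Sy^2)
    vx, vy = sx2 - sigma2_eps, sy2 - sigma2_eps
    return (vx + vy) / (vx + vy + 2.0 * sigma2_eps)
```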

Since $\Phi$ is a monotonically increasing function of $\delta$, finding a confidence interval for $A$ is equivalent to finding one for $\delta$. Using the Delta method, we estimate the variance of $\hat{\delta}_c$ by

$$\widehat{\mathrm{Var}}(\hat{\delta}_c) = \frac{S_x^2/n_X + S_y^2/n_Y}{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2} + \frac{\hat{\delta}_c^2}{4\left(S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2\right)^2}\left[\frac{2S_x^4}{n_X - 1} + \frac{2S_y^4}{n_Y - 1} + \frac{8\hat{\sigma}_\varepsilon^4}{n_f}\right]$$

(see appendix 1). Therefore, an approximate $(1 - \alpha) \times 100$ percent confidence interval for $\delta$ is given by

$$\hat{\delta}_c \pm z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\delta}_c)}, \qquad (2)$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution, with the corresponding interval for $A$ being

$$\left(\Phi\!\left(\hat{\delta}_c - z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\delta}_c)}\right),\; \Phi\!\left(\hat{\delta}_c + z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\delta}_c)}\right)\right).$$
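The following sketch implements the interval as reconstructed above (the function name and the floor used for negative variance estimates are ours, not the authors').

```python
# Sketch of the corrected point estimate and the equation 2 confidence interval.
import numpy as np
from scipy.stats import norm

def auc_corrected_ci(x, y, sigma2_eps, n_f, alpha=0.05, floor=1e-8):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    v = max(sx2 - sigma2_eps, floor) + max(sy2 - sigma2_eps, floor)   # sigma_X^2 + sigma_Y^2
    delta_c = (y.mean() - x.mean()) / np.sqrt(v)
    var_num = sx2 / nx + sy2 / ny                                     # Var(ybar - xbar)
    var_den = (2 * sx2**2 / (nx - 1) + 2 * sy2**2 / (ny - 1)
               + 8 * sigma2_eps**2 / n_f)            # Var(Sx^2 + Sy^2 - 2*sigma_eps^2-hat)
    var_delta = var_num / v + delta_c**2 * var_den / (4 * v**2)
    z = norm.ppf(1 - alpha / 2)
    lo, hi = delta_c - z * np.sqrt(var_delta), delta_c + z * np.sqrt(var_delta)
    return norm.cdf(delta_c), (norm.cdf(lo), norm.cdf(hi))
```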


    SIMULATION STUDY
 
To examine the coverage properties of the confidence intervals developed above, we simulated 7,300 samples for various combinations of (nX, nY, nf). We examined all combinations of (nX, nY, nf) = (50, 50, 19), (100, 100, 49), (900, 50, 49), and (1,000, 1,000, 199) with A = 0.6, 0.7, 0.8, 0.9, 0.95; R = 0.2, 0.4, 0.6, 0.8; and $\alpha$ = 0.05, 0.10. We simulated nX normally distributed values for controls with mean $\mu_X = 0$ and variance $\sigma_X^2 + \sigma_\varepsilon^2$ and nY values for cases with mean $\mu_Y$ and variance $\sigma_Y^2 + \sigma_\varepsilon^2$.

The parameter values were selected with $\mu_X = 0$ and with $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_\varepsilon^2$ chosen to yield the specified values of A and R.

Instead of simulating the $w_{ij}$, $\hat{\sigma}_\varepsilon^2$ was obtained directly as $\sigma_\varepsilon^2 \chi_{n_f}^2 / n_f$, where $\chi_{n_f}^2$ is a chi-square variate with $n_f$ df.

The percentages of the 7,300 simulated samples in which the 95 percent confidence intervals computed according to equation 2 contained the true value of A are presented in table 1 for various sample sizes. The results for the 90 percent confidence interval were very similar and are omitted for the sake of brevity. The number of generated samples was chosen to be 7,300 so that the width of a 95 percent confidence interval on the actual coverage is about 0.01 when the actual coverage is close to 0.95. Whenever negative variance estimates occurred, $S_x^2 - \hat{\sigma}_\varepsilon^2$ or $S_y^2 - \hat{\sigma}_\varepsilon^2$ was replaced by a very small positive number, as described in the Discussion section. In addition, the bias of $\hat{A}_c$ was computed for all the simulations.
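A condensed sketch of one cell of this simulation is given below. The equal variances $\sigma_X^2 = \sigma_Y^2 = 1$ are an illustrative assumption (the exact parameter values are not reported above), and the sketch reuses auc_corrected_ci from the sketch in the preceding section.

```python
# Sketch of the coverage estimate for one (A, R, nx, ny, nf) combination.
import numpy as np
from scipy.stats import norm

def coverage(A, R, nx, ny, n_f, n_sim=7300, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    delta = norm.ppf(A)                        # A = Phi(delta)
    mu_y = delta * np.sqrt(2.0)                # assuming sigma_X^2 = sigma_Y^2 = 1
    sigma2_eps = (1.0 - R) / R                 # from R = 2 / (2 + 2*sigma_eps^2)
    hits = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, np.sqrt(1.0 + sigma2_eps), nx)    # observed controls
        y = rng.normal(mu_y, np.sqrt(1.0 + sigma2_eps), ny)   # observed cases
        s2e = sigma2_eps * rng.chisquare(n_f) / n_f           # sigma_eps^2-hat drawn directly
        _, (lo, hi) = auc_corrected_ci(x, y, s2e, n_f, alpha)
        hits += (lo <= A <= hi)
    return hits / n_sim
```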


TABLE 1. Observed coverage probabilities of confidence intervals

 
In table 1, the estimated coverage probabilities whose 95 percent confidence intervals (based on a binomial sample of 7,300 simulations) do not include the target coverage probability are indicated by italics.

For nX = nY = 50 and nf = 19, the observed coverage of the proposed confidence interval is close to its nominal value for R > 0.5, even though the nominal value sometimes lies outside the binomial confidence interval. However, for small R, the coverage tends to be conservative for A < 0.7 and low for A ≥ 0.7. With a larger sample size (nX = nY = 100, nf = 49), the estimated confidence interval shows some improvement, with observed coverages reasonably close to their nominal values for all A with R > 0.2, although in some cases the target value is outside the binomial confidence interval. When we consider the situation of an unequal number of cases and controls, dealing with the extreme case of nX = 900 and nY = 50, the results are very similar to those previously presented. Interestingly, when nX, nY, and nf are large (nX = nY = 1,000, nf = 199), the coverage probability is very good in almost every situation except R = 0.2, which indicates that coverage improves with sample size. We also increased the sample size to nX = nY = 10,000, nf = 1,999 and found that the observed coverage was very close to the nominal value, even for R = 0.2. Although these results indicate that the coverage is asymptotically correct, such large sample sizes are rarely available in practice, and thus we conclude that for R = 0.2 our confidence interval does not perform satisfactorily. However, R = 0.2 is an extreme case. Note that for $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, R = 0.2 implies $\sigma_\varepsilon^2 = 4\sigma^2$, which rarely occurs in practice (11).

All of the above indicate that for a wide range of sample sizes and parameter values, equation 2 provides an effective confidence interval for the area under the ROC curve.

We also computed the percentage of negative variance estimates. As one might expect, this percentage is very much affected by sample size. For small sample sizes, the percentage was as high as 25 percent for R = 0.2, about 10 percent for R = 0.4, 2.5 percent for R = 0.6, and less than 0.5 percent for R = 0.8. These results were similar for various areas under the curve. As sample sizes increase, the percentage of negative variance estimates decreases, and for very large sample sizes, it becomes negligible. Finally, we evaluated the bias of $\hat{A}_c$ and found that it was, at most, 2 percent and in many cases much less. Since the bias was so small, for the sake of brevity, we omitted presenting the individual biases.


    EXAMPLE
 
Oxygen free radicals are believed to be related to the processes of aging and chronic diseases, as well as to atherosclerotic coronary heart disease (17). Recent studies have suggested that oxidative modification of low density lipoproteins may be a critical factor (17) in the atherosclerotic process.

Thiobarbituric acid reaction substances (TBARS) constitute a biomarker that measures subproducts of lipid peroxidation and has been proposed as a means of discriminating between cardiovascular disease cases and healthy controls.

A population-based sample of randomly selected residents of Erie and Niagara counties, New York, who were aged 35–79 years is the focus of this investigation (18). The New York State Department of Motor Vehicles driver's license rolls were used as the sampling frame for adults between ages 35 and 64 years, while the elderly sample (ages 65–79) was randomly selected from the Health Care Financing Administration.

Blood samples, physical measurements, and a detailed questionnaire on various behavioral and physiologic patterns were obtained from study participants. Owing to cost and logistical considerations, sample replications were not obtained from the participants.

After the exclusion of participants with a history of cancer (60 subjects) and/or incomplete information on TBARS (68 subjects) and of non-White participants (75 subjects), a total of 474 White men and 494 White women were included in the analysis. We define the cases as persons with myocardial infarction. Because of the skewness of the original data, the transformation $(\mathrm{TBARS})^{-1/2}$ was applied to bring the data distribution closer to normality. Note that after the transformation, the marker values for the cases are smaller than those for the controls; consequently, when estimating A, the sign of the estimator of $\delta$ needs to be reversed. The transformed data yield the following results:






A reliability study to estimate the variance due to random measurement error in the analysis of TBARS was conducted on a convenience sample of 10 participants. Twelve-hour fasting blood samples were obtained for seven women and three men over a period of 6 months to estimate the measurement error variability. The blood samples were obtained every month on the same day of each female's menstrual cycle and every month on the same calendar day for each male. From the reliability study, we obtain the following:



where $\hat{R}$ is obtained by substituting estimates for the parameters in the formula for R. The values of nX, nY, and nf in this particular study are similar to those used for the simulations in table 1, column 3. Furthermore, since table 1 indicates that the confidence intervals computed according to equation 2 have good coverage properties for R > 0.2, we can expect them to operate reasonably well for the value of $\hat{R}$ obtained in our example.

Not correcting for measurement error yields (0.550, 0.725) as the unadjusted 95 percent confidence interval for A. Applying the methods of the section Statistical Inference in the Presence of Random Measurement Error gives a larger adjusted area estimate, with the corresponding corrected 95 percent confidence interval being (0.600, 0.888).

Correction for measurement error increased the estimate of the area under the ROC curve and shifted the confidence interval to include much higher values. Use of the uncorrected results understates the effectiveness of TBARS as a biomarker capable of discriminating between subjects with and those without cardiovascular disease.
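The transformation and sign-reversal step used in this example can be sketched as follows; the function name is ours, and for brevity the sketch shows only the uncorrected estimate (the corrected analysis would subtract $\hat{\sigma}_\varepsilon^2$ from the sample variances of the transformed values, as in the earlier sketches).

```python
# Sketch: (TBARS)^(-1/2) reverses the ordering of cases and controls,
# so the sign of delta-hat is flipped before computing A.
import numpy as np
from scipy.stats import norm

def auc_after_inverse_sqrt(tbars_controls, tbars_cases):
    x = 1.0 / np.sqrt(np.asarray(tbars_controls, float))   # transformed controls
    y = 1.0 / np.sqrt(np.asarray(tbars_cases, float))      # transformed cases
    delta_hat = (y.mean() - x.mean()) / np.sqrt(x.var(ddof=1) + y.var(ddof=1))
    return norm.cdf(-delta_hat)   # sign reversed: higher TBARS indicates cases
```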


    DISCUSSION
 
In this paper, we have shown that an approximate confidence interval for the area under the ROC curve, which takes measurement error into account through a reliability study, can be obtained by using the Delta method. Our simulation study indicates that this method has good coverage properties for a wide range of situations. If measurement error is not taken into account in the construction of the confidence interval, the effectiveness of the biomarker can be seriously understated. We assumed equality of the measurement errors for the controls and the cases. If this assumption is not appropriate, reliability studies on both cases and controls should be conducted to allow separate estimation of the measurement errors. Applying the same methods as in the section Statistical Inference in the Presence of Random Measurement Error, we now obtain

$$\widehat{\mathrm{Var}}(\hat{\delta}_c) = \frac{S_x^2/n_X + S_y^2/n_Y}{S_x^2 - \hat{\sigma}_{\varepsilon x}^2 + S_y^2 - \hat{\sigma}_{\varepsilon y}^2} + \frac{\hat{\delta}_c^2}{4\left(S_x^2 - \hat{\sigma}_{\varepsilon x}^2 + S_y^2 - \hat{\sigma}_{\varepsilon y}^2\right)^2}\left[\frac{2S_x^4}{n_X - 1} + \frac{2S_y^4}{n_Y - 1} + \frac{2\hat{\sigma}_{\varepsilon x}^4}{n_{fx}} + \frac{2\hat{\sigma}_{\varepsilon y}^4}{n_{fy}}\right], \qquad (3)$$

where $\hat{\sigma}_{\varepsilon x}^2$, $n_{fx}$ and $\hat{\sigma}_{\varepsilon y}^2$, $n_{fy}$ are the estimated variances due to random measurement error and their corresponding degrees of freedom for the controls and cases of the reliability studies, respectively, and $\hat{\delta}_c = (\bar{y} - \bar{x})/\sqrt{S_x^2 - \hat{\sigma}_{\varepsilon x}^2 + S_y^2 - \hat{\sigma}_{\varepsilon y}^2}$. Applying equation 3 to equation 2 results in the approximate confidence interval for this case.
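A sketch of the corresponding interval with unequal error variances, mirroring the equal-variance sketch given earlier (function and argument names are ours):

```python
# Sketch of the corrected estimate and confidence interval when separate
# reliability studies give sigma_eps_x^2-hat (df n_fx) and sigma_eps_y^2-hat (df n_fy).
import numpy as np
from scipy.stats import norm

def auc_corrected_ci_unequal(x, y, s2ex, n_fx, s2ey, n_fy, alpha=0.05, floor=1e-8):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    v = max(sx2 - s2ex, floor) + max(sy2 - s2ey, floor)
    delta_c = (y.mean() - x.mean()) / np.sqrt(v)
    var_num = sx2 / nx + sy2 / ny
    var_den = (2 * sx2**2 / (nx - 1) + 2 * sy2**2 / (ny - 1)
               + 2 * s2ex**2 / n_fx + 2 * s2ey**2 / n_fy)   # equation 3 bracket term
    var_delta = var_num / v + delta_c**2 * var_den / (4 * v**2)
    z = norm.ppf(1 - alpha / 2)
    lo, hi = delta_c - z * np.sqrt(var_delta), delta_c + z * np.sqrt(var_delta)
    return norm.cdf(delta_c), (norm.cdf(lo), norm.cdf(hi))
```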

The use of correction methods for random measurement error remains controversial in epidemiologic studies. These methods require the assumption that the variance due to random measurement error is known or can be reliably estimated. The lack of good estimates of this variance can limit the usefulness of the correction. When the reliability or validation study involves only small numbers of subjects and replications, the estimated variance of measurement error will have a large variability, leading to uncertainty with respect to corrections. Moreover, if the measurement error varies over subgroups of the population (risk status, gender, etc.) and these subgroups are not taken into account by the reliability study, the estimates of measurement error variance may be seriously biased, leading to bias in the adjustment.

We evaluated the sensitivity of our method to deviations from distributional assumptions and found that the method is susceptible to departures from normality. When the distributional assumptions were grossly violated, our method underestimated the area under the ROC curve, and the corresponding confidence intervals provided coverage less than the nominal value. In cases in which nonnormality is suspected, suitable transformations can often be used (19).

We varied our choice of the small number used to replace negative variance estimates, examining seven alternatives (one of which was based on a factor of 0.8), and found the coverage probability to be robust across the seven options. Since one alternative provided slightly better results than the others did, that option was used for the results reported in table 1.

The clinical implications of this method are important for the early detection and prevention of disease. If use of the corrected area results in a marker being considered a good discriminator, while the uncorrected area indicates poor discrimination, then to use this marker for future subjects it is necessary to reduce the measurement error. If the source of the error is technical, for example, laboratory equipment, several approaches to reducing random measurement error are available. An average over replications can be taken. Alternatively, the laboratory quality control protocol may be revised and improved to minimize different sources of unwanted variability (e.g., to shorten the time elapsed between when the sample is taken and when it is analyzed). Sometimes, new and more effective measuring instruments and methods can be used. However, if the source of error is intraindividual variability, a standardized protocol should be implemented to reduce this source of variation.

In conclusion, we showed that ignoring measurement error can result in a serious understatement of the effectiveness of a biomarker. We presented a point estimate and a confidence interval procedure for the area under the ROC curve that correct for random measurement error. This procedure assumes the availability of data from a reliability study of the biomarker.


    APPENDIX 1
 
In the measurement error case, $x_i$ and $y_j$ are independent normal random variables with means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2 + \sigma_\varepsilon^2$ and $\sigma_Y^2 + \sigma_\varepsilon^2$, respectively. Consequently, $\bar{x}$ and $\bar{y}$ are independent normal random variables with means $\mu_X$ and $\mu_Y$ and variances $(\sigma_X^2 + \sigma_\varepsilon^2)/n_X$ and $(\sigma_Y^2 + \sigma_\varepsilon^2)/n_Y$, respectively.

Thus,

$$\bar{y} - \bar{x} \sim N\!\left(\mu_Y - \mu_X,\; \frac{\sigma_X^2 + \sigma_\varepsilon^2}{n_X} + \frac{\sigma_Y^2 + \sigma_\varepsilon^2}{n_Y}\right), \qquad (\mathrm{A1})$$

and $\bar{y} - \bar{x}$, $S_x^2$, $S_y^2$, and $\hat{\sigma}_\varepsilon^2$ are all mutually independent.

In addition,

$$\mathrm{Var}(S_x^2) = \frac{2(\sigma_X^2 + \sigma_\varepsilon^2)^2}{n_X - 1}, \qquad \mathrm{Var}(S_y^2) = \frac{2(\sigma_Y^2 + \sigma_\varepsilon^2)^2}{n_Y - 1}, \qquad \mathrm{Var}(\hat{\sigma}_\varepsilon^2) = \frac{2\sigma_\varepsilon^4}{n_f},$$

so that

$$\mathrm{Var}\!\left(S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2\right) = \frac{2(\sigma_X^2 + \sigma_\varepsilon^2)^2}{n_X - 1} + \frac{2(\sigma_Y^2 + \sigma_\varepsilon^2)^2}{n_Y - 1} + \frac{8\sigma_\varepsilon^4}{n_f}. \qquad (\mathrm{A2})$$

We use the Delta method (13) to obtain the approximate variance of $\hat{\delta}_c = (\bar{y} - \bar{x})/\sqrt{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2}$, noting the independence of $\bar{y} - \bar{x}$ and $S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2$:

$$\mathrm{Var}(\hat{\delta}_c) \approx \frac{\mathrm{Var}(\bar{y} - \bar{x})}{\sigma_X^2 + \sigma_Y^2} + (\mu_Y - \mu_X)^2\, \mathrm{Var}\!\left(\frac{1}{\sqrt{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2}}\right). \qquad (\mathrm{A3})$$

$\mathrm{Var}(\bar{y} - \bar{x})$ is given in equation A1, while for $\mathrm{Var}\!\left(1/\sqrt{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2}\right)$, we use the Delta method again to obtain

$$\mathrm{Var}\!\left(\frac{1}{\sqrt{S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2}}\right) \approx \frac{\mathrm{Var}\!\left(S_x^2 + S_y^2 - 2\hat{\sigma}_\varepsilon^2\right)}{4(\sigma_X^2 + \sigma_Y^2)^3}. \qquad (\mathrm{A4})$$

Applying equations A1 and A4 to equation A3 results in

$$\mathrm{Var}(\hat{\delta}_c) \approx \frac{\dfrac{\sigma_X^2 + \sigma_\varepsilon^2}{n_X} + \dfrac{\sigma_Y^2 + \sigma_\varepsilon^2}{n_Y}}{\sigma_X^2 + \sigma_Y^2} + \frac{\delta^2}{4(\sigma_X^2 + \sigma_Y^2)^2}\left[\frac{2(\sigma_X^2 + \sigma_\varepsilon^2)^2}{n_X - 1} + \frac{2(\sigma_Y^2 + \sigma_\varepsilon^2)^2}{n_Y - 1} + \frac{8\sigma_\varepsilon^4}{n_f}\right]. \qquad (\mathrm{A5})$$

Substituting the estimates for the unknown parameters in equation A5 results in the variance estimator $\widehat{\mathrm{Var}}(\hat{\delta}_c)$ given in the section Statistical Inference in the Presence of Random Measurement Error and used in equation 2.


    NOTES
 
Reprint requests to Dr. Enrique F. Schisterman, Cedars-Sinai Medical Center, Cardiac Imaging/Artificial Intelligence Medicine—A041, 8700 Beverly Boulevard, Los Angeles, CA 90069 (e-mail: schistermane@cshs.org).


    REFERENCES
 

  1. Erdreich LS, Lee ET. Use of relative operating characteristic analysis in epidemiology: a method for dealing with subjective judgement. Am J Epidemiol 1981;114:649–62.
  2. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
  3. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415.
  4. Lloyd DK. Estimating the life cycle of complex modeled system. In: Institute of Environmental Science proceedings. Mount Prospect, IL: Institute of Environmental Science, 1980:87–96.
  5. Reiser B, Guttman I. Statistical inference for P(Y < X): the normal case. Technometrics 1986;28:253–7.
  6. Fuller W. Measurement error models. New York, NY: John Wiley & Sons, 1987.
  7. Dunn G. Design and analysis of reliability studies: the statistical evaluation of measurement errors. New York, NY: Oxford University Press, 1989.
  8. Liu K, Stamler J, Dyer A, et al. Statistical methods to assess and minimize the role of intra-individual variability in obscuring the relationship between dietary lipids and serum cholesterol. J Chronic Dis 1987;31:399–418.
  9. Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors. Biometrics 1997;53:823–37.
  10. Coffin M, Sukhatme S. A parametric approach to measurement errors in receiver operating characteristic studies. In: Lifetime data: models in reliability and survival analysis. Dordrecht, the Netherlands: Kluwer Academic Publishers, 1994:71–5.
  11. Faraggi D. The effect of random measurement error on receiver operating characteristic (ROC) curves. Stat Med 2000;19:61–70.
  12. Reiser B. Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Stat Med 2000;19:2115–29.
  13. Miller RG Jr. Survival analysis. New York, NY: John Wiley & Sons, 1981.
  14. Searle SR. Linear models. New York, NY: John Wiley & Sons, 1971.
  15. Puduri SRS, Rao CR. Variance components: mixed models, methodologies and applications. New York, NY: Chapman and Hall, 1997.
  16. Carmines EG, Zeller RA. Reliability and validity. Series: quantitative applications in the social sciences. Sage University paper. London, England: Sage Publications, 1979.
  17. Hoffman RM, Garewal HS. Antioxidants and the prevention of heart disease. Arch Intern Med 1995;155:241–6.
  18. Schisterman EF. Lipid peroxidation and cardiovascular disease: an ROC approach. Doctoral dissertation. Buffalo, NY: State University of New York at Buffalo, 1999.
  19. Draper NR, Smith H. Applied regression analysis. 3rd ed. New York, NY: John Wiley & Sons, 1998.
Received for publication January 28, 2000. Accepted for publication September 18, 2000.




