Discrimination, adjusted correlation, and equivalence of imprecise tests: application to glucose tolerance

Jonathan Levy, Richard Morris, Margaret Hammersley, and Robert Turner

Diabetes Research Laboratories, Department of Clinical Medicine, Oxford University, Oxford OX2 6HE, United Kingdom


    ABSTRACT

Comparison studies between physiological tests are often unsatisfactory for assessing their ability to distinguish between subjects. We recommend a simple but comprehensive protocol, using duplicate testing, that compares tests using 1) the discriminant ratio (DR), i.e., the ratio of the underlying between-subject SD to the within-subject SD, 2) correlation coefficients adjusted for attenuation due to test imprecision, and 3) unbiased estimation of the underlying linear relationship between test results. The following five alternative methods for assessing glucose tolerance were compared: fasting plasma glucose (FPG) as a single sample or as the mean of three samples taken at 5-min intervals (FPG3); the 1- and 2-h glucose during a low-dose intravenous glucose infusion (CIG); and the 2-h plasma glucose from a 75-g oral glucose tolerance test (OGTT). All tests had similar DRs ranging from 2.6 to 4.2. The adjusted correlation between FPG and CIG tests approached unity, and those between OGTT and other tests were ~0.9, showing that FPG3 provides similar information to the OGTT. FPG concentrations of 6.0 and 7.1 mmol/l were found equivalent to the 1985 World Health Organization OGTT thresholds for impaired glucose tolerance and diabetes (7.8 and 11.1 mmol/l).

plasma glucose; precision


    INTRODUCTION

COMPLEX PHYSIOLOGICAL responses, such as glucose tolerance, may be assessed by many different methods. For instance, the control of plasma glucose may be measured by the fasting plasma glucose (FPG), by the response to an oral or an intravenous glucose challenge, by the percentage of glycated hemoglobin (HbA1c), or by the plasma concentration of fructosamine. Different clinical tests are used to assess the same physiological variable because of different clinical circumstances, the application of advances in understanding and technique, or as a result of historical circumstances or fashion. The tests may assess differing or equivalent aspects of a physiological characteristic and may express their results in different units of measurement.

In this paper, we recommend a simple but comprehensive methodology for comparing different tests of a continuous physiological variable, which may be applied irrespective of the scales of measurements used. For such comparisons, it is important to include assessment of both the between- and within-subject variation of each test, as this allows comparison of the ability of tests to discriminate between different subjects, determination of the underlying correlation between tests after adjusting for attenuation due to within-subject variation, and unbiased estimation of the underlying relationships between the results of the different tests.

Omission of any of these components in a comparison study will provide inadequate information for choosing a particular test for a particular situation. The imprecision, or the within-subject variation, of a test is of little use by itself and must be considered in relation to the range of the test results. The "coefficient of variation," which relates imprecision to the midpoint of the range, is often inappropriate as it does not take account of the dynamic range of a test (1) and is not always comparable between different scales of measurement. The present paper introduces the concept of "discrimination," i.e., the ability to distinguish between individual subjects within a specified range of interest. This can be expressed as the discriminant ratio (DR), which is defined here as the ratio of the underlying between-subject SD (SDU) to the within-subject SD (SDW). The DR has a defined distribution, and DRs for different tests can be compared statistically.

The correlation between different tests must be included in any comparison. Test imprecision will diminish, or "attenuate," the measured correlation coefficients in a well-described way (11), and it is important to correct for this so as to be able to assess the underlying "true" correlation, as this represents the degree to which the tests are assessing the same physiological trait. This attenuation adjustment can be expressed in terms of the DRs of the respective tests.

Finally, the comparison of tests must, if possible, relate measurements by one test to those by another. Where the relationship is linear, the gradient of the relationship obtained by least squares regression is underestimated ("regression dilution") when both tests are subject to appreciable measurement error. An unbiased technique for deriving the relationship is therefore necessary, and, while many methods are available, a suitable method (27) is recommended here.

The assessment of glucose tolerance is a specific area where several methods are available to an investigator, for instance, the measurement of steady-state plasma glucose or of the response to a glucose challenge, either oral or intravenous. The standard test of glucose tolerance has, until recently, been the oral glucose tolerance test (OGTT; see Ref. 33), but, in practice, this test is not often performed. This is partly because of the inconvenience of a 2-h test and partly because of the marked variability of the OGTT. The test is poorly reproducible (12, 21, 24, 26), with a reported coefficient of variation of the order of 15-40%, due in part to the variable rate of gastric emptying. This has predictable effects on the reclassification of subjects, with a change in status on repeat testing in 30-60% of cases of impaired glucose tolerance (IGT; see Refs. 9, 13, 26, 28, 29) and predictable regression to the mean (26). The simple measurement of FPG has been suggested as a preferable measure (4, 8, 15, 20), and a continuous intravenous infusion of glucose (CIG) can assess glucose tolerance and give simultaneous measures of pancreatic β-cell function and insulin resistance (16). The FPG thresholds for diabetes have been set using outcome data (4). We evaluate and compare the performance of the FPG and other tests throughout the physiological range.

This paper outlines the concepts and components of a physiological comparison study and uses as an example the assessment of glucose tolerance by 1) the FPG (single sample or mean of 3 consecutive samples), 2) the 1- and 2-h responses to a CIG, and 3) the standard 2-h response to the OGTT, in repeated tests in 30 subjects spanning the range of glucose tolerance.


    METHODS

Statistical Methods

This paper considers the following three aspects of the assessment and comparison of different tests for measuring an underlying physiological variable such as glucose tolerance: 1) the ability of a test to discriminate between different subjects, and the comparison of discrimination between different tests; 2) the correlation between pairs of tests, adjusted for bias due to within-subject variation. Such variation attenuates measured correlation coefficients so that they underestimate the underlying true correlation, and the adjustment is important in assessing the degree to which different tests purporting to assess the same underlying trait differ with respect to systematic between-subject factors as opposed to random within-subject variation; and 3) in cases where the relationship between a pair of tests appears to be linear, unbiased estimation of the underlying line of equivalence between them.

Each of these aspects is required for a comprehensive comparison study and is based on a combination of well-recognized and novel concepts. Our approach is considered on a conceptual level at this point, with a more rigorous statistical treatment reserved for APPENDIXES I-III.

Discrimination between subjects. All physiological measurements 1) are subject to imprecision, which may derive from biological, sampling, and analytic sources and 2) relate implicitly to variables taking on values within a particular range of interest. The performance of a particular physiological test will depend on the relationship between both of these characteristics. Absolute measurements of the imprecision of the test are only meaningful in relation to the range of values to which that test will be applied. The smaller the former is in relation to the latter, the greater is the ability of a test to discriminate between individual subjects. In the context of measurements being obtained from a series of individuals representing the physiological spectrum of interest, we propose a novel, simple index of discrimination, the DR, the ratio between the SD of the underlying subject means (SDU), and the SD of repeated measurements on the same subject (SDW).

The discrimination of a test is not a universal property but will relate to the spectrum of values in the population being studied. Hence absolute DR values are not comparable between different populations, but they are essential when comparing the practical application and performance of different tests in the same population.

Underlying SDU. Because a physiological test may be applied to a variety of possibly nonuniform populations, it is important to assess the test in relation to its expected range of application. Subjects in a comparison study should be chosen to represent and to span the range rather than be randomly selected from particular populations of interest, and this range is characterized statistically by the SDU. The measured SD (SDB) will overestimate the underlying SDU due to the presence of within-subject variation, and it is important to adjust for this, using a standard formula, to yield an unbiased estimate of the SDU.

SDW. To relate simply to the between-subject variation, we must be able to assume a common within-subject variation for all of the subjects in the study. This property is called "homoscedasticity" and can be checked by simple plots of the data (see APPENDIX I). Lack of homoscedasticity can often be rectified by an appropriate numerical transformation of the test results.

For homoscedastic data, the common within-subject variance is simply the mean of the individual within-subject variances.

DR. As outlined above, the DR is defined as the ratio SDU/SDW. In a comparison study where k replicate measurements are performed in each subject, the measured SDB is calculated as the SD of the subject mean values (calculated from the k replicates). The standard mathematical adjustment to yield SDU is
$$\mathrm{SD}_\mathrm{U} = \sqrt{\mathrm{SD}_\mathrm{B}^{2} - \mathrm{SD}_\mathrm{W}^{2}/k}$$
so that the DR is calculated simply as
$$\mathrm{DR} = \frac{\sqrt{\mathrm{SD}_\mathrm{B}^{2} - \mathrm{SD}_\mathrm{W}^{2}/k}}{\mathrm{SD}_\mathrm{W}}$$
This result may also be obtained by an analysis of variance (ANOVA) approach, using a fixed effects model, and this is presented in APPENDIX I. The appropriate equation is then
$$\mathrm{DR} = \sqrt{\frac{\mathrm{MS}_\mathrm{B} - \mathrm{MS}_\mathrm{W}}{k \times \mathrm{MS}_\mathrm{W}}}$$
where MSB is the between-subject mean square, MSW is the within-subject mean square, and k is the number of replicate tests in each subject, as above.
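
As an illustration only, the two routes to the DR can be sketched in Python for a balanced design. The function name `discriminant_ratio` and the duplicate values below are hypothetical and are not taken from this study; the sketch assumes homoscedastic data with the same number of replicates, k, in every subject, so that MSB equals k times the variance of the subject means and MSW equals the mean of the individual within-subject variances.

```python
import math
from statistics import mean, variance


def discriminant_ratio(replicates):
    """Return (SD_U, SD_W, DR, DR_anova) for a balanced replicate design.

    `replicates` is a list with one entry per subject; each entry is a list
    of k >= 2 repeated results of the same test (hypothetical layout).
    """
    k = len(replicates[0])
    if k < 2 or any(len(r) != k for r in replicates):
        raise ValueError("each subject needs the same number (>= 2) of replicates")

    # Measured between-subject variance SD_B^2: variance of the subject means.
    var_b = variance([mean(r) for r in replicates])
    # Common within-subject variance SD_W^2: mean of the per-subject variances.
    var_w = mean(variance(r) for r in replicates)

    # Remove the within-subject contribution to obtain the underlying SD_U^2.
    var_u = var_b - var_w / k
    if var_u <= 0:
        raise ValueError("within-subject variation swamps between-subject variation")
    sd_u, sd_w = math.sqrt(var_u), math.sqrt(var_w)

    # ANOVA route: MS_B = k * SD_B^2 and MS_W = SD_W^2 in a balanced design.
    ms_b, ms_w = k * var_b, var_w
    dr_anova = math.sqrt((ms_b - ms_w) / (k * ms_w))
    return sd_u, sd_w, sd_u / sd_w, dr_anova


# Hypothetical duplicate results (k = 2) in five subjects.
example = [[4.0, 4.2], [5.0, 4.8], [6.0, 6.2], [7.0, 6.8], [9.0, 9.2]]
sd_u, sd_w, dr, dr_anova = discriminant_ratio(example)
print(f"SD_U={sd_u:.3f} SD_W={sd_w:.3f} DR={dr:.2f} DR(ANOVA)={dr_anova:.2f}")
```

Both routes return the same DR, since the ANOVA expression is an algebraic rearrangement of the SD-based one. Confidence limits and significance tests for the DR are not covered by this sketch.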

Assuming that the within-subject variation is normally distributed, we have used an analytical approach to derive confidence limits for estimated DR values and to test for the significance of differences between the DRs of different tests. We present these in APPENDIX I, where we also discuss our methodology in relation to alternative measures of test "reliability," in particular the intraclass correlation coefficient (ICC; see Refs. 5, 30).

Correlation between pairs of tests. Two tests designed to assess a complex physiological characteristic, such as degree of glycemic control, may use different methodologies, neither of which may perfectly represent the characteristic in question. The results of the tests may differ in systematic ways, independently from their random within-subject variation. The degree to which the tests measure the same characteristic may be assessed by the correlation between their results in a set of subjects. In the absence of within-subject variation, the degree to which their correlation falls short of unity would represent the extent to which the tests either fail to measure precisely the same characteristic or measure aspects of the underlying characteristic that are differentially influenced by other factors that vary between subjects. It is this underlying true correlation that we are interested in here.

Test imprecision, however, further attenuates the observed correlation so that imperfect correlation will usually be due to a combination of the systematic between-subject factors discussed above and the presence of within-subject variation. A study comparing tests must distinguish between these two components, and this can be achieved using a standard formula that corrects measured correlations for attenuation (11). Such a correction requires knowledge of both the within- and between-subject variation of the measured values. Because the DR incorporates both of these elements, the correction can be expressed in terms of the DR
$$\text{corrected correlation coefficient} = \frac{\text{measured correlation coefficient}}{\eta}$$
where the attenuation factor $\eta$, which cannot exceed 1, is
$$\eta = k \times \sqrt{\frac{\mathrm{DR}_1^{2}}{1 + k \times \mathrm{DR}_1^{2}}} \times \sqrt{\frac{\mathrm{DR}_2^{2}}{1 + k \times \mathrm{DR}_2^{2}}}$$
(see APPENDIX I). The corrected correlation coefficient, therefore, represents the degree to which the tests represent the same physiological characteristic, independent of test imprecision.
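
A minimal sketch of this correction follows, assuming correlations are calculated between subject means of k replicates. The function names are hypothetical, and the measured correlation of 0.85 is an arbitrary illustrative value. The sketch treats η as the attenuation factor: as defined above it is at most 1, so the measured coefficient is divided by it to recover the larger underlying correlation.

```python
import math


def attenuation_factor(dr1, dr2, k):
    """Factor by which within-subject variation shrinks the correlation
    between subject means of k replicates of two tests with DRs dr1 and dr2."""
    # k * DR^2 / (1 + k * DR^2) is the share of the variance of a subject
    # mean that reflects underlying between-subject differences.
    share1 = k * dr1 ** 2 / (1 + k * dr1 ** 2)
    share2 = k * dr2 ** 2 / (1 + k * dr2 ** 2)
    return math.sqrt(share1 * share2)


def corrected_correlation(r_measured, dr1, dr2, k):
    """Adjust a measured correlation coefficient for attenuation."""
    return r_measured / attenuation_factor(dr1, dr2, k)


# Illustration with the lowest and highest DRs reported here (2.6 and 4.2),
# duplicate tests (k = 2), and a hypothetical measured correlation of 0.85.
eta = attenuation_factor(2.6, 4.2, k=2)
print(f"eta = {eta:.3f}, corrected r = {corrected_correlation(0.85, 2.6, 4.2, 2):.3f}")
```

Because the DRs are themselves estimates, a corrected coefficient can occasionally exceed 1 by chance; the sketch deliberately does not truncate it.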

Unbiased estimation of a linear relationship. Where two tests represent the same characteristic, their results will often be found to be linearly related (although this may require numerical transformation). The determination of this relationship will be important in relating the results of one test to those of the other. Within-subject variability will lead to a noisy relationship, and a statistical approach is necessary to estimate the underlying linear equation. Although least-squares linear regression is often used for this, it is not appropriate here since it assumes that the explanatory variable is free from noise. When this is not the case, linear regression will underestimate the slope of the equation, a well-recognized effect termed regression dilution.
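
Regression dilution can be demonstrated with a small simulation. This is a sketch with arbitrary simulated values, not an analysis of the study data, and the function names are hypothetical. When the explanatory variable carries within-subject noise with SD `sd_w` around underlying values with SD `sd_u`, the least-squares slope is expected to shrink by the factor sd_u²/(sd_u² + sd_w²).

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_dilution(true_slope=2.0, sd_u=1.0, sd_w=1.0, n=20000, seed=0):
    """Fit y on a noisy version of x and return the diluted slope."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_u) for _ in range(n)]
    # The observed explanatory variable includes within-subject noise.
    x_obs = [t + rng.gauss(0.0, sd_w) for t in x_true]
    # The response depends on the underlying (noise-free) value.
    y = [true_slope * t for t in x_true]
    return ols_slope(x_obs, y)


expected = 2.0 * 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)  # dilution halves the slope here
print(f"fitted slope {simulate_dilution():.3f}, expected about {expected:.1f}")
```

With equal underlying and within-subject SDs the fitted slope is close to half the true slope; with `sd_w=0` the true slope is recovered.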

There is no perfect method of estimating the true relationship in these circumstances, but the method chosen here is that of least perpendicular distances corrected for scale differences, which has been shown to perform well in relation to other methods (27). The equations for this are presented in APPENDIX I.
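
For orientation only, a closely related estimator can be sketched as follows. This is the standard errors-in-variables (Deming-type) line, which minimizes perpendicular distances after scaling by the ratio of the error variances; it is offered as an assumption about the general form of the approach and may differ in detail from the equations in APPENDIX I. The function name, the data, and the use of the within-subject variances to set `error_var_ratio` are all hypothetical.

```python
import math


def line_of_equivalence(x, y, error_var_ratio):
    """Errors-in-variables line y = intercept + slope * x.

    error_var_ratio is the error variance of y divided by the error
    variance of x (assumed here to come from the within-subject variances).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; the slope is undefined")
    d = error_var_ratio
    # Root of the quadratic in the slope that takes the sign of the covariance.
    slope = (syy - d * sxx + math.sqrt((syy - d * sxx) ** 2 + 4 * d * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx


# Hypothetical subject means for two tests measured on different scales.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.2, 6.8, 9.1, 11.0]
slope, intercept = line_of_equivalence(x, y, error_var_ratio=1.0)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```

Unlike least-squares regression, this estimator is symmetric: exchanging the two tests (and inverting the variance ratio) yields the reciprocal slope, so the same line of equivalence is obtained whichever test is treated as x.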

Design of comparison studies. There is no single measure that can be used to compare physiological tests. Assessment and comparison of test discrimination and the determination of their underlying correlations and inter-relationships are equally important components. These all require simultaneous consideration of both between-subject and within-subject variation within the physiological range of interest. A comparison study should, ideally, assess both of these factors by using replicate tests in subjects chosen to span that range.

Experimental Protocols

Subjects. Thirty white Caucasian subjects were studied, consisting of 10 normoglycemic subjects, 9 subjects with IGT, and 11 with type II diabetes according to 1985 World Health Organization (WHO) definitions (33). All subjects were on a weight-maintaining diet and had not changed their medication for 4 wk before the tests. Subject characteristics are presented in Table 1 by glucose tolerance group.

                              
 
Table 1.   Subject characteristics

Protocols. Each subject was studied on four occasions within a 6-wk period. After a 12-h fast, subjects went to the hospital and sat on a bed for the duration of the tests. Tests were each performed on two occasions in the same subject and in random order.

FPG and CIG. Two cannulas were inserted in the same arm. One, for blood sampling, was placed at the wrist or on the dorsum of the hand, which was heated by an electrical blanket to "arterialize" the venous blood. The other cannula, for infusion of glucose, was placed in an antecubital vein. A blood sample was taken at time -10 min, and the plasma glucose concentration at this single time point was termed FPG1. Blood samples were also taken at times -5 and 0 min, and the mean of the plasma glucose at the three time points was termed FPG3. At time 0, a continuous 5 mg · kg ideal body wt⁻¹ · min⁻¹ infusion (22) of 10% glucose was started and continued for 2 h. One-hour CIG and 2-h CIG glucose were defined as the means of the three plasma glucose concentrations in blood samples taken at 50, 55, and 60 min and at 110, 115, and 120 min, respectively.

OGTT. A single cannula was placed in an antecubital vein for blood sampling. Fasting blood samples were taken at -10, -5, and 0 min. At 0 min, the subject consumed a 75-g glucose drink, and blood samples were taken at 30, 60, 90, and 120 min.

Biochemical Assays

Plasma glucose was determined by a hexokinase-based method (Boehringer Mannheim UK, Lewes, UK) on a centrifugal COBAS MIRA autoanalyzer (Roche, Welwyn Garden City, UK).


    RESULTS

Estimates of Glucose Tolerance

Values for glucose tolerance using FPG1, FPG3, 1-h CIG, 2-h CIG, and 2-h OGTT are presented in Table 2 as medians and ranges. Fasting and CIG measures were homoscedastic, and the within- and between-subject variations are illustrated as plots of difference (first test - second test) vs. mean (of the 2 tests) in Fig. 1. The 2-h OGTT was found to have within-subject variation increasing with mean values, and this was corrected by log transformation, with the transformed data presented as difference vs. mean plots in Fig. 2 and as medians and ranges for the whole group in Table 2. The underlying SDU and SDW and the DRs of the tests are also presented in Table 2.

                              
 
Table 2.   Within- and between-subject SDs and discriminant ratios of measures of glucose tolerance



 
Fig. 1.   First minus second test difference vs. mean plots for the mean of three 5-min samples of fasting plasma glucose (A) and 1-h (B) and 2-h (C) plasma glucose during a constant 5 mg · kg ideal body wt⁻¹ · min⁻¹ intravenous glucose infusion (CIG).


 
Fig. 2.   First minus second test difference vs. mean plots for 2-h oral glucose tolerance test (OGTT) for untransformed plasma glucose (A) and glucose logarithmically transformed to ensure homoscedasticity (B).

DR values for the five measures, together with the one SE range of the estimates, are illustrated in Fig. 3. Although the lowest DR was the FPG1 and the highest the 2-h CIG, there was no significant difference between them on the overall statistical test ($\chi^2_4 = 6.2$, P = 0.19, using Eq. 10 in APPENDIX I). Consideration was given to the exclusion of a subject whose inter-test difference in the 2-h CIG was four SDs from the mean of the rest of the group (see Fig. 1). In the absence of an identifiable reason for the large difference between his two test values, this subject was included in the analyses presented here, although, if he were excluded, the DR of the 2-h CIG would rise to 6.1, significantly greater than the DRs of the other tests.


 
Fig. 3.   Discriminant ratio values for fasting plasma glucose (single sample and mean of 3 samples), 1-h CIG, 2-h CIG, and 2-h OGTT for plasma glucose. Error bars represent 1 SE of the estimate.

Table 3 shows the Pearson correlations between the tests, both before and after adjustment for attenuation. The correlations were calculated between subject means of duplicate tests, to give the best estimate of the underlying relationships between tests. Adjusted correlations between the fasting and both of the intravenous measures were high, approaching one. Those between the 2-h OGTT results and the other tests were somewhat lower (~0.9), indicating that there was some biological discordance in their relationship, independent of their within-subject measurement error.

                              
 
Table 3.   Measured Pearson correlation coefficients between within-subject means of duplicate tests

Figure 4 shows the scattergram between the 2-h plasma OGTT (on a logarithmic scale) and the FPG. It also illustrates the line of equivalence, derived as explained above, and the linear regression line of the 2-h OGTT on FPG. The dilution effect of the within-subject variation on the regression line can be seen clearly. Table 4 gives coefficients for the unbiased linear equations relating the test values, and Table 5 gives the points on the various scales that are equivalent to the 1985 WHO OGTT thresholds for IGT and diabetes.


 
Fig. 4.   Scattergram of the mean of two 2-h OGTT tests (logarithmic scale) vs. the mean of two FPG3 tests showing the unbiased linear relationship (dotted line) and the linear regression relationship (dashed line).

                              
 
Table 4.   Gradient and intercept coefficients for lines of equivalence


                              
 
Table 5.   Equivalent values to common OGTT thresholds

SDW of the logarithm of the 2-h OGTT may be interpreted in relation to a standard interval in the range of glucose tolerance, for instance, the interval between the thresholds for IGT (7.8 mmol/l) and for diabetes (11.1 mmol/l). This interval is 0.153 on a logarithmic scale (base 10), whereas the SDW is 0.060. The difference between two individual measurements at either end of this interval would not be significant at the 5% level (P = 0.07), given the imprecision of the 2-h OGTT. In other words, it would not be possible to confidently distinguish between two such individual values. The same applies to all other tests examined, using the unbiased equivalent values to these thresholds given in Table 5, apart from the 2-h CIG plasma glucose, for which the two measurements at opposite ends of the equivalent interval would differ significantly (P = 0.013).
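
The arithmetic behind this comparison can be checked with a short sketch. It assumes that the two individual measurements are independent, that each has the common within-subject SD on the log10 scale, and that a two-sided normal approximation is used; the text does not state which test was applied, so this is an assumption that happens to reproduce the quoted P value. The function name is hypothetical.

```python
import math


def p_for_two_single_measurements(lower, upper, sd_w_log10):
    """Two-sided P for the difference between two single measurements
    lying at `lower` and `upper`, given the within-subject SD of log10 values."""
    interval = math.log10(upper) - math.log10(lower)
    # The difference of two independent measurements has SD = SD_W * sqrt(2).
    sd_diff = sd_w_log10 * math.sqrt(2)
    z = interval / sd_diff
    # Two-sided tail probability of a standard normal deviate.
    p = math.erfc(z / math.sqrt(2))
    return interval, z, p


interval, z, p = p_for_two_single_measurements(7.8, 11.1, 0.060)
print(f"interval = {interval:.3f}, z = {z:.2f}, P = {p:.2f}")
```

Under these assumptions the interval is 0.153 log10 units, about 1.8 SDs of the difference between two single measurements, which falls just short of the 1.96 needed for significance at the 5% level.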


    DISCUSSION

This study has shown that different methods of assessing glucose tolerance were broadly comparable in a range of subjects spanning normal glucose tolerance, IGT, and type II diabetes. This assessment involved the following three separate components: 1) comparison of the discrimination of the tests, i.e., their ability to distinguish between different subjects, 2) determination of the degree to which different tests measure the same underlying physiological property, and 3) estimation of the underlying relationship between the test results. The assessment of the within-subject imprecision of each test is a fundamental requirement for this evaluation, so that comparison studies must involve at least duplicate measurements in all subjects in order to determine both the between- and within-subject measurement variation of each test in the same group of individuals. This is best done in subjects who represent a clinically meaningful range of glucose tolerance. To ascribe a single value to within-subject imprecision requires homoscedasticity, and numerical transformation of results may be necessary to achieve this, as illustrated by the 2-h plasma glucose from the OGTT. A measure of imprecision is important for the assessment of changes within subjects, such as over time or after interventions (10).

The determination of imprecision alone is not adequate, however, for the assessment of the practical value of a test. The standard methods of assessing imprecision, including the coefficient of variation, have little meaning on their own, without reference to the range of measurements to which they are being applied. The ability to distinguish between individuals within this range is here termed the discrimination of the test and is assessed using the DR.

The concept of discrimination should be distinguished from the ability of a test to categorize patients by an external gold-standard dichotomy. This is a particular concern in the field of clinical chemistry, for instance, when a biochemical test is being assessed for its ability to detect the presence of a malignancy. Receiver operating characteristics (34) have been used for this purpose. However, they are not suitable for the assessment of test results as a continuous scale of measurement or for the comparison of tests without reference to an external categorization. In the field of glucose tolerance, categories have been defined on the basis of thresholds in a continuous scale of measurement for the OGTT based on external criteria, but this is a notoriously imprecise test (7), and it would not therefore be appropriate to assess possible alternative tests using a categorical approach based on these thresholds. The use of the DR provides a means of comparing how well the subjects studied can be reliably distinguished by different tests, which is an important component of a comprehensive comparison of imprecise tests. For instance, in many research studies using continuous variables, the statistical power to distinguish between groups of subjects or to determine correlations between variables will depend on discrimination.

A very similar concept, referred to as reliability, has been assessed, particularly in the psychological literature, using the ICC. This relates the covariance of replicate test results to the combined between-subject and within-subject variance and is algebraically related to the DR. However, in the context of discriminating between different subjects, it is not as easy to grasp as the DR, which has a direct intuitive relevance when considering a test's application. Furthermore, it is not as easy to derive statistical tests comparing different ICC values. A recently described method for the comparison of two ICCs does not extend to the comparison of more than two tests and was only validated for studies employing repeated tests in 100 subjects or more (2).

The choice of a methodology for comparing different tests is intimately related to its theoretical basis and in particular to the availability of statistical criteria for assessing such a comparison. Much of the theoretical discussion of the ICC and treatments of comparative tests between ICCs have been based on random effects models in which the individuals in whom the tests are being performed are assumed to be drawn at random from a known normally distributed population. This presupposes a particular population in which the tests are being applied. Complex populations (for instance those consisting of subgroups) would require appropriately complex statistical models. Unfortunately, even the simplest random effects models are hard to treat analytically, as in the case of the ICC quoted above. Our approach has been to concentrate on evaluating tests across a particular physiological spectrum of interest. In the example presented here, for instance, glucose tolerance represents a physiological and pathological unity irrespective of its distribution in particular populations. For this purpose, we take the view that it is appropriate to perform a comparison study in subjects selected to span the range of interest, which may be analyzed by a fixed-effects statistical model. Such an analysis is presented here for the DR, allowing the derivation of straightforward expressions for the SD and confidence intervals of the DR and the evaluation of the statistical significance of differences between the DRs of different tests.

Although the comparison of the DRs of different tests in the same study is valid, the DR in itself is not a universal characteristic of a test, as it will depend on the choice of subjects on which the comparison is performed. When the subjects cover a greater range of values, the DR will be larger, and vice versa, and the DR calculated when subjects are selected to span a range of interest will generally be larger than if subjects had been chosen randomly from the same population. However, it is a fundamental property when it comes to a test's practical application, and, unlike imprecision, it can be used as a basis for comparison between tests. The DRs of the five tests examined here were not significantly different, in spite of the increased complexity and expense of the CIG and the OGTT.

Two perfectly precise tests assessing the same physiological variable would be perfectly correlated. Departures from perfect correlation can be due to the following two factors: 1) underlying differences not directly related to the variable of interest, which manifest themselves as systematic differences between subjects. For instance, in the assessment of glycemic control, which is determined by the FPG and OGTT in qualitatively different ways, the underlying correlation may fall short of unity because factors such as gastric emptying and intestinal incretin effects differ between the two methodological approaches; and 2) within-subject variation, which diminishes the measured correlation. This is a well-described statistical effect termed "attenuation," which may be adjusted for by standard techniques to enable the estimation of the underlying correlation (11). This adjustment depends on both test imprecision and the degree of variation between subjects and so can be expressed in terms of the test DRs.

Adjustment for attenuation will establish the degree to which the underlying correlation differs from unity due to the systematic between-subject differences described under 1) above.

In this paper, the adjusted correlation coefficients between the fasting glucose and the CIG approached unity and were only slightly lower between these tests and the 2-h OGTT. Although additional factors unrelated to the homeostatic control of plasma glucose, such as variable gastric emptying, would contribute to this, the relatively high overall intercorrelations and the simplicity and cheapness of the FPG would recommend this as the measure of choice for the assessment of glucose tolerance.

When two tests have an underlying structural relationship between their measurements (or transformations of these) that is linear, it can be instructive to determine the equation of the "line of equivalence." Linear regression, although often used, is unsatisfactory since it assumes perfect precision in the independent variable and is subject to regression dilution. There have been many approaches to deriving an unbiased estimation, as comprehensively reviewed by Riggs et al. (27), and the "weighted least squared perpendicular distance" approach (Riggs' "PW" method) has been used in this paper. In the present study, the assumption of linearity, within the limits set by the imprecision of the tests, was supported by visual inspection of plots of the relationships (data not shown). The FPG threshold concentrations recommended by the American Diabetes Association (4), based on studies of the prevalence of retinopathy in three distinct populations, were confirmed as equivalent to the established OGTT thresholds for IGT and diabetes.

We also calculated the SD of $\log_e(\mathrm{DR})$ estimates for different numbers of replicate tests, using the Taylor series expansion, subject to a constraint on the total number of tests performed. For a study comparing two methods using a total of 60 tests, the power to detect a difference between DRs of 2.5 and 4.0 is 58% using 30 subjects and two replicate tests, rising to 72% with 20 subjects and three tests, a relative increase in power of 24%. Further increasing the number of replicates gives smaller increases in power, e.g., for four tests in 15 subjects the power is 78%, with even smaller gains for more than four replicates. There would appear to be some advantage in using three, rather than two, replicate tests in each subject, but there is little advantage in increasing the number beyond three given the need to obtain sufficient subjects to adequately cover the range of interest.
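The power figures above can be approximated with a short calculation. The following Python sketch is an illustration only, not necessarily the calculation we performed: it assumes that each method receives 60 tests, that the two $\log_e(\mathrm{DR})$ estimates are independent and approximately normal, and that the comparison is a two-sided test at the 5% level. The function names (`var_log_dr`, `power_two_drs`) are hypothetical; the variance formula is Eq. 17 of APPENDIX III.

```python
import math

def var_log_dr(n, k, dr):
    """First-order (Taylor series) variance of log_e(DR), Eq. 17 of APPENDIX III."""
    lam = (n - 1) * k * dr ** 2  # noncentrality parameter
    num = n ** 2 * (k - 1) ** 2 * (
        lam ** 2 + 2 * lam * (n * k - 3) + (n - 1) * (n * k - 3)
    )
    den = 2 * (n * k - n - 4) * (n * (k - 1) * lam + 2 * (n - 1)) ** 2
    return num / den

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_drs(n, k, dr_a, dr_b, z_crit=1.959964):
    """Power of an assumed two-sided z-test on log_e(DR_a) - log_e(DR_b)."""
    se = math.sqrt(var_log_dr(n, k, dr_a) + var_log_dr(n, k, dr_b))
    z = abs(math.log(dr_a) - math.log(dr_b)) / se
    return normal_cdf(z - z_crit) + normal_cdf(-z - z_crit)

# Designs with n * k = 60 tests per method
for n, k in [(30, 2), (20, 3), (15, 4)]:
    print(n, k, round(power_two_drs(n, k, 2.5, 4.0), 2))
```

Under these assumptions the three designs give powers of roughly 0.58, 0.72, and 0.78, in line with the figures quoted above.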

This study showed that, with OGTTs performed under carefully controlled conditions, with a reproducibility that is somewhat better than reported elsewhere, it was not possible to distinguish between the WHO thresholds for IGT and diabetes at a 5% significance level, and this was also the case for FPG, even when the mean of three samples at 5-min intervals was assessed (the between-sample variation being relatively small in relation to the between-day variation). It is therefore not surprising, with two thresholds close together, that repeat measurements often result in a change of status. Improved classification could be achieved by taking the mean of determinations on more than one day. However, although these classifications may be useful for epidemiological purposes, for practical purposes the actual OGTT or FPG value is more informative (31, 32).

In summary, we have outlined a comprehensive but simple methodology for the comparison of imprecise tests, encouraging 1) comparison of test discrimination, expressed as the DR, 2) the evaluation of the degree of agreement between tests based on correlation coefficients adjusted for attenuation, and 3) in the case of a linear relationship between test results (or their mathematical transformations), the use of an unbiased method for estimating the underlying equation. For such a comparison study, it is important to determine the within-subject variation of each test as well as the variation between subjects. Application of these methods to various tests of glucose tolerance demonstrated similar discrimination, acceptable agreement, and an unbiased estimation of the FPG values equivalent to those of the 2-h OGTT. The latter agree closely with the outcome-derived thresholds currently being recommended by the American Diabetes Association. However, because the thresholds for IGT and diabetes are within measurement error and cannot be reliably distinguished, the absolute 2-h OGTT or FPG is more informative than the categorization.


    APPENDIX I. STATISTICAL METHODS

This section presents a more detailed mathematical treatment of the concepts outlined in METHODS.

Discrimination Between Subjects

Statistical model. We consider the comparison of different tests, each measuring the same physiological variable. Each test is performed k times on each of n subjects, with the order of the tests being randomized for each subject.

Considering first a single test in isolation, an appropriate model is
$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k \tag{1}$$
where $X_{ij}$ is the result of the test performed for the $j$th time on the $i$th subject, $\mu$ is the overall mean value of the variable in question on the scale of the current test, and $\alpha_i$ is the true value of the $i$th subject, measured as a deviation from the mean (thus $\sum_{i=1}^{n} \alpha_i = 0$); $\varepsilon_{ij}$ represents day-to-day variation, which includes both biological and assay variation; the $\varepsilon_{ij}$ are assumed to be independent, normally distributed random variables with mean zero and variance $\sigma^2$. Equation 1 is the standard one-way ANOVA model.

The assumption of constant variance (or homoscedasticity) of the error term, $\sigma^2$, can be checked graphically. If $k \ge 5$, we can calculate the quartiles of the $k$ replicate test results for each subject and plot log(interquartile range) against log(median) (14). If $2 < k < 5$, plot the SD of the $k$ replicates against the mean for each subject (23), and if $k = 2$ plot the differences (1st − 2nd replicate) between the pairs of tests against the subject means (6). If the assumption of homoscedasticity holds, the plotted measure of variation [log(interquartile range), SD, or difference] should be approximately constant across the range of subjects. If there appears to be a systematic relationship between the measure of variation and subject medians or means, this can often be removed by mathematically transforming the results $X_{ij}$. A common case in physiological measurements is where the SD increases in direct proportion to the mean, when a log transformation of the $X_{ij}$ stabilizes the variance and $\log(X_{ij})$ can then be used in place of $X_{ij}$ in Eq. 1. Other transformations can be considered for different relationships between the subject SDs and means (14, 23).
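As an illustration, the quantities to be plotted for this check could be computed as in the following Python sketch. The function name `spread_vs_level` and the quartile convention (the "inclusive" method of the standard library) are assumptions made for the example; other quartile definitions would serve equally well.

```python
import math
import statistics

def spread_vs_level(replicates):
    """Return one (level, spread) pair per subject for a homoscedasticity plot.

    replicates: list of per-subject lists holding the k replicate results.
    """
    points = []
    for x in replicates:
        k = len(x)
        if k >= 5:
            # log(interquartile range) against log(median);
            # fails if the interquartile range is zero
            q1, med, q3 = statistics.quantiles(x, n=4, method="inclusive")
            points.append((math.log(med), math.log(q3 - q1)))
        elif k > 2:
            # SD of the replicates against the subject mean
            points.append((statistics.mean(x), statistics.stdev(x)))
        else:
            # difference (1st - 2nd replicate) against the subject mean
            points.append((statistics.mean(x), x[0] - x[1]))
    return points

# Hypothetical data: one subject with two replicates, one with three
print(spread_vs_level([[5.0, 5.4], [7.0, 8.0, 9.0]]))
```

Plotting the second element of each pair against the first should show no systematic trend if the variance is constant.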

It is also possible to check the assumption that the $\varepsilon_{ij}$ have a normal distribution by plotting the ordered residuals from fitting Eq. 1 against standard normal deviates in a "normal probability plot" (3). However, the ANOVA procedures used here are fairly robust to moderate departures from normal distribution and can be used without such sophisticated checking, provided homoscedasticity of variance holds and the data do not exhibit marked skewness.

In these experiments, subjects are selected to span a range of glucose tolerance and are not chosen randomly from a prespecified population. The subject effects $\alpha_i$ are therefore considered as "fixed" rather than "random" effects.

DR. As a measure of discrimination between subjects, we define the true DR, $\Delta$, as the ratio of the underlying between-subject SD ($\mathrm{SD_B}$) to the within-subject SD ($\mathrm{SD_W}$)
$$\Delta = \frac{\sqrt{\sum_{i=1}^{n} \alpha_i^2 / (n - 1)}}{\sigma} \tag{2}$$
Unbiased estimates of the between- and within-subject variances are given by $(\mathrm{MS_B} - \mathrm{MS_W})/k$ and $\mathrm{MS_W}$, respectively, where $\mathrm{MS_B}$ and $\mathrm{MS_W}$ are the between- and within-subject mean squares from a standard one-way ANOVA, i.e.
$$\mathrm{MS_B} = k \times \sum_{i=1}^{n} (M_i - M)^2 / (n - 1) \tag{3}$$
$$\mathrm{MS_W} = \sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij} - M_i)^2 / [n \times (k - 1)] \tag{4}$$
and $M_i = \sum_{j=1}^{k} X_{ij}/k$ and $M = \sum_{i=1}^{n} M_i/n$ are the subject and overall means.

We then estimate $\Delta$ empirically as the ratio of the between- to within-subject standard deviations
$$\mathrm{DR} = \sqrt{(\mathrm{MS_B} - \mathrm{MS_W})/(k \times \mathrm{MS_W})} \tag{5}$$
The DR is algebraically related to the ICC, which is commonly used as a measure of the reliability of tests (5, 30)
$$\mathrm{DR} = \sqrt{\mathrm{ICC}/(1 - \mathrm{ICC})} \tag{6}$$
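The calculation of the DR from replicate data (Eqs. 3-6) can be sketched in a few lines of Python. This is a minimal illustration that assumes a balanced design (the same number of replicates for every subject); the function name `discriminant_ratio` and the example data are hypothetical.

```python
import math

def discriminant_ratio(data):
    """Estimate the DR from a balanced design.

    data: list of n subjects, each a list of k replicate results.
    Returns (MS_B, MS_W, DR, ICC).
    """
    n, k = len(data), len(data[0])
    subject_means = [sum(row) / k for row in data]
    grand_mean = sum(subject_means) / n
    # Between-subject mean square, Eq. 3
    ms_b = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square, Eq. 4
    ms_w = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (k - 1))
    if ms_b <= ms_w:
        raise ValueError("MS_B <= MS_W: the DR cannot be estimated")
    dr = math.sqrt((ms_b - ms_w) / (k * ms_w))  # Eq. 5
    icc = dr ** 2 / (1.0 + dr ** 2)             # Eq. 6 rearranged for the ICC
    return ms_b, ms_w, dr, icc

# Hypothetical duplicate results (mmol/l) for three subjects
ms_b, ms_w, dr, icc = discriminant_ratio([[4.0, 4.2], [6.0, 5.8], [8.0, 8.2]])
print(ms_b, ms_w, dr, icc)
```

The guard on MS_B <= MS_W reflects the fact that Eq. 5 has no real solution when the between-subject mean square does not exceed the within-subject mean square.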

However, the methodology developed for ICCs is in the context of a random effects model, rather than the fixed effects model used here, so published results for SDs and confidence intervals cannot be used. The DR gives a measure that is intuitively closer to the idea of discrimination between subjects, whereas the ICC is a measure of correlation. In addition, for tests with good discrimination, ICC values tend to cluster unhelpfully close to their upper limit of one. Furthermore, there is no simple practicable test available for the comparison of ICCs from different tests in a random effects model. A recently described method for the comparison of two ICCs does not extend to the comparison of more than two tests and was only validated for studies employing repeated tests in 100 subjects or more (2). We have derived straightforward expressions for the SD and confidence intervals of the DR in a fixed effects model and a test for the equivalence of DRs in a comparison study.

Confidence limits for DR. Confidence limits for the DR can be found by noting that
$$\mathrm{DR} = \sqrt{(F_0 - 1)/k} \tag{7}$$
where $F_0 = \mathrm{MS_B}/\mathrm{MS_W}$ is the standard $F$ statistic from the one-way ANOVA. $F_0$ has a noncentral $F$ distribution with degrees of freedom $\nu_1 = n - 1$ and $\nu_2 = n \times (k - 1)$ and noncentrality parameter
$$\lambda = k \times \sum_{i=1}^{n} \alpha_i^2 / \sigma^2 = (n - 1) \times k \times \Delta^2$$
$\lambda$ can be estimated by $(n - 1) \times k \times \mathrm{DR}^2$, and a 95% confidence interval for $\Delta$ is
$$\sqrt{(F_\mathrm{L} - 1)/k} \le \Delta \le \sqrt{(F_\mathrm{U} - 1)/k} \tag{8}$$
where $F_\mathrm{L}$ and $F_\mathrm{U}$ are the lower and upper 2.5% points of the noncentral $F$ distribution (17).

Noncentral $F$ tables are not widely available (18), and a reliable approximation to $F_\mathrm{L}$ and $F_\mathrm{U}$ can be made using a central $F$ distribution (25)
$$F_\mathrm{L} \approx (1 + k \times \mathrm{DR}^2) \times F'_\mathrm{L} \quad \text{and} \quad F_\mathrm{U} \approx (1 + k \times \mathrm{DR}^2) \times F'_\mathrm{U}$$
where $F'_\mathrm{L}$ and $F'_\mathrm{U}$ are the lower and upper 2.5% points of a central $F_{\nu,\nu_2}$ distribution and
$$\nu = (n - 1) \times (1 + k \times \mathrm{DR}^2)^2 / (1 + 2 \times k \times \mathrm{DR}^2)$$
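The approximate confidence limits can be computed without statistical tables. The Python sketch below is one possible implementation, offered as an illustration: it obtains central $F$ percentage points from the regularized incomplete beta function (evaluated by a continued fraction) and bisection, which is a choice made for this example and not part of the method itself. All function names are hypothetical, and a statistical package's central $F$ quantile function could be substituted for `f_quantile`.

```python
import math

def _betacf(a, b, x, eps=1e-12, max_iter=500):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = c * d
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_cdf(f, d1, d2):
    """Cumulative distribution function of the central F(d1, d2) distribution."""
    return _reg_inc_beta(d1 / 2.0, d2 / 2.0, d1 * f / (d1 * f + d2))

def f_quantile(p, d1, d2):
    """Central F percentage point found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, d1, d2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dr_confidence_limits(dr, n, k, level=0.95):
    """Approximate confidence limits for the true DR (Eq. 8)."""
    tail = (1.0 - level) / 2.0
    scale = 1.0 + k * dr ** 2
    nu = (n - 1) * scale ** 2 / (1.0 + 2.0 * k * dr ** 2)
    nu2 = n * (k - 1)
    limits = []
    for p in (tail, 1.0 - tail):
        f_nc = scale * f_quantile(p, nu, nu2)  # approximate noncentral F point
        limits.append(math.sqrt(max(f_nc - 1.0, 0.0) / k))
    return tuple(limits)

# Hypothetical example: DR of 3.0 from 30 subjects tested in duplicate
print(dr_confidence_limits(3.0, 30, 2))
```

The lower limit is set to zero if the approximate lower $F$ point falls below 1, where Eq. 8 has no real solution.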
Comparison of DRs. We derived a test for the equivalence of several DRs by assuming the following model
$$X_{ijh} = \mu_h + \alpha_{ih} + \varepsilon_{ijh} \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, k, \text{ and } h = 1, \ldots, r \tag{9}$$
$X_{ijh}$ is now the result of the $h$th test performed for the $j$th time on the $i$th subject, $\mu_h$ is the mean value of the variable in question on the scale of test $h$, and $\alpha_{ih}$ is the true value of the $i$th subject measured using test $h$ ($\sum_{i=1}^{n} \alpha_{ih} = 0$ for each test $h$); $\varepsilon_{ijh}$ are again assumed to be independent, normally distributed random variables with zero mean and variance $\sigma_h^2$.

Using this model, the DRs for each test are statistically independent. We used simulations (see APPENDIX II) to show that the distribution of $\log_e(\mathrm{DR})$ is approximately normal if the model assumptions hold. We then used Cochran's theorem (19) to show that the statistic $Q$ has a $\chi^2$ distribution with $r - 1$ degrees of freedom, where
$$Q = \sum_{h=1}^{r} w_h \times (L_h - L)^2, \quad L_h = \log_e(\text{DR of test } h), \quad L = \frac{\sum_{h=1}^{r} w_h \times L_h}{\sum_{h=1}^{r} w_h} \tag{10}$$
and $w_h = 1/s_h^2$, where $s_h$ is the standard deviation of $L_h$.

The DRs are unequal at a significance level of 0.05 if $Q$ exceeds the 95th percentile of the $\chi^2_{r-1}$ distribution.

We derived an expression for $s_h$, the estimated SD of $L_h$, from the mean and variance of the noncentral $F$ distribution using a Taylor series expansion; details are given in APPENDIX III.
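A minimal Python sketch of the test follows, assuming that every test is performed $k$ times on the same $n$ subjects and that $s_h$ is taken from Eq. 17 of APPENDIX III with $\Delta$ replaced by the observed DR. The function names and the DR values in the example are hypothetical; the tabulated values are the standard 95th percentiles of the $\chi^2$ distribution.

```python
import math

# 95th percentiles of the chi-square distribution for 1-4 degrees of freedom
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def sd_log_dr(n, k, dr):
    """Taylor series SD of log_e(DR), Eq. 17 of APPENDIX III."""
    lam = (n - 1) * k * dr ** 2
    num = n ** 2 * (k - 1) ** 2 * (
        lam ** 2 + 2 * lam * (n * k - 3) + (n - 1) * (n * k - 3)
    )
    den = 2 * (n * k - n - 4) * (n * (k - 1) * lam + 2 * (n - 1)) ** 2
    return math.sqrt(num / den)

def q_statistic(drs, n, k):
    """Return (Q, degrees of freedom) for a list of observed DRs (Eq. 10)."""
    logs = [math.log(d) for d in drs]
    weights = [1.0 / sd_log_dr(n, k, d) ** 2 for d in drs]
    l_bar = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    q = sum(w * (l - l_bar) ** 2 for w, l in zip(weights, logs))
    return q, len(drs) - 1

# Hypothetical example: two tests, 30 subjects, duplicate measurements
q, df = q_statistic([2.5, 4.0], n=30, k=2)
print(q, df, q > CHI2_95[df])
```

For two tests, $Q$ reduces to the squared difference of the two $L_h$ divided by the sum of their variances.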

Alternative models. The models we have used, given by Eqs. 1 and 9, for observations from the comparison study have been deliberately chosen for their relative simplicity. Although some of the algebra is intricate, all of the calculations we have presented can be easily implemented using spreadsheet software and do not require the use of specialized statistical packages. However, some of our model assumptions do warrant further discussion.

First, the choice of a fixed rather than a random effects model is unusual in this kind of context. However, subject selection in our study is clearly nonrandom in that we have deliberately chosen roughly equal numbers of normal glucose tolerance, IGT, and diabetic subjects. Even within each of these subpopulations, sampling is unlikely to be random as subjects are sought to span the range of interest as evenly as possible, which is likely to result in oversampling from the extremes of the distribution. Such a sampling scheme is likely to produce a DR that is higher than that which would be obtained from a random sample from the population, and its use is restricted to comparison with other tests in the same study. It is not appropriate to formally compare test DRs that have been derived from different populations.

A population consisting of clearly defined subpopulations might be best treated using a mixed model, or even structural equation modelling. This would require more sophisticated analytical techniques and a larger scale of comparison study than we have presented in this paper. In the particular example presented here, however, the "subgroups" are not clearly separable but are arbitrarily defined by thresholds in a continuous spectrum. In this situation, which is relatively common in physiology, the approach taken here would be adequate, relatively simple, and practical. A formal comparison of the use of more complex models and the simple approach made here is beyond the scope of this paper.

The assumption of independence of the error terms $\varepsilon_{ijh}$ is unlikely to be completely true since most biological measurements exhibit some degree of "autocorrelation," i.e., correlation between successive measurements made on the same subject. In the context of these studies, where repeat measurements are almost always made on different days and often several days or even weeks apart, the magnitude of such autocorrelation is likely to be small compared with the total within-subject variation in which we are interested. Furthermore, accurately estimating autocorrelation coefficients would be difficult in relatively small studies, and the degree of mathematical complexity would increase such that specialized statistical methodology and software would be needed, rendering the procedures inaccessible to many researchers. However, our methodology might not be appropriate where repeat measurements were made within the same day or where other biological reasons existed for suspecting nonnegligible autocorrelation.

Correlation Between Pairs of Tests

The nature of the relationship between a pair of tests can be examined graphically by plotting the subject means for the first test against those for the second. In many cases, particularly after transformations to ensure homoscedasticity, the relationship will be approximately linear, and the degree of correlation can be assessed using the Pearson product-moment correlation coefficient, r (3).

In the model of Eq. 9 with $r = 2$ tests, we are interested in the correlation between the underlying subject means $\alpha_{i1}$ and $\alpha_{i2}$. However, in the presence of within-subject variation, the sample correlation coefficient, i.e., the correlation between the two sets of observed subject means, underestimates the true correlation between the tests. This effect is known as attenuation and means that, even if the true subject means $\alpha_{i1}$ and $\alpha_{i2}$ were perfectly correlated, the correlation between the observed subject means would be less than unity because of the random fluctuations due to within-subject measurement variation.

Standard results from measurement error theory (11) show that the correlation between two measurements, both of which are subject to error, is attenuated by the factor
$$\eta = \sqrt{\kappa_1 \times \kappa_2}$$
where $\kappa_1$ and $\kappa_2$ are the reliability coefficients of the two tests. From Eq. 6
$$\kappa = \mathrm{DRM}^2/(1 + \mathrm{DRM}^2)$$
where DRM is the DR of the means $M_i$, $i = 1, \ldots, n$, of the $k$ replicate measurements on each subject, rather than of the individual measurements themselves.

Taking the mean of Eq. 1 over the $k$ replicates yields
$$M_i = \mu + \alpha_i + \xi_i \quad \text{for } i = 1, \ldots, n$$
where $\xi_i$ is normally distributed with mean of zero and variance $\sigma^2/k$. Thus, in Eq. 2, $\sigma$ must be replaced by $\sigma/\sqrt{k}$, which we estimate by $\sqrt{\mathrm{MS_W}/k}$, and from Eq. 5
$$\mathrm{DRM} = \mathrm{DR} \times \sqrt{k} \tag{11}$$
Thus
$$\eta = \sqrt{[\mathrm{DRM}_1^2/(1 + \mathrm{DRM}_1^2)] \times [\mathrm{DRM}_2^2/(1 + \mathrm{DRM}_2^2)]}$$
The Pearson correlation coefficient $r$ can be adjusted for attenuation by dividing it by $\eta$, i.e.
$$r_\mathrm{adj} = r/\eta$$
where $r_\mathrm{adj}$ is the adjusted $r$.
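The adjustment amounts to a few lines of arithmetic, sketched below in Python. The function name `adjusted_correlation` and the numerical values in the example are hypothetical, and the sketch assumes the same number of replicates $k$ for both tests.

```python
import math

def adjusted_correlation(r, dr1, dr2, k):
    """Adjust a Pearson r between subject means for attenuation.

    r: observed correlation between the two sets of subject means.
    dr1, dr2: DRs of the two tests; k: number of replicates per subject.
    Returns (adjusted r, attenuation factor eta).
    """
    def reliability(dr):
        drm_sq = k * dr ** 2          # DRM^2, from Eq. 11
        return drm_sq / (1.0 + drm_sq)

    eta = math.sqrt(reliability(dr1) * reliability(dr2))
    return r / eta, eta

# Hypothetical example: observed r = 0.90, DRs of 3.0 and 4.0, duplicate tests
r_adj, eta = adjusted_correlation(0.90, 3.0, 4.0, k=2)
print(round(r_adj, 3), round(eta, 3))
```

Because $r$ and the DRs are all estimates, the adjusted coefficient can exceed unity through sampling error.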

In cases where the relationship between the tests is clearly nonlinear, the Spearman rank correlation coefficient $r_s$ should be used in place of $r$ to assess the comparability of the tests. However, there is no universal formula for the attenuation of $r_s$ in the presence of measurement error.

Unbiased Estimation of Linear Relationship

In the case where the relationship between a pair of tests is linear, it may be useful to obtain unbiased estimates of the gradient and intercept of the line. Linear regression gives biased estimates because it only considers errors in the dependent variable, and clearly both tests here are subject to error; the gradient is always underestimated, and regression of subject means from test 1 on those from test 2 clearly gives a different relationship to that of test 2 on test 1.

The method that we have chosen to estimate the linear relationship between the subject mean measurements from test 1 and those from test 2 is that of "perpendicular least squares, properly weighted." This essentially minimizes the sum of the squared perpendicular distances between the observed data and the fitted line, but with an adjustment that makes the method invariant to linear transformations of the measurement scales. If $M_{i1}$ and $M_{i2}$, $i = 1, \ldots, n$, are the subject means (over the $k$ replicate tests), then the estimated gradient is
$$b = \frac{S_{22} - \theta \times S_{11} + \sqrt{(S_{22} - \theta \times S_{11})^2 + 4 \times \theta \times S_{12}^2}}{2 \times S_{12}}$$
where $S_{11} = \sum_{i=1}^{n} (M_{i1} - M_1)^2$; $S_{22} = \sum_{i=1}^{n} (M_{i2} - M_2)^2$; and $S_{12} = \sum_{i=1}^{n} (M_{i1} - M_1) \times (M_{i2} - M_2)$. $M_1$ and $M_2$ are the overall means for each test, i.e., $M_h = \sum_{i=1}^{n} M_{ih}/n$, $h = 1, 2$, and $\theta = \sigma_2^2/\sigma_1^2$.

The $\sigma_1^2$ and $\sigma_2^2$ are estimated from their respective $\mathrm{MS_W}$, so that we estimate $\theta$ by
$$\mathrm{MS_W}(\text{test 2})/\mathrm{MS_W}(\text{test 1})$$
The intercept is then estimated as
$$a = M_2 - b \times M_1$$
This method is described and contrasted with other methods by Riggs et al. (27), where it is shown to perform well under a range of values of $\theta$ when the correlation between the $M_{i1}$ and $M_{i2}$ is fairly high (above ~0.5) and $\theta$ is estimated fairly precisely. Such conditions are likely to apply in these experiments: the $M_{i1}$ and $M_{i2}$ are measuring the same underlying physiological variable so the correlation will be high, and $\sigma_1$ and $\sigma_2$, and hence $\theta$, are directly estimated from the repeat measurements using each test.
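The gradient and intercept formulas above translate directly into code. The following Python sketch is an illustration; the function name `line_of_equivalence` and the example values are hypothetical, and the sketch assumes $S_{12} \ne 0$.

```python
import math

def line_of_equivalence(m1, m2, msw1, msw2):
    """Fit test 2 = a + b * test 1 from subject means.

    m1, m2: subject means for tests 1 and 2 (same subjects, same order).
    msw1, msw2: within-subject mean squares of the two tests.
    Returns (intercept a, gradient b).
    """
    n = len(m1)
    mean1, mean2 = sum(m1) / n, sum(m2) / n
    s11 = sum((x - mean1) ** 2 for x in m1)
    s22 = sum((y - mean2) ** 2 for y in m2)
    s12 = sum((x - mean1) * (y - mean2) for x, y in zip(m1, m2))
    theta = msw2 / msw1                     # estimated ratio of error variances
    diff = s22 - theta * s11
    b = (diff + math.sqrt(diff ** 2 + 4.0 * theta * s12 ** 2)) / (2.0 * s12)
    a = mean2 - b * mean1
    return a, b

# Hypothetical check: exactly collinear means recover the generating line
m1 = [4.0, 5.0, 6.0, 7.0, 8.0]
m2 = [2.0 + 3.0 * x for x in m1]
print(line_of_equivalence(m1, m2, msw1=0.1, msw2=0.4))
```

When the subject means are exactly collinear, the fitted line does not depend on $\theta$, which provides a simple check on an implementation.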


    APPENDIX II. SIMULATIONS

We used simulations to examine the distribution of the DR and $\log_e(\mathrm{DR})$ and to check the accuracy of the Taylor series formula for the SD of $\log_e(\mathrm{DR})$, given the form of model described by Eq. 1. These were performed for all combinations of the following values of $n$ (number of subjects), $k$ (number of replicate tests), and $\Delta$ (the true DR):
$n$: 5, 10, 15, 20, 30, 40, 60, 100, 200, 500
$k$: 2, 4, 6, 8, 10
$\Delta$: 1.5, 2.0, 3.0, 4.0, 5.0
For each of these combinations of $n$, $k$, and $\Delta$, the following procedure was performed.

1) An arbitrary overall mean $\mu$ was chosen, along with a set of $n$ equally spaced subject effects $\alpha_i$ chosen symmetrically around zero so that $\sum_{i=1}^{n} \alpha_i = 0$.

2) The within-subject variance, $\sigma^2$, was calculated as
$$\sigma^2 = \frac{\sum_{i=1}^{n} \alpha_i^2/(n - 1)}{\Delta^2}$$
3) For each $i = 1, \ldots, n$ and $j = 1, \ldots, k$, a random observation $\varepsilon_{ij}$ was generated from a normal distribution with mean zero and variance $\sigma^2$. The $X_{ij}$ were then generated from Eq. 1.

4) The DR and hence $\log_e(\mathrm{DR})$ were calculated from the $X_{ij}$ using Eqs. 3-5.

5) The SD of $\log_e(\mathrm{DR})$ was calculated from the Taylor series approximation (Eq. 17 of APPENDIX III), using the noncentrality parameter $\lambda$ evaluated from the DR estimate at the current step of the simulation.

6) Steps 3-5 were repeated 500 times, yielding a distribution of 500 values for each of DR, $\log_e(\mathrm{DR})$, and SD of $\log_e(\mathrm{DR})$.

7) The distributions of DR and $\log_e(\mathrm{DR})$ were checked for normality using the Shapiro-Wilk test and were plotted as histograms.

8) The true SD of $\log_e(\mathrm{DR})$ was estimated from the simulated distribution of $\log_e(\mathrm{DR})$.

9) The distribution of Taylor series estimates of the SD of $\log_e(\mathrm{DR})$ was compared with the true SD by plotting the median, upper and lower quartiles, and 5th and 95th percentiles against $n$ for different values of $k$ and $\Delta$.
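Steps 1-4, 6, and 8 of this procedure can be sketched in Python as follows. This is an illustration only: the function name `simulate_log_dr`, the value of $\mu$, the unit spacing of the subject effects, and the random seed are arbitrary assumptions, and the normality testing and Taylor series comparison of steps 5, 7, and 9 are omitted.

```python
import math
import random
import statistics

def simulate_log_dr(n, k, delta, reps=500, mu=10.0, seed=1):
    """Simulate log_e(DR) under the fixed effects model of Eq. 1."""
    rng = random.Random(seed)
    # Step 1: equally spaced subject effects, symmetric about zero
    alphas = [i - (n - 1) / 2.0 for i in range(n)]
    # Step 2: within-subject SD giving the required true DR
    sigma = math.sqrt(sum(a * a for a in alphas) / (n - 1)) / delta
    log_drs = []
    for _ in range(reps):  # step 6
        # Step 3: generate the observations from Eq. 1
        data = [[mu + a + rng.gauss(0.0, sigma) for _ in range(k)]
                for a in alphas]
        # Step 4: DR from Eqs. 3-5
        means = [sum(row) / k for row in data]
        grand = sum(means) / n
        ms_b = k * sum((m - grand) ** 2 for m in means) / (n - 1)
        ms_w = sum((x - m) ** 2 for row, m in zip(data, means)
                   for x in row) / (n * (k - 1))
        if ms_b > ms_w:  # the DR is undefined otherwise
            log_drs.append(0.5 * math.log((ms_b - ms_w) / (k * ms_w)))
    return log_drs

# Step 8: empirical SD of log_e(DR) for one combination of n, k, and Delta
logs = simulate_log_dr(n=30, k=2, delta=3.0)
print(statistics.mean(logs), statistics.stdev(logs))
```

The empirical SD printed at the end is the quantity against which the Taylor series estimate of Eq. 17 is compared.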

Examination of P values from the Shapiro-Wilk test showed some evidence that $\log_e(\mathrm{DR})$ was not quite normally distributed (slightly >10% of the P values examined were <0.05, but there was no apparent relationship between low P values and $n$, $k$, or $\Delta$). However, this was a marked improvement over the DR itself, for which >50% of the P values were <0.05. Histograms also showed the distribution of $\log_e(\mathrm{DR})$ to be symmetric, whereas that of DR was markedly positively skewed (data not shown). $\log_e(\mathrm{DR})$ was deemed to be sufficiently close to normal for use in the $\chi^2$ test for equality of DRs. Plots showed that the median Taylor series estimate for the SD of $\log_e(\mathrm{DR})$ was generally within ±10% of the true value for $n \ge 10$. However, for $k = 2$, the SD could be overestimated by as much as 20% for $10 \le n \le 20$. The distribution of SD estimates is positively skewed and, for $n \ge 10$, the 5th percentile of the SD estimates was at most 10% below the true value. Overestimation could be more marked, but even the upper quartile SDs were within +25% of the true value (data not shown). Because the $\chi^2$ test statistic is a function of the reciprocal of the SD, the test is conservative with respect to overestimates of the SD, i.e., one is unlikely to wrongly reject the null hypothesis (no difference between the DRs), but the test may not be particularly sensitive to genuine differences for small values of $n$, especially if $k = 2$.

Estimates of the SD become very inaccurate for n < 10, and the approximation should not be used in this range. However, we would not recommend performing an evaluation study of this kind on such a small number of subjects, since the objective is to characterize the performance of the tests over a reasonable range of the variable of interest.


    APPENDIX III. SD OF LOGE(DR) USING TAYLOR SERIES APPROXIMATION

We derived an estimate of the variance (and hence the SD) of $\log_e(\mathrm{DR})$ using a first-order Taylor series expansion. From Eq. 7
$$\mathrm{DR} = \sqrt{(F_0 - 1)/k}$$
where $F_0 = \mathrm{MS_B}/\mathrm{MS_W}$ has a noncentral $F$ distribution with degrees of freedom $\nu_1 = n - 1$ and $\nu_2 = n \times (k - 1)$ and noncentrality parameter
$$\lambda = k \times \sum_{i=1}^{n} \alpha_i^2 / \sigma^2 = (n - 1) \times k \times \Delta^2$$
Let $\mathrm{LDR} = \log_e(\mathrm{DR})$.

Expanding LDR as a Taylor series in $F_0$ about its mean $F'_0$ gives, to first order
$$\mathrm{LDR} \approx \mathrm{LDR}' + (F_0 - F'_0) \times \frac{d(\mathrm{LDR})}{dF'_0} \tag{12}$$
where $\mathrm{LDR}'$ is LDR evaluated at $F'_0$ and $d(\mathrm{LDR})/dF'_0$ is also evaluated at $F'_0$. Hence
$$\mathrm{var}(\mathrm{LDR}) \approx \mathrm{var}(F_0) \times \left[\frac{d(\mathrm{LDR})}{dF'_0}\right]^2 \tag{13}$$
where var indicates variance. Now
$$\frac{d(\mathrm{LDR})}{dF'_0} = \frac{1}{\mathrm{DR}'} \times \frac{1}{2} \times \left[\frac{F'_0 - 1}{k}\right]^{-1/2} \times \frac{1}{k} = \frac{1}{2k \times (\mathrm{DR}')^2} \tag{14}$$
where $\mathrm{DR}'$ is DR evaluated at $F'_0$. From general properties of the noncentral $F$ distribution (17)
$$F'_0 = \frac{\nu_2 \times (\nu_1 + \lambda)}{\nu_1 \times (\nu_2 - 2)} = \frac{n \times (k - 1) \times (n - 1 + \lambda)}{(n - 1) \times (nk - n - 2)}$$
$$\Rightarrow \mathrm{DR}' = \sqrt{\frac{n \times (k - 1) \times \lambda + 2 \times (n - 1)}{k \times (n - 1) \times (nk - n - 2)}} \tag{15}$$
and
$$\mathrm{var}(F_0) = \frac{2 \times (\nu_2/\nu_1)^2 \times [(\nu_1 + \lambda)^2 + (\nu_1 + 2\lambda) \times (\nu_2 - 2)]}{(\nu_2 - 2)^2 \times (\nu_2 - 4)} = \frac{2n^2(k - 1)^2 \times [\lambda^2 + 2\lambda \times (nk - 3) + (n - 1) \times (nk - 3)]}{(n - 1)^2 (nk - n - 2)^2 \times (nk - n - 4)} \tag{16}$$

Substituting for $\mathrm{DR}'$ from Eq. 15 into Eq. 14 and for $d(\mathrm{LDR})/dF'_0$ from Eq. 14 and $\mathrm{var}(F_0)$ from Eq. 16 into Eq. 13 gives
$$\mathrm{var}(\mathrm{LDR}) \approx \frac{n^2 (k - 1)^2 \times [\lambda^2 + 2\lambda \times (nk - 3) + (n - 1) \times (nk - 3)]}{2 \times (nk - n - 4) \times [n(k - 1)\lambda + 2(n - 1)]^2} \tag{17}$$
which is evaluated by noting that
$$\lambda = (n - 1) \times k \times \Delta^2$$
and replacing $\Delta^2$ by $\mathrm{DR}^2$, i.e.
$$\lambda \approx (n - 1) \times k \times \mathrm{DR}^2$$
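The algebra leading to Eq. 17 can be checked numerically by evaluating Eqs. 13-16 step by step and comparing the result with the closed form. The Python sketch below does this; the function names are hypothetical, and the formulas require $\nu_2 = n \times (k - 1) > 4$.

```python
import math

def var_log_dr_stepwise(n, k, dr):
    """Variance of log_e(DR) assembled from Eqs. 13-16."""
    nu1, nu2 = n - 1, n * (k - 1)
    lam = (n - 1) * k * dr ** 2
    mean_f = nu2 * (nu1 + lam) / (nu1 * (nu2 - 2))    # mean of F0 (Eq. 15)
    dr_prime_sq = (mean_f - 1.0) / k                   # (DR')^2
    var_f = (2.0 * (nu2 / nu1) ** 2
             * ((nu1 + lam) ** 2 + (nu1 + 2 * lam) * (nu2 - 2))
             / ((nu2 - 2) ** 2 * (nu2 - 4)))           # Eq. 16
    slope = 1.0 / (2.0 * k * dr_prime_sq)              # Eq. 14
    return var_f * slope ** 2                          # Eq. 13

def var_log_dr_closed(n, k, dr):
    """Variance of log_e(DR) from the closed form, Eq. 17."""
    lam = (n - 1) * k * dr ** 2
    num = n ** 2 * (k - 1) ** 2 * (
        lam ** 2 + 2 * lam * (n * k - 3) + (n - 1) * (n * k - 3)
    )
    den = 2 * (n * k - n - 4) * (n * (k - 1) * lam + 2 * (n - 1)) ** 2
    return num / den

# Hypothetical example: 30 subjects, duplicate tests, DR of 3.0
step = var_log_dr_stepwise(30, 2, 3.0)
closed = var_log_dr_closed(30, 2, 3.0)
print(math.sqrt(step), math.sqrt(closed))
```

The two functions should agree to within rounding error for any admissible $n$, $k$, and DR.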


    ACKNOWLEDGEMENTS

We are grateful for the assistance from Dr. Sue Manley and Nuala Walravens.


    FOOTNOTES

This study was done with aid of grants from Servier and the Alan & Babette Sainsbury Trust.

Address for reprint requests: J. C. Levy, Diabetes Research Laboratories, Radcliffe Infirmary, Woodstock Rd., Oxford OX2 6HE, UK.

Received 30 December 1997; accepted in final form 15 October 1998.


    REFERENCES

1. Allison, D. B. Limitations of coefficient of variation as index of measurement reliability. Nutrition 9: 559-560, 1993.

2. Alsawalmeh, Y. M., and L. S. Feldt. Test of hypothesis that the intraclass reliability coefficient is the same for two measurement procedures (Abstract). Appl. Psychol. Measurement 16: 195, 1992.

3. Altman, D. G. Practical Statistics for Medical Research (1st ed.). London: Chapman & Hall, 1991, p. 293-294.

4. American Diabetes Association Expert Committee. Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 20: 1183-1197, 1997.

5. Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19: 3-11, 1966.

6. Bland, J. M., and D. G. Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1: 307-310, 1986.

7. Bortheiry, A. L., D. A. Malerbi, and L. J. Franco. The ROC curve in the evaluation of fasting capillary blood glucose as a screening test for diabetes and IGT. Diabetes Care 17: 1269-1272, 1994.

8. Engelgau, M. M., T. J. Thompson, W. H. Herman, J. P. Boyle, R. E. Aubert, S. J. Kenny, A. Barran, E. S. Sous, and M. A. Ali. Comparison of fasting and 2-hour glucose and HbA1c levels for diagnosing diabetes: diagnostic criteria and performance revisited. Diabetes Care 20: 785-791, 1997.

9. Forrest, R. D., C. A. Jackson, and J. S. Yudkin. The abbreviated glucose tolerance test in screening for diabetes: the Islington Diabetes Survey. Diabet. Med. 4: 544-554, 1988.

10. Fraser, C. G., and E. K. Harris. Generation and application of data on biological variation in clinical chemistry. Crit. Rev. Clin. Lab. Sci. 27: 409-437, 1989.

11. Fuller, W. A. Measurement Error Models (1st ed.). New York: Wiley, 1987, p. 4.

12. Ganda, O. P., J. L. Day, J. J. Connon, and R. E. Gleason. Reproducibility and comparative analysis of repeated intravenous and oral glucose tolerance tests. Diabetes 27: 715-725, 1978.

13. Harding, P. E., N. W. Oakley, and V. Wynn. Reproducibility of oral glucose tolerance data in normal and mildly diabetic subjects. Clin. Endocrinol. Metab. 2: 387-395, 1973.

14. Hoaglin, D. C., F. Mosteller, and J. W. Tukey. Understanding Robust and Exploratory Data Analysis. New York: Wiley, 1983.

15. Holman, R. R., and R. C. Turner. The basal plasma glucose: a simple, relevant index of maturity-onset diabetes. Clin. Endocrinol. Metab. 14: 279-286, 1980.

16. Hosker, J. P., D. R. Matthews, A. S. Rudenski, M. A. Burnett, P. Darling, E. G. Bown, and R. C. Turner. Continuous infusion of glucose with model assessment: measurement of insulin resistance and beta-cell function in man. Diabetologia 28: 401-411, 1985.

17. Johnson, N. L., S. Kotz, and N. Balakrishnan. Noncentral $\chi^2$ distributions. Noncentral F distributions. In: Continuous Univariate Distributions (2nd ed.). Chichester, UK: Wiley, 1995, vol. 2, p. 433-434, 480-482.

18. Lachenbruch, P. A. The non-central F distribution: extension of Tang's tables (Abstract). Ann. Math. Stat. 37: 744, 1966.

19. Lindgren, B. W. Linear models and analysis of variance. In: Statistical Theory (3rd ed.). New York: Macmillan, 1976, p. 525-528.

20. McCance, D. R., R. L. Hanson, M. A. Charles, L. T. Jacobsson, D. J. Pettitt, P. H. Bennett, and W. C. Knowler. Comparison of tests for glycated haemoglobin and fasting and two hour plasma glucose concentrations as diagnostic methods for diabetes. Br. Med. J. 308: 1323-1328, 1994.

21. McDonald, G. W., G. F. Fisher, and C. Burnham. Reproducibility of the oral glucose tolerance test. Diabetes 14: 473-480, 1965.

22. Metropolitan Life Insurance Company. Net weight standard for men and women. Stat. Bull. Metrop. Insur. Co. 40: 1-4, 1959.

23. Montgomery, D. C. Design and Analysis of Experiments (3rd ed.). Singapore: Wiley, 1991, p. 103-108.

24. Olefsky, J. M., and G. M. Reaven. Insulin and glucose responses to identical oral glucose tolerance tests performed forty-eight hours apart. Diabetes 23: 449-453, 1974.

25. Patnaik, P. B. The non-central $\chi^2$ and F distributions and their applications. Biometrika 36: 202-232, 1949.

26. Riccardi, G., O. Vaccaro, A. Rivellese, S. Pignalosa, L. Tutino, and M. Mancini. Reproducibility of the new diagnostic criteria for impaired glucose tolerance. Am. J. Epidemiol. 121: 422-429, 1985.

27. Riggs, D. S., J. A. Guarnieri, and S. Addelman. Fitting straight lines when both variables are subject to error. Life Sci. 22: 1305-1360, 1978.

28. Saad, M. D., W. C. Knowler, D. J. Pettitt, R. G. Nelson, D. M. Mott, and P. H. Bennett. The natural history of impaired glucose tolerance in the Pima Indians. N. Engl. J. Med. 319: 1500-1506, 1988.

29. Sartor, G., B. Schersten, S. Carlstrom, A. Melander, A. Norden, and G. Persson. Ten year follow-up of subjects with impaired glucose tolerance: prevention of diabetes by tolbutamide and diet regulation. Diabetes 29: 41-49, 1980.

30. Shrout, P. E., and J. L. Fleiss. Intraclass correlations: use in assessing rater reliability. Psychol. Bull. 86: 420-428, 1979.

31. Turner, R. C., R. R. Holman, D. R. Matthews, S. P. O'Rahilly, A. S. Rudenski, and W. J. Braund. Diabetes nomenclature: classification or grading of severity? Diabet. Med. 3: 216-220, 1986.

32. Turner, R. C., J. I. Mann, R. D. Simpson, E. Harris, and R. Maxwell. Fasting hyperglycaemia and relatively unimpaired meal responses in mild diabetes. Clin. Endocrinol. Metab. 6: 253-264, 1977.

33. WHO Expert Committee on Diabetes Mellitus. Second report. World Health Organisation Technical Support Series (2nd. report), 1985, p. 727.

34. Zweig, M. H., and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 39: 561-577, 1993.


Am J Physiol Endocrinol Metab 276(2):E365-E375
0002-9513/99 $5.00 Copyright © 1999 the American Physiological Society