Discrimination, adjusted correlation, and equivalence of imprecise
tests: application to glucose tolerance
Jonathan Levy, Richard Morris, Margaret Hammersley, and Robert Turner
Diabetes Research Laboratories, Department of Clinical Medicine,
Oxford University, Oxford OX2 6HE, United Kingdom
ABSTRACT
Comparison studies between physiological tests
are often unsatisfactory for assessing their ability to distinguish
between subjects. We recommend a simple but comprehensive protocol,
using duplicate testing, that compares tests using
1) the discriminant ratio (DR)
between the underlying between- and within-subject SDs,
2) correlation coefficients adjusted
for attenuation due to test imprecision, and
3) unbiased estimation of the
underlying linear relationship between test results. The following five
alternative methods for assessing glucose tolerance were compared:
fasting plasma glucose (FPG) as a single sample or as the mean of three 5-min samples (FPG3); the 1- and
2-h glucose during a low-dose intravenous glucose infusion (CIG); and
the 2-h plasma glucose from a 75-g oral glucose tolerance test (OGTT).
All tests had similar DRs ranging from 2.6 to 4.2. The adjusted
correlation between FPG and CIG tests approached unity, and those
between OGTT and other tests were ~0.9, showing that
FPG3 provides similar information
to the OGTT. FPG concentrations of 6.0 and 7.1 mmol/l were found equivalent to
the 1985 World Health Organization OGTT thresholds for
impaired glucose tolerance and diabetes (7.8 and 11.1 mmol/l).
plasma glucose; precision
INTRODUCTION
COMPLEX PHYSIOLOGICAL responses, such as glucose
tolerance, may be assessed by many different methods. For instance, the
control of plasma glucose may be measured by the fasting plasma glucose (FPG), by the response to an oral or an intravenous glucose challenge, by the percentage of glycated hemoglobin
(HbA1c), or by the plasma concentration of fructosamine. Different clinical tests are used to
assess the same physiological variable because of different clinical
circumstances, the application of advances in understanding and
technique, or as a result of historical circumstances or fashion. The
tests may assess differing or equivalent aspects of a physiological characteristic and may express their results in different units of measurement.
In this paper, we recommend a simple but comprehensive methodology for
comparing different tests of a continuous physiological variable, which
may be applied irrespective of the scales of measurements used. For
such comparisons, it is important to include assessment of both the
between- and within-subject variation of each test, as this allows
comparison of the ability of tests to discriminate between different
subjects, determination of the underlying correlation between tests
after adjusting for attenuation due to within-subject variation, and
unbiased estimation of the underlying relationships between the results
of the different tests.
Omission of any of these components in a comparison study will provide
inadequate information for choosing a particular test for a particular
situation. The imprecision, or the within-subject variation, of a test
is of little use by itself and must be considered in relation to the
range of the test results. The "coefficient of variation," which
relates imprecision to the midpoint of the range, is often
inappropriate as it does not take account of the dynamic range of a
test (1) and is not always comparable between different scales of
measurement. The present paper introduces the concept of
"discrimination," i.e., the ability to distinguish between
individual subjects within a specified range of interest. This can be
expressed as the discriminant ratio (DR), which is defined here as the
ratio of the underlying between-subject SD (SDU) to the within-subject SD
(SDW). The DR has a defined
distribution, and DRs for different tests can be compared statistically.
The correlation between different tests must be included in any
comparison. Test imprecision will diminish, or "attenuate," the
measured correlation coefficients in a well-described way (11), and it
is important to correct for this so as to be able to assess the
underlying "true" correlation, as this represents the degree
to which the tests are assessing the same physiological trait. This
attenuation adjustment can be expressed in terms of the DRs of the
respective tests.
Finally, the comparison of tests must, if possible, relate measurements
by one test to those by another. Where the relationship is linear, the
gradient of the relationship obtained by least squares regression is
underestimated ("regression dilution") when both tests are
subject to appreciable measurement error. An unbiased technique for
deriving the relationship is therefore necessary, and, while many
methods are available, a suitable method (27) is recommended here.
The assessment of glucose tolerance is a specific area where several
methods are available to an investigator, for instance, by the
measurement of steady-state plasma glucose or the response to a glucose
challenge, either oral or intravenous. The standard test of glucose
tolerance has, until recently, been the oral glucose tolerance test
(OGTT; see Ref. 33), but, in practice, this test is not often
performed. This is partly because of the inconvenience of a 2-h test
and partly because of the marked variability of the OGTT. The poor
reproducibility of the test (12, 21, 24, 26), which has a reported
coefficient of variation of the order of 15-40%, due in part to
the variable rate of gastric emptying, has predictable effects on
reclassification of subjects on repeat tests, with a change in status
on repeat testing in 30-60% of cases of impaired glucose
tolerance (IGT; see Refs. 9, 13, 26, 28, 29) and predictable regression
to the mean (26). The simple measurement of FPG has been suggested as a
preferable measure (4, 8, 15, 20), and a continuous intravenous infusion of glucose (CIG) can assess glucose tolerance and give simultaneous measures of pancreatic
β-cell function and insulin resistance (16). The FPG thresholds for diabetes have been set using
outcome data (4). We evaluate and compare the performance of this and
other tests throughout the physiological range.
This paper outlines the concepts and components of a physiological
comparison study and uses as an example the assessment of glucose
tolerance by either 1) the FPG
(single sample or mean of 3 consecutive samples),
2) the 1- and 2-h responses to a
CIG, and 3) the standard 2-h
response to the OGTT, in repeated tests in 30 subjects spanning the
range of glucose tolerance.
METHODS
Statistical Methods
This paper considers the following three aspects relating to the
assessment and comparison of different tests for measuring an
underlying physiological variable such as glucose tolerance: 1) the ability of a test to
discriminate between different subjects and comparison of
discrimination between different tests;
2) the correlation between pairs of
tests, adjusting for bias due to within-subject variation. Such
variation attenuates measured correlation coefficients so that they
underestimate the underlying true correlation. This is important in
assessing the degree to which different tests purporting to assess the
same underlying trait differ with respect to systematic between-subject
factors as opposed to random within-subject variation; and
3) in cases where the relationship
between a pair of tests appears to be linear, unbiased estimation of
the underlying line of equivalence between them.
Each of these aspects is required for a comprehensive comparison study
and is based on a combination of well-recognized and novel concepts.
Our approach is considered on a conceptual level at this point, with a
more rigorous statistical treatment reserved for APPENDIXES
I-III.
Discrimination between subjects. All
physiological measurements 1) are
subject to imprecision, which may derive from biological, sampling, and
analytic sources and 2) relate
implicitly to variables taking on values within a particular range of
interest. The performance of a particular physiological test will
depend on the relationship between both of these characteristics.
Absolute measurements of the imprecision of the test are only
meaningful in relation to the range of values to which that test will
be applied. The smaller the former is in relation to the latter, the
greater is the ability of a test to discriminate between individual
subjects. In the context of measurements being obtained from a series
of individuals representing the physiological spectrum of interest, we
propose a novel, simple index of discrimination, the DR, the ratio
between the SD of the underlying subject means
(SDU), and the SD of repeated measurements on the same subject
(SDW).
The discrimination of a test is not a universal property but will
relate to the spectrum of values in the population being studied. Hence
absolute DR values are not comparable between different populations,
but they are essential when comparing the practical application and
performance of different tests in the same population.
Underlying SDU.
Because a physiological test may be applied to a variety of possibly
nonuniform populations, it is important to assess the test in relation
to its expected range of application. Subjects in a comparison study
should be chosen to represent and to span the range rather than be
randomly selected from particular populations of interest, and this
range is characterized statistically by the
SDU. The measured SD
(SDB) will overestimate the
underlying SDU due to the presence
of within-subject variation, and it is important to adjust for this,
using a standard formula, to yield an unbiased estimate of the
SDU.
SDW.
To relate simply to the between-subject variation, we must be able to
assume a common within-subject variation for all of the subjects in the
study. This property is called "homoscedasticity" and can be
checked by simple plots of the data (see APPENDIX I). Lack
of homoscedasticity can often be rectified by an appropriate numerical
transformation of the test results.
For homoscedastic data, the common within-subject variance is
simply the mean of the individual within-subject variances.
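The difference-versus-mean check and the pooled within-subject variance described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example values are hypothetical, and the correlation between the absolute difference and the mean is used here only as a rough numerical companion to the visual plot.

```python
import numpy as np

def difference_mean_pairs(duplicates):
    """Return (means, differences) for an (n_subjects, 2) array of duplicate tests."""
    x = np.asarray(duplicates, dtype=float)
    means = x.mean(axis=1)
    diffs = x[:, 0] - x[:, 1]  # first test minus second test
    return means, diffs

def pooled_within_variance(replicates):
    """Common within-subject variance: mean of the per-subject sample variances."""
    x = np.asarray(replicates, dtype=float)
    return x.var(axis=1, ddof=1).mean()

def spread_trend(duplicates):
    """Pearson correlation between |difference| and mean.

    A value well above zero suggests that within-subject variation grows
    with the level of the measurement (heteroscedasticity)."""
    means, diffs = difference_mean_pairs(duplicates)
    return np.corrcoef(means, np.abs(diffs))[0, 1]

# Hypothetical duplicate measurements (mmol/l), invented for illustration only.
example = np.array([[4.8, 5.0], [5.6, 5.3], [6.9, 7.4], [9.1, 8.5], [12.0, 13.1]])
means, diffs = difference_mean_pairs(example)
sw2 = pooled_within_variance(example)
trend = spread_trend(example)
```

If the trend is marked, the same functions can be applied to transformed values (for example `np.log10(example)`) to see whether the transformation stabilizes the spread.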
DR. As outlined above, the DR is
defined as the ratio
SDU/SDW.
In a comparison study where k
replicate measurements are performed in each subject, the measured
SDB is calculated as the SD of the subject mean values (calculated from the
k replicates). The standard mathematical adjustment to yield
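SDU is
\[ SD_U = \sqrt{SD_B^2 - \frac{SD_W^2}{k}} \]
so that the DR is calculated simply as
\[ DR = \frac{SD_U}{SD_W} = \sqrt{\frac{SD_B^2}{SD_W^2} - \frac{1}{k}} \]
This result may also be obtained by an analysis of variance (ANOVA)
approach, using a fixed effects model, and this is presented in
APPENDIX I. The appropriate equation is then
\[ DR = \sqrt{\frac{MS_B / MS_W - 1}{k}} \]
where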
MSB is the between-subject mean
square, MSW is the within-subject
mean square, and k is the number of
replicate tests in each subject, as above.
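As a minimal sketch of the calculation described above, the DR can be computed from an n-by-k array of replicate results in two ways. The sketch assumes the standard adjustment SD_U^2 = SD_B^2 - SD_W^2/k and the one-way ANOVA form DR = sqrt((MS_B/MS_W - 1)/k). The function names and example data are hypothetical.

```python
import numpy as np

def discriminant_ratio(replicates):
    """DR from the SD of subject means and the pooled within-subject SD."""
    x = np.asarray(replicates, dtype=float)
    n, k = x.shape
    sd_b2 = x.mean(axis=1).var(ddof=1)    # measured variance of subject means
    sd_w2 = x.var(axis=1, ddof=1).mean()  # pooled within-subject variance
    sd_u2 = sd_b2 - sd_w2 / k             # adjusted (underlying) variance
    if sd_u2 <= 0:
        raise ValueError("within-subject variation exceeds between-subject variation")
    return {"sd_b": np.sqrt(sd_b2), "sd_w": np.sqrt(sd_w2),
            "sd_u": np.sqrt(sd_u2), "dr": np.sqrt(sd_u2 / sd_w2)}

def discriminant_ratio_anova(replicates):
    """Same DR via one-way ANOVA mean squares (balanced design assumed)."""
    x = np.asarray(replicates, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1, keepdims=True)
    ms_b = k * ((subject_means - x.mean()) ** 2).sum() / (n - 1)
    ms_w = ((x - subject_means) ** 2).sum() / (n * (k - 1))
    if ms_b <= ms_w:
        raise ValueError("MS_B does not exceed MS_W; DR is not estimable")
    return np.sqrt((ms_b / ms_w - 1.0) / k)

# Hypothetical duplicate measurements (mmol/l), invented for illustration only.
data = np.array([[4.8, 5.0], [5.6, 5.3], [6.9, 7.4], [9.1, 8.5], [12.0, 13.1]])
result = discriminant_ratio(data)
dr_anova = discriminant_ratio_anova(data)
```

For balanced data the two routes agree, because MS_B equals k times the variance of the subject means and MS_W equals the pooled within-subject variance.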
Assuming that the within-subject variation is normally distributed, we
have used an analytical approach to derive confidence limits for
estimated DR values and to test for the significance of differences
between the DRs of different tests. We present these in APPENDIX
I, where we also discuss our methodology in relation to
alternative measures of test "reliability," in particular the
intraclass correlation coefficient (ICC; see Refs. 5, 30).
Correlation between pairs of tests.
Two tests designed to assess a complex physiological characteristic,
such as degree of glycemic control, may use different methodologies,
neither of which may perfectly represent the characteristic in
question. The results of the tests may differ in systematic ways,
independently from their random within-subject variation. The degree to
which the tests measure the same characteristic may be assessed by the correlation between their results in a set of subjects. In the absence
of within-subject variation, the degree to which their correlation
falls short of unity would represent the extent to which the tests
either fail to measure precisely the same characteristic or measure
aspects of the underlying characteristic that are differentially
influenced by other factors that vary between subjects. It is this
underlying true correlation that we are interested in here.
Test imprecision, however, further attenuates the observed correlation
so that imperfect correlation will usually be due to a combination of
the systematic between-subject factors discussed above and the presence
of within-subject variation. A study comparing tests must distinguish
between these two components, and this can be achieved using a standard
formula that corrects measured correlations for attenuation (11). Such
a correction requires knowledge of both the within- and between-subject
variation of the measured values. Because the DR incorporates both of
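these elements, the correction can be expressed in terms of the DR
\[ r_{\mathrm{adjusted}} = \frac{r_{\mathrm{measured}}}{\sqrt{A_1 A_2}} \]
where
\[ A_i = \frac{DR_i^2}{DR_i^2 + 1/k} \]
is the attenuation factor for test i when the measured correlation is calculated between subject means of k replicates (see
APPENDIX I). The corrected correlation coefficient,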
therefore, represents the degree to which the tests represent the
same physiological characteristic, independent of test imprecision.
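The attenuation adjustment can be sketched numerically as below. This is an illustration under a stated assumption: the classical correction for attenuation is used, with the reliability of a mean of k replicates written as DR^2/(DR^2 + 1/k). The function names and the example correlation of 0.85 are hypothetical and are not taken from the study.

```python
import math

def attenuation_factor(dr, k=1):
    """Reliability of the mean of k replicates, expressed through the DR."""
    return dr ** 2 / (dr ** 2 + 1.0 / k)

def adjusted_correlation(r_measured, dr_1, dr_2, k_1=1, k_2=1):
    """Correct a measured correlation for attenuation in both tests."""
    a_1 = attenuation_factor(dr_1, k_1)
    a_2 = attenuation_factor(dr_2, k_2)
    return r_measured / math.sqrt(a_1 * a_2)

# Hypothetical measured correlation between subject means of duplicate tests,
# combined with DR values at the two ends of the range quoted in the abstract.
r_adj = adjusted_correlation(0.85, dr_1=2.6, dr_2=4.2, k_1=2, k_2=2)
```

Because the DRs are themselves estimates, an adjusted coefficient can occasionally exceed 1 through sampling error, so values close to unity should be read with that in mind.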
Unbiased estimation of a linear
relationship. Where two tests represent the same
characteristic, their results will often be found to be linearly
related (although this may require numerical transformation). The
determination of this relationship will be important in relating the
results of one test to those of the other. Within-subject variability
will lead to a noisy relationship, and a statistical approach is
necessary to estimate the underlying linear equation. Although
least-squares linear regression is often used for this, it is not
appropriate here since it assumes that the explanatory variable is free
from noise. When this is not the case, linear regression will
underestimate the slope of the equation, a well-recognized effect
termed regression dilution.
There is no perfect method of estimating the true relationship in these
circumstances, but the method chosen here is that of least
perpendicular distances corrected for scale differences, which has been
shown to perform well in relation to other methods (27). The equations
for this are presented in APPENDIX I.
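Regression dilution, and one way of avoiding it, can be illustrated with a short sketch. Note that the code below uses Deming regression, a widely used errors-in-variables fit that takes the ratio of the two error variances as input. It is swapped in here for illustration and is not claimed to be identical to the weighted perpendicular method of Ref. 27. The function names and data are hypothetical.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (assumes x is error free)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)

def deming_line(x, y, error_variance_ratio=1.0):
    """Deming regression of y on x.

    error_variance_ratio = var(error in y) / var(error in x).
    Returns (slope, intercept)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxx, syy = x.var(ddof=1), y.var(ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = error_variance_ratio
    slope = (syy - d * sxx + np.sqrt((syy - d * sxx) ** 2 + 4 * d * sxy ** 2)) / (2 * sxy)
    return slope, y.mean() - slope * x.mean()

# Invented example: true relationship y = 2x + 1, with fixed "errors" in both variables.
x_true = np.arange(1.0, 7.0)
x_obs = x_true + np.array([0.6, -0.6, -0.6, 0.6, 0.6, -0.6])
y_obs = 2 * x_true + 1 + np.array([-0.4, 0.4, 0.4, -0.4, -0.4, 0.4])

slope_ols = ols_slope(x_obs, y_obs)  # shallower than the true slope of 2
slope_dem, intercept_dem = deming_line(x_obs, y_obs, error_variance_ratio=0.16 / 0.36)
```

With a positive covariance, the Deming slope always lies between the slope of y regressed on x and the reciprocal of the slope of x regressed on y, which is why it is not pulled toward zero in the way the ordinary regression slope is.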
Design of comparison studies. There is
no single measure that can be used to compare physiological tests.
Assessment and comparison of test discrimination and the determination
of their underlying correlations and inter-relationships are equally
important components. These all require simultaneous consideration of
both between-subject and within-subject variation within the
physiological range of interest. A comparison study should, ideally,
assess both of these factors by using replicate tests in subjects
chosen to span that range.
Experimental Protocols
Subjects. Thirty white Caucasian
subjects were studied, consisting of 10 normoglycemic subjects, 9 subjects with IGT, and 11 with type II diabetes according to 1985 World
Health Organization (WHO) definitions (33). All subjects were on a
weight-maintaining diet and had not changed their medication for 4 wk
before the tests. Subject characteristics are presented in Table
1 by glucose tolerance group.
Protocols. Each subject was studied on
four occasions within a 6-wk period. After a 12-h fast, subjects went
to the hospital and sat on a bed for the duration of the tests. Tests
were each performed on two occasions in the same subject and in random order.
FPG and CIG. Two cannulas were
inserted in the same arm. One, for blood sampling, was placed at the
wrist or on the dorsum of the hand, which was heated by an electrical
blanket to "arterialize" the venous blood. The other cannula, for
infusion of glucose, was placed in an antecubital vein. A blood sample
was taken at time −10 min, and
the plasma glucose concentration at this single time point was termed
FPG1. Blood samples were also
taken at times −5 and 0 min, and the mean of the plasma
glucose at the three time points was termed
FPG3. At time 0, a continuous 5 mg · kg ideal body wt⁻¹ · min⁻¹
infusion (22) of 10% glucose was started and continued for 2 h.
One-hour CIG and 2-h CIG glucose were defined as the means of the three
plasma glucose concentrations in blood samples taken at 50, 55, and 60 min and at 110, 115, 120 min, respectively.
OGTT. A single cannula was placed in
an antecubital vein for blood sampling. Fasting blood samples were
taken at −10, −5, and 0 min. At 0 min, the subject
consumed a 75-g glucose drink, and blood samples were taken at 30, 60, 90, and 120 min.
Biochemical Assays
Plasma glucose was determined by a hexokinase-based method (Boehringer
Mannheim UK, Lewes, UK) on a centrifugal COBAS MIRA autoanalyzer
(Roche, Welwyn Garden City, UK).
RESULTS
Estimates of Glucose Tolerance
Values for glucose tolerance using
FPG1,
FPG3, 1-h CIG, 2-h CIG, and 2-h
OGTT are presented in Table 2 as median and
ranges. Fasting and CIG measures were homoscedastic, and the within-
and between-subject variations are illustrated as plots of difference (first test −
second test) vs. mean (of the 2 tests) in Fig.
1. The 2-h OGTT was found to have
within-subject variation increasing with mean values, and this was
corrected by log transformation, with the transformed data presented as
difference vs. mean plots in Fig. 2 and as
medians and ranges for the whole group in Table 2. The underlying
SDU and
SDW and the DRs of the tests are
also presented in Table 2.

Fig. 1.
First minus second test difference vs. mean plots for
the mean of three 5-min samples of fasting plasma glucose
(A) and 1-h (B) and 2-h (C) plasma glucose during a constant
5 mg · kg ideal body wt⁻¹ · min⁻¹
intravenous glucose infusion (CIG).

Fig. 2.
First minus second test difference vs. mean plots for 2-h oral glucose
tolerance test (OGTT) for untransformed plasma glucose
(A) and glucose logarithmically
transformed to ensure homoscedasticity
(B).
DR values for the five measures, together with the one SE range of the
estimates, are illustrated in Fig. 3.
Although the lowest DR was the
FPG1 and the highest the 2-h CIG,
there was no significant difference between them on the overall
statistical test (χ² = 6.2 with 4 df, P = 0.19, using Eq. 10 in APPENDIX I). Consideration was given
to the exclusion of a subject whose inter-test difference in the 2-h
CIG was four SDs from the mean of the rest of the group (see Fig. 1).
In the absence of an identifiable reason for the large difference
between his two test values, this subject was included in the analyses
presented here, although, if he were excluded, the DR of the 2-h CIG
would rise to 6.1, significantly greater than the DRs of the other
tests.

Fig. 3.
Discriminant ratio values for fasting plasma glucose (single sample and
mean of 3 samples), 1-h CIG, 2-h CIG, and 2-h OGTT for plasma glucose.
Error bars represent 1 SE of the estimate.
Table 3 shows the Pearson correlations
between the tests, both before and after adjustment for attenuation.
The correlations were calculated between subject means of duplicate
tests, to give the best estimate of the underlying relationships
between tests. Adjusted correlations between the fasting and both of
the intravenous measures were high, approaching one. Those between the
2-h OGTT results and the other tests were somewhat lower (~0.9),
indicating that there was some biological discordance in their
relationship, independent of their within-subject measurement error.
Figure 4 shows the scattergram between the
2-h plasma OGTT (on a logarithmic scale) and the FPG. It also
illustrates the line of equivalence, derived as explained above, and
the linear regression line of the 2-h OGTT on FPG. The dilution effect
of the within-subject variation on the regression line can be seen
clearly. Table 4 gives coefficients for the
unbiased linear equations relating the test values, and Table
5 gives the points on the various scales
that are equivalent to the 1985 WHO OGTT thresholds for IGT and
diabetes.

Fig. 4.
Scattergram of the mean of two 2-h OGTT tests (logarithmic scale) vs.
the mean of two FPG3 tests showing the unbiased linear
relationship (dotted line) and the linear regression relationship
(dashed line).
SDW of the logarithm of the 2-h
OGTT may be interpreted in relation to a standard interval in the range
of glucose tolerance, for instance, the interval between the thresholds
for IGT (7.8 mmol/l) and for diabetes (11.1 mmol/l). This interval is
0.153 on a logarithmic scale (base
10), whereas the
SDW is 0.060. The difference
between two individual measurements at either end of this interval
would not be significant at the 5% level
(P = 0.07), given the imprecision of
the 2-h OGTT. In other words, it would not be possible to confidently
distinguish between two such individual values. The same applies to all
other tests examined, using the unbiased equivalent values to these
thresholds given in Table 5, apart from the 2-h CIG plasma glucose, for
which the two measurements at opposite ends of the equivalent interval
would differ significantly (P = 0.013).
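The arithmetic behind this comparison can be checked with a short sketch. It assumes a normal approximation in which the difference between two independent single measurements has an SD of sqrt(2) times SDW. The exact test used in the study is not given in this section, so this is an approximation, and the function name is hypothetical.

```python
import math

def two_value_test(interval, sd_within):
    """Two-sided normal-approximation P value for a difference of `interval`
    between two single measurements, each with within-subject SD `sd_within`."""
    sd_diff = math.sqrt(2.0) * sd_within       # SD of a difference of two values
    z = interval / sd_diff
    p_two_sided = math.erfc(z / math.sqrt(2.0))
    return z, p_two_sided

# Interval between the 1985 WHO thresholds on a log10 scale.
interval = math.log10(11.1) - math.log10(7.8)   # about 0.153
z, p = two_value_test(interval, sd_within=0.060)
```

Under these assumptions the P value comes out at about 0.07, in line with the figure quoted above for the 2-h OGTT.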
DISCUSSION
This study has shown that different methods of assessing glucose
tolerance were broadly comparable in a range of subjects spanning
normal glucose tolerance, IGT, and type II diabetes. This assessment
involved the following three separate components: 1) comparison of the discrimination
of the tests, i.e., their ability to distinguish between different
subjects, 2) determination of the
degree to which different tests measure the same underlying physiological property, and 3)
estimation of the underlying relationship between the test results. The
assessment of the within-subject imprecision of each test is a
fundamental requirement for this evaluation, so that comparison studies
must involve at least duplicate measurements in all subjects in order
to determine both the between- and within-subject measurement variation
of each test in the same group of individuals. This is best done in
subjects who represent a clinically meaningful range of glucose
tolerance. To ascribe a single value to within-subject imprecision
requires homoscedasticity, and numerical transformation of results may
be necessary to achieve this, as illustrated by the 2-h plasma glucose
from the OGTT. A measure of imprecision is important for the assessment
of changes within subjects, such as over time or after interventions
(10).
The determination of imprecision alone is not adequate, however, for
the assessment of the practical value of a test. The standard methods
of assessing imprecision, including the coefficient of variation, have
little meaning on their own, without reference to the range of
measurements to which they are being applied. The ability to
distinguish between individuals within this range is here termed the
discrimination of the test and is assessed using the DR.
The concept of discrimination should be distinguished from the ability
of a test to categorize patients by an external gold-standard dichotomy. This is a particular concern in the field of clinical chemistry, for instance, when a biochemical test is being assessed for
its ability to detect the presence of a malignancy. Receiver operating
characteristics (34) have been used for this purpose. However, they are
not suitable for the assessment of test results as a continuous scale
of measurement or for the comparison of tests without reference to an
external categorization. In the field of glucose tolerance, categories
have been defined on the basis of thresholds in a continuous scale of
measurement for the OGTT based on external criteria, but this is a
notoriously imprecise test (7), and it would not therefore be
appropriate to assess possible alternative tests using a categorical
approach based on these thresholds. The use of the DR provides a means
of comparing how well the subjects studied can be reliably
distinguished by different tests, which is an important component of a
comprehensive comparison of imprecise tests. For instance, in many
research studies using continuous variables, the statistical power to
distinguish between groups of subjects or to determine correlations
between variables will depend on discrimination.
A very similar concept, referred to as reliability, has been assessed,
particularly in the psychological literature, using the ICC. This
relates the covariance of replicate test results to the combined
between-subject and within-subject variance and is algebraically
related to the DR. However, in the context of discriminating between
different subjects, it is not as easy to grasp as the DR, which has a
direct intuitive relevance when considering a test's application.
Furthermore, it is not as easy to derive statistical tests comparing
different ICC values. A recently described method for the comparison of
two ICCs does not extend to the comparison of more than two tests and
was only validated for studies employing repeated tests in 100 subjects
or more (2).
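The algebraic link between the two indexes can be made concrete with a small sketch. It assumes the simplest single-measurement form of the ICC, the ratio of the underlying between-subject variance to the total variance, which gives ICC = DR^2/(DR^2 + 1). Other ICC variants exist and would give different expressions. The function names are hypothetical.

```python
import math

def icc_from_dr(dr):
    """Single-measurement ICC implied by a discriminant ratio."""
    return dr ** 2 / (dr ** 2 + 1.0)

def dr_from_icc(icc):
    """Discriminant ratio implied by a single-measurement ICC (0 <= icc < 1)."""
    return math.sqrt(icc / (1.0 - icc))

# DR values at the two ends of the range quoted in the abstract.
icc_low = icc_from_dr(2.6)
icc_high = icc_from_dr(4.2)
```

On this scale a DR of 1 corresponds to an ICC of 0.5, and differences between high DRs are compressed into a narrow band of ICC values near 1, which is one reason the DR can be easier to interpret when tests are compared.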
The choice of a methodology for comparing different tests is intimately
related to its theoretical basis and in particular to the availability
of statistical criteria for assessing such a comparison. Much of the
theoretical discussion of the ICC and treatments of comparative tests
between ICCs have been based on random effects models in which the
individuals in which the tests are being performed are assumed to be
drawn at random from a known normally distributed population. This
presupposes a particular population in which the tests are being
applied. Complex populations (for instance those consisting of
subgroups) would require appropriately complex statistical models.
Unfortunately, even the simplest random effects models are hard to
treat analytically, as in the case of the ICC quoted above. Our
approach has been to concentrate on evaluating tests across a
particular physiological spectrum of interest. In the example presented
here, for instance, glucose tolerance represents a physiological and
pathological unity irrespective of its distribution in particular
populations. For this purpose, we take the view that it is appropriate
to perform a comparison study in subjects selected to span the range of
interest, which may be analyzed by a fixed-effects statistical model.
Such an analysis is presented here for the DR, allowing the derivation of straightforward expressions for the SD and confidence intervals of
the DR and the evaluation of the statistical significance of differences between the DRs of different tests.
Although the comparison of the DRs of different tests in the same study
is valid, the DR in itself is not a universal characteristic of a test,
as it will depend on the choice of subjects on which the comparison is
performed. When the subjects cover a greater range of values, the DR
will be larger, and vice versa, and the DR calculated when subjects are
selected to span a range of interest will generally be larger than if
subjects had been chosen randomly from the same population. However, it
is a fundamental property when it comes to a test's practical
application, and, unlike imprecision, it can be used as a basis for
comparison between tests. The DRs of the five tests examined here were
not significantly different, in spite of the increased complexity and
expense of the CIG and the OGTT.
Two perfectly precise tests assessing the same physiological variable
would be perfectly correlated. Departures from perfect correlation can
be due to the following two factors:
1) underlying differences not
directly related to the variable of interest. These will manifest
themselves as systematic differences between subjects; for instance, in
the assessment of glycemic control, which is determined by the FPG and
OGTT in qualitatively different ways, the underlying correlation may
fall short of unity because of the influence of factors such as the
effect of gastric emptying and the influence of intestinal incretin
effects that differ between the two methodological approaches;
2) diminution of the underlying correlation may also arise from within-subject variation, and this is a
well-described statistical effect termed "attenuation," which may
be adjusted for by standard techniques to enable the estimation of the
underlying correlation (11). This adjustment depends on both test
imprecision and the degree of variation between subjects and so can be
expressed in terms of the test DRs.
Adjustment for attenuation will establish the degree to which the
underlying correlation differs from unity due to the factors detailed
in factor 1 above.
In this paper, the adjusted correlation coefficients between the
fasting glucose and the CIG approached unity and were only slightly
lower between these tests and the 2-h OGTT. Although additional factors
unrelated to the homeostatic control of plasma glucose, such as
variable gastric emptying, would contribute to this, the relatively
high overall intercorrelations and the simplicity and cheapness of the
FPG would recommend this as the measure of choice for the assessment of
glucose tolerance.
When two tests have an underlying structural relationship between their
measurements (or transformations of these) that is linear, it can be
instructive to determine the equation of the "line of
equivalence." Linear regression, although often used, is
unsatisfactory since it assumes perfect precision in the independent variable and is subject to regression dilution. There have been many
approaches to deriving an unbiased estimation, as comprehensively reviewed by Riggs et al. (27), and the "weighted least squared perpendicular distance" approach (Riggs' "PW" method) has
been used in this paper. In the present study, the assumption of
linearity, within the limits set by the imprecision of the tests, was
supported by visual inspection of plots of the relationships (data not
shown). The FPG threshold concentrations recommended by the American
Diabetes Association (4), based on studies of the prevalence of
retinopathy in three distinct populations, were confirmed as equivalent
to the established OGTT thresholds for IGT and diabetes.
We also calculated the SD of
loge(DR) estimates for different
numbers of replicate tests, using the Taylor series expansion, subject
to a constraint on the total number of tests performed. For a study
comparing two methods using a total of 60 tests, the power to detect a
difference between DRs of 2.5 and 4.0 is 58% using 30 subjects and two
replicate tests, rising to 72% with 20 subjects and three tests, a relative
increase in power of 24%. Further increasing the number of replicates
gives smaller increases in power, e.g., for four tests in 15 subjects
the power is 78%, with even smaller gains for more than four
replicates. There would appear to be some advantage in using three,
rather than two, replicate tests in each subject, but there is little
advantage in increasing the number beyond three given the need to
obtain sufficient subjects to adequately cover the range of interest.
This study showed that, with OGTT done under carefully controlled
conditions, with a reproducibility that is somewhat better than
reported elsewhere, it was not possible to distinguish between the WHO
thresholds for IGT and diabetes at a 5% significance level, and this
was also the case for FPG, even when the mean of three samples at 5-min
intervals was assessed (the between-sample variation being relatively
small in relation to the between-day variation). It is therefore not
surprising, with two thresholds close together, that repeat
measurements often give a change of status. Improved classification could
be achieved by taking the mean of determinations on more than one day.
However, although these classifications may be useful for
epidemiological purposes, for practical purposes the actual OGTT or FPG
value is more informative (31, 32).
In summary, we have outlined a comprehensive but simple methodology for
the comparison of imprecise tests, encouraging
1) comparison of test
discrimination, expressed as the DR,
2) the evaluation of the degree of
agreement between tests based on correlation coefficients adjusted for
attenuation, and 3) in the case of a linear relationship between test results (or their mathematical transformations), the use of an unbiased method for estimating the
underlying equation. For such a comparison study, it is important to
determine the within-subject variation of each test as well as the
variation between subjects. Application of these methods to various
tests of glucose tolerance demonstrated similar discrimination, acceptable agreement, and an unbiased estimation of the FPG values equivalent to those of the 2-h OGTT. The latter agree closely with the
outcome-derived thresholds currently being recommended by the American
Diabetes Association. However, because the thresholds for IGT and
diabetes are within measurement error and cannot be reliably
distinguished, the absolute 2-h OGTT or FPG is more informative than
the categorization.
 |
APPENDIX I. STATISTICAL METHODS |
This section presents a more detailed mathematical treatment of the
concepts outlined in METHODS.
Discrimination Between Subjects
Statistical model. We consider the
comparison of different tests, each measuring the same physiological
variable. Each test is performed k
times on each of n subjects, with the
order of the tests being randomized for each subject.
Considering first a single test in isolation, an appropriate model is
$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij} \tag{1}$$
where $X_{ij}$ is the result of the test performed for the
j'th time on the i'th subject, µ is the overall mean
value of the variable in question on the scale of the current test, and
$\alpha_i$ is the true value of the i'th subject, measured as a deviation
from the mean (thus $\sum_{i=1}^{n} \alpha_i = 0$);
$\varepsilon_{ij}$ represents day-to-day
variation, which includes both biological and assay variation; the
$\varepsilon_{ij}$ are assumed to be
independent, normally distributed random variables with mean zero and
variance $\sigma^2$.
Equation 1 is the standard one-way ANOVA model.
The assumption of constant variance (or homoscedasticity) of the error
term, $\sigma^2$, can be checked
graphically. If k ≥ 5, we can
calculate the quartiles of the k
replicate test results for each subject and plot log(interquartile range) against log(median) (14). If 2 < k < 5, plot the SD of the
k replicates against the mean for each
subject (23), and if k = 2 plot the
differences (1st − 2nd replicate) between the pairs of tests
against the subject means (6). If the assumption of homoscedasticity
holds, the plotted measure of variation [log(interquartile range), SD, or difference] should be approximately constant
across the range of subjects. If there appears to be a systematic
relationship between the measure of variation and subject medians or
means, this can often be removed by mathematically transforming the
results of
Xij.
A common case in physiological measurements is where the SD increases
in direct proportion to the mean, when a log transformation of the
Xij
stabilizes the variance and
log(Xij)
can then be used in place of
Xij
in Eq. 1. Other transformations can be
considered for different relationships between the subject SDs and
means (14, 23).
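The check for 2 < k < 5 can be sketched numerically as well as graphically. The Python fragment below is an illustration only: the data are invented, the function name is hypothetical, and the tabulated (mean, SD) pairs are simply the points that would be plotted. The invented data have a spread proportional to the level, so the SDs rise with the means on the raw scale and become constant after a log transformation.

```python
from math import log
from statistics import mean, stdev


def spread_vs_level(results):
    """Return one (mean, SD) pair per subject from a list of replicate lists."""
    return [(mean(reps), stdev(reps)) for reps in results]


# Hypothetical data: 3 replicates per subject, spread growing with the level.
raw = [[4.0, 4.4, 4.2], [6.0, 6.6, 6.3], [10.0, 11.0, 10.5], [16.0, 17.6, 16.8]]
logged = [[log(x) for x in reps] for reps in raw]

raw_pairs = spread_vs_level(raw)
log_pairs = spread_vs_level(logged)

for label, pairs in (("raw", raw_pairs), ("log", log_pairs)):
    print(label, [(round(m, 2), round(s, 3)) for m, s in pairs])
```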
It is also possible to check the assumption that the
$\varepsilon_{ij}$ have a normal distribution
by plotting the ordered residuals from fitting Eq. 1 against standard normal deviates in a "normal probability plot" (3). However, the ANOVA procedures used here are
fairly robust to moderate departures from a normal distribution and can
be used without such sophisticated checking, provided homoscedasticity
holds and the data do not exhibit marked skewness.
In these experiments, subjects are selected to span a range of glucose
tolerance and are not chosen randomly from a prespecified population.
The subject effects $\alpha_i$ are
therefore considered as "fixed" rather than "random" effects.
DR. As a measure of discrimination
between subjects, we define the true DR,
$\delta$, as the ratio of the
underlying SDB to the SDW
$$\delta = \frac{SD_B}{SD_W} = \frac{\sqrt{\sum_{i=1}^{n} \alpha_i^2/(n-1)}}{\sigma} \tag{2}$$
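Unbiased estimates of the between- and within-subject variances are
given by (MSB − MSW)/k
and MSW, respectively, where
MSB and MSW are the between- and within-subject
mean squares from a standard one-way ANOVA, i.e.
$$MSB = \frac{k \sum_{i=1}^{n} (M_i - M)^2}{n - 1} \tag{3}$$
$$MSW = \frac{\sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij} - M_i)^2}{n \times (k - 1)} \tag{4}$$
and
$M_i = \sum_{j=1}^{k} X_{ij}/k$
and $M = \sum_{i=1}^{n} M_i/n$ are
the subject and overall means.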
We then estimate the true DR, $\delta$,
empirically as the ratio of the estimated between- to
within-subject standard deviations
$$DR = \sqrt{\frac{MSB - MSW}{k \times MSW}} \tag{5}$$
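The calculation of the DR from replicate data can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function name and the duplicate glucose values are hypothetical, and returning a DR of zero when MSB does not exceed MSW is a choice made here for the sketch.

```python
from math import sqrt


def discriminant_ratio(results):
    """Return (MSB, MSW, DR) for n subjects, each a list of k replicate results."""
    n = len(results)
    k = len(results[0])
    if n < 2 or k < 2 or any(len(reps) != k for reps in results):
        raise ValueError("need n >= 2 subjects with the same k >= 2 replicates each")
    subject_means = [sum(reps) / k for reps in results]
    overall_mean = sum(subject_means) / n
    # Between- and within-subject mean squares of the one-way ANOVA.
    msb = k * sum((m - overall_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2
              for reps, m in zip(results, subject_means)
              for x in reps) / (n * (k - 1))
    if msw == 0:
        raise ValueError("no within-subject variation: DR is undefined")
    if msb <= msw:
        return msb, msw, 0.0   # between-subject variance estimate not positive
    return msb, msw, sqrt((msb - msw) / (k * msw))


# Hypothetical duplicate fasting glucose results (mmol/l) for five subjects.
fpg = [[4.8, 5.0], [5.6, 5.4], [6.3, 6.7], [7.9, 7.5], [9.8, 10.2]]
msb, msw, dr = discriminant_ratio(fpg)
print(f"MSB = {msb:.3f}, MSW = {msw:.3f}, DR = {dr:.2f}")
```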
The DR is algebraically related to the ICC, which is commonly used as a
measure of the reliability of tests (5, 30)
$$ICC = \frac{DR^2}{1 + DR^2} \tag{6}$$
However, the methodology developed for ICCs is in the context of a
random effects model, rather than the fixed effects model used here, so
published results for SDs and confidence intervals cannot be used. The
DR gives a measure that is intuitively closer to the idea of
discrimination between subjects, whereas the ICC is a measure of
correlation. In addition, for tests with good discrimination, ICC
values tend to cluster unhelpfully close to their upper limit of one.
Furthermore, there is no simple practicable test available for the
comparison of ICCs from different tests in a random effects model. A
recently described method for the comparison of two ICCs does not
extend to the comparison of more than two tests and was only validated
for studies employing repeated tests in 100 subjects or more (2). We
have derived straightforward expressions for the SD and confidence
intervals of the DR in a fixed effects model and a test for the
equivalence of DRs in a comparison study.
Confidence limits for DR. Confidence
limits for the DR can be found by noting that
$$DR = \sqrt{\frac{F_0 - 1}{k}} \tag{7}$$
where
$F_0 = MSB/MSW$
is the standard F statistic from the
one-way ANOVA. $F_0$
has a noncentral F distribution with
degrees of freedom
$\nu_1 = n - 1$ and
$\nu_2 = n \times (k - 1)$ and noncentrality
parameter $\lambda$. The parameter $\lambda$
can be estimated by $(n - 1) \times k \times DR^2$, and a 95% confidence
interval for the true DR, $\delta$,
is
$$\sqrt{\frac{F_L - 1}{k}} \le \delta \le \sqrt{\frac{F_U - 1}{k}} \tag{8}$$
where
$F_L$ and
$F_U$ are the lower
and upper 2.5% points of the noncentral F distribution
(17).
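Noncentral F tables are not widely
available (18), and a reliable approximation to
$F_L$ and
$F_U$ can be made
using a central F distribution (25)
$$F_L \approx \left(1 + \frac{\lambda}{\nu_1}\right) F'_L, \qquad F_U \approx \left(1 + \frac{\lambda}{\nu_1}\right) F'_U$$
where
$F'_L$ and
$F'_U$ are
the lower and upper 2.5% points of a central
$F_{\nu,\nu_2}$
distribution and
$$\nu = \frac{(\nu_1 + \lambda)^2}{\nu_1 + 2\lambda}$$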
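As a hedged illustration of the central F approximation, the Python sketch below computes the two quantities it needs in the form in which the approximation is usually stated: a scale factor 1 + λ/ν1 and an adjusted numerator degrees of freedom (ν1 + λ)²/(ν1 + 2λ). The function name and the study size are hypothetical. The central F percentage points themselves would still be taken from tables or a statistics package, so they are not computed here.

```python
def patnaik_central_f(nu1, lam):
    """Scale factor and adjusted numerator df of the central F approximation."""
    scale = 1 + lam / nu1
    nu_adj = (nu1 + lam) ** 2 / (nu1 + 2 * lam)
    return scale, nu_adj


n, k, dr = 30, 2, 3.0              # hypothetical study size and DR estimate
nu1, nu2 = n - 1, n * (k - 1)
lam = nu1 * k * dr ** 2            # estimated noncentrality parameter
scale, nu_adj = patnaik_central_f(nu1, lam)

# Sanity check: the scaled central F has the same mean as the noncentral F.
approx_mean = scale * nu2 / (nu2 - 2)
exact_mean = nu2 * (nu1 + lam) / (nu1 * (nu2 - 2))
print(f"scale = {scale:.3f}, adjusted df = {nu_adj:.2f}")
print(f"mean (approximation) = {approx_mean:.4f}, mean (noncentral F) = {exact_mean:.4f}")
```

Given the two returned values, approximate limits for the noncentral F follow by multiplying the lower and upper 2.5% points of the central F on (adjusted df, ν2) degrees of freedom by the scale factor.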
Comparison of DRs. We derived a test
for the equivalence of several DRs by assuming the following
model
$$X_{ijh} = \mu_h + \alpha_{ih} + \varepsilon_{ijh} \tag{9}$$
where $X_{ijh}$
is now the result of the h'th test (h = 1,...,r)
performed for the j'th time on the
i'th subject,
$\mu_h$ is the mean value of the
variable in question on the scale of test
h, and
$\alpha_{ih}$ is the true value of the
i'th subject measured using test
h
($\sum_{i=1}^{n} \alpha_{ih} = 0$ for each test
h);
$\varepsilon_{ijh}$ are again assumed to be
independent, normally distributed random variables with zero mean and
variance $\sigma_h^2$.
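Using this model, the DRs for each test are statistically independent.
We used simulations (see APPENDIX II) to show that the
distribution of log_e(DR) is
approximately normal if the model assumptions hold. We then used
Cochran's theorem (19) to show that, when the true DRs are equal, the statistic
Q has a
$\chi^2$ distribution with
r − 1 degrees of freedom,
where
$$Q = \sum_{h=1}^{r} \frac{(L_h - \bar{L})^2}{s_h^2}, \qquad \bar{L} = \frac{\sum_{h=1}^{r} L_h/s_h^2}{\sum_{h=1}^{r} 1/s_h^2} \tag{10}$$
$L_h$ is log_e(DR) for test h and $s_h$ is its estimated SD.
The DRs are unequal at a significance level of 0.05 if
Q exceeds the 95% point of the
$\chi^2_{r-1}$ distribution.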
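A test of this kind can be sketched as follows. The Python fragment is an assumption-laden illustration, not a transcription of Eq. 10: it uses the usual inverse-variance weighted form of the statistic (squared deviations of log_e(DR) about their weighted mean, each divided by its variance). The DRs, their SDs, and the function name are hypothetical; the critical values are the standard upper 5% points of the chi-squared distribution.

```python
from math import log

# Upper 5% points of the chi-squared distribution for 1 to 4 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def q_statistic(drs, sds):
    """Weighted sum of squared deviations of log_e(DR) about its weighted mean."""
    logs = [log(d) for d in drs]
    weights = [1 / s ** 2 for s in sds]        # weight = reciprocal of the variance
    mean_log = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    return sum(w * (l - mean_log) ** 2 for w, l in zip(weights, logs))


# Hypothetical DRs for three tests with hypothetical SDs of log_e(DR).
drs = [2.6, 3.4, 4.2]
sds = [0.16, 0.15, 0.15]
q = q_statistic(drs, sds)
df = len(drs) - 1
print(f"Q = {q:.2f} on {df} df; exceeds the 5% critical value: {q > CHI2_95[df]}")
```

Because each term is divided by the squared SD, an overestimated SD shrinks Q and makes the test conservative.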
We derived an expression for sh,
the estimated SD of Lh, from the
mean and variance of the noncentral F distribution using a Taylor series expansion; details are given in
APPENDIX III.
Alternative models. The models we have
used, given by Eqs. 1 and 9, for observations from the
comparison study have been deliberately chosen for their relative
simplicity. Although some of the algebra is intricate, all of the
calculations we have presented can be easily implemented using
spreadsheet software and do not require the use of specialized
statistical packages. However, some of our model assumptions do warrant
further discussion.
First, the choice of a fixed rather than a random effects model is
unusual in this kind of context. However, subject selection in our
study is clearly nonrandom in that we have deliberately chosen roughly
equal numbers of normal glucose tolerance, IGT, and diabetic subjects.
Even within each of these subpopulations, sampling is unlikely to be
random as subjects are sought to span the range of interest as evenly
as possible, which is likely to result in oversampling from the
extremes of the distribution. Such a sampling scheme is likely to
produce a DR that is higher than that which would be obtained from a
random sample from the population, and its use is restricted to
comparison with other tests in the same study. It is not appropriate to
formally compare test DRs that have been derived from different populations.
A population consisting of clearly defined subpopulations might be best
treated using a mixed model, or even structural equation modelling.
This would require more sophisticated analytical techniques and a
larger scale of comparison study than we have presented in this paper.
In the particular example presented here, however, the
"subgroups" are not clearly separable but are arbitrarily defined
by thresholds in a continuous spectrum. In this situation, which is
relatively common in physiology, the approach taken here would be
adequate, relatively simple, and practical. A formal comparison of the
use of more complex models and the simple approach made here is beyond
the scope of this paper.
The assumption of independence of the error terms
$\varepsilon_{ijh}$ is unlikely to be
completely true since most biological measurements exhibit some degree
of "autocorrelation," i.e., correlation between successive
measurements made on the same subject. In the context of these studies,
where repeat measurements are almost always made on different days and
often several days or even weeks apart, the magnitude of such
autocorrelation is likely to be small compared with the total
within-subject variation in which we are interested. Furthermore,
accurately estimating autocorrelation coefficients would be difficult
in relatively small studies, and the degree of mathematical complexity
would increase such that specialized statistical methodology and
software would be needed, rendering the procedures inaccessible to many
researchers. However, our methodology might not be appropriate where
repeat measurements were made within the same day or where other
biological reasons existed for suspecting nonnegligible autocorrelation.
Correlation Between Pairs of Tests
The nature of the relationship between a pair of tests can be examined
graphically by plotting the subject means for the first test against
those for the second. In many cases, particularly after transformations
to ensure homoscedasticity, the relationship will be approximately
linear, and the degree of correlation can be assessed using the Pearson
product-moment correlation coefficient, r (3).
In the model of Eq. 9 for
r = 2 tests, we are interested in
the correlation between the underlying subject means
$\alpha_{i1}$ and $\alpha_{i2}$.
However, in the presence of within-subject variation, the sample
correlation coefficient, i.e., the correlation between the two sets of
observed subject means, underestimates the true correlation between the
tests. This effect is known as attenuation and means that, even if the
true subject means
$\alpha_{i1}$ and $\alpha_{i2}$
were perfectly correlated, the correlation between the observed subject
means would be less than unity because of the random fluctuations due
to within-subject measurement variation.
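Standard results from measurement error theory (11) show that the
correlation between two measurements, both of which are subject to
error, is attenuated by the factor
$$\sqrt{\rho_1 \rho_2}$$
where
$\rho_1$ and
$\rho_2$ are the reliability
coefficients of the two tests. From Eq. 6
$$\rho = \frac{DR_M^2}{1 + DR_M^2}$$
where
$DR_M$ is the DR of the means $M_i$,
i = 1,...,n, of the
k replicate measurements on each
subject, rather than of the individual measurements themselves.
Taking the mean of Eq. 1 over the
k replicates yields
$$M_i = \mu + \alpha_i + \bar{\varepsilon}_i$$
where
$\bar{\varepsilon}_i$ is normally distributed with
mean of zero and variance
$\sigma^2/k$.
Thus, in Eq. 2, $\sigma$
must be replaced
by $\sigma/\sqrt{k}$, which we estimate by
$\sqrt{MSW/k}$, and from
Eq. 5
$$DR_M = \sqrt{\frac{MSB - MSW}{MSW}} = \sqrt{k} \times DR \tag{11}$$
Thus
$$\rho = \frac{k \times DR^2}{1 + k \times DR^2}$$
The Pearson correlation coefficient r
can be adjusted for attenuation by dividing it by
$\sqrt{\rho_1 \rho_2}$, i.e.
$$r_{adj} = \frac{r}{\sqrt{\rho_1 \rho_2}}$$
where
$r_{adj}$ is the
adjusted r.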
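The adjustment can be illustrated with a short Python sketch. It assumes the relationship described above, in which the reliability of the mean of k replicates is k × DR²/(1 + k × DR²); the function names and the numerical inputs are hypothetical.

```python
from math import sqrt


def reliability_of_means(dr, k):
    """Reliability coefficient of the mean of k replicates for a test with this DR."""
    return k * dr ** 2 / (1 + k * dr ** 2)


def adjusted_r(r, dr1, dr2, k):
    """Pearson r between subject means, corrected for attenuation."""
    rho1 = reliability_of_means(dr1, k)
    rho2 = reliability_of_means(dr2, k)
    return r / sqrt(rho1 * rho2)


# Hypothetical example: observed r = 0.85 between the subject means of two
# tests done in duplicate, with DRs of 2.6 and 4.2.
r_adj = adjusted_r(0.85, 2.6, 4.2, k=2)
print(f"adjusted r = {r_adj:.3f}")
```

Because r and the DRs are all estimates, an adjusted r computed in this way can slightly exceed unity through sampling error alone.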
In cases where the relationship between the tests is clearly nonlinear,
the Spearman rank correlation coefficient
rs should be used
in place of r to assess the
comparability of the tests. However, there is no universal formula for
the attenuation of rs in the
presence of measurement error.
Unbiased Estimation of Linear Relationship
In the case where the relationship between a pair of tests is linear,
it may be useful to obtain unbiased estimates of the gradient and
intercept of the line. Linear regression gives biased estimates because
it only considers errors in the dependent variable, and clearly both
tests here are subject to error; the gradient is always underestimated,
and regression of subject means from test
1 on those from test 2 clearly gives a different relationship to that of test
2 on test 1.
The method that we have chosen to estimate the linear relationship
between the subject mean measurements from test
1 and those from test
2 is that of "perpendicular least squares, properly weighted." This essentially minimizes the sum of the squared
perpendicular distances between the observed data and the fitted line,
but with an adjustment that makes the method invariant to linear
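transformations of the measurement scales. If
$M_{i1}$ and $M_{i2}$,
i = 1,...,n, are the subject means (over
the k replicate tests), then the
estimated gradient, with test 2 plotted against test 1, is
$$b = \frac{S_{22} - \theta S_{11} + \sqrt{(S_{22} - \theta S_{11})^2 + 4 \theta S_{12}^2}}{2 S_{12}}$$
where
$S_{11} = \sum_{i=1}^{n} (M_{i1} - M_1)^2$;
$S_{22} = \sum_{i=1}^{n} (M_{i2} - M_2)^2$;
and $S_{12} = \sum_{i=1}^{n} (M_{i1} - M_1) \times (M_{i2} - M_2)$.
$M_1$ and $M_2$ are the overall means for each
test, i.e., $M_h = \sum_{i=1}^{n} M_{ih}/n$,
h = 1, 2, and
$\theta = \sigma_2^2/\sigma_1^2$ is the ratio of the within-subject variances of the two tests.
The $\sigma_1^2$ and $\sigma_2^2$ are estimated from their respective
MSW, so that we estimate $\theta$ by
$$\hat{\theta} = \frac{MSW_2}{MSW_1}$$
The intercept is then estimated as
$$a = M_2 - b \times M_1$$
This method is described and contrasted with other methods by Riggs et
al. (27), where it is shown to perform well under a range of values of
$\theta$ when the correlation between the
$M_{i1}$ and $M_{i2}$
is fairly high (above ~0.5) and $\theta$
is estimated fairly precisely.
Such conditions are likely to apply in these experiments: the
$M_{i1}$ and $M_{i2}$
are measuring the same underlying physiological variable so the
correlation will be high, and
$\sigma_1$ and $\sigma_2$, and hence $\theta$, are
directly estimated from the repeat measurements using each test.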
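A minimal Python sketch of a weighted perpendicular fit is given below. It is an illustration under stated assumptions, not the authors' implementation: it uses the textbook errors-in-both-variables (Deming-type) gradient with the error-variance ratio estimated as MSW of test 2 over MSW of test 1, treats test 2 as the vertical axis, and assumes the same number of replicates for both tests. The function name and data are hypothetical.

```python
from math import sqrt


def weighted_perpendicular_fit(m1, m2, msw1, msw2):
    """Fit m2 = a + b * m1 when both sets of subject means carry error.

    m1, m2: subject means for test 1 and test 2 (same subjects, same order).
    msw1, msw2: within-subject mean squares of the two tests.
    """
    n = len(m1)
    mean1, mean2 = sum(m1) / n, sum(m2) / n
    s11 = sum((x - mean1) ** 2 for x in m1)
    s22 = sum((y - mean2) ** 2 for y in m2)
    s12 = sum((x - mean1) * (y - mean2) for x, y in zip(m1, m2))
    if s12 == 0:
        raise ValueError("zero covariance: gradient is undefined")
    ratio = msw2 / msw1                      # estimated error-variance ratio
    d = s22 - ratio * s11
    b = (d + sqrt(d ** 2 + 4 * ratio * s12 ** 2)) / (2 * s12)
    a = mean2 - b * mean1
    return a, b


# Hypothetical subject means: fasting glucose (test 1) and 2-h glucose (test 2).
fasting = [4.9, 5.5, 6.5, 7.7, 10.0]
two_hour = [5.8, 7.9, 10.6, 13.2, 19.5]
a, b = weighted_perpendicular_fit(fasting, two_hour, msw1=0.06, msw2=0.9)
print(f"test 2 = {a:.2f} + {b:.2f} x test 1")
```

Rescaling one test, together with its MSW, rescales the fitted gradient by the same factor, which is the invariance property mentioned above.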
 |
APPENDIX II. SIMULATIONS |
We used simulations to examine the distribution of the DR and
loge(DR) and to check the accuracy
of the Taylor series formula for the SD of
loge(DR), given the form of model
described by Eq. 1. These were
performed for all combinations of the following values of
n (number of subjects),
k (number of replicate tests), and
$\delta$ (the true DR)
For each of these combinations of n,
k, and $\delta$, the following procedure
was performed.
1) An arbitrary overall mean µ was
chosen, along with a set of n equally
spaced subject effects $\alpha_i$
chosen symmetrically around zero so that
$\sum_{i=1}^{n} \alpha_i = 0$.
2) The within-subject variance, $\sigma^2$,
was calculated as
$$\sigma^2 = \frac{\sum_{i=1}^{n} \alpha_i^2}{(n - 1) \times \delta^2}$$
3) For each
i = 1,...,n and
j = 1,...,k, a random observation
$\varepsilon_{ij}$ was generated from a normal
distribution with mean zero and variance $\sigma^2$. The
$X_{ij}$
were then generated from Eq. 1.
4) The DR and hence
log_e(DR) were calculated from the
$X_{ij}$
using Eqs. 3-5.
5) The SD of
log_e(DR) was calculated from the
Taylor series approximation (Eq. 17 of
APPENDIX III), using the noncentrality parameter $\lambda$
evaluated from the DR estimate at the current step of the simulation.
6) Steps
3-5 were repeated
500 times, yielding a distribution of 500 values for each of DR,
log_e(DR), and SD of
log_e(DR).
7) The distributions of DR and
log_e(DR) were checked for
normality using the Shapiro-Wilk test and were plotted as histograms.
8) The true SD of
log_e(DR) was estimated from the
simulated distribution of
log_e(DR).
9) The distribution of Taylor series
estimates of the SD of log_e(DR)
was compared with the true SD by plotting the median, upper and lower
quartiles, and 5th and 95th percentiles against n
for different values of k and
$\delta$.
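Steps 1-4, 6, and 8 of this procedure can be sketched in Python as below. The sketch is illustrative only: the function name, the overall mean of 6.0, the seed, and the chosen n, k, and true DR are hypothetical, the within-subject SD is set from the between-subject SD divided by the true DR, and the Shapiro-Wilk test and the Taylor series comparison (steps 5, 7, and 9) are omitted.

```python
import random
from math import log, sqrt
from statistics import mean, stdev


def simulate_log_dr(n, k, true_dr, reps=500, seed=1):
    """Simulate log_e(DR) under Eq. 1 with equally spaced fixed subject effects."""
    rng = random.Random(seed)
    alphas = [i - (n - 1) / 2 for i in range(n)]       # equally spaced, sum to zero
    sigma = sqrt(sum(a * a for a in alphas) / (n - 1)) / true_dr
    mu = 6.0                                           # arbitrary overall mean
    out = []
    for _ in range(reps):
        data = [[mu + a + rng.gauss(0, sigma) for _ in range(k)] for a in alphas]
        means = [sum(row) / k for row in data]
        grand = sum(means) / n
        msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
        msw = sum((x - m) ** 2
                  for row, m in zip(data, means)
                  for x in row) / (n * (k - 1))
        if msb > msw:                                  # DR is undefined otherwise
            out.append(0.5 * log((msb - msw) / (k * msw)))
    return out


logs = simulate_log_dr(n=20, k=2, true_dr=3.0)
print(f"mean log_e(DR) = {mean(logs):.3f} (log_e of true DR = {log(3.0):.3f})")
print(f"SD of log_e(DR) = {stdev(logs):.3f}")
```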
Examination of P values from the
Shapiro-Wilk test showed some evidence that
loge(DR) was not quite normally
distributed (slightly >10% of the P
values examined were <0.05, but there was no apparent relationship
between low P values and
n, k,
or $\delta$). However, this was a marked improvement over the DR itself, for
which >50% of the P values were
<0.05. Histograms also showed the distribution of
loge(DR) to be symmetric, whereas
that of DR was markedly positively skewed (data not shown).
Log_e(DR) was deemed to be
sufficiently close to normal for use in the
$\chi^2$ test for equality of DRs.
Plots showed that the median Taylor series estimate for the SD of
log_e(DR) was generally within
±10% of the true value for n ≥ 10. However, for k = 2, the SD could be overestimated by as much as 20% for 10 ≤ n ≤ 20. The distribution of SDs is
positively skewed and, for n ≥ 10, the 5th percentile SD was at most 10% below the true value. Overestimation could
be more marked, but even the upper quartile SDs were within +25% of
the true value (data not shown). Because the
$\chi^2$ test statistic is a function
of the reciprocal of the SD, the test is conservative with respect to
overestimates of the SD, i.e., one is unlikely to wrongly reject the
null hypothesis (no difference between the DRs), but the test may not
be particularly sensitive to genuine differences for small values of
n, especially if
k = 2.
Estimates of the SD become very inaccurate for
n < 10, and the approximation should
not be used in this range. However, we would not recommend performing
an evaluation study of this kind on such a small number of subjects,
since the objective is to characterize the performance of the tests
over a reasonable range of the variable of interest.
 |
APPENDIX III. SD OF LOGE(DR) USING TAYLOR SERIES
APPROXIMATION |
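We derived an estimate of the variance (and hence the SD) of
log_e(DR) using a first-order
Taylor series expansion. From Eq. 7
$$DR = \sqrt{\frac{F_0 - 1}{k}}$$
where
$F_0 = MSB/MSW$
has a noncentral F distribution with
degrees of freedom
$\nu_1 = n - 1$ and
$\nu_2 = n \times (k - 1)$ and noncentrality parameter $\lambda$.
Let LDR = log_e(DR), so that
$$LDR = \frac{1}{2} \log_e\left(\frac{F_0 - 1}{k}\right)$$
Expanding LDR as a Taylor series in
$F_0$ about its mean
$F'_0$ gives, to first
order
$$LDR \approx LDR' + (F_0 - F'_0) \times \frac{d(LDR)}{dF'_0} \tag{12}$$
where
LDR' is LDR evaluated at
$F'_0$ and
$d(LDR)/dF'_0$ is also
evaluated at $F'_0$. Hence
$$var(LDR) \approx \left[\frac{d(LDR)}{dF'_0}\right]^2 \times var(F_0) \tag{13}$$
where
var indicates variance. Now
$$\frac{d(LDR)}{dF'_0} = \frac{1}{2 (F'_0 - 1)} = \frac{1}{2 k \times DR'^2} \tag{14}$$
where DR' is DR evaluated at
$F'_0$. From general
properties of the noncentral F
distribution (17)
$$F'_0 = \frac{\nu_2 (\nu_1 + \lambda)}{\nu_1 (\nu_2 - 2)} \tag{15}$$
$$var(F_0) = \frac{2 \nu_2^2 \left[(\nu_1 + \lambda)^2 + (\nu_1 + 2\lambda)(\nu_2 - 2)\right]}{\nu_1^2 (\nu_2 - 2)^2 (\nu_2 - 4)} \tag{16}$$
Substituting for DR' from Eq. 15
into Eq. 14 and for
$d(LDR)/dF'_0$ from
Eq. 14 and
$var(F_0)$ from
Eq. 16 into Eq. 13 gives
$$var(LDR) \approx \frac{var(F_0)}{4 (F'_0 - 1)^2} = \frac{\nu_2^2 \left[(\nu_1 + \lambda)^2 + (\nu_1 + 2\lambda)(\nu_2 - 2)\right]}{2 (\nu_2 - 4)(\nu_2 \lambda + 2 \nu_1)^2} \tag{17}$$
which is evaluated by noting that
$\lambda = (n - 1) \times k \times \delta^2$
and
replacing
$\delta^2$ by
$DR^2$, i.e.
$$\lambda \approx (n - 1) \times k \times DR^2$$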
 |
ACKNOWLEDGEMENTS |
We are grateful for the assistance from Dr. Sue Manley and Nuala Walravens.
 |
FOOTNOTES |
This study was done with the aid of grants from Servier and the Alan & Babette Sainsbury Trust.
Address for reprint requests: J. C. Levy, Diabetes Research
Laboratories, Radcliffe Infirmary, Woodstock Rd., Oxford OX2 6HE, UK.
Received 30 December 1997; accepted in final form 15 October 1998.
 |
REFERENCES |
1. Allison, D. B. Limitations of coefficient of variation as index of measurement reliability. Nutrition 9: 559-560, 1993.
2. Alsawalmeh, Y. M., and L. S. Feldt. Test of hypothesis that the intraclass reliability coefficient is the same for two measurement procedures (Abstract). Appl. Psychol. Measurement 16: 195, 1992.
3. Altman, D. G. Practical Statistics for Medical Research (1st ed.). London: Chapman & Hall, 1991, p. 293-294.
4. American Diabetes Association Expert Committee. Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 20: 1183-1197, 1997.
5. Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19: 3-11, 1966.
6. Bland, J. M., and D. G. Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1: 307-310, 1986.
7. Bortheiry, A. L., D. A. Malerbi, and L. J. Franco. The ROC curve in the evaluation of fasting capillary blood glucose as a screening test for diabetes and IGT. Diabetes Care 17: 1269-1272, 1994.
8. Engelgau, M. M., T. J. Thompson, W. H. Herman, J. P. Boyle, R. E. Aubert, S. J. Kenny, A. Barran, E. S. Sous, and M. A. Ali. Comparison of fasting and 2-hour glucose and HbA1c levels for diagnosing diabetes: diagnostic criteria and performance revisited. Diabetes Care 20: 785-791, 1997.
9. Forrest, R. D., C. A. Jackson, and J. S. Yudkin. The abbreviated glucose tolerance test in screening for diabetes: the Islington Diabetes Survey. Diabet. Med. 4: 544-554, 1988.
10. Fraser, C. G., and E. K. Harris. Generation and application of data on biological variation in clinical chemistry. Crit. Rev. Clin. Lab. Sci. 27: 409-437, 1989.
11. Fuller, W. A. Measurement Error Models (1st ed.). New York: Wiley, 1987, p. 4.
12. Ganda, O. P., J. L. Day, J. J. Connon, and R. E. Gleason. Reproducibility and comparative analysis of repeated intravenous and oral glucose tolerance tests. Diabetes 27: 715-725, 1978.
13. Harding, P. E., N. W. Oakley, and V. Wynn. Reproducibility of oral glucose tolerance data in normal and mildly diabetic subjects. Clin. Endocrinol. Metab. 2: 387-395, 1973.
14. Hoaglin, D. C., F. Mosteller, and J. W. Tukey. Understanding Robust and Exploratory Data Analysis. New York: Wiley, 1983.
15. Holman, R. R., and R. C. Turner. The basal plasma glucose: a simple, relevant index of maturity-onset diabetes. Clin. Endocrinol. Metab. 14: 279-286, 1980.
16. Hosker, J. P., D. R. Matthews, A. S. Rudenski, M. A. Burnett, P. Darling, E. G. Bown, and R. C. Turner. Continuous infusion of glucose with model assessment: measurement of insulin resistance and beta-cell function in man. Diabetologia 28: 401-411, 1985.
17. Johnson, N. L., S. Kotz, and N. Balakrishnan. Noncentral χ² distributions. Noncentral F distributions. In: Continuous Univariate Distributions (2nd ed.). Chichester, UK: Wiley, 1995, vol. 2, p. 433-434, 480-482.
18. Lachenbruch, P. A. The non-central F distribution: extension of Tang's tables (Abstract). Ann. Math. Stat. 37: 744, 1966.
19. Lindgren, B. W. Linear models and analysis of variance. In: Statistical Theory (3rd ed.). New York: Macmillan, 1976, p. 525-528.
20. McCance, D. R., R. L. Hanson, M. A. Charles, L. T. Jacobsson, D. J. Pettitt, P. H. Bennett, and W. C. Knowler. Comparison of tests for glycated haemoglobin and fasting and two hour plasma glucose concentrations as diagnostic methods for diabetes. Br. Med. J. 308: 1323-1328, 1994.
21. McDonald, G. W., G. F. Fisher, and C. Burnham. Reproducibility of the oral glucose tolerance test. Diabetes 14: 473-480, 1965.
22. Metropolitan Life Insurance Company. Net weight standard for men and women. Stat. Bull. Metrop. Insur. Co. 40: 1-4, 1959.
23. Montgomery, D. C. Design and Analysis of Experiments (3rd ed.). Singapore: Wiley, 1991, p. 103-108.
24. Olefsky, J. M., and G. M. Reaven. Insulin and glucose responses to identical oral glucose tolerance tests performed forty-eight hours apart. Diabetes 23: 449-453, 1974.
25. Patnaik, P. B. The non-central χ² and F distributions and their applications. Biometrika 36: 202-232, 1949.
26. Riccardi, G., O. Vaccaro, A. Rivellese, S. Pignalosa, L. Tutino, and M. Mancini. Reproducibility of the new diagnostic criteria for impaired glucose tolerance. Am. J. Epidemiol. 121: 422-429, 1985.
27. Riggs, D. S., J. A. Guarnieri, and S. Addelman. Fitting straight lines when both variables are subject to error. Life Sci. 22: 1305-1360, 1978.
28. Saad, M. D., W. C. Knowler, D. J. Pettitt, R. G. Nelson, D. M. Mott, and P. H. Bennett. The natural history of impaired glucose tolerance in the Pima Indians. N. Engl. J. Med. 319: 1500-1506, 1988.
29. Sartor, G., B. Schersten, S. Carlstrom, A. Melander, A. Norden, and G. Persson. Ten year follow-up of subjects with impaired glucose tolerance: prevention of diabetes by tolbutamide and diet regulation. Diabetes 29: 41-49, 1980.
30. Shrout, P. E., and J. L. Fleiss. Intraclass correlations: use in assessing rater reliability. Psychol. Bull. 86: 420-428, 1979.
31. Turner, R. C., R. R. Holman, D. R. Matthews, S. P. O'Rahilly, A. S. Rudenski, and W. J. Braund. Diabetes nomenclature: classification or grading of severity? Diabet. Med. 3: 216-220, 1986.
32. Turner, R. C., J. I. Mann, R. D. Simpson, E. Harris, and R. Maxwell. Fasting hyperglycaemia and relatively unimpaired meal responses in mild diabetes. Clin. Endocrinol. Metab. 6: 253-264, 1977.
33. WHO Expert Committee on Diabetes Mellitus. Second report. World Health Organisation Technical Report Series (2nd. report), 1985, p. 727.
34. Zweig, M. H., and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 39: 561-577, 1993.
Am J Physiol Endocrinol Metab 276(2):E365-E375
0002-9513/99 $5.00
Copyright © 1999 the American Physiological Society