a Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, Canada.
Reprint requests: E Bartfay, Radiation Oncology Research Unit, Apps Level 4, Kingston General Hospital, Kingston, Ontario, Canada, K7L 2V7. E-mail: emma.bartfay{at}krcc.on.ca
Abstract |
---|
Methods Simulation studies were used to compare the coverage levels and widths of confidence intervals. Sample size requirements were compared for multinomial and dichotomous data. We illustrate our results using a published data set on drinking habits that assesses agreement between primary and proxy respondents.
Results Our results show that when multinomial data are treated as dichotomous, not only do the expected confidence interval widths become greater, but the penalty in terms of larger sample size requirements for hypothesis testing can be severe.
Conclusion We conclude that there are clear advantages in preserving multinomial data on the original scale rather than collapsing the data into a binary trait.
Keywords Agreement, kappa statistic, sample size, confidence interval, epidemiological studies
Accepted 10 May 2000
Introduction |
---|
Most of the literature on agreement assessment, however, has focused on continuous or dichotomous outcome data.3–6 Nevertheless, recent interest in kappa7 reflects the importance of this statistic, which can be applied to more general problems. For instance, investigators in epidemiological studies often rely on proxy informants when primary subjects are unavailable to provide the needed information, particularly when study subjects are elderly or very young children.8,9 It has been suggested that the criteria for evaluating agreement between information obtained from primary subjects and their proxy respondents depend on their relationship, the research subject matter, and even the subjects' ethnicity.9–12 For example, wives seem to report their husbands' dietary intake more accurately, while husbands tend to be more accurate about their wives' alcohol consumption.13,14 Moreover, children can provide reliable information on the smoking habits of their cohabiting parents, whereas parents are not effective informants for evaluating their children's oral health status.15,16 In spite of the varying degrees of proxy-primary agreement reported in the literature, formal statistical evaluation has not been routinely used.9
One purpose of this paper is to show how the results of Donner and Eliasziw2 can be extended to provide a statistical procedure to construct confidence intervals about the kappa coefficient for multinomial data. In addition, we will demonstrate the consequence of collapsing multinomial data into a binary outcome measure. Our results show that this practice can be disadvantageous in terms of sample size requirements for hypothesis testing and for the expected confidence interval width. We illustrate these results using data on drinking habits from a previously published study.
Materials and Methods |
---|
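For subject t (t = 1, 2, ..., n), let xtj denote the number of the two ratings that fall into category j, j = 1, 2, ..., J. Conditional on the category probabilities, xt1, xt2, ..., xtJ are assumed to follow a multinomial distribution

$$
f(x_{t1}, \ldots, x_{tJ} \mid p_1, \ldots, p_J) = \frac{2!}{x_{t1}!\, x_{t2}! \cdots x_{tJ}!} \prod_{j=1}^{J} p_j^{x_{tj}}
$$

with parameters p1, p2, ..., pJ (p1 + p2 + ... + pJ = 1) and xt1 + xt2 + ... + xtJ = 2. The parameters p1, p2, ..., pJ represent the probability of a rating being classified into categories 1, 2, ..., J, respectively. We further assume that p1, p2, ..., pJ follow a Dirichlet distribution with parameters θ1, θ2, ..., θJ, and with density function given by

$$
f(p_1, \ldots, p_J) = \frac{\Gamma(\theta_1 + \theta_2 + \cdots + \theta_J)}{\Gamma(\theta_1)\,\Gamma(\theta_2) \cdots \Gamma(\theta_J)} \prod_{j=1}^{J} p_j^{\theta_j - 1}
$$

where θj > 0, j = 1, 2, ..., J. The joint distribution of xt1, xt2, ..., xtJ is Dirichlet-multinomial,17 which can be written as

$$
\Pr(x_{t1}, \ldots, x_{tJ}) = \frac{2!}{\prod_{j=1}^{J} x_{tj}!} \cdot \frac{\Gamma\!\left(\sum_{j=1}^{J} \theta_j\right)}{\Gamma\!\left(\sum_{j=1}^{J} \theta_j + 2\right)} \prod_{j=1}^{J} \frac{\Gamma(\theta_j + x_{tj})}{\Gamma(\theta_j)}
$$

If we let Pr(j, j′) be the probability that the first rating is in category j and the second rating is in category j′, the basic model can be written as

$$
\Pr(j, j') =
\begin{cases}
\dfrac{\theta_j(\theta_j + 1)}{\theta(\theta + 1)}, & j = j' \\[2ex]
\dfrac{\theta_j \theta_{j'}}{\theta(\theta + 1)}, & j \neq j'
\end{cases}
$$

where θ = θ1 + θ2 + ... + θJ.
Letting µj = θj/θ and κ = 1/(1 + θ), the Dirichlet-multinomial model can also be expressed as

$$
\Pr(j, j') =
\begin{cases}
\mu_j^2 + \kappa \mu_j (1 - \mu_j), & j = j' \\
(1 - \kappa)\, \mu_j \mu_{j'}, & j \neq j'
\end{cases}
$$

for all j, j′ = 1, 2, ..., J.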
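The two ways of writing the model can be checked numerically. The sketch below is only an illustration: the function names, the name `theta` for the Dirichlet parameters, and the parameter values are assumptions chosen here, not taken from the paper. It builds the J x J matrix of pair probabilities in both parameterizations and lets one confirm that they agree and that the probabilities sum to one.

```python
def pair_probabilities(mu, kappa):
    """Pr(j, j') for an ordered pair of ratings, in the (mu, kappa) form."""
    J = len(mu)
    return [
        [
            mu[j] ** 2 + kappa * mu[j] * (1.0 - mu[j]) if j == k
            else (1.0 - kappa) * mu[j] * mu[k]
            for k in range(J)
        ]
        for j in range(J)
    ]


def pair_probabilities_dirichlet(theta):
    """The same probabilities written with Dirichlet parameters theta_j."""
    total = sum(theta)
    denom = total * (total + 1.0)
    J = len(theta)
    return [
        [
            theta[j] * (theta[j] + 1.0) / denom if j == k
            else theta[j] * theta[k] / denom
            for k in range(J)
        ]
        for j in range(J)
    ]


# Hypothetical values, for illustration only.
mu = [0.2, 0.3, 0.5]
kappa = 0.4
total = (1.0 - kappa) / kappa        # inverts kappa = 1 / (1 + total)
theta = [m * total for m in mu]      # inverts mu_j = theta_j / total

by_kappa = pair_probabilities(mu, kappa)
by_theta = pair_probabilities_dirichlet(theta)
```

Because the discordant probabilities carry the factor (1 - kappa), kappa = 1 puts all mass on the diagonal, while kappa = 0 reduces the matrix to the product of the marginal probabilities.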
The coefficient of interobserver agreement defined above has a parallel interpretation as the correlation between any two subjects within a litter in toxicological studies.18
Each pair of ratings may be regarded as falling into one of the J(J + 1)/2 classifications (see Table 1 for data layout). Letting ni denote the number of subjects in classification i, i = 1, 2, ..., J(J + 1)/2, the log-likelihood function can be written as
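$$
\log L = \sum_{i=1}^{J(J+1)/2} n_i \log \mathrm{Pr}_i(\kappa)
$$

where Pri(κ) denotes the probability of classification i under the model above.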
In order to construct a one degree of freedom GOF test, we may further combine all discordant cells into a single cell. The modified log-likelihood function may then be expressed as
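$$
\log L_M = \sum_{i=1}^{J+1} m_i \log \mathrm{Pr}'_i(\kappa)
$$

where mi = ni for concordant cells (i = 1, 2, ..., J), mJ+1 represents the sum of all discordant cells, and Pri′(κ) is the corresponding cell probability. In the next three subsections, we show how to construct confidence intervals for κ, test hypotheses concerning κ, and estimate the sample size required for hypothesis testing.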
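As a minimal sketch of the collapsed layout, the code below computes the J concordant cell probabilities and the single pooled discordant probability, then evaluates the modified log-likelihood up to its additive multinomial constant. The function names and the counts are hypothetical, introduced only for illustration.

```python
import math


def collapsed_probabilities(mu, kappa):
    """J concordant cell probabilities followed by the pooled discordant cell."""
    concordant = [m * m + kappa * m * (1.0 - m) for m in mu]
    discordant = 1.0 - sum(concordant)   # equals (1 - kappa) * (1 - sum(mu_j^2))
    return concordant + [discordant]


def modified_log_likelihood(m_counts, mu, kappa):
    """Sum of m_i * log Pr'_i(kappa); the multinomial constant is omitted."""
    probs = collapsed_probabilities(mu, kappa)
    return sum(m * math.log(p) for m, p in zip(m_counts, probs) if m > 0)


# Hypothetical parameters and counts (three concordant cells, one discordant).
mu = [0.2, 0.3, 0.5]
probs = collapsed_probabilities(mu, 0.2)
loglik = modified_log_likelihood([10, 15, 20, 13], mu, 0.2)
```

Pooling the discordant cells leaves J + 1 cells in total, which is what the one degree of freedom GOF statistic is built on.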
Confidence interval construction
Suppose it is of interest to construct a 100(1 − α)% confidence interval for κ. The observed frequencies mi, corresponding to the Pri′(κ), i = 1, 2, ..., J + 1, follow a multinomial distribution, conditional on the sample size n (Table 1). The estimated probabilities P̂ri′(κ) can be obtained by replacing the µj by suitable estimates in Pri′(κ), i = 1, 2, ..., J + 1. It follows that

$$
\chi^2_G = \sum_{i=1}^{J+1} \frac{\left[m_i - n\,\widehat{\mathrm{Pr}}{}'_i(\kappa)\right]^2}{n\,\widehat{\mathrm{Pr}}{}'_i(\kappa)}
$$

has a limiting chi-square distribution with one degree of freedom. One can obtain the two corresponding 100(1 − α)% confidence limits for κ by finding the admissible roots to the polynomial equation $\chi^2_G = \chi^2_{1,1-\alpha}$. When a closed form solution is unattainable, the confidence limits may be obtained numerically by replacing the µj with their maximum likelihood estimates and solving the equation above for κ. Maximum likelihood estimates can be obtained by solving ∂logLM/∂µ1 = 0, ..., ∂logLM/∂µJ−1 = 0 and ∂logLM/∂κ = 0 simultaneously. For the case of a binary outcome variable, an explicit expression for the maximum likelihood estimator was obtained by Bloch and Kraemer.3
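The numerical inversion described above can be sketched as follows. Several choices here are assumptions made for illustration and are not the authors' procedure: the µj are estimated by pooled marginal proportions rather than by maximum likelihood, the point estimate is taken as the minimizer of the GOF statistic, the 95% critical value is hard-coded, and the 3 x 3 table is hypothetical. Each term of the statistic is convex in κ, so a simple search for the minimum followed by bisection on each side is enough.

```python
CHI2_1DF_95 = 3.841458820694124  # upper 5% point of chi-square with 1 df


def collapsed_probabilities(mu, kappa):
    concordant = [p * p + kappa * p * (1.0 - p) for p in mu]
    return concordant + [1.0 - sum(concordant)]


def gof_statistic(m_counts, mu, kappa):
    n = sum(m_counts)
    probs = collapsed_probabilities(mu, kappa)
    return sum((m - n * p) ** 2 / (n * p) for m, p in zip(m_counts, probs))


def bisect_root(func, lo, hi, tol=1e-10):
    """Bisection for func(x) = 0; func(lo) and func(hi) must differ in sign."""
    f_lo = func(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def kappa_confidence_interval(table, crit=CHI2_1DF_95):
    """Return (lower, kappa_hat, upper) from a J x J table of paired ratings."""
    J = len(table)
    n = sum(sum(row) for row in table)
    # Assumption: pooled marginal proportions stand in for the MLEs of mu_j.
    mu = [(sum(table[j]) + sum(row[j] for row in table)) / (2.0 * n)
          for j in range(J)]
    concordant = [table[j][j] for j in range(J)]
    m_counts = concordant + [n - sum(concordant)]
    # Admissible kappa keeps every cell probability strictly positive.
    lo_bound = max(-p / (1.0 - p) for p in mu) + 1e-9
    hi_bound = 1.0 - 1e-9

    def g(kappa):
        return gof_statistic(m_counts, mu, kappa) - crit

    lo, hi = lo_bound, hi_bound
    for _ in range(200):                 # ternary search on a convex function
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if g(a) < g(b):
            hi = b
        else:
            lo = a
    kappa_hat = 0.5 * (lo + hi)
    if g(kappa_hat) >= 0:
        raise ValueError("GOF statistic never falls below the critical value")
    lower = bisect_root(g, lo_bound, kappa_hat) if g(lo_bound) > 0 else lo_bound
    upper = bisect_root(g, kappa_hat, hi_bound) if g(hi_bound) > 0 else hi_bound
    return lower, kappa_hat, upper


# Hypothetical table: rows are first ratings, columns are second ratings.
table = [[20, 4, 2], [3, 25, 5], [2, 4, 35]]
lower, kappa_hat, upper = kappa_confidence_interval(table)
```

The interval is generally not symmetric about the point estimate, since the two roots are found separately on either side of the minimum.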
Hypothesis testing
The procedure above may also be used to test hypotheses concerning κ. Suppose it is of interest to test H0: κ = κ0, where κ0 is a pre-specified value. The GOF test statistic is given by

$$
\chi^2_0 = \sum_{i=1}^{J+1} \frac{\left[m_i - n\,\widehat{\mathrm{Pr}}{}'_i(\kappa_0)\right]^2}{n\,\widehat{\mathrm{Pr}}{}'_i(\kappa_0)}
$$

Under H0, χ0² follows an approximate chi-square distribution with one degree of freedom. The P̂ri′(κ0), i = 1, 2, ..., J + 1, are obtained by replacing µj, j = 1, 2, ..., J − 1, with their maximum likelihood estimates and κ by κ0.
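A short sketch of the test follows, assuming the estimates of the µj are already available and supplied by the caller. The function name, the counts and the estimates are hypothetical. The p-value uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution at x equals erfc(sqrt(x / 2)).

```python
import math


def gof_test(m_counts, mu_hat, kappa0):
    """GOF statistic and 1-df chi-square p-value for H0: kappa = kappa0."""
    n = sum(m_counts)
    concordant = [p * p + kappa0 * p * (1.0 - p) for p in mu_hat]
    probs = concordant + [1.0 - sum(concordant)]
    statistic = sum((m - n * p) ** 2 / (n * p)
                    for m, p in zip(m_counts, probs))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical data: three concordant counts followed by the discordant total.
m_counts = [20, 25, 35, 20]
mu_hat = [0.255, 0.33, 0.415]    # assumed estimates, supplied by the caller
statistic, p_value = gof_test(m_counts, mu_hat, kappa0=0.4)
```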
Sample size requirements
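Suppose that it is of interest to estimate the number of subjects needed to test the null hypothesis H0: κ = κ0 versus Ha: κ = κa at the 100α% significance level (2-sided), and with power (1 − β). Under Ha, the GOF statistic has a non-central chi-square distribution with one degree of freedom, and with corresponding non-centrality parameter given by

$$
\lambda = n \sum_{i=1}^{J+1} \frac{\left[\mathrm{Pr}'_i(\kappa_a) - \mathrm{Pr}'_i(\kappa_0)\right]^2}{\mathrm{Pr}'_i(\kappa_0)}
$$

If 1 − β(1, λ, α) denotes the power of the GOF statistic corresponding to λ and α, one can determine the sample size required to test H0: κ = κ0 versus Ha: κ = κa by using tables of the non-central chi-square distribution (e.g. Haynam et al.19). The required number of subjects is then given by

$$
n = \lambda(1, 1 - \beta, \alpha) \left\{ \sum_{i=1}^{J+1} \frac{\left[\mathrm{Pr}'_i(\kappa_a) - \mathrm{Pr}'_i(\kappa_0)\right]^2}{\mathrm{Pr}'_i(\kappa_0)} \right\}^{-1}
$$

where λ(1, 1 − β, α) is the tabulated non-centrality parameter that yields power 1 − β at level α with one degree of freedom.
As an example, suppose that we have a trinomial outcome variable and wish to test H0: κ = 0.2 at α = 0.05 (2-sided) and β = 0.2. When µ1 = 0.2, µ2 = 0.3 and κa = 0.4, we can compute the required number of subjects from the equation above as n = 118.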
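The worked example can be reproduced with a few lines of code. One assumption is made in place of the published tables: the non-centrality value for one degree of freedom is approximated by the squared sum of two standard normal quantiles, which gives about 7.85 for α = 0.05 and β = 0.2. The function names are illustrative.

```python
import math
from statistics import NormalDist


def collapsed_probabilities(mu, kappa):
    concordant = [p * p + kappa * p * (1.0 - p) for p in mu]
    return concordant + [1.0 - sum(concordant)]


def required_subjects(mu, kappa0, kappa_a, alpha=0.05, beta=0.2):
    """Subjects needed to test H0: kappa = kappa0 against Ha: kappa = kappa_a."""
    z = NormalDist()
    # Assumption: normal-quantile approximation to the tabulated
    # non-centrality parameter lambda(1, 1 - beta, alpha).
    lam = (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(1.0 - beta)) ** 2
    p0 = collapsed_probabilities(mu, kappa0)
    pa = collapsed_probabilities(mu, kappa_a)
    effect = sum((a - b) ** 2 / b for a, b in zip(pa, p0))
    return math.ceil(lam / effect)


# Trinomial example from the text: mu3 = 1 - 0.2 - 0.3 = 0.5.
n_trinomial = required_subjects([0.2, 0.3, 0.5], kappa0=0.2, kappa_a=0.4)

# Same hypotheses after merging categories 1 and 2 into one (illustrative).
n_binary = required_subjects([0.5, 0.5], kappa0=0.2, kappa_a=0.4)
```

With these inputs `n_trinomial` matches the n = 118 quoted above. Under the model, merging categories leaves κ unchanged, so the second call uses the same κ0 and κa; it returns a larger number of subjects, which illustrates the cost of dichotomizing in this particular configuration.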
Results |
---|
To conserve space, we present some of the results in Table 2. It is seen that most of the coverage levels fall between 940 and 960, and are therefore generally acceptable. When the number of subjects is increased to 100 and 200, the differences in coverage level become negligible. The confidence interval results further show that the three-category interval widths are consistently narrower at all parameter values. The advantage is particularly apparent at n = 50 for κ = 0.1, where some of the observed differences in average width are as great as 0.17, or an increase of 48.6%. The average widths become more similar when the number of subjects is increased to 100 and 200. Results for n = 200 are not provided for reasons of space. Nevertheless, using three categories still produced narrower interval widths at all parameter values.
Application |
---|
Information on alcohol consumption was collected using the typical occasions method.22 In the present paper, we are interested in the degree of agreement regarding drinking habits between primary respondents (non-fatal myocardial infarction cases) and proxy respondents (closest next-of-kin). Three categories were used in the data analysis: (I) non-drinker, (II) occasional drinker, or (III) regular drinker.
The data for the 58 respondent pairs are given in Table 4, with the results summarized in Table 5. When the data from the 3 × 3 table are used to estimate agreement, the 95% confidence interval for κ is (0.38, 0.73). When we collapse the data into non-regular drinker (groups I and II) versus regular drinker (group III), the 95% confidence interval for κ is (0.45, 0.85). Alternatively, when we collapse the data into non-drinker (group I) versus drinker (groups II and III), the 95% confidence interval for κ is (0.32, 0.77). Relative to the three-category analysis, the confidence interval width increases by 16% and 34%, respectively. It is clear that the confidence interval is narrower when more categories are used.
Discussion |
---|
Kraemer24 addressed the problem of multinomial outcome categories by proposing the use of a symmetric matrix of coefficients to measure reliability. For this approach, the intraclass kappa coefficients from the matrix diagonal represent the degree of agreement for a particular category relative to all other categories combined. An advantage of the matrix approach is that it meets the concerns of those who criticized the use of a single overall measure of reliability for multinomial data (e.g. Roberts and McNamee25). When the main interest is in an initial global measure of agreement, however, these methods might become cumbersome to perform and perhaps provide more information than is needed.
The results of our simulation study, as reflected in the example, show that there are clear advantages to preserving multinomial data on the original scale rather than collapsing the data into a binary trait. Depending upon how the categories are collapsed, the penalty in terms of larger sample size requirements for testing hypotheses can be quite severe. These observations are also consistent with those of previous authors,26–28 who show there can be a severe loss of power when an inherently continuous variable is dichotomized. However, when there are small numbers in the categories, collapsing these categories may sometimes be the only practical way to facilitate data analysis. Nonetheless, when an investigator considers collapsing categories, biological or clinical relevance should also be taken into account.
References |
---|
2 Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Statist Med 1992;11:1511–19.
3 Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989;45:269–87.
4 Mak TK. Analysing intraclass correlation for dichotomous variables. Appl Stat 1988;37:344–52.
5 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
6 Basu S, Basu A. Comparison of several goodness-of-fit tests for the kappa statistic based on exact power and coverage probability. Statist Med 1995;14:347–56.
7 Banerjee M, Capozzoli L, McSweeney L, Sinha D. Beyond kappa: a review of interrater agreement measures. Can J Stat 1999;27:3–23.
8 Pierre U, Wood-Dauphinee S, Korner-Bitensky N, Gayton D, Hanley J. Proxy use of the Canadian SF-36 in rating health status of the disabled elderly. J Clin Epidemiol 1998;51:983–90.
9 Whiteman D, Green A. Wherein lies the truth? Assessment of agreement between parent proxy and child respondents. Int J Epidemiol 1997;26:855–59.
10 Navarro AM. Smoking status by proxy and self report: rate of agreement in different ethnic groups. Tob Control 1999;8:182–85.
11 MaCarthur C, Dougherty G, Pless IB. Reliability and validity of proxy respondent information about childhood injury: an assessment of a Canadian surveillance system. Am J Epidemiol 1997;145:834–41.
12 Walker AM, Velema JP, Robins JM. Analysis of case-control data derived in part from proxy respondents. Am J Epidemiol 1988;127:905–14.
13 Humble CG, Samet JM, Skipper BE. Comparison of self- and surrogate-reported dietary information. Am J Epidemiol 1984;119:86–98.
14 Cahalan D. Quantifying alcohol consumption: patterns and problems. Circulation 1981;64(Suppl.III):7–14.
15 Barnett T, O'Loughlin J, Paradis G, Renaud L. Reliability of proxy reports of parental smoking by elementary schoolchildren. Ann Epidemiol 1997;7:396–99.
16 Beltran ED, Malvitz DM, Eklund SA. Validity of two methods for assessing oral health status of populations. J Public Health Dent 1997;57:206–14.
17 Johnson NL, Kotz S. Discrete Distribution. New York: Wiley, 1969.
18 Chen JJ, Kodell RL, Howe RB, Gaylor DW. Analysis of trinomial responses from reproductive and developmental toxicity experiments. Biometrics 1991;47:1049–58.
19 Haynam GE, Govindarajulu Z, Leone GC. Tables of the Cumulative Non-central Chi-square Distribution, Case Statistical Laboratory, Publication No. 104, 1962. Part of the tables have been published in: Harter HL, Owen DB (eds). Selected Tables in Mathematical Statistics Vol. 1. Chicago: Markham, 1970.
20 Robey RR, Barcikowski RS. Type I error and the number of iterations in Monte Carlo studies of robustness. Br J Math Stat Psychol 1992;45:283–88.
21 Graham P, Jackson R. Primary versus proxy respondents: comparability of questionnaire data on alcohol consumption. Am J Epidemiol 1993;138:443–52.
22 Alanko T. An overview of techniques and problems in measurement of alcohol consumption. Res Adv Alcohol Drug Prob 1984;8:299–326.
23 Donner A, Eliasziw M. A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Statist Med 1997;16:1097–106.
24 Kraemer HC. Measurement of reliability for categorical data in medical research. Stat Meth Med Res 1992;1:183–99.
25 Roberts C, McNamee R. A matrix of kappa-type coefficients to assess the reliability of nominal scales. Statist Med 1998;17:471–88.
26 Cohen J. The cost of dichotomization. Appl Psychol Measurement 1983;7:249–53.
27 Kraemer HC. A measure of 2 × 2 association with stable variance and approximately normal small-sample distribution: planning cost-effective studies. Biometrics 1986;42:359–70.
28 Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994;50:550–55.