The effect of collapsing multinomial data when assessing agreement

E Bartfay a,b and A Donner a

a Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, Canada.

Reprint requests: E Bartfay, Radiation Oncology Research Unit, Apps Level 4, Kingston General Hospital, Kingston, Ontario, Canada, K7L 2V7. E-mail: emma.bartfay@krcc.on.ca


    Abstract
Background In epidemiological studies, researchers often depend on proxies to obtain information when primary subjects are unavailable. However, relatively few studies have performed formal statistical inference to assess agreement between proxy informants and primary study subjects. In this paper, we consider inference procedures for studies of interobserver agreement characterized by two raters and three or more outcome categories. Of particular interest is the consequence of dichotomizing such data on the expected confidence interval width for the kappa coefficient. The effect of dichotomization on sample size requirements for testing hypotheses concerning kappa is also evaluated.

Methods Simulation studies were used to compare coverage levels and widths of the resulting confidence intervals. Sample size requirements were compared for multinomial and dichotomous data. We illustrate our results using a published data set on drinking habits that assesses agreement between primary and proxy respondents.

Results Our results show that when multinomial data are treated as dichotomous, not only do the expected confidence interval widths become greater, but the penalty in terms of larger sample size requirements for hypothesis testing can be severe.

Conclusion We conclude that there are clear advantages in preserving multinomial data on the original scale rather than collapsing the data into a binary trait.

Keywords Agreement, kappa statistic, sample size, confidence interval, epidemiological studies

Accepted 10 May 2000


    Introduction
Since its introduction by Cohen,1 the kappa coefficient (κ) has become a very popular index for quantifying agreement among raters with respect to categorical measurements. The principal advantage of kappa compared with earlier measures of agreement is that it corrects for the excess agreement expected by chance. Donner and Eliasziw2 developed a goodness-of-fit (GOF) procedure for constructing inferences for the kappa statistic when the trait of interest is measured on a dichotomous scale. In particular, they showed how one can construct confidence intervals for the kappa statistic and estimate sample size requirements for hypothesis testing using this procedure.

Most of the literature on agreement assessment, however, has focused on continuous or dichotomous outcome data.3–6 Nevertheless, recent interest in kappa7 reflects the importance of this statistic, which can be applied to more general problems. For instance, investigators in epidemiological studies often rely on proxy informants when primary subjects are unavailable to provide the needed information, particularly when study subjects are elderly or are very young children.8,9 It has been suggested that the criteria for evaluating agreement between information obtained from primary subjects and their proxy respondents depend on their relationship, the research subject matter, and even the subjects' ethnicity.9–12 For example, wives seem to report their husbands' dietary intake more accurately, while husbands tend to be more accurate about their wives' alcohol consumption.13,14 Moreover, children can provide reliable information on the smoking habits of their cohabiting parents, whereas parents are not effective informants for evaluating their children's oral health status.15,16 In spite of the varying degrees of proxy-primary agreement reported in the literature, formal statistical evaluation has not been routinely used.9

One purpose of this paper is to show how the results of Donner and Eliasziw2 can be extended to provide a statistical procedure for constructing confidence intervals for the kappa coefficient with multinomial data. In addition, we demonstrate the consequences of collapsing multinomial data into a binary outcome measure. Our results show that this practice can be disadvantageous in terms of sample size requirements for hypothesis testing and expected confidence interval width. We illustrate these results using data on drinking habits from a previously published study.


    Materials and Methods
Suppose that a sample of n subject-pairs has been selected. Each individual is asked to classify a response into one of J (> 2) mutually exclusive categories. Let x_{tj} denote the number of individuals of the tth pair whose response falls in category j, where t = 1, 2, ..., n and j = 1, 2, ..., J. Assume that the joint distribution of x_{t1}, x_{t2}, ..., x_{tJ} is multinomial, i.e.

\[
\Pr(x_{t1}, x_{t2}, \ldots, x_{tJ}) = \frac{2!}{x_{t1}!\,x_{t2}!\cdots x_{tJ}!}\; p_1^{x_{t1}} p_2^{x_{t2}} \cdots p_J^{x_{tJ}},
\]

with parameters p_1, p_2, ..., p_J (p_1 + p_2 + ... + p_J = 1) and x_{t1} + x_{t2} + ... + x_{tJ} = 2, since each subject-pair contributes two ratings. The parameters p_1, p_2, ..., p_J represent the probability of a rating being classified into categories 1, 2, ..., J, respectively. We further assume that p_1, p_2, ..., p_J follow a Dirichlet distribution with parameters α_1, α_2, ..., α_J and density function given by

\[
f(p_1, \ldots, p_J) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_J)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\cdots\Gamma(\alpha_J)}\; p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \cdots p_J^{\alpha_J - 1},
\]

where α_j > 0, j = 1, 2, ..., J. The joint distribution of x_{t1}, x_{t2}, ..., x_{tJ} is then Dirichlet-multinomial,17 which can be written as

\[
\Pr(x_{t1}, x_{t2}, \ldots, x_{tJ}) = \frac{2!}{x_{t1}!\,x_{t2}!\cdots x_{tJ}!}\; \frac{\Gamma(\theta)}{\Gamma(\theta + 2)}\; \prod_{j=1}^{J} \frac{\Gamma(\alpha_j + x_{tj})}{\Gamma(\alpha_j)}.
\]

If we let Pr(j, j') be the probability that the first rating is in category j and the second rating is in category j', the basic model can be written as

\[
\Pr(j, j) = \frac{\alpha_j(\alpha_j + 1)}{\theta(\theta + 1)}, \qquad
\Pr(j, j') = \frac{\alpha_j \alpha_{j'}}{\theta(\theta + 1)}, \quad j \neq j',
\]

where θ = α_1 + α_2 + ... + α_J.

Letting µ_j = α_j/θ and κ = (1 + θ)^{-1}, the Dirichlet-multinomial model can also be expressed as

\[
\Pr(j, j') =
\begin{cases}
\mu_j^2 + \kappa\,\mu_j(1 - \mu_j), & j = j',\\[4pt]
(1 - \kappa)\,\mu_j \mu_{j'}, & j \neq j',
\end{cases}
\]

for all j, j' = 1, 2, ..., J.

The coefficient of interobserver agreement κ defined above has a parallel interpretation as the correlation between any two subjects within a litter in toxicological studies.18
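
To make the model concrete, the following minimal sketch (Python with NumPy; the function name and parameter values are ours, purely for illustration) builds the J × J matrix of pair probabilities Pr(j, j') from (µ, κ) and checks it by simulating directly from the Dirichlet-multinomial formulation with θ = (1 − κ)/κ.

```python
import numpy as np

def pair_prob_matrix(mu, kappa):
    """J x J matrix of Pr(j, j'): (1 - kappa)*mu_j*mu_j' off the diagonal,
    mu_j**2 + kappa*mu_j*(1 - mu_j) on the diagonal."""
    return (1.0 - kappa) * np.outer(mu, mu) + kappa * np.diag(mu)

mu, kappa = np.array([0.2, 0.3, 0.5]), 0.4   # illustrative values
P = pair_prob_matrix(mu, kappa)
assert np.isclose(P.sum(), 1.0)              # cell probabilities sum to one

# Monte Carlo check via the Dirichlet-multinomial formulation:
# draw p ~ Dirichlet(theta * mu) for each subject, then two ratings given p.
rng = np.random.default_rng(1)
theta = (1.0 - kappa) / kappa                # from kappa = 1/(1 + theta)
p = rng.dirichlet(theta * mu, size=200_000)
cum = p.cumsum(axis=1)
u = rng.random((len(p), 2))
ratings = (u[:, :, None] < cum[:, None, :]).argmax(axis=2)
emp = np.zeros_like(P)
np.add.at(emp, (ratings[:, 0], ratings[:, 1]), 1.0 / len(p))
print(np.round(P, 3))                        # model probabilities
print(np.round(emp, 3))                      # empirical frequencies, close to P
```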

Each pair of ratings may be regarded as falling into one of J(J + 1)/2 classifications (see Table 1 for the data layout). Letting n_i denote the number of subjects in classification i, i = 1, 2, ..., J(J + 1)/2, the log-likelihood function can be written as

\[
\log L = \text{constant} + \sum_{i=1}^{J(J+1)/2} n_i \log \Pr{}_i(\kappa),
\]

where Pr_i(κ) is the probability that the model above assigns to classification i.
Table 1 Data layout
 

In order to construct a one degree of freedom GOF test, we may further combine all discordant cells into a single cell. The modified log-likelihood function may then be expressed as

\[
\log L_M = \text{constant} + \sum_{i=1}^{J+1} m_i \log \Pr{}_i(\kappa),
\]

where m_i = n_i for the J concordant cells and m_{J+1} is the total count over all discordant cells. Under the model above, Pr_i(κ) = µ_i² + κ µ_i(1 − µ_i) for i = 1, ..., J, and Pr_{J+1}(κ) = (1 − κ)(1 − Σ_j µ_j²).
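
As a minimal sketch of this pooling (Python with NumPy; cell_probs and mod_loglik are our own helper names, and the additive multinomial constant is omitted):

```python
import numpy as np

def cell_probs(mu, kappa):
    """Probabilities of the J concordant cells, then the pooled discordant cell."""
    conc = mu**2 + kappa * mu * (1.0 - mu)          # Pr_i(kappa), i = 1, ..., J
    disc = (1.0 - kappa) * (1.0 - np.sum(mu**2))    # Pr_{J+1}(kappa)
    return np.append(conc, disc)

def mod_loglik(m, mu, kappa):
    """Modified log-likelihood log L_M; m holds the J concordant counts
    followed by the pooled discordant count (additive constant omitted)."""
    return float(np.sum(m * np.log(cell_probs(mu, kappa))))
```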

In the next three subsections, we show how one can construct confidence intervals, test hypotheses, and estimate sample size requirements for κ.

Confidence interval construction
Suppose it is of interest to construct a 100(1 − α)% confidence interval for κ. The observed frequencies m_i, corresponding to the Pr_i(κ), i = 1, 2, ..., J + 1, follow a multinomial distribution conditional on the sample size n (Table 1). Estimated probabilities are obtained by replacing the µ_j in Pr_i(κ) with suitable estimates. It follows that

\[
\chi^2_G = \sum_{i=1}^{J+1} \frac{\bigl(m_i - n\,\widehat{\Pr}_i(\kappa)\bigr)^2}{n\,\widehat{\Pr}_i(\kappa)}
\]

has a limiting chi-square distribution with one degree of freedom. The two corresponding 100(1 − α)% confidence limits for κ are the admissible roots of the equation χ²_G = χ²_{1−α,1}. When a closed-form solution is unattainable, the confidence limits may be obtained numerically by replacing the µ_j with their maximum likelihood estimates and solving the equation above for κ. The maximum likelihood estimates are obtained by solving ∂log L_M/∂µ_1 = 0, ..., ∂log L_M/∂µ_{J−1} = 0 and ∂log L_M/∂κ = 0 simultaneously. For the case of a binary outcome variable, an explicit expression for the maximum likelihood estimator was obtained by Bloch and Kraemer.3
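
A numerical version of this interval can be sketched as follows (Python with SciPy, reusing cell_probs from the sketch above). This is not the authors' code: for simplicity it searches κ within (0, 1), plugs in externally supplied estimates of the µ_j where the paper uses maximum likelihood estimates, and the observed counts shown are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar
from scipy.stats import chi2

def gof_stat(m, n, mu_hat, kappa):
    """Goodness-of-fit statistic chi2_G evaluated at a candidate kappa."""
    p = cell_probs(mu_hat, kappa)                  # helper from the sketch above
    return float(np.sum((m - n * p) ** 2 / (n * p)))

def kappa_ci(m, n, mu_hat, level=0.95, eps=1e-6):
    """Admissible roots of chi2_G(kappa) = chi2_{1-alpha,1}, searched within (0, 1).
    Assumes the minimised statistic falls below the critical value."""
    crit = chi2.ppf(level, df=1)
    f = lambda k: gof_stat(m, n, mu_hat, k) - crit
    k_hat = minimize_scalar(lambda k: gof_stat(m, n, mu_hat, k),
                            bounds=(eps, 1.0 - eps), method="bounded").x
    lo = brentq(f, eps, k_hat) if f(eps) > 0 else eps
    hi = brentq(f, k_hat, 1.0 - eps) if f(1.0 - eps) > 0 else 1.0 - eps
    return lo, hi

# hypothetical data: three concordant counts, then the pooled discordant count
m = np.array([6, 10, 20, 22])
n = m.sum()
mu_hat = np.array([0.2, 0.3, 0.5])                 # stand-in for the ML estimates
print(kappa_ci(m, n, mu_hat))
```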

Hypothesis testing
The procedure above may also be used to test hypotheses concerning κ. Suppose it is of interest to test H_0: κ = κ_0, where κ_0 is a pre-specified value. The GOF test statistic is given by

\[
\chi^2_0 = \sum_{i=1}^{J+1} \frac{\bigl(m_i - n\,\widehat{\Pr}_i(\kappa_0)\bigr)^2}{n\,\widehat{\Pr}_i(\kappa_0)}.
\]

Under H_0, χ²_0 follows an approximate chi-square distribution with one degree of freedom. The estimated probabilities are obtained by replacing µ_j, j = 1, 2, ..., J − 1, with their maximum likelihood estimates and κ with κ_0.
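
With the helpers from the previous sketch in scope, the test reduces to evaluating the statistic at κ_0 and referring it to a chi-square distribution with one degree of freedom:

```python
# continuing from the confidence interval sketch above
chi2_0 = gof_stat(m, n, mu_hat, kappa=0.2)   # H0: kappa = 0.2
p_value = chi2.sf(chi2_0, df=1)              # reject H0 when p_value < alpha
print(chi2_0, p_value)
```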

Sample size requirements
Suppose that it is of interest to estimate the number of subjects needed to test the null hypothesis H_0: κ = κ_0 versus H_a: κ = κ_a at the 100α% significance level (two-sided) and with power 1 − β. Under H_a, the GOF statistic has a non-central chi-square distribution with one degree of freedom and non-centrality parameter

\[
\lambda = n \sum_{i=1}^{J+1} \frac{\bigl[\Pr{}_i(\kappa_a) - \Pr{}_i(\kappa_0)\bigr]^2}{\Pr{}_i(\kappa_0)}.
\]

If 1 − β(1, λ, α) denotes the power of the GOF statistic corresponding to λ and α, one can determine the sample size required to test H_0: κ = κ_0 versus H_a: κ = κ_a by using tables of the non-central chi-square distribution (e.g. Haynam et al.19). Writing λ(1, α, 1 − β) for the non-centrality parameter at which this power is attained, the required number of subjects is then given by

\[
n = \frac{\lambda(1, \alpha, 1 - \beta)}{\sum_{i=1}^{J+1} \bigl[\Pr{}_i(\kappa_a) - \Pr{}_i(\kappa_0)\bigr]^2 \big/ \Pr{}_i(\kappa_0)}.
\]

As an example, suppose that we have a trinomial outcome variable and wish to test H_0: κ = 0.2 at α = 0.05 (two-sided) and β = 0.2. When µ_1 = 0.2, µ_2 = 0.3 and κ_a = 0.4, the equation above gives a required sample size of n = 118.
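
This calculation can be reproduced numerically in place of the printed tables. The sketch below (our helper name; Python with SciPy, reusing cell_probs from the earlier sketch) finds the non-centrality parameter λ(1, α, 1 − β) for a one-degree-of-freedom non-central chi-square and divides it by the per-subject discrepancy, recovering n = 118 for the stated parameter values.

```python
import math
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def required_n(mu, kappa0, kappa_a, alpha=0.05, power=0.80):
    p0, pa = cell_probs(mu, kappa0), cell_probs(mu, kappa_a)  # helper from above
    delta = float(np.sum((pa - p0) ** 2 / p0))   # non-centrality per subject
    crit = chi2.ppf(1.0 - alpha, df=1)
    # smallest lambda at which the 1-df non-central chi-square attains the power
    lam = brentq(lambda l: ncx2.sf(crit, 1, l) - power, 1e-9, 200.0)
    return math.ceil(lam / delta)

print(required_n(np.array([0.2, 0.3, 0.5]), kappa0=0.2, kappa_a=0.4))   # -> 118
```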


    Results
Confidence interval comparison
A Monte Carlo simulation study was conducted to evaluate the effect of dichotomizing trinomial outcome data on coverage level and confidence interval width. The parameters in the simulation included various values of µ_1, µ_2 and κ, as well as the total number of subjects (n = 50, 100, 200). The number of replications used in the simulation was 1000, which allows a departure of 0.025 from the true coverage of 95% to be detected as statistically significant with 90% power.20
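
One replicate of such a simulation might look like the sketch below (an illustration only, reusing pair_prob_matrix and kappa_ci from the earlier sketches; marginal frequencies stand in for the ML estimates of the µ_j, and no guard is included for the rare replicate with an empty category).

```python
import numpy as np

def one_replicate(rng, mu, kappa, n):
    P = pair_prob_matrix(mu, kappa)                  # J x J cell probabilities
    cells = rng.choice(P.size, size=n, p=P.ravel())  # n rating pairs
    j1, j2 = np.unravel_index(cells, P.shape)
    J = len(mu)
    m = np.array([int(np.sum((j1 == j) & (j2 == j))) for j in range(J)]
                 + [int(np.sum(j1 != j2))])          # concordant + pooled discordant
    mu_hat = np.bincount(np.concatenate([j1, j2]), minlength=J) / (2.0 * n)
    lo, hi = kappa_ci(m, n, mu_hat)
    return lo <= kappa <= hi                         # does the interval cover kappa?

rng = np.random.default_rng(2)
cover = np.mean([one_replicate(rng, np.array([0.2, 0.3, 0.5]), 0.4, 50)
                 for _ in range(1000)])
print(cover)                                         # roughly the nominal 0.95
```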

To conserve space, we present a selection of the results in Table 2. Most of the coverage levels fall between 94.0% and 96.0%, and are therefore generally acceptable. When the number of subjects is increased to 100 and 200, the differences in coverage level become negligible. The confidence interval results further show that the three-category interval widths are consistently narrower at all parameter values. The advantage is particularly apparent at n = 50 for κ = 0.1, where some of the observed differences in average width are as great as 0.17, an increase of 48.6%. The average widths become more similar when the number of subjects is increased to 100 and 200. Results for n = 200 are omitted for reasons of space; nevertheless, using three categories still produced narrower interval widths at all parameter values.


Table 2 Effect of dichotomization on coverage level and confidence interval width
 
Sample size requirements comparison
We now consider the effect of dichotomization on sample size requirements for testing H_0: κ = κ_0 against H_a: κ = κ_a. For this purpose, we specify values of µ_1, µ_2, κ_0 and κ_a corresponding to the three-category case, and combine two of the categories to facilitate the comparison. For example, suppose we have µ_1 = 0.2, µ_2 = 0.2 and µ_3 = 0.6 for a trinomial outcome variable. Collapsing the data into a binary trait yields either (i) µ = µ_1 = 0.2 with 1 − µ = 0.8, or (ii) µ = µ_1 + µ_2 = 0.4 with 1 − µ = 0.6. The numbers of subjects required in the trinomial and binomial cases are displayed in Table 3 for µ_1 = (0.2, 0.3), µ_2 = (0.2, 0.3, 0.4, 0.6), κ_0 = (0.1, 0.2), κ_a = (0.3, 0.4, 0.5), α = (0.01, 0.05) and β = (0.2, 0.1). The corresponding number of subjects required when β = 0.1 is reported in parentheses.


Table 3 Comparison of sample size requirements
 
The overall results show that when a trinomial outcome variable is collapsed to create a binary variable, substantial increases in sample size are required to maintain the same power at a given level of significance. The magnitude of this effect depends on how the categories are combined. For example, when testing H_0: κ = 0.2 against H_a: κ = 0.4 at α = 0.05 and β = 0.2, a trinomial outcome variable with µ_1 = 0.2 and µ_2 = 0.3 requires 118 subjects. The corresponding requirement for a dichotomous outcome variable is 248 subjects at µ = 0.2 (1 − µ = 0.8), but is reduced to 189 subjects at µ = µ_1 + µ_2 = 0.5 (1 − µ = 0.5).
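
Assuming the required_n sketch given earlier, the three sample sizes in this example can be reproduced directly:

```python
import numpy as np

print(required_n(np.array([0.2, 0.3, 0.5]), 0.2, 0.4))  # trinomial        -> 118
print(required_n(np.array([0.2, 0.8]), 0.2, 0.4))       # binary, mu = 0.2 -> 248
print(required_n(np.array([0.5, 0.5]), 0.2, 0.4))       # binary, mu = 0.5 -> 189
```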


    Application
An example: Drinking habits among non-fatal myocardial infarction patients
For illustrative purposes, we use part of the data from a community-based case-control study of coronary heart disease.21 The study population consisted of all white men and women aged 25–64 who resided in the Auckland Statistical Area from 1986 to 1988. Cases included all non-fatal myocardial infarction patients from a World Health Organization MONICA project and all myocardial infarction deaths. Controls for the non-fatal myocardial infarction patients were a group-matched age- and sex-stratified random sample from the study population.

Information on alcohol consumption was collected using the ‘typical occasions’ method.22 In the present paper, we are interested in the degree of agreement regarding drinking habits between primary respondents (non-fatal myocardial infarction cases) and proxy respondents (closest next-of-kin). Three categories were used in the data analysis: (I) non-drinker, (II) occasional drinker, or (III) regular drinker.

The data for the 58 respondent pairs are given in Table 4, with the results summarized in Table 5. When the data from the 3 × 3 table are used to estimate agreement, the 95% confidence interval for κ is (0.38, 0.73). When we collapse the data into non-regular drinker (groups I and II) versus regular drinker (group III), the 95% confidence interval for κ is (0.45, 0.85). Alternatively, when we collapse the data into non-drinker (group I) versus drinker (groups II and III), the 95% confidence interval for κ is (0.32, 0.77). The two collapsed analyses increase the confidence interval width by 16% and 34%, respectively. It is clear that the confidence interval is narrower when more categories are used.


Table 4 Data layout for drinking habits by primary and proxy respondents
 

Table 5 95% confidence intervals for κ: alcohol consumption example
 

    Discussion
Donner and Eliasziw23 proposed a hierarchical approach to the construction of inferences concerning interobserver agreement when the outcome variable of interest is multinomial. By combining the original categories into binary traits, the authors were able to perform a series of nested, statistically independent inferences. However, this method is only appropriate when some of the outcome categories can be naturally combined to answer a series of questions that are of a priori interest.

Kraemer24 addressed the problem of multinomial outcome categories by proposing the use of a symmetric matrix of coefficients to measure reliability. For this approach, the intraclass kappa coefficients from the matrix diagonal represent the degree of agreement for a particular category relative to all other categories combined. An advantage of the matrix approach is that it meets the concerns of those who criticized the use of a single overall measure of reliability for multinomial data (e.g. Roberts and McNamee25). When the main interest is in an initial global measure of agreement, however, these methods might become cumbersome to perform and perhaps provide more information than is needed.

The results of our simulation study, as reflected in the example, show that there are clear advantages to preserving multinomial data on the original scale rather than collapsing the data into a binary trait. Depending upon how the categories are collapsed, the penalty in terms of larger sample size requirements for hypothesis testing can be quite severe. These observations are consistent with those of previous authors,26–28 who showed that there can be a severe loss of power when an inherently continuous variable is dichotomized. However, when some categories contain few observations, collapsing them may be the only practical way to proceed with the analysis. Even then, biological or clinical relevance should be taken into account when deciding which categories to combine.


    Acknowledgments
 
Dr Bartfay's research has been partially supported by a grant from the Advisory Research Committee of Queen's University and Dr Donner's research has been partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada.


    Notes
 
b Current address: Radiation Oncology Research Unit, Department of Oncology, and Department of Community Health and Epidemiology, Queen's University, Kingston, Ontario, Canada.


    References
1 Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46.

2 Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Statist Med 1992;11:1511–19.

3 Bloch DA, Kraemer HC. 2 x 2 kappa coefficients: measures of agreement or association. Biometrics 1989;45:269–87.

4 Mak TK. Analysing intraclass correlation for dichotomous variables. Appl Stat 1988;37:344–52.

5 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.

6 Basu S, Basu A. Comparison of several goodness-of-fit tests for the kappa statistic based on exact power and coverage probability. Statist Med 1995;14:347–56.

7 Banerjee M, Capozzoli L, McSweeney L, Sinha D. Beyond kappa: a review of interrater agreement measures. Can J Stat 1999;27:3–23.

8 Pierre U, Wood-Dauphinee S, Korner-Bitensky N, Gayton D, Hanley J. Proxy use of the Canadian SF-36 in rating health status of the disabled elderly. J Clin Epidemiol 1998;51:983–90.

9 Whiteman D, Green A. Wherein lies the truth? Assessment of agreement between parent proxy and child respondents. Int J Epidemiol 1997;26:855–59.

10 Navarro AM. Smoking status by proxy and self report: rate of agreement in different ethnic groups. Tob Control 1999;8:182–85.

11 MaCarthur C, Dougherty G, Pless IB. Reliability and validity of proxy respondent information about childhood injury: an assessment of a Canadian surveillance system. Am J Epidemiol 1997;145:834–41.

12 Walker AM, Velema JP, Robins JM. Analysis of case-control data derived in part from proxy respondents. Am J Epidemiol 1988;127:905–14.

13 Humble CG, Samet JM, Skipper BE. Comparison of self- and surrogate-reported dietary information. Am J Epidemiol 1984;119:86–98.

14 Cahalan D. Quantifying alcohol consumption: patterns and problems. Circulation 1981;64(Suppl. III):7–14.

15 Barnett T, O'Loughlin J, Paradis G, Renaud L. Reliability of proxy reports of parental smoking by elementary schoolchildren. Ann Epidemiol 1997;7:396–99.

16 Beltran ED, Malvitz DM, Eklund SA. Validity of two methods for assessing oral health status of populations. J Public Health Dent 1997;57:206–14.

17 Johnson NL, Kotz S. Discrete Distributions. New York: Wiley, 1969.

18 Chen JJ, Kodell RL, Howe RB, Gaylor DW. Analysis of trinomial responses from reproductive and developmental toxicity experiments. Biometrics 1991;47:1049–58.

19 Haynam GE, Govindarajulu Z, Leone GC. Tables of the Cumulative Non-central Chi-square Distribution. Case Statistical Laboratory, Publication No. 104, 1962. Part of the tables have been published in: Harter HL, Owen DB (eds). Selected Tables in Mathematical Statistics Vol. 1. Chicago: Markham, 1970.

20 Robey RR, Barcikowski RS. Type I error and the number of iterations in Monte Carlo studies of robustness. Br J Math Stat Psychol 1992;45:283–88.

21 Graham P, Jackson R. Primary versus proxy respondents: comparability of questionnaire data on alcohol consumption. Am J Epidemiol 1993;138:443–52.

22 Alanko T. An overview of techniques and problems in measurement of alcohol consumption. Res Adv Alcohol Drug Prob 1984;8:299–326.

23 Donner A, Eliasziw M. A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Statist Med 1997;16:1097–106.

24 Kraemer HC. Measurement of reliability for categorical data in medical research. Stat Meth Med Res 1992;1:183–99.

25 Roberts C, McNamee R. A matrix of kappa-type coefficients to assess the reliability of nominal scales. Statist Med 1998;17:471–88.

26 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249–53.

27 Kraemer HC. A measure of 2 x 2 association with stable variance and approximately normal small-sample distribution: planning cost-effective studies. Biometrics 1986;42:359–70.

28 Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994;50:550–55.




