a School of Public Health and Community Medicine, The Hebrew University and Hadassah, P.O. Box 12272, Jerusalem 91010, Israel. E-mail: om{at}cc.huji.ac.il.
b Department of Epidemiology & Public Health, Institute of Child Health, 30 Guilford Street, London WC1N 1EH, UK.
Abstract
Methods Using data from the 1958 British birth cohort, we examined the relationship between socioeconomic conditions, indicated by occupational class at four ages, and self-rated health. Results obtained for a dichotomous variable using logistic regression were compared with alternative methods for ordered categorical variables including polytomous regression, cumulative odds, continuation ratio and adjacent categories models.
Results and Conclusions Findings concerning the relationship between socioeconomic position and self-rated health yielded by a logistic regression model were confirmed by alternative statistical methods which incorporate the ordered nature of self-rated health. Similarity of results was found regarding size and significance of main effects, type of association and interactive effects.
Keywords Self-rated health, social class, logistic regression, polytomous regression, cumulative odds model, continuation ratio model, adjacent categories model
Accepted 29 June 1999
Introduction
Although self-rated health is measured as a categorical response, usually having three to five categories, it is often collapsed into a dichotomous variable of good versus less than good health when it is used as a dependent variable.9–16 The justification for this practice has not been established. It may be that the categories of self-rated health represent an arbitrary classification of underlying continuous phenomena.17 Alternatively, the categories may represent intrinsically distinct health states, which are predicted by different factors.18,19 It is important, therefore, to establish whether analyses of self-rated health are sensitive to categorization. Investigation of the uncollapsed variable may shed light on whether threshold effects exist which would justify the practice of collapsing this health measure. To our knowledge this issue has not been examined.
The collapsing of categories of a categorical variable has been discussed in the statistical literature and it is recognised that dichotomization, whilst valid, involves loss of information and may lead to reduction in efficiency in the statistical analysis under consideration.20,21 Alternative approaches, which relate to the ordinal nature of a response variable, have been suggested within the framework of logit and loglinear models for ordered categorical variables. These approaches include the continuation ratio model, the cumulative odds model and the adjacent categories model.20–24
Elsewhere we have examined the association of cumulative socioeconomic circumstances and education with self-rated health among 33 year olds in the 1958 British birth cohort.25 Five specific questions were identified as follows. First, do socioeconomic conditions (as indicated by occupational class) at four life stages contribute to self-rated health at age 33? Second, are these contributions of similar magnitude? Third, is there an interaction effect of socioeconomic conditions at different ages? Fourth, is the cumulative effect of socioeconomic conditions distinct from that of education? Finally, are there gender differences in these relationships? This previous work used self-rated health as a dichotomous variable. The purpose here is to establish whether results for the dichotomous outcome differ from those obtained with alternative approaches based on self-rated health as an ordered categorical variable.
Statistical models
Consider a polytomous variable Y having k categories denoted by 1,2,..., k corresponding to the ordered values y1,..., yk and let x = (x1,..., xp) be a column vector of covariates. A possible generalization of the logistic regression model is the ordinary polytomous model.26 The model is given by
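$$\log\left[\frac{\Pr(Y = y_j \mid x)}{\Pr(Y = y_1 \mid x)}\right] = \alpha_j + \beta_j' x, \qquad j = 2, \ldots, k \tag{1}$$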
where βj are vectors of regression coefficients and βji represents the log odds ratio for Y = yj versus Y = y1 per unit increase in xi. The model is based on a reference category, here the first one, and involves k − 1 logit equations, each having separate parameters. While the model's parameters are easy to interpret, the model does not incorporate the natural ordering of the response variable Y. The following models address this issue.
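As an illustration of how the k − 1 logit equations of the ordinary polytomous model determine the category probabilities, the following minimal Python sketch may help. The function name `polytomous_probs` and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the cohort data.

```python
import math

def polytomous_probs(alphas, betas, x):
    """Category probabilities under a polytomous logit model with category 1 as reference.

    alphas[m] and betas[m] are the intercept and coefficient vector for
    category m + 2 versus category 1 (hypothetical values, not fitted ones).
    """
    # The linear predictor of the reference category is fixed at 0.
    etas = [0.0] + [a + sum(b_i * x_i for b_i, x_i in zip(b, x))
                    for a, b in zip(alphas, betas)]
    denom = sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas]

# Hypothetical parameters for k = 4 categories and a single covariate.
alphas = [0.5, -0.4, -1.8]
betas = [[0.10], [0.25], [0.40]]
probs = polytomous_probs(alphas, betas, [2.0])

# Each log odds against the reference category recovers alpha_j + beta_j * x.
log_odds_vs_ref = [math.log(p / probs[0]) for p in probs[1:]]
```

Because each logit has its own coefficient vector, nothing in this parameterization uses the ordering of the categories, which is the limitation noted above.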
The cumulative odds model does incorporate the ordered nature of the response variable. This model, which was suggested by Walker and Duncan27 and further discussed by McCullagh,20 is related to cumulative probabilities and is given by
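$$\log\left[\frac{\Pr(Y \ge y_j \mid x)}{\Pr(Y < y_j \mid x)}\right] = \alpha_j + \beta' x, \qquad j = 2, \ldots, k \tag{2}$$
with α2 > α3 > ... > αk.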
Consider transforming the Y variable into a binary variable in which all categories from 1 to j − 1 form one category, and those from j to k form the second category. The logits of such binary variables are modelled in the cumulative odds model. The vector of regression coefficients, β, does not depend on j, thus this model assumes that the association between X and Y is independent of j. Comparing the odds of having a response ≥ yj for x = x1 with that for x = x2 (where x1 and x2 here denote two values of the covariate vector) will yield
$$\log\left[\frac{\Pr(Y \ge y_j \mid x_1) / \Pr(Y < y_j \mid x_1)}{\Pr(Y \ge y_j \mid x_2) / \Pr(Y < y_j \mid x_2)}\right] = \beta'(x_1 - x_2)$$
The assumption of equality of the log odds ratio over all the cut-off points is also referred to as the proportional odds assumption.20 A more general model which relaxes this assumption will include on the right-hand side of (2) αj + βj′x. Such a model can be used to assess the appropriateness of the proportional odds assumption by testing the equality of the βj's. The model stays practically the same if the order of the categories in the Y variable is reversed. Further, collapsing some of the categories will only affect the intercept parameters. This collapsibility property allows modelling an underlying continuous variable.24
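The proportional odds and collapsibility properties can be checked numerically. The sketch below assumes the cumulative logit is written as log[Pr(Y ≥ yj | x) / Pr(Y < yj | x)] = αj + β′x with decreasing intercepts; the helper names and all parameter values are hypothetical and serve only as an illustration.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_odds_probs(alphas, beta, x):
    """Category probabilities when logit Pr(Y >= y_j | x) = alpha_j + beta'x.

    alphas holds alpha_2..alpha_k in strictly decreasing order.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    # upper[m] = Pr(Y >= y_(m+1)), padded with 1 (for j = 1) and 0 (beyond k).
    upper = [1.0] + [expit(a + eta) for a in alphas] + [0.0]
    return [upper[m] - upper[m + 1] for m in range(len(alphas) + 1)]

def cumulative_log_odds(probs, j):
    """Log odds of Y >= y_j versus Y < y_j for a 1-based cut-off j >= 2."""
    lower = sum(probs[:j - 1])
    upper_tail = sum(probs[j - 1:])
    return math.log(upper_tail / lower)

alphas = [1.0, -0.5, -2.0]   # hypothetical intercepts, k = 4
beta = [0.1]                 # hypothetical slope for a single covariate
p_x1 = cumulative_odds_probs(alphas, beta, [12.0])
p_x2 = cumulative_odds_probs(alphas, beta, [8.0])

# The log odds ratio is beta * (12 - 8) at every cut-off j = 2, 3, 4.
log_odds_ratios = [cumulative_log_odds(p_x1, j) - cumulative_log_odds(p_x2, j)
                   for j in (2, 3, 4)]

# Collapsing categories 3 and 4 leaves the remaining cut-offs unchanged.
collapsed = [p_x1[0], p_x1[1], p_x1[2] + p_x1[3]]
collapsed_log_odds = [cumulative_log_odds(collapsed, j) for j in (2, 3)]
```

In this sketch the three log odds ratios coincide, and merging the two highest categories simply drops the last intercept while the slope and the other intercepts are untouched.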
Another model which incorporates the natural order of the response is the continuation ratio model. Here the logits to be modelled are those yielded by transforming Y into a binary variable in the following way: one category is formed from the j-th category of Y, and the other is formed by categories 1 to j − 1. The model has the following form
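$$\log\left[\frac{\Pr(Y = y_j \mid x)}{\Pr(Y < y_j \mid x)}\right] = \alpha_j + \beta' x, \qquad j = 2, \ldots, k \tag{3}$$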
The model, which was suggested by Feinberg,28 is based on the probability of being in category j, conditional on being in category j or a lower one. The slope βi, corresponding to the covariate xi, represents the change in the log odds of a particular ranking, as against a lower ranking, for a unit change in xi. Reversing the order of the categories will yield a different model, as will collapsing some of the categories of Y. The model is related to Cox's proportional hazards model.29
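To make the conditional structure of the continuation ratio model concrete, the sketch below builds the category probabilities from the conditional probabilities Pr(Y = yj | Y ≤ yj), working down from the highest category. It assumes a common slope across logits; the function names and parameter values are hypothetical.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def continuation_ratio_probs(alphas, beta, x):
    """Category probabilities when logit Pr(Y = y_j | Y <= y_j) = alpha_j + beta'x.

    alphas holds alpha_2..alpha_k; the slope vector is shared by all logits.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    h = [expit(a + eta) for a in alphas]   # h[m] belongs to category j = m + 2
    probs = [0.0] * (len(alphas) + 1)
    remaining = 1.0                        # Pr(Y <= y_j), starting at j = k
    for m in range(len(h) - 1, -1, -1):
        probs[m + 1] = h[m] * remaining    # Pr(Y = y_j)
        remaining *= 1.0 - h[m]            # Pr(Y <= y_(j-1))
    probs[0] = remaining
    return probs

alphas = [0.2, -0.6, -1.5]   # hypothetical intercepts, k = 4
beta = [0.1]                 # hypothetical slope for a single covariate
probs = continuation_ratio_probs(alphas, beta, [10.0])

# Each continuation-ratio logit recovers alpha_j + beta * x.
logits = [math.log(probs[j - 1] / sum(probs[:j - 1])) for j in (2, 3, 4)]
```

Because the conditioning set changes when the category order is reversed or categories are merged, the same parameters would not describe the reordered or collapsed response.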
The adjacent categories model is an additional model suggested for analysing an ordered categorical response. The model is based on comparing in each logit two adjacent categories and has the following form
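$$\log\left[\frac{\Pr(Y = y_j \mid x)}{\Pr(Y = y_{j-1} \mid x)}\right] = \alpha_j + \beta_j' x, \qquad j = 2, \ldots, k \tag{4}$$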
The parameter βji corresponds to the regression coefficient of the i-th covariate for the log odds of (Y = yj) relative to (Y = yj−1). The model is related to the ordinary polytomous model presented in (1): in both models the regression coefficient β depends on j, and thus is different for each logit modelled. A simpler model may include an assumption that the βj's are all equal, implying a similar logit effect for all (k − 1) pairs of adjacent response categories; such a model will have on the right-hand side αj + β′x. The adjacent categories model was presented by Goodman30 and the model with fixed β is similar to a loglinear model having a linear by linear association.21 Such a model requires assigning scores to the response variable.
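The link between the fixed-β adjacent categories model and equally spaced scores can be seen by summing adjacent logits: the log odds of category j against category 1 is then the sum of the intercepts up to j plus (j − 1)β′x. The sketch below illustrates this with hypothetical names and values.

```python
import math

def adjacent_categories_probs(alphas, beta, x):
    """Category probabilities when log[Pr(Y = y_j) / Pr(Y = y_(j-1))] = alpha_j + beta'x.

    alphas holds alpha_2..alpha_k; beta is common to all adjacent pairs.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    # Accumulate adjacent logits to get the log odds of each category versus category 1.
    log_ratio_to_first = [0.0]
    for a in alphas:
        log_ratio_to_first.append(log_ratio_to_first[-1] + a + eta)
    denom = sum(math.exp(v) for v in log_ratio_to_first)
    return [math.exp(v) / denom for v in log_ratio_to_first]

alphas = [0.3, -0.7, -1.2]   # hypothetical intercepts, k = 4
beta = [0.05]                # hypothetical common slope
x = [10.0]
probs = adjacent_categories_probs(alphas, beta, x)

adjacent_logits = [math.log(probs[m + 1] / probs[m]) for m in range(3)]
# Versus category 1, the covariate effect grows linearly with the category rank.
logits_vs_first = [math.log(probs[m + 1] / probs[0]) for m in range(3)]
```

Under these assumptions the covariate effect against the first category is β, 2β and 3β for categories 2, 3 and 4, which is what equally spaced scores for the response imply.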
It is important to note that the parameters of the models presented above, while representing in each case a constant (α) and a regression parameter (β), are not comparable since each predicts different probabilities.
Variations and generalizations of the models described above have also been presented,31 including the stereotype model,32 which is based on model (1) together with constraints on the regression parameters so as to have a monotonic pattern. While models based on logits are natural for studying an ordered categorical response variable, additional models based on mean response can also be employed. These models, which resemble linear regression models for continuous response, are appropriate mainly for analysing underlying continuous variables and have some disadvantages when used for categorical ordered variables.26 In addition, a number of possibilities for assigning scores to an ordered categorical variable can be employed, including methods based on an underlying continuous distribution and methods based on maximizing correlation.33–35
The selection of an appropriate model is usually carried out using goodness of fit considerations. However, additional criteria should be addressed as well. These criteria include the formulation of the research question, namely which logit was of interest a priori, as well as the type of dependent ordered variable under consideration; that is, does it represent an underlying continuous phenomenon or a discrete one? Is the model parsimonious? Does modelling involve the assignment of scores to the ordered categories? Finally, what is the effect of combining adjoining categories on the model?
Given the diverse approaches available for analysing the association between a set of covariates and an ordered categorical response, it is of interest to establish whether conclusions would differ substantively using alternative methods, and especially whether the results yielded by models which are based on a polytomous response variable differ from those yielded by a model which dichotomizes the response variable. We therefore selected the ordinary polytomous model, the cumulative odds model, the continuation ratio model and the adjacent categories model and compared the conclusions yielded by these models with those of a logistic regression model based on a dichotomization of self-rated health.
Health and socioeconomic circumstances in the 1958 birth cohort
Measures
For self-rated health, cohort members gave an overall assessment of their health, rated as excellent, good, fair or poor during a personal interview at age 33.
Socioeconomic circumstances were indicated by social class, based on occupation, at four ages: (i) father's occupation at the time of the respondent's birth (using the Registrar General's (RG) 1950 classification); (ii) father's occupation when the cohort member was aged 16 (to reduce the effects of sample attrition due to missing data, father's social class at age 11 was used if 16-year data were not available); (iii) cohort member's own occupation at age 23 (using the 1980 RG classification); (iv) cohort member's own occupation at age 33 (using the 1990 RG classification). Four social groups were used: I & II (professional and managerial), III non-manual (skilled), III manual (skilled) and IV & V (unskilled manual). Both men and women were classified on the basis of their own occupation.
We constructed a summary measure of socioeconomic circumstances at the four ages (SES lifescore) in which each life stage had equal weighting and which was compared with other measures of cumulative SES.25 This simple score was derived by adding the social class value (1–4) at different ages, and ranged from 4 (for those always in the highest social class) to 16 (for those always in the lowest class). About 25% of men had scores below 8, and 25% had scores above 12; for women the percentages were 25% and 17% respectively.
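The construction of the lifescore can be sketched as follows. The mapping of the four social groups to the values 1 to 4 follows the description above; the dictionary and function names are hypothetical.

```python
# Social class values, from highest (1) to lowest (4) social group.
CLASS_SCORE = {
    "I & II": 1,
    "III non-manual": 2,
    "III manual": 3,
    "IV & V": 4,
}

def ses_lifescore(class_at_birth, class_at_16, class_at_23, class_at_33):
    """Sum the social class values at the four ages, each with equal weight."""
    stages = (class_at_birth, class_at_16, class_at_23, class_at_33)
    return sum(CLASS_SCORE[c] for c in stages)

always_highest = ses_lifescore("I & II", "I & II", "I & II", "I & II")
always_lowest = ses_lifescore("IV & V", "IV & V", "IV & V", "IV & V")
mixed = ses_lifescore("III manual", "III manual", "III non-manual", "I & II")
```

Here `always_highest` is 4 and `always_lowest` is 16, matching the stated range, while the mixed trajectory scores 3 + 3 + 2 + 1 = 9.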
Educational qualifications obtained by age 33 were classified into five groups: above A Level (29% men, 26% women), A Level or equivalent (24% men, 10% women), O Level or equivalent (24% men, 36% women), less than O Level (14% men, 17% women), and no qualifications (9% men, 11% women).
Data analysis
The approaches for comparison were ordinary polytomous regression, the cumulative odds, the continuation ratio, the adjacent categories method and a logistic regression analysis based on a dichotomization of self-rated health. For the continuation ratio, the assumption of parallelism was tested. For the adjacent categories method a uniform association having a fixed β was used to represent a more parsimonious model and equally-spaced scores were assigned to self-rated health.
Three models were examined for each method: first with social class at the four ages separately as the explanatory variables, second with the SES summary (lifescore), and third with lifescore and education simultaneously. Both social class and educational qualifications were used as ordinal categorical variables with equally spaced scores assigned to the categories. Different ways of assigning the scores to these variables and the sensitivity of the results of logistic regression models to the different scores have been reported elsewhere.35 A potential problem in Model 1 is collinearity arising from the correlation between the independent variables. We assessed the size of this problem and its impact on the estimated parameters for the logistic regression model, and found that the results were not affected by collinearity.38
Goodness of fit was assessed for each model using the deviance statistic. However, when the number of covariate patterns was large, we used a modified test based on a procedure suggested by Hosmer and Lemeshow.39 The estimated logits were divided into 10 equal groups and in each group the observed and expected frequencies were compared for the categories of self-rated health. A Pearson χ² test was used to summarize the comparison. Furthermore, evaluation of the residuals and graphical methods were also used to assess the fit of the models.
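A grouped comparison of observed and expected frequencies of this kind might be sketched as below. This is only one possible reading of the procedure: subjects are ordered by an estimated logit, split into equal-sized groups, and a Pearson statistic is accumulated over groups and response categories. The function name, its arguments and the toy data are hypothetical, and the choice of reference distribution and degrees of freedom is deliberately left out.

```python
def grouped_pearson_statistic(linear_predictors, predicted_probs, observed, n_groups=10):
    """Pearson statistic comparing observed and expected category counts by group.

    linear_predictors: one estimated logit per subject, used only for ordering.
    predicted_probs:   per-subject list of fitted category probabilities.
    observed:          per-subject observed category index (0-based).
    """
    order = sorted(range(len(observed)), key=lambda i: linear_predictors[i])
    n = len(order)
    k = len(predicted_probs[0])
    statistic = 0.0
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        for c in range(k):
            obs = sum(1 for i in members if observed[i] == c)
            exp = sum(predicted_probs[i][c] for i in members)
            if exp > 0:
                statistic += (obs - exp) ** 2 / exp
    return statistic

# Toy data: 20 subjects, two categories, fitted probability 0.5 for each.
lp = [float(i) for i in range(20)]
fitted = [[0.5, 0.5] for _ in range(20)]
stat_close_fit = grouped_pearson_statistic(lp, fitted, [i % 2 for i in range(20)], n_groups=2)
stat_poor_fit = grouped_pearson_statistic(lp, fitted, [0] * 20, n_groups=2)
```

With observed counts equal to the expected ones the statistic is zero, and it grows as the observed frequencies depart from the fitted ones; small expected counts in some cells make the χ² approximation less reliable.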
The results for logistic regression and the three alternative approaches were compared with respect to the effect size, the significance of results, gender differences, linearity of association (which was tested with a quadratic term) and tests of interactions.
Results
The power of the statistical testing carried out under the different approaches is influenced, as in other cases of statistical inference, by the sample size and the effect size. In our analyses both the sample and effect sizes are large and so we considered whether our findings were robust with a smaller sample and effect size. Focusing on the association between lifetime SES and self-rated health among men, we therefore considered two additional situations: the first in which the same bivariate association prevailed between these two variables but the sample size was reduced to 10% of the original male sample (i.e. 346), and the second in which for the reduced sample there was also a smaller effect. We compared the results of the logistic regression model with those of the adjacent categories, cumulative odds and continuation ratio methods. For the first situation all methods showed a significant association between lifetime SES and self-rated health. For the second situation the results differed: logistic regression showed no significant association with an estimated slope of 0.069 (SE 0.0521), whereas the other methods showed a significant association. For example, the estimated slope was 0.073 (SE 0.027) for the adjacent categories model and 0.099 (SE 0.035) for the cumulative odds model. The reduction in the sample size resulted in only four individuals in the poor category for self-rated health. We repeated the analyses described above after combining the poor and fair categories and found similar results.
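The contrast between the methods in the second situation can be illustrated from the reported slopes and standard errors. The sketch below assumes a two-sided Wald test with a normal reference distribution; the text does not state which test was actually used, so this is an assumption made for illustration, and the function name is hypothetical.

```python
import math

def wald_test(estimate, standard_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Slopes and standard errors as reported for the reduced sample with a smaller effect.
reported = {
    "logistic regression": (0.069, 0.0521),
    "adjacent categories": (0.073, 0.027),
    "cumulative odds": (0.099, 0.035),
}
results = {name: wald_test(est, se) for name, (est, se) in reported.items()}
```

Under this assumption the logistic regression slope is about 1.3 standard errors from zero, whereas the slopes from the two ordinal models are more than 2.5 standard errors from zero, which is consistent with the pattern of significance described above.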
Discussion
Logistic regression and other ordered categorical models
The results yielded by the logistic regression approach were similar to those obtained by the other methods. The logistic regression analysis was based on a dichotomization of the self-rated health variable, which involves loss of information, and in addition, the ordinal nature of the variable is not incorporated. Both aspects may lead to a reduction in statistical efficiency.21,24 However, only small differences in power and efficiency were evident. These small differences are related to the relatively large sample size. Using simulations, Armstrong and Sloan41 compared results from logistic regression with the cumulative odds and continuation ratio models. They concluded that logistic regression involved only a slight reduction in power compared with the other approaches when the ordinal scale is collapsed into a dichotomy with equal numbers in each category. Dichotomization in our study resulted in about 12% in one category and 88% in the other. However, our sample size was far larger than that of Armstrong and Sloan.41 The similarity of results, together with the good model fit reported in this paper, provides further evidence of the robustness of the logistic regression results and justifies previous work carried out with the 1958 birth cohort.9,10 Further, given the simplicity and wide use of logistic regression, particularly in epidemiology, our results support this practice within the context of a large sample size. However, as was illustrated by an example, in situations characterized by a smaller sample, a smaller effect size or both, results from logistic regression may differ from those yielded by the other methods.
The findings of the logistic regression analyses regarding our main research questions were confirmed by the other methods in respect of the accumulation and magnitude of effect of socioeconomic conditions on self-rated health, interactive effects, the combined effects of lifetime SES and education, and gender differences.
Ordered categorical models
The methods which incorporated the categorical nature of self-rated health also gave similar results. Ananth and Kleinbaum24 used data with a single covariate and compared a number of approaches including the continuation ratio, the cumulative odds, the ordinary polytomous and the adjacent categories. The results from the different approaches were similar. Two other studies, one based on simple data with a single covariate42 and the other based on more complex data43 reported similarity of findings for several analytical approaches. An additional study44 used simulations to compare results based on proportional odds to those of the stereotype model. It concluded that the performance of the first approach was superior to that of the second. Our analysis, which is based on a large data set and a number of covariates, provides additional support for the similarity of results of the various methods.
All four approaches examined fit the data for men rather well, with a less good fit for women and an indication of better fit for the adjacent categories approach than the continuation ratio or the cumulative odds approaches. However, the deviances reported for each approach are not completely comparable, and thus cannot be used as the only measure of goodness of fit. Further, while we have used a modified test of goodness of fit when the number of covariate patterns was large, the resulting χ² tests were based on tables in which the expected number of cases was smaller than five in a number of cells. The small number of expected cases was due to the small number of subjects who rated their health as poor. This, together with the sensitivity of the χ² test to a sample as large as ours, leads to model selection which depends on scrutiny of the basic relationships, graphical representations, conceptualization of the response variable and, most importantly, the specific research questions. The continuation ratio, cumulative odds and adjacent categories approaches are more parsimonious than ordinary polytomous regression. The continuation ratio approach assumes that each category of self-rated health is of intrinsic interest, while the adjacent categories and cumulative odds approaches imply an underlying continuous attribute. The assumption of parallelism that was found to be appropriate for the continuation ratio model suggests that the associations hold across different comparisons (each health category versus better health). With the ordinary polytomous models, by contrast, there was limited evidence, for women but not men, for the existence of a threshold in the association between cumulative socioeconomic conditions and self-rated health. Our analyses did not indicate that one method was strongly preferable to any other, but were suggestive of monotonic relationships throughout the categories of self-rated health. This finding accords with recently published results on the continuity of self-rated health17,19 and contrasts with previously reported studies suggesting that there are different predictors for good and for less than good health.18 However, when self-rated health was used to predict mortality, each decrease in the level of rating was found to incrementally increase the risk of mortality.45,46
In conclusion, results concerning the relationship between socioeconomic position and self-rated health yielded by a logistic regression model were confirmed by alternative statistical methods which incorporate the ordered nature of self-rated health. Similarity of results was found regarding size and significance of main effects, type of association and interactive effects. Methods based on self-rated health with four categories suggest that self-rated health forms a continuum.
References
2 Power C, Manor O, Fox AJ. Health and Class: The Early Years. London: Chapman and Hall, 1991.
3 Moller L, Kristensen TS, Hollnagel H. Self rated health as a predictor of coronary heart disease in Copenhagen, Denmark. J Epidemiol Community Health 1996;50:423–28.
4 Fylkesnes K. Determinants of health care utilization: visits and referrals. Scand J Soc Med 1993;21:40–50.
5 Idler EL, Benyamini Y. Self rated health and mortality: a review of twenty-seven community studies. J Health Soc Behav 1997;38:21–37.
6 Miilunpalo S, Vuori I, Oja P, Pasanen M, Urponen H. Self-rated health status as a health measure: the predictive value of self-reported health status on the use of physician services and on mortality in the working age population. J Clin Epidemiol 1997;50:517–28.
7 Krause NM, Jay GM. What do global self-rated health items measure? Med Care 1994;32:930–42.
8 Lundberg O, Manderbacka K. Assessing reliability of a measure of self-rated health. Scand J Soc Med 1996;24:218–24.
9 Power C, Matthews S, Manor O. Inequalities in self-rated health in the 1958 birth cohort: life time social circumstances or social mobility? Br Med J 1996;313:449–53.
10 Power C, Matthews S, Manor O. Inequalities in self-rated health: explanations from different stages in life. Lancet 1998;351:1009–14.
11 Mackenbach JP, Kunst AE, Cavelaars AEJM, Groenhof F, Geurts JJM. Socioeconomic inequalities in morbidity and mortality in western Europe. Lancet 1997;349:1655–59.
12 Shetterly S, Baxter J, Mason LD, Hamman RF. Self rated health among Hispanic vs non-Hispanic white adults: the San Luis Valley Health and Aging Study. Am J Public Health 1996;86:1798–801.
13 Rahkonen O, Arber S, Lahelma E. Health inequalities in early adulthood: a comparison of young men and women in Britain and Finland. Soc Sci Med 1995;41:163–71.
14 West P. Inequalities? Social class differentials in health in British youth. Soc Sci Med 1988;27:291–96.
15 Arber S. Comparing inequalities in women's and men's health: Britain in the 1990's. Soc Sci Med 1997;44:773–87.
16 Macran S, Clarke L, Sloggett A, Bethune A. Women's socioeconomic status and self-assessed health: identifying some disadvantaged groups. Soc Health Ill 1994;16:182–208.
17 Manderbacka K, Lahelma E, Martikainen P. Examining the continuity of self-rated health. Int J Epidemiol 1998;27:208–13.
18 Smith AMA, Shelley JM, Dennerstein L. Self rated health: biological continuum or social discontinuity. Soc Sci Med 1994;39:77–83.
19 Mackenbach JP, Van Den Bos J, Joung IMA, Van De Mheen H, Stronks K. The determinants of excellent health: different from the determinants of ill-health? Int J Epidemiol 1994;23:1273–81.
20 McCullagh P. Regression models for ordinal data (with discussion). J R Statist Soc B 1980;42:109–42.
21 Agresti A. Analysis of Ordinal Categorical Data. New York: Wiley, 1984.
22 McCullagh P, Nelder JA. Generalized Linear Models (2nd Edn). London: Chapman and Hall, 1989.
23 Anderson JA, Philips PR. Regression, discrimination, and measurement models for ordered categorical variables. Appl Statist 1981;30:22–31.
24 Ananth CV, Kleinbaum DG. Regression models for ordinal responses: a review of methods and applications. Int J Epidemiol 1997;26:1323–33.
25 Power C, Manor O, Matthews S. Duration and timing of exposure: effects of socio-economic environment on adult health. Am J Public Health 1999;7:1059–65.
26 Agresti A. Categorical Data Analysis. New York: Wiley, 1990.
27 Walker SH, Duncan DB. Estimation of the probability of an event as a function of several independent variables. Biometrika 1967;54:167–79.
28 Feinberg S. The Analysis of Cross-Classified Categorical Data, 2nd Edn. Cambridge MA: MIT Press, 1980.
29 Läära E, Mathews JSN. The equivalence of two models for ordinal data. Biometrika 1985;72:206–07.
30 Goodman LA. The analysis of dependence in cross-classifications having ordered categories, using log-linear models for frequencies and log-linear models for odds. Biometrics 1983;39:149–60.
31 Peterson BL, Harrell FE. Partial proportional odds models for ordinal response variables. Appl Statist 1990;39:205–17.
32 Anderson JA. Regression and ordered categorical variables (with discussion). J R Statist Soc B 1984;46:1–30.
33 Fielding A. Scoring functions for ordered classifications in statistical analysis. Quality Quantity 1993;27:1–17.
34 Gilula Z. Grouping and association in two-way contingency tables: a canonical correlation analytic approach. J Am Stat Ass 1986;81:773–79.
35 Manor O, Matthews S, Power C. Comparing measures of health inequality. Soc Sci Med 1997;45:761–71.
36 Butler NR, Bonham DG. Perinatal Mortality. Edinburgh: Livingstone, 1963.
37 Ferri E (ed). Life at 33: The Fifth Follow up of the National Child Development Study. London: National Children's Bureau, 1993.
38 Wax Y. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet-cancer relationship in epidemiological studies. Stat Med 1992;11:1273–87.
39 Hosmer DW, Lemeshow D. Applied Logistic Regression. New York: Wiley, 1989.
40 Wagstaff A, van Doorslaer E. Measuring inequalities in health in the presence of multiple-category morbidity indicator. Health Econ 1994;3:281–91.
41 Armstrong BG, Sloan M. Ordinal regression models for epidemiologic data. Am J Epidemiol 1989;129:191–204.
42 Cox C, Chuang C. A comparison of chi-square partitioning and two logit analyses of ordinal pain data from a pharmaceutical study. Stat Med 1984;3:273–85.
43 Greenwood G, Farewell V. A comparison of regression models for ordinal data in an analysis of transplanted-kidney function. Can J Stat 1988;16:325–35.
44 Holtbrugge W, Schumacher M. A comparison of regression models for the analysis of ordered categorical data. Appl Statist 1991;40:249–59.
45 Pijls LTJ, Feskens EJM, Kromhout D. Self-rated health, mortality and chronic disease in elderly men: the Zutphen study, 1985–1990. Am J Epidemiol 1993;138:840–48.
46 McCallum J, Shadbolt B, Wang D. Self reported health and survival: a 7-year follow up study of Australian elderly. Am J Public Health 1994;84:1100–05.