MRC Social and Public Health Sciences Unit, University of Glasgow, 4 Lilybank Gardens, Glasgow G12 8RZ, UK. E-mail: geoff{at}msoc.mrc.gla.ac.uk
In their editorial on a life course approach to chronic disease epidemiology, Ben-Shlomo and Kuh predict that 'techniques ... currently under-utilized in conventional epidemiological analyses, for example structural equation modelling, path analysis, G-estimation and multi-level modelling, will become more widespread'.1 In this issue, Singh-Manoux and colleagues2 present a structural equation model which offers a simple and appealing solution to a type of problem that will be familiar to epidemiologists. The problem, in their example, is that the unconditional effect of education on health is positive, as expected, but when conditioned on, or adjusted for, occupational grade and income, the direction of the effect is reversed.
Situations like this are common where predictors are correlated, and they typically involve imprecise estimates, due to large standard errors, or unstable estimates that are sensitive to changes in relatively few data points. The problems are more severe the greater the degree of association among the predictors, with the most severe form occurring when there is a perfect linear relationship between them. The term collinearity, strictly speaking, refers to this extreme case, although its usage has now been extended to cover less than perfect association. In observational research, true collinearity is very rare and apparent examples are much more likely to be due to model misspecification.
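The link between the degree of association and the size of the standard errors can be made concrete with the variance inflation factor. The sketch below assumes the simplest case of exactly two predictors with correlation r, for which the factor is 1 / (1 - r^2); the function name and the chosen values of r are illustrative only and do not come from the paper under discussion.

```python
import math


def variance_inflation(r):
    """Variance inflation factor for either of two predictors
    whose correlation with each other is r (requires |r| < 1)."""
    return 1.0 / (1.0 - r ** 2)


# Illustrative correlations between the two predictors.
for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation(r)
    # The standard error is multiplied by the square root of the VIF.
    print(f"r = {r:.2f}  VIF = {vif:7.2f}  SE multiplier = {math.sqrt(vif):5.2f}")
```

At r = 1 the expression is undefined, which corresponds to true collinearity: the separate coefficients cannot be estimated at all.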
Perhaps by analogy with true collinearity, collinearity in the looser sense is often treated as a technical problem and one to be overcome by technical means, for example: variable selection, principal component scores, or techniques like ridge regression. However, it can also be viewed as an issue of interpretation. Singh-Manoux and colleagues take this line, describing the negative conditional effect of education as misleading although not erroneous, and they add that it is plausible to believe that the better educated have poorer psychosocial health than those less educated, given that they have achieved the same income and occupational status. With correlated predictors, it is the subjects who do not conform to the pattern that provide information about the conditional effects and, for highly correlated predictors, these may be few in number and heterogeneous. In such cases it can be useful to ask 'who are these people and why are they exceptional?' and even to examine the data for possible answers. Substantial measurement error may be one answer.
In contrast to the technical remedies for collinearity, the solution proposed by Singh-Manoux and colleagues imposes a temporal ordering on the predictors, which yields more plausible results.
For those inclined towards a life course approach this may be particularly appealing. Theoretically important variables are retained rather than being dropped or rendered less interpretable as principal component scores and they are ordered, or structured, to represent a theoretically based model. Add to these advantages the prospect that regression dilution can be reduced by employing latent variables, and the use of full information maximum likelihood to reduce the impact of missing values, and structural equation models (SEM) begin to seem attractive indeed.
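How a temporal ordering can reconcile a positive unconditional effect with a negative conditional one is easiest to see in a small path analysis. The sketch below is not the authors' model: it assumes a fully recursive ordering (education, then occupation, then income, then health), and the correlation matrix is hypothetical, chosen only to reproduce the sign reversal. The helper `std_coefs` is likewise an illustrative name.

```python
import numpy as np

# Hypothetical correlations (not taken from the paper), in the order:
# education, occupation, income, health.
R = np.array([
    [1.0, 0.6, 0.5, 0.1],
    [0.6, 1.0, 0.6, 0.3],
    [0.5, 0.6, 1.0, 0.3],
    [0.1, 0.3, 0.3, 1.0],
])
EDU, OCC, INC, HLTH = range(4)


def std_coefs(R, predictors, outcome):
    """Standardized regression coefficients of outcome on predictors,
    computed from a correlation matrix."""
    r_xx = R[np.ix_(predictors, predictors)]
    r_xy = R[predictors, outcome]
    return np.linalg.solve(r_xx, r_xy)


# Path coefficients implied by the assumed temporal ordering.
(edu_occ,) = std_coefs(R, [EDU], OCC)
edu_inc, occ_inc = std_coefs(R, [EDU, OCC], INC)
edu_hlth, occ_hlth, inc_hlth = std_coefs(R, [EDU, OCC, INC], HLTH)

# Direct effect: education on health, adjusted for occupation and income.
direct = edu_hlth
# Indirect effect: the sum over every path from education to health
# that passes through occupation and/or income.
indirect = (edu_occ * occ_hlth
            + edu_inc * inc_hlth
            + edu_occ * occ_inc * inc_hlth)
total = direct + indirect

print(f"direct   = {direct:+.3f}")
print(f"indirect = {indirect:+.3f}")
print(f"total    = {total:+.3f}")
```

With these made-up correlations the direct effect of education is about -0.17, the indirect effect about +0.27, and the total +0.10, which equals the unconditional correlation between education and health. The negative adjusted coefficient is therefore only one component of a positive overall effect.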
Why then are they still under-utilized in epidemiology? Unfamiliar terminology and methods? The fact that some (LISREL) models appear to be formulated entirely in Greek? Or the dozens of vicariously related fit statistics? More probably it is because the most popular SEM programs (LISREL, EQS and AMOS) lack many of the basic features available in general or generalized linear models.
Structural equation models, in common with many other multivariate techniques, assume that all the variables employed are continuously and normally distributed. Adhering strictly to this assumption would severely restrict their use and exclude some of the control variables routinely included in models for other health outcomes, e.g. sex, social class, and smoking. The paper by Singh-Manoux et al. typifies the more pragmatic use of SEM. None of their predictors are continuous and some of the data are not normally distributed, so the results are checked using distribution-free methods. The one dichotomous variable, sex, is handled partly by separate analysis and partly by multi-group analysis: a technique whereby the separate covariance matrices for subgroups are analysed jointly and subgroup differences are modelled by imposing or relaxing across-group constraints. However, multi-group analysis is usually confined to a single variable. It is not uncommon to see published models that simply include dichotomous variables, like sex, as if they were continuous normal covariates, with or without the usual advice to treat the results with caution (How much caution?). What would a newcomer make of a method whose practitioners frequently flout its basic assumptions? What if they were also told that there is debate about how to include interactions and non-linear relationships in such models?3
Then there is the problem of equivalent models. In the paper, models II and III are equivalent, that is, they both fit the data equally well. But model III is not the only other model equivalent to model II. Take Figure 2, for example: the three boxes for the predictors of health could be re-labelled with any of the five other permutations of Education, Occupation and Income and still yield equivalent models. Choosing the most plausible model makes sense, but care must be taken to avoid circular reasoning.
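The equivalence of re-labelled models can be checked numerically. The sketch below does not reproduce the model in Figure 2; it assumes a generic fully recursive path model with health as the final outcome and a hypothetical correlation matrix, and the function `implied_correlations` is an illustrative name. For each of the six orderings of the three predictors it fits the paths by sequential regression and rebuilds the correlation matrix that the model implies.

```python
from itertools import permutations

import numpy as np

# Hypothetical correlations (not taken from the paper), in the order:
# education, occupation, income, health.
names = ["education", "occupation", "income", "health"]
R = np.array([
    [1.0, 0.6, 0.5, 0.1],
    [0.6, 1.0, 0.6, 0.3],
    [0.5, 0.6, 1.0, 0.3],
    [0.1, 0.3, 0.3, 1.0],
])


def implied_correlations(R, order):
    """Fit a fully recursive path model in the given causal order and
    return the correlation matrix it implies, in the original order."""
    k = len(order)
    S = R[np.ix_(order, order)]
    B = np.zeros((k, k))   # B[j, i] is the path from variable i to variable j
    psi = np.zeros(k)      # residual variances
    psi[0] = S[0, 0]
    for j in range(1, k):
        # Regress variable j on every variable that precedes it.
        coef = np.linalg.solve(S[:j, :j], S[:j, j])
        B[j, :j] = coef
        psi[j] = S[j, j] - S[:j, j] @ coef
    A = np.linalg.inv(np.eye(k) - B)
    implied = A @ np.diag(psi) @ A.T
    back = np.argsort(order)   # undo the reordering
    return implied[np.ix_(back, back)]


for perm in permutations(range(3)):
    order = list(perm) + [3]   # health is always the final outcome
    fits = np.allclose(implied_correlations(R, order), R)
    label = " -> ".join(names[i] for i in order)
    print(f"{label}: reproduces observed correlations = {fits}")
```

Every ordering reproduces the observed matrix, so under these assumptions fit alone cannot distinguish between them; the choice has to rest on theory about which ordering is plausible.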
Having said all that, SEM is a developing area and methods which remove some of the limitations are percolating through to mainstream packages. At the same time, there are new programs, such as Mx4 and Mplus,5 which are much more flexible, both in the range of data types that can be accommodated and the models that can be fitted.
Structural equation models can be thought of as combining path analysis with latent variables. Singh-Manoux and colleagues emphasize the advantages of the path analysis aspect but the incorporation of latent variables is at least as important. Indeed, Muthén6 argues that the notion of latent variables, when expanded to include latent categorical variables, subsumes a wide range of statistical concepts and their associated methods of analysis. These include random effects, multilevel models, growth curve models, latent class analysis, and cluster analysis. His general latent variable modelling framework may already contain most of the tools needed for a life course approach and surely that is an appealing prospect.
References
1 Ben-Shlomo Y, Kuh D. A life course approach to chronic disease epidemiology: conceptual models, empirical challenges and interdisciplinary perspectives. Int J Epidemiol 2002;31:285–93.
2 Singh-Manoux A, Clarke P, Marmot M. Multiple measures of socioeconomic position and psychosocial health: proximal and distal measures. Int J Epidemiol 2002;31:1192–99.
3 Schumacker RE, Marcoulides GA (eds). Interaction and Nonlinear Effects in Structural Equation Modeling. Fullerton: California State University, 1998.
4 Neale MC. Mx: Statistical Modeling. 2nd Edn. Box 710 MCV, Richmond, VA 23298: Department of Psychiatry, 1994. http://www.vcu.edu/mx/mxkey.html (accessed 3 September 2002).
5 Muthén LK, Muthén BO. Mplus User's Guide. Los Angeles, CA: Muthén & Muthén, 1998–2001.
6 Muthén BO. Beyond SEM: general latent variable modelling. Behaviormetrika 2002;29:81–117.