1 Department of Statistics, Texas A&M University, College Station, TX.
2 Department of Large Animal Medicine and Surgery, Texas A&M University, College Station, TX.
Received for publication February 23, 2001; accepted for publication April 18, 2002.
ABSTRACT
case-control studies; colic; effect modifiers (epidemiology); epidemiologic methods; heterogeneity; logistic models; multicenter studies
INTRODUCTION
Specifically, we examine whether a characteristic of a matching covariate is associated with effect heterogeneity for the association of equine colic with other covariates of interest. More generally, the methods described can be extended to examine effect heterogeneity by any ordered categorical or continuous matching covariate; other categorical factors such as geographic region can also be considered. The methods described in this report may be particularly useful for the analysis of multicenter case-control studies where effect heterogeneity by center or center size may occur. Failure to account for effect heterogeneity by center could lead to inappropriate generalization of study findings among centers or failure to detect factors relating to exposure or outcome that differ among centers.
Although epidemiologic studies of equine colic have been reported (2–13), few describe management practices that predispose to colic (3, 4, 6, 8, 10). Recent change in diet has been identified as a risk factor for colic (3, 4); in these reports, participating veterinarians contributed various numbers (clusters) of matched pairs of cases of colic and controls. A natural question to ask is the extent to which cluster size results in effect heterogeneity for the association of diet change with colic. The rationale for examining cluster size is that veterinarians who contributed more pairs of cases and controls may have differed in meaningful ways from those who contributed fewer. This is an illustration of a multicenter study where center-level-specific factors (e.g., number of contributed matched pairs) may be associated with effect heterogeneity.
MATERIALS AND METHODS
Participating veterinarians were asked to provide data for one horse treated for colic and one horse that received emergency treatment for any condition other than colic, monthly between March 1, 1997, and February 28, 1998. A colic case was defined as the first horse treated during a given month for signs of intraabdominal pain. A control was defined as the next horse that received emergency treatment for any condition other than colic and was treated by the veterinarian who treated the horse with colic. Controls were examined no more than 30 days after the corresponding colic case. The rationale for selecting the next horse treated for a condition other than colic as a control was to obviate any seasonal bias in choosing the comparison population, because incidence of colic is considered to be seasonal and because feeding and other management practices of horses vary by season; for example, the amount of fresh grass ingested while grazing varies between winter and spring. Horses were functionally matched on the basis of month, veterinarian, and region. Horses less than 6 months of age were excluded, because the types of colic in weanlings and foals are different from those in horses. Because some participating veterinarians were employed in the same practice, matched case-control pairs were often contributed by practices with cases and controls not matched by individual veterinarian. For the purposes of this study, we examined only the 498 matched pairs (996 horses) contributed by the 145 veterinarians in which the case-control sets were seen by the same veterinarian (table 1). Data collected for cases and controls included information regarding identifiers for the horse, farm, and veterinarian, date of examination, and various management factors (including dietary practices) (4).
Statistical methods and analysis of the data
This section describes the development of graphs-based methods, with associated confidence intervals, for understanding effect heterogeneity in a variable that is part of the matching process. Currently, no such graphs-based methods exist. The graphs used for our method, however, can be constructed by any conditional logistic regression program. In our study, the basic matching depended on the veterinarian, and the derived variable was the number of matched pairs contributed by a veterinarian. Each member of a matched pair had the same value of the potential source of effect heterogeneity.
Because it may not be obvious in what follows, we emphasize that every conditional logistic regression that we consider below is a multivariate conditional logistic regression: All the factors that were considered in the original multivariate model of risk factors for colic are considered simultaneously in our method. These factors were recent change in diet, recent change in type or batch of hay, recent change in stabling/housing management, recent change in weather conditions, history of previous colic, age (>10 years or ≤10 years), Arabian breed, whether the horse was exercised at least once each week, farm acreage (farms of >25 acres (0.1012 km²) and farms of ≤25 acres), recent administration of an anthelmintic, and whether anthelmintics were administered regularly (regular parasite control).
A simple method to test for effect heterogeneity is to add one or more multiplicative interaction terms to the model, multiplying the potential source of effect heterogeneity by the covariate in question and then testing whether this derived variable has a statistically significant effect. This can be done either simultaneously or in a series of conditional logistic regressions with each variable alone but in turn having the multiplicative interaction. Our method differs from this in two respects: 1) it is based on graphs rather than on statistical significance tests; and 2) it does not assume that the effect heterogeneity is of multiplicative form.
Our method is a version of what is known as the varying coefficient model (14), in which the coefficients of a conditional logistic regression vary with a matching variable. To the authors' knowledge, the idea of using the varying coefficient model in conditional logistic regression has not been reported previously. We describe our method as it applies to our example. The Appendix gives algebraic details of the varying coefficient model.
Our method to understand effect heterogeneity graphically is based on a type of moving average. The moving average approach entails running a conditional logistic regression for each of several overlapping subcohorts defined by the potential source of effect heterogeneity. In our example, the potential source of effect heterogeneity, k, was the number of matched pairs contributed by a veterinarian. We ran conditional logistic regression models for six subcohorts, each subcohort defined by a window of the number of matched pairs contributed. Subcohort 1 contained those matched pairs from veterinarians who contributed k = 1, 2, 3, 4, 5, or 6 matched pairs; subcohort 2 contained those matched pairs from veterinarians who contributed k = 2, 3, 4, 5, 6, or 7 matched pairs, and so on. The final subcohort, subcohort 6, consisted of those veterinarians who contributed ≥6 matched pairs. The regression coefficients are then plotted as a function of the source of effect heterogeneity. Any resultant trends can then be investigated further.
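The moving-window idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the analysis reported here: it assumes a single binary exposure, for which the conditional maximum likelihood log odds ratio in 1:1 matched pairs reduces to the log ratio of discordant pairs, whereas the models in this report are multivariate. The function names, the `(k, case_exposed, control_exposed)` tuple layout, and the treatment of the last window as open-ended are all assumptions made for the example.

```python
import math


def window_bounds(n_windows=6, width=6):
    """Return (low, high) limits on k for each moving-window subcohort.

    Subcohort j covers k = j, ..., j + width - 1; the last subcohort is
    treated as open-ended (an assumption of this sketch).
    """
    bounds = []
    for j in range(1, n_windows + 1):
        high = math.inf if j == n_windows else j + width - 1
        bounds.append((j, high))
    return bounds


def discordant_log_or(pairs):
    """Log odds ratio for 1:1 matched pairs and ONE binary exposure.

    Each pair is (k, case_exposed, control_exposed). Only discordant
    pairs carry information; returns None when the ratio is undefined.
    """
    n10 = sum(1 for _, case_x, ctrl_x in pairs if case_x == 1 and ctrl_x == 0)
    n01 = sum(1 for _, case_x, ctrl_x in pairs if case_x == 0 and ctrl_x == 1)
    if n10 == 0 or n01 == 0:
        return None
    return math.log(n10 / n01)


def moving_window_coefficients(pairs, n_windows=6, width=6):
    """Estimate one coefficient per subcohort, ready to plot against j."""
    coefs = []
    for low, high in window_bounds(n_windows, width):
        subcohort = [p for p in pairs if low <= p[0] <= high]
        coefs.append(discordant_log_or(subcohort))
    return coefs
```

Plotting the returned list against the subcohort index j = 1, ..., 6 gives the kind of graph described above. With several covariates, `discordant_log_or` would be replaced by a call to whatever conditional logistic regression routine is available.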
To assess probabilistically whether the odds ratio depended heavily on the subcohort number $j$, we conducted two analyses. First, we bootstrapped the data nonparametrically (15) by sampling veterinarians with replacement and then by sampling matched pairs by veterinarian with replacement (as an aside, we also sampled veterinarians with replacement in each subcohort defined by the number of matched pairs, with no major differences). Having formed the new bootstrap sample, we recomputed the moving average subcohorts $j = 1, 2, \ldots, 6$, as previously described, thereby obtaining the coefficients $\beta(j)$. We then fit linear and quadratic regressions to the plot of the coefficients $\beta(j)$ against $j$ and used bootstrap confidence intervals to make probability statements. We also investigated a second, more parameterized analysis. Instead of using our semiparametric subcohort method, we fit conditional logistic regression with linear and quadratic modifying effects for the change in diet. In the linear case, for example, to the dummy variable ($X$) indicating a change in diet, we added the variable $k \times X$, where $k$ is the number of matched pairs contributed by the veterinarian.
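A minimal sketch of the first analysis follows, under stated assumptions. It reads the two-stage bootstrap as drawing veterinarians with replacement and then drawing matched pairs with replacement within each drawn veterinarian. Only the linear trend is shown; the quadratic fit is analogous. The callback `coef_fn`, which should return the subcohort coefficients for a resampled data set, is hypothetical, as are all function names and the simple percentile interval.

```python
import random


def resample_two_stage(pairs_by_vet, rng):
    """Draw veterinarians with replacement, then pairs within each one."""
    vets = list(pairs_by_vet)
    clusters = []
    for _ in vets:
        vet = rng.choice(vets)
        pairs = pairs_by_vet[vet]
        clusters.append([rng.choice(pairs) for _ in pairs])
    return clusters


def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slopes(pairs_by_vet, coef_fn, n_boot=200, seed=0):
    """Slope of the coefficients against subcohort index, per bootstrap draw.

    coef_fn (hypothetical) maps a list of resampled clusters to the list
    of subcohort coefficients [beta(1), ..., beta(J)].
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        coefs = coef_fn(resample_two_stage(pairs_by_vet, rng))
        indices = list(range(1, len(coefs) + 1))
        slopes.append(ls_slope(indices, coefs))
    return slopes


def percentile_interval(values, level=0.95):
    """Simple percentile interval from a list of bootstrap replicates."""
    vals = sorted(values)
    low_index = int((1.0 - level) / 2.0 * (len(vals) - 1))
    return vals[low_index], vals[len(vals) - 1 - low_index]
```

An interval for the slope that excludes zero would suggest a linear trend in the coefficients across subcohorts. Because each resampled cluster keeps its original number of pairs, the value of k for a cluster can be recovered as its length.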
The method described above for ordered categorical variables can be extended to a continuous source of effect heterogeneity, if one categorizes the potential source of heterogeneity and then uses the methods described above (e.g., 5-year age categories instead of the specific age of the veterinarian). More generally, if one prefers to treat continuous variables as continuous data rather than as categorical data, we describe in the Appendix how this can be done with varying coefficient models (14), which are a class of models that contains ours as a special case.
Another way to model effect heterogeneity is through multiplicative interactions (i.e., to multiply each covariate by the source of effect heterogeneity, possibly modeling the product nonparametrically; see the Appendix for details). Our approach has the advantage of interpretation: Regression coefficients themselves depend on the source of effect heterogeneity. More importantly, our graphs-based method gives, we think, more direct and intuitive understanding of the effect heterogeneity than do significance levels obtained by fitting models with numerous multiplicative interactions.
Ours is not the only possible method that can be used to study effect heterogeneity. In the Appendix, we not only describe the varying coefficient model algebraically, but we also discuss another method, currently available only as a user-contributed S-PLUS program (MathSoft, Inc., Cambridge, Massachusetts) and suggested to us by a referee, that can be used to understand a specific type of effect modification, although the program was not intended for evaluating effect modification in case-control studies. This alternative method is not graphs based and requires special software. It is also not a varying coefficient model for effect heterogeneity except in the case that all covariates are binary. Our preliminary simulations (data not shown) suggest that, in this case, our method is more powerful statistically in detecting linear and quadratic effect heterogeneity.
RESULTS
Because conditional logistic regression was performed with all variables in the model simultaneously, we can consider effect heterogeneity for other risk factors. For example, consider regular parasitic control. In figure 2, for this risk factor we plot the regression coefficients $\beta(j)$ for the $j$th subcohort. Note here that effect heterogeneity appears to be nonlinear. Indeed, we fit a quadratic regression to this figure, resulting in the fit $0.74 + 0.05 \times k - 0.02 \times k^2$. A 95 percent confidence interval for the quadratic coefficient is $-0.03$, $-0.01$, so that a quadratic effect is indicated. We performed the same analysis for diet change and, as expected, the quadratic effect was not statistically significant.
DISCUSSION
This basic method is semiparametric in the following sense. In a standard conditional logistic regression with no effect heterogeneity, one is making the parametric modeling assumption that $\beta(k)$ does not depend on $k$. On the other hand, a parametric model for linear effect heterogeneity assumes that $\beta(k) = \alpha_0 + \alpha_1 k$, for some values of $\alpha_0$ and $\alpha_1$. In our initial analysis, we tested whether the number of matched pairs contributed by the veterinarian was a source of linear effect heterogeneity for diet change and found this to be highly statistically significant. The linear effect heterogeneity test is a type of test for trend, but as a test alone it sheds no light on the shape of the effect heterogeneity, other than to suggest that it is somewhat monotone in pattern. In contrast, the method we describe enables one to describe the shape of the effect heterogeneity by ordered categorical or continuous matching covariates and allows for any shape of effect heterogeneity. For example, we found evidence of a quadratic effect heterogeneity for regular parasitic control. The basic point is that our method allows the study of effect heterogeneity without having to specify the shape a priori.
This advantage of a semiparametric method becomes more evident when one allows for the possibility of more than one covariate. Recall that our analyses were multivariate conditional logistic regression, with all 11 aforementioned risk factors for colic considered simultaneously. Using parametric modeling, one would necessarily be required to specify 11 different functions. Some covariates would not have effect heterogeneity attributable to $k$, some would be linear in $k$ (as described above for diet change), while others might be quadratic in $k$ (as described above for parasitic control). Our method will, at the least, provide an analytical means for proposing models for each of the covariates. For example, if only one covariate (e.g., $X$) appears to have systematic effect heterogeneity and if it appears to be linear in the source of effect heterogeneity (e.g., $k$), then this can be modeled directly by adding to the model the term $k \times X$. Effect heterogeneity can be tested by conditional logistic regression software. Thus, the bootstrap, which requires special programming, is not necessary for using our approach.
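As an illustration of modeling linear effect heterogeneity directly, the sketch below fits a conditional logistic regression for 1:1 matched pairs with two terms, the exposure X and the product of k and X, by Newton-Raphson. It relies on the standard fact that for 1:1 matching the conditional likelihood depends only on within-pair differences of the predictors. The function name, the `(k, x_case, x_control)` tuple layout, and the parameter names `a0` and `a1` are assumptions for the example. In practice any conditional logistic regression program would be used, with the other covariates included.

```python
import math


def fit_pairs_with_interaction(pairs, n_iter=50, tol=1e-10):
    """Fit log OR(k) = a0 + a1 * k for 1:1 matched pairs.

    pairs: iterable of (k, x_case, x_control). Requires at least two
    distinct values of k among discordant pairs; otherwise the two
    terms are collinear and the 2 x 2 system below is singular.
    """
    # Within-pair differences (case minus control) for X and for k * X.
    diffs = [(xc - xn, k * (xc - xn)) for k, xc, xn in pairs]
    a0 = a1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d0, d1 in diffs:
            p = 1.0 / (1.0 + math.exp(-(a0 * d0 + a1 * d1)))
            # Score contributions.
            g0 += (1.0 - p) * d0
            g1 += (1.0 - p) * d1
            # Information matrix contributions.
            w = p * (1.0 - p)
            h00 += w * d0 * d0
            h01 += w * d0 * d1
            h11 += w * d1 * d1
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        a0 += step0
        a1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return a0, a1
```

The fitted log odds ratio for a veterinarian who contributed k pairs is then `a0 + a1 * k`, and a test of `a1 = 0` is the test for linear effect heterogeneity.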
The method of running a separate conditional logistic regression for each value of k will often be unacceptable, because the conditional logistic regression in the subcohort of matched pairs for a covariate may either be wildly variable or, in the worst instance, not even converge. Using the moving average approach obviates this problem. The methods we have proposed extend easily to the analysis of any ordered categorical variable as a potential source of effect heterogeneity. The numeric scale attributed to the categories should be selected with care, because the horizontal scale can affect the slope of the dose response. If the conditional logistic regressions are stable for each value of the ordered categorical variable, then one can run these regressions and plot the regression coefficients against the ordered categories. Otherwise, the moving average idea can be used, along with the bootstrap-based confidence intervals described above. These methods also can be applied to continuous covariates (Appendix). It also may be possible to extend these methods to categorical data, such as geographic information. For example, the conditional logistic regression for subcohorts of pairs matched on geographic location could be run, and one could plot the resulting coefficients. Evaluation of the resulting plot might indicate a spatial pattern that could be represented in a model. Although our study is an example of 1:1 matching, the method is applicable for any type of matching for which conditional logistic regression can be applied.
The results indicated that veterinarians who participated more completely differed from those who did not participate completely. This may have been because the former provided more accurate information or because these veterinarians had busier practices that saw different types of horses and clients than did those veterinarians who saw fewer horses. If, for example, the type of diet change or response to diet change depended on a horse's activity level or other management factors, it is possible that veterinarians who participated more completely were more likely to see horses at a certain level of activity (e.g., racing horses) or horses managed in a particular manner (e.g., horses that were predominantly stalled). Alternatively, the veterinarians who contributed more cases may have been less careful with data collection and tended to record similar differential responses for their matched pairs (i.e., were more likely to indicate that cases but not controls had a recent change in diet).
The response rate by veterinarians was low. Explanations include the fact that some veterinarians listed as large-animal or mixed-animal practices likely treated no or few horses (i.e., treated large animals other than horses) and lacked the interest and time to collect and transmit data for a 12-month study using a long questionnaire. Comparison of respondents with nonrespondents was limited because the only information recorded about veterinarians was their address and type of practice. There was no significant difference between respondents and nonrespondents regarding the distribution of practice types; however, the categories of practice type may not have been useful to discriminate between those large-animal practices that treated horses and those that treated few or none.
Only case-control pairs that were matched by specific veterinarian were considered in this analysis. Excluding sets of pairs matched by a clinic (rather than by the individual veterinarian) could have biased our results. We believe that the effect of this bias was minimal, however, because veterinarians who contributed a large number of cases individually were associated with clinics that contributed a large number of cases. Consequently, we believe the results would have been similar if we had matched by practice. Unfortunately, we did not include a covariate for practice.
This report describes a graphs-based method for assessing and characterizing the presence of effect heterogeneity by a matched covariate in matched case-control studies. The method can be applied to ordered categorical or continuous covariates for which cases have been matched with controls. The graphs themselves can be constructed with any conditional logistic regression program. The graphs may suggest forms of effect heterogeneity, for example, linear or quadratic, that can be modeled directly using conditional logistic regression software. At least in principle, it would appear that the method could yield information about effect heterogeneity even if the covariate of interest were not ordered (e.g., geographic data), although this remains to be explored.
Our method applies without change when the variable suspected of being linked to effect heterogeneity is measured as a matching factor for the pairs (see the Appendix if this variable is a continuous one). Our method is restricted to the case of a single variable being linked to effect heterogeneity. In principle, it would seem possible to extend the idea to the case of effect heterogeneity in two different variables. We have not attempted to do this and believe that such an extension is likely to be challenging.
Our method is a type of regression, where, in effect, the odds ratios are being regressed on a variable suspected of being linked to effect heterogeneity. As in any regression method, there are dangers in overinterpretation and variable misspecification. The overinterpretation issue is a standard one: One cannot claim causality, nor can one claim that there are no other variables that might be linked to effect heterogeneity. As mentioned, our method can help to identify variables that seem to be correlated with the cluster-specific odds ratios and thus to document that there may be effect heterogeneity. The method, however, cannot determine causality and, in particular, cannot determine whether the variable of interest is an important determinant of responses to exposure or merely an ecologically related factor.
The misspecification issue is more subtle but still important. For example, it might be that there are two variables that are linked to effect heterogeneity. If one applies our method to such a case, with a single variable, the same thing can happen as can happen in any regression problem where two variables affect a response but only one is placed in a model. As is well known, such model misspecification can lead to missed effects or effects even of the wrong sign. Thus, it is entirely possible that instances may arise whereby our method does not detect effect heterogeneity in a single variable when in fact effect heterogeneity exists and is "caused" by two variables. Alternatively, as in any regression problem, it is possible that our single-variable method may suggest an increasing pattern in odds ratios, when in fact controlling for a second variable would have shown that the first variable was associated with decreasing odds ratios. Unexplained variables affect all regression models, and ours is no exception. Patterns can be obscured or even wrongly interpreted, and significance levels can be affected in ways that are impossible to predict. Variance estimation can also be biased because of cluster-based dependencies that are secondary to an incompletely specified model for the binary outcomes, in the presence of effect heterogeneity.
Despite these important caveats, we believe that our simple graphs-based method represents an advance in suggesting a way to gain some understanding of the possibility of heterogeneity in odds ratios.
APPENDIX
Section 1
We let $D$ be case-control status, $S$ be all stratum-level variables (stratum is defined as the level of the matching covariate of interest, in our case the level of pairs submitted by the veterinarian), $X_1, X_2, \ldots, X_M$ be the individual-level predictors, and $Z$ be the variable that may be an effect modifier. In the colic case-control study, $Z = k$, the number of matched pairs contributed by the veterinarian.
The usual unconditional risk model without effect modification is that the logit of risk conditional on covariates is
$$\operatorname{logit}\{\Pr(D = 1)\} = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_M X_M + q(S), \tag{1}$$
where q(S) includes the intercept and the unknown effects of the strata that disappear with the conditioning in conditional logistic regression; that is, they are not modeled.
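To see why $q(S)$ need not be modeled, consider a single 1:1 matched pair within one stratum. The following is the standard conditional-likelihood argument, given as a sketch; the vector notation $\beta^\top x$ and the labels $x_{\text{case}}$ and $x_{\text{ctrl}}$ for the two predictor vectors are introduced here for illustration. Conditional on the pair containing exactly one case, the probability that the observed case is the one who developed the outcome is

$$
\frac{e^{\beta^\top x_{\text{case}} + q(S)}}{e^{\beta^\top x_{\text{case}} + q(S)} + e^{\beta^\top x_{\text{ctrl}} + q(S)}}
= \frac{1}{1 + e^{-\beta^\top (x_{\text{case}} - x_{\text{ctrl}})}} .
$$

The factor $e^{q(S)}$ is common to the numerator and denominator and cancels, so only within-pair differences in the predictors contribute to the likelihood.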
The most general varying coefficient model allows all the regression parameters in equation 1 to depend on the effect modifier $Z$, that is,
$$\operatorname{logit}\{\Pr(D = 1)\} = \beta_1(Z) X_1 + \beta_2(Z) X_2 + \cdots + \beta_M(Z) X_M + q(S). \tag{2}$$
If Z were categorical, as in our study, and if there were sufficient observations for each level of Z, then one simple way to fit equation 2 is to run a conditional logistic regression separately for each category and then graph each of the regression coefficients as a function of Z. Any conditional logistic regression program can be used to construct the graphs.
The interaction model described in the text takes a different form, namely,
$$\operatorname{logit}\{\Pr(D = 1)\} = \beta_1 X_1 + \cdots + \beta_M X_M + \theta_1(Z \times X_1) + \theta_2(Z \times X_2) + \cdots + \theta_M(Z \times X_M) + q(S). \tag{3}$$
In equation 3, the functions $\theta_1(Z \times X_1), \ldots, \theta_M(Z \times X_M)$ are functions of the products of $Z$ and $X$. Fitting equation 3 requires specialized software. As far as we know, no commercial packages include options for fitting equation 3, although for S-PLUS a program is available on the Web (http://lib.stat.cmu.edu/s-news/Burst/12907). In addition, if $Z$ is an ordered categorical variable and any of the predictors are continuous variables, equation 3 suffers from the difficulty of interpreting multiplicative interactions. We believe that our model (equation 2) is more natural and more easily interpreted. We point out that even the use of equation 3 is new in the context of matched studies.
There is one case when equation 2 and equation 3 coincide, namely, when all of the predictors $X_1, X_2, \ldots, X_M$ are binary. Such a model is overparameterized in this case, so that the $\beta$s must be removed from the model for fitting. In this case, the strength of our graphs-based method is that any conditional logistic regression program can be used and not just specialized software.
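The coincidence for binary predictors can be checked in one line. As a sketch, write the $m$th product function in equation 3 as $\theta_m$ (a label chosen here for illustration). If $X_m$ takes only the values 0 and 1, then $Z \times X_m$ equals 0 when $X_m = 0$ and $Z$ when $X_m = 1$, so

$$
\beta_m X_m + \theta_m(Z \times X_m)
= \theta_m(0) + \bigl\{\beta_m + \theta_m(Z) - \theta_m(0)\bigr\} X_m
= \theta_m(0) + \beta_m(Z)\, X_m ,
$$

with $\beta_m(Z) = \beta_m + \theta_m(Z) - \theta_m(0)$. The constant $\theta_m(0)$ is shared by the case and the control and is absorbed into $q(S)$, leaving a varying coefficient term of the form in equation 2. Only the combination $\beta_m(Z)$ is identifiable, which is why the separate $\beta_m$ is redundant.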
Section 2
The text describes how to fit equation 2 when $Z$ is a categorical variable. Here we describe how to fit equation 2 when $Z$ is a continuous variable. The basic varying coefficient model is fit by what is called local linear regression (LOESS smoothing) (14, 16). The idea is to select matched pairs whose value of $Z$ is somewhat near any given value, for example, $z$, and then run a conditional logistic regression on this subset of the data. To implement this method, one must specify a percentage, called the span; typically, a span = 67 percent is used as the default. Then one collects the values of $Z$, called the nearest neighbors, containing the span percentage closest to $z$: If the span = 67 percent, then these are the collection of the 67 percent of the $Z$ values closest to $z$. Let $\Delta$ be the range of the collected nearest neighbors. Define the weight for any value of $Z$ by $w = 0$ if $Z$ is not in the nearest neighbor collection, while otherwise
$$w = \{1 - (Z - z)^2/\Delta^2\}^3.$$
One then runs a weighted multivariate conditional logistic regression with weights $w$ and with the predictors $X_j$ and $X_j(Z - z)$, for $j = 1, \ldots, M$. If the former have regression coefficient estimates $\hat{\alpha}_{0j}$ and the latter have regression coefficient estimates $\hat{\alpha}_{1j}$, then the fitted local line is $\hat{\alpha}_{0j} + \hat{\alpha}_{1j}(Z - z)$, and the estimate of $\beta_j(z)$ is its value at $Z = z$, namely $\hat{\beta}_j(z) = \hat{\alpha}_{0j}$, because the centered term vanishes there. Although these local regression methods are well established, their application to evaluation of effect modification in conditional logistic regression is new.
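The nearest-neighbor weights can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function name is hypothetical, the range of the collected neighbors (called `delta` here) is taken to be their maximum minus their minimum, and weights are clamped at zero in case the target value lies outside that range.

```python
import math


def local_weights(z_values, z0, span=0.67):
    """Weights for a local fit at z0 using the nearest `span` fraction of Z.

    Values outside the nearest-neighbor collection get weight 0; values
    inside get {1 - (Z - z0)^2 / delta^2}^3, clamped at 0.
    """
    n_keep = max(1, math.ceil(span * len(z_values)))
    # Indices ordered by distance from the target value z0.
    order = sorted(range(len(z_values)), key=lambda i: abs(z_values[i] - z0))
    keep = order[:n_keep]
    kept = [z_values[i] for i in keep]
    delta = max(kept) - min(kept)  # assumed meaning of "range"
    weights = [0.0] * len(z_values)
    for i in keep:
        if delta == 0:
            weights[i] = 1.0  # all neighbors share one value of Z
        else:
            u = (z_values[i] - z0) ** 2 / delta ** 2
            weights[i] = max(0.0, 1.0 - u) ** 3
    return weights
```

The weights would then be supplied, one per matched pair, to a weighted conditional logistic regression at each target value of Z, and the resulting coefficients plotted against those target values.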
As before, a plot of $\beta(z)$ against $z$ can suggest structure for effect modification. One can, for example, fit a linear regression in this plot and then use the bootstrap for inference in the same way that was described for the categorical case.