Department of Epidemiology, UCLA School of Public Health, 22333 Swenson Drive, Topanga, CA 90290, USA.
Abstract
Keywords Bayesian statistics, empirical-Bayes estimation, hierarchical regression, mixed models, multilevel modelling, random-coefficient regression, ridge regression, risk assessment, Stein estimation
Accepted 29 June 1999
Introduction
Multilevel modelling has appeared and reappeared in many forms over the last 50 years. For example, the following techniques are all special cases of or equivalent to hierarchical regression: Bayesian, empirical-Bayes (EB), Stein, penalized-likelihood, mixed-model, ridge, and random-coefficient regression, and variance-components analysis.1–6 Conventional estimation methods, such as least squares and maximum likelihood, are also special cases in which there is only one level in the underlying model. The hierarchical approach unifies these methods and clarifies their meaning. Most importantly, it unifies the seemingly disparate methods of frequentist (classical) and Bayesian analysis. This unification is a major benefit of the approach, for it leads to methods that are superior to both classical frequentist and Bayesian methods.
I will here barely touch on the diverse applications, forms, and examples of multilevel modelling. My reason is that a more basic conceptual introduction seems necessary. Despite several decades of published examples and methodological studies demonstrating the superiority of the multilevel (hierarchical) perspective, and its widespread acceptance in the social sciences, it is not widely understood, taught, or employed in the health sciences. Lack of affordable and user-friendly software has no doubt been an obstacle to use, although as discussed below many software options are now available for multilevel modelling.
Another obstacle, however, is that most presentations move into technical and subject-matter details too rapidly for health-science readers, and few discuss the general multilevel perspective. I will develop that perspective here, although to limit the scope I will focus on estimation error, which is just one of several major problems addressed by multilevel modelling. I will explain how one by-product of multilevel modelling, hierarchical estimation, can help address that problem. The examples used here are artificial and too simple to warrant even an ordinary regression analysis. Their purpose is only to provide a basis for understanding the much more complex real examples that can be found in the references.
The Estimation Problem
Suppose our data comprise an observation of A spontaneous abortions in a cohort of N first planned pregnancies in that exposed environment. The observed proportion A/N is the usual estimator of the risk parameter π under standard validity assumptions; it is known as the maximum-likelihood estimator (MLE). Here, these validity assumptions may be summarized by stating that the observed proportion A/N differs from the target risk π in only a random fashion, in accord with the assumed probability model (usually, but not necessarily, a binomial model). In the present example, this validity assumption is met by ensuring that subject selection is random and outcome ascertainment is complete. Such assumptions are of course difficult to meet in practice; nonetheless, all standard statistical procedures rely on them, a point which is often overlooked when comparing new and standard methods.
Bias versus random scatter
The expectation of the observed proportion A/N is the average of this proportion over all possible samples (study repetitions) from the target population, with each possible value for A/N weighted by its probability. Given the above validity assumptions, this expectation equals the true abortion risk, which is the target parameter π. Denoting expectation by E( ), we can restate this property of the proportion A/N with the equation:

E(A/N) = π

In statistical theory, the bias of an estimator is defined to be the difference between the expectation (average value) of the estimator and the target parameter. An estimator is unbiased when its expectation equals the target parameter or, equivalently, when its bias is zero. The above equation thus says that, in a valid cohort study, the proportion A/N is unbiased for the average risk π, regardless of the size of π or N.
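This unbiasedness can be checked numerically. Below is a minimal sketch assuming a binomial model, with illustrative values (true risk 0.2, N = 10) that are not taken from the text:

```python
import random

random.seed(1)

def simulate_proportion(pi, N):
    """Draw one study: A ~ Binomial(N, pi), return the MLE A/N."""
    A = sum(1 for _ in range(N) if random.random() < pi)
    return A / N

# Average A/N over many study repetitions; it should approach the true risk.
pi, N, reps = 0.2, 10, 200_000
mean_estimate = sum(simulate_proportion(pi, N) for _ in range(reps)) / reps
print(round(mean_estimate, 2))  # close to 0.2
```

The average over repetitions converges to the target parameter, which is exactly the unbiasedness property stated above.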
For many decades, statistical unbiasedness was taken to be a prerequisite for a good estimator. Gradually, however, it was realized that other properties are important as well, and that much could be gained by compromising unbiasedness to improve those properties.7–10 One such property is the standard deviation (SD) of the estimator, which is a measure of the scatter or spread the estimator would exhibit around its own expectation across random samples from the target population. (Its exact definition is not needed here and can be found in any statistics text.) Estimators with large standard deviations (random scatter) are unreliable estimators of the target parameter, even if they are unbiased.
Unfortunately, any estimator that has a smaller standard deviation than A/N may (depending on what π is) have non-zero statistical bias, which is to say, its expectation may not equal the target parameter π. We thus face a trade-off: If we want to reduce the standard deviation (random variation) of our estimator, we must risk incurring some statistical bias. But we ought to be willing to risk adding a small amount of bias in an estimator if that estimator suffers from much less random variation than A/N.
Figure 1 illustrates the principle that neither bias nor scatter should be considered alone. Suppose we have three rifles, and we wish to test which one will more often win when sighted on π and shot, where a win is a hit anywhere inside the ring around π. Rifle 1 has straight sights and a short worn-out barrel, and tends to scatter its shots (the ·s) widely around π when it is aimed straight at π; only about 20% of those shots get inside the ring. Rifle 2 has bent sights and a long new barrel, and tends to concentrate its shots (the Xs) in a small cluster to the left of π when it is aimed straight at π; about 80% of those shots get inside the ring. Rifle 1 has unbiased sights because its scatter of shots is evenly distributed around π; Rifle 2 has biased sights. But Rifle 1 is less likely to win than Rifle 2 because it has larger scatter than Rifle 2. Rifle 3 also has bent sights and an even longer new barrel, so that it tends to concentrate its shots (the +s) even more tightly than Rifle 2; its sights are so bent, however, that only about 10% of its shots get inside the ring.
Unfortunately, this reasonable-sounding criterion turns out to be very hard to use for statistical practice. A much simpler way to balance bias against scatter is to choose the estimator with the smallest average squared distance from π. This average is known as the expected squared error (ESE) or mean squared error (MSE)7 and is a common measure of estimation inaccuracy. By this criterion, Rifle 2 again outperforms Rifles 1 and 3.
One very useful property of ESE is its Pythagorean (sum-of-squares) relation to bias and standard deviation:
ESE = Bias² + SD²    (1)
This equation displays the trade-off we must make between bias and scatter when choosing estimators based on their expected squared error. Note that, for a statistically unbiased estimator like A/N, the bias is zero and so ESE = SD2 if the study is perfectly valid. In practice, however, we should expect all the estimators (including A/N) to have some bias due to imperfections in study design or conduct (such as case underascertainment).
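The Pythagorean relation can be verified numerically for any estimator of a binomial risk, by computing bias, SD, and ESE exactly from the binomial distribution. A sketch (the choice of π = 0.3 and N = 10 is illustrative only):

```python
from math import comb

def ese_bias_sd(estimator, pi, N):
    """Exact bias, SD, and ESE of estimator(A), where A ~ Binomial(N, pi)."""
    probs = [comb(N, a) * pi**a * (1 - pi)**(N - a) for a in range(N + 1)]
    vals = [estimator(a) for a in range(N + 1)]
    mean = sum(p * v for p, v in zip(probs, vals))          # E(estimator)
    bias = mean - pi                                        # E(estimator) - target
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, vals))
    ese = sum(p * (v - pi) ** 2 for p, v in zip(probs, vals))
    return bias, var ** 0.5, ese

# Check relation (1) for the MLE A/N: bias is zero, so ESE = SD^2.
bias, sd, ese = ese_bias_sd(lambda a: a / 10, 0.3, 10)
print(abs(ese - (bias**2 + sd**2)) < 1e-12)  # True
```

The same function applies to any shrinkage estimator, where the bias term becomes non-zero but the SD term shrinks.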
Bayes Estimation
A Bayes estimator for π uses probability models to combine prior information (embodied in r and n) with the data according to a rule known as Bayes' theorem.4,8,9,11–13 Such estimators can be rather complicated, but in many examples a reasonable approximate Bayes estimator can be obtained by taking a weighted average of the ordinary estimator and our prior guess for π. One such estimator weights the observed proportion A/N and the prior mean r by their respective sample sizes N and n. If we let w = N/(N + n), this estimator is the weighted average:

B = w(A/N) + (1 − w)r    (2)

B is called a posterior expectation or posterior mean for π; it is posterior in the sense that it can only be computed after the study has been conducted. N + n is the sum of the study cohort size N and our prior sample size n, so it can be thought of as representing the effective sample size on which the estimator B is based. The weight w for A/N represents the proportion of this effective sample size that comes from the study data, while the weight 1 − w represents the proportion that comes from our prior information. If we set n to zero, we get w = 1 and put no weight on our prior guess r; when that happens, we get back A/N from formula 2. Thus, A/N is the special case of the estimator B when we completely disregard prior information about π.
To illustrate the above formulae, suppose we observe A = 4 spontaneous abortions in a cohort of N = 10 first pregnancies. The observed proportion is then 4/10 = 0.40, which is the usual estimate of π, the true risk. With a prior guess of r = 0.20 for π and a weight for this guess equal to n = 5 pregnancies, our weight w for A/N is 10/(10 + 5) = 0.67 and our approximate Bayes estimator for π is

B = 0.67(0.40) + 0.33(0.20) = 0.33

which is 17% smaller than the observed proportion of 0.40.
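The weighted average in formula 2 is a one-liner in code; a sketch reproducing this worked example (variable names are mine, not the text's):

```python
def bayes_estimate(A, N, r, n):
    """Approximate Bayes estimator (formula 2): weighted average of the
    observed proportion A/N and the prior mean r, with weights proportional
    to the study size N and the prior sample size n."""
    w = N / (N + n)
    return w * (A / N) + (1 - w) * r

B = bayes_estimate(A=4, N=10, r=0.20, n=5)
print(round(B, 2))  # 0.33
```

Note that the same formula can be rewritten as (A + nr)/(N + n), which makes the "effective sample size" interpretation of N + n explicit.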
The shrinkage viewpoint
Why should we ever expect to do better than A/N when n is greater than zero? Imagine again our unbiased but very scattered Rifle 1, which corresponds to A/N. Suppose we deflect every shot emerging from its barrel toward a point r (for example, imagine Rifle 1 shoots steel bullets and we place a powerful magnet at r). Figure 2 illustrates what happens if r is next to the ring and each bullet is deflected about halfway toward r. When we aim Rifle 1 at π, the deflection toward r produces a tighter cluster whose centre is shifted away from π, toward r. Thus, the hits are now biased away from π, but less scattered and hence closer to π on average, provided r is not too far from π.
The estimator B is an example of a shrinkage estimator, because it is the usual estimator A/N shrunk toward the point r by a degree proportional to 1 − w, as in Figure 2.7,14 The term shrinkage is a bit misleading, because B will be bigger than A/N if r is bigger than A/N. The phrase shrinkage toward r should thus be interpreted as averaging with r, not as becoming smaller.

An important question is just how much we should shrink toward r. This question comes down to how small to make w: If w = 1, B = A/N and there is no shrinkage; as w shrinks toward 0, B shrinks toward r. If r were equal to the target parameter π, we would want w = 0, because then B = r = π; that is, B would equal the target. In reality, we do not know with much certainty how far r is from π; we are shooting in the dark. Using a value of w closer to 1 than to 0 provides some protection against shrinking B too far in the wrong direction, which could happen if r were very far from π.
Choosing the prior parameters
How should we pick r and n? A vague guideline is that r should reflect the best of our knowledge and n should adequately reflect our uncertainty. It turns out, however, that we can screen out bad choices for the prior parameters r and n. More precisely, we can always see whether our choices make B more accurate than A/N within the range for π in which we think accuracy is crucial.

To be more precise, suppose we are given a range πL to πU for the target parameter π, with πL > 0 and πU < 1. Then we can check to make sure that our choices for the prior parameters r and n yield an estimator B that has smaller expected squared error than A/N for any π in that range. The chosen B may have higher expected squared error than A/N if π is outside the given range; however, we can always make the range wide enough so that it includes all values of π for which accuracy is crucial.
To illustrate, suppose we are most worried about accurately estimating spontaneous-abortion risk for values of π that are slightly or moderately elevated above normal, because such values are liable to generate ambiguous study results. Suppose spontaneous abortions of first planned pregnancies are independent across women in our study cohort; N = 10 such pregnancies are observed; and slightly or moderately elevated risk for these women would fall somewhere inside the range πL = 0.10 to πU = 0.40. Suppose that we wish to evaluate the pair r = 0.20, n = 5, for which w = 10/(10 + 5) = 0.67. The resulting approximate Bayes estimator is:

B = 0.67(A/10) + 0.33(0.20)

From formula (A2) in Appendix 1, we find that B will have smaller expected squared error than A/N when the risk is slightly or moderately elevated (i.e. when π is between 0.10 and 0.40). If one spontaneous abortion is observed in the cohort (i.e. if A = 1), then A/N will equal 1/10 = 0.10 and B will equal 2/15 = 0.13; on the other hand, if A = 4, then A/N = 4/10 = 0.40 and B = 5/15 = 0.33. Thus, B represents only a modest alteration of A/N when r = 0.20, n = 5, N = 10, and A/N is between πL and πU.
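The screening check itself can be done by brute force rather than via the appendix formula, by computing each estimator's exact expected squared error under a binomial model at every risk value in the range. A sketch (this is an equivalent numerical check, not formula (A2)):

```python
from math import comb

def ese(estimator, pi, N):
    """Exact expected squared error of estimator(A), A ~ Binomial(N, pi)."""
    return sum(comb(N, a) * pi**a * (1 - pi)**(N - a) * (estimator(a) - pi) ** 2
               for a in range(N + 1))

N, r, n = 10, 0.20, 5
w = N / (N + n)
mle = lambda a: a / N                       # conventional estimator A/N
bayes = lambda a: w * (a / N) + (1 - w) * r # approximate Bayes estimator

# The Bayes estimator should beat A/N everywhere in the range 0.10 to 0.40.
grid = [0.10 + 0.01 * i for i in range(31)]
print(all(ese(bayes, pi, N) < ese(mle, pi, N) for pi in grid))  # True
```

A bad prior pair (say, a very large n with an r far outside the range) would fail this check, which is exactly the screening the text describes.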
The difference between A/N and B grows as A/N moves outside the range πL, πU. But the idea is to choose πL and πU so that accuracy is not so important outside that range: We can and should choose πL so that both estimators are likely to be unambiguously low if π < πL; likewise, we can and should choose πU so that both estimators are likely to be unambiguously high if π > πU. The point of using B instead of A/N is to concentrate on improving accuracy where we need it most, in the grey zone between πL and πU, where ambiguous values of A/N are most likely to arise.
The multilevel viewpoint
The estimator B is a very simple example of a multilevel or hierarchical estimator based on two levels, or stages, of models.1–9,15 These stages are as follows: The first stage is the model for the incidence, A/N, of spontaneous abortions in the study cohort, as a probabilistic function of the target parameter π; specifically, A/N is assumed to deviate from π in a purely random fashion. This model is the one and only stage used to derive A/N as the maximum-likelihood estimator for the average risk π. The second stage is a model for the average risk π as a probabilistic function of r and n; specifically, π is assumed to deviate from r in a purely random fashion to an extent inversely related to n. Because the parameters r and n can be derived before the study is conducted, this second-stage model is usually called the prior model.4

It may (and perhaps should) at first seem odd to model the target π, which is presumably a fixed and real quantity, as a function of our prior mean r, which after all is just our best guess about π. In fact, the oddity of this step (modelling a presumably fixed reality as a function of an idea) has provoked some of the most vociferous resistance to and misunderstanding of Bayes estimators. There are, however, several explanations for why the model runs from the prior mean r to the target π.

An intuitive explanation is that we are attempting to make inferences from our initial guess r to the reality π, and hence the best way we can express our initial uncertainty concerning π is to use our best guess r as a reference point. More rigorous justification along these lines can be based on subjective probability logic.11–13 A frequentist argument is essentially that given above: The model provides estimators with smaller expected squared error than A/N under explicit assumptions.7
The picture with two cohorts
With two cohorts, we can more easily translate the lesson in Figures 1 and 2 to a statistical setting as follows: Suppose we have a cohort from Finland and a cohort from Sweden, and we have a choice among three different statistical methods for estimating risk in the cohorts. Let π1 be the risk in the Finnish cohort and π2 the risk in the Swedish cohort. The point π in the figures now represents the target of the study: π is the pair of risks (π1, π2), so that π1 is the x-coordinate (abscissa) and π2 is the y-coordinate of π. The different symbols in Figure 1 (·, X, +) correspond to typical results from statistical methods 1, 2, and 3. Figure 1 illustrates how method 2 (the Xs) tends to give results closest to the risk pair π, even though its results are more biased than those from method 1 (the ·s) and more variable than those from method 3 (the +s). It thus appears that a compromise between bias and variability tends to produce results closest to the truth.

Suppose now we have good guesses r1 and r2 (perhaps from other data) for the target risks π1 and π2. The point r in Figure 2 represents this pair of guesses, i.e. r is the point with coordinates (r1, r2). Figure 2 illustrates how the unbiased but highly variable results from method 1 can be brought closer to the target pair π by pulling them toward r, which introduces some bias, but more than compensates by greatly reducing variability. More generally, it suggests that the most accurate methods may be found by introducing some bias into statistically unbiased estimators to reduce their variability and thus increase their accuracy.
Stein Estimation
When three or more unknown target parameters are being simultaneously estimated, we can construct two-stage estimators that always have lower total ESE than conventional (maximum-likelihood or weighted-least-squares) one-stage estimators. This gain is achieved by a clever strategy, known as Stein estimation, of weighting prior guesses in a manner that directly depends on their agreement with the observations.7,19,20 To illustrate, let us expand the earlier example by supposing that we have three independent cohorts of first planned pregnancies, all in a presumably exposed occupation but in different locations, e.g. Finland, Sweden, and Denmark, and that the target parameters are the average risks π1, π2, π3 in the three cohorts. Let N1, N2, N3 be the cohort sizes, with N+ = N1 + N2 + N3, and let A1, A2, A3 be the numbers of spontaneous abortions in each cohort, with A+ = A1 + A2 + A3.

Now suppose that our prior guesses for π1, π2, π3 are r1, r2, r3 (these guesses might all be the same, but need not be so). The Stein (or James-Stein) estimators S1, S2, S3 of π1, π2, π3 are given by
Sk = ŵk(Ak/Nk) + (1 − ŵk)rk    (3)
for k = 1, 2, or 3. Here, ŵ1, ŵ2, ŵ3 are estimated weights; their formula is a bit complicated and so is deferred to Appendix 2. An important point about the weight formula is that it makes ŵk close to 0, and hence makes Sk close to rk, if Ak/Nk appears highly variable or if all the observed proportions A1/N1, A2/N2, A3/N3 are close to their prior means r1, r2, r3 (which suggests we have made good guesses). On the other hand, the weight formula makes ŵk close to 1, and hence makes Sk close to Ak/Nk, if Ak/Nk appears very precise or if any of the observed proportions are far from their prior means (which, assuming the study was valid, suggests we have made at least one bad guess).
To illustrate the Stein formula, suppose we observe A1 = 4, A2 = 1, and A3 = 3 spontaneous abortions in Finnish, Swedish, and Danish pregnancy cohorts of size 10 each (N1 = N2 = N3 = 10), so that the observed proportions Ak/Nk are 0.40, 0.10, and 0.30, with a prior guess of 0.20 for the risk in all three cohorts (r1 = r2 = r3 = 0.20). From Appendix 2 we get estimated weights of ŵ1 = 0.68, ŵ2 = 0.70, and ŵ3 = 0.68 for the three observed proportions. Applying formula 3, we obtain S1 = 0.34, S2 = 0.13, and S3 = 0.27 for the Stein estimates of the true abortion risks in the three cohorts. Each of these estimates is the result of pulling away from the observed proportion, toward the common prior mean of 0.20.
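Given the weights, formula 3 is again a simple weighted average; a sketch reproducing this example (the weights are taken from the text, since the Appendix 2 weight formula is not shown here):

```python
def stein_estimates(A, N, r, w_hat):
    """Formula 3: S_k = w_hat_k (A_k/N_k) + (1 - w_hat_k) r_k, for each cohort k."""
    return [w * (a / n) + (1 - w) * rk for a, n, rk, w in zip(A, N, r, w_hat)]

S = stein_estimates(A=[4, 1, 3], N=[10, 10, 10],
                    r=[0.20, 0.20, 0.20], w_hat=[0.68, 0.70, 0.68])
print([round(s, 2) for s in S])  # [0.34, 0.13, 0.27]
```

Each observed proportion is pulled partway toward its prior mean, with the degree of pulling controlled by the data-driven weight.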
An amazing property of the Stein estimators is that their total expected squared error is guaranteed to be less than that of the conventional estimators, regardless of the values of the target parameters or the prior means.7,19,20 That is, taken as an ensemble, the Stein estimators will outperform the conventional estimators even if our prior means are very wrong. For example, but for the approximation error, the total of the expected squared errors of S1, S2, S3 will be less than the total of the expected squared errors of A1/N1, A2/N2, A3/N3 even if r1, r2, r3 are all far from π1, π2, π3. In this sense, and unlike Bayes estimators, the overall (ensemble) performance of the Stein estimators is robust to our prior guesses. Nonetheless, as with Bayes estimators, if the prior guess used for one of the target parameters is poor, the Stein estimator for that parameter may be much less accurate than the conventional estimator for that parameter. It is only the total squared error of the Stein estimators that is guaranteed to be lower. To ensure that each of the Stein estimators is more accurate than its conventional counterpart, we would have to make a good prior guess for each target parameter.
The shrinkage viewpoint
The robustness advantage of the Stein estimators relative to Bayes estimators may be explained from a shrinkage perspective.7,19–21 If the conventional estimator Ak/Nk is highly variable, it should get more shrinkage toward a stable point (that is, get a smaller ŵk) in order to reduce its variability. Such shrinkage will introduce excessive bias, however, if the chosen stable point rk is too far from the true parameter πk. One way to protect ourselves from shrinking too much toward the wrong point is to shrink in proportion to how good our guesses appear to be. The weight ŵk is constructed in such a manner that it responds as needed to the apparent variability of Ak/Nk and to the distance of our prior means from the conventional estimates (Appendix 2).
The multilevel viewpoint
Stein estimators are two-stage estimators with the same first-stage model as the conventional and Bayes estimators. The second-stage model differs somewhat from that of the Bayes estimators. As in Bayes estimation, each πk is assumed to deviate only randomly from its prior mean rk. Unlike in Bayes estimation, however, the extent of this deviation is no longer assumed to be known. Thus, Stein estimators depend on one less second-stage assumption than do Bayes estimators.
Empirical Bayes Estimation
To illustrate, let us further expand the earlier example to four cohorts. For example, suppose we add a Norwegian cohort of size N4 to our study, in which we observe A4 spontaneous abortions; now N+ = N1 + N2 + N3 + N4 and A+ = A1 + A2 + A3 + A4, with π4 the true Norwegian risk. With conventional single-stage estimation and no further information, we have just two choices for estimating the target parameters: Either estimate each separately by A1/N1, A2/N2, A3/N3, and A4/N4; or else assume that π1 = π2 = π3 = π4 and estimate each with the same estimator, A+/N+. Using A1/N1, A2/N2, A3/N3, and A4/N4 to estimate π1, π2, π3, π4 would be the natural choice if we were absolutely certain that abortion risks differ across the countries to an important extent; using A+/N+ to estimate π1, π2, π3, π4 would be the natural choice if we were absolutely certain the risks do not differ. But what if we are not at all sure about whether the risks differ to an important extent? Hierarchical methods allow us more sensible options for this situation, including compromises between the two conventional extremes.
An exchangeability assumption
Suppose that our background information about the target parameters π1, π2, π3, π4 is such that, before seeing the data, we would not be able to delineate any reasons to distinguish among them; in technical terms, suppose we would consider them to be exchangeable.5,11,24,25 Such exchangeability implies that we would offer about the same prior guess (r1 ≈ r2 ≈ r3 ≈ r4) with about the same effective sample size (n1 ≈ n2 ≈ n3 ≈ n4) for all four parameters. This is a much weaker assumption than that the parameters really are equal; it only says that, without seeing the data, we can't yet tell how they might differ. It is, essentially, an uncertain and qualitative prior guess about the similarity of the cohorts. If we did not consider π1, π2, π3, π4 exchangeable (e.g. if we suspected that π1 < π2 < π3 < π4) we could modify the estimators below to reflect our actual views, although the result would be more complicated. It turns out, however, that even if we do not accurately capture our background (prior) information, the EB estimators will still have an advantage over the conventional and Stein estimators.7,19,20
A fully empirical estimator
A key idea behind Bayes and Stein estimation is that, by analogy with Figure 2, we can stabilize (reduce the standard deviation of) the conventional estimators A1/N1, A2/N2, A3/N3, A4/N4 by shrinking them toward a given point.20,21 Given that we will shrink them all toward the same value (because we consider the underlying risks to be exchangeable), which such value would minimize our total expected squared error? A moment's reflection might suggest that a good choice would be some weighted average of π1, π2, π3, π4; call that unknown average μ. We can construct an estimator of μ from the data, and plug this estimator into formula 2 or 3 to get new estimators for π1, π2, π3, π4. We can also estimate the best weights to attach to A1/N1, A2/N2, A3/N3, A4/N4 when we plug our estimator of μ into formula 2 or 3.
With weights proportional to the cohort sizes Nk, we obtain the crude proportion A+/N+ as the estimator of μ. If we use μ in place of rk in formula 3 and use A+/N+ to estimate μ, we get approximate empirical-Bayes (EB) estimators ê1, ê2, ê3, ê4 of π1, π2, π3, π4, which have the form

êk = ŵk(Ak/Nk) + (1 − ŵk)(A+/N+)    (4)

for k = 1, 2, 3, or 4. Here, ŵ1, ŵ2, ŵ3, ŵ4 are the estimated weights; their formula differs from that used for Stein estimation (Appendix 2). An important point about formula 4 is that it makes ŵk close to 0, and hence êk close to A+/N+, when Ak/Nk is unstable or when all the observed proportions A1/N1, A2/N2, A3/N3, A4/N4 are close together; in the latter event, the data appear compatible with exchangeability. On the other hand, the weight formula makes ŵk close to 1, and hence makes êk close to Ak/Nk, if Ak/Nk appears very precise or if the observed proportions are extremely spread out, for in the latter event the data appear incompatible with the exchangeability assumption. (A rationale for using A+/N+ in formula 4 is that it is an unbiased estimator of μ when the weights used for averaging the πk are proportional to the cohort sizes Nk. In modern practice, better estimators of μ than A+/N+ are used; A+/N+ is used here only for simplicity, although it works well enough when π1, π2, π3, π4 are not very spread out.)
To illustrate the EB formula, suppose we observe A1 = 4, A2 = 1, A3 = 3, and A4 = 5 spontaneous abortions in Finnish, Swedish, Danish, and Norwegian cohorts of size 10 each, so that the observed proportions are 0.40, 0.10, 0.30, and 0.50, and the crude proportion A+/N+ is 13/40 = 0.33. From Appendix 2 we get estimated weights of 0.79, 0.84, 0.80, and 0.78 for the four observed proportions, and ê1 = 0.38, ê2 = 0.14, ê3 = 0.31, and ê4 = 0.46 for the EB estimates of the risks in the four cohorts. Each of these estimates is the result of pulling away from the observed proportion, toward the estimated common prior mean of 0.33.
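As before, the averaging step is simple once the weights are in hand; a sketch reproducing the four-cohort example (weights taken from the text, since the Appendix 2 formula is not shown):

```python
def eb_estimates(A, N, w_hat):
    """Formula 4: shrink each A_k/N_k toward the pooled proportion A+/N+,
    which serves as the data-based estimate of the common prior mean."""
    pooled = sum(A) / sum(N)
    return [w * (a / n) + (1 - w) * pooled for a, n, w in zip(A, N, w_hat)]

e = eb_estimates(A=[4, 1, 3, 5], N=[10, 10, 10, 10],
                 w_hat=[0.79, 0.84, 0.80, 0.78])
print(e)  # approximately 0.384, 0.136, 0.305, 0.462
```

The only structural change from the Stein sketch is that the shrinkage target is now estimated from the data (the pooled proportion) rather than supplied as a prior guess.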
The estimator in formula 4 is called empirical Bayes because it uses data (empirical) quantities such as A+/N+ in place of all the prior parameters in the Bayes estimator (formula 2). As with Stein estimators, the total expected squared error of the EB estimators is guaranteed to be less than that of the conventional estimators, no matter what the target parameters really are.7,19,21 That is, taken as an ensemble, the EB estimators will outperform the conventional estimators even if our exchangeability assumption (similarity judgement) was very wrong. For example, but for the approximation error, the total of the expected squared errors of ê1, ê2, ê3, ê4 will be less than the total of the expected squared errors of A1/N1, A2/N2, A3/N3, A4/N4, even if π1, π2, π3, π4 are far apart. In this sense, the overall (ensemble) performance of the EB estimators is robust.

Unlike the Bayes and Stein estimators, the accuracy of the EB estimators does not depend on assumptions about the exact values of the true parameters; only a much weaker assumption of similarity (exchangeability) is used. This can be viewed as another robustness advantage of EB estimation. Nonetheless, if the exchangeability assumption is not a good approximation to reality, some of the EB estimators in the ensemble may be less accurate than their conventional counterparts. For example, if the cohorts were all of equal size and risk except for cohort 4, and the latter was half the size and twice the risk of the other cohorts, the EB estimator for cohort 4 (ê4) might be much less accurate than the conventional estimator A4/N4. Again, it is only the total squared error of the EB estimators that is guaranteed to be lower. To ensure that each EB estimator is more accurate than its conventional counterpart, we would have to use good exchangeability assumptions; for example, if π4 was twice the size of π1, π2, and π3, we would get best accuracy by assuming that π4/2 is exchangeable with π1, π2, π3, rather than that all four parameters are exchangeable. Thus, as with all statistics, the accuracy of our background assumptions will be crucial to the accuracy of our estimates.
The shrinkage viewpoint
The robustness advantages of EB estimators may be explained from the shrinkage perspective.7,14,20,21,26 If the conventional estimator Ak/Nk is highly variable, it should get more shrinkage toward a more stable point (that is, it should get a smaller ŵk) to reduce its variability. If we have no secure prior guess and the conventional estimates are close to one another, then the natural point to pull them toward is close to A+/N+, which is the estimate we would use if we thought the parameters were equal. Conversely, if a conventional estimator has low variability, it does not need much pull toward a stable point to improve its accuracy, so ŵk can be close to 1. And, if the conventional estimates are not close together, then there is no basis for pulling them much toward A+/N+. The weight ŵk is constructed in such a manner that it responds as needed to the apparent variability of Ak/Nk and the spread among the conventional estimates (Appendix 2).
The multilevel viewpoint
The EB estimators are two-stage estimators, with the same first-stage model as all the earlier estimators, but with a different underlying second-stage model. This is because the EB estimators make no use of our own prior (second-stage) means; instead, they employ an estimate of μ as the prior mean. The second-stage model underlying the EB estimators represents π1, π2, π3, π4 as probabilistic functions of μ, in that π1, π2, π3, π4 are assumed to deviate only randomly from μ. Unlike in Bayes or Stein estimation, however, this prior mean is not assumed to be known. Thus, EB estimators depend on one less assumption than do Stein estimators. (It should be noted, however, that some authors use the term Stein estimation as a synonym for empirical-Bayes estimation.)
Bayes Empirical Bayes and Beyond
In empirical-Bayes estimation, we can improve our estimate of μ in the same way we did when we had just one target parameter π: by using prior information. Recall that, when we had just one parameter to estimate, we could construct simple two-stage estimators, known as Bayes estimators, whose ESE could be lower than that of conventional single-stage estimators over a given range of practical concern. In a similar fashion, we can construct basic three-stage estimators whose ESE can be lower than that of empirical-Bayes estimators over a given range of practical concern. These estimators are sometimes called Bayes empirical Bayes (BEB) estimators.15,31,32

Returning to the example, suppose as before that we consider π1, π2, π3, π4 exchangeable, so that we would give the same prior guess r with the same prior certainty (e.g. as measured by n) for each. In addition, suppose we specify the same range πL to πU in which we want our BEB estimators to outperform the competitors given earlier. (Again, these assumptions are not essential, but the formulas become more complex if a different prior pair rk, nk and range is specified for each target parameter.) Some approximate BEB estimators of π1, π2, π3, π4 have the form
θ̃k = ŵk(Ak/Nk) + (1 - ŵk)μ̂ (5)
where
μ̂ = u(A+/N+) + (1 - u)r (6)
As in formula 4, A+/N+ serves as an estimator of μ and the ŵk are the estimated weights for the conventional estimators (although the weight formulas are more complex than in EB estimation). The main addition over the EB estimator in formula 4 is that we now also use our prior guess r in estimating μ. This is done by averaging r with the sample estimate of μ using a weight u (formula 6), in a manner that parallels ordinary Bayes estimation (formula 2).
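To make the arithmetic of formulas 5 and 6 concrete, here is a small Python sketch of the BEB point estimates. It assumes the weights ŵk and u have already been estimated; the actual weight formulas are more complex and are not reproduced here, and all cohort counts and weight values below are hypothetical.

```python
def beb_estimate(A, N, r, w, u):
    """Sketch of BEB point estimates for K cohorts (formulas 5-6).

    A, N : case counts and denominators for the K cohorts
    r    : prior guess for the common mean of the theta_k
    w    : assumed estimated weights w_k on the conventional estimates
    u    : assumed estimated weight on the pooled estimate A+/N+ versus r
    """
    pooled = sum(A) / sum(N)              # A+/N+, the sample estimate of mu
    mu_hat = u * pooled + (1 - u) * r     # formula 6: average r into the estimate of mu
    # formula 5: shrink each conventional estimate toward mu_hat
    return [w_k * (a / n) + (1 - w_k) * mu_hat
            for a, n, w_k in zip(A, N, w)]

# Hypothetical illustration: four cohorts of 10 persons, equal weights
est = beb_estimate([2, 4, 3, 5], [10, 10, 10, 10], r=0.2, w=[0.6] * 4, u=0.8)
```

Each resulting estimate is a convex combination of the cohort's own proportion Ak/Nk and the prior-adjusted pooled mean, so it always lies between the two.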
The BEB estimators defined by formulas 5 and 6 are based on a three-stage model. As with all the estimators, the first stage is a model for the incidence proportions Ak/Nk as probabilistic functions of their target parameters θk, in which Ak/Nk deviates only randomly from θk. As with EB estimation, the second stage is a model for the parameters θk as probabilistic functions of their mean μ, in which θk deviates only randomly from μ. The new, third stage is a model for the mean μ as a probabilistic function of the prior mean r, in which μ deviates only randomly from r.
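The three-stage structure just described can be simulated directly, which may help fix ideas. The beta and binomial forms below are illustrative choices of my own; the text does not commit to specific distributions for the second and third stages.

```python
import random

random.seed(1)
r, K, N = 0.2, 4, 100   # prior mean, number of cohorts, cohort size (illustrative)

# Stage 3: the mean mu deviates only randomly from the prior mean r
mu = random.betavariate(r * 50, (1 - r) * 50)

# Stage 2: theta_1, ..., theta_K deviate only randomly from mu
thetas = [random.betavariate(mu * 50, (1 - mu) * 50) for _ in range(K)]

# Stage 1: each count A_k deviates only randomly from N * theta_k
A = [sum(random.random() < th for _ in range(N)) for th in thetas]
props = [a / N for a in A]   # the conventional estimates A_k/N_k
```

Running this repeatedly shows the conventional estimates scattering around their θk, the θk scattering around μ, and μ scattering around r, which is exactly the hierarchy the BEB estimator exploits.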
One major conceptual difference between the ordinary Bayes and BEB models is that our prior information (embodied in r and u) has been moved up one stage in the hierarchy, from the second to the third stage. Just as the second-stage model is often called the prior model, the third-stage model is often called the hyperprior model;9 r is then called the hyperprior mean. An advantage of introducing this additional model is analogous to the advantage of introducing the prior model in estimating one parameter. Suppose we can specify some range θL to θU for θ1, θ2, θ3, θ4, with θL > 0 and θU ≤ 1, within which we want to improve the accuracy of our estimates (values outside the range may be unimportant because they would be likely either to produce obvious results or to be extremely implausible). We can then make sure that our values for r and u produce BEB estimators that have a smaller total expected squared error than the corresponding EB estimators within the specified range for μ.
Multilevel modelling provides many directions in which EB and BEB estimation can be generalized. For example, there is no theoretical limit to the number of stages (levels) allowed. Although practical limits of computation and imagination have kept most applications in health sciences to two-level models, examples involving three or more levels often arise in the social sciences.1,3 With such extensions, multilevel models can encompass all other parametric regression models as special cases, and can be extended to encompass non-parametric regression as well (via the relation of roughness penalties to prior information).33,34
Some Issues in Multilevel Modelling
Limitations of modelling
Two points cannot be overemphasized. First, every inferential statistic (such as a P-value or confidence interval) is model based, in that some set of constraints (i.e. a model) on the data-generating process must be assumed in order to derive tests and estimates of quantities of interest.17,35 This limitation applies to so-called non-parametric methods as well as to explicit modelling methods. It is a severe limitation in every analysis of epidemiological studies of cause and effect, because in such studies we almost never know enough about the processes that determine exposure or disease to be even moderately certain that our model is approximately valid. The best we can do is explicate the model underlying our analysis so that everyone can critically evaluate the model assumptions. Multilevel modelling is distinguished only by its unfamiliarity, which obliges one to make more effort to explain the model. But multilevel modelling need not involve stronger assumptions than ordinary modelling, and in fact provides an opportunity to use weaker assumptions (that is, more flexible models) than those used in ordinary single-level analyses. This increased flexibility is achieved by pushing the models ordinarily used for single-level analysis up to the second level of the multilevel analysis.34,36,37
Second, inferential statistics, whether non-parametric (i.e. based on a weak but hidden model) or explicitly model-based, are not data summaries.35,38 True data summaries, such as cross-tabulations and data plots, should be examined in every analysis as a check on statistical assumptions and results.34,38 For example, most epidemiological statistics depend in subtle ways on sample-size requirements that should be checked against data summaries.39 Similarly, associations, trends, or more complex patterns that emerge from model-based estimates need to be checked against true data summaries (such as tables of observed proportions) to determine the degree to which the patterns reflect observations as opposed to model assumptions.16,18,34 Again, multilevel models are not special in this regard, although they do require checking of all levels, which can be laborious. In return, however, they can provide the benefits described here and elsewhere.7,9,14,19-23,26-32,34,36,37
Complexity concerns
One cost of moving to the multilevel perspective is its greater complexity. Multilevel models demand more mental effort to understand and build. They also demand some consideration of the relative importance of different errors (see the discussion of estimation criteria below). I view these demands as beneficial, for the effort goes into connecting the model and estimation criteria to the scientific context.37 Nonetheless, there is a danger that multilevel analysis will be employed in the same automatic and careless manner that is common in ordinary regression analyses, with the same ill effects. But the potential for misuse should not be an excuse to deny the benefits of a method to careful users.
There are many situations in which the complexity of a full multilevel analysis is not needed. This occurs, for example, when only first-stage parameters are of interest and the data set is so large that all first-stage standard errors are small; here, there is no need for variance reduction. Conversely, if the data set is so small that only a few free parameters can be used in the first-level model, a full multilevel approach (with its attendant flexibility and robustness) will not be applicable, though more restrictive Bayesian analyses can still be useful for both bias and variance reduction.39 Finally, the data may exhibit results so obvious that any complex approach is pointless, such as in the classic studies of large effects (smoking and lung cancer, estrogens and endometrial cancer, vinyl chloride and angiosarcoma, etc.).
On the other hand, there are many complex problems in which multilevel modelling provides an alternative superior to the oversimplifications required by standard approaches. This advantage is especially great in studies that search for effects or interactions among many exposures (so-called 'fishing expeditions'), in which standard methods of forcing in all variables or using mechanical variable-selection algorithms easily produce invalid inferences.27-30,34,36,37
Modifying the estimation criterion
Nearly all discussions, including the present one, focus on minimizing total expected squared error as the estimation criterion. Nonetheless, the general lessons illustrated earlier continue to hold if we employ other criteria that, like the total ESE criterion, consider errors due to bias and errors due to random variation as equally costly, and that consider larger deviations from a target worse than smaller deviations. An example of such a criterion is expected absolute error, or EAE. Under this criterion, it will still be true that we can achieve lower expected errors than conventional estimators by averaging results from different levels of a multilevel model ('shrinkage'),7 though the form of the weight estimators will change. The same comments apply if we replace total ESE or total EAE by a weighted total; in the example, such cost weighting would be appropriate if each unit of error in the Finnish cohort were more costly than each unit of error in the other cohorts.
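The claim that shrinkage can help under EAE as well as ESE is easy to check by simulation. The sketch below uses arbitrary values (true proportion 0.25, prior guess 0.2, weight w = N/(N + n) as in the single-parameter example) and compares a conventional binomial proportion with a shrunk estimator under both criteria; these numerical choices are mine, not from the text.

```python
import random

random.seed(2)
theta, N, r = 0.25, 10, 0.2        # illustrative true value, sample size, prior guess
w = N / (N + 5)                    # weight of form N/(N+n) with n = 5

sq_conv = sq_shrunk = abs_conv = abs_shrunk = 0.0
trials = 20000
for _ in range(trials):
    a = sum(random.random() < theta for _ in range(N))   # A ~ binomial(N, theta)
    conv = a / N                                         # conventional estimate
    shrunk = w * conv + (1 - w) * r                      # two-stage (shrunk) estimate
    sq_conv += (conv - theta) ** 2
    sq_shrunk += (shrunk - theta) ** 2
    abs_conv += abs(conv - theta)
    abs_shrunk += abs(shrunk - theta)

# Monte-Carlo estimates of ESE and EAE for each estimator
results = {
    "ESE": (sq_conv / trials, sq_shrunk / trials),
    "EAE": (abs_conv / trials, abs_shrunk / trials),
}
```

With the true value this close to the prior guess, the shrunk estimator comes out ahead on both criteria, illustrating that the benefit of averaging across levels is not tied to squared error in particular.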
Multilevel estimation criteria can be modified further to recognize qualitative differences in error costs. For example, model parameters often can be divided into two groups, one containing parameters that represent effects of the exposures of interest (including any product terms involving an exposure), and the other containing parameters of no direct interest (such as intercepts and confounder coefficients). A simple way to account for this coefficient grouping is to use separate estimation criteria for the two groups of parameters,40 for example by shrinking only the estimates of interest.27,28,30,32,34,41 Further improvement can be obtained by using accuracy in estimating the exposure coefficients as the sole criterion.42,43 Though this criterion leads to modification of all the parameter estimates, it sacrifices accuracy in the confounder and intercept estimates whenever such sacrifice appears to improve accuracy in estimation of exposure coefficients.
Software for fitting multilevel models
At least four types of software options are currently available for fitting multilevel models under the ESE criterion. First, there are several commercial packages for fitting multilevel models,44 although not all are capable of handling the binomial and Poisson outcomes common in epidemiology. Second, if one is familiar with the correspondence between multilevel models and mixed models,15,37,41 one can translate the desired multilevel model into a generalized-linear mixed model, fit that model with a generalized-linear mixed model program such as SAS GLIMMIX, and translate the results back to multilevel-model estimates. Similar translations can be made to allow fitting of multilevel models via ridge-regression or penalized-likelihood software. Third, one can employ a specialized SAS procedure for obtaining two-stage model estimates from ordinary regression output,45 available on the World-Wide Web at http://darwin.cwru.edu/~witte/hm.html. These three options are based on approximations that appear to work well for typical epidemiological data, at least as well as the maximum-likelihood methods used to fit ordinary single-stage models.27,29,30 A fourth, more accurate option, available to those familiar with advanced hierarchical-Bayesian methodology, is to employ Markov-Chain Monte-Carlo techniques.4,5
Interval estimation
For conceptual purposes I have focused on point estimation, but in practice one needs interval estimates for the parameters of interest. The simplest 95% confidence intervals are Wald intervals, found by taking the point estimate (or some transform of it) ±1.96 estimated standard errors. The Wald method is only a large-sample approximation when dealing with non-normal data or non-linear models (as when analysing risks or rates), and has the least accurate coverage among approximate intervals. In multilevel modelling, it tends to suffer additional inaccuracy by falling more frequently to one side than the other of the true parameter. This lopsidedness of coverage is a direct consequence of the bias in the multilevel point estimator, on which the Wald interval is centered. Despite this problem, multilevel Wald 95% intervals appear to provide conservatively valid (i.e. at least 95%) average coverage for parameter ensembles in common epidemiological settings.27,29,30 For more accurate intervals, one will have to employ more accurate analytic approximations or Monte-Carlo methods.4,5
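As a small illustration of the Wald construction just described (point estimate, or a transform of it, ±1.96 estimated standard errors), here is a sketch for a single proportion using the logit transform. The logit scale is one common transform choice and the numbers are hypothetical; this is the simple single-parameter interval, not a multilevel interval.

```python
import math

def wald_95ci_logit(a, n):
    """Approximate 95% Wald CI for a proportion, built on the log-odds scale."""
    p = a / n
    logit = math.log(p / (1 - p))
    se = math.sqrt(1 / a + 1 / (n - a))        # estimated SE of the log-odds
    lo, hi = logit - 1.96 * se, logit + 1.96 * se
    expit = lambda x: 1 / (1 + math.exp(-x))   # back-transform to the proportion scale
    return expit(lo), expit(hi)

lo, hi = wald_95ci_logit(6, 20)   # e.g. 6 cases among 20 persons
```

Because the interval is symmetric only on the logit scale, the back-transformed limits are asymmetric about the point estimate; centring such an interval on a biased multilevel point estimate is what produces the lopsided coverage noted above.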
Conclusion
I have here focused on a more general application of multilevel modelling: reduction of estimation error through incorporation of prior information about parameter sizes (including information about similarity or relative sizes). This focus provides a conceptual unification of seemingly disparate Bayesian and frequentist approaches to estimation. Multilevel analysis allows one to use prior information in a manner more cautious than ordinary Bayesian analysis and more thorough than ordinary frequentist analysis. Many published epidemiological applications, as well as simulation studies, have illustrated this benefit in problems involving multiple comparisons, variable selection, or sparse data.7,19,21,23,26-30,32,37,41,50 I hope that the present exposition encourages readers to examine these applications and to contribute further examples of multilevel analysis to the literature.
Appendix 1
The ESE of the two-stage estimator θ̂ = w(A/N) + (1 - w)r, with w = N/(N + n), is

[E(θ̂) - θ]² + var(θ̂) = (1 - w)²(r - θ)² + w²σ²/N (A1)

where σ² = N var(A/N). Because the ESE of A/N is σ²/N, θ̂ will have lower ESE than A/N if and only if (A1) is less than σ²/N, or equivalently if and only if

(r - θ)² < (1 - w²)σ²/[N(1 - w)²] = (2/n + 1/N)σ² (A2)
Given an interval [θL, θU] with 0 < θL < θU < 1, we can find a nonempty set of positive pairs (n, r) that ensure that (A2) holds for every θ in the interval, and we can check whether (A2) holds for every θ in the interval for a given pair (n, r). In the example, N = 10, n = 5, r = 0.2, and σ² = θ(1 - θ), so (A2) becomes

(0.2 - θ)² < (2/5 + 1/10)θ(1 - θ) (A3)

which holds for all θ in the example interval (θL, θU) = (0.1, 0.4). Of course, n = 5, r = 0.2 is not necessarily optimal, and the optimal pair (n, r) will depend on the optimality criterion.7,51
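Inequality (A3) can be verified numerically over the example interval. The grid resolution below is an arbitrary choice; the check also confirms that the inequality can fail well outside the interval, where the prior guess r = 0.2 is far from θ.

```python
def a3_holds(theta, r=0.2, n=5, N=10):
    """Inequality (A3): does the two-stage estimator beat A/N in ESE at theta?"""
    return (r - theta) ** 2 < (2 / n + 1 / N) * theta * (1 - theta)

# Check (A3) on a 101-point grid spanning the example interval (0.1, 0.4)
grid = [0.1 + k * (0.4 - 0.1) / 100 for k in range(101)]
ok = all(a3_holds(t) for t in grid)
```

The check passes throughout (0.1, 0.4), while for example at θ = 0.9 the left side (0.49) exceeds the right side (0.045), so the conventional estimator would win there.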
Appendix 2
For the estimator in formula 3, the estimated weights satisfy

1 - ŵk = max[0, (K - 2)ûk v̂k/K]

where

v̂k = θ̂k(1 - θ̂k)/Nk (A4)

is the estimated variance of Ak/Nk,

ûk = 1/(v̂k + τ̂²), (A5)

τ̂² = Σk ûk dk / Σk ûk (A6)

is the estimated average squared deviation of the θk about their prior means, and19

dk = (Ak/Nk - rk)² - v̂k (A7)

Because θ̂k appears in the definition of v̂k, τ̂² appears in the definition of ûk, and ûk appears in the definition of τ̂², computation of these quantities requires iteration among formulas 3 and A4-A7. Because 1 - ŵk is the weight placed on rk, it may be seen that shrinkage towards rk increases as the variance estimate v̂k for Ak/Nk increases.
For the EB estimator in formula 4, ŵk, v̂k, and dk are instead given by

1 - ŵk = max[0, (K - 3)ûk v̂k/(K - 1)] (A8)

v̂k = êk(1 - êk)/Nk (A9)

and

dk = K(Ak/Nk - A+/N+)²/(K - 1) - v̂k (A10)

which require iteration among formulas 4, A5, A6, and A8-A10.26

If the denominators Nk represent person-time rather than person-counts, formula A4 becomes v̂k = θ̂k/Nk and A9 becomes v̂k = êk/Nk.
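The iteration among formulas 3 and A4-A7 can be sketched as a simple fixed-point loop. The update ordering, starting values, and convergence handling below are my own choices (the appendix does not specify them), and the cohort counts and prior guesses are hypothetical.

```python
def semi_bayes_weights(A, N, r, iters=100):
    """Sketch of the iteration among formula 3 and A4-A7.

    A, N : case counts and denominators for the K cohorts
    r    : list of prior guesses r_k, one per cohort
    """
    K = len(A)
    theta = [a / n for a, n in zip(A, N)]   # start at the conventional estimates
    tau2 = 0.01                             # arbitrary starting value for tau-hat^2
    for _ in range(iters):
        v = [t * (1 - t) / n for t, n in zip(theta, N)]               # A4
        u = [1 / (vk + tau2) for vk in v]                             # A5
        d = [(a / n - rk) ** 2 - vk
             for a, n, rk, vk in zip(A, N, r, v)]                     # A7
        tau2 = max(0.0, sum(uk * dk for uk, dk in zip(u, d)) / sum(u))  # A6
        w = [1 - max(0.0, (K - 2) * uk * vk / K) for uk, vk in zip(u, v)]
        # formula 3: shrink each estimate toward its prior mean r_k
        theta = [wk * (a / n) + (1 - wk) * rk
                 for wk, a, n, rk in zip(w, A, N, r)]
    return w, theta

w, est = semi_bayes_weights([2, 4, 3, 5], [10] * 4, [0.2] * 4)
```

As the text notes, a cohort with larger v̂k receives a smaller ŵk and is pulled more strongly toward its prior mean; each resulting estimate lies between the conventional proportion and rk.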
References
2 Searle SR, Casella G, McCulloch CE. Variance Components. New York: Wiley, 1992.
3 Goldstein H. Multilevel Statistical Models, 2nd edn. London: Edward Arnold, 1995.
4 Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. New York: Chapman and Hall, 1995.
5 Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. New York: Chapman and Hall, 1996.
6 Kreft IGG, de Leeuw J. Introducing Multilevel Modelling. London: Sage, 1998.
7 Efron B. Biased versus unbiased estimation. Adv Math 1975;16:259-77.
8 Leamer EE. Specification Searches. New York: Wiley, 1978.
9 Good IJ. Good Thinking. Minneapolis: University of Minnesota Press, 1983, Ch. 9.
10 Meeden G. Estimation when using a statistic that is not sufficient. Am Stat 1987;41:135-36.
11 DeFinetti B. The Theory of Probability. Vols 1 and 2. New York: Wiley, 1974.
12 Howson C, Urbach P. Scientific Reasoning: The Bayesian Approach, 2nd edn. LaSalle, IL: Open Court, 1993.
13 Greenland S. Probability logic and probabilistic induction. Epidemiology 1998;9:322-32.
14 Copas JB. Regression, prediction, and shrinkage. J Roy Stat Soc Ser B 1983;45:311-54.
15 Good IJ. Hierarchical Bayesian and empirical Bayesian methods (letter). Am Stat 1987;41:92.
16 Box GEP. An apology for ecumenism in statistics. In: Box GEP, Leonard T, Wu CF (eds). Scientific Inference, Data Analysis, and Robustness. New York: Academic Press, 1983.
17 Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol 1986;123:392-402.
18 Hill JR. A general framework for model-based statistics. Biometrika 1990;77:115-26.
19 Efron B, Morris CN. Stein's estimation rule and its competitors: an empirical Bayes approach. J Am Stat Assoc 1973;68:117-30.
20 Brandwein AC, Strawderman WE. Stein estimation: the spherically symmetric case. Stat Sci 1990;5:356-69.
21 Efron B, Morris CN. Stein's paradox in statistics. Sci Am 1977;236:119-27.
22 Casella G. An introduction to empirical-Bayes data analysis. Am Stat 1985;39:83-87.
23 Greenland S, Poole C. Empirical-Bayes and semi-Bayes approaches to occupational and environmental hazard surveillance. Arch Environ Health 1994;49:9-16.
24 Draper D, Hodges JS, Mallows CL, Pregibon D. Exchangeability and data analysis. J Roy Stat Soc Ser A 1993;156:9-37.
25 Greenland S, Draper D. Exchangeability. In: The Encyclopedia of Biostatistics. New York: Wiley, 1998.
26 Efron B, Morris CN. Data analysis using Stein's estimator and its generalizations. J Am Stat Assoc 1975;70:311-19.
27 Greenland S. Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression. Stat Med 1993;12:717-36.
28 Witte JS, Greenland S, Haile RW, Bird CL. Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 1994;5:612-21.
29 Witte JS, Greenland S. Simulation study of hierarchical regression. Stat Med 1996;15:1161-70.
30 Greenland S. Second-stage least squares versus penalized quasi-likelihood for fitting hierarchical models in epidemiologic analyses. Stat Med 1997;16:515-26.
31 Deeley JJ, Lindley DV. Bayes empirical Bayes. J Am Stat Assoc 1981;76:833-41.
32 Greenland S. Hierarchical regression for epidemiologic analyses of multiple exposures. Environ Health Perspect 1994;102(Suppl.8):33-39.
33 Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. New York: Chapman and Hall, 1994, sec. 3.8.3.
34 Greenland S. Introduction to regression modeling. In: Rothman KJ, Greenland S (eds). Modern Epidemiology, 2nd edn. Philadelphia: Lippincott-Raven, 1998, Ch. 21.
35 Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:421-29.
36 Greenland S. Multilevel modeling and model averaging. Scand J Work Environ Health 2000 (in press).
37 Greenland S. When should epidemiologic regressions use random coefficients? Technical Report, UCLA Department of Epidemiology, September 1999.
38 Greenland S. Summarization, smoothing, and inference. Scand J Soc Med 1993;21:227-32.
39 Greenland S. Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators. Biostatistics 2000;1 (in press).
40 Lindsay BG. Using empirical partially Bayes inference for increased efficiency. Ann Stat 1985;13:914-31.
41 Greenland S. A semi-Bayes approach to the analysis of correlated multiple associations, with an application to an occupational cancer-mortality study. Stat Med 1992;11:219-30.
42 Kalish L. Reducing mean squared error in the analysis of pair-matched case-control studies. Biometrics 1990;46:493-99.
43 Greenland S. Reducing mean squared error in the analysis of stratified epidemiologic studies. Biometrics 1991;47:773-75.
44 Kreft IGG, de Leeuw J, van der Leeden R. Review of five multilevel analysis programs: BMDP5V, GENMOD, HLM, ML3, VARCL. Am Stat 1994;48:324-35.
45 Witte JS, Greenland S, Kim L-L. Software for hierarchical modeling of epidemiologic data. Epidemiology 1998;9:563-66.
46 Kreft IGG (ed.). Hierarchical linear models: problems and prospects. J Educ Behav Stat 1995;20:109-240.
47 Wong GY, Mason WM. The hierarchical logistic regression model for multilevel analysis. J Am Stat Assoc 1985;80:913-24.
48 Raudenbush SW, Bryk AS. Empirical-Bayes meta-analysis. J Educ Stat 1985;10:75-98.
49 Von Korff M, Koepsell T, Curry S, Diehr P. Multilevel analysis in epidemiologic research on health behaviors and outcomes. Am J Epidemiol 1992;135:1077-82.
50 Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 1991;2:244-51.
51 Cox DR, Hinkley DV. Theoretical Statistics. New York: Chapman and Hall, 1974, Ch. 11.