Principles of multilevel modelling

Sander Greenland

Department of Epidemiology, UCLA School of Public Health, 22333 Swenson Drive, Topanga, CA 90290, USA.


    Abstract
 
Background Multilevel modelling, also known as hierarchical regression, generalizes ordinary regression modelling to distinguish multiple levels of information in a model. Use of multiple levels gives rise to an enormous range of statistical benefits. To aid in understanding these benefits, this article provides an elementary introduction to the conceptual basis for multilevel modelling, treating classical frequentist, Bayes, and empirical-Bayes techniques as special cases. The article focuses on the role of multilevel averaging (‘shrinkage’) in the reduction of estimation error, and the role of prior information in finding good averages.

Keywords Bayesian statistics, empirical-Bayes estimation, hierarchical regression, mixed models, multilevel modelling, random-coefficient regression, ridge regression, risk assessment, Stein estimation

Accepted 29 June 1999


    Introduction
 
The vast increase in computing power over recent decades has led to the emergence of multilevel modelling and its equivalents as practical and powerful analysis tools.1–6 This approach involves specifying two or more levels, or stages, of relationships among study variables and parameters. These levels are arranged in a hierarchy; hence the approach is also commonly known as hierarchical modelling or hierarchical regression. Ordinary regression techniques represent a special case in which there is only one level of the hierarchy.

Multilevel modelling has appeared and reappeared in many forms over the last 50 years. For example, the following techniques are all special cases of or equivalent to hierarchical regression: Bayesian, empirical-Bayes (EB), Stein, penalized likelihood, mixed-model, ridge, and random-coefficient regression, and variance-components analysis.1–6 Conventional estimation methods, such as least squares and maximum likelihood, are also special cases in which there is only one level in the underlying model. The hierarchical approach unifies these methods and clarifies their meaning. Most importantly, it unifies the seemingly disparate methods of frequentist (‘classical’) and Bayesian analysis. This unification is a major benefit of the approach, for it leads to methods that are superior to both classical frequentist and Bayesian methods.

I will here barely touch on the diverse applications, forms, and examples of multilevel modelling. My reason is that a more basic conceptual introduction seems necessary. Despite several decades of published examples and methodological studies demonstrating the superiority of the multilevel (hierarchical) perspective, and its widespread acceptance in the social sciences, it is not widely understood, taught, or employed in the health sciences. Lack of affordable and user-friendly software has no doubt been an obstacle to use, although as discussed below many software options are now available for multilevel modelling.

Another obstacle, however, is that most presentations move into technical and subject-matter details too rapidly for health-science readers, and few discuss the general multilevel perspective. I will develop that perspective here, although to limit the scope I will focus on estimation error, which is just one of several major problems addressed by multilevel modelling. I will explain how one by-product of multilevel modelling—hierarchical estimation—can help address that problem. The examples used here are artificial and too simple to warrant even an ordinary regression analysis. Their purpose is only to provide a basis for understanding the much more complex real examples that can be found in the references.


    The Estimation Problem
 
To fix ideas, let us consider one of the simplest cases of statistical estimation in health science: estimation of the average risk of an outcome in a single cohort. Specifically, suppose we are concerned with the probability (average risk) of spontaneous abortion as an outcome of first planned pregnancy among women exposed to some work environment. Denote this average risk by {theta}. This risk parameter {theta} is our target parameter.

Suppose our data comprise an observation of A spontaneous abortions in a cohort of N first planned pregnancies in that exposed environment. The observed proportion A/N is the usual estimator of the risk parameter {theta} under standard validity assumptions; it is known as the maximum-likelihood estimator (MLE). Here, these validity assumptions may be summarized by stating that the observed proportion A/N differs from the target risk {theta} in only a random fashion, in accord with the assumed probability model (usually, but not necessarily, a binomial model). In the present example, this validity assumption is met by ensuring that subject selection is random and outcome ascertainment is complete. Such assumptions are of course difficult to meet in practice; nonetheless all standard statistical procedures rely on them, a point which is often overlooked when comparing new and standard methods.

Bias versus random scatter
The expectation of the observed proportion A/N is the average of this proportion over all possible samples (study repetitions) from the target population, with each possible value for A/N weighted by its probability. Given the above validity assumptions, this expectation equals the true abortion risk, which is the target parameter {theta}. Denoting expectation by E( ), we can restate this property of the proportion A/N with the equation:

E(A/N) = {theta}

In statistical theory, the bias of an estimator is defined to be the difference between the expectation (average value) of the estimator and the target parameter. An estimator is unbiased when its expectation equals the target parameter or, equivalently, when its bias is zero. The above equation thus says that, in a valid cohort study, the proportion A/N is unbiased for the average risk {theta}, regardless of the size of {theta} or N.

For many decades, statistical unbiasedness was taken to be a prerequisite for a good estimator. Gradually, however, it was realized that other properties are important as well, and that much could be gained by compromising unbiasedness to improve those properties.7–10 One such property is the standard deviation (SD) of the estimator, which is a measure of the scatter or spread the estimator would exhibit around its own expectation across random samples from the target population. (Its exact definition is not needed here and can be found in any statistics text.) Estimators with large standard deviations (random scatter) are unreliable estimators of the target parameter, even if they are unbiased.

Unfortunately, any estimator that has a smaller standard deviation than A/N may (depending on what {theta} is) have non-zero statistical bias, which is to say, its expectation may not equal the target parameter {theta}. We thus face a trade-off: If we want to reduce the standard deviation (random variation) of our estimator, we must risk incurring some statistical bias. But we ought to be willing to risk adding a small amount of bias in an estimator if that estimator suffers from much less random variation than A/N.

Figure 1 illustrates the principle that neither bias nor scatter should be considered alone. Suppose we have three rifles, and we wish to test which one will more often ‘win’ when sighted on {theta} and shot, where a ‘win’ is a hit anywhere inside the ring around {theta}. Rifle 1 has straight sights and a short worn-out barrel, and tends to scatter its shots (the •s) widely around {theta} when it is aimed straight at {theta}; only about 20% of those shots get inside the ring. Rifle 2 has bent sights and a long new barrel, and tends to concentrate its shots (the Xs) in a small cluster to the left of {theta} when it is aimed straight at {theta}; about 80% of those shots get inside the ring. Rifle 1 has ‘unbiased’ sights because its scatter of shots is evenly distributed around {theta}; Rifle 2 has biased sights. But Rifle 1 is less likely to ‘win’ than Rifle 2 because it has larger scatter than Rifle 2. Rifle 3 also has bent sights and an even longer new barrel, so that it tends to concentrate its shots (the +s) even more tightly than Rifle 2; its sights are so bent, however, that only about 10% of its shots get inside the ring.
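The rifle thought experiment can be simulated. The bias and scatter values below are illustrative assumptions chosen to roughly reproduce the hit rates quoted in the text (about 20%, 80%, and 10%); the article itself gives no numerical settings.

```python
import math
import random

def hit_rate(bias_x, sd, ring_radius=1.0, shots=100_000, seed=1):
    """Simulate shots from a rifle sighted on theta = (0, 0).

    bias_x: horizontal offset of the shot cluster's centre (bent sights).
    sd: scatter (standard deviation) of shots around that centre.
    Returns the fraction of shots landing inside the ring around theta.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(shots):
        x = rng.gauss(bias_x, sd)
        y = rng.gauss(0.0, sd)
        if math.hypot(x, y) <= ring_radius:
            hits += 1
    return hits / shots

# Assumed settings, chosen to mimic the three rifles in the text:
rifle1 = hit_rate(bias_x=0.0, sd=1.5)    # unbiased, wide scatter: ~20% hits
rifle2 = hit_rate(bias_x=0.6, sd=0.35)   # some bias, tight cluster: ~80% hits
rifle3 = hit_rate(bias_x=1.2, sd=0.18)   # large bias, very tight: ~10% hits
print(rifle1, rifle2, rifle3)
```

As in the text, the moderately biased but tightly clustered Rifle 2 wins far more often than either extreme.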



Figure 1 Clusters of shots (estimates) from three different rifles (estimators) sighted on point {theta}: • = Rifle 1 shots, X = Rifle 2 shots, + = Rifle 3 shots

 
Expected squared error
Rifle 1 is like an estimator with no bias but large random error; Rifle 2 is like an estimator with moderate bias and moderate random error; and Rifle 3 is like an estimator with large bias but small random error. Our thought experiment with such rifles thus might suggest the following criterion for choosing estimators: Take the one with the smallest average distance from the target {theta}. By this criterion, Rifle 2 outperforms Rifles 1 and 3.

Unfortunately, this reasonable-sounding criterion turns out to be very hard to use for statistical practice. A much simpler way to balance bias against scatter is to choose the estimator with the smallest average squared distance from {theta}. This average is known as expected squared error (ESE) or mean squared error (MSE)7 and is a common measure of estimation inaccuracy. By this criterion, Rifle 2 again outperforms Rifles 1 and 3.

One very useful property of ESE is its Pythagorean (sum-of-squares) relation to bias and standard deviation:


ESE = Bias² + SD²     (1)

This equation displays the trade-off we must make between bias and scatter when choosing estimators based on their expected squared error. Note that, for a statistically unbiased estimator like A/N, the bias is zero and so ESE = SD² if the study is perfectly valid. In practice, however, we should expect all the estimators (including A/N) to have some bias due to imperfections in study design or conduct (such as case underascertainment).
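The Pythagorean relation (1) can be checked numerically. The sketch below (not from the article) computes the exact bias, SD, and ESE of any estimator of {theta} by enumerating all possible binomial outcomes:

```python
from math import comb

def estimator_moments(est, N, theta):
    """Exact bias, SD, and ESE of an estimator est(A, N) when
    A ~ binomial(N, theta), by summing over all outcomes A = 0..N."""
    pmf = [comb(N, a) * theta**a * (1 - theta)**(N - a) for a in range(N + 1)]
    mean = sum(p * est(a, N) for a, p in zip(range(N + 1), pmf))
    var = sum(p * (est(a, N) - mean) ** 2 for a, p in zip(range(N + 1), pmf))
    ese = sum(p * (est(a, N) - theta) ** 2 for a, p in zip(range(N + 1), pmf))
    return mean - theta, var ** 0.5, ese

# For the unbiased estimator A/N with N = 10 and a hypothetical theta = 0.30:
bias, sd, ese = estimator_moments(lambda a, n: a / n, N=10, theta=0.30)
print(bias, sd, ese)                          # bias is zero; ESE = SD^2
assert abs(ese - (bias**2 + sd**2)) < 1e-12   # relation (1) holds exactly
```

For A/N the ESE reduces to the binomial variance {theta}(1 - {theta})/N, here 0.021.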


    Bayes Estimation
 
How do we find an estimator with smaller ESE than the observed proportion A/N? A/N is statistically unbiased and has the smallest standard deviation possible for such an estimator. Thus, the only way to do better is to search among statistically biased estimators with smaller SD than A/N, and determine which ones have smaller ESE as well. One systematic approach to this search goes as follows. Before seeing the study data, suppose that your best guess is that the average risk {theta} is somewhere near r. You should be unsure about your guess, however; suppose for the moment you would give your best guess the same weight as you would give to the results from a valid study of size n. For example, based on background literature on spontaneous abortions among first planned pregnancies and on the effects of exposures in the work environment, you might offer 0.20 as your best guess for the risk {theta}, so r = 0.20. Also, you might give your guess the same weight as you would give an estimate from a cohort of size n = 5; in other words, you would think your educated guess to be worth as much as observation of only five pregnancies. The values r and n are called your prior parameters, because they represent your state of knowledge prior to seeing the new data A and N. In particular, r is called your prior expectation or prior mean for {theta}, and n is called your prior effective sample size. Taken together, they imply that your prior information is equivalent to having observed a = n • r cases in a valid cohort study of size n; in the example, the prior information is equivalent to having observed 5(0.20) = 1 case in a valid cohort study of five pregnancies.

A Bayes estimator for {theta} uses probability models to combine prior information (embodied in r and n) with the data according to a rule known as Bayes' Theorem.4,8,9,11–13 Such estimators can be rather complicated, but in many examples a reasonable approximate Bayes estimator can be obtained by taking a weighted average of the ordinary estimator and our prior guess for {theta}. One such estimator weights the observed proportion A/N and the prior mean r by their respective sample sizes N and n. If we let w = N/(N + n), this estimator is the weighted average:


B = w(A/N) + (1 - w)r     (2)

B is called a posterior expectation or posterior mean for {theta}; it is ‘posterior’ in the sense that it can only be computed after the study has been conducted. N + n is the sum of the study cohort size N and our prior sample size n, so it can be thought of as representing the effective sample size on which the estimator is based. The weight w for A/N represents the proportion of this effective sample size that comes from the study data, while the weight 1 - w represents the proportion that comes from our prior information. If we set n to zero, we get w = 1 and put no weight on our prior guess r; when that happens, we get back A/N from formula 2. Thus, A/N is the special case of the estimator B obtained when we completely disregard prior information about {theta}.

To illustrate the above formulae, suppose we observe A = 4 spontaneous abortions in a cohort of N = 10 first pregnancies. The observed proportion is then 4/10 = 0.40, which is the usual estimate of {theta}, the true risk. With a prior guess of r = 0.20 for {theta} and a weight for this guess equal to n = 5 pregnancies, our weight w for A/N is 10/(10 + 5) = 0.67 and our approximate Bayes estimator for {theta} is

B = 0.67(0.40) + 0.33(0.20) = 0.33,

which is 17% smaller than the observed proportion of 0.40.
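The arithmetic of formula 2 is easy to transcribe. The function below is a direct sketch of the weighted average, applied to the example's values:

```python
def bayes_estimate(A, N, r, n):
    """Approximate Bayes (shrinkage) estimate per formula 2: a weighted
    average of the observed proportion A/N and the prior guess r,
    weighted by the actual and prior effective sample sizes N and n."""
    w = N / (N + n)
    return w * (A / N) + (1 - w) * r

# The example from the text: A = 4, N = 10, prior guess r = 0.20 worth n = 5.
B = bayes_estimate(A=4, N=10, r=0.20, n=5)
print(round(B, 2))  # 0.33, vs the observed proportion 0.40
```

Equivalently, B is the pseudo-data form (A + n·r)/(N + n) = (4 + 1)/15 = 5/15: the prior acts like 1 extra case in 5 extra pregnancies.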

The shrinkage viewpoint
Why should we expect to ever do any better than A/N when n is greater than zero? Imagine again our unbiased but very scattered Rifle 1, which corresponds to A/N. Suppose we deflect every shot emerging from its barrel toward a point r (for example, imagine Rifle 1 shoots steel bullets and we place a powerful magnet at r). Figure 2 illustrates what happens if r is next to the ring and each bullet is deflected about halfway toward r. When we aim Rifle 1 at {theta}, the deflection toward r produces a tighter cluster whose centre is shifted away from {theta}, toward r. Thus, the hits are now biased away from {theta}, but less scattered and hence closer to {theta} on average, provided r is not too far from {theta}.



Figure 2 How cluster from Rifle 1 could be made better by pulling toward a point r that need not equal {theta}

 
The same phenomenon applies with the estimator B. As we increase n, we decrease w and so increase the weight we put on the prior guess r, which increases the bias but reduces the scatter of B. Of course, the better our guess, the closer r is to {theta}, and the less bias we will incur.

The estimator B is an example of a ‘shrinkage estimator’, because it is the usual estimator A/N ‘shrunk’ toward the point r by a degree proportional to 1 - w, as in Figure 2.7,14 The term ‘shrinkage’ is a bit misleading because B will be bigger than A/N if r is bigger than A/N. The phrase ‘shrinkage toward r’ should thus be interpreted as ‘averaging with r’, not as ‘becoming smaller’.

An important question is just how much we should ‘shrink’ toward r. This question comes down to how small to make w: If w = 1, B = A/N and there is no shrinkage; as w shrinks toward 0, B shrinks toward r. If r were equal to the target parameter {theta}, we would want w = 0, because then B = r = {theta}; that is, B would equal the target. In reality, we do not know with much certainty how far r is from {theta}; we are ‘shooting in the dark’. Using a value of w closer to 1 than to zero provides some protection against shrinking too far in the wrong direction, which could happen if r were very far from {theta}.

Choosing the prior parameters
How should we pick r and n? A vague guideline is that r should reflect the best of our knowledge and n should adequately reflect our uncertainty. It turns out, however, that we can screen out ‘bad’ choices for the prior parameters r and n. More precisely, we can always see whether our choices make B more accurate than A/N within the range for {theta} in which we think accuracy is crucial.

To be more precise, suppose we are given a range {theta}L to {theta}U for the target parameter {theta}, with {theta}L > 0 and {theta}U < 1. Then we can check to make sure that our choice for the prior parameters r and n yields an estimator B that has smaller expected squared error than A/N for any {theta} in that range. The chosen B may have higher expected squared error than A/N if {theta} is outside the given range; however, we can always make the range wide enough so that it includes all values of {theta} for which accuracy is crucial.

To illustrate, suppose we are most worried about accurately estimating spontaneous-abortion risk for values of {theta} that are ‘slightly or moderately elevated above normal’, because such values are liable to generate ambiguous study results. Suppose spontaneous abortions of first planned pregnancies are independent across women in our study cohort; N = 10 such pregnancies are observed; and ‘slightly or moderately elevated risk’ for these women would fall somewhere inside the range {theta}L = 0.10 to {theta}U = 0.40. Suppose that we wish to evaluate the pair r = 0.20, n = 5, for which w = 10/(10 + 5) = 0.67. The resulting approximate Bayes estimator is:


B = 0.67(A/N) + 0.33(0.20)

From formula (A2) in Appendix 1, we find that B will have smaller expected squared error than A/N when the risk is slightly or moderately elevated (i.e. when {theta} is between 0.10 and 0.40). If one spontaneous abortion is observed in the cohort (i.e. if A = 1), then A/N will equal 1/10 = 0.10 and B will equal 2/15 = 0.13; on the other hand, if A = 4, then A/N = 4/10 = 0.40 and B = 5/15 = 0.33. Thus, B represents only a modest alteration of A/N when r = 0.20, n = 5, N = 10, and A/N is between {theta}L and {theta}U.

The difference between A/N and B grows as A/N moves outside the range {theta}L, {theta}U. But the idea is to choose {theta}L and {theta}U so that accuracy is not so important outside that range: We can and should choose {theta}L so that both estimators are likely to be unambiguously low if {theta} < {theta}L; likewise, we can and should choose {theta}U so that both estimators are likely to be unambiguously high if {theta} > {theta}U. The point of using B instead of A/N is to concentrate on improving accuracy where we need it most, in the ‘grey zone’ between {theta}L and {theta}U, where ambiguous values of A/N are most likely to arise.
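The screening idea can be sketched with the generic squared-bias-plus-variance formulas implied by relation (1) under a binomial model; these are not necessarily identical to formula (A2) of Appendix 1, but they let us compare the two estimators over the crucial range:

```python
def ese_mle(theta, N):
    """Expected squared error of the unbiased A/N: just its binomial variance."""
    return theta * (1 - theta) / N

def ese_bayes(theta, N, r, n):
    """Expected squared error of B = w(A/N) + (1 - w)r, computed as
    variance plus squared bias, per relation (1)."""
    w = N / (N + n)
    return w**2 * theta * (1 - theta) / N + (1 - w)**2 * (r - theta)**2

# The example's prior (r = 0.20, n = 5) beats A/N across the crucial range:
for theta in [0.10, 0.20, 0.30, 0.40]:
    assert ese_bayes(theta, 10, 0.20, 5) < ese_mle(theta, 10)

# ...but can lose badly far outside it, e.g. at theta = 0.90:
print(ese_bayes(0.90, 10, 0.20, 5), ese_mle(0.90, 10))
```

This is exactly the advertised trade: B buys accuracy in the grey zone at the cost of accuracy at values of {theta} we have declared unimportant.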

The multilevel viewpoint
The estimator B is a very simple example of a multilevel or hierarchical estimator based on two levels, or stages, of models.1–9,15 These stages are as follows: The first stage is the model for the incidence, A/N, of spontaneous abortions in the study cohort, as a probabilistic function of the target parameter {theta}; specifically, A/N is assumed to deviate from {theta} in a purely random fashion. This model is the one and only stage used to derive A/N as the maximum-likelihood estimator for the average risk {theta}. The second stage is a model for the average risk {theta} as a probabilistic function of r and n; specifically, {theta} is assumed to deviate from r in a purely random fashion to an extent inversely related to n. Because the parameters r and n can be derived before the study is conducted, this second-stage model is usually called the prior model.4

It may (and perhaps should) at first seem odd to model the target {theta}, which is presumably a fixed and real quantity, as a function of our prior mean r, which after all is just our best guess about {theta}. In fact, the oddity of this step—modelling a presumably fixed reality as a function of an idea—has provoked some of the most vociferous resistance to and misunderstanding of Bayes estimators. There are, however, several explanations for why the model is from the prior mean r to the target {theta}.

An intuitive explanation is that we are attempting to make inferences from our initial guess r to the reality {theta}, and hence the best way we can express our initial uncertainty concerning {theta} is to use our best guess r as a reference point. More rigorous justification along these lines can be based on subjective probability logic.11–13 A frequentist argument is essentially that given above: The model provides estimators with smaller expected squared error than A/N under explicit assumptions.7

The picture with two cohorts
With two cohorts, we can more easily translate the lesson in Figures 1 and 2 to a statistical setting as follows: Suppose we have a cohort from Finland and a cohort from Sweden, and we have a choice among three different statistical methods for estimating risk in the cohorts. Let {theta}1 be the risk in the Finnish cohort and {theta}2 the risk in the Swedish cohort. The point {theta} in the figures now represents the target of the study: {theta} is the pair of risks ({theta}1, {theta}2), so that {theta}1 is the x-coordinate (abscissa) and {theta}2 is the y-coordinate of {theta}. The different symbols in Figure 1 (•, X, +) correspond to typical results from statistical methods 1, 2, 3. Figure 1 illustrates how method 2 (the Xs) tends to give results closest to the risk pair {theta}, even though its results are more biased than those from method 1 (the •s) and more variable than those from method 3 (the +s). It thus appears that a compromise between bias and variability tends to produce results closest to the truth.

Suppose now we have good guesses r1 and r2 (perhaps from other data) for the target risks {theta}1 and {theta}2. The point r in Figure 2 represents this pair of guesses, i.e. r is the point with coordinates (r1, r2). Figure 2 illustrates how the unbiased but highly variable results from method 1 can be brought closer to the target pair {theta} by pulling them toward r, which introduces some bias, but more than compensates by greatly reducing variability. More generally, it suggests that the most accurate methods may be found by introducing some bias into statistically unbiased estimators to reduce their variability and thus increase their accuracy.


    Stein Estimation
 
As with all statistical assumptions, the prior (second-stage) assumptions should be recognized and checked against the data.16–18 This means, in particular, that if the conventional estimates turn out to be very far (many standard deviations) from our prior guesses, we may become concerned that our prior guesses are very wrong and so give them little or no weight.

When three or more unknown target parameters are being simultaneously estimated, we can construct two-stage estimators that always have lower total ESE than conventional (maximum-likelihood or weighted-least-squares) one-stage estimators. This gain is achieved by a clever strategy, known as Stein estimation, of weighting prior guesses in a manner that directly depends on their agreement with the observations.7,19,20 To illustrate, let us expand the earlier example by supposing that we have three independent cohorts of first planned pregnancies, all in a presumably exposed occupation but in different locations, e.g. Finland, Sweden, Denmark, and that the target parameters are the average risks {theta}1, {theta}2, {theta}3 in the three cohorts. Let N1, N2, N3 be the cohort sizes, with N+ = N1 + N2 + N3, and let A1, A2, A3 be the numbers of spontaneous abortions in each cohort, with A+ = A1 + A2 + A3.

Now suppose that our prior guesses for {theta}1, {theta}2, {theta}3 are r1, r2, r3 (these guesses might all be the same, but need not be so). The Stein (or James-Stein) estimators S1, S2, S3 of {theta}1, {theta}2, {theta}3 are given by


Sk = ŵk(Ak/Nk) + (1 - ŵk)rk     (3)

for k = 1, 2, or 3. Here, ŵ1, ŵ2, ŵ3 are estimated weights; their formula is a bit complicated and so is deferred to Appendix 2. An important point about the weight formula is that it makes ŵk close to 0, and hence makes Sk close to rk, if Ak/Nk appears highly variable or if all the observed proportions A1/N1, A2/N2, A3/N3 are close to their prior means r1, r2, r3 (which suggests we have made good guesses). On the other hand, the weight formula makes ŵk close to 1, and hence makes Sk close to Ak/Nk, if Ak/Nk appears very precise or if any of the observed proportions are far from their prior means (which, assuming the study was valid, suggests we have made at least one bad guess).

To illustrate the Stein formula, suppose we observe A1 = 4, A2 = 1, and A3 = 3 spontaneous abortions in Finnish, Swedish, and Danish pregnancy cohorts of size 10 each (N1 = N2 = N3 = 10), so that the observed proportions Ak/Nk are 0.40, 0.10, and 0.30, with a prior guess of 0.20 for the risk in all three cohorts (r1 = r2 = r3 = 0.20). From Appendix 2 we get estimated weights of ŵ1 = 0.68, ŵ2 = 0.70, and ŵ3 = 0.68 for the three observed proportions. Applying formula 3, we obtain S1 = 0.34, S2 = 0.13, and S3 = 0.27 for the Stein estimates of the true abortion risks in the three cohorts. Each of these estimates is the result of pulling the observed proportion part of the way toward the common prior mean of 0.20.
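Formula 3 with the example's numbers can be transcribed directly. The weights here are taken as given from the text, since their formula is deferred to Appendix 2:

```python
def stein_estimates(props, weights, priors):
    """Formula 3: S_k = w_k * (A_k/N_k) + (1 - w_k) * r_k, applied
    componentwise to the three cohorts."""
    return [w * p + (1 - w) * r for p, w, r in zip(props, weights, priors)]

props = [0.40, 0.10, 0.30]     # observed proportions A_k/N_k
weights = [0.68, 0.70, 0.68]   # estimated weights quoted from Appendix 2
priors = [0.20, 0.20, 0.20]    # common prior guess r_k
S = stein_estimates(props, weights, priors)
print([round(s, 2) for s in S])  # [0.34, 0.13, 0.27]
```

Each observed proportion is pulled roughly a third of the way toward the shared prior mean 0.20.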

An amazing property of the Stein estimators is that their total expected squared error is guaranteed to be less than that of the conventional estimators, regardless of the values of the target parameters or the prior means.7,19,20 That is, taken as an ensemble, the Stein estimators will outperform the conventional estimators even if our prior means are very wrong. For example, but for the approximation error, the total of the expected squared errors of S1, S2, S3 will be less than the total of the expected squared errors of A1/N1, A2/N2, A3/N3 even if r1, r2, r3 are all far from {theta}1, {theta}2, {theta}3. In this sense, and unlike Bayes estimators, the overall (ensemble) performance of the Stein estimators is robust to our prior guesses. Nonetheless, like Bayes estimators, if the prior guess used for one of the target parameters is poor, the Stein estimator for that parameter may be much less accurate than the conventional estimator for that parameter. It is only the total squared error of the Stein estimators that is guaranteed to be lower. To ensure that each of the Stein estimators is more accurate than its conventional counterpart, we would have to make a good prior guess for each target parameter.

The shrinkage viewpoint
The robustness advantage of the Stein estimators relative to Bayes estimators may be explained from a shrinkage perspective.7,19–21 If the conventional estimator Ak/Nk is highly variable, it should get more shrinkage toward a stable point (that is, get a smaller ŵk) in order to reduce its variability. Such shrinkage will introduce excessive bias, however, if the chosen stable point rk is too far from the true parameter {theta}k. One way to protect ourselves from shrinking too much toward the wrong point is to shrink in proportion to how good our guesses appear to be. The weight ŵk is constructed in such a manner that it responds as needed to the apparent variability of Ak/Nk and to the distance of our prior means from the conventional estimates (Appendix 2).

The multilevel viewpoint
Stein estimators are two-stage estimators with the same first-stage model as the conventional and Bayes estimators. The second-stage model differs somewhat from that of the Bayes estimators. As in Bayes estimation, each {theta}k is assumed to deviate only randomly from its prior mean rk. Unlike in Bayes estimation, however, the extent of this deviation is no longer assumed to be known. Thus, Stein estimators depend on one less second-stage assumption than do Bayes estimators.


    Empirical Bayes Estimation
 
When there are four or more target parameters, Stein estimation can be improved upon. The key idea is to use Bayes estimators in which the prior means as well as the weights are estimated from the data, rather than being left to the judgement of the analyst. Such estimators are called empirical Bayes estimators5,7,15,19,22,23 and represent some of the most valuable outputs of multilevel modelling.

To illustrate, let us further expand the earlier example to four cohorts. For example, suppose we add a Norwegian cohort of size N4 to our study, in which we observe A4 spontaneous abortions; now N+ = N1 + N2 + N3 + N4 and A+ = A1 + A2 + A3 + A4, with {theta}4 the true Norwegian risk. With conventional single-stage estimation and no further information, we have just two choices for estimating the target parameters: Either estimate each separately by A1/N1, A2/N2, A3/N3 and A4/N4; or else assume that {theta}1 = {theta}2 = {theta}3 = {theta}4, and estimate each with the same estimator, A+/N+. Using A1/N1, A2/N2, A3/N3 and A4/N4 to estimate {theta}1, {theta}2, {theta}3, {theta}4 would be the natural choice if we were absolutely certain that abortion risks differ across the countries to an important extent; using A+/N+ to estimate {theta}1, {theta}2, {theta}3, {theta}4 would be the natural choice if we were absolutely certain the risks do not differ. But what if we are not at all sure about whether the risks differ to an important extent? Hierarchical methods allow us more sensible options for this situation, including compromises between the two conventional extremes.

An exchangeability assumption
Suppose that our background information about the target parameters {theta}1, {theta}2, {theta}3, {theta}4 is such that, before seeing the data, we would not be able to delineate any reasons to distinguish among them; in technical terms, suppose we would consider them to be exchangeable.5,11,24,25 Such exchangeability implies that we would offer about the same prior guess (r1 ≈ r2 ≈ r3 ≈ r4) with about the same effective sample size (n1 ≈ n2 ≈ n3 ≈ n4) for all four parameters. This is a much weaker assumption than that the parameters really are equal; it only says that, without seeing the data, we can’t yet tell how they might differ. It is, essentially, an uncertain and qualitative prior guess about the similarity of the cohorts. If we did not consider {theta}1, {theta}2, {theta}3, {theta}4 exchangeable (e.g. if we suspected that {theta}1 < {theta}2 < {theta}3 < {theta}4) we could modify the estimators below to reflect our actual views, although the result would be more complicated. It turns out, however, that even if we do not accurately capture our background (prior) information, the EB estimators will still have an advantage over the conventional and Stein estimators.7,19,20

A fully empirical estimator
A key idea behind Bayes and Stein estimation is that, by analogy with Figure 2, we can stabilize (reduce the standard deviation of) the conventional estimators A1/N1, A2/N2, A3/N3, A4/N4 by shrinking them toward a given point.20,21 Given that we will shrink them all toward the same value (because we consider the underlying risks to be exchangeable), which such value would minimize our total expected squared error? A moment's reflection might suggest that a good choice would be some weighted average of {theta}1, {theta}2, {theta}3, {theta}4; call that unknown average µ. We can construct an estimator of µ from the data, and plug this estimator into formula 2 or 3 to get new estimators for {theta}1, {theta}2, {theta}3, {theta}4. We can also estimate the best weights to attach to A1/N1, A2/N2, A3/N3, A4/N4 when we plug our estimator of µ into formula 2 or 3.

With weights proportional to the cohort sizes Nk, we obtain the crude proportion A+/N+ as the estimator of µ. If we use µ in place of rk in formula 3 and use A+/N+ to estimate µ, we get approximate empirical Bayes (EB) estimators ê1, ê2, ê3, ê4 of {theta}1, {theta}2, {theta}3, {theta}4, which have the form


êk = ŵk(Ak/Nk) + (1 - ŵk)(A+/N+) (4)

for k = 1, 2, 3, or 4. Here, ŵ1, ŵ2, ŵ3, ŵ4 are the estimated weights; their formula differs from that used for Stein estimation (Appendix 2). An important point about formula 4 is that it makes ŵk close to 0, and hence êk close to A+/N+, when Ak/Nk is unstable or when all the observed proportions A1/N1, A2/N2, A3/N3, A4/N4 are close together; in the latter event, the data appear compatible with exchangeability. On the other hand, the weight formula makes ŵk close to 1, and hence makes êk close to Ak/Nk, if Ak/Nk appears very precise or if the observed proportions are extremely spread out, for in the latter event the data appear incompatible with the exchangeability assumption. (A rationale for using A+/N+ in formula 4 is that it is an unbiased estimator of µ when the weights used for averaging the {theta}k are proportional to the cohort sizes Nk. In modern practice, better estimators of µ than A+/N+ are used; A+/N+ is used here only for simplicity, although it works well enough when {theta}1, {theta}2, {theta}3, {theta}4 are not very spread out.)

To illustrate the EB formula, suppose we observe A1 = 4, A2 = 1, A3 = 3, and A4 = 5 spontaneous abortions in Finnish, Swedish, Danish, and Norwegian cohorts of size 10 each, so that the observed proportions are 0.40, 0.10, 0.30, and 0.50, and the crude proportion A+/ N+ is 13/40 = 0.33. From Appendix 2 we get estimated weights of 0.79, 0.84, 0.80, and 0.78 for the four observed proportions, and ê1 = 0.38, ê2 = 0.14, ê3 = 0.31, and ê4 = 0.46 for the EB estimates of the risks in the four cohorts. Each of these estimates are the result of pulling away from the observed proportions, toward their estimated common prior mean of 0.33.
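The Appendix 2 iteration (formulas 4, A5, A6, and A8-A10) can be sketched in Python. The loop below is a plausible fixed-point implementation, not the article's exact routine; the starting value for the between-cohort variance and the clamping of the weights to [0, 1] are my own numerical safeguards.

```python
# Approximate EB estimation for the four-cohort example in the text.
A = [4, 1, 3, 5]          # observed abortions per cohort
N = [10, 10, 10, 10]      # cohort sizes
K = len(A)
p = [a / n for a, n in zip(A, N)]   # conventional estimates Ak/Nk
pbar = sum(A) / sum(N)              # crude proportion A+/N+ = 0.325

e = p[:]                  # EB estimates, initialized at the conventional ones
tau2 = 0.01               # starting value for the between-cohort variance
for _ in range(200):
    v = [ek * (1 - ek) / n for ek, n in zip(e, N)]                # A9
    d = [K * (pk - pbar) ** 2 / (K - 1) - vk
         for pk, vk in zip(p, v)]                                 # A10
    u = [1 / (vk + tau2) for vk in v]                             # A5
    tau2 = max(sum(uk * dk for uk, dk in zip(u, d)) / sum(u), 1e-8)  # A6
    w = [1 - min(1.0, max(0.0, (K - 3) * uk * vk / (K - 1)))
         for uk, vk in zip(u, v)]                                 # A8
    e = [wk * pk + (1 - wk) * pbar for wk, pk in zip(w, p)]       # formula 4

print([round(wk, 2) for wk in w])  # estimated weights, near those in the text
print([round(ek, 2) for ek in e])  # EB estimates: each pulled toward 0.325
```

Run on these data, the iteration settles near the weights and estimates reported in the text.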

The estimator in formula 4 is called empirical Bayes because it uses data (‘empirical’) quantities such as A+/N+ in place of all the prior parameters in the Bayes estimator (formula 2). As with Stein estimators, the total expected squared error of the EB estimators is guaranteed to be less than that of the conventional estimators, no matter what the target parameters really are.7,19,21 That is, taken as an ensemble, the EB estimators will outperform the conventional estimators even if our exchangeability assumption (similarity judgement) was very wrong. For example, but for the approximation error, the total of the expected squared errors of ê1, ê2, ê3, ê4 will be less than the total of the expected squared errors of A1/N1, A2/N2, A3/N3, A4/N4, even if {theta}1, {theta}2, {theta}3, {theta}4 are far apart. In this sense, the overall (ensemble) performance of the EB estimators is robust.
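The ensemble advantage can be illustrated with a small simulation. The shrinkage weight below is a simple moment-based choice of my own, not the article's weight formula, and the equal true risks make exchangeability hold exactly:

```python
import random
random.seed(1)

K, N, theta = 4, 10, [0.3, 0.3, 0.3, 0.3]   # equal true risks

def binom(n, p):
    # simple binomial draw from n Bernoulli trials
    return sum(random.random() < p for _ in range(n))

sse_conv = sse_eb = 0.0
for _ in range(5000):
    A = [binom(N, t) for t in theta]
    p = [a / N for a in A]
    pbar = sum(A) / (K * N)
    # moment-based shrinkage toward the pooled proportion
    v = pbar * (1 - pbar) / N                                   # within-cohort variance
    between = sum((pk - pbar) ** 2 for pk in p) / (K - 1)       # spread of the p's
    w = max(0.0, 1 - v / between) if between > 0 else 0.0
    e = [w * pk + (1 - w) * pbar for pk in p]
    sse_conv += sum((pk - t) ** 2 for pk, t in zip(p, theta))
    sse_eb += sum((ek - t) ** 2 for ek, t in zip(e, theta))

print(sse_conv / 5000, sse_eb / 5000)   # shrinkage yields the smaller total
```

The averaged total squared error of the shrinkage estimates comes out well below that of the conventional estimates, as the ensemble guarantee predicts.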

Unlike that of the Bayes and Stein estimators, the accuracy of the EB estimators does not depend on assumptions about the exact values of the true parameters; only a much weaker assumption of similarity (exchangeability) is used. This can be viewed as another robustness advantage of EB estimation. Nonetheless, if the exchangeability assumption is not a good approximation to reality, some of the EB estimators in the ensemble may be less accurate than their conventional counterparts. For example, if the cohorts were all of equal size and risk except for cohort 4, and the latter was half the size and twice the risk of the other cohorts, the EB estimator for cohort 4 (ê4) might be much less accurate than the conventional estimator A4/N4. Again, it is only the total squared error of the EB estimators that is guaranteed to be lower. To ensure that each EB estimator is more accurate than its conventional counterpart, we would have to use good exchangeability assumptions; for example, if {theta}4 were twice the size of {theta}1, {theta}2, and {theta}3, we would get the best accuracy by assuming that {theta}4/2 is exchangeable with {theta}1, {theta}2, {theta}3, rather than that all four parameters are exchangeable. Thus, as with all statistics, the accuracy of our background assumptions will be crucial to the accuracy of our estimates.

The shrinkage viewpoint
The robustness advantages of EB estimators may be explained from the shrinkage perspective.7,14,20,21,26 If the conventional estimator Ak/Nk is highly variable, it should get more shrinkage toward a more stable point (that is, it should get a smaller ŵk) to reduce its variability. If we have no secure prior guess and the conventional estimates are close to one another, then the natural point to pull them toward is close to A+/N+, which is the estimate we would use if we thought the parameters were equal. Conversely, if a conventional estimator has low variability, it does not need much pull toward a stable point to improve its accuracy, so ŵk can be close to 1. And if the conventional estimates are not close together, then there is no basis for pulling them much toward A+/N+. The weight ŵk is constructed so that it responds as needed to the apparent variability of Ak/Nk and to the spread among the conventional estimates (Appendix 2).

The multilevel viewpoint
The EB estimators are two-stage estimators, with the same first-stage model as all the earlier estimators but a different underlying second-stage model. This is because the EB estimators make no use of our own prior (second-stage) means; instead, they employ an estimate of µ as the prior mean. The second-stage model underlying the EB estimators represents {theta}1, {theta}2, {theta}3, {theta}4 as probabilistic functions of µ, in that {theta}1, {theta}2, {theta}3, {theta}4 are assumed to deviate only randomly from µ. Unlike in Bayes or Stein estimation, however, this prior mean is not assumed to be known. Thus, EB estimators depend on one less assumption than do Stein estimators. (It should be noted, however, that some authors use the term ‘Stein estimation’ as a synonym for empirical-Bayes estimation.)


    Bayes Empirical Bayes and Beyond
What do we get for the extra labour of EB estimation? Numerous studies, in fields as diverse as epidemiology and baseball, have found that the accuracy gains over conventional estimates are usually large and often spectacular.21,26–30 For example, several studies suggest ESE reductions on the order of 40% or more in common epidemiological settings.27–30 It may come as a surprise, then, that we can do even better.

In empirical Bayes estimation, we can improve our estimate of µ in the same way we did when we had just one target parameter {theta}: by using prior information. Recall that, when we had just one parameter to estimate, we could construct simple two-stage estimators, known as Bayes estimators, whose ESE could be lower than that of conventional single-stage estimators over a given range of practical concern. In a similar fashion, we can construct basic three-stage estimators whose ESE can be lower than that of empirical Bayes estimators over a given range of practical concern. These estimators are sometimes called Bayes empirical Bayes (BEB) estimators.15,31,32

Returning to the example, suppose as before that we consider {theta}1, {theta}2, {theta}3, {theta}4 exchangeable, so that we would give the same prior guess r with the same prior certainty (e.g. as measured by n) for each. In addition, suppose we specify the same range {theta}L to {theta}U in which we want our BEB estimators to outperform the competitors given earlier. (Again, these assumptions are not essential, but formulas become more complex if a different prior pair rk, nk and range is specified for each target parameter.) Some approximate BEB estimators of {theta}1, {theta}2, {theta}3, {theta}4 have the form


(5)

where


(6)

As in formula 4, A+/N+ serves as an estimator of µ, and the ŵk are the estimated weights for the conventional estimators (although the weight formulas are more complex than in EB estimation). The main addition over the EB estimator in formula 4 is that we now also use our prior guess r to estimate µ. This is done by averaging r with the sample estimate A+/N+ of µ using a weight u (formula 6), in a manner that parallels ordinary Bayes estimation (formula 2).
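One plausible reading of this averaging step can be sketched as follows. The prior guess r, its weight u, and the flat shrinkage weights are hypothetical placeholders, since the article's actual (more complex) weight formulas are not reproduced here:

```python
# Sketch of the BEB averaging step: blend the prior guess r with the sample
# estimate A+/N+ of the second-stage mean, then shrink toward the blend.
A, N = [4, 1, 3, 5], [10, 10, 10, 10]
p = [a / n for a, n in zip(A, N)]
pbar = sum(A) / sum(N)        # sample estimate of the second-stage mean

r, u = 0.20, 0.8              # hypothetical prior guess and its weight
mu_hat = u * pbar + (1 - u) * r          # the blended estimate of the mean

w = [0.8] * 4                 # placeholder shrinkage weights
beb = [wk * pk + (1 - wk) * mu_hat for wk, pk in zip(w, p)]
print(round(mu_hat, 3), [round(b, 2) for b in beb])
```

The cohort estimates are thus pulled toward a point that itself reflects the prior guess, rather than toward A+/N+ alone.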

The BEB estimators defined by formulas 5 and 6 are based on a three-stage model. As with all the estimators, the first stage is a model for the incidence proportions Ak/Nk as probabilistic functions of their target parameters {theta}k, in which each Ak/Nk deviates only randomly from {theta}k. As with EB estimation, the second stage is a model for the parameters {theta}k as probabilistic functions of their mean µ, in which each {theta}k deviates only randomly from µ. The new, third stage is a model for the mean µ as a probabilistic function of the prior mean r, in which µ deviates only randomly from r.

One major conceptual difference between the ordinary Bayes and BEB models is that our prior information (embodied in r and u) has been moved up one stage in the hierarchy, from the second to the third stage. Just as the second-stage model is often called the prior model, the third-stage model is often called the hyperprior model;9 r is then called the hyperprior mean. An advantage of introducing this additional model is analogous to the advantage of introducing the prior model in estimating one parameter. Suppose we can specify some range {theta}L to {theta}U for {theta}1, {theta}2, {theta}3, {theta}4, with {theta}L > 0 and {theta}U <= 1, within which we want to improve the accuracy of our estimates (values outside the range may be unimportant because they would be likely to produce obvious results, or would be extremely implausible). We can then make sure that our values for r and u produce BEB estimators that have a smaller total expected squared error than the corresponding EB estimators within the specified range for the {theta}k.

Multilevel modelling provides many directions in which EB and BEB estimation can be generalized. For example, there is no theoretical limit to the number of stages (levels) allowed. Although practical limits of computation and imagination have kept most applications in health sciences to two-level models, examples involving three or more levels often arise in the social sciences.1,3 With such extensions, multilevel models can encompass all other parametric regression models as special cases, and can be extended to encompass non-parametric regression as well (via the relation of roughness penalties to prior information).33,34


    Some Issues in Multilevel Modelling
Like other techniques that claim generality and superiority to established methods, multilevel modelling raises some concerns about potential limitations relative to ordinary regression modelling. Although some of these concerns are well founded, most limitations of multilevel modelling are in fact limitations of all modelling methods. I can here provide only a brief summary of some key issues and my responses, along with references.

Limitations of modelling
Two points cannot be overemphasized. First, every inferential statistic (such as a P-value or confidence interval) is model based, in that some set of constraints (i.e. a model) on the data-generating process must be assumed in order to derive tests and estimates of quantities of interest.17,35 This limitation applies to so-called ‘non-parametric’ methods as well as to explicit modelling methods. It is a severe limitation of every analysis of epidemiological studies of cause and effect, because in such studies we almost never know enough about the processes that determine exposure or disease to be even moderately certain that our model is even approximately valid. The best we can do is explicate the model underlying our analysis so that everyone can critically evaluate the model assumptions. Multilevel modelling is distinguished only by its unfamiliarity, which obliges one to make more effort to explain the model. But multilevel modelling need not involve stronger assumptions than ordinary modelling, and in fact provides an opportunity to use weaker assumptions (that is, more flexible models) than those used in ordinary single-level analyses. This increased flexibility is achieved by pushing the models ordinarily used for single-level analysis up to the second level in the multilevel analysis.34,36,37

Second, inferential statistics, whether ‘non-parametric’ (i.e. based on a weak but hidden model) or explicitly model-based, are not data summaries.35,38 True data summaries, such as cross-tabulations and data plots, should be examined in every analysis as a check on statistical assumptions and results.34,38 For example, most epidemiological statistics depend in subtle ways on sample-size requirements that should be checked against data summaries.39 Similarly, associations, trends, or more complex patterns that emerge from model-based estimates need to be checked against true data summaries (such as tables of observed proportions) to determine the degree to which the patterns reflect observations as opposed to model assumptions.16,18,34 Again, multilevel models are not special in this regard, although they do require checking of all levels, which can be laborious. In return, however, they can provide the benefits described here and elsewhere.7,9,14,19–23,26–32,34,36,37

Complexity concerns
One cost of moving to the multilevel perspective is its greater complexity. Multilevel models demand more mental effort to understand and build. They also demand some consideration of the relative importance of different errors (see the discussion of estimation criteria below). I view these demands as beneficial, for the effort lies in connecting the model and estimation criteria to the scientific context.37 Nonetheless, there is a danger that multilevel analysis will be employed in the same automatic and careless manner that is common in ordinary regression analysis, with the same ill effects. But the potential for misuse should not be an excuse to deny the benefits of a method to careful users.

There are many situations in which the complexity of a full multilevel analysis is not needed. This occurs, for example, when only first-stage parameters are of interest and the data set is so large that all first-stage standard errors are small; here, there is no need for variance reduction. Conversely, if the data set is so small that only a few free parameters can be used in the first-level model, a full multilevel approach (with its attendant flexibility and robustness) will not be applicable, though more restrictive Bayesian analyses can still be useful for both bias and variance reduction.39 Finally, the data may exhibit results so obvious that any complex approach is pointless, such as in the classic studies of large effects (smoking and lung cancer, estrogens and endometrial cancer, vinyl chloride and angiosarcoma, etc.).

On the other hand, there are many complex problems in which multilevel modelling provides an alternative superior to the oversimplifications required by standard approaches. This advantage is especially great in studies that search for effects or interactions among many exposures (so-called ‘fishing expeditions'), in which standard methods of forcing in all variables or using mechanical variable-selection algorithms easily produce invalid inferences.27–30,34,36,37

Modifying the estimation criterion
Nearly all discussions, including the present one, focus on minimizing total expected squared error as the estimation criterion. Nonetheless, the general lessons illustrated earlier continue to hold if we employ other criteria that, like the total ESE criterion, consider errors due to bias and errors due to random variation as equally costly, and that consider larger deviations from a target worse than smaller deviations. An example of such a criterion is expected absolute error, or EAE. Under this criterion, it will still be true that we can achieve lower expected errors than conventional estimators by averaging results from different levels of a multilevel model (‘shrinkage’),7 though the form of the weight estimators will change. The same comments apply if we replace total ESE or total EAE by a weighted total; in the example, such cost weighting would be appropriate if each unit of error in the Finnish cohort were more costly than each unit of error in the other cohorts.

Multilevel estimation criteria can be modified further to recognize qualitative differences in error costs. For example, model parameters often can be divided into two groups, one containing parameters that represent effects of the exposures of interest (including any product terms involving an exposure), and the other containing parameters of no direct interest (such as intercepts and confounder coefficients). A simple way to account for this coefficient grouping is to use separate estimation criteria for the two groups of parameters,40 for example by shrinking only the estimates of interest.27,28,30,32,34,41 Further improvement can be obtained by using accuracy in estimating the exposure coefficients as the sole criterion.42,43 Though this criterion leads to modification of all the parameter estimates, it sacrifices accuracy in the confounder and intercept estimates whenever such sacrifice appears to improve accuracy in estimation of exposure coefficients.

Software for fitting multilevel models
At least four types of software options are currently available for fitting multilevel models under the ESE criterion. First, there are several commercial packages for fitting multilevel models,44 although not all are capable of handling the binomial and Poisson outcomes common in epidemiology. Second, if one is familiar with the correspondence between multilevel models and mixed models,1–5,37,41 one can translate the desired multilevel model into a generalized-linear mixed model, fit that model with a generalized-linear mixed model program such as SAS GLIMMIX, and translate the results back to multilevel-model estimates. Similar translations can be made to allow fitting of multilevel models via ridge-regression or penalized-likelihood software. Third, one can employ a specialized SAS procedure for obtaining two-stage model estimates from ordinary regression output,45 available on the World-Wide Web at http://darwin.cwru.edu/~witte/hm.html. These three options are based on approximations that appear to work well for typical epidemiological data, at least as well as the maximum-likelihood methods used to fit ordinary single-stage models.27,29,30 A fourth, more accurate option, available to those familiar with advanced hierarchical-Bayesian methodology, is to employ Markov-Chain Monte-Carlo techniques.4,5

Interval estimation
For conceptual purposes I have focused on point estimation, but in practice one needs interval estimates for the parameters of interest. The simplest 95% confidence intervals are Wald intervals, found by taking the point estimate (or some transform of it) ±1.96 estimated standard errors. The Wald method is only a large-sample approximation when dealing with non-normal data or non-linear models (as when analysing risks or rates), and has the least accurate coverage among approximate intervals. In multilevel modelling, it tends to suffer additional inaccuracy by falling more frequently to one side than the other of the true parameter. This lopsidedness of coverage is a direct consequence of the bias in the multilevel point estimator, on which the Wald interval is centered. Despite this problem, multilevel Wald 95% intervals appear to provide conservatively valid (i.e. at least 95%) average coverage for parameter ensembles in common epidemiological settings.27,29,30 For more accurate intervals, one will have to employ more accurate analytic approximations or Monte-Carlo methods.4,5
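For a single proportion, the Wald construction described above amounts to the following sketch (the counts are hypothetical):

```python
import math

# Wald 95% interval for a proportion: estimate +/- 1.96 standard errors.
a, n = 13, 40
p = a / n
se = math.sqrt(p * (1 - p) / n)       # estimated standard error of A/N
lower, upper = p - 1.96 * se, p + 1.96 * se
print(round(lower, 3), round(upper, 3))
```

In a multilevel analysis the same construction would be centred on the (biased) shrinkage estimate, which is what produces the lopsided coverage noted above.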


    Conclusion
Despite debate about its extent of applicability,46 multilevel modelling is now well established in the social sciences, where it is used for simultaneous study of relations among group-level and individual-level variables.1,3,6,46–48 The theory and examples of these applications are now found in textbooks,1,3,6 as well as in many articles in education and related fields. In what may be a promising sign, these ideas are starting to appear in the health sciences as well.49

I have here focused on a more general application of multilevel modelling: reduction of estimation error through incorporation of prior information about parameter sizes (including information about similarity or relative sizes). This focus provides a conceptual unification of seemingly disparate Bayesian and frequentist approaches to estimation. Multilevel analysis allows one to use prior information in a manner more cautious than ordinary Bayesian analysis and more thorough than ordinary frequentist analysis. Many published epidemiological applications, as well as simulation studies, have illustrated this benefit in problems involving multiple comparisons, variable selection, or sparse data.7,19,21,23,26–30,32,37,41,50 I hope that the present exposition encourages readers to examine these applications and to contribute further examples of multilevel analysis to the literature.


    Appendix 1
The ESE of the Bayes estimator in formula 2, denoted B̂ here, is

[E(B̂) - {theta}]² + var(B̂) = (1 - w)²(r - {theta})² + w²{sigma}²/N (A1)

where {sigma}² = N·var(A/N). Because the ESE of A/N is {sigma}²/N, B̂ will have lower ESE than A/N if and only if (A1) is less than {sigma}²/N, or equivalently if and only if

(r - {theta})² < (1 - w²){sigma}²/[N(1 - w)²] = (2/n + 1/N){sigma}² (A2)

Given an interval [{theta}L, {theta}U] with 0 < {theta}L < {theta}U < 1, we can find a nonempty set of pairs (n, r) with n > 0 that ensure (A2) holds for every {theta} in the interval, and we can check whether (A2) holds throughout the interval for any given pair (n, r). In the example, N = 10, n = 5, r = 0.2, and {sigma}² = {theta}(1 - {theta}), so (A2) becomes

(0.2 - {theta})² < (2/5 + 1/10){theta}(1 - {theta}) (A3)

which holds for all {theta} in the example interval ({theta}L, {theta}U) = (0.1, 0.4). Of course, n = 5, r = 0.2 is not necessarily optimal, and the optimal pair (n, r) will depend on the optimality criterion.7,51
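The claim that (A3) holds across the interval can be checked numerically:

```python
# Check (0.2 - theta)^2 < (2/5 + 1/10) * theta * (1 - theta)
# over a grid spanning the example interval (0.1, 0.4).
ok = all((0.2 - t) ** 2 < 0.5 * t * (1 - t)
         for t in [0.1 + 0.001 * i for i in range(301)])
print(ok)  # True
```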


    Appendix 2
Given K cohorts, a simple weight for the Stein estimator in formula 3 is given by

1 - ŵk = max[0, (K - 2)ukvk/K]

where

vk = ŝk(1 - ŝk)/Nk (A4)

is the estimated variance of Ak/Nk (with ŝk denoting the Stein estimate from formula 3),

uk = 1/(vk + {tau}²), (A5)

and

{tau}² = {Sigma}kukdk/{Sigma}kuk (A6)

is the estimated average squared deviation of the {theta}k about their prior means, where19

dk = (Ak/Nk - rk)² - vk (A7)

Because ŝk appears in the definition of vk, {tau}² appears in the definition of uk, and uk appears in the definition of {tau}², computation of these quantities requires iteration among formulas 3 and A4-A7. Because 1 - ŵk is the weight placed on rk, it may be seen that shrinkage increases as the variance estimate vk for Ak/Nk increases.
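This iteration can be sketched along the same lines as the EB computation; the prior guesses rk, the starting value for {tau}², and the clamping of the weights are my own illustrative choices, not the article's exact routine:

```python
# Sketch of the Stein-weight iteration (formulas 3 and A4-A7).
A = [4, 1, 3, 5]
N = [10, 10, 10, 10]
r = [0.2, 0.2, 0.2, 0.2]            # hypothetical prior guesses rk
K = len(A)
p = [a / n for a, n in zip(A, N)]   # conventional estimates Ak/Nk

s = p[:]                            # Stein estimates, initialized at Ak/Nk
tau2 = 0.01
for _ in range(200):
    v = [sk * (1 - sk) / n for sk, n in zip(s, N)]                   # A4
    d = [(pk - rk) ** 2 - vk for pk, rk, vk in zip(p, r, v)]         # A7
    u = [1 / (vk + tau2) for vk in v]                                # A5
    tau2 = max(sum(uk * dk for uk, dk in zip(u, d)) / sum(u), 1e-8)  # A6
    w = [1 - min(1.0, max(0.0, (K - 2) * uk * vk / K))
         for uk, vk in zip(u, v)]
    s = [wk * pk + (1 - wk) * rk for wk, pk, rk in zip(w, p, r)]     # formula 3

print([round(sk, 2) for sk in s])   # each pulled from Ak/Nk toward rk
```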

For the EB estimator in formula 4, the quantities ŵk, vk, and dk are given by

1 - ŵk = max[0, (K - 3)ukvk/(K - 1)] (A8)

vk = êk(1 - êk)/Nk (A9)

and

dk = K(Ak/Nk - A+/N+)²/(K - 1) - vk (A10)

which require iteration among formulas 4, A5, A6, and A8-A10.26

If the denominators Nk represent person-time rather than person-counts, formula A4 becomes vk = ŝk/Nk and A9 becomes vk = êk/Nk.


    Acknowledgments
 
This paper was initially drafted for a workshop on multilevel modelling sponsored by the International Society for Environmental and Occupational Health, Helsinki, Finland, 1998. Revisions greatly benefitted from the comments of Peter Morfeld, Charles Poole, and the reviewers.


    References
1 Bryk AS, Raudenbush SW. Hierarchical Linear Models. Newbury Park, CA: Sage, 1992.

2 Searle SR, Casella G, McCulloch CE. Variance Components. New York: Wiley, 1992.

3 Goldstein H. Multilevel Statistical Models, 2nd edn. London: Edward Arnold, 1995.

4 Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. New York: Chapman and Hall, 1995.

5 Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. New York: Chapman and Hall, 1996.

6 Kreft IGG, de Leeuw J. Introducing Multilevel Modelling. London: Sage, 1998.

7 Efron B. Biased versus unbiased estimation. Adv Math 1975;16:259–77.

8 Leamer EE. Specification Searches. New York: Wiley, 1978.

9 Good IJ. Good Thinking. Minneapolis: University of Minnesota Press, 1983, Ch. 9.

10 Meeden G. Estimation when using a statistic that is not sufficient. Am Stat 1987;41:135–36.

11 DeFinetti B. The Theory of Probability. Vols 1 and 2. New York: Wiley, 1974.

12 Howson C, Urbach P. Scientific Reasoning: The Bayesian Approach, 2nd edn. LaSalle, IL: Open Court, 1993.

13 Greenland S. Probability logic and probabilistic induction. Epidemiology 1998;9:322–32.

14 Copas JB. Regression, prediction, and shrinkage. J Roy Stat Soc Ser B 1983;45:311–54.

15 Good IJ. Hierarchical Bayesian and empirical Bayesian methods (letter). Am Stat 1987;41:92.

16 Box GEP. An apology for ecumenism in statistics. In: Box GEP, Leonard T, Wu CF (eds). Scientific Inference, Data Analysis, and Robustness. New York: Academic Press, 1983.

17 Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol 1986;123:392–402.

18 Hill JR. A general framework for model-based statistics. Biometrika 1990;77:115–26.

19 Efron B, Morris CN. Stein's estimation rule and its competitors—an empirical Bayes approach. J Am Stat Assoc 1973;68:117–30.

20 Brandwein AC, Strawderman WE. Stein estimation: the spherically symmetric case. Stat Sci 1990;5:356–69.

21 Efron B, Morris CN. Stein's paradox in statistics. Sci Am 1977;236:119–27.

22 Casella G. An introduction to empirical-Bayes data analysis. Am Stat 1985;39:83–87.

23 Greenland S, Poole C. Empirical Bayes and semi-Bayes approaches to occupational and environmental hazard surveillance. Arch Environ Health 1994;49:9–16.

24 Draper D, Hodges JS, Mallows CL, Pregibon D. Exchangeability and data analysis. J Roy Stat Soc Ser A 1993;156:9–37.

25 Greenland S, Draper D. Exchangeability. In: The Encyclopedia of Biostatistics. New York: Wiley, 1998.

26 Efron B, Morris CN. Data analysis using Stein's estimator and its generalizations. J Am Stat Assoc 1975;70:311–19.

27 Greenland S. Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression. Stat Med 1993;12:717–36.

28 Witte JS, Greenland S, Haile RW, Bird CL. Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 1994;5:612–21.

29 Witte JS, Greenland S. Simulation study of hierarchical regression. Stat Med 1996;15:1161–70.

30 Greenland S. Second-stage least squares versus penalized quasi-likelihood for fitting hierarchical models in epidemiologic analyses. Stat Med 1997;16:515–26.

31 Deeley JJ, Lindley DV. Bayes empirical Bayes. J Am Stat Assoc 1981;76:833–41.

32 Greenland S. Hierarchical regression for epidemiologic analyses of multiple exposures. Environ Health Perspect 1994;102(Suppl.8):33–39.

33 Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. New York: Chapman and Hall, 1994, sec. 3.8.3.

34 Greenland S. Introduction to regression modeling. In: Rothman KJ, Greenland S (eds). Modern Epidemiology, 2nd edn. Philadelphia: Lippincott-Raven, 1998, Ch. 21.

35 Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:421–29.

36 Greenland S. Multilevel modeling and model averaging. Scand J Work Environ Health 2000 (in press).

37 Greenland S. When should epidemiologic regressions use random coefficients? Technical Report, UCLA Department of Epidemiology, September 1999.

38 Greenland S. Summarization, smoothing, and inference. Scand J Soc Med 1993;21:227–32.

39 Greenland S. Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators. Biostatistics 2000;1 (in press).

40 Lindsay BG. Using empirical partially Bayes inference for increased efficiency. Ann Stat 1985;13:914–31.

41 Greenland S. A semi-Bayes approach to the analysis of correlated multiple associations, with an application to an occupational cancer-mortality study. Stat Med 1992;11:219–30.

42 Kalish L. Reducing mean squared error in the analysis of pair-matched case-control studies. Biometrics 1990;46:493–99.

43 Greenland S. Reducing mean squared error in the analysis of stratified epidemiologic studies. Biometrics 1991;47:773–75.

44 Kreft IGG, de Leeuw J, van der Leeden R. Review of five multilevel analysis programs: BMDP5V, GENMOD, HLM, ML3, VARCL. Am Stat 1994;48:324–35.

45 Witte JS, Greenland S, Kim L-L. Software for hierarchical modeling of epidemiologic data. Epidemiology 1998;9:563–66.

46 Kreft IGG (ed.). Hierarchical linear models: problems and prospects. J Educ Behav Stat 1995;20:109–240.

47 Wong GY, Mason WM. The hierarchical logistic regression model for multilevel analysis. J Am Stat Assoc 1985;80:913–24.

48 Raudenbush SW, Bryk AS. Empirical-Bayes meta-analysis. J Educ Stat 1985;10:75–98.

49 Von Korff M, Koepsell T, Curry S, Diehr P. Multilevel analysis in epidemiologic research on health behaviors and outcomes. Am J Epidemiol 1992;135:1077–82.

50 Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 1991;2:244–51.

51 Cox DR, Hinkley DV. Theoretical Statistics. New York: Chapman and Hall, 1974, Ch. 11.