From the National Immunization Program, Centers for Disease Control and Prevention, Atlanta, GA.
Received for publication April 16, 2002; accepted for publication July 19, 2002.
ABSTRACT
epidemiologic methods; ethnic groups; hypothesis testing; immunization; statistics; vaccination; vaccines
Abbreviations: DTP, diphtheria and tetanus toxoids and pertussis; NIS, National Immunization Survey; TOST, two one-sided test.
INTRODUCTION
To determine whether equivalence has been achieved, the question "Has a disparity been eliminated?" is often asked. To answer this question, a test for difference between groups may be conducted. If the test reveals a statistically significant difference, the answer to the question is "no." Of course, the public health importance of the difference must be assessed; if the goal of "no difference" remains a priority, limited public health resources might be directed toward determining and correcting the cause of a disparity that no longer threatens public health.
The first step in avoiding this problem is to consider how the question about disparity is being framed. Instead of asking whether disparity, or difference, in vaccination coverage among population groups has been eliminated, we might ask, "Has practical equivalence been achieved?"
No sample survey can be used to prove that a difference has been eliminated, for two reasons. First, even if no significant difference were found in a given sample, increasing the power of the study would probably reveal some difference, however small. Second, in the unlikely event that a difference in vaccination coverage were totally eliminated, proving it would be impossible without the use of an error-free census, not a sample. In any sample, a difference smaller than the sample's margin of error will be undetectable.
Questioning equivalence, instead of looking for a lack of difference, is more than just semantics. Questioning equivalence reveals the most appropriate way to use a sample survey to assess disparities in vaccination coverage, and it is operationally different from looking for a difference. Equivalence testing, a process mandated by the Food and Drug Administration to determine whether a generic drug performs as well as the brand-name drug (within a previously established, tolerable difference), may address the limitations of testing for public health disparities in the traditional manner (2-5).
DIFFERENCE TESTS VERSUS EQUIVALENCE TESTS
However, if the analysis does not reveal a statistically significant difference between groups, the null hypothesis must stand; it cannot be rejected. But failing to find a difference is not proof of similarity. As was noted above, failing to find a difference may just indicate a sample size too small to detect one.
Equivalence tests
In equivalence testing, the null hypothesis is formulated so that rejecting it supports a conclusion of similarity; it states that the groups differ by at least a tolerably small amount, $\delta$. The alternative hypothesis is that the groups differ by less than $\delta$; that is, they are similar. Thus, if one rejects the null hypothesis, one may correctly state that the groups are similar. A more complete discussion of equivalence testing appears elsewhere (8-10).
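Stated formally, the two formulations can be contrasted as follows. This is a sketch in notation introduced here for illustration: $p_1$ and $p_2$ are assumed to denote the true coverage proportions in the two groups, and $\delta > 0$ is the tolerable difference.

```latex
\begin{aligned}
\text{Difference test:}\quad  & H_0: p_1 - p_2 = 0
  \quad\text{versus}\quad H_A: p_1 - p_2 \neq 0,\\
\text{Equivalence test:}\quad & H_0: |p_1 - p_2| \geq \delta
  \quad\text{versus}\quad H_A: |p_1 - p_2| < \delta.
\end{aligned}
```

The roles of "same" and "different" are swapped between the two rows, which is why rejecting the null hypothesis in the second row supports a statement of similarity.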
In this paper, we demonstrate equivalence testing using data from the National Immunization Survey (NIS). As recently as 10 years ago, disparities in childhood vaccination coverage contributed to a resurgence of measles in the United States. In particular, coverage was low among minority children in inner cities (11). Since then, gaps have narrowed (12). Using data from 2000, we show that racial/ethnic groups which differ in vaccination coverage by an amount that is "statistically significant" under the traditional approach are, for some comparisons, equivalent at a level of precision permitted by the survey. Defining tolerable levels of difference is a necessary precursor to applying equivalence testing to assess equity. Below, we discuss implications for defining tolerable levels of difference in this area of public health.
METHODS
The weighting estimation methodology of the NIS adjusts vaccination coverage estimates for household nonresponse, households with multiple telephone lines or without telephones, and provider nonresponse. NIS methods have been described in detail elsewhere (13-15).
Using SUDAAN, version 7.5 (16), we calculated point estimates of coverage and their standard deviations by race/ethnicity for selected vaccines; these statistics formed the basis for both difference testing and equivalence testing. We conducted pairwise comparisons of immunization coverage between Whites, the largest group and (usually) the group with the greatest immunization coverage, and each of the largest minority racial/ethnic groups: Blacks, Hispanics, and Asians. Other groups, such as American Indians and Pacific Islanders, were not considered because of the small numbers of children in those groups in the NIS.
Statistical analysis
Difference testing
Traditional tests of differences were conducted comparing coverage among minority groups with that among Whites. In statistical tests, the probability of incorrectly rejecting the null hypothesis (that is, the $\alpha$ level) is often set at 0.05. Therefore, we constructed 95 percent confidence intervals ($(1 - \alpha) \times 100$ percent = 95 percent) for differences in coverage between two groups. A difference is statistically significant if the 95 percent confidence interval for the difference in coverage does not include 0; that is, if

$$(\hat{p}_{\text{White}} - \hat{p}_{\text{minority}}) \pm 1.96\,\left(s_{\text{White}}^{2} + s_{\text{minority}}^{2}\right)^{0.5}$$

does not include 0, where $\hat{p}$ denotes the point estimate of coverage and $s$ denotes its standard deviation.
In difference testing, the null hypothesis is "no difference." Thus, $\alpha$ is the probability of type I error (concluding that a difference exists when, in fact, none does). Reducing the chance of this error increases $\beta$, the chance of type II error (the probability that the populations' coverages will be found not to differ when the true difference is not zero).
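The difference test described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the numeric inputs are hypothetical, the inputs are not NIS estimates, and the sketch assumes that the coverage estimates and their standard deviations have already been produced by survey software.

```python
import math


def difference_test(cov_a, sd_a, cov_b, sd_b, z=1.96):
    """Confidence interval for cov_a - cov_b and whether it excludes 0.

    z=1.96 gives the 95 percent interval used for alpha = 0.05.
    """
    diff = cov_a - cov_b
    half_width = z * math.sqrt(sd_a ** 2 + sd_b ** 2)
    lower, upper = diff - half_width, diff + half_width
    # Statistically significant when 0 lies outside the interval.
    significant = lower > 0.0 or upper < 0.0
    return lower, upper, significant


# Hypothetical inputs for illustration only; these are not NIS estimates.
lower, upper, significant = difference_test(0.80, 0.005, 0.77, 0.012)
print(f"95% CI: ({lower:.4f}, {upper:.4f}); significant: {significant}")
```

In this made-up case the interval lies entirely above 0, so the difference would be declared statistically significant.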
Equivalence testing
The two one-sided test (TOST) procedure, the most basic form of equivalence testing used to compare two groups, has been studied extensively (3, 5, 7). To test for equivalence, confidence intervals for the difference between two groups must be defined, just as with difference testing. In a TOST analysis, a $(1 - 2\alpha) \times 100$ percent confidence interval is constructed. A careful explanation of why one defines a $(1 - 2\alpha) \times 100$ percent confidence interval, not $(1 - \alpha) \times 100$ percent, appears elsewhere (5). In this TOST analysis, as in the difference analysis, we select $\alpha = 0.05$. Thus, in the TOST analysis, we constructed 90 percent confidence intervals ($(1 - 2 \times 0.05) \times 100$ percent = 90 percent). In a TOST analysis, we reject the null hypothesis that the groups differ by at least $\delta$, and declare two groups similar at the $\alpha = 0.05$ level, if the 90 percent confidence interval for the difference in coverage is completely contained in the interval with endpoints $-\delta$ and $+\delta$. That is, the groups are similar if

$$(\hat{p}_{\text{White}} - \hat{p}_{\text{minority}}) \pm 1.645\,\left(s_{\text{White}}^{2} + s_{\text{minority}}^{2}\right)^{0.5}$$

is completely contained in the interval with endpoints $-\delta$ and $+\delta$, where $\hat{p}$ denotes the point estimate of coverage and $s$ denotes its standard deviation.
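The TOST decision rule described above can be sketched as follows. The function name and all numeric inputs are hypothetical and chosen only for illustration (they are not NIS estimates). "Completely contained" is interpreted here as strict containment, which is an assumption of this sketch.

```python
import math


def tost_equivalent(cov_a, sd_a, cov_b, sd_b, delta, z=1.645):
    """90 percent CI for cov_a - cov_b and whether it lies within (-delta, +delta).

    z=1.645 corresponds to the (1 - 2*alpha)*100 percent interval for alpha = 0.05.
    """
    diff = cov_a - cov_b
    half_width = z * math.sqrt(sd_a ** 2 + sd_b ** 2)
    lower, upper = diff - half_width, diff + half_width
    # Declare similarity only if the whole interval is inside the tolerance band.
    equivalent = -delta < lower and upper < delta
    return lower, upper, equivalent


# Hypothetical inputs for illustration only; these are not NIS estimates.
lower, upper, equivalent = tost_equivalent(0.80, 0.005, 0.78, 0.012, delta=0.05)
print(f"90% CI: ({lower:.4f}, {upper:.4f}); equivalent: {equivalent}")
```

Note that the function returns `False` both when the groups truly differ by the tolerance or more and when the estimate is too imprecise to tell; the two cases cannot be distinguished from this test alone.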
In equivalence testing, the null hypothesis is "a difference of $\delta$ or more." Thus, $\alpha$ is the probability of concluding that the populations differ by less than $\delta$ when, in fact, the difference is $\delta$ or more. Similarly, $\beta$ is the probability of failing to conclude that the populations' coverages differ by less than $\delta$ when the true difference is, in fact, less than $\delta$.
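The name of the procedure reflects its construction as two one-sided tests, each carried out at level $\alpha$. A sketch of that construction under a normal approximation follows. The notation is assumed here for illustration: $d$ is the true difference in coverage, $\hat{d}$ its estimate, $s_{\hat{d}}$ the standard deviation of the estimate, and $z_{1-\alpha}$ the standard normal quantile.

```latex
\begin{aligned}
H_{01}&: d \le -\delta \quad\text{versus}\quad H_{A1}: d > -\delta,
  &\quad \text{reject if } \frac{\hat{d} + \delta}{s_{\hat{d}}} > z_{1-\alpha},\\
H_{02}&: d \ge +\delta \quad\text{versus}\quad H_{A2}: d < +\delta,
  &\quad \text{reject if } \frac{\delta - \hat{d}}{s_{\hat{d}}} > z_{1-\alpha}.
\end{aligned}
```

Both hypotheses are rejected exactly when $\hat{d} - z_{1-\alpha}\, s_{\hat{d}} > -\delta$ and $\hat{d} + z_{1-\alpha}\, s_{\hat{d}} < +\delta$, which is the condition that the $(1 - 2\alpha) \times 100$ percent confidence interval lies inside the interval with endpoints $-\delta$ and $+\delta$. With $\alpha = 0.05$, $z_{1-\alpha} \approx 1.645$.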
For this analysis, we demonstrate equivalence testing using $\delta = 0.05$. This value was selected for this demonstration because NIS data for 2000 can detect equivalences at $\delta = 0.05$ between Whites and the other races/ethnicities and vaccine series presented here. For some vaccines in some contexts, this value might have public health relevance. However, for other vaccines in other contexts, an appropriate value might be lower or higher. As we discuss below, if researchers apply the equivalence testing method in studies that conclude with public health recommendations, the critical choice of $\delta$ should depend on public health issues and will vary with the context.
RESULTS
Comparisons of test results
All four possible combinations of results arise in this comparison of difference testing and equivalence testing. Both difference tests and equivalence tests must be interpreted carefully and correctly.
Positive for difference and equivalence
For some comparisons, the difference test shows difference, but the equivalence test shows similarity at $\delta = 0.05$. This holds in seven of the 21 tests shown in table 1 and figure 1. For example, for ≥3 doses of Haemophilus influenzae type b vaccine and ≥3 doses of hepatitis B vaccine, difference testing finds significant differences between Blacks and Whites and between Hispanics and Whites. However, in these instances, the equivalence test allows us to conclude that the groups are similar at $\delta = 0.05$.
Negative for difference and positive for equivalence
When comparing Blacks and Whites for varicella vaccine coverage, difference testing does not show a difference, but we cannot conclude from this test that the groups are identical. The test merely shows that a difference has not been proven. If the sample size were large enough, a statistically significant difference would probably emerge. However, for the same comparison, equivalence testing permits us to conclude that the groups are similar regarding varicella vaccination, within the tolerance of the test.
Positive for difference and negative for equivalence
For coverage with 4 doses of DTP, the results from difference testing indicate that coverage is significantly different among Blacks versus Whites and among Hispanics versus Whites. However, the tests of equivalence do not reveal similarity. (Remember that in equivalence testing, failure to find similarity is not proof of difference.)
Negative for difference and equivalence
For measles-mumps-rubella vaccine coverage among Asians versus Whites, the difference test did not show difference, and similarity could not correctly be concluded from the difference test. The equivalence test did not show similarity and could not prove difference. In this case, the evidence was not strong enough to reject either null hypothesis: neither the hypothesis of no difference nor the hypothesis of a difference of at least the tolerable amount.
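The four combinations above can be reproduced with a small sketch that runs both tests on the same estimated difference. Everything here is hypothetical: the function names, the differences, the standard deviations, and the tolerance are made up for illustration and are not taken from table 1.

```python
def compare_tests(diff, sd_diff, delta, z_diff=1.96, z_tost=1.645):
    """Run both tests on one estimated difference and its standard deviation."""
    # Difference test: the 95 percent CI excludes 0.
    differs = abs(diff) > z_diff * sd_diff
    # Equivalence test: the 90 percent CI lies inside (-delta, +delta).
    similar = abs(diff) + z_tost * sd_diff < delta
    return differs, similar


def describe(differs, similar):
    """Label the outcome using the four headings of this section."""
    if differs and similar:
        return "positive for difference and equivalence"
    if similar:
        return "negative for difference and positive for equivalence"
    if differs:
        return "positive for difference and negative for equivalence"
    return "negative for difference and equivalence"


# Hypothetical (difference, standard deviation) pairs; not NIS results.
cases = [(0.03, 0.008), (0.01, 0.010), (0.06, 0.010), (0.02, 0.025)]
labels = [describe(*compare_tests(d, s, delta=0.05)) for d, s in cases]
for (d, s), label in zip(cases, labels):
    print(f"diff={d:.2f}, sd={s:.3f}: {label}")
```

A small, precisely estimated difference lands in the first category, while an imprecisely estimated one lands in the last, regardless of its true size.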
DISCUSSION
Difference testing is rigorous and is appropriate for many applications (for example, to quantify disparities that adversely affect public health). However, statistically significant differences are not necessarily meaningful from a public health standpoint. Moreover, difference testing allows no conclusions about similarity between groups (we implicitly assume that only a test resulting in a decision to declare that the groups differ or that they cannot be proven to differ is being considered). Difference testing should be reserved for situations where the appropriate research question is, "Is there a disparity between groups that can be addressed with public health interventions?"
On the other hand, when interventions have succeeded in reducing disparities between groups, the appropriate research question is, "Is there practical equivalence among these groups?" In equivalence testing, as in difference testing, the method for showing that groups are similar is rigorous. With this method, the researcher must specify an acceptable difference between groups, $\delta$. We cannot reject the null hypothesis of a difference greater than $\delta$ unless the statistical evidence is strong. The sample size might limit the values of $\delta$ that are practical; if both $\delta$ and the sample size are small, it is unlikely that equivalence can be established.
For results of equivalence testing to be useful in assessing immunization coverage, $\delta$ must be carefully selected. For example, some small differences in coverage between groups may not increase the risk of morbidity and mortality from vaccine-preventable disease. Determining the magnitude of a tolerable difference will depend on many factors. For example, a tolerable difference for one vaccine might not be tolerable for another, depending on the epidemiology of the disease in question. A $\delta$ considered acceptable on a national scale might be unacceptable on a local scale.
As with difference testing, equivalence testing requires an adequate sample size. The NIS samples smaller numbers of minorities than of Whites, which can limit the ability of the data to detect similarity at small values of $\delta$. Oversampling minorities or combining data from multiple years could increase the sensitivity. Surveys with smaller sample sizes than that of the NIS might or might not provide sufficient discriminatory power to support equivalence testing with reasonable values of $\delta$.
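The interplay between sample size and the tolerance can be illustrated with a rough sketch. It assumes simple random sampling, so that the standard deviation of a coverage estimate is $\sqrt{p(1-p)/n}$. That assumption does not hold for a complex, weighted survey such as the NIS, where design-based standard deviations from survey software would be needed. The coverage values and sample sizes below are hypothetical.

```python
import math


def srs_sd(p, n):
    """Standard deviation of a proportion under simple random sampling (assumed)."""
    return math.sqrt(p * (1.0 - p) / n)


def smallest_delta(p_a, n_a, p_b, n_b, z=1.645):
    """Threshold that the tolerance must exceed for the 90 percent CI to fit inside it."""
    sd_diff = math.sqrt(srs_sd(p_a, n_a) ** 2 + srs_sd(p_b, n_b) ** 2)
    return abs(p_a - p_b) + z * sd_diff


# Hypothetical coverage (0.80 versus 0.78) and sample sizes; not NIS figures.
thresholds = {
    n_minority: smallest_delta(0.80, 15000, 0.78, n_minority)
    for n_minority in (500, 2000, 8000)
}
for n_minority, threshold in thresholds.items():
    print(f"n_minority={n_minority}: equivalence requires delta > {threshold:.4f}")
```

Under these assumptions, enlarging the smaller group lowers the threshold, which is the sense in which oversampling or pooling years could increase sensitivity.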
Reducing disparities in immunization coverage should remain a top priority. However, even when health interventions succeed, differences will persist, to some small degree. Establishing the goal of practical equivalence, defining a tolerable difference that does not threaten public health, and using equivalence testing to prove that equity has been achieved could permit us to focus limited public health resources on identifying and addressing those disparities that might threaten public health.