Assessing Equivalence: An Alternative to the Use of Difference Tests for Measuring Disparities in Vaccination Coverage

Lawrence E. Barker, Elizabeth T. Luman, Mary M. McCauley and Susan Y. Chu

From the National Immunization Program, Centers for Disease Control and Prevention, Atlanta, GA.

Received for publication April 16, 2002; accepted for publication July 19, 2002.


    ABSTRACT
 
Eliminating health disparities in vaccination coverage among various groups is a cornerstone of public health policy. However, the statistical tests traditionally used cannot prove that a state of no difference between groups exists. Instead of asking, "Has a disparity (or difference) in immunization coverage among population groups been eliminated?" one can ask, "Has practical equivalence been achieved?" A method called equivalence testing can show that the difference between groups is smaller than a tolerably small amount. This paper demonstrates the method and introduces public health considerations that have an impact on defining tolerable levels of difference. Using data from the 2000 National Immunization Survey, the authors tested for statistically significant differences in rates of vaccination coverage between Whites and members of other racial/ethnic groups and for equivalencies among Whites and these same groups. For some minority groups and some vaccines, coverage was statistically significantly lower than was seen among Whites; however, for some of these groups and vaccines, equivalence testing revealed practical equivalence. To use equivalence testing to assess whether a disparity remains a threat to public health, researchers must understand when to use the method, how to establish assumptions about tolerably small differences, and how to interpret the test results.

epidemiologic methods; ethnic groups; hypothesis testing; immunization; statistics; vaccination; vaccines

Abbreviations: DTP, diphtheria and tetanus toxoids and pertussis; NIS, National Immunization Survey; TOST, two one-sided test.


    INTRODUCTION
 
Eliminating differences in immunization coverage that occur by gender, race/ethnicity, income, geographic location, and other factors is a cornerstone of public health policy in the United States (1). Tremendous progress has been made. Evidence comes from survey research, which played a crucial role in identifying immunization disparities, quantifying them, and charting the progress of efforts to eliminate them. Today, survey research is used to investigate whether equivalence has been achieved among groups. This task is more challenging than many people realize.

To determine whether equivalence has been achieved, the question "Has a disparity been eliminated?" is often asked. To answer this question, a test for difference between groups may be conducted. If the test reveals a statistically significant difference, the answer to the question is "no." Of course, the public health importance of the difference must be assessed; if the goal of "no difference" remains a priority, limited public health resources might be directed toward determining and correcting the cause of a disparity that no longer threatens public health.

The first step in avoiding this problem is to consider how the question about disparity is being framed. Instead of asking whether disparity—or difference—in vaccination coverage among population groups has been eliminated, we might instead ask, "Has practical equivalence been achieved?"

No sample survey can be used to prove that a difference has been eliminated, for two reasons. First, even if no significant difference was found in a given sample, increasing the power of the study would probably reveal some difference, however small. Second, in the unlikely event that a difference in vaccination coverage were totally eliminated, proving it would be impossible without the use of an error-free census, not a sample. In any sample, a difference smaller than the sample’s margin of error will be undetectable.

Questioning equivalence—instead of looking for a lack of difference—is more than just semantics. Questioning equivalence reveals the most appropriate way to use a sample survey to assess disparities in vaccination coverage, and it is operationally different from looking for a difference. Equivalence testing, a process mandated by the Food and Drug Administration to determine whether a generic drug performs as well as the brand-name drug (within a previously established, tolerable difference), may address the limitations of testing for public health disparities in the traditional manner (2–5).


    DIFFERENCE TESTS VERSUS EQUIVALENCE TESTS
 
Limitations of difference tests
Difference tests have been widely used to answer questions about whether a disparity has been successfully addressed; however, these tests are subject to well-known limitations, and the results are sometimes misinterpreted (4, 6, 7). In tests of difference, analysts test the null hypothesis that the groups under consideration do not differ. If the analysis reveals a statistically significant difference between groups, the null hypothesis of no difference is rejected. Here, analysts sometimes call attention to a difference, even though it may be too small to be meaningful. This problem plagues large samples, in which differences too small to be of public health concern are often statistically significant.

However, if the analysis does not reveal a statistically significant difference between groups, the null hypothesis must stand; it cannot be rejected. But failing to find a difference is not proof of similarity. As was noted above, failing to find a difference may just indicate that the sample was too small to detect one.

Equivalence tests
In equivalence testing, the null hypothesis is formulated so that the statistical test is proof of similarity; it states that the groups differ by more than a tolerably small amount, Δ. The alternative hypothesis is that the groups differ by less than Δ; that is, they are similar. Thus, if one rejects the null hypothesis, one may correctly state that the groups are similar. A more complete discussion of equivalence testing appears elsewhere (8–10).

In this paper, we demonstrate equivalence testing using data from the National Immunization Survey (NIS). As recently as 10 years ago, disparities in childhood vaccination coverage contributed to a resurgence of measles in the United States. In particular, coverage was low among minority children in inner cities (11). Since then, gaps have narrowed (12). Using data from 2000, we show that racial/ethnic groups which differ in vaccination coverage by an amount that is "statistically significant" under the traditional approach are, for some comparisons, equivalent at a level of precision permitted by the survey. Defining tolerable levels of difference is a necessary precursor to applying equivalence testing to assess equity. Below, we discuss implications for defining tolerable levels of difference in this area of public health.


    METHODS
 
The National Immunization Survey
The NIS is conducted annually by the Centers for Disease Control and Prevention to obtain estimates of vaccination coverage rates for noninstitutionalized US children aged 19–35 months. The NIS is a random-digit-dialing survey of households with age-eligible children, followed by a mail survey of the children’s vaccination providers to validate vaccination information. In general, analysis of NIS data is limited to children whose providers respond.

The weighting estimation methodology of the NIS adjusts vaccination coverage estimates for household nonresponse, households with multiple telephone lines or without telephones, and provider nonresponse. NIS methods have been described in detail elsewhere (13–15).

Using SUDAAN, version 7.5 (16), we calculated point estimates of coverage and their standard deviations by race/ethnicity for selected vaccines; these statistics formed the basis for both difference testing and equivalence testing. We conducted pairwise comparisons of immunization coverage among Whites, the largest group and (usually) the group with the greatest immunization coverage, and the largest minority racial/ethnic groups: Blacks, Hispanics, and Asians. Other groups, such as American Indians and Pacific Islanders, were not considered because of small numbers of children in those groups in the NIS.

Statistical analysis
Difference testing
Traditional tests of differences were conducted comparing coverage among minority groups with that among Whites. In statistical tests, the probability of incorrectly rejecting the null hypothesis (that is, the α level) is often set at 0.05. Therefore, we constructed 95 percent confidence intervals ((1 – α)100 percent = 95 percent) for differences in coverage between two groups. A difference is statistically significant if the 95 percent confidence interval for the difference in coverage does not include 0; that is, if $(\hat{p}_{\text{white}} - \hat{p}_{\text{minority}}) \pm 1.96\,(s_{\text{white}}^2 + s_{\text{minority}}^2)^{0.5}$ does not include 0, where $\hat{p}$ denotes the point estimate of coverage and $s$ its standard deviation.
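As an illustration, the confidence-interval rule above can be sketched in Python. This is a minimal sketch, not the SUDAAN analysis used in the paper; the function names and the numeric inputs are hypothetical and are not taken from table 1.

```python
import math

def difference_ci(p_white, sd_white, p_minority, sd_minority, z=1.96):
    """Return (lower, upper) confidence limits for p_white - p_minority.

    z = 1.96 gives the 95 percent interval used for difference testing.
    """
    diff = p_white - p_minority
    half_width = z * math.sqrt(sd_white**2 + sd_minority**2)
    return diff - half_width, diff + half_width

def differs(p_white, sd_white, p_minority, sd_minority):
    """True if the 95 percent interval for the difference excludes 0."""
    lower, upper = difference_ci(p_white, sd_white, p_minority, sd_minority)
    return lower > 0 or upper < 0

# Hypothetical coverage estimates (as proportions) and standard deviations.
print(difference_ci(0.95, 0.005, 0.92, 0.010))
print(differs(0.95, 0.005, 0.92, 0.010))
```

With these made-up inputs the interval lies entirely above 0, so the difference would be declared statistically significant.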

In difference testing, the null hypothesis is "no difference." Thus, α is the probability of type I error (concluding that a difference exists when, in fact, none does). Reducing the chance of this error increases β, the chance of type II error (the probability that the populations’ coverages will be found not to differ when the true difference is not zero).

Equivalence testing
The two one-sided test (TOST) procedure, the most basic form of equivalence testing used to compare two groups, has been studied extensively (3, 5, 7). To test for equivalence, confidence intervals for the difference between two groups must be defined, just as with difference testing. In a TOST analysis, a (1 – 2α)100 percent confidence interval is constructed. A careful explanation of why one defines a (1 – 2α)100 percent confidence interval, not (1 – α)100 percent, appears elsewhere (5). In this TOST analysis, as in the difference analysis, we select α = 0.05. Thus, in the TOST analysis, we constructed 90 percent confidence intervals ((1 – 2 × 0.05)100 percent = 90 percent). In a TOST analysis, we reject the null hypothesis that the groups differ by at least Δ and declare two groups similar at the α = 0.05 level, if the 90 percent confidence interval for the difference in coverage is completely contained in the interval with endpoints –Δ and +Δ. That is, the groups are similar if $(\hat{p}_{\text{white}} - \hat{p}_{\text{minority}}) \pm 1.645\,(s_{\text{white}}^2 + s_{\text{minority}}^2)^{0.5}$ is completely contained in the interval with endpoints –Δ and +Δ, where $\hat{p}$ denotes the point estimate of coverage and $s$ its standard deviation.
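The TOST decision rule just described can be sketched in a few lines of Python. This is an illustrative sketch under the normal approximation stated above; the function name `tost_equivalent` and the input values are hypothetical.

```python
import math

def tost_equivalent(p_white, sd_white, p_minority, sd_minority,
                    delta=0.05, z=1.645):
    """True if the 90 percent interval for the difference in coverage
    lies completely inside (-delta, +delta).

    z = 1.645 corresponds to a (1 - 2*alpha) interval with alpha = 0.05.
    """
    diff = p_white - p_minority
    half_width = z * math.sqrt(sd_white**2 + sd_minority**2)
    lower, upper = diff - half_width, diff + half_width
    return -delta < lower and upper < delta

# Hypothetical estimates: a 3-point gap with small standard deviations.
print(tost_equivalent(0.95, 0.005, 0.92, 0.010))
# Hypothetical estimates: a 5-point gap cannot fit inside +/- 5 points.
print(tost_equivalent(0.95, 0.005, 0.90, 0.010))
```

The first call declares similarity at Δ = 0.05 and the second does not, because a point estimate at the boundary leaves no room for the interval's half-width.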

In equivalence testing, the null hypothesis is "a difference of Δ or more." Thus, α is the probability of concluding that the populations differ by less than Δ when, in fact, the difference is Δ or more. Similarly, β is the probability that the populations’ coverages will be found to differ by at least Δ when the true difference is less than Δ.
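For readers who prefer symbols, the hypotheses can be written as follows. The notation $p_W$ and $p_M$ for the true coverage among Whites and among a minority group is an assumption introduced here for illustration; it does not appear in the original text.

```latex
\begin{aligned}
H_0 &: \lvert p_W - p_M \rvert \ge \Delta
  && \text{(the groups differ by } \Delta \text{ or more)} \\
H_1 &: \lvert p_W - p_M \rvert < \Delta
  && \text{(the groups are similar)} \\[4pt]
\text{TOST splits } H_0 \text{ into } \quad
H_{01} &: p_W - p_M \le -\Delta
  \quad \text{and} \quad
H_{02} : p_W - p_M \ge +\Delta ,
\end{aligned}
```

Each one-sided hypothesis is tested at level α, and similarity is declared only if both are rejected. This is why a (1 – 2α)100 percent confidence interval is compared with the interval from –Δ to +Δ.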

For this analysis, we demonstrate equivalence testing using Δ = 0.05. This value was selected for this demonstration because NIS data for 2000 can detect equivalences at Δ = 0.05 between Whites and the other races/ethnicities and vaccine series presented here. For some vaccines in some contexts, this value might have public health relevance. However, for other vaccines in other contexts, an appropriate value might be lower or higher. As we discuss below, if researchers apply the equivalence testing method in studies that conclude with public-health recommendations, the critical choice of Δ should depend on public health issues and will vary with the context.


    RESULTS
 
Difference and equivalence in coverage
When we examine point estimates of vaccination coverage among young children by race/ethnicity, percentages are generally higher among Whites and Asians (table 1). Smaller standard deviations for estimates of coverage among Whites show less variance in the data, which reflects the larger number of Whites sampled.


 
TABLE 1. Assessing differences and equivalencies in early childhood immunization coverage by self-reported race/ethnicity, National Immunization Survey, 2000
 
For difference tests, boldface type in table 1 indicates that these groups differ from non-Hispanic Whites; that is, the null hypothesis of no difference is rejected. For vaccines for which coverage differs between two groups, the confidence intervals of the differences do not contain zero (figure 1, left column). Traditional tests of differences show that, compared with Whites, coverage among Blacks is significantly lower for all vaccines listed, except varicella vaccine. Among Hispanics, coverage is significantly higher for varicella vaccine and lower for all other vaccines except measles-mumps-rubella vaccine. Compared with Whites, Asians have significantly higher coverage for one dose of varicella vaccine.



 
FIGURE 1. Differences and equivalencies in early childhood immunization coverage by self-reported race/ethnicity for three minority groups versus Whites, National Immunization Survey, 2000. The horizontal T-shaped bars depict confidence intervals. Confidence intervals for statistically significant differences (95% confidence intervals) do not include zero. Confidence intervals for equivalencies (90% confidence intervals) are contained in the interval with endpoints –5% and +5%. Confidence intervals marked with an asterisk go beyond ±10%. 3 DTP, three or more doses of diphtheria and tetanus toxoids and pertussis vaccine; 4 DTP, four or more doses of diphtheria and tetanus toxoids and pertussis vaccine; Polio, three or more doses of poliovirus vaccine; MMR, one or more doses of measles-mumps-rubella vaccine; Hib, three or more doses of Haemophilus influenzae type b vaccine; Hep B, three or more doses of hepatitis B vaccine; Varicella, one or more doses of varicella vaccine.

 
For equivalence tests, boldface type in table 1 indicates similarity; that is, the null hypothesis of a difference greater than ±Δ is rejected. For vaccines for which coverage is similar between two groups, confidence intervals for the difference are completely contained in the interval with endpoints –5 percent and +5 percent (figure 1, right column). Within Δ = 5 percent, Blacks in comparison with Whites have similar coverage for ≥3 doses of diphtheria and tetanus toxoids and pertussis (DTP) vaccine, ≥3 doses of Haemophilus influenzae type b vaccine, ≥3 doses of hepatitis B vaccine, and one dose of varicella vaccine (table 1). Among Hispanics, coverage is equivalent to that among Whites for ≥3 doses of DTP, ≥3 doses of poliovirus vaccine, one dose of measles-mumps-rubella vaccine, ≥3 doses of Haemophilus influenzae type b vaccine, and ≥3 doses of hepatitis B vaccine. Among Asians, coverage is equivalent to that among Whites for ≥4 doses of DTP and ≥3 doses each of DTP, poliovirus vaccine, and hepatitis B vaccine.

Comparisons of test results
All four possible combinations of results arise in this comparison of difference testing and equivalence testing. Both difference tests and equivalence tests must be interpreted carefully and correctly.

Positive for difference and equivalence
For some comparisons, the difference test shows difference, but the equivalence test shows similarity at Δ = 0.05. This holds in seven of 21 tests shown in table 1 and figure 1. For example, for ≥3 doses of Haemophilus influenzae type b vaccine and ≥3 doses of hepatitis B vaccine, difference testing finds significant differences among Blacks versus Whites and among Hispanics versus Whites. However, in these instances, the equivalence test allows us to conclude that the groups are similar at Δ = 0.05.

Negative for difference and positive for equivalence
When comparing Blacks and Whites for varicella vaccine coverage, difference testing does not show a difference, but we cannot conclude from this test that the groups are identical. The test merely shows that a difference has not been proven. If the sample size were large enough, a statistically significant difference would probably emerge. However, for the same comparison, equivalence testing permits us to conclude that the groups are similar regarding varicella vaccination, within the tolerance of the test.

Positive for difference and negative for equivalence
For coverage with ≥4 doses of DTP, the results from difference testing indicate that coverage is significantly different among Blacks versus Whites and among Hispanics versus Whites. However, the tests of equivalence do not reveal similarity. (Remember that in equivalence testing, failure to find similarity is not proof of difference.)

Negative for difference and equivalence
For measles-mumps-rubella vaccine coverage among Asians versus Whites, the difference test did not show difference, and similarity could not correctly be concluded from the difference test. The equivalence test did not show similarity and could not prove difference. In this case, the evidence was not strong enough to reject either null hypothesis: neither that of no difference nor that of a difference of at least Δ.
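The four combinations of results described above can be reproduced with a short Python sketch that applies both tests to the same pair of estimates. The function name `classify` and all numeric inputs are hypothetical; they were chosen only so that each combination occurs once and are not values from table 1.

```python
import math

def classify(p_white, sd_white, p_minority, sd_minority, delta=0.05):
    """Return (differs, similar) for one comparison.

    differs: the 95 percent interval for the difference excludes 0.
    similar: the 90 percent interval lies inside (-delta, +delta).
    """
    diff = p_white - p_minority
    se = math.sqrt(sd_white**2 + sd_minority**2)
    lo95, hi95 = diff - 1.96 * se, diff + 1.96 * se
    lo90, hi90 = diff - 1.645 * se, diff + 1.645 * se
    differs = lo95 > 0 or hi95 < 0
    similar = -delta < lo90 and hi90 < delta
    return differs, similar

# Hypothetical inputs, one per combination of outcomes.
examples = {
    "positive for difference and equivalence": (0.95, 0.005, 0.92, 0.010),
    "negative for difference, positive for equivalence": (0.95, 0.005, 0.94, 0.010),
    "positive for difference, negative for equivalence": (0.85, 0.005, 0.78, 0.012),
    "negative for difference and equivalence": (0.90, 0.005, 0.88, 0.030),
}
for label, args in examples.items():
    print(label, classify(*args))
```

The last case shows how a large standard deviation can leave both null hypotheses standing: the interval is too wide either to exclude 0 or to fit inside ±Δ.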


    DISCUSSION
 
Equivalence testing permits investigators to pose and answer very precise questions about the levels of difference in vaccination coverage that remain between Whites and members of other racial/ethnic groups. As we have shown, for some comparisons, a difference test may reveal that groups differ by a statistically significant amount. However, equivalence testing for that particular comparison may reveal that the groups are, for practical purposes, equivalent.

Difference testing is rigorous and is appropriate for many applications—for example, to quantify disparities that adversely affect public health. However, statistically significant differences are not necessarily meaningful from a public health standpoint. Moreover, difference testing allows no conclusions about similarity between groups (we implicitly assume that only a test resulting in a decision to declare that the groups differ or that they cannot be proven to differ is being considered). Difference testing should be reserved for situations where the appropriate research question is, "Is there a disparity between groups that can be addressed with public health interventions?"

On the other hand, when interventions have succeeded in reducing disparities between groups, the appropriate research question is, "Is there practical equivalence among these groups?" In equivalence testing, as in difference testing, the method for showing that groups are similar is rigorous. With this method, the researcher must specify an acceptable difference between groups, Δ. We cannot reject the null hypothesis of a difference greater than Δ unless the statistical evidence is strong. The sample size might limit the values of Δ that are practical; if both Δ and the sample size are small, it is unlikely that equivalence can be established.
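One way to see how the data limit the practical values of Δ is to compute, for a given comparison, the bound that Δ must exceed before the TOST rule can declare similarity. The sketch below is illustrative only; the function name `delta_threshold` and its inputs are hypothetical.

```python
import math

def delta_threshold(p_white, sd_white, p_minority, sd_minority, z=1.645):
    """Return the bound that delta must exceed for the 90 percent interval
    to fit inside (-delta, +delta): |difference| plus the half-width."""
    diff = p_white - p_minority
    half_width = z * math.sqrt(sd_white**2 + sd_minority**2)
    return abs(diff) + half_width

# Hypothetical estimates: similarity can be declared only for delta
# larger than the printed value.
print(delta_threshold(0.95, 0.005, 0.92, 0.010))
```

A smaller sample inflates the standard deviations and therefore raises this bound, which is the sense in which a small Δ and a small sample together make equivalence hard to establish.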

For results of equivalence testing to be useful in assessing immunization coverage, Δ must be carefully selected. For example, some small differences in coverage between groups may not increase the risk of morbidity and mortality from vaccine-preventable disease. Determining the magnitude of a tolerable difference will depend on many factors. For example, a tolerable difference for one vaccine might not be tolerable for another, depending on the epidemiology of the disease in question. A Δ considered acceptable on a national scale might be unacceptable on a local scale.

As with difference testing, equivalence testing requires an adequate sample size. The NIS samples smaller numbers of minorities than of Whites, which can limit the ability of the data to detect similarity at small values of Δ. Oversampling minorities or combining data from multiple years could increase the sensitivity. Surveys with smaller sample sizes than that of the NIS might or might not provide sufficient discriminatory power to support equivalence testing with reasonable values of Δ.
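The effect of sample size can be illustrated with a rough calculation. The sketch below assumes simple random sampling, so that the standard deviation of an estimated proportion is sqrt(p(1 – p)/n). This is a simplifying assumption that ignores the NIS weighting and design, which would generally widen the intervals. The function name and sample sizes are hypothetical.

```python
import math

def tost_half_width(p_white, n_white, p_minority, n_minority, z=1.645):
    """Half-width of the 90 percent interval for the difference in coverage,
    assuming simple random samples of sizes n_white and n_minority."""
    var_white = p_white * (1 - p_white) / n_white
    var_minority = p_minority * (1 - p_minority) / n_minority
    return z * math.sqrt(var_white + var_minority)

# Hypothetical scenario: 90 percent coverage in both groups, with a
# fixed White sample and a growing minority sample.
for n_minority in (400, 800, 1600):
    print(n_minority, round(tost_half_width(0.90, 4000, 0.90, n_minority), 4))
```

Because the half-width must be smaller than Δ for similarity to be declared even when the point estimates coincide, enlarging the smaller group's sample is the most direct way to make small values of Δ testable.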

Reducing disparities in immunization coverage should remain a top priority. However, even when health interventions succeed, differences will persist, to some small degree. Establishing the goal of practical equivalence, defining a tolerable difference that does not threaten public health, and using equivalence testing to prove that equity has been achieved could permit us to focus limited public health resources on identifying and addressing those disparities that might threaten public health.


    NOTES
 
Reprint requests to Dr. Lawrence Barker, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Mailstop E-62, Atlanta, GA 30333 (e-mail: lsb8@cdc.gov).


    REFERENCES
 

  1. US Public Health Service, Department of Health and Human Services. Healthy People 2010. Washington, DC: US Public Health Service, 2000.
  2. Food and Drug Administration. Bioavailability and bioequivalence requirements. Code of Federal Regulations, title 21, vol 5, part 320 (21 CFR 320). Washington, DC: Food and Drug Administration, 1992.
  3. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 1987;15:657–80.
  4. Farrington CP, Manning G. Test statistics and sample size formulae for comparing binomial trials with null hypothesis of non-zero difference or nonunity relative risk. Stat Med 1990;9:1447–54.
  5. Barker L, Rolka H, Rolka D, et al. Equivalence testing for binomial random variables: which test to use? Am Stat 2001;55:279–87.
  6. Westlake WJ. Response to T. B. L. Kirkwood: bioequivalence testing—a need to rethink. Biometrics 1981;37:589–94.
  7. Huh MH. Equivalence testing as an alternative to significance testing. J Korean Stat Soc 1994;23:199–206.
  8. Blackwelder WC. Proving the null hypothesis in clinical trials. Control Clin Trials 1982;3:345–53.
  9. Blackwelder WC. Equivalence trials. In: Armitage P, Colton T, eds. Encyclopedia of biostatistics. New York, NY: John Wiley and Sons, Inc, 1998.
  10. Shulka R, Wang Q, Fulk F. Bioequivalence approach for whole effluent toxicity testing. Toxicol Chem 2000;19:169–74.
  11. Atkinson WL, Orenstein WA, Krugman S. The resurgence of measles in the United States, 1989–1990. Annu Rev Med 1992;43:451–63.
  12. Luman ET, Barker LE, Simpson DM, et al. National, state, and urban-area vaccination-coverage levels among children aged 19–35 months, United States, 1999. Am J Prev Med 2001;20(suppl 4):88–153.
  13. Zell E, Ezzati-Rice T, Battaglia M, et al. National Immunization Survey: the methodology of a vaccination surveillance system. Public Health Rep 2000;115:65–77.
  14. Smith PJ, Battaglia MP, Hugins VJ, et al. Overview of the sampling design and statistical methods used in the National Immunization Survey. Am J Prev Med 2001;20(suppl 4):17–24.
  15. Smith PJ, Rao JN, Battaglia MP, et al. Compensating for provider nonresponse using response propensities to form adjustment cells: The National Immunization Survey. (Vital and health statistics, series 2, no. 133). Hyattsville, MD: National Center for Health Statistics, 2001.
  16. Shah BV, Barnwell BG, Bieler GS. SUDAAN version 7.5 manual. Research Triangle Park, NC: Research Triangle Institute, 1997.



