a Department of Psychiatry, Nagoya City University Medical School, Mizuho-cho, Mizuho-ku, Nagoya 467-8601, Japan.
b Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, 1200 Main St West, Hamilton, Ontario L8N 3Z5, Canada.
Toshiaki A Furukawa, Department of Psychiatry, Nagoya City University Medical School, Mizuho-cho, Mizuho-ku, Nagoya 467-8601, Japan. E-mail: furukawa@med.nagoya-cu.ac.jp
Abstract
Background Meta-analyses summarize the magnitude of treatment effect using a number of measures of association, including the odds ratio (OR), risk ratio (RR), risk difference (RD) and/or number needed to treat (NNT). In applying the results of a meta-analysis to individual patients, some textbooks of evidence-based medicine advocate individualizing the NNT, based on the RR and the patient's expected event rate (PEER). This approach assumes a constant RR, but no empirical study to date has examined the validity of this assumption.
Methods We randomly selected a subset of meta-analyses from a recent issue of the Cochrane Library (1998, Issue 3). When a meta-analysis pooled three or more randomized controlled trials (RCT) to produce a summary measure for an outcome, we compared the OR, RR and RD of each RCT with the corresponding pooled OR, RR and RD from the meta-analysis of all the other RCT. Using the conventional P-value of 0.05, we calculated the percentage of comparisons in which there was no statistically significant difference in the estimates of OR, RR or RD, and refer to this percentage as the concordance rate.
Results For each effect measure, we made 1843 comparisons, extracted from 55 meta-analyses. The random effects model OR had the highest concordance rate, closely followed by the fixed effects model OR and random effects model RR. The minimum concordance rate for these indices was 82%, even when the baseline risk differed substantially. The concordance rates for RD, under either the fixed effects or random effects model, were substantially lower (54–65%).
Conclusions The fixed effects OR, random effects OR and random effects RR appear to be reasonably constant across different baseline risks. Given the interpretational and arithmetic ease of RR, clinicians may wish to rely on the random effects model RR and use the PEER to individualize NNT when they apply the results of a meta-analysis in their practice.
Keywords Meta-analysis, odds ratio, risk ratio, risk difference, number needed to treat, evidence-based medicine
Accepted 25 June 2001
Today's meta-analyses summarize their results in several ways. When the outcome is dichotomous, some authors prefer the number needed to treat (NNT), because it expresses the efforts that clinicians and patients must expend in order to accomplish the desired treatment target. The NNT is calculated as the inverse of the risk difference (RD), which is an absolute measure of effectiveness. However, many meta-analyses continue to utilize relative measures of effectiveness such as odds ratio (OR) and risk ratio (RR).
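For a concrete illustration of how the four measures relate to one another, the following sketch computes them from a hypothetical 2x2 table (the numbers are invented, not drawn from any trial discussed in this paper):

```python
# Illustrative only: the four effect measures for a dichotomous outcome,
# computed from a hypothetical 2x2 table.
def effect_measures(events_trt, n_trt, events_ctl, n_ctl):
    """Return (OR, RR, RD, NNT) for a dichotomous outcome."""
    p_t = events_trt / n_trt      # event rate, treatment arm
    p_c = events_ctl / n_ctl      # event rate, control arm (CER)
    odds_ratio = (p_t / (1 - p_t)) / (p_c / (1 - p_c))
    risk_ratio = p_t / p_c
    risk_diff = p_c - p_t         # absolute risk reduction
    nnt = 1 / risk_diff           # number needed to treat = 1/RD
    return odds_ratio, risk_ratio, risk_diff, nnt

# Example: 10/100 events on treatment vs 20/100 on control
or_, rr, rd, nnt = effect_measures(10, 100, 20, 100)
print(round(or_, 3), round(rr, 2), round(rd, 2), round(nnt, 1))
# -> 0.444 0.5 0.1 10.0
```

Note that the relative measures (OR, RR) and the absolute measures (RD, NNT) answer different questions, which is why the choice among them matters for generalizability.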
It is the ultimate aim of evidence-based medicine (EBM) to individualize group data from clinical research, in order to satisfy each individual patient's values and preferences.1 Many EBM theorists therefore note that, since event rates vary, often dramatically, across patients, a single NNT is unlikely to be applicable to all patients. They advocate individualizing the NNT, depending on estimates of RR obtained from group studies and on each patient's expected event rate (PEER).2 This approach is based on the assumption of a constant RR, i.e. that 'the relative benefits and risks of therapy are the same for persons with high or low PEERs'.3 (p.120) Basically, the argument goes that, assuming a constant RR, the absolute benefit of the treatment, usually expressed as the NNT, is smaller among low-risk patients than among high-risk patients. A classical example where this holds is the treatment of hypertension.4
But is this premise true for a wide variety of health interventions? Sackett et al. themselves wrote that 'this (constancy of RR) is a big assumption'.5 (p.170)
To the best of the present authors' knowledge, no empirical study to date has directly examined how applicable or generalizable various effect measures of meta-analyses are in actual practice. One study showed that RR and OR are more independent of the baseline risk than RD for a wide range of randomized controlled trials (RCT),6 but these investigators examined individual trials and do not tell us which summary effect measure was better for meta-analysis.
Another study compared fixed as well as random effects model OR and RD for 125 meta-analyses and found that RD tended to be more heterogeneous among trials.7 They noted that random effects estimates often showed wider CI than fixed effects models and concluded that, since formal tests of heterogeneity are often underpowered, it might be appropriate to assume that systematic differences among trials are always present and to use a random effects model. This study did not include RR in their comparisons. Moreover, neither of these two studies tells us how often a summary effect measure, OR or RR or RD, is constant over a range of baseline risks. The constancy of an effect measure is an important factor to consider in deciding if we can individualize NNT and which summary measure to utilize if we want to do so. The present report presents the results of an empirical examination of the generalizability of the most commonly used measures of association for summarizing treatment effects in meta-analyses.
Methods
We included all the meta-analyses in the field of psychiatry, as well as a randomly selected subset of the meta-analyses in other branches of medicine, contained in a recent issue of the Cochrane Library.8 When a meta-analysis pooled three or more RCT to produce a summary measure for one outcome, we compared the OR, RR and RD of each of the included RCT with the pooled OR, RR and RD, respectively, from the other RCT. At minimum, a meta-analysis of three RCT would contribute three comparisons because, for a single outcome, we would compare each of the three RCT with a meta-analysis of the other two. Another meta-analysis of three RCT would contribute nine comparisons if the study pooled the three RCT for three discrete outcomes, such as acceptability of treatment, response, and side effects. For each outcome, the number of comparisons was equal to the number of RCT: a meta-analysis of four RCT would yield four comparisons, five RCT five comparisons, and so on.
Using methods described by Fleiss,9 if the individual OR (or RR or RD) and the pooled OR (or RR or RD) were statistically significantly different at a conventional P-value of 0.05, we regarded them as discordant and, if not, as concordant.
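The leave-one-out concordance check can be sketched as follows. This is a simplified illustration, not the authors' code, and it uses an inverse-variance pool of log odds ratios with a z-test; Fleiss describes the exact procedures used in the paper:

```python
import math

# Sketch of the concordance check: compare one trial's log OR with the
# pooled log OR of the remaining trials by a z-test at P = 0.05.
def log_or_and_var(a, b, c, d):
    """Log odds ratio and its variance for a 2x2 table
    (a, b = treatment events/non-events; c, d = control events/non-events)."""
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def concordant(trial, others, z_crit=1.96):
    """True if `trial` does not differ significantly from the
    inverse-variance pooled estimate of `others`."""
    lor, var = log_or_and_var(*trial)
    lors = [log_or_and_var(*t)[0] for t in others]
    weights = [1 / log_or_and_var(*t)[1] for t in others]
    pooled = sum(w * l for w, l in zip(weights, lors)) / sum(weights)
    pooled_var = 1 / sum(weights)
    z = (lor - pooled) / math.sqrt(var + pooled_var)
    return abs(z) < z_crit
```

For example, a trial with a modest effect in the same direction as the others is scored concordant, while a trial with a strong effect in the opposite direction is scored discordant.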
We calculated pooled estimates using both a fixed effects model (Mantel-Haenszel) and a random effects model (DerSimonian and Laird). Theoretically, neither may be entirely satisfactory, because the latter merely exchanges the questionable homogeneity assumption of the former for a fictitious random distribution of effects.10,11 In practice, when pooling non-heterogeneous studies, investigators have found that both agree rather well, but the random effects model tends to be more conservative and often yields wider CI.7,12
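The relationship between the two models can be sketched as below. This is illustrative only: it pools log odds ratios by inverse variance, whereas the original fixed effects analyses used Mantel-Haenszel; the random effects pool uses the DerSimonian-Laird estimate of the between-trial variance tau^2 from Cochran's Q:

```python
import math

# Hedged sketch: fixed effects (inverse-variance) vs DerSimonian-Laird
# random effects pooling of log odds ratios from a list of 2x2 tables.
def pool_log_or(tables, random_effects=False):
    """Return (pooled log OR, its variance) for tables of (a, b, c, d)."""
    lors = [math.log(a * d / (b * c)) for a, b, c, d in tables]
    variances = [1/a + 1/b + 1/c + 1/d for a, b, c, d in tables]
    w = [1 / v for v in variances]
    fixed = sum(wi * li for wi, li in zip(w, lors)) / sum(w)
    if not random_effects:
        return fixed, 1 / sum(w)
    # DerSimonian-Laird tau^2 from Cochran's Q; widens every trial's variance
    q = sum(wi * (li - fixed) ** 2 for wi, li in zip(w, lors))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(tables) - 1)) / c)
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * li for wi, li in zip(w_star, lors)) / sum(w_star)
    return pooled, 1 / sum(w_star)
```

When the trials are homogeneous, Q falls below its degrees of freedom, tau^2 is truncated to zero, and the two models coincide; heterogeneity inflates tau^2 and hence the random effects CI.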
Because the rate of concordance could be artificially inflated by the small sample size of some of the RCT involved (that is, we might fail to reject the hypothesis that the point estimate from the individual RCT differs from the pooled estimate because of inadequate precision, and therefore excessively wide CI), we conducted a sensitivity analysis restricted to comparisons in which both the individual RCT and the corresponding meta-analysis produced statistically significant results. To examine the consistency of treatment effectiveness indices when the control event rate (CER) differs substantially, we conducted another sensitivity analysis limited to instances where the results of the RCT and the meta-analysis were statistically significant and, in addition, the CER of the individual RCT was less than half or more than twice that of the weighted average of the other studies in that meta-analysis. We further examined concordance rates when both results were statistically significant and the CER of the individual RCT differed more than three-fold from the weighted average of the other studies.
Our results showed that the fixed effects OR, random effects OR and random effects RR all produced potentially acceptable concordance rates between one RCT and the meta-analysis of similar RCT (see Results). In order to individualize NNT, we would apply these indices of treatment effectiveness to PEER by the following formulae:
NNT = 1 / [PEER × (1 − RR)]   (from RR)

NNT = [1 − PEER × (1 − OR)] / [(1 − PEER) × PEER × (1 − OR)]   (from OR)
We therefore next examined the extent to which these models would produce similar individualized NNT across the range of baseline risks in which clinicians would typically apply the method of individualizing the NNT. For this analysis, we used meta-analyses which produced statistically significant fixed effects model OR. For each meta-analysis that met this criterion, we calculated the NNT assuming patient expected event rates of 0.1, 0.2, 0.3, 0.4, and 0.5.
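These standard PEER-based expressions from the EBM literature cited in the text can be sketched in code (function names are ours, for illustration only):

```python
# Individualized NNT from a patient's expected event rate (PEER) and a
# relative effect measure, following the standard EBM formulae.
def nnt_from_rr(peer, rr):
    # NNT = 1 / (PEER * (1 - RR))
    return 1 / (peer * (1 - rr))

def nnt_from_or(peer, odds_ratio):
    # NNT = (1 - PEER * (1 - OR)) / ((1 - PEER) * PEER * (1 - OR))
    return (1 - peer * (1 - odds_ratio)) / \
           ((1 - peer) * peer * (1 - odds_ratio))

# e.g. PEER = 0.2 with RR = 0.6 gives NNT = 1 / (0.2 * 0.4) = 12.5,
# and with OR = 0.5 gives NNT = 0.9 / 0.08 = 11.25
```

The OR formula follows from converting the patient's baseline odds PEER/(1 − PEER) to treatment odds via the OR, back to a treatment risk, and taking the inverse of the resulting risk difference.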
To determine the extent of agreement, we needed to define a range of NNT in which the clinical implications are likely to be very similar. We chose the following (inevitably somewhat arbitrary) criteria: for NNT of 1–5, differences of ≤3; for NNT of 6–10, differences of ≤4; for NNT of 11–50, differences of ≤15; for NNT of 51–100, differences of ≤30; for NNT over 100, <0.3 × NNT. Using these criteria, we calculated agreement of the individualized NNT based on fixed or random effects OR and random effects RR.
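The banded agreement criteria can be expressed as a small function. This is a sketch; in particular, applying the band of the smaller of the two NNTs is our assumption, as the text does not specify how ties across bands were handled:

```python
# Sketch of the (somewhat arbitrary) agreement bands for two
# individualized NNTs; the band is chosen by the smaller NNT (assumption).
def nnts_agree(nnt_a, nnt_b):
    smaller = min(nnt_a, nnt_b)
    diff = abs(nnt_a - nnt_b)
    if smaller <= 5:
        return diff <= 3
    if smaller <= 10:
        return diff <= 4
    if smaller <= 50:
        return diff <= 15
    if smaller <= 100:
        return diff <= 30
    return diff < 0.3 * smaller  # NNT over 100
```

For example, NNTs of 4 and 6 agree (difference 2 ≤ 3), whereas NNTs of 12 and 30 do not (difference 18 > 15).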
Because the results were very similar between psychiatry and general medicine, we present the combined results. Because of lack of independence of effect measures in these sets of comparisons, we were unable to calculate 95% CI for the concordance rates or to examine if the differences in the concordance rates were statistically significant. The results, however, show us how often, in absolute terms, we can expect the pooled effects of meta-analyses to apply to separate groups of patients.
We did not exclude the comparisons where statistical heterogeneity was noted if the studies were combined and the summary measures were reported in the original meta-analyses. Nor did we consider the impact of switching from the absence to the presence of the selected outcome event for the RR (for instance, from death to survival, or persistent disease to cure), although the RR of event and the RR of no event can make a substantial difference in the estimated effect size, its 95% CI and observed heterogeneity (in contrast to RR, OR and RD are symmetrical around 1 and 0, respectively, if we switch the selected event, and therefore do not present such problems).13 We aimed to examine the generalizability of pooled results as they are currently practised and reported.
Results
We made 1843 comparisons between OR, RR or RD of an individual RCT and the pooled OR, RR or RD of meta-analyses of all the other comparable RCT for various outcome variables extracted from 55 meta-analyses in the Cochrane Library (16 from psychiatry and 39 from general medicine). These included such diverse topics as antenatal thyroxin releasing hormone (TRH) prior to preterm delivery, antibiotics in salmonella, anticoagulation following non-embolic stroke, clozapine for schizophrenia and pharmacotherapy for dysthymia.
In terms of the total sample of comparisons made, all effect measures appeared reasonably generalizable, with concordance rates of around 90%, but the random effects model OR and RR produced the highest concordance rates (92%) (Table 1).
When we further limited the comparisons to instances in which the CER of the individual RCT was less than half or more than twice that of the corresponding meta-analysis, the random effects model OR had the highest concordance rate (88%), closely followed by the fixed effects model OR and the random effects model RR (87% and 84%, respectively). The concordance rates of the RD, under both fixed and random effects models, showed marked declines from the values obtained for the total sample. The results were consistent when we examined more extreme cases where the CER differed more than three-fold (Table 1).
We noted no particular clinical area where either the OR or the RR showed more than occasional inconsistency across studies. Out of 412 comparisons in which both the RCT and the meta-analysis had significant results, in only 17 instances (4%) was the discrepancy qualitative, i.e. an RCT produced an RR in the opposite direction from the random effects model RR of the meta-analysis of the other RCT. The instances in which qualitative differences were noted were very diverse and showed no common features that we could discern: antibiotics for treating salmonella gut infection, prophylactic surfactant in preterm infants, amodiaquine versus chloroquine in symptomatic patients with malaria, and clozapine versus typical antipsychotics in schizophrenia.
For a range of patient's expected event rates (PEER), point estimates of the individualized NNT calculated from fixed effects OR, random effects OR and random effects RR all produced good to excellent agreement, and were unlikely to lead to differing clinical decisions (Table 2).
Discussion
In this study, the random effects model OR showed the greatest consistency across RCT within meta-analyses, closely followed by the fixed effects model OR and random effects model RR. All of these measures of effect showed individual RCT results consistent with those of the other trials addressing the same question 82% or more of the time, even when the baseline risk differed substantially. On the other hand, the random or fixed effects model RD proved substantially less generalizable.
This degree of concordance for some measures of association is surprising and encouraging, given that trials differ in the patients recruited, the way the interventions are administered, and the way the outcomes are measured, all of which can influence the size of the treatment effect. Publication bias could have inflated the concordance rates. This could occur if negative RCT, which would be likely to have RR qualitatively discrepant from those in positive RCT, were not published. In addition, systematic reviews that suggested substantial heterogeneity may not have been performed or published. We cannot know the extent of this bias.
Demonstrating similar treatment effects across differing groups of patients within a series of trials would provide the strongest support for assuming a constant RR or OR. For instance, assume that pooling results in the low, moderate, and high-risk patients who participated in a group of trials showed a similar magnitude of effect. Such a finding would provide very powerful evidence for applying a single OR or RR in calculating the likely benefit in all such patients. Unfortunately, such data are seldom available. The results of this study provide somewhat weaker, but still compelling, evidence that we may safely assume a similar magnitude of treatment effect when we want to individualize treatment decisions in separate groups of individuals or in an individual patient who may have a varying baseline risk.
On the other hand, we found that no effect measure was 100% applicable to all the possibly similar groups of patients. Our results apply only to the range of baseline risks seen in the studies that we included. The largest range was a 30-fold difference, but the large majority showed differences of up to 5-fold: in 81% the difference was no greater than 2-fold, in another 10% no greater than 3-fold, and in another 5% no greater than 5-fold. Applying our results to greater differences in baseline risk is less secure. Examples are accumulating where OR and RR do appear to differ materially among subgroups of patients with differing baseline risks. They include anti-arrhythmic drugs after myocardial infarction,14 carotid endarterectomy,15 and human immunodeficiency virus infection.16 Our results suggest that these cases represent exceptions, and that across various health interventions in humans, the fixed effects model OR and the random effects model OR or RR would be correct in eight to nine out of ten instances when applied to separate groups of individuals.
On the basis of our results, the best summary measure for a meta-analysis might be the random effects model OR. Clinicians could then use this OR to calculate the PEER-adjusted NNT. Moreover, the OR has some theoretical advantages over the RR, because (1) it is symmetric around unity, (2) it does not predict impossible event rates if the measure is assumed constant, (3) efficient estimation is available in small samples, (4) it can easily be expanded to a model with multiple factors and multiple levels, and (5) it can be estimated from any of the three basic epidemiological study designs (retrospective, cross-sectional or prospective).17
However, along with these mathematical properties, there are other factors to consider when recommending a summary measure for meta-analyses, such as the ease of interpretation and communication.18 Clinicians find the OR difficult to interpret19 and repeated examples show that even the most prestigious journals misinterpret OR as if they were RR.20,21 This difficulty appears even greater when the result is to be used to obtain informed consent from a patient.22 The difference between OR and RR is large when the CER is moderate to high and/or when the OR and RR are much greater or smaller than 1.0, and misinterpreting OR as RR often ends up overestimating the benefits or harms of an intervention.13 Furthermore, calculation of NNT from OR and PEER is arithmetically complicated.23 On the other hand, our analyses suggested that point estimates of individualized NNT agree well if we calculate them from OR or RR.
Our results, and the additional considerations we have outlined, suggest the following approach to individualizing estimates of treatment benefit. First, the clinician should examine the available results to ensure that there is no evidence that relative risk varies substantially across risk groups. In the absence of such evidence, the clinician can safely use the random effects model RR to estimate PEER-adjusted NNT for individual patients they treat.
References
1 Glasziou P, Guyatt GH, Dans AL, Dans LF, Straus S, Sackett DL. Applying the results of trials and systematic reviews to individual patients. Evidence-Based Medicine 1998;3:165–66.
2 Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2nd Edn. Boston/Toronto/London: Little, Brown and Company, 1991.
3 Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. 2nd Edn. Edinburgh: Churchill Livingstone, 2000.
4 Mulrow CD, Cornell JA, Herrera CR, Kadri A, Farnett L, Aguilar C. Hypertension in the elderly. Implications and generalizability of randomized trials. JAMA 1994;272:1932–38.
5 Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice & Teach EBM. New York: Churchill Livingstone, 1997.
6 Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the effect of the control rate as a predictor of treatment efficacy in meta-analysis of clinical trials. Stat Med 1998;17:1923–42.
7 Engels EA, Schmid CH, Terrin N, Olkin I, Lau J. Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Stat Med 2000;19:1707–28.
8 Cochrane Collaboration. Cochrane Library [database on disk and CD-ROM]. Oxford: Update Software, Issue 3, 1998.
9 Fleiss JL. The statistical basis of meta-analysis. Stat Methods Med Res 1993;2:121–45.
10 Greenland S. A critical look at some popular meta-analytic methods. Am J Epidemiol 1994;140:290–96.
11 Lau J, Ioannidis JP, Schmid CH. Summing up evidence: one answer is not always enough. Lancet 1998;351:123–27.
12 Berlin JA, Laird NM, Sacks HS, Chalmers TC. A comparison of statistical methods for combining event rates from clinical trials. Stat Med 1989;8:141–51.
13 Deeks JJ, Altman DG. Effect measures for meta-analysis of trials with binary outcomes. In: Egger M, Davey Smith G, Altman DG (eds). Systematic Reviews in Health Care: Meta-Analysis in Context. London: BMJ Books, 2001.
14 Boissel JP, Collet JP, Lievre M, Girard P. An effect model for the assessment of drug benefit: example of antiarrhythmic drugs in postmyocardial infarction patients. J Cardiovasc Pharmacol 1993;22:356–63.
15 Rothwell PM. Can overall results of clinical trials be applied to all patients? Lancet 1995;345:1616–19.
16 Ioannidis JP, Cappelleri JC, Schmid CH, Lau J. Impact of epidemic and individual heterogeneity on the population distribution of disease progression rates. An example from patient populations in trials of human immunodeficiency virus infection. Am J Epidemiol 1996;144:1074–85.
17 Walter SD. Choice of effect measure for epidemiological data. J Clin Epidemiol 2000;53:931–39.
18 Deeks J. What is an odds ratio? Bandolier 1997;2:67.
19 Sinclair JC, Bracken MB. Clinically useful measures of effect in binary analyses of randomized trials. J Clin Epidemiol 1994;47:881–89.
20 Hayes RJ. Odds ratios and relative risks [letter]. Lancet 1988;ii:338.
21 Altman DG, Deeks JJ, Sackett DL. Odds ratio should be avoided when events are common. Br Med J 1998;317:1318.
22 Feinstein AR. Indexes of contrast and quantitative significance for comparisons of two groups. Stat Med 1999;18:2557–81.
23 Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence-Based Medicine 1996;1:164–66.