University of Kansas School of Medicine, Wichita, Kansas; Arthritis Research Center Foundation, Wichita, Kansas; and Vanderbilt University School of Medicine, Nashville, Tennessee, USA
Correspondence to: F. Wolfe, National Data Bank for Rheumatic Diseases, Arthritis Research Center Foundation, 1035 N. Emporia, Suite 230, Wichita, KS 67214, USA. E-mail: fwolfe@arthritis-research.org
In this issue of Rheumatology, Brennan et al. [1] describe the cost-effectiveness (CE) of one of the most important and effective treatments in rheumatology, etanercept. They do so in an article that represents the state of the art of rheumatology CE analyses. Their work is transparent, careful, thoughtful and insightful: a model for others working in this field. CE analyses have influenced the National Institute for Clinical Excellence (NICE; www.nice.org.uk) and have had a profound effect on rheumatology treatment options in the UK and elsewhere. This encomium notwithstanding, we think the premises of this and similar studies may be untenable. We offer these criticisms in the spirit of scientific inquiry.
Rheumatology CE studies use a mixture of data from randomized clinical trials (RCTs), observational studies and extrapolations. From extrapolations of RCT results, long-term costs and effectiveness are then estimated for the non-randomized setting of clinical practice. CE studies have vigorously addressed the assumptions relating to right censoring: that is, how CE can be modelled when long-term outcome data are unavailable. However, almost no attention has been directed to an area of even greater importance: the extent to which measurement of patient status at the onset of the RCT reflects the actual status of the rheumatoid arthritis (RA) patients who will receive the treatment.
In the etanercept clinical trial used as the exemplar in this study, the benefit of etanercept in Health Assessment Questionnaire (HAQ) [2, 3] units was 0.88 units for responders and 0.37 for non-responders (Table 1 in [1]). The authors assume a 50% response to etanercept based on ACR20 response criteria [4], which would result in an overall HAQ benefit of 0.63 units for those exposed to etanercept. This measure of clinical change (response), together with the actual level of the HAQ, is central to CE studies because the utility measure used (the EuroQol [5]) is determined entirely from the HAQ, as there are no long-term EuroQol data. In addition, the HAQ is used to estimate the risk of mortality attributed to RA. So it would seem appropriate to ask whether the 0.37 change in HAQ among ACR non-responders, the 0.88 change among ACR responders and the overall 0.63-unit change seen in the RCT are reasonable and valid in the setting of clinical practice.
At first blush, these results appeared reasonable to us, as major improvement agreed with our clinical experience and collected data when we compared patients' clinical status at the start of therapy with their status in the months that followed. However, decidedly different results emerged when we examined the US National Data Bank for Rheumatic Diseases (NDB) [6]. We identified 785 RA patients who were not receiving etanercept when the HAQ was first measured, who were receiving it 6 months later, and who had a duration of etanercept therapy of 3 months or more. In these patients, the mean HAQ score decreased from an initial 1.22 (S.D. 0.7) to 1.15 (S.D. 0.7), a difference of -0.07 [95% confidence interval (CI) -0.10 to -0.05] over the 6-month period. If the expected 6-month increase in HAQ score in the absence of treatment is added in, the HAQ benefit is 0.08 units. The difference between the RCT results and the observational study results (which may be thought to represent practice results) was in excess of nine-fold. On the basis of simulation studies in 410 community RA patients, these HAQ results are equivalent to DAS28 scores of 4.5 (95% CI 4.4 to 4.6) and 4.4 (95% CI 4.2 to 4.6). Although we have presented data on etanercept, the results are similar when infliximab is studied.
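The arithmetic behind these figures is simple to reproduce. The sketch below (values taken from the text; the 50% ACR20 response rate is the authors' assumption) contrasts the expected overall RCT benefit with the NDB observational estimate:

```python
# Back-of-envelope comparison of RCT-derived and NDB-observed HAQ benefit.
rct_responder_benefit = 0.88      # HAQ improvement, ACR20 responders
rct_nonresponder_benefit = 0.37   # HAQ improvement, ACR20 non-responders
response_rate = 0.50              # authors' assumed ACR20 response rate

# Expected HAQ benefit for all patients exposed to etanercept in the RCT
rct_overall = (response_rate * rct_responder_benefit
               + (1 - response_rate) * rct_nonresponder_benefit)

# NDB observational estimate: mean HAQ 1.22 -> 1.15 over 6 months
ndb_observed = 1.22 - 1.15        # 0.07 units

print(f"RCT overall benefit: {rct_overall:.3f} HAQ units")
print(f"RCT / NDB ratio:     {rct_overall / ndb_observed:.1f}")  # roughly nine-fold
```

Weighting the responder and non-responder benefits equally yields about 0.63 HAQ units, roughly nine times the 0.07-unit change observed in the NDB.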
There are several possible explanations for this difference between RCT results and clinical practice. The primary explanation is that the RCT data and the NDB data use different timings to examine the effect of etanercept on HAQ. In RCTs, the comparison is between a deliberately selected worst state, identified by the inclusion criteria, and some future time (usually 6, 12 or 24 months). In the NDB analyses shown, the comparison is between the patient's usual clinical state 1-3 months before starting etanercept and 3-6 months after starting it. In addition, RCT patients are not allowed to receive additional therapy in the month(s) before starting the RCT. In actual clinical practice, patients may receive intra-articular and intramuscular corticosteroids as well as oral corticosteroids, and they may receive additional disease-modifying anti-rheumatic drugs (DMARDs) or DMARD adjustments. Another way of appreciating the HAQ change seen in the NDB is to consider it the incremental benefit of etanercept when added to all other treatment modalities, smoothed by expanding the comparison time domain to the 1- to 3-month period before starting etanercept.
The comparative observations from the NDB do not suggest in any way that the RCTs are incorrect. They do suggest, however, that extrapolating RCT results such as these to long-term CE analyses may not be justified, as community results as defined above differ from RCT results. This caveat extends also to hypothesized HAQ progression rates if they are determined from RCT data. If the magnitude of benefit seen in RCTs is not of the magnitude seen in actual practice, then CE conclusions based on RCT data are not valid. This is the main objection we have to the study of Brennan et al. [1].
If the main purpose of CE analysis is to draw conclusions applicable to general patient usage, we question whether RCT rheumatology data that address the issue of comparative efficacy are appropriate data for long-term cost-effectiveness analysis. In addition, persons participating in clinical trials are unrepresentative of patients in clinical practice on the basis of clinical severity [7], and the results of clinical trials are often discordant with results from clinical practice [8-12]. We would also point out that the problems with RCT data in CE analyses do not arise when the 'before' state does not depend on fluctuating clinical symptoms. Therefore, CE analyses of osteoporosis therapy in RA, for example, would not have the problems cited above.
In the model proposed by Brennan et al., patients having an unsatisfactory response to two DMARDs may receive an anti-tumour necrosis factor (TNF) agent or alternative therapy, provided they have active RA, defined by a DAS score greater than 5.1 [13]. They remain on one of these therapies until they fail to achieve or maintain a DAS benefit of 1.2 units, at which point they are switched to the next DMARD, until a total of three DMARDs have been used. Given initial DMARD- and anti-TNF-specific responses to therapy as proposed, and treatment-specific estimated rates of increase in HAQ in subsequent years, the calculation of benefit in QALYs (quality-adjusted life years) is straightforward. A major concern, however, is whether this model is too far from reality to be useful. We suspect it is.
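The kind of QALY calculation such a model performs can be sketched as follows. The linear HAQ-to-EuroQol mapping, the response sizes and the progression rates below are illustrative placeholders of our own, not the parameters Brennan et al. actually used:

```python
# Illustrative QALY calculation of the general kind used in such models.
# The HAQ-to-utility coefficients are invented placeholders; the actual
# HAQ -> EuroQol mapping in Brennan et al. differs.
def utility_from_haq(haq: float) -> float:
    """Map a HAQ score (0-3) to a 0-1 utility via an assumed linear rule."""
    return max(0.0, 0.85 - 0.25 * haq)

def qalys(initial_haq: float, initial_drop: float,
          annual_progression: float, years: int) -> float:
    """Sum yearly utilities given an initial treatment response and a
    treatment-specific annual HAQ progression rate thereafter."""
    haq = initial_haq - initial_drop
    total = 0.0
    for _ in range(years):
        total += utility_from_haq(min(3.0, max(0.0, haq)))
        haq += annual_progression
    return total

# Incremental benefit of a hypothetical anti-TNF (larger initial response,
# slower progression) over a hypothetical DMARD, for a 10-year horizon.
benefit = qalys(1.5, 0.63, 0.03, 10) - qalys(1.5, 0.25, 0.06, 10)
print(f"Incremental QALYs: {benefit:.2f}")
```

The point is not the numbers but the structure: the QALY output is driven entirely by the assumed initial HAQ response and the assumed progression rates, which is why the validity of those inputs matters so much.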
In the model, the stated reason to change a DMARD is failing the DMARD or biological agent, defined as failing to achieve a 1.2-unit DAS response or an ACR20 response. DMARD/biological success is the other side of the coin: a 20% ACR response or a DAS reduction of 1.2 units. Yet these definitions may have little relevance to clinical practice, where patients have an unsatisfactory response when they are not doing as well as they could reasonably wish. Practically, this means (i) unsatisfactory functional and pain status or (ii) an unacceptable rate of progression. It is clearly possible to be an ACR20 and DAS28 responder but still have an HAQ or pain score that is unacceptable. We examined treatment success and failure in two ways. First, if we consider low disease activity (DAS < 3.2) a measure of treatment success, only 27% of the 410 community patients have had a treatment success. Second, if one defines treatment failure as not being very satisfied with one's health, then 21% of RA patients in the NDB without other comorbidity may be considered treatment successes and 79% treatment failures [14]. Treatment failure and success in the RCT model differ from failure and success in actuality.
It is also true that patients do not necessarily abandon an unsuccessful treatment. They may choose to remain on therapy because they presume that they would be worse without it, or because they believe it is slowing progression adequately for them. For example, in the NDB, 2881 patients receiving methotrexate but not anti-TNF treatment had a mean HAQ score of 1.03 but an annual HAQ progression rate of 0.005 (95% CI -0.001 to 0.011). Among 4204 patients on anti-TNF treatment, the mean HAQ score was 1.20 and the annual progression rate was 0.012 (95% CI 0.004 to 0.021). These data and this discussion suggest to us that the modelled switching may not actually occur.
Can the success and failure criteria, as modelled by Brennan et al., work in practice? The British Society for Rheumatology (BSR) guidelines [15] require a DAS score >5.1 to receive anti-TNF treatment. To continue therapy, a good response (a reduction in the DAS of 1.2) is required. Assuming one starts at a DAS of 5.1, simulation studies indicate that a 1.2-unit response represents an improvement of four tender joints, three swollen joints, a difference in erythrocyte sedimentation rate of 9 units, a patient global difference (scale 0-10) of 1.2 units and an HAQ difference of 0.31 units. We expect the following will happen. In the ordinary clinic setting, without the availability of additional therapy, the clinician might treat the patient who is in a flare with intermittent steroids or joint injections and the passage of time. Where anti-TNF therapy is available, a number of flare patients will be started on it. We suspect that clinicians will be liberal with their assessment of joints and that patients may slightly over-report severity and joint tenderness, as is the case in RCTs [16]. After 3 months, we suspect the opposite will occur. Two joints added at the start and two subtracted at 3 months, together with similar changes in patient global assessment, can lead to a good DAS response when the understood alternative is not to receive or continue an important and potentially useful therapy. If this gaming actually happens, then, even using the strict BSR guidelines, results will not be as hypothesized in the CE model.
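A 1.2-unit decomposition of this kind can be checked against the standard four-variable DAS28 formula. The baseline joint counts, ESR and global scores below are our own illustrative choices, not values from the simulation studies cited above:

```python
import math

def das28(tjc: int, sjc: int, esr: float, gh: float) -> float:
    """Four-variable DAS28: 28-joint tender and swollen counts,
    ESR in mm/h, patient global health on a 0-100 scale."""
    return (0.56 * math.sqrt(tjc) + 0.28 * math.sqrt(sjc)
            + 0.70 * math.log(esr) + 0.014 * gh)

# Illustrative baseline and 3-month values embodying roughly the changes
# quoted in the text: 4 fewer tender joints, 3 fewer swollen joints,
# ESR down 9 mm/h, patient global down 1.2 units on a 0-10 scale (12/100).
before = das28(tjc=6, sjc=4, esr=30, gh=55)
after = das28(tjc=2, sjc=1, esr=21, gh=43)
print(f"DAS28 change: {before - after:.2f}")  # close to the 1.2 'good response'
```

Because the joint counts enter through square roots, the same absolute change in joints produces a larger DAS28 change when baseline counts are low, which is part of what makes the response threshold sensitive to small assessment shifts.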
We have suggested that observational data may be more useful than RCT data for measuring the initial response in CE modelling. The critique that observational study results do not match those of RCTs is often dismissed in favour of the RCT result, because RCTs are considered gold-standard (Class 1) evidence while observational studies are considered Class 2b evidence or below [17]. However, the actual response that occurs in practice cannot be measured in RCTs, and it is the actual response, rather than the RCT response, that ultimately determines true cost-effectiveness. Where illness severity fluctuates, before-and-after comparisons may be misleading when the 'before' data are sampled at flare time. We think that the proper approach may be to adopt the methods of the economists, who use smoothed rates (running averages) to overcome fluctuations that would otherwise lead to misleading conclusions.
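A minimal sketch of such smoothing, using invented monthly HAQ scores, shows how sampling the 'before' state at a flare inflates the apparent benefit while a running average does not:

```python
# Invented monthly HAQ scores: usual state around 1.2, a flare to 1.9 at
# month 5 (index 4, when therapy is started), then a return toward the
# usual state over the following months.
def running_mean(xs, window=3):
    """Trailing moving average over the last `window` observations."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

haq = [1.2, 1.25, 1.15, 1.2, 1.9, 1.4, 1.25, 1.2, 1.2, 1.15]

flare_based_benefit = haq[4] - haq[-1]         # 'before' sampled at the flare
smoothed = running_mean(haq)
smoothed_benefit = smoothed[4] - smoothed[-1]  # 'before' from the smoothed state

print(f"Flare-sampled benefit: {flare_based_benefit:.2f}")  # 0.75
print(f"Smoothed benefit:      {smoothed_benefit:.2f}")
```

With the flare month as baseline, the apparent improvement is several-fold larger than the improvement measured against the running average, which better reflects the patient's usual state.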
A necessary conclusion from these observations is that RA DMARD RCT data by themselves may be too far removed from reality to form the basis of CE analyses, a role that may more properly belong to observational databases. We wonder whether CE analyses in rheumatic diseases that are keyed on symptomatic status and RCTs should play any role in the approval process of DMARD and biological therapies. This presents a conundrum, as observational clinical data can only be obtained after drug approval. One solution is for clinicians to collect simple self-report data at all times (much as blood pressure is measured). Then, when patients enter RCTs the overall results of therapy can be assessed beyond the flare/regression to the mean design that appears to form some of the basis of the RCT response. With better data, the elegant and thoughtful analyses of Brennan et al. [1] can provide meaningful information to clinicians and regulatory authorities.
The authors have declared no conflicts of interest.
References