How to interpret radiological progression in randomized clinical trials?
R. B. M. Landewé, M. Boers¹ and D. M. F. M. van der Heijde
Department of Internal Medicine/Rheumatology, University Hospital Maastricht and ¹Department of Clinical Epidemiology and Biostatistics, Free University Medical Center, Amsterdam, The Netherlands
Radiological progression is a hallmark of rheumatoid arthritis (RA) that can be visualized on plain radiographs. Advantages of this technique include its low cost and wide availability. Standards for measurement and analysis of damage progression are now available and broadly accepted. The main disadvantage is that plain radiographs are insensitive to early changes.
The acceptance of standards has encouraged the rheumatological community and drug registration authorities to rely on radiological progression as one of the most important outcome parameters by which to judge the efficacy of new disease-modifying anti-rheumatic drugs (DMARDs). Hence, information about which anti-rheumatic drugs retard progression, as well as their rank order, is important. Unfortunately, clinical trials usually study only one or two active drugs, so that a rank order can only be obtained from systematic reviews (meta-analyses).
Jones et al. [1] have taken on this difficult task, and their review appears in this issue of Rheumatology. The cornerstone of their article is a league table ranking all available DMARDs by their potential to retard radiological progression. There is no question that this hierarchy will attract the most attention. Unfortunately, several methodological issues raise concerns about its validity, and these are discussed below.
A systematic review attempts to synthesize (or pool) data across studies in a quantifiable way. Any systematic review should, like primary studies, meet a number of methodological standards, and the authors have taken care of most of these. However, comparing or pooling is justified only if the original studies show clinical as well as statistical homogeneity. The latter means that the trials to be compared all show approximately the same treatment effect; it will not be discussed in detail here. The former, clinical homogeneity, is highly unlikely in the current collection of studies. Clinical homogeneity means that patient characteristics are comparable across studies, methods are similar and treatments (kind, dosage and treatment rules) are the same. The included studies were homogeneous only with respect to the disease under study (RA) and the kind of drug. Studies including patients with early RA were compared with studies including more advanced RA, studies with different dosage schedules of the same drug were compared and, perhaps most importantly, because the studies applied different radiological scoring techniques (Larsen score and/or modifications, Sharp score and/or modifications, scoring with sequence known or sequence blinded), the original results had to be transformed in order to allow pooling.
To allow comparison, the authors used parametric descriptive statistics (group mean, standard deviation, standardized progression scores and differences) to summarize individual trial results. Such summaries are most useful when the data are normally distributed. Unfortunately, radiological progression over time shows a highly skewed distribution pattern. Radiological scores start at zero (no damage) and can only deteriorate over time (although rare cases of improvement or repair of damage have now been documented [2]), and the majority of patients will show no or only mild progression in a 1-yr period. As an example, we show a frequency distribution of individual first-year progression scores in the COBRA (Combinatietherapie bij Reumatoïde Artritis) trial [3, 4] (Fig. 1). It is a typical example of a skewed distribution: the majority of patients show zero or minor progression and only a subset show substantial progression.
As most pooling requires parametric statistics, mathematical transformation of the source data (e.g. by taking logarithms) may improve the distributional characteristics considerably. However, most trial authors avoid transformation altogether: they correctly report medians and quartiles and analyse their data in a non-parametric (distribution-free) fashion. In our view, it is unacceptable to use untransformed radiographic data as if they were normally distributed, and this severely limits the manoeuvrability of the meta-analyst, because the means and standard deviations of logarithmically transformed source data are hardly ever, if ever, presented in articles.
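To make this concrete, here is a minimal sketch in Python, with simulated gamma-distributed scores standing in for real trial data (an assumption of ours): the raw mean and standard deviation are dominated by the right tail, whereas summaries of log(x + 1)-transformed scores are far better behaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-year progression scores (Sharp units): most patients
# show little or no progression, a few progress substantially. Simulated
# data, standing in for real trial scores.
scores = rng.gamma(shape=0.5, scale=6.0, size=200)

print(f"raw: mean = {scores.mean():.1f}, SD = {scores.std(ddof=1):.1f}")
print(f"raw: median = {np.median(scores):.1f}, "
      f"25th-75th percentile = {np.percentile(scores, 25):.1f}"
      f"-{np.percentile(scores, 75):.1f}")

# log(x + 1) transformation: the + 1 accommodates zero scores (common in
# real data) and the log pulls in the right tail, making parametric
# summaries more defensible.
log_scores = np.log1p(scores)
print(f"log: mean = {log_scores.mean():.2f}, SD = {log_scores.std(ddof=1):.2f}")
```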
Another problem of parametric statistics in skewed data lies in their sensitivity to selective dropout of cases. Table 1 summarizes the means, standard deviations and medians calculated for the radiographic progression scores of the COBRA trial. In order to demonstrate this sensitivity to dropout, we recalculated the group mean, standard deviation and standardized progression score in the same data set after omitting 10% of the observations by either random or selective dropout. The group mean and especially the standard deviation were particularly sensitive to selectively leaving out the 10% highest progression scores, in contrast with leaving out the 10% lowest progression scores and with random dropout. The repercussions of this 'sensitivity to positive extremes' (which is inherent to right-skewed distributions) for the standardized progression score are clear: we encounter the intuitively paradoxical situation of a decreasing mean effect (which is logical, because the high scores are selectively left out), whereas the standardized score suggests the opposite, because the standard deviation shrinks even more than the mean. So, neglecting distributional assumptions and using group means and standard deviations to describe radiological progression may have made the results of this systematic review more sensitive to selective dropout.

Is selective dropout a likely feature of the included trials? In our opinion, the answer is definitely yes. In general, patients with a worse prognosis (greater disease activity, greater radiological progression) have a higher prior probability of premature discontinuation in any clinical trial, so that patients completing the entire trial have a more favourable prognosis, either by nature or by treatment. This bias by trial completion is highly likely in trials with losses to follow-up (i.e. no data beyond a certain time point) greater than 10%, and is the most important argument for relying on intention-to-treat (ITT) analysis. In the included trials, loss to follow-up ranged between 9 and 51% of the participants, and many of the trials were not analysed by ITT. Even when the analysis was done by ITT, it is critical to know how the investigators handled the patients in whom no follow-up scores were available. A frequent method of imputing missing data, last observation carried forward (LOCF), may have merits for clinical or laboratory data; for radiographs, however, it underestimates the true progression of damage and adds to the sensitivity to positive extremes mentioned above. Thus, the quality of the studies, the authors' report [1] and their method of subsequent pooling prevent any meaningful ranking of DMARDs.
TABLE 1. Parametric and non-parametric descriptive statistics in the COBRA trial and the relative effects of random and selective dropout
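To illustrate the mechanism behind Table 1, a minimal simulation sketch (Python; simulated data, not the COBRA scores, and with the standardized progression score taken as the group mean divided by its standard deviation, which is our assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed progression scores for one trial arm
# (simulated, not the COBRA data), sorted so the tails are easy to drop.
scores = np.sort(rng.gamma(shape=0.5, scale=6.0, size=200))
n_drop = int(0.10 * scores.size)  # 10% of observations

def summarize(label, x):
    # Standardized progression score: group mean / standard deviation
    # (our assumption for the parametric summary discussed in the text).
    mean, sd = x.mean(), x.std(ddof=1)
    print(f"{label:<21} mean={mean:5.2f}  SD={sd:5.2f}  "
          f"standardized={mean / sd:4.2f}  median={np.median(x):4.2f}")

summarize("complete data", scores)
summarize("random 10% dropout",
          rng.choice(scores, size=scores.size - n_drop, replace=False))
summarize("highest 10% dropped", scores[:-n_drop])  # worst progressors lost
summarize("lowest 10% dropped", scores[n_drop:])

# Dropping the highest scores lowers the mean, but it shrinks the SD even
# more, so the standardized score rises: the paradox described in the text.
```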
The authors also presented the data in a dichotomized fashion. Dividing patients into those with and those without progression suggests that a clear cut-off exists. The authors decided to distinguish patients with worsening of erosions from those with unchanged erosions, assuming that such a distinction can be made reliably. But scoring erosions or joint damage is a matter of subjective judgement, best demonstrated by visualizing the level of agreement in progression scores between two separate observers who have scored the same data set independently. In a Bland–Altman plot, the mean of the two observers' scores for every patient is plotted against the difference between the scores (Fig. 2). Every dot in Fig. 2 incorporates the results of two separate scorings of the same data, and the greater the distance from the x-axis, the greater the inter-observer variability or measurement error. It is obvious that there is a considerable amount of disagreement between the two observers, not only in the higher progression score range but also in the lower range, which means that zero is not a rational cut-off point for progression. It is better to choose a cut-off level that accommodates measurement error, e.g. using the concept of the smallest detectable difference (SDD). The SDD relies on the assumption that a progression score obtained in a given patient should exceed any score that is compatible with pure measurement error before it is accepted as real progression. The 95%-level SDD informs us that a patient with no real progression would show spurious changes in sequential films no greater than that SDD in 95% of cases. We often choose a probability level of 95%, but it is conceptually sound to choose less conservative probability levels, such as 80%. In the COBRA trial we calculated an SDD of 8.7 Sharp units based on a 95% level of agreement, and 5.7 units based on an 80% level of agreement. Thus, a less conservative SDD will lead us to classify more patients as true progressors. The reader who is further interested in calculating SDDs is referred to the literature [5–7]. The concept of the SDD is conservative with respect to classifying a patient as progressive, but rates measurement error at its true value. We therefore strongly recommend its use in clinical trials, not least because it provides a tool for pooling data from different trials in the future. This recommendation was also the result of a consensus meeting on how to report radiographic data [8].
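For concreteness, a minimal sketch of one common way to compute an SDD from paired observer scores; published formulas differ in detail (see [5–7]), so this is an illustration rather than the method used in COBRA. Note that the ratio of the two normal quantiles, 1.96/1.28 ≈ 1.53, matches the ratio of the COBRA values quoted above (8.7/5.7 ≈ 1.53).

```python
import numpy as np
from statistics import NormalDist

def smallest_detectable_difference(scores_a, scores_b, level=0.95):
    """SDD from two observers' progression scores on the same films.

    A Bland-Altman-style sketch: the SDD is taken as the normal quantile
    for the chosen agreement level times the standard deviation of the
    inter-observer differences. Published SDD formulas differ in detail
    (refs [5-7]); this is the simplest variant.
    """
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # 1.96 at 95%, 1.28 at 80%
    return z * diffs.std(ddof=1)
```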
Could the results of the trials investigated here have been biased by improperly chosen cut-off points? We calculated odds ratios for different cut-off points in the COBRA trial (Table 2), as Jones et al. did for the trials they investigated. It is obvious that the odds ratios depended on the cut-off level that was chosen, which in this example is primarily due to a ceiling effect: the prevalence of progression with a cut-off of zero was so high that the treatment contrast could not be detected. In general, an odds ratio depends on the background prevalence of the condition for which it is calculated. A glance at Table 2 of the review of Jones et al. shows that the background prevalence of progression of erosions in the trials under study differed substantially (from 10% to more than 90%), which makes pooling a hazardous exercise. As we have already argued that a cut-off level of zero (which neglects measurement error) provides unreliable results, we seriously doubt the stability of the odds ratios presented in this systematic review.
TABLE 2. Percentage of patients with radiological progression and odds ratios in both arms of the COBRA trial, with different cut-off scores
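To show the mechanism, a minimal simulation sketch (Python; simulated scores, not the COBRA data) of how the chosen cut-off moves both the background prevalence of 'progression' and the resulting odds ratio:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical integer progression scores for two trial arms (simulated,
# not the COBRA data); the treated arm truly progresses less.
control = np.rint(rng.gamma(shape=0.5, scale=8.0, size=100))
treated = np.rint(rng.gamma(shape=0.5, scale=4.0, size=100))

def odds_ratio(prog_t, prog_c):
    """Haldane-corrected odds ratio (+0.5 per cell) of being a progressor."""
    a, b = prog_t.sum() + 0.5, (~prog_t).sum() + 0.5
    c, d = prog_c.sum() + 0.5, (~prog_c).sum() + 0.5
    return (a / b) / (c / d)

# Zero versus (illustrative) SDD-based cut-offs at the 80% and 95% levels.
for cutoff in (0.0, 5.7, 8.7):
    prog_t, prog_c = treated > cutoff, control > cutoff
    prevalence = np.concatenate([prog_t, prog_c]).mean()
    print(f"cut-off {cutoff:4.1f}: prevalence {prevalence:4.0%}, "
          f"OR = {odds_ratio(prog_t, prog_c):.2f}")

# With a cut-off of zero the prevalence of 'progression' is high in both
# arms, pulling the odds ratio towards 1 (the ceiling effect); SDD-based
# cut-offs restore the treatment contrast.
```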
Radiological progression has become a standard outcome in longer-term clinical trials of DMARDs in RA. There is a justifiable need to compare DMARD performance on this outcome across different trials, and the most appropriate technique available for doing so is the systematic review, with or without pooling of data from different trials. Currently, radiographic data and the available radiological scoring methods have a number of statistical properties that considerably limit the ability to perform standard calculations and tests. Within the context of a single clinical trial these limitations are often of modest importance, because randomized trial groups by definition include similar patients and are handled similarly during the trial. For example, the parametric two-sample Student t-test, which is often used to compare radiological progression in two groups, is quite robust against violation of normality. In pooling exercises, however, these limitations become critical. Lack of clinical homogeneity, substantial loss to follow-up and improper ITT analysis, parametric analysis of markedly non-normal data, and unreliable cut-off scores for progression strongly impair the comparability of progression scores across different trials.
Given these limitations, the future comparability of radiographic data between trials can be improved if trial investigators include at least the following methodological criteria in their protocols and reports. Some of these recommendations were already summarized in the consensus report on how to report radiographic data that was mentioned above [8]:
- obtain and score follow-up X-rays in all patients regardless of premature discontinuation, and limit data imputation;
- use one of the validated radiological scoring methods;
- have X-rays read by at least two observers, state explicitly whether or not the sequence was blinded (because future meta-analysis must be stratified for this factor), and use the mean of the observers' scores as the progression score;
- present radiological progression as medians with 25th and 75th percentiles (box-and-whisker plots), and make available in an appendix the means and standard deviations, as well as the log-transformed means and standard deviations, to allow future methodological studies;
- present smallest detectable differences, based on 80% as well as on 95% agreement.
These recommendations assure a basic level of homogeneity, give insight into some aspects of measurement error, and provide a scientific rationale for choosing a cut-off point for radiological progression.
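As a compact illustration, a hypothetical Python sketch of the summary statistics such a report would contain (the function and variable names are ours, not from any published protocol):

```python
import numpy as np
from statistics import NormalDist

def radiographic_report(reader_a, reader_b):
    """Summary statistics recommended above, computed from two readers'
    per-patient progression scores. A hypothetical sketch."""
    a, b = np.asarray(reader_a, float), np.asarray(reader_b, float)
    progression = (a + b) / 2.0         # mean of the two readers' scores
    p25, p50, p75 = np.percentile(progression, [25, 50, 75])
    log_scores = np.log1p(progression)  # log(x + 1), since zeros occur
    diffs = a - b                       # inter-observer differences
    sdd = {lvl: NormalDist().inv_cdf(0.5 + lvl / 2) * diffs.std(ddof=1)
           for lvl in (0.80, 0.95)}
    return {
        "median (25th-75th percentile)": (p50, p25, p75),
        "mean (SD)": (progression.mean(), progression.std(ddof=1)),
        "log-transformed mean (SD)": (log_scores.mean(),
                                      log_scores.std(ddof=1)),
        "SDD at 80% / 95% agreement": (sdd[0.80], sdd[0.95]),
    }
```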
Notes
Correspondence to: R. B. M. Landewé, Department of Internal Medicine/Rheumatology, P.O. Box 5800, 6202 AZ Maastricht, The Netherlands. 
References
- Jones G, Halbert J, Crotty M, Shanahan EM, Batterham M, Ahern M. The effect of treatment on radiological progression in rheumatoid arthritis: a systematic review of randomized placebo-controlled trials. Rheumatology 2002;41:6–13.
- Rau R, Wassenberg S, Herborn G, Perschel WT, Freitag G. Identification of radiologic healing phenomena in patients with rheumatoid arthritis. J Rheumatol 2001;28:2608–15.
- Boers M, Verhoeven AC, Markusse HM et al. Randomised comparison of combined step-down prednisolone, methotrexate and sulphasalazine with sulphasalazine alone in early rheumatoid arthritis. Lancet 1997;350:309–18.
- Landewé RB, Boers M, Verhoeven AC et al. COBRA combination therapy in patients with early rheumatoid arthritis: long-term structural benefits of a brief intervention. Arthritis Rheum 2002;46:347–56.
- Lassere M, Boers M, van der Heijde D et al. Smallest detectable difference in radiological progression. J Rheumatol 1999;26:731–9.
- Lassere MN, van der Heijde D, Johnson K et al. Robustness and generalizability of smallest detectable difference in radiological progression. J Rheumatol 2001;28:911–3.
- Lassere MN, van der Heijde D, Johnson KR, Boers M, Edmonds J. Reliability of measures of disease activity and disease damage in rheumatoid arthritis: implications for smallest detectable difference, minimal clinically important difference, and analysis of treatment effects in randomized controlled trials. J Rheumatol 2001;28:892–903.
- van der Heijde D, Simon L, Smolen J et al. How to report radiographic data in randomized clinical trials in rheumatoid arthritis: guidelines from a roundtable discussion. Arthritis Rheum 2002;47:215–8.
Accepted 28 June 2002