Department of Internal Medicine, Division of Rheumatology, University Hospital Maastricht, Maastricht and
1 Department of Epidemiology and Biostatistics, Medical Faculty, Vrije Universiteit, Amsterdam, The Netherlands
Correspondence to:
D. van der Heijde, Department of Internal Medicine, Division of Rheumatology, University Hospital Maastricht, PO Box 5800, 6202 AZ Maastricht, The Netherlands.
Abstract
Methods. Two studies were performed with 10 and 12 patients fulfilling the American College of Rheumatology criteria. In Study 1, two sets of films with a 1 yr interval were scored in chronological order, in pairs, and as single films. In Study 2, four sets of films, with a 1 yr interval each, were scored in chronological order, as single films and as single-pair (right and left together). All films were scored with the Sharp/van der Heijde method by two independent observers. Data were analysed with a repeated measures ANOVA using a full mixed effects model. Two generalizability (G) coefficients were constructed for reliability and for change.
Results. Study 1: the interobserver reliability was similar for the three methods (G coefficient for reliability: chronological 0.94, paired 0.88, single 0.93); progression was a mean increase (averaged over patients, observers and methods) from 26 to 37 (P=0.046). The sensitivity to change was greater for the chronological than for the paired and single scoring (G coefficient for change: 0.39, 0.22 and 0.24, respectively). Study 2: the interobserver reliability was 0.86 for chronological, 0.76 for single-pair and 0.91 for single readings. Significantly more progression was measured with the chronological compared with the single-pair and single methods (15.9 vs 8.5 and 8.3; P=0.0001). A constant progression was suggested by chronological reading, in contrast to a stabilization in the other two methods after 1 yr.
Conclusion. Reading films in chronological order is most sensitive to change in a time period up to 3 yr follow-up; this was already present after 1 yr, but even more pronounced with longer follow-up.
KEY WORDS: Radiographs, Reading order, Paired/single films, Rheumatoid arthritis, Clinical trials
Introduction
Recently, we carried out a non-standardized literature review on the methods to score films in randomized clinical trials. This survey revealed some interesting facts. A number of therapeutic trials did not state in what order films were scored. One trial used the method of scoring single films; half of the remaining trials scored in chronological order, the other half used paired scoring but provided no information on timing. Studies comparing methods were also scarce. Fries et al. [8] showed that if films were read in pairs compared with single films, precision was greater. Recently, two Italian groups assessed the influence of reading the films in chronological order, in pairs or as single films [9, 10]. Films of hands and feet were read with the Larsen method in one study and films of hands with the Sharp method in the other study. Both groups conclude that paired reading is preferable, although they draw their conclusions on completely different grounds.
At the time of the above-mentioned publications, we had performed two studies to evaluate these issues. The aim of our studies was to assess the possible influence of knowledge about the chronology. In these two studies, films were scored in chronological order, in pairs, as single-pairs (hands or feet) and single (one hand or foot). The first study had films of two points in time per patient, the second study of four points in time. As far as we are aware, this is the first study with more than two points in time per patient and also the first one scored with the Sharp/van der Heijde method including hands and feet (Table 1). We chose this method because it seems to be the most sensitive and we have a lot of experience with it, although it has the disadvantage of being more time consuming than the Larsen method [3].
Patients and methods
All films in both studies were scored independently by two experienced readers (DvdH and AB) using the Sharp/van der Heijde method [3, 7]. Erosions were scored in 32 joints in the hands and in 12 joints in the feet, with a maximum of 5 per joint in the hands and 10 in the feet. Erosions were scored according to the surface of the joint involved. Joint space narrowing was graded from 0 to 4 in 30 joints in the hands and in 12 joints in the feet. This results in a total damage score that can range from 0 to 448. When scored in true chronological order, scores cannot decrease ('once an erosion, always an erosion'). This follows the published Sharp/van der Heijde modification, which has since been applied in many studies [3, 7]. Erosions and joint space narrowing were summed to obtain the total score. The results of the total score are presented, unless stated otherwise.
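The score range quoted above can be checked from the joint counts and per-joint maxima given in the text. The sketch below does that arithmetic; the `enforce_non_decreasing` helper is only an illustrative assumption about how the 'scores cannot decrease' rule could be expressed for a series of per-joint scores, not a description of how the readers worked.

```python
from itertools import accumulate

# (number of joints, maximum score per joint), taken from the text above
EROSION_SITES = {"hands": (32, 5), "feet": (12, 10)}
NARROWING_SITES = {"hands": (30, 4), "feet": (12, 4)}


def max_score(sites):
    """Sum of joints x per-joint maximum over all regions."""
    return sum(n_joints * per_joint_max for n_joints, per_joint_max in sites.values())


max_erosions = max_score(EROSION_SITES)      # 32*5 + 12*10
max_narrowing = max_score(NARROWING_SITES)   # 30*4 + 12*4
max_total = max_erosions + max_narrowing


def enforce_non_decreasing(scores_over_time):
    """Hypothetical helper: running maximum, so a later score is never lower."""
    return list(accumulate(scores_over_time, max))


print(max_erosions, max_narrowing, max_total)
print(enforce_non_decreasing([3, 5, 4, 7]))
```

The erosion and narrowing maxima sum to the total of 448 stated in the text.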
Statistical analysis
The main analyses concentrated on the total damage score. Secondary analyses were performed for erosions and narrowing scores, respectively. Results are expressed as means.
Study 1
Statistical analysis
The data were analysed with a repeated measures ANOVA; a full mixed effects model was constructed with patient (n=10) and order of application of method (n=6) as random factors, and method (n=3), observer (n=2) and time (n=2) as fixed factors. To check the assumptions of the ANOVA (i.e. normality of the distribution of the residuals and constancy of variance of the residuals), we plotted residuals vs fitted values. We found no substantial deviations from the assumptions of ANOVA. As order of application of method did not change the findings of the analysis, the results are presented without this factor, for clarity.
For each method, separate ANOVA tables yielded mean square estimates that were used to calculate expected mean squares and, from these, variance estimates. Two generalizability (G) coefficients were constructed to express the efficiency of the three scoring methods; in each, the numerator identifies the signal and the denominator the sum of signal and noise. G coefficients can range from 0 to 1, and usually values above 0.80 are considered as good.
Per method, interobserver reliability was expressed as the G coefficient (VC is the variance component) [11]:
[Equation: G coefficient for interobserver reliability; image not available]
Also per method, sensitivity to change was expressed as the G coefficient:
[Equation: G coefficient for sensitivity to change; image not available]
For comparability with other reports, signal-to-noise ratios were also calculated by dividing the square root of the signal variance ($VC_{time} + VC_{observer,time}$), i.e. the signal S.D., by the square root of the noise variance ($VC_{time,pat} + VC_{observer,time,pat}$), i.e. the noise S.D. However, the S.D. values are not directly comparable with those of other reports because our model is more complex, and the S.D. estimates are composed of more terms.
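The relation between a G coefficient (signal divided by signal plus noise) and the signal-to-noise ratio described above can be sketched numerically. The variance components below are hypothetical values chosen for illustration, not results from this study, and applying the same signal and noise terms to the G coefficient for change is an assumption, since the formula itself is not reproduced in the text.

```python
import math


def g_coefficient(signal_var, noise_var):
    """Signal variance over the sum of signal and noise variance (range 0 to 1)."""
    return signal_var / (signal_var + noise_var)


def signal_to_noise(signal_var, noise_var):
    """Signal S.D. divided by noise S.D."""
    return math.sqrt(signal_var) / math.sqrt(noise_var)


# Hypothetical variance components (VC), for illustration only
vc = {
    "time": 30.0,
    "observer,time": 2.0,
    "time,pat": 40.0,
    "observer,time,pat": 8.0,
}

signal = vc["time"] + vc["observer,time"]          # signal variance as defined in the text
noise = vc["time,pat"] + vc["observer,time,pat"]   # noise variance as defined in the text

g_change = g_coefficient(signal, noise)   # assumed form of the G coefficient for change
snr = signal_to_noise(signal, noise)

print(round(g_change, 3), round(snr, 3))
```

Under these definitions the two quantities carry the same information: with ratio r, the coefficient equals r²/(1 + r²), so ranking methods by either measure gives the same order.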
Results
The two observers found the single method less time consuming, but also stated that the paired and chronological readings offered the possibility to compare right and left, or baseline and follow-up, and to assess the quality of films. The interobserver reliability (expressed as G coefficients) of the absolute scores was good and not influenced by the method (chronological 0.94, paired 0.88 and single 0.93).
Table 2 shows the ANOVA table. The main effect of time, i.e. progression, was a mean increase in the joint score (averaged over patients, observers and methods) from 26 to 37 (P=0.046). The main effect of observer (averaged over patients, time and methods) was highly significant because one observer scored consistently higher than the second observer (P=0.003). The main effect of method (averaged over patients, time and observers) was not significant (P=0.34). The two-factor interaction effects show that the difference in absolute scores between the two observers (i.e. averaged over time) was not considerably influenced by the method (method × observer): chronological 6.1, paired 8.2 and single 6.8 (P=0.706). Likewise, the effect of time (progression) averaged over methods did not differ significantly between the observers (observer × time): 10 vs 13 (P=0.10). However, the effect of time averaged over observers differed significantly between the methods (method × time): chronological 14.6, paired 10.4, single 9.4 (P=0.027). Finally, the three-factor interaction (method × observer × time) was not significant (P=0.34), indicating that the interaction between the method and time was not significantly different between the two observers.
A secondary analysis showed that both erosions and narrowing contributed to the effect of time (factor time for erosions P=0.06 and for narrowing P=0.04). The effect of method did not differ significantly between erosions and narrowing; in other words, the difference noted in progression between the three methods could not be attributed to one of the two subscores of the total score.
Study 2
Statistical analysis
The study question was whether the differences between the three methods would disappear with longer follow-up. A repeated measures ANOVA similar to that in Study 1 was performed. However, only the interobserver G coefficient was calculated. The G coefficient for change would be hard to interpret given the multiple time points and thus many different and interrelated possibilities to express change (e.g. change 0–1, change 1–2, change 0–2, change 0–3 yr). Again, plots of residuals vs fitted values showed no substantial deviations from the assumptions of ANOVA.
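To make the point about interrelated change scores concrete, the short sketch below enumerates every interval that can be formed from four yearly time points. Treating the time points as years 0 to 3 follows the text; the variable names are illustrative.

```python
from itertools import combinations

time_points = [0, 1, 2, 3]  # years of follow-up in Study 2

# Every pair (start, end) with start < end defines one possible change score
intervals = list(combinations(time_points, 2))

for start, end in intervals:
    print(f"change {start}-{end} yr")

print(len(intervals), "possible change scores")
```

Because these intervals overlap (for example, change 0 to 2 yr is the sum of change 0 to 1 and change 1 to 2 yr), they are not independent, which is why a single G coefficient for change is hard to define here.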
Results
Observer 1 needed on average 11.2 min (range 7.3–21.5) for a set in chronological order, 11.3 min (4.2–22.0) for single-pair and 10.7 min (4.8–18.0) for single readings. However, observer 2 took almost twice as long to score a set in chronological order (27.7 min; range 20.0–38.0) compared with single-pair (15.4 min; range 10.5–21.0) and single reading (18.5 min; range 9.5–30.0). Technical problems were felt to play a role in 2% of the films if scored in chronological order vs 13% in single-pair and 4% in single readings.
The interobserver G coefficients for absolute scores were 0.86 for chronological, 0.76 for single-pair and 0.91 for single readings.
The main effect of time (averaged over patients, observers and methods) was significant: the mean initial score was 19.2 and the final (3 yr) score was 30.1 (P=0.002). Again, the main effect of observer (averaged over patients and methods) was significant, with a mean difference of 5.4 (P=0.03). The main effect of method was not significant (P=0.53). As in Study 1, the difference in absolute scores between the two observers did not differ significantly between the methods: chronological 3.2, single-pair 5.6 and single 7.3 (P=0.15). Likewise, the effect of time (progression) averaged over methods did not differ significantly between the observers: 11.2 vs 10.5 (P=0.83). However, the effect of time averaged over observers differed significantly between the three methods: chronological 15.9, single-pair 8.5, single 8.3 (P=0.0001) (Fig. 2). A constant progression is suggested by chronological reading, in contrast to a stabilization in the other two methods after 1 yr. A similar pattern was observed for both observers (three-factor interaction not significant; P=0.57).
Discussion
Our data are based on small numbers of patients. In general, our position is that non-significant results from small studies should be considered with great caution, with an eye open for the possibility that lack of power, rather than absence of effects, may be the explanation. However, if a relatively small study yields convincing (significant) results, as our study does, it can safely be concluded in hindsight that the study, relatively small as it is, had sufficient power to demonstrate the effects of interest. Moreover, although our study is indeed small in number of patients included, we believe that our design is highly efficient; this may explain why the power (precision) of our study turns out to be greater than might be expected at first sight.
Recently, two other papers dealing extensively with this methodological issue were published (Table 1) [9, 10]. The first is a letter to the editor on 100 patients with early RA and 18 month follow-up films [10]. Hand films were scored according to Sharp by two independent readers. They found a higher progression rate in chronological readings, followed by the paired and single readings. The interobserver reliability of the absolute scores for the total score was greater for the paired readings (0.90–0.93) than for the chronological (0.82–0.85) and single (0.76–0.80) readings. However, the differences in interobserver reliability for the progression scores were only marginal between the chronological and paired readings (0.67 and 0.64, respectively), but greater than the single reading (0.54). They also calculated a signal-to-noise ratio, but this ratio is not directly comparable with ours because they used a simple two-factor model without interaction terms, as opposed to our full three-factor model. Their ratios, calculated as intrapatient S.D. divided by inter-rater S.D., were 4.8, 3.8 and 2.3 for the chronological, paired and single methods, respectively. The higher ratio of chronological scoring compared with that of single scoring is in agreement with our finding, although the results of their paired reading were better than ours. They concluded that paired reading is the preferred method because of the higher sensitivity to change, better interobserver reliability, and conservative reading if the order is not known.
Comparing our results with those of the above cited study shows that the interobserver reliability on absolute scores was somewhat higher in our study (chronological 0.94, paired 0.88 and single 0.93), and that in our study it was only slightly worse for the paired readings than for the chronological and single readings.
The second study concerns 284 RA patients with early disease, with films at baseline and after 12 months. Hand and foot films were scored according to Larsen by a panel of three observers that provided a consensus score (Table 1) [9]. The authors found that the S.D. of the paired readings was smallest. However, the mean change with this method was also smallest. A more appropriate way is to compare the coefficients of variation (CVs), relating the S.D. to the change. For progression in eroded joint counts, they were 1.46, 1.64 and 1.93 for the chronological, paired and single readings, respectively, and for progression in damage score 1.38, 1.72 and 2.47, respectively. Expressed in this way, the chronological method is the one to prefer with the greatest sensitivity to change (and therefore most powerful in a trial). The authors also performed a bootstrap analysis and give most weight to this final analysis and prefer paired readings.
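The practical meaning of these CVs can be sketched with a standard sample-size argument. Assuming a two-arm comparison of mean progression in which the treatment effect is a fixed fraction of the mean change, the number of patients needed scales with the square of the CV (S.D. divided by mean change). That proportionality is an assumption of this sketch, not a calculation reported by either study; the CV values are the damage score figures quoted above.

```python
def coefficient_of_variation(sd_change, mean_change):
    """S.D. of the change score relative to the mean change."""
    return sd_change / mean_change


def relative_sample_size(cv, cv_reference):
    """Sample size relative to a reference method, assuming n is proportional to CV squared."""
    return (cv / cv_reference) ** 2


# CVs for progression in damage score, as quoted in the text
cvs = {"chronological": 1.38, "paired": 1.72, "single": 2.47}

relative_n = {
    method: relative_sample_size(cv, cvs["chronological"])
    for method, cv in cvs.items()
}

for method, ratio in relative_n.items():
    print(f"{method}: {ratio:.2f} x the patients needed with chronological reading")
```

Under this assumption, paired reading would need roughly one and a half times, and single reading roughly three times, as many patients as chronological reading to detect the same relative treatment effect.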
Neither study provided information on the order in which the three methods were applied: totally random, or first all films with one method, thereafter with the second and lastly with the third method. Only the second paper stated that the films of the same patient were scored in different sessions. If the methods were not applied in random order, this could have influenced the results. We copied the films in order to be able to randomize the orders and showed that ordering did not influence the results. The use of copies may have influenced precision overall, but with all methods read on copies, it is unlikely that the quality of the films biased the differences between the methods in any way.
The methodological distinctions between our two studies and the two studies mentioned before are as follows: (1) compared with the study by Salaffi and Carotti [10], we also included feet (in contrast to hands alone); (2) compared with the study by Ferrara et al. [9], we used the Sharp/van der Heijde method (in contrast to the Larsen method). Sensitivity to change is reported to be greater for the Sharp/van der Heijde score than for the Larsen score, and this also applies to combined assessment of hands and feet compared with hands only [3]. So it could be hypothesized that the combination of the most sensitive methods (Sharp/van der Heijde and hands and feet) distinguishes best between the various methods of scoring. On the other hand, the numbers of patients in the other two studies were much larger than in ours. However, the repeated measures ANOVA and the application of generalizability theory use all information available, reducing the number of patients needed to draw reliable conclusions.
Our data on 3 yr follow-up provide new information. This is especially relevant because clinical therapeutic trials increasingly use more than two films to evaluate treatment effects.
Differences in absolute scores between the two observers were marked, but there was no difference in assessing progression. This has been shown before and is not a major drawback if average scores of the observers are used [13].
The signal-to-noise ratio of chronological scoring was greater than for paired and single scoring. This results in a marked difference in power in a clinical trial. The choice of the order in which to score the films could have a similar or even greater impact than the choice of the scoring method itself. The question we cannot answer at this moment is whether this higher sensitivity to change coincides with bias. In other words: more signal, but less valid, i.e. a false signal? We hypothesized that the chronological method would be able to detect differences earlier, but that this advantage would disappear with longer follow-up. This was tested in the second study. However, the differences became even more apparent. Also, the chronological order suggested progression in all three periods of follow-up, whereas the single and single-pair methods suggested progression only during the first period of follow-up. Thus, we still do not have a definite answer as to which method is measuring the truth. To answer this, we need a gold standard, which is currently not available for radiographic damage. Therefore, we need to use a surrogate gold standard, another external criterion for damage. We think that sonography and MRI are also not suitable because the scoring methods for these techniques have not yet been validated. One way forward would be for an expert panel to decide whether there is a true difference within a set of films, and then to relate this judgement to the data obtained with the three methods. An alternative would be to read the radiographs of a therapeutic trial with known efficacy with the three methods.
What type of information do we have available depending on the radiographs that are present at the same time when scoring the films? Table 3 summarizes the different sources of information that can influence the scoring and in which type of scoring method these are present. Sources of information which can influence the scoring are: (1) the identical joint at a different time; (2) contralateral joint at the same point in time; (3) other joints in different regions. This information can be helpful in scoring a joint, e.g. a change in positioning can be seen in (1); an anatomical variation can be seen in (2). However, these various sources of information can also introduce bias, e.g. if the joints in the feet do not show erosions, the expectation could be that no erosions will be present in the hands. The balance between positive extra information, which leads to reduction in measurement error, and negative extra information, which leads to the introduction of bias, should be investigated for the various sources of information.
It might very well be that the chronological order is biased and overestimates progression of damage, i.e. the signal is false. On the other hand, the films were selected to show progression of damage in at least some of the sets. With the single and single-pair methods, on average no progression at all could be detected in the second and third year. One might really doubt whether this occurrence is realistic. The interpretation could also be that non-chronological scoring introduces measurement error by limiting the information the reader gets, and that the signal is lost in the noise.
In conclusion, reading films in chronological order is most sensitive to change, but it cannot yet be excluded that it overestimates the progression of damage. In randomized clinical trials where sensitivity to change is pivotal, we would advise scoring chronologically. In this situation, bias (if present) will be in the same direction in both arms of the trial, assuming films are read in a blinded fashion. Effective treatment will show a measurable reduction in joint damage compared with the less (or not) effective treatment arm. In observational studies with many potential sources of bias, the choice is more difficult. Chronological reading could be chosen as the most sensitive scoring method and to ensure comparability with the results of clinical trials. On the other hand, the reduction of possible bias could be an argument to choose paired, single-pair or single reading, although it cannot be excluded that these methods are biased in the opposite direction (i.e. show no change where in fact one has occurred). More data are needed to make a final choice for observational studies.