Reading radiographs in chronological order, in pairs or as single films has important implications for the discriminative power of rheumatoid arthritis clinical trials

D. van der Heijde, A. Boonen, M. Boers2, P. Kostense1 and S. van der Linden

Department of Internal Medicine, Division of Rheumatology, University Hospital Maastricht, Maastricht and
1 Department of Epidemiology and Biostatistics, Medical Faculty, Vrije Universiteit, Amsterdam, The Netherlands

Correspondence to: D. van der Heijde, Department of Internal Medicine, Division of Rheumatology, University Hospital Maastricht, PO Box 5800, 6202 AZ Maastricht, The Netherlands.


    Abstract
 
Objective. To determine the influence of reading series of films in chronological order, in pairs with unknown time sequence, or as single films, on precision and sensitivity to change.

Methods. Two studies were performed with 10 and 12 patients fulfilling the American College of Rheumatology criteria. In Study 1, two sets of films with a 1 yr interval were scored in chronological order, in pairs, and as single films. In Study 2, four sets of films, with a 1 yr interval each, were scored in chronological order, as single films and as single-pair (right and left together). All films were scored with the Sharp/van der Heijde method by two independent observers. Data were analysed with a repeated measures ANOVA using a full mixed effects model. Two generalizability (G) coefficients were constructed for reliability and for change.

Results. Study 1: the interobserver reliability was similar for the three methods (G_reliability chronological 0.94, paired 0.88, single 0.93); progression was a mean increase (averaged over patients, observers and methods) from 26 to 37 (P=0.046). The sensitivity for change was greater for the chronological than for the paired and single scoring (G_change 0.39, 0.22 and 0.24, respectively). Study 2: the interobserver reliability was 0.86 for chronological, 0.76 for single-pair and 0.91 for single readings. Significantly more progression was measured with the chronological compared with the single-pair and single methods (15.9 vs 8.5 and 8.3; P=0.0001). A constant progression was suggested by chronological reading, in contrast to a stabilization in the other two methods after 1 yr.

Conclusion. Reading films in chronological order is most sensitive to change in a time period up to 3 yr follow-up; this was already present after 1 yr, but even more pronounced with longer follow-up.

KEY WORDS: Radiographs, Reading order, Paired/single films, Rheumatoid arthritis, Clinical trials


    Introduction
 
Structural damage as seen on radiographs is an important feature in rheumatoid arthritis (RA). The ability to sustain joint structure and functional capacity determines whether a drug has disease-controlling capacity [1]. Radiographs are included in the core set of measures to evaluate trials with a duration of 1 yr and longer [2]. Many scoring methods for the assessment of radiographic damage are available [3]. Probably the best known and most widely used are the Larsen and Sharp methods with their modifications [4–7]. In the past, various aspects of the validity of these assessments have been addressed, such as which view should be used, which abnormalities should be scored, aspects of intra- and interobserver reliability, and so on [3]. One of the issues that has not yet received much attention is the order in which films should be scored to measure progression. Many possibilities exist, e.g. (1) films grouped per patient (all available radiographs of the hands and feet of one particular patient) ordered chronologically ('chronological'); (2) films grouped per patient, but in random time order ('paired'); (3) films grouped per region (e.g. both hands) from a particular patient at a single point in time ('single-pair'); (4) single films without any grouping or ordering, i.e. all films of all patients mixed randomly ('single'). Theoretically, there are advantages and disadvantages for any of these possibilities. Scoring in chronological order probably provides the most information to the reader. This may help to reduce measurement error introduced solely by variation in positioning or quality of the films. However, it could also introduce bias as the observer may expect progression of damage over time. In contrast, the reading of a single film at a time is unbiased, but probably more prone to measurement error. The advantages and disadvantages of the other methods range between these two extremes.

Recently, we carried out a non-standardized literature review on the methods to score films in randomized clinical trials. This survey revealed some interesting facts. A number of therapeutic trials did not state in what order films were scored. One trial used the method of scoring single films; half of the remaining trials scored in chronological order, the other half used paired scoring but provided no information on timing. Studies comparing methods were also scarce. Fries et al. [8] showed that if films were read in pairs compared with single films, precision was greater. Recently, two Italian groups assessed the influence of reading the films in chronological order, in pairs or as single films [9, 10]. Films of hands and feet were read with the Larsen method in one study and films of hands with the Sharp method in the other study. Both groups conclude that paired reading is preferable, although they draw their conclusions on completely different grounds.

At the time of the above-mentioned publications, we had performed two studies to evaluate these issues. The aim of our studies was to assess the possible influence of knowledge about the chronology. In these two studies, films were scored in chronological order, in pairs, as single-pairs (hands or feet) and single (one hand or foot). The first study had films of two points in time per patient, the second study of four points in time. As far as we are aware, this is the first study with more than two points in time per patient and also the first one scored with the Sharp/van der Heijde method including hands and feet (Table 1). We chose this method because it seems to be the most sensitive and we have a lot of experience with it, although it has the disadvantage of being more time consuming than the Larsen method [3].


 
TABLE 1.  Schedule of the methods used in the present and published studies
 

    Patients and methods
 
Two studies were performed. First, patients, methods and results will be described for the first study, thereafter for the second study. The details of both studies are presented in Table 1. All patients fulfilled the 1987 American College of Rheumatology criteria for RA and had a disease duration of <1 yr at the start. Films were made at the start and after 1 yr of follow-up. Films were selected for high and low scores at the first film, and for having high and low progression between the two films. The films were made in posteroanterior view, copied three times and blinded, and scored by three different methods. The order of patients, as well as the order of application of the methods in a particular patient, was also randomized (random number table). Selection, ordering and blinding were performed by one of us (MB) not involved in scoring.

All films in both studies were scored by two experienced readers (DvdH and AB) independently by the Sharp/van der Heijde method [3, 7]. Erosions were scored in 32 joints in the hands and in 12 joints in the feet with a maximum of 5 per joint in the hands and 10 in the feet. Erosions were scored according to the surface of the joint involved. Joint space narrowing was graded from 0 to 4 in 30 joints in the hands and in 12 joints in the feet. This results in a total damage score that can range from 0 to 448. When scored in true chronological order, scores cannot decrease ('once an erosion, always an erosion'). This rule follows the Sharp/van der Heijde modification as published and subsequently applied in many studies [3, 7]. Erosions and joint space narrowing were summed to obtain the total score. The results of the total score are presented, unless stated otherwise.
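The stated maximum of 448 can be checked directly from the joint counts and per-joint maxima given above. The following is only an arithmetic sketch; the variable names are illustrative and not part of the scoring method.

```python
# Arithmetic check of the maximum total score, using the joint counts
# and per-joint maxima stated in the text (variable names are illustrative).
EROSION_JOINTS_HANDS, EROSION_MAX_HANDS = 32, 5
EROSION_JOINTS_FEET, EROSION_MAX_FEET = 12, 10
NARROWING_JOINTS_HANDS, NARROWING_JOINTS_FEET, NARROWING_MAX = 30, 12, 4

# Erosions: 32 * 5 in the hands plus 12 * 10 in the feet
max_erosions = (EROSION_JOINTS_HANDS * EROSION_MAX_HANDS
                + EROSION_JOINTS_FEET * EROSION_MAX_FEET)

# Joint space narrowing: (30 + 12) joints, each graded 0 to 4
max_narrowing = (NARROWING_JOINTS_HANDS + NARROWING_JOINTS_FEET) * NARROWING_MAX

max_total = max_erosions + max_narrowing
print(max_erosions, max_narrowing, max_total)  # 280 168 448
```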

Statistical analysis
The main analyses concentrated on the total damage score. Secondary analyses were performed for erosions and narrowing scores, respectively. Results are expressed as means.


    Study 1
 
Patients and methods
In Study 1, two sets of hand and foot films of 10 patients, with a 1 yr interval between the first and second set, were scored. In this study, the three scoring methods were as follows. (1) Single: a single hand or foot is presented to the reader. The time, patients and other radiographs of the same patient at that point in time are in random order. (2) Paired: all hands and feet of the same patient of both points in time are presented together. The left and right hand and foot of the same point in time are kept together. The order in time is unknown to the reader. (3) Chronological: all hands and feet of the same patient of both points in time are presented together. The left and right hands and feet of the same point in time are kept together. The order in time is known to the reader.

Statistical analysis
A repeated measures ANOVA analysed the data; a full mixed effects model was constructed with patient (n=10) and order of application of method (n=6) as random factors, and method (n=3), observer (n=2) and time (n=2) as fixed factors. In order to check the assumptions of the ANOVA (i.e. normality of the distribution of the residuals and constancy of variance of residuals), we plotted residuals vs fitted values. We found no substantial deviations from the assumptions of ANOVA. As order of application of method did not change the findings of the analysis, the results will be presented without this factor, for clarity.

For each method, separate ANOVA tables yielded mean square estimates that were used to calculate expected mean squares and, from these, variance estimates. Two generalizability (G) coefficients were constructed to express the efficiency of the three scoring methods; in each, the numerator identifies the signal and the denominator the sum of signal and noise. G coefficients can range from 0 to 1, and usually values above 0.80 are considered as good.

Per method, interobserver reliability was expressed as the G coefficient (VC is the variance component) [11]:

Also per method, sensitivity to change was expressed as the G coefficient:

For comparability with other reports, signal-to-noise ratios were also calculated by dividing the square root of the signal variance (VC_time + VC_observer,time), i.e. the signal S.D., by the square root of the noise variance (VC_time,pat + VC_observer,time,pat), i.e. the noise S.D. However, the S.D. values are not directly comparable with those of other reports because our model is more complex, and the S.D. estimates are composed of more terms.
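Taken together with the description of a G coefficient as signal divided by the sum of signal and noise, the signal and noise variances named above suggest the following form for the G coefficient for change. This is a reconstruction from the text under that assumption, not a reproduction of the printed formula; r denotes the signal-to-noise ratio of S.D. values.

```latex
\[
G_{\text{change}} =
\frac{VC_{\text{time}} + VC_{\text{observer,time}}}
     {VC_{\text{time}} + VC_{\text{observer,time}}
      + VC_{\text{time,pat}} + VC_{\text{observer,time,pat}}}
\]

\[
r = \sqrt{\frac{VC_{\text{time}} + VC_{\text{observer,time}}}
               {VC_{\text{time,pat}} + VC_{\text{observer,time,pat}}}}
\quad\Longrightarrow\quad
G_{\text{change}} = \frac{r^{2}}{1 + r^{2}}
\]
```

Under this reading, the G coefficient and the signal-to-noise ratio carry the same information on different scales: G is bounded between 0 and 1, whereas r is unbounded.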

Results
The two observers found the single method less time consuming, but also stated that the paired and chronological readings offered the possibility to compare right and left, or baseline and follow-up, and to assess the quality of films. The interobserver reliability (expressed as G coefficients) of the absolute scores was good and not influenced by the method (chronological 0.94, paired 0.88 and single 0.93).

Table 2 shows the ANOVA table. The main effect of time, i.e. progression, was a mean increase in the joint score (averaged over patients, observers and methods) from 26 to 37 (P=0.046). The main effect of observer (averaged over patients, time and methods) was highly significant because one observer scored consistently higher than the second observer (P=0.003). The main effect of method (averaged over patients, time and observers) was not significant (P=0.34). The two-factor interaction effects show that the difference in absolute scores between the two observers (i.e. averaged over time) was not considerably influenced by the method (method × observer): chronological 6.1, paired 8.2 and single 6.8 (P=0.706). Likewise, the effect of time (progression) averaged over methods did not differ significantly between the observers (observer × time): 10 vs 13 (P=0.10). However, the effect of time averaged over observers differed significantly between the methods (method × time): chronological 14.6, paired 10.4, single 9.4 (P=0.027). Finally, the three-factor interaction (method × observer × time) was not significant (P=0.34), indicating that the interaction between the method and time was not significantly different between the two observers.


 
TABLE 2.  ANOVA table of Study 1 with patient (n = 10) as random factor, and method (n = 3), observer (n = 2) and time (n = 2) as fixed factors
 
Figure 1 shows the Bland and Altman plot of the difference in progression between the two observers per method [12]. On the y-axis, the difference in progression as assessed by the two observers is presented; on the x-axis, the mean of progression as assessed by the two observers is presented. In the ideal situation, all points would be situated on or close to y=0. It can be clearly seen that there are major differences between the observers in the paired method, but not in the chronological and single readings. The differences between the observers are quite similar over the full range of measured progression; there is no clear tendency to increase or decrease if more progression is observed. From this plot, it can also be seen that one observer is consistently scoring higher than the other observer, resulting in more points in the lower right quadrant.



 
FIG. 1.  Bland and Altman plot presenting the difference in progression between the two observers in relation to the mean progression measured by the two observers for chronological, paired and single readings (Study 1). The chronological and single readings are distributed equally around a difference of zero, whereas one observer reads paired sets systematically higher than the other observer.
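The coordinates of such a plot are simple to compute: for each patient, the x value is the mean of the two observers' progression scores and the y value is their difference. The sketch below uses invented progression scores purely for illustration (they are not data from this study), and it adds the conventional mean difference and 1.96 S.D. limits of agreement, which are an assumption of the sketch rather than something reported here.

```python
from statistics import mean, stdev

def bland_altman(progression_obs1, progression_obs2):
    """Return plot coordinates, mean difference and limits of agreement."""
    diffs = [a - b for a, b in zip(progression_obs1, progression_obs2)]
    means = [(a + b) / 2 for a, b in zip(progression_obs1, progression_obs2)]
    bias = mean(diffs)            # systematic difference between observers
    spread = stdev(diffs)         # sample S.D. of the differences
    return {
        "points": list(zip(means, diffs)),   # (x, y) per patient
        "bias": bias,
        "limits": (bias - 1.96 * spread, bias + 1.96 * spread),
    }

# Hypothetical progression scores for five patients (illustration only)
observer_1 = [4, 10, 22, 0, 15]
observer_2 = [6, 9, 25, 3, 18]

result = bland_altman(observer_1, observer_2)
print(result["points"])
print(result["bias"], result["limits"])
```

A negative mean difference in this sketch would correspond to the second observer scoring more progression on average; points scattered evenly around zero would indicate no systematic difference.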

 
The G coefficient for change was 0.39 for chronological, 0.23 for paired and 0.22 for single readings. Expressed differently, the ratio of signal S.D. to noise S.D. was 0.79 for chronological, 0.55 for paired and 0.53 for single readings. Given that the S.D. of progression is around 16 in all three methods (data not shown), the signal is almost 50% stronger in chronological vs the other methods. This has major implications for the power of a study. For example, a trial with 10 patients per group and a S.D. of 16 (as in this study) has a calculated power of 47% with chronological reading (difference 14.6) and for the other methods 24% (paired) and 21% (single) (differences 10.4 and 9.4, respectively).
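The figures in this paragraph can be explored with a short calculation. The sketch below makes two assumptions that are not stated in the text: that the G coefficient for change relates to the signal-to-noise ratio r as r²/(1 + r²), and that power is approximated with a two-sided, two-sample normal approximation at alpha = 0.05. How the reported power values were actually calculated is not described, so the approximate power values printed here should not be expected to match the reported 47%, 24% and 21% exactly; only the ordering of the methods is of interest.

```python
from math import sqrt
from statistics import NormalDist

def g_from_ratio(ratio):
    """G coefficient implied by a signal S.D. / noise S.D. ratio (assumed relation)."""
    return ratio ** 2 / (1 + ratio ** 2)

def approx_power(difference, sd, n_per_group, alpha=0.05):
    """Two-sided, two-sample power using a normal approximation (an assumption)."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Standardized difference divided by its standard error
    shift = difference / (sd * sqrt(2 / n_per_group))
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

# Ratios and mean differences as reported in the text for Study 1
reported = [("chronological", 0.79, 14.6), ("paired", 0.55, 10.4), ("single", 0.53, 9.4)]

for method, ratio, difference in reported:
    g = g_from_ratio(ratio)
    power = approx_power(difference, sd=16, n_per_group=10)
    print(f"{method}: G = {g:.2f}, approximate power = {power:.2f}")
```

An exact calculation based on the noncentral t distribution would give somewhat lower power than the normal approximation in samples this small.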

A secondary analysis showed that both erosions and narrowing contributed to the effect of time (factor time for erosions P=0.06 and for narrowing P=0.04). Erosions and narrowing had no significantly different effect on the method; in other words, the difference noted in progression between the three methods could not be attributed to one of the two subscores of the total score.


    Study 2
 
Patients and methods
Given the results of the first study, we hypothesized that the chronological method would be able to detect radiographic differences earlier, but that this effect would disappear with longer follow-up. This hypothesis was tested in the second study. Four sets of hand and foot films of 12 patients (completely different from those in Study 1), with a 1 yr interval between each of the four sets, were scored. Because (1) Study 1 did not show relevant differences between paired and single readings in assessing progression, (2) the interobserver reliability was lower for the paired readings, (3) the observers assumed that single readings were less time consuming and (4) it might be helpful to have the right and left hand (or foot) at the same time, it was decided to investigate the influence of reading in chronological order vs single vs single with pairs of hands or feet ('single-pair'; Table 1). The same readers as in Study 1 were involved. The time for each reading was recorded, but due to inadequate instruction, one observer noted the time of the reading in seconds and the other in full minutes. Therefore, the differences between the observers are not precise, but the differences between the three methods for each observer reflect true differences. Thus, the films were scored by three methods that varied partially from the first study: (1) chronological; (2) single-pair: both hands or both feet from the same point in time are presented together, time and patient are in random order; (3) single. In this second study, it was also recorded whether a film, according to one or both observers, had technical problems that could interfere with appropriate scoring, such as bad positioning of the joints on the radiograph.

Statistical analysis
The study question was whether the differences between the three methods would disappear with longer follow-up. A repeated measures ANOVA similar to that in Study 1 was performed. However, only the interobserver G coefficient was calculated. The G coefficient for change would be hard to interpret given the multiple time points and thus many different and interrelated possibilities to express change (e.g. change 0–1, change 1–2, change 0–2, change 0–3 yr). Again, plots of residuals vs fitted values indicated that the residuals were approximately normally distributed.

Results
Observer 1 needed on average 11.2 min (range 7.3–21.5) for a set in chronological order, 11.3 min (4.2–22.0) for single-pair and 10.7 min (4.8–18.0) for single readings. However, observer 2 took almost twice as long to score a set in chronological order (27.7; range 20.0–38.0) compared with single-pair (15.4; range 10.5–21.0) and single reading (18.5; range 9.5–30.0), respectively. Technical problems were felt to play a role in 2% of the films if scored in chronological order vs 13% in single-pair and 4% in single readings.

The interobserver G coefficients for absolute scores were 0.86 for chronological, 0.76 for single-pair and 0.91 for single readings.

The main effect of time (averaged over patients, observers and methods) was significant: the mean initial score was 19.2, the final (3 yr) score was 30.1 (P=0.002). Again, the main effect of observer (averaged over patients and methods) was significant, with a mean difference of 5.4 (P=0.03). The main effect of method was not significant (P=0.53). As in Study 1, the difference in absolute scores between the two observers was not statistically different between the methods: chronological 3.2, single-pair 5.6 and single 7.3 (P=0.15). Likewise, the effect of time (progression) averaged over methods did not differ significantly between the observers: 11.2 vs 10.5 (P=0.83). However, the effect of time averaged over observers differed significantly between the three methods: chronological 15.9, single-pair 8.5, single 8.3 (P=0.0001) (Fig. 2). A constant progression is suggested by chronological reading, in contrast to a stabilization in the other two methods after 1 yr. A similar pattern was observed for both observers (three-factor interaction not significant; P=0.57).



 
FIG. 2.  Influence of order of scoring on progression over 3 yr (average of two observers, Study 2).

 
A separate analysis of erosions and narrowing showed comparable results (data not shown). The effect on the difference in progression between the three methods was as strong for erosions and narrowing as for total score.


    Discussion
 
Our studies have shown that scoring films in chronological order results in higher progression rates and a better signal-to-noise ratio than scoring films in pairs or singly. This was evident after 1 yr, but even more pronounced with longer follow-up. From Study 1, it was obvious that the results between paired and single scoring were very similar in respect to progression rates. However, paired scoring performed worse in terms of precision. This could be seen in both G coefficients of change and reliability. In the second study, it became clear that there is no relevant difference between reading according to the single and the single-pair method.

Our data are based on small numbers of patients. In general, our position is that non-significant results from small studies should be considered with great caution, with an eye open for the possibility that lack of power, rather than absence of effects, may be the explanation. However, if a relatively small study yields convincing (significant) results, as our study does, it can safely be concluded in hindsight that the study, relatively small as it is, had sufficient power to demonstrate the effects of interest. Moreover, although our study is indeed small in number of patients included, we believe that our design is highly efficient; this may explain why the power (precision) of our study turns out to be greater than might be expected at first sight.

Recently, two other papers dealing extensively with this methodological issue were published (Table 1) [9, 10]. The first is a letter to the editor on 100 patients with early RA and 18 month follow-up films [10]. Hand films were scored according to Sharp by two independent readers. They found a higher progression rate in chronological readings, followed by the paired and single readings. The interobserver reliability of the absolute scores for the total score was greater for the paired readings (0.90–0.93) than for the chronological (0.82–0.85) and single (0.76–0.80) readings. However, the differences in interobserver reliability for the progression scores were only marginal between the chronological and paired readings (0.67 and 0.64, respectively), but greater than the single reading (0.54). They also calculated a signal-to-noise ratio, but this ratio is not directly comparable with ours because they used a simple two-factor model without interaction terms, as opposed to our full three-factor model. Their ratios, calculated as intrapatient S.D. divided by inter-rater S.D., were 4.8, 3.8 and 2.3 for the chronological, paired and single methods, respectively. The higher ratio of chronological scoring compared with that of single scoring is in agreement with our finding, although the results of their paired reading were better than ours. They concluded that paired reading is the preferred method because of the higher sensitivity to change, better interobserver reliability, and conservative reading if the order is not known.

Comparing our results on interobserver reliability with the above cited study shows that the interobserver reliability on absolute scores was somewhat higher in our study (chronological 0.94, paired 0.88 and single 0.93), whereas the interobserver reliability was only slightly worse for the paired readings compared with chronological and single readings.

The second study concerns 284 RA patients with early disease, with films at baseline and after 12 months. Hand and foot films were scored according to Larsen by a panel of three observers that provided a consensus score (Table 1) [9]. The authors found that the S.D. of the paired readings was smallest. However, the mean change with this method was also smallest. A more appropriate way is to compare the coefficients of variation (CVs), relating the S.D. to the change. For progression in eroded joint counts, they were 1.46, 1.64 and 1.93 for the chronological, paired and single readings, respectively, and for progression in damage score 1.38, 1.72 and 2.47, respectively. Expressed in this way, the chronological method is the one to prefer, having the greatest sensitivity to change (and therefore being the most powerful in a trial). The authors also performed a bootstrap analysis, gave most weight to this final analysis and therefore preferred paired readings.
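The point about CVs can be illustrated with a small calculation: a method can have the smaller S.D. of change and still the larger CV if its mean change is smaller as well. The numbers below are invented for illustration and are not taken from either study; the function name is likewise illustrative.

```python
def coefficient_of_variation(sd_change, mean_change):
    """S.D. of the change divided by the mean change (lower is more sensitive)."""
    if mean_change <= 0:
        raise ValueError("mean change must be positive for the CV to be meaningful")
    return sd_change / mean_change

# Hypothetical (mean change, S.D. of change) per reading method
hypothetical = {
    "method A": (10.0, 8.0),   # larger S.D., larger mean change
    "method B": (5.0, 7.0),    # smaller S.D., but much smaller mean change
}

cvs = {name: coefficient_of_variation(sd, mean_change)
       for name, (mean_change, sd) in hypothetical.items()}

for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.2f}")
```

In this invented example, method B would be preferred on S.D. alone, whereas method A is preferred once the S.D. is related to the size of the change.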

Neither study provided information on the order in which the three methods were applied: totally random, or first all films with one method, thereafter with the second and lastly with the third method. Only the second paper stated that the films of the same patient were scored in different sessions. If the order of application was not randomized, this could have influenced the results. We copied the films in order to be able to randomize the orders and showed that ordering did not influence the results. The use of copies may have influenced precision overall, but with all methods read on copies, it is unlikely that the quality of the films biased the differences between the methods in any way.

The methodological distinctions between our two studies and the two studies mentioned before are as follows: (1) compared with the study by Salaffi and Carotti [10], we also included feet (in contrast to hands alone); (2) compared with the study by Ferrara et al. [9], we used the Sharp/van der Heijde method (in contrast to the Larsen method). Sensitivity to change is reported to be greater for the Sharp/van der Heijde score compared with the Larsen score, and this also applies for combined assessment of hands and feet compared with hands only [3]. So it could be hypothesized that the combination of the most sensitive methods (Sharp/van der Heijde and hands and feet) distinguishes best between the various methods of scoring. On the other hand, the numbers of patients in the other two studies were much larger than in ours. However, the repeated measures ANOVA and the application of generalizability theory use all information available, reducing the number of patients needed to draw reliable conclusions.

Our data on 3 yr follow-up add new information. This is especially relevant because clinical therapeutic trials increasingly use more than two films to evaluate treatment effects.

Differences in absolute scores between the two observers were marked, but there was no difference in assessing progression. This has been shown before and is not a major drawback if average scores of the observers are used [13].

The signal-to-noise ratio of chronological scoring was greater than for paired and single scoring. This results in a marked difference in power in a clinical trial. The choice of the order in which to score the films could have a similar or even greater impact than the choice of the scoring method itself. The question we cannot answer at this moment is whether this higher sensitivity to change coincides with bias. In other words: more signal, but less valid, i.e. a false signal? We hypothesized that the chronological method would be able to detect differences earlier, but that this would disappear with longer follow-up. This was tested in the second study. However, the differences became even more apparent. Also, the chronological order suggested progression in all three periods of follow-up, whereas the single and single-pair methods suggested progression only during the first period of follow-up. Thus, we still do not have the definite answer as to which method is measuring the truth. To answer this, we need a gold standard, which is in fact currently not available for radiographic damage. Therefore, we need to use a surrogate gold standard, another external criterion for damage. We think that sonography and MRI are also not suitable because the scoring methods for these techniques are not validated yet. One way forward would be to have an expert panel decide whether there is a true difference within a set of films, and then to examine the relationship between this judgement and the data obtained with the three methods. An alternative would be to read the radiographs of a therapeutic trial with known efficacy with the three methods.

What type of information do we have available depending on the radiographs that are present at the same time when scoring the films? Table 3 summarizes the different sources of information that can influence the scoring and in which type of scoring method these are present. Sources of information which can influence the scoring are: (1) the identical joint at a different time; (2) contralateral joint at the same point in time; (3) other joints in different regions. This information can be helpful in scoring a joint, e.g. a change in positioning can be seen in (1); an anatomical variation can be seen in (2). However, these various sources of information can also introduce bias, e.g. if the joints in the feet do not show erosions, the expectation could be that no erosions will be present in the hands. The balance between positive extra information, which leads to reduction in measurement error, and negative extra information, which leads to the introduction of bias, should be investigated for the various sources of information.


 
TABLE 3.  Possible sources of information when reading radiographs on the presence of abnormalities (e.g. films of the hands)
 
The rule that the scores cannot decrease in chronological scoring means that error can only be in one direction: overestimation. This was the rule as it was published with the modification of the Sharp method [3, 7]. This was chosen because in series of films it was regularly seen that on film 1 there was an erosion, which was not present on film 2, but was present again on film 3. So, it was obvious that even though the erosion could not be seen on film 2, it must have been present. It would be worthwhile to see what would happen if this rule was not applied. One obvious effect would be the introduction of more measurement error.

It might very well be that the chronological order is biased and overestimates progression of damage, i.e. the signal is false. On the other hand, the films were selected to show progression of damage in at least some of the sets. With the single and single-pair methods, on average no progression at all could be detected in the second and third year. One might really doubt whether this occurrence is realistic. The interpretation could also be that non-chronological scoring introduces measurement error by limiting the information the reader gets, and that the signal is lost in the noise.

In conclusion, reading films in chronological order is most sensitive to change, but it cannot yet be excluded that it overestimates the progression of damage. In randomized clinical trials where sensitivity to change is pivotal, we would advise scoring chronologically. In this situation, bias (if present) will be in the same direction in both arms of the trial, assuming films are read in a blinded fashion. Effective treatment will show a measurable reduction in joint damage compared with the less (or not) effective treatment arm. In observational studies with many potential sources of bias, the choice is more difficult. Chronological reading could be chosen as the most sensitive scoring method and to ensure comparability with the results of clinical trials. On the other hand, the reduction of possible bias could be an argument to choose paired, single-pair or single reading, although it cannot be excluded that these methods are biased in the opposite direction (i.e. show no change where in fact one has occurred). More data are needed to make a final choice for observational studies.


    Acknowledgments
 
We would like to thank Mrs L. Heusschen for her excellent assistance in the entire project.


    Notes
 
2 Present address: Department of Clinical Epidemiology, VU University Hospital, Amsterdam, The Netherlands.


    References
 

  1.  Edmonds JP, Scott DL, Furst DE, Brooks P, Paulus HE. Antirheumatic drugs: a proposed new classification. [Editorial] Arthritis Rheum 1993;36:336–9.
  2.  Boers M, Tugwell P, Felson DT et al. World Health Organization and International League of Associations for Rheumatology core endpoints for symptom modifying antirheumatic drugs in rheumatoid arthritis clinical trials. J Rheumatol 1994;41(suppl.):86–9.
  3.  van der Heijde DM. Plain X-rays in rheumatoid arthritis: overview of scoring methods, their reliability and applicability. Baillière's Clin Rheumatol 1996;10:435–53.
  4.  Larsen A, Dale K, Eek M. Radiographic evaluation of rheumatoid arthritis and related conditions by standard reference films. Acta Radiol Diagn Stockh 1977;18:481–91.
  5.  Sharp JT, Young DY, Bluhm GB et al. How many joints in the hands and wrists should be included in a score of radiologic abnormalities used to assess rheumatoid arthritis? Arthritis Rheum 1985;28:1326–35.
  6.  Rau R, Herborn G. A modified version of Larsen's scoring method to assess radiologic changes in rheumatoid arthritis. J Rheumatol 1995;22:1976–82.
  7.  van der Heijde DM, van Riel PL, Nuver Zwart IH, Gribnau FW, van de Putte LB. Effects of hydroxychloroquine and sulphasalazine on progression of joint damage in rheumatoid arthritis. Lancet 1989;i:1036–8.
  8.  Fries JF, Bloch DA, Sharp JT et al. Assessment of radiologic progression in rheumatoid arthritis. A randomized, controlled trial. Arthritis Rheum 1986;29:1–9.
  9.  Ferrara R, Priolo F, Cammisa M et al. Clinical trials in rheumatoid arthritis: methodological suggestions for assessing radiographs arising from the Grisar Study. Ann Rheum Dis 1997;56:608–12.
  10. Salaffi F, Carotti M. Interobserver variation in quantitative analysis of hand radiographs in rheumatoid arthritis: comparison of 3 different reading procedures. J Rheumatol 1997;24:2055–6.
  11. Streiner DL, Norman GR. Health measurements scales. A practical guide to their development and use. Oxford: Oxford University Press, 1995:104–80.
  12. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
  13. Sharp JT. Radiologic assessment as an outcome measure in rheumatoid arthritis. Arthritis Rheum 1989;32:221–9.
Submitted 13 July 1998; revised version accepted 4 June 1999.