University Hospital Maastricht, Maastricht, The Netherlands,
1 Evangelisches Fachkrankenhaus, Ratingen, Germany and
2 Free University, Amsterdam, The Netherlands
Correspondence to:
D. van der Heijde, Department of Rheumatology, Division of Internal Medicine, University Hospital Maastricht, PO Box 5800, 6202 AZ Maastricht, The Netherlands.
Abstract
Methods. Sets of seven radiographs of hands and feet were taken of 20 RA patients with a wide spectrum of radiological damage. For 14 patients, these seven radiographs were taken during a follow-up period of 5 yr, and for six patients during a follow-up of 10 yr. Each set of radiographs was scored twice by the same observer (DvdH). Erosions and joint space narrowing were scored with SHS (range 0–448) in 32 and 30 joints in the hands, respectively, and both in 12 joints in the feet. SENS gives a score of 1 if there is any erosion in a joint and also 1 if there is any narrowing in the joint (range 0–86). In each case, SENS was derived from SHS. To analyse data, generalizability theory and repeated measurements ANOVA were used.
Results. The overall reliability coefficient was 0.81 for SHS and 0.80 for SENS. Intra-observer reliability [intraclass correlation coefficient (ICC)] was 0.99 and 0.98 for SHS and SENS, respectively. The ICC for the sensitivity to change was 0.84 for SHS and 0.88 for SENS. The smallest detectable difference (SDD) could be determined for both methods. The presence of progression based on this SDD was very comparable between the two methods.
Conclusion. The measurement properties of SENS are good and comparable to SHS. This makes SENS suitable for use in clinical practice and in large (epidemiological) studies, especially in the first years of disease.
KEY WORDS: Rheumatoid arthritis, Radiological assessment, Sharp/van der Heijde score, Simplification, Reliability, Sensitivity to change.
Introduction
Several methods are available to score progressive joint damage. Some methods give a global assessment for the whole patient. Other methods give a global joint score and still others score specific joint abnormalities [1, 4–8]. The methods used most widely are the ones developed by Larsen and Sharp [1, 4, 6, 7]. Modified versions of both methods have been developed to overcome some disadvantages of the original methods [9, 10]. In particular, one of the authors (DvdH) has modified Sharp's method to include the joints of the feet [9]. Several studies have shown higher intra- and interobserver reliability, and a higher sensitivity to change in Sharp's method than in Larsen's method [11–13]. Larsen's method has the advantage of being less time consuming than Sharp's method [13]. The disadvantage of both methods is that they require trained observers. For clinical trials, this disadvantage is generally no problem. However, in clinical practice, a less time-consuming and simplified method, with adequate reliability and sensitivity to change, would be desirable.
We propose a simplified method of scoring radiographs for clinical practice based on the Sharp/van der Heijde score (SHS): instead of grading, the number of joints with erosions and the number of joints with joint space narrowing (JSN) are simply summed [simple erosion narrowing score (SENS)]. Similar suggestions have been made in the literature, but have until now never been fully validated [1, 15]. We have tested the simplified method against the reference standard (SHS) in several ways. First, we compared the reproducibility (intra-observer consistency) and the sensitivity to change. Next, we investigated how often progression was seen with SENS and not with SHS (false positive) and vice versa (false negative). We also determined how much the patient's score with SENS had to increase to measure progression reliably. Finally, we checked for a ceiling effect in progression of SENS scores.
Patients and methods
Radiographic analysis
Radiographs were made in posteroanterior view and scored twice by the same trained observer (DvdH) according to a randomization list including 40 sets of radiographs (each set twice). Per patient, the radiographs were scored in chronological order. The observer was unaware of each patient's identity. Erosions and JSN were scored according to SHS [9]. In SHS, erosions are counted in the 10 metacarpophalangeal (MCP) joints, the eight proximal interphalangeal (PIP) joints, the two interphalangeal joints of the thumbs, the right and left first metacarpal bone, the right and left radius and ulna, the right and left trapezium and trapezoid (as one unit; multangular), right and left navicular bones, right and left lunate bones, the 10 metatarsophalangeal (MTP) joints, and the two interphalangeal joints of the big toes. JSN is assessed in the 10 MCP joints, the eight PIP joints, right and left third, fourth and fifth carpometacarpal joints, right and left multangular-navicular joints, right and left capitate-navicular-lunate joints, right and left radiocarpal joints, the 10 MTP joints and the two interphalangeal joints of the big toes. Erosions are scored 1 if there is a discrete interruption of the cortical surface; if there is a larger defect, a score is given according to the surface of the joint involved [2–5]. Consequently, for confluent erosions, the score cannot decrease. In the hands, the maximum erosion score in a joint is 5; in the feet, it is 10.
For JSN, five grades are recognized: 0=normal; 1=focal or doubtful; 2=general, <50% of the original joint space; 3=general, >50% of the original joint space or subluxation; 4=ankylosis. If a joint cannot be scored correctly, e.g. because of previous surgery, the last score of the joint is carried forward. The maximum erosion score is 160 in the hands and 120 in the feet; the maximum scores for JSN are 120 and 48, respectively. The total score is the sum of the scores for erosions and JSN. The maximum total score is 448.
SENS assesses the same joints. The sites that are included in both SHS and SENS are shown in Fig. 1. In SENS, a joint is scored as affected ('1') if there is any erosion in the joint. A joint is scored as affected ('1') for JSN if the joint is scored 1 or more in the original method, i.e. if there is at least focal JSN. So per joint, the score can range from 0 to 2. The number of joints in which erosions can be scored is 32 in the hands and 12 in the feet; the numbers of joints in which JSN can be scored are 30 and 12, respectively. Therefore, the maximum total score of SENS per patient is 86. In this study, the scores for SENS were derived from the SHS scores. However, we also scored 12 films directly with SENS and compared these direct scores with those derived from the SHS scores.
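The scoring rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text, not the software used in the study; the function and constant names are hypothetical, and the per-joint grades in the usage note are invented.

```python
# Joint counts and per-joint maxima as stated in the text.
EROSION_SITES = {"hands": 32, "feet": 12}
JSN_SITES = {"hands": 30, "feet": 12}
MAX_EROSION_PER_JOINT = {"hands": 5, "feet": 10}
MAX_JSN_PER_JOINT = 4  # grade 4 = ankylosis


def max_shs():
    """Maximum total SHS: sum of the maximum erosion and JSN scores."""
    erosion = sum(EROSION_SITES[e] * MAX_EROSION_PER_JOINT[e] for e in EROSION_SITES)
    jsn = sum(JSN_SITES[e] * MAX_JSN_PER_JOINT for e in JSN_SITES)
    return erosion + jsn


def max_sens():
    """Maximum total SENS: one point per erosion site plus one per JSN site."""
    return sum(EROSION_SITES.values()) + sum(JSN_SITES.values())


def sens_from_shs(erosion_grades, jsn_grades):
    """Derive SENS from per-joint SHS grades: any grade of 1 or more counts as 1."""
    eroded = sum(1 for g in erosion_grades if g >= 1)
    narrowed = sum(1 for g in jsn_grades if g >= 1)
    return eroded + narrowed
```

For example, with invented erosion grades `[0, 1, 3, 5]` and JSN grades `[0, 2, 0, 4]`, the SHS subtotal is 15, whereas `sens_from_shs` counts three eroded and two narrowed joints and returns 5.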
Statistical analysis
Reliability.
Reliability was tested by using generalizability theory, a random model ANOVA approach which estimates the components of variance within each model [16]. We used the computer program GENOVA for PCs by Crick and Brennan, which is especially suited for calculating random model variance components within analysis of variance [17, 18]. Elementary sources of variance in data are called facets in generalizability theory. Relevant facets in this study are: method (SHS vs SENS), patient identification number (1–20), type of abnormality (erosion, JSN), extremity (hands, feet), time (1–7) and number of observations (occasion 1, occasion 2). In generalizability theory, a distinction is made between fixed and random facets. The facets 'patient', 'time' and 'number of observations' were defined as random facets, the others as fixed facets. The facet 'time' was nested within the facet 'patient' because of unequal spacing of the radiographs (14 patients were followed for 5 yr and six patients were followed for 10 yr). The overall reliability coefficient over all facets is called the Φ-coefficient, showing the reliability of the methods with all the sources of variance included. Theoretically, the Φ-coefficient ranges from 0 (not reliable at all) to 1.00 (maximum).
So-called decision studies were made within some of the 16 models (Table 1) of both methods to estimate the number of observations needed for a specific level of intra-observer reliability. In addition, the intraclass correlation coefficients (ICC) of some of the models (1–4, 9, 14–16) were calculated to determine intra-observer reliability. These ICC are not identical to the classical definition of ICC. The ICC calculated in this study are called G-coefficients as defined by Streiner and Norman [19]. We retained the term ICC to indicate that the results are comparable to the classical ICC.
Sensitivity to change.
This was measured by estimating relevant ratios of variance components from results of mixed model repeated measurements ANOVA. For each method, Norman's quasi-classical ICC formula for sensitivity to change was calculated with the components of 'time' (T) and 'patient by time' (P×T): $\mathrm{ICC}_s = \mathrm{VC}(T) / [\mathrm{VC}(T) + \mathrm{VC}(P \times T)]$, where VC stands for variance component. To obtain equal time periods between radiographs, so that 'time' could be considered as a fixed facet, the six patients who had been followed for 10 yr were excluded from this analysis.
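As a small illustration of the ratio described above, the following Python sketch computes the ICC for sensitivity to change from two variance components. The function name is hypothetical, and the variance components in the usage note are invented for illustration; they are not the values from this study.

```python
def icc_sensitivity_to_change(vc_time, vc_patient_by_time):
    """Ratio VC(T) / (VC(T) + VC(PxT)) as described in the text.

    vc_time: variance component of the facet 'time' (T).
    vc_patient_by_time: variance component of 'patient by time' (PxT).
    """
    if vc_time < 0 or vc_patient_by_time < 0:
        raise ValueError("variance components must be non-negative")
    total = vc_time + vc_patient_by_time
    if total == 0:
        raise ValueError("at least one variance component must be positive")
    return vc_time / total
```

With invented components VC(T) = 84 and VC(P×T) = 16, the function returns 0.84: the larger the systematic change over time relative to the patient-specific deviations from it, the closer the coefficient is to 1.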
Smallest detectable difference (SDD).
With the results of GENOVA, the SDD can also be calculated for the sample of 14 patients with 5 yr follow-up. This can be done for SHS and SENS separately. The variance component EMS, i.e. the expected mean sum of squares of the facet time crossed with patient (T×P), is needed. The square root of this EMS gives the standard error of measurement (SEM). To decide whether there is real progression or no progression at all, one-sided testing is sufficient; because of paired observations, the result should be multiplied by $\sqrt{2}$. For a 90% confidence interval in 14 patients, the normal range Z score is 1.282 for two radiographs. This results in the following formula: $\mathrm{SDD} = \sqrt{\mathrm{EMS}(T \times P)} \times 1.282 \times \sqrt{2}$ [16]. The above formula is valid if the SDD is based on the information of two (successive) radiographs. To determine the SDD, separate analyses were performed for each pair of successive radiographs, to obtain results that are valid if only two radiographs are available, without the information of the complete series of seven radiographs.
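The SDD calculation described above (square root of the EMS, multiplied by the one-sided Z score of 1.282 and by the square root of 2) can be sketched as follows. The function name is hypothetical and the EMS value in the usage note is invented; it is not a result of this study.

```python
import math


def smallest_detectable_difference(ems_txp, z=1.282):
    """SDD = sqrt(EMS(TxP)) * z * sqrt(2).

    ems_txp: expected mean sum of squares of time crossed with patient.
    z: one-sided normal deviate; 1.282 is the value used in the text.
    """
    if ems_txp < 0:
        raise ValueError("EMS must be non-negative")
    sem = math.sqrt(ems_txp)       # standard error of measurement
    return sem * z * math.sqrt(2)  # sqrt(2) accounts for paired observations
```

For an invented EMS of 4.0, the SEM is 2.0 and the SDD is about 3.63 score points, so under this sketch only an increase larger than that between two successive radiographs would be read as real progression.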
Sensitivity and specificity of progression.
The SDDs for SENS and SHS were used as limits to determine whether a patient showed real progression as assessed by that method. Two by two (2x2) contingency tables were made to assess the sensitivity, specificity and accuracy of SENS compared to the gold standard SHS. These tables were created for every period between two radiographs. The kappa statistic was calculated as a measure of agreement between SHS and SENS.
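The quantities derived from each 2×2 table can be sketched as below, treating progression according to SHS as the reference and progression according to SENS as the test. The function name and the counts in the usage note are hypothetical, and kappa is computed here as the standard Cohen's kappa, which is an assumption about the exact variant used.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy and Cohen's kappa from a 2x2 table.

    tp: progression by both SENS and SHS
    fp: progression by SENS only (false positive)
    fn: progression by SHS only (false negative)
    tn: no progression by either method
    """
    n = tp + fp + fn + tn
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference standard needs both progressors and non-progressors")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "kappa": kappa}
```

With invented counts for 14 patients (5 true positives, 1 false positive, 1 false negative, 7 true negatives), the accuracy is 12/14 and kappa is 68/96, roughly 0.71.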
Results
The time needed to score seven sets of radiographs of the hands and feet of one patient with SHS is ~25 min. The scoring time with SENS is ~7 min. The scores of 12 films assessed directly with SENS were compared with the scores derived from SHS. Out of 1032 joints, 968 showed complete agreement, 30 showed abnormality in the derived score but not in the direct SENS, and 34 in the direct SENS but not in the derived score. The direct scores were obtained more than 2 yr after the original scoring, so these results also reflect intra-observer agreement over a long period.
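The joint-level comparison above can be summarized as a proportion of agreement. The sketch below uses only the counts reported in the text; the function name is hypothetical.

```python
def proportion_agreement(agree, derived_only, direct_only):
    """Fraction of joints on which direct and derived SENS agree."""
    total = agree + derived_only + direct_only
    if total == 0:
        raise ValueError("no joints to compare")
    return agree / total


# Counts reported in the text: 12 films x 86 scorable sites = 1032 joints.
AGREE, DERIVED_ONLY, DIRECT_ONLY = 968, 30, 34
assert AGREE + DERIVED_ONLY + DIRECT_ONLY == 12 * 86
agreement = proportion_agreement(AGREE, DERIVED_ONLY, DIRECT_ONLY)
```

This gives an agreement of about 94%, with the disagreements split almost evenly between the two directions (30 vs 34 joints).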
Reliability/reproducibility
The reliability coefficient (Φ-coefficient) of all 16 models ranged between 0.81 and 0.90 for SHS, and between 0.80 and 0.91 for SENS. The Φ-coefficient of model 16 was 0.81 for SHS and 0.80 for SENS, for seven radiographs and two observations. Table 2 shows the components of variance in model 16 of both methods. As could be expected, the percentages of variance components of 'patient' and that of 'time nested within patient' were by far the largest (99 and 97.7%, respectively). The high level of the latter percentage largely explains the somewhat subdued, but still very good performance of the Φ-coefficient. The fact that patients strongly differed over time, each in his/her own specific way, did much to hinder the overall Φ-coefficient reaching its maximum score of 1.00. The low variance components of 'number of observations', of the interaction of 'patient' and 'number of observations', and of the interaction of 'number of observations' and 'time nested within patient' indicate a high reliability of the scoring method (Table 2). It should also be noted that the Φ-coefficient can be improved somewhat by adopting models other than model 16, which sums over both extremities and both types of abnormality; an example is a model that keeps the information on the extremities and/or the types of abnormality separate (e.g. model 9). The Φ-coefficients of models 1–4, which use partial patient information only, lie between 0.82 and 0.90 (SHS), and 0.82 and 0.91 (SENS). The highest Φ-coefficients are attained with 'erosions in feet only' (model 3) and the lowest with 'erosions in hands only' (model 1). Of course, the validity of the measurement system embodied by model 16 precludes such 'improvements' in reliability. On the whole, the reliability of model 16 seems to be quite acceptable.
The ICC for reliability in 'type of abnormality', which test the correlation between the erosion scores and the JSN scores, were moderate. In models 14 of the SHS and of SENS, these ICC were 0.72 and 0.65. In model 9, these ICC were 0.59 and 0.69, also showing a moderate positive correlation between the erosions and the JSN in the hands, and between the erosions and the JSN in the feet separately. The ICC of the hands and feet were high in model 15 (0.81 and 0.78 for SHS and SENS, respectively), which indicates an acceptable correlation between the hand scores and the foot scores. The ICC of the hands and feet decreased in model 9 (0.68 and 0.59), showing a moderate correlation between the erosions in the hands and erosions in the feet, and between the JSN in the hands and the JSN in the feet.
Sensitivity to change
Table 3 shows the results of repeated measurements ANOVA combined with the results of the calculation of the components of variance (model 16). Based on seven radiographs scored by one observer, the sensitivity to change is 0.88 in SENS and 0.84 in SHS.
Discussion
This study showed that SENS had a reliability equal to that of SHS. The intra-observer reliability was very high in both methods and did not decrease when only parts of the radiographs were scored (only erosions, only hands, etc.). As could be expected, the greatest source of variability in scores was the diversity of patients and the diversity in the course of progression of the individual patient. The latter source of variance induced a decrease in the reliability coefficients, but given that other, potentially more important sources of variability (repeated observations at each point in time) turned out to be of only minor importance, both methods fared quite well. The ICC of the facets extremity and type of abnormality show that the hand scores and the foot scores, and the erosion scores and the JSN scores, do agree with each other. This agreement was not so high that one part can easily be omitted without loss of information. Fries et al. [3] described the additional information of erosions and JSN before. Also, the importance of including the foot joints in a scoring method has been described before [15]. In conclusion, both hands and feet, and both erosions and JSN, should be scored to get the maximum information.
The sensitivity to change of SENS, expressed as an ICC, was very similar to that of SHS. As far as data are available, the original method of SHS seems to have a greater sensitivity to change and a higher reliability than other scoring methods, such as Larsen's method and the CMC ratio [11, 13]. SHS has the best sensitivity to change compared to some other radiographic methods, including Sharp's method [14]. Therefore, we used the most sensitive method available as the gold standard with which to compare SENS. SENS' sensitivity to change can be considered as good.
Expressed as a percentage of the maximum scores, the means of SENS were higher than the means of SHS, indicating that in these patients many joints are damaged, but the amount of damage per joint is limited (Fig. 2). Van der Heijde et al. [15] made similar observations in another larger cohort of patients with a follow-up of 3 yr.
The results of this study indicate that damage to joints of patients with RA can be scored reliably with SENS during the first 5 yr. More data are needed to judge the performance of the method with longer disease duration. Also, interobserver agreement will have to be included in future studies of SENS. Because of the time saved and the results of this study, SENS seems useful in clinical practice in at least the first 5 yr of RA.
References