Behçet's disease: evaluation of a new instrument to measure clinical activity

B. B. Bhakta, P. Brennan1, T. E. James2, M. A. Chamberlain, B. A. Noble3 and A. J. Silman1

Rheumatology and Rehabilitation Research Unit, University of Leeds,
1 ARC Epidemiology Unit, University of Manchester,
2 Department of Ophthalmology, St James University Hospital, Leeds and
3 Department of Ophthalmology, The General Infirmary at Leeds, Leeds, UK

Correspondence to: B. B. Bhakta, Rheumatology and Rehabilitation Research Unit, University of Leeds, 36 Clarendon Road, Leeds LS2 9NZ, UK.


    Abstract
 
Objective. Behçet's disease (BD) is a rare multisystem disorder characterized by vasculitis. At present, there are no laboratory markers that correlate well with the clinical activity in BD. This has led to the development of an instrument (BD Current Activity Form) to measure activity. Scoring is based on the history of new clinical features present over the 4 weeks prior to assessment. Standardized questions were developed for all parts of the form. The face validity of the proforma was determined following worldwide collaboration with physicians and ophthalmologists managing patients with BD. The aim of this study was to evaluate the interobserver reliability of this form.

Methods. Nineteen patients fulfilling the International Study Group criteria for BD were randomly allocated, questioned and examined independently on the same day by five physicians experienced in BD.

Results. There was good agreement between the physicians' rating of oral [intraclass correlation coefficient (ICC)=0.87] and genital (ICC=0.95) ulceration, skin involvement (ICC=0.62 for pustules and ICC=0.66 for erythema nodosum), arthritis (ICC=0.62), headache (ICC=0.80), large vessel (kappa=0.53), nervous system (kappa=0.61) and eye involvement (kappa=0.77). There was poor agreement for the question relating to the presence of bloody diarrhoea (ICC=0.28). There was significant bias in the rating of fatigue by one of the physicians (F=5.2, P=0.001).

Conclusion. Overall, this instrument has good interobserver reliability for assessing general disease activity. We therefore suggest that this proforma has a place in routine clinical monitoring of patients with BD, as well as assessing outcome in therapeutic trials.

KEY WORDS: Behçet's disease, Activity, Measurement, Reliability.


    Introduction
 
Behçet's disease (BD) is a rare multisystem disorder characterized by vasculitis. The classification of this condition depends on the presence of clinical features as defined by the International Study Group (ISG) on Behçet's Disease [1]. The availability of an international, standardized, reliable measure of disease activity is of prime importance in monitoring the natural history and effects of therapeutic intervention in BD [2]. At present, there are no laboratory markers that correlate well with clinical activity in BD. This has led to the development of a standardized proforma [Behçet's Disease Current Activity Form (BDCAF)] to assess disease activity which is based on history of clinical features.

Previous work [3] compared two schemes that were available to assess disease activity: the Iran Behçet's Disease Dynamic Activity Measure (IBDDAM) [4] and a European scheme initially developed in the UK. Interobserver and intra-observer agreement between two clinicians using both forms were assessed in 13 patients with BD as defined by the ISG criteria. Reliability depends not only on the patient's accurate recall of symptoms, but also on the clinician's interpretation of them. This study suggested that agreement between clinicians in scoring of clinical features was greater when the standard period was 28 days, as in the European form, compared with the Iranian form in which a variable time period is taken [3]. Although there was greater variability in scoring when the Iranian form was used, the opinion of the clinicians was that both forms had good aspects and that an internationally accepted activity form could be derived from them without great difficulty.

As a result of this study, a prototype form was developed incorporating aspects of both forms. This was circulated to all members of the International Scientific Committee for comments. A workshop was held in Leeds, UK, in 1994 to arrive at a consensus view about the contents of the activity form (face validity) with emphasis on the need for clarity and consistency for potential use by clinicians worldwide. The inclusion of laboratory parameters within the activity form was raised by various members of the International Scientific Committee on Behçet's Disease. Although it was agreed that inclusion of erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP) measurements would not add significantly to overall measurement of disease activity [5], it was appreciated that where disease appeared to be clinically inactive, a raised ESR or CRP might prompt further investigation. There was general agreement that standardized questions should be developed for each organ system which could be readily translated for international use. The interobserver reliability of this new instrument developed from these discussions is presented here.


    Materials and methods
 
Subjects
Twenty patients were recruited from the combined Rheumatology/Ophthalmology Behçet's disease clinic at Leeds General Infirmary. All patients recruited fulfilled the ISG criteria for the classification of BD. Five physicians (four rheumatologists and one clinical immunologist) experienced in assessing clinical activity in BD acted as the assessors.

Instrument
The BDCAF scores oral and genital ulceration, skin, joint and gastrointestinal involvement, presence of fatigue and headache according to the duration of symptoms. The presence and type of large-vessel and central nervous system (CNS) involvement are documented. Eye activity was deemed present if there was a history of blurring of vision or if the eye was painful or red. All patients were examined by an ophthalmologist who completed the Behçet's Oculopathy Index. This index was subject to a separate reproducibility evaluation in a subsequent study. In addition, patients were asked to rate on a seven-point scale how active they felt their BD had been over the preceding 4 weeks and on the day of assessment. The clinicians also completed a seven-point rating scale to assess their opinion of overall activity. Only new symptoms over the preceding 4 weeks that the clinicians felt were due to BD were scored. Standardized questions were developed for all parts of the form. For use during routine clinical practice, changes to current medication could also be documented. The layout and instructions for scoring are shown in Fig. 1a and b, respectively.



 
FIG. 1.  (a) The Behçet's Disease Current Activity Form. (b) Scoring system.

 
Method
The proforma was circulated to the assessors prior to the reliability study to familiarize them with the layout and instructions. This study was conducted over 1 day. All subjects were randomly allocated to the assessors (1–5). All subjects were independently questioned and examined on the same day by all observers. Scoring was based on the patient's response to a series of standard questions for each organ system subscale comprising the general disease activity form. All assessment forms were coded to prevent patient identification. At the end of each assessment, any qualitative difficulties in completing the forms were documented by the assessors.

Statistical analysis
The interobserver reliability of the scoring was assessed with respect to two properties: bias and agreement. The presence of bias relates to the systematic deviation between observers in their scoring patterns, while the level of agreement reflects the extent of random differences in scoring between the five observers. A high level of reliability for the form requires a high level of agreement in the score for each organ system in the absence of bias.

The level of agreement for each organ system was assessed by calculating the kappa statistic. This is a chance-corrected measure of agreement, with values close to one indicating a high level of agreement (Table 1). For dichotomous variables, such as assessment of eye, nervous system and major vessel involvement, the calculation of the kappa statistic between two observers is a simple procedure. An overall kappa for all five observers was also calculated.


 
TABLE 1.  Interpretation of the kappa statistic
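
As an illustration of the pairwise calculation described above, the following is a minimal sketch of the unweighted (Cohen's) kappa for two observers rating a dichotomous feature. The function name and the ratings are hypothetical and are not taken from the study data; the method used here for the overall five-observer kappa is not specified in the text and is not reproduced.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two observers on categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of patients")
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: proportion of patients given the same rating by both.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each observer's marginal proportions.
    p_expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        return 1.0  # both observers used one and the same category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings (not study data): 1 = involvement present, 0 = absent.
assessor_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
assessor_2 = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohen_kappa(assessor_1, assessor_2)
print(round(kappa, 2))
```

In this made-up example the observers agree on 8 of 10 patients, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw proportion of agreement.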
 
For those clinical features measured on a five-point scale, a weighted kappa statistic was calculated in order to incorporate the relative seriousness of the different amounts of disagreement. The weighted kappa statistic can be calculated between pairs of clinicians; however, the calculation of an overall weighted kappa value for all five clinicians is not possible using this method. Therefore, an overall kappa value for all five clinicians was estimated by using the intraclass correlation coefficient (ICC).
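
The pairwise weighted kappa can be sketched as follows. The weighting scheme used in the study is not stated, so this example assumes linear disagreement weights on a scale coded 0 to 4; the function name and ratings are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories=5):
    """Linearly weighted kappa for two observers scoring 0..n_categories-1."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of patients")
    k = n_categories
    if any(r not in range(k) for r in list(ratings_a) + list(ratings_b)):
        raise ValueError("ratings must be integers between 0 and n_categories - 1")
    n = len(ratings_a)

    def weight(i, j):
        # Assumed linear weights: the penalty grows with the distance between scores.
        return abs(i - j) / (k - 1)

    # Marginal proportion of each score for each observer.
    prop_a = [ratings_a.count(c) / n for c in range(k)]
    prop_b = [ratings_b.count(c) / n for c in range(k)]
    # Mean weighted disagreement actually observed.
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n
    # Mean weighted disagreement expected by chance from the marginals.
    expected = sum(
        weight(i, j) * prop_a[i] * prop_b[j] for i in range(k) for j in range(k)
    )
    if expected == 0:
        return 1.0  # both observers gave every patient the same single score
    return 1 - observed / expected


# Hypothetical five-point scores (not study data) from two clinicians.
clinician_1 = [0, 1, 2, 3, 4, 2]
clinician_2 = [0, 1, 2, 3, 4, 3]
kw = weighted_kappa(clinician_1, clinician_2)
print(round(kw, 3))
```

With these weights a disagreement of one point is penalized a quarter as heavily as a disagreement across the whole scale, which is the sense in which the statistic incorporates the seriousness of each disagreement.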

The possibility of bias was also considered separately for each organ system. The proportions of patients who were scored as having involvement within each of the dichotomous variables (eye, CNS and major vessel involvement) were compared using the χ² statistic (values >3.84 indicating bias). For the organ systems that were rated using a five-point scale, the distribution of scoring was compared using the ANOVA method to detect any clinician consistently scoring higher or lower for an organ system. The strength of any bias is given as a P value. Finally, comparison of the clinicians' and patients' perception of overall disease activity was calculated using the ICC.
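
One way to see how bias and agreement are separated for the five-point scales is a two-way analysis of variance of a patients-by-assessors table of scores. The sketch below is an assumption about the general approach, not the authors' code: the exact ANOVA model and ICC form used in the study are not stated, so it uses the two-way random-effects, absolute-agreement, single-rater ICC, and the scores are hypothetical.

```python
def rater_bias_and_icc(scores):
    """Return (F for the assessor effect, ICC) from a patients x assessors table.

    scores: list of rows, one per patient, each holding one score per assessor.
    The ICC form is an assumption (two-way random effects, absolute agreement).
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of assessors
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with at least 2 patients and 2 assessors")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Partition the total sum of squares into patient, assessor and residual parts.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_patients = k * sum((m - grand) ** 2 for m in row_means)
    ss_assessors = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_patients - ss_assessors
    ms_patients = ss_patients / (n - 1)
    ms_assessors = ss_assessors / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_error == 0:
        raise ValueError("residual variance is zero; F is undefined")
    # A large F suggests that at least one assessor scores systematically high or low.
    f_assessors = ms_assessors / ms_error
    icc = (ms_patients - ms_error) / (
        ms_patients + (k - 1) * ms_error + k * (ms_assessors - ms_error) / n
    )
    return f_assessors, icc


# Hypothetical scores (not study data): 4 patients rated by 3 assessors.
example_scores = [
    [1, 1, 2],
    [3, 2, 3],
    [4, 4, 4],
    [0, 1, 2],
]
f_value, icc_value = rater_bias_and_icc(example_scores)
print(round(f_value, 2), round(icc_value, 2))
```

Converting F to a P value requires the F distribution with (k - 1) and (n - 1)(k - 1) degrees of freedom, which is left to a statistics package rather than implemented here.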


    Results
 
Of 20 patients initially recruited, one became ill during the study and was unable to complete it. Data on the remaining 19 patients are presented. Table 2 shows the bias and level of agreement between assessors when rating oral and genital ulceration, skin, joint and gastrointestinal involvement. Level of agreement is indicated by the ICC. Overall, the ICCs are >0.60 for oral and genital ulceration, skin, joint involvement and the presence of headache, implying good agreement between assessors' rating of these symptoms. There was, however, poor agreement for rating the presence of bloody diarrhoea. There was generally little bias between the assessors apart from rating fatigue, where a significant bias was identified. This bias related to Assessor 3 who consistently rated this symptom lower than the other four assessors. Repeat analysis omitting this assessor showed very good agreement and no systematic bias between the remaining four assessors (ICC=0.82; F=0.087, P=0.967).


 
TABLE 2.  Assessment of bias and agreement between observers
 
Table 3 shows the kappa values between the assessors' rating of the presence of eye activity. The overall kappa is 0.77, implying good agreement between the assessors about the presence of activity. There was significant bias in the assessment of eye disease by Assessor 5 who consistently rated more patients as having active eye disease when compared to the other four assessors.


 
TABLE 3.  Agreement between assessors with regard to the presence of eye activity
 
Table 4 shows the level of agreement between the assessors in identifying active CNS involvement. The overall kappa is 0.61, indicating good agreement between the assessors about the presence of new nervous system involvement. Assessment of bias showed a complicated relationship between assessors, with Assessors 4 and 5 consistently finding more new nervous system activity than Assessors 1 and 3. There was considerable difficulty in categorizing the type of nervous system involvement. In eight patients who were deemed to have new neurological involvement by one or more of the assessors, only in three could the assessors categorize the type of involvement on clinical examination.


 
TABLE 4.  Agreement between assessors with regard to new activity in the CNS
 
The level of agreement in identifying new large-vessel involvement is shown in Table 5. Although there is complete agreement between Assessors 2 and 4 and between Assessors 3 and 5, the overall kappa is 0.53, implying only moderate agreement between all the assessors about the presence of new large-vessel involvement. Significant bias was present in the rating given by Assessor 1 who consistently found more new large-vessel activity. There was again considerable difficulty in categorizing the type of large-vessel involvement. Four patients were deemed to have new large-vessel involvement by one or more of the assessors, but in only two patients was the type of involvement identified.


 
TABLE 5.  Agreement between assessors with regard to new large-vessel involvement
 
ICCs were calculated to determine the agreement between assessors about the amount of overall activity over the preceding 28 days. There were some differences of opinion, with an ICC value of 0.48 suggesting only moderate agreement. The patient's rating of disease activity over the preceding 28 days was compared with the clinician's rating of disease activity over the preceding 28 days. On average, the patient tended to score well-being lower than the clinician. The amount of bias was -0.94, suggesting that the patients tended to rate their well-being approximately one point (one face) lower than the clinician's impression. Comparison between the patient's self-rating score of well-being today and over the preceding 4 weeks showed a bias of 0.46. This suggests that patients tended to rate their well-being better on the day of assessment than over the preceding 28 days.


    Discussion
 
The problem of scoring new disease activity and disaggregating it from established damage and resultant functional loss has exercised clinicians treating several multisystem diseases. Collaborative efforts have led to the establishment of several systemic lupus erythematosus indices [6–8]. A similar approach has been used in the development of the BD activity proforma. As with these indices, measurement of BD activity has to encompass several organ systems. Disease 'severity', disease 'activity' and established damage should be considered separately. The scoring system must (a) adequately span the range between no activity and maximum activity in each system, (b) be free of systematic bias between different clinicians, (c) be valid, (d) be reproducible, (e) be comprehensive enough to incorporate geographical variability in clinical features [9] and (f) be simple enough to use in routine clinical practice.

Several instruments have been developed worldwide incorporating broadly similar organ system subscales. In some instruments, the measurement of disease activity relies solely on clinical features [2], while others include laboratory investigations, and changes in body weight and temperature. Although changes in body weight and temperature may indicate systemic activity, we have found them neither specific nor sensitive enough as markers for disease activity. Similarly, haemoglobin and ESR do not correlate well with activity [5].

In common with many other rheumatic diseases, the clinical features in BD vary considerably over time. In order to document this variation, new clinical features present over the preceding 28 days are scored using the BDCAF. This represents a compromise between assessing disease activity based on (a) clinical features on the day of assessment, which may be unrepresentative of overall disease activity and (b) clinical features present over a longer time period, as in the IBDDAM, which reduces reliability in terms of accurate recall of symptoms by the patient.

The disease activity rating for oral and genital ulceration, and skin lesions using the BDCAF relies solely on the duration of symptoms and does not take into account the size or number of lesions present, which might also reflect activity. Unfortunately, although documentation of the latter may be more representative of activity, its reliability is likely to be poor because of the difficulty for patients in recalling these symptoms accurately (the number and size of ulcers/skin lesions). There was good agreement between assessors in the scoring of oral and genital ulceration, and joint and skin manifestations using duration of symptoms alone. Although duration alone may not encapsulate all the features of activity in these organ systems, this scoring appears to be relatively simple to use, reliable and free of bias. Fatigue is a common problem in patients with BD, although it is not known how it correlates with the other clinical features. Assessor 3 consistently rated the fatigue symptoms lower than the other four assessors. The results from this study confirm that clinicians may differ in whether they attribute fatigue to BD or to other conditions that may co-exist, such as fibromyalgia. Although the presence of fibromyalgia was not specifically identified in this study, further studies are needed to quantify fibromyalgic symptoms in patients with BD.

Surrogate indicators are required for lesions that are less visible. Routine direct scoring of gastrointestinal (GI) activity is difficult. The scale to assess GI tract activity is based on two questions designed to identify upper and lower GI tract inflammation. The scoring represents a compromise between ease of monitoring activity and the accuracy with which the answers to the questions reflect inflammation. The lack of agreement in rating the presence and duration of bloody diarrhoea suggests that it is not often easy to determine from history alone whether this symptom relates to mucosal inflammation or is merely the result of a combination of other conditions such as drug-related diarrhoea (e.g. colchicine) and bleeding haemorrhoids. This item was included in the BD proforma as a result of its use in a proforma for measuring clinical activity in inflammatory bowel disease by the gastroenterology service. Patients in whom mucosal inflammation was suspected were referred to the gastroenterology service for further investigation and advice.

The need for surrogate indicators of activity also applies to large-vessel and nervous system involvement. Currently, MRI does not have a role in the monitoring of neurological disease activity as MRI lesions can be identified in patients with [10] and without [11, 12] clinical evidence of CNS involvement. Therefore, assessment of CNS involvement in BD is based on clinical features. As CNS involvement may be a solitary event (e.g. stroke) or relapsing (e.g. aseptic meningitis), it seems appropriate to score activity based on the site of lesion (e.g. meningeal, hemispheric, basal ganglia, brain stem and spinal cord), bearing in mind that the pathophysiology of CNS involvement remains unclear. This system would have merit as the location of presumed pathology could be determined by clinical features and radiological imaging. Currently, there is not sufficient knowledge of the natural history of lesions at various sites to enable the severity of the type of involvement to be graded and, therefore, the BDCAF aims to record and categorize neurological events during the course of the disease. This study shows that while there was good agreement in terms of identifying new CNS involvement, there remained disagreement between the assessors on the probable location of lesions based on clinical history and examination. From a practical perspective, further opinion is sought from a neurologist when neurological symptoms or signs are present (particularly if they are new), which may then require the CNS score to be amended.

Although the skin pathergy reaction is highly specific for BD, there is considerable variation in the rate of positivity in patients from different geographical areas, which limits its clinical usefulness. A positive pathergy reaction is common in patients from Iran, Turkey and Japan, but rare in those from the UK, the USA and France. Although this test has diagnostic importance, there is no evidence to suggest that its presence correlates with disease activity. The main difficulties with using this test as a measure of disease activity are the lack of consensus on the procedure (e.g. the optimal number of needle pricks needed [13]) and the practicalities, in a routine out-patient clinic, of grading the reaction 2 days after administration. For these reasons, it was not included as part of the disease activity assessment in the BDCAF.

The reliability of scores obtained for the patient and physician perception of overall activity was only moderate and associated with bias. While an overall perception of activity is important, moderate reproducibility and the presence of bias may limit its usefulness.

The reliability and ease of use of this activity form have to be balanced against the validity of the questions being asked. If more serious manifestations can be predicted by readily assessed symptoms such as oral and genital ulceration, skin lesions, arthritis and superficial thrombophlebitis, then these organ system subscores could be expanded to provide a more accurate representation of activity. This study has highlighted the difficulty in reliably scoring uncommon manifestations such as large-vessel involvement, GI tract inflammation and nervous system involvement.

This new instrument offers an easy-to-complete and reliable method of assessing and documenting clinical activity in patients with BD for use in routine clinical practice. The proforma takes 5–10 min to complete (longer if detailed examination of the vascular tree or nervous system is required). For the purpose of research, treatment trials targeted at a specific organ system would require a more comprehensive measure of activity within that organ system. For instance, a study of treatment aimed at oral ulceration (e.g. thalidomide [14]) would not only require assessment of the duration of symptoms, but also other features which may relate to activity (number of ulcers, size of ulcers, number of crops of ulcers, site of ulcers, etc.). However, it is recommended that such detailed assessment is accompanied by this validated general disease activity instrument to alert the clinician conducting any trials to any advantageous or deleterious effects of the trial drug on other organ systems.


    Acknowledgments
 
The authors are grateful to Dr Colin G. Barnes, Professor Hasan Yazici and Dr John M. Bamford for their advice during the development of this disease activity proforma. We would also like to thank the nursing staff of the ophthalmic out-patient department for their assistance during the reproducibility study. This research was supported by a grant from the Arthritis and Rheumatism Council.


    References
 

  1.  International Study Group for Behçet's Disease. Evaluation of diagnostic (classification) criteria in Behçet's Disease—towards internationally agreed criteria. Br J Rheumatol 1992;31:299–308.
  2.  Chamberlain MA, Noble BA, Behçet's U.K. Study Group. Disease activity in Behçet's Disease. In: O'Duffy JD, Kokmen E, eds. Behçet's Disease, basic and clinical aspects. New York: Marcel Dekker, 1991:299–302.
  3.  Bhakta B, Hamuryudan V, Brennan P, Chamberlain MA, Barnes C, Silman AJ. Assessment of disease activity in Behçet's disease. In: Wechsler B, Godeau P, eds. Excerpta Medica Int Congress Series 1037, 611. Amsterdam: Elsevier Science Publishers BV, 1993:235–40.
  4.  Davatchi F, Akbaran M, Shahram F et al. Iran Behçet's Disease Dynamic Activity Measure. Hung Rheumatol 1991;32(suppl.):FP10–100 (Abstracts of the XIIth European Congress of Rheumatology).
  5.  Muftuoglu AU, Yazici H, Yurdakul S et al. Behçet's disease. Relation of serum C-reactive protein and erythrocyte sedimentation rates to disease activity. Int J Dermatol 1986;25:235–39.
  6.  Symmons DP, Coppock JS, Bacon PA et al. Development and assessment of a computerised index of clinical disease activity in systemic lupus erythematosus. Members of the British Isles Lupus Assessment Group (BILAG). Q J Med 1988;69:927–37.
  7.  Gladman DD, Goldsmith CH, Urowitz MB et al. Crosscultural validation and reliability of 3 disease activity indices in systemic lupus erythematosus. J Rheumatol 1992;19:608–11.
  8.  Liang MH, Socher SA, Larson MG, Schur PH. Reliability and validity of six systems for the clinical assessment of disease activity in systemic lupus erythematosus. Arthritis Rheum 1989;32:1107–18.
  9.  O'Neill T, Rigby AS, McHugh S, Silman AJ, Barnes CG on behalf of the International Study Group for Behçet's disease. Regional differences in clinical manifestations of Behçet's Disease. In: Wechsler B, Godeau P, eds. Excerpta Medica Int Congress Series 1037, 611. Amsterdam: Elsevier Science Publishers BV, 1993:159–63.
  10. Wechsler B, Dell'Isola B, Vidailhet M et al. MRI in 31 patients with Behçet's disease and neurological involvement: prospective study with clinical correlation. J Neurol Neurosurg Psychiatry 1993;56:793–8.
  11. Morrissey SP, Miller DH, Hermaszewski R et al. Magnetic resonance imaging of the central nervous system in Behçet's disease. Eur Neurol 1993;33:287–93.
  12. Besana C, Comi G, Del Maschio A et al. Electrophysiological and MRI evaluation of neurological involvement in Behçet's disease. J Neurol Neurosurg Psychiatry 1989;52:749–54.
  13. Dilsen N, Konice M, Aral O, Inanc M, Gul A, Ocal L. Important implications of skin pathergy test in Behçet's Disease. In: Wechsler B, Godeau P, eds. Excerpta Medica Int Congress Series 1037, 611. Amsterdam: Elsevier Science Publishers BV, 1993:229–33.
  14. Revuz J, Guillaume J, Janier M et al. Crossover study of thalidomide versus placebo in severe recurrent aphthous ulceration. Arch Dermatol 1990;126:923–7.
Submitted 2 July 1998; revised version accepted 15 March 1999.