The Behçet's Disease Activity Index

G. Lawton, B. B. Bhakta, M. A. Chamberlain and A. Tennant

Academic Unit of Musculoskeletal and Rehabilitation Medicine, University of Leeds, Leeds, UK

Correspondence to: B. B. Bhakta, Academic Unit of Musculoskeletal and Rehabilitation Medicine, University of Leeds, 36 Clarendon Road, Leeds LS2 9NZ, UK. E-mail: b.bhakta@leeds.ac.uk

Abstract

Objective. To identify a subset of clinical features of Behçet's disease (BD) that can be summated to form an overall index of disease activity appropriate for clinical and research use internationally.

Methods. Completed Behçet's Disease Current Activity Forms were collected from a total of 524 patients with BD from five countries. The data from 14 questions on the form were subjected to Rasch analysis to establish whether these items form a hierarchical and unidimensional scale of disease activity, both within and between countries.

Results. The data showed a good fit to the Rasch model within three countries using a dichotomous scoring function. However, when the data from these three countries were pooled, the fit to the model was poor. Cross-cultural differential item functioning (DIF) was found in seven items in the pooled data. When the items with DIF by country were separated and two items were removed, the resulting 26-item scale showed a good fit to the Rasch model.

Conclusions. Within Turkey, Korea and the UK, the 14 items can be summated to give an index of disease activity. Analysis of the pooled data confirmed that the index is not suitable for comparison between countries or for pooling of data in the raw form, but after fitting the data to the Rasch model such comparisons can be made. This gives a scaling tool that is quick and easy to use in the clinical situation.

KEY WORDS: Behçet's disease, Disease activity, Rasch analysis, Cross-cultural validity

Currently, there are few reliable laboratory markers that reflect the fluctuating clinical signs and symptoms in Behçet's disease (BD). Judgement of disease activity is based on clinical features. There is a need to develop a standardized assessment of current disease activity for use in monitoring disease progression and for evaluating the effects of therapeutic interventions. In addition, any assessment would need to be validated for cross-cultural comparability before it is used internationally.

The Behçet's Disease Current Activity Form (BDCAF) was developed with this in mind on behalf of the International Scientific Committee on Behçet's Disease. The content of the BDCAF was based on previous work [1] that compared two schemes that were available to assess disease activity: the Iranian Behçet's Disease Dynamic Activity Measure (IBDDAM) [2] and a European scheme initially developed in the UK. Although there was greater variability in inter- and intra-rater scoring when the Iranian form was used, the opinion of the clinicians participating in the study was that both forms had important aspects that could be incorporated into an internationally accepted activity form. As a result of this study, a prototype form was developed incorporating aspects of both forms. This was circulated to all members of the International Scientific Committee for comments. A workshop was held in Leeds, UK in 1994 to collate these comments and arrive at a consensus view about which clinical signs and symptoms to include within the activity form (BDCAF). There was an emphasis on the need for clarity in the wording of questions to allow potential use by clinicians worldwide. There was general agreement that standardized questions, which could be readily translated for international use, should be developed for each organ system. Previous research on the BDCAF has found good inter-observer reliability for assessing general disease activity in British patients [3], and for the orogenital ulcers and eye involvement of BD using a translated version on Turkish patients [4]. A further study found that the agreement between clinicians was better using the BDCAF, which scores clinical features for the past 28 days, than using the IBDDAM, in which a variable time period is taken [5].

The current project was undertaken with the aim of confirming the construct and cross-cultural validity of a previously identified subset of clinical features [6] from the BDCAF. Preliminary work used Rasch analysis to identify the clinical features which best fitted the unidimensional construct of ‘disease activity’. The aims of this project were to reassess the psychometric properties of these clinical features in a larger international data set, to verify that they can be summated, and to produce an index to indicate how active the patient's disease has been in a defined period (1 month) preceding the consultation. The resulting index should measure only disease activity (i.e. it should be unidimensional), as distinct from permanent damage. When the intention is to sum the scores on any scale it is essential that all items of the scale are measuring the same construct, otherwise it is not possible to interpret the clinical significance of the total score. The validity of the index should also extend to its use in comparing disease activity in populations from different countries. This is particularly important in the context of multicentre international trials of therapeutic interventions.

Methods

Data collection
The clinical disease activity data used in this study were collected during the patients' routine medical care. Completed BDCAFs were returned from the five countries that participated in this study (China, Korea, Iraq, Turkey and the UK). The analysis used the responses to the 12 items on the BDCAF previously identified as a potential unidimensional scale [6], together with the two questions which assess disease activity from the patient's and the clinician's perspective. These two questions each used a Likert scale, represented by ‘smiley’ faces ranging from very bad to very good, to indicate how the patient or the clinician felt the disease had been over the past 4 weeks. The 12 items recorded the presence or absence (over the 4 weeks prior to the clinic visit) of arthralgia, arthritis, diarrhoea, erythema nodosum, eye inflammation, genital ulcers, headaches, mouth ulcers, nausea/vomiting, new central nervous system involvement, new major vessel inflammation, and pustules.

Statistical analysis
A glossary of Rasch terminology is given in Appendix 1.

The Rasch model was used to determine the internal construct validity of the index, and to test for the cross-cultural equivalence of the items in the index. The model assumes that the probability of a particular patient affirming a given item or category is a logistic function of the severity of the item and the activity of the patient's disease [7]. For the index, the more active the disease the more likely the patient is to affirm any given item.

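The Rasch model (dichotomous case) is given by the equation:

$$P_{ni} = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}$$

where $P_{ni}$ is the probability that person $n$ will affirm item $i$ (in educational testing, answer the item correctly or be able to do the task specified by that item), $\theta_n$ is the disease activity of person $n$, and $b_i$ is the activity parameter of item $i$.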
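The dichotomous Rasch model is a logistic function of the difference between person activity and item location, both expressed in logits. The following is a minimal sketch of that relationship; the function name `rasch_probability` and the example values are illustrative assumptions and are not taken from the study or from RUMM 2010.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability that a person with activity `theta` affirms an item
    with location `b` (both in logits) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# When person activity equals item location, the two outcomes are equally likely.
p_equal = rasch_probability(0.0, 0.0)

# For the same item, a patient with more active disease is more likely to affirm it.
p_high = rasch_probability(2.0, 0.0)
p_low = rasch_probability(-2.0, 0.0)
```

Because only the difference between the two parameters enters the formula, persons and items are located on the same logit scale.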

An implicit assumption in this approach is that the items in the index will form a hierarchical scale that measures the spectrum from no disease activity to severe disease activity.

Where data fit this model (i.e. the observations accord with the model expectations), then, given local independence (i.e. the responses to a given item are independent of the responses to other items in the scale [8]), the data are derived from a unidimensional scale.

Fit of the data to the model is assessed by a number of fit statistics [9]. The overall fit of the scale is given by the item–trait interaction χ² statistic. This statistic indicates any significant deviation of the data from the Rasch model, and hence how well the items fit together to form a hierarchical and unidimensional scale. It is calculated by summing the χ² values for the individual items (see below) and deriving the significance value from the summated degrees of freedom (d.f.).

An estimate of the reliability of the scores on the index can also be made. This is based on the traditional method of reliability estimation, Cronbach's α. However, instead of the raw scores, the activity estimates on the logit scale for each person are used to calculate reliability. This is called the ‘person separation index’; its estimates will be very similar to the value of Cronbach's α, hence the interpretation is the same, i.e. a value above 0.8 is very good.

In addition, the individual fit to the model of each item can be considered. The residual value is the standardized difference between each person's actual and predicted response to an item, which is summed over all persons and standardized (i.e. divided by the standard deviation). If this value is outside the desired range of ±2.5, the item is regarded as showing misfit to the model. Individual item fit is also assessed by the χ² test. The χ² values are calculated by grouping the patients into class intervals (approximately 50 patients in each) on the basis of their overall level of disease activity, and comparing the mean expected response within each class interval with the observed responses. A significant χ² value indicates misfit.

As the data were collected from five countries, the analysis was done first within countries (the data from each country were analysed separately) and then across countries (the data from all countries were analysed together), specifically to assess cross-cultural differential item functioning (DIF).
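Two of the quantities described here can be sketched in a few lines. The block below shows a standardized residual for a single dichotomous response and a person separation index computed as the share of observed variance in person estimates that is not error variance. Both are common textbook formulations; whether RUMM 2010 computes them in exactly this way is an assumption, and all names and numbers are hypothetical.

```python
import math
from statistics import mean, variance


def expected_response(theta: float, b: float) -> float:
    """Model-expected probability of affirming the item (dichotomous Rasch)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def standardized_residual(x: int, theta: float, b: float) -> float:
    """(observed - expected) / standard deviation of a dichotomous response."""
    p = expected_response(theta, b)
    return (x - p) / math.sqrt(p * (1.0 - p))


def person_separation_index(estimates, standard_errors) -> float:
    """(observed variance - mean error variance) / observed variance."""
    observed_var = variance(estimates)  # sample variance of the logit estimates
    error_var = mean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var


# Hypothetical values: a patient located exactly at the item affirms it.
z = standardized_residual(1, 0.0, 0.0)

# Hypothetical person estimates (logits) and their standard errors.
psi = person_separation_index([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

A larger error variance relative to the spread of person estimates pulls the index towards zero, which is why small or homogeneous samples tend to yield low reliability.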

Rasch analysis also allows us to evaluate the consistency of the fit of data from different countries. That is, we can examine whether the scale is working the same way in different countries. This is done using DIF analysis. Within the framework of Rasch measurement, the scale should work in the same way irrespective of which group is being assessed. Thus, the probability of affirming an item should be the same between groups, given the same trait level [10]. Analysis of variance (ANOVA) based on the residuals (calculated for each person) is used to check for the presence of bias between countries. Again, the patients are split into class intervals and the comparison is made between the subgroups at the same level (or class interval) of disease activity. The two factors in the ANOVA are the person factor (e.g. age, gender or country) and the class intervals (as described above). DIF may manifest itself as a constant difference between countries across the trait (uniform DIF, which is the main effect), or as a variable difference, where the response functions of the two groups cross over (non-uniform DIF, which is the interaction effect). Both the country factor and the interaction with the class interval might be significant in some cases, as with the main and interaction effects in any ANOVA. Tukey's post hoc tests determine where the statistically significant differences are to be found when there are more than two groups.
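The comparison of residuals between countries within each class interval can be illustrated descriptively. The sketch below is not the two-way ANOVA itself; it only tabulates the mean residual per country within each class interval for one item, which is the quantity the ANOVA tests. A roughly constant gap across class intervals is the pattern expected under uniform DIF, and a gap that changes sign is the pattern expected under non-uniform DIF. The country labels "A" and "B" and the residual values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records):
    """records: iterable of (country, class_interval, residual) for one item.
    Returns {(class_interval, country): mean residual}."""
    cells = defaultdict(list)
    for country, interval, residual in records:
        cells[(interval, country)].append(residual)
    return {key: mean(values) for key, values in cells.items()}


def country_gaps(records, country_a, country_b):
    """Difference in mean residual (country_a - country_b) per class interval."""
    means = cell_means(records)
    intervals = sorted({interval for interval, _ in means})
    return {
        i: means[(i, country_a)] - means[(i, country_b)]
        for i in intervals
        if (i, country_a) in means and (i, country_b) in means
    }


# Hypothetical residuals for a single item in two class intervals.
toy_records = [
    ("A", 1, 0.5), ("A", 1, 1.5), ("B", 1, 0.0),
    ("A", 2, 1.0), ("B", 2, -0.5), ("B", 2, 0.5),
]
gaps = country_gaps(toy_records, "A", "B")
```

In this toy example the gap is the same in both class intervals, which is what a main effect of country without an interaction would look like.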

When some but not all items display DIF, it is possible to make an adjustment that allows the items with DIF to vary by country. To do this, the item is replaced by a series of country-specific items (e.g. headaches becomes ‘headaches, Iraq’, ‘headaches, Turkey’, etc.). For each country, only the scores observed in its corresponding item are considered, while the other country-specific items are assigned missing values. Subsequent analysis is undertaken on this expanded data set (i.e. original plus split items). This procedure has been used successfully and documented for a measure of disease activity in manic depression [11].
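The item-splitting step is a simple data transformation, sketched below under stated assumptions: each patient record is a dictionary with a `country` field, missing values are represented by `None`, and the column-naming scheme `item_country` is purely illustrative. None of these conventions comes from the study or from RUMM 2010.

```python
def split_item(rows, item, countries):
    """Replace `item` with one country-specific column per country.
    A patient's score is kept only in the column for their own country;
    the other country-specific columns are set to None (missing)."""
    expanded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != item}
        for country in countries:
            column = f"{item}_{country}"
            new_row[column] = row[item] if row["country"] == country else None
        expanded.append(new_row)
    return expanded


# Hypothetical records: one item with DIF (headaches), one link item (mouth_ulcers).
rows = [
    {"country": "Turkey", "headaches": 1, "mouth_ulcers": 1},
    {"country": "UK", "headaches": 0, "mouth_ulcers": 1},
]
expanded_rows = split_item(rows, "headaches", ["Turkey", "UK", "Korea"])
```

Items that are not split stay in a single shared column, so they continue to link the countries on a common scale.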

Due to the number of significance tests undertaken within each analysis, a significance level of P < 0.01 was used.

The software package RUMM 2010 [12] was used to complete the Rasch analysis of the data, and SPSS version 10.1 (SPSS, Chicago, IL, USA) was used for other descriptive analysis.

Results

Patient characteristics
Between 1995 and 2002, 524 completed BDCAFs were returned. The characteristics of the patients involved in this study are shown in Table 1.


TABLE 1. Age and gender of patients

 
Analysis of individual country fit to the Rasch model
A separate analysis was performed on the data to examine how well the data from the 14 items fit the Rasch model within each country.

In all countries, for the questions relating to arthralgia, arthritis, erythema nodosum, genital ulcers, headaches, mouth ulcers, nausea/vomiting, diarrhoea, and pustules, problems were found with the three category response options. With a response category of 0 indicating no symptom in the past 4 weeks, 1 indicating a symptom for up to 2 weeks in the past 4 weeks and 2 indicating a symptom for more than 2 weeks in the past 4 weeks, the analysis showed that this response function was not working as intended for these items (i.e. they displayed disordered thresholds). Thus, for items at the level of disease activity at which a score of 1 would be expected by the model, patients were more likely to score 0 or 2. Consequently, in each country all of the items listed above were re-scored to create a dichotomous response function that theoretically represented a response of ‘symptom not present in past 4 weeks’ or ‘symptom has been present in the last 4 weeks’.
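The re-scoring described here collapses the two ‘symptom present’ categories into one. A minimal sketch follows, assuming the original responses are coded 0, 1 and 2 as in the text and that missing responses are represented by `None`; the function name is hypothetical.

```python
def dichotomize(score):
    """Collapse a 0/1/2 duration score into absent (0) or present (1).
    Missing responses (None) stay missing."""
    if score is None:
        return None
    if score not in (0, 1, 2):
        raise ValueError(f"unexpected score: {score!r}")
    return 0 if score == 0 else 1


# Hypothetical responses for one item across five patients.
original_scores = [0, 1, 2, None, 2]
rescored = [dichotomize(score) for score in original_scores]
```

After re-scoring, each of these items has a single threshold, so disordered thresholds can no longer arise for them.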

Following the re-scoring of these items, the scale demonstrated reasonable fit to the Rasch model within all countries (Table 2). However, given the low person separation index in China and Iraq (indicating a low level of reliability in the scores) and the small number of cases from each country (33 and 49 respectively), the data from these countries were not included in the pooled analysis.


TABLE 2. Individual item fit and person separation index for each country

 
Analysis of the fit of pooled data across three countries to the Rasch model
Analysis was undertaken of the pooled data from Turkey, the UK and Korea with the re-scored items as dichotomous response options. Five items showed misfit to the model (i.e. a residual value outside ±2.5 and/or a significant χ² P value of <0.01); these items are indicated by asterisks in Table 3.


TABLE 3. Individual fit of items within the BDAI to the Rasch model for the pooled data (n = ...)

 
The item–trait interaction statistic, which identifies the degree of the overall fit of the index to the Rasch model, was significant (χ² = 41.343, d.f. = 14, P < 0.01). This indicates that the data are deviating significantly from the model and therefore do not represent a unidimensional and hierarchical scale. The person separation index was 0.79.

One possible cause of item misfit is DIF; therefore DIF was examined in relation to the country from which the data were obtained. Of the 14 items, seven displayed DIF by country; these items were any new central nervous system involvement, any new major vessel involvement, arthralgia, disease activity (patient), erythema nodosum, headaches, and pustules. Post hoc analysis (Tukey's test) did not identify one particular country that was showing the most deviation. Therefore, the seven items with DIF were separated for all countries and the scale was re-analysed, which effectively created a 28-item scale (seven items split across three countries and seven original items which act as link items). Unfortunately, the items ‘disease activity (patient), Korea’ and ‘disease activity (clinician)’ showed significant misfit to the model and had to be removed from the analysis. Following this, the scale showed good fit to the Rasch model both at the individual item level and overall (χ² = 41.85, d.f. = 25, P = 0.0186), with a person separation index of 0.71, which confirmed that the problem with misfit in the pooled data was driven by differences at the country level.
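The item counts quoted for the split scale follow from simple arithmetic, written out below as a check. The variable names are illustrative; the numbers are those given in the text.

```python
# Counts taken from the text.
n_items = 14            # items in the original index
n_dif_items = 7         # items showing DIF by country
n_countries = 3         # Turkey, UK and Korea
n_removed = 2           # misfitting items removed after splitting

link_items = n_items - n_dif_items          # unsplit items shared by all countries
split_items = n_dif_items * n_countries     # one copy of each DIF item per country
expanded_scale = link_items + split_items   # size of the split scale
final_scale = expanded_scale - n_removed    # size after removing misfitting items
```

The result matches the 28-item expanded scale described here and the 26-item scale reported in the abstract.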

Distribution of BDAI across countries
There is some variation in the distribution of symptom location on the underlying continuum of disease activity between countries (Fig. 1). For example, new major vessel involvement is the item representing the highest level of disease activity in Korea and the UK, while the highest level of disease activity in Turkey is represented by the item diarrhoea or rectal bleeding. In all countries, the item that represented the lowest level of disease activity is mouth ulcers.



FIG. 1. Location of items on the underlying logit scale of disease activity for each country. MV, major vascular.

 
Discussion
Valid measurement of disease activity in BD is important for clinical management and testing the efficacy of treatments. It is important to clearly differentiate potentially reversible activity from permanent damage. Added to this is the need to pool patient information across countries to perform comparisons, particularly in the case of international clinical trials. This study on the international validity of the BDAI addresses this issue.

The data were collected using the original proforma (BDCAF) from five countries. Fourteen items previously identified to form an index of disease activity [6] were analysed using the Rasch method. The results of the analysis showed that the three-category scoring function for seven items was not working as intended, i.e. a higher score did not consistently indicate ‘more’ disease activity within a particular item. Therefore, a two-category scoring function was necessary to satisfy the requirements of the Rasch model for the BDAI. Unfortunately, in two countries (China and Iraq) the person separation index was too low to be considered acceptable. Coupled with the small number of cases from these countries, it was felt that it was not possible to draw any meaningful conclusions from the analysis of these data. Among the remaining three countries, the UK displayed very good fit of the data to the model and Turkey and Korea showed a small amount of misfit, though for all three countries the person separation index was reasonable.

The pooled analysis demonstrated some cross-cultural DIF between the three countries, though a solution was found by splitting the items that displayed DIF. Two of the items in the split scale had to be deleted as they showed misfit. This is only necessary for the purpose of cross-cultural comparison; the full index (BDAI) can still be used within each country. This means that, to produce accuracy of outcome measurement (e.g. in an international study of a drug intervention), it would be necessary to use not the raw data, but the Rasch-transformed scores. By separating the items for each country and fitting the data to the Rasch model, it is possible to use the score obtained from the index to make comparisons between countries. This is feasible as the fit to the model allows the estimation of disease activity on an interval scale [13], and all items are calibrated on the same linear metric. Interval level measurement is necessary to perform parametric statistics and calculate change scores [14]. Data from countries that were not involved in this study can similarly be analysed following the same process to allow accurate between-country comparisons. Thus, the BDAI is potentially a useful tool for international research.

There are a number of limitations to the study. The approach assumes that it is true cross-cultural differences that cause DIF by country. However, it is acknowledged that country may be serving as a proxy factor for other sources of DIF. For example, differing presentations of BD across populations, different co-morbidities (age of appearance, symptom pattern), and a cultural or medical bias towards items perceived as more important than others may all be sources of DIF.

Another important issue is that of translating the activity measures into the main language of the various countries. In our study, only the forms used to collect data from patients in Turkey have been officially translated using appropriate methods [4]. For other countries, the responsibility for the translation during an individual patient consultation was left to the clinician using the scale. While all clinicians administering the form in this study had an excellent command of the English language, the translation was not standardized. For accuracy in recording data, ideally the form should be translated formally, using the well-recognized procedure for self-report measures [15]. Cultural differences may mean that, in certain cultures, it may be more acceptable to express symptoms or difficulties in certain ways and not in others. When the condition involves genital symptoms there may be less freedom of discussion between males and females (patients may be less willing to volunteer information).

Despite these limitations, we feel that the BDCAF is a convenient and logical tool: it follows the natural course of a consultation with a BD patient, so it can easily be administered during a routine consultation and can be used to generate a useful index of disease activity. An overall disease activity score (BDAI) can be derived from the form and can be used in clinical trials of BD interventions, with the caveat that the analysis presented in this paper applies only to the countries mentioned. Using the Rasch method, some of the problems associated with cross-cultural DIF can be overcome when the activity index is used in international studies. For the purpose of research, treatment trials targeted at a specific organ system would require a more comprehensive measure of activity within that organ system. For instance, a study of treatment aimed at oral ulceration [16] would require assessment not only of the duration of symptoms but also of other features which may relate to activity (number of ulcers, size of ulcers, number of crops of ulcers, site of ulcers, etc.). However, it is suggested that detailed assessment of a particular organ system be accompanied by an overall assessment of disease activity using the BDAI. This will alert the clinician conducting any trial to advantageous or deleterious effects of the trial drug on other organ systems.

Appendix 1. Glossary of terminology used in Rasch analysis (adapted from Bond [17])


Term: Explanation

Differential item functioning: The variability of item response across subgroups of people identified by e.g. gender, age or race

Internal construct validity: Theoretical argument that the items in a scale are actually operationalizations of the theoretical construct or latent trait under investigation; i.e. that the instrument measures exactly what it claims to measure

Interval scale: A measurement scale in which the value of the unit of measurement is maintained throughout the scale so that equal numerical differences have equal arithmetic values, regardless of location. The zero point on the scale is regarded as arbitrary rather than absolute

Invariance: The maintenance of the identity of a variable from one occasion to the next. For example, item estimates remain stable across suitable samples, and person estimates remain stable across suitable tests

Item fit statistics: Indices that show the extent to which each item performance matches the Rasch-modelled expectations. Items that fit the model are components of a unidimensional variable

Item–trait interaction: Identifies the degree of the overall fit of the measure to the model. It assesses the degree to which the measure is diverging from the model in a systematic way that is not accounted for by chance alone

Local independence: The items of the scale are statistically independent of each subpopulation of patients who are homogeneous to the latent trait measured. The affirmation of one item in the scale should not increase the probability that another different item will be affirmed by an individual patient

Logit: Contraction of ‘log odds unit’, which is the unit of measurement in Rasch theory

Person separation index: The estimate of the replicability of person placement that can be expected if this sample of persons were to be given another set of items measuring the same construct. Analogous to Cronbach's α, it is bounded by 0 and 1

Residual: Standardized difference between the observed score and the expected score according to the model

Threshold: This relates to the responses made by patients to different options within an item. The threshold is the level at which the likelihood of failure to agree with or endorse a given response category (below the threshold) equates to the likelihood of agreeing with or endorsing the category (above the threshold)

Unidimensionality: The concept that one attribute of an object be measured at a time. The Rasch model requires a single construct to be underlying the items that form a hierarchical continuum

Acknowledgments

We are very grateful to Professor Z. Al-Rawi, Dr D. Bang, Dr F. Gogus, Professor D. Haskard, Dr M. R. Helbert, Professor S. Lee, Dr R. Powell, Dr S. Salman, Professor A. J. Silman, Professor H. Yazici, Professor D. Yi, Dr Z. Zhuoli and many more who provided us with data.

The BDCAF can be obtained from Dr B. Bhakta at the correspondence address.

References

  1. Bhakta B, Hamuryudan V, Brennan P, Chamberlain MA, Barnes C, Silman AJ. Assessment of disease activity in Behçet's disease. In: Wechsler B, Godeau P, eds. Excerpta Medica International Congress Series 1037, 611. Amsterdam: Elsevier Science, 1993:235–40
  2. Davatchi F, Akbaran M, Shahram F et al. Iran Behçet's Disease Dynamic Activity Measure. Abstracts of the XIIth European Congress of Rheumatology. Hungarian Rheumatol 1991;32(Suppl.):FP10–100
  3. Bhakta BB, Brennan P, James TE, Chamberlain MA, Noble BA, Silman AJ. Behçet's disease: evaluation of a new instrument to measure clinical activity. Rheumatology 1999;38:728–33
  4. Hamuryudan V, Fresko I, Direskeneli H et al. Evaluation of the Turkish translation of a disease activity form for Behçet's syndrome. Rheumatology 1999;38:734–6
  5. Bhakta BB, Hamuryudan V, Brennan P, Chamberlain MA, Barnes C, Silman AJ. Assessment of disease activity in Behçet's disease. In: Wechsler B, Godeau P, eds. Excerpta Medica International Congress Series 1037, 611. Amsterdam: Elsevier Science, 1993:235–40
  6. Chamberlain MA, Bhakta BB, Tennant A, Eyres S. Behçet's disease: diagnosis and assessment of disease activity. In: Bang D, Lee ES, Lee S, eds. Behçet's disease. Design Mecca Publishing 105–9
  7. Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press, 1960 (reprinted 1980)
  8. Smith RM. Fit analysis in latent trait measurement models. J Appl Measure 2000;2:199–218
  9. Andrich D. Rasch models for measurement. Sage University Paper Series on Quantitative Applications in the Social Sciences, No. 07-068. Beverly Hills: Sage Publications, 1998
  10. Angoff WH. Perspectives on differential item functioning methodology. In: Holland PW, Wainer H, eds. Differential item functioning. Hillsdale (NJ): Lawrence Erlbaum, 1993
  11. Lange R, Thalbourne MA, Houran J, Lester D. Depressive response sets due to gender and culture-based differential item functioning. Pers Individual Differences 2002;33:937–54
  12. Andrich D, Lyne A, Sheridan B, Luo G. RUMM 2010. Perth: RUMM Laboratory, 2000:11
  13. Glass GV, Stanley JC (eds). Measurement scales and statistics. In: Statistical methods in education and psychology. Englewood Cliffs (NJ): Prentice Hall, 1970:7–25
  14. Svensson E. Guidelines to statistical evaluation of data from rating scales and questionnaires. J Rehabil Med 2001;33:47–8
  15. Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine 2000;25:3186–91
  16. Revuz J, Guillaume J, Janier M et al. Crossover study of thalidomide versus placebo in severe recurrent aphthous ulceration. Arch Dermatol 1990;126:923–27
  17. Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the human sciences. Hillsdale (NJ): Lawrence Erlbaum, 2001
Submitted 7 February 2003; Accepted 28 May 2003




