Department of Public Health, University of Oxford, Oxford, 1 Department of Health Sciences and 2 Hull York Medical School, University of York, York, UK
Correspondence to: S. Brealey, York Trials Unit, Department of Health Sciences, Second floor, Area 4, Seebohm Rowntree Building, University of York, Heslington, York YO10 5DD, UK. E-mail: sb143{at}york.ac.uk
Abstract |
---|
Methods. Instruments were identified through systematic searches of the literature. Information relating to instrument content, patient population, reliability, validity and responsiveness was extracted from published papers.
Results. The 16 instruments that met the inclusion criteria varied in length from 4 to 42 items. The majority form a single index; six produce a profile of scores. Eight have been evaluated in patients with a variety of knee problems. All instruments have satisfactory internal or test–retest reliability. However, there is limited empirical support for the health domains of six instruments. Patients informed the development of items within just five instruments. Few authors gave explicit consideration to the size of expected relationships in tests of construct validity. Eleven instruments have evidence for responsiveness to changes in health. The minimally important difference was not determined for any of the instruments.
Conclusions. In the absence of comparative evidence, the large number of patient-assessed instruments for knee problems makes instrument selection difficult. The Knee Injury and Osteoarthritis Outcome Score (KOOS), Knee Pain Scale and Oxford Knee Score have good evidence for reliability, content validity and construct validity. The KOOS and Oxford instruments also have evidence for responsiveness. The instruments have not been evaluated for all knee problems, and instrument appropriateness, including content relevance, must be assessed before application. The comparative evaluation of instruments is recommended.
KEY WORDS: Health status, Knee, Quality of life, Reliability, Responsiveness, Review, Validity
Introduction |
---|
Clinicians and researchers wishing to select an instrument for measuring the health of patients with knee problems have a choice of several patient-assessed instruments, irrespective of underlying diagnosis or intervention [3, 4]. In recent years, groups such as OMERACT (Outcome Measures in Rheumatology) have been working towards establishing a core set of outcome measures for assessment of rheumatoid arthritis, but no such consensus exists for knee problems. The plethora of available instruments across a broad spectrum of knee problems has led to a lack of standardization in applications, including clinical trials, which has implications for the generalizability of results [4]. The appraisal of instrument content along with evidence for measurement properties, including reliability and validity in relation to the study population, are prerequisites for appropriate instrument selection [2]. Concurrent evaluation of measurement properties from a number of instruments can inform subsequent instrument selection. However, such evidence is often unavailable and the development of new instruments will mean that there are likely to be gaps in comparative evidence.
Structured reviews that are based on comprehensive searches of the literature are a further means of informing instrument selection. Four recently published reviews that included patient-assessed instruments for the knee were neither comprehensive in scope nor based on systematic searches of the literature [3–6]. One focused on four named journals and a literature search that was not defined [3]. Three did not report the search strategy used [4–6]. One presented major scoring systems which were not defined [6]. Data extracted included reliability and validity. However, different forms of testing and the quality of the results were not explicitly considered. The structured review that follows is broader in scope and draws on recommendations relating to standardized criteria for the selection of instruments [2]. It also reports evidence for responsiveness, an important criterion for instruments that are intended for application in evaluative studies including clinical trials [2]. The results of this review will inform the selection in clinical practice and research of patient-assessed instruments that are self-completed and measure aspects of the health and quality of life of patients with knee problems.
Methods |
---|
Identification of studies
The search strategy was designed to retrieve references relating to the development and evaluation of instruments, including reviews. The search terms were developed by combining terms relevant to patient-assessed health instruments with knee-specific terms [1, 2], examples of which are shown in Table 1. Databases searched included the Patient-assessed Health Instruments bibliography (http://phi.uhce.ox.ac.uk) hosted by the National Centre for Health Outcomes Development, University of Oxford, which is based on systematic searches of the literature [1]. Medline, CINAHL and EMBASE were also searched. The names of identified instruments were then used in further searches of these databases. Original papers were retrieved for references that included the development or evaluation of patient-assessed instruments that are specific to the knee. The citation lists of these papers were examined for further developmental work and other instruments. The review included evaluations of instruments in non-English-speaking populations that were published in an English language journal. It has been recognized that the first comprehensive knee rating system [7–9] was developed in 1977, so the search was restricted to evaluations published from 1977 to the end of 2002.
Data extraction
Our approach was consistent with previous reviews [3–6] and recommendations for the evaluation of patient-assessed instruments [2, 10]. Data were independently extracted by two reviewers [S.B., A.M.G.]. Information extracted from articles included the characteristics of patients in which the instrument was developed or evaluated, instrument content and the results of testing for reliability, validity and responsiveness, and floor and ceiling effects when available. These criteria have been recommended for the selection of instruments in clinical trials [2].
The characteristics of patients recorded were the setting, age, gender and knee condition. Instrument content was described in terms of domains of health measured and number of items.
Reliability is concerned with whether an instrument is internally consistent or reproducible. Internal consistency is tested following a single administration and assesses how well items within a scale measure a single underlying dimension. Test–retest reliability is designed to take account of variation over time in stable patients. The results of tests of internal consistency and test–retest reliability, specifically Cronbach's alpha and test–retest correlation coefficients, are presented. Reliability estimates of 0.7 and 0.9 are recommended for instruments intended for use at the group and individual level, respectively [2, 11].
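As an illustration of the internal consistency statistic referred to above, the following is a minimal sketch of Cronbach's alpha using its conventional definition, together with the 0.7 and 0.9 thresholds. The function names and the small data set are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Conventional formula: alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)),
    using sample variances throughout.
    """
    if len(scores) < 2:
        raise ValueError("at least two respondents are required")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Variance of each item across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*scores))
    # Variance of the summed scale score across respondents.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def reliability_level(estimate):
    """Classify an estimate against the 0.7 (group) and 0.9 (individual) criteria."""
    if estimate >= 0.9:
        return "individual"
    if estimate >= 0.7:
        return "group"
    return "below group criterion"


# Hypothetical responses: three patients, two items.
example = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(example)
level = reliability_level(alpha)
```

In this toy example the two items agree only partly, so the estimate falls below the group criterion; an instrument whose items move together across patients would approach 1.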
Validity is concerned with whether an instrument measures what is intended. Validity can be evaluated qualitatively through examination of instrument content, and quantitatively through factor analysis and comparisons with related variables. The source of instrument items and any evidence for content and face validity are presented. Evidence derived from factor analysis or principal component analysis that supports dimensionality or internal construct validity is presented. External construct validation includes comparisons with other instruments and relating instrument scores to clinical and sociodemographic variables [2, 11].
Responsiveness is concerned with whether an instrument is sensitive to changes in health and can be assessed using distribution-based and anchor-based approaches [12–14]. The former relate instrument score changes to some measure of variability, and include the effect size statistic [15], standardized response mean [2] and modified standardized response mean or responsiveness index [16]. Several authors suggest that statistical measures of responsiveness are an insufficient basis for assessing responsiveness and that patients' views about the importance of the change should inform testing [17, 18]. Anchor-based approaches assess the relationship between changes in instrument scores and an external variable. This includes health transition items or global judgements of change that have been used to estimate the minimally important difference (MID), the instrument change score corresponding to a small but important change [12]. The MID can inform sample size calculations but consideration must be given to the specific group of patients and settings [18]. External variables, including transition ratings, have also been compared with instrument score changes using correlation. This form of longitudinal validity assesses whether changes in instrument scores concur with an accepted measure of change in patient health [14].
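The two most widely reported distribution-based statistics named above can be sketched as follows. The sketch assumes their conventional definitions: the effect size divides the mean change by the standard deviation of baseline scores, and the standardized response mean divides the mean change by the standard deviation of the change scores. The function names and scores are hypothetical and purely illustrative.

```python
from statistics import mean, stdev


def _changes(baseline, follow_up):
    """Paired change scores (follow-up minus baseline) for the same patients."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two patients")
    return [after - before for before, after in zip(baseline, follow_up)]


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    return mean(_changes(baseline, follow_up)) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = _changes(baseline, follow_up)
    return mean(changes) / stdev(changes)


# Hypothetical instrument scores for four patients before and after treatment.
before = [10, 20, 30, 40]
after = [15, 30, 35, 60]
es = effect_size(before, after)
srm = standardized_response_mean(before, after)
```

Because the two statistics use different denominators, the same data can yield quite different values, which is one reason the study population and context need to be reported alongside any responsiveness estimate.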
Data extraction covered the full range of approaches to responsiveness and descriptive statistics. Where available, data were also extracted for studies designed to assess the MID and longitudinal construct validity. Responsiveness and MIDs are not fixed properties [18, 19], and therefore the study population and context, including any intervention, were also extracted.
No ethical approval or informed consent was required for our study.
Results |
---|
Setting and scope of instruments
Table 2 shows that the majority of instruments were developed or evaluated in the USA [20–28]. Instruments were also developed or evaluated in Finland [29], Australia [30, 31], the UK [32–34], Canada [35–39], Japan and France [23], Germany [40] and Sweden [41–44]. Eight instruments were evaluated in patients with a range of knee problems. The Functional Index Questionnaire (FIQ) has undergone the most evaluations in five patient populations [30, 33, 35–37].
The dimensionality of the Edinburgh instrument, IKDC, KOOS, KOS-ADLS, KPS and KSI has been assessed using factor or principal component analysis (PCA). The Edinburgh instrument is used as a single index but the results of PCA suggest that it is multidimensional [32]. Three of the KOOS scales were found to contribute to a single component [41]. While the majority of items within the KOS-ADLS loaded onto one factor, several of the items relating to symptoms loaded onto a second factor [25]. There was limited empirical support for the hypothesized domains within the KSI and a single index was adopted [27].
All the instruments have undergone some form of testing for construct validity against scores for other instruments and clinical or sociodemographic variables (Table 5). However, few studies used formal hypotheses relating to the size of expected relationships, which would have facilitated interpretation of the results [2]. Extensive hypotheses were applied in a concurrent evaluation of the Cincinnati, KOS-ADLS, Lysholm Knee Score and SKRS instruments [22]. As hypothesized, instrument scores were (i) more highly related to scores for other knee instruments than with the generic SF-36, (ii) more highly correlated with the SF-36 scales relating to physical health than mental health, and (iii) highly correlated with severity ratings. As hypothesized, IKDC scores were more strongly correlated with the SF-36 scores that make a larger contribution to physical health than mental health [23]. As hypothesized, KOOS scores were found to have the largest correlations with SF-36 scales that relate most to the SF-36 physical component and were of a moderate-to-large level [24, 41]. As hypothesized, KOS-ADLS scores had moderately strong levels of correlation with a global rating of function that were stronger than those for the Lysholm [25]. As hypothesized, moderate correlations were found between the Oxford instrument and clinician ratings of knee function and the Health Assessment Questionnaire [34]. Correlations with the SF-36 were of a small-to-moderate size, fitting the general hypothesis of a moderate level of correlation. As hypothesized, the Swedish version of the Oxford instrument had the largest correlations with physical and pain domains of the Nottingham Health Profile, Sickness Impact Profile, SF-36, SF-12, and the Western Ontario and McMaster University osteoarthritis index (WOMAC) [43].
Responsiveness
Table 6 shows that the responsiveness of the Cincinnati, KOS-ADLS, Lysholm and SKRS was assessed in a sample of 42 patients with a variety of knee disorders who were expected to improve at follow-up [22]. Standardized response means (SRMs) ranged from 0.8 for the Cincinnati to 1.1 for the KOS-ADLS. The responsiveness of the Cincinnati was further assessed in 250 patients 2 yr after anterior cruciate ligament (ACL) bone–patellar tendon–bone autogenous reconstruction [21]. Effect sizes and SRMs ranged from 0.69 to 3.49. The FIQ has undergone evaluations in patients following physical therapy, and moderate [37] and large [36] effect sizes of 0.59 and 1.63 were found, respectively.
The KOS-ADLS was assessed in 266 patients undergoing physical therapy [25]. Effect sizes ranged from 0.44 to 1.26. The Knee VAS was assessed in two groups, both comprising 28 patients who had undergone arthroscopic partial resection of the medial meniscus or open ACL reconstruction with a bone–patellar–bone graft [40]. For the first group, significant improvements in Knee VAS scores were seen at all follow-up periods. For the second group, Knee VAS scores were significantly lower at short-term follow-up and significantly higher at long-term follow-up compared with pre-operative scores [40].
The LEAP was assessed in 32 patients with osteoarthritis who were undergoing knee replacement surgery [49]. LEAP scores showed statistically significant improvements at 3 months. The Oxford Knee Score was assessed in 117 patients undergoing total knee replacement [34]. The instrument produced an effect size of 2.19, which was greater than those for the SF-36. Changes in QoL-ACL scores were found to correspond with clinical opinion in 21 out of 25 patients, but responsiveness was not statistically assessed [39].
Discussion |
---|
The existing reviews did not focus exclusively on self-assessed instruments, but limitations in their scope and search strategies used meant they failed to identify six instruments meeting the inclusion criteria for the current review. The review identified 16 patient-assessed instruments with evidence for reliability and validity. Such a large number can only serve to confuse clinicians and researchers choosing an instrument for applications including clinical trials and clinical practice. This proliferation of instruments, many of which do not adequately draw on recommended criteria for instrument development, led the authors of one recent review to recommend that no further instruments be developed [3]. The lack of standardization in the choice of instruments will limit the generalizability of findings.
Published evidence for internal consistency reliability was not available for the ARS, AKPQ, Cincinnati, Knee VAS, QoL-ACL and SKRS. These instruments sum to produce a single index. The AKPQ, Cincinnati, QoL-ACL and SKRS can also produce a profile of scores. These six instruments lack evidence for both internal reliability and internal construct validity; hence there is limited empirical support for the constructs that they purport to measure. The remaining instruments all produce reliability estimates that meet the group criterion; most estimates pertaining to the IKDC, KOS-ADLS and Oxford instruments meet the individual level criterion.
There is no published evidence relating to the content validity of the AKPQ, FIQ, KSI and LEAP. The ARS, KOOS, Knee-VAS, Oxford and QoL-ACL have evidence for content validity from a patient perspective. All instruments have undergone some form of empirical evaluation of validity, including comparisons with scores for other patient-assessed instruments and clinical variables. However, the lack of hypotheses relating to the magnitude of expected relationships limits interpretation [2]. The Cincinnati, IKDC, Oxford, Lysholm and SKRS have undergone validity testing based on explicit hypotheses relating to the size of expected relationships. Just six of the instruments have been assessed for internal construct validity; the IKDC, KPS and KSI performed satisfactorily.
Few instruments have satisfactory levels of reliability, content validity and construct validity. The Edinburgh, IKDC, KOOS, KOS-ADLS, KPS, Lysholm and Oxford instruments have the best performance in relation to these criteria. Of these instruments, published reports relating to the KOOS, KPS and Oxford instruments give explicit descriptions of patient involvement in the derivation of instrument content. If a patient-assessed instrument is to have content validity as a measure that is relevant to the recipients of care, patients should be involved in the derivation of items [2].
Ten instruments have some evidence for responsiveness to changes in health, although few studies involved comparisons with other instruments. The MID has not been determined for any of the instruments. These instruments will continue to be used in assessing health outcomes in longitudinal studies and so these deficiencies must be addressed by future research.
The 16 instruments are not designed for all knee disorders and it is important that both the content of the instrument and the population in which the instrument has been evaluated are considered as part of the process of instrument selection. Reliability, validity and responsiveness are context-specific attributes, and an instrument that has demonstrated satisfactory measurement properties in one population is not necessarily appropriate for use in other populations [2]. The aim of this review is not to recommend a single instrument but rather to assess evidence for knee-specific instruments that will inform future selection for applications, including clinical trials. The review found that several instruments have been developed and evaluated in patients with specific knee problems. These instruments cannot be recommended for use in other knee problems without further evaluation.
Consideration should also be given to the use of generic instruments, which have been recommended for use alongside specific instruments [51, 52]. Generic instruments have greater potential to measure side-effects or unforeseen effects of treatment and are more suitable for economic evaluation. There are several studies relating to the performance of generic instruments in patients with knee problems. The SF-36 was found to have satisfactory reliability in patients with osteoarthritis of the knee and was more responsive than osteoarthritis-specific instruments in a group of out-patients [53]. After assessing validity, reliability and acceptability to patients, the SF-12 was recommended in preference to the Nottingham Health Profile, Sickness Impact Profile and SF-36 for patients who had undergone knee arthroplasty [44]. The EuroQol and SF-36 produced statistically significant changes in health in patients undergoing magnetic resonance imaging, with some evidence to suggest that the SF-36 was the most responsive instrument [54]. The SF-36 has evidence for responsiveness following physical therapy for knee impairments [55], surgical and non-surgical management for ACL injuries [56] and total knee arthroplasty [57, 58]. It was also concluded that the Sickness Impact Profile was sensitive to changes in health for patients with moderate osteoarthritis of the knee [59].
The SF-36 has also been recommended with the Oxford Knee Score and the WOMAC [60] for assessing the outcomes of total knee replacement [4]. The WOMAC has been widely validated in patients with osteoarthritis of the hip and knee, but, together with the Lower Extremity Function Scale, which was validated on patients following a hip or knee joint arthroplasty [61], was not included in the review. This is because these instruments were not developed solely for patients with knee problems. The WOMAC was, however, found to be more responsive than the SF-36 in patients undergoing knee surgery [62], but the SF-36 was more responsive than the WOMAC in patients with osteoarthritis who were attending rheumatology clinics [53]. The Oxford Knee Score has also been recommended in preference to the WOMAC for arthroplasty patients [44]. In contrast to the knee-specific instruments reviewed here, the WOMAC has been assessed for the MID [63].
In summary, few of the 16 patient-assessed instruments that are specific to knee problems have satisfactory evidence for reliability, validity and responsiveness. If an instrument is to adequately address the concerns of patients and hence possess content validity, instrument development must draw upon the views of patients, including how their knee problems affect their lives. Patients were involved in the development of just five instruments. Based on the review criteria, the KOOS, KPS, and Oxford instruments are the most suitable for inclusion in applications designed to assess knee-related health and outcomes from the perspective of the patient. The limited evidence for the comparative performance of instruments makes it difficult to select one particular instrument for a study population and research question. Future studies should therefore consider the inclusion of multiple instruments to assess comparative performance including generic instruments. Consideration must also be given to assessing the MID, which will inform sample size calculations in evaluative studies such as clinical trials.
Acknowledgments |
---|
The authors have declared no conflicts of interest.
References |
---|