Department of Musculoskeletal Science, Royal Liverpool University Hospital and Liverpool Upper Limb Surgery Unit, Royal Liverpool University Hospital, Liverpool, UK
Correspondence to: S. P. Frostick, Department of Musculoskeletal Science, University of Liverpool, Liverpool L69 3GA, UK. E-mail: s.p.frostick@liv.ac.uk
Abstract
Methods. Items were generated using 25 patients and expert opinion, and reduced using 25 new patients to yield a nine-item patient questionnaire and a six-item clinical evaluation (of strength, motion and ulnar nerve involvement). This was validated using 63 new patients (of whom 28 were studied twice without therapy and 18 were studied again after appropriate surgery).
Results. The test–retest reliability coefficient of determination (R2 = 0.93) and internal consistency (Cronbach's alpha = 0.98) were both good. Convergent validity was attested by good correlations with other scores, the Disabilities of Arm, Shoulder and Hand Questionnaire (DASH) and the Nottingham Health Profile (NHP) (physical) (R2 = 0.62 and 0.29, P<0.0005). Sensitivity to change was demonstrated by correlating preoperative–postoperative changes with those in DASH and NHP (physical) (R2 = 0.50 and 0.27, P<0.04).
Conclusion. This is a reliable, internally consistent score, correlating well with other, non-elbow specific scores and sensitive to change on treatment.
KEY WORDS: Clinical score, Elbow, Outcome measure, Validation
Introduction
Many outcome measures have been used for elbow conditions: the Mayo Elbow Performance Index [1–4] and its several variants [5–8], the Ewald Scoring system [9, 10], the Hospital for Special Surgery (HSS) Scoring System [11] and its variants [12, 13], the Flynn Criteria [14, 15], the Pritchard Score [1], the Brumfield Score [1], the Neviaser Criteria [16], the Jupiter Score [17–19], the Khalfayan Score [20, 21], the Disabilities of Arm, Shoulder and Hand Questionnaire (DASH) [22], the Modified American Shoulder and Elbow Surgeons self-evaluation form (M-ASES) [23], and the same organization's Elbow Assessment form [24]. Of these, DASH and M-ASES are patient-completed functional and general questionnaires. However, general questionnaires alone may not assess accurately the symptoms and functions of an individual joint [25]; they are often lengthy and contain questions irrelevant to a specific problem or procedure [26]. The ASES Elbow Assessment form [24] is elbow-specific and combines patient- and physician-completed questions; however, it has not been validated and is somewhat unwieldy, containing over 50 responses. The others listed are physician-completed questionnaires scored by clinical assessment, some (HSS, Ewald, Mayo and Khalfayan) also containing functional questions completed by the physician on the patient's behalf. The components of these scores are compared in Table 1.
Above all, none of these scores has been properly validated. Full validation [31] requires assessment of internal consistency (the degree to which component responses agree with each other, giving confidence that they are measuring different aspects of the same thing); construct validity (the degree to which the instrument supports predefined hypotheses), of which an important aspect is usually convergent validity (the degree to which the instrument correlates with other established and accepted questionnaires); test–retest reliability (reproducibility); and sensitivity to change (usually to treatment). A further aspect is discriminant validity, the degree to which an instrument diverges from instruments designed to measure different things; in the present context this means anatomical specificity, which must be built into both the functional and clinical parts of the questionnaire [32–34]. Of the scores listed above, Turchin et al. [30] examined construct validity alone for the HSS, Mayo and some variants, and the Ewald and Pritchard scoring systems. They found that the variable mixture of clinical and functional criteria impaired validity, and there was very low agreement between scores. Construct validity was reported, but it was assessed by comparison with patient- and physician-rated severity rather than with a valid, standard score. Internal consistency and sensitivity to change were not assessed, and test–retest reliability was measured only for M-ASES and DASH. These authors recommended that an ideal tool for the assessment of the elbow would measure pain, function and disability simultaneously, and that the outcome of treatment should be assessed on the basis of function, clinical examination and assessment of pain [30].
We have therefore developed an instrument which we call the Liverpool Elbow Score (LES). This contains two main components: a patient-rated questionnaire assessment of function, relevant to the functions of the elbow (unlike some general upper limb questionnaires mentioned above), including a question about pain; and some important relevant clinical data, which can be measured objectively and consistently, regardless of the condition of the elbow. It is simple and quick to administer. We present here evidence of its validity.
Patients and methods
There are three stages of scale development [35–37]: item generation, item reduction (comprising the selection of the item pool and choice of item scaling), and the determination of reliability, validity and responsiveness. We describe separately the patient-response and clinically assessed items.
Generation of the scale: the patient-answered questionnaire
Item generation
Items were generated from interviews with 25 patients at the Royal Liverpool University Hospital Upper Limb Unit. These had elbow problems which included rheumatoid arthritis, primary osteoarthritis, post-traumatic osteoarthritis, elbows with joint replacements, elbows with failed joint replacements, tennis elbow, elbows with loose bodies, etc. Patients were asked what they thought were the most important activities of daily living affected specifically by their elbow problem. A total of 21 common items were obtained. Experts in elbow surgery were consulted, including members of the university department where this project was done and eminent elbow surgeons from various parts of the UK. Based on their judgement, the patient responses and literature review, a list was drawn up of functional and clinical criteria comprising a total of 42 items.
Item reduction
From this list, items were selected using the following criteria. For inclusion, items had to reflect a problem which is common, has a substantial effect on daily life, is important to patients and judged important by expert clinicians, and which affects tasks which are performed by all subjects at some functional level; it should be stable over at least short periods of time yet have potential for change. We excluded items which were generic, repetitive, not reflective of disability, not relevant to the elbow or not highly endorsed by expert opinion and patients. We separated the functional items (activities of daily living) and the clinical items (assessed by the clinician), and obtained a reduced list: the functional items included the use of the other arm due to difficulty with the affected arm, combing hair, personal hygiene, dressing, household activities, lifting, pain, sport, leisure activities, driving, gardening, keyboard use, writing and feeding; the clinical items assessed motion, strength, ulnar nerve problems, instability and deformity. These were then tested on 25 out-patients attending with elbow problems. Patients were asked whether they could understand the questions, and how relevant they were to their problems. The clinicians who administered these questionnaires were consulted regarding the practicality and difficulty of administering the clinical items. Six patients felt that the term 'washing' would be easier to understand than 'personal hygiene'. Seven felt that the question on driving was irrelevant because they had never driven in their lives. Sport and leisure were grouped together, as most patients considered that they indulged in only one of these, and some took them to be the same. Similarly, gardening, keyboard use and writing were not universally endorsed by the patients, and we omitted these as potentially lacking consistency.
The final list of items was as listed in Table 2. For each item we used a five-level Likert scale, as commonly used for questionnaires of this kind and also advised by our statisticians, starting from zero for simplicity of calculation. Wording was such that 0 represented worst/least function and 4 best/most function. In the earlier studies, patients were asked to answer about 'how you are now'. With the last 21 patients studied (and used for the test–retest study), we addressed the question of timeframe explicitly by specifying 'the last 4 weeks' (Table 2). We asked these patients whether they would have answered any of these questions differently if a time limit had not been given, and the opposite question of 20 of the patients who had been given the earlier questionnaire. Both questions were invariably answered in the negative, and there was no discernible difference in reproducibility or internal consistency between the earlier and this slightly modified form of the questionnaire.
The scale as used
The final questionnaire combined a nine-item patient-answered questionnaire (PAQ) and a six-item clinical assessment score (CAS) component. For calculation of the final score, all responses were transformed to a scale of 0–10, and equally weighted (see Discussion) for summation by averaging, so the final score runs from 10 (best) to 0 (worst). Thus the total score can be expressed as
$$\mathrm{LES} = \frac{1}{15}\left(\sum_{i=1}^{9}\mathrm{PAQ}_i + \sum_{j=1}^{6}\mathrm{CAS}_j\right)$$

where PAQ_i and CAS_j denote the individual item scores after transformation to the 0–10 scale.
The limiting range 0–10 was chosen somewhat arbitrarily, to avoid on the one hand an excessive use of decimal places, and on the other hand an irrelevantly large number of levels.
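As an illustration only, the arithmetic of the score can be sketched as follows (the function names, the assumption that PAQ items are raw 0–4 Likert responses, and the raw maxima for the CAS items are ours for the example; the published score requires only that every item be rescaled to 0–10 before averaging):

```python
# Illustrative sketch of the LES arithmetic described above (not the authors' code).
# Assumption: PAQ items are raw Likert responses 0-4; CAS item maxima are placeholders.

def rescale(raw, raw_max):
    """Map a raw item value (0..raw_max, higher = better) onto the 0-10 scale."""
    return 10.0 * raw / raw_max

def liverpool_elbow_score(paq_raw, cas_raw, cas_maxima=None):
    """Average of the 15 rescaled items: 0 (worst) to 10 (best), equally weighted."""
    if cas_maxima is None:
        cas_maxima = [4] * 6          # placeholder assumption, not from the paper
    items = [rescale(x, 4) for x in paq_raw]                        # 9 PAQ items
    items += [rescale(x, m) for x, m in zip(cas_raw, cas_maxima)]   # 6 CAS items
    return sum(items) / len(items)

# Example: 3/4 on every PAQ item and 2/4 on every CAS item gives (9*7.5 + 6*5.0)/15 = 6.5
print(liverpool_elbow_score([3] * 9, [2] * 6))
```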
Assessment of the scale: characteristics of the patients
The final questionnaire thus developed was validated in a prospective way for internal consistency, reproducibility, validity and sensitivity to clinical change [32–34]. During their assessment patients were examined by the clinician and then asked to answer the questionnaire (with the clinician present). There were 112 assessments in all. We studied 63 patients (median age 55 yr, range 15–77 yr) with various elbow conditions: 19 rheumatoid arthritis, 14 osteoarthritis, eight tennis elbow, six arthroplasty, six loose bodies, three fractured radial head, three ulnar nerve problems, two golfer's elbow, and one each of osteochondritis dissecans and synovial chondromatosis. Of these, 28 were studied again between 1 and 3 days later, when no change had occurred in their condition. Of the original 63 patients, 18 were studied again (median 24 weeks) after surgery of various kinds.
Assessment of the scale: other instruments
Patients were also asked to answer the SF-12 [39], the NHP (Nottingham Health Profile) [40] and the DASH [41]. The SF-12 is a short-form health survey of 12 general health questions, which are weighted separately for mental and physical assessments [39]: we used the physical weights in this study. The NHP is a generic health-related quality of life measure [40], structured to assess physical mobility (eight items), pain (eight items), social isolation (five items), emotional reactions (nine items), energy (three items) and sleep (five items). Each item is weighted: we used the physical weights. The DASH is a single-scale, 30-item questionnaire designed to measure upper limb disability and symptoms [41]; its functional domains include physical, social and psychological. All three questionnaires are already well validated.
Methods of validation
Internal consistency was assessed by administering the instrument to a group of patients on one occasion and estimating to what extent the items yield similar results. For this we used Cronbach's alpha coefficient [42], which (in effect) assumes that each actual item represents a retest of a single notional item, and also the correlation between each individual item and the overall score.
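For readers who wish to reproduce these statistics, a minimal sketch of the calculation is given below (assuming item scores are held in an n_patients × n_items array; this is our illustration, not the authors' analysis code):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_patients x n_items) array of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def item_total_correlations(scores):
    """Correlation of each individual item with the overall (summed) score."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]
```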
Construct validity concerns the extent to which the questionnaire supports predefined hypotheses, including the degree to which it relates to other established and accepted questionnaires. In the present context we assessed a particular component of construct validity, convergent validity, by measuring the correlation with other established and accepted questionnaires; for this we used Pearson's correlation coefficient. We considered that the score should correlate well with the physical components of SF-12 and NHP, and also with DASH.
Test–retest reliability (reproducibility) was assessed by administering the test to the same sample on two different occasions, on the assumption that there is no substantial change in what is measured. This was assessed using Pearson's correlation coefficient (R), expressed here as the coefficient of determination (R2), and by the mean ± S.D. of the test–retest difference; the mean test–retest S.D. was also expressed as a coefficient of variation (CV%).
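A sketch of these reproducibility statistics follows (our illustration; the exact coefficient-of-variation convention is not specified above, so dividing the within-pair S.D. by the overall mean is an assumption):

```python
import numpy as np

def test_retest(first, second):
    """Reproducibility statistics for paired test-retest scores."""
    first, second = np.asarray(first, float), np.asarray(second, float)
    r = np.corrcoef(first, second)[0, 1]     # Pearson's R between test and retest
    diff = second - first
    # Assumed CV convention: within-pair SD divided by the overall mean, as a percentage
    cv_pct = 100.0 * diff.std(ddof=1) / np.concatenate([first, second]).mean()
    return {"R2": r ** 2,
            "mean_diff": diff.mean(),
            "sd_diff": diff.std(ddof=1),
            "CV%": cv_pct}
```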
Sensitivity to change was investigated by comparing pre- to postoperative changes in the score with changes in other scores administered at the same time [41, 43], using the coefficient of determination (R2). We are of course comparing the preoperative–postoperative differences, not making claims about the benefits or otherwise of the operation.
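Responsiveness can then be sketched as the correlation of change scores (again an illustration, assuming paired pre- and postoperative values for the LES and a reference instrument such as DASH):

```python
import numpy as np

def change_score_r2(les_pre, les_post, ref_pre, ref_post):
    """R^2 between preoperative-postoperative changes in the LES and in a reference score."""
    delta_les = np.asarray(les_post, float) - np.asarray(les_pre, float)
    delta_ref = np.asarray(ref_post, float) - np.asarray(ref_pre, float)
    return np.corrcoef(delta_les, delta_ref)[0, 1] ** 2
```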
Results
Results of validation did not differ significantly between pre- and postoperative patients. The initial (pre-op) studies are used in the analysis and figures which follow.
Internal consistency for the score was good (Table 3); of the two main components of the score, the PAQ by itself had better internal consistency and better correlation with the overall score.
Diagnosis appeared to have no significant influence on the results (i.e. internal consistency, reproducibility or correlation with other measures), as assessed pragmatically by dividing the patients into two main groups: disease involving mainly the joint surface (largely osteoarthritis, but also loose bodies and osteochondritis dissecans) and disease involving mainly the periarticular structures (largely rheumatoid arthritis, but also ulnar nerve problems, tennis elbow, golfer's elbow, synovial chondromatosis and posterior impingement).
Discussion
The score follows the principle that the patient can provide reliable and valid judgements of health status and of the benefits of the treatment [31]: 60% of the total score comes from the PAQ. There is some debate on the need for clinical assessment as part of a score, alongside patient evaluation. As in other cases [33], the PAQ performed somewhat better alone, at least as judged by Cronbach's alpha coefficient (Table 3). We would suggest that while leaving the CAS component out might miss important aspects of elbow pathology, it would certainly be possible to use the PAQ alone in situations where a purely postal or telephone assessment would be desirable on practical grounds.
Is a new score needed for the elbow? Existing elbow scores have not been fully validated [30, 44] and there is little agreement between the scores [30], and also little consistency in their exact definition. It was our belief that there are desirable components of an elbow score which none of the existing scores possessed. The present elbow score was developed by consulting patients with various different problems, and tested on patients with a variety of pathologies and operations. The items in the PAQ were included on the basis of interviews with patients, which itself assists in achieving content validity, and refined by testing, retesting and several consultations with experts. The resulting score was tested statistically in a prospective manner for all the components of validity which are a prerequisite for an assessment tool [34]. It has good internal consistency and reproducibility.
For purposes of construct validity and sensitivity to change, we compared it with three well-established and well-validated scores. NHP and SF-12 are questionnaires that can be used for any disease state, and DASH is a general upper limb questionnaire. We contend that this is better than comparing clinician-assessed severity and patient-assessed severity, as has been done in some validation studies [30]. Sensitivity to change is an essential criterion in scale validation [45] which has not been assessed in patients with elbow pathology [44]. We have shown that the new score responds to operative treatment at least as well as the general scores against which we compared it. The difference between them, of course, is the direct focus on the elbow (content validity).
We chose to weight each item in the score equally. This is a difficult point, and a number of different choices are available. It would have been wrong to allow the weighting to be a function of the scale length for each item, as this is decided on the basis of precedent, convenience and a judgement of the reasonableness of the size of the distinctions that the scale length implies. Thus, we adjusted each score to result in equal weight. However, it would of course be possible to alter the weightings, for example to optimize reproducibility or internal consistency. This would require a very much larger number of patients, and in fact is unlikely to depart very much from equal weighting (as the contributions to Cronbach's alpha are so similar). Any such optimization would be likely to reduce considerably the contribution of the CAS, but we would argue on general grounds that it would be unwise to ignore objective findings in the assessment of an elbow problem. One consequence of this equal weighting is that the pain item is only 1/9 (11%) of the PAQ and 1/15 (approximately 7%) of the whole score. This is in contrast to, for example, the five major scoring systems reviewed by Turchin et al. [30], in which pain is weighted as 30–50% of the final score. However, pain can make a contribution to the functional limitations referred to in any of the other PAQ items. We felt it would be impractical and inappropriate to distinguish this contribution from, for example, restrictions due to deformity, instability or weakness. The pain weighting in fact makes little difference to the performance of the test. Increasing its weighting so that the pain question contributes 40% to the overall score (54% to PAQ alone) results in a trivial decrease in reproducibility (R2 now 0.89 for both PAQ and the full LES), and very small effects on the correlations between LES and other scores, both absolute values and preoperative–postoperative differences.
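The reweighting experiment just described can be sketched as follows (illustration only; the position of the pain item among the 15 rescaled items and the equal split of the remaining weight are our assumptions):

```python
def weighted_les(rescaled_items, pain_index=0, pain_weight=0.40):
    """Weighted LES on 0-10: the pain item receives pain_weight, the other 14 items share 1 - pain_weight."""
    other_weight = (1.0 - pain_weight) / (len(rescaled_items) - 1)
    return sum((pain_weight if i == pain_index else other_weight) * x
               for i, x in enumerate(rescaled_items))

# With pain_weight = 1/15 this reduces to the equally weighted score described earlier.
```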
We acknowledge that larger studies would be necessary to validate the LES separately for different elbow conditions. The LES was developed in tertiary care, and it would be interesting to test it in the primary care setting; we would expect that the increasing number of general practitioners with an interest in musculoskeletal conditions would find the CAS straightforward, and the PAQ could easily be administered by any health-care professional.
The authors have declared no conflicts of interest.
References