Tavistock & Portman NHS Trust, Tavistock Centre, London and Rampton Hospital, Retford, UK
Psychological Therapies Research Centre, University of Leeds, UK
Department of Psychotherapy, Manchester Royal Infirmary, UK
Psychological Therapies Research Centre, University of Leeds, UK
Correspondence: Dr Chris Evans, Rampton Hospital, Retford, Nottinghamshire DN22 0PD, UK. E-mail: chris{at}psyctc.org
Declaration of interest None. Funding detailed in Acknowledgements.
ABSTRACT
Aims To present psychometric data on reliability, validity and sensitivity to change for the CORE-OM (Clinical Outcomes in Routine Evaluation Outcome Measure).
Method A 34-item self-report instrument was developed, with domains of subjective well-being, symptoms, function and risk. Analysis includes internal reliability, test-retest reliability, socio-demographic differences, exploratory principal-component analysis, correlations with other instruments, differences between clinical and non-clinical samples and assessment of change within a clinical group.
Results Internal and test-retest reliability were good (0.75-0.95), as was convergent validity with seven other instruments; there were large differences between clinical and non-clinical samples and good sensitivity to change.
Conclusions The CORE-OM is a reliable and valid instrument with good sensitivity to change. It is acceptable in a wide range of practice settings.
INTRODUCTION
"Can mental health outcome measures be developed which meet the following three criteria: (1) standardised, (2) acceptable to clinicians, and (3) feasible for ongoing routine use? We shall argue that the answers at present are yes, perhaps, and not known."
For psychotherapies, we argue that the Clinical Outcomes in Routine Evaluation Outcome Measure (CORE-OM) described below can answer yes, largely and generally. There have been previous initiatives to create a core battery to assess change in psychotherapy (Waskow, 1975; Strupp et al, 1997). We have analysed some reasons why these did not achieve wide uptake (Barkham et al, 1998). In the UK, the need for a core battery and routine data collection has been acknowledged, as has the need for routine effectiveness and efficacy evidence (Department of Health, 1996, 1999; Roth & Fonagy, 1996). Despite a multitude of measures (Froyd et al, 1996), there is still no single, pantheoretical measure. Such a measure would need to measure the core domains of problems, meet Thornicroft & Slade's desiderata and be copyleft (i.e. the copyright holders license it for use without royalty charges subject only to the requirement that others do not change it or make a profit out of it).
Development of the new outcome measure
This paper assesses the self-report CORE-OM. Its rationale and
development have been described elsewhere
(Barkham et al, 1998;
Evans et al, 2000). A
team, led by the authors, reviewed current psychological measures and produced
a measure refined in two waves of pilot work involving quantitative analyses
and qualitative feedback from a wide group of service users and clinicians.
This paper reports the psychometric properties and utility of the final
measure.
The measure
The measure fits on two sides of A4 and includes 34 simply worded items, all
answered on the same five-point scale ranging from 'not at all' to
'most or all the time'. It can be hand-scored or scanned by
computer. The items cover four domains: subjective well-being (four items),
problems/symptoms (twelve items), life functioning (twelve items) and risk (to
self and to others; six items) (see Table
1). Some items are tuned to lower and some to higher intensity of
problems in order to increase scoring range and sensitivity to change; 25% of
the items are positively framed with reversed scores. Overall,
the measure is problem scored (i.e. higher scores indicate more problems).
Scores are reported as means across completed items, giving a pro-rated
score if there are incomplete responses. For example, if two items have not
been responded to, the total score is divided by 32 (see below). Pro-rating an
overall score is not recommended if more than three items have been missed;
nor should pro-rating be applied to domains if more than one item is missing
from that domain.
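The pro-rating rule described above can be sketched in code. This is a minimal illustration only, not part of the published scoring materials; the function name and the 0-4 response coding are assumptions:

```python
def prorate(responses, max_missing=3):
    """Pro-rated CORE-OM score: the mean across completed items.

    `responses` is a list of item responses coded 0-4, with None
    for omitted items. Returns None when too many items are
    missing to pro-rate (more than 3 for the overall score, as
    described above; pass max_missing=1 for a single domain).
    """
    answered = [r for r in responses if r is not None]
    if len(responses) - len(answered) > max_missing:
        return None  # too many omissions: do not pro-rate
    return sum(answered) / len(answered)
```

With 34 items and two omissions this divides the total across the 32 completed items, matching the worked example above.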
We recommend that the measure be used before and at the end of therapy. It may be useful to repeat it during longer therapies, and follow-up is highly desirable, even if not often sought in current clinical practice. Different therapies and services will address different questions and create very different patterns of best usage. A research-oriented example is given by Barkham et al (2001). We know of services offering very brief therapies that find it useful as an overall nomothetic assessment completed at initial and final sessions, whereas other services have posted the measure at referral and repeated it at assessment or first session. Services with waiting times between assessment and therapy have also found checking stability over that interval informative. Reliable and clinically significant change appraisal (see below) supports a case audit of successes and failures.
METHOD
The data
Results are reported on data from two main samples: a non-clinical sample
and a clinical sample. Samples are described in
Table 2.
The clinical data came from 23 sites that expressed an interest in such a measure in our initial survey of purchasers and providers (Evans et al, 2000) or were known through the UK Society for Psychotherapy Research's Northern Practice Research Network. The majority of sites were within the National Health Service (NHS) but they also included three university student counselling services and a staff support service. Two services were focused on primary care, whereas others had wider spans of referrals. Leadership and membership varied, including medical psychotherapists, clinical psychologists, counselling psychologists, counsellors and psychotherapists. Theoretical orientation also varied, the majority describing themselves as eclectic and the remainder asserting behavioural, cognitive-behavioural or psychodynamic orientations. Minimal patient demographic information was collected, but non-completion rates were not assessed because most services said that they were not logistically ready for this. Data used were the first available, whether from pre-treatment or the first treatment session.
One non-clinical sample was from a British university with both undergraduate and postgraduate students. To complement this in relation to the general population, a sample of convenience was sought from non-clinical workers, relatives and friends of the clinicians in the CORE battery team and in the major collaborating sites. Differences between the student and non-student samples generally were minimal and all results reported here are pooled across both.
Analysis
The CORE-OM data were scanned by computer using the FORMIC
data-capturing system (Formic Design and
Automatic Data Capture, 1996). Most analyses were conducted in
SPSS for Windows, version 8.0.2. Non-parametric tests were used because
statistical power was high and distributions generally differed significantly
from Gaussian. All inferential tests of differences were two-tailed against
P<0.05. The large sample sizes gave high statistical power, so that
significance would be found even for small effects; thus effect sizes and
confidence intervals (Gardner &
Altman, 1986) are generally reported. Most were produced by SPSS
but confidence intervals for Spearman correlations were calculated using
Confidence Interval Analysis (CIA; Gardner
et al, 1989) and those for Cronbach's alpha were
calculated using an SAS/IML (SAS
Institute, 1990) program written by one of us (C.E.), implementing
the methods of Feldt et al
(1987).
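Confidence intervals for correlations are commonly approximated with Fisher's z transformation; the sketch below illustrates that standard approach (the CIA program's exact method for Spearman coefficients may differ in detail):

```python
import math

def spearman_ci(rs, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation of
    rs from n pairs, via Fisher's z transformation."""
    z = math.atanh(rs)              # transform towards normality
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error on the z scale
    return (math.tanh(z - z_crit * se),
            math.tanh(z + z_crit * se))
```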
RESULTS
The item that was most often incomplete was no. 19 (I have felt warmth and affection for someone) in both samples (2.5% incomplete in the non-clinical and 3.8% incomplete in the clinical sample). The overall omission rate was 1.7%. If this rate applied to all items, then the numbers omitted would be binomially distributed. Forty-three or more omissions would be a significantly (P<0.05) elevated number. Items exceeding this were nos 21 and 34 (43 omissions), nos 20 and 30 (44), no. 32 (49) and no. 19 (61). A significantly low number of omissions would be 24 or fewer. These items were no. 3 (23 omissions), no. 2 (20), no. 14 (18) and no. 5 (16). There is heterogeneity in omission of items, with some suggestion that later items were omitted more frequently, but there is no link with domain.
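The omission thresholds quoted above follow from the binomial distribution. A sketch of the calculation, with the number of completed forms treated as an assumed parameter (the actual ns are those of the samples in Table 2):

```python
from math import comb

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def omission_cutoffs(n, p=0.017, tail=0.025):
    """For an omission rate p applying equally to all items across
    n completed forms: the largest omission count that is
    significantly low and the smallest that is significantly high,
    each at roughly the 2.5% tail (two-sided P < 0.05)."""
    low = next(k for k in range(n + 1) if binom_cdf(k, n, p) >= tail) - 1
    high = next(k for k in range(n + 1)
                if 1 - binom_cdf(k - 1, n, p) < tail)
    return low, high
```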
Internal consistency
Internal reliability is indexed most often by coefficient α
(Cronbach, 1951), which
indicates the proportion of the variance that is covariant between items. Low
values indicate that the items do not tap a nomothetic dimension of individual
differences. Very high values (near unity) indicate that too many items are
being used or that items are semantically equivalent (i.e. not adding new
information to each other). All domains show α
of >0.75 and <0.95
(i.e. appropriate internal reliability;
Table 3). Confidence intervals
show that the values are estimated very precisely by the large sample sizes.
Despite this, only the problem domain showed a statistically significant lower
reliability in the clinical than the non-clinical sample. Even this difference
of 2% (88% v. 90%) in the proportion of covariance is not
problematic, although its origins may prove to be of theoretical interest.
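Coefficient α as used here can be computed directly from the item and total-score variances. A minimal sketch follows (illustrative only; the published values and their confidence intervals were computed with SPSS and SAS):

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of
    k complete item responses: (k/(k-1)) * (1 - sum of item
    variances / variance of total scores)."""
    k = len(rows[0])
    items = list(zip(*rows))                          # responses per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)
```

Perfectly covarying items give α of 1; independent items give α near 0, matching the interpretation of α as the proportion of covariant variance.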
Test-retest stability
Very marked score changes over a short period of time would suggest
problems. Of 55 students approached, 43 returned complete data from both
occasions. Test-retest correlations were highest within domains (see
Table 4). The stability of the
risk domain was lowest at 0.64, which is unsurprising in view of the brevity
and the situational, reactive nature of these items. The stabilities of
0.87-0.91 for all other scores are excellent. The second part of
Table 4 gives the mean change,
95% confidence interval and the significance (Wilcoxon test), showing small
but statistically significant falls on some scores.
Convergent validity
As noted, the measure is designed to tap differences between clients and
change in therapy across the domains. Failure to correlate with
appropriate specific measures would suggest invalidity. Correlations
(Table 5) are highest against
conceptually close measures, showing convergent validity and that scores do
not just reflect common response sets.
The only exception was for the new version of the Beck Depression Inventory
(BDI-II; Beck et al,
1996), where the small n gives only low precision
(95% CI for rs=0.51-0.87).
At one site, a university counselling service, clinician ratings of significant risk were recorded and 7/40 clients were considered to be at risk. Their risk scores differed strongly and statistically significantly from those of the other 33 clients (means 1.1 v. 0.3; 95% CI for the difference 0.41-1.2, P<0.0005), with no statistically significant differences on the other domains. This supports the allocation of the risk items to their own domain, as does the high correlation with severe depression on the General Health Questionnaire (GHQ).
Differences between clinical and non-clinical samples
The main validity requirement of an outcome measure is that it should
discriminate between the clinical populations for which it has been designed
and the non-clinical populations. Table
6 illustrates that these differences were large and highly
statistically significant on all domains. Confidence intervals are small,
showing that the differences are estimated precisely and are large: more than
one point on a 0-4 scale for all domain scores other than risk.
The boxplot in Fig. 1 shows a few patients in the clinical sample scoring zero and a very few respondents (outliers) in the non-clinical sample scoring very highly. However, the box for each sample (which covers the middle 50% of scores in that group) does not overlap the median line bisecting the box of the other sample.
Ethnicity, age and gender differences
Students were asked whether English was their first language. Because
omission of items might reflect linguistic problems with the measure, the
number of omitted items was related to the first language. This showed that
the 50 respondents who said that their first language was not English omitted
an average of 2.5 items, as opposed to 0.35 by the other 607 who answered the
language question in that survey. This is statistically significant
(P<0.0005) but relatively few items were dropped by either group.
Internal consistency was similar for the samples, with no statistically
significant differences, suggesting that answering in a second language in
these samples did not impair internal consistency.
Analysis showed only small correlations between scores and age. There was a
statistically significant but negligible increase in symptom scores with age
(rs=0.076, P=0.014) in the non-clinical sample, and small
reductions in risk (rs=-0.15, P<0.0005) and function scores
(rs=-0.10, P=0.004) with age in the clinical sample.
Many psychological measures show gender differences, and much has been written on whether these represent response biases. In the design of the CORE-OM, we sought to minimise gender bias but had no belief in a gender-free instrument. The results (Table 7) show moderate and statistically significant gender differences in the non-clinical samples for all domain scores except functioning. The differences in the clinical samples were smaller, with statistically significant differences on well-being and, narrowly, on risk. Clearly, gender should be taken into account when relating individual scores to referential data, but the effects of gender are small compared with the effects of clinical v. non-clinical status.
Correlations between domain scores
Given the interrelationship between clinical domains, scores were expected
to be positively correlated. The correlations in
Table 8 show that the risk
items show lower correlations with the other scores, more so in the
non-clinical than the clinical sample. The three other scores show high
correlations with each other.
Exploratory principal-component analysis
Principal-component analyses were conducted separately for the clinical and
non-clinical samples. The scree plot for the non-clinical sample is shown in
Fig. 2. This shows the very
large proportion of the variance in the first component (38%) and the
suggestion of an elbow (i.e. a flatter scree;
Cattell, 1966) after three components.
The pattern matrix after oblique rotation (Table 9) shows a clear separation of the items into a negatively worded group, a group made up largely of the risk items and a positively worded group. Figure 3 presents the scree plot for the clinical sample. Again, the pattern matrix suggests three components: a problem one, a risk one and a more positively worded one. However, the solution seems to differ in fine detail from that for the non-clinical sample (Table 10).
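The scree inspection described above amounts to examining the eigenvalues of the item correlation matrix. A sketch of that step using numpy on synthetic data (not the study data; the dominant first component mimics the pattern reported here):

```python
import numpy as np

def scree(data):
    """Eigenvalues of the item correlation matrix, largest first,
    and the proportion of total variance each component explains."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals, eigvals / eigvals.sum()

# Synthetic example: six items sharing one strong common factor,
# so the first component should dominate the scree.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
items = shared + 0.3 * rng.normal(size=(500, 6))
eigvals, proportions = scree(items)
```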
Sensitivity to change
To test for possible differences relating to the nature of problems and to
differences in typical numbers of sessions offered, change was considered in
relation to three settings: counselling in primary care, student counselling
and a clinical group comprising NHS psychotherapy and/or
counselling services (i.e. the remainder of the overall sample). The results
(Table 11) show substantial
and highly statistically significant improvements on all scores for all three
settings.
Reliable and clinically significant change (RCSC)
The methods of classifying change as reliable and as
clinically significant address individual change rather than
group mean change. Reliable change is change that would be found in only 5% of
cases if change were simply due to unreliability of measurement. Clinically significant change
is what moves a person from a score more characteristic of a clinical
population to a score more characteristic of a non-clinical population
(Jacobson & Truax, 1991).
RCSC analysis complements and extends grouped analyses
(Evans et al, 1998).
The referential data reported here give the cut-points shown in
Table 12.
Using those cut-points and the coefficient α value of 0.94 to calculate the
reliable change criterion allows the change categories to be counted. The
three possible categories of reliability of change are: small enough to fall
within the range that would be seen by chance alone given reliability
(not reliable); reliable improvement; and reliable
deterioration. The four categories of clinical significance of change are:
stayed in the clinical range; stayed in the non-clinical range; changed from
clinical to non-clinical (clinically significant improvement);
and changed from the non-clinical to the clinical (clinically
significant deterioration). Together, these give the 12 theoretically
possible change categories seen in Table
13. Clearly, the ideal outcome is the one shown in bold: reliable
and clinically significant improvement. A few patients will score too low on
entry into therapy to show clinically significant improvement, whereas some
will score highly on entry and improve reliably but not necessarily such that
they end below the cut-point to be in the clinically significant improved
range.
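The cross-classification described above can be sketched as follows. This is a hedged illustration: the clinical/non-clinical cut-point and pre-treatment standard deviation must come from the referential data (Tables 2 and 12); the values in the usage example are invented.

```python
import math

def reliable_change_criterion(sd, reliability, z_crit=1.96):
    """Jacobson & Truax reliable change criterion: the smallest
    pre-post difference unlikely (P < 0.05) to arise from
    measurement unreliability alone."""
    se_measurement = sd * math.sqrt(1 - reliability)
    se_difference = math.sqrt(2) * se_measurement
    return z_crit * se_difference

def classify_change(pre, post, cut_point, rc_crit):
    """Cross-classify reliability and clinical significance of
    change. Scores are problem scored, so lower is better, and
    scores at or above `cut_point` count as in the clinical range."""
    diff = pre - post
    if diff >= rc_crit:
        reliable = "reliable improvement"
    elif diff <= -rc_crit:
        reliable = "reliable deterioration"
    else:
        reliable = "not reliable"
    pre_clin, post_clin = pre >= cut_point, post >= cut_point
    if pre_clin and not post_clin:
        clinical = "clinically significant improvement"
    elif post_clin and not pre_clin:
        clinical = "clinically significant deterioration"
    elif pre_clin:
        clinical = "stayed in clinical range"
    else:
        clinical = "stayed in non-clinical range"
    return reliable, clinical
```

Crossing the three reliability categories with the four clinical significance categories yields the 12 theoretically possible change categories of Table 13.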
The majority of patients showed reliable improvement in all three groups. The clinical significance results were less impressive, with only a slight majority, except in the primary care sample, which showed no clinically significant change. Very few showed either clinically significant or reliable deterioration. However, identifying the 19 people of the 281 (7%) who appear to have shown reliable, or clinically significant, deterioration would support case-level audit.
Without knowing more about the clinical services or about the non-response rates, it is premature to interpret either the grouped or the individually categorised change data comparatively. However, they underline that the measure is sensitive to, and can usefully categorise, change in all three settings.
DISCUSSION
Is the CORE-OM reliable, valid and sensitive to change?
The results presented are satisfactory. The CORE-OM and its domain scores
show excellent internal consistency in large clinical and non-clinical
samples. In addition, it has high 1-week test-retest reliability in a
small sample of students. Convergent validation against a battery of existing
measures and clinician ratings of risk is good. Gender differences are
statistically significant in the non-clinical samples but less so in the
clinical samples. Although sufficiently different to require gender-specific
referential data, the differences are small enough in relation to the
clinical/non-clinical differences to suggest that the measure is not heavily
gender-biased. On this evidence, the CORE-OM meets the required standard for
acceptable validity and reliability.
The very strong discrimination between clinical and non-clinical samples suggests that the measure is well tuned to this distinction. The correlations between the domain scores and the principal-component analysis suggest that item responses across domains are highly correlated, both in clinical and non-clinical samples. A first component accounts for a large proportion of the variance, but a three-component structure that separates problems, risk items and positively scored items may be worthy of further exploration, particularly in relation to the phase model of change in psychotherapy (Howard et al, 1993). Change data from counselling in primary care, student counselling and NHS psychotherapies all suggest that the CORE-OM is sensitive to change and capable of categorising change using the methods of reliable and clinically significant change.
Is the CORE-OM acceptable and accessible?
The rates of omitted items are such that most scores can be pro-rated and
the measure has good acceptability in clinical and non-clinical use.
Non-completion rates were not assessed, so the results can be generalised only
to the population of clients who are currently willing to complete such
measures on the minimal encouragement available when a research project is
spliced onto normal clinical practice. Work is now in progress with some sites
to gain regular and detailed non-completion information and to explore
residual practitioner and patient reluctance.
The non-clinical data-sets provide referential data on score distributions in British populations that are not available for many symptom measures in routine use. Further work is in progress to develop translations into other languages. In addition, two parallel, 18-item, single-sided short forms are available for services wishing to track progress session by session. Data on them will be reported separately.
Does the CORE-OM have wide utility?
The collaboration between practitioners and researchers has produced a
reliable, valid and user-friendly core outcome measure that has clinical
utility in a range of different settings. It has achieved its design aims.
However, the aims went beyond creating another measure. The
first intention was that the CORE-OM constitutes a core:
something onto which other measures can be added
(Barkham et al, 1998),
as shown in Fig. 4. The CORE-OM
then constitutes a common, available measure to pursue the broader goals of
measurement of efficacy and effectiveness in psychological treatments. A
report on its usage in one large service is given by Barkham et al
(2001). However, to return to
Thornicroft & Slade (2000)
for a more general overview:
"Can mental health outcome measures be developed which meet the following three criteria: (1) standardised, (2) acceptable to clinicians, and (3) feasible for ongoing routine use?...... implementing the routine use of outcome measures is a complex task involving the characteristics of the scales, the motivation and training of staff, and the wider clinical and organisational environment.
... When assessed using these criteria [applicability, acceptability and practicality] it is clear that our current knowledge tells us more about barriers to implementing routine outcome measures than about the necessary and sufficient ingredients for their successful translation into clinically meaningful everyday use."
We believe that the CORE-OM and the CORE system provide a strong platform to amend their first assessment to yes, largely and generally for psychological therapies, and we believe that the CORE-OM shows applicability, acceptability and practicality. However, we agree completely that much cultural change, in which practice-based evidence (Margison et al, 2000) must be given respect equal to evidence-based practice, will be needed for "successful translation into clinically meaningful everyday use".
Clinical Implications and Limitations
REFERENCES
Barkham, M., Evans, C., Margison, F., et al (1998) The rationale for developing and implementing core outcome batteries for routine use in service settings and psychotherapy outcome research. Journal of Mental Health, 7, 35-47.
Barkham, M., Margison, F., Leach, C., et al (2001) Service profiling and outcomes benchmarking using the CORE-OM: towards practice-based evidence in the psychological therapies. Journal of Consulting and Clinical Psychology, 69, 184-196.
Beck, A. T., Ward, C. H., Mendelson, M., et al (1961) An inventory for measuring depression. Archives of General Psychiatry, 4, 561-571.
Beck, A. T., Epstein, N., Brown, G., et al (1988) An inventory for measuring clinical anxiety: psychometric properties. Journal of Consulting and Clinical Psychology, 56, 893-897.
Beck, A. T., Steer, R. A. & Brown, G. K. (1996) Manual for the Beck Depression Inventory Second Edition (BDI-II). San Antonio, TX: Psychological Corporation.
Cattell, R. B. (1966) The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Core System Group (1998) CORE System (Information Management) Handbook. Leeds: Core System Group.
Cronbach, L. J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Department of Health (1996) NHS Psychotherapy Services in England. Review of Strategic Policy. London: HMSO.
Department of Health (1999) A National Service Framework for Mental Health. London: Stationery Office.
Derogatis, L. R. (1983) SCL-90-R: Administration, Scoring & Procedures: Manual. Towson, MD: Clinical Psychometric Research.
Derogatis, L. R. & Melisaratos, N. (1983) The Brief Symptom Inventory: an introductory report. Psychological Medicine, 13, 595-605.
Evans, C. E., Margison, F. & Barkham, M. (1998) The contribution of reliable and clinically significant change methods to evidence-based mental health. Evidence-Based Mental Health, 1, 70-72.
Evans, C. E., Mellor-Clark, J., Margison, F., et al (2000) Clinical Outcomes in Routine Evaluation: the CORE Outcome Measure (CORE-OM). Journal of Mental Health, 9, 247-255.
Feldt, L. S., Woodruff, D. J. & Salih, F. A. (1987) Statistical inference for coefficient alpha. Applied Psychological Measurement, 11, 93-103.
Formic Design and Automatic Data Capture (1996) FORMIC 3 for Windows. London: Formic Ltd.
Froyd, J. E., Lambert, M. J. & Froyd, J. D. (1996) A review of practices of psychotherapy outcome measurement. Journal of Mental Health, 5, 11-15.
Gardner, M. J. & Altman, D. G. (1986) Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ, 292, 746-750.
Gardner, M. J., Gardner, S. B. & Winter, P. D. (1989) Confidence Interval Analysis (C.I.A.) Microcomputer Program Manual. London: BMJ Press.
Goldberg, D. P. & Hillier, V. F. (1979) A scaled version of the General Health Questionnaire. Psychological Medicine, 9, 139-145.
Howard, K. I., Lueger, R. J., Maling, M., et al (1993) A phase model of psychotherapy outcome: causal mediation of change. Journal of Consulting and Clinical Psychology, 61, 678-685.
Jacobson, N. S. & Truax, P. (1991) Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19.
Margison, F. R., Barkham, M., Evans, C., et al (2000) Measurement and psychotherapy. Evidence-based practice and practice-based evidence. British Journal of Psychiatry, 177, 123-130.
Roth, A. & Fonagy, P. (1996) What Works for Whom? A Critical Review of Psychotherapy Research. New York: Guilford.
SAS Institute (1990) SAS/IML Software: Usage and Reference. Version 6 (1st edn). Cary, NC: SAS Institute Inc.
Strupp, H. H., Horowitz, L. M. & Lambert, M. J. (1997) Measuring Patient Changes in Mood, Anxiety and Personality Disorders: Toward a Core Battery. Washington, DC: American Psychological Association.
Thornicroft, G. & Slade, M. (2000) Are routine outcome measures feasible in mental health? Quality in Health Care, 9, 84.
Waskow, I. E. (1975) Selection of a core battery. In Psychotherapy Change Measures (eds I. E. Waskow & M. B. Parloff), pp. 245-269. Rockville, MD: National Institute of Mental Health.
Received for publication July 13, 2000. Revision received June 26, 2001. Accepted for publication September 27, 2001.