1 Department of Anesthesia, Sunnybrook and Women's College Health Sciences Centre, Women's College Campus, 76 Grenville Street, Toronto, Ontario, Canada M5S 1B2. 2 Centre for Research in Education, University Health Network, University of Toronto, Toronto, Canada
*Corresponding author. E-mail: pam.morgan@utoronto.ca
Accepted for publication: October 24, 2003
Abstract
Methods. Our undergraduate committee designed 10 scenarios based on curriculum objectives. Fifteen faculty members with undergraduate educational experience identified items considered appropriate for medical students' performance level and identified items that, if omitted, would negatively affect grades. Items endorsed by fewer than 20% of faculty were omitted. For the remaining items, weighting was calculated according to faculty responses. Students managed at least one scenario during which their performance was videotaped. Two raters independently completed the checklists for three consecutive sessions to determine inter-rater reliability. Validity was determined using Cronbach's α, with an α ≥ 0.6 and ≤ 0.9 considered acceptable internal consistency. Item analysis was performed by recalculating Cronbach's α with each item deleted to determine whether that item contributed to a low internal consistency.
Results. A total of 135 students participated in the study. Inter-rater reliability between the two raters, determined at the third session, was 0.97; therefore, one rater completed the remaining performance assessments. Cronbach's α for the 10 scenarios ranged from 0.16 to 0.93, with two scenarios demonstrating acceptable internal consistency with all items included. Three scenarios demonstrated acceptable internal consistency with one item deleted.
Conclusions. Five scenarios developed for this study were shown to be valid when using the faculty criteria for expected performance level.
Br J Anaesth 2004; 92: 388–92
Keywords: anaesthesia; assessment, validity; medical students, performance assessment; patient simulation
Introduction
A measurement scale should demonstrate face and content validity. Face validity ensures that scale items are actually measuring what they set out to measure. Content validity ensures that the scale has enough items and involves the appropriate domains. In addition, internal consistency should be within an acceptable range demonstrating homogeneity of the measurement tool.
There are no gold standard undergraduate performance-based assessments in anaesthesia to which simulator assessments can be compared for purposes of validation. This therefore limits comparisons to existing assessment methods such as written examinations and clinical evaluations.
A previous study assessing undergraduates' performance using high-fidelity patient simulation demonstrated low internal consistency of checklist assessments.2 It was deemed important to address potential reasons for these findings and to determine whether imposed changes improved the validity of the assessment tool.
Methods
Once responses were received, items endorsed by fewer than 20% of faculty were deleted to form the final performance checklist. Scoring of the remaining performance items was weighted according to the percentage of faculty endorsement for each performance and critical performance item. A total score of 100 was possible for each scenario, and a negative grade was possible if critical performance items were omitted.
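As a rough illustration of this scoring scheme, the sketch below computes a weighted checklist total with negative grading for omitted critical items; the item names, the weights, and the assumption that an omitted critical item subtracts its own weight are hypothetical and are not taken from the study checklists.

```python
# Hypothetical sketch of the weighted scoring described above: item weights
# (derived from faculty endorsement) sum to 100, and omitting a critical
# performance item subtracts its weight, so a negative total is possible.

def score_checklist(items, performed):
    """items: dict of item name -> (weight, is_critical);
    performed: set of item names the student completed."""
    total = 0.0
    for name, (weight, is_critical) in items.items():
        if name in performed:
            total += weight
        elif is_critical:
            total -= weight  # negative grading for an omitted critical item
    return total

# Invented checklist for illustration only.
checklist = {
    "assess airway": (20.0, False),
    "administer oxygen": (30.0, True),
    "call for help": (25.0, True),
    "check blood pressure": (25.0, False),
}

print(score_checklist(checklist, {"assess airway", "check blood pressure"}))  # -10.0
```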
All final-year medical students were invited to participate in the study (n=169). Full-day simulator sessions were carried out every 2 weeks during the anaesthesia rotation for a period of 36 weeks. Students were assigned randomly to manage each scenario. All assessments were done using the MedSim simulator (MedSim Incorporated, Kfar Sava, Israel). Written informed consent was obtained from each student and approximately 10 students participated in each session. Students' performances were videotaped for the purpose of subsequent checklist evaluation. Two raters independently assessed the students' performances in the first session using the videotapes. Once completed, the raters compared their results and reviewed the videotape to resolve any discrepancies in marking. Both raters then independently completed the performance checklists for the next student session and results were compared. After completion of the checklists for the third session, inter-rater reliability of the marking was 0.97; therefore, one rater completed the performance checklists for the remaining 15 sessions.
To ensure that appropriate domains were being tested, students were asked to complete an evaluation form at the end of each session. One of the questions asked whether the scenario content correlated with the learning objectives of the rotation, rated on a 5-point Likert scale (1=strongly disagree, 5=strongly agree).
Internal consistency of the checklists was determined. Scores for each student's performance were totalled and compared with their clinical and final examination marks in anaesthesia.
Statistical analysis
SPSS 11.0.1 for Windows was used for statistical analysis. Validity was determined using Cronbach's α, with an α ≥ 0.6 and ≤ 0.9 considered acceptable internal consistency. Item analysis was performed by recalculating Cronbach's α with each item deleted to determine whether that item contributed to a low internal consistency.
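For reference, Cronbach's α for a checklist of k items is the standard internal-consistency coefficient

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_{i}^{2}}{\sigma_{T}^{2}}\right)

where \sigma_{i}^{2} is the variance of scores on item i and \sigma_{T}^{2} is the variance of the total checklist score. The item analysis repeats this calculation k times, each time with one item removed.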
Performance checklist scores were compared with clinical, examination and final marks using a Pearson product moment correlation coefficient. P<0.05 was considered statistically significant.
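A minimal sketch of these analyses follows; the study itself used SPSS, so the Python code and the invented data array (students × checklist items, plus examination marks) are illustrative assumptions only.

```python
# Illustrative re-implementation of the analyses described above (the study used SPSS).
import numpy as np
from scipy import stats

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = students, columns = checklist items."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def alpha_if_item_deleted(item_scores):
    """Item analysis: recompute alpha with each item removed in turn."""
    item_scores = np.asarray(item_scores, dtype=float)
    return [cronbach_alpha(np.delete(item_scores, i, axis=1))
            for i in range(item_scores.shape[1])]

# Invented data: 12 students, 5 weighted checklist items, plus examination marks.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 20, size=(12, 5))
exam_marks = scores.sum(axis=1) + rng.normal(0, 5, size=12)

print(cronbach_alpha(scores))
print(alpha_if_item_deleted(scores))
r, p = stats.pearsonr(scores.sum(axis=1), exam_marks)  # checklist total vs examination mark
print(r, p)  # P < 0.05 taken as statistically significant
```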
Results
A total of 118 students completed the evaluation form related to the simulation experience. Eighty-five per cent of students either agreed or strongly agreed that the scenarios reflected the learning objectives of the rotation. Thirteen per cent had no opinion and 2% disagreed that the scenarios reflected the learning objectives.
All 15 faculty returned the scenarios with accompanying performance checklist items. Performance checklists for each scenario differed in the number of items in the final version but all checklists included items with negative grading. The checklists and grading are found in the Appendix. (See Supplementary data in the online version of this article.)
The proportion of students scoring greater than 60% on each scenario checklist is summarized in Table 1. The difference in checklist scores if negative grading was and was not included is illustrated in Figure 1.
Discussion
Face and content validity are important aspects in the development of any new assessment tool. These terms indicate whether the tool is assessing the desired qualities and whether it samples the relevant or important content or domains of a subject.5 The content of the scenarios in this study was deemed relevant by both faculty and the majority of students, which supports the appropriateness of the domains being tested. The final checklist evaluations in this study were considered to have face and content validity on the basis of the opinions of 15 experts.
It is also important that evaluation tools or checklists undergo both frequency of endorsement and a test of homogeneity using analysis of internal consistency.5 This process involves selection of items as identified by experts, who in this case were faculty members with significant experience in undergraduate medical education. Validation of a new measurement tool can be based on the determination of internal consistency. Internal consistency determines the degree to which each item (in this case a performance item) correlates with the scale as a whole. To determine the degree of homogeneity of the checklist or scale, it must be demonstrated that there is a moderate correlation between items.6 Cronbach's α is a technique used to determine the homogeneity of a scale. If Cronbach's α increases when a specific item is deleted, the indication is that that item's exclusion would increase the homogeneity of the scale. Therefore, the higher the α, the better, within limits. Generally, an α ≥ 0.60 and ≤ 0.90 is considered a reasonable measure of the validity of the scale.
Cronbach's α, however, is not only dependent on the magnitude of the item correlations, but also on the number of items that are included in the scale.6 If the scale or checklist does not have many items, it may be very difficult to achieve adequate internal consistency for the performance checklists. This fact may play a part in some of the scenarios not demonstrating adequate internal consistency in this study, as demonstrated by a low Cronbach's α. Expectations of medical students' performances may be limited at this point in their education. It may not be possible, therefore, to develop performance-based simulator assessments with the large number of expected performance items that might, in turn, increase the internal consistency of the scale.
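The dependence of α on the number of items is visible in its standardized form, expressed in terms of the number of items k and the mean inter-item correlation \bar{r}:

\alpha_{\text{standardized}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

With a mean inter-item correlation of 0.3, for example, a 5-item checklist yields α ≈ 0.68 whereas a 15-item checklist yields α ≈ 0.87, so short checklists are intrinsically limited in the internal consistency they can achieve.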
A previous study of medical students' performances using high-fidelity simulation showed similar results, with low internal consistency of all scenarios used in the evaluation.2 In that study, a 25-point criterion-based checklist was used and students were expected to manage a patient case from the beginning, including preoperative assessment, preparation, induction, and intraoperative management. The low internal consistency was thought to result from the multimodal nature of the performance expectations and from students being asked to manage a situation that was beyond their ability. It was concluded that further study was necessary and that the assessment process should be limited to the management of discrete events or testing of a particular skill rather than attempting to evaluate a lengthy case scenario.2
In the current study, students were expected only to manage a discrete event, and checklist formation centred on the management of that event. Some checklists, therefore, were limited in the number of items used and were unlikely to demonstrate an acceptable Cronbach's α. Another difference from the previous study was that items in this study were weighted according to faculty endorsement. The weighting may also have affected the internal consistency of the checklists.
The absence of correlation between the checklist scores and other evaluation tools is consistent with the findings of a previous study.2 The lack of correlation may be because management of critical events using human patient simulation is a performance-based assessment and therefore would not be expected to correlate with a written examination, which is mainly a knowledge-based assessment. Clinical evaluations are generally considered to be subjective and, in the course of our 2-week rotation in anaesthesia, are often based on intraoperative case discussions and rarely on a student's hands-on management of critical events.
Investigators have reported that an evaluation system using the simulator was able to differentiate individuals on the basis of clinical experience or training.4 7 Devitt and colleagues demonstrated that performance scores improved as the level of expertise of the subject increased, with medical students not performing as well as postgraduates or established clinicians.7 For this reason, construct validity was not assessed in this study.
The validity and reliability of the OSCE process have been well researched and documented for formative evaluation and the assessment of clinical competence.8–13 However, very little published research has examined the validity of high-fidelity simulator-based assessments. Although it seems logical that a performance-based assessment would be a more appropriate assessment of competence than a written examination, there are no data that support this statement.
Five of the scenarios in this study were found to have content validity and acceptable internal consistency. These scenarios can be used with confidence to evaluate medical students' performance. Although one might have expected the ventricular tachycardia scenario to have good internal consistency, a number of reasons may account for this not being the case. First, none of our students had taken an Advanced Cardiac Life Support course before the simulator evaluation. Secondly, the performance expectations included factors other than the guidelines outlined by the American Heart Association, such as calling for help. Very few students actually called for help, and this was reflected in the poor scores on this scenario.
Scenarios with low internal consistency require modification before use as assessment tools. Recognizing that the content of these scenarios may form an important part of the students' learning objectives, attention must be focused on either rewriting or re-weighting the performance items. Attention to the inclusion of negative grading, and potential alteration in the weighting of that grading, may contribute to improved internal consistency of these scenarios.
There has been an exponential increase in the number of high-fidelity patient simulators worldwide.14 This study provides a template from which valid assessment tools can be developed further for undergraduate, postgraduate, and continuing medical education as simulators become more widely used for this purpose. Before implementation of any simulator-based assessment tool, however, evaluators must ensure that the tool demonstrates reliability and validity and meets the standards expected of all other examination modalities.
Supplementary data
Acknowledgements
References
2 Morgan PJ, Cleave-Hogg D, Guest CB, Herold J. Validity and reliability of undergraduate performance assessments in an anesthesia simulator. Can J Anesth 2001; 48: 225–33
3 Devitt JH, Kurrek MM, Cohen MM, Cleave-Hogg D. The validity of performance assessments using simulation. Anesthesiology 2001; 95: 36–42
4 Byrne A, Jones J. Responses to simulated anaesthetic emergencies by anaesthetists with different durations of clinical experience. Br J Anaesth 1997; 78: 553–6
5 Streiner D, Norman G, eds. Basic Concepts. In: Health Measurement Scales: A Practical Guide to their Development and Use. Oxford: Oxford University Press, 1995
6 Streiner D, Norman G, eds. Selecting the Items. In: Health Measurement Scales, 2nd Edn. Oxford: Oxford University Press, 1995; 54–68
7 Devitt J, Kurrek M, Cohen M, et al. Testing internal consistency and construct validity during evaluation of performance in a patient simulator. Anesth Analg 1998; 86: 1160–4
8 Hodges B, Regehr G, Hanson M, McNaughton N. Validation of an objective structured clinical examination in psychiatry. Acad Med 1998; 73: 910–12
9 Hodges B, Regehr G, Hanson M, McNaughton N. An objective structured clinical examination for evaluating psychiatric clinical clerks. Acad Med 1997; 72: 715–21
10 Hilliard R, Tallett S, Tabak D. Use of an objective structured clinical examination as a certifying examination in pediatrics. Annals RCPSC 2000; 33: 222–8
11 Cohen R, Rothman A, Poldre P, Ross J. Validity and generalizability of global ratings in an objective structured clinical examination. Acad Med 1991; 66: 545–8
12 Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998; 73: 993–7
13 Grand'Maison P, Brailovsky C, Lescop J. Content validity of the Quebec licensing examination (OSCE). Can Fam Physician 1996; 42: 254–9
14 Morgan P, Cleave-Hogg D. A worldwide survey of simulation in anesthesia. Can J Anesth 2002; 49: 659–62