1Morriston Hospital, Swansea SA6 6NL, UK. 2Royal Victoria Infirmary, Newcastle upon Tyne NE1 4LP, UK. *Corresponding author
Accepted for publication: October 27, 2000
Abstract
This review was undertaken to discover what assessment instruments have been used as measures of performance during anaesthesia simulation and whether their validity and reliability have been established. The literature describing the assessment of performance during simulated anaesthesia amounted to 13 reports published between 1980 and 2000. Only four of these were designed to investigate the validity or reliability of the assessment systems. We conclude that the efficacy of methodologies for assessment of performance during simulation is largely undetermined. The introduction of simulator-based tests for certification or re-certification of anaesthetists would be premature.
Br J Anaesth 2001; 86: 445–50
Keywords: education, simulation of anaesthesia; anaesthesia, simulation; publications
Complex simulation of anaesthesia is an exciting development in medical education and has strong appeal as a way of measuring the performance of anaesthetists without the risks of using real patients.1 However, 15 yr ago, Vreuls and Overmayer wrote: 'Performance measurement in otherwise sophisticated training simulators is so poorly designed that the measures are virtually useless.'2 In order to determine how much progress has been made since then, we have examined five aspects of the reported assessments that bear on the evaluation of a simulator-based assessment system.
(i) The type of simulator. Chopra3 developed a classification that divides simulators into three groups according to whether the simulation environment is of high, intermediate or low fidelity.
(ii) The type of simulation. A simulator can be used in a full theatre environment with a full complement of staff present, for short simulations with only the anaesthetist present, or as a device for training for specific tasks.
(iii) The efficacy of the assessment. Assessment of performance during simulation must be valid and reliable.4 A valid assessment measures what it is intended to measure (Table 1). The reliability of a test reflects the consistency and precision of the observations, e.g. between assessors or between tests.
(iv) The aspect of performance tested. Performance of anaesthesia involves propositional knowledge, procedural knowledge, psychomotor skills and professional behaviours.
(v) The assessment instruments applied in conjunction with the simulation. There are a number of measures of outcome that can be used during or after a simulator session (Table 2). These can be directed towards single aspects of performance or they can depend upon a global score for overall performance. Multiple instruments may be applied to a single simulation, especially when the sessions are videotaped.
Search strategy and criteria for assessment of the papers
Articles were sought that described the assessment of performance during simulated anaesthesia and that were published between 1980 and 2000. The review was restricted to articles that had been peer reviewed. Internet search engines were used to search for anaesthesia simulation sites and to identify key workers in the field. The Medline and Embase databases were then searched using the words 'anaesthesia' and 'anaesthesiology' (in all their variations), 'simulation', 'performance', 'competence' and 'assessment', as well as the names of authors previously identified. The references in each report were examined for relevance to simulation, and their authors were made the basis of a repeat search; this was continued through three cycles. Studies were included only if they contained an attempt to measure performance.
Principal findings
The features of the assessments in these reports are shown in Table 3. Gaba and DeAnda5 studied 19 first- and second-year anaesthesia residents during simulation on the Comprehensive Anaesthesia Simulation Environment (CASE) simulator. Their purpose was to determine the response of anaesthetists to critical incidents. As part of the study they attempted to determine whether there were differences between individual trainees or between experience levels in the rapidity of detection and correction of intra-operative problems. The simulations contained six adverse events. They assessed time to detect, time to solve, and adherence to protocols based on accepted resuscitation guidelines. Acceptable actions were classified as either compensation criteria or corrective criteria, depending on whether they merely reduced the consequences of the problem or treated the cause. Deviations from the protocols were classified as major or minor. The investigators showed that second-year residents corrected problems significantly more quickly than first-year residents (P<0.05) but that variations in performance were wide. Time to detect and rates of adherence to protocols did not vary between the two groups studied.
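To make the timing measures concrete, the following sketch shows how 'time to detect', 'time to solve' and the compensatory/corrective classification could be derived from a log of a subject's actions. It is a minimal illustration only: the event representation, field names and the example protocol are assumptions and are not taken from the CASE study.

```python
# Hypothetical illustration of the timing metrics used by Gaba and DeAnda:
# given the onset of a simulated adverse event and a log of the subject's
# actions, derive "time to detect" and "time to solve", and classify each
# protocol-relevant action as compensatory or corrective.
from dataclasses import dataclass

@dataclass
class Action:
    time_s: float   # seconds from the start of the scenario
    name: str       # e.g. "give_fluid_bolus" (illustrative action name)

def score_event(onset_s, detected_s, resolved_s, actions, protocol):
    """Return timing metrics and a classification of protocol actions.

    protocol maps action name -> "compensatory" or "corrective".
    """
    time_to_detect = detected_s - onset_s
    time_to_solve = resolved_s - onset_s
    taken = {a.name for a in actions if onset_s <= a.time_s <= resolved_s}
    classified = {name: kind for name, kind in protocol.items() if name in taken}
    omitted = [name for name in protocol if name not in taken]
    return time_to_detect, time_to_solve, classified, omitted

# Example: a hypothetical hypotension event
protocol = {"give_fluid_bolus": "compensatory", "treat_cause": "corrective"}
actions = [Action(130.0, "give_fluid_bolus"), Action(200.0, "treat_cause")]
print(score_event(onset_s=100.0, detected_s=120.0, resolved_s=210.0,
                  actions=actions, protocol=protocol))
```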
Howard and colleagues7 used simple written questions to assess the understanding of Anesthesia Crisis Resource Management (ACRM) principles before and after a training course, using a high-fidelity theatre-type simulator (TTS). They showed that residents improved their scores after the simulation, but experienced practitioners did not.
Gaba and colleagues8 reported a study with the primary purpose of evaluating the inter-rater variability of a system of scoring the behaviour and technical performance of anaesthetists during simulations. They retrospectively reviewed videotapes of 72 participants in ACRM courses. Each subject was rated by five observers for crisis management behaviour and by three observers for technical actions. The study showed that, for technical actions, agreement between observers was good, whereas that for crisis management skills was less good.
Schwid and O'Donnell9 studied 30 anaesthetists during simulations using the Anaesthesia Simulator Consultant (ASC), a screen-based simulation program. The purpose of the study was to assess the ability of anaesthetists to recognise diagnostic clues, to make the diagnosis rapidly, and to evaluate the patient's response during simulated critical incidents. Each subject was tested with five simulations. Performance was assessed by adherence to management protocols and rated as either correct or incorrect. There was no consistent relationship between experience and performance. The major finding was a clear relationship between the ability of subjects to follow Advanced Cardiac Life Support (ACLS) guidelines and the period of time since ACLS training. No attempt was made to validate the assessment protocol.
Byrne and colleagues10 published a study using the Anaesthetic Computer-Controlled Emergency Situation Simulator (ACCESS) system, an intermediate-fidelity TTS. The simulations were short scenarios containing simple problems with well-defined solutions. Performance was assessed by time to solve and mortality. The inexperienced subjects took longer to solve the problems and mortality was higher, but this was a preliminary study and no conclusions about the efficacy of the performance assessment were drawn from the documented data.
In 1997, Byrne and Jones11 studied more anaesthetists using the same technique as in the previous study. The anaesthetists were divided into four groups depending on the length of their anaesthetic experience. Anaesthetists with <1 yr's experience took longer to treat the problem and caused more deaths. Although the result was statistically significant, there were wide variations in performance within each group.
Chopra and colleagues12 studied the performance of 28 trainee anaesthetists over two simulations using a high-fidelity TTS, the Leiden Anaesthesia Simulator. The purpose of the study was to evaluate quantitatively the efficiency of simulation as a training tool in anaesthesia. Subjects were asked to solve a crisis 4 weeks after being challenged with either the same crisis or an unrelated control emergency. Measurement of performance was based on a scoring system developed from accepted guidelines. Numerical weightings were assigned to each appropriate action, together with a judgement as to whether the action was of major or minor importance. For each individual, a treatment score represented the total score for positive actions, a deviation score represented the total for actions omitted, and a total performance score was calculated from the difference between the two. The time taken to first response was also noted. For all four measured variables, performance during the simulation of malignant hyperthermia was significantly better in the group exposed to a training simulation of malignant hyperthermia, when compared with their baseline performance.
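The arithmetic of this scoring system can be illustrated with a short sketch. The action names and weights below are invented for illustration and are not taken from the Leiden scoring sheets; only the structure (weighted positive actions, weighted omissions, and total performance as their difference) follows the description above.

```python
# Hypothetical sketch of a weighted crisis-management scoring system of the
# kind described by Chopra and colleagues. Actions performed contribute to
# the treatment score, actions omitted to the deviation score, and total
# performance is the difference. Weights and action names are invented.
weights = {
    "stop_trigger_agents": 10,   # hypothetical major action
    "give_dantrolene": 10,       # hypothetical major action
    "hyperventilate_100_o2": 5,  # hypothetical minor action
    "active_cooling": 5,         # hypothetical minor action
}

def performance_scores(actions_performed, weights):
    treatment = sum(w for a, w in weights.items() if a in actions_performed)
    deviation = sum(w for a, w in weights.items() if a not in actions_performed)
    return treatment, deviation, treatment - deviation

performed = {"stop_trigger_agents", "hyperventilate_100_o2"}
print(performance_scores(performed, weights))   # (15, 15, 0)
```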
Lindekaer and colleagues13 studied performance in managing ventricular fibrillation. Eighty anaesthetists worked in 40 teams of two people during a simulation session of the SOPHUS simulator, a high-fidelity TTS. The simulator sessions were videotaped and reviewed along with the subjects. The performance of each team was compared with the European Resuscitation Guidelines, modified for an anaesthetized patient. The number of teams making each intervention and the times taken from the onset of arrhythmia were noted for each team. There was poor adherence to the resuscitation protocol.
Devitt and colleagues14 studied the inter-rater reliability of two anaesthetists who independently observed and scored a total of 30 clinical problems. Five problems were generated in each hour-long scenario. For each problem, the actions of the subject were rated as: 0, no response; 1, compensatory intervention; 2, corrective treatment. The two raters were in complete agreement over the score in 29 out of 30 problems. The authors report this as demonstrating a high level of inter-rater reliability.
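Agreement of this kind can be summarized either as simple percentage agreement or with a chance-corrected statistic such as Cohen's kappa. The sketch below computes both for two raters scoring 30 problems on the 0/1/2 scale; the ratings are invented and the choice of kappa is an assumption, not a statistic reported in the study.

```python
# Percentage agreement and Cohen's kappa for two raters scoring the same
# 30 problems on a 0/1/2 scale. Ratings below are hypothetical.
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n          # raw agreement
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = [2] * 20 + [1] * 9 + [0]   # hypothetical scores for 30 problems
rater2 = [2] * 20 + [1] * 9 + [1]   # disagreement on one problem only
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"agreement = {agreement:.2f}, kappa = {cohens_kappa(rater1, rater2):.2f}")
```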
In 1998, Devitt and colleagues15 described a study primarily designed to test the items in a rating system developed specifically to evaluate the performance of anaesthetists in a simulated setting. The principal objective of the study was to determine whether the test was reliable, as judged by the internal consistency between test items. A secondary aim was to see whether the test items could discriminate between two groups of anaesthetists with different lengths of experience; this would demonstrate that the test had construct validity. They studied the performance of 25 anaesthetists during a 1 h simulator session. The inter-rater reliability of the scoring system had been evaluated previously.14 The more experienced anaesthetists produced significantly higher scores (P<0.001). Individuals did not score consistently across all five scenarios, i.e. there was poor internal consistency between test items. This variation was mostly in two items, and when these items were removed from each scenario the consistency was improved.
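Internal consistency of this kind is commonly summarized with Cronbach's alpha, calculated from each subject's scores on the individual scenarios. The study does not publish raw scores, and the exact statistic used is not detailed here, so the sketch below is purely illustrative, with invented data.

```python
# Cronbach's alpha for a subjects-by-scenarios matrix of scores; the data
# below are invented for illustration only.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array-like, rows = subjects, columns = test items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical scores for 6 subjects on 5 scenarios
scores = [[8, 7, 9, 8, 7],
          [5, 6, 5, 6, 5],
          [9, 9, 8, 9, 9],
          [4, 5, 4, 4, 5],
          [7, 6, 7, 7, 6],
          [6, 6, 6, 5, 6]]
print(round(cronbach_alpha(scores), 2))   # high alpha: consistent invented data
```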
In 1998, Kurrek and colleagues16 studied the performance of 89 anaesthetists during a simulation, on a high-fidelity TTS, that contained an episode of ventricular fibrillation. Performance was rated on a three-point scale. Seventy per cent of participants had taken an ACLS course and these subjects achieved higher scores than those who had not been trained (P<0.05). The assessment was limited to measuring adherence to resuscitation guidelines in a small part of a lengthy simulation.
Morgan and Cleave-Hogg17 reported a pilot study of an assessment of medical students who undertook the anaesthetic management of simulated patients in a high-fidelity TTS. Six 15 min scenarios were developed and each student worked through one of these. Each student's performance was videotaped and assessed separately by five evaluators using standardized performance evaluation criteria. The authors do not explain how their scoring system operated. The simulator scores showed high inter-rater reliability. The simulator marks showed poor correlation with marks allocated for the clinical attachment and for short answer questions.
Discussion
All the reports reviewed described assessment of performance in the course of conducting a simulated general anaesthetic during which critical incidents were presented.
What aspect of performance was tested?
No study directly assessed propositional knowledge. All the studies made some assessment of procedural knowledge, using several assessment techniques: (i) compliance with a recognized treatment algorithm, e.g. ALS;5 6 8 9 13–16 (ii) compliance with criteria determined by the researchers as representing good practice;5 6 8 10 11–15 17 (iii) compliance with management systems developed for dealing with anaesthetic crises;7 8 (iv) complex observations with associated scoring systems;7–9 and (v) measurement of time to respond or time to solve, used in several studies to assess clinical judgement and decision making.5 6 10 11 The study by Gaba and colleagues8 included assessment of professional behaviours in the form of crisis management scoring. None of these investigations specifically assessed psychomotor skills.
Assessment instruments used
In many of these studies scoring was by review of videotapes.5 6 8 12 14–17 A number of workers used measures of time to solve or time to detect.5 6 10 11 Byrne and colleagues used patient death as a measure of outcome.10 11 Psychological tests have been used in the operating room to measure the performance of anaesthetists.18 19 There are no reports of this form of assessment being used in conjunction with anaesthetic simulation.
The efficacy of assessments used in the simulator
In this review we have found only four studies that were designed primarily to investigate aspects of the validity and reliability of assessments undertaken in the simulator.8 14 15 17
Validity
None of the studies attempted to prove the face or content validity of their assessments. These aspects of the assessments cannot be taken for granted. The construct validity of the assessment system was investigated directly by Devitt and colleagues15 in a small study of 25 subjects. The individual test items showed poor internal consistency, such that conclusions cannot be made about the performance of individuals. Few of these papers make reference to criterion validity, i.e. the extent to which the subject's performance is judged adequate. They record the performance of individuals but do not attempt to translate this into a measurement of competence. The raters in the study by Gaba and colleagues considered some of the performances deficient if two of the observers considered that the subjects had omitted two actions that the simulator crew had rated as essential.8 Other studies do not identify any reference criteria for marking a performance as unsatisfactory.
Morgan and Cleave-Hogg17 found that their simulator assessment correlated poorly with both written tests and the opinions of supervising consultants. The authors suggest that their new test measures different aspects of their students' performance.
A number of studies found that individual test subjects had a wide variation in response to apparently similar test items.5 6 10 11 15 16 This can be explained as lack of concurrent validity in the assessment system. It is well recognized with other assessment systems, such as multiple choice questions, that apparently well designed questions can lead to unpredicted responses.4 These questions are usually discarded. It must be presumed that a similar situation exists in simulator assessments.
Reliability
Devitt and colleagues14 investigated the inter-rater reliability of scoring. Gaba and colleagues,8 using three raters, showed good inter-rater reliability in scoring technical actions. There was less inter-rater agreement between five raters scoring ACRM behaviours (such as orientation to case, communication, feedback, anticipation, etc.). The authors analyse the statistics of inter-rater reliability in depth and note that their scoring system requires further development before it can be used for assessment of competence in high-stakes situations such as certification. Morgan and Cleave-Hogg17 found that the inter-rater reliability of their assessment in 25 subjects was good.
Thirteen reports include comparative assessment of subjects working in anaesthesia simulators. In two of these the construct validity of the assessment is addressed16 17 and in three, measures are described that attempt to determine the inter-rater reliability of the assessment.8 14 17 Several points emerge. (i) Simulators can generate a variety of tasks that can be used as the basis for the assessment of performance. (ii) Simulators can be used to measure adherence to protocols. (iii) There is some preliminary evidence that scoring of actions in response to simulated situations shows inter-rater reliability; the validity of the observations is not established. (iv) The reported within-subject and within-group variability calls into question some of the stimulus–response expectations of the investigators. (v) Attempts to assess the performance of anaesthetists are in their infancy, and few of the studies to date have specifically designed assessments to address the questions of validity and reliability. It must be concluded that any move towards using performance in administering simulated anaesthesia as an assessment in training or in re-certification procedures would be premature.
References
1 Kapur PA, Steadman RH. Patient simulator competency testing: ready for takeoff? Anesth Analg 1998; 86: 1137–9
2 Vreuls D, Overmayer RW. Human system performance measurements in training simulators. Human Factors 1985; 27: 241–50
3 Chopra V. Anesthesia simulators. In: Aitkenhead AR, ed. Clinical Anesthesiology: Quality Assurance and Risk Management in Anesthesia. London: WB Saunders, 1996; 297–316
4 Black PJ. Testing: Friend or Foe? Theory and Practice of Assessment and Testing. London: Falmer Press, 1998; 37–50
5 Gaba DM, DeAnda A. The response of anesthesia trainees to simulated critical incidents. Anesth Analg 1989; 68: 444–51
6 DeAnda A, Gaba DM. Role of experience in the response to simulated critical incidents. Anesth Analg 1991; 72: 308–15
7 Howard SK, Gaba DM, Fish KJ, Yang G, Sarnquist FH. Anesthesia crisis resource management training: teaching anesthesiologists to handle critical incidents. Aviat Space Environ Med 1992; 63: 763
8 Gaba DM, Howard SK, Flanagan B, Smith BE, Fish KJ, Botney R. Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology 1998; 89: 8–18
9 Schwid HA, O'Donnell D. Anesthesiologists' management of simulated critical incidents. Anesthesiology 1992; 76: 495–501
10 Byrne AJ, Hilton PJ, Lunn JN. Basic simulations for anaesthetists. A pilot study of the ACCESS system. Anaesthesia 1994; 49: 376–81
11 Byrne AJ, Jones JG. Responses to simulated anaesthetic emergencies by anaesthetists with different duration of clinical experience. Br J Anaesth 1997; 78: 553–6
12 Chopra V, Gesink BJ, De Jong J, Bovill JG, Spierdijk J, Brand R. Does training on an anaesthesia simulator lead to an improvement in performance? Br J Anaesth 1994; 73: 293–7
13 Lindekaer AL, Jacobsen J, Anderson G, Laub M, Jensen PF. Treatment of ventricular fibrillation during anesthesia in an anesthesia simulator. Acta Anaesthesiol Scand 1997; 41: 1280–84
14 Devitt JH, Kurrek MM, Cohen MM, Fish P, Murphy PM, Szalai J-P. Testing the raters: inter-rater reliability of standardized anesthesia simulator performance. Can J Anesth 1997; 44: 924–8
15 Devitt JH, Kurrek MM, Cohen MM, et al. Testing internal consistency and construct validity during evaluation of performance in a patient simulator. Anesth Analg 1998; 86: 1160–64
16 Kurrek MM, Devitt JH, Cohen MM. Cardiac arrest in the OR: how are our ACLS skills? Can J Anesth 1998; 45: 130–32
17 Morgan PJ, Cleave-Hogg D. Evaluation of medical students' performance using the anaesthesia simulator. Med Educ 2000; 34: 42–5
18 Weinger MB, Herndon OW, Zornow MH, Paulus MP, Gaba DM, Dallen LT. An objective methodology for task analysis and workload assessment in anesthesia providers. Anesthesiology 1994; 80: 77–92
19 Weinger MB, Herndon OW, Gaba DM. The effect of electronic record keeping and transesophageal echocardiography on task distribution, workload and vigilance during cardiac anesthesia. Anesthesiology 1997; 87: 144–55