Use of a high-fidelity simulator to develop testing of the technical performance of novice anaesthetists†

F. C. Forrest*,1,2, M. A. Taylor1,2, K. Postlethwaite3 and R. Aspinall1,2

1Sir Humphry Davy Department of Anaesthesia, Bristol Royal Infirmary, Upper Maudlin Street, Bristol BS2 8HW, UK. 2Bristol Medical Simulation Centre, UBHT Education Centre, Upper Maudlin Street, Bristol BS2 8AE, UK. 3University of Exeter School of Education, St Luke’s Campus, Heavitree Road, Exeter EX1 2LU, UK*Corresponding author

†This article is accompanied by Editorial I.

Accepted for publication: September 7, 2001


    Abstract
Background. We used the Delphi technique to gain a consensus view from 26 consultant anaesthetists on the technical tasks undertaken during general anaesthesia. From this we developed a technical scoring system to assess anaesthetists undertaking general anaesthesia with rapid sequence induction.

Methods. We then followed the performance of six novice anaesthetists on five occasions during their first 3 months of training. At each visit, each novice ‘anaesthetized’ the Human Patient Simulator at the Bristol Medical Simulation Centre. For comparison, seven post-fellowship anaesthetists were scored on a single occasion.

Results. Novice scores improved significantly over the 12-week period (P<0.01). A significant difference was also found between the final novice scores and those of the post-fellowship subjects (P<0.05).

Conclusions. These findings suggest that simulation can be used to observe and quantify technical performance.

Br J Anaesth 2002; 88: 338–44

Keywords: anaesthesia, performance; education, Delphi technique; education, simulators


    Introduction
A ‘comprehensive anaesthesia simulation environment’1 is a useful place in which to assess many aspects of anaesthetists’ performance. It is a realistic environment where conditions can be simulated to match the needs of the assessment.

Researchers using this simulated operating theatre environment have developed objective scoring systems for both technical and behavioural performance during critical incidents.2 3 We are not aware of any simulator-based performance studies that observe routine anaesthesia or track individuals’ progress over time. Before the simulation suite can be used to assess anaesthetic trainees, an appropriate simulation environment and a validated scoring system need to be developed.

We set out to: (1) develop an appropriate validated scoring system for technical performance of individual anaesthetists undertaking general anaesthesia with rapid sequence induction; (2) see if the expected improvement in technical performance of novice anaesthetists in their first 3 months could be observed in the simulation suite and quantified using the scoring system.


    Methods
Development of a technical scoring system using the Delphi technique
The conventional Delphi technique was used to obtain a consensus view on a subject from a group of experts.4 As our ‘expert’ group, we identified 28 consultant anaesthetists in our region with an interest in training and invited them by letter to take part in the study. Each anaesthetist was given the opportunity to decline.

A list was made of the technical tasks undertaken during rapid sequence induction and maintenance of general anaesthesia. This list was based on observed practice, textbook descriptions, and guidelines issued by the Association of Anaesthetists.5 The original list of tasks was formulated into a scoring sheet with room for comment.

The scoring sheet was sent to the participants, who were invited to comment on the inclusion of each task and to rank its relative importance on a scale of 1–5. They were asked to add any tasks that they felt were important but had been omitted. The completed scoring sheets were returned to us and the results analysed to obtain a mean score for each task. The mean scores and additional tasks were noted on the scoring sheet to produce a new, modified scoring sheet.

The review process was then repeated. Each participant was sent a copy of the new modified scoring sheet together with a copy of their own original scores. They were asked to review their original responses against the mean values derived from the group opinion and to modify them if they felt it appropriate. To comply with the conventional Delphi technique, this review had to take place at least once; the items added after the first round were therefore reviewed separately for a second and final time.

The forms were processed through a central office and the results analysed using Microsoft Excel. The anonymity of the participants was preserved by use of a numerical identifier.
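To make the analysis concrete, the sketch below shows a minimal version of the round-by-round calculation: group mean ratings per task and a final weighted whole-number score, with values of 0.5 or more rounded up (the rounding rule noted in the Discussion). The task names and ratings are hypothetical, and Python is used purely for illustration; the study itself used Microsoft Excel.

```python
# Illustrative sketch (hypothetical data): computing group mean scores for each
# Delphi round and the final weighted whole-number score (values >= 0.5 round up).
from statistics import mean

# Ratings on the 1-5 importance scale from each expert, keyed by task.
round2_ratings = {
    "Check anaesthetic machine": [5, 5, 4, 5, 5],
    "Confirm consent form present": [4, 5, 4, 4, 3],
}

def final_weighted_score(ratings):
    """Mean of the panel's ratings, rounded to the nearest whole number,
    with values of 0.5 or more rounded up."""
    m = mean(ratings)
    return int(m + 0.5)

for task, ratings in round2_ratings.items():
    print(task, round(mean(ratings), 2), final_weighted_score(ratings))
```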

Results of scoring system development
Twenty-eight consultant anaesthetists were approached to take part in the study. Twenty-six agreed to do so.

The list of tasks, the associated group mean scores from the first and second surveys, and the final weighted scores (whole numbers) are shown in Table 1. The 19 additional items, their mean scores for the two rounds, and their final weighted scores are shown in Table 2. We stopped the Delphi process after two rounds.


Table 1 Tasks and associated mean scores from the first and second rounds of Delphi. The final weighted whole number ‘score’ for each task is also shown
 

Table 2 Additional tasks added by respondents after the first Delphi round. First and second round scores plus final weighted score for each item are shown
 
Methods used to observe technical performance
This part of the study was approved by the Local Ethics Committee and conducted at the Bristol Medical Simulation Centre, which contains a high-fidelity Human Patient Simulator produced by Medical Education Technologies Inc. The simulator is kept in a realistic operating theatre environment, complete with an anaesthetic machine, simple and invasive monitoring, and a full array of anaesthetic and surgical equipment. Two cameras and microphones discreetly placed in the operating theatre are used to record events onto videotape.

We recruited six novice anaesthetists from four different hospitals in the South West. We brought them to the centre on five separate occasions, at the end of weeks 1, 2, 4, 8, and 12 of their training. On each visit the novices were given a different ‘patient’ to anaesthetize on their own. They received the scenarios in the same order and each of the scenarios was designed to prompt the anaesthetist to plan a general anaesthetic with rapid sequence induction. On their first visit to the centre each novice was given a brief introduction to the simulation environment primarily to explain the features of the simulator.

The patient simulator can be programmed to represent patients with different baseline physiological conditions, and these simulated patients can be recreated exactly on each occasion. For this study all the simulated patients were programmed as ‘standard man’, a fit 70-kg man. However, the patient details presented to the anaesthetist (gender, presenting complaint, medical history, and the planned operation) varied from visit to visit and are summarized in Table 3.


Table 3 Summary details of the simulated patient for each study week
 
The theatre set-up was the same in each scenario:

1. the anaesthetist was told it was the first case on a Monday morning list and the theatre had not been in use over the weekend;

2. the form indicating consent for operation was absent from the notes but was available in a nearby office;

3. it was not apparent that the surgeon was in the theatre suite at the start of the case.

Thus, the anaesthetist was expected to check the equipment, find the form showing patient consent and contact the surgeon before starting the anaesthetic.

An anaesthetic assistant (a trainer in the simulator centre) was available to help with preparation and conduct of anaesthesia. The trainee could also ask further questions of the simulated patient. Once the anaesthetic was underway, the surgical procedure was acted out by members of the research team. The novice was expected to continue anaesthesia until the end of surgery, wake the patient up, and then hand over to recovery staff. The trainees were not debriefed on their performance during the study.

Each novice was videotaped for the whole procedure. The videotape captured the action in the theatre suite as well as an inset display showing the monitored physiological variables. Performance was scored from the videotape, rather than from the trainees’ actions in real time, using the Delphi-derived technical scoring system described above. For this study we used the tasks listed in Table 1. All videos were analysed separately by two raters.
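The way a videotape is converted into a single technical score is not specified beyond the description above; a minimal sketch, assuming the total is simply the sum of the Delphi-weighted scores of the actions the rater marked as observed, is given below. The item names and weights are hypothetical.

```python
# Minimal sketch (assumed scoring rule): the total technical score is taken here
# as the sum of the Delphi-weighted scores of the checklist items the rater
# marked as observed on the videotape. Item names and weights are hypothetical.
checklist = [
    # (task, Delphi weight, observed by rater?)
    ("Checks anaesthetic machine", 5, True),
    ("Confirms consent", 4, True),
    ("Pre-oxygenates for 3 min", 5, False),
    ("Applies cricoid pressure", 5, True),
]

technical_score = sum(weight for _task, weight, observed in checklist if observed)
print(technical_score)  # 14 for this hypothetical checklist
```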

In addition to the novice anaesthetists, we recruited seven post-fellowship anaesthetists to the study. They were asked to anaesthetize a single case (the week 1 patient, Table 3), presented in an identical fashion. Each of these anaesthetists was videotaped and scored by the two raters.

Finally, to assess the reliability of the scoring system, five consultant anaesthetists (not involved with the study) who were familiar with the simulation suite were asked to rate one videotape. They were given a brief verbal introduction to the use of the technical scoring system and then scored Novice 2 (week 12 tape).


    Results of performance assessment
The scores attributed by the two raters to each novice at the first and last visits are shown in Table 4. The scores at each visit for each of the six novices are plotted in Figure 1. Because the null hypothesis was of no increase in score over the training period, one-tailed paired t-tests comparing the scores at the first and last visits (weeks 1 and 12 of training) were performed on the data from each rater. Both sets of data showed a significant difference (P<0.01; rater 1, P=0.002; rater 2, P=0.003), with the week 12 scores higher than the week 1 scores.
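A minimal sketch of this comparison, using scipy.stats and placeholder scores rather than the study data, might look as follows:

```python
# One-tailed paired t-test: week 12 scores vs week 1 scores for the same six
# novices (placeholder values, not the study data).
from scipy import stats

week1  = [180, 210, 250, 195, 230, 205]   # hypothetical scores, week 1
week12 = [290, 310, 335, 300, 325, 315]   # hypothetical scores, week 12

t, p_two_sided = stats.ttest_rel(week12, week1)
# Convert to a one-tailed P value for the alternative "week 12 > week 1".
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(t, p_one_sided)
```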


Table 4 Week 1 and week 12 technical scores, mean and SD, for the six novices by two raters
 


Fig 1 Plots showing technical scores awarded by two raters at each of five visits during the first 12 weeks of training. Each box represents the interquartile range for the six subjects and the whiskers the 10th and 90th percentiles. The median is shown within the box, and the full range of scores is represented by the dots.

 
Both raters scored the post-fellowship group. Table 5 shows the scores from each rater and the experience of each anaesthetist at the time of testing. Because the null hypothesis was of no difference between the two groups, two-tailed unpaired t-tests comparing the novice scores at week 12 with the scores of the post-fellowship group were performed for both raters. These showed a significant difference (P<0.05; rater 1, P=0.03; rater 2, P=0.01).
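The corresponding two-tailed unpaired comparison can be sketched in the same way (again with placeholder scores rather than the study data):

```python
# Two-tailed unpaired (independent-samples) t-test: novices at week 12 vs the
# post-fellowship group (placeholder values, not the study data).
from scipy import stats

novices_week12  = [290, 310, 335, 300, 325, 315]
post_fellowship = [350, 340, 365, 330, 355, 345, 360]

t, p = stats.ttest_ind(novices_week12, post_fellowship)
print(t, p)
```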


Table 5 Technical scores for experienced anaesthetists. *Indicates previous simulator experience
 
The scoring patterns from the five raters reviewing novice 2 are shown in Table 6, which shows the degree of agreement between raters on whether or not they saw the novice performing each action. There was complete agreement on 62 of the 91 actions (68.1%), and four out of five raters agreed on 80 of the 91 actions (88%). Technical scores were calculated for each of the five raters (rater 3=265, rater 4=284, rater 5=305, rater 6=270, and rater 7=273); the standard deviation (SD) of these scores is 16. The SD of all of novice 2’s scores between weeks 1 and 12 is 92 (rater 1) and 93 (rater 2). The SD of all six novices’ scores at week 1 is 51 (rater 1) and 53 (rater 2); at week 12 this SD falls to 36 (rater 1) and 32 (rater 2).
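The agreement proportions and score spread quoted above can be reproduced with a short calculation of the kind sketched below. The presence/absence matrix shown is hypothetical rather than a copy of Table 6; the five raters’ totals are those given in the text.

```python
# Sketch of the inter-rater agreement and score-spread calculations
# (hypothetical presence/absence matrix: rows = checklist items, columns = raters 3-7).
import numpy as np

items_by_rater = np.array([
    [1, 1, 1, 1, 1],   # all five raters saw this action
    [1, 0, 1, 1, 1],   # four of five agreed
    [1, 0, 0, 1, 0],   # no four-way agreement
])

# For each item, the size of the larger camp (saw it vs did not see it).
agree_counts = np.maximum(items_by_rater.sum(axis=1), 5 - items_by_rater.sum(axis=1))
complete_agreement = np.mean(agree_counts == 5)   # proportion with 5/5 agreement
four_of_five = np.mean(agree_counts >= 4)         # proportion with >=4/5 agreement

scores = np.array([265, 284, 305, 270, 273])      # the five raters' totals (from the text)
print(complete_agreement, four_of_five, scores.std(ddof=1))  # sample SD is approximately 16
```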


Table 6 Presence (1) or absence (0) of items by five new raters of one videotape

 

    Discussion
An anaesthetist’s actions in the operating theatre can be broadly divided into technical and non-technical components. In order to validate the simulation suite as an environment for assessing the technical aspects of performance, we needed to show that different performance abilities could be measured using a valid scoring system. We hypothesized that it would be easiest to demonstrate changes in technical performance in a short period of time in novice anaesthetists. We therefore developed a scenario of general anaesthesia with rapid sequence induction, which requires the individual to follow a series of prescribed technical steps (especially at induction and extubation). All SHOs must be familiar with this technique by the time they have to provide anaesthesia with limited supervision (usually after 3 months of training).

The Delphi technique and scoring system
We wanted a scoring system that would be valid and reliable. Previously described tests of skills6 or technical ability, whether specific2 or non-specific7 to the simulator, were not suitable for scoring this scenario. The problems relating to methods of assessment used in simulation have recently been reviewed by Byrne and Greaves.8 They concluded that, of 13 publications relating to assessment of performance, only four had paid attention to investigating the validity and reliability of the assessment system.

To address some of the issues of validity, we chose the Delphi technique to develop our scoring system. Delphi has not been widely used in medicine, but it has been used to define diagnostic indicators of malignant hyperpyrexia9 and to define the necessary skills for pre-registration trainee doctors.10 Using Delphi, we would gain a consensus expert opinion (important for rapid sequence induction of anaesthesia, which has been shown to be open to individual interpretation11) and the tasks would have scores ranked by importance. Thus, we hoped that the scoring system applied to this one scenario would have construct validity.12

The Delphi process is cumbersome to perform. We satisfied convention by recruiting a panel of 15–30 ‘experts’. We felt that a group of consultant anaesthetists with responsibility for training had sufficient experience of rapid sequence induction to be considered ‘expert’. Although three- or four-round processes are encouraged to achieve maximum input from the experts, we ended the study after two rounds because:

• mean scores were rounded to the nearest whole number (values of 0.5 or more rounded up) and only one of the 92 items would have changed its final allotted score between the first and second rounds;

• the task list was extremely long and respondents needed to remain motivated to complete it;

• motivation seemed good in the first round, as we received useful feedback and comment, but this diminished in the second round.

We used mean rather than median scores as a measure of central tendency and for feedback to the group. We reviewed the difference between mean and median scores for the second round data and found no difference.
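This check amounts to comparing the rounded mean with the rounded median for each item, for example (hypothetical second-round ratings):

```python
# Comparing mean- and median-based final scores for one hypothetical item.
from statistics import mean, median

ratings = [4, 5, 5, 4, 4, 5]                                  # hypothetical second-round ratings
print(int(mean(ratings) + 0.5), int(median(ratings) + 0.5))   # both give 5 here
```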

The results of the novice study suggest that the scoring system is valid, as we demonstrated that the technical scores of all the novices improved with time (Fig. 1). Also, the expert group (only two of whom had prior simulator experience) performed better than the novices after their 3 months of training.

A scoring system should have good inter-rater reliability if it is to be used for a trainee assessment programme. Many performance studies in simulators have been criticized for failing to address this issue.8 Gaba found better inter-rater reliability with technical than with behavioural rating.2 We would also expect good inter-rater reliability with a checklist scoring system in which actions are either present or absent and the videotape can be reviewed if concentration is lost or clarity is required. Our scores from the two raters look similar, but might be criticized because the ranking of the novices differs slightly between raters. Rater 1 consistently scored higher than rater 2, but the actual difference in scores was small, equivalent to the value of two or three items on the checklist.

To study inter-rater reliability further, we asked five raters who had never used the scoring system before to score one tape after minimal tuition. Agreement was good, especially if a consensus was taken (i.e. agreement by four out of five raters). However, this makes no allowance for the influence of chance. We therefore calculated the actual technical scores generated by the five raters, which ranged from 265 to 305 with an SD of 16 (equivalent to three or four items on the scoring system).

When we compare this with the results generated by raters 1 and 2 scoring the six novices at week 1, we find that the SDs for the novices are similar for both raters (51 and 53) but markedly greater than the SD of the single video rated by the five raters. This suggests that raters 1 and 2 found differences amongst the novices at week 1 that were unlikely to be simply the result of rater error. At week 12, the SDs amongst the six novices were smaller (35 for rater 1; 32 for rater 2). This would be expected if training had enabled those who were less technically competent at the beginning to improve their technical skills. The variation in technical scores within the group at week 12 is still, however, at least twice as great as the variation in the rating of the single video by the five raters, suggesting that even at week 12 individual differences amongst the novices were greater than could readily be accounted for by rater error.

We conclude that the scoring system is acceptable to a range of experts, can be used with an acceptable level of inter-rater reliability (see above), and is valid, since it detected expected performance differences (see the comparisons of week 1 with week 12 and of novices with experts). If simulator tapes were to be used for real assessments rather than research, we suggest that two raters might watch each tape together. Our results suggest that they would achieve reasonable levels of agreement, and if they discussed any ambiguities as they arose, a sound technical judgement of performance could be reached. This was obviously not possible as part of a research process.

The novice study
A considerable range of technical scores was noted at the novices’ first visit. Although none of the novices had previously held an anaesthetic post, two had worked in Accident and Emergency (novices 5 and 6, Table 4) and one in Intensive Care (novice 3, Table 4). All novices’ scores increased, but a higher score in week 1 did not necessarily lead to a higher score in week 12.

Although we wished to find out whether different performance abilities could be observed and scored in the simulation environment, we were interested to discover that after 12 weeks the technical ability of the novices was still less than that of the experts. Although it is encouraging that experience leads to continued improvement in technical ability, it prompts the question of why the novices scored less well on a set practical task. Novices scored poorly on machine checking, pre-induction checks and hand-over to recovery compared with the experts; only one of the expert group did not include checks in their performance. This raises some interesting questions. Do novices understand that it is their responsibility to check equipment? Do novices know how to check equipment by 3 months? Whose responsibility is it to teach them, or have they been taught and not learned it? The Royal College of Anaesthetists has introduced guidelines for the assessment of SHOs in training, including a 3-month assessment, which begins to address these issues.13 It is reassuring to note that all the novices in this study understood and employed a rapid sequence induction by the last assessment, when they were beginning to undertake emergency cases with limited supervision.

The maximum possible score is 401, and the experienced anaesthetists did not score close to it (Table 5). This may indicate that the scoring system included aspects of performance that experienced anaesthetists do not include in day-to-day practice; for example, end-tidal carbon dioxide may be used to confirm tube placement rather than auscultation of the chest. Alternatively, this may be a feature of providing anaesthesia in a simulation suite. Only one anaesthetist wore gloves in the simulation suite, yet we consider it highly likely that these anaesthetists would wear gloves with real patients. Further studies comparing simulator and real performance will help to identify differences in practice between the two environments.

We have shown that simulation and a scoring system can document and assess changes in technical performance. Although continued development of valid and reliable scoring systems is necessary, simulation may be a useful tool for assessing trainees during their training. Technical assessments combined with case management analysis could allow standardized competency-based assessment of training to be developed.


    References
1 Gaba DM, DeAnda A. A comprehensive anaesthesia simulation environment: recreating the operating room for research and training. Anesthesiology 1988; 69: 387–94

2 Gaba DM, Howard SK, Flanagan B, Smith BE, Fish KJ, Botney R. Assessment of clinical performance during simulated crises using both technical and behavioural ratings. Anesthesiology 1998; 89: 8–18

3 Devitt JH, Kurrek MM, Cohen MM, et al. Testing internal consistency and construct validity during evaluation of performance in a patient simulator. Anesth Analg 1998; 86: 1160–4

4 Clayton MJ. Delphi: a technique to harness expert opinion for critical decision-making tasks in education. Educ Psychol 1997; 17: 373–86

5 The Association of Anaesthetists of Great Britain and Ireland. Checklist for Anaesthetic Apparatus. Alresford Press Limited, 1997

6 Sivarajan M, Miller E, Handy C, et al. Objective evaluation of clinical performance and correlation with knowledge. Anesth Analg 1984; 63: 603–7

7 Kestin IG. A statistical approach to measuring the competence of anaesthetic trainees at practical procedures. Br J Anaesth 1995; 75: 805–9

8 Byrne AJ, Greaves JD. Assessment instruments used during anaesthetic simulation: review of published studies. Br J Anaesth 2001; 86: 445–50

9 Larach MG, Localio AR, Allen GC, et al. A clinical grading scale to predict malignant hyperthermia susceptibility. Anesthesiology 1994; 80: 771–9

10 Stewart J, O’Halloran C, Harrigan P, et al. Identifying appropriate tasks for the pre-registration year: modified Delphi technique. BMJ 1999; 319: 224–9

11 Thwaites AJ, Rice CP, Smith I. Rapid sequence induction: a questionnaire survey of its routine conduct and continued management during a failed intubation. Anaesthesia 1999; 54: 372–92

12 Black TR. Doing Quantitative Research in the Social Sciences. London: Sage, 1999

13 The Royal College of Anaesthetists. The CCST in Anaesthesia II: competency based SHO training and assessment. A manual for trainees and trainers. http://www.rcoa.ac.uk