I bring you a weekly bite-sized chunk of the science behind helicopter human factors and CRM in practice, simplifying the complex and distilling a helicopter-related study into a summary of fewer than 500 words.
TITLE:
The Reliability of Instructor Evaluations of Crew Performance: Good News and Not So Good News.
WHAT?
A study investigating the quality of instructors' evaluations of aircrew performance in both technical and CRM assessments, using a broad sample of US Navy helicopter pilots flying a simulator scenario.
WHEN?
2002. This study contributed evidence toward the later development of the CRM competencies and behavioural marker taxonomies in common use today.
WHERE?
University of South Florida study of two US Navy helicopter squadrons.
WHY?
Although flight training has traditionally focused on technical skills and emergency procedures, increasing emphasis is placed on training non-technical skills: the interactions between people, and between people and their equipment. Part of the problem of evaluating CRM is how to measure CRM performance at all. Little has been published on the quality of instructor evaluations of aircrew, especially the evaluation of CRM skills.
HOW?
Two instructors graded the performance of 45 helicopter crews on 7 competencies during a simulated scenario: mission analysis, decision making, leadership, situation awareness, adaptability/flexibility, assertiveness, and communication. Grades were given for three types of item:
- specific crew behaviours in response to scenario events (e.g., whether crews kept out of icing conditions);
- generic evaluations of crew responses to scenario events (e.g., overall handling of an icing problem); and
- CRM competencies for an entire scenario (e.g., evaluations of decision making).
The instructors' grades were then statistically examined for inter-rater reliability: the extent to which the two instructors' grades were consistent with each other for each crew.
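For intuition, inter-rater reliability for a single item can be summarised as the correlation between the two instructors' grades across crews. Here is a minimal sketch of that check on hypothetical grades (illustrative values only, not the study's data, and the paper's exact statistics may differ):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical 1-5 grades from two instructors for the same 10 crews
# on one item (e.g., "kept out of icing conditions"); not study data.
instructor_a = np.array([4, 5, 3, 2, 4, 5, 3, 4, 2, 5])
instructor_b = np.array([4, 4, 3, 2, 5, 5, 2, 4, 3, 5])

# Inter-rater reliability as the Pearson correlation between the two
# instructors' grades: r near 1 means they rank crews consistently.
r, p = pearsonr(instructor_a, instructor_b)
print(f"Inter-rater reliability r = {r:.2f} (p = {p:.3f})")
```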
FINDINGS:
- For the evaluation of specific behaviours, the reliability of the instructors was generally good; that is, they usually agreed about what crews actually did in response to events encountered in a scenario.
- Evaluation of the broader CRM competencies showed low reliability between instructors and evidence of a halo effect, whereby judgements were based on a general impression of whether a crew was competent or incompetent rather than on specific evidence of CRM behaviours (see the sketch after this list).
- Asking instructors to evaluate specific behaviours and responses to events embedded within the scenarios led to better reliability than asking them to evaluate broader dimensions of CRM.
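The halo effect has a simple statistical signature: if an instructor's seven competency grades mostly reflect one overall impression of a crew, the grades on the different competencies will correlate highly with each other. A minimal sketch of that check, on fabricated illustrative grades (not the study's data):

```python
import numpy as np

# Hypothetical 1-5 grades from one instructor for 10 crews on the
# seven CRM competencies (rows = crews, columns = competencies);
# a per-crew "overall impression" plus small noise simulates halo.
rng = np.random.default_rng(0)
base = rng.integers(2, 6, size=(10, 1))           # overall impression per crew
grades = np.clip(base + rng.integers(-1, 2, size=(10, 7)), 1, 5)

# A halo effect shows up as uniformly high correlations between the
# competency columns: one global impression, not seven judgements.
corr = np.corrcoef(grades, rowvar=False)
off_diag = corr[~np.eye(7, dtype=bool)]
print(f"Mean inter-competency correlation: {off_diag.mean():.2f}")
```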
SO WHAT?
- The more abstract the behaviours and the greater the number of behavioural dimensions evaluated, the more difficult rating becomes. For example, it is easy for instructors to say whether a crew kept out of icing conditions, but relatively difficult to assess the crew's situation awareness of this and other threats across the entire flight.
- The results underscored the difficulty of evaluating CRM behaviours reliably. They suggested that practice in assigning observed behaviours to categories aids the evaluation of specific behaviours, but that doing so is likely to require more extensive instructor training in how to evaluate them.
REFERENCE:
Brannick, M. T., Prince, C., & Salas, E. (2002). The Reliability of Instructor Evaluations of Crew Performance: Good News and Not So Good News. The International Journal of Aviation Psychology, 12(3), 241–261. https://doi.org/10.1207/S15327108IJAP1203_4
