“The general consensus seems to be that I don’t act at all.” -Gary Cooper
Average scores are typically used by vendors in 360-degree feedback reports. However, without a way to discern outliers, polarized scores, and the degree of rater agreement, average scores can be quite misleading.
It is common to use average scores to present results; many vendors provide a table of the most and least frequently observed behaviors. It is important that vendors also provide a metric indicating whether these average scores reflect homogeneity of rater responses or enough dispersion to make the average score difficult to use meaningfully in developmental planning efforts.
For example, in each of the 360-degree feedback reports produced by Envisia Learning, Inc., there is a section at the end that provides a summary table containing average competency and item scores for each rater group, as well as an overall average across all raters, excluding self-ratings. Each item or question measuring a specific competency is grouped under its appropriate competency to assist in the interpretation of the results. A feature of this section is the Index of Rater Agreement, shown in parentheses after the average scores for each rater group. This Index of Rater Agreement ranges from 0 to 1.0 and is based on a statistical measure of dispersion or spread among raters, called the standard deviation. The index is derived by dividing the calculated standard deviation by a scale-specific divisor and subtracting the result from 1.
An agreement index score of 0.0 suggests little or no rater agreement among those answering a specific question. An agreement score of 1.0 suggests uniform and consistent ratings by all raters providing feedback. Agreement index scores below .50 might suggest greater diversity, inconsistency, and spread among the raters.
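The index described above can be sketched in a few lines of code. Note that the scale-specific divisor is not published here, so this sketch assumes it is the maximum possible standard deviation on the rating scale (half the scale range, which occurs when raters split evenly between the two endpoints); Envisia Learning's actual divisor may differ.

```python
import statistics

def agreement_index(ratings, scale_min=1, scale_max=7):
    """Illustrative Index of Rater Agreement: 1 - (SD / divisor).

    The divisor is assumed to be the maximum possible population
    standard deviation for the scale, (scale_max - scale_min) / 2.
    This is an assumption for illustration, not the vendor's
    published formula.
    """
    sd = statistics.pstdev(ratings)           # population standard deviation
    divisor = (scale_max - scale_min) / 2     # assumed maximum possible SD
    return 1 - sd / divisor

# Uniform ratings -> perfect agreement
print(agreement_index([5, 5, 5, 5]))   # 1.0
# Raters split between the scale endpoints -> no agreement
print(agreement_index([1, 1, 7, 7]))   # 0.0
```

Under this assumption, uniform ratings produce an index of 1.0 and a perfectly polarized split produces 0.0, matching the interpretation of the index's endpoints above.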
It is common to take the average scores shown in graphic comparisons at face value. However, when the Index of Rater Agreement is less than .50, caution is warranted in interpreting those averages. In reality, some raters might have a very positive bias in responding to the questions, whereas other raters might have a very negative bias, creating a polarized view of the participant. The Rater Agreement Index can be calculated at both the item and the competency level. At the item level, it indicates the amount of rater agreement on each 360-degree feedback question.
In my experience helping clients interpret their 360 results, it is very common for participants to focus on average scores to gauge how they did. This approach seems logical, especially since average scores take multiple raters into account rather than just one.
However, there are a couple of problems with focusing on average scores. First, averages do not account for extreme or outlier ratings, which can significantly inflate or deflate a score.
Second, I find average scores difficult to work with when helping participants interpret the meaning of their results. When participants are reactive about their scores, or when defensiveness is high, they want to know more about what their scores mean and how they came about. As a coach, I want to provide as much information as I can in order to facilitate awareness. For instance, what was the consensus around a particular score? How many raters were averaged into it? What was the dispersion of ratings on each item? I often find that when I walk clients through the rater agreement behind their average scores, their resistance to the scores decreases.
Suppose someone receives a score of 4 out of 7 on a particular item. On most typical frequency scales, this score would be interpreted as "to a moderate extent." Now suppose you knew the breakdown of ratings behind this item: two people rated it a 6 ("to a very large extent") and two people rated it a 2 ("to a very small extent"). Clearly this breakdown adds a different level of meaning to the score. The dispersion of ratings provides a whole new layer of information that can be of tremendous value in coaching. The actual discrepancy in rater scores can raise important facilitative questions that increase awareness in our clients.
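The example above can be made concrete with a quick calculation: a polarized set of ratings (two 6s and two 2s) and a uniform set (four 4s) produce exactly the same average, yet their standard deviations tell very different stories about consensus.

```python
import statistics

polarized = [6, 6, 2, 2]   # two "very large extent", two "very small extent"
uniform = [4, 4, 4, 4]     # all raters chose "a moderate extent"

for label, ratings in [("polarized", polarized), ("uniform", uniform)]:
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)   # population standard deviation
    print(f"{label}: mean = {mean}, sd = {sd}")
# Both sets average 4.0, but the spread (sd of 2.0 vs 0.0) reveals
# whether that "moderate" score reflects consensus or polarization.
```

This is why an average alone cannot distinguish a participant whom everyone sees as moderate from one whom raters see in starkly opposite ways.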
What are your thoughts on interpreting average scores? What are ways to make the best use of them when interpreting 360 results?