“Just think of how stupid the average person is, and then realize half of them are even stupider!”
George Carlin
Most vendors use average scores of raters in their summary reports for 360-degree feedback. For example, it’s not uncommon to report a table summarizing the “most frequent” and “least frequent” behaviors perceived by the different rater groups. These top/bottom lists are derived by simple average score calculations. If all raters are essentially in agreement with each other, the average score is a pretty good metric.
However, quite a bit of research on 360-degree feedback suggests that we should expect diversity in ratings both within and between rater groups. The more dispersion, the more confusing average scores are in feedback reports.
As my friend and CEO of Personal Strengths Publishing Tim Scudder says, “If my head is in a hot oven and my feet are in cold snow, on average, I am feeling pretty comfortable.” Average scores can be misleading, particularly when behavior changes are being attempted based on the results of 360-degree feedback reports.
THE FALLACY OF AVERAGE SCORES
In one meta-analytic study by Conway & Huffcutt (1997), the average correlation between two supervisors was only .50, between two peers .37, and between two subordinates only .30. Greguras and Robie (1995) explored within-source variability in a study of 153 managers using 360-degree feedback. This study suggests that raters within a group (e.g., direct reports) are likely to have diverse points of view and different perceptions of the same behavior.
Given these findings, vendors who do not give participants a way to evaluate agreement within rater groups increase the probability that the average scores in their reports will be misinterpreted. That risk is greatest when coaches use those scores to help coachees pick specific competencies and behaviors for development planning, such as reviewing the “most and least” frequent behaviors or the items rated “most and least” effective.
It’s easy to watch our own coachees react to the “most/least” lists so common in vendors’ feedback reports and fixate on a few items without any clear sense of whether agreement among their raters is low or high.
Our reports offer at least three different ways to gauge rater agreement and to help participants interpret and use average score summaries. I wish more vendors would do the same.
THREE MEASURES OF RATER AGREEMENT
1. RANGE OF SCORES
One simple way to see how closely raters agree is to look at the “range of scores” across all questions in a competency. Every one of our reports that uses average scores includes this simple view, giving participants a general sense of the lowest and highest rating given by any rater on any one question composing the competency. When this range is large (e.g., 5 points or more), the participant should consider the possibility that the variability is big enough that the average score reflects polarized points of view.
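As a rough illustration, here is a minimal sketch of the idea (not our actual report logic): it assumes a hypothetical 10-point frequency scale, invented question names, and invented scores, and it flags any question whose range reaches the 5-point threshold mentioned above.

# Hypothetical ratings: one score per invited rater, on an assumed 10-point scale.
ratings = {
    "Listens without interrupting": [2, 3, 9, 8, 3],
    "Gives timely feedback": [6, 7, 6, 7, 6],
}

for question, scores in ratings.items():
    avg = sum(scores) / len(scores)
    score_range = max(scores) - min(scores)  # highest minus lowest rating
    note = "check agreement" if score_range >= 5 else "ok"
    print(f"{question}: average={avg:.1f}, range={score_range} ({note})")

In this made-up example, the first question averages a middling 5.0 but spans 7 points, a hint that the average may be hiding polarized views.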
2. RATING DISTRIBUTIONS
A second way we help participants understand whether raters cluster their ratings is the actual rating distribution shown in the “Most Frequent” and “Least Frequent” behavior summaries in our reports. It gives the participant another qualitative way to see the actual ratings from invited raters and to judge whether an average score is driven by a single outlier or by real divergence in how the same behavior is perceived by others. (Note: the box depicts the participant’s own self-rating on that behavior.)
This rating distribution matters because participants tend to look only at the behaviors in the “Most” or “Least” frequent table and build their development plans around them. If the average score is mathematically correct but reflects polarized points of view among invited raters, the more useful development step might be to explore why the same behavior is experienced so differently by different raters.
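To make the outlier-versus-polarization distinction concrete, here is a small hypothetical sketch (again assuming a 10-point scale; all numbers are invented) showing two sets of ratings with the same average but very different distributions.

from collections import Counter

def distribution(scores, scale=range(1, 11)):
    # Tally how many raters chose each point on the assumed 10-point scale.
    counts = Counter(scores)
    return {point: counts.get(point, 0) for point in scale}

outlier = [7, 7, 8, 7, 1]  # one dissenting rater pulls the average down
polarized = [1, 2, 8, 9, 10]  # raters genuinely disagree about the behavior

for label, scores in (("outlier", outlier), ("polarized", polarized)):
    avg = sum(scores) / len(scores)
    print(f"{label}: average={avg:.1f}, distribution={distribution(scores)}")

Both lists average 6.0, yet only the distribution shows that the second one represents two camps of raters rather than one dissenting voice.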
3. STATISTICAL MEASURE OF AGREEMENT
Finally, we offer a statistical measure of rater agreement based on the standard deviation. We convert this number to a percentage, where 100 percent means all raters were in complete agreement and 0 percent means the maximum possible disagreement on the rating scale being used.
Any score below 50 percent means the average score should be interpreted cautiously for developmental purposes. Scores that low suggest enough disagreement exists between raters to make the mean less useful for clearly identifying a competency or behavior as a strength to leverage or as a potential development area to work on.
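Our exact formula isn’t spelled out here, but one common way to turn a standard deviation into an agreement percentage is to compare it against the largest standard deviation the rating scale allows (half the raters at each extreme). The sketch below, assuming a 10-point scale and invented scores, is illustrative only.

import statistics

def agreement(scores, scale_min=1, scale_max=10):
    # Illustrative only: 1.0 (100%) = identical ratings, 0.0 (0%) = maximum possible spread.
    sd = statistics.pstdev(scores)  # population standard deviation
    max_sd = (scale_max - scale_min) / 2  # SD of a perfectly polarized split
    return 1 - sd / max_sd

print(f"{agreement([7, 7, 7, 7]):.0%}")  # 100% -- all raters identical
print(f"{agreement([1, 1, 10, 10]):.0%}")  # 0% -- maximum disagreement
print(f"{agreement([2, 3, 9, 8]):.0%}")  # about 32% -- interpret the mean cautiously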
Sometimes, average is just that: average.
Other times, an average masks wild disparity among the raters you invited: a group of “supporters” and “critics” who view the world quite differently. Coaches and participants need to look a bit deeper at average scores on assessment reports.
I’d rate the information in this blog as “above average.”
What do you think? Be well…..