“We’re going to try to bucket everyone into strong, meets, and low performers, then essentially try to stack rank everyone.”
Over the past month I’ve heard this sentiment in two different calibration meetings. Since I happen to have read a bit about stack ranking, warning bells went off in my head.
Now, I actually believe the stack ranking exercise that we ended up doing was useful. Before we look at that, let’s first recap why stack ranking has (rightfully) earned a bad rap.
Stack Ranking For Performance
Stack ranking hurts when it’s directly tied to performance and compensation, with quotas. In this form, managers are asked to rank their reports. They’re typically required to put some people in the top and/or bottom X%.
The immediate concern with this is that stack ranking is a relative measure. Someone being in the bottom 25% of your team doesn’t necessarily mean they’re bad; it just means that most of the other people on your team are better.
The concern becomes an actual problem when performance and compensation get tied to these measures. You shouldn’t get less compensation because everyone else on your team is better than you. You should get less compensation because you’re bad at your job.
There are also, naturally, calibration concerns across teams. If the bottom 25% of team A is still above the organization-wide average, they shouldn’t be punished just for being on a strong team.
Perhaps most importantly, this form of stack ranking hurts overall performance. People are incentivized to be more competitive and cutthroat, focusing on making themselves look better rather than on making the team succeed. That hurts everyone in the long run.
Stack Ranking For Calibration
With all that, why do I think our stack ranking exercise was useful? Fundamentally, we weren’t using it to decide who gets what compensation. We were using it to calibrate our standards and come to (more of) a consensus on our values. What matters here isn’t the tool we’re using, but how it’s being used.
Our process was actually closer to this change described in the article above:
Most companies have shifted to systems that are more flexible. Employees may still be rated or ranked, but not along a bell curve or with strict cutoffs. There is also more focus on consistent feedback and how people can improve.
We did aim to bucket people into high, middle, and low performers. We did aim to create an absolute ranking; in fact, we aimed to create one across multiple teams. But there were no quotas, and the results weren’t tied to compensation.
The value of this exercise came afterwards: we then discussed and debated why we felt these rankings made sense. Naturally, this raised lots of different discussion topics. Here are some samples.
Are we on the same page?
Sally and Susan are both strong performers. Sally’s really good at seeing the big picture: she helps coordinate team members and ensures that we’re making progress towards shipping our feature. Susan’s really strong when it comes to individual work: she focuses a lot on details and puts out high-quality work.
The conversation starter: why did we rank Sally higher?
There are lots of possible answers here. Maybe there’s something we value more in what Sally does. Maybe there isn’t, and she shouldn’t be ranked higher. Maybe Sally and Susan were ranked by different people and they’re miscalibrated.
The conversation is about better understanding what we find important, and making sure we agree. Sometimes we don’t agree. Sometimes we realize that we value certain qualities equally, even though the qualities themselves are very different. Sometimes we need to dig further and gather more data points about the people we’re trying to compare.
Is the ranking consistent?
When you try to collapse people onto a single dimension, you’re going to see some odd results. We ranked Larry as a strong performer and Luke as a medium performer. Larry’s only been here for 2 months but is doing amazingly well, whereas Luke’s been here for 1.5 years and is doing about average. Do we actually believe Larry is better?
Again, lots of possible answers. For us, this usually meant there was some dimension that we were (or weren’t) factoring in.
For example, maybe we actually think Luke is closer to a promotion despite believing that Larry is the stronger performer. Luke performs more consistently and has much more experience with our product and our practices. Larry produces amazing work but lacks that exposure.
There are at least two unspoken dimensions here. One is in how we evaluate: we discarded experience when we were doing our stack ranking, but we factored it back in once the question shifted to promotion readiness.
Another is what we’re evaluating for. Perhaps we believe Luke is better overall today, but we rated Larry more favorably because we were subconsciously looking for growth velocity.
Does the result make sense?
Occasionally, when we step back, the overall distribution looks odd. For example, we might have a higher-than-expected number of strong performers. Does this result make sense?
Again, there are lots of possible answers. Perhaps our bar has dropped, or we’re miscalibrated. Perhaps they really are all strong performers and we happen to have an unusual case. Perhaps there’s a promotion bottleneck: we end up with strong performers at this level because they aren’t getting promoted out of it.
This topic is slightly different: it’s about determining where the organization as a whole should be calibrated, not about where a specific person sits in the ranking.