Amazon staff are reportedly using the company's internal AI tools for trivial or unnecessary tasks simply to inflate their AI usage scores, which are tied to performance reviews. The gaming of metrics reveals exactly what happens when you measure AI adoption instead of outcomes - and it's a perfect case study in how not to deploy transformative technology.
According to the Financial Times, Amazon employees are using the company's AI assistant for tasks that don't require AI - writing trivial emails, generating unnecessary summaries, asking it basic questions they already know the answers to - because their AI usage metrics are tracked and factor into performance evaluations.
This is what happens when you optimize for the metric instead of the goal. Amazon wants employees to adopt AI tools because leadership believes they increase productivity. So they track AI usage. Employees respond rationally by maximizing the tracked metric, regardless of whether the AI usage actually makes them more productive. Classic Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
I've seen this pattern before in my startup days. Company announces exciting new tool. Management wants adoption metrics for board meetings. Adoption gets tied to performance reviews. Suddenly everyone's using the tool whether it helps or not, because that's what's being measured. The tool might be great - the deployment strategy is terrible.
The problem isn't the AI tools themselves. Amazon's CodeWhisperer (since folded into Amazon Q Developer) and its internal AI assistants are probably genuinely useful for many tasks. The problem is mandating adoption through performance metrics rather than letting the tools succeed on their merits. If the AI actually saves time and improves work quality, people will use it. If you need to track usage and tie it to reviews, that suggests the value proposition isn't obvious.
What Amazon should be measuring is outcomes, not AI usage. Did the project ship faster? Is the code quality better? Are customer issues resolved more quickly? Those are the metrics that matter. But they're harder to measure than "did this person query the AI assistant 50 times this week."
