EVA DAILY

TECHNOLOGY | Tuesday, February 17, 2026 at 6:38 PM

Peer-Reviewed Research Confirms What Engineers Have Been Quietly Worrying About: AI Coding Tools Backfire in Messy Codebases

A peer-reviewed study analyzing AI-generated code changes across 5,000 real programs found that defect risk increases by at least 30% in codebases with lower health scores, with the researchers warning that deeply unhealthy legacy systems likely face non-linear breakage rates. The finding challenges the premise that AI coding tools are most helpful where developers need help most.

Aisha Patel

3 days ago · 3 min read


Photo: Unsplash / Fahim Muntashir

I've been waiting for someone to actually study this properly, and now they have.

CodeScene has published peer-reviewed research that quantifies something many engineers have suspected but struggled to articulate: AI coding assistants don't perform uniformly. They perform dramatically worse in the kind of codebases where you most want help.

The study analyzed AI-generated refactorings across 5,000 real programs using six different large language models. The researchers measured whether AI-generated changes preserved behavior while keeping tests passing. The headline finding: in code with lower health scores, defect risk increased by at least 30% compared to healthier codebases.
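To make that methodology concrete, here is a minimal sketch of the pass/fail criterion as the paper describes it (my illustration, not the study's actual harness; it assumes a Python project tested with pytest and an AI-generated change delivered as a unified diff):

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    def tests_pass(repo: Path) -> bool:
        """True if the project's full test suite passes (pytest assumed)."""
        result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
        return result.returncode == 0

    def change_is_safe(repo_dir: Path, patch_file: Path) -> bool:
        """Sketch of the study's criterion: an AI-generated refactoring
        counts as behavior-preserving only if the tests passed before
        the change and still pass after it."""
        with tempfile.TemporaryDirectory() as tmp:
            repo = Path(tmp) / "repo"
            shutil.copytree(repo_dir, repo)  # work on a throwaway copy

            if not tests_pass(repo):
                return False  # a failing baseline tells us nothing

            # Apply the AI-generated change (unified diff assumed here).
            applied = subprocess.run(["git", "apply", str(patch_file)], cwd=repo)
            if applied.returncode != 0:
                return False

            return tests_pass(repo)

The design point worth noticing is the before-and-after comparison: a change landing in a repo whose tests were already failing tells you nothing, which is exactly the situation in many unhealthy legacy systems.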

Here's what makes this finding significant: the 30% increase was observed in code that was still relatively maintainable - systems scoring 7.0 or above on CodeScene's health metric. The study explicitly notes that truly low-quality legacy modules (scores of 4, 3, or 1) weren't even included in the analysis. The researchers suggest, based on prior research, that breakage rates in deeply unhealthy legacy systems are likely non-linear and could increase steeply.

Translation for anyone who doesn't live in codebases: the code where you most desperately want AI help - the scary, undocumented, spaghetti legacy system that nobody wants to touch - is precisely where AI is most likely to make things worse.

As a former engineer who has spent quality time with some genuinely horrifying fintech codebases, this result is completely unsurprising and also somewhat devastating. The promise of AI coding tools has always been "let AI handle the grunt work so humans can focus on interesting problems." If the grunt work in your messy codebase is exactly where AI fails most spectacularly, that promise collapses.

The research frames this with an elegant insight about what it means to write code in the AI era. The traditional maxim says code should be written for humans to read. But as the paper argues, if AI is increasingly modifying code, it may also need to be structured in ways machines can reliably interpret.

The practical implications are real:

Healthy code makes AI more predictable. When the codebase has clear structure, consistent naming, small functions with single responsibilities, and good test coverage, AI assistants can navigate it reliably. The same patterns that make code readable for humans make it interpretable for AI.

Unhealthy code causes defect rates to spike. Deep coupling, long functions, inconsistent abstractions - these are exactly the patterns that cause LLMs to lose context, make incorrect assumptions, and introduce subtle bugs.
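To make those patterns concrete, here is an illustrative toy contrast (my own example, not code from the paper). The first version mixes querying, pricing rules, persistence, and email in one function, with an unexplained magic number; the second factors the same pricing logic into small, single-purpose functions that a human or an LLM can interpret in isolation:

    # Unhealthy: one long function, mixed responsibilities, hidden coupling.
    def process(order, db, mailer):
        total = 0
        for item in order["items"]:
            price = db.query(f"SELECT price FROM products WHERE id={item['id']}")
            total += price * item["qty"]
        if order["region"] == "EU":
            total *= 1.2  # magic number: VAT? surcharge? an LLM has to guess
        db.execute(f"UPDATE orders SET total={total} WHERE id={order['id']}")
        mailer.send(order["email"], f"Your total is {total}")
        return total

    # Healthier: small, single-responsibility functions with explicit inputs.
    EU_VAT_RATE = 0.20  # the intent behind the 1.2 multiplier, now named

    def subtotal(items, price_of):
        return sum(price_of(item["id"]) * item["qty"] for item in items)

    def with_vat(amount, region):
        return amount * (1 + EU_VAT_RATE) if region == "EU" else amount

Nothing about the second version is AI-specific; it is ordinary refactoring discipline. The study's point is that this same discipline is now also what keeps an assistant's edits predictable.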

This has a counterintuitive implication for how organizations should think about AI adoption. The teams that will benefit most from AI coding tools are the ones who already have good engineering practices. The teams that most need productivity help - the ones drowning in legacy technical debt - will get the least benefit and potentially the most harm.

The paper's conclusion is worth sitting with: "Code Health is a key factor in whether AI coding assistants accelerate development or amplify defect risk." That's a careful, measured conclusion from peer-reviewed research. Vendors selling AI coding tools have every incentive to downplay this finding. Developers and engineering managers have every reason to understand it.
