A cleaner codebase results in less token usage

8 min read


Prasenjit Sarkar

Solutions Marketing Manager


Priyansh Trivedi

Machine Learning Scientist

The prevailing take on AI-assisted software development goes something like this: AI agents don't get cognitively overloaded, they read fast, they don't care about naming conventions or nested logic; human limitations don’t apply. Put another way, cleaner code is critical for human readers only and doesn’t really matter for AI tools. 

So maybe clean, structured code is a solved problem. Perhaps it’s an artifact of the old era. It's a reasonable-sounding argument. But we couldn't find evidence anyone had actually tested it. So we did.

What we tested

The question sounds simple: working on the same task, does an AI agent behave differently on cleaner code versus messier code?

Answering it rigorously is harder than it looks. Real-world repos that differ on code quality also tend to differ on a hundred other things—programming language, framework, test coverage, dependencies, age, team size. If an agent performed better on one repo than another, we couldn't tell whether that was because the code was easier to work with or simply because the agent knew the framework better.

So we created the comparison ourselves. We built six pairs of repositories where both sides ship the same application, pass the same test suite, use the same dependencies, and broadly share the same architecture. They only differed under the hood: how the code was factored, named, nested, and whether it carried the kinds of issues SonarQube flags. Same app, very different insides.

We intentionally built these pairs in two directions—some started from a clean codebase and got deliberately messed up by an agent pipeline (we called this one “Slopify”). Others started from an organically-grown messy codebase and got cleaned up by a SonarQube-guided agent (called “Vibeclean”). Running the comparison in both directions ensured that any downstream effect is due to the cleaner state of the code, and not due to our process of building the pairs.

Across these six pairs we wrote 27 coding tasks, routed through the parts of each codebase where the difference between clean and messy actually showed up. We described each task the way a product manager would describe a ticket: inputs, outputs, the behaviour a user should see. No file names, no function names, no internal hints—just enough for the agent to figure out where to go on its own.

Then we ran each task ten times on both sides of every pair, using Claude Code with Sonnet 4.6, about 540 runs in total.
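To make that setup concrete, here is a minimal sketch of what such a harness loop might look like. The run_agent_task helper, the repository layout, and the record fields are hypothetical stand-ins rather than our actual pipeline; the point is only that each task is executed repeatedly against both variants of a pair and the per-run usage is written out for later analysis.

```python
import json
from pathlib import Path

RUNS_PER_TASK = 10

def run_agent_task(repo_path: str, task_prompt: str) -> dict:
    """Hypothetical wrapper around the coding agent.

    In practice this would launch the agent (e.g. Claude Code) against a
    working copy of `repo_path` with `task_prompt`, wait for it to finish,
    and return the usage it reports: input/output tokens, turns, files
    read and edited. Placeholder only.
    """
    raise NotImplementedError

def run_pair(pair_id: str, tasks: list[dict], out_dir: Path) -> None:
    """Run every task ten times against both variants of one repo pair."""
    for task in tasks:
        for variant in ("clean", "unclean"):
            repo_path = f"repos/{pair_id}/{variant}"
            for run_idx in range(RUNS_PER_TASK):
                usage = run_agent_task(repo_path, task["prompt"])
                record = {"pair": pair_id, "variant": variant,
                          "task": task["id"], "run": run_idx, **usage}
                out_path = out_dir / f"{pair_id}_{task['id']}_{variant}_{run_idx}.json"
                out_path.write_text(json.dumps(record))
```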

What we found

Across the 540 runs, the cleaner side of every pair was measurably less expensive to run than the unclean side.

  • 7.2% fewer input tokens consumed
  • 8.5% fewer output tokens generated
  • 11.1% reduction in agent reasoning effort (an estimate, since Anthropic doesn't expose reasoning-token counts directly; we count characters off the event stream, as sketched below)
  • About a third fewer file revisits after the agent had already edited a file
  • 3.6% fewer turns before the first code change, on average
  • No meaningful change in whether the task got done (−0.9 percentage points)

The pass-rate number sits at noise. Whatever cleaner code does for an agent, it doesn't decide whether the work finishes. What it changes is how much the agent has to do to finish it.

One caveat worth flagging up front: these are dataset-wide averages over a wide per-task spread. Some tasks saved 40% on input tokens; a handful actually cost slightly more on the high-quality side. Across the 27 tasks, the helping effect dominates on average, but not on every task. We'll come back to where it does and doesn't.
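On the reasoning-effort figure above: because reasoning-token counts aren't exposed, our approximation is simply a character count over reasoning-type events on the agent's event stream. The sketch below illustrates the idea; the event shape and field names are assumptions for illustration, not Anthropic's documented schema.

```python
def estimate_reasoning_chars(events: list[dict]) -> int:
    """Approximate reasoning effort as the character count of reasoning events.

    `events` is an assumed, simplified shape for the streamed event log, e.g.
    {"type": "reasoning", "text": "..."} alongside tool-call events. We total
    the characters of the reasoning-type entries per run.
    """
    return sum(len(e.get("text", "")) for e in events if e.get("type") == "reasoning")
```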

The bottom line

Early observations point to two critically important findings:

(1) Agents' reasoning budget is impacted by messy code; and

(2) Cleaner code is now an AI infrastructure cost lever, not just an engineering best practice.

The variance across tasks is real: not every task benefits equally from a cleaner codebase. However, if the pattern holds across more repos and more models, the headline numbers understate the structural-quality effect.

The same code that burdens a human reader burdens agents, too. In other words, a codebase that includes deeply nested logic, high cognitive complexity, cryptic naming, etc. will drive more labor and higher cost. When an agent encounters a 400-line function with branchy control flow, it has to work harder: more reading, more re-reading, more reasoning before it touches anything.

Well-maintained code gives agents shorter paths to the same answer: smaller functions, cleaner control flow, and comments that hand context directly to the agent instead of forcing it to infer from structure. The agent doesn't have to build a full mental model before it acts.
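As a toy illustration (not drawn from our benchmark repos, and in Python purely for brevity), consider a deeply nested function next to a flatter equivalent. An agent reading the first version has to hold several interleaved conditions in mind before it can safely change anything; the second hands it the same logic in smaller, named pieces.

```python
# Messy: one function, deep nesting, cryptic names. Every branch has to be
# understood before any edit is safe.
def proc(d):
    if d:
        if "items" in d:
            out = []
            for i in d["items"]:
                if i.get("ok"):
                    if i.get("qty", 0) > 0:
                        out.append(i["qty"] * i.get("price", 0))
            return sum(out)
    return 0

# Cleaner: same behaviour, flatter control flow, names that carry the context.
def is_billable(item: dict) -> bool:
    """Only valid items with a positive quantity count toward the total."""
    return item.get("ok", False) and item.get("qty", 0) > 0

def order_total(order: dict) -> float:
    """Sum quantity * price over billable items; empty orders total zero."""
    items = order.get("items", []) if order else []
    return sum(i["qty"] * i.get("price", 0) for i in items if is_billable(i))
```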

Two of the more telling signals in the data aren't about how much the agent reads or writes; rather, they're about how it moves through the work. On the clean side the agent re-reads files it has already edited about 34% less often than on the unclean side, and reaches its first code edit slightly sooner. Both effects show up on every pair we measured, and unlike the input/output/reasoning numbers (which swing widely from task to task), these stay consistent across the dataset.
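The revisit metric itself is straightforward to compute from an agent's tool-call trace. A minimal sketch, assuming a simplified trace format (an ordered list of read/edit actions with file paths; the field names are ours, not the harness's):

```python
def count_post_edit_rereads(trace: list[dict]) -> int:
    """Count reads of files the agent has already edited earlier in the run.

    `trace` is an assumed, simplified tool-call log, ordered in time, e.g.
    [{"action": "read", "path": "src/billing.py"},
     {"action": "edit", "path": "src/billing.py"},
     {"action": "read", "path": "src/billing.py"}]  # -> 1 revisit
    """
    edited: set[str] = set()
    revisits = 0
    for step in trace:
        if step["action"] == "edit":
            edited.add(step["path"])
        elif step["action"] == "read" and step["path"] in edited:
            revisits += 1
    return revisits
```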

One plausible interpretation of this consistent signal is that the agent commits and moves on when the code is cleaner. On messier code, it goes back to re-read what it already touched, spends longer building a mental model before its first edit, and second-guesses itself more often. The token reductions look like a consequence of that behavioural difference, not the cause of it.

Hidden tokenomics beyond the "prompt"

The financial burden of AI is putting a spotlight on "agentic inference" costs. As software developers move from single-turn prompts to multi-step agentic workflows, token consumption has soared, with platforms like OpenRouter processing over 100 trillion tokens annually.

A single coding agent task on a frontier model now averages three to four million tokens, accumulated across tool calls, file reads, edits, retries, and reasoning steps in one conversation (Bai et al., 2026, our work). Most of those tokens are not spent generating code. They go into reading it, reviewing it, and re-reading it (Bai et al., 2026; Salim et al., 2026). What drives the cost isn't how much the agent writes; it's how much code the agent has to look at in order to write it. Anything that reduces this work—smaller files, predictably named code, clearer control flow, comments documenting upstream/downstream consequences of a method—can lower overall costs.
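To make the accounting concrete: per-task cost is just the sum of input and output tokens over every turn of the conversation, and most of the input side is code being read back in. A minimal sketch, assuming per-turn usage records shaped like the Anthropic Messages API usage object (input_tokens, output_tokens); the surrounding record structure is our own.

```python
def task_token_totals(turns: list[dict]) -> dict:
    """Sum input and output tokens across every turn of one agent task.

    Each turn is assumed to carry a `usage` dict with `input_tokens` and
    `output_tokens`, the fields the Anthropic Messages API reports per
    request; everything else about the record shape is illustrative.
    """
    totals = {"input_tokens": 0, "output_tokens": 0}
    for turn in turns:
        usage = turn.get("usage", {})
        totals["input_tokens"] += usage.get("input_tokens", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals
```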

Efficiency gains with cleaner code

Our first results therefore point to a simple takeaway: cleaner code lowers what your agents cost to run. Across the matched pairs we tested, the cleaner side of each pair used about 7% fewer input tokens and 8% fewer output tokens than its unclean counterpart, with no meaningful change in whether the task got done.

To some extent, what’s easy for human developers to work on is also easy for the agents to work on. Code maintained with SonarQube has less overhead, and the token reduction is the direct result.

What this means for agent-centric development 

AI usage management 

AI token usage is on its way from a line-item curiosity to a real budget concern. In just two years, the share of financial operations teams actively managing AI spend has jumped from 31% to 98%. The conversation about how much an engineering organisation spends on agents is moving into the same meetings as the cloud bill, and the research we've shared today is one of the first signals that code quality is a crucial part of that conversation.

Most engineering teams already invest in code quality, whether through SonarQube, regular code review, or periodic refactors. What our findings suggest is that this investment now applies in two contexts, not one: the codebase a developer can keep moving in, and the per-task agent cost that scales with how legible the code is.

What's next

This work is Sonar's first step into a larger body of research on the relationship between AI agents and cleaner codebases. As we refine our findings, we will gradually broaden the experimental setup to cover a larger set of LLMs and additional agentic harnesses.

Furthermore, we are confident that the positive impact of working on cleaner code will compound over time. Though our experiments were conducted in a one-shot setting, we'll be working on long-horizon benchmarks to test this hypothesis more rigorously in the future. Stay tuned for more.

In the meantime, check out Sonar's AI code verification platform, SonarQube.

– 

Author’s note: The research was conducted using SonarQube to define and measure code quality — the same opinionated approach we apply elsewhere. This is not a third-party study, and we’re transparent about that. The directional result is what we stand behind.

