Your AI bill is a code quality problem

6 min read

Killian Carlsen-Phelan

Developer Content Engineer

Your AI coding agent bill probably went up last month, and so did everyone else's. The fixes people reach for (switch models, tune prompts, cap usage) all target the AI side of the equation. A controlled study from Sonar looked at the code side and found that structural quality affects what agents cost to run.

Why are AI coding agent bills so high?

One developer's monthly Claude Code bill hit $1,600 after a few weeks of heavy usage. Across more than 400 engineering organizations, DX's longitudinal research found that teams mixing inline completion and agentic coding tools now spend $200 to $600 per engineer per month.

Uber rolled out Claude Code to roughly 5,000 engineers and exhausted the AI tools portion of its $3.4 billion R&D budget within four months, with CTO Praveen Neppalli Naga telling The Information that individual engineers were spending $500 to $2,000 per month on tokens. Microsoft began cancelling Claude Code licenses across its Experiences and Devices division in May, driven partly by cost and partly by a push to consolidate on GitHub Copilot CLI before its June 30 fiscal year close. GitHub Copilot's own switch to token-based billing on June 1 surfaced the same cost dynamics on a different platform, with developers projecting that heavy agentic usage would push individual costs from $29 to $750 per month.

The FinOps Foundation's 2026 report puts a number on how fast this became everyone's problem. Two years ago, 31% of financial operations teams actively managed AI spend. Today the figure is 98%, a shift the Foundation describes as AI moving "from emerging concern to everyday FinOps scope in just two years."

Bai et al. (2026), co-authored with researchers from the Stanford Digital Economy Lab, found that agentic coding tasks consume roughly 1,000 times the tokens of single-turn chat interactions. Agents read files, plan edits, execute changes, verify results, and re-send the full conversation history every turn, so the ratio reflects the work, not waste. Because each turn carries everything before it, costs compound with session length, and input tokens account for most of the total.
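The compounding is easy to see with a little arithmetic. The sketch below is a toy model, not anything taken from the Bai et al. paper: it assumes every turn re-sends the full context, that each turn adds a fixed number of tokens to the history, and that prompt caching is not in play. The function name and all the numbers are hypothetical.

```python
def cumulative_input_tokens(turns: int, base_context: int, tokens_added_per_turn: int) -> int:
    """Total input tokens across a session where every turn re-sends the full history.

    Toy model: all parameters are illustrative assumptions, and caching is ignored.
    """
    total = 0
    context = base_context
    for _ in range(turns):
        total += context                   # the whole context so far is sent as input
        context += tokens_added_per_turn   # this turn's output and tool results join the history
    return total


# Hypothetical sessions: same starting context, same growth per turn.
short_session = cumulative_input_tokens(turns=10, base_context=2_000, tokens_added_per_turn=1_000)
long_session = cumulative_input_tokens(turns=40, base_context=2_000, tokens_added_per_turn=1_000)

print(short_session)  # 65000
print(long_session)   # 860000
```

Under these assumptions, a session with four times as many turns sends roughly thirteen times as many input tokens, which is why anything that adds turns or re-reads to a session costs more than its share.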

A 400-line method with branching control flow can't be navigated by name, so the agent reads the whole thing. Cryptic identifiers force the same exhaustive search, and tangled module boundaries send the agent looping back to verify edits across a seam. These are all familiar structural problems, but nobody had tested, until recently, whether they also show up on the AI bill.

Cleaner code, smaller footprint

To isolate the variable, Sonar's study built six matched pairs of repositories. Each pair ships the same application and passes the same test suite with the same dependencies. The only difference is internal: how the code is factored, named, and nested, and whether it carries the kinds of structural issues SonarQube flags. Three pairs started from clean codebases and were deliberately degraded; three started from organically messy codebases and were cleaned up through SonarQube-guided automation. Building pairs in both directions ensured that any observed effect came from the code's structural state, not from the construction process. Across the six pairs, researchers authored 33 coding tasks and ran each ten times on both sides, totaling 660 trials on Claude Code with Claude Sonnet 4.6.

Agents working on cleaner code used 7.1% fewer input tokens, 8.5% fewer output tokens, and 11.1% less reasoning effort. Task completion was unaffected: pass rates on the two sides differed by less than 1%.

Agents on the cleaner side also re-read files they had already edited roughly 34% less often, and that pattern held across every repository pair. Token metrics swung widely from task to task, but the revisitation drop was directionally consistent. On clean code, agents committed to an edit and moved on, while on messy code they looped back to re-read what they had already touched, orienting longer before acting and revisiting changes afterward.

On one commons-bcel task, the messy side held opcode dispatch inside two parallel methods, each a several-hundred-line switch over JVM opcodes. Cleanup replaced both with thin dispatchers of a couple of dozen lines, each delegating to roughly ten named helpers. Agents working on the cleaned version of that task used 35% fewer input tokens, opened 25% fewer files, and finished in 32% fewer conversation turns. File size barely changed (Utility.java was essentially identical on both variants), but agents could now grep for handleOpcodePush() instead of scanning a full switch to find the push-handling branch.
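The shape of that refactoring can be sketched in a few lines. This is not the commons-bcel code (which is Java); it is a minimal Python illustration of a thin dispatcher that delegates to named helpers. The opcode constants, the handler names, and the stack-based behavior are all assumptions made up for the example.

```python
# Illustrative opcode constants; hypothetical, not taken from commons-bcel.
PUSH, POP, ADD = 0x10, 0x57, 0x60


def handle_opcode_push(stack, operand):
    """Push the operand onto the stack."""
    stack.append(operand)


def handle_opcode_pop(stack, operand):
    """Discard the top of the stack."""
    stack.pop()


def handle_opcode_add(stack, operand):
    """Replace the top two stack values with their sum."""
    right, left = stack.pop(), stack.pop()
    stack.append(left + right)


# The dispatcher stays thin: one lookup table, one call.
HANDLERS = {
    PUSH: handle_opcode_push,
    POP: handle_opcode_pop,
    ADD: handle_opcode_add,
}


def dispatch(opcode, stack, operand=None):
    """Route an opcode to its named handler, failing loudly on unknown opcodes."""
    try:
        handler = HANDLERS[opcode]
    except KeyError:
        raise ValueError(f"unsupported opcode: {opcode:#x}") from None
    handler(stack, operand)


stack = []
dispatch(PUSH, stack, 2)
dispatch(PUSH, stack, 3)
dispatch(ADD, stack)
print(stack)  # [5]
```

The point is findability rather than line count: a search for handle_opcode_push lands directly on the push logic, where a several-hundred-line switch would have to be read to locate the same branch.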

Not every cleanup helped. On one genie task, the pipeline extracted helpers around focal launch logic but left the original logic in place, spreading work across more methods without making the core any more findable. The agents paid for the added surface area with 8% more input tokens, though other metrics were essentially flat.

Multi-module tasks carried the strongest signal. Where work required changes across two or more module boundaries, agents on the clean side used 10.7% fewer input tokens and revisited files 50.8% less often. On clean seams, they crossed a boundary and kept going; on messy seams, they looped back. Cognitive-hotspot tasks (work inside a single dense method or class) netted out roughly token-neutral, because extraction redistributed complexity across more files rather than eliminating it.

A comment-volume ablation tested the obvious alternative explanation. After equalizing comments between clean and messy variants, the clean side's advantage either held or grew. Researchers concluded that "comments and suppression markers are not what drives the cleaner-versus-messier footprint contrast," though they noted their normalization required extensive manual intervention and the study's high per-task variance prevented a stronger claim.

The study is Sonar's own, conducted on one model in one harness, across six repository pairs. Per-task variance was wide, with individual deltas ranging from -47% to +44% on input tokens, and of 27 non-calibration tasks, 16 favored the clean side while 11 favored the messy side. The directional finding holds across the dataset, but specific percentages may shift with further research.

Where the savings come from

SonarQube's cognitive complexity metric (rule S3776) measures exactly what made those commons-bcel dispatchers expensive to navigate: control-flow density that forces any reader, human or machine, to parse the whole method rather than targeting a named entry point. S3776 is classified as HIGH-severity maintainability across Java, Python, JavaScript, and TypeScript, with a default threshold of 15. Teams already tracking coverage and duplication can add cognitive complexity density as a project-level metric. The study's data suggests it functions as a reasonable proxy for how expensive a codebase is for agents to navigate, though the genie counterexample shows the limit. Extraction that adds structure without creating findable names makes things worse, not better.
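To make the idea of control-flow density concrete, here is a rough sketch of a nesting-weighted score for Python functions. It is emphatically not SonarQube's S3776 algorithm: it is a simplified approximation that charges each control structure one point plus its nesting depth, and one point per boolean expression. The function names and the sample source are hypothetical; only the threshold of 15 comes from the rule's default.

```python
import ast

# Structures that add a point and deepen nesting in this simplified model.
CONTROL_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def rough_complexity(node: ast.AST, depth: int = 0) -> int:
    """Approximate, nesting-weighted complexity. Not SonarQube's implementation."""
    score = 0
    for child in ast.iter_child_nodes(node):
        if isinstance(child, CONTROL_NODES):
            score += 1 + depth                        # deeper nesting costs more
            score += rough_complexity(child, depth + 1)
        else:
            if isinstance(child, ast.BoolOp):
                score += 1                            # each and/or expression costs one point
            score += rough_complexity(child, depth)
    return score


def function_scores(source: str) -> dict:
    """Map each function name in the source to its rough score."""
    tree = ast.parse(source)
    return {
        node.name: rough_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


SAMPLE = '''
def flat(x):
    if x:
        return 1
    return 0

def nested(items):
    for item in items:
        if item and item > 0:
            while item:
                item -= 1
'''

scores = function_scores(SAMPLE)
print(scores)  # {'flat': 1, 'nested': 7}

# Flag functions above the rule's default threshold of 15.
over_threshold = {name: s for name, s in scores.items() if s > 15}
```

Three nested structures already score 7 here, which gives a feel for how quickly a long, deeply branched method climbs past 15. For real measurement, use the analyzer's own number rather than a homemade approximation like this one.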

Naming works the same way, and the paper's contrast between normalize_query and _xfm_q2 shows why: predictable names let agents do targeted lookups instead of scanning every file. Le et al. (2025) quantified the effect in a related study, finding that uninformative identifier renames degraded GPT-4o's summarization accuracy from 87.3% to 58.7%, a drop of 28.6 percentage points from names alone.

Clean module boundaries explain why the multi-module track carried the strongest signal. When interfaces are clear and dependencies don't circle back on themselves, agents can cross a boundary and keep going rather than looping back to verify. When refactoring time is limited, the comment-volume ablation suggests spending it on structure rather than documentation.

The compounding question

Seven percent per task may seem modest on its own, but agents run hundreds or thousands of times on the same codebase, and whether those savings accumulate is the question that matters.

The Sonar paper raises this directly in its limitations: "The version of the maintainability argument worth measuring next is whether [per-task savings] compound: across a year of agent work on a codebase, do per-task savings accumulate, or does the codebase drift in ways that erase them?" The study measured single-task footprints on fixed codebases. It did not track what happens when agents repeatedly modify the same repository over weeks or months.

SlopCodeBench (Orlanski et al., 2026) approached from the other end. Instead of measuring how agents navigate existing code, it tracked what happens to code quality when agents extend their own prior work under evolving specifications. Agent-generated code drifted toward more verbose, less structured output, ending 2.3 times more verbose than human-maintained baselines, with structural erosion rising in 77% of trajectories. Prompt-level interventions reduced initial verbosity by roughly a third but didn't change degradation rates over time.

Neither study makes the combined claim. Connecting them is inference, and should be read as such. But the logic is straightforward: if messier code costs more for agents to work on (the Sonar finding), and agents tend to make code messier over successive iterations (the SlopCodeBench finding), then the cost trajectory for unverified agent work bends upward. Each iteration degrades the codebase slightly, and the next iteration pays more to navigate that degradation.

Running static analysis after each agent modification isn't only about catching bugs or enforcing style. It prevents the structural drift that would make the next agent run more expensive than the last one. Quality gates applied consistently preserve navigability across iterations, keeping the codebase in the range where agents commit and move forward rather than looping back.
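The inference can be written down as a toy model, with the caveat that neither study measured it. The sketch below assumes each agent run makes the next one a fixed fraction more expensive unless a quality gate holds drift at zero. The function name, the 0.1% drift rate, the run count, and the unit cost are all hypothetical values chosen only to show the shape of the curve.

```python
def cumulative_cost(runs: int, base_cost: float, drift_per_run: float) -> float:
    """Total cost of successive agent runs when each run inflates the next one's cost.

    Toy model: drift_per_run is an assumed rate, not a measured one.
    """
    total = 0.0
    cost = base_cost
    for _ in range(runs):
        total += cost
        cost *= 1.0 + drift_per_run   # the codebase gets slightly harder to navigate
    return total


# Hypothetical comparison over 100 runs at a unit cost of 1.0 per run.
gated = cumulative_cost(runs=100, base_cost=1.0, drift_per_run=0.0)
ungated = cumulative_cost(runs=100, base_cost=1.0, drift_per_run=0.001)

print(f"gated:   {gated:.1f}")
print(f"ungated: {ungated:.1f}")
```

With these made-up numbers the ungated total comes out about 5% higher after 100 runs, and the gap widens as runs accumulate. Whether real codebases drift at anything like this rate is exactly what the long-horizon benchmarks would need to establish.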

The Sonar researchers call the compounding question the study's most important open question and identify long-horizon benchmarks as the next step. For teams making decisions now, the per-task evidence is enough to act on.

Same investment, two returns

Code quality has always paid for itself through developer productivity: fewer bugs in production, faster onboarding, and less time fighting tangled logic.

What's new is that the same structural work now also reduces AI infrastructure costs. It is not the biggest lever on your AI bill (model choice and prompt caching matter more in absolute terms), but it is the only one that requires no changes to your tooling and no new infrastructure. A team enforcing quality gates on cognitive complexity in SonarQube is already doing the work. The same refactoring that speeds up a developer also cuts what the next agent run costs.
