The introduction of sophisticated reasoning capabilities in models like GPT-5 marks a significant evolution in AI code generation. This report provides a deep dive into GPT-5’s four reasoning modes—minimal, low, medium, and high—to understand the impact of increased reasoning on functional correctness, code quality, security, and cost. Our analysis, based on over 4,400 Java tasks, reveals a clear trade-off: while higher reasoning delivers best-in-class functional performance, it achieves this by generating a massive volume of complex and hard-to-maintain code.
This blog post builds on our previous analysis, The Coding Personalities of Leading LLMs—GPT-5 Update, where we evaluated GPT-5’s minimal reasoning mode against other leading models. In this research, we found that reasoning is a powerful tool for improving correctness and security, but not a free one. The medium reasoning mode achieves the highest functional success rate and offers the best balance of performance and cost. Regardless of the setting, however, GPT-5’s code requires rigorous static analysis to manage an immediate increase in technical debt and a new class of subtle, complex flaws that reasoning introduces. The key takeaway: while reasoning reduces common problems in the code, it also creates new, hidden ones that demand greater scrutiny.
A note on methodology
Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code from GPT-5 across its four reasoning modes: minimal, low, medium, and high. Each mode was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval. This analysis is a deep dive into the impact of reasoning and is intended as a follow-up to our previous report, which provided a broader comparison of GPT-5-minimal against other leading LLMs.
Functional performance
Increased reasoning has a positive impact on functional performance, but the returns diminish at the highest, most expensive levels.
Introducing even a little reasoning provides a material boost: the low reasoning mode’s pass rate of ~80% is a five-point jump over the minimal mode’s ~75%. Performance peaks with the medium reasoning mode, which achieved the highest functional success rate in our evaluation (81.96%), narrowly edging out the far more expensive high setting (81.78%). This makes the medium reasoning mode a clear “sweet spot”: for complex tasks where correctness is paramount, it represents the optimal balance of performance and cost.
Table 1: Functional performance by reasoning mode
MultiPL-E benchmarks | GPT-5-high | GPT-5-medium | GPT-5-low | GPT-5-minimal |
HumanEval (158 tasks) | 96.84% | 96.84% | 96.20% | 91.77% |
MBPP (385 tasks) | 75.13% | 75.39% | 73.58% | 68.13% |
Weighted Test Pass@1 Avg | 81.78% | 81.96% | 80.50% | 75.37% |
The cost of reasoning
While functionally superior, the code generated by higher reasoning modes is not necessarily better: it is more verbose, more expensive, and contains more defects per task.
1. Verbosity & complexity
All GPT-5 reasoning modes are more verbose and complex than their predecessor, GPT-4o. As the table below shows, even GPT-5-minimal produces more than double the lines of code of GPT-4o, and verbosity grows further with reasoning. Complexity, too, is consistently higher across all four GPT-5 modes than in non-reasoning alternatives. This indicates a shift toward a more elaborate approach to problem-solving, in which the model adds significantly more Lines of Code (LOC), even between the medium and high settings, without a corresponding increase in functional performance.
Table 2: Code volume and complexity metrics
LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity | Cyclomatic complexity / LoC | Cognitive complexity / LoC |
GPT-5-high | 727,154 | 204,395 | 169,496 | 0.281 | 0.233 |
GPT-5-medium | 611,112 | 171,485 | 138,925 | 0.281 | 0.227 |
GPT-5-low | 561,325 | 154,776 | 119,313 | 0.276 | 0.213 |
GPT-5-minimal | 490,010 | 145,099 | 111,133 | 0.296 | 0.227 |
GPT-4o | 209,994 | 44,387 | 26,450 | 0.211 | 0.126 |
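To make the complexity metrics above concrete, here is an illustrative Java sketch (not drawn from the benchmark corpus; method names are invented for this example). Both methods are functionally identical, but static analyzers such as SonarQube score the nested version’s cognitive complexity higher, because each additional level of nesting adds to the metric. A verbose, nested style at scale is what drives the density figures in the table.

```java
// Illustration only: two behaviorally identical methods with different
// cognitive complexity. Each nesting level in the first version adds to
// the score; the flat guard-clause version keeps the score low.
public class ComplexityDemo {

    // Nested style: deeper nesting inflates cognitive complexity.
    static String classifyNested(int score) {
        if (score >= 0) {
            if (score < 50) {
                return "fail";
            } else {
                if (score < 80) {
                    return "pass";
                } else {
                    return "distinction";
                }
            }
        } else {
            return "invalid";
        }
    }

    // Flat guard-clause style: same behavior, lower cognitive complexity.
    static String classifyFlat(int score) {
        if (score < 0) return "invalid";
        if (score < 50) return "fail";
        if (score < 80) return "pass";
        return "distinction";
    }

    public static void main(String[] args) {
        // Sanity check: the two styles always agree.
        for (int s : new int[]{-1, 0, 49, 50, 79, 80, 100}) {
            if (!classifyNested(s).equals(classifyFlat(s))) {
                throw new AssertionError("mismatch at " + s);
            }
        }
        System.out.println("both styles agree");
    }
}
```

The point is not that nesting is always wrong, but that a model which habitually emits the first style will accumulate a higher complexity density across hundreds of thousands of lines.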
2. The financial cost
The cost of using GPT-5 scales with reasoning, driven by both internal “reasoning tokens” and the volume of verbose code the model generates. Developers and organizations should factor in the additional cost when deciding whether to move up to a higher reasoning setting.
Table 3: Cost per benchmark run
Reasoning mode | Cost per benchmark run |
GPT-5-high | $189 |
GPT-5-medium | $64 |
GPT-5-low | $47 |
GPT-5-minimal | $22 |
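One way to compare these price points is cost per passing task. The sketch below is a back-of-the-envelope calculation, assuming a round benchmark size of 4,400 tasks (the report says “over 4,400”) and using the run costs and pass rates from the tables in this post; the exact task count is an approximation.

```java
// Back-of-the-envelope sketch: run cost divided by the approximate number
// of passing tasks (total tasks x pass rate). The 4,400 task count is an
// assumption; run costs and pass rates come from the tables above.
public class CostPerTask {
    static double costPerPassingTask(double runCostUsd, int totalTasks, double passRate) {
        return runCostUsd / (totalTasks * passRate);
    }

    public static void main(String[] args) {
        int tasks = 4400; // approximate benchmark size
        System.out.printf("high:    $%.4f per passing task%n", costPerPassingTask(189, tasks, 0.8178));
        System.out.printf("medium:  $%.4f per passing task%n", costPerPassingTask(64, tasks, 0.8196));
        System.out.printf("minimal: $%.4f per passing task%n", costPerPassingTask(22, tasks, 0.7537));
    }
}
```

By this rough measure, each passing task from the high mode costs several times more than one from the medium mode, despite a nearly identical pass rate.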
Code quality
As part of this evaluation, we found that the reasoning level does not materially change the code’s complexity density: cognitive complexity per line is nearly identical across the minimal, medium, and high modes (0.227–0.233). This indicates that GPT-5 has an inherently complex coding style, regardless of how much it reasons.
The model also appears to “overthink” as reasoning increases, introducing more issues per passing task: from 3.90 at the minimal setting to 5.50 at high. Because issue density per line remains stable, the greater volume of code at higher reasoning levels translates directly into more absolute issues. This makes higher-reasoning GPT-5 a source of increased tech debt, trading long-term maintainability for short-term velocity.
Table 4: Code quality and issue rates
LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
GPT-5-high | 81.78% | 19,968 | 5.50 |
GPT-5-medium | 81.96% | 16,629 | 4.57 |
GPT-5-low | 80.50% | 13,887 | 3.88 |
GPT-5-minimal | 75.37% | 13,057 | 3.90 |
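The “issues per passing task” column can be roughly reconstructed from the other figures in this post. The sketch below assumes a round benchmark size of 4,400 tasks (an approximation; the report says “over 4,400”), which reproduces the table’s values to within a few hundredths.

```java
// Rough reconstruction of "issues per passing task": total SonarQube issues
// divided by the number of passing tasks (total tasks x pass rate).
// The 4,400 task count is an assumption, so results only approximate the table.
public class IssueDensity {
    static double issuesPerPassingTask(int issues, int totalTasks, double passRate) {
        return issues / (totalTasks * passRate);
    }

    public static void main(String[] args) {
        int tasks = 4400; // approximation
        System.out.printf("high:    %.2f%n", issuesPerPassingTask(19968, tasks, 0.8178)); // table: 5.50
        System.out.printf("minimal: %.2f%n", issuesPerPassingTask(13057, tasks, 0.7537)); // table: 3.90
    }
}
```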
A new risk profile
Another important takeaway is that reasoning shifts the type of flaws generated. It reduces common, obvious issues but replaces them with nuanced ones. Developers may get a false sense of security, as the code appears cleaner on the surface.
1. Security
The data suggests that reasoning at the medium and high levels produces more secure code: as reasoning increases, GPT-5 becomes significantly better at avoiding common, high-risk vulnerabilities. But the improvement is uneven, because it also introduces more complex security issues.
As the table below shows, “path-traversal & injection” flaws are nearly eliminated at higher reasoning levels, as are other common issues like “cryptography misconfiguration.”
However, this security benefit comes at a cost. In place of these well-understood flaws, the higher reasoning modes introduce more subtle, implementation-specific vulnerabilities: the rates of “inadequate I/O error-handling” and “certificate-validation omissions” both rise sharply. This leaves development leaders with a difficult trade-off: reducing the prevalence of common exploits while increasing the risk of nuanced implementation flaws.
Table 5: Vulnerability sub-category distribution (%)
Vulnerability category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
Path-traversal & injection | 0.00 | 1.69 | 0.00 | 20.00 |
Inadequate I/O error-handling | 43.84 | 35.59 | 51.02 | 30.00 |
Cryptography misconfiguration | 6.85 | 10.17 | 24.49 | 23.33 |
Certificate-validation omissions | 15.07 | 22.03 | 8.16 | 8.33 |
Hard-coded credentials | 10.96 | 15.25 | 6.12 | 5.00 |
XML External Entity (XXE) | 16.44 | 11.86 | 6.12 | 10.00 |
JSON-injection risk | 0.00 | 0.00 | 0.00 | 0.00 |
JWT signature not verified | 0.00 | 0.00 | 0.00 | 0.00 |
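The “certificate-validation omissions” category is worth illustrating, since it is one of the flaws that grows with reasoning. The sketch below is illustrative only (it is not taken from the benchmark output): a custom `X509TrustManager` whose check methods are empty, which silently disables TLS server authentication and is exactly the kind of pattern static analysis flags.

```java
// Illustrative only: a "certificate-validation omission". Empty check
// methods mean ANY certificate chain is trusted, defeating TLS.
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

public class TrustDemo {
    // VULNERABLE: the no-op overrides accept every certificate chain.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // no-op: flaw — client certificates are never validated
        }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // no-op: flaw — server certificates are never validated
        }
        @Override public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    public static void main(String[] args) throws Exception {
        // Even an empty (attacker-controlled) chain passes without an exception.
        TRUST_ALL.checkServerTrusted(new X509Certificate[0], "RSA");
        System.out.println("accepted an unvalidated chain: TLS server auth is effectively disabled");
    }
}
```

Code like this often appears in generated solutions as a shortcut to make an HTTPS call “just work”, which is precisely why it can slip past a surface-level review.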
2. Reliability
Reliability presents another difficult trade-off. The data shows a clear pattern: in higher reasoning modes, fewer severe bugs are introduced. While the code generated is not bug-free, the chance of a major issue is reduced as reasoning helps the model avoid fundamental logical errors and common API usage mistakes.
Two trends in the table below stand out. As reasoning increases, the rate of basic “control-flow mistake” bugs is roughly halved from the minimal to the high setting. Inversely, the model's attempts at more complex, multi-threaded solutions lead to an increase in “concurrency / threading” bugs, which nearly double over the same range. This highlights another difficult trade-off: increased reasoning fixes simple logical errors but creates more complex, harder-to-detect ones. The remaining categories are included for completeness but show less pronounced trends, indicating that the primary impact of reasoning is on the core logic and complexity of the solutions.
Table 6: Bug sub-category distribution (%)
Bug category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
Control-flow mistake | 12.57 | 10.85 | 11.60 | 24.26 |
Concurrency / Threading | 38.30 | 35.05 | 27.44 | 20.00 |
API contract violation | 10.53 | 6.23 | 8.47 | 9.18 |
Exception handling | 5.56 | 6.58 | 8.10 | 9.18 |
Resource management / leak | 7.16 | 9.07 | 9.58 | 11.48 |
Type-safety / Casts | 5.56 | 4.27 | 7.92 | 5.25 |
Null / data-value issues | 5.12 | 7.65 | 5.52 | 3.77 |
Performance / structure | 4.82 | 7.12 | 3.87 | 3.77 |
Pattern / regex | 1.61 | 4.98 | 2.39 | 0.82 |
Data-structure bug | 0.00 | 0.18 | 0.37 | 0.00 |
Serialization / serializable | 0.00 | 0.00 | 0.00 | 0.00 |
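The “concurrency / threading” category can also be made concrete. The sketch below is illustrative (not from the benchmark output): a plain `int` counter incremented from several threads loses updates because `n++` is a non-atomic read-modify-write, while an `AtomicInteger` alongside it stays correct. Races like this are easy to miss in review because the code looks reasonable and usually works on small inputs.

```java
// Illustrative "concurrency / threading" bug: a racy int counter loses
// updates under contention; AtomicInteger does not. Lost updates can only
// make the racy total fall short of the expected count, never overshoot it.
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int racy = 0;                                    // bug: unsynchronized shared state
    static final AtomicInteger safe = new AtomicInteger();  // fix: atomic updates

    public static void main(String[] args) {
        final int threads = 4, perThread = 100_000;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    racy++;                 // non-atomic read-modify-write
                    safe.incrementAndGet(); // atomic increment
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        System.out.println("racy=" + racy + " safe=" + safe.get()
                + " expected=" + threads * perThread);
    }
}
```

On most runs the racy total falls short of 400,000 while the atomic total is exact; the intermittent nature of the shortfall is what makes this bug class hard to catch with tests alone.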
3. Severity
The data on issue severity reveals one of the most significant trade-offs of the reasoning models: a shift from application-breaking flaws toward a higher volume of less critical issues. This is most evident in the security profile, where the severity of vulnerabilities generated by GPT-5 differs fundamentally from other models. As shown below, all four reasoning modes produce a much lower proportion of BLOCKER vulnerabilities than their peers (an average of Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 Vision 90B, and OpenCoder-8B), indicating a successful effort in reducing the most severe security flaws.
Table 7: Vulnerability severity distribution (%)
LLM model | BLOCKER vulnerabilities (%) | CRITICAL vulnerabilities (%) | MINOR vulnerabilities (%) |
GPT-5-high | 30.14 | 24.66 | 45.21 |
GPT-5-medium | 28.81 | 33.90 | 35.59 |
GPT-5-low | 12.24 | 36.73 | 51.02 |
GPT-5-minimal | 35.00 | 31.67 | 30.00 |
(Other Models Avg) | ~63% | ~27% | ~8% |
This trend is also reflected in the bug profile. Reasoning is effective at reducing the most severe functional bugs, and the GPT-5 suite consistently outperforms other models in this regard. The high reasoning mode produces the lowest proportion of BLOCKER bugs (~3%), a figure that rises to ~8% for the minimal mode. While the code is far from bug-free, the chance of functional error is clearly reduced with increased reasoning.
Table 8: Bug severity distribution (%)
LLM model | BLOCKER bugs (%) | CRITICAL bugs (%) |
GPT-5-high | 2.92 | 2.63 |
GPT-5-medium | 4.80 | 3.20 |
GPT-5-low | 5.16 | 2.95 |
GPT-5-minimal | 7.70 | 2.30 |
(Other Models Avg) | ~10.2 | ~5.8 |
In summary, the issues in the lower-reasoning modes were simply easier to spot because they were more common and straightforward. The higher the reasoning level, the deeper the review needs to be, and the greater the chance that subtle issues slip past a standard code review.
Conclusion: Trust and verify, rigorously
Reasoning is a powerful feature that allows GPT-5 to achieve a new level of functional correctness and security against common attacks. However, it is not a silver bullet. “Trust and verify” is more critical than ever for this new class of models.
From a developer’s standpoint, the danger is complacency. At a glance, the code from higher-reasoning modes will have fewer obvious logical errors and common vulnerabilities. But hidden beneath the surface is a greater volume of complex code saturated with subtle, hard-to-detect issues like concurrency bugs and insecure error handling. For teams with existing codebases, the poor maintainability of this code presents a significant risk.
Teams adopting GPT-5 will likely see an increase in initial feature velocity, but this will be paid for by a direct and immediate increase in technical debt. Harnessing the power of reasoning models requires a robust governance strategy, centered on rigorous static analysis to identify and manage the complex flaws they create.