
How Reasoning Impacts LLM Coding Models


Prasenjit Sarkar

Solutions Marketing Manager

10 min read

Table of contents

  • A note on methodology
  • Functional performance
  • The cost of reasoning
  • Code quality
  • A new risk profile

The introduction of sophisticated reasoning capabilities in models like GPT-5 marks a significant evolution in AI code generation. This report provides a deep dive into GPT-5’s four reasoning modes—minimal, low, medium, and high—to understand the impact of increased reasoning on functional correctness, code quality, security, and cost. Our analysis, based on over 4,400 Java tasks, reveals a clear trade-off: while higher reasoning delivers best-in-class functional performance, it achieves this by generating a massive volume of complex and hard-to-maintain code.

This blog post builds upon our previous analysis, The Coding Personalities of Leading LLMs—GPT-5 Update, where we evaluated GPT-5’s minimal reasoning mode against other leading models. In this research, we found that reasoning is a powerful tool for improving correctness and security, but it comes with trade-offs. Medium reasoning mode achieves the highest functional success rate and provides a good balance of performance and cost. However, regardless of the setting, GPT-5’s code requires rigorous static analysis to manage the immediate increase in technical debt and a new class of subtle, complex flaws that reasoning introduces. The key takeaway: while reasoning reduces common problems in the code, it also creates new, hidden ones that demand greater scrutiny.

A note on methodology

Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code from GPT-5 across its four reasoning modes: minimal, low, medium, and high. Each mode was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval. This analysis is a deep dive into the impact of reasoning and is intended as a follow-up to our previous report, which provided a broader comparison of GPT-5-minimal against other leading LLMs.

Functional performance

Increased reasoning has a positive impact on functional performance, but the returns diminish at the highest, most expensive levels.

Introducing even a little reasoning provides a material boost, with the low reasoning mode’s pass rate of 80.50% representing a jump from the minimal mode’s 75.37%. Performance peaks with the medium reasoning mode, which achieved the highest functional success rate in our evaluation at 81.96%, slightly outperforming the much more expensive high setting (81.78%). This makes the medium reasoning mode a clear “sweet spot”: for complex tasks where correctness is paramount, it represents the optimal balance of performance and cost.


Table 1: Functional performance by reasoning mode

| MultiPL-E benchmarks | GPT-5-high | GPT-5-medium | GPT-5-low | GPT-5-minimal |
| --- | --- | --- | --- | --- |
| HumanEval (158 tasks) | 96.84% | 96.84% | 96.20% | 91.77% |
| MBPP (385 tasks) | 75.13% | 75.39% | 73.58% | 68.13% |
| Weighted Test Pass@1 Avg | 81.78% | 81.96% | 80.50% | 75.37% |
©2025, Sonar
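The weighted average in the bottom row is, presumably, the task-count-weighted mean of the per-benchmark pass rates. Here is a minimal sketch of that computation using the GPT-5-high figures from Table 1; these two suites alone yield roughly 81.45%, slightly below the published 81.78%, which suggests rounding or additional tasks also enter the average.

```python
# Sketch of a task-count-weighted Test Pass@1 average.
# Benchmark sizes and per-benchmark pass rates are taken from Table 1.

def weighted_pass_at_1(results):
    """results: list of (num_tasks, pass_rate_percent) tuples."""
    total_tasks = sum(n for n, _ in results)
    return sum(n * rate for n, rate in results) / total_tasks

gpt5_high = [
    (158, 96.84),  # HumanEval
    (385, 75.13),  # MBPP
]
print(round(weighted_pass_at_1(gpt5_high), 2))  # → 81.45
```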

The cost of reasoning

While functionally superior, the code generated by higher reasoning modes is not necessarily better. It is more verbose, more expensive, and contains more defects for each given task.

1. Verbosity & complexity

All GPT-5 reasoning models are more verbose and complex than their predecessor GPT-4o. As the table below shows, even GPT-5-minimal produces more than double the lines of code of GPT-4o, and this verbosity increases with higher reasoning. Furthermore, the complexity is consistently higher across all four GPT-5 modes compared to non-reasoning alternatives. This indicates a shift towards a more complex approach to problem-solving, where the model adds significantly more Lines of Code (LOC)—even between the medium and high settings—without a corresponding increase in functional performance.

Table 2: Code volume and complexity metrics

| LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity | Cyclomatic complexity / LOC | Cognitive complexity / LOC |
| --- | --- | --- | --- | --- | --- |
| GPT-5-high | 727,154 | 204,395 | 169,496 | 0.281 | 0.233 |
| GPT-5-medium | 611,112 | 171,485 | 138,925 | 0.281 | 0.227 |
| GPT-5-low | 561,325 | 154,776 | 119,313 | 0.276 | 0.213 |
| GPT-5-minimal | 490,010 | 145,099 | 111,133 | 0.296 | 0.227 |
| GPT-4o | 209,994 | 44,387 | 26,450 | 0.211 | 0.126 |
©2025, Sonar
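The two density columns in Table 2 are simply the raw complexity counts divided by lines of code. A quick sketch recomputing a few of them from the raw figures:

```python
# Recomputing the complexity-density columns of Table 2 from the raw
# counts (LOC, cyclomatic complexity, cognitive complexity per model).
rows = {
    "GPT-5-high":    (727_154, 204_395, 169_496),
    "GPT-5-minimal": (490_010, 145_099, 111_133),
    "GPT-4o":        (209_994,  44_387,  26_450),
}
for model, (loc, cyclomatic, cognitive) in rows.items():
    # Density = complexity per line of code, rounded as in Table 2.
    print(model, round(cyclomatic / loc, 3), round(cognitive / loc, 3))
```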

2. The financial cost

The cost of using GPT-5 scales with reasoning, driven by both internal “reasoning tokens” and the volume of verbose code the model generates. Developers and organizations should factor in the additional cost when deciding whether to move up to a higher reasoning setting. 

Table 3: Cost per benchmark run

| Reasoning mode | Cost per benchmark run |
| --- | --- |
| GPT-5-high | $189 |
| GPT-5-medium | $64 |
| GPT-5-low | $47 |
| GPT-5-minimal | $22 |
©2025, Sonar
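To illustrate why cost scales this way: on a token-billed API, per-run cost is roughly input tokens times the input rate plus output tokens (including internal reasoning tokens, which are typically billed at the output rate) times the output rate. The token counts and per-million-token rates below are hypothetical placeholders, not figures from this benchmark.

```python
# Illustrative sketch of how reasoning inflates API cost. Token counts
# and rates are HYPOTHETICAL placeholders, not figures from the report.
def run_cost(input_tokens, output_tokens, reasoning_tokens,
             in_rate_per_m, out_rate_per_m):
    # Reasoning tokens are billed at the output rate on most APIs.
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens / 1e6) * in_rate_per_m \
         + (billed_output / 1e6) * out_rate_per_m

# Same prompt volume; the higher mode emits more code and far more
# reasoning tokens (all numbers invented for illustration).
low  = run_cost(3_000_000, 1_500_000,  1_000_000, in_rate_per_m=1.25, out_rate_per_m=10.0)
high = run_cost(3_000_000, 2_500_000, 12_000_000, in_rate_per_m=1.25, out_rate_per_m=10.0)
print(low, high)
```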

Code quality

As part of this evaluation, we found that reasoning levels do not materially impact the code’s complexity density. The Cognitive Complexity Density is nearly identical across the minimal, medium, and high modes (0.227-0.233). This indicates that GPT-5 has an inherently complex coding style.

The model appears to “overthink” the answer as reasoning increases, introducing more issues per passing task: from 3.90 at the minimal setting to 5.50 at high. Because issue density remains stable, the greater volume of code at higher reasoning levels results in more absolute issues. This makes higher-reasoning GPT-5 a source of increased technical debt, accepting long-term maintenance overhead in exchange for short-term velocity.

Table 4: Code quality and issue rates

| LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
| --- | --- | --- | --- |
| GPT-5-high | 81.78% | 19,968 | 5.50 |
| GPT-5-medium | 81.96% | 16,629 | 4.57 |
| GPT-5-low | 80.50% | 13,887 | 3.88 |
| GPT-5-minimal | 75.37% | 13,057 | 3.90 |
©2025, Sonar
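For reference, “Issues per passing task” in Table 4 is total SonarQube issues divided by the number of functionally passing tasks. The sketch below assumes a total of 4,442 tasks (an assumption; the report says only “over 4,400”), which reproduces the published ratios closely.

```python
# "Issues per passing task" = total issues / number of passing tasks.
TOTAL_TASKS = 4_442  # ASSUMPTION: the report states only "over 4,400"

def issues_per_passing_task(total_issues, pass_rate_percent):
    passing = TOTAL_TASKS * pass_rate_percent / 100
    return total_issues / passing

# GPT-5-high and GPT-5-minimal rows of Table 4
print(round(issues_per_passing_task(19_968, 81.78), 2))
print(round(issues_per_passing_task(13_057, 75.37), 2))
```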

A new risk profile

Another important takeaway is that reasoning shifts the type of flaws generated. It reduces common, obvious issues but replaces them with nuanced ones. Developers may get a false sense of security, as the code appears cleaner on the surface.

1. Security

The data suggests that reasoning at the medium and high levels produces more secure code. As reasoning increases, GPT-5 becomes significantly better at avoiding common, high-risk vulnerabilities. But the improvement is not uniform: it also introduces more complex security issues.

As the table below shows, security issues like “path-traversal & injection” flaws are nearly eliminated at higher reasoning levels, as are other common issues like “cryptography misconfiguration.”

However, this security benefit comes at a cost. In place of these well-understood flaws, the higher reasoning modes introduce more subtle, implementation-specific vulnerabilities: the rates of “inadequate I/O error-handling” and “certificate-validation omissions” both skyrocket. This leaves development leaders with a difficult trade-off: reducing the prevalence of common exploits while increasing the risk of nuanced implementation flaws.

Table 5: Vulnerability sub-category distribution (%)

| Vulnerability category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
| --- | --- | --- | --- | --- |
| Path-traversal & injection | 0.00 | 1.69 | 0.00 | 20.00 |
| Inadequate I/O error-handling | 43.84 | 35.59 | 51.02 | 30.00 |
| Cryptography misconfiguration | 6.85 | 10.17 | 24.49 | 23.33 |
| Certificate-validation omissions | 15.07 | 22.03 | 8.16 | 8.33 |
| Hard-coded credentials | 10.96 | 15.25 | 6.12 | 5.00 |
| XML External Entity (XXE) | 16.44 | 11.86 | 6.12 | 10.00 |
| JSON-injection risk | 0.00 | 0.00 | 0.00 | 0.00 |
| JWT signature not verified | 0.00 | 0.00 | 0.00 | 0.00 |
©2025, Sonar
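To make the dominant “inadequate I/O error-handling” category concrete, here is an illustrative sketch (in Python for brevity; the benchmark corpus itself is Java, and this is not code from it) of the pattern: a failure silently converted into valid-looking data, next to a variant that surfaces the error to the caller.

```python
# Illustrative "inadequate I/O error-handling" pattern (not from the
# benchmark corpus): the error is swallowed, so callers cannot tell a
# read failure apart from a legitimately empty file.
def read_config_bad(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""  # BUG: failure masquerades as valid, empty data

# A safer variant propagates the failure with context attached.
def read_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        raise RuntimeError(f"could not read config at {path!r}") from exc
```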

2. Reliability

Reliability presents another difficult trade-off. The data shows a clear pattern: in higher reasoning modes, fewer severe bugs are introduced. While the code generated is not bug-free, the chance of a major issue is reduced as reasoning helps the model avoid fundamental logical errors and common API usage mistakes.

This effect is most evident in the two most significant trends shown in the table below. As reasoning increases, the rate of basic “control-flow mistake” bugs is halved from the minimal to the high setting. Inversely, the model's attempts at more complex, multi-threaded solutions lead to an increase in “concurrency / threading” bugs, which nearly double over the same range. This highlights another difficult trade-off: increasing reasoning fixes simple logical errors but creates more complex, harder-to-detect ones. Other categories are included for completeness but show less pronounced trends, indicating the primary impact of reasoning is on the core logic and complexity of the solutions.

Table 6: Bug sub-category distribution (%)

| Bug category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
| --- | --- | --- | --- | --- |
| Control-flow mistake | 12.57 | 10.85 | 11.60 | 24.26 |
| Concurrency / threading | 38.30 | 35.05 | 27.44 | 20.00 |
| API contract violation | 10.53 | 6.23 | 8.47 | 9.18 |
| Exception handling | 5.56 | 6.58 | 8.10 | 9.18 |
| Resource management / leak | 7.16 | 9.07 | 9.58 | 11.48 |
| Type-safety / casts | 5.56 | 4.27 | 7.92 | 5.25 |
| Null / data-value issues | 5.12 | 7.65 | 5.52 | 3.77 |
| Performance / structure | 4.82 | 7.12 | 3.87 | 3.77 |
| Pattern / regex | 1.61 | 4.98 | 2.39 | 0.82 |
| Data-structure bug | 0.00 | 0.18 | 0.37 | 0.00 |
| Serialization / serializable | 0.00 | 0.00 | 0.00 | 0.00 |
©2025, Sonar
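The “concurrency / threading” category can likewise be made concrete. The illustrative sketch below (again Python for brevity, not benchmark code) shows the classic unsynchronized check-then-act on shared state that static analysis flags, alongside a lock-protected fix.

```python
# Illustrative "concurrency / threading" bug class (not from the
# benchmark corpus): an unsynchronized check-then-act on shared state.
import threading

counts = {}
lock = threading.Lock()

def record_unsafe(key):
    # BUG: another thread can interleave between the membership check
    # and the write, so increments can be lost even though each line
    # looks correct in isolation.
    if key not in counts:
        counts[key] = 0
    counts[key] += 1

def record_safe(key):
    # The lock makes the whole check-then-act sequence atomic.
    with lock:
        counts[key] = counts.get(key, 0) + 1
```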

3. Severity

The data on issue severity reveals one of the most significant trade-offs of the reasoning models: a shift from application-breaking flaws toward a higher volume of less critical issues. This is most evident in the security profile, where the severity of vulnerabilities generated by GPT-5 is fundamentally different from that of other models. As shown below, all four reasoning modes produce a much lower proportion of BLOCKER vulnerabilities than their peers (an average of Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 Vision 90B, and OpenCoder-8B), indicating a successful effort in reducing the most severe security flaws.

Table 7: Vulnerability severity distribution (%)

| LLM model | BLOCKER vulnerabilities (%) | CRITICAL vulnerabilities (%) | MINOR vulnerabilities (%) |
| --- | --- | --- | --- |
| GPT-5-high | 30.14 | 24.66 | 45.21 |
| GPT-5-medium | 28.81 | 33.90 | 35.59 |
| GPT-5-low | 12.24 | 36.73 | 51.02 |
| GPT-5-minimal | 35.00 | 31.67 | 30.00 |
| (Other Models Avg) | ~63% | ~27% | ~8% |
©2025, Sonar

This trend is also reflected in the bug profile. Reasoning is effective at reducing the most severe functional bugs, and the GPT-5 suite consistently outperforms other models in this regard. The high reasoning mode produces the lowest proportion of BLOCKER bugs (~3%), a figure that rises to ~8% for the minimal mode. While the code is far from bug-free, the chance of functional error is clearly reduced with increased reasoning.

Table 8: Bug severity distribution (%)

| LLM model | BLOCKER bugs (%) | CRITICAL bugs (%) |
| --- | --- | --- |
| GPT-5-high | 2.92 | 2.63 |
| GPT-5-medium | 4.80 | 3.20 |
| GPT-5-low | 5.16 | 2.95 |
| GPT-5-minimal | 7.70 | 2.30 |
| (Other Models Avg) | ~10.2 | ~5.8 |
©2025, Sonar

In summary, the issues in the lower-reasoning mode were simply easier to spot because they were more common and straightforward. This means that the higher the reasoning level of the model, the deeper the code review needs to be, and the greater the chance that issues might go unnoticed in a standard code review.

Conclusion: Trust, and verify rigorously

Reasoning is a powerful feature that allows GPT-5 to achieve a new level of functional correctness and security against common attacks. However, it is not a silver bullet. “Trust and verify” is more critical than ever for this new class of models.

From a developer’s standpoint, the danger is complacency. At a glance, the code from higher-reasoning modes will have fewer obvious logical errors and common vulnerabilities. But hidden beneath the surface is a greater volume of complex code saturated with subtle, hard-to-detect issues like concurrency bugs and insecure error handling. For teams with existing codebases, poor maintainability of code presents a significant risk.

Teams adopting GPT-5 will likely see an increase in initial feature velocity, but this will be paid for by a direct and immediate increase in technical debt. Harnessing the power of reasoning models requires a robust governance strategy, centered on rigorous static analysis to identify and manage the complex flaws they create.

