Blog post

How Reasoning Impacts LLM Coding Models


Prasenjit Sarkar

Solutions Marketing Manager

10 min read

Table of contents

  • A note on methodology
  • Functional performance
  • The cost of reasoning
  • Code quality
  • A new risk profile

The introduction of sophisticated reasoning capabilities in models like GPT-5 marks a significant evolution in AI code generation. This blog post provides a deep dive into GPT-5’s four reasoning modes—minimal, low, medium, and high—to understand the impact of increased reasoning on functional correctness, code quality, security, and cost. Our analysis, based on over 4,400 Java tasks, reveals a clear trade-off: while higher reasoning delivers best-in-class functional performance, it achieves this by generating a massive volume of complex and hard-to-maintain code.

This post builds upon our previous analysis, The Coding Personalities of Leading LLMs—GPT-5 Update, where we evaluated GPT-5’s minimal reasoning mode against other leading models. In this research, we found that reasoning is a powerful tool for improving correctness and security, but it is not free of trade-offs. The medium reasoning mode achieves the highest functional success rate and offers a good balance of performance and cost. However, regardless of the setting, GPT-5’s code requires rigorous static analysis to manage the immediate increase in technical debt and a new class of subtle, complex flaws that reasoning introduces. The key takeaway: while reasoning reduces common problems in the code, it also creates new, hidden ones that demand greater scrutiny.

A note on methodology

Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code from GPT-5 across its four reasoning modes: minimal, low, medium, and high. Each mode was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval. This analysis is a deep dive into the impact of reasoning and is intended as a follow-up to our previous report, which provided a broader comparison of GPT-5-minimal against other leading LLMs.

Functional performance

Increased reasoning has a positive impact on functional performance, but the returns diminish at the highest, most expensive levels.

Introducing even a little reasoning provides a material boost: the low reasoning mode’s pass rate of ~80% is a clear jump from the minimal mode’s ~75%. Performance peaks with the medium reasoning mode, which achieved the highest functional success rate in our evaluation at 81.96%, slightly outperforming the much more expensive high setting (81.78%). This makes medium reasoning a clear “sweet spot”: for complex tasks where correctness is paramount, it represents the optimal balance of performance and cost.


Table 1: Functional performance by reasoning mode

| MultiPL-E benchmarks | GPT-5-high | GPT-5-medium | GPT-5-low | GPT-5-minimal |
| --- | --- | --- | --- | --- |
| HumanEval (158 tasks) | 96.84% | 96.84% | 96.20% | 91.77% |
| MBPP (385 tasks) | 75.13% | 75.39% | 73.58% | 68.13% |
| Weighted Test Pass@1 Avg | 81.78% | 81.96% | 80.50% | 75.37% |
©2025, Sonar
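The “Weighted Test Pass@1 Avg” row weights each benchmark’s pass rate by its task count. As a hedged sketch (the helper name is ours, not part of any benchmark tooling):

```python
# Sketch: a task-weighted pass@1 average across benchmarks.
# Note: the published averages cover the full task set, so applying
# this to the two visible rows alone will not reproduce them exactly.

def weighted_pass_at_1(benchmarks):
    """benchmarks: list of (task_count, pass_rate_percent) tuples."""
    total_tasks = sum(n for n, _ in benchmarks)
    weighted = sum(n * rate for n, rate in benchmarks)
    return weighted / total_tasks

# HumanEval and MBPP rows for GPT-5-high, weighted by task count:
avg = weighted_pass_at_1([(158, 96.84), (385, 75.13)])
print(f"{avg:.2f}%")  # → 81.45%
```

Over these two rows alone the weighted figure comes to 81.45%, slightly below the published 81.78%, which evidently averages over tasks beyond the two benchmarks shown.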

The cost of reasoning

While functionally superior, the code generated by higher reasoning modes is not necessarily better. It is more verbose, more expensive, and contains more defects for each given task.

1. Verbosity & complexity

All four GPT-5 reasoning modes are more verbose and complex than the model’s predecessor, GPT-4o. As the table below shows, even GPT-5-minimal produces more than double the lines of code of GPT-4o, and verbosity increases further with higher reasoning. Complexity is also consistently higher across all four GPT-5 modes than in non-reasoning alternatives. This indicates a shift toward a more complex approach to problem-solving, where the model adds significantly more Lines of Code (LOC)—even between the medium and high settings—without a corresponding increase in functional performance.

Table 2: Code volume and complexity metrics

| LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity | Cyclomatic complexity / LOC | Cognitive complexity / LOC |
| --- | --- | --- | --- | --- | --- |
| GPT-5-high | 727,154 | 204,395 | 169,496 | 0.281 | 0.233 |
| GPT-5-medium | 611,112 | 171,485 | 138,925 | 0.281 | 0.227 |
| GPT-5-low | 561,325 | 154,776 | 119,313 | 0.276 | 0.213 |
| GPT-5-minimal | 490,010 | 145,099 | 111,133 | 0.296 | 0.227 |
| GPT-4o | 209,994 | 44,387 | 26,450 | 0.211 | 0.126 |
©2025, Sonar
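The density columns in Table 2 are simply the raw complexity totals divided by lines of code. A quick sketch (helper name ours) reproducing the GPT-5-high row:

```python
# Sketch: per-LOC density is the raw complexity total over lines of code.

def density(complexity_total, loc):
    return complexity_total / loc

# GPT-5-high row from Table 2:
cyc = density(204_395, 727_154)  # cyclomatic complexity / LOC
cog = density(169_496, 727_154)  # cognitive complexity / LOC
print(round(cyc, 3), round(cog, 3))  # → 0.281 0.233
```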

2. The financial cost

The cost of using GPT-5 scales with reasoning, driven by both internal “reasoning tokens” and the volume of verbose code the model generates. Developers and organizations should factor in the additional cost when deciding whether to move up to a higher reasoning setting. 

Table 3: Cost per benchmark run

| Reasoning mode | Cost per benchmark run |
| --- | --- |
| GPT-5-high | $189 |
| GPT-5-medium | $64 |
| GPT-5-low | $47 |
| GPT-5-minimal | $22 |
©2025, Sonar
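Why cost scales this way can be sketched with a simple token model: reasoning tokens are typically billed like output tokens, so a higher setting pays for thinking the user never sees, on top of the more verbose code it emits. All rates and token counts below are hypothetical placeholders, not actual published pricing:

```python
# Sketch: how reasoning tokens compound API cost. Rates are per million
# tokens and, like the token counts, are illustrative placeholders only.

def run_cost(input_tokens, output_tokens, reasoning_tokens,
             in_rate, out_rate):
    """Reasoning tokens are typically billed at the output rate."""
    return (input_tokens * in_rate
            + (output_tokens + reasoning_tokens) * out_rate) / 1_000_000

# Same prompt, two settings: the higher setting spends far more
# output-priced tokens before any visible code is produced.
low  = run_cost(2_000, 1_000, 500,   in_rate=1.25, out_rate=10.0)
high = run_cost(2_000, 1_500, 8_000, in_rate=1.25, out_rate=10.0)
print(f"${low:.4f} vs ${high:.4f}")  # → $0.0175 vs $0.0975
```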

Code quality

As part of this evaluation, we found that the reasoning level does not materially change the code’s complexity density: cognitive complexity per line is nearly identical across the minimal, medium, and high modes (0.227–0.233). This indicates that GPT-5 has an inherently complex coding style, independent of how much it reasons.

The model appears to “overthink” as reasoning increases: issues per passing task rise from 3.90 at the minimal setting to 5.50 at high. Because issue density remains stable, the greater volume of code at higher reasoning levels results in more absolute issues. This makes higher-reasoning GPT-5 a source of increased technical debt, trading long-term maintainability for short-term velocity.

Table 4: Code quality and issue rates

| LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
| --- | --- | --- | --- |
| GPT-5-high | 81.78% | 19,968 | 5.50 |
| GPT-5-medium | 81.96% | 16,629 | 4.57 |
| GPT-5-low | 80.50% | 13,887 | 3.88 |
| GPT-5-minimal | 75.37% | 13,057 | 3.90 |
©2025, Sonar
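“Issues per passing task” in Table 4 is the total number of discovered issues divided by the number of functionally passing tasks. A sketch, assuming a total of 4,442 tasks (the post itself only says “over 4,400”):

```python
# Sketch: deriving "issues per passing task" from the table's columns.
# TOTAL_TASKS is an assumption; the post states only "over 4,400 tasks".

TOTAL_TASKS = 4_442

def issues_per_passing_task(issues, pass_rate_pct, total_tasks=TOTAL_TASKS):
    passing = total_tasks * pass_rate_pct / 100
    return issues / passing

# GPT-5-high row: 19,968 issues at an 81.78% pass rate
print(round(issues_per_passing_task(19_968, 81.78), 2))  # → 5.5
```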

A new risk profile

Another important takeaway is that reasoning shifts the type of flaws generated. It reduces common, obvious issues but replaces them with nuanced ones. Developers can be lulled into a false sense of security, as the code appears cleaner on the surface.

1. Security

GPT-5 is optimized for security. Higher reasoning eliminates common, well-understood vulnerability classes such as “path-traversal & injection” (dropping to 0% in the high and low modes). However, these are replaced by subtle, harder-to-detect flaws: the share of vulnerabilities related to “inadequate I/O error-handling” rises sharply, to ~44% in the high reasoning mode versus 30% in the minimal mode.

Table 5: Vulnerability sub-category distribution (%)

| Vulnerability category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
| --- | --- | --- | --- | --- |
| Path-traversal & injection | 0.00 | 1.69 | 0.00 | 20.00 |
| Inadequate I/O error-handling | 43.84 | 35.59 | 51.02 | 30.00 |
©2025, Sonar
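For illustration (in Python rather than the Java used in the benchmark), the “inadequate I/O error-handling” pattern typically looks like a handler that swallows failures instead of distinguishing recoverable from unexpected errors. The function names here are hypothetical:

```python
# Illustrative sketch of inadequate vs. adequate I/O error handling.

import json

def load_config_bad(path):
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}  # all failures silently swallowed; caller cannot tell
                   # a missing file from corrupt JSON or a permission error

def load_config_good(path, default=None):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default  # the one expected, recoverable case
    # other I/O and parse errors propagate to the caller
```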

2. Reliability

A similar trade-off occurs with bugs. As reasoning increases, the rate of fundamental “control-flow mistake” bugs decreases significantly (from ~24% at minimal to ~11-12% at higher levels). However, the percentage of advanced “concurrency / threading” bugs increases from 20% in minimal mode to ~38% in high mode. Higher reasoning attempts more complex, multi-threaded solutions and fails in more advanced ways.

Table 6: Bug sub-category distribution (%)

| Bug category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
| --- | --- | --- | --- | --- |
| Control-flow mistake | 12.57 | 10.85 | 11.60 | 24.26 |
| Concurrency / Threading | 38.30 | 35.05 | 27.44 | 20.00 |
©2025, Sonar
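As an illustration (again in Python rather than Java), the concurrency class of bug is typically an unsynchronized read-modify-write on shared state. The sketch below shows the race-prone pattern alongside the lock-protected fix:

```python
# Illustrative sketch: an unsynchronized increment is a read-modify-write
# race; guarding it with a lock makes the update atomic.

import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1  # race-prone: read, add, write are not atomic

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:  # the lock serializes the read-modify-write
            counter += 1

# Four threads, 10,000 locked increments each:
threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # → 40000 (the unsafe version can silently lose updates)
```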

3. Severity

The data on issue severity reveals one of the most significant trade-offs of the reasoning models: a shift from application-breaking flaws toward a higher volume of less critical issues. This is most evident in the security profile, where the severity of vulnerabilities generated by GPT-5 is fundamentally different from other models. As shown below, all four reasoning modes produce a much lower proportion of BLOCKER vulnerabilities than their peers (an average across Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 Vision 90B, and OpenCoder-8B), indicating a successful effort to reduce the most severe security flaws.

Table 7: Vulnerability severity distribution (%)

| LLM model | BLOCKER vulnerabilities (%) | CRITICAL vulnerabilities (%) | MINOR vulnerabilities (%) |
| --- | --- | --- | --- |
| GPT-5-high | 30.14 | 24.66 | 45.21 |
| GPT-5-medium | 28.81 | 33.90 | 35.59 |
| GPT-5-low | 12.24 | 36.73 | 51.02 |
| GPT-5-minimal | 35.00 | 31.67 | 30.00 |
| (Other models avg) | ~63% | ~27% | ~8% |
©2025, Sonar

This trend is also reflected in the bug profile. Reasoning is effective at reducing the most severe functional bugs, and the GPT-5 suite consistently outperforms other models in this regard. The high reasoning mode produces the lowest proportion of BLOCKER bugs (~3%), a figure that rises to ~8% for the minimal mode. While the code is far from bug-free, the chance of functional error is clearly reduced with increased reasoning.

Table 8: Bug severity distribution (%)

| LLM model | BLOCKER bugs (%) | CRITICAL bugs (%) |
| --- | --- | --- |
| GPT-5-high | 2.92 | 2.63 |
| GPT-5-medium | 4.80 | 3.20 |
| GPT-5-low | 5.16 | 2.95 |
| GPT-5-minimal | 7.70 | 2.30 |
| (Other models avg) | ~10.2 | ~5.8 |
©2025, Sonar

In summary, the severity data paints a clear picture: GPT-5’s reasoning is highly effective at steering the model away from generating high-severity bugs and vulnerabilities. While this reduces the risk of immediate, application-breaking flaws, it comes at a trade-off. The risk profile shifts toward more subtle, lower-severity issues that contribute to the overall complexity and long-term maintenance burden of the code.

Conclusion: Trust, and verify rigorously

Reasoning is a powerful feature that allows GPT-5 to achieve a new level of functional correctness and security against common attacks. However, it is not a silver bullet. “Trust and verify” is more critical than ever for this new class of models.

From a developer’s standpoint, the danger is complacency. At a glance, the code from higher-reasoning modes will have fewer obvious logical errors and common vulnerabilities. But hidden beneath the surface is a greater volume of complex code saturated with subtle, hard-to-detect issues like concurrency bugs and insecure error handling. For teams with existing codebases, the poor maintainability of this code presents a significant risk.

Teams adopting GPT-5 will likely see an increase in initial feature velocity, but this will be paid for by a direct and immediate increase in technical debt. Harnessing the power of reasoning models requires a robust governance strategy, centered on rigorous static analysis to identify and manage the complex flaws they create.
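A governance strategy of this kind can start as simply as a CI quality gate that rejects generated code whose analysis metrics exceed agreed thresholds. The metric names and limits below are illustrative, not SonarQube defaults:

```python
# Sketch of a minimal CI quality gate over static-analysis metrics.
# Threshold keys and limits are illustrative placeholders.

THRESHOLDS = {
    "issues_per_task": 4.0,
    "cognitive_complexity_per_loc": 0.25,
    "blocker_vulnerabilities": 0,
}

def quality_gate(metrics):
    """Return (passed, list of failed metric names)."""
    failures = [name for name, limit in THRESHOLDS.items()
                if metrics.get(name, 0) > limit]
    return (len(failures) == 0, failures)

# Example: metrics resembling the GPT-5-high profile fail the gate.
ok, failed = quality_gate({
    "issues_per_task": 5.5,
    "cognitive_complexity_per_loc": 0.233,
    "blocker_vulnerabilities": 0,
})
print(ok, failed)  # → False ['issues_per_task']
```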
