The introduction of sophisticated reasoning capabilities in models like GPT-5 marks a significant evolution in AI code generation. This report provides a deep dive into GPT-5’s four reasoning modes—minimal, low, medium, and high—to understand the impact of increased reasoning on functional correctness, code quality, security, and cost. Our analysis, based on over 4,400 Java tasks, reveals a clear trade-off: while higher reasoning delivers best-in-class functional performance, it achieves this by generating a massive volume of complex and hard-to-maintain code.
This blog post builds on our previous analysis, The Coding Personalities of Leading LLMs—GPT-5 Update, where we evaluated GPT-5’s minimal reasoning mode against other leading models. In this research, we found that reasoning is a powerful tool for improving correctness and security, but not a free one. The medium reasoning mode achieves the highest functional success rate and offers the best balance of performance and cost. Regardless of the setting, however, GPT-5’s code requires rigorous static analysis to manage an immediate increase in technical debt and a new class of subtle, complex flaws that reasoning introduces. The key takeaway: while reasoning reduces common problems in the code, it also creates new, hidden ones that demand greater scrutiny.
A note on methodology
Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code from GPT-5 across its four reasoning modes: minimal, low, medium, and high. Each mode was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval. This analysis is a deep dive into the impact of reasoning and is intended as a follow-up to our previous report, which provided a broader comparison of GPT-5-minimal against other leading LLMs.
Functional performance
Increased reasoning has a positive impact on functional performance, but the returns diminish at the highest, most expensive levels.
Introducing even a little reasoning provides a material boost: the low reasoning mode’s pass rate of ~80% is a five-point jump over the minimal mode’s ~75%. Performance peaks with the medium reasoning mode, which achieved the highest functional success rate in our evaluation (81.96%), narrowly edging out the far more expensive high setting (81.78%). This makes the medium reasoning mode a clear “sweet spot”: for complex tasks where correctness is paramount, it represents the optimal balance of performance and cost.
Table 1: Functional performance by reasoning mode
MultiPL-E benchmarks | GPT-5-high | GPT-5-medium | GPT-5-low | GPT-5-minimal |
HumanEval (158 tasks) | 96.84% | 96.84% | 96.20% | 91.77% |
MBPP (385 tasks) | 75.13% | 75.39% | 73.58% | 68.13% |
Weighted Test Pass@1 Avg | 81.78% | 81.96% | 80.50% | 75.37% |
The cost of reasoning
While functionally superior, the code generated by higher reasoning modes is not necessarily better: it is more verbose, more expensive, and contains more defects per task.
1. Verbosity & complexity
All GPT-5 reasoning modes are more verbose and complex than their predecessor, GPT-4o. As the table below shows, even GPT-5-minimal produces more than double the lines of code of GPT-4o, and verbosity grows further with reasoning. Complexity, too, is consistently higher across all four GPT-5 modes than in non-reasoning alternatives. This indicates a shift toward a more elaborate approach to problem-solving, in which the model adds significantly more Lines of Code (LOC), even between the medium and high settings, without a corresponding increase in functional performance.
Table 2: Code volume and complexity metrics
LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity | Cyclomatic complexity / LoC | Cognitive complexity / LoC |
GPT-5-high | 727,154 | 204,395 | 169,496 | 0.281 | 0.233 |
GPT-5-medium | 611,112 | 171,485 | 138,925 | 0.281 | 0.227 |
GPT-5-low | 561,325 | 154,776 | 119,313 | 0.276 | 0.213 |
GPT-5-minimal | 490,010 | 145,099 | 111,133 | 0.296 | 0.227 |
GPT-4o | 209,994 | 44,387 | 26,450 | 0.211 | 0.126 |
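To make the complexity metrics above concrete, here is an illustrative Java sketch (not drawn from the benchmark corpus; method names are invented for this example). Both methods are functionally identical, but static analyzers such as SonarQube score the nested version’s cognitive complexity higher, because each additional level of nesting adds to the metric. A verbose, nested style at scale is what drives the density figures in the table.

```java
// Illustration only: two behaviorally identical methods with different
// cognitive complexity. Each nesting level in the first version adds to
// the score; the flat guard-clause version keeps the score low.
public class ComplexityDemo {

    // Nested style: deeper nesting inflates cognitive complexity.
    static String classifyNested(int score) {
        if (score >= 0) {
            if (score < 50) {
                return "fail";
            } else {
                if (score < 80) {
                    return "pass";
                } else {
                    return "distinction";
                }
            }
        } else {
            return "invalid";
        }
    }

    // Flat guard-clause style: same behavior, lower cognitive complexity.
    static String classifyFlat(int score) {
        if (score < 0) return "invalid";
        if (score < 50) return "fail";
        if (score < 80) return "pass";
        return "distinction";
    }

    public static void main(String[] args) {
        // Sanity check: the two styles always agree.
        for (int s : new int[]{-1, 0, 49, 50, 79, 80, 100}) {
            if (!classifyNested(s).equals(classifyFlat(s))) {
                throw new AssertionError("mismatch at " + s);
            }
        }
        System.out.println("both styles agree");
    }
}
```

The point is not that nesting is always wrong, but that a model which habitually emits the first style will accumulate a higher complexity density across hundreds of thousands of lines.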
2. The financial cost
The cost of using GPT-5 scales with reasoning, driven by both internal “reasoning tokens” and the volume of verbose code the model generates. Developers and organizations should factor in the additional cost when deciding whether to move up to a higher reasoning setting.
Table 3: Cost per benchmark run
Reasoning mode | Cost per benchmark run |
GPT-5-high | $189 |
GPT-5-medium | $64 |
GPT-5-low | $47 |
GPT-5-minimal | $22 |
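One way to compare these price points is cost per passing task. The sketch below is a back-of-the-envelope calculation, assuming a round benchmark size of 4,400 tasks (the report says “over 4,400”) and using the run costs and pass rates from the tables in this post; the exact task count is an approximation.

```java
// Back-of-the-envelope sketch: run cost divided by the approximate number
// of passing tasks (total tasks x pass rate). The 4,400 task count is an
// assumption; run costs and pass rates come from the tables above.
public class CostPerTask {
    static double costPerPassingTask(double runCostUsd, int totalTasks, double passRate) {
        return runCostUsd / (totalTasks * passRate);
    }

    public static void main(String[] args) {
        int tasks = 4400; // approximate benchmark size
        System.out.printf("high:    $%.4f per passing task%n", costPerPassingTask(189, tasks, 0.8178));
        System.out.printf("medium:  $%.4f per passing task%n", costPerPassingTask(64, tasks, 0.8196));
        System.out.printf("minimal: $%.4f per passing task%n", costPerPassingTask(22, tasks, 0.7537));
    }
}
```

By this rough measure, each passing task from the high mode costs several times more than one from the medium mode, despite a nearly identical pass rate.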
Code quality
As part of this evaluation, we found that the reasoning level does not materially change the code’s complexity density: cognitive complexity per line is nearly identical across the minimal, medium, and high modes (0.227–0.233). This indicates that GPT-5 has an inherently complex coding style, regardless of how much it reasons.
The model also appears to “overthink” as reasoning increases, introducing more issues per passing task: from 3.90 at the minimal setting to 5.50 at high. Because issue density per line remains stable, the greater volume of code at higher reasoning levels translates directly into more absolute issues. This makes higher-reasoning GPT-5 a source of increased tech debt, trading long-term maintainability for short-term velocity.
Table 4: Code quality and issue rates
LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
GPT-5-high | 81.78% | 19,968 | 5.50 |
GPT-5-medium | 81.96% | 16,629 | 4.57 |
GPT-5-low | 80.50% | 13,887 | 3.88 |
GPT-5-minimal | 75.37% | 13,057 | 3.90 |
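The “issues per passing task” column can be roughly reconstructed from the other figures in this post. The sketch below assumes a round benchmark size of 4,400 tasks (an approximation; the report says “over 4,400”), which reproduces the table’s values to within a few hundredths.

```java
// Rough reconstruction of "issues per passing task": total SonarQube issues
// divided by the number of passing tasks (total tasks x pass rate).
// The 4,400 task count is an assumption, so results only approximate the table.
public class IssueDensity {
    static double issuesPerPassingTask(int issues, int totalTasks, double passRate) {
        return issues / (totalTasks * passRate);
    }

    public static void main(String[] args) {
        int tasks = 4400; // approximation
        System.out.printf("high:    %.2f%n", issuesPerPassingTask(19968, tasks, 0.8178)); // table: 5.50
        System.out.printf("minimal: %.2f%n", issuesPerPassingTask(13057, tasks, 0.7537)); // table: 3.90
    }
}
```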
A new risk profile
Another important takeaway is that reasoning shifts the type of flaws generated. It reduces common, obvious issues but replaces them with nuanced ones. Developers may get a false sense of security, as the code appears cleaner on the surface.
1. Security
The data suggests that reasoning at the medium and high levels produces more secure code: as reasoning increases, GPT-5 becomes significantly better at avoiding common, high-risk vulnerabilities. But the improvement is uneven, because it also introduces more complex security issues.
As the table below shows, “path-traversal & injection” flaws are nearly eliminated at higher reasoning levels, as are other common issues like “cryptography misconfiguration.”
However, this security benefit comes at a cost. In place of these well-understood flaws, the higher reasoning modes introduce more subtle, implementation-specific vulnerabilities: the rates of “inadequate I/O error-handling” and “certificate-validation omissions” both rise sharply. This leaves development leaders with a difficult trade-off: reducing the prevalence of common exploits while increasing the risk of nuanced implementation flaws.
Table 5: Vulnerability sub-category distribution (%)
Vulnerability category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
Path-traversal & injection | 0.00 | 1.69 | 0.00 | 20.00 |
Inadequate I/O error-handling | 43.84 | 35.59 | 51.02 | 30.00 |
Cryptography misconfiguration | 6.85 | 10.17 | 24.49 | 23.33 |
Certificate-validation omissions | 15.07 | 22.03 | 8.16 | 8.33 |
Hard-coded credentials | 10.96 | 15.25 | 6.12 | 5.00 |
XML External Entity (XXE) | 16.44 | 11.86 | 6.12 | 10.00 |
JSON-injection risk | 0.00 | 0.00 | 0.00 | 0.00 |
JWT signature not verified | 0.00 | 0.00 | 0.00 | 0.00 |
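The “certificate-validation omissions” category is worth illustrating, since it is one of the flaws that grows with reasoning. The sketch below is illustrative only (it is not taken from the benchmark output): a custom `X509TrustManager` whose check methods are empty, which silently disables TLS server authentication and is exactly the kind of pattern static analysis flags.

```java
// Illustrative only: a "certificate-validation omission". Empty check
// methods mean ANY certificate chain is trusted, defeating TLS.
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

public class TrustDemo {
    // VULNERABLE: the no-op overrides accept every certificate chain.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // no-op: flaw — client certificates are never validated
        }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // no-op: flaw — server certificates are never validated
        }
        @Override public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    public static void main(String[] args) throws Exception {
        // Even an empty (attacker-controlled) chain passes without an exception.
        TRUST_ALL.checkServerTrusted(new X509Certificate[0], "RSA");
        System.out.println("accepted an unvalidated chain: TLS server auth is effectively disabled");
    }
}
```

Code like this often appears in generated solutions as a shortcut to make an HTTPS call “just work”, which is precisely why it can slip past a surface-level review.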
2. Reliability
Reliability presents another difficult trade-off. The data shows a clear pattern: in higher reasoning modes, fewer severe bugs are introduced. While the code generated is not bug-free, the chance of a major issue is reduced as reasoning helps the model avoid fundamental logical errors and common API usage mistakes.
Two trends in the table below stand out. As reasoning increases, the rate of basic “control-flow mistake” bugs is roughly halved from the minimal to the high setting. Inversely, the model's attempts at more complex, multi-threaded solutions lead to an increase in “concurrency / threading” bugs, which nearly double over the same range. This highlights another difficult trade-off: increased reasoning fixes simple logical errors but creates more complex, harder-to-detect ones. The remaining categories are included for completeness but show less pronounced trends, indicating that the primary impact of reasoning is on the core logic and complexity of the solutions.
Table 6: Bug sub-category distribution (%)
Bug category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
Control-flow mistake | 12.57 | 10.85 | 11.60 | 24.26 |
Concurrency / Threading | 38.30 | 35.05 | 27.44 | 20.00 |
API contract violation | 10.53 | 6.23 | 8.47 | 9.18 |
Exception handling | 5.56 | 6.58 | 8.10 | 9.18 |
Resource management / leak | 7.16 | 9.07 | 9.58 | 11.48 |
Type-safety / Casts | 5.56 | 4.27 | 7.92 | 5.25 |
Null / data-value issues | 5.12 | 7.65 | 5.52 | 3.77 |
Performance / structure | 4.82 | 7.12 | 3.87 | 3.77 |
Pattern / regex | 1.61 | 4.98 | 2.39 | 0.82 |
Data-structure bug | 0.00 | 0.18 | 0.37 | 0.00 |
Serialization / serializable | 0.00 | 0.00 | 0.00 | 0.00 |
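The “concurrency / threading” category can also be made concrete. The sketch below is illustrative (not from the benchmark output): a plain `int` counter incremented from several threads loses updates because `n++` is a non-atomic read-modify-write, while an `AtomicInteger` alongside it stays correct. Races like this are easy to miss in review because the code looks reasonable and usually works on small inputs.

```java
// Illustrative "concurrency / threading" bug: a racy int counter loses
// updates under contention; AtomicInteger does not. Lost updates can only
// make the racy total fall short of the expected count, never overshoot it.
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int racy = 0;                                    // bug: unsynchronized shared state
    static final AtomicInteger safe = new AtomicInteger();  // fix: atomic updates

    public static void main(String[] args) {
        final int threads = 4, perThread = 100_000;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    racy++;                 // non-atomic read-modify-write
                    safe.incrementAndGet(); // atomic increment
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        System.out.println("racy=" + racy + " safe=" + safe.get()
                + " expected=" + threads * perThread);
    }
}
```

On most runs the racy total falls short of 400,000 while the atomic total is exact; the intermittent nature of the shortfall is what makes this bug class hard to catch with tests alone.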
3. Severity
The data on issue severity reveals one of the most significant trade-offs of the reasoning models: a shift from application-breaking flaws toward a higher volume of less critical issues. This is most evident in the security profile, where the severity of vulnerabilities generated by GPT-5 differs fundamentally from other models. As shown below, all four reasoning modes produce a much lower proportion of BLOCKER vulnerabilities than their peers (an average of Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 Vision 90B, and OpenCoder-8B), indicating a successful effort in reducing the most severe security flaws.
Table 7: Vulnerability severity distribution (%)
LLM model | BLOCKER vulnerabilities (%) | CRITICAL vulnerabilities (%) | MINOR vulnerabilities (%) |
GPT-5-high | 30.14 | 24.66 | 45.21 |
GPT-5-medium | 28.81 | 33.90 | 35.59 |
GPT-5-low | 12.24 | 36.73 | 51.02 |
GPT-5-minimal | 35.00 | 31.67 | 30.00 |
(Other Models Avg) | ~63% | ~27% | ~8% |
This trend is also reflected in the bug profile. Reasoning is effective at reducing the most severe functional bugs, and the GPT-5 suite consistently outperforms other models in this regard. The high reasoning mode produces the lowest proportion of BLOCKER bugs (~3%), a figure that rises to ~8% for the minimal mode. While the code is far from bug-free, the chance of functional error is clearly reduced with increased reasoning.
Table 8: Bug severity distribution (%)
LLM model | BLOCKER bugs (%) | CRITICAL bugs (%) |
GPT-5-high | 2.92 | 2.63 |
GPT-5-medium | 4.80 | 3.20 |
GPT-5-low | 5.16 | 2.95 |
GPT-5-minimal | 7.70 | 2.30 |
(Other Models Avg) | ~10.2 | ~5.8 |
In summary, the issues in the lower-reasoning modes were simply easier to spot because they were more common and straightforward. The higher the reasoning level, the deeper the review needs to be, and the greater the chance that subtle issues slip past a standard code review.
Conclusion: Trust and verify, rigorously
Reasoning is a powerful feature that allows GPT-5 to achieve a new level of functional correctness and security against common attacks. However, it is not a silver bullet. “Trust and verify” is more critical than ever for this new class of models.
From a developer’s standpoint, the danger is complacency. At a glance, the code from higher-reasoning modes will have fewer obvious logical errors and common vulnerabilities. But hidden beneath the surface is a greater volume of complex code saturated with subtle, hard-to-detect issues like concurrency bugs and insecure error handling. For teams with existing codebases, the poor maintainability of this code presents a significant risk.
Teams adopting GPT-5 will likely see an increase in initial feature velocity, but this will be paid for by a direct and immediate increase in technical debt. Harnessing the power of reasoning models requires a robust governance strategy, centered on rigorous static analysis to identify and manage the complex flaws they create.