The introduction of sophisticated reasoning capabilities in models like GPT-5 marks a significant evolution in AI code generation. This blog post provides a deep dive into GPT-5’s four reasoning modes—minimal, low, medium, and high—to understand the impact of increased reasoning on functional correctness, code quality, security, and cost. Our analysis, based on over 4,400 Java tasks, reveals a clear trade-off: while higher reasoning delivers best-in-class functional performance, it achieves this by generating a massive volume of complex and hard-to-maintain code.
This post builds on our previous analysis, The Coding Personalities of Leading LLMs—GPT-5 Update, where we evaluated GPT-5’s minimal reasoning mode against other leading models. Here we find that reasoning is a powerful tool for improving correctness and security, but it comes with trade-offs. Medium reasoning achieves the highest functional success rate and offers the best balance of performance and cost. Regardless of the setting, however, GPT-5’s code requires rigorous static analysis to manage the immediate increase in technical debt and a new class of subtle, complex flaws that reasoning introduces. The key takeaway: while reasoning reduces common problems in the code, it also creates new, hidden ones that demand greater scrutiny.
A note on methodology
Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code from GPT-5 across its four reasoning modes: minimal, low, medium, and high. Each mode was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval. This analysis is a deep dive into the impact of reasoning and is intended as a follow-up to our previous report, which provided a broader comparison of GPT-5-minimal against other leading LLMs.
Functional performance
Increased reasoning has a positive impact on functional performance, but the returns diminish at the highest, most expensive levels.
Introducing even a little reasoning provides a material boost: the low reasoning mode’s pass rate of ~80% is a clear jump from the minimal mode’s ~75%. Performance peaks with the medium reasoning mode, which achieved the highest functional success rate in our evaluation at 81.96%, slightly outperforming the far more expensive high setting (81.78%). This makes medium reasoning a clear “sweet spot”: for most complex tasks where correctness is paramount, it represents the optimal balance of performance and cost.
Table 1: Functional performance by reasoning mode
| MultiPL-E benchmarks | GPT-5-high | GPT-5-medium | GPT-5-low | GPT-5-minimal |
|---|---|---|---|---|
| HumanEval (158 tasks) | 96.84% | 96.84% | 96.20% | 91.77% |
| MBPP (385 tasks) | 75.13% | 75.39% | 73.58% | 68.13% |
| Weighted Test Pass@1 Avg | 81.78% | 81.96% | 80.50% | 75.37% |
The cost of reasoning
While functionally superior, the code generated by higher reasoning modes is not necessarily better: it is more verbose, more expensive, and contains more defects per task.
1. Verbosity & complexity
All GPT-5 reasoning models are more verbose and complex than their predecessor GPT-4o. As the table below shows, even GPT-5-minimal produces more than double the lines of code of GPT-4o, and this verbosity increases with higher reasoning. Furthermore, the complexity is consistently higher across all four GPT-5 modes compared to non-reasoning alternatives. This indicates a shift towards a more complex approach to problem-solving, where the model adds significantly more Lines of Code (LOC)—even between the medium and high settings—without a corresponding increase in functional performance.
Table 2: Code volume and complexity metrics
| LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity | Cyclomatic complexity / LOC | Cognitive complexity / LOC |
|---|---|---|---|---|---|
| GPT-5-high | 727,154 | 204,395 | 169,496 | 0.281 | 0.233 |
| GPT-5-medium | 611,112 | 171,485 | 138,925 | 0.281 | 0.227 |
| GPT-5-low | 561,325 | 154,776 | 119,313 | 0.276 | 0.213 |
| GPT-5-minimal | 490,010 | 145,099 | 111,133 | 0.296 | 0.227 |
| GPT-4o | 209,994 | 44,387 | 26,450 | 0.211 | 0.126 |
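To make the complexity metrics concrete, here is a hypothetical Java illustration (not taken from the benchmark output) of how two functionally identical methods can diverge sharply in cognitive complexity, which, under SonarSource's definition, penalizes branching and especially nesting:

```java
public class ComplexityDemo {
    // Flat version: no branching, cognitive complexity of roughly 0.
    static int clampFlat(int v, int lo, int hi) {
        return Math.max(lo, Math.min(hi, v));
    }

    // Nested version: same behavior, but each `if`/`else` adds an increment
    // and nested structures add a further nesting penalty, so this method
    // scores several points higher for identical functionality.
    static int clampNested(int v, int lo, int hi) {
        if (v < lo) {            // +1
            return lo;
        } else {                 // +1
            if (v > hi) {        // +1, plus a nesting increment
                return hi;
            } else {             // +1
                return v;
            }
        }
    }

    public static void main(String[] args) {
        for (int v : new int[]{-5, 3, 42}) {
            if (clampFlat(v, 0, 10) != clampNested(v, 0, 10)) {
                throw new AssertionError("implementations diverge at " + v);
            }
        }
        System.out.println("both clamps agree"); // prints "both clamps agree"
    }
}
```

Verbose, deeply nested code like the second method is what drives the higher absolute complexity totals in Table 2, even when the underlying logic is simple.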
2. The financial cost
The cost of using GPT-5 scales with reasoning, driven by both internal “reasoning tokens” and the volume of verbose code the model generates. Developers and organizations should factor in the additional cost when deciding whether to move up to a higher reasoning setting.
Table 3: Cost per benchmark run
| Reasoning mode | Cost per benchmark run |
|---|---|
| GPT-5-high | $189 |
| GPT-5-medium | $64 |
| GPT-5-low | $47 |
| GPT-5-minimal | $22 |
Code quality
As part of this evaluation, we found that reasoning levels do not materially impact the code’s complexity density. The Cognitive Complexity Density is nearly identical across the minimal, medium, and high modes (0.227-0.233). This indicates that GPT-5 has an inherently complex coding style.
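The density figures are simple ratios of the totals in Table 2; a quick computation (our own arithmetic check, not part of the benchmark) reproduces them:

```java
public class DensityCheck {
    public static void main(String[] args) {
        // Cognitive complexity density = cognitive complexity / LOC (Table 2 totals).
        double highDensity    = 169_496.0 / 727_154.0;
        double minimalDensity = 111_133.0 / 490_010.0;
        System.out.printf("high:    %.3f%n", highDensity);    // prints "high:    0.233"
        System.out.printf("minimal: %.3f%n", minimalDensity); // prints "minimal: 0.227"
    }
}
```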
The model appears to “overthink” the answer as reasoning increases, introducing more issues per passing task: from 3.90 at the minimal setting to 5.50 at the high setting. Because issue density remains stable, the greater volume of code at higher reasoning levels translates into more absolute issues. This makes higher-reasoning GPT-5 a source of increased technical debt, trading long-term maintainability for short-term velocity.
Table 4: Code quality and issue rates
| LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
|---|---|---|---|
| GPT-5-high | 81.78% | 19,968 | 5.50 |
| GPT-5-medium | 81.96% | 16,629 | 4.57 |
| GPT-5-low | 80.50% | 13,887 | 3.88 |
| GPT-5-minimal | 75.37% | 13,057 | 3.90 |
A new risk profile
Another important takeaway is that reasoning shifts the type of flaws generated. It reduces common, obvious issues but replaces them with more nuanced ones. Developers can be lulled into a false sense of security, as the code appears cleaner on the surface.
1. Security
Higher reasoning hardens GPT-5 against common, well-understood attacks, eliminating “path-traversal & injection” vulnerabilities entirely (0% in the high and low modes). However, these are replaced by subtle, harder-to-detect flaws: the share of vulnerabilities related to “inadequate I/O error-handling” climbs to 44% in the high reasoning mode, versus 30% in the minimal reasoning mode.
Table 5: Vulnerability sub-category distribution (%)
| Vulnerability category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
|---|---|---|---|---|
| Path-traversal & injection | 0.00 | 1.69 | 0.00 | 20.00 |
| Inadequate I/O error-handling | 43.84 | 35.59 | 51.02 | 30.00 |
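The shift from one flaw class to the other can be sketched in a single hypothetical method (illustrative code, not taken from the benchmark output): the path-traversal check is handled correctly, while the I/O error handling is the kind of silent failure that static analysis flags:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadUserFile {
    // Path traversal handled: the resolved path must stay inside baseDir.
    static String read(Path baseDir, String userSupplied) {
        Path resolved = baseDir.resolve(userSupplied).normalize();
        if (!resolved.startsWith(baseDir.normalize())) {
            throw new IllegalArgumentException("path escapes base directory");
        }
        // Inadequate I/O error handling: the IOException is swallowed and an
        // empty string returned, so callers cannot tell a missing or
        // unreadable file from a genuinely empty one.
        try {
            return new String(Files.readAllBytes(resolved));
        } catch (IOException e) {
            return ""; // silent failure: the subtler flaw class
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("demo");
        Files.write(base.resolve("a.txt"), "hello".getBytes());
        System.out.println(read(base, "a.txt"));                  // prints "hello"
        System.out.println(read(base, "missing.txt").isEmpty());  // prints "true" (error hidden)
        try {
            read(base, "../outside.txt");
        } catch (IllegalArgumentException e) {
            System.out.println("traversal blocked");              // prints "traversal blocked"
        }
    }
}
```

The first flaw is obvious in a code review; the second passes a happy-path test suite and only surfaces in production, which is why this shift demands extra scrutiny.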
2. Reliability
A similar trade-off occurs with bugs. As reasoning increases, the rate of fundamental “control-flow mistake” bugs decreases significantly (from ~24% at minimal to ~11-12% at higher levels). However, the percentage of advanced “concurrency / threading” bugs increases from 20% in minimal mode to ~38% in high mode. Higher reasoning attempts more complex, multi-threaded solutions and fails in more advanced ways.
Table 6: Bug sub-category distribution (%)
| Bug category | GPT-5-high (%) | GPT-5-medium (%) | GPT-5-low (%) | GPT-5-minimal (%) |
|---|---|---|---|---|
| Control-flow mistake | 12.57 | 10.85 | 11.60 | 24.26 |
| Concurrency / Threading | 38.30 | 35.05 | 27.44 | 20.00 |
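As a hypothetical illustration of this flaw class (again, not code from the benchmark), a non-atomic read-modify-write is exactly the kind of subtle concurrency bug that compiles cleanly and passes single-threaded tests:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    // The bug: `unsafeCount++` is a read-modify-write of shared state with no
    // synchronization, so concurrent increments can be lost.
    static int unsafeCount = 0;
    static final AtomicInteger safeCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    unsafeCount++;               // races under contention
                    safeCount.incrementAndGet(); // atomic, always correct
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println("unsafe: " + unsafeCount);     // typically below 800000
        System.out.println("safe:   " + safeCount.get()); // always 800000
    }
}
```

A static analyzer flags the unsynchronized field access immediately; a functional benchmark that runs the code single-threaded never will, which is why these bugs count as "passing" in Table 1 while still appearing in Table 6.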
3. Severity
The data on issue severity reveals one of the most significant trade-offs of the reasoning models: a shift from application-breaking flaws toward a higher volume of less critical issues. This is most evident in the security profile, where the severity of vulnerabilities generated by GPT-5 is fundamentally different from other models. As shown below, all four reasoning modes produce a much lower proportion of BLOCKER vulnerabilities than their peers (the average of Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 Vision 90B, and OpenCoder-8B), indicating a successful effort to reduce the most severe security flaws.
Table 7: Vulnerability severity distribution (%)
| LLM model | BLOCKER vulnerabilities (%) | CRITICAL vulnerabilities (%) | MINOR vulnerabilities (%) |
|---|---|---|---|
| GPT-5-high | 30.14 | 24.66 | 45.21 |
| GPT-5-medium | 28.81 | 33.90 | 35.59 |
| GPT-5-low | 12.24 | 36.73 | 51.02 |
| GPT-5-minimal | 35.00 | 31.67 | 30.00 |
| (Other models avg) | ~63 | ~27 | ~8 |
This trend is also reflected in the bug profile. Reasoning is effective at reducing the most severe functional bugs, and the GPT-5 suite consistently outperforms other models in this regard. The high reasoning mode produces the lowest proportion of BLOCKER bugs (~3%), a figure that rises to ~8% for the minimal mode. While the code is far from bug-free, the chance of functional error is clearly reduced with increased reasoning.
Table 8: Bug severity distribution (%)
| LLM model | BLOCKER bugs (%) | CRITICAL bugs (%) |
|---|---|---|
| GPT-5-high | 2.92 | 2.63 |
| GPT-5-medium | 4.80 | 3.20 |
| GPT-5-low | 5.16 | 2.95 |
| GPT-5-minimal | 7.70 | 2.30 |
| (Other models avg) | ~10.2 | ~5.8 |
In summary, the severity data paints a clear picture: GPT-5’s reasoning is highly effective at steering the model away from generating high-severity bugs and vulnerabilities. While this reduces the risk of immediate, application-breaking flaws, it involves a trade-off: the risk profile shifts toward more subtle, lower-severity issues that add to the overall complexity and long-term maintenance burden of the code.
Conclusion: Trust, and verify rigorously
Reasoning is a powerful feature that allows GPT-5 to achieve a new level of functional correctness and security against common attacks. However, it is not a silver bullet. “Trust and verify” is more critical than ever for this new class of models.
From a developer’s standpoint, the danger is complacency. At a glance, the code from higher-reasoning modes will have fewer obvious logical errors and common vulnerabilities. But hidden beneath the surface is a greater volume of complex code saturated with subtle, hard-to-detect issues like concurrency bugs and insecure error handling. For teams maintaining existing codebases, the poor maintainability of this code presents a significant risk.
Teams adopting GPT-5 will likely see an increase in initial feature velocity, but this will be paid for by a direct and immediate increase in technical debt. Harnessing the power of reasoning models requires a robust governance strategy, centered on rigorous static analysis to identify and manage the complex flaws they create.