In our previous report, “The Coding Personalities of Leading LLMs,” we revealed the shared strengths and flaws of some of the most popular LLMs, while also uncovering distinct coding “personalities” for each model.
GPT-5’s arrival on the scene adds an important new dimension to the landscape, so we have updated our analysis to include it. For an apples-to-apples comparison, we evaluated GPT-5 with minimal reasoning against Anthropic's Claude Sonnet 4 and 3.7, OpenAI's GPT-4o, Meta's Llama 3.2 90B, and the open-source OpenCoder-8B.
Bottom line: GPT-5 with minimal reasoning does not unseat Claude Sonnet 4 as the performance leader. It outperforms every other model we tested, but it has lower functional performance than Claude Sonnet 4 while generating more verbose, complex, and issue-prone code. Claude Sonnet 4 remains the leader among the non-reasoning models, in both functional performance and code quality.
A note on methodology
Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code generated by six leading LLMs, including the latest GPT-5 model from OpenAI. Each model was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval.
For this evaluation, we analyzed “GPT-5-minimal,” which operates at the model’s lowest reasoning level, to have a fair comparison with other models like Claude Sonnet 4 that have reasoning disabled by default. Reasoning adds a number of dimensions to this analysis, which we will explore in future work.
Functional performance
The first dimension of any model’s personality is its raw functional skill. On this front, GPT-5-minimal establishes itself as a highly competitive, top-tier performer with a weighted pass average of ~75%, second only to Claude Sonnet 4.
Table 1: Functional performance on MultiPL-E Java benchmarks
MultiPL-E Benchmarks | GPT-5-minimal | Claude Sonnet 4 | Claude 3.7 Sonnet | GPT-4o | Llama 3.2 Vision 90B | OpenCoder-8B
HumanEval (158 tasks) | 91.77% | 95.57% | 84.28% | 73.42% | 61.64% | 64.36%
MBPP (385 tasks) | 68.13% | 69.43% | 67.62% | 68.13% | 61.40% | 58.81%
Weighted Test Pass@1 Avg | 75.37% | 77.04% | 72.46% | 69.67% | 61.47% | 60.43%
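The weighted average in the last row can be reproduced by weighting each benchmark's pass rate by its task count (158 HumanEval tasks, 385 MBPP tasks). A minimal sketch, using the Claude Sonnet 4 row from Table 1 as the worked example:

```python
# Weighted Test Pass@1 average: per-benchmark pass rates weighted by task count.
# Task counts come from the benchmark names in Table 1.
TASKS = {"HumanEval": 158, "MBPP": 385}

def weighted_pass_at_1(scores: dict) -> float:
    """Task-count-weighted average of per-benchmark pass rates (in %)."""
    total_tasks = sum(TASKS.values())
    return sum(scores[bench] * n for bench, n in TASKS.items()) / total_tasks

claude_sonnet_4 = {"HumanEval": 95.57, "MBPP": 69.43}
print(round(weighted_pass_at_1(claude_sonnet_4), 2))  # 77.04, matching Table 1
```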
The cost of performance: Extreme verbosity and complexity
We have previously seen that models that do well functionally tend to generate more lines of code per completed task. GPT-5 breaks this trend; despite not being the top performer, GPT-5-minimal generates a substantially larger and more complex volume of code than any other model, including Claude Sonnet 4.
Table 2: Code volume and complexity metrics
LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity |
GPT-5-minimal | 490,010 | 145,099 | 111,133 |
Claude Sonnet 4 | 370,816 | 81,667 | 47,649 |
Claude 3.7 Sonnet | 288,126 | 55,485 | 42,220 |
GPT-4o | 209,994 | 44,387 | 26,450 |
Llama 3.2 Vision 90B | 196,927 | 37,948 | 20,811 |
OpenCoder-8B | 120,288 | 18,850 | 13,965 |
GPT-5-minimal produced 490,010 lines of code, over 30% more than the top-performing Claude Sonnet 4, and its output shows dramatically higher cyclomatic and cognitive complexity. Developers who need to review code generated by GPT-5-minimal face a tough challenge.
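To make the cognitive-complexity concern concrete, here is an illustrative pair of behaviorally equivalent Python functions (a hypothetical example, not benchmark output). Sonar-style cognitive complexity penalizes each additional nesting level, so the nested version scores roughly twice as high as the flat one:

```python
# Two behaviorally equivalent functions; the nested one scores higher on
# Sonar-style cognitive complexity because each nesting level adds a penalty.

def grade_nested(score):
    # if (+1), nested if (+2), doubly nested if (+3): roughly 6 total
    if score is not None:
        if score >= 0:
            if score >= 90:
                return "A"
            return "B"
    return "invalid"

def grade_flat(score):
    # three guard clauses at the top level (+1 each): roughly 3 total
    if score is None:
        return "invalid"
    if score < 0:
        return "invalid"
    if score >= 90:
        return "A"
    return "B"
```

The flat version is the kind of refactoring that static analysis would suggest for the complexity hotspots reported above.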
Deep dive into code quality
Compounding these challenges, we find that the code from GPT-5-minimal has a much higher density of issues relative to the tasks it solves.
Table 3 highlights the “Issues per passing task” for each model. GPT-5-minimal introduces 3.90 issues for every correct solution, nearly double the rate of the more concise and higher-performing Claude Sonnet 4. For every task it completes successfully, it introduces significantly more potential defects than its competitors, creating a large downstream technical debt, quality, security, and verification burden.
Table 3: Overall code quality and issue rates
LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
GPT-5-minimal | 75.37% | 13,057 | 3.90 |
Claude Sonnet 4 | 77.04% | 7,225 | 2.11 |
Claude 3.7 Sonnet | 72.46% | 6,576 | 2.04 |
GPT-4o | 69.67% | 5,476 | 1.77 |
Llama 3.2 Vision 90B | 61.47% | 5,159 | 1.89 |
OpenCoder-8B | 60.43% | 3,903 | 1.45 |
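The “Issues per passing task” column can be reproduced from the other two columns of Table 3 once the total task count is fixed. A minimal sketch, assuming a total of 4,442 tasks (the figure implied by the table, consistent with the methodology note's “over 4,400 unique Java assignments”):

```python
# "Issues per passing task" = total SonarQube issues / number of passing tasks.
# TOTAL_TASKS is an assumption implied by Table 3 (the methodology note says
# "over 4,400 unique Java assignments"); the other inputs are Table 3 values.
TOTAL_TASKS = 4_442

def issues_per_passing_task(pass_pct: float, issues: int) -> float:
    passing_tasks = TOTAL_TASKS * pass_pct / 100
    return issues / passing_tasks

print(round(issues_per_passing_task(75.37, 13_057), 2))  # GPT-5-minimal: 3.9
print(round(issues_per_passing_task(77.04, 7_225), 2))   # Claude Sonnet 4: 2.11
```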
GPT-5-minimal produces the lowest density of vulnerabilities by a wide margin (0.12 per KLOC) and a relatively low bug density. However, this strength is offset by a much higher density of code smells (25.28 per KLOC), pointing to a primary weakness in code quality and maintainability.
Table 4: Issue density by type (per KLOC)
LLM model | Bug density (Bugs/KLOC) | Vulnerability density (Vuln./KLOC) | Code smell density (Smells/KLOC) |
GPT-5-minimal | 1.24 | 0.12 | 25.28 |
Claude Sonnet 4 | 1.14 | 0.38 | 17.96 |
Claude 3.7 Sonnet | 1.22 | 0.40 | 21.20 |
GPT-4o | 1.93 | 0.53 | 23.61 |
Llama 3.2 Vision 90B | 2.02 | 0.62 | 23.55 |
OpenCoder-8B | 2.05 | 0.56 | 29.84 |
While a low density can be misleading if a model is simply more verbose, the absolute vulnerability counts confirm this is not the case here. With only 60 total vulnerabilities generated, GPT-5-minimal's security focus holds up on both a relative and an absolute basis.
Table 5: Absolute vulnerability counts
LLM model | Total vulnerabilities generated |
GPT-5-minimal | 60 |
Claude Sonnet 4 | 141 |
Claude 3.7 Sonnet | 116 |
GPT-4o | 112 |
Llama 3.2 Vision 90B | 123 |
OpenCoder-8B | 67 |
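The density figures in Table 4 follow directly from Tables 2 and 5: vulnerabilities divided by thousands of lines of code. A quick sketch confirming the two rows discussed above:

```python
# Vulnerability density (per KLOC) = total vulnerabilities / (LOC / 1000).
# LOC figures are from Table 2; vulnerability counts are from Table 5.
def vuln_density(total_vulns: int, loc: int) -> float:
    return total_vulns / (loc / 1000)

print(round(vuln_density(60, 490_010), 2))   # GPT-5-minimal: 0.12
print(round(vuln_density(141, 370_816), 2))  # Claude Sonnet 4: 0.38
```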
The sections below examine the coding personality of GPT-5-minimal in more detail.
Security
GPT-5-minimal’s strongest trait is its focus on security, which is evident across multiple metrics. Its issues are far less likely to be security-related; only 0.46% of its total discovered issues are vulnerabilities, a fraction of the rate for other models. Furthermore, it produces the lowest density of vulnerabilities of any model tested—just 0.12 per 1,000 lines of code (KLOC).
However, there are caveats. The model shows a tendency to reintroduce classic security flaws that are less common in other models: path-traversal and injection flaws make up 20% of its total security vulnerabilities, indicating a different and more fundamental risk profile.
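For readers unfamiliar with the class of flaw, here is a minimal, hypothetical Python illustration of path traversal (not output from any of the evaluated models): a user-supplied filename like `../../etc/passwd` escapes the intended directory unless the resolved path is checked.

```python
# Illustrative path-traversal flaw and mitigation (hypothetical example).
import os

BASE_DIR = "/var/app/uploads"

def read_upload_unsafe(name: str) -> str:
    # FLAW: the joined path is never validated, so name="../../etc/passwd"
    # resolves outside BASE_DIR.
    with open(os.path.join(BASE_DIR, name)) as f:
        return f.read()

def read_upload_safe(name: str) -> str:
    # Mitigation: resolve the path, then reject anything that escapes BASE_DIR.
    path = os.path.realpath(os.path.join(BASE_DIR, name))
    if not path.startswith(BASE_DIR + os.sep):
        raise ValueError("path escapes upload directory")
    with open(path) as f:
        return f.read()
```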
Maintainability
This model’s strong security performance is offset by weaker code quality and maintainability.
- Code smell density: Its code is inherently less maintainable, with a high density of ~25 code smells per 1,000 lines of code.
- Complex issues: The core issue is the code’s intricacy. The solutions generated by GPT-5-minimal result in the highest percentage of code smells related to “Cognitive/computational complexity” (~12%) among all evaluated models. This tendency to produce overly complex code directly creates long-term technical debt, making it difficult to understand and maintain in the future.
Reliability
This model demonstrates a higher rate of foundational logical errors compared to its peers. “Control-flow mistake” bugs dominate, accounting for roughly 24% of its total functional bugs. This shows that while the model can produce functionally correct code, it is also prone to basic logical errors.
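As a hypothetical illustration of this bug class (not actual model output), one common control-flow mistake is an early return inside a loop that turns a search over all items into a check of only the first one:

```python
# Illustrative "control-flow mistake" (hypothetical example): the else-branch
# return exits the loop on the first non-match instead of continuing.

def contains_buggy(items, target):
    for item in items:
        if item == target:
            return True
        else:
            return False  # BUG: gives up after inspecting only the first item

def contains_fixed(items, target):
    for item in items:
        if item == target:
            return True
    return False  # only conclude "absent" after checking every item
```

Bugs like this pass the happy-path test (target in first position) while failing everywhere else, which is why they survive into otherwise "functionally correct" solutions.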
Conclusion: Trust and verify
GPT-5 is undeniably a powerful new force in AI code generation. However, this analysis of its minimal reasoning mode shows that progress is not linear. It reveals a model that, while functionally proficient, carries a significant quality cost and presents a different profile of security and reliability considerations.
This makes the “trust and verify” mandate more critical than ever. To leverage this model's power, organizations must evolve their governance strategies:
- Manage the complexity: Its code is a prime candidate for refactoring. Static analysis is essential to identify the critical code smells and high-complexity methods that will quickly become unmaintainable.
- Scrutinize for advanced flaws: Code reviewers must be vigilant for this model's specific tendencies, including the re-emergence of classic vulnerabilities like path-traversal and a higher rate of fundamental logic errors.
This analysis shows that as AI models evolve, their flaw profiles become more nuanced. Harnessing their potential requires an equally sophisticated and adaptable approach to governance.