In our previous report, “The Coding Personalities of Leading LLMs,” we revealed the shared strengths and flaws of some of the most popular LLMs, while also uncovering distinct coding “personalities” for each model.
GPT-5’s arrival on the scene adds an important new dimension to the landscape, so we have updated our analysis to include it. For an apples-to-apples comparison, we evaluated GPT-5 with minimal reasoning against Anthropic's Claude Sonnet 4 and 3.7, OpenAI's GPT-4o, Meta's Llama 3.2 90B, and the open-source OpenCoder-8B.
Bottom line: GPT-5 with minimal reasoning does not unseat Claude Sonnet 4 as the performance leader. It outperforms every other model we tested, but it has lower functional performance than Claude Sonnet 4 while generating more verbose, complex, and issue-prone code. Claude Sonnet 4 remains the leader among the non-reasoning models, in both functional performance and code quality.
A note on methodology
Using the SonarQube Enterprise static analysis engine, Sonar has now evaluated code generated by six leading LLMs, including the latest GPT-5 model from OpenAI. Each model was tested against over 4,400 unique Java assignments from recognized benchmarks like MultiPL-E and ComplexCodeEval.
For this evaluation, we analyzed “GPT-5-minimal,” which operates at the model’s lowest reasoning level, to have a fair comparison with other models like Claude Sonnet 4 that have reasoning disabled by default. Reasoning adds a number of dimensions to this analysis, which we will explore in future work.
Functional performance
The first dimension of any model’s personality is its raw functional skill. On this front, GPT-5-minimal establishes itself as a highly competitive, top-tier performer with a weighted pass average of ~75%, second only to Claude Sonnet 4.
Table 1: Functional performance on MultiPL-E Java benchmarks
MultiPL-E Benchmarks | GPT-5-minimal | Claude Sonnet 4 | Claude 3.7 Sonnet | GPT-4o | Llama 3.2 Vision 90B | OpenCoder-8B
HumanEval (158 tasks) | 91.77% | 95.57% | 84.28% | 73.42% | 61.64% | 64.36%
MBPP (385 tasks) | 68.13% | 69.43% | 67.62% | 68.13% | 61.40% | 58.81%
Weighted Test Pass@1 Avg | 75.37% | 77.04% | 72.46% | 69.67% | 61.47% | 60.43%
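The weighted average in the last row can be reproduced by weighting each benchmark's pass rate by its task count (158 HumanEval tasks, 385 MBPP tasks). A minimal sketch, using the Claude Sonnet 4 row from Table 1 as the worked example:

```python
# Weighted Test Pass@1 average: per-benchmark pass rates weighted by task count.
# Task counts come from the benchmark names in Table 1.
TASKS = {"HumanEval": 158, "MBPP": 385}

def weighted_pass_at_1(scores: dict) -> float:
    """Task-count-weighted average of per-benchmark pass rates (in %)."""
    total_tasks = sum(TASKS.values())
    return sum(scores[bench] * n for bench, n in TASKS.items()) / total_tasks

claude_sonnet_4 = {"HumanEval": 95.57, "MBPP": 69.43}
print(round(weighted_pass_at_1(claude_sonnet_4), 2))  # 77.04, matching Table 1
```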
The cost of performance: Extreme verbosity and complexity
We have previously seen that models that do well functionally tend to generate more lines of code per completed task. GPT-5 breaks this trend; despite not being the top performer, GPT-5-minimal generates a substantially larger and more complex volume of code than any other model, including Claude Sonnet 4.
Table 2: Code volume and complexity metrics
LLM model | Lines of code (LOC) | Cyclomatic complexity | Cognitive complexity |
GPT-5-minimal | 490,010 | 145,099 | 111,133 |
Claude Sonnet 4 | 370,816 | 81,667 | 47,649 |
Claude 3.7 Sonnet | 288,126 | 55,485 | 42,220 |
GPT-4o | 209,994 | 44,387 | 26,450 |
Llama 3.2 Vision 90B | 196,927 | 37,948 | 20,811 |
OpenCoder-8B | 120,288 | 18,850 | 13,965 |
GPT-5-minimal produced 490,010 lines of code, over 30% more than the top-performing Claude Sonnet 4, and its output shows dramatically higher cyclomatic and cognitive complexity. Developers who need to review code generated by GPT-5-minimal face a tough challenge.
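To make the cognitive-complexity concern concrete, here is an illustrative pair of behaviorally equivalent Python functions (a hypothetical example, not benchmark output). Sonar-style cognitive complexity penalizes each additional nesting level, so the nested version scores roughly twice as high as the flat one:

```python
# Two behaviorally equivalent functions; the nested one scores higher on
# Sonar-style cognitive complexity because each nesting level adds a penalty.

def grade_nested(score):
    # if (+1), nested if (+2), doubly nested if (+3): roughly 6 total
    if score is not None:
        if score >= 0:
            if score >= 90:
                return "A"
            return "B"
    return "invalid"

def grade_flat(score):
    # three guard clauses at the top level (+1 each): roughly 3 total
    if score is None:
        return "invalid"
    if score < 0:
        return "invalid"
    if score >= 90:
        return "A"
    return "B"
```

The flat version is the kind of refactoring that static analysis would suggest for the complexity hotspots reported above.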
Deep dive into code quality
Compounding these challenges, we find that the code from GPT-5-minimal has a much higher density of issues relative to the tasks it solves.
Table 3 highlights the “Issues per passing task” for each model. GPT-5-minimal introduces 3.90 issues for every correct solution, nearly double the rate of the more concise and higher-performing Claude Sonnet 4. For every task it completes successfully, it introduces significantly more potential defects than its competitors, creating a large downstream technical debt, quality, security, and verification burden.
Table 3: Overall code quality and issue rates
LLM model | Passing tests % | SonarQube discovered issues | Issues per passing task |
GPT-5-minimal | 75.37% | 13,057 | 3.90 |
Claude Sonnet 4 | 77.04% | 7,225 | 2.11 |
Claude 3.7 Sonnet | 72.46% | 6,576 | 2.04 |
GPT-4o | 69.67% | 5,476 | 1.77 |
Llama 3.2 Vision 90B | 61.47% | 5,159 | 1.89 |
OpenCoder-8B | 60.43% | 3,903 | 1.45 |
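The “Issues per passing task” column can be reproduced from the other two columns of Table 3 once the total task count is fixed. A minimal sketch, assuming a total of 4,442 tasks (the figure implied by the table, consistent with the methodology note's “over 4,400 unique Java assignments”):

```python
# "Issues per passing task" = total SonarQube issues / number of passing tasks.
# TOTAL_TASKS is an assumption implied by Table 3 (the methodology note says
# "over 4,400 unique Java assignments"); the other inputs are Table 3 values.
TOTAL_TASKS = 4_442

def issues_per_passing_task(pass_pct: float, issues: int) -> float:
    passing_tasks = TOTAL_TASKS * pass_pct / 100
    return issues / passing_tasks

print(round(issues_per_passing_task(75.37, 13_057), 2))  # GPT-5-minimal: 3.9
print(round(issues_per_passing_task(77.04, 7_225), 2))   # Claude Sonnet 4: 2.11
```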
GPT-5-minimal produces the lowest density of vulnerabilities by a wide margin (0.12 per KLOC) and a relatively low bug density. However, this strength is offset by a much higher density of code smells (25.28 per KLOC), pointing to a primary weakness in code quality and maintainability.
Table 4: Issue density by type (per KLOC)
LLM model | Bug density (Bugs/KLOC) | Vulnerability density (Vuln./KLOC) | Code smell density (Smells/KLOC) |
GPT-5-minimal | 1.24 | 0.12 | 25.28 |
Claude Sonnet 4 | 1.14 | 0.38 | 17.96 |
Claude 3.7 Sonnet | 1.22 | 0.40 | 21.20 |
GPT-4o | 1.93 | 0.53 | 23.61 |
Llama 3.2 Vision 90B | 2.02 | 0.62 | 23.55 |
OpenCoder-8B | 2.05 | 0.56 | 29.84 |
While a low density can be misleading if a model is simply more verbose, the absolute vulnerability counts confirm this is not the case here. With only 60 total vulnerabilities generated, GPT-5-minimal's security focus holds up on both a relative and an absolute basis.
Table 5: Absolute vulnerability counts
LLM model | Total vulnerabilities generated |
GPT-5-minimal | 60 |
Claude Sonnet 4 | 141 |
Claude 3.7 Sonnet | 116 |
GPT-4o | 112 |
Llama 3.2 Vision 90B | 123 |
OpenCoder-8B | 67 |
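The density figures in Table 4 follow directly from Tables 2 and 5: vulnerabilities divided by thousands of lines of code. A quick sketch confirming the two rows discussed above:

```python
# Vulnerability density (per KLOC) = total vulnerabilities / (LOC / 1000).
# LOC figures are from Table 2; vulnerability counts are from Table 5.
def vuln_density(total_vulns: int, loc: int) -> float:
    return total_vulns / (loc / 1000)

print(round(vuln_density(60, 490_010), 2))   # GPT-5-minimal: 0.12
print(round(vuln_density(141, 370_816), 2))  # Claude Sonnet 4: 0.38
```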
The sections below examine the coding personality of GPT-5-minimal in more detail.
Security
GPT-5-minimal’s strongest trait is its focus on security, which is evident across multiple metrics. Its issues are far less likely to be security-related; only 0.46% of its total discovered issues are vulnerabilities, a fraction of the rate for other models. Furthermore, it produces the lowest density of vulnerabilities of any model tested—just 0.12 per 1,000 lines of code (KLOC).
However, there are caveats. The model shows a tendency to reintroduce classic security flaws that are less common in other models: path-traversal and injection flaws make up 20% of its total security vulnerabilities, indicating a different and more fundamental risk profile.
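For readers unfamiliar with the class of flaw, here is a minimal, hypothetical Python illustration of path traversal (not output from any of the evaluated models): a user-supplied filename like `../../etc/passwd` escapes the intended directory unless the resolved path is checked.

```python
# Illustrative path-traversal flaw and mitigation (hypothetical example).
import os

BASE_DIR = "/var/app/uploads"

def read_upload_unsafe(name: str) -> str:
    # FLAW: the joined path is never validated, so name="../../etc/passwd"
    # resolves outside BASE_DIR.
    with open(os.path.join(BASE_DIR, name)) as f:
        return f.read()

def read_upload_safe(name: str) -> str:
    # Mitigation: resolve the path, then reject anything that escapes BASE_DIR.
    path = os.path.realpath(os.path.join(BASE_DIR, name))
    if not path.startswith(BASE_DIR + os.sep):
        raise ValueError("path escapes upload directory")
    with open(path) as f:
        return f.read()
```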
Maintainability
This model’s strong security performance is offset by weaker code quality and maintainability.
- Code smell density: Its code is inherently less maintainable, with a high density of ~25 code smells per 1,000 lines of code.
- Complex issues: The core issue is the code’s intricacy. The solutions generated by GPT-5-minimal result in the highest percentage of code smells related to “Cognitive/computational complexity” (~12%) among all evaluated models. This tendency to produce overly complex code directly creates long-term technical debt, making it difficult to understand and maintain in the future.
Reliability
This model demonstrates a higher rate of foundational logical errors compared to its peers. “Control-flow mistake” bugs dominate, accounting for roughly 24% of its total functional bugs. This shows that while the model can produce functionally correct code, it is also prone to basic logical errors.
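As a hypothetical illustration of this bug class (not actual model output), one common control-flow mistake is an early return inside a loop that turns a search over all items into a check of only the first one:

```python
# Illustrative "control-flow mistake" (hypothetical example): the else-branch
# return exits the loop on the first non-match instead of continuing.

def contains_buggy(items, target):
    for item in items:
        if item == target:
            return True
        else:
            return False  # BUG: gives up after inspecting only the first item

def contains_fixed(items, target):
    for item in items:
        if item == target:
            return True
    return False  # only conclude "absent" after checking every item
```

Bugs like this pass the happy-path test (target in first position) while failing everywhere else, which is why they survive into otherwise "functionally correct" solutions.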
Conclusion: Trust and verify
GPT-5 is undeniably a powerful new force in AI code generation. However, this analysis of its minimal reasoning mode shows that progress is not linear. It reveals a model that, while functionally proficient, carries a significant quality cost and presents a different profile of security and reliability considerations.
This makes the “trust and verify” mandate more critical than ever. To leverage this model's power, organizations must evolve their governance strategies:
- Manage the complexity: Its code is a prime candidate for refactoring. Static analysis is essential to identify the critical code smells and high-complexity methods that will quickly become unmaintainable.
- Scrutinize for advanced flaws: Code reviewers must be vigilant for this model's specific tendencies, including the re-emergence of classic vulnerabilities like path-traversal and a higher rate of fundamental logic errors.
This analysis shows that as AI models evolve, their flaw profiles become more nuanced. Harnessing their potential requires an equally sophisticated and adaptable approach to governance.