New View the LLM Leaderboard for Code Quality, Complexity, and Security

State of Code report series

The Coding Personalities of Leading LLMs

Make smarter AI adoption decisions with Sonar's latest report in The State of Code series. Explore the habits, blind spots, and archetypes of the top five LLMs to uncover the critical risks each brings to your codebase.

Download the report Executive summary

Key findings

Our deep analysis of LLM-generated code goes beyond standard benchmarks.

Coding personalities

Each LLM has a distinct style that impacts your production environment.

Shared strengths

All models consistently produce valid and create viable solutions for well-defined problems.

Shared blind spots

All models have a fundamental lack of security awareness and a bias for messy code.

Upgrades increase risk

Newer models can generate bugs that are almost twice as likely to be of the highest severity.

What our analysis uncovered

more likely for new Claude model to be of 'BLOCKER' severity than its predecessor.

of all issues found in LLM-generated code create long-term technical debt.

of the vulnerabilities for one LLM are of ‘BLOCKER’ severity.

of all bugs from one popular LLM are control-flow mistakes.

Methodology

Our analysis is based on 4,442 identical programming tasks performed by each LLM. We measured their output across multiple dimensions to create a comprehensive profile of each model's coding personality and risk profile.

Verbosity Measurement

Complexity Measurement
Communication & Documentation
Software Quality Analysis
Functional Performance

Verbosity Measurement

Verbosity quantifies the sheer volume of code each model generates to solve identical tasks.

Lines of Code (LOC)
Total number of lines of code generated across all 4,442 tasks, including blank lines and comments. This metric reveals whether a model tends toward concise or elaborate implementations.
Token Count
Total tokens generated in the code output, providing a language-agnostic measure of code volume that accounts for the actual content density.
Code Density
Ratio of executable statements to total lines, indicating how compact or spread out the code structure is.

The coding archetypes of leading LLMs

Our analysis shows that each LLM has a unique and measurable coding personality. Which one have you "hired" for your team?

Baseline performer

Senior architect

Balanced predecessor

Efficient generalist

Unfulfilled promise

Rapid prototyper

GPT-5-minimal

The baseline performer

Strong performance with traditional risk profile, but generates the most verbose and complex code.

This is the entry-level reasoning mode. It delivers strong performance that is superior to most non-reasoning models. Its personality is defined by having a more "traditional" risk profile compared to more advanced models.

It produces common and well-understood flaws, such as a significant rate of "Path-traversal & Injection" vulnerabilities (20%) and basic "Control-flow mistake" bugs.

At the same time, it introduces a new class of risk with its high verbosity and complexity, leading to the highest proportion of CRITICAL code smells of any model.

Claude Sonnet 4

The senior architect

Codes like a seasoned architect with highest functional skill, but sophistication introduces complex, high-severity bugs.

This LLM codes like a seasoned and ambitious architect tasked with building enterprise-grade systems. It exhibits the highest functional skill, successfully passing 77.04% of the benchmark tests.

The very sophistication of the model creates a lot of opportunities for higher-risk bugs that plague complex, stateful systems. Its unique bug profile reveals a high propensity for difficult concurrency and threading bugs (9.81% of its total bugs) and a significant rate of resource management leaks (15.07% of its bugs).

Claude 3.7 Sonnet

The balanced predecessor

Well-rounded developer with exceptional documentation, but still introduces high proportion of BLOCKER vulnerabilities.

This model represents a capable and well-rounded developer from a prior generation, exhibiting strong functional skills with a 72.46% benchmark pass rate.

Its most defining personality trait is its communication style—it is an exceptional documentarian, producing code with a remarkable 16.4% comment density—nearly three times higher than its successor and the highest of any model evaluated. This makes its code uniquely readable and easier for human developers to understand.

GPT-4o

The efficient generalist

Reliable middle-of-the-road developer with solid performance, but careless with logical precision.

This LLM is a reliable, middle-of-the-road developer. Its style is not as verbose as the "senior architect" nor as concise as the "rapid prototyper"—it is a jack-of-all-trades, a common choice for general-purpose coding assistance. Its code is moderately complex and its functional performance is solid.

Its distinctive personality trait, however, is revealed in the type of mistakes it makes. While generally avoiding the most severe 'BLOCKER' or 'CRITICAL' bugs, it demonstrates a notable carelessness with logical precision.

Llama 3.2 90B

The unfulfilled promise

Should be top-tier given its scale, but delivers mediocre performance with alarming security blind spots.

Given its scale and backing, this model represents what should be a top-tier contender, but its performance in our analysis suggests its promise is largely unfulfilled. Its functional skill is mediocre, with a pass rate of 61.47%, only marginally better than the much smaller open-source model we tested.

However, the model's most alarming characteristic is its remarkably poor security posture. The model exhibits a profound security blind spot, with an alarming 70.73% of the vulnerabilities it introduces being of 'BLOCKER' severity—the highest proportion of any model evaluated.

OpenCoder-8B

The rapid prototyper

Brilliant but undisciplined junior developer, perfect for rapid prototyping but buries projects in technical debt.

This LLM is the brilliant but undisciplined junior developer, perfect for getting a concept off the ground with maximum speed. Its style is defined by conciseness, producing the least amount of code (120,288 LOC) to achieve functional results.

This model is a technical debt machine, exhibiting the highest issue density of all models at 32.45 issues per thousand lines of code. Its most prominent personality flaw is a notable tendency to leave behind dead, unused, and redundant code, which accounts for 42.74% of all its code smells.

The "trust but verify" mandate for AI

AI is now a core part of software development, but performance benchmarks alone are misleading. They can lead to LLMs that solve difficult challenges but fail to write good, secure, and reliable code. To harness these powerful models responsibly, you must look beyond the benchmark. Our report provides the critical insights needed to choose the right models and use them safely.

Download the report

The three qualities of software source code

Sonar classifies the issues found in every project or codebase across three deeply interconnected software qualities: reliability, security, and maintainability.

Reliability

Bugs that would affect the software's capability to maintain its level of performance under promised conditions, potentially compromising its reliability and operational effectiveness.

Advanced Security demo

Security

Vulnerabilities and security hotspots. Vulnerabilities are code weaknesses that could be exploited for attacks, while hotspots are security-sensitive code requiring manual review.

SonarQube demo

Maintainability

Code smells, which could indicate weaknesses in design that can increase technical debt, slow down development, or increase the risk of bugs or failures down the line.

AI Code Assurance demo

Security Vulnerability Analysis

Security vulnerabilities in AI-generated code pose significant risks. Our analysis reveals distinct patterns in how each LLM handles security-critical code, with some models producing vulnerabilities of BLOCKER severity at alarming rates.