
New data on code quality: GPT-5.2 high, Opus 4.5, Gemini 3, and more


Prasenjit Sarkar

Solutions Marketing Manager

10 min read


Functional benchmarks remain a standard for evaluating AI models, effectively measuring whether generated code can pass a test case. As LLMs evolve, they are becoming increasingly proficient at solving these functional challenges. However, for engineering leaders deploying this code into production, functional correctness is only half of the equation.

To understand the real effectiveness of AI coding models, we also need to understand the structural quality, security, and maintainability of their output. Sonar is in an excellent position to do this work, as we analyze over 750 billion lines of code each day.

Several months ago, we began analyzing the quality, security, and maintainability of the code created with leading LLMs by testing them on over 4,000 distinct Java programming assignments using the SonarQube static analysis engine. 

Today, we are making all evaluations available in a new Sonar LLM leaderboard and sharing our latest findings on GPT-5.2 High, GPT-5.1 High, Gemini 3.0 Pro, Opus 4.5 Thinking, and Claude Sonnet 4.5.

Explore the new data on the Sonar LLM Leaderboard

Visualizing the trade-offs

To understand the trade-offs and behaviors of different models, we plotted them on three critical dimensions: pass rate (X-axis), cognitive complexity (Y-axis), and verbosity (bubble size).

As models become more “performant” and move to the right, their outputs tend to get more verbose and complex, imposing a higher burden on the engineers reviewing and using the code.

The complexity correlation

Our research highlights a correlation between model reasoning capabilities and code complexity. As models attempt sophisticated, stateful solutions to harder problems, they often move away from simple code. This shift introduces engineering challenges that are harder to detect than simple syntax errors.

For example:

  • Opus 4.5 Thinking leads in functional performance with an 83.62% pass rate, placing it furthest to the right in the chart above. However, this performance comes with high verbosity: it generated 639,465 lines of code (LOC) to solve the benchmark tasks, which is why it is one of the largest bubbles on the chart and more than double the volume of less verbose models.
  • Gemini 3 Pro stands out as an efficiency outlier. It achieves a comparable 81.72% pass rate while maintaining low cognitive complexity and low verbosity (small bubble size). This combination suggests a unique ability to solve complex problems with concise, readable code. However, Gemini 3 Pro also has the highest issue density of the recent models.
  • GPT-5.2 High ranks third in functional performance (80.66%), trailing Opus 4.5 Thinking and Gemini 3 Pro. Despite the high pass rate, it generated the highest code volume of the cohort (974,379 LOC). Compared to its predecessor, GPT-5.1 High, it shows regressed maintainability and increased bug density across all severities, though it demonstrates marginal improvements in overall security and blocker-level vulnerabilities.
  • GPT-5.1 High also achieves an 80% pass rate but exhibits an increase in cognitive complexity (a high placement on the Y-axis). This indicates that while it solves the problems, it generates logic that is structurally more difficult to read and maintain; a short illustration of what cognitive complexity captures follows this list.
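
To make the cognitive complexity metric concrete, here is a small hypothetical Java sketch (not drawn from any model's output). Both methods implement the same discount rule, but SonarQube's cognitive complexity metric scores the nested version higher, because each additional level of nesting adds to the score; the flattened guard-clause version is cheaper to read and maintain.

```java
// Hypothetical illustration of cognitive complexity (not model output).
// Cognitive complexity penalizes nesting: a nested if costs more than
// the same construct at the top level of a method.
class DiscountPolicy {

    // Nested conditionals: harder to follow, higher cognitive complexity.
    double discountNested(boolean member, int years, double total) {
        if (member) {
            if (years > 5) {
                if (total > 100) {
                    return 0.15;
                } else {
                    return 0.10;
                }
            } else {
                return 0.05;
            }
        }
        return 0.0;
    }

    // Same rule with guard clauses: flat control flow, lower cognitive complexity.
    double discountFlat(boolean member, int years, double total) {
        if (!member) return 0.0;
        if (years <= 5) return 0.05;
        return total > 100 ? 0.15 : 0.10;
    }
}
```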

Engineering discipline and reliability

While models demonstrate strong logic capabilities, our analysis reveals distinct patterns in how they handle software engineering fundamentals like resource management and thread safety. Contextualizing these numbers reveals significant disparities in reliability between models that otherwise have similar pass rates.

1. Concurrency challenges: GPT-5.2 High demonstrates powerful reasoning but is more prone to concurrency errors than its peers. It generates 470 concurrency issues per million lines of code (MLOC), a rate nearly double that of the next closest model and more than 6x higher than Gemini 3 Pro. A minimal sketch of this failure pattern follows the table.

Model | Concurrency issues per MLOC
GPT-5.2 High | 470
GPT-5.1 High | 241
Opus 4.5 Thinking | 133
Claude Sonnet 4.5 | 129
Gemini 3 Pro | 69
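
The concurrency issues counted above are typically of this shape: shared mutable state updated without synchronization. The class below is a hypothetical sketch, not taken from the benchmark output, showing an unsafe increment alongside an atomic alternative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a common concurrency defect (not model output).
public class HitCounter {

    // Unsafe: 'count++' is a read-modify-write on shared state, so two threads
    // can interleave and lose updates; the field is neither volatile nor synchronized.
    private long unsafeCount = 0;

    public void unsafeIncrement() {
        unsafeCount++; // data race: not atomic
    }

    // Safe alternative: AtomicLong makes the increment a single atomic operation.
    private final AtomicLong safeCount = new AtomicLong();

    public void safeIncrement() {
        safeCount.incrementAndGet();
    }

    public long safeValue() {
        return safeCount.get();
    }
}
```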

2. Resource management: Claude Sonnet 4.5 showed the highest rate of resource leaks in the cohort, generating 195 leaks per MLOC. By comparison, GPT-5.1 High produced only 51 leaks per MLOC for the same tasks. A sketch of the leak pattern appears after the table.

Model | Resource leaks per MLOC
Claude Sonnet 4.5 | 195
GPT-5.2 High | 86
Opus 4.5 Thinking | 84
Gemini 3 Pro | 79
GPT-5.1 High | 51
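
Resource leaks in Java usually come down to streams, readers, or connections that are not closed on every code path. The sketch below is hypothetical, not from the benchmark, and contrasts a leaky method with a try-with-resources version that static analysis would accept.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch of the resource-leak pattern (illustrative only).
public class ConfigLoader {

    // Leaky: if readLine() throws, the reader is never closed and the
    // underlying file handle leaks.
    public String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close(); // skipped entirely when an exception is thrown above
        return line;
    }

    // Fixed: try-with-resources closes the reader on every path,
    // including exceptional ones.
    public String firstLineSafe(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}
```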

3. Control flow precision: Gemini 3 Pro posted the highest rate of control flow mistakes (200 per MLOC), nearly 4x higher than Opus 4.5 Thinking (55 per MLOC). GPT-5.2 High demonstrated high precision, achieving the lowest error rate in the cohort at just 22 control flow mistakes per MLOC. A typical mistake of this kind is sketched after the table.

Model | Control flow mistakes per MLOC
Gemini 3 Pro | 200
Claude Sonnet 4.5 | 152
GPT-5.1 High | 98
Opus 4.5 Thinking | 55
GPT-5.2 High | 22
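
A control flow mistake, in this context, is logic where a branch can never execute or a condition is always true or false. The following hypothetical sketch, not taken from any model's output, shows a dead branch of the kind these rules count, with a corrected ordering underneath.

```java
// Hypothetical sketch of a control flow mistake (illustrative only).
public class RangeCheck {

    // Buggy: the second branch is unreachable, because any value greater
    // than 100 already satisfied the first condition. The intended
    // upper-bound check is silently lost.
    public String classifyBuggy(int score) {
        if (score >= 0) {
            return "valid";
        } else if (score > 100) {   // always false here: dead branch
            return "too large";
        }
        return "negative";
    }

    // Fixed: conditions ordered so every branch is reachable.
    public String classifyFixed(int score) {
        if (score < 0) {
            return "negative";
        } else if (score > 100) {
            return "too large";
        }
        return "valid";
    }
}
```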

Security verification

Security remains a critical area for verification. Our analysis confirms that models do not always reliably track untrusted user input from source to sink.

Claude Sonnet 4.5 registered 198 blocker-severity vulnerabilities per MLOC, including path traversal and injection flaws, a rate higher than the other models in its class. Opus 4.5 Thinking performed significantly better, with only 44 blockers per MLOC, suggesting its “thinking” process may allow for better verification of security constraints before generating output. GPT-5.2 High achieved the best security posture in the cohort, with only 16 blocker vulnerabilities per MLOC. While other metrics show this model struggles with code volume and general bug density, its handling of critical security issues is currently best-in-class. A simplified source-to-sink example follows the table.

Model | Blocker vulnerabilities per MLOC
Claude Sonnet 4.5 | 198
Gemini 3 Pro | 66
GPT-5.1 High | 53
Opus 4.5 Thinking | 44
GPT-5.2 High | 16
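
Path traversal is a classic source-to-sink flaw: untrusted input reaches a file system call without validation. The sketch below is a hypothetical illustration (the class and paths are invented, not from the benchmark) of the vulnerable pattern and one common mitigation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a path traversal source-to-sink flaw (illustrative only).
public class ReportService {

    private static final Path BASE_DIR = Path.of("/var/app/reports");

    // Vulnerable: the user-supplied name is resolved directly into the file path,
    // so an input like "../../etc/passwd" escapes the intended directory.
    public byte[] readReportVulnerable(String userSuppliedName) throws IOException {
        Path target = BASE_DIR.resolve(userSuppliedName); // tainted source reaches the sink
        return Files.readAllBytes(target);
    }

    // Safer: normalize the resolved path and verify it is still inside BASE_DIR
    // before touching the file system.
    public byte[] readReportSafe(String userSuppliedName) throws IOException {
        Path target = BASE_DIR.resolve(userSuppliedName).normalize();
        if (!target.startsWith(BASE_DIR)) {
            throw new IllegalArgumentException("Path escapes report directory");
        }
        return Files.readAllBytes(target);
    }
}
```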

The challenge of maintainability

Beyond critical bugs, maintainability remains a primary factor in the total cost of ownership for AI code. “Code smell” issues, which degrade maintainability, accounted for 92% to 96% of all detected issues across the models evaluated.

GPT-5.1 High generated the most generic code smells in the cohort, at over 4,400 per MLOC.

Model | Generic smells per MLOC
GPT-5.1 High | 4426
GPT-5.2 High | 3453
Gemini 3 Pro | 3044
Claude Sonnet 4.5 | 2551
Opus 4.5 Thinking | 2225
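
For context, “generic” smells are the low-severity maintainability issues that accumulate fastest: magic numbers, unused locals, commented-out code, and the like. The hypothetical sketch below, not drawn from the benchmark, shows the kind of code that inflates these counts and a cleaner equivalent.

```java
// Hypothetical sketch of common maintainability smells (illustrative only).
public class InvoiceCalculator {

    // Smelly: an unused local, commented-out code, and unexplained magic numbers
    // are typical low-severity smells that pile up per MLOC.
    public double totalSmelly(double net) {
        double unused = net * 2;        // unused local variable
        // double legacy = net * 1.21;  // commented-out code
        return net * 1.19 + 4.99;       // magic numbers with no named intent
    }

    // Cleaner: intent captured in named constants, no dead code.
    private static final double VAT_RATE = 1.19;
    private static final double SHIPPING_FEE = 4.99;

    public double total(double net) {
        return net * VAT_RATE + SHIPPING_FEE;
    }
}
```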

Claude Sonnet 4.5 violated design best practices at the highest rate, with 4,316 violations per MLOC.

Model | Design best practice violations per MLOC
Claude Sonnet 4.5 | 4316
Gemini 3 Pro | 3824
Opus 4.5 Thinking | 2494
GPT-5.2 High | 2293
GPT-5.1 High | 1840
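
Design best-practice violations are structural rather than cosmetic, for example leaking mutable internal state through a getter. The hypothetical sketch below, not taken from any model's output, illustrates one such violation and the conventional fix.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a design best-practice violation (illustrative only).
public class OrderHistory {

    private final List<String> orders = new ArrayList<>();

    // Violates encapsulation: callers receive the internal list and can
    // mutate it behind the class's back.
    public List<String> getOrdersLeaky() {
        return orders;
    }

    // Better: expose an unmodifiable view so internal state stays encapsulated.
    public List<String> getOrders() {
        return Collections.unmodifiableList(orders);
    }

    public void add(String orderId) {
        orders.add(orderId);
    }
}
```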

About the Sonar LLM Leaderboard 

We created the Sonar LLM Leaderboard to provide transparency into how models build code, not just what they build. By running thousands of AI-generated solutions through SonarQube, we evaluate models on the metrics that matter to engineering leaders: security, reliability, maintainability, and complexity.

Explore the complete dataset on the Sonar LLM Leaderboard.
