Evaluation Pipeline

Each model on this leaderboard is evaluated through a multi-stage pipeline that measures both functional correctness and code quality.

1. Code Generation

Each model receives a set of coding task prompts and generates source code solutions. Tasks span multiple languages and difficulty levels.

2. Functional Testing

Generated code is compiled and executed against a test suite. The pass rate (%) is the share of tasks the model solved correctly.

3. Static Analysis

All generated code is analyzed by SonarQube to detect bugs, vulnerabilities, and code smells, and to measure complexity. Metrics are normalized by lines of code for fair comparison across models.

Benchmarks

Models are evaluated on a set of well-established coding benchmarks. All models are run against the same task pool across languages, so that static analysis with SonarQube is applied consistently.

ComplexCodeEval

A large-scale benchmark of advanced algorithmic challenges derived from real-world open-source projects. It tests a model's ability to produce production-grade code.

MBPP

The Mostly Basic Programming Problems benchmark. It covers fundamental programming tasks to assess general-purpose code generation.

HumanEval

OpenAI's HumanEval suite: hand-crafted problems that evaluate core programming and problem-solving skills.


Java is currently evaluated on ~3,900 tasks for ComplexCodeEval, ~400 tasks for MBPP, and ~160 tasks for HumanEval.

Correctness (Pass Rate %)

Correctness measures whether a model's generated code actually works. A task is considered solved when the generated code compiles and passes every unit test in its reference test suite.

Correctness is reported only for the two benchmarks that ship with executable tests: HumanEval and MBPP. Other benchmarks contribute to code quality metrics (complexity, security, reliability, maintainability) but not to the correctness score.

We use the standard pass@1 metric: each task is attempted once and counts as a success only if that single sample passes all unit tests. The result is reported as the percentage of tasks solved correctly, which is the Pass Rate %.

Tasks where the model produced no usable answer at all are tracked separately as unsolved tasks, so they are not conflated with wrong answers.
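The pass@1 bookkeeping described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual tooling: the names `Outcome`, `PassRateReport`, and `summarize` are hypothetical, and the sketch assumes that unsolved tasks remain in the denominator (they count as not solved), which the text does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Result of the single attempt made on one task (pass@1)."""
    PASSED = "passed"      # compiled and passed every unit test
    FAILED = "failed"      # code was produced but did not compile or failed a test
    UNSOLVED = "unsolved"  # no usable answer was produced at all


@dataclass
class PassRateReport:
    total: int
    passed: int
    failed: int
    unsolved: int

    @property
    def pass_rate_percent(self) -> float:
        # Assumption: unsolved tasks stay in the denominator.
        if self.total == 0:
            return 0.0
        return 100.0 * self.passed / self.total


def summarize(outcomes: list[Outcome]) -> PassRateReport:
    """Count each outcome separately so unsolved tasks are not mixed with wrong answers."""
    return PassRateReport(
        total=len(outcomes),
        passed=outcomes.count(Outcome.PASSED),
        failed=outcomes.count(Outcome.FAILED),
        unsolved=outcomes.count(Outcome.UNSOLVED),
    )


if __name__ == "__main__":
    sample = [Outcome.PASSED] * 3 + [Outcome.FAILED, Outcome.UNSOLVED]
    report = summarize(sample)
    print(report.pass_rate_percent, report.failed, report.unsolved)
```

Keeping `failed` and `unsolved` as separate counters makes it possible to tell a model that answers wrongly apart from one that does not answer at all, even when both have the same Pass Rate %.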

Data Sources

All evaluation data are generated by Sonar's research team. Models are evaluated using their publicly available APIs under default or recommended settings.

Static analysis is powered by SonarQube, which applies thousands of rules across multiple languages.

The leaderboard is updated as new models are released and evaluated. Results reflect the state of each model at evaluation time.

Metrics & Dimensions

Code quality is assessed across five complementary dimensions. Together they reveal a model's coding personality: not just whether its code works, but how well that code is written.

Correctness

Functional pass rate: the percentage of tasks where the generated code passes all tests.

Higher is better

Complexity

Cyclomatic and cognitive complexity measure how intricate and hard to understand the code is.

Per KLOC — lower is better

Security

Vulnerabilities detected by SonarQube's security rules, including injection flaws, path traversals, and insecure crypto.

Per MLOC — lower is better

Reliability

Bugs that could cause runtime failures: null dereferences, resource leaks, incorrect logic, and other defects.

Per MLOC — lower is better

Maintainability

Code smells that make the code harder to maintain: duplications, overly long methods, poor naming, and style issues.

Per MLOC — lower is better
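The per-KLOC and per-MLOC figures above are densities: a raw count scaled by the amount of code analyzed. The sketch below shows that normalization, assuming KLOC means 1,000 lines of code and MLOC means 1,000,000 lines of code. The function name `density` and the sample numbers are hypothetical and are not taken from the leaderboard's data or tooling.

```python
KLOC = 1_000        # thousand lines of code
MLOC = 1_000_000    # million lines of code


def density(count: float, lines_of_code: int, per: int) -> float:
    """Scale a raw count (issues or complexity points) to a rate per `per` lines."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return count * per / lines_of_code


if __name__ == "__main__":
    # Hypothetical model output: 48,000 lines of generated code in total.
    total_lines = 48_000
    print(density(12, total_lines, MLOC))     # vulnerabilities per MLOC
    print(density(9_600, total_lines, KLOC))  # complexity per KLOC
```

Normalizing this way keeps a model that writes longer solutions from looking worse simply because it produced more code to analyze.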

Severity Levels

Issues detected by SonarQube are classified into four severity levels, from most to least impactful.

Severity: Description
Blocker: Critical issues with high probability of impacting production: data loss, security breaches, application crashes.
Critical: Significant issues that are likely to cause problems: unexpected behavior, performance degradation, or security risks.
Major: Quality issues that can hinder maintainability or lead to subtle bugs if left unaddressed.
Minor: Low-impact issues related to style, conventions, or minor inefficiencies.
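One way to work with the four levels in code is an ordered enumeration, so that issues can be tallied and listed from most to least impactful. This is only a sketch: the `Severity` class, its numeric values, and `tally_by_severity` are hypothetical and are not part of SonarQube's API.

```python
from collections import Counter
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    # Numeric values are an assumption; they only encode the ordering.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3
    BLOCKER = 4


def tally_by_severity(severities: Iterable[Severity]) -> dict[Severity, int]:
    """Count issues per level, ordered from most to least impactful."""
    counts = Counter(severities)
    return {level: counts.get(level, 0) for level in sorted(Severity, reverse=True)}


if __name__ == "__main__":
    issues = [Severity.MAJOR, Severity.MINOR, Severity.MAJOR, Severity.BLOCKER]
    for level, count in tally_by_severity(issues).items():
        print(f"{level.name.title()}: {count}")
```

Levels with no issues still appear with a count of zero, which keeps reports for different models aligned row by row.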