Evaluation Pipeline
Each model on this leaderboard is evaluated through a multi-stage pipeline that measures both functional correctness and code quality.
Code Generation
Each model receives a set of coding task prompts and generates source code solutions. Tasks span multiple languages and difficulty levels.
Functional Testing
Generated code is compiled and executed against a test suite. The pass rate (%) reflects how many tasks the model solved correctly.
Static Analysis
All generated code is analyzed by SonarQube to detect bugs, vulnerabilities, and code smells, and to measure complexity. Metrics are normalized by lines of code for fair comparison across models.
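As a rough illustration of why normalization matters, the sketch below computes an issue density. The unit (issues per 1,000 lines of code), the function name `issues_per_kloc`, and the numbers are assumptions made for this example; the exact normalization used by the leaderboard is not specified here.

```python
def issues_per_kloc(issue_count: int, lines_of_code: int) -> float:
    """Return issue density as issues per 1,000 lines of code (assumed unit)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count * 1000 / lines_of_code


# Hypothetical numbers for two models, for illustration only.
model_a = issues_per_kloc(issue_count=120, lines_of_code=60_000)
model_b = issues_per_kloc(issue_count=90, lines_of_code=30_000)

print(f"model_a: {model_a:.1f} issues/KLOC")
print(f"model_b: {model_b:.1f} issues/KLOC")
```

In this made-up case, model_a raises more issues in absolute terms but has the lower density, because it generated twice as much code.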
Benchmarks
Models are evaluated on a set of well-established coding benchmarks. Every model runs against the same task pool in each language, which keeps the SonarQube static analysis consistent.
ComplexCodeEval
A large-scale benchmark of advanced algorithmic challenges derived from real-world open-source projects. It tests a model's ability to produce production-grade code.
MBPP
The Mostly Basic Programming Problems benchmark. It covers fundamental programming tasks to assess general-purpose code generation.
HumanEval
OpenAI's HumanEval suite: hand-crafted problems that evaluate core programming and problem-solving skills.
Java is currently evaluated on ~3,900 tasks for ComplexCodeEval, ~400 tasks for MBPP, and ~160 tasks for HumanEval.
Correctness (Pass Rate %)
Correctness measures whether a model's generated code actually works. A task is considered solved when the generated code compiles and passes every unit test in its reference test suite.
Correctness is reported only for the two benchmarks that ship with executable tests: HumanEval and MBPP. Other benchmarks contribute to code quality metrics (complexity, security, reliability, maintainability) but not to the correctness score.
We use the standard pass@1 metric: each task is attempted once and counts as a success only if that single sample passes all unit tests. The result is reported as the percentage of tasks solved correctly, which is the Pass Rate %.
Tasks where the model produced no usable answer at all are tracked separately as unsolved tasks, so they are not conflated with wrong answers.
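The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual implementation: the `Outcome` enum, the `summarize` function, and the sample data are hypothetical, and the sketch assumes that tasks with no usable answer stay in the denominator while being counted separately from wrong answers.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"        # compiled and passed every unit test
    FAILED = "failed"        # an answer was produced, but it did not pass
    NO_ANSWER = "no_answer"  # no usable answer was produced


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Compute pass@1 as a percentage, with one sample per task."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one task outcome is required")
    passed = sum(o is Outcome.PASSED for o in outcomes)
    failed = sum(o is Outcome.FAILED for o in outcomes)
    no_answer = sum(o is Outcome.NO_ANSWER for o in outcomes)
    return {
        # Assumption: every attempted task counts in the denominator.
        "pass_rate_pct": 100 * passed / total,
        "wrong_answers": failed,
        "unsolved_no_answer": no_answer,
    }


# Hypothetical run: 7 passes, 2 wrong answers, 1 task with no usable answer.
run = [Outcome.PASSED] * 7 + [Outcome.FAILED] * 2 + [Outcome.NO_ANSWER]
print(summarize(run))
```

Keeping the two non-passing categories apart makes it possible to tell a model that writes incorrect code from one that fails to produce code at all.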
Data Sources
All evaluation data are generated by Sonar's research team. Models are evaluated using their publicly available APIs under default or recommended settings.
Static analysis is powered by SonarQube, which applies thousands of rules across multiple languages.
The leaderboard is updated as new models are released and evaluated. Results reflect the state of each model at evaluation time.
Metrics & Dimensions
Code quality is assessed across five complementary dimensions. Together they reveal a model's coding personality: not just whether the code works, but how well it is written.
Correctness
Functional pass rate: the percentage of tasks where the generated code passes all tests.
Complexity
Cyclomatic and cognitive complexity measure how intricate and hard to understand the code is.
Security
Vulnerabilities detected by SonarQube's security rules, including injection flaws, path traversals, and insecure crypto.
Reliability
Bugs that could cause runtime failures: null dereferences, resource leaks, incorrect logic, and other defects.
Maintainability
Code smells that make the code harder to maintain: duplications, overly long methods, poor naming, and style issues.
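To make the Complexity dimension more concrete, here is a simplified sketch of cyclomatic complexity: one plus the number of decision points in the code. It analyzes Python source with the standard `ast` module purely for illustration. The set of counted constructs is an assumption of this example, and SonarQube's own rules differ in detail and vary by language.

```python
import ast

# Constructs treated as decision points in this simplified sketch.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths (one per operator).
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(n):
    if n < 0 and n % 2 == 0:
        return "negative even"
    for _ in range(3):
        if n > 100:
            return "large"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))  # 1 + if + and + for + if = 5
```

Cognitive complexity is a separate measure that also penalizes nesting, so two functions with the same cyclomatic score can differ in how hard they are to read.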
Severity Levels
Issues detected by SonarQube are classified into four severity levels, from most to least impactful.
| Severity | Description |
|---|---|
| Blocker | Severe issues with a high probability of impacting production: data loss, security breaches, application crashes. |
| Critical | Significant issues that are likely to cause problems: unexpected behavior, performance degradation, or security risks. |
| Major | Quality issues that can hinder maintainability or lead to subtle bugs if left unaddressed. |
| Minor | Low-impact issues related to style, conventions, or minor inefficiencies. |
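One way to work with these levels is to model them as an ordered type and tally issues per level. The sketch below is illustrative only: the `Severity` enum, its numeric ranks, and the sample issue list are assumptions for this example and are not taken from SonarQube's API.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    # Higher value means more impactful; the numeric ranks are arbitrary.
    BLOCKER = 4
    CRITICAL = 3
    MAJOR = 2
    MINOR = 1


def count_by_severity(labels: list[str]) -> dict[str, int]:
    """Tally issue labels, returning counts ordered from most to least impactful."""
    counts = Counter(Severity[label.upper()] for label in labels)
    return {s.name: counts.get(s, 0) for s in sorted(Severity, reverse=True)}


# Hypothetical issue labels from one analysis run.
issues = ["Minor", "Major", "Minor", "Blocker", "Major", "Minor"]
print(count_by_severity(issues))
```

Because the result always contains all four levels, a level with no issues appears with a count of zero instead of being omitted.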