
A technical look at SonarSweep for GPT-OSS-20B


Joe Tyler

AI Researcher

8 min read

We recently released SonarSweep-java-gpt-oss-20b, a fine-tuned version of OpenAI’s gpt-oss-20b optimized for generating high-quality Java code.

This release is not intended to compete with state-of-the-art (SOTA) reasoning models. Instead, it serves as a technical demonstration of how training data quality affects the quality of the code a model generates.

By processing our training dataset through the SonarSweep pipeline, we aimed to answer a critical question: Can we significantly reduce the density of bugs and vulnerabilities in generated code without increasing model size or latency?

Here is an overview of the methodology, the results, and the known limitations of this model.

The methodology

We started with OpenAI’s gpt-oss-20b base model. For the training dataset, we compiled 70k Java examples from OpenCoder and synthetic alignment data.

Before fine-tuning, we used SonarSweep to analyze and optimize this dataset. The hypothesis was that by identifying "bad" code (code smells, bugs, and security vulnerabilities) in the data, then remediating and curating the training examples, the resulting model would learn to follow good practice and generate expert-level coding patterns.
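To make the idea concrete, here is a minimal Python sketch of this kind of analyze-then-filter curation step. The field names and the simple keep/drop rule are illustrative assumptions, not the actual SonarSweep pipeline, which also remediates issues rather than simply discarding examples.

```python
# Minimal sketch of static-analysis-driven curation (illustrative only).
# Each training example is assumed to carry issue counts produced by an
# earlier analysis pass; we keep only examples whose target code is free
# of bugs and vulnerabilities.
import json

def keep_example(example: dict) -> bool:
    issues = example.get("issues", {})
    return issues.get("bugs", 0) == 0 and issues.get("vulnerabilities", 0) == 0

def curate(path: str) -> list[dict]:
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    kept = [ex for ex in examples if keep_example(ex)]
    print(f"Kept {len(kept)} of {len(examples)} examples")
    return kept
```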

We fine-tuned by training LoRA adapters for all linear layers of the experts and attention blocks.
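As a rough illustration, a setup along these lines can be expressed with Hugging Face PEFT. The rank, dropout, and module targeting shown here are illustrative assumptions rather than our exact training configuration, and quantization and training-loop details are omitted.

```python
# Hedged sketch of attaching LoRA adapters to all linear layers with PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                         # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # adapters on every linear layer, covering
                                  # attention and expert projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```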

The results: Code quality and functional correctness

For our benchmarks we evaluate two dimensions: Functional Correctness (does the generated code pass a set of pre-defined unit tests?) and Code Quality, which we quantify as the number of Sonar quality issues our SonarQube analyzers detect, split across reliability, maintainability, and security.
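In sketch form, and assuming one generated sample per problem plus pre-computed analyzer results, the two metrics reduce to:

```python
# Back-of-the-envelope definitions of the two evaluation metrics.

def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated solution passes all unit tests."""
    return sum(results) / len(results)

def issues_per_kloc(issue_count: int, total_lines: int) -> float:
    """Sonar issues normalized per thousand lines of generated code."""
    return issue_count / (total_lines / 1000)

# Example: 53 bugs across 100,000 generated lines is 0.53 bugs/KLOC.
print(issues_per_kloc(53, 100_000))  # 0.53
```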

1. Java functional correctness

Functionally, the fine-tuned model performs at near-parity with the base model. On the MultiPL-E Java benchmark, the Pass@1 score shifted marginally from 71.49% (Base) to 72.37% (Fine-tuned). 

2. Java code quality

The real impact of SonarSweep is visible when we analyze the quality of the generated code. Benchmarking on ComplexCodeEval and MultiPL-E Java, the fine-tuned model produced significantly higher-quality code with fewer defects.

Code quality      Metric                    Base model    SonarSweep fine-tuned model    Change
Reliability       Bugs / KLOC               0.90          0.53                           ▼ ~41%
Security          Vulnerabilities / KLOC    0.41          0.24                           ▼ ~41%
Maintainability   Code Smells / KLOC        20.04         16.29                          ▼ ~18%

Note: KLOC = thousand lines of code.
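For reference, the Change column is the relative reduction against the base model, which a quick calculation confirms:

```python
# Relative reduction per KLOC for each row of the table above.
for name, base, tuned in [("bugs", 0.90, 0.53),
                          ("vulnerabilities", 0.41, 0.24),
                          ("code smells", 20.04, 16.29)]:
    print(f"{name}: {(base - tuned) / base:.1%} fewer per KLOC")
# bugs: 41.1% fewer, vulnerabilities: 41.5% fewer, code smells: 18.7% fewer
```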

3. Other languages and general ability

While the model was optimized exclusively for Java, we observed no significant degradation in functional correctness on a selection of non-target languages. Furthermore, the model’s general question-answering capabilities remained intact, achieving 78.12% accuracy on the MMLU benchmark—a negligible 0.79% difference from the base model.

These benchmark scores demonstrate that using SonarSweep to analyze, remediate, and curate training data improves the quality of generated code in the target language without sacrificing the model's functional coding ability on other languages or its general knowledge on MMLU. The ~41% reduction in generated bugs and security vulnerabilities compared to the base model validates that models trained on high-quality data don't just write code that works; they write code that is safer and more reliable.

What this model is (and is not)

To ensure this model is used correctly by the community, we want to be transparent about its scope: it is a demonstration of how using SonarSweep for fine-tuning can reduce downstream technical debt in LLM-generated code. 

This model operates exclusively as a low-reasoning model, derived from gpt-oss-20b-low and hard-coded to a low-reasoning profile. It is optimized for speed and standard conversational tasks rather than complex chain-of-thought processing.

Evaluation and access

For teams looking to train their own models or fine-tune existing ones, these results are clear: leveraging SonarSweep to boost your data quality can lead to significant improvements in the security, reliability and maintainability of LLM-generated code. More details are available on the HuggingFace model card. We invite the community to review the full metrics and test the model.
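For a quick local test, a standard transformers loading snippet is enough. Note that the repository ID below is a placeholder, so check the HuggingFace model card for the exact name and recommended generation settings.

```python
# Quick-start sketch for trying the model locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SonarSource/SonarSweep-java-gpt-oss-20b"  # placeholder repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user",
     "content": "Write a Java method that parses an ISO-8601 date string "
                "and returns a java.time.LocalDate, handling invalid input."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```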

We welcome feedback on the Sonar Community forum.