
A technical look at SonarSweep for GPT-OSS-20B


Joe Tyler

AI Researcher

8 min read

We recently released SonarSweep-java-gpt-oss-20b, a fine-tuned version of OpenAI’s gpt-oss-20b optimized for generating high-quality Java code.

This release is not intended to compete with state-of-the-art (SOTA) reasoning models. Instead, it serves as a technical demonstration of how training data quality affects the quality of the code a model generates.

By processing our training dataset through the SonarSweep pipeline, we aimed to answer a critical question: Can we significantly reduce the density of bugs and vulnerabilities in generated code without increasing model size or latency?

Here is an overview of the methodology, the results, and the known limitations of this model.

The methodology

We started with OpenAI’s gpt-oss-20b base model. For the training dataset, we compiled 70k Java examples from OpenCoder and synthetic alignment data.

Before fine-tuning, we used SonarSweep to analyze and optimize this dataset. The hypothesis was that by identifying "bad" code (code smells, bugs, and security vulnerabilities) in the data, then remediating and curating the training examples, the resulting model would learn to follow good practice and generate expert-level coding patterns.
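To make the idea concrete, here is a minimal Python sketch of this kind of analyze-then-filter curation step. The field names and the simple keep/drop rule are illustrative assumptions, not the actual SonarSweep pipeline, which also remediates issues rather than simply discarding examples.

```python
# Minimal sketch of static-analysis-driven curation (illustrative only).
# Each training example is assumed to carry issue counts produced by an
# earlier analysis pass; we keep only examples whose target code is free
# of bugs and vulnerabilities.
import json

def keep_example(example: dict) -> bool:
    issues = example.get("issues", {})
    return issues.get("bugs", 0) == 0 and issues.get("vulnerabilities", 0) == 0

def curate(path: str) -> list[dict]:
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    kept = [ex for ex in examples if keep_example(ex)]
    print(f"Kept {len(kept)} of {len(examples)} examples")
    return kept
```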

We fine-tuned by training LoRA adapters for all linear layers of the experts and attention blocks.
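As a rough illustration, a setup along these lines can be expressed with Hugging Face PEFT. The rank, dropout, and module targeting shown here are illustrative assumptions rather than our exact training configuration, and quantization and training-loop details are omitted.

```python
# Hedged sketch of attaching LoRA adapters to all linear layers with PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                         # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # adapters on every linear layer, covering
                                  # attention and expert projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```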

The results: Code quality and functional correctness

For our benchmarks we evaluate two dimensions: Functional Correctness (does the generated code pass a set of pre-defined unit tests?) and Code Quality, which we quantify as the number of Sonar quality issues our SonarQube analyzers detect, split across reliability, maintainability, and security.
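In sketch form, and assuming one generated sample per problem plus pre-computed analyzer results, the two metrics reduce to:

```python
# Back-of-the-envelope definitions of the two evaluation metrics.

def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated solution passes all unit tests."""
    return sum(results) / len(results)

def issues_per_kloc(issue_count: int, total_lines: int) -> float:
    """Sonar issues normalized per thousand lines of generated code."""
    return issue_count / (total_lines / 1000)

# Example: 53 bugs across 100,000 generated lines is 0.53 bugs/KLOC.
print(issues_per_kloc(53, 100_000))  # 0.53
```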

1. Java functional correctness

Functionally, the fine-tuned model performs at near-parity with the base model. On the MultiPL-E Java benchmark, the Pass@1 score shifted marginally from 71.49% (Base) to 72.37% (Fine-tuned). 

2. Java code quality

The real impact of SonarSweep is visible when we analyze the quality of the generated code. Benchmarking on ComplexCodeEval and MultiPL-E Java, the fine-tuned model produced significantly higher-quality code with fewer defects.

Code quality      Metric                    Base model    SonarSweep fine-tuned model    Change
Reliability       Bugs / KLOC               0.90          0.53                           ▼ ~41%
Security          Vulnerabilities / KLOC    0.41          0.24                           ▼ ~41%
Maintainability   Code Smells / KLOC        20.04         16.29                          ▼ ~18%

Note: KLOC = thousand lines of code.
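For reference, the Change column is the relative reduction against the base model, which a quick calculation confirms:

```python
# Relative reduction per KLOC for each row of the table above.
for name, base, tuned in [("bugs", 0.90, 0.53),
                          ("vulnerabilities", 0.41, 0.24),
                          ("code smells", 20.04, 16.29)]:
    print(f"{name}: {(base - tuned) / base:.1%} fewer per KLOC")
# bugs: 41.1% fewer, vulnerabilities: 41.5% fewer, code smells: 18.7% fewer
```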

3. Other languages and general ability

While the model was optimized exclusively for Java, we observed no significant degradation in functional correctness on a selection of non-target languages. Furthermore, the model’s general question-answering capabilities remained intact, achieving 78.12% accuracy on the MMLU benchmark—a negligible 0.79% difference from the base model.

These benchmark scores demonstrate that using SonarSweep to analyze, remediate, and curate training data improves the quality of generated code in the target language without sacrificing the model's functional coding ability on other languages or its general knowledge on MMLU. The ~41% reduction in generated bugs and security vulnerabilities compared to the base model validates that models trained on high-quality data don't just write code that works; they write code that is safer and more reliable.

What this model is (and is not)

To ensure this model is used correctly by the community, we want to be transparent about its scope: it is a demonstration of how using SonarSweep for fine-tuning can reduce downstream technical debt in LLM-generated code. 

This model operates exclusively as a low-reasoning model, derived from gpt-oss-20b-low and hard-coded to a low-reasoning profile. It is optimized for speed and standard conversational tasks rather than complex chain-of-thought processing.

Evaluation and access

For teams looking to train their own models or fine-tune existing ones, these results are clear: leveraging SonarSweep to boost your data quality can lead to significant improvements in the security, reliability and maintainability of LLM-generated code. More details are available on the HuggingFace model card. We invite the community to review the full metrics and test the model.
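For a quick local test, a standard transformers loading snippet is enough. Note that the repository ID below is a placeholder, so check the HuggingFace model card for the exact name and recommended generation settings.

```python
# Quick-start sketch for trying the model locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SonarSource/SonarSweep-java-gpt-oss-20b"  # placeholder repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user",
     "content": "Write a Java method that parses an ISO-8601 date string "
                "and returns a java.time.LocalDate, handling invalid input."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```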

We welcome feedback on the Sonar Community forum.