Sonar Announces SonarSweep to Improve Training Data Quality for Coding LLMs

GENEVA & AUSTIN – October 21, 2025 – Sonar, the industry standard for code quality and automated code review, today announced the launch of SonarSweep™, a service designed to improve the training of LLMs optimized for coding applications. SonarSweep tackles the root cause of bugs and vulnerabilities in AI-generated code, enabling organizations to build safer, more reliable AI models used by coding assistants. The service is now available in early access.

Recent research published by leading institutions, including foundation model pioneer Anthropic, highlights how sensitive model performance is to low-quality and malicious training data. LLMs used in software development are susceptible to these issues because they are generally trained on a large corpus of publicly available open source code, which often contains security issues and bugs. These flaws are amplified through the model training process; even a small amount of flawed data can disproportionately degrade the output of models of any size.

SonarSweep is engineered to systematically remediate, optimize, and secure coding datasets for model training. It proactively ensures that models learn from high-quality, secure examples, from pre-training through model alignment, an essential step in building reliable AI coding models. Models trained on data prepared by SonarSweep produced code with up to 67% fewer security vulnerabilities and up to 42% fewer bugs than models trained on the original, un-swept data. This improvement in quality and security was achieved without loss in functional performance. Additional details on the extensive testing can be found here.

SonarSweep’s effectiveness comes from its unique ability to identify and automatically fix a wide range of code quality and security issues, from critical vulnerabilities to subtle bugs, directly within the training data itself. Applications include improving foundation model pre-training and post-training, improving existing models through reinforcement learning that leverages ‘swept’ data, and using distillation techniques to create Small Language Models (SLMs) for use with AI agents and other specialized purposes.

“The latest research confirms what we’ve suspected: data quality is the Achilles’ heel of AI code generation,” said Tariq Shaukat, CEO of Sonar. “The best way to boost software development productivity, reduce risks, and improve security is to tackle the problem at inception—inside the models themselves. Vibe engineering leveraging models enhanced through SonarSweep will have fewer issues in production, reducing the burden on developers and enterprises. Combined with strong verification practices, we believe this will substantially remove a major bottleneck in AI software development.”

Sonar is uniquely positioned to offer this service to the market as the company’s flagship product SonarQube analyzes the security, reliability, and maintainability of over 750 billion lines of code each day for more than 7 million developers worldwide.

Availability: SonarSweep is now available through early access. Those interested in participating can submit a request.

About Sonar

Sonar is the trust and verification layer for AI code, and the industry standard for automated code review for 17+ years. Integrating code quality and code security into a single platform, Sonar delivers deterministic, repeatable, and actionable code verification at scale, analyzing over 750 billion lines of code daily to ensure software is secure, reliable, and maintainable. Rooted in the open source community, Sonar is trusted by 7M+ developers globally, including teams at Snowflake, Booking.com, Deutsche Bank, AstraZeneca, and Ford Motor Company.

To learn more about Sonar, please visit: www.sonar.com

Cautionary note; forward-looking statements

This press release may contain forward-looking statements, including but not limited to future expectations, plans, and product capabilities. These statements are based on current beliefs and assumptions and are subject to risks and uncertainties. The information in this press release is provided as of this date, and we undertake no obligation to update any statements.

