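Build trust in your coding models
Proactively eliminate systemic flaws from training data to train foundation models that are "secure by design."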

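The quality of AI-generated code depends on the quality of the data used to train the large language model. Research shows that even a small amount of low-quality data can "poison" a model disproportionately, causing it to generate flawed and insecure code.
The vast public datasets that underpin most LLMs are in fact a chaotic mix of high-quality code and snippets riddled with bugs and security vulnerabilities.
During training, an LLM internalizes these flawed patterns and cannot tell good code from bad. It learns to reproduce the same mistakes it was trained on.
When it generates code, the LLM reproduces these bugs and vulnerabilities accordingly. The defects can make their way into products, so the output requires rigorous verification.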
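Generative AI is changing the way we write code, but LLMs have a critical limitation: the code they generate often hides bugs, security vulnerabilities, and maintainability debt. For LLM providers and for enterprises with higher quality standards, the need to fine-tune and customize models is urgent. SonarSweep provides a critical data quality assurance layer for the following:
Build models that are secure and reliable by design by optimizing training data at the source, giving customers a competitive edge in the market.
Develop custom models with confidence in private environments, helping customers meet strict compliance requirements and protect sensitive intellectual property.
Create high-performance, cost-effective small language models (SLMs) for specialized agentic workflows on platforms such as Databricks and IBM.
Build more capable models with less data and compute by optimizing the training dataset, reaching state-of-the-art performance within budget.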


SonarSweep automatically analyzes and remediates thousands of bugs, vulnerabilities, and code quality issues in training datasets at scale.

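A strict filtering process removes low-quality code. The refined dataset is then balanced to keep learning diverse and representative, which results in a highly capable model.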

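The final "cleaned" dataset is an optimized, high-quality asset that is ready for model training and significantly improves the quality of the generated code.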

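SonarSweep uses Sonar's industry-leading code analysis engines to automatically process massive volumes of training code, fix issues, and turn flawed data into high-quality training samples.
By fixing code rather than deleting it, we preserve valuable learning samples for the model and improve its understanding of complex patterns.
Our engine turns bad examples into good ones, systematically raising the overall quality and security of the entire dataset.
SonarSweep is built on the analysis engine trusted by more than 7 million developers worldwide, an engine that has secured 700 billion lines of code globally.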
SonarSweep is a product from Sonar that remediates, secures, and optimizes coding datasets used to train AI language models. It is designed for AI companies and model builders — not for software development teams managing their own codebases.
Coding LLMs are typically trained on large volumes of publicly available open-source code, which frequently contains bugs, security vulnerabilities, and poor patterns. Models learn from these flawed examples and reproduce — and in many cases amplify — those flaws in the code they generate. SonarSweep addresses this at the root by cleaning and improving the training data before it is used to train or fine-tune a model.
SonarSweep shares its underlying code analysis engines with SonarQube and SonarQube Cloud, but it is a completely separate service and does not integrate with either product. It is not an add-on, extension, or feature of any SonarQube edition.
Where SonarQube and SonarQube Cloud help development teams detect quality and security issues in their own application code during development and CI/CD, SonarSweep processes large code datasets that AI companies use to train models. The relationship is a shared technological foundation — Sonar's analysis engines — applied to an entirely different use case and a different customer.
Coding LLMs are pre-trained on raw public open-source code — code that's full of bugs, vulnerabilities, and poor patterns. Models don't just absorb these flaws; they amplify them in everything they generate. SonarSweep fixes this at the source by cleaning training data before a model ever sees it.
It reduces security vulnerabilities in model output by up to 67% and cuts bugs by up to 42%. It also handles a subtler problem: naively removing flawed code can skew language distribution in a dataset, so SonarSweep rebalances after cleaning to preserve model proficiency across all languages. And by addressing quality upfront, it eliminates the need for costly post-training correction passes.
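The rebalancing idea can be illustrated with a small Python sketch. This is not SonarSweep's actual implementation, which the text does not describe. It assumes a hypothetical sample format (a dict with `language` and `issue_count` keys), a hypothetical `max_issues` threshold, and simple downsampling as the balancing strategy. The function names `filter_clean`, `language_shares`, and `rebalance` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def filter_clean(samples, max_issues=0):
    """Naive filter: keep samples whose (hypothetical) issue count is within the limit."""
    return [s for s in samples if s["issue_count"] <= max_issues]


def language_shares(samples):
    """Return the fraction of samples written in each language."""
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def rebalance(original, cleaned, seed=0):
    """Downsample `cleaned` so its language shares match those of `original`."""
    target = language_shares(original)
    by_lang = defaultdict(list)
    for s in cleaned:
        by_lang[s["language"]].append(s)
    # Largest dataset size at which every language can still meet its target share.
    total = min(len(by_lang[lang]) / share for lang, share in target.items())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for lang, share in target.items():
        k = min(len(by_lang[lang]), round(total * share))
        balanced.extend(rng.sample(by_lang[lang], k))
    return balanced


# Toy dataset: 60% Python, 40% Java; half of the Python samples have issues.
original = (
    [{"language": "python", "issue_count": i % 2} for i in range(6)]
    + [{"language": "java", "issue_count": 0} for _ in range(4)]
)
cleaned = filter_clean(original)          # 3 Python + 4 Java: Python share has dropped
balanced = rebalance(original, cleaned)   # 3 Python + 2 Java: 60/40 split restored
print(Counter(s["language"] for s in balanced))
```

Downsampling is the simplest way to restore the original distribution, but it discards clean samples. Remediating flawed code instead of deleting it, as described above, would shrink the gap that rebalancing has to close.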
SonarQube for IDE (formerly SonarLint) is a developer productivity tool that runs inside editors like VS Code, IntelliJ, and Eclipse, giving individual developers real-time feedback on quality and security issues as they write code. It operates at the developer level, in the IDE, during active development.
SonarSweep is not a developer tool at all. It is a data processing service for AI companies that are training or fine-tuning coding LLMs. It does not run in an IDE, does not provide feedback to developers, and is not part of a development workflow.
Yes — this is the core purpose of SonarSweep. The quality of code a language model generates is directly shaped by the quality of the data it trained on. A model that learned from code full of vulnerabilities and bugs will reproduce those patterns at scale. SonarSweep intervenes at the data stage, before training, to raise the quality floor of what the model learns from.
Models trained on SonarSweep-prepared datasets have demonstrated up to 67% fewer security vulnerabilities and up to 42% fewer bugs in their generated code compared to models trained on unswept data — with no degradation in functional performance. This was validated on the GPT-OSS-20B model.
SonarSweep supports 35+ programming languages, drawing on the full breadth of Sonar's code analysis engines — the same engines that power SonarQube and SonarQube Cloud.
In the context of LLM training data, this means SonarSweep can analyze, filter, and remediate code across all the languages that typically appear in large public code datasets: common back-end languages, front-end languages, scripting languages, systems languages, and more. Across these languages, it can identify and automatically fix over 6,700 distinct types of quality and security issues.
SonarSweep doesn't produce code changes for developers to review in pull requests. It processes and delivers cleaned training datasets to AI companies. Governance in this context sits with the AI team — validating dataset quality and model output before using the swept data in a training run.
No. SonarSweep has no connection to any SonarQube edition. It is a separate product for companies building or fine-tuning coding LLMs — not a feature unlocked through any SonarQube subscription tier.
The ROI is for AI companies, not development teams. Models trained on SonarSweep-processed data produce up to 67% fewer security vulnerabilities and up to 42% fewer bugs — with no loss in functional performance. It also reduces training cost by addressing data quality upfront, eliminating expensive post-training correction cycles.