Press release

Study finds shared strengths and common challenges across popular LLMs

Study also identifies distinct coding “personalities” behind Anthropic's Claude Sonnet 4 and Claude 3.7 Sonnet, OpenAI's GPT-4o, Meta's Llama-3.2-vision:90b, and the open-source OpenCoder-8B

AUSTIN – August 13, 2025 – Today, Sonar published a comprehensive new study analyzing the quality and security of software code produced by top Large Language Models (LLMs), finding significant strengths as well as material challenges across the tested models. The research also reveals that each model evaluated has a distinct and measurable “coding personality.” These findings add a richer view beyond standard performance benchmarks, giving developers and technology leaders deeper insight into how to more safely and effectively embed AI solutions into their software development process.

"The rapid adoption of LLMs for writing code is a testament to their power and effectiveness,” said Tariq Shaukat, CEO of Sonar. “To really get the most from them, it is crucial to look beyond raw performance to truly understand the full mosaic of a model’s capabilities. Understanding the unique personality of each model, and where they have strengths but also are likely to make mistakes, can ensure each model is used safely and securely.” 

To conduct the study, Sonar employed a proprietary analysis framework for assessing LLM-generated code, tasking the LLMs with over 4,400 Java programming assignments. The LLMs evaluated in the study included Anthropic's Claude Sonnet 4 and Claude 3.7 Sonnet, OpenAI's GPT-4o, Meta's Llama-3.2-vision:90b, and OpenCoder-8B.

The Sonar team observed four key findings: 

  1. The models tested had shared strengths: 
    • All models show a strong ability to generate syntactically correct code and boilerplate for common frameworks and functions, which reliably speeds up the initial stages of development. For example, Claude Sonnet 4's success rate of 95.57% on HumanEval demonstrates a very high capability to produce valid, executable code.
    • The models possess a strong foundational understanding of common algorithms and data structures. They can create viable solutions for well-defined problems, which serve as a solid starting point for more complex features. The “weighted test Pass@1 average” provides a balanced measure of this capability, and the scores achieved by models like Claude 3.7 Sonnet (72.46%) and GPT-4o (69.67%) confirm a high degree of reliability in producing correct solutions.
    • The models are highly effective at translating code concepts and snippets from one programming language to another. This makes them a powerful tool for developers who work with different technology stacks.
  2. The models also had shared flaws: 
    • All evaluated models demonstrated significant gaps in security. Critical flaws such as hard-coded credentials and path-traversal injections were common across all models (the sketch following this list illustrates both flaw classes). While the exact prevalence varies between models, every evaluated LLM produced a high percentage of vulnerabilities with the highest severity ratings: for Llama-3.2-vision:90b, over 70% of its vulnerabilities were rated ‘blocker’ severity; for GPT-4o, the figure is 62.5%; and for Claude Sonnet 4, it is nearly 60%.
    • All models tested also showed a bias toward messy code: over 90% of the issues found were "code smells"—indicators of poor structure, low maintainability, and future technical debt. 
  3. The research also discovered unique "coding personalities" for each LLM based on a quantifiable analysis across three personality traits:  
    • Verbosity: The sheer volume of code a model generates to solve a given set of tasks.
    • Complexity: The structural and logical intricacy of the generated code, measured by metrics like cyclomatic and cognitive complexity.
    • Communication and documentation: The density of comments in the code, which reveals the model's tendency to explain its work.
  4. A surprising and crucial insight from the study is that improved functional performance was often accompanied by much higher levels of risk. While Claude Sonnet 4 improved its performance benchmark pass rate by 6.3% over Claude 3.7 Sonnet, meaning it solved more problems correctly, this performance gain came at a price: the percentage of high-severity bugs rose by 93%.
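
For readers less familiar with the flaw classes named in the findings above, the short Java sketch below shows what a hard-coded credential and a path-traversal injection can look like. It is a hypothetical example written for this release (the class, method, and values are invented for illustration); it is not code produced by any of the tested models.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    public class ReportService {

        // Hard-coded credential: a secret embedded in source code is visible
        // to anyone with repository access. (Hypothetical value.)
        private static final String DB_PASSWORD = "s3cr3t-admin-pw";

        // Path traversal: user input is joined to a base directory without
        // validation, so a name like "../../etc/passwd" escapes the intended folder.
        public byte[] loadReport(String userSuppliedName) throws IOException {
            File report = new File("/var/reports/" + userSuppliedName);
            return Files.readAllBytes(report.toPath());
        }
    }

A safer version would read the secret from a vault or environment variable and canonicalize the resolved path to confirm it stays inside the intended directory.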

As the industry deepens its reliance on AI for code generation—Gartner predicts 90% of enterprise software engineers will use AI code assistants by 2028—this report highlights an urgent need for a more nuanced understanding that supplements performance benchmark scores with a direct assessment of the code's security, reliability, and maintainability. The findings demonstrate the critical need for a "trust and verify" approach, including robust governance and analysis of all AI-generated code. This allows organizations to benefit from the vast capabilities of LLMs while effectively managing the inherent risks that come with them.

Download the full report: https://www.sonarsource.com/resources/the-coding-personalities-of-leading-llms/ 

How Sonar can help

Sonar’s integrated code quality and code security solution, SonarQube, analyzes all code, including LLM-generated code, by serving as the verification and governance tool in a “trust and verify” approach. 

When customers integrate SonarQube as part of their development lifecycle, they can fuel AI-enabled development while building trust into every line of code. The solution provides a consistent review process for security, reliability, and maintainability before the code enters production. 

For organizations embracing generative AI solutions, SonarQube can help specifically by: 

  • Detecting security flaws: SonarQube’s engine, with its experience in detecting vulnerabilities, can identify security issues, including the non-local data-flow problems that require taint tracking to uncover, an area where LLMs face challenges.
  • Enforcing engineering discipline: The study found that all LLMs struggle with core software engineering tenets, consistently creating severe bugs like resource leaks and API contract violations. Rules such as java:S2095, which requires that resources be closed, illustrate how SonarQube flags the resource leaks these models generate (see the sketch after this list).
  • Managing technical debt and code smells: The most common issue identified in the report was an inherent bias towards messy code, with code smells making up over 90% of all issues found for every model, as noted above. These smells, which include dead and redundant code, contribute to long-term technical debt. SonarQube is designed to detect these code smells, providing the analysis needed to clean up the code and make it maintainable.
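
As one concrete illustration of the engineering-discipline point above, the minimal Java sketch below shows a resource leak of the kind rule java:S2095 (“Resources should be closed”) reports, alongside the try-with-resources fix. The class and methods are invented for illustration; this is not output from any tested model.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ConfigReader {

        // Leaky: the reader is never closed, so the underlying file handle
        // is lost; java:S2095 flags this unclosed resource.
        public String firstLineLeaky(String path) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            return reader.readLine();
        }

        // Fixed: try-with-resources closes the reader on every path,
        // whether the method returns normally or throws.
        public String firstLineSafe(String path) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                return reader.readLine();
            }
        }
    }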

To learn more about SonarQube or to try it for free, please visit: https://www.sonarsource.com/products/sonarcloud/

About Sonar

Sonar is the industry standard for automated code review, integrating code quality and code security into a single platform built for the AI-coding era. Sonar provides the essential, independent verification of all code—AI-generated and developer-written—so development teams can find and fix security, reliability, and maintainability issues quickly and effectively. Rooted in the open source community, Sonar's solutions support over 35 programming languages and are used by 7M+ developers across 400K organizations, including Barclays, MasterCard, and T-Mobile.

To learn more about Sonar, please visit: www.sonar.com 

Cautionary note regarding forward-looking statements

This press release may contain forward-looking statements about future expectations, plans, and prospects. These statements are based on current beliefs and assumptions and are subject to risks and uncertainties. The field of AI and large language models is rapidly evolving, and actual results may differ. The information in this press release is provided as of this date, and we undertake no obligation to update any statements.