OpenAI GPT-5.5: An evaluation

10 min read


Prasenjit Sarkar

Solutions Marketing Manager

OpenAI logo under a magnifying glass, symbolizing an evaluation of the OpenAI LLM security model on a technical grid.

TL;DR overview

  • OpenAI’s latest model, GPT-5.5, delivers some of the strongest security metrics we have analyzed to date.
  • Security is a definitive strength for GPT-5.5, featuring a low vulnerability density of 75 per mLOC. 
  • With a flat distribution across all severity levels, the model proves it isn't just avoiding simple catches.
  • Concurrency remains a challenge across LLMs, as threading bugs at around 170 per mLOC dominate the overall profile.
  • Verification debt compounds as high-volume, complex outputs outpace manual review, shifting the burden of proof to engineering.

GPT-5.5 is the latest model from OpenAI, and it delivers huge improvements in a key area: security. In fact, its security numbers are some of the best we’ve seen. Vulnerability density is low, consistent across runs, and flat across severity levels. That's the headline. But, as with all models, there’s a more nuanced, complex story when we dig below the surface. 

We ran GPT-5.5 through Sonar's LLM evaluation framework, which is designed to measure LLM-generated code against the same rules as a developer-written codebase.

What was measured

Model: GPT-5.5

Language: Java

Benchmark: 4,444 tasks

Runs: 10 independent runs at temperature=1.0, reasoning_effort=medium

Analyzer: SonarQube systematic code analysis. Density metrics are per 1,000 lines of code (kLOC); category breakdowns are per million lines (mLOC). 

There are two important terms to define before diving into the results:  

  • Cyclomatic complexity: Counts independent paths through a function. 
  • Cognitive complexity: This SonarQube metric weights nested and deeply branched logic more heavily, reflecting how difficult the code is for a human to read. 

Neither metric captures correctness. Both correlate with how long review and testing take.
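To make the distinction concrete, here is a minimal Java sketch (the class and method names are mine): two methods with the same four execution paths, and therefore the same cyclomatic complexity, but very different cognitive complexity because one nests its branches.

```java
public class ComplexityDemo {
    // Guard clauses: cyclomatic complexity 4, cognitive complexity 3
    // (three flat `if`s, no nesting penalty).
    public static String classifyFlat(int age) {
        if (age < 0) return "invalid";
        if (age < 18) return "minor";
        if (age < 65) return "adult";
        return "senior";
    }

    // Same four paths, so cyclomatic complexity is still 4, but nesting
    // raises cognitive complexity to roughly 8 under SonarQube's rules:
    // each nested branch adds an extra increment per level of nesting.
    public static String classifyNested(int age) {
        if (age >= 0) {
            if (age < 18) {
                return "minor";
            } else {
                if (age < 65) {
                    return "adult";
                } else {
                    return "senior";
                }
            }
        }
        return "invalid";
    }

    public static void main(String[] args) {
        // The two methods agree on every input; only readability differs.
        System.out.println(classifyFlat(30) + " == " + classifyNested(30));
    }
}
```

Both methods behave identically; the reviewer's cost differs. That gap is what cognitive complexity measures and cyclomatic complexity misses.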

Key metrics at a glance

Metric                            GPT-5.5
Lines of code (total)             703,324
Comments (% of LOC)               2.0%
Cyclomatic complexity per kLOC    251.1
Cognitive complexity per kLOC     151.8
Bug density per mLOC              520
Vulnerability density per mLOC    75
Code smell density per kLOC       17.1
Overall issue density per kLOC    17.7
Functional skill (pass rate)      78.7%
Missing completions               0.18%

Volume and style

Across the 4,444-task benchmark, GPT-5.5 generated 703,324 lines of code, and only 2% of that output is comments. In practical terms, for every 100 lines a developer opens in review, roughly two contain any explanation. Comments are not the only route to clarity; well-named functions and variables can carry much of the explanation. Still, that ratio compounds with the output volume: more code and less documentation means more cognitive load on anyone touching that code after generation.

Complexity

Cognitive complexity is 151.8 per kLOC and cyclomatic complexity is 251.1 per kLOC. Cognitive complexity, as measured by SonarQube, tracks how difficult it is for a human to understand a piece of code. It penalizes nested conditionals, loops inside loops, and branching logic where a reader has to hold multiple states in their head. Code with high cognitive complexity is harder to review accurately, harder to write good tests for, and harder to modify without introducing new bugs.

Roughly 700,000 lines of elevated-complexity code, with minimal comments, create a review surface where errors are easy to miss. Independent, deterministic code reviews become more important than ever.

Bug density and severity

Overall bug density is 0.52 per kLOC (520 per mLOC). The table below details the severity distribution.

Severity    Per mLOC
Blocker     43
Critical    26
Major       232
Minor       220

Blocker and critical counts are low (43 and 26 per million lines), and that's what matters most for production stability. But there's a long tail of major and minor issues: 232 and 220 per million lines that don't cause immediate failures but accumulate into technical debt, slow down future changes, and occasionally surface as bugs once the codebase evolves. At GPT-5.5's output volume, that adds up quickly.

The one category worth calling out is concurrency and threading bugs, at 170 per mLOC. That's substantially higher than any other bug category, as the table below shows.

Bug category               Per mLOC
Concurrency / threading    170
Resource / stream leaks    67
Exception handling         54
Type safety / casts        27

Concurrency bugs are expensive. They're hard to reproduce in testing, tend to be environment dependent, and can produce intermittent failures requiring significant time to diagnose. The elevated rate here is consistent with a model that is generating more code and more concurrent patterns as part of that volume.
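The most common shape of these bugs is an unsynchronized read-modify-write. A minimal Java sketch (class and field names are mine) of the racy pattern and its atomic fix:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    // Racy: `unsafeCount++` is a read-modify-write, so concurrent
    // increments can overwrite each other and updates get lost.
    public static int unsafeCount = 0;

    // Safe: AtomicInteger performs the read-modify-write atomically.
    public static final AtomicInteger safeCount = new AtomicInteger();

    // Run the given task 100,000 times on each of 4 threads.
    public static void run(Runnable task) {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) task.run();
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        run(() -> unsafeCount++);          // often ends below 400,000
        run(safeCount::incrementAndGet);   // always exactly 400,000
        System.out.println(unsafeCount + " vs " + safeCount.get());
    }
}
```

The racy version frequently passes a single-threaded test, which is exactly why these bugs are hard to catch without static analysis or stress testing.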

Vulnerability density and severity

Vulnerability density is 75 per mLOC. This is one of the cleaner security profiles we've seen. The severity breakdown holds up at every level, as the table below shows.

Severity    Per mLOC
Blocker     18
Critical    20
Major       15
Minor       22

The distribution is flat. Blockers and criticals aren't disproportionately high relative to major and minor, so the model isn't just avoiding trivially detectable issues while leaving deeper ones in place. 

The top vulnerability categories are shown below; together they cover roughly 43% of total vulnerabilities, and the balance spreads across smaller categories.

Vulnerability category           Per mLOC
Cryptography misconfiguration    17
XML external entity (XXE)        8
Path traversal / injection       7

Cryptography misconfigurations—weak algorithms, insecure key sizes, improper use of random number generators—are one of the more common failure modes in AI-generated code. But at 17 per mLOC, GPT-5.5 keeps that category manageable with automated detection. Path traversal and injection is particularly low at 7 per mLOC. Security is a clear strength, and the consistency of the numbers makes that credible.
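A typical instance of the cryptography-misconfiguration category, sketched in Java (helper names are mine): hashing with a broken algorithm versus a current one. Analyzers such as SonarQube flag the MD5 variant.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class CryptoDemo {
    // Flagged pattern: MD5 is cryptographically broken and must not be
    // used for any security-sensitive purpose.
    public static String weakHash(byte[] data) {
        try {
            return HexFormat.of().formatHex(
                MessageDigest.getInstance("MD5").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Preferred: SHA-256 (or stronger) for integrity checks.
    public static String strongHash(byte[] data) {
        try {
            return HexFormat.of().formatHex(
                MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] msg = "hello".getBytes();
        System.out.println(weakHash(msg));   // 32 hex chars (128-bit digest)
        System.out.println(strongHash(msg)); // 64 hex chars (256-bit digest)
    }
}
```

The same rule family covers insecure key sizes and `java.util.Random` used where `java.security.SecureRandom` is required; the fix is equally mechanical, which is why this category responds so well to automated detection.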

Maintainability signals

Code smell density is 17.1 per kLOC, driven primarily by collection and generics parameter-type issues: cases where the model uses raw types instead of properly parameterized generics, or where collection handling bypasses type safety in ways that don't cause immediate failures but create technical friction over time. In Java, these issues carry real cost: they bypass compile-time type checks, make refactoring harder, and can mask bugs a properly typed implementation would catch at compile time.
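The raw-type smell looks like this in practice; a minimal Java sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

public class RawTypeDemo {
    // Code smell: a raw List accepts any element type, so the mistaken
    // add(42) compiles silently and only fails later, at runtime, as a
    // ClassCastException when someone reads the element as a String.
    public static List buildRaw() {
        List items = new ArrayList();
        items.add("id-1");
        items.add(42); // compiles, but breaks the implied contract
        return items;
    }

    // Parameterized version: the bad add() no longer compiles, so the
    // bug is caught at compile time instead of in production.
    public static List<String> buildTyped() {
        List<String> items = new ArrayList<>();
        items.add("id-1");
        // items.add(42); // compile-time error with generics
        return items;
    }

    public static void main(String[] args) {
        System.out.println(buildRaw().size() + " vs " + buildTyped().size());
    }
}
```

Nothing fails when this code is generated or merged, which is what makes the category a maintainability cost rather than a bug count: the damage surfaces during later refactoring.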

The overall code smell number is not trivial, but it should be read alongside the output volume. A density of 17.1 per kLOC across 700,000 lines is a larger absolute number than the same density across a more concise output. Combined with comment density of 2%, teams maintaining this code will find fewer signposts and more accumulated minor issues to address as the codebase ages.

Functional skill

The passing test rate is 78.7%, with missing completions at just 0.18%, which means the model completes tasks reliably. But the 78.7% pass rate is where verification earns its keep: more than one in five generated solutions doesn't pass functional tests, and that rate isn't predictable in advance for any individual task. Code review and testing pipelines catch what the model misses.

What this means for developer teams using GPT-5.5

GPT-5.5's security numbers are low, stable, and flat across severity levels. For teams where security is a primary acceptance criterion for AI-generated code, those numbers matter.

The areas requiring active management are more structural than any specific bug type. The number of lines of code is large, the comments are sparse, and the cognitive complexity is elevated. Those three factors together raise the per-task cost of human review.  

If the code being generated is concurrent by nature, build in the assumption from the start that threading issues will need to be caught at the testing and analysis stage, not the generation stage. The model generates them at a higher rate than other bug categories, and they aren't reliably visible in code review alone.

Three takeaways:

  • Security is GPT-5.5's clear strength. Vulnerability density is 75 per mLOC (0.075 per kLOC) and the distribution is flat across severity, meaning the model is not just avoiding easy findings.
  • Concurrency is a weak spot. Threading bugs at 170 per mLOC dominate the bug profile.
  • Volume, sparse comments, and elevated cognitive complexity shift verification cost onto the team. This is verification debt in practice: the model generates faster than an unaided team can verify, and the verification gap is where issues land.

GPT-5.5 is a very capable code generator with a strong security profile, and that finding does not remove the need for verification—rather, it changes what verification should focus on. 

GPT-5.5's full evaluation results, along with those of all other evaluated models, are available on the Sonar LLM Leaderboard.
