GPT-5.5’s biggest blind spot: the Java bugs your tests won’t catch

8 min read

Killian Carlsen-Phelan

Developer Content Engineer

TL;DR overview

  • Concurrency bugs are among the hardest defects to catch in AI-generated Java code because they pass functional tests but fail under production thread timing.
  • Sonar’s LLM Leaderboard analysis shows concurrency bug density varies 7x across models, with GPT-5.5 producing 170 bugs per million lines of code.
  • Common failure patterns include broken double-checked locking, unsound synchronization on value-based classes like Boolean, and holding locks during Thread.sleep() calls.
  • Static analysis identifies these thread-safety risks by analyzing code structurally, catching defects that standard test frameworks cannot reliably trigger.

Sonar's LLM Leaderboard evaluations have analyzed millions of lines of AI-generated Java code across multiple models. Concurrency bugs show up in every model's output, but at rates that vary more than almost any other bug category.

What doesn't vary is the failure mode. These bugs compile and pass functional tests but break in production because their correctness depends on thread timing that no test framework controls. The patterns behind them are well-documented and detectable through static code analysis, but they live in the gap between code that passes tests and code that is thread-safe.

How concurrency rates vary across models

Sonar's evaluation framework runs each model through thousands of Java coding tasks (4,444 for the GPT-5.5 evaluation), executing multiple independent runs and analyzing the output with SonarQube's Java coding rules. The table below shows concurrency bug density for a sample of evaluated models.

| Model | Concurrency bugs per million LOC |
| --- | --- |
| GPT-5.2 High | 470 |
| GPT-5.1 High | 241 |
| GPT-5.5 | 170 |
| Claude Opus 4.5 Thinking | 133 |
| Claude Sonnet 4.5 | 129 |
| Gemini 3.0 Pro | 69 |

The absolute rates span a 7x range across these models alone, and the leaderboard includes additional models that widen the picture further. Concurrency accounts for nearly 50% of all bugs in some model configurations and under 3% in others: for some models it is the dominant bug category by a wide margin, while others are led by exception handling or type safety instead. Either way, a double-checked locking violation or a lock held during sleep behaves the same in production regardless of which model generated it.

Three patterns to watch for

The concurrency bugs that surface in these evaluations share a trait regardless of rate: their correctness depends on execution ordering and runtime object identity, not on what's written in the method body. A resource leak is visible in the code itself because you can point to the missing close() call. Whether double-checked locking is safe depends on the Java Memory Model's happens-before guarantees, and whether a synchronized block actually provides mutual exclusion depends on which object you're locking on and whether the JVM might be sharing that object with unrelated code. These are properties of how the program runs, not how it reads, and they're why concurrency bugs survive functional testing: a test exercises one execution ordering, and the bug lives in a different one.

The three patterns below, drawn from SonarQube's Java concurrency rules, each represent a different failure mode: a broken initialization sequence, a wrong lock object, and a lock held during sleep.

Double-checked locking (S2168)

Double-checked locking is meant to avoid synchronizing every call to a singleton accessor by checking null before and after the synchronized block:

public class ResourceFactory {
    private static Resource resource;

    public static Resource getInstance() {
        if (resource == null) {
            synchronized (ResourceFactory.class) {
                if (resource == null)
                    resource = new Resource();
            }
        }
        return resource;
    }
}

The "Double-Checked Locking is Broken" Declaration documented this failure in 2000. Without volatile on the resource field, the JVM is free to reorder the field assignment and the constructor completion, which means thread B can see a non-null reference to a partially constructed Resource while thread A is still inside new Resource(). The outcome depends entirely on timing, so no test suite catches it reliably. The pattern dates back to a time when synchronized methods carried significant overhead, and the double-checked idiom was widely taught as a standard optimization. Modern JVMs have closed much of that performance gap, making the synchronized version both safer and fast enough that the performance argument for double-checked locking no longer holds.

The fix is to synchronize the method:

public static synchronized Resource getInstance() {
    if (resource == null)
        resource = new Resource();
    return resource;
}
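If the unsynchronized fast path matters, the double-checked structure can also be repaired rather than abandoned: declaring the field volatile restores the happens-before edge the broken version was missing. A minimal sketch, with a placeholder Resource class standing in for the real type:

```java
class ResourceFactory {
    static class Resource { }  // placeholder for the real resource type

    // volatile forbids the reordering that lets another thread observe
    // a non-null reference to a partially constructed Resource
    private static volatile Resource resource;

    public static Resource getInstance() {
        Resource local = resource;  // one volatile read on the fast path
        if (local == null) {
            synchronized (ResourceFactory.class) {
                local = resource;
                if (local == null) {
                    local = new Resource();
                    resource = local;
                }
            }
        }
        return local;
    }
}
```

The local variable isn't required for correctness, but it limits the common path to a single volatile read.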

If method-level synchronization is too coarse, an inner static holder class achieves lazy initialization through the JVM's class-initialization guarantee, with no explicit synchronization needed:

private static class ResourceHolder {
    static final Resource resource = new Resource();
}

public static Resource getResource() {
    return ResourceHolder.resource;
}

The JVM guarantees that ResourceHolder is not initialized until getResource() is first called, and class initialization is inherently thread-safe per JLS 12.4, so this approach is both lazy and correct without any synchronization code.
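The laziness is easy to observe. In this sketch, a flag (added purely for illustration; it is not part of the idiom) records when the Resource constructor runs, and it stays false until the holder is first touched:

```java
class HolderDemo {
    static volatile boolean resourceCreated = false;  // illustration only

    static class Resource {
        Resource() { resourceCreated = true; }
    }

    private static class ResourceHolder {
        static final Resource RESOURCE = new Resource();
    }

    public static Resource getResource() {
        // first call triggers ResourceHolder's class initialization,
        // which the JVM serializes across threads (JLS 12.4)
        return ResourceHolder.RESOURCE;
    }
}
```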

Synchronizing on value-based classes (S1860)

The next pattern is a fundamentally different kind of failure. The synchronization mechanism itself is unsound because the lock object isn't what the developer thinks it is.

private static final Boolean bLock = Boolean.FALSE;

public void doSomething() {
    synchronized (bLock) {  // Noncompliant
        // critical section
    }
}

A private static final field used as a lock looks reasonable. The problem is that Boolean is a value-based class, and the JVM caches its instances. Every Boolean.FALSE reference in the entire application, including in third-party libraries, points to the same object in memory. Synchronizing on it means unrelated code paths can contend for the same lock, producing deadlocks with stack traces that show no logical connection between the contending threads.

The same applies to Integer.valueOf() within the cached range (-128 to 127), String literals, List.of() results, and java.time types. Two fields declared as Integer a = 0 and Integer b = 0 point to the same cached object, so synchronizing on a in one method and b in another creates a single shared lock where the developer intended two independent ones.
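The caching is easy to demonstrate with reference equality. This sketch assumes a stock JVM with the default autobox cache; the method names are illustrative:

```java
class CachedLockDemo {
    // Autoboxing 0 twice yields the same cached Integer instance,
    // so two "locks" declared this way would be one shared lock.
    static boolean insideCache() {
        Integer a = 0;
        Integer b = 0;
        return a == b;  // reference comparison: same cached object
    }

    // Outside the default -128..127 cache range, each boxing produces
    // a distinct object (unless the cache is enlarged via
    // -XX:AutoBoxCacheMax, which is why == on boxed values is a trap).
    static boolean outsideCache() {
        Integer a = 128;
        Integer b = 128;
        return a == b;  // reference comparison: distinct objects
    }
}
```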

The fix is a dedicated Object instance:

private static final Object lock = new Object();

public void doSomething() {
    synchronized (lock) {
        // critical section
    }
}

Sleeping with a lock held (S2276)

public void doSomething() throws InterruptedException {
    synchronized (monitor) {
        while (!ready()) {
            Thread.sleep(200);  // Noncompliant
        }
        process();
    }
}

Thread.sleep() pauses the current thread but does not release the monitor lock, so every other thread waiting to enter this synchronized block is frozen for the duration of the sleep. If another thread needs this lock before it can set the condition that makes ready() return true, you have a deadlock. This pattern appears naturally in polling loops and retry logic, where Thread.sleep() is the intuitive choice for introducing a delay.

Object.wait() releases the lock while waiting, allowing other threads to make progress:

public void doSomething() throws InterruptedException {
    synchronized (monitor) {
        while (!ready()) {
            monitor.wait(200);  // Releases the lock
        }
        process();
    }
}

The distinction between sleep() and wait() is fundamental to Java concurrency, but it's also the kind of semantic difference that doesn't affect whether the code compiles or passes single-threaded functional tests. The signatures are similar, the behavior in a test with one thread is identical, and the bug only surfaces under real contention.
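The wait() side only works if some thread eventually changes the condition and wakes the waiters while holding the same monitor. A minimal end-to-end sketch of that pairing (field and method names are illustrative):

```java
class WaitNotifyDemo {
    private final Object monitor = new Object();
    private boolean ready = false;

    public void awaitReady() throws InterruptedException {
        synchronized (monitor) {
            while (!ready) {      // loop guards against spurious wakeups
                monitor.wait();   // releases monitor while parked
            }
        }
    }

    public void markReady() {
        synchronized (monitor) {  // must hold the same lock to notify
            ready = true;
            monitor.notifyAll();  // wakes every thread waiting on monitor
        }
    }

    // Runs the two halves on separate threads; returns true if the
    // waiter was released after markReady().
    public static boolean demo() {
        WaitNotifyDemo d = new WaitNotifyDemo();
        Thread waiter = new Thread(() -> {
            try {
                d.awaitReady();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        waiter.start();
        try {
            Thread.sleep(50);     // give the waiter time to park
            d.markReady();
            waiter.join(2000);
        } catch (InterruptedException e) {
            return false;
        }
        return !waiter.isAlive();
    }
}
```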

Why static analysis catches what tests miss

Try writing a unit test that reliably catches broken double-checked locking. The bug only manifests when thread A's constructor call is reordered relative to the field assignment and thread B reads the field in between. Standard test frameworks don't control thread scheduling at that granularity, so the test may pass a thousand times and then fail in production under load.

Synchronizing on a cached Boolean.FALSE has the same problem: the deadlock requires two unrelated threads to hit their synchronized blocks concurrently, which a single-threaded test never exercises. Thread.sleep() inside a lock is functionally identical to Object.wait() when only one thread is running, so any test that doesn't simulate lock contention sees correct behavior from both.

In all three patterns the code is correct when executed by a single thread; the bug exists only in the interaction between threads.

SonarQube's data flow analysis reasons through code paths structurally rather than relying on runtime execution, catching patterns like double-checked locking or lock-held sleep regardless of whether any test triggered the dangerous interleaving. The Java analyzer includes over 20 rules for concurrency and synchronization, with recent additions covering virtual thread semantics for Java 21+.

Concurrency rates vary more across models than almost any other bug category, but regardless of where your model sits on that spectrum, these are the bugs your test suite is least likely to catch. The complete data is on the LLM Leaderboard, and the GPT-5.5 evaluation has the methodology behind the numbers.
