Claude Opus 4.7: An evaluation review and metrics benchmark

10 min read


Prasenjit Sarkar

Solutions Marketing Manager

TL;DR overview

  • Claude Opus 4.7 is Anthropic's flagship model, delivering 40% more concise code than version 4.6.
  • Evaluation shows an 82.52% functional pass rate with significantly improved production-critical blocker bug density.
  • High cognitive complexity and a vulnerability density of 290 per MLOC require rigorous security reviews.
  • Focus verification on increased cryptography misconfigurations and hard-coded credentials to ensure safe AI-generated code.

Claude Opus 4.7 is Anthropic's latest flagship model. Using our proprietary LLM code quality and security evaluation framework, we discovered the new model delivers a clear efficiency improvement: 40% less code for the same functional pass rate as Opus 4.6 Thinking. That's what the data says at first glance. However, upon a closer look, the picture shifts. 

What was measured

Model: Claude Opus 4.7 (Adaptive Thinking mode)

Language: Java

Benchmark: 4,444 tasks (HumanEval, MBPP, ComplexCodeEval)

Analyzer: SonarQube systematic code analysis. Density metrics are per 1,000 lines of code (kLOC); category breakdowns are per million lines (MLOC).

Two important terms to define before getting into the results:

  • Cyclomatic complexity: Counts independent paths through a function.
  • Cognitive complexity: This SonarQube metric weights nested and deeply branched logic more heavily, reflecting how difficult the code is for a human to read.

Neither metric captures correctness. Both correlate with how long review and testing take.
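To make the distinction concrete, here is a minimal Java sketch (the methods and scores are illustrative, not drawn from the benchmark output). Both methods have the same cyclomatic complexity, four independent paths, but under SonarQube's published definition the nested version scores twice as high on cognitive complexity, because each level of nesting adds an extra increment:

```java
public class ComplexityDemo {

    // Flat logic with early returns: cyclomatic complexity 4 (three
    // decision points + 1), cognitive complexity 3 (each 'if' adds +1,
    // no nesting penalty).
    static String classifyFlat(int score) {
        if (score < 0) return "invalid";
        if (score < 50) return "fail";
        if (score < 80) return "pass";
        return "distinction";
    }

    // Same four paths, so cyclomatic complexity is still 4, but
    // cognitive complexity is 6 (increments of 1 + 2 + 3 as the
    // nesting deepens) — the metric penalizes what a reader must
    // hold in their head.
    static String classifyNested(int score) {
        if (score >= 0) {
            if (score >= 50) {
                if (score >= 80) {
                    return "distinction";
                }
                return "pass";
            }
            return "fail";
        }
        return "invalid";
    }
}
```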

Key metrics at a glance

Metric | Opus 4.7 Thinking
Lines of code (total) | 336,283
Comments (% of LOC) | 3.8%
Cyclomatic complexity per kLOC | 240.63
Cognitive complexity per kLOC | 171.22
Bug density per MLOC | 800
Vulnerability density per MLOC | 290
Code smell density per kLOC | 23.01
Overall issue density per kLOC | 24.10
Functional skill (pass rate) | 82.52%
Missing completions | 0.45%

Volume and style

Claude Opus 4.7 produced 336,283 lines of code across 4,444 tasks. For the same tasks, Opus 4.6 Thinking produced 566,389 lines, so Opus 4.7 writes roughly 40% fewer lines for the same work. The functional pass rates are 82.52% and 82.55%: effectively the same pass rate on the same tasks, with far fewer lines of code.

Comments dropped to 3.8% of the output, down from 8.2% in Opus 4.6. The code is more compact and less annotated. If you're maintaining this code past the immediate task, you'll have less inline context to work from.

Complexity

Cognitive complexity for Opus 4.7 is 171.22 per kLOC and cyclomatic complexity is 240.63 per kLOC. Cognitive complexity is up from Opus 4.6 Thinking's 132.1 per kLOC. The code is shorter but denser, with more branching logic and nested control flow per thousand lines. When models write less code, they often pack more logic per line, so reviewing each line takes more effort even though there are fewer total lines. Combined with a comment density of just 3.8%, that density makes independent and deterministic code reviews more important than ever.

Bug density and severity

Bug density is 0.80 per kLOC. Severity breakdown per million lines:

Severity | Per MLOC
Blocker | 74
Critical | 48
Major | 369
Minor | 324

Blocker bugs are at 74 per MLOC, down from 83 in Opus 4.6, and critical bugs held steady at 48 per MLOC. These are the two levels that cause production fires; blockers improved and criticals did not get worse.

Concurrency and threading bugs are at 131 per MLOC, tied with resource and stream leaks as the largest bug category in this evaluation. The rate is lower than Claude Opus 4.6's 157 per MLOC, but concurrent patterns remain the dominant area of risk.

Bug category | Per MLOC
Concurrency / threading | 131
Resource / stream leaks | 131
Exception handling | 101
Type safety / casts | 68

Concurrency bugs are expensive. They're hard to reproduce in testing, tend to be environment-dependent, and can produce intermittent failures that take significant time to diagnose. The rate here is better than the prior generation, but it remains the dominant bug category to watch.
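To illustrate the two tied categories, here is a hedged Java sketch (the class and method names are invented for the example) showing a classic check-then-act race and an unclosed stream, next to the patterns a static analyzer typically recommends:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GeneratedPatterns {

    // BUG: check-then-act on an unsynchronized map. Two threads can
    // both observe 'null' and both insert, losing an update or
    // corrupting the HashMap under concurrent access.
    private final Map<String, Integer> counts = new HashMap<>();

    void recordUnsafe(String key) {
        if (counts.get(key) == null) {      // race window between check...
            counts.put(key, 1);             // ...and act
        } else {
            counts.put(key, counts.get(key) + 1);
        }
    }

    // FIX: a concurrent map with an atomic merge removes the race.
    private final ConcurrentHashMap<String, Integer> safeCounts = new ConcurrentHashMap<>();

    void recordSafe(String key) {
        safeCounts.merge(key, 1, Integer::sum); // single atomic update
    }

    // BUG: the reader is never closed, and leaks a file handle even on
    // the happy path (the resource/stream-leak category).
    String firstLineLeaky(Path file) throws IOException {
        BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        return reader.readLine(); // reader is never closed
    }

    // FIX: try-with-resources guarantees the handle is released.
    String firstLineSafe(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }
}
```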

Vulnerability density and severity

Vulnerability density is 0.29 per kLOC. Severity breakdown per million lines:

Severity | Per MLOC
Blocker | 113
Critical | 80
Major | 42
Minor | 57

Blocker and critical vulnerabilities went up compared to Opus 4.6, which had 53 and 56 per MLOC respectively. This is where the model regressed. Specific vulnerability categories:

Vulnerability category | Per MLOC
Cryptography misconfiguration | 57
Path traversal / injection | 24
Hard-coded credentials | 45
XML external entity (XXE) | 39

Cryptography misconfigurations, which include weak algorithms, insecure key sizes, and improper use of random number generators, are one of the more common failure modes in AI-generated code, and they show up here at 57 per MLOC. Path traversal and injection, at 24 per MLOC, is a category SonarQube catches reliably through data flow analysis. Hard-coded credentials are at 45 per MLOC, and XXE is at 39 per MLOC.
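As a concrete illustration of the two most actionable categories, the sketch below (the APP_AES_KEY environment variable is a hypothetical stand-in for whatever secret store you use) contrasts a typical weak pattern, ECB mode with a hard-coded key, against an authenticated AES-GCM setup with the key injected at runtime:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CryptoPatterns {

    // BAD: ECB mode leaks plaintext structure, and the hard-coded key
    // ships inside every build artifact (two of the categories above).
    static byte[] encryptWeak(byte[] plaintext) throws Exception {
        byte[] hardCodedKey = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding"); // weak mode
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(hardCodedKey, "AES"));
        return cipher.doFinal(plaintext);
    }

    // BETTER: authenticated AES-GCM with a fresh random IV per message
    // and a key loaded from the environment rather than the source.
    static byte[] encryptSafer(byte[] plaintext) throws Exception {
        byte[] keyBytes = Base64.getDecoder()
                .decode(System.getenv("APP_AES_KEY")); // key injected at runtime
        SecretKey key = new SecretKeySpec(keyBytes, "AES");

        byte[] iv = new byte[12];            // 96-bit IV, GCM's recommended size
        new SecureRandom().nextBytes(iv);    // never reuse an IV with the same key

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        // Prepend the IV so the receiver can decrypt; the IV is not secret.
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }
}
```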

Maintainability signals

Code smell density is 23.01 per kLOC, driven primarily by collection and generics parameter type issues (3,565 per MLOC) and assignment, field, and scope visibility issues (2,132 per MLOC). These are cases where the model uses raw types instead of properly parameterized generics, or where field visibility is looser than it needs to be. In Java, these issues carry real cost: they suppress compiler warnings, make refactoring harder, and can mask bugs a properly typed implementation would catch at compile time.
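A short illustrative example of both smell families (the class is invented for demonstration): raw collections defer type errors to runtime, and over-broad visibility invites uncontrolled mutation, both of which a properly parameterized, properly scoped version prevents:

```java
import java.util.ArrayList;
import java.util.List;

public class MaintainabilityPatterns {

    // SMELL: raw type plus public mutable field. The compiler cannot
    // check what goes into the list, and any caller can reassign it.
    public List scoresRaw = new ArrayList();

    void addRaw(Object score) {
        scoresRaw.add(score); // a String slips in silently...
    }

    int firstRaw() {
        return (int) scoresRaw.get(0); // ...and fails here with ClassCastException
    }

    // FIXED: parameterized generics catch the mistake at compile time,
    // and private final visibility keeps mutation behind the class API.
    private final List<Integer> scores = new ArrayList<>();

    void add(int score) {
        scores.add(score); // only ints can ever enter
    }

    int first() {
        return scores.get(0); // no cast needed
    }
}
```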

The overall code smell number should be read alongside the comment density. At 3.8% comments across 336,000 lines, teams maintaining this code will find fewer signposts and more accumulated minor issues to address as the codebase ages.

Functional skill

The passing test rate for Opus 4.7 is 82.52%, with missing completions at 0.45%. The functional pass rate is essentially unchanged from Opus 4.6's 82.55%: the generational update preserved functional capability. But 82.52% is also where verification earns its keep: roughly one in six generated solutions doesn't pass functional tests, and that rate isn't predictable in advance for any individual task. Testing pipelines catch what the model misses.
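As a sketch of what that verification looks like in practice, here is a minimal JUnit 5 example; the median function stands in for any model-generated method, and the edge cases are hypothetical:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

class GeneratedCodeTest {

    // Stand-in for a model-generated function under review.
    static double median(int[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("empty input");
        }
        int[] sorted = values.clone();
        java.util.Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    @Test
    void handlesOddAndEvenLengths() {
        assertEquals(3.0, median(new int[] {5, 1, 3}));
        assertEquals(2.5, median(new int[] {4, 1, 2, 3}));
    }

    @Test
    void rejectsEmptyInput() {
        // Edge cases are exactly where the roughly one-in-six failure rate hides.
        assertThrows(IllegalArgumentException.class, () -> median(new int[] {}));
    }
}
```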

What this means for developer teams using Opus 4.7 

The conciseness improvement is real. Roughly 40% less code for the same results means smaller review surfaces, faster iteration, and potentially lower token costs. Blocker bug density improved too: fewer of the issues that cause immediate production failures, continuing a positive trend from Opus 4.6.

The areas requiring active management are structural. Denser code, fewer comments, and higher per-line cognitive complexity raise the per-task cost of human review, even as total line count drops. The net review burden depends on how a team manages that tradeoff.

The vulnerability picture is the one that most deserves attention. Opus 4.7 ships fewer bugs than Opus 4.6, but more vulnerabilities, in a denser codebase with fewer comments. Fewer lines do not mean less security risk. The jump in blocker and critical vulnerabilities means security review cannot be treated as a checkbox. Systematic, multilayered code analysis tools in your development pipeline, catching path traversal, cryptography misconfigurations, and hard-coded credentials at generation time, are the practical way to address this without adding manual review time.

Three takeaways:

  • Conciseness is the headline improvement. Roughly 40% fewer lines for the same functional pass rate means smaller review surfaces and potentially lower token costs. This is a meaningful efficiency gain.
  • Blocker bugs improved. The most production-critical bug category moved in the right direction, continuing the trend from Opus 4.6.
  • Vulnerability density increased. Blocker and critical vulnerabilities are higher than Opus 4.6, and that's where verification focus should land — particularly on cryptography, path traversal, and hard-coded credentials.

Opus 4.7 Thinking is a capable and more efficient code generator than its predecessor, but that finding does not remove the need for verification. It changes what verification should focus on: more compact, denser code demands closer review.

Opus 4.7 Thinking's full evaluation results, along with all other evaluated models, are available on the Sonar LLM Leaderboard.

Build trust in every line of code.
