On November 18th, 2025, Cloudflare experienced a significant outage that rippled through the Internet. Cloudflare reacted quickly, diagnosing the issue and deploying fixes, and the incident now serves as a case study for engineering teams everywhere. The company published a detailed post-mortem explaining how a small change cascaded into a global disruption.
This is the type of knowledge-sharing that allows the entire software industry to progress; it’s important for everyone to understand what can happen when you run interconnected services at the scale of the planet.
This blog post looks at how seemingly small decisions can have massive effects, and why prioritizing code quality is essential to building reliable software.
The outage
I’ll let you read the post-mortem, but it boils down to two unrelated things:
- A change in database permissions
- A hard-coded limit in a process routing traffic across the Cloudflare network
The code that ultimately failed was designed with performance in mind, likely with a set of expectations about the input it would consume. In a high-scale environment like Cloudflare’s, hard-coded limits often exist for good reasons, such as ensuring speed and minimizing memory consumption.
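To illustrate the idea, a hard-coded limit typically pairs with a pre-allocated buffer so that memory use stays fixed and predictable on the hot path. The sketch below is invented for this post; the constant, names, and sizes are not taken from Cloudflare’s code.

```rust
// Hypothetical illustration of why a hard-coded limit can exist: a buffer
// sized once, up front, that never reallocates while the service handles traffic.
const MAX_FEATURES: usize = 200; // invented limit for this example

struct FeatureSet {
    values: Vec<f64>,
}

impl FeatureSet {
    fn new() -> Self {
        // Single allocation with a known upper bound: fast and memory-predictable.
        FeatureSet {
            values: Vec::with_capacity(MAX_FEATURES),
        }
    }
}

fn main() {
    let set = FeatureSet::new();
    println!("reserved capacity: {}", set.values.capacity());
}
```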
Likely, the team generating the data and the team consuming it were distinct units operating under agreed-upon assumptions. But in the age of cloud computing, hidden dependencies can shift unexpectedly. The critical question of what happens if these limits are not honored may never have been part of the review conversation. This is a fact of software development. Things fall through the cracks. It is difficult for any single team to envision every cascading effect of a database change.
So what are we to do in the face of such a disheartening situation?
The real question
The most important question isn't who made the mistake, but whether the conversation about failure modes ever happened. Was anyone even aware that the software could fail under those specific conditions?
When you look at the code Cloudflare openly shared with the world, you can see (if you read Rust fluently) that there is a seemingly innocuous call to unwrap() at the end.
This call is the reason the software failed. The unwrap() call takes the result of a previous call and extracts the value. If the previous call fails, unwrap() panics and kills the program.
While unwrap() is not bad in and of itself, it is part of the standard library and suitable in many simple cases, with the right precautions. In this particular case, however, it meant that if the expectations on the input file were not met, the program crashed. Could the software have been designed differently to handle the problem gracefully? Maybe.
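To make the mechanics concrete, here is a minimal, hypothetical sketch, not Cloudflare’s code, showing the difference between unwrapping a result and handling it explicitly. The parsing scenario and fallback value are invented for illustration.

```rust
fn main() {
    let input = "not-a-number";

    // With unwrap(), a failed parse panics and kills the program:
    // let port: u32 = input.parse().unwrap();

    // With explicit handling, the failure becomes a decision instead of a crash:
    let port: u32 = match input.parse::<u32>() {
        Ok(value) => value,
        Err(e) => {
            eprintln!("invalid port '{input}' ({e}), falling back to default");
            8080
        }
    };

    println!("using port {port}");
}
```

Both versions compile; only one of them stays up when the input surprises you.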
As this program seems to be mission-critical, it is a fair assumption that one requirement should have been “it is not allowed to fail.” How do you make sure that this requirement is met? You can carry out an exhaustive risk analysis, which might happen when the program is first designed, but it is unlikely to be repeated as the program evolves over time.
My point here is not to point fingers at the software engineers, the architect, the QA team, or the company. I want to highlight how difficult it is to ensure such a requirement over the lifetime of any piece of software. If you want assurances that a requirement holds, you need an automated way to check for it. Testing helps, but writing a test for a failure mode requires having identified it first, which you cannot guarantee.
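For example, a test like the following, where everything is hypothetical, can only be written once someone has imagined the “too many entries” failure mode; a failure mode nobody thought of never gets its test.

```rust
const MAX_ENTRIES: usize = 200; // hypothetical hard-coded limit

// Hypothetical loader that enforces the limit instead of panicking.
fn load_entries(raw: &[u64]) -> Result<Vec<u64>, String> {
    if raw.len() > MAX_ENTRIES {
        return Err(format!(
            "expected at most {MAX_ENTRIES} entries, got {}",
            raw.len()
        ));
    }
    Ok(raw.to_vec())
}

#[cfg(test)]
mod tests {
    use super::*;

    // This test exists only because the oversized-input case was anticipated.
    #[test]
    fn oversized_input_is_an_error_not_a_panic() {
        let oversized = vec![0u64; MAX_ENTRIES + 1];
        assert!(load_entries(&oversized).is_err());
    }
}
```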
Catching bugs
The standard defense against that problem is “code review.” Software engineers rely on each other to catch mistakes in their code. However, giving an informed opinion requires keeping in mind both the goal of the change and the original requirements that should still hold, at a time when the reviewer is probably busy with different things. A simple method call can easily slip through, because it does not scream “an original requirement is broken here.”
The easiest way to make sure such issues are brought to our attention is static code analysis, which reads the code to identify problematic patterns and provides context on why each pattern is bad. The Rust toolchain actually ships with such an analyzer, Clippy, which, with a simple configuration, can raise a warning every time unwrap() is used in a risky way.
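Concretely, in a hypothetical crate, enabling this can be as simple as a crate-level attribute; the lint itself is real but off by default, since it belongs to Clippy’s opt-in “restriction” group.

```rust
// At the crate root (main.rs or lib.rs): opt in to the lints that flag
// every unwrap() and expect() on Option/Result values.
#![warn(clippy::unwrap_used)]
#![warn(clippy::expect_used)]

fn main() {
    let input = "42";
    // `cargo clippy` now flags this line (clippy::unwrap_used), forcing the
    // author or reviewer to justify or replace it.
    let value: u32 = input.parse().unwrap();
    println!("{value}");
}
```

On recent toolchains the same choice can also be recorded per project in Cargo.toml’s `[lints]` table, so the decision travels with the repository rather than with individual developers.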
Going back to our discussion: when the software was originally specified, and assuming it was deemed “mission critical,” turning on that rule would have made sense. Months or years later, during a change introducing the call to unwrap(), it would have raised an issue, making the call stand out during development. The developer would have challenged their own decision, which might have sparked a deeper conversation with upstream stakeholders, or simply within the team, perhaps leading to a version of the file-loading code that logs a warning and keeps only what fits in the pre-allocated buffer, instead of failing completely.
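Here is a minimal sketch of what that “degrade instead of die” alternative could look like. The file format, names, and limit are all invented for this post; the point is the shape of the error handling, not Cloudflare’s actual implementation.

```rust
use std::fs;

const MAX_FEATURES: usize = 200; // hypothetical pre-allocated capacity

fn load_features(path: &str) -> Vec<String> {
    // If the file cannot be read at all, warn and return an empty set
    // rather than crashing the process.
    let contents = match fs::read_to_string(path) {
        Ok(c) => c,
        Err(e) => {
            eprintln!("warning: could not read {path}: {e}");
            return Vec::new();
        }
    };

    let mut features = Vec::with_capacity(MAX_FEATURES);
    let mut dropped = 0usize;
    for line in contents.lines() {
        if features.len() < MAX_FEATURES {
            features.push(line.to_string());
        } else {
            dropped += 1;
        }
    }

    if dropped > 0 {
        // Log loudly, keep what fits in the pre-allocated buffer, and keep serving.
        eprintln!(
            "warning: feature file exceeded the limit of {MAX_FEATURES}; dropped {dropped} entries"
        );
    }
    features
}

fn main() {
    let features = load_features("features.conf");
    println!("loaded {} features", features.len());
}
```

Whether silently truncating like this is acceptable is exactly the kind of question that deserves a conversation with the upstream team; the code simply makes the decision visible.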
No matter what the decision, the conversation would have happened because the analyzer would have flagged it.
It is extremely easy to choose which rules run on which project, depending on your context. And if the context changes, you can change the rules that are active and review the newly surfaced issues to get ahead of potential problems, just as security teams do for new vulnerabilities.
Where does code quality fit in?
At Sonar, we define code quality as the fundamental health of your codebase. It goes far beyond syntax or style. It is the structural integrity that determines whether your software operates as intended or fails under pressure.
Code quality is code governance
The critical thing here is that code quality is not simply about best practices; it is about code governance. The root cause of many outages is not just a bug. It is a lack of visibility. Governance ensures that the assumptions made when software was designed continue to hold as you modify it years later. It’s about having the right tools in place to surface where those assumptions are broken and bring them to the eyes of those who can act on them: developers.
The risk of not treating quality as an integral part of the SDLC is outages like this one, potentially leading to breached SLAs, the cost of diagnosing, fixing, and deploying a new version of your product, not to mention public relations problems. The costs can accumulate quickly, and, as we have seen here, the moment a bug manifests itself is unpredictable at best.
The acceleration of AI-generated code has made manual oversight impossible at scale, transforming automated verification from an option into a necessity. Implementing deterministic static analysis creates an essential safety net that continuously scans your entire codebase without slowing down development. While no single tool can predict the exact sequence of events in a complex outage, catching these specific logic errors early effectively breaks the chain of failure, stopping compounding issues before they ever reach production.

