Why I’m passionate about Static Analysis and how I helped make it better

Abbas Sabra

C++ Analyzer Developer

October 2, 2023

I was recently interviewed on the C++ podcast, CppCast - “the first podcast by C++ developers, for C++ developers”. We talked about static analysis and how I got into it in the first place. Then we talked about Automatic Analysis for C++, a feature that we had been working on for over a year and that was released just last month on SonarQube Cloud.

You can listen to the podcast here: https://cppcast.com/automatic_static_analysis/. Still, I’m going to cover most of what we talked about here, too.

How I got into static analysis

Earlier in my career, I was working in finance, where runtime efficiency is usually held up above all else - including developer efficiency. I saw a lot of productivity lost to tooling issues that could have been avoided. I would say that we spent about 80% of our time debugging. One day, when I was working on a million-lines-of-code interest rate derivative project, I got a ticket for a bug where a calculation was coming out wrong. It took me two days to find that bug, and it turned out to involve an expression with a side-effect that we relied on. Someone had moved it into a decltype. The trouble with that is that the operand of decltype is never evaluated, so the side-effect no longer happened, which affected the calculations in the financial model.
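To make that concrete, here is a minimal sketch of this kind of bug. The names next_rate, g_calls and count_calls are hypothetical stand-ins, not the code from that project; the only thing taken from the story is that an expression with a side-effect ended up as the operand of decltype.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the real code: a call whose side effect
// (bumping a counter) the rest of the calculation depends on.
static int g_calls = 0;

double next_rate() {
    ++g_calls;      // the side effect
    return 0.05;
}

// Returns how many times next_rate() actually ran.
int count_calls() {
    g_calls = 0;

    // Evaluated context: the call runs and the counter is bumped.
    double evaluated = next_rate();
    assert(g_calls == 1);

    // Unevaluated context: decltype only asks for the type of the
    // expression (double), so next_rate() is never called here.
    decltype(next_rate()) unevaluated = evaluated;
    static_assert(std::is_same<decltype(unevaluated), double>::value,
                  "decltype yields the type of the expression");
    assert(g_calls == 1);   // still 1: the second "call" never happened

    (void)unevaluated;
    return g_calls;
}
```

Both lines look like they call next_rate(), but only the first one does, which is why this is so easy to miss in review.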

Once I found this, I wondered if there were more cases where something similar had happened, and it occurred to me that I could write a simple script to look through the code for me. It took me less than an hour to write the script and just a few seconds to run across the whole codebase. But it found another such issue that could have led to another multi-day debugging session!
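I no longer have that script, so what follows is only a hypothetical sketch of the idea, written in C++ for illustration. It assumes that a purely textual heuristic is good enough for a first pass: find every decltype(...), extract the operand by matching parentheses, and flag operands containing ++, -- or an assignment. The function names are made up, and the scan ignores comments, string literals and macros.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Heuristic: does the text contain ++, --, or an assignment operator?
bool looks_side_effecting(const std::string& expr) {
    if (expr.find("++") != std::string::npos ||
        expr.find("--") != std::string::npos) {
        return true;
    }
    for (std::size_t i = 0; i < expr.size(); ++i) {
        if (expr[i] != '=') continue;
        const char prev = i > 0 ? expr[i - 1] : ' ';
        const char next = i + 1 < expr.size() ? expr[i + 1] : ' ';
        // Skip ==, !=, <= and >= (this also skips <<= and >>=, a known gap).
        const bool comparison = next == '=' || prev == '=' || prev == '!' ||
                                prev == '<' || prev == '>';
        if (!comparison) return true;
    }
    return false;
}

// Returns the operand of every decltype(...) in `source` that looks as if
// it has a side effect.
std::vector<std::string> suspicious_decltype_operands(const std::string& source) {
    static const std::string keyword = "decltype";
    std::vector<std::string> hits;
    std::size_t pos = 0;
    while ((pos = source.find(keyword, pos)) != std::string::npos) {
        pos += keyword.size();
        const std::size_t open = source.find_first_not_of(" \t\r\n", pos);
        if (open == std::string::npos || source[open] != '(') continue;

        // Walk forward to the parenthesis that closes the operand.
        int depth = 0;
        std::size_t close = open;
        for (; close < source.size(); ++close) {
            if (source[close] == '(') {
                ++depth;
            } else if (source[close] == ')' && --depth == 0) {
                break;
            }
        }
        if (close == source.size()) break;  // unbalanced parentheses: give up
        assert(depth == 0);

        const std::string operand = source.substr(open + 1, close - open - 1);
        if (looks_side_effecting(operand)) hits.push_back(operand);
        pos = close;
    }
    return hits;
}
```

A real analyzer would work on the syntax tree rather than on text, but even something this crude can turn a multi-day debugging session into a list of lines to review.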

That experience got me hooked on finding things that could be quickly automated for significant productivity gains - especially when it comes to finding issues in the code - or finding patterns that might lead to issues. And that passion led me to static analysis.

The challenges of C++ tooling

Whether it’s static analysis, a code inspection tool, an IDE, or just a syntax highlighter or code formatter, tooling for C++ is much more complex than tooling for most other languages. That is mainly because all these tools ultimately rely on the ability to parse the language - and C++ is a complicated and resource-intensive language to parse. There are many grammatical peculiarities, such as a token changing meaning depending on what comes later, on top of years of backward-compatibility legacy - all the way back to C (and sometimes even earlier). Then there’s the preprocessor and tons of compiler extensions, which throw everything into question again. So maintaining a reliable parser for C++ is a big task for even a medium-sized team working full-time!
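One well-known example of this kind of grammatical ambiguity is the so-called “most vexing parse”, sketched below. Timer and Keeper are hypothetical types invented for the illustration; the point is only that two lines that look almost identical are parsed as completely different things.

```cpp
#include <cassert>      // for the checks in the usage note
#include <type_traits>

struct Timer {};

struct Keeper {
    explicit Keeper(Timer) {}
    int value() const { return 42; }
};

// Looks like an object built from a temporary Timer, but the grammar
// prefers a declaration: a function named looks_like_object that takes a
// pointer to a function returning Timer, and returns a Keeper.
Keeper looks_like_object(Timer());

// Braces remove the ambiguity: this really is an object.
Keeper real_object{Timer{}};

static_assert(std::is_function<decltype(looks_like_object)>::value,
              "parsed as a function declaration");
static_assert(!std::is_function<decltype(real_object)>::value,
              "parsed as an object definition");

int read_real_object() {
    return real_object.value();
}
```

Here assert(read_real_object() == 42) holds, while writing looks_like_object.value() would not even compile. Every tool that touches the code has to get this right, not just the compiler.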

Things have improved since we got clang-tooling. Now, the same parser that the Clang compiler uses can be built on by other tools. However, even that is not a magic bullet. Clang-tooling can get small limited-scope projects quite far - so that’s good. Nonetheless, complex and performance-sensitive tools with a wide range of use cases, like an IDE or a full-featured static analysis tool, must deal with many extra complexities. Even before you allow for the fact that Clang is no longer the first to implement new language features, you must deal with incomplete code and exotic compiler extensions. Clang can assume the code is complete and compile based on that assumption. If it’s not, that’s a compiler error. But for something that needs to understand the code while you’re writing it - in real time - this adds a lot of extra complexity. Also, Clang has different performance constraints than the usual interactive IDE-based tools.

Unlike C++, languages with more regular syntax, typically designed with toolability in mind, are much easier to work with. That’s why, for example, IDEs for Java or C# tend to feel so much smoother and more productive - and at the same time, lighter - than those for C++, even when they are all built by the same company, like the JetBrains IDEs. Sadly, things don’t get better for tooling with “modern C++”; they even get worse! We can now write almost anything as constexpr code - which sounds like a great win. However, tools must now have a full-blown C++ interpreter just to be able to parse it, because the value of a constant expression can change how the surrounding code parses! Even when you aspire to use C++20 modules to solve the frequent parsing bottleneck of text-based include directives, backward compatibility always reminds you that, for C++ tooling, there is no moving forward.
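Here is a small sketch of why constexpr forces a tool to evaluate code before it can even parse it. The names fib and Pick are hypothetical, chosen just for this illustration: the same token sequence is a multiplication or a declaration depending on what a constexpr function returns.

```cpp
#include <cassert>      // for the checks in the usage note
#include <type_traits>

constexpr int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

// Primary template: `member` is a value.
template <int N> struct Pick { static constexpr int member = 2; };
// Specialization for 8: `member` is a type.
template <> struct Pick<8> { using member = int; };

int x = 21;

int as_expression() {
    // fib(5) == 5, so the primary template is chosen and `member` is a
    // value: this is a multiplication with the global x.
    return Pick<fib(5)>::member * x;
}

bool as_declaration() {
    // fib(6) == 8, so the specialization is chosen and `member` is a type:
    // the same shape of statement now declares a local pointer named x.
    Pick<fib(6)>::member * x = nullptr;
    return std::is_same<decltype(x), int*>::value;
}
```

With these definitions, assert(as_expression() == 42) and assert(as_declaration()) both hold. A tool cannot tell which of the two readings applies without actually computing fib.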

Static analysis as a tool for education

We tend to think of static analysis as a way of finding bugs - or patterns that might lead to bugs - all without compiling your code (as opposed to dynamic analysis, which works at runtime). Of course, it’s great for that. At the same time, a good static analyzer should also help you to understand why something is an issue or why there may be a better way to do something. If the spirit of Left Shifting is dealing with things at earlier and earlier stages in the pipeline, then arming you with the knowledge to avoid writing problematic code in the first place is the ultimate Left Shift. For me, that’s even more interesting. This is especially the case now that C++ is such a fast-moving target, with major new versions like C++20 often overturning what we consider best practices. Even the most experienced can struggle to keep up.

So, at Sonar, we strive to write good rule descriptions that help you understand the problem - and we’re constantly improving even older rules. We also have rules specifically for detecting patterns representing older usages and explaining how to update them to modern forms - and, when feasible, doing it for you. For example, static analyzers can do exceptionally well with detecting equivalent code. We build on that by detecting raw loops that have an equivalent STL algorithm, and we encourage you to leverage the STL - perhaps using the newer range algorithms if you’re using C++20 or later. Most of us could do with making better use of the STL algorithms, so this is a great educational resource.
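As a small illustration of the raw-loop case, the sketch below shows a hand-written loop and an equivalent call to std::any_of. The function names are hypothetical, and this is only an example of the kind of equivalence involved, not the output of any particular rule.

```cpp
#include <algorithm>
#include <cassert>      // for the checks in the usage note
#include <cstddef>
#include <vector>

// Raw loop: the intent ("is any element negative?") has to be inferred
// from the index handling and the early return.
bool has_negative_raw(const std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] < 0) return true;
    }
    return false;
}

// Equivalent STL algorithm: the name states the intent directly.
bool has_negative_stl(const std::vector<int>& values) {
    return std::any_of(values.begin(), values.end(),
                       [](int v) { return v < 0; });
}
```

Both functions agree on every input, for instance assert(has_negative_raw({3, -1, 4}) == has_negative_stl({3, -1, 4})). With C++20 the second one could also be written with std::ranges::any_of.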

Path explosion

So static analysis is great for detecting patterns in code that might lead to issues - prompting you to follow “best practices”. Detecting actual bugs - e.g., dereferencing a null pointer (where the pointer value is only known at runtime) - is also possible but often much harder. It is harder not just in terms of the code needed to do the detection but also in the mathematical sense of needing to track an exponentially increasing number of possible states. We call this the “Path Explosion Problem”.

For example, if you write some code that, given two integers, divides one by the other, then there are various failure modes depending on the values of the integers. An obvious one is a zero denominator, which is undefined behavior (UB). So, you need to look at where those integers came from, their possible values, and what branches they took along the way. If you can see that, before the division, the denominator is checked against zero - and the code branches away if it is - you should be safe from division by zero issues. We call this theoretical stepping through stages of code “symbolic execution”. That’s reasonably achievable if that check is fairly close to the division itself. But the further away it gets, the more intermediate branches you must account for. If you cross the function boundary, then things get especially tricky. And once you have calls from other translation units, the problem becomes intractable in the general case. In some specific cases, we can do whole program analysis to catch cross-translation unit issues, but it is not feasible to do this in general. To do so accurately, you would need to effectively execute the whole program - in the analyzer - for all possible ranges of inputs. You may not even have all the source code.
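The sketch below illustrates the difference between the easy case and the hard one. All names are hypothetical. In checked_divide the guard sits right next to the division; in divide_adjusted the denominator first passes through three independent branches, so there are already 2^3 = 8 paths along which an analyzer has to work out whether the value can be zero.

```cpp
#include <cassert>      // for the checks in the usage note
#include <climits>
#include <optional>

// Easy case: the checks are adjacent to the division.
std::optional<int> checked_divide(int num, int den) {
    if (den == 0) return std::nullopt;                     // division by zero is UB
    if (num == INT_MIN && den == -1) return std::nullopt;  // quotient overflows int: also UB
    return num / den;
}

// Each independent branch doubles the number of paths to the return.
int adjust(int den, bool halve, bool third, bool fifth) {
    if (halve) den /= 2;
    if (third) den /= 3;
    if (fifth) den /= 5;
    return den;
}

// Harder case: whether the denominator is zero depends on which of the
// eight paths through adjust() was taken.
std::optional<int> divide_adjusted(int num, int den,
                                   bool halve, bool third, bool fifth) {
    return checked_divide(num, adjust(den, halve, third, fifth));
}
```

For example, divide_adjusted(100, 4, false, false, false) yields 25, but divide_adjusted(100, 4, false, false, true) yields no value, because 4 / 5 truncates to 0. Add a few more branches, or move adjust into another translation unit, and the number of cases to track grows very quickly.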

But despite its limitations, symbolic execution is still very valuable; it does detect complex bugs in established codebases. It is one of the many techniques we use at Sonar to implement our rules - some of our most specialized developers are working on it.

Nonetheless, dynamic analysis tools, such as Valgrind and the Clang Sanitizers (msan, asan, ubsan, etc.), are still valuable to run alongside static analysis - although they can typically only detect issues if they are encountered at runtime. This is why I feel that detecting patterns that can lead to issues (so-called “Code Smells”) is the best contribution that static analyzers can make. If you follow these best practices, then you can usually steer clear of the actual bugs in the first place. A good example here is spotting locations where you can use abstractions like std::span or std::string_view instead of raw pointers and lengths. Better still might be to use gsl::span (from the C++ Core Guidelines Support Library), as this is also range-checked. These are all patterns we can check and warn you about - even if the code itself is not buggy.
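To show what that replacement looks like, here is a minimal sketch using std::string_view (C++17), chosen so the example stays within the standard library. The function names are hypothetical. The raw version relies on the caller passing a length that matches the pointer; the string_view version carries both together, and its at() member checks the index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>    // std::out_of_range, thrown by string_view::at
#include <string_view>

// Raw pointer and length: nothing ties `len` to `data`, so a wrong
// length silently reads out of bounds.
std::size_t count_commas_raw(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == ',') ++n;
    }
    return n;
}

// string_view: pointer and length travel together.
std::size_t count_commas(std::string_view text) {
    return static_cast<std::size_t>(std::count(text.begin(), text.end(), ','));
}

// at() throws std::out_of_range instead of reading past the end.
char char_at(std::string_view text, std::size_t index) {
    return text.at(index);
}
```

So count_commas("a,b,c") returns 2 without the caller having to pass 5 separately, and char_at("abc", 3) throws rather than reading past the last character.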

How do Sonar tools fit in?

We also talked, on the episode, about the tools that we offer as part of the Sonar Solution. If you’re reading this here, you may already know about them - but it’s worth mentioning that we do have three tools and what the differences are.

SonarQube for IDE is likely to be the most familiar to many developers. It runs as a plug-in in your IDE and analyzes your code as you write - giving you real-time feedback along the lines we’ve already discussed. It also offers Quick Fixes for many issues, so it can even rewrite the code for you. That’s great for the ultimate left-shifting we talked about. But that only works if everyone is using the same tools in the same way. That’s hard to enforce in our modern heterogeneous development teams.

So, we also have two services that can run as part of your server-based builds (what we sometimes call CI or CD servers). SonarQube Server and SonarQube Cloud are largely the same - but you’d usually use SonarQube Server if you’re self-hosting or SonarQube Cloud if you want us to host. SonarQube Cloud is especially useful for Open Source software projects. There’s a lot more to them than just running the same analyzers on the server. They can act as quality gates on Pull Requests, for example - so you can be sure that new issues are not being introduced. They also enable our Clean as You Code process, where, by doing nothing more than keeping your new commits clean, the whole code base (or a significant chunk with the highest churn rate) gets cleaned over time. This prevents the common feeling of being overwhelmed when you turn on all warnings for the first time or use a new quality tool.

Automatic Analysis

One downside to the server-based tools is that they need to be configured, integrated into your toolchain, and maintained over time. This is often quite a bit more involved than with other language ecosystems because of the nature of C++ build systems and the wide range of compilers. If you have dedicated DevOps resources, this shouldn’t be an issue. Yet, if this is a developer’s part-time responsibility or you’re an open-source author, this can be a bit of a barrier to entry - at least just to try them out.

So, we really wanted to make all that complexity disappear and offer a zero-config option for systematically incorporating static analysis across a project. We’ve had this for some other languages for some time now, but for C++, we - even I - considered it impossible for some time. Fortunately, we had a breakthrough last year and thought we had a shot at doing it. So, since then, I’ve been leading a small team and am pleased to say that last month, we released Automatic Analysis for C++, and I have to say, it has exceeded our expectations. It works so well that we’re now suggesting this be the default way to set up C++ analysis in SonarQube Cloud! All you need to do is give SonarQube Cloud access to your source code and tell it to analyze it, and it goes away, figures out the most likely build options, dependencies, etc., and analyzes on that basis. The entire process takes less than a minute! See for yourself. According to the data we have from our large corpus of projects we test against internally, we get something like 95% accuracy. For compilation, only 100% is good enough, but for static analysis, 95% is actually excellent - and for most projects, you would probably not know the difference. If you have a special case you can always fall back to a manual setup approach, of course.

We’re very proud of what we have achieved. I don’t believe anyone else has been able to do this yet. What excites me is that this technology can now open up static analysis to even more developers, especially those contributing to open-source projects where this feature is free!