Earlier this year, we started doing something that we had felt uncertain about for a long time, namely porting PVS-Studio to Linux. In this article, I will tell you how we made the decision to create a product for Linux distributions after 10 years of the Windows version's existence. It's a big job, which, unfortunately, involves much more work than simply compiling the source files for the new platform, as some may think.

Introduction

In fact, the Linux version of the PVS-Studio console kernel has been ready for a long time, about three years now. Why did we never show it to the public then? You see, developing a software product, even one based on an existing product, is a huge job that takes lots of person-hours and involves tons of unexpected problems and details to deal with. We knew that back then, and the work had yet to be done, so there was no official support for the Linux version.

As an author of a number of articles about project checks, I, unlike my colleagues, would often find inspiration in software designed for Linux. This environment is abundant in large and interesting open-source projects that are extremely hard, if possible at all, to build under Windows. It is actually the need to be able to check such projects that has driven the development of PVS-Studio for Linux.

It took our small team a couple of months to port the PVS-Studio kernel's code to Linux. Replacing a few system calls and debugging on the Chromium project enabled us to make a decent console application. We added this version to our regular nightly builds and checked it with the Clang Static Analyzer. Thanks to regular checks of open-source projects and build management, the analyzer did fairly well for several years and at times even seemed quite ready for release. However, you don't know yet what tricks I had to use to be able to analyze projects with that version...

Using static analysis tools

Before we continue with our tool's development history, I'd like to talk about static analysis technology as such. This will also answer possible questions like, "Why use third-party tools when you can write bug-free code right away and do peer code review?" Sadly, this question comes up often.

Static code analysis helps find errors and defects in software's source code. Whatever particular tools you are using, this is a great technique for managing the quality of your code under development. If possible, combine different static analysis tools: it can help a lot.

Some of our readers, users, and conference guests believe that peer code review is an ample means of detecting bugs at the early coding stage. Sure, such "inspections" do help find some bugs, but we have really been talking about the same thing all along. Static analysis can be treated as automated code review. Think of a static analyzer as one of your colleagues, a virtual robot expert who doesn't get tired and takes part in every code review, pointing out fragments to be examined. Isn't it helpful?!

Many industries use automation to exclude the so-called human factor, and code quality management is no exception. We are not forcing you to give up manual code review if this is what you normally do. It's just that a static analyzer can help find even more bugs at the earliest stage possible.

Another important thing is that static analyzers don't get tired or lazy. Programmers make different kinds of mistakes in the code. What about typos? They don't catch your eye easily. Syntax mistakes? The ability to recognize them greatly depends on the reviewer's skill. Modern code sizes make the situation even worse. Many functions don't fit even widescreen displays. When context is lacking, the reviewer's attention weakens. A person grows tired after 15 minutes of closely reading program code, and it gets worse as you go on. It's no surprise that automatic analysis tools have become so popular and grow even more popular every year.

What PVS-Studio users expected of the Linux version

Our product has always attracted the interest of people who deal with software development one way or another. These are Windows users, who could try the tool right away, programmers working with other platforms and languages, and even non-programmers. Such interest is natural, as many programming mistakes are common to a large variety of languages.

Linux users showed much persistence in asking us for a Linux version all these years. Their questions and arguments can all be summarized as follows:

Command line utility - "We don't need IDE integration!"

No installer needed - "We'll install it ourselves!"

No documentation needed - "We'll figure out how to get started ourselves!"

The rest of the story will show, more than once, how these statements contradicted their actual expectations.

A myth about understanding build scripts

I talked with some people from large commercial projects and discovered that many developers don't know how projects are built and actually don't always need deep knowledge of that process. Every developer knows how to build/debug their project/module, but this knowledge is usually reduced to just a few magical commands. Figuratively speaking, there is a large button that they just need to press to have their modules built, but they have only a general understanding of the actual mechanics behind this process. As for the build scripts, there is usually a special person assigned to manage them.

In such cases, you need a tool to check your project without integrating with build systems, if only to get started with the analyzer.

The Linux version actually appeared after we introduced a compiler monitoring system in PVS-Studio's Windows version, which gave us a tool to check any project designed for that platform. As we found out later, there were quite a lot of serious projects built with the Microsoft compiler but lacking a Visual Studio solution. Thanks to this feature, we could tell you about the analysis results for such projects as Qt, Firefox, and CryEngine5, and even work for Epic Games on fixing bugs in their code. Our research showed that, to call the preprocessor and run the analysis, you only needed to know three things about a compiler invocation: the working directory, the command line parameters, and the environment variables.

As I was planning on checking Linux projects, I knew from the very beginning that I would not be able to figure out the specifics of integrating the analyzer with every particular project, so I made a similar monitoring system based on ProcFS (the /proc/<pid> entries). I took the PVS-Studio code from the Windows plugin and ran it under Mono to analyze the files. We used this method for several years on various projects, the largest of which were the Linux kernel and FreeBSD. Although it was a long-established procedure, it was by no means appropriate for commercial use. The product was not ready yet.
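To make the ProcFS idea concrete, here is a minimal sketch in C of reading the three pieces of information mentioned above (command line, environment, working directory) for a given process. It assumes a Linux system with /proc mounted; the helper names are hypothetical and this is not PVS-Studio's actual monitoring code.

```c
#define _POSIX_C_SOURCE 200809L  /* for readlink() under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a NUL-separated procfs file such as
   /proc/<pid>/cmdline or /proc/<pid>/environ into buf.
   Returns the number of bytes read, or -1 if the file cannot be opened. */
long read_proc_file(pid_t pid, const char *name, char *buf, size_t cap)
{
    char path[64];
    assert(cap > 0);
    snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, name);
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, cap - 1, f);
    fclose(f);
    buf[n] = '\0';  /* guarantee termination even if the data was truncated */
    return (long)n;
}

/* Hypothetical helper: resolve the /proc/<pid>/cwd symlink, which points
   at the process's working directory. Returns 0 on success, -1 on error. */
int read_proc_cwd(pid_t pid, char *buf, size_t cap)
{
    char path[64];
    assert(cap > 0);
    snprintf(path, sizeof path, "/proc/%ld/cwd", (long)pid);
    ssize_t n = readlink(path, buf, cap - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';  /* readlink() does not terminate the string */
    return 0;
}

/* Print each NUL-separated entry of buf (len bytes) on its own line. */
void print_entries(const char *label, const char *buf, long len)
{
    for (long i = 0; i < len; i += (long)strlen(buf + i) + 1)
        printf("%s: %s\n", label, buf + i);
}
```

A monitor would call read_proc_file(pid, "cmdline", ...), read_proc_file(pid, "environ", ...), and read_proc_cwd(pid, ...) for each compiler process it spots. One caveat with any polling approach of this kind is that a short-lived compiler process may exit before its entries are read.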

Choosing the monitoring system

Once we decided to implement this feature, we started making prototypes and choosing among them.

(-) Clang scan-build - we examined Clang scripts and made a prototype that used a similar mechanism to assign an analyzer call to the variables CC/CXX. We had already tried this method before when analyzing open-source projects with the Clang Static Analyzer, and it had not always worked. As we learned more about this method, we discovered that project authors would often assign compilation flags to these variables as well, so overriding them would result in losing their values. That's why we discarded that method.

(+) strace - this utility generates quite a detailed trace log where most of the logged processes are irrelevant to the compilation. Its output format also lacks the process's working directory that we needed so much. However, we managed to get it by chaining the child and parent processes, and the C++ version can parse such a file very quickly by analyzing the found files in parallel. This is a good way to check projects using any build system and get started with the analyzer at the same time. For example, we used it recently for another check of the Linux Kernel, and this time it was smooth and easy.
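The "chaining the child and parent processes" step can be sketched roughly as follows. This is only an illustration under stated assumptions: each input line has the simplified form "PID syscall(args) = result" that strace produces with the -f option, chdir paths are absolute, and split "unfinished/resumed" lines are ignored. The function and type names are hypothetical, and the real C++ parser is certainly more thorough.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PROCS 1024
#define MAX_PATH_LEN 512

struct proc_cwd {
    long pid;
    char cwd[MAX_PATH_LEN];
};

static struct proc_cwd table[MAX_PROCS];
static size_t table_len;

/* Find the entry for pid, creating it with initial_cwd if it is new. */
static struct proc_cwd *lookup(long pid, const char *initial_cwd)
{
    for (size_t i = 0; i < table_len; i++)
        if (table[i].pid == pid)
            return &table[i];
    if (table_len == MAX_PROCS)
        return NULL;
    struct proc_cwd *p = &table[table_len++];
    p->pid = pid;
    snprintf(p->cwd, sizeof p->cwd, "%s", initial_cwd);
    return p;
}

/* Feed one trace line. root_cwd is the directory the traced build was
   started in. Returns the working directory of the process when the line
   is a successful execve, otherwise NULL. */
const char *track_line(const char *line, const char *root_cwd)
{
    assert(line != NULL && root_cwd != NULL);
    char *rest;
    long pid = strtol(line, &rest, 10);
    if (rest == line)
        return NULL;                     /* no leading PID */
    while (*rest == ' ')
        rest++;
    struct proc_cwd *p = lookup(pid, root_cwd);
    if (!p)
        return NULL;

    /* The syscall result follows the last '=' on the line. */
    const char *eq = strrchr(rest, '=');
    long result = eq ? strtol(eq + 1, NULL, 10) : -1;

    if (strncmp(rest, "chdir(\"", 7) == 0 && result == 0) {
        const char *start = rest + 7;
        const char *end = strchr(start, '"');
        if (end && (size_t)(end - start) < sizeof p->cwd) {
            memcpy(p->cwd, start, (size_t)(end - start));
            p->cwd[end - start] = '\0';
        }
    } else if ((strncmp(rest, "clone(", 6) == 0 ||
                strncmp(rest, "fork(", 5) == 0 ||
                strncmp(rest, "vfork(", 6) == 0) && result > 0) {
        lookup(result, p->cwd);          /* the child inherits the parent's cwd */
    } else if (strncmp(rest, "execve(", 7) == 0 && result == 0) {
        return p->cwd;
    }
    return NULL;
}
```

The point of the design is that the log never states a working directory directly: it has to be reconstructed by replaying chdir calls and propagating the result from parent to child at every fork.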

(+) JSON Compilation Database - you can get this format for a CMake project by using one additional flag (-DCMAKE_EXPORT_COMPILE_COMMANDS=ON). It includes all the information required for analysis without unnecessary processes, so we supported it.

(+/-) LD_PRELOAD - analyzer integration through function replacement. This method won't work if the build itself already relies on LD_PRELOAD. There are also utilities that can use LD_PRELOAD to generate a JSON Compilation Database for non-CMake projects (for example, Bear). Their output differs slightly from what CMake generates, but we supported it as well. If the project does not depend on any predefined environment variables, we will be able to check it too. Hence the mark +/-.

Developing regular tests

There are different software testing procedures. The most effective technique for testing the analyzer and its diagnostic rules is to run tests on a large code base of open-source projects. We started with about 30 large projects. I mentioned earlier that the Linux version had existed for a few years by then and we had regularly used it to check projects. Everything seemed to work well, but it was not until we launched full-fledged testing that we saw how incomplete and imperfect the analyzer was. Before the analysis can be run, the source code needs to be parsed for the analyzer to find the necessary constructs. Even though unparsed code doesn't affect the analysis quality too much, it's still an unpleasant drawback. Every compiler has non-standard extensions, but we supported all such extensions in MS Visual C/C++ years ago, while in GCC we had to start almost from scratch. Why 'almost'? Because we have had support for GCC (MinGW) under Windows for a long time, but it's not common there, so neither we nor our users had any trouble using it.

Compiler extensions

In this section, we'll talk about code constructs that, hopefully, you won't see anywhere else: constructs that use GCC extensions. Why would we need these? They are hardly used in most cross-platform projects, aren't they? Well, it turns out that programmers do use them. We came upon code that made use of extensions when developing a testing system for Linux projects. Where things get most complicated, though, is the parsing of the standard library's code: this is where the extensions are used in full. You can never be sure about the preprocessed files of your project: for the sake of optimization, the compiler might turn a regular memset function into a macro with a statement expression. But first things first. What new constructs did we learn about when checking projects under Linux?
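As a quick illustration of the statement expression mentioned above, here is a minimal sketch of the construct. The macro name is hypothetical; this is not the memset macro from any real standard library header, just the same GNU C syntax in a small form.

```c
#include <assert.h>

/* GNU statement expression: a brace-enclosed block inside parentheses
   whose value is that of its last expression. Each macro argument is
   copied into a local first, so it is evaluated exactly once. */
#define MAX_ONCE(a, b)               \
    ({ __typeof__(a) a_ = (a);       \
       __typeof__(b) b_ = (b);       \
       a_ > b_ ? a_ : b_; })

int max_demo(void)
{
    int i = 1;
    int m = MAX_ONCE(i++, 0);  /* compares the old value of i with 0 */
    assert(i == 2);            /* i was incremented once, not twice */
    return m;
}
```

A parser that only understands standard C sees a block where an expression is expected, which is why an analyzer has to support this syntax explicitly.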

One of the first extensions we saw was designated initializers. These allow you to initialize an array in an arbitrary order. It is especially convenient if the array is indexed as enum: you explicitly specify the index, thus making the code easier to read and making mistakes less likely to appear when modifying it later. It looks very nice and neat:
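A minimal sketch of such an initializer is given below; the enum, array, and struct names are made up for illustration. In C this syntax has been standard since C99, and GCC lists it among its extensions because it also accepts it in C90 mode.

```c
#include <assert.h>

enum color { COLOR_RED, COLOR_GREEN, COLOR_BLUE, COLOR_COUNT };

/* Array designators tie each string to its enum index, so the order of
   the initializers does not matter. */
static const char *const color_names[COLOR_COUNT] = {
    [COLOR_BLUE]  = "blue",
    [COLOR_RED]   = "red",
    [COLOR_GREEN] = "green",
};

const char *color_name(enum color c)
{
    assert((unsigned)c < COLOR_COUNT);
    return color_names[c];
}

struct point { int x, y; };

/* Index and member designators can be mixed and given in any order. */
const struct point corners[2] = {
    [1].y = 4,
    [0]   = { .y = 2, .x = 1 },
    [1].x = 3,
};
```

If a new enumerator is later inserted in the middle of the enum, the table stays correct because every entry names its own index.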

That is, such an initializer may list array indexes and structure members in any order. A range can also be used as an index:

int array[] = {
    [0 ... 99] = 0,
    [100 ... 199] = 10,
};

One small GCC extension that is very useful from a security viewpoint deals with null pointers. We have discussed the problem of using NULL quite a lot, so I won't repeat myself. Things are somewhat better in GCC: in C++, NULL is defined as __null, which lets GCC keep us from shooting ourselves in the foot like this:

GCC lets you specify attributes with the __attribute__(()) syntax. There is a long list of attributes for functions, variables, and types that control linking, alignment, optimizations, and many other features. One interesting attribute is transparent_union. If a function parameter has such a union type, you can pass as an argument not only the union itself but also a value of any of the member types listed in it, such as a pointer. The following code will be correct:
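Here is a small sketch of the attribute in use. The union and function names are hypothetical and chosen only for illustration; the attribute placement follows the form shown in the GCC manual.

```c
#include <assert.h>

/* A union of pointer types marked transparent: callers may pass a value
   of any member type directly, without wrapping it in the union. */
typedef union __attribute__((transparent_union)) {
    int   *as_int;
    float *as_float;
} number_ptr;

/* Hypothetical function: is_float tells it which member the caller
   actually passed, since the union itself carries no tag. */
void set_to_one(number_ptr p, int is_float)
{
    assert(p.as_int != 0);
    if (is_float)
        *p.as_float = 1.0f;
    else
        *p.as_int = 1;
}
```

With this declaration, both set_to_one(&some_int, 0) and set_to_one(&some_float, 1) compile without a cast. The attribute is meant for C code.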

The wait function is an example that makes use of transparent_union: it can take both int* and union wait* as arguments. This is done for the sake of compatibility with POSIX and 4.1BSD.

You must have heard about GCC nested functions. Such a function can use the variables of the enclosing function that were declared before it. A nested function can also be passed by pointer (although it's obviously not a good idea to call it through that pointer after the enclosing function has returned).
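The behavior described above matches GNU C nested functions, and a minimal sketch looks like this. The function names are hypothetical. Note that this is a GCC-specific extension for C: Clang does not support it, and taking the address of a nested function typically relies on a trampoline placed on the stack.

```c
#include <assert.h>

/* Calls fn for every element; fn is an ordinary function pointer. */
static void for_each(const int *data, int n, void (*fn)(int))
{
    for (int i = 0; i < n; i++)
        fn(data[i]);
}

int sum_array(const int *data, int n)
{
    assert(data != 0 || n == 0);
    int total = 0;  /* declared before the nested function, so visible in it */

    /* GNU C nested function: it can read and modify 'total'. */
    void add(int value) { total += value; }

    /* Passing it by pointer is fine while sum_array is still running. */
    for_each(data, n, add);
    return total;
}
```

Storing the pointer to add somewhere and calling it after sum_array has returned would be the mistake the paragraph above warns about, because total no longer exists at that point.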

And here's a small Clang extension. Even though PVS-Studio has long been friends with this compiler, it's wonderful that we still encounter new language and compiler constructs emerging there. For example:

Closed beta testing. Episode 1

Once we had prepared a stable analyzer version, documentation, and a few methods of checking projects without integration, we launched a closed beta test.

When we started handing out the analyzer to the first testers, we discovered that the executable alone was not enough. Their responses ranged from "It's a wonderful tool; we've found lots of bugs!" to "I don't trust your app and I'm not installing it to /usr/bin!" Sadly, the latter were more common. The arguments of the forum members who claimed they would be OK with just the executable file proved to be exaggerated. Not everyone can or wishes to work with the analyzer in such a format. We needed some common means of Linux software distribution.

Closed beta testing. Episode 2

After the first responses, we stopped the test and dove into hard work for about 2 weeks. Testing on third-party code revealed even more problems with compilers. Since GCC is used as a basis to build compilers and cross compilers for various platforms, people started testing our analyzer on every possible kind of software, even firmware of various devices. It generally managed to deal with those tasks, and we did receive positive feedback, but it had to skip some code fragments because of extensions that we had yet to support.

False positives are inherent in any static analyzer, but their number had grown somewhat in the Linux version, so we got down to adjusting the diagnostics to the new platform and compilers.

The development of Deb/Rpm packages was a big improvement. Once we made them, all complaints about PVS-Studio installation ceased. There was probably only one person who didn't like using sudo to install the package, although almost all software is installed that way.

Closed beta testing. Episode 3

We also paused for a while to make the following improvements:

We discarded configuration files used for quick analysis: introducing Deb/Rpm packages brought the problem of filling in a configuration file to the fore. We had to improve the quick-analysis mode so that it works without configuration files, using just two obligatory parameters: the path to the license file and the path to the analyzer log. The advanced settings for this mode were left intact.

We improved log handling in strace. Originally, strace logs were processed by a script in Perl, which was the language of the prototype. This script was slow and bad at parallelizing the analysis process. Rewriting this feature in C++ helped speed up file processing and also made it easier to maintain the whole code written in a single language.

Improving Deb/Rpm packages. Since we needed the strace utility for the quick-analysis mode and the first packages included Perl/Python scripts, we failed to specify all the dependencies properly at first, and then just discarded the scripts altogether. A few people reported errors when installing the analyzer using GUI managers, and we quickly eliminated those errors. An important thing to mention here is that the testing procedure that we set up for ourselves helped quite a lot: we would deploy a few dozen Linux distributions in Docker and install the ready packages on them. We also checked if it was possible to run already installed programs. This technique enabled us to implement new modifications in the packages and test them at a fast pace.

Other improvements of the analyzer and its documentation. All the steps and changes we were making were reflected in the documentation. As for improving the analyzer, well, this process never stops: we develop new diagnostics and improve the existing ones.

Closed beta testing. Episode 4 (Release Candidate)

During the last stage of the test, the users no longer had any trouble installing, running, and setting up the analyzer. We were receiving positive feedback, examples of real bugs found by the tool, and examples of false positives.

The testers also showed more interest in the analyzer's advanced settings, which forced us to expand the documentation with an explanation of how to integrate the analyzer with Makefile/CMake/QMake/QtCreator/CLion. These methods are discussed below.

Supported integration techniques

Integration with Makefile/Makefile.am

Although projects can be conveniently checked without integration, integrating the analyzer with build systems does have a few advantages:

Fine tuning of the analyzer;

Incremental analysis;

Running analysis in parallel on the build-system level;

Other advantages provided by the build system.

When called at the same point as the compiler, the analyzer has a correctly set-up environment, working directory, and all the parameters. That way, you have all the necessary conditions fulfilled to ensure correct and high-quality analysis.