PVS-Studio is a static code analyzer that searches for errors and vulnerabilities in programs written in C, C++, and C#. In this article, I am going to describe the technologies that we use in the PVS-Studio analyzer. In addition to the general theoretical information, I will show practical examples of how a particular technology enables the detection of bugs.

Introduction

The reason for writing this article was my talk at the open conference ISPRAS OPEN 2016, which took place at the beginning of December in the main building of the Russian Academy of Sciences. The subject of the talk: "The operation principles of the PVS-Studio static code analyzer" (presentation in pptx format).

Unfortunately, the time for the talk was very limited, so I had to prepare a very short presentation and couldn't cover everything I wanted. So I decided to write this article, where I will give more details on the approaches and algorithms that we use in the development of the PVS-Studio analyzer.

At the moment, PVS-Studio is, in fact, two separate analyzers: one for C++ and another for C#. Moreover, they are written in different languages: we develop the kernel of the C++ analyzer in C++, and the C# kernel in C#.

However, we use similar approaches when developing these two kernels. Besides, a number of employees work on both C++ and C# diagnostics at the same time. This is why I won't separate these analyzers any further in this article; the description of the mechanisms applies to both. Of course, there are some differences, but they are quite insignificant for an overview of the analyzer. Where it matters, I will say whether I am talking about the C++ analyzer or the C# one.

The team

Before I get into the description of the analyzer, I will say a couple of words about our company, and our team.

The PVS-Studio analyzer is developed by the Russian company - OOO "Program Verification Systems". The company is growing and developing solely on profit gained from product sales. The company office is located in Tula, 200 km to the south of Moscow.

To some, it may seem that a single person would be enough to create an analyzer. However, the job is much more complicated and requires many person-years of work. The maintenance and further development of the product requires even more.

We see our mission as promoting the methodology of static code analysis. And, of course, as earning a financial reward by developing a powerful tool that enables the detection of a large number of bugs at the earliest stages of development.

Our achievements

To spread the word about PVS-Studio, we regularly check open source projects and describe the findings in our articles. At the moment, we have checked about 270 projects.

Since the moment we started writing articles we have found more than 10 000 errors, and reported them to the authors of the projects. We are quite proud of this, and I should explain why.

If we divide the number of bugs found by the number of projects, we get quite an unimpressive number: 40 errors per project. So I want to highlight an important point; these 10000 bugs are a side effect. We have never had the goal to find as many errors as possible. Quite often, we stop when we find enough errors for an article.

This shows quite well the convenience, and the abilities, of the analyzer. We are proud that we can simply take different projects and start searching for bugs immediately, almost without the need to set up the analyzer. If it weren't so, we wouldn't be able to detect 10000 bugs just as a side effect of writing the articles.

PVS-Studio

PVS-Studio also provides various additional capabilities, such as integration with SonarQube and IncrediBuild.

Why C and C++

The C and C++ languages are extremely efficient and elegant. But in return they require a lot of attention and deep knowledge of the subject. This is why static analyzers are so popular among C and C++ developers. Despite the fact that compilers and development tools are also evolving, nothing really changes. I will explain what I mean by that.

There is a saying: "Bugs. C++ bugs never change". A typical example: the pointer StrippedPtr is dereferenced first, and only then verified against NULL.
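The original code fragment is not reproduced here, so below is a minimal hypothetical sketch of the pattern being described: the pointer is dereferenced before the NULL check, so the check comes too late to protect anything. The Node structure and readValue function are invented for illustration.

```cpp
#include <cstddef>

struct Node { int value; };

// Hypothetical illustration: StrippedPtr is dereferenced first,
// and only then verified against NULL. If it were NULL, undefined
// behavior would already have occurred by the time of the check.
int readValue(Node *StrippedPtr) {
    int v = StrippedPtr->value;    // dereference happens here...
    if (StrippedPtr == NULL)       // ...so this check is useless
        return -1;
    return v;
}
```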

Analyzers are extremely helpful for the C and C++ languages. This is why we started developing the PVS-Studio analyzer for these languages, and will continue doing so. There is little chance that PVS-Studio will run out of work in the future, as these languages are really popular and dangerous at the same time.

Why C#

Of course, in some regard, C# is more thought-out, and safer than C++. Still, it is not perfect and it also causes a lot of hassle for programmers. I'll give only one example, because it is a topic for a separate article.

Here is our good old friend, the error we described before. A fragment from the PowerShell project:

First, the reference other.Parameters is used to get the property Count, and only then is it verified against null.

As you can see, in C# the pointers are now called references, but it didn't really help. If we touch upon the topic of typos, they are made everywhere, regardless of the language. In general, there is a lot to do in C#, so we continue developing this direction.

What's next?

For now we don't have exact plans on what language we want to support next. We have two candidates: Objective-C and Java. We are leaning more towards Java, but it is not decided yet.

Technologies we do not use in PVS-Studio

Before speaking about the inner structure of PVS-Studio, I should briefly state what you won't find there.

PVS-Studio has nothing to do with the Prototype Verification System (PVS). It's just a coincidence. PVS-Studio is a contraction of 'Program Verification Systems' (OOO "Program Verification Systems").

PVS-Studio does not use a formal grammar for bug searching. The analyzer works at a higher level: the analysis is done on the basis of the derivation tree.

PVS-Studio does not use the Clang compiler to analyze C/C++ code; we use Clang only for preprocessing. More details can be found in the article: "A few words about interaction between PVS-Studio and Clang". To build the derivation tree, we use our own parser, which was based on the OpenC++ library, now largely forgotten in the programming world. Actually, there is almost nothing left of this library, and we implement support for new constructs ourselves.

When working with C# code we take Roslyn as the basis. The C# analyzer of PVS-Studio checks the source code of a program, which increases the quality of the analysis compared with binary code analysis (Common Intermediate Language).

PVS-Studio does not use string matching or regular expressions. This way is a dead end. This approach has so many disadvantages that it is impossible to create a reasonably good analyzer based on it, and some diagnostics cannot be implemented at all. This topic is covered in more detail in the article "Static analysis and regular expressions".

Technologies we use in PVS-Studio

To ensure high quality in our static analysis results, we use advanced methods of source code analysis for the program and its control flow graph: let's see what they are.

Note. Further on, we'll have a look at several diagnostics and the principles of their work. It is important to note that I deliberately omit the description of those cases when a diagnostic should not issue warnings, so as not to overload this article with details. I have written this note for those who have no experience in the development of an analyzer: don't think that it's as simple as it may seem after reading the material below. Creating a diagnostic is only 5% of the task. It's not hard for the analyzer to complain about suspicious code; it's much harder not to complain about correct code. We spend 95% of our time "teaching" the analyzer to recognize various programming techniques that may look suspicious to a diagnostic but are in fact correct.

Pattern-based analysis

Pattern-based analysis is used to search for fragments in the source code that are similar to known error containing code. The number of patterns is huge, and the complexity of their detection varies greatly. Moreover, in some cases, the diagnostics use empirical algorithms to detect typos.

For now, let's consider two of the simplest cases that are detected with the help of pattern-based analysis. The first simple case:

The same set of actions is performed regardless of the condition. I think everything is so simple that it requires no special explanation. By the way, this code fragment is not taken from a student's coursework, but from the code of the GCC compiler. The article "Finding bugs in the code of GCC compiler with the help of PVS-Studio" describes those bugs we found in GCC.
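The GCC fragment itself is not shown above, so here is a minimal invented sketch of the pattern: both branches of the condition perform the same actions.

```cpp
// Invented illustration of the pattern: the 'then' and 'else' branches
// are identical, so the condition has no effect. This is usually a
// copy-paste error where one of the branches was meant to differ.
int pickShift(bool bigEndian, int width) {
    if (bigEndian)
        return width - 8;   // identical to the branch below --
    else
        return width - 8;   // one of the two was probably meant to differ
}
```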

Here is the second simple case (the code is taken from the FCEUX project):

The following erroneous pattern gets analyzed. Programmers know that when they allocate memory to store a string, they must allocate room for one extra character, where the string terminator (the null character) will be stored. In other words, they know that they must add +1 or +sizeof(TCHAR). But sometimes they do it rather carelessly. As a result, they add 1 not to the value returned by the strlen function, but to the pointer.

This is exactly what happened in our case. strlen(name)+1 should be written instead of strlen(name+1).

Because of such an error, less memory is allocated than necessary. Then we'll have an access beyond the bound of the allocated buffer, and the consequences will be unpredictable. Moreover, the program may appear to work correctly if, by mere luck, the two bytes after the allocated buffer aren't used. In a worst-case scenario, this defect can cause induced errors that show up in a completely different place.
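A sketch of the arithmetic behind this pattern (the name variable is hypothetical): strlen(name + 1) measures the string starting from its second character, so the result is two less than the strlen(name) + 1 that is actually needed.

```cpp
#include <cstring>

// The buggy expression skips the first character of the string,
// yielding strlen(name) - 1 instead of strlen(name) + 1.
size_t buggyAllocSize(const char *name)   { return strlen(name + 1); }
// The correct expression reserves room for the terminating null.
size_t correctAllocSize(const char *name) { return strlen(name) + 1; }
```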

Now let's have a look at the analysis of the medium complexity level.

The diagnostic is formulated like this: we warn that after using the as operator, the original object is verified against null instead of the result of the as operator.

Note that the variable other is verified against null, not the variable right. This is clearly a mistake, because the code goes on to work with the variable right.

And in the end - here is a complex pattern, related to the usage of macros.

The macro is defined in such a way that the operation precedence inside the macro is higher than the priority outside of the macro. Example:

#define RShift(a) a >> 3
....
RShift(a & 0xFFF) // a & 0xFFF >> 3

To solve this problem, we should enclose the argument a in parentheses inside the macro (better still, enclose the entire macro body as well), so that it looks like this:

#define RShift(a) ((a) >> 3)

Then the macro will be correctly expanded into:

RShift(a & 0xFFF) // ((a & 0xFFF) >> 3)

The definition of the pattern looks quite simple, but in practice the implementation of the diagnostic is quite complicated. It's not enough to analyze just "#define RShift(a) a >> 3". If warnings were issued for all lines of this kind, there would be too many of them. We have to look at the way the macro expands in each particular case and try to determine when it was done intentionally and when the parentheses are really missing.
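The precedence trap can be demonstrated directly; the sketch below compares an unparenthesized and a parenthesized version of the macro from the text.

```cpp
#define RShiftBad(a)  a >> 3        // expands with no parentheses
#define RShiftGood(a) ((a) >> 3)    // argument and whole body parenthesized

// Since '&' binds more weakly than '>>', the bad macro expands
// RShiftBad(a & 0xFFF) into a & (0xFFF >> 3), i.e. a & 0x1FF,
// instead of the intended (a & 0xFFF) >> 3.
int badResult(int a)  { return RShiftBad(a & 0xFFF);  }
int goodResult(int a) { return RShiftGood(a & 0xFFF); }
```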

Type inference

The type inference based on the semantic model of the program, allows the analyzer to have full information about all variables and statements in the code.

In other words, the analyzer has to know whether the token Foo is a variable name, a class name, or a function. The analyzer repeats the work of the compiler, which also needs to know the type of an object and all additional information about the type: its size, whether it is signed or unsigned; if it is a class, how it is inherited, and so on.

This is why PVS-Studio needs to preprocess the *.c/*.cpp files. The analyzer can obtain the information about the types only by analyzing the preprocessed file. Without this information, many diagnostics would be impossible to implement, or they would issue too many false positives.

Note. If someone claims that their analyzer can check *.c/*.cpp files as a text document, without complete preprocessing, then it's just playing around. Yes, such an analyzer is able to find something, but in general it's a mere toy to play with.

So, information about the types is necessary both to detect errors, and also so as not to issue false positives. The information about classes is especially important.

Let's take a look at some examples of how information about the types is used.

The first example demonstrates that information about the type is needed to detect an error when working with the fprintf function (the code is taken from the Cocos2d-x project):

The fprintf function receives a pointer of the char * type as the fourth argument. It so happened that the actual argument is a string of the wchar_t * type.

To detect this error, we need to know the type that is returned by the function gai_strerrorW. If there is no such information, it will be impossible to detect the error.
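The original fragment is not shown above, so here is a hypothetical sketch of the mismatch: errorTextW stands in for gai_strerrorW and returns a wide string, while the "%s" specifier expects a char *.

```cpp
#include <cstdio>
#include <cwchar>

// Hypothetical stand-in for gai_strerrorW: returns a wide string.
const wchar_t *errorTextW() { return L"connection refused"; }

void report(FILE *out) {
    // Bug pattern: "%s" expects char *, but a wchar_t * is passed.
    // Without knowing the return type of errorTextW, the mismatch
    // is invisible to an analyzer.
    // fprintf(out, "error: %s\n", errorTextW());  // wrong
    fprintf(out, "error: %ls\n", errorTextW());    // correct: %ls
}
```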

Now let's examine an example where data about the type helps to avoid a false positive.

The code "*A = *A;" would normally be considered suspicious. However, the analyzer stays silent if it sees the following:

volatile char *ptr;
....
*ptr = *ptr; // <= No V570 warning

The volatile specifier gives a hint that this is not a bug but a deliberate action of the programmer. The developer has to "touch" this memory cell. Why is it needed? It's hard to say, but if the programmer does it, there must be a reason for it, and the analyzer shouldn't issue a warning.

Let's take a look at an example of how we can detect a bug, based on knowledge about the class.

PVS-Studio warning: V598 The 'memcpy' function is used to copy the fields of 'GCStatistics' class. Virtual table pointer will be damaged by this. cee_wks gc.cpp 287.

It is acceptable to copy one object into another using the memcpy function if the objects are POD structures. However, there are virtual methods in this class, which means that it contains a pointer to a virtual method table. It is very dangerous to copy this pointer from one object to another.

So, this diagnostic is possible due to the fact that we know that the g_LastGCStatistics variable is a class instance, and that this class isn't a POD type.
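The distinction can be sketched with the standard type trait std::is_trivially_copyable; GCStatisticsLike is a hypothetical class modeled on the one described, where a virtual method gives the object a vtable pointer.

```cpp
#include <type_traits>

struct PodStats {                   // plain data: memcpy is safe here
    int collections;
    double pauseTime;
};

struct GCStatisticsLike {           // hypothetical polymorphic class:
    int collections;                // the virtual destructor means the object
    double pauseTime;               // carries a vtable pointer that memcpy
    virtual ~GCStatisticsLike() {}  // would blindly overwrite
};

static_assert(std::is_trivially_copyable<PodStats>::value,
              "byte-wise copy is fine for POD");
static_assert(!std::is_trivially_copyable<GCStatisticsLike>::value,
              "memcpy would damage the virtual table pointer");
```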

Symbolic execution and data-flow analysis

The UnboxedTypeSize function returns various values, including 0. Without a check that the result of the function may be 0, it is used as a denominator. This can potentially lead to division of the offset variable by zero.
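A sketch of this pattern with a hypothetical unboxedTypeSize (the real function is not shown); the fix the analyzer suggests is a check of the result before dividing.

```cpp
// Hypothetical model of UnboxedTypeSize: returns 0 for an unknown type.
int unboxedTypeSize(int typeId) {
    return typeId <= 0 ? 0 : typeId * 4;
}

int elementIndex(int offset, int typeId) {
    int size = unboxedTypeSize(typeId);
    if (size == 0)        // the missing check the analyzer warns about
        return -1;
    return offset / size; // safe: size is known to be nonzero here
}
```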

The previous examples were about the range of integer values. However, the analyzer handles values of other data types, for example, strings and pointers.

Let's look at an example of incorrect handling of strings. In this case, the analyzer stores the information that the whole string was converted to lower or upper case. This allows us to detect the following kinds of situations:

PVS-Studio warning: V773 The function was exited without releasing the 'pMainFrame' pointer. A memory leak is possible. Merge merge.cpp 353

If the frame could not be loaded, the function exits. At the same time, the object, whose pointer is stored in the pMainFrame variable, doesn't get destroyed.

The diagnostic works as follows. The analyzer remembers that the pointer pMainFrame stores the address of an object created with the new operator. Analyzing the control flow graph, the analyzer sees a return statement. At this point, the object hasn't been destroyed, and the pointer still refers to the created object, which means that we have a memory leak in this fragment.
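A minimal sketch of this leak pattern (the Frame type and surrounding code are invented; the pMainFrame name is from the text):

```cpp
#include <string>

struct Frame { bool loaded; };

// The object created with 'new' is not deleted on the early-return
// path -- the V773 pattern described above.
bool showFrame(bool canLoad, std::string &err) {
    Frame *pMainFrame = new Frame{canLoad};
    if (!pMainFrame->loaded) {
        err = "frame could not be loaded";
        return false;              // leak: pMainFrame is never freed here
    }
    delete pMainFrame;
    return true;
}
```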

Method annotations

Method annotations provide more information about the methods used than can be obtained by analyzing their signatures alone.

We have done a lot in annotating the functions:

C/C++. By this moment we have annotated 6570 functions (standard C and C++ libraries, POSIX, MFC, Qt, ZLib and so on).

C#. At the moment we have annotated 920 functions.

Let's see how a memcmp function is annotated in the C++ analyzer kernel:

V698 Expression 'memcmp(....) == -1' is incorrect. This function can return not only the value '-1', but any negative value. Consider using 'memcmp(....) < 0' instead. sos util.cpp 142

This code may work well, but in general it is incorrect. The memcmp function returns zero, a value greater than zero, or a value less than zero. Important:

"greater than zero" is not necessarily 1

"less than zero" is not necessarily -1

Thus, there is no guarantee that such code is well-behaved. At any moment the comparison may start working incorrectly. This may happen during the change of the compiler, changes in the optimization settings, and so on.
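Only the sign of the result is guaranteed, so only ordered comparisons with zero are portable; a short sketch:

```cpp
#include <cstring>

// memcmp guarantees only the sign of the result, so comparing the
// result with 0 is portable, while comparing it with -1 or +1 is not.
int safeLess(const void *a, const void *b, size_t n) {
    return memcmp(a, b, n) < 0;   // portable
    // memcmp(a, b, n) == -1      // NOT portable: any negative value is legal
}
```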

The flag INT_STATUS helps to detect one more kind of an error. The code of Firebird project:

PVS-Studio warning: V642 Saving the 'memcmp' function result inside the 'short' type variable is inappropriate. The significant bits could be lost breaking the program's logic. texttype.cpp 3

Again, the programmer handles the return value of the memcmp function carelessly. The error is that the result is truncated: it is placed into a variable of the short type.

Some may think that we are just too picky. Not in the least. Such sloppy code can easily create a real vulnerability.

One such mistake was the root of a serious vulnerability in MySQL/MariaDB in versions earlier than 5.1.61, 5.2.11, 5.3.5, and 5.5.22. The reason was the following code in the file 'sql/password.c':

typedef char my_bool;
....
my_bool check(...) {
return memcmp(...);
}

The thing is that when a user connects to MySQL/MariaDB, the code evaluates a token (SHA over the password and hash), which is then compared with the expected value using the memcmp function. But on some platforms the return value can go beyond the range [-128..127]. As a result, in 1 out of 256 cases, the procedure of comparing the hash with the expected value always returns true, regardless of the hash. Therefore, a simple bash command gives a hacker root access to the vulnerable MySQL server, even without knowing the password. A more detailed description of this issue can be found here: Security vulnerability in MySQL/MariaDB.
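The truncation mechanics can be sketched in a few lines: a perfectly valid nonzero memcmp result loses its significant bits when stored in a one-byte type (my_bool above is a char).

```cpp
// Suppose memcmp returns 0x200 -- a legal "not equal" result on some
// platforms. Truncated to a one-byte type, only the low byte survives.
int compareLikeMySQL(int memcmpResult) {
    char my_bool = (char)memcmpResult;   // truncation, as in the code above
    return my_bool;                      // 0x200 becomes 0: "hashes match"
}
```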

PVS-Studio warning: V549 The first argument of 'memcmp' function is equal to the second argument. psymtab.c 1580

The first and second arguments are marked as POINTER_1 and POINTER_2. Firstly, this means that they must not be NULL. But in this case, we are interested in the second property of the markup: these pointers must not be the same, the suffixes _1 and _2 show that.

Because of a typo in the code, the buffer &sym1->ginfo.value is compared with itself. Relying on the markup, PVS-Studio easily detects this error.

Here is an example of using the F_MEMCMP markup.

This markup includes a number of special diagnostics for such functions as memcmp and __builtin_memcmp. As a result, the following error was detected in the Haiku project:

PVS-Studio warning: V512 A call of the 'memcmp' function will lead to underflow of the buffer '"Private-key-format: v"'. dst_api.c 858

The string "Private-key-format: v" has 21 characters, not 20. Thus, fewer bytes are compared than there should be.
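The off-by-one is easy to verify: the literal is 21 characters long, so a 20-byte memcmp leaves the last character unchecked.

```cpp
#include <cstring>

// The prefix really is 21 characters, not 20; comparing only
// 20 bytes leaves the trailing 'v' unverified.
size_t prefixLength() { return strlen("Private-key-format: v"); }
```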

Here is an example of using the REENTERABLE markup. Frankly speaking, the word "reenterable" does not entirely depict the essence of this flag. However, all our developers are quite used to it, and don't want to change it for the sake of some beauty.

The essence of the markup is as follows: the function has no state and no side effects; it doesn't modify memory, print anything, or delete files on disk. This is how the analyzer distinguishes between correct and incorrect constructions. For example, code such as the following is perfectly workable:

if (fprintf(f, "1") == 1 && fprintf(f, "1") == 1)

The analyzer will not issue any warnings. We are writing two items to the file, and the code cannot be contracted to:

if (fprintf(f, "1") == 1) // incorrect

But this code is redundant, and the analyzer will be suspicious about it, as the function cosf doesn't have any state and doesn't write anything:

if (cosf(a) > 0.1f && cosf(a) > 0.1f)

Now let's go back to the memcmp function, and see which error we managed to find in PHP with the help of the markup we spoke of earlier:

PVS-Studio warning: V501 There are identical sub-expressions '!memcmp("auto", charset_hint, 4)' to the left and to the right of the '||' operator. html.c 396

It is checked twice that the buffer has the "auto" word. This code is redundant, and the analyzer assumes it has an error. Indeed, the comment tells us that comparison with the string "none" is missing here.

As you can see, using the markup, you can find a lot of interesting bugs. Quite often, analyzers give users the possibility of annotating functions themselves. In PVS-Studio, these capabilities are quite limited. There are only a few diagnostics that you can use to annotate something: for example, the V576 diagnostic, which looks for bugs in the usage of formatted output functions (printf, sprintf, wprintf, and so on).

We deliberately don't develop the mechanism of user annotations. There are two reasons for this:

Nobody would spend time doing the markup of functions in a large project. It's simply impossible if you have 10 million lines of code, and the PVS-Studio analyzer is meant for medium and large projects.

If some functions from a well-known library aren't marked up, it's best to write to us, and we'll annotate them. Firstly, we'll do it better and faster; secondly, the results of the markup will be available to all our users.

Once more - brief facts about the technologies

I'll briefly summarize the information about the technologies we use. PVS-Studio uses:

Pattern-based analysis on the basis of an abstract syntax tree: it is used to look for fragments in the source code that are similar to the known code patterns with an error.

Type inference based on the semantic model of the program: it allows the analyzer to have full information on all variables and statements in the code.

Symbolic execution: this allows evaluating variable values that can lead to errors, and performing range checking of values.

Data-flow analysis: this is used to evaluate limitations that are imposed on the variable values when processing various language constructs. For example, values that a variable can take inside if/else blocks.

Method annotations: this provides more information about the used methods than can be obtained by analyzing only their signatures.

Based on these technologies the analyzer can identify the following classes of bugs in C, C++ and C# programs:

64-bit errors;

the address of a local variable is returned from the function by reference;

arithmetic overflow, underflow;

array index out of bounds;

double release of resources;

dead code;

micro optimizations;

unreachable code;

uninitialized variables;

unused variables;

incorrect shift operations;

undefined/unspecified behavior;

incorrect handling of types (HRESULT, BSTR, BOOL, VARIANT_BOOL);

misconceptions about the work of a function/class;

typos;

absence of a virtual destructor;

code formatting that doesn't correspond to its logic;

errors due to Copy-Paste;

exception handling errors;

buffer overflow;

security issues;

confusion with the operation precedence;

null pointer/reference dereference;

dereferencing parameters without a prior check;

synchronization errors;

errors when using WPF;

memory leaks;

integer division by zero;

diagnostics implemented at users' requests

Conclusion. PVS-Studio is a powerful tool for bug hunting, which uses a modern arsenal of detection methods.

Yes, PVS-Studio is like a superhero in the world of programs.

Testing PVS-Studio

The development of an analyzer is impossible without constantly testing it. We use seven different testing techniques in the development of PVS-Studio:

Static code analysis on the machines of our developers. Every developer has PVS-Studio installed. New code fragments and the edits made in the existing code are instantly checked by means of incremental analysis. We check C++ and C# code.

Static code analysis during the nightly builds. If a warning wasn't dealt with, it will show up during the overnight build on the server. PVS-Studio scans C# and C++ code. Besides that, we also use the Clang compiler to check the C++ code.

Unit tests at the class, method, and function levels. This approach isn't very well developed, as some things are hard to test because of the need to prepare a large amount of input data for a test. We mostly rely on high-level tests.

Functional tests for specially prepared and marked up files with errors. This is our alternative to the classical unit testing.

Functional tests proving that we are parsing the main system header files correctly.

Regression tests of individual third-party projects and solutions. This is the most important and useful way of testing for us. Comparing the old and new analysis results we check that we haven't broken anything; it also provides an opportunity to polish new diagnostic messages. To do this, we regularly check open source projects. The C++ analyzer is tested on 120 projects under Windows (Visual C++), and additionally on 24 projects under Linux (GCC). The test base of the C# analyzer is slightly smaller. It has only 54 projects.

Functional tests of the user interface - the add-on, integrated in the Visual Studio environment.

Conclusion

This article was written in order to promote the methodology of static analysis. I think that readers might be interested to know not just about the results of the analyzer work, but also about the inner workings. I'll try writing articles on this topic from time to time.

Additionally, we plan to take part in various programming events, such as conferences and seminars. We will be glad to receive invitations to various events, especially those that are in Moscow and St. Petersburg. For example, if there is a programmer meeting in your institute or a company, where people share their experience, we can come and make a report on an interesting topic. For instance, about modern C++; or about the way we develop analyzers, about typical errors of programmers and how to avoid them by adding a coding standard, and so on. Please, send the invitations to my e-mail: karpov [@] viva64.com.