My coworkers recently found a bug in the x64 code generated by Visual C++. This bug exists in VC++ 2010 to VC++ 2013 RC. We put in a workaround (the traditional one of disabling optimizations for the afflicted function) and I created a minimal repro to allow reporting of the bug to Microsoft.

This post discusses the poorly understood art of how to create minimal reproes of compiler bugs using Visual C++.

The bug showed up in a large member function in a source file that, with its include files, was several hundred thousand lines of code. This object file was linked with hundreds of other object files and dozens of libraries. The resulting DLL was loaded into a process with many other DLLs where it would connect to multiple databases and web servers. This is so much complexity that it is useless for reporting the bug. In order to investigate and report compiler issues it is important to reduce them to a manageable size. I got this one down to a 4 KB .zip file.

Most compiler bugs aren’t

Five different people looked at this code before we decided that it was a compiler bug. We quintuple checked because we all knew that the first rule of compiler bugs is that they usually aren’t compiler bugs. To accurately assess whether something is a compiler bug you have to know the language standard, you have to know about undefined behavior (for example, p[i] = ++i; is undefined), you have to be comfortable reading assembly language, and you have to realize that you might be wrong. Compiling the problematic code with other compilers and looking at their warnings can be helpful in determining whether your code is at fault.

/FAcs – linking is optional

When iterating on code generation (code-gen) you can save a lot of time if you don’t have to link and run your code. In the traditional edit/compile/link/run cycle the last two stages can easily take 90% or more of the time, so if you can skip them you can iterate ten times faster! I use this trick when investigating compiler bugs and when figuring out how to get a compiler to generate efficient code. Skipping the link/run cycle also means that you can call functions that don’t exist, which simplifies your workflow when you are focused on a single function.

The trick in Visual Studio is to modify the build settings so that you can see the assembly language generated by each source file without invoking the linker. Go to Properties, C/C++, Output Files, and set Assembler Output to /FAcs. This tells the compiler to create a .cod file containing the generated assembly language. This file is typically put in the same directory as the .obj file.

If you have Whole Program Optimization (aka Link Time Code Generation, /LTCG, or /GL) enabled then machine code won’t be generated until link time, which is not what you want. To disable LTCG go to the project properties, General, look for Whole Program Optimization, and set it to No. That should change the setting in C/C++, Optimization, Whole Program Optimization, but check it just to be sure.

If disabling LTCG makes the code-gen bug stop happening then coming up with a minimal repro is going to be harder, and beyond the scope of this post – sorry.

Compile the problematic file using Ctrl+F7. Load the .cod file into Visual Studio and make sure that VS is set to automatically reload modified files. This setting is in Tools, Options, Environment, Documents, Auto-load changes if saved. Now every time that you recompile the problematic source file the .cod file will be regenerated and reloaded. If you are making minor changes then the problematic machine code will stay in view. If you make larger changes then it may move – search for the function name and “PROC” to find it.

Delete, delete, delete

Now you can start iterating on the code. The basic idea is to delete or simplify code, compile, and see if the bug still occurs. Since you aren’t linking the code you can delete code aggressively – you only need enough to let the source file compile. You should be able to delete all functions after the problematic function, and the bodies of all of the functions before it.

Because you aren’t linking and running the code you have to know what the bug looks like in order to tell whether it is still there. In the case of this bug there was a specific signature that I was watching for:

mov ebx, [rax+48] ; Move 32 bits to ebx. This is zero-extended to rbx.
cmp rbx, -400 ; 64-bit compare against a sign-extended constant.

The bug here is that the mov to ebx zero extends the value to 64 bits in rbx, but the 64-bit cmp is using a sign-extended constant. It is impossible for the comparison to ever succeed. The variables in question were both 32-bit int and the compiler emulated a 32-bit comparison incorrectly. The register used by the compiler varied, but always there was the zero-extended mov and the sign-extended constant for the cmp.

More on the bug later; the point is to know what to look for so that you can tell whether the bug is still there by inspecting the .cod file.

In addition to deleting other functions that aren’t needed you should start deleting parts of the buggy function. The more code you remove the easier it is to understand what is going on and the smaller your repro will be. It is not uncommon at this point to spot a mistake that you have made and realize that the bug is in your code, thus proving that it’s not a compiler bug.

In the case of this bug I took the original function which was a few hundred lines and trimmed it down to about ten lines of code.

Isolate

Now you have a minimal function that probably references some data and makes some function calls. In order to come up with a minimal repro you need to simplify those references and calls. For instance, let’s imagine that the function had a line like this:

int v = g_sys->Foo()->Bar()->Data()->GetInt();

If the line can be deleted entirely then that is ideal, but if ‘v’ or the function calls are part of the bug then you can’t do that. But you can start simplifying. If g_sys->Foo() returns a Foo* then declare a global variable of that type and change the expression to:

extern Foo* pFoo;
int v = pFoo->Bar()->Data()->GetInt();

See what we did there? The variable pFoo is declared but not defined anywhere. This code won’t link, but we don’t care – we’re just compiling. If the bug still reproes (examine the code carefully) then replace pFoo->Bar() with pBar, and then replace pBar->Data() with pData, and maybe even replace pData->GetInt() with GetInt(). Simplify, simplify, simplify. The more code you remove now, the easier the next step will be.

In the case of this bug the triply nested member function calls needed one level of indirection to still reproduce the bug, so the final line looked like this:

extern Data* pData;
pData->GetInt();

The Data class is probably big and complicated, so we have to simplify it. That’s usually quite easy because the only functionality we are using from it is GetInt(). You’ll want to create a MyData struct that contains that accessor function and the backing data that it needs, rename Data to MyData and then confirm that the code-gen bug is still there. Here’s what a typical simplified struct might look like:
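The struct from the original repro isn't reproduced here; a sketch under assumptions (the member names and 48-byte offset are inferred from the assembly signature and the initializer shown later in the post) might be:

```cpp
// Hypothetical reconstruction of the simplified struct -- names and
// layout are assumptions consistent with the rest of the post.
#include <cassert>

typedef unsigned int uint32;

const int k_eOSUMQ = -400;          // the constant the accessor returns

struct MyData {
    char   padding[48];             // filler so the member sits at the offset seen in the asm
    uint32 m_value;                 // backing data for the accessor
    uint32 GetInt() const { return m_value; }
};
```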

You should repeat this set of operations until you have a function with no non-trivial external references. This includes getting rid of references to member variables – those can usually be replaced with global variables – and then eventually making your function a non-member function, or a member of a trivial class.

At every step of this process you need to verify that the bug is still happening. This is a research project because you don’t know what transformations are “bug-preserving” and which are not.

You should go back and forth between deleting code and isolating it – isolating it often opens up opportunities to delete more code.

Preprocessing

Some bugs aren’t amenable to the techniques above, so another option for making a stand-alone repro is to preprocess the file. This is important in a large project like Chrome where recreating the full build environment is an enormous effort. In this case you can make a stand-alone repro by building the file with the /P option to preprocess it to a file. Then rename the generated .i file to .cpp, rebuild it with /c /FAcs, and look at the generated code to see if the bug still reproduces. You can then start deleting code from the preprocessed file.
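From a command prompt the same workflow might look like this (the cl.exe flags are real; the file names are placeholders):

```shell
cl /P big_file.cpp        :: preprocess only; writes big_file.i
ren big_file.i repro.cpp  :: treat the preprocessed output as a new source file
cl /c /FAcs repro.cpp     :: compile only; writes repro.cod with annotated asm
findstr PROC repro.cod    :: locate function labels in the listing
```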

If your repro includes a lot of system header files then you may want to unpreprocess the system files – delete their contents and replace them with the #include statements. That process can be simplified by including all of the system header files first and by using /showIncludes to make them easier to identify in the preprocessed file.

Linker bugs

This post is focused on code-gen bugs but most of the same techniques apply to internal compiler errors and incorrect syntax errors. Linker bugs, however, require a different technique. The VC++ linker has a special linker-repro path, documented by Microsoft. Mostly it involves setting a LINK_REPRO environment variable to point to a directory where the repro should be put.

Making the repro

At this stage you should have a small function. This function should have references to a small number of global variables whose types are trivial. You should be able to delete all the rest of the code from this source file and all of the includes (turn off precompiled headers to allow this) and the code should still compile and still reproduce the bug. We’re almost done!

Now you need to make a new project. Turn on /FAcs, paste in the buggy code, compile, and see if you can see the bug in the generated code. If not then there must be a difference in the compilation settings, so go through those until the bug comes back. Pay attention to what settings trigger this bug as you will need that information for the bug report.

Once the bug is reproing in the new project you need to start making it link again. Verifying a bug through code inspection is great, but getting it to run is better. When doing this I recommend that you create a second source file (I typically call it SecondFile.cpp) and put all of the global variables and non-inline functions that your repro depends on there. This should be trivial because you’ve deleted so much code that there aren’t many external references and they are all simple. By putting these functions in a second source file (and because you’ve disabled LTCG, right?) you’ve ensured that the optimizer can’t see them. This means that even if the functions just return a constant value the optimizer can’t use that information. In my repro the second file just had a couple of data declarations that needed to be hidden from the optimizer:

static MyData foo = { { }, (uint32)k_eOSUMQ };
MyData* pData = &foo;

You should adjust your repro so that it detects the bad behavior and prints out a message. My repro prints “Bug!” or “No bug.”

Report the bug

If you are still sure that you have a compiler bug then report it. The compiler vendor should fix it and they may even supply more information about what triggers the bug, how to avoid it, etc. Compiler bug fixes may take a long time to arrive – even after the vendor ships the fix you still have to upgrade – but your future self will thank you. Only by reporting bugs can we get the improvements that we need.

The specifics

The specific bug that I tracked down involved the compiler treating 32-bit integers as 64-bit values, and doing it inconsistently. The ‘int’ type in VC++ is 32 bits, even in x64 builds, so if the compiler does full 64-bit operations on an int it is required to do them “as-if” they were 32-bit operations. The compiler failed to do this.

The bug report can be found on connect.microsoft.com. It contains VS 2010 and VS 2013 projects that demonstrate the bug. All you have to do is build and run the 64-bit release configuration and the program prints “Bug!”. The complete repro .zip files, including source code, project file, and solution file, are about 4 KB.

Here’s the crucial bit of code – GetInt() has a return type of unsigned int and returns k_eOSUMQ, a const int with a value of -400.

The ‘cmp’ instruction is supposed to always compare equal, but it never does because the high 32 bits of rbx are always zero.

Security

In these paranoid days of the NSA subverting every computer system available, every bug is a possible security flaw. This bug seems curiously commonplace – why wasn’t it discovered when building Windows or other 64-bit products? The most likely explanation is chance and happenstance, but compiler bugs can be used to make innocuous source code do surprising things. Maybe this bug is used as part of a back door to let code execution go in a direction that the source code says is impossible. This happened accidentally on the Xbox 360, where the low 32 bits of r0 were checked and then all 64 bits were used. This inconsistency allowed a hypervisor privilege escalation that led to arbitrary code execution.

This bug is probably not security related, but security is another reason to ensure that compiler bugs get fixed.

Standardese

The C++ standard says, in 4.7.2, that assigning an int to an unsigned int is legal and well defined. The MSDN documentation agrees with this interpretation.

Section 4.7.3 says that the reverse conversion, from unsigned int to int, is, for the value used here, implementation defined. MSDN’s explanation is not very useful, but it does say that the conversion is supported. Unlike undefined behavior, “implementation defined” behavior (such as sizeof(int)) is required to be documented, sane, and consistent. Therefore the fact that the implementation-defined behavior varies between debug and release builds proves that this is a bug.

Microsoft agrees that it is a bug. Their comments on the bug are a bit confusing because they conflate undefined behavior (very bad) with implementation-defined behavior (use with care), but as long as they fix the bug I don’t care.

Learnings

One reason to write these posts is that I always learn something useful.

Delta and C-Reduce were suggested as useful tools for creating minimal reproes. For this particular bug (with a complex run-time environment and an easy-to-see signature in the assembly language) they wouldn’t have helped, but it’s great knowing about such tools.


Comments

It looks like the downside of Delta is that you have to write a test to detect whether or not the bug is still there. For this bug I think that using /FAcs and examining the assembly language is still the easiest technique to detect the bug — linking and running the system is just too slow.

Given that the compiler sometimes changed what register it was using I’m not sure that the effort of writing an accurate test for this bug would be worthwhile. It really didn’t take me much time to reduce this bug — half an hour perhaps? Once you don’t need to link the code you can slash-and-burn at a fearsome rate, and my brain is more flexible at recognizing legitimate variations of the bug than any script would be.

In some contexts, delta debugging produces impressive results. A generic “test to detect whether or not the bug is still there” can be: the program P is defined (therefore it should exhibit only one behavior), yet two compilers that make identical implementation-defined choices compile P into binaries that produce different results.

The two compilers can be the same compiler with optimizations enabled and disabled.

Such a test, completely automated, needs to be written only once and then allows one to reduce many “wrong code” bugs.

A group at the University of Utah implemented their own specialized delta-debugger for C programs (and Chucky Ellison and I contributed to the “verifying that the program P is defined” aspect of the problem). The final, automatically reduced bug reports are very short and to the point. It is definitely acceptable to send them as generated (although one may sometimes pad them up a bit to something less reduced and more human-looking).

I agree with your assessment – given this bug and your familiarity with the assembly, Delta wouldn’t be as efficient as your compile-and-inspect iteration.

I had an ideal case for Delta: a very large leaf function that would sometimes return a wrong answer at a specific optimization level. Once preprocessed, it was easily severed from the rest of the application, reproducing with a test driver and mocked inputs. I let it run in Delta for a few hours and it gave me a beautifully minimized testcase to send to the compiler vendor.

In case you’re curious, the compiler introduced a side effect on the right side of an assignment:

Interesting… in the last year we ran into a string pooling bug in VC6.0.
We had two cases: one customer application with lots of long strings, and another with a lot of very similar 8+3 strings. It turned out that /Zi (edit and continue) invokes string pooling, and there was/is a bug in string pooling in VC6.0 C++. Two different strings (in each case) were being optimized into a single string. This was ultimately identified at the assembly level. Our work-around for the ancient compiler was to disable /Zi and string pooling. It’s possible that this bug still exists… someday I may try using Delta to replicate the VC6 bug and see if it’s still in 2013/2015 C++.

Can you give a reference for that? It’s still a read and two writes before the same sequence point (or whatever they’re called in C++11), although at least there’s only one plausible outcome (i is incremented by one).