Getting Familiar with GCC Parameters

gcc (GNU C Compiler) is actually a collection of frontend tools that does compilation, assembly, and linking. The goal is to produce a ready-to-run executable in a format acceptable to the OS. For Linux, this is ELF (Executable and Linking Format) on x86 (32-bit and 64-bit). But do you know what some of the gcc parameters can do for you? If you're looking for ways to optimize the resulted binary, prepare for a debugging session, or simply observe the steps gcc takes to turn your source code into an executable, getting familiar with these parameters is a must. So, please read on.

Recall that gcc does multiple steps, not just one. Here is a brief explanation of their meaning:

Preprocessing: Producing code that no longer contains directives. Things like "#if" cannot be understood directly by the compiler, so this must be translated into real code. Macros are also expanded at this stage, making the resulting code larger than the original. [1]

Compilation: This takes the post-processed code, does lexical and syntax analysis, and then generates the assembly code. gcc outputs warning or error messages during this stage as the analyzer parses and finds any mistakes in your code. If any optimization is requested, gcc will analyze your code further to find the possible spots and manipulate them further. This work is done in multipass style, which demonstrates that it sometimes takes more than one scan through the source code to optimize. [2]

Assembly: Accepts the assembler mnemonics and produces object codes containing opcodes. This is a common misunderstanding: compilation stage doesn't produce opcodes, assembly stage does. The result is one or more object files containing opcodes, which are really machine-dependent. [3]

Linking: Transforms object files into final executables. Opcodes alone are not enough to make the operating system recognize and execute them. They must be wrapped into a more complete form. This form, known as binary format, dictates how the OS loads the binary, arranges the relocation, and does other necessary work. ELF is the default format for x86-based Linux. [4]

The gcc parameters described here directly and indirectly touch all those four stages, so for clarity, this article is organized as follows:

Parameters related to optimization

Parameters related to function call

Parameters related to debugging

Parameters related to preprocessing

Before that, let's meet the accompanying tools that will help us to sneak inside the resulting code:

The ELF tool collection, which includes such programs as objdump and readelf. They parse ELF information for us.

Oprofile, one of the de facto ways to collect hardware performance counters. We need this tool to see certain aspects of the code's performance.

time, a simple way to collect total runtime of a program.

The following explanations can be applied to gcc version 3.x and 4.x, so it's pretty generic. Let's start digging, shall we?

Parameters Related To Code Optimization

gcc offers a "no-brainer" way to do a bunch of optimizations for you: the -O option. It either makes your code run faster or squeezes the final code's size. There are five variants of it:

-O0 (O zero) to -O3. "0" means no optimization, while "3" is highest optimization level. "1" and "2" are between these extreme edges. If you just use -O without specifying any number, implicitly it means -O1.

-Os. Tells gcc to optimize for size. Basically, it is similar to -O2, but skips some steps that might enlarge the size.

How much speedup do we really get from them? Well, suppose we have this code:

Using gcc, four different binaries are created, using each -O variants (except -Os). The time tool records their execution time, e.g.:

$ time ./non-optimized

No optimization

-O1

-O2

-O3

real

0.728

0.1

0.1

0.1

user

0.728

0.097

0.1

0.1

sys

0.000

0.002

0.000

0.000

For simplicity's sake, we use the following conventions:

Non-optimized refers to the executable file compiled with -O0.

OptimizedO1 refers to the executable file compiled with -O1.

OptimizedO2 refers to the executable file compiled with -O2.

OptimizedO3 refers to the executable file compiled with -O3.

As you can see, between the non-optimized case and the one compiled using -O1, the elapsed time is seven times shorter. Notice that -O1, -O2, and -O3 don't really give too much additional benefit--in fact, they are pretty much the same. So what's the magic of -O1 here?

After a quick sightseeing trip to the source code, you would expect that such code is a dead-end for optimization. First, let's look at a brief comparison between the disassembled versions of non-optimized and optimizedO1.

$ objdump -D non-optimized
$ objdump -D optimizedO1

(Note: You might get different results, so use this as general guideline.)

Non-optimized

OptimizedO1

mov 0xfffffff4(%ebp),%eax

add $0x6,%edx

add %eax,0xfffffff8(%ebp)

add $0x1,%eax

addl $0x1,0xfffffff4(%ebp)

cmp $0x1388,%eax

cmpl $0x3,0xfffffff4(%ebp)

The above snippets implement the innermost loop (for (k=0;k<4;k++)). Notice the clear difference: the non-optimized code directly loads and stores from a memory address, while optimizedO1 uses CPU registers as the accumulator and loop counter. As you may already be aware, registers can be accessed hundreds or thousands times faster than RAM cells.

Not satisfied to just exploit CPU registers for temporary storage, gcc uses another optimization trick. Let's see the disassembled code of optimizedO1 again, specifically in the main() function:

0x7530 is 30,000 in decimal form, so we can quickly guess the loop is simplified. This code represents the innermost loop and the outermost loop ("for(j=0;j<5000;j++) ... for(k=0;k<4;k++)") because that is literally a request to do 30,000 loops. Note that you just need to loop three times inside. When k=0, acc is still the same, so the first loop can be discarded.

Hmm, now it is compared to 300,000,000 (10,000*5,000*6). This represents all three loops. After we reach that number of loops, we jump straight to the printf() to print the sum (address 0x80483d0 - 0x80483db)

Six is added in the accumulator in each iteration. %edx will eventually hold the total sum after completion of all the loops. The third and fourth lines shows us that after this is done 5,000 times, it should return to address 0x80483a8 (as mentioned previously).

We can conclude that gcc creates a simplification here. Instead of looping three times in the innermost loop, it simply adds six for every middle loop. This sounds simple, but it makes your program do only 100,000,000 loops instead of 300,000,000. This simplification, technically called loop unrolling, is one of the jobs done by -O1/2/3. Of course, you can do it on your own, but sometimes it is good to know that gcc can detect such things and optimize them.

With -O2 or -O3, gcc also tries to optimize the branch. This is usually achieved via reordering [5] and code transformation. The goal of this procedure is to eliminate as many mispredicted branches as possible, thus improving pipeline utilization. For example, we can compare how non-optimized and optimizedO2 do most outer loops.

Instead of jle, now jne is used. Assuming any integer could be compared in the previous cmp, you could easily conclude that it boosts the chance of taking the branch to almost 100 percent. It is a small but useful hint for the processor to indicate which code should be executed. However, for most modern processors, this kind of transformation is not terribly necessary because the branch predictor is smart enough on its own.

To prove how much this transformation can help, OProfile comes to the rescue. Oprofile is executed to record the number of retired branches and retired mispredicted branches. Retired here means it is executed inside the CPU pipeline.

We run non-optimized and optimizedO2 five times each. Then we take out the maximum and minimum of the samples. We calculate the misprediction rate using this formula (adapted from this):

Misprediction ratio = retired mispredicted branch / retired branch

The average misprediction ratio is now calculated for each binary. Non-optimized gets 0.5117 percent, while optimizedO2 gets 0.4323 percent--a very small gain for our case. The actual gain may vary in a real-world case, because gcc itself can't do much without external hints. Please read about __builtin_expect() in the gcconline documentation for further information.