First, we will look at static profiling. Consider the
following code fragment:

g();
for (i = 0; i < 10; i++) {
    g();
}

Obviously, the call inside the loop executes ten times more
often than the call outside the loop. In many cases, however, there
is no way to make a good estimate. In the following code:

for (i = 0; i < 10; i++) {
    if (condition) {
        g();
    } else {
        h();
    }
}

it is difficult to say whether one branch is more likely to be
taken than the other. If h() happened to be exit() or some other
routine known not to return, it would be safe to assume the then
branch was the more likely one, and inlining g() might be
worthwhile. Without such information, however, the decision of
whether to inline one call or the other (or both) gets more
complicated. Another option is to use dynamic profiling.
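One way a compiler can learn that a routine never returns is through an explicit annotation. The sketch below is hypothetical (the names process(), g() and fatal() are placeholders, not from the article) and uses the __attribute__((noreturn)) extension understood by both gcc and the Intel compiler:

```c
#include <stdio.h>
#include <stdlib.h>

static int calls;

static void g(void) { calls++; }

/* The attribute tells the compiler this routine never returns,
   so the else branch below cannot fall back into the loop. */
static void fatal(const char *msg) __attribute__((noreturn));
static void fatal(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    exit(1);
}

void process(int condition)
{
    for (int i = 0; i < 10; i++) {
        if (condition)
            g();           /* assumed the common path */
        else
            fatal("bad");  /* known never to return */
    }
}
```

With this hint, a static profiler can treat the then branch as the hot path even without run-time data.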

Dynamic profiling gathers information from actual executions
of a program. This allows the compiler to take advantage of the way
a program actually runs in order to optimize it. In a three-step
process, the application is first built with profiling
instrumentation embedded in it. Then the resulting application is
run with a representative sample (or samples) of data, which yields
a database for the compiler to use in a subsequent build of the
application. Finally, the information in this database is used to
guide optimizations such as code placement (grouping frequently
executed basic blocks together), full or partial function inlining,
and register allocation. Register allocation in the Intel compiler is
based on graph fusion (see Resource 5), which breaks the code into
regions. These regions are typically loop bodies or other cohesive
units. With profile information, the regions can be selected more
effectively and are based on the actual frequency of the blocks
instead of syntactic guesses. This allows spills to be pushed into
less frequently executed parts of the program.
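The three-step process might look like the following with the classic Intel compiler; the flag spellings here (-prof-gen, -prof-use) are from contemporary icc releases, and the file names are placeholders, so check your compiler manual:

```shell
# 1. Build with profiling instrumentation embedded
icc -prof-gen -o app main.c

# 2. Run on a representative sample of data to produce the profile database
./app representative-input.dat

# 3. Rebuild, letting the compiler use the gathered profile
icc -prof-use -O2 -o app main.c
```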

Intra-Register Vectorization

Exploiting parallelism is an important way to increase
application performance in modern architectures. The Intel compiler
can be key in the effort to exploit potential parallelism in a
program by facilitating such optimizations as automatic
vectorization, automatic parallelization and support for OpenMP
directives. Let's look at the automatic conversion of serial loops
into a form that takes advantage of the instructions provided by
the Intel MMX technology or SSE/SSE2 (Streaming SIMD Extensions), a
process we refer to as “intra-register vectorization” (see
Resource 1). For example, given a function containing the loop:

for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

the Intel compiler will transform the loop to allow four
single-precision floating-point additions to occur simultaneously
using the addps instruction. Simply put, using a pseudo-vector
notation, the result would look something like this:

for (i = 0; i < n; i += 4) {
    c[i:i+3] = a[i:i+3] + b[i:i+3];
}

A scalar cleanup loop would follow to execute the remainder of the
instructions if the trip count n is not exactly divisible by four.
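Written out by hand, the strip-mined loop plus its scalar cleanup might look like the following (the real compiler emits a single addps for the four-wide body; vecadd() is an assumed name, not from the article):

```c
/* Four single-precision additions per iteration of the main loop,
   then a scalar cleanup loop for the remainder when n % 4 != 0. */
void vecadd(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* "vector" body */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)                 /* scalar cleanup */
        c[i] = a[i] + b[i];
}
```

For n = 7, the main loop handles elements 0 through 3 and the cleanup loop handles the remaining three.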
Several steps are involved in this process. First, because it is
possible that no information exists about the base addresses of the
arrays, runtime code must be inserted to ensure that the arrays do
not overlap (dynamic dependence testing) and that the bulk of the
loop runs with each vector iteration having addresses aligned along
16-byte boundaries (dynamic loop peeling for alignment). In order
to vectorize efficiently, only loops of sufficient size are
vectorized. If the number of iterations is too small, a simple
serial loop is used instead. Besides simple loops, the vectorizer
also supports loops with reductions (such as summing an array of
numbers or searching for the max or min of an array), conditional
constructs, saturation arithmetic and other idioms. Even the
vectorization of loops with trigonometric mathematical functions is
supported by means of a vector math library.
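A sum reduction, for instance, can be rewritten with four partial accumulators, the shape the vectorizer uses so that each SSE lane keeps its own running sum; sum4() below is an assumed name for illustration:

```c
/* Four partial sums accumulated independently, combined once after
   the loop, with a scalar cleanup for leftover elements. Note this
   reassociates the floating-point additions, which the compiler only
   does when permitted by its floating-point settings. */
float sum4(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);   /* combine the lanes once */
    for (; i < n; i++)                 /* scalar cleanup */
        s += a[i];
    return s;
}
```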

To give a taste of a realistic performance improvement that
can be obtained by intra-register vectorization, we report some
performance numbers for the double-precision version of the Linpack
benchmark (available in both Fortran and C at
www.netlib.org/benchmark).
This benchmark reports the performance of a linear equation solver
that uses the routines DGEFA and DGESL for the factorization and
solve phase, respectively. Most of the runtime of this benchmark
results from repetitively calling the Level 1 BLAS routine DAXPY
for different subcolumns of the coefficient matrix during
factorization. Under generic optimizations (switch -O2), this
benchmark reports 1,049 MFLOPS for solving a 100×100 system
on a 2.66GHz Pentium 4 processor. When intra-register vectorization
for the Pentium 4 processor is enabled (switch -xW), the
performance goes up to 1,292 MFLOPS, boosting the performance by
about 20%.
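The DAXPY kernel at the heart of this benchmark computes y := a*x + y; simplified here to unit stride (the real BLAS routine also takes increment arguments), it is exactly the kind of loop -xW vectorizes:

```c
/* Level 1 BLAS DAXPY, unit-stride sketch: y[i] += a * x[i] */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```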

Comments

I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful... :)

"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.