The OpenMP standard for C/C++ and Fortran
(www.openmp.org) has
recently emerged as the de facto standard for shared-memory
parallel programming. It allows the user to specify parallelism
without getting involved in the details of iteration partitioning,
data sharing, thread scheduling and synchronization. Based on these
directives, the Intel compiler will transform the code to generate
multithreaded code automatically. The Intel compiler supports the
OpenMP C++ 2.0 and OpenMP Fortran 2.0 standard directives for
explicit parallelization. Applications can use these directives to
increase performance on multiprocessor systems by exploiting both
task and data parallelism.

The following example program illustrates the use of OpenMP
directives with the Intel C++ compiler for Linux:

The for loop will be executed in parallel by a team of
threads that divide the iterations in the loop body amongst
themselves. Variable k is marked private—each thread will have its
own copy of k—while the arrays x, y and z are shared among the
threads.

The resulting multithreaded code is illustrated below. The
Intel compiler generates OpenMP runtime library calls for thread
creation and management, as well as synchronization (see Resources
1 and 2):
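The generated listing is omitted here, but based on the description that follows, its shape is roughly the following (pseudocode; the exact entry-point signatures belong to the Intel OpenMP runtime library):

```
main()
{
    ...
    /* fork a team of threads, each entering the T-entry point */
    __kmpc_fork_call(loc, T-entry(__main_par_loop), x, y, z);
    ...
}

/* T-region: executed by every thread in the team */
T-entry __main_par_loop(tid, x, y, z)
{
    /* compute this thread's localized lower bound, upper bound
       and stride (static scheduling in this example) */
    __kmpc_for_static_init(loc, tid, STATIC, &lower, &upper, &stride, ...);

    for (k = lower; k <= upper; k++)   /* privatized k */
        z[k] = x[k] * y[k];

    /* inform the runtime that this thread finished its loop chunk */
    __kmpc_for_static_fini(loc, tid);
    T-ret;
}
```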

The multithreaded code generator inserts the thread
invocation call __kmpc_fork_call with the T-entry point and data
environment (for example, thread id tid) for each loop. This call
into the Intel OpenMP runtime library forks a number of threads
that execute the iterations of the loop in parallel.

The serial loops annotated with the OpenMP directive are
converted to multithreaded code by localizing the lower- and
upper-loop bounds and by privatizing the iteration variable.
Finally, multithreading runtime initialization and synchronization
code is generated for each T-region defined by a [T-entry, T-ret]
pair. The call __kmpc_for_static_init computes the localized loop
lower-bound, upper-bound and stride for each thread according to a
scheduling policy. In this example, the generated code uses static
scheduling. The library call __kmpc_for_static_fini informs the
runtime system that the current thread has completed one loop
chunk.

Rather than performing source-to-source transformations, as
is done in other compilers such as OpenMP NanosCompiler and OdinMP,
the Intel compiler performs these transformations internally. This
allows tight integration of the OpenMP implementation with other
advanced, high-level compiler optimizations, such as vectorization
and loop transformations, for improved uniprocessor performance.

Besides compiler support for OpenMP directive-guided explicit
parallelism, users also can try auto-parallelization by using the
-parallel option. Under this
option, the compiler automatically analyzes the loops in the
program to detect those that have no loop-carried dependency and
can be executed in parallel profitably. The auto-parallelization
phase in the compiler relies on advanced memory disambiguation
techniques for its analysis, as well as the profiling information
for its heuristics in deciding when to parallelize.

CPU-Dispatch

One of the unique features of the Intel compiler is
CPU-Dispatch, which allows the user to target a single object for
multiple IA-32 architectures by means of either manual CPU-Dispatch
or Auto-CPU-Dispatch. Manual CPU-Dispatch allows the user to write
multiple versions of a single function. Each function either is
assigned a specific IA-32 architecture platform or is considered
generic, meaning it can run on any IA-32 architecture. The Intel
compiler generates code that dynamically determines on which
architecture the code is running and accordingly chooses the
particular version of the function that will actually execute. This
runtime determination allows programmers to take advantage of
architecture-specific optimizations, such as SSE and SSE2, without
sacrificing flexibility, allowing execution of the same binary on
architectures that do not support newer instructions.
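A sketch of manual CPU-Dispatch, using the cpu_dispatch/cpu_specific declspec extensions as I recall them from the Intel compiler (consult the compiler documentation for the exact keywords; this is not a portable C construct and will not build with other compilers):

```c
/* Empty dispatch stub: the compiler emits code here that picks the
 * right version at runtime based on the detected processor. */
__declspec(cpu_dispatch(generic, pentium_4))
void work(void) { }

/* Version that runs on any IA-32 processor. */
__declspec(cpu_specific(generic))
void work(void) { /* plain IA-32 code */ }

/* Version that may use SSE2 instructions. */
__declspec(cpu_specific(pentium_4))
void work(void) { /* SSE2-optimized code */ }
```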

Auto-CPU-Dispatch is similar but with the added benefit that
the compiler automatically generates multiple versions of a given
function. During compilation, the compiler decides which routines
will gain from architecture-specific optimizations. These routines
are then automatically duplicated to produce architecture-specific
optimized versions, as well as generic versions. The benefit of
this feature is that it does not require any rewriting by the
programmer.
A normal source file can take advantage of the Auto-CPU-Dispatch
feature by the simple use of a command-line option. For example,
given the function:
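The original listing is omitted here; based on the description that follows, it would have resembled a function with one single-precision loop (vectorizable with SSE) and one double-precision loop (vectorizable only with SSE2), for example:

```c
#define N 1024

float  a[N], b[N], c[N];
double x[N], y[N], z[N];

void work(void)
{
    int i;

    /* Single-precision loop: vectorizable with SSE, so it benefits
     * already on a Pentium III. */
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Double-precision loop: vectorizable only with SSE2, so it
     * benefits on a Pentium 4. */
    for (i = 0; i < N; i++)
        z[i] = x[i] + y[i];
}
```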

the Intel compiler can produce up to three versions of the
function. A generic version of the function is generated that will
run on any IA-32 processor. Another version would be tuned for the
Pentium III processor by vectorizing the first loop with SSE
instructions. A third version would be optimized for the Pentium 4
processor by vectorizing both loops to take advantage of SSE2
instructions.


I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, no OpenMP support for gcc. However, when Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful... :)

"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.