AVX Vectorisation and Cilk Plus

The parallel computing curriculum developed by the University of Oregon IPCC (Intel Parallel Computing Center) is available here.

Some Knights Corner training materials developed by Intel are available here.

AVX (Advanced Vector Extensions) is an extension of the standard x86 instruction set architecture (ISA) that supports algorithms demanding heavy floating-point computation. It is particularly well suited to multimedia applications. Unlike the ISA of NVIDIA GPUs, AVX cannot handle non-consecutive memory access patterns directly, and it loses significant performance when the data is contiguous but not aligned.

The performance provided by AVX on CPUs can be exploited by

embedding AVX assembly code into the source,

using AVX intrinsic functions defined in the header file immintrin.h, or

relying on the auto-vectorization of the GNU and Intel compilers, optionally helping the compilers with #pragma directives.
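As an illustration of the last option, the following is a minimal sketch of a loop prepared for auto-vectorization. The function name saxpy is hypothetical. The sketch assumes C99 restrict pointers and the OpenMP 4.0 directive #pragma omp simd, which GCC enables with -fopenmp-simd and the Intel compiler with -qopenmp-simd. Without such a flag the pragma is ignored and the loop remains a correct scalar loop.

```c
/* Hypothetical example: y = a*x + y, written so that the compiler
 * can vectorize it without having to prove anything about aliasing. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    /* restrict promises that x and y do not overlap;
     * the pragma asks the compiler to generate SIMD code for the loop. */
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop was actually vectorized can be checked in the compiler's vectorization report.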

AVX intrinsics

Intrinsics are functions that wrap individual assembly instructions as inline functions, which reduces programming complexity compared to hand-written assembly. They are defined in header files (e.g. immintrin.h) and are available in both the GNU and Intel C/C++ compilers.
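A minimal sketch of intrinsics in use is given here: adding two float arrays eight elements at a time. The function name add_avx is hypothetical. The target("avx") attribute is a GCC/Clang-specific assumption that lets the function compile without -mavx; when the whole file is compiled with -mavx it can be dropped. Unaligned loads and stores are used so that the sketch works for any pointers, at the cost of the alignment penalty mentioned above.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical example: c[i] = a[i] + b[i] using 256-bit AVX registers. */
__attribute__((target("avx")))
void add_avx(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: one __m256 register holds 8 single-precision floats. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned load  */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }

    /* Scalar remainder loop for the last n % 8 elements. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The function must only be called on a CPU that supports AVX.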

OpenCL

Intel's and AMD's OpenCL (Open Computing Language) drivers provide "implicit vectorization" using the CPU's vector instructions. OpenCL offers better code portability across parallel platforms. To obtain vectorized code with OpenCL, it is essential to optimize the code specifically for SIMD-like vector instructions and multithreaded execution. One may use the Intel Offline Compiler to learn more about what prevents parallelization in the code being developed.

CilkPlus, array notation and elemental functions

These three features of the Intel compiler help to implement algorithms with instruction-level parallelism (ILP) and thread-level parallelism, and they can be used independently of each other. Cilk is a task-parallel (multi-threading) feature, provided by the Intel cilkrts runtime library, that schedules a parallel problem (e.g. a for loop) onto multiple cores of a CPU. It is similar to OpenMP's parallel for, with the major difference that Cilk handles parallel work as a set of problem chunks and schedules these chunks onto threads with a work-stealing scheduler.
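A minimal sketch of a for loop parallelised with cilk_for follows. The function name scale is hypothetical. The sketch assumes that the compiler defines the __cilk macro when Cilk Plus is enabled (as the Intel compiler and GCC with -fcilkplus do); on any other compiler cilk_for is mapped to a plain for, which gives the serial version of the same program.

```c
#ifdef __cilk
#include <cilk/cilk.h>   /* provides the cilk_for keyword */
#else
#define cilk_for for     /* assumption: fall back to a serial loop */
#endif

/* Hypothetical example: multiply every element of x by a. */
void scale(float *x, float a, int n)
{
    /* The runtime splits the iteration range into chunks and lets
     * idle worker threads steal chunks from busy ones. */
    cilk_for (int i = 0; i < n; ++i) {
        x[i] *= a;
    }
}
```

The iterations must be independent of each other, since the order in which the chunks run is not defined.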

CilkPlus is Intel's attempt to help the compiler parallelize better at both the instruction level and the task level by means of language extensions. It is essentially a suite that combines Cilk's thread/task-level parallelism with the array notation language extension. Array notation was introduced to help the compiler use SIMD instructions by giving it more information about the data structure, and to simplify implementation. Using array notation also makes the code more readable and clearer.
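As a sketch of what array notation looks like, the function below adds two arrays. The function name vec_add is hypothetical. An array section is written as array[start:length]; the statement inside the #ifdef is only compiled when the compiler defines __cilk (assumed here to indicate Cilk Plus support), and the #else branch shows the equivalent scalar loop.

```c
/* Hypothetical example: c = a + b over n elements. */
void vec_add(float *c, const float *a, const float *b, int n)
{
#ifdef __cilk
    /* Array notation: operate on n elements starting at index 0.
     * No loop index is needed and the operation is explicitly data parallel. */
    c[0:n] = a[0:n] + b[0:n];
#else
    /* Equivalent scalar loop for compilers without Cilk Plus. */
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
#endif
}
```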

From the compiler's point of view, array notation is in some sense similar to Fortran's array handling. A tutorial on CilkPlus can be found here. For Cilk and CilkPlus example source code, click here. Elemental functions are a key construct for vectorisation. They are functions that the Intel compiler vectorizes and then inlines into a for loop at a later stage of compilation.
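A minimal sketch of defining and calling an elemental function is shown below. The names ELEMENTAL, axpy_elem and axpy are hypothetical. The sketch assumes the Linux spelling __attribute__((vector)) (the Intel compiler on Windows uses __declspec(vector) instead) and that __cilk is defined when Cilk Plus is available; otherwise the macro expands to nothing and the function is an ordinary scalar function.

```c
#ifdef __cilk
#define ELEMENTAL __attribute__((vector))  /* ask for a vector variant */
#else
#define ELEMENTAL                          /* plain scalar function */
#endif

/* Elemental function: written for one element, with scalar arguments. */
ELEMENTAL float axpy_elem(float a, float x, float y)
{
    return a * x + y;
}

/* Hypothetical caller: the compiler may replace the scalar call with
 * the vector variant that processes several elements per iteration. */
void axpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; ++i) {
        y[i] = axpy_elem(a, x[i], y[i]);
    }
}
```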

Tips and Tricks for AVX vectorization

Manual loop unrolling for better performance. Although the Intel compiler is able to vectorize a loop in some cases, performance can often be increased further by unrolling the loop manually. The reader is encouraged to visit the following blog post. Note: the code in the post only works for N > stepsize. In general one might consider using

#define ROUND_DOWN(N,step) (((N)/(step))*(step))

instead of the proposed one:

#define ROUND_DOWN(N,step) ((N) & ~((step)-1))

If N and step are declared as const int, or are C/C++ macros or C++ template parameters, then the expression with division and multiplication is evaluated at compile time. Otherwise one might consider using the latter one, which avoids the division but gives the correct result only when step is a power of two.
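To make the role of the macro concrete, here is a minimal sketch of a manually unrolled sum. The names sum_unrolled and ROUND_DOWN_POW2 are hypothetical, and the unroll factor of 4 is an arbitrary choice. The main loop runs over the largest multiple of the step that fits into n, and a remainder loop handles the rest, so the function also works when n is smaller than the step.

```c
#include <stddef.h>

/* Division form: correct for any positive step. */
#define ROUND_DOWN(N, step)      (((N) / (step)) * (step))
/* Mask form: correct only when step is a power of two. */
#define ROUND_DOWN_POW2(N, step) ((N) & ~((step) - 1))

/* Hypothetical example: sum of n floats, unrolled by 4. */
float sum_unrolled(const float *x, size_t n)
{
    const size_t step = 4;
    const size_t n_main = ROUND_DOWN(n, step);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;

    /* Four independent accumulators break the dependency chain,
     * which gives the compiler and the CPU more freedom. */
    for (i = 0; i < n_main; i += step) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float s = (s0 + s1) + (s2 + s3);

    /* Remainder loop: at most step - 1 iterations. */
    for (; i < n; ++i) {
        s += x[i];
    }
    return s;
}
```

Because floating-point addition is not associative, the unrolled sum may differ slightly from a strictly sequential sum for general inputs.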