Memory Movement and Initialization: Optimization and Control

Overview

Are you initializing data or copying blocks of data from one variable to another in your application? Probably so; moving or setting blocks of data is very common. So how can you best optimize these operations for Intel® Xeon Phi™ Coprocessors?

Job #1 - For Phi, Parallelize the initialization!

A single Intel Xeon Phi coprocessor core cannot saturate the available memory bandwidth. So if only one core is initializing your large arrays, you will notice significantly lower performance compared to Intel® Xeon® processors (due to the relatively slow clock speed of the coprocessor cores). Therefore, on Intel Xeon Phi coprocessors it is necessary to get many cores involved in memory initialization to ensure that the memory subsystem is driven at or near maximum bandwidth.

For example, if you have something like this:

do i=1,N
   arr1(i) = 1.1_dp
end do

you can parallelize the do loop:

!DIR$ vector nontemporal
!$OMP PARALLEL DO
do i=1,N
   arr1(i) = 1.1_dp
end do

mem*() calls in libc

The mem* family of functions in libc can take significant amounts of time in many applications. These include the memcpy(), memset(), and memmove() functions. C programmers may call these directly in their code. Fortran and C applications with data initializations or data copy statements may also IMPLICITLY call these functions when the compiler translates the set/move/copy statements into calls to the libc mem*() functions. Finally, Fortran may hide calls to the libc mem*() functions inside the Fortran runtime libraries, which often "wrap" them.

Applications compiled with the Intel Compilers

Because these libc mem*() functions are so common, the Intel compilers provide optimized versions of memset and memcpy in the compiler-supplied library libirc. These functions are intended to replace calls to the libc mem*() functions with more highly optimized versions. The Intel replacement routines have the symbol names _intel_fast_memset and _intel_fast_memcpy.

So, depending on the options used for compilation, you may be getting the glibc memcpy (or the user's own version) OR _intel_fast_memcpy.

Streaming Stores - Nontemporal writes for data

Many high-performance computing applications need to move data in huge blocks. Normally, during write operations the application moves data through the data cache(s) on the assumption that the data may be reused again soon (a write-allocate policy). However, in many cases an HPC application completely overwrites the cache contents (first level, second level, the whole cache hierarchy) when moving blocks of data that are much larger than the cache size. This evicts any 'useful' data that may be cached, effectively flushing the caches. To avoid this, the programmer may specify 'streaming stores.' Streaming store instructions on the Intel microarchitecture code name Knights Corner do not perform a read for ownership (RFO) on the target cache line before the actual store, thus saving memory bandwidth. The data remain cached in L2. (This is in contrast to streaming stores on Intel® Xeon® processors, where the on-chip cache hierarchy is bypassed and the data are combined in a separate write-combining buffer.) See the article Intel® MIC Architecture Streaming Stores for more details.

To control the use of non-temporal streaming store instructions, the Intel compilers provide the -qopt-streaming-stores (Linux*, OS X*) and /Qopt-streaming-stores (Windows*) option. This option enables the generation of streaming stores. With this method, data are stored with instructions that use a non-temporal buffer, which minimizes memory hierarchy pollution. Refer to the Intel Compiler User and Reference Guide for more information (C++ | Fortran).

Advanced Notes

_intel_fast_memcpy() (a library function that resides in the libirc.a library shipped with the Intel compiler) uses non-temporal stores for memcpy IF the copy size is > 256 KB. For smaller sizes you will still get vector code, but it will not use non-temporal stores.

The Intel compilers and libraries do NOT automatically parallelize mem*() calls (the execution happens in a single thread unless the memcpy call or loop resides inside a user-parallelized code region).

In some specialized uses of memcpy, the application has extra knowledge of the cache behavior of the source and destination arrays, and of their cache locality at a larger scope than the library code can see from a single invocation of memcpy. In such cases you may be able to apply smarter optimizations (such as prefetching techniques that are not based solely on the input size) in a loop version, or a smarter user-written specialized memcpy, that may lead to better behavior for your application.

For the STREAM benchmark's Copy kernel, the source code does not use memcpy directly; it has a copy loop. Under default options, the compiler translates the loop into a call to _intel_fast_memcpy, which then takes the path that executes the stores as non-temporal stores. In the best-performing version, though, you can get slightly better performance (~14% better) using the options "-qopt-streaming-stores always -qopt-prefetch-distance=64,8" OR "-ffreestanding -qopt-prefetch-distance=64,8", due to the better prefetching behavior in the loop version of the code vectorized by the compiler (driven by the compiler prefetching options and no translation to a memcpy library call).

In general, small-size memcpy performance is expected to be slower on Intel MIC Architecture than on a host processor (when it is NOT bandwidth bound, meaning small sizes plus cache-resident data) due to the slower single-threaded clock speed of the coprocessor.

Takeaways

Memory movement operations can either explicitly or implicitly call the memcpy() or memset() functions to move or set blocks of data. These calls may be linked against the routines in the libc provided by your OS. Under certain conditions the Intel compilers will replace the slower libc calls with faster versions from the Intel compiler runtime libraries, such as _intel_fast_memcpy and _intel_fast_memset, which are optimized for Intel architecture.

Moving large data sets through the cache hierarchy can flush useful data out of the caches. Streaming stores can be used to improve memory bandwidth on Intel® Xeon Phi™ Coprocessors. The -qopt-streaming-stores compiler option can be used, or the nontemporal pragma/directive can be used for finer-grained control.

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ Coprocessors. The paths provided in this guide reflect the steps necessary to discover best possible application performance.