Effective Use of the Intel Compiler’s Offload Features for Intel MIC Architecture


Overview

In this chapter, we examine various best known methods for the Heterogeneous Offload programming model for the Intel® MIC Architecture.

Topics

Selecting Code Sections to Offload

Selections Based on Parallelism

Choose highly-parallel sections of code to run on the coprocessor. Serial code offloaded to the coprocessor will run much slower than on the CPU.

Changing Scope of Offloaded Sections Based on Data Flow

Selecting regions purely by their level of parallelism may yield many small offload sections. Balance this against the cost of transferring data back and forth between CPU and MIC. Data exchange can be slow (it is subject to PCIe speeds). It can also be difficult to program, because of data marshaling (#pragma offload) or the need to insert _Cilk_shared keywords and use _Offload_shared_malloc for dynamic allocation. If two parallel sections have some serial processing between them, choose between:
a) moving the output data of the first parallel section back to the CPU, running the serial code on the CPU, and then moving the input data of the second parallel section from the CPU to the coprocessor; or
b) keeping the data on the coprocessor and running the serial code there, in other words, making the entire parallel-serial-parallel section of code a single offload unit.

Choosing Data Transfer Mechanism

Copyin/Copyout Model (#pragma offload)
This model is supported in both the Intel C/C++ and the Intel Fortran compilers
If the data exchanged between CPU and coprocessor is limited to scalars or arrays of bit-wise copyable elements, choose the #pragma offload model. This model requires localized changes to the code at the point of offload, and some markup of function declarations. Fortran programs are limited to this model (Fortran does not support the Shared-Memory model described below).

Shared-memory Model (_Cilk_shared/_Cilk_offload)
This model is available in the Intel C/C++ compiler ONLY (not supported in Fortran).
If the data exchanged between CPU and coprocessor is more complex than simple scalars and bit-wise copyable arrays, consider using the _Cilk_shared/_Cilk_offload constructs. These keywords implement a shared-memory offload programming model. This model requires functions and statically allocated data to be given the _Cilk_shared attribute, and dynamically allocated data to be allocated in shared memory. Implementing the shared-memory programming model with _Cilk_shared/_Cilk_offload can take more effort; however, a richer class of programs can use Intel MIC Architecture this way, since almost all C/C++ programs can be handled.

Offload Using #pragma offload

Measuring Offload Performance

Initialization Overhead
By default when a program performs the first #pragma offload all MIC devices assigned to the program are initialized. Initialization consists of loading the MIC program on to each device, setting up a data transfer pipeline between CPU and the device and creating a MIC thread to handle offload requests from the CPU thread. These activities take time. Therefore, do not place the first offload within a timing measurement. Exclude this one-time overhead by performing a dummy offload to the device.

Alternatively, use the OFFLOAD_INIT=on_start environment variable setting to pre-initialize all available MIC devices before starting the main program.
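The dummy-offload technique can be sketched as follows. This is a minimal illustration, not code from the original article: the function names warm_up_coprocessor, offloaded_sum, and time_offload are hypothetical, and the timer is the POSIX clock_gettime call. With a compiler that does not recognize the offload pragmas, they are ignored and the code runs on the CPU.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: an empty offload whose only purpose is to trigger
   the one-time device initialization before anything is timed. */
static void warm_up_coprocessor(void)
{
    #pragma offload target(mic:0)
    { ; }
}

/* Hypothetical workload: sums 0..n-1 inside an offload region. */
static long offloaded_sum(int n)
{
    long sum = 0;
    #pragma offload target(mic:0) in(n) inout(sum)
    {
        int i;
        for (i = 0; i < n; i++)
            sum += i;
    }
    return sum;
}

/* Returns the elapsed seconds of the second offload only; the
   initialization cost is paid by the untimed warm-up call. */
double time_offload(int n, long *result)
{
    struct timespec t0, t1;

    warm_up_coprocessor();                  /* not timed */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *result = offloaded_sum(n);             /* timed */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```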

Offload Data Transfer

Minimizing Input Data
Compute Locally if Possible

Keep Data Persistent across Offloads
If data values at the end of an offload are needed by a later offload, keep them on the coprocessor.

When relying on data reuse across offloads, the offloads must be to the same coprocessor. Ensure this is the case by using an explicit coprocessor number in the target clause.

Persistence: Statically allocated data
In C/C++, variables declared at file-scope and function-local variables with storage class “static” are statically allocated. Fortran common blocks, data declared in the PROGRAM block, and data with the “save” attribute are statically allocated. Static data will retain values across offloads as long as they are not over-written with new values. Use the nocopy clause to reuse previous values.
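A minimal sketch of static-data persistence, assuming a file-scope array named table (a hypothetical name) and the ON_MIC helper macro defined below, which expands to the target(mic) attribute only when the Intel offload compiler defines __INTEL_OFFLOAD. The array is sent once, and a later offload reads the coprocessor copy through nocopy.

```c
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

#define TABLE_SIZE 4

/* Statically allocated: exists on the CPU and, in an offload build,
   on the coprocessor as well. */
ON_MIC int table[TABLE_SIZE];

/* Fill the table on the CPU and send it to the coprocessor once. */
void send_table(void)
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = i * 10;
    #pragma offload_transfer target(mic:0) in(table)
}

/* A later offload reuses the values already on the coprocessor:
   nocopy(table) suppresses any new transfer of the array. */
int read_table(int idx)
{
    int v = 0;
    #pragma offload target(mic:0) nocopy(table) in(idx) out(v)
    { v = table[idx]; }
    return v;
}
```

Both offloads name coprocessor 0 explicitly, so the second one finds the data where the first one left it.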

In C/C++ and Fortran, variables declared within functions and subroutines are given “automatic” or stack storage by default. Minimize the need for retaining function-local values across offloads.

In the offload environment, each offloaded region runs as a separate function on the coprocessor, so stack-allocated variables are normally not retained across offloads. To provide data persistence across offloads when “nocopy” is requested, scalar values are copied back to the CPU at the end of an offload and sent to the coprocessor again at the next offload, which simulates their retention. For efficiency reasons, this is not recommended for non-scalars (i.e., large function-local arrays and struct objects). Starting with the version 13.0.0.079 Compiler Product (not earlier Beta versions), the compiler functionally supports offloading of such function-local arrays. However, we do not recommend using this feature in performance-sensitive portions of code.

void f()
{
    int x = 55;
    int y[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    // x, y sent from CPU
    // To use values computed into y by this offload in next offload,
    // y is brought back to the CPU
    #pragma offload target(mic:0) in(x) inout(y)
    { y[5] = 66; }

    // The assignment to x on the CPU
    // is independent of the value of x on the coprocessor
    x = 30;
    …
    // Reuse of x from previous offload is possible using nocopy
    // However, array y needs to be sent again from CPU
    #pragma offload target(mic:0) nocopy(x) in(y)
    {
        ... = y[5]; // Has value 66
        ... = x;    // x has value 55 from first offload
    }
}

Persistence: Heap allocated data

The coprocessor heap is persistent across offloads. There are two ways to use heap memory on MIC:

Using the #pragma offload

Explicitly calling malloc on the coprocessor

Either let the compiler manage dynamic memory using the #pragma or manage it using malloc/free. Compiler-managed dynamic memory is allocated/deallocated using alloc_if and free_if.

Compiler-managed Heap-allocated Data

Memory allocation is controlled by alloc_if and free_if, and data transfer is controlled by in/out/inout/nocopy. The two are independent, but data can only be transferred in and out of allocated memory.
// The following macros are used in all the samples when alloc_if/free_if clauses are used
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)
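As a hedged sketch of how these macros combine with the data-transfer clauses, the hypothetical function double_repeatedly below allocates a coprocessor buffer once, reuses it over several offloads, and frees it when the results come back. The function name and the doubling workload are assumptions made for illustration.

```c
#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

/* Hypothetical routine: doubles every element of buf `passes` times,
   keeping the coprocessor buffer allocated between offloads. */
void double_repeatedly(float *buf, int n, int passes)
{
    int p;

    /* Allocate on the coprocessor and send the data once. */
    #pragma offload_transfer target(mic:0) in(buf : length(n) ALLOC)

    for (p = 0; p < passes; p++) {
        /* length(0) with REUSE: no allocation and no transfer,
           the existing coprocessor buffer is used as is. */
        #pragma offload target(mic:0) in(n) in(buf : length(0) REUSE)
        {
            int i;
            for (i = 0; i < n; i++)
                buf[i] *= 2.0f;
        }
    }

    /* Bring the results back and free the coprocessor buffer. */
    #pragma offload_transfer target(mic:0) out(buf : length(n) FREE)
}
```

Allocation (ALLOC/REUSE/FREE) and transfer direction (in/out) are chosen independently on each pragma, which is the separation the paragraph above describes.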

Explicitly managed Heap-allocated Data

Code running on the coprocessor may call malloc/free to explicitly allocate/deallocate dynamic memory. Pointer variables that point to dynamic memory allocated in this way are scalars and are subject to the data persistence rules described above, depending on the scope of their definition (static or function-local).
To prevent interference between compiler-managed dynamic allocation and explicit dynamic allocation, use the nocopy clause for explicitly managed pointer variables that are referenced within offload regions.
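A minimal sketch of explicit management, assuming a file-scope pointer named scratch and the ON_MIC helper macro (both hypothetical). The pointer is only ever assigned and used inside offload regions, and nocopy keeps the offload clauses from touching it.

```c
#include <stdlib.h>

#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

/* Static pointer, so its coprocessor value persists across offloads. */
ON_MIC float *scratch = NULL;

/* Allocate and fill the buffer on the coprocessor. Returns 1 on success. */
int build_scratch(int n)
{
    int ok = 0;
    #pragma offload target(mic:0) nocopy(scratch) in(n) out(ok)
    {
        int i;
        scratch = (float *)malloc((size_t)n * sizeof(float));
        ok = (scratch != NULL);
        for (i = 0; ok && i < n; i++)
            scratch[i] = (float)i;
    }
    return ok;
}

/* Read the last element in a later offload, then free the buffer there.
   Call only after build_scratch(n) has returned 1. */
float last_and_release(int n)
{
    float last = 0.0f;
    #pragma offload target(mic:0) nocopy(scratch) in(n) out(last)
    {
        last = scratch[n - 1];
        free(scratch);
        scratch = NULL;
    }
    return last;
}
```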

Pointers used within offload regions are by default inout, that is, data associated with them is transferred in and out. Sometimes a pointer may be used strictly locally, that is, it is assigned and used on the coprocessor only. The nocopy clause is useful in this case to leave the pointer unmodified by the offload clauses, and allow the programmer to explicitly manage the value of the pointer. In other cases, data is transferred into the pointer from the CPU, and a subsequent offload may want to either a) use the same memory allocated and transfer fresh data into it, or b) keep the same memory and reuse the same data. For case a), an in clause with length equal to the number of elements is useful. For case b) an in clause with length 0 can be used to “refresh” the pointer but avoid any data transfer.

The complete description of in/out/nocopy and use of length clause:

| Clause and modifiers | Length or element count < 0 | Length or element count == 0 | Length or element count > 0 |
| --- | --- | --- | --- |
| nocopy: alloc_if(0) free_if(0) | Length is ignored, value of pointer variable is not modified by the clause (useful for managing pointers locally on MIC) | Length is ignored, value of pointer variable is not modified by the clause (useful for managing pointers locally on MIC) | Length is ignored, value of pointer variable is not modified by the clause (useful for managing pointers locally on MIC) |
| nocopy: alloc_if(0) free_if(1) | OK, update ptr, free memory (ignore length) | OK, update ptr, free memory (ignore length) | OK, update ptr, free memory (ignore length) |
| nocopy: alloc_if(1) free_if(0) | Error, cannot alloc <0 | Error, cannot alloc 0 | OK, do alloc, update ptr |
| nocopy: alloc_if(1) free_if(1) | Error, cannot alloc <0 | Error, cannot alloc 0 | OK, do alloc, update ptr, free memory |
| in / out / inout: alloc_if(0) free_if(0) | OK, update ptr only | OK, update ptr only | OK, update ptr, transfer |
| in / out / inout: alloc_if(0) free_if(1) | OK, update ptr, no transfer, free | OK, update ptr, no transfer, free | OK, update ptr, transfer, free |
| in / out / inout: alloc_if(1) free_if(0) | Error, cannot alloc/transfer <0 | Error, cannot alloc 0 | OK, alloc, update ptr, transfer |
| in / out / inout: alloc_if(1) free_if(1) | Error, cannot alloc/transfer <0 | Error, cannot alloc 0 | OK, alloc, update ptr, transfer, free |

An example of the use of in/out/nocopy and use of length clause is below:

The length value is not needed for freeing (so you can pass a dummy-value of 0 as the length). The length modifier is needed with a pointer because whether you are allocating or freeing is known only at runtime (alloc_if and free_if are expressions). That's why lexically, the length modifier is needed. But when freeing the length value is ignored.
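The following sketch puts three of the table rows together. It is an illustration under assumptions (the function name sum_then_max and its workload are hypothetical): the first offload allocates and transfers, the second refreshes the pointer with length(0) and reuses the data already on the coprocessor, and the last pragma frees the buffer with a dummy length of 0.

```c
/* Hypothetical routine: returns the sum of p[0..n-1] and stores the
   maximum in *max_out, sending the data to the coprocessor only once. */
int sum_then_max(int *p, int n, int *max_out)
{
    int sum = 0, max = 0;

    /* in, alloc_if(1) free_if(0), length > 0: alloc, update ptr, transfer. */
    #pragma offload target(mic:0) in(n) out(sum) \
        in(p : length(n) alloc_if(1) free_if(0))
    {
        int i;
        sum = 0;
        for (i = 0; i < n; i++)
            sum += p[i];
    }

    /* in, alloc_if(0) free_if(0), length == 0: update ptr only.
       The data sent by the first offload is reused. */
    #pragma offload target(mic:0) in(n) out(max) \
        in(p : length(0) alloc_if(0) free_if(0))
    {
        int i;
        max = p[0];
        for (i = 1; i < n; i++)
            if (p[i] > max)
                max = p[i];
    }

    /* nocopy, alloc_if(0) free_if(1): free memory. The length modifier is
       required syntactically but its value is ignored, so 0 is passed. */
    #pragma offload_transfer target(mic:0) \
        nocopy(p : length(0) alloc_if(0) free_if(1))

    *max_out = max;
    return sum;
}
```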

Transferring non-bitwise Copyable Data Between CPU and MIC

Sometimes a data object containing a mixture of bitwise copyable elements (such as scalars and arrays) and non-bitwise copyable elements (such as pointers to other data) needs to be exchanged between CPU and MIC. By default, the compiler disallows such objects in the in/out clauses. If the program only needs to transfer the bitwise copyable elements of such objects, disable the error with the -wd<number> switch, or convert it to a warning with the -ww<number> switch.

Note:

The non-bitwise copyable elements will have indeterminate value and it is your responsibility not to access those fields before first assigning valid values to them.

There may be other circumstances where the compiler issues the "not bitwise copyable" diagnostic. When the error may be overridden, the error code is printed. Use that error code in the -wd or -ww switch.

In other cases, all elements of the non-bitwise object need to be transferred. In this case, you must transfer the individual components of the non-transferrable struct object.

The compiler cannot transfer non bit-wise copyable structs as a whole but can transfer individual fields separately, allowing you to specify a length for each pointer variable.

Once the data is transferred it remains persistent. Sometimes, the “nocopy” clause needs to be used so that existing data remains on MIC and the compiler does not attempt to update it.

Here is an example of passing a struct containing pointer fields, keeping the data persistent across offloads. Data is sent from CPU to MIC only once, in the function send_inputs. It is brought back to the CPU at the end, in the function receive_results. In between, you can use the data as many times as you like, as shown in the function use_the_data.
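The pattern can be sketched as below. This is not the article's original listing: the struct type job_t, its fields, the global mic_job, and the ON_MIC macro are hypothetical, and only the three function names come from the text. Each pointer field is copied into a scalar pointer variable and transferred with its own length, and the coprocessor-side struct is always referenced with nocopy.

```c
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

/* Hypothetical struct that is not bitwise copyable (pointer field). */
typedef struct {
    int    n;
    float *data;
} job_t;

/* Coprocessor-side copy of the struct; never transferred as a whole. */
ON_MIC job_t mic_job;

/* Send the fields one by one; the buffer stays allocated on MIC. */
void send_inputs(const job_t *job)
{
    int    n    = job->n;
    float *data = job->data;
    #pragma offload target(mic:0) in(n) nocopy(mic_job) \
        in(data : length(n) alloc_if(1) free_if(0))
    {
        mic_job.n    = n;
        mic_job.data = data;   /* coprocessor address of the buffer */
    }
}

/* May be called any number of times; no data crosses the bus. */
void use_the_data(void)
{
    #pragma offload target(mic:0) nocopy(mic_job)
    {
        int i;
        for (i = 0; i < mic_job.n; i++)
            mic_job.data[i] += 1.0f;
    }
}

/* Bring the buffer back to the CPU and free it on the coprocessor. */
void receive_results(job_t *job)
{
    int    n    = job->n;
    float *data = job->data;
    #pragma offload_transfer target(mic:0) \
        out(data : length(n) alloc_if(0) free_if(1))
}
```

Whether nocopy on a struct with pointer members needs one of the -wd/-ww overrides described above may depend on the compiler version; treat that as an open detail of this sketch.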

The offload syntax currently disallows specifying arrays of pointers in the IN and OUT clauses. If one or more of the arrays pointed to are needed in offloaded code, assign each required pointer array element to a scalar pointer variable of the same pointer type and use that variable in #pragma offload.

If you need all the data pointed to by an array of pointers at the same time on MIC when doing the computation, then use a loop to allocate and transfer the data to MIC and another loop to free the memory on MIC. In between, do an offload and use the data you transferred. Be careful to use nocopy whenever the pointer array is referenced in the offloaded code, because by default it is treated as inout, but you are transferring in/out the data in the array separately. See below:
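A hedged sketch of the two-loop approach, assuming a global pointer array named rows with NROWS entries and the ON_MIC helper macro (all hypothetical). Each element is copied to a scalar pointer for its own transfer, and the array itself always appears in a nocopy clause.

```c
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

#define NROWS 3

/* Array of pointers; the CPU copy holds CPU addresses and the
   coprocessor copy holds coprocessor addresses. */
ON_MIC float *rows[NROWS];

/* Loop 1: allocate and transfer each row, recording its MIC address. */
void rows_to_mic(int len)
{
    int r;
    for (r = 0; r < NROWS; r++) {
        float *p = rows[r];
        #pragma offload target(mic:0) in(r) nocopy(rows) \
            in(p : length(len) alloc_if(1) free_if(0))
        { rows[r] = p; }
    }
}

/* Use all rows at once; nocopy keeps the default inout away from rows. */
float sum_all_rows(int len)
{
    float total = 0.0f;
    #pragma offload target(mic:0) in(len) out(total) nocopy(rows)
    {
        int r, i;
        total = 0.0f;
        for (r = 0; r < NROWS; r++)
            for (i = 0; i < len; i++)
                total += rows[r][i];
    }
    return total;
}

/* Loop 2: free each row on the coprocessor. */
void rows_from_mic(int len)
{
    int r;
    for (r = 0; r < NROWS; r++) {
        float *p = rows[r];
        #pragma offload_transfer target(mic:0) \
            nocopy(p : length(len) alloc_if(0) free_if(1))
    }
}
```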

Sometimes inlining a function is necessary for optimum performance of the generated code. Functions called directly within a #pragma offload are not inlined by the compiler even if they are marked as inline. To enable optimum performance of code in offload regions, either manually inline functions, or place the entire offload construct into its own function.

In the example below the code in function v1 demonstrates the problem. Without the #pragma offload the function call f(a,i) would have been inlined by the compiler and the loop would have been vectorized. However, when offloaded, the call to f(a,i) is not inlined, which inhibits loop vectorization.

One solution is to manually inline function f, as shown in function v2.

Another solution is to move the offload construct into its own function as shown in function v3.
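The three variants might look like the sketch below. The bodies are assumptions made for illustration (the original listing is not reproduced here), and v3 reflects one reading of "its own function": the loop is moved into a helper, here called v3_body, so that the call to f is no longer made directly inside the #pragma offload construct.

```c
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

/* Small function we would like to see inlined. */
ON_MIC static float f(float *a, int i)
{
    return 2.0f * a[i];
}

/* v1: f is called directly inside the offload region, so it is not
   inlined and the loop may not vectorize. */
void v1(float *a, float *b, int n)
{
    #pragma offload target(mic:0) in(n) in(a : length(n)) out(b : length(n))
    {
        int i;
        for (i = 0; i < n; i++)
            b[i] = f(a, i);
    }
}

/* v2: f inlined by hand. */
void v2(float *a, float *b, int n)
{
    #pragma offload target(mic:0) in(n) in(a : length(n)) out(b : length(n))
    {
        int i;
        for (i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }
}

/* v3: the loop lives in its own function; inside it the call to f is an
   ordinary call that the compiler is free to inline. */
ON_MIC static void v3_body(float *a, float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        b[i] = f(a, i);
}

void v3(float *a, float *b, int n)
{
    #pragma offload target(mic:0) in(n) in(a : length(n)) out(b : length(n))
    { v3_body(a, b, n); }
}
```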

NEW in Intel(R) Composer XE 2013 SP1 (in beta testing spring/summer 2013): This compiler supports the new clauses mandatory, optional, and status, which provide greater control over offload.

NEW in Intel(R) Composer XE 2013 SP1: mandatory and optional clauses

Offload specified using an offload pragma is mandatory by default. The clause “optional” is available to make an individual #pragma offload optional. The clause “mandatory” is also available to specify that a particular offload is mandatory.

NEW in Intel(R) Composer XE 2013 SP1: Compiler switch to control optional/mandatory

All offloads in a file can be made optional or mandatory with the compiler switch -offload-mode={none|mandatory|optional}. The “none” setting turns off the offload feature, which means all #pragma offloads are ignored.

NEW in Intel(R) Composer XE 2013 SP1: status clause

The status clause specifies a variable that will hold the result of an offload after the offload has executed. The variable in the status clause is of type OFFLOAD_STATUS, defined in offload.h. The macro OFFLOAD_STATUS_INIT(var) can be used to initialize the status variable to a special value to distinguish success or failure.

Offload execution behavior

When offload is optional, if an offload request cannot be met
• Execution falls back to the CPU
• If a “status” clause had been used, then you will be able to tell what happened

With mandatory offload, if the request cannot be met
• No CPU fallback
• If a “status” clause is specified, the program won’t terminate. You will need to handle the situation yourself
• Without a “status” clause the program will be terminated
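A minimal sketch of an optional offload with a status clause. The clause names, OFFLOAD_STATUS, and OFFLOAD_STATUS_INIT come from the text above; the field name result and the value OFFLOAD_SUCCESS are assumptions about offload.h that should be checked against your compiler's header. The function name is hypothetical, and the #else branch only exists so the sketch also builds without the Intel offload compiler.

```c
#ifdef __INTEL_OFFLOAD
#include <offload.h>
#endif

/* Increments *value, on the coprocessor if possible.
   Returns 1 if the offload ran on the coprocessor, 0 if the CPU did the work. */
int increment_maybe_offloaded(int *value)
{
    int v = *value;
#ifdef __INTEL_OFFLOAD
    OFFLOAD_STATUS st;
    OFFLOAD_STATUS_INIT(st);

    /* optional: if the request cannot be met, execution falls back to the
       CPU, and st tells us what happened. */
    #pragma offload target(mic:0) optional status(st) inout(v)
    { v += 1; }

    *value = v;
    return st.result == OFFLOAD_SUCCESS;   /* assumed field and constant */
#else
    v += 1;          /* no offload support: plain CPU execution */
    *value = v;
    return 0;
#endif
}
```

With mandatory in place of optional there would be no CPU fallback, so the caller would have to inspect the status and recover on its own.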

When device is available but 16GB cannot be allocated, prints:
offload failed
offload failed due to insufficient memory

When device is not available, prints:
offload failed

Minimize Coprocessor Memory Allocation Overhead

Dynamic memory allocation on the coprocessor can be slow. Minimize allocation/deallocation overhead by doing fewer allocations and frees. If an array is going to be passed multiple times between CPU and coprocessor, allocate it at first usage and free it at last usage. See the example under “Persistence: Heap allocated data”. Even if the same array is not going to be reused repeatedly in offloaded code, the memory block allocated for it on MIC can be reused for other data.

Note that buffers kept allocated when they are no longer needed consume available coprocessor memory, so balance memory needs against the goal of minimizing allocation/deallocation overhead.

Offload Data Alignment

To enable vectorization of code on the coprocessor align data on 64B boundary or higher. For statically allocated variables this is achieved using __declspec(align(64)). For pointer data transferred to the coprocessor the align modifier of #pragma offload could be useful.
#pragma offload target(mic) in(p[0:2048] :align(64))

Note that the offload library normally assigns the coprocessor data the same offset within 64B as the offset within 64B of the CPU data. This offset matching ensures that fast DMA transfers between CPU and coprocessor will be enabled. An align modifier may override this offset matching. To get the benefits of fast DMA data transfer and proper alignment of coprocessor data, align the CPU data instead, and don’t explicitly use the align directive.

Maximize Data Transfer Rate

Data transfer rate between CPU and coprocessor is slowest for stack data and fastest for statically allocated and dynamically allocated data. Align CPU data on a 64B boundary or higher for improved data transfer rate. Align at 2MB for maximum transfer rate.
Make data transfer size a multiple of 64B for improved transfer rate, and a multiple of 2MB for maximum transfer rate. Generally, the larger the data transfer size, the higher the bandwidth. Allocate coprocessor memory in large (2MB) pages for improved data transfer rate. See notes on using the environment variable MIC_USE_2MB_BUFFERS.
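One way to follow both recommendations on the CPU side is sketched below, assuming a POSIX system (posix_memalign). The helper names round_up and alloc_transfer_buffer are hypothetical. The buffer is aligned to the requested boundary and its size is padded to a multiple of the same value, so passing 64 gives 64B alignment and passing 2 * 1024 * 1024 gives 2MB alignment.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

/* Round bytes up to the next multiple of `multiple`. */
static size_t round_up(size_t bytes, size_t multiple)
{
    return (bytes + multiple - 1) / multiple * multiple;
}

/* Allocate a CPU buffer whose address is aligned to `alignment` bytes and
   whose size is padded to a multiple of `alignment`. The padded size is
   returned through *padded. Returns NULL on failure; free with free(). */
void *alloc_transfer_buffer(size_t bytes, size_t alignment, size_t *padded)
{
    void *p = NULL;

    *padded = round_up(bytes, alignment);
    if (posix_memalign(&p, alignment, *padded) != 0)
        return NULL;
    return p;
}
```

Because the CPU data is aligned, the offload library's offset matching also yields aligned coprocessor data without an explicit align modifier.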

Overlapping Data Transfer and Offloaded Computation

Input data needed by an offloaded computation may be sent in advance of the offload. The CPU may continue processing while the data is being transferred. In the example below f1 and f2 are sent to the coprocessor ahead of the offload that will use their values.
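A sketch of sending inputs ahead of time, assuming a hypothetical function add_with_early_send. The offload_transfer pragma starts the transfer of f1 and f2 and returns immediately; the later offload waits on the same signal tag before it computes.

```c
/* Hypothetical routine: result[i] = f1[i] + f2[i], with the inputs sent
   to the coprocessor before the computation is offloaded. */
void add_with_early_send(float *f1, float *f2, float *result, int n)
{
    /* Start the transfer; signal(f1) tags it so it can be waited on. */
    #pragma offload_transfer target(mic:0) signal(f1) \
        in(f1 : length(n) alloc_if(1) free_if(0)) \
        in(f2 : length(n) alloc_if(1) free_if(0))

    /* ... unrelated CPU work can run here while the data is in flight ... */

    /* wait(f1) blocks until the transfer has completed. length(0) reuses
       the data already sent, and free_if(1) releases it afterwards. */
    #pragma offload target(mic:0) wait(f1) in(n) \
        in(f1 : length(0) alloc_if(0) free_if(1)) \
        in(f2 : length(0) alloc_if(0) free_if(1)) \
        out(result : length(n))
    {
        int i;
        for (i = 0; i < n; i++)
            result[i] = f1[i] + f2[i];
    }
}
```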

To receive data asynchronously from MIC to CPU, signal and wait are used as clauses of two different pragmas. The first offload performs the compute but only initiates data transfer. The second pragma causes a wait for the data transfer to be completed.
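The receiving direction can be sketched the same way, again with a hypothetical function name. The first pragma carries the signal clause, so the CPU does not block on the compute or on the transfer of result; the second pragma carries the matching wait.

```c
/* Hypothetical routine: result[i] = a[i] * a[i], with the results
   returned to the CPU asynchronously. */
void square_async(float *a, float *result, int n)
{
    /* Compute on the coprocessor; signal(result) makes the offload
       asynchronous, so control returns to the CPU right away. */
    #pragma offload target(mic:0) in(n) in(a : length(n)) \
        out(result : length(n)) signal(result)
    {
        int i;
        for (i = 0; i < n; i++)
            result[i] = a[i] * a[i];
    }

    /* ... CPU work that does not read result can run here ... */

    /* Block until the offload and its output transfer have finished. */
    #pragma offload_wait target(mic:0) wait(result)
}
```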

In Fortran, subroutine and function parameters are passed by reference. This means that the called routine operates on the parameter through its address. When a subroutine/function parameter is used within an offload region the parameter appears to the compiler as a pointer to data.

Just like other pointer data, the allocation of memory for the parameter in an offload can be controlled explicitly using alloc_if and free_if modifiers in the offload directives. If however, the parameter is used within the offload region and it is not specified explicitly in IN/OUT/INOUT directives then it is subject to the default transfer direction of INOUT.

Whether explicitly specified in the offload directive or implicitly offloaded, the parameter is allocated a separate buffer on the device. Buffer creation/destruction is an expensive process. Avoiding it improves offload data transfer performance.

A simple way to avoid buffer creation for subroutine/function parameters is to make a local copy of the variable and to use the newly created variable in the offloaded region. Then, the variable is grouped along with other local variables and only a single buffer is created for all of them. The example below illustrates this technique:

subroutine does_offload(param)
    integer :: param
    integer :: param_local

    ! This offload uses the parameter in the offload region.
    ! This is inefficient.
    !dir$ offload begin target(mic)
    param = param + 5
    !dir$ end offload

    ! An alternate way to use the parameter on the device
    ! is to use a local copy. This is more efficient.
    ! Make a local copy of the parameter
    param_local = param
    ! Do the offload, operating on the local variable
    !dir$ offload begin target(mic)
    param_local = param_local + 5
    !dir$ end offload
    ! Update the parameter with the result from the device
    param = param_local
end subroutine does_offload

Offload Using _Cilk_shared/_Cilk_offload

Marking Data and Classes _Cilk_shared

Shared Pointer Declaration

A Shared pointer is declared as follows:

int * _Cilk_shared q; // Shared pointer q to non-shared int

Declaration of a Pointer to Shared Data

A pointer to Shared data is written this way:

_Cilk_shared int * p; // Non-shared pointer p to shared int

A Shared pointer to Shared data is a combination of the two:

_Cilk_shared int * _Cilk_shared r; // Shared pointer r to shared int

Declaring a Class Type as Shared

When a class type must be declared Shared, place the keyword between “class” and the rest of the declaration. Placing _Cilk_shared at the beginning marks the data being declared Shared and not the type:

class _Cilk_shared C {
// class members
};

Allocating Dynamic Memory for Shared Data

Dynamically allocated Shared data must be allocated from the pool of Shared memory. This is done by using the APIs declared in offload.h, such as _Offload_shared_malloc and _Offload_shared_free.

In some cases the memory allocation is not under direct user control, for example, STL objects. The offload.h header provides a placement new mechanism for diverting STL memory allocations into Shared memory. Here is an example

Improving Performance of _Cilk_offload/_Cilk_shared

The default memory model for Shared data is to assume both CPU and coprocessor may modify data that is Shared. If the application data model is such that input data of an offload is only sent from CPU to coprocessor and after the offload has finished, all modified data can be sent back to the CPU without needing to be merged with other Shared data that may have been concurrently modified on the CPU, then a simpler and more efficient synchronization model may be specified. Enable this model using the environment variable MYO_CONSISTENCE_PROTOCOL.

Example

setenv MYO_CONSISTENCE_PROTOCOL HYBRID_UPDATE_NOT_SHARED

Linking Offloaded Code with Coprocessor Libraries

At present the compiler is very strict about checking mixing of Shared and non-Shared pointers and it is necessary to circumvent some of these checks. In general, Shared data can always be processed by routines that know nothing about sharing, as long as casts are used.
Third-party libraries built for MIC may be linked with offloaded code by following these steps:
1. Keep your third-party libraries as they are, i.e., built with -mmic and unaware of any sharing.
2. Offload from the CPU program to some functions on MIC that serve as the data exchange functions. These functions will be marked _Cilk_shared and will deal with data marked as _Cilk_shared.
3. From these data exchange functions running on MIC, make calls to the MIC-only libraries. Because data and functions referenced in functions marked _Cilk_shared are required to be _Cilk_shared, you will need casts on data and functions defined in the external libraries (which are not built with the Shared keywords).
Schematically: CPU code → SHARED code running on MIC → (casts) → code built with -mmic

Using MKL and TBB on MIC

MKL and TBB are available on MIC, to be called from functions running on MIC. To use MKL the compiler switch -mkl (Linux) or /Qmkl (Windows) is specified when compiling the program on the CPU. Similarly, to use TBB the compiler switch -tbb or /Qtbb is specified when compiling the program on the CPU. These switches are automatically propagated to the MIC compilation of the offloaded code and those MKL and TBB functions that are supported on MIC may then be called from code running on MIC.

See _Cilk_shared/_Cilk_offload example below (taken from the MKL documentation).

This method of customization may be infeasible if the MIC or CPU versions use intrinsics that are not available on both processors. In these cases an #ifdef may be used. However, do not use an #ifdef __MIC__ directly within a #pragma offload construct because it has the potential to create a mismatch between variables sent from/received by the coprocessor, and sent/received by the CPU. Mismatch may occur because the default for variables is inout and the variable references may not be identical on the two alternate code versions.

#ifdef __MIC__
// MIC version
#else
// CPU version
#endif
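A safe placement of the #ifdef can be sketched as follows, using a hypothetical function where_am_i and the ON_MIC helper macro. The conditional code sits inside a function that is compiled for both sides, so the #pragma offload construct itself references exactly the same variables in the CPU and MIC compilations.

```c
#ifdef __INTEL_OFFLOAD
#define ON_MIC __attribute__((target(mic)))
#else
#define ON_MIC   /* host-only build: no attribute needed */
#endif

/* The #ifdef lives inside a function, not inside the offload construct. */
ON_MIC int where_am_i(void)
{
#ifdef __MIC__
    return 1;   /* MIC version */
#else
    return 0;   /* CPU version */
#endif
}

/* The offload body is identical in both compilations, so the set of
   variables sent and received cannot diverge. */
int run_where(void)
{
    int on_mic = 0;
    #pragma offload target(mic:0) out(on_mic)
    { on_mic = where_am_i(); }
    return on_mic;
}
```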

Controlling Options Passed to CPU and MIC Compilers

Many options that you specify in the command-line for an offload compilation apply to both the host-side compilation as well as the MIC-side compilation.

If you want to pass additional options to the offload compilation, or you would like to override the command line options passed to offload compilation, you must use option -qoffload-option or /Qoffload-option to specify the additional or overriding options.

Example: You can pass the reporting options only for the MIC-side compilation (this makes it easier for analyzing what optimizations are performed for MIC) as follows:

icc -qoffload-option,mic,compiler,"-vec-report2 -qopt-report-phase hlo -qopt-report=3" t1.c -c -qopenmp

In this case, the reporting options are not used for the host-side compilation, and they get used for the MIC-side compilation.

When building a heterogeneous application, the driver passes all compiler options specified on the command-line to the host compilation and only certain options to the offload compilation. To see a list of options passed to the offload compilation, specify option "-watch=mic-cmd".

If you add the option "-watch=mic-cmd" to the earlier example above, then the compiler reports the exact command-line expansion that will be used for the MIC-side compile as follows (in addition to doing the compile):

Internal compiler options which typically begin with -m are not automatically passed from CPU to MIC compilations. These options must be specified explicitly for either the CPU or MIC compilations (using the -qoffload-option,mic,compiler,<option>).

Example

// The following passes an internal option to the CPU compiler
// The MIC command line is asked to be printed
icc -c test.c -watch=mic-cmd -mP2OPT_hlo_pref_indirect_refs=T
MIC command line:
icc -c test.c

// The following passes an internal option to the MIC compiler
// The MIC command line is asked to be printed
icc -c test.c -watch=mic-cmd -qoffload-option,mic,compiler,-mP2OPT_hlo_pref_indirect_refs=T
MIC command line:
icc -c test.c -mP2OPT_hlo_pref_indirect_refs=T

Using Multiple MIC Cards

On a multi-card system #pragma offloads that do not specify an explicit MIC card number result in offloads issued to MIC card 0. However, if data is to be reused between offloads then it is safer to use explicit card numbers in the offloads to ensure that data is carried over from one offload to another in a predictable manner, irrespective of the number of cards available in the system.

The offload pragma allows a user to write offloads to logical cards 1 to N. The physical cards available to the process are specified using the environment variable OFFLOAD_DEVICES=<list of physical devices>. Logical card numbers are then mapped to physical cards by computing N%<#cards-available>, meaning that logical card numbers wrap around among the physical cards.

When using _Cilk_shared and _Cilk_offload, management of data is automatic. Explicit card numbers using _Cilk_offload_to(<card-number>) are useful for manually load balancing between available cards instead of relying on the round-robin default offload behavior.
See also: environment variable OFFLOAD_DEVICES

Environment Variables for Controlling Offload

There are two categories of environment variables:

1. Those that affect the way the Offload runtime library operates

2. Those that are passed through to the coprocessor execution environment by the Offload library

We first describe environment variables in category 1. These are prefixed with either "MIC_" or "OFFLOAD_". The prefix is fixed, as is the environment variable name.

The special environment variable MIC_ENV_PREFIX is used to distinguish variables in category 2. It is described at the end of this section.

MIC_USE_2MB_BUFFERS

Sets the threshold for creating buffers with large pages. A buffer is created with the large pages hint if its size exceeds the threshold value.

Example

// any variable allocated on MIC that is equal to or greater than
// 100KB in size will be allocated in large pages.
setenv MIC_USE_2MB_BUFFERS 100k

This environment variable applies only for data allocated by pragma-offload for pointer variables in in/out/nocopy clauses. If the environment variable is not set, all such allocations happen in 4KB pages.

where [CPU->MIC Data] and [MIC->CPU Data] is the total data transferred in bytes

OFFLOAD_DEVICES

The environment variable OFFLOAD_DEVICES restricts the process to use only the MIC cards specified as the value of the variable. <value> is a comma separated list of physical device numbers in the range 0 to (number_of_devices_in_the_system-1).

Devices available for offloading are numbered logically. That is _Offload_number_of_devices() returns the number of allowed devices and device indexes specified in the target specifiers of offload pragmas are in the range 0 to (number_of_allowed_devices-1).

Example

setenv OFFLOAD_DEVICES "1,2"

Allows the program to use only physical MIC cards 1 and 2 (for instance, in a system with four installed cards). Offloads to devices numbered 0 or 1 will be performed on physical devices 1 and 2. Offloads to target numbers higher than 1 will wrap-around so that all offloads remain within logical devices 0 and 1 (which map to physical cards 1 and 2). The function _Offload_get_device_number() executed on a MIC device will return 0 or 1, when the offload is running on physical devices 1 or 2.
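The wrap-around arithmetic can be written out as a small function. This only illustrates the mapping described above and is not the offload runtime's code; the function name physical_card and its parameters are hypothetical.

```c
/* allowed: physical device numbers listed in OFFLOAD_DEVICES
   count:   number of entries in allowed
   logical: device number written in the target clause
   Returns the physical card that would handle the offload. */
int physical_card(const int *allowed, int count, int logical)
{
    return allowed[logical % count];
}
```

With OFFLOAD_DEVICES set to "1,2", the array is {1, 2}: target(mic:0) and target(mic:2) both land on physical card 1, while target(mic:1) and target(mic:3) land on physical card 2.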

OFFLOAD_INIT

The environment variable specifies a hint to the offload runtime about when it should initialize MIC devices.

Supported values:

on_start

All available devices are initialized before entering main.

on_offload

Device initialization is performed right before the first offload to it. Initialization is done only on the MIC device which handles offload.

on_offload_all

All available MIC devices are initialized right before the first offload in a program.

The default is on_offload_all (for backward compatibility).

MIC_ENV_PREFIX

This is the general mechanism to pass environment variable values to the process running on a MIC card.

Note: The setting of this environment variable has no effect on the fixed MIC_* environment variables discussed before this section, namely MIC_USE_2MB_BUFFERS, MIC_STACKSIZE and MIC_LD_LIBRARY_PATH. Those names are fixed.

By default, all environment variables defined in the environment of an executing CPU program are replicated to the coprocessor's execution environment when an offload occurs. You can modify this behavior by defining the environment variable MIC_ENV_PREFIX. When you set MIC_ENV_PREFIX, then not all CPU environment variables are replicated to the coprocessor, but only those environment variables that begin with the value of the MIC_ENV_PREFIX environment variable. The environment variables set on the coprocessor have the prefix value removed. You thus have independent control of OpenMP*, Intel® Cilk™ Plus, and other execution environments that use common environment variable names.

So, if MIC_ENV_PREFIX is not set, the Offload runtime simply replicates the host environment to the coprocessor. If MIC_ENV_PREFIX is set then only those environment variables whose names begin with the value defined by MIC_ENV_PREFIX are passed to the target (with the prefix removed).

Thus, the value of MIC_ENV_PREFIX sets the value of the prefix which is used to recognize environment variable values intended for programs running on MIC devices. For example, setenv MIC_ENV_PREFIX MYCARDS will use “MYCARDS” as the string that indicates that an environment variable is intended for a MIC process.

Environment variable values of the form <mic-prefix>_<var>=<value> will send <var>=<value> to each card.

Environment variable values of the form <mic-prefix>_<card-number>_<var>=<value> will send <var>=<value> to the MIC card numbered <card-number>.

Environment variable values of the form <mic-prefix>_ENV=<variable1=value1|variable2=value2> will send <variable1>=<value1> and <variable2>=<value2> to each card.

Environment variable values of the form <mic-prefix>_<card-number>_ENV=<variable1=value1|variable2=value2> will send <variable1>=<value1> and <variable2>=<value2> to the MIC card numbered <card-number>.
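The naming rules for the first two forms can be illustrated with a small parser. This is a sketch of the rules as stated above, not the offload runtime's implementation; the function name forwarded_name is hypothetical, and the special _ENV forms are deliberately not handled.

```c
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

/* Given the value of MIC_ENV_PREFIX and the name of a CPU environment
   variable, returns the name the coprocessor would see, or NULL if the
   variable does not carry the prefix. *card is set to the card number for
   the <prefix>_<card-number>_<var> form and to -1 when the variable goes
   to every card. */
const char *forwarded_name(const char *prefix, const char *name, int *card)
{
    size_t plen = strlen(prefix);
    const char *rest;
    const char *p;

    *card = -1;
    if (strncmp(name, prefix, plen) != 0 || name[plen] != '_')
        return NULL;              /* no prefix: not replicated */

    rest = name + plen + 1;       /* text after "<prefix>_" */
    p = rest;
    while (isdigit((unsigned char)*p))
        p++;
    if (p != rest && *p == '_') { /* "<digits>_" follows the prefix */
        *card = atoi(rest);
        return p + 1;
    }
    return rest;
}
```

For example, with MIC_ENV_PREFIX set to MYCARDS, the CPU variable MYCARDS_OMP_NUM_THREADS reaches every card as OMP_NUM_THREADS, while MYCARDS_1_KMP_AFFINITY reaches only card 1 as KMP_AFFINITY.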

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get best possible application performance.

Comments (3)

Thanks for the informative and helpful article. Would like to have a small doubt cleared up -

Under the section - Persistence:Heap allocated data, I was particularly interested in Explicitly managed Heap-allocated Data. What I want to do is allocate an array of structure( this structure contains an allocatable array as well) within the offload region. As of now the compiler throws an error on this.

I could try the same problem successfully on a C++ method, but I'm out of luck in case of Fortran 90.(And I'm using Intel Fortran Compiler v15)

In section "Free-ing memory used by Offload without knowing the length", to initialize c_ptr pointer on MIC (similar as in section "Example of Persistent MIC Pointer and Selective Data Transfer"), should we use

Hi Ronald,
Very good article! A few comments/suggestions/questions:
1) In “Persistence: Statically allocated data”, you globally declare small x but assign to big X in f
2) In “Compiler-managed Heap-allocated Data”, you say that the “macros are use in all the samples when alloc_if/free_if clauses are used” but you actually keep on using alloc_if/free_if afterwards
3) In “Explicitly managed Heap-allocated Data”, you globally declare small p but assign to big P in the pragma
4) The console output from the sample code would be very useful, especially in “Example of Persistent MIC Pointer and Selective Data Transfer”. I am still not sure that I understand what happens with the data in this sample.
5) In what case can we use offload pragmas without a block following it? Are there cases where the pragma will be discarded by the compiler if it is not followed by a block? What do you have a “{;}” block in “Free-ing memory used by Offload without knowing the length”?