Abstract

This paper is a supplement to the Intel® Composer XE for Linux* documentation. It serves as a compendium of best known methods for using the OpenMP* extensions to C/C++ and Fortran when programming offload and native programs for the Intel Many Integrated Core (Intel MIC) Architecture.

Overview

Best Known Methods (BKMs) are techniques, explanations and hints that empower users to gain maximum benefit in their use of systems. The BKMs described here are particular tips for implementing and optimizing programs running on the Intel® Many Integrated Core (Intel® MIC) Architecture and using the versatile language extension for parallel programming, OpenMP. This article and the ones to follow describe OpenMP techniques and tips for operating in both offloaded and native execution environments on Intel MIC Architecture. OpenMP offers a simple to learn model with well-chosen defaults to make the simple forms easy to understand. Those forms have rich options that can be customized to tune for a myriad of variations. These articles will explore some of these options to show how they might most effectively be applied. Each example has been verified against the Intel Composer XE 2013 compilers.

Environment settings for use on Intel MIC Architecture

BKM: Use MIC_ENV_PREFIX in offload to automatically propagate Host environment to Target

When offloading execution to the target coprocessor, users can limit the host environment exposed on the target and selectively propagate settings defined in the host environment to the target through the use of the MIC_ENV_PREFIX environment variable.

Applicability: Offloaded C, C++ and Fortran.

Details: If set on the host (normally to “MIC”), only the host-side, user-defined shell environment variables that start with the defined prefix followed by an underscore (“_”) will be propagated to the target environment. The one exception to this rule is MIC_LD_LIBRARY_PATH, which has a unique meaning on the host (the location on the host of the target libraries to be searched by the offloader); there will be no LD_LIBRARY_PATH in the target environment of offloaded programs. Examples using MIC_ENV_PREFIX are scattered throughout this article; note that the current behavior is always to append an underscore to the defined prefix before using it to match host environment variables for import to the target.

If the MIC_ENV_PREFIX is not set on the host, nearly the entire host environment is copied to the environment of the offload process. This can be confusing, especially if you are using OpenMP on both the target and the host, since you will likely want properties such as affinity and thread count to be different, reflective of the specific architectures.

Basic Use of OpenMP directives in combination with Offload directives

BKM: Offload Essential Details

Offload declarations can start simply but can be modified with options to handle a range of capabilities.

Applicability: Offloaded C, C++ and Fortran

Details: The simplest portable offload statement looks something like this (C and Fortran):

These “minimal” examples still include an optional component, the “:0”, which is not required but selects a specific card unit number, in this case card 0. Even if the unit is left unspecified, i.e. “mic” by itself, the scheduler will always use card 0, which makes performance behavior more predictable. If you want to balance the load across several cards, you must do so manually by specifying a target of “mic:n” in your offloads, where n is the card to which you are offloading.

Once synchronous offload is understood, the next thing users usually want to do is play with data: copying them in, copying them back, allocating them on the target, or some combination. The key to understanding offload data is the recognition that the IN, OUT, INOUT and NOCOPY clauses do not expand the normal scope of data; they only restrict it. If a lot of data are visible but not all are used, restricting how much gets copied between host and target may yield significant performance improvements.

With a mechanism in place to filter data to and from the target, there are more things you can specify about the data, such as their length, alignment, and dynamic allocation behavior. The example below shows the use of the INOUT specifier to name a data pointer (memp), give its length (200 elements whose size is determined by the type of memp) and its data alignment (8-byte). INOUT does not change the offload behavior of identifiers in its charge, but provides a bit of syntax within which the other qualities of the offloaded data can be specified:

#pragma offload target(mic:0), inout(memp : length(200) align(8))

The fourth keyword, NOCOPY, is most commonly used in programs employing pointers located exclusively on the target side, along with the directives LENGTH, ALLOC_IF and FREE_IF. There is also a mechanism using IN and LENGTH(0) directives to maintain the host-side connection to offloaded data, which empowers the programmer to use data persistent on the target across multiple offloads without having to recopy them at each offload. (More information on the use of these keywords is available in the compiler reference documents, and at the Intel® Xeon Phi™ Community portal site.)

Asynchronous offload features are also available, wherein the strict, sequential, non-overlapping execution between host and offload target can be relaxed: the SIGNAL and WAIT specifiers synchronize offloads, and the OFFLOAD_TRANSFER and OFFLOAD_WAIT pragmas handle data transfers separately from offloaded computation. These are also described in more detail in the compiler reference documents.

More about these topics, including examples to show them in practical use, will appear in other articles.

BKM: Use -openmp-report to observe how the compiler is realizing your OpenMP regions and loops

-openmp-report is a command-line switch that produces diagnostic messages on stderr about the compiler’s handling of OpenMP constructs processed during C/C++ or Fortran source compilation. Use the setting -openmp-report=1 to learn about the handling of parallel loops, regions and sections within the compilation unit; raise it to 2 to add diagnostics about various synchronization constructs. Also look at -vec-report.

Applicability: Offloaded and native, C, C++ and Fortran.

Example: Be sure to enable -openmp when you employ the -openmp-report switch. From code equipped with OpenMP parallel region and loop constructs, you should get reports like these (or others indicating failing loops or regions); the line numbers let you connect reports back to specific OpenMP constructs.

BKM: Uncertain whether that offload section is running on the target or the host? Count the number of threads.

If you are uncertain whether the code you’re testing is actually running on the right target, it is fairly easy to add a piece of temporary code that returns the number of threads in the thread team.

Applicability: Offloaded C, C++ and Fortran.

Example: Here are code samples demonstrating a way to use omp_get_num_threads() to report the number of threads running in a parallel section. Remember to make the get_num_threads call within a parallel section but avoid extra work by ensuring that only one thread makes the call.

OMP_GET_NUM_THREADS() operates on thread state within the “innermost enclosing parallel region,” but there are OpenMP calls that don’t require such a context: OMP_GET_MAX_THREADS() returns an upper bound on the thread count without requiring a parallel region. Are you interested only in where the code is running, and not in thread counts at all? Even easier: just replace the body of testThreadCount() with the following code, which will be selectively emitted in the two copies of the function:

Differences in OpenMP use in Native versus Offload environment

Because the system reserves a core for its own processing when running programs invoked via the offload compiler (versus programs cross-compiled and run natively on the target), some of the parameters available to OpenMP vary between these two execution environments.

Applicability: Offloaded and native C, C++ and Fortran.

Details: A simple expansion of the sample program shown above for getting the available number of threads can reveal these differences. When run as an offload program, it produces numbers like the following (note that these examples were captured on an early coprocessor prototype; with modern, production parts you should see much higher counts):

The variations here are all due to the availability of all the cores when running native. This is implemented in the kernel thread table and mimicked by an affinity map set up in the OpenMP runtime environment. You can peek at this affinity map by setting the environment variable KMP_AFFINITY=verbose prior to running an offloaded program. This will generate a lot of output, some of which looks like this:

Here you can see that the default affinity exhibits the "fine" property, ensuring that each thread is tightly bound to a specific logical CPU, allowing the OS no opportunity to move it around the machine. In the above example, there are 31 cores available, giving 124 threads. Logging into the card and setting KMP_AFFINITY=verbose before running the code as a native program shows that one extra core (four extra logical CPUs) is available:

Here the full set of 32 cores is available and the extra HW threads show up in the affinity map as numbers 0, 125, 126 and 127. This odd numbering is due to the way the HW threads are enumerated by the Advanced Programmable Interrupt Controller (APIC) as can be seen in the contents of /proc/cpuinfo (filtered down to the relevant bits):

This structure of HW thread mapping is present regardless of the number of cores on any particular device: using offload causes the system to remove one core from the affinity map. The core removed from the map is the one whose HW thread numbers are not adjacent—always the highest three plus HW thread 0. This core is used by the Intel® Coprocessor Offload Infrastructure (Intel® COI) and other offload-related services.

Between offloaded and native execution OpenMP programs can find themselves within very different environments. On the native side a typical environment is very simple. Using a little environ table dumper, it’s easy to print out the environment of the main process (LD_LIBRARY_PATH was added to provide a path to the OpenMP runtime object).

However, without user action to restrict propagation, the offloaded environment is nearly a complete copy of the host environment. The low-level communications libraries, Intel COI and the Intel® Symmetric Communications Infrastructure (Intel® SCI), each insert an identifier; LD_LIBRARY_PATH is not copied, but the host PATH is passed intact, with “/usr/bin:/bin” appended to it. It is easy to clear away all this host environment clutter by defining MIC_ENV_PREFIX in the host environment:

With MIC_ENV_PREFIX defined, the target environment is isolated from any weird side effects the host environment might trigger, but particular variables can be passed through by adding a prefix to the host variable name comprising the user-defined MIC_ENV_PREFIX plus an underscore, as shown above. You can see that it is working, as evidenced by the ENV_PREFIX definition. Change MIC_ENV_PREFIX to something else and that variable will go away.

Using this mechanism, it is possible to have separate environment values defined for runs on the host or on the target. The one place this doesn’t work is the excluded variable, LD_LIBRARY_PATH. If you look at your host environment after running the compiler initialization script, you’ll find both LD_LIBRARY_PATH and MIC_LD_LIBRARY_PATH defined, the former identifying the host library environment and the latter locating the target library paths for the offloaded code.

Topical Advice on OpenMP on Intel MIC Architecture

BKM: When modifying OpenMP stack size (OMP_STACKSIZE), take care to note the different options, defaults and effects with high thread counts

The careful reader will note that there are a number of controls available to adjust stack size for running threads on host and target, and that these controls interact with each other but don’t share the same defaults.

Applicability: Offloaded and native C, C++ and Fortran.

Details: First there is KMP_STACKSIZE, whose values can be suffixed with B/K/M/G/T to mark bytes, kilobytes, megabytes, etc.:

export KMP_STACKSIZE=4M

If the units are unspecified for KMP_STACKSIZE, the number is assumed to be in bytes. KMP_STACKSIZE overrides any OMP_STACKSIZE setting, and if both are set the runtime will generate a warning message to the user. OMP_STACKSIZE uses the same B/K/M/G/T suffix notation, but if unspecified, OMP_STACKSIZE units default to kilobytes. To be safe, always specify a units-suffix with these parameters.

If offloading computation from the host and MIC_ENV_PREFIX is not defined, the stack-size environment variables are copied from host to target environment when the target process is spawned (along with the rest of the environment). With MIC_ENV_PREFIX defined, users can define separate settings in the host environment for both host and target values:
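One hedged example of such paired settings; the sizes here are chosen to match the caution that follows (the host default doubled, the target default halved):

```shell
# Separate stack sizes for host and target OpenMP threads.
export MIC_ENV_PREFIX=MIC
export OMP_STACKSIZE=8M       # host OpenMP threads: 8 MB each
export MIC_OMP_STACKSIZE=2M   # target OpenMP threads: 2 MB each
```

With MIC_ENV_PREFIX set to MIC, the target process sees OMP_STACKSIZE=2M while the host keeps its own 8M setting.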

Caution: an Intel Xeon Phi coprocessor can have as many as 60 cores, each with four HW threads, or 240 hardware threads. The default OpenMP stack size is 4 MB, so by default the threads on this platform could allocate nearly a gigabyte of local memory just for all the stacks! (The example above would cut this resource limit in half while doubling the default limit on the host side.)

In addition to these environment variables, the runtime libraries also respond to the host environment setting of MIC_STACKSIZE. This controls the size of the stack in the target process wherein offloaded code is run and so only applies to offloaded code. The default size of this stack is 12 MB but there’s only one of them. Likewise, native applications run in a process whose default stack size is 8 MB.

BKM: Processor affinity may have a significant impact on the performance of your algorithm, so understanding the affinity types available and the behaviors of your algorithm can help you adapt affinity on the target architecture to maximize performance.

Hardware threads are grouped together into cores, within which they share resources. This sharing of resources can benefit or harm particular algorithms, depending on their behaviors. Understanding the behavior of your algorithms will guide you in selecting the optimal affinity. Affinity can be specified in the environment (KMP_AFFINITY) or via a function call (kmp_set_affinity).

Applicability: Offloaded and native C, C++ and Fortran

Details: Affinity starts with the process that runs the main OpenMP thread on the target. Offloaded programs inherit an affinity map that hides the last core, which is dedicated to offload system functions. Native programs can use all the cores, making the calculations required for balancing the threads slightly different.

Some algorithms exploit data sharing between threads and can take advantage of shared caches to speed computation. Other algorithms demand a lot of data that are visited only once, requiring the threads to be spread out to make maximum use of the available bandwidth. And some algorithms lie somewhere in between. OMP_NUM_THREADS and KMP_AFFINITY (and their functional counterparts) let users and programmers configure their threads as appropriate:
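A plausible reconstruction of those settings for an offloaded run (a sketch, assuming a 31-core target with 124 HW threads):

```shell
# Confine these settings to the target only.
export MIC_ENV_PREFIX=MIC
# Use half of the 124 available HW threads...
export MIC_OMP_NUM_THREADS=62
# ...spread as broadly as possible, one OpenMP thread per HW thread.
export MIC_KMP_AFFINITY=granularity=fine,scatter
```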

On a 31-core target processor with an offloaded application, the above sequence of host environment settings would limit the HW threads to half the available number and distribute them as broadly as possible across the cores. The example sets MIC_ENV_PREFIX so that the lines that follow can selectively set the number of threads and their distribution for the target only. Granularity is set to fine so each OpenMP thread is constrained to a single HW thread; setting core granularity instead would group the OpenMP threads assigned to a core into little quartets with free rein over their respective cores.

The affinity types COMPACT and SCATTER either clump OpenMP threads together on as few cores as possible or spread them apart so that adjacent thread numbers are on different cores. Sometimes though, it is advantageous to enable a limited number of threads distributed across the cores in a manner that leaves adjacent thread numbers on the same cores. For this, there is a new affinity type available, BALANCED, which does exactly that. Using the verbose setting shown above, you can determine how the OpenMP thread numbers are dispersed. Just run a program with an offloaded OpenMP component to trigger the affinity report. We can run that for several affinity-types and see how they vary:

AFFINITY: fine, compact

AFFINITY: fine, scatter

AFFINITY: fine, balanced

Because MIC_OMP_NUM_THREADS was set to 60 but there are 31 cores in this example, the last couple of cores only get a single OpenMP thread. Alternatively, setting MIC_OMP_NUM_THREADS to 62 in this example would move thread 59 to core 29 and leave threads 60 and 61 on core 30.

In the case of native execution, all the same rules apply, though we don’t need to preface the environment variables with “MIC_” or use MIC_ENV_PREFIX. We may want to acknowledge the availability of another core and bump OMP_NUM_THREADS up to 64 in order to populate all the cores with two threads in the balanced affinity-type. Any irregularities in the underlying HW thread numbering scheme for the last core are hidden by the affinity mask.

Beyond these affinity-types, more exotic mappings can be achieved through using the low-level affinity interface. Mention of these low-level affinity APIs also brings up a caution: it’s very unlikely that code being ported to Intel MIC Architecture that already has an explicit affinity map will have the correct map for the new architecture; better to reset affinity back to some basic mapping until experiments can be run to re-optimize affinity.

BKM: Use KMP_PLACE_THREADS for scaling tests

The Intel MIC Architecture-specific environment variable KMP_PLACE_THREADS makes it much easier than it otherwise would be to dedicate a subset of an Intel Xeon Phi coprocessor to scaling tests while maintaining the correct affinity. By using KMP_PLACE_THREADS you can restrict the slice of the machine that your program uses, and then set a simple affinity within that space.

Applicability: Offloaded and native C, C++ and Fortran

Details: For instance, to investigate the scaling of a program as we change the number of cores and threads/core we can use placement like KMP_PLACE_THREADS=30c,2t to run on 30 cores with two threads per core. We can then use KMP_AFFINITY=scatter or KMP_AFFINITY=compact to enumerate the threads in different ways within the 60 hardware threads. There is no need then to set OMP_NUM_THREADS in addition, since the runtime default is to use all the hardware threads it can see (which will be 60 in this case).
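As a sketch, those settings look like this:

```shell
# Slice out 30 cores with two threads per core (60 HW threads total)...
export KMP_PLACE_THREADS=30c,2t
# ...then enumerate threads within that slice.
export KMP_AFFINITY=scatter   # or compact
```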

If you do set OMP_NUM_THREADS as well as KMP_PLACE_THREADS, be careful. If the number of threads set via OMP_NUM_THREADS exceeds the number of hardware threads made available by KMP_PLACE_THREADS, you will have over-subscription and, likely, poor performance; if it is fewer, some cores will very likely carry different numbers of threads than others, which is confusing when trying to investigate performance.

About the Author

Robert Reed is a software performance engineer in the Technical Computing, Analyzers and Runtimes organization within the Intel Developer Products Division. Robert honed his programming skills over many years working at Tektronix. After a migrant middle period (contracting and consulting for a variety of companies), Robert settled at Intel where he has done a lot of performance optimization and analysis on everything from architecture-specific functional simulations of DGEMM to playing with fluids and cloth in high-end 3D editors. Robert has been working with the Intel MIC Architecture program since before it was called that. When he’s not pushing bits, Robert likes to sing and see live theatre, of which there is plenty in Portland.

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

Performance Notice

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.