Intel Developer Zone Articles
https://software.intel.com/en-us/articles/20857

Webinar: Better Threaded Performance and Scalability With Intel® VTune™ Amplifier + OpenMP*
https://software.intel.com/en-us/articles/webinar-better-threaded-performance-and-scalability-with-intelr-vtune-amplifier-openmp
<p><strong>Prerequisites:</strong></p>
<ol><li>Intel® Parallel Studio Professional or Ultimate Edition installed on a Linux machine (provides the Intel® C++ Compiler, Intel® VTune™ Amplifier, and Intel® Advisor, which we will use in this lab).</li>
<li>Install the latest version of OpenCV:
<ol><li>Download the source from github (<a href="https://github.com/opencv/opencv" rel="nofollow">https://github.com/opencv/opencv</a>) using git clone command.</li>
<li>Build OpenCV libraries using instructions documented at <a href="http://docs.opencv.org/trunk/d7/d9f/tutorial_linux_install.html" rel="nofollow">http://docs.opencv.org/trunk/d7/d9f/tutorial_linux_install.html</a>.</li>
</ol></li>
<li>Make sure that you have a copy of the source code for your lab which includes the lab documentation.</li>
</ol><p><strong>Introduction:</strong></p>
<p>This lab will help you understand how to use Intel® VTune™ Amplifier and Intel® Advisor to look for tuning opportunities and to tune the code by enabling threading (using OpenMP* or Intel® Threading Building Blocks [Intel® TBB]) and enabling vectorization (using OpenMP 4.0 SIMD constructs).</p>
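<p>As a rough illustration of the kind of constructs the lab covers (this sketch is not taken from the lab code; the loop and array names are placeholders), a single OpenMP 4.0 combined construct can both thread and vectorize a simple loop:</p>
<pre class="brush:cpp;">// Illustrative only: OpenMP threading plus OpenMP 4.0 SIMD on a placeholder loop.
#include &lt;vector&gt;

void scale_and_add(std::vector&lt;float&gt;&amp; out, const std::vector&lt;float&gt;&amp; in, float k)
{
    // "parallel for" distributes iterations across threads;
    // "simd" asks the compiler to vectorize each thread's chunk.
    #pragma omp parallel for simd
    for (int i = 0; i &lt; (int)out.size(); ++i)
        out[i] += k * in[i];
}
</pre>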
<p>The detailed lab document is available <a href="https://software.intel.com/sites/default/files/managed/91/02/Lab_Instructions.pdf">here</a>.</p>
Tue, 26 Sep 2017 00:02:50 -0700, Anoop M. (Intel)

Intel® Xeon® Scalable Processor Cryptographic Performance
https://software.intel.com/en-us/articles/intel-xeon-scalable-processor-cryptographic-performance
<h2>Executive Summary</h2>
<p>The new Intel® Xeon® Scalable processor family provides dramatically improved cryptographic performance for data at rest and in transit. Many Advanced Encryption Standard (AES)<sup>1</sup> based encryption schemes will immediately benefit from the 75 percent improvement in Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) latency. In addition to improvements in existing technologies, the new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction family brings up to 3X performance gains<sup>1</sup> over previous-generation Intel® Advanced Vector Extensions 2 (Intel® AVX2) implementations of secure hashing. These performance gains for cryptographic primitives improve throughput for intensive workloads in markets such as networking and storage, lowering the barrier to making encryption ubiquitous.</p>
<h2>Overview</h2>
<p>With strong security becoming a ubiquitous data center application prerequisite, any associated cryptographic performance tax takes compute cycles away from the main function. Intel’s focus on providing primitives in every core to accelerate cryptographic algorithms has helped alleviate this burden and enabled demanding workloads to achieve remarkable throughput on general purpose servers<sup>2</sup>. One example of this is the continued performance increase of Intel® AES-NI since its original launch in 2010. This focus also holds for the introduction of new features that provide significant gains over previous generations<sup>1</sup>.</p>
<p>With Intel® Xeon Scalable Processors, the improved Intel AES-NI design and introduction of Intel® AVX-512 brings a new level of cryptographic performance to the data center. This paper examines the gains seen in two modes of AES operation, Galois counter mode (GCM) and cipher block chaining (CBC), as a result of the Intel AES-NI improvements. The impact of Intel AVX-512 will be demonstrated with the secure hashing algorithms (SHA-1, SHA-256, and SHA-512)<sup>3</sup>, in particular comparing the new results against Intel® AVX2 based implementations from the previous Haswell/Broadwell generation of Intel Xeon processors.</p>
<h2>Intel® Xeon® Scalable Processor Improvements</h2>
<p>The cryptographic performance enhancements seen in the Intel Xeon Scalable processors are due to new instructions, micro architectural updates, and novel software implementations. Intel AVX-512 doubles the instruction operand size from 256 bits in Intel AVX2 to 512 bits. In addition to the 2X increase in amount of data that can be processed at once, powerful new instructions such as VPTERNLOG enable more complex operations to be executed per cycle. Combining Intel AVX-512 with the multibuffer software technique for parallel processing of data streams spectacularly improves SHA performance. The latency reduction of the AES Encrypt and AES Decrypt (AESENC/AESDEC) instructions along with the improved microarchitecture have shown gains in both parallel and serial modes of AES operation.</p>
<h2>AES</h2>
<p>The new Intel® Xeon® Scalable processor has significantly reduced the latency of AES instructions from seven cycles in the previous Xeon v4 generation down to four cycles. This reduction benefits serial modes of AES operation, such as Cipher Block Chaining (CBC) encrypt. As with most new Intel® microarchitectures, improvements in the core design manifest as appreciable performance gains. For optimized implementations of AES GCM, the parallel paths of the AESENC and PCLMULQDQ instructions have improved to the point where the authentication path is almost free.</p>
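<p>To illustrate why the parallel modes benefit so much (a hedged sketch, not code from this paper; the four-block grouping is only an example), independent blocks can keep several AESENC operations in flight at once, hiding the per-instruction latency, whereas CBC encrypt must finish each block before starting the next:</p>
<pre class="brush:cpp;">#include &lt;wmmintrin.h&gt;  /* Intel AES-NI intrinsics */

/* Sketch: one AES round applied to four independent blocks (e.g., CTR/GCM style).
 * Because the blocks do not depend on each other, their AESENC operations can
 * overlap in the pipeline; CBC encrypt cannot overlap them, since each block
 * depends on the ciphertext of the previous one. */
static inline void aes_round_x4(__m128i b[4], __m128i round_key)
{
    b[0] = _mm_aesenc_si128(b[0], round_key);
    b[1] = _mm_aesenc_si128(b[1], round_key);
    b[2] = _mm_aesenc_si128(b[2], round_key);
    b[3] = _mm_aesenc_si128(b[3], round_key);
}
</pre>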
<h2>SHA</h2>
<p>Moving from Intel AVX2 to Intel AVX-512 implementations of the SHA family brings benefits beyond the doubling of data buffers that can be processed at once. Two key additions are the expansion of registers available from the 16 256-bit YMMs in Intel AVX2 to the 32 512-bit ZMMs in Intel AVX-512 and the more powerful instructions in Intel AVX-512. With more registers available, the message schedule portion of SHA can be stored in registers and no longer has to be saved on the stack. With more powerful instructions, the number of instructions that need to be executed is reduced and the dependencies are eliminated.</p>
<p>A closer examination of the power of the VPTERNLOG instruction can be illustrated on the SHA-256 Ch and Maj functions. Table 1 shows the Ch and Maj functions along with the Boolean logic table.</p>
<p style="text-align:center"><strong>Table 1. </strong> <em>SHA-256 Ch and Maj function logic tables.</em></p>
<table align="center" class="no-alternate" border="1"><tbody><tr><td colspan="4">
<p>Ch (e, f, g) = (e &amp; f) ^ (~e &amp; g)</p>
</td>
<td colspan="4">
<p>Maj (a, b, c) = (a &amp; b) ^ (a &amp; c) ^ (b &amp; c)</p>
</td>
</tr><tr><td>
<p align="center">e</p>
</td>
<td>
<p align="center">f</p>
</td>
<td>
<p align="center">g</p>
</td>
<td>
<p align="center">Result (0xCA)</p>
</td>
<td>
<p align="center">a</p>
</td>
<td>
<p align="center">b</p>
</td>
<td>
<p align="center">c</p>
</td>
<td>
<p align="center">Result (0xE8)</p>
</td>
</tr><tr><td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
</tr><tr><td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
</tr><tr><td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
</tr><tr><td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
</tr><tr><td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">0</p>
</td>
</tr><tr><td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
</tr><tr><td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">0</p>
</td>
<td>
<p align="center">1</p>
</td>
</tr><tr><td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
<td>
<p align="center">1</p>
</td>
</tr></tbody></table><p>The VPTERNLOG instruction takes three operands and an immediate specifying the Boolean logic function to execute. Tables 2 and 3 compare the Intel AVX2 and Intel AVX-512 instruction sequences for the SHA-256 Ch and Maj functions. Note that register-to-register copies are generally free in the microarchitecture.</p>
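<p>The immediate is simply the truth table of the desired three-input function, read as an 8-bit value: bit (e&lt;&lt;2)|(f&lt;&lt;1)|g of the immediate holds the function's output for that input combination. The short program below (a sketch added for illustration, not part of the original paper) derives the 0xCA and 0xE8 values used in the tables from the Ch and Maj definitions:</p>
<pre class="brush:cpp;">#include &lt;stdio.h&gt;

/* Illustrative sketch: derive a VPTERNLOG immediate from a Boolean function.
 * Bit ((e&lt;&lt;2)|(f&lt;&lt;1)|g) of the immediate is the function value for that input. */
static unsigned char ternlog_imm(int (*fn)(int, int, int))
{
    unsigned char imm = 0;
    for (int x = 0; x &lt; 8; x++)
        imm |= (unsigned char)(fn((x &gt;&gt; 2) &amp; 1, (x &gt;&gt; 1) &amp; 1, x &amp; 1) &lt;&lt; x);
    return imm;
}

static int ch (int e, int f, int g) { return ((e &amp; f) ^ (~e &amp; g)) &amp; 1; }
static int maj(int a, int b, int c) { return ((a &amp; b) ^ (a &amp; c) ^ (b &amp; c)) &amp; 1; }

int main(void)
{
    printf("Ch  immediate: 0x%02X\n", ternlog_imm(ch));   /* prints 0xCA */
    printf("Maj immediate: 0x%02X\n", ternlog_imm(maj));  /* prints 0xE8 */
    return 0;
}
</pre>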
<p style="text-align:center"><strong>Table 2. </strong> <em>Intel® AVX2 and Intel® AVX-512 instruction sequence for the SHA-256 Ch function.</em></p>
<table align="center" class="no-alternate" style="width:387px" border="1"><tbody><tr><td colspan="2">
<p align="center">Ch (e, f, g) = (e &amp; f) ^ (~e &amp; g)</p>
<p align="center">Note this is equivalent to ((f ^ g) &amp; e) ^ g)</p>
</td>
</tr><tr><td>
<p align="center">Intel® AVX2</p>
</td>
<td>
<p align="center">Intel® AVX-512</p>
</td>
</tr><tr><td>
<p>vpxor ch, f, g</p>
<p>vpand ch, ch, e</p>
<p>vpxor ch, ch, g</p>
</td>
<td>
<p>vmovdqa32 ch, e</p>
<p>vpternlogd ch, f, g, 0xCA</p>
</td>
</tr></tbody></table><p style="text-align:center"><strong>Table 3. </strong> <em>Intel® AVX2 and Intel® AVX-512 instruction sequence for the SHA-256 Maj function.</em></p>
<table align="center" class="no-alternate" style="width:639px" border="1"><tbody><tr><td colspan="2">
<p align="center">Maj (a, b, c) = (a &amp; b) ^ (a &amp; c) ^ (b &amp; c)</p>
<p align="center">Note this is equivalent to ((a ^ c) &amp; b) | (a &amp; c)</p>
</td>
</tr><tr><td>
<p align="center">Intel® AVX2</p>
</td>
<td>
<p align="center">Intel® AVX-512</p>
</td>
</tr><tr><td>
<p>vpxor maj, a, c</p>
<p>vpand maj, maj, b</p>
<p>vpand tmp, a, c</p>
<p>vpor maj, maj, tmp</p>
</td>
<td>
<p>vmovdqa32 maj, a</p>
<p>vpternlogd maj, b, c, 0xE8</p>
</td>
</tr></tbody></table><h2>Performance Gains</h2>
<p>Cryptographic performance of Intel Xeon Scalable processors shows per-core gains of 1.18X to over 3X<sup>1</sup> compared to the previous Xeon v4 processors. The performance of some of the most commonly used cryptographic algorithms in secure networking and storage is highlighted using two popular open source libraries, OpenSSL*<sup>4</sup> and the Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)<sup>5</sup>.</p>
<h2>Methodology</h2>
<p>In order to maximize reproducibility and be able to project performance to processors with different frequencies and core counts, the results are reported in cycles/byte (lower is better). The platforms are tuned for performance and turbo mode is disabled to allow for consistent core frequency for every run. To get throughput numbers in bytes per second for a specific processor, divide the processor’s frequency by the cycles/byte value reported. For total system performance, multiply that value by the number of cores, since these results scale nearly linearly with core count.</p>
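<p>As a quick worked example of that conversion (the frequency and core count below are placeholders chosen for illustration, not measurements from this paper):</p>
<pre class="brush:cpp;">#include &lt;stdio.h&gt;

/* Illustrative only: convert cycles/byte into throughput.
 * The 2.1 GHz frequency and 22-core count are placeholder values. */
int main(void)
{
    double core_freq_hz    = 2.1e9;  /* example core frequency              */
    double cycles_per_byte = 0.65;   /* e.g., AES-128-GCM result in Table 4 */
    int    cores           = 22;     /* example core count                  */

    double per_core = core_freq_hz / cycles_per_byte;  /* bytes/second per core         */
    double total    = per_core * cores;                /* approximate system throughput */

    printf("per core: %.2f GB/s, system: %.2f GB/s\n", per_core / 1e9, total / 1e9);
    return 0;
}
</pre>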
<p>The processors used for these performance tests are the Intel® Xeon® Gold 6152 processor and Intel Xeon processor E5-2695 v4, each with 16 GB of memory and running Ubuntu* 16.04.1.</p>
<h2>OpenSSL*</h2>
<p>Results shown in Table 4 are collected from the OpenSSL v1.1.0f speed application on 8 KB buffers using the following commands:</p>
<pre class="brush:plain;">openssl speed -mr -evp aes-128-cbc
openssl speed -mr -evp aes-128-gcm
</pre>
<p style="text-align:center"><strong>Table 4. </strong> <em>OpenSSLU</em><em> speed results for AES CBC Encrypt and AES GCM (cycles/byte).</em></p>
<table align="center" class="no-alternate" border="1"><tbody><tr><td>
<p>Algorithm</p>
</td>
<td>
<p align="center">Xeon V4</p>
</td>
<td>
<p align="center">Xeon Scalable</p>
</td>
<td>
<p align="center">Xeon Scalable Gain</p>
</td>
</tr><tr><td>
<p>AES-128-CBC Encrypt</p>
</td>
<td>
<p align="center">4.44</p>
</td>
<td>
<p align="center">2.64</p>
</td>
<td>
<p align="center">1.68</p>
</td>
</tr><tr><td>
<p>AES-128-GCM</p>
</td>
<td>
<p align="center">0.77</p>
</td>
<td>
<p align="center">0.65</p>
</td>
<td>
<p align="center">1.18</p>
</td>
</tr></tbody></table><h2>Intel® ISA-L</h2>
<p>Results shown in Table 5 are collected from the Intel ISA-L crypto version v2.19.0 cold cache performance tests using the following commands:</p>
<pre class="brush:plain;">make perfs
make perf
</pre>
<p style="text-align:center"><strong>Table 5. </strong><em>Intel® ISA-L performance test results for SHA </em><em>Multibuffer</em><em> (cycles/byte).</em></p>
<table align="center" class="no-alternate" border="1"><tbody><tr><td>
<p>Algorithm</p>
</td>
<td>
<p align="center">Xeon V4</p>
</td>
<td>
<p align="center">Xeon Scalable</p>
</td>
<td>
<p align="center">Xeon Scalable Gain</p>
</td>
</tr><tr><td>
<p>SHA-1</p>
</td>
<td>
<p align="center">1.13</p>
</td>
<td style="width:66px">
<p align="center">0.44</p>
</td>
<td>
<p align="center">2.55</p>
</td>
</tr><tr><td>
<p>SHA-256</p>
</td>
<td>
<p align="center">2.60</p>
</td>
<td>
<p align="center">0.87</p>
</td>
<td>
<p align="center">2.97</p>
</td>
</tr><tr><td>
<p>SHA-512</p>
</td>
<td>
<p align="center">3.24</p>
</td>
<td>
<p align="center">1.07</p>
</td>
<td>
<p align="center">3.03</p>
</td>
</tr></tbody></table><p style="text-align:center"><img src="/sites/default/files/managed/4b/67/Intel-Xeon-Scalable-Processor-Cryptographic-Performance-fig01.png" /></p>
<p style="text-align:center"><strong>Figure 1. </strong><em>Single core Xeon Scalable Processor performance gain over previous generation Xeon v4.</em></p>
<h2>Conclusion</h2>
<p>The new Intel Xeon Scalable processors continue the tradition of lowering the computational burden of cryptographic algorithms. By incorporating the optimized open source cryptographic software libraries profiled in this paper, your application can take advantage of the latest processor features for the best performance.</p>
<h2>Acknowledgements</h2>
<p>We thank Jim Guilford, Ilya Albrekht, and Greg Tucker for their contributions to the optimized code. We also thank Jon Strang for preparing the Intel Xeon platforms in the performance tests.</p>
<h2>References</h2>
<p>1. “Federal Information Processing Standards Publication 197 Advanced Encryption Standard” <a href="http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf" target="_blank" rel="nofollow">http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf</a></p>
<p>2. “6WIND Boosts IPsec with Intel Xeon Scalable Processors” <a href="http://www.6wind.com/wp-content/uploads/2017/07/6WIND-Purley-Solution-Brief.pdf" target="_blank" rel="nofollow">http://www.6wind.com/wp-content/uploads/2017/07/6WIND-Purley-Solution-Brief.pdf</a></p>
<p>3. “Federal Information Processing Standards Publication 180-4 Secure Hash Standard” <a href="http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf" target="_blank" rel="nofollow">http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf</a></p>
<p>4. OpenSSL <a href="https://github.com/openssl/openssl" target="_blank" rel="nofollow">https://github.com/openssl/openssl</a></p>
<p>5. Intel Intelligent Storage Acceleration Library Crypto Version <a href="https://github.com/01org/isa-l_crypto" target="_blank" rel="nofollow">https://github.com/01org/isa-l_crypto</a></p>
<p class="footnote">1 – Performance claims based on measured data and methodology outline in the “Performance Gains” section of this document</p>
Wed, 06 Sep 2017 15:30:49 -0700, Sean Gulley (Intel)

Use the Intel® SPMD Program Compiler for CPU Vectorization in Games
https://software.intel.com/en-us/articles/use-the-intel-spmd-program-compiler-for-cpu-vectorization-in-games
<p><a class="button-highlight" href="https://github.com/GameTechDev/ISPC-DirectX-Graphics-Samples" target="_blank" rel="nofollow">Download GitHub* Code Sample</a></p>
<h2>Introduction</h2>
<p>The open source <a href="http://llvm.org/" target="_blank" rel="nofollow">LLVM*</a> based <a href="https://ispc.github.io/" target="_blank" rel="nofollow">Intel® SPMD Program Compiler</a> (commonly referred to in previous documents as ISPC) is not a replacement for the Gnu* Compiler Collection (GCC) or the Microsoft* C++ compiler; instead it should be considered more akin to a shader compiler for the CPU that can generate vector instructions for a variety of instruction sets such as Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 4 (Intel® SSE4), Intel® Advanced Vector Extensions (Intel® AVX), Intel® AVX2, and so on. The input shaders or kernels are C-based and the output is a precompiled object file with an accompanying header file to be included in your application. Through the use of a small number of keywords, the compiler can be explicitly directed on how the work should be split across the CPU vector units.</p>
<p>The extra performance from explicit vectorization is available if a developer chooses to write intrinsics directly into their codebase; however, this has high complexity and a high maintenance cost. Intel SPMD Program Compiler kernels are written in a high-level language, so the development cost is low. It also becomes trivial to support multiple instruction sets, providing the best performance for the CPU the code is running on rather than the lowest common denominator, such as Intel SSE4.</p>
<p>This article does not aim to teach the reader how to write Intel SPMD Program Compiler kernels; it simply demonstrates how to plug Intel SPMD Program Compiler into a Microsoft Visual Studio* solution and provides guidance on how to port simple High-Level Shading Language* (HLSL*) compute shaders to Intel SPMD Program Compiler kernels. For a more detailed overview of Intel SPMD Program Compiler, please refer to the online <a href="https://ispc.github.io/documentation.html" target="_blank" rel="nofollow">documentation</a>.</p>
<p>The example code provided with this article is based on a modified version of the <a href="https://github.com/Microsoft/DirectX-Graphics-Samples" target="_blank" rel="nofollow">Microsoft DirectX* 12 n-body sample</a> that has been ported to support Intel SPMD Program Compiler vectorized compute kernels. It is not intended to show performance deltas against the GPU, but to show the large performance gains that can be achieved when moving from standard scalar CPU code to vectorized CPU code.</p>
<p>While this application clearly does not represent a game due to the very light CPU load in the original sample, it does show the kind of performance scaling possible by using the vector units on multiple CPU cores.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/ce/b3/Intel-spmd-program-compiler-cpu-vectorization-games-fig01.jpg" /><br /><strong>Figure 1.</strong><em> Screenshot from the modified n-Body Gravity sample</em></p>
<h2>The Original DirectX* 12 n-Body Gravity Sample</h2>
<p>Before starting the port to Intel SPMD Program Compiler, it would be useful to understand the original sample and its intent. The <em>DirectX 12 n-Body Gravity</em> sample was written to highlight how to use the separate compute engine in DirectX 12 to perform asynchronous compute; that is, the particle render is done in parallel to the particle update, all on the GPU. The sample generates 10,000 particles and updates and renders them each frame. The update involves every particle interacting with every other particle, to generate 100,000,000 interactions per simulation tick.</p>
<p>The HLSL compute shader maps a compute thread to each particle to perform the update. The particle data is double-buffered so that for each frame, the GPU renders from buffer 1 and asynchronously updates buffer 2, before flipping the buffers in preparation for the next frame.</p>
<p>That’s it. Pretty simple, and a good candidate for an Intel SPMD Program Compiler port because an asynchronous compute task lends itself perfectly to being run on the CPU; the code and engine have already been designed to perform the compute in a concurrent execution path, so by transferring some of this load onto the often underutilized CPU, the GPU can either finish its frame quicker, or can be given more work, while making full use of the CPU.</p>
<h2>Porting to Intel® SPMD Program Compiler</h2>
<p>The recommended approach would be to port from HLSL to scalar C/C++ first. This ensures that the algorithm is correct and produces the correct results, interacts with the rest of the application correctly and, if applicable, handles multiple threads properly. As trivial as this sounds, there are a few things to consider:</p>
<ol><li>How to share memory between the GPU and CPU.</li>
<li>How to synchronize between the GPU and CPU.</li>
<li>How to partition the work for single instruction/multiple data (SIMD) and multithreading.</li>
<li>Porting the HLSL code to scalar C.</li>
<li>Porting scalar C to an Intel SPMD Program Compiler kernel.</li>
</ol><p>Some of these are easier than others.</p>
<h3>Sharing Memory</h3>
<p>We know we need to share the memory between the CPU and the GPU, but how? Fortunately, DirectX 12 provides a few options, such as mapping GPU buffers into CPU memory, and so on. To keep this example simple and to minimize code changes, we just re-use the particle upload staging buffers that were used for the initialization of the GPU particle buffers, and we create a double-buffered CPU copy for CPU access. The usage model becomes:</p>
<ul><li>Update CPU-accessible particle buffer from the CPU.</li>
<li>Call the DirectX 12 helper <code>UpdateSubresources</code> using the original upload staging buffer with the GPU particle buffer as the destination.</li>
<li>Bind the GPU particle buffer and render.</li>
</ul><h3>Synchronization</h3>
<p>The synchronization falls out naturally as the original async compute code already has a DirectX 12 Fence object for marshalling the interaction between the compute and render, and this is simply reused to signal to the render engine that the copy has finished.</p>
<h3>Partitioning the Work</h3>
<p>To partition the work, we should first consider how the GPU partitions the work, as this may be a natural fit for the CPU. Compute shaders have two ways to control their partitioning. First is the dispatch size, which is the size passed to the API call when recording the command stream. This describes the number of and the dimensionality of the work groups to be run. Second is the size and dimensionality of the local work group, which is hard coded into the shader itself. Each item in the local work group can be considered a work thread and each thread can share information with other threads in the work group if shared memory is used.</p>
<p>Looking at the <em>nBodyGravityCS.</em><em>hlsl</em> compute shader, we can see that the local work group size is 128 x 1 x 1 and it uses some shared memory to optimize some of the particle loads, but this may not be necessary on the CPU. Other than this, there is no interaction between the threads and each thread works on a different particle from the outer loop while interacting with all other particles in the inner loop.</p>
<p>This seems a natural fit to the CPU vector width, so we could swap the 128 x 1 x 1 with 8 x 1 x 1 for Intel AVX2 or 4 x 1 x 1 for Intel SSE4. We can also use the dispatch size as a hint for how to multithread the code, so we could divide the 10,000 particles by 8 or 4, depending on the SIMD width. But, because we have discovered that there is no dependency between each thread, we could simplify it and just divide the number of particles by the available number of threads in the thread pool, or available logical cores on the device, which would be 8 on a typical quad core CPU with Intel® Hyper-Threading Technology enabled. When porting other compute shaders, this may require more thought.</p>
<p>This gives us the following pseudocode:</p>
<pre class="brush:cpp;">For each thread
Process N particles where N is 10000/threadCount
For each M particles from N, where M is the SIMD width
Test interaction with all 10000 particles
</pre>
<h3>Porting HLSL* to Scalar C</h3>
<p>When writing Intel SPMD Program Compiler kernels, unless you are experienced, it is recommended that you have a scalar C version written first. This will ensure that all of the application glue, multithreading, and memory operations are working before you start vectorizing.</p>
<p>To that end, most of the HLSL code from <em>nBodyGravityCS.hlsl</em> will work in C with minimal modifications other than adding the outer loop for the particles and changing the shader math vector types to using a C-based equivalent. In this example, float4/float3 types were exchanged for the DirectX XMFLOAT4/XMFLOAT3 types, and some vector math operations were split out into their scalar equivalents.</p>
<p>The CPU particle buffers are used for reading and writing, and then the write buffer is uploaded to the GPU as described above, using the original fences for the synchronization. To provide the threading, the sample uses Microsoft’s <em>concurrency::parallel_for</em> construct from its <a href="https://msdn.microsoft.com/en-us/library/dd492418.aspx" target="_blank" rel="nofollow">Parallel Patterns Library</a>.</p>
<p>The code can be seen in <code>D3D12nBodyGravity::SimulateCPU()</code> and <code>D3D12nBodyGravity::ProcessParticles().</code></p>
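<p>A minimal sketch of how that partitioning might look with <em>concurrency::parallel_for</em> is shown below; the names <code>UpdateRange</code>, <code>SimulateOnCpu</code>, and the variables are illustrative placeholders rather than the sample's actual identifiers.</p>
<pre class="brush:cpp;">#include &lt;ppl.h&gt;        // Microsoft Parallel Patterns Library
#include &lt;algorithm&gt;

// Placeholder for the per-range particle update (not the sample's real function).
void UpdateRange(int first, int last) { (void)first; (void)last; }

void SimulateOnCpu(int particleCount, int threadCount)
{
    const int perThread = (particleCount + threadCount - 1) / threadCount;

    // One parallel_for iteration per worker thread; each worker processes a
    // contiguous slice of particles, matching the pseudocode shown earlier.
    concurrency::parallel_for(0, threadCount, [=](int t)
    {
        const int first = t * perThread;
        const int last  = std::min(first + perThread, particleCount);
        UpdateRange(first, last);
    });
}
</pre>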
<p>Once the scalar code is working, it is worth doing a quick performance check to ensure there are no algorithmic hot spots that should be fixed before moving to Intel SPMD Program Compiler. In this sample, some basic hot spot analysis with <a href="https://software.intel.com/en-us/intel-vtune-amplifier-xe">Intel® VTune™</a> tools highlighted that a reciprocal square root (sqrt) was on a hot path, so this was replaced with the infamous <a href="https://en.wikipedia.org/wiki/Fast_inverse_square_root" target="_blank" rel="nofollow">fast reciprocal sqrt</a> approximation from Quake* that provided a small performance improvement with no perceivable impact due to the loss of precision.</p>
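<p>For context, the widely published Quake-style approximation is roughly the following scalar C (shown here as background; the sample's kernel expresses the same idea, later using the Intel SPMD Program Compiler floatbits()/intbits() casts for the bit reinterpretation):</p>
<pre class="brush:cpp;">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Classic fast reciprocal square root approximation, shown for background. */
static float Q_rsqrt(float number)
{
    float x2 = number * 0.5f;
    float y  = number;
    uint32_t i;
    memcpy(&amp;i, &amp;y, sizeof(i));       /* reinterpret the float bits as an integer */
    i = 0x5f3759df - (i &gt;&gt; 1);       /* magic constant gives the initial guess   */
    memcpy(&amp;y, &amp;i, sizeof(y));
    y = y * (1.5f - (x2 * y * y));   /* one Newton-Raphson refinement step       */
    return y;
}
</pre>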
<h3>Porting Scalar C to Scalar Intel® SPMD Program Compiler</h3>
<p>Once your build system has been modified to build Intel SPMD Program Compiler kernels and link them into your application (Microsoft Visual Studio modifications are described later in this article), it is time to start writing Intel SPMD Program Compiler code and hook it into your application.</p>
<h4>Hooks</h4>
<p>To call any Intel SPMD Program Compiler kernels from your application code, you need to include the relevant auto-generated output header file and then call any of the exported functions as you would any normal library, remembering that all declarations are wrapped in the <code>ispc</code> namespace. In the sample, we call <code>ispc::ProcessParticles</code> from within the <code>SimulateCPU()</code> function.</p>
<h4>Vector Math</h4>
<p>Once the hooks are in, the next step is to get scalar Intel SPMD Program Compiler code working, and then vectorize it. Most of the scalar C code can be dropped straight into an Intel SPMD Program Compiler kernel with only a few simple modifications. In the sample, the vector math types needed defining: although Intel SPMD Program Compiler does provide some templated vector types, they are intended only for storage, so new structs were defined. Once done, all XMFLOAT types were converted to the new Vec3 and Vec4 types.</p>
<h4>Keywords</h4>
<p>We now need to start decorating the code with some Intel SPMD Program Compiler specific keywords to help direct the vectorization and compilation. The first keyword is <strong><code>export</code></strong>, which is used on a function signature like a calling convention, to inform Intel SPMD Program Compiler that this is an entry point into the kernel. This does two things: it adds the function signature to the autogenerated header file along with any required structs, but it also puts some restrictions on the function signature, as all arguments need to be scalar, which leads us to the next two keywords, <strong><code>varying</code></strong> and <strong><code>uniform</code></strong>.</p>
<p>A uniform variable is a scalar variable that will not get vectorized; its single value is shared across all SIMD lanes. A varying variable, in contrast, is vectorized and holds a unique value in each SIMD lane. All variables are varying by default, so while the keyword can be added explicitly, it has not been used in this sample. In our first pass of creating a scalar version of this kernel, we decorate all variables with the uniform keyword to ensure the kernel is strictly scalar.</p>
<h4>Intel SPMD Program Compiler Standard Library</h4>
<p>Intel SPMD Program Compiler provides a <a href="https://ispc.github.io/ispc.html#the-ispc-standard-library" target="_blank" rel="nofollow">standard library</a> containing many common functions that can also aid the port, including floatbits() and intbits(), which are needed for the floating-point bit casts used in the fast reciprocal sqrt function.</p>
<h3>Vectorizing the Intel SPMD Program Compiler Kernel</h3>
<p>When the Intel SPMD Program Compiler kernel is functioning as expected, it is time to vectorize. The main complexity is normally deciding what to parallelize and how to parallelize it. A rule of thumb for porting GPU compute shaders is to follow the original model of GPU vectorization which, in this case, had the core compute kernel invoked by multiple GPU execution units in parallel. So, where we added a new outer loop for the scalar version of the particle update, it is this outer loop that should most naturally be vectorized.</p>
<p>The layout of data is also important, as scatter/gather operations can be expensive for vector ISAs (although this is improved with the Intel AVX2 instruction set), so consecutive memory locations are normally preferred for frequent loads/stores.</p>
<h4>Parallel Loops</h4>
<p>In the n-body example, this rule of thumb was followed and the outer loop was vectorized, leaving the inner loop scalar. Therefore, 8 particles would be loaded into the Intel AVX registers and all 8 would then be tested against the entire 10,000 particles. These 10,000 positions would all be treated as scalar variables, shared across all SIMD lanes with no scatter/gather cost. Intel SPMD Program Compiler hides the actual vector width from us (unless we really want to know), which provides a nice abstraction to transparently support the different SIMD widths for Intel SSE4 or Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and so on.</p>
<p>The vectorization was done by replacing the outer <strong>for</strong> loop with an Intel SPMD Program Compiler <strong>foreach</strong> loop, which directs Intel SPMD Program Compiler to iterate over the range in N-sized chunks, where N is the current vector width. Hence, whenever the <em><strong>foreach</strong></em> loop iterator <strong>ii</strong> is used to dereference an array variable, the value of ii will be different for each SIMD lane of the vector, which allows each lane to work on a different particle.</p>
<h4>Data Layout</h4>
<p>At this point, it is important to briefly mention data layout. When using vector registers on the CPU it is important that they are loaded and unloaded efficiently; not doing so can cause a big performance slowdown. To achieve this, vector registers want to have the data loaded from a structure of arrays (SoA) data source, so a vector width number of memory-adjacent values can be loaded directly into the working vector register with a single instruction. If this cannot be achieved, then a slower gather operation is required to load a vector width of non-adjacent values into the vector register, and a scatter operation is required to save the data out again.</p>
<p>In this example, like many graphics applications, the particle data is kept in an array of structures (AoS) layout. This could be converted to SoA to avoid the scatter/gather, but due to the nature of the algorithm, the scatter/gather required in the outer loop becomes a small cost compared to processing the 10,000 scalar particles in the inner loop, so the data is left as AoS.</p>
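<p>To make the layout distinction concrete, the illustrative declarations below (field names do not mirror the sample's structs) show the same particle data in both forms; in the SoA form, eight consecutive x values sit next to each other in memory and can be loaded with a single contiguous vector read, while the AoS form needs a gather:</p>
<pre class="brush:cpp;">// Array of structures (AoS): one struct per particle; loading eight x values
// for eight particles touches strided memory and requires a gather.
struct ParticleAoS { float x, y, z, w; };
ParticleAoS particlesAoS[10000];

// Structure of arrays (SoA): one array per component; eight consecutive x
// values are adjacent in memory and load with a single vector read.
struct ParticlesSoA {
    float x[10000];
    float y[10000];
    float z[10000];
    float w[10000];
};
ParticlesSoA particlesSoA;
</pre>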
<h4>Vector Variables</h4>
<p>The aim is to vectorize the outer loop and keep the inner loop scalar, hence a vector width of outer loop particles will all be processing the same inner loop particle. To achieve this, we load the position, velocity, and acceleration from the outer loop particles into vector registers by declaring <strong>pos, vel,</strong> and <strong>accel</strong> as varying. This was done by removing the <strong>uniform</strong> decoration we added to the scalar kernel, so Intel SPMD Program Compiler knows these variables require vectorizing.</p>
<p>This needs propagating through the <strong>bodyBodyInteraction</strong> and <strong>Q_rsqrt</strong> functions to ensure they are all correctly vectorized. This is just a case of following the flow of the variables and checking for compiler errors. The result is that <strong>Q_rsqrt</strong> is fully vectorized and <strong>bodyBodyInteraction</strong> is mainly vectorized, apart from the inner loop particle position <strong>thatPos</strong>, which is scalar.</p>
<p>This should be all that is required, and the Intel SPMD Program Compiler vectorized kernel should now run, providing good performance gains over the scalar version.</p>
<h3>Performance</h3>
<p>The modified n-body application was tested on two different Intel CPUs and the performance data was captured using <a href="https://github.com/GameTechDev/PresentMon" target="_blank" rel="nofollow">PresentMon*</a> to record the frame times from three runs of 10 seconds each, which were then averaged. This showed performance scaling in the region of 8–10x from scalar C/C++ code to an Intel AVX2 targeted Intel SPMD Program Compiler kernel. Both systems used NVIDIA* GTX 1080 GPUs and all available CPU cores.</p>
<table align="center" border="1" class="no-alternate" style="width:700px"><tbody><tr><td>
<p><strong>Processor</strong></p>
</td>
<td>
<p><strong>Scalar CPU Implementation</strong></p>
</td>
<td>
<p><strong>Intel® AVX2 Implementation Compiled with Intel® SPMD Program Compiler </strong></p>
</td>
<td>
<p><strong>Scaling</strong></p>
</td>
</tr><tr><td>
<p>Intel® Core™ i7-7700K processor</p>
</td>
<td>
<p>92.37 ms</p>
</td>
<td>
<p>8.42 ms</p>
</td>
<td>
<p>10.97x</p>
</td>
</tr><tr><td>
<p>Intel® Core™ i7-6950X Processor Extreme Edition</p>
</td>
<td>
<p>55.84 ms</p>
</td>
<td>
<p>6.44 ms</p>
</td>
<td>
<p>8.67x</p>
</td>
</tr></tbody></table><h2>How to Integrate Intel SPMD Program Compiler into Microsoft Visual Studio*</h2>
<ol><li>Ensure the Intel SPMD Program Compiler binary is on the path or can easily be located from within Microsoft Visual Studio*.</li>
<li>Include your Intel SPMD Program Compiler kernel into your project. It will not be built by default as the file type will not be recognized.</li>
<li>Right-click on the file <strong>Properties</strong> to alter the Item Type to be a <strong>Custom Build Tool:</strong>
<p style="text-align:center"><img alt="" src="/sites/default/files/managed/25/c2/Intel-spmd-program-compiler-cpu-vectorization-games-fig02.png" /></p>
</li>
<li>Click <strong>OK</strong> and then re-open the Property pages, allowing you to modify the custom build tool.
<p>a. Use the following command-line format:</p>
<pre class="brush:cpp;">ispc -O2 &lt;filename&gt; -o &lt;output obj&gt; -h &lt;output header&gt; --target=&lt;target backends&gt; --opt=fast-math</pre>
<p>b. The full command line used in the sample is:</p>
<pre class="brush:cpp;">$(ProjectDir)..\..\..\..\third_party\ispc\ispc -O2 "%(Filename).ispc" -o "$(IntDir)%(Filename).obj" -h "$(ProjectDir)%(Filename)_ispc.h" --target=sse4,avx2 --opt=fast-math</pre>
<p>c. Add the relevant compiler-generated outputs, <strong>i.e., .obj</strong> files:</p>
<pre class="brush:cpp;">$(IntDir)%(Filename).obj;$(IntDir)%(Filename)_sse4.obj;$(IntDir)%(Filename)_avx2.obj</pre>
<p>d. Set Link Objects to <strong>Yes</strong>.</p>
<p style="text-align:center"><img alt="" src="/sites/default/files/managed/05/5b/Intel-spmd-program-compiler-cpu-vectorization-games-fig03.png" /></p>
</li>
<li>Now compile your Intel SPMD Program Compiler kernel. If successful, this should produce a header file and an object file.</li>
<li>Add the header to your project and include it into your application source code.
<p style="text-align:center"><img alt="" src="/sites/default/files/managed/06/79/Intel-spmd-program-compiler-cpu-vectorization-games-fig04.png" style="width:200px" /></p>
</li>
<li>Call the Intel SPMD Program Compiler kernel from the relevant place, remembering that any functions you export from the kernel will be in the <code>ispc</code> namespace:
<p style="text-align:center"><img alt="" src="/sites/default/files/managed/dd/fd/Intel-spmd-program-compiler-cpu-vectorization-games-fig05.png" /></p>
</li>
</ol><h2>Summary</h2>
<p>The purpose of this article was to show how easily developers can migrate highly vectorized GPU compute kernels to vectorized CPU code by using Intel SPMD Program Compiler; thereby allowing spare CPU cycles to be fully utilized and providing the user with a richer gaming experience. The demonstrated extra performance from using Intel SPMD Program Compiler kernels instead of scalar code is available with very little effort for any workloads that naturally vectorize, and using Intel SPMD Program Compiler reduces development and maintenance time while also enabling new instruction sets to be supported with ease.</p>
Thu, 17 Aug 2017 15:18:29 -0700, JONATHAN K. (Intel)

LZO data compression functions and improvements in Intel® Integrated Performance Primitives
https://software.intel.com/en-us/articles/lzo-data-compression-functions-and-improvements-in-intel-integrated-performance-primitives
<h3>Introduction</h3>
<p>In this document, we describe the Intel IPP data compression functions that implement the LZO (Lempel-Ziv-Oberhumer) compressed data format. This format and algorithm use a 64 KB compression dictionary and do not require additional memory for decompression. (See the original code of the LZO library at <a href="http://www.oberhumer.com" rel="nofollow">http://www.oberhumer.com</a>.)</p>
<p>Lempel-Ziv-Oberhumer (LZO) is a well-known lossless data compression algorithm focused on decompression speed, and it is among the fastest compression and decompression algorithms available.</p>
<h3>LZO Example in IPP</h3>
<p>IPP LZO is one of the numerous LZO variants; it has a medium compression ratio and shows very high decompression performance with a low memory footprint.</p>
<p>The code example below shows how to use Intel IPP functions for the LZO compression. It includes compression and decompression procedures. </p>
<p>Before looking at the IPP LZO functions, take a look at the IPP parameters made specifically for the LZO functionality.</p>
<p>The LZO coding initialization functions have a special parameter <span>method</span>. This parameter specifies the level of parallelization and the generic LZO compatibility to be used in the LZO encoding. The table below lists the possible values of the <span>method</span> parameter and their meanings.</p>
<div>
<table border="1"><caption><span>Parameter<em> method</em> for the LZO Compression Functions</span></caption>
<thead><tr><th>Value</th>
<th>Descriptions</th>
</tr></thead><tbody><tr><td><span>IppLZO1XST</span></td>
<td>
<p>The compression and decompression are performed sequentially in a single-thread mode with full binary compatibility with generic LZO libraries and applications</p>
</td>
</tr><tr><td><span>IppLZO1XMT</span></td>
<td>
<p>The compression and decompression are performed in parallel (multi-threaded mode); this is faster, but not binary compatible with the generic LZO.</p>
</td>
</tr></tbody></table></div>
<p>Intel IPP provides five functions for LZO. Please refer to the links below for details of each supported function.</p>
<ul><li><a href="https://software.intel.com/node/ec3aba09-3abd-4609-829e-4cf93680f8f9">EncodeLZOGetSize</a></li>
<li><a href="https://software.intel.com/node/2684f435-8435-44f0-b8ec-0898f00f90ce">EncodeLZOInit</a></li>
<li><a href="https://software.intel.com/node/0e383683-f8c5-404e-b8ac-2e9a087f3898">EncodeLZO</a></li>
<li><a href="https://software.intel.com/node/d20bc816-4a67-4a95-8406-aa2ee4f120bf">DecodeLZO</a></li>
<li><a href="https://software.intel.com/node/0ed1746c-d051-4c7f-8aa0-68651431c4d4">DecodeLZOSafe</a></li>
</ul><p>Please refer to <a href="https://software.intel.com/en-us/node/503894">Getting Started With Intel IPP</a> to learn how to set environment variables, integrate with your compiler, and build Intel IPP applications.</p>
<pre class="brush:cpp;">/* Simple example of file compression using IPP LZO functions */
#include &lt;stdio.h&gt;
#include "ippdc.h"
#include "ipps.h"
#define BUFSIZE 1024
void CompressFile(const char* pInFileName, const char* pOutFileName)
{
FILE *pIn, *pOut;
IppLZOState_8u *pLZOState;
Ipp8u src[BUFSIZE];
/* For uncompressible data the size of output will be bigger */
Ipp8u dst[BUFSIZE + BUFSIZE/10];
Ipp32u srcLen, dstLen, lzoSize;
pIn = fopen(pInFileName, "rb");
pOut = fopen(pOutFileName, "wb");
ippsEncodeLZOGetSize(IppLZO1XST, BUFSIZE, &amp;lzoSize);
pLZOState = (IppLZOState_8u*)ippsMalloc_8u(lzoSize);
ippsEncodeLZOInit_8u(IppLZO1XST, BUFSIZE, pLZOState);
while ((srcLen = (Ipp32u)fread(src, 1, BUFSIZE, pIn)) &gt; 0) {
ippsEncodeLZO_8u(src, srcLen, dst, &amp;dstLen, pLZOState);
fwrite(&amp;srcLen, 1, sizeof(srcLen), pOut);
fwrite(&amp;dstLen, 1, sizeof(dstLen), pOut);
fwrite(dst, 1, dstLen, pOut);
}
fclose(pIn);
fclose(pOut);
}
/* Example of using of DecodeLZO function to decompress the file */
void DecompressFile(const char* pInFileName, const char* pOutFileName)
{
FILE *pIn, *pOut;
size_t allocSizeSrc = 0;
size_t allocSizeDst = 0;
Ipp32u srcLen, dstLen;
Ipp8u *pSrc, *pDst;
pIn = fopen(pInFileName, "rb");
pOut = fopen(pOutFileName, "wb");
while (1) {
if (fread(&amp;dstLen, 1, sizeof(dstLen), pIn) != sizeof(dstLen))
break;
fread(&amp;srcLen, 1, sizeof(srcLen), pIn);
if (srcLen &gt; allocSizeSrc) {
if (allocSizeSrc &gt; 0)
ippsFree(pSrc);
pSrc = ippsMalloc_8u(allocSizeSrc = srcLen);
}
if (dstLen &gt; allocSizeDst) {
if (allocSizeDst &gt; 0)
ippsFree(pDst);
pDst = ippsMalloc_8u(allocSizeDst = dstLen);
}
fread(pSrc, 1, srcLen, pIn);
ippsDecodeLZO_8u(pSrc, srcLen, pDst, &amp;dstLen);
fwrite(pDst, 1, dstLen, pOut);
}
fclose(pIn);
fclose(pOut);
}
</pre>
<h3>LZO Improvements compared to the previous version</h3>
<p>In the latest IPP 2018, there have been significant improvements in compression performance compared to the previous version. Please take a look at the compression data below.</p>
<p><span><img height="568" width="996" src="https://software.intel.com/sites/default/files/managed/55/9f/LZO_1.png" alt="" /></span></p>
<p>Compression performance is reported in MB/s, so higher is better. Compression speeds up by roughly 50% on average with the newest IPP version.</p>
<p>For decompression performance there has not been a big change; on several files the 2018 decompression performance is even lower than in 2017.</p>
<p>The reason is that in 2017 Update 2 we introduced support for decompressing LZO-999 compressed data, which requires additional checks during the decompression process and, as a result, lowers performance.</p>
<p>The IPP team managed to restore decompression performance in 2018, but not for all files from the Calgary corpus.</p>
<p>Additionally, the source code has been rewritten from assembly to C, which opens up further optimization potential.</p>
<p>Please refer to the test data below for the decompression results.</p>
<p><span><img height="568" width="996" src="https://software.intel.com/sites/default/files/managed/9a/9d/LZO_2.png" alt="" /></span></p>
<p> </p>
Thu, 10 Aug 2017 23:18:27 -0700, JON J K. (Intel)

Fast Computation of Adler32 Checksums
https://software.intel.com/en-us/articles/fast-computation-of-adler32-checksums
<h2>Abstract</h2>
<p>Adler32 is a common checksum used for checking the integrity of data in applications such as zlib*, a popular compression library. It is designed to be fast to execute in software, but in this paper we present a method to compute it with significantly better performance than the previous implementations. We show how the vector processing capabilities of Intel® Architecture Processors can be exploited to efficiently compute the Adler32 checksum.</p>
<h2>Introduction</h2>
<p>The Adler32 checksum (<a href="https://en.wikipedia.org/wiki/Adler-32" target="_blank" rel="nofollow">https://en.wikipedia.org/wiki/Adler-32</a>) is similar to the Fletcher checksum, but it is designed to catch certain differences that Fletcher is not able to catch. It is used, among other places, in the zlib data compression library (<a href="https://en.wikipedia.org/wiki/Zlib" target="_blank" rel="nofollow">https://en.wikipedia.org/wiki/Zlib</a>), a popular general-purpose compression library.</p>
<p>While scalar implementations of Adler32 can achieve reasonable performance, this paper presents a way to further improve the performance by using the vector processing feature of Intel processors. This is an extension of the method we used to speed up the Fletcher checksum as described in (<a href="https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums">https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums</a>).</p>
<h2>Implementation</h2>
<p>If the input stream is considered to be an array of bytes (data), the checksum essentially consists of two 16-bit words (A and B), and the checksum can be defined as:</p>
<pre class="code-simple">for (i=0; i&lt;end; i++) {
A = (A + data[i]) % 65521;
B = (B + A) % 65521;
}
</pre>
<p>Doing the modulo operation after every addition is expensive. A well-known way to speed this up is to do the addition using larger variables (for example, 32-bit dwords), and then to perform the modulo only when the variables are about to overflow, for example:</p>
<pre class="code-simple">for (i=0; i&lt;5552; i++) {
A = (A + data[i]);
B = (B + A);
}
A = A % 65521;
B = B % 65521;
</pre>
<p>The reason that up to 5552 bytes can be processed before needing to do the modulo is that if A and B are initially 65520 and the data is all 0xFF (255), after processing 5552 bytes, B (the larger of the two) will be 0xFFFBC598. But if one processes 5553 such bytes, the result would be greater than 2<sup>32</sup>.</p>
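<p>This bound is easy to verify directly with 64-bit arithmetic; the small check below (added here for illustration) reproduces the 0xFFFBC598 value quoted above and shows that one more byte would exceed 2<sup>32</sup>:</p>
<pre class="code-simple">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Check the 5552-byte bound: start A and B at 65520 and feed 0xFF bytes. */
int main(void)
{
    uint64_t A = 65520, B = 65520;
    for (int i = 0; i &lt; 5552; i++) {
        A += 0xFF;
        B += A;
    }
    printf("B after 5552 bytes = 0x%llX\n", (unsigned long long)B);              /* 0xFFFBC598 */
    printf("B after 5553 bytes = 0x%llX\n", (unsigned long long)(B + A + 0xFF)); /* &gt; 2^32    */
    return 0;
}
</pre>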
<p>Within that loop, the calculation looks the same as in Fletcher, so the same approach can be used to vectorize the calculation. In this case, the body of the main loop would be an unrolled version of:</p>
<pre class="code-simple"> pmovzxbd xdata0, [data] ; Loads byte data into dword lanes
paddd xa, xdata0
paddd xb, xa
</pre>
<p>One can see that this looks essentially identical to what one would do with scalar code, except that it is operating on vector registers and, depending on the hardware generation, could be processing 4, 8, or 16 bytes in parallel.</p>
<p>If “a[i]” represents the i’th lane of vector register “a” and N is the number of lanes, we can (as shown in the earlier paper) calculate the actual sums by:</p>
<p style="text-align:center"><img src="/sites/default/files/managed/b3/1d/WEBOPS-5668-fast-computation-figure01.jpg" /></p>
<p>The sums can be done using a series of horizontal adds (PHADDD), and the scaling can be done with PMULLD.</p>
<p>In pseudo-code, if the main loop is operating on eight lanes (either with eight lanes in one register or four lanes unrolled by a factor of two), this might look like:</p>
<pre class="code-simple">While (size != 0) {
s = min(size, 5222)
end = data + s – 7
while (data &lt; end) {
compute vector sum
data += 8
}
end += 7
if (0 == (s &amp; 7)) {
size -= s;
reduce from vector to scalar sum
compute modulo
continue while loop
}
// process final 1…7 bytes
Reduce from vector to scalar sum
Do final adds in scalar loop
Compute modulo
}
</pre>
<h2>Performance</h2>
<p>The following graph compares the cycles as a function of input buffer size for an optimized scalar implementation and for both an Intel® Streaming SIMD Extensions (Intel® SSE) based and an Intel® Advanced Vector Extensions 2 (Intel® AVX2) based parallel version, as described in this paper.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/60/ce/WEBOPS-5668-fast-computation-figure02.jpg" /></p>
<p>One can clearly see that the vector versions have a significantly better performance than an optimized scalar one. This is true for all but the smallest buffers.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/00/3d/WEBOPS-5668-fast-computation-figure03.jpg" /></p>
<p>An Intel® Advanced Vector Extensions 512 version was not tested, but it should perform significantly faster than the Intel AVX2 version.</p>
<p>Versions of this code are in the process of being integrated and released as part of the Intel® Intelligent Storage Acceleration Library (<a href="https://software.intel.com/en-us/storage/ISA-L">https://software.intel.com/en-us/storage/ISA-L</a>).</p>
<h2>Conclusion</h2>
<p>This paper illustrated a method for improved Adler32 checksum performance. By leveraging architectural features such as SIMD in the processors and combining innovative software techniques, large performance gains are possible.</p>
<h2>Author</h2>
<p>Jim Guilford is an architect in the Intel Data Center Group, specializing in software and hardware features relating to cryptography and compression.</p>
Thu, 08 Jun 2017 13:55:09 -0700, James Guilford (Intel)

Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*
https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier
<h3>Overview</h3>
<p>This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and a customized version of Caffe*, Intel® Optimized Caffe*. Using Intel® VTune™ Amplifier and the time profiling option of Caffe* itself, we explain why and how Intel® Optimized Caffe* performs efficiently on Intel® architecture.</p>
<h3>Introduction to BVLC Caffe* and Intel® Optimized Caffe*</h3>
<p><a href="http://caffe.berkeleyvision.org/" rel="nofollow">Caffe</a>* is a well-known and widely used machine vision based Deep Learning framework developed by the Berkeley Vision and Learning Center (<a href="http://bvlc.eecs.berkeley.edu/" rel="nofollow">BVLC</a>). It is an open-source framework and is evolving currently. It allows users to control a variety options such as libraries for BLAS, CPU or GPU focused computation, CUDA, OpenCV, MATLAB and Python before you build Caffe* through 'Makefile.config'. You can easily change the options in the configuration file and BVLC provides intuitive instructions on their project web page for developers. </p>
<p>Intel® Optimized Caffe* is Intel distributed customized Caffe* version for Intel Architectures. Intel® Optimized Caffe* offers all the goodness of main Caffe* with the addition of Intel Architectures optimized functionality and multi-node distributor training and scoring. Intel® Optimized Caffe* makes it possible to more efficiently utilize CPU resources.</p>
<p>To see in detail how Intel® Optimized Caffe* has changed in order to optimize itself to Intel Architectures, please refer this page : <a href="https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques">https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques</a></p>
<p>In this article, we will first profile the performance of BVLC Caffe* with Cifar 10 example and then will profile the performance of Intel® Optimized Caffe* with the same example. Performance profile will be conducted through two different methods.</p>
<p>Tested platform : Xeon Phi™ 7210 ( 1.3Ghz, 64 Cores ) with 96GB RAM, CentOS 7.2</p>
<p>1. Caffe* provides its own timing option, for example:</p>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p>2. Intel® VTune™ Amplifier : Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface. <a href="https://software.intel.com/en-us/intel-vtune-amplifier-xe">https://software.intel.com/en-us/intel-vtune-amplifier-xe</a></p>
<h3>How to Install BVLC Caffe*</h3>
<p>Please refer to the BVLC Caffe* project web page for installation instructions: <a href="http://caffe.berkeleyvision.org/installation.html" rel="nofollow">http://caffe.berkeleyvision.org/installation.html</a></p>
<p>If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.</p>
<p>In your Makefile.config, choose BLAS := mkl and specify the MKL location. (The default setting is BLAS := atlas.)</p>
<p>In our test, we kept all configurations at their default values except for the CPU-only option.</p>
<p> </p>
<h3>Test example</h3>
<p>In this article, we will use the 'CIFAR-10' example included in the Caffe* package by default.</p>
<p>You can refer to the BVLC Caffe* project page for detailed information about this example: <a href="http://caffe.berkeleyvision.org/gathered/examples/cifar10.html" rel="nofollow">http://caffe.berkeleyvision.org/gathered/examples/cifar10.html</a></p>
<p>You can simply run the CIFAR-10 training example as follows:</p>
<pre class="brush:cpp;">cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh</pre>
<p>First, we will try Caffe's own benchmark method to obtain its performance results, as follows:</p>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p>As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end, it shows the average execution time per iteration over 1,000 iterations, per layer and for the entire calculation.</p>
<p><span><img height="538" width="308" src="https://software.intel.com/sites/default/files/managed/c1/9d/Picture1.png" alt="" /></span></p>
<p>This test was run on an Intel® Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, running CentOS 7.2.</p>
<p>The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*. </p>
<p>Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.</p>
<p> </p>
<h3>VTune Profiling</h3>
<p>Intel® VTune™ Amplifier is a modern processor performance profiler that is capable of analyzing top hotspots quickly and helping you tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link:</p>
<p>Intel® VTune™ Amplifier : <a href="https://software.intel.com/en-us/intel-vtune-amplifier-xe">https://software.intel.com/en-us/intel-vtune-amplifier-xe</a></p>
<p>We used Intel® VTune™ Amplifier in this article to find the functions with the highest total CPU utilization time and to see how the OpenMP* threads are working.</p>
<p> </p>
<h3>VTune result analysis</h3>
<p> </p>
<p><span><img height="820" width="1798" src="https://software.intel.com/sites/default/files/managed/5c/97/Capture1.PNG" alt="" /></span></p>
<p>What we can see here is a list of functions on the left side of the screen that consume most of the CPU time. They are called 'hotspots' and are the natural targets for performance optimization. </p>
<p>In this case, we will focus on the 'caffe::im2col_cpu&lt;float&gt;' function as an optimization candidate. </p>
<p>'im2col_cpu&lt;float&gt;' is one of the steps in expressing a direct convolution as a GEMM operation so that highly optimized BLAS libraries can be used. This function consumed the most CPU time in our test of training the Cifar 10 model with BVLC Caffe*. </p>
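<p>To make the role of this function more concrete, below is a simplified, illustrative im2col routine for a single image with no padding. It is only a sketch of the data rearrangement that lets a convolution be computed as one matrix multiplication; it is not the actual Caffe* implementation, and the function name and parameters are chosen for illustration.</p>
<pre class="brush:cpp;">#include &lt;vector&gt;

/* Illustrative sketch (not Caffe* code): rearrange an input of shape
 * (channels, height, width) into a matrix of shape
 * (channels*kernel*kernel) x (out_h*out_w), so that convolution with a
 * filter matrix becomes a single GEMM call into an optimized BLAS library. */
void im2col_sketch(const std::vector&lt;float&gt; &amp;data_im,
                   int channels, int height, int width,
                   int kernel, int stride,
                   std::vector&lt;float&gt; &amp;data_col)
{
    const int out_h = (height - kernel) / stride + 1;
    const int out_w = (width  - kernel) / stride + 1;
    data_col.assign((size_t)channels * kernel * kernel * out_h * out_w, 0.0f);

    size_t col = 0;
    for (int c = 0; c &lt; channels; ++c)          /* one output row per (c, kh, kw) */
        for (int kh = 0; kh &lt; kernel; ++kh)
            for (int kw = 0; kw &lt; kernel; ++kw)
                for (int oh = 0; oh &lt; out_h; ++oh)
                    for (int ow = 0; ow &lt; out_w; ++ow)
                        data_col[col++] =
                            data_im[(size_t)(c * height + oh * stride + kh) * width
                                    + ow * stride + kw];
}</pre>
<p>Because the heavy lifting after this rearrangement is a GEMM, the choice of BLAS library (ATLAS or MKL, as configured in Makefile.config) and how well this copy loop is threaded and vectorized largely determine convolution performance, which is why this function shows up as the top hotspot here.</p>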
<p>Let's take a look at the threading behavior of this function. In VTune™, you can select a function and filter other workloads out to observe only the workloads of the specified function. </p>
<p><span><img height="625" width="1350" src="https://software.intel.com/sites/default/files/managed/36/a2/Capture2.PNG" alt="" /></span></p>
<p>In the result above, we can see that the CPI (cycles per instruction) of the function is 0.907 and that the function utilizes only a single thread for the entire calculation.</p>
<p>VTune provides another intuitive view here. </p>
<p><span><img height="476" width="948" src="https://software.intel.com/sites/default/files/managed/45/a8/Capture3.PNG" alt="" /></span></p>
<p>This 'CPU Usage Histogram' shows how many CPUs were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology, so it exposes 256 logical CPUs. The CPU usage histogram here suggests that the process is not efficiently threaded. </p>
<p>However, we cannot simply label these results as 'bad', because we have not yet established a performance baseline or target. We will compare these results with those of Intel® Optimized Caffe* later.</p>
<p> </p>
<p>Let's move on to Intel® Optimized Caffe* now.</p>
<p> </p>
<h3>How to Install Intel® Optimized Caffe*</h3>
<p>The basic installation procedure for Intel® Optimized Caffe* is the same as for BVLC Caffe*. </p>
<p>When cloning from Git, use this repository instead: </p>
<pre class="brush:cpp;">git clone https://github.com/intel/caffe</pre>
<p> </p>
<p>Additionally, Intel® MKL is required to bring out the best performance of Intel® Optimized Caffe*. </p>
<p>Please download and install Intel® MKL. Intel offers MKL free of charge without technical support, or for a license fee with one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is MKL.</p>
<p> Intel® MKL : <a href="https://software.intel.com/en-us/intel-mkl">https://software.intel.com/en-us/intel-mkl</a></p>
<p>After downloading Intel® Optimized Caffe* and installing MKL, make sure in your Makefile.config that you choose MKL as your BLAS library and point BLAS_INCLUDE and BLAS_LIB to the MKL include and lib folders:</p>
<pre class="brush:cpp;">BLAS :=mkl
BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64</pre>
<p> </p>
<p>If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, please install 'libstdc++-static'. For example:</p>
<pre class="brush:cpp;">sudo yum install libstdc++-static</pre>
<p> </p>
<p> </p>
<p> </p>
<h3>Optimization factors and tunes</h3>
<p>Before we run and test the performance of the examples, there are some options we need to change or adjust to optimize performance.</p>
<ul><li>Use 'mkl' as the BLAS library: specify 'BLAS := mkl' in Makefile.config and also configure the locations of your MKL include and lib directories.</li>
<li>Keep the CPUs at their maximum frequency and enable turbo mode:
<pre class="brush:cpp;">echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo</pre>
</li>
<li>Put 'engine:"MKL2017" ' at the top of your train_val.prototxt or solver.prototxt file or use this option with caffe tool : -engine "MKL2017"</li>
<li>Current implementation uses OpenMP threads. By default the number of OpenMP threads is set to the number of CPU cores. Each one thread is bound to a single core to achieve best performance results. It is however possible to use own configuration by providing right one through OpenMP environmental variables like KMP_AFFINITY, OMP_NUM_THREADS or GOMP_CPU_AFFINITY. For the example run below , 'OMP_NUM_THREADS = 64' has been used.</li>
<li>Intel® Optimized Caffe* has edited many parts of original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on other processes running on the background, it is often useful to adjust the number of threads getting utilized by OpenMP*. For Intel Xeon Phi™ product family single-node we recommend to use OMP_NUM_THREADS = numer_of_cores-2.</li>
<li>Please also refer here : <a href="https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance" rel="nofollow">Intel Recommendation to Achieve the best performance </a></li>
</ul><p>If you observe too much overhead because the OS moves threads between cores too frequently, you can try adjusting the OpenMP* affinity environment variable: </p>
<pre class="brush:cpp;">KMP_AFFINITY=compact,granularity=fine</pre>
<p> </p>
<h3>Test example</h3>
<p>For Intel® Optimized Caffe*, we run the same example so we can compare against the previous results. </p>
<pre class="brush:cpp;">cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh</pre>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p> </p>
<h3>Comparison</h3>
<p>The results of the above example are as follows.</p>
<p>Again, the platform used for the test is an Intel® Xeon Phi™ processor 7210 (1.3 GHz, 64 cores) with 96 GB RAM, running CentOS 7.2.</p>
<p>First, let's look at the BVLC Caffe* and Intel® Optimized Caffe* results side by side: </p>
<p><span><img height="538" width="308" src="https://software.intel.com/sites/default/files/managed/a1/80/Picture1.png" alt="" /></span> --&gt; <span><img height="546" width="315" src="https://software.intel.com/sites/default/files/managed/ed/6f/Picture2.png" alt="" /></span></p>
<p>To make the comparison easier, see the table below. The duration of each layer is listed in milliseconds, and the fifth column shows how many times faster Intel® Optimized Caffe* is than BVLC Caffe* for that layer. You can observe significant performance improvements in every layer except the bn layers. 'Bn' stands for batch normalization, which requires fairly simple calculations and therefore offers little optimization potential; the bn forward layers are still somewhat faster, while the bn backward layers are about 2~3% slower than the original, most likely as a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case. </p>
<table border="0" style="width:451px"><tbody><tr><td style="width:72px"> </td>
<td style="width:72px">Direction</td>
<td style="width:72px">BVLC (ms)</td>
<td style="width:72px">Intel (ms)</td>
<td style="width:163px">Performance Benefit (x)</td>
</tr><tr><td>conv1</td>
<td>Forward</td>
<td>40.2966</td>
<td>1.65063</td>
<td>24.413</td>
</tr><tr><td>conv1</td>
<td>Backward</td>
<td>54.5911</td>
<td>2.24787</td>
<td>24.286</td>
</tr><tr><td>pool1</td>
<td>Forward</td>
<td>162.288</td>
<td>1.97146</td>
<td>82.319</td>
</tr><tr><td>pool1</td>
<td>Backward</td>
<td>21.7133</td>
<td>0.459767</td>
<td>47.227</td>
</tr><tr><td>bn1</td>
<td>Forward</td>
<td>1.60717</td>
<td>0.812487</td>
<td>1.978</td>
</tr><tr><td>bn1</td>
<td>Backward</td>
<td>1.22236</td>
<td>1.24449</td>
<td>0.982</td>
</tr><tr><td>Sigmoid1</td>
<td>Forward</td>
<td>132.515</td>
<td>2.24764</td>
<td>58.957</td>
</tr><tr><td>Sigmoid1</td>
<td>Backward</td>
<td>17.9085</td>
<td>0.262797</td>
<td>68.146</td>
</tr><tr><td>conv2</td>
<td>Forward</td>
<td>125.811</td>
<td>3.8915</td>
<td>32.330</td>
</tr><tr><td>conv2</td>
<td>Backward</td>
<td>239.459</td>
<td>8.45695</td>
<td>28.315</td>
</tr><tr><td>bn2</td>
<td>Forward</td>
<td>1.58582</td>
<td>0.854936</td>
<td>1.855</td>
</tr><tr><td>bn2</td>
<td>Backward</td>
<td>1.2253</td>
<td>1.25895</td>
<td>0.973</td>
</tr><tr><td>Sigmoid2</td>
<td>Forward</td>
<td>132.443</td>
<td>2.2247</td>
<td>59.533</td>
</tr><tr><td>Sigmoid2</td>
<td>Backward</td>
<td>17.9186</td>
<td>0.234701</td>
<td>76.347</td>
</tr><tr><td>pool2</td>
<td>Forward</td>
<td>17.2868</td>
<td>0.38456</td>
<td>44.952</td>
</tr><tr><td>pool2</td>
<td>Backward</td>
<td>27.0168</td>
<td>0.661755</td>
<td>40.826</td>
</tr><tr><td>conv3</td>
<td>Forward</td>
<td>40.6405</td>
<td>1.74722</td>
<td>23.260</td>
</tr><tr><td>conv3</td>
<td>Backward</td>
<td>79.0186</td>
<td>4.95822</td>
<td>15.937</td>
</tr><tr><td>bn3</td>
<td>Forward</td>
<td>0.918853</td>
<td>0.779927</td>
<td>1.178</td>
</tr><tr><td>bn3</td>
<td>Backward</td>
<td>1.18006</td>
<td>1.18185</td>
<td>0.998</td>
</tr><tr><td>Sigmoid3</td>
<td>Forward</td>
<td>66.2918</td>
<td>1.1543</td>
<td>57.430</td>
</tr><tr><td>Sigmoid3</td>
<td>Backward</td>
<td>8.98023</td>
<td>0.121766</td>
<td>73.750</td>
</tr><tr><td>pool3</td>
<td>Forward</td>
<td>12.5598</td>
<td>0.220369</td>
<td>56.994</td>
</tr><tr><td>pool3</td>
<td>Backward</td>
<td>17.3557</td>
<td>0.333837</td>
<td>51.989</td>
</tr><tr><td>ipl</td>
<td>Forward</td>
<td>0.301847</td>
<td>0.186466</td>
<td>1.619</td>
</tr><tr><td>ipl</td>
<td>Backward</td>
<td>0.301837</td>
<td>0.184209</td>
<td>1.639</td>
</tr><tr><td>loss</td>
<td>Forward</td>
<td>0.802242</td>
<td>0.641221</td>
<td>1.251</td>
</tr><tr><td>loss</td>
<td>Backward</td>
<td>0.013722</td>
<td>0.013825</td>
<td>0.993</td>
</tr><tr><td>Ave.</td>
<td>Forward</td>
<td>735.534</td>
<td>21.6799</td>
<td>33.927</td>
</tr><tr><td>Ave.</td>
<td>Backward</td>
<td>488.049</td>
<td>21.7214</td>
<td>22.469</td>
</tr><tr><td>Ave.</td>
<td>Forward-Backward</td>
<td>1223.86</td>
<td>43.636</td>
<td>28.047</td>
</tr><tr><td>Total</td>
<td> </td>
<td>1223860</td>
<td>43636</td>
<td>28.047</td>
</tr></tbody></table><p> </p>
<p>Some of the many reasons these optimizations were possible are (a generic illustration follows the list):</p>
<ul><li>Code vectorization for SIMD </li>
<li>Finding hotspot functions and reducing function complexity and the amount of calculations</li>
<li>CPU / system specific optimizations</li>
<li>Reducing thread movements</li>
<li>Efficient OpenMP* utilization</li>
</ul><p> </p>
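<p>As a generic illustration of the last two points (not code taken from Intel® Optimized Caffe*; the function name and parameters are chosen purely for illustration), a hot loop can be both threaded and vectorized with a single OpenMP* 4.0 combined construct:</p>
<pre class="brush:cpp;">#include &lt;cstddef&gt;

/* Generic illustration (not Intel® Optimized Caffe* code): one OpenMP* 4.0
 * combined construct distributes the iterations across threads and asks the
 * compiler to vectorize each thread's chunk with SIMD instructions. */
void scale_add(float a, const float *x, float *y, std::size_t n)
{
    #pragma omp parallel for simd
    for (std::size_t i = 0; i &lt; n; ++i)
        y[i] = a * x[i] + y[i];
}</pre>
<p>With the Intel® C++ Compiler, such loops are enabled with the -qopenmp option.</p>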
<p>Additionally, let's compare the VTune results of this example between BVLC Caffe* and Intel® Optimized Caffe*. </p>
<p>We will simply look at how efficiently the im2col_cpu function is utilized. </p>
<p><span><img height="625" width="1350" src="https://software.intel.com/sites/default/files/managed/fd/1e/Capture2.PNG" alt="" /></span></p>
<p>BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single threaded. </p>
<p><span><img height="622" width="1346" src="https://software.intel.com/sites/default/files/managed/4f/19/Capture4.PNG" alt="" /></span></p>
<p>In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multithreaded by OpenMP* worker threads. </p>
<p>The CPI rate increased here because vectorization brings a higher CPI (each vector instruction has a longer latency) and because multi-threading can introduce spinning while threads wait for others to finish their work. However, in this example, the benefits of vectorization and multi-threading outweigh the added latency and overhead and deliver an overall performance improvement.</p>
<p>VTune suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about that value for this function. The training workload for the Cifar 10 example handles 32 x 32 pixel images in each iteration, so when the work is split across many threads, each thread receives a very small task, which can cause transition overhead for multi-threading. With larger images we would expect less spinning time and a lower CPI rate.</p>
<p>CPU Usage Histogram for the whole process also shows better threading results in this case. </p>
<p><span><img height="476" width="948" src="https://software.intel.com/sites/default/files/managed/bf/89/Capture3.PNG" alt="" /></span></p>
<p> </p>
<p><span><img height="611" width="1194" src="https://software.intel.com/sites/default/files/managed/83/0d/Capture5.PNG" alt="" /></span></p>
<p> </p>
<h3> </h3>
<h3>Useful links</h3>
<div>BVLC Caffe* Project : <a href="http://caffe.berkeleyvision.org/" rel="nofollow">http://caffe.berkeleyvision.org/ </a></div>
<div>BVLC Caffe* Git : <a href="https://github.com/BVLC/caffe" rel="nofollow">https://github.com/BVLC/caffe </a></div>
<div> </div>
<div>Intel® Optimized Caffe* Introduction : <a href="https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe">https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe</a></div>
<div>Intel® Optimized Caffe* Git : <a href="https://github.com/intel/caffe" rel="nofollow">https://github.com/intel/caffe</a></div>
<div>Intel® Optimized Caffe* Recommendations for the best performance : <a href="https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance" rel="nofollow">https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance </a></div>
<div>Intel® Optimized Caffe* Modern Code Techniques : <a href="https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques">https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques</a></div>
<div> </div>
<div>
<h3> </h3>
<h3>Summary</h3>
<p>Intel® Optimized Caffe* is a customized Caffe* version for Intel® architectures that incorporates modern code techniques.</p>
<p>In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization. </p>
<p> </p>
</div>
<div> </div>
Sun, 09 Apr 2017 00:33:05 -0700JON J K. (Intel)707193How to use the MPI-3 Shared Memory in Intel® Xeon Phi™ Processorshttps://software.intel.com/en-us/articles/using-mpi-3-shared-memory-in-xeon-phi-processors
<p>This whitepaper introduces the MPI-3 shared memory feature, the corresponding APIs, and a sample program to illustrate the use of MPI-3 shared memory in the Intel® Xeon Phi™ processor.</p>
<h2>Introduction to MPI-3 Shared Memory</h2>
<p>MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI) standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory allows multiple MPI processes to allocate and have access to the shared memory in a compute node. For applications that require multiple MPI processes to exchange huge local data, this feature reduces the memory footprint and can improve performance significantly.</p>
<p>In the MPI standard, each MPI process has its own address space. With MPI-3 shared memory, each MPI process exposes its own memory to other processes. The following figure illustrates the concept of shared memory: Each MPI process allocates and maintains its own local memory, and exposes a portion of its memory to the shared memory region. All processes then can have access to the shared memory region. Using the shared memory feature, users can reduce the data exchange among the processes.</p>
<p><img alt="Data exchange among the processes" src="https://software.intel.com/sites/default/files/managed/07/58/using-mpi-shared-memory-figure1.png" title="Figure 1" /></p>
<p>By default, the memory created by an MPI process is private. It is best to use MPI-3 shared memory when only memory needs to be shared and all other resources remain private. As each process has access to the shared memory region, users need to pay attention to process synchronization when using shared memory.</p>
<h2>Sample Code</h2>
<p>In this section, sample code is provided to illustrate the use of MPI-3 shared memory.</p>
<p>A total of eight MPI processes are created on the node. Each process maintains a long array of 32 million elements. For each element <code><em>j</em></code> in its array, the process updates the element value based on its current value and the values of element <code><em>j</em></code> in the arrays of its two nearest-neighbor processes, and the same procedure is applied to the whole array. The following pseudo-code shows what happens when the program runs with eight MPI processes for 64 iterations:</p>
<pre>Repeat the following procedure 64 times:
for each MPI process <em>n</em> from <em>0</em> to <em>7</em>:
for each element <em>j</em> in the array <em>A<sub>n</sub></em>:
<em>A<sub>n</sub></em>[<em>j</em>] ← 0.5*<em>A<sub>n</sub></em>[<em>j</em>] + 0.25*<em>A<sub>previous</sub></em>[<em>j</em>] + 0.25*<em>A<sub>next</sub></em>[<em>j</em>]</pre>
<p>where <em>A<sub>n</sub></em> is the long array belonging to process <code><em>n</em></code>, and <code><em>A<sub>n</sub></em></code>[<code><em>j</em></code>] is the value of element<code><em> j </em></code>in that array. In this program, since each process exposes its local memory to the others, all processes can access all arrays, although each process needs only the two neighbor arrays (for example, process 0 needs data from processes 1 and 7, process 1 needs data from processes 0 and 2,…).</p>
<p style="text-align:center"><img alt="Shared Memory Diagram" src="https://software.intel.com/sites/default/files/managed/25/a4/using-mpi-shared-memory-figure2.png" title="Figure 2" /></p>
<p>Besides the basic APIs used for MPI programming, the following MPI-3 shared memory APIs are introduced in this example (a short sketch showing how they fit together follows the list):</p>
<ul><li><code>MPI_Comm_split_type</code>: Used to create a new communicator where all processes share a common property. In this case, we pass <code>MPI_COMM_TYPE_SHARED</code> as an argument in order to create a shared memory from a parent communicator such as <code>MPI_COMM_WORLD</code>, and decompose the communicator into a shared memory communicator <code><em>shmcomm</em></code>.</li>
<li><code>MPI_Win_allocate_shared</code>: Used to create a shared memory that is accessible by all processes in the shared memory communicator. Each process exposes its local memory to all other processes, and the size of the local memory allocated by each process can be different. By default, the total shared memory is allocated contiguously. The user can pass an info hint “<code>alloc_shared_noncontig</code>” to specify that the shared memory does not have to be contiguous, which can cause performance improvement, depending on the underlying hardware architecture. </li>
<li><code>MPI_Win_free</code>: Used to release the memory.</li>
<li><code>MPI_Win_shared_query</code>: Used to query the address of the shared memory of an MPI process.</li>
<li><code>MPI_Win_lock_all</code> and <code>MPI_Win_unlock_all</code>: Used to start an access epoch to all processes in the window. Only shared epochs are needed. The calling process can access the shared memory on all processes.</li>
<li><code>MPI_Win_sync</code>: Used to ensure the completion of copying the local memory to the shared memory.</li>
<li><code>MPI_Barrier</code>: Used to block the caller process on the node until all processes reach a barrier. The barrier synchronization API works across all processes.</li>
</ul>
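<p>The following is a minimal sketch of how these calls fit together, following the structure described above: each rank allocates its portion of the shared window, queries the base addresses of its two neighbors, and then iterates the averaging update. It is an illustration rather than the downloadable sample program (which is a C source file, mpishared.c); it is written in C++ for brevity, and the array size, iteration count, and the temporary buffer used to keep reads and writes of the shared arrays from overlapping are assumptions made for this sketch.</p>
<pre class="brush:cpp;">#include &lt;mpi.h&gt;
#include &lt;vector&gt;

/* Minimal sketch of the MPI-3 shared memory pattern described above
 * (illustrative only; ARRAY_SIZE, NUM_ITER, and the temporary buffer
 * are assumptions for this sketch, not the downloadable sample). */
int main(int argc, char **argv)
{
    const MPI_Aint ARRAY_SIZE = 1 &lt;&lt; 20;   /* elements per process */
    const int NUM_ITER = 64;

    MPI_Init(&amp;argc, &amp;argv);

    /* Communicator of ranks that can share memory (the ranks on this node). */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &amp;shmcomm);
    int rank, size;
    MPI_Comm_rank(shmcomm, &amp;rank);
    MPI_Comm_size(shmcomm, &amp;size);

    /* Each rank exposes its own array as part of the shared window. */
    double *my_arr = NULL;
    MPI_Win win;
    MPI_Win_allocate_shared(ARRAY_SIZE * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, shmcomm, &amp;my_arr, &amp;win);

    /* Query the base addresses of the two neighbor ranks' arrays. */
    const int prev = (rank + size - 1) % size, next = (rank + 1) % size;
    MPI_Aint qsize; int qdisp;
    double *prev_arr, *next_arr;
    MPI_Win_shared_query(win, prev, &amp;qsize, &amp;qdisp, &amp;prev_arr);
    MPI_Win_shared_query(win, next, &amp;qsize, &amp;qdisp, &amp;next_arr);

    for (MPI_Aint j = 0; j &lt; ARRAY_SIZE; ++j) my_arr[j] = rank;
    std::vector&lt;double&gt; tmp(ARRAY_SIZE);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);       /* shared access epoch */
    for (int it = 0; it &lt; NUM_ITER; ++it) {
        MPI_Win_sync(win);                         /* make local stores visible */
        MPI_Barrier(shmcomm);                      /* everyone's data is ready to read */
        #pragma omp parallel for                   /* loop-level OpenMP* threading */
        for (MPI_Aint j = 0; j &lt; ARRAY_SIZE; ++j)
            tmp[j] = 0.5 * my_arr[j] + 0.25 * prev_arr[j] + 0.25 * next_arr[j];
        MPI_Barrier(shmcomm);                      /* all reads done before overwriting */
        for (MPI_Aint j = 0; j &lt; ARRAY_SIZE; ++j) my_arr[j] = tmp[j];
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&amp;win);
    MPI_Finalize();
    return 0;
}</pre>
<p>The "alloc_shared_noncontig" info hint mentioned above could be passed in place of MPI_INFO_NULL if a contiguous allocation is not required.</p>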
<h2>Basic Performance Tuning for Intel® Xeon Phi™ Processor</h2>
<p>This test is run on an Intel Xeon Phi processor 7250 at 1.40 GHz with 68 cores, installed with Red Hat Enterprise Linux* 7.2 and <a href="/en-us/articles/xeon-phi-software" rel="nofollow">Intel® Xeon Phi™ Processor Software</a> 1.5.1, and <a href="/en-us/intel-parallel-studio-xe" rel="nofollow">Intel® Parallel Studio</a> 2017 update 2. By default, the Intel compiler will try to vectorize the code, and each MPI process has a single thread of execution. OpenMP* pragma is added at loop level for later use. To compile the code, run the following command line to generate the binary <code>mpishared.out</code>:</p>
<pre class="brush:bash;">$ mpiicc mpishared.c -qopenmp -o mpishared.out
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 5699 (after 64 iterations)</pre>
<p>To explore the thread parallelism, run four threads per core, and re-compile with <code>-xMIC-AVX512</code> to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions:</p>
<pre class="brush:bash;">$ mpiicc mpishared.c -qopenmp -xMIC-AVX512 -o mpishared.out
$ export OMP_NUM_THREADS=4
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 4535 (after 64 iterations)</pre>
<p>As the MCDRAM in this system is currently configured in flat mode, the Intel Xeon Phi processor appears as two NUMA nodes. Node 0 contains all CPUs and the on-platform DDR4 memory, while node 1 contains the on-package MCDRAM:</p>
<pre class="brush:bash;">$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92775 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node 0 1
0: 10 31
1: 31 10</pre>
<p>To allocate the memory in the MCDRAM (node 1), pass the argument <code>-m 1</code> to the command <code>numactl</code> as follows:</p>
<pre class="brush:bash;">$ numactl -m 1 mpirun -n 8 ./mpishared.out
Elapsed time in msec: 3070 (after 64 iterations)</pre>
<p>This simple optimization technique greatly improves performance.</p>
<h2>Summary</h2>
<p>This whitepaper introduced the MPI-3 shared memory feature, followed by sample code, which used MPI-3 shared memory APIs. The pseudo-code explained what the program is doing along with an explanation of shared memory APIs. The program ran on an Intel Xeon Phi processor, and it was further optimized with simple techniques.</p>
<h2>Reference</h2>
<ol><li>MPI Forum, <a href="http://mpi-forum.org/mpi-30/" target="_blank" rel="nofollow">MPI 3.0</a></li>
<li>Message Passing Interface Forum, <a href="http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf" target="_blank" rel="nofollow">MPI: A Message-Passing Interface Standard Version 3.0</a></li>
<li>The MIT Press, <a href="https://mitpress.mit.edu/using-advanced-MPI" target="_blank" rel="nofollow">Using Advanced MPI</a></li>
<li>James Reinders, Jim Jeffers, Publisher: Morgan Kaufmann, Chapter 16 - MPI-3 Shared Memory Programming Introduction, <a href="https://www.elsevier.com/books/high-performance-parallelism-pearls-volume-two/jeffers/978-0-12-803819-2" rel="nofollow">High Performance Parallelism Pearls Volume Two</a></li>
</ol><h2>Appendix</h2>
<p>The code of the sample MPI program is available for download.</p>
Wed, 29 Mar 2017 12:33:43 -0700Nguyen, Loc Q (Intel)721808Intel® Math Kernel Library for Deep Neural Networks: Part 2 – Code Build and Walkthroughhttps://software.intel.com/en-us/articles/intel-mkl-dnn-part-2-sample-code-build-and-walkthrough
<h2>Introduction</h2>
<p>In <a href="https://software.intel.com/en-us/articles/intel-mkl-dnn-part-1-library-overview-and-installation">Part 1</a> we introduced Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), an open source performance library for deep learning applications. Detailed steps were provided on how to install the library components on a computer with an Intel processor supporting Intel® Advanced Vector Extensions 2 (Intel® AVX2) and running the Ubuntu* operating system. Details on how to build the C and C++ code examples from the command line were also covered in Part 1.</p>
<p>In Part 2 we will explore how to configure an integrated development environment (IDE) to build the C++ code example, and provide a code walkthrough based on the AlexNet* deep learning topology. In this tutorial we’ll be working with the <a href="http://www.eclipse.org/downloads/" target="_blank" rel="nofollow">Eclipse Neon</a>* IDE with the <a href="http://www.eclipse.org/cdt/" target="_blank" rel="nofollow">C/C++ Development Tools (CDT)</a>. (If your system does not already have Eclipse* installed you can follow the directions on the <a href="http://ubuntuhandbook.org/index.php/2016/01/how-to-install-the-latest-eclipse-in-ubuntu-16-04-15-10/" target="_blank" rel="nofollow">Ubuntu Handbook</a> site, specifying the Oracle Java* 8 and Eclipse IDE for C/C++ Developers options.)</p>
<h2>Building the C++ Example in Eclipse IDE</h2>
<p>This section describes how to create a new project in Eclipse and import the Intel MKL-DNN C++ example code.</p>
<p>Create a new project in Eclipse:</p>
<ul><li>Start Eclipse.</li>
<li>Click <strong>New</strong> in the upper left-hand corner of screen.</li>
<li>In the <em>Select a wizard</em> screen, select <strong>C++ Project </strong>and then click <strong>Next</strong> (Figure 1).</li>
</ul><p style="text-align:center"><img alt="Create a new C++ project in Eclipse" src="/sites/default/files/managed/63/20/mkl-dnn-part-2-fig-01-create-a-new.png" /><br /><strong>Figure 1.</strong> <em>Create a new C++ project in Eclipse.</em></p>
<ul><li>Enter <strong>simple_net</strong> for the project name. For the project type select <strong>Executable, Empty Project</strong>. For toolchain select <strong>Linux GCC</strong>. Click <strong>Next</strong>.</li>
<li>In the <em>Select Configurations</em> screen, click <strong>Advanced Settings</strong>.</li>
</ul><p>Enable C++11 for the project:</p>
<ul><li>In the <em>Properties</em> screen, expand the <strong>C/C++ Build</strong> option in the menu tree and then select <strong>Settings</strong><em>.</em></li>
<li>In the <em>Tool Settings</em> tab, select <strong>GCC C++ Compiler</strong>, and then <strong>Miscellaneous</strong>.</li>
<li>In the <em>Other flags</em> box add -<strong>std=c++11</strong> to the end of the existing string, separated by a space (Figure 2).</li>
</ul><p style="text-align:center"><img alt="Enable C++11 for the project (1 of 2)" src="/sites/default/files/managed/63/20/mkl-dnn-part-2-fig-02-enable-1.png" /><br /><strong>Figure 2</strong>. <em>Enable C++11 for the project (1 of 2).</em></p>
<ul><li>In the <em>Properties</em> screen, expand the <strong>C/C++ General</strong> and then select <strong>Preprocessor Include Paths, Macros etc.</strong></li>
<li>Select the <strong>Providers</strong> tab and then select the compiler you are using (for example, <em>CDT GCC Built-in Compiler Settings</em>).</li>
<li>Locate the field named <em>Command to get compiler specs:</em> and add <strong><code>-std=c++11</code></strong>. The command should look similar to this when finished:<br />
“${COMMAND} ${FLAGS} -E -P -v -dD “${INPUTS}” -std=c++11”.</li>
<li>Click <strong>Apply</strong> and then <strong>OK</strong> (Figure 3).</li>
</ul><p style="text-align:center"><img alt="Enable C++11 for the project (2 of 2)" src="/sites/default/files/managed/63/20/mkl-dnn-part-2-fig-03-enable-2.png" /><br /><strong>Figure 3</strong>. <em>Enable C++11 for the project (2 of 2).</em></p>
<p>Add library to linker settings:</p>
<ul><li>In the <em>Properties</em> screen, expand the <strong>C/C++ Build</strong> option in the menu tree and then select <strong>Settings</strong>.</li>
<li>In the <em>Tool Settings</em> tab, select <strong>GCC C++ Linker</strong>, and then<strong> Libraries</strong>.</li>
<li>Under the <em>Libraries (l)</em> section click <strong>Add</strong>.</li>
<li>Enter <strong>mkldnn</strong> and then click <strong>OK</strong> (Figure 4).</li>
</ul><p style="text-align:center"><img alt="Add library to linker settings" src="/sites/default/files/managed/63/20/mkl-dnn-part-2-fig-04-add-library.png" /><br /><strong>Figure 4</strong>. <em>Add library to linker settings</em>.</p>
<p>Finish creating the project:</p>
<ul><li>Click <strong>OK</strong> at the bottom of the <em>Properties</em> screen.</li>
<li>Click <strong>Finish</strong> at the bottom of the <em>C++ Project</em> screen.</li>
</ul><p>Add the C++ source file (note: at this point the <em>simple_net</em> project should appear in Project Explorer):</p>
<ul><li>Right-click the project name in Project Explorer and select <strong>New</strong>,<strong> Source Folder</strong>. Enter <strong>src</strong> for the folder name and then click <strong>Finish</strong>.</li>
<li>Right-click the <em>src</em> folder in Project Explorer and select <strong>Import…</strong></li>
<li>In the <em>Import</em> screen, expand the <strong>General</strong> folder and then highlight <strong>File System</strong>. Click <strong>Next</strong>.</li>
<li>In the <em>File System</em> screen, click the <strong>Browse</strong> button next to the <em>From directory</em> field. Navigate to the location containing the Intel MKL-DNN example files, which in our case is <em>/mkl-dnn/examples</em>. Click <strong>OK</strong> at the bottom of the screen.</li>
<li>Back in the <em>File System</em> screen, check the <strong>simple_net.cpp</strong> box and then click <strong>Finish</strong>.</li>
</ul><p>Build the Simple_Net project:</p>
<ul><li>Right-click on the project name <strong>simple_net</strong> in <em>Project Explorer</em>.</li>
<li>Click on <strong>Build Project</strong> and verify no errors are encountered.</li>
</ul><h2>Simple_Net Code Example</h2>
<p>Although it’s not a fully functional deep learning framework, Simple_Net provides the basics of how to build a neural network topology block that consists of convolution, rectified linear unit (ReLU), local response normalization (LRN), and pooling, all in an executable project. A brief step-by-step description of the Intel MKL-DNN C++ API is presented in the <a href="http://01org.github.io/mkl-dnn/" target="_blank" rel="nofollow">documentation</a>; however, the Simple_Net code example provides a more complete walkthrough based on the AlexNet topology. Hence, we will begin by presenting a brief overview of the AlexNet architecture.</p>
<h2>AlexNet Architecture</h2>
<p>As described in the paper <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf" target="_blank" rel="nofollow"><em>ImageNet Classification with Deep Convolutional Neural Networks</em></a>, the AlexNet architecture contains an input image (L0) and eight learned layers (L1 through L8)—five convolutional and three fully-connected. This topology is depicted graphically in Figure 5.</p>
<p style="text-align:center"><img alt=" MIT*)" src="/sites/default/files/managed/63/20/mkl-dnn-part-2-fig-05-alexnet.png" /><br /><strong>Figure 5</strong>. <em>AlexNet topology (credit: <a href="http://vision03.csail.mit.edu/cnn_art/" target="_blank" rel="nofollow">MIT</a>*).</em></p>
<p>Table 1 provides additional details of the AlexNet architecture:</p>
<table><thead><tr><th>
<p align="center"><strong>Layer</strong></p>
</th>
<th>
<p align="center"><strong>Type</strong></p>
</th>
<th>
<p><strong>Description</strong></p>
</th>
</tr></thead><tbody><tr><td>
<p align="center">L0</p>
</td>
<td>
<p align="center">Input image</p>
</td>
<td>
<p>Size: 227 x 227 x 3 (listed as 224 x 224 x 3 in the original paper's diagram)</p>
</td>
</tr><tr><td>
<p align="center">L1</p>
</td>
<td>
<p align="center">Convolution</p>
</td>
<td>
<p>Size: 55* x 55 x 96</p>
<ul><li>96 filters, size 11 × 11</li>
<li>Stride 4</li>
<li>Padding 0</li>
</ul><p>*Size = (N - F)/S + 1 = (227 - 11)/4 + 1 = 55</p>
</td>
</tr><tr><td>
<p align="center">-</p>
</td>
<td>
<p align="center">Max-pooling</p>
</td>
<td>
<p>Size: 27* x 27 x 96</p>
<ul><li>96 filters, size 3 × 3</li>
<li>Stride 2</li>
</ul><p>*Size = (N - F)/S + 1 = (55 – 3)/2 + 1 = 27</p>
</td>
</tr><tr><td>
<p align="center">L2</p>
</td>
<td>
<p align="center">Convolution</p>
</td>
<td>
<p>Size: 27 x 27 x 256</p>
<ul><li>256 filters, size 5 x 5</li>
<li>Stride 1</li>
<li>Padding 2</li>
</ul></td>
</tr><tr><td>
<p align="center">-</p>
</td>
<td>
<p align="center">Max-pooling</p>
</td>
<td>
<p>Size: 13* x 13 x 256</p>
<ul><li>256 filters, size 3 × 3</li>
<li>Stride 2</li>
</ul><p>*Size = (N - F)/S + 1 = (27 - 3)/2 + 1 = 13</p>
</td>
</tr><tr><td>
<p align="center">L3</p>
</td>
<td>
<p align="center">Convolution</p>
</td>
<td>
<p>Size: 13 x 13 x 384</p>
<ul><li>384 filters, size 3 × 3</li>
<li>Stride 1</li>
<li>Padding 1</li>
</ul></td>
</tr><tr><td>
<p align="center">L4</p>
</td>
<td>
<p align="center">Convolution</p>
</td>
<td>
<p>Size: 13 x 13 x 384</p>
<ul><li>384 filters, size 3 × 3</li>
<li>Stride 1</li>
<li>Padding 1</li>
</ul></td>
</tr><tr><td>
<p align="center">L5</p>
</td>
<td>
<p align="center">Convolution</p>
</td>
<td>
<p>Size: 13 x 13 x 256</p>
<ul><li>256 filters, size 3 × 3</li>
<li>Stride 1</li>
<li>Padding 1</li>
</ul></td>
</tr><tr><td>
<p align="center">-</p>
</td>
<td>
<p align="center">Max-pooling</p>
</td>
<td>
<p>Size: 6* x 6 x 256</p>
<ul><li>256 filters, size 3 × 3</li>
<li>Stride 2</li>
</ul><p>*Size = (N - F)/S + 1 = (13 - 3)/2 + 1 = 6</p>
</td>
</tr><tr><td>
<p align="center">L6</p>
</td>
<td>
<p align="center">Fully Connected</p>
</td>
<td>
<p>4096 neurons</p>
</td>
</tr><tr><td>
<p align="center">L7</p>
</td>
<td>
<p align="center">Fully Connected</p>
</td>
<td>
<p>4096 neurons</p>
</td>
</tr><tr><td>
<p align="center">L8</p>
</td>
<td>
<p align="center">Fully Connected</p>
</td>
<td>
<p>1000 neurons</p>
</td>
</tr></tbody></table><p align="center"><strong>Table 1. </strong><em>AlexNet layer descriptions.</em></p>
<p>A detailed description of convolutional neural networks and the AlexNet topology is beyond the scope of this tutorial, but the reader may find the following links useful if more information is required.</p>
<ul><li><a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" target="_blank" rel="nofollow">Wikipedia - Convolutional Neural Networks</a></li>
<li><a href="https://deeplearning4j.org/convolutionalnets.html" target="_blank" rel="nofollow">Introduction to Convolutional Neural Nets</a></li>
<li><a href="https://cs231n.github.io/" target="_blank" rel="nofollow">Convolutional Neural Networks for Visual Recognition</a></li>
<li><a href="https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/" target="_blank" rel="nofollow">An Intuitive Explanation of Convolutional Neural Networks</a></li>
</ul><h2>Simple_Net Code Walkthrough</h2>
<p>The source code presented below is essentially the same as the Simple_Net example contained in the repository, except it has been refactored to use the fully qualified Intel MKL-DNN types to enhance readability. This code implements the first layer (L1) of the topology.</p>
<ol><li>Add include directive for the library header file:
<pre class="brush:cpp;"> #include "mkldnn.hpp"
</pre>
</li>
<li>Initialize the CPU engine as index 0:
<pre class="brush:cpp;"> auto cpu_engine = mkldnn::engine(mkldnn::engine::cpu, 0);
</pre>
</li>
<li>Allocate data and create tensor structures:
<pre class="brush:cpp;"> const uint32_t batch = 256;
std::vector&lt;float&gt; net_src(batch * 3 * 227 * 227);
std::vector&lt;float&gt; net_dst(batch * 96 * 27 * 27);
/* AlexNet: conv
* {batch, 3, 227, 227} (x) {96, 3, 11, 11} -&gt; {batch, 96, 55, 55}
* strides: {4, 4}
*/
mkldnn::memory::dims conv_src_tz = {batch, 3, 227, 227};
mkldnn::memory::dims conv_weights_tz = {96, 3, 11, 11};
mkldnn::memory::dims conv_bias_tz = {96};
mkldnn::memory::dims conv_dst_tz = {batch, 96, 55, 55};
mkldnn::memory::dims conv_strides = {4, 4};
auto conv_padding = {0, 0};
std::vector&lt;float&gt; conv_weights(std::accumulate(conv_weights_tz.begin(),
conv_weights_tz.end(), 1, std::multiplies&lt;uint32_t&gt;()));
std::vector&lt;float&gt; conv_bias(std::accumulate(conv_bias_tz.begin(),
conv_bias_tz.end(), 1, std::multiplies&lt;uint32_t&gt;()));
</pre>
</li>
<li>Create memory for user data:
<pre class="brush:cpp;"> auto conv_user_src_memory = mkldnn::memory({{{conv_src_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::nchw}, cpu_engine}, net_src.data());
auto conv_user_weights_memory = mkldnn::memory({{{conv_weights_tz},
mkldnn::memory::data_type::f32, mkldnn::memory::format::oihw},
cpu_engine}, conv_weights.data());
auto conv_user_bias_memory = mkldnn::memory({{{conv_bias_tz},
mkldnn::memory::data_type::f32, mkldnn::memory::format::x}, cpu_engine},
conv_bias.data());
</pre>
</li>
<li>Create memory descriptors for convolution data using the wildcard <em>any</em> for the convolution data format (this enables the convolution primitive to choose the data format that is most suitable for its input parameters—kernel sizes, strides, padding, and so on):
<pre class="brush:cpp;"> auto conv_src_md = mkldnn::memory::desc({conv_src_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::any);
auto conv_bias_md = mkldnn::memory::desc({conv_bias_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::any);
auto conv_weights_md = mkldnn::memory::desc({conv_weights_tz},
mkldnn::memory::data_type::f32, mkldnn::memory::format::any);
auto conv_dst_md = mkldnn::memory::desc({conv_dst_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::any);
</pre>
</li>
<li>Create a convolution descriptor by specifying the algorithm, propagation kind, shapes of input, weights, bias, output, and convolution strides, padding, and padding kind:
<pre class="brush:cpp;"> auto conv_desc = mkldnn::convolution_forward::desc(mkldnn::prop_kind::forward,
mkldnn::convolution_direct, conv_src_md, conv_weights_md, conv_bias_md,
conv_dst_md, conv_strides, conv_padding, conv_padding,
mkldnn::padding_kind::zero);
</pre>
</li>
<li>Create a descriptor of the convolution primitive. Once created, this descriptor has specific formats instead of any wildcard formats specified in the convolution descriptor:
<pre class="brush:cpp;"> auto conv_prim_desc =
mkldnn::convolution_forward::primitive_desc(conv_desc, cpu_engine);
</pre>
</li>
<li>Create a vector of primitives that represents the net:
<pre class="brush:cpp;"> std::vector&lt;mkldnn::primitive&gt; net;
</pre>
</li>
<li>Create reorders between user and data if it is needed and add it to net before convolution:
<pre class="brush:cpp;"> auto conv_src_memory = conv_user_src_memory;
if (mkldnn::memory::primitive_desc(conv_prim_desc.src_primitive_desc()) !=
conv_user_src_memory.get_primitive_desc()) {
conv_src_memory = mkldnn::memory(conv_prim_desc.src_primitive_desc());
net.push_back(mkldnn::reorder(conv_user_src_memory, conv_src_memory));
}
auto conv_weights_memory = conv_user_weights_memory;
if (mkldnn::memory::primitive_desc(conv_prim_desc.weights_primitive_desc()) !=
conv_user_weights_memory.get_primitive_desc()) {
conv_weights_memory =
mkldnn::memory(conv_prim_desc.weights_primitive_desc());
net.push_back(mkldnn::reorder(conv_user_weights_memory,
conv_weights_memory));
}
auto conv_dst_memory = mkldnn::memory(conv_prim_desc.dst_primitive_desc());
</pre>
</li>
<li>Create convolution primitive and add it to net:
<pre class="brush:cpp;"> net.push_back(mkldnn::convolution_forward(conv_prim_desc, conv_src_memory,
conv_weights_memory, conv_user_bias_memory, conv_dst_memory));
</pre>
</li>
<li>Create a ReLU primitive and add it to net:
<pre class="brush:cpp;"> /* AlexNet: relu
* {batch, 96, 55, 55} -&gt; {batch, 96, 55, 55}
*/
const double negative_slope = 1.0;
auto relu_dst_memory = mkldnn::memory(conv_prim_desc.dst_primitive_desc());
auto relu_desc = mkldnn::relu_forward::desc(mkldnn::prop_kind::forward,
conv_prim_desc.dst_primitive_desc().desc(), negative_slope);
auto relu_prim_desc = mkldnn::relu_forward::primitive_desc(relu_desc, cpu_engine);
net.push_back(mkldnn::relu_forward(relu_prim_desc, conv_dst_memory,
relu_dst_memory));
</pre>
</li>
<li>Create an AlexNet LRN primitive:
<pre class="brush:cpp;"> /* AlexNet: lrn
* {batch, 96, 55, 55} -&gt; {batch, 96, 55, 55}
* local size: 5
* alpha: 0.0001
* beta: 0.75
*/
const uint32_t local_size = 5;
const double alpha = 0.0001;
const double beta = 0.75;
auto lrn_dst_memory = mkldnn::memory(relu_dst_memory.get_primitive_desc());
/* create lrn scratch memory from lrn src */
auto lrn_scratch_memory = mkldnn::memory(lrn_dst_memory.get_primitive_desc());
/* create lrn primitive and add it to net */
auto lrn_desc = mkldnn::lrn_forward::desc(mkldnn::prop_kind::forward,
mkldnn::lrn_across_channels,
conv_prim_desc.dst_primitive_desc().desc(), local_size,
alpha, beta);
auto lrn_prim_desc = mkldnn::lrn_forward::primitive_desc(lrn_desc, cpu_engine);
net.push_back(mkldnn::lrn_forward(lrn_prim_desc, relu_dst_memory,
lrn_scratch_memory, lrn_dst_memory));
</pre>
</li>
<li>Create an AlexNet pooling primitive:
<pre class="brush:cpp;"> /* AlexNet: pool
* {batch, 96, 55, 55} -&gt; {batch, 96, 27, 27}
* kernel: {3, 3}
* strides: {2, 2}
*/
mkldnn::memory::dims pool_dst_tz = {batch, 96, 27, 27};
mkldnn::memory::dims pool_kernel = {3, 3};
mkldnn::memory::dims pool_strides = {2, 2};
auto pool_padding = {0, 0};
auto pool_user_dst_memory = mkldnn::memory({{{pool_dst_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::nchw}, cpu_engine}, net_dst.data());
auto pool_dst_md = mkldnn::memory::desc({pool_dst_tz},
mkldnn::memory::data_type::f32,
mkldnn::memory::format::any);
auto pool_desc = mkldnn::pooling_forward::desc(mkldnn::prop_kind::forward,
mkldnn::pooling_max, lrn_dst_memory.get_primitive_desc().desc(), pool_dst_md, pool_strides, pool_kernel, pool_padding, pool_padding,mkldnn::padding_kind::zero);
auto pool_pd = mkldnn::pooling_forward::primitive_desc(pool_desc, cpu_engine);
auto pool_dst_memory = pool_user_dst_memory;
if (mkldnn::memory::primitive_desc(pool_pd.dst_primitive_desc()) !=
pool_user_dst_memory.get_primitive_desc()) {
pool_dst_memory = mkldnn::memory(pool_pd.dst_primitive_desc());
}
</pre>
</li>
<li>Create pooling indices memory from pooling dst:
<pre class="brush:cpp;"> auto pool_indices_memory =
mkldnn::memory(pool_dst_memory.get_primitive_desc());
</pre>
</li>
<li>Create pooling primitive and add it to net:
<pre class="brush:cpp;"> net.push_back(mkldnn::pooling_forward(pool_pd, lrn_dst_memory,
pool_indices_memory, pool_dst_memory));
</pre>
</li>
<li>Create reorder between internal and user data if it is needed and add it to net after pooling:
<pre class="brush:cpp;"> if (pool_dst_memory != pool_user_dst_memory) {
net.push_back(mkldnn::reorder(pool_dst_memory, pool_user_dst_memory));
}
</pre>
</li>
<li>Create a stream, submit all the primitives, and wait for completion:
<pre class="brush:cpp;"> mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();
</pre>
</li>
<li>The code described above is contained in the <em>simple_net()</em> function, which is called in <em>main</em> with exception handling:
<pre class="brush:cpp;"> int main(int argc, char **argv) {
try {
simple_net();
}
catch(mkldnn::error&amp; e) {
std::cerr &lt;&lt; "status: " &lt;&lt; e.status &lt;&lt; std::endl;
std::cerr &lt;&lt; "message: " &lt;&lt; e.message &lt;&lt; std::endl;
}
return 0;
}
</pre>
</li>
</ol><h2>Conclusion</h2>
<p><a href="https://software.intel.com/en-us/articles/intel-mkl-dnn-part-1-library-overview-and-installation">Part 1</a> of this tutorial series identified several resources for learning about the technical preview of Intel MKL-DNN. Detailed instructions on how to install and build the library components were also provided. In this paper (Part 2 of the tutorial series), information on how to configure the Eclipse integrated development environment to build the C++ code sample was provided, along with a code walkthrough based on the AlexNet deep learning topology. Stay tuned as Intel MKL-DNN approaches production release.</p>
Tue, 14 Mar 2017 14:28:11 -0700Bryan B. (Intel)720743Intel® Math Kernel Library for Deep Neural Networks: Part 1–Overview and Installationhttps://software.intel.com/en-us/articles/intel-mkl-dnn-part-1-library-overview-and-installation
<h2>Introduction</h2>
<p>Deep learning is one of the hottest subjects in the field of computer science these days, fueled by the convergence of massive datasets, highly parallel processing power, and the drive to build increasingly intelligent devices. Deep learning is described by <a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank" rel="nofollow">Wikipedia</a> as a subset of machine learning (ML), consisting of algorithms that model high-level abstractions in data. As depicted in Figure 1, ML is itself a subset of artificial intelligence (AI), a broad field of study in the development of computer systems that attempt to emulate human intelligence.</p>
<p style="text-align:center"><img alt="Relationship of deep learning to Artificial Intelligence" src="/sites/default/files/managed/5a/30/mkl-dnn-part-1-fig-01-relationship.png" /><br /><strong>Figure 1. </strong><em>Relationship of deep learning to AI.</em></p>
<p>Intel has been actively involved in the area of deep learning though the optimization of popular frameworks like Caffe* and Theano* to take full advantage of Intel® architecture (IA), the creation of high-level tools like the <a href="/en-us/deep-learning-sdk" rel="nofollow">Intel® Deep Learning SDK</a> for data scientists, and providing software libraries to the developer community like <a href="/en-us/intel-daal" rel="nofollow">Intel® Data Analytics Acceleration Library (Intel® DAAL)</a> and <a href="https://github.com/01org/mkl-dnn" target="_blank" rel="nofollow">Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)</a>.</p>
<p>Intel MKL-DNN is an open source, performance-enhancing library for accelerating deep learning frameworks on IA. Software developers who are interested in the subject of deep learning may have heard of Intel MKL-DNN, but perhaps haven’t had the opportunity to explore it firsthand.</p>
<p>The Developer's Introduction to Intel MKL-DNN tutorial series examines Intel MKL-DNN from a developer’s perspective. Part 1 identifies informative resources and gives detailed instructions on how to install and build the library components. <a href="https://software.intel.com/en-us/articles/intel-mkl-dnn-part-2-sample-code-build-and-walkthrough">Part 2</a> of the tutorial series provides information on how to configure the Eclipse* integrated development environment to build the C++ code sample, and also includes a source code walkthrough.</p>
<h2>Intel® MKL-DNN Overview</h2>
<p>As depicted in Figure 2, Intel MKL-DNN is intended for accelerating deep learning frameworks on IA. It includes highly vectorized and threaded building blocks for implementing convolutional neural networks with C and C++ interfaces.</p>
<p style="text-align:center"><img alt="Deep learning framework on IA" src="/sites/default/files/managed/5a/30/mkl-dnn-part-1-fig-02-deep-learning.png" /><br /><strong>Figure 2.</strong> <em>Deep learning framework on IA.</em></p>
<p>Intel MKL-DNN operates on these main object types: primitive, engine, and stream. These objects are defined in the library <a href="http://01org.github.io/mkl-dnn/" target="_blank" rel="nofollow">documentation</a> as follows:</p>
<ul><li><strong>Primitive</strong> - any operation, including convolution, data format reorder, and memory. Primitives can have other primitives as inputs, but can have only memory primitives as outputs.</li>
<li><strong>Engine</strong> - an execution device, for example, <em>CPU</em>. Every primitive is mapped to a specific engine.</li>
<li><strong>Stream</strong> - an execution context; you submit primitives to a stream and wait for their completion. Primitives submitted to a stream may have different engines. Stream objects also track dependencies between the primitives.</li>
</ul><p>A typical workflow is to create a set of primitives, push them to a stream for processing, and then wait for completion. Additional information on the programming model is provided in the <a href="http://01org.github.io/mkl-dnn/" target="_blank" rel="nofollow">Intel MKL-DNN documentation</a>.</p>
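<p>As a rough sketch of that workflow (a condensed fragment based on the Simple_Net example walked through in Part 2, not a complete program), the three object types come together like this:</p>
<pre class="brush:cpp;">#include &lt;vector&gt;
#include "mkldnn.hpp"

/* Condensed sketch of the primitive/engine/stream workflow; the individual
 * primitives are omitted here (see Part 2 for the full Simple_Net walkthrough). */
void workflow_sketch()
{
    /* Engine: the execution device, here CPU index 0. */
    auto cpu_engine = mkldnn::engine(mkldnn::engine::cpu, 0);

    /* Primitives: memory and operations created against the engine. */
    std::vector&lt;mkldnn::primitive&gt; net;
    /* ... create convolution, relu, lrn, pooling, and reorder primitives
     *     bound to cpu_engine and push_back each of them into 'net' ... */

    /* Stream: submit the primitives and wait for completion. */
    mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();
}</pre>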
<h2>Resources</h2>
<p>There are a number of informative resources available on the web that describe what Intel MKL-DNN is, what it is not, and what a developer can expect to achieve by integrating the library with his or her deep learning project.</p>
<h3>GitHub Repository</h3>
<p>Intel MKL-DNN is an open source library available to download for free on <a href="https://github.com/01org/mkl-dnn" target="_blank" rel="nofollow">GitHub</a>*, where it is described as a performance library for DL applications that includes the building blocks for implementing convolutional neural networks (CNN) with C and C++ interfaces.</p>
<p>An important thing to note on the GitHub site is that although the Intel MKL-DNN library includes functionality similar to <a href="/en-us/intel-mkl" rel="nofollow">Intel® Math Kernel Library (Intel® MKL) 2017</a>, <strong>it is not API compatible</strong>. At the time of this writing the Intel MKL-DNN release is a technical preview, implementing the functionality required to accelerate image recognition topologies like AlexNet* and VGG*.</p>
<h3>Intel Open Source Technology Center</h3>
<p>The <a href="https://01.org/mkl-dnn" target="_blank" rel="nofollow">MKL-DNN|01.org</a> project microsite is a member of the Intel Open Source Technology Center known as 01.org, a community supported by Intel engineers who participate in a variety of open source projects. Here you will find an overview of the Intel MKL-DNN project, information on how to get involved and contribute to its evolution, and an informative blog entitled <a href="https://01.org/group/4217/blogs" target="_blank" rel="nofollow"><em>Introducing the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)</em></a> by Kent Moffat.</p>
<h2>Installing Intel MKL-DNN</h2>
<p>This section elaborates on the installation information presented on the <a href="https://github.com/01org/mkl-dnn" target="_blank" rel="nofollow">GitHub repository</a> site by providing detailed, step-by-step instructions for installing and building the Intel MKL-DNN library components. The computer you use will require an Intel® processor supporting Intel® Advanced Vector Extensions 2 (Intel® AVX2). Specifically, Intel MKL-DNN is optimized for Intel® Xeon® processors and Intel® Xeon Phi™ processors.</p>
<p>GitHub indicates the software was validated on RedHat* Enterprise Linux* 7; however, the information presented in this tutorial was developed on a system running <a href="http://releases.ubuntu.com/16.04/" target="_blank" rel="nofollow">Ubuntu* 16.04</a>.</p>
<h3>Install Dependencies</h3>
<p>Intel MKL-DNN has the following dependencies:</p>
<ul><li><a href="https://cmake.org" target="_blank" rel="nofollow">CMake</a>* – a cross-platform tool used to build, test, and package software.</li>
<li><a href="http://www.stack.nl/~dimitri/doxygen/" target="_blank" rel="nofollow">Doxygen</a>* – a tool for generating documentation from annotated source code.</li>
</ul><p>If these software tools are not already set up on your computer you can install them by typing the following:</p>
<p><code>sudo apt install cmake</code></p>
<p><code>sudo apt install doxygen</code></p>
<h3>Download and Build the Source Code</h3>
<p>Clone the Intel MKL-DNN library from the GitHub repository by opening a terminal and typing the following command:</p>
<p><code>git clone https://github.com/01org/mkl-dnn.git</code></p>
<p>Note: if Git* is not already set up on your computer you can install it by typing the following:</p>
<p><code>sudo apt install git</code></p>
<p>Once the installation has completed you will find a directory named <em>mkl-dnn</em> in the Home directory. Navigate to this directory by typing:</p>
<p><code>cd mkl-dnn</code></p>
<p>As explained on the GitHub repository site, Intel MKL-DNN uses the optimized general matrix to matrix multiplication (GEMM) function from Intel MKL. The library supporting this function is also included in the repository and can be downloaded by running the <em>prepare_mkl.sh</em> script located in the <em>scripts</em> directory:</p>
<p><code>cd scripts &amp;&amp; ./prepare_mkl.sh &amp;&amp; cd ..</code></p>
<p>This script creates a directory named <em>external</em> and then downloads and extracts the library files to a directory named <em>mkl-dnn/external/mklml_lnx*</em>.</p>
<p>The next command is executed from the <em>mkl-dnn</em> directory; it creates a subdirectory named <em>build</em> and then runs <em>CMake</em> and <em>Make</em> to generate the build system:</p>
<p><code>mkdir -p build &amp;&amp; cd build &amp;&amp; cmake .. &amp;&amp; make</code></p>
<h3>Validating the Build</h3>
<p>To validate your build, execute the following command from the <em>mkl-dnn/build</em> directory:</p>
<p><code>make test</code></p>
<p>This step executes a series of unit tests to validate the build. All of these tests should indicate <em>Passed</em>, along with the processing time, as shown in Figure 3.</p>
<p style="text-align:center"><img alt="Execute a series of unit tests to validate the build" src="/sites/default/files/managed/5a/30/mkl-dnn-part-1-fig-03-test-results.png" /><br /><strong>Figure 3.</strong> <em>Test results.</em></p>
<h3>Library Documentation</h3>
<p>Documentation for Intel MKL-DNN is <a href="http://01org.github.io/mkl-dnn/" target="_blank" rel="nofollow">available online</a>. This documentation can also be generated locally on your system by executing the following command from the <em>mkl-dnn/build</em> directory:</p>
<p><code>make doc</code></p>
<h3>Finalize the Installation</h3>
<p>Finalize the installation of Intel MKL-DNN by executing the following command from the <em>mkl-dnn/build</em> directory:</p>
<p><code>sudo make install</code></p>
<p>This command installs the libraries and other components required to develop Intel MKL-DNN enabled applications under the <em>/usr/local</em> directory:</p>
<p>Shared libraries (/<em>usr/local/lib</em>):</p>
<ul><li>libiomp5.so</li>
<li>libmkldnn.so</li>
<li>libmklml_intel.so</li>
</ul><p>Header files (/<em>usr/local/include</em>):</p>
<ul><li>mkldnn.h</li>
<li>mkldnn.hpp</li>
<li>mkldnn_types.h</li>
</ul><p>Documentation (/<em>usr/local/share/doc/mkldnn</em>):</p>
<ul><li>Intel license and copyright notice</li>
<li>Various files that make up the HTML documentation (under <em>/reference/html</em>)</li>
</ul><h2>Building the Code Examples on the Command Line</h2>
<p>The GitHub repository contains C and C++ code examples that demonstrate how to build a neural network topology block that consists of convolution, rectified linear unit, local response normalization, and pooling. The following section describes how to build these code examples from the command line in Linux. Part 2 of the tutorial series demonstrates how to configure the Eclipse integrated development environment for building and extending the C++ code example.</p>
<h3>C++ Example Command-Line Build (G++)</h3>
<p>To build the C++ example program (simple_net.cpp) included in the Intel MKL-DNN repository, first go to the <em>examples</em> directory:</p>
<p><code>cd ~/mkl-dnn/examples</code></p>
<p>Next, create a destination directory for the executable:</p>
<p><code>mkdir -p bin</code></p>
<p>Build the simple_net.cpp example by linking the shared Intel MKL-DNN library and specifying the output directory as follows:</p>
<p><code>g++ -std=c++11 simple_net.cpp -lmkldnn -o bin/simple_net_cpp</code></p>
<p style="text-align:center"><img alt="C++ command-line build using G++" src="/sites/default/files/managed/5a/30/mkl-dnn-part-1-fig-04-c-plus-plus.png" /><br /><strong>Figure 4.</strong> <em>C++ command-line build using G++.</em></p>
<p>Go to the <em>bin</em> directory and run the executable:</p>
<p><code>cd bin</code></p>
<p><code>./simple_net_cpp</code></p>
<h3>C Example Command-Line Build Using GCC</h3>
<p>To build the C example application (simple_net.c) included in the Intel MKL-DNN repository, first go to the <em>examples</em> directory:</p>
<p><code>cd ~/mkl-dnn/examples</code></p>
<p>Next, create a destination directory for the executable:</p>
<p><code>mkdir -p bin</code></p>
<p>Build the simple_net.c example by linking the Intel MKL-DNN shared library and specifying the output directory as follows:</p>
<p><code>gcc -Wall -o bin/simple_net_c simple_net.c -lmkldnn</code></p>
<p style="text-align:center"><img alt="C command-line build using GCC" src="/sites/default/files/managed/5a/30/mkl-dnn-part-1-fig-05-c.png" /><br /><strong>Figure 5.</strong> <em>C command-line build using GCC.</em></p>
<p>Go to the <em>bin</em> directory and run the executable:</p>
<p><code>cd bin</code></p>
<p><code>./simple_net_c</code></p>
<p>Once completed, the C application will print either <em>passed</em> or <em>failed</em> to the terminal.</p>
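<p>For scripted runs it can be handy to turn that console output into a check. A minimal sketch (the log file name is arbitrary):</p>
<pre class="brush:plain;">./simple_net_c | tee simple_net_c.log
grep -q passed simple_net_c.log &amp;&amp; echo "example OK" || echo "example FAILED"</pre>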
<h2>Next Steps</h2>
<p>At this point you should have successfully installed the Intel MKL-DNN library, executed the unit tests, and built the example programs provided in the repository. In <a href="https://software.intel.com/en-us/articles/intel-mkl-dnn-part-2-sample-code-build-and-walkthrough">Part 2</a> of the Developer's Introduction to Intel MKL-DNN you’ll learn how to configure the Eclipse integrated development environment to build the C++ code sample along with a walkthrough of the code.</p>
Mon, 13 Mar 2017 15:56:59 -0700Bryan B. (Intel)720659Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processorshttps://software.intel.com/en-us/articles/recipe-building-and-running-milc-on-intel-processors
<h3>Introduction</h3>
<p>MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “<em>Strong interactions</em>” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute cycle users at many U.S. and European supercomputing centers.</p>
<p>This article provides code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The “ks_imp_rhmc” application is a dynamical RHMC (rational hybrid Monte Carlo) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.</p>
<p>Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.</p>
<p>The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.</p>
<h3>Code Access</h3>
<p>The MILC Software and QPhiX library are primarily required. The MILC software can be downloaded from GitHub here: <a href="https://github.com/milc-qcd/milc_qcd" target="_blank" rel="nofollow">https://github.com/milc-qcd/milc_qcd</a>. Download the <em>master </em>branch. QPhiX support is integrated into this branch for CG solvers.</p>
<p>The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from <a href="https://github.com/jeffersonlab/qphix.git" target="_blank" rel="nofollow">https://github.com/jeffersonlab/qphix.git</a> and <a href="https://github.com/jeffersonlab/qphix-codegen.git" target="_blank" rel="nofollow">https://github.com/jeffersonlab/qphix-codegen.git</a>, respectively. For the most up-to-date version, we suggest you use the <em>devel</em> branch of QPhiX. The MILC-specific version of QPhiX is currently not open source; please contact the MILC collaboration group for access to the QPhiX (MILC) branch.</p>
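<p>As a sketch, the public downloads described above might look like the following (branch names are as given above; the QPhiX (MILC) branch itself is not shown because it must be obtained from the MILC collaboration):</p>
<pre class="brush:plain;"># MILC software, master branch (QPhiX support for the CG solver is integrated here)
git clone -b master https://github.com/milc-qcd/milc_qcd.git

# Public QPhiX library (devel branch suggested above) and code generator
git clone -b devel https://github.com/jeffersonlab/qphix.git
git clone https://github.com/jeffersonlab/qphix-codegen.git</pre>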
<h3>Build Directions</h3>
<h4><strong>Compile the QPhiX Library</strong></h4>
<p>Users need to build QPhiX first before building the MILC package.</p>
<p>The QPhiX library is delivered as two tar files, <em>mbench*.tar</em> and <em>qphix-codegen*.tar</em>.</p>
<p>Untar both files.</p>
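<p>For example (assuming both tar files are in the current directory):</p>
<pre class="brush:plain;">tar xf mbench*.tar
tar xf qphix-codegen*.tar</pre>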
<h4><strong>Build qphix-codegen</strong></h4>
<p>The files with intrinsics for QPhiX are built in the <em>qphix-codegen</em> directory.</p>
<p>Enter the qphix-codegen directory.</p>
<p>Edit line #3 in <em>“Makefile_xyzt”</em> to enable the “milc=1” variable.</p>
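<p>This can be done in an editor or scripted; a minimal sketch, assuming the <em>milc</em> variable really does sit alone on line 3 of the file (adjust if your copy differs):</p>
<pre class="brush:plain;"># replace line 3 of Makefile_xyzt with "milc=1"
sed -i '3s/.*/milc=1/' Makefile_xyzt</pre>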
<p>Compile as:</p>
<pre class="brush:plain;">source /opt/intel/compiler/&lt;version&gt;/bin/compilervars.sh intel64
source /opt/intel/impi/&lt;version&gt;/mpi/intel64/bin/mpivars.sh
make –f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processor]
make –f Makefile_xyzt avx2 -- [for Intel® Xeon® v3 Processors /Intel® Xeon® v4 Processor]</pre>
<h4><strong>Build mbench</strong></h4>
<p>The mbench code is part of the QPhiX library; see the <em>Code Access</em> section above for how to obtain the QPhiX (MILC) branch.</p>
<p>Enter the mbench directory.</p>
<p>Edit line #3 in <em>“Makefile_qphixlib”</em>: set <em>“mode=mic”</em> to compile with Intel® AVX-512 for the Intel® Xeon Phi™ processor, or <em>“mode=avx”</em> to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® processors.</p>
<p>Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.</p>
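<p>Before compiling, it is worth confirming that both settings took effect; a quick check (variable names as given above, exact spacing in your copy may differ):</p>
<pre class="brush:plain;">grep -n -E 'mode *=|ENABLE_MPI' Makefile_qphixlib
# expected, per the instructions above:
#   3:mode=mic          (or mode=avx for Intel® Xeon® processors)
#   13:ENABLE_MPI = 1</pre>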
<p>Compile as:</p>
<pre class="brush:plain;">make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]</pre>
<h4><strong>Compile MILC Code</strong></h4>
<p>Install/download the master branch from the above GitHub location.</p>
<p>Download the Makefile.qphix file from the following location:</p>
<p><a href="http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/" target="_blank" rel="nofollow">http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/</a>.</p>
<p>Copy Makefile.qphix to the corresponding application directory; in this case, copy it to the “ks_imp_rhmc” application directory and rename it Makefile.</p>
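<p>A sketch of these two steps on the command line (the file URL is assumed to be the directory listed above plus the file name, and the clone location <em>~/milc_qcd</em> is only an example):</p>
<pre class="brush:plain;"># fetch the prepared makefile and drop it into the application directory
wget http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/Makefile.qphix
cp Makefile.qphix ~/milc_qcd/ks_imp_rhmc/Makefile</pre>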
<p>Make the following changes to the Makefile:</p>
<ul><li>On line #17 - Add/uncomment the appropriate ARCH variable:
<ul><li>For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).</li>
<li>For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).</li>
</ul></li>
<li>On line #28 - Change MPP variable to “true” if you want MPI.</li>
<li>On line #34 - Pick the PRECISION you want:
<ul><li>1 = Single, 2 = Double. We use Double for our runs.</li>
</ul></li>
<li>Starting at line #37 - The compiler setup begins:
<ul><li>This should work as-is if the directions above were followed; if not, customize starting at line #40.</li>
</ul></li>
<li>On line #124 - Setup of Intel compiler starts:
<ul><li>Based on ARCH it will use the appropriate flags.</li>
</ul></li>
<li>On line #395 - QPhiX customization starts:
<ul><li>On line #399 - Set QPHIX_HOME to the correct QPhiX path (path to the mbench directory).</li>
<li>The appropriate QPhiX FLAGS will be set if the above is defined correctly.</li>
</ul></li>
</ul><p>Enter the ks_imp_rhmc directory; the Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library, then compile as follows:</p>
<pre class="brush:plain;">make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary</pre>
<p>Compile the above binaries for both the Intel® Xeon Phi™ processor and the Intel® Xeon® processor (editing the Makefile accordingly for each architecture). For reference, a sketch of the key settings appears below.</p>
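<p>After the edits described above, the relevant Makefile settings for an Intel Xeon Phi processor build might look like the following (line numbers and exact variable spellings follow the step list above and may differ between MILC versions):</p>
<pre class="brush:plain;">ARCH = knl                           # line #17: use bdw for Intel® Xeon® processors
MPP = true                           # line #28: enable MPI
PRECISION = 2                        # line #34: 1 = single, 2 = double
QPHIX_HOME = /path/to/qphix/mbench   # line #399: path to the mbench directory</pre>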
<h3>Run Directions</h3>
<h4>Input Files</h4>
<p>There are two required input files, params.rest and rat.m013m065m838.</p>
<p>They can be downloaded from here:</p>
<p><a href="http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/" target="_blank" rel="nofollow">http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/</a>.</p>
<p>The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.</p>
<p>In addition, a params.&lt;lattice-size&gt; file with the required lattice size will be created at runtime. This file essentially consists of the lattice size (nx * ny * nz * nt) to run, with params.rest appended to it.</p>
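<p>For example, for the 24 x 24 x 24 x 60 single-node case used below, the generated params.24x24x24x60 file begins with the following lines, with the contents of params.rest appended after them (format taken from the file-creation step in the run directions):</p>
<pre class="brush:plain;">prompt 0
nx 24
ny 24
nz 24
nt 60</pre>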
<h4>The Lattice Sizes</h4>
<p>The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.</p>
<p>As an example, consider a problem as (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem a user would begin by multiplying <strong><em>nt</em></strong> by 2, then <strong><em>nz</em></strong> by 2, then <strong><em>ny</em></strong> by 2, then <strong><em>nx</em></strong> by 2 and so on, such that all variables get sized accordingly in a round-robin fashion.</p>
<p>This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64. To keep the elements/rank constant (weak scaling) at a 128-rank count, first multiply <strong><em>nt</em></strong> by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply <strong><em>nt</em></strong> by 2, <strong><em>nz</em></strong> by 2, and <strong><em>ny</em></strong> by 2 from the original problem size to keep the same elements/rank.</p>
<table align="center" border="1"><tbody><tr><td><strong>Ranks</strong></td>
<td style="text-align:right"><strong>64</strong></td>
<td style="text-align:right"><strong>128</strong></td>
<td style="text-align:right"><strong>256</strong></td>
<td style="text-align:right"><strong>512</strong></td>
</tr><tr><td>nx</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
</tr><tr><td>ny</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
<td style="text-align:right"><strong>64</strong></td>
</tr><tr><td>nz</td>
<td style="text-align:right">32</td>
<td style="text-align:right">32</td>
<td style="text-align:right"><strong>64</strong></td>
<td style="text-align:right">64</td>
</tr><tr><td>nt</td>
<td style="text-align:right">64</td>
<td style="text-align:right"><strong>128</strong></td>
<td style="text-align:right">128</td>
<td style="text-align:right">128</td>
</tr><tr><td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr><tr><td>Total Elements</td>
<td style="text-align:right">2097152</td>
<td style="text-align:right">4194304</td>
<td style="text-align:right">8388608</td>
<td style="text-align:right">16777216</td>
</tr><tr><td>Multiplier</td>
<td style="text-align:right">1</td>
<td style="text-align:right">2</td>
<td style="text-align:right">4</td>
<td style="text-align:right">8</td>
</tr><tr><td>Elements/Rank</td>
<td style="text-align:right">32768</td>
<td style="text-align:right">32768</td>
<td style="text-align:right">32768</td>
<td style="text-align:right">32768</td>
</tr></tbody></table><p align="center"><strong>Table:</strong><em> Illustrates Weak Scaling of Lattice Sizes</em></p>
<h4>Running with MPI x OpenMP*</h4>
<p>The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points and the gluon field has values on each of the links connecting nearest-neighbors of the lattice sites. </p>
<p>The lattice is divided into equal subvolumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.</p>
<p>Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks with neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the thread-level calculation are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.</p>
<h4>Running the Test Cases</h4>
<ol><li>Create a “run” directory in the top-level directory and add the input files obtained from above.</li>
<li>cd &lt;milc&gt;/run<br />
Note: Run the appropriate binary for each architecture.</li>
<li>Create the lattice volume:
<pre class="brush:plain;">cat &lt;&lt; EOF &gt; params.$nx*$ny*$nz*$nt
prompt 0
nx $nx
ny $ny
nz $nz
nt $nt
EOF
cat params.rest &gt;&gt; params.${nx}x${ny}x${nz}x${nt}</pre>
<p>For this performance recipe, we evaluate single-node and multinode (16 nodes) performance with the following weak-scaled lattice volumes:</p>
<p>Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60</p>
<p>Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120</p>
</li>
<li>Run on Intel Xeon processor (E5-2697v4).<br />
Source the latest Intel compilers and Intel MPI Library
<ul><li>Intel® Parallel Studio 2017 and above recommended</li>
</ul><p><strong>Single Node:</strong></p>
<p><code>mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.bdw &lt; params.24x24x24x60</code></p>
<p><strong>Multinode</strong> (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):</p>
<pre class="brush:plain;"># Create a runScript (run-bdw) #
&lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.bdw &lt; params.48x48x48x120
#Intel® OPA fabric-related environment variables#
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm2
export PSM2_IDENTIFY=1
export I_MPI_FALLBACK=0
#Create nodeconfig.txt with the following#
-host &lt;hostname1&gt; -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 &lt;path-to&gt;/run-bdw
…..
…..
…..
-host &lt;hostname16&gt; -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 &lt;path-to&gt;/run-bdw
#mpirun command#
mpiexec.hydra -configfile nodeconfig.txt</pre>
</li>
<li>Run on Intel Xeon Phi processor (7250).<br />
Source Intel compilers and Intel MPI Library
<ul><li>Intel® Parallel Studio 2017 and above recommended</li>
</ul><p><strong>Single Node:</strong></p>
<p><code>mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.knl &lt; params.24x24x24x60</code></p>
<p><strong>Multinode</strong> (16 nodes, via Intel OP HFI):</p>
<pre class="brush:plain;"># Create a runScript (run-knl) #
numactl -p 1 &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.knl &lt; params.48x48x48x120
#Intel OPA fabric-related environment variables#
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm2
export PSM2_IDENTIFY=1
export I_MPI_FALLBACK=0
#Create nodeconfig.txt with the following#
-host &lt;hostname1&gt; -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 &lt;path-to&gt;/run-knl
…..
…..
…..
-host &lt;hostname16&gt; -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 &lt;path-to&gt;/run-knl
#mpirun command#
mpiexec.hydra -configfile nodeconfig.txt</pre>
</li>
</ol><h3>Performance Results and Optimizations</h3>
<p>The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).</p>
<p>The performance chart below shows the speedup with respect to a 2S Intel® Xeon® processor E5-2697 v4, based on the total run time.</p>
<p align="center"><img alt=" Speedup w.r.t 2S Intel® Xeon® processor E5-2697v4" src="https://software.intel.com/sites/default/files/managed/0b/b6/performance-results-optimizations.png" /></p>
<p>The optimizations in the QPhiX library include data layout changes that target vectorization and the generation of packed, aligned loads/stores; cache blocking; load balancing; and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics where necessary. See the <em>References and Resources</em> section for details.</p>
<h3>Testing Platform Configurations</h3>
<p>The following hardware was used for the above recipe and performance testing.</p>
<table border="1"><thead><tr><th><strong>Processor</strong></th>
<th><strong>Intel® Xeon® Processor E5-2697 v4</strong></th>
<th><strong>Intel® Xeon Phi™ </strong><strong>Processor 7250F</strong></th>
</tr></thead><tbody><tr><td>Sockets / TDP</td>
<td>2S / 290W</td>
<td>1S / 215W</td>
</tr><tr><td>Frequency / Cores / Threads</td>
<td>2.3 GHz / 36 / 72</td>
<td>1.4 GHz / 68 / 272</td>
</tr><tr><td>DDR4</td>
<td>8x16 GB 2400 MHz</td>
<td>6x16 GB 2400 MHz</td>
</tr><tr><td>MCDRAM</td>
<td>N/A</td>
<td>16 GB Flat</td>
</tr><tr><td>Cluster/Snoop Mode</td>
<td>Home</td>
<td>Quadrant</td>
</tr><tr><td>Memory Mode</td>
<td> </td>
<td>Flat</td>
</tr><tr><td>Turbo</td>
<td>OFF</td>
<td>OFF</td>
</tr><tr><td>BIOS</td>
<td>SE5C610.86B.01.01.0016.033<br />
120161139</td>
<td>GVPRCRB1.86B.0010.R02.1<br />
606082342</td>
</tr><tr><td>Operating System</td>
<td>Oracle Linux* 7.2<br />
(3.10.0-229.20.1.el6.x86_64)</td>
<td>Oracle Linux* 7.2<br />
(3.10.0-229.20.1.el6.x86_64)</td>
</tr></tbody></table><h3>MILC Build Configurations</h3>
<p>The following configurations were used for the above recipe and performance testing.</p>
<table border="1"><tbody><tr><td>MILC Version</td>
<td>Master version as of 28 January 2017</td>
</tr><tr><td>Intel® Compiler Version</td>
<td>2017.1.132</td>
</tr><tr><td>Intel® MPI Library Version</td>
<td>2017.0.098</td>
</tr><tr><td>MILC Makefiles Used</td>
<td>Makefile.qphix, Makefile_qphixlib, Makefile</td>
</tr></tbody></table><h3>References and Resources</h3>
<ol><li>MIMD Lattice Computation (MILC) Collaboration: <a href="http://physics.indiana.edu/~sg/milc.html" target="_blank" rel="nofollow">http://physics.indiana.edu/~sg/milc.html</a></li>
<li>QPhiX Case Study: <a href="http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/" target="_blank" rel="nofollow">http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/</a></li>
<li>MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: <a href="https://anl.app.box.com/v/IXPUG2016-presentation-10" target="_blank" rel="nofollow">https://anl.app.box.com/v/IXPUG2016-presentation-10</a></li>
</ol>Tue, 07 Feb 2017 11:42:25 -0800Smahane D. (Intel)711301