Intel Developer Zone Articles
https://software.intel.com/en-us/articles/20800
Article Feed (en)

Introducing Batch GEMM Operations
https://software.intel.com/en-us/articles/introducing-batch-gemm-operations
<p>The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not optimally use all the cores. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires a significant programming effort that involves thread creation/termination, synchronization, and load balancing. That is, until now. </p>
<p>Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:</p>
<h4>Example</h4>
<p>Let <em>A0, A1 </em>be two real double precision 4x4 matrices; Let <em>B0, B1</em> be two real double precision 8x4 matrices. We'd like to perform these operations:</p>
<p><i>C0 = 1.0 * A0 * B0<sup>T</sup></i>, and <i>C1 = 1.0 * A1 * B1<sup>T</sup></i></p>
<p>where <i>C0</i> and <em>C1</em> are two real double precision 4x8 result matrices. </p>
<p>Again, let <em>X0, X1</em> be two real double precision 3x6 matrices; let <em>Y0, Y1</em> be another two real double precision 3x6 matrices. We'd like to perform these operations:</p>
<p><i>Z0 = 1.0 * X0 * Y0<sup>T</sup> + 2.0 * Z0</i>, and <i>Z1 = 1.0 * X1 * Y1<sup>T</sup> + 2.0 * Z1</i></p>
<p>where <em>Z0 </em>and <em>Z1 </em>are two real double precision 3x3 result matrices.</p>
<p>We could accomplish these multiplications using four individual calls to the standard DGEMM API. Instead, we use a single "Batch GEMM" call to do the same work with potentially improved overall performance. We illustrate this below using the "cblas_dgemm_batch" function.</p>
<pre class="brush:cpp;">#define GRP_COUNT 2
MKL_INT m[GRP_COUNT] = {4, 3};
MKL_INT k[GRP_COUNT] = {4, 6};
MKL_INT n[GRP_COUNT] = {8, 3};
MKL_INT lda[GRP_COUNT] = {4, 6};
MKL_INT ldb[GRP_COUNT] = {4, 6};
MKL_INT ldc[GRP_COUNT] = {8, 3};
CBLAS_TRANSPOSE transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE transB[GRP_COUNT] = {CblasTrans, CblasTrans};
double alpha[GRP_COUNT] = {1.0, 1.0};
double beta[GRP_COUNT] = {0.0, 2.0};
MKL_INT size_per_grp[GRP_COUNT] = {2, 2};
// Total number of multiplications: 4
double *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;
// Call cblas_dgemm_batch
cblas_dgemm_batch (
CblasRowMajor,
transA,
transB,
m,
n,
k,
alpha,
a_array,
lda,
b_array,
ldb,
beta,
c_array,
ldc,
GRP_COUNT,
size_per_grp);
</pre>
<p>The "Batch GEMM" interface resembles the GEMM interface, except that arguments are passed as arrays of pointers to matrices and parameters instead of as individual matrices and parameters. As the example shows, multiplications of different shapes and parameters can be batched together by packaging them into groups, where each group consists of multiplications with the same matrix shape (same <em>m, n,</em> and <em>k</em>) and the same parameters. </p>
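<p>For reference, the single batch call computes the same results as issuing one ordinary GEMM per entry in each group. The following plain-C sketch of that per-entry operation (for the no-transpose/transpose case used above) has no MKL dependency; the helper name dgemm_nt and the row-major layout assumptions are ours:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of one entry in a "Batch GEMM" group with
 * transA = no-transpose and transB = transpose: computes
 *   C = alpha * A * B^T + beta * C
 * for row-major A (m x k, lda = k), B (n x k, ldb = k), C (m x n, ldc = n). */
static void dgemm_nt(size_t m, size_t n, size_t k, double alpha,
                     const double *A, const double *B,
                     double beta, double *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[j * k + p]; /* row j of B = column j of B^T */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

<p>Calling this once per a_array[i]/b_array[i]/c_array[i] triple, with the m, n, k, alpha, and beta of the entry's group, reproduces what the single batch call computes; the point of the batch API is that the library schedules all of these multiplications across cores for you.</p>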
<h4>Performance</h4>
<p>While this example does not show the performance advantages of "Batch GEMM", the advantages become apparent when there are thousands of independent small matrix multiplications. The chart below shows the performance of 11K small matrix multiplications of various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel Xeon processor (Haswell). The performance metric is Gflops; higher bars mean higher performance.</p>
<p><img height="673" width="1090" src="https://software.intel.com/sites/default/files/managed/70/b9/gemm_batch_xeon.png" alt="" /></p>
<p>The second chart shows the same benchmark running on a 61-core Intel Xeon Phi coprocessor (KNC). Because "Batch GEMM" exploits parallelism using many concurrent threads, its advantages are more evident on architectures with a larger core count. </p>
<p><img height="669" width="1079" src="https://software.intel.com/sites/default/files/managed/70/b9/gemm_batch_phi.png" alt="" /></p>
<h4>Summary</h4>
<p><span>This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation. </span></p>
<p><img alt="Optimization Notice in English" src="https://software.intel.com/sites/default/files/m/0/1/3/opt-notice-en_080411.gif" /></p>
Thu, 14 Sep 2017 01:55:40 -0700, Fiona Z. (Intel)

Coarray Fortran 32-bit doesn&#039;t work on 64-bit Microsoft* Windows
https://software.intel.com/en-us/articles/coarray-fortran-doesnt-work-on-microsoft-windows-10
<p><strong>Version : </strong>Intel<span style="color:rgb(31, 73, 125)">®</span> Visual Fortran Compiler 17.0, 18.0</p>
<p><strong>Operating System :</strong> Microsoft* Windows 10 64-bit, Microsoft* Windows Server 2012 R2 64-bit</p>
<p><strong>Problem Description : </strong>Coarray Fortran 32-bit doesn't work on Microsoft* Windows 10 or Microsoft* Windows Server 2012 R2 (64-bit OS only) because the required utilities “mpiexec.exe” and “smpd.exe” do not work properly.</p>
<p><strong>Resolution Status :</strong></p>
<p>It is a compatibility issue. You need to change the compatibility properties so that “mpiexec.exe” and “smpd.exe” run correctly. The following workaround should resolve the problem:</p>
<p>1. Go to the folder where your “mpiexec.exe” and “smpd.exe” files are located.<br />
2. For both files, follow these steps:</p>
<ul><li>Right click &gt; Properties &gt; Compatibility Tab</li>
<li>Make sure the “Run this program in compatibility mode for:” box is checked and Windows Vista (Service Pack 2) is chosen.</li>
<li>Click Apply and close the Properties window.</li>
</ul><p>A 32-bit Coarray Fortran application should work correctly if all steps are followed carefully.</p>
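<p>The compatibility mode set through the Properties dialog is stored in the Windows registry, so the same workaround can also be scripted. The following is a sketch under the assumption that the "VISTASP2" layer value corresponds to the Windows Vista (Service Pack 2) mode chosen in the dialog; the paths are placeholders for the actual locations of your files:</p>

```shell
:: Hypothetical scripted equivalent of the Properties-dialog steps above.
:: Windows stores per-executable compatibility layers under this registry key.
:: Replace C:\path\to with the folder that actually contains the utilities.
reg add "HKCU\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers" ^
    /v "C:\path\to\mpiexec.exe" /t REG_SZ /d "VISTASP2" /f
reg add "HKCU\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers" ^
    /v "C:\path\to\smpd.exe" /t REG_SZ /d "VISTASP2" /f
```

<p>Setting the value per user (HKCU) matches what the dialog does when "Change settings for all users" is not used.</p>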
Thu, 31 Aug 2017 08:55:12 -0700, Igor V. (Intel)

Wrong Intel® Fortran compiler version displayed in Microsoft* Visual Studio 2012
https://software.intel.com/en-us/articles/wrong-intel-fortran-compiler-version-displayed-in-microsoft-visual-studio-2012
<p><strong>Issue: </strong>Microsoft* Visual Studio 2012 is supported by Intel® Parallel Studio XE 2017 but not by Intel® Parallel Studio XE 2018. When both Intel® Parallel Studio XE 2017 and Intel® Parallel Studio XE 2018 are installed on the same system with Microsoft* Visual Studio 2012, a wrong Intel® Fortran compiler version is displayed in Microsoft* Visual Studio 2012.</p>
<p>It may be observed while opening "Tools &gt; Options &gt; Intel Compilers and Tools &gt; Visual Fortran &gt; Compilers".<br />
The 'selected compiler' may be shown as "Intel(R) Visual Fortran Compiler 18.0", which is not correct.</p>
<p>Once compilation is invoked, the correct compiler version is used, but the output window shows the wrong compiler name, "Intel(R) Visual Fortran Compiler 18.0". For example, with both the 17.0 Update 4 and 18.0 compiler versions installed:</p>
<p>1&gt;------ Rebuild All started: Project: Console8, Configuration: Debug Win32 ------<br />
1&gt;Deleting intermediate files and output files for project 'Console8', configuration 'Debug|Win32'.<br />
1&gt;<strong>Compiling with Intel(R) Visual Fortran Compiler 18.0.0.118 [IA-32]...</strong><br />
1&gt;Console8.f90<br />
1&gt;<strong>Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on IA-32, Version 17.0.4.210 Build 20170411</strong><br />
1&gt;Copyright (C) 1985-2017 Intel Corporation. All rights reserved.<br />
1&gt;Linking...</p>
<p><strong>Environment: </strong>Both Intel(R) Parallel Studio XE 2017 Update 4 and Intel(R) Parallel Studio XE 2018 are installed, Microsoft* Visual Studio 2012 is installed</p>
<p><strong>Root Cause:</strong> The root cause was identified and will be fixed in upcoming compiler versions.</p>
<p><strong>Workaround: </strong></p>
<p>The user should select “Intel(R) Visual Fortran Compiler 17.0” at the 'Select compiler' option at "Tools &gt; Options &gt; Intel Compilers and Tools &gt; Visual Fortran &gt; Compilers". Then the correct name and compiler will be displayed as expected:</p>
<p>1&gt;------ Rebuild All started: Project: Console8, Configuration: Debug Win32 ------<br />
1&gt;Deleting intermediate files and output files for project 'Console8', configuration 'Debug|Win32'.<br />
1&gt;<strong>Compiling with Intel(R) Visual Fortran Compiler 17.0.4.210 [IA-32]...</strong><br />
1&gt;Console8.f90<br />
1&gt;<strong>Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on IA-32, Version 17.0.4.210 Build 20170411</strong><br />
1&gt;Copyright (C) 1985-2017 Intel Corporation. All rights reserved.<br />
1&gt;Linking...</p>
Mon, 21 Aug 2017 05:15:00 -0700, Igor V. (Intel)

Intel(R) Math Kernel Library - Introducing Vectorized Compact Routines
https://software.intel.com/en-us/articles/intelr-math-kernel-library-introducing-vectorized-compact-routines
<h3>Introduction </h3>
<p> Many high performance computing applications depend on matrix operations performed on large groups of matrices of small sizes. Intel® Math Kernel Library (Intel® MKL) 2018 and later versions provide new <em>compact </em>routines that include optimizations for problems of this type.</p>
<p>The main idea behind these compact routines is to create true SIMD computations, in which subgroups of matrices are operated on with kernels that abstractly appear as scalar kernels while registers are filled by cross-matrix vectorization. Intel MKL compact routines provide significant performance benefits compared to batched techniques (see <a href="https://software.intel.com/en-us/articles/introducing-batch-gemm-operations">https://software.intel.com/en-us/articles/introducing-batch-gemm-operations</a> for more detailed information about Intel MKL Batch functions), while maintaining ease-of-use through the inclusion of compact service functions that facilitate the reformatting of matrix data for use in these routines.</p>
<p>Compact routines operate on matrices that have been packed into a contiguous segment of memory in an interleaved format, called <em>compact format</em>. Six compact routines have been introduced in Intel MKL 2018: general matrix-multiply (<em>mkl_?gemm_compact</em>), triangular matrix equation solve (<em>mkl_?trsm_compact</em>), inverse calculation (<em>mkl_?getrinp_compact</em>), LU factorization (<em>mkl_?getrfnp_compact</em>), Cholesky decomposition (<em>mkl_?potrf_compact</em>), and QR decomposition (<em>mkl_?geqrf_compact</em>). These routines can only be used for groups of matrices of identical dimensions, where the layout (row-major or column-major) and the stride are identical throughout the group. </p>
<h3>Compact Format</h3>
<p> In compact format, for real precisions, matrices are organized into packs of size V, where V is related to the SIMD vector length of the underlying architecture. Each pack is a 3D tensor with the matrix index incrementing the fastest. These packs can then be loaded into registers and operated on using SIMD instructions.</p>
<p>The picture below demonstrates the packing of a set of 4, 3 x 3, real-precision matrices into compact format. The pack length for this example is V = 2, resulting in 2 compact packs.</p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/16/88/Compact_figure%231.png" alt="" /></span> </p>
<p> Figure 1: Compact format for 4, 3 x 3, real precision matrices with pack length V = 2</p>
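<p>To make the layout concrete, the packing step illustrated in Figure 1 can be sketched in plain C. This is our own simplified reconstruction of the interleaved format for row-major real matrices, not the MKL implementation (pack_compact is a hypothetical helper; in MKL, mkl_?gepack_compact performs this task):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketch of the interleaved "compact" layout: nmat row-major
 * m x n matrices are grouped into packs of V matrices each, and within a
 * pack the matrix index varies fastest, so element (i, j) of the t-th
 * matrix of pack p lands at offset
 *   p * (m * n * V) + (i * n + j) * V + t.
 * Loading V consecutive doubles then yields element (i, j) of V matrices. */
static void pack_compact(size_t m, size_t n, size_t nmat, size_t V,
                         const double *const *mats, double *compact)
{
    for (size_t t = 0; t < nmat; t++) {
        size_t p = t / V, r = t % V;            /* pack index, slot in pack */
        double *pack = compact + p * (m * n * V);
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++)
                pack[(i * n + j) * V + r] = mats[t][i * n + j];
    }
}
```

<p>For Figure 1 (nmat = 4, m = n = 3, V = 2), this produces two packs of 2 interleaved matrices each.</p>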
<p>The particular form of the packs for each architecture and problem precision is specified by an MKL_COMPACT_PACK enum type.</p>
<p>Before calling a BLAS or LAPACK compact function, the input data must be packed in compact format. After execution, the output data should be unpacked from compact format, unless another compact routine will be called immediately afterwards. Two service functions, mkl_?gepack_compact and mkl_?geunpack_compact, facilitate the process of storing matrices in compact format. It is recommended that the user call mkl_get_format_compact to obtain the optimal format for performance before calling the mkl_?gepack_compact routine, but advanced users can pack and unpack the matrices themselves and still use Intel MKL compact kernels on the packed set.</p>
<p>For more details, including a description of the compact format of complex-type arrays, see &lt;Compact Format&gt; in the Intel MKL User’s guide.</p>
<h4><strong>A SIMPLE VISUAL EXAMPLE</strong></h4>
<p>A simple compact version of a matrix multiplication is illustrated in this section, performing the operation C = A * B for a set of 4, 3 x 3, real-precision matrices. Generic (or batched) routines require 4 matrix-matrix multiplications to be performed for a problem of this type, as illustrated in Figure 2.</p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/a6/00/Compact_figure%232.png" alt="" /></span></p>
<p> Figure 2: Generic GEMM for a set of 4, 3 x 3 matrices</p>
<p>Assuming that the matrices have been packed into compact format using a pack length of V = 2, the compact version of this problem involves two matrix-matrix multiplications, as illustrated in Figure 3</p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/54/8a/Compact_figure%233.png" alt="" /></span></p>
<p> Figure 3: Compact GEMM for a set of 4, 3 x 3 matrices</p>
<p>The elements of the matrices involved in these two multiplications are vectors of length V, which are loaded into registers and operated on as if they were a scalar element in an ordinary matrix-matrix multiplication. Clearly, it is optimal to have pack length V equal to the length of the SIMD registers of the architecture.</p>
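<p>This can be sketched in plain C (our own illustration, not MKL's actual kernel; gemm_compact_pack is a hypothetical helper). It assumes a row-major interleaved layout where element (i, j) of pack slot t is stored at (i * n + j) * V + t; the innermost loop updates the same element of all V matrices at once, which a compiler can map onto SIMD registers:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Compact GEMM for one pack of V matrices: C += A * B, where each "scalar"
 * in the triple loop is a length-V vector holding the corresponding element
 * of all V matrices, so the innermost update is a cross-matrix
 * multiply-add over the pack. A is m x k, B is k x n, C is m x n. */
static void gemm_compact_pack(size_t m, size_t n, size_t k, size_t V,
                              const double *Ap, const double *Bp, double *Cp)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++)
                for (size_t t = 0; t < V; t++)   /* vectorizable lane loop */
                    Cp[(i * n + j) * V + t] +=
                        Ap[(i * k + p) * V + t] * Bp[(p * n + j) * V + t];
}
```

<p>Matching V to the SIMD register width of the architecture, as the text notes, makes each lane loop one vector instruction.</p>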
<h4><strong>NUMERICAL LIMITATIONS</strong></h4>
<p>Compact routines are subject to a set of numerical limitations, and, to enable effective vectorization, they skip most of the checks present in regular BLAS and LAPACK routines. Error checking is the responsibility of the user. For more information on the limitations of compact routines, see &lt;MKL User Guide Numerical Limitations&gt;.</p>
<h4><strong>BLAS COMPACT ROUTINES</strong></h4>
<p>Intel MKL BLAS provides compact routines for general matrix-matrix multiplication and solving triangular matrix equations. The following table provides a brief description of the new routines. For detailed information on usage for these routines, see the Intel MKL User’s Guide.</p>
<table border="1"><tbody><tr><td style="width:215px">
<p>MKL Routine</p>
</td>
<td style="width:408px">
<p>Description</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?gemm_compact</p>
</td>
<td style="width:408px">
<p><em>General matrix-matrix multiply.</em> Performs the operation</p>
<p>C = alpha*op(A)*op(B) + beta*C</p>
<p>where op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H, alpha and beta are scalars, and A, B, and C are matrices stored in compact format.</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?trsm_compact</p>
</td>
<td style="width:408px">
<p><em>Triangular matrix equation solve.</em> Computes the solution of one of the following matrix equations:</p>
<p>op(A) * X = alpha * B, or X*op(A) = alpha*B</p>
<p>where alpha is a scalar, X and B are m x n matrices stored in compact format, and A is a unit (or non-unit) triangular matrix stored in compact format.</p>
</td>
</tr></tbody></table><h4><strong>LAPACK COMPACT ROUTINES</strong></h4>
<p>Intel MKL LAPACK provides compact functions to calculate QR, LU, and Cholesky decompositions, as well as inverses, in Intel MKL 2018 (and later versions). The compact routines for LAPACK follow the same optimization principles as the compact BLAS routines. The following table provides a brief description of the new routines. For detailed information on these routines, see the Intel MKL User’s Guide.</p>
<table border="1"><tbody><tr><td style="width:215px">
<p>MKL Routine</p>
</td>
<td style="width:408px">
<p>Description</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?geqrf_compact</p>
</td>
<td style="width:408px">
<p><em>QR decomposition.</em> Computes the QR factorization of a set of general m x n matrices stored in compact format.</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?getrfnp_compact</p>
</td>
<td style="width:408px">
<p><em>LU decomposition, without pivoting.</em> Computes the LU factorization, without pivoting, of a set of general m x n matrices A stored in the array ap in compact format (see Compact Format).</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?getrinp_compact</p>
</td>
<td style="width:408px">
<p><em>Inverse, without pivoting.</em> Computes the inverses of a set of LU-factorized (without pivoting) general matrices A stored in compact format (see Compact Format).</p>
</td>
</tr><tr><td style="width:215px">
<p>mkl_?potrf_compact</p>
</td>
<td style="width:408px">
<p><em>Cholesky decomposition.</em> Computes the Cholesky factorization of a set of symmetric (Hermitian) positive-definite matrices stored in compact format.</p>
</td>
</tr></tbody></table>
<h4>Example</h4>
<p>The following example uses Intel MKL compact routines to calculate first the LU factorizations, then the inverses (from the LU factorizations), of a group of 2048 8x8 matrices. For comparison, the same calculations are performed using an OpenMP loop over the group of matrices. The time taken by each approach is printed so that the user can verify the performance improvement from using compact routines.</p>
<p>Notice that the routines mkl_dgetrfnp_compact and mkl_dgetrinp_compact are called between the mkl_dgepack_compact and mkl_dgeunpack_compact functions. Because the mkl_?gepack_compact and mkl_?geunpack_compact functions add overhead, users who call multiple compact routines on the same group of matrices will see the greatest performance benefit from using compact routines.</p>
<p>The complex compact routines are executed similarly, but it is important to note that for complex precisions, all input parameters are of real type. For more details, see &lt;Compact Format&gt; in the Intel MKL User’s guide. Examples of the calling sequences for each individual routine can be found in the Intel MKL 2018 product.</p>
<pre class="brush:cpp;">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;omp.h&gt;
#include "mkl.h"
#define N 8
#define NMAT 2048
#define NITER_WARMUP 10
void test(double *t_compact, double *t_omp) {
MKL_INT i, j;
MKL_LAYOUT layout = MKL_COL_MAJOR;
MKL_INT m = N;
MKL_INT n = N;
MKL_INT lda = m;
MKL_INT info;
MKL_COMPACT_PACK format;
MKL_INT nmat = NMAT;
/* Set up standard arrays in P2P (pointer-to-pointer) format */
MKL_INT a_size = lda * n;
MKL_INT na = a_size * nmat;
double *a_ref = (double *)mkl_malloc(na * sizeof(double), 128);
double *a = (double *)mkl_malloc(na * sizeof(double), 128);
double *a_array[NMAT];
double *a_compact;
/* For random generation of matrices */
MKL_INT idist = 1;
MKL_INT iseed[] = { 0, 1, 2, 3 };
double diag_offset = (double)n;
/* For workspace calculation */
MKL_INT imone = -1;
MKL_INT lwork;
double work_query[1];
double *work_compact;
/* For threading */
MKL_INT nthr = omp_get_max_threads();
MKL_INT ithr;
MKL_INT lwork_i;
double *work_omp;
double *work_i;
/* For setting up compact arrays */
MKL_INT a_buffer_size;
MKL_INT ldap = lda;
MKL_INT sdap = n;
/* Random generation of matrices */
dlarnv(&amp;idist, iseed, &amp;na, a);
for (i = 0; i &lt; nmat; i++) {
/* Make matrix diagonal dominant to avoid accuracy issues
in the non-pivoted LU factorization */
for (j = 0; j &lt; m; j++) {
a[i * a_size + j + j * lda] += diag_offset;
}
a_array[i] = &amp;a[i * a_size];
}
/* Set up a_ref to use in OMP version */
for (i = 0; i &lt; na; i++) {
a_ref[i] = a[i];
}
/* -----Start Compact----- */
/* Set up Compact arrays */
format = mkl_get_format_compact();
a_buffer_size = mkl_dget_size_compact(ldap, sdap, format, nmat);
a_compact = (double *)mkl_malloc(a_buffer_size, 128);
/* Workspace query */
mkl_dgetrinp_compact(layout, n, a_compact, ldap, work_query, imone, &amp;info, format, nmat);
lwork = (MKL_INT)work_query[0];
work_compact = (double *)mkl_malloc(sizeof(double) * lwork, 128);
/* Start timing compact */
*t_compact = dsecnd();
/* Pack from P2P to Compact format */
mkl_dgepack_compact(layout, n, n, a_array, lda, a_compact, ldap, format, nmat);
/* Perform Compact LU Factorization */
mkl_dgetrfnp_compact(layout, n, n, a_compact, ldap, &amp;info, format, nmat);
/* Perform Compact Inverse Calculation */
mkl_dgetrinp_compact(layout, n, a_compact, ldap, work_compact, lwork, &amp;info, format, nmat);
/* Unpack from Compact to P2P format */
mkl_dgeunpack_compact(layout, n, n, a_array, lda, a_compact, ldap, format, nmat);
/* End timing compact */
*t_compact = dsecnd() - *t_compact;
/* -----End Compact----- */
/* -----Start OMP----- */
for (i = 0; i &lt; nmat; i++) {
a_array[i] = &amp;a_ref[i * a_size];
}
/* Workspace query */
mkl_dgetrinp(&amp;n, a_array[0], &amp;lda, work_query, &amp;imone, &amp;info);
lwork = (MKL_INT)work_query[0] * nthr;
work_omp = (double *)mkl_malloc(sizeof(double) * lwork, 128);
/* Start timing OMP */
*t_omp = dsecnd();
/* OpenMP loop */
#pragma omp parallel for private(ithr, lwork_i, work_i)
for (i = 0; i &lt; nmat; i++) {
/* Set up workspace for thread */
ithr = omp_get_thread_num();
lwork_i = lwork / nthr;
work_i = &amp;work_omp[ithr * lwork_i];
/* Perform LU Factorization */
mkl_dgetrfnp(&amp;n, &amp;n, a_array[i], &amp;lda, &amp;info);
/* Perform Inverse Calculation */
mkl_dgetrinp(&amp;n, a_array[i], &amp;lda, work_i, &amp;lwork_i, &amp;info);
}
/* End timing OMP */
*t_omp = dsecnd() - *t_omp;
/* -----End OMP----- */
/* Deallocate arrays */
mkl_free(a_compact);
mkl_free(a);
mkl_free(a_ref);
mkl_free(work_compact);
mkl_free(work_omp);
}
int main() {
MKL_INT i = 0;
double t_compact;
double t_omp;
double flops = NMAT * ((2.0 / 3.0 + 4.0 / 3.0) * N * N * N);
for (i = 0; i &lt; NITER_WARMUP; i++) {
test(&amp;t_compact, &amp;t_omp);
}
test(&amp;t_compact, &amp;t_omp);
printf("N = %d, NMAT = %d\n", N, NMAT);
printf("Compact time = %fs, GFlops = %f\n", t_compact, flops / t_compact / 1e9);
printf("OMP time = %fs, GFlops = %f\n", t_omp, flops / t_omp / 1e9);
return 0;
}
</pre>
<h3><strong>PERFORMANCE RESULTS</strong></h3>
<p>The following four charts demonstrate the performance improvement for the following operations: general matrix-matrix multiplication (GEMM), triangular matrix equation solve (TRSM), non-pivoting LU-factorization of a general matrix (GETRFNP), and inverse calculation of an LU-factorized (without pivoting) general matrix (GETRINP). The results were measured against calls to the generic BLAS and LAPACK functions, as in the above example.</p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/ee/c9/Compact_Perf_Chart%231.png" alt="" /></span></p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/88/a9/Compact_Perf_Chart_trsm%232.png" alt="" /></span></p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/16/e3/Compact_Perf_Chart_getrfnp%233.png" alt="" /></span></p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/27/e2/Compact_Perf_Chart_getrfiverse%234.png" alt="" /></span></p>
<p> </p>
Mon, 14 Aug 2017 01:26:15 -0700, Gennady F. (Intel)

Intel® Data Analytics Acceleration Library - Decision Trees
https://software.intel.com/en-us/articles/intel-data-analytics-acceleration-library-decision-trees
<h3>Introduction</h3>
<p>The decision tree method is one of the most popular approaches in machine learning. Decision trees can easily be used to solve different classification and regression tasks. They are often favored for their universality and for the fact that the model obtained by learning a decision tree is easy to interpret, even by a non-expert.</p>
<p>The universality of decision trees is a consequence of two main factors. First, the decision tree method is a non-parametric machine learning method: its use does not require knowing or assuming the probabilistic characteristics of the data it is supposed to work with. Second, the decision tree method naturally incorporates mixtures of variables with different levels of measurement [1].</p>
<p>At the same time, the decision tree model is a white box: it is clear which class (for a classification problem) or which value of the dependent variable (for a regression problem) will be predicted for particular data, and which features or dependent variables influence the prediction and how.</p>
<p>This article describes the decision trees algorithm and how Intel® Data Analytics Acceleration Library (Intel® DAAL) [2] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.</p>
<h3>What is a Decision tree?</h3>
<p>Decision trees partition the feature space into a set of hypercubes, and then fit a simple model in each one. Such a simple model can be a prediction model that ignores all predictors and predicts the majority (most frequent) class (or the mean of the dependent variable, for regression), also known as the 0-R or constant classifier.</p>
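<p>The 0-R (constant) classifier mentioned above is simple enough to sketch directly. The following plain-C illustration (zero_r_fit is a hypothetical name; integer labels 0..num_classes-1 are an assumption for illustration) returns the majority class of the training labels, which is then predicted for every input:</p>

```c
#include <assert.h>
#include <stddef.h>

/* 0-R ("constant") classifier: ignores all features and returns the most
 * frequent class label in the training data, which becomes the prediction
 * for every future observation. */
static int zero_r_fit(const int *labels, size_t n, int num_classes)
{
    int best = 0;
    size_t best_count = 0;
    for (int c = 0; c < num_classes; c++) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (labels[i] == c) count++;
        if (count > best_count) { best_count = count; best = c; }
    }
    return best;
}
```

<p>For regression, the analogous constant model simply returns the mean of the dependent variable over the observations at the leaf.</p>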
<p>Decision tree induction constructs a tree-like graph structure, as shown in the figure below, where each internal (non-leaf) node denotes a test on features, each branch descending from a node corresponds to an outcome of the test, and each external (leaf) node denotes the simple model mentioned above. </p>
<p><span><img src="https://software.intel.com/sites/default/files/managed/83/22/Decision_Trees_v0.png" alt="" /></span></p>
<p>The test is a rule, which depends on feature values, for partitioning the feature space: each outcome of the test represents a hypercube associated with both the test and one of the descending branches. If the test is a Boolean expression (e.g. <em>f</em> &lt; <em>c</em> or <em>f</em> = <em>c</em>, where <em>f</em> is a feature and <em>c</em> is a constant fitted during decision tree induction), the induced decision tree is binary, and each of its non-leaf nodes has exactly two branches (“true” and “false”) according to the result of the Boolean expression. In this case, the left branch is often implicitly assumed to be associated with the “true” outcome and the right branch with the “false” outcome.</p>
<p>Test selection is performed as a search through all reasonable tests to find the best one according to some criterion, named the split criterion. There are many widely used split criteria, including the Gini index [3] and Information Gain [4] for classification, and Mean-Squared Error (MSE) [3] for regression.</p>
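<p>As an illustration of a split criterion, a minimal sketch of the Gini index follows (our own helper, not Intel DAAL code). The impurity of a node is 1 − Σ<sub>c</sub> p<sub>c</sub>², where p<sub>c</sub> is the fraction of observations of class c at the node; a candidate split is scored by the weighted impurity of the resulting child nodes, lower being better:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Gini impurity of a node from its per-class observation counts:
 * 1 - sum_c (count_c / total)^2. A pure node (one class) scores 0;
 * a maximally mixed two-class node scores 0.5. */
static double gini_impurity(const size_t *class_counts, int num_classes)
{
    size_t total = 0;
    for (int c = 0; c < num_classes; c++) total += class_counts[c];
    if (total == 0) return 0.0;
    double sum_sq = 0.0;
    for (int c = 0; c < num_classes; c++) {
        double p = (double)class_counts[c] / (double)total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}
```

<p>During test selection, the candidate test whose children minimize this weighted impurity is chosen for the node.</p>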
<p>To improve prediction, a decision tree can be pruned [5]. Pruning techniques that are embedded in the training process are called pre-pruning, because they stop further growth of the decision tree. There are also post-pruning techniques, which replace an already fully trained decision tree by another one [5].</p>
<p>For instance, Reduced Error Pruning (REP), described in [5], assumes the existence of a separate pruning dataset; each observation in it is used to get a prediction from the original (unpruned) tree. For every non-leaf subtree, the change in mispredictions over the pruning dataset that would occur if this subtree were replaced by the best possible leaf is examined:</p>
<p><em>ΔE = E<sub>leaf</sub> − E<sub>subtree</sub></em></p>
<p>where <em>E<sub>subtree</sub></em> and <em>E<sub>leaf</sub></em> are the numbers of errors (in the case of classification) or the MSE (in the case of regression) for the given subtree and for the best possible leaf that replaces it. If the new tree would give an equal or smaller number of mispredictions (Δ<em>E</em> ≤ 0) and the subtree contains no subtree with the same property, the subtree is replaced by the leaf. The process continues until any further replacement would increase mispredictions over the pruning dataset. The final tree is the most accurate subtree of the original tree with respect to the pruning dataset and is the smallest tree with that accuracy. The pruning dataset can be a fraction of the original training dataset (e.g. a randomly chosen 20% of observations), but in that case those observations must be excluded from the training dataset.</p>
<p>Prediction is performed by starting at the root node of the tree, applying the test specified by this node, then moving down the tree branch corresponding to the outcome of the test for the given example. This process is repeated for the subtree rooted at the new node. The final result is the prediction of the simple model at the reached leaf node.</p>
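<p>For a binary tree with tests of the form <em>f</em> &lt; <em>c</em>, the prediction procedure can be sketched as follows (the node layout is a hypothetical illustration, not Intel DAAL's data structure):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Each internal node tests x[feature] < threshold and descends left on
 * "true", right on "false"; a leaf (left == NULL) stores the constant
 * prediction of its simple model. */
struct dt_node {
    int feature;                       /* feature index tested at this node */
    double threshold;                  /* constant c fitted during induction */
    double prediction;                 /* used only at leaf nodes */
    const struct dt_node *left, *right;
};

static double dt_predict(const struct dt_node *node, const double *x)
{
    while (node->left != NULL)         /* internal node: apply the test */
        node = (x[node->feature] < node->threshold) ? node->left : node->right;
    return node->prediction;           /* leaf: the simple model's output */
}
```

<p>For classification, the stored prediction would be a class label rather than a real value, but the traversal is identical.</p>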
<h3>Applications of Decision trees</h3>
<p>Decision trees can be used in many real-world applications [6]:</p>
<ul><li>Agriculture</li>
<li>Astronomy (e.g. for filtering noise from Hubble Space Telescope images)</li>
<li>Biomedical Engineering</li>
<li>Control Systems</li>
<li>Financial analysis</li>
<li>Manufacturing and Production</li>
<li>Medicine</li>
<li>Molecular biology</li>
<li>Object recognition</li>
<li>Pharmacology</li>
<li>Physics (e.g. for the detection of physical particles)</li>
<li>Plant diseases (e.g. to assess the hazard of mortality to pine trees)</li>
<li>Power systems (e.g. power system security assessment and power stability prediction)</li>
<li>Remote sensing</li>
<li>Software development (e.g. to estimate the development effort of a given software module)</li>
<li>Text processing (e.g. medical text classification)</li>
<li>Personal learning assistants</li>
<li>Classifying sleep signals</li>
</ul><h3>Advantages and disadvantages of Decision trees</h3>
<p>Using Decision trees has advantages and disadvantages [7]:</p>
<ul><li><strong>Advantages</strong>
<ul><li>Simple to understand and interpret; decision trees are a white-box model.</li>
<li>Able to handle both numerical and categorical data.</li>
<li>Require little data preparation.</li>
<li>A non-statistical approach that makes no assumptions about the training data or prediction residuals; e.g., no distributional, independence, or constant-variance assumptions.</li>
<li>Perform well even with large datasets.</li>
<li>Mirror human decision making more closely than other approaches.</li>
<li>Robust against collinearity.</li>
<li>Have built-in feature selection.</li>
<li>Have value even with small datasets.</li>
<li>Can be combined with other techniques.</li>
</ul></li>
<li><strong>Disadvantages</strong>
<ul><li>Trees do not tend to be as accurate as other approaches.</li>
<li>Trees can be very non-robust. A small change in the training data can result in a big change in the tree, and thus a big change in final predictions.</li>
<li>The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node.</li>
<li>Decision-tree learners can create over-complex trees that do not generalize well from the training data. Mechanisms such as pruning are necessary to avoid this problem.</li>
<li>There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large.</li>
</ul></li>
</ul><h3>Intel® Data Analytics Acceleration Library</h3>
<p>Intel® DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These building blocks are highly optimized for the latest features of Intel® processors. More about Intel® DAAL can be found in [2]. Intel® DAAL provides both Decision tree classification and regression algorithms.</p>
<h3>Using Decision trees in Intel® Data Analytics Acceleration Library</h3>
<p>This section shows how to invoke Decision trees classification and regression using Intel® DAAL.</p>
<p>Follow these steps to invoke the Decision tree classification algorithm from Intel® DAAL:</p>
<pre class="brush:cpp;">1. Ensure that you have Intel® DAAL installed and the environment is prepared. See details in [8, 9, 10] according to your operating system.
2. Include header file daal.h into your application:
#include &lt;daal.h&gt;
3. To simplify the usage of Intel® DAAL namespaces, we will use the following using directives:
using namespace daal;
using namespace daal::algorithms;
4. We will assume that the training, pruning, and testing datasets are in appropriate .csv files. If so, we must read the first two of them into Intel® DAAL numeric tables:
const size_t nFeatures = 5; /* Number of features in training and testing data sets */
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the input data from a .csv
file */
FileDataSource&lt;CSVFeatureManager&gt; trainDataSource("train.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for training data and labels */
NumericTablePtr trainData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr trainGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr mergedData(new MergedNumericTable(trainData, trainGroundTruth));
/* Retrieve the data from the input file */
trainDataSource.loadDataBlock(mergedData.get());
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the pruning input data from a
.csv file */
FileDataSource&lt;CSVFeatureManager&gt; pruneDataSource("prune.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for pruning data and labels */
NumericTablePtr pruneData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr pruneGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr pruneMergedData(new MergedNumericTable(pruneData, pruneGroundTruth));
/* Retrieve the data from the pruning input file */
pruneDataSource.loadDataBlock(pruneMergedData.get());
5. Create an algorithm object to train the model:
const size_t nClasses = 5; /* Number of classes */
/* Create an algorithm object to train the Decision tree model */
decision_tree::classification::training::Batch&lt;&gt; algorithm1(nClasses);
6. Pass the training data and labels, together with the pruning data and labels, to the algorithm:
/* Pass the training data set, labels, and pruning dataset with labels to the algorithm */
algorithm1.input.set(classifier::training::data, trainData);
algorithm1.input.set(classifier::training::labels, trainGroundTruth);
algorithm1.input.set(decision_tree::classification::training::dataForPruning, pruneData);
algorithm1.input.set(decision_tree::classification::training::labelsForPruning,
pruneGroundTruth);
7. Train the model:
/* Train the Decision tree model */
algorithm1.compute();
where algorithm1 is the variable defined in step 5.
8. Store the result of training in a variable:
decision_tree::classification::training::ResultPtr trainingResult =
algorithm1.getResult();
9. Read the testing dataset from the appropriate .csv file:
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the test data from a .csv
file */
FileDataSource&lt;CSVFeatureManager&gt; testDataSource("test.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for testing data and labels */
NumericTablePtr testData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr testGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));
/* Retrieve the data from the input file */
testDataSource.loadDataBlock(testMergedData.get());
10. Create an algorithm object to test the model:
/* Create algorithm objects for Decision tree prediction with the default method */
decision_tree::classification::prediction::Batch&lt;&gt; algorithm2;
11. Pass the testing data and trained model to the algorithm:
/* Pass the testing data set and trained model to the algorithm */
algorithm2.input.set(classifier::prediction::data, testData);
algorithm2.input.set(classifier::prediction::model,
trainingResult-&gt;get(classifier::training::model));
12. Test the model:
/* Compute prediction results */
algorithm2.compute();
13. Retrieve the results of the prediction:
/* Retrieve algorithm results */
classifier::prediction::ResultPtr predictionResult = algorithm2.getResult();
</pre>
<p>For decision tree regression, steps 1-4, 7, 9, and 12 are the same, and the others are very similar:</p>
<pre class="brush:cpp;">1. Ensure that you have Intel® DAAL installed and the environment is prepared. See details in [8, 9, 10] according to your operating system.
2. Include header file daal.h into your application:
#include &lt;daal.h&gt;
3. To simplify the usage of Intel® DAAL namespaces, we will use the following using directives:
using namespace daal;
using namespace daal::algorithms;
4. We will assume that the training, pruning, and testing datasets are in appropriate .csv files. If so, we must read the first two of them into Intel® DAAL numeric tables:
const size_t nFeatures = 5; /* Number of features in training and testing data sets */
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the input data from a .csv
file */
FileDataSource&lt;CSVFeatureManager&gt; trainDataSource("train.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for training data and labels */
NumericTablePtr trainData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr trainGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr mergedData(new MergedNumericTable(trainData, trainGroundTruth));
/* Retrieve the data from the input file */
trainDataSource.loadDataBlock(mergedData.get());
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the pruning input data from a
.csv file */
FileDataSource&lt;CSVFeatureManager&gt; pruneDataSource("prune.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for pruning data and labels */
NumericTablePtr pruneData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr pruneGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr pruneMergedData(new MergedNumericTable(pruneData, pruneGroundTruth));
/* Retrieve the data from the pruning input file */
pruneDataSource.loadDataBlock(pruneMergedData.get());
5. Create an algorithm object to train the model:
/* Create an algorithm object to train the Decision tree model */
decision_tree::regression::training::Batch&lt;&gt; algorithm;
6. Pass the training data and dependent variables, together with the pruning data and dependent variables, to the algorithm:
/* Pass the training data set, dependent variables, and pruning dataset with dependent
variables to the algorithm */
algorithm.input.set(decision_tree::regression::training::data, trainData);
algorithm.input.set(decision_tree::regression::training::dependentVariables,
trainGroundTruth);
algorithm.input.set(decision_tree::regression::training::dataForPruning, pruneData);
algorithm.input.set(decision_tree::regression::training::dependentVariablesForPruning,
pruneGroundTruth);
7. Train the model:
/* Train the Decision tree model */
algorithm.compute();
where algorithm is the variable defined in step 5.
8. Store the result of training in a variable:
decision_tree::regression::training::ResultPtr trainingResult =
algorithm.getResult();
9. Read the testing dataset from the appropriate .csv file:
/* Initialize FileDataSource&lt;CSVFeatureManager&gt; to retrieve the test data from a .csv
file */
FileDataSource&lt;CSVFeatureManager&gt; testDataSource("test.csv",
DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);
/* Create Numeric Tables for testing data and labels */
NumericTablePtr testData(new HomogenNumericTable&lt;&gt;(nFeatures, 0,
NumericTable::notAllocate));
NumericTablePtr testGroundTruth(new HomogenNumericTable&lt;&gt;(1, 0,
NumericTable::notAllocate));
NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));
/* Retrieve the data from the input file */
testDataSource.loadDataBlock(testMergedData.get());
10. Create an algorithm object to test the model:
/* Create algorithm objects for Decision tree prediction with the default method */
decision_tree::regression::prediction::Batch&lt;&gt; algorithm2;
11. Pass the testing data and trained model to the algorithm:
/* Pass the testing data set and trained model to the algorithm */
algorithm2.input.set(decision_tree::regression::prediction::data, testData);
algorithm2.input.set(decision_tree::regression::prediction::model,
trainingResult-&gt;get(decision_tree::regression::training::model));
12. Test the model:
/* Compute prediction results */
algorithm2.compute();
13. Retrieve the results of the prediction:
/* Retrieve algorithm results */
decision_tree::regression::prediction::ResultPtr predictionResult =
algorithm2.getResult();
</pre>
<h3>Conclusion</h3>
<p>The decision tree is a powerful method that can be used for both classification and regression. Intel® DAAL provides an optimized implementation of the decision tree algorithm. By using Intel® DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications; they only need to link their applications against the latest version of Intel® DAAL.</p>
<h3>References</h3>
<ol><li><a href="https://en.wikipedia.org/wiki/Level_of_measurement" rel="nofollow">https://en.wikipedia.org/wiki/Level_of_measurement</a><a name="_Ref487797216" id="_Ref487797216"></a></li>
<li><a href="https://software.intel.com/en-us/blogs/daal">https://software.intel.com/en-us/blogs/daal</a><a name="_Ref488155198" id="_Ref488155198"></a></li>
<li><a name="_Ref488244516" id="_Ref488244516">Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone. <em>Classification and Regression Trees.</em> Chapman &amp; Hall. 1984.</a></li>
<li><a name="_Ref488244569" id="_Ref488244569">J. R. Quinlan. <em>Induction of Decision Trees.</em> Machine Learning, Volume 1 Issue 1. pp. 81-106. 1986.</a></li>
<li><a name="_Ref488244670" id="_Ref488244670">J. R. Quinlan. <em>Simplifying decision trees.</em> International journal of Man-Machine Studies, Volume 27 Issue 3. pp. 221-234. 1987.</a></li>
<li><a href="http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node32.html" rel="nofollow">http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node32.html</a><a name="_Ref488245088" id="_Ref488245088"></a></li>
<li><a href="https://en.wikipedia.org/wiki/Decision_tree_learning" rel="nofollow">https://en.wikipedia.org/wiki/Decision_tree_learning</a><a name="_Ref488247359" id="_Ref488247359"></a></li>
<li><a href="https://software.intel.com/en-us/get-started-with-daal-for-linux">https://software.intel.com/en-us/get-started-with-daal-for-linux</a><a name="_Ref488249055" id="_Ref488249055"></a></li>
<li><a href="https://software.intel.com/en-us/get-started-with-daal-for-windows">https://software.intel.com/en-us/get-started-with-daal-for-windows</a><a name="_Ref488249068" id="_Ref488249068"></a></li>
<li><a href="https://software.intel.com/en-us/get-started-with-daal-for-macos">https://software.intel.com/en-us/get-started-with-daal-for-macos</a><a name="_Ref488249082" id="_Ref488249082"></a></li>
</ol>Sun, 13 Aug 2017 23:53:58 -0700Gennady F. (Intel)741254Intel® Software Development tools integration to Microsoft* Visual Studio 2017 issueshttps://software.intel.com/en-us/articles/intel-software-development-tools-integration-to-vs2017-issue
<p>We have received customer reports regarding Intel Parallel Studio 2017 and 2018 installation or integration issues with Visual Studio 2017 update 3. The following known issues have been identified, and corresponding workarounds are provided. We are working on appropriate fixes. The root cause is related to a change in Visual Studio 2017 update 3 that introduced a new version format.</p>
<p><strong>Please follow the workarounds provided below for now. When the fixes are available, this article will be updated accordingly. </strong></p>
<p><strong>Environment: </strong>Microsoft* Windows, Visual Studio 2017</p>
<p><strong>Affected products</strong>: Intel® Parallel Studio XE 2017 Update 4, Intel® Parallel Studio XE 2018 and later versions, Intel® System Studio Update 3 and later.</p>
<p><strong>Problem Description:</strong></p>
<p>Please find the list of all known integration issues below. Issues are expected to be fixed with the latest 15.3.3 version of Microsoft* Visual Studio 2017 and upcoming Intel® Parallel Studio XE/System Studio versions. Issues are still observed with available Intel® Parallel Studio XE/System Studio versions, e.g. with Intel® Parallel Studio XE 2017 Update 4 and Intel® Parallel Studio XE 2018.</p>
<p>Note that to use the Intel® Compilers with Microsoft Visual Studio* 2017 you must customize the install and enable additional workloads. Please refer to <a href="https://software.intel.com/en-us/articles/installing-microsoft-visual-studio-2017-for-use-with-intel-compilers">this article</a> for details.</p>
<div>
<ol><li><a href="#issue1" rel="nofollow">Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails</a></li>
<li><a href="#issue2" rel="nofollow">Errors within Visual Studio 2017 after Intel® Parallel Studio XE/Intel® System Studio uninstallation</a></li>
<li><a href="#issue3" rel="nofollow">VS2017 installation is not complete message during the installation of IPSXE/ISS</a></li>
<li><a href="#issue4" rel="nofollow">Integration to Visual Studio 2017 version 15.3 on Windows 10</a></li>
<li><a href="#issue5.1" rel="nofollow">Installation/upgrade of Intel® Parallel Studio XE/Intel® System Studio into Visual Studio 2017 version 15.3 reports errors once installation is completed</a>​
<ol><li><a href="#issue5.1" rel="nofollow">Installation of Intel® Parallel Studio XE/Intel® System Studio into Visual Studio 2017 version 15.3 reports errors once installation is completed​</a></li>
<li><a href="#issue5.2" rel="nofollow">Upgrade of Intel® Parallel Studio XE/Intel® System Studio with integration into Visual Studio 2017 version 15.3 (or later) reports errors once installation is completed</a></li>
</ol></li>
<li><a href="#issue6" rel="nofollow">Existing integration to Visual Studio 2017 is broken once Visual Studio upgraded</a></li>
</ol></div>
<p><a name="issue1" id="issue1"></a><strong>#1 -- Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails</strong></p>
<p>Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails on some systems. The problem is intermittent and not reproducible on every system. Any attempt to repair it fails with the message "Incomplete installation of Microsoft Visual Studio* 2017 is detected". Note that in some cases the installation may complete successfully with no errors or crashes; however, the integration into VS2017 is not installed.</p>
<p><strong>Workaround: </strong>Note that with Intel® Parallel Studio XE 2017 Update 4 there is no workaround for this integration problem. The following workaround is expected to be implemented in Intel® Parallel Studio XE 2017 Update 5. It is implemented in Intel® Parallel Studio XE 2018.</p>
<p>Integrate the Intel Parallel Studio XE components manually. You need to run all the files from the corresponding folders:</p>
<ul><li>C++/Fortran Compiler IDE:
<ul><li>&lt;installdir&gt;/ide_support_2018/VS15/Common Tools/*.vsix</li>
<li>&lt;installdir&gt;/ide_support_2018/VS15/C++/*.vsix</li>
<li>&lt;installdir&gt;/ide_support_2018/VS15/Fortran/*.vsix</li>
</ul></li>
<li>Amplifier: &lt;installdir&gt;/VTune Amplifier 2018/amplxe_vs2017-integration.vsix</li>
<li>Advisor: &lt;installdir&gt;/Advisor 2018/advi_vs2017-integration.vsix</li>
<li>Inspector: &lt;installdir&gt;/Inspector 2018/insp_vs2017-integration.vsix</li>
<li>Debugger: &lt;InstallDir&gt;/ide_support_2018/MIC/*.vsix<br />
&lt;InstallDir&gt;/ide_support_2018/CPUSideRDM/*.vsix</li>
</ul><p>Note that for 2017 version you need to use folders with 2017 instead of 2018, e.g. ide_support_2017, VTune Amplifier 2017, etc.</p>
<p><a name="issue2" id="issue2"></a><strong>#2 -- Errors within Visual Studio 2017 after Intel® Parallel Studio XE/Intel® System Studio uninstallation</strong></p>
<p>Different errors are observed within Visual Studio 2017 if not all parts of Intel® Parallel Studio XE/Intel® System Studio were uninstalled. Examples of such errors:</p>
<ul><li>Error message appears while building any C/C++ program within Visual Studio 2017 after Intel® Parallel Studio XE/Intel® System Studio uninstallation:<br /><em>Could not load file or assembly 'Intel.Misc.Utilities,</em><br /><em>Version=18.0.15.0 Culture=neutral,<br />
PublicKeyToken=5caa3becd8c4c9ee' or one of its dependencies.<br />
The system cannot find the file specified.,</em></li>
<li>Error message when invoking Visual Studio 2017:<br /><em>The 'IntelCommonPkg' package did not load correctly.</em><br /><em>The problem may have been caused by a configuration change or by the installation of another extension. You can get more information by examining the file …</em></li>
<li>Remnants of Intel® Parallel Studio XE/Intel® System Studio in Tools&gt;Options:<br />
Remaining entry in Tools&gt;Options:<br /><em> &gt; Intel_Composer_XE<br />
General<br />
Compilers (win32 tab present, but empty)<br />
Guided Auto Parallelism (…)</em></li>
</ul><p><a name="issue2workaround" id="issue2workaround"></a><strong>Workaround:</strong></p>
<ul><li>Launch "Visual Studio Installer"</li>
<li>For Visual Studio ... 2017</li>
<li>Select [Modify]</li>
<li>On the right side of the screen under the heading, "Summary", open the category, "Individual components"</li>
<li>Uncheck all entries that begin with or contain "Intel(R) ..." including “Common tools for Intel compiler projects”</li>
<li>Select [Modify] to have it make the changes</li>
<li>Launch Visual Studio 2017</li>
<li>Select menu items, “Tools &gt; Extensions and Updates”</li>
<li>Uninstall any remaining entries (components) beginning with "Intel(R) ..."</li>
<li>Ensure that there are no remaining entries (components) containing "Intel(R) ...". Otherwise uninstall them.</li>
</ul><p><a name="issue3" id="issue3"></a><strong>#3 -- VS2017 installation is not complete message during the installation of IPSXE/ISS</strong></p>
<p>The installation displays a dialog message saying that the VS2017 installation is not complete. Another symptom is that there is no Intel compiler context-switch menu to change a project to use the Intel Compiler within Visual Studio 2017.<br /><strong>Workaround: </strong>Reinstall Visual Studio 2017 and then reinstall Intel® Parallel Studio XE/Intel® System Studio.</p>
<p><a name="issue4" id="issue4"></a><strong>#4 -- Integration to Visual Studio 2017 version 15.3 on Windows 10</strong></p>
<p>On some Windows 10 systems with Visual Studio 2017 version 15.3 installed, after installation of Intel® Parallel Studio XE/Intel® System Studio with integration into Visual Studio, there is no indication of product being integrated except Extension Manager (“Tools &gt; Extensions and Updates”) shows Intel extensions.<br /><strong>Workaround:</strong> Launch "Visual Studio Installer" and repair Visual Studio installation. Then repair Intel® Parallel Studio XE/Intel® System Studio installation.</p>
<p><a name="issue5.1" id="issue5.1"></a> <strong>#5.1 -- Installation of Intel® Parallel Studio XE/Intel® System Studio into Visual Studio 2017 version 15.3 reports errors once installation is completed</strong></p>
<p>An error is reported that the Visual Studio 2017 version 15.3 integration failed to install correctly and that some Intel® Parallel Studio XE/Intel® System Studio components may not be available in Visual Studio 2017.<br /><strong>Workaround:</strong> Despite the error message, all components are installed successfully; no action is required.</p>
<div><a name="issue5.2" id="issue5.2"></a><strong>#5.2 – Upgrade of Intel® Parallel Studio XE/Intel® System Studio with integration into Visual Studio 2017 version 15.3 (or later) reports errors once installation is completed</strong></div>
<div>The issue is observed with the following product installation scenario:</div>
<ul><li>Installed Intel® Parallel Studio XE 2017 Update 4 / 2018 Beta / Intel® System Studio 2017 Update 3 with integration to Visual Studio 2017. </li>
<li>Visual Studio 2017 upgraded to Update 3 (15.3.3) or later.</li>
<li>Installation of Intel® Parallel Studio 2018 completes with an error; integration into Visual Studio is not successful.</li>
</ul><div>We recommend following steps 2 and 3 of this workaround prior to installing these product versions.</div>
<div><strong>Workaround:</strong></div>
<div>
<ol><li>Uninstall unsuccessfully installed Intel® Parallel Studio XE/Intel® System Studio.</li>
<li>Do <a href="#issue2workaround" rel="nofollow">workaround from issue #2</a>.</li>
<li>Create a .bat file with the content provided below and execute it. The script should be run for every Visual Studio 2017 instance, so "root" should correspond to the appropriate installed Visual Studio version. For example, for the Professional edition:<br /><em>@echo off<br />
rem Please verify that root corresponds to the location of your Microsoft Visual Studio 2017 instance<br />
set root="C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\Common7\IDE\VC\VCTargets"<br />
for /d /r %root% %%d in ("Intel C++ Compiler"*) do if exist "%%d" rmdir /s /q "%%d"<br />
for /r %root% %%f in (Intel.*) do if exist "%%f" del "%%f"</em></li>
<li>Install Intel® Parallel Studio XE/Intel® System Studio again.</li>
</ol></div>
<p><a name="issue6" id="issue6"></a><strong>#6 -- Existing integration to Visual Studio 2017 is broken once Visual Studio upgraded</strong></p>
<p>Visual Studio 2017, or any update of it, is installed and Intel® Parallel Studio XE/Intel® System Studio is integrated. An attempt to upgrade Visual Studio 2017 (for example from 15.0 to 15.3.1, or 15.3.1 to 15.3.2, etc.) results in various issues, e.g.:</p>
<ul><li>when building C/C++ projects with the Intel C++ compiler: <em>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Platforms\Win32\PlatformToolsets\Intel C++ Compiler 18.0\Toolset.targets(127,5): error : Could not expand ICInstallDir variable. Platform toolset may be set to an invalid version number. 1&gt;Done building project "ConsoleApplication1.vcxproj" -- FAILED.</em></li>
<li>Project Properties are empty</li>
<li>Code coverage summary is not shown when you run Code Coverage from Tools &gt; Intel Compiler &gt; Code Coverage…</li>
<li>Other odd behavior with Intel tools integrated with Visual Studio 2017</li>
</ul><p><strong>Workaround: </strong>Close Visual Studio and run an Intel® Parallel Studio XE/Intel® System Studio repair.</p>
<p>Note that the root causes of most issues have been identified and reported to Microsoft*. We are working with Microsoft* to fix all issues in upcoming updates.</p>
<p>There may be other causes of integration issues and failures. We are documenting all cases and providing them to Microsoft for further root-cause analysis.</p>
<p>If your problem is not described in this article, or the suggested workaround doesn't work, please report the problem to Intel through the <a href="https://software.intel.com/en-us/forums/">Intel® Developer Zone Forums</a> or the <a href="http://www.intel.com/supporttickets">Online Service Center</a>. You will need to supply the <a href="https://software.intel.com/en-us/articles/where-can-i-find-the-installation-log-files">installation log file</a> and the error message from the Microsoft installer.</p>
Wed, 26 Jul 2017 07:47:51 -0700Igor V. (Intel)739887Intel® Software Guard Extensions Tutorial Part 9: Power Events and Data Sealing https://software.intel.com/en-us/articles/intel-sgx-tutorial-part-9-power-events-and-data-sealing
<p><a class="button-highlight" href="/protected-download/676750/737211" rel="nofollow">Download</a> [ZIP 598KB]</p>
<p>In part 9 of the <a href="https://software.intel.com/en-us/sgx">Intel® Software Guard Extensions (Intel® SGX)</a> tutorial series we’ll address some of the complexities surrounding the suspend and resume power cycle. Our application needs to do more than just <em>survive</em> power transitions: it must also provide a smooth user experience without compromising overall security. First, we’ll discuss what happens to enclaves when the system resumes from the sleep state and provide general advice on how to manage power transitions in an Intel SGX application. We’ll examine the data sealing capabilities of Intel SGX and show how they can help smooth the transitions between power states, while also pointing out some of the serious pitfalls that can occur when they are used improperly. Finally, we’ll apply these techniques to the Tutorial Password Manager in order to create a smooth user experience.</p>
<p>You can find a list of all the published tutorials in the article <a href="https://software.intel.com/en-us/articles/introducing-the-intel-software-guard-extensions-tutorial-series">Introducing the Intel® Software Guard Extensions Tutorial Series</a>.</p>
<p>Source code is provided with this installment of the series.</p>
<h2>Suspend, Hibernate, and Resume</h2>
<p>Applications must be able to survive a sleep and resume cycle. When the system resumes from suspend or hibernation, applications should return to their previous state, or, if necessary, create a new state specifically to handle the wake event. What applications <em>shouldn’t</em> do is become unstable or crash as a direct result of that change in the power state. Call this the “rule zero” of managing power events.</p>
<p>Most applications don’t actually need special handling for these events. When the system suspends, the application state is preserved because RAM is still powered on. When the system hibernates, the RAM is saved to a special hibernation file on disk, which is used to restore the system state when it’s powered back on. You don’t need to add code to enable or take advantage of this core feature of the OS. There are two notable exceptions, however:</p>
<ul><li>Applications that rely on physical hardware that isn’t guaranteed to be preserved across power events, such as CPU caches.</li>
<li>Scenarios where possible changes to the system context can affect program logic. For example, a location-based application can be moved hundreds of miles while it’s sleeping and would need to re-acquire its location. An application that works with sensitive data may choose to guard against theft by reprompting the user for his or her password.</li>
</ul><p>Our Tutorial Password Manager actually falls into <em>both</em> categories. Certainly, if a laptop running our password manager is stolen, the thief would potentially have access to the victim’s passwords until they explicitly closed the application or locked the vault. The first category, though, may be less obvious: Intel SGX is a hardware feature that is not preserved across power events.</p>
<p>We can demonstrate this by running the Tutorial Password Manager, unlocking the vault, suspending the system, waking it back up, and then trying to read a password or edit one of the accounts. Follow those sequences, and you’ll get one of the error dialogs shown in Figure 1 or Figure 2.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/ae/3d/intel-software-guard-extensions-part-9-power-events-data-sealing-fig01.png" /></p>
<p style="text-align:center"><strong>Figure 1.</strong> Error received when attempting to edit an account after resuming from sleep.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/bf/a3/intel-software-guard-extensions-part-9-power-events-data-sealing-fig02.png" /></p>
<p style="text-align:center"><strong>Figure 2.</strong> Error received when attempting to view an account password after resuming from sleep.</p>
<p>As currently written, the Tutorial Password Manager violates rule zero: it becomes unstable after resuming from a sleep operation. The application needs special handling for power events.</p>
<h2>Enclaves and Power Events</h2>
<p>When a processor leaves S0 or S1 for a lower-power state, the enclave page cache (EPC) is destroyed: all EPC pages are erased along with their encryption keys. Since enclaves store their code and data in the EPC, when the EPC goes away the enclaves go with it. This means that enclaves do not survive power events that take the system to state S2 or lower.</p>
<p>Table 1 provides a summary of the power states.</p>
<p style="text-align:center"><strong>Table 1.</strong> CPU power states</p>
<table border="1" class="grey-alternating-rows"><tbody><tr><td>
<p><strong>State</strong></p>
</td>
<td>
<p><strong>Description</strong></p>
</td>
</tr><tr><td>
<p><strong>S0</strong></p>
</td>
<td>
<p>Active run state. The CPU is executing instructions, and background tasks are running even if the system appears idle and the display is powered off.</p>
</td>
</tr><tr><td>
<p><strong>S1</strong></p>
</td>
<td>
<p>Processor caches are flushed, CPU stops executing instructions. Power to CPU and RAM is maintained. Devices may or may not power off. This is a high-power standby state, sometimes called “power on suspend.”</p>
</td>
</tr><tr><td>
<p><strong>S2</strong></p>
</td>
<td>
<p>CPU is powered off. CPU context and contents of the system cache are lost.</p>
</td>
</tr><tr><td>
<p><strong>S3</strong></p>
</td>
<td>
<p>RAM remains powered to preserve its contents. A standby or sleep state.</p>
</td>
</tr><tr><td>
<p><strong>S4</strong></p>
</td>
<td>
<p>RAM is saved to nonvolatile storage in a hibernation file before powering off. When powered on, the hibernation file is read in to restore the system state. A hibernation state.</p>
</td>
</tr><tr><td>
<p><strong>S5</strong></p>
</td>
<td>
<p>“Soft off.” The system is off but some components are powered to allow a full system power-on via some external event, such as Wake-on-LAN, a system management component, or a connected device.</p>
</td>
</tr></tbody></table><p>Power state S1 is not typically seen on modern systems, and state S2 is uncommon in general. Most systems enter S3 when put to “sleep” and drop to S4 when hibernating to disk.</p>
<p>The Windows* OS provides a mechanism for applications to subscribe to wakeup events, but that won’t help any ECALLs that are in progress when the power transition occurs (nor, by extension, any OCALLs, since they are launched from inside ECALLs). When the enclave is destroyed, the execution context for the ECALL is destroyed with it, any nested OCALLs and ECALLs are destroyed, and the outermost ECALL immediately returns with a status of SGX_ERROR_ENCLAVE_LOST.</p>
<p>It is important to note that any OCALLs that are in progress are destroyed without warning, which means any changes they were making in unprotected memory may be left incomplete. Since unprotected memory is maintained or restored when resuming from the S3 and S4 power states, developers must use reliable and robust procedures to prevent partial-write corruption. Applications must not end up in an indeterminate or invalid state when power resumes.</p>
<h3>General Advice for Managing Power Transitions</h3>
<p>Planning for power transitions begins before a sleep or hibernation event occurs. Decide how extensive the enclave recovery needs to be. Should the application be able to pick up exactly where it left off without user intervention? Will it resume interrupted tasks, restart them, or just abort? Will the user interface, if any, reflect the change in state? The answers to these questions will drive the rest of the application design. As a general rule, the more autonomous and seamless the recovery is, the more complex the program logic will need to be.</p>
<p>An application may also have different levels of recovery at different points. Some stages of an application may be easier to seamlessly recover from than others, and in some execution contexts it may not make sense or even be good security practice to attempt a seamless recovery at all.</p>
<p>Once the overall enclave recovery strategy has been identified, the process of preparing an enclave for a power event is as follows:</p>
<ol><li>Determine the minimal state information and data that needs to be saved in order to reconstruct the enclave.</li>
<li>Periodically seal the state information and save it to unprotected memory (data sealing is discussed below). The sealed state data can be sent back to the main application as an [out] pointer parameter to an ECALL, or the ECALL can make an OCALL specifically to save state data.</li>
<li>When an SGX_ERROR_ENCLAVE_LOST code is returned by an ECALL, explicitly destroy the enclave and then recreate it. <strong>It is strongly recommended that applications explicitly destroy the enclave with a call to <em>sgx_destroy_enclave</em>()</strong>.</li>
<li>Restore the enclave state using an ECALL that is designed to do so.</li>
</ol><p>It is important to save the enclave state to untrusted memory <em>before</em> a power transition occurs. Even if the OS is able to send an event to an application when it is about to enter a standby mode, there are no guarantees that the application will have sufficient time to act before the system physically goes to sleep.</p>
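<p>The recovery flow in steps 2–4 can be sketched as follows. This is a minimal illustration using stand-in types and functions (a real application would use <em>sgx_create_enclave</em>, <em>sgx_destroy_enclave</em>, and actual ECALL wrappers); only the control flow is meant to be representative.</p>

```cpp
// Sketch of steps 2-4 above, with stand-ins for the SGX SDK types so the
// control flow can be shown self-contained. All names here are illustrative.
#include <string>

enum status_t { STATUS_OK, STATUS_ENCLAVE_LOST };

struct Enclave {
    bool alive = true;        // false models a destroyed EPC after suspend
    std::string state;        // state restored into the enclave
};

// Sealed state kept in untrusted memory (step 2).
static std::string g_sealed_state = "sealed-vault-state";

// Stand-in for an ECALL: it fails if the enclave was lost.
status_t do_work_ecall(Enclave &e)
{
    return e.alive ? STATUS_OK : STATUS_ENCLAVE_LOST;
}

// Wrapper implementing steps 3 and 4: on ENCLAVE_LOST, destroy and
// recreate the enclave, restore its state, and retry the operation.
status_t work_with_recovery(Enclave &e)
{
    status_t rv = do_work_ecall(e);
    if (rv == STATUS_ENCLAVE_LOST) {
        e = Enclave();              // destroy and recreate (step 3)
        e.state = g_sealed_state;   // restore sealed state (step 4)
        rv = do_work_ecall(e);      // retry the original operation
    }
    return rv;
}
```

<p>The important property is that the sealed state already sits in untrusted memory before the power event, so recovery never depends on the dying enclave.</p>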
<h2>Data Sealing</h2>
<p>When an enclave needs to preserve data across instantiations, either in preparation for a power event or between executions of the parent application, it needs to send that data out to untrusted memory. The problem with untrusted memory, however, is exactly that: it is untrusted. It is neither encrypted nor integrity checked, so any data sent outside the enclave in the clear is potentially leaking secrets. Furthermore, if that data were to be modified in untrusted memory, future instantiations of the enclave would not be able to detect that the modification occurred.</p>
<p>To address this problem, Intel SGX provides a capability called data sealing. When data is sealed, it is encrypted with advanced encryption standard (AES) in Galois/Counter Mode (GCM) using a 128-bit key that is derived from CPU-specific key material and some additional inputs, guided by one of two key policies. The use of AES-GCM provides both confidentiality of the data being sealed and integrity checking when the data is read back in and unsealed (decrypted).</p>
<p>As mentioned above, the key used in data sealing is derived from several inputs. The two key policies defined by data sealing determine what those inputs are:</p>
<ul><li><strong>MRSIGNER</strong>. The encryption key is derived from the CPU’s key material, the security version number (SVN), and the enclave signing key used by the developer. Data sealed using MRSIGNER can be unsealed by other enclaves on that same system that originate from the same software vendor (enclaves that share the same signing key). The use of an SVN allows enclaves to unseal data that was sealed by previous versions of an enclave, but prevents older enclaves from unsealing data from newer versions. It allows enclave developers to enforce software version upgrades.</li>
<li><strong>MRENCLAVE</strong>. The encryption key is derived from the CPU’s key material and the enclave’s measurement (a cryptographic hash of the enclave’s contents). Data sealed with the MRENCLAVE policy can only be unsealed by that exact enclave on that system.</li>
</ul><p>Note that the CPU is a common component in the two key policies. Each processor has some random, hardware-based key material—physical circuitry on the processor—which is built into it as part of the manufacturing process. This ensures that data sealed by an enclave on one CPU cannot be unsealed by enclaves on another CPU. Each CPU derives a different sealing key, even if all other inputs to the key derivation (enclave measurement, enclave signing key, SVN) are the same.</p>
<p>The data sealing and unsealing API is really a set of convenience functions. They provide a high-level interface to the underlying AES-GCM encryption and 128-bit key derivation functions.</p>
<p>Once data has been sealed in the enclave, it can be sent out to untrusted memory and optionally written to disk.</p>
<h3>Caveats</h3>
<p>There is a caveat with data sealing, though, and it has significant security implications. Your enclave API needs to include an ECALL that will take sealed data as an input and then unseal it. <strong>However, Intel SGX does not authenticate the calling application, so you cannot assume that only your application is loading your enclave.</strong> This means that your enclave can be loaded and executed by <em>anyone</em>, even applications you didn’t write. As you might recall from Part 1, enclave applications are divided into two parts: the trusted part, which is made up of the enclaves, and the untrusted part, which is the rest of the application. These terms, “trusted” and “untrusted,” are chosen deliberately.</p>
<p>Intel SGX cannot authenticate the calling application because this would require a trusted execution chain that runs from system power-on all the way through boot, the OS load, and launching the application. This is far outside the scope of Intel SGX, which limits the trusted execution environment to just the enclaves themselves. Because there’s no way for the enclave to validate the caller, each enclave must be written defensibly. <strong>Your enclave cannot make any assumptions about the application that has called into it.</strong> An enclave must be written under the assumption that <em>any</em> application can load it and execute its API, and that its ECALLs can be executed in <em>any </em>order.</p>
<p>Normally this is not a significant constraint, but sealing and unsealing data complicates matters significantly because both the sealed data and the means to unseal it are exposed to arbitrary applications. <strong>The enclave API must not allow applications to use sealed data to bypass security mechanisms.</strong></p>
<p>Take the following scenario as an example: A file encryption program wants to save end users the hassle of re-entering their password every time the application runs, so it seals their password using the data sealing functions and the MRENCLAVE policy, and then writes the sealed data to disk. When the application starts, it looks for the sealed data file, and if it’s present, reads it in and makes an ECALL to unseal the data and restore the user’s password into the enclave.</p>
<p>The problems with this hypothetical application are two-fold:</p>
<ul><li>It assumes that it is the only application that will ever load the enclave.</li>
<li>It doesn’t authenticate the end user when the data is unsealed.</li>
</ul><p>A malicious software developer can write their own application that loads the same enclave and follows the same procedure (look for the sealed data file, then invoke the ECALL to unseal it inside the enclave). While the malicious application can’t expose the user’s password directly, it <em>can</em> use the enclave’s ECALLs to encrypt and decrypt the user’s files using that stored password, which is nearly as bad: the attacker has gained the ability to decrypt files without knowing the user’s password at all!</p>
<p>A non-Intel SGX version of this same application that offered this same convenience feature would also be vulnerable, but that’s not the point. If the goal is to use Intel SGX features to harden the application’s security, those same features should not be undermined by poor programming practices!</p>
<h2>Managing Power Transitions in the Tutorial Password Manager</h2>
<p>Now that we understand how power events affect enclaves and know what tools are available to assist with the recovery process, we can turn our attention to the Tutorial Password Manager. As currently written, it has two problems:</p>
<ul><li>It becomes unstable after a power event.</li>
<li>It assumes the password vault should remain unlocked after the system resumes.</li>
</ul><p>Before we can solve the first problem we need to address the second one, and that means making some design decisions.</p>
<h3>Sleep and Resume Behavior</h3>
<p>The big decision that needs to be made for the Tutorial Password Manager is whether or not to lock the password vault when the system resumes from a sleep state.</p>
<p>The primary argument for locking the password vault after a sleep/resume cycle is to protect the password database in case the physical system is stolen while it’s suspended. Locking prevents a thief from accessing the password database after waking up the device. However, locking the vault immediately can also be a source of user-interface friction: aggressive power management settings sometimes put a running system to sleep while the user is still in front of the device. If the user wakes the system back up immediately, they might be irritated to find that their password vault has been locked.</p>
<p>This issue really comes down to balancing user convenience against security, so the right approach is to give the user control over the application’s behavior. The default will be for the password vault to lock immediately upon suspend/resume, but the user can configure the application to wait up to 10 minutes after the sleep event before the vault is forcibly locked.</p>
<h3>Intel® Software Guard Extensions and Non-Intel Software Guard Extensions Code Paths</h3>
<p>Interestingly, the default behavior of the Intel SGX code path differs from that of the non-Intel SGX code path. Enclaves are destroyed during the sleep/resume cycle, which means that we <em>effectively</em> lock the password vault as a result. To give the user the illusion that the password vault never locked at all, we have to not only reload the vault file from disk, but also explicitly unlock it again <em>without</em> forcing the user to re-enter their password (this has some security implications, which we discuss below).</p>
<p>For the non-Intel SGX code path, the vault is just stored in regular memory. When the system resumes, system memory is unchanged and the application continues as normal. Thus, the default behavior is that an unlocked password vault remains unlocked when the system resumes.</p>
<h2>Application Design</h2>
<p>With the behavior of the application decided, we turn to the application design. Both code paths need to handle the sleep/resume cycle and place the vault in the correct state: locked or unlocked.</p>
<h3>The Non-Intel Software Guard Extensions Code Path</h3>
<p>This is the simpler of the two code paths. As mentioned above, the non-Intel SGX code path will, by default, leave the password vault unlocked if it was unlocked when the system went to sleep. When the system resumes, the application only needs to check how long the system slept: if the sleep time exceeds the maximum configured by the user, the password vault is explicitly locked.</p>
<p>To keep track of the sleep duration, we’ll need a periodic heartbeat that records the current time. This time will serve as the “sleep start” time when the system resumes. For security, the heartbeat time will be encrypted using the database key.</p>
<h3>The Intel Software Guard Extensions Code Path</h3>
<p>No matter how the application is configured, the system will need code to recreate the enclave and reopen the password vault. This will put the vault in the locked state.</p>
<p>The application will then need to see how long it has been sleeping. If the sleep time was less than the maximum configured by the user, the password vault needs to be explicitly unlocked without prompting the user for his or her master passphrase. In order to do that the application needs the passphrase, and that means the passphrase must be saved to untrusted memory so that it can be read back in when the system is restored.</p>
<p>The only safe way to save a secret to untrusted memory is to use data sealing, but this presents a significant security issue: As mentioned previously, our enclave can be loaded by <em>any</em> application, and the same ECALL that is used to unseal the master password will be available for anyone to use. Our password manager application exposes secrets to the end user (their passwords), and the master password is the only means of authenticating the user. The point of keeping the password vault unlocked after the sleep/resume cycle is to prevent the user from having to authenticate. That means we are creating a logic flow where a malicious user could potentially use our enclave’s API to unseal the user’s master password and then extract their account and password data.</p>
<p>In order to mitigate this risk, we’ll do the following:</p>
<ul><li>Data will be sealed using the MRENCLAVE policy.</li>
<li>Sealed data will be kept in memory only. Writing it to disk would increase the attack surface.</li>
<li>In addition to sealing the password, we’ll also include the process ID. When unsealing the data, the enclave will require that the process ID of the calling process match the one that was saved. If they don’t match, the vault will be left in the locked state.</li>
<li>The current system time will be sealed periodically using a heartbeat function. This will serve as the “sleep start” time.</li>
<li>The sleep duration will be checked in the enclave.</li>
</ul><p>Note that verification logic must be in the enclave where it cannot be modified or manipulated.</p>
<p>This is not a perfect solution, but it helps. A malicious application would need to scrape the sealed data from memory, crash the user’s existing process, and then create new processes over and over until it gets one with the same process ID. It will have to do all of this before the lock timeout is reached (or take control of the system clock).</p>
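<p>The enclave-side verification might look like the following sketch. The structure and field names are illustrative, not the tutorial’s actual code; in the real enclave these checks run on values that have just been unsealed, so they are integrity-protected.</p>

```cpp
// Illustrative enclave-side check: unlock only if the caller is the same
// process that sealed the state and the lock timeout has not expired.
#include <cstdint>

struct vault_state_t {             // illustrative unsealed state
    uint64_t pid;                  // process ID recorded at seal time
    uint64_t lastheartbeat;        // "sleep start" time from the heartbeat
    uint64_t lockafter;            // lastheartbeat + configured lock delay
};

// Decide, inside the enclave, whether the vault may be silently unlocked.
bool may_unlock(const vault_state_t &st, uint64_t caller_pid, uint64_t now)
{
    if (caller_pid != st.pid) return false;    // different process: stay locked
    if (now < st.lastheartbeat) return false;  // clock rolled back: stay locked
    if (now > st.lockafter) return false;      // slept too long: stay locked
    return true;
}
```

<p>Keeping both the heartbeat time and the precomputed deadline, as this sketch does, is what lets the enclave notice a clock that has been rolled backward.</p>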
<h3>Common Needs</h3>
<p>Both code paths will need some common infrastructure:</p>
<ul><li>A timer to provide the heartbeat. We’ll use a timer interval of 15 seconds.</li>
<li>An event handler that is called when the system resumes from a sleep state.</li>
<li>Safe handling for any potential race conditions, since wakeup events are asynchronous.</li>
<li>Code that updates the UI to reflect the “locked” state of the password vault.</li>
</ul><h2>Implementation</h2>
<p>We won’t go over every change in the code base, but we’ll look at the major components and how they work.</p>
<h3>User Options</h3>
<p>The lock timeout value is set in the new Tools -&gt; Options configuration dialog, shown in Figure 3.</p>
<p style="text-align:center"><img src="/sites/default/files/managed/06/07/intel-software-guard-extensions-part-9-power-events-data-sealing-fig03.png" /></p>
<p style="text-align:center"><strong>Figure 3.</strong> Configuration options.</p>
<p>This parameter is saved immediately to the Windows registry under HKEY_CURRENT_USER and is loaded by the application on startup. If the registry value is not present, the lock timeout defaults to zero (lock the vault immediately after going to sleep).</p>
<p>The Intel SGX code path also saves this value in the enclave.</p>
<h3>The Heartbeat</h3>
<p>Figure 4 shows the declaration for the <strong>Heartbeat </strong>class, which is ultimately responsible for recording the vault’s state information. The heartbeat only runs if state information is needed, however. If the user has set the lock timeout to zero, we don’t need to maintain state because we know to lock the vault immediately when the system resumes.</p>
<pre class="brush:cpp;">class PASSWORDMANAGERCORE_API Heartbeat {
    class PasswordManagerCoreNative *nmgr;
    HANDLE timer;

    void start_timer();

public:
    Heartbeat();
    ~Heartbeat();

    void set_manager(PasswordManagerCoreNative *nmgr_in);
    void heartbeat();
    void start();
    void stop();
};
</pre>
<p style="text-align:center"><strong>Figure 4.</strong> The Heartbeat class.</p>
<p>The <strong>PasswordManagerCoreNative </strong>class gains a <strong>Heartbeat</strong> object as a class member, and the <strong>Heartbeat</strong> object is initialized with a reference back to the containing <strong>PasswordManagerCoreNative</strong> object.</p>
<p>The <strong>Heartbeat</strong> class obtains a timer from <em>CreateTimerQueueTimer </em>and executes the callback function <em>heartbeat_proc</em> when the timer expires, as shown in Figure 5. The callback is passed a pointer to the <strong>Heartbeat</strong> object and invokes its <em>heartbeat</em> method, which in turn calls the <em>heartbeat</em> method in <strong>PasswordManagerCoreNative </strong>and restarts the timer.</p>
<pre class="brush:cpp;">static void CALLBACK heartbeat_proc(PVOID param, BOOLEAN fired)
{
    // Call the heartbeat method in the Heartbeat object
    Heartbeat *hb = (Heartbeat *)param;
    hb-&gt;heartbeat();
}

Heartbeat::Heartbeat()
{
    timer = NULL;
}

Heartbeat::~Heartbeat()
{
    if (timer != NULL) DeleteTimerQueueTimer(NULL, timer, NULL);
}

void Heartbeat::set_manager(PasswordManagerCoreNative *nmgr_in)
{
    nmgr = nmgr_in;
}

void Heartbeat::heartbeat()
{
    // Call the heartbeat method in the native password manager
    // object. Restart the timer unless there was an error.
    if (nmgr-&gt;heartbeat()) start_timer();
}

void Heartbeat::start()
{
    stop();
    // Perform our first heartbeat right away.
    if (nmgr-&gt;heartbeat()) start_timer();
}

void Heartbeat::start_timer()
{
    // Set our heartbeat timer. Use the default Timer Queue.
    CreateTimerQueueTimer(&amp;timer, NULL, (WAITORTIMERCALLBACK)heartbeat_proc,
        (void *)this, HEARTBEAT_INTERVAL_SECS * 1000, 0, 0);
}

void Heartbeat::stop()
{
    // Stop the timer (if it exists)
    if (timer != NULL) {
        DeleteTimerQueueTimer(NULL, timer, NULL);
        timer = NULL;
    }
}
</pre>
<p style="text-align:center"><strong>Figure 5. </strong>The Heartbeat class methods and timer callback function.</p>
<p>The heartbeat method in the <strong>PasswordManagerCoreNative</strong> object maintains the state information. To prevent partial-write corruption, it keeps a two-element array of state data and an index identifying the current element (0 or 1). The new state information is obtained from:</p>
<ul><li>The new ECALL <em>ve_heartbeat </em>in the Intel SGX code path (by way of <em>ew_heartbeat</em> in EnclaveBridge.cpp).</li>
<li>The <strong>Vault</strong> method <em>heartbeat </em>in the non-Intel SGX code path.</li>
</ul><p>After the new state has been received, it updates the next element (alternating between elements 0 and 1) of the array, and then updates the index pointer. The last operation is our atomic update, ensuring that the state information is complete before we officially mark it as the “current” state.</p>
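<p>The double-buffering scheme can be sketched as follows. The state type is a stand-in; the important property is that the index is published only after the new element has been completely written.</p>

```cpp
// Two-element state array with an atomically published index. A reader
// always sees the element the index pointed to, which is, by construction,
// a completely written state record.
#include <atomic>
#include <cstdint>

struct state_t {
    uint64_t timestamp;
    uint32_t payload;
};

class StateBuffer {
    state_t slots[2];
    std::atomic<int> current{0};   // index of the last complete state
public:
    void update(const state_t &s)
    {
        int next = 1 - current.load();
        slots[next] = s;           // may be interrupted mid-write...
        current.store(next);       // ...but readers never see a partial slot
    }
    state_t get() const { return slots[current.load()]; }
};
```

<p>Updating the index is the single atomic operation that officially marks the new record as “current,” exactly as described above.</p>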
<h4>Intel Software Guard Extensions code path</h4>
<p>The <em>ve_heartbeat</em> ECALL simply calls the <em>heartbeat </em>method in the <strong>E_Vault </strong>object, as shown in Figure 6.</p>
<pre class="brush:cpp;">int E_Vault::heartbeat(char *state_data, uint32_t sz)
{
    sgx_status_t status;
    vault_state_t vault_state;
    uint64_t ts;

    // Copy the db key
    memcpy(vault_state.db_key, db_key, 16);

    // To get the system time and PID we need to make an OCALL
    status = ve_o_process_info(&amp;ts, &amp;vault_state.pid);
    if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

    vault_state.lastheartbeat = (sgx_time_t)ts;

    // Storing both the start and end times provides some
    // protection against clock manipulation. It's not perfect,
    // but it's better than nothing.
    vault_state.lockafter = vault_state.lastheartbeat + lock_delay;

    // Saves us an ECALL to have to reset this when the vault is restored.
    vault_state.lock_delay = lock_delay;

    // Seal our data with the MRENCLAVE policy. We defined our
    // struct as packed to support working on the address
    // directly like this.
    status = sgx_seal_data(0, NULL, sizeof(vault_state_t), (uint8_t *)&amp;vault_state,
        sz, (sgx_sealed_data_t *)state_data);
    if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

    return NL_STATUS_OK;
}
</pre>
<p style="text-align:center"><strong>Figure 6.</strong> The heartbeat in the enclave.</p>
<p>It has to obtain the current system time and the process ID, and to do this we have added our first OCALL to the enclave, <em>ve_o_process_info</em>. When the OCALL returns, we update our state information and then call <em>sgx_seal_data</em> to seal it into the <em>state_data</em> buffer.</p>
<p>One restriction of the Intel SGX seal and unseal functions is that they can only operate on enclave memory. That means the <em>state_data</em> parameter must be a marshaled data buffer when used in this manner. If you need to write sealed data to a raw pointer that references untrusted memory (one passed with the <em>[user_check]</em> attribute), you must first seal the data to an enclave-local buffer and then copy it over.</p>
<p>The OCALL is defined in EnclaveBridge.cpp:</p>
<pre class="brush:cpp;">// OCALL to retrieve the current process ID and
// local system time.
void SGX_CDECL ve_o_process_info(uint64_t *ts, uint64_t *pid)
{
    DWORD dwpid = GetCurrentProcessId();
    time_t ltime;

    time(&amp;ltime);
    *ts = (uint64_t)ltime;
    *pid = (uint64_t)dwpid;
}
</pre>
<p>Because the heartbeat runs asynchronously, two threads can enter the enclave at the same time. This means the number of Thread Control Structures (TCSs) allocated to the enclave must be increased from the default of 1 to 2. This can be done in one of two ways:</p>
<ol><li>Right-click the Enclave project, select Intel SGX Configuration -&gt; Enclave Settings to bring up the configuration window, and then set Thread Number to 2 (see Figure 7).</li>
<li>Edit the Enclave.config.xml file in the Enclave project directly, and then change the &lt;TCSNum&gt; parameter to 2.</li>
</ol><p style="text-align:center"><img src="/sites/default/files/managed/14/99/intel-software-guard-extensions-part-9-power-events-data-sealing-fig07.png" /></p>
<p style="text-align:center"><strong>Figure 7.</strong> Enclave settings dialog.</p>
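<p>For reference, the second approach amounts to a one-line change in Enclave.config.xml. The values below are placeholders; keep whatever your project template generated and change only the &lt;TCSNum&gt; element:</p>

```xml
<EnclaveConfiguration>
  <ProdID>0</ProdID>
  <ISVSVN>0</ISVSVN>
  <StackMaxSize>0x40000</StackMaxSize>
  <HeapMaxSize>0x100000</HeapMaxSize>
  <TCSNum>2</TCSNum>          <!-- was 1: allow the heartbeat thread -->
  <TCSPolicy>1</TCSPolicy>
  <DisableDebug>0</DisableDebug>
</EnclaveConfiguration>
```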
<h3>Detecting Suspend and Resume Events</h3>
<p>A suspend and resume cycle will destroy the enclave, and that will be detected by the next ECALL. However, we shouldn’t rely on this mechanism to perform enclave recovery, because we need to act as soon as the system wakes up from the sleep state. That means we need an event listener to receive the power state change messages that are generated by Windows.</p>
<p>The best place to capture these is in the user interface layer. In addition to performing the enclave recovery, we must be able to lock the password vault if the system was in the sleep state longer than the maximum sleep time set in the user options. When the vault is locked, the user interface also needs to be updated to reflect the new vault state.</p>
<p>One limitation of the Windows Presentation Foundation* is that it does not provide event hooks for power-related messages. The workaround is to hook in to the message handler for the underlying window handle. Our main application window and all of our dialog windows need a listener so that we can gracefully close each one.</p>
<p>The hook procedure for the main window is shown in Figure 8.</p>
<pre class="brush:cpp;">private IntPtr Main_Power_Hook(IntPtr hwnd, int msg, IntPtr wParam, IntPtr lParam, ref bool handled)
{
    UInt16 pmsg;

    // C# doesn't have definitions for power messages, so we'll get them via C++/CLI. It returns a
    // simple UInt16 that defines only the things we care about.
    pmsg = PowerManagement.message(msg, wParam, lParam);
    if (pmsg == PowerManagementMessage.Suspend)
    {
        mgr.suspend();
        handled = true;
    }
    else if (pmsg == PowerManagementMessage.Resume)
    {
        int vstate = mgr.resume();
        if (vstate == ResumeVaultState.Locked) lockVault();
        handled = true;
    }

    return IntPtr.Zero;
}
</pre>
<p style="text-align:center"><strong>Figure 8.</strong> Message hook for the main window.</p>
<p>To get at the messages, the handler must dip down to native code. This is done using the new <strong>PowerManagement</strong> class, which defines a static function called <em>message</em>, shown in Figure 9. It returns one of four values:</p>
<table border="0" class="grey-alternating-rows"><tbody><tr><td>
<p><strong>PWR_MSG_NONE</strong></p>
</td>
<td>
<p>The message was not a power event.</p>
</td>
</tr><tr><td>
<p><strong>PWR_MSG_OTHER</strong></p>
</td>
<td>
<p>The message was power-related, but not a suspend or resume message.</p>
</td>
</tr><tr><td>
<p><strong>PWR_MSG_RESUME</strong></p>
</td>
<td>
<p>The system has woken up from a low-power or sleep state.</p>
</td>
</tr><tr><td>
<p><strong>PWR_MSG_SUSPEND</strong></p>
</td>
<td>
<p>The system is suspending to a low-power state.</p>
</td>
</tr></tbody></table><pre class="brush:cpp;">UINT16 PowerManagement::message(int msg, IntPtr wParam, IntPtr lParam)
{
    INT32 subcode;

    // We only care about power-related messages
    if (msg != WM_POWERBROADCAST) return PWR_MSG_NONE;

    subcode = wParam.ToInt32();
    if (subcode == PBT_APMRESUMEAUTOMATIC) return PWR_MSG_RESUME;
    else if (subcode == PBT_APMSUSPEND) return PWR_MSG_SUSPEND;

    // Don't care about other power events.
    return PWR_MSG_OTHER;
}
</pre>
<p style="text-align:center"><strong>Figure 9.</strong> The message listener.</p>
<p>We actually listen for both suspend and resume messages here, but the suspend handler does very little work. When a system is transitioning to a sleep state, an application has less than 2 seconds to act on the power message. All we do with the sleep message is stop the heartbeat. This isn’t strictly necessary, and is just a precaution against having a heartbeat execute while the system is suspending.</p>
<p>The resume message is handled by calling the <em>resume </em>method in <strong>PasswordManagerCore</strong>. Its job is to figure out whether the vault should be locked or unlocked. It does this by checking the current system time against the saved vault state (if any). If there’s no state, or if the system has slept longer than the maximum allowed, it returns <code>ResumeVaultState.Locked</code>.</p>
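<p>That decision logic amounts to the following sketch. The types, field names, and return values are illustrative, not the tutorial’s actual code:</p>

```cpp
// Resume-time decision: lock the vault unless we have valid saved state
// and the system slept for less than the user's configured timeout.
#include <cstdint>

enum ResumeVaultState { Locked, Unlocked };

struct saved_state_t {
    bool valid;               // false if no heartbeat state was captured
    uint64_t lastheartbeat;   // "sleep start" time recorded by the heartbeat
    uint64_t lock_delay;      // user's configured timeout, in seconds
};

ResumeVaultState resume_decision(const saved_state_t &st, uint64_t now)
{
    if (!st.valid) return Locked;                        // no state: lock
    if (now < st.lastheartbeat) return Locked;           // clock went backward
    if (now - st.lastheartbeat > st.lock_delay) return Locked; // slept too long
    return Unlocked;
}
```

<p>Defaulting to <code>Locked</code> whenever anything looks wrong keeps the failure mode on the safe side.</p>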
<h3>Restoring the Enclave</h3>
<p>In the Intel SGX code path, the enclave has to be recreated before the enclave state information can be checked. The code for this is shown in Figure 10.</p>
<pre class="brush:cpp;">bool PasswordManagerCore::restore_vault(bool flag_async)
{
    bool got_lock = false;
    int rv;

    // Only let one thread do the restore if both come in at the
    // same time. A spinlock approach is inefficient but simple.
    // This is OK for our application, but a high-performance
    // application (or one with a long-running work loop)
    // would want something else.
    try {
        slock.Enter(got_lock);
        if (_nlink-&gt;supports_sgx()) {
            bool do_restore = true;

            // This part is only needed for enclave-based vaults.
            if (flag_async) {
                // If we are entering as a result of a power event,
                // make sure the vault has not already been restored
                // by the synchronous/UI thread (ie, a failed ECALL).
                rv = _nlink-&gt;ping_vault();
                if (rv != NL_STATUS_LOST_ENCLAVE) do_restore = false;

                // If do_restore is false, then we'll also use the
                // last value of restore_rv as our return value.
                // This will tell us whether or not we should lock the
                // vault.
            }

            if (do_restore) {
                // If the vault file isn't open then we are locked or hadn't
                // been opened to begin with.
                if (!vaultfile-&gt;is_open()) {
                    // Have we opened a vault yet?
                    if (vaultfile-&gt;get_vault_path()-&gt;Length == 0) goto restore_error;

                    // We were explicitly locked, so reopen.
                    rv = vaultfile-&gt;open_read(vaultfile-&gt;get_vault_path());
                    if (rv != NL_STATUS_OK) goto restore_error;
                }

                // Reinitialize the vault from the header.
                rv = _vault_reinitialize();
                if (rv != NL_STATUS_OK) goto restore_error;

                // Now, call to the native object to restore the vault state.
                rv = _nlink-&gt;restore_vault_state();
                if (rv != NL_STATUS_OK) goto restore_error;

                // The database password was restored to the vault. Now restore
                // the vault, itself.
                rv = send_vault_data();

restore_error:
                restore_rv = (rv == NL_STATUS_OK);
            }
        }
        else {
            rv = _nlink-&gt;check_vault_state();
            restore_rv = (rv == NL_STATUS_OK);
        }
        slock.Exit(false);
    }
    catch (...) {
        // We don't need to do anything here.
    }

    return restore_rv;
}
</pre>
<p style="text-align:center"><strong>Figure 10.</strong> The <em>restore_vault</em>() method.</p>
<p>The enclave and vault are reinitialized from the vault data file, and the vault state is restored using the method <em>restore_vault_state</em> in <strong>PasswordManagerCoreNative.</strong></p>
<h4>Which Thread Restores the Vault State?</h4>
<p>The Tutorial Password Manager can have up to three threads executing at any given time. They are:</p>
<ul><li>The main UI</li>
<li>The heartbeat</li>
<li>The power event handler</li>
</ul><p>Only one of these threads should be responsible for actually restoring the enclave, but it is possible that both the heartbeat and the main UI thread are in the middle of an ECALL when a power event occurs. In that case, both ECALLs will fail with the error code SGX_ERROR_ENCLAVE_LOST while the power event handler is executing. Given this potential race condition, it’s necessary to decide which thread is given the job of enclave recovery.</p>
<p>If the lock timeout is set to zero, there won’t be a heartbeat thread at all, so it doesn’t make sense to put enclave recovery logic there. If the heartbeat ECALL returns SGX_ERROR_ENCLAVE_LOST, it simply stops the heartbeat and assumes other threads will deal with it.</p>
<p>That leaves the UI thread and the power event handler, and a good argument can be made that <em>both</em> threads need the ability to recover an enclave. The event handler will catch all suspend/resume cycles immediately, so it makes sense to have enclave recovery happen there. However, as we pointed out earlier, it is entirely possible for a power event to occur during an active ECALL on the UI thread, and there’s no reason to prevent <em>that</em> thread from starting the recovery, especially since it might occur before the power event message is received. This not only provides a safety net in case the event handler fails to execute for some reason, but it also provides a quick and easy retry loop for the operation.</p>
<p>Since we can’t have both of these threads run the recovery at the same time, we need to use locking to ensure that only the first thread to arrive is given the job. The second one simply waits for the first to finish.</p>
<p>It’s also possible that a failed ECALL will complete the recovery process before the event handler enters the recovery loop. To prevent the event handler from blindly repeating the enclave recovery procedure, we have added a quick test to make sure the enclave hasn’t already been recreated.</p>
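<p>The “first thread wins” locking pattern described above can be sketched in isolation. The following is a minimal, self-contained illustration, not the tutorial’s actual code: the names (<em>recover_if_needed</em>, <em>recreate_enclave</em>, the generation counter) are hypothetical stand-ins for the real enclave relaunch and <em>restore_vault</em> sequence.</p>

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical sketch of the "first thread wins" recovery pattern.
static std::mutex recovery_lock;
static std::atomic<int> enclave_generation{0};
static std::atomic<int> recovery_count{0};

static void recreate_enclave() {
    // Stand-in for the real enclave relaunch + restore_vault() sequence.
    ++recovery_count;
    ++enclave_generation;
}

// Each thread remembers the enclave generation it saw before its ECALL
// failed. Only the first thread into the critical section actually
// recovers; the second sees the generation has changed and skips the work.
void recover_if_needed(int generation_seen) {
    std::lock_guard<std::mutex> guard(recovery_lock);
    if (enclave_generation.load() != generation_seen)
        return;  // another thread already recreated the enclave
    recreate_enclave();
}

int run_race() {
    int gen = enclave_generation.load();
    std::thread ui([gen] { recover_if_needed(gen); });     // UI thread path
    std::thread power([gen] { recover_if_needed(gen); });  // power event handler
    ui.join();
    power.join();
    return recovery_count.load();  // exactly one thread performed recovery
}
```

Regardless of which thread wins the lock, the generation check guarantees the recovery runs exactly once, which is the behavior the power event handler’s “quick test” provides in the tutorial.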
<h4>Detection in the UI Thread</h4>
<p>The UI thread detects power events by looking for ECALLs that fail with SGX_ERROR_ENCLAVE_LOST. The wrapper functions in EnclaveBridge.cpp automatically relaunch the enclave and pass the error NL_STATUS_RECREATED_ENCLAVE back up to the <strong>PasswordManagerCore</strong> object.</p>
<p>Each method in <strong>PasswordManagerCore</strong> handles this return code uniquely. Some methods, such as <em>initialize</em>, <em>initialize_from_header</em>, and <em>lock_vault</em>, don’t actually have to restore state at all, but most of the others do, and they call into <em>restore_vault</em> as shown in Figure 11.</p>
<pre class="brush:cpp;">int PasswordManagerCore::accounts_password_to_clipboard(UInt32 idx)
{
	UINT32 index = idx;
	int rv;
	int tries = 3;

	while (tries--) {
		rv = _nlink-&gt;accounts_password_to_clipboard(index);
		if (rv == NL_STATUS_RECREATED_ENCLAVE) {
			if (!restore_vault()) {
				rv = NL_STATUS_LOST_ENCLAVE;
				tries = 0;
			}
		}
		else break;
	}

	return rv;
}
</pre>
<p style="text-align:center"><strong>Figure 11.</strong> Detecting a power event on the main UI thread.</p>
<p>Here, the method gets three attempts to restore the vault before giving up. This retry count of three is an arbitrary limit: it’s not <em>likely</em> that we’ll have multiple power events in rapid succession but it’s possible. Though we don’t want to just give up after one attempt, we also don’t want to loop forever in case there’s a system issue that prevents the enclave from ever being recreated.</p>
<h3>Restoring and Checking State</h3>
<p>The last step is to examine the state data for the vault and determine whether the vault should be locked or unlocked. In the Intel SGX code path, the sealed state data is sent into the enclave where it is unsealed, and then compared to current system data obtained from the OCALL <em>ve_o_process_info</em>. This method, <em>restore_state</em>, is shown in Figure 12.</p>
<pre class="brush:cpp;">int E_Vault::restore_state(char *state_data, uint32_t sz)
{
	sgx_status_t status;
	vault_state_t vault_state;
	uint64_t now, thispid;
	uint32_t szout = sz;

	// First, make an OCALL to get the current process ID and system time.
	// Make these OCALLs so that the parameters aren't supplied by the
	// ECALL (which would make it trivial for the calling process to fake
	// this information).
	status = ve_o_process_info(&amp;now, &amp;thispid);
	if (status != SGX_SUCCESS) {
		// Zap the state data.
		memset_s(state_data, sz, 0, sz);
		return NL_STATUS_SGXERROR;
	}

	status = sgx_unseal_data((sgx_sealed_data_t *)state_data, NULL, 0, (uint8_t *)&amp;vault_state, &amp;szout);

	// Zap the state data.
	memset_s(state_data, sz, 0, sz);

	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;
	if (thispid != vault_state.pid) return NL_STATUS_PERM;
	if (now &lt; vault_state.lastheartbeat) return NL_STATUS_PERM;
	if (now &gt; vault_state.lockafter) return NL_STATUS_PERM;

	// Everything checks out. Restore the key and mark the vault as unlocked.
	lock_delay = vault_state.lock_delay;
	memcpy(db_key, vault_state.db_key, 16);
	_VST_CLEAR(_VST_LOCKED);

	return NL_STATUS_OK;
}
</pre>
<p style="text-align:center"><strong>Figure 12.</strong> Restoring state in the enclave.</p>
<p>Note that unsealing data is programmatically simpler than sealing it: the key derivation and policy information is embedded in the sealed data blob. Unlike data sealing, there is only one unseal function, <em>sgx_unseal_data,</em> and it takes fewer parameters than its counterpart.</p>
<p>This method returns NL_STATUS_OK if the vault is restored to the unlocked state, and NL_STATUS_PERM if it is restored to the locked state.</p>
<h3>Lingering Issues</h3>
<p>The Tutorial Password Manager as currently implemented still has issues that need to be addressed.</p>
<ul><li>There is still a race condition in the enclave recovery logic. Because the ECALL wrappers in EnclaveBridge.cpp immediately recreate the enclave before returning an error code to the <strong>PasswordManagerCore </strong>layer, it is possible for the power event handler thread to enter the <em>restore_vault </em>method after the enclave has been recreated but before the enclave recovery has completed. This can cause the power event handler to return the wrong status to the UI layer, placing the UI in the “locked” or “unlocked” state incorrectly.</li>
<li>We depend on the system clock when validating our state data, but the system clock is actually untrusted. A malicious user can manipulate the time in order to force the password vault into an unlocked state when the system wakes up (this can be addressed by using trusted time, instead).</li>
</ul><h2>Summary</h2>
<p>In order to prevent cold boot attacks and other attacks against memory images in RAM, Intel SGX destroys the Enclave Page Cache whenever the system enters a low-power state. However, this added security comes at a price: software complexity that can’t be avoided. All real-world Intel SGX applications need to plan for power events and incorporate enclave recovery logic because failing to do so will lead to runtime errors during the application’s execution.</p>
<p>Power event planning can rapidly escalate the application’s level of sophistication. The user experience needs of the Tutorial Password Manager took us from a single-threaded application with relatively simple constructs to one with multiple, asynchronous threads, locking, and atomic memory updates via simple journaling. As a general rule, seamless enclave recovery requires careful design and a significant amount of added program logic.</p>
<h2>Sample Code</h2>
<p>The code sample for this part of the series builds against the Intel SGX SDK version 1.7 using Microsoft Visual Studio* 2015.</p>
<h3>Release Notes</h3>
<ul><li>Running a mixed-mode Intel SGX application under the debugger in Visual Studio will cause an exception to be thrown if a power event is triggered. The exception occurs when an ECALL detects the lost enclave and returns SGX_ERROR_ENCLAVE_LOST.</li>
<li>The non-Intel SGX code path was updated to use Microsoft’s DPAPI to store the database encryption key. This is a better solution than the in-memory XOR’ing.</li>
</ul><h2>Coming Up Next</h2>
<p>In Part 10 of the series, we’ll discuss debugging mixed-mode Intel SGX applications with Visual Studio. Stay tuned!</p>
Thu, 22 Jun 2017 15:12:28 -0700John M. (Intel)737211Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memoryhttps://software.intel.com/en-us/articles/intel-vtune-amplifier-vtune-disk-io-analysis-with-intel-optane-memory
<p>This article discusses Intel® VTune™ Amplifier I/O analysis with Intel® Optane™ memory. Benchmark tools such as CrystalDiskMark, Iometer, SYSmark, or PCMark evaluate system I/O efficiency and usually report a score. Power users and PC-gaming enthusiasts may be satisfied with those numbers for performance validation. But what about deeper technical information: identifying slow I/O activities, visualizing I/O queue depth on a timeline, viewing I/O function API call stacks, and correlating all of this with other system metrics for debugging and profiling? Software developers need these clues to understand how efficiently their programs perform I/O. VTune provides such insights with its new feature, the Disk I/O analysis type.</p>
<h1>A bit about I/O Performance metrics</h1>
<p>First, some basics: I/O queue depth, read/write latency, and I/O bandwidth are the metrics used to track I/O efficiency. I/O queue depth is the number of I/O commands waiting in a queue to be served. The queue depth (size) depends on the application, driver, OS implementation, and the host controller interface specification, such as AHCI or NVMe. Compared to AHCI's single-queue design, NVMe's multiple-queue design supports parallel operations.</p>
<p>Imagine a program issuing multiple I/O requests that pass through frameworks, software libraries, a VM or container, runtimes, the OS I/O scheduler, and the driver before reaching the I/O device's host controller. These requests can be temporarily delayed in any of these components due to different queue implementations and other reasons. Observing changes in the system's queue depth helps you understand how busy system I/O is and what the overall access pattern looks like. From the OS perspective, a high queue depth indicates the system is working through pending I/O requests, while a queue depth of zero means the I/O scheduler is idle. From the storage device perspective, a high queue depth design suggests the storage media or controller can serve a large batch of I/O requests faster than a lower queue depth design can. Read/write latency measures how quickly the storage device completes or responds to an I/O request; its inverse corresponds to IOPS (I/O operations per second). I/O bandwidth, finally, is bounded by the capability of the host controller interface. For example, SATA 3.0 has a theoretical bandwidth of 600 MB/s, while NVMe over two PCIe 3.0 lanes can reach roughly 1.87 GB/s.</p>
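<p>The relationship between latency, IOPS, and bandwidth can be sketched with simple arithmetic. The numbers below are hypothetical, chosen only to illustrate the units:</p>

```cpp
// Illustrative arithmetic only (the numbers in the comments are made up,
// not measurements): at queue depth 1, IOPS is simply the inverse of
// per-operation latency, and bandwidth is IOPS times the transfer size.
double iops_from_latency_us(double latency_us) {
    return 1e6 / latency_us;             // operations per second
}

double bandwidth_mb_s(double iops, double block_size_kb) {
    return iops * block_size_kb / 1024;  // MB/s
}

// Example: a device completing one 4 KB read every 100 microseconds
// sustains 10,000 IOPS, or about 39 MB/s -- far below the SATA 3.0
// ceiling of 600 MB/s. Higher queue depths let the device overlap
// requests, which is how measured bandwidth approaches the interface limit.
```

This is also why a single queue-depth-1 latency number and a peak bandwidth number describe very different aspects of the same device.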
<p> </p>
<p><span><img alt="Optane+NAND SSD" height="703" width="1104" src="https://software.intel.com/sites/default/files/managed/0e/22/OptaneOverview.png" /></span></p>
<p> </p>
<p>We expect system I/O performance to increase after adopting Intel® Optane™ memory with Intel® Rapid Storage Technology.</p>
<h1>Insight from VTune for a workload running on Optane enabled setup</h1>
<p><span><img alt="IOAPI_time_ssdvsoptane" height="587" width="1440" src="https://software.intel.com/sites/default/files/managed/ec/5c/IOAPI_time_optane.png" /></span> [figure1]</p>
<p>Figure 1 shows two VTune results for the benchmark program PCMark: one on a single SATA NAND SSD, and one on a SATA NAND SSD plus an additional 16 GB NVMe Optane module in IRST RAID 0 mode. Beyond the basics covered in VTune's online help for Disk I/O analysis, you can also observe effective I/O API time by applying the "Task Domain" grouping view. As VTune indicates, the I/O APIs' CPU time also improves with Optane acceleration. This makes sense, since most of the I/O API calls in this case are synchronous, and I/O media with Optane acceleration responds quickly.</p>
<p><span><img alt="Latency SSD vs Optane" height="494" width="1224" src="https://software.intel.com/sites/default/files/managed/a9/ba/IOAPI_latency_ssdvsoptane.png" /></span></p>
<p>[figure 2]</p>
<p>Figure 2 shows how VTune measures the latency of a single I/O operation. We compare the third FileRead operation of test #3 (importing pictures into Windows Photo Gallery) of the benchmark workload in both configurations. Optane+SSD delivers nearly a 5x speedup for this read operation: 60 µs versus 300 µs.</p>
<p>On Linux targets, VTune also provides a page fault metric. Page faults usually trigger disk I/O to handle page swapping, so the typical way to avoid frequent page-fault-driven disk I/O is to keep more pages in memory rather than swapping them back to disk. Intel® Memory Drive Technology provides a solution to expand memory capacity, and Optane offers the closest proximity to memory speed. Because it is transparent to the application and the OS, it also mitigates the disk I/O penalty and further increases performance. One common misconception is that asynchronous I/O always improves an application's I/O performance. What asynchronous I/O actually adds is responsiveness: it does not force the CPU to wait, as a synchronous I/O API does while the I/O operation is still in flight.</p>
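<p>To illustrate the responsiveness point, here is a generic (non-VTune-specific) sketch of asynchronous I/O using standard C++. The file name is a made-up temporary; the point is that the calling thread keeps working instead of blocking on the read:</p>

```cpp
#include <cstdio>
#include <fstream>
#include <future>
#include <iterator>
#include <string>

// A plain synchronous read: the caller blocks until the I/O completes.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

std::string demo() {
    const std::string path = "async_demo.tmp";  // hypothetical temp file
    std::ofstream(path) << "hello";             // create some test data

    // Launch the read on another thread; this thread stays responsive.
    auto pending = std::async(std::launch::async, read_file, path);

    // ... do other useful work here instead of blocking ...

    std::string result = pending.get();  // wait only when the data is needed
    std::remove(path.c_str());           // clean up the temporary file
    return result;
}
```

Note that the total I/O time is unchanged; what the application gains is the ability to overlap that time with other work.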
<p>Beyond the software design suggestions above, the remaining performance option is to upgrade your hardware to faster media. Intel® Optane™ is Intel's leading-edge non-volatile memory technology, enabling memory-like performance at storage-like capacity and cost. VTune can help you squeeze out even more software performance by providing this kind of insight.</p>
<p><span style="color:rgb(83, 86, 90)">See also</span></p>
<p><a href="https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html">Intel® Optane™ Technology</a></p>
<p><a href="https://www.intel.com/content/www/us/en/architecture-and-technology/rapid-storage-technology.html">Intel® Rapid Storage Technology</a></p>
<p><a href="https://software.intel.com/en-us/system-studio/2017">Check Intel® VTune™ Amplifier in Intel</a>®<a href="https://software.intel.com/en-us/system-studio/2017"> System Studio</a></p>
<p><a href="https://software.intel.com/en-us/node/628099">Intel® VTune™ Amplifier online help - Disk Input and Output Analysis</a></p>
<p><a href="https://software.intel.com/en-us/articles/how-to-use-disk-io-analysis-in-intel-vtune-amplifier-for-systems">How to use Disk I/O analysis in Intel® VTune™ Amplifier for systems</a></p>
<p><a href="https://software.intel.com/en-us/articles/memory-performance-in-a-nutshell">Memory Performance in a Nutshell</a></p>
Wed, 07 Jun 2017 02:54:11 -0700Lin, Joel (Intel)735716Intel® Parallel Studio XE integration to Microsoft* Visual Studio 2017 failshttps://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-update-4-integration-to-vs2017-fails
<p><strong>Issue: </strong>Installation of Intel<span style="color:rgb(31, 73, 125)">®</span> Parallel Studio XE 2017 Update 4 with Microsoft* Visual Studio 2017 integration hangs and fails on some systems. The problem is intermittent and not reproducible on every system. Any attempt to repair it fails with the message "Incomplete installation of Microsoft Visual Studio* 2017 is detected". Note that in some cases the installation may complete successfully with no errors or crashes, yet the VS2017 integration is still not installed. The same issue is observed with Intel<span style="color:rgb(31, 73, 125)">®</span> Parallel Studio XE 2018 Beta.</p>
<p><strong>Environment: </strong>Microsoft* Windows, Visual Studio 2017</p>
<p><strong>Root Cause:</strong> The root cause was identified and reported to Microsoft*. Note that there may be different reasons for integration failures. We are documenting all cases and providing them to Microsoft for further root-cause analysis.</p>
<p><strong>Workaround: </strong></p>
<p>Note that with Intel Parallel Studio XE 2017 Update 4 there is no workaround for this integration problem. The following workaround is expected to be implemented in Intel Parallel Studio XE 2017 Update 5. It is implemented in Intel Parallel Studio XE 2018 Beta Update 1.</p>
<p>Integrate the Intel Parallel Studio XE components manually by running all of the files in the corresponding folders:</p>
<ul><li>C++/Fortran Compiler IDE: &lt;installdir&gt;/ide_support_2018/VS15/*.vsix</li>
<li>Amplifier: &lt;installdir&gt;/VTune Amplifier 2018/amplxe_vs2017-integration.vsix</li>
<li>Advisor: &lt;installdir&gt;/Advisor 2018/advi_vs2017-integration.vsix</li>
<li>Inspector: &lt;installdir&gt;/Inspector 2018/insp_vs2017-integration.vsix</li>
<li>Debugger: &lt;InstallDir&gt;/ide_support_2018/MIC/*.vsix<br />
&lt;InstallDir&gt;/ide_support_2018/CPUSideRDM/*.vsix</li>
</ul><p>If this workaround doesn't work and installation still fails then please report the problem to Intel through the <a href="https://software.intel.com/en-us/forums/">Intel® Developer Zone Forums</a> or <a href="http://www.intel.com/supporttickets">Online Service Center</a>. You will need to supply the <a href="https://software.intel.com/en-us/articles/where-can-i-find-the-installation-log-files">installation log file</a> and error message from Microsoft installer.</p>
Tue, 23 May 2017 10:17:45 -0700Igor V. (Intel)734881Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier
<h3>Overview</h3>
<p> This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and a custom version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® Architecture via Intel® VTune™ Amplifier and the time profiling option of Caffe* itself.</p>
<h3> </h3>
<h3>Introduction to BVLC Caffe* and Intel® Optimized Caffe*</h3>
<p><a href="http://caffe.berkeleyvision.org/" rel="nofollow">Caffe</a>* is a well-known and widely used machine-vision Deep Learning framework developed by the Berkeley Vision and Learning Center (<a href="http://bvlc.eecs.berkeley.edu/" rel="nofollow">BVLC</a>). It is an open-source framework that is still evolving. Before building Caffe*, users can control a variety of options through 'Makefile.config', such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV, MATLAB, and Python. You can easily change these options in the configuration file, and BVLC provides intuitive instructions for developers on its project web page. </p>
<p>Intel® Optimized Caffe* is an Intel-distributed, customized version of Caffe* for Intel® architectures. It offers all the goodness of mainline Caffe* with the addition of Intel architecture-optimized functionality and multi-node distributed training and scoring, making it possible to utilize CPU resources more efficiently.</p>
<p>To see in detail how Intel® Optimized Caffe* has been changed to optimize for Intel architectures, please refer to this page: <a href="https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques">https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques</a></p>
<p>In this article, we will first profile the performance of BVLC Caffe* with the Cifar 10 example, and then profile Intel® Optimized Caffe* with the same example. Profiling is conducted using two different methods.</p>
<p>Tested platform: Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2</p>
<p>1. Caffe* provides its own timing option. For example:</p>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p>2. Intel® VTune™ Amplifier : Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface. <a href="https://software.intel.com/en-us/intel-vtune-amplifier-xe">https://software.intel.com/en-us/intel-vtune-amplifier-xe</a></p>
<p> </p>
<p> </p>
<h3>How to Install BVLC Caffe*</h3>
<p>Please refer to the BVLC Caffe project web page for installation: <a href="http://caffe.berkeleyvision.org/installation.html" rel="nofollow">http://caffe.berkeleyvision.org/installation.html</a></p>
<p>If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.</p>
<p>In your Makefile.config, choose BLAS := mkl and specify the MKL path. (The default is BLAS := atlas.)</p>
<p>In our test, we kept all configuration defaults except the CPU-only option.</p>
<p> </p>
<h3>Test example</h3>
<p>In this article, we use the 'Cifar 10' example included in the Caffe* package.</p>
<p>You can refer to the BVLC Caffe project page for detailed information about this example: <a href="http://caffe.berkeleyvision.org/gathered/examples/cifar10.html" rel="nofollow">http://caffe.berkeleyvision.org/gathered/examples/cifar10.html</a></p>
<p>You can simply run the Cifar 10 training example as follows:</p>
<pre class="brush:cpp;">cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh</pre>
<p>First, we will try Caffe's own benchmark method to obtain its performance results:</p>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p>As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end it shows the average execution time per iteration over 1,000 iterations, per layer and for the entire calculation.</p>
<p><span><img height="538" width="308" src="https://software.intel.com/sites/default/files/managed/c1/9d/Picture1.png" alt="" /></span></p>
<p>This test was run on a Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, running CentOS 7.2.</p>
<p>The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*. </p>
<p>Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.</p>
<p> </p>
<h3>VTune Profiling</h3>
<p>Intel® VTune™ Amplifier is a modern processor performance profiler that is capable of analyzing top hotspots quickly and helping tuning your target application. You can find the details of Intel® VTune™ Amplifier from the following link :</p>
<p>Intel® VTune™ Amplifier : <a href="https://software.intel.com/en-us/intel-vtune-amplifier-xe">https://software.intel.com/en-us/intel-vtune-amplifier-xe</a></p>
<p>We used Intel® VTune™ Amplifier in this article to find the functions with the highest total CPU utilization time, and to see how the OpenMP threads are working.</p>
<p> </p>
<h3>VTune result analysis</h3>
<p> </p>
<p><span><img height="820" width="1798" src="https://software.intel.com/sites/default/files/managed/5c/97/Capture1.PNG" alt="" /></span></p>
<p>Here we can see the functions, listed on the left side of the screen, that are taking most of the CPU time. These are called 'hotspots' and are the candidate functions for performance optimization.</p>
<p>In this case, we will focus on the 'caffe::im2col_cpu&lt;float&gt;' function as an optimization candidate.</p>
<p>'im2col_cpu&lt;float&gt;' is one of the steps in performing direct convolution as a GEMM operation, so that highly optimized BLAS libraries can be used. This function took the largest share of CPU time in our test of training the Cifar 10 model with BVLC Caffe*.</p>
<p>Let's take a look at this function's thread behavior. In VTune™, you can choose a function and filter out other workloads to observe only the workloads of the specified function.</p>
<p><span><img height="625" width="1350" src="https://software.intel.com/sites/default/files/managed/36/a2/Capture2.PNG" alt="" /></span></p>
<p>In the above result, we can see that the CPI (Cycles Per Instruction) of the function is 0.907 and that it utilizes only a single thread for the entire calculation.</p>
<p>VTune provides one more intuitive view here.</p>
<p><span><img height="476" width="948" src="https://software.intel.com/sites/default/files/managed/45/a8/Capture3.PNG" alt="" /></span></p>
<p>This 'CPU Usage Histogram' shows how many CPUs were running simultaneously. The training process appears to have utilized about 25 CPUs. The platform has 64 physical cores, each supporting four hardware threads, for a total of 256 logical CPUs. The histogram therefore suggests that the process is not efficiently threaded.</p>
<p>However, we cannot simply call these results 'bad', because we have not set any performance standard to judge them against. We will compare these results with those of Intel® Optimized Caffe* later.</p>
<p> </p>
<p>Let's move on to Intel® Optimized Caffe* now.</p>
<p> </p>
<h3>How to Install Intel® Optimized Caffe*</h3>
<p>The basic installation procedure for Intel® Optimized Caffe* is the same as for BVLC Caffe*.</p>
<p>When cloning Intel® Optimized Caffe* from Git, use this repository:</p>
<pre class="brush:cpp;">git clone https://github.com/intel/caffe</pre>
<p> </p>
<p>Additionally, Intel® MKL is required to bring out the best performance of Intel® Optimized Caffe*.</p>
<p>Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee with one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.</p>
<p> Intel® MKL : <a href="https://software.intel.com/en-us/intel-mkl">https://software.intel.com/en-us/intel-mkl</a></p>
<p>After downloading Intel® Optimized Caffe* and installing MKL, make sure you choose MKL as your BLAS library in your Makefile.config and point BLAS_INCLUDE and BLAS_LIB at MKL's include and lib folders:</p>
<pre class="brush:cpp;">BLAS :=mkl
BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64</pre>
<p> </p>
<p>If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, please install 'libstdc++-static'. For example:</p>
<pre class="brush:cpp;">sudo yum install libstdc++-static</pre>
<p> </p>
<p> </p>
<p> </p>
<h3>Optimization factors and tunes</h3>
<p>Before we run and test the performance of examples, there are some options we need to change or adjust to optimize performance.</p>
<ul><li>Use 'mkl' as the BLAS library: specify 'BLAS := mkl' in Makefile.config and also configure your MKL include and lib locations.</li>
<li>Force the CPU to run at its maximum frequency (set the minimum P-state to 100% and enable turbo):
<pre class="brush:cpp;">echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo</pre>
</li>
<li>Put 'engine: "MKL2017"' at the top of your train_val.prototxt or solver.prototxt file, or use this option with the caffe tool: -engine "MKL2017"</li>
<li>The current implementation uses OpenMP* threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. You can, however, provide your own configuration through the OpenMP environment variables KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS = 64' was used.</li>
<li>Intel® Optimized Caffe* has modified many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on what other processes are running in the background, it is often useful to adjust the number of threads used by OpenMP*. For Intel Xeon Phi™ product family single-node runs, we recommend OMP_NUM_THREADS = number_of_cores - 2.</li>
<li>Please also refer here: <a href="https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance" rel="nofollow">Intel Recommendation to Achieve the best performance </a></li>
</ul><p>If you observe too much overhead because the OS moves threads too frequently, you can try adjusting the OpenMP* affinity environment variable:</p>
<pre class="brush:cpp;">KMP_AFFINITY=compact,granularity=fine</pre>
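<p>To make the OpenMP* discussion concrete, here is a minimal sketch of the kind of data-parallel loop these environment variables control. This is an illustration, not code from Intel® Optimized Caffe* itself; OMP_NUM_THREADS and KMP_AFFINITY govern how the iterations of such a loop are distributed across threads and pinned to cores:</p>

```cpp
#include <vector>

// Sketch of an OpenMP*-parallelized element-wise operation. When built
// with OpenMP support, the iterations are split across OMP_NUM_THREADS
// threads; without it, the pragma is ignored and the loop runs serially,
// producing identical results either way.
std::vector<float> scale(const std::vector<float>& in, float k) {
    std::vector<float> out(in.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = in[i] * k;
    return out;
}
```

Because each iteration is independent, thread affinity settings such as KMP_AFFINITY=compact,granularity=fine mainly affect cache locality and thread migration overhead, not correctness.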
<p> </p>
<h3>Test example</h3>
<p> For Intel® Optimized Caffe* we run the same example to compare the results with the previous results. </p>
<pre class="brush:cpp;">cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh</pre>
<pre class="brush:cpp;">./build/tools/caffe time \
--model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
-iterations 1000</pre>
<p> </p>
<h3>Comparison</h3>
<p>The results for the above example are the following.</p>
<p>Again, the platform used for the test is a Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2.</p>
<p>First, let's look at the BVLC Caffe* and Intel® Optimized Caffe* results side by side:</p>
<p><span><img height="538" width="308" src="https://software.intel.com/sites/default/files/managed/a1/80/Picture1.png" alt="" /></span> --&gt; <span><img height="546" width="315" src="https://software.intel.com/sites/default/files/managed/ed/6f/Picture2.png" alt="" /></span></p>
<p>To make the comparison easy, please see the table below. The duration of each layer is listed in milliseconds, and the 5th column states how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, except for the bn layers. 'Bn' stands for "batch normalization", which requires fairly simple calculations with little optimization potential. The bn forward layers show better results, while the bn backward layers are 2~3% slower than the original; this slightly worse performance can be a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case.</p>
<table border="0" style="width:451px"><tbody><tr><td style="width:72px"> </td>
<td style="width:72px">Direction</td>
<td style="width:72px">BVLC (ms)</td>
<td style="width:72px">Intel (ms)</td>
<td style="width:163px">Performance Benefit (x)</td>
</tr><tr><td>conv1</td>
<td>Forward</td>
<td>40.2966</td>
<td>1.65063</td>
<td>24.413</td>
</tr><tr><td>conv1</td>
<td>Backward</td>
<td>54.5911</td>
<td>2.24787</td>
<td>24.286</td>
</tr><tr><td>pool1</td>
<td>Forward</td>
<td>162.288</td>
<td>1.97146</td>
<td>82.319</td>
</tr><tr><td>pool1</td>
<td>Backward</td>
<td>21.7133</td>
<td>0.459767</td>
<td>47.227</td>
</tr><tr><td>bn1</td>
<td>Forward</td>
<td>1.60717</td>
<td>0.812487</td>
<td>1.978</td>
</tr><tr><td>bn1</td>
<td>Backward</td>
<td>1.22236</td>
<td>1.24449</td>
<td>0.982</td>
</tr><tr><td>Sigmoid1</td>
<td>Forward</td>
<td>132.515</td>
<td>2.24764</td>
<td>58.957</td>
</tr><tr><td>Sigmoid1</td>
<td>Backward</td>
<td>17.9085</td>
<td>0.262797</td>
<td>68.146</td>
</tr><tr><td>conv2</td>
<td>Forward</td>
<td>125.811</td>
<td>3.8915</td>
<td>32.330</td>
</tr><tr><td>conv2</td>
<td>Backward</td>
<td>239.459</td>
<td>8.45695</td>
<td>28.315</td>
</tr><tr><td>bn2</td>
<td>Forward</td>
<td>1.58582</td>
<td>0.854936</td>
<td>1.855</td>
</tr><tr><td>bn2</td>
<td>Backward</td>
<td>1.2253</td>
<td>1.25895</td>
<td>0.973</td>
</tr><tr><td>Sigmoid2</td>
<td>Forward</td>
<td>132.443</td>
<td>2.2247</td>
<td>59.533</td>
</tr><tr><td>Sigmoid2</td>
<td>Backward</td>
<td>17.9186</td>
<td>0.234701</td>
<td>76.347</td>
</tr><tr><td>pool2</td>
<td>Forward</td>
<td>17.2868</td>
<td>0.38456</td>
<td>44.952</td>
</tr><tr><td>pool2</td>
<td>Backward</td>
<td>27.0168</td>
<td>0.661755</td>
<td>40.826</td>
</tr><tr><td>conv3</td>
<td>Forward</td>
<td>40.6405</td>
<td>1.74722</td>
<td>23.260</td>
</tr><tr><td>conv3</td>
<td>Backward</td>
<td>79.0186</td>
<td>4.95822</td>
<td>15.937</td>
</tr><tr><td>bn3</td>
<td>Forward</td>
<td>0.918853</td>
<td>0.779927</td>
<td>1.178</td>
</tr><tr><td>bn3</td>
<td>Backward</td>
<td>1.18006</td>
<td>1.18185</td>
<td>0.998</td>
</tr><tr><td>Sigmoid3</td>
<td>Forward</td>
<td>66.2918</td>
<td>1.1543</td>
<td>57.430</td>
</tr><tr><td>Sigmoid3</td>
<td>Backward</td>
<td>8.98023</td>
<td>0.121766</td>
<td>73.750</td>
</tr><tr><td>pool3</td>
<td>Forward</td>
<td>12.5598</td>
<td>0.220369</td>
<td>56.994</td>
</tr><tr><td>pool3</td>
<td>Backward</td>
<td>17.3557</td>
<td>0.333837</td>
<td>51.989</td>
</tr><tr><td>ipl</td>
<td>Forward</td>
<td>0.301847</td>
<td>0.186466</td>
<td>1.619</td>
</tr><tr><td>ipl</td>
<td>Backward</td>
<td>0.301837</td>
<td>0.184209</td>
<td>1.639</td>
</tr><tr><td>loss</td>
<td>Forward</td>
<td>0.802242</td>
<td>0.641221</td>
<td>1.251</td>
</tr><tr><td>loss</td>
<td>Backward</td>
<td>0.013722</td>
<td>0.013825</td>
<td>0.993</td>
</tr><tr><td>Ave.</td>
<td>Forward</td>
<td>735.534</td>
<td>21.6799</td>
<td>33.927</td>
</tr><tr><td>Ave.</td>
<td>Backward</td>
<td>488.049</td>
<td>21.7214</td>
<td>22.469</td>
</tr><tr><td>Ave.</td>
<td>Forward-Backward</td>
<td>1223.86</td>
<td>43.636</td>
<td>28.047</td>
</tr><tr><td>Total</td>
<td> </td>
<td>1223860</td>
<td>43636</td>
<td>28.047</td>
</tr></tbody></table><p> </p>
<p>Some of the many reasons this optimization was possible are:</p>
<ul><li>Code vectorization for SIMD</li>
<li>Finding hotspot functions, then reducing their complexity and the amount of computation</li>
<li>CPU- and system-specific optimizations</li>
<li>Reducing thread migration</li>
<li>Efficient OpenMP* utilization</li>
</ul><p> </p>
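<p>To make the first and last bullets concrete, here is a minimal, hedged sketch (not Intel's actual code) of how a hot element-wise loop can be both multi-threaded and vectorized with a single OpenMP* pragma; the function name and sizes are illustrative only:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hot loop: y = alpha * x + y over a large buffer.
// "parallel for" distributes iterations across OpenMP threads;
// "simd" asks the compiler to vectorize each thread's chunk.
// Without OpenMP support the pragma is ignored and the loop runs serially.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(y.size());
#pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

The same pattern (thread-level parallelism on the outer loop, SIMD on the inner work) is what the bullets above describe at the scale of whole Caffe* layers.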
<p>Additionally, let's compare the VTune results of this example between BVLC Caffe* and Intel® Optimized Caffe*.</p>
<p>We will simply look at how efficiently the im2col_cpu function has been utilized.</p>
<p><span><img height="625" width="1350" src="https://software.intel.com/sites/default/files/managed/fd/1e/Capture2.PNG" alt="" /></span></p>
<p>BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single-threaded.</p>
<p><span><img height="622" width="1346" src="https://software.intel.com/sites/default/files/managed/4f/19/Capture4.PNG" alt="" /></span></p>
<p>In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multi-threaded across OpenMP worker threads.</p>
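<p>For readers unfamiliar with the hotspot being discussed: im2col unrolls convolution patches into columns so that the convolution becomes a matrix multiply. The following is a simplified sketch of that pattern (single channel, stride 1, no padding) with the outer loops parallelized, which is roughly where the OpenMP workers in the profile come from; it is not the actual Caffe* implementation:</p>

```cpp
#include <cstddef>
#include <vector>

// Simplified im2col: unrolls every k x k patch of an h x w image into a
// column buffer so convolution can be computed as a GEMM.
// Parallelizing the patch-offset loops is the kind of change that turns
// the single-threaded BVLC hotspot into multi-threaded OpenMP work.
std::vector<float> im2col(const std::vector<float>& img, int h, int w, int k) {
    const int out_h = h - k + 1, out_w = w - k + 1;
    std::vector<float> col(static_cast<std::size_t>(k) * k * out_h * out_w);
#pragma omp parallel for collapse(2)
    for (int ki = 0; ki < k; ++ki)
        for (int kj = 0; kj < k; ++kj)
            for (int oi = 0; oi < out_h; ++oi)
                for (int oj = 0; oj < out_w; ++oj)
                    col[((ki * k + kj) * out_h + oi) * out_w + oj] =
                        img[(oi + ki) * w + (oj + kj)];
    return col;
}
```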
<p>The CPI rate increased here for two reasons: vectorization, which raises CPI because each vector instruction has a longer latency, and multi-threading, which can introduce spinning while threads wait for others to finish their work. In this example, however, the benefits of vectorization and multi-threading outweigh the added latency and overhead, so performance still improves overall.</p>
<p>VTune suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for this function. The training workload for the CIFAR-10 example handles 32 x 32 pixel images in each iteration, so when that workload is split across many threads, each thread receives a very small task, which can cause transition overhead for multi-threading. With larger images we would see less spinning time and a smaller CPI rate.</p>
<p>The CPU Usage Histogram for the whole process also shows better threading results in this case.</p>
<p><span><img height="476" width="948" src="https://software.intel.com/sites/default/files/managed/bf/89/Capture3.PNG" alt="" /></span></p>
<p> </p>
<p><span><img height="611" width="1194" src="https://software.intel.com/sites/default/files/managed/83/0d/Capture5.PNG" alt="" /></span></p>
<p> </p>
<h3>Useful links</h3>
<div>BVLC Caffe* Project : <a href="http://caffe.berkeleyvision.org/" rel="nofollow">http://caffe.berkeleyvision.org/ </a></div>
<div>BVLC Caffe* Git : <a href="https://github.com/BVLC/caffe" rel="nofollow">https://github.com/BVLC/caffe</a></div>
<div> </div>
<div>Intel® Optimized Caffe* Introduction : <a href="https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe">https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe</a></div>
<div>Intel® Optimized Caffe* Git : <a href="https://github.com/intel/caffe" rel="nofollow">https://github.com/intel/caffe</a></div>
<div>Intel® Optimized Caffe* Recommendations for the best performance : <a href="https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance" rel="nofollow">https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance </a></div>
<div>Intel® Optimized Caffe* Modern Code Techniques : <a href="https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques">https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques</a></div>
<div> </div>
<div>
<h3>Summary</h3>
<p>Intel® Optimized Caffe* is a version of Caffe* customized for Intel architectures using modern code techniques.</p>
<p>In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.</p>
<p> </p>
</div>
<div> </div>
Sun, 09 Apr 2017 00:33:05 -0700 | JON J K. (Intel) | 707193