The latest Intel® Xeon® processor E7 v2 family includes a feature called Intel® Advanced Vector Extensions (Intel® AVX), which can potentially improve application performance. Here we will explain the context, and provide an example of how using Intel® AVX improved performance for a commonly known benchmark.

For existing vectorized code that uses floating-point operations, you can gain a potential performance boost when running on newer platforms such as the Intel® Xeon® processor E7 v2 family by doing one of the following:

- Recompiling the code with a compiler option that targets Intel AVX
- Linking with libraries that are optimized for Intel AVX (the Intel® Optimized LINPACK benchmark used below falls into this category)
- Hand-coding with Intel AVX intrinsics or assembly

In this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain from Intel AVX for three workload sizes (30K, 40K, and 75K equations) on the Windows* and Linux* operating systems. I will also share the list of Intel AVX instructions that were executed, along with the equivalent Intel SSE instructions, for developers who are interested in direct coding.

Create three different input files (for the 30K, 40K, and 75K problem sizes) from the sample input file in the “...\linpack” directory.
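As a rough sketch of what one of those input files looks like, here is a 30K example modeled on the lininput files shipped with a typical Intel Optimized LINPACK package; the exact layout and the trial/alignment values may differ across benchmark versions, so treat the values below as illustrative assumptions:

```
Sample Intel(R) Optimized LINPACK Benchmark data file (30K run)
Intel(R) Optimized LINPACK Benchmark data
1      # number of tests
30000  # problem sizes
30000  # leading dimensions
4      # times to run each test
4      # alignment values (in KBytes)
```

The 40K and 75K files would differ only in the problem size and leading dimension lines.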

For the Intel AVX runs, update the files as follows:

For Windows, update the runme_xeon64.bat file to use the new input files that you have created. For Linux, update the runme_xeon64 shell script file to use the new input files.
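For the Linux case, the kind of change involved can be sketched as follows; the binary and input file names here are assumptions based on a typical Intel Optimized LINPACK layout, so check the script shipped with your own package before editing:

```shell
# Hypothetical edit inside runme_xeon64: run each of the three
# input files created earlier and collect the results in one file.
for input in lininput_30k lininput_40k lininput_75k; do
    ./xlinpack_xeon64 "$input" >> linpack_results.txt
done
```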

The results will be reported in Gflops, similar to those shown in Table 2.

For the Intel SSE runs, you will need a processor that does not support Intel AVX; repeat the steps above on that system.

What are the Intel AVX and the equivalent Intel SSE instructions that were executed?

Table 1 has a list of Intel AVX instructions that were executed during the Intel AVX runs. I have provided the equivalent Intel SSE instructions for those developers who are thinking of moving their existing Intel SSE code to Intel AVX.

| Intel AVX Instructions from the LINPACK Runs | Equivalent Intel SSE Instructions (SSE/SSE2/SSE3/SSE4) | Definition |
|---|---|---|
| VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values |
| VBLENDPD | BLENDPD | Blend Packed Double-Precision Floating-Point Values |
| VBROADCASTSD | N/A – introduced with AVX | Copy a 64-bit memory operand to all four elements of a YMM vector register |

Table 1 – Intel AVX instructions executed during the LINPACK runs and their Intel SSE equivalents

What is the performance gain from running the LINPACK benchmark with Intel AVX vs. Intel SSE on the Intel Xeon processor E7-4890 v2 server?

Table 2 shows the results from the three different workloads running on Windows* and Linux*. The Ratio column shows that the LINPACK benchmark produces ~1.6x-1.7x better performance when running with the combination of an Intel AVX-optimized LINPACK and an Intel AVX-capable processor. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.

Windows*

| Workload | Intel AVX (Gflops) | Intel SSE (Gflops) | Ratio: Intel AVX/Intel SSE |
|---|---|---|---|
| LINPACK 30K v11.1.1 | 631.8 | 400.3 | 1.6 |
| LINPACK 40K v11.1.1 | 756.4 | 480.6 | 1.6 |
| LINPACK 75K v11.1.1 | 829.3 | 514.3 | 1.6 |

Linux*

| Workload | Intel AVX (Gflops) | Intel SSE (Gflops) | Ratio: Intel AVX/Intel SSE |
|---|---|---|---|
| LINPACK 30K v11.1.1 | 913.6 | 534.3 | 1.7 |
| LINPACK 40K v11.1.1 | 1023.5 | 621.2 | 1.6 |
| LINPACK 75K v11.1.1 | 1128.8 | 657.0 | 1.7 |

Table 2 – Results and performance gain from the LINPACK benchmark

Conclusion

Our LINPACK experiment shows compelling performance benefits from moving to an Intel AVX-enabled Intel Xeon processor: in our test environment, we measured a performance increase of ~1.6x-1.7x. That is a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with Intel AVX. The reference materials below can help developers learn how to migrate existing Intel SSE code to Intel AVX.