1. Executive Summary

The Intel® Xeon® processor E7 V2 family, codenamed “Ivy Bridge EX”, is a 2-, 4-, or 8-socket platform based on Intel’s most recent microarchitecture. Ivy Bridge is the 22-nanometer shrink of the “Sandy Bridge” microarchitecture. This product brings additional capabilities for data centers: more cores, more memory bandwidth, and extended Reliability, Availability, and Serviceability (RAS) features. As a result, platforms based on this product family can yield up to a 2X improvement in performance compared to the previous generation Intel Xeon processor E7 family. Additional features (such as Intel® AVX, Intel® Secure Key, and new RAS features) provide opportunities to create faster, more secure, and more resilient applications.

2. Introduction

The Intel Xeon processor E7 V2 family is based on the Ivy Bridge EX microarchitecture, an enhanced version of the Sandy Bridge microarchitecture (http://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview). The platform supporting the Intel Xeon processor E7 V2 family is named “Brickland”. This paper discusses the new features available compared to the previous generation Intel Xeon processor E7 family. Each section explains what developers need to do to take advantage of the new features to improve application performance, security, and reliability.

3. Intel Xeon processor E7 V2 family enhancements

Some of the new features that come with the Intel Xeon processor E7 V2 family include:

Figure 1 shows a block diagram of the 4-socket Intel Xeon processor E7-4800 V2 family microarchitecture. All processors in the family have up to 15 cores (compared to 10 cores in the predecessor), which brings additional computing power to the table. They also have 25% more cache (37.5 MB) and higher memory capacity and bandwidth. With the 22-nm process technology, the Intel Xeon processor E7 V2 family consumes less power during idle periods than its predecessor platform.

Table 1 compares the features of the Intel Xeon processor E7-4800 V2 product family with those of its predecessor, the Intel Xeon processor E7-4800.

Table 1. Comparison of the Intel® Xeon® processor E7–4800 product family to the Intel® Xeon® processor E7–4800 V2 product family

On Jordan Creek-based platforms, exact memory speeds depend on the memory configuration and population rules as well as on the memory controller mode selected in the BIOS (Performance or Lockstep).

The rest of this paper discusses some of the main enhancements in this product family.

3.1 Intel® C104/102 Scalable Memory Buffer

The C104/102 scalable memory buffer available for Intel Xeon processor E7 V2 platforms significantly increases memory capacity: with 24 DDR3 DIMMs of 64 GB each per socket, a 4-socket platform can support up to 6 TB. The Intel Xeon processor E7 V2 family supports DDR3 speeds of up to 1600 MHz. The memory controller has two modes of operation: Performance mode and Lockstep mode. Performance mode is the normal (default) mode of operation, with higher I/O and bandwidth. Lockstep mode uses two memory channels at a time, storing half of each cache line in a DIMM on one channel and the other half in a DIMM on the next, and offers an even higher level of protection. In Lockstep mode, two channels operate as a single channel: each write and read operation moves a data word two channels wide. In three-channel memory systems, the third channel is unused and left unpopulated. Lockstep mode is the most reliable, but it reduces the total system memory bandwidth by one-third in most systems. The mode is configurable in the BIOS, which is often set to Performance mode by default.
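The capacity figure above follows from simple multiplication. The short C sketch below reproduces it; the function name platform_memory_gb is illustrative only and does not come from any Intel tool or library.

```c
#include <assert.h>
#include <stdint.h>

/* Total platform memory in GB: sockets x DIMMs per socket x GB per DIMM.
 * Hypothetical helper, used only to check the arithmetic in the text. */
uint64_t platform_memory_gb(unsigned sockets, unsigned dimms_per_socket,
                            unsigned gb_per_dimm)
{
    assert(sockets > 0 && dimms_per_socket > 0 && gb_per_dimm > 0);
    return (uint64_t)sockets * dimms_per_socket * gb_per_dimm;
}
```

For the configuration quoted above, platform_memory_gb(4, 24, 64) gives 6144 GB, which is 6 TB (1.5 TB per socket).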

3.2 Intel® Secure Key (DRNG)

Intel Secure Key (Digital Random Number Generator: DRNG) is a hardware approach to high-quality and high-performance entropy and random number generation. The entropy source is thermal noise within the silicon.

Figure 2. Digital Random Number Generator using RDRAND instruction

Figure 2 shows a block diagram of the Digital Random Number Generator. The entropy source outputs a random stream of bits at the rate of 3 GHz that is sent to the conditioner for further processing. The conditioner takes pairs of 256-bit raw entropy samples generated by the entropy source and reduces them to a single 256-bit conditioned entropy sample. This sample is passed to a deterministic random bit generator (DRBG) that spreads it into a large set of random values, thus increasing the number of random values available from the module. DRNG is compliant with ANSI X9.82 and NIST SP800-90 and is certifiable to FIPS-140-2.

Since DRNG is implemented in hardware as a part of the processor, both the entropy source and DRBG execute at processor clock speeds. There is no system I/O required to obtain entropy samples and no off-chip bus latencies to slow entropy transfer. DRNG is scalable enough to support heavy server application workloads and multiple VMs.

DRNG can be accessed through a new instruction named RDRAND. RDRAND takes the random value generated by DRNG and stores it in a 16-bit, 32-bit, or 64-bit destination register (the size of the destination register determines the size of the random value). RDRAND support can be enumerated via CPUID.1.ECX[30], and the instruction is available at all privilege levels and operating modes. The performance of the RDRAND instruction depends on the bus infrastructure; it varies between processor generations and families.

Software developers can use the RDRAND instruction either through cryptographic libraries (OpenSSL* 1.0.1) or through direct application use (assembly functions). The Intel® Compiler (starting with version 12.1), Microsoft Visual Studio* 2012, and GCC* 4.6 support the RDRAND instruction.
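As a minimal sketch of direct application use, the C code below checks the CPUID feature bit and then requests a 32-bit value with a bounded retry loop. It assumes a GCC- or Clang-compatible compiler (for the cpuid.h helper and the target attribute); the function names rdrand_supported and rdrand32_retry, and the idea of passing the retry count as a parameter, are illustrative choices rather than part of any Intel API. Only _rdrand32_step is the actual compiler intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* GCC/Clang helper for the CPUID instruction */
#include <immintrin.h>  /* _rdrand32_step intrinsic */

/* Returns 1 if CPUID.1:ECX[30] reports RDRAND support, 0 otherwise. */
int rdrand_supported(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 30) & 1u);
}

/* Tries RDRAND up to `retries` times.
 * Returns 1 and writes *out on success, 0 if no value was obtained.
 * Call only after rdrand_supported() has returned 1. */
__attribute__((target("rdrnd")))
int rdrand32_retry(uint32_t *out, int retries)
{
    assert(out != NULL);
    for (int i = 0; i < retries; i++) {
        unsigned int value;
        if (_rdrand32_step(&value)) {  /* 1 means a value was delivered */
            *out = value;
            return 1;
        }
    }
    return 0;
}
#else
/* Non-x86 builds: report that RDRAND is unavailable. */
int rdrand_supported(void) { return 0; }
int rdrand32_retry(uint32_t *out, int retries)
{
    assert(out != NULL);
    (void)retries;
    return 0;
}
#endif
```

A caller would typically test rdrand_supported() once at startup and fall back to another random source if it returns 0 or if rdrand32_retry() fails.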

Microsoft Windows* 8 uses the DRNG as an entropy source to improve the quality of output from its cryptographically secure random number generator. Linux* distributions based on the 3.2 kernel use DRNG inside the kernel for random timings. Linux distributions based on the 3.3 kernel use it to improve the quality of random numbers coming from /dev/random and /dev/urandom, but not the quantity. That being said, Red Hat Fedora* Core 18 ships with the rngd daemon enabled by default, which will use DRNG to increase both the quality and quantity of random numbers in /dev/random and /dev/urandom.
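Applications that rely on the operating system rather than on RDRAND directly can simply read from /dev/urandom. The sketch below shows one way to do this in C on a Linux system; the function name read_urandom is illustrative, and whether the bytes are influenced by DRNG depends on the kernel and daemon configuration described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fills buf with len bytes from /dev/urandom.
 * Returns 0 on success, -1 if the device cannot be opened or read. */
int read_urandom(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}
```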

For more details on DRNG and RDRAND instruction, refer to the Intel DRNG Software Implementation Guide.

3.3 Intel® OS Guard

Support for Intel OS Guard must be present in the operating system (OS) or Virtual Machine Monitor (VMM) you are using. Please contact your OS or VMM providers to determine which versions include this support. No changes are required at the BIOS or application level to use this feature.

3.4 Intel® Advanced Vector Extensions (Intel® AVX)

Intel® AVX is a new 256-bit instruction set extension designed for applications that are floating-point (FP) intensive. This product family also introduces a new set of instructions to convert between single-precision and half-precision floating-point formats.

Figure 4. Intel® Advanced Vector Extensions Instruction Format

Intel AVX introduces the following architectural enhancements:

Support for 256-bit wide vectors and SIMD register set.

Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility and efficient encoding of new instruction extensions.

Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.

Instruction encoding format using a new prefix (referred to as VEX) to provide compact, efficient encoding for three-operand syntax, vector lengths, compaction of SIMD prefixes and REX functionality.

Intel AVX employs an instruction encoding scheme using a new prefix (known as a “VEX” prefix). Instruction encoding using the VEX prefix can directly encode a register operand within the VEX prefix. This supports two new instruction syntaxes in Intel 64 architecture:

In four-operand syntax, the extra register operand is encoded in the immediate byte. The introduction of three-operand and four-operand syntaxes helps to reduce the number of register to register copies, thus making the programming more efficient.

Intel AVX improves performance due to wider vectors, new extensible syntax, and rich functionality which results in better data management. Applications that could benefit from Intel AVX include general purpose applications like image, audio/video processing, scientific simulations, financial analytics and 3D modeling and analysis.

Operating system and compiler support are needed to execute applications with Intel AVX. Supporting operating systems include Linux* 2.6.30 or later, Windows* 7 SP1 or later, and Windows Server* 2008 R2 SP1 or later. Compilers supporting Intel AVX include the Intel C/C++ and Fortran Compilers version 11.1 or later, Microsoft* Visual Studio 2010 or later, and GCC* 4.4.1 or later.

There are a couple of ways a developer can make use of Intel AVX in their applications:

Re-compiling the application with the appropriate compiler – developers who do not want to modify their code can rebuild the application with one of the compilers mentioned above, using the right switches to turn on AVX optimizations. On Windows, using the Intel compiler, use the command line switch /QxAVX. On Linux, use -xavx. The switches /QaxAVX (Windows) and -axavx (Linux) may be used to build applications that take advantage of AVX instructions on Intel systems that support them, but use only SSE instructions on other Intel or compatible, non-Intel systems. For the Microsoft Visual Studio compiler, use the flag /arch:AVX to enable AVX optimizations.

Hand-optimizing the application using intrinsics – developers can modify the relevant portions of their code to use intrinsics. Please refer to http://software.intel.com/en-us/articles/intel-intrinsics-guide for more details on the intrinsics. The Intel® C++ Compiler supports Intel AVX-based intrinsics via the header file immintrin.h.
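To illustrate the intrinsics route, the C sketch below adds two float arrays eight elements at a time using 256-bit AVX intrinsics from immintrin.h, with a scalar fallback. It assumes a GCC- or Clang-compatible compiler (for the target attribute and the __builtin_cpu_supports check); the function names add_avx and add_floats are illustrative and not taken from any Intel sample.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX path: processes eight floats per iteration in 256-bit registers. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned 256-bit load */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                       /* scalar tail */
        out[i] = a[i] + b[i];
}
#endif

/* Computes out[i] = a[i] + b[i], taking the AVX path only when the
 * compiler runtime reports that AVX is available. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        add_avx(a, b, out, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The runtime check plays a role similar to the /QaxAVX and -axavx switches described above: one binary runs on both AVX and non-AVX systems.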

3.5 APIC Virtualization (APICv)

A significant amount of the performance overhead in machine virtualization is due to Virtual Machine (VM) exits. Every VM exit can cause a penalty of approximately 2,000 – 7,000 CPU cycles (see Figure 5), and a significant portion of these exits are for APIC and interrupt virtualization. Whenever a guest operating system tries to read an APIC register, the VM has to exit and the Virtual Machine Monitor (VMM) has to fetch and decode the instruction.

The Intel Xeon processor E7 V2 family introduces support for APIC virtualization (APICv); in this context, the guest OS can read most APIC registers without requiring VM exits. Hardware and microcode emulate (virtualize) the APIC controller, thus saving thousands of CPU cycles and improving VM performance.

Figure 5. APIC Virtualization

This feature must be enabled at the VMM layer: please contact your VMM supplier for their roadmap on APICv support. No application-level changes are required to take advantage of this feature.
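To get a rough feel for what the per-exit penalty quoted above means in practice, the C sketch below turns an exit rate into a range of cycles per second and a fraction of one core. The exit rate and clock frequency passed in are hypothetical inputs chosen by the caller, not measurements, and the names vm_exit_cycles and core_fraction are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Per-exit cycle cost range quoted in the text above. */
#define VM_EXIT_CYCLES_LOW  2000u
#define VM_EXIT_CYCLES_HIGH 7000u

struct cycle_range {
    uint64_t low;
    uint64_t high;
};

/* Cycles per second spent on VM exits, as a low/high range. */
struct cycle_range vm_exit_cycles(uint64_t exits_per_second)
{
    struct cycle_range r;
    r.low = exits_per_second * VM_EXIT_CYCLES_LOW;
    r.high = exits_per_second * VM_EXIT_CYCLES_HIGH;
    return r;
}

/* Fraction of one core consumed by `cycles_per_second` at `core_hz`. */
double core_fraction(uint64_t cycles_per_second, double core_hz)
{
    assert(core_hz > 0.0);
    return (double)cycles_per_second / core_hz;
}
```

With a hypothetical 10,000 APIC-related exits per second on a 2.0 GHz core, this gives 20 to 70 million cycles per second, or 1% to 3.5% of that core.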

3.6 PCI Express Enhancements

The Intel Xeon processor E7 V2 family supports PCIe atomic operations (as a completer). Today, message-based transactions are used for PCIe devices, and these use interrupts that can experience long latency, unlike CPU updates to main memory that use atomic transactions. An Atomic Operation (AtomicOp) is a single PCIe transaction that targets a location in memory space, reads the location’s value, potentially writes a new value back to the location, and returns the original value. This “read-modify-write” sequence to the location is performed atomically. This is a new operation added per PCIe Specification 3.0. FetchAdd, Swap, and CAS (Compare and Swap) are the new atomic transactions.
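The semantics of the three AtomicOps can be illustrated with their CPU-side counterparts in C11. The sketch below is only an analogy using stdatomic.h on ordinary memory; it does not issue PCIe transactions, and the function names op_fetch_add, op_swap, and op_cas are illustrative. Each function returns the original value of the location, as the PCIe operations do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* FetchAdd: add `delta` to *loc and return the value it held before. */
uint32_t op_fetch_add(_Atomic uint32_t *loc, uint32_t delta)
{
    assert(loc != NULL);
    return atomic_fetch_add(loc, delta);
}

/* Swap: store `value` unconditionally and return the old value. */
uint32_t op_swap(_Atomic uint32_t *loc, uint32_t value)
{
    assert(loc != NULL);
    return atomic_exchange(loc, value);
}

/* CAS: store `desired` only if *loc equals `expected`.
 * Returns the value found in *loc before the operation in both cases. */
uint32_t op_cas(_Atomic uint32_t *loc, uint32_t expected, uint32_t desired)
{
    assert(loc != NULL);
    /* On failure, `expected` is overwritten with the value actually found;
     * on success it already equals the old value. */
    atomic_compare_exchange_strong(loc, &expected, desired);
    return expected;
}
```

A lock-free counter update, for example, is a single op_fetch_add call with no separate read, lock, or interrupt.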

The benefits of atomic operations include:

Lower overhead for synchronization

Lock-free statistics (e.g. counter updates)

Performance enhancement for device drivers

The Intel Xeon processor E7 V2 family also supports a x16 non-transparent bridge. All of these features contribute to better I/O performance.

These PCIe features are inherently transparent and require no application changes.

3.7 New RAS features

PCIe Live Error Recovery (LER) - When errors are detected by the PCIe root port, the Live Error Recovery (LER) feature will bring down the PCIe link associated with the affected root port within one cycle and then automatically recover the link. PCIe LER also protects against the transfer of associated corrupt data during this process. PCIe LER allows the system to recover from PCIe errors that would otherwise cause the system to crash. Note that this is implemented in the hardware and there is no need to change anything in the software programs.

Machine Check Architecture (MCA) recovery – Execution Path – This enables software layers (OS, VMM, DBMS, and application) to assist in system recovery from data errors that are not correctable at the hardware level. Execution path extends prior recovery capabilities to include errors detected in data passed to the CPU. MCA recovery allows the system to recover from certain errors that would otherwise be fatal. Execution path enabling greatly increases the number of situations where recovery is possible by extending this capability to data errors detected in the execution engine (core).

Enhanced Machine Check Architecture (eMCA) Gen 1 - Enhanced MCA is a new Xeon RAS capability that allows firmware to enhance the error logging capabilities of Machine Check Architecture. Enhanced MCA can be configured to provide more information to the software layer about error conditions enabling better recovery and better error identification. In Gen1 this information is provided for corrected memory errors and uncorrected errors. eMCA Gen 1 provides enhanced error log information to the OS, VMM, DBMS that can be used to extend software recovery capabilities and to provide better diagnostic and predictive failure analysis for the system. This enables higher levels of uptime and reduced service costs.

Machine Check Architecture (MCA) recovery for I/O - MCA recovery for I/O incorporates PCIe uncorrected (non-fatal and fatal) errors into the architecturally supported MCA mechanism for communicating error information to the software layer (OS, VMM, DBMS) to improve error identification and recovery. MCA recovery for I/O enables the system to recover from I/O error conditions that would otherwise have been fatal, increasing overall uptime. It also provides additional diagnostic information on the source of errors, which can reduce service costs.

The new RAS features require additional enabling. Please refer to the Appendix for the operating systems and VMMs that support these new features.

4. Conclusion

In summary, the Intel Xeon processor E7 V2 family, combined with the Brickland platform, provides many new and improved features that could significantly change your performance and power experience on enterprise platforms. Developers can make use of most of these new features without making any changes to their applications.

Appendix

Figure 6: Intel® Xeon® Processor E7 Family RAS Features OS Support Summary
** New features will be supported in upcoming OS releases. Please contact OS vendors for additional details.
¥ denotes new features.