Striking a Balance

This week, AMD is making a couple of very important announcements for developers: support for Intel’s Advanced Vector Extensions (AVX) instruction set in future AMD processors, and the adaptation of AMD’s previous SSE5 instruction set proposal to the AVX framework. The latter step has resulted in three new extensions: XOP (for eXtended Operations), CVT16 (half-precision floating point converts), and FMA4 (four-operand Fused Multiply/Add). In this posting I’ll give an overview of the capabilities that these extensions provide, and also some insight into why we’re taking this step.

First, the why. When we proposed the SSE5 extensions back in mid-2007, they brought some important innovations to the SIMD side of the x86 architecture:

a non-destructive three-operand capability, and a four-operand capability to support some very powerful new operations;

a variety of other new operations to address various holes in the SSE instruction set: shift/rotate, integer compares, integer multiply/accumulate, and half-precision floating point support.

In April of 2008, Intel published its AVX/FMA proposal, which incorporated several of SSE5’s innovations – in particular the three- and four-operand capabilities, the Fused Multiply/Add instructions, and some of the permute instructions – albeit in a somewhat different form. This proposal also added some new capabilities with a new instruction format: doubling the width of SIMD FP operations, applying the non-destructive three-operand capability to most legacy SSE instructions, and greatly expanding the potential opcode space for future extensions.

With this duplication of functionality between SSE5 and AVX/FMA, and AVX’s additional features, we felt the right thing to do was to support AVX. In our minds, a more unified instruction set is clearly what’s best for developers and the x86 software industry. With our acceptance of AVX, a key aspect of this instruction set unification is the stability of the specification. Since we don’t control the definition of AVX, all we can say for sure is that we expect our initial products to be compatible with version 5 of the specification (the most recent one, as of this writing, published in January of 2009), except for the FMA instructions, which we expect will be compatible with version 3 (published in August of 2008).

Why the FMA difference? This was not something we did lightly. In December of 2008, Intel made significant changes to the FMA definition, which we found we could not accommodate without unacceptable risk to our product schedules. Yet we did not want to deprive customers of the significant performance benefits of FMA. So we decided to stick with the earlier definition, renaming it FMA4 (for four-operand FMA – Intel’s newer definition uses what we believe to be a less capable three-operand, destructive-destination format). It will have a different CPUID feature flag from Intel’s FMA extension. At some future point, we will likely adopt Intel’s newer FMA definition as well, coexisting with FMA4. But as you might imagine, we may wait until we’re sure the specification is stable.
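To make the distinction concrete, here is a minimal sketch in Python (a scalar reference model, not intrinsics or the actual instruction encodings). It illustrates two things: what "fused" buys you (the product is not rounded before the add), and the difference between a four-operand, non-destructive form and a three-operand form that overwrites one of its sources. The function names and the dictionary standing in for a register file are hypothetical, chosen only for illustration, and the three-operand model shows just one possible operand ordering.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model a*b + c with a single rounding (finite inputs only)."""
    # Fraction arithmetic is exact, so the only rounding happens in float().
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

def fma4(regs, dst, src1, src2, src3):
    """Four-operand form: dst is distinct, all three sources survive."""
    regs[dst] = fused_multiply_add(regs[src1], regs[src2], regs[src3])

def fma3_destructive(regs, dst_src, src2, src3):
    """Three-operand form: the first source doubles as the destination."""
    regs[dst_src] = fused_multiply_add(regs[dst_src], regs[src2], regs[src3])

# Hypothetical register contents chosen so the product needs more than 53 bits.
regs = {"xmm0": 0.0, "xmm1": 1 + 2**-27, "xmm2": 1 - 2**-27, "xmm3": -1.0}
fma4(regs, "xmm0", "xmm1", "xmm2", "xmm3")
print(regs["xmm0"])                      # fused result keeps the tiny residue
print(unfused_multiply_add(regs["xmm1"], regs["xmm2"], regs["xmm3"]))
```

In this example the exact product is 1 - 2^-54, which rounds to 1.0 as a double, so the unfused version returns 0.0 while the fused model returns -2^-54. After `fma4`, all three source registers still hold their original values; with `fma3_destructive`, the caller would need an extra copy to preserve the overwritten source.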

The fact remains that AVX does not incorporate all of SSE5’s features. Since SSE5 was based on months of discussions with ISVs on what sort of capabilities they felt were needed, and had been positively reviewed by the industry when we first put out the specification, we decided to follow through with development of these additional features. To do so most effectively, we redefined them in the AVX framework, resulting in the XOP extension.

So what changed in the move from SSE5 to XOP? Quite a lot, really. First of all, the instruction formatting was changed to leverage the capabilities that the AVX VEX prefix brings, using a new VEX-like three-byte prefix sequence called (interestingly enough) the XOP prefix. This provides three- and four-operand non-destructive destination encoding, an expansive new opcode space, and extension of SIMD floating point operations to 256 bits. The SSE5 operations that are retained by the XOP extension are:

Horizontal integer add/subtract: Signed or unsigned add, or signed subtract, of adjacent byte, word, or dword elements in the source vector to word, dword or qword elements of the destination vector. 128-bit.

Integer multiply/accumulate: Multiplies elements of two input vectors, adding the results to a third input vector. 128-bit.

Shift/rotate with per-element counts: These use a vector of shift counts, allowing each element of the source vector to be shifted or rotated by a different amount. There is also a rotate instruction with an immediate-byte single count applied to all elements. 128-bit.

Integer compare: Signed and unsigned comparison of byte, word, dword and qword elements, with predicate (mask) generation as in the various SSE compare instructions. The particular comparison to perform is specified in an immediate byte. 128-bit.

Byte permute: A powerful operation which copies bytes from two 16-byte input vectors to a 16-byte destination vector, optionally performing a selected transformation on each, under the control of a third input vector. 128-bit.

Bit-wise conditional move: Selects each bit of the destination vector from either of two input vectors, per a third input vector. 128- and 256-bit.

Fraction extract: Extracts the fractional portion of floating point operands. Scalar and 128- or 256-bit vector, single and double precision.

Half-precision convert: These convert between half-precision and single-precision formats while loading or storing a four- or eight-element vector. They provide dynamic control of rounding and denormalized operand handling. These particular instructions form a separate extension called CVT16, with a distinct CPUID feature flag.
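As a rough illustration of what a half-precision convert does to the data, the sketch below uses Python's `struct` module, whose `e` format code is IEEE 754 binary16. It is only a model of the numeric conversion: it always rounds to nearest-even, whereas the instructions described above offer dynamic control of rounding and denormal handling, and Python floats are doubles rather than singles. The helper names are hypothetical.

```python
import struct

def single_to_half_bits(x: float) -> int:
    """Return the 16-bit binary16 encoding of x (round to nearest even)."""
    # struct raises OverflowError if x is too large for half precision.
    return struct.unpack("<H", struct.pack("<e", x))[0]

def half_bits_to_single(bits: int) -> float:
    """Expand a 16-bit binary16 encoding back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def convert_vector_to_half(values):
    """Convert a four- or eight-element vector element by element."""
    return [single_to_half_bits(v) for v in values]

packed = convert_vector_to_half([1.0, 0.5, -2.0, 0.1])
print([hex(h) for h in packed])
print([half_bits_to_single(h) for h in packed])
```

Values such as 1.0, 0.5, and -2.0 survive the round trip exactly, while 0.1 comes back as 0.0999755859375, which shows the precision given up in exchange for halving the storage and memory bandwidth.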

Along with the FMA4 instructions, the XOP and CVT16 operations support a wide variety of numeric-intensive, multimedia, and cryptographic applications, and allow some new cases of automatic vectorization by compilers. Speaking of compilers, plans are afoot to support these instructions in intrinsic form in various compilers, and they may be used automatically in code generation in some cases.
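For readers who prefer code to prose, here is a minimal Python reference model of a few of the integer operations listed above. It treats a 128-bit vector as a plain list of element values and is meant only to show the data flow, not the instruction encodings or the full set of variants. The function names are hypothetical. The rotate model assumes that a positive count rotates left and a negative count rotates right, and the multiply/accumulate model shows only a wrap-around (non-saturating) form.

```python
MASK32 = 0xFFFFFFFF

def rotate_dwords(src, counts):
    """Rotate each 32-bit element by its own count (assumed: + left, - right)."""
    out = []
    for value, count in zip(src, counts):
        n = count % 32  # a right rotate by k equals a left rotate by 32 - k
        value &= MASK32
        out.append(((value << n) | (value >> (32 - n))) & MASK32)
    return out

def bitwise_select(a, b, selector, width=128):
    """Take each result bit from a where selector is 1, else from b."""
    mask = (1 << width) - 1
    return ((a & selector) | (b & ~selector)) & mask

def horizontal_add_unsigned_bytes(src):
    """Add adjacent unsigned bytes, widening each pair into one word."""
    return [src[i] + src[i + 1] for i in range(0, len(src), 2)]

def multiply_accumulate_dwords(a, b, acc):
    """Per-element a*b + acc, keeping the low 32 bits (wrap-around)."""
    return [(x * y + z) & MASK32 for x, y, z in zip(a, b, acc)]

print([hex(v) for v in rotate_dwords([0x80000001, 0x1, 0xF0000000], [1, -1, 4])])
print(hex(bitwise_select(0xFF00, 0x00FF, 0xF0F0, width=16)))
print(horizontal_add_unsigned_bytes([255, 255, 1, 2]))
print(multiply_accumulate_dwords([2, 0xFFFFFFFF], [3, 2], [4, 5]))
```

The per-element rotate is a good example of the vectorization point: a scalar loop in which every iteration rotates by a different amount maps directly onto one vector operation, instead of needing a separate shift sequence per distinct count.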

A version of the AMD64 SimNow! simulator with support for these extensions is planned for availability in very short order.

I hope I’ve given you a good taste of these new features. For all the details on the XOP and FMA4 extensions, you can find the specification here. I also encourage you to read the blog of our CMO, Nigel Dessau, for an executive perspective on driving innovation into the x86 instruction set. We believe we’ve struck the right balance between innovation and standardization. Feel free to comment or ask questions – we’re always happy to hear from you. As you can see below, we’ve already heard from ten of our technology partners on the subject.

“The addition of AVX support by AMD is a great move as it enables superior performance potential across AMD’s x86 family of processors,” said Wood Lotz, Absoft CEO. “AMD’s use of AVX can also simplify development of high performance compilers and tools for companies like Absoft, and enable customers across a wide variety of industries to build faster applications.”

“Acumem fully supports AMD’s adoption and enhancement of the AVX instructions and will follow this standard as it becomes available in the market. As an ISV for performance tools we clearly see potential for performance improvements with these new additions,” said Mats Nilsson, VP Software Engineering at Acumem.

“Axceleon applauds AMD’s efforts to support both specifications, AVX and SSE5, in their XOP specification proposal. The further enhancements in FMA4 which accelerate floating point algorithms are very important to Axceleon’s HPC customers and will be welcomed across the board,” said Mike Duffy, CEO of Axceleon.

“We at Bibble Labs are constantly looking for performance improvements, and as such we are investigating AVX because of the possible performance advantage it might bring. We also appreciate that AMD is taking an active role to ensure the instruction sets converge and not create separate, conflicting instruction sets,” said Jeff Stephens, Vice President of Product Development, Bibble Labs.

“We commend AMD for taking an active role in open standards, by unifying the x86 instruction set and merging SSE5 into the AVX specification. This can help improve compatibility and simplify the work for developers implementing this. We look forward to investigating AVX for potential advantages it may bring to our real-time applications and plug-ins,” said Noel Borthwick, Chief Technology Officer, Cakewalk.

“We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant,” said Simone Hoefer, General Manager, Technology at Nero AG. “This will help reduce implementation complexity and multiple code-paths. We are confident that the SIMD (SSE/SSE2) optimizations already implemented will scale nicely to 256-bit/AVX, allowing us to truly embrace this new development.”

“Having to choose acceleration solutions that work well on both AMD and Intel CPU platforms, Smith Micro welcomes convergence of the x86 instruction set. AMD supporting AVX is desirable from Smith Micro’s point of view,” said Uli Klumpp, director of engineering, Smith Micro Software, Inc. “The AVX instruction set extensions are looking promising for further optimizing our computationally most demanding software, DCC and data compression products such as Poser and StuffIt.”

“AMD’s adoption of AVX will help Sonic unify some of its engineering efforts and reduce development costs,” said Jim Roth, Chief Technical Officer, Sonic Solutions. “We welcome this initiative and the proposed enhancements to the x86 processor architecture, which we will leverage to increase the responsiveness and performance of our digital media applications.”

“We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant,” said John Freeborg, Vice President of Engineering for Sony Creative Software. “We also appreciate that AMD is taking an active role to ensure these converge and do not create separate, conflicting instruction sets.”

This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.