Introduction

With this week's introduction of the x52 line of Opteron processors, AMD is giving us a little look into the future of their Athlon 64 line. As mentioned in our article on Monday, the new 2.6GHz speed grade is also introducing the new E4 stepping, which adds SSE3 support. The new Opteron also received a face lift in that it is fabbed on a 90nm process, runs coherent HT links at 1GHz, and comes in a shiny new organic package rather than the older ceramic.

The goal of this article is to take a quick look at what SSE3 brings to the table for Opteron and the future revision E Athlon 64 cores. As desktop parts do not enable coherent HT links at all, the 1GHz support won't matter. Also, the newer A64 parts are already 90nm on organic packages. Other than the usual small tweaks we see between steppings, the only thing that will be new across the board for K8 processors is SSE3.

What exactly is SSE3? Intel introduced SSE3 as Prescott New Instructions last year. These instructions are generally additions to the SIMD (single instruction multiple data) capabilities of the processor. SIMD processing is based on the idea that processors must sometimes take large amounts of data and perform similar operations across the entire set. This lends itself well to things like audio and video processing, where large amounts of data flow through the processor, undergoing roughly the same operations, in preparation for display. The same philosophy applies to graphics: modern graphics cores incorporate many SIMD processing units in order to churn through vector and pixel data as fast as possible. SIMD processing has also largely overshadowed the use of the x87 floating point unit on x86 processors. Because of this, it is advantageous for AMD to support Intel's SIMD extensions as quickly as possible.

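With SSE3, Intel added 10 new instructions targeted at SIMD as well as 3 other instructions that don't touch the SSE registers (fisttp, monitor, mwait). Here's a brief list of SSE3 instructions and what they are for:

fisttp: x87 float to integer conversion (with truncation)
addsubps, addsubpd, movsldup, movshdup, movddup: complex math
lddqu: unaligned 128bit loads (video encoding)
haddps, hsubps, haddpd, hsubpd: horizontal add and subtract (graphics)
monitor, mwait: thread synchronization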

The float to integer conversion is rather obvious in function, but some of the other instructions are a little mysterious. The complex math instructions extend functionality for complex numbers. The hadd and hsub instructions are horizontal additions and horizontal subtractions. These allow faster processing of data stored "horizontally" in (for example) vertex arrays. Here is a 4-element array of vertex structures:

x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4

SSE and SSE2 are organized such that performance is better when processing vertical data, or structures that contain arrays; for example, a vertex structure with 4-element arrays for each component:

x1 x2 x3 x4
y1 y2 y3 y4
z1 z2 z3 z4
w1 w2 w3 w4

Generally, the preferred organizational method for vertices is the former (an array of structures). Under SSE2, the compiler (or very unfortunate programmer) would have to reorganize the data during processing.
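To make the horizontal add concrete, here is a minimal sketch in portable C that models what haddps does to two 4-float registers and uses it to take a dot product on vertices kept in the array-of-structures layout. The names vec4, hadd_ps and dot4_aos are made up for illustration. This is a scalar model, not the instruction itself; real SSE3 code would use the instruction or a compiler intrinsic such as _mm_hadd_ps.

```c
#include <assert.h>
#include <stddef.h>

/* Four packed single-precision floats, standing in for one XMM register.
   (Hypothetical type for illustration.) */
typedef struct { float f[4]; } vec4;

/* Scalar model of haddps: adjacent pairs are summed. The pairs from the
   first operand land in the low two lanes, the pairs from the second
   operand in the high two lanes. */
static vec4 hadd_ps(vec4 a, vec4 b)
{
    vec4 r;
    r.f[0] = a.f[0] + a.f[1];
    r.f[1] = a.f[2] + a.f[3];
    r.f[2] = b.f[0] + b.f[1];
    r.f[3] = b.f[2] + b.f[3];
    return r;
}

/* Dot product of every vertex in an array of {x, y, z, w} structures with
   a fixed vector w. No reshuffling into x/y/z/w arrays is needed. */
static void dot4_aos(const vec4 *verts, size_t n, vec4 w, float *out)
{
    assert(verts != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        vec4 p;
        for (int k = 0; k < 4; k++)
            p.f[k] = verts[i].f[k] * w.f[k];  /* vertical multiply */
        p = hadd_ps(p, p);  /* {x+y, z+w, x+y, z+w} */
        p = hadd_ps(p, p);  /* full sum in every lane */
        out[i] = p.f[0];
    }
}
```

Two horizontal adds collapse the four products of one vertex into a single sum, which is the step that otherwise requires shuffles or a data reorganization under SSE2.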

The lddqu instruction is designed to reduce the impact of unaligned 128bit memory accesses, which happen quite often in video processing. Rather than loading the requested 16 bytes directly, lddqu loads 256 bits of data starting on a 16-byte boundary and then extracts the correct 16 bytes (as requested) from that 32-byte block. Under SSE2, 64bit loads are executed and then the data is recombined.
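The load-and-extract behavior described above can be sketched in portable C. This is only a model of the description, with memory represented as a byte array whose index 0 is assumed to sit on a 16-byte boundary; the function name lddqu_model and its parameters are hypothetical. Real code would use the instruction or the _mm_lddqu_si128 intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of an unaligned 16-byte load done as one aligned 32-byte read.
   mem[0] is assumed to be 16-byte aligned; addr is a byte offset into mem. */
static void lddqu_model(const uint8_t *mem, size_t mem_len,
                        size_t addr, uint8_t out[16])
{
    size_t base  = addr & ~(size_t)15;  /* round down to a 16-byte boundary */
    size_t shift = addr & 15;           /* position of the data in the block */
    uint8_t block[32];

    assert(mem != NULL && out != NULL);
    assert(base + 32 <= mem_len);       /* the model reads the whole block */

    memcpy(block, mem + base, 32);      /* one aligned 256-bit read */
    memcpy(out, block + shift, 16);     /* extract the requested 16 bytes */
}
```

Because the shift is at most 15, the requested 16 bytes always fit inside the 32-byte block, so a single wide read is enough regardless of alignment.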

In order to test these features as implemented by AMD, we tested an Opteron 250 against an Opteron 252. Using CrystalCPUID, we were able to set the multiplier of the Opteron 252 to 12 (through PowerNow!), which on the 200MHz reference clock matches the 2.4GHz of the Opteron 250. This way, we'll have a direct clock-for-clock comparison of the two cores.

We ran both processors in HP's wx9300 workstation. We used a single CPU configuration and 4x 512MB of RAM at 3:3:3:8. Windows XP SP2 was used in our tests. In an MP environment (with more memory bandwidth), the Opteron has a greater potential for improvement with SSE3. Unfortunately, we were unable to perform a direct comparison of the older and newer cores under a DP configuration. Attempting to use PowerNow! to adjust the multiplier with more than 1 processor installed resulted in a BSOD (machine check exception).

"#29, XviD is an *UNLICENSED MPEG-4 HACK*. That's just a fact. DivX is a MPEG-4 licensee, XviD is not. "

This is pretty silly. How could a piece of code get an MPEG4 license? Obviously it can't, which is why neither Xvid nor Divx code is licensed. Only a (compiled) product can be licensed to use MPEG4.

Anyone selling an MPEG4 product is welcome to use Xvid and it's perfectly legal, but they must pay a license fee for each product sold, as Divx does when you buy their product. It's the same situation as LAME: when you use it without having paid for a license, you're violating some patents. But you're free to license it and then you're in the clear legally.

Also the irony of calling something a hack and mentioning Divx is simply breathtaking.

"This seems to indicate that the K8 architecture is simply resilient when it comes to unaligned 128bit loads. In the case of Intel's NetBurst, the lddqu instruction may have more impact."
If you have SSE3 enabled Intel CPUs, then test your hypothesis instead of guessing. It would be interesting to see the absolute and percentage increases in performance for the same tests using equivalent Intel chips. From what I can remember, SSE3 gave Intel little performance increase for previously SSE2 optimised code. There may have been a few artificial test cases that showed large benefits, i.e. deliberately unoptimised SSE2 code versus optimised SSE3 code.

"As the Intel compiler is designed to optimize for Intel processors, we haven't had a viable source for high quality SSE3 compilation." You may be surprised by the performance of so-called 'Intel optimised' code on AMD systems. I say this particularly because of the old case of PIII and early P4 optimised code showing better AMD Athlon scores at the time.

It would also be interesting to see the difference in performance with the Opteron 252 with the SSE3 turned off in those benchmarks.

Like always we will have to wait for further optimisations and validations before we can make a better comparison. One way to investigate the features and implementation is to use hand-coded SSE2/3 code for an inner loop and compare performance and behaviour under different conditions. At the moment, it's like we can only see one side of a six-sided die.

The other thing would be to compare the power consumption of the two steppings of Opterons (either at the power point and extrapolate or measure power to mainboard/CPU).

I see that "23 - Posted on Feb 17, 2005 at 10:37 AM by pxc" has added some useful information using an Intel 3.4GHz P4 F. A 2.4GHz Opteron could be considered to compete with an Intel P4 at 3.6GHz. Others have already made comments similar to mine or provided a different view of the benchmarks given.

XviD at least implements features as per specifications. DivX tends to add in their own "features" that aren't exactly in spec (although some are understandable given the limitations of the AVI container)

Choose your words carefully.

#32

I'm also inclined to believe that it's simply because DivX's implementation of SSE3 simply doesn't do anything much yet.

Sorry I'm a bit of a n00b when it comes to Divx encoding tests but are you sure the SSE3 codepath was enabled on the Opteron? I'm curious if some apps simply test for core/stepping rather than actual SSE3 ability--maybe DivX wasn't even using the right code??

Is it even remotely possible that DivX skips using SSE3 on the Opteron because it's currently only "meant" to run on the Prescott? I realise that SSE3 should work if the program is correctly written, but one never knows...
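For what it's worth, the difference between the two kinds of check being asked about can be sketched in a few lines of C. CPUID leaf 1 reports SSE3 in bit 0 of ECX on both Intel and AMD parts, and leaf 0 returns the vendor string in EBX, EDX, ECX. Everything else here is hypothetical: the function names are made up, the register values are passed in rather than read from the CPU, and nothing is implied about what DivX or any other application actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SSE3 is reported in bit 0 of ECX from CPUID leaf 1. */
static int ecx_has_sse3(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & 1u) != 0;
}

/* Build the 12-character vendor string from CPUID leaf 0 registers.
   Bytes are taken low to high from EBX, then EDX, then ECX. */
static void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                          char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    assert(out != NULL);
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Policy A (hypothetical): trust the feature flag alone. */
static int use_sse3_feature_only(uint32_t leaf1_ecx)
{
    return ecx_has_sse3(leaf1_ecx);
}

/* Policy B (hypothetical): also require a specific vendor. An SSE3-capable
   AMD part would fall back to the older code path under this policy. */
static int use_sse3_vendor_gated(const char *vendor, uint32_t leaf1_ecx)
{
    assert(vendor != NULL);
    return ecx_has_sse3(leaf1_ecx) && strcmp(vendor, "GenuineIntel") == 0;
}
```

With the registers of an SSE3-capable "AuthenticAMD" processor, policy A selects the SSE3 path and policy B does not, which is the scenario these two comments are wondering about.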

Anyway, it seems that either DivX's SSE3 implementation isn't very good or SSE3 is simply useless. I think it is the first possibility (or DivX doesn't really use SSE3 at all), because with TMPGEnc Xpress, for example, there is a good improvement.

It's to be expected that the E revision chips will run hotter than the D revision because strained-silicon increases power consumption (but allows for higher speeds). So long as you have good cooling, the E revision chips should be great overclockers.

It would be nice for comparisons of temperature and system power consumption to be taken of a D4 x48, and E4 clocked at 2.2GHz (there are no D4 x50 parts, they were all CG revision).