64-bit computing in theory and practice

LISTEN TO THE HYPE ABOUT 64-bit computing, and you could get the idea that the move to 64 bits will make all of your games run twice as fast, replace blocky 3D models with smooth, photorealistic replicas of the human form, and transform the average PC into a wonder-box that can resequence your dog’s genome in its spare cycles so he won’t pee on the rug anymore. On the other hand, listen to the anti-hype about 64-bit computing, and you could be forgiven for wondering why anyone even bothered; it’s probably just a conspiracy to get us to buy new stuff we don’t need.

The truth is somewhat different from both of these visions. 64-bit computing won’t bring us two times the performance in an amazing overnight transformation of the PC, as the move from 8 bits to 16 seemed to do back in the day. But it’s not a pointless exercise in shuffling bits, either. The 64-bit extensions to the venerable x86 instruction set architecture (ISA), including AMD64 and Intel’s code-compatible EM64T, actually offer some tangible benefits with few drawbacks. These extensions to the x86 ISA offer a much larger memory address space, bring a cleaner programming model with performance benefits, and retain backward compatibility with existing 32-bit applications.

In order to help you navigate through the hype, we nabbed a pair of 64-bit processors from AMD and Intel and tested them with the latest release candidate of the 64-bit version of Windows XP. Read on for our take on the move to 64 bits, including a look at the performance of the latest CPUs in Windows XP Pro x64 Edition with both 32 and 64-bit applications.

64-bit basics

The essence of the move to 64-bit computing is a set of extensions to the x86 instruction set pioneered by AMD and now known as AMD64. During development, they were sensibly called x86-64, but AMD decided to rename them to AMD64, probably for marketing reasons. In fact, AMD64 is also the official name of AMD’s K8 microarchitecture, just to keep things confusing. When Intel decided to play ball and make its chips compatible with the AMD64 extensions, there was little chance they would advertise their processors “now with AMD64 compatibility!” Heart attacks all around in the boardroom. And so EM64T, Intel’s carbon copy of AMD64 renamed to Intel Extended Memory 64 Technology, was born.

The difference in names obscures a distinct lack of difference in functionality. Code compiled for AMD64 will run on a processor with EM64T and vice versa. They are, for our purposes, the same thing.

Whatever you call ’em, 64-bit extensions are increasingly common in newer x86-compatible processors. Right now, all Athlon 64 and Opteron processors have x86-64 capability, as do Intel’s Pentium 4 600 series processors and newer Xeons. Intel has pledged to bring 64-bit capability throughout its desktop CPU line, right down into the Celeron realm. AMD hasn’t committed to bringing AMD64 extensions to its Sempron lineup, but one would think they’d have to once the Celeron makes the move.

The contenders

For some time now, various flavors of Linux compiled for 64-bit processors have been available, but Microsoft’s version of Windows for x86-64 is still in beta. That’s about to change, at long last, in April. Windows XP Professional x64 Edition, as it’s called, is finally upon us, as are server versions of Windows with 64-bit support. (You’ll want to note that these operating systems are distinct from Windows XP 64-bit Edition, intended for Intel Itanium processors, which is a whole different ball of wax.) Windows x64 is currently available to the public as Release Candidate 2, and judging by our experience with it, it’s nearly ready to roll. Once Windows XP x64 Edition hits the stores, I expect that we’ll see the 64-bit marketing push begin in earnest, and folks will want to know more about what 64-bit computing really means for them.

The immediate impact, in a positive sense, isn’t much at all. Windows x64 can run current 32-bit applications transparently, with few perceptible performance differences, via a facility Microsoft has dubbed WOW64, for Windows on Windows 64-bit. WOW64 allows 32-bit programs to execute normally on a 64-bit OS. Using Windows XP Pro x64 is very much like using the 32-bit version of Windows XP Pro, with the same basic look and feel. Generally, things just work as they should.

There are differences, though. Device drivers, in particular, must be recompiled for Windows x64. The 32-bit versions won’t work. In many cases, Windows x64 ships with drivers for existing hardware. We were able to test on the Intel 925X and nForce4 platforms without any additional chipset drivers, for example. In other cases, we’ll have to rely on hardware vendors to do the right thing and release 64-bit drivers for their products. Both RealTek and NVIDIA, for instance, supply 64-bit versions of their audio and video drivers, respectively, that share version numbers and feature sets with the 32-bit equivalents, and we were able to use them in our testing. ATI has a 64-bit beta version of its Catalyst video drivers available, as well, but not all hardware makers are so on the ball.

Some other types of programs won’t make the transition to Windows x64 seamlessly, either. Microsoft ships WinXP x64 with two versions of Internet Explorer, a 32-bit version and a 64-bit version. The 32-bit version is the OS default because nearly all ActiveX controls and the like are 32-bit code, and where would we be if we couldn’t execute the full range of spyware available to us? Similarly, some system-level utilities and programs that do black magic with direct hardware access are likely to break in the 64-bit version of Windows. There will no doubt be teething pains and patches required for certain types of programs, despite Microsoft’s best efforts.

Of course, many applications will be recompiled as native 64-bit programs as time passes, and those 64-bit binaries will only be compatible with 64-bit processors and operating systems. Those applications should benefit in several ways from making the transition.
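If you’re curious which kind of binary you are actually running, here is a minimal sketch in Python. It assumes a standard CPython interpreter, and the helper name `pointer_bits` is ours, not part of any Windows or Python API. A 32-bit interpreter running via WOW64 should still report 32, even on a 64-bit OS.

```python
import struct
import sys


def pointer_bits() -> int:
    """Return the width, in bits, of a native pointer in this process."""
    # struct.calcsize("P") is the size in bytes of a C void* for this build.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"This interpreter is a {pointer_bits()}-bit process")
    # sys.maxsize only exceeds 2**32 on a 64-bit build of the interpreter.
    print("64-bit build:", sys.maxsize > 2**32)
```

The check describes the running process, not the processor or the operating system underneath it.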

The 64-bit advantage

When AMD’s design team created the x86-64 ISA, they tackled several inherent deficiencies of the old x86 ISA. First and foremost among those was a very basic limitation of accessing memory with 32-bit addresses: the sum total of memory one can address at one time with a 32-bit number is 4GB. That may sound like a lot of memory for the average desktop PC, but then again, not every PC is average, and the x86 ISA is increasingly becoming the platform of choice for technical workstations and servers, as well. As memory densities increase over time thanks to the happy benefits of Moore’s Law, that 4GB limit is beginning to look smaller and smaller.

Not only that, but the practical effects of 32-bit addressing are even more constraining. By default, Windows XP limits applications to 2GB of memory space and reserves 2GB for system-level tasks. (It is possible for x86 systems to address more than 4GB of total memory using a mechanism called Physical Address Extension, created by Intel. In fact, some server versions of Windows allow up to 128GB of physical RAM in a 32-bit system. However, PAE uses a paging scheme that generally isn’t considered an ideal way of doing things.)

Meanwhile, certain types of user data sets are growing constantly, from ever-higher resolutions in digital cameras to HD video streams to video games capable of taking advantage of 512MB of RAM on a graphics card. Scientific computing and technical workstations are already hitting their heads on 32-bit addressing limitations with regularity.

By moving to a 64-bit addressing scheme, the possible address space grows exponentially from 2^32 to 2^64, so that the x86-64 ISA allows for what seems like a practically unlimited amount of memory. The theoretical peak size of a 64-bit address space is 16 exabytes, an extremely large number. Current AMD64 processors allow up to 40 bits of physical address space, or one terabyte, and up to 48 bits of virtual address space, or 256TB. Initial versions of WinXP x64 will support as much as 128GB of physical RAM and up to 16 terabytes of virtual memory. The upper limits of the Windows system cache size grow from 1GB in 32 bits to 1TB in 64 bits, a thousand-fold increase. WinXP x64 even takes advantage of the additional headroom for 32-bit apps, giving each one up to 4GB of its own space.
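The arithmetic behind those figures is easy to check. The short Python sketch below simply raises two to the number of address bits and prints the result in binary units; the function names are ours, chosen for illustration.

```python
def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


def human(n: int) -> str:
    """Format a byte count using binary units (1KB = 1024 bytes here)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n}{units[i]}"


if __name__ == "__main__":
    cases = [
        ("32-bit addresses", 32),   # prints 4GB
        ("AMD64 physical", 40),     # prints 1TB
        ("AMD64 virtual", 48),      # prints 256TB
        ("64-bit addresses", 64),   # prints 16EB
    ]
    for label, bits in cases:
        print(f"{label:18s} 2^{bits} = {human(address_space_bytes(bits))}")
```

Every extra address bit doubles the reachable space, which is why the jump from 32 to 64 bits is so much larger than a doubling.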

In short, the move to 64 bits removes the memory address space constraints of the old x86 ISA, granting PCs room to grow for quite some time. This change alone won’t bring performance benefits, except in cases where the amount of memory is a performance-constraining factor, but it’s still probably the most important benefit of x86-64 overall.

x86: registered offender

Another problem with the x86 ISA is the number of general-purpose registers (GPRs) available. Registers are fast, local slots inside a processor where programs can store values. Data stored in registers is quickly accessible for reuse, and registers are even faster than on-chip cache. The x86 ISA only provides eight general-purpose registers, and thus is generally considered register-poor. Most reasonably contemporary ISAs offer more. The PowerPC 604 RISC architecture, to give one example, has 32 general-purpose registers. Without a sufficient number of registers for the task at hand, x86 compilers must sometimes direct programs to spend time shuffling data around in order to make the right data available for an operation. This creates overhead that slows down computation.

x86-64 adds register space to the x86 ISA. Source: AMD.

To help alleviate this bottleneck, the x86-64 ISA brings more and better registers to the table. x86-64 packs 8 more general-purpose registers, for a total of 16, and they are no longer limited to 32-bit values; all 16 can store 64-bit datatypes. In addition to the new GPRs, x86-64 also includes 8 new 128-bit SSE/SSE2 registers, for a total of 16 of those. These additional registers bring x86 processors up to snuff with the competition, and they will quite likely bring the largest performance gains of any aspect of the move to the x86-64 ISA.

What is the magnitude of those performance gains? Well, it depends. Some tasks aren’t constrained by the number of registers available now, while others will benefit greatly when recompiled for x86-64 because the compiler will have more slots for local data storage. The amount of “register pressure” presented by a program depends on its nature, as this paper on 64-bit technical computing with Fortran explains:

The performance gains from having 16 GPRs available will vary depending on the complexity of your code. Compute-intensive applications with deeply nested loops, as in most Fortran codes, will experience higher levels of register pressure than simpler algorithms that follow a mostly linear execution path.

So, as they say, your mileage may vary. Sometimes, 64-bit programs will see little or no performance advantage over 32-bit versions of the same. In other cases, the performance increase could be substantial. We will, of course, test that theory in the following pages.
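To make “register pressure” a little more concrete, here is a deliberately crude toy model in Python. It assumes that any value which doesn’t fit in a register at a given point in the program must be shuffled out to memory. The live-value counts are invented for illustration, and real compilers allocate registers in far more sophisticated ways, so treat this as a sketch of the idea rather than a model of any actual compiler.

```python
def spilled_values(live_counts, num_registers):
    """For each program point, count live values that do not fit in registers."""
    return [max(0, live - num_registers) for live in live_counts]


# Hypothetical number of simultaneously live values at successive points
# inside a nested loop. These figures are made up, not measured.
LIVE_IN_LOOP = [6, 9, 12, 14, 11, 7]

if __name__ == "__main__":
    for regs in (8, 16):
        spills = spilled_values(LIVE_IN_LOOP, regs)
        print(f"{regs:2d} registers: spills per point {spills}, total {sum(spills)}")
```

A simple routine that never has more than eight values in flight gains nothing from the extra registers in this model, which matches the “your mileage may vary” caveat.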

Declaring war on alphabet soup

The final major problem with the x86 ISA is a programming model cluttered by an alphabet soup of overlapping instruction set extensions that aren’t entirely necessary or, in the case of some legacy instructions, particularly efficient. MMX, 3DNow!, x87, SSE, SSE2, and SSE3 extensions all hang off of the original x86 ISA, overlapping in many cases. x86-64 cleans things up by adopting SSE and SSE2 as part of its core set of instructions and jettisoning MMX, 3DNow!, and the x87 FPU. SSE/2 instructions can duplicate the functionality of those other instruction sets, and as a result, WinXP x64 doesn’t carry over the registers for the FPU and MMX during context switches in 64-bit mode. MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps. (SSE3, the newest of the extensions, will likely be supported by all 64-bit processors in the near future, because AMD is expected to add SSE3 to the AMD64 architecture very soon. I’d expect SSE3 to work in 64-bit mode.)

The x87 FPU has long been considered a weakness of x86 CPU architectures compared to competing RISC designs, and x86 processors have indeed had weak FPU performance, relatively speaking. SSE2 exchanges the x87’s stack-based programming model for a more modern one, a potential boon for floating-point math performance. SSE2 also replaces the x87’s IEEE 80-bit precision with the choice of either IEEE 32-bit or 64-bit floating-point math. As a result, x86-64 processors running in 64-bit mode will produce floating-point results more like those of most RISC CPUs, but those results will vary slightly from the answers produced by legacy programs that use the x87 FPU due to the difference in precision.
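Python has no portable way to do 80-bit x87 arithmetic, so the sketch below shows the same kind of effect one step down the ladder: an identical sum carried out with IEEE 32-bit and IEEE 64-bit rounding. The helper names are ours, and the example is only an analogy for the x87-versus-SSE2 difference described above, not a reproduction of it.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (IEEE 64-bit) to the nearest IEEE 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_tenths(single_precision: bool) -> float:
    """Add 0.1 ten times, rounding after every step if single_precision is set."""
    step = to_single(0.1) if single_precision else 0.1
    total = 0.0
    for _ in range(10):
        total += step
        if single_precision:
            total = to_single(total)  # emulate storing the result as 32-bit
    return total


if __name__ == "__main__":
    print("64-bit rounding:", repr(sum_tenths(False)))
    print("32-bit rounding:", repr(sum_tenths(True)))
```

Both answers are very close to 1.0, but they are not the same number, and that is the sort of small discrepancy to expect when the working precision changes.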

Because of the move to the 64-bit ISA and the elimination of MMX, 3DNow! and the x87 FPU, Windows applications that include inline assembly code will not compile on Windows x64. That means applications, including games, that include segments of hand-tuned inline assembly code may have to sacrifice their optimizations when being ported to 64 bits. During the transition period between 32 and 64 bits, this reality may be a bit of a counterweight against the performance advantages that x86-64’s extra registers provide. One could see how 32-bit native games or similar applications with lots of optimizations might perform better than their 64-bit equivalents. However, the move to clean up the x86 programming model will almost surely pay dividends in the long run in terms of simplicity of development, ease of optimization, and even outright performance.

Weighing the benefits of 64 bits

Now that we’ve sorted through the theory about 64-bit performance, it’s time to take a look at the current reality. Neither Windows XP Pro x64 Edition nor the handful of 64-bit applications and device drivers we used are yet finished products, but as you’ll see, their performance indicates relative maturity. With that mild caveat in mind, we’ll attempt to explore answers to several questions. Among them: How do 32-bit applications perform on Windows x64? What are the performance benefits of running 64-bit code on a 64-bit OS? And how do the Intel and AMD implementations of x86-64 compare? Do they offer similar performance deltas in the move to 64 bits, or does one demonstrate obvious superiority over the other?

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
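The averaging step mentioned above amounts to something like the following Python sketch. The function name and the sample frame rates are hypothetical, included only to show the shape of the calculation.

```python
from statistics import mean


def average_result(runs):
    """Average repeated benchmark runs, insisting on at least three samples."""
    if len(runs) < 3:
        raise ValueError("need at least three runs")
    return mean(runs)


# Hypothetical frame rates from three runs of one test (not measured data).
example_runs = [71.8, 72.4, 72.1]

if __name__ == "__main__":
    print(f"Reported score: {average_result(example_runs):.1f} FPS")
```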

The Chronicles of Riddick: Escape from Butcher Bay

We’ll start with gaming performance because I know many of you will be interested to see these numbers first. Our first test is a surprisingly good new game, The Chronicles of Riddick: Escape from Butcher Bay. This is one of the most visually impressive games on the PC, perhaps even better than Doom 3. The game also comes out of the box with a 64-bit executable (in addition to the standard 32-bit version) and a built-in benchmarking function. That makes it particularly useful for us, because we can test performance without running a 32-bit benchmarking utility, like FRAPS, alongside it.

We recorded our own custom demo of one of the opening levels of the game and played it back for testing. The game has an advanced rendering mode with soft shadows available on GeForce 6-series GPUs like the ones we used in our test systems, but it really taxes the graphics card, so we bypassed it for the “SM2.0” mode, which runs fast enough to show us when performance is CPU limited.

You’ll notice that in the benchmark graphs below and those on the following pages, we have several sets of data for each CPU. Any result labeled “Win32” was run on the 32-bit version of Windows XP Pro, and anything labeled “Win64” was run on WinXP Pro x64 RC2. The tests labeled “32-bit” used 32-bit executable programs, and those labeled “64-bit” used 64-bit versions. Notice that in many cases you’ll see a mix of “Win64” and “32-bit,” when we are running a 32-bit program via WOW64 on Windows x64.

There’s nothing earth-shattering about the performance of either the AMD or Intel CPUs in 64-bit mode here. Interestingly enough, the Athlon 64 is faster running the 32-bit code on WinXP x64 than on WinXP 32-bit. The Pentium 4, meanwhile, is the opposite, losing a step or two in the 64-bit OS. Neither processor benefits tangibly from the move to 64-bit application code, unfortunately.

Doom 3

We’ll continue our gaming tests with a few more 32-bit games, just to see how they run on Windows XP Pro x64. Few other games have 64-bit versions that are available to the public at present, sadly. That makes our gaming tests a little bit less enlightening than the non-gaming applications that follow.

We tested performance by playing back a custom-recorded demo that should be fairly representative of most of the single-player gameplay in Doom 3.

Doom 3 doesn’t gain or lose much of anything when making the transition to the 64-bit OS. That’s good news for those who would like to make the leap.

Far Cry

Far Cry is an interesting case of a game that, like Riddick, ships with an AMD64 logo on the box. Unlike Riddick, its 64-bit version is long AWOL, so we have to stick to 32-bit code only.

Our Far Cry demo takes place on the Pier level, in one of those massive, open outdoor areas so common in this game. Vegetation is dense, and view distances can be very long.

Once more, no news is good news. Far Cry runs pretty much the same in WinXP x64 as it does in 32 bits.

Unreal Tournament 2004

Our UT2004 demo shows yours truly putting the smack down on some bots in an Onslaught game.

The Pentium 4 runs a few frames per second faster with the 64-bit OS, but overall, it’s safe to say that 32-bit gaming performance on WinXP x64 is now more or less equivalent to WinXP in 32 bits. That wasn’t the case with earlier revisions of the OS and video drivers, so our results show solid progress. It looks like there will be little reason for gamers to avoid making the move to WinXP x64.

picCOLOR

The picCOLOR image processing and analysis tool is a nice example of a 32-bit application ported to 64 bits. picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA.

Comparing the 64-bit version of this program to the 32-bit version isn’t entirely straightforward, because the 32-bit version of picCOLOR for Windows includes hand-tuned inline assembly that uses MMX to accelerate the program’s Morph function. This inline assembly code doesn’t fly in 64-bit mode because MMX isn’t supported, and because inline assembly code won’t compile in Microsoft’s 64-bit compiler. As a result, the 64-bit version of picCOLOR doesn’t have this optimized code.

Fortunately, the 32-bit version of picCOLOR includes an option to disable the hand-tuned MMX code, so we can compare 32-bit and 64-bit performance in picCOLOR purely with executable binaries compiled from a high-level language (C, in this case). In the graphs below, the data set with the inline MMX assembly disabled is labeled “32-bit/No MMX.” These “no MMX” results do not include hand-tuned MMX code, but don’t let my labels fool you; the compiler may have chosen to use some MMX instructions in the executable it produced.

Both the Pentium 4 and Athlon 64 gain significantly with the 64-bit version of picCOLOR. Compared directly to the 32-bit version of the program without inline MMX assembly code, the 64-bit version of picCOLOR is quite a bit faster. In a bit of drama, the Athlon 64 4000+ manages to leapfrog the Pentium 4 660 during the move to 64 bits: the P4 is faster in 32 bits, but the Athlon 64 benefits more from using the x86-64 ISA.

The 32-bit version of picCOLOR is indeed faster with hand-tuned MMX than without, and there’s virtually no performance gained or lost when running the 32-bit version on WinXP x64. Let’s have a look at the individual functions that make up the picCOLOR benchmark sequence to see how they are affected by the transition to 64 bits.

The first function worth a mention here is Morph, which uses inline MMX. The hand-tuned code provides a major performance boost in that function, and turning it off brings a performance loss. However, we more than make up the difference elsewhere simply by recompiling the application for 64 bits. (None of the other functions use inline assembly.) Dr. Müller describes the tradeoff like so:

[Y]ou see that changing from 32 bit to 64 bit gives us a speed up of a good 30% on the AMD, less than 20% on the P4. But inline MMX gave us a factor of 2.5 just for the morph function! Now imagine all the 12 function had been hand-optimized with MMX or SSE2! We’d have some overall score of 8 or 9! Well, but we’d need another 10.000 programming hours… 🙁

The problem is that inline MMX-ing is quite some work, and switching from 32 bit to 64 bit is just re-compiling 🙂

Taking the time to port an app to 64 bits may be a very efficient means of improving performance, relatively speaking.
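One way to see why is to run the numbers through Amdahl’s law, as in the Python sketch below. It assumes, purely for illustration, that Morph accounts for one twelfth of the benchmark’s run time (one of 12 equally weighted functions); picCOLOR’s real weighting is not something we know, so the output is a back-of-the-envelope figure only.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Assumption: Morph is 1 of 12 equally weighted functions.
    morph_only = overall_speedup(1 / 12, 2.5)
    print(f"2.5x on Morph alone:   {morph_only:.3f}x overall")
    # A recompile that lifts everything by 30%, the AMD figure from the quote.
    print(f"30% across the board: {overall_speedup(1.0, 1.3):.3f}x overall")
```

Under that assumption, a 2.5x gain confined to one function moves the overall result by only about 5%, while a recompile that touches every function delivers its full gain.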

Undeterred by the restrictions of working in x86-64, Dr. Müller says he may yet convert his hand-tuned MMX code to use SSE2 registers and assemble the SSE2 code separately from his C program. With luck, he hopes to see even more of a speedup in 64-bit mode.

Several of picCOLOR’s other functions get quite a bit quicker in 64 bits. Among them is the Skeleton function, which Dr. Müller describes as “a very simple function with lots of short loops, integer comparisons and array index calculations” that “[s]hould fit in any cache.” It seems quite likely that the additional general-purpose registers are being put to good use here.

The next function, Texture Orien, is “based on a 16*16 double precision DCT,” or discrete cosine transform, a bit of math commonly used in image compression algorithms like JPEG and MPEG. It’s also faster in 64 bits, especially on the Athlon 64. The rotate function with floating-point interpolation nearly doubles its performance in 64 bits, as well.
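For readers who haven’t met the DCT before, here is a naive Python sketch of a 16×16 double-precision transform. It uses the textbook orthonormal DCT-II applied to rows and then columns; we have no knowledge of how picCOLOR implements its version, so this shows only the underlying math, and it makes no attempt at speed.

```python
import math

N = 16  # block size named in the quote


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of floats (Python floats are doubles)."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, v in enumerate(values))
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    flat = [[1.0] * N for _ in range(N)]
    coeffs = dct_2d(flat)
    # A perfectly flat block has all of its energy in the top-left (DC) term.
    print(f"DC coefficient: {coeffs[0][0]:.6f}")
```

The inner loops are nothing but multiplies, adds, and array indexing, which is the kind of code that tends to benefit from having more registers available.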

Dr. Müller suspects that the Pentium 4’s relatively strong performance in the rotate test with fixed-point interpolation is the result of the barrel shifter added to the Prescott core, but oddly, this function slows down slightly in 64 bits on the Pentium 4.

Another function that gains dramatically from x86-64 is Watershed, which Dr. Müller says “uses about 5 MBytes of stack, all integer.” He speculates that the Athlon 64’s lower memory access latencies may help it outperform the Pentium 4 in this test, but he’s unsure why the function is so much faster in 64 bits on both architectures.

The Panorama Factory

The Panorama Factory joins together multiple photographs to create ultra-wide-angle panoramic images. Because working with multiple high-resolution images at once can require a lot of memory, The Panorama Factory is also a good candidate for porting to 64 bits. We used the program’s default wizard to join together four very high res (approximately 4000×3500 pixels) images in a partial panorama.

The performance boost when going to 64 bits is dramatic. The Athlon 64 lops off almost exactly one minute from its processing time in 64-bit mode, and the Pentium 4’s gains are similar. The Panorama Factory’s timer function records the time required for each step of the process of converting our sample images into a panorama, so we can see where the speed-ups are.

The stitch function, which is the heart of this program’s capabilities, gains greatly by using the x86-64 ISA. I don’t believe the I/O functions like read and write are included in The Panorama Factory’s calculation of the overall wizard time. The crop, render blend, enhance, and improve quality functions shave off quite a bit of execution time in 64 bits on both the Pentium 4 and Athlon 64. In addition, the Athlon 64 is faster at the align and fine-tune operations in 64 bits, although the P4 doesn’t benefit as much.

POV-Ray

POV-Ray is a ray-tracing rendering program that we’ve been using as a benchmark for ages. It’s an open-source program that is intended to be portable to multiple platforms easily, so it’s not multithreaded. There is, however, a 64-bit version available now.

We tested POV-Ray with a pair of scenes. The first one is a classic Chess scene.

The two processors are mirror images of each other here. The Athlon 64 renders this scene ten seconds quicker in the 64-bit version of POV-Ray, while the Pentium 4 is actually slower with the 64-bit version of the renderer. With the 32-bit version of the program, the P4 gets a little faster in WinXP x64, but the Athlon 64 is slower. Talk about mixed results!

POV-Ray’s default benchmark tells a similar story. (Note, here, that the results are reported in pixels per second rather than render times.) The P4 again slows down with the 64-bit version of the program, and the Athlon 64 gets a pretty nice speed boost. I’ll be curious to see whether this pattern holds with future versions of the program or with those compiled differently.

Blobby Dancer

Blobby Dancer is a graphics demo from NVIDIA that was originally a 32-bit program, but NVIDIA later ported it to x86-64. Not only is it 64 bits, but it’s funky, too!

The P4 and Athlon 64 are both able to stretch their legs in the 64-bit version of this quirky little demo.

SiSoft Sandra

Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The most interesting of those benchmarks is probably the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX and SSE. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The 64-bit port of this benchmark, of course, ought to be able to show us how x86-64 aids performance. The benchmark is also multithreaded, and should be able to take advantage of Hyper-Threading.

The “Integer x16” version of this test uses integer numbers to simulate floating-point math. Oddly, the Athlon 64 is slower in the 64-bit integer test. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations at once. The Pentium 4 has long excelled in highly parallel SSE2 tests, and this one is no exception. The additional SSE2 registers in x86-64 really appear to help, too, on both processors.
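To show what “using integers to simulate floating-point math” can look like, here is a Python sketch of the Mandelbrot escape-time loop written twice: once with ordinary floating-point numbers and once in fixed point. The 255-iteration cap comes from the FAQ; everything else, including the choice of 16 fractional bits and the function names, is our own assumption and not a description of SiSoft’s code or of what “x16” means.

```python
MAX_ITER = 255  # iteration cap quoted in the FAQ

FRAC_BITS = 16          # assumed fixed-point format: 16 fractional bits
ONE = 1 << FRAC_BITS    # the value 1.0 in that format


def mandel_float(cr, ci, max_iter=MAX_ITER):
    """Iterations before z = z*z + c escapes |z| > 2, using floats."""
    zr = zi = 0.0
    for i in range(max_iter):
        if zr * zr + zi * zi > 4.0:
            return i
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    return max_iter


def to_fixed(x):
    """Convert a float to the assumed fixed-point integer format."""
    return int(round(x * ONE))


def mandel_fixed(cr, ci, max_iter=MAX_ITER):
    """Same loop with integers only: multiply, then shift to rescale."""
    cr_f, ci_f = to_fixed(cr), to_fixed(ci)
    zr = zi = 0
    for i in range(max_iter):
        zr2 = (zr * zr) >> FRAC_BITS
        zi2 = (zi * zi) >> FRAC_BITS
        if zr2 + zi2 > 4 * ONE:
            return i
        zr, zi = zr2 - zi2 + cr_f, ((2 * zr * zi) >> FRAC_BITS) + ci_f
    return max_iter


if __name__ == "__main__":
    for point in [(0.0, 0.0), (-1.0, 0.0), (2.0, 2.0)]:
        print(point, mandel_float(*point), mandel_fixed(*point))
```

A full benchmark frame would run a loop like this once per pixel across the 640×480 image, so the work per frame is dominated by this small kernel.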

The Dhrystone test is more synthetic than the Mandelbrot test. From the FAQ:

The original Dhrystone benchmark is still widely used to measure CPU performance in industry under various versions/variants. The benchmark is designed to contain a representative sample of types of operations, mostly numerical, used by applications. Unfortunately this does not always represent a true real-life performance, but is useful to compare the speed of various CPUs.

The Dhrystone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform some sequences of instructions. Due to various changes, the result is not directly comparable with other Dhrystone benchmarks. However the MIPS (Million Instructions Per Second) should be the same for the same system (+5-10% variation) between benchmarks.

Yes, it’s MIPS, that Meaningless Indicator of Processor Speed! What kind of MIPS differences do we get with x86-64?

About that much. Again, the Pentium 4 gains more than the Athlon 64 here, but both achieve solid improvements.

Whetstone is the floating-point twin of Dhrystone; it reports results in MFLOPS, or millions of floating-point operations per second. SiSoft has created a version of Whetstone that’s vectorized for use with SSE2. The original “FPU” version most likely uses SSE/SSE2 in 64-bit mode, but in a scalar rather than vector fashion.

As in Dhrystone, so in Whetstone; compiling for x86-64 produces higher performance.

Conclusions

These early benchmarks indicate that the x86-64 ISA holds significant promise for better performance when applications are ported to it. The benefits aren’t uniform or universal, but they can be fairly compelling. For technical and scientific computing, the combination of additional registers, a cleaner programming model, and a larger memory address space adds up to a slam dunk. 64 bits is the way to go. The same is likely true for servers.

For PC enthusiasts and gamers, moving to 64 bits may not present as many obvious advantages in the near term, but there’s also very little apparent penalty in going with Windows XP Pro x64, even if it’s only to run 32-bit applications. All of our gaming tests showed very little performance delta between WinXP and WinXP x64, and the same was generally true for other apps. Just make sure that 64-bit device drivers are available for your hardware.

One question that our testing hasn’t answered is whether or not 64-bit versions of popular games will really bring notable performance gains. Judging by our experience with the Riddick game, it’s hard to be terribly optimistic on this front. 64-bit games do hold promise down the road, when really large textures and very complex worlds eat up more than 4GB of total RAM, but that day is still a long way off.

As for the issue of whether the Athlon 64 or the Pentium 4 stands to gain more from 64-bit apps, well, I think the jury is still out. The applications we’ve tested have been all over the map on that question, and I’d hate to venture a guess. The best news, though, is that the typical scenario seems to involve solid performance increases on both architectures with 64-bit programs, if there is any performance increase at all. That makes sense, because both microarchitectures have dedicated transistors to the x86-64 ISA’s additional register space, and those new registers are the key to better performance.

Any chance of a rev 2 to this article? Surely things have changed, especially when you add in SMP patches that a lot of newer games seem to have come out with.
I’d love to see a comparison of:
a64 4000+ vs a64 4800+
(or price-equivalent single vs dual-core)
64-bit vs 32-bit

bwcbiz

14 years ago

One minor quibble:
“q[

indeego

14 years ago

You can tell the intelligence of Anandtech readers simply by reading their comments/reaction to this TR article: http://www.anandtech.com/news/shownews.aspx?i=24007

PerfectCr

14 years ago

Thanks (I think) for showing me that link. My IQ has now dropped a few points after reading some of those inane posts.

UberGerbil

14 years ago

Good grief. Hopefully none of them cross over and start posting here…

eitje

14 years ago

i dunno, page 3 is pretty neat.

sativa

14 years ago

i could have sworn there was a ut2k4 64 bit demo out… maybe its just for linux.

UberGerbil

14 years ago

………………………………………………………….

UberGerbil

14 years ago

One interesting factor that is a little difficult to tease out is the support libraries, and how optimized they might be. Doom is running against OpenGL; the other games are (AFAIK) running against DirectX. Both are 64bit versions, but we don’t know how tweaked they are. Moreover they have to talk to the video drivers, which are still in beta. We could see continued performance improvements — even for 32bit apps — after Windows x64 has shipped, as new revisions of drivers and OpenGL and DirectX are released.

Magnus

14 years ago

Too true. I think that we all have to keep in mind that this is the bare minimum of performance improvement we are likely to see. 6 months from now, these might be skewed differently.

Oldtech

14 years ago

The Alphas are RISC based.

5150: they are not connected to the internet. Shame 🙁

Oldtech

Kurlon

14 years ago

To be more specific, Alphas are based on the…

…

wait for it…

…

Alpha ISA.

Damn I miss that line. EV8 would have been smoking had the project gotten full backing. Most CPU tech you see as ‘new’ and ‘innovative’ in PPC and IA32 is derived from Alpha production or research.

tfp

14 years ago

If they are not x86 or PPC, they cannot fold. Don’t worry, they aren’t going to waste.

Oldtech

14 years ago

64 bit computing is not new. The DEC Alphas, vintage ~1995, were 64 bit and ran OpenVMS. At work we still have two of these cranking right along 24/7.
When you think about it, everybody else is playing catch up 10 years later.

Oldtech

5150

14 years ago

Soooooo, are they folding?

tfp

14 years ago

Are Alphas x86 or PPC?

Lord.Blue

14 years ago

Neither. They are Alphas.

tfp

14 years ago

And that was my point: Alphas can’t fold.

Kurlon

14 years ago

Actually, they can. Alphas can run IA32 binaries via FX!32 under Linux and NT/2k. Depending on the task, running under emulation was FASTER than on available x86 cpus back in the day…

tfp

14 years ago

wow nice, i’m guessing now it would be considered very slow for something like F@H

Kurlon

14 years ago

Actually, F@H should be the type of app FX!32 excels at. FX!32 is similar to HP’s Dynamo in that it’s a profiling dynamic recompiler. The first run is usually slow as you do brute emulation. As it’s running, it’s also building Alpha-native code chunks to replace the emulated ones with, and caching them. Your second run is then quicker, as all the code paths you hit initially are now native code. Hit a new code path, and you slow down while you brute through it; next pass, it’s native.
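The emulate-first, run-native-afterwards behavior described above can be sketched in a few lines. This is only a toy model, not how FX!32 was actually built: the class `ToyTranslator`, the two-operation "guest" instruction set, and the use of generated Python as a stand-in for native code are all assumptions made for illustration.

```python
class ToyTranslator:
    """Toy emulate-then-translate cache (hypothetical; not FX!32's real design)."""

    def __init__(self):
        self.native_cache = {}  # block address -> translated callable
        self.emulated_runs = 0
        self.native_runs = 0

    def _emulate(self, block):
        # Slow path: interpret the guest block one operation at a time.
        acc = 0
        for op, value in block:
            if op == "add":
                acc += value
            elif op == "mul":
                acc *= value
            else:
                raise ValueError(f"unknown op: {op}")
        return acc

    def _translate(self, block):
        # Build host code for the block once; generated Python stands in
        # for native machine code here.
        symbols = {"add": "+", "mul": "*"}
        lines = ["def native():", "    acc = 0"]
        for op, value in block:
            lines.append(f"    acc {symbols[op]}= {value!r}")
        lines.append("    return acc")
        namespace = {}
        exec("\n".join(lines), namespace)
        return namespace["native"]

    def run(self, address, block):
        # Cached path: the block was seen before, so run the translation.
        if address in self.native_cache:
            self.native_runs += 1
            return self.native_cache[address]()
        # First visit: emulate now, and cache a translation for next time.
        self.emulated_runs += 1
        result = self._emulate(block)
        self.native_cache[address] = self._translate(block)
        return result


translator = ToyTranslator()
hot_block = [("add", 3), ("mul", 4), ("add", 1)]
first = translator.run(0x1000, hot_block)   # emulated
second = translator.run(0x1000, hot_block)  # served from the cache
print(first, second, translator.emulated_runs, translator.native_runs)
```

A loop-heavy workload keeps hitting the same few blocks, which is why the cached path would dominate after the first pass.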

For reference, I had a 21164 533 on a ‘LX’ mobo with 4MB of L3. That box could CRUSH Pent 2’s running Quake 2’s dedicated server, especially the native port. (Ah, those were the days, Id released native Alpha binaries.)

For floating point code, and highly looped code like F@H, I’d guesstimate that box could run with P3 550’s and up to say P3 733’s when running via FX!32. If I still had my box I’d light it up to see.

Actually, a friend still has one… hrmm…

tfp

14 years ago

Ooo, now that’s slick. If you can, post results. P3 500-750ish is sort of slow these days, but not unreasonable.

This is what Intel should have done with the Itanium…

UberGerbil

14 years ago

It *[http://www.realworldtech.com/page.cfm?ArticleID=RWT122803224105

tfp

14 years ago

I knew they had software emulation but I didn’t think it did what Kurlon described for alpha “As it’s running it’s also building alpha native code chunks to replace emulated ones with, and caching them.”…

From your link:
…

Kurlon

14 years ago

I still have Win2k Beta RC2 for Alpha floating around here somewheres… was pretty slick, FX!32 was working well, PnP was functional, good amount of hardware support…

Mac_Bug

14 years ago

64 bit is good for some things, but it’s safe to say that these things only become mainstream when there is a NEED for them on the consumer market. Addressing more memory is great and all, but the real performance that everyone cares about is in all the other changes that come with a new generation of procs.

sbuckler

14 years ago

Testing memory usage would have been interesting. For true 64 bit programmes this may go up something like 25% due to the increase in the size of variables.

spworley

14 years ago

Yes, but most programs will see only small changes. It’s even possible for memory use to decrease a bit, since there’s less register musical chairs.
The biggest bloat will come from storing lots and lots of pointers, since THAT is the data which doubled in size. I’ve never seen 25% bigger but I suppose some super-fancy data structure with lots and lots of internal links may work that way. An Oracle database may be the most interesting case.
My own 3D rendering code is about 5% larger in 64 bit, within the “noise” limit where I don’t care.
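A rough back-of-the-envelope model shows why pointer-heavy structures bloat while mostly-data structures barely move. It assumes a simplified layout (pointers first, then a payload of 4-byte fields, the whole record padded to pointer alignment); the helper `record_size` and the two sample records are hypothetical, and real compilers may lay things out differently.

```python
def record_size(pointer_bytes, n_pointers, payload_bytes):
    """Size of a record with n_pointers pointers plus a payload of 4-byte fields.

    Simplifying assumption: the record is padded up to a multiple of the
    pointer size, which is the strictest alignment in this model.
    """
    raw = n_pointers * pointer_bytes + payload_bytes
    align = pointer_bytes
    return (raw + align - 1) // align * align


def growth_percent(n_pointers, payload_bytes):
    """Percentage growth when pointers go from 4 bytes to 8 bytes."""
    small = record_size(4, n_pointers, payload_bytes)
    large = record_size(8, n_pointers, payload_bytes)
    return 100.0 * (large - small) / small


# Doubly linked list node: two pointers and one 4-byte integer.
print("list node:", record_size(4, 2, 4), "->", record_size(8, 2, 4),
      f"({growth_percent(2, 4):.1f}% larger)")

# Data-heavy record: one pointer and 60 bytes of 4-byte values.
print("data record:", record_size(4, 1, 60), "->", record_size(8, 1, 60),
      f"({growth_percent(1, 60):.1f}% larger)")
```

Under these assumptions the link-heavy node doubles in size while the data-heavy record grows far less, which fits the observation that only code full of internal links gets anywhere near large percentages.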

UberGerbil

14 years ago

No, you’re unlikely to see anything like 25%. Recompiling clean code shouldn’t grow much of anything other than pointers. Your FP values are already >64bits, and most of your ints are probably happy remaining 32bits.

eitje

14 years ago

D –

you may want to link MS KB 282423 somewhere. drops a quick list of “limitations” for Windows x64.

included in these are the note about 32-bit activeX controls, as well as that 32-bit dll shell extensions don’t work anymore. so if you have a favorite program that’s 32-bit only, and it adds right-click extensions, those aren’t going to work anymore until the dll is recompiled into a 64-bit dll.

but i thought it might be important to mention the inproc extensions to explorer, since the only reason i switched back from 64-bit XP was because i had some right-click-only tools i liked to use, but couldn’t since they weren’t in the right-click menu.

Dposcorp

14 years ago

HEYYYYYYYYYYYYYYYYY! How come folding was tested????

(someone had to ask)

alphaGulp

14 years ago

Awesome article.

I’m particularly impressed with the list of programs you found to test. As I recall, you had asked for suggestions on this a few weeks ago, and no one was able to think of much.

The ’64-bit advantage’ section is without a doubt the best presentation of the subject that I have read.

Anyhow, what surprises me the most about all this is that the intel architecture does so well. Apparently, the athlon was designed from the ground up to operate in 64 and 32 bits, whereas the pentium 4 only had its 64 bit stuff bolted on as an afterthought.

Architecture-wise, I imagine this means that the 32 and 64 bit logic is totally separate (only sharing things like SSE and the caches)? I’m surprised that Intel’s late addition of EM64T didn’t increase their logic gate count by that much.

(In truth, my impression of the test results is that the AMD architecture gets more of a boost in switching to x64, but the % improvement seen in both architectures is generally close enough.)

UberGerbil

14 years ago

The reason adding 64bits didn’t increase their gate count by much was because they …

alphaGulp

14 years ago

I wasn’t surprised about the gate count of the K8, but rather of the P4. (nor was I making assumptions, AFAIK)

The surprise is that the P4 could have all of its registers and logic widened to 64 bits (with a check at all points to see what mode it is operating in), have the number of registers increased, and have new micro ops added. The P4 (32 bits) was a very complex and heavily optimized architecture, and yet not only did the gate count not increase by much, but the 32 bit performance was not (seemingly) affected. Surely such a huge change to their 32bit architecture would have lost them some optimizations, at least in the first generation.

Damage

14 years ago

Prescott’s gate count was up quite a bit, actually, over Northwood’s.

redpriest

14 years ago

Lord Blue, for all intents and purposes, there is no difference architecturally. A program doesn’t care whether or not the CPU can address 36 bits, 40 bits, or even more. The implementation is 64-bits across the board. How much physical memory the processor can address is a completely separate issue.

Anomymous Gerbil

14 years ago

Edit: not to worry, I see that your post is in reply to post #20, so my response is irrelevant!

Illissius

14 years ago

Thoroughness, thy name is The Tech Report.
Love this site. It has been months since both AMD and Intel processors with 64-bit extensions have been available, and this is only the …

Krogoth

14 years ago

Excellent Article, Damage! It just further reinforces what I knew all along. X86-64 only benefits the professional market for the foreseeable future. Just give or take 3-5 more years for mainstream apps to get compiled for x86-64.

To me, on the grand scale of folding, 2% seems huge. How many additional clients would that represent?

It seems like an app like this would be pretty easy to recompile for 64-bit.

-Stephen

tfp

14 years ago

Not when you don’t have enough people to do the coding, it doesn’t. Heck, they hardly support any platforms compared to something like SETI. And a lot of the important parts of their code are hand-coded asm, so I don’t know if a simple recompile will do what they want.

They are using their limited resources for things that will give them “more bang for the buck” like folding on a GPU.

Thresher

14 years ago

Excellent article. Very impressive work.

One thing that would be nice:

Can you add a glossary to explain what some of the abbreviations and words mean?

I’m not trying to be a smartass, but when I refer some of my co-workers and our analysts to this article, they may not know what they all mean. A glossary would be helpful, especially if hyperlinked into the article.

sbarash

14 years ago

While I’ve been in the biz for a long time, acronyms still drive me nuts. I hate getting half way through an article and hitting some acronym that wasn’t previously defined.

Not to say that this article had any acronyms that I didn’t understand, but in general lots of tech articles drive me nuts in this respect.

-Stephen

Ruiner

14 years ago

Meh.
The only cpu intensive apps I use are games. My impression has been that game developers have been spending more effort increasing compatibility/improving performance on the …

UberGerbil

14 years ago

Actually, assuming they’ve been writing their code in a clean fashion, even small developers should be able to recompile for 64bits with minimal effort. How much benefit it will offer is hard to say; if a game is (or can be) CPU-limited and the compiler does a good job of taking advantage of the added registers (libraries that make good use of the added SSE registers may help a lot) then you might see a nice little gain. The problem is really regression testing, and supporting users dealing with beta drivers and so on.

Recoding to make good use of multiple cores, on the other hand, is a huge piece of work.

redpriest

14 years ago

I should add that a nice thing about the x64 windows compiler (which is related to Whidbey, or visual studio 8.0/2005) is that it’s a much nicer compiler than .NET 2003/2002. The control you have over floating point precision is 100x better than it is with the previous versions.
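For anyone wondering why control over the floating-point model matters at all: floating-point addition is not associative, so a compiler that is permitted to reorder operations for speed can change the low-order bits of a result. The snippet below only demonstrates that arithmetic property in Python with IEEE 754 doubles; it is not a demonstration of any particular compiler switch, and the variable names are just for illustration.

```python
# Three ordinary double-precision values.
a, b, c = 0.1, 0.2, 0.3

# The same sum, grouped two different ways. A compiler allowed to
# reassociate could turn one form into the other.
left_to_right = (a + b) + c
regrouped = a + (b + c)

print(repr(left_to_right))
print(repr(regrouped))
print("identical:", left_to_right == regrouped)
print("difference:", abs(left_to_right - regrouped))
```

The two results differ by roughly one unit in the last place, which is harmless for a game but can matter for code that expects bit-identical output across builds.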

/fp:precise, /fp:strict, /fp:fast, etc. are better than /Op. Here is an awesome article describing its effects:

Need a time-to-precision benchmark that measures how fast a CPU can solve a singular value decomposition problem to a given level of precision. Even more fun if you compare times to successively solve the same problem for a series of increasingly tight precision requirements.

Randy2

14 years ago

Too bad the review didn’t include tests with the NF3 chipset. NF4 just hit the streets, and the majority of the AMD64 crowd is still using the former, waiting on MS for the software. So, this review only provides information to a prospective purchaser of brand new hardware, but doesn’t really provide any useful info to existing users. The conclusion doesn’t seem to match the benchmarks. If you weigh in the burden of all the hardware and software issues, it really isn’t a practical move to switch to a 64bit OS.

Since no software runs any faster on it, you may as well ask MS to release an SP3 for XP with a bunch of bugs in it – Then you can have the features and performance of X64 on your existing 32bit XP !

Right now, the X64 is similar to Linux distros. It will install correctly, but half of your hardware and software won’t work.

d0g_p00p

14 years ago

Reading is good. Learn it, it might help.

WaltC

14 years ago

I think the jury is in, though, on the fact that x86-64 is here to stay…;) Very likely within the next year if you buy a high-end x86 cpu you won’t be able to escape it–even if for some odd reason you’d ever want to…;)

I’m running x64 RC2 along with XP in a dual-boot config (from IDE), and my only hardware yet to receive x64 drivers is my HP 3845 Deskjet and my Promise tx4200 RAID controller. Drivers for my x800 and Audigy 2zs and nf3 core-logic came from their respective manufacturer web sites and run fine–and everything else in my system installed natively from x64.

Applications which receive an x86-64 recompile will certainly run faster–sometimes by a wide margin as demonstrated in the article. Those that won’t receive the recompile should run transparently under the x64 OS. So-called “security” applications like virus checkers will have to be rewritten for the OS, of course–since x64 is not an *upgrade* to XP and may only be installed clean.

I would expect to see the needed remaining drivers and “security” applications hitting the market shortly after x64 ships in the next couple of months.

My impression of x64 RC2 thus far is that overall it is a far more polished OS than XP was when XP shipped in ’01.

Norphy

14 years ago

Surely that wasn’t the purpose of the article. It seems to me that the author just tried to gauge the difference in performance between the PC running in 32bit mode and 64bit mode. If that’s the case it wouldn’t make any difference if the benchmarks were run on an nForce 150, a VIA chipset, an AMD 8111 or whatever. Logically, the %age difference in performance should be the same no matter what platform the CPU is running on.

That’s the point, he wasn’t benchmarking the hardware, he was benchmarking the software. He was running the software on the 64bit and 32bit versions of Windows on AMD and Intel platforms. As far as I can see the idea behind the article isn’t supposed to be “Who’s best, Intel or AMD?”, it’s more a “Is there any point in using Windows x64?” question.

paulio

14 years ago

excellent article

redpriest

14 years ago

#9,

It’s somewhat straightforward, but it depends on how many lines of assembly you’re going to port. (Every memory access has to be changed from esi ->rsi, etc).

There’s also the difference between porting it in a “dummy” sort of fashion, where all it is is a straight port with no performance enhancement, and a port that takes full advantage of the many registers you have available. I’d hold off porting until I had the latter version done, simply because you get some payoff for the work you put in. This, of course, only happens if you have had some register pressure to begin with, or the algorithm benefits from expanded registers.

Great article as always Damage, good to see some performance numbers.

Zenith

14 years ago

dragmor – Heh, I wouldn’t doubt there being extra juice pull (Though not much), but you have to remember something…the Athlon64 has extra Registers for 64-bit mode….the Pentium 4 doesn’t.

redpriest

14 years ago

#10 – EM64T is identical to AMD64 to about 99.9%. There are a couple exceptions, but nothing an applications programmer has to worry about for the most part. This means the P4 gets the benefit of the expanded registers too – and in some cases, gets an even larger benefit than the Athlon 64 does!

Lord.Blue

14 years ago

Actually, no, the P4, EM64T or whatever does not have the same number of registers that the AMD64 does, AMD64 uses 40-bit registers, while the P4 uses 36-bit registers all the while lying to Windows and any other application saying that it does indeed have 40-bit ones. When you try to run something on the registers 37 and up, the system will crash. Seen it. Really bad when the developer manual from Intel is a carbon copy of the one AMD sent to developers almost a year ago. Right down to saying 40-bit registers. The reason Intel did not implement 40-bit registers is one thing: cost. It already had made new 36-bit ones for the Prescott & Xeons, so it figured it could get away with it. Seems like it has, but don’t expect full compatibility between both chips even if they have the same instruction sets (SSE 1, 2, 3, etc.)

just brew it!

14 years ago

I think you are talking about physical address bits, not registers. The registers are all 64 bits wide.

Lord.Blue

14 years ago

Probably so, but there is that difference for programmers to consider when making programs, so something that absolutely flies on an Athlon may just crash on a P4.

jdevers

14 years ago

Only the OS really has to worry about that sort of stuff, the app just requests memory and not specific addresses (well, not in general anyway).

UberGerbil

14 years ago

Sorry, you’re wrong. The P4 has the same number of registers as the Opteron/A64, in both 32bit and 64bit mode. And in 64bit the integer registers are 64bit (I can’t understand whether you’re asserting a difference in number of registers or their width, but it doesn’t matter, as both CPUs are identical in both respects). Both do 48bit virtual addressing. There is a difference in physical addressing, as JBI says: there are up to 40 valid address bits in the Page Table Entries on the A64, and 36 valid address bits on the P4 (the PTEs are actually 64bits, but some bits are reserved, and others are used for things like the NX bit). However, only the OS will see that difference, and it would only be relevant if you were running on a system with >64GB of memory in any case. There is no way the difference in the number of valid bits in the Page Table Entries will cause an application to crash. Early OS versions may have had problems with the difference, but not applications. Applications do not see physical addresses. And unless your system has more than 64GB of memory, all of those addresses will fit into 36bits or less anyway.
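To put numbers on those address widths, here is a quick bit of arithmetic. The helper name `addressable_gib` is made up for this sketch; the bit counts are the ones quoted in the comment above, plus plain 32-bit addressing for comparison.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_gib(address_bits):
    """Return how many GiB can be addressed with the given number of address bits."""
    return 2 ** address_bits // GIB


for label, bits in [
    ("32-bit flat addressing", 32),
    ("36-bit physical addressing", 36),
    ("40-bit physical addressing", 40),
    ("48-bit virtual addressing", 48),
]:
    print(f"{label}: {addressable_gib(bits)} GiB")
```

So 36 bits is where the 64GB figure comes from, 40 bits reaches 1 TB, and the 48-bit virtual space that applications see is the same 256 TB on both chips.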

Intel probably restricted the current version to 36bits simply because they already have server boards that support 36bits of memory and they wanted to maintain compatibility (allowing them to use old Xeons in 32bit mode using PAE to access >4GB and new Xeons in 64bit mode using flat addressing). Those additional bits are still reserved, so they could enable them in future versions of the P4/Xeon. And realistically, it’s going to be a while before desktop systems bump up against the 64GB “limit” anyway.

If there was a serious incompatibility between the chips, it would have shown up in these tests. And word would have leaked out of Redmond. The same version of Windows runs on both. The instruction sets are the same. The registers are the same. And addressing, as seen by application programs, is the same.

UberGerbil

14 years ago

…………………………..

Flowboy

14 years ago

Very nice.

SSE2 is really a superset of what’s in MMX; it shouldn’t be that hard to port the MMX code over. There have been MMX and SSE intrinsics in the Microsoft and Intel compilers for a good few years now; you don’t always have to use assembler. My experience of porting from MMX to SSE2 is that it’s straightforward.

dragmor

14 years ago

Nice to see positive results.

Any chance of doing a comparison of the power draw / CPU temp with 64bit vs 32bit applications, since the extra registers, etc. are being used?

totoro

14 years ago

second that.

side note: should I use the rc2 of xp x86-64, or wait for the final version?
can I install side by side with XP?

redpriest

14 years ago

You need a separate partition for 64-bit windows, or you could install on top of your 32-bit installation.

I recommend the separate partition. There are still some bugs with 64-bit windows that are being fixed.

totoro

14 years ago

Thanks for the information.

BTW, Great review, Scott!
This is the kind of article that attracted me to TR in the first place.

Mac_Bug

14 years ago

Larger memory address space also means larger page tables and the works, in that sense even the most efficient algorithms might take longer with a bigger input.

alphaGulp

14 years ago

I don’t quite get what you are saying. The page table grows as you access more memory. x64 lets you go past the 32bit limit, but obviously you can’t compare a game using 10GB of textures and data, vs. one that is using 2GB. (or one using 100 MB vs. one using 500MB).

Mac_Bug

14 years ago

Well, in the absence of advancement in hardware, PAE mentioned is a classic example of the problem. Expanded address space resulted in an extra level of page table, smaller number of entries per table and bigger page size which can lead to higher internal fragmentation plus overhead on software translation and additional memory access in the case of TLB cache miss.

Hierarchical paging worked for 32bit, but it’s not really appropriate without additional hardware support for 64 bit processors so I’m assuming stuff like hashed page table will be used instead. There are always trade offs between speed and space usage, and I would imagine people might not be too pleased to have their program’s memory footprint balloon too much, which typically means slower algorithms (if it were faster, then it would already be used, and even if that’s the case, larger input means slower output) which is only negated by faster subsystems or moving software calculations into hardware.

UberGerbil

14 years ago

Current implementations are using the same 3rd level of PTEs that PAE uses, they just do it transparently and more efficiently. Then they add a fourth level to handle the larger address space. There are some issues (the OS has to be careful to manage memory for 32bit apps so they don’t see addresses they aren’t expecting) and it’s possible that a time will come when inefficiencies create problems for our fully-stocked 64GB+ machines, but with present hardware it’s not a problem. And it’s not like we will be headed into uncharted territory even then; Itanium, Power, etc are already there.
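As a rough illustration of the extra levels, the sketch below splits a virtual address into its per-level table indices for 4 KB pages. The bit layouts used (2+9+9 index bits for 32-bit PAE, 9+9+9+9 for 64-bit long mode, each with a 12-bit page offset) are the commonly documented ones; the function name and constants are hypothetical, and large pages are ignored.

```python
PAGE_OFFSET_BITS = 12            # 4 KB pages
PAE_LEVELS = (2, 9, 9)           # 32-bit PAE: three levels of tables
LONG_MODE_LEVELS = (9, 9, 9, 9)  # 64-bit long mode: four levels


def split_virtual_address(address, index_bits, offset_bits=PAGE_OFFSET_BITS):
    """Split an address into (table indices from the top level down, page offset)."""
    offset = address & ((1 << offset_bits) - 1)
    remaining = address >> offset_bits
    indices = []
    # Peel index fields off from the lowest level up, then reverse the list.
    for bits in reversed(index_bits):
        indices.append(remaining & ((1 << bits) - 1))
        remaining >>= bits
    indices.reverse()
    return indices, offset


# Top of the 32-bit address space under PAE.
print(split_virtual_address(0xFFFFFFFF, PAE_LEVELS))

# An address built from known indices 1, 2, 3, 4 and offset 5 in long mode.
sample = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(sample, LONG_MODE_LEVELS))
```

Each level is one more memory lookup on a TLB miss, which is the cost being weighed in this thread; the index widths add up to 32 bits for PAE and 48 bits for long mode.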

Convert

14 years ago

What Indeego said minus the copyrighted green punctuation.

I am glad to see gaming doesn’t take the kind of hit the first 64bit tests showed. In some cases it helps, which is very nice to see.

Anyways, I’ve seen Windows benchies, and I’ve seen Linux benchies…AMD64 does pretty well…Now to wait for dual-core benchies. 🙂

samadhi

14 years ago

Couldn’t agree more about the Synthetic Benchmark stuff. What is the point of Synthetic benchmarks if they do not at least accurately reflect the performance of real life applications?

Anyone who based their buying on Sandra, Dhrystone and Whetstone would go out and buy a P4 in an instant as it is apparently almost twice as fast as the AMD64.

I am not really against synthetic benchmarks as a whole if they can be used as an easy way to gauge the relative power of CPUs, but when they demonstrate a significantly different behaviour pattern to real life performance you have to wonder if they should still be included in reviews?