Benchmarks: WinXP, OSX on MacBook Pro

GeekPatrol uses their GeekBench tool to compare Windows XP and OSX, both running on MacBook Pros. “Overall, there are areas where the Windows XP MacBook Pro was faster, areas where the Mac OS X MacBook Pro was faster, and areas where they were both roughly the same. Looking at these results, it’s hard to say which configuration comes out on top, although I think you could make a convincing argument for Windows XP (with Visual C++) being a bit faster overall than Mac OS X (with GCC).”

This benchmark doesn’t really compare the two OSes: it compares two operating systems while compiling the same code with two different compilers. Why not compile the Windows version with GCC (using Cygwin)? As it stands, this is actually a benchmark of VC++ versus GCC, and it tells us nothing new!

But why compare two OSes while changing a very important factor? Everybody knows that GCC is not the fastest compiler out there; the reason it’s the default on most OSes is its open-source nature, not its performance. If VC++ weren’t a Microsoft product, GCC might well be the default for WinXP too. And with the money required to buy WinXP’s default compiler (VC++), someone could easily buy a faster-than-GCC compiler for Mac OS X as well.

For me, this was just another GCC vs. VC++ comparison (not WinXP vs. Mac OS X), and it told us nothing new.

Actually, VC++ is free these days (as in beer). In any case, everyone compiles with VC++ on Windows, even though they could spend money on a better compiler (Intel C++). The same will likely remain true of GCC on OS X. Also note that only GCC supports Obj-C on OS X, so it’s pretty much the only reasonable compiler choice for most OS X apps.

The operating systems aren’t really relevant in these benchmarks, except as noted (the standard library implementations). Only a very bad operating system would have an adverse effect on what is basically a bunch of CPU benchmarks. What is relevant, however, is the platform. Simply put, GCC is part of the OS X platform, while Visual C++ is part of the Windows platform. GCC-code performance on Windows is more or less irrelevant, as is non-GCC code performance on OS X.

I couldn’t help it. It’s not horribly surprising, especially on the stdlib scores.

All in all I think they’re pretty close in _most_ of the benches. But it’s neat to finally be able to compare them on the exact same hardware. Imagine telling your dad this 15 years ago! Or if you’re old enough, telling yourself!

And some memory benchmarks do depend on the OS, if they call malloc/free n times, where n scales with the length of the benchmark. But I think most of these don’t do that; definitely not bzip2.

And it also shouldn’t show much difference if the malloc’s are all the same size.

… And I’ll say it again — YOU DO NOT PUBLISH BENCHMARK SCORES SUBMITTED BY READERS AS CONCLUSIVE.

Seriously.

Just to name a few things off the top of my head … you don’t know what kind of tweaks/optimizations have been done to either installation. You don’t know what other programs were running during the benchmark. You don’t know if someone has disabled DEP on their XP installation. You don’t know if the tests were run repeatedly to iron out abnormalities, or whether abnormal scores for a few tests were submitted.

And so on, and so forth.

I appreciate what these guys are doing in terms of writing a neat little benchmark, but they’re going about it completely incorrectly. Oh, and on the topic of compiler optimizations …

MSVC and GCC are *very* different compilers. Most seasoned developers will provide custom optimization flags for each specific benchmark source file, knowing which optimizations are beneficial to that specific code. You can’t just use a few generic flags for everything. Each of these optimizations is also very specific to the version of the compiler being used, let alone to different compilers.

This is all just very silly. They should borrow a MacBook Pro from a friend and run their benchmark in a consistent manner, documenting all of the settings used for both OSes.

If I didn’t type something in wrong, the geometric mean of all the WinXP results is 1.2177. With the four extreme outliers removed, the geometric mean is 1.1873. So basically, we’re talking about a 20% advantage for Windows. Not hugely significant, but not peanuts either.

Well, either you did type something wrong, I did, or geometric mean has a different meaning to you and me.

Going a bit further, if I do the obvious thing and separate the data in single-threaded vs. multi-threaded, taking out the lower & higher values, the results are 4% advantage in multi-threaded and 6.7% advantage in single-threaded tests. (1.039 & 1.067 geom. mean respectively)

Of course, “geometric mean” has NO MEANING in this case to begin with, since each value measures a different element, and neither would arithmetic mean, modal, etc.

If you want to compare the performance of both systems, a true test is running Real World Application(s) X(YZ) on system A vs. the same Real World Application(s) X(YZ) on system B.

Hmm. I couldn’t call a 20% performance difference dog-like. In fact, given VC++’s limited focus, and GCC’s much broader one (both in language and platform support), a delta of 20% on this particular benchmark seems quite good to me. When you consider that GCC’s current register allocator is quite primitive, and that its SSA infrastructure has yet to mature, future improvements could easily wipe out that difference and even turn it into an advantage. These improvements are going to come one way or another, either via the integration of LLVM’s SSA framework, or via alternate proposals that have cropped up.

In the end, a 20% difference means a lot when you’re developing applications. Not to take anything away from GCC, but if you’re developing for Windows with C++, there really is almost no reason to use it.

This is one of several benchmarks that show the stdlib memory allocation functions as being woefully slow. Apple really needs to do something about that. A 35x speed hit in this benchmark and similarly, though not equally, bad benchmarks on other memory tests leads me to think Apple needs to concentrate their optimization folks in that part of the OS.

The 35x number on this benchmark really should raise some alarm bells about the validity of the test. Based on the documentation I’ve found, malloc() on both OS X and Windows is thread-safe. There are only so many ways to skin the malloc() cat (particularly the thread-safe version), so I find it really hard to believe that OS X’s malloc() is that much slower than Windows’. Looking at the OS X libc code, it appears that the malloc implementation is one of the few parts of the stdlib portion that are not based on FreeBSD code. The OS X malloc implementation appears to have been written to replace the previous (presumably 4.4BSD-derived) implementation in 1999, and it looks fairly sophisticated.

Therefore, I find it far more likely that Geekbench is either hitting some sort of pathological case in OS X’s malloc(), or hitting some fast-case optimization in Windows’. It would be illustrative to see how the stdlib test performs on Linux, which uses glibc’s very good malloc() implementation.

Actually, after using ObjectAlloc to look at Geekbench’s allocation profile, I have to say the benchmark sucks. Basically, it repeatedly allocates and then immediately frees a 32KB memory block. This case is trivial to optimize for. All the malloc code has to do is cache the last freed block and its size, and if the new allocation size matches the old one, return the cached block. Windows is full of such “do nothing case” optimizations, but their advantage for real-world code is minimal at best.
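The “cache the last freed block” trick described above can be sketched in C. The names are invented, and the size-tracking free is a simplification (a real free() doesn’t take a size; allocators recover it from block metadata):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch of a "do nothing case" optimization: remember the
 * last freed block, and if the very next allocation asks for the same
 * size, hand that block straight back without touching the heap. */
static void  *cached_block = NULL;
static size_t cached_size  = 0;

void *fast_malloc(size_t size) {
    if (cached_block != NULL && cached_size == size) {
        void *p = cached_block;   /* reuse the cached block... */
        cached_block = NULL;      /* ...and drop it from the cache */
        return p;
    }
    return malloc(size);          /* otherwise fall through to the heap */
}

void fast_free(void *ptr, size_t size) {
    if (cached_block == NULL) {   /* cache at most one block at a time */
        cached_block = ptr;
        cached_size  = size;
        return;
    }
    free(ptr);
}
```

An alloc/free loop over a single 32KB block never leaves this cache, which is exactly why such a benchmark says little about real-world allocation.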

A far better benchmark would randomize both the allocation size, and whether an alloc(), free(), or both would happen on any given iteration. It would certainly use a mix of both small, odd-sized allocations (think strings in a program), and large allocations that are multiples of the page size (think I/O buffers). This would test both the ability of the allocator to handle varying block sizes, the ability of the allocator to manage fragmentation within pages, and exercise the allocator with a non-trivial loading pattern with multiple outstanding allocations.
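A minimal sketch of such a randomized benchmark; the slot count, size ranges, and action split are invented parameters, not anything Geekbench actually uses:

```c
#include <stdlib.h>

#define SLOTS 256     /* outstanding allocations held at once */
#define PAGE  4096    /* assumed page size for the large-buffer case */

/* Randomized allocator stress: random slot, random size class, and a
 * random mix of free/alloc per iteration, so the allocator always has
 * many live blocks of varying sizes. Returns iterations completed. */
long run_alloc_stress(long iterations, unsigned seed) {
    void *slots[SLOTS] = { 0 };
    long done = 0;
    srand(seed);

    for (long i = 0; i < iterations; i++) {
        int idx = rand() % SLOTS;

        /* mix small odd sizes (think strings) with page multiples
         * (think I/O buffers) */
        size_t size = (rand() % 2)
            ? (size_t)(rand() % 509 + 3)            /* small, odd-sized */
            : (size_t)((rand() % 8 + 1) * PAGE);    /* page multiple */

        if (slots[idx]) {            /* sometimes free, sometimes both */
            free(slots[idx]);
            slots[idx] = NULL;
        }
        if (rand() % 2)
            slots[idx] = malloc(size);
        done++;
    }

    for (int i = 0; i < SLOTS; i++)
        free(slots[i]);              /* free(NULL) is a no-op */
    return done;
}
```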

The asm code for the stdlib.allocate benchmark follows:

00016068 lwz r3,0x2c(r31)
0001606c addi r30,r30,0x1
00016070 bl 0x1b8c0 ; symbol stub for: _malloc
00016074 bl 0x1b930 ; symbol stub for: _free
00016078 lwz r0,0x30(r31)
0001607c cmplw cr7,r30,r0
00016080 blt cr7,0x16068

Nota bene:

malloc() has the signature int -> pointer

free() has the signature pointer -> nil

On PowerPC, the first 4-byte integer or pointer argument is passed in GPR3. The return value, if it is a 4-byte integer or pointer, is stored in GPR3.

On PowerPC, large constant values (in this case, 32768 and the loop maximum) must be loaded from memory, as immediate values have a limited size. That’s what the lwz’s are for.
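Putting those annotations together, the disassembled loop corresponds to roughly this C (function and variable names invented):

```c
#include <stdlib.h>

/* C equivalent of the disassembled stdlib.allocate loop: the block size
 * and iteration limit live in the stack frame (hence the lwz loads),
 * malloc's result lands in GPR3, and free is then called on that same
 * register. Returns the number of iterations performed. */
long allocate_benchmark(size_t block_size, long iterations) {
    long i;
    for (i = 0; i < iterations; i++) {
        void *p = malloc(block_size);  /* bl _malloc, size in GPR3 */
        free(p);                       /* bl _free, pointer still in GPR3 */
    }
    return i;
}
```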

Avie Tevanian wasn’t “let go”, he left the company to pursue other interests. In any case, he’s been in a managerial position since 2003, so if replacing Mach was on Apple’s list of priorities, Apple could’ve done it by now. The reason they haven’t, and likely won’t, is because fixing Mach’s limitations is likely to be much easier than ripping Mach out and replacing it. The BSD component of XNU is quite intimately tied to Mach, as is IOKit (and by extension all OS X drivers), and even some of the userspace. Replacing Mach would require a lot of time and effort on Apple’s part, and would hardly be transparent to developers.

On the other hand, if Apple spends some time working on Mach’s threading limitations, and continues the locking work they’ve already started for Tiger, they can probably get XNU into pretty decent shape. It’ll never be in the same league as FreeBSD, Linux, or Solaris, in that it’ll probably never be a good fit for 64 CPUs, handle 10,000 threads per machine, or gracefully handle a process select()’ing 1,000 file descriptors, but to be honest, for Apple’s purposes, it really doesn’t need to. Apple isn’t in the high-end server business, it’s in the workstation business, and for such apps all you want is an OS that’s good at getting out of the way, which XNU does adequately.

All that aside, Mach really has nothing to do with these benchmark results. None of these benchmarks should spend a non-trivial amount of time in the kernel.