Posted
by
kdawson
on Tuesday September 12, 2006 @11:30PM
from the times-eight dept.

An anonymous reader writes, "Apple's Showtime event was all well and good, but the big news today was on Anandtech.com. They found that the two dual-core CPUs in the Mac Pro were not only removable, but that they were able to insert two quad-core Clovertown CPUs. OS X recognized all eight cores and it worked fine. Anandtech could not release performance numbers for the new monster, but did report they were unable to max out the CPUs."

Too many cores on the same bus will cause a lot of contention for memory access. There will always be a place for NUMA architectures, including clusters. That place is for the ultra-high end though, not for scientists who merely want a few processors for a Gaussian computation.

There should be a considerable performance improvement if the cores are on the same chip die, since communication doesn't have to go through the motherboard.

If the bulk of your bus traffic is inter-CPU transfers, yes. However, if you've now got four cores and they all need to get to memory (or, heaven forbid, the disk), then they're all going to be sucking down bus bandwidth, and sitting in wait states until the cache refills. A single processor can waste over a hundred cycles on a cache miss, I don't e

Speaking of memory access, it seems Anandtech showed the Pro in the worst light. They pointed out (fairly) where the higher latency of FB-DIMMs slowed performance, but ran the benchmarks with only a pair of DIMMs instead of four, failing to show the boost in performance from quad-channel memory access. Doubling memory bandwidth could have boosted some of the scores. It would have been fun to see something better show the potential gains available from additional cores. A utility like Visual Hub [techspansion.com] can use mul

Which puts me in mind of the sex researchers Masters and Johnson, who forty years ago established under rigorous experimental conditions that the degree of, uh, masculine endowment doesn't make any difference. Notwithstanding this, people always care about what they can't have.

Hrmmm. Well, seeing as how I just took delivery of a new quad 3.0GHz Mac Pro, this dulls my bragging rights a bit. However, this bodes well for the CPU upgrade market. Companies like Sonnett, Newer, Powerlogix and OWC have had a tough time with the IBM/Freescale market because of poor performance among other critical reasons. The old 1.0GHz G4 I have at home as a media server is still an adequate system that currently holds a terabyte of storage space and I'd love to drop a good 2.0GHz or higher chip

However, this bodes well for the CPU upgrade market. Companies like Sonnett, Newer, Powerlogix and OWC have had a tough time with the IBM/Freescale market because of poor performance among other critical reasons.

And it will still bode poorly for these companies, because now that the Mac is all off-the-shelf components, so are the CPU upgrades.

There are enough old G4s lying around for the after market to last for a few more years. I'm keeping mine til the thing dies because I still need an OS 9 native environment; Classic still can't do everything, and is no longer available on x86 Macs.

Sure, but he'll have to drop $2,000 or whatever it will cost to buy two of these puppies first. His credit card is probably already strained from buying a $5,000 desktop to start with. :)

In Australian dollars at least, it is over $1,000 extra to get the 3GHz vs the 2.66GHz CPUs in the Mac Pro - that's about US$750 at the current rate. So chances are these quad-core CPUs will be pricey.

In Australian dollars at least, it is over $1,000 extra to get the 3GHz vs the 2.66GHz CPUs in the Mac Pro - that's about US$750 at the current rate.

FYI, this processor bump costs exactly US$800 (plus applicable tax, of course) from Apple for buyers in the US.

Having always presumed it a foregone conclusion that the processors would be swappable, I opted for the standard 2.66GHz configuration and an eventual upgrade as it becomes necessary. Considering the current cost of FB-DIMMs with huge heat sinks (a

While Jones's transformer gets installed, don't forget that my blog is directly connected to the grid so we pretty much never have a power loss.
Today I mostly covered the resolution of the piece of cheesy-poof that was stuck between the letters Z and X on my keyboard (those that have been reading my blog know the entire tale). Well, to summarize, I spilled a little diet coke in that same area just this morning, the food soaked it up, and voila, out it popped.
Anyway, check it out, it's at www.foopy-see

Isn't Clovertown where all the leprechauns hang out? 'Tis a fine place to spend your gold on some Guinness while watching some midget porn. Just don't get into an argument about pipe tobacco with one of those short-assed little shits.

You know, I thought I would never say this. You're right, this is great. The one MAJOR thing I did not like about Apple is that I can't change the hardware much. Back in high school I swore I would never own a Mac unless I could upgrade the CPU. With OSX and tweakable hardware the Mac is looking more and more worthwhile.

It's harmless in the sense that it won't crash your computer, but it will still block that user from running any additional programs because it uses up their thread quota. Of course, if you can trick someone into running it as root.... I remember writing stuff similar to this back in the 80's to trip the watchdog on the VAX when the system operator was away and the machine needed a reboot. I think the C code of choice was something like "main(){while(fork(fork())||!fork(fork()))fork();}". We'd get a few

"they were unable to max out the CPUs" that is ridiculous! On PC's in VB it's pretty simple: dim Processor1Thread as new thread(addressof sub1)
dim Processor2Thread as new thread(addressof sub2)
Processor1Thread.start()
Processor2Thread.start()
dim x as integer
sub sub1()
for x = 0 to 1000000000000000
end sub
sub sub2()
dim x as integer
for x = 0 to 1000000000000000
end sub
and repeat for 6 other threads and subs. So they either proved it doesn't really work well at all or programming on a mac is impossibly hard...or they're lying to make it sound more dramatic. So whether they're lying about not maxing it out or they're lying and you just plain can't use all 8 cores at once, it's not as good as it sounds.

Your sig reads (to me) like you are a (younger) CS student. Assuming you are, here's what you're missing; in the real world, we need to max out those cores doing something productive, or we get in trouble. Very few users have apps that can use even more than one core usefully.

I'm 19, been in college since I was 17 cuz they made me go early since I was so smart. And forget that CS theory bullshit, the department is called IT and that's what's written on the degree. People that go to 4 year colleges for programming are beyond stupid and I've heard many stories of how all that theory and little experience forced them to go to my college for a year before anyone would hire them. But gee, at least they know when C++ was invented and how they decided to name memory addresses. And

Sounds like you're going to something like DeVry [devry.edu], correct?

Here's a hint... Most companies won't give a DeVry graduate any more consideration than someone without a degree. In fact, many companies will take someone who is self-taught without a degree over a DeVry graduate.

And forget that CS theory bullshit

Good luck with ever being more than a code monkey. If you don't understand the theory behind programming, you'll never do more than writing basic code that conforms to the specifications that the architects gave you.

P.S. My sig says that because the teacher, a 15 year programming veteran, and some other crazy expert with natural skills like me all couldn't design the project we were working on as fast as I could and only one other person's was virtually crash proof.

If a second year student is writing better code than the teacher, that says a lot about the school. That goes back to what I said about most companies not giving much (if any) weight to a degree in "PC programming/Web Development with a certificate in Web Design", because the types of schools that give those out are usually not the highest caliber.

And I'm not trying to be a dick, but drop the attitude; you're not the super programmer that you think you are. Relax, and pay attention to what others are telling you, you'll learn something.

ps... Graduating high school and starting college at 17 isn't all that special, tons of people do that.

I had some friends who did two year certificate programs at a local college with a VERY good CS program. Anyone who graduated from that program was almost guaranteed a job at a certain darling of the game development world.

These guys told me a story once. Some hotshot with a degree from DeVry was hired one day. He was fired within two weeks for incompetence.

I'm always suspicious of an institution of higher education that finds it necessary to advertise on TV, radio and by SPAM!

GP claims correctness because he was one of the best programmers at his school, and he started school at 17. I started university at 15 and similarly out-performed (most of) the (largely mediocre) students at my (less-than-prestigious) university as well as many of the professors. Ergo, if we assume the GP's correctness, my opinions must carry equal or greater weight than the GP's, by his own arguments.

However, I agree with the parent and think the GP is full of crap. This contradicts the starting assump

The reason colleges make you take all those stupid classes is to help round out your education, so you learn to think in a variety of different ways and learn different methods of analysis... at least at good colleges. If you really want to be a better programmer, take a class on the philosophy of language.

way to make the 4 year assumption there. I'm going to write and sell software myself. I looked at the badly designed crap that's out there and decided to become a programmer because I can do infinitely better. That same theory applied to computer repair and that business is running pretty well for me at the moment too.
And 4 year colleges rerun all that info from high school and middle school because they assume you paid no attention and must have cheated on the SAT/ACT's or something to get in. It's an

Just a guess, they don't teach English there, do they? And, I want to guess you skipped typing when you left high school early and went straight on to "college". I think what they meant in the article is that they have no applications that thread to 8 threads nicely. It's easy to max out 8 CPUs/cores with 8 different tasks (or 9-10 tasks if you want to take advantage of context switches and iowait). It's harder to find something that scales past 4 threads because most programmers just don't program for it. A

Dude, give the kid a break. He didn't learn anything about Shakespeare, atoms, Africa, grammar, and how to turn on a computer (his words, not mine). By the time we get to be managers (if you aren't already), he'll still be in college, trying to figure out why he can't get laid, and we can make it a point not to hire one-trick ponies with big ego problems. He was 17 when he entered college, for crying out loud!

I run Blender (www.blender3d.org), and the latest version supports 8 CPUs. When integrated with povray (blend2pov), you get really nice rendering of very powerful models and can animate the lot (plus add hair/cloth/particle effects) plus sound/animation, etc. When you add Catmull-Clark subdivisions and advanced effects, and povray the lot at 24 frames per second, your CPUs can be pinned at 100% for literally hundreds of hours at a crack. My single 1.8 GHz processor can easily be pinned working on the same job for months on end (6 at least). Double the processor speed and you could look at 3 months. Now divide by 8 processors, and 90 days turns into 11.25 days -- pinned at 100%. Now I take the animation, and add 3 more scenes, and we are back up to 45 days of rendering with 8 cores twice as fast as what I am running now. There are literally a million computer applications that suck time hard. Over at Pixar, one frame from Finding Nemo took 4500 computers over 90 hours to render. Supercomputers with hundreds of thousands of processors (BlueGene/L, etc.) are usually capped to not run jobs that take more than two weeks to run. Short answer: they did not try very hard to 'max the processors'.

Apart from a missing 'next' statement, why wouldn't any half-decent compiler just optimise out the pointless empty looping? I'm pretty sure you've got to do something in a loop or it'll be dropped by the compiler as a trivial optimisation. But hey! What do I know after years of VB, VBA programming, in addition to *real* languages like C++ or *useful* things like SQL? I'm a babe in the woods compared to a Uni student full of piss and vinegar!
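For what it's worth, here's roughly what that busy-loop test looks like when it's done properly, sketched in C with pthreads (the thread count and iteration count are arbitrary). The volatile counter is the whole trick: it gives the loop a side effect the optimizer isn't allowed to throw away.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8   /* one spinner per core, assumed */

static volatile unsigned long counters[NUM_THREADS];

static void *spin(void *arg)
{
    int id = *(int *)arg;
    unsigned long i;
    /* visible side effect, so -O2 can't simply delete the loop */
    for (i = 0; i < 1000000000UL; i++)
        counters[id]++;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, spin, &ids[i]);
    }
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("thread 0 spun %lu times\n", counters[0]);
    return 0;
}

Build with gcc -O2 -pthread and all eight spinners will happily sit at 100%.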

The NeXT architecture of OS X has always been more “at ease” with multiple CPUs than various versions of NT. Not that NT can’t handle them, but that OS X does a better job of dividing tasks sanely to more fully utilize the chips and from what I’ve heard is much more capable once you move past four. That being the case, as multiple CPUs/cores become more commonplace, I think OS X will end up with the reputation of being the faster of the two.

Windows divides just fine on multiple cores. It just spreads threads around, and can even move things core to core (or CPU to CPU, as the case may be) as needed. Remember there ARE 32 processor versions of Windows. I have a friend who works on them, they do large SQL databases on 32-processor Itanium Superdomes (HP) running Windows. I've never seen any good benchmarking on it, probably because there haven't been higher order Intel Macs until recently, but I'm going to bet you find little difference when running apps

Remember there ARE 32 processor versions of Windows. I have a friend who works on them, they do large SQL databases on 32-processor Itanium Superdomes (HP) running Windows.

I thought that the >4 CPU Windows systems were, in essence, specially tweaked systems to make it all worthwhile and that standard setups couldn’t really make effective use of more than four processors. If so, I stand corrected. *looks around* Err, sit corrected, sorry.

I thought that the >4 CPU Windows systems were, in essence, specially tweaked systems to make it all worthwhile and that standard setups couldn't really make effective use of more than four processors. If so, I stand corrected. *looks around* Err, sit corrected, sorry.

Multi-core restrictions on Windows versions are mostly artificial. For example, 8-CPU systems run just fine on Windows 2003 Advanced Server without any special tweaking. The system the grandparent referred to must have been runnin

Well, according to MS, Windows has no problems supporting 32 processors for 32-bit software and 64 processors for 64-bit software. Particular versions of Windows are limited to a lower number of processors, though not cores. One processor is one processor regardless of cores by MS's licensing. Indeed you'll find XP Pro, while only supporting 2 processors, will happily run 2 dual-core processors and see and use all 4 cores.

You have to remember that Windows is not static, they improve it all the time. They rolled out a 32-processor version back with Windows 2000. It's called Data Center Edition. You can't buy it over the counter, only from OEMs that make systems with tons of processors. You've likely never encountered it since it's fairly rare to see systems with that many processors. Generally you cluster smaller systems rather than get one large one. However there are cases where the big iron is called for, hence why HP sells them.

Also I think multiprocessing in the OS is less complicated than many people make it out to be. The OS isn't where the magic has to happen, it's the app. The OS already has things broken up for it in the form of threads and processes. A thread, by definition, can be executed in parallel. So the OS simply needs to decide on the optimum placement of all the threads it's being asked to run on its cores. Also, it doesn't have to stick with where it puts them (unless software requests a certain CPU), it can move them if there's reason to. The hard part is in the app: to break it up into pieces that can be processed at the same time and to keep them all in sync.
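To make the app side concrete, here's a toy sketch in C (sizes, thread count, and names are all arbitrary): each thread gets an independent slice of the work, the OS is free to place those threads on whatever cores it likes, and the joins at the end are the sync point where the pieces come back together.

#include <pthread.h>
#include <stdio.h>

#define N        (1 << 22)   /* ~4 million elements */
#define NTHREADS 4

static double data[N];

struct chunk { int start, end; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    int i;
    for (i = c->start; i < c->end; i++)
        s += data[i];
    c->partial = s;   /* each thread writes only its own slot, so no locking is needed */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct chunk c[NTHREADS];
    double total = 0.0;
    int i;

    for (i = 0; i < N; i++)
        data[i] = 1.0;

    for (i = 0; i < NTHREADS; i++) {
        c[i].start = i * (N / NTHREADS);
        c[i].end   = (i + 1) * (N / NTHREADS);
        pthread_create(&t[i], NULL, sum_chunk, &c[i]);
    }
    for (i = 0; i < NTHREADS; i++) {   /* the sync point */
        pthread_join(t[i], NULL);
        total += c[i].partial;
    }
    printf("total = %.0f (expected %d)\n", total, N);
    return 0;
}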

My guess is that it's mostly FUD floated by anti-Windows people. There is, unfortunately, a lot of that going around. For example it was reported on /. that Vista won't support OpenGL (http://slashdot.org/article.pl?sid=05/08/06/1772251). Well it turns out this isn't just false, but is the exact opposite of the truth. Vista indeed supports OpenGL in three different ways:

1) The method mentioned there, as an emulation that is limited to 1.4 and isn't that fast. Bonus is it works on any system with Vista graphics drivers, even if the manufacturer doesn't provide GL.

2) Old style ICD. This is the kind of driver used on XP today. This more or less takes over the display, and thus will turn off all the nifty effects while active. The bonus is there's little to update. However this is probably not going to be used because there's...

3) The new ICD. This provides full, hardware accelerated GL and is fully compatible with the shiny new compositing engine. For that matter, you can add any API you want via an ICD that works with the new UI.

So not only does the OS have the ability to support GL, it can do so better than XP can, because GL can be used in the same way as DX. However to read the /. story, you'd think they'd all but disabled hardware GL in their OS. As it stands nVidia has beta drivers with a GL ICD. I haven't tried them, but the release notes suggest it's a new ICD that works with the compositor. ATi's drivers don't have an ICD, though ATi claims to be working on it and says they'll have it for launch. Intel doesn't have any driver status for Vista on their website.

When it comes to Windows info, you do need to check sources, as with anything else. There's plenty of misinformation floating around. Often people who don't like Windows believe they know what they are talking about so post incorrect information.

Well I'm not going to justify their business case to you since I don't work for them. However, I'm going to go out on a limb here and say you've got no idea what you are talking about. I'm going to guess you probably do not develop enterprise telecom apps for a living. This is, in fact, what the company my friend works for does (no I'm not going to name them). I don't know why they use what they use, I don't work for them, however I'm going to guess, given that they do a good job making money, that their ch

The NeXT architecture of OS X has always been more "at ease" with multiple CPUs than various versions of NT.

Your evidence for this being what, exactly? Tea leaves?

NeXT didn't even *support* multiple processors until Apple's OS X reinvention, whereas NT was designed from the ground up with multi-CPU machines in mind and has supported them since its first release in 1993.

Not that NT can't handle them, but that OS X does a better job of dividing tasks sanely to more fully utilize the chips and from what

Anandtech could not release performance numbers for the new monster, but did report they were unable to max out the CPUs.

From TFA:

We definitely had a difficult time stressing 8 cores in the Mac Pro, but if you have a handful of well threaded, CPU intensive tasks then a pair of slower Clovertowns can easily outperform a pair of dual core Woodcrest based Xeons.

There's a big difference between unable to and had a difficult time. When I first read the summary I thought that there must be some problem with the system if they're unable to get all the CPUs under full load.

I thought that there must be some problem with the system if they're unable to get all the CPUs under full load.

It's actually really easy to do if your memory system isn't meant to service 8 cores. And the article pretty much backs this up, every time the quad cores fail to shine it's blamed on the memory. But to me, the really interesting aspect of this is that they always blame FB-DIMM, which gains bandwidth by sacrificing latency. They even go so far as to suggest:

if Apple were to release a Core 2 (Conroe/Kentsfield) based Mac similar to the Mac Pro, it could end up outperforming the Mac Pro by being able to use regular DDR2 memory.

So, I think regular DDR2 @ 667 = 5.4 GB/s... divided amongst 8 cores is just 675 MB/s per core. It seems insane to think that would work (maybe it would, maybe my numbers are wrong also). If you want to attack latency but simply can't give up the bandwidth, wouldn't the SMT model work better -- just swap out the L2-miss-stalled thread and run the other full bore? Now you've reduced the problem to distributing your register bank among active threads. Well, I think that's how video cards do it, and memory latency is their enemy #1.

In any event, there you have it. The performance pendulum has left GHz, is briefly swinging toward more cores, but appears headed now toward memory systems. Does anyone else think it's funny that L1 is still just 32KB? (oughta be enough for anybody).

Yes, and it runs very well (drivers for all major devices). Note that installing XP of any sort on the Mac Pro is a bit of an endeavor currently due to the need to slipstream drivers or you get 1/20th of the SATA performance. http://forums.macrumors.com/showthread.php?t=231901 [macrumors.com]

To be perfectly honest, I can see an immediate application for this where I work.

We're introducing a virtual infrastructure very quickly, using XServe RAIDs as our storage LUNs. That being said, with VMware's soon-to-be Mac OS X offering, this would give our Mac-toting engineers the ability to build a virtual machine locally before deploying it into the wider infrastructure. That is a truly valuable tool.

There's three of us at work that heavily rely on our non-mac machines - a pair of us doing some reasonably heavy VM work. I'd love to transition to a straight Mac platform (not Mac OS X + SuSE + XP). It's such a pain in the ass to have to suspend one and start another constantly because my performance starts to block. It's not disk I/O - the I/O never pegs (most of the stuff is resident, anyway). The RAM can be mitigated by adding more RAM (4GB currently). More than once I've watched procmon show me that the vmx process is pegged on the

I immediately thought the same thing. Be aware that most high quality renderers aren't multithreaded through an entire render job though.

Case in point: a Maya mental ray render uses a single thread for translation, displacement map triangulation and subsurface scattering map processing before the render itself begins. Most dynamics calculations are also single-threaded.

So, on a dual-core workstation rendering a scene fitting the above description in 5 minutes, I see about 2 1/2 of those minutes utilizing only one core.

Really, who the frig cares from a general computing standpoint? Who needs 8 CPUs?

Where this really pays off right now is with virtualization. For the cost of 1 & 1/2 boxes, you get the value of 8. That may not seem like a "general computing standpoint" to you, but virtualization is getting absolutely huge in the software development and server world. Besides, since when is the Mac Pro dual core a "general computing" machine? My guess is >75% of the buyers are buying them for specific heavy lifting.

My MacBook, a wonderfully affordable Very Good Quality laptop, is now serving as my one and only machine at work. I'm an IT Manager for a 65/35 Windows/Mac house and I LOVE the fact that I now have options like Parallels and Crossover to give me true cross-platform ability... BUT! My CPU is maxed out. I've got plenty of RAM (2GB) and I can only wait to get a faster hard drive, as faster-than-5400RPM SATA drives aren't readily available. Give me a user-installable multi-core CPU PLEASE!

It's great for scientific computing. My software (which analyzes the structure of galaxy clusters) is fully parallelized. Its speed scales with the number of CPUs, i.e., if you double the CPUs you double the speed. A quad mac pro would be an enormous productivity boost for me.

Why don't I just farm the software out to a Beowulf cluster? Well, I do, but we have a queue for ours. When I'm testing the software I need to run, stop, and rerun the software, something which would be inefficient on a remote clust

NeXT multiprocessed the guts of OS X on 2-4 processors. The result is that the Mach kernel doesn't scale linearly with processor count. There isn't the straight-line performance boost from adding another processor beyond 4 cores with Mac OS X's Mach kernel.

Let's assume for the moment that none of us in this forum actually know anything factual about how many years Apple (or even NeXT before them) have been running Mach on machines with more than 4 processors on the corporate campus behind locked doors.

However, we can probably reason this out if we try. We're all bright geek types, right? There are several clues. NeXT bought Apple for a negative $400 million or so in what, December of 1996?

The heritage of NeXT that you mention is a pretty big clue. I don't recall off the top of my head how many processors were supported by the production shipping Mach build for SPARC and PA-RISC back in the NeXT days, but let's assume it was 2, just for the sake of argument. Both of those platforms offered ready availability of systems with many processors even way back then. Perhaps there were systems like that in the lab.

Mach was originally a research project with an interesting goal: clean support of certain abstractions in a platform-independent way. One of those abstractions was support for multiple processors, beyond the typical SMP architectures we see today, which means that the authors' concept of platform-independent went quite some distance beyond a different instruction set in a different RISC architecture. Dig this:

Mach kernel [wikipedia.org]
Unlike UNIX, which was developed without regard for multiprocessing, Mach incorporates multiprocessing support throughout. Its multiprocessing support is also exceedingly flexible, ranging from shared memory systems to systems with no memory shared between processors. Mach is designed to run on computer systems ranging from one to thousands of processors. In addition, Mach is easily ported to many varied computer architectures. A key goal of Mach is to be a distributed system capable of functioning on heterogeneous hardware.

An excellent book entirely about Mach is: Programming under Mach [amazon.com], which also mentions the design intent.

The original project was funded by DARPA, with the specific goal of developing operating systems technologies which would support super computers with hundreds or thousands of processors.

The Mach project developed new techniques which have migrated directly (via actual Mach code to OSF, NeXT, Mac OS X, et al.) or indirectly into pretty much every modern operating system.

Mach research spanned a very long period of time, and two Universities. Curious, bright, and arguably insane people (or they would have been making money instead of slaving away making Mach on grad-student salary) with access to multiple processor machines with DARPA funded directives to make it scale to hundreds of processors. Hmm... that seems like a clue.

NeXT was, and Apple is, a hardware engineering company. Apple has been building multiple-processor boxes since before the reverse acquisition. I know, I had the, uh, perverse and shameful pleasure of running BeOS on one of them for sport.

I'm not quite sure what you mean by "mitigate their single-threaded nature", but if you run 8 single-threaded processes on an 8 core machine on any modern OS, the OS will end up spreading the workload across all 8 processors without having to do anything special. Normally, the OS will move threads from core to core as it sees fit, depending on the whims of the thread scheduler. However, you can override this (e.g., in XP by using the task manager and setting the processor affinity mask). The main reason
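For the record, the programmatic version of that task-manager trick is the Win32 affinity-mask call. A minimal sketch (the mask value here is arbitrary):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x3;   /* one bit per core: 0x3 = cores 0 and 1 */

    /* same effect as Task Manager's "Set Affinity..." dialog */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    printf("process restricted to cores 0 and 1\n");
    return 0;
}

Normally you wouldn't bother; the scheduler's own placement is almost always good enough.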

would you be able to somehow mitigate their single-threaded nature by assigning the respective processes each to its own core?

First, pretty much any application on the Mac is multithreaded just because of the way the user interface works. Apple's OpenGL implementation is partly software, for example... this is why you can run games that expect hardware T&L on the Mac mini with its GMA950 GPU - the OS does the T&L in software on the second core even in single-threaded games.

Second, OS X does a pretty good job of distributing applications to cores without having to explicitly bind them. Binding an application to a core would most likely slow it down... unless the program has been written to use a lot of fine-grained shared state between threads... and what you're doing with processor affinity is *preventing* it from multiprocessing.

Processor affinity is like 64 bit. Unless you're doing something on the edge you probably don't need it, and if you need it you're probably already doing it.

Here's the summary:

The bad news is that OS X doesn't provide a hook for processor affinity. The good news is that Mach does support it, and you could use the Darwin sources to figure out how to implement it in OSX using direct Mach calls. The bad news is that it's really hard. The good news is you don't need to do it unless you're trying to prevent multiprocessing anyway.
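For the morbidly curious, the direct Mach route looks roughly like the sketch below. The THREAD_AFFINITY_POLICY flavor comes from the Darwin headers and is only an affinity hint (threads sharing a tag are kept near each other), not a hard core binding, and whether it's actually present on the OS X release being discussed here is an assumption on my part.

#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    /* group the current thread under affinity tag 1 -- a scheduling hint,
       not a guarantee of which core it lands on */
    thread_affinity_policy_data_t policy = { 1 };
    mach_port_t thread = pthread_mach_thread_np(pthread_self());

    kern_return_t kr = thread_policy_set(thread,
                                         THREAD_AFFINITY_POLICY,
                                         (thread_policy_t)&policy,
                                         THREAD_AFFINITY_POLICY_COUNT);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "thread_policy_set failed: %d\n", (int)kr);
        return 1;
    }
    printf("affinity hint set\n");
    return 0;
}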