Graphics chipmaker Nvidia Corp. is in the early developmental stages of its first Mac-bound GPGPUs, AppleInsider has learned.

Short for general-purpose computing on graphics processing units, GPGPUs are a new wave of graphics processors that can be instructed to perform computations previously reserved for a system's primary CPU, allowing them to speed up non-graphics applications.

The technology -- in Nvidia's case -- leverages a proprietary architecture called CUDA, which is short for Compute Unified Device Architecture. It's currently compatible with the company's new GeForce 8 Series of graphics cards, allowing developers to use the C programming language to write algorithms for execution on the GPU.

It's likely that the first Mac-compatible GPGPUs would turn up as build-to-order options for Apple's Mac Pro workstations due to their ability to aid digital video and audio professionals in sound effects processing, video decoding and post-processing.

Precisely when those cards will crop up is unclear, though Nvidia, through its Santa Clara, Calif.-based offices, this week put out an urgent call for a full-time staffer to help design and implement kernel-level Mac OS X drivers for the cards.

Nvidia's $1500 Tesla graphics and computing hybrid card released in June is the chipmaker's first chipset explicitly built for both graphics and high intensity, general-purpose computing.

Programs based on the CUDA architecture can not only tap its 3D performance but also repurpose the shader processors for advanced math. The massively parallel nature leads to tremendous gains in performance compared to regular CPUs, NVIDIA claims.

In science applications, calculations have seen speed boosts from 45 times to as much as 415 times in processing MRI scans for hospitals. Increases such as this can mean the difference between using a single system and a whole computer cluster to do the same work, the company says.

Will software need to be modified to leverage the power of these processors?

I'm very confused at the role of the GPU/CPU, they seem to be crossing paths more and more! Someone, tell me, what does the GPU do, and what does the CPU do, and why are they better for their separate tasks.

As GUIs get more processor heavy, i.e. RI, and people realize that computers are useful for much more than gaming, I think these will be a great asset, especially for companies with software-hardware integration like Apple.

In science applications, calculations have seen speed boosts from 45 times to as much as 415 times in processing MRI scans for hospitals. Increases such as this can mean the difference between using a single system and a whole computer cluster to do the same work, the company says.

"A computer cluster is a group of loosely coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks." From The Fount of all Knowledge.

Will software need to be modified to leverage the power of these processors?

I'm very confused at the role of the GPU/CPU, they seem to be crossing paths more and more! Someone, tell me, what does the GPU do, and what does the CPU do, and why are they better for their separate tasks.

Yes, the computation pipeline would have to be largely rewritten. And even then, the GPGPU is very specific in the kinds of operations it can accelerate. If you can define your operation in terms of what discrete set of input pixels you'll need in order to generate each output pixel (blurring, sharpening, color tweaking, etc), it can accelerate phenomenally. If, however, your operation can only be expressed in terms of what output pixel will need to be modified for each input pixel (histograms, accumulators, searches, etc), it's no better than a regular CPU. To make matters a little worse, in order to switch between the two "modes", you may need to copy your entire image from VRAM to RAM and back to VRAM.

The newer GPGPUs attempt to compensate for some of these issues, but the areas where this computation is applicable tend to be pretty specific.
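The distinction drawn above, between operations defined per output pixel and operations defined per input pixel, can be sketched in a few lines of plain Python. This is only an illustration: the lists stand in for GPU buffers, and the function names `box_blur_1d` and `histogram` are made up for this example rather than taken from any GPU API.

```python
# Illustrative sketch only: plain Python lists stand in for GPU buffers.

def box_blur_1d(pixels):
    """Gather: each output pixel reads a fixed neighbourhood of input pixels."""
    n = len(pixels)
    out = []
    for i in range(n):
        left = pixels[max(i - 1, 0)]        # clamp at the left edge
        right = pixels[min(i + 1, n - 1)]   # clamp at the right edge
        # out[i] depends only on inputs, so every i could be computed in parallel.
        out.append((left + pixels[i] + right) / 3)
    return out


def histogram(pixels, bins):
    """Scatter: each input pixel decides which output slot it modifies."""
    counts = [0] * bins
    for p in pixels:
        # Several inputs may target the same slot, so parallel writers would
        # collide unless the updates are serialized or made atomic.
        counts[p] += 1
    return counts


image = [0, 3, 3, 6]
blurred = box_blur_1d(image)
hist = histogram(image, bins=8)
print(blurred)
print(hist)
```

In the blur, no two outputs share a write location, which is the shape of problem the post describes as accelerating well. In the histogram, the two pixels with value 3 both update the same counter, which is the kind of contention the post has in mind.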

As GUIs get more processor heavy, i.e. RI, and people realize that computers are useful for much more than gaming, I think these will be a great asset, especially for companies with software-hardware integration like Apple.

OMG!! Computers are useful for more than just gaming ??? (KIDDING) This is a very promising development. When it migrates down to the point where us mere mortals can afford it, time will tell.... Sorry but I won't be spending $1,500 for a graphics card ANYTIME soon...

I'm really curious how these things might work for audio - audio companies have been making DSP cards for years, but they are usually extremely expensive and CPUs have become a better bang for the buck.

Anyone know if any audio apps are taking advantage of things like this yet?

It's an interesting development, but it looks to be pretty expensive. I imagine that's the only reason why they'd do it for the few Macs that can use it.

Since this would be useless without the OS recognizing it for what it is, I wonder if the reason why they're doing it, is because Apple is somehow involved. If Apple didn't support it, it wouldn't be useful.

I'm really curious how these things might work for audio - audio companies have been making DSP cards for years, but they are usually extremely expensive and CPUs have become a better bang for the buck.

Anyone know if any audio apps are taking advantage of things like this yet?

The advantage to these chips is that they can be programmed to do whatever you need them to do. A DSP is just a number cruncher, as is a vector processor.

This is really good news. Apple has been offloading processing into GPUs for some time, but so far, I think it's only been for certain graphic-oriented frameworks (like CoreImage) and some special-purpose apps (like Motion).

With this announcement, we may see more generalized frameworks (possibly integrated into Accelerate, CoreAudio, and other frameworks) to take advantage of the immense amount of GPU power that goes unused most of the time.

Quote:

Originally Posted by thefunky_monkey

Will software need to be modified to leverage the power of these processors?

Maybe yes, maybe no. If Apple updates OS frameworks (like CoreImage, CoreAudio, Accelerate, and other commonly-used components) to use these capabilities, lots of apps will automatically begin using them.

Of course, you probably won't be able to take advantage of some features without using new APIs, which will obviously require updates to applications.

Quote:

Originally Posted by thefunky_monkey

I'm very confused at the role of the GPU/CPU, they seem to be crossing paths more and more! Someone, tell me, what does the GPU do, and what does the CPU do, and why are they better for their separate tasks.

The lines have been getting very blurry in recent years.

A modern GPU is really a high-end math coprocessor. Things like matrix arithmetic across large data sets, Fourier transforms, and other features useful for image processing are built in. Many features are highly specialized, like various 3D texture mapping abilities, anti-aliasing, compositing objects, etc.

For quite a while now, it has been possible to upload code into the GPU for execution. Typically, this is to get better video performance, but it can be done for other purposes as well. The source and target memory of an operation doesn't have to be on-screen - it can be anywhere in the video card's memory.

Of course, this is only useful if your application needs the kinds of operations the GPU can provide. Audio processing is one such operation. I can think of a few other possibilities.

The big deal about NVidia's CUDA architecture (if I understand the article right) is that it will now be possible to use the GPU for a much wider range of processing tasks than just those that can benefit from image-processing algorithms.

Quote:

Originally Posted by minderbinder

I'm really curious how these things might work for audio - audio companies have been making DSP cards for years, but they are usually extremely expensive and CPUs have become a better bang for the buck.

Anyone know if any audio apps are taking advantage of things like this yet?

I know Motion (part of the Final Cut Studio suite) requires a powerful GPU, because it offloads tons of rendering work into it. I think the other parts of FCS (including Soundtrack Pro) will use GPU power, if it's available, but I'm not certain of that.

By the time it is available for the Mac, the Mac hardware and OSX should have completely transitioned to a dumbed down system for making photo albums of the kids' little league and sending prettified emails about how you just managed to make photo albums of the kids' little league.

To extend its usefulness will be video iChat so you can hold up the photo album to show nana what you just did.

To get Steve's attention though, Nvidia will have to work a lot harder to make this the world's thinnest graphics card.

It will take a marketing miracle for this to be a success: it was hard enough getting compilers to support altivec, which is a hell of a lot more seamless than this proposal.

Even so, I wholeheartedly support the effort. Superscalar CPUs are often inefficient for many kinds of modern computing. However, I see IBM's Cell as a much better paradigm than this is, and certainly one that's more likely to have an impact on the future of computing. Board-to-board data flow is supposed to be going away, not coming back.

Funny you say that, as this first thing I thought of was Apple using this specific card to boost performance in the successor to Shake…

Quote:

Originally Posted by Mirlin11

… Sorry but I won't be spending $1,500 for a graphics card ANYTIME soon...

And the consumer market is NOT where nVidia is aiming this product… (duh)

Quote:

Originally Posted by gastroboy

Sounds useful, but will Apple buy it?

By the time it is available for the Mac, the Mac hardware and OSX should have completely transitioned to a dumbed down system for making photo albums of the kids' little league and sending prettified emails about how you just managed to make photo albums of the kids' little league.

To extend its usefulness will be video iChat so you can hold up the photo album to show nana what you just did.

To get Steve's attention though, Nvidia will have to work a lot harder to make this the world's thinnest graphics card.

Wow! Bitter much? Apple is a business, and the consumer sales are what drive that business. But I seriously don't see them forgetting about the professional DCC market anytime soon.

I would expect this card working in conjunction with the rework of Shake and announced at Siggraph this year… Pixar might get into the game and show a GPGPU-ized RenderMan also…

It will take a marketing miracle for this to be a success: it was hard enough getting compilers to support altivec, which is a hell of a lot more seamless than this proposal.

When AltiVec came out, Apple didn't have the various Core libraries (including the abstracted SIMD library), which automatically distribute their load across available resources. If you use those you get GPGPU for free when your machine has it (after Apple supports it in their frameworks, obviously). If you need it to do custom work, just write a plugin. You would only have to drop down to writing specialized GPGPU code for boundary cases where Apple's libraries weren't efficient enough. Even their Son of Batch Processing task-based API can be updated to transparently support GPGPU.

This process might be a bit slower than some pros would like if the tech first appears on an optional upgrade card for the Mac Pro, but if it's as promising as it sounds it will propagate down the lineup and Apple will have a real incentive to build support into their various libraries.

The real benefit of CUDA will be speed, accuracy, and usefulness. Currently Apple uses only OpenGL to access the GPU. Apple is using OpenGL for tasks it was not designed to do directly, like all the flavors of Quartz Composer such as Core Image. Now back in the day, the painfully slow progress of the PPC chip forced Apple to start hacking OpenGL to speed up image processing. It's given them a competitive edge that is really coming full circle. Now that the GPU makers are seeing people use their hardware this way they are finally designing GPUs for this.

With the new GPU hardware paradigm and the access CUDA will give we will start to see real time raytracing, faster than realtime H.264 encoding (maybe), and yes massive Audio processing. So with the more approachable development environment you will start to see applications really start to take advantage of the speed. It would provide a type of SSE on steroids. But to really make it shine I guess the real work will be in Apple's hands, because they will have to extend Quartz Composer to accommodate both Nvidia's CUDA and ATI's Close To Metal (CTM). Maybe it will be a whole new Apple tech that will show up in 10.6 or sooner. If Apple unifies both techs and makes it even more accessible to developers it would be pure genius!

I think you have the same problem as you had for AltiVec, that Apple didn't build the technology into every new model.

When the technology starts there are relatively few computers that support it.

If, and only IF, Apple commits to make all computers include the technology, and there is a clear advantage to using it, then you get a growing base of computers that will benefit from the additional programming to implement it. But it still takes quite some time before the new technology forms even a sizable minority of computers in the market.

Apple is so unreliable with its support of even its own technology (eg ports, media, CPUs, SCSI, FW, GPUs, etc,) that the chances of the new technology having a viable long term pool of compatible computers to utilise is slim.

It's worth noting that the GPGPU hardware is just a standard nVidia 8800 series GPU, without the video display output. The port of the CUDA software to the Mac means that you should be able to run CUDA programs on any Mac with an 8800 (or later) based nVidia graphics card. The GPGPU boards are additional hardware to boost the computational capabilities of the machine even further.

Unfortunately CUDA is nVidia-specific, so until nVidia and ATI/AMD agree or Apple introduces an OS-provided standard that covers both this will never run on all Macs even if they have the latest GPUs. As observed above, Apple may transparently leverage this kind of technology via its various Core technologies and SIMD libraries which will benefit applications using those. This isn't as flexible as coding directly to CUDA, but until a better alternative to CUDA comes along then it's better than nothing.

BTW, the main difference between a CPU and a GPU is that the CPU does one operation at a time to one piece of data at a time and is designed for these operations and their ordering to be as flexible as possible and for the data to be as freely organized as possible. GPUs, on the other hand, constrain what your data looks like and how it is organized and they do the same sequence of operations to as many data elements as possible at the same time. It is these tighter constraints that allow the GPU hardware guys to make more design assumptions, which allows them to optimize to faster computational speeds. The price of generality/flexibility is usually performance (and vice versa).
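As a rough illustration of "the same sequence of operations to as many data elements as possible", here is a small Python sketch. The names `kernel` and `launch` are hypothetical stand-ins chosen for this example, not the CUDA API; the point is only that when every element runs the same function and writes its own slot, the order of execution does not matter, which is what lets hardware run the elements side by side.

```python
import random

def kernel(i, src, dst):
    """One fixed sequence of operations, applied to element i only."""
    dst[i] = src[i] * src[i] + 1


def launch(kernel_fn, src, dst, order):
    """Hypothetical 'launch': run the same kernel once per index, in any order."""
    for i in order:
        kernel_fn(i, src, dst)


src = [1, 2, 3, 4]

# Run the indices in their natural order.
in_order = [0] * len(src)
launch(kernel, src, in_order, range(len(src)))

# Run the same indices in a random order.
order = list(range(len(src)))
random.shuffle(order)
shuffled = [0] * len(src)
launch(kernel, src, shuffled, order)

# Same result either way, because no element depends on another.
print(in_order == shuffled)
```

A CPU-style algorithm where step i needs the result of step i-1 (a running total, for instance) would not survive the shuffle, which is the flexibility-versus-throughput trade the post describes.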

Over the past decade the lines have become increasingly blurry. GPUs are becoming vastly more flexible in what operations they can do, and in what order, and are improving in the flexibility for how they handle data (although it is all still strongly oriented toward large volumes of data and massively data parallel computations). CPUs have also become more capable of doing operations on more data at once (SIMD, in the form of AltiVec and SSE, and multiple cores). The future will be even more blurred as GPUs and CPUs are built into the same physical chips, and as the processing elements of GPUs become more and more flexible in how they operate, and as CPUs get more cores, better SIMD, and more memory bandwidth.

How to explain this to consumers is a really big problem for the marketeers, who usually don't even understand it themselves. The days of being able to say that "this machine is faster than that one" are gone or going fast -- in the future you'll have to say "this implementation of that particular algorithm with these tools, on that hardware, under these conditions, running that OS is faster than..." (well, that's what you'll say if you want to be truthful... otherwise everybody will just claim they are the fastest... and at something they will be). For the software developers this is an ever increasing nightmare of having too many poorly conceived and inadequate tools to efficiently develop software that is competitive, not to mention being stuck with a legacy of outdated non-concurrent code. The software world is going to have to mature significantly to keep up with the rapidly changing hardware.

If the TDP of a 45nm Core 2 is 130W and the TDP of a GeForce 8800 GTX is 185W, then the GeForce only needs to perform around 45% faster than the CPU for its perf/Watt to be superior. Anything close to the "45 times faster" claimed by this article would imply a far superior perf/Watt.
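The arithmetic behind that estimate can be checked with a short Python sketch. The two TDP figures are the ones assumed in the post, not measured values, and the function names are made up for this example. With those inputs the break-even point works out to roughly 42% faster, close to the post's "around 45%".

```python
CPU_TDP_WATTS = 130.0  # figure assumed in the post for a 45nm Core 2
GPU_TDP_WATTS = 185.0  # figure assumed in the post for a GeForce 8800 GTX


def break_even_speedup(cpu_watts, gpu_watts):
    """Speedup at which the GPU's perf/Watt equals the CPU's."""
    return gpu_watts / cpu_watts


def perf_per_watt_ratio(speedup, cpu_watts, gpu_watts):
    """GPU perf/Watt divided by CPU perf/Watt for a given raw speedup."""
    return speedup * cpu_watts / gpu_watts


break_even = break_even_speedup(CPU_TDP_WATTS, GPU_TDP_WATTS)
ratio_at_45x = perf_per_watt_ratio(45.0, CPU_TDP_WATTS, GPU_TDP_WATTS)

print(f"break-even speedup: {break_even:.2f}x")
print(f"perf/Watt advantage at 45x: {ratio_at_45x:.1f}x")
```

Under these assumptions, a 45 times speedup would still leave the GPU with about a 30-fold perf/Watt advantage. TDP is only a ceiling on power draw, so this is a back-of-envelope comparison at best.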

True that. I'd love to see some turbocharged GPGPU applications. Informative first post. Welcome to the boards!

Amdahl wasn't strictly talking about this. His "law" doesn't apply in a direct way. It's more closely related to using 8 cores over 2 cores.

Sure it does. Amdahl's Law is extremely important in this case. If you have a process where 50% of the time is in a data parallel problem, and the other 50% is in a serial portion of the problem, then by throwing the data parallel problem at a GPU that can do those calcs a billion times faster... your program will run (at most) twice as fast.
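The 50/50 example above follows directly from the usual statement of Amdahl's Law, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that can be parallelized and s is how much faster that fraction runs. A minimal Python sketch, with a function name chosen for this example:

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# The case from the post: half the time is data parallel, and the GPU
# runs that half a billion times faster.
half_parallel = amdahl_speedup(0.5, 1e9)
print(half_parallel)

# For comparison: a mostly parallel job with the article's 45x figure.
mostly_parallel = amdahl_speedup(0.95, 45.0)
print(mostly_parallel)
```

The first result sits just under 2, however large the second argument gets, because the serial half is untouched. The second case shows that even 5% of serial work holds a 45 times kernel speedup well below 45 times overall.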

It's worth noting that the GPGPU hardware is just a standard nVidia 8800 series GPU, without the video display output. The port of the CUDA software to the Mac means that you should be able to run CUDA programs on any Mac with an 8800 (or later) based nVidia graphics card. The GPGPU boards are additional hardware to boost the computational capabilities of the machine even further.

Hi,
I've done both MIMD and SIMD programming as well as a little bit of GPGPU programming. GPGPU was pretty much a pain in the ass. I wonder if now it's a lot better or easier to program for, seems it probably is. Also C code is a huge improvement over when I did it in assembly.

So it is possible to get 45 times speed up, not just 45%, on an application over a standard one core cpu using a SIMD machine, especially one as powerful as an nvidia gpu processor.

Processing MRI scans seems like an image algorithm which can be very efficient on SIMD machines, similar to GPGPUs. However I don't know the details of course so I'll just be safe and say they probably meant 45%-415% for now. But to say 45-415 times faster wouldn't surprise me if proven true. I've written SIMD code which ends up around 5-10 times faster than on a modern one core cpu, however the performance per watt is astoundingly high, as each SIMD cpu is only 20 MHz, as compared with an intel processor in the GHz.

Amdahl's law is CRUCIAL for GPGPUs. It says you can only speed up the part which can be parallelized, while part of the algorithm cannot be sped up by going parallel because it has to be done in order. Biggest example of a serial part of a program is the I/O.

So for a while Intel and AMD have been looking into SIMD co-processors to do certain algorithms very fast, not just pieces of the cpu already there, but think like 512 processing elements dedicated to running a thread which is designated as SIMD. It looked like they were going to implement it but maybe the existence of a GPGPU in every computer in a few years may trump this and why pay for an extra simd processor when you can just use a GPGPU? interesting.

Quote:

Hi,
I've done both MIMD and SIMD programming as well as a little bit of GPGPU programming. GPGPU was pretty much a pain in the ass. I wonder if now it's a lot better or easier to program for, seems it probably is. Also C code is a huge improvement over when I did it in assembly.

"Easier to program" is a relative term. Yes, it is easier. No, it's not easy.

Quote:

So it is possible to get 45 times speed up, not just 45%, on an application over a standard one core cpu using a SIMD machine, especially one as powerful as an nvidia gpu processor.

No, they mean 45 times faster. On data parallel problems this is entirely feasible.

Quote:

each SIMD cpu is only 20 MHz, as compared with an intel processor in the GHz.

I don't know what you were coding, but modern SIMD is epitomized by the IBM Cell processor. A main processor plus up to 8 on-chip 3+ GHz dual issue SIMD cores. 200+ GFLOPS performance, compared to quad core Intel Core 2 Duos in the ~50 GFLOPS ballpark... with much higher cost and power consumption. On scalar code the Intel CPUs win hands down. On SIMD code, the Cell crushes them.

Quote:

Biggest example of a serial part of a program is the I/O.

I/O isn't inherently serial (although the encoding for output might be), but it usually is a bandwidth restriction. Serial portions of a program are the algorithms that you just can't figure out how to parallelize (or replace with parallel equivalents).

Quote:

but maybe the existence of a GPGPU in every computer in a few years may trump this and why pay for an extra simd processor when you can just use a GPGPU? interesting.

Increasingly tight integration is what we're going to see. AMD's Fusion project is going to bring the GPU on-chip with the CPU. Intel's integrated GPUs are another example of tighter integration as well. The biggest performance bottleneck in GPGPU is usually moving data across the expansion bus (PCIe currently), so bringing it onto the motherboard has some advantages at the cost of losing modularity. System-on-chip is the logical move as we approach a billion transistors per die.
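To see why the bus can dominate, here is a simple cost model in Python. Every number in it is a hypothetical placeholder picked for illustration (a 10 second CPU job, a 45 times faster kernel, 1 GB copied each way over a 4 GB/s bus); none of them come from the article or from measurements, and the function name is made up for this sketch.

```python
def effective_speedup(cpu_seconds, kernel_speedup, bytes_moved, bus_bytes_per_second):
    """Overall speedup once bus transfer time is added to the GPU's compute time."""
    gpu_seconds = cpu_seconds / kernel_speedup
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return cpu_seconds / (gpu_seconds + transfer_seconds)


GB = 1e9

# Hypothetical workload: all values below are assumptions for illustration.
cpu_seconds = 10.0          # time the job takes on the CPU
kernel_speedup = 45.0       # how much faster the GPU runs the computation itself
bytes_moved = 2 * 1 * GB    # 1 GB copied to the card, 1 GB copied back
bus_rate = 4 * GB           # assumed bus throughput in bytes per second

overall = effective_speedup(cpu_seconds, kernel_speedup, bytes_moved, bus_rate)
print(round(overall, 1))
```

With these made-up inputs the copy takes longer than the computation, and the 45 times kernel ends up closer to 14 times overall. A shorter path between CPU and GPU shrinks the transfer term, which is the advantage of tighter integration that the post points to.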

First thanks for clarifying a few things for others. Definitely 'easier' not from an algorithm design standpoint, but for implementation certainly more simple than assembly. You're right about I/O, I was just thinking of having a file on a HDD but yea you're right.

I didn't know gpus were moving to the cpu, but then again as you say, it's obvious they're going there.

The SIMD I programmed for is a single board 512 processing element linear array (with wrap around). 2-d torus I think is the shape. http://www.soe.ucsc.edu/projects/kestrel/
"system accelerates computational biology, computational chemistry, and other algorithms by factors of 20 to 40" than a 433 mhz, probably, sun workstation. Each cpu runs at 20 MHz.

I don't really think the cell is the epitome of SIMD. It can be programmed to be mimd and so many various things, it's very flexible which is good but at the same time difficult to master from its complexity. But of course it's still super fast, powerful, and very interesting.

Anyways I just saw their latest paper addresses why to have a SIMD over a gpgpu, however it hasn't been published yet but you can read their pdf on the webpage.
"We propose a model in which the CPU and the GPU are complemented by the third big chip, a massively-parallel SIMD processor."