If you haven’t seen it yet, Jamie and Adam did what may be the greatest illustration of a computing concept onstage ever, using an 1100-barrel paintball gun:

Updated: We’ve seen the basic idea before — the creator of a Max/MSP + Atmel-powered Printball notes his own, similar project, as featured on Pixelsumo way back in 2005. But this is the first time I’ve seen the technique used to illustrate this particular point.

The basic idea: GPUs, by using parallel processing, are able to render graphics more effectively than CPUs. And while the illustration is something of an oversimplification, it is pretty literal in terms of showing people what’s going on — and why GPUs are uniquely well-suited to computing graphics. Conceptually, it’s really one of the most brilliant demos I’ve ever seen.
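To make the paintball analogy concrete, here is a toy Python model of the demo. Everything here is invented for illustration — the canvas size, the step accounting — except the 1,100 barrels, which match the demo's gun; real CPUs and GPUs obviously don't work in lockstep "volleys" like this.

```python
# Toy model of the two paintball guns: a "CPU" gun fires one pixel per
# time step, while the "GPU" gun fires one paintball from every barrel
# in a single volley. Canvas dimensions are made up for illustration.

WIDTH, HEIGHT = 32, 24  # hypothetical canvas size (768 pixels total)

def paint_serial(width, height):
    """Single barrel: one pixel painted per time step."""
    steps = 0
    for _row in range(height):
        for _col in range(width):
            steps += 1  # fire one paintball
    return steps

def paint_parallel(width, height, barrels):
    """Many barrels: up to `barrels` pixels painted per time step."""
    pixels = width * height
    # ceiling division: the last volley may not use every barrel
    return (pixels + barrels - 1) // barrels

serial_steps = paint_serial(WIDTH, HEIGHT)            # 768 steps
parallel_steps = paint_parallel(WIDTH, HEIGHT, 1100)  # 1 step
```

The punchline is the same as the demo's: when the work is a huge number of independent per-pixel operations, the wide machine finishes in one "volley" while the narrow one grinds through every pixel in turn.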

There are just a couple of problems — and, amusingly, this demo makes them visible, as well.

We should recognize that the demo took place at NVIDIA’s NVISION show, so there was a certain bias expected.

But first, there’s the problem of standards. With graphics, we’ve already seen what happens with competing APIs like DirectX and OpenGL (and other wrinkles, like the Mac-only Core Image, which is in turn built on OpenGL). Now, NVIDIA and other GPU makers want to sell us on the idea of using their chips for general computing tasks (aka GPGPU). There still isn’t a standard for doing that. NVIDIA is pushing its CUDA 2.0 technology, of course, but it’s not clear whether others will adopt it. I’d put more faith in OpenCL; aside from the backing of Apple and AMD, that spec looks like it has a good shot at being adopted by the Khronos Group, the folks who maintain the OpenGL standard. (Not that everything Khronos has touched has turned to gold — see the largely failed OpenML.) But the bottom line is, to put it in the terms of this demo, it’d be a lot nicer if you could take the same instructions from one paintball gun to another.

Then there’s the bigger problem: what works for graphics doesn’t work for everything else.

In fact, you can see in the demo why, in general terms, parallel processing doesn’t work as well for audio tasks. Interestingly, this week as I chatted about a new release of a dedicated DSP-chip platform called the UAD-2 on Create Digital Music, people brought up this question of whether GPU chips might be the future of audio processing. You can see the fundamental problem here: audio, as a real-time operation that occurs in extremely tiny slices of time (think 44,100+ samples each second), tends to want to be processed more like the first paintball gun. That’s not to say that some tasks, particularly those that involve inherently parallel operations like granular synthesis, wouldn’t benefit from some parallel processing. But even then, the round-trip to the GPU has to compete with simply staying on the CPU for the task. These aren’t necessarily insurmountable problems, but suffice to say CPU makers are working on this, too. GPU makers are removing architectural barriers between the GPU and the CPU just as CPU makers are working on smarter parallel processing, so the placards in this demo are likely to say “serial versus parallel”, not “CPU versus GPU” in the near future. (What do you want? It was an NVIDIA-sponsored event.)
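Here is a minimal sketch of why much audio DSP resists parallelization, with invented coefficients and signals: a one-pole lowpass filter can't compute sample n until it has sample n−1's output, while a simple gain touches every sample independently and could be farmed out to as many "barrels" as you like.

```python
# One-pole lowpass: y[n] = a*x[n] + (1 - a)*y[n-1]
# Each output depends on the previous output, forcing serial evaluation.
def one_pole(x, a=0.5):
    y, prev = [], 0.0
    for s in x:
        prev = a * s + (1 - a) * prev  # needs the previous result
        y.append(prev)
    return y

# Gain: every sample is independent, so this maps trivially onto
# parallel hardware (each "barrel" handles one sample).
def gain(x, g=0.5):
    return [g * s for s in x]
```

The recursive filter is the common case in audio (EQs, compressors, reverbs all lean on feedback), which is why the serial paintball gun is the better mental model for most plug-in DSP.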

That said, readers of Create Digital Motion have quite a lot to look forward to in the increasing power of GPUs. Obviously, 3D graphics capabilities continue to improve, and despite what you may hear about the PC game market dying, both the niche PC game and 3D production industries seem interested in continuing to push the envelope here.

But that’s just the beginning. Video decoding at higher resolutions and qualities is getting more efficient, which is not only good for video but leaves more room for other tasks. Decoding on the GPU (or a dedicated chip) is becoming increasingly powerful, and a number of GPGPU implementations do decoding on the GPU. (If we get really lucky, we’ll see a convergence of an open-source codec with an open-source GPGPU implementation in a standard spec like OpenCL, finally making video accessible to all. A boy can dream.)

I think the most interesting application may actually be in the area of computer vision. Doing analysis of video to determine motion has always been costly and difficult on the CPU. But processing a bunch of pixels in parallel is ideally suited to GPGPU implementations. Again, if we could get an open-source implementation, the work built on that could be incredible. Intel’s open-source OpenCV library is the basis for countless computer vision projects, including many in academia that would be unable to license a proprietary library. OpenCV runs on the CPU (written in C/C++); imagine something similar running on a standard GPGPU implementation.
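As a hedged sketch of why computer vision maps so well onto parallel hardware: even the simplest motion detector, frame differencing, treats every pixel independently. This pure-Python stand-in (the threshold value is arbitrary, and a real implementation would use OpenCV or a GPU kernel) shows the per-pixel independence:

```python
# Frame differencing: mark a pixel as "moving" if its value changed
# by more than a threshold between two frames. Frames are flat lists
# of grayscale values here; every pixel is computed independently,
# which is exactly the shape of work a GPU handles well.
def motion_mask(prev_frame, curr_frame, threshold=16):
    return [1 if abs(c - p) > threshold else 0
            for p, c in zip(prev_frame, curr_frame)]

mask = motion_mask([0, 0, 100], [0, 40, 100])  # -> [0, 1, 0]
```

Real vision pipelines (optical flow, feature tracking) are more involved, but they share this structure: lots of small, independent per-pixel or per-patch computations, which is why a GPGPU port of something like OpenCV is such an appealing prospect.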

If anyone was at NVISION, I’d love to hear about any developments on those fronts.

So, there’s my semi-uninformed opinion, at least. Usually what happens is, when we touch a subject like this, people who know more than me come out of the woodwork to share what they know. If you’re there, I’d love to hear from you.

<blockquote cite="http://en.wikipedia.org/wiki/Uad-1">
"[The UAD-1] has an MPACT2 processor which is a 125 MHz graphics, audio and video media processor, used primarily for AGP graphics cards and hardware DVD decoding in 1997."
</blockquote>

Comparing the Mpact2 to a general purpose (GP) processor like the Pentium or PowerPC is difficult because they have profoundly different architectures and design strategies. The closest comparisons are the SSE2 or AltiVec engines in these GP CPUs, which are actually derived from design advances originally used by Chromatic based on super-computer architectures.

These engines are commonly referred to as "vector" units because they process data in blocks rather than one datum at a time, which allows them to save instruction decode cycles. Standard scalar DSPs (e.g. the 56K) and GP processors must perpetually decode the same instructions over and over when doing the same sequence of operations, and this wastes cycles.
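A toy way to see the decode savings (the 8-wide block width mirrors the SIMD width mentioned below for the Mpact2's execution units; the accounting ignores everything else a real chip does per cycle):

```python
# Toy instruction-decode accounting: a scalar unit decodes the same
# "multiply" instruction once per datum; a vector unit decodes it once
# per block and applies it to the whole block.

def scalar_decodes(n_samples):
    # one instruction decode per sample
    return n_samples

def vector_decodes(n_samples, block=8):
    # one instruction decode per block (ceiling division covers
    # a final partial block)
    return (n_samples + block - 1) // block

scalar_decodes(1024)           # 1024 decodes
vector_decodes(1024, block=8)  # 128 decodes
```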

The fundamental differences between the Mpact2 and a GP CPU (even its vector units) are the amount of parallelism and the available memory bandwidth. The Mpact2 uses an internal 11-way 792-bit bus for moving data around between the DMA engine, PCI bus interface, the instruction cache, the data cache, the 5 execution units, the various peripheral interfaces, and the main dual-spline 1200 MByte/s RDRAM interface. This provides 11GBytes/s of bandwidth. In addition to this ultra-high speed internal data bandwidth, the multiple execution units are parallel instruction SIMD units, which can process up to 8 independent data sets with multiple instructions per clock.

In audio DSP terms, the effective internal speed is indeed 1GHz, and can exceed this for certain sub-operations. For example, a floating point multiply/add instruction consists of 2 FLOPs (floating-point operations). The Mpact2 typically provides 2GOPS of execution unit performance (and up to 6GOPS for some specialized units), hence the 1GHz processor speed equivalent. However the comparison isn't quite accurate because the Mpact2 can do additional non-FPU operations in parallel (like calculating addresses, dithering, moving data, etc.). In addition, the Mpact2 doesn't have a deep pipeline like the GP CPUs. The Pentium 4 has a 14-clock pipeline that's also data dependent. This means coding audio algorithms like recursive filters (which have high data dependencies) on the Pentium requires lots of waiting around for the pipeline delay.
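A back-of-the-envelope model of that pipeline cost (the depth of 14 comes from the Pentium 4 figure above; everything else — single-issue, no forwarding — is simplified away, so treat the numbers as illustrative, not measured):

```python
# Toy cycle accounting for a pipelined multiply-add unit.

def cycles_recursive(n_samples, pipeline_depth=14):
    # In a recursive filter, each sample's multiply-add needs the
    # previous sample's result, so each op waits out the full pipeline.
    return n_samples * pipeline_depth

def cycles_independent(n_samples, pipeline_depth=14):
    # Independent ops can enter the pipeline every cycle; only the
    # first one pays the full latency.
    return pipeline_depth + (n_samples - 1)

cycles_recursive(100)    # 1400 cycles
cycles_independent(100)  # 113 cycles
```

That roughly 12x gap for dependent work is the "waiting around for the pipeline delay" the comment describes, and it is why deep pipelines help spreadsheets and codecs far more than feedback-heavy audio DSP.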

In practice, the GP CPUs overcome the pipeline delay by allowing new operations to be started while the others are in process, but this has limited applications in audio processing. It's much better suited to calculating spreadsheets, or decoding an MP3 stream or MPEG movie.

Another advantage the Mpact2 has over GP CPUs is the multiple dedicated execution units and the cache architecture. The Pentium actually shares the same physical silicon hardware multiplier between the FPU, ALU and vector units, so all multiply instructions must fight over the same hardware. And the over-hyped Hyper-Threading CPUs share this same silicon between multiple thread contexts which pretend to be multiple CPUs, but they're not. You still only have one multiplier unit.

The Mpact2 uses an independent direct-mapped instruction cache, and a separate ld/st data cache with read-ahead, write-behind access. This means the Mpact2's instruction fetch unit doesn't cause random cache-line misses like the GP CPU's multi-way set-associative code/data cache. The Mpact2 compilers also provide control over instruction cache use, which avoids cache misses at runtime. The Mpact2 data cache appears as a huge, 9-way ported register set to the execution and bus interface units, which eliminates all conflicts between multiple execution units and the "key-holing" effect in GP CPUs caused by accessing the cache through a tiny register set.

The Mpact2 has a 1200 MByte/s RDRAM interface that has independent, full-speed access to the instruction and data caches, plus the PCI bus at all times. This means nothing needs to wait for anything else in the Mpact2 to access a resource. Everything runs at full speed, all the time. This *never* happens in a GP CPU, because they're designed around a statistical model of resource sharing, while the Mpact2 is designed around a deterministic model.

That's one of the main advantages DSPs have over GP CPUs when processing real-time audio. Audio samples come in on a rigid time schedule and can't wait to be processed like video or pretty much any other type of compute intensive operation.

The Mpact2 is not going away anytime soon. We have large inventories, and the chip is available for the customers who are still using it. Incidentally, there are DVD decoders that use this chip that are still being made.

The advantages this processor has brought to everyone involved are clear: we get an extraordinary advantage in the marketplace, and our customers get the benefit of this power with the best processing algorithms available.

The transition to other DSP platforms is not a difficult one, as we have shown with our TDM plug-ins. In some cases, the main difficulty has been optimizing the algorithms to fit, but these optimizations can be applied to the UAD versions as well. The forthcoming 3.1 versions of the 1176LN and LA2A are not only better sounding than before, they're more efficient as well.

Regarding third-party support, the DSP coding aspect of plugin development is a small portion of the process, while the business aspects are very important. We would rather spend our company resources developing new algorithms for our customers than supporting a development platform for other companies. That said, there are still opportunities for 3rd party plugin developers, only not at the code porting level right now.

The evolution of Universal Audio has included the incorporation of both Kind of Loud, and Hyperactive Audio Systems, which I started several years ago. We think this is a unique combination of analog, DSP and digital hardware expertise, and we hope you agree.

There is a GPU vision library called GpuCV that provides the same interface as OpenCV. I haven't tried it out.
