Pre-AMD, ATI preps novel server charge

Even before its merger with AMD closes, ATI plans to charge the server market with a new type of graphics product that could shake up the high performance computing scene. Advocates of ATI's technology say it could create a lucrative new revenue stream for the company and add some weight to the ATI/AMD marriage.

ATI has invited reporters to a Sept. 29 event in San Francisco at which it will reveal "a new class of processing known as Stream Computing." The company has refused to divulge much more about the event other than the vague "stream computing" reference. The Register, however, has learned that a product called FireStream will likely be the star of the show.

The FireStream product marks ATI's most concerted effort to date in the world of GPGPUs, or general purpose graphics processing units. Ignore the acronym hell for a moment, because this gear is simple to understand. GPGPU backers just want to take graphics chips from the likes of ATI and Nvidia and tweak them to handle software that normally runs on mainstream server and desktop processors.

The GPGPU concept isn't new, but for the first time, the hardware and the supporting software have matured to the point where companies can make the technology live up to its promise. And what a promise it is.

The enormous horsepower delivered by ATI and Nvidia's graphics gear could facilitate 10x to 30x performance gains on a fairly wide variety of software loads typically handled by standard processors. Such a performance boost would be of major interest to big spenders in the government lab, oil and gas and bio-tech industries who want all the juice they can get. Even better, the GPGPU products should prove both cost effective and power efficient when compared to current processor options.

"There's this whole change going on right now," said Mike Houston, a PhD student at Stanford's Graphics Lab. "Now, there are companies doing this stuff for real. And, more importantly, there are big businesses that will buy their stuff."

Researchers at Stanford, the University of North Carolina and the University of Waterloo are just some of the folks who have hammered away at the software problems around GPGPUs for years. The computer science crowd has worked with - and in some cases convinced - ATI and Nvidia to open up their hardware and programming interfaces to make it easier to run common software on the GPUs. The University of Waterloo, for example, has a programming language called Sh to ease the software translation process, while Stanford has Brook.

A company called RapidMind - formerly Serious Hack - commercialized Sh in 2004. At the SIGGRAPH conference this year, RapidMind showed off its software working on the Cell processor developed by IBM, Toshiba and Sony.

PeakStream, another company going after this market, came out of stealth mode this week with a software programming platform meant to make it easier for developers to push code onto GPUs, multi-core processors and the Cell chip. The company turned to Stanford's Brook for inspiration and basically provides a type of shim that goes between a GPU and applications.
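PeakStream hasn't published the details of its platform, but the Brook-style idea of a shim is easy to sketch: the application writes a small elementwise kernel, and a runtime layer decides where that kernel actually executes - GPU, multi-core processor or Cell. A minimal, hypothetical Python sketch of the concept (the names here are illustrative, not PeakStream's real API):

```python
# Hypothetical sketch of a stream-computing "shim": the application
# supplies an elementwise kernel, and the runtime applies it across
# whole streams of data. A real shim would dispatch this loop to a
# GPU, Cell SPEs or several CPU cores; here it simply runs on the host.

def stream_map(kernel, *streams):
    """Apply `kernel` to corresponding elements of each input stream."""
    return [kernel(*elems) for elems in zip(*streams)]

# A saxpy-style kernel (y = a*x + y), a staple of floating-point HPC code.
def saxpy(x, y, a=2.0):
    return a * x + y

xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
print(stream_map(saxpy, xs, ys))  # [12.0, 24.0, 36.0]
```

The point of the shim is that `saxpy` never mentions where it runs; the same kernel could be retargeted from a CPU loop to a GPU without touching the application code.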

Researchers have zeroed in on products such as FPGAs, GPUs and the Cell chip because of their potential to speed up demanding floating-point operations. Most of the action right now has centered around software that relies on what's known as single precision floating point calculations. We're talking about horsepower hungry code for things such as medical imaging, computational fluid dynamics and seismic modeling.
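The single versus double precision distinction matters more than it might appear: single precision carries only about seven decimal digits. Python's own floats are double precision, but the standard `struct` module can round-trip a value through a 32-bit IEEE float to show what gets lost - a quick way to see why some scientific codes can't yet move onto today's GPUs:

```python
import struct

def to_single(x):
    """Round-trip a Python double through a 32-bit IEEE single-precision float."""
    return struct.unpack('f', struct.pack('f', x))[0]

pi64 = 3.141592653589793      # double precision: ~16 significant digits
pi32 = to_single(pi64)        # single precision: ~7 significant digits
print(pi32)                   # 3.1415927410125732
print(abs(pi64 - pi32))       # error on the order of 1e-7
```

For imaging or seismic work that error is usually tolerable; for long-running simulations it can accumulate, which is why researchers keep asking for double precision hardware.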

As it turns out, some of the biggest spenders in the hardware world use tons of floating-point heavy software. So the software middlemen, along with companies such as ATI and Nvidia, could make serious profits if they're able to deliver on the GPGPU potential.

Of course, the software problem is not an easy hurdle to clear.

Up to now, developers have mostly focused on single-core server processors from IBM, Sun Microsystems, HP, Intel and AMD. Savvy types in the Unix world have written lots of multi-threaded software to spread work across large servers with tens and even hundreds of chips. Multi-threaded software has become even more important in recent years with chip makers producing dual-core, four-core and even eight-core chips.

The GPGPU world presents new challenges.

"These are streaming chips with a whole bunch of floating point units," Houston said. "You have to restructure code sometimes to get the best use out of these things. It's not for the faint of heart.

"In the high performance market, we've been talking about symmetric multi-processor servers with maybe four or eight or 16 threads. On an ATI chip, you're talking about 48 threads of simultaneous execution."*

ATI has only recently allowed developers to tap into its CTM (close to the metal) interface, which lets software interact directly with the underlying hardware.

Presumably, ATI will announce an even more open stance at its event next week.

As stated, the company declined to give us specific details on what it will reveal at the event. In late August, however, some savvy types discovered mention of ATI's FireStream 2U product by examining the server output log from an ATI Linux driver. Then, earlier this month, another chap discovered a living, breathing 1GB ATI FireStream card.

"The card is indeed based on R580 with a board layout nearly identical to the FireGL 7350 including the 1GB of ram," he wrote. "The box I saw contained only this card and a driver cd which had been burnt and was labeled as being a beta. Also, I found the label on the CD interesting: 'FireSTREAM Enterprise Stream Processor.'"

An ATI spokesman confirmed the existence of the FireStream product, but said its name may change due to the pending AMD merger and associated branding funk.

Nvidia did not immediately return our calls seeking comment for this story.

In an ideal world, the AMD/ATI tie-up will make life even easier on the GPGPU crowd.

"If I had my dream setup, it would include a much tighter interconnection between the graphics chip and central processor," Houston said. "What would be really interesting would be to have a cache coherent interface between the graphics processor and main processor."

AMD - via its HyperTransport technology - could potentially deliver just such an interface by plugging GPUs directly into Opteron-based motherboards. This would let mainstream server makers such as Sun, IBM and HP follow the lead of GraphStream and build graphics supercomputers.

Houston, and others, are also hoping that future GPGPU gear will allow for double precision floating-point operations as well, opening up the processor technology to a wider array of applications.

At the moment, ATI seems to be in an experimental phase with the GPGPU idea. Similarly, Nvidia hasn't rushed to talk up what it plans to offer.

The success of the technology will depend on the progression of software written for the GPUs and the sophistication of the GPGPU tools. In addition, the GPUs will need to stack up well against other options such as the Cell chip and FPGAs.

Given the raw power of GPUs and their high-volume economics, however, customers should expect to see $2,000-ish boards make their way into workstations soon, followed by cheaper boards slotting into servers.

Without question, enterprise customers and labs are pleased to see GPGPUs moving out of the concept and testing phase and toward productville. A merged AMD/ATI might be in the best possible position to capitalize on these customers' interest. Hopefully, we'll know a lot more about ATI's ambitions next week. ®

*Bootnote

Houston was kind enough to add some technical detail to the differences between stream processing and multi-threaded processing for the curious.

In stream processing, you run the same program on lots of elements simultaneously. Stream processing is a subtype of data parallel processing. The main goal of stream processing is to stage data so that it can be moved (streamed) through the memory system at high efficiency. All processing elements run the exact same program, but on different data (parts of the stream). You cover memory latency with a large amount of computation on each element ("arithmetic intensity"). In a stream model, all execution contexts (processors) run independently, so there is no locking or communication. Stream programming works well for large amounts of parallelism, but is limited in which applications it can run well. Often you have to convert an algorithm into a streaming formulation.
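That last step - converting an algorithm into a streaming formulation - is the crux. A serial sum, for instance, has a loop-carried dependency and streams poorly; the streaming reformulation is a tree reduction, where every pass combines elements pairwise and independently. A rough Python rendering of the idea (on a streaming chip, each pairwise sum in a pass would run on a separate processing element):

```python
# A plain serial sum carries a dependency from one iteration to the next,
# which defeats a streaming chip. The streaming reformulation is a tree
# reduction: each pass sums pairs independently, so every pair can go to
# a different processing element with no locking or communication.

def tree_sum(xs):
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:                # pad odd-length passes
            xs.append(0.0)
        # each pair below is independent of every other pair
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

print(tree_sum([1.0, 2.0, 3.0, 4.0, 5.0]))  # 15.0
```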

For multi-threading, each core can, and often does, run a different program. For example, one thread might be doing audio, while another does the AI for the bots in a game. Memory performance is generally gained by tuning your apps to make good use of the processor caches. General thread programming styles work well for a small number of threads - tens of threads, say. The user explicitly manages the processing, but there is a large burden on the programmer to handle locking and communication control. Data-parallel and streaming models can be used on multi-core processors as well, generally by carefully moving data through the cache hierarchy.
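For contrast, here is a minimal multi-threaded sketch in Python: a handful of threads sharing one piece of state, which is exactly where the locking burden described above comes from - something the lock-free streaming model avoids by construction:

```python
import threading

# Multi-threaded model: a few threads share mutable state, so the
# programmer must manage locking explicitly to keep updates correct.
counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:            # without this lock, updates could be lost
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

In the streaming model there is nothing equivalent to `lock`, because no two processing elements ever touch the same data.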