Self-generating processors advance

PALO ALTO, Calif. - Design tools are becoming so deft they may one day put designers out of a job. Hewlett-Packard Co. has just turned out a processor built around its self-generating VLIW core and is advancing the technology in anticipation of a day when a machine will be able to design a machine. Tensilica Inc. is also experimenting with ways to let a compiler choose the optimal processor architecture using only C-language inputs; it recently demonstrated a tool that takes less than a minute to generate gates for a fast Fourier transform algorithm.

STMicroelectronics' announcement that it has completed a very-long-instruction-word processor with partner Hewlett-Packard Laboratories here is clear evidence of how EDA tools are handing software programmers and systems architects keys once held only by hardware designers. "This is the next plateau of computer architecture," said Bob Rau, HP fellow and director of compiler and architecture research at HP Labs, in a recent interview with EE Times.

HP and Tensilica have the same goal in mind: Capture the hardware design expertise in the tool itself. The next generation of tools will be able to spit out many iterations of processor designs by taking into consideration the application code, compiler performance and hardware cost/performance trade-offs, said Monica Lam, a professor at Stanford University and Tensilica co-founder.

Manual massage

Yet some are skeptical of whether tools can ever evolve to the point of booting the hardware designer out of the equation. "If you are designing at the algorithmic level and going to hardware, chances are you are going to need some hardware expertise for things like registers and pipelines," said Stan Krolikowski, vice president of business operations at Cadence Design Systems Inc. "At least for the foreseeable future, we have to provide the mechanism to allow people to manually massage these designs."

In development for three years, STMicroelectronics' new ST210 processor is the brainchild of HP researchers who are part of a growing movement to generate an optimized processor core using straight C code. Unlike the C-language tools being bandied about in the EDA industry, these tools have no roots in register-transfer-level coding. They don't even have a notion of a clock.

"With C-level design tools, you still have to think like a hardware designer, and none of them go up to the level of architecting a computer system," HP's Rau said. "There's no concept of hardware built into this code."

The genesis of the ST210 goes back some six years, when Rau and others at HP Labs were looking at what comes after VLIW. Rau was among the early pioneers of very long-instruction-word computing when it burst onto the scene in the 1980s.

That effort later evolved into the PA Wide Word project, the predecessor to the Explicitly Parallel Instruction Computing that is the basis of Intel Corp.'s IA-64 architecture.

The HP Labs team wanted to extend VLIW's ability to profile an application through its compiler to bring out a higher level of design abstraction. The result was the Program-In, Chip-Out (Pico) project, an ambitious effort that would turn hardware design over to the compiler itself.

HP is now on its second-generation self-generating VLIW core and expects to begin marketing it in earnest next year. Under this architectural-synthesis model, algorithmic C code is fed into a program that generates its own processor compiler and a VLIW machine in VHDL, which can then be synthesized into processor gates.

Rau said the VLIW machine is a natural for the wide range of embedded applications the company is targeting. Optimized for video and audio streams, the ST210 executes multiple instructions per cycle while running at a mellow 250 MHz; a general-purpose RISC processor could only attain that kind of performance at 1-GHz speeds, according to STMicroelectronics.

Malleable architecture

Most noteworthy is that the processor's VLIW microarchitecture remains malleable. Using the Pico compilation tools, a customer can create new instructions, choose the number of execution units and play with the internal cache hierarchy. The tool then analyzes the trade-offs in price, power and performance before committing to silicon.

The algorithms are geared to extract instruction-level parallelism from C code to keep the number of hardware functional units - and hence die size and power - to a minimum. While the ST210 is designed to execute four instructions per cycle, Rau said Pico is capable of much more.
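As an illustration of the kind of parallelism such a compiler hunts for, here is a minimal C sketch (the function and data are hypothetical, not drawn from HP's tools): the four statements in the loop body carry no data dependences on one another, so a 4-issue VLIW machine could in principle pack all four multiplies into a single wide instruction word each cycle.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration only. The four assignments in
 * the loop body read and write independent array elements, so a VLIW
 * compiler can schedule them into one instruction word at compile
 * time -- no runtime scoreboarding hardware is needed. */
void scale4(const int *in, int *out, size_t n, int k)
{
    for (size_t i = 0; i + 3 < n; i += 4) {
        /* No statement here depends on another; a 4-issue VLIW
         * machine can execute all four in a single cycle. */
        out[i]     = in[i]     * k;
        out[i + 1] = in[i + 1] * k;
        out[i + 2] = in[i + 2] * k;
        out[i + 3] = in[i + 3] * k;
    }
}
```

A compiler that proves this independence can trade issue width against functional-unit count, which is exactly the die-size/power knob the Pico flow is described as turning.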

"We can issue eight to 12 operations per cycle, which is something that can't be done with a RISC core even if it's superscalar," he said. "If you are going for high performance, we believe VLIW is much better. VLIW can do a lot of computations that are not vector-based."

Use of advanced compilers to generate processors appears to be aimed primarily at signal processing, which is a natural fit for VLIW. But companies such as Tensilica (Santa Clara, Calif.), which has developed tools to let users reconfigure a RISC processor core, are exploring the use of more general-purpose processors to take on different forms of signal processing.

"There has been a lot of emphasis on DSPs," said Tensilica's Lam. "The general way to improve their performance is through parallelism, such as VLIW, SIMD [single-instruction, multiple-data architectures] and pipelining."

But there's also a need to extend configurable processors to the more general-purpose architectures used commonly in networks. Such an architecture is "not like DSP - it doesn't sit in a small loop and execute instructions," she said. "But at the same time it moves a lot of data through the system at speed. Anytime you worry about performance, you need to look at specializing your architecture."

Lam wouldn't discuss the details of Tensilica's next-generation compiler tools but said the company is looking to make use of VLIW and SIMD techniques to create "fused instructions" to perform simple bit extractions that would otherwise take multiple instructions in a general-purpose processor.

"If you look at Texas Instruments' instruction-set architecture, sometimes they have extra instructions for special DSP algorithms. The principle behind that is fused instructions," she said. "For some of the design problems, the tools are much better at making the iterations, and because it's integrated with the compiler it knows a lot of details - not just about the architecture but how the compiler works."
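A minimal C model of what a fused bit-extraction instruction collapses: on a plain RISC, the shift and the mask below are two separate instructions, while a fused extract performs both in one operation. The `extract` name is an illustrative assumption, not an actual TI or Tensilica mnemonic.

```c
#include <stdint.h>

/* Model of a fused bit-field extract. On a general-purpose RISC this
 * is (at least) a shift instruction followed by an AND; a fused
 * instruction does both in one issue slot. Valid for width in 1..31
 * (width 32 would shift by the full word size, which C leaves
 * undefined). */
static uint32_t extract(uint32_t word, unsigned pos, unsigned width)
{
    /* shift + mask, fused into a single logical operation */
    return (word >> pos) & ((1u << width) - 1u);
}
```

A configurable-core tool that spots this shift-and-mask idiom in application C code is exactly the sort of thing Lam describes: it can mint a one-cycle instruction for it because it sees both the architecture and the compiler's view of the program.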

For HP, devising a self-constructing VLIW processor is just the first step. The company is also developing a way for the Pico compiler to generate what it calls nonprogrammable accelerators (NPAs), or closely coupled, hardwired functional blocks. Under this scheme, the NPA could be compared to a VLIW machine as part of a hardware-exploration exercise that considers some 3,200 possible system configurations, out of 2.5 million architectures, and narrows them down to 77 candidates. Those candidates are then plotted on a chart showing the performance and die area trade-offs. HP claims to hold key patents on this search mechanism.
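The narrowing step described above amounts to keeping only Pareto-optimal designs: a configuration survives if no other candidate is at least as fast and at least as small, and strictly better on one axis. The sketch below assumes a toy cost model (cycles and area as plain integers) and is not HP's patented search mechanism.

```c
#include <stddef.h>

/* Toy design-space candidate: performance cost in cycles, die area in
 * arbitrary units. Illustrative only. */
struct config { int cycles; int area; };

/* a dominates b: no worse on either axis, strictly better on one. */
static int dominates(struct config a, struct config b)
{
    return a.cycles <= b.cycles && a.area <= b.area &&
           (a.cycles < b.cycles || a.area < b.area);
}

/* Sets keep[i] = 1 for each Pareto-optimal config (no other config
 * dominates it) and returns how many survive. */
static size_t pareto(const struct config *c, size_t n, int *keep)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n && keep[i]; j++)
            if (j != i && dominates(c[j], c[i]))
                keep[i] = 0;
        count += (size_t)keep[i];
    }
    return count;
}
```

The surviving configurations are precisely the points worth plotting on a performance-versus-area chart; every discarded one is beaten outright by some survivor.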

Using that approach, HP says the gate count using the automated system can be close to that of a manual design, but design time is cut dramatically. In one extreme example, HP's printer group was able to design hardware for a 40-line row discrete cosine transform loop in just two days for a 14 percent gate count penalty. That kind of job generally takes six months, the company said.

Test run

HP said the Pico NPA was also evaluated by an undisclosed semiconductor company, which used it to design a Viterbi decoder that made it through physical design on its first pass.

Despite his belief that even the most highly capable tools will never make hardware designers obsolete, Cadence's Krolikowski said there's nothing stopping companies from moving to higher levels of abstraction when designing a processor. "It's a valid approach, and that's one way to get around the limitations of behavioral synthesis," he said.

Krolikowski said Cadence isn't prepared to take the SystemC behavioral language into that realm, but he does envision it being used to compare power and performance models of code running on disparate microprocessor and DSP architectures.

"I'd like to have a tool suite that will allow you to make that trade-off analysis in real time. You need to be able to represent performance models and quantify the performance of the processor bus and memory at pre-implementation levels," Krolikowski said.