Next-Generation, High-Performance Processor Unveiled

The prototype for a revolutionary new general-purpose computer processor, which has the potential of reaching trillions of calculations per second, has been designed and built by a team of computer scientists at The University of Texas at Austin. The new processor, known as TRIPS (Tera-op, Reliable, Intelligently adaptive Processing System), could be used to accelerate industrial, consumer and scientific computing. Professors Stephen Keckler, Doug Burger and Kathryn McKinley have been working on underlying technology that culminated in the TRIPS prototype for the past seven years. Their research team designed and built the hardware prototype chips and the software that runs on the chips.

37 Comments

Well, apparently you didn’t read much of the article: a major problem with VLIW is that everything is scheduled by the compiler, which is fine for very regular code, but which sucks for ‘normal’ code.

TRIPS is a kind of massively super-scalar OOO CPU (which doesn’t use the register file much), so it’s quite different.

That said, I wonder if it’s possible to have a TRIPS ISA independent of the CPU implementation and still get good performance; otherwise it’d mean that you’d have to change/recompile all your applications each time you upgrade, a big problem.

Having read through chapter 6 of the link given above (thanks to that poster: can’t remember the name off the top of my head, and OSNews has obliterated that from my view while typing this comment, and I feel too lazy right now to get it by opening another browser window/tab) it appears to me that even a lousy compiler writer would still create executable code without too much effort: in each block of up to 128 instructions, the processor itself prioritizes them according to how they’re entered, BUT the instructions only execute once their predicates and incoming data are ready. In other words, the actual order of the instructions to accomplish something is merely an optimization of priorities, and does not affect the eventual outcome at all, except perhaps in execution time. It is indeed “data-flow driven” code, in that the dependencies between incoming and outgoing data and calculations are very explicitly stated, and as long as you’ve drawn that graph correctly, it’s actually very easy to create code for it that’s technically correct, even if not optimal for prioritization.
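To make the "order only affects timing, not the result" point concrete, here is a minimal sketch in C. It is a toy model, not the TRIPS ISA: the node layout, the opcode names and the sweep-based scheduler are all assumptions made up for illustration. Each node fires only once both of its producers have fired, so listing the same graph in a different order changes how many sweeps are needed but not the value computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dataflow node. Hypothetical layout for illustration only;
   this is NOT the real TRIPS instruction encoding. */
enum op { OP_CONST, OP_ADD, OP_MUL };

struct node {
    enum op op;
    int src_a, src_b;  /* indices of producer nodes (unused for OP_CONST) */
    int value;         /* the constant, or the result once fired */
    bool done;         /* true once this node has fired */
};

/* Sweep the block repeatedly, firing every node whose operands are ready.
   Returns the number of sweeps used, or -1 if some node can never fire
   (for example, a dependency cycle). */
int run_block(struct node *n, size_t count)
{
    size_t fired = 0;
    int sweeps = 0;

    assert(n != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        n[i].done = false;

    while (fired < count) {
        bool progress = false;
        for (size_t i = 0; i < count; i++) {
            if (n[i].done)
                continue;
            if (n[i].op != OP_CONST) {
                const struct node *a = &n[n[i].src_a];
                const struct node *b = &n[n[i].src_b];
                if (!a->done || !b->done)
                    continue;  /* operands not ready yet: wait */
                n[i].value = (n[i].op == OP_ADD) ? a->value + b->value
                                                 : a->value * b->value;
            }
            n[i].done = true;
            fired++;
            progress = true;
        }
        if (!progress)
            return -1;
        sweeps++;
    }
    return sweeps;
}

/* (2 + 3) * (4 + 1), with producers listed before consumers. */
int demo_forward(int *sweeps)
{
    struct node n[] = {
        { OP_CONST, 0, 0, 2, false },
        { OP_CONST, 0, 0, 3, false },
        { OP_CONST, 0, 0, 4, false },
        { OP_CONST, 0, 0, 1, false },
        { OP_ADD,   0, 1, 0, false },
        { OP_ADD,   2, 3, 0, false },
        { OP_MUL,   4, 5, 0, false },
    };
    int s = run_block(n, sizeof n / sizeof n[0]);
    if (sweeps)
        *sweeps = s;
    return n[6].value;
}

/* The same graph with the consumers listed first. */
int demo_reversed(int *sweeps)
{
    struct node n[] = {
        { OP_MUL,   1, 2, 0, false },  /* node 0 = node 1 * node 2 */
        { OP_ADD,   3, 4, 0, false },  /* node 1 = 2 + 3 */
        { OP_ADD,   5, 6, 0, false },  /* node 2 = 4 + 1 */
        { OP_CONST, 0, 0, 2, false },
        { OP_CONST, 0, 0, 3, false },
        { OP_CONST, 0, 0, 4, false },
        { OP_CONST, 0, 0, 1, false },
    };
    int s = run_block(n, sizeof n / sizeof n[0]);
    if (sweeps)
        *sweeps = s;
    return n[0].value;
}
```

Both demo functions compute the same product; the reversed listing simply needs more sweeps in this sequential simulation, which is the software analogue of a badly prioritized but still correct block.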

The impression I got from reading thus far through the PDF file is that this is a CPU designed by someone that is very accustomed to thinking in functional logic blocks, and as such the code and how it is all connected has an incredibly strong resemblance to drawing a combinatorial/sequential logic gate diagram and all the inputs/outputs, etc. associated with them.

In addition, in a bizarre but potentially very scalable design implementation, many (or all, I’d have to review) of the registers for each execution unit are available outside the chip (!) via a memory mapped address, and outsiders can pause the processor as needed: this chip is designed from the outset to scale in some mutation of a grid (the topology can be defined quite loosely in hardware to some degree, largely defined in software for programming purposes) leading to the very interesting observation that if there were (for example) a data processing task that required 1000 separate small sets of data, it’d be fairly easy to figure out how to set that up with a bunch of these processors, as blocks of instructions are executed atomically: either everything is satisfied by the end of the block, or no outside writing of data occurs, and there are native synchronization instructions to make this easier.

Nonetheless, compared to any other processor architecture I’ve seen documented, this one is WEIRD but it may very well be the future, though I suspect not in the next few years: there’s too much legacy code to make it an easy jump to something so foreign in concept to most people at the low level, though if the compiler writers get busy, that may mostly be abstracted into irrelevancy: the biggest question becomes how well and fast it can execute emulation of other processors in a reasonable manner. I suspect that may be the really hard thing to do, by comparison.

hmm, you make it sound a bit like what i would expect a FPGA to behave, but then im on really thin ice here…

as in, it gulps down a whole lot of instructions, stores them in some way (a FPGA modifies its layout to match the instructions) and then weights them based on the data it gets to work on afterwards.

im guessing that with enough storage it could store most of the program the user is using. or at least the part of the program the user uses (who ever uses all the features of said word at the same time?). so that after a couple of passes the speed of said program would be interesting to say the least…

still, it really really sounds like yet another bulk calculation chip. i really wonder how well it would handle multitasking…

That’s exactly what I was thinking too, an FPGA, except one that’s reprogrammable a million times a second. It looks like each instruction is a mini circuit, executing code directly in hardware. This would be ideal for image, audio and signal processing – it would beat traditional DSPs.

I’m fairly confident that it’s possible to write compilers for such a processor. It just has to analyze the dataflow, and design an ever changing circuit. You could execute entire loops in hardware.

Binary compatibility is not an issue these days, when the world is moving in the virtual machine direction. The compiler can be executed at runtime.

Multitasking wouldn’t be an issue. If you have 1024 execution units in an array, you can choose to run dozens of tasks, or fully utilize the entire processor to run a single task at an unprecedented speed. Just imagine that you could download an entire neural network to a TRIPS. Now that’s parallel processing!

I have further researched the processor on the university website, and not only is it possible, they’ve already written a C and a Fortran compiler and toolchain, along with performance monitoring tools.

The looping structure and how that works is from one block to another, and not specifically within the same block: however, unrolling loops should be very easy, because of the usage of predicates and dataflow. For example:

for (i = 0; i < 4; i++)
{
    a[i] *= 10;
}

In this case, how it could be unrolled is:

a[0] depends on i==0, and then updates a[0]
a[1] depends on i==1, and then updates a[1]
<ditto for a[2] and a[3]>

Now if you put each a[x] in a separate location and set what each one depends on to match the loop unrolling increment (4 in this case), all the instructions are executed in parallel, based on their data dependencies. If each iteration of such an unrolled loop takes 4 instructions (I haven’t figured it out exactly), that may mean being able to unroll the loop by an increment of 32, and subject to the throughput limit of the processor core (there seems to be a limit per thread of 16 instructions) it could do this effectively in 2 cycles, assuming all the instructions and data were present in the block and it was loaded. After that comes the time required to store the results back out, if desired, or they could be left behind for the next block to process.
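For anyone who hasn't seen unrolling written out, here is a rough sketch in plain C of the rolled loop next to a version unrolled by 4. It only shows the source-level transformation; the function names are made up for this example, and how a TRIPS compiler actually maps the independent updates onto a block is not shown. The point is that the four statements in the unrolled body don't depend on each other, so nothing but data availability stops them from executing at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one update per loop iteration. */
void scale_rolled(int *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] *= 10;
}

/* Unrolled by 4: the four updates in the body touch different elements,
   so they have no data dependencies on one another. */
void scale_unrolled4(int *a, size_t n)
{
    size_t i = 0;

    assert(a != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        a[i]     *= 10;
        a[i + 1] *= 10;
        a[i + 2] *= 10;
        a[i + 3] *= 10;
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        a[i] *= 10;
}
```

The estimate of 32 in the comment above comes from dividing the 128-instruction block size by a guessed 4 instructions per iteration.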

If anything, it looks like keeping this sort of processor fed with data/instructions may be the biggest issue.

at best it will be done using virtualization, and then it most likely needs to run said x86 at near native speed.

still, with the ongoing development of true, open standards and open source software that seems easier to port by the day…

still, the same thing that makes dell still have XP as an option on the computers is what keeps x86 in use (even if its being translated on the chip as some say): install base inertia. there is too much out there that works on current day tech, and unless we have a painless and effortless way of jumping to the new platform/technology, it will not happen.

this is why microsoft have sacrificed security on the altar of compatibility, to maintain their install base.

still, apple have jumped cpu tech 2 times. and people still use their stuff because they supply an emulator that integrates with the new as seamlessly as possible.

so it can be done. but its not something that will ever happen over night. if one wants to change something, one has to be able to stay in the game for the long haul. but with the investors requests for quick payoffs, not many are willing, or able, to do that.

hell, there has been an upgrade for the IP protocol sitting around since the 80s or 90s at least. why is it not in use all over the place? because the critical boxes, the routers, do not support it as well as the old one. and replacing those boxes will not happen unless its 100% required.

one can say that stuff like x86, IPv4, and similar have become a kind of tech tradition. and traditions are damn hard to change. just attempt to tell some recent immigrant from a nation with traditions you find revolting that he cant do that in his nation…

Why not ditch the hard-drive? It is the slowest piece of hardware in a laptop/PC/server etc.

It is the bottleneck in any system and prone to failure; this is the area that needs work the most.

X86 is going to be around for a long time. IMO the entire processor arena is stagnant: not too much is going on, not for what it costs. The second factor is that in the enterprise, corporations are trying to go cheap by not spending money on new equipment and re-deploying the old hardware. Lastly, the truth of the matter is the dot-com bubble has come and gone. The days of having huge budgets and being able to hire more associates are over. Well, Google may be the exception, but like everything else that will come to an end. We are still waiting for the next ‘new thing’ to come along; right now tech is really dead.

>Why not ditch the hard-drive? It is the slowest piece of hardware in a laptop/PC/server etc.

Agreed, and as everything else has dramatically improved over the years this bottleneck has relatively worsened. Hopefully MRAM or PRAM memory technology becomes widely available over the next five years and disk drives will become mainly archival devices.

SAME here, it still amazes me that (X-Corp) has the latest processor with 50 trillion floating calculations that can iron a shirt too… However, the same mechanical hard-drive that plagues the server room, with anything from RAID set failures to plain failed drives, continues to hammer on forever.

It is time for the hard-drive to be retired, given all of the current technology available. LCD screens, for example, were Expensive when they came out! Now they are cheaper than the original CRT monitors.

Talk about a system that BOOTS like a rocket! Running yum -y update on Red Hat or apt-get dist-upgrade on Ubuntu etc. and having those updates install QUICKLY without the hard-drive spooling would be awesome. With the ‘dot-com’ bust companies have gone cheap and it shows. Almost every ad you see for X-brand hardware is all about PRICE, not about speed or robustness, just the same old rehash of some blade server or cheap server.

I am all for new processors, but why not match them with STORAGE capable of matching the read/write rates of static memory??? Sure, the cost would be high at first, but it should not be all about ‘price’ and how cheap you can go. Unscheduled outages/downtime due to failed drives are too common today in the Data Center.

“Why not ditch the hard-drive? It is the slowest piece of hardware in a laptop/PC/server etc.”

Yeah, I’m still waiting for the promises of http://nantero.com/ based on nanotubes, offering non-volatile memory that is faster than Static RAM, and more dense and as cheap as Magnetic Hard Drive technology.

This would be great, half way through a game of Duke Nukem Forever, you could flick a switch to power down, and then power up anywhere else instantly to continue your game.

For some reason the last time I heard this kind of news was a few years ago; well, history repeats, I guess. The one thing they can do right is license this technology to AMD, Intel and IBM, and hope that they will use it. Otherwise it’s doomed to walk the same path as Transmeta did. The plain fact is that even when big names like Intel and IBM have introduced their “superior” new technologies, it never worked.

Sounds nothing like Transmeta. This isn’t a commercial product, it’s a university research project. Even if it never makes a dime, it’ll be a success, because academia is more interested in The Right Thing (TM) than What People Will Buy (TM).

How is it a success if it’s never used in Real Life (TM)? And what about this bold statement that it will “accelerate industrial, consumer and scientific computing”, how is it going to do that? And keep in mind that this is probably, being American university research, highly patented. So if all the major companies are going to ignore it, it won’t “accelerate industrial, consumer and scientific computing”. It’s bullshit to say that academic research in the USA doesn’t need commercial success, ofc it does! How do you think universities are funded in the USA? Public money, nope.

Just for some background, here is a brief description of processor evolution.

1. Accumulator and index register based. These were the simplest, so came first. The index register is large enough to address all memory and can be changed in very simple ways. The accumulator is smaller and can support more operations. Data from memory is applied to the accumulator – load, store, add, or, and so on. The PDP-8 and 6502 were accumulator architectures.

2. General purpose registers, in which a register could be used as an index or an accumulator. This meant indexing was no longer limited to extremely simple operations. Also with more than one accumulator, other registers could be a data source as well as a memory address. By extension, a memory value could be an address to another value. Plus register accesses could have side effects (typically increment or decrement). The PDP-11 was an example, the 68000 wasn’t quite – it has separate address and data registers.

Ever wonder why so many 80x86 operations are A + B -> A instead of A + B -> C ? Because the registers were still considered to be accumulators, so were implicitly one source of the data as well as the destination.

3. Stack based, in which the accumulator is replaced by a stack, which can also be accessed by offset (not just the top value or two). This was an alternative strategy to the general purpose register design. The advantages were much smaller code, since operands were implied, simpler design, and a closer match to the execution of languages like Algol, Pascal, and C, but the stack was a bottleneck to parallelism and optimisation.

General purpose registers could be used to emulate stacks in memory (without the efficiency gains, but without the disadvantages), so stack machines were abandoned. The Burroughs B 5000 and some FORTH machines are examples.
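As a rough illustration of the "operands were implied" point, here is a toy stack machine in C. The opcode names and struct layout are invented for this sketch and do not correspond to the B 5000 or any real FORTH machine. Note that the arithmetic instructions carry no operand fields at all, and that every one of them reads and writes the top of the stack, which is where the bottleneck to parallelism comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-opcode stack machine, for illustration only. */
enum sop { S_PUSH, S_ADD, S_MUL };

struct sinsn {
    enum sop op;
    int imm;  /* used only by S_PUSH; ADD and MUL name no operands */
};

#define STACK_MAX 16

/* Run a straight-line program and return the value left on top. */
int run_stack(const struct sinsn *code, size_t len)
{
    int stack[STACK_MAX];
    size_t sp = 0;

    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc].op) {
        case S_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc].imm;
            break;
        case S_ADD:
        case S_MUL: {
            assert(sp >= 2);
            /* Operands are implied: always the top two stack entries. */
            int rhs = stack[--sp];
            int lhs = stack[--sp];
            stack[sp++] = (code[pc].op == S_ADD) ? lhs + rhs : lhs * rhs;
            break;
        }
        }
    }
    assert(sp >= 1);
    return stack[sp - 1];
}

/* (2 + 3) * 4 written as a stack program. */
int stack_demo(void)
{
    static const struct sinsn code[] = {
        { S_PUSH, 2 }, { S_PUSH, 3 }, { S_ADD, 0 },
        { S_PUSH, 4 }, { S_MUL, 0 },
    };
    return run_stack(code, sizeof code / sizeof code[0]);
}
```

A three-operand register machine would spend bits naming sources and a destination in each instruction, but in exchange independent operations no longer have to queue up behind a single top-of-stack.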

4. Load and store based architectures restrict memory access to load and store using an address in a register, or with a small offset. No indirect addresses were allowed – that was simply replaced by two loads. This greatly simplified memory access. In addition, since all operations were now between registers, more were added, and three operand instructions (A + B -> C) allowed operations to leave both source registers unchanged so those values could be reused.

The MIPS and SPARC were early examples, with POWER coming later.

The trends have been to localise data and reduce data types. Data is no longer primarily in memory, but operated on in registers. And the distinction between data and addresses has disappeared, first by merging accumulator and index registers, then eliminating the idea of indirect addressing (operands are always data, never address to the data, with the single exception of load/store).

Future trends should be simple to extrapolate, but how best to implement them isn’t. Data will continue to be localised. Currently this is being done transparently, such as in the POWER5 and late Alpha, which physically duplicate the register sets to keep copies of data closer to the functional units, and which group instructions in bundles dynamically.

Some past ideas to further this have been dataflow architectures, which try to describe data operations more directly; Sun’s Counterflow research project, which would send operands into two ends of a pipe which would meet in the middle where the operation is performed; and Move Machines (or Transfer Triggered Architectures – TTAs), which send the data to functional units which simply perform an operation and produce a result once both input ports receive data (I like this one, but it’s impractical in its pure form).

This looks like it’s following this trend of localising data, and it’s doing it in a very balanced way. It’s like a dataflow architecture, but with hardware making the analysis rather than the compiler (relying on the compiler was also the downfall of VLIW architectures, which rarely do any better than hardware grouping of instructions like modern 80x86 or PowerPC). So I think this idea is promising, especially if it can show 2-10 times the speedup over conventional CPUs, like RISC did.

RISC was a breakthrough, and fundamentally altered even conventional designs to incorporate new features, even while using the old instruction sets. This may have the same potential, with an explosion of new, far faster processors, followed (if possible) by existing architectures (80x86, Power, IA-64) adopting the most effective features. Either way the potential for vastly improving computing performance is good.

>especially if it can show 2-10 times the speedup over conventional CPUs, like RISC did.

RISC was a major improvement, but only for a time; afterwards the x86 managed to catch up. Intel managed to provide adequate performance by reusing improvements made by RISC CPUs, despite keeping the ugly x86 CISC ISA.

>Either way the potential for vastly improving computing performance is good.

Perhaps, but if TRIPS CPUs don’t provide binary compatibility across several implementations, it will probably remain a ‘niche product’: this compatibility issue prevented VLIW from becoming widely used for a long time.

Since a number of comments have mentioned these, I’ll throw in my 2c since this is also my main work interest.

I haven’t studied the paper in depth but I see some similarities elsewhere.

There is this idea held by myself and quite a few others suggesting that much digital hardware and some software can be put in the same form with a single language that can be used for highly concurrent programming and/or be synthesized into hardware. This is especially true for mathematical algorithms that can be expressed in Matlab, Fortran, several HDLs, Haskell, occam and various hardware C dialects, all of which are candidates to turn into hardware. There seems to be a lot of overlap in the research of auto parallelization and behavioural synthesis. Indeed many of these scientific languages are used to design hardware in combination with HDLs.

There are now quite a few Par C dialects mostly from hardware and tool vendors hoping to sell FPGA kit.

I would like to see some of the Verilog (or VHDL) language (sort of C like) combined with a light version of C++ using struct => class => process idea with the hardware language adding a very strong concurrency aspect that is easy to synthesize to hardware but can also be executed on conventional cpus albeit slower using event driven kernels. In other words the process is the object model, with wiring for messages and models the world as it really is. Some of the more interesting non procedural languages already do some of this but the world keeps reinventing the same thing over.

One could argue whether the resulting model is software or hardware running on a hardware simulator or real hardware. Indeed the 1st application of FPGAs was for hardware simulators. If one changes the granularity level from 100ks of simple FPGA LUT cells to 1000s of simple cpus, we have a continuum for executing the parallel languages.

I have yet to read the documentation but I wonder if anyone here is aware of the Mozart/Oz system (as utilized in Concepts, Techniques, and Models of Computer Programming)? If my memory serves this system uses dataflow variables to control much of the computation. ( See wikipedia