With over 9500 miles of road tripping across North America so far this month, I found myself driving into Stanford, California last week for the 20th annual Hot Chips conference. Unlike ISCA, which was mostly about theoretical future design changes, Hot Chips is all about real new chips that will be seeing the light of day shortly. It was two days of pretty intense back-to-back presentations on Larrabee, Nehalem, Rock, Godson, Tukwila, SPARC64, and other cool new CPUs. The common theme across most of the presentations was a plethora of hyphenated adjectives - "multi-core", "hyper-threaded", "in-order", "clock-gating" - describing this year's batch of processors.

Multi-core is definitely here to stay, as all the companies are jumping on the bandwagon of putting 4, 8, or more CPU cores on a single silicon die. What's more, Intel's hyper-threading technology, which seemed to die off with the Pentium 4, is back with a vengeance, allowing most chips to double or even quadruple the effective number of hardware execution threads. With all this parallelism, a number of CPUs are now shedding their OOO (out-of-order) pipelines in favor of a 1990's style in-order pipeline, as I recently described in the context of the new Intel Atom notebook processor. Chip makers are finally starting to significantly cut gates from the silicon to save power and die space, with out-of-order execution being the first major cut in most designs.

It wasn't all about computer microprocessors. One of the most fascinating presentations, I thought, was the one by Telegent Systems on their latest 300-milliwatt NTSC/PAL "TV on a chip", where literally a single piece of silicon has an RF antenna wire going in one end and HDMI video output coming out the other. I'd never really stopped to consider how they put live television on a cell phone, but in hindsight, it explains the recent availability of rather inexpensive (60 dollars or less) USB-based TV tuners that are barely the size of a thumb drive. Gone are the days of the bulky and heavily shielded 100 to 200 dollar TV tuner cards. The Telegent presenter pointed out that while North America may be killing off analog television transmissions, 5 billion people in the rest of the world still use NTSC or PAL, and now have a means for very inexpensive television reception. The cell phone may not only kill the PC in emerging markets, it may also kill off the stand-alone television, heh.

Another presentation that I found to be an eye-opener was the one on China's Godson-3 microprocessor. China had a non-existent CPU industry only a few years ago, but now the Chinese government sees it as a matter of national security to free itself from its dependence on Western computer technology. In just 7 years they have bootstrapped themselves straight into the 21st century and designed a chip that is competitive with where Intel or AMD were barely a year or two ago. The Godson-3 is an x86 clone chip that appears to be a brilliant melding of ideas from Intel, Transmeta, IBM, and other companies that have spent decades and billions of dollars developing the industry. While the Godson-3 is of course multi-core, its designers avoided the complexity of cloning an x86 core by using Transmeta-like binary translation instead. That is, the Godson-3 is actually a MIPS core using a RISC instruction set. As described in the presentation, about 5% of the silicon is devoted to hardware-accelerating the binary translation in order to achieve (and this number stunned me) 80% of the speed of a comparable x86 processor, with a total power consumption for all four cores of only 10 watts.

Repeat: a quad-core 64-bit RISC-based x86 clone that performs software-based x86 emulation and consumes a total of 10 watts of power. Holy cow! China gets it: save gates, save power, harness binary translation to do things in software instead. There is no hardware implementation of 16-bit MS-DOS real mode, v86 mode, 32-bit protected mode, or any such legacy baggage. China has beaten Intel and AMD to the punch in implementing my "10 Steps To Fix The Hardware Mess" which I posted last year. There seems to be absolutely nothing that our future overlords can't accomplish once they throw their money and manpower at it. If there is any doubt about that, the opening and closing ceremonies of the Olympics alone should leave you in awe. A 500-foot long LED screen, the human tower of acrobats, the inexplicably precise synchronization of 2008 drummers acting as one giant computer screen. The West is f*****. It would not surprise me if the next Apple Macbook or some new game console used the Godson-3.

So how did AMD and Intel respond? As they had done the week before, Intel presented the Larrabee processor, a heavily multi-core in-order x86 chip designed for graphics performance. It is interesting to note that one of the authors of the Larrabee paper is none other than Michael Abrash, famed developer of the Quake rendering engine and author of clever programming books. The Larrabee uses a ring bus architecture to connect its cores and, like the Godson-3, uses a directory mechanism to break up and distribute the L2 cache across the different cores. There isn't one global L2 cache that all the cores have to contend for; rather, the address of the memory access determines which cache to go look in. Larrabee is an in-order core just like the Atom, and thus is more suitable for very parallel workloads such as graphics rendering. It is interesting that while Larrabee is a full-blown multi-core x86 chip, it will first be marketed as a graphics co-processor.

For general purpose desktop use, Intel is introducing a third new core this year, the Nehalem, to replace the existing Core 2 architecture. As was stated in the presentation, Nehalem's goal is not so much to push the performance envelope that much beyond Core 2 as it is to make legacy x86 code run better. In other words, to reduce latencies for things like data cache misses and interlock operations, to get closer to achieving the 3 or 4 instructions-per-cycle throughput that existing processors such as the Pentium III, Pentium 4, and Core 2 were theoretically capable of but rarely achieved. One very welcome design change is yet again cutting in half the latency of atomic operations such as LOCK CMPXCHG. Last month I mentioned that the Intel Atom appears to completely eliminate such atomic lock overhead, and Nehalem seems to be taking a similar path. This is good news for multi-threaded software, which suffered from ridiculously high locking latencies (of over 100 clock cycles) in architectures such as that of the Pentium 4. A related change, and likely why the lock overhead is so low, is that Intel has followed AMD in moving to an on-chip memory controller, which they say will cut the memory latency of cache misses by 40%. If true, this would be serious competition for AMD's HyperTransport mechanism, erasing the one design lead that AMD still had left. Also related to this, Intel claims that hardware virtualization (VT) latencies are reduced by 40%. But in my opinion, hardware VT is still an unnecessary waste of die space. I am not sure if the power consumption figures for Nehalem were disclosed, but I am guessing it is not 10 watts. So kill VT please!

Like most of the other chips this year, Nehalem brings back hyper-threading to give us, as I am going to guess will appear next year, desktop PCs with 8 execution threads. So Nehalem will not necessarily make well-written code run much faster than Core 2, but poorly written code that suffers from cache misses or high lock contention will now run and scale better. For the SSE fanatics out there who absolutely must use every possible multimedia extension (and I mean you Igor, you know who you are), Nehalem adds the SSE4.2 instructions, which include the POPCNT bit-counting instruction and the CRC32 hashing instruction; these will likely see more use in device drivers than in application code.

Not to be outdone by Larrabee, AMD presented its AMD 780G chip, which for a 1-watt power consumption is claimed to deliver 2560x1600 HDMI output and hardware BluRay decoding. Could it be that, between the 780G and the Larrabee, video cards will finally stop needing their own power supplies and big loud cooling fans?

The last new chip I'll mention is the Sun Rock, a 2.1 GHz 16-core 32-thread monster of a chip that implements some novel speculative execution schemes. The Rock implements hardware transactions, which is the ability of the chip to make it appear that a sequence of instructions executed atomically. Put another way, hardware transactions allow a series of instructions to be undone, making it possible to write multi-threaded code with practically no locking overhead. If two threads conflict on, say, writing to the same memory location, one thread simply undoes its actions, at no cost to the other thread. This is different from classic lock-based synchronization schemes where a thread must pay a lock penalty (in the form of a compare-and-swap type instruction at the heart of any critical section, mutex, or lock function) regardless of whether any other threads will even take the same lock.

This transaction mechanism is implemented by having two separate copies of every single register, a "checkpoint" set of registers containing the last good state of the thread, and a "working" set of registers containing the speculative transactional state. The hardware thus maintains two program counters into the code, the program counter corresponding to the checkpointed state, and a "scout" which executes ahead speculatively. When a long-latency instruction such as a data cache miss is encountered, the scout thread keeps on executing ahead. If it finds any instructions which are dependent on the stalled instruction, those get put into a "deferred queue", and the scout thread keeps executing. At some later point when the deferred instructions can complete, the two threads will eventually merge and get in sync again. This effectively discovers parallelism in serial code and sounds a lot like a very deep out-of-order pipeline, but with far less complexity (and from the sounds of it, a much deeper window into the code) than a traditional out-of-order pipeline. The Rock is an in-order core that appears to behave as an out-of-order core. Or at least that's my understanding from the limited number of slides that I saw, but it does sound like Sun is doing something way cool that AMD and Intel have not tried yet.

SSDs Keep Getting Faster and Cheaper - And LCD TVs!

Just in the past few weeks since I posted about solid-state drives and building your own from Compact Flash, the prices and capacities have just been improving like crazy. Last week at IDF, Intel announced its intention to enter the hot SSD market with very fast and high capacity drives. Not surprisingly, the price of 64-gigabyte and 128-gigabyte drives has hit new all-time lows, with 128GB drives now being available for under 500 dollars. Amazon.com for example is now offering the 128GB OCZ SATA drive for 489 dollars, which if you think about it is four times the value of the $999 64GB option for the Macbook Air that Steve Jobs announced just this past January at Macworld. At this rate, and with Intel's move into the market, SSDs may displace mechanical hard disks well before my prediction a few weeks ago of the year 2011. Stupid man, what was I thinking!

The price of 1080p LCD televisions also appears to have hit a new milestone this week, no doubt in preparation for the Labor Day rush to buy televisions for the coming TV season and football season. As I was driving up I-5 a few days ago I stopped in various Best Buy, Fry's, and Circuit City locations in California and Oregon. At several locations I verified that 40-inch Sony Bravia LCD televisions are now selling for under $1000. One location in Palo Alto offered the store demo V2500 model for $929, while a location in Eugene, Oregon offered the same television brand new, unopened, for $999. Given the high premium usually charged on Sony products, this is the first time I've seen a decent Sony LCD television sell for under $1500, let alone under $1000. Why do they still sell $6000 televisions?

One has to wonder, how soon before solid-state drives and televisions merge? Perhaps by 2011, not only will solid-state drives replace the hard disk in all notebooks and desktop PCs, but why not build them right into television sets and eliminate the stand-alone Tivo DVR device? A television with a 1000-hour multi-terabyte flash drive. Sony, make it so!

Zero Overhead Interpretation?

Let's get to code again finally. As dynamic binary translation is catching on again as a viable way to emulate x86 on simpler RISC cores, I still stand by my belief that binary translation can mostly, if not fully, be implemented using purely interpreted techniques. That was the topic that Stanislav and I presented at ISCA, and is a topic that I have continued to research this summer with my Pentium M, Core 2, and now Intel Atom based notebooks in tow. At ISCA in June, we showed how the purely interpreted Bochs x86 emulator comes within a factor of 2x to 3x of the speed of QEMU, the widely used jit-based x86 emulator. As our benchmarks show, the CPU interpretation loop in Bochs can at best dispatch a new guest x86 instruction about once every 15 host clock cycles. QEMU, on the other hand, jits code which takes on average about 6 host clock cycles to emulate one guest instruction.

The main bottleneck now left in Bochs is that branch-mispredicted CALL instruction inside of the main CPU loop (in bochs\cpu\cpu.cpp) which calls each guest instruction's handler function. Most other branch mispredictions in Bochs are today eliminated, save for this most important and common one of all, which results in a 9 clock cycle per guest instruction advantage for QEMU. If you think about it, if the CALL instruction had no overhead, the code flow and execution speed of an interpreter and of jitted code would be very similar. Jitting could become mostly unnecessary.

Most processors today have a branch misprediction overhead of 20 or more clock cycles, while a correctly predicted branch (including indirect calls and indirect jumps) takes just one cycle. The Pentium 4 in some cases appears to exhibit far worse latency, on the order of about 50 cycles for a misprediction and about 14 cycles even for a predicted indirect call. When you consider that the actual implementations of most x86 instruction handler functions (or, for that matter, of the 68040 instruction handlers in Gemulator) complete in about 5 to 10 cycles, it is obviously quite beneficial to reduce misprediction of the instruction dispatch. With everything else optimized, mispredicted dispatch can easily still consume 1/2 to 2/3 of the clock cycles of an interpreter. This is the technical issue that needs to be solved.

There are two ways to make this overhead go away without resorting to jitting or assembly language:

Rewrite the dispatch code in a way which allows the host CPU to better branch predict such calls to handlers, or,

Parallelize other work during the branch misprediction to reduce the clock cycle cost of the misprediction itself.

I believe I've solved this problem. In the second of this week's postings which will be up on Wednesday September 3 on the 1 year anniversary of the start of this blog, I will show what I call the Nostradamus Distributor, a very simple yet portable acceleration of the interpreter dispatch mechanism that works unmodified across CPU architectures and C++ compilers, whether 32-bit or 64-bit, x86 host or PowerPC host. As is the theme of Judas Priest's current album, it is all about predicting the future. The mechanism does in fact accelerate the standard CPU loop by a factor of 2x to 3x by eliminating the mispredicted guest instruction dispatch, and does so purely in C++. Can you figure it out?