Friday, October 29, 2010

The mere mention of the Intel 440BX motherboard chipset still sends chills down my spine (as it should for all self-respecting nerds). It serves as a great example of a piece of hardware that lasted far longer than its builder intended. Its compatibility started with the 0.35 micron Pentium II 233 MHz processor of 1997, and ended with the 0.13 micron Pentium III-S 1.4 GHz Tualatin processor of 2001. That spans four process generations of Moore's law! In addition, Tualatin benefited from an unusually elegant design that allowed it to outperform 2 GHz processors sold years later. (0.18 and 0.13 micron processor generations required slocket adapters like this, and in a future post we will also note another great role the slocket served ;-) From a system builder's perspective, the 440BX represented the epitome of upgradeability, and reinforced in us the value of building our own computers.

The amazing dominance of the 440BX chipset may have partially inspired hubris at Intel that led to a series of bad decisions, like RDRAM (Rambus) and the 31-stage pipeline of the later Pentium 4. Once pride was swallowed, Intel backtracked into better products that used standard memory and efficient pipelines. In order to increase sales, however, they began a practice of requiring new motherboard chipsets for each processor generation, and this continues today.

In this way the 440BX represents the greatest period of computer upgrading in history, a golden age of technology that will seemingly never be repeated. In a future post we will see similarities with Microsoft's Windows XP product, which was also followed by bad decision making, a poor product cycle, and continuous incompatibility.

Thursday, October 28, 2010

This is what it's like when worlds collide (song) - well, maybe not worlds, more like data storage technologies with different substrates. As reported at TomsHardware, Steve Luczo, CEO of magnetic storage hard drive manufacturer Seagate, made some interesting comments regarding up-and-coming Solid State Drive (SSD) technology (full transcript).

First, on the product most famous for its SSD option, the MacBook Air (discussed here by Mac himself :) Luczo states the percentage of those units sold with SSD is very low, so it's not a threat. Second, SSDs are too small (low capacity) and too costly. Third, SSDs slow down over time, so their performance is not as great as you've heard. And last, "Seagate introduced hybrid drive last quarter, you get basically the features and function of SSD at more like disc drive cost and capacity ... with the hybrid there is things that you can do to alleviate that [performance degradation] so your boot times are actually as compelling one and two, three and four years down the road."

There you have it, resistance is futile, prepare for assimilation. Well, that is certainly one point of view, but in reality things could turn out less comfortable for Big Magnetic Storage (yeah, I said it :P) than they would like to admit.

As reported at RealWorldTech, SSD is a classic disruptive technology. There is a certain amount of storage each user needs, and as the cost per bit of SSD continues to follow Moore's Law, SSD will continue to meet an increasing percentage of users' needs. A possible counter argument is that users' needs are increasing exponentially, but I wonder if this is really happening. Most of the large data requirements come from audio and video libraries, where storage demands increase when the libraries get bigger or gain quality. In my case I have about two hundred albums in my music library that I have collected over the last decade, and it is really unlikely this will double to 400 albums in the next two years. Furthermore, high definition video is pretty close to the resolution of the human retina, making demand for further improvements (and their requisite bitrate increases) unnecessary.

To hard drive manufacturers, SSDs represent the wild west. In the vertically integrated magnetic storage business, manufacturers build the entire hard drive themselves in factories that cost billions of dollars. In contrast, the flash memory chips that store data in SSDs are a commodity, similar to DRAM memory chips, and anyone can buy them wholesale and integrate them into SSDs. This allows anyone to become an SSD OEM and ushers in a competitive environment that did not exist before. It will be interesting to see how this new west is won.

Wednesday, October 27, 2010

SiCortex was an inspiring company - they had a novel chip-to-chip networking architecture that boasted improved latency and bandwidth for MPI programs. They packed their processors densely, and instead of making rack-mount servers they optimized their own custom chassis for cooling. Their largest system unit had over 5,000 cores and could fit within the power budget of the typical office without any power or cooling retrofitting - a very exciting proposition and I don't think I was the only one who dreamed of bringing one home.

The system was marketed as a power efficient supercomputer, a niche that seems like a good one to target, since with the right architecture commodity servers can be beaten by a large margin in that arena. Low-power cores coming out of ARM and Tensilica inspire thoughts of how computer systems could incorporate such efficient cores in a useful way.

In 2007, about a year after SiCortex installed their first systems, the best SiCortex system was "only arguably" more power efficient than the latest Intel servers - meaning a significant advantage on some common cluster workloads wasn't obvious. For example, in the double-precision GFlops arena (often not a representative benchmark but used here for simplicity), SiCortex provided 5832 cores, each capable of 1 GFlops, yielding 5.832 TFlops. The power consumption of the system was about 18 kilowatts, resulting in 324 GFlops per kilowatt. The 3 GHz Core 2 Quads that could fit in 150-watt servers in 2007 were putting out 48 GFlops (4 ops/cycle in SIMD, 4 cores, 3 GHz). That's 320 GFlops per kilowatt, reducing the SiCortex advantage to a rounding error.
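The arithmetic behind that comparison is short enough to check directly. The sketch below just recomputes the figures quoted above; the function name is made up for illustration, and the numbers are peak theoretical rates, not measured results.

```python
def gflops_per_kilowatt(peak_gflops, watts):
    """Peak GFlops delivered per kilowatt of power drawn."""
    return peak_gflops / (watts / 1000.0)

# SiCortex: 5832 cores at 1 GFlops each, about 18 kW for the whole system.
sicortex_gflops = 5832 * 1.0
sicortex_eff = gflops_per_kilowatt(sicortex_gflops, 18_000)

# Core 2 Quad server: 4 flops/cycle (SIMD) * 4 cores * 3 GHz, in a 150 W server.
intel_gflops = 4 * 4 * 3.0
intel_eff = gflops_per_kilowatt(intel_gflops, 150)

# Ratio of the two efficiencies: a value near 1.0 means no real advantage.
advantage = sicortex_eff / intel_eff
print(f"SiCortex: {sicortex_eff:.0f} GFlops/kW, Intel: {intel_eff:.0f} GFlops/kW")
print(f"SiCortex advantage: {(advantage - 1) * 100:.2f}%")
```

The ratio works out to a little over one percent, which is the "rounding error" in question.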

The GFlops comparison is not fair - the SiCortex architecture had a lot of advantages outside of GFlops, like much lower penalties for cache misses, higher memory bandwidth per compute cycle, somewhat lower penalties for branch misprediction, etc. The comparison above also does not take network power consumption into account, and the PC network would have delivered lower performance for latency or bandwidth-bound problems. These advantages for SiCortex would have been more compelling if the floating point power efficiency had at least a 2x-4x advantage over Intel at the time, which could be perceived as a 2-year to 4-year advantage over commodity servers as a minimum. As it was, it was easy to think of workloads (e.g. SIMD floating-point bound workloads) that gained no power efficiency advantage on the SiCortex hardware, which starts the power efficiency story on the wrong foot.

Tuesday, October 26, 2010

This is an update to the previous post on the Q9505S (and includes a correction).

When the first 45nm quad-core processor arrived, the 3 GHz QX9650 with 12MB cache, it was labeled with a 130-watt TDP. TDP, which stands for "Thermal design power", indicates the maximum amount of heat that a cooler would need to dissipate when the processor is under load. TDPs are known to not be the best estimate of power consumption, in fact they are necessarily overestimates, but I had figured they are a fair estimate for processors with the highest clock speed in their class, i.e. the ones closest to consuming the TDP power.

Measuring the power consumption of the processor in a PC is non-trivial - using a Kill-a-Watt results in measuring the total system power (including power supply overhead). Even Kill-a-Watts get it quite wrong if the power consumption is changing quickly between different levels - but for measuring constant loads they work fine. Another way of measuring power consumption, and the one xbitlabs uses, is to run the DC power cables that feed the CPU through an ammeter ring, which measures amps, and then to measure voltage elsewhere, allowing calculation of watts as amps times volts. This method includes the inefficiency of the voltage regulator module, or VRM, in the CPU power figure. The VRM is responsible for bringing the 12 V down to the ~1.1 V required by the processor, and its efficiency can vary from 75% to 95% (ASUS has achieved 96%). It is possible to use multiple motherboards with the same processor and use some statistical techniques to work out what the efficiencies of the different motherboards must be close to, but that has never been tried to my knowledge since it is just too much work.
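To see how much the unknown VRM efficiency matters, here is a minimal sketch of the ammeter-ring calculation. The 8 amp reading is a made-up example value, and the function name is hypothetical; only the 12 V rail and the 75%-95% efficiency range come from the discussion above.

```python
def cpu_power_watts(amps, volts, vrm_efficiency):
    """Estimate the power reaching the CPU from a clamp reading on its 12 V cable."""
    if not 0.0 < vrm_efficiency <= 1.0:
        raise ValueError("vrm_efficiency must be in (0, 1]")
    input_watts = amps * volts            # power entering the VRM
    return input_watts * vrm_efficiency   # what is left after VRM losses

reading_amps = 8.0   # hypothetical clamp reading
rail_volts = 12.0

low = cpu_power_watts(reading_amps, rail_volts, 0.75)
high = cpu_power_watts(reading_amps, rail_volts, 0.95)
print(f"CPU power is somewhere between {low:.1f} W and {high:.1f} W")
```

With the same 96 watts going into the VRM, the CPU estimate moves by almost 20 watts depending on which efficiency you assume, which is why this method is only as good as your knowledge of the motherboard.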

Another technique for measuring power is to change the voltage of the processor several times, each time measuring the system power consumption. This can be hairy because over-volting can break the processor, but when it works you end up with several data points at several different voltages. You can overclock and underclock the processor to different levels as well. Power consumption will be passive power + dynamic power. Assuming passive power is pretty low (claimed to have decreased by 10x in Intel's 45nm process), estimating the dynamic power is enough, and dynamic power can be calculated by multiplying several factors, including frequency and the square of voltage. By holding the other factors constant and modifying just voltage and frequency it is possible to calculate the total dynamic power by solving for the missing factor. System power = Non-CPU power + CPU power, or s = a + x^2*y*b, where many (s, x, y) points (system power, voltage, and frequency) can be collected to solve for a and b (ignoring CPU passive power). I haven't seen this technique used but it should work - it would be interesting to compare it experimentally with other techniques to see how close it comes to them. One nice aspect of this technique is that the only hardware required is a Kill-a-Watt, as the frequency and voltage can be read using SpeedFan or other software tools.
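A rough sketch of how the fit could be done, using ordinary least squares on a single combined variable. This assumes the usual CMOS approximation that dynamic power scales with voltage squared times frequency, and it ignores passive power as described above. The readings are synthetic (generated from made-up values a = 60 W and b = 15) so that the fit can be checked; none of this is measured data.

```python
def fit_system_power(samples):
    """Least-squares fit of s = a + b * (V^2 * f).

    samples: list of (system_watts, volts, ghz) readings.
    Returns (a, b): non-CPU power in watts and the CPU scaling factor.
    """
    z = [v * v * f for _, v, f in samples]   # combined V^2 * f term
    s = [w for w, _, _ in samples]
    n = len(samples)
    z_mean = sum(z) / n
    s_mean = sum(s) / n
    sxx = sum((zi - z_mean) ** 2 for zi in z)
    if sxx == 0:
        raise ValueError("need at least two distinct voltage/frequency settings")
    sxy = sum((zi - z_mean) * (si - s_mean) for zi, si in zip(z, s))
    b = sxy / sxx
    a = s_mean - b * z_mean
    return a, b

# Synthetic wall-power readings at four (volts, GHz) settings.
true_a, true_b = 60.0, 15.0
settings = [(1.00, 2.0), (1.10, 2.5), (1.20, 3.0), (1.25, 3.2)]
samples = [(true_a + true_b * v * v * f, v, f) for v, f in settings]

a, b = fit_system_power(samples)
cpu_watts = b * 1.20 ** 2 * 3.0   # estimated CPU power at 1.20 V, 3 GHz
print(f"non-CPU power: {a:.1f} W, CPU at 1.20 V / 3 GHz: {cpu_watts:.1f} W")
```

With real Kill-a-Watt readings the points will not fall exactly on a line, and the size of the residuals gives a hint as to how much the ignored passive power and power supply losses are distorting the estimate.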

A last technique for measuring power consumption, and the one used in the previous article, is described by Anandtech as: "requires nothing more that the processor's specified TDP and then scales this value based on a given overclocked core frequency and voltage". This method is particularly terrible when the TDP is inaccurate, and the QX9650 was a special case of an extremely inaccurate TDP, estimated to be about double the actual power consumption.

This makes a corrected theme of the previous post a little less exciting: the power consumption of the 45nm process probably decreased a significant amount but not by half over the lifetime of the process.

Still, the Q9505S is an amazing processor. While working in the brain engineering lab at Dartmouth we put it in a mobile robot to run speech recognition (Dragon Naturally Speaking), speech production (AT&T Voices) and visual feature extraction (RoboRealm) simultaneously on a small ~15 pound mobile robot called Brainbot. Each application ran on a different core quite smoothly, while the entire robot got around 1.5 hours of battery life on 200 watt-hours. That's better life than my old Alienware Pentium 4 laptop when it was new. Brainbot is now sold for $30k, and the only robot available with more onboard processing power is the WillowGarage PR2, with two PCs built-in, consuming 6x as much power and selling for $400k. The Q9505S helped Brainbot get close to the PR2's level of performance for less than one-tenth the cost, which makes it a true marvel.

Monday, October 25, 2010

In the Intel lineup, the Q9505S is a freak. If I were to tell you that the performance of Intel's flagship 45nm quad-core processor (QX9650) would be delivered two years later at one third the cost and one half the power consumption you might conclude that Moore's Law had cycled again and the new processor benefited from a 32nm die shrink. This is what makes the Q9505S such a strange creature - these benefits were surprisingly reaped all within the same 45nm tech node. What gives?

For starters, it was released quite recently, two years after the first 45nm Core 2 quad-core processor, the QX9650, and consumes half the power at 65-watt TDP vs 130-watt TDP, and achieves approximately the same clock speed at 2.83 GHz vs 3 GHz. Because the processor architecture is the same (Core 2 quad), there are only a few possible sources for the power efficiency gains.

One possibility is an improved layout, i.e. placement-and-routing of the transistors and wires that implement the Core 2 architecture. It makes sense that only slight improvements would have been made here because Intel spends many man-hours hand-crafting the circuitry in the first place, and the Q9505S is seemingly not a high volume product that would merit follow-on hand-crafting. Another source of efficiency gains is obtaining a sweet spot of 6MB for the L2 cache, which is 50% less than the 130 watt 12MB Core-2-quads, and 50% more than the 4MB Core-2-quads.

Perhaps most interesting is the implication that the Q9505S is a beneficiary of improved fabrication technology within the 45nm tech node, long after the original 45nm debut. Additional evidence for this is that mobile quad-core 2.53 GHz parts (e.g. QX9300) with similar power efficiency to the Q9505S (45 watt TDP) were available a full year prior to the Q9505S, but cost much more (~$1k), indicating that the yield of processors with those specs was quite low. Given the additional year for developing the 45nm process, the yield for such processor specs must have improved by a large margin to allow their release at much lower prices - these types of improvements usually come from Moore's Law die shrinks but in this case it was all within the 45nm process.

45nm also stands out as an unusual tech node, having been credited as the greatest advancement in semiconductors in 40 years by Gordon Moore himself. This was due to overcoming difficulties in fabricating 45nm transistors by changing the materials that make up the gate and its insulation (so-called "high-k metal gate"). This same technology is also being used in 32nm, raising the question of whether 32nm will see similar delayed improvements. This would be just what Intel needs in order to deliver 3 GHz 8-core Sandy Bridge processors before Ivy Bridge's 22nm fabrication technology is ready.

Friday, October 22, 2010

I was fortunate enough to attend Bill Gates's back yard barbecue twice. In the summers of 2003 and 2004 I interned at Microsoft, definitely two of the best summers of my life, and each time some interns were treated to meeting Bill Gates for dinner on his home turf. What fantastic evenings - I don't remember the food but the unlimited free beer and ice cream sandwiches were just awesome. Even the bathroom was amazing: the paper towels used to dry your hands after washing were like real towels, super thick and yet super soft.

Bill would arrive fashionably late on his back lawn that touches Lake Washington, just when the sun was setting but so bright you had to almost close your eyes to squint hard enough to see when looking west. A crowd of 50+ interns would immediately surround him at very close proximity and at that point he would answer questions for about an hour and a half and then security would usher us out and back home. Nobody wanted to leave Bill, he really has an electric personality in personal settings. I think he wanted to inspire us interns, and he did.

I have some experience getting to the front of crowds, having practiced getting to the stage at Tool and Rage Against the Machine concerts - and I was able to in this case too, but there were always certain interns with more hubris, who would ask questions quicker and louder than me. I was able to interject a couple times, and one of my questions pertained to supercomputing and what he thought about it.

He said "Supercomputing didn't happen, it never happened, ask anybody. The only company that even tried is right over there [points across Lake Washington, referring to Cray Inc.] and they have only barely survived." (not a word for word quote)

I like people who get to the point and don't give wishy-washy answers (who doesn't?), and I liked his statements. Bill was definitely right from certain perspectives, and in that historical context, but today the destinies of PCs and supercomputers are deeply intertwined. In the modern context, most of supercomputing is the collection of networked personal computers, personal computers that he invented. Many supercomputers are built around hardware acceleration on graphics cards that originated in, and can't run without, the PC; and a standard way to build a supercomputer is by extending PC clusters with accelerator cards like Nvidia's Fermi or IBM's CELL PCI Express cards. The dependency between supercomputing and the PC is not a one-way street either. PCs owe many of their features to supercomputers, like multi-core processing, 64-bit addressing, and SIMD instructions. Only a fraction of the performance of modern desktop PCs would be possible without parallel programming techniques previously used only for supercomputers, like MPI and OpenMP.

Besides the insight Bill bestowed upon the crowd of interns, he also gave us great stories that we can tell and retell, anytime there is an excuse, to anyone who will listen.. blog readers not exempted! :-D

Thursday, October 21, 2010

Four score and seven years ago... no wait... Four years and two tech nodes ago, Intel bestowed upon us the first quad-core x86 processor, the QX6800. (Well, ok three-and-a-half years ago, but that doesn't really roll off the tongue the same, does it?) It was a beast at 2.93 GHz, capable of issuing four instructions per cycle.

Fast forward two Moore's Law cycles (4x increase in transistors, even faster than before) and we should have 16 cores of at least the same performance, right? Or maybe 4 cores that are four times as fast? Or 8 cores that are twice as fast? Hell, we could even settle for four cores that issue a max of 16 instructions per clock. That is the life to which we have grown accustomed.

Well, it turns out we've been spoiled, the chickens have come home to roost, and Intel is only to blame if you think they should be able to tweak the laws of physics (well, maybe for getting our hopes up, but do you really want them not to be optimists?). The best x86 processor today is a 6-core Gulftown (Westmere) with the same 4-issue rate, and a measly 13% increase in clock rate. There are some feature improvements, like extra threads, better branch prediction, and an increased likelihood of actually issuing all four instructions, but dammit, I want my cores, clocks and issue rate ;-)

We were promised and actually got 8 cores at 45nm, called the Nehalem-EX, but that was way too hot to function at 3 GHz (a speed originally introduced back in 2002) - so it was underclocked to 2.26 GHz. It also arrived _after_ Westmere and costs about 2x-3x as much :-(

Undeterred, Intel has announced its first 22nm fab will be located in the U.S., which will cost something like $8B, a surprising move in some respects since historically most of Intel's fabs are outside the U.S. in places like Israel, Malaysia, and Costa Rica. The increasing cost of new fabs (yes really, $8B!) has necessarily created a consolidation in the semiconductor industry, with only one player (Intel) capable of production at 32nm (or better) for about 10 months, and previously competitive companies are coming together to prevent falling further behind. In fact it is quite arguable that "half nodes" like 40nm and 28nm are an admission that the traditional nodes cannot be delivered in lock step with Intel, and the missing months are costly. All this leads to the conclusion that, with the possibility of escalating trade wars, a state-of-the-art domestic fab is of key strategic importance.

In response to China's increasing dependence on imported computers, the Chinese national processor "Godson" was developed, and can be fabricated by STMicroelectronics within China's borders. With respect to placing a lower bound on the processor performance that can be achieved without imports, Godson could be considered a huge success and a security blanket of sorts. Intellectual property issues did arise early in Godson's development due to using an instruction set based on MIPS but without the patented instructions. Licensing agreements were eventually worked out with MIPS Technologies (founded in the U.S.), which were arguably unnecessary but certainly put a stop to any ongoing controversy.

It will be interesting to find out how the world responds to China's reluctance to export rare earth elements, and where future fabs and processor architectures will emerge in the context of their increasing political importance...

Tuesday, October 19, 2010

Until the HPC Cluster instance debuted back in July, there was a lot to complain about in terms of CPU performance per dollar in the Amazon EC2 cloud. Single-threaded performance of their next-best instance, the so-called "High CPU" instance, is less than a tenth that of a modern desktop PC (2.5/8 = 0.3125 vs. 33.5/8 = 4.1875 EC2 compute units for a 2.93 GHz Nehalem). Indeed it is well known that Amazon slices up their real cores into many virtual cores that include only a fraction of the computing resources. This was the norm until the HPC Cluster instance, which is the first to provide a 1:1 real:virtual core ratio.

Without special arrangements only 8 HPC cluster instances can be recruited, at a cost of $1.60 each per hour, or $12.80 for all 8. The theoretical max double-precision GFlops (an imperfect and often misleading metric that is OK with respect to how we use it here) is 93.76 GFlops/instance * 8 instances = 750 GFlops (although only half this rate was achieved in their poster benchmark, we will give the benefit of the doubt). An hour's worth of processing (the smallest unit that can be purchased) delivers 750 * 3600 seconds = 2.7 Peta floating point operations.

Some applications are able to scale performance on N processors to be O(N), meaning linear scaling minus some overhead that does not increase out of proportion. "Embarrassingly parallel" algorithms are good examples of this, such as Monte Carlo algorithms and algorithms that process large amounts of data, like Web search.

Suppose a scalable algorithm that bottlenecks on the SIMD DP FPU takes an hour to complete a task on Amazon's 8 available HPC instances (any faster and performance per dollar is reduced due to the 1-hour minimum). Ignoring initialization time (which we will discuss in a future posting, and which today is not charged to EC2 users) scalable algorithms can do massive amounts of work in almost no time by recruiting tons of hardware for very brief periods. In this example, if 28,800 instances are available and there is no "1-hour minimum", the task finishes in about 1 second for the same price as the 1-hour scenario, utilizing 2.7 Petaflops (quadrillion floating point operations per second). At the current cost per compute-second, err.. compute-hour, the total cost would be $12.80.
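The numbers in this thought experiment are easy to reproduce. The sketch below uses only the figures quoted in this post (93.76 peak GFlops and $1.60 per instance-hour); the per-second billing is the hypothetical "no 1-hour minimum" scenario, not how EC2 actually charges, and the function name is made up for illustration.

```python
GFLOPS_PER_INSTANCE = 93.76       # theoretical double-precision peak, from the post
PRICE_PER_INSTANCE_HOUR = 1.60    # USD, from the post

def job(instances, seconds):
    """Return (total floating point operations, cost in USD) for a job.

    Assumes perfect linear scaling and hypothetical per-second billing.
    """
    total_flop = GFLOPS_PER_INSTANCE * 1e9 * instances * seconds
    cost = PRICE_PER_INSTANCE_HOUR * instances * seconds / 3600.0
    return total_flop, cost

hour_flop, hour_cost = job(instances=8, seconds=3600)
burst_flop, burst_cost = job(instances=8 * 3600, seconds=1)

print(f"8 instances for 1 hour:      {hour_flop:.3e} flop for ${hour_cost:.2f}")
print(f"28,800 instances for 1 sec:  {burst_flop:.3e} flop for ${burst_cost:.2f}")
```

Both cases do the same roughly 2.7 quadrillion operations for the same price; the only thing that changes is how long you wait for the answer.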

Conventional commodity-server based systems will probably never be capable of delivering this type of performance because of initialization time (currently 5-20 minutes on EC2, depending on OS), but it is easy to envision custom cloud architectures that would confer nearly instant execution to scalable algorithms.

Monday, October 18, 2010

Ashok Chandrashekar is an amazing guy. During his first few months of graduate school at Dartmouth he taught himself how to program the CELL processor, and even how to debug it, which was much harder. Errors that show up only in hardware were particularly vicious, and the catch-all "bus error" actually gives no information about what went wrong. Still, in about one month he wrote every line of code that went into our entry for the CELL University Challenge, which resulted in winning the grand prize (thanks also to Jay Moorkanikara, who originally had the idea to submit the entry).

For that month, working past midnight day after day with Ashok was one of the best experiences of my life, and our workarounds for the problems we encountered are why I think we won. The biggest and most strategic workaround was discovered while we were trying to reduce our "bit-vector dot-product" (dot-product where all the elements are 0 or 1) to 3 cycles. This was possible because the CELL processor ingeniously implemented the "pop-count" instruction which counts all of the 1's in a binary integer (e.g. popcount(1000100100001) = 4). One of the claims-to-fame for architectures like Itanium was the hardware pop-count instruction, which required significant dedicated hardware in the architecture design. Itanium and other architectures count the bits in the entire integer, but counting 32 or 64 bits requires a lot of logic to complete in a single clock cycle. Someone at IBM had the notion to count bits in smaller fields, namely each 8-bit field, separately. For large integers, it is much easier to count multiple 8-bit fields separately and store the totals in separate 8-bit regions, which allows the results of multiple pop-counts to be summed (on the CELL, summing 31 or fewer pop-counts has no overflow danger).

With the CELL pop-count instruction, it is possible to perform a 128-bit bit-vector dot-product in just 3 cycles: AND, popcount, ADD, and repeat along the entire vector's length. Before the sums outgrow their 8-bit boundaries they must be aggregated, e.g. to 16-bits, but that is pretty simple to do. And of course both input vectors must be loaded into registers as well, but those loads are hidden by the CELL's second execution port which can handle simultaneous loads/stores to/from the local store memory.
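For readers who want to play with the idea, here is a plain Python model of the AND / per-byte popcount / ADD loop. It is only a sketch of the arithmetic, not SPU code: a 128-bit register is modeled as a Python integer split into sixteen 8-bit fields, and all of the function names are made up for illustration.

```python
LANES = 16  # a 128-bit register viewed as sixteen 8-bit fields

def to_lanes(x):
    """Split a 128-bit integer into 16 bytes, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(LANES)]

def popcount_per_byte(lanes):
    """Count the 1 bits in each 8-bit field separately (each result is 0..8)."""
    return [bin(b).count("1") for b in lanes]

def bitvector_dot(a_chunks, b_chunks):
    """Dot product of two 0/1 vectors given as lists of 128-bit integers."""
    total = 0
    acc = [0] * LANES
    pending = 0
    for a, b in zip(a_chunks, b_chunks):
        counts = popcount_per_byte(to_lanes(a & b))   # AND, then popcount
        acc = [x + c for x, c in zip(acc, counts)]    # ADD into the 8-bit lanes
        pending += 1
        if pending == 31:   # 31 * 8 = 248 still fits in 8 bits; 32 * 8 would not
            assert max(acc) <= 0xFF
            total += sum(acc)                         # aggregate to a wider sum
            acc = [0] * LANES
            pending = 0
    return total + sum(acc)

all_ones = (1 << 128) - 1
print(sum(popcount_per_byte(to_lanes(0b1000100100001))))   # the example above
print(bitvector_dot([all_ones] * 40, [all_ones] * 40))
```

The flush after 31 accumulations is the worst-case bound mentioned earlier: every lane could gain 8 per step, so 31 steps is the most that is guaranteed not to overflow an 8-bit field.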

During the last stages of our implementation we encountered a throughput problem: much more than 3 cycles were required per 128 bits, and the reason for this was not obvious. It turns out the CELL processor does not contain a bypass network in the traditional sense, meaning that values that exit the ALU for register writeback are not immediately available in the next cycle as inputs (a capability provided in most modern architectures by their bypass network - in fact the Pentium 4 had a half-cycle throughput and latency for simple instructions). The CELL is designed this way because the bypass network is an expensive piece of hardware in terms of silicon area and power consumption, and as clock rates scale (3.2 GHz in the CELL, which was a very high clock for 90nm technology) the turnaround time of the bypass network must decrease to achieve single-cycle latency (there is a similar requirement for branch prediction which we will cover in a future post). Furthermore, the ability of typical modern processors to execute out-of-order (OOO) allows other instructions to be scheduled whose inputs are in fact available, but the CELL uses an in-order design instead of OOO to reduce the silicon area and power requirements of the CPU (thereby increasing the number of cores that can fit in each processor, increasing overall throughput).

Instead of a full latency-hiding bypass network and OOO execution, IBM relies on the compiler to intelligently schedule instructions, but this didn't work in our case (something we are ironically thankful for, since it made our contest submission more impressive, hehe). This may have been due to limitations in the compiler's scheduler, such as the size of the window in which rearrangements are looked for. Our workaround was to unroll the inner loop and then syncopate the operations of several loop iterations, interleaving them so that no instruction immediately follows the one whose result it needs.

The very large (though multi-cycle latency) register file in the CELL processor was able to simultaneously hold all of the temporary values without issue. This bit-vector dot-product sequence works on vector chunks of 768 bits, which evenly divided our input vectors. This scheduling allows the AND, POPCOUNT, and ADD instructions to have 6 cycles of latency headroom before their outputs are needed as inputs. It also amortizes the cost of the looping branch over 18 instructions instead of just 3, further increasing throughput.
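The exact instruction listing is not reproduced here, so the following is only a guess at the shape of the schedule, reconstructed from the numbers above: six 128-bit chunks (768 bits), three operations per chunk, 18 instructions per loop body. It assumes the operations were grouped by type (all ANDs, then all POPCOUNTs, then all ADDs), and the function names are made up for illustration.

```python
OPS = ("AND", "POPCOUNT", "ADD")

def naive_schedule(chunks):
    """Finish each chunk before starting the next: AND0 POPCOUNT0 ADD0 AND1 ..."""
    return [(op, c) for c in range(chunks) for op in OPS]

def interleaved_schedule(chunks):
    """Issue one operation type across all chunks before moving to the next type."""
    return [(op, c) for op in OPS for c in range(chunks)]

def min_dependency_distance(schedule):
    """Fewest issue slots between an op and the next op on the same chunk."""
    position = {instr: i for i, instr in enumerate(schedule)}
    gaps = []
    for op, c in schedule:
        idx = OPS.index(op)
        if idx + 1 < len(OPS):
            gaps.append(position[(OPS[idx + 1], c)] - position[(op, c)])
    return min(gaps)

unrolled = interleaved_schedule(6)   # 6 chunks * 128 bits = 768 bits
print(len(unrolled), "instructions per loop body")
print("naive distance:      ", min_dependency_distance(naive_schedule(6)))
print("interleaved distance:", min_dependency_distance(unrolled))
```

In the naive order every instruction needs the result of the one issued just before it, while in the interleaved order each result has six issue slots to arrive before it is consumed.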

With the $10k in prize money our team went to Vegas and had a very crazy week that was eventually ripped off by Rockstar Games and incorporated into Grand Theft Auto 4. :-D just kidding, I think I spent my portion paying off a credit card. Oh well!

Friday, October 15, 2010

I was discussing some new developments in computer architecture with my friend (and CEO) Mac Dougherty one day when he suggested that I might start a blog on the subject. This intrigued me because, since building my first computer (Dual Celeron 300a with modded slockets! (the forum that started it all)), I have appreciated reading commentary on computer architecture. I still get a rush when reading great articles by the likes of Jon Stokes, David Kanter, and Michael Schuette, and I also love checking tech news sites like HardOCP, which are updated at a much higher frequency. What was never satisfied for me was a need for computer architecture analysis and commentary updated more frequently than is possible for in-depth articles, and so I will endeavor to do something about it.