PC user, hardcore gamer and programmer here; for me, energy efficiency is a lesser priority than speed in a CPU. Make an ARM CPU compete with an Intel Core i7 2600K, and show me it's overclockable with few issues, and you got my attention.

No doubt your CPU would win. But when looking at power/price as well, you'd have to pit your CPU against 50 or so ARM chips in parallel. For some solutions, it may be a far better choice. One size doesn't fit all.

There is already one line of supercomputers built from embedded hardware: the IBM Blue Gene. Their CPUs are embedded PowerPC [wikipedia.org] cores. That's the reason why those systems typically have an order of magnitude more cores than their x86-based competition.

Now, the problem with BG is that not all codes scale well with the number of cores. Especially when you're doing strong scaling (i.e. you fix the problem size but throw more and more cores at the problem), Amdahl's law [wikipedia.org] tells you that it's beneficial to have fewer, faster cores.
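To put rough numbers on the strong-scaling point, here is a minimal sketch of Amdahl's law. The 95% parallel fraction and the two core configurations are made-up illustrations, not measurements of Blue Gene or any x86 system; both configurations have the same aggregate throughput on paper.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over a single core for a fixed problem size (strong scaling)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_runtime(parallel_fraction: float, cores: int, core_speed: float) -> float:
    """Runtime relative to one baseline core; core_speed is per-core speed vs. that baseline."""
    return 1.0 / (core_speed * amdahl_speedup(parallel_fraction, cores))


if __name__ == "__main__":
    p = 0.95  # assumption: 95% of the work parallelizes
    few_fast = relative_runtime(p, cores=16, core_speed=4.0)   # hypothetical fast cores
    many_slow = relative_runtime(p, cores=64, core_speed=1.0)  # hypothetical slow cores
    print(f"16 fast cores: {few_fast:.4f} of single-core runtime")
    print(f"64 slow cores: {many_slow:.4f} of single-core runtime")
```

Under these assumptions the 16 fast cores finish sooner than the 64 slow ones, because the serial 5% runs four times faster on them.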

Finally I consider the study to be fundamentally flawed as it compares the OEM prices of consumer-grade embedded chips with retail prices of high-end server chips. This is wrong for so many reasons... you might then throw in the 947 GFLOPS, $500 AMD Radeon 7970 [wikipedia.org], which beats even the ARM SoCs by a margin of 2x (ARM: ~1 GFLOPS/$, AMD Radeon: ~2 GFLOPS/$).
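For what it's worth, the GFLOPS-per-dollar figures above can be checked with a couple of lines. This sketch only uses the numbers quoted in the post (947 GFLOPS and $500 for the Radeon, roughly 1 GFLOPS/$ for the ARM SoCs); it does not verify them against any price list.

```python
def gflops_per_dollar(gflops: float, price_usd: float) -> float:
    """Price-performance: peak GFLOPS bought per dollar."""
    return gflops / price_usd


if __name__ == "__main__":
    radeon = gflops_per_dollar(947.0, 500.0)  # figures quoted in the post
    arm = 1.0                                 # the post's rough ARM SoC figure
    print(f"Radeon 7970: {radeon:.2f} GFLOPS/$")
    print(f"Margin over ARM: {radeon / arm:.2f}x")
```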

I may be wrong here, but I get the impression that the MIPS architecture is much more power efficient than that of the ARM architecture

If they are going to talk about building up a big iron using CPUs which are of high power efficiency, I reckon the MIPS cpu might be more suitable for this task than one from the ARM camp

I don't think it is. Best figures (albeit somewhat out-of-date) I can find for a MIPS-based system is 2GFLOPS/W for a complete 6-core node including memory. ARM Cortex A15 power consumption is a little hard to track down, although it's suggested that a 4-core 1.8GHz configuration (eg Samsung Exynos 5) could run at full speed on 8W (if the power manager let it; the Exynos 5 throttles down when it consumes more than 4W). Performance per GHz/core is about 4GFLOPS, so this system should be able to pull in about 28.8GFLOPS (or twice that if using ARM's "NEON" SIMD system to full advantage). Add in ~2W for 1GB DDR3 SDRAM, and that's 2.9GFLOPS/W. Assuming that the MIPS system I found is not the best available (as the data was from 2009 it certainly seems likely better is available now), the two appear to be roughly comparable.
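The back-of-the-envelope estimate above, spelled out as a sketch. All inputs are the rough figures from the post (4 cores, 1.8 GHz, about 4 GFLOPS per GHz per core, 8 W for the SoC, about 2 W for DRAM), not measured values.

```python
def node_estimate(cores: int, clock_ghz: float, gflops_per_ghz_core: float,
                  soc_watts: float, dram_watts: float) -> tuple:
    """Return (peak GFLOPS, GFLOPS per watt) for one node."""
    peak_gflops = cores * clock_ghz * gflops_per_ghz_core
    total_watts = soc_watts + dram_watts
    return peak_gflops, peak_gflops / total_watts


if __name__ == "__main__":
    # Rough figures from the post; treat them as assumptions.
    peak, efficiency = node_estimate(4, 1.8, 4.0, soc_watts=8.0, dram_watts=2.0)
    print(f"peak: {peak:.1f} GFLOPS, efficiency: {efficiency:.2f} GFLOPS/W")
```

Doubling `gflops_per_ghz_core` gives the "twice that with NEON" case mentioned above.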

I may be wrong here, but I get the impression that the MIPS architecture is much more power efficient than that of the ARM architecture

If they are going to talk about building up a big iron using CPUs which are of high power efficiency, I reckon the MIPS cpu might be more suitable for this task than one from the ARM camp

MIPS is an under invested older but great technology.
Another historic winner was the DEC Alpha.

As the folk at Transmeta (and others) demonstrated, logic to decode any random ISA and drive a RISC core faster than the old VAX microcode days is very possible. This seems to be the way of modern processors. So the ARM/x86/x86_64 ISA almost does not matter except to the compiler and API/ABI folk. If you want to go fast, feed your compiler folk well.

One of the best ways you can help the compiler folk is with an orthogonal and sensible architecture. Furthermore, consider that generating good code is a problem that must be solved for every language, so starting with a good ISA makes for a lot less work.

The Core i7 might very well still win. Remember that Intel is more efficient in computing work per watt, and an Ivy Bridge Core i7 3770K uses 77W. If your average ARM chip uses 2 watts, that means that ~30 ARM chips will still get beaten by the Core i7....
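A quick sketch of the arithmetic behind that claim, using the 77 W and 2 W figures from the post. The efficiency advantage parameter is a placeholder, and the model assumes the workload scales perfectly across chips, which is generous to ARM.

```python
def chips_within_budget(budget_watts: float, chip_watts: float) -> float:
    """How many small chips draw the same power as one big chip."""
    return budget_watts / chip_watts


def arm_chips_to_match(i7_watts: float, arm_watts: float,
                       intel_perf_per_watt_advantage: float) -> float:
    """ARM chips needed to equal one i7's throughput, assuming perfect scaling.

    intel_perf_per_watt_advantage is Intel's perf/W divided by ARM's perf/W;
    1.0 means equal efficiency (a placeholder, not a measured ratio).
    """
    return intel_perf_per_watt_advantage * i7_watts / arm_watts


if __name__ == "__main__":
    print(chips_within_budget(77.0, 2.0))      # ARM chips in the i7's power envelope
    print(arm_chips_to_match(77.0, 2.0, 1.0))  # break-even count at equal efficiency
```

If Intel's perf/W is at least equal to ARM's, the break-even count is at least 38.5 chips, so ~30 chips would indeed fall short.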

No doubt your CPU would win. But when looking at power/price as well, you'd have to pit your CPU against 50 or so ARM chips in parallel. For some solutions, it may be a far better choice. One size doesn't fit all.

50 costs more in silicon than a single x86.

basically you need a "new generation" of arm chips. but they'll have to compete against a new generation of x86 chips - and remember, x86 chips are priced as they are only because they're the fastest you can buy!

the thing is, we have been listening to this for years, that in few years arm will take over everything. yet it hasn't.

instead of supercomputing, I would foresee the lowest tier of rent-a-webservers to move to arm.. what's a better business than renting a mach

..and by ugly you mean the greatest (most versatile) addressing modes of any currently produced CPUs?

The x86 addressing modes are so powerful that they even created an instruction to leverage the address generation logic without accessing memory...

The fact is that neither RISC nor CISC is best; a hybrid of the two is best. The problem with the RISC camp is that they can't make it hybrid while still being RISC, while the CISC camp hybridized long ago and even remained entirely compatible while do

Alpha's high price was due to DEC trying too hard to achieve prized speeds, and thereby having plenty of fallout, resulting in their need to jack up prices on those that did pass their tests. Had DEC gone for different speed bins, instead of just one, they could have priced it lower and sold it to markets which would have happily considered an Alpha, but where price was less critical.

There have always been cheap x86 chips. It's only the extreme high end that's been ridiculous. There has always been a sweet spot with x86 in terms of price and performance.

Although Alpha does provide a nice example of how performance per core trumps anything else. There were some problems you simply could not solve by throwing lesser CPUs at it no matter how much you might have wanted.

Then you use something else as well. High performance computing server rooms already have a mix of stuff, especially since the AMD chips can give you a 64 core machine with half a terabyte of memory for $14K, but it's not as fast per core as the two-way Xeons. The parallel stuff is done on the plentiful and slower cores while the single-threaded stuff is done on the faster cores - then GPUs do whatever parallel stuff you can feed them (memory and bandwidth limits keep them from doing some tasks).

Exactly, then again, there are plenty of non-CPU-intensive loads.. part of the popularity and growth of NodeJS is that a lot of jobs are IO bound, and even a lot of web services/sites are spending most of their time waiting on files, or network resources/services... 10 ARM CPUs handling 10K simultaneous requests is as good as 1 uber-CPU handling 10K simultaneous requests... for that matter, there's been a lot of work done in MessageQueue routing, and distributed databases... ARM is a pretty good fit for an environment designed to scale horizontally. Some of the first things I wanted to try on my Raspberry Pi were MongoDB and NodeJS, with the thought that a couple dozen of them might work better with more resilience than a few larger systems...
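To illustrate the IO-bound point, here is a small sketch using Python's asyncio as a stand-in for NodeJS-style event loops. The "request" is just a sleep that simulates waiting on a file or the network; the request count and delay are arbitrary assumptions.

```python
import asyncio
import time


async def handle_request(request_id: int, io_delay: float) -> int:
    """Simulated IO-bound request: almost all of the time is spent waiting."""
    await asyncio.sleep(io_delay)  # stand-in for a file or network wait
    return request_id


async def serve(n_requests: int, io_delay: float) -> list:
    """Handle all requests concurrently on a single thread."""
    tasks = [handle_request(i, io_delay) for i in range(n_requests)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(serve(10_000, 0.1))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} requests handled in {elapsed:.2f}s on one thread")
```

Because the waits overlap, one modest core can keep thousands of such requests in flight; the CPU only matters for the small slice of work between waits.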

For the record, I think addressing a bit more memory, and larger/faster storage channels are what's holding back some of these systems.. which aren't a problem at super-computer scale.. but for someone wanting to put together a small cluster, it gets irritating.

In TFA's slides 10 and 11, Intel i7 chips are shown to be more efficient in terms of performance per watt than ARM chips. However, they're close to each other and Intel's prices are significantly higher.

Useless for what you do. The second performance...not performance per watt...PERFORMANCE becomes an issue..ARM is a steaming pile of shit and you know it. If you're doing anything more than what the above AC said (keep playing sudoku, and portal) it can't handle it. How about everyday consumers who need a tablet that can actually do work? A gimp version of windows is not going to get the job done. Some of the Samsung Slate tablets however come with an x86...and are actually fully functional! Can you point t

On the other hand, a Windows version of GIMP does get a lot of jobs done that don't quite need Adobe Photoshop.

But seriously, the reason Windows RT is "gimped" is because Microsoft has refused to endorse recompiling desktop applications. That's not a failing of ARM, as ARM ran RISC OS on Acorn computers, as much as a power grab by Microsoft.

Some of the Samsung Slate tablets however come with an x86...and are actually fully functional! Can you point to an ARM tablet that can do everything it can?

Some ARM tablets run Ubuntu [ubuntu.com]. Other Android tablets run Debian in a chroot, with video out through an X11 server app for Android. These can't run Windows application

You aren't operating in the supercomputing market. There, what matters is how much processing you can get for how much money. You can always buy more chips, and power usage and cooling are both significant factors. That's why x86 became dominant in that space. It was cheaper to buy a bunch of x86 chips than to buy fewer POWER chips. In terms of computing power, a POWER7 will eat your i7 for breakfast, but they are ungodly expensive.

It was a two week process to attempt to buy a single low end machine with one of those things to see if it was viable for a particular task - two weeks getting my company's wallet weighed by a slimy bastard that made used car salesmen look like saints and a lot of veiled comments that may have been about kickbacks. In the end the price was more than that of four gold plated IBM Xeon systems of similar clockspeed or about double that in whitebox systems. Sounds like you need a black budget immune from the e

...but also reliability (because supercomputers are really large and one failed node will generally crash the whole job, thereby wasting gazillions of core hours; that's one reason why SC centers buy expensive Nvidia Tesla hardware instead of the cheaper GeForce series) and IO and memory bandwidth and finally integration density. That one Intel chip can be more tightly integrated as it won't generate as much excess heat per GFLOPS (according to TFA...).

Why did you even say this? "PC users" aren't even mentioned in this article. This article is about supercomputers where the workloads are by virtual definition extremely parallel and the restrictions are around price and power consumption, not "FPS on a single game".

No supercomputing whatsoever. I'm not a physicist, a mathematician, a code breaker nor anyone else with supercomputing needs. My HTTP request for web page is quite likely served by a single core. Maybe 2.

The problem you have is the software tools you use sap the power of the hardware. Windows is engineered to consume cycles to drive their need for recurrent license fees. Try a different OS that doesn't have this handicap and you'll find the full power of the equipment is available.

The last two times I ran Linux on my desktop I ran into issues that weren't impossible to overcome, just a pain in the ass to deal with... I had a desktop with two graphics cards in sli, and two monitors.. getting them both working in 2006 was a pain, I know that was seven years ago, but still... far harder than it should have been.. in 2007, my laptop was running fine, upgraded to the latest ubuntu, nothing but problems.. In the first case, XP/Vista were less trouble, in the second, Win7 RC1 ran better... I also ran PC-BSD for a month, which was probably the nicest experience I've had with something outside win/osx on my main desktop, but still had issues with virtual machines that was a no-go.

Given, my experiences are pretty dated, and things have gotten better... for me, linux is on the server(s) or in a virtual machine... every time I've tried to make it my primary OS has been met with heartache and pain. I replaced my main desktop a couple months ago, and tried a few Linux variants.. The first time, I installed on my SSD, then when I plugged in my other hard drives, it still booted, but an update to Grub screwed things up and it wouldn't boot any longer. This was after 3 hours of time to get my displays working properly.... I wasn't willing to spend another day on the issue, so back to Windows I went. I really like Linux.. and I want to make it my primary desktop, but I don't have extra hours and days to tinker with problems an over-the-wire update causes... let alone the initial setup time which I really felt was unreasonable.

I've considered putting it as my primary on my macbook, but similar to windows, the environment pretty much works out of the box, and brew takes things a long way towards how I want it to work. Linux is close to 20 years old.. and still seems to be more crusty for desktop users than windows was a decade and a half ago in a lot of ways. In the end, I think Android may be a better desktop interface than what's currently on offer from most of the desktop bases in the Linux community, which is just plain sad... I really hope something good comes out of it all, I don't like being tethered to Windows or OSX... I don't like the constraints... but they work, with far fewer issues... the biggest ones being security related... I think that Windows is getting secure faster than Linux is getting friendlier, or at least easier to get up and running with.

Given SLI barely works in Windows, expecting it to work in Linux was optimistic. I recently booted up a Linux Mint DVD on my laptop to try it out and... everything just works. Even using the 'recovery partition' to reinstall Windows on there takes over three hours, reboots about thirty times and breaks with barely decipherable and completely misleading error messages if you installed a hard drive larger than the one that came with it.

Yeah yeah you had no problems therefore they don't exist. I wish Linux advocates would be more honest about its flaws. I think it's great but it's nowhere near perfect. I swapped a Mint hard drive from another machine into this one and it works flawlessly which Windows most certainly wouldn't, however when I put Ubuntu on that other machine it was a nightmare.

Got any evidence for that claim? here [phoronix.com] are some benchmarks that suggest gaming performance is the same (which is what you would expect since the OS isn't participating much, except through the graphics drivers).

Far more games are played on ARM cpus than X86 CPUs these days. Of course the takeover started at the bottom end with Snake, and moved on through Angry Birds etc., it's only a matter of time before ARM takes over the hard core gamers too. It's more a matter of having a platform with big screen and interesting controllers. ARM CPUs are already up to the task of running such systems.

Your comment is off-topic. Nobody cares about your gaming machine and your desktop. Have you read the article? It is about HPC, you know, the machines that simulate global warming, nuclear weapons, etc. It is talking about entire rooms filled with dense, compact racks of CPUs and memory. These run up a super high electricity bill each month, so their operators actually care about energy efficiency, which may mean more processing power for the same price. Overclocking your gaming machine isn't HPC.

Damage or a winner? I feel so bad about having a cheap, efficient, and above all, quiet box.

I bought this [hardkernel.com] 4*2GHz baby, and the only reason it's not my main desktop yet is a weird and asinine requirement for monitor resolution to be exactly 720 or 1080 (WTF?!?). I think I'll replace my old but perfectly working pair of 1280x1024 monitors (I hate 16x9!), and put the big loud clunker to the cellar. I just hate the noise so much. x86 machines with no moving parts are extremely hard to get, and have terrible performance/price. Anything that requires lots of processing power: compilation, running Windows VMs, etc, can be done remotely from the cellar just as well, while a 2GHz arm is fast enough to do client stuff, running a browser being the most demanding part.

And what else do you need to reside directly on the machine you plop your butt at?

If it's the OP AC, whinging about how his games don't work well on ARM - then it's a damage (not that I regret it).
If it's you (thanks for the link: nice to see others on top of RasPi) or me - then it's winning.

Speaking about quiet: I recently bought a ProLiant MicroServer for the "home FS"/NAS - at 15W for the Turion and the 4 NAS-grade WD HDDs... mums, I can't hear it (under 60W at peak use). I would have gone with an ARM board, but couldn't find enough support for NAS-ing (not when RAID-ing anyway).

A single ARM 4 core A-15 running 1.5 GHz per core blows away any competing chip at the same specs, on power AND price. It's not limited to the calculations x86 are and can process graphics and physics better as a result.

Translation: It gets raped sideways on single-threaded performance and you have to double up on sockets right out of the gate. There's a bit of a misconception about ARM and x86. ARM wins on watts/socket and MHz/watt, but Intel's i7s cream ARM on performance/watt. Once you account for those two factors, ARM isn't as competitive as you might think. Now, I'm not saying it isn't competitive, just that it's nowhere near as one-sided as you might be led to believe by cherry-picking.

..if it runs x86 native, isn't it an x86 cpu? you look like an idiot who read some hyped-up article a few years back and is still waiting for it to be true. keep waiting! like for the magic parallel! (plenty of games utilize parallel code nowadays)

Most of the actual processing power in current supercomputers comes from GPUs, not CPUs. There are exceptions (that all-SPARC Japanese one, or a few Cell-based ones), but they're just that, exceptions.

So sure, replace the Xeons and Opterons with Cortex-A15s. Doesn't really change much.

What might be interesting is a GPU-heavy SoC - some light CPU cores on the die of a supercomputer-class GPU. I have heard Nvidia is working on such (using Tegra CPUs and Tesla GPUs), and I would not be surprised if AMD is as well, although they'd be using one of their x86 cores for it (probably Bulldozer - damn thing was practically built for heavily-virtualized servers, not much different from supercomputers).

As someone who does heavy duty scientific computing, I wouldn't say that "most" of the actual processing power is in GPUs. They are certainly more powerful at certain tasks, but most applications being run are legacy code, and most algorithms require substantial reworking to get them to run with reasonable performance on a GPU. Simply put, GPU for supercomputing is not quite a mature technology yet. I am personally not too interested in coding for GPUs simply because the code is not portable enough yet, and by the time the technology might be mature, there might be a new wave of technology (like ARM) that could be easier to work with.

ARM has predication: execute or don't execute a particular instruction based on the result of a previous instruction. It's like branching past one instruction at a time, and it doesn't stall the pipeline.

That advantage goes away if your core is superscalar -- you still have issues with branching and not keeping the queue full.
Some versions of x86 superscalar can execute both sides of branches, then discard the results of the branch not taken.
There is no reason that an architecture with an ARM instruction set could not do this; but then some of the performance-per-watt benefits would be leveled out.

It really doesn't seem like portability should be a huge goal for writing code for top-100 supercomputers. The cost of the computer would dwarf (or at least be a significant portion of) the cost of developing the software for it. It seems like writing purpose-built software for this type of machine would be desirable.

If you can cut the cost of the computer in half by doubling the speed of the software, it seems a valid fiscal tradeoff, and the way to do that would be to write it for purpose-built hardware

On the point of portability, there's then a distinction of your focus. If you do research on numerical methods, then yes, you would write highly optimized code for a particular machine, as an end in and of itself. I myself am merely a user, and our research group does not have the expertise to write such optimized code. We pay for time on supercomputing clusters, which constantly bring online new machines and retire old ones. Every year our subscription can change, and we are allowed to use resources on dif

System and numerical libraries and compilers are of course written specifically for the machine. But user-level apps (and a lot of scientific computing uses finished apps) are ported across multiple systems.

Portability is not as big an issue as it was a generation ago, as most supercomputers basically are Linux machines today, and made to more or less look like a typical Linux installation from a user-application level, with a POSIX API; pthreads, OpenMP and OpenMPI; a standard set of numerical libraries; a

It depends on what are you doing. If you have relatively short term project (say less than couple of years) you are right.

I've got to take issue with this statement. Anything that takes over a couple of years probably should not be started on new silicon, as it doesn't make sense to start it yet due to Moore's law. The guy that starts the same project a year from now using the same amount of money that you used will beat you to the final calculation and get the hookers and blow that you thought that you deserved.
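The "wait a year and still finish first" argument can be sketched numerically. This assumes the same budget buys hardware that doubles in speed every 18 months, which is a common reading of Moore's law but an assumption here, and it ignores everything except raw compute time.

```python
def finish_time(wait_years: float, work_years_today: float,
                doubling_years: float = 1.5) -> float:
    """Years from now until the job finishes if you wait before buying hardware.

    Assumes hardware bought after waiting is 2 ** (wait / doubling) times faster.
    """
    speedup = 2.0 ** (wait_years / doubling_years)
    return wait_years + work_years_today / speedup


if __name__ == "__main__":
    for work in (2.0, 4.0):  # job length in years on today's hardware
        now = finish_time(0.0, work)
        later = finish_time(1.0, work)
        print(f"{work:.0f}-year job: start now -> {now:.2f}y, wait a year -> {later:.2f}y")
```

With these assumptions, waiting a year pays off for the 4-year job but not for the 2-year one, which is roughly the "couple of years" threshold claimed above.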

The only time it makes sense is when the hardware is otherwise at end of life, that there is no longer an initial inv

False. According to the Top 500 computer survey from November, 2012 (Category: Accelerator/Co-Processor), 87% of systems are not using any type of GPU co-processor, and 77% of the processing power is coming from the CPU.

This is, however, a decrease from the June 2012 survey, so GPU is certainly making inroads, but it is not yet the main source of computation.

http://www.top500.org/statistics/list/

I still remember when the IBM Blue Gene architecture came out, using embedded PowerPC processors and it was a huge po

Of the last published Top500 list, 7 out of the top 10 had no GPUs. This is a clear indication that while GPU is definitely there, claiming 'Most of the actual processing power' is overstating it a touch. It's particularly telling that there are so few, as overwhelming the specific HPL benchmark is one of the key benefits of GPUs. Other benchmarks in more well rounded test suites don't treat GPUs so kindly.

Perhaps. Pretty much any time I am doing some SSE coding I am thinking to myself "wouldn't it be nice if these registers were wider.. why doesn't someone in the x86 market just go ahead and make huge vector registers at least for addition, multiplication, and shifting" and then I realize that that is in fact where the APUs are at right now.. and think to myself "geeze I should be doing OpenCL not this hand-crafted SSE shit"

Not really. The main difference between ARM and x86 cores in this application is that ARM has an equally flexible but lower performance ALU. For scientific applications that is a good trade off because performance tends to be mostly dependent on the FPU and on things like network and memory latency.

In other words it is hard to max out an x86 core constantly in a supercomputer so much of its performance is unused. ARM does away with the bits that are less critical which results in lower power consumption and

As I understand it, Intel still has the advantage in the performance per watt category for general processing and GPUs have better performance per watt IF you can optimize for that specific environment--both things which have been commented to death endlessly by people far more knowledgeable than I.

However, to me there are at least 3 questions unanswered:

1. ASICs (and possibly FPGAs): Bitcoin miners and DES breakers are the best known examples. Where is the dividing line between where your operations are specific enough to employ an ASIC vs not specific enough and needing a GPU (or even CPU)? Could further optimization move this line more toward the ASIC?

2. Huge dies: This has been talked about before, but it seems that, for applications that are embarrassingly parallel, this is clearly where the next revolution will be, with hundreds of cores (at least, and of whatever kind of "core" you want). So when will this stop being vaporware?

3. But what do we do about all the NON-parallel jobs? If you can't apply an ASIC and you can't break it down, you're still stuck at the basic wall we've been at for around a decade now: where's Moore's (performance) law here? It would seem the only hope is new algorithms: TRUE computer science!

In ASICs ARM is an ideal choice because you can build it right into the chip from a reference design. A lot of ASICs feature an 8051 core for management and I/O tasks, but if you needed to execute a more complex application then a simple ARM core running Thumb or even a full 32-bit ARM core would be ideal.

Hopefully this means we should start seeing ARM-using motherboards in an ATX form-factor. The Pi and Beaglebone are nice, but I want something that's essentially just like a commodity x86 motherboard except it uses ARM.

Why? Mini-ATX's not good for a commodity MB? 'cause you don't need a high google-fu to find heaps of them.

Mini-ATX or Mini-ITX will do fine. I just haven't seen any that have the kinds of things you take for granted on x86 boards. I want an ARM board with SATA ports, PCIe slots, and DIMM (or SODIMM) slots. Is that too hard to produce? I don't see anything like this anywhere.

Current ARM processors may indeed have a role to play in supercomputing, but the advantages this article implies don't exist.

Go look at performance figures for the Cortex-A15. It's *much* faster than the Cortex-A9. It also draws far more power. There's a reason why ARM's own product literature identifies the Cortex-A15 as a smartphone chip at the high end, but suggests strategies like big.LITTLE for lowering total power consumption. Next year, ARM's Cortex-A57 will start to appear. That'll be a 64-bit chip, it'll be faster than the Cortex-A15, it'll incorporate some further power efficiency improvements, and it'll use more power at peak load.

That doesn't mean ARM chips are bad -- it means that when it comes to semiconductors and the laws of physics, there are no magic bullets and no such thing as a free lunch.

What it shows is the cost, in energy, of moving data. Keeping data local is essential to keeping power consumption down in a supercomputing environment. That means that smaller, less-efficient cores are a bad fit for environments in which data has to be synchronized across tens of thousands of cores and hundreds of nodes. Now, can you build ARM cores that have higher single-threaded efficiency? Absolutely, yes. But they use more power.

ARM is going to go into datacenters and supercomputers, but it has no magic powers that guarantee it better outcomes.

Slashdot seems to have lots of ARM fanboys that look at ARM's low power processors and assume that ARM could make processors on par with Intel chips but much more efficient. They seem to think Intel does things poorly, as though they don't spend billions on R&D.

Of course that would raise the question as to why ARM doesn't, and the answer is they can't. The more features you bolt on to a chip, the higher the clock speed, and so on, the more power it needs. So you want 64-bit? More power. Bigger memory controller? More power. Heavy hitting vector unit? More power. And so on.

There's no magic ju ju in ARM designs. They are low power designs, in both senses of the word. Now that's wonderful, we need that for cellphones. You can't be slogging around with a 100 watt chip in a phone or the like. However don't mistake that for meaning that they can keep that low consumption and offer performance equal to the 100 watt chip.

The point is that an ARM processor can provide, say, 75% of the performance for 25% of the power compared to x86. You can see it in tablet computers, particularly those running Windows RT or Ubuntu where a direct comparison is possible. Since most of the bottlenecks are not due to processing power but rather disk, RAM, graphics rendering, network etc. you very quickly reach the point of diminishing returns with increasing CPU performance.
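Taking the 75%/25% figures above at face value (they are the poster's illustrative numbers, not benchmark results), the implied efficiency gap works out as follows. The sketch assumes work can be spread across extra chips with no overhead.

```python
def relative_perf_per_watt(rel_perf: float, rel_power: float) -> float:
    """Perf/W of chip A relative to chip B, given A's perf and power as fractions of B's."""
    return rel_perf / rel_power


def power_for_equal_throughput(rel_perf: float, rel_power: float) -> float:
    """Total power (as a fraction of B's) for enough A chips to match B's throughput."""
    chips_needed = 1.0 / rel_perf
    return chips_needed * rel_power


if __name__ == "__main__":
    print(relative_perf_per_watt(0.75, 0.25))      # efficiency ratio
    print(power_for_equal_throughput(0.75, 0.25))  # fraction of x86 power at equal throughput
```

So if the 75%/25% claim held, matching one x86 chip would take about 1.33 ARM chips at roughly a third of the power; whether real chips hit those figures is exactly what the replies dispute.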

yeah well, we'll see when it does 75% performance for 25% of power. it doesn't. you can't see it in tablets right now. that's what next gen is supposed to fix. but the next gen arm design is going to use more power to get there.

(incidentally memory access, network etc are all slower on arm and for most supercomputing they do matter)

it is a bit boring to read these articles now for a decade though. "intel is dead due to arm in two years!! yeehaw!!". they were even more boring back in the day when intel was m

The magic ju ju is the ARM business model. There is one trump card ARM holds that precludes Intel from many portable devices: chip makers can build custom SoCs in-house with whatever special circuits they want on the same die. Intel doesn't do that and they don't want to do it; it would mean licensing masks to other manufacturers like ARM does. For example, the Apple A5, manufactured by Samsung, includes third party circuits like the Audience EarSmart noise-cancellation processor, among others. It is presently not feasible to imagine Intel handing over masks such that Apple could then contract with some foundry to manufacture custom x86 SoCs. This shuts Intel out of many portable use cases.

That feature of the ARM business model might be very useful to large scale computing. One can imagine integrating a custom high-performance crossbar with an ARM core. Cores on separate dies could then communicate with the lowest possible latency. Using a general purpose ARM core to marshal data to and from high-performance SIMD circuits on the same die is another obvious possibility. A custom cryptography circuit might be hosted the same way.

Contemporary supercomputers are great aggregations of near-commodity components. However, supercomputing has a long history of custom circuit design and if the need arises for a highly specialized circuit then a designer may decide that integrating with ARM to do the less exotic leg work computing that is always necessary is a good choice.

I have long pined for a server with maybe ten 4-core ARM CPUs. Basically my server spends its time serving up web stuff from memory. Each web request needs to do a bit of thinking and then fire the data out the port. Disk IO is not an issue nor is server bandwidth. Quite simply I don't need much CPU but I need many CPUs. A big powerful Intel is of less interest.

Also by breaking up the system into physically separate CPUs I suspect that an interesting memory accessing architecture could be conjured up preventing another potential choke point.

Supermicro 1U 64 cores [supermicro.com]. Bunch of other mobos (some more than 1U) on this page [supermicro.com]. Cheap is relative to the buyer I suppose, but to my (admittedly very large) company these things are rather cheap unless you start stacking them with lots of dense memory.

Has anybody else seen/considered the Xilinx Zynq [xilinx.com]? It's a mix of ARM cores and FPGA, which could be interesting in supercomputing solutions.

For anyone willing to tweak around with it there are development boards around like the ZedBoard [zedboard.org] that is priced at US$395. Not the cheapest device around, but for anyone willing to learn more about this interesting chip it is at least not an impossible sum. Xilinx also have the Zynq®-7000 AP SoC ZC702 Evaluation Kit [xilinx.com] which is priced at US$895, which is quite a bit more expensive and not as interesting for hobbyists.

Done right you may be able to do a lot of interesting stuff with a FPGA a lot faster than an ordinary processor can and then let the processor take care of stuff where performance isn't a critical part.

Those chips are right now starting to find their way into vehicle ECUs [xilinx.com], but it's still in an early phase so there aren't many mass produced cars yet with it.

As I see it - supercomputers will have to look at every avenue to get maximum performance for the lowest possible power consumption - and avoid solutions with high power consumption in standby situations.

Not this week.... I am a fan boy for the small ARM boards... I have built an MPI cluster out of Raspberry-Pi boards and it is not even close except as a teaching exercise where it excels.

However many site services can be dedicated to these little boards where corp IT seems to dedicate virtual machines.

Department Web Servers... with mostly static content... via NFS or a revision control system like hg.
Department and internal caching name servers... NTP servers and managed central storage for each bu

This isn't to say that ARM *can't* be there, but thus far all of the implementations have focused on 'good enough' performance within a tightly constrained power envelope. Intel's designs have traditionally been highly inefficient in that power band, but at peak conditions, they are still compelling.

I recall one 'study' which claimed to demonstrate ARM as inarguably better. It got way more attention than it should have. The reason being is that they measured the performance on the ARM test, but just *

It depends entirely on the task. There's plenty of threads that just cannot fit their memory requirements onto a GPU and keeping the things fed with memory can be slower than doing it on a CPU in the first place. Remember you are comparing something of the order of 8GB shared memory between the GPU cores with 1TB shared between the CPU cores.

I'm thinking you don't understand. The whole "shared memory" thing is not exclusive to x86 cores. At some level it's a software abstraction relating to latency of storage. GPUs can have terabytes of RAM too as a sixth level cache.

Intel really needs some help here because the ground has shifted too much for them.

"keeping the things fed with memory can be slower than doing it on a CPU in the first place" is the line you've missed and is why GPUs don't solve every highly parallel problem at the moment. They can do reverse time migration, but can't currently do time migration, depth migration, tomography etc etc. The penalties of swapping so much memory in and out are far too costly, to the point of orders of magnitude of performance or complete showstoppers where you just can't get enough in for it to work at all.