Posted
by
CmdrTaco
on Monday January 03, 2011 @01:12PM
from the eat-the-seeds dept.

eldavojohn writes "Remember a few months ago when the feasibility was discussed of a thousand core processor? By using FPGAs, Glasgow University researchers have claimed a proof of concept 1,000 core chip that they demonstrated running an MPEG algorithm at a speed of 5Gbps. From one of the researchers: 'This is very early proof-of-concept work where we're trying to demonstrate a convenient way to program FPGAs so that their potential to provide very fast processing power could be used much more widely in future computing and electronics. While many existing technologies currently make use of FPGAs, including plasma and LCD televisions and computer network routers, their use in standard desktop computers is limited. However, we are already seeing some microchips which combine traditional CPUs with FPGA chips being announced by developers, including Intel and ARM. I believe these kinds of processors will only become more common and help to speed up computers even further over the next few years.'"
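The "MPEG algorithm" in question is presumably something along the lines of the discrete cosine transform that sits at the heart of MPEG's 8x8 block coding. As a rough illustration only (a naive DCT-II sketch, not the researchers' actual code):

```python
import math

def dct_1d(x):
    """Naive DCT-II of a sample vector -- the transform at the heart of
    MPEG's block coding (a 2D version applies this to rows, then columns)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out
```

A flat 8-sample block collapses to a single DC coefficient with all the AC terms at zero, which is exactly why the transform compresses typical image blocks so well.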

A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD. Mind you, I've seen embedded designs where a microcontroller, RAM, ROM and custom logic are implemented in a $10 FPGA - especially where volumes are too low for an ASIC.

On the other hand I could definitely see programmable logic inside Intel or AMD CPUs, a sort of super SSE. Then again, even there you'd probably be better off using GPU-like custom hardware for the heavy lifting. In fact I can see CPU/GPU hybrids being very common in low-end machines. Full custom logic is always going to have a performance-per-$ advantage over FPGAs unless FPGA technology changes drastically.

They're dynamically reprogrammable, but it's not like you can just instantly flip to another ROM. These things take time to switch to another configuration. They are much better suited for batch operation - running one task completely before moving on to the next - than multitasking.

You should jump on some of the newer non-volatile FPGAs that can run a microcontroller core. I found one from Lattice that was on sale for $29.95 with a JTAG programmer. Now they are $50. There are other brands, and I'm always looking for cheap dev kits. I think there's a dev kit for OMAP from TI that's open source and not too expensive, but I'm not finding it.

given your last comment - I think pre-defined hardware such as AMD/Intel desktop chips will always be faster than FPGA for a pre-specified set of individual operations. It's only when you get to operation combinations not defined at manufacture time, but used frequently, that FPGA has an advantage.

The current CPU design will stay for most of the work, and an FPGA attachment would handle the specialty work that isn't needed most of the time, and can be dropped.

A non-FPGA AMD/Intel CPU will always be faster doing general CPU business than a FPGA implemented one doing the same.
It is however a stupid approach: a CPU is built to do general-purpose calculations to allow for all software to exist without specialized hardware. An FPGA, on the other hand, is made to be configured into specialized hardware in order to... well, I guess not having to build a lot of prototypes for hardware testing was its original purpose. But its uses go far beyond that, in that it could turn into

The typical home user rarely needs to do any really heavy number-crunching - the closest they get is physics in games. It has definite use in scientific computing and analytics, though - especially as it allows the engineers to constantly improve the programs without needing to get new silicon manufactured. It's a niche into which GPGPU has settled quite happily, though - and it does such a good job, only the most extremely demanding workloads may justify the expense of FPGAs and people with the skills t

The typical home user rarely needs to do any really heavy number-crunching - the closest they get is physics in games.

For the past 5 to 8 years there has been a "rasterization vs. ray tracing" [google.com] debate in the game developing and graphics community (with ray tracing in real time in games only being a theoretical pipe dream until recently).

If someone were to make ray tracing feasible, cheap, and practical for either a console or desktop PC, then yes... Home users will need that number crunching as Ray Tracing i

How long will it be before we will see the first motherboards with FPGA emerge?

A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD.

Sure, but no-one's going to do that anyway- if the OP thought that, then he missed the potential of his own idea.

I thought up something similar a few years back, and realised that, yes, the performance would obviously be horribly uncompetitive and pointless if you simply tried to reproduce (e.g.) an x86 chip's circuitry with an FPGA. The obvious idea (or rather, my idea, which I suspect countless other people also figured out independently) is that the FPGA *circuit* implemented in hardware replaces the *

Tens of thousands of blocks, but how many do they spend implementing their CPU cores? Could be using multiple FPGAs or a very very simple CPU core... I'm more intrigued by the blurb about Intel and ARM developing CPU/FPGA chips - could be a lot of fun with (hopefully) a lot lower cost than a Virtex.

"By creating more than 1,000 mini-circuits within the FPGA chip, the researchers effectively turned the chip into a 1,000-core processor - each core working on its own instructions." This is entirely feasible, but the 'cores' would have to be very very simple. Looking at the data sheet for the Xilinx Virtex 6 FPGA, it contains 118,560 Configurable Logic Blocks, each equivalent to four look-up tables and 8 flip-flops. If you wanted to create an 8-bit instruction set processor, it would require at minim
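Taking the numbers in the parent at face value, the per-core budget works out like this (back-of-envelope arithmetic only, using only the figures quoted above):

```python
# Resource budget per core, from the Virtex-6 figures quoted above.
clbs_total = 118_560                 # CLB count from the data sheet, as quoted
cores = 1_000
clbs_per_core = clbs_total // cores  # ~118 CLBs per "core"
luts_per_core = clbs_per_core * 4    # 4 LUTs per CLB, as quoted
ffs_per_core = clbs_per_core * 8     # 8 flip-flops per CLB, as quoted
```

About 118 CLBs is only enough for a very small soft core, which is why the 'cores' really would have to be very simple.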

I agree, hence the "very simple" in my reply. I bet they are extremely limited, but fast. Other brands/models of FPGAs have different definitions of 'complex' - Altera has some pretty smokin' FPGAs, too.

FPGAs can be programmed to emulate any logic hardware (logically, though not usually electrically, so power and timing will not be accurate though the logical results will be identical). Many CPU cores have been rendered as library modules that can be programmed into an FPGA. Put 1,000 of them in your FPGA (or big array of FPGAs in this case) and route them together, and you can claim you have a 1,000-core CPU.
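The reason an FPGA can emulate any logic is that its basic cell is a look-up table: a tiny ROM addressed by the input bits. A toy Python model (purely illustrative - this is not how any real toolchain works) shows how the same block becomes any gate just by changing its contents:

```python
def make_lut(truth_table):
    """A k-input LUT is just a tiny ROM: the inputs form an address
    into a 2^k-entry table, so any k-input Boolean function fits."""
    def lut(*inputs):
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (1 if bit else 0)
        return truth_table[addr]
    return lut

# The same hardware block becomes NAND or XOR purely by reprogramming it.
nand2 = make_lut([1, 1, 1, 0])   # 2-input NAND truth table
xor2  = make_lut([0, 1, 1, 0])   # 2-input XOR truth table
```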

Of course, it takes more than one FPGA chip to do this, so you can't in any sense claim a 1,000-

It may be too late, but perhaps someone could talk with Viva Computing, LLC who now owns the assets of Star Bridge Systems [starbridgesystems.com]. It was not specified in the news release if they also own the intellectual property.

I think this is a great development. I've been using FPGAs in medical imaging for about 15 years. The groups that use the GPUs are getting great performance--definitely--but seeing as how MRI and CT machines are installed and need to run for 10, 15, 20 years, I don't see how the GPUs will survive that time. One large OEM was pushing the GPUs for their architecture, and I can't believe it will be successful if success is measured on the longevity scale. I'm sure the service sales guys will clean up.

Why do GPUs fail? I'm not sure of the exact modes of failure but the amount of heat has got to have something to do with it. FPGAs will run much cooler and in the FLOPS/Watt game, will win.

The drawback with using FPGAs compared to commodity processors is that the FPGA market currently does not support using the bleeding-edge processes that CPUs are manufactured with. Typically a competitively priced FPGA will be at least one generation behind a CPU. In HPC, FPGAs are a plausible improvement, but at a smaller scale the development costs for incorporating custom firmware for an FPGA into an application are significant. It all really rests on what demand is out there for a particular algorithm

Actually, usually FPGAs are on the bleeding edge of manufacturing processes. Intel may have beat everyone to 28/32nm, but expect to see 28nm FPGAs from Altera and Xilinx (from TSMC and/or Samsung) around the same time as 28/32nm ASICs from AMD or nVidia. Intel rolls their own, but everyone else is using the same foundries...

FPGAs are much slower and less efficient and bigger than a dedicated design because even the simplest gate is actually a block that can be controlled to perform many different functions. That block consists of several latches and a complex gate, perhaps a hundred transistors in all, whereas a 2-input nand gate consists of four transistors. So it's 25 times bigger (area), and the distance to the next gate is increased by 5x (linear). The complexity makes the block inherently slower than a simple gate, and th
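Using the parent's rough transistor counts, the overhead works out as follows (illustrative back-of-envelope arithmetic, not figures for any specific process):

```python
import math

lut_block_transistors = 100  # programmable block, per the estimate above
nand2_transistors = 4        # dedicated 2-input NAND gate
area_ratio = lut_block_transistors / nand2_transistors  # 25x the silicon area
linear_ratio = math.sqrt(area_ratio)  # 5x the distance to the next gate
```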

FPGA development is synchronous digital logic design. Verilog and VHDL are hardware description languages; they are not programming languages. Having a software-engineering or programming background does not mean you can simply learn Verilog and start doing FPGA design.

I don't see why an MRI machine processor can't be made fault-tolerant. If a GPU burns out, it could just be disabled and a fault warning indicated - and then the machine can carry on working, even if it does take significantly longer to produce an image. Then you call tech support, they come around, pull the faulty part and slot in a new one. The only concern then is making sure parts are available in twenty years - and I imagine any machine that expensive has to come with a long-term support contract which will oblige the manufacturer to ensure a supply of compatible boards in years to come.

Two things--if there's a failure, then there's a problem. The machines for years used to use military grade hardware. Machines that were designed in 1992, sold in 1994 are still running strong today. Then to cut costs, the OEMs switched to more commodity hardware and they've effectively sucked in uptime since. You make it sound like it's no big deal to call tech support. It is a big deal. To put it in dollar terms, we had a machine go down for technically 4 hours. The tech was there, made the diagnos

FPGAs have so much more overhead both in space and power due to programmability, whereas GPUs are pure processing. Further the algorithms necessary for CT and MRI are practically the same algorithms GPUs were designed for, so if you were to use an FPGA, your design would end up with a similar architecture anyway. Further, while low end commercial GPUs (like those you and I use for gaming), may only last 3-4 years, the high end scientific computin

I am serious and you are wrong. I don't have a clear idea what you mean about space and power due to programmability. FPGAs are soft coded hardware. If by the nature of being able to code it and change it you mean "overhead" then fine. But even with that overhead, they are still more efficient. You might be thinking of raw speed instead of FLOPS/Watt.

From "A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N^2) Gravitational N-body Simulation"

What's most surprising is that the research was on matrix dot products, something that graphics cards do in 3D operations. The FPGA beat the graphics card at its own game in both performance and performance per watt.

I'm impressed. Perhaps we'll see graphics cards made up of nothing but programmable FPGAs in the future. Instead of loading and running a CUDA kernel we'll be loading and running an FPGA core.

A $3 Million MRI machine can't afford to have 10 $100 redundant backup GPUs inside it? Of course commodity hardware isn't medical grade. Anyone trying to shove an off the shelf GTX 580 WTF FTW suck-my-balls-off edition card into such an expensive device is cutting some huge corners instead of requesting industrial/medical grade units from any of the potential manufacturers. So what if that part costs $50k and is equivalently powerful as a $50 card at Best Buy.

Not all problems map well to current GPU offerings. I have a problem that would benefit from parallel processing, but due to a branchy algorithm and very random access for read/write, I can't really take advantage of GPUs to the extent some algorithms can (note: I have coded and run it on GPUs, so this is more than just theory; additionally I have coded it to run on a network of computers, and unfortunately the calc-time vs. network-transmission-time ratio for each cycle is not favorable enough for that to

What are the practical differences between targeting an FPGA on a computing platform and targeting more ubiquitous massively-parallel programmable pipelines in modern GPUs? Also, what are the fundamental differences? Could my GPU already contain FPGAs?

The main difference is that you don't program FPGAs. You do synchronous digital logic design which is implemented in the FPGA fabric. Thinking that you can program them like you program a sequential-execution processor is a recipe for failure. And, yeah, C-to-gates tools are a joke.

The researchers then used the chip to process an algorithm which is central to the MPEG movie format – used in YouTube videos – at a speed of five gigabytes per second: around 20 times faster than current top-end desktop computers.

20x speed is getting closer to what I need before I can even ATTEMPT to build my very own Holodeck.

Indeed. 1000 simple CPUs will fit in an FPGA, though it might require one near the top of the line. (e.g., PicoBlaze reportedly needs 96 "slices" and 1.5 "block RAMs"; the biggest Virtex-7 FPGA has more than 1400x as many block RAMs and 3100x as many "slices".) There's little doubt that you could program a DCT for a PicoBlaze, if you wanted to.
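Running those PicoBlaze ratios through (using only the headroom figures quoted above, which are rough to begin with):

```python
# How many PicoBlaze-class cores fit, using the ratios quoted above.
slice_headroom = 3100  # Virtex-7 slices / slices needed per PicoBlaze
bram_headroom = 1400   # Virtex-7 block RAMs / block RAMs per PicoBlaze
max_cores = min(slice_headroom, bram_headroom)  # block RAM runs out first
```

So block RAM, not logic, would be the binding constraint - roughly 1400 cores before the instruction memory runs out.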

It's hard to tell what 5.0GBps refers to -- the bitrate of the incoming, uncompressed, RGB video data? If so, that's maybe about 800FPS of 1080P video. In a circa
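Assuming 5 GB/s really does refer to uncompressed 24-bit RGB input, the frame-rate arithmetic looks like this:

```python
bytes_per_frame = 1920 * 1080 * 3   # one uncompressed 24-bit RGB 1080p frame
throughput = 5 * 10**9              # 5 GB/s, as quoted in the article
fps = throughput / bytes_per_frame  # roughly 800 frames per second
```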

We first need to break the lock of the x86 instruction set and the operating systems that require it. CPUs already try to execute multiple x86 instructions in parallel, but this is severely limited by the sequential instruction set design. There needs to be a way to express computations A and B using different sets of virtual registers and let hardware execute them sequentially or in parallel depending on its capabilities, or vectorize/parallelize multiple iterations of a loop. If software, including operating systems

There's a reason why embedded devices use ARM over x86. The x86 instruction set has a lot of instructions that no compilers (and therefore hardly anyone) ever use. Those unused instructions are just sitting there in the silicon, charged up with electrons, draining power, generating heat, and making it harder to create smaller & faster x86 chips. Some of these "deprecated" instructions are microcoded, but that just means they're slower and even less likely to be used by an optimizin

Sigh. Multi-way branching was already old when ARM implemented it. What you fail to explain (understand?) is that there is a cost associated with either choice. As with most of engineering there is not a simple proposal that wins. In the case that branch prediction is perfect, the predicted execution is cheaper. In the case that the prediction is terrible the multi-way execution wins. In real life branch prediction is neither perfect, nor is it that terrible, so engineers have to balance the likelihood that
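The trade-off the parent describes can be written down directly. With illustrative cycle counts (not real figures for any particular CPU), the break-even prediction accuracy is:

```python
def expected_speculative_cost(p_correct, hit_cost, miss_cost):
    """Expected cycles for a predicted branch: cheap when the predictor
    is right, a full pipeline-flush penalty when it is wrong."""
    return p_correct * hit_cost + (1 - p_correct) * miss_cost

def break_even_accuracy(hit_cost, miss_cost, multiway_cost):
    """Prediction accuracy above which speculation beats executing both paths."""
    return (miss_cost - multiway_cost) / (miss_cost - hit_cost)
```

With, say, a 1-cycle hit, 15-cycle misprediction penalty, and 4 cycles to execute both paths, speculation wins once the predictor is right more than about 79% of the time - which is exactly the balancing act the engineers face.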

Well.... no. A few percent is a small deal. A larger percentage would be a bigger deal.

You've made a critical failure here: the x86 *instruction* cache stores x86 instructions after they've been decoded into simpler RISCy form

Yes - after they've been transferred across the bottleneck from memory. So at the point where it matters (the cache fetching lines from memory) the code is in a dense form because of the CISC encoding.

It's really quite simple: RISC is an advantage where the cost of decoding dominates because it simplifies the decoder circuitry. CISC is an advantage where the cost of transferring instructions (and the space that they occup

Admittedly slightly tangential to your discussion of virtual machines... but part of the point of Intel's IA64 instruction set was to address this kind of thing. The compiler's job was to specify groups of instructions that could be executed safely in parallel, then the CPU would execute these according to its capabilities.

But a higher-level virtual instruction set with just-in-time compilation is admittedly more insulated against future technology and more amenable to the code being run on a variety of a

A programmable hardware platform would provide amazing computing power because of hardware specialization: rather than emulating a proper CPU, you would download core architecture into the FPGA to accelerate tasks such as REGEX processing or H.264 decoding. You could compile the entire logic of a program into a gate array with various logical operators and flip-flop circuits for unlimited (albeit slow) registers (L2 registers) as well as including standard registers and SRAM cache (L1).

Although the FPGA runs slower than a regular CPU, direct programming rather than instructional programming (that is, logic blocks that perform programmatic functions, rather than logic blocks that interpret discrete instructions to carry out programmatic functions) would shorten the overall hardware logic path. In short, the chip would take fewer clock cycles and instead just "do things." The CPU would be slow, but optimized for your workload. The main performance bottleneck would be the context switch: replacing the logic gate configuration with a new program every time you switch. Other than that, dynamic program expansion could be used: inlining operations like multiplication, addition, etc., or breaking them out if space constraints make it hard to load the whole program onto the FPGA that way.

The obvious, major issue we see is, of course, a security issue. You can now reprogram the CPU. This makes it difficult to prevent a program from bypassing any and all hardware security measures. This is solved by implementing a completely new security design on the chip, by which the CPU itself (the FPGA) is under control of external security mechanisms (paging etc handled in the MMU, outside the FPGA space, would largely mitigate most of this); it's not impossible to deal with, it's just an issue that needs to be raised.

In short, this sucks for "download the new Intel CPU into your BIOS/bootloader." This sucks for whatever general purpose CPU you can think of. For an entirely new programmatic platform, however, this would provide some interesting performance possibilities, and some interesting challenges.
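On the regex-acceleration point above: the standard hardware trick is one flip-flop per NFA state, with every state advancing in parallel each cycle. The bit-parallel Shift-And algorithm is the software analogue of exactly that structure - a sketch for plain substring patterns only, where each bit of `state` plays the role of one flip-flop:

```python
def shift_and_match(pattern, text):
    """Bit-parallel Shift-And matcher. Every pattern position advances in
    parallel per input character, mirroring an FPGA regex engine where each
    NFA state is a flip-flop. Returns start offsets of all matches."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits
```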

So here is the basic problem. If the target application is made of steps that exist as specialised circuits in the CPU then selecting which of those circuits to apply in sequence will be faster than a generic circuit because the specialised circuit uses the space on the die more effectively and is clocked at a much higher frequency.

If the target application is made of steps which are very unlike the circuits provided on the CPU, then the generic design will win. For everything in between it is a trade-off. Not that many things win as FPGA designs, and there are ten years of literature showing only marginal improvements.

Encryption involves a lot of things that are faster in dedicated hardware, because it takes a single clock cycle to do things that take 30,000 clock cycles on the CPU.

Regex calculation, faster in a specialized hardware chip.

Codec decoding, we use an off-board CPU that has a microkernel and a small program; it benefits from just not running an OS and being a dedicated RISC processor, but in no other way.

GPU, specialized instruction set. Not dedicated to a specific task, but dedicated to a type of task. WAY faster t

It's odd that you pick crypto as I've spent a little time implementing crypto primitives on weird and exotic hardware. Sure - division is quite slow, that is why most primitives avoid the need for it, or only perform reductions in a specialised field rather than a full division. Multiplication on the other hand is fast and tends to be used a lot.

AES is quite a bad example for FPGAs. The very latest AES extensions from Intel can compute a round of AES in under three clock cycles. Performing the full cipher takes less than twenty clock cycles (on a processor running in excess of 3 GHz). No FPGA in the world can keep up with that performance.
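That claim is easy to sanity-check from the figures quoted above (illustrative arithmetic only; real throughput depends on pipelining and the cipher mode):

```python
clock_hz = 3 * 10**9   # "in excess of 3 GHz", per the comment above
cycles_per_block = 20  # full cipher in under twenty cycles, as quoted
block_bytes = 16       # AES block size
throughput_gb_s = clock_hz / cycles_per_block * block_bytes / 10**9  # ~2.4 GB/s
```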

"AES Extensions" means that Intel put a dedicated instruction pipeline in the processor to compute AES. That means you now have a specialized purpose hardware encryption chipset built into your CPU, tada. Just like an FPU.

Try the same Intel CPU with IA-32 instructions implementing AES, you won't do the whole cipher in 20 cycles. If you implement the exact same instruction architecture on an FPGA, it'll run at the slower clock of the FPGA, but still do it in 20 cycles. This means when you want to run

Your original point was that a reconfigurable processor would be more efficient at most tasks than a specialised processor, and that the big issue would be handling security. Why resort to car analogies when your entire argument can be summarised so concisely?

I have made a simple enough argument that you seem to keep missing - while it is a nice theory that we can reconfigure chips to be more efficient for a particular task the actual practice doesn't live up to expectations. Reconfigurable architectures ha

My point is everyone responding when this was first posted had this idea that you can just "reconfigure your FPGA to be a new Intel CPU" by some magic, and it'll work better. This is a dumb and short-sighted idea; if you have reconfigurable hardware, you have the ability to ad-hoc create specialized gate hardware rather than run software on generic instruction set architectures.

As for lard-cycles, somebody pointed out modern FPGAs clock at 1.5GHz; I'm more interested in what someone else said about the l

I could see this being used by a driver model. A generic driver is present that is able to reprogram the FPGA. Specialized or even derived drivers use the - now static - set of functionality. This could allow you to create general-purpose CPUs that can still be tweaked for certain tasks. It would also allow for upgrades of the algorithms being implemented. Symmetric cryptography and encoding/decoding would be obvious choices.

If updating the FPGA is really slow, I would not try and let applications change t

I had the same idea, but from my time playing with them at university, I remember that they have a very limited number of write cycles. You can reprogram them enough to do your development and bugfixes, but you can't reprogram them every time you run a different type of application.
Well, that is the case now. I suppose if someone can prove that it will make consumers happy, the engineers can find a way to increase the write limit.

Software developers have barely figured out how to write single threaded algorithms without crashing. Now we are seeing more multithreaded algorithms with race conditions, deadlocks and other data-sharing bugs.

Can you imagine what will happen if every desktop machine has one or two FPGAs available for programs to use as needed?

PHB says "Hey, I've heard that you can make the program faster if you program custom hardware on the motherboard's FPGA. Get the new intern to write some FPGA code for our algorithm

Sorry, but to make parallelism painless, you have to restrict the language in ways that make a lot of other things painful.

A language where every method call is a perfect closure is easily made parallel; the only question left is what granularity of parallelism will produce a gain once you consider the overhead of managing threads. It also introduces a lot of overhead for constantly copying data on methods you are not going to be making parallel, rendering it slower for some to many applications when comp

The ultimate end to this trend is to build a system that is just core processing logic, with logic and memory all fused as closely as possible. I call it the BitGrid... it consists of 4-bit look-up tables hooked into an orthogonal grid. Because every single table can be used simultaneously, there is no von Neumann bottleneck to worry about.

You've just described the FPGA. Large areas of an FPGA are devoted to thousands of almost-identical functional blocks ("slices" in xilinx parlance). For instance, in one Xilinx family, a slice contains a 4-input LUT, a flip-flop (1 bit of memory, called an FF), and other specific gates that help implement things like carry chains, shift registers, and some 5+input functions the chip designers thought were commonly encountered.
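A slice is simple enough to model in a few lines. Here's a toy version (hypothetical, ignoring the carry chains and extra gates mentioned above), configured as the carry-out of a full adder:

```python
class Slice:
    """Toy model of one FPGA slice: a 4-input LUT feeding a flip-flop.
    On a real FPGA, every slice in the fabric clocks simultaneously."""
    def __init__(self, truth_table):
        self.table = truth_table  # 16 entries: one output bit per input combo
        self.q = 0                # the flip-flop's stored bit

    def clock(self, a, b, c, d):
        addr = (a << 3) | (b << 2) | (c << 1) | d
        self.q = self.table[addr]
        return self.q

# Configure the LUT as majority(a, b, c): the carry-out of a full adder.
# Input d is simply ignored by this function.
carry_table = [1 if ((addr >> 3) & 1) + ((addr >> 2) & 1) + ((addr >> 1) & 1) >= 2
               else 0
               for addr in range(16)]
carry = Slice(carry_table)
```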

I think this is fantastic that a 1000-core processor is in development.
I hate to be the devil's advocate, but at what point will Amdahl's Law take hold fully, making the addition of more cores to a processor a fruitless endeavor?

Depends on the problem(s) and the processor design, which is entirely the point of why a 1k core CPU is a big deal. If you can have enough independent problems or programs running all the time and design a system that lessens contention and fighting for resources, Amdahl's Law can be avoided almost indefinitely. It won't last forever of course.
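For reference, the law itself is a one-liner, and it bites hard at 1,000 cores unless almost everything parallelises:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: overall speedup when only `parallel_fraction`
    of the work can run on all `cores` at once."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)
```

Even a 95%-parallel workload tops out below 20x on 1,000 cores - which is why independent problems (throughput, not latency) are the way such a chip earns its keep.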