Posted
by
samzenpus
on Monday June 23, 2014 @07:52AM
from the we-need-another-core dept.

rtoz writes: The more cores — or processing units — a computer chip has, the bigger the problem of communication between cores becomes. For years, Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, has argued that the massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size. This week, at the International Symposium on Computer Architecture, Peh's group unveiled a 36-core chip that features just such a "network-on-chip." In addition to implementing many of the group's earlier ideas, it also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores' locally stored copies of globally accessible data remain up to date.

To really run Crysis well, you'd probably need something like the GeForce GTX Titan, which has 896 double-precision cores. However, if you ray trace the graphics, you might be able to run it on a 72-core Knights Landing chip.

The core count isn't the interesting thing about this chip. The cores themselves are pretty boring off-the-shelf parts too. I was at the ISCA presentation about this last week and it's actually pretty interesting. I'd recommend reading the paper (linked to from the press release) rather than the press release, because the press release is up to MIT's press department's usual standards (i.e. completely content-free and focusing on totally the wrong thing). The cool stuff is in the interconnect, which uses the bounded latency of the longest path (its hop count multiplied by the single-cycle one-hop delivery time) to define an ordering, allowing you to implement a sequentially consistent view of memory relatively cheaply.
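
Very roughly, and this is my paraphrase of the general idea rather than the paper's exact mechanism: if every hop takes exactly one cycle and the longest path through the mesh is known, then any message injected more than that many cycles ago is guaranteed to have been delivered, so every core can commit messages in injection-timestamp order without funnelling them through a central arbiter. A toy sketch in C (the hop bound and the message struct are made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_HOPS 10  /* assumed bound on the longest path, e.g. (6-1)+(6-1) in a 6x6 mesh */

    struct msg {
        uint64_t inject_cycle;          /* cycle at which the sender injected the message */
        int      src, dst, payload;
    };

    /* With single-cycle hops, a message injected at cycle t has definitely been
     * delivered by cycle t + MAX_HOPS, so once that bound has elapsed every core
     * can safely order it by its injection timestamp without further handshakes. */
    static int safe_to_commit(const struct msg *m, uint64_t now)
    {
        return now >= m->inject_cycle + MAX_HOPS;
    }

    int main(void)
    {
        struct msg m = { .inject_cycle = 100, .src = 0, .dst = 35, .payload = 42 };
        printf("cycle 105: %s\n", safe_to_commit(&m, 105) ? "commit" : "wait");
        printf("cycle 112: %s\n", safe_to_commit(&m, 112) ? "commit" : "wait");
        return 0;
    }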

Since I'm here, I'll also throw out a plug for the work we presented at ISCA, The CHERI capability model: Revisiting RISC in an age of risk [cam.ac.uk]. We've now open-sourced (as a code dump, public VCS coming soon) our (64-bit) MIPS softcore, which is the basis for the experimentation in CHERI. It boots FreeBSD, and there are a few instances sitting around the place that we can ssh into and use. This is pretty nice for experimentation, because it takes about 2 hours to produce and boot a new revision of the CPU.

Yes, we've also released the generated Verilog for anyone who wants to use just that. If you're a university, you can easily get a free license for Bluespec. If you're not, then you either most likely don't have the resources to get a decent FPGA (the ones that can run a processor at a usable speed start at about $3K), or you can probably afford the license. We're also talking to Bluespec about open sourcing their compiler, as most of their real value is from other services on top of it, but that's like

So what's special about this chip that Intel's Xeon Phi (whose 80-or-so-core research precursor was first demonstrated back in 2007) isn't already doing? Or is this just a rehash of 7-year-old technology that's already in production? It sounds like a copy/paste of Intel's research.

"Intel's research chip has 80 cores, or "tiles," Rattner said. Each tile has a computing element and a router, allowing it to crunch data individually and transport that data to neighboring tiles." - Feb 11, 2007

Yes, as usual, the MIT press release oversells the research, while the original paper [pdf] [mit.edu] is a bit more careful in its claims. The paper makes clear that the novel contribution isn't the idea of putting "little internets" (as the press release calls them) on a chip, but acknowledges that there is already a lot of research in the area of on-chip routing between cores. The paper's contribution is to propose a new cache coherence scheme which they claim has scalability advantages over existing schemes.

The paper's contribution is to propose a new cache coherence scheme which they claim has scalability advantages over existing schemes.

Somehow this was obvious to me even from the press release. I've never yet seen details of an ordering model laid bare where it wasn't the core novelty. Ordering models are inherently substantive. Ordering models beget theorems. Cute little Internets drool and coo.

It does seem rather similar - a large cluster of cores, laid out in a grid topology. Perhaps they're doing something different with the cache coherency? I couldn't find too much on how Intel's handling that, and it seems to be a focus of the articles on this chip.

I would be curious to know more about the architecture and all-around chip specs they are using in their prototype: clock speed, memory interface, etc. The article states they are developing a version of Linux to test it on, so it's safe to say it's an established architecture. Anyway, I am excited to see the results once they have tested it on Linux. While this does not help with the density per core problem, perhaps it will help extend Moore's Law from the perspective of speed increase in respect to micro

So, in one die, it's a little interesting, though GPU stream processors and Intel's Phi would seem to suggest this is not that novel. The latter even let's you ssh in and see the core count for yourself in a very familiar way (though it's not exactly the easiest of devices to manage, it's still a very much real world example of how this isn't new to the world).

The 'not all cores are connected' idea is even older. In the commodity space, HyperTransport and QPI can be used to construct topologies that are not fu

The basic idea isn't new. What the paper is really claiming is new is their particular cache coherence scheme, which (to quote from the Conclusion) "supports global ordering of requests on a mesh network by decoupling the message delivery from the ordering", making it "able to address key coherence scalability concerns".

How novel and useful that is I don't know, because it's really a more specialist contribution than the headline claims, to be evaluated by people who are experts in multicore cache coherence schemes.

Some knowledge about multicore cache coherence here. You are completely right, Slashdot's summary does not introduce any novel idea. In fact, a cache-coherent mesh-based multicore system with one router associated with each core was brought to market years ago by a startup out of MIT, Tilera [tilera.com]. Also, the article claims that today's cores are connected by a single shared bus -- that's far outdated, since most processors today employ some form of switched communication (an arbitrated ring, a single crossbar, a mesh of routers, etc).

What the actual ISCA paper [mit.edu] presents is a novel mechanism to guarantee total ordering on a distributed network. Essentially, when your network is distributed (i.e., not a single shared bus, which is basically every current on-chip network), there are several problems with guaranteeing ordering: i) it is really hard to provide a global ordering of messages (like a bus does) without making all messages cross a single centralized point, which becomes a bottleneck, and ii) if you employ adaptive routing, it is impossible to provide point-to-point ordering of messages.

Coherence messages are divided into different classes in order to prevent deadlock. Depending on the coherence protocol implementation, messages of certain classes need to be delivered in order between the same pair of endpoints, and for this, some of the virtual networks can require static routing (e.g. Dimension-Ordered Routing in a mesh). Note that a "virtual network" is a subset of the network resources which is used by the different classes of coherence messages to prevent deadlock. This is a remedy for the second problem. However, a network that provided global ordering would allow for potentially huge simplifications of the coherence mechanisms, since many races would disappear (the devil is in the details), and a snoopy mechanism would be possible -- as they implement. Additionally, this might also impact the consistency model. In fact, their model implements sequential consistency, which is the most restrictive -- yet simple to reason about -- consistency model.
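
For anyone unfamiliar with the term, Dimension-Ordered ("XY") Routing in a mesh just sends every packet along the X dimension until it reaches the destination's column, then along Y. Because the route between any given pair of endpoints is fixed, two messages between the same endpoints can never overtake each other, which is what provides the point-to-point ordering mentioned above. A minimal sketch (the 6x6 mesh size is my assumption, to match a 36-core part):

    #include <stdio.h>

    #define COLS 6  /* assumed 6x6 mesh */

    /* One step of X-then-Y (dimension-ordered) routing: returns the next node id
     * on the deterministic path from cur to dst. */
    static int xy_next_hop(int cur, int dst)
    {
        int cx = cur % COLS, cy = cur / COLS;
        int dx = dst % COLS, dy = dst / COLS;

        if (cx != dx)                               /* first correct the X coordinate */
            return cy * COLS + cx + (dx > cx ? 1 : -1);
        if (cy != dy)                               /* then correct the Y coordinate */
            return (cy + (dy > cy ? 1 : -1)) * COLS + cx;
        return cur;                                 /* already at the destination */
    }

    int main(void)
    {
        int node = 0, dst = 21;                     /* walk from core 0 to core 21 */
        while (node != dst) {
            printf("%d -> ", node);
            node = xy_next_hop(node, dst);
        }
        printf("%d\n", node);
        return 0;
    }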

Disclaimer: I am not affiliated with their research group, and in fact, I have not read the paper in detail.

While adding an extra core or two made for big jumps in performance (because you are almost always running at least two applications), there comes a point where most users won't see a performance boost. While I may now be able to throw 36 processors at a problem, someone still has to program all those cores to work together. Right now that's a lot of effort, and until programming languages catch up and can optimize code by making it massively parallel, this is going to be a non-starter.

You could go ahead and run both paths of code, then decide which one is correct and discard the unused results.

Intel is already doing partial speculative execution in the case of conditional branches. The pipeline is filled with the predicted path, which is then frequently executed out of order (before the condition is known).

Intel is not, however, doing the full concept you have described (eager speculative execution), and I don't think it's likely that they ever will. The best case for eager speculative execution would be when the branches are completely unpredictable, which is only very rarely true. Further, it r
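
To make the contrast concrete: eager execution is essentially the hardware version of computing both sides of a branch and then keeping one result, which compilers already approximate in software with branchless code (typically a conditional move). A trivial, illustration-only example in C:

    #include <stdio.h>

    /* Branchy version: the CPU has to predict which side will be taken. */
    static int max_branchy(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    /* "Eager" version: both candidate results are available, then one is selected.
     * Compilers usually turn this into a conditional move with no branch to
     * mispredict; hardware eager execution would run both paths outright. */
    static int max_eager(int a, int b)
    {
        int take_a = (a > b);
        return take_a * a + (1 - take_a) * b;
    }

    int main(void)
    {
        printf("%d %d\n", max_branchy(3, 7), max_eager(3, 7));
        return 0;
    }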

Cache coherency has been one of the banes of multicore architecture for years. It's nice to see a different approach, but chip manufacturers are already getting high-performance results without introducing additional complexity. The Oracle (Sun) SPARC T5 [oracle.com] architecture has 16 cores with 128 threads running at 3.6GHz. It gives a few more years to Solaris at least, but it's still a hell of a processor. For you Intel fans, the E7-2790 v2 [intel.com] sports 15 cores with 30 threads and a 37.5MB cache, so they're doing something right: it screams and is capable of 85GB/s memory throughput.

I'm sure the chip architects are looking at this research but somehow I think they're already ahead of the curve because these kinds of cores/threads are jumps ahead of where we were just a few years ago. Anybody remember the first Pentium Dual Core [wikipedia.org] and The UltraSparc T1 [wikipedia.org]?

High "thread" count cores are good for work loads where there is little inter-thread communication and has lots of memory stalls. By having a lot of threads running at once, whenever there is a memory stall, you can just switch to another thread, and the chance of that thread being stalled is very low. This also means lots more cache thrashing, so you need larger caches, but they can be tuned for high-throughput high-latency. The entire design for these cpus is geared for high-throughput high-latency, which

Oh, no question, high thread counts would make sense for, say, a web service application server vs. something more compute-intensive. None of these architectures will ever be in the teraflop or petaflop range for that, so there will still be a need for specialization of highly compute-intensive workloads to those kinds of systems. One thing that will kill this architecture is software compatibility, so it'd be interesting to see if it does take off. In the meantime Moore's law will keep pushing the Sparc and

Parallel processing has made big strides, but only in some limited areas: graphics rendering, where each pixel can be updated independently of the other pixels; fluid mechanics (CFD) using time-marching techniques, where updating the solution at one point needs data from only a limited set of neighbors; or iterative matrix solvers. Even for something very structured without if statements, like inverting a matrix, parallel methods have struggled.

The basic problem is this: even if just 5% of the work has to be serial, the maximum speedup is 20x; that is the theoretical maximum. YMMV, and it does. The Internet and search have opened up another vast area where a thread can do lots of work and send just a very small set of results back to the caller. Hits are so small compared to misses that you can make some headway. Even then we have found very few applications suitable for massively parallel solutions.
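
For reference, that 20x ceiling is just Amdahl's law with a serial fraction s = 0.05:

    S(N) = \frac{1}{s + (1 - s)/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s} = \frac{1}{0.05} = 20

No matter how many cores N you throw at the problem, the serial 5% caps the speedup at 20x.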

We need a big breakthrough. If you divide a 3D domain into a number of subdomains, the interfaces between the subdomains are 2D. The volume of the 3D domain represents computational load, and the interface areas represent the communication load. If we could come up with domain-division algorithms that guarantee the interfaces would be an order of magnitude smaller, even as we go from 3D to a higher number of dimensions, and if we could organize these subdomains into hierarchies, we would be able to deploy more and more computational work and be confident the communication load would not overwhelm the algorithm.
This breakthrough is yet to come. Delaunay tessellations (and their dual, Voronoi polygons) have been defined in higher dimensions. But the ratio of the number of "cells" to the number of "vertices" explodes in higher dimensions; the last time we tried, we could not even fit a 10-dimensional mesh of 10 points into all the available memory of the machine. It did not look promising.
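
To put rough numbers on the surface-to-volume argument: cut the domain into cube-like subdomains of side h. In d dimensions each subdomain has 2d faces of size h^(d-1) to communicate across, against h^d worth of interior work, so

    \frac{\text{communication}}{\text{computation}} \sim \frac{2d\,h^{d-1}}{h^{d}} = \frac{2d}{h}

Shrinking the subdomains, or raising the dimension, pushes that ratio the wrong way, which is exactly the wall described above.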

Lol, so says the guy with no clue. The Transputers were way ahead of what we do these days. And the first thing I thought when I saw the MIT concept was: "oh, they have put 32 transputers on a single die".

Transputers were built by a company called INMOS.

About 90% of the military hardware in Europe (around 1990/1995) was running on transputers.

That means radar systems, flight control, avionic hardware etc.

INMOS went down because the Japanese wanted to buy it. But the French government intervened and prevente

People are still doing it, or did you not get what this article is about? Or did you never catch any hint (given in this thread and many others) about http://www.greenarraychips.com... [greenarraychips.com] ?

Actually, the transputer had a few good kernels of an idea: a sea of loosely interconnected processors, each with local memory. However, the actual execution wasn't that good, and the only real market was embedded military signal-processing systems. For a while, INMOS attempted to chase workstation graphics, but eventually they got killed by the i860 (which is sad, as it wasn't a very good implementation of any idea either, but happened to have better price/performance than the transputer for floating point) which of course

This is a nice little trick. It has the potential to extend shared consistent-memory multiprocessor designs to far larger numbers of processors. Whether this is a performance win remains to be seen. Good idea, though. Note that the prototype chip is just a feasibility test; they used an off-the-shelf Power CPU design, added their interconnect network, and sent the job off to a fab. A production chip would have optimizations this one does not.

I don't see what the big deal is. I'm currently working with early silicon on a cache coherent 48-core 64-bit MIPS chip with NUMA support and built-in 40Gbps Ethernet support. The chip also has a lot of extended instructions for encryption and hashing plus a lot of hardware engines for things like zip compression, RAID calculations, regular expression engines and networking support among other things. It also has built-in support for content addressable memory.

According to the comparison table (see the 4:21 mark of this video), this chip uses 1.1V while other standard chips are using 1.0V. This difference may make it hard for chip makers to use this technology.

Really? They won't be able to specify a 1.1V VRM instead of a 1.0V VRM? Those poor, poor chip makers. They sound like a bunch of incompetent fucks.

A higher high/low voltage swing (with a reasonable amount of other stuff being equal) will be more of a thermal nuisance; but if the perks make up for it, that's hardly a dealbreaker. The toasty end of boring desktop CPUs is somewhere north of 200 watts already, with a little shoving that they typically survive, so if somebody really wants 36 cache-coherent cores on-die, they'll suck it up and make it work.
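
Rough numbers, assuming the textbook dynamic-power model P ≈ C V^2 f with capacitance and frequency held equal:

    \frac{P_{1.1\,\mathrm{V}}}{P_{1.0\,\mathrm{V}}} = \left(\frac{1.1}{1.0}\right)^{2} = 1.21

So the 1.1V part pays roughly 21% more switching power, a thermal annoyance rather than a showstopper.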

For applications that don't specifically demand that, I'd be interested to know how the costs and benefits of 'dealing with the cooling demands of a smaller number of denser parts' compare with 'dealing with the cooling demands of more, cooler, parts, closer to whatever the performance per watt sweet spot is; but with more cabling, PSUs, switches, and similar interconnect and support stuff to buy and power'...

Post Netburst, AMD is the one having TDP issues, and their current enthusiast-gamer-nutjob CPU is specced at 220 watts. Intel has their numbers down from the Prescott Pentium D days, though the use of 'TDP' rather than peak, and thermal throttling that actually works, makes it a little tricky to pin a precise ceiling value on some of them without actually getting out the test equipment.

Most are, of course, much lower, given the popularity of laptops and desktops that don't need water cooling, and so on.

and their current enthusiast-gamer-nutjob CPU is specced at 220 watts.

I'll admit, the AMD FX was the only line I didn't check before posting. Their next closest chips are only 140W, and they've only got a couple at that. Most are 115W or lower. I didn't even know the AM3 socket was capable of 220W.

Based on the mixed reviews, it sounds like 220W is really pushing your luck unless the motherboard has some heroically overqualified VRM onboard and your PSU is descended from an arc welder on its mother's side; but I've yet to see a single report of somebody actually fusing a pin rather than just crashing a lot, so apparently the socket is tougher than it looks. I was very surprised to see such a part being sold at that power level, though, rather than just 'unlocked, and we'll just look the other way'.

Why do people with zero actual semiconductor knowledge try to speak as an authority*?!

It's a research chip, meaning they don't need to be on the latest process node to show their proof of concept. Larger nodes (much cheaper to design a chip on) have thicker gate passivation layers and run at higher voltage. From an architecture standpoint the process node/voltage are irrelevant. So if their architecture proves out, some bigger outfit can run with it while targeting the latest-greatest itty-bitty process node to increase the clock rate, drop the power, and reduce the area/price.

*I am not a processor designer, just a mixed signal (mostly analog) guy, but I've been working in the semiconductor industry, including doing process bake-offs for over a dozen years.

According to the comparison table (see the 4:21 mark of this video [youtube.com]), this chip uses 1.1V while other standard chips are using 1.0V. This difference may make it hard for chip makers to use this technology.

Whilst I have my foot to the floor... I still think it's a failure of science - there's nothing wrong with doing both simultaneously - to believe otherwise would be to buy into a rhetorical device based on "false opposites."

There are still technical challenges to increasing clock speed. Just because "IBM said it would" doesn't make it so. Instead you are seeing higher IPC due to architectural refinements as well as more and more cores. Clock speeds are still inching up but do not expect any huge radical jumps anytime soon.

Nope, liquid nitrogen cooling gets you past the speed limits. How about over 8GHz [youtube.com] on a chip that costs less than $200? Going to helium, you can get over 8.5GHz [youtube.com], although both become a bit unwieldy when it comes to game play because I don't want my hard drives to freeze. I love that last video; there's some real country-boy engineering there, including using a propane torch and a hair dryer to keep certain components from freezing.

Nope, liquid nitrogen cooling gets you past the speed limits. How about over 8GHz [youtube.com] on a chip that costs less than $200? Going to helium, you can get over 8.5GHz [youtube.com], although both become a bit unwieldy when it comes to game play because I don't want my hard drives to freeze. I love that last video; there's some real country-boy engineering there, including using a propane torch and a hair dryer to keep certain components from freezing.

I'm a little confused as to why you're citing the chip's low low price of "less than $200" if you need liquid nitrogen to get it to perform the way you want it to. You do realize that cooling systems cost money, too...right? There's no point in being able to use a cheap processor to get to X performance benchmark if the required additional support systems cost thousands of dollars more than a more powerful and more expensive processor that can do it out of the box. Not to mention the fact that liquid nitrogen cooling isn't exactly hassle-free, especially in a household environment. And it's worth noting that even if you boost GHz, you eventually run into choke points related to pushing data to and from the chip anyway. You can give the most important worker on an assembly line all the crystal meth they can eat, but they can't work any faster than the conveyor belt in front of them.

Some of us run better-than-off-the-shelf liquid cooling, with no hassles and for less than 300 bucks. I have a nice system and it's quiet because I can run the big fans. Sure, liquid nitrogen systems are available, but the OP was about the rev-up process stopping; since 8GHz is now possible, the barrier needs to be set higher. I don't think we'll see it anytime within the next five years, but maybe.

Some of us run better-than-off-the-shelf liquid cooling, with no hassles and for less than 300 bucks. I have a nice system and it's quiet because I can run the big fans. Sure, liquid nitrogen systems are available, but the OP was about the rev-up process stopping; since 8GHz is now possible, the barrier needs to be set higher. I don't think we'll see it anytime within the next five years, but maybe.

Yeah, but Intel and AMD will go bankrupt if they make chips just for "some of us." And if you look at where Intel has gotten their speed increases, very little of it in the past decade has been from clock speed. Ghz is no longer where the performance boost is to be found.

GHz is king because not all workloads are multithreaded enough to take advantage of multiple cores/threads. Eventually software engineers will catch up and start leveraging what the architecture provides. I'd bet that 8 out of 10 COTS packages out there, at least in the desktop arena, don't take advantage of multithreading.

Liquid Nitrogen/Helium cooling is great... while it lasts. When it's used up however, you've got to pay for another bottle of cooling. I have no idea how long a $200 CPU can run at 8GHz on a bottle of Nitrogen, or how much a bottle of Nitrogen costs, but I can't imagine it's a good long term solution.

Nitrogen overclocking is done for contests. You can get phase change cooling, which is the next best thing and will still get your processor far below zero. The big downside to that is just power consumption. It's also bulky and noisy.

Using liquid helium would be way cost prohibitive, especially for a very small gain (8.5 GHz vs. 8 GHz in GP's post). Under a fairly good contract and purchased in relatively large quantities, our current cost of LHe is slightly less than $11/liter.

Liquid nitrogen is a different story. It's still expensive compared to a cooling fan, but we pay ~$0.40/liter for LN2. If you were doing a lot of this, the "standard" tank sizes are 160 or 180 L, one such tank carefully managed should last a bare minimum of 1-2 w
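
Taking those figures at face value: a 180 L tank at about $0.40/L works out to 180 x 0.40 = $72 of LN2, while the same 180 L of LHe at roughly $11/L would be on the order of $2,000, which is why nobody seriously proposes helium outside of bragging-rights overclocking runs.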

I didn't really feel the need to go there, but I hear you. Considering that my job relies on keeping the superconducting coils in our magnets at ~4K, I'm all too aware of this. Our prices have nearly doubled in the last few years, and there have been a handful of supply scares.

Well then since our Helium reserves come from Oil and Natural Gas drilling, all I can say is Drill baby Drill!

When I started TIG welding in the 70s, Helium tanks were about $30/bottle which was still expensive considering a mortgage for a decent home was $300. Now all I use is Argon which is a bitch when trying to weld overhead.

A better analogy is that they keep adding seats and making the whole vehicle slower.

Kawasaki Ninja == 10GHz single core (fastest way to get anywhere alone)
Ford Mustang == 4GHz quad-core (most people only use the front two seats, but if desperate you can squeeze more people in)
Chevy Suburban == 3.3 GHz 8-core (it seems like everyone wants one, but most people who have a full load just have a bunch of little kiddies)
Mercedes Sprinter == 2.7 GHz 12-core (just meant to be a grinding people hauler)
School Bus ==

The Core 2 Duo is approximately 2x faster clock-for-clock than the Pentium 4 [techreport.com], and the current Haswell core is barely 40% faster than that (assuming a 7% per-clock speedup for every core revision since). That gets you somewhere in the 2x-3x performance improvement range for Haswell, barring corner cases where it's embarrassingly easy to leverage AVX/FMA (most real-world use cases show small improvements).
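
Working out the arithmetic behind that estimate: roughly five core revisions at ~7% per clock each compounds to 1.07^5 ≈ 1.40, and 2 x 1.40 ≈ 2.8x over the Pentium 4, which is how you land in the quoted 2x-3x range.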

Intel proved that they could do a whole lot better than the Pentium 4, but your performance

Here is an example of one of the world's most optimized pieces of software: x264. It's also one of the few real-world loads that can take advantage of multiple processors and SSE. So how much speedup did this incredible piece of software see with AVX2, which DOUBLED the width of the integer pipelines?

Now why would I want to do that? Obviously, for FP tasks, a modified design would be necessary - but given that the GA144 is already unbeatable in integer performance energy efficiency, even at the 180 nm node where it's being manufactured, if you extend the ISA to be more FP-friendly and switch to a recent process, I don't see a problem. Well, it would need different memory interfaces to make it a shared-memory multiprocessor. That's a bummer. But I guess it can't be helped; programmers are lazy.

Maybe you don't want to do that, but good floating point performance is a requirement for a lot of useful tasks. Also, many real world tasks need access to large amounts of memory, and often that memory needs to be available to multiple nodes. The GA144 fails there too, since it has a pitiful amount of memory. Except for a small handful of niche applications that happen to match the GA144's capabilities, it's a useless device.

It's the notion (asynchronous, self-clocked, energy efficient chip, maximizing performance per watt and performance per mm^2) that matters to me, not this specific design (which is intended for specific purposes). Witness how the HPC people embraced GPUs, which are sort of heading in a similar direction already.

My point exactly. What is a simple task on a modern Intel becomes nearly impossible on the GA144. We've already tried the idea of combining large numbers of simple processors, and it has failed every single time. If NxM simple cores together can't beat a modern Intel processor for a range of useful tasks, there's not much point in developing it.

Not just 'tried to', but actually delivered. The Thinking Machines CM-1 and CM-2 had routers on chip with 32 CPUs per chip. In the hypercube architecture, the 5 lowest bits of the CPU address routed on-chip, and the rest of the CPU address routed between chips. It worked quite well, and it was the fastest computer on the planet for several years running.
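
In other words, the routing decision was just a bit-slice of the destination CPU address, something like this (my reconstruction of the address split described above, not actual CM code):

    #include <stdio.h>

    /* CM-1/CM-2 style address split: with 32 CPUs per chip, the low 5 bits of a
     * destination CPU address pick the CPU within the chip, and the remaining
     * bits pick which chip to route to over the hypercube links. */
    #define ON_CHIP_BITS 5
    #define ON_CHIP_MASK ((1u << ON_CHIP_BITS) - 1)   /* 0x1F */

    int main(void)
    {
        unsigned cpu = 1234;                           /* an arbitrary CPU address */
        unsigned local = cpu & ON_CHIP_MASK;           /* CPU within the chip */
        unsigned chip  = cpu >> ON_CHIP_BITS;          /* chip within the hypercube */
        printf("cpu %u -> chip %u, local cpu %u\n", cpu, chip, local);
        return 0;
    }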

The reason Apple stuck with the Power architecture for so long was because IBM promised them quad-core and greater chips running at 8 GHz, air-cooled, by 2005. Needless to say, they didn't even come close to delivering. It was that failure that led Apple to switch to x86.

Regarding your PowerPC comment specifically: this was from an IBM research department that makes a processor other than the PowerPC – I believe they were used for z/OS units or the like. I use it as an operational example ONLY as, as Mr Spock says: "that which HAS happened CAN happen", and it is therefore a possibility.

And hopefully in any lectures on Moore's Law, the students learn that Moore's Law refers to transistors on a die, not the speed of the chips. This 36-core chip probably jumps ahead of Moore's Law a bit, as it's got to be a fairly large die. In any event Moore's Law continues to hold, more or less. Other things like CPU speed have followed a similar trend in times past, but no longer do.

And hopefully in any lectures on Moore's Law, the students learn that Moore's Law refers to transistors on a die, not the speed of the chips. This 36-core chip probably jumps ahead of Moore's Law a bit, as it's got to be a fairly large die.

Moore's Law refers to the number of components per integrated circuit for minimum cost. Note that this is basically transistor density and is not impacted by core size. Silicon defects and transistor size determine the optimal number of components per IC.

A quote from Wikipedia:

Moore himself wrote only about the density of components, "a component being a transistor, resistor, diode or capacitor,"[26] at minimum cost.

Do you know what I'm going to do? I'm going to go out and get a shirt printed up with the expression “I Heart Processor Ghz” and wear it at parties! Why? Every time this crops up at meetings that I've attended there is always someone who loses their temper at the mere mention of anyone developing a faster processor, irrespective of how many cores or cache size and I don't like it! The physics has been done! I'd swear, if anyone ever does a THz processor, and one of these kids finds out, they'll

It's of course good if the distance between cores is kept to a minimum, but if the software designers and compilers consider the limitations when generating the binaries, it may not be a huge performance bottleneck in real-world applications.

It's better to switch to a new core than to switch tasks on a core, for example. Looking at what happens in a modern PC, most processing is mostly unrelated to the rest. Even inside a web browser you may have several plugins running in different parts of the screen, but t
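
On Linux you can already steer work this way by pinning a thread to a particular core, so a task keeps its warm caches instead of being migrated around; a minimal, Linux-specific sketch (the helper name is mine):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core so the scheduler won't migrate it and
     * its working set stays in that core's caches. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_core(2) != 0)
            fprintf(stderr, "could not set affinity\n");
        printf("running on core %d\n", sched_getcpu());
        return 0;
    }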

That's a fun post! 36 cores is immense! As an aside: it's been a while since we've seen any decent rise in processor GHz. I remember IBM talking about functioning, reasonably cool 10 GHz processors (ref needed) in the early 2000s, but no one has them in the shops yet! I'm sure this was discussed in Moore's Law lectures prior to Y2K, but mention it these days and everyone scowls! So some people can (and they run cool) and some people can't; what normally happens in computing when the faster parts are released?