The idea: massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size.

The innovation also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores’ locally stored copies of globally accessible data remain up to date.

In today’s chips, all the cores — typically somewhere between two and six — are connected by a single wire, called a bus. When two cores need to communicate, they’re granted exclusive access to the bus. But that approach won’t work as the core count mounts: Cores will spend all their time waiting for the bus to free up, rather than performing computations.

In a network-on-chip, each core is connected only to those immediately adjacent to it. “You can reach your neighbors really quickly,” says Bhavya Daya, an MIT graduate student in electrical engineering and computer science, and first author on the new paper. “You can also have multiple paths to your destination. So if you’re going way across, rather than having one congested path, you could have multiple ones.”
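The "multiple paths" point can be made concrete: on a 2-D mesh, the number of distinct shortest routes between two cores grows combinatorially with distance. Below is a small illustrative sketch (the 6×6 layout and coordinate scheme are assumptions, not taken from the paper):

```python
from math import comb

def mesh_hops_and_paths(src, dst):
    """For two cores at grid coordinates src and dst on a 2-D mesh,
    return the Manhattan hop count and the number of distinct
    shortest paths between them (toy model, hypothetical 6x6 mesh)."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    hops = dx + dy             # each hop moves one core closer
    paths = comb(dx + dy, dx)  # choose which hops go in the x direction
    return hops, paths

# Going "way across" the chip, corner to corner on a 6x6 mesh:
print(mesh_hops_and_paths((0, 0), (5, 5)))  # (10, 252)
```

So a corner-to-corner message has 252 equally short routes to choose from, which is why one congested path need not stall traffic.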

Maintaining cache coherence

One advantage of a bus, however, is that it makes it easier to maintain cache coherence. Every core on a chip has its own cache, a local, high-speed memory bank in which it stores frequently used data. As it performs computations, it updates the data in its cache, and every so often, it undertakes the relatively time-consuming chore of shipping the data back to main memory.

But what happens if another core needs the data before it’s been shipped? Most chips address this question with a protocol called “snoopy,” because it involves snooping on other cores’ communications. When a core needs a particular chunk of data, it broadcasts a request to all the other cores, and whichever one has the data ships it back.
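The broadcast-and-respond behavior described above can be sketched in a few lines. This is a deliberately simplified model (cache states, invalidation, and the fallback to main memory are all omitted, and the data layout is invented for illustration):

```python
def snoop_request(cores, requester_id, address):
    """Toy snoopy lookup: the requester broadcasts an address to every
    other core; whichever core holds that line in its cache ships the
    data back.  Real protocols also track line state and invalidate
    stale copies, which this sketch omits."""
    for core_id, cache in cores.items():
        if core_id != requester_id and address in cache:
            return core_id, cache[address]  # the owner responds
    return None, None  # no core has it; would fall back to main memory

# Hypothetical caches: core 2 holds the line core 1 needs.
caches = {0: {0x100: "A"}, 1: {}, 2: {0x200: "B"}}
print(snoop_request(caches, 1, 0x200))  # (2, 'B')
```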

If all the cores share a bus, then when one of them receives a data request, it knows that it’s the most recent request that’s been issued. Similarly, when the requesting core gets data back, it knows that it’s the most recent version of the data.

But in a network-on-chip, data is flying everywhere, and packets will frequently arrive at different cores in different sequences. The implicit ordering that the snoopy protocol relies on breaks down.

The MIT researchers solve this problem by equipping their chips with a second network, which shadows the first. The circuits connected to this network are very simple: All they can do is declare that their associated cores have sent requests for data over the main network. But precisely because those declarations are so simple, nodes in the shadow network can combine them and pass them on without incurring delays.

Groups of declarations reach the routers associated with the cores at discrete intervals — intervals corresponding to the time it takes to pass from one end of the shadow network to another. Each router can thus tabulate exactly how many requests were issued during which interval, and by which other cores. The requests themselves may still take a while to arrive, but their recipients know that they’ve been issued.
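The tabulation step can be modeled simply: because every router eventually sees the same stream of declarations, every router computes the same per-interval tally. A minimal sketch (the data representation is an assumption for illustration, not the actual SCORPIO encoding):

```python
from collections import defaultdict

def tabulate_declarations(declarations):
    """Tally shadow-network declarations per interval (toy model).
    `declarations` is a list of (interval, core_id) pairs.  Every
    router sees the same declarations, so all routers agree on
    exactly which cores issued requests during which interval."""
    tally = defaultdict(list)
    for interval, core_id in declarations:
        tally[interval].append(core_id)
    return dict(tally)

# Cores 1 and 10 declare in interval 0; core 32 in interval 1.
decls = [(0, 1), (0, 10), (1, 32)]
print(tabulate_declarations(decls))  # {0: [1, 10], 1: [32]}
```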

During each interval, the chip’s 36 cores are given different, hierarchical priorities. Say, for instance, that during one interval, both core 1 and core 10 issue requests, but core 1 has a higher priority. Core 32’s router may receive core 10’s request well before it receives core 1’s, but it will hold core 10’s request until it has passed along core 1’s.


This hierarchical ordering simulates the chronological ordering of requests sent over a bus, so the snoopy protocol still works. The hierarchy is shuffled during every interval, however, to ensure that in the long run, all the cores receive equal weight.
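The two mechanisms just described — ordering an interval's requests by priority rather than arrival time, then reshuffling the hierarchy — can be sketched as follows (a toy model of the idea, not the actual SCORPIO router logic; the rotation scheme is an assumption standing in for whatever shuffle the hardware uses):

```python
def order_interval(arrived, priority):
    """Order one interval's requests by that interval's core
    priorities, regardless of the order in which they arrived.
    `priority` lists core IDs from highest to lowest priority."""
    rank = {core: i for i, core in enumerate(priority)}
    return sorted(arrived, key=lambda core: rank[core])

def rotate_priority(priority):
    """Reshuffle (here: rotate) the hierarchy each interval so that,
    in the long run, every core gets the top slot equally often."""
    return priority[1:] + priority[:1]

priority = list(range(36))                 # core 0 highest this interval
print(order_interval([10, 1], priority))   # [1, 10]: core 1 goes first
priority = rotate_priority(priority)       # next interval: core 1 on top
```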

The new architecture, called SCORPIO, will be open-source.

Abstract of International Symposium on Computer Architecture presentation

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs.
We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered.
The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36- and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively.
The SCORPIO architecture is incorporated in an 11 mm-by-13 mm chip prototype, fabricated in IBM 45 nm SOI technology, comprising 36 Freescale e200 Power Architecture™ cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers.
The chip prototype achieves a post-synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19% of tile power.

Comments (11)

It’s sad that it took so long. My original Amiga 1000 from 1986 had three or four processors – Motorola 68000, Blitter, Copper, ?? – all of them capable of running separate programs and returning independent results. Recall that the Amiga was generally about 15 to 20 years ahead of the PC – full 32-bit architecture, real-time OS, multitasking, multithreaded – and those extra processors made a HUGE difference in performance. One example: formatting a drive – floppy or HD – on a PC or Mac bogged the whole system down. On the Amiga, you could simultaneously format up to four floppies and an HD and be running a bunch of apps with minimal slowdown, because the jobs were handed off to co-processors. The PC, because of its open architecture, became locked into a model of having one main processor and one math processor plus any of thousands of often incompatible plug-in boards that used a 16-bit bus and required their own drivers, which might or might not play together and had to be written generically for all the models of PC.

My 1991 Amiga 3000 had 9 processors native and I could have added more via expansion cards that directly accessed the main system bus. It was FAST!

Commodore demonstrated around 1992 a transputer-based computer running the Helios OS that allowed you to just keep adding processors. I think that they claimed something like 400 MHz operations. And of course there was the BeBox. I think that the design allowed for 8 processors.

P.S.
There is a company called Tilera that has multicore chips. If memory serves me correctly, they believe that they can get over a thousand cores on one device. Put Tilera in a search engine, and you can read more.

The whole thing will be apocalyptic only if we make it so. What is the alternative? To kill off all of the scientists and engineers?
Hans Moravec rightly said that most of us have trouble with math because our evolution did not require it. A full understanding of how the brain works should remedy this problem. Full mind-machine transfer of knowledge will then be possible, leveling out the playing field, and true communism – not the crap of the Soviet Union – will come to pass. In the meantime, no apology should be made for automating hazardous work, and I remember when no one thought that keypunching would cause carpal tunnel syndrome.
By the way Gordon, you never contacted me. Suffice it to say, you used to be a nice guy.

I still think that we can avoid the apocalypse if the people can own the robots.

Or at least, the department of Housing and Urban Development can own the design for self-replicating robots that can build houses in the desert that are made of stone quarried by those robots from the mountains.

Are you guys paying attention? These networked cores will make the robot pink-slip apocalypse come even sooner than the news in that previous article titled “Ultrasonic waves allow for precision micro- and nano-manufacturing of thin-film chips.”

All of these new developments will synergize.

Don’t any of you forget what our friend Ray told us about the hockey stick curve and that the “exponent will have an exponent.”

Nice graphic. Something I would call a “logic model.” That seems indeed the way clients and networks do work. The trick here, I think, relative to this news article, is that somebody figured out a suitable material architecture for this (kind of thinking) to work/run on a multi-core silicon chip. That in itself is nice work.

Wow, you theorized this some time ago… for DARPA, no less? And a random MIT professor just happened to notice your awesome work and decided to base his new chip architecture on your findings? That’s a highly plausible story, seriously.