The Interconnect Bottleneck

With communications playing a crucial role in the design and performance of multi-core SoCs, various interconnect structures have been proposed as promising solutions to simplify and optimize SoC design.

However, sometimes things don’t go as planned and the interconnect becomes the bottleneck.

“Under high utilization cases the DRAM will be over-constrained with requests from all the active agents in the system,” said Neil Parris, senior product manager in the Systems and Software Group at ARM. “Many SoC architects may over-provision the interconnect to ensure that it is never the bottleneck to memory. Of course this over-provisioning may add unnecessary cost in the form of wires, gates, area and power, so it’s important to configure and size this appropriately.”

Parris noted the interconnect configuration includes obvious factors such as bus width and frequency, but there are also tradeoffs to be made on the number and type of interfaces and scaling the size of internal transaction trackers and snoop filters.
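
As a rough illustration of the sizing question Parris describes, the sketch below computes the peak bandwidth of a link from its bus width and clock frequency, then compares it with the bandwidth demanded of it. The function names, the one-transfer-per-cycle assumption and all of the numbers are hypothetical and are not taken from any ARM product.

```python
def peak_bandwidth_gbs(bus_width_bits: int, freq_mhz: float) -> float:
    """Peak bandwidth in GB/s, assuming one data transfer per clock cycle."""
    bytes_per_cycle = bus_width_bits / 8
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


def utilization(demand_gbs: float, bus_width_bits: int, freq_mhz: float) -> float:
    """Fraction of the link's peak bandwidth that the demand would consume."""
    return demand_gbs / peak_bandwidth_gbs(bus_width_bits, freq_mhz)


# Hypothetical link: 128 bits wide at 800 MHz, asked to carry 9.6 GB/s.
link_peak = peak_bandwidth_gbs(128, 800)
link_util = utilization(9.6, 128, 800)
print(f"peak {link_peak:.1f} GB/s, utilization {link_util:.0%}")
```

A utilization far below 1.0 suggests the link may be over-provisioned, while a value near or above 1.0 suggests it could become the bottleneck. Real interconnects rarely sustain their peak, so this is an upper bound rather than a prediction.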

Drew Wingard, CTO of Sonics, has seen a similar trend. “A lot of the SoC activity in the last 15 years has been pretty focused on consumer-facing systems, including mobile phones. In those systems there has been a relentless push for integrating more features and integrating those features at reasonable cost, and that always requires sharing important resources on the devices. So instead of having 50 chips inside your phone or TV, you’re trying to have one ideally.”

The two resources that are hardest to share are the general-purpose microprocessor and the memory.

“The approach to addressing that has been to go multicore so that you’re not really sharing one, you’re sharing a cluster of them, and the smart software decides which thing is going to run on which one, at which time,” Wingard said. On top of that there is memory, particularly external memory. He said a very strong case can be made that the biggest cost benefit to integration comes from sharing the external memory so there doesn’t have to be three separate DRAM ports — there’s one.

“That means the interconnect’s job on those systems tends to be about funneling traffic to and from that external memory,” he said. “When you design an interconnect around the protocol that’s available at the edge of the microprocessor, which was the early way that most SoCs did it, you’ll find out that protocol is not usually very friendly to getting the highest memory system performance. You need to be able to do things like reorder stuff. You’ve got to have lots of transactions outstanding. You’ve got to be able to understand the differences between the page and bank behaviors of DRAM, because the throughput and latency characteristics of memory are very dependent upon those things. When we see a system where the interconnect is the bottleneck, the most common answer to the next question, ‘How is it the bottleneck?’ is, ‘I’m not getting enough memory system performance.’”
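
To make the point about reordering and DRAM page behavior concrete, here is a minimal sketch of a row-buffer model. Each bank keeps one row open; a request to the open row is a cheap "hit" and a request to any other row is an expensive "miss." The cycle counts, function names and request pattern are assumptions chosen for illustration, and the model ignores bank-level parallelism, refresh and ordering constraints between dependent requests.

```python
def service_cycles(requests, hit_cycles=4, miss_cycles=20):
    """Total cycles to serve (bank, row) requests one at a time, in order."""
    open_row = {}  # bank -> row currently held in that bank's row buffer
    total = 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            total += hit_cycles       # row already open: page hit
        else:
            total += miss_cycles      # close the old row and open the new one
            open_row[bank] = row
    return total


def reorder_by_row(requests):
    """Group requests that target the same bank and row (stable sort)."""
    return sorted(requests, key=lambda r: (r[0], r[1]))


# Two agents ping-ponging between rows 1 and 2 of bank 0.
pattern = [(0, 1), (0, 2), (0, 1), (0, 2), (0, 1), (0, 2)]
in_order = service_cycles(pattern)                    # every access misses
reordered = service_cycles(reorder_by_row(pattern))   # two misses, four hits
print(in_order, reordered)
```

With these assumed timings the in-order stream takes 120 cycles and the reordered stream takes 56. An interconnect that can neither keep several transactions outstanding nor reorder them gives the memory controller no chance to find the second schedule.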

This is SoC architectural design at its most fundamental level—performance, power and area.

“At a high level, if you think about the different interconnects in the cache coherent portion, and depending on the functions in the SoC, there are either internal caches or L2 cache,” said Pat Sheridan, director of product marketing for virtual prototyping at Synopsys. “If the information they need is available in the cache, then you don’t have to go to the DRAM, so you can have a shorter latency from the cache access. There are lots of architecture decisions around the optimization of the sizes of caches, and the way that you set up the snooping or directory based cache coherency. There are tradeoffs there, too, depending on how many masters there are that are cache coherent versus stuff that can observe what’s in the cache but maybe can’t participate; and then there are non-coherent ones. All that’s in the front end of the interconnect. And then as you get closer to the memory controller, it comes down to methodologies at the system level.”

Topology choices can create bottlenecks in physical design, as well. “If you’re not careful about the interconnect topologies and technologies you’re using, of course you can end up with bottlenecks in terms of the place and route,” said Wingard. “The total connectivity level of these SoCs is very high, so while the bulk data transfers all tend to happen through external shared memory, there are important communication paths between different blocks and the designs tend to be highly connected. If you’re not careful, and let’s say you decided to try to build that whole thing flat almost like a crossbar switch, of course you would die in the total number of wires associated with that. You’d end up with designs that are almost impossible to place and route. By using technologies where you essentially build deeper networks and you’re more careful about how you shape things and share things, you keep the network out of that problem.”
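
The wiring argument can be sketched with simple counting. The example below compares the number of point-to-point links in a flat crossbar with a two-level network in which initiators are grouped into clusters that share uplinks to the targets. The cluster arrangement, the function names and the sizes are hypothetical simplifications, not a description of any Sonics product.

```python
import math


def crossbar_links(initiators: int, targets: int) -> int:
    """Flat crossbar: a dedicated link from every initiator to every target."""
    return initiators * targets


def clustered_links(initiators: int, targets: int, cluster_size: int) -> int:
    """Two-level network: each initiator has one link to its cluster switch,
    and each cluster switch has one shared link to every target."""
    clusters = math.ceil(initiators / cluster_size)
    return initiators + clusters * targets


def wires(links: int, width_bits: int) -> int:
    """Data wires only; control and handshake signals are ignored."""
    return links * width_bits


# Hypothetical SoC: 32 initiators, 16 targets, 128-bit data paths.
flat = wires(crossbar_links(32, 16), 128)
deep = wires(clustered_links(32, 16, cluster_size=8), 128)
print(flat, deep)
```

For these assumed sizes the flat crossbar needs 65,536 data wires against 12,288 for the clustered network. The saving comes from sharing uplinks, which is also where contention can appear, so the deeper network trades wires for arbitration.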

Related to this, the interconnect can become a bottleneck in terms of timing convergence. Its job is to hook up all the components to each other, which means, abstractly, that the interconnect has to manage the longest wires on the SoC other than clock and reset, he pointed out. “We’ve got specialized tools in our flows for managing the fact that the clock has so many loads on it, and the fact that reset has so many loads on it. But in the interconnect, there isn’t necessarily a specialized approach for doing that or automation in the flows. As a result, many designers report that their on-chip fabrics end up having timing closure issues.”

Avoiding interconnect roadblocks
To plan for the issues with interconnects, as detailed above, Parris suggested there is a tradeoff that system architects can make between interconnect area and performance, and this activity should include benchmarking. “This would start with simple traffic generators to measure peak bandwidth and latency, but extend later to micro-benchmarks running on processors and even full benchmarks and use-cases. The goal of this activity is to validate the configuration of the system.”
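
The first step Parris mentions, a simple traffic generator, can be sketched as a toy simulation. The example below injects fixed-size requests at a regular interval into a single target that serves one request at a time, then reports average latency and achieved bandwidth. The single-server model, the function name and every parameter value are assumptions made for illustration; a real flow would drive an RTL or virtual-prototype model of the interconnect instead.

```python
def run_traffic(num_requests, inject_interval, service_cycles, bytes_per_request=64):
    """Return (average latency in cycles, achieved bandwidth in bytes/cycle)."""
    server_free_at = 0   # cycle at which the target can accept the next request
    total_latency = 0
    finish = 0
    for i in range(num_requests):
        issue = i * inject_interval           # generator issues at a fixed rate
        start = max(issue, server_free_at)    # wait if the target is still busy
        finish = start + service_cycles
        server_free_at = finish
        total_latency += finish - issue
    avg_latency = total_latency / num_requests
    bandwidth = num_requests * bytes_per_request / finish
    return avg_latency, bandwidth


# Light load: one request every 10 cycles, each taking 8 cycles to serve.
light = run_traffic(100, inject_interval=10, service_cycles=8)
# Heavy load: one request every 4 cycles, faster than the target can serve.
heavy = run_traffic(100, inject_interval=4, service_cycles=8)
print(light, heavy)
```

Under the light load the latency stays at the 8-cycle service time. Under the heavy load the bandwidth saturates at 8 bytes per cycle while the average latency climbs to 206 cycles as requests queue up, which is the signature of a saturated path that benchmarking is meant to expose.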

Further, Wingard said, there are techniques that can be used to simplify this. “You can add retiming stages, which is a tradeoff of latency and area, versus ease of timing closure. But a lot of people don’t like slowing down the latency of their system in order to make it work. Other approaches involve trying to use more intelligent protocols inside the network so that you can rework things, like flow control to try to allow the design to run at a higher frequency even with long wire delays.”
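
The retiming tradeoff Wingard describes can be approximated with a few lines of arithmetic. The sketch below splits a long wire into equal segments separated by registers and reports the resulting maximum clock frequency and end-to-end latency. The wire delay, the per-register overhead and the function name are hypothetical, and the model assumes the wire divides evenly, which physical layout may not allow.

```python
def pipelined_link(wire_delay_ns, stages, reg_overhead_ns=0.2):
    """Return (fmax in MHz, latency in cycles, latency in ns) for a wire
    broken into stages + 1 equal segments by retiming registers."""
    segments = stages + 1
    # Each cycle must cover one wire segment plus register setup/clock-to-q.
    period_ns = wire_delay_ns / segments + reg_overhead_ns
    fmax_mhz = 1000.0 / period_ns
    latency_ns = segments * period_ns
    return fmax_mhz, segments, latency_ns


# Hypothetical 3 ns cross-chip wire, with zero and with two retiming stages.
for stages in (0, 2):
    fmax, cycles, latency = pipelined_link(3.0, stages)
    print(f"{stages} stages: {fmax:.1f} MHz, {cycles} cycles, {latency:.1f} ns")
```

With these assumed numbers, two retiming stages raise the achievable clock from about 312 MHz to about 833 MHz, while the latency grows from one cycle to three and from 3.2 ns to 3.6 ns. That is the cost designers are reluctant to pay, and each stage also adds a register per wire, which is the area side of the tradeoff.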

New datacenter, new challenges
Looking ahead, Wingard sees the datacenter as another place where the interconnect can be a bottleneck. “There is a new class of SoCs that are becoming more interesting. A lot of people discuss what the requirements are for the new datacenter where we’re kind of beyond the place where just adding more processors is going to help us from a throughput energy budget perspective. Because of this we are seeing a renaissance of systems companies doing chip designs that are focused on trying to reduce the power budget for their datacenter.”

Here, it’s really about performance at a certain amount of power, so the SoCs are more specialized. But a lot of those architectures also tend to be highly parallelized.

“If you go to the literature on computer networks, it’s not about a model in which all the interesting traffic is going to external memory,” said Wingard. “There’s a lot more peer-to-peer traffic because the nodes they put into those networks have a fair amount of memory and a lot of times it’s explicitly managed as sharable memory, i.e., my processor in my node can talk to the memory inside your node. Suddenly that does change the traffic patterns inside the SoC a lot and therefore it changes the design in the network a lot. As we look at some of the AMBA-ish kinds of fabrics that have been used, a lot of times they don’t scale well into that world because they’ve been optimized around the fact that everything’s talking to memory basically.”

At the end of the day, Parris asserted that interconnect architecture will also evolve and improve in parallel with processors.

“A correctly configured interconnect should not be the bottleneck,” he concluded.