Understanding Bandwidth and Latency

Increasing Bandwidth

In the preceding sections, I touched on only a few of the techniques that systems use to make more efficient use of available bandwidth. There are many more that could be covered, data prefetching for instance, but since I've discussed most of them here and there in previous articles, I'll leave them aside for now. In this section, we'll switch gears and explore some techniques for adding more bandwidth to the bus itself.

A faster bus

The ideal way to get more bus bandwidth is to increase the speed of the bus. Increasing the bus speed adds more down beats (or "hooks", in our analogy) per second to the bus. More down beats per second means more opportunities per second for sending out code and data. Thus doubling a bus's clock speed also doubles its theoretical peak bandwidth.

More beats per second also means that each bus cycle translates into fewer CPU cycles, which means that from the CPU's perspective RAM looks "closer," since the CPU spends fewer of its own cycles waiting to get its requests filled. And of course decreasing the amount of time the CPU has to wait for data is the primary goal of memory subsystem designers.
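To put some rough numbers on all of this, here's a minimal back-of-the-envelope sketch in Python. The clock rates and bus width below are purely illustrative and aren't taken from any particular system:

    # Illustrative numbers only, not from any particular system.
    bus_width_bytes = 8               # 8-byte (64-bit) data bus
    cpu_clock_hz = 2_000_000_000      # 2 GHz CPU

    def peak_bandwidth(bus_clock_hz, width_bytes, transfers_per_clock=1):
        """Theoretical peak = beats per second x bytes per beat."""
        return bus_clock_hz * transfers_per_clock * width_bytes

    for bus_clock_hz in (100_000_000, 200_000_000):   # 100 MHz vs. 200 MHz bus
        bw = peak_bandwidth(bus_clock_hz, bus_width_bytes)
        cpu_cycles_per_bus_cycle = cpu_clock_hz / bus_clock_hz
        print(f"{bus_clock_hz / 1e6:.0f} MHz bus: {bw / 1e6:.0f} MB/s peak, "
              f"{cpu_cycles_per_bus_cycle:.0f} CPU cycles per bus cycle")

    # Doubling the bus clock doubles the theoretical peak bandwidth and halves
    # the number of CPU cycles the processor spends waiting out each bus cycle.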

A wider bus

Another common way to increase bus bandwidth is to increase the width of the bus. If you double a bus's width while keeping its clock speed the same, then although down beats come by at the same rate as before you can place twice as much data on each beat. So for the 8-byte bus + 32-byte cache line system I've been using as an example, if we doubled the bus width to 16 bytes then it would take only two bus cycles to transfer an entire cache line (as opposed to four cycles in the initial example).
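The same kind of quick arithmetic, again just a Python sketch using the running example's numbers, shows how widening the bus cuts the number of bus cycles a cache line fill takes:

    cache_line_bytes = 32

    for bus_width_bytes in (8, 16):        # original width vs. doubled width
        beats = cache_line_bytes // bus_width_bytes   # one beat per bus cycle (SDR)
        print(f"{bus_width_bytes}-byte bus: {beats} bus cycles per cache line")

    # 8-byte bus  -> 4 bus cycles per 32-byte cache line
    # 16-byte bus -> 2 bus cycles per 32-byte cache line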

Though doubling the width of the bus also doubles the theoretical peak bandwidth, it's a less desirable way of doing so than doubling the clock rate. This is because when you double the clock rate the system's initial read latency (measured in CPU cycles) decreases, while when you double the bus width that latency stays the same. So on a system with a faster bus, it takes less CPU time for RAM to return the critical word than it does on a system with a slower but wider bus.

Figure 6: fast bus vs. slow bus

The diagram above illustrates how a faster but narrower bus returns the critical word more quickly to the CPU than a slower but wider one. Since the critical word is the one that the CPU is waiting on at that very moment, it's always best to get the critical word in as fast as possible.

Of course, it should be noted that the above diagram presumes that the RAM attached to the faster bus can actually cough up the critical word in a shorter amount of time than the RAM attached to the slower but wider bus. This only works if the faster system's RAM has a sufficiently short access latency. In some cases, when designers increase the speed of the bus without also using RAM with a lower access latency, it winds up taking more of the shorter cycles to get the critical word in. This is similar to the situation, discussed below, with DDR buses.
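To make that tradeoff concrete, here's a rough model of time-to-critical-word, assuming critical-word-first bursting: the CPU waits out the RAM's access latency plus one bus cycle for the first beat to come across. The latencies and clock rates below are made up for illustration:

    def time_to_critical_word_ns(ram_latency_ns, bus_clock_mhz):
        bus_cycle_ns = 1000 / bus_clock_mhz
        return ram_latency_ns + bus_cycle_ns   # access latency + first beat

    # Same RAM access latency on both systems: the faster, narrower bus wins.
    print(time_to_critical_word_ns(50, 200))   # 55.0 ns (200 MHz, narrow bus)
    print(time_to_critical_word_ns(50, 100))   # 60.0 ns (100 MHz, wide bus)

    # Pair the faster bus with slower RAM and the advantage evaporates.
    print(time_to_critical_word_ns(60, 200))   # 65.0 ns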

For some applications, especially media apps that do lots of data streaming over the bus, the critical word isn't quite so critical in terms of real-world performance. Such systems benefit more from high sustained bandwidth than they do from getting the critical word out quickly, which is why RDRAM works so well with such apps in spite of the fact that it does not do critical word first bursting. In systems designed to run applications where the critical word's latency isn't quite so important, it's cheaper and easier for a system manufacturer to double the bus width than it is to increase the bus frequency. This is why very wide memory buses have become so popular as processor frequencies increase. It's tough to scale memory bus frequency to match ever higher CPU speeds, so system makers compensate by widening the bus.

Doubling the data rate

A relatively easy way to jack up a bus's peak bandwidth is to send data on both the rising and falling edges of the clock. It's much easier to implement such a double data rate (DDR) bus than it is to actually double the clock rate of a bus. So DDR lets you instantly double a bus's peak bandwidth without all the hassle and expense of a higher-frequency bus.

Let's take a look at a DDR bus.

Figure 7: DDR bus

Since this bus can carry data on both the rising and falling edges of the clock, it's able to transfer all four beats of our 32-byte cache line in only two clock cycles, or half the number of clock cycles that the SDR bus in our previous example required.
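In terms of the rough bandwidth sketch above, DDR simply bumps the transfers-per-clock factor from one to two. Again in Python, with the same illustrative numbers:

    bus_width_bytes = 8
    cache_line_bytes = 32
    bus_clock_hz = 100_000_000         # 100 MHz

    def peak_bandwidth(bus_clock_hz, width_bytes, transfers_per_clock):
        return bus_clock_hz * transfers_per_clock * width_bytes

    sdr_peak = peak_bandwidth(bus_clock_hz, bus_width_bytes, transfers_per_clock=1)
    ddr_peak = peak_bandwidth(bus_clock_hz, bus_width_bytes, transfers_per_clock=2)
    print(sdr_peak / 1e6, ddr_peak / 1e6)        # 800.0 vs. 1600.0 MB/s

    beats = cache_line_bytes // bus_width_bytes  # 4 beats either way
    print(beats / 1, beats / 2)                  # 4.0 SDR clocks vs. 2.0 DDR clocks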

Of course, just because DDR allows you to easily double a bus's peak bandwidth, this doesn't mean that the bus's sustained bandwidth is doubled as well. In the following graph, the blue curve represents a 200MHz SDR bus, while the red curve represents a 100MHz DDR bus.

Graph 4: Bandwidth vs. burst length

Notice how the blue curve ramps up much faster than the red curve. This is because the 200MHz SDR bus's read latency is much shorter than that of the 100MHz DDR bus. In other words, doubling the data rate does not affect read latency, which means that it also does not affect the amount of time it takes for the CPU to get the critical word. Take a look at how this works:

Figure 8

The top figure shows a DDR system, while the bottom figure shows an SDR one. The DDR system gets all four beats of the cache line to the CPU faster, but the critical word arrives at exactly the same time as it would on an SDR system with the same bus clock rate. In the end, while DDR signaling affords real increases in peak and sustained bandwidth, it isn't without its shortcomings. The ideal will always be a combination of increased bus frequencies and decreased DRAM access latencies.
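To see roughly where curves like the ones in Graph 4 come from, here's a crude Python model of sustained bandwidth as a function of burst length. It assumes, purely for illustration, that the initial read latency is a fixed number of bus clock cycles, so that latency is shorter in absolute time on the faster-clocked bus; all of the numbers are made up:

    bus_width_bytes = 8
    latency_bus_cycles = 5            # bus cycles from request to first beat (assumed)

    def sustained_mb_per_s(bus_clock_mhz, beats_per_clock, burst_beats):
        bus_cycle_ns = 1000 / bus_clock_mhz
        latency_ns = latency_bus_cycles * bus_cycle_ns
        transfer_ns = burst_beats * bus_cycle_ns / beats_per_clock
        bytes_moved = burst_beats * bus_width_bytes
        return bytes_moved / (latency_ns + transfer_ns) * 1000   # MB/s

    for burst in (4, 16, 64):
        sdr = sustained_mb_per_s(200, beats_per_clock=1, burst_beats=burst)
        ddr = sustained_mb_per_s(100, beats_per_clock=2, burst_beats=burst)
        print(f"{burst:3d} beats: 200 MHz SDR {sdr:4.0f} MB/s, "
              f"100 MHz DDR {ddr:4.0f} MB/s")

    # Both buses approach the same 1600 MB/s peak as bursts get longer, but the
    # SDR bus ramps up faster because its read latency is half as long.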

Conclusions

There are many other factors playing into sustained bandwidth that I haven't covered above. For instance, I haven't really discussed other types of non-data bus traffic (address traffic, cache snoop traffic, command traffic, and so on) that can take up a significant portion of a system's bandwidth and thereby limit the bandwidth left over for the CPU to receive data. In this respect, packet-based bus protocols, like RDRAM's memory bus protocol or the PowerPC 970's frontside bus protocol, are probably the worst offenders. Although such protocols usually run on a much faster bus, command and address traffic can take up a relatively greater fraction of available bus bandwidth than such traffic would on a simpler bus. Nonetheless, the preceding discussion should provide a useful framework for understanding how the specifics of various implementations impact sustained bandwidth and performance.

Revision History

Date        Version   Changes
11/6/2002   1.0       Release
12/4/2002   1.1       Some minor clarifications; I changed the maximum burst lengths on the graphs from a ridiculously long 400 beats to a slightly less ridiculously long 100 beats. (Please note that these are idealized numbers intended to prove a point.)