
Understanding Bandwidth and Latency

Introduction

From the bygone debates over DDR vs. RDRAM to the current controversy over Apple's DDR implementations, one issue is commonly misunderstood in discussions of memory technology: the nature of the relationship between bandwidth and latency. This article aims to give you a basic grasp of the complex and subtle interaction between bandwidth and latency, so that the next time you see bandwidth numbers quoted for a system you'll be able to better understand how those numbers translate into real-world performance.

I've written this article so that the concepts it communicates will be applicable to a wide range of systems and parts of systems: from the frontside and memory buses of current P4 and Athlon systems to the buses in Apple's new XServe. Throughout the article, then, I've deliberately avoided getting mired in the details of specific implementations in hopes that the general concepts will stand out clearly for the reader. The flip side of this simplicity is that for almost every claim I make, a technically savvy reader could probably point out various exceptions, workarounds, and other caveats peculiar to particular systems and data access scenarios. Nonetheless, I hope that the article will contribute to more informed discussions when bandwidth-based comparisons between different systems come up in forums like Ars's own OpenForum.

The theoretical: peak bandwidth

Most technical people have this sense of bus bandwidth as a single number, attached to a line or a bus, that quantifies how much data the line or bus can carry. In this view bandwidth is seen as an intrinsic, fixed property of the transmitting medium itself, a number that's unaffected by the vagaries of the transmitters at either end of the medium.

When people talk about bus bandwidth this way, what they're really describing is only one type of bandwidth: the bus's theoretical peak bandwidth. The peak bandwidth of a bus is the most easily calculated, the largest (read: the most marketing-friendly), and the least relevant bandwidth number that you can use to quantify the amount of data that two components (e.g. the CPU and RAM) can exchange over a given period of time. Product literature almost always cites this theoretical number, which is rarely (if ever) approached in actual practice, whenever it wants to talk about how much bandwidth is available to the system. Let's take a closer look at how this number is calculated and what it represents.

Figure 1

Take a moment to look over the simplified, conceptual diagram above. It shows main memory sending four 8-byte blocks of data to the CPU, with each 8-byte block being sent on the falling edge (or down beat) of the memory clock. Each of these 8-byte blocks is called a word, so the system shown above is sending four words in succession from memory to the CPU.

(Note that this example assumes a 64-bit (or 8-byte) wide memory bus. If the memory bus were narrowed to 32 bits, then it would only transmit 4 bytes on each clock pulse. Likewise, if it were widened to 128 bits then it would send 16 bytes per clock pulse.)

Think of the falling edges or down beats of the memory bus clock as hooks on which the memory can hang a rack of 8 bytes to be carried to the CPU. Since the bus clock is always beating, it's sort of like a conveyor belt with empty hooks coming by once every clock cycle. These empty hooks represent opportunities for transmitting code and data to the CPU, and every time one goes by with nothing on it, that's wasted capacity, or unused bandwidth. Ideally, the system would like to see all of these hooks filled so that all of the bus's available bandwidth is used. However, for reasons I'll explain much later, it can be difficult to keep the bus fully utilized.

In addition to the memory clock, I've included the CPU clock at the top for reference. Note that the CPU clock runs much faster than the bus clock, so that each bus clock cycle corresponds to multiple (in this case about 7.5) CPU clock cycles. Also note that there is no northbridge; the CPU is directly connected to main memory for the sake of simplicity, so for now when I use the term "memory bus" I'm actually talking about this combination frontside bus/memory bus.

In the preceding diagram and in the ones that follow, I've tried to represent the amount of time that it takes for memory to respond to a request from the CPU (or latency) as a distance value. The size of the CPU clock cycles will remain fixed in each diagram, while the number of CPU clock cycles (and, in effect, the distance) separating the CPU from main memory will vary. In depictions of systems where memory takes fewer CPU clock cycles to respond to a request for data, memory will be placed closer to the CPU; and vice versa for systems where memory takes more clock cycles to deliver the goods.
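
To put rough numbers on this "distance," here's a minimal Python sketch; the clock speeds are hypothetical, chosen only to approximate the roughly 7.5:1 ratio in the diagram:

cpu_clock_mhz = 1000  # hypothetical 1GHz CPU clock
bus_clock_mhz = 133   # hypothetical 133MHz memory bus clock

# About 7.5 CPU clock cycles elapse for every bus clock cycle.
ratio = cpu_clock_mhz / bus_clock_mhz

# A memory access that takes 4 bus cycles costs the CPU roughly 30 of its
# own cycles -- this is the "distance" the diagrams represent visually.
bus_cycles_of_latency = 4
cpu_cycles_of_latency = bus_cycles_of_latency * ratio  # ~30 CPU cycles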

Just to show you what I'm talking about, check out the following picture to see how I illustrate a system with a slower bus speed.

Figure 2

The slower, "longer" bus in the above diagram has a lower peak bandwidth than the faster bus, since it delivers fewer 8-byte blocks in a given period of time. You can see this by comparing the number of CPU clock cycles that sit between the CPU and RAM in the first diagram versus this one: since the length of the CPU clock cycles is fixed between the two diagrams, the greater number of CPU cycles separating the CPU and RAM in the second diagram shows that the slower bus takes more time to send data to the CPU.

Now that we've got that straightened out, let's try a bandwidth calculation. If the slow bus runs at 100 million clock cycles per second (100MHz) and it delivers 8 bytes on each clock cycle, then its peak bandwidth is 800 million bytes per second (800 MB/sec). Likewise, if the faster bus runs at 133MHz and delivers 8 bytes per clock cycle, then its peak bandwidth is 1064 MB/sec (or 1.064 GB/sec).

8 bytes * 100MHz = 800 MB/s
8 bytes * 133MHz = 1064 MB/s
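
If you'd like to play with these numbers yourself, here's a minimal Python sketch of the same arithmetic (the function name is just illustrative); it also shows how the bus width from the earlier note factors in:

def peak_bandwidth_mb_s(bus_width_bits, clock_mhz):
    # Theoretical peak: bytes delivered per clock times clocks per second.
    bytes_per_clock = bus_width_bits // 8
    return bytes_per_clock * clock_mhz

print(peak_bandwidth_mb_s(64, 100))   # 800  (the slower bus)
print(peak_bandwidth_mb_s(64, 133))   # 1064 (the faster bus)
print(peak_bandwidth_mb_s(32, 100))   # 400  (halving the width halves it)
print(peak_bandwidth_mb_s(128, 100))  # 1600 (doubling the width doubles it)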

Both of these numbers are theoretical peak bandwidth numbers that characterize the bandwidth of the bus only. Or, to go back to the "hooks" analogy, these numbers simply tell you how many hooks are going by each second. There's quite a bit more to the picture than just the capacity of the bus, though, and once we factor in the capabilities of both the consumer (the CPU) and the producer (the RAM) we'll see that the real-world bandwidth of the system as a whole is usually quite a bit less than the raw capacity of the transmitting medium might at first suggest. The more hooks that go by unfilled (for whatever reason), the more the system's real-world bandwidth is diminished.
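
To see how unfilled hooks eat into the theoretical number, here's one more minimal sketch; the utilization figure is invented purely for illustration:

peak_mb_s = 800              # theoretical peak of the 100MHz, 64-bit bus
hooks_filled_fraction = 0.6  # hypothetical: only 60% of clock edges carry data

# Real-world bandwidth is the peak scaled by how many hooks actually get
# filled -- here, 480 MB/s, well below the number on the box.
effective_mb_s = peak_mb_s * hooks_filled_fraction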

One final thing that I should make clear before we move on is that a complete clock cycle consists of one up beat and one down beat. For reasons that will become apparent when we talk about DDR signaling, in my diagrams I've slightly separated the up and down beats, but you can also imagine them fused together.