This post originally appeared on the Flux7 blog at: http://flux7.com/blogs/benchmarks/littles-law/. It discusses the relationship between throughput and latency summarized by Little’s Law. Little’s Law has applications on any system with throughput and latency, whether it is memory performance or an industrial assembly line. Read the original article below:

I’ve been doing a lot of analysis of latency and throughput recently as a part of benchmarking work on databases. I thought I’d share some insights on how the two are related. For an overview of what these terms mean, check out Aater’s post describing the differences between them here.

In this post we’ll talk about how they each relate to one another.

The basic relationship between latency and throughput within a stable system—one that is not ramping up and down—is defined by a simple equation called Little’s Law:

occupancy = latency x throughput

Latency in this equation means the unloaded latency of the system, plus the queuing time that Aater wrote about in the post referenced above. Occupancy is the number of requestors in the system.

The Water Pipe Revisited

To understand the latency/throughput relationship, let’s go back to Aater’s example of the water pipe.

Occupancy is the amount of water in the pipe through which we are measuring latency and throughput. The throughput is the rate at which water leaves the pipe, and the latency is the time it takes to get water from one end of the pipe to the other.

If we cut the pipe in two, we halve the latency, but throughput remains constant. That’s because by halving the water pipe we’ve also halved the total amount of water in the pipe. On the other hand, if we add another pipe, the latency is unaffected and the throughput is doubled because the amount of water in the system has doubled. Increasing the water pressure also changes the system. If we double the speed at which the water travels we halve the latency, double the throughput, and the occupancy remains constant.

This relationship governs many computing systems, as well as systems we encounter in our daily lives. In computing it can be used to understand peak throughput of a processor, of a database, or of a memory subsystem. In our everyday lives we can also see it in traffic flow. The one constant factor in each of these systems is that they exhibit peak capacity, or occupancy. In a memory subsystem, peak capacity may be the number of fill buffers. In a database, it may be the maximum number of requests that can be serviced concurrently. In the case of traffic, it’s the number of cars that can be on the road at any one time. Regardless of system type, there’s always a physical limit to maximum occupancy.

To understand the behavior of a system with an increasing load, let’s consider an unloaded system. If a transaction comes in, it will take a minimum amount of time, which will be the unloaded system latency. In ideal conditions, the latency remains constant as we ramp up throughput. At the same time, unseen to us, the system occupancy increases in proportion to the throughput. But, once we reach peak capacity around the bottleneck, throughput stops increasing. At this point, additional requests don’t add to the throughput. They simply linger around the bottleneck and increase the average latency caused by queuing delays. Figure 3, illustrates this observed behavior with real data from database benchmarking:

Figure 3: Throughput vs Average Latency

In many cases, the results aren’t as clear as in the table above. That’s because there’s often interference between transactions. This is especially true for highway traffic because cars cause neighboring cars to slow down. At low occupancy on a freeway, there’s very little interference and the average speed is, say, 65 mph. As more cars come onto the road, everyone slow down a bit. Studies show that this system reaches maximum throughput at 45 mph. Adding more cars to the mix doesn’t increase the throughput, but rather slows all of the cars, thereby increasing latency and, by Little’s law, decreasing throughput. Suddenly, you realize that traffic has come to a crawl. Those of you in California probably have noticed how some on-ramps limit the number of cars that can enter the freeway at any one time. That’s an attempt to keep traffic flowing at a high throughput and to keep overall system throughput high. Figure 4, shows this behavior with the same database being benchmarked from a remote location. The unloaded latency is higher but we have the same peak threshold. Due to the higher latency of remote accesses, more concurrent requests were created and resulted in interference. So we see the throughput drop while the latency increases. This is an example of interference caused by having multiple threads. Unfortunately increasing threads increases measurement inaccuracies and we could not collect accurate data with a far higher number of threads.

Figure 4 Request Rate vs Average Latency

So Which Is More Important?

Let’s say that you’re designing a processor and a genie offers to either halve the latency or double the throughput. Which would you choose? The answer is simple—latency. That’s because improving latency improves throughput at the same time. Put another way, if you do things faster you get more things done. But, you really don’t want to design your processor for latency. Why not? Well, let’s say you need a bunch of boxes moved. A normal person can perform the move in four hours but you need the task finished in an hour. So you have two options, find a bodybuilder capable of performing the task in one hour. Alternatively, hire four people and they will be able to perform the same task in one hour. Either way, you sustain the same throughput. But the difference is that it is a lot harder to find the one person that can do the complete task in an hour, rather than use four normal people. So in general it is easier to increase the throughput than to decrease the latency. In computers, when you try to reduce latency you need to put in more hardware, more expensive hardware, and increase complexity.

In processor design, you have to make compromises for each particular subsystem. Sometimes you have to focus on throughput and at other times on latency. But your goal is nearly always to get your task completed faster, which means decreasing the latency of the complete program.

One Man’s Throughput Is Another Man’s Latency

What I think’s interesting, and is probably a major source of confusion for many, is how throughput at one level determines latency at the next higher level. The simplest example of this is an instruction. We can design a processor that wastes no time in latching the results and that executes an instruction in one long cycle. But we don’t because we can get much higher throughput by pipelining the instructions, while causing only a nominal increase in latency due to the additional time taken to latch the results. So, by increasing single-instruction latency we decrease the latency of processing a complete stream of instructions.

A 1-cycle processor versus a pipelined processor is an extreme case, but it represents a real-life compromise. For multiple parallel data streams, you can ignore CPUs that are designed for reducing latency on a single data stream. Instead, you can use a GPU that executes a single data stream at a significantly slower speed, but processes multiple streams in parallel in order to complete the task at a faster overall speed.

Yet, there are higher levels still. If you don’t need to run your operation on a single system, you can use clusters of multiple systems. When using parallelism to increase throughput, you have to ask yourself the same question that you ask when looking at any parallelism problem: What are the data dependencies and how much data will need to be shared?

The relationship between latency and throughput across different levels can be confusing. It’s a reminder that we need to look at each system as a whole. Any system we see in real life will include many subsystems, and there will be complex dependencies between them. Each subsystem will bind to the rest of the system, either by throughput or by latency. When designing a system, it’s important to consider these interactions and then to create a system that won’t bottleneck the larger system. These same rules apply to everything, including supply chains, CPU performance, databases and highway traffic. So, when asking yourself which is fastest, first ask what fast means for you in terms of latency and throughput, and how your system meets these goals.

Hi Sorry, this didn’t get noticed. Both I and Aater are busy with Flux7 and this blog gets ignored more than we’d like. You should check out the blog on Flux7, it is very active and has a lot of very valuable content. Although with more of a focus on the cloud than chip design.

Modeling the latency/throughput of a submodule is significantly easier than doing it for the whole module. It’s really only the submodules that we understand at a fundamental level. The first step is understanding the pipelines. After that you should make some simplifying assumptions about the usecase to make the analysis simpler. Afterwards you need to do your analysis keeping an eye on all three metrics because there may be different limits on any of them. Watch out for any feedback loops (e.g., deallocation of resources in the end of the pipeline needed in the beginning). And then keep on stepping outwards, and repeating the analysis at the next level. It goes without saying that parallel pipes are additive and sequential pipes are taking the minimum values. Keep in mind, that may be several new latencies at points where decisions are made.

Hi Sorry, this didn’t get noticed. Both I and Aater are busy with Flux7 and this blog gets ignored more than we’d like. You should check out the blog on Flux7, it is very active and has a lot of very valuable content. Although with more of a focus on the cloud than chip design.

Modeling the latency/throughput of a submodule is significantly easier than doing it for the whole module. It’s really only the submodules that we understand at a fundamental level. The first step is understanding the pipelines. After that you should make some simplifying assumptions about the usecase to make the analysis simpler. Afterwards you need to do your analysis keeping an eye on all three metrics because there may be different limits on any of them. Watch out for any feedback loops (e.g., deallocation of resources in the end of the pipeline needed in the beginning). And then keep on stepping outwards, and repeating the analysis at the next level. It goes without saying that parallel pipes are additive and sequential pipes are taking the minimum values. Keep in mind, that may be several new latencies at points where decisions are made.

Howdy! This is kind of off topic but I need some
guidance from an established blog. Is it hard
to set up your own blog? I’m not very techincal but I can figure things out pretty quick.
I’m thinking about making my own but I’m not sure where to begin.
Do you have any ideas or suggestions? Thank you

I met this equation during ICL sizing training in 1970 as T=C/E (throughput = concurrency/existencetime).
I still have the T=C/E lapel pin from the course.

In refering to latency both this and the original article fail to distinguish between servicetime and responsetime.

Servicetime is the latency of an operation when there is no contention and therefore no queuing.
If your servicetime is unacceptable then no amount of additional resource will improve it. The only solutuion is to reduce the servicetime by tuning the operation (shortening the pipe) or by using faster resources (increasing the water pressure).

Responsetime is the time taken to complete an operation at a given level of concurrency (resource usage/contention) including the queuing which Queuing Theory tells us is introduced.

If you measure servicetime for an operation along with its load on each component resource, i.e. run a metric measurement test at a concurrency of 1, then from queuing theory it is relatively easy to calculate the responsetime which will be achiveved for a given Concurrency and the Concurrency (and hence resource) which will be required to achieve a given Throughput. This can then be verified by stress tests at increasing levels of concurrency to validate the scalability of the solution (absence of bottlenecks).
The results of the servicetime test should be a basic deliverable from the developer of any transaction/operation.
Performance SLAs should be defineded as Throughput at Responsetimes (latencies) expressed as both expected average and acceptable deviations for multiple percentiles (80,90,95). Most users and indeed IT staff are totally unaware of the increase in resources required to achieve a given responsetime at the 95%tile compared to that required to achieve it at the 80%tile.

This was a cool read for me as I was researching Little’s Law as it applies to business optimization. Really the same thing, just looked at from a slightly different angle… i.e. Any Lean journey strives to minimize waste and increase speed. Increasing speed equates to reducing lead time to your customers. Minimizing waste includes an analysis of inventory on-hand and steps to reduce that inventory. Little’s Law provides an equation for relating Lead Time, Work-in-Process (WIP) and Average Completion Rate (ACR) for any process.