Understanding bufferbloat and the network buffer arms race

If a little salt makes food taste better, then a lot must make it taste great, right? This logic is often applied in the digital domain, too. (My pet peeve is that TV shows and DVDs keep getting darker and darker.) In a similar vein, networks used to buffer a little data, but these buffers have been getting larger and larger and are now getting so big they are actually reducing performance. Long-time technology pundit Bob Cringely even deemed the issue worthy of three of his ten predictions for the new year.

Networks need buffers to function well. Think of a network as a road system where everyone drives at the maximum speed. When the road gets full, there are only two choices: crash into other cars, or get off the road and wait until things get better. The former isn't as disastrous on a network as it would be in real life: losing packets in the middle of a communication session isn't a big deal. (Losing them at the beginning or the end of a session can lead to some user-visible delays.) But making a packet wait for a short time is usually better than "dropping" it and having to wait for a retransmission.

For this reason, routers—but also switches and even cable or ADSL modems—have buffers that briefly hold packets that can't be transmitted immediately. Network traffic is inherently bursty, so buffers are necessary to smooth out the flow of traffic—without any buffering, it wouldn't be possible to use the available bandwidth fully. Network stacks and/or device drivers also use some buffering, so the software can generate multiple packets at once, which are then transmitted one at a time by the network hardware. Incoming packets are also buffered until the CPU has time to look at them.

So far, so good. But there's another type of buffering in the network, used in protocols such as TCP. For instance, it takes about 150 milliseconds for a packet to travel from Europe to the US west coast and back. My ADSL line can handle about a megabyte per second, which means that at any given time, 150K of data is in transit when transferring data between, say, Madrid and Los Angeles. The sending TCP needs to buffer the data that is in transit in case some of it gets lost and must be retransmitted, and the receiving TCP must have enough buffer space to receive all the data that's in transit even if the application doesn't get around to reading any of it.
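
To put rough numbers on that, here's the back-of-the-envelope arithmetic as a small Python sketch (the 1MB/s and 150ms figures are just the ones from this example):

    # Bandwidth-delay product: how much data is "in flight" at any moment,
    # and therefore how much the sending and receiving TCP each need to buffer.
    bandwidth_bytes_per_sec = 1_000_000   # roughly one megabyte per second
    round_trip_seconds = 0.150            # Europe <-> US west coast and back
    in_flight = bandwidth_bytes_per_sec * round_trip_seconds
    print(f"{in_flight / 1000:.0f}K in transit")   # ~150K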

In the old days (which mostly live on in Windows XP), the TCP buffers were limited to 64K, but more modern OSes can support pretty large TCP buffers. Some of them, like Mac OS X 10.5 and later, even try to automatically size their TCP buffers to accommodate the time it takes for packets to flow through the network. So when I send data from Madrid to Los Angeles, my buffer might be 150K at home, but at the university, my network connection is ten times faster so the buffer can grow as large as 1.5MB.

The trouble starts when the buffers in the network start to fill up. Suppose there's a 64-packet buffer on the network card—although it would be hard to fill it entirely—and another 64 packets are buffered by the router. With 1500-byte Ethernet packets, that's 192K of data being buffered. To TCP's buffer auto-tuning, the extra queueing delay looks like a longer path, so TCP simply increases its buffer by 192K, as if the big quake happened and LA is now a bit farther away than it used to be.

The waiting is the hardest part

Of course with all the router buffers filled up with packets from a single session, there's no longer much room to accommodate the bursts that the router buffers were designed to smooth out, so more packets get lost. To add insult to injury, all this waiting in buffers can take a noticeable amount of time, especially on relatively low bandwidth networks.

I personally got bitten by this when I was visiting a university in the UK where there was an open WiFi network for visitors. This WiFi network was hooked up to a fairly pathetic 128kbps ADSL line. This worked OK as long as I did some light Web browsing, but as soon as I started downloading a file, my browser became completely unworkable: every click took 10 seconds to register. It turned out that the ADSL router had a buffer that accommodated some 80 packets, so 10 seconds worth of packets belonging to my download would be occupying the buffers at any given time. Web packets had to join the conga line at the end and were delayed by 10 seconds. Not good.
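
The math behind that estimate is simple; here it is as a quick sketch (80 packets and 1500 bytes are the figures from above, and ADSL framing overhead would make the wait a bit longer still, hence the round 10 seconds):

    # Delay added by a full buffer: buffered bits divided by the link speed.
    packets_in_buffer = 80
    packet_size_bytes = 1500
    link_bits_per_sec = 128_000           # the 128kbps ADSL uplink
    buffered_bits = packets_in_buffer * packet_size_bytes * 8
    print(f"{buffered_bits / link_bits_per_sec:.1f} seconds of queueing delay")  # ~7.5s before overhead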

Cringely got wind of the problem through the blog of Bell Labs' Jim Gettys, which reads like a cross between a detective novel and an exercise in higher Linuxery. Gettys suggests some experiments to do at home to observe the issue ("your home network can't walk and chew gum at the same time"), which seems to be exacerbated by the Linux network stack. He gets delays of up to 200ms when transferring data locally over a 100Mbps link. I tried this experiment, but my network between two Macs, using a 100Mbps wired connection through an Airport Extreme base station, was only slowed down by 6ms (Mac OS X 10.5 to 10.6) or 12ms (10.6 to 10.5).

Cringely gets many of the details wrong. To name a few: he posits that modems and routers pre-fetch and buffer data in case it's needed later. Those simple devices—including the big routers in the core of the Internet—simply aren't smart enough to do any of that. They just buffer data that flows through them for a fraction of a second to reduce the burstiness of network traffic and then immediately forget about it. Having more devices, each with their own buffers, doesn't make the problem worse: there will be one network link that's the bottleneck and fills up, and packets will be buffered there. The other links will run below capacity so the packets drain from those buffers faster than they arrive.

He mentions that TCP congestion control—not flow control, that's something else—requires dropped packets to function, but that's not entirely true. TCP's transmission speed can be limited by the send and/or receive buffers and the round-trip time, or it can slow down because packets get lost. Both excessive buffering and excessive packet loss are unpleasant, so it's good to find some middle ground.

Unfortunately, it looks like the router vendors and the network stack makers got into something of an arms race, pushing up buffer space at both ends. Or maybe, as Gettys suggests, it's just that memory is so cheap these days. The network stacks need large buffers for sessions to high-bandwidth, far-away destinations. (I really like being able to transfer files from Amsterdam to Madrid at 7Mbps!) So it's mostly up to the (home) router vendors to show restraint, and limit the amount of buffering they put in their products. Ideally, they should also use a good active queue management mechanism that avoids most of these problems in the first place.

Cringely may have a point when he suggests that ISPs are in no big hurry to solve this, because having a high-latency open Internet just means that their own VoIP and video services, which usually operate under a separate buffering regime, look that much better. But the IETF LEDBAT working group is looking at ways to avoid having background file transfers get in the way of interactive traffic, which includes avoiding filling up all those router buffers. This may also provide relief in the future.

I understand wanting to skewer Cringely, but why not spend at least as much time discussing the original source, Gettys', actual findings? Gettys spills ten thousand words on the subject illuminating bad behavior in *all* OSes and gets a passing mention in your article. Not good.

The fact that the majority of the Internet's websites are running copies of Linux with huge software buffers is also probably not helping things.

The machine needs to be tuned and I'm glad Gettys made the effort to tell us we need a checkup.

This affects pretty much all modern OSes and networking equipment. The only Linux-specific issue is the large txqueuelen. And even that might not be completely Linux-specific, since the equivalent setting may simply be less visible on other OSes.

Most TCP congestion control algorithms are based on packet loss. The sender keeps sending more data, increasing the congestion window, until a packet is lost; it then backs off, shrinking the congestion window.
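
In rough pseudocode-style Python, that loss-based behavior is just additive increase, multiplicative decrease (a bare sketch, not any particular stack's implementation):

    # Additive increase / multiplicative decrease (AIMD), the classic
    # loss-based congestion control behavior.
    SEGMENT = 1460  # bytes

    def on_ack(cwnd):
        # Congestion avoidance: grow by roughly one segment per round trip.
        return cwnd + SEGMENT * SEGMENT / cwnd

    def on_loss(cwnd):
        # Back off hard when a loss signals congestion.
        return max(cwnd / 2, SEGMENT)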

LEDBAT proposes a TCP congestion control algorithm that moderates transmit speed based on round-trip times instead of packet loss. As the buffers in the various hops between nodes start to fill up, the latency increases, so LEDBAT shrinks the congestion window. This can reduce or eliminate packet loss and keep round-trip times low for interactive traffic like telnet sessions or web browsing.
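
A delay-based controller in the LEDBAT spirit looks more like this sketch (the 100ms target and the gain are illustrative values, not the exact numbers from the spec):

    # Delay-based control in the LEDBAT spirit: back off as queueing delay
    # (current RTT minus the lowest RTT seen) approaches a target.
    SEGMENT = 1460         # bytes
    TARGET = 0.100         # seconds of acceptable queueing delay (illustrative)

    def adjust_window(cwnd, rtt, base_rtt):
        queueing_delay = rtt - base_rtt
        off_target = (TARGET - queueing_delay) / TARGET   # negative once buffers fill
        return max(cwnd + SEGMENT * off_target, SEGMENT)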

The big gotcha is that any stack using the older, packet loss based congestion control algorithm will be at a significant advantage. It will keep sending packets to fill up those buffers until the buffers are full. The LEDBAT implementation will slow down sending as the buffers get full. Stacks using the old algorithm will bog everything down but get more traffic through.

I believe that the LEDBAT congestion control algorithms come from work done in peer to peer clients. Peer to peer clients may actually be some of the most well behaved clients on a network if they're using the round trip based congestion control.

One thing I should have made clearer: large buffering in the stack or routers/switches/etc doesn't hurt too much as long as TCP buffers are small and the other way around. Only when both get large will you see these problems.

TCP is fine when you are working at Layer 3, but it does nothing to help when the last-mile broadband solution is to oversubscribe a network ten-fold (yes, 1,000%). As such, the end user's original packet is dropped, TCP resends it (or requests it to be sent again), and the process repeats itself into a non-linear exhaustion condition.

It's afforded under the whole best-effort scheme that is the business model, not the ideal networking model. Until that changes (not in my lifetime), it isn't going to get perceivably better, as it takes major changes to cross the perception threshold.

Let's go back five years on the telco side and ask how on Earth one can aggregate 150 1.5 Mbps ADSL lines onto a 45 Mbps DS3 pipe? Or how you can aggregate 15-25 DS3 pipes onto an OC3 access trunk feeding an ATM core? Easy: make TCP work for you, and hell be damned for anything UDP.

No mention of bandwidth shaping? Gettys' article is basically raging on the benefits of bandwidth shaping, unsurprisingly, while also mentioning the 12ms latency he found on Mac OS X, which is also what you found.

Really, for a home user who wants to be able to browse while the connection is absolutely loaded with downloads and so on, this is a non-issue and has been solvable for years. QoS along with traffic shaping on the link will let you completely load the connection without sacrificing interactivity, at the cost of a small amount of bandwidth on both the downstream and upstream.

It's a misconception that you should be able to use your line at 100% capacity 100% of the time. If everyone picks up the phone at the same time, most people don't get a dial tone. If everyone takes a shower at the same time, not much water comes out. Et cetera.

One thing I should have made clearer: large buffering in the stack or routers/switches/etc doesn't hurt too much as long as TCP buffers are small and the other way around. Only when both get large will you see these problems.

Hmmmm... correct me if I'm wrong, but isn't this only true in the situation where there are only a few TCP clients in the network? So this would be true for a home router, where the small TCP buffers in the PCs would minimize how much data would be held in the router's buffer at any point. But this would not necessarily be true for an ISP's router, where many small TCP buffers would have the same effect as a few big ones.

This looks to me like a QoS issue, i.e. prioritization of packets. Count me as a total fail in understanding how larger buffers decrease throughput or latency, as explained in this article. If data wasn't buffered, instead of the conga line you'd have the equivalent delay in the browser waiting for the network to be ready.

The key issue is the lowest-bandwidth link, which acts as the rate-limiting step for the traffic. Having a smaller buffer doesn't make this link faster. Having a large buffer shouldn't make it slower.

High latencies due to FIFO queuing in crappy hardware with a saturated uplink is an old and somewhat well-known problem.

It has bitten, probably, every cable/DSL user who has ever saturated his upstream. In the future, I think it's going to bite people on the downstream side as well. 802.11g remains popular but usually one can't get more than 20Mbit/s from it. As these clients get 20Mbit/s and faster DSL/cable/fiber services, their cheap wireless router will become the choke point.

On the upstream the problem has remained mostly unsolved. Users just set limits on their P2P upload rate and otherwise live with it. On the downstream side, it's a somewhat rare problem nowadays. If it becomes more common, it will be hard to mitigate with similar measures; better router design will be needed. But so far, this is a problem with congestion at the edge, in the client's premises router, which affects each client individually. Clients don't affect each other.

Gettys' articles hint at something much larger: ISP-level congestion caused by buffer bloating. However, I don't see a very solid explanation for that.

One way for buffer bloating to cause ISP-level congestion problems would be if ISP-level routers are suffering from the same tendency toward buffer bloating. Since we usually don't see dramatic latencies nowadays, this would also mean that currently, ISPs don't run their networks in saturation. LOL. In my limited experience, the more "serious" network gear does exhibit an increase in latency when under saturation, but the latency values remain quite acceptable (<10ms).

Another way for this to cause ISP-level congestion would be if the bad behavior of these millions of edge routers could, somehow, combine to produce it. But then again, I don't see how that would happen.

This looks to me like a QoS issue, i.e. prioritization of packets. Count me as a total fail in understanding how larger buffers decrease throughput or latency, as explained in this article. If data wasn't buffered, instead of the conga line you'd have the equivalent delay in the browser waiting for the network to be ready.

The key issue is the lowest-bandwidth link, which acts as the rate-limiting step for the traffic. Having a smaller buffer doesn't make this link faster. Having a large buffer shouldn't make it slower.

Or?

The problem now is that buffers are so large as to be the equivalent of lines around the block. Under congestion, packets are being delivered seconds late instead of being dropped according to some queue management algorithm. In the absence of Explicit Congestion Notification, dropping packets is the way TCP signals congestion, so holding onto all that traffic for so long is counter-productive.
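
For reference, the classic active queue management approach (RED) starts dropping, or ECN-marking, a small fraction of packets early, well before the queue is full. A bare sketch with made-up thresholds:

    import random

    # Random Early Detection, sketched: drop probability ramps up with the
    # average queue depth instead of waiting for the queue to overflow.
    def red_should_drop(avg_queue_bytes, min_th=5_000, max_th=15_000, max_p=0.1):
        if avg_queue_bytes < min_th:
            return False
        if avg_queue_bytes >= max_th:
            return True
        return random.random() < max_p * (avg_queue_bytes - min_th) / (max_th - min_th)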

One thing I should have made clearer: large buffering in the stack or routers/switches/etc doesn't hurt too much as long as TCP buffers are small and the other way around. Only when both get large will you see these problems.

Not true for the general case. As long as you are saturating your router/switch/modem TX, its buffers will grow until full. As some of these crappy things will store 1 or 2 seconds' worth of data, your connection's latency grows to those 1 or 2 seconds. I think we all agree that latencies of 1 or 2 seconds are a problem.

This looks to me like a QoS issue, i.e. prioritization of packets. Count me as a total fail in understanding how larger buffers decrease throughput or latency, as explained in this article. If data wasn't buffered, instead of the conga line you'd have the equivalent delay in the browser waiting for the network to be ready.

The key issue is the lowest-bandwidth link, which acts as the rate-limiting step for the traffic. Having a smaller buffer doesn't make this link faster. Having a large buffer shouldn't make it slower.

Or?

The large buffer doesn't make the link slower. It creates a longer line of data waiting to get through the link, increasing latency dramatically. QoS can help fix this, but only to the extent that the QoS software has a way of knowing which packets you want delivered first. So if you prioritize UDP, that's great for UDP applications, but what about other TCP sessions? Don't you still want CNN to load quickly while you're downloading a movie?

As long as you are saturating your router/switch/modem TX, its buffers will grow until full.

This will only happen if the outgoing link on the computer is the bottleneck, like in that 100 Mbps Ethernet test. If there is a smaller bottleneck elsewhere, the packets will drain faster from the sending system's buffers than they are injected by TCP.

Just a quick correction: the article seems to link bandwidth and latency - which are not necessarily related. For example, in the article, you state that a 1 Mbps connection overseas with a latency of 150 ms = 150 KB (bandwidth*latency). The correct function would be packet size * maximum number of outstanding packets to get an idea of the average in-flight data size. If there is no maximum # of packets, there will be some cases, due to buffering, where you may actually have more than your total bandwidth "in flight".

As long as you are saturating your router/switch/modem TX, its buffers will grow until full.

This will only happen if the outgoing link on the computer is the bottleneck, like in that 100 Mbps Ethernet test. If there is a smaller bottleneck elsewhere, the packets will drain faster from the sending system's buffers than they are injected by TCP.

Err... No, this will happen if there's a big buffer behind a saturated link, at any point in the path.

Right. The issue is that without ECN (Explicit Congestion Notification), the sending system will try to pump as many packets as it can towards the bottleneck. The buffers along the path will fill up, from the bottleneck back, until such time as the end-to-end packet propagation time (latency) increases to the point that the TCP speed control mechanism notices that the latency is exceeding an acceptable value. At that point, the sending station will slow down its transmission, the buffers along the path will start to drain out, and the sending station may decide to speed up again. Then the buffers fill, etc.

The problem is that the latency delay introduced by all the buffering prolongs the time it takes for the transmitting station to calibrate itself for the available path bandwidth, causing the latency to rise and crash and rise and crash, as opposed to letting packets drop quickly, thereby providing more information to the transmitting node as to the appropriate level of throttling (not buffering) that it should be doing.

What's very wrong here is that routers (especially ADSL, but any with very limited upstream) size their buffers, if they size them at all, for the downstream bandwidth.

This means that if you saturate the upstream, that oversized buffer fills completely. The connection becomes unusable because all your ACKs are stuck in the buffer too, which means whatever you're downloading slows down, since the far end assumes the pipe between you and it is clogged.

This is me, pinging out with 20% upstream use and about 50% downstream use:

One thing I should have made clearer: large buffering in the stack or routers/switches/etc doesn't hurt too much as long as TCP buffers are small and the other way around. Only when both get large will you see these problems.

Alternatively, we can start with the proper client-side priority routine again....

Nah, that'd take WORK.

hellokeith - Yeah, I mean, in my experience it's only about 10x worse with IPv6 devices. (Mostly because they're more modern and hence have bigger buffers...)

total fail in understanding how larger buffers decrease throughput or latency, as explained in this article

Increase latency. The pipe has a bottleneck, so the buffer before it fills up. Packets are buffered, not dropped, so TCP calculates a high-bandwidth, high-delay link. The buffer can hold 10 seconds of data, so latency goes up by 10 seconds. Average throughput goes way down for new requests, since even the smallest request takes 10+ seconds for a response. Your total fail is the network admins' total fail, and we're all fucked for it because you don't understand.

Quote:

If data wasn't buffered, instead of the conga line you'd have the equivalent delay in the browser waiting for the network to be ready

If it wasn't buffered for 10 seconds, you'd have <100ms round trip times, and dropped packets would get each end to throttle itself to the link speed - not what they think is a 10s RTT but high-bandwidth link, but a 100ms high-bandwidth link when not being lag-fucked by the admins. It's not rocket science.

I can only imagine that most consumer grade routers have their buffers tuned for a 100Mbps WAN connection (because it is a 100Mbps port typically) which of course leads to nasty egress delays. Now, RED is a nice fix but you have to know the bandwidth in your upstream direction to properly apply it. If ISPs provided a more robust customer device that would auto-tune its knowledge of bandwidth in the upstream direction, QoS could be automatically deployed to an extent. The real problem for automagic QoS deployment for the layman network user is how do you treat UDP differently than TCP? Of course any network engineer modifying QoS policies knows how to handle the situation, but you can't just automatically assume all UDP gets express forwarding treatment (EF, DSCP 46 from a QoS perspective). That is a recipe for disaster.

As for the statement about bandwidth utilization with 150ms of delay and a 1Mbps pipe, the math is off as previously stated, but there is even more to account for. Achievable bandwidth does depend on latency, but there is also a TCP window size that must be accounted for. The TCP window size is the number of bytes that can be in-flight for one session at a time. The default window size these days is typically 64kB (512kb). So, the maximum potential bandwidth for a connection is (window size)/(delay in seconds). With a 64kB window and 150ms of delay, that ceiling works out to about 427kB/s (roughly 3.5Mb/s), no matter how fast the pipe is. The window size is a setting typically hidden by every OS for good reason. Once you add packet loss to the equation your bandwidth typically crashes. This was the reason for RED, to prevent "global TCP synchronization."
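
In sketch form (using the article's 150ms round trip):

    # Throughput ceiling imposed by a fixed TCP window: at most one window
    # of data can be in flight per round trip.
    window_bytes = 64 * 1024
    rtt_seconds = 0.150
    ceiling = window_bytes / rtt_seconds
    print(f"~{ceiling / 1024:.0f}kB/s, ~{ceiling * 8 / 1_000_000:.1f}Mb/s")   # ~427kB/s, ~3.5Mb/s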

In short, it seems the only way to fix the issue of bandwidth being underutilized by the layman (read: non-network-engineer) is to have protocol changes signal to consumer router XYZ what level of QoS a particular session should receive, and have the router intelligently know the upload bandwidth capacity. Of course, this is all a moot point if internet connections were symmetric (same upload/download): the provider can't necessarily apply the same type of automagic QoS principle to traffic heading towards the customer, otherwise a rogue client could completely saturate a pipe with "high priority" traffic. Unless, of course, everything by default is "high priority" and applications self-select "low" priority traffic.

As a tangent, the premise of LEDBAT sounds intriguing and I will definitely read more about it.

Seems to me it is much better to overbuffer and induce 100-200ms of latency than to simply drop the packet and wait for TCP to retry. The retransmission means even more latency, and it also exacerbates the bandwidth overuse issue.

I can only imagine that most consumer grade routers have their buffers tuned for a 100Mbps WAN connection (because it is a 100Mbps port typically) which of course leads to nasty egress delays

Uh, wouldn't the consumer stuff be tuned for LAN or first-hop-to-ISP connections, which have tiny pings that more than offset the high bandwidth? Egress delays on a typically uncongested and very low-ping link? Even appropriate buffering for a wireless LAN is still under a second, which doesn't account for seconds of lag... unless the 'tuning' is fucked more than you seem to know...

Quote:

The default window size these days is typically 64kB (512kb)

Scaling is done by Windows 7, recent Linux, and recent Mac OS X, so you're way behind the game if you're running a commercial network with that assumption. This is where the idiots screaming at torrents make their idiocy known - it doesn't take a torrent when the window scales appropriately.

Quote:

Once you add packet loss to the equation your bandwidth typically crashes

Once you add 10 seconds of buffering you fucked it much worse for most uses.

Quote:

In short, it seems the only way to fix the issue ... level of QoS

What about appropriately-sized buffers for fixed network hops, and window scaling that uses RTT to detect broken buffers? I think Gargoyle is pointed out as having a potential solution for the latter in the posts, so why don't you actually read the posts and address the issue at hand...

As someone who could arguably be called an "expert" (if there really are such things) on this topic (or at least TCP/IP and large networks), I have a few thoughts to share.

I wish to preface this with the following: I have not read all of Jim Gettys' website (or at least the portion linked to by this article), nor have I read all the comments here. What I have read has already led me to form an opinion that this is a raging tempest in a teapot.

This article's author, Iljitsch van Beijnum, lays out a number of the particulars of the problem. I could nitpick some of the specifics, but there's nothing that was written that fatally undermines any specific point. The author is clearly aware of, and has some understanding of, the salient points that can contribute to the topic being discussed.

The problem, which I cannot in good conscience hold the author accountable for, is in taking all of these individual details and understanding how they interact when brought together under "real world" conditions. This starts to be that line in the sand that distinguishes "experts" from "non-experts". This is not intended to slight anyone; I think everyone recognizes that you can have a good command of the basics, but someone who has 3-5 years of practical experience in dealing with the "topic" every single day is going to have a "better" command of the issues at hand.

First, let us distinguish between two things: Buffers at the END POINTS, and buffers in the NETWORK. These two things are completely and totally different even though they both use the word "buffer". The reason why I think all of this is a "raging tempest in a teapot" is that my quick review of the article and the linked page is immediately twigging my sense that "the distinction between these two points is not properly understood or separated." This is an initial impression, and further reading of the material may cause me to change my mind. Also, I have seen a number of comments here that clearly do not distinguish between the two.

Although I'm not going to explain the pedantic details of why, and anything that requires an explanation based on "trust me" should be viewed as suspicious, you're just going to have to "trust me" when I say "The amount of buffers in the NETWORK is, for the most part, irrelevant, and does not impact performance either positively or negatively." This is not universally true; the amount of buffers in the NETWORK can have an impact on performance, but when and why is extremely complex and very non-intuitive. In fact, from a practical standpoint (i.e., making a backbone router), it is a problem of "what is the minimum amount of buffers that need to be added before it begins to impact performance?" As a general rule, adding more buffers past that point "does nothing", which includes "does not negatively impact performance for flows going through this link / buffer". Therefore, it's always safe to "add more", but here "add more" means "adds cost", and that extra cost does not turn into more performance.

In backbone routers, we're talking about costs that range from four to six figures, so there is a genuine, compelling reason to get the amount down to as little as possible. On a 100 Gb/s link (which is a shipping product and in use in backbones today), a 64-byte minimum-sized packet must be processed about every 5 nanoseconds. You can't use DDR3 DRAM, the stuff that's in your computer, for buffers at this speed- DDR3 DRAM is simply not fast enough (as in orders of magnitude too slow, but I admit I haven't actually done the calculations, this is off the top of my head). You need stuff like QDR SRAM, which is probably going to cost you something on the order of "$10 to $100 per megabyte". This one single point should be enough to raise some doubt about anyone who is claiming that "too much buffering in the NETWORK is end of the world bad!" They clearly do not understand or appreciate that there is already tremendous pressure (read: MONEY) to keep the amount of buffers in the NETWORK at a minimum. All you have to do is realize "Well, if they weren't needed at all, then you could just rip them out of the routers and save tens, maybe hundreds, of thousands of dollars!"

Next, TCP between END POINTS needs a certain amount of buffering. This is directly proportional to the Bandwidth Delay Product, or BDP. In short, how far away the END POINTS are (in terms of latency) determines how much buffer the sender requires in order to achieve the maximum performance for that path. After that, "more is just more" with no additional benefit.

When operating at 100% peak maximum performance for any given BDP path, the amount of buffers in the NETWORK is, for all practical purposes, completely irrelevant. When operating in this mode, any problem, and I mean ANY problem, that causes the receiver to stop receiving packets instantly (in the informal sense for our purposes) causes the sender to stop sending packets. That is to say:

The amount of packets in the network, i.e. that are in flight at any given moment, between two end points is completely under the control of those two end points.

From this it follows that the amount of packets in flight is tightly coupled between what the receiver is receiving and what the sender is sending, and any "problems" cause an immediate reduction in new packets entering the network by the sender. This is the quality that makes TCP so unbelievably robust and, for all intents and purposes, completely unmodified since its introduction (I'm hand-waving some important details; one of them is actually related precisely to this very point).

There is a related phenomenon known as "jitter". This phenomenon is usually the result of "queueing" at a link, and typically begins to manifest itself in a way that becomes noticeable and objectionable when a link's utilization starts to go above 66-70%. After that point, a small burst that causes link utilization to rise significantly, if transiently, above that point (i.e., there's not a lot of slop left to soak things up) can dramatically increase packet jitter for flows traversing that link.

This phenomenon is also mostly unrelated to how big the NETWORK buffer is... it has everything to do with how queue service time grows as a link becomes used more and more. For example, a link at 66% utilization may have an average jitter of 0.2ms per packet, but a link at 93% utilization may have an average jitter of 10ms per packet. After a certain point, jitter becomes extremely non-linear. Adding more buffers won't help in this case- in order to fix the "problem", you need to (wait for it)..... add more capacity! If you need to send 500Mb/s between two points, either from a single end point or aggregated over many end points, a 100Mb/s link is just not going to cut it. No amount of buffers, bandwidth shaping, or other snake oil is going to change that fact.
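
The textbook way to see that non-linearity is the M/M/1 waiting-time formula, where the average wait scales with utilization/(1 - utilization). A toy sketch (the service time is a made-up figure and real traffic isn't Poisson, so treat the output as shape, not truth):

    # Toy M/M/1 queue: average waiting time explodes as utilization nears 100%.
    def avg_wait_ms(utilization, service_time_ms=0.12):
        return service_time_ms * utilization / (1.0 - utilization)

    for u in (0.50, 0.66, 0.80, 0.90, 0.95, 0.99):
        print(f"{u:.0%} utilized -> {avg_wait_ms(u):.2f}ms average wait")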

In fact, "bandwidth shaping" is nothing but snake oil, though it is likely to be defended with religious fervor by the true believers. Why is it snake oil? Well, there's no doubt that it does exactly what the true believers say it does. The "problem" is that the only time that "bandwidth shaping" becomes important is when you hit the point where you actually need to add more capacity. If you're at the point where bandwidth shaping can "make a difference", you're already past the point where you need to add more capacity. The only thing "bandwidth shaping" can do at that point is help you "pick which of the really important traffic you're going to screw over first." Granted, this can be a total life saver in certain situations (DDoS, the extra capacity you ordered won't be delivered due to telco issues for another month, etc).... but you are out of your mind, and fundamentally misunderstand the issues at hand if you think "bandwidth shaping" is some kind of sustainable solution. It is a valid stop-gap emergency solution in limited situations only, and that's it.

Once you add 10 seconds of buffering you fucked it much worse for most uses.

With all due respect, this statement is not only wrong, it is wronger than wrong. While not universally true, for all practical purposes adding "more buffering" can not negatively impact performance. In fact, while the following statement has certain qualifications that are likely to be understood by a "PhD level of expert", as a general rule of thumb, adding "more buffering" can not negatively impact performance even in principle.

As a caveat to my very long post regarding this topic, I think it's important that I mention at least one case where what I wrote doesn't necessarily apply.

There are two basic schools of thought when designing a network from the ground up: Make the network as smart as possible with the end points as dumb as possible, or make the network as dumb as possible but make the end points as smart as possible.

TCP/IP networking was built on a very simple assumption: The end points are smart, the network is stupid. Everything that can, within reason, be shifted to the end points should be. The only thing really required of the network is that it "makes an effort to get a packet from point A to point B", and that's about it. There is a very subtle and usually unstated assumption that "it is better for the network to drop the packet rather than for the network to attempt to reliably deliver said packet." The ability to "reliably deliver" a packet was shifted entirely to the end points.

In other words, TCP/IP takes the "make the network as dumb as possible but make the end points as smart as possible" approach to network design. One could reasonably argue that prior to TCP/IP, nearly every network architecture was in the other camp. Many of these network architecture designs tended to originate from telcos, though I genuinely don't think it was any kind of conspiracy on their part- they were experts in the field, but this was a case where that expertise was not an advantage because it prevented them from giving serious consideration to other alternatives.

This should raise the obvious question of "Well, ok, what happens when you send TCP/IP traffic over a network that makes an effort to reliably deliver packets?" The short answer is: nothing good.

Do such networks exist in the real world? Yes they do. Where are you likely to find such a beast? On a modern, or even next generation, high speed digital cell phone. The same people who thought it was a great idea back then still think it's a great idea today, and the protocols and network interfaces are a product of that.

What happens when you stuff "lots of buffers" on a "smart network with reliable packet delivery" and run TCP/IP on it? Disaster. In this case, having the NETWORK and the END POINTS both trying to do reliable packet delivery can result in an exponential explosion of the number of packets in flight in the network at any time.... and if both layers are doing their job, that means that the packets TCP adds to ensure reliable packet delivery are literally wasting valuable buffer space, and will just be dropped by the receiving end point once they finally get there anyway.

This is the only case I'm aware of, that is to say running TCP/IP over a "smart network that also implements 'invisible' reliable packet delivery", where adding more buffers in the NETWORK is a bad, bad thing. In practice, the only place you're likely to find this today is running TCP/IP over a cellular interface (but this is a generalization, and does not necessarily apply to all modern cellular interfaces).

Manufacturers need to get on the ball here as their products are artificially restricting their own performance.

The phenomenon that you are describing literally has nothing whatsoever to do with buffers, either in any part of the network or on the communicating end-points.

The phenomenon you are describing is modeled by queueing theory. There is nothing a manufacturer, nor any technology or algorithm, can do to "fix" the "problem" you are having- you can only utilize so much of the capacity of a link before it begins to affect the average service time of a queue for a "random distribution". In other words you need more bandwidth, and there is only one thing that can fix that problem- more bandwidth.

"Packet shaping", "QoS", etc may help you make the most of a link that is congested to the point where queueing delays have become manifest, perhaps even dominant. But they can not change, or do anything about, the fact that what your real problem is that you don't have enough bandwidth for what you need to do. QoS is a euphemism for "picking which traffic to screw harder than the rest." If you have enough actual bandwidth to do what you need to do, QoS is not only useless, all that extra programming and silicon is just one more thing that has to be done, and therefore might itself impact performance (though in reality with a modern CPU dedicated to the task (i.e., a home ADSL/802.11n router), doing QoS at 100Mb/s is "trivial" and unlikely to contribute noticeably to the time it takes to process a packet). Much more likely, however, is if it's all done in software, that's just one more place for bugs to hide, and as the old proverb goes: "You don't have to debug what you leave out."

The way that the torrent programs you talk about work is they essentially use heuristics to get a feel for what your maximum link capacity to the internet is (i.e., what your up and down speeds are). Once they have a model of just those two simple parameters, they can artificially restrict how much traffic they inject in to the network so that they don't just flood traffic on to your link which is, if you're running a torrent, likely to be at the saturation point where the effects of queueing delay are not just noticeable, but dominate everything that has to pass through that link. Since the torrent program is the source of all of the problems for the traffic that is likely to be passing through a home link, it makes sense that it is the best place to do intelligent traffic management.
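
The usual way a client self-limits like that is some variant of a token bucket; a minimal sketch (the 80kB/s cap is an arbitrary example, not anything a real client ships with):

    import time

    # Minimal token-bucket rate limiter: once the client has estimated its
    # link capacity, it caps its own send rate somewhere below that.
    class TokenBucket:
        def __init__(self, rate_bytes_per_sec, burst_bytes):
            self.rate = rate_bytes_per_sec
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def try_send(self, nbytes):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True   # OK to put this packet on the wire
            return False      # hold it back; we're at our self-imposed cap

    limiter = TokenBucket(rate_bytes_per_sec=80_000, burst_bytes=16_000)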

The impact of queueing delay effects is extremely non-linear. The difference between a link operating at 85%, 90%, 95%, and 100% is not only noticeable at the "surfing the web using a web browser" level, the tiny difference of just a single percentage point past 90% can be the difference between "useable" and "completely unusable", even though intuitively you might expect there to be a nice, linear relationship and proportionate degradation in perceived network quality. In fact, in reality every "single digit percentage point" past 90% can represent nearly a "doubling in average packet jitter time" (this is "rough rule of thumb" true, not pedantically true, in so far as it helps visualize how much of an impact queueing delay has at high utilizations).

As long as you are saturating your router/switch/modem TX, its buffers will grow until full.

This will only happen if the outgoing link on the computer is the bottleneck, like in that 100 Mbps Ethernet test. If there is a smaller bottleneck elsewhere, the packets will drain faster from the sending system's buffers than they are injected by TCP.

Err... No, this will happen if there's a big buffer behind a saturated link, at any point in the path.

Actually, you're both wrong, but iljitsch is much less wrong than raxx7.

Imagine a set of ethernet switches set up to approximate something like a lab model of a "small internet". Now imagine two PC's plugged in to that network- one on the east coast, one on the west coast. The east coast PC sends traffic to the west coast PC at 100Mb/sec. Assume that the traffic must pass through two to three ethernet switches that represent "hops in the middle". We'll simplify and call the ethernet switches "routers" for the purposes of our discussion, and further assume that said switches make a reasonable approximation for the behavior we are attempting to understand. In good faith, for the purposes of this discussion, I believe that this is "reasonably true". If we want to model true bandwidth delay product, we assume there is a black box linking routers that "emulates" this behavior, but specifically this phenomenon is "out of band" of the network devices themselves.

Now, how much of the "buffers" of the routers is the east-to-west-coast 100Mb/s transfer using?

None. As in zero.

How about if we say that each link contains a 128 megabyte buffer. Now how much of the buffer of each link is this east-to-west 100Mb/s transfer using?

Again, none. As in zero.

Now assume that there are two PC's on the east coast, one on the west coast. The two east coast PC's are on separate, 100Mb/s links. The west coast PC downloads a large file from each of the two east PC's. Same question, how much buffer of each link is this transfer taking up?

None. As in zero. With the minor caveat of "once steady state is achieved."

Also note that TCP is very specifically designed to correctly deal with just this case: additive increase with exponential backoff. Using this algorithm, TCP will converge to "maximum steady state" very quickly regardless of the other traffic or link utilization of the hops a flow has to pass through. Once steady state has been achieved, it requires no buffering by the router hops for that flow.

TCP is also specifically designed so that once steady state is achieved, anything that causes a change to the steady state (say a third, unrelated but high-throughput flow traversing one of our flow's hops) immediately and "instantly" results in a reduction in the amount of packets the sender introduces to the network. This phenomenon is known as "ACK clocking", named after the fact that "once steady state is achieved, the receiver's ACKs effectively serve as a clock pulse to gate when the sender can send new packets." If there is a reduction in bandwidth "somewhere in the middle", this instantly translates into a reduction in the amount of packets arriving at the receiver, which instantly translates into a reduction in the rate at which the receiver can ACK packets. In effect, what happens is the sender can not send any additional packets because "the send window is full", and the send window becomes unfull when it gets an ACK from the receiver. Anything that interrupts this process instantly causes the sender to stop adding new packets into the network.
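
Stripped down to its essence, that gating looks something like this (purely illustrative; real TCP tracks sequence numbers, retransmissions, and much more):

    # ACK clocking, sketched: a sender may only inject a new packet when an
    # ACK has opened up room in the send window.
    class AckClockedSender:
        def __init__(self, window_packets):
            self.window = window_packets
            self.in_flight = 0

        def can_send(self):
            return self.in_flight < self.window

        def on_send(self):
            self.in_flight += 1

        def on_ack(self):
            self.in_flight -= 1   # each ACK frees a slot for the next packet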

This is one of the primary reasons why "adding buffers to the NETWORK" makes it, for all practical purposes, impossible to negatively impact the performance of TCP. For TCP, buffers in the NETWORK are there to soak up the transient RTT delays that prevent the "signal" from instantly being communicated from one end to the other. A small amount of buffer goes an incredibly long way in this situation, and it should be obvious that adding more buffers very quickly results in "diminished returns". Past a certain point, more is just more, with no additional benefit, and certainly no negative impact on performance.

Iljitsch van Beijnum is a contributing writer at Ars Technica, covering network protocols as well as Apple topics. He is currently finishing his Ph.D. work at the telematics department at Universidad Carlos III de Madrid (UC3M) in Spain.