Whose house is of glasse, must not throw stones at another.

In my last post I outlined the general bufferbloat problem. This post attempts to explain what is going on and how I started on this investigation, which resulted in (re)discovering that the Internet’s broadband connections are fundamentally broken (others have been there before me). It is very likely that your broadband connection is badly broken as well; so is your home router; and even your home computer. There are things you can do immediately to mitigate the brokenness in part, which will make applications such as VOIP, Skype and gaming work much, much better; I’ll cover them in more depth very soon. Also coming soon: how this affects the worldwide dialog around “network neutrality.”

Bufferbloat is present in all of the broadband technologies, cable, DSL and FIOS alike. And bufferbloat is present in other parts in the Internet as well.

As may be clear from old posts here, I’ve had lots of network trouble at my home, made particularly hard to diagnose due to repetitive lightning problems. This has caused me to buy new (and newer) equipment over the last five years (and experience the fact that bufferbloat has been getting worse in all its glory). It also means that I can’t definitively answer all questions about my previous problems, as almost all of that equipment is scrap.

Debugging my network

As covered in my first puzzle piece, last April I was investigating the performance of an old VPN device Bell Labs had built, and found that the latency and jitter when running at full speed were completely unusable, for reasons I did not understand, but had to understand for my project to succeed. The plot thickened when I discovered I had the same terrible behavior without using the Blue Box.

I had had an overnight trip to the ICU in February, so I did not immediately investigate then, as I was catching up on other work. But I knew I had to dig into it, if only to make good teleconferencing viable for me personally. In early June, lightning struck again (yes, it really does strike in the same place many times). Maybe someone was trying to get my attention on this problem. Who knows? I did not get back to chasing my network problem until sometime in late June, after partially recovering my home network, further protecting my house, fighting with Comcast to get my cable entrance relocated (the mom-and-pop cable company Comcast had bought had installed it far away from the power and phone entrance), and replacing my washer, pool pump, network gear, and irrigation system.

But the clear signature of the criminal I had seen in April had faded. Despite several weeks of periodic attempts, including using the wonderful tool smokeping to monitor my home network, and installing it in Bell Labs, I couldn’t nail down what I had seen again. I could get whiffs of smoke of the unknown criminal, but not the same obvious problems I had seen in April. This was puzzling indeed; the biggest single change in my home network had been replacing the old blown cable modem provided by Comcast with a new, faster DOCSIS 3 Motorola SB6120 I bought myself.

In late June, my best hypothesis was that there might be something funny going on with Comcast’s PowerBoost® feature. I wondered how that worked, did some Googling, and happened across the very nice internet draft that describes how Comcast runs and provisions its network. When going through the draft, I happened to notice that one of the authors lives in an adjacent town, and emailed him, suggesting lunch and a wide-ranging discussion around QOS, Diffserv, and the funny problems I was seeing. He’s a very senior technologist at Comcast. We got together in mid-July for a very wide-ranging lunch lasting three hours.

Lunch with Comcast

Before we go any further…

Given all the Comcast bashing currently going on, I want to make sure my readers understand that through all of this Comcast has been extremely helpful and professional, and that the problems I uncovered, as you will see before the end of this blog entry, are not limited to Comcast’s network: bufferbloat is present in all of the broadband technologies, cable, FIOS and DSL alike.

The Comcast technical people are as happy as the rest of us that they now have proof of bufferbloat and can work on fixing it, and I’m sure Comcast’s business people are happy that they are in the same boat the other broadband technologies are in (much as we all wish the mistake were only in one technology or network, it’s unfortunately very commonplace, and possibly universal). And as I’ve seen the problem in all three common operating systems, in all current broadband technologies, and many other places, there is a lot of glasse around us. Care with stones is therefore strongly advised.

The morning we had lunch, I happened to start transferring the old X Consortium archives from my house to an X.org system at MIT (only 9ms away from my house; most of the delay is in the cable modem/CMTS pair); these archives are 20GB or so in size. All of a sudden, the whiffs of smoke I had been smelling became overpowering to the point of choking and death. “The Internet is Slow Today, Daddy” echoed through my mind; but this was self-inflicted pain. But as I only had an hour before lunch, the discussion was a bit less definite than it would have been even a day later. Here is the “smoking gun” of the following day, courtesy of the DSL Reports Smokeping installation. You too can easily use this wonderful tool to monitor the behavior of your home network from the outside.

As you can see, I had well over one second of latency, and jitter just as bad, along with high ICMP packet loss. Behavior from the inside out looked essentially identical. The times when my network connection returned to normal were when I would get sick of how painful it was to browse the web and suspend the rsync to MIT. As to why the smoke broke out: the upstream transfer is always limited by the local broadband connection. The server is at MIT’s colo center on a gigabit network that directly peers with Comcast; it is a gigabit (at least) from Comcast’s CMTS all the way to that server (and from my observations, Comcast runs a really clean network in the Boston area). It’s the last mile that is the killer.

As part of lunch, I was handed a bunch of puzzle pieces that I assembled over the following couple of months. These included:

That what I was seeing was more likely excessive buffering in the cable system, in particular in cable modems. Comcast had been trying to get definitive proof of this problem ever since Dave Clark at MIT brought it to their attention several years ago.

A suggestion of how to rule in/out the possibility of problems from Comcast’s Powerboost by falling back to the older DOCSIS 2 modem.

I went home, and started investigating seriously. It was clearly time to do packet traces to understand the problem. I set up to take data, and eliminated my home network entirely by plugging my laptop directly into the cable modem.

But it had been more than a decade since I had last taken packet captures and stared at TCP traces. Wireshark was immediately a big step up (I’d occasionally played with it over the last decade); as soon as I took my first capture I knew something was gravely wrong, despite being very rusty at staring at traces. In particular, there were periodic bursts of illness, with bursts of dup’ed acks, retransmissions, and reordering. I’d never seen TCP behave in such a bursty way (for long transfers). So I really wanted to see visually what was going on in more detail. After wasting my time investigating more modern tools, I settled on the old standbys of tcptrace and xplot I had used long before. There are certainly more modern tools, but most are closed source and require Microsoft Windows; acquiring the tools, their learning curve, and the fact I normally run Linux militated against their use.

A number of plots show the results. The RTT becomes very large a while (10-20 seconds) into the connection, just as the ICMP ping results show. The outstanding data graph and throughput graph show the bursty behavior so obvious even when browsing the wireshark results. Contrast these with the sample RTT, outstanding data, and throughput graphs from the tcptrace manual.

RTT - Round Trip Time

Outstanding data graph

Throughput Graph

Also remember that buffering in one direction still causes problems in the other direction: TCP’s ack packets will be delayed. So my occasional uploads (in concert with the buffering) were causing the “Daddy, the Internet is slow today” phenomenon; the opposite situation is of course also possible.

The Plot Thickens Further

Shortly after verifying my results on cable, I went to New Jersey (I work for Bell Labs from home, reporting to Murray Hill), where I stay with my in-laws in Summit. I did a further set of experiments. When I did, I was monumentally confused (for a day), as I could not reproduce the strong latency/jitter signature (approaching 1 second of latency and jitter) that I saw my first day there when I went to take the traces. With a bit of relief, I realized that the difference was that I had initially been running wireless, and then had plugged into the router’s ethernet switch (which has about 100ms of buffering) to take my traces. The only explanation that made sense to me was that the wireless hop had additional buffering (almost a second’s worth) above and beyond that present in the FIOS connection itself. This sparked my later investigation of routers (along with occasionally seeing terrible latency in other routers), which in turn (when the results were not as I had naively expected) sparked investigating the base operating systems.

The wireless traces from Summit are much rattier: there are occasional packet drops severe enough to cause TCP to do full restarts (rather than just fast retransmits), and I did not have the admin password on the router to shut out access by others in the family. But the general shape in both is similar to what I initially saw at home.

Ironically, I have realized that you don’t see the full glory of TCP RTT confusion caused by buffering if you have a bad connection, since packet loss resets TCP’s timers and RTT estimation; packet loss is always treated as possible congestion. This is a situation where the “cleaner” the network is, the more trouble you’ll get from bufferbloat: the cleaner the network, the worse it will behave. And I’d done so much work to make my cable as clean as possible…

At this point, I realized what I had stumbled into was serious and possibly widespread; but how widespread?

Calling the consulting detectives

At this point, I worried that we (all of us) are in trouble, and asked a number of others to help me understand my results, ensure their correctness, and get some guidance on how to proceed. These included Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Dick Sites and others. They helped with the diagnosis from the traces I had taken, and confirmed the cause. Additionally, Van notes that there is timestamp data present in the packet traces I took (since both ends were running Linux) that can be used to locate where in the path the buffering is occurring; so while my pings are very easy to use, they may not be necessary for real TCP wizards (which I am not), and they beg a question of accuracy if the nodes being probed are loaded.

Dave Reed was shouted down and ignored over a year ago when he reported bufferbloat in 3G networks (I’ll describe this problem in a later blog post; it is an aggregate behavior caused by bufferbloat). With examples in broadband and suspicions of problems in home routers, I now had reason to believe I was seeing a general mistake that (nearly) everyone is making repeatedly. I wanted to build a strong enough case that the problem was large and widespread that everyone would start to systematically search for bufferbloat. I have spent some of the intervening several months documenting and discovering additional instances of bufferbloat, in my switch, my home router, results from browser experiments, and additional cases such as corporate and other networks, as future blog entries will make clear.

ICSI Netalyzr

One of the puzzle pieces handed me by Comcast was a pointer to Netalyzr.

ICSI has built the wonderful Netalyzr tool, which you can use to help diagnose many problems in your ISP’s network. I recommend it very highly. Other really useful network diagnosis tools can be found at M-Lab and you should investigate both; some of the tests can be run immediately from a browser (e.g. Netalyzr), though some tests are very difficult to implement in Java. And by using these tools, you will also be helping researchers investigate problems in the Internet, and you may be able to discover and expose misbehavior of many ISPs. I have, for example, discovered that the network service provided on the Acela Express is running a DNS server which is vulnerable to man-in-the-middle attacks due to lack of port randomization, and therefore will never consider doing anything on it that requires serious security.

At about the same time as I was beginning to chase my network problem, the first Netalyzr results were published at NANOG; more recent results have since been published: Netalyzr: Illuminating The Edge Network, by Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. This paper has a wealth of data on all sorts of problems that Netalyzr has uncovered; excessive buffering is covered in section 5.2. The scatterplot there and the discussion are worth reading. The ICSI group have kindly sent me a color version of that scatterplot that makes the technology situation much clearer (along with the magnitude of the buffering); they have used it in their presentations, but it is not in the paper. Without this data, I would still have been wondering whether bufferbloat was widespread, and whether it was present in different technologies or not. My thanks to them for permission to post these scatter plots.

Netalyzr uplink buffer test results

Netalyzr downlink buffer test results

As outlined in section 5.2 of the Netalyzr paper, the structure you see is very useful for seeing what buffer sizes and provisioned bandwidths are common. The diagonal lines indicate the latency (in seconds!) caused by the buffering. Both wired and wireless Netalyzr data are mixed in the above plots. The structure shows common buffer sizes, which are sometimes as large as a megabyte. Note that there are times that Netalyzr may have been under-detecting and/or under-reporting the buffering, particularly on faster links; the Netalyzr group have been improving their buffer test.
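To make those diagonal lines concrete, here is a small sketch of the arithmetic behind them (the 256 KB buffer and the uplink rates below are illustrative assumptions, not Netalyzr data):

```python
# Once a FIFO of B bytes in front of an uplink of R bits/s fills,
# every packet behind it waits up to 8*B/R seconds: the diagonal
# lines in the scatterplots are lines of constant 8*B/R.

def buffer_delay_s(buffer_bytes, uplink_bps):
    # worst-case queueing delay added by a full buffer
    return buffer_bytes * 8 / uplink_bps

# e.g. a (hypothetical) 256 KB modem buffer on typical uplinks:
for mbps in (1, 2, 10):
    d = buffer_delay_s(256 * 1024, mbps * 1e6)
    print(f"{mbps:2d} Mbit/s uplink: {d:.2f} s of added latency")
```

Note how the same buffer that is merely annoying on a fast link is catastrophic on a slow one; this is why the lower bandwidth tiers suffer the most.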

I do have one additional caution, however: do not regard the bufferbloat problem as limited to interference caused by uploads. Certainly more bandwidth makes the problem smaller (for the same size buffers); the wired performance of my FIOS data is much better than what I observe for Comcast cable when plugged directly into the home router’s switch. But since the problem is present in the wireless routers often provided by those network operators, the typical latency/jitter results for the user may in fact be similar, even though the bottleneck may be in the home router’s wireless routing rather than the broadband connection. Anytime the downlink bandwidth exceeds the “goodput” of the wireless link that most users are now connected by, the user will suffer from bufferbloat in the downstream direction in the home router (typically provided by Verizon) as well as upstream (in the broadband gear) on cable and DSL. I commonly see downstream bufferbloat on my Comcast service too: now that I’ve upgraded to 50/10 service, it is much more common for my wireless bandwidth to be less than the broadband bandwidth.

Discarding various alternate hypotheses

You may remember that I started this investigation with a hypothesis that Comcast’s Powerboost might be at fault. This hypothesis was discarded by dropping my cable service back to using DOCSIS 2; if Powerboost had been the cause, the signature would have changed in a different way than it did.

Secondly, those who have waded through this blog will have noted that I have had many reasons not to trust the cable to my house, due to mis-reinstallation of a failed cable by Comcast earlier, when I moved in. However, the lightning events I have had meant that the cable to my house was relocated this summer, and a Comcast technician had been to my house and verified the signal strength, noise and quality at my house. Furthermore, Comcast verified my cable at the CMTS end; there Comcast saw a small amount of noise (also evident in some of the packet traces as occasional packet loss) due to the TV cable also being plugged in (the previous owner of my house loved TV, and the TV cabling wanders all over the house). For later datasets, I eliminated this source of noise; the cable tested clean at the Comcast end, and the loss is gone in subsequent traces. This cable is therefore as good as it gets outside a lab, with very low loss. You can consider some of these traces close to lab quality. Comcast has since confirmed my results in their lab.

Another objection I’ve heard is that ICMP ping is not “reliable”. This may be true when pinging a particular node that is loaded, as the ping may be handled on the node’s slow path. However, the major packet loss is actual packet loss (as is clear from the TCP traces). I personally think much of the “lore” I’ve heard about ICMP is incorrect and/or a symptom of the bufferbloat problem. I’ve also worked with the author of httping, adding support for persistent connections, so that there is a commonly available tool (Linux and Android) for doing RTT measurements that is indistinguishable from HTTP traffic (because it is HTTP traffic!). In all the tests I’ve made, the results for ICMP ping match those of httping. And in any case, TCP shows the same RTT problems that ICMP or httping does.

What’s happening here?

I’m not a TCP expert; if you are one, and I’ve misstated or missed something, do let me know. Go grab your own data (it’s easy: just run an scp to a well provisioned server while running ping), or you can look at my data.

The buffers are confusing TCP’s RTT estimator; the delay caused by the buffers is many times the actual RTT on the path. Remember, TCP is a servo system, which is constantly trying to “fill” the pipe. So by not signalling congestion in a timely fashion, there is *no possible way* that TCP’s algorithms can determine the correct bandwidth at which to send data (it needs to compute the delay/bandwidth product, and the delay becomes hideously large). TCP sends data a bit faster (the usual slow start rules apply), re-estimates the RTT from that, and sends data faster still. Of course, this means that even in slow start, TCP ends up trying to run too fast. Therefore the buffers fill (and the latency rises). Note the actual RTT on the path of this trace is 10 milliseconds; TCP’s RTT estimator is misled by more than a factor of 100. It takes 10-20 seconds for TCP to get completely confused by the buffering in my modem; but there is no way back.
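The delay/bandwidth arithmetic above can be sketched in a few lines (the 10 Mbit/s bottleneck and 1 MB buffer are illustrative assumptions; only the 10 ms path RTT comes from my traces):

```python
# TCP only needs bandwidth * RTT bytes in flight to fill the path;
# everything beyond that sits in a queue and inflates the measured RTT.

def bdp_bytes(rate_bps, rtt_s):
    # bandwidth-delay product in bytes
    return rate_bps * rtt_s / 8

base_rtt = 0.010                 # the ~10 ms path described above
bdp = bdp_bytes(10e6, base_rtt)  # hypothetical 10 Mbit/s bottleneck
print(f"bandwidth-delay product: {bdp:.0f} bytes")

# A 1 MB buffer at that bottleneck adds 8 * 1 MB / 10 Mbit/s of
# queueing delay once it fills, swamping the real RTT:
rtt_full = base_rtt + 1_000_000 * 8 / 10e6
print(f"measured RTT with a full buffer: {rtt_full:.2f} s "
      f"({rtt_full / base_rtt:.0f}x the real RTT)")
```

A megabyte of buffer on such a link is roughly eighty times the data the path can actually hold; all the rest is pure standing latency.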

Eventually, packet loss occurs and TCP tries to back off; a little bit of headroom reappears, but TCP then exceeds the bottleneck bandwidth again very soon. Wash, rinse, repeat… High latency with high jitter, with the periodic behavior you see. This is a recipe for terrible interactive application performance. And it’s probable that the device is doing tail drop; head drop would be better.
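That fill/drop cycle shows up even in a toy simulation (all the numbers here are made up for illustration: a 100 packet/s bottleneck behind a 250-packet tail-drop FIFO, with a crudely AIMD-ish sender):

```python
# Toy model: a tail-drop FIFO in front of a bottleneck link. The sender
# ramps up additively and halves its rate on loss; once its rate exceeds
# the link rate, the queue fills and the standing queueing delay
# approaches buffer_size / link_rate.

def simulate(link_pps=100, buf_pkts=250, ticks=2000):
    tick = 0.01           # seconds per simulation step
    send_rate = 10.0      # packets per tick (starts well above link rate)
    queue = 0.0           # packets waiting in the FIFO
    delays = []           # queueing delay a new arrival would see, per tick
    for _ in range(ticks):
        queue += send_rate                         # arrivals this tick
        queue = max(queue - link_pps * tick, 0.0)  # link drains the queue
        if queue > buf_pkts:    # overflow: tail drop, multiplicative backoff
            queue = buf_pkts
            send_rate /= 2.0
        else:                   # no loss: additive increase
            send_rate += 0.05
        delays.append(queue / link_pps)            # time to drain the queue
    return delays

delays = simulate()
print(f"worst queueing delay: {max(delays):.2f} s")
```

The queue spends nearly all its time pinned near full, so the latency saturates at about buffer/link-rate (2.5 seconds here, on what would otherwise be a 10 ms path) with only a small sawtooth around each loss event.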

There is significant packet loss as a result of “lying” to TCP. In the traces I’ve examined using the TCP STatistic and Analysis Tool (tstat), I see 1-3% packet loss, a much higher loss rate than a “normal” TCP should be generating. So in the misguided idea that dropping data is “bad”, we’ve now managed to build a network that is both lossier and exhibits more than 100 times the latency it should. Even more fun is that the losses come in “bursts.” I hypothesize that this accounts for the occasional DNS lookup failures I see on loaded connections.

By inserting such egregiously large buffers into the network, we have destroyed TCP’s congestion avoidance algorithms. TCP is used as a “touchstone” of congestion avoiding protocols: in general, there is very strong pushback against any protocol which is less conservative than TCP. This is really serious, as future blog entries will amplify. I personally have scars on my back (on my career, anyway), partially induced by the NSFnet congestion collapse of the 1980’s. And there is nothing unique here to TCP; any other congestion avoiding protocol will certainly suffer.

Again, by inserting big buffers into the network, we have violated the design presumption of all Internet congestion avoiding protocols: that the network will drop packets in a timely fashion.

Any time you have a large data transfer to or from a well provisioned server, you will have trouble. This includes file copies, backup programs, video downloads, and video uploads. A generally congested link (such as at a hotel) will also suffer. Or if you have multiple streaming video sessions going over the same link, in excess of the available bandwidth. Or running current bittorrent to download your Linux ISOs. Or Google Chrome uploading a crash to Google’s server (as I found out one evening). I’m sure you can think of many others. Of course, to make this “interesting”, as in the Chinese curse, the problem will come and go mysteriously as you happen to change your activity (or as things you aren’t even aware of happen in the background).

If you’ve wondered why VOIP and Skype have been flaky, stop wondering. Even though they are UDP based applications, it’s almost impossible to make them work reliably over links with such high latency and jitter. And since there is no traffic classification going on in broadband gear (or other generic Internet service), you just can’t win. At best, you can (greatly) improve the situation at the home router, as we’ll see in a future installment. Also note that broadband carriers may very well have provisioned their telephone service independently of their data service, so don’t jump to the conclusion that their telephone service won’t be reliable.

Why hasn’t bufferbloat been diagnosed sooner?

Well, it has been (mis)diagnosed multiple times before; but I believe the full breadth of the problem has been missed.

The individual cases have often been noticed, as Dave Clark did on his personal DSLAM, or as noted in the Linux Advanced Routing & Traffic Control HOWTO. (Bert Huber attributed much more blame to the ISPs than is justified: the blame should primarily be borne by the equipment manufacturers, and Bert et al. should have made a fuss in the IETF over what they were seeing.)

As to specific reasons why, these include (but are not limited to):

We’re all frogs in heating water; the water has been getting hotter gradually as the buffers grow in subsequent generations of hardware and memory becomes cheaper. We’ve been forgetting what the Internet *should* feel like for interactive applications. Us old guys’ memories are fading of how well the Internet worked in the days when links were 64Kb, fractional T1 or T1 speeds. For interactive applications, it often worked much better than today’s Internet.

Those of us most capable of diagnosing the problems have tended to opt for the highest bandwidth tier of service from ISPs; this means we suffer less than the “common man” does. More about this later. And anytime we try to diagnose the problem, it is most likely that we ourselves were the cause; as soon as we stop whatever we were doing that caused “Daddy, the Internet is slow today”, the problem vanishes.

It takes time for the buffers to confuse TCP’s RTT computation. You won’t see problems on a very short (several-second) test using TCP (you can test for excessive buffers much more quickly using UDP, as Netalyzr does).

The most commonly used system on the Internet today remains Windows XP, which does not enable TCP window scaling by default and so will never have more than 64KB in flight at once. But bufferbloat will become much more obvious and common as more users switch to other operating systems and/or later versions of Windows, any of which can saturate a broadband link with merely a single TCP connection.
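The 64KB ceiling is easy to quantify (a sketch; 65535 bytes is the classic maximum receive window without the window scale option, and the RTTs below are illustrative):

```python
# Without window scaling, a TCP connection can have at most 65535 bytes
# unacknowledged, so a single connection's throughput is capped at
# window / RTT, no matter how fast the link is.

def max_throughput_mbps(window_bytes=65535, rtt_s=0.1):
    return window_bytes * 8 / rtt_s / 1e6

for rtt_ms in (10, 50, 100):
    cap = max_throughput_mbps(rtt_s=rtt_ms / 1000)
    print(f"RTT {rtt_ms:3d} ms -> at most {cap:5.1f} Mbit/s per connection")
```

Note the feedback at work: on a 100 ms path one XP connection tops out around 5 Mbit/s, and as soon as a buffer starts to fill and the RTT grows, the cap falls further, which is part of why XP-era traffic so rarely kept broadband buffers full.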

In good engineering fashion, we usually do a single test at a time, first testing bandwidth and then latency separately; you only see the problem if you test bandwidth and latency simultaneously. None of the common consumer bandwidth tests measure latency under load. That is what I did for literally years as I tried to diagnose my personal network. Unfortunately, the emphasis has been on speed: the Ookla speedtest.net and pingtest.net, for example, are really useful, but they don’t run their tests simultaneously with each other. As soon as you test for latency and bandwidth together, the problem jumps out at you. Now that you know what is happening, if you have access to a well provisioned server on the network, you can run tests yourself that make bufferbloat jump out at you.

I understand you may be incredulous as you read this: I know I was when I first ran into bufferbloat. Please run tests for yourself. Suspect problems everywhere, until you have evidence to the contrary. Think hard about where the choke point is in your path; queues form only on either side of that link, and only when the link is saturated.

Acknowledgements

My thanks to the many who have helped with the cracking of this case, including Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Scott Bradner, Steve Bellovin, Greg Chesson, Dick Sites, Ted T’so, and quite a few others. And particularly to the ICSI Netalyzr developers, without whose work I’d still be wondering whether what I saw at home and in New Jersey was a fluke.

Conclusions

All broadband technologies are suffering badly from bufferbloat, as are many other parts of the Internet.

You suffer from bufferbloat nearly everywhere: if not at home or at your office, then when you travel. Many hotels are now connected by broadband connections, and you will often suffer grievous latency and jitter there, since they have not mitigated bufferbloat and are sharing the connection with many others. (More about mitigation strategies soon.) How easy or difficult those technologies are to fix clearly depends on their details; full solutions depend on active queue management, though some other mitigations are possible (just set the buffers to something sane; they are often up to a megabyte in size now, as the ICSI data show), as I’ll describe later in this sequence of blog posts.

Bufferbloat is a serious, widespread problem, the full severity of which will become clearer in subsequent postings.

169 Responses to “Whose house is of glasse, must not throw stones at another.”

Is this exclusively about latency, or could it also explain situations where a large file transfer initially saturates the “last hop” link, but slows down to ~10% of theoretical bandwidth after a few megabytes are transferred, and stays that way until completion?

I’d have to see data to know (I’m not volunteering to go look at yours either; I have plenty of fish frying). I’ve seen high packet loss rates at times, but I haven’t caught anything like 90% loss rate in my experiments.

Certainly Powerboost (and similar features from other broadband providers), don’t make a 90% difference in bandwidth performance; they might get you temporarily a factor of 2-5 more than your provisioned bandwidth at most.

It’s not that simple, with a modern TCP: fast retransmit and SACK can paper over a lot of sins. But there may be circumstances where things go badly wrong. I suggest you take a packet capture, and see if you can get someone with real TCP expertise to take a look at it.

Yes, you are entirely correct. Of course, editing your registry on Windows is hazardous to your machine’s health, so few enable it. It’s mostly interesting as it bears on why bufferbloat (and the problems it has caused) went so long before widespread diagnosis, as future posts will make clear. It is also why I tend to lose sleep at night: as traffic finally shifts away from old TCPs and XP finally retires, I worry about the problem becoming more severe.

Most traffic is initiated by Windows XP, given its (finally dropping) dominance on the net. So correct me if I’m wrong, but that tells me that we’ll still see most XP initiated TCP sessions running without window scaling.

Linux has traffic shaping capabilities that can be used to work around this problem. My home setup involves a Linux router sitting in front of the DSL modem. Traffic from the router to the DSL modem is rate limited to a couple percent slower than the actual DSL link speed, so that buffering will occur in the router rather than in the modem. Then, one can configure buffering behaviors in the router:
– limiting buffer size, to control latency;
– fair queuing (typically SFQ in linux), so that individual high throughput connections might still have high latency but they at least won’t impact the latency of lower throughput connections;
– or, any combination of the above strategies.

Really, the Linux traffic shaping stuff is very powerful and underused. I wish broadband hardware manufacturers would all do something similar in their hardware (even better if they’d make it configurable, of course).
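For the curious, the commenter’s setup can be sketched with the Linux tc tool (the interface name and the rates here are assumptions; substitute your own interface and a rate a couple percent below your measured link speed). This sketch only prints the commands rather than running them, since applying them requires root:

```python
# Sketch of the commenter's shaping setup: HTB rate-limits the traffic
# toward the modem to just under the link speed, so the queue forms in
# the router (where we control it), and SFQ gives per-flow fairness so
# one bulk transfer can't starve interactive flows.

IFACE = "eth0"      # interface facing the DSL modem (assumed)
RATE = "950kbit"    # a couple percent below an assumed 1000 kbit/s sync rate

commands = [
    f"tc qdisc add dev {IFACE} root handle 1: htb default 10",
    f"tc class add dev {IFACE} parent 1: classid 1:10 htb rate {RATE} ceil {RATE}",
    f"tc qdisc add dev {IFACE} parent 1:10 handle 10: sfq perturb 10",
]
for cmd in commands:
    print(cmd)      # pipe into a root shell to apply
```

The key design point is the first one: shaping below the link speed moves the bottleneck (and hence the queue) out of the modem’s bloated, unmanaged buffer and into a qdisc you can actually configure.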

Yes, it does, as I’ll explain in detail in a future post. OpenWRT variants such as Gargoyle do this. It doesn’t require special hardware at all, just use of existing facilities (though RED also has some problems, as I’ll also cover). This is the mitigation I’ve referred to in my post. But as the posts are long enough as it is, I didn’t want to try to cover it immediately.

Note that classification is not sufficient: you also need to run some form of AQM, or you’ll still have problems.

However, it’s clear they aren’t doing everything they should: such as running (G)RED on the local routing.

I’ve also been aware of this problem since May 2009, when I noticed that high latency was correlated with a saturated upload link. Initially I thought it was something BitTorrent specific, what with the 100+ connections, but it was just a wild guess. The key moment for me was when I realized that even a single uploading connection as with a speed test was capable of increasing the latency of the connection. At that point I understood why, because all of the QoS related information out there for systems like OpenWRT and Tomato mention the buffering issue as something that has to be worked around for the QoS to be able to provide good latencies for high priority packets. I’m amazed at how much more time you had to spend to diagnose this, but I’m happy you’re taking up the cause, and I look forward to your advice for mitigating it, given the level of detail you’ve put into this post.

The time has been spent mostly looking elsewhere than in the broadband link; that was clear quickly as soon as I had traces and saw the Netalyzr data.

Since it quickly became clear the problem is much more widespread than the broadband edge, the time has gone into building a strong enough case that I now hope everyone will stop and think deeply about whether their piece of the Internet system suffers from bufferbloat. Dave Reed tried to warn everyone over a year ago about bufferbloat in 3G network systems, and despite his deep expertise in Internet technology (he’s a co-author of the famous “end to end” design paper), ended up not “making the case” well enough to convince the jury. Some have axes to grind.

The immediate reaction I’ve received on quite a few occasions, including in my own company, has been incredulity.

“nothing bad can be happening”
“but dropping any packet is horrible and wrong”
“I don’t understand”

Just to give a small sample of what I’ve heard over the last few months. It helps that I’ve had a bit of success with this quest; I know of at least one product we’ll be shipping which will work well rather than badly, having had a bloatectomy. And that device will therefore likely work much better than its competition; I certainly hope it does well when it reaches the market.

In terms of technical insight and investigative ability this was a HUGE hit out of the ballpark. Way out. You not only got a home run, you hit it over the stands, past the parking lot and it’s bouncing over the highway as we speak.

Internet history in the making. No question about it at all.

Beyond belief you deserve the gratitude of, well, anybody with high speed internet access to the internet.

Words escape me. All I can do is shake my head in awe. Completely awesome.

As always, we are on the shoulders of other giants: the area of congestion management was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I’ve done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than “conventional” internet core routers, where it was pretty fully explored in the 1980’s and 1990’s.

And chance is very important: aiding me was knowing some of the players here, so that when I smelled smoke, they could diagnose the fire, giving me the confidence to dig deeper and look further. So in part, it’s being in a particular place at a particular time.

I’ve been working in the area of video streaming over TCP for a number of years. In the course of that work, I’ve noticed some of the pathologies in last-mile broadband access too. A lot of the time, they seem to be due to shapers that appear to be applied probabilistically. It seems to me that if you are a long fat TCP flow, odds are high you will be lumped into the “smells like bittorrent” category by the “traffic management” gear of the ISP.
When that happens, the shaper kicks in, and the buffering is horrendous.

For a kind of crazy workaround, you might find a paper we published in the ACM Multimedia Systems 2010 conference to be entertaining:

In particular, we designed an automated failover mechanism into our protocol above TCP, called Paceline. Basically, when TCP delay goes off the chart, we kill the connection and continue on a fresh one. I usually explain this as based on the human behavior that is the ‘stop-reload’ cycle everyone does when their web browsing session stalls. Only in Paceline, we automate it. It was not designed to address the above shaper issue, but I’ve noticed that it often does so in practice. The connections will failover for a few seconds, and then one will seem to break free of the shaper and be good to go for tens of seconds or even minutes. I had a good chuckle when I first noticed it in action. :)

Yuck. Engineering around brokenness. Let’s get the brokenness fixed… Or the kludge tower that is the Internet will teeter yet more, and someday we’ll fall over (something I now actually fear, as I’ve alluded to in my posting and will discuss in more detail soon).

You may be aware of this, but it is important to remember that TCP’s window size is the minimum of 1) receive buffer size, 2) send buffer size, and 3) the window determined by the congestion control algorithms (which tracks the bandwidth-delay product). When network buffers get massive, it is very possible that 1) or 2) will be smaller than 3), so you could say the effect is to “turn off” congestion control in elephant flows most of the time. From their point of view, they are in a LAN-like, ACK-paced mode: they simply send data on receipt of every ACK, and leave the actual rate determination to lower network layers. I have long suspected/wondered whether whoever engineered current traffic management practices has done this by design: the goal isn’t to “break” congestion control, but instead to re-assign responsibility to a different entity, from end-host to ISP-managed devices (broadband modems and traffic management gear).
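If the sending limit really is the minimum of those three quantities, a toy calculation (all numbers purely illustrative) shows how bloat sidelines the congestion window:

```python
def effective_window(rcv_buf_bytes, snd_buf_bytes, cwnd_bytes):
    """Bytes TCP may have in flight: the smallest of the three limits."""
    return min(rcv_buf_bytes, snd_buf_bytes, cwnd_bytes)

# Healthy path: the congestion window is the binding limit, so loss
# feedback actually regulates the sending rate.
assert effective_window(262_144, 262_144, 65_536) == 65_536

# Bloated path: queueing hides loss, the congestion window balloons
# past the socket buffers, and the flow degenerates into pure
# ACK-clocking at whatever rate the bottleneck queue drains.
assert effective_window(262_144, 262_144, 2_097_152) == 262_144
```

Once a socket buffer is the binding term, loss-based feedback has essentially nothing left to regulate.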

You have done a lot of good measurements. I’ve found some insights through end-host instrumentation; specifically, I’ve wired up some of Linux’s TCP_INFO sockopt statistics (buffer sizes, window size, RTTs, RTOs, etc.) to a user-level trace tool (of my own writing). Watching the actual values used inside of TCP is quite informative.
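A minimal sketch of that kind of instrumentation, assuming Linux’s struct tcp_info layout from linux/tcp.h (the option number and byte offsets are hardcoded here, so this is deliberately Linux-only and may drift across kernel versions):

```python
import socket
import struct

TCP_INFO = 11  # Linux socket option number for struct tcp_info

def parse_tcp_info(buf):
    """Decode a few congestion-related fields of Linux's struct tcp_info.

    Layout per linux/tcp.h: 8 bytes of u8 state/flag fields, then a run
    of u32s; tcpi_rtt lands at byte offset 68, tcpi_snd_cwnd at 80.
    """
    rto_us, ato_us, snd_mss, rcv_mss = struct.unpack_from("=4I", buf, 8)
    rtt_us, rttvar_us = struct.unpack_from("=2I", buf, 68)
    snd_cwnd = struct.unpack_from("=I", buf, 80)[0]
    return {"rto_us": rto_us, "snd_mss": snd_mss, "rtt_us": rtt_us,
            "rttvar_us": rttvar_us, "snd_cwnd": snd_cwnd}

def tcp_info_snapshot(sock):
    """Fetch and decode TCP_INFO from a connected TCP socket (Linux only)."""
    return parse_tcp_info(sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 104))
```

Sampling this periodically on a saturated connection makes cwnd growth and RTT inflation visible in real time.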

As for the kludginess of Paceline, yea well “kludge” vs “pragmatic, balanced and elegant solution given the context” is always a subjective assessment. ;)

Thanks for delving into this! I’ve been wanting to get to the bottom of this ever since writing the Linux Advanced Routing & Traffic Control HOWTO.

I indeed noticed this problem way back in 1999 or so. You correctly note that the blame should fall on equipment manufacturers, but back then consumers did not have any choice in the matter. You got the equipment your cable company or DSL provider selected for you.

Also, at the time, there was a huge and almost exclusive focus on ‘DOWNLOAD SPEED’, and modems were clearly optimized to generate as much of that as possible, disregarding any latency impact.

About raising a stink in the IETF, I don’t know. At the time I did not see the (European) Internet Service Providers I was working with interact with the IETF much.

Certainly the ISPs share the responsibility for the problem with the equipment vendors; the monomaniacal focus on bandwidth has cost us all tremendously, and we need to change the conversation from bandwidth alone to some combined bandwidth/latency metric to make progress. We also have to make this a competitive issue: without shining the light of day onto the problem and turning it into a competitive situation, bufferbloat won’t get eliminated in finite time.

Van Jacobson pointed out to me that the problem goes back a long way, when DARPA walked away from funding most network research over a decade ago: this left research in how to handle dynamic range of bandwidth completely in the lurch; NSF has primarily been interested in just “go fast” to connect scientists to super computers. So nobody has been minding the store, and doing research on many orders of magnitude of differing performance is far from a fully solved problem and needs serious research. Dynamic range of adaptive behavior is as hard as absolute performance; we’ve only been looking at absolute performance for over a decade.

As to the IETF, even in 1997, when I was still working in HTTP extensively, there was enough representation that the word might have spread; the Nordic countries were already very clueful in particular. The IETF is both very similar to and slightly different from the FOSS community (the two share some heritage, both having been spawned out of the academic research communities decades ago). There was certainly heavy representation from all the equipment manufacturers. Somehow we need to break down the barriers that have somewhat separated the communities, as there is much that can/should be shared.

And yes, mitigation of bufferbloat is (partially) possible via the Wondershaper and techniques like those Paul Bixel is attempting in his recent work on Gargoyle (I haven’t yet tried them out, but hope to soon). Just remember that the problem is more general, and not confined to the router/broadband hop; as my experiments show, we also have to fix even local traffic (to your storage and other boxes at home).

And the base OSes all have problems to some degree or another. We have a mess everywhere, so be careful with stones…

The Wondershaper should be updated to use the tc options “linklayer” and “overhead”.
These solve the problem of having to “reduce” the bandwidth to achieve queue control, since the options (e.g. linklayer adsl) take the ADSL overhead and framing into account.

The options (which I implemented) have been included in mainline kernels since 2.6.24, and in tc/iproute2 since version 2.6.25.

Guess word of this (now old) option hasn’t been spread… sorry about that.
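For illustration, a hypothetical HTB configuration using those options on a 512kbit/s ADSL uplink (the device name, rates, and 40-byte overhead are assumptions to adapt, not recommendations):

```shell
# Hypothetical example: shape a 512 kbit/s ADSL uplink to slightly
# below sync rate, letting tc account for ATM cell framing so its
# rate computation matches what actually fits on the wire.
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb \
    rate 450kbit ceil 450kbit overhead 40 linklayer adsl
```

Without the linklayer/overhead accounting, ATM’s 48-of-53-byte cell payload forces you to shape well below the nominal rate before the queue actually moves into your control.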

Much of the “tuning” information I’ve seen is out of date (often now completely broken, or superseded by other problems; e.g. classification may be completely ineffective if your device driver is doing buffering underneath you), and is mostly aimed at “go fast” tuning for supercomputers, not what most users want/need. This is part of why I think real solutions should “just work”; expecting everyone to figure out the right “default” is a recipe for failure.

I need to turn on a wiki I have set up to help with this problem and have a place for everyone to work together on this. Maybe next week. Getting Slashdotted today hasn’t helped.

It does not compete successfully with other TCP implementations. Further, it appears to be confused by the number of retries in a modern wireless connection, mis-estimating the length of the path (resulting in a slowdown).

This blog post explains a lot. Recently I upgraded my cable internet connection from 2 to 10 Mbps and also switched from Windows XP to 7, and I’ve experienced incredible sluggishness and outright connection resets when downloading even a single file that saturates the pipe. The one file I’m downloading with wget comes along really nicely at a steady 1.1 MB/s, but all the other connections I have open (like ssh and irc) reset within 30 to 60 seconds of starting the download.

Yes, you’ll see more problems having switched to Windows 7, due to its implementing TCP window scaling by default. Exactly how bad things can get, I don’t really know; I haven’t seen connection resets in my controlled experiments, but I have seen DNS lookup failures.

How much pain you will suffer depends on the combination of buffering amount and bandwidth (in each direction): the delay added under load is roughly the buffer size divided by the link bandwidth.
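The back-of-the-envelope version of that relationship, sketched in a few lines (the buffer and link sizes are purely illustrative):

```python
def saturated_delay_s(buffer_bytes, link_bits_per_s):
    """Queueing delay added once a FIFO buffer fills at this link rate."""
    return buffer_bytes * 8 / link_bits_per_s

# A hypothetical 256 KB modem buffer draining into a 1 Mbit/s uplink
# adds about two seconds of latency for every packet behind it:
delay = saturated_delay_s(256 * 1024, 1_000_000)  # ~2.1 seconds
```

The same buffer on a 100 Mbit/s link would add only ~21 ms, which is why a fixed buffer size cannot be right across the range of deployed bandwidths.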

I have no data on which modems may be “good” or “bad” in terms of buffering.

I do have a SB6120 myself; tomorrow’s post will be how I’ve mitigated most of the pain in my broadband hop. I’m quite happy now…

Also, I’ve noticed that manually limiting the download speed results in oddly fluctuating speeds. I’ve seen this with both LeechFTP and wget — I’ve tried limiting the download speed to 800 kB/s to avoid having my other connections reset, and the download speed fluctuates between 100 and 1000 kB/s on my cable modem. On an ethernet connection at work (presumably a fiber optic link without bufferbloat at any point) everything downloads at a steady 800 kB/s when I try this.

This may not be a “real” effect you are seeing; the fluctuation may be primarily in the accounting. If you look at my traces, you’ll see bursts of dup’ed ACKs in them and bursts of SACKs. Most data did not get dropped, but the ACKs certainly end up getting piled together. TCP gurus can better explain what’s going on; I’m not such a guru.

When I upload large files over my cable connection, I always see it go in bursts with a period of about a second. I had assumed that this was something inherent in the way the cable system imposed my upstream bandwidth limit, i.e. it was setting a quota of bytes per second. Now I suspect that the data on the cable is going at a constant rate, and the one-second burstiness is a function of the buffer size in my modem.

So the question is, what can be done about it? All of my network gear has some sort of web interface where lots of things can be tweaked, but I’ve never seen anything to change buffer sizes. I wonder if it’s possible in principle to change the buffer sizes in typical devices by changing the software, or whether the buffers are at a lower level in the hardware?

If you are running Linux, have a look at its pluggable congestion control algorithms. See . Some algorithms rely on RTT measurements rather than packet loss for feedback on congestion. Try changing to TCP Vegas and see if that solves the problem.
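For anyone who wants to experiment on Linux, switching the algorithm is a one-line sysctl (which modules are available varies by kernel build):

```shell
# Show which congestion control algorithms this kernel offers:
sysctl net.ipv4.tcp_available_congestion_control
# Switch the system default to the delay-based Vegas (needs root,
# and the tcp_vegas module must be available):
sudo sysctl -w net.ipv4.tcp_congestion_control=vegas
```

Note this only changes the sender’s behavior; it does nothing about buffering on paths where the traffic originates elsewhere.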

Actually, much of what that link describes is *exactly* the kind of bandwidth maximization tuning that got us into this mess, and is often obsolete information to boot. For example, at this date Linux automatically tunes its socket buffer sizes, making a class of “optimization” obsolete (along with some of the reason for some of the buffering).

What is more, as I said in a previous post: *there is no single right answer* for buffer sizes. The challenge is how to do buffer management in a fully automatic way. More about that to come…

And I certainly hope it doesn’t come to your jocular “The Terrible Internet Buffer Overrun Disaster of 2012”, though I have been losing sleep over it. Destroying TCP’s congestion avoidance algorithms is a recipe for disaster.

I’ve been arguing with Verizon in the UK for the last year about insane RTT times on our E1. They always pointed at over-utilisation, but it just didn’t make sense to me that RTT would be knocked to pieces by a single FTP session. As you mention in the article, for a while now the whole network has just ‘felt’ wrong in a way I struggled to explain, but knew wasn’t right. Also as you suggest, I had pretty much given up the ghost on working it out and have just been throwing bandwidth at the problem, with very limited success. But reading this, suddenly it all makes sense. Not sure where to go from here, but at least I know I’m not mad.

Looking at your quoted comment above, “but dropping any packet is horrible and wrong”, I can’t help but think how many ISP SLA documents have specific compensation clauses about levels of packet loss. Would it be fair to suggest that this builds in an inherent motivation for the ISPs to increase buffer size in order to prevent packet loss and therefore reduce compensation payouts? Even if by doing so they break the network? Note, I’m not suggesting this is a Machiavellian plot, but simply an unintended consequence of how ISP contracts are written.

I just had a look at the SLA for our E1. Interestingly, we get a direct commitment to packet loss based on a percentage per month from ingress to the network (i.e. our managed router) to the point where they hand it off to the next provider or destination, but there doesn’t appear to be any exception for over-utilisation. There is also a latency SLA, but this only applies across the core network, not the last mile. So if my local tail is buffered up to the eyeballs and causing latency, but not dropping packets, neither SLA will kick in. This structure makes sense when the average throughput is lower than the total bandwidth, as was typically the case historically. But in the modern age, when any pipe can be filled no matter how fat it is, it doesn’t make sense any more. Before any technical fix can be applied, the ISPs need to alter their SLA structure so packet loss from over-utilisation is exempted from compensation; otherwise any attempt to fix this will just trigger lots of invalid payments to customers.

Note that Verizon has made the same mistake as Comcast, as have AT&T, the hardware manufacturers, the operating system folks, and so on. We’re all living in glass houses, so be gentle and leave the stones behind; go forth and educate…

Telling everyone “they are stupid”, or “they screwed up”, when it’s “we were complacent”, and “we all screwed up” won’t be at all helpful; it is why I chose the title I did for this posting. At some point late in this process of blogging, I’ll show bufferbloat in application software as well, just to complete the journey. It’s “we” who have made/are making this mistake.

Certainly, there may have been unintended consequences of SLA contracts; but as the last SLA I ever worried about was about 15 years ago, I’m hardly someone to comment on the perverse incentives that may have entered the system.

See Nagle’s RFC 970 “On packet switches with infinite storage”. Even in 1985 the early roots of this problem were visible.

Note also that Nagle suggests dropping the *last* packet in the host’s queue when one must be dropped. If we want drops to produce rapid feedback, dropping the *first* one in the queue would notify the receiving host earlier that there’s a problem.
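A toy sketch of the two drop policies (class and method names are invented for illustration), showing why head drop surfaces the congestion signal sooner:

```python
from collections import deque

class BoundedFifo:
    """A fixed-capacity FIFO supporting both drop disciplines."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()

    def enqueue_tail_drop(self, pkt):
        """Classic tail drop: the arriving packet is discarded, so the
        sender only learns of congestion after the whole standing queue
        ahead of the gap has drained."""
        if len(self.q) >= self.capacity:
            return pkt  # the dropped packet
        self.q.append(pkt)
        return None

    def enqueue_head_drop(self, pkt):
        """Head drop: discard the oldest packet instead. The sequence gap
        reaches the receiver a full queue-length sooner, so duplicate
        ACKs (the congestion signal) come back much earlier."""
        dropped = None
        if len(self.q) >= self.capacity:
            dropped = self.q.popleft()
        self.q.append(pkt)
        return dropped
```

With bloated queues the difference is dramatic: tail drop defers the loss signal by the entire (possibly multi-second) queue, head drop does not.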

and RFC 896, also by John Nagle, is worth reminding yourselves of: I’ve been alluding to congestion collapse, and we all need to remember what was said in 1984 as I move on to that topic…

Right, of course, is in the eye of the beholder: real AQM (RED or something better; classic RED has not one, but two bugs, according to Van when I talked to him in August) is also better than head drop, as the queues never grow to such a huge size (remember, they are now often orders of magnitude bigger than they should be, and there is no “single right answer” to the question). I’ll move onto that topic soon as well.

One of the strangest support calls we got at PSINet was “web pages load 1/2 way and stop”, which was tracked to a bad buffer on a router on our network. My home lab has a DOCSIS 3 modem load-balanced with FIOS 20/20 using a Vyatta software router. I’d be interested in the mitigations.

Tomorrow. Today’s posting will cover a couple of areas where I know the issue has affected the network neutrality discussions. Since they are ongoing, I want to inject a bit of insight (and opinion) on that topic now; I feel I can’t take my time and come back to it later if the observations are to illuminate the discussion (whatever side of the debate you may be on).

a) Never tell the application to stop sending, i.e. no back pressure: if the application has stuff to send, let it send;
b) treat everything equally, giving no special treatment to any packets, which is what the net neutrality types are always crying for.
Now this may not be the behavior that is desirable, but it seems to be in line with what has been advocated for many years.

My view on NN is that the network is supposed to do what I ask it to (it’s what I’m paying my ISP to provide service for, and my money may at times need to go further than my immediate ISP in the form of traffic exchange agreements), and do it with some fairness when sharing is required. Note that I personally don’t have problems with paying extra for premium performance at busy times of day (which is why it makes me sad that the best mitigation for bufferbloat right now defeats Comcast’s PowerBoost, which, in internet tradition, is trying to give me extra performance when it doesn’t cost them extra).

It’s having others make yea or nay decisions on what I can access and/or get decent service for that makes my hackles rise immediately and will get me all worked up on the topic. I should be able to choose the service I get, and without the “bundling” disaster that has made me pull the plug on cable TV.

Having a network which is neither fair under load nor free of operational nightmares for users and ISPs alike is a jointly losing strategy. That’s where we are today: a lose-lose situation, if ever there was one.

Namely, that by doing delay-based estimation it becomes possible to divorce the point of control (the ‘congestion notification’, a.k.a. where to drop the packets) from the bottleneck, allowing you to get active queue management even when the queues themselves don’t support it.

Thus a properly equipped in-path device (like a WRT system) could kludge-fix ALL the paths going through it for buffer problems, without needing to be the bottleneck itself (unlike conventional traffic shaping).

This will fail if there are some flows through the bottleneck that aren’t controlled by the RAQM device, but otherwise should work very well.

And there also is a 90% solution in queue engineering: queues sized in delay rather than capacity.

If a queue is considered ‘full’ when the oldest packet is > 200ms old, this will still allow good across-the-world bandwidth (you need a minimum queue size of ~bandwidth*delay/sqrt(N), so with 200ms US-to-Europe ping times, 200ms is big enough for most).

This is NOT optimal (the optimal size should be based on measured RTTs and dynamically changed), but it’s at least in the right ballpark: you still get full-rate TCP throughput on reasonable cross-the-planet links, and you add a maximum delay of 250ms latency in the worst case.
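A sketch of such a delay-sized queue, with an explicit clock parameter so the behavior is easy to reason about (the 200ms constant is the compromise under discussion; names are illustrative):

```python
from collections import deque

class DelaySizedQueue:
    """A queue that is 'full' when its oldest packet has waited too
    long, rather than when a byte or packet count is exceeded."""

    def __init__(self, max_delay_s=0.200):
        self.max_delay_s = max_delay_s
        self.q = deque()  # entries are (enqueue_time, packet)

    def enqueue(self, pkt, now):
        # If the head of the queue is older than the delay budget,
        # treat the queue as full and drop, signaling congestion.
        if self.q and (now - self.q[0][0]) > self.max_delay_s:
            return False
        self.q.append((now, pkt))
        return True

    def dequeue(self):
        return self.q.popleft()[1] if self.q else None
```

Because fullness is measured in time, the same queue admits deep bursts on a fast link but only a few packets on a slow one, which is exactly the property fixed-size buffers lack.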

Certainly the good is the enemy of the perfect; far be it from me to tell people not to do something less broken than what they currently do. I’m often seeing latencies in seconds; getting to 200ms would be a serious improvement. And RAQM may help mitigate the problem while we fix all the broken gear properly.

I will point out, however, that we can’t stop at 200ms (a number that networking people seem to like, as it is convenient and achievable with little thought or hard work). The reality of human interaction and the speed of light is that *any* additional unnecessary latency is often/usually too much. As a UI guy, my metrics have always been (since I learned this stuff first hand in the 1980’s) that:

No perceptible delay to all human interactions requires less than 20ms (rubber banding is hardest)

semi-tolerable rubber banding needs less than 50ms

typing needs to be less than 50ms to be literally imperceptible

typing echo needs to be less than 100ms to be usually not objectionable

echo cancellation gets harder as well (the best echo cancellation needs to be done as close to all participants as possible, even the latency over a broadband link is undesirable).

then there are serious gamers, where even a millisecond may be an advantage and the difference between life and death

don’t even get me started about the financial loonies that got us into our current economic mess

Given that vertical retrace at 60Hz puts you behind from the get-go (on average, you’ve lost 8ms right there, even with a really good OS and scheduler), and most paths are tens to a hundred milliseconds and we can’t repeal the speed of light, the problem is harder than most in the networking community tend to acknowledge. Even a gigabit switch, when loaded, may insert significant latency due to buffering.

I see latency as one of the great challenges for the networking/OS community. It’s probably as difficult as the “go fast” problem, and we want an internet that does both simultaneously without tuning, under load.

Lest people think this is unrealistic, I’ll point out that my rubber banding experiments were on a Microvax II on a 10Mbps ethernet in 1985; we got to 16ms over the X protocol over TCP on that local network then; it required that to get client side rubber banding to “feel” physically attached to the hand when running remotely. While client side java/javascript has relaxed the need some, it hasn’t gotten rid of all of the need, and the typing perception requirements are still necessary (as Google Instant has shown).

Here’s the reason why so many dirty-network types like me want to say “200ms and done” (Jim knows this; it’s more just for the record in general):

It’s because anything less than 200ms REQUIRES that every bottleneck in the path implement traffic classification and prioritization.

It is impossible to have simple queues which satisfy the requirements of both LPBs (“Low Ping Bastards”: any application with strong realtime components like first person shooters, VoIP, etc) and sustained TCP throughput at the same time.

Thus the choices are either a compromise that produces the maximum benefit for both types of traffic (~200ms is a very good answer, and RAQM can easily do this) or do a forklift upgrade on EVERY bottleneck in the Internet to do multiple-queue traffic prioritization.

You seem to be stuck with legacy information. Achieving low (zero) loss together with high (>0.98) goodput is indeed impossible without a major overhaul of deployed congestion control, and without separating (corruption) loss from congestion feedback.

You may want to learn about DCTCP (based on alpha/beta ECN TCP, a.k.a. ECN-hat), virtual queues (deployed in ATM, but that technology became too costly), and CONEX (re-ECN), which provides the basic signalling foundation to implement all that goodness.

And I no longer think there is a good excuse not to do full AQM going forward: a dual-issue gigahertz SoC with embedded NIC dissipates on the order of 1 watt and costs no more than $15 (right now). So something fully capable of “working right”, implementing full AQM, should be in hardware/software/firmware designs going forward, IMHO, and we can mitigate the problem to the order of 200ms as best we can in the existing plant until it gets swapped out. That I seek the ideal while also wanting the good ASAP is not a contradiction in my view.

This is why I’ve been talking both about mitigation and solution to the buffering problem, rather than a single “fix”.

I *really* don’t want people to leave under the impression that 200ms is good enough. Many who haven’t worked in the UI field don’t understand the UI realities that exist. And there is a market for equipment that doesn’t just work OK, but works well. I see it as a market opportunity for the equipment vendors that solve the problem properly.

No, if buffering is correct in a system, two (or more) TCP sessions should fairly share the link just fine even with the connection saturated. You should never be seeing latencies of the order we’re getting.

I don’t have the expertise to know if you’re right or not, but I’ve been noticing these sorts of issues for most of the last decade on and off, occasionally musing: TCP has features built in to avoid these problems, doesn’t it? Why don’t they work!

Recent experiences streaming random internet video to recent windows devices on a (locally) quiet network have been equally confounding. The video is supposed to degrade seamlessly, but instead I get sputtering high quality video — maybe the players are written wrong, but I’m suspicious.

I don’t quite have the expertise to judge, but intuitively this makes sense. Fortuitously there should be a much more hackable home router on my desk tomorrow. I shall follow along eagerly.

A couple of observations:
– there are a number of ‘rules of thumb’ which call for rather big buffers (google Villamizar tcp buffer size).
– TCP congestion avoidance works under the assumption that bit-error-induced packet loss is very rare. This is true for optical networks, but less true for copper. Wireless is extremely lossy; the link layer has to do some error correction or TCP would never get up to speed. This comes at the price of some buffering.

Hi Jim,
For someone who’s always been affected by the symptoms you describe here (thanks to the severely limited upstream links we have in Brazil), this was a very enlightening read. Thanks a lot!
Recently, though, I switched providers (also switching from cable to DSL) and bought the *cheapest* modem I could find. Much to my surprise, since then I seem to be able to upload at full-speed (which is still rather slow; I have a 400kbps uplink) without rendering the downlink unusable (as has always been the case). Now I’m wondering that maybe to cut costs on the modem they used a small buffer, which in turn doesn’t trick the TCP congestion avoidance algorithms. I guess that’s a possibility?

The problem with this theory is that you can’t even buy “small enough” DRAM chips these days for the “right size” buffers (not that there can be any “right size” in the first place; AQM is the “right” solution, as I’ll discuss later).

Most likely it was cheap because it was an old design from when memory was a significant cost issue, or because the designer’s firmware wasn’t riddled with bugs they were papering over (a common cause of bufferbloat, since latency has not been tested for properly, and if devices don’t meet bandwidth goals, they don’t get certified by carriers for use). Measure the bandwidth you get and the saturated latency, and you can easily compute the buffer size.

I was running into this with a 128kbit/s uplink I had once. The solution I used was to put a Linux box in front of the modem, capping my bandwidth at 100kbit/s using a tiny queue, effectively taking the modem’s queue out of the equation. I couldn’t do much about the downlink, though, except try to manage TCP ACKs.
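For the record, roughly the same trick on a modern Linux box is a one-line tbf qdisc (interface name and numbers are illustrative, sized for that 128kbit/s uplink):

```shell
# Cap upstream slightly below the modem's rate with a short queue, so
# packets queue here (under our control) instead of in the modem's
# bloated buffer; burst must be at least one MTU.
tc qdisc add dev eth0 root tbf rate 100kbit burst 1540 latency 50ms
```

The cost is the ~20% of raw upstream bandwidth sacrificed to keep the modem’s own queue permanently empty.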

I guess this is a bit of a tangent, but what you say about overly clean networks is often observed when tunneling TCP over TCP, as in the not uncommon “VPN over SSH” situation. Saturating such a network will tend to bring everything to a halt, since no packets are dropped in the upper layer.

I see you have mentioned this elsewhere, but my first reaction to reading this was “where art thou, ECN?”. This is the very problem ECN is meant to solve. Yet we don’t implement it because the short-term pain may be high.

I get IPv6 déjà vu when I think about it. But unlike IPv6, there is no D-day coming. I suspect that rather than wear the one-off pain and fix the problem with ECN, we will just let the water warm up until it becomes intolerable, then tweak a few queue lengths until it becomes just tolerable again. If there isn’t a collective effort put in by a few big players, that is where we will sit for time immemorial.

Oh, and for those thinking implementing QOS at home will fix the problem for you – it will only do that if you are the cause of the congestion in the cloud. If someone else is filling up the queues in an upstream router nothing you can do can change the latency you see. From Jim’s description, this is the situation he finds himself in.

Steve Bauer at MIT (and probably others) is researching the state of ECN suppression. Until that and/or other research is complete, it’s not clear if we can. A conversation with Steve a month or two ago makes me hope it may be usable and useful in some parts of the network, but the general answer isn’t in yet.

And yes, you can possibly help yourself with QOS locally for your VOIP in a limited environment like your home, but you really still have to manage your queues and get TCP behaving correctly. If you don’t, you run smack dab into congestion someplace.

And as ISPs aren’t necessarily managing queues, we have lots of messy problems. Others can help by starting to monitor their ISPs carefully with tools like smokeping (and educating them as to the issues, if they lack clues). In the limited probing I’ve done of Comcast’s network, it’s always been smooth as a baby’s behind, until that last killer mile. Other monitoring I’ve done (particularly from hotel rooms) makes me believe, as the anecdotal and other data suggest, that there are clueless ISPs out there. I’ll explain next week why there has been a reluctance to use AQM.

I’m guessing your data was captured directly on the sending machine, and that it has a NIC doing TSO.

The effect should be small, but because of the way you’ve captured the data, none of it can be truly trusted. The TCP RTT plots are measuring latency starting with a bogus TCP segment that hasn’t actually been transmitted yet. It still needs to be sliced into MSS-sized segments, which then need to be streamed onto your LAN.

Yes, serialization delay is low on the LAN, but it would be nice to see this data more accurately.

Do you perhaps have captures without TSO, or (better) taken by a 3rd party to the transaction?

At the time I took that data, I had no good way to take traces except on the transmitting system.

The particular laptops I’ve used to take data have Intel NICs, not Broadcom, and I don’t think Linux distros are typically playing traffic control games (though maybe they should). But with the current very large transmit rings, from what I’ve gathered in other comments to this blog, the traffic control would be ineffective anyway (until those buffers are cut down to size, as one person posted a patch to do).

My pings in the traces are exactly in line with what I observe from without (e.g. the DSL reports Smokeping data) in magnitude. If you are motivated, it would be interesting to look a bit further into the pings; Van Jacobson noted that since Linux happened to be used on both ends, there is already timestamp data in the traces.

I’ve since bought one of these port mirroring switches, which are quite inexpensive ($150). At some point, it would indeed be better to collect data that way (now that I’m able), and expect to do so before I do formal publication. For the next few weeks, I need to finish writing up what I know, update an overview presentation I did several months ago, and do a few other things, before circling back to try to write more formally and rigorously.

In any case, while I’d certainly like to retake the data before a formal publication, I encourage you to do your own experiments. The Netalyzr data shows buffering is dismayingly common (e.g. Nick Weaver at ICSI immediately reproduced similar results on his home connection as soon as I made contact with him toward the end of the summer). If anything, the Netalyzr data has underestimated the frequency of the problem (its UDP buffering test wasn’t aggressive enough to fill higher-bandwidth connections such as FIOS all the time, and can be confused by cross traffic). Broadband bufferbloat isn’t a rare phenomenon (worse luck).

You suggested that small packets might be priority queued. There has been research on one-way delay in 3G networks showing that large packets get through faster than smaller ones, at least in Sweden (I guess it’s dependent on the ISP’s hardware). See the links below; there seems to be a threshold at around 250 bytes. The authors claim the reason is that the technology in use changes from WCDMA to HSDPA around this point, resulting in lower latencies.

I think you would be very interested in reading my master’s thesis [1] (http://goo.gl/sBHtg), as it contains pieces to solve your puzzle.

I think you have actually missed what is really happening.

The real problem is that TCP/IP is clocked by the ACK packets, and on asymmetric links (like ADSL and DOCSIS), the ACK packets simply come downstream too fast (on the larger downstream link), resulting in bursts and high latency on the upstream link. See page 11 of the thesis for a nice drawing.
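The asymmetry pressure can be sketched with round numbers (illustrative ADSL-like figures, not from the thesis): a full-rate download generates a steady stream of ACKs that must squeeze up the narrow link.

```python
# Sketch: upstream bandwidth consumed by ACKs for a full-rate download
# on an asymmetric link. All figures here are illustrative assumptions.

DOWN_BPS = 8_000_000   # 8 Mbit/s downstream
UP_BPS = 512_000       # 512 kbit/s upstream
MSS = 1460
ACK_WIRE_BYTES = 64    # ~40-byte ACK plus assumed link-layer overhead

segments_per_sec = DOWN_BPS / 8 / MSS   # full-size segments arriving per second
acks_per_sec = segments_per_sec / 2     # delayed ACK: one ACK per two segments
ack_bps = acks_per_sec * ACK_WIRE_BYTES * 8

print(f"ACKs alone need ~{ack_bps/1000:.0f} kbit/s of the {UP_BPS/1000:.0f} kbit/s upstream")
```

A third of the upstream goes to ACK clocking alone, so any concurrent upload queues the ACKs and stalls the download.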

With the ADSL-optimizer I actually solved the problem by having an ACK queue, which is bandwidth-“sized” to the opposite link’s capacity. The ADSL-optimizer also solves the issue by seizing control of the queue, which actually isn’t that easy on ADSL due to the special link-layer overhead (see chapters 5 and 6).

My investigations show that the major issue is that the TCP/IP congestion protocol was not designed with asymmetric links in mind.
But there is still some truth in the claim that ISPs are increasing buffer sizes too much, which makes this effect even worse.

I guess the real (but impractical) solution would be to implement a new TCP algorithm which handles this asymmetry (e.g. one not based on ACK feedback) and deploy it on your home machines (as the effect is largest there).

While I don’t doubt that there are issues caused by the asymmetric nature of many broadband connections, I would be surprised if that were the primary issue here (and I’ve already had real TCP experts, which I’m not, look at the data). In part, that’s because I see the same effect on symmetric FIOS service….

Remember, what’s going on here is that a huge amount of buffering has been inserted into TCP’s control loop: the paths I’m typically testing are between 10 and 30 ms, and the observed delays are orders of magnitude larger. Without queue management, TCP (and other protocols) will fill these buffers. If your algorithm has the effect of managing the queue growth, then you are achieving what is required for good operation.
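The “orders of magnitude” point is simple arithmetic: a filled, unmanaged buffer adds delay equal to its size divided by the bottleneck rate. A sketch with assumed cable-like numbers:

```python
# Sketch: queueing delay a filled, unmanaged buffer adds at the bottleneck.
# Illustrative numbers: a cable uplink and a plausibly bloated modem buffer.

UPLINK_BPS = 2_000_000      # 2 Mbit/s upstream (assumed)
BUFFER_BYTES = 256 * 1024   # 256 KB of modem buffering (assumed)
PATH_RTT_MS = 20            # typical unloaded path RTT cited above

queue_delay_ms = BUFFER_BYTES * 8 / UPLINK_BPS * 1000
print(f"{queue_delay_ms:.0f} ms of queue vs {PATH_RTT_MS} ms path RTT "
      f"({queue_delay_ms / PATH_RTT_MS:.0f}x)")
```

A second of added delay on a 20 ms path swamps every timing assumption TCP’s control loop makes.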

And above we have a real winner. We determined this to be the case around 1997-1999, when doing one-way satellite links to Australia with the back channel being a GRE tunnel. TCP inherently does not deal well with links that have different directional latency. It is the primary, if not the only, issue affecting you.

It is caused by under-provisioned networks between the end points. And that is caused by people buying into the crap known as Quality of Service, which should be called Quantify of Service.

By your definition, all networks would always have to be provisioned at the highest possible speed (note that a modern TCP can trivially run at gigabits/second). And the needed provisioning is never even knowable (in general), nor can provisioning be changed overnight, as it requires trucks, ships, and backhoes. So at best an ISP can try to match provisioning with traffic, but can never get it 100% right.

The whole point of TCP and congestion-avoiding protocols is to adapt to whatever speed is actually available over a path. With bufferbloat, however, you destroy the hosts’ ability to react to congestion properly and the network’s ability to operate properly.

Because you went from talking about buffers in TCP connections to line card buffers back to TCP buffers.

“And it is never even knowable (in general) what the provisioning would need to be, nor can provisioning be changed over-night, as it requires trucks, ships, and backhoes. So at best an ISP can try to match provisioning with traffic; but can never do it 100% right. ”

FUD. Of course you know what your network is provisioned for. I will let you in on a little secret: take the transfer cap and divide it by the number of seconds in the interval the cap covers, and you will see what the network is provisioned for. Neat, huh?
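The commenter’s arithmetic, spelled out with the numbers they cite:

```python
# Divide the monthly transfer cap by the seconds in a month to get the
# sustained rate it implies. The cap and tier are the commenter's figures.

CAP_BYTES = 250 * 10**9            # 250 GB/month cap
SECONDS_PER_MONTH = 30 * 24 * 3600

sustained_bps = CAP_BYTES * 8 / SECONDS_PER_MONTH
print(f"250 GB/month sustains ~{sustained_bps/1e6:.2f} Mbit/s, "
      f"on a tier sold as 50 Mbit/s")
```

About 0.77 Mbit/s sustained versus 50 Mbit/s peak: roughly a 65:1 gap between the rate sold and the rate the cap implies.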

When you provision your network for 250 GB/month transferred per drop (which is what Comcast does) but peddle it as a 50 Mbit/s connection, you will have all the weird problems you are experiencing. And when you add “PowerBoost”-type crap (which the supposedly in-the-know population swoons over as Comcast’s gift to users in the spirit of the free Internet), you are just covering up an incompetent network design.

“The whole point of TCP and congestion avoiding protocols is to adapt to whatever speed is actually available over a path.”

No, the whole point of TCP congestion control was to allow the protocol to adapt to whatever speeds were available on a sanely designed path, because at the time all paths were sanely designed. Congestion showed up on symmetrical links. That’s why SLIP and PPP worked well over V.22bis and sucked over HST.

And TCP always sucked under congestion: just ask anyone who had MAE-East ports back when ServInt and Netaxs were 30% of the traffic going over that fabric.

The Netalyzr data shows many different buffer sizes in play. So you may be lucky in the hardware chosen by your ISP (or your ISP had a clue when selecting it).

And don’t ever tell a gamer that 240 ms is good latency (and, by the way, that is almost twice the “acceptable” latency the telephony industry has used as a benchmark for many years). 240 ms is much less than the disaster that set me going, but I’d still have to consider it very problematic.

I still can’t consider 20 ms of base jitter “good”, however: that’s at the threshold of human perception, and gamers (and stock traders) care about latency differences even an order of magnitude smaller. It’s still much higher than it “ought” to be from first principles. Latency is also something you never get back, and to get acceptable latency for a given application, you have to add up all the latencies: vertical retrace on the display, queuing delays in all switching/routing, delays in server/peer processing, speed of light, etc. You have to attack all sources of latency and jitter in the entire system, end to end. The speed of light means we’re almost always starting off behind; we should always be minimizing latency as much as we can.

Of course, my home router is now my biggest problem, having “fixed” my broadband connection… We’ll have to beat down some other nails before circling back to broadband. There are a lot of nails scattered around to pound on…

You must be new around here; I first saw this problem in the 1980 time frame with nice large buffers (I mean, what could go wrong?) that could buffer 30 seconds’ worth of data, and they did. Trivial enough to spot. By reducing the buffer size, one just makes the problem harder to spot. I’d kinda like someone to do the theoretical modeling where they `fix’ the problem by going last-in, first-out (minimum latency, no matter the line condition) and updating all the software stacks to `work’ in this mode. You would then sense congestion by noticing out-of-order and delayed packets. The only problem: it’s kinda `late’ to fix it now. :-( Not that I’ve done the work to know if such a solution is even possible.

Another fix I’d like to see would be to use the existing TOS field by giving it a hard meaning, with router support defining the `cost’ associated with each packet. Imagine it as a mu-law encoded 8-bit floating-point value representing the `cost’ to send the packet. Let 0 be bulk, very low cost. Most people and most applications would use this, and it would be equivalent to what we have today. People doing million-dollar trades across the Internet could, if they wanted, tag the 20 packets of the trade at a higher cost. Rate limiters could then enforce not bandwidth, but the `cost’ of the packets.

One could run data hogs alongside VOIP merely by raising the TOS to, say, 30x the cost of the baseline. Grandmas who use their connections lightly could just tag everything with a higher cost and get `better’ quality, suitable for VOIP; people who pound out gigabytes could sit back and know their ISP doesn’t have to worry about them, as they tag everything at 0 cost, meaning it drops faster and sooner than a packet with _any_ non-zero cost. Links that saturate would first throw out all 0-cost traffic. Bye-bye BitTorrent, which is fine, as the client would seek out a link with no saturation. If people want to torrent really important stuff: TOS of 0xff, and presto, it’s like having a dedicated line.

This would allow an ISP to offer differing levels of service merely by having the clients select the service they want with the TOS field. Hard-core gamers with money would want to run at a higher TOS, and pay for it. Cheap games would run at a lower value, and bulk users (they know who they are) would want to run at 0. This gets the ISP out of monitoring, controlling, deep packet inspection and the like; it also provides a way for customers to pay more (to raise the caps/rate limiters that control their line).

Also, the ISP can use a sudden influx of high-TOS packets across a link to mean “man, let’s order up another one of these circuits now”; conversely, a link carrying only 0-TOS packets means that, even though it is completely full and saturated, don’t bother wasting any money expanding it. This reduces the cost of expansion and build-out, and gives customers a way to better communicate the value of the links of the network as a whole. Owners of high-value links would naturally want to charge more, which would, in the free market, encourage competition, thus reducing the cost of the link.
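The cost-precedence dropping the commenter proposes can be sketched in a few lines. This is a toy model of the hypothetical scheme, not an implementation of any real queueing discipline:

```python
# Toy sketch of cost-precedence dropping: when the queue is saturated,
# evict the cheapest-tagged packet first. Purely illustrative.
import heapq

class CostQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []      # min-heap keyed on cost: cheapest packet on top
        self.seq = 0        # tie-breaker: FIFO order within one cost level

    def enqueue(self, cost: int, packet) -> None:
        heapq.heappush(self.heap, (cost, self.seq, packet))
        self.seq += 1
        if len(self.heap) > self.capacity:
            heapq.heappop(self.heap)   # saturated: cheapest packet is dropped

    def drain(self):
        return [pkt for _, _, pkt in sorted(self.heap)]

q = CostQueue(capacity=2)
q.enqueue(0, "bulk")       # BitTorrent-style traffic, cost 0
q.enqueue(255, "trade")    # high-value packet, cost 0xff
q.enqueue(30, "voip")      # 30x the baseline cost
print(q.drain())           # the cost-0 "bulk" packet was evicted
```

On saturation the 0-cost packet goes first, exactly the "bye-bye BitTorrent" behavior described.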

In my book, trying to fix this with globally accepted TOS rules already turned out to be an EPIC FAIL about two decades ago…

Why else are ISPs today not running TOS, and instead MPLS, when real money comes into the equation?

You should take an interest in CONEX and re-ECN to learn how all parties (users, content and service providers) can get their incentives right to actually make this a reality (and not another EPIC FAIL like TOS, AQM and pure ECN).

Yes! Of course TOS was an epic fail, and it is exactly why we run MPLS. But please don’t tell that to people who worked at Bell Labs or are considered gods of TCP; it just does not fit into their nice world view of how things are supposed to work.

I keep saying, and I’ll say it again: bufferbloat isn’t a property of TCP per se.

The bad latencies will occur anytime a network hop is saturated with any protocol.

The issue is that the buffers are destroying every congestion-avoiding protocol’s congestion avoidance, which guarantees that any unmanaged buffer at the bottleneck fills and stays full. So the endpoints (hosts) never slow down, and you suffer continually higher latency than you would otherwise. All buffers should be correctly sized, and when you can’t (as often occurs) predict the “right” buffer size in advance, the buffers must be managed.
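“Correctly sized” usually means the classic rule of thumb of one bandwidth-delay product. A sketch with assumed numbers:

```python
# Sketch of the classic buffer-sizing rule of thumb: one bandwidth-delay
# product (BDP) for the link. The example numbers are illustrative.

def bdp_bytes(link_bps: int, rtt_ms: float) -> int:
    """Bandwidth-delay product: bytes in flight when the pipe is full."""
    return int(link_bps / 8 * rtt_ms / 1000)

# A 2 Mbit/s uplink on a 100 ms path calls for ~25 KB of buffer,
# not the hundreds of KB broadband gear commonly ships with.
print(bdp_bytes(2_000_000, 100))
```

The catch, as the comment says, is that the “right” rate and RTT often aren’t knowable in advance, which is exactly why the queue must be actively managed instead.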

Interestingly, this reminds me of a similar issue I raised with AT&T frame relay engineers back in the mid-’90s: the excessive amounts of data they were buffering in their frame relay switches, and how it completely subverted the edge router’s ability to properly prioritize traffic.

I never got anywhere with this. It was as if I were speaking a completely foreign language. And in the frame relay case it is even lamer to buffer seconds’ worth of data based on the CIR at a single switch, because your choices are not limited to buffering or dropping: frame relay has a built-in flow-control mechanism.

The problem was first observed very early in the Internet on satellite experiments.

And we have the same phenomenon today in 802.11 and 3G network technologies, where they are often trying too hard to transport data (and on top of that, we’re failing to manage the building queues). As I don’t understand those technologies at the proper level of detail, I’ve been glossing over those problems; someone else needs to take a stab at the explanations.

The fundamental issue is that most practicing engineers think that losing any bits is bad; whereas the Internet was designed presuming that packets could/would be lost at any time and would indicate congestion was occurring. And with memory having become so cheap (in most places; I know the really high speed networking folks still have problems), we’ve been frogs in heating water.

I’m surprised this is news for any network engineer, especially at ISPs. This problem has been known, and understood well enough to develop countermeasures, since at least 2002; see http://lartc.org/wondershaper/.

But few have seen it as a general problem afflicting systems end-to-end.

As far as ISPs go, RED has had enough tuning difficulty that we have a bimodal set of ISPs/networks: those who were so burned by congestion in the 1990s that they have RED (or other AQM) religion, and those who don’t. So some networks do run with AQM; some only bother where they have problems (and then have problems later when a bottleneck shifts); and some run entirely without.
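For readers who haven’t met RED: the core of its drop decision fits in a few lines. This is a simplified sketch; the real algorithm averages the queue with an EWMA and spaces drops using a packet count.

```python
# Minimal sketch of RED's drop decision (simplified: no EWMA averaging,
# no count-based drop spacing). Thresholds here are illustrative.
import random

def red_drop(avg_queue: float, min_th: float, max_th: float,
             max_p: float = 0.1) -> bool:
    """Probabilistically drop (or ECN-mark) as the average queue grows."""
    if avg_queue < min_th:
        return False               # queue healthy: never drop
    if avg_queue >= max_th:
        return True                # queue out of control: always drop
    # in between: drop probability rises linearly from 0 to max_p
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p
```

The point is the early, probabilistic drop: senders get told to slow down *before* the buffer fills, which is exactly what an unmanaged FIFO never does. The tuning difficulty mentioned above is choosing min_th/max_th/max_p well for a given link.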

And equipment vendors have never tested for “latency under load”; until I stumbled into my simple-minded test that exposed it, even Comcast had no easy test for it. And since Windows XP doesn’t implement window scaling, even that test doesn’t work on the systems most engineers were using (at least in their day jobs) until recently.

And there have been cultural barriers: the right Linux hackers never happened to talk to the other right people to get word out.

Ahh good point. Isn’t Nagle’s essentially another buffer? It accumulates small data messages into a single large message before sending.

For data transfer, where latency doesn’t matter, it can help. However, for interactive apps, like games, that send small messages often, it can hurt. In fact, games that use TCP, like World of Warcraft, often explicitly disable Nagle’s algorithm while they are running.

How this would interact with other buffers I’m not sure… But I guess it illustrates the negative effects of a buffer on latency at a local level.

Nagle’s is a pretty different horse, and is certainly not excessive in the size of its buffering; it’s just trying to coalesce tiny back-to-back operations (like two consecutive keystrokes) into fewer packets.

Nagle is usually a good optimization; but it was also why I became aware of these issues in the first place in 1984 or 1985. TCP_NODELAY was added to Berkeley UNIX when we ran into it early in the development of the X Window System.

I would argue the default for Nagle has happened to be wrong, since a lot of app writers get it wrong.
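For app writers who want to get it right: the TCP_NODELAY option mentioned above is the standard way to disable Nagle on a socket, shown here in Python.

```python
# Disabling Nagle's algorithm with TCP_NODELAY, the socket option added
# to Berkeley UNIX during X Window System development (per the comment above).
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send small writes immediately

# Confirm the option took effect (non-zero when set):
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
s.close()
```

Interactive apps (games, X11, telnet-like protocols) set this; bulk-transfer apps should generally leave Nagle on.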

This is news? Really? Traffic congestion on an oversubscribed link? QoS facilities have been available for years to work around traffic congestion and aid TCP. For the most part, there’s nothing new here; everyone can move along and simply read http://www.benzedrine.cx/ackpri.html, which covers 85% of the diatribe here.

I’ve had to add the answer to this one to the FAQ list I put together today, having answered it sooo many times now.

Please go look at the FAQ list and/or any of the other attempts to explain why classification isn’t adequate.

Classification by itself can’t solve the bufferbloat problem. It may be useful for many other purposes, but if you haven’t done AQM, you will still lose, and classification (ironically) becomes much more necessary than it would otherwise be.

The link you provided is all about a tool that can improve a DSL broadband link under OpenBSD using classification. But I mostly use Linux, on cable, and, now that I’ve twisted my home router’s QOS knobs to deal with the cable hop, I still have problems on the 802.11 side of my home network and on my laptop that are not amenable to radical improvement by classification. So first, the problem isn’t what you say it is; nor would your approach help me much if I did the equivalent in Linux, as it doesn’t actually solve the problem. I hope this illustrates that we’re in a complex mess.

There is a general point: while we have many tools that can immediately help the situation (including classification), this is a complex topic. When people assert that a particular tool or incantation “solves” “the” problem, they leave the impression there is a single magic fix, where in reality there are many tools that can be used to both mitigate and solve the many problems that occur in many parts of the network, often including network hops over which the user has no control, where someone else will have to do their piece. Which tool is most appropriate where depends on the circumstances.

That classification could at most help, but could not solve the problem, is a point I’d covered many times already. It does get tiring to repeat oneself so many times.

But not everyone reads every page, nor should I expect them to, and I hadn’t put the FAQ together, so your post is fully understandable.

Jim, you and I saw some of this when using audio via the J-Video system between Palo Alto and Cambridge in the mid-90s. I was adjusting the local VoIP buffer size trying to minimize buffering while maximizing continuity by measuring arrival jitter, eventually coming up with something very similar to Van’s heuristic in the mbone tools (8x std deviation?) .

We found this worked great locally but to Cambridge we had these occasional crazy stalls tracked down to routers that would occasionally buffer a large fraction of a second w/o dropping anything, shooting my jitter estimations in the head.

You may not remember this because I gave up and we reverted to the telephone for voice and J-Video for video. At the time I declared that “VoIP works as far as you can walk” and gave up on the whole idea.

I see echos of this today (like when people try to relay VoIP over RDP sessions) and have found that people don’t understand that more expensive service might improve bandwidth but has no effect on latency, and as you point out might even make it worse.

I’d completely forgotten about this. It was actually before the mid-90’s, just about exactly when Sally and Van were doing RED. I think we were doing the J-Video work in 1992 or thereabouts. I have no clue as to whether we were in touch with them over our troubles or not; certainly Van was using AudioFile and we were in touch in that era pretty regularly.

And yes, people somehow think more bandwidth will solve their latency problems; it seldom does, and has often made them worse (as you buy later hardware, which is often even more bloated than the old, and the dynamic-range problem gets that much larger…).

Excellent post. Explains perfectly the issues I’ve been seeing recently. I’d been baffled as to why lately a single download stream can make an Internet connection unusable for anyone else (on big pipes) when back in the day we used to do multiples on much lower bandwidth connections without an issue!

What you describe here seems like a manifestation of the Nagle algorithm. I have been dealing with issues like this for years in using TCP to move large medical imaging datasets; even on a LAN we will see several seconds of latency when the algorithm is not properly handled. By manipulating the send and receive buffers in sockets we can usually get the latency down to something manageable. As an example, a 512 KB CAT scan image might take 3 seconds over a 1 Gbps network with these settings set incorrectly. When you correctly set the send and receive buffers, you can get the transmission time back down to milliseconds. When you are moving over 20K of these images per day on a network, it makes a huge difference.

Nagle is a useful algorithm; I just think it should have been off by default, since it bites so many application programmers the wrong way. On the other hand, that might have caused other problems.

Best might have been to have the socket interface require you to specify, so application writers might have had to engage brain enough to think in the first place, about the nature of the traffic they were about to send.

@Brian: What you describe is not per-packet latency (which is what this blog post is about), but receiver- or sender-side limited bandwidth and flow completion time.

TCP window sizes are way too small these days, even for corporate LANs (and it has been mentioned multiple times already that Windows XP’s defaults are especially small).

Orthogonal to too-small window-size defaults and bufferbloat, most TCP stacks (with the exception of Linux) implement only the RFC algorithms for loss recovery.

However, especially at higher (LAN) speeds, timely recovery from losses (which, as mentioned here multiple times, are a basic design choice of the Internet / IP networks) also becomes a paramount issue, and not everything that could be done in that space has been fully explored.

For starters, not many stacks implement F-RTO, Eifel (well, only Linux is allowed to), FACK, or my favorites, Lost Retransmission Detection (LRD) and improved RTTM.

Just let two Linux and two Windows boxes (with properly tuned window sizes) run across a larger LAN. You will notice that the goodput (flow completion time) of Linux beats any other stack every time, hands down.

@Paul M: No, SPDY is an L5 protocol (an alternative to HTTP), transporting the same content as HTTP (i.e. HTML, XML).

But Chrome is often bundled with devices/OSes (e.g. Android), where the TCP stack itself is already heavily tuned (some would say these stacks violate IETF standards), and that helps too…

But quite a number of SPDY’s features you can also get with HTTP 1.1, when server and browser are properly tuned (see this blog: http://bitsup.blogspot.com/ ). But as they are optional (instead of default, as with SPDY), they are not in widespread use…

The problem of the trade-off between delay and throughput goes back decades, to when, as you said, Jacobson and Floyd were working on RED. It seems to me that the issue is still marketing, and that “throwing bandwidth at the problem” won’t cut it anymore.

But mainly, mainstream OSes do as they please. ECN was proposed long ago, and only now has Windows Vista implemented it (not enabled by default, of course). And then come BitTorrent-type protocols and modifications to TCP such as CUBIC (now the default in Linux). As congestion is my topic of research, I say it should be addressed first. Since the core network is now fairly unloaded, due to overdimensioned bandwidth, the problem has moved toward the edges, and as you said, the only thing ISPs managed to do was throw the ball to the last mile (customers).

Interesting that someone puts this up, and that people actually care about it! Keep it up.

I much prefer driving a sports car: trying to maneuver the Queen Mary on a super highway approaching the exit to my house just isn’t fun… And that’s what we all get to do these days.

And yes, I agree wholeheartedly that we have to change the marketing discourse. Ergo this blog: we have to shine light on the problem, encourage fixes, and have the power of the purse working for us, rather than against us.

Note that ECN deployment has been inhibited by a certain Taiwanese vendor having shipped a lot of broken kit long ago that would go belly-up if it saw an ECN bit. Steve Bauer (and maybe others) is investigating whether it may finally be safe to use ECN everywhere. So characterizing this as “as they please” is a mischaracterization. We do need easy-to-use tools, and ways to distribute them, to help your grandma find out if her home network is broken, however.

Bufferbloat is a case where I know the problem has been generating many service calls (hurting ISPs where it hurts most: directly on their bottom lines). I know, because I’ve placed such calls multiple times myself (before I understood what was going on).

I’m not convinced ECN per se is the proper (or only) answer here. ECN by itself only helps reduce loss (and the redundant work subsequently necessary after a lost packet).

Here at home, I’m running with ECN enabled (Linux, Win7) and have found only a few obscure server sites that also support TCP ECN.

However, even though I’m constantly tracing, I have yet to see a CE-marked frame in one of those few ECN-enabled flows.

Thus the problem is NOT only the end systems (where the default is still not to use ECN) but, IMHO, much more the access routers (where congestion actually occurs) and core routers. There, ECN (and AQM) could and should be enabled, as it won’t make a difference even if those broken home routers are still operational (and I kind of doubt there are still large numbers around: the half-life of home gear is probably less than 2-3 years, and the debacle happened two or three half-lives ago, so only a small fraction of the original population of broken equipment will still be operational; and users there are completely free not to enable, or to disable, ECN on their side).

One problem of ECN, as I see it, is that it only signals the existence of “TCP-compatible” congestion, but NOT its extent (i.e. the depth of the cumulative network buffers across the whole path). The reaction to ECN was specified to be identical to the reaction to loss; thus there never was any incentive for either end users or network operators to move from loss-based congestion signalling to ECN-mark congestion signalling.

A simple incentive back then might have been to allow a gradually less severe cwnd reduction on ECN marks. Thereby traditional protocols not using ECN (but required to be TCP-friendly) would be at a disadvantage (not only because of the worse goodput/throughput ratio, which end users don’t really care about all that much, but which network operators should care about).
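The incentive problem can be made concrete in three small rules. The first two reflect the standard behavior described above; the third, gentler reduction is the commenter’s hypothetical, with an arbitrary illustrative factor:

```python
# Sketch of the ECN incentive problem: the specified reaction to a mark
# is the same multiplicative decrease as for a loss, so ECN buys no
# throughput. The "gentler" beta below is hypothetical, not a standard.

def cwnd_after_loss(cwnd: float) -> float:
    return cwnd / 2        # standard multiplicative decrease on loss

def cwnd_after_ecn_mark(cwnd: float) -> float:
    return cwnd / 2        # specified to react to a mark exactly like a loss

def cwnd_after_ecn_gentler(cwnd: float, beta: float = 0.8) -> float:
    return cwnd * beta     # hypothetical milder cut as a deployment incentive

for f in (cwnd_after_loss, cwnd_after_ecn_mark, cwnd_after_ecn_gentler):
    print(f.__name__, f(100))
```

Since the first two functions are identical, a sender gains nothing by deploying ECN; the third is the sort of asymmetry that might have created the missing incentive.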

Perhaps the current CONEX WG will get something right and not only build an improved signalling framework, but also set the incentives for end users and network operators correctly, to get it deployed this time.

What AQM algorithm are you using together with ECN (I assume your router at home should be doing most of the marking)? That’s an interesting thing I have yet to try.

I agree with you that changes should be made to TCP for those using ECN, to encourage its use, but the IETF has always looked after fairness, and how some flows shouldn’t be able to starve others that follow the standards. But there’s the new P2P on the block, which abuses TCP by opening many connections, and quickly overflows queues.

As for the inhibition of the ECN bit, Jim, I think that’s still the market. Since ECN wasn’t that important to many manufacturers, selling expensive cards or focusing on other areas mattered more than adding such a feature. If the ECN bug had been a TCP bug, they would have taken the Taiwanese routers down; instead they just inhibited ECN.

As I said before, one main advantage of ECN (besides the obvious one of reducing packet losses and retransmissions) is that it allows differentiating packet losses due to congestion from those due to malfunctions, the medium, etc., which is key to actually modifying TCP to behave accordingly. With no TCP modifications, well, that advantage is not used to its full potential.

Hah! Your wireshark picture looks almost exactly like what I noticed happening on my connection just a week or two ago! I was wondering why less than a hundred kB/s of bittorrent download was completely thrashing every other outgoing connection attempt and saw exactly the same sorts of dup’d acks and retransmissions going on.

What I’m less certain of is whether TCP’s RTT/retry/congestion-avoidance algorithms are worth trying to save at all. In the presence of any real degree of packet loss much above the 0.1% range, TCP falls down horribly, as I discovered while working on ultra-wideband networking devices a few years ago. I guess that finally actually implementing proper ToS/QoS everywhere is going to be the only real effective solution long-term.

I don’t agree with your conclusions: rather, I believe that everyone needs to develop a deeper understanding of how packet networking actually works. We aren’t seeing the forest for the trees.

Part of the issue (as I understand it from watching mail traffic, and again, this is not my area) is that many/most of these technologies have been designed such that they buffer packets for a long time in the name of trying to get them delivered reliably: but then this can have the effect of defeating SACK and fast retransmit in TCP. Note that in my traces (which show between one and three percent packet loss, BTW), the pipe’s being kept very full.

There is no such thing as a “layer” in a network “stack”; I’ve been badly burned by this kind of thinking (and somewhat guilty of it myself) and hope to address this in a future post. One very common pervasive problem has been design by committee, where the committees have been entirely focused on the particular “layer” (in the ISO model sense) of a particular technology, and lacking in expertise in how the protocols built above them actually function. These so-called “layers” interact with each other, and “fixing” problems in one “layer” may just cause more trouble elsewhere.

A quick example: most 802.11 access points drop the transmit rate down to 1 Mbps for all multicast/broadcast traffic; this means that if there is even a small amount of such traffic, you can turn your 20-megabit network into a 1-megabit network.
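The airtime arithmetic behind that example, with illustrative numbers: each multicast bit sent at the 1 Mbit/s basic rate costs 20x the airtime of a bit sent at 20 Mbit/s, so a trickle of multicast eats the network.

```python
# Back-of-envelope for the 802.11 example: multicast sent at the basic
# rate starves unicast airtime. All figures are illustrative assumptions.

BASIC_RATE = 1_000_000     # rate many APs use for multicast/broadcast
DATA_RATE = 20_000_000     # nominal unicast rate of the network

for mcast_bps in (100_000, 500_000, 900_000):
    airtime = mcast_bps / BASIC_RATE              # fraction of air consumed
    left = (1 - airtime) * DATA_RATE / 1e6        # capacity left for unicast
    print(f"{mcast_bps/1e3:.0f} kbit/s multicast -> ~{left:.0f} Mbit/s remaining")
```

Less than 1 Mbit/s of offered multicast is enough to drag the whole cell toward basic-rate speeds.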

Hang on a second. This is familiar. I used to have an old DSL modem that was really fast, but the ISP had it working at about 1/4 to 1/8 of its maximum speed. This worked fine for downloads from the Internet, but caused issues on uploads.

Because the modem could do something like 1 Mbps but was working at only 120 kbps, its buffers were way bigger than they needed to be for the speed it was actually running at. If the outbound link got saturated, the Internet would basically stop working. The symptoms: new HTTP connections (or any TCP connections, really) might start but not always complete, or complete really, really slowly.

Since I was writing a P2P client at the time and mostly using that program for large transfers, I implemented throttling in it, and the problem went away. It would come back, though, if any family member saturated the outbound connection for more than a few seconds.

As the ISP upgraded its systems, the modem started working closer to its design speed and the issue went away. Dramatically so. I could still make it happen by adding noise to the line in such a way that enough of its upstream 5 KBps channels closed off. (This was easy at the time, since the apartment block’s wiring would do it for me.)

All this was when I first got broadband, back in 1999-2000. Unless I missed something, I don’t see why silly buffer sizes causing latency, causing TCP to fail, would be controversial.
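That modem story is the bufferbloat arithmetic in miniature. With an assumed buffer sized for the design rate (the buffer size here is a guess for illustration), running at a fraction of that rate multiplies the drain time:

```python
# The modem story in rough numbers: a buffer sized for the design rate
# drains ~8x slower when the ISP runs the modem at a fraction of it.
# Buffer size is an illustrative assumption; the rates are the commenter's.

BUFFER_BYTES = 32 * 1024       # assumed buffer sized for ~1 Mbit/s operation
DESIGN_BPS = 1_000_000         # what the modem could do
ACTUAL_BPS = 120_000           # what the ISP actually ran it at

drain_design_ms = BUFFER_BYTES * 8 / DESIGN_BPS * 1000
drain_actual_ms = BUFFER_BYTES * 8 / ACTUAL_BPS * 1000
print(f"~{drain_design_ms:.0f} ms to drain at design rate, "
      f"~{drain_actual_ms:.0f} ms as actually run")
```

Over two seconds of standing queue is exactly the regime where new TCP connections start but stall, matching the symptoms described.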

The Netalyzr team hasn’t published the source to their tests so far. I gather they may have a command line version internally. You could drop a note to Nick Weaver or Christian Kreibich and see if they will give you copies.

However, I believe some of the tests at m-lab are equally effective; but as the results of those tests haven’t yet been published, I’ve mostly ignored them in this blog as they would just further complicate an already complicated story.

I worked for a startup company almost 10 years ago. We were building a cellular wireless data system. In our first simulation system I noticed slow transmissions and a high number of retransmissions. I finally tracked it down to packets being delivered out of order. The solutions I saw were either to modify the TCP packets, or to increase the TCP window size. This is a complicated issue. I always thought a good solution for wireless carriers was a proxy approach: if you break up the communication into two separate TCP connections, the issue becomes more manageable.

I haven’t thought about these kind of problems for years. These are interesting problems. I wish I had more time to think about them.

Running multiple TCP connections just dilutes the congestion avoidance further, and makes the situation worse. I’ve alluded to this in the blog already a bit; I have a major posting sometime on this topic coming, once I can breathe again.

Of course, I cannot prove this when using the phone’s browser (or an Apple device, due to its closed ecosystem, which prevents tcpdump from being available there). But even when tethered, some non-HTTP sessions look suspicious: a smaller MTU negotiated in the SYN/SYN-ACK of the TCP session; certain options I know are supported by the server (when using a wired Internet connection) missing from the SYN/ACK, etc.

Doing this vs. not doing this has become a no-brainer, as mobile operators not “proxying” TCP sessions (independent of content; with HTTP it’s particularly easy and more cost effective) will not have many customers for long…

Again, this is rumor as far as I’m concerned (I only consulted with an operator once – running 2G at the time, and this was the one big thing which fixed their issues vs. their competition). Perhaps someone working for a mobile operator wants to speak up and shed some light on 2G / 3G / 4G mobile data networks and operational tweaks.

This happens to be a well-known problem in gaming circles, where latency AND bandwidth both matter greatly. The solution is to tune the “maximum receive window size” in the operating system to the speed of the connection. Ideally that buffer should hold no more than a second’s worth of data.

In Linux, this can be done via /proc/sys/net/ipv4/tcp_rmem, and for Windows there’s a program called TCPOptimizer.
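As a sketch of the tuning the commenter describes (the link speed and RTT below are illustrative assumptions), the receive window only needs to cover the bandwidth-delay product to keep the link busy; anything beyond that just queues.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: the receive window needed to
    keep the pipe full, with no excess data queued."""
    return int(bandwidth_bps * rtt_s / 8)

# e.g. a 10 Mbit/s link with a 50 ms path RTT:
print(bdp_bytes(10_000_000, 0.050))  # 62500 bytes, far below a one-second buffer
```

On Linux, a value like this would go in the third (maximum) field of /proc/sys/net/ipv4/tcp_rmem.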

Years ago I realised that my ISP had a buffer at their end large enough to hold 40 seconds’ worth of data, which ended up being a similar story to yours.

Yes, it’s well known in a few circles (which haven’t properly screamed about their problems, in my opinion).

Note, however, that your wireless router or your computer may also be bufferbloated (and maybe even worse than your ISP): as soon as your broadband bandwidth is higher than your wireless bandwidth, the bottleneck moves to the 802.11 link, and you have yet another problem.

What is worse, the bandwidth actually available there (the actual goodput) often varies widely, so static tuning as you suggest won’t really help that case. So we have to circle back to AQM to fully solve the problem.

It’s sad that the lessons of Stuart Cheshire’s well-known rant It’s the Latency, Stupid are still being ignored. Back in the early days of the commercial Internet, Cheshire was bemoaning the excessive buffering in consumer modems: and here we are again, fifteen years later, in exactly the same place.

Thanks for your detective work. It might be useful to distil the large amount of text you have written about the history of your investigation into a short overview document, capturing the essential message. The congestion recovery mechanisms of TCP assume that congestion leads to packet loss that can be detected. Bufferbloat removes this link between congestion and packet loss. The result is worse overall network performance than using smaller buffers with some packet loss.
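One way to quantify that broken link between congestion and loss (the buffer size and rates below are illustrative assumptions): with tail-drop, a sender only sees its first loss after the bottleneck buffer has completely filled, so the congestion signal arrives a full queue-drain time late.

```python
def loss_signal_delay_s(base_rtt_s, buffer_bytes, bottleneck_bps):
    """Earliest time a tail-drop loss can reach the sender:
    the base RTT plus the time for the full queue to drain."""
    return base_rtt_s + buffer_bytes * 8 / bottleneck_bps

# 20 ms path, 256 kB of buffer in front of a 1 Mbit/s uplink:
print(loss_signal_delay_s(0.020, 256_000, 1_000_000))  # ≈ 2.07 s, not 20 ms
```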

Correctly identified a symptom, and predictably resulted in the knee-jerk reaction of “more memory for the masses”. (The root cause of incast and its TCP performance impact should be addressed by smarter means, i.e. an evolution of http://simula.stanford.edu/sedcl/files/dctcp-final.pdf)

This is really great research – and the narrative is great reading too. Thanks. Looking forward to the rest.

I’m wondering if this may also be incorrectly implemented by satellite providers – creating bufferbloat on these wireless links? My folks live way out in the woods and can only get sat links to the Internet. Their latency is atrocious at all times, but sometimes the link is just unusable – I wonder if this is at times when the uplink (since it’s shared with lots of other senders) is overloaded and the sat company is buffering everybody’s stuff (either on the sat or at the downlink)?

Any thoughts on how to test out this hypothesis? I’d love to be able to give some info to the sat company on how to improve their service, since it’s very painful to use these days. Very bursty in my superficial experience which makes me think bufferbloat might be the culprit (and I could see the false logic in thinking that putting a big buffer on such a slow link would help).

Also, would implementing wondershaper on their local end possibly improve things? Thanks for any insights on the satellite implications of all this. Really great work!

Sure. Historically, (talking with others with yet more gray hair than me), bufferbloat was first identified and understood on satellite hops, and I’ve certainly seen behavior over satellite links that was likely extreme bufferbloat.

And yes, shaping traffic to avoid the buffers filling may (or may not) be very helpful. On links with predictable bandwidth, you can avoid filling the buffers.

But, as usual, you have to identify which hop is actually the bottleneck. It can be hiding in the satellite technology itself, or in the routers on either side of the hop, or locally in your home network (though this is probably less likely in this case). And, IIRC, some of the satellite technologies play evil games with TCP in the background. So the first step is to identify where you are actually suffering. Time, maybe, to start another page on tools and troubleshooting, or turn on a wiki for everyone to play with.

I just got my new FTTH connection (10/10 Mbit/s). As expected, even the new provider has not configured any decent AQM scheme (despite the CPE using a Broadcom BCM5338M, which does offer advanced schemes – but only rate limiting appears to be utilized – the physical FTTH link is Eth 100 FDX).

But the point I wanted to make is, that things might not be as bleak with the demise of WinXP as you suggest. Win7 comes with Compound TCP as the standard congestion control algorithm, which is a hybrid (latency / loss feedback) scheme – see http://tools.ietf.org/html/draft-sridharan-tcpm-ctcp-02.

So, running with a really large TCP receive window to a decent server in my ISP’s core, the latency impact with CTCP is significantly reduced compared with NewReno.

The tests were performed using an FTP session, captured with Wireshark, and analysed using Ostermann’s tcptrace [cygwin] utility, version 6.6.0, 4 Nov 2003:

The b-side is the important part. In summary, the latency induced by bufferbloat is 150 ±87 ms with NewReno (the high variance indicating frequent draining of the buffers – and frequent collapse of the throughput – averaging 1.198 MB/sec). With CTCP, in comparison, the induced latency is “only” 92 ±18 ms – I believe the latency component of CTCP has a target of 100 ms – and the much better variance also indicates a much less pronounced sawtooth behavior, for an average throughput of 1.204 MB/sec.

Still looking at the download traces (Ostermann’s tcptrace doesn’t correlate round-trip times properly from a receiver-side trace…)

Well, arriving at ~74-100 ms is an improvement over latencies in the range of ~63-237 ms, but the problem here is that the baseline latency (speed of light) will be different for every path in the Internet. And a sender has no means to distinguish signalling delay from queuing delay.

In my case, the unloaded latency (empty buffers at the beginning of the test, until slow-start overshoots and floods the queues) is slightly less than 3 ms to the ISP’s local server. And for the record, on the download path (from the same server), the full-queue latency rises to 39.6 ± 2.8 ms – a bit better than in the uplink direction. Most likely this comes from the shared-memory buffering used by many common switch designs (I’m linked up to Broadcom chipset boxes doing the rate limiting). Right now, there are about 20 home users, provisioned 10/10 and a few 30/30, sharing the 1 GE link to my apartment complex – thus the total buffer of the switch in the basement is shared among all these users, which results in less buffering (in kB) available to each individual port – and thereby indirectly limits maximum latency…

From the point of view of control theory, you want the feedback signal as fast as possible, but not faster than the fundamental frequency of the control loop (i.e. 1 RTT; ICMP source quench violated that principle, and had a number of other shortcomings, such as generating more load at times when congestion was already prevailing), and not very much slower (i.e. >> 2-3 RTT).

In my example, the empty-queue RTT to my server is around 3 ms, but the feedback loop reacts in ~40, ~100 or even ~150 ms – a factor of 15, 35 and 50 less timely than would be ideal…

OTOH, with such huge buffering delays, latency-based congestion algorithms have little trouble spotting these building queues… And you won’t even need to go to a real-time-optimized OS (minimizing OS scheduler / interrupt jitter, measuring times with high precision).

(For comparison, pathChirp needs to run its timers at microsecond / sub-microsecond resolution to yield good results. This quickly leads into a rat hole of OS stack changes throughout…)

I have seen various people publish “hacks” to “improve” the performance of Firefox by increasing the number of TCP sockets it can open to web servers. Thinking about it, these go back to the days of WinXP and older. This then causes some people to go crazy and increase the values beyond any reasonable limit.

I think a similar problem occurs when there is wifi congestion. People’s instinct is to turn the power UP on their wifi access points (or fit higher-gain antennas) – their idea being that shouting louder overcomes the interference and congestion; but of course the wireless clients’ power can’t easily be fixed that way, nor can their receivers easily be made less sensitive to suit!

The proper solution of course is for *everyone* to *reduce* the power output on their access points to the minimum just sufficient to cover their site. The snag is that this requires people to (a) understand the cause of interference and (b) cooperate with their neighbours.

Thanks for your investigation and write-up. I think you’ve managed to bring together some fairly well-known behaviors (fill your link and it becomes useless for anything else) and make sense of them :)

7-8 years ago I was trying to implement a “QoS” service on our then rather new network. The service consisted of a Frame Relay link to the customer, which terminated on an ATM switch (I don’t recall what type; not my area), which re-encapsulated the packets in ATM and delivered them to my router (Unisphere, now Juniper ERX). Frame Relay has the ability to interleave frames (FRF.12, I think it’s called) so you can run VoIP on low-speed links, but due to the ATM layer in the middle we couldn’t take advantage of that.

Our primary concern was jitter caused by the serialisation delay on low speed links. At the time Cisco said 768Kbps was the lower bound, and we were aiming for 1Mbps given the additional buffering/jitter introduced by the ATM layer.

Most of my lab testing involved saturating low priority queues with UDP packets (no back off so the queues were permanently full) and testing that the higher priority queues still had a timely (low latency/jitter) service.

Some of the interesting facts I figured out during this process were:

Our Ethernet switches (Extreme i-series) use 256kB buffers on each port.
The Juniper ERX line cards are equipped with 32 MB of buffer, which is shared dynamically among all the sub-interfaces on that card (each card supports up to 32,000 sub-interfaces). Each sub-interface has an upper bound of 7 MB of buffer.
The Cisco CPE at the time (2500/2600 routers) used 64kB buffers.

During my final testing I dropped the speed of the service (to 256 kbps, I think) and tested the impact of filling the high-priority queue while a VoIP call was in progress. I set it up such that the total traffic was around 10% higher than the available bandwidth. Once the queue filled and started tail-dropping, the service had the expected 10% packet loss and the VoIP call continued to work. The problem was that the end-to-end latency was > 40 seconds. I was able to speak into one of the phones, stand around for 30 seconds, then walk across the room and listen to my message on the other phone.

The problem was that my lab environment had only a single service configured on the line card, so the router had given it the full 7 MB buffer – totally insane.
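The numbers bear out just how insane: at 256 kbps, a completely full 7 MB buffer would take several minutes to drain, so the observed 40-second latency suggests the queue never even filled completely. A quick sanity check:

```python
def drain_time_s(buffer_bytes, link_bps):
    """Seconds of latency added by a full buffer at a given rate."""
    return buffer_bytes * 8 / link_bps

print(drain_time_s(7_000_000, 256_000))  # ~219 s if the 7 MB buffer fills
print(drain_time_s(64_000, 256_000))     # 2.0 s even with the 64 kB cap
```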

I was able to find the knob in the config to lower the upper bound to 64kB based on the reasoning “64kB is good enough for Cisco, it’s good enough for me”.

We then told our customers that “under no circumstances over-subscribe your high priority queue, use it for latency/jitter sensitive traffic only, e.g. RTP”. Even with the 64kB queue limit a full queue still caused more latency than was ideal for VoIP.

Let me finally get to my points :)

I was building a private service (RFC2547 based). Most of us using standard public Internet services don’t have the luxury of being able to mark important packets and expect them to get special treatment at all the choke points in the network.

Much (most?) telco router gear has knobs to tune these sorts of settings. I wonder how many ISPs actually change these settings from the defaults? Are the defaults on this gear sensible?

VoIP is really tolerant of packet loss, far more so than the voice engineers are willing to admit. This means our obsessive pursuit of low-/no-loss services is unnecessary (at least when VoIP is used as the excuse to justify that pursuit). From my experience working at a telco, loss is considered evil because voice doesn’t like it, which I find ironic because the standard solution (larger buffers) causes far worse issues.

What I look forward to is the opportunity to meet again with Dave Clark and Vint Cerf. David and I met when he took a little trip up to Hillsboro, OR a while back, and we discussed tcptrace and slow start a long time ago. Vint and I met while he was with MCI; we met in Folsom, CA. It’s been a while. I hope they are reading this, and I wish them well.

The surprise here is that a single TCP connection fills these buffers (on anything except Windows XP). And the buffers are so large as to be causing TCP major confusion: congestion avoidance has been defeated.

Anything can fill the buffers, but TCP is the protocol that is negatively impacted by it. UDP by itself shouldn’t be impacted by this phenomenon beyond huge variability in latencies. The mean throughput for UDP should be good, as should the min latency (mean & median will be somewhat negatively impacted, but likely not too badly). Of course, a lot of UDP applications have their own congestion control at a higher layer, which causes tons of fun.

I think though that what Daniel is talking about is really a different problem related to actually maxing out available bandwidth in the pipe, rather than transient max outs caused by buffer bloat.

Given the fact of single queues and no classification in these devices, the buffers being full means that UDP traffic also suffers the delays just as much as TCP.

Yet worse: the buffers are being kept almost precisely full, with TCP pacing its packets to keep them topped off (more or less), particularly in bursts as TCP probes for new safe operating points, increasing the loss rates on competing flows, whether TCP or UDP.

Actually, UDP traffic can be even more negatively impacted because of the lack of reliability, and it can greatly contribute to bufferbloat due to the lack of congestion control mechanisms.

With full buffers, dropped UDP PDUs will not be retransmitted unless reliability is implemented in the application layer. Nevertheless, the fact that many applications (such as DNS) will attempt to retransmit some packets to achieve their ends will cause further flooding of packets into the network, with the risk of no data actually getting through.

Right. If there are long lived flows that are greedy for bandwidth (elephants), they will fill the queues. Whether the elephants are TCP, or UDP (with app level congestion control), is moot.

We see more and more long-form video (elephants) moving through the network. There are downloads (torrents etc.), and adaptive streaming for VoD, live streaming, and video conferencing. Elephants already dominate Internet traffic on a fraction-of-bytes basis, and will for the foreseeable future. The presence of the elephants assures that “bufferbloat” latencies will become ever more common. That is, until better mechanisms than the status quo are deployed.

Note that I wouldn’t call torrents elephants. And actually, torrents are somewhat a mechanism for abusing TCP congestion control. While a torrent pushes networks to their limits (which I guess is fine), it opens many connections to download chunks of data, which causes unfairness to other well-behaved flows. A single download should be fair to a single streaming flow; however, if a torrent opens 99 connections, the competing flow will only get 1/100 of the bandwidth.

In addition, in order to bypass the congestion control mechanism of TCP, some implementations use UDP, which turned out to be more detrimental to the network (but somewhat more efficient for the downloaders).

It appears that you may have mistaken implementation bugs in an early alpha for the design goals of this particular protocol. (For some reason, eastern European ISPs in particular had issues at that time [2008] with uTP.)

At least it’s good to see it’s going through the IETF. Now, as far as I know, uTorrent is attempting to transmit reliably over UDP, which seems to be what TCP does. While I’m all for improving TCP (and there have been several proposals for that), I think (my opinion) that using UDP to bypass the basic TCP-friendliness requirements imposed on other protocols doesn’t seem fair. As I said, several other protocols that are friendly to TCP have been discarded for various reasons, and this type of “new” protocol needs to undergo special review before running in the wild, particularly when there is no congestion control mechanism defined for UDP traffic.

You are right, with plain UDP there is no congestion control other than what an application designer thinks is appropriate.

My point being that BitTorrent in particular is a bad example, because uTP does have a congestion control scheme, one that goes to extreme lengths to be a scavenger service (less-than-best-effort), compared with the best-effort service of TCP.

One key signal missing in TCP to build improved congestion control (such as LEDBAT, and available in uTP) is measuring one-way delay (instead of round trip time, which is the signal measured currently by TCP). There are efforts underway to address this aspect of TCP, btw.

@Arms: I know you (we all) are trying to make a point about the importance of congestion control mechanisms. My point of view is that several improvements have been proposed to maximize throughput while being TCP friendly. We can spend much time discussing the advantages of uTP.

My point being that XCP, for example (among many others), focused exclusively on congestion feedback and was carefully studied. They followed the rules on TCP friendliness. UDP has no congestion control mechanism; any congestion control on top of UDP is simply an application-layer feature. And that shouldn’t be the way to compete against TCP: enforcing reliability over a protocol that wasn’t meant for it. It’s like using a hammer the wrong way around; it may feel lighter, it may do the job, but that’s not how it is supposed to be used. And careful attention should be given to “wild” deployments of this type of protocol, which may turn out to be hurtful – which is why I think it’s good that it is going under review at the IETF.

And the different flavors of congestion avoidance algorithms in TCP and other protocols are entirely moot so long as we fail to notify the hosts of congestion in a timely fashion (necessary for the congestion avoidance servo mechanisms to have a rapid and stable response). That’s what bufferbloat has done to us; the amount of buffering would not matter if we had working AQM algorithms deployed everywhere.

It is easy to get lost in the forest among the trees of the congestion avoidance algorithms if you lose sight of this fundamental fact.

Yes, torrent clients maintain many connections. But at any given time, a relatively small subset of them will be actively transferring data. Compared to other “mousey” Internet traffic types (web pages, e-mail, chat, …), the active connections in torrent swarms engage in relatively long-lived data transfers (in the MB+ range, comprising several chunks). The point is that these flows transfer data over sufficiently many RTTs to open up their TCP congestion window and keep it there for a while. In my book this makes them elephants. You could even say that elephants are precisely those flows that can induce bufferbloat effects.

Web browsing *also* induces bufferbloat effects; you have N connections (6 or more), all with their initial windows’ data flying toward the broadband edge, where they go *splat* into the queues of the home devices.

So suffering comes in all forms: in the web browsing case, you get transient effects, just the thing to cause your VOIP traffic to have fits.

Charles, I read that book. Unfortunately, I’d say it hasn’t been updated. Current web pages may download that amount easily. I think the rapid increase in bandwidth availability has shifted the meaning of elephants and mice. I already know of several cases where terabyte transfers are competing against music streaming, because the bandwidth allows it.

Sure. It would be an interesting study to classify the causes of bufferbloat in the wild. My bet is on torrents and streaming dominating the other web induced bufferbloat “events”. Hopefully with all this discussion, one or more such studies are already underway. :)

[…] Jim Gettys is leading an initiative to fight “bufferbloat”, i.e., overly large buffers that cause time-sensitive traffic to be delayed significantly in the presence of high-volume background data transfers. Take a look a Jim’s introductory article and the role of Netalyzr’s findings here. […]