Thursday, December 8, 2011

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM Queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup.

A paper like this does
a good job of showing just how bad the situation can be -
experimentally driving router buffering delay from 10ms to ~1000ms on
many common broadband cable and DSL modems. I wish the paper had been
able to show me the range and frequency of that queue delay under normal
conditions.
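To get an intuition for where numbers like that come from (these are my own back-of-the-envelope figures, not the paper's): a drop-tail buffer drains at the link rate, so its worst-case delay is just buffer size divided by rate. Broadband uplinks are slow, and modem buffers are sized generously, which is exactly the ~1000ms recipe.

```python
def queue_delay_ms(buffer_bytes: int, link_bps: float) -> float:
    """Worst-case drain time of a full drop-tail buffer, in milliseconds."""
    return buffer_bytes * 8 / link_bps * 1000

# A hypothetical 256 KB modem buffer on a 2 Mbps uplink:
print(queue_delay_ms(256 * 1024, 2_000_000))    # ~1049 ms - the ~1000ms regime
# The same buffer in front of a 100 Mbps link is barely noticeable:
print(queue_delay_ms(256 * 1024, 100_000_000))  # ~21 ms
```

The buffer sizes and link rates are illustrative assumptions, but they show why the same amount of buffering is harmless at LAN speeds and disastrous on a DSL uplink.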

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to back off doesn't save you much because it has already sent most of what it needs to send anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - a dropped SYN or a dropped last packet of the data train can only be recovered by very slow retry timers, and short flows are comprised largely of these kinds of packets. Non-drop based congestion notification like ConEx/ECN does less damage but is again ineffective, because the short flow is more or less complete by the time the notification arrives, so it cannot adapt its sending rate.
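To see why those particular packets hurt so much: a lost SYN or a lost tail packet generates no duplicate ACKs behind it, so fast retransmit never fires and recovery has to wait out the full retransmission timer, which backs off exponentially on each failure. A sketch, assuming the classic 3 second initial RTO used when there is no RTT sample yet:

```python
def retransmit_schedule(initial_rto: float, attempts: int) -> list[float]:
    """Cumulative seconds until each retransmission attempt,
    using binary exponential backoff of the RTO."""
    times, rto, elapsed = [], initial_rto, 0.0
    for _ in range(attempts):
        elapsed += rto
        times.append(elapsed)
        rto *= 2  # double the timer after every unanswered retransmission
    return times

# A lost SYN with a 3s initial RTO is retried roughly this far into the
# connection attempt - an eternity next to a sub-100ms page fetch:
print(retransmit_schedule(3.0, 4))  # [3.0, 9.0, 21.0, 45.0]
```

The exact initial RTO varies by stack (newer specs lowered it to 1 second), but the shape of the damage is the same: one unlucky drop turns a short flow into a multi-second stall.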

The problem is all of those other parallel HTTP sessions going on that didn't get the message. It's the aggregate that causes the buffer buildup. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my spdy work. When spdy consolidates resources onto the same tcp flow, the remaining larger flows will be much more tcp friendly. Loss indicators will have a fighting chance of hitting a flow that can still back off, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 on a small number of flows, otoh, is excellent.)
That ought to keep router queue sizes down and give things like rtcweb a
fighting chance.
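To put numbers on that IW=10 parenthetical (assuming a typical 1460 byte MSS):

```python
MSS = 1460        # typical segment payload in bytes - an assumption
IW_SEGMENTS = 10  # the proposed larger TCP initial window

def initial_burst_bytes(flows: int) -> int:
    """Bytes a set of uncorrelated new flows can inject into the
    path in their very first RTT, before any congestion feedback."""
    return flows * IW_SEGMENTS * MSS

print(initial_burst_bytes(90))  # 1314000 - over 1.3 MB slammed into the queue
print(initial_burst_bytes(6))   # 87600 - a handful of spdy connections
```

Ninety flows get to dump well over a megabyte into the bottleneck queue before a single congestion signal can reach any of them; a few consolidated flows make the same IW=10 a cheap latency win instead of a queue bomb.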

It also opens up the possibility of the browser identifying queue growth through delay-based analysis and possibly helping the situation out by managing our bulk tcp download rate (and definitely the upload rate) inside the browser, by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but it's kind of hard to imagine in the current HTTP/1.x world.
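For what it's worth, the rwin-munging idea might look something like this - an entirely hypothetical sketch, every function name and threshold here is my own invention, and a real browser would need OS cooperation to actually clamp the advertised window:

```python
def clamp_rwin(base_rtt_ms: float, current_rtt_ms: float,
               rwin_bytes: int, min_rwin: int = 16 * 1024,
               max_rwin: int = 256 * 1024) -> int:
    """Shrink the advertised receive window when RTT inflation suggests
    a queue is building upstream; grow it back as the delay subsides.
    Purely illustrative - not an API any browser or stack exposes today."""
    inflation = current_rtt_ms / base_rtt_ms
    if inflation > 1.5:       # delay well above baseline: a queue is forming
        rwin_bytes = int(rwin_bytes * 0.8)
    elif inflation < 1.1:     # path looks uncongested: probe back upward
        rwin_bytes = int(rwin_bytes * 1.25)
    return max(min_rwin, min(max_rwin, rwin_bytes))

# 100ms baseline but 400ms observed: back the bulk download off
print(clamp_rwin(100, 400, 128 * 1024))  # 104857
# delay back near baseline: let the window recover
print(clamp_rwin(100, 105, 128 * 1024))  # 163840
```

The thresholds and multipliers are arbitrary placeholders; the point is only that delay-based signals give the receiver a knob that per-flow loss signals never could in a 90-flow HTTP/1.x world.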