ports so multiple connections between systems can occur simultaneously

connection setup and teardown

flow control via a sliding window

Let's focus for a moment on TCP's sliding window. It's the feature that allows a sender
to put a certain amount of data on the wire without having yet received an
acknowledgment from the receiver. The window "slides" open and shut as the receiver lets
the sender know how much data can safely be in flight at any given moment, based on
network conditions, free memory on the receiver, etc. Without this windowing feature,
we'd have to send a byte, wait for its acknowledgment, send another byte, and so on. You
can imagine that this kind of "stop and wait" protocol would be a performance nightmare.
But who picks the size of this window? How big is big enough, or too big?
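To put rough numbers on that, here's a back-of-the-envelope sketch; the segment size and RTT are illustrative assumptions, not figures from the article:

```python
# Stop-and-wait: one segment in flight, then idle until its ACK returns,
# so throughput is capped at one segment per round trip regardless of
# how fast the link is.
segment_bytes = 1460   # a typical Ethernet TCP payload (assumed)
rtt_ms = 100           # an assumed 100 ms round trip

throughput_bps = segment_bytes * 8 * 1000 // rtt_ms
print(throughput_bps)  # 116800, i.e. about 117 Kbps even on a gigabit link
```

Windowing exists precisely so throughput isn't chained to the round trip time like this.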

On this topic, a curiosity arose recently at work: some of our intercontinental WAN
links just weren't living up to their advertised capacity. What was going on? The links
should have been 2Mbps, but we were barely getting 600-700Kbps over them with our FTP
transfers. And this was between two Solaris 8 systems, whose TCP/IP stack should be
decent, if not one of the best in the industry.

So after a bit of investigation into performance problems on WAN links, I came across a
plethora of writeups on TCP window sizes, the bandwidth-delay product (BDP), and networks
with high BDPs, or "Long Fat Networks" (LFNs, read "elephants"). Bottom line: it turns
out most TCP stacks in the industry just aren't set up by default for today's WAN links,
satellite links, or even gigabit ethernet--anything with high bandwidth, high delay, or
both.

Let's consider what makes an LFN an LFN. We'll use the above WAN link as an example. Say
you measure the round trip time (RTT) to be 300ms using ping. This means it's taking
about 150ms for the data you're sending to get to the other side, and another 150ms for
the other side's acknowledgment to get back to you. 300ms is a pretty long delay, almost
a third of a second. Click the screenshot of Figure 1 below to watch a flash animation I
created demonstrating a ping over this LFN.

Figure 1: Ping behavior over an LFN (click to watch)

So let's consider the BDP--what is it? The BDP tells us the optimal TCP window size
needed to fully utilize the line. To keep the pipe fully utilized you must push data onto
the wire, at your given bandwidth, for as long as an entire RTT. That is, the receiver
must advertise a window size big enough to allow the sender to keep sending data right up
until the first acknowledgments begin arriving. Say the WAN link is 2 megabits per second
(Mbps). In our case, we need a window size big enough to hold 300ms worth of data at that
rate. Let's compute the BDP:
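The arithmetic, sketched in code (the same numbers as the example above):

```python
bandwidth_bps = 2_000_000   # 2 Mbps WAN link
rtt_ms = 300                # measured round trip time

bdp_bits = bandwidth_bps * rtt_ms // 1000   # bits "in flight" over one full RTT
bdp_bytes = bdp_bits // 8
print(bdp_bytes)  # 75000
```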

So in this example our TCP window size should be a minimum of 75,000 bytes. But let's say
that our systems haven't been properly tuned for this kind of bandwidth and delay, and
their TCP stacks have a maximum TCP window size of 24,000 bytes. Oh no! Click and watch
the simulated behavior in Figure 2.

Figure 2: Untuned TCP session over an LFN (click to watch)

You can clearly see in the animation above how inefficient this is. The sender can
only send so much data before it completely fills the window and has to stop and wait
for acknowledgments to come back. As soon as the first ACK arrives it can resume
sending data--until it prematurely fills the window again, and so on.

As it turns out, I didn't just pick 24,000 bytes out of the air--that is Solaris 8's
default maximum TCP window size (24576 to be exact). And this is the reason we were only
getting 600-700Kbps on a 2Mbps line. The fact is Solaris 8, by default, just isn't tuned
for performance on a high bandwidth-delay network. But this problem doesn't just affect
the fairly antiquated Solaris 8--most OSes today have a default maximum TCP window size of
at most 64KB, which would still be insufficient.
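One way to sanity-check that: a window-limited TCP session can move at most one full window per round trip. A small sketch, reusing the 300ms RTT from the example above:

```python
def max_throughput_bps(window_bytes: int, rtt_ms: int) -> int:
    # Window-limited TCP throughput: at most one full window per round trip
    return window_bytes * 8 * 1000 // rtt_ms

# A 64KB window on a 300 ms RTT link still tops out below the 2 Mbps line rate:
print(max_throughput_bps(65536, 300))   # 1747626, about 1.75 Mbps
# whereas the 75,000-byte window from the BDP calculation fills the pipe:
print(max_throughput_bps(75000, 300))   # 2000000, the full 2 Mbps
```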

Fortunately, the TCP window size is easily adjustable on Solaris or pretty much any OS.
Instructions on how to do this are widely available (see links at the end of this
article). It's probably worth noting that the adjustments you make are actually to the
default socket buffer size for each application, which indirectly allows the system to
advertise larger TCP windows. The following commands are for Solaris:
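As a sketch of what such tuning looks like, these are the standard Solaris ndd tunables for the default send and receive buffer sizes (run as root, and verify the parameter names against your own system's documentation):

```shell
# Raise the default send and receive socket buffer sizes to 75,000 bytes
ndd -set /dev/tcp tcp_xmit_hiwat 75000
ndd -set /dev/tcp tcp_recv_hiwat 75000

# tcp_max_buf caps how large an application may set its socket buffers;
# check it, and raise it too if you need windows larger than its value
ndd -get /dev/tcp tcp_max_buf
```

Note that ndd changes don't survive a reboot; they're typically made permanent in a startup script.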

So we adjusted our systems to use 75,000 byte TCP window sizes. Click and watch Figure 3
for a demonstration of an optimized TCP flow.

Figure 3: Optimal TCP session over an LFN (click to watch)

Much better. And in fact now we're getting a full 1.9-2Mbps on our WAN link.

The next logical questions would be: what are the dangers of increasing the TCP window
size, and how much is too much? The only real downside to turning up the default socket
buffer sizes is increased memory usage, which in this age of cheap and plentiful memory
doesn't seem like a big deal. The argument also comes up that with more unacknowledged
data in flight, the risk of clogging your network increases, since more data might be
transmitted, lost, and retransmitted. That could be the case if your network is dropping
packets due to errors, in which case you should probably fix that problem anyway. But if
you don't increase the window size, you're underutilizing your network. So which is worse?

As a side note, we found that turning up the TCP window size on our gigabit
ethernet-capable systems significantly boosted LAN throughput as well. Using Iperf (see
links at end of article), a great tool for testing the effect of different TCP window
sizes on throughput, we saw an increase from about 300Mbps to 850-900Mbps.

Let's take one last look at the BDP formula and observe the following, which is
essentially the point of this article: TCP throughput is limited by the round trip time
of the line and the size of the TCP window. While the former is out of your control, the
latter is not. You could in fact prove, using the BDP formula, that with a TCP window of
24,576 bytes, it was impossible for me to get any more than 655Kbps on a line with a RTT
of 300ms.
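That proof is a one-liner: at most one window can be in flight per round trip, so throughput = window / RTT:

```python
window_bytes = 24_576   # Solaris 8's default maximum TCP window
rtt_ms = 300            # measured round trip time

throughput_bps = window_bytes * 8 * 1000 // rtt_ms
print(throughput_bps)  # 655360, i.e. about 655 Kbps -- the ceiling we observed
```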

So while TCP is normally a well-oiled machine, there are still opportunities for
performance tuning and considerable payoff for understanding how things work under the
hood. Of all the networking settings you can tweak on your machine, tuning the TCP window
size is hands down the biggest bang-for-buck optimization you can make for improving
throughput over high bandwidth-delay networks.

While that's the end of the discussion on sizing and tuning TCP windows, there's actually
quite a bit more for anyone interested. Here are some tidbits: the original designers of
TCP made the window field 16 bits, which allows a maximum of 65,535 bytes. After all, who
would ever need a bigger window than that? :-). To accommodate larger window sizes, both
systems must support TCP window scaling (RFC 1323), which specifies a power-of-two
multiplier (2, 4, 8, 16, etc.) applied to the advertised window size. Also, when window
sizes start getting huge, and with them the amount of unacknowledged data in flight, the
ability to do selective ACKs (SACK) becomes important. SACK lets a receiver request
retransmission of just the missing sequence ranges, instead of throwing away and forcing
retransmission of everything past a lost segment, as the original TCP spec required.
Most modern OSes are capable of both--I know Linux, Solaris, and FreeBSD 5.x all
are--however, surprisingly, in almost all cases these features are not enabled by
default.
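To illustrate how window scaling works: the 16-bit window field is left-shifted by a negotiated shift count, so the multiplier is always a power of two. The helper function below is mine, for illustration only:

```python
import math

def window_scale_shift(desired_window_bytes: int) -> int:
    """Smallest RFC 1323 shift count whose multiplier (2**shift)
    lets the 16-bit window field cover the desired window."""
    max_unscaled = 65535  # largest value of the raw 16-bit window field
    if desired_window_bytes <= max_unscaled:
        return 0
    return math.ceil(math.log2(desired_window_bytes / max_unscaled))

print(window_scale_shift(75_000))      # 1 -> multiplier of 2
print(window_scale_shift(10_000_000))  # 8 -> multiplier of 256
```

So our 75,000-byte window from earlier only needs a shift of 1, but a tens-of-megabytes satellite-sized window needs a much larger one.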

TCP is a complicated yet fascinating protocol. Here are some links which may be of
interest:

Great article. Much easier to understand than most of the other ones out
there.

On Wed Oct 24th 2007, 9:48am, Visitor posted:

Very clear. With satellites, the BDP is on the order of tens of MBs. Would
you know what settings and tuning people use? Thx

On Wed Oct 24th 2007, 12:07pm, Steve Kehlet posted:

Hi Visitor, thanks for posting. I'd be curious to know too, but no, I
haven't done any work with TCP over satellite links. I'd guess, just like
any other link, you'd want to measure the RTT and then just use a TCP
window calculator (like the one I have linked above) to determine what the
optimal TCP window size would be. That should be a ballpark figure at
least.

On Fri Dec 14th 2007, 7:24am, Rohit posted:

the best I have found on the net... thanks.
Can you tell me what we should do when network bandwidth is low and the
sender's system buffer gets full? Should we increase delay (RTT)?

On Wed Sep 10th 2008, 12:51am, Dharmendra Tripathi posted:

Great article. Thanks much.
I want a suggestion. We have a web application (deployed on Tomcat) on a Sun
Sparc m/c with Solaris 9 and 2 GB of RAM. Sometimes under excessive load,
response times become too slow. How can we improve the server response
time by setting TCP config parameters? Please suggest.

On Thu Nov 6th 2008, 5:15pm, Mike posted:

Good article...

So, taking into consideration the BDP (RCV Window, Latency), what is the
calculation needed to answer questions like:
a) how long it would take to send 1GBYTES over a 100Mbit Link?
b) What size link is required to send 1GBYTE in 10 seconds?

Thanks,

On Thu Jan 8th 2009, 5:54pm, Visitor (Glenn) posted:

This is a fantastic, well-written article on a very important subject.
Well done!

On Thu Mar 26th 2009, 5:20pm, Jackie posted:

Good article!

On Thu Apr 16th 2009, 12:08pm, Visitor posted:

Your article, Tuning TCP for High Bandwidth-Delay Networks, is really good!
I send this article to clients that think throwing bandwidth at a TCP
transmission issue is the answer. Once they see this and optimize their
TCP window size, things get much better. Some opt for WAN acceleration,
which does this and much more.

Thanks again! This truly is a great resource!!!!

On Thu Jul 2nd 2009, 3:15pm, Mohan posted:

Excellent article on this subject. Nice flash animations.

I have one question. Available link bandwidth and delay of a link can vary
depending on how many others are using the link. In that case, how do we
calculate the BDP?

Thanks again for the wonderful article. I have forwarded it to my colleagues.

On Tue Sep 29th 2009, 12:05pm, Visitor posted:

This is by far the best and simplest article I have found that gives an
excellent explanation of TCP window size and throughput ...

On Sun Nov 29th 2009, 12:31pm, Visitor posted:

Thanks for a very lucid, simple, and compelling explanation!

On Wed Mar 24th 2010, 8:14am, Visitor posted:

Great article. Thank you!

On Fri Mar 26th 2010, 8:02pm, Visitor posted:

Great article! However, manual tuning will only work in certain
environments; consider this topology:
http://i39.tinypic.com/34g7ij6.jpg

if you manually adjust the RWIN for ftp01.lax01 and ftp01.nap01 for the
2Mbps link: RWIN: 75K, you will indeed see a performance increase; however,
once you try to establish a link to ftp01.ber01 from ftp01.lax01 and
transfer a file, you will see a decrease in performance: as the RWIN for
the connection will need to be at least 250K for 200ms and a 10Mbps VPN
pipe (assuming 0 percent utilization on both Internet connections):
10000000/8 * .2 = 250K, however, as you manually configured the RWIN on
ftp01.lax01, the maximum theoretical throughput you can achieve between
ftp01.lax01 and ftp01.ber01 would be 3Mbps (RWIN/RTT=Throughput); 75000/.2=
375KB/sec or 3Mbps. Check this out, if you started the same file transfer
from ftp02.lax01 to ftp01.ber01 with TCP RWIN Auto Tuning, your performance
would be much better as the RWIN would be calculated on the fly.

Hope this clears things up a bit.

Evilbit

On Fri Mar 26th 2010, 11:29pm, Steve Kehlet posted:

@Evilbit: Glad you liked the article. Yes, when sizing your tcp windows
you need to consider the path with the largest bandwidth-delay product, or
you could unintentionally limit your throughput. In your example, simply
pick 250KB and you're covered in both cases. Regarding auto tuning of the
tcp window size, good point, it's probably best to let the OS handle this,
if it's capable--note I wrote this article six years ago, when neither
Windows nor Linux (and certainly not Solaris) had this ability.

On Sat Mar 27th 2010, 9:48am, Visitor posted:

Steve, exactly; six years ago is a long time; this is why I added it in
case someone else had some confusion. Take care

EB

On Mon Apr 19th 2010, 5:53pm, Visitor posted:

Hi - really appreciate the excellent & clear writeup. By chance, do
you have a new link for "A User's Guide to TCP Windows
(excellent)" The page can't be displayed. thanks!

On Mon Apr 19th 2010, 6:32pm, Steve Kehlet posted:

Sadly, no. Looks like the project wrapped up, google doesn't have a cached
copy. Oh wait! It occurred to me to try the Wayback machine, here it
is:

(here is the search page from the wayback machine:
http://web.archive.org/web/*/http://dast.nlanr.net/Guides/GettingStarted/TCP_window_size.html)

On Wed Apr 21st 2010, 4:43pm, Visitor posted:

thanks for the tip on the Wayback machine and the link to a copy of the
User's Guide! cool.

On Thu Jan 27th 2011, 9:45am, Visitor posted:

Hi Steve, I think your message (and the same message from other authors)
is getting out. It's brought up another learning: setting the window
size too large may fill buffers in network elements and ruin your latency
and jitter. If you have a home network and want to use different types of
applications, it's a good idea not to overdo the window size. Have a look
at more information by searching for bufferbloat on Wikipedia. Best
Regards.

On Thu Jan 27th 2011, 10:39am, Steve Kehlet posted:

Thanks Visitor for bringing up the point of too large window sizes.

On Fri Jun 24th 2011, 1:45am, ThumbsUp posted:

Excellent, thank you very much !

On Fri Feb 10th 2012, 3:51pm, Sandeep posted:

awesome post. Thanks.

On Wed Mar 7th 2012, 6:52am, Visitor posted:

Thanks. It is great

On Mon May 13th 2013, 8:40pm, Visitor posted:

Nearly 9 years later, and still this article sits ready to serve up great
information. Thanks Steve!