WAD -- TCP tuning daemon

As part of our Net100 efforts to improve bulk transfers over
high-speed, high-latency networks, we have developed a TCP
tuning daemon, WAD (a workaround daemon), based on a
Web100-modified Linux kernel.
The WAD can auto-tune various TCP parameters of designated network flows.
Our hope is to work around various application, kernel, and protocol
bottlenecks.

Our version 1 WAD uses a static configuration file.
An entry in the WAD config file (wad.conf) looks like:
[net100.lbl.gov]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 131.243.2.93
dst_port: 0
mode: 1
sndbuf: 4000000
rcvbuf: 4052159
wadai: 6
wadmd: .3
maxssth: 100
divide: 0
floyd: 1
If "mode" is 1, the WAD will tune the flow even if the application has
done its own setsockopt() on the send/receive buffers (SO_SNDBUF/SO_RCVBUF).
If "mode" is 2, the WAD will use NTAF data for the buffer sizes.
If "floyd" is 2, the WAD will dynamically update (every 0.1 seconds)
the AIMD for the flow using Floyd's AIMD tables.
If "floyd" is 1, the WAD will enable the kernel version of Floyd's AIMD
tuning (continuous).
The "wadai" field modifies TCP's additive increase for the flow.
The "wadmd" field modifies TCP's multiplicative decrease.
The "maxssth" field enables Floyd's modified slow start (you need the
event-driven WAD to tune slow start; polling may be too late).
Here are some early results using WAD to enable
Floyd's slow-start for designated flows.
If "divide" is 1, the WAD will dynamically reallocate the buffer size
among concurrent flows; otherwise each flow always gets the full buffer size.
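The effect of "divide" can be sketched as follows. This is a minimal illustration only: the text above does not say how the WAD splits the buffer, so the equal split used here is an assumption, and per_flow_buffer is a hypothetical helper name.

```python
def per_flow_buffer(configured_bytes, active_flows, divide):
    """Return the buffer size (bytes) each matching flow would be given.

    Assumption: with divide == 1 the configured buffer is split equally
    among the concurrent flows. The real WAD policy may differ.
    """
    if active_flows < 1:
        raise ValueError("need at least one active flow")
    if divide == 1:
        return configured_bytes // active_flows
    # divide == 0: every flow gets the full configured buffer
    return configured_bytes


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "flows:", per_flow_buffer(4000000, n, divide=1), "bytes each")
```

With the sndbuf value from the sample entry, four concurrent flows would each get a quarter of the 4000000 bytes under this assumption.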

The current version of the WAD can either poll for new connections
or the kernel can notify the WAD via a netlink socket that
a new connection has been established (see info on event notification).
When the WAD identifies a new connection,
it checks the configuration file to see if the
flow should be tuned.
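The lookup step can be sketched in Python. This is not the WAD's own code: it assumes that an address of 0.0.0.0 and a port of 0 act as wildcards, and load_entries and find_entry are hypothetical names. Python's configparser happens to accept the "key: value" layout of the sample entry.

```python
import configparser

SAMPLE = """\
[net100.lbl.gov]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 131.243.2.93
dst_port: 0
mode: 1
sndbuf: 4000000
rcvbuf: 4052159
wadai: 6
wadmd: .3
maxssth: 100
divide: 0
floyd: 1
"""


def load_entries(text):
    """Parse wad.conf-style text into {section_name: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def _matches(want_addr, want_port, addr, port):
    # Assumption: 0.0.0.0 and port 0 mean "match anything".
    addr_ok = want_addr == "0.0.0.0" or want_addr == addr
    port_ok = int(want_port) == 0 or int(want_port) == port
    return addr_ok and port_ok


def find_entry(entries, src_addr, src_port, dst_addr, dst_port):
    """Return (name, entry) for the first entry matching the flow, else None."""
    for name, e in entries.items():
        if (_matches(e["src_addr"], e["src_port"], src_addr, src_port)
                and _matches(e["dst_addr"], e["dst_port"], dst_addr, dst_port)):
            return name, e
    return None


if __name__ == "__main__":
    entries = load_entries(SAMPLE)
    print(find_entry(entries, "10.0.0.1", 40000, "131.243.2.93", 5001))
```

A flow to any other destination address returns None here, meaning the flow would be left untuned.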

We are testing the WAD over various high speed, high delay links.
Here are some preliminary tuning results using the WAD.
The following figure shows the bandwidth for a 10 second netperf
transfer from ORNL to PSC.
The receiver at PSC (80 ms RTT, OC12) advertises a 2 MB window, and the plot
shows the throughput when the transmitter
uses a 64 KB send buffer (typical default)
or a WAD-tuned 2 MB buffer.
The data for this graph was collected dynamically at the
sender from the Web100 MIB
variables using ORNL's Web100 tracer daemon,
a variation of LBL's Python WAD daemon.
WAD/Web100 can only tune up to the window-scale factor used
by the application in the initial SYN packet.
Web100 provides a sysctl variable to set the initial scale factor.
We are investigating other TCP parameters that we might "tune",
such as a virtual MSS, AIMD parameters, dup threshold, etc.
We have also deployed the WAD on both ends of the connection, getting
57 Mbs for a WAD-tuned flow (1 MB buffer) vs 6 Mbs for a 10 second iperf using
16K default buffers.
(Using 1 MB buffers on both ends, iperf gives 81 Mbs.
Using a 1 MB buffer on the iperf server and letting Linux 2.4 autotune the
client achieves 77 Mbs.
Linux autotuning will not tune the receiver, so the receiver must
advertise a "big enough" window.)
We have an auto-tuning summary
that describes other approaches to dynamically tuning TCP.
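The buffer sizes in the ORNL-to-PSC example can be related to throughput with a little arithmetic: a flow cannot send more than one window per round trip, and the window itself is capped by the scale factor negotiated in the SYN. The sketch below is illustrative only. It assumes 1 KB = 1024 bytes, ignores loss and protocol overhead, and uses hypothetical helper names. The 16-bit window field and the maximum shift of 14 come from the TCP window-scale option.

```python
def window_limited_mbps(window_bytes, rtt_seconds):
    """Upper bound on throughput (Mb/s) when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6


def min_window_scale(window_bytes):
    """Smallest window-scale shift whose maximum window covers window_bytes."""
    for shift in range(15):  # the window-scale option allows shifts 0..14
        if (65535 << shift) >= window_bytes:
            return shift
    raise ValueError("window too large for TCP window scaling")


if __name__ == "__main__":
    rtt = 0.080  # 80 ms RTT, as in the ORNL to PSC test
    for label, size in (("64 KB", 64 * 1024), ("2 MB", 2 * 1024 * 1024)):
        print(label,
              "bound: %.1f Mb/s" % window_limited_mbps(size, rtt),
              "min scale:", min_window_scale(size))
```

At 80 ms, a 64 KB window bounds the flow at roughly 6.5 Mb/s, which is why the untuned sender falls so far short of the OC12 path.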

A bigger MSS should help both network and operating
system performance.
We have modified the Linux kernel so that the WAD can use a "virtual
MSS" for designated flows.
The "virtual MSS" is implemented by
adding one segment to cwnd a constant K times per
RTT during congestion avoidance.
The virtual MSS does not cause IP
fragmentation, but neither does it reduce the interrupt
overhead.
The effect of the virtual MSS is best illustrated when there is packet loss.
The following plot illustrates two transfers from ORNL
to NERSC with packet loss during slow start.
Both flows use the same TCP buffer sizes, but one flow is dynamically tuned
by the WAD to use a virtual MSS of 6 segments.
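The virtual MSS idea can be sketched as a per-ACK window update. This is a simplified model, not the kernel patch: it assumes cwnd is counted in segments, one ACK arrives per segment in flight, and each ACK adds K/cwnd, which works out to about K segments per RTT. The function name is hypothetical.

```python
def congestion_avoidance_rtt(cwnd, k=1):
    """Advance cwnd (in segments) through one RTT of congestion avoidance.

    k is the virtual MSS factor; k = 1 models standard TCP.
    Assumption: one ACK per segment currently in flight.
    """
    acks = int(cwnd)
    for _ in range(acks):
        cwnd += k / cwnd  # each ACK contributes a k/cwnd share of a segment
    return cwnd


if __name__ == "__main__":
    start = 100.0
    for k in (1, 6):
        gain = congestion_avoidance_rtt(start, k) - start
        print("K=%d: cwnd grows by about %.2f segments in one RTT" % (k, gain))
```

Under this model a flow with K = 6 regains lost window about six times faster than a standard flow, without changing the size of the packets on the wire.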

Our WAD can also further improve recovery after a loss, and hence
throughput, by altering
TCP's multiplicative decrease.
Normally, TCP halves cwnd (a multiplicative decrease of 0.5) after a loss
and then increases cwnd by 1 segment per round-trip time.
In the following graph we plot two different tests between ORNL and NERSC,
one with standard TCP and the other with WAD tuning the multiplicative
decrease to be only 0.3 and the additive increase to be 6.
This example also illustrates the typical packet loss during slow start.
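The benefit of these two settings can be estimated with a simple recovery model. The sketch below assumes that the multiplicative decrease is the fraction of cwnd removed at a loss and that the additive increase is in segments per RTT. It also ignores timeouts and further losses. The function names are hypothetical.

```python
def on_loss(cwnd, md=0.5):
    """cwnd (segments) after one loss; md is the fraction of cwnd removed."""
    return cwnd * (1.0 - md)


def rtts_to_recover(cwnd_before_loss, ai=1.0, md=0.5):
    """RTTs of additive increase needed to climb back to the pre-loss cwnd."""
    deficit = cwnd_before_loss * md  # segments lost to the decrease
    return deficit / ai


if __name__ == "__main__":
    cwnd = 1000.0  # illustrative window size in segments
    print("standard (ai=1, md=0.5): %.0f RTTs" % rtts_to_recover(cwnd))
    print("tuned    (ai=6, md=0.3): %.0f RTTs" % rtts_to_recover(cwnd, ai=6, md=0.3))
```

For a 1000-segment window, the model gives 500 RTTs to recover with standard TCP and 50 RTTs with the tuned values, which at an 80 ms RTT is the difference between 40 seconds and 4 seconds.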

We have recently installed Sally Floyd's AIMD mods in the Linux kernel,
and our WAD has an option to periodically (every 0.1 seconds)
tune AIMD values for designated flows using Floyd's table.
In the following plot, one can see the slope of the recovery increasing
as cwnd increases, and one can see that the multiplicative decrease
is no longer 0.5 for the WAD/Floyd tuned flow.
A kernel implementation would continuously update the AIMD values.
Two tests are illustrated using 2 MB buffers for a 60 second
transfer from ORNL to LBNL (OC12, 80 ms RTT).
(The better slow-start of the Floyd flow is just the luck of that
particular test.)
(Also see our WAD Floyd slow-start results.)
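For readers who want to see where such a table comes from, the sketch below computes the increase a(w) and decrease b(w) from the HighSpeed TCP response function. The formulas and default parameters are written as we understand them from RFC 3649 and should be checked against that document. The table actually used by the WAD or the kernel is not shown in this text and may be built differently.

```python
import math

# Default HighSpeed TCP parameters, as we understand them from RFC 3649.
LOW_WINDOW = 38        # below this window, behave like standard TCP
HIGH_WINDOW = 83000
LOW_P = 1e-3           # loss rate corresponding to LOW_WINDOW
HIGH_P = 1e-7          # loss rate corresponding to HIGH_WINDOW
HIGH_DECREASE = 0.1    # decrease factor at HIGH_WINDOW


def _frac(w):
    """Position of w between LOW_WINDOW and HIGH_WINDOW on a log scale."""
    return ((math.log(w) - math.log(LOW_WINDOW))
            / (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)))


def hstcp_b(w):
    """Multiplicative decrease (fraction of cwnd removed) at window w."""
    if w <= LOW_WINDOW:
        return 0.5
    return (HIGH_DECREASE - 0.5) * _frac(w) + 0.5


def hstcp_a(w):
    """Additive increase (segments per RTT) at window w."""
    if w <= LOW_WINDOW:
        return 1.0
    p = math.exp(_frac(w) * (math.log(HIGH_P) - math.log(LOW_P)) + math.log(LOW_P))
    b = hstcp_b(w)
    return w * w * p * 2.0 * b / (2.0 - b)


if __name__ == "__main__":
    for w in (38, 1000, 10000, 83000):
        print("w=%6d  a=%6.2f  b=%.2f" % (w, hstcp_a(w), hstcp_b(w)))
```

The increase grows and the decrease shrinks as cwnd grows, which matches the steeper recovery slope and the smaller post-loss drop visible for the WAD/Floyd tuned flow.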

Tierney of LBNL did more systematic testing of Floyd's HSTCP
in October 2002.
Here are some early results for testing HSTCP in the net100 kernel (2.4.19).
These results are averages of 6 to 30 iperf tests, each 30 seconds long,
for each path.

The following two graphs show a series of GridFTP tests transferring
a 200 MB file from ORNL to LBNL (64K IO buffer, 4 MB TCP buffer),
for an untuned stream, 4 parallel streams, and a WAD-tuned AIMD (.125,4)
stream.
The tuned single stream is configured to perform like the 4 parallel streams;
see multcp.
(A fully untuned stream, 64K TCP buffer, takes 200 seconds at 8 Mbs.)
In one series the tuned single stream outperforms the parallel
streams.
In the second series of tests, they perform about equally.
The tuned stream in the second plot also includes Floyd's modified slow-start
(max_ssthresh 100).
We have no conclusive results yet on tuning the buffer sizes of parallel
flows.
Parallel flows have an advantage in slow start, because the WAD
cannot set the number of segments initially sent at the beginning of slow start,
though doubling the number of initial segments only reduces the slow-start
duration by one RTT.
The WAD can tune the slow-start increment to make slow start more
aggressive; we are still experimenting with this.
With an increment of K, slow-start time is reduced by a factor
of log(K)/log(2).
For other parallel results see here.
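The two slow-start claims above can be checked with a small counting model. The sketch assumes an idealized, loss-free slow start in which cwnd is multiplied by a fixed growth factor each RTT (2 for standard TCP), and it models an "increment of K" as a growth factor of K, which is the reading under which the log(K)/log(2) ratio holds. The function name and the window sizes are illustrative.

```python
def slow_start_rtts(target_cwnd, initial_cwnd=1, growth=2):
    """Count RTTs for cwnd to reach target_cwnd when multiplied by growth each RTT."""
    if initial_cwnd < 1 or growth <= 1:
        raise ValueError("initial_cwnd must be >= 1 and growth must be > 1")
    cwnd, rtts = initial_cwnd, 0
    while cwnd < target_cwnd:
        cwnd *= growth
        rtts += 1
    return rtts


if __name__ == "__main__":
    target = 4096  # illustrative target window in segments
    print("initial 1, doubling:", slow_start_rtts(target, 1, 2), "RTTs")
    print("initial 2, doubling:", slow_start_rtts(target, 2, 2), "RTTs")
    print("initial 1, growth 4:", slow_start_rtts(target, 1, 4), "RTTs")
```

Doubling the initial window saves exactly one RTT in this model, while a growth factor of 4 halves the number of RTTs, consistent with log(4)/log(2) = 2.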

For future versions of the WAD,
we are considering having the end-point WADs exchange tuning information on an
active flow.
We are considering a number of additional TCP variables that might be tuned.
The current list is as follows: