Delay implementation

[ This file explains how traffic shaping is implemented with the emphasis
on how the delay-agent works. ]
0. Overview
We can shape network links or LANs. Links can have their characteristics
set either symmetrically (duplex) or asymmetrically (simplex). LANs can
have characteristics set either uniformly for the entire LAN or individually
per node on the LAN. Note that shaped LANs are mostly used to emulate
"clouds" where you are modeling the last hop of nodes connected to some
opaque network. We can shape bandwidth, delay and packet loss rate, and
to a limited extent, queuing behavior.
From the user perspective, links and LANs can be shaped "statically" by
specifying their characteristics once in the NS file, or dynamically by
sending "shaping events" via a web page, client GUI, or the command line
tool "tevc."
Shaping is usually done using a dedicated "delay node" which is interposed
between nodes on a link or LAN. A single shaping node can shape one link
per two interfaces. So in Emulab, where nodes typically have four experimental
network interfaces, we can shape two links per shaping node. For a LAN in
Emulab, one shaping node can handle two nodes connected to the LAN.
More details are given below.
A lower-fidelity method is to shape the links at the end points ("end node
shaping"). Larger networks can be emulated in this way since it doesn't
require a dedicated shaping node for every 1-2 links.
As part of the Flexlab project, Emulab supports shaping LANs in more
specialized ways. These include the ability to shape traffic between
individual node-pairs and even to shape between individual TCP or UDP
flows among nodes.
Our shaping nodes currently use dummynet, configured via IPFW, running on
FreeBSD. Much of the terminology below (e.g., "pipe") comes from this
heritage though hopefully the parameters are general enough for other
implementations. The particular implementation sets up a layer 2 bridge
between the two interfaces composing a link. An IPFW rule is set up for
each interface, and a dummynet pipe associated with each rule. Shaping
characteristics are then applied to those pipes.
1. Specifying shaping.
Shaping can be specified statically in the NS file using (largely) standard
NS commands and syntax. Some commands were added by us, in particular to
handle LANs. Commands:
* Create a link between nodes with specified parameters:
set <link> [$ns duplex-link <node1> <node2> <bw> <del> <q-behavior>]
and to set loss rate on the link:
tb-set-link-loss <link> <plr>
Here the characteristics specified are one-way; i.e., traffic from
<node1> to <node2> is shaped with these values, as is the traffic from
<node2> to <node1>. The result is that a round-trip measurement such
as ping will see <bw> bandwidth, 2 * <del> delay, and 1-((1-<plr>)**2)
packet loss rate.
* To set simplex (individual direction) parameters on a link:
tb-set-link-simplex-params <link> <src-node> <del> <bw> <plr>
As measured from a single node doing a round-trip test, you will observe
the lesser of the two directional <bw> values, the sum of the directional
<del> values and 1 - ((1-<plr1>) * (1-<plr2>)) packet loss. In effect,
a duplex link is a degenerate case of the simplex link (duh!)
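The round-trip arithmetic above can be checked with a short sketch. The function name and the tuple layout are illustrative assumptions, not part of Emulab; bandwidth is in Kbit/s and delay in ms.

```python
def round_trip(forward, reverse):
    """Combine two one-way (bw_kbps, delay_ms, plr) tuples into the values
    a round-trip test such as ping would observe."""
    bw_f, del_f, plr_f = forward
    bw_r, del_r, plr_r = reverse
    bw = min(bw_f, bw_r)                  # the slower direction limits throughput
    delay = del_f + del_r                 # one-way delays add
    plr = 1 - (1 - plr_f) * (1 - plr_r)   # a packet must survive both directions
    return bw, delay, plr

# Duplex link: 1Mb, 10ms, 1% loss in each direction.
duplex = round_trip((1000, 10, 0.01), (1000, 10, 0.01))

# Simplex link: 1Mb/10ms/1% one way, 2Mb/20ms/2% the other.
simplex = round_trip((1000, 10, 0.01), (2000, 20, 0.02))
```

Passing the same tuple twice gives the duplex case, which is the sense in which a duplex link is a degenerate simplex link.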
For LANs:
* Create a LAN with N nodes:
set <lan> [$ns make-lan "<node0> <node1> ... <nodeN>" <bw> <del>]
and to set loss rate for a LAN:
tb-set-lan-loss <lan> <loss>
Here a LAN appears as a set of pairwise links for the purposes of shaping
characteristics. Traffic from any node to any other will see the indicated
values. Thus, round-trip traffic between any pair of nodes will be the
same as for an identically shaped link between those nodes. For the
remainder of this text, we will generally refer to this as a "symmetrically
shaped" or simply "symmetric" LAN.
* You can also construct LANs with per-node characteristics:
set <lan> [$ns make-lan "<n1> <n2> ... <nN>" 100Mbs 0ms]
tb-set-node-lan-params <n1> <lan> <del1> <bw1> <loss1>
tb-set-node-lan-params <n2> <lan> <del2> <bw2> <loss2>
...
However, here the interpretation of the shaping values is slightly different.
In this case, the characteristics reflect one-way values to and from "the
LAN." In other words, it is a duplex-link to the LAN. In still other words,
round-trip traffic from <n1> to any other unshaped node on an unshaped LAN
will see <bw1> bandwidth, 2 * <del1>, and 1-((1-<loss1>)**2) packet loss.
If the other node involved in the round trip is also shaped, then round-trip
traffic will see:
bw: lesser of <bw1> and <bw2>
delay: 2 * <del1> + 2 * <del2>
plr: 1 - ((1 - <loss1>)**2 * (1 - <loss2>)**2)
We refer to this case as an "asymmetrically shaped" or "asymmetric" LAN.
NOTE: This is a bit of a misnomer however, as it is possible
to use tb-set-node-lan-params to set identical (symmetric)
characteristics on the node connections, but the observed
behavior will be different than for a so-called symmetric LAN
setup with the same characteristics.
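The per-node LAN semantics can be sketched the same way. This is only an illustration of the formulas above; the function name and argument layout are assumptions, and the numbers are example values.

```python
def lan_round_trip(node, peer=None):
    """Round-trip values seen across an asymmetric LAN.

    node and peer are (bw_kbps, delay_ms, loss) tuples as given to
    tb-set-node-lan-params; peer=None means the other node is unshaped.
    """
    bw, delay_ms, loss = node
    rt_delay = 2 * delay_ms               # to the LAN and back again
    survive = (1 - loss) ** 2             # this node's connection is crossed twice
    if peer is not None:
        peer_bw, peer_delay_ms, peer_loss = peer
        bw = min(bw, peer_bw)
        rt_delay += 2 * peer_delay_ms
        survive *= (1 - peer_loss) ** 2
    return bw, rt_delay, 1 - survive

one_shaped = lan_round_trip((1000, 10, 0.01))
both_shaped = lan_round_trip((1000, 10, 0.01), (2000, 20, 0.02))
```

Note that two nodes with identical per-node values of 10ms each see a 40ms round trip here, whereas a symmetric LAN declared with 10ms gives 20ms. That is the difference the NOTE above warns about.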
It is also possible for the base LAN to be shaped (characteristics on the
make-lan method) and for the individual node connections to be shaped (an
asymmetric symmetric LAN?). For sanity reasons we won't EVEN go there.
Shaping can also be modified dynamically using the web page or tevc.
No matter the UI, the actual work is done by sending link shaping events
to the shaping client, as described later.
It is possible to force allocation of shaping nodes even for unshaped
links (i.e., links that might be later dynamically shaped) by using the
mustdelay method:
<link-or-lan> mustdelay
End node shaping can be set globally or per-link/LAN:
tb-use-endnodeshaping <enable?>
tb-set-endnodeshaping <link-or-lan> <enable?>
1a. Is a LAN of two nodes a link?
One interesting issue that arises here, and has implications further down
the line, is whether a LAN of two nodes is equivalent to a link. The answer,
as you might expect, is "yes and no." A duplex-link will behave the same
as a symmetric LAN of two nodes. The two are in fact represented in the DB
and implemented identically. The same is NOT (quite) true of an asymmetric
LAN of two nodes, because of the different semantics. The LAN will be
"collapsed" into a link in the DB and the implementation will be the same,
but the characteristics stored in the DB and the resulting observed behavior
will be different, reflecting the differing semantics.
2. Shaping info in the database.
Shaping information is stored in the DB in three tables. One stores
the virtual ("logical") information, which is essentially as specified in
the NS topology. The other two store the physical information needed by
either the dedicated shaping node or the experiment nodes themselves (when
endnode shaping is in effect). This info includes the physical nodes used
for shaping (if any), the interfaces involved, the dummynet pipe numbers
to use, etc.
2a. virt_lans
The logical information is stored in the virt_lans table. Here, for a
given experiment, there is a row for every endpoint of a link or lan for
every node involved. For example, a link between two nodes:
set link [$ns duplex-link n1 n2 1Mbps 10ms DropTail]
tb-set-link-loss $link 0.01
would have two rows, one for n1 and one for n2:
+-------+------+------+---------+------+------+---------+
| vnode | del | bw | loss | rdel | rbw | rloss |
+-------+------+------+---------+------+------+---------+
| n1 | 5.00 | 1000 | 0.00501 | 5.00 | 1000 | 0.00501 |
| n2 | 5.00 | 1000 | 0.00501 | 5.00 | 1000 | 0.00501 |
+-------+------+------+---------+------+------+---------+
Each row contains the characteristics for "outgoing" or forward traffic on
the endpoint (del/bw/loss) and the characteristics of "incoming" or reverse
traffic on the endpoint (rdel/rbw/rloss).
Since the characteristics specified by the user for a link are for
one-way between the nodes, this DB arrangement requires that the shaping
characteristics be divided up between the nodes (across DB rows) such that
the resulting combination reflects the user-specified values. For bandwidth,
the value stored in the DB is just what the user gave, since limiting the
BW on both sides is the same as limiting on one side. For delay, half the
value is associated with each endpoint since delay values are additive.
For loss rate, there is a multiplicative effect so "half" the value means
1 - sqrt(1-<loss>). Returning to the example above, this means that the
outgoing characteristics for n1, and incoming for n2, will be bw=1000,
delay=5, loss=0.00501. Since it is a duplex-link, the return path (outgoing
for n2, incoming for n1) will be set the same.
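As a quick check of the splitting rule, here is a minimal sketch. The function name is made up for illustration; only the arithmetic comes from the text above.

```python
import math

def split_for_virt_lans(bw_kbps, delay_ms, loss):
    """Return the per-endpoint (bw, delay, loss) stored for a duplex link."""
    endpoint_delay = delay_ms / 2              # two halves add up to the total
    endpoint_loss = 1 - math.sqrt(1 - loss)    # two "halves" multiply back to 1 - loss
    return bw_kbps, endpoint_delay, endpoint_loss

bw, delay, loss = split_for_virt_lans(1000, 10, 0.01)
recombined_loss = 1 - (1 - loss) ** 2          # apply the endpoint loss twice
```

For the example link this gives bw=1000, delay=5 and a loss that rounds to 0.00501, and applying the endpoint loss twice recovers the user-specified 0.01.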
For simplex links in which each direction has different characteristics:
set link [$ns duplex-link $n1 $n2 100Mb 0ms DropTail]
tb-set-link-simplex-params $link $n1 10ms 1Mb 0.01
tb-set-link-simplex-params $link $n2 20ms 2Mb 0.02
the characteristics are again split, with the node listed as the source
node using the "outgoing" fields to store the characteristics for that
direction:
+-------+-------+------+---------+-------+------+---------+
| vnode | del | bw | loss | rdel | rbw | rloss |
+-------+-------+------+---------+-------+------+---------+
| n1 | 5.00 | 1000 | 0.00501 | 10.00 | 2000 | 0.01005 |
| n2 | 10.00 | 2000 | 0.01005 | 5.00 | 1000 | 0.00501 |
+-------+-------+------+---------+-------+------+---------+
For a symmetric delayed LAN (i.e., one in which all node pairs have virtual
duplex links with the indicated characteristics); e.g.:
set lan [$ns make-lan "n1 n2 n3" 1Mbps 10ms]
tb-set-lan-loss $lan 0.01
the DB state for the endpoints for each set of nodes is set up as for a
duplex link above:
+-------+------+------+---------+------+------+---------+
| vnode | del | bw | loss | rdel | rbw | rloss |
+-------+------+------+---------+------+------+---------+
| n1 | 5.00 | 1000 | 0.00501 | 5.00 | 1000 | 0.00501 |
| n2 | 5.00 | 1000 | 0.00501 | 5.00 | 1000 | 0.00501 |
| n3 | 5.00 | 1000 | 0.00501 | 5.00 | 1000 | 0.00501 |
+-------+------+------+---------+------+------+---------+
As mentioned in earlier text, a symmetric LAN of two nodes is represented
identically to a duplex link with the same characteristics.
For asymmetric delayed LANs (those with per-node characteristics); e.g.:
set lan [$ns make-lan "n1 n2 n3" 100Mbs 0ms]
tb-set-node-lan-params $n1 $lan 10ms 1Mbps 0.01
tb-set-node-lan-params $n2 $lan 20ms 2Mbps 0.02
tb-set-node-lan-params $n3 $lan 30ms 3Mbps 0.03
the user specified values are for the "link" between a node and the LAN.
Thus for LANs, the information is not split. Recalling that the user-
specified values are for traffic both to and from the node, the single row
associated with the connection of the node and the LAN contains those
user-specified values for both the forward and reverse directions:
+-------+-------+------+---------+-------+------+---------+
| vnode | del | bw | loss | rdel | rbw | rloss |
+-------+-------+------+---------+-------+------+---------+
| n1 | 10.00 | 1000 | 0.01000 | 10.00 | 1000 | 0.01000 |
| n2 | 20.00 | 2000 | 0.02000 | 20.00 | 2000 | 0.02000 |
| n3 | 30.00 | 3000 | 0.03000 | 30.00 | 3000 | 0.03000 |
+-------+-------+------+---------+-------+------+---------+
2b. delays
The delays table stores the "physical" information related to delays when
dedicated shaping nodes are used. This is information about the physical
instantiation of the virt_lans information and thus only exists when an
experiment is swapped in. The delays table information is structured for
the benefit of the shaping node for which it is intended.
NOTE: Shaping nodes do not even exist in the virtual (persistent)
state of an experiment. They are assigned when an experiment
is swapped in, and only physical state tables like delays know
about them.
For each experiment, there is a single row representing a delayed link or lan
connection. Each row has two sets of shaping characteristics, called "pipes".
Each pipe represents traffic flowing in one direction through the shaping
node. Exactly what that means, depends on whether we are shaping a link,
a symmetrically delayed LAN, or an asymmetrically delayed LAN. Let's look
at some examples.
NOTE: In the interest of full-disclosure, it should be noted that
the following DB tables were hand-edited for clarity. In particular,
many columns are omitted and we currently support only two delayed
links per physical shaping node. For the latter, pipe numbers have
been renumbered to be unique--as though all three LAN nodes were
delayed by the same shaping node. In reality the delays table also
contains the physical node_id of the shaping node, and it is the
combo of node_id/pipe that is truly unique.
For our example duplex link:
set link [$ns duplex-link n1 n2 1Mbps 10ms DropTail]
tb-set-link-loss $link 0.01
we get:
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| vn1 | p0 | del0 | bw0 | loss0 | vn2 | p1 | del1 | bw1 | loss1 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| n1 | 130 | 10.00 | 1000 | 0.010 | n2 | 140 | 10.00 | 1000 | 0.010 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
where pipe0 (p0) represents the shaping (del0/bw0/loss0) on the path from
n1 to n2, and pipe1 (p1) represents the shaping (del1/bw1/loss1) on the
path from n2 to n1. As we see, for the duplex link, both directions are
identical. For the simplex link:
set link [$ns duplex-link $n1 $n2 100Mb 0ms DropTail]
tb-set-link-simplex-params $link $n1 10ms 1Mb 0.01
tb-set-link-simplex-params $link $n2 20ms 2Mb 0.02
we get:
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| vn1 | p0 | del0 | bw0 | loss0 | vn2 | p1 | del1 | bw1 | loss1 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| n1 | 130 | 10.00 | 1000 | 0.010 | n2 | 140 | 20.00 | 2000 | 0.020 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
Here we see the two pipes reflecting the different characteristics.
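One way to read the relationship between the two virt_lans rows of a link and its single delays row is sketched below. The combination rule is inferred from the example tables in this file, not taken from the Emulab code, and the function and key names are hypothetical.

```python
def link_pipes(row_a, row_b):
    """Derive (pipe0, pipe1) for a link from its two virt_lans-style rows.

    Each row is a dict with keys del, bw, loss, rdel, rbw, rloss.
    pipe0 carries row_a -> row_b traffic, pipe1 the reverse.
    """
    def combine(out_end, in_end):
        delay = out_end["del"] + in_end["rdel"]                    # halves add
        bw = min(out_end["bw"], in_end["rbw"])
        loss = 1 - (1 - out_end["loss"]) * (1 - in_end["rloss"])   # halves multiply
        return round(delay, 2), bw, round(loss, 3)
    return combine(row_a, row_b), combine(row_b, row_a)

# The simplex-link rows from the virt_lans example.
n1 = {"del": 5.00, "bw": 1000, "loss": 0.00501,
      "rdel": 10.00, "rbw": 2000, "rloss": 0.01005}
n2 = {"del": 10.00, "bw": 2000, "loss": 0.01005,
      "rdel": 5.00, "rbw": 1000, "rloss": 0.00501}
pipe0, pipe1 = link_pipes(n1, n2)
```

With these inputs pipe0 comes out as 10ms/1000/0.010 and pipe1 as 20ms/2000/0.020, matching the delays row shown for the simplex link.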
For our symmetric delayed LAN:
set lan [$ns make-lan "n1 n2 n3" 1Mbps 10ms]
tb-set-lan-loss $lan 0.01
we have:
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| vn1 | p0 | del0 | bw0 | loss0 | vn2 | p1 | del1 | bw1 | loss1 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| n1 | 130 | 5.00 | 1000 | 0.005 | n1 | 140 | 5.00 | 1000 | 0.005 |
| n2 | 150 | 5.00 | 1000 | 0.005 | n2 | 160 | 5.00 | 1000 | 0.005 |
| n3 | 110 | 5.00 | 1000 | 0.005 | n3 | 120 | 5.00 | 1000 | 0.005 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
Notice that this is NOT like the entries in the delays table would be for
duplex links between any two nodes in the LAN. Instead, the values are
"halved". Even though the definition of a symmetric shaped LAN leads one
to believe that the connection between any pair of LAN nodes would look
like a duplex link, that isn't the case here. This is due to the fact that
the implementation of LANs is different than that of links and the delays
table reflects the implementation. The difference is that, for links,
shaping is between two nodes while, for LANs, the shaping is between a node
and the LAN. Hence one-way traffic on a link is shaped by a single pipe
(e.g., n1 -> n2 via pipe 130 in the duplex link table) while in a LAN, it
is shaped by two (e.g., n1 -> LAN via pipe 130, LAN -> n2 via pipe 160).
So the values must be different in the two implementations to achieve the
same observed result.
But what if we had a LAN of two nodes; e.g., removing "n3" above? Then
it is represented exactly like a duplex link. The two lines you would
get in the delays table by removing "n3" above, are in fact collapsed
into a single entry that looks like the duplex-link example.
For an asymmetric delayed LAN where nodes have individual shaping parameters,
such as:
set lan [$ns make-lan "n1 n2 n3" 100Mbs 0ms]
tb-set-node-lan-params $n1 $lan 10ms 1Mbps 0.01
tb-set-node-lan-params $n2 $lan 20ms 2Mbps 0.02
tb-set-node-lan-params $n3 $lan 30ms 3Mbps 0.03
we get:
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| vn1 | p0 | del0 | bw0 | loss0 | vn2 | p1 | del1 | bw1 | loss1 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| n1 | 130 | 10.00 | 1000 | 0.010 | n1 | 140 | 10.00 | 1000 | 0.010 |
| n2 | 150 | 20.00 | 2000 | 0.020 | n2 | 160 | 20.00 | 2000 | 0.020 |
| n3 | 110 | 30.00 | 3000 | 0.030 | n3 | 120 | 30.00 | 3000 | 0.030 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
Here the entries are very much like the duplex-link case. That is because
asymmetric delayed LANs are essentially duplex-links from a node to the LAN.
Thus pipe0 is the path from node to LAN, and pipe1 the path from LAN to node.
Note that since this is a LAN configuration, traffic from one node to another
does traverse two pipes.
If we again remove "n3" to get a LAN of two nodes, the remaining two lines
are again collapsed into one, but the result is NOT the same as for the
simplex-link example. Instead we get:
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| vn1 | p0 | del0 | bw0 | loss0 | vn2 | p1 | del1 | bw1 | loss1 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
| n1 | 130 | 30.00 | 1000 | 0.030 | n2 | 140 | 30.00 | 1000 | 0.030 |
+------+-----+-------+------+-------+------+-----+-------+------+-------+
This reflects the behavior that traffic from n1 to n2 will see, for example,
10ms delay from n1 to the LAN and then another 20ms (30ms total) from the
LAN to n2.
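The collapsed values can be reproduced with a few lines. This is a sketch of the arithmetic only; the function name is an assumption, and rounding the loss to three places simply mirrors how the tables in this file are printed.

```python
def lan_one_way(src, dst):
    """One-way values from src to dst across an asymmetric LAN.

    src and dst are (delay_ms, bw_kbps, loss) per-node settings; traffic
    crosses src's connection to the LAN and then dst's connection from it.
    """
    delay = src[0] + dst[0]
    bw = min(src[1], dst[1])
    loss = 1 - (1 - src[2]) * (1 - dst[2])
    return delay, bw, round(loss, 3)

n1 = (10, 1000, 0.01)
n2 = (20, 2000, 0.02)
collapsed = lan_one_way(n1, n2)
```

Because each node's values apply in both directions, n2 to n1 gives the same result, which is why both pipes in the collapsed row are identical.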
2c. linkdelays
The linkdelays table is the analog of the delays table for cases where
endnode shaping is used. In other words, linkdelays entries exist for
links and LANs that have endnode shaping specified, delays table entries
exist for all others.
The structure of linkdelays is very similar to that of delays.
As with delays, the entries only exist when an experiment is swapped in.
Again, for each experiment, there is a single row representing a delayed
link or LAN connection and each row has two pipes and the associated
characteristics. However, here the pipes represent traffic out of the
node ("pipe", analogous to delays "pipe0") and traffic into the node
("rpipe" analogous to delays "pipe1").
One interesting difference is in how links are represented. Instead of
being two pipes on the shaping node (and thus one delays table entry),
it is now one pipe each on each endnode (and thus two linkdelays table
entries). The entries for our example duplex link look like:
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pnode | vn | pipe | del | bw | loss | rpipe | rdel | rbw | rloss |
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pc1 | n1 | 130 | 10.00 | 1000 | 0.010 | 0 | 0.00 | 100 | 0.000 |
| pc2 | n2 | 130 | 10.00 | 1000 | 0.010 | 0 | 0.00 | 100 | 0.000 |
+-------+----+------+-------+------+-------+-------+------+------+-------+
Note the odd information for the reverse pipe. This is because a reverse
pipe is not set up as there is no shaping to do in that direction. Traffic
from n1 to n2 is shaped on n1 and traffic from n2 to n1 is shaped on n2.
The simplex link example is similar:
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pnode | vn | pipe | del | bw | loss | rpipe | rdel | rbw | rloss |
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pc1 | n1 | 130 | 10.00 | 1000 | 0.010 | 0 | 0.00 | 100 | 0.000 |
| pc2 | n2 | 130 | 20.00 | 2000 | 0.020 | 0 | 0.00 | 100 | 0.000 |
+-------+----+------+-------+------+-------+-------+------+------+-------+
Information for LANs is the same as in the delays table. For the symmetric
example all pipes are used and the characteristics are "halved":
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pnode | vn | pipe | del | bw | loss | rpipe | rdel | rbw | rloss |
+-------+----+------+-------+------+-------+-------+------+------+-------+
| pc1 | n1 | 110 | 5.00 | 1000 | 0.005 | 120 | 5.00 | 1000 | 0.005 |
| pc2 | n2 | 110 | 5.00 | 1000 | 0.005 | 120 | 5.00 | 1000 | 0.005 |
| pc3 | n3 | 110 | 5.00 | 1000 | 0.005 | 120 | 5.00 | 1000 | 0.005 |
+-------+----+------+-------+------+-------+-------+------+------+-------+
and for the asymmetric LAN:
+-------+----+------+-------+------+-------+-------+-------+------+-------+
| pnode | vn | pipe | del | bw | loss | rpipe | rdel | rbw | rloss |
+-------+----+------+-------+------+-------+-------+-------+------+-------+
| pc1 | n1 | 110 | 10.00 | 1000 | 0.010 | 120 | 10.00 | 1000 | 0.010 |
| pc2 | n2 | 110 | 20.00 | 2000 | 0.020 | 120 | 20.00 | 2000 | 0.020 |
| pc3 | n3 | 110 | 30.00 | 3000 | 0.030 | 120 | 30.00 | 3000 | 0.030 |
+-------+----+------+-------+------+-------+-------+-------+------+-------+
2d. A few words about queues.
Conspicuously absent from the discussion thus far, is the topic of queues
and queue lengths. The NS specification allows queue types and lengths to
be set on links and LANs, and both the virt_lans and delays tables contain
information about queues--I have just ignored it. This needs to be
addressed; I have just never taken the time to understand the issues.
However, briefly, the default queue size is 50 packets. That value can be
adjusted or changed to reflect bytes rather than packets. There are also
some parameters for describing RED queuing as implemented by dummynet.
The one anomalous case is for the so-called incoming (reverse) path for LANs
in the delays table. Here the queue size is hardwired to five, (I believe)
because bottleneck queuing for the connection between two nodes on the LAN
should take place only once, and that would be at the outgoing pipe.
This is an area that needs to be better understood and described.
3. Shaping info on the client.
To reiterate, shaping clients are most often dedicated "delay nodes,"
but may also be experiment nodes themselves when endnode shaping is used.
Each shaping client runs some delay configuration scripts at boot time to
handle the initial, static configuration of delays. The client also runs
an instance of the delay-agent to handle dynamic changes to shaping.
The boot time scripts use database information returned via the "tmcc"
database proxy to perform the initial configuration and also to provide
interface configuration to the delay-agent.
3a. Dedicated shaping node
The DB state is returned in the tmcd "delay" command and is a series of
lines, each line looking like:
DELAY INT0=<mac0> INT1=<mac1> \
PIPE0=<pipe0> DELAY0=<delay0> BW0=<bw0> PLR0=<plr0> \
PIPE1=<pipe1> DELAY1=<delay1> BW1=<bw1> PLR1=<plr1> \
LINKNAME=<link> \
<queue0 params> <queue1 params> \
VNODE0=<node0> VNODE1=<node1> NOSHAPING=<0|1>
pretty much a direct reflection of the information in the delays table,
one line per row in the table.
<mac0> and <mac1> are used to identify the physical interfaces which are
the endpoints of the link. The client runs a program called findif to map
a MAC address into an interface to configure. Identification is done in
this manner since different OSes have different names for interfaces
(e.g., "em0", "eth0") and even different versions of an OS might label
interfaces in different orders.
<pipe0> and <pipe1> identify the two directions of a link, with <delayN>,
<bwN> and <plrN> being the associated characteristics. How these are used
is explained below.
<link> is the name of the link as given in the NS file and is used to
identify the link in the delay-agent.
<queueN params> are the parameters associated with queuing which I will
continue to gloss over for now. Suffice it to say, the parameters pretty
much directly translate into dummynet configuration parameters.
<vnode0> and <vnode1> are the names of the nodes at the end points of the
link as given in the NS file.
The NOSHAPING parameter is not used by the delay agent. It is used for
link monitoring to indicate that a bridge with no pipes should be set up.
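To make the line format concrete, here is a minimal parsing sketch. It assumes the reply arrives as a single line of whitespace-separated KEY=VALUE tokens (the backslashes above being presentational); the MAC strings and other values in the sample are made up, and the queue parameters are omitted because their exact form is not shown here.

```python
def parse_tmcd_line(line):
    """Split a tmcd reply line into its leading keyword and KEY=VALUE fields."""
    tokens = line.split()
    keyword, fields = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value               # values are kept as strings
    return keyword, fields

# Hypothetical sample corresponding to the example duplex link.
sample = ("DELAY INT0=00d0b7000001 INT1=00d0b7000002 "
          "PIPE0=130 DELAY0=10 BW0=1000 PLR0=0.01 "
          "PIPE1=140 DELAY1=10 BW1=1000 PLR1=0.01 "
          "LINKNAME=link VNODE0=n1 VNODE1=n2 NOSHAPING=0")
keyword, fields = parse_tmcd_line(sample)
```

The result is a flat dictionary, so fields["PIPE0"] and fields["VNODE1"] give "130" and "n2" respectively.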
This information is used at boot time to create two files. One is for
the benefit of the delay-agent and is discussed later. The other,
/var/emulab/boot/rc.delay is constructed on dedicated shaping nodes.
This file contains shell commands and is run to configure the bridge
and pipes. For each delayed link/LAN (i.e., each line of the tmcc delays
information) the two interfaces are bridged together using the FreeBSD
bridge code and IPFW is enabled for the bridge. Then the assorted IPFW
pipes are configured, again using the information from tmcc. The result
looks something like this for a link:
sysctl -w net.link.ether.bridge_cfg=<if0>:69,<if1>:69,
sysctl -w net.link.ether.bridge=1
sysctl -w net.link.ether.bridge_ipfw=1
...
ipfw add <pipe0> pipe <pipe0> ip from any to any in recv <if0>
ipfw add <pipe1> pipe <pipe1> ip from any to any in recv <if1>
ipfw pipe <pipe0> config delay <del0>ms bw <bw0>Kbit/s plr <plr0> <q0-params>
ipfw pipe <pipe1> config delay <del1>ms bw <bw1>Kbit/s plr <plr1> <q1-params>
...
Or pictorially:
+-------+ +-------+ +-------+
| | +-----+ +-----+ | |
| node0 |---- pipe0 ---->| if0 | delay | if1 |<---- pipe1 ----| node1 |
| | del0/bw0/plr0 +-----+ +-----+ del1/bw1/plr1 | |
+-------+ +-------+ +-------+
In terms of physical connectivity, node0's interface and delay's
interface <if0> are in a switch VLAN together while node1's interface and
delay's <if1> are in another VLAN.
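The IPFW part of rc.delay for a single link could be generated along these lines. This is a sketch that follows the command templates shown above, not the actual boot script; the function name, the dict keys, the interface names and the "queue 50" option (chosen to match the stated default of 50 packets) are all illustrative assumptions.

```python
def rc_delay_commands(if0, if1, pipe0, pipe1):
    """Build the IPFW commands for one shaped link.

    pipe0 and pipe1 are dicts with keys num, delay_ms, bw_kbps, plr and
    queue (a ready-made dummynet queue option string, possibly empty).
    """
    cmds = []
    # One rule per interface, matching packets as they arrive on it.
    for iface, pipe in ((if0, pipe0), (if1, pipe1)):
        cmds.append(
            f"ipfw add {pipe['num']} pipe {pipe['num']} "
            f"ip from any to any in recv {iface}"
        )
    # Then the shaping characteristics for each pipe.
    for pipe in (pipe0, pipe1):
        config = (
            f"ipfw pipe {pipe['num']} config delay {pipe['delay_ms']}ms "
            f"bw {pipe['bw_kbps']}Kbit/s plr {pipe['plr']} {pipe['queue']}"
        )
        cmds.append(config.strip())
    return cmds

commands = rc_delay_commands(
    "em0", "em1",
    {"num": 130, "delay_ms": 10, "bw_kbps": 1000, "plr": 0.01, "queue": "queue 50"},
    {"num": 140, "delay_ms": 20, "bw_kbps": 2000, "plr": 0.02, "queue": "queue 50"},
)
```

The rule number and the pipe number are deliberately the same, as in the listing above, which makes it easy to match a rule to its pipe when inspecting a shaping node.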
For a LAN, consider a shaping node which is handling two nodes from a
potentially larger LAN. There would be two lines from tmcd:
DELAY INT0=<mac0> INT1=<mac1> \
PIPE0=<pipe0> DELAY0=<delay0> BW0=<bw0> PLR0=<plr0> \
PIPE1=<pipe1> DELAY1=<delay1> BW1=<bw1> PLR1=<plr1> \
LINKNAME=<lan> \
<queue0 params> <queue1 params> \
VNODE0=<node0> VNODE1=<node0> NOSHAPING=<0|1>
DELAY INT0=<mac2> INT1=<mac3> \
PIPE0=<pipe2> DELAY0=<delay2> BW0=<bw2> PLR0=<plr2> \
PIPE1=<pipe3> DELAY1=<delay3> BW1=<bw3> PLR1=<plr3> \
LINKNAME=<lan> \
<queue2 params> <queue3 params> \
VNODE0=<node1> VNODE1=<node1> NOSHAPING=<0|1>
Note that LINKNAME is the same in both lines since the shaping node is
handling two nodes in the same LAN. Also, in each line, VNODE0 and VNODE1
are the same since the two pipes describe connections for the same node
to and from the LAN as indicated in the picture below.
sysctl -w net.link.ether.bridge_cfg=<if0>:69,<if1>:69,<if2>:70,<if3>:70,
sysctl -w net.link.ether.bridge=1
sysctl -w net.link.ether.bridge_ipfw=1
...
ipfw add <pipe0> pipe <pipe0> ip from any to any in recv <if0>
ipfw add <pipe1> pipe <pipe1> ip from any to any in recv <if1>
ipfw pipe <pipe0> config delay <del0>ms bw <bw0>Kbit/s plr <plr0> <q0-params>
ipfw pipe <pipe1> config delay <del1>ms bw <bw1>Kbit/s plr <plr1> queue 5
ipfw add <pipe2> pipe <pipe2> ip from any to any in recv <if2>
ipfw add <pipe3> pipe <pipe3> ip from any to any in recv <if3>
ipfw pipe <pipe2> config delay <del2>ms bw <bw2>Kbit/s plr <plr2> <q2-params>
ipfw pipe <pipe3> config delay <del3>ms bw <bw3>Kbit/s plr <plr3> queue 5
which looks like:
+-------+ +-------+ +-------+
| | +-----+ +-----+ | |
| node0 |---- pipe0 ---->| if0 | | if1 |<---- pipe1 ----| |
| | del0/bw0/plr0 +-----+ +-----+ del1/bw1/plr1 | |
+-------+ | | | |
| delay | | "lan" |
+-------+ | | | |
| | +-----+ +-----+ | |
| node1 |---- pipe2 ---->| if2 | | if3 |<---- pipe3 ----| |
| | del2/bw2/plr2 +-----+ +-----+ del3/bw3/plr3 | |
+-------+ +-------+ +-------+
| |
(to other | |
shaping nodes) V V
In terms of physical connectivity, node0's interface and delay's
interface <if0> are in a switch VLAN together as are node1's interface
and delay's <if2>. On the other side, delay's <if1> and <if3> (along with
interfaces for shaping nodes for any other nodes in the LAN) are in a VLAN
together. Note that traffic between node0 and node1 will not take a
loopback "shortcut" on delay as the bridges between <if0>/<if1> and
<if2>/<if3> ensure that traffic is pushed out onto the LAN.
NOTE: Looking at the previous two diagrams, you can see the main
reason that LANs of two nodes are converted into links. If the
LAN example just above had only the two nodes shown and was
implemented as a true LAN, it would require twice as many shaping
resources as for a link. That is, a link has one set of shaping
pipes between nodes while a LAN of two nodes would have two sets.
NOTE: These diagrams also show a subtle implementation issue
with respect to switches. Consider the shaped link between
node0 and node1 in the first diagram. Because the shaping node
is bridging traffic between its if0 and if1, a side effect is
that traffic with node0's MAC address will arrive at the switch
not only at the port to which node0 is attached, but also at the
port to which delay's if1 is attached. Ditto for node1's MAC
on its switch port and delay's if0 port. Some switches cannot
handle the same MAC address appearing on multiple ports in
different VLANs. These switches have a global (as opposed to
per-VLAN) MAC table used for "learning" which MACs are behind
which ports and are confused by the same MAC appearing on
different ports.
3a0. Linux dedicated delay nodes.
There exists a beta implementation of dedicated delays for linux. It supports
the same queue param set as FreeBSD's dummynet pipes (including RED/GRED queue
params). Essentially, it is our linkdelay support for linux (see Section 3c)
with the addition of bridges. The main difference once bridges are added is
that you hook the queueing disciplines directly to the bridge and (at least
as of 2.6) you no longer need IMQ support to handle ingress queueing.
3b. Endnode shaping
The DB state is returned in the tmcd "linkdelay" command and is a series of
lines, each line looking like:
LINKDELAY IFACE=<mac> TYPE=<type> LINKNAME=<link> VNODE=<node> \
INET=<IP> MASK=<netmask> \
PIPE=<pipe> DELAY=<delay> BW=<bw> PLR=<plr> \
RPIPE=<rpipe> RDELAY=<rdelay> RBW=<rbw> RPLR=<rplr> \
<queue params>
pretty much a direct reflection of the information in the linkdelays table,
one line per row in the table.
<mac> is used to identify the physical interface which corresponds to
the endpoint of the link on this node. As with shaping nodes, the client
uses findif to map the MAC address into an interface to configure.
<type> is the type of the link or LAN, either "simplex" or "duplex".
This is used as an indication as to whether the reverse pipe needs to
be set up (duplex) or not (simplex). This is an unfortunate overloading
of the terms, as a duplex link will be labeled with TYPE=simplex.
<link> is the name of the link as given in the NS file and is used to
identify the link in the delay-agent.
<node> is the node receiving the info (us). It is not really needed.
<IP> and <netmask> are no longer important. They were used to enable
endnode shaping on physical links that were multiplexed using IP aliasing.
These were used along with a local modification to IPFW to apply multiple
rules to an interface based on the network of the "next hop". We no longer
allow this (though the rules are still setup, see below) as it did not
completely work.
<pipe> and <rpipe> identify the two directions of a link, with <(r)delay>,
<(r)bw>, and <(r)plr> being the associated characteristics.
<queue params> are the parameters associated with queuing which I will
continue to gloss over for now. Suffice it to say, the parameters pretty
much directly translate into dummynet configuration parameters.
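The role of TYPE can be illustrated with a small sketch that decides which pipes to configure for one record. It assumes the record has already been split into a dictionary of strings; the function name and the sample values (modeled on the linkdelays tables above) are hypothetical.

```python
def pipes_to_configure(fields):
    """Return (direction, pipe, delay, bw, plr) tuples for one LINKDELAY record."""
    # The outgoing pipe is always configured.
    pipes = [("out", fields["PIPE"], fields["DELAY"], fields["BW"], fields["PLR"])]
    # The reverse pipe is only configured when TYPE says "duplex".
    if fields["TYPE"] == "duplex":
        pipes.append(("in", fields["RPIPE"], fields["RDELAY"],
                      fields["RBW"], fields["RPLR"]))
    return pipes

# One end of the example duplex link: note TYPE=simplex, so no reverse pipe.
link_end = {"TYPE": "simplex", "PIPE": "130", "DELAY": "10", "BW": "1000",
            "PLR": "0.01", "RPIPE": "0", "RDELAY": "0", "RBW": "100", "RPLR": "0"}
# One node of the symmetric LAN example: both directions are shaped.
lan_end = {"TYPE": "duplex", "PIPE": "110", "DELAY": "5", "BW": "1000",
           "PLR": "0.005", "RPIPE": "120", "RDELAY": "5", "RBW": "1000",
           "RPLR": "0.005"}
link_pipes_needed = pipes_to_configure(link_end)
lan_pipes_needed = pipes_to_configure(lan_end)
```

This reflects the overloading described above: an endpoint of a duplex link only shapes its own outgoing traffic, while a LAN endpoint shapes both directions.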
As with the shaping node case, the information is used at boot time to
create two files. One is for the delay-agent, discussed in the next section,
and the other is for boot time configuration of the shaping pipes.
/var/emulab/boot/rc.linkdelay is analogous to rc.delay for dedicated shaping
nodes. It contains shell commands and is run to configure IPFW on the shaped
interfaces. The result looks something like:
ifconfig <if> media 100baseTX mediaopt full-duplex
ipfw add <pipe> pipe <pipe> ip from any to any out xmit <if>
ipfw pipe <pipe> config delay <del>ms bw <bw>Kbit/s plr <plr> <q-params>
ipfw add <rpipe> pipe <rpipe> ip from any to any in recv <if>
ipfw pipe <rpipe> config delay <rdel>ms bw <rbw>Kbit/s plr <rplr> queue 5
3c. Endnode shaping on Linux.
Endnode shaping is currently only implemented in Redhat Linux 9 (aka, the
2.4 kernel) since it involves a local modification to tc ("traffic control")
and two packet schedulers (that implement delay and plr) in the kernel (NOTE:
these have since been ported to 2.6 kernels).
As described in Section 3b, we use the same information from the LINKDELAY tmcd
command to setup traffic shaping on links. On linux, we use tc (one of the
iproute2 utils) to do the traffic shaping and iptables (with IMQ patches) to
do some funky forwarding. tc is the userspace interface to the packet
schedulers in the kernel, which implement things like token bucket filters for
rate limiting, or plr and delay (tc calls these things "queueing
disciplines"). Instead of sending the packet directly to an interface, the
kernel pushes it to any queueing disciplines setup for the interface (by
default, each interface gets a pfifo buffer). qdiscs can be chained, and some
can be classful (meaning you can tag packets in iptables as belonging to a
class, and the per-class rules in the qdisc get applied). We only make use of
the chaining aspect. We add qdiscs to an interface in the following order:
PLR, delay, and rate limit. Note that linux endnode shaping does not presently
support RED/GRED queue params, although it easily could as of 2.6 kernels.
However, since the linux kernel network stack did not support ingress queueing
disciplines, we used the IMQ (intermediate queue) patches for the kernel and
iptables so that we could handle ingress queueing for the duplex case. The IMQ
patches create imqX devices. With iptables, you siphon off all incoming
traffic on an interface to an imq device, then set the imq device's "target" to
a real interface. Then, of course, we can attach qdiscs to the imq device to
handle ingress packet scheduling.
The commands to setup endnode shaping are written by
/usr/local/etc/emulab/delaysetup to /var/emulab/boot/rc.linkdelay. The key
commands look something like the following:
ifconfig <if> txqueuelen <slots>
tc qdisc add dev <if> handle <pipeno> root plr <plr>
tc qdisc add dev <if> handle <pipeno+q_i*10> parent <pipeno+(q_i-1)*10>:1 \
delay usecs <delay>
tc qdisc add dev <if> handle <pipeno+q_i*10> parent <pipeno+(q_i-1)*10>:1 \
htb default 1
tc class add dev <if> classid <pipeno+q_i*10> \
parent <pipeno+(q_i-1)*10>:1 htb rate <bw> ceil <bw>
(if the link is duplex, there are similar rules for the imq device that is
interposed for ingress queueing; the iptables command to do this shuffle
looks like:
iptables -t mangle -A PREROUTING -i <if> -j IMQ --todev <imqno> ).
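The handle arithmetic in the tc commands above can be sketched as
follows. This is an illustrative Python fragment, assuming that q_i
counts the qdiscs chained after the root (1 for delay, 2 for the rate
limiter); the function name is hypothetical.

```python
def qdisc_handles(pipeno, nqdiscs=3):
    """Return (handle, parent) pairs for a chain of qdiscs on one interface.

    The first qdisc is the root. Qdisc q_i (counting from 1 after the root)
    gets handle pipeno + q_i*10 and hangs off class :1 of the previous one.
    """
    chain = [(pipeno, "root")]
    for q_i in range(1, nqdiscs):
        handle = pipeno + q_i * 10
        parent = f"{pipeno + (q_i - 1) * 10}:1"
        chain.append((handle, parent))
    return chain
```

With a pipe number of 100 this yields handles 100 (plr, root), 110
(delay, parent 100:1) and 120 (htb, parent 110:1).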
4. Dynamic shaping with the delay-agent.
4a. delay-agent running on dedicated shaping nodes.
The delay-agent uses a mapping file created at boot time to determine
what names are associated with what interfaces and delay pipes.
/var/emulab/logs/delay_mapping contains a line describing each link
which is to be shaped by this node. Lines look like:
<linkname> <linktype> <node0> <node1> <if0> <if1> <pipe0> <pipe1>
<linkname> is what is specified in the ns file as the link/lan name.
It is used as the ID for operations (events) on the link/lan.
<linktype> is duplex or simplex.
<node0> and <node1> are the ns names of nodes which are the endpoints
of a link. For a lan, they will be the same name.
<if0> and <if1> are the interfaces *on the delay node* for the two sides
of the link. <if0> is associated with <node0> and <if1> with <node1>.
For a lan, <if0> is associated with <node0> and <if1> with "the lan"
(see below for more info).
Reviewing the diagrams from the previous section, the shaping setup for
a link (or LAN of two nodes) looks like:
+-------+ +-------+ +-------+
| | +-----+ +-----+ | |
| node0 |--- pipe0 -->| if0 | delay | if1 |<-- pipe1 ---| node1 |
| | +-----+ +-----+ | |
+-------+ +-------+ +-------+
So the delay_mapping file provides the necessary context to map events
for the link <linkname> to IPFW operations on dummynet pipes.
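To make the file format concrete, here is a minimal Python sketch of a
parser for delay_mapping lines. It is not the delay-agent's own parser;
the class and function names are hypothetical, and it assumes exactly
eight whitespace-separated fields with numeric pipe identifiers.

```python
from dataclasses import dataclass


@dataclass
class DelayMapping:
    linkname: str
    linktype: str   # "duplex" or "simplex"
    node0: str
    node1: str
    if0: str
    if1: str
    pipe0: int
    pipe1: int

    @property
    def is_lan(self):
        # For a lan, both node fields carry the same name.
        return self.node0 == self.node1


def parse_delay_mapping(text):
    """Parse delay_mapping contents into a list of DelayMapping entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 8:
            raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
        *names, pipe0, pipe1 = fields
        entries.append(DelayMapping(*names, int(pipe0), int(pipe1)))
    return entries
```

An event for <linkname> can then be resolved by looking up the entry with
that name and using its interface and pipe fields.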
A LAN of 3 or more nodes is considerably different. Each node will have
two pipes again, one between the node and the delay node and one between
the delay node and "the lan". The delay_mapping file now looks like:
<linkname> <linktype> <node0> <node0> <if0> <if1> <pipe0> <pipe1>
<linkname> <linktype> <node1> <node1> <if2> <if3> <pipe2> <pipe3>
<linkname> <linktype> <node2> <node2> <if4> <if5> <pipe4> <pipe5>
NOTE: Of course, our shaping nodes can only handle two links
each since they have only four interfaces, so there will
actually be two shaping nodes for a LAN of three nodes.
But for this explanation, we pretend that one shaping node
has six interfaces and would have the above lines.
This translates into:
+-------+ +-------+ +-------+
| | +-----+ +-----+ | |
| node0 |--- pipe0 -->| if0 | | if1 |<-- pipe1 ---| |
| | +-----+ +-----+ | |
+-------+ | | | |
| | | |
+-------+ | | | |
| | +-----+ +-----+ | |
| node1 |--- pipe2 -->| if2 | delay | if3 |<-- pipe3 ---| "lan" |
| | +-----+ +-----+ | |
+-------+ | | | |
| | | |
+-------+ | | | |
| | +-----+ +-----+ | |
| node2 |--- pipe4 -->| if4 | | if5 |<-- pipe5 ---| |
| | +-----+ +-----+ | |
+-------+ +-------+ +-------+
4b. delay-agent on end nodes.
Fill me in...
4c. Dynamic configuration via events.
Emulab events are used to communicate and effect changes on links.
delay-agent specific events have the following arguments.
OBJNAME: the link being controlled.
The name is of the form <linkname> or <linkname>-<nodename>.
The former is used for duplex links and symmetrically shaped LANs to
change the global characteristics. The latter is used to identify
specific endpoints of a link or LAN to effect simplex-style changes.
OBJTYPE: the event agent type.
Always "LINK".
EVENTTYPE: operations on links and LANs.
One of: RESET, UP, DOWN, MODIFY.
RESET forces a complete re-running of "delaysetup" which tears down
all existing dummynet and bridging, and sets it up again. Currently
only used as part of the Flexlab infrastructure below, but not specific
to it.
UP, DOWN will take the indicated link up or down. Taking a link down
is done by setting the packet loss rate to 1. Up returns the plr to
its previous value.
MODIFY is used for all other changes:
PIPE=N: shaping pipe to apply changes to,
BANDWIDTH=N: bandwidth, measured in kilobits per second,
DELAY=N: one-way delay measured in milliseconds,
PLR=N.N: a packet loss probability between 0 and 1,
LIMIT=N: the maximum length of the bandwidth shaping queue in packets,
QUEUE-IN-BYTES: indicates that limit is bytes rather than packets,
MAXTHRESH, THRESH, LINTERM, Q_WEIGHT: assorted dummynet RED params.
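The UP/DOWN behavior described above (DOWN sets the loss rate to 1, UP
restores the previous value) can be modeled with a small Python sketch.
The class name is hypothetical, and the handling of a repeated DOWN is an
assumption; the text does not say what the delay-agent does in that case.

```python
class PipeState:
    """Toy model of one shaping pipe's loss rate across UP/DOWN events."""

    def __init__(self, plr=0.0):
        self.plr = plr
        self._saved_plr = None  # set only while the link is down

    def down(self):
        # Assumption: a second DOWN must not overwrite the saved value.
        if self._saved_plr is None:
            self._saved_plr = self.plr
            self.plr = 1.0

    def up(self):
        if self._saved_plr is not None:
            self.plr = self._saved_plr
            self._saved_plr = None
```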
The tevc syntax for sending these events is:
tevc -e pid/eid now <objname> <eventtype> <args...>
Links and lans are identified by the names given to them in the NS file
(e.g., "link1", "biglan"). Appending the node name with a dash (e.g.,
"link1-n2", "biglan-client5") identifies a particular node's connection
to the indicated link/lan and is used for modifying one-way params.
The object type (LINK) is implied by the object and does not need to be
passed via tevc.
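For scripted use, a tevc command line in the form shown above can be
assembled as in this Python sketch. The helper name is hypothetical; it
only builds the string and does not validate the arguments against what
the delay-agent accepts.

```python
import shlex


def tevc_command(pid, eid, objname, eventtype, args=None):
    """Build a 'tevc -e pid/eid now <objname> <eventtype> <args...>' string."""
    parts = ["tevc", "-e", f"{pid}/{eid}", "now", objname, eventtype]
    for key, value in (args or {}).items():
        parts.append(f"{key}={value}")
    # Quote each word so the result is safe to paste into a shell.
    return " ".join(shlex.quote(part) for part in parts)
```

For example, passing "link1-n2", "MODIFY" and {"BANDWIDTH": 1000,
"DELAY": 10} produces a one-way change for n2's side of link1.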
5. Flexlab configuration.
Flexlab has a variety of interesting requirements. In the "application-
centric" Internet Model (ACIM), it wants to be able to perform traffic
shaping on a per-flow basis. Here a flow is a combination of protocol,
src and dst IP address and port numbers. These flow pipes are setup and
torn down on demand as the application being monitored opens and closes
network connections. Different pipes are used for traffic in each direction.
In addition to bandwidth and delay (no loss rate yet), it also indirectly
tracks the maximum queuing delay of the bottleneck router and wants to be
able to emulate that in dummynet.
There are also two "simple," measurement-driven models. The simple-static
model uses historic data to do a one-time configuration of the shaping of
links between host pairs. The simple-dynamic model continuously updates
the link shaping parameters based on current values collected from a
background measurement service. Both of the simple models require two
simplex pipes per host pair in the experiment, one in each direction.
Finally, there are some improvements on the simple model which attempt
to recognize bottleneck links in the topology of communicating nodes.
In this so-called "hybrid" model, there is not a complete set of node-pair
shaping pipes for all characteristics. Instead, a node may share a
bandwidth pipe to a set of destination nodes. The node may also have a
different shared bandwidth pipe to another set of nodes, or may have
per-node destination bandwidth pipes as well:
N5 <----+ +----> N2
\ 5Mbps 10Mbps /
N6 <------+<----- N1 ------>+------> N3
/ \
N7 <-------------+ +----> N4
50Mbps
This is the "shared destination" variant of the hybrid model. There is also
an analogous (and as of yet unimplemented) shared-source model in which
incoming traffic for a node from a set of nodes shares a bandwidth pipe.
In both of the shared models, delay and loss are still per node-pair
attributes; only bandwidth is shared.
It is assumed that the ACIM and simple models will not exist within the
same experiment; i.e., a delay node only has to implement one or the
other at any given time.
While all of these models are configured and managed dynamically using
events and the delay-agents, there are some changes to the NS file
necessary to accommodate Flexlab usage.
A Flexlab experiment must have all nodes in a "cloud" created via the
"make-cloud" method instead of "make-lan":
set cloud [$ns make-cloud "n1 n2 n3" 100Mbps 0ms]
Make-cloud is roughly equivalent to creating a symmetrically shaped LAN
with no initial shaping (i.e., with mustdelay set to force allocation of
shaping nodes), and with special triggering of a CREATE event (described
below) at boot time, e.g.:
set cloud [$ns make-lan "n1 n2 n3" 100Mbps 0ms]
$cloud mustdelay
$ns at 0 "$cloud CREATE"
A cloud must have at least three nodes as LANs of two nodes are optimized
into a link and links do not give us all the pipes we need, as we will see
soon.
The CREATE event is sent to all nodes in the cloud (rather, to the shaping
node responsible for each node's connection to the underlying LAN) and
creates, internal to the delay-agent, "node pair" pipes for each node to
all other nodes on the LAN. Actual IPFW rules and dummynet pipes are only
created the first time a per-pair pipe's characteristics are set via the
MODIFY event. This behavior is in part an optimization, but is also
essential for the hybrid model described later.
There is a corresponding CLEAR event which will destroy all the per-pair
pipes, leaving only the standard delayed LAN setup (node to LAN pipes).
Each node-to-LAN connection has two pipes associated with each possible
destination on the LAN (destinations determined from /etc/hosts file).
The first pipe is used for shaping bandwidth for the pair. The second
pipe is used for shaping delay (and eventually packet loss). While it
might seem that the single pipe from a node to the LAN might be sufficient
for shaping both, the split is needed when operating in the hybrid mode
as described below. Characteristics of these per-pair pipes cannot be
modified unless a CREATE command has first been executed.
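The CREATE/MODIFY/CLEAR life cycle of per-pair pipes can be sketched as a
toy Python model. The names are hypothetical, and raising an error for a
MODIFY issued before CREATE is an assumption; the text only says such a
modification is not possible.

```python
class PairPipes:
    """Toy bookkeeping for the per-pair pipes of one node-to-LAN connection."""

    def __init__(self):
        self.pairs = None        # None until CREATE; then dest -> params
        self.installed = set()   # dests with real IPFW rules/dummynet pipes

    def create(self, dests):
        # CREATE only builds internal state; nothing is installed yet.
        self.pairs = {dest: {} for dest in dests}

    def modify(self, dest, **params):
        if self.pairs is None:
            raise RuntimeError("CREATE must precede a per-pair MODIFY")
        self.pairs[dest].update(params)
        # The first MODIFY for a pair is what instantiates its pipes.
        self.installed.add(dest)

    def clear(self):
        # CLEAR drops all per-pair pipes, leaving only the default pipes.
        self.pairs = None
        self.installed.clear()
```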
Assuming every per-pair pipe has been instantiated by a MODIFY event, the
cloud snippet above would translate into a physical setup of:
+----+ +-------+ +-------+
| |--- to n2 pipe -->+-----+ +-----+<- from n2 pipe --| |
| n1 |--- to n3 pipe -->| if0 | | if1 |<- from n3 pipe --| |
| |-- from n1 pipe ->+-----+ +-----+<-- to n1 pipe ---| |
+----+ (BW) | | (del/plr) | |
| | | |
+----+ | | | |
| |--- to n1 pipe -->+-----+ +-----+<- from n1 pipe --| |
| n2 |--- to n3 pipe -->| if2 | delay | if3 |<- from n3 pipe --| "lan" |
| |-- from n2 pipe ->+-----+ +-----+<-- to n2 pipe ---| |
+----+ (BW) | | (del/plr) | |
| | | |
+----+ | | | |
| |--- to n1 pipe -->+-----+ +-----+<- from n1 pipe --| |
| n3 |--- to n2 pipe -->| if0 | | if1 |<- from n2 pipe --| |
| |-- from n3 pipe ->+-----+ +-----+<-- to n3 pipe ---| |
+----+ (BW) +-------+ (del/plr) +-------+
where the top two pipes in each set of three are the new, per-pair pipes
and the final pipe is the standard shaping pipe which can be thought of
as the "default" pipe through which any traffic flows for which there is
not a specific per-pair setup. In IPFW, the rules associated with the
per-pair pipes are numbered starting at 60000 and decreasing. This gives
them higher priority than the default pipes which are numbered above 60000.
One important thing to note is that while bandwidth is shaped on the
outgoing pipe, when a delay value is set on n1 for destination n2, it is
imposed on the link *into* n1. This is different than for regular LAN
shaping (and for the ACIM model below), where bandwidth, delay and loss
are all applied in one direction. The reason for the split is explained
in the hybrid-model discussion below.
5a. Simple mode setup:
In both the simple-static and simple-dynamic models, tevc commands are
used to assign characteristics to the various per-pair pipes created above.
In the static case, this is done only at boot time. In the dynamic case,
it is done periodically throughout the lifetime of the experiment. To
accomplish this, the tevc MODIFY event is augmented with an additional
DEST parameter. The DEST parameter is used to identify a specific node
pair pipe (the source is implied by the link object targeted by the tevc
command). If the DEST parameter is not given, then the modification is
applied to the "default" pipe (i.e., the normal shaping behavior). For
example:
tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=1000 DELAY=10
Assuming 10.0.0.2 is "n2" in the diagram above, this would change n1's
"to n2 pipe" to shape the bandwidth, and change n1's "from n2 pipe" to
handle the delay. If a more "balanced" shaping is desired, the delay could
be split evenly between the two sides (with bandwidth set on both) via:
tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=1000 DELAY=5
tevc -e pid/eid now cloud-n2 MODIFY DEST=10.0.0.1 BANDWIDTH=1000 DELAY=5
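The way a per-pair MODIFY is divided between the two pipes can be
sketched in Python as follows. The function name and the pipe labels are
hypothetical; routing PLR to the "from" pipe follows the statement that
the second pipe is for delay "and eventually packet loss".

```python
def split_pair_modify(dest, params):
    """Split per-pair MODIFY parameters between the 'to' and 'from' pipes."""
    # Bandwidth is shaped on the outgoing (node-to-LAN) side.
    to_pipe = {k: v for k, v in params.items() if k == "BANDWIDTH"}
    # Delay (and loss) are imposed on traffic coming back from the LAN.
    from_pipe = {k: v for k, v in params.items() if k in ("DELAY", "PLR")}
    return {f"to {dest}": to_pipe, f"from {dest}": from_pipe}
```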
5b. ACIM mode setup:
ACIM mode is again a dynamic shaping feature. As if per-node pair pipes
were not enough, here we further add per-flow pipes! For example, in the
diagram above, the six pipes for n1 might also have a seventh pipe for
"n1 TCP port 10345 to n2 TCP port 80" if a monitored web application running
on n1 were to connect to the web server on n2. That pipe could then have
specific BW, delay and loss characteristics.
Note that only one pipe is created here to serve bandwidth, delay and loss,
unlike the split of BW from the others on per-pair pipes. The one pipe is
in the node-to-lan outgoing direction (i.e., on the left hand side in the
diagram above).
Higher priority is given to per-flow pipes by numbering the IPFW rules
starting from 100 and working up. Thus the priority is: per-flow pipe,
per-pair pipe, default pipe.
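Because IPFW checks rules in ascending rule-number order, the numbering
scheme alone yields this priority. A small Python sketch, assuming a step
of one between consecutive rules and using 60100 as a made-up example of a
default rule number above 60000:

```python
PER_FLOW_BASE = 100     # per-flow rules count up from here
PER_PAIR_BASE = 60000   # per-pair rules count down from here


def flow_rule(index):
    return PER_FLOW_BASE + index


def pair_rule(index):
    return PER_PAIR_BASE - index


def winning_rule(matching_rule_numbers):
    """With first-match semantics, the lowest matching rule number wins."""
    return min(matching_rule_numbers)
```

A packet matching a flow rule, a pair rule and the default rule is thus
handled by the flow rule.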
For an application being monitored with ACIM, the flow pipes are created
for each flow on the fly as connections are formed. Flows from unmonitored
applications will use the node pair pipes. Note that this would include
return traffic to the monitored application unless the other end were also
monitored.
The tevc command sports even more parameters to support per-flow pipes.
In addition to the DEST parameter, there are three others needed:
PROTOCOL:
Either "UDP" or "TCP".
SRCPORT:
The source UDP or TCP port number.
DSTPORT:
The destination UDP or TCP port number.
An example follows. First, a flow pipe must be explicitly created:
tevc -e pid/eid now cloud-n1 CREATE \
DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80
Note that unlike per-pair pipes, the CREATE call here immediately creates
the associated IPFW rule and dummynet pipe. A flow pipe will inherit its
initial characteristics from the "parent" per-pair pipe. Those
characteristics can be changed with:
tevc -e pid/eid now cloud-n1 MODIFY \
DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80 \
BANDWIDTH=1000 DELAY=10
When finished, the flow pipe is destroyed with:
tevc -e pid/eid now cloud-n1 CLEAR \
DEST=10.0.0.2 PROTOCOL=TCP SRCPORT=10345 DSTPORT=80
A new dummynet feature was added to model bottleneck queuing for ACIM.
The "maxinq" (maximum time in queue) parameter, and associated MAXINQ
MODIFY event parameter defines a maximum time for packets to be in a
queue before they are dropped. This is an alternative method to a
probability-based drop or RED. When a packet arrives at the bandwidth
shaper, its expected time to traverse the shaping pipe is calculated,
namely: time on the bandwidth queue plus the explicit delay time. If
that value exceeds the maxinq setting, the packet is dropped.
NOTE: since the BW and delay are applied via separate pipes in
the current Flexlab dummynet, the maxinq setting doesn't do
exactly what you might expect.
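The maxinq drop test can be written out as a short Python sketch. It is
not the dummynet code; the function name is hypothetical, the backlog is
assumed to include the arriving packet, a kilobit is taken as 1000 bits,
and the bandwidth must be non-zero.

```python
def maxinq_drops(backlog_bytes, bw_kbps, delay_ms, maxinq_ms):
    """Return True if a packet would be dropped under the maxinq rule."""
    # bits / (kbit/s) gives milliseconds spent on the bandwidth queue.
    queue_ms = backlog_bytes * 8 / bw_kbps
    # Expected traversal time is queue time plus the explicit delay.
    return queue_ms + delay_ms > maxinq_ms
```

For instance, 12500 bytes queued on a 1000 Kbit/s pipe take 100ms to
drain; with a 10ms delay that exceeds a maxinq of 100ms and the packet is
dropped.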
5c. Hybrid mode setup:
Hybrid mode adds the possibility of groups of nodes sharing bandwidth to or
from a specific node. The current implementation allows only a limited
form. For a given node, it allows full per-destination delay settings and
partial per-destination bandwidth settings. All destinations that do not
have individual bandwidth pipes will share a single, default bandwidth pipe.
This is where the separate pipes for bandwidth and delay/plr described above
come into play. Recall that the CREATE call only creates a full NxN set of
pipes internally, and that actual dummynet pipes are only created when the
first MODIFY event for the pipe is received. This allows for having only
a subset of per-pair pipes active. Hence, for a given node, by explicitly
setting the characteristics for only some destination nodes, all other
destinations will use the default pipe and its characteristics. This is
how hybrid mode achieves a shared destination bandwidth.
Specifically, in the current Flexlab hybrid-model implementation, every
node pair is set with individual delay and loss characteristics via MODIFY
events. These are the "from node" pipes (i.e., the right-hand side of
the diagram above). Thus for a LAN of N nodes, each node will have N-1
such "from node" pipes active. Nodes may then also have per-node pair
bandwidth pipes to some, but possibly not all, of the other nodes. These
are the "to node" (left-hand side) pipes. Where specific bandwidth per-pair
pipes are not setup with MODIFY, the default pipe will then be used and
thus its bandwidth shared by traffic to all unnamed destinations.
This mechanism allows only a single set of shared destination bandwidth
nodes. The implementation will have to be modified to allow multiple
shared destination bandwidth sets or shared source bandwidth sets.
The tevc commands to setup unique delay characteristics per pair use the
DEST parameter:
tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 DELAY=10
would say that traffic from us to 10.0.0.2 should have a 10ms round-trip
delay. Likewise for setting up unique per-pair bandwidth:
tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2 BANDWIDTH=5000
which says that traffic between us and 10.0.0.2 has an outgoing "private"
BW of 5000Kbps. To establish the "default" shared bandwidth, we simply
omit the DEST:
tevc -e pid/eid now cloud-n1 MODIFY BANDWIDTH=1000
to say that traffic from us to all other nodes in the cloud shares a 1000Kbps
outgoing bandwidth.
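The resulting choice of bandwidth pipe for a destination can be sketched
in Python. The function name and the pipe labels are hypothetical; the
logic mirrors the rule that destinations without their own bandwidth pipe
fall through to the default pipe.

```python
def bandwidth_pipe_for(dest, per_pair_bw, default_bw):
    """Return (pipe label, Kbps) used for traffic from this node to dest."""
    if dest in per_pair_bw:
        # A MODIFY with DEST= gave this destination a private pipe.
        return ("private", per_pair_bw[dest])
    # Everything else shares the default pipe's bandwidth.
    return ("shared-default", default_bw)
```

With per_pair_bw = {"10.0.0.2": 5000} and default_bw = 1000, traffic to
10.0.0.3 and 10.0.0.4 competes for the same 1000 Kbps.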
5d. Late additions to Flexlab shaping.
A later, quick hack added the ability to specify multiple sets of shared
outgoing bandwidth nodes. A specification like:
tevc -e pid/eid now cloud-n1 MODIFY DEST=10.0.0.2,10.0.0.3 BANDWIDTH=5000
creates a "per node pair" style pipe for which the destination is a list
of nodes rather than a single node. This directly translates into an IPFW
command:
ipfw add <pipe> pipe <pipe> ip from any to 10.0.0.2,10.0.0.3 in recv <if>
so it was straightforward, though hacky in the current delay-agent, to
implement. This is clearly more general than the one "default rule"
bandwidth, but would be less efficient in the case where there is only
one set.
A final variation is a mechanism for allowing the specification for an
"incoming" delay from a particular node:
tevc -e pid/eid now cloud-n1 MODIFY SRC=10.0.0.2 DELAY=10
This would appear to be equivalent to:
tevc -e pid/eid now cloud-n2 MODIFY DEST=10.0.0.1 DELAY=10
and for round-trip traffic they will produce the same result. However,
they will perform differently for one way traffic. For the SRC= rule,
traffic from n2 to n1 will see 10ms of delay, but for the DEST= rule
traffic from n2 to n1 will see no delay since the shaping is on the
return path. This is really an implementation artifact though.
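The difference can be checked with a toy Python model in which each rule
is recorded as (node the delay pipe sits in front of, peer, delay in ms),
and the delay applies to traffic arriving at that node from the peer. The
representation and names are hypothetical; the placement follows the
description that per-pair delay is imposed on the link into the node
named in the event.

```python
def one_way_delay(rules, src, dst):
    """Sum the delay seen by traffic flowing from src to dst."""
    return sum(ms for node, peer, ms in rules if node == dst and peer == src)


def round_trip_delay(rules, a, b):
    return one_way_delay(rules, a, b) + one_way_delay(rules, b, a)


# cloud-n1 MODIFY SRC=<n2> DELAY=10: delay on traffic into n1 from n2.
src_form = [("n1", "n2", 10)]
# cloud-n2 MODIFY DEST=<n1> DELAY=10: delay on traffic into n2 from n1.
dest_form = [("n2", "n1", 10)]
```

Both rule sets give a 10ms round trip between n1 and n2, but only
src_form delays the n2-to-n1 direction.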
So why are there both forms? I do not recall if there was supposed to
be a functional difference, or whether it was just a convenience issue
depending on which object handle you had readily available.
5e. Future additions to Flexlab shaping.
Thus far, the only additional feature that has been requested is the
ability to specify a "shared source" bandwidth. For example, with:
set cloud [$ns make-cloud "n1 n2 n3 n4" 100Mbps 0ms]
we might want to say: "on n1 I want 1Mbps from {n2,n3}" which would
presumably translate into a tevc command such as:
tevc -e pid/eid now cloud-n1 MODIFY SRC=10.0.0.2,10.0.0.3 BANDWIDTH=1000
So why is this a problem? Going back to the base diagram for a cloud
(for simplicity assuming a shaping node that could handle shaping four links):
+-------+ +-------+ +-------+
| | +-----+ +-----+ | |
| n1 |- to pipes ->| if0 | | if1 |<- from pipes -| |
| | (BW) +-----+ +-----+ (del) | |
+-------+ | | | |
| | | |
+-------+ | | | |
| | +-----+ +-----+ | |
| n2 |- to pipes ->| if2 | | if3 |<- from pipes -| |
| | (BW) +-----+ +-----+ (del) | |
+-------+ | | | |
| delay | | "lan" |
+-------+ | | | |
| | +-----+ +-----+ | |
| n3 |- to pipes ->| if4 | | if5 |<- from pipes -| |
| | (BW) +-----+ +-----+ (del) | |
+-------+ | | | |
| | | |
+-------+ | | | |
| | +-----+ +-----+ | |
| n4 |- to pipes ->| if6 | | if7 |<- from pipes -| |
| | (BW) +-----+ +-----+ (del) | |
+-------+ +-------+ +-------+
So the shaping would need to be applied in the "from pipes" for "cloud-n1"
(i.e., the upper right). However, the from pipes already include one pipe
per source node for adding per-pair delay from each of the other nodes to n1:
<n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
pipe <pipe1a> config delay 10ms
<n3-del> pipe <pipe1b> ip from 10.0.0.3 to any in recv <if1>
pipe <pipe1b> config delay 20ms
<n4-del> pipe <pipe1c> ip from 10.0.0.4 to any in recv <if1>
pipe <pipe1c> config delay 30ms
to which we would need to add a rule for shared bandwidth:
<n1-bw> pipe <pipe1d> ip from 10.0.0.2,10.0.0.3 to any in recv <if1>
pipe <pipe1d> config bw 1000Kbit/sec
but only one of these rules can trigger for each packet coming in on <if1>.
In this case, packets from 10.0.0.2 and .3 will go through the delay pipes
(pipe1a or pipe1b) and not the bandwidth pipe (pipe1d). Putting the
bandwidth pipe first won't help: packets would then pass through it and
not the delay pipes!
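The first-match problem can be demonstrated with a few lines of Python.
This is a toy model of rule evaluation, not IPFW itself; rules are
reduced to (pipe name, set of matching source addresses) and the function
name is hypothetical.

```python
def first_match(rules, src):
    """Return the pipe of the first rule matching src, or None."""
    for pipe, sources in rules:
        if src in sources:
            return pipe
    return None


delay_rules = [
    ("pipe1a", {"10.0.0.2"}),
    ("pipe1b", {"10.0.0.3"}),
    ("pipe1c", {"10.0.0.4"}),
]
bw_rule = ("pipe1d", {"10.0.0.2", "10.0.0.3"})

delay_first = delay_rules + [bw_rule]
bw_first = [bw_rule] + delay_rules
```

With delay_first, a packet from 10.0.0.2 lands in pipe1a and never sees
pipe1d; with bw_first it lands in pipe1d and never sees pipe1a.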
We could apply the appropriate bandwidth and delay to each of the from
pipes from .2 and .3 so that there is only one pipe from each node:
<n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
pipe <pipe1a> config delay 10ms bw 1000Kbit/sec
<n3-del> pipe <pipe1b> ip from 10.0.0.3 to any in recv <if1>
pipe <pipe1b> config delay 20ms bw 1000Kbit/sec
but now the bandwidth of 1000Kbit/sec is no longer shared.
We could instead augment the left-hand "to pipes" adding an "incoming"
rule so that we had:
# to pipes
<n2-bw> pipe <pipe0a> ip from any to 10.0.0.2 in recv <if0>
<n3-bw> pipe <pipe0b> ip from any to 10.0.0.3 in recv <if0>
<n4-bw> pipe <pipe0c> ip from any to 10.0.0.4 in recv <if0>
# new rule
<n1-bw> pipe <pipe0d> ip from 10.0.0.2,10.0.0.3 to any out xmit <if0>
However, when combining bridging (recall, <if0> and <if1> are bridged)
with IPFW, packets traveling in either direction will only pass through
IPFW once in each direction. This means that a packet coming from the
lan to n1, will trigger the appropriate "in recv <if1>" rule (pipe1?)
and then be immediately placed on the outgoing interface <if0> with no
further filtering. Hence, the "out xmit <if0>" rule (aka pipe0d) will
never be triggered.
So we cannot hang a "shared source bandwidth" pipe in either place nor
modify any of the existing pipes.
In the big picture, what we might want to be able to support in a shaping
node are, for each of BW, delay and loss and for each node in an N node cloud:
* shaping from node to {node set}
* shaping to node from {node set}
Here a {node set} might be "all N-1 other nodes in the LAN" in which case
we have two shaping pipes for a node to and from the LAN (aka, the current
asymmetric shaped LAN), or a set might contain a single node in which case
we have N-1 shaping pipes for other nodes (aka, the current Flexlab per
node pair pipes), or it might be multiple pipes with subsets of 2 to N-1
nodes (aka, shared-source and shared-destination bandwidth pipes, as well
as possibly useless shared-source and shared-destination delay and PLR
pipes). The only requirement for a set would be that it be disjoint with
any other set.
5f. A way to implement multiple pipes for Flexlab shaping
If you set net.inet.ip.fw.one_pass to zero with sysctl, it changes the
behavior of IPFW so that ALL rules that match are applied to a packet.
This means that something like:
# delay pipes
<n2-del> pipe <pipe1a> ip from 10.0.0.2 to any in recv <if1>
pipe <pipe1a> config delay 10ms
<n3-del> pipe <pipe1b> ip from 10.0.0.3 to any in recv <if1>
pipe <pipe1b> config delay 20ms
# BW pipes
<n1-bw> pipe <pipe1d> ip from 10.0.0.2,10.0.0.3 to any in recv <if1>
pipe <pipe1d> config bw 1000Kbit/sec
...
from above, would work. For a packet coming in on <if1> from 10.0.0.2,
both <n2-del> and <n1-bw> would be applied. The tricky part here is to
try to limit the number of rules that are checked/applied to every packet.
Considering that we might have two (BW, delay) node-to-shapingnode pipes
and two (BW, delay) LAN-to-shapingnode pipes for every node being shaped,
and each shaping node can handle two links, there are eight possible rules
that every packet may have to match against. That includes control net
packets!
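The effect of applying every matching rule can be modeled in Python in
the same toy fashion: rules are (pipe name, set of matching sources) and
the function returns every pipe a packet would traverse, in rule order.
The function name is hypothetical and the model ignores everything about
IPFW other than the match-all behavior described above.

```python
def all_matches(rules, src):
    """Return, in rule order, every pipe whose rule matches src."""
    return [pipe for pipe, sources in rules if src in sources]


rules = [
    ("pipe1a", {"10.0.0.2"}),              # delay from n2
    ("pipe1b", {"10.0.0.3"}),              # delay from n3
    ("pipe1d", {"10.0.0.2", "10.0.0.3"}),  # shared BW from n2 and n3
]
```

A packet from 10.0.0.2 now passes through both pipe1a and pipe1d, but
every packet is also compared against every rule in the list.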
We can limit it some by doing something like:
skipto 60110 ip from any to any in recv <if1>
skipto 60210 ip from any to any in recv <if2>
skipto 60310 ip from any to any in recv <if3>
skipto 60410 ip from any to any in recv <if4>
skipto 65534 ip from any to any
# 60110: <if1> set
pipe 60110 ip from any to any # BW
pipe 60120 ip from any to any # delay
skipto 65534 ip from any to any
# 60210: <if2> set
...
# 65534: final rule
allow ip from any to any
Note that the "skipto 65534" rules at the end of each block are needed to
prevent fall-through, since processing will not stop until the final rule
(65534) is hit. This is clearly a convoluted mess and might get worse since
we can no longer use the "more specific rule overrides a more general rule"
flexibility when specifying shaping characteristics.
In addition to the standard constant BW/delay/plr values, mechanisms were
added to dummynet to support distribution- and table-driven methods for
adjusting the three. Those arguments are:
BWQUANTUM, BWQUANTABLE, BWMEAN, BWSTDDEV, BWDIST, BWTABLE:
Bandwidth tweakage.
DELAYQUANTUM, DELAYQUANTABLE, DELAYMEAN, DELAYSTDDEV, DELAYDIST, DELAYTABLE:
Delay tweakage.
PLRQUANTUM, PLRQUANTABLE, PLRMEAN, PLRSTDDEV, PLRDIST, PLRTABLE:
Packet loss tweakage.