Abstract:

Methods and systems for determining a congestion metric for a path in a
network are provided. According to one embodiment, multiple paths are
provided between each pair of multi-path load balancing (MPLB) components
within a Layer 2 network by establishing overlapping loop-free topologies
in which each MPLB component is reachable by any other via each of the
overlapping topologies. A first MPLB component associated with a first
network device sends a latency request packet, including a first
timestamp provided by a first clock associated with the first MPLB
component, to a second MPLB component associated with a second network
device via the path. Responsive thereto, the first MPLB component
receives, from the second MPLB component, a latency response packet,
including a second timestamp provided by a second clock associated with
the second MPLB component. The first MPLB component derives a one-way
latency value for the path based upon the timestamps.

Claims:

1. A computer-implemented method for determining a congestion metric for a
particular path in a network, comprising: providing a plurality of
possible paths between each pair of multi-path load balancing (MPLB)
components of a plurality of MPLB components within a Layer 2 network by
establishing a plurality of overlapping loop-free topologies within the
Layer 2 network in which each of the plurality of MPLB components is
reachable by any other of the plurality of MPLB components via each of
the plurality of overlapping loop-free topologies; sending, by a first
MPLB component of the plurality of MPLB components associated with a
first network device, a latency request packet to a second MPLB component
of the plurality of MPLB components associated with a second network
device via the particular path, wherein the latency request packet
comprises a timestamp provided by a first clock associated with the first
MPLB component; receiving, by the first MPLB component, a latency response
packet sent by the second MPLB component in response to the latency
request packet, wherein the latency response packet comprises a timestamp
provided by a second clock associated with the second MPLB component;
and deriving, by the first MPLB component, a one-way latency value for the
particular path based at least partially upon the timestamp provided by
the first clock and the timestamp provided by the second clock, wherein
the one-way latency value indicates how much time is needed for a set of
information to travel from the first MPLB component to the second MPLB
component along the particular path; and wherein the first MPLB and the
second MPLB are realized using one or more of hardware logic components,
one or more Application Specific Integrated Circuits (ASICs) and one or
more processors executing a set of instructions.

2. The method of claim 1, wherein said sending, by a first multi-path load
balancing (MPLB) component of a first network device, a latency request
packet to a second MPLB component of a second network device via the
particular path, said receiving, by the first MPLB component, a latency
response packet sent by the second MPLB component in response to the
latency request packet and said deriving, by the first MPLB component, a
one-way latency value for the particular path based at least partially
upon the timestamp provided by the first clock and the timestamp provided
by the second clock are repeated a plurality of times to derive a
plurality of one-way latency values for the particular path.

3. The method of claim 2, further comprising computing, based at least
partially upon the plurality of one-way latency values for the particular
path, a latency variation value which indicates variation in one-way
latency values on the particular path.

4. The method of claim 1, wherein the first and second clocks are not
synchronized with each other, wherein the method further
comprises: determining a clock offset value that represents a difference
in time values provided by the first clock and the second clock;
and wherein the one-way latency value for the particular path is
derived based at least partially upon the timestamp provided by the first
clock, the timestamp provided by the second clock, and the clock offset
value.

5. The method of claim 4, wherein determining the clock offset value
comprises: sending a second latency request packet from the first MPLB
component to the second MPLB component via the particular path, wherein
the latency request packet comprises an initial timestamp provided by the
first clock of the first MPLB component; receiving, by the first MPLB
component, a second latency response packet sent by the second MPLB
component in response to the second latency request packet, wherein the
second latency response packet comprises a second timestamp provided by
the second clock of the second MPLB component; obtaining a received
timestamp from the first clock of the first MPLB component, wherein the
received timestamp approximately indicates a time at which the second
latency response packet was received by the first MPLB
component; determining, based at least partially upon the initial
timestamp and the received timestamp, a roundtrip latency value which
indicates how much time is needed for a set of information to travel from
the first MPLB component to the second MPLB component along the
particular path and back to the first MPLB component; determining whether
the roundtrip latency value is less than a previously established minimum
roundtrip latency value; in response to a determination that the roundtrip
latency value is less than the previously established minimum roundtrip
latency value, deriving an estimated one-way latency value based upon the
roundtrip latency value, wherein the estimated one-way latency value
indicates how much time is estimated to be needed for a set of
information to travel from the first MPLB component to the second MPLB
component along the particular path; and computing the clock offset value
based at least partially upon the initial timestamp, the second timestamp
provided by the second clock of the second MPLB component, and the
estimated one-way latency value.

6. The method of claim 5, wherein determining the clock offset value
further comprises in response to a determination that the roundtrip
latency value is less than the previously established minimum roundtrip
latency value, establishing the roundtrip latency value as the previously
established minimum roundtrip latency value.

10. The method of claim 5, wherein the second latency response packet
comprises (a) an arrived timestamp provided by the second clock
approximately indicating a time at which the second latency request
packet was received by the second MPLB component and (b) a return
timestamp provided by the second clock approximately indicating a time at
which the second latency response packet was sent by the second MPLB
component, and wherein determining the roundtrip latency value
comprises: determining a difference between the arrived timestamp and the
initial timestamp to derive a first difference; determining a difference
between the received timestamp and the return timestamp to derive a
second difference; and adding the first difference and the second
difference to derive the roundtrip latency value.

12. The method of claim 2, wherein the first MPLB component receives
packets associated with a flow from a source at a particular flow rate,
wherein the first MPLB component forwards the packets associated with the
flow to the second MPLB component, wherein the first MPLB component sends
a plurality of latency request packets to the second MPLB component, and
wherein the first MPLB component sends the plurality of latency request
packets to the second MPLB component at a rate which is proportional to
the particular flow rate.

13. The method of claim 2, wherein the first MPLB component receives
packets associated with a flow from a source, wherein the first MPLB
component forwards the packets associated with the flow to the second
MPLB component, and wherein sending a latency request packet from the
first MPLB component to the second MPLB component via the particular path
comprises: determining whether a certain amount of information associated
with the flow has been received by the first MPLB component since an
information count was last cleared; in response to a determination that a
certain amount of information associated with the flow has been received
by the first MPLB component since the information count was last cleared,
sending a latency request packet from the first MPLB component to the
second MPLB component via the particular path; and clearing the
information count.

14. The method of claim 1, wherein the first MPLB component comprises a
network interface of the first network device.

15. The method of claim 1, wherein the second MPLB component comprises a
network interface of the second network device.

16. A network interface comprising: a communication manager responsible for
sending and receiving communications on behalf of a networking device
with which the network interface is associated; and a path selection
manager, coupled to the communication manager, operable to perform a
method for determining a congestion metric for a particular path in a
network in which a plurality of possible paths exist between each pair of
network interfaces of a plurality of network interfaces within a Layer 2
network as a result of a plurality of overlapping loop-free topologies
having been established within the Layer 2 network in which each of the
plurality of network interfaces is reachable by any other of the
plurality of network interfaces via each of the plurality of overlapping
loop-free topologies, wherein the method comprises: sending a latency
request packet to a second network interface of the plurality of network
interfaces of a network device on the network via the particular path,
wherein the latency request packet comprises a timestamp provided by a
first clock associated with the network interface; receiving a latency
response packet sent by the second network interface in response to the
latency request packet, wherein the latency response packet comprises a
timestamp provided by a second clock associated with the second network
interface; deriving a one-way latency value for the particular path based
at least partially upon the timestamp provided by the first clock and the
timestamp provided by the second clock, wherein the one-way latency value
indicates how much time is needed for a set of information to travel from
the network interface to the second network interface along the
particular path and wherein the one-way latency value for the
particular path is derived based at least partially upon the timestamp
provided by the first clock, the timestamp provided by the second clock,
and the clock offset value; and wherein the path selection manager and the
communication manager are realized using one or more of hardware logic
components, an Application Specific Integrated Circuit (ASIC) and a
processor executing a set of instructions.

17. The network interface of claim 16, wherein said sending a latency
request packet to a second network interface via the particular path,
said receiving a latency response packet sent by the second network
device in response to the latency request packet and said deriving a
one-way latency value for the particular path based at least partially
upon the timestamp provided by the first clock and the timestamp provided
by the second clock are repeated a plurality of times to derive a
plurality of one-way latency values for the particular path.

18. The network interface of claim 16, wherein the method further
comprises computing, based at least partially upon the plurality of
one-way latency values for the particular path, a latency variation value
which indicates variation in one-way latency values on the particular
path.

19. The network interface of claim 16, wherein determining the clock
offset value further comprises in response to a determination that the
roundtrip latency value is less than the previously established minimum
roundtrip latency value, establishing the roundtrip latency value as the
previously established minimum roundtrip latency value.

23. The network interface of claim 16, wherein the second latency response
packet comprises (a) an arrived timestamp provided by the second clock
approximately indicating a time at which the second latency request
packet was received by the second network interface and (b) a return
timestamp provided by the second clock approximately indicating a time at
which the second latency response packet was sent by the second network
interface and wherein determining the roundtrip latency value
comprises: determining a difference between the arrived timestamp and the
initial timestamp to derive a first difference; determining a difference
between the received timestamp and the return timestamp to derive a
second difference; and adding the first difference and the second
difference to derive the roundtrip latency value.

25. The network interface of claim 17, wherein the communication manager
receives packets associated with a flow from a source at a particular
flow rate, wherein the communication manager forwards the packets
associated with the flow to the second network interface, wherein the
communication manager sends a plurality of latency request packets to the
second network interface, and wherein the communication manager sends the
plurality of latency request packets to the second network interface at a
rate which is proportional to the particular flow rate.

26. The network interface of claim 17, wherein the communication manager
receives packets associated with a flow from a source, wherein the
communication manager forwards the packets associated with the flow to
the second network interface, and wherein sending a latency request
packet from the communication manager to the second network interface via
the particular path comprises: determining whether a certain amount of
information associated with the flow has been received by the
communication manager since an information count was last cleared; in
response to a determination that a certain amount of information
associated with the flow has been received by the communication manager
since the information count was last cleared, sending a latency request
packet from the communication manager to the second network interface via
the particular path; and clearing the information count.

Description:

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001]This is a continuation of U.S. patent application Ser. No.
11/732,599, filed on Apr. 3, 2007, which is a continuation-in-part of
U.S. patent application Ser. No. 11/336,713, filed on Jan. 20, 2006,
which is a continuation-in-part of U.S. patent application Ser. No.
11/056,436, filed on Feb. 10, 2005, which claims the benefit of priority
of U.S. Provisional Application No. 60/545,788, filed on Feb. 18, 2004.
The contents of all the aforementioned patent applications are hereby
incorporated by reference in their entirety for all purposes.

[0004]Embodiments of the present invention generally relate to computer
networking. In particular, embodiments of the present invention relate to
determining a congestion metric for a particular path in an Ethernet
network.

[0005]2. Description of the Related Art

[0006]Over the past several years, the computing and storage server
industries have been migrating towards a network-based computing and
storage model to take advantage of lower cost, high-performance commodity
processors and lower cost, high-density storage media. This server
industry trend has created a need for a highly scalable interconnect
technology to enable the various computing and storage resources to be
efficiently and effectively coupled. One type of interconnect that has
been considered for this purpose is an Ethernet network.

[0007]An Ethernet network is a loop-free switching path, reverse path
learning network. By "loop-free", it is meant that there is only one path
between any pair of nodes in the network. Because of this loop-free
property, it is possible for the switches in an Ethernet network to
forward packets by broadcast flooding, and to populate their forwarding
tables through reverse path learning.

[0008]Specifically, when an Ethernet switch encounters a packet with a
destination node address that it does not have in its forwarding tables,
the switch broadcasts that packet on all outgoing links, except for the
link on which the packet was received. All subsequent switches that
receive the packet that do not have the destination node address in their
forwarding tables do the same thing. Eventually, the packet will be
delivered to the destination node. Because there is only one path to the
destination node, it is assured that broadcasting the packet in this way
will not create an infinite loop of broadcasts.

[0009]In addition to broadcasting the packet, a switch also determines,
from the packet, the address of the source node that sent the packet. It
also notes the link on which the packet was received. This address and
link association is stored in the forwarding tables of the switch. In the
future, if the switch receives any packet destined for the source node,
it will know, based upon the address and link association in the
forwarding tables, which link to switch the packet to. It will not need
to broadcast the packet. In this way, an Ethernet switch learns the
reverse path of a packet. Because of this reverse path learning
capability, it is not necessary to pre-configure the forwarding tables of
Ethernet switches. The switches can build these forwarding tables on the
fly. This self learning capability of Ethernet switches is a key "plug
and play" attribute of an Ethernet network, and is one of the reasons why
Ethernet is so widely deployed.

[0010]While the loop-free aspect of an Ethernet network gives rise to
certain advantages, it also is the root of several significant drawbacks.
First, because there is only one path between each pair of nodes, the
network does not recover from failure as quickly as would be desired.
When a link in a path is disabled, another path has to be determined and
deployed. This takes a relatively long time, and during that time, nodes
coupled via that link cannot communicate. Another drawback is that the
single path between each pair of nodes limits the cross section bandwidth
of the network; thus, the switching capacity of the network is
underutilized. Furthermore, because there is only one path between each
pair of nodes, it is not possible to spread and balance the traffic
across a plurality of paths. For these and other reasons, Ethernet, as it
has been implemented, has not been an effective interconnect for coupling
computing and storage resources in a network-based, high-performance
system.

SUMMARY

[0011]Methods and systems are described for determining a congestion
metric for a particular path in a network. According to one embodiment,
multiple possible paths are provided between each pair of multi-path load
balancing (MPLB) components within a Layer 2 network by establishing
multiple overlapping loop-free topologies within the Layer 2 network in
which each of the MPLB components is reachable by any other of the MPLB
components via each of the overlapping loop-free topologies. A first MPLB
component associated with a first network device sends a latency request
packet to a second MPLB component associated with a second network device
via the particular path. The latency request packet includes a timestamp
provided by a first clock associated with the first MPLB component. The
first MPLB component receives a latency response packet sent by the
second MPLB component in response to the latency request packet. The
latency response packet includes a timestamp provided by a second clock
associated with the second MPLB component. The first MPLB component
derives a one-way latency value for the particular path based at least
partially upon the timestamp provided by the first clock and the
timestamp provided by the second clock. The one-way latency value
indicates how much time is needed for a set of information to travel from
the first MPLB component to the second MPLB component along the
particular path.
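
The timestamp arithmetic summarized above, together with the clock offset steps recited in claims 4, 5, and 10, can be illustrated with a minimal Python sketch. The names Probe and OffsetEstimator are hypothetical and do not appear in the specification. Halving the minimum roundtrip latency to estimate the one-way latency, the sign convention of the offset, and treating the first probe as the initial minimum are all assumptions of this sketch, not statements of the claimed method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    initial: float    # first clock: latency request sent
    arrived: float    # second clock: latency request received
    returned: float   # second clock: latency response sent
    received: float   # first clock: latency response received


def roundtrip_latency(p: Probe) -> float:
    # Each difference mixes the two clocks, but the unknown offset enters
    # the two terms with opposite signs and cancels in the sum.
    return (p.arrived - p.initial) + (p.received - p.returned)


class OffsetEstimator:
    def __init__(self) -> None:
        self.min_roundtrip: Optional[float] = None
        self.offset: Optional[float] = None

    def one_way_latency(self, p: Probe) -> float:
        rtt = roundtrip_latency(p)
        # Re-estimate the offset only when a new minimum roundtrip is seen,
        # since that probe is the least affected by queuing delay.
        if self.min_roundtrip is None or rtt < self.min_roundtrip:
            self.min_roundtrip = rtt
            estimated_one_way = rtt / 2.0  # assumption: symmetric path
            self.offset = (p.arrived - p.initial) - estimated_one_way
        return (p.arrived - p.initial) - self.offset


# Illustrative numbers: the second clock runs 1000 units ahead of the first.
estimator = OffsetEstimator()
print(estimator.one_way_latency(Probe(0, 1010, 1012, 22)))      # 10.0
print(estimator.one_way_latency(Probe(100, 1130, 1131, 141)))   # 30.0
```

In the second probe the roundtrip latency (40) exceeds the stored minimum (20), so the previously computed offset is reused and the larger one-way latency reflects added delay on the forward direction only.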

[0012]Other features of embodiments of the present invention will be
apparent from the accompanying drawings and from the detailed description
that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013]Embodiments of the present invention are illustrated by way of
example, and not by way of limitation, in the figures of the accompanying
drawings and in which like reference numerals refer to similar elements
and in which:

[0014]FIG. 1 shows a sample Ethernet network.

[0015]FIG. 2 shows the sample Ethernet network of FIG. 1 after it has been
divided into two virtual networks.

[0016]FIG. 3 shows a sample loop-free switching path, reverse path
learning network in which multiple virtual networks have been established
to provide multiple paths between each pair of nodes.

[0020]FIG. 5 is a functional block diagram of a network interface, in
accordance with one embodiment of the present invention.

[0021]FIG. 6 is a flow diagram illustrating the operation of a network
interface in accordance with one embodiment of the present invention.

[0022]FIG. 7 shows a multi-path load balancing (MPLB) component
implemented in the sample network of FIG. 3, in accordance with one
embodiment of the present invention.

[0023]FIG. 8 is a functional block diagram of an MPLB, in accordance with
one embodiment of the present invention.

[0024]FIG. 9 is a flow diagram illustrating the operation of an MPLB in
accordance with one embodiment of the present invention.

[0025]FIG. 10 is a flow diagram illustrating how load balancing may be
performed in accordance with one embodiment of the present invention.

[0026]FIGS. 11A and 11B illustrate how latency values may be derived in
accordance with one embodiment of the present invention.

[0027]FIGS. 12A and 12B illustrate how a clock offset value and latency
values may be derived in accordance with one embodiment of the present
invention.

DETAILED DESCRIPTION

[0028]Methods and systems are described for determining a congestion
metric for a particular path in a network. One embodiment of the present
invention is implemented in a loop-free switching path, reverse path
learning network, such as an Ethernet network. Before describing this
embodiment in detail, some additional information on Ethernet networks
will be provided to facilitate a complete understanding of the invention.

Ethernet Networks

[0029]FIG. 1 shows a sample Ethernet network. As shown, the network 100
comprises four Ethernet switches a, b, c, and d. The switches are
connected to each other by links 1 through 6. Coupled to each switch are
two end nodes, identified by their Ethernet MAC (media access control)
addresses. Specifically, switch a is coupled to the nodes with MAC
addresses S and T. Switch b is coupled to the nodes with MAC addresses U
and V. Switch c is coupled to the nodes with MAC addresses W and X, and
switch d is coupled to the nodes with MAC addresses Y and Z.

[0030]In an Ethernet network, the path between each pair of nodes is
determined using a spanning tree protocol (STP). The STP ensures that
there is no more than one path between each pair of nodes so that the
network 100 is "loop-free". If an STP is applied to the network 100 of
FIG. 1 with switch a being designated as the root switch, then the link
configuration shown in FIG. 1 may result, where links 1, 2, and 6 (shown
in solid lines) are put into a forwarding state, and links 3, 4, and 5
(shown in dashed lines) are blocked. Because links 3, 4, and 5 are
blocked, they cannot be used to carry traffic; thus, for example, switch
c cannot send any information to switch d using link 3. Blocking some of
the links in this manner enforces the requirement that there be one and
only one path between each pair of nodes. The resulting tree structure
composed of links 1, 2, and 6 allows all switches to forward packets to
each other without encountering a forwarding loop.
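
The link blocking described above can be approximated with a short Python sketch. It builds a breadth-first tree from the root switch rather than running an actual spanning tree protocol (no bridge identifiers, path costs, or protocol messages are modeled), so it is an illustration only. The endpoints of links 1, 2, 3, 4, and 6 follow the examples in this description; the endpoints of link 5 are not stated and are assumed here to be switches b and c.

```python
from collections import deque

# Link number -> the two switches it joins. Link 5 (b, c) is an assumption.
LINKS = {1: ("a", "b"), 2: ("a", "c"), 3: ("c", "d"),
         4: ("b", "d"), 5: ("b", "c"), 6: ("a", "d")}


def spanning_tree(root):
    """Return (forwarding, blocked) link sets for a BFS tree from root."""
    forwarding = set()
    visited = {root}
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for link in sorted(LINKS):
            if switch not in LINKS[link]:
                continue
            x, y = LINKS[link]
            peer = y if x == switch else x
            if peer not in visited:
                # First link that reaches a new switch joins the tree.
                visited.add(peer)
                forwarding.add(link)
                queue.append(peer)
    # Every link that is not part of the tree is blocked.
    return forwarding, set(LINKS) - forwarding


forwarding, blocked = spanning_tree("a")
print(sorted(forwarding), sorted(blocked))  # [1, 2, 6] [3, 4, 5]
```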

[0031]Suppose now that one of the nodes (say the one with MAC address Z)
wants to send an Ethernet packet to the node with MAC address U. To do
so, it first sends the packet to switch d. Switch d looks into its
forwarding table, and if it does not find MAC address U, it broadcasts
the packet to its outgoing ports, in this case the port to link 6 and the
port connected to the node with MAC address Y. Ethernet packets that
arrive at nodes for which the MAC address does not match will be dropped;
thus, the node with MAC address Y will drop the packet. Switch a then
receives the packet and looks into its forwarding table for MAC address
U. If it does not find MAC address U, it broadcasts the packet to its
outgoing links, in this case links 1 and 2, and to both of its connected
nodes. When the packet arrives at switch b, switch b looks into its
forwarding table, and if it does not find MAC address U in the forwarding
table, it broadcasts the packet to its outgoing links, in this case to
the nodes with MAC address U and MAC address V. The node with MAC address
U thus receives the packet. In this way, the node with MAC address Z is
able to send an Ethernet packet to the node with MAC address U without
the nodes or the intermediate switches knowing the forwarding path
beforehand.

[0032]This broadcast or flooding mechanism, while effective, can cause
unnecessary traffic over the switching links. Ethernet solves this
problem by using reverse path learning to create a forwarding table
association between MAC addresses and ports. In reverse path learning,
the switches look at the source MAC address of the Ethernet packets that
they receive, and populate their forwarding tables with the MAC
address/received port associations. In the above example, when the node
with MAC address U later sends a packet to the node with MAC address Z,
all of the switches in the path (switch b, switch a and switch d) will
have "learned" the association of the MAC address Z with the respective
received port from the previous received Ethernet packet. Therefore,
switch b will know to forward Ethernet packets with MAC address Z as the
destination address on link 1 since that was the link on which it
received Ethernet packets with MAC address Z as the source MAC address.
Likewise, switch a will know to switch such packets to link 6. Similarly,
switch d will know to forward such packets to the port connected to the
node with MAC address Z. In this way, the network self-learns the reverse
paths so that broadcast and flooding is performed in very few instances.
The key ingredient to making reverse path learning possible is the fact
that there is only one path between each pair of nodes.
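
The flooding and reverse path learning behavior walked through above can be simulated with a small Python sketch. It models only the forwarding links of FIG. 1 (links 1, 2, and 6) and counts how many transmissions each packet causes. The port names and the send helper are hypothetical, introduced purely for illustration.

```python
from collections import deque

LINKS = {1: ("a", "b"), 2: ("a", "c"), 6: ("a", "d")}   # forwarding links only
HOSTS = {"a": ["S", "T"], "b": ["U", "V"], "c": ["W", "X"], "d": ["Y", "Z"]}


def build_ports():
    """Map each switch to its ports: port name -> (kind, peer)."""
    ports = {sw: {} for sw in HOSTS}
    for link, (x, y) in LINKS.items():
        ports[x][f"link{link}"] = ("switch", y)
        ports[y][f"link{link}"] = ("switch", x)
    for sw, macs in HOSTS.items():
        for mac in macs:
            ports[sw][f"host{mac}"] = ("host", mac)
    return ports


def send(ports, tables, src, dst):
    """Forward one packet; return (delivered, number of transmissions)."""
    start = next(sw for sw, macs in HOSTS.items() if src in macs)
    queue = deque([(start, f"host{src}")])
    delivered, transmissions = False, 0
    while queue:
        switch, in_port = queue.popleft()
        tables[switch][src] = in_port            # reverse path learning
        known = tables[switch].get(dst)
        # Forward on the learned port, or flood on every port but the ingress.
        out_ports = [known] if known else [p for p in ports[switch] if p != in_port]
        for port in out_ports:
            kind, peer = ports[switch][port]
            transmissions += 1
            if kind == "switch":
                queue.append((peer, port))
            elif peer == dst:
                delivered = True
    return delivered, transmissions


ports = build_ports()
tables = {sw: {} for sw in HOSTS}
print(send(ports, tables, "Z", "U"))   # (True, 10): flooded everywhere
print(send(ports, tables, "U", "Z"))   # (True, 3): follows learned entries
```

After the first packet, switches b, a, and d hold entries for MAC address Z on link 1, link 6, and the port to node Z respectively, which is why the reply needs only three transmissions.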

Virtual Networks within Ethernet Networks

[0033]In an Ethernet network, it is possible to divide the network into
one or more virtual networks (also referred to herein as virtual local
area networks, or VLANs). Doing so allows the network to be divided into
multiple switching domains to limit the links on which packet broadcasts
are performed. Virtual networks also allow nodes to be segregated so that
only certain nodes are allowed to communicate with each other.

[0034]FIG. 2 shows an example of the network 100 of FIG. 1 after it has
been divided into two VLANs, VLAN 1 and VLAN 2. In this example, switch a
is the STP root for VLAN 1 and switch d is the STP root for VLAN 2. In
this example, VLAN 1 is configured in the same way as that described
previously for FIG. 1, namely, links 1, 2 and 6 are placed in a
forwarding state and links 3, 4 and 5 are blocked. For VLAN 2, links 3,
4, and 6 are placed in a forwarding state and links 1, 2, and 5 are
blocked. Thus, in this configuration, link 5 is always blocked and link 6
carries traffic for both VLAN 1 and VLAN 2.

[0035]Each end node is associated with one of the VLANs. Specifically, the
nodes with MAC addresses S, U, X, and Y are on VLAN 1, and the nodes with
MAC addresses T, V, W, and Z are on VLAN 2. Set up in this way, the nodes
with MAC addresses S, U, X, and Y will be able to communicate with each
other but not with the nodes with MAC addresses T, V, W, and Z.
Similarly, the nodes with MAC addresses T, V, W, and Z will be able to
communicate with each other but not with the nodes with MAC addresses S,
U, X, and Y. In this way, the VLANs separate the various nodes into
different domains such that the nodes are treated as if they are in
separate networks even though they are in the same physical network.

[0036]VLANs also enable the broadcasting of packets to be limited to
certain links. To illustrate this point, suppose that the node with MAC
address W wishes to send an Ethernet packet on VLAN 2 to the node with
MAC address V. To do so, it first sends the packet to switch c. Switch c
looks into its forwarding table, and if it does not find MAC address V,
it broadcasts the packet to its outgoing ports which support VLAN 2, in
this case the port to link 3. It does not broadcast to link 2 or to the
port coupled to the node with MAC address X since these ports support
VLAN 1 only. Switch d then receives the packet and looks into its
forwarding table for MAC address V. If it does not find MAC address V, it
broadcasts the packet to its outgoing links that support VLAN 2, in this
case links 4 and 6, and to the node with MAC address Z. When the packet
arrives at switch b, switch b looks into its forwarding table, and if it
does not find MAC address V in the forwarding table, it broadcasts the
packet to its outgoing links that support VLAN 2, in this case to the
node with MAC address V. In this way, the node with MAC address V
receives the packet. With the use of a VLAN, the packet reaches the
destination with broadcasts over fewer links.
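
The VLAN-limited flooding in this walkthrough can be expressed as a small Python sketch. The link sets per VLAN are taken from the walkthrough itself (VLAN 2 traffic uses links 3, 4, and 6, while links 1 and 2 carry VLAN 1 only); link 5 is omitted because it is blocked in both VLANs. The flood_targets helper and the table names are hypothetical.

```python
LINK_ENDS = {1: ("a", "b"), 2: ("a", "c"), 3: ("c", "d"),
             4: ("b", "d"), 6: ("a", "d")}
VLAN_LINKS = {1: {1, 2, 6}, 2: {3, 4, 6}}        # forwarding links per VLAN
HOST_SWITCH = {"S": "a", "T": "a", "U": "b", "V": "b",
               "W": "c", "X": "c", "Y": "d", "Z": "d"}
HOST_VLAN = {"S": 1, "U": 1, "X": 1, "Y": 1,
             "T": 2, "V": 2, "W": 2, "Z": 2}


def flood_targets(switch, vlan, ingress):
    """Links and hosts a switch floods to for a given VLAN.

    ingress is the link number or host MAC the packet arrived from;
    the packet is never sent back out of its ingress port.
    """
    links = sorted(l for l in VLAN_LINKS[vlan]
                   if switch in LINK_ENDS[l] and l != ingress)
    hosts = sorted(h for h, sw in HOST_SWITCH.items()
                   if sw == switch and HOST_VLAN[h] == vlan and h != ingress)
    return links, hosts


print(flood_targets("c", 2, "W"))   # ([3], [])
print(flood_targets("d", 2, 3))     # ([4, 6], ['Z'])
print(flood_targets("b", 2, 4))     # ([], ['V'])
```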

[0037]A point to note about VLANs is that they in and of themselves are
loop-free switching path, reverse path learning networks. Each VLAN
provides one and only one path between each pair of nodes; thus, packet
broadcast and reverse path learning can still be performed. Consequently,
VLANs implemented within an Ethernet network retain the advantages of
Ethernet networks.

Multiple Switching Paths

[0038]VLANs have primarily been used to segregate traffic within a
network. It has been observed by Applicants, however, that VLANs may also
be used to provide multiple paths between node pairs. For example, if all
of the nodes in FIG. 2 are associated with both VLAN 1 and VLAN 2, then
various nodes would have two possible paths to various other nodes. For
example, the node with MAC address Y would have two paths to the node
with MAC address V. Using VLAN 1, the path would be from switch d to link
6 to switch a to link 1 to switch b. Using VLAN 2, the path would be from
switch d to link 4 to switch b. Thus, by setting up multiple VLANs, and
by associating nodes with multiple VLANs, it is possible to provide
multiple paths between each pair of nodes. These multiple paths will
enable load balancing to be performed, fault tolerance to be enhanced,
and switching capacity to be more fully utilized. As an additional point,
these benefits can be realized without losing the advantages of a
loop-free switching path, reverse path learning network. As noted above,
VLANs do not violate any of the fundamentals of a loop-free switching
path, reverse path learning network. Thus, by setting up multiple VLANs,
and by associating nodes with multiple VLANs, it is possible to provide
multiple paths between node pairs, even in a loop-free switching path,
reverse path learning network, such as an Ethernet network.
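
To make the two-path example concrete, the following Python sketch looks up the single path that each VLAN provides between switch d (serving node Y) and switch b (serving node V). VLAN 1 is taken to forward on links 1, 2, and 6, and VLAN 2 on links 3, 4, and 6, consistent with the packet walkthroughs in this description. The vlan_path helper is hypothetical and is shown only as an illustration.

```python
LINK_ENDS = {1: ("a", "b"), 2: ("a", "c"), 3: ("c", "d"),
             4: ("b", "d"), 6: ("a", "d")}
VLAN_LINKS = {1: {1, 2, 6}, 2: {3, 4, 6}}   # forwarding links per VLAN


def vlan_path(vlan, src, dst):
    """Return the path [switch, link, switch, ...] within one VLAN tree."""
    stack = [(src, [src])]
    seen = {src}
    while stack:
        switch, path = stack.pop()
        if switch == dst:
            return path
        for link in sorted(VLAN_LINKS[vlan]):
            x, y = LINK_ENDS[link]
            if switch not in (x, y):
                continue
            peer = y if switch == x else x
            if peer not in seen:
                seen.add(peer)
                stack.append((peer, path + [link, peer]))
    return None  # dst is not reachable on this VLAN


print(vlan_path(1, "d", "b"))   # ['d', 6, 'a', 1, 'b']
print(vlan_path(2, "d", "b"))   # ['d', 4, 'b']
```

Because each VLAN is a tree, the search finds exactly one path per VLAN; a load balancer choosing between VLAN 1 and VLAN 2 is therefore choosing between these two physical routes.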

Sample Multi-Path Network

[0039]FIG. 3 shows an example of a loop-free switching path, reverse path
learning network 300 in which multiple VLANs have been established to
provide multiple paths between each pair of nodes. Specifically, in
network 300, there are four VLANs (VLANs 1 through 4); thus, there are
four possible paths between each pair of nodes.

[0040]In FIG. 3, the components labeled with lower case letters are
switches, and the components identified with capital letters are nodes.
For the sake of convenience, the same capital letter is used to identify
both a node and its MAC address. Thus, node A has MAC address A, node F
has MAC address F, and so forth. For the sake of simplicity, only some of
the nodes in the network 300 are shown. Each link has one or more numbers
placed next to it. These numbers indicate the VLAN or VLANs supported on
that link. For example, the link between switch a and switch j has the
label "3,4". This means that the link supports VLANs 3 and 4. Given this
labeling system, it can be seen that each of the links going to and from
a node supports all four VLANs. Thus, each node can use and can be
reached via any of the four VLANs.

[0041]In network 300, switch q is the root switch for VLAN 1, switch r is
the root switch for VLAN 2, switch s is the root switch for VLAN 3, and
switch t is the root switch for VLAN 4. Applying an STP to network 300
using each of the above switches as root switches will cause four
spanning trees to be produced. Each spanning tree will represent one of
the VLANs, and each spanning tree will specify the loop-free paths
between all of the nodes. The topology shown in FIG. 3 may be the result
of this process. To enforce the spanning trees, the various switches may
be configured, either manually or by an automated network manager (not
shown), to associate each link with one or more VLANs. For example,
switches a and j may be configured to associate the link between them
with VLANs 3 and 4. By configuring the switches in this way, each switch
will know which links can be used with which VLAN or VLANs. In one
embodiment, each spanning tree and hence, each VLAN, provides a different
path from a source node to a destination node. For example, on VLAN 1,
the path from node A to node K would be from switch a to switch i to
switch q to switch m to switch f. On VLAN 4, the path from node A to node
K would be from switch a to switch j to switch t to switch n to switch f.
By choosing different VLANs, packets can be sent from a source node to a
destination node along different paths.
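
The relationship between a VLAN and a path can be illustrated with a
minimal sketch. It assumes that each VLAN is represented as a list of
links and that the unique path on that VLAN is found by a breadth-first
search. Only the links named in the A-to-K example above are encoded; the
names VLAN_LINKS and path_on_vlan are hypothetical and the full topology
of FIG. 3 is not reproduced.

```python
from collections import deque

# Partial per-VLAN link sets, limited to the links named in the text for the
# node A to node K example. This is an illustrative subset, not FIG. 3.
VLAN_LINKS = {
    1: [("a", "i"), ("i", "q"), ("q", "m"), ("m", "f")],
    4: [("a", "j"), ("j", "t"), ("t", "n"), ("n", "f")],
}


def path_on_vlan(vlan_id, src, dst):
    """Return the switch path from src to dst using only links of one VLAN."""
    adjacency = {}
    for u, v in VLAN_LINKS[vlan_id]:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # dst is not reachable on this VLAN's links


print(path_on_vlan(1, "a", "f"))
print(path_on_vlan(4, "a", "f"))
```

Because each VLAN is a spanning tree, the search can find at most one path
per VLAN, so choosing the VLAN is equivalent to choosing the path.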

[0042]To enable an Ethernet packet to be properly switched according to
VLANs, some information is added to each packet. FIG. 4A shows a standard
Ethernet packet comprising a destination MAC address 402, a source MAC
address 404, and data and other control information 406. In one
embodiment, to VLAN enable a packet, a set of path-selecting VLAN
information 408 is inserted into the packet between the source MAC
address 404 and the data and other control information 406, as shown in
FIG. 4B. The VLAN information 408 may include information (e.g. a VLAN
ID) indicating which VLAN is to be used to switch the packet, as well as
any other information (e.g. control information) needed by the switches
to switch the packet properly. The VLAN information 408 enables the
switches to determine the correct VLAN, and hence, the correct path to
take to switch the packet to the destination node. In one embodiment, the
path-selecting VLAN information 408 is removed from the packet when the
packet is received. Thus, by the time the destination node gets the
packet, the VLAN information 408 will no longer be there. As a result,
the destination node can remain unaware that the path-selecting VLAN
information 408 was ever in the packet. This in turn enables the
path-selection to be carried out without changing any networking
protocols implemented by the source and destination nodes.
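
The insertion and removal of the path-selecting VLAN information 408 can
be sketched as follows. The description does not specify the layout of
the VLAN information 408; this sketch assumes a 4-byte IEEE 802.1Q-style
tag (a 0x8100 tag protocol identifier followed by a 16-bit field whose low
12 bits carry the VLAN ID). The function names are hypothetical.

```python
import struct

# Assumption: an 802.1Q-style tag stands in for VLAN information 408.
TPID_8021Q = 0x8100


def insert_path_tag(frame: bytes, vlan_id: int) -> bytes:
    """Insert a 4-byte tag after the destination and source MAC addresses."""
    if not 0 < vlan_id < 4095:
        raise ValueError("VLAN ID must be in 1..4094")
    tag = struct.pack("!HH", TPID_8021Q, vlan_id)
    # Bytes 0..11 hold the two 6-byte MAC addresses (402 and 404).
    return frame[:12] + tag + frame[12:]


def strip_path_tag(frame: bytes):
    """Remove the outermost tag; return (vlan_id, frame without the tag)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        raise ValueError("frame carries no path-selecting tag")
    return tci & 0x0FFF, frame[:12] + frame[16:]


original = bytes(6) + bytes([1] * 6) + b"\x08\x00" + b"payload"
tagged = insert_path_tag(original, 1)          # as in FIG. 4B
vlan_id, restored = strip_path_tag(tagged)     # done before delivery
assert restored == original
```

Stripping the tag restores the frame byte for byte, which is why the
destination node can remain unaware that the tag was ever present.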

[0043]In some implementations, the source and destination nodes may
already be implementing VLANs (for example, for segregation purposes).
Thus, an Ethernet packet may already contain a set of VLAN information.
Even in such a case, a set of path-selecting VLAN information may still
be added to the Ethernet packet. This is shown in FIG. 4C, where the
path-selecting VLAN information 408 is inserted into the packet between
the source MAC address 404 and the node-provided VLAN information 410. In
this case, it will be the path-selecting VLAN information 408, not the
node-provided VLAN information 410, that will be used by the switches to
switch the packet through the network 300 to the destination node. As
this example shows, path selection can be implemented even with packets
that already contain node-provided VLAN information.
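
The precedence of the path-selecting VLAN information 408 over the
node-provided VLAN information 410 can be sketched as follows. The sketch
assumes that both sets of VLAN information use an 802.1Q-style 4-byte
layout and that a switch reads only the first tag after the MAC
addresses. The VLAN ID 100 and the helper names are hypothetical.

```python
import struct

TPID = 0x8100  # assumption: both tags use an 802.1Q-style layout


def make_tag(vlan_id: int) -> bytes:
    """Build a 4-byte tag carrying the given VLAN ID."""
    return struct.pack("!HH", TPID, vlan_id)


def outermost_vlan(frame: bytes) -> int:
    """Return the VLAN ID of the first tag after the two MAC addresses."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID:
        raise ValueError("untagged frame")
    return tci & 0x0FFF


macs = bytes.fromhex("0000000000aa") + bytes.fromhex("0000000000bb")
# Frame as produced by the node, already carrying its own VLAN (410).
node_frame = macs + make_tag(100) + b"\x08\x00" + b"data"
# Frame after the path-selecting tag (408) is placed ahead of it (FIG. 4C).
network_frame = node_frame[:12] + make_tag(4) + node_frame[12:]

print(outermost_vlan(node_frame), outermost_vlan(network_frame))
```

The node-provided tag is carried through the network unchanged behind the
path-selecting tag and becomes the outermost tag again once the
path-selecting tag is removed.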

[0044]To show that the network 300 of FIG. 3 still behaves like a
loop-free switching path, reverse path learning network, reference will
now be made to an example. Suppose that node A wishes to send an Ethernet
packet to Node P. Suppose further that VLAN 1 is to be used; thus, the
packet is updated with information indicating VLAN 1 as the VLAN of
choice. Suppose further that the switches in the path have not yet
learned the MAC address/switch port association. In such a case, when the
packet is sent to switch a, switch a will broadcast the packet to all of
its outgoing ports on which VLAN 1 is enabled. These ports include the
port attached to MAC address B and the port labeled VLAN 1,2. The port
labeled 3,4 is not part of VLAN 1 and therefore would be blocked for the
broadcast of this packet. The packet is sent to switch i, which in turn,
broadcasts the packet onto those ports which are part of VLAN 1. In this
case, these would be the link labeled 1,2 which connects switch i to
switch b, and the link labeled VLAN 1, which connects switch i to switch
q. Switch b broadcasts the packet to the ports attached to MAC addresses
C and D but not to switch j. Switch q broadcasts the packet to switches
k, m and o. Switch k broadcasts the packet to switches c and d but not to
switch r since that link is only enabled for VLAN 2. Likewise, switch m
broadcasts the packet to switches e and f and not to switch r, and switch
o broadcasts the packet to switches g and h and not to switch r. Switch c
forwards the packet to the ports connected to MAC addresses E and F,
switch d forwards to ports connected to MAC addresses G and H, switch e
forwards to ports connected to MAC addresses I and J, switch f forwards
to ports connected to MAC addresses K and L, switch g forwards to ports
connected to MAC addresses M and N, and switch h forwards to ports
connected to MAC addresses O and P. Node P thus receives the packet. Note
that because of the tree structure on which VLAN 1 is defined, there are
no loops in the switching topology; hence, there is only one path between
each pair of end nodes for which VLAN 1 is enabled. This example
demonstrates how an Ethernet packet in network 300 can reach its
destination without a priori knowledge of the switching path, and is
consistent with how Ethernet typically performs flooding.

[0045]As each switch switches the packet, it is also carrying out reverse
path learning. Thus, each switch (switches h, o, q, i, a) along the path
populates its forwarding tables with an association between source MAC
address A, VLAN 1, and the port on which the packet was received. That
way, when node P sends an Ethernet packet destined for node A on VLAN 1,
the switches will already know how to switch that packet. Namely, the
packet will be forwarded from node P to switch h then to switch o over
the link marked 1,2 to switch q over the link marked 1 to switch i over
the link marked 1 to switch a over the link marked VLAN 1,2 to the port
connected to MAC address A, which is Node A. No flooding or broadcasting
is necessary. As illustrated by this example, the reverse path learning
for a VLAN-based multi-path switching topology is just an extension (e.g.
taking the VLAN into account) of the reverse path learning methodology
carried out for non-VLAN based switching topology.
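
The flooding and reverse path learning behavior described in the
preceding example can be sketched for a single switch as follows. The
sketch assumes that the forwarding table is keyed by the pair (VLAN ID,
MAC address) and that each port is configured with the set of VLANs
enabled on it. The class name, the port names, and the port configuration
shown for switch a are illustrative assumptions.

```python
class LearningSwitch:
    """A VLAN-aware learning switch, reduced to its forwarding decision."""

    def __init__(self, port_vlans):
        # port_vlans maps a port name to the set of VLAN IDs enabled on it.
        self.port_vlans = port_vlans
        self.table = {}  # (vlan_id, mac) -> port on which that MAC was seen

    def receive(self, in_port, vlan_id, src_mac, dst_mac):
        """Learn the source address, then return the list of output ports."""
        # Reverse path learning: remember where the source is reachable.
        self.table[(vlan_id, src_mac)] = in_port
        known_port = self.table.get((vlan_id, dst_mac))
        if known_port is not None:
            return [known_port]
        # Unknown destination: flood on every other port in the same VLAN.
        return sorted(
            port
            for port, vlans in self.port_vlans.items()
            if vlan_id in vlans and port != in_port
        )


# Assumed port layout for switch a: two node ports and two uplinks.
switch_a = LearningSwitch(
    {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4}, "to_i": {1, 2}, "to_j": {3, 4}}
)
print(switch_a.receive("A", 1, "A", "P"))     # flooded within VLAN 1
print(switch_a.receive("to_i", 1, "P", "A"))  # reply uses the learned port
```

Keying the table by VLAN as well as by MAC address allows the same
destination to be learned on a different port for each VLAN.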

[0046]Continuing the example, assume that node P wants to send an Ethernet
packet to node A, but this time over VLAN 4. Further assume that the
switches have populated their forwarding tables via reverse path learning
from a previous Ethernet packet sent from node A to node P over VLAN 4.
The Ethernet packet with VLAN 4 and destination MAC address A will be
forwarded from node P to switch h then to switch p over the link marked
3,4 to switch t over the link marked 4 to switch j over the link marked 4
to switch a over the link marked 3,4 to the port connected to node A.

[0047]With the multiple VLANs defined between each pair of nodes, any node
can send an Ethernet packet to another node using one of the four VLANs,
and hence, one of the four switching paths through the switching
topology. The ability to use the four switching paths between each pair
of nodes effectively increases the cross section switching capacity of
the network. In the example above, there are four VLAN paths defined so
there is an increase by a factor of four to the cross section switching
capacity. The presence of the four switching paths also provides four
paths across which traffic can be spread. As a result, load balancing can
be performed. The manner in which load balancing can be carried out in
accordance with one embodiment of the present invention will be
elaborated upon in a later section.

Network Interface

[0048]As shown in FIG. 3, each node is coupled to the network 300 through
a corresponding network interface 302. Basically, the network interface
302 interfaces the node with the network 300 to enable the node to send
and receive sets of information (e.g. packets). In one embodiment, it is
the network interface 302 that selects which VLAN to use to transport a
set of information to a destination node. It is also the network
interface that performs the load balancing function. These aspects will
be discussed in greater detail in a later section.

[0049]For purposes of the present invention, a network interface 302 may
take on any desired form and its functionality may be implemented in any
desired manner. For example, a network interface 302 may be a physical
interface (e.g. a network interface card) that is coupled to but is
separate from the node. As an alternative, a network interface 302 may be
a physical component that is part of the node (e.g. integrated into the
node's motherboard, integrated into the node's processor or processors,
etc.). As a further alternative, a network interface 302 may be a logical
component (e.g. a computer program) that resides and runs on the node.
These and other embodiments are within the scope of the present
invention. In any embodiment, the functionality of the network interface
302 may be realized using hardware (e.g. hardware logic components,
ASICs, etc.), software (e.g. having one or more processors execute one or
more sets of instructions), or a combination of both.

[0050]For purposes of the present invention, a node may be any component
that communicates on the network. In one embodiment, a node comprises one
or more processors, and executes an operating system and one or more
applications. In such a node, the applications and operating system may
perform some networking functions, such as preparing sets of information
(e.g. packets) for sending into the network, implementing network
protocols, determining destination nodes, etc. However, in one
embodiment, the operating system and the applications do not participate
in the selection of the VLAN, and hence, the selection of the path to be
taken to transport a set of information to a destination node. In fact,
the operating system and applications do not even need to be aware of the
multiple VLANs. Rather, in one embodiment, selection of the VLAN to use
in transporting a set of information is the responsibility of the network
interface 302. By delegating this responsibility to the network interface
302, it is possible to shield the operating system and applications from
the complexities of the network. As a result, it is possible to implement
the methodology of the present invention without modifying any existing
operating systems and applications.

[0051]FIG. 5 shows a functional block diagram of a network interface 302
in accordance with one embodiment of the present invention. As shown, the
network interface 302 comprises a communication manager 502, a path
selection manager 504, and a VLAN storage 506. Generally, the
communication manager 502 is responsible for sending and receiving
communications to and from its corresponding node and the network. The
path selection manager 504 is responsible for selecting a VLAN, and
hence, a path for outgoing sets of information. The VLAN storage 506 is
used to store information pertaining to the one or more VLANs that can be
used to transport a set of information from the corresponding node to a
destination node. The functions performed by these components will be
elaborated upon in the following discussion.

Sample Network Interface Operation

[0052]FIG. 6 shows a flow diagram which illustrates the manner in which a
network interface 302 operates in accordance with one embodiment of the
present invention. Before the operations shown in FIG. 6 are performed, a
network, such as network 300 in FIG. 3, is first configured. The network
300 may be configured by a network administrator by selecting the four
switches q, r, s, and t as the root switches for their respective VLANs,
and then running an STP for each root switch. This will produce a
spanning tree for each VLAN, which defines the paths between all of the
node pairs in the VLAN. After the four spanning trees are derived, the
switches are configured such that they recognize which links support
which VLANs, in accordance with the derived spanning trees. Once that is
done, each of the network interfaces 302 is informed of the VLANs that it
can use to transport information from a source node to a destination
node. In network 300, the usable VLANs are VLANs 1, 2, 3, and 4. This set
of VLANs is stored by each network interface 302 in its VLAN storage 506
(FIG. 5). With the network 300 thus configured, the network interfaces
302 are ready to facilitate communication by their respective nodes. In
the following discussion, it will be assumed for the sake of example that
node A is the source node, node P is the destination node, and network
interface 302(A) is the interface whose operation is being described.

[0053]To send a set of information (e.g. an Ethernet packet) from node A
to node P, network interface 302(A), and more specifically, the
communication manager 502 on network interface 302(A), acquires (block
602 of FIG. 6) a set of information from node A. Network interface 302(A)
may acquire this set of information in any number of ways. For example,
node A may simply provide the set of information to network interface
302(A). Node A may also provide a reference or a pointer to the set of
information, which the network interface 302(A) would use to access the
set of information. Network interface 302(A) may also acquire the set of
information in other ways.

[0054]In one embodiment, the set of information specifies an address for
the destination node or comprises information from which the address of
the destination node may be derived. For example, the set of information
may actually include the address for node P. Alternatively, the set of
information may include some other information (such as a memory
location) from which the address for node P can be derived. In such a
case, the communication manager 502 on network interface 302(A) derives
the address for node P. As a further alternative, the set of information
may comprise other information that can be used to derive the address of
node P.

[0055]Network interface 302(A), and more specifically, the path selection
manager 504 on network interface 302(A), determines (block 604) a set of
virtual networks that can be used to transport the set of information to
node P. As discussed above, at the time the network 300 is configured,
each network interface 302 is informed of and stores the set of VLANs (in
its VLAN storage 506) that it can use to transport sets of information.
Thus, network interface 302(A) accesses and consults this information,
and determines that, in the current example, it can use VLANs 1, 2, 3,
and 4.

[0056]Thereafter, network interface 302(A), and more specifically, the
path selection manager 504 on network interface 302(A), selects (block
606) one of the VLANs to use to transport the set of information. For the
sake of example, it will be assumed VLAN 1 is selected. In one
embodiment, this selection is made after the address of the destination
node is specified or derived. Also, in one embodiment, this selection is
made in a manner that tends to balance the traffic load across the
multiple VLANs. For example, the network interface 302(A) may select the
VLAN randomly (if the selection process is truly random, this should
spread traffic across the various VLANs relatively evenly). The network
interface 302(A) may also select the VLAN in a round-robin fashion.
Furthermore, the network interface 302(A) may select the VLAN based upon
current traffic conditions. For example, the network interface 302(A) may
select the VLAN with the path that is currently experiencing the lowest
traffic load, or the VLAN with the path that currently has the most
available capacity for carrying traffic. Information on current traffic
conditions may be provided to the path selection manager 504 of network
interface 302(A) by a central traffic monitoring component (not shown) or
may be derived based upon information gathered by the path selection
manager 504 of network interface 302(A). Based upon this information,
network interface 302(A) can select the VLAN in such a manner as to
balance the traffic load across the multiple paths. The manner in which
load balancing can be carried out in accordance with one embodiment of
the present invention will be elaborated upon in a later section.
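
The three selection approaches mentioned above (random, round-robin, and
selection based upon current traffic conditions) can be sketched as
follows. The function names and the load_by_vlan mapping are hypothetical;
how the load figures are measured is outside the scope of this sketch.

```python
import itertools
import random


def select_random(vlans, rng=random):
    """Pick a VLAN uniformly at random."""
    return rng.choice(vlans)


def make_round_robin(vlans):
    """Return a function that yields the VLANs in a repeating cycle."""
    cycle = itertools.cycle(vlans)
    return lambda: next(cycle)


def select_least_loaded(vlans, load_by_vlan):
    """Pick the VLAN whose path currently reports the lowest load."""
    return min(vlans, key=lambda vlan: load_by_vlan[vlan])


usable_vlans = [1, 2, 3, 4]  # as stored in VLAN storage 506

next_vlan = make_round_robin(usable_vlans)
print([next_vlan() for _ in range(6)])

# Hypothetical load figures (fraction of capacity in use) per VLAN path.
print(select_least_loaded(usable_vlans, {1: 0.7, 2: 0.2, 3: 0.9, 4: 0.4}))
print(select_random(usable_vlans))
```

Random and round-robin selection need no feedback from the network,
whereas load-based selection depends on the quality and freshness of the
traffic information supplied to the path selection manager 504.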

[0057]After the VLAN (VLAN 1 in the current example) is selected, the
network interface 302(A), and more specifically, the path selection
manager 504 of network interface 302(A), updates (block 608) the set of
information to include information indicating the particular VLAN that
was selected. In one embodiment, this may be done as shown in FIGS. 4B
and 4C, wherein the path-selecting VLAN information 408 is inserted into
the set of information. After the set of information is updated, it is
sent (block 610) by the network interface 302(A), and more specifically,
by the communication manager 502 of network interface 302(A), into the
network 300. The switches in the network 300 then use the address and
VLAN information in the set of information to switch the set of
information to node P along VLAN 1.

[0058]On the receiving end, the network interface 302(P), and more
specifically, the communication manager 502 on network interface 302(P),
receives the set of information, and removes the path-selecting VLAN
information 408 therefrom. The network interface 302(P) then passes the
set of information to node P for processing. The set of information is
thus delivered from node A to node P.

[0059]In one embodiment, the network interface 302(A) processes every set
of information destined for node P in this way. Thus, network interface
302(A) may acquire another set of information from node A that is
destined for node P. For this set of information, network interface
302(A) would perform the same operations as those shown in FIG. 6.
However, the network interface 302(A) may select a different VLAN (VLAN
2, for example) for this set of information. In general, the network
interface 302(A) may acquire a series of information sets from node A
that are destined for node P, and may select different VLANs for
different information sets. By doing so, the network interface 302(A)
distributes the traffic across the multiple VLANs, and hence, the
multiple paths. If each network interface 302 does this, then load
balancing can be achieved throughout the network 300. In this manner,
load balancing may be implemented on a dynamic, packet by packet basis.

Alternative Embodiment

[0060]Thus far, the path selection and load balancing functions have been
described as being implemented by the network interface 302. While this
is an advantageous embodiment, it is not the only one. As an alternative,
the path selection and load balancing functions may be implemented by a
multi-path load balancing (MPLB) component that is separate from the
network interface 302. Such a component deployed in the network 300 of
FIG. 3 is shown in FIG. 7. As shown, an MPLB 702 may be deployed between
the network interface 302 of a node (e.g. network interface 302(A) of
node A) and one of the switches (e.g. switch a) of the network 300. One
of the advantages of this arrangement is that it relieves the network
interface 302 of all path selection and load balancing responsibilities.
Thus, any standard network interface 302 may be used to couple a node to
the network 300. An MPLB 702 may also be used to couple other non-node
components to the network 300. These non-node components may be switches
(e.g. switch u, switch v), routers, or any other type of network
component.

[0061]In one embodiment, the MPLBs 702 are implemented at the edge of the
network 300. More specifically, an MPLB 702 is coupled to each of the
edge links of the network 300. In network 300, the switches a through h
are on the edge of the network 300, and each of these switches has two
edge links that allow external components to couple to the network 300.
In such a network 300, an MPLB 702 would be coupled to each of these edge
links. Implemented in this way, the MPLBs 702 act as the border of the
network 300. Everything above the dashed line 704 is considered internal
to the network 300. Everything below the dashed line 704 is considered
external to the network 300. The significance of this internal/external
distinction is that components external to the network 300 need not be
aware of the multiple VLANs implemented within the network 300. Hence,
none of the external components (e.g. node A, node P, switch u, switch v)
need to perform any path selection or load balancing functions. Instead,
it is up to the MPLBs 702 to perform these functions. When a set of
information (e.g. a packet) from an external component enters the network
300, it is up to the MPLB 702 to (when appropriate) select a VLAN, and
hence, a path, to use to switch the packet through the network 300. As
discussed previously in connection with the network interface 302, this
may involve inserting a set of path-selecting VLAN information into the
packet. When a packet leaves the network 300 to go to an external
component, it is up to the MPLB 702 to remove any path-selecting VLAN
information from the packet before sending it on. By doing so, the MPLB
702 makes the path selection and load balancing details completely
transparent to the external components.

[0062]In the implementation where the path selection and load balancing
functions are performed by a network interface 302, it is known that the
network interface 302 is coupled to a node. Because of this, certain
assumptions can be made as to the types of packets that the network
interface 302 will be handling. In the implementation where the path
selection and load balancing functions are performed by an MPLB 702,
however, such assumptions cannot be made because an MPLB 702 may be
coupled not just to a node but also to any other type of network
component. As a result, an MPLB 702 needs to know how to handle more
types of packets than a network interface 302. For some types of packets,
it may not be appropriate to implement a path selection or load balancing
function. For other types of packets, some additional processing may be
needed. Because of this variety, the MPLB 702 needs to be endowed with
additional capability to handle more types of packets. The handling of
different packet types will be discussed in greater detail in later
sections.

[0063]As noted above, an MPLB 702, in one embodiment, is deployed at the
border of the network 300. As a result, an MPLB 702 behaves as an access
point to the network 300 for external components. This access point
function is typically performed by an edge port of an edge switch (for
example, if MPLB 702(u) were not deployed, the access point function
would be performed by the edge ports of edge switch c). Because an MPLB
702 is now taking on the access point function of an edge port, the MPLB
702, in one embodiment, performs the functions that are typically
performed by an edge port. This is in addition to the path selection and
load balancing functions noted above.

Multi-Path Load Balancing Component

[0064]FIG. 8 shows a functional block diagram of an MPLB 702 in accordance
with one embodiment of the present invention. For the most part, the MPLB
702 comprises the same general components as the network interface 302 of
FIG. 5. These components include a communication manager 802, a path
selection manager 804, and a VLAN storage 806. Generally, the path
selection manager 804 is responsible for selecting a VLAN, and hence, a
path, to use to switch a set of information through the network 300. The
VLAN storage 806 is used to store information pertaining to the one or
more active VLANs that can be used to switch a set of information through
the network. The VLAN storage 806 may be populated with VLAN information
in the same manner as the VLAN storage 506 of the network interface 302
is populated with VLAN information. The communication manager 802 is
responsible for interfacing an external component with the network 300.
Unlike the communication manager 502 of network interface 302, this
communication manager 802 may be coupled not just to a node but to any
other type of network component. Because of this, the communication
manager 802 may receive various types of packets (the terms "packet" and
"set of information" are used interchangeably hereinafter). In one
embodiment, additional functionality is imparted to the communication
manager 802 to enable it to handle various different types of packets. In
addition, in one embodiment, the communication manager 802 is endowed
with additional functionality to enable it to perform the functions
typically performed by an edge port. The operation of these components
will be elaborated upon in later sections. As was the case with the
network interface 302, for purposes of the present invention, the
functionality of the MPLB 702 may be realized using hardware (e.g.
hardware logic components, ASICs, etc.), software (e.g. having one or
more processors execute one or more sets of instructions), or a
combination thereof.

Operational Overview

[0065]Each MPLB 702 may be used for ingress (to forward one or more
packets from an external component into the network 300) or egress (to
forward one or more packets from the network 300 out to an external
component). FIG. 9 shows a high level operational flow diagram for the
ingress functionality of an MPLB 702, in accordance with one embodiment
of the present invention. The following discussion assumes that the
network 300 has already been configured and that the VLANs 1, 2, 3, 4
have already been established in the manner described previously.

[0066]In operation, an MPLB 702, and more specifically, the communication
manager 802 on the MPLB 702, receives (block 904) a packet from an
external component. This packet may be received from a node, a switch, or
any other type of external network component. In response, the
communication manager 802 determines (block 908) a packet type for the
packet. This determination may be made, for example, based upon the
control information and data in the packet. Once the packet type is
determined, the communication manager 802 further determines (block 912)
whether path selection is appropriate for this type of packet. If it is
not, then the communication manager 802 proceeds to process (block 932)
the packet in a manner that is appropriate for that packet type. On the
other hand, if path selection is appropriate for this type of packet,
then the MPLB 702, and more specifically, the path selection manager 804
of the MPLB 702, proceeds to select a path to use to transport the packet
through the network 300.
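
The ingress decision described above can be sketched as follows, with the
block numbers of FIG. 9 noted in comments. The sketch assumes that the
packet type has already been reduced to a text label and uses round-robin
selection as a stand-in for whichever selection policy is actually
employed. The Packet and Mplb classes and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Packet:
    packet_type: str                 # hypothetical label, e.g. "tcp/ip"
    payload: bytes = b""
    vlan_id: Optional[int] = None    # set once a path has been selected


@dataclass
class Mplb:
    active_vlans: list               # stands in for VLAN storage 806
    selectable_types: set = field(
        default_factory=lambda: {"tcp/ip", "udp/ip"}
    )
    next_index: int = 0              # round-robin position (an assumption)

    def ingress(self, packet: Packet) -> Packet:
        # Blocks 908 and 912: is path selection appropriate for this type?
        if packet.packet_type not in self.selectable_types:
            return self.process_other(packet)            # block 932
        # Blocks 916 and 920: consult the active VLANs and pick one.
        vlan = self.active_vlans[self.next_index % len(self.active_vlans)]
        self.next_index += 1
        packet.vlan_id = vlan                            # block 924
        return packet                                    # ready for block 928

    def process_other(self, packet: Packet) -> Packet:
        # Placeholder: type-specific handling is not modeled in this sketch.
        return packet


mplb = Mplb(active_vlans=[1, 2, 3, 4])
print(mplb.ingress(Packet("tcp/ip")).vlan_id)
print(mplb.ingress(Packet("other")).vlan_id)
```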

[0067]In one embodiment, the path selection manager 804 selects a path by
first determining (block 916) a set of virtual networks that it can use
to transport the packet through the network 300. In one embodiment, the
path selection manager 804 makes this determination by accessing and
consulting the information stored in the VLAN storage 806, which sets
forth all of the active VLANs within network 300. In the current example,
the active VLANs are VLANs 1, 2, 3, and 4. Given this information, the
path selection manager 804 selects (block 920) one of the VLANs to use to
transport the packet. By selecting a VLAN, the path selection manager 804
in effect selects a path (since each VLAN defines a different path
through the network 300).

[0068]In one embodiment, this selection is made in a manner that tends to
balance the traffic load across the multiple VLANs. For example, the path
selection manager 804 may select the VLAN randomly (if the selection
process is truly random, this should spread traffic across the various
VLANs relatively evenly). The path selection manager 804 may also select
the VLAN in a round-robin fashion. Furthermore, the path selection
manager 804 may select the VLAN based upon current traffic conditions.
For example, the path selection manager 804 may select the VLAN with the
path that is currently experiencing the lowest traffic load, or the VLAN
with the path that currently has the most available capacity for carrying
traffic. The path selection manager 804 may also select the VLAN with the
path that has the lowest current latency. Information on current traffic
conditions may be provided to the path selection manager 804 by a central
traffic monitoring component (not shown) or may be derived based upon
information gathered by the path selection manager 804. Based upon this
information, the path selection manager 804 can select the VLAN in such a
manner as to balance the traffic load across the multiple paths. The
presence of the four switching paths also provides four paths across
which traffic can be spread. As a result, load balancing can be
performed. The manner in which load balancing can be carried out in
accordance with one embodiment of the present invention will be
elaborated upon in a later section.

[0069]After the VLAN is selected, the path selection manager 804 updates
(block 924) the packet to include information indicating the particular
VLAN that was selected. In one embodiment, this may be done as shown in
FIGS. 4B and 4C, wherein the path-selecting VLAN information 408 is
inserted into the packet. After the packet is updated, it is sent (block
928) by the MPLB 702, and more specifically, by the communication manager
802 of the MPLB 702, into the network 300. The switches in the network
300 thereafter use the VLAN information in the packet to switch the
packet through the network 300 along the selected path.

[0070]The above discussion provides a high level description of the
operation of the MPLB 702. To illustrate how the MPLB 702 handles
specific types of packets, reference will now be made to several specific
examples.

TCP/IP and UDP/IP Packets

[0071]Suppose that an MPLB 702, and more specifically, the communication
manager 802 of the MPLB 702, receives a TCP/IP or a UDP/IP packet from an
external component. This packet may be received from a node coupled to
the MPLB 702, from an external switch that has forwarded the packet to
the MPLB 702, or from some other external network component. In addition
to data, this type of packet comprises header information. The header
information includes an Ethernet portion, which includes a source MAC
address, a destination MAC address, and a type (which would be IP in this
case). The header information further comprises a source IP address, a
destination IP address, a protocol type (which would be either TCP or
UDP), a TCP or UDP source port number, and a TCP or UDP destination port
number. Upon receiving the packet, the communication manager 802 extracts
the header information therefrom, and uses the header information to
determine the packet type. From the header information described above,
the communication manager 802 determines that this packet is of the
TCP/IP or UDP/IP type. In one embodiment, path selection is appropriate
for this type of packet. Therefore, the communication manager 802 invokes
the path selection manager 804 to select a path through the network 300
for this packet.
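
For purposes of illustration only, the header inspection described above may be sketched as follows. The sketch is written in Python and assumes an untagged Ethernet II frame carrying IPv4. The names HeaderInfo and extract_header_info are hypothetical and are not part of any embodiment.

```python
import struct
from typing import NamedTuple, Optional

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP, IPPROTO_UDP = 6, 17


class HeaderInfo(NamedTuple):
    """Header fields named in the description (hypothetical container)."""
    src_mac: bytes
    dst_mac: bytes
    src_ip: bytes
    dst_ip: bytes
    protocol: int
    src_port: int
    dst_port: int


def extract_header_info(frame: bytes) -> Optional[HeaderInfo]:
    """Return the header fields of a TCP/IP or UDP/IP frame, else None."""
    if len(frame) < 14 + 20:          # Ethernet header + minimal IPv4 header
        return None
    dst_mac, src_mac = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:   # the "type" field must indicate IP
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4          # IPv4 header length in bytes
    protocol = ip[9]
    if ihl < 20 or protocol not in (IPPROTO_TCP, IPPROTO_UDP):
        return None
    if len(ip) < ihl + 4:             # need the source and destination ports
        return None
    src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return HeaderInfo(src_mac, dst_mac, ip[12:16], ip[16:20],
                      protocol, src_port, dst_port)
```

A return value other than None corresponds to the determination that the packet is of the TCP/IP or UDP/IP type, at which point the path selection manager 804 would be invoked.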

[0072]In response to being invoked, the path selection manager 804
accesses the VLAN storage 806 and uses the VLAN information contained
therein to determine a set of VLANs that can be used to transport the
packet through the network 300. After the available VLANs are determined,
the path selection manager 804 selects one of the VLANs. By selecting a
VLAN, the path selection manager 804 in effect selects a path through the
network 300. In one embodiment, this VLAN selection is made in a manner
that tends to balance the traffic load across the multiple VLANs.

[0073]After the VLAN is selected, the path selection manager 804 updates
the packet to include information indicating the particular VLAN that was
selected. In one embodiment, this may be done as shown in FIGS. 4B and
4C, wherein the path selecting VLAN information 408 is inserted into the
packet. After the packet is updated, it is sent by the communication
manager 802 into the network 300. The switches in the network 300
thereafter use the address and VLAN information in the packet to switch
the packet through the network 300. In this manner, the MPLB 702
implements path selection and load balancing on a TCP/IP or UDP/IP type
of packet.

[0074]In the above example, the path selection manager 804 performs path
selection on a packet-by-packet basis. In some implementations, it may be
desirable to group certain packets together into a flow, and to use the
same VLAN (and hence, the same path) for all of the packets in that flow.
To accommodate such an implementation, the path selection manager 804, in
one embodiment, may process the TCP/IP or UDP/IP packet as follows.

[0075]Before selecting a VLAN from the available VLANs specified in the
VLAN storage 806, the path selection manager 804 derives a flow label for
the packet. The flow label may be derived, for example, by processing the
header information in the packet through a hashing algorithm and deriving
a hash code therefrom. This hash code semi-uniquely or uniquely
identifies the flow of which the packet is a part. Once the flow label is
derived, the path selection manager 804 determines whether other packets
having this flow label have been processed before. In one embodiment, to
make this determination, the path selection manager 804 determines
whether there is an already existing flow data structure associated with
this flow label.

[0076]If such a data structure already exists (thereby indicating that
packets associated with this flow were previously processed), then the
path selection manager 804 accesses that data structure, and extracts
therefrom a VLAN ID. This VLAN ID indicates the VLAN that was selected
for the previously processed packets in the flow. After the VLAN ID is
extracted, it is compared against the available VLANs specified in the
VLAN storage 806. If this VLAN is still available, it is selected as the
VLAN to be used for this packet. If it is no longer available, then
another VLAN is selected from the list of available VLANs, and that VLAN
ID is stored into the flow data structure. That VLAN will be used for
this packet and for future packets of the flow.

[0077]If there is no currently existing flow data structure associated
with the derived flow label, then the path selection manager 804 creates
a new flow data structure, and associates the flow label with that new
data structure. In addition, the path selection manager 804 selects one
of the available VLANs specified in the VLAN storage 806, and stores that
VLAN ID into the new flow data structure. By doing so, the path selection
manager 804 associates the selected VLAN with the flow. This will cause
this packet and future packets of the flow to use the selected VLAN. In
this manner, the path selection manager 804 can cause all of the packets
of a flow to use the same VLAN. By using different VLANs for different
flows, the path selection manager 804 can load balance across the
multiple VLANs.
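
The per-flow processing described above may be sketched as follows, purely as an illustration. The sketch assumes that the flow label is a truncated SHA-256 hash of the raw header bytes and that a simple round-robin choice stands in for the VLAN selection policy. Neither assumption is required by the description, and the names PathSelector and FlowRecord are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FlowRecord:
    """Flow data structure holding the VLAN chosen for a flow."""
    vlan_id: int


class PathSelector:
    def __init__(self, available_vlans: List[int]):
        self.available_vlans = list(available_vlans)   # stands in for VLAN storage 806
        self.flows: Dict[bytes, FlowRecord] = {}       # flow label -> flow data structure
        self._next = 0

    def _pick_vlan(self) -> int:
        # Round-robin is an assumed stand-in for the real selection policy.
        vlan = self.available_vlans[self._next % len(self.available_vlans)]
        self._next += 1
        return vlan

    @staticmethod
    def flow_label(header: bytes) -> bytes:
        # Hash the header information into a short flow label.
        return hashlib.sha256(header).digest()[:8]

    def select_vlan(self, header: bytes) -> int:
        label = self.flow_label(header)
        record = self.flows.get(label)
        if record is None:
            # No existing flow data structure: create one and store a VLAN ID.
            record = FlowRecord(self._pick_vlan())
            self.flows[label] = record
        elif record.vlan_id not in self.available_vlans:
            # The stored VLAN is no longer available: select and store another.
            record.vlan_id = self._pick_vlan()
        return record.vlan_id
```

In this sketch, repeated calls with the same header bytes return the same VLAN ID for as long as that VLAN remains in the available list.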

Topology Control Packets

[0078]As shown in FIG. 7, an MPLB 702 may be coupled to an external
switch, such as switch u or v. One of the consequences of being coupled
to a switch is that the MPLB 702 may receive some of the topology control
packets sent by that switch. In one embodiment, the MPLB 702, and more
specifically, the communication manager 802 of the MPLB 702, knows how to
recognize and handle these topology control packets.

Spanning Tree BPDU

[0079]One type of topology control packet that may be sent by a switch is
a spanning tree bridge protocol data unit (spanning tree BPDU) packet.
This type of packet is directed to well known and specific MAC addresses
(i.e. this type of packet includes within the packet a well known and
specific MAC destination address). From the well known MAC destination
address, it is possible for the communication manager 802 to determine
the packet type of this packet. Once the packet type is known, the
communication manager 802 knows how to handle the packet. In one
embodiment, path selection is not appropriate for spanning tree BPDU
packets; thus, the MPLB 702 neither path selects nor load balances these
packets. Instead, in one embodiment, the MPLB 702 just forwards the
packet unmodified. Thus, for example, if switch u (see FIG. 7) sends a
spanning tree BPDU packet to MPLB 702(u), then MPLB 702(u) just forwards
the packet unmodified to the internal switch c.

GVRP BPDU

[0080]Another type of topology control packet that may be sent by a switch
is a generic attribute registration protocol (GARP) VLAN registration
protocol (GVRP) BPDU packet. The purpose of this type of packet is to
register a VLAN (or multiple VLANs) on a particular port or link. For
example, if switch u sends a GVRP BPDU packet to MPLB 702(u), and if the
packet indicates that VLAN X is to be registered, then it means that
switch u is trying to register VLAN X on the external link between switch
u and MPLB 702(u).

[0081]Like the spanning tree BPDU, this type of packet also uses well
known and specific MAC destination addresses. Thus, based upon the MAC
destination address in the packet, this type of packet can be readily
identified by the communication manager 802 as a GVRP BPDU. Also, like
the spanning tree BPDU, path selection and load balancing are not
appropriate (in one embodiment) for this type of packet. Thus, in
response to receiving a GVRP BPDU packet, the MPLB 702, and more
specifically, the communication manager 802, forwards the packet
unmodified. In an example where external switch u (see FIG. 7) sends a
GVRP BPDU packet to MPLB 702(u), MPLB 702(u) would just forward the
packet to the internal switch c unmodified.

[0082]In contrast to a spanning tree BPDU packet, however, the MPLB
702(u), in one embodiment, performs additional processing in response to
a GVRP BPDU packet. Specifically, the communication manager 802 of MPLB
702(u) interacts with switch c (which is implementing the GVRP protocol)
to register the VLAN indicated in the packet on the MPLB's external link.
Thus, if, for example, the packet indicates that VLAN X is to be
registered, then the communication manager 802 of MPLB 702(u) would
register VLAN X on the external link between MPLB 702(u) and external
switch u. Once VLAN X is registered in this way, the communication
manager 802 of MPLB 702(u) will know that VLAN X is supported on its
external link and hence, will allow packets tagged with VLAN X to be sent
out on that link. A point to note is that, registered in this way, VLAN X
is recognized by MPLB 702(u) as an external VLAN only. That VLAN is not
one of the internal VLANs used to switch a packet through the network
300.
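
By way of a hedged illustration, recognition and handling of the two topology control packet types discussed above may be sketched as follows. The sketch uses the IEEE 802.1D bridge group address (01-80-C2-00-00-00) for spanning tree BPDUs and the GVRP address (01-80-C2-00-00-21). Parsing of the GVRP payload is omitted, so the VLAN to be registered is passed in as a hypothetical parameter (declared_vlan).

```python
from enum import Enum
from typing import Optional, Set


class PacketType(Enum):
    SPANNING_TREE_BPDU = "spanning tree BPDU"
    GVRP_BPDU = "GVRP BPDU"
    OTHER = "other"


# Well known MAC destination addresses for the two control packet types.
WELL_KNOWN_DESTINATIONS = {
    bytes.fromhex("0180c2000000"): PacketType.SPANNING_TREE_BPDU,
    bytes.fromhex("0180c2000021"): PacketType.GVRP_BPDU,
}


def classify(frame: bytes) -> PacketType:
    """Determine the packet type from the MAC destination address."""
    return WELL_KNOWN_DESTINATIONS.get(frame[:6], PacketType.OTHER)


def handle_topology_control(frame: bytes,
                            external_vlans: Set[int],
                            declared_vlan: Optional[int] = None) -> bytes:
    """Return the frame to forward; no path selection is performed."""
    kind = classify(frame)
    if kind is PacketType.OTHER:
        raise ValueError("not a topology control packet")
    if kind is PacketType.GVRP_BPDU and declared_vlan is not None:
        # Additional processing: record the VLAN as supported on the
        # external link (an external VLAN only, not an internal one).
        external_vlans.add(declared_vlan)
    return frame  # forwarded unmodified in both cases
```

Keeping external_vlans separate from the set of internal VLANs mirrors the point that a VLAN registered in this way is not used to switch packets through the network 300.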

Multicast Registration Packets

[0083]Another type of packet that an MPLB 702 may receive from an external
component is a multicast registration packet. This type of packet is sent
by a node to register that node to receive packets sent to a multicast
address. A multicast registration packet may be sent as a GARP multicast
registration protocol (GMRP) packet or as an Internet Group Management
Protocol (IGMP) packet. This type of packet uses well known and
specific MAC destination addresses. Thus, based upon the MAC destination
address in the packet, this type of packet can be readily identified by
the communication manager 802 of an MPLB 702 as a multicast registration
packet. Included with a multicast registration packet is a multicast
address. This multicast address is the address that is to be programmed
into the internal switches of the network 300 to cause the internal
switches to forward any packet sent to the multicast address to the node
that is originating the multicast registration packet. According to
current standards, certain address ranges are reserved for multicast
addresses; thus, from inspecting the multicast address, it can be readily
determined that the address is a multicast address. A multicast
registration packet may also be tagged with an external VLAN ID. This
VLAN ID indicates the external VLAN that is to be used to carry a
multicast packet to the node that is originating the multicast
registration packet. In one embodiment, the communication manager 802 of
an MPLB 702 processes a multicast registration packet as follows.

[0084]Upon receiving a multicast registration packet, the communication
manager 802 inspects the MAC destination address of the packet and
determines that it is a multicast registration packet. In one embodiment,
path selection is appropriate for this type of packet. Thus, the
communication manager 802 invokes the path selection manager 804 to
select an internal VLAN. In one embodiment, the path selection manager
804 of each of the MPLBs 702 is preconfigured (e.g. given some
configuration information) to associate certain multicast addresses with
certain internal VLANs. This configuration information may cause all
multicast addresses to be associated with the same internal VLAN (e.g.
VLAN 1), or it may cause certain ranges of multicast addresses to be
associated with certain internal VLANs (e.g. all multicast addresses
within a first range may be associated with VLAN 1, all multicast
addresses within a second range may be associated with VLAN 2, etc.).
Based on the configuration information, and the multicast address
specified in the registration packet, the path selection manager 804
determines which internal VLAN it should associate with the multicast
address, and selects that VLAN. The multicast address is effectively
mapped to this internal VLAN.
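
As an illustration only, configuration information that associates ranges of multicast addresses with internal VLANs may be sketched as follows. The sketch assumes IPv4 multicast group addresses and an arbitrary three-way split of the 224.0.0.0/4 range. The table contents and the name internal_vlan_for are hypothetical.

```python
import ipaddress

# Assumed configuration: (multicast address range, internal VLAN ID).
# Every MPLB would be given the same table so that all of them map a
# given multicast address to the same internal VLAN.
MULTICAST_VLAN_CONFIG = [
    (ipaddress.ip_network("224.0.0.0/6"), 1),   # 224.x.x.x through 227.x.x.x
    (ipaddress.ip_network("228.0.0.0/6"), 2),   # 228.x.x.x through 231.x.x.x
    (ipaddress.ip_network("232.0.0.0/5"), 3),   # 232.x.x.x through 239.x.x.x
]


def internal_vlan_for(group: str) -> int:
    """Map a multicast group address to its configured internal VLAN."""
    address = ipaddress.ip_address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    for network, vlan_id in MULTICAST_VLAN_CONFIG:
        if address in network:
            return vlan_id
    raise LookupError(f"no internal VLAN configured for {group}")
```

Because the mapping depends only on the shared configuration and the address itself, two MPLBs handling registrations for the same multicast address arrive at the same internal VLAN without communicating with each other.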

[0085]After the internal VLAN is selected, the path selection manager 804
proceeds to update the multicast registration packet. Before it does so,
however, it checks to see if the packet is currently tagged with an
external VLAN ID. If it is not, then the path selection manager 804
updates the packet to include information indicating the selected
internal VLAN. This may be done as shown in FIGS. 4B and 4C, wherein the
path selecting VLAN information 408 is inserted into the packet. After
the packet is updated, it is sent by the communication manager 802 into
the network 300.

[0086]On the other hand, if the packet is tagged with an external VLAN ID,
then the path selection manager 804 updates the packet by overwriting the
external VLAN ID in the packet with the ID of the selected internal VLAN.
In effect, this maps the external VLAN and the multicast address to the
internal VLAN. As part of this mapping, the path selection manager 804
records the external VLAN-multicast address-internal VLAN association.
This association enables the path selection manager 804 to perform the
reverse mapping when a packet leaves the network 300. After the multicast
registration packet is updated to replace the external VLAN ID with the
internal VLAN ID, the communication manager 802 forwards the updated
packet into the network 300. The multicast registration packet is thus
processed by the MPLB 702.
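
The two update cases described above (packet not tagged, and packet tagged with an external VLAN ID) may be sketched as follows. The sketch is illustrative only and represents a packet as a dictionary with hypothetical keys "group" and "vlan". The class name MulticastMapper is likewise an assumption.

```python
from typing import Dict, Optional, Tuple


class MulticastMapper:
    def __init__(self) -> None:
        # multicast address -> (external VLAN ID, internal VLAN ID)
        self.associations: Dict[str, Tuple[int, int]] = {}

    def update_on_ingress(self, packet: dict, internal_vlan: int) -> dict:
        """Return a copy of the packet carrying the selected internal VLAN."""
        updated = dict(packet)
        external_vlan: Optional[int] = packet.get("vlan")
        if external_vlan is not None:
            # Record the external VLAN / multicast address / internal VLAN
            # association so the reverse mapping can be done on egress.
            self.associations[packet["group"]] = (external_vlan, internal_vlan)
        # Insert the internal VLAN ID, or overwrite the external one with it.
        updated["vlan"] = internal_vlan
        return updated
```

In both cases the packet leaves the MPLB carrying only the internal VLAN ID. The recorded association is the only place where the external VLAN ID survives.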

[0087]To illustrate the effect of the multicast registration packet on the
internal network 300, reference will be made to an example. Suppose that
node A of FIG. 7 wishes to register to receive packets sent to multicast
address X. In such a case, node A would send a multicast registration
packet containing multicast address X to MPLB 702(A). For the sake of
simplicity, it will be assumed that this packet does not include an
external VLAN ID. Upon receiving this packet, the communication manager
802 of MPLB 702(A) determines that the packet is a multicast registration
packet; thus, it invokes the path selection manager 804 to select an
internal VLAN. Based upon a set of configuration information and the
multicast address X, the path selection manager 804 selects one of the
internal VLANs (assume VLAN 1 for the sake of example). It then updates
the packet with the selected VLAN ID. The updated packet is thereafter
sent by the communication manager 802 to switch a.

[0088]Upon receiving the multicast registration packet, switch a updates
its forwarding table to indicate that all packets sent to multicast
address X should be forwarded to the port coupled to MPLB 702(A). From
the packet, switch a knows that internal VLAN 1 has been associated with
multicast address X; thus, it forwards the registration packet on to the
next higher switch in VLAN 1, which, in the network of FIG. 7, is switch
i. Upon receiving the multicast registration packet, switch i updates its
forwarding table to indicate that all packets sent to multicast address X
should be forwarded to the port coupled to switch a. From the packet,
switch i knows that internal VLAN 1 has been associated with multicast
address X; thus, it forwards the registration packet on to the next
higher switch in VLAN 1, which, in the network of FIG. 7, is root switch
q. Upon receiving the multicast registration packet, root switch q
updates its forwarding table to indicate that all packets sent to
multicast address X should be forwarded to the port coupled to switch i.
Since switch q is the root switch for VLAN 1, it does not forward the
multicast registration packet on to any other switch. Node A is thus
registered to receive packets sent to the multicast address X.

[0089]Suppose now that node P also wishes to register to receive packets
sent to multicast address X. In such a case, node P would send a
multicast registration packet containing multicast address X to MPLB
702(P). For the sake of simplicity, it will be assumed that this packet
does not include an external VLAN ID. Upon receiving this packet, the
communication manager 802 of MPLB 702(P) determines that the packet is a
multicast registration packet; thus, it invokes the path selection
manager 804 to select an internal VLAN. Based upon a set of configuration
information (which should be the same as that used by MPLB 702(A)) and
the multicast address X, the path selection manager 804 selects one of
the internal VLANs (this VLAN should be the same (i.e. VLAN 1) as that
selected previously by the path selection manager 804 of MPLB 702(A)).
The path selection manager 804 then updates the packet with the selected
VLAN ID. The updated packet is thereafter sent by the communication
manager 802 to switch h.

[0090]Upon receiving the multicast registration packet, switch h updates
its forwarding table to indicate that all packets sent to multicast
address X should be forwarded to the port coupled to MPLB 702(P). From
the packet, switch h knows that internal VLAN 1 has been associated with
multicast address X; thus, it forwards the registration packet on to the
next higher switch in VLAN 1, which, in the network of FIG. 7, is switch
o. Upon receiving the multicast registration packet, switch o updates its
forwarding table to indicate that all packets sent to multicast address X
should be forwarded to the port coupled to switch h. From the packet,
switch o knows that internal VLAN 1 has been associated with multicast
address X; thus, it forwards the registration packet on to the next
higher switch in VLAN 1, which, in the network of FIG. 7, is root switch
q. Upon receiving the multicast registration packet, root switch q
updates its forwarding table to indicate that all packets sent to
multicast address X should be forwarded to the port coupled to switch o.
Since switch q is the root switch for VLAN 1, it does not forward the
multicast registration packet on to any other switch. Node P is thus
registered to receive packets sent to the multicast address X. In the
above manner, multiple nodes may be registered to receive packets sent to
a multicast address.
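
The hop-by-hop effect of a multicast registration packet may be sketched as follows. The sketch is a simplified model, not the switch implementation: it uses a hypothetical five-switch tree (leaf_1, mid_1, leaf_2, mid_2, root) in place of the network of FIG. 7, and it identifies each port by the name of the component coupled to it.

```python
from collections import defaultdict

# Next higher switch on the internal VLAN for each switch (None at the root).
PARENT = {
    "leaf_1": "mid_1", "mid_1": "root",
    "leaf_2": "mid_2", "mid_2": "root",
    "root": None,
}


def new_tables():
    """Forwarding tables: switch -> multicast address -> set of ports."""
    return defaultdict(lambda: defaultdict(set))


def register(tables, edge_switch: str, mplb_port: str, group: str) -> None:
    """Propagate a registration from the edge switch up to the root."""
    port, switch = mplb_port, edge_switch
    while switch is not None:
        # Each switch records the port on which the registration arrived.
        tables[switch][group].add(port)
        # The next higher switch sees the packet arrive from this switch;
        # the root has no parent, so propagation stops there.
        port, switch = switch, PARENT[switch]
```

After two registrations for the same multicast address from different edges of the tree, the root's table holds one port per branch, which is what later allows a multicast packet to be sent down every registered branch.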

Multicast Packets

[0091]The above discussion addresses the handling of a multicast
registration packet. This discussion will focus on the processing of an
actual multicast packet, which is a data packet. To facilitate
discussion, reference will be made to an example. In the following
example, it will be assumed that both nodes A and P of FIG. 7 have been
registered to receive packets sent to multicast address X. It will also
be assumed that a set of configuration information has been provided to
all of the MPLBs 702, and that the configuration information specifies
that internal VLAN 1 is to be used to carry the packets sent to multicast
address X.

[0092]Suppose now that node P sends a data packet having multicast address
X to MPLB 702(P). From the multicast address in the packet, the
communication manager 802 of MPLB 702(P) determines that this packet is a
multicast packet. In one embodiment, path selection is appropriate for
this type of packet. Therefore, the communication manager 802 invokes the
path selection manager 804 to select a path through the network 300 for
this packet.

[0093]In response to being invoked, the path selection manager 804
inspects the multicast address X, and using the configuration information
noted above, associates the multicast address X with VLAN 1; thus, VLAN 1
is selected as the internal VLAN to use to transport this packet. As
noted previously, the configuration information may specify that certain
multicast address ranges be mapped to certain internal VLANs. For
example, multicast addresses within a first range may be mapped to VLAN
1, while multicast addresses within a second range may be mapped to VLAN
2, while multicast addresses within a third range may be mapped to VLAN
3, and so forth. By mapping different multicast addresses to different
internal VLANs, it is possible to spread multicast traffic across the
multiple internal VLANs. Thus, a static version of load balancing can be
achieved.

[0094]After the internal VLAN is selected, the path selection manager 804
proceeds to update the packet with the ID of the selected VLAN. In one
embodiment, before it does so, the path selection manager 804 checks to
see if the packet is currently tagged with an external VLAN ID. If it is
not, then the path selection manager 804 updates the packet to include
information indicating the selected internal VLAN. This may be done as
shown in FIGS. 4B and 4C, wherein the path selecting VLAN information 408
is inserted into the packet. On the other hand, if the packet is tagged
with an external VLAN ID, then the path selection manager 804 updates the
packet by overwriting the external VLAN ID in the packet with the ID of
the selected internal VLAN. In effect, this maps the external VLAN and
the multicast address to the selected internal VLAN. After the packet is
updated with path selecting VLAN information, the communication manager
802 forwards the updated packet to switch h.

[0095]When switch h receives the packet, it consults its forwarding table
and finds that all packets destined for multicast address X should be
forwarded to the port coupled to MPLB 702(P) (this is from the previous
registration of node P). However, because the packet was received on that
port, switch h does not forward the packet back to that port. Instead,
because the packet specifies that internal VLAN 1 is to be used to
transport the packet through the network 300, switch h forwards the
packet to switch o. When switch o receives the packet, it consults its
forwarding table and finds that all packets destined for multicast
address X should be forwarded to the port coupled to switch h (again,
this is from the previous registration of node P). However, because the
packet was received on that port, switch o does not forward the packet
back to that port. Instead, because the packet specifies that internal
VLAN 1 is to be used to transport the packet through the network 300,
switch o forwards the packet to root switch q. Upon receiving the
multicast packet, root switch q consults its forwarding table and sees
that all packets destined for multicast address X should be forwarded to
the port coupled to switch i (this is from the previous registration of
node A) and the port coupled to switch o (this is from the previous
registration of node P). Since the root switch q received the multicast
packet on the port coupled to switch o, it does not forward the packet
back to that port; thus, in this example, root switch q forwards the
packet just to the port coupled to switch i. When switch i receives the
packet, it consults its forwarding table and finds that all packets
destined for multicast address X should be forwarded to the port coupled
to switch a (this is from the previous registration of node A). Thus, it
forwards the packet to switch a. In turn, switch a consults its
forwarding table and finds that all packets destined for multicast
address X should be forwarded to the port coupled to MPLB 702(A) (again,
this is from the previous registration of node A). Thus, it forwards the
packet to MPLB 702(A).
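
The forwarding behavior walked through above may be modeled, in simplified form, as follows. The sketch assumes a hypothetical tree (leaf_1, mid_1, leaf_2, mid_2, root) rather than the network of FIG. 7, identifies each port by the name of the component coupled to it, and assumes forwarding tables already populated by earlier registrations from mplb_A and mplb_P.

```python
from typing import Dict, List, Optional, Set

# Next higher switch on the internal VLAN for each switch (None at the root).
PARENT: Dict[str, Optional[str]] = {
    "leaf_1": "mid_1", "mid_1": "root",
    "leaf_2": "mid_2", "mid_2": "root",
    "root": None,
}

# Forwarding tables left behind by two earlier registrations for address "X".
TABLES: Dict[str, Dict[str, Set[str]]] = {
    "leaf_1": {"X": {"mplb_A"}}, "mid_1": {"X": {"leaf_1"}},
    "leaf_2": {"X": {"mplb_P"}}, "mid_2": {"X": {"leaf_2"}},
    "root": {"X": {"mid_1", "mid_2"}},
}


def forward(switch: str, in_port: str, group: str, delivered: List[str]) -> None:
    """Forward a multicast packet that arrived at `switch` on `in_port`."""
    registered = TABLES.get(switch, {}).get(group, set())
    # Never send the packet back out of the port it was received on.
    for port in sorted(registered - {in_port}):
        if port in PARENT:
            forward(port, switch, group, delivered)   # port leads to a switch
        else:
            delivered.append(port)                    # port leads to an MPLB
    # A packet that arrived from below also continues toward the root.
    up = PARENT[switch]
    if up is not None and up != in_port:
        forward(up, switch, group, delivered)


delivered: List[str] = []
forward("leaf_2", "mplb_P", "X", delivered)   # packet sent by the node behind mplb_P
```

In this model the packet climbs from leaf_2 to the root, is not reflected back toward its sender, and is delivered only to mplb_A.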

[0096]When the communication manager 802 of MPLB 702(A) receives the
packet, it determines from the multicast address that this is a multicast
packet. The communication manager 802 then checks its records to see if
this multicast address X has an external VLAN associated therewith
(recall that an external VLAN-multicast address-internal VLAN association
may have been saved as part of the registration process for node A). If
the multicast address X does not have an external VLAN associated
therewith, then the communication manager 802 removes the path selection
VLAN information from the packet and forwards it on to node A. On the
other hand, if the multicast address X does have an external VLAN
associated with it, then the communication manager 802 replaces the path
selection VLAN information in the packet with the external VLAN ID. The
communication manager 802 then forwards the updated packet on to node A.
In this manner, a multicast packet is transported through the network
300.
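
The reverse mapping performed when a multicast packet leaves the network may be sketched as follows, for illustration only. The packet is represented as a dictionary with hypothetical keys "group" and "vlan", and the saved associations are assumed to be available as a mapping from multicast address to external VLAN ID.

```python
from typing import Dict


def update_on_egress(packet: dict, external_vlan_by_group: Dict[str, int]) -> dict:
    """Return a copy of the packet as it should be handed to the node."""
    updated = dict(packet)
    external_vlan = external_vlan_by_group.get(packet["group"])
    if external_vlan is None:
        # No external VLAN recorded: remove the path selection VLAN information.
        updated.pop("vlan", None)
    else:
        # Replace the path selection VLAN information with the external VLAN ID.
        updated["vlan"] = external_vlan
    return updated
```

Either way, the internal VLAN ID never reaches the node, so the internal VLANs remain invisible outside the network 300.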

Egress Functionality

[0097]The above discussion describes the ingress functionality of an MPLB
702. The following discussion will describe the egress functionality.
This egress functionality enables the MPLB 702, and more specifically,
the communication manager 802 of the MPLB 702, to receive a packet from a
switch internal to the network 300, and to either forward that packet to
an external component or to process the packet in an appropriate manner.

[0098]One type of packet that the communication manager 802 of an MPLB 702
may receive from an internal switch is a forwarded TCP/IP or UDP/IP
packet. When it receives such a packet, the communication manager 802
processes the packet to find a set of path-selecting VLAN information
(this path-selecting VLAN information was added to the packet by another
MPLB 702 when it entered the network 300). The communication manager 802
removes this path-selecting VLAN information from the packet and then
forwards the packet to an external component. By doing this, the
communication manager 802 makes the existence of the internal VLANs
transparent to the external component.
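
A minimal sketch of adding and removing path-selecting VLAN information is given below. It assumes, for the sake of example, that the information is carried as an IEEE 802.1Q tag (TPID 0x8100) placed immediately after the source MAC address. The exact layout of FIGS. 4B and 4C is not reproduced here, and the function names are hypothetical.

```python
import struct
from typing import Tuple

TPID_8021Q = 0x8100


def push_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert a 4-byte 802.1Q tag after the destination and source MACs."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in the range 0..7")
    tci = (priority << 13) | vlan_id
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def pop_vlan_tag(frame: bytes) -> Tuple[int, bytes]:
    """Remove the outermost 802.1Q tag; return (VLAN ID, untagged frame)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        raise ValueError("frame does not carry an 802.1Q tag")
    return tci & 0x0FFF, frame[:12] + frame[16:]
```

Under this assumption, the frame delivered to the external component after pop_vlan_tag is byte-for-byte identical to the frame that entered the network before push_vlan_tag was applied.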

[0099]Another type of packet that the communication manager 802 may
receive from an internal switch is a GVRP declaration packet originated
by one of the root switches q, r, s, t. The purpose of this packet is to
set up one or more VLANs within the network 300 (the declaration packet
contains information indicating which VLAN or VLANs are to be set up). A
GVRP declaration packet is propagated from switch to switch until it
reaches an MPLB 702. When the communication manager 802 of an MPLB 702
receives a GVRP declaration packet, it does not forward the packet to an
external component. Rather, the communication manager 802 sends a
response in the form of a GVRP registration packet. The purpose of this
response is to acknowledge receipt of the GVRP declaration packet, and to
register the one or more VLANs specified in the declaration packet on the
appropriate switches. This GVRP registration packet, which includes
information indicating which VLAN or VLANs are to be registered, is sent
by the communication manager 802 to the internal switch to which the MPLB
702 is coupled. For example, if MPLB 702(A) is sending the GVRP
registration packet, that packet would be sent to internal switch a. Upon
receiving a GVRP registration packet on a particular port, the internal
switch registers the VLAN or VLANs specified in the packet on that
particular port. For example, if internal switch a receives a GVRP
registration packet from MPLB 702(A) on the port shown in FIG. 7, and if
the registration packet indicates that VLAN 1 is to be registered, then
switch a registers VLAN 1 on that port. Thereafter, the link between
switch a and MPLB 702(A) will support VLAN 1. In this manner, the
communication manager 802 of an MPLB 702 participates in the
establishment of an internal VLAN. As an additional note, in one
embodiment, when the communication manager 802 of an MPLB 702
participates in the VLAN establishment process, it updates the VLAN
storage 806 of the MPLB 702 to include the ID of the newly established
VLAN. By doing so, the communication manager 802 enables the path
selection manager 804 to select that VLAN in future path selection and
load balancing operations.
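
The exchange described above may be sketched as follows, as a simplified model only. The GVRP packets themselves are not encoded. A declaration is represented by the set of VLAN IDs it names, and the class names InternalSwitch and CommunicationManager are hypothetical stand-ins for an internal switch and the communication manager 802.

```python
from typing import Dict, Iterable, Set


class InternalSwitch:
    def __init__(self) -> None:
        self.vlans_by_port: Dict[str, Set[int]] = {}

    def on_gvrp_registration(self, port: str, vlans: Iterable[int]) -> None:
        # Register the specified VLANs on the port the packet arrived on.
        self.vlans_by_port.setdefault(port, set()).update(vlans)


class CommunicationManager:
    def __init__(self, switch: InternalSwitch, port: str) -> None:
        self.switch = switch              # internal switch the MPLB is coupled to
        self.port = port                  # switch port coupled to this MPLB
        self.vlan_storage: Set[int] = set()

    def on_gvrp_declaration(self, declared_vlans: Iterable[int]) -> None:
        vlans = set(declared_vlans)
        # The declaration is not forwarded to any external component.
        # Instead, a registration is sent back to the internal switch.
        self.switch.on_gvrp_registration(self.port, vlans)
        # Make the new VLANs selectable in future path selection operations.
        self.vlan_storage |= vlans
```

After on_gvrp_declaration returns, the VLAN appears both in the switch's per-port registrations and in the MPLB's VLAN storage, which corresponds to the link supporting the VLAN and the path selection manager being able to select it.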

Other Functionalities

[0100]As noted previously, because an MPLB 702 is deployed at the edge of
a network, it performs the functions typically performed by an edge port
of an edge switch. One of the typical functions that an edge port
performs (if it is so configured) is to add a default VLAN tag to packets
that it processes. A default VLAN tag may be added to a packet when it
leaves the edge port, and/or when it enters the edge port. This
functionality is desirable in some contexts. In one embodiment, the
communication manager 802 of an MPLB 702 provides this functionality.
Specifically, the communication manager 802 may be configured to add a
default VLAN tag to packets ingressing the network 300 and/or add a
default VLAN tag to packets egressing the network 300. In one embodiment,
if the communication manager 802 adds a default VLAN tag to a packet
ingressing the network 300, it does so before invoking the path selection
manager 804. That way, the path selection manager 804 will add the
path-selecting VLAN information (which is different from the default VLAN
tag) to the packet after the default VLAN tag is already present. Doing
things in this order ensures that it will be the path-selecting VLAN
information and not the default VLAN tag that will be used by the
internal switches to switch the packet through the network 300.
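
The ordering constraint described above may be illustrated with the following sketch. It models the VLAN tags on a packet as a list ordered from outermost to innermost and assumes that the internal switches act on the outermost tag. The function name and the list representation are hypothetical.

```python
from typing import List, Optional


def tags_on_ingress(default_vlan: Optional[int], path_vlan: int) -> List[int]:
    """Return the packet's VLAN tags, outermost first, after ingress processing."""
    tags: List[int] = []
    if default_vlan is not None:
        # Step 1: the communication manager adds the default VLAN tag (if configured).
        tags.insert(0, default_vlan)
    # Step 2: the path selection manager adds the path-selecting VLAN
    # information afterwards, so it ends up outermost.
    tags.insert(0, path_vlan)
    return tags
```

Because the path-selecting tag is added last, it occupies the first position, and the default VLAN tag sits beneath it until the path-selecting tag is removed on egress.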

[0101]Adding a default VLAN tag to packets is just one of the functions
that can be performed by an edge port of an edge switch. For purposes of
the present invention, the communication manager 802 may be endowed with
capability to perform any and all functions performed by an edge port of
an edge switch.

Further Alternative Embodiment

[0102]Thus far, the MPLB 702 has been described as being a separate,
standalone component. It should be noted, however, that if so desired,
the functionality of the MPLB 702 may be incorporated into one or more of
the edge ports of a switch. Also, each MPLB 702 has been shown as being
coupled to only one edge port. It should be noted, however, that if so
desired, an MPLB 702 may be coupled to multiple edge ports and even to
multiple edge switches. These and other embodiments are within the scope
of the present invention.

Load Balancing Overview

[0103]Thus far, the load balancing function performed by the path
selection manager 504 (FIG. 5) of the network interface 302 and the path
selection manager 804 (FIG. 8) of the MPLB 702 has been described at a
relatively high and general level. The following sections will describe a
specific methodology for implementing load balancing in accordance with
one embodiment of the present invention. In the following discussion,
reference will be made to the network 300 shown in FIG. 7. However, it
should be noted that such reference is made for illustrative purposes
only. The load balancing concepts taught herein are not limited to
implementation within such a network. Rather, they may be applied
generally to any network in which multiple paths are provided between
component pairs. All such applications are within the scope of the
present invention. Also, for the sake of simplicity and example, the
following discussion will refer to the path selection manager 804 of the
MPLB 702 as performing the load balancing operations to be described. It
should be noted that the path selection manager 504 of the network
interface 302 may perform the same operations as well. For purposes of
the present invention, the functionality of the path selection manager
504, 804 may be realized in any desired manner. For example, the
functionality of the path selection manager 504, 804 may be realized
using hardware (e.g. hardware logic components, ASICs, etc.), software
(e.g. having one or more processors execute one or more sets of
instructions), or a combination thereof.

[0104]In one embodiment, the path selection manager 804 performs load
balancing on a "per flow" basis. That is, the path selection manager 804
groups packets into flows, and sends packets onto paths as flows. For
example, the packets associated with a first flow may be sent along a
first path (e.g. a first VLAN), while the packets associated with a
second flow may be sent along a second path (e.g. a second VLAN). By
putting different flows on different paths, the path selection manager
804 can balance the load across the various paths.

[0105]For purposes of the present invention, packets may be associated
with a flow based upon any desired criteria. For example, all packets
having the same destination address may be associated with the same flow.
All packets of the same type may be associated with the same flow. All
packets carrying the same type of traffic may be associated with the same
flow. Further, flow association may be determined based upon information
in the header of the packets (e.g. all packets with the same quintuple
may be associated with the same flow), based upon information in the
payloads of the packets (determined based upon deep packet inspection),
etc. Flow association may also be determined based upon some additional
processing. For example, a hashing algorithm may be applied to some of
the information in a packet (e.g. the information in the header of the
packet). All packets that hash to the same result may be associated with
the same flow. These and other methodologies may be used to group packets
into flows. Basically, as long as a set of packets has some commonality,
they can be grouped together and be associated with the same flow. A flow
may be a long-lived flow having a large number of packets, or it may be a
short-lived flow having a relatively small number of packets.

[0106]In one embodiment, to manage a plurality of packets as a flow, the
path selection manager 804 maintains a flow data structure for each flow.
Each flow data structure contains information specific to its
corresponding flow. The information that may be stored within a flow data
structure will be elaborated upon in later sections. Each flow data
structure is associated with a unique flow label. In one embodiment, the
flow label is a set of information that can be derived from the
information in each packet of a flow. For example, a flow label may be a
destination address that is specified in the packet (in such a case, all
packets going to the same destination address would be part of the same
flow). The flow label may also be a hash value that would be derived by
applying a hashing algorithm to information contained within the packet.
The flow label may be these and other types of information. So long as a
set of information can be derived from the information in a packet, it
can be used as a flow label. The flow label will be used to determine
which packets belong to which flows (put another way, all packets that
cause the same flow label to be derived belong to the same flow). For
example, if the flow label is a hash value, then all packets that hash to
that hash value will belong to the same flow.
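
By way of illustration only, the following Python sketch shows one way a flow label could be derived by hashing header information, and how a flow data structure could be keyed by that label. The quintuple layout, the use of SHA-256 truncated to 32 bits, and the names derive_flow_label, FlowData, and FlowTable are assumptions made for this sketch; they are not prescribed by the embodiments described above.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical quintuple: (source address, destination address,
# protocol, source port, destination port).
Quintuple = Tuple[str, str, int, int, int]


def derive_flow_label(header: Quintuple) -> int:
    """Hash the header fields down to a 32-bit flow label."""
    data = "|".join(str(field) for field in header).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


@dataclass
class FlowData:
    """Per-flow state; only the selected VLAN ID is modeled in this sketch."""
    vlan_id: Optional[int] = None


class FlowTable:
    """Maps each flow label to its flow data structure."""

    def __init__(self) -> None:
        self._flows: Dict[int, FlowData] = {}

    def get_or_create(self, label: int) -> Tuple[FlowData, bool]:
        """Return (flow data, created), creating the entry if absent."""
        created = label not in self._flows
        if created:
            self._flows[label] = FlowData()
        return self._flows[label], created
```

Because every packet carrying the same header fields hashes to the same label, all such packets resolve to the same FlowData entry.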

[0107]In one embodiment, the path selection manager 804 performs load
balancing at the time a new flow is started. Specifically, when a new
flow is started, the path selection manager 804 determines which paths
are available for forwarding packets associated with that flow to a
destination, and then selects one of those paths. The path selection
manager 804 selects the path in such a manner that traffic load on the
various paths is kept relatively balanced. In one embodiment, the path
selection manager 804 selects the path that is currently the least
congested. In another embodiment, the path selection manager 804 selects
the path randomly (if the selection is truly random, it will have the
tendency to spread flows evenly across the various paths), in a round
robin fashion, etc. These and other methods may be used to select the
path for the flow. After the path is selected, the packets associated
with the flow are sent to the destination along the selected path.
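
The selection policies mentioned above (least congested, random, and round robin) may be sketched as follows. This is a minimal illustration in Python; the function names and the representation of congestion as a mapping from VLAN ID to a numeric value are assumptions of the sketch.

```python
import itertools
import random
from typing import Dict, Iterator, List


def select_least_congested(congestion: Dict[int, float]) -> int:
    """Return the VLAN ID whose congestion value is currently lowest."""
    return min(congestion, key=congestion.get)


def select_random(vlans: List[int], rng: random.Random) -> int:
    """Return a VLAN ID chosen uniformly at random."""
    return rng.choice(vlans)


def round_robin(vlans: List[int]) -> Iterator[int]:
    """Return an endless iterator that cycles through the VLAN IDs in order."""
    return itertools.cycle(vlans)
```

For example, calling next() repeatedly on round_robin([1, 2, 3, 4]) assigns successive new flows to VLANs 1, 2, 3, 4, and then 1 again.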

[0108]In one embodiment, the path selection manager 804 performs load
balancing even after a path has been selected for a flow. Specifically,
while the packets associated with a flow are being forwarded along the
selected path, the path selection manager 804 determines how congested
the selected path and the other available paths are. If, for example, the
selected path becomes significantly more congested than one of the other
paths, then the path selection manager 804 may move the flow from the
selected path onto the other path. Thus, the path selection manager 804
has the ability to move a flow in "mid-flow" (by "mid-flow", it is meant
that the flow has not yet terminated and that the packets associated with
the flow are still being forwarded from one component to another). With
this ability to move a flow while it is in mid-flow, the path selection
manager 804 can balance the traffic load across multiple paths in a very
dynamic manner. As traffic conditions on the different paths change, the
path selection manager 804 can move flows from path to path to
dynamically adapt and adjust to the changes in traffic conditions. By
moving flows in this manner, the path selection manager 804 can keep
congestion on the multiple paths relatively balanced and minimized.

Load Balancing Example

[0109]To illustrate how load balancing may be implemented in accordance
with one embodiment of the present invention, reference will now be made
to an example. Suppose that MPLB 702(A) (FIG. 7) receives a stream of
packets from node A, and that the packets have some commonality so that
they can be grouped and associated with the same flow. Suppose further
that all of the packets of the flow are destined for node P so that they
need to pass through MPLB 702(P). Suppose further that the path selection
manager 804 on MPLB 702(A) can use any one of the four VLANs, and hence,
any one of four paths, to forward the packets to node P. In this example,
the four possible paths from MPLB 702(A) to MPLB 702(P) are: (1) MPLB
702(A) to switch a to switch i to switch q to switch o to switch h to
MPLB 702(P) for VLAN 1; (2) MPLB 702(A) to switch a to switch i to switch
r to switch o to switch h to MPLB 702(P) for VLAN 2; (3) MPLB 702(A) to
switch a to switch j to switch s to switch p to switch h to MPLB 702(P)
for VLAN 3; and (4) MPLB 702(A) to switch a to switch j to switch t to
switch p to switch h to MPLB 702(P) for VLAN 4. Given this scenario, load
balancing may be implemented as follows. For the sake of simplicity, the
path selection manager 804 on MPLB 702(A) will hereinafter be referred to
simply as the path selection manager 804, and the path selection manager
on MPLB 702(P) will be referred to as the path selection manager on MPLB
702(P).

[0110]Initially, node A sends the first packet of the flow to MPLB 702(A).
When the communication manager 802 (FIG. 8) on MPLB 702(A) receives this
packet, it forwards the packet to the path selection manager 804. When
the path selection manager 804 receives the packet, it derives a flow
label based upon information in the packet (the flow label may be derived
by processing the packet in any desired manner). After the flow label is
derived, the path selection manager 804 determines whether there is any
existing flow data structure that has that flow label associated
therewith. In the current example, this packet is the first packet of the
flow; thus, there should be no existing flow data structure having that
flow label associated therewith. In such a case, the path selection
manager 804 creates a new flow data structure, and associates the newly
derived flow label with the newly created flow data structure.

[0111]Thereafter, the path selection manager 804 selects one of the VLANs
to use to forward the packet to node P. By selecting one of the VLANs,
the path selection manager 804 is in effect selecting a path through the
network 300 that is to be used to transport the packet to node P. In one
embodiment, the path selection manager 804 selects the path (i.e. the
VLAN) that is currently the least congested. In another embodiment, the
path selection manager 804 selects the path (i.e. the VLAN) randomly, in
a round robin fashion, etc. These and other methods may be used to select
the path. For the sake of example, it will be assumed that the path
provided by VLAN 1 is selected. Once VLAN 1 is selected, the path
selection manager 804 stores the ID of VLAN 1 into the flow data
structure. This serves as an indication that the path provided by VLAN 1
is the path that is to be used to forward packets associated with the
flow. This will cause subsequent packets associated with the flow to be
sent to node P using VLAN 1.

[0112]After VLAN 1 is selected, the path selection manager 804 updates the
packet using the ID of VLAN 1 as the path-selecting VLAN information 408
(in the manner previously described with reference to FIGS. 4B and 4C),
and sends the packet into the network 300. The switches in the network
300 will thereafter forward the packet to node P via the path provided by
VLAN 1 (it is assumed that reverse path learning has already been
performed).

[0113]Suppose now that the path selection manager 804 receives the second
packet of the flow from node A. Again, the path selection manager 804
derives a flow label based upon information in the packet. This flow
label should be the same as that derived for the first packet of the flow
(all packets of the flow should cause the same flow label to be derived).
The path selection manager 804 again determines whether this flow label
has an existing flow data structure associated therewith. This time, the
path selection manager 804 will find that the flow label is associated
with the flow data structure that was previously created. Upon learning
this, the path selection manager 804 accesses the flow data structure,
and extracts therefrom the ID of VLAN 1, which was previously stored in
the flow data structure. The path selection manager 804 then updates the
packet using the ID of VLAN 1 as the path-selecting VLAN information 408,
and sends the packet into the network 300. The switches in the network
300 will thereafter forward the packet to node P via the path provided by
VLAN 1. In this manner, the path selection manager 804 causes the packets
associated with the same flow to be sent along the same path.
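
The per-packet handling described in the preceding paragraphs (select a path for the first packet of a flow, then reuse the stored VLAN ID for later packets) may be sketched as follows. The class name PathSelector, the dictionary representation of a packet, and the "vlan_id" key standing in for the path-selecting VLAN information 408 are hypothetical and are used only for illustration.

```python
from typing import Callable, Dict, Hashable


class PathSelector:
    """Keeps one VLAN ID per flow label and tags packets with it."""

    def __init__(self, select_path: Callable[[], int]) -> None:
        # select_path is any policy that returns a VLAN ID for a new flow.
        self._select_path = select_path
        self._vlan_by_flow: Dict[Hashable, int] = {}

    def tag_packet(self, packet: dict, flow_label: Hashable) -> dict:
        """Return a copy of the packet tagged with the flow's VLAN ID."""
        if flow_label not in self._vlan_by_flow:
            # First packet of the flow: select a path and remember it.
            self._vlan_by_flow[flow_label] = self._select_path()
        tagged = dict(packet)
        tagged["vlan_id"] = self._vlan_by_flow[flow_label]
        return tagged
```

The selection policy is consulted only when no entry exists for the flow label, so later packets of the same flow keep the VLAN ID chosen for the first packet.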

[0114]In the manner described above, the path selection manager 804
forwards (block 1004 of FIG. 10) one or more packets associated with the
flow to node P along the path provided by VLAN 1. As the path selection
manager 804 is forwarding packets associated with the flow along this
path, the path selection manager 804 monitors the traffic conditions on
all of the available paths, including the path provided by VLAN 1 and the
other paths provided by VLANs 2-4. In one embodiment, the path selection
manager 804 monitors the traffic conditions on the various paths by
determining a congestion metric for each of the paths. Specifically, the
path selection manager 804 determines a congestion metric (block 1008) for
the path provided by VLAN 1, and determines a congestion metric (block
1012) for each of the other paths provided by VLANs 2-4. The manner in
which these congestion metrics are determined in accordance with one
embodiment of the present invention will be elaborated upon in a later
section. These congestion metrics basically provide an indication as to
how congested each path currently is.

[0115]As an alternative to determining a separate congestion metric for
each of the other paths provided by VLANs 2-4, the path selection manager
804 may instead determine (block 1012) a single aggregate congestion
metric for those paths. This aggregate congestion metric takes into
account the congestion on each of the paths provided by VLANs 2-4, but
provides a single congestion metric for all of those paths. In a sense,
the aggregate congestion metric reflects an "average" congestion across
the paths provided by VLANs 2-4. The manner in which the aggregate
congestion metric is determined in accordance with one embodiment of the
present invention will be elaborated upon in a later section.

[0116]Based upon the congestion metrics, the path selection manager 804
determines (block 1016) whether the flow should be moved from the path
provided by VLAN 1 to one of the other paths. In the embodiment where the
path selection manager 804 determines a separate congestion metric for
each of the possible paths, the path selection manager 804 makes this
determination by comparing the congestion metric for the path provided by
VLAN 1 with the congestion metric for each of the other paths to derive
an imbalance metric for each pair of paths. For example, the path
selection manager 804 derives an imbalance metric for the path provided
by VLAN 1 and the path provided by VLAN 2, derives an imbalance metric
for the path provided by VLAN 1 and the path provided by VLAN 3, and
derives an imbalance metric for the path provided by VLAN 1 and the path
provided by VLAN 4. These imbalance metrics indicate how much congestion
difference there is between the other paths and the path provided by VLAN
1. In one embodiment, the largest imbalance metric is selected, and a
determination is made as to whether that imbalance metric exceeds a
certain threshold. If so, then it may be worthwhile to move the flow from
the path provided by VLAN 1 to another path (the one with the largest
imbalance metric) provided by one of the other VLANs. For example, if the
imbalance metric shows that one of the other paths is significantly less
congested than the path provided by VLAN 1, then it may be worthwhile to
move the flow onto that other path.
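
The pairwise comparison described above may be sketched as follows. The text does not specify how an imbalance metric is computed; this sketch assumes, purely for illustration, that it is the congestion of the current path minus the congestion of the other path, and that the threshold is expressed in the same units. The function name choose_move_target is hypothetical.

```python
from typing import Dict, Optional


def choose_move_target(
    current_vlan: int, congestion: Dict[int, float], threshold: float
) -> Optional[int]:
    """Return the VLAN to move the flow to, or None to leave it in place."""
    # Assumed imbalance metric: how much more congested the current
    # path is than each other path.
    imbalances = {
        vlan: congestion[current_vlan] - value
        for vlan, value in congestion.items()
        if vlan != current_vlan
    }
    if not imbalances:
        return None
    best = max(imbalances, key=imbalances.get)
    # Move only if the largest imbalance exceeds the threshold.
    return best if imbalances[best] > threshold else None
```

With congestion values of 10, 9, 4, and 7 for VLANs 1 through 4 and a threshold of 5, the largest imbalance is 6 (VLAN 1 versus VLAN 3), so the flow would be moved to VLAN 3.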

[0117]In the embodiment where the path selection manager 804 determines a
single aggregate congestion metric for the other paths provided by VLANs
2-4, the path selection manager 804 may determine (block 1016) whether
the flow should be moved by comparing the congestion metric for the path
provided by VLAN 1 with the aggregate congestion metric to derive an
imbalance metric. A determination is then made as to whether the
imbalance metric exceeds a certain threshold. If so, then it may be
worthwhile to move the flow from the path provided by VLAN 1 to one of
the other paths (an alternate path) provided by VLANs 2-4. In this
embodiment, because the aggregate congestion metric is an aggregate
metric and hence does not provide information as to which of the other
paths is currently least congested, it is not immediately clear which of
the other paths provided by VLANs 2-4 the flow should be moved to. In one
embodiment, the path selection manager 804 selects one of the other paths
provided by VLANs 2-4 as the alternate path. For purposes of the present
invention, the path selection manager 804 may select the alternate path
in any desired manner (e.g. randomly, in a round robin fashion, etc.).
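
The aggregate embodiment may be sketched in the same way. Here the aggregate congestion metric is assumed to be the arithmetic mean of the other paths' congestion values, the imbalance metric is assumed to be a simple difference, and the alternate path is chosen randomly; all three choices, and the name pick_alternate, are assumptions of the sketch.

```python
import random
from statistics import mean
from typing import Dict, Optional


def pick_alternate(
    current_vlan: int,
    congestion: Dict[int, float],
    threshold: float,
    rng: random.Random,
) -> Optional[int]:
    """Return an alternate VLAN if the current path is out of balance."""
    others = [vlan for vlan in congestion if vlan != current_vlan]
    if not others:
        return None
    # Assumed aggregate metric: mean congestion of the other paths.
    aggregate = mean(congestion[vlan] for vlan in others)
    imbalance = congestion[current_vlan] - aggregate
    if imbalance <= threshold:
        return None
    # The aggregate does not identify the best path, so pick one at random.
    return rng.choice(others)
```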

[0118]If the path selection manager 804 determines that the flow should
not be moved to another path, then the path selection manager 804 does
not make any changes to the flow data structure. In such a case, the path
selection manager 804 loops back to block 1004, and continues to forward
packets associated with the flow to node P using the path provided by
VLAN 1. The path selection manager 804 also continues to determine
(blocks 1008, 1012) congestion metrics for the path provided by VLAN 1
and the other paths provided by the other VLANs. In addition, the path
selection manager 804 continues to determine (block 1016), based upon the
congestion metrics for the various paths, whether to move the flow from
the path provided by VLAN 1 to another path. Thus, the path selection
manager 804 continuously looks for the need and the opportunity to move
the flow onto another path to balance the load across the multiple paths.

[0119]If, at some point, the path selection manager 804 does determine
that the flow should be moved onto another path, then the path selection
manager 804, in one embodiment, updates the flow data structure. For
example, suppose that the path selection manager 804 determines that the
flow should be moved to the path provided by VLAN 4. In such a case, the
path selection manager 804 would update the flow data structure by
overwriting the ID for VLAN 1 with the ID for VLAN 4. This will cause
subsequent packets associated with the flow to be forwarded to node P
along the path provided by VLAN 4 instead of the path provided by VLAN 1.
In one embodiment, the path selection manager 804 moves the flow from one
path to another in such a manner that the packets of the flow will not
arrive at the destination out of order; thus, the receiving component
will not need to reorder the packets. The manner in which the path
selection manager 804 moves a flow from one path to another will be
elaborated upon in a later section.

[0120]Even after the flow is moved to another path, the path selection
manager 804 still remains vigilant with regard to load balancing.
Specifically, while the path selection manager 804 is forwarding packets
(block 1004) associated with the flow to node P along the new, alternate
path, the path selection manager 804 continues to determine (blocks 1008,
1012) congestion metrics for the new path and the other paths provided by
the other VLANs, and continues to determine (block 1016), based upon the
congestion metrics of the various paths, whether to move the flow from
the new path to another path provided by another VLAN. By constantly
looking for load balancing opportunities in this manner, the path
selection manager 804 is able to adapt and adjust dynamically to ever
changing traffic conditions on the multiple paths.

Determining Congestion Metrics

[0121]As noted above, part of the load balancing process involves
determining (blocks 1008, 1012) congestion metrics for the paths between
a pair of components. To illustrate how a congestion metric can be
determined in accordance with one embodiment of the present invention,
reference will be made to an example. In the following example, which is
a continuation of the above example, the pair of components will be MPLB
702(A) and MPLB 702(P). The example will show how the path selection
manager 804 (the one on MPLB 702(A)) can determine a congestion metric
for the path provided by VLAN 1 between MPLB 702(A) and MPLB 702(P) (this
path would be from MPLB 702(A) to switch a to switch i to switch q to
switch o to switch h to MPLB 702(P)). While this example will discuss
only the path provided by VLAN 1, it should be noted that the same
methodology may be used to determine the congestion metrics for the other
paths between MPLB 702(A) and MPLB 702(P) provided by VLANs 2, 3, and 4.

[0122]In one embodiment, determining a congestion metric for a path
involves determining one or more latency values for that path. The
latency values that are determined for a path may be one-way latency
values (indicating how much time is needed to transfer a set of
information from a forwarding component to a receiving component along
that path) and/or roundtrip latency values (indicating how much time is
needed to transfer a set of information from a forwarding component to a
receiving component along that path and back to the forwarding
component). Determining a congestion metric for a path may also involve
determining one or more latency variation values for that path. These
latency values and latency variation value(s), viewed alone or together,
provide a congestion metric for a path, which in turn, provides an
indication as to how congested that path currently is.

[0123]To determine the congestion metric for the path provided by VLAN 1,
the path selection manager 804 (the one on MPLB 702(A)) may operate as
follows. While it is forwarding packets associated with the flow from
MPLB 702(A) to node P along the path provided by VLAN 1 (in the manner
described above), the path selection manager 804 at some point composes a
latency request packet. In the following example, the latency request
packet will be described as being a separate packet from the other
packets in the flow. However, if so desired, the latency request packet
may be piggybacked onto one of the data packets of the flow.

[0124]Within the latency request packet, the path selection manager 804
includes several sets of information. These sets of information may
include:

[0125](1) a source address (in the current example, the source address
may be the address of MPLB 702(A) or the address of node A, which is the
source of the flow packets);

[0126](2) a destination address (in the current example, the destination
address may be the address of MPLB 702(P) or the address of node P, which
is the destination of the flow packets);

[0127](3) a set of information indicating that the packet is a latency
request packet;

[0128](4) a current timestamp (referred to as the initial timestamp)
provided by a clock on MPLB 702(A); and

[0129](5) a set of VLAN selection information (e.g. the ID of VLAN 1)
indicating that VLAN 1 is to be used to transport the packet to the
destination address.

[0130]If so desired, other sets of information may also be included in the
latency request packet. These sets of information may include, for
example, the flow label of the flow (this allows the latency request
packet to be associated directly with the flow), a set of information
indicating a priority level for the flow, and a set of information
indicating which VLAN should be used to transport a latency response
packet back to the path selection manager 804.
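
For illustration, the contents of a latency request packet may be modeled as in the following Python sketch. The field names, the string marker "LATENCY_REQUEST", the use of nanoseconds, and the function compose_latency_request are assumptions of the sketch; no particular packet format is implied.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class LatencyRequest:
    src_addr: str            # e.g. address of the sending MPLB or node
    dst_addr: str            # e.g. address of the receiving MPLB or node
    vlan_id: int             # VLAN that carries the request
    initial_timestamp: int   # from the sender's clock (assumed nanoseconds)
    packet_type: str = "LATENCY_REQUEST"
    # Optional sets of information:
    flow_label: Optional[int] = None
    priority: Optional[int] = None
    response_vlan_id: Optional[int] = None


def compose_latency_request(
    src_addr: str,
    dst_addr: str,
    vlan_id: int,
    clock: Callable[[], int] = time.time_ns,
    **optional,
) -> LatencyRequest:
    """Build a request stamped with the sender's current clock value."""
    return LatencyRequest(src_addr, dst_addr, vlan_id, clock(), **optional)
```

Passing the clock as a parameter keeps the sketch independent of any particular time source.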

[0131]After composing the latency request packet, the path selection
manager 804 forwards the packet to the communication manager 802 (FIG. 8)
of MPLB 702(A), which in turn, sends the packet into the network 300.
Because the latency request packet includes VLAN selection information
specifying that VLAN 1 is to be used to transport the packet to the
destination, the switches in the network 300 will switch the packet to
the destination along the path provided by VLAN 1. Thus, the latency
request packet will travel from MPLB 702(A) to switch a to switch i to
switch q to switch o to switch h to finally arrive at MPLB 702(P) (it is
assumed that reverse path learning has already taken place between the
two MPLBs).

[0132]When the communication manager 802 of MPLB 702(P) receives the
packet, it determines, based upon the information in the packet, that the
packet is a latency request packet. Thus, it does not forward the packet
on to node P. Rather, the communication manager 802 passes the packet to
the path selection manager of MPLB 702(P) for further processing.

[0133]Seeing that the packet is a latency request packet, the path
selection manager of MPLB 702(P) knows that it should compose a latency
response packet in response to the latency request packet. In one
embodiment, to do so, the path selection manager of MPLB 702(P) starts
with the latency request packet, and makes changes and additions to it to
transform it into a latency response packet. Specifically, in one
embodiment, the path selection manager of MPLB 702(P) performs the
following operations:

[0134](1) take the source address specified in the latency request packet
and make it the destination address of the latency response packet (this
will cause the latency response packet to be sent back to MPLB 702(A));

[0135](2) insert a source address into the latency response packet; in the
current example, the inserted source address may be the address of MPLB
702(P) (since MPLB 702(P) is the source of the latency response packet);
alternatively, the inserted source address may be an address that is
recognized by all of the switches in the network 300 as being an address
on which reverse path learning should not be performed;

[0136](3) change the information in the latency request packet to indicate
that the packet is now a latency response packet rather than a latency
request packet;

[0137](4) store a new set of VLAN selection information in the latency
response packet to indicate which VLAN is to be used to transport the
latency response packet back to MPLB 702(A); this VLAN may be the same
VLAN (VLAN 1) that was used to transport the latency request packet to
MPLB 702(P), or it may be a different VLAN; thus, the latency request and
latency response packets may travel along different paths; and

[0138](5) add one or more timestamps to the latency response packet; in
one embodiment, the path selection manager of MPLB 702(P) adds two
timestamps: (i) an arrived timestamp, which approximately indicates a
time at which the latency request packet was received by the path
selection manager of MPLB 702(P); and (ii) a return timestamp, which
approximately indicates a time at which the path selection manager of
MPLB 702(P) sends the latency response packet back to MPLB 702(A); in
another embodiment, the path selection manager of MPLB 702(P) may add
just one timestamp to the latency response packet; in such an embodiment,
the timestamp may be either the arrived timestamp or the return
timestamp; in either case, the timestamp(s) is/are provided by a clock on
the MPLB 702(P).

[0139]In addition to the above information, the latency response packet
may also include any information that was included in the latency request
packet. This may include, for example, the initial timestamp provided by
the clock on MPLB 702(A), the flow label of the flow, and the set of
information indicating the priority level for the flow. This and other
information may be included in the latency response packet.
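
The transformation of a latency request packet into a latency response packet may be sketched as follows. The LatencyPacket type, its field names, and the function to_latency_response are hypothetical; the sketch assumes the two-timestamp embodiment, although either timestamp may be passed as None to model the one-timestamp embodiment.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LatencyPacket:
    kind: str                 # "request" or "response"
    src_addr: str
    dst_addr: str
    vlan_id: int
    initial_timestamp: int    # set by the requester's clock
    arrived_timestamp: Optional[int] = None
    return_timestamp: Optional[int] = None


def to_latency_response(
    request: LatencyPacket,
    responder_addr: str,
    response_vlan_id: int,
    arrived_ts: Optional[int],
    return_ts: Optional[int],
) -> LatencyPacket:
    """Derive the response from the request, keeping the initial timestamp."""
    return replace(
        request,
        kind="response",
        dst_addr=request.src_addr,   # operation (1): send back to requester
        src_addr=responder_addr,     # operation (2): new source address
        vlan_id=response_vlan_id,    # operation (4): may differ from request
        arrived_timestamp=arrived_ts,
        return_timestamp=return_ts,
    )
```

Because the response is built from the request, fields such as the initial timestamp are carried over unchanged.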

[0140]After composing the latency response packet, the path selection
manager of MPLB 702(P) forwards the packet to the communication manager
802 of MPLB 702(P), which in turn, sends the packet into the network 300.
Eventually, the latency response packet is transported to MPLB 702(A)
(again, it is assumed that reverse path learning has already taken place
between the two MPLBs).

[0141]When the communication manager 802 of MPLB 702(A) receives the
latency response packet, it determines, based upon the information in the
packet, that it is a latency response packet. Thus, it does not pass the
packet on to node A. Instead, it passes the packet on to the path
selection manager 804 (the one on MPLB 702(A)) for further processing.

[0142]Upon receiving the latency response packet, the path selection
manager 804 obtains a current timestamp (referred to as the received
timestamp) from the clock on MPLB 702(A). Then, based upon the received
timestamp and the various timestamps in the latency response packet, the
path selection manager 804 determines the latencies for the path provided
by VLAN 1.

[0143]In the embodiment where the path selection manager on MPLB 702(P)
provides only one timestamp (the arrived timestamp or the return
timestamp), the latencies can be determined as shown in FIG. 11A. Namely,
the one-way latency is derived by subtracting the initial timestamp from
the arrived or return timestamp (whichever one is provided by the path
selection manager on MPLB 702(P)). The roundtrip latency is derived by
subtracting the initial timestamp from the received timestamp. In the
embodiment where the path selection manager on MPLB 702(P) provides both
an arrived timestamp and a return timestamp, the latencies can be derived
as shown in FIG. 11B. Namely, the one-way latency is derived by
subtracting the initial timestamp from the arrived timestamp. The
roundtrip latency is derived by subtracting the initial timestamp from
the arrived timestamp, and adding that difference to the difference
derived by subtracting the return timestamp from the received timestamp.
However it is derived, the one-way latency value indicates how much time
is currently needed to transport a set of information from MPLB 702(A) to
MPLB 702(P) along the path provided by VLAN 1. The roundtrip latency
value indicates how much time is currently needed to transport a set of
information from MPLB 702(A) to MPLB 702(P) along the path provided by
VLAN 1, and back to MPLB 702(A) (along any path). In this manner, the
path selection manager 804 is able to measure how much time is needed to
transport information on the path provided by VLAN 1.
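
The arithmetic described above may be summarized in the following Python sketch. The function name derive_latencies and the integer timestamp units are assumptions; the sketch further assumes that the clocks on the two MPLBs are comparable, since the one-way latency is computed from timestamps taken on different clocks.

```python
from typing import Optional, Tuple


def derive_latencies(
    initial: int,
    received: int,
    arrived: Optional[int] = None,
    returned: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (one-way latency, roundtrip latency) for one request/response."""
    if arrived is not None and returned is not None:
        # Two-timestamp case: time spent inside the responder is excluded.
        one_way = arrived - initial
        roundtrip = (arrived - initial) + (received - returned)
    else:
        # One-timestamp case: use whichever remote timestamp was provided.
        remote = arrived if arrived is not None else returned
        if remote is None:
            raise ValueError("response carries no remote timestamp")
        one_way = remote - initial
        roundtrip = received - initial
    return one_way, roundtrip
```

For example, with an initial timestamp of 100, an arrived timestamp of 220, a return timestamp of 250, and a received timestamp of 400, the one-way latency is 120 and the roundtrip latency is 120 + 150 = 270. If only the arrived timestamp were provided, the roundtrip latency would instead be 400 - 100 = 300.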

[0144]In one embodiment, the path selection manager 804 repeatedly sends
latency request packets to MPLB 702(P) along the path provided by VLAN 1,
and repeatedly receives latency response packets in response to these
latency request packets. Thus, the path selection manager 804, over the
life of the flow, derives a plurality of one-way latency values and a
plurality of roundtrip latency values for the path provided by VLAN 1.
Based on this plurality of latency values, the path selection manager is
able to determine a plurality of values for the path provided by VLAN 1.
In one embodiment, these values include an "average" one-way latency
value (AOWL), an "average" roundtrip latency value (ARTL), and a latency
variation value (LV). In one embodiment, these values may be derived as
follows.

Deriving "Average" One-Way Latency

[0145]In one embodiment, the AOWL may be derived using the following
equation:

AOWL=(1-ρL)*previous AOWL+ρL*current one-way latency

[0146]In this equation, ρL is a constant that is used to weight
the various components of the equation to give rise to a smooth
"average". In one embodiment, ρL is a value that is between zero
and one. For ease of implementation, ρL may be set to a value
that is a power of 1/2 (e.g. 1/4, 1/8, 1/16, 1/32, etc.). The optimal
value for ρL may differ from network to network and from
implementation to implementation. Thus, the optimal value for ρL
in any particular implementation may be determined experimentally.
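
One plausible reason that a weight such as 1/4 or 1/8 eases implementation is that multiplication by the weight reduces to a bit shift when latencies are kept as integers. The following Python sketch illustrates this under that assumption; the function name update_average_shift is hypothetical, and the result is rounded down relative to the exact equation.

```python
def update_average_shift(previous: int, current: int, shift: int) -> int:
    """Integer update with weight 1/2**shift.

    Algebraically, (1 - w)*previous + w*current equals
    previous + w*(current - previous); with w = 1/2**shift the
    multiplication becomes a right shift. Python's >> rounds toward
    negative infinity, so the result is a floor of the exact value.
    """
    return previous + ((current - previous) >> shift)
```

With a shift of 2 (a weight of 1/4), a previous average of 20 and a current latency of 100 yield 20 + (80 >> 2) = 40.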

[0147]To illustrate how this equation may be used, reference will be made
to an example, which is a continuation of the above example. Suppose
that, for the flow discussed in the above example, the path selection
manager 804 has not yet sent any latency request packets. Suppose now
that the path selection manager 804 sends a first latency request packet,
and receives a first latency response packet. Based upon the received
timestamp and the timestamp(s) in the first latency response packet, the
path selection manager 804 derives a current one-way latency value for
the path provided by VLAN 1 in the manner described in FIGS. 11A and 11B.
Once the current one-way latency value is derived, it is multiplied by
ρL. This result is added to (1-ρL)*previous AOWL.
However, since there is no previous AOWL (this is the first time the AOWL
is being derived for this flow), the value of (1-ρL)*previous
AOWL is zero. Thus, the newly computed AOWL is equal to ρL*the
current one-way latency value. In one embodiment, this new AOWL value is
stored in the flow data structure.

[0148]Suppose now that the path selection manager 804 sends a second
latency request packet, and receives a second latency response packet.
Based upon the received timestamp and the timestamp(s) in the second
latency response packet, the path selection manager 804 derives another
current one-way latency value for the path provided by VLAN 1 in the
manner described in FIGS. 11A and 11B. Once the current one-way latency
value is derived, it is multiplied by ρL. This result is added
to (1-ρL)*previous AOWL. In this case, the previous AOWL is the
AOWL value that was previously stored in the flow data structure. After
the value for (1-ρL)*previous AOWL is computed, it is added to
the value of ρL*current one-way latency to derive the new AOWL.
In one embodiment, this new AOWL value is stored in the flow data
structure, overwriting the AOWL value that was previously stored therein.

[0149]Continuing the example, suppose that the path selection manager 804
sends a third latency request packet, and receives a third latency
response packet. Based upon the received timestamp and the timestamp(s)
in the third latency response packet, the path selection manager 804
derives another current one-way latency value for the path provided by
VLAN 1 in the manner described in FIGS. 11A and 11B. Once the current
one-way latency value is derived, it is multiplied by ρL. This
result is added to (1-ρL)*previous AOWL. In this case, the
previous AOWL is the AOWL value that was previously stored in the flow
data structure. After the value for (1-ρL)*previous AOWL is
computed, it is added to the value of ρL*current one-way latency
to derive the new AOWL. In one embodiment, this new AOWL value is stored
in the flow data structure, overwriting the AOWL value that was
previously stored therein.

[0150]In one embodiment, the process described above is repeated for each
set of latency request and latency response packets. As a result, the
path selection manager 804 constantly updates the AOWL. Notice that the
AOWL is not a true average of all of the one-way latency values (that is,
it is not derived by summing up an n number of one-way latency values and
then dividing the sum by n). Rather, it is more of a rolling, weighted
average. Deriving the AOWL in this manner is simple and requires little
storage; thus, from an implementation standpoint, it is advantageous.
However, if so desired, the AOWL may be derived in any other desired
manner. For purposes of the present invention, any desired methodology
may be used to derive the AOWL.

Deriving "Average" Roundtrip Latency

[0151]In one embodiment, the ARTL may be derived using the following
equation:

ARTL=(1-ρR)*previous ARTL+ρR*current roundtrip latency.

[0152]In this equation, ρR is a constant that is used to weight
the various components of the equation to give rise to a smooth
"average". In one embodiment, ρR is a value that is between zero
and one. For ease of implementation, ρR may be set to a value
that is a power of 1/2 (e.g. 1/4, 1/8, 1/16, 1/32, etc.). The optimal
value for ρR may differ from network to network and from
implementation to implementation. Thus, the optimal value for ρR
in any particular implementation may be determined experimentally.
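
By way of a non-limiting illustration, the following sketch suggests why a weight such as 1/8 (that is, 1/2^k) may ease implementation: the ARTL equation can be rearranged as previous ARTL+ρR*(current roundtrip latency-previous ARTL), so the multiplication reduces to a right shift. The function name `update_artl_shift`, the parameter `shift`, and the use of integer latency units (e.g. microseconds) are assumptions made for this example only.

```python
def update_artl_shift(previous_artl: int, current_rtt: int, shift: int) -> int:
    """Integer form of ARTL = (1 - rho)*previous + rho*current, rho = 1 / 2**shift.

    Rearranged as previous + (current - previous) * rho so that the
    multiply becomes an arithmetic right shift. Latency units are assumed
    to be integers (e.g. microseconds); the shift rounds toward -infinity.
    """
    return previous_artl + ((current_rtt - previous_artl) >> shift)


# rho = 1/8: (7/8)*800 + (1/8)*1600 = 900
new_artl = update_artl_shift(previous_artl=800, current_rtt=1600, shift=3)
```

Because the shift discards fractional bits, the integer result may differ slightly from the floating-point form of the equation when the difference is not a multiple of 2**shift.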

[0153]This equation may be used in a manner similar to that described
above in connection with the equation for AOWL. Specifically, suppose
that, for the flow discussed in the above example, the path selection
manager 804 sends a first latency request packet, and receives a first
latency response packet (these may be the same packets as those described
above in connection with deriving the AOWL). Based upon the received
timestamp and the timestamp(s) in the first latency response packet, the
path selection manager 804 derives a current roundtrip latency value in
the manner described in FIGS. 11A and 11B. Once the current roundtrip
latency value is derived, it is multiplied by ρR. This result is
added to (1-ρR)*previous ARTL. However, since there is no
previous ARTL (this is the first time the ARTL is being derived for this
flow), the value of (1-ρR)*previous ARTL is zero. Thus, the
newly computed ARTL is equal to ρR*the current roundtrip
latency value. In one embodiment, this new ARTL value is stored in the
flow data structure.

[0154]Suppose now that the path selection manager 804 sends a second
latency request packet, and receives a second latency response packet
(these may be the same packets as those described above in connection
with deriving the AOWL). Based upon the received timestamp and the
timestamp(s) in the second latency response packet, the path selection
manager 804 derives another current roundtrip latency value in the manner
described in FIGS. 11A and 11B. Once the current roundtrip latency value
is derived, it is multiplied by ρR. This result is added to
(1-ρR)*previous ARTL. In this case, the previous ARTL is the
ARTL value that was previously stored in the flow data structure. After
the value for (1-ρR)*previous ARTL is computed, it is added to
the value of ρR*current roundtrip latency to derive the new
ARTL. In one embodiment, this new ARTL value is stored in the flow data
structure, overwriting the ARTL value that was previously stored therein.

[0155]Continuing the example, suppose that the path selection manager 804
sends a third latency request packet, and receives a third latency
response packet (these may be the same packets as those described above
in connection with deriving the AOWL). Based upon the received timestamp
and the timestamp(s) in the third latency response packet, the path
selection manager 804 derives another current roundtrip latency value in
the manner described in FIGS. 11A and 11B. Once the current roundtrip
latency value is derived, it is multiplied by ρR. This result is
added to (1-ρR)*previous ARTL. In this case, the previous ARTL
is the ARTL value that was previously stored in the flow data structure.
After the value for (1-ρR)*previous ARTL is computed, it is
added to the value of ρR*current roundtrip latency to derive the
new ARTL. In one embodiment, this new ARTL value is stored in the flow
data structure, overwriting the ARTL value that was previously stored
therein.

[0156]In one embodiment, the process described above is repeated for each
set of latency request and latency response packets. As a result, the
path selection manager 804 constantly updates the ARTL. Notice that the
ARTL is not a true average of all of the roundtrip latency values (that
is, it is not derived by summing up an n number of roundtrip latency
values and then dividing the sum by n). Rather, it is more of a rolling,
weighted average. Deriving the ARTL in this manner is simple and requires
little storage; thus, from an implementation standpoint, it is
advantageous. However, if so desired, the ARTL may be derived in any
other desired manner. For purposes of the present invention, any desired
methodology may be used to derive the ARTL.
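
By way of illustration only, the three-packet walkthrough above may be sketched as follows. The function name `update_average`, the use of `None` to represent "no previous value stored in the flow data structure", and the sample latency values are assumptions made for this example; the same function applies to the AOWL when one-way latency values and ρL are supplied instead.

```python
from typing import Optional


def update_average(previous: Optional[float], current: float, rho: float) -> float:
    """Return (1 - rho)*previous + rho*current.

    `previous` is None when no value has been stored for the flow yet,
    in which case the (1 - rho)*previous term is zero, as in the text.
    """
    prev = 0.0 if previous is None else previous
    return (1.0 - rho) * prev + rho * current


artl = None  # nothing stored in the flow data structure yet
for rtt in (400.0, 800.0, 400.0):  # hypothetical roundtrip latencies
    artl = update_average(artl, rtt, rho=0.25)
# After the three updates: 100.0, then 275.0, then 306.25
```

Only the most recent value is retained between updates, which is the storage advantage noted above.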

Deriving Latency Variation

[0157]In one embodiment, the LV (latency variation) may be derived using
the following equation:

LV=(1-ρv)*previous LV+ρv*|current latency value-previous latency value|.

[0158]This equation may be used to derive the latency variation for both
one-way latency values and roundtrip latency values. If the equation is
used to derive the latency variation for one-way latency values, then the
"current latency value" in the equation would be the current one-way
latency value, and the "previous latency value" would be the previous
one-way latency value. If the equation is used to derive the latency
variation for roundtrip latency values, then the "current latency value"
in the equation would be the current roundtrip latency value, and the
"previous latency value" would be the previous roundtrip latency value.

[0159]In this equation, ρv is a constant that is used to weight
the various components of the equation. In one embodiment, ρv is
a value that is between zero and one. For ease of implementation,
ρv may be set to a value that is a power of 1/2 (e.g. 1/4, 1/8,
1/16, 1/32, etc.). The optimal value for ρv may differ from
network to network and from implementation to implementation. Thus, the
optimal value for ρv in any particular implementation may be
determined experimentally.

[0160]To illustrate how this equation may be used, reference will be made
to an example, which is a continuation of the above example. In the
following example, the equation is used to determine the latency
variation for the one-way latency values. It should be noted that the
latency variation for roundtrip latency values may be derived in a
similar way. Suppose that, for the flow discussed in the above example,
the path selection manager 804 sends a first latency request packet, and
receives a first latency response packet (these may be the same packets
as those described above in connection with deriving the AOWL). Based
upon the received timestamp and the timestamp(s) in the first latency
response packet, the path selection manager 804 derives a current one-way
latency value for the path provided by VLAN 1 in the manner described in
FIGS. 11A and 11B. Once the current one-way latency value is derived, the
path selection manager 804 obtains from the flow data structure a
previous one-way latency value, and subtracts that previous one-way
latency value from the current one-way latency value. In the current
example, this is the first set of latency request/response packets, so
there is no previous one-way latency value. Thus, the result of this
subtraction is just the current one-way latency value. Thereafter, the
absolute value of this subtraction result is multiplied by ρv.
This result is added to (1-ρv)*previous LV. However, since there
is no previous LV (this is the first time the LV is being derived for
this flow), the value of (1-ρv)*previous LV is zero. Thus, the
newly computed LV is equal to ρv*the current one-way latency
value. In one embodiment, this new LV value and the current one-way
latency value are stored in the flow data structure.

[0161]Suppose now that the path selection manager 804 sends a second
latency request packet, and receives a second latency response packet
(these may be the same packets as those described above in connection
with deriving the AOWL). Based upon the received timestamp and the
timestamp(s) in the second latency response packet, the path selection
manager 804 derives another current one-way latency value for the path
provided by VLAN 1 in the manner described in FIGS. 11A and 11B. After
the current one-way latency value is derived, the path selection manager
804 obtains from the flow data structure the one-way latency value that
was previously stored therein. This previous one-way latency value is
subtracted from the current one-way latency value, and the absolute value
of the result is multiplied by ρv. This result is added to
(1-ρv)*previous LV. In this case, the previous LV is the LV
value that was previously stored in the flow data structure. After the
value for (1-ρv)*previous LV is computed, it is added to the
value of ρv*|current latency value-previous latency value| to
derive the new LV. In one embodiment, this new LV value is stored in the
flow data structure, overwriting the LV value that was previously stored
therein, and the current one-way latency value is stored in the flow data
structure, overwriting the one-way latency value that was previously
stored therein.

[0162]Continuing the example, suppose that the path selection manager 804
sends a third latency request packet, and receives a third latency
response packet (these may be the same packets as those described above
in connection with deriving the AOWL). Based upon the received timestamp
and the timestamp(s) in the third latency response packet, the path
selection manager 804 derives another current one-way latency value for
the path provided by VLAN 1 in the manner described in FIGS. 11A and 11B.
After the current one-way latency value is derived, the path selection
manager 804 obtains from the flow data structure the one-way latency
value that was previously stored therein. This previous one-way latency
value is subtracted from the current one-way latency value, and the
absolute value of the result is multiplied by ρv. This result
is added to (1-ρv)*previous LV. In this case, the previous LV is
the LV value that was previously stored in the flow data structure. After
the value for (1-ρv)*previous LV is computed, it is added to the
value of ρv*|current latency value-previous latency value| to
derive the new LV. In one embodiment, this new LV value is stored in the
flow data structure, overwriting the LV value that was previously stored
therein, and the current one-way latency value is stored in the flow data
structure, overwriting the one-way latency value that was previously
stored therein. In one embodiment, the process described above is
repeated for each set of latency request and latency response packets. As
a result, the path selection manager 804 constantly updates the LV value.
In the manner described, the path selection manager 804 can compute and
maintain a latency variation value for the one-way latency values and/or
the roundtrip latency values.

[0163]The above example provides just one way of computing the latency
variation value(s). Many other methodologies may be used. For purposes of
the present invention, the latency variation value(s) may be computed
using any desired methodology.
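
By way of illustration only, the latency variation update described above may be sketched as follows. The class name `LatencyVariationState`, the function name `update_lv`, and the sample values are assumptions made for this example. Consistent with the first-packet case described above, a missing previous latency value and a missing previous LV are both treated as zero.

```python
from dataclasses import dataclass


@dataclass
class LatencyVariationState:
    """Per-flow state assumed to live in the flow data structure."""
    previous_latency: float = 0.0  # no previous sample yet: treated as zero
    lv: float = 0.0                # no previous LV yet: treated as zero


def update_lv(state: LatencyVariationState, current_latency: float, rho_v: float) -> float:
    """Apply LV = (1 - rho_v)*previous LV + rho_v*|current - previous latency|."""
    delta = abs(current_latency - state.previous_latency)
    state.lv = (1.0 - rho_v) * state.lv + rho_v * delta
    # Overwrite the stored latency so the next update uses it as "previous".
    state.previous_latency = current_latency
    return state.lv


state = LatencyVariationState()
history = [update_lv(state, sample, rho_v=0.25) for sample in (400.0, 480.0, 440.0)]
# history is [100.0, 95.0, 81.25] for these hypothetical one-way latencies
```

The same function may be applied to roundtrip latency values by keeping a second, separate state object.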

Congestion Metric

[0164]Taken alone or together, the "average" latency value(s) (AOWL and/or
ARTL) and the latency variation value(s) provide a congestion metric for
the path provided by VLAN 1. This congestion metric in turn provides an
indication as to how congested the path provided by VLAN 1 currently is.
In one embodiment, the congestion metric for a path is derived by summing
a multiple of an "average" latency value for that path with a multiple of
a latency variation value for that path. Expressed in equation form, this
would be:

Congestion Metric=A*"average" latency value+B*Latency Variation.

[0165]In this equation, A and B are configurable constants. In one
embodiment, they are both set to 1; however, in some implementations, it
may be desirable to set these constants to other values. The congestion
metric for a path may be determined based upon one-way latency values or
roundtrip latency values. If the congestion metric is determined based
upon one-way latency values, then the equation would be:

Congestion Metric=A*AOWL+B*One-Way Latency Variation.

[0166]If the congestion metric is determined based upon roundtrip latency
values, then the equation would be:

Congestion Metric=A*ARTL+B*Roundtrip Latency Variation.

[0167]Once a congestion metric is determined for a path, it can be
compared with the congestion metric of another path to derive an
imbalance metric. In one embodiment, this imbalance metric may be derived
by subtracting one congestion metric from the other. This imbalance
metric indicates the difference in congestion between the two paths. If
this imbalance metric is large enough (e.g. exceeds a certain threshold),
thereby indicating that another path is much less congested than a
currently selected path, then it may be worthwhile for the path selection
manager 804 to move the flow from the currently selected path to the
other path. In one embodiment, a flow is moved from a currently selected
path to another path if the following expression is true.

A*"average" latency value of current path+B*latency variation of current path
≥C*"average" latency value of other path+D*latency variation of other path.

[0168]In one embodiment, A=1, B=1, C=3, and D=-1. If so desired, other
values may be used for these constants. This expression basically
determines whether the difference in congestion metrics between a current
path and another path is sufficiently large to justify moving a
flow from the current path to the other path. If so, then the move may be
made. As will be made clear in a later section, the movement of a flow
from one path to another may not be immediate. Thus, the above expression
may be true at the time the decision is made to move the flow, but it may
no longer be true at the time the flow is actually moved. For this
reason, in one embodiment, two thresholds are used. The first threshold
is the above expression with A=1, B=1, C=3, and D=-1. The second
threshold is the above expression with A=1, B=1, C=3/2, and D=-1/2. The
first threshold is used to make the initial determination to move the
flow. The second threshold (which is lower than the first threshold) is
used at the time the flow is actually to be moved. If, at the time the
flow is to be moved, the second threshold is not met, then the movement
of the flow is aborted. This may occur, for example, if some other flows
have already been moved from the current path to the other path so that
the current path is no longer significantly more congested than the other
path. If the move is aborted, then the flow simply remains on the current
path. In this manner, the path selection manager 804 can determine
whether to move a flow from a current path to another path.
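
By way of illustration only, the two-threshold determination described above may be sketched as follows. The constants are those given in the text; the names `PathStats`, `Threshold`, `DECIDE`, `CONFIRM`, and `exceeds`, and the sample latency figures, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    average_latency: float    # AOWL or ARTL for the path
    latency_variation: float  # LV for the path


@dataclass(frozen=True)
class Threshold:
    a: float
    b: float
    c: float
    d: float


DECIDE = Threshold(a=1.0, b=1.0, c=3.0, d=-1.0)    # first threshold
CONFIRM = Threshold(a=1.0, b=1.0, c=1.5, d=-0.5)   # second, lower threshold


def exceeds(current: PathStats, other: PathStats, t: Threshold) -> bool:
    """True if A*avg(current) + B*LV(current) >= C*avg(other) + D*LV(other)."""
    lhs = t.a * current.average_latency + t.b * current.latency_variation
    rhs = t.c * other.average_latency + t.d * other.latency_variation
    return lhs >= rhs


current = PathStats(average_latency=900.0, latency_variation=100.0)
other_at_decision = PathStats(average_latency=300.0, latency_variation=50.0)
other_at_move = PathStats(average_latency=800.0, latency_variation=100.0)

decided = exceeds(current, other_at_decision, DECIDE)   # 1000 >= 850
confirmed = exceeds(current, other_at_move, CONFIRM)    # 1000 >= 1150 fails
move_flow = decided and confirmed                       # move is aborted here
```

In this hypothetical sequence the other path has become more congested between the decision and the move, so the second threshold is not met and the flow remains on the current path.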

[0169]The above example shows how the path selection manager 804 can
determine a congestion metric for the path provided by VLAN 1. The same
methodology may be used to determine the congestion metrics for each of
the other paths provided by VLANs 2-4. All that would need to be changed
would be the set of VLAN selection information that is inserted into the
latency request packets. For example, to determine the congestion metric
for the path provided by VLAN 2, the path selection manager 804 would
insert the ID for VLAN 2 into the latency request packets. This will
cause the latency request packets to be transported from MPLB 702(A) to
MPLB 702(P) along the path provided by VLAN 2, thereby enabling the
latency values for that path to be measured.

Aggregate Congestion Metric

[0170]The above discussion addresses the embodiment in which a separate
congestion metric is determined for each of the possible paths. To
describe how an aggregate congestion metric may be determined for a
plurality of paths, reference will be made to the following example,
which is a continuation of the above example. In the following example,
it will be assumed that the path provided by VLAN 1 is the path that is
currently being used to forward packets associated with the flow to node
P, and that the other available paths (the paths for which an aggregate
congestion metric is to be determined) are the paths provided by VLANs
2-4.

[0171]In one embodiment, the aggregate congestion metric for a plurality
of paths may be determined using the same equations as those given above
for Congestion Metric, AOWL, ARTL, and LV. The difference is that the
values that are plugged into the equations are no longer values from just
one of the paths but rather are values from multiple paths.

[0172]To illustrate, the AOWL value for the plurality of paths provided by
VLANs 2-4 may be computed as follows. Initially, the path selection
manager 804 (the one on MPLB 702(A)) sends a latency request packet to
MPLB 702(P) via the path provided by VLAN 2, and receives a latency
response packet from MPLB 702(P). Based upon various timestamps, the path
selection manager 804 determines (in the manner described previously) a
current one-way latency value for the path provided by VLAN 2. This
current one-way latency value is plugged into the equation for AOWL to
derive an updated AOWL (this updated AOWL is stored). Thereafter, the
path selection manager 804 sends another latency request packet to MPLB
702(P), this time via the path provided by VLAN 3, and receives another
latency response packet from MPLB 702(P). Based upon various timestamps,
the path selection manager 804 determines (in the manner described
previously) a current one-way latency value for the path provided by VLAN
3. This current one-way latency value is plugged into the equation for
AOWL, along with the stored updated AOWL, to derive another updated AOWL
(this updated AOWL is stored). Thereafter, the path selection manager 804
sends another latency request packet to MPLB 702(P), this time via the
path provided by VLAN 4, and receives another latency response packet
from MPLB 702(P). Based upon various timestamps, the path selection
manager 804 determines (in the manner described previously) a current
one-way latency value for the path provided by VLAN 4. This current
one-way latency value is plugged into the equation for AOWL, along with
the stored updated AOWL, to derive another updated AOWL (this updated
AOWL is stored). This process may be repeated to take into account
multiple current one-way latency values from each of the paths provided
by VLANs 2-4. As can be seen, this AOWL is computed based upon one-way
latency values from the various paths. Hence, it does not reflect the
"average" one-way latency of any one path but rather the "average"
one-way latency of the multiple paths. Consequently, it is an aggregate
AOWL value for the multiple paths.

[0173]An aggregate ARTL value may be computed in a similar fashion.
Specifically, the path selection manager 804 sends a latency request
packet to MPLB 702(P) via the path provided by VLAN 2, and receives a
latency response packet from MPLB 702(P). Based upon various timestamps,
the path selection manager 804 determines (in the manner described
previously) a current roundtrip latency value for the path provided by
VLAN 2. This current roundtrip latency value is plugged into the equation
for ARTL to derive an updated ARTL (this updated ARTL is stored).
Thereafter, the path selection manager 804 sends another latency request
packet to MPLB 702(P), this time via the path provided by VLAN 3, and
receives another latency response packet from MPLB 702(P). Based upon
various timestamps, the path selection manager 804 determines (in the
manner described previously) a current roundtrip latency value for the
path provided by VLAN 3. This current roundtrip latency value is plugged
into the equation for ARTL, along with the stored updated ARTL, to derive
another updated ARTL (this updated ARTL is stored). Thereafter, the path
selection manager 804 sends another latency request packet to MPLB
702(P), this time via the path provided by VLAN 4, and receives another
latency response packet from MPLB 702(P). Based upon various timestamps,
the path selection manager 804 determines (in the manner described
previously) a current roundtrip latency value for the path provided by
VLAN 4. This current roundtrip latency value is plugged into the equation
for ARTL, along with the stored updated ARTL, to derive another updated
ARTL (this updated ARTL is stored). This process may be repeated to take
into account multiple current roundtrip latency values from each of the
paths provided by VLANs 2-4. As can be seen, this ARTL is computed based
upon roundtrip latency values from the various paths. Hence, it is an
aggregate ARTL that reflects the "average" roundtrip latency values on
those multiple paths.

[0174]An aggregate LV value may be computed in a similar fashion. In the
following example, the LV equation is used to determine the latency
variation for one-way latency values, but it should be noted that the
latency variation for roundtrip latency values may be derived in a
similar fashion. Initially, the path selection manager 804 sends a
latency request packet to MPLB 702(P) via the path provided by VLAN 2,
and receives a latency response packet from MPLB 702(P). Based upon
various timestamps, the path selection manager 804 determines (in the
manner described previously) a current one-way latency value for the path
provided by VLAN 2. This current one-way latency value is plugged into
the equation for LV to derive an updated LV. The current one-way latency
value for the path provided by VLAN 2 is stored as the previous one-way
latency value, and the updated LV is stored as well. Thereafter, the path
selection manager 804 sends another latency request packet to MPLB
702(P), this time via the path provided by VLAN 3, and receives another
latency response packet from MPLB 702(P). Based upon various timestamps,
the path selection manager 804 determines (in the manner described
previously) a current one-way latency value for the path provided by VLAN
3. This current one-way latency value is plugged into the equation for
LV, along with the previous one-way latency value and the stored updated
LV, to derive another updated LV. The current one-way latency value for
the path provided by VLAN 3 is stored as the previous one-way latency
value, and the updated LV is stored as well. Thereafter, the path
selection manager 804 sends another latency request packet to MPLB
702(P), this time via the path provided by VLAN 4, and receives another
latency response packet from MPLB 702(P). Based upon various timestamps,
the path selection manager 804 determines (in the manner described
previously) a current one-way latency value for the path provided by VLAN
4. This current one-way latency value is plugged into the equation for
LV, along with the previous one-way latency value and the stored updated
LV, to derive another updated LV. The current one-way latency value for
the path provided by VLAN 4 is stored as the previous one-way latency
value, and the updated LV is stored as well. This process may be repeated
to take into account multiple current one-way latency values from each of
the paths provided by VLANs 2-4. As can be seen, this LV is computed
based upon one-way latency values from the various paths. Hence, it is an
aggregate LV that reflects the "average" latency variation across the
multiple paths.

[0175]After the aggregate AOWL, ARTL, and LV values are determined, they
may be inserted into the Congestion Metric equation to compute the
aggregate congestion metric for the plurality of paths provided by VLANs
2-4. This aggregate congestion metric reflects an "average" congestion
across the multiple paths. In the manner described, an aggregate
congestion metric may be determined for a plurality of paths.
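
By way of illustration only, the aggregate computation described above may be sketched as follows for one-way latency values. Samples from the several paths are fed, in the order measured, into a single shared AOWL and a single shared LV. The function name `aggregate_congestion_metric`, the (VLAN ID, latency) tuple layout, and the sample values are assumptions made for this example.

```python
from typing import Iterable, Tuple


def aggregate_congestion_metric(
    samples: Iterable[Tuple[int, float]],
    rho_l: float,
    rho_v: float,
    a: float = 1.0,
    b: float = 1.0,
) -> Tuple[float, float, float]:
    """Return (aggregate AOWL, aggregate LV, aggregate congestion metric).

    `samples` yields (vlan_id, one_way_latency) pairs in measurement order.
    The VLAN ID only identifies which path produced the sample; every
    sample updates the same stored AOWL, LV, and previous latency value.
    """
    aowl = 0.0
    lv = 0.0
    previous_latency = 0.0
    for _vlan_id, latency in samples:
        aowl = (1.0 - rho_l) * aowl + rho_l * latency
        lv = (1.0 - rho_v) * lv + rho_v * abs(latency - previous_latency)
        previous_latency = latency
    return aowl, lv, a * aowl + b * lv


# One hypothetical sample from each of VLANs 2, 3, and 4.
aowl, lv, metric = aggregate_congestion_metric(
    [(2, 400.0), (3, 800.0), (4, 600.0)], rho_l=0.5, rho_v=0.5
)
# aowl == 550.0, lv == 250.0, metric == 800.0
```

Note that, because consecutive samples come from different paths, the aggregate LV in this sketch also reflects latency differences between paths, not only variation within a path.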

Determining and Using Clock Offset Value

[0176]In the above discussion, various timestamps are used to determine
the latency values. Some of these timestamps (e.g. the initial timestamp
and the received timestamp) are provided by the clock on MPLB 702(A),
while other timestamps (e.g. the arrived timestamp and the return
timestamp) are provided by the clock on MPLB 702(P). If these two clocks
are synchronized, then the one-way latency values derived from these
timestamps will be accurate. However, if the clocks are not synchronized,
then the one-way latency values will be off by the difference in the
clocks. For example, if the clock on MPLB 702(P) is faster than the clock
on MPLB 702(A) by 300 milliseconds, then the derived one-way latency
values will be 300 milliseconds longer than they should be, which will
make the path appear to be slower than it is. Unfortunately, in most
implementations, it is very difficult if not impossible to keep all of
the clocks on all of the components in a network synchronized. As a
result, clock offsets are almost inevitable.

[0177]In one embodiment, the path selection manager 804 compensates for
clock offset by determining at least an approximation of the clock offset
value between two clocks, and then taking this clock offset value into
account when determining one-way latency values. By doing so, the path
selection manager 804 is able to derive more accurate measurements of
absolute one-way latency values.

[0178]To illustrate how a clock offset value may be determined in
accordance with one embodiment of the present invention, reference will
again be made to the above example. As described above, the path
selection manager 804 (the one on MPLB 702(A)) sends a latency request
packet to MPLB 702(P) along the path provided by VLAN 1. When the path
selection manager 804 receives a latency response packet in response to
this latency request packet, the path selection manager 804 obtains a
received timestamp from the clock on MPLB 702(A). Based upon the received
timestamp and the one or more timestamps in the latency response packet
(e.g. initial timestamp, arrived timestamp, return timestamp), the path
selection manager 804 determines a current roundtrip latency value (this
roundtrip latency value may be derived in the manner described above with
reference to FIGS. 11A and 11B).

[0179]In one embodiment, after the current roundtrip latency value is
derived, the path selection manager 804 determines whether the current
roundtrip latency value is less than a previously established minimum
roundtrip latency value. The previously established minimum roundtrip
latency value may be stored within the flow data structure or in some
other data structure, and is initially set to a high value. If the
current roundtrip latency value is less than the previously established
minimum roundtrip latency value, then the path selection manager 804 will
perform a clock offset computation. Before or after the clock offset
computation is performed, the path selection manager 804 overwrites the
previously established minimum roundtrip latency value with the current
roundtrip latency value. This establishes the current roundtrip latency
value as the new previously established minimum roundtrip latency value
for future calculations.

[0180]To compute the clock offset value, the path selection manager 804
initially determines an estimated one-way latency value for the path
provided by VLAN 1. In one embodiment, the path selection manager 804
derives this estimated one-way latency value by dividing the current
roundtrip latency value by two (if so desired, the estimated one-way
latency value may be derived using some other computation). This
estimation assumes that it takes approximately the same amount of time to
travel from MPLB 702(A) to MPLB 702(P) along the path provided by VLAN 1
as it does to travel from MPLB 702(P) back to MPLB 702(A). In this
regard, the use of the minimum roundtrip latency value (which the current
roundtrip latency value is) is significant because it is likely to
produce the least amount of estimation error in the estimated one-way
latency value.

[0181]Once the estimated one-way latency value is calculated, the
estimated clock offset value between the clocks on MPLB 702(A) and MPLB
702(P) may be computed as follows. In the embodiment where the path
selection manager on MPLB 702(P) provides only one timestamp (the arrived
timestamp or the return timestamp) in the latency response packet, the
following equation may be used to compute the clock offset value:

Clock offset value=(AT or RT)-IT-estimated one-way latency value;

[0182]where AT is the arrived timestamp (if any) specified in the latency
response packet, RT is the return timestamp (if any) specified in the
latency response packet, and IT is the initial timestamp specified in the
latency response packet (see FIG. 12A).

[0183]In the embodiment where the path selection manager on MPLB 702(P)
provides both an arrived timestamp and a return timestamp in the latency
response packet, the following equation may be used to compute the clock
offset value:

Clock offset value=AT-IT-estimated one-way latency value;

[0184]where AT is the arrived timestamp specified in the latency response
packet and IT is the initial timestamp specified in the latency response
packet (see FIG. 12B).

[0185]It should be noted that this clock offset value is an estimation of
the actual clock offset between the clocks on MPLB 702(A) and MPLB
702(P). Nonetheless, it can be used effectively to derive more accurate
one-way latency values. The clock offset value may be used as follows.

[0186]In the embodiment where the path selection manager on MPLB 702(P)
provides only one timestamp (the arrived timestamp or the return
timestamp) in the latency response packet, the following equation may be
used to derive the current one-way latency value for the path provided by
VLAN 1:

Current one-way latency value=(AT or RT)-IT-clock offset value;

[0187]where AT is the arrived timestamp (if any) specified in the latency
response packet, RT is the return timestamp (if any) specified in the
latency response packet, and IT is the initial timestamp specified in the
latency response packet (again, see FIG. 12A).

[0188]In the embodiment where the path selection manager on MPLB 702(P)
provides both an arrived timestamp and a return timestamp in the latency
response packet, the following equation may be used to derive the current
one-way latency value for the path provided by VLAN 1:

Current one-way latency value=AT-IT-clock offset value;

[0189]where AT is the arrived timestamp specified in the latency response
packet and IT is the initial timestamp specified in the latency response
packet (again, see FIG. 12B).

[0190]By calculating the clock offset value, and by taking it into account
in this manner, the path selection manager 804 is able to derive much
more accurate absolute one-way latency values for a path. These more
accurate absolute one-way latency values in turn may be used to derive
more accurate AOWL values, and more accurate one-way latency variation
values.

[0191]In one embodiment, after the clock offset value is computed, it is
stored in either the flow data structure or another data structure for
future reference. In a future sending of a latency request packet and
receiving of a latency response packet, the calculated roundtrip latency
value for that latency response packet may not be less than the
previously established minimum roundtrip latency value. In such a case,
the path selection manager 804 will not recalculate the clock offset
value. Instead, it will access the clock offset value calculated above
and use it to derive the one-way latency value for that latency response
packet. Thus, once calculated, the clock offset value may be used over
and over again.
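
By way of illustration only, the clock offset determination and its reuse may be sketched as follows for the case in which the arrived timestamp (AT) is used. The class name `ClockOffsetEstimator`, the method name `one_way_latency`, and the sample timestamps are assumptions made for this example; the roundtrip latency value is taken as an input because its derivation is described elsewhere with reference to FIGS. 11A and 11B.

```python
import math
from dataclasses import dataclass


@dataclass
class ClockOffsetEstimator:
    min_rtt: float = math.inf  # "initially set to a high value"
    offset: float = 0.0        # estimated clock offset value

    def one_way_latency(self, it: float, at: float, rtt: float) -> float:
        """Return AT - IT - clock offset value.

        it:  initial timestamp (local clock)
        at:  arrived timestamp (remote clock)
        rtt: current roundtrip latency value for this request/response pair
        """
        if rtt < self.min_rtt:
            # New minimum: recompute the offset from half the roundtrip time.
            self.min_rtt = rtt
            estimated_one_way = rtt / 2.0
            self.offset = at - it - estimated_one_way
        return at - it - self.offset


est = ClockOffsetEstimator()
# Hypothetical case: remote clock is 300 ms ahead, true one-way latency 20 ms.
first = est.one_way_latency(it=1000.0, at=1320.0, rtt=40.0)
# Later, slower packet (35 ms each way): stored offset is reused, not recomputed.
second = est.one_way_latency(it=2000.0, at=2335.0, rtt=70.0)
```

In this sketch the first call returns 20.0 and stores an offset of 300.0; the second call returns 35.0 using the stored offset.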

[0192]In addition to having clock offset, multiple clocks may also have
clock drift. That is, their clock offset may change over time. For
example, at time X, the clock on MPLB 702(P) may be faster than the clock
on MPLB 702(A) by 300 milliseconds but at time Y, it may be faster by 600
milliseconds. To correct for this clock drift, the path selection manager
804, in one embodiment, periodically increases the previously established
minimum roundtrip latency value. The larger the previously established
minimum roundtrip latency value becomes, the more likely it will be that
a calculated roundtrip latency value for a latency response packet will
be smaller than the previously established minimum roundtrip latency
value. This in turn makes it more likely that the clock offset value will
be recalculated. When the clock offset value is recalculated, the clock
drift will be detected and compensated for.
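
The drift correction described above may be sketched in Python as follows. The names are hypothetical, and two details are assumptions made only for illustration because the description does not fix them: the minimum is raised by a fixed additive step each period, and the offset is recalculated by treating the one-way latency as half of the new minimum roundtrip latency.

```python
class DriftCompensator:
    def __init__(self, inflate_step):
        self.min_roundtrip = float("inf")
        self.clock_offset = 0.0
        self.inflate_step = inflate_step  # ASSUMPTION: fixed additive step

    def age_minimum(self):
        """Called periodically: raise the minimum so that it can be beaten again."""
        self.min_roundtrip += self.inflate_step

    def on_response(self, initial_ts, arrived_ts, roundtrip):
        """Return True if the clock offset was recalculated."""
        if roundtrip < self.min_roundtrip:
            self.min_roundtrip = roundtrip
            # ASSUMPTION: symmetric path when the minimum is observed.
            self.clock_offset = (arrived_ts - initial_ts) - roundtrip / 2
            return True
        return False


comp = DriftCompensator(inflate_step=10)
comp.on_response(0, 350, 100)              # first sample: offset becomes 300
stale = comp.on_response(1000, 1652, 104)  # 104 >= 100: offset stays 300
comp.age_minimum()                         # minimum rises from 100 to 110
fresh = comp.on_response(2000, 2652, 104)  # 104 < 110: offset recalculated
print(stale, fresh, comp.clock_offset)     # False True 600.0
```

The second response already reflects a drifted clock, but the drift goes unnoticed until the minimum has been raised far enough for a later response to fall below it.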

Controlling the Number of Latency Packets

[0193]As noted previously, over the life of the flow, the path selection
manager 804 may send a large number of latency request packets and
receive a large number of latency response packets. The sending and
receiving of these latency packets has the potential of flooding and
hence overburdening the network. In one embodiment, to prevent this from
happening, the path selection manager 804 limits the number of latency
request packets that are sent.

[0194]In one embodiment, the path selection manager 804 determines the
number of latency request packets to send based upon the rate of the
flow. Specifically, the path selection manager 804 determines a rate for
the flow (i.e. the rate at which information associated with the flow is
received from node A and forwarded to node P). Then, it limits the
sending of the latency request packets to a rate that is a certain
percentage of the rate of the flow. Thus, if the rate of the flow is X,
then the rate at which the latency request packets are sent will be a
certain percentage of X (1 to 3 percent, for example). By maintaining
proportionality between the rate of the flow and the rate of the latency
request packets, the path selection manager 804 ensures that the latency
request packets will not flood or overburden the network.
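
A minimal Python sketch of this proportional limit follows. The function names are hypothetical, the flow rate is assumed to be measured in packets per second, and the default fraction of 2 percent is simply a value within the 1 to 3 percent example range given above.

```python
def probe_rate(flow_rate, fraction=0.02):
    """Rate for latency request packets as a fixed fraction of the flow rate."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return flow_rate * fraction


def probe_interval(flow_rate, fraction=0.02):
    """Seconds between latency request packets, or None if the flow is idle."""
    rate = probe_rate(flow_rate, fraction)
    return 1.0 / rate if rate > 0 else None
```

For a flow of 1000 packets per second, this sketch yields roughly 20 latency request packets per second, or one about every 50 milliseconds. An idle flow generates no latency request packets at all.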

[0195]As an alternative, the path selection manager 804 may keep a count
of how much information (in terms of bits or bytes) has been received and
forwarded for the flow. At one or more certain points, a determination is
made as to whether the information count has reached a certain amount.
When the count reaches a certain amount, a latency request packet is
sent. At that point, the count is cleared, and the counting of
information restarts. When the count again reaches a certain amount,
another latency request packet is sent, and the count is restarted. By
metering the latency request packets in this manner, the path selection
manager 804 ensures that the network traffic generated by the latency
request packets will be kept in proportion to the network traffic
generated by the flow. These and other methods may be used to control the
number of latency request packets that are sent by the path selection
manager 804.
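
The count-based alternative may be sketched in Python as follows. The class name, the method name, and the threshold value are hypothetical; the check is assumed to be made each time information is forwarded for the flow.

```python
class ProbeMeter:
    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.count = 0  # bytes forwarded since the last latency request packet

    def on_forward(self, nbytes):
        """Record forwarded bytes; return True when a probe should be sent."""
        self.count += nbytes
        if self.count >= self.threshold_bytes:
            self.count = 0  # clear the count and restart the counting
            return True
        return False


meter = ProbeMeter(threshold_bytes=1000)
decisions = [meter.on_forward(400) for _ in range(5)]
print(decisions)  # [False, False, True, False, False]
```

Because a latency request packet is triggered only by forwarded flow traffic, the probe traffic scales with the flow and stops when the flow is idle.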

Moving a Flow without Requiring Packet Reordering

[0196]As noted previously, when traffic conditions warrant, the path
selection manager 804 may move a flow from one path to another. When a
flow is moved from one path to another, there is a potential that the
packets of the flow may arrive at the destination out of order. To
illustrate, suppose that a packet of a flow is forwarded to a destination
along a first path. Suppose further that after forwarding that packet,
the path selection manager 804 decides to move the flow to a second path.
When the path selection manager 804 receives the next packet in that
flow, it will forward that packet to the destination along the second
path. When that happens, a race condition arises. If conditions are
right, the packet sent along the second path may arrive at the
destination before the packet sent along the first path. In such a case,
the packets will be out of order, which will require the destination to
reorder the packets. Packet reordering is burdensome and adds overhead
and complexity; thus, if possible, it is desirable to avoid it.

[0197]In one embodiment of the present invention, the path selection
manager 804 moves a flow from one path to another in such a manner that
it is ensured that the packets of the flow will not arrive at the
destination out of order. Thus, it is ensured that the destination will
not have to perform packet reordering.

[0198]To illustrate how this may be done in accordance with one embodiment
of the present invention, reference will again be made to the above
example. Initially, the path selection manager 804 (the one on MPLB
702(A)) forwards packets associated with the flow to node P along the
path provided by VLAN 1. Suppose at some point that the path selection
manager 804 determines that the flow should be moved to the path provided
by VLAN 4. In one embodiment, after making this determination, the path
selection manager 804 does not immediately update the flow data structure
to indicate VLAN 4 instead of VLAN 1; hence, it does not immediately
start sending packets associated with the flow along the path provided by
VLAN 4. Rather, the path selection manager 804 first sends a request
packet to MPLB 702(P) along the path provided by VLAN 1. This request
packet may be one of the latency request packets discussed above. In one
embodiment, at the time that it sends the request packet, the path
selection manager 804 sets a "path clear" flag.

[0199]At some point, the path selection manager on MPLB 702(P) will
receive the request packet. When it does so, the path selection manager
on MPLB 702(P) composes and sends a response packet back to MPLB 702(A).
This response packet may be one of the latency response packets discussed
previously.

[0200]When the path selection manager 804 (the one on MPLB 702(A))
receives the response packet, it determines whether the "path clear" flag
is still set. In one embodiment, this flag will still be set if no
packets associated with the flow were received and forwarded by the path
selection manager 804 between the sending of the request packet and the
receiving of the response packet. If the "path clear" flag is still set,
then it means that all previously sent flow packets have reached MPLB
702(P). In such a case, it is known that the path is clear of flow
packets; thus, the path selection manager 804 can move the flow to the
path provided by VLAN 4, knowing that no packet reordering will need to
be done by the destination. Thus, at that point, the path selection
manager 804 updates the flow data structure to replace the ID for VLAN 1
with the ID for VLAN 4. Future packets associated with the flow will be
forwarded to the destination via the path provided by VLAN 4 instead of
the path provided by VLAN 1.

[0201]On the other hand, if the "path clear" flag is no longer set, then
it means that one or more intervening flow packets were received and
forwarded by the path selection manager 804 between the sending of the
request packet and the receiving of the response packet. In such a case,
the path selection manager 804 will repeat the above process. Namely, it
will send another request packet, set the "path clear" flag, receive
another response packet, and check to see if the "path clear" flag is
still set at that time. Until it encounters an instance in which a
response packet is received and the "path clear" flag is still set, the
path selection manager 804 will not move the flow onto the path provided
by VLAN 4.

[0202]In one embodiment, when the path selection manager 804 (the one on
MPLB 702(A)) receives a packet associated with the flow, it checks the
"path clear" flag. If the flag is set, it clears the flag (thereby
indicating that the path is no longer clear). It then sends the packet to
node P along the path provided by VLAN 1 (this is so despite the fact
that the path selection manager 804 has already determined that the flow
should be moved to the path provided by VLAN 4). By doing so, the path
selection manager 804 in effect causes the path provided by VLAN 1 to
still be used. It will continue to use the path provided by VLAN 1 until
it can move the flow to the path provided by VLAN 4 in such a manner that
it is ensured that the destination will not need to perform packet
reordering.
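
The "path clear" handshake described in this section may be sketched in Python as follows. This is an illustrative sketch only: the class, method, and attribute names are hypothetical, and the sending of the request packet is replaced by a counter so that the example is self-contained.

```python
class FlowMover:
    def __init__(self, current_vlan):
        self.current_vlan = current_vlan  # VLAN ID held in the flow data structure
        self.pending_vlan = None          # target path while a move is pending
        self.path_clear = False
        self.requests_sent = 0

    def _send_request(self):
        # Stand-in for sending a request packet along the current path.
        self.requests_sent += 1
        self.path_clear = True

    def request_move(self, new_vlan):
        self.pending_vlan = new_vlan
        self._send_request()

    def on_flow_packet(self):
        """Return the VLAN on which this flow packet is forwarded."""
        self.path_clear = False  # a flow packet is now in flight on the old path
        return self.current_vlan

    def on_response(self):
        """Return True if the flow was moved upon this response packet."""
        if self.pending_vlan is None:
            return False
        if self.path_clear:
            self.current_vlan = self.pending_vlan
            self.pending_vlan = None
            self.path_clear = False
            return True
        self._send_request()  # path not clear: repeat the process
        return False


mover = FlowMover(current_vlan=1)
mover.request_move(4)
used = mover.on_flow_packet()        # intervening packet still uses VLAN 1
moved_first = mover.on_response()    # flag was cleared, so no move yet
moved_second = mover.on_response()   # no intervening packet: move to VLAN 4
print(used, moved_first, moved_second, mover.current_vlan)  # 1 False True 4
```

In this sketch the flow keeps using VLAN 1 until a response packet arrives with the flag still set, which is the only point at which the stored VLAN ID is replaced.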

Using Latencies to Determine Link Failure Timeout Period

[0203]As described previously, the path selection manager 804 can use
latency request packets and latency response packets to determine a
plurality of latency values for a path. With these latency values, the
path selection manager 804 can derive an estimated latency value for that
path. For example, the plurality of latency values may be averaged to
give rise to the estimated latency value. These and other methods may be
used to derive the estimated latency value. In one embodiment of the
present invention, this estimated latency value may be used to derive a
more accurate link failure timeout period that will enable link failures
to be detected more quickly and accurately.

[0204]To elaborate, a link failure is typically detected by sending a
packet to a receiving node along a certain path, and then waiting for a
response. If the response is not received within a certain timeout
period, then it is concluded that a link failure has occurred. Typically,
the timeout period is a static value that is pre-established by a network
administrator. This static value has no correlation with the actual
performance of any path at any time. If the timeout period is set too
short, then there will be an excessive number of false positives (i.e.
false conclusions that a link failure has occurred when it actually has
not). To avoid this result, most administrators set the timeout period to
a long period. The problem with this approach is that the link failure
will not be detected for a long period of time. Thus, reaction to the
link failure will be undesirably slow.

[0205]To enable link failures to be detected more quickly and accurately,
one embodiment of the present invention derives a link failure timeout
period for a path based upon the estimated latency value for that path.
For purposes of the present invention, the estimated latency value may be
an estimated one-way latency value or an estimated roundtrip latency
value. Because each path may perform differently, each path may have a
different link failure timeout period. In one embodiment, the link
failure timeout period is set to a value that is a multiple of the
estimated latency value (for example, the multiple may be in the range of
10 to 30). Because this link failure timeout period is determined based
upon the actual latency on a path, it will provide a more realistic and
accurate timeout period with which to judge whether a link has failed or
not. If a multiple suited to the network is used, the resulting link
failure timeout period enables link failures to be detected faster and
with fewer, if any, false positives. Also, as the latency on the path
changes, the link failure timeout period also changes. As a result, the
link failure timeout period adapts to the performance changes of the
path.

[0206]In accordance with one embodiment of the present invention, link
failure may be detected as follows. Initially, the path selection manager
804 sends a plurality of latency request packets to another component
(e.g. MPLB 702(P) of FIG. 7) in the network 300 along a particular path
(e.g. the path provided by VLAN 1). The path selection manager 804 also
receives a plurality of latency response packets from the other
component. Based upon various timestamps, and in the manner described
above, the path selection manager 804 determines from these latency
response packets a plurality of latency values. Based upon these latency
values, the path selection manager 804 determines an estimated latency
value (for example, by averaging the latency values). Then, based upon
the estimated latency value, the path selection manager 804 determines a
link failure timeout period (for example, by multiplying the estimated
latency value by a certain factor).

[0207]Thereafter, the path selection manager 804 sends another latency
request packet to the other component along the same path. If the path
selection manager 804 determines that a latency response packet, which
should be sent by the other component in response to the latency request
packet, has not been received prior to expiration of the link failure
timeout period, then the path selection manager 804 concludes that a link
failure has occurred. At that point, the path selection manager 804 can
take whatever steps are necessary (e.g. moving a flow onto another path)
to react to the link failure. By using a link failure timeout period that
reflects the actual latency on a path, the path selection manager 804 is
able to detect link failures more quickly and accurately.
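
The derivation of an adaptive link failure timeout period may be sketched in Python as follows. The function names and the sample latency values are hypothetical; averaging and a multiple of 10 are used because they are the examples given above, and other estimators or multiples could be substituted.

```python
from statistics import mean


def link_failure_timeout(latency_samples, multiple=10):
    """Timeout = multiple x estimated latency (here, the average of the samples)."""
    return multiple * mean(latency_samples)


def link_failed(elapsed_since_request, timeout):
    """True if no latency response packet arrived before the timeout expired."""
    return elapsed_since_request >= timeout


samples_ms = [2.0, 4.0, 3.0]  # hypothetical latency values for one path
timeout_ms = link_failure_timeout(samples_ms, multiple=10)
print(timeout_ms)                     # 30.0
print(link_failed(12.0, timeout_ms))  # False
print(link_failed(31.0, timeout_ms))  # True
```

Recomputing the timeout as new latency values arrive lets the timeout follow the path's performance; note that `statistics.mean` raises an error on an empty list, so at least one latency value is needed before a timeout can be derived.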

Flow Control at Network Border

[0208]Most switches in a network have the ability to instruct an upstream
component (e.g. an upstream switch) to stop sending packets, or at least,
to slow down the rate at which packets are being sent. A switch may do
this, for example, when it detects that its buffers are full or are about
to be full and that, if the packet reception rate is not decreased, the
switch will have to start dropping packets. The problem with a switch
within the heart of a network exercising this capability is that it can
lead to a chain reaction of switches exercising this capability. For
example, a switch may instruct an upstream switch to stop sending packets
to it, which may cause that switch to instruct an upstream switch to stop
sending packets to it, which may cause that switch to instruct an
upstream switch to stop sending packets to it, and so forth. Under
certain conditions, this chain reaction may cause the overall network to
become gridlocked.

[0209]To prevent this from occurring, one embodiment of the present
invention exercises flow control at the border of the network. More
specifically, in one embodiment, it is the path selection managers (on
either the MPLBs 702 or the network interfaces 502) that exercise flow
control. To illustrate how this can be done, reference will again be made
to the example discussed previously, in which: (1) node A (FIG. 7) sends
a plurality of packets associated with a flow to MPLB 702(A); (2) all of
the packets of the flow are destined for node P; and (3) the path
selection manager 804 (the one on MPLB 702(A)) has selected the path
provided by VLAN 1 to be the path to use to transport the packets to node
P.

[0210]In such a scenario, the path selection manager 804 receives the flow
packets from node A at a particular flow rate, and forwards the flow
packets to node P via the path provided by VLAN 1. As the path selection
manager 804 is doing so, it determines a congestion metric for the path
provided by VLAN 1 (in the manner described previously). Part of the
congestion metric will be one or more latency values for the path
provided by VLAN 1. Based at least partially upon this congestion metric,
the path selection manager 804 determines whether a certain congestion
threshold has been reached. This congestion threshold may, for example,
be a certain latency value. For instance, suppose that the path selection
manager 804 is aware that the latency on the path provided by VLAN 1 can
only get so high before the switches in that path start dropping packets.
In such a case, the latency value at which the switches start dropping
packets may be the congestion threshold. If the path selection manager
804 determines that the congestion on the path provided by VLAN 1 has
reached this congestion threshold, then it instructs the source (node A
in this example) to stop sending flow packets, or at least, to slow the
rate at which the flow packets are being sent. In one embodiment, as part
of this instruction, the path selection manager 804 may provide to node A
sufficient information for node A to specifically identify the flow. That
way, node A can stop sending packets, or slow down the sending of
packets, just for that flow. By exercising flow control in this manner,
the path selection manager 804 ensures that the network 300 will not
become overly congested, which in turn, eliminates the need for the
switches within the network 300 to exercise flow control. Thus, gridlock
caused by a chain reaction of switches exercising flow control is
prevented.
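
The border flow control decision may be sketched in Python as follows. The type names, the function name, and the threshold value are hypothetical; the sketch assumes that the congestion threshold is a single latency value and that the flow is identified to the source by a source and destination pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowId:
    src: str
    dst: str


@dataclass(frozen=True)
class SlowDown:
    """Instruction to the source to stop or slow the identified flow."""
    flow: FlowId


def border_flow_control(latest_latency, threshold, flow) -> Optional[SlowDown]:
    """Return an instruction for the source if the threshold has been reached."""
    if latest_latency >= threshold:
        return SlowDown(flow)
    return None


flow = FlowId(src="A", dst="P")
print(border_flow_control(5.0, 8.0, flow))  # None
print(border_flow_control(8.0, 8.0, flow))
```

Carrying the flow identifier in the instruction is what allows the source to throttle only the affected flow rather than all of its traffic.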

[0211]At this point, it should be noted that although the invention has
been described with reference to one or more specific embodiments, it
should not be construed to be so limited. Various modifications may be
made by those of ordinary skill in the art with the benefit of this
disclosure without departing from the spirit of the invention. For
example, the network 300 shown in FIG. 3 is just a sample network. This
switching topology can be extended to include more switches, to switches
with fewer or more ports, to fewer or more hierarchical levels of
switches, to fewer or more VLANs and VLAN paths, etc. These and other
modifications and extensions are possible. Thus, the invention should not
be limited by the specific embodiments used to illustrate it but only by
the scope of the issued claims and the equivalents thereof.