Abstract

A switch for a switching network includes a plurality of ports for communicating data traffic and a switch controller that controls switching between the plurality of ports. The switch controller selects a forwarding path for the data traffic based on at least topological congestion information for the switching network. In a preferred embodiment, the topological congestion information includes sFlow topological congestion information and the switch controller includes an sFlow client that receives the sFlow topological congestion information from an sFlow controller in the switching network.

Description

This application is a continuation of U.S. patent application Ser. No. 13/267,459 entitled “NETWORK TRAFFIC DISTRIBUTION,” filed on Oct. 6, 2011, the disclosure of which is incorporated herein by reference in its entirety for all purposes.

BACKGROUND OF THE INVENTION

1. Technical Field

The present disclosure relates in general to network communication and, in particular, to traffic distribution in packet switched networks.

2. Description of the Related Art

As is known in the art, network communication is commonly premised on the well known seven layer Open Systems Interconnection (OSI) model, which defines the functions of various protocol layers while not specifying the layer protocols themselves. The seven layers, sometimes referred to herein as Layer 7 through Layer 1, are the application, presentation, session, transport, network, data link, and physical layers, respectively.

At a source station, data communication begins when data is received from a source process at the top (application) layer of the stack of functions. The data is sequentially formatted at each successively lower layer of the stack until a data frame of bits is obtained at the data link layer. Finally, at the physical layer, the data is transmitted in the form of electromagnetic signals toward a destination station via a network link. When received at the destination station, the transmitted data is passed up a corresponding stack of functions in the reverse order in which the data was processed at the source station, thus supplying the information to a receiving process at the destination station.

The principle of layered protocols, such as those supported by the OSI model, is that, while data traverses the model layers vertically, the layers at the source and destination stations interact in a peer-to-peer (i.e., Layer N to Layer N) manner, and the functions of each individual layer are performed without affecting the interface between the function of the individual layer and the protocol layers immediately above and below it. To achieve this effect, each layer of the protocol stack in the source station typically adds information (in the form of an encapsulated header) to the data generated by the sending process as the data descends the stack. At the destination station, these encapsulated headers are stripped off one-by-one as the data propagates up the layers of the stack until the decapsulated data is delivered to the receiving process.

The physical network coupling the source and destination stations may include any number of network nodes interconnected by one or more wired or wireless network links. The network nodes commonly include hosts (e.g., server computers, client computers, mobile devices, etc.) that produce and consume network traffic, switches, and routers. Conventional network switches interconnect different network segments and process and forward data at the data link layer (Layer 2) of the OSI model. Switches typically provide at least basic bridge functions, including filtering data traffic by Layer 2 Media Access Control (MAC) address, learning the source MAC addresses of frames, and forwarding frames based upon destination MAC addresses. Routers, which interconnect different networks at the network (Layer 3) of the OSI model, typically implement network services such as route processing, path determination and path switching.

A large network typically includes a large number of switches, which operate somewhat independently. Switches within the flow path of network data traffic include an ingress switch that receives incoming data packets and an egress switch that sends outgoing data packets, and frequently further include one or more intermediate switches coupled between the ingress and egress switches. In such a network, a switch is said to be congested when the rate at which data traffic ingresses at the switch exceeds the rate at which data traffic egresses at the switch.

In conventional networks, when a switch in a data flow path is congested with data traffic, the congested switch may apply “back pressure” by transmitting one or more congestion management messages, such as a priority-based flow control (PFC) or congestion notification (CN) message, requesting other switches in the network that are transmitting data traffic to the congested switch to reduce or to halt data traffic to the congested switch. Conventional congestion management message may specify a backoff time period during which data traffic is reduced or halted, where the backoff time may be determined upon the extent of congestion experienced by the congested switch. Conventional congestion management messages may not provide satisfactory management of network traffic, however. While serving to temporarily reduce the transmission rate of some network nodes, conventional congestion management does nothing to address persistent long term congestion on switching ports, which can arise, for example, in cases in which different high-traffic source-destination address tuples hash to the same network path.

SUMMARY OF THE INVENTION

In at least one embodiment, a switch for a switching network includes a plurality of ports for communicating data traffic and a switch controller that controls switching between the plurality of ports. The switch controller selects a forwarding path for the data traffic based on at least topological congestion information for the switching network. In a preferred embodiment, the topological congestion information includes sFlow topological congestion information and the switch controller includes an sFlow client that receives the sFlow topological congestion information from an sFlow controller in the switching network.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high level block diagram of a data processing environment in accordance with one embodiment;

FIG. 2 is a more detailed view of a switching network in the data processing environment of FIG. 1;

FIG. 3 illustrates an exemplary embodiment of a physical switch in a switching network;

FIG. 4 depicts an exemplary embodiment of a host platform that can be utilized to implement a virtual switch of a switching network;

FIG. 5 depicts the flow of traffic management information in an exemplary switching network; and

FIG. 6 is a high level logical flowchart of an exemplary embodiment of a process by which topological congestion information is employed to achieve improved traffic distribution in a switching network.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

With reference now to the figures and with particular reference to FIG. 1, there is illustrated a high level block diagram of an exemplary data processing environment 100 in accordance within one embodiment. As shown, data processing environment 100 includes a collection of resources 102. Resources 102, which may include various hosts, clients, switches, routers, storage, etc., are interconnected for communication and may be grouped (not shown) physically or virtually, in one or more public, private, community, public, or cloud networks or a combination thereof. In this manner, data processing environment 100 can offer infrastructure, platforms, software and/or services accessible to various client devices 110, such as personal (e.g., desktop, laptop, netbook, tablet or handheld) computers 110a, smart phones 110b, server computer systems 110c and consumer electronics, such as media players (e.g., set top boxes, digital versatile disk (DVD) players, or digital video recorders (DVRs)) 110d. It should be understood that the types of client devices 110 shown in FIG. 1 are illustrative only and that client devices 110 can be any type of electronic device capable of communicating with and/or accessing resources 102 via a packet network.

Referring now to FIG. 2, there is illustrated a more detailed view of an exemplary embodiment of a switching network within resources 102 of data processing environment 100. In the depicted embodiment, resources 102 includes a plurality of server racks 202, which may form the computational resources of a data center, for example. Server racks 202 are coupled for communication by a Clos switching network 200, which may be, for example, a Transparent Interconnection of Lots of Links (TRILL) network, Ethernet network, a converged network carrying Fibre Channel over Ethernet (FCoE), or some other packet switched network.

Switching network 200 has at a lowest tier a plurality of top-of-rack (ToR) switches 204 each mounted on a respective one of server racks 202. Switching network 200 additionally includes a middle tier of aggregation switches 206, each of which is coupled to, and aggregates data traffic of one or more ToRs 204. Switching network 200 finally includes at an upper tier a plurality of core switches 208. In the depicted embodiment, aggregation switches 206 and core switches 208 are coupled in a full mesh topology in which each core switch 208 is coupled to each of aggregation switches 206.

In a switching network 200 such as that illustrated, any of switches 204, 206 and 208 may become congested as one or more other switches of switching network 200 transmit data traffic at a rate greater than that switch 202 is itself able to forward that data traffic towards its destination(s). In many switching networks 200, congestion in some intermediate node (a switch 208 or 206) prevents data packets from being delivered to a final egress switch 204 even if there exists some alternate path to that egress switch 204. For example, a particular core switch 208 may become congested as multiple aggregation switches 206 concentrate egress data traffic at the same core switch 208, for example, due to multiple frequently referenced source-destination address tuples hashing to the same network path.

With reference now to FIG. 3, there is illustrated a first exemplary embodiment of a physical switch 300 that may be utilized to implement any of switches of FIG. 2. As shown, switch 300 includes a plurality of physical ports 302a-302m. Each physical port 302 includes a respective one of a plurality of receive (Rx) interfaces 304a-304m and a respective one of a plurality of ingress queues 306a-306m that buffers frames of data traffic received by the associated Rx interface 304. Each of ports 302a-302m further includes a respective one of a plurality of egress queues 314a-314m and a respective one of a plurality of transmit (Tx) interfaces 320a-320m that transmit frames of data traffic from an associated egress queue 314.

Switch 300 additionally includes a switch fabric 310, such as a crossbar or shared memory switch fabric, which is operable to intelligently switch data frames from any of ingress queues 306a-306m to any of egress queues 314a-314m under the direction of switch controller 330. As will be appreciated, switch controller 330 can be implemented with one or more centralized or distributed, special-purpose or general-purpose processing elements or logic devices, which may implement control entirely in hardware, or more commonly, through the execution of firmware and/or software by a processing element.

In order to intelligently switch data frames, switch controller 330 builds and maintains one or more data plane data structures, for example, a Layer 2 forwarding information base (FIB) 332 and a Layer 3 routing information base (RIB) 334, which can be implemented, for example, as tables in content-addressable memory (CAM). In some embodiments, the contents of FIB 332 can be preconfigured, for example, by utilizing a management interface to specify particular egress ports 302 for particular traffic classifications (e.g., MAC addresses, traffic types, ingress ports, etc.) of traffic. Switch controller 330 can alternatively or additionally build FIB 332 in an automated manner by learning from observed data frames an association between ports 302 and destination MAC addresses specified by the data frames and recording the learned associations in FIB 332. A forwarding process 333 in switch controller 330 thereafter controls switch fabric 310 to switch data frames in accordance with the associations recorded in FIB 332. RIB 334, if present, can similarly be preconfigured or dynamically configured with routes associated with Layer 3 addresses, which are utilized by routing process 335 to route data packets. For example, in a embodiment in which switch 300 is a TRILL switch implemented in a TRILL network, RIB 334 is preferably preconfigured with a predetermined route through switching network 200 among multiple possible equal cost paths for each destination address. In other embodiments, dynamic routing algorithms, such as OSPF (Open Shortest Path First) or the like, can be utilized to dynamically select (and update RIB 334 with) a route for a flow of data traffic based on Layer 3 address and/or other criteria.

Switch controller 330 additionally includes an sFlow client 350 that, as discussed in greater detail below, receives sFlow information from the sFlow controller of switching network 200 and supplies the information to forwarding process 333 to optimize the distribution of data traffic in switching network 200.

As noted above, any of switches 202 may be implemented as a virtual switch by program code executed on a physical host platform. For example, FIG. 4 depicts an exemplary host platform 400 including one or more network interfaces 404 (e.g., network interface cards (NICs), converged network adapters (CNAs), etc.) that support connections to physical network links for communication with other switches 202 or other network-connected devices. Host platform 400 additionally includes one or more processors 402 (typically comprising one or more integrated circuits) that process data and program code, for example, to manage, access and manipulate data or software in data processing environment 100. Host platform 400 also includes input/output (I/O) devices 406, such as ports, displays, user input devices and attached devices, etc., which receive inputs and provide outputs of the processing performed by host 400 and/or other resource(s) in data processing environment 100. Finally, host platform 400 includes data storage 410, which may include one or more volatile or non-volatile storage devices, including memories, solid state drives, optical or magnetic disk drives, tape drives, etc. Data storage 410 may store, for example, program code 420 (including software, firmware or a combination thereof) executable by processors 402. Program code 420, which may comprise one or more of a virtual machine monitor (VMM), virtual machines, operating system(s) (OSs), and/or application software, may implement one or more switches 204, 206 or 208 (and one or more associated network links) virtually. As understood by those skilled in the art, such virtual switches may virtualize the components and functions of switch 300 of FIG. 3, including that of switch controller 330. Further, such switches can be configured to support any of a number of protocols, including TRILL, Fibre Channel, Ethernet, FCoE, etc.

With reference now to FIG. 5, there is illustrated a exemplary data flow diagram of traffic management information in an exemplary switching network, such as switching network 200 of FIG. 2. As shown, in the exemplary data flow, the sFlow agent 340 in each of a plurality of switches 300 of switching network 200 captures packet samples from the data traffic transiting its associated switch 300 and communicates the captured packet samples and interface counter values in UDP sFlow datagrams 502 to a central sFlow controller 500. SFlow controller 500, which may execute, for example, on server 110c of FIG. 1 or one of server racks 202, collects and analyzes the network traffic information in sFlow datagrams 502 to generate a network traffic report 504 that digests the types and distribution of network traffic in switching network 200, permitting performance optimization, accounting and billing for network usage, and detection and response to security threats. As will be appreciated, network traffic report 504 may be recorded in data storage (e.g., in a log) and may further be presented in a human-viewable (e.g., graphical, textual, tabular and/or numeric) format.

In accordance with the present disclosure, the capabilities of sFlow controller 500 are extended to include the distribution of relevant topological congestion information 506 to one or more (and possibly all of) switches 300 in switching network 200. Topological congestion information 506, which identifies one or more forwarding paths of the recipient switch 300 that are experiencing higher congestion relative to other forwarding paths of the receiving switch 300, is received by the sFlow client 350 of the recipient switch 300, which in turn informs forwarding process 333 of the recipient switch 300. In response, forwarding process 333 of the recipient switch 300 selects a forwarding path for its data traffic among multiple equal cost paths (i.e., ECMP paths) based on available path information from the routing process 335 and the topological congestion information provided by sFlow client 350. Forwarding process 350 may further update FIB 332 with an entry associating the selected forwarding path and the Layer 2 destination address of the data traffic.

Referring now to FIG. 6, there is illustrated a high level logical flowchart of an exemplary embodiment of a process by which a congestion information is employed to achieve improved traffic distribution in a switching network. As a logical rather than strictly chronological flowchart, at least some of the illustrated steps can be performed in a different order than illustrated or concurrently. The illustrated process can be implemented, for example, by a forwarding process 333 of the switch controller 330 of a switch 300 in switching network 200.

The process begins at block 600 and then proceeds to block 602 and 604, which illustrate forwarding process 333 asynchronously receiving sFlow topological congestion information (e.g., from sFlow client 350) and routing information (e.g., from routing process 335). Forwarding process 333 then selects a forwarding path for its data traffic from among multiple network paths based upon the available paths indicated by the routing information and the sFlow topological congestion information (block 606). At block 606, forwarding process 333 preferably selects the forwarding path in order to reduce network congestion along the forwarding path(s) indicated by the sFlow topological congestion information provided by sFlow controller 500 and sFlow client 350. If needed, forwarding process 333 updates FIB 332 with an entry associating the selected forwarding path and the Layer 2 destination address of the data traffic (block 608). Thereafter the process returns to block 602 and 604, which have been described.

As has been described, in at least one embodiment a switch for a switching network includes a plurality of ports for communicating data traffic and a switch controller that controls switching between the plurality of ports. The switch controller selects a forwarding path for the data traffic based on at least topological congestion information for the switching network. In a preferred embodiment, the topological congestion information includes sFlow topological congestion information and the switch controller includes an sFlow client that receives the sFlow topological congestion information from an sFlow controller in the switching network.

While the present invention has been particularly shown as described with reference to one or more preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, although aspects have been described with respect to one or more machines (e.g., hosts and/or network switches) executing program code (e.g., software, firmware or a combination thereof) that direct the functions described herein, it should be understood that embodiments may alternatively be implemented as a program product including a tangible machine-readable storage medium or storage device (e.g., an optical storage medium, memory storage medium, disk storage medium, etc.) storing program code that can be processed by a machine to cause the machine to perform one or more of the described functions.

Claims (5)

What is claimed is:

1. A method in a switching network including a plurality of switches, the method comprising:

an sFlow controller receiving sFlow datagrams from multiple of the plurality of switches coupled to different network links of the switching network;

based on the sFlow datagrams, the sFlow controller generating topological congestion information for a recipient switch among the plurality of switches and transmitting the topological congestion information to the recipient switch, wherein the topological congestion information identifies one or more forwarding paths of the recipient switch that are experiencing higher congestion relative to other forwarding paths of the recipient switch;

receiving data traffic at the recipient switch, wherein the recipient switch includes a physical platform, a plurality of ports for communicating data traffic, data storage including a Layer 2 forwarding information base, and a switch controller that controls switching of data traffic between the plurality of ports by reference to the Layer 2 forwarding information base and includes an sFlow client;

the sFlow client of the switch controller receiving the topological congestion information from the sFlow controller;

the switch controller selecting a Layer 2 forwarding path for the data traffic from among multiple possible forwarding paths then available and reachable from the recipient switch based on at least the received topological congestion information; and

the switch controller updating the Layer 2 forwarding information base based on the selected forwarding path and a Layer 2 destination address for the data traffic.

2. The method of claim 1, wherein selecting the forwarding path comprises selecting the L2 forwarding path based on the topological congestion information and available routes indicated by a routing process of the recipient switch.

3. The method of claim 1, and further comprising the recipient switch transmitting sFlow datagrams to the sFlow controller.

4. The method of claim 2, and further comprising:

the recipient switch maintaining a Layer 3 routing information base that associates a respective predetermined route through the switching network for each of multiple Layer 3 destination addresses; and

the switch determining, using the routing process, the available routes by reference to the Layer 3 routing information base.

5. The method of claim 2, and further comprising the recipient switch determining, using the routing process, the available routes utilizing dynamic routing.

Shi et al., Architectural Support for High Speed Protection of Memory Integrity and Confidentiality in Multiprocessor Systems, pp. 1-12, Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (2004).