Receive-Side Scaling: Maximizing Web Server CPU

UPDATE 1/7/2013: I recently re-ran the scenario on the original hardware and could not reproduce the RSS issue below. I found that by setting 'Number of RSS Queues' to 4 for each of the Broadcom NICs, I could get an even distribution of load across all 16 server cores. There was likely an RSS configuration issue, but I can't confirm what that was.

Re-running on the original hardware after setting 'Number of RSS Queues' to 4 for each NIC:

4 Intel Xeon Quad-Core processors (16 cores total)

4 Broadcom BCM5708C NetXtreme II GigE NICs

ORIGINAL POST:

In the previous post, I stated that when adding multiple NICs to a web server you should ensure that the network load is balanced across those NICs. To make full use of the CPUs on a multi-processor server, you also need to ensure that those NICs support Receive-Side Scaling (RSS), a technology that distributes network receive processing across multiple CPUs. In this post I will demonstrate how to verify that RSS is working.
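To make "distributes network processing across multiple CPUs" more concrete, here is a rough sketch of the usual RSS mechanism: the NIC hashes each connection's addresses and ports (commonly with a Toeplitz hash), and the low-order bits of the hash index an indirection table that names the CPU to handle the packet. The key, the table size of 128, and the helper names below are assumptions made for illustration; they are not taken from the servers in this post or from any driver.

```python
import socket
import struct

# Hypothetical 40-byte secret key. Real hardware is programmed with a key by the OS.
KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of `data` under `key`."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("key is too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # XOR in the 32-bit window of the key that starts at this bit position.
                position = i * 8 + bit
                result ^= (key_int >> (key_bits - 32 - position)) & 0xFFFFFFFF
    return result


def flow_bytes(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Concatenate the TCP/IPv4 fields in the order assumed here for hashing."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", src_port, dst_port))


def pick_cpu(flow: bytes, table: list) -> int:
    """Map a flow to a CPU using the low-order bits of its hash."""
    # The table length is assumed to be a power of two so masking works.
    return table[toeplitz_hash(KEY, flow) & (len(table) - 1)]


# Hypothetical 128-entry indirection table spreading flows over 16 CPUs.
TABLE = [i % 16 for i in range(128)]

if __name__ == "__main__":
    flow = flow_bytes("10.0.0.5", "10.0.0.1", 49152, 80)
    print("flow handled by CPU", pick_cpu(flow, TABLE))
```

Because the hash depends only on the connection's addresses and ports, every packet of a given connection lands on the same CPU, while different connections spread across the table. If the table only ever names a subset of the CPUs, the remaining CPUs never receive network work, no matter how many connections arrive.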

Recall our webserver specs:

4 Intel Xeon Quad-Core processors (16 cores total)

4 Broadcom BCM5708C NetXtreme II GigE NICs

After load-balancing our server and clients, we now have the following wcat node configuration:

After collecting data on the server NICs, enabling processor counters for all processors, and running my wcat scenario, I was able to see what happened to incoming network traffic. In the table below, notice that only 8 of the 16 CPUs processed the Deferred Procedure Calls (DPCs). Meanwhile, the other 8 CPUs spent more time idle. Something in my server configuration was preventing RSS from utilizing all 16 cores.
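The same check can be done programmatically once the per-processor counter values are exported. The sketch below is a minimal illustration and not the tooling used for this post: it assumes you already have one DPC figure per CPU (for example from a counter such as DPCs Queued/sec) and flags CPUs that receive far less than an even share. The function names, the 25% threshold, and the sample numbers are all made up.

```python
from typing import Dict, List


def dpc_share(dpcs_per_cpu: Dict[int, float]) -> Dict[int, float]:
    """Return each CPU's fraction of the total DPC count."""
    total = sum(dpcs_per_cpu.values())
    if total == 0:
        return {cpu: 0.0 for cpu in dpcs_per_cpu}
    return {cpu: value / total for cpu, value in dpcs_per_cpu.items()}


def underused_cpus(dpcs_per_cpu: Dict[int, float], threshold: float = 0.25) -> List[int]:
    """List CPUs whose DPC share is below `threshold` times an even share."""
    if not dpcs_per_cpu:
        return []
    shares = dpc_share(dpcs_per_cpu)
    even_share = 1.0 / len(shares)
    return sorted(cpu for cpu, share in shares.items() if share < threshold * even_share)


# Made-up sample: CPUs 0-7 handle nearly all DPCs, CPUs 8-15 almost none.
sample = {cpu: (1000.0 if cpu < 8 else 5.0) for cpu in range(16)}

if __name__ == "__main__":
    print("CPUs receiving little or no DPC load:", underused_cpus(sample))
```

With RSS working as intended, the list should come back empty or close to it. A result that names half of the CPUs, as in the made-up sample, matches the pattern described above.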

To compare, we borrowed some hardware from the IIS performance lab which had previously gone through this RSS exercise.

The new webserver has a total of 20 processors and 10 NICs:

20 Intel Xeon CPU E7-4860 processors

6 Intel Gigabit ET Dual Port Server Adapter NICs

4 Intel 82576NS Gigabit Ethernet Controller NICs

The following is the new load-balanced wcat node configuration:

The following is the compilation of NIC and counter data for a wcat run. RSS was configured properly on this node: notice how the DPC processing was evenly distributed across all processors (taking into account that RSS CPUs 8-9 and e-f had increased load).

To compile this data, I did the following:

Noted which client(s) pointed to which IP(s), based on my load balancing

Found the number of requests sent in the per-client statistics section of the wcat log