Xeon CPUs + Ultra320 SCSI = blazing speed

Posted on November 01, 2003

We recently began testing commodity servers based on Intel's Xeon processors. One of the leading examples in this new class of server is the IBM xSeries 235 eServer. It sports dual Xeon CPUs with a 533MHz front-side bus, up to 12GB of ECC DDR266 memory, an embedded Broadcom NetXtreme Gigabit Ethernet NIC, an LSI Logic 53c1030 Ultra320 SCSI RAID controller, and 64-bit 133MHz PCI-X expansion slots. This type of server is designed for applications that require extremely high bandwidth, such as financial modeling, digital rendering, and seismic analysis.

Figure 1: Streaming throughput (reads only) on our 4-drive RAID 0 volume was higher on Windows 2003 Server (NTFS) than on SuSE Linux (XFS). The use of asynchronous threads by Windows makes the difference in the high-end environment of a Xeon-based system. Tests were based on the oblDisk 1.0 and oblWinDisk 3.0 benchmarks.

Ultra320 SCSI introduces major protocol changes to reduce overhead and improve performance enough to support a data burst rate of 320MBps. One key feature of Ultra320 SCSI is called "packetized SCSI," which enables transferring multiple commands in a single connection along with data and status information. In contrast, Ultra160 SCSI transfers data during a synchronous phase at 160MBps, but transfers commands and status information during slower asynchronous phases. What's more, command and status information are restricted to a single transfer per connection. As a result, Ultra320 SCSI does a much better job of maximizing bus utilization and minimizing command overhead.

These performance improvements in SCSI controllers, however, introduce a performance problem for servers: Ultra320 SCSI's faster I/O performance can rapidly saturate a standard 64-bit PCI bus. Don't even think about trying to use a 32-bit PCI bus. If you want to reap the benefits of Ultra320 SCSI, you'll need a server with at least one 100MHz PCI-X slot.
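The arithmetic behind this warning can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not a measurement: it assumes nominal bus clocks (33, 66, 100, and 133MHz), one transfer per clock, and a single Ultra320 channel. The bus list and the function name are illustrative choices, not part of our test suite.

```python
# Theoretical peak bandwidth of a parallel bus: width in bytes x clock in MHz = MB/s.
# Nominal clocks are assumed; commonly quoted figures (133, 266, 533, 1066) use
# 33.33MHz multiples, and sustained real-world throughput is lower than either.

ULTRA320_MBPS = 320  # burst rate of one Ultra320 SCSI channel, from the text


def bus_peak_mbps(width_bits, clock_mhz):
    """Peak transfer rate in MB/s, assuming one transfer per clock cycle."""
    return width_bits / 8 * clock_mhz


# Illustrative list of bus configurations (an assumption, not taken from the tests).
BUSES = {
    "PCI 32-bit/33MHz": (32, 33),
    "PCI 64-bit/33MHz": (64, 33),
    "PCI 64-bit/66MHz": (64, 66),
    "PCI-X 64-bit/100MHz": (64, 100),
    "PCI-X 64-bit/133MHz": (64, 133),
}

for name, (width, clock) in BUSES.items():
    peak = bus_peak_mbps(width, clock)
    ratio = peak / ULTRA320_MBPS
    print(f"{name:22s} {peak:6.0f} MB/s  {ratio:4.2f}x one Ultra320 channel")
```

Under these assumptions, a 32-bit or 64-bit bus at 33MHz peaks below 320MBps, while a 100MHz PCI-X slot leaves headroom for the controller and any other device sharing the bus.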

We began testing each operating system's ability to handle large amounts of I/O by installing Windows 2003 Server Enterprise Edition on the xSeries 235 server. There are a number of improvements in storage management included in the Windows 2003 Server, one of the most interesting being the ability to perform snapshot backups on shared volumes via Volume Shadow Copy Service. System administrators can now automate the creation of point-in-time copies of single or multiple volumes and can immediately restore files from these point-in-time archives.

There are a number of ways that administrators can enable this function, the easiest being through the new File Server Management option under Administrative Tools. From this management console, shadow copies can be configured to occur on specific time intervals or when there are periods of disk inactivity. In addition, a size restriction on the total space reserved for shadow copies can be set.

To gain access to these shadow copies, client systems need to update the Windows Network Neighborhood applet. This update adds a Previous Versions tab to the properties window of shared volumes. Upon clicking on this option, a user is presented a list of point-in-time copies of the disk volume. Selecting a time period opens a window with the archived files, which can be opened, dragged, and dropped like any other file.

For comparison testing, we loaded SuSE Linux Enterprise Server (SLES) on the same IBM server. SLES is built on a single code base that spans multiple 32-bit and 64-bit architectures from Intel X86 to Itanium, Opteron, POWER4, and even eServer zSeries mainframes.

For the storage subsystem tests on the Windows 2003 Server we ran our oblWinDisk and oblLoad benchmarks. On SLES, we ran oblDisk and oblLoad. oblWinDisk and oblDisk test streaming throughput, while oblLoad tests I/O-transaction-request processing in terms of I/O operations per second (IOPS).

As expected, throughput for streaming I/O under the Windows 2003 Server ramped up as the size of reads increased. Under Linux, small blocks are bundled by the operating system into larger blocks—up to 128KB. As a result, there should be little if any variation as sequential read sizes increase from 2KB to 128KB.
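To illustrate why bundling flattens the curve, here is a simplified Python sketch of request coalescing. It is not the Linux kernel's actual algorithm: the function name, the (offset, length) request format, and the merge rule are assumptions made for illustration. Only the 128KB ceiling comes from the text.

```python
MAX_MERGED = 128 * 1024  # 128KB ceiling on a bundled request, from the text


def coalesce(requests, max_bytes=MAX_MERGED):
    """Merge back-to-back (offset, length) reads into larger requests.

    Two requests are merged only if the second starts exactly where the
    first ends and the combined length stays within max_bytes.
    """
    merged = []
    for offset, length in requests:
        if merged:
            last_offset, last_length = merged[-1]
            contiguous = last_offset + last_length == offset
            if contiguous and last_length + length <= max_bytes:
                merged[-1] = (last_offset, last_length + length)
                continue
        merged.append((offset, length))
    return merged


# 128 sequential 2KB reads (256KB in total) collapse into two 128KB requests.
small_reads = [(i * 2048, 2048) for i in range(128)]
print(coalesce(small_reads))  # [(0, 131072), (131072, 131072)]
```

In this simplified model, the device sees the same two 128KB requests whether the application reads in 2KB or 64KB pieces, which is why sequential read size should matter little.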

Figure 2: Linux (XFS) was able to fulfill a higher number of requests than Windows 2003 Server (NTFS) under the traditional TPC1 benchmark rule of an average response time of less than 100ms. Tests were based on the oblLoad v.1.0 benchmark.

Typically, however, streaming throughput tests on Linux and Windows converge as reads on Windows reach 64KB. In our tests, which used the LSI Logic Ultra320 SCSI controller on both SLES and the Windows 2003 Server, Windows I/O throughput surpassed throughput on Linux at 16KB reads. At 64KB reads, throughput on the Windows 2003 Server was about 33% greater. It is important to note that most programs use 8KB reads. Typically, only system utilities such as backup software or an internal database utility use large-block reads in a Windows environment.

The reason for the better performance can be attributed to the way that Windows does file I/O. Rather than bundle requests, Windows fires off all of the requests and then waits for them to return asynchronously.

Conceptually, this is simulated in the oblDisk benchmark when it "chunks" the file into zones and uses a separate thread to read each zone in parallel. When we do this with four threads, throughput on Windows does not exceed that on Linux until the read size is greater than 32KB.
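The zoning idea can be sketched in Python as follows. This is a minimal illustration, not the oblDisk source: the function names, the equal-sized zones, and the 64KB default read size are assumptions. On a small or recently read file the operating system's cache will serve most reads, so the timing is only meaningful on files much larger than memory.

```python
import os
import tempfile
import threading
import time


def read_zone(path, start, length, block_size, totals, index):
    """Sequentially read one zone of the file in block_size pieces."""
    done = 0
    with open(path, "rb", buffering=0) as f:
        f.seek(start)
        while done < length:
            chunk = f.read(min(block_size, length - done))
            if not chunk:  # unexpected end of file
                break
            done += len(chunk)
    totals[index] = done


def zoned_read(path, threads=4, block_size=64 * 1024):
    """Split the file into equal zones and read each zone in its own thread.

    Returns (total bytes read, elapsed seconds).
    """
    size = os.path.getsize(path)
    zone = -(-size // threads)  # ceiling division so the zones cover the whole file
    totals = [0] * threads
    workers = []
    started = time.perf_counter()
    for i in range(threads):
        start = i * zone
        length = max(0, min(zone, size - start))
        t = threading.Thread(
            target=read_zone,
            args=(path, start, length, block_size, totals, i),
        )
        workers.append(t)
        t.start()
    for t in workers:
        t.join()
    return sum(totals), time.perf_counter() - started


if __name__ == "__main__":
    # Demo on a 4MB scratch file; a real test would target a large file on the RAID volume.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    nbytes, seconds = zoned_read(tmp.name, threads=4, block_size=32 * 1024)
    print(f"read {nbytes} bytes in {seconds:.4f}s with 4 threads")
    os.remove(tmp.name)
```

Each thread keeps one sequential stream outstanding, so four threads approximate four requests in flight at once.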

An important specification for the 15K Maxtor drives is a quick 3.2ms seek time. This was key to the outstanding performance measured by our oblLoad I/O loading benchmark.

Historically, this test, which measures the fulfillment of database-patterned I/O requests in a transaction-processing environment, has been devastating for Linux. The Linux kernel previously had no specialized support for asynchronous I/O, which is a strong point of Windows.

During the oblLoad benchmark, background thread processes are dispatched that issue their own unique 8KB data requests in a manner that simulates access patterns of a relational database. This continues until the average access time exceeds 100 milliseconds, which is the limit in formal TPC1 benchmarks. The results of oblLoad can then be evaluated in terms of total I/Os per second versus the total number of I/O-issuing processes, or in terms of the response time versus the number of I/Os per second being processed.
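A ramp-up test of this kind can be sketched in Python as shown below. This is a rough approximation, not the oblLoad implementation: uniformly random offsets stand in for a database access pattern, and the function names, step size, and requests per worker are assumptions. Only the 8KB request size and the 100ms ceiling come from the text.

```python
import os
import random
import threading
import time

REQUEST_SIZE = 8 * 1024  # 8KB requests, as in the text
LATENCY_LIMIT = 0.100    # 100ms average response-time ceiling, as in the text


def worker(path, file_size, requests, latencies, seed):
    """Issue random-offset 8KB reads and record each response time in seconds."""
    rng = random.Random(seed)
    blocks = max(1, file_size // REQUEST_SIZE)
    with open(path, "rb", buffering=0) as f:
        for _ in range(requests):
            offset = rng.randrange(blocks) * REQUEST_SIZE
            t0 = time.perf_counter()
            f.seek(offset)
            f.read(REQUEST_SIZE)
            latencies.append(time.perf_counter() - t0)  # list.append is atomic in CPython


def run_step(path, n_workers, requests_per_worker=200):
    """Run one load level; return (workers, IOPS, average response time in seconds)."""
    size = os.path.getsize(path)
    latencies = []
    threads = [
        threading.Thread(target=worker, args=(path, size, requests_per_worker, latencies, i))
        for i in range(n_workers)
    ]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = max(time.perf_counter() - t0, 1e-9)
    return n_workers, len(latencies) / elapsed, sum(latencies) / len(latencies)


def ramp(path, max_workers=64, step=8):
    """Add workers step by step until the average response time exceeds the limit."""
    results = []
    for n in range(step, max_workers + 1, step):
        result = run_step(path, n)
        results.append(result)
        if result[2] > LATENCY_LIMIT:
            break
    return results
```

Calling `ramp()` on a large file returns one tuple per load level, which can be plotted either as IOPS versus the number of workers or as response time versus IOPS, the two views described above.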

The ability of a system to complete more I/O requests as more I/O process threads are loaded is dependent upon data caching and thread load balancing.

With the LSI Logic controller, we were able to deliver more than 2,000 IOPS to more than 250 simulated users on SLES, which was more than double the I/O load supported by our Ultra160 SCSI reference subsystem. Surprisingly, this was greater than the number of I/Os per second processed by Windows 2003. Nonetheless, there was a quantifiable advantage garnered by the Windows 2003 Server in these tests. Under Windows 2003, we did measure a much more rapid buildup in the number of I/Os that could be processed in less than 8ms. With an 8ms response time, Windows 2003 was able to process about 40% more I/Os per second.

Jack Fegreus is the research director at CCI Communications (www.ccicommunications.com). He can be contacted at jfegreus@ccicommunications.com.
