Performance & Best Practices

Sun has demonstrated the first large-scale grid validation for SAS Grid Computing 9.2. The workload showcased both the strength of Solaris 10 Containers for ease of deployment and the value of the Sun Storage 7410 Unified Storage System with Fishworks Analytics for analyzing, tuning, and delivering performance in complex, data-intensive, multi-node SAS grid environments.

To model the real world, the Grid Endurance Test uses large data sizes and complex data processing, so the results reflect realistic customer scenarios. These benchmark results represent a significant engineering effort, with collaboration and coordination between SAS and Sun, and illustrate the commitment of the two companies to providing the best solutions for the most demanding data integration requirements.

A grid of seven Sun Fire X2200 M2 servers running Solaris 10, backed by a Sun Storage 7410 Unified Storage System, showed continued performance improvement as the node count increased from 2 to 7 for the Grid Endurance Test.

SAS environments are often complex. Ease of deployment, configuration,
use, and ability to observe application IO characteristics (hotspots,
trouble areas) are critical for production environments. The power
of Fishworks Analytics combined with the reliability of ZFS is a perfect
fit for these types of applications.

Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were migrated within minutes as hardware resources became available and the grid expanded.

This result is the only large-scale grid validation for SAS Grid Computing 9.2, and the first qualification of OpenStorage for SAS.

The tests showed delivered throughput of over 100 MB/s through a client's 1 Gb connection.

Configuration

The test grid consisted of eight Sun Fire X2200 M2 servers: one configured as the grid manager and seven as the actual grid nodes. Each node had a 1 GbE connection through a Brocade FastIron 1 GbE/10 GbE switch. The 7410 had a 10 GbE connection to the switch and served as the back-end storage, providing the common shared file system that SAS Grid Computing requires across all nodes. A storage appliance like the 7410 is an easy-to-set-up and easy-to-maintain solution that satisfies the bandwidth the grid demands. Our particular 7410 consisted of 46 x 700 GB 7200 RPM SATA drives, 36 GB of write-optimized SSDs, and 300 GB of read-optimized SSDs.

About the Test

The workload is a batch mixture. The CPU-bound workloads are numerically intensive tests, some using tables varying in row count from 9,000 to almost 200,000. The tables have up to 297 variables and are processed with both stepwise linear regression and stepwise logistic regression. Other computational tests use GLM (General Linear Model). The IO-intensive jobs vary as well. One test reads raw data from multiple files, then generates two SAS data sets, one containing over 5 million records and the other over 12 million. Another IO-intensive job creates a 50-million-record SAS data set, then performs lookups against it and finally sorts it into a dimension table. Finally, other jobs are both compute and IO intensive.

The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access, as can be viewed via Fishworks Analytics further below. The typical block size for IO is 32KB.
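As a rough illustration of this access pattern, the sketch below (hypothetical file size, not from the benchmark) reads a file sequentially in 32 KB blocks, the way SAS typically issues IO, and reports throughput:

```python
import os
import tempfile
import time

BLOCK_SIZE = 32 * 1024          # 32 KB, the typical SAS IO block size
FILE_SIZE = 16 * 1024 * 1024    # 16 MB sample file (hypothetical size)

def sequential_read_mb_per_s(path: str, block_size: int = BLOCK_SIZE) -> float:
    """Read a file sequentially in fixed-size blocks and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / (1024 * 1024)) / elapsed

# Create a throwaway data file and measure sequential read throughput.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(FILE_SIZE))
    path = tmp.name

print(f"sequential 32KB-block read: {sequential_read_mb_per_s(path):.1f} MB/s")
os.remove(path)
```

On a local file this mostly measures the page cache, of course; against an NFS mount from the 7410 it would exercise the network path described in this post.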

Governing the batch is the SAS Grid Manager scheduler, Platform LSF. It decides when to add a job to a node based on the number of open job slots (user defined) and a point-in-time sample of how busy the node actually is. From run to run, jobs end up scheduled somewhat randomly, making runs less predictable. Inevitably, multiple IO-intensive jobs get scheduled on the same node, saturating its 1 Gb connection and creating a bottleneck while other nodes do little to no IO. This is often unavoidable because of the great variety of behavior a SAS program can exhibit during its lifecycle. For example, a program can start out CPU intensive and be scheduled on a node already processing an IO-intensive job; that is the desired behavior and the correct decision at that point in time. However, the initially CPU-intensive job can then turn IO intensive as it proceeds through its lifecycle.
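The placement decision described above can be sketched in a few lines. This is a simplified illustration, not Platform LSF's actual algorithm; the node names, slot counts, and load values are made up:

```python
def place_job(nodes):
    """Pick the least-busy node that still has an open job slot.

    nodes: dict mapping node name -> (open_slots, load), where load is a
    point-in-time busyness sample (lower is less busy).
    Returns the chosen node name, or None if no slots are open anywhere.
    """
    candidates = [(load, name)
                  for name, (slots, load) in nodes.items() if slots > 0]
    if not candidates:
        return None
    return min(candidates)[1]

# node3 is nearly idle but has no open slots, so node2 wins.
nodes = {"node1": (2, 0.90), "node2": (1, 0.10), "node3": (0, 0.05)}
print(place_job(nodes))  # node2
```

The point-in-time nature of the sample is exactly what the post describes: a job that looks CPU bound at placement time can later turn IO intensive, and the scheduler has no way to anticipate that.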

Results of scaling up node count

Below is a chart of results scaling from 2 to 7 nodes. The metric is total run time from when the first job is scheduled until the last job completes.

Scaling of 400 Analytics Batch Workload

    Number of Nodes    Time to Completion
    2                  6hr 19min
    3                  4hr 35min
    4                  3hr 30min
    5                  3hr 12min
    6                  2hr 54min
    7                  2hr 44min

Note that time to completion does not scale linearly as node count increases. To a large extent this is due to the nature of the workload, as explained above regarding saturated 1 Gb connections. If this were a highly tuned benchmark with jobs placed with high precision, we certainly could have improved run times; however, we chose not to do this, in order to keep the batch as realistic as possible. On the positive side, run times continued to improve up through the 7th node.
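Using the times from the table above, the speedup and parallel efficiency relative to the 2-node run can be computed directly:

```python
# Elapsed times from the table above, in minutes, keyed by node count.
times = {2: 6 * 60 + 19, 3: 4 * 60 + 35, 4: 3 * 60 + 30,
         5: 3 * 60 + 12, 6: 2 * 60 + 54, 7: 2 * 60 + 44}

base_nodes = 2
for n in sorted(times):
    speedup = times[base_nodes] / times[n]     # vs the 2-node run
    efficiency = speedup / (n / base_nodes)    # fraction of ideal linear scaling
    print(f"{n} nodes: {speedup:.2f}x speedup, {efficiency:.0%} of linear")
```

The 7-node run comes out at roughly 2.3x the 2-node run, about two thirds of ideal linear scaling, which matches the saturation effects discussed above.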

The Fishworks Analytics displays below show several performance statistics for varying numbers of nodes, with more nodes on the left and fewer on the right. The first two graphs show file operations per second, and the third shows network bytes per second. The 7410 provides over 900 MB/s in the seven-node test. More information about the interpretation of the Fishworks data for these tests will be provided in a later white paper.

Impressively, the Fishworks Analytics capture above shows throughput of 763 MB/s during the sample period. That wasn't the top end of what the 7410 could provide: for the tests summarized in the table above, the 7-node run peaked at over 900 MB/s through a single 10 GbE connection. Clearly the 7410 can sustain a fairly high level of IO.

It is also important to note that while we did try to emulate a real-world scenario, with varying types of jobs and well over 1 TB of data being manipulated during the batch, this is a benchmark. The workload tries to encompass a large variety of job behavior, but your scenario may differ considerably from what was run here. Along with the scheduling issues, we were certainly seeing signs of pushing this 7410 configuration near its limits (with the SAS IO pattern and data set sizes), which also affected our ability to achieve linear scaling. But many grid environments run workloads that aren't very IO intensive and tend to be more CPU bound with minimal IO requirements. In that scenario, one could expect excellent node scaling well beyond what this batch demonstrated. To demonstrate this, the batch was rerun without the IO-intensive jobs. The remaining jobs do require some IO, but tend to be restricted to 25 MB/s or less per process, and only to initially read a data set or write results.

3 nodes ran in 120 minutes

7 nodes ran in 58 minutes

Very nice scaling: near linear, especially given the lag that can occur when scheduling batch jobs. The point of this exercise is to know your workload. In this case, the 7410 solution on the back end was more than up to the demands these 350+ jobs put on it, with room left to scale out more nodes and further reduce overall run time.
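The near-linear claim checks out with quick arithmetic on the two runs above:

```python
t3, t7 = 120, 58     # minutes: the 3-node and 7-node IO-light runs
speedup = t3 / t7    # observed speedup going from 3 to 7 nodes
ideal = 7 / 3        # ideal linear speedup for the same node increase
print(f"{speedup:.2f}x observed vs {ideal:.2f}x ideal "
      f"-> {speedup / ideal:.0%} efficiency")
```

Roughly 2.07x observed against an ideal 2.33x, or about 89% parallel efficiency, a marked improvement over the full batch with its IO-intensive jobs included.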

Tuning(?)

The question mark is appropriate. For the achieved results, after configuring a mirrored (RAID1) share on the 7410, only one parameter made a significant difference. During the IO-intensive periods, single 1 Gb client throughput was observed at 120 MB/s simplex and 180 MB/s duplex, producing well over 100,000 interrupts a second. Jumbo frames were enabled on the 7410 and the clients, reducing interrupts by almost 75% and reducing IO-intensive job run time by an average of 12%. Many other NFS, Solaris, and TCP/IP tunings were tried, with no meaningful improvement in microbenchmarks or the actual batch. A nice, relatively simple (for a grid) setup.
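A back-of-envelope calculation shows why jumbo frames help so much. It assumes roughly one interrupt per packet and approximate 1500-byte vs 9000-byte MTUs, so it slightly overstates the real reduction (~75% was observed, since NICs coalesce some interrupts):

```python
duplex_bytes_per_s = 180 * 1024 * 1024   # observed 180 MB/s duplex
standard_mtu, jumbo_mtu = 1500, 9000     # approximate payload bytes per frame

pkts_standard = duplex_bytes_per_s / standard_mtu   # ~126,000 packets/s
pkts_jumbo = duplex_bytes_per_s / jumbo_mtu         # ~21,000 packets/s
print(f"~{1 - pkts_jumbo / pkts_standard:.0%} fewer packets (and interrupts)")
```

Six times fewer frames for the same payload is an ~83% reduction in packet events, comfortably in line with the observed interrupt drop.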

Not a tuning per se, but an application change worth mentioning came from the visibility Analytics provides. Early in the load phase of the benchmark, the IO rate was less than spectacular: what should have taken about 4.5 hours was on pace to take almost a day. Drilling down through Analytics showed that hundreds of thousands of file opens and closes were occurring, which the development team had been unaware of. That was quickly fixed, and the data loader ran at the expected rates.

Okay - Really no other tuning? How about 10GbE!

Alright, so there was something else we tried, outside the test results achieved above. The X2200 we were using is an 8-core box. Even when maxing out the 1 Gb testing with multiple IO-bound jobs, there were still CPU resources left over. Considering that higher core counts and more memory are becoming the standard for a "client", it makes sense to utilize all those resources. For the case where a node is scheduled with multiple IO jobs, we wanted to see whether 10GbE could push up client throughput. Through our testing, two things helped improve performance.

The first was to turn off interrupt blanking. With blanking disabled, packets are processed when they arrive rather than waiting for an interrupt to be issued. Doing this resulted in a ~15% increase in duplex throughput. Caveat: there is a reason interrupt blanking exists, and it isn't to slow down your network throughput. Tune this only if you have a decent amount of idle CPU, because disabling interrupt blanking will consume it. The other change that produced a significant increase in throughput through the 10GbE NIC was to use multiple NFS client processes, which we achieved through zones. Adding a second zone increased throughput through the single 10GbE interface by ~30%. The final duplex numbers (also the peak throughput) were:

288MB/s no tuning

337MB/s interrupt blanking disabled

430MB/s 2 NFS client processes + interrupt blanking disabled
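The relative gains from each step can be read off those numbers (peak duplex MB/s, consistent with the ~15% and ~30% figures quoted above):

```python
baseline, no_blanking, two_zones = 288, 337, 430   # peak duplex MB/s from above

print(f"blanking disabled:    +{no_blanking / baseline - 1:.0%}")   # ~ +17%
print(f"second zone on top:   +{two_zones / no_blanking - 1:.0%}")  # ~ +28%
print(f"combined vs baseline: +{two_zones / baseline - 1:.0%}")     # ~ +49%
```

Together the two changes bought roughly half again as much client throughput through the single 10GbE interface.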

Conclusion - what does this show?

SAS Grid Computing, which requires a shared file system across all nodes, fits very nicely on the 7410 storage appliance. The workload continued to scale as nodes were added.

The 7410 can provide very solid throughput peaking at over 900MB/s (near 10GbE linespeed) with the configuration tested.

The 7410 is easy to set up and gives incredible insight into your application's IO, which can lead to optimization.

Know your workload: in many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price while providing the benefits described above (and others not described).

10GbE client networking can help if your 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU headroom.


About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.