Oracle Blog

Roch (rhymes with Spock) Bourbonnais: Kernel Performance Engineering

Wednesday, January 21, 2015

In the initial days of ZFS, some pointed out that ZFS resilvering was
metadata driven and therefore super fast: after all, we only had to
resilver data that was in use, whereas traditional storage has to
resilver the entire disk even if there is no actual data stored.
And indeed, on newly created pools, ZFS resilvering was super fast.

But of course storage pools rarely stay empty. So what happened when
pools grew to store large quantities of data? We basically had to
resilver most blocks present on a failed disk. So the advantage of
only resilvering what is actually present is, in real life, not much
of an advantage for ZFS.

And while ZFS based storage grew in importance, so did disk sizes.
The disk sizes that people put in production are growing very fast,
showing the appetite of customers for storing vast quantities of data. This
is happening despite the fact that those disks are not delivering
significantly more IOPS than their ancestors. So as time goes by,
a trend that has lasted forever, we have fewer and fewer IOPS
available to service a given unit of data. Here, ZFSSA storage arrays with
TB class caches are certainly helping the trend. Disk IOPS
don't matter as much as before because all of the hot data is cached
inside ZFS. So customers gladly trade off IOPS for capacity, given that
the ZFSSA delivers tons of cached IOPS and ultra cheap gigabytes of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it.
At that point, it is assured that we will be accessing all of the
data from the surviving disks in the raid group, and that
this is not a highly cached set. And here was the rub with old style ZFS
resilvering: the metadata driven algorithm was actually generating
small random IOPS. The old algorithm went through all of
the blocks file by file, snapshot by snapshot. When it found an
element to resilver, it would issue the IOPS necessary for that
operation. Because of the nature of ZFS, the layout of those
blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random reads
covering 100% of what was stored on the failed disk and
issue small random writes to the new disk coming in as a
replacement. With big disks and their very low IOPS ratings come ugly
resilvering times. That effect was also compounded by a deliberate design
balance that was strongly biased toward protecting the
application load. The compounded effect was month-long resilvering.

The Solution

To solve this, we designed a subtly modified version of
resilvering. We split the algorithm into two phases: a populating
phase and an iterating phase. The populating phase is mostly
unchanged from the previous algorithm except that, when encountering a block to
resilver, instead of issuing the small random IOPS, we append the block
to a new on-disk log. After having iterated through all of the metadata
and discovered all of the elements that need to be resilvered, we
can sort these blocks by physical disk offset and issue the I/Os in
ascending order. This in turn allows the ZIO subsystem to aggregate
adjacent I/Os more efficiently, leading to fewer, larger I/Os issued to
the disk. And by virtue of issuing I/Os in physical order, it
allows the disk to serve them at the streaming limit of the
disk (say 100MB/sec) rather than being IOPS limited (say 200
IOPS).
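
To make the mechanics concrete, here is a minimal sketch of the two phases, assuming a simple (offset, size) log; this illustrates the idea only and is not the actual ZFS resilvering code:

    /* Sketch of sequential resilvering: phase 1 logs the blocks to
     * repair, phase 2 sorts them by disk offset and issues the I/O
     * in ascending order so adjacent extents can be aggregated.
     * Illustrative only; not the actual ZFS code. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { unsigned long long offset; unsigned int size; } rs_entry_t;

    static int
    by_offset(const void *a, const void *b)
    {
        const rs_entry_t *x = a, *y = b;
        return ((x->offset > y->offset) - (x->offset < y->offset));
    }

    int
    main(void)
    {
        /* Phase 1: the metadata walk appends entries in discovery order. */
        rs_entry_t log[] = {
            { 9000000, 131072 }, { 4096, 131072 },
            { 135168, 131072 }, { 700000, 131072 },
        };
        size_t i, n = sizeof (log) / sizeof (log[0]);

        /* Phase 2: sort by physical offset, then issue in ascending order. */
        qsort(log, n, sizeof (rs_entry_t), by_offset);
        for (i = 0; i < n; i++)
            printf("repair I/O at offset %llu, size %u\n",
                log[i].offset, log[i].size);
        return (0);
    }

Once the log is sorted, the entries at offsets 4096 and 135168 become back-to-back reads that an aggregating I/O layer can merge into one larger request.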

So we now have a strategy that allows us to resilver nearly as fast as
physically possible on the given disk hardware. With that newly acquired
capability of ZFS comes the requirement to service the application load
with limited impact from resilvering. We therefore have a mechanism
to limit the resilvering load in the presence of application load. Our
stated goal is to be able to run through resilvering at 1TB/day (1TB
of data reconstructed on the replacing drive) even
in the face of an active workload.
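
To put that goal in perspective, 1TB/day works out to roughly 10^12 bytes / 86400 seconds, or about 11.6 MB/sec of sustained reconstruction. Against the ~100MB/sec streaming limit of a single disk, that pace leaves plenty of headroom for the application load.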

As disks are getting bigger and bigger, all storage vendors will see
increasing resilvering times. The good news is that, since Solaris
11.2 (and ZFSSA since 2013.1.2),
ZFS is now able to run resilvering at much the same disk
throughput limits as the rest of non-ZFS based storage.

Tuesday, December 2, 2014

The initial topic from my list is reARC. This is a major rearchitecture of the code that manages the ZFS in-memory cache, along with its interface to the DMU. The ARC is of course a key enabler of ZFS's high performance. As the scale of systems grows in memory size, CPU count, and frequency, some major changes were required for the ARC to keep up with the pace. reARC is such a major body of work that I can only talk about a few aspects of the Wonders of ZFS Storage here.

In this article, I describe how the reARC project improved at least these 7 important aspects of ARC operation:

Managing metadata

Handling ARC accesses to cloned buffers

Scalability of cached and uncached IOPS

Steadier ARC size under steady state workloads

Improved robustness for a more reliable code

Reduction of the L2ARC memory footprint

Finally, a solution to the long standing issue of I/O priority inversion

The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and is a testament to the importance of ARC operations to all other ZFS
subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:

"most recently used" (MRU)

"most frequently used" (MFU)

But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm
gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that, when we needed to adjust a list down, we just didn't have the necessary information to
perform the shrink. This led to the presence of an ugly tunable, arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly
sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used (LRU) scheme. A new access to some block, data or metadata, moves that block back to the head of the LRU list, well protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for
a sufficient amount of time, such that the block is now at the tail of an
eviction list, what is the argument that says I should protect
that block based on its state?

I came up blank on that question. If it hasn't been used, it can be evicted, period.
Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable as long as any of the block pointers they reference are currently in the ARC. In other words, if some data is in the cache, its metadata is also in the cache
and, furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.

And so, the reARC project actually took the same path. There is no more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, with shorter lock hold times during eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today, and it should be considered obsolete.
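
For instance, a legacy /etc/system entry along the following lines can now simply be deleted (the exact tunable name and value here are illustrative, and have varied across releases):

    * Legacy ARC metadata cap; obsolete with reARC; remove this line.
    set zfs:zfs_arc_meta_limit=0x200000000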

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is its unlimited snapshot capability. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today, where there is something new for the ARC. While we've always had the ability to read a block referenced from N different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU to ARC interfaces, we don't
have to keep multiple data copies. Multiple clones of the same data
share the same buffers for read accesses and new copies are only created for a write access.
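
The idea can be sketched as a reference-counted buffer with copy-on-write semantics; this is an illustration of the concept only, not the real DMU/ARC interface:

    /* Concept sketch: clones share one cached copy for reads; a private
     * copy is made only on write. Not the actual ARC interfaces. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct arc_buf {
        int  refcount;     /* number of references to this copy */
        char data[8192];   /* cached block contents */
    } arc_buf_t;

    /* A read from any clone just shares the existing buffer. */
    arc_buf_t *
    access_for_read(arc_buf_t *buf)
    {
        buf->refcount++;
        return (buf);
    }

    /* A write gets a private copy only when the buffer is shared. */
    arc_buf_t *
    access_for_write(arc_buf_t *buf)
    {
        if (buf->refcount == 1)
            return (buf);            /* sole owner: write in place */
        arc_buf_t *copy = malloc(sizeof (arc_buf_t));
        memcpy(copy->data, buf->data, sizeof (copy->data));
        copy->refcount = 1;
        buf->refcount--;             /* drop our share of the original */
        return (copy);
    }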
It has not escaped our notice that this N-way sharing has immense
consequences for virtualization technologies. The use of ZFS clones
(or writable snapshots) is just a great way to deploy a large number
of virtual machines. ZFS has always been able to store N clone copies
with zero incremental storage cost. But reARC takes this one step
further. As VMs are used, the in-memory caches that manage
multiple VMs no longer need to inflate, allowing the space savings to
be used to cache other data. This improvement allows Oracle to boast
an amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been
redesigned. One of the main functions of the ARC is to keep track of
accesses, such that the most recently used data moves to the head of
the list and the least recently used buffers make their way towards
the tail, where they are eventually evicted. The new design allows eviction to be performed using a separate set of locks from the set used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.
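
The separate-lock idea can be sketched with a classic two-lock queue (after Michael and Scott); this models the concept only and is not the actual ARC list code:

    /* An LRU-ordered queue with distinct insert and evict locks, so
     * inserts at the MRU end never contend with eviction at the LRU
     * end. Concept sketch only; not the ARC list code. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct node { struct node *next; void *buf; } node_t;

    typedef struct {
        pthread_mutex_t insert_lock;  /* serializes MRU-end inserts */
        pthread_mutex_t evict_lock;   /* serializes LRU-end eviction */
        node_t *mru;                  /* newest node */
        node_t *lru;                  /* dummy node before the oldest */
    } arc_list_t;

    void
    arc_list_init(arc_list_t *l)
    {
        node_t *dummy = calloc(1, sizeof (node_t)); /* keeps ends disjoint */
        l->mru = l->lru = dummy;
        pthread_mutex_init(&l->insert_lock, NULL);
        pthread_mutex_init(&l->evict_lock, NULL);
    }

    /* Record a fresh access: link the buffer in at the MRU end. */
    void
    arc_list_insert(arc_list_t *l, void *buf)
    {
        node_t *n = calloc(1, sizeof (node_t));
        n->buf = buf;
        pthread_mutex_lock(&l->insert_lock);
        l->mru->next = n;
        l->mru = n;
        pthread_mutex_unlock(&l->insert_lock);
    }

    /* Reclaim the least recently inserted buffer, or NULL if empty. */
    void *
    arc_list_evict(arc_list_t *l)
    {
        pthread_mutex_lock(&l->evict_lock);
        node_t *dummy = l->lru, *oldest = dummy->next;
        if (oldest == NULL) {
            pthread_mutex_unlock(&l->evict_lock);
            return (NULL);
        }
        void *buf = oldest->buf;
        l->lru = oldest;              /* oldest becomes the new dummy */
        pthread_mutex_unlock(&l->evict_lock);
        free(dummy);
        return (buf);
    }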

These changes were very important in removing long pauses in ARC
operations that hampered the previous implementation. Finally, the
main hash table was modified to use more locks, placed on separate
cache lines, improving the scalability of ARC operations. This led
to a boost in the cached and uncached maximum IOPS capabilities of the
ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new
model grows the ARC less aggressively when approaching memory pressure
and instead recycles buffers earlier. This recycling leads to a
steadier ARC size and fewer disruptive shrink cycles. If a changing
environment nevertheless requires the ARC to shrink, the amount by
which we shrink each time is reduced, making each shrink cycle less
stressful. Along with the reorganization of the ARC list
locking, this has led to a much steadier, more dependable ARC at high
loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to
signal read or write intent to the ARC. This, in turn, enables more
checks to be performed by the code, catching bugs earlier
in the process. A better separation of function between the DMU and
the ARC is critical for ZFS robustness, or hardening. In the new reARC
mode of operation, the ARC actually has the freedom to relocate
kernel buffers in memory between DMU accesses to a cached
buffer. This new feature proves invaluable as we scale to large memory
systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut this amount by more than 2X, down to
a bare minimum that now only requires about 80 bytes of metadata per
L2 buffer. With the arrival of larger SSDs for the L2ARC and a better
feeding algorithm, this reduced L2ARC footprint is a very significant
change for the Hybrid Storage Pool (HSP) storage model.
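
The flavor of the change can be sketched with two header layouts; the fields below are illustrative stand-ins, not the real ZFS structures:

    /* Concept sketch: a buffer resident in the primary ARC needs full
     * bookkeeping (the real header weighed about 170 bytes), while a
     * buffer living only on the L2ARC SSD needs just enough state to
     * find and verify it (about 80 bytes in the real code). The field
     * layouts here are illustrative, not the actual structures. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct full_hdr {
        uint64_t dva[2];          /* on-pool block address */
        uint64_t birth_txg;       /* transaction group of birth */
        uint64_t cksum[4];        /* block checksum */
        void    *data;            /* in-memory copy of the block */
        void    *hash_next;       /* hash table chaining */
        void    *list_next;       /* MRU/MFU list linkage */
        void    *list_prev;
        uint32_t size;
        uint32_t flags;
        uint32_t refcount;
    } full_hdr_t;

    typedef struct l2only_hdr {
        uint64_t dva[2];          /* identity, for hash lookup */
        uint64_t birth_txg;
        uint64_t l2_daddr;        /* device address on the cache SSD */
        uint32_t size;
        uint32_t flags;
        uint64_t cksum[4];        /* to verify the SSD copy on read */
    } l2only_hdr_t;

    int
    main(void)
    {
        printf("full header: %zu bytes, L2-only header: %zu bytes\n",
            sizeof (full_hdr_t), sizeof (l2only_hdr_t));
        return (0);
    }

Across the millions of buffers a large L2ARC device can hold, those ~90 saved bytes per header add up quickly.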

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetch I/Os, which the ZIO pipeline handled as lower priority operations than, for example, a regular read issued by an
application. Before reARC, the behavior was that, after an I/O prefetch was issued, a subsequent read of the data that arrived while the prefetch was still pending would block waiting on the low priority prefetch's completion.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirk. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.
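
In concept, the fix looks something like the following sketch (the function and helper names are hypothetical; this is not the ZIO pipeline code): a demand read that finds a matching prefetch already in flight boosts that I/O's priority instead of waiting at the back of the low-priority queue.

    /* Concept sketch of the priority-inversion fix; names are
     * hypothetical and this is not the actual ZIO pipeline code. */
    #include <stdio.h>

    typedef enum { PRIO_PREFETCH, PRIO_DEMAND } io_prio_t;
    typedef struct io { io_prio_t prio; int in_flight; } io_t;

    /* Stand-ins for the I/O scheduler hooks. */
    static void
    requeue(io_t *io)
    {
        printf("requeued at priority %d\n", io->prio);
    }

    static void
    wait_for(io_t *io)
    {
        (void) io;   /* would block until the I/O completes */
    }

    /* A demand read matching a pending prefetch boosts it rather
     * than waiting behind other low-priority I/O. */
    void
    demand_hit_pending_prefetch(io_t *io)
    {
        if (io->in_flight && io->prio < PRIO_DEMAND) {
            io->prio = PRIO_DEMAND;
            requeue(io);
        }
        wait_for(io);
    }

    int
    main(void)
    {
        io_t prefetch = { PRIO_PREFETCH, 1 };
        demand_hit_pending_prefetch(&prefetch);
        return (0);
    }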

Conclusion

The key points that we saw in reARC are as follows:

Metadata doesn't need special protection from eviction; arc_meta_limit has
become an obsolete tunable.

Multiple clones of the same data share the same buffers for great performance
in a virtualization environment.

We boosted ARC scalability for cached and uncached IOPS.

The ARC size is now steadier and more dependable.

Protection from creeping memory bugs is better.

L2ARC uses a smaller footprint.

I/Os are handled with more fairness in the presence of prefetches.

All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 releases and recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

Well, look who's back! After years of relative silence, I'd like to
put my blogging hat back on and update my patient readership about
the significant ZFS technological improvements that have been integrated
since Sun and ZFS became Oracle brands.
Since there is so much to cover, I tee up this series of articles with a short
description of 9 major performance topics that have evolved
significantly in the last years. Later, I will describe each
topic in more detail in individual blog entries.
Of course, these selected advancements are nowhere near an
exhaustive list. There have been over 650 changes to the ZFS code in
the last 4 years. My personal performance bias has selected the topics that I know best.
The designated topics are:

ZIL Pipelining

Allows the ZIL to carve up smaller units of
work for better pipelining and higher log device
utilisation.

It is the dawning of the age of the L2ARC

Not only did we make the L2ARC persistent on reboot,
we made the feeding process so much more efficient
we had to slow it down.

Zero Copy I/O Aggregation

A new tool delivered by the Virtual Memory team allows
the already incredible ZFS I/O aggregation feature to
actually do its thing using one less copy.

Scalable Reader/Writer locks

Reader/Writer locks, used extensively by ZFS and
Solaris, had their scalability greatly improved
on large systems.

New thread Scheduling class

ZFS transaction groups are now managed by a new type
of taskq which behaves better when managing bursts of CPU activity.

Concurrent Metaslab Syncing

The task of syncing metaslabs is now handled with more
concurrency, boosting ZFS write throughput capabilities.

Block Picking

The task of choosing blocks for allocation has been
enhanced in a number of ways, allowing us to work more
efficiently at a much higher pool capacity percentage.
There you have it. I'm looking forward to reinvigorating my blog so
stay tuned.

Tuesday, February 28, 2012

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark, SPC-1's profile is close to what a fixed block size DB
would actually be doing. See Fast
Safe Cheap : Pick 3 for more details on that result. Here, for an
encore, we're showing how the ZFS Storage Appliance performs
in a totally different environment: generic NFS file serving.

Does price/performance matter? It does, doesn't it? See what Darius has to say about how we compare to NetApp: Oracle posts SPEC SFS.

This is one step further in the direction of bringing our customers
true high performance unified
storage capable of handling blocks and files on the same
physical media. It's worth noting that provisioning of space between
the different protocols is entirely software based and fully dynamic,
that every stored element is fully checksummed, that all stored data can
be compressed with a number of different algorithms (including gzip),
and that both filesystems and block based luns can be snapshotted and
cloned at their own granularity. All these
manageability features are available to you in this high performance storage package.

Way to go, ZFS!

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation
Corporation (SPEC).
Results as of February 22, 2012; for more information see www.spec.org.

Monday, October 3, 2011

SPC-1: Twice the performance of NetApp at the same latency; half the $/IOPS

I'm proud to say that, yours truly, along with a lot of great
teammates in Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C
cluster achieved 137000 SPC-1
IOPS with an average latency of less than 10 ms. That is
double the result of NetApp's 3270A while delivering the same
latency. Compared to the NetApp 3270 result, this is a 2.5X
improvement in $/SPC-1-IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when
the ZFS Storage Appliance runs at the rate posted by the 3270A (68034
SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs
(9.16ms). Moreover, our result was obtained with 23700 GB of user
level capacity (internally mirrored) at $17.3/GB, while NetApp,
even using a space saving raid scheme, can only deliver $23.5/GB. This
is the price per GB of application data actually used in the
benchmark. On top of that, the 7420C still had 40% of space headroom,
whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible by the
availability of 15K RPM Hard Disk Drives (HDDs). Those are great for running
the most demanding databases because they combine a large IOPS
capability with a generally smaller capacity. That ratio of IOPS/GB
makes them ideal to store the high intensity databases modeled by SPC-1.
On top of that, this concerted engineering effort led to improved
software, and not just for systems running on 15K RPM drives. We actually used this benchmark to seek out ways to
increase the quality of our products. During the preparation runs, after an initial
diagnosis of some issue, we were committed to finding solutions that
did not target the idiosyncrasies of SPC-1 but were based on sound
design decisions. So instead of changing the default value of some
internal parameter to a new static default, we actually changed the way
the parameter worked, so that our storage systems of all types and sizes would benefit.

So not only are we getting a great SPC-1 result, but all existing
customers will benefit from this effort even if they are operating
outside of the intense conditions created by the benchmark.

So what is SPC-1? It is one of the few benchmarks that count for
storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple
databases running on a centralized storage system or storage cluster. But
even if SPC-1 is a block based benchmark, within the ZFS Storage
Appliance a block based FC or iSCSI volume is handled very much the
same way as a large file subject to synchronous operations.
And by combining modern network technologies (InfiniBand or 10GbE
Ethernet), the CPU power packed in the 7420C storage controllers, and
Oracle's custom dNFS technology for databases, one can truly achieve
very high database transaction rates on top of the more manageable and
flexible file based protocols.

The benchmark defines three Application Storage Units (ASUs): ASU1 with a heavy 8KB block
read/write component, ASU2 with a much lighter 8KB block read/write
component, and ASU3, which is subject to hundreds of write streams. As
such, it is not too far from a simulation of running hundreds of Oracle
databases on a single system: ASU1 and ASU2 for datafiles and ASU3
for redo log storage.

The total size of the ASUs is constrained such that all of the stored data
(including mirror protection and disks used as spares) must exceed 55%
of all configured storage. The benchmark team is then free to decide
how much total storage to configure. From that figure, 10% is given to
ASU3 (redo log space) and the rest is divided equally between the heavily
used ASU1 and the lightly used ASU2.
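
A worked example, using a hypothetical 100 TB of configured storage:

    configured storage                100 TB   (hypothetical)
    minimum stored, incl. mirroring    55 TB   (the 55% rule)
    user-level ASU capacity          ~27.5 TB  (mirrored)
    ASU3, redo logs (10%)             ~2.75 TB
    ASU1 and ASU2 (45% each)         ~12.4 TB each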

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run.
This is not a light decision, given that you want to balance high IOPS, low
latency, and $/user-GB.

Once the target IOPS rate is selected, there are multiple criteria
needed to pass a successful audit; one of the most critical is that
you have to run at the specified IOPS rate for a whole 8 hours. Note
that the previous specification of the benchmark, used by NetApp,
called for a 4 hour run. During that 8 hour run delivering a solid
137000 SPC-1 IOPS, the average latency must be less than 30ms
(we did much better than that).

After this brutal 8 hour run, the benchmark then enters another critical
phase: the workload is restarted (using a new randomly selected working set)
and performance is measured for a 10 minute period. It is this 10 minute
period that decides the official latency of the run.

When everything is said and done, you pull the trigger, go to sleep,
and wake up to the result. As you can guess, we were ecstatic that
morning. Before that glorious day, for lack of a stronger word, a lot
of hard work had been done during the extensive preparation runs. With
little time, and normally not all of the hardware, one runs through a
series of runs at incremental loads, making educated guesses as to how
to improve the result. As you get more hardware, you scale up the
result, tweaking things more or less until the final hour.

SPC-1, with its requirement of less than 45% of unused space, is
designed to trigger many disk level random read IOPS. Despite this
inherently random pattern of the workload, we saw that our extensive
caching architecture was as helpful for this benchmark as it is in
real production workloads. While a 15K RPM HDD normally levels off
at a random operation rate slightly above 300 IOPS, our 7420C, as
a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.

In the end, one of the most satisfying aspects was to see that
the data being managed by ZFS was stored rock solid on disk and properly
checksummed, that all data could be snapshotted and compressed on demand,
and that the system delivered impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower
latency, 30% cheaper per user GB with room to grow... So, if you have
a storage decision coming and you need FAST, SAFE, CHEAP: pick 3, and
take a fresh look at the ZFS Storage Appliance.

Wednesday, May 26, 2010

Recall that I had LUN alignment on my mind a few weeks ago.
There is nothing special about the ZFS Storage Appliance compared to any other storage here: pay attention to how you partition your LUNs, as it can have a great impact on performance.
Right, Roch?

The Sun Storage 7410 Unified Storage Array delivers over 22000
8K synchronous writes per second, combining great DB
performance with the ease of deployment of Network Attached Storage,
while delivering the economic benefits of inexpensive SATA disks.

The Sun Storage 7410 Unified Storage Array delivers over 36000
random 8K reads per second from a 400GB working set for great mail application
responsiveness. This corresponds to an enterprise of 100000 people,
with every employee accessing new data every 3.6 seconds, consolidated
on a single server.

All those numbers characterize a single head of the clusterable 7410
technology. The 7000 clustering technology stores all data in dual
attached disk trays, and no state is shared between cluster heads
(see Sun 7000 Storage clusters). This
means that an active-active cluster of 2 healthy 7410s will deliver 2X
the performance posted here.

Also note that the performance posted here represents what is achieved
under a very tightly defined, constrained
workload (see Designing 11 Storage metric), and it does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise done extremely well
by my friend Brendan.

Measurement Method

To measure our performance we used the open source Filebench tool,
accessible from SourceForge (Filebench
on solarisinternals.com). Measuring the performance of NAS storage
is not an easy task. One has to deal with the client side cache, which
needs to be bypassed, the synchronization of multiple clients, and the
presence of client side page flushing daemons, which can turn asynchronous
workloads into synchronous ones. Because our Storage 7000 line
can have such large caches (up to 128GB of RAM and more than 500GB of secondary cache), and we wanted to test disk responses, we
needed to find backdoor ways to flush those caches on the servers. Read
Amitabha's Filebench
Kit entry on the topic, in which he posts a link to the toolkit
used to produce the numbers.

We recently released our first major software update, 2009.Q2, and along with it a new lower cost clusterable 96 TB storage system, the 7310.

We report here the numbers for a 7310 running the latest software release,
compared to those previously obtained for the 7410, 7210, and 7110
systems, each attached to a pool of 18 to 20 clients over a single 10GbE
interface using regular 1500-byte Ethernet frames. By the way, looking at
Brendan's results above, I encourage you to upgrade to Jumbo Frame
Ethernet for even more performance, and note that our servers can drive
two 10GbE ports at line speed.

The newly released 7310 was tested with the most recent software revision, and that certainly gives the 7310 an edge over its peers.
The 7410, on the other hand, was measured here managing a much larger contingent of storage, including mirrored Logzillas and 3 times as many JBODs, and that is
expected to account for some of the performance delta being observed.

Metrics

Each test below is listed with its short name (used in the results table) in parentheses:

1 thread per client streaming cached reads (Stream Read light)
1 thread per client streaming cold cache reads (Cold Stream Read light)
10 threads per client streaming cached reads (Stream Read)
20 threads per client streaming cold cache reads (Cold Stream Read)
1 thread per client streaming write (Stream Write light)
20 threads per client streaming write (Stream Write)
128 threads per client 8K synchronous writes (Sync write)
128 threads per client 8K random read (Random Read)
20 threads per client 8K random read on cold caches (Cold Random Read)
8 threads per client 8K small file create IOPS (Filecreate)

There are 6 read tests, 2 write tests, and 1 synchronous write test
that overwrites its data files as a database would. A final
filecreate test completes the metrics. Tests execute against a 20GB
working set _per client_ times 18 to 20 clients. There are 4 sets used
in total, running over independent shares, for a total of 80GB per
client. So before actual runs are taken, we create all the working sets,
or 1.6 TB of precreated data. Then before each run, we clear all
caches on the clients and the server.

In each of the 3 groups of 2 read tests, the first one benefits from
no caching at all, and the throughput delivered to the client over the
network is observed to come from disk. That test runs for N seconds,
priming data in the storage caches. A second run (non-cold) is
then started after clearing the client side caches. Those tests will
see 100% of the data delivered over the network link, but not all
of it is coming off the disks. Streaming tests will race through the
cached data and then finish off reading from disks. The random read
test can also benefit from increasingly cached responses as the test
progresses. The exact caching characteristics of the 7000 line will
depend on a large number of parameters, including your application
access pattern. The numbers here reflect the performance of a fully
randomized test over 20GB per client x 20 clients, or a 400GB working
set. Upcoming studies will include more data (showing even higher
performance) for workloads with higher cache hit ratios than those used
here.

In a Storage 7000 server, disks are grouped together in one pool and
then individual shares are created. Each share has access to all disk
resources, subject to any quota (a maximum) and reservation (a minimum)
that might be set. One important setup parameter associated with each share
is the DB record size. It is generally better for IOPS tests to use 8K
records and for streaming tests to use 128K records. The recordsize can
be dynamically set based on expected usage, as the example below shows.
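
On a plain ZFS system the equivalent knob is the share's recordsize property (the pool/share names below are placeholders):

    # zfs set recordsize=8k pool/dbshare        (8K records for IOPS tests)
    # zfs set recordsize=128k pool/streamshare  (128K records for streaming)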

The tests shown here were obtained with NFSv4, the default for Solaris clients (NFSv3 is expected to
come out slightly better). The
clients were running Solaris 10, with tcp_recv_hiwat tuned to 400K and
dopageflush set to 0 to prevent buffered writes from being converted into
synchronous writes.
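
For reference, that client-side tuning would look roughly like this on Solaris 10 (illustrative; consult your release's documentation):

    # ndd -set /dev/tcp tcp_recv_hiwat 400000

    and in /etc/system:

    * Keep the page flushing daemon from turning buffered writes
    * into synchronous ones.
    set dopageflush=0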

Compared Results of the 7000 Storage Line

NFSv4 Test                 7410 Head        7310 Head        7210 Head        7110 Head
                           Mirrored Pool    Mirrored Pool    Mirrored Pool    3+1 Raid-Z

Throughput
Cold Stream Read light      915 MB/sec       685 MB/sec       719 MB/sec       378 MB/sec
Stream Read light          1074 MB/sec       751 MB/sec       894 MB/sec       416 MB/sec
Cold Stream Read            959 MB/sec       598 MB/sec       752 MB/sec       329 MB/sec
Stream Read                1030 MB/sec       620 MB/sec       792 MB/sec       386 MB/sec
Stream Write light          480 MB/sec       507 MB/sec       490 MB/sec       226 MB/sec
Stream Write                447 MB/sec       526 MB/sec       481 MB/sec       224 MB/sec

IOPS
Sync write                22383 IOPS        8527 IOPS       10184 IOPS        1179 IOPS
Filecreate                 5162 IOPS        4909 IOPS        4613 IOPS         162 IOPS
Cold Random Read          28559 IOPS        5686 IOPS        4006 IOPS        1043 IOPS
Random Read               36478 IOPS        7107 IOPS        4584 IOPS        1486 IOPS

Per Spindle IOPS           272 spindles      86 spindles      44 spindles      12 spindles
Cold Random Read            104 IOPS          76 IOPS          91 IOPS          86 IOPS
Random Read                 134 IOPS          94 IOPS         104 IOPS         123 IOPS

Analysis

The data shows that the entire Sun Storage 7000 line is a throughput
workhorse, delivering 10 Gbps level NAS services per cluster head
node, using a single network interface and a single IP address for easy
integration into your existing network.

As with other storage technologies, write streaming requires
more involvement from the storage controller, and this leads to about
50% less write throughput compared to read throughput.

The use of write optimized SSDs in the 7410, 7310, and 7210 also gives
this storage very high synchronous write capability. This is one of
the most interesting results, as it maps to database performance. The ability to
sustain 24000 O_DSYNC writes at 192MB/sec of synchronized user data
using only 48 inexpensive SATA disks and 3 write optimized SSDs is one
of the many great performance characteristics of this novel storage
system.

Random read tests generally map directly to individual disk
capabilities and are a measure of total disk rotations. The cold runs
show that all our platforms are delivering data at the expected 100
IOPS per spindle for these SATA disks. Recall that our offering is
based on economical, energy efficient 7.2K RPM disk technology. For
cold random reads, a mirrored pair of 7.2K RPM disks offers the same
total disk rotations (and IOPS) as an expensive and power hungry 15K
RPM disk, but in a much more economical package.

Moreover, the difference between the warm and cold random read runs
shows that the Hybrid Storage Pool (HSP) provides a 30% boost
even on this workload, which randomly addresses a 400GB working set with
128GB of controller cache. The effective boost from the HSP can be
much greater depending on the cacheability of the workload.

If we consider an organization in which the average mail message is 8K
in size, our results show that we could consolidate 100000 employees on
a single 7410 storage system, with each employee accessing new data every
3.6 seconds at a 70ms response time.
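
The arithmetic behind that sizing: 100000 employees each touching one 8K message every 3.6 seconds amounts to 100000 / 3.6, or roughly 27800 random 8K reads per second, which sits right between the cold (28559) and warm (36478) random read results above.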

Messaging systems are also big consumers of file creations. I've shown
in the past how efficient ZFS can be at creating small files
(Need Inodes ?). For the NFS protocol,
file creation is a straining workload, but the 7000 storage line comes
out not too bad, with more than 5000 file creates per second per storage
controller.

Conclusion

Performance can never be summarized with a few numbers, and we have
just begun to scratch the surface here. The numbers presented here,
along with the disruptive pricing of the Hybrid Storage Pool, will, I
hope, go a long way toward showing the incredible power of the Open
Storage architecture being proposed. And keep in mind that this
performance is achievable using less expensive, less power hungry SATA
drives, and that every data service (NFS, CIFS, iSCSI, FTP, HTTP, etc.)
offered by our Sun Storage 7000 servers is available at zero additional
software cost to you.

Monday, November 10, 2008

I see many reports about running campaigns of tests measuring
performance over a test matrix. One problem with this approach is, of
course, the matrix.
That matrix is never big enough for the consumer of the information ("can
you run this instead?").

A more useful approach is to think in terms of performance
invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS
as an invariant, and disks will have throughput limits such as
80MB/sec. Thinking in terms of those invariants helps in extrapolating
performance data (with caution), and observing breakdowns in invariants
is often a sign that something else needs to be root caused.

So using our 11 metrics and our performance engineering effort, what can be
our guiding invariants? Bear in mind that these are
rough estimates. For real measured numbers check out Amitabha
Banerjee's excellent post on
Analyzing the Sun Storage 7000.

Streaming: 1 GB/s on the server and 110 MB/sec per client

For read streaming, we're observing that 1GB/s is roughly our
guiding number. This can be achieved with a fairly
small number of clients and threads, but will be easier to reach if the
data is prestaged in the server caches. A client normally running a 1GbE
network card is able to extract 110 MB/sec rather easily. Read
streaming will be easier to achieve with the larger 128K records,
probably due to the lower CPU demand. While our results are with
regular 1500-byte Ethernet frames, using jumbo frames will also make
this limit easier to reach or even break. For a mirrored pool, data
needs to be sent twice to the storage, and we see a reduction of about
50% for write streaming workloads.

Random read I/Os per second: 150 random read IOPS per mirrored disk

This is probably a good guiding light also. When going to disks, that
will be a reasonable expectation. But here caching can radically
change things. Since we can configure up to 128GB of host RAM and 4
times that much secondary cache, there are opportunities to break
this barrier. But when going to spindles, it needs to be kept under
consideration. We also know that raid-z spreads records across all
disks, so the 150 IOPS limit basically applies to
raid-z groups. Do plan to have many groups to service random reads.

In some instances, data evicted from main memory will be kept
in the secondary caches. Small files and filesystems with a tuned recordsize are
good target workloads for this. Those read-optimized SSDs can restitute this
data at a rate of 3100 IOPS (see L2 ARC). More
importantly, they can do so at much reduced latency, meaning that
lightly threaded workloads will be able to achieve high throughput.

Synchronous writes: 5000-8000 synchronous writes per second per write optimized SSD

Synchronous writes can be generated by an O_DSYNC write (database) or
just as part of the NFS protocol (such as the tar extract:
open, write, close workloads). Those will reach the NAS server and be
coalesced in a single transaction with the separate intent log. Those
SSD devices are great latency accelerators, but they are still devices with
a max throughput of around 110 MB/sec. However, our code actually
detects when the SSD devices become the bottleneck and will
divert some of the I/O requests to the main storage pool. The net
of all this is a complex equation, but we've easily observed 5000-8000
synchronous writes per SSD, up to 3 devices (or 6 in mirrored pairs).
Using a smaller working set, which creates less competition for CPU
resources, we've even observed 48K synchronous writes per second.

Cycles per byte: 30-40 cycles per byte for NFS and CIFS

Once we include the full NFS or CIFS protocol, the efficiency was
observed to be in the 30-40 cycles per byte range (8 to 10 of those coming
from the pure network component at the regular 1500-byte MTU). More
studies are required to figure out the extent to which this is valid,
but it's an interesting way to look at the problem. Having to run
disk I/O, versus being serviced directly from cached data, is expected to
exert an additional 10-20 cycles per byte. Obviously, for metadata tests,
in which a small number of bytes is transferred per operation, we probably
need to come up with a cycles/MetaOps invariant, but that is still TBD.
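
To turn the invariant into a throughput estimate, assume a hypothetical 3 GHz core: at 35 cycles per byte it can push about 3x10^9 / 35, or roughly 85 MB/sec, of NFS or CIFS traffic, so sustaining 1 GB/s takes on the order of a dozen cores' worth of cycles.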

Single client NFS throughput: 1 TCP window per round trip

This is one fundamental rule of network throughput, but it's a
good occasion to refresh it in everyone's mind. Clients, at least
Solaris clients, will establish a single TCP connection to a server.
On that connection there can be a large number of unrelated requests,
as NFS is a very scalable protocol. However, a single connection will
transport data at a maximum speed of one "socket buffer" divided by the
round trip latency. Since today's network speeds, particularly in wide
area networks, have grown somewhat faster than default socket buffers,
we can see such things become performance bottlenecks. Now, given that
I work in Europe but my test systems are often located in California,
I might be a little more sensitive than most to this fact. So one
important change we made early on in this project was to simply bump
up the default socket buffers in the 7000 line to 1MB. For
read throughput under similar conditions, we can only advise you to do
the same for your client infrastructure.
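
The rule in numbers: with the 1MB socket buffer, a 1 ms round trip allows up to 1MB / 0.001 s, or about 1 GB/sec, on a single connection, while a 100 ms intercontinental round trip caps that same connection at 1MB / 0.1 s, or 10 MB/sec. With a default socket buffer of a few tens of KB, that long-haul connection would be capped well below 1 MB/sec.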