Oracle Blog

Roch (rhymes with Spock) Bourbonnais: Kernel Performance Engineering

Friday, February 20, 2015

The third topic on my list of improvements since 2010
is ZIL pipelining:

Allow the ZIL to carve up smaller units of
work for better pipelining and higher log device
utilization.

So let's remind ourselves of a few things about the ZIL and why it's
so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in
order to speed up synchronous operations such as O_DSYNC writes or
fsync(3C) calls. Since most database operations involve
synchronous writes, it's easy to understand that having good ZIL
performance is critical in many environments.

It is well understood that a ZFS pool updates its global on-disk state at
a set interval (5 seconds these days). The ZIL is what
keeps track of information in between those transaction groups (TXGs). The ZIL
records what has been committed to stable storage from a user's point of
view. Basically, the last committed TXG plus a replay of the ZIL is the valid
storage state from a user's perspective.

The on-disk ZIL is a linked list of records which is actually only
useful in the event of a power outage or system crash. As part of a
pool import, the on-disk ZIL is read and operations replayed such that
the ZFS pool contains the exact information that had been committed
before the disruption.

While we often think of the ZIL as its on-disk representation (its
committed state), the ZIL is also an in-memory representation of every
POSIX operation that needs to modify data. For example, a file
creation, even though it is an asynchronous operation, needs to be tracked
by the ZIL. This is because any asynchronous operation may, at any
point in time, need to be committed to disk; this is often due to an
fsync(3C) call. At that moment, every pending operation on a given
file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored?

Well, that's also more complex than it sounds. ZFS manages devices
specifically geared to store ZIL blocks; those separate log devices,
or slogs, are very often flash SSDs. However, the ZIL is not
constrained to only using blocks from slog devices; it can store data
on main (non-slog) pool devices. When storing ZIL information on the non-slog pool devices, the ZIL has a
choice of recording data inside ZIL blocks or recording full file
records inside pool blocks and storing a reference to them inside the
ZIL. This last method for storing ZIL blocks has the benefit of
offloading work from the upcoming TXG sync at the expense of higher
latency, since the ZIL I/Os are being sent to rotating disks. This
mode is the one used with logbias=throughput. More on that below.
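
For illustration, logbias is a per-dataset ZFS property, so the mode described above can be selected per share or LUN; the pool and dataset names below are hypothetical:

zfs set logbias=throughput tank/dbfiles    # record data in pool blocks, offload the slog
zfs set logbias=latency tank/redo          # the default: commit through the slog for low latency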

Net net: the ZIL records data in stable storage in a linked list, and
user applications have synchronization points at which they choose to
wait on the ZIL to complete its operation.

When things are not stressed, operations show up at the ZIL, wait a
little bit while the ZIL does its work, and are then
released. Latency of the ZIL is then coherent with the underlying device
used to capture the information. If this rosy picture were the whole
story, we would not have needed this train project.

At times, though, the system can get stressed. The older mode of operation of
the ZIL was to issue a ZIL transaction (implemented by the ZFS function
zil_commit_writer) and, while that was going on, build up the next ZIL
transaction with everything that showed up at the door. Under stress,
when a first operation was serviced with high latency, the next
transaction would accumulate many operations, growing in size and thus
leading to an even longer-latency transaction, and this would spiral out of
control. The system would automatically divide into 2 ad-hoc sets of
users: one set of operations would commit together as a group,
while all other threads in the system would form the next
ZIL transaction, and vice-versa.

This led to bursty activity on the ZIL devices, which meant that, at
times, they would go unused even though they were the critical
resource. This 'convoy' effect also meant disruption of servers,
because when those large ZIL transactions did complete, 100s or 1000s of
user threads might see their synchronous operation complete and all
would end up flagged as 'runnable' at the same time. Often those would want to
consume the same resource, run on the same CPU, or use the same lock,
etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down
convoys into smaller units and dispatch them as smaller ZIL-level
transactions which are then pipelined through the entire data path.

With logbias set to throughput, the new code attempts to group ZIL
transactions into sets of approximately 40 operations, which is a
compromise between efficient use of the ZIL and reduction of the convoy effect.
For other types of synchronous operations, we group them into sets
representing about ~32K of data to sync. That means that a single
sufficiently large operation may run by itself, but more threads will
group together if their individual commit sizes are small.

The ZIL train is thus expected to handle bursts of synchronous activity
with a lot less stress on the system.

The THROUGHPUT VS LATENCY debate.

As we just saw, the ZIL provides 2 modes of operation: the throughput
mode and the default latency mode. The throughput mode is named as
such not so much because it favors throughput but because it
doesn't care too much about individual operation latency. The
implied corollary of throughput-friendly workloads is that they are
highly concurrent (100s or 1000s of independent operations) and
are therefore able to reach high throughput even when served at high
latency. The goal of providing a ZIL throughput mode is
actually to free up slog devices from having to handle such highly
concurrent workloads and allow those slog devices to concentrate on
serving other low-concurrency, but highly latency-sensitive,
operations.

For Oracle DB, we therefore recommend the use of logbias set to
throughput for DB files which are subject to highly concurrent DB
writer operations while we recommend the use of the default latency mode
for handling other latency sensitive files such as the redo log. This
separation is particularly important when redo log latency is very
critical and when the slog device is itself subject to stress.

When using Oracle 12c with dNFS and OISP, this best practice is
automatically put into place. In addition to proper logbias handling,
DB data files are created with a ZFS recordsize matching the
established best practice: ZFS recordsize matching the DB blocksize for
datafiles; ZFS recordsize of 128K for redo logs.

When setting up a DB, with or without OISP, there is one thing that
storage administrators must enforce: they must segregate redo log
files into their own filesystems (also known as shares or
datasets), as shown in the sketch below. The reason for this is that the ZIL is a single linked list
of transactions maintained by each filesystem (other filesystems run
their own ZIL independently). And while the ZIL train allows
multiple transactions to be in flight concurrently, there is a
strong requirement for completion of the transactions and notification
of waiters to be handled in order. If one were to mix data files and
redo log files in the same ZIL, then some redo transactions would be
linked behind some DB writer transactions. Those critical redo
transactions committing in latency mode to a slog device would see
their I/O complete quickly (100us timescale) but nevertheless have to
wait for an antecedent DB writer transaction committing in throughput
mode to regular spinning disk devices (ms timescale). In order to avoid
this situation, one must ensure that redo log files are stored in
their own shares.
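
To make this concrete, here is a minimal sketch of such a layout (the pool name, share names and the 8K DB blocksize are assumptions for the example); the key points are the separate redo share and the per-dataset recordsize and logbias settings:

# datafiles: recordsize matching the DB blocksize, ZIL in throughput mode
zfs create -o recordsize=8k -o logbias=throughput dbpool/datafiles

# redo logs: their own share, 128K recordsize, default latency mode using the slog
zfs create -o recordsize=128k -o logbias=latency dbpool/redo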

Wednesday, January 21, 2015

In the initial days of ZFS, some pointed out that ZFS resilvering was
metadata driven and was therefore super fast: after all, we only had
to resilver data that was in use, compared to traditional storage that
has to resilver the entire disk even if there is no actual data stored.
And indeed, on newly created pools, ZFS was super fast at resilvering.

But of course storage pools rarely stay empty. So what happened when
pools grew to store large quantities of data? Well, we basically had to
resilver most blocks present on a failed disk. So the advantage of
only resilvering what is actually present is not much of an advantage,
in real life, for ZFS.

And while ZFS based storage grew in importance, so did disk sizes.
The disk sizes that people put in production are growing very fast,
showing the appetite of customers to store vast quantities of data. This
is happening despite the fact that those disks are not delivering
significantly more IOPS than their ancestors. As time goes by,
following a trend that has lasted forever, we have fewer and fewer IOPS
available to service a given unit of data. Here ZFSSA storage arrays with
TB-class caches are certainly helping the trend. Disk IOPS
don't matter as much as before because all of the hot data is cached
inside ZFS. So customers gladly trade off IOPS for capacity, given that
the ZFSSA delivers tons of cached IOPS and ultra cheap GBs of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it.
It is assured at that point that we will be accessing all of the
data from surviving disks in the RAID group and that
this is not a highly cached set. And here was the rub with old-style ZFS
resilvering: the metadata-driven algorithm was actually generating
small random IOPS. The old algorithm went through all of
the blocks file by file, snapshot by snapshot. When it found an
element to resilver, it would issue the IOPS necessary for that
operation. Because of the nature of ZFS, the populating of those
blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random IOPS
covering 100% of what was stored on the failed disk and
issue small random writes to the new disk coming in as a
replacement. With big disks and very low IOPS ratings come ugly resilvering times. That effect was also
compounded by a voluntary design balance that was strongly biased to protect
application load. The compounded effect was month-long resilvering.

The Solution

To solve this, we designed a subtly modified version of
resilvering. We split the algorithm into two phases: the populating
phase and the iterating phase. The populating phase is mostly
unchanged from the previous algorithm except that, when encountering a block to
resilver, instead of issuing the small random IOPS, we append it to a new
on-disk log. After having iterated through all of the metadata
and discovered all of the elements that need to be resilvered, we can
sort these blocks by physical disk offset and issue the I/Os in
ascending order. This in turn allows the ZIO subsystem to aggregate
adjacent I/Os more efficiently, leading to fewer, larger I/Os issued to
the disk. And by virtue of issuing I/Os in physical order, it
allows the disk to serve these IOPS at the streaming limit of the
disk (say 100MB/sec) rather than being IOPS limited (say 200
IOPS).

So we now have a strategy that allows us to resilver nearly as fast as
physically possible given the disk hardware. With that newly acquired
capability of ZFS comes the requirement to service application load
with a limited impact from resilvering. We therefore have mechanisms
to limit resilvering load in the presence of application load. Our
stated goal is to be able to run through resilvering at 1TB/day (1TB
of data reconstructed on the replacing drive, which works out to roughly
12MB/sec sustained) even in the face of an active workload.

As disks get bigger and bigger, all storage vendors will see
increasing resilvering times. The good news is that, since Solaris
11.2, and on the ZFSSA since 2013.1.2,
ZFS is now able to run resilvering with much the same disk
throughput limits as the rest of non-ZFS based storage.

Tuesday, December 2, 2014

The initial topic from my list is reARC. This is a major rearchitecture of the code that manages the ZFS in-memory cache along with its interface to the DMU. The ARC is of course a key enabler of ZFS's high performance. As systems grow in memory size, CPU count and frequency, some major changes were required for the ARC to keep up with the pace. reARC is such a major body of work that I can only talk about a few aspects of it in this Wonders of ZFS Storage entry.

In this article, I describe how the reARC project had an impact on at least these 7 important aspects of its operation:

Managing metadata

Handling ARC accesses to cloned buffers

Scalability of cached and uncached IOPS

Steadier ARC size under steady state workloads

Improved robustness for a more reliable code

Reduction of the L2ARC memory footprint

Finally, a solution to the long standing issue of I/O priority inversion

The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS
subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:

"most recently used" (MRU)

"most frequently used" (MFU)

But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm
gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that when we needed to adjust a list down, we just didn't have the necessary information to
perform the shrink. This led to the presence of an ugly tunable, arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly
sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used scheme (LRU). New access to some block, data or metadata, puts that block back to the head of the LRU list, very much protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for
a sufficient amount of time, such that the block is now at the tail of an
eviction list, what is the argument that says I should protect
that block based on its state?

I came up blank on that question. If it hasn't been used, it can be evicted, period.
Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable as long as any of the block pointers they reference are currently in the ARC. In other words, if some data is in the cache, its metadata is also in the cache
and, furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.

And so, the reARC project actually took the same path: no more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, with benefits such as shorter lock hold times during eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.
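
As an assumption about how such legacy tuning usually shows up on Solaris, it would be an /etc/system line along these lines; the advice above amounts to removing (or commenting out) the entry and rebooting:

* legacy ARC metadata cap; per the advice above, consider removing this line
set zfs:zfs_arc_meta_limit=0x200000000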

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is its infinite snapshot capability. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today, where there is something new for the ARC. While we've always had the ability to read a block referenced from N different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU-to-ARC interfaces, we don't
have to keep multiple data copies. Multiple clones of the same data
share the same buffers for read accesses, and new copies are only created for write accesses.
It has not escaped our notice that this N-way sharing has immense
consequences for virtualization technologies. The use of ZFS clones
(or writable snapshots) is just a great way to deploy a large number
of virtual machines. ZFS has always been able to store N clone copies
with zero incremental storage cost. But reARC takes this one step
further. As VMs are used, the in-memory caches that are used to manage
multiple VMs no longer need to inflate, allowing the space savings to
be used to cache other data. This improvement allows Oracle to boast
the amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been
redesigned. One of the main functions of the ARC is to keep track of
accesses, such that the most recently used data is moved to the head of
the list and the least recently used buffers make their way towards
the tail, where they are eventually evicted. The new design allows eviction to be performed using a separate set of locks from the set used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.

These changes were very important in removing long pauses in ARC
operations that hampered the previous implementation. Finally, the
main hash table was modified to use more locks, placed on separate
cache lines, improving the scalability of ARC operations. This led
to a boost in the cached and uncached maximum IOPS capabilities of the
ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new
model grows the ARC less aggressively when approaching memory pressure
and instead recycles buffers earlier. This recycling leads to a
steadier ARC size and fewer disruptive shrink cycles. If the changing
environment nevertheless requires the ARC to shrink, the amount by
which we shrink each time is reduced, making each shrink cycle less of
a stress. Along with the reorganization of the ARC list
locking, this has led to a much steadier, dependable ARC at high
loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to
signify read or write intent to the ARC. This, in turn, enables more
checks to be performed by the code, therefore catching bugs earlier
in the process. A better separation of function between the DMU and
the ARC is critical for ZFS robustness, or hardening. In the new reARC
mode of operation, the ARC now actually has the freedom to relocate
kernel buffers in memory in between DMU accesses to a cached
buffer. This new feature proves invaluable as we scale to large memory
systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD-based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut this amount by more than 2X, down to
a bare minimum that now only requires about 80 bytes of metadata per
L2 buffer. With the arrival of larger SSDs for the L2ARC and a better
feeding algorithm, this reduced L2ARC footprint is a very significant
change for the Hybrid Storage Pool (HSP) storage model.

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetch I/Os, which were handled by the ZIO pipeline at a lower priority than, for example, a regular read issued by an
application. Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the same data that arrived while the prefetch was still pending would block waiting on the low-priority prefetch's completion.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.

Conclusion

The key points that we saw in reARC are as follows:

Metadata doesn't need special protection from eviction; arc_meta_limit has
become an obsolete tunable.

Multiple clones of the same data share the same buffers for great performance
in a virtualization environment.

We boosted ARC scalability for cached and uncached IOPs.

The ARC size is now steadier and more dependable.

Protection from creeping memory bugs is better.

L2ARC uses a smaller footprint.

I/Os are handled with more fairness in the presence of prefetches.

All of these improvements are available to customers of Oracle's ZFS Storage Appliance in any AK-2013 release and in recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

Well, look who's back! After years of relative silence, I'd like to
put my blogging hat back on and update my patient readership about
the significant ZFS technological improvements that have been integrated
since Sun and ZFS became Oracle brands.
Since there is so much to cover, I tee up this series of articles with a short
description of 9 major performance topics that have evolved
significantly in the last few years. Later, I will describe each
topic in more detail in individual blog entries.
Of course, these selected advancements represent nowhere near an
exhaustive list. There have been over 650 changes to the ZFS code in
the last 4 years. My personal performance bias has selected the topics that I know best.
The designated topics are:

ZIL Pipelining

Allow the ZIL to carve up smaller units of
work for better pipelining and higher log device
utilization.

It is the dawning of the age of the L2ARC

Not only did we make the L2ARC persistent on reboot,
we made the feeding process so much more efficient
we had to slow it down.

Zero Copy I/O Aggregation

A new tool delivered by the Virtual Memory team allows
the already incredible ZFS I/O aggregation feature to
actually do its thing using one less copy.

Scalable Reader/Writer locks

Reader/Writer locks, used extensively by ZFS and
Solaris, had their scalability greatly improved
on large systems.

New Thread Scheduling Class

ZFS transaction groups are now managed by a new type
of taskq which behaves better when managing bursts of CPU activity.

Concurrent Metaslab Syncing

The task of syncing metaslabs is now handled with more
concurrency, boosting ZFS write throughput capabilities.

Block Picking

The task of choosing blocks for allocations has been
enhanced in a number of ways, allowing us to work more
efficiently at a much higher pool capacity percentage.

There you have it. I'm looking forward to reinvigorating my blog, so
stay tuned.

Tuesday, February 28, 2012

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark, SPC-1's profile is close to what a fixed-block-size DB
would actually be doing. See Fast
Safe Cheap: Pick 3 for more details on that result. Here, for an
encore, we're showing how the ZFS Storage Appliance can perform
in a totally different environment: generic NFS file serving.

Does price/performance matter? It does, doesn't it? See what Darius has to say about how we compare to NetApp: Oracle posts SPEC SFS.

This is one step further in the direction of bringing our customers
true high performance unified
storage capable of handling blocks and files on the same
physical media. It's worth noting that provisioning of space between
the different protocols is entirely software based and fully dynamic,
that every stored element is fully checksummed, that all stored data can
be compressed with a number of different algorithms (including gzip),
and that both filesystems and block-based LUNs can be snapshotted and
cloned at their own granularity. All these
manageability features are available to you in this high performance storage package.

Way to go, ZFS!

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation
Corporation (SPEC).
Results as of February 22, 2012, for more information see www.spec.org.

Monday, October 3, 2011

At Oracle OpenWorld this week in San Francisco, the ZFS Storage Appliance booth is
located in Moscone South, Center - SC-139. I'll be spending time there
Tuesday and Wednesday afternoons, hoping to hear from both existing and prospective customers.

SPC-1: Twice the performance of NetApp at the same latency; half the $/IOPS.

I'm proud to say that, yours truly, along with a lot of great
teammates in Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C
cluster achieved 137000 SPC-1
IOPS with an average latency of less than 10 ms. That is
double the result of NetApp's 3270A while delivering the same
latency. Compared to the NetApp 3270 result, this is a 2.5x
improvement in $/SPC-1 IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when
the ZFS Storage Appliance runs at the rate posted by the 3270A (68034
SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs
(9.16ms). Moreover, our result was obtained with 23700 GB of user
level capacity (internally mirrored) for $17.30/GB, while NetApp,
even using a space-saving RAID scheme, can only deliver $23.50/GB. This
is the price per GB of application data actually used in the
benchmark. On top of that, the 7420C still had 40% of space headroom
whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible by the
availability of 15K RPM Hard Disk Drives (HDDs). Those are great for running
the most demanding databases because they combine a large IOPS
capability with a generally smaller capacity. The ratio of IOPS/GB
makes them ideal to store the high intensity database modeled by SPC-1.
On top of that, this concerted engineering effort led to improved
software, not just for systems running on 15K RPM drives. We actually used this benchmark to seek out ways to
increase the quality of our products. During the preparation runs, after an initial
diagnosis of an issue, we were committed to finding solutions that
were not targeting the idiosyncrasies of SPC-1 but were based on sound
design decisions. So instead of changing the default value of some
internal parameter to a new static default, we actually changed the way
the parameter worked so that our storage systems of all types and sizes would benefit.

So not only are we getting a great SPC-1 result, but all existing
customers will benefit from this effort even if they are operating
outside of the intense conditions created by the benchmark.

So what is SPC-1? It is one of the few benchmarks that count for
storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple
databases running on a centralized storage system or storage cluster. But
even though SPC-1 is a block-based benchmark, within the ZFS Storage
Appliance a block-based FC or iSCSI volume is handled very much the
same way as a large file subject to synchronous operations would be.
And by combining modern network technologies (InfiniBand or 10GbE
Ethernet), the CPU power packed in the 7420C storage controllers and
Oracle's custom dNFS technology for databases, one can truly achieve
very high database transaction rates on top of the more manageable and
flexible file-based protocols.

The benchmark defines three Application Storage Units (ASUs): ASU1 with a heavy 8KB block
read/write component, ASU2 with a much lighter 8KB block read/write
component, and ASU3 which is subject to hundreds of write streams. As
such, it is not too far from a simulation of running hundreds of Oracle
databases on a single system: ASU1 and ASU2 for datafiles and ASU3
for redo log storage.

The total size of the ASUs is constrained such that all of the stored data
(including mirror protection and disks used as spares) must exceed 55%
of all configured storage. The benchmark team is then free to decide
how much total storage to configure. From that figure, 10% is given to
ASU3 (redo log space) and the rest is divided equally between the heavily
used ASU1 and the lightly used ASU2.

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run at.
This is not a light decision, given that you want to balance high IOPS, low
latency and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria
needed to pass a successful audit; one of the most critical is that
you have to run at the specified IOPS rate for a whole 8 hours. Note
that the previous specification of the benchmark used by NetApp
called for a 4 hour run. During that 8 hour run delivering a solid
137000 SPC-1 IOPS, the average latency must be less than 30ms
(we did much better than that).

After this brutal 8 hour run, the benchmark then enters another critical
phase: the workload is restarted (using a new randomly selected working set)
and performance is measured for a 10 minute period. It is this 10 minute
period that decides the official latency of the run.

When everything is said and done, you pull the trigger, go to sleep,
and wake up to the result. As you can guess, we were ecstatic that
morning. Before that glorious day, for lack of a stronger word, a lot
of hard work had been done during the extensive preparation runs. With
little time, and normally not all of the hardware, one runs through a
series of runs at incremental loads, making educated guesses as to how
to improve the result. As you get more hardware, you scale up the
result, tweaking things more or less until the final hour.

SPC-1, with its requirement of less than 45% unused space, is
designed to trigger many disk-level random read IOPS. Despite this
inherent random pattern of the workload, we saw that our extensive
caching architecture was as helpful for this benchmark as it is in
real production workloads. While a 15K RPM HDD normally levels off
with random operations at a rate slightly above 300 IOPS, our 7420C, as
a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.

In the end, one of the most satisfying aspects was to see that
the data being managed by ZFS was stored rock solid on disk, properly
checksummed, that all data could be snapshotted and compressed on demand, and
that the system delivered impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower
latency, 30% cheaper per user GB with room to grow... So, if you have
a storage decision coming and you need FAST, SAFE, CHEAP: pick 3, and
take a fresh look at the ZFS Storage Appliance.

Wednesday, May 26, 2010

Recall that I had LUN alignment on my mind a few weeks ago.
There is nothing special about the ZFS Storage Appliance here compared to any other storage: pay attention to how you partition your LUNs, as it can have a great impact on performance.
Right, Roch?

Thursday, March 11, 2010

One of the major milestones for the ZFS Storage Appliance with 2010/Q1
is the ability to dedup data on
disk. The open question is then: what performance characteristics should we expect to see from
dedup? As Jeff says, this is the ultimate gaming ground for
benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

The dedup code is, simplistically, a large hash table (the DDT). It uses a 256-bit (32
byte) checksum along with other metadata to identify data
content. On a hash match, we only need to increase a reference count
instead of writing out duplicate data. The dedup code is integrated into
the I/O pipeline and runs on the fly as part of the ZFS transaction
group (see
Dynamics
of ZFS,
The
New ZFS Write Throttle). A ZFS
zpool typically holds a number of datasets: either block-level LUNs,
which are based on ZVOLs, or NFS and CIFS file shares based on ZFS
filesystems. So while the dedup table is a construct associated with an individual zpool,
enabling the deduplication feature is controlled at the
dataset level. Enabling the dedup feature on a dataset has no impact on existing
data, which stays outside of the dedup table.
However, any new data stored in the dataset will then be subject to the
dedup code. To actually have existing data become part of the dedup
table one can run a variant of "zfs send | zfs recv" on the datasets.
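
A minimal sketch of what that looks like at the command line, with hypothetical pool and dataset names: dedup is enabled per dataset, and pre-existing data only enters the DDT once it is rewritten, for example through a send/receive cycle:

zfs set dedup=on tank/vdi

# fold existing data into the dedup table by rewriting it
zfs snapshot tank/vdi@migrate
zfs send tank/vdi@migrate | zfs recv tank/vdi-dedup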

Dedup works at the ZFS block or record level. For an iSCSI or FC
LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For
filesystems (NFS, CIFS or direct-attach ZFS), objects smaller than 128K
(the default recordsize) are stored as a single ZFS block, while
objects bigger than the default recordsize
are stored as multiple records. Each
record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated across many filesystem instances are
expected to dedup perfectly. For example, whole DBs copied from a
master file are expected to fall in this category. Similarly for LUNs, virtual desktops created
from the same virtual desktop master image are also expected to dedup
perfectly.

An interesting topic for dedup concerns streams of bytes, such as a tar
file. For ZFS, a tar file is actually a sequence of ZFS records with
no identified file boundaries. Therefore, identical objects (files
captured by tar) present in 2 tar-like byte streams might not dedup well
unless the objects actually start at the same alignment within the
byte stream. A better dedup ratio would be obtained by expanding the
byte stream into its constituent file objects within ZFS. If possible, the tools
creating the byte stream would be well advised to start new
objects on identified boundaries such as 8K.

Another interesting topic is backups of active databases.
Since databases often interact with their constituent files using an
identified block size, it is rather important for deduplication
effectiveness that the backup target be set up with a block size that
matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence
that modifications to small blocks of the source database will cause the corresponding large blocks in the backup target
to appear unique and, somewhat artificially, not dedup. By using an 8K block size in
the dedup target dataset instead of 128K, one could conceivably see
up to a 10X better deduplication ratio.
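
As an illustration of that recommendation (names are hypothetical), a backup target for a database using an 8K block size could be created as:

zfs create -o recordsize=8k -o dedup=on tank/dbbackup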

Performance Model and I/O Pipeline Differences

What is the effect of dedup on performance? First, when dedup is
enabled, the checksum used by ZFS to validate the disk I/O is changed to the
cryptographically strong SHA256. Darren Moffat shows in his
blog
that SHA256 actually runs at more than 128 MB/sec on a modern
CPU. This means that less than 1 ms is consumed to checksum a 128K unit
and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation
that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.

For the read code path, very little modification should be
observed. The fact that a read happens to hit a block which is part
of the dedup table is not relevant to the main code path. The biggest
effect will be that we use a stronger checksum function invoked after
a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to
be in the pool's ARC cache, then instead of having to wait for a full
disk I/O, only a much faster copy of the duplicate block will be
necessary. Each filesystem can then work independently on its copy
of the data in the ARC cache, as is the case without deduplication. Synchronous writes are also unaffected in their interaction with the
ZIL. The blocks written to the ZIL have a very short lifespan and are
not subject to deduplication. Therefore the path of synchronous writes
is mostly unaffected, unless the pool itself ends up not being able to
absorb the sustained rate of incoming changes for 10s of seconds.
Similarly, for asynchronous writes which interact with the ARC caches, the
dedup code has no effect unless the pool's transaction group itself
becomes the limiting factor. So the effect of dedup will take place during the pool transaction
group updates. This is where we take all modifications that occurred in
the last few seconds and atomically commit a large transaction group
(TXG). While a TXG is running, applications are not directly affected
except possibly by the competition for CPU cycles. They mostly
continue to read from disk, do synchronous writes to the ZIL, and
do asynchronous writes to memory. The biggest effect will come if the
incoming flow of work exceeds the capability of the TXG to commit
data to disk. Then eventually the reads and writes will be held up by
the write throttling code, which prevents ZFS from consuming all of
memory.

Looking into the ZFS TXG, we have 2 operations of interest: the creation of a
new data block and the simple removal (free) of a previously used
block. Since ZFS operates under a copy-on-write (COW) model, any modification to
an existing block actually represents both a new data block creation
and a free of a previously used block (unless a snapshot was taken,
in which case there is no free). For file shares, this concerns
existing file rewrites; for block LUNs (FC and iSCSI), this
concerns most writes except the initial one (the very first write to a
logical block address or LBA actually allocates the initial data; subsequent
writes to the same LBA are handled using COW). For the creation of a new application data block, ZFS will then compute
the checksum of the block, as it does normally, and then look up
that checksum, along with a few other bits of
information, in the dedup table. On a dedup table hit, only a reference count needs to be
increased, and such changes to the dedup table will be stored on disk
before the TXG completes. Many DDT entries are grouped in a disk
block and compression is involved. A big win occurs when many
entries in a block are subject to a write match during one TXG. Then a
single 16K I/O can replace tens of larger IOPS.
As for free operations, ZFS internally holds the referencing block pointer, which
contains the checksum of the block being freed. Therefore there is no need
to read nor recompute the checksum of the data being freed. ZFS, with checksum in hand, looks
up the entry in the dedup table and decrements the reference counter. If
the counter is non-zero then nothing more is necessary (just the dedup
table sync). If the freed block ends up without any reference then it
will actually be freed.

The dedup table itself is an object managed by ZFS at the pool level.
The table is considered metadata and its elements will be stored in the ARC
cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store
metadata. When the dedup table actually fits in memory, enabling
dedup is expected to have a rather small effect on performance. But when the
table is many times greater than the allotted memory, the lookups
necessary to complete the TXG can cause write throttling to be invoked
earlier than for the same workload running without dedup. If using an
L2ARC, the DDT represents a prime candidate for the secondary
cache. Note that, independent of the size of the dedup table, read-intensive
workloads in highly duplicated environments are expected to
be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are
operations that can free up large quantities of data at once; when the
dedup table exceeds the allotted memory, those operations, which are
more complex with deduplication, can then impact the
amount of data going into every TXG and the write throttling behavior.

So how large is the dedup table?

The command zdb -DD on a pool shows the size of DDT entries. In one of
my experiments it reported about 200 bytes of core memory per table
entry. If each unique object is associated with 200 bytes of memory,
that means that 32GB of RAM could reference 20TB of unique data
stored in 128K records, or more than 1TB of unique data in 8K
records. So if there is a need to store more unique data than what
these ratios provide, strongly consider allocating some large read-optimized SSDs
to hold the DDT. The DDT lookups are small random I/Os which are
handled very well by current generation SSDs.
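
A hedged back-of-the-envelope using the figures above (the ~200 bytes per entry came from a zdb observation and can vary):

zdb -DD tank    # reports DDT entry counts and in-core/on-disk sizes

# 32GB of RAM / ~200 bytes per entry  ~ 170 million unique blocks
# 170 million x 128K records          ~ 20TB of unique data
# 170 million x 8K records            ~ 1.3TB of unique data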

The first motivation to enable dedup is actually when dealing with
duplicate data to begin with. If possible, procedures that generate
duplication should be reconsidered. The use of ZFS clones is actually a
much better way to generate logically duplicate data for multiple users
in a way that does not require a dedup hash table.
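
For instance, with hypothetical names, many logically identical environments can be deployed from one master image at no extra space cost and with no DDT involved:

zfs snapshot tank/golden@deploy
zfs clone tank/golden@deploy tank/vm001
zfs clone tank/golden@deploy tank/vm002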

But when the operating conditions do not allow the use of ZFS clones
and data is highly duplicated, then the ZFS deduplication
capability is a great way to reduce the volume of stored data.

The views expressed on this blog are my own and do not necessarily reflect
the views of Oracle.

Because of disk partitioning software on your storage clients
(keywords: EFI, VTOC, fdisk, DiskPart, ...) or a mismatch between
storage configuration and application request pattern, you could be
suffering a 2-4X performance degradation....

Many I/O performance problems I see end up being the result of a
mismatch in request sizes, or of their alignment versus the natural block
size of the underlying storage. While raw disk storage works
using a 512 byte sector and performs at the same level independent of
the starting offset of I/O requests, this is not the case for more
sophisticated storage, which will tend to use larger block units. Some
SSDs today support 512B-aligned requests but will work much better if
you give them 4K-aligned requests, as described in
Aligning on 4K boundaries and Flash and
Sizes. The Sun Oracle 7000 Unified Storage line supports
different block sizes between 4K and 128K (it can actually go
lower but I would not recommend that in general). Having proper alignment
between the application's view, the initiator partitioning and the
backing volume can have a great impact on the end performance delivered to
applications.

When is alignment most important?

Alignment problems are most likely to have an impact when:

running a DB on file shares or block volumes

write streaming to block volumes (backups)

Also impacted, to a lesser degree:

large file rewrites on CIFS or NFS shares

In each case, adjusting the recordsize to match the workload and
ensuring that partitions are aligned on a block boundary can have
an important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block oriented application. General ZFS Best Practices warrant
that the storage use a record size equal to the DB natural
block size. At the logical level, the DB is issuing I/O which are aligned on block
boundaries. When using file semantics (NFS or CIFS), then the
alignment is guaranteed to be observed all the way to the backend
storage. But when using block device semantics, the alignments of
requests on the initiator is not guaranteed to be the same as the
alignement on the target side. Misalignment of the LUN will cause two pathologies. First, an
application block read will straddle 2 storage blocks creating storage
IOPS inflation (more backend reads than application reads). But a more
drastic effect will be seen for block writes which, when aligned,
could be serviced by a single write I/O. Those will now require a
Read-Modify-Write (R-W-M) of 2 adjacent storage blocks. Such type of I/O
inflation leads to additional storage load and degrade performance
during high demand.

To avoid such I/O inflation, ensure that the backing store uses a
block size (LUN volblocksize or share recordsize) compatible with the
DB block size. If using a file share such as NFS, ensure that the
filesystem client passes I/O requests directly to the NFS server
using a mount option such as directio, or use Oracle's
dNFS client
(note that with the directio mount option, for
memory management considerations independent of alignment concerns,
the server will behave better when the client specifies rsize and wsize options not exceeding 128K). To avoid LUN misalignment, prefer the use of full
LUNs as opposed to sliced partitions. If disk slices must be used,
prefer a partitioning scheme in which one can control the sector offset
of individual partitions, such as EFI labels. In that case, start partitions on a
sector boundary which aligns with the volume's blocksize. For instance,
an initial block for a partition which is a multiple of 16 * 512B
sectors will align on an 8K boundary, the default LUN blocksize.
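
To tie those two recommendations together, here is a minimal sketch (sizes and names are assumptions): create the LUN with a volblocksize matching the DB block size and, if the client must slice it, start the data partition on a sector number that is a multiple of 16:

# storage side: an 8K volblocksize LUN for an 8K-block database
zfs create -V 100g -o volblocksize=8k dbpool/dblun

# client side: a partition starting at sector 256 (256 x 512B = 128K)
# is aligned on 8K boundaries; a start sector of 63 is not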

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a
raw block device. Block devices by default commit each write to stable
storage. This path is often optimized through the use of
acceleration devices such as write optimized SSD.
Misalignement of the LUNS due to
partitioning software imply that application writes, which could otherwise be committed to
SSD at low latency, will instead be delayed by disk reads caught in
R-M-W. Because the writes are synchronous in nature,
the application running on the initiator will thus be considerably
delayed by disk reads. Here again one must insure that partitions created on the client system are
aligned with the volumes blocksize which typically default to 8K. For
pure streaming workloads large blocksize up to the maximum 128K can lead to
greater streaming performance. One must take good care that the
block size used for a LUNS should not exceed the application writes sizes
to raw volumes or risk being hit by the R-M-W penalty.
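
A hedged example for such a backup LUN (names and size are hypothetical): if the backup application writes in chunks of 128K or more, a large volblocksize maximizes streaming performance, but it must not exceed the application's write size:

zfs create -V 2t -o volblocksize=128k dbpool/backuplun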

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming write will be of 2 types : they will
either be the more common file creation (write allocation) or they
will correspond to streaming overwrite to existing file. The more
common write allocation would not greatly suffer from misalignment
since there is no pre-existing data to be read and modified. But for
the less common streaming rewrite to files, one can definitely be
impacted by misalignment and R-M-W cycles. Fortunately file protocols
are not subject to LUN misalignment so one must only take care that the write
sizes reaching the storage be multiple of the
recordsize used to create the file share in the storage. The solaris NFS clients often issues 32K write size for streaming
application while CIFS has been observed to use 64K from clients.
If existing streaming asynchronous file rewrite is an important component of your I/O
workloads (a rare set of conditions), it might well be that setting the LUN blocksize
accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is more generally seen with fixed
record oriented application (as for Oracle Database or Microsoft
Exchange) with random access pattern and synchronous I/O semantics. It
can be caused by partitioning software (fdisk, diskpart) which create
disk partitions not aligned with the storage blocks. It can also
be caused to a lesser extent by streaming file overwrite when the
application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting
different blocksizes for different use within a single pool of
storage. However it has no control on the offset that could be
selected during disk partitioning of block devices on client
systems. Care must be taken when partitioning disks to avoid
misalignment and degraded performance. Using full LUNs is preferred.

The views expressed on this blog are my own and do not necessarily reflect
the views of Oracle.

One of the great advances present in the ZFS Appliance
2010/Q1
software update relates to the block allocation strategy. It's been
one of the most complex performance investigations I've ever had to deal
with, because of the very strong impact the previous history of block
allocation had on future performance. It was a maddening experience
littered with dead-end leads. During that whole time it was very hard to
make sense of the data and segregate what was due to a
problem in block allocation from other causes that lead
customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance
and 70% reduced variability from run to run. We also saw a full 200% improvement in MS Exchange performance from these changes.

At some point we started to look at random synchronous file rewrites (a
la DB writer) and it seemed clear that the performance was not what we
expected for this workload. Basically, independent DB block
synchronous writes were not aggregating into larger I/Os in the vdev
queue. We could not truly assert a point where a regression had set
in, so rather than treat this as a performance regression, we just
decided to study what we had in 2009/Q3 and see how we could make our
current scheme work better. And that led us down the path of the
metaslab allocator:

As Jeff explains, when a piece of data needs to be stored on disk, ZFS
will first select a top-level vdev (a raid-z group, a mirrored set, a
single disk) for it. Within that top-level vdev, a metaslab (slab for
short) will be chosen, and within the slab a block of Data Virtual
Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we
thought we had an issue with the block allocator. What we were
seeing was that for random file rewrites the aggregation factor was
large (say 8 blocks or more) when performance was good but dropped
to 1 or 2 when performance was low. So we tried to see if
we could do a better job at selecting blocks that would lead to better
I/O aggregation down the pipeline. We kept looking at the effect of block allocation,
but it turned out the source of the problem was in the slab selection
process.

So a slab is a portion of DVA space within a metaslab group (aka a top
level vdev). We currently divide vdev space into approximately 200
slabs (see vdev_metaslab_set_size).
Slabs can be either loaded in memory or not. When loaded, the
associated spacemaps are active, meaning we can allocate space from them.
When slabs are not loaded, we can't allocate space, but we can still
free space from them (ZFS being copy-on-write, or COW, a block rewrite frees up the old space). In
this case we just log the freed range information to disk. Loads and
unloads of spacemaps are not cheap, so we make sure we minimize such operations.

So each slab is weighted according to a few criteria and the slab with
the highest weight is selected for allocation on a vdev. The first criterion for slab selection is to reuse the same one as the last
one used: basically, don't change a winner. We refer to this as the PRIMARY slab. The second criterion for slab selection is the amount of free
space: the more the better. However, lower LBAs (logical block
addresses), which map to outer cylinders, will generally give better
performance. So we weight lower LBAs more than inner ones at
equivalent free space. Finally, a slab that has already been used in the past, even if
currently unloaded, is preferred to opening up a fresh new slab. This
is the SMO bonus (because primed slabs have a Space Map Object
associated with them). We do want to favor previously used slabs in
order to limit the span of head seeks: we only move inwards when
outer space is filled up.

The purpose of the slabs is to service a block allocation, say for a
128K record. So when a request comes in, the highest weighted slab is
chosen and we ask for a block of the proper size using an AVL tree of
free/allocated space. There was a problem we had to deal with in previous
releases which occurred when such an allocation failed because of free space
fragmentation. The AVL tree was then not able to find a span of the
requested size and was consuming CPU only to figure out there was
no free block present to satisfy the allocation. When space was really tight in a pool we walked every
slab before deciding that the allocation needed to be split into
small chunks and a gang block (a block of blocks) created. So the spacemaps were
augmented with another structure that allows ZFS to immediately know
how large an allocation could be serviced in a slab (the so-called
picker-private tree, organized by size of free space).

At that point we had 2 ways to select a block: either find one in
sequence after the previous allocation (first fit) or use one that fills in
exactly a hole in the allocated space: the so-called best-fit
allocator. We also decided then to switch from first fit to best fit
as a slab became 70% full. The problem that this created, we now realize,
is that while it helped the compactness of the on-disk
layout, it created a headache for writes. Each new allocation got a
tailored-fit disk area and this led to much less write aggregation
than expected. We would see write workloads to a slab slow
down as it transitioned to 70% full (note this occurred when a slab was 70% full, not the
full vdev nor the pool). Eventually, the degraded slab became fully
used and the allocator would transition to a different slab with
better performance characteristics. Performance could then fluctuate from one hour to the next.

So to solve this problem, what went into the 2010/Q1 software release is
multifold. The most important thing is: we increased the threshold at
which we switched from 'first fit' (go fast) to 'best fit' (pack tight)
from 70% full to 96% full. With TB drives, each slab is at least 5GB
and 4% is still 200MB, plenty of space; there is no need to do anything
radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slab until it failed
an allocation, we decided to stop giving the primary slab this preferential
treatment as soon as the biggest allocation that could be satisfied
by the slab was down to 128K (metaslab_df_alloc_threshold). At that
point we were ready to switch to another slab that had
more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50%
empty was preferred over slabs that had never been used. In order to
foster more write aggregation, we reduced the threshold to 33%
empty. This means that a random write workload now spreads over more
slabs, where each one will have a larger amount of free space, leading to
more write aggregation. Finally, we also saw that slab loading was contributing to lower
performance, so we implemented a slab prefetch mechanism to reduce the downtime
associated with that operation.

The conjunction of all these changes led to 50% improved OLTP performance
and 70% reduced variability from run to
run (see David Lutz's post on OLTP performance). We also saw a full 200% improvement in MS Exchange performance
from these changes.

The views expressed on this blog are my own and do not necessarily reflect
the views of Oracle.

Thursday, October 8, 2009

Now that we have Gigabytes/sec-class Network Attached OpenStorage and
highly threaded CMT
servers to attach to it, you would figure just connecting the two would be
enough to open the pipes for immediate performance. Well ... almost.

Our OpenStorage systems can deliver great performance but we often find
limitations on the client side. Now that NAS servers can deliver so much power,
their NAS clients can themselves be powerful servers trying to deliver
GB/sec-class services to the internet.

CMT servers are great throughput engines for that; however, they
deliver the goods when the whole stack is threaded. So in a recent
engagement, my colleague David Lutz found that we needed one tuning at
each of 4 levels in Solaris: IP, TCP, RPC and NFS.

Service    Tunable
-------    -----------------
IP         ip_soft_rings_cnt
TCP        tcp_recv_hiwat
RPC        clnt_max_conns
NFS        nfs3_max_threads
NFS        nfs4_max_threads

ip_soft_rings_cnt requires tuning up to and including Solaris 10 update 7.
The default value of 2 is not enough to sustain high throughput in
a CMT environment. A value of 16 proved beneficial.

In /etc/system :

* To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
set ip:ip_soft_rings_cnt=16

The receive socket buffer size is critical to TCP connection
performance. The buffer is not preallocated, and memory is only
used if and when the application is not reading the data
it has requested. The default of 48K dates from the age of 10MB/s network cards
and 1GB/sec systems. A larger value allows the peer to keep sending without
throttling its flow while waiting for the returning TCP ACKs. This is especially
critical in high-latency environments, urban area networks or other
long fat networks, but it's also critical in the datacenter to reach a
reasonable portion of the 10GbE bandwidth available in today's NICs. It turns out
that NFS connections inherit the system's TCP default, so it's
interesting to run with a value between 400K and 1MB :

ndd -set /dev/tcp tcp_recv_hiwat 400000

But even with this, a single TCP connection is not enough to extract
the most out of 10GbE on CMT, and the Solaris RPC client will
establish a single connection to each of the servers it connects to.
The code underneath is highly threaded but did suffer from a few bugs
when trying to tune that number of connections, notably
6696163 and
6817942,
both of which are fixed in S10 update 8.

With that release, it becomes interesting to tune the number of RPC
connections, for instance to 8.

In /etc/system :

* To drive 10GbE on CMT in Solaris 10 update 8 : see blogs.sun.com/roch
set rpcmod:clnt_max_conns=8

And finally, above the RPC layer, NFS implements a pool of threads
per mount point to service asynchronous requests. These are mostly
used in streaming workloads (readahead and writebehind), while other,
synchronous requests are issued within the context of the
application thread. The default number of asynchronous threads is
likely to limit performance in some streaming scenarios. So
I would experiment with:

In /etc/system :

* To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
set nfs:nfs3_max_threads=32
set nfs:nfs4_max_threads=32
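
Pulling the pieces above together, here is a hedged consolidated sketch; the
module prefixes follow the documented forms of these tunables, so double-check
them against your Solaris update level, and remember that the tcp_recv_hiwat
ndd setting does not survive a reboot unless placed in a boot-time script:

In /etc/system :

* 10GbE NFS client tuning on CMT (Solaris 10 update 7/8) : see blogs.sun.com/roch
set ip:ip_soft_rings_cnt=16
set rpcmod:clnt_max_conns=8
set nfs:nfs3_max_threads=32
set nfs:nfs4_max_threads=32

At boot time or in a startup script :

ndd -set /dev/tcp tcp_recv_hiwat 400000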

As usual YMMV; use these settings with the usual circumspection.
Remember that tuning is evil, but it's better to know about these factors
than to be in the dark, stuck with lower than expected performance.

Thursday, Sept. 17, 2009

One of the much anticipated features of the 2009.Q3 release of the
fishworks OS is a complete rewrite of the iSCSI target implementation,
known as the Common Multiprotocol SCSI Target or
COMSTAR. The new target code is
an in-kernel implementation that replaces what was previously known as
the iSCSI target daemon, a user-level implementation of iSCSI.

Should we expect huge performance gains from this change ? You Bet !

But like most performance questions, the answer is often: it
depends. The measured performance of a given test is gated by the
weakest link it triggers. iSCSI is just one component among many
that can end up gating performance. If the daemon was not a limiting
factor in your test, then don't expect a large gain.

The target daemon was a userland implementation of iSCSI: daemon
threads would read data from a storage pool and write it to a socket,
or vice versa. Moving to a kernel implementation opens up options to
bypass at least one of the copies, and that is being considered as a
future improvement. But extra copies, while undesirable, do not necessarily
hurt small-packet latency or large-request throughput.
For small requests, the copy is small change compared to the
request handling. For large-request throughput, the important thing is
that the data path establishes a pipelined flow in order to keep every
component busy at all times.

But the way threads interact with one another can be a much greater
factor in delivered performance. And there lies the problem. The old
target daemon suffered from one major flaw in that each and every
iSCSI request required multiple trips through a single queue
(shared between all LUNs), and that queue was being read and written
by 2 specific threads. Those 2 threads would end up fighting for the
same locks. This was compounded by the fact that user-level threads
can be put to sleep when they fail to acquire a mutex, and that going
to sleep for a user-level thread is a costly operation implying a
system call and all the accounting that goes with it.

So while the iSCSI target daemon gave reasonable service for large
requests, it was much less scalable in the number of IOPS it could
serve and the CPU efficiency with which it could do so. IOPS
is of course a critical metric for block protocols.

As an illustration, with 10 client initiators and 10 threads
per initiator (so 100 outstanding requests) doing 8K cache-hit reads,
we observed:

Old Target Daemon    COMSTAR     Improvement
31K IOPS             85K IOPS    2.7X

Moreover, the target daemon was consuming 7.6 CPUs to service those
31K IOPS, while COMSTAR could handle 2.7X more IOPS consuming only 10
CPUs, a 2X improvement in IOPS-per-CPU efficiency.

On the write side, with a disk pool that had 2 striped write-optimised
SSDs, COMSTAR gave us 50% more throughput (130 MB/sec vs
88 MB/sec) and 60% more CPU efficiency.

ImmediateData

During our testing we noted a few interesting contributors to delivered
performance. The first is the setting of the iSCSI ImmediateData
parameter (see
iscsiadm(1M)). On the
write path, that parameter causes the iSCSI initiator to send up
to 8K of data along with the initial request packet. While this is generally a good
idea, we found that for certain write sizes it would
trigger a condition in the ZIL that caused ZFS to issue more data
than necessary through the logzillas. The problem is well understood,
remediation is underway, and we expect to get to a situation in
which keeping the default value of ImmediateData=yes is best. But
as of today, for those attempting world record data transfer speeds
through logzillas, setting ImmediateData=no and using a 32K or 64K write
size might yield positive results depending on your client OS.

Interrupt Blanking

Interested in low latency request response? Interestingly, a chunk of
that response time is lost in obscure settings of network card
drivers. Network cards will often delay pending interrupts in the hope
of coalescing more packets into a single interrupt. The extra
efficiency often results in more throughput at high data rates at the
expense of small-packet latency. For 8K requests we managed to get 15%
more single-threaded IOPS by tweaking one such client-side
parameter. Historically such tuning has always been hidden in the
bowels of each driver and specific to every client OS, so that's too
broad a topic to cover here. But for Solaris clients, the
Crossbow
framework aims, among other things, to make the latency vs throughput decision
much more adaptive to operating conditions, relaxing the need for
per-workload tuning.

WCE Settings

Another important parameter to consider for COMSTAR is the 'write
cache enable' bit. By default, every write request to an iSCSI LUN needs
to be committed to stable storage, as this is what is expected by most
consumers of block storage. That means that each individual write
request to a disk-based storage pool will take minimally a disk
rotation, or 5ms to 8ms, to commit. This is also why a write-optimised SSD
is quite critical to many iSCSI workloads, often yielding 10X
performance improvements. Without such an SSD, iSCSI performance will
appear quite lackluster, particularly for lightly threaded workloads
which are more affected by latency characteristics.

One could then feel justified in setting the write cache enable bit on
some LUNs in order to recoup some spunk in their engine. One piece of good news
here is that in the new 2009.Q3 release of fishworks the setting is
now persistent across reboots and reconnection events, fixing a nasty
condition of 2009.Q2. However, one should be very careful with this
setting, as the end consumer of the block storage (Exchange, NTFS,
Oracle, ...) is quite probably operating under an unexpected set of
conditions. This setting can lead to application corruption in case
of an outage (there is no risk to the storage's internal state).

There is one exception to this caveat and it is ZFS itself. ZFS is
designed to safely and correctly operate on top of devices that have
their write cache enabled. That is because ZFS will flush write
caches whenever application semantics or its own internal consistency
require it. So for a zpool created on top of iSCSI LUNs, setting WCE
on the LUNs to boost performance is well justified.
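
On the appliance the WCE bit is toggled per LUN from the management interface;
on a generic Solaris COMSTAR setup, the equivalent is usually the wcd
(write cache disabled) logical-unit property. A hedged sketch, assuming the
wcd property is available in your release and using made-up placeholder names:

# Server side (generic COMSTAR, not the appliance BUI): leave the LU's write
# cache enabled; wcd=false means "write cache not disabled".
stmfadm modify-lu -p wcd=false <lu-guid>

# Client side: a zpool built on such LUNs remains safe because ZFS flushes
# the device write caches whenever its semantics require it.
zpool create tank <iscsi-lun-device> <iscsi-lun-device>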

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias,
we now have the option to bypass the write-optimised SSDs for a LUN if
the workload it receives is less sensitive to latency. This would be
the case for a highly threaded workload doing large data
transfers. Experimenting with this new property is warranted at this
point.

With the 2009.Q3 release of fishworks, along with a new
iSCSI
implementation, we're coming up with
a very significant new feature for managing the performance of Oracle
databases: the new dataset Synchronous write bias property, or
logbias for short. In a nutshell, this
property takes the default value of Latency, signifying that the
storage should handle synchronous writes with urgency, the historical
default handling. See Brendan's
comprehensive blog entry on the Separate Intent Log and synchronous writes.
However, for datasets holding Oracle datafiles,
the logbias property can be set to Throughput, signifying that the
storage should avoid using log device acceleration and instead try to
optimize the workload's throughput and efficiency. We definitely
expect to see a good boost to Oracle performance from this feature for
many types of workloads and configs: workloads that generate
10s of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD.

The property is set in the Share Properties just above the
database record size. You might need to unset the Inherit from
project checkbox in order to modify the settings on a particular
share.

The logbias property addresses a peculiar aspect of Oracle workloads:
namely, that DB writers issue a large number of concurrent
synchronous writes to Oracle datafiles, writes which individually
are not particularly urgent. In contrast to other types of synchronous
write workloads, the most important metric for DB writers is not
individual latency. The important metric is that the storage
keeps up with the throughput demand in order to have database buffers
always available for recycling. This is unlike redo log
writes, which are critically sensitive to latency as they are holding
up individual transactions and thus users.

ZFS and the ZIL

A little background: with ZFS, synchronous writes are managed by the
ZFS Intent Log (ZIL).
Because synchronous writes are typically holding up applications, it's
important to handle those writes with some level of urgency, and the
ZIL does an admirable job at that.

In the OpenStorage
hybrid storage pool, the ZIL itself
is sped up using low-latency write-optimized SSD devices: the
logzillas. Those devices are used to commit a copy of the in-memory
ZIL transaction and retain the data until an upcoming transaction group
commits the in-memory state to the on-disk pooled storage
(Dynamics of
ZFS,
The
New ZFS write throttle).

So while the ZIL speeds up synchronous writes, logzillas speed up the
ZIL. Now, SSDs can serve IOPS at a blazing 100μs but also have
their own throughput limits: currently around 110MB/sec per device.
At that throughput, committing, for example, 40K of data will need
minimally 360μs. The more data we can divert away from log devices, the lower the
latency response of those devices will be.

It's interesting to note that other types of RAID controllers are
hostage to their NVRAM and require, for consistency, that data be
committed through some form of acceleration in order to avoid the RAID
write hole (Bonwick on Raid-Z). ZFS, however,
does not require that data pass through its SSD commit accelerator:
it can manage the consistency of commits either using disks or using
SSDs.

Synchronous write bias : Throughput

With this newfound ability for storage administrators to signify to ZFS
that some datasets will be subject to highly threaded synchronous
writes for which global throughput is more critical than individual
write latency, we can enable a different handling mode. By setting
logbias=Throughput, ZFS is able to divert writes away from
the logzillas, which are then preserved for servicing low-latency
sensitive operations (e.g. redo log writes).

A setting of Synchronous write bias : Throughput for a dataset allows synchronous
writes to files in other datasets to have lower latency
access to SSD log devices.

But that's not all. Data flowing through a logbias=Throughput
dataset is still served by the ZIL. It turns out that the ZIL has
different internal options in the way it can commit transactions, one
of which is tagged WR_INDIRECT. WR_INDIRECT commits issue an
I/O for the modified file record and record a pointer to it in the ZIL chain.
(see WR_INDIRECT in
zil.c,
zvol.c,
zfs_log.c
).

ZIL transactions of type WR_INDIRECT might use more disk I/Os and
incur slightly higher latency immediately, but fewer I/Os and fewer total bytes
during the upcoming transaction group update. Up to this point, the
heuristics that lead to using WR_INDIRECT transactions were not
triggered by DB writer workloads. But armed with the knowledge that
comes with the new logbias property, we're now less concerned
about the slight latency increase that WR_INDIRECT can have. So for
efficiency considerations, logbias=Throughput datasets
are now set to use this mode, leading to more leveled latency
distributions for transactions.

Synchronous write bias : Throughput is a dataset mode that reduces the number of
I/Os that need to be issued on behalf of this dataset during the regular transaction
group updates, leading to more leveled response times.

A reminder that these kinds of improvements can sometimes go unnoticed
in sustained benchmarks if the downstream transaction group destage is
not given enough resources. Make sure you have enough spindles (or
total disk KRPM) to sustain the level of performance you need. A
pool with 2 logzillas and a single JBOD might have enough SSD
throughput to absorb DB writer workloads without adversely affecting
redo log latency, and so would not benefit from the special logbias
setting; however, with 1 logzilla per JBOD the situation might be
reversed.

While the DB record size property is inherited by files in a dataset at
creation time and is immutable thereafter, the logbias setting is fully
dynamic and can be toggled on the fly during operations. For instance, during
database creation or other lightly threaded write operations to datafiles, it's
expected that logbias=Latency will perform better.
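
For readers on plain Solaris ZFS rather than the appliance BUI, logbias is an
ordinary dataset property, so the toggle can be scripted; the pool and dataset
names below are made up for illustration:

# During database creation or lightly threaded loads: favor per-write latency
zfs set logbias=latency dbpool/oracle/datafiles

# Steady-state DB writer traffic: divert datafile writes away from the logzillas
zfs set logbias=throughput dbpool/oracle/datafiles

# Verify the current setting
zfs get logbias dbpool/oracle/datafiles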

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around
deploying an Oracle DB on an OpenStorage system with SSD acceleration is to
segregate, at the filesystem/dataset level but within a single
storage pool, Oracle datafiles, index files and redo log files. Having
each type of file in a different dataset allows better observability
into each one using the great Analytics
tool. But also, each dataset can then be tuned independently to
deliver the most stable performance characteristics. The most
important parameter to consider is the ZFS internal recordsize used to
manage the files. For Oracle datafiles, the established practice (ZFS
Best Practice) is to match the recordsize to the DB block size.
For redo log files, using the default 128K records means that fewer file
updates will straddle multiple filesystem records. With 128K
records we expect fewer transactions needing to wait for redo
log input I/Os, leading to a more leveled latency distribution for
transactions. As for index files, using smaller blocks of 8K offers
better cacheability for both the primary and secondary caches
(only cache what is used from the indexes), but using larger blocks offers
better index-scan performance. Experimenting is in order, depending on
your use case, but an intermediate block size of maybe 32K might also
be considered for mixed usage scenarios.

For Oracle datafiles specifically, using the new setting of
Synchronous write bias : Throughput has the potential to deliver
more stable performance in general and higher performance for
redo-log-sensitive workloads.

Dataset       Recordsize         Logbias
Datafiles     8K                 Throughput
Redo Logs     128K (default)     Latency (default)
Index         8K-32K?            Latency (default)
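
Expressed as ZFS commands for those running plain Solaris ZFS rather than the
appliance BUI (dataset names are hypothetical, and the 16K index recordsize is
just one point in the 8K-32K range suggested above):

zfs create -o recordsize=8k -o logbias=throughput dbpool/oracle/datafiles
zfs create -o logbias=latency dbpool/oracle/redo        # recordsize stays at the 128K default
zfs create -o recordsize=16k -o logbias=latency dbpool/oracle/index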

Following these guidelines yielded a 40% boost in our transaction
processing testing, in which we had 1 logzilla for a 40-disk pool.