Oracle Blog

Richard Elling's Weblog

Thursday Apr 05, 2007

The FreeBSD team has added ZFS to the FreeBSD-7.0 release! This is excellent news, and all of us are happy to share it with the FreeBSD community. Pawel Jakub Dawidek has posted this note to the FreeBSD and ZFS communities. This will greatly expand the use of ZFS and will no doubt lead to more innovative developments in the community. Well done!

So, why is build 55 so special? It is the first build of what has become known as the Solaris Express Developer Edition. This is a milestone release because it is available with a support contract. Many people don't care about support, and you will often hear them complaining that Solaris 10 doesn't have some version of an application which was just put up on the web last week. The quality process involved in producing, distributing, and supporting a large software product, such as Solaris, takes time to crank through. If you really want the latest, greatest, and riskiest version of Solaris, then you need to be on a Solaris Express release. The problem was that we couldn't provide a support contract for Solaris Express. That problem is now solved: those who demand support and want to be closer to the leading edge can now get both in the Solaris Express Developer Edition.

Go ahead, give it a try. Then hang out with the Solaris community at the OpenSolaris.org website, where there are always interesting discussions going on.

Wednesday Jan 31, 2007

Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using. Here it is:

This shows the data I've previously blogged about, drawn to scale. You can easily see that for MTTDL, double parity protection is better than single parity protection, which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for both MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.

While this sort of analysis is useful, the problem is that the design space has more dimensions than one graph can show. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph. I'll try my best...

Tuesday Jan 30, 2007

In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.

The best thing about a model is that it is a simplification of real life. The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance
model. The calculations for the model can be made with data which is
readily available from disk data sheets. We calculate the expected
I/O operations per second (iops) based on the average read seek and
rotational speed of the disk. We don't consider the command overhead,
as it is generally small for modern drives and is not always
specified in disk data sheets.
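As a sketch, the model computes expected iops as the reciprocal of the average service time: the average read seek plus the average rotational latency (half a revolution). The seek times and spindle speeds below are illustrative values I've picked, not figures from any particular data sheet:

```python
# Small, random read iops model: expected iops is the reciprocal of the
# average service time, which is the average read seek time plus the
# average rotational latency (half a revolution).

def small_random_read_iops(avg_read_seek_ms, rpm):
    # Half a revolution, in milliseconds: (60000 ms/min / rpm) / 2
    avg_rotational_latency_ms = 60000.0 / rpm / 2.0
    return 1000.0 / (avg_read_seek_ms + avg_rotational_latency_ms)

# Illustrative values (assumptions, not from a specific data sheet):
# a small 10k rpm drive vs. a larger 7200 rpm drive.
fast = small_random_read_iops(4.1, 10000)  # ~141 iops
slow = small_random_read_iops(8.5, 7200)   # ~79 iops
# fast / slow is about 1.78, i.e. roughly a 78% improvement
```

Command overhead could be added as a third term in the denominator, but as noted above it is small and often unpublished.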

I purposely used those two examples because people are always
wondering why we tend to prefer smaller, faster, and (unfortunately)
more expensive drives over larger, slower, less expensive drives - a
78% performance improvement is rather significant. The 3.5"
drives also use about 25-75% more power than their smaller cousins,
largely due to the rotating mass. Small is beautiful in a SWaP
sense.

Next we use the RAID set configuration information to calculate
the total small, random read iops for the zpool or volume. Here we
need to talk about sets of disks which may make up a multi-level
zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of
mirrored sets (RAID-1). RAID-0 is a stripe of disks.

For dynamic striping (RAID-0), add the iops for each set or
disk. On average the iops are spread randomly across all sets or
disks, gaining concurrency.

For mirroring (RAID-1), add the iops for each set or disk.
For reads, any set or disk can satisfy a read, so we also get
concurrency.

For single parity raidz (RAID-5), the set operates at the
performance of one disk. See below.

For double parity raidz2 (RAID-6), the set operates at the
performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance:

RAID Configuration (6 disks)                      Small, Random Read Performance
                                                  Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                    6
3-set dynamic stripe, 2-way mirror (RAID-1+0)     6
2-set dynamic stripe, 3-way mirror (RAID-1+0)     6
6-disk raidz (RAID-5)                             1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)     2
2-way mirror, 3-disk raidz (RAID-5+1)             2
6-disk raidz2 (RAID-6)                            1
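The combination rules above can be sketched in a few lines of Python. Per-disk iops is normalized to 1 so the results are relative to a single disk, matching the table; this is a sketch of the model, not how ZFS itself computes anything:

```python
# Combination rules for small, random read iops; per-disk iops is
# normalized to 1, so results are relative to a single disk.

def stripe_iops(set_iops):
    # Dynamic stripe (RAID-0): random reads spread across all sets,
    # so the iops of the sets add.
    return sum(set_iops)

def mirror_iops(side_iops):
    # Mirror (RAID-1): any side can satisfy a read, so iops also add.
    return sum(side_iops)

def raidz_iops(ndisks):
    # raidz/raidz2: checksum validation reads the whole stripe, so the
    # set performs like a single disk for small, random reads.
    return 1

DISK = 1

raid0  = stripe_iops([DISK] * 6)                       # 6
raid10 = stripe_iops([mirror_iops([DISK, DISK])] * 3)  # 6
raid50 = stripe_iops([raidz_iops(3)] * 2)              # 2
raid51 = mirror_iops([raidz_iops(3)] * 2)              # 2
raidz2 = raidz_iops(6)                                 # 1
```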

Clearly, using mirrors improves both performance and data
reliability. Using stripes increases performance, at the cost of data
reliability. raidz and raidz2 offer data reliability, at the cost of
performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity,
such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance
of striped volumes, except for the parity disk. In other words, they
expect that a 6-disk raidz zpool would have the same small, random
read performance as a 5-disk dynamic stripe. Similarly, they expect
that a 6-disk raidz2 zpool would have the same performance as a
4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a
checksum to validate the contents of a block of data written. The
block is spread across the disks (vdevs) in the set. In order to
validate the checksum, ZFS must read the blocks from more than one
disk, thus not taking advantage of spreading unrelated, random reads
concurrently across the disks. In other words, the small, random read
performance of a raidz or raidz2 set is, essentially, the same as the
single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.

Many people also think that this is a design deficiency. As a RAS
guy, I value the data validation offered by the checksum over the
performance supposedly gained by RAID-5. Reasonable people can
disagree, but perhaps some day a clever person will solve this for
ZFS.

So, what do other logical volume managers or RAID arrays do? The
results seem mixed. I have seen some RAID array performance
characterization data which is very similar to the ZFS performance
for parity sets. I have heard anecdotes that other implementations
will read the blocks and only reconstruct a failed block as
needed. The problem is, how do such systems know that a block has
failed? Anecdotally,
it seems that some of them trust what is read from the disk. To
implement a per-disk block checksum verification, you'd still have to
perform at least two reads from different disks, so it seems to me
that you are trading off data integrity for performance. In ZFS, data
integrity is paramount. Perhaps there is more room for research here,
or perhaps it is just one of those engineering trade-offs that we
must live with.

Other Performance Models

I'm also looking for other performance models which can be applied
to generic disks with data that is readily available to the public.
The reason that the small, random read iops model works is that it
doesn't need to consider caching or channel resource utilization.
Adding these variables would require some knowledge of the
configuration topology and the cache policies (which may also change
with firmware updates). I've kicked around the idea of a total disk
bandwidth model which will describe a range of possible bandwidths
based upon the media speed of the drives, but it is not clear to me
that it will offer any satisfaction. Drop me a line if you have a
good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results
described here to other workloads. You could consider this to be a
worst-case model because it assumes 0% disk cache hits. I would hope
that most workloads exhibit better performance, but rather than
guessing (hoping) the best way to find out is to run the workload and
measure the performance. If you characterize a number of different
configurations, then you might build your own performance graphs
which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID
disk configurations by evaluating space, performance, and MTTDL.
First, let's look at single parity schemes such as 2-way mirrors and
raidz on the Sun
Fire X4500 (aka Thumper) server.

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better
performance and MTTDL than raidz for any specific space requirement
except for the case where we run out of hot spares for the 2-way
mirror (using all 46 disks for data). By contrast, all of the raidz
configurations here have hot spares. You can use this to help make
design trade-offs by prioritizing space, performance, and MTTDL.

You'll also note that I did not label the left-side Y axis (MTTDL)
again, but I did label the right-side Y axis (small, random read
iops). I did this with mixed emotion. I didn't label the MTTDL axis
values as I explained previously. But I did label the performance
axis so that you can do a rough comparison to the double parity graph
below. Note that in the double parity graph, the MTTDL axis is in units of millions of years, instead of years as in the graph above.

Here you can see the same sort of comparison between 3-way mirrors
and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.

Conclusion

We can make design trade-offs between space, performance, and
MTTDL for disk storage systems. As with most engineering decisions,
there often is not a clear best solution given all of the possible
solutions. By using some simple models, we can see the trade-offs
more clearly.

For this blog, I want to explore the calculation of MTTDL for a
bunch of disks. It turns out, there are multiple models for
calculating MTTDL. The one described previously here is the simplest
and only considers the Mean Time Between Failure (MTBF) of a disk and
the Mean Time to Repair (MTTR) of the repair and reconstruction
process. I'll call that model #1 which solves for MTTDL[1]. To
quickly recap:

For non-protected schemes (dynamic striping, RAID-0)

MTTDL[1] = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1,
RAID-5):

MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
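As a quick sketch, the MTTDL[1] equations translate directly into code. MTBF and MTTR are in hours; the values below are illustrative examples, not vendor numbers:

```python
# MTTDL[1]: considers only disk MTBF and MTTR, both in hours.

def mttdl1_raid0(mtbf, n):
    # No protection: any disk failure loses data.
    return mtbf / n

def mttdl1_single_parity(mtbf, n, mttr):
    # 2-way mirror, raidz, RAID-1, RAID-5.
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl1_double_parity(mtbf, n, mttr):
    # 3-way mirror, raidz2, RAID-6.
    return mtbf**3 / (n * (n - 1) * (n - 2) * mttr**2)

HOURS_PER_YEAR = 8766.0

# Illustrative: 1,000,000 hour MTBF, 24 hour MTTR, 6-disk sets
mtbf, mttr = 1.0e6, 24.0
raidz_years  = mttdl1_single_parity(mtbf, 6, mttr) / HOURS_PER_YEAR
raidz2_years = mttdl1_double_parity(mtbf, 6, mttr) / HOURS_PER_YEAR
# Both land far beyond the ~5-year service life of a disk, which is
# why only the relative comparison matters.
```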

You can
often get MTBF data from your drive vendor and you can measure or
estimate your MTTR with reasonable accuracy. But MTTDL[1] does not
consider the Unrecoverable Error Rate (UER) for read operations on
disk drives. It turns out that the UER is often easier to get from
the disk drive data sheets, because sometimes the drive vendors don't
list MTBF (or Annual Failure Rate, AFR) for all of their drive
models. Typically, UER will be 1 per 10^14 bits read for consumer
class drives and 1 per 10^15 for enterprise class drives. This can be
alarming, because you could also say that consumer class drives
should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte
drives are readily available and 1 TByte drives are announced. Most
people will be unhappy if they get an unrecoverable read error once
every dozen or so times they read the whole disk. Worse yet, if we
have the data protected with RAID and we have to replace a drive, we
really do hope that the data reconstruction completes correctly. To
add to our nightmare, the UER does not decrease by adding disks. If
we can't rely on the data to be correctly read, we can't be sure that
our data reconstruction will succeed, and we'll have data loss.
Clearly, we need a model which takes this into account. Let's call
that model #2, for MTTDL[2]:

First, we calculate the probability of unsuccessful
reconstruction due to a UER for N disks of a given size (unit
conversion omitted):

Precon_fail = (N-1) * size / UER

For single-disk failure protection:

MTTDL[2] = MTBF / (N * Precon_fail)

For double-disk failure protection:

MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
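A sketch of the MTTDL[2] model with the unit conversion made explicit. Here the UER is expressed as errors per bit read, which is equivalent to dividing by the "1 per 10^14 bits" figure from a data sheet; all of the values are illustrative:

```python
# MTTDL[2]: factors in unrecoverable read errors during reconstruction.
# Units made explicit: size in bytes, uer in errors per bit read,
# MTBF and MTTR in hours. All values below are illustrative.

def precon_fail(n, size_bytes, uer):
    # Probability of hitting an unrecoverable read error while reading
    # the N-1 surviving disks; 8 bits per byte, capped at 1.
    return min(1.0, (n - 1) * size_bytes * 8 * uer)

def mttdl2_single_protection(mtbf, n, size_bytes, uer):
    return mtbf / (n * precon_fail(n, size_bytes, uer))

def mttdl2_double_protection(mtbf, n, mttr, size_bytes, uer):
    return mtbf**2 / (n * (n - 1) * mttr *
                      precon_fail(n, size_bytes, uer))

# Consumer-class drive: 1 error per 10^14 bits; 500 GByte disks,
# 6-disk single-parity set.
p = precon_fail(6, 500e9, 1e-14)                           # 0.2
single = mttdl2_single_protection(1.0e6, 6, 500e9, 1e-14)  # hours
```

Note how quickly Precon_fail approaches 1 as the disks grow: with the same consumer-class UER, the cap is hit well before the 12.5 TBytes mentioned above.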

Comparing the MTTDL[1] model to
the MTTDL[2] model shows some interesting aspects of design. First,
there is no MTTDL[2] model for RAID-0 because there is no data
reconstruction – any failure and you lose data. Second, the MTTR
doesn't enter into the MTTDL[2] model until you get to double-disk
failure scenarios. You could nit pick about this, but as you'll soon
see, it really doesn't make any difference for our design decision
process. Third, you can see that the Precon_fail is a function of the
size of the data set. This is because the UER doesn't change as you
grow the data set. Or, to look at it from a different direction, if
you use consumer class drives with 1 UER per 10^14 bits, and you have
12.5 TBytes of data, the probability of an unrecoverable read during
the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the
MTTDL[2] model looks a lot like the RAID-0 model and friends don't
let friends use RAID-0! Maybe you could consider a smaller sized
data set to offset this risk. Let's see how that looks in pictures.

2-way mirroring is an example of
a configuration which provides single-disk failure protection. Each
data point represents the space available when using 2-way mirrors in
a zpool. Since this is for a X4500, we consider 46 total available
disks and any disks not used for data are available as spares. In
this graph you can clearly see that the MTTDL[1] model encourages the
use of hot spares. More importantly, although the results of the
calculations of the two models are around 5 orders of magnitude
different, the overall shape of the curve remains the same. Keep in
mind that we are talking years here, perhaps 10 million years, which
is well beyond the 5-year expected life span of a disk. This is the
nature of the beast when using a constant MTBF. For models which
consider the change in MTBF as the device ages, you should never see
such large numbers. But the wish for more accurate models does not
change the relative merits of the design decision, which is
what we really care about – the best RAID configuration given a
bunch of disks. Should I use single disk failure protection or double
disk failure protection? To answer that, let's look at the model for
raidz2.

From this graph you can see that
double disk protection is clearly better than single disk protection
above, regardless of which model we choose. Good, this makes sense.
You can also see that with raidz2 we have a larger number of disk
configuration options. A 3-disk raidz2 set is somewhat similar to a
3-way mirror with the best MTTDL, but doesn't offer much available
space. A 4-disk set will offer better space, but not quite as good
MTTDL. This pattern continues through 8 disks/set. Judging from the
graphs, you should see that a 3-disk set will offer approximately an
order of magnitude better MTTDL than an 8-disk, for either MTTDL
model. This is because the UER remains constant while the data to be
reconstructed increases.

I hope that these models give
you an insight into how you can model systems for RAS. In my
experience, most people get all jazzed up with the space and forget
that they are often making a space vs. RAS trade-off. You can use
these models to help you make good design decisions when configuring
RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models,
but it is unclear if they would help make better decisions, and I'll
explore those in another blog.

Thursday Jan 11, 2007

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets. Which would be best? One analysis is the trade-off between space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward. Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0)

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF^2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Where MTBF is the Mean Time Between Failure and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it. You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously. In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM. This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume.
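That composition can be sketched as follows, assuming independent set failures: compute the MTTDL of each mirrored set with the single parity equation, then apply the no-protection (RAID-0) form across the sets. MTBF and MTTR are in hours and the values are illustrative:

```python
# MTTDL of a multi-level RAID-1+0 zpool: single parity within each
# mirrored set, no protection across the top-level stripe.
# MTBF and MTTR in hours; values are illustrative.

def mttdl_single_parity(mtbf, n, mttr):
    # MTTDL of one protected set, e.g. a 2-way mirror (n = 2).
    return mtbf**2 / (n * (n - 1) * mttr)

def mttdl_stripe(set_mttdl, num_sets):
    # The top-level stripe has no protection: losing any set loses
    # data, so the MTBF/N form applies with sets as the units.
    return set_mttdl / num_sets

# 10 disks as a 5-set stripe of 2-way mirrors
mtbf, mttr = 1.0e6, 24.0
per_mirror = mttdl_single_parity(mtbf, 2, mttr)  # hours per mirror set
raid10_mttdl = mttdl_stripe(per_mirror, 5)       # hours for the zpool
```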

So, I took a large number of possible ZFS configurations for a X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

The most obvious conclusion from the above data is that you simply shouldn't use dynamic striping (RAID-0) alone. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You've noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is because they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL because the expected lifetime of a disk is on the order of 5 years. I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole.

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.

The most interesting result of this analysis is that the RAID configurations tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk raidz zpool and that of a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500. You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.

There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.

Monday Jan 01, 2007

Over the holiday break I installed Solaris Nevada build 55 on a few machines. This is an important build and I think you'll see lots of people talking and blogging about it. Here are a few reasons why I think it is cool:

StarOffice 8 is now integrated. Previous Nevada builds had StarOffice 7, which meant that I had to install and patch StarOffice 8 separately. This significantly reduces the amount of time I needed to get my desktop up to snuff.

Studio 11 is now integrated. This is another time saver.

NetBeans 5.5 is also integrated. I use NetBeans quite a bit for Java programming. I fell in love with refactoring and the debugging environment. I also do quite a bit of GUI work, and NetBeans has saved me many hundreds of hours of drudgery.

NVidia 3-D drivers are integrated. Still more installation time savings.

I do install every other build of Nevada, and these changes have significantly reduced the time it takes for me to go from install boot to being ready as my primary desktop. More importantly, it shows that once we can get past some of the (internal political) roadblocks, we can create a richly featured, easy to install, open source Solaris environment that showcases some of the best of Sun's technologies.