Thursday Feb 08, 2007

My last entry provided some recommendations regarding the use of ZFS
with databases. Time now to share some updated numbers.

Before we go to the numbers, it is
important to note that these results are for the OLTP/Net workload,
which may or may not represent your
workload. These results are also specific to our system configuration,
and may not be true for all system configurations. Please test your own
workload before drawing any conclusions. That said, OLTP/Net is based
on well known standard benchmarks, and we use it quite extensively to
study performance on our rigs.

Filesystem       FS Checksum    Database Checksum(1)    Normalized Throughput(2)
UFS Directio     N/A            No                      1.12
UFS Directio     N/A            Yes                     1.00
ZFS              Yes            No                      0.94

(1) Both block checksumming as well as block checking
(2) Bigger is better

Databases usually checksum their blocks to maintain data integrity.
Oracle, for example, uses a per-block checksum, and checksum checking
is on by default. This is typically recommended because most
filesystems do not have a checksumming feature. With ZFS, checksums are
enabled by default. Since databases are not tightly integrated with the
filesystem/volume manager, a checksum error is normally handled by the
database itself. Because ZFS includes volume manager functionality, a
checksum error is handled transparently by ZFS (provided you have some
kind of redundancy such as mirroring or raidz), and the situation is
corrected before a read error ever reaches the database. Moreover, ZFS
repairs the corrupted blocks via self-healing. While RAS experts will
note that an end-to-end checksum at the database level is slightly
better than an end-to-end checksum at the ZFS level, ZFS checksums give
you unique advantages while providing almost the same level of RAS.

If you do not penalize ZFS with double checksums, we are within 6% of
our best UFS number. That 6% buys you provable data integrity,
unlimited snapshots, no fsck, and all the other good features. Quite
good in my book. Of course, this number is only going to get better as
more performance enhancements make it into the ZFS code.
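For reference, here is a quick sketch of how you might check both sides of this on a test system. The dataset name dbpool/data is hypothetical, and db_block_checksum/db_block_checking are the usual Oracle init.ora parameters for block checksumming and block checking; consult your database documentation before turning them off.

    # ZFS checksums are on by default; confirm for the database dataset
    zfs get checksum dbpool/data

    # if you choose to rely on ZFS alone, the database-level features are
    # controlled by init.ora parameters such as:
    #   db_block_checksum = false
    #   db_block_checking = false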

More about the workload.

The tests were done with OLTP/Net on a 72-CPU Sun Fire E25K connected
to 288 15k rpm spindles. We ran the test with around 50% idle time to
simulate real customers. The test was done on Solaris Nevada build 46.
Watch this space for numbers with the latest build of Nevada.

Monday Sep 25, 2006

Databases and ZFS

Comparing UFS and ZFS out of the box, we find that ZFS performs
slightly better than UFS buffered. We also demonstrate that it is
possible to get further performance improvements with ZFS by following
a small set of recommendations. We have also identified a couple of
tunings that help performance; these tunings will be on by default in
future releases of ZFS.

We (PAE - Performance Engineering) recently completed a study to
understand database performance with ZFS. Read on for more details and
recommendations. You can also read Roch's blog on the same study.

Databases stress the filesystem in unique ways. Depending on the
workload and configuration, you can have thousands of IO operations per
second. These IOs are usually small (the database block size). All
writes are synchronous writes. Reads can be random or sequential. Some
writes are also more critical than others. Depending on the
configuration, reads are cached by the database program or by the
filesystem (if supported/requested). In many cases where filesystems
are used, the IO is spread over a few files. This causes the single
writer lock to be very hot under certain configurations, such as
buffered UFS.

Since IO is so important for databases, not surprisingly there are a
lot of heavyweight players in this arena. UFS, QFS, and VxFS are quite
popular with customers as the underlying filesystem. So how does the
new kid on the block (ZFS) do?

We used an internally developed benchmark called OLTP/Net to study
database performance with ZFS. OLTP/Net (O-L-T-P slash Net) is an OLTP
benchmark that simulates an online store. The major feature of the
benchmark is that it has a bunch of tuning knobs that control the ratio
of network IO to disk IO, the read/write mix of the transactions, the
number of new connects/disconnects to the database, and so on. This
makes it quite easy to simulate customer situations in our labs. We use
it quite extensively inside Sun to model real-world database
performance, and have found and fixed quite a few performance issues
using this workload.

For our ZFS study, we used the default settings for OLTP/Net. In this
scenario, we have a read/write ratio of 2:1 and a network/disk IO ratio
of 10:1. Since our goal is to run like most customers, we controlled
the number of users (load generators) such that the box was 60%
utilized.

The hardware configuration consisted of a T2000 with 32x1200MHz CPUs
and 32GB RAM connected to 140 Fibre Channel JBODs. We used both Solaris
10 Update 2 and Solaris Nevada build 43 for the analysis. We created
one big dynamically striped pool with all the disks and set the
recordsize of this pool to 8k. Each disk was divided into 2 slices.
These slices were allocated to UFS and ZFS in round-robin fashion to
ensure that each filesystem got an equal number of inner and outer
slices.

Normally for OLTP benchmark situations, we try to use the smallest
database block size for best performance. When we started out with our
study, we used a block size of 2048 as that gives us the best
performance for other filesystems. But since we are trying to do what
most customers might do, we switched over to a block size of 8192.
We did two kinds of tests, a cached database as well as a large (not
cached) database. Details follow in the sections below.

Recommendations for ZFS and Databases

Most customers use UFS buffered filesystems, and ZFS already performs
better than UFS buffered! Since we want to test performance, and we
want ZFS to be super fast, we decided to compare ZFS with UFS directio.
We noticed that UFS directio performs better than what we get with ZFS
out of the box. With ZFS, not only was the throughput much lower, but
we used more than twice the amount of CPU per transaction, and we were
doing 2x the IO. The disks were also more heavily utilized. We noticed
that we were not only reading in more data, but were also doing more IO
operations than needed. A little bit of dtracing quickly revealed that
these reads were originating from the write code path! More dtracing
showed that these were level 0 blocks being read in for the
read-modify-write cycle. This led us to the FIRST recommendation:

Match the database block size with ZFS record size.
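For example, with an 8K database block size you would set the ZFS recordsize to match (dbpool/data is a hypothetical dataset name). Note that recordsize only affects files created after the property is set, so do this before loading the database.

    # match the ZFS recordsize to the 8K database block size
    zfs set recordsize=8k dbpool/data
    zfs get recordsize dbpool/data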

A look at the DBMS statistics showed that "log file sync" was one of
the biggest wait events. Since the log files were in the same
filesystem as the data, we noticed higher latency for log file writes.
We then created a different filesystem (in the same pool), but set the
record size to 128K as log writes are typically large. We noticed a
slight improvement in our numbers, but not the dramatic improvement we
wanted. We then created a separate pool and used that pool for the
database log files. We got quite a big boost in performance. This
performance boost can be attributed to the decrease in write latency.
The latency of database log writes is critical for OLTP performance.
When we used one pool, the extra IOs to the disks increased the latency
of the database log writes, and thus impacted performance. Moving the
logs to a dedicated pool improved the latency of the writes, giving a
performance boost. This leads us to our SECOND recommendation:

If you have a write-heavy workload, you are better off separating the
log files onto a separate pool.
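A sketch of that separation, with made-up pool and device names:

    # a dedicated pool just for the database log files
    zpool create logpool c4t0d0 c4t1d0
    zfs create logpool/oralog

    # log writes are large, so use a large recordsize for the log filesystem
    zfs set recordsize=128k logpool/oralog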

Looking at the extra IO being generated by ZFS, we noticed that the
reads from disk were 64K in size. This was puzzling as the ZFS
recordsize is 8K. More dtracing, and we figured out that the vdev_cache
(or software track buffer) reads in quite a bit more than what we
request. The default size of the read is 64K (8x more than what we
request). Not surprisingly, the ZFS team is aware of this, and there
are quite a few change requests (CRs) on this issue:

4933977: vdev_cache could be smarter about prefetching
6437054: vdev_cache: wise up or die
6457709: vdev_knob values should be determined dynamically

Tuning the vdev_cache to read in only 8K at a time decreased the amount
of extra IO by a big factor and, more importantly, improved the latency
of the reads too. This leads to our THIRD recommendation:

Tune down the vdev_cache using ztune.sh until 6437054 is fixed
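ztune.sh simply pokes the relevant kernel variable for you; the effect is roughly the following. The symbol name shown (zfs_vdev_cache_bshift) is what later Nevada builds expose, so treat this as a sketch rather than a supported interface.

    # read 8K (2^13) per vdev_cache fill instead of the default 64K (2^16)
    echo "zfs_vdev_cache_bshift/W 0t13" | mdb -kw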

OK, we have achieved quite a big boost from all the above tunings, but
we are still seeing high latency for our IOs. We see that the disks are
busier during spa_sync time. Having read Eric Kustarz's blog about
vq_max_pending, we tried playing with that value. We found that setting
it to 5 gives us the best performance (for our disks and our workload).
Finding the optimal value involves testing multiple values -- a
time-consuming affair. Luckily the fix is in the works:

6457709: vdev_knob values should
be determined dynamically

So, future releases of ZFS will have this auto-tuned. This leads us to
our FOURTH recommendation:

Tune vq_max_pending using ztune.sh until 6457709 is fixed
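Conceptually this tuning boils down to something like the following. Later builds expose a global zfs_vdev_max_pending symbol, while older builds kept the value per vdev (which is why ztune.sh exists), so again this is only a sketch.

    # cap the number of outstanding IOs queued per vdev at 5
    echo "zfs_vdev_max_pending/W 0t5" | mdb -kw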

We tried various other things. For example, we tried changing the
frequency of spa_sync. The default is once every 5 seconds. We tried
once every second, once every 30 seconds, and even once every hour.
While in some cases we saw marginal improvement, we noticed higher CPU
utilization or high spin on mutexes. Our belief is that this is
something that is good out of the box, and we recommend you do not
change it. We also tried changing the behaviour of the ZIL by modifying
the zfs_immediate_write_sz value. Again, we did not see improvements.
This leads to our FINAL recommendation:

Let ZFS auto-tune. It knows best. In cases where tuning helps, expect
ZFS to incorporate that fix in future releases.

In conclusion, you can improve the out-of-the-box performance of
databases on ZFS by doing a few simple things. We have demonstrated
that it is possible to run high-throughput workloads with the current
release of ZFS. We have also shown that it is quite possible to get
huge improvements in database performance in future versions of ZFS.
Given that ZFS is around a year old, this is amazing!

Wednesday Aug 09, 2006

I attended a talk titled "Nanotechnology and Renewable Energy" by
Prof. Paul Alivisatos, Professor of Chemistry, University of
California, Berkeley, and Associate Laboratory Director for Physical
Sciences, Lawrence Berkeley National Laboratory. It was a very nice
talk. Do not miss a chance to attend his talks.

The efficiency of solar cells varies from 2% to 35%, and as the
efficiency increases, cost increases non-linearly. It is interesting to
note that the amount of energy consumed by the United States in one
year is roughly equal to the amount of solar energy received by the
earth in one hour! To generate 3TW (the current US power usage) using a
3% efficient solar technology would require solar cells covering an
area roughly equal to the size of Texas! The cost per kilowatt would be
much higher too.
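A rough back-of-the-envelope check (the ~200 W/m^2 average insolation figure is my assumption, not a number from the talk):

    3 TW / (0.03 x 200 W/m^2) = 5 x 10^11 m^2 = 500,000 km^2

Texas is roughly 695,000 km^2, so "an area the size of Texas" is indeed in the right ballpark.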

Disclaimer: I did not take notes during the talk, so my numbers may be
slightly off. But I guess you get the big picture.

Monday Jul 31, 2006

Performance for the real-world, where it matters the most.

A major portion of my job (at PAE) is spent trying to optimize Solaris
for real customer workloads. We tend to focus on databases, but work
with other applications too. We have tons (both weight-wise and
dollar-wise) of equipment in our labs, where we try to replicate a real
enterprise data center. Of course, the term "real customer workload" is
a loaded term. Since most big customers are rarely willing to share
their workloads, we have to simulate them or write something close to
them in-house. Trying to rewrite every customer's workload is not a
scalable approach. Hence we have developed a workload called OLTP/Net
that can be retrofitted to fit most customer workloads. Using several
tuning knobs we can control the amount of reads, writes, network
packets per transaction, connects, disconnects, etc. Think of it as a
super workload! We have used it quite effectively to simulate several
customer workloads.

There is a big difference between trying to get the best numbers for a
benchmark and replicating a customer's setup. PAE has traditionally
focused on getting the most out of the system. Our machines typically
run at 100% utilization, run the latest and greatest Solaris builds,
and have lots of tunings applied. We believe fully in Cary Millsap's
statement:

"Each CPU cycle that passes by unused is a cycle that you will never
have a chance to use again; it is wasted forever. Time marches
irrevocably onward."
(Performance Management: Myths & Facts, Cary V. Millsap, Oracle Corp,
June 28, 1999)

However, many customers run their machines at less than 100%
utilization to leave enough headroom for growth. When machines are not
running at 100% utilization, things like idle loop performance matter a
lot. If you have followed Solaris releases closely, you will have
noticed several enhancements to idle loop performance that improve the
efficiency of lightly loaded systems by quite a bit. Similarly, we have
seen quite a few UFS + database performance enhancements over the past
few releases of Solaris.

So while benchmark numbers do matter, real performance also matters, and we
are working on it!

Monday Dec 12, 2005

Update: In my previous blog I showed how to install 6 OS's on a disk.
Well, actually you can have seven (7). Disk slices are numbered from 0
to 7. Ignoring slice 2, that leaves us with 7 free slices on which to
install an OS. Although I have yet to log on to a machine with 7 OS's
on one disk!!

Richard Elling pointed out that you could also use slice 2 (the
loopback/backup/overlap slice). So that's 8. He also mentions that some
SCSI devices support 16 slices, so you could install quite a few more.
Maybe we should have a competition for how many OS's you can install on
a single disk. My personal best is 6.

Friday Dec 02, 2005

Six (6) OS's on one disk

Do you want to install 6 OS's on a single disk? If so, read on.

The goal is to have 6 bootable OS's on a single disk. Why would one do
this? Better sharing, more reliability, easier comparisons between OS
versions, quicker recovery, and so on. BTW, I have only tried this on
SPARC.

Although I am sure that people have been doing this for ages, I first
heard it from Charles
Suresh,
who encouraged me to go ahead and give it a try.

Create Partitions

Disk slices usually run from 0 to 7, with slice 2 being the overlap.
For our experiment, we set slice 1 to be swap. We sized the other
slices equally, with slice 0 being a little smaller than the others. On
my 36G disk, the partitioning looks like the following:
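The sizes below are illustrative rather than my exact cylinder counts, but the shape of the layout is this:

    slice 0:   ~4 GB    first OS (a little smaller than the rest)
    slice 1:   ~2 GB    swap
    slice 2:   whole disk (overlap; leave it alone)
    slice 3-7: ~6 GB each, one OS per slice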

Install The OS

Install Solaris from any source. I typically download the images from
nana.eng and use my jumpstart server. You can also install from CD,
DVD, etc. Once you install on a slice, you can dd(1) it to other slices
and fix /etc/vfstab. This is the fastest way of installing multiple
Solaris instances on a disk. If you want another version, or a
different build, bfu is your friend. You can also save off these slices
to some /net/... place and restore an OS at will (again using dd both
ways, since you need to preserve the boot blocks). If you slice
multiple machines this way, you can even copy slices across machines
(assuming the same architecture, etc.) -- more scripts are needed to
change /etc/hosts, hostname, net/*/hosts, etc.
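A sketch of the dd/vfstab dance, with hypothetical device names:

    # clone the freshly installed OS from slice 0 to slice 3
    dd if=/dev/rdsk/c0t0d0s0 of=/dev/rdsk/c0t0d0s3 bs=1024k

    # mount the copy and point its vfstab at the new slice
    mount /dev/dsk/c0t0d0s3 /mnt
    vi /mnt/etc/vfstab     # change the / entry to c0t0d0s3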

Install via Jumpstart: Setup Profile

If you like things automated, you can perform a hands-off install via
custom jumpstart. The first step is to set up the profile for your
server. Since you want to preserve the existing partitions, you have to
use the preserve keyword. The profile for my machine looks like the
following:
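Here is roughly what such a profile looks like. The slice assignments and the software cluster are illustrative, and you should double-check the exact preserve syntax against the Custom JumpStart documentation for your release.

    install_type    initial_install
    system_type     standalone
    partitioning    explicit
    cluster         SUNWCXall
    filesys         c0t0d0s3   existing   /
    filesys         c0t0d0s1   existing   swap
    # ...plus one filesys line per remaining slice using the preserve
    #    option, so the other installed OS's are left untouched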

Note that these profiles can be stored on any server. That machine does
not need to have anything special installed. You only need to make sure
that the location of the profile and the other custom jumpstart scripts
is shared via NFS in read-only mode.

Jumpstart

On the jumpstart server (abc.yyy in my case), we added our machine to
the list of clients as follows:
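The usual add_install_client invocation does the trick; the hostnames, the install-image path, and the platform group below are placeholders for your own setup.

    # on the jumpstart server
    cd /export/install/nevada/Solaris_11/Tools
    ./add_install_client -c abc.yyy:/export/jumpstart myclient sun4u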

Monday Nov 28, 2005

As I mentioned in my previous blog entry, the ZIL (ZFS Intent Log)
operates with block sizes between ZIL_MIN_BLKSZ (4K) and ZIL_MAX_BLKSZ
(128K). Let us take a closer look at this.

The ZIL has to allocate a new zil block before it commits the current
zil block, because the block being committed has to contain a link to
the next zil block. If you do not preallocate, you have to update the
next pointer in the previous block whenever you write a new zil block.
This means that you have to read in the previous block, update the next
pointer, and rewrite it. Obviously this is quite expensive (and quite
complicated).

The current block selection strategy is to choose either the sum of the
sizes of all outstanding zil blocks or, if no outstanding zil blocks
are present, the size of the last zil block that was committed. If the
size of the outstanding zil blocks is greater than 128K, it is capped
at 128K.

The above strategy works in most cases, but behaves badly for
certain edge cases.

Let us examine the zil block size for the set of actions described
below
(dtrace -n '::zil_lwb_commit:entry{@[1] = quantize(((lwb_t *)args[2])->lwb_sz);}')

When the first O_DSYNC write was initiated in (4), the zil coalesced
all outstanding log operations into big blocks (in my case a 128K block
and a 64K block) and then did a zil_commit. The next O_DSYNC write then
chose 64K as the zil block size, as that was the size of the last
zil_commit. The following O_DSYNC writes then continued to use 64K as
the zil block size.

Neil Perrin filed CR 6354547: sticky log buf size to fix this issue.
His proposed fix is to use the size of the last block as the basis for
the size of the new block. This should work optimally for most cases,
but there is a possibility of empty log writes. I need to investigate
this issue with "real" workloads.

Wednesday Nov 16, 2005

A quick guide to the ZFS Intent Log (ZIL)

I am not a ZFS developer. However, I am interested in ZFS performance,
and am intrigued by ZFS logging. I figure a good way to learn about
something is to blog about it. What follows are my notes as I made my
way through the ZIL.

Introduction

Most modern file systems include a logging feature to ensure faster
write times and faster crash recovery (fsck). UFS has supported logging
since Solaris 2.7 and uses logging by default on Solaris 10. Our tests
internally have shown that logging file systems perform as well as (and
sometimes even better than) non-logging file systems.

Logging is implemented in ZFS via the ZFS Intent Log module. The ZFS
Intent Log, or ZIL, is implemented in the zil.c file. Here is a brief
walk-through of the logging implementation in ZFS. All of this
knowledge can be found in the zil.[c|h] files in the ZFS source code. I
also recommend you check out Neil's blog -- he is one of the ZFS
developers who works on the ZIL.

All file system related system calls are logged as transaction
records
by the ZIL. These transaction records contain sufficient information
to replay them back in the event of a system crash.

ZFS operations are always part of a DMU (Data Management Unit)
transaction. When a DMU transaction is opened, a ZIL transaction is
also opened. This ZIL transaction is associated with the DMU
transaction, and in most cases discarded when the DMU transaction
commits. These transactions accumulate in memory until an fsync or
O_DSYNC write happens, in which case they are committed to stable
storage. For committed DMU transactions, the ZIL transactions are
discarded (from memory or stable storage).

The ZIL consists of a zil header, zil blocks, and a zil trailer. The
zil header points to a list of records. Each of these log records is a
variable-sized structure whose format depends on the transaction type.
Each log record consists of a common structure of type lr_t followed by
multiple structures/fields that are specific to each transaction. These
log records can reside either in memory or on disk. The on-disk format
is described in zil.h. ZIL records are written to disk in
variable-sized blocks. The minimum block size is defined as
ZIL_MIN_BLKSZ and is currently 4096 (4K) bytes. The maximum block size
is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128K).
The zil block size written to disk is chosen to be either the size of
all outstanding zil blocks (capped at ZIL_MAX_BLKSZ) or, if there are
no outstanding ZIL transactions, the size of the last zil block that
was committed.

ZIL and write(2)
The zil behaves differently for different sizes of writes. For small
writes, the data is stored as part of the log record. For writes
greater than zfs_immediate_write_sz (64KB), the ZIL does not store a
copy of the write; instead it syncs the data to disk and stores only a
pointer to the synced data in the log record.
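To get a feel for where your writes land relative to that 64KB cutoff, here is a quick sketch that aggregates write(2) sizes issued against ZFS files:

    # distribution of write(2) sizes hitting ZFS
    dtrace -n 'syscall::write:entry /fds[arg0].fi_fs == "zfs"/
        { @["write size (bytes)"] = quantize(arg2); }'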
We can examine the write(2)
system call on ZFS using dtrace.

 230  -> zfs_write                  21684
 230  -> zfs_prefault_write         28005
 230     zfs_time_stamper           69932
 230  -> zfs_time_stamper_locked    72893
 230     zfs_log_write              81054
 [...]
As you can see, there is a log entry (zfs_log_write) associated with
every write(2) call. If the file was opened with the O_DSYNC flag,
writes are supposed to be synchronous. For synchronous writes, the ZIL
has to commit the zil transaction to stable storage before returning.
For non-synchronous writes, the ZIL holds the transaction in memory
until the DMU transaction commits or there is an fsync or an O_DSYNC
write.

zil.c walk-through

There are several zil functions that operate on zil records.
What follows is a very brief description of their functionality.

zil_create() creates a dmu
transaction and allocates a first log block
and commits it.

zil_itx_assign() is used to
associate this intent log transaction with
a dmu transaction.

zil_itx_clean() is used to clean up all in-memory log transactions.
Clearing in-memory zil transactions implies that they are not flushed
to disk. zil_itx_clean() is called via the zil_clean() function, which
dispatches a work request to a dispatch thread.

zil_sync() is where ZIL transactions are cleaned (or deleted), once the
DMU transactions they are assigned to are committed to disk (maybe as a
result of an fsync). It is mostly called from the txg_sync_thread every
txg_time (5 seconds).
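If you want to watch these routines firing on a live system, here is a rough fbt-based sketch (the function names are as in the current source and may change):

    # count calls to the main ZIL entry points every 5 seconds
    dtrace -n 'fbt:zfs:zil_commit:entry,
        fbt:zfs:zil_itx_assign:entry,
        fbt:zfs:zil_clean:entry
        { @[probefunc] = count(); }
        tick-5sec { printa(@); trunc(@); }'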

At file system mount time, ZFS checks to see if there is an intent log.
If there is one, it implies that the system crashed (as the ZIL is
deleted at umount(2) time). The intent log is converted to a replay log
and is replayed to bring the file system to a stable state. If both a
replay log and an intent log are present, it implies that the system
crashed while replaying the replay log, in which case it is OK to
ignore/delete the replay log and replay the intent log.

ZIL Tunables
I am almost tempted to mention some tunables here, but the truth is
that ZFS is intended to not require any tuning. ZFS should (and will)
perform optimally "out of the box". You might find some switches in the
code, but they are only for internal development and will be yanked out
soon!

ZIL Performance
As you must have figured out by now, ZIL performance is critical for
the performance of synchronous writes. A common application that issues
synchronous writes is a database, which means that all of those writes
run at the speed of the ZIL. The ZIL is already quite optimized, and I
am sure ongoing efforts will optimize this code path even further. As
Neil mentions, using nvram/solid state disks for the log would make it
scream! I also recommend that you check out Roch's work on ZFS
performance for details of other performance studies in progress.

Tuesday Nov 15, 2005

I am Neelakanth Nadgir and I am a part of PA2E
(Performance Architecture, and Availability Organization)
group. I work out of Menlo Park, CA. My professional
interests include scalability, networking, filesystems,
distributed systems etc.

Before joining PA2E, I worked at Sun's Market
Development Engineering, where I spent 4 years working on
Performance tuning, Porting, Sizing, and ISV account
management.

I was also involved with several open source projects. I am an active
member of the JXTA community and jointly started two projects, the Ezel
Project and the JNGI Project. I have also served as web-master for the
GNU project for 2 years. I also contributed to the Mozilla project in
the past by providing SPARC binaries and misc performance fixes.

Before working at Sun, I graduated with a master's in Computer Science
from Texas Tech University in Lubbock, TX (Go Raiders!). My thesis was
on the reliability of distributed systems, where I devised a faster
algorithm for calculating minimal file spanning trees. I have a
bachelor's degree in Computer Science from Karnatak University, India.

My other interests include cricket and tropical aquarium fish (African
cichlids in particular). My favorite fish is Pseudotropheus demasoni.
My wife got me hooked on the aquarium hobby after we got married, and
before I knew it, we had more than 60 fish in 6 tanks :-)

I plan to use this blog to share the knowledge that I have gained from
working with lots of cool people here at Sun. Stay tuned for more
insights!