Roch (rhymes with Spock) Bourbonnais: Kernel Performance Engineering

Monday Nov. 10, 2008

Today Sun is announcing a new line of Unified
Storage designed by a core of the most brilliant engineers. For
starters, Mike Shapiro provides a great introduction to this product,
the new economics behind it, and the killer app in Sun
Storage 7000.

The killer app is of course Bryan Cantrill's brainchild, the already
famous Analytics.
As a performance engineer, it's been a great thrill to give this
tool an early test drive. Working a full ocean (the Atlantic) plus a
continent (the USA) away from my system running Analytics, I was
skeptical at first that I would be visualizing all that
information in real time: the NFS/CIFS ops, the disk ops, the CPU load and network
throughput, per client, per disk, per file. ARE YOU CRAZY! All that
information available IN REAL TIME; I just have to say a big thank you
to the team that made it possible. I can't wait to see our customers
put this to productive use.

Let us not forget the immense contribution of the boundless energy bubble
that is Brendan Gregg, the man who brought the DTraceToolkit to the
semi-geek; he must be jumping with excitement as we now see the power
of DTrace delivered to each and every system administrator.
He talks here about the Status
Dashboard. And Brendan's contribution does not stop there: he is
also the parent of this wonderful component of the Hybrid Storage Pool (HSP) known
as the L2ARC, which is how the readzillas become activated. See his own
previous work on the L2ARC along with Jing Zhang's more recent
studies. Quality assurance people don't often get into the spotlight, but check out Tim Foster's post on how he tortured the zpool code,
adding and removing L2ARC devices from pools.

For myself, it's been very exciting to see performance
improvement ideas get turned into product improvements from week to
week. Those interested should read how our group influenced the product that
is shipping today; see Alan Chiu's entry
and my own Delivering Performance Improvements.

Such a product has a strong price/performance appeal, and given that we
fundamentally did not think that there were public benchmarks that
captured our value proposition, we had to come up with a third-millennium,
participative way to talk about performance. Check out how we
designed our Metrics,
or maybe go straight to our numbers obtained by Amitabha
Banerjee, a concise entry backed up by an immense, intense and
careful data gathering effort over the last few weeks. bmseer is shedding his own light
on the low-level data (data to be updated with numbers from a grander config).

On the application side, we have the great work of Sean (Hsianglung
Wu) and Arini Balakrishnan showing how a 7210 can deliver more than 5000 concurrent video streams at an aggregate of,
you're kidding: WOW ZA, 750MB/sec.
More details on how this was achieved in cdnperf.

See our Vice President, Solaris Data, Availability, Scalability &
HPC Bob Porras trying to tame this beast into a nutshell
and pointing out code bits reminding everyone of the value of the
OpenStorage proposition.

We can talk all we want about performance but as Josh Simons points out,
these babies are available to you for your own
try and buy.
Or check out how you could be running the appliance within the next hour, really:
Sun Storage 7000 in VMware.

It seems I am in competition with another, less verbose aggregator.
Finally, capture the whole stream of postings related to
Sun Storage 7000.

Tuesday Nov. 04, 2008

The standard answer to any computer performance question is
almost always : "it depends" which is semantically
equivalent to "I don't know". The better answer is to state
the dependencies.
I would certainly like to see every performance issue studied with a
scientific approach. OpenSolaris and DTrace are just incredible
enablers when trying to reach root cause, and finding those causes is
really the best way to work toward delivering improved performance.
More generally though, people use common wisdom or possibly faulty
assumptions to match their symptoms with those of other, similar reported
problems. And, as human nature has it, we'll easily blame the
component we're least familiar with for problems. So we often end up
with a lot of reports of ZFS performance problems that, once drilled down,
turn out to be either totally unrelated to ZFS (say, HW problems),
misconfigurations, departures from Best Practices or, at times,
unrealistic expectations.
That does not mean there are no issues. But it's important
that users can more easily identify known issues and plan
for fixes, workarounds, etc. So anyone deploying ZFS should
really be familiar with these two sites: ZFS Best Practices and the Evil Tuning Guide.
That said, what are the real, commonly encountered performance problems
I've seen, and where do we stand?

Writes overrunning memory
That is a real problem that was fixed last March, and the fix is integrated in
the Solaris 10 Update 6 (U6) release. Running out of memory causes many different
types of complaints and erratic system behavior. This can happen
anytime a lot of data is created and streamed at a rate greater than
the rate at which it can be written into the pool. Solaris 10 U6 will be an important
shift for customers running into this issue. ZFS will still try to
use memory to cache your data (a good thing), but the competition this
creates for memory resources will be much reduced. The way ZFS is
designed to deal with this contention (ARC shrinking) will need a new
evaluation from the community. The lack of throttling was a great
impairment to the ability of the ARC to give back memory under
pressure. In the meantime, lots of people are capping their ARC size
with success, as per the Evil Tuning Guide.
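To make the "data created faster than the pool can absorb it" scenario concrete, here is a minimal toy model in Python. It is only a sketch: the function name, the parameters and all the rates are hypothetical, and it is not the actual ZFS write throttle algorithm. It simply shows how un-throttled dirty data grows without bound while a capped version levels off.

```python
def simulate_dirty_data(ingest_mb_s, drain_mb_s, seconds, dirty_limit_mb=None):
    """Return the dirty (not yet on disk) data in MB at the end of each second.

    Toy model only: every name and number here is hypothetical, and this
    is not how the real ZFS write throttle is implemented.
    """
    dirty = 0.0
    history = []
    for _ in range(seconds):
        accepted = ingest_mb_s
        if dirty_limit_mb is not None:
            # Throttled case: accept only what keeps dirty data under the
            # limit once this second's drain to disk is accounted for.
            room = dirty_limit_mb - dirty + drain_mb_s
            accepted = max(0.0, min(ingest_mb_s, room))
        dirty = max(0.0, dirty + accepted - drain_mb_s)
        history.append(dirty)
    return history


if __name__ == "__main__":
    # Hypothetical workload: 500 MB/s produced, pool absorbs 200 MB/s.
    unthrottled = simulate_dirty_data(500, 200, 60)
    throttled = simulate_dirty_data(500, 200, 60, dirty_limit_mb=1024)
    print("unthrottled dirty data after 60s: %.0f MB" % unthrottled[-1])
    print("throttled dirty data after 60s:   %.0f MB" % throttled[-1])
```

In the un-throttled run, memory consumption grows by the difference between the two rates every second; in the throttled run, the writer is slowed to the pool's pace once the limit is reached.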
For more on this topic check out: The new ZFS write throttle

Cache flushes on SAN storage
This is a common issue we hit in the enterprise. Although it will
cause ZFS to be totally underwhelming in terms of performance, it is,
interestingly, not a sign of any defect in ZFS. Sadly this touches
the customers that are the most performance minded. The issue is somewhat
related to ZFS and somewhat to the storage. As is well documented
elsewhere, ZFS will, at critical times, issue "cache flush" requests to
the storage elements on which it is layered. This is to take into
account the fact that storage can be layered on top of _volatile_
caches that do need to be flushed to stable storage for ZFS to reach its
consistency points. Enterprise storage arrays do not use _volatile_
caches to store data and so should ignore the request from ZFS to
"flush caches". The problem is that some arrays don't ignore it. This
misunderstanding between ZFS and storage arrays leads to underwhelming
performance. Fortunately we have an easy workaround that can be used
to quickly identify whether this is indeed the problem: setting
zfs_nocacheflush (see the Evil Tuning Guide). The best workaround here is
to configure the storage to indeed ignore "cache
flush". We also have the option of tuning sd.conf on a per-array
basis. Refer again to the Evil Tuning Guide for more detailed
information.

NFS slow over ZFS (Not True)
This is just not generally true and is often a side effect of the
previous cache flush problem. People have used storage arrays to
accelerate NFS for a long time but failed to see the expected gains with
ZFS. Many sightings of NFS problems are traced to this.
Other sightings involve common disks with volatile
caches. Here the performance deltas observed are rooted in
the stronger semantics that ZFS offers to this operational
model. See NFS and ZFS for a more detailed description of the
issue.
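One rough way to see what synchronous semantics cost on a given setup is to time small writes that are each forced to stable storage. The Python sketch below does that with fsync; the function name, the write size and the write count are arbitrary choices made for illustration, and how to interpret the resulting number for a particular array or disk is left to the reader.

```python
import os
import tempfile
import time


def time_sync_writes(directory, count=50, size=8192):
    """Return the average seconds per write when each write is fsync'ed."""
    payload = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # ask for the data to be on stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / count


if __name__ == "__main__":
    # Hypothetical usage: point this at a directory on the filesystem under test.
    avg = time_sync_writes(tempfile.gettempdir())
    print("average synchronous write latency: %.3f ms" % (avg * 1000))
```

Running it against the same storage with and without a non-volatile write accelerator in the path gives a first-order feel for how much of an observed delta comes from the synchronous path alone.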
While I don't consider ZFS as generally slow serving NFS, we did
identify in recent months a condition that affects high thread counts
of synchronous writes (such as a DB). This issue is fixed in
Solaris 10 Update 6 (CR 6683293).
I would encourage you to be familiar with where we stand regarding ZFS
and NFS because I know of no big, gaping ZFS over NFS problems (if
there were one, I think I would know). People just need to be aware
that NFS is a protocol that needs some type of acceleration (such as NVRAM)
in order to deliver a user experience close to what a direct-attach
filesystem provides.

ZIL is a problem (Not True)
There is a wide perception that the ZIL is the source of performance
problems. This is just a naive interpretation of the facts. The ZIL
serves a very fundamental function of the filesystem and does that
admirably well. Disabling the synchronous semantics of a filesystem
will necessarily lead to higher performance, in a way that is totally
misleading to the outside observer. So while we are looking at further
ZIL improvements for large-scale problems, the ZIL is just not today
the source of common problems. So please don't disable it unless you
know what you're getting into.

Random read from Raid-Z
Raid-Z is a great technology that allows blocks to be stored on top of
common JBOD storage without being subject to raid-5 write hole
corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However,
the performance characteristics of raid-z depart from those of
raid-5 significantly enough to surprise first-time users. Raid-Z as currently
implemented spreads blocks to the full width of the raid group and
creates extra IOPS during random reading. At lower loads, the latency
of operations is not impacted, but sustained random read loads can
suffer. However, workloads that end up with frequent cache hits will
not be subject to the same penalty as workloads that access vast
amounts of data more uniformly. This is where one truly needs to say,
"it depends".
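The dependencies can be stated with a back-of-the-envelope model. The Python sketch below is a simplification under loud assumptions: every cache miss is assumed to cost one I/O on each data disk of the raid group (the full-width behavior described above), parity is assumed not to be read on a healthy pool, and the disk counts and per-disk IOPS are made-up numbers.

```python
def max_random_reads_per_sec(disks, iops_per_disk, ios_per_read, cache_hit_ratio=0.0):
    """Estimate sustainable application random reads per second.

    disks           -- number of disks in the pool (hypothetical)
    iops_per_disk   -- random IOPS one disk can sustain (hypothetical)
    ios_per_read    -- disk I/Os caused by one cache-missing block read
    cache_hit_ratio -- fraction of reads served from cache, in [0, 1)
    """
    if not 0.0 <= cache_hit_ratio < 1.0:
        raise ValueError("cache_hit_ratio must be in [0, 1)")
    disk_budget = disks * iops_per_disk
    miss_ratio = 1.0 - cache_hit_ratio
    # Only cache misses reach the disks, and each miss costs ios_per_read I/Os.
    return disk_budget / (ios_per_read * miss_ratio)


if __name__ == "__main__":
    # Hypothetical pool: 8 disks at 100 random IOPS each.
    # Assumed 7+1 raid-z group: a missed read touches all 7 data disks.
    raidz = max_random_reads_per_sec(8, 100, ios_per_read=7)
    # Layout where a missed read touches a single disk.
    one_disk = max_random_reads_per_sec(8, 100, ios_per_read=1)
    raidz_cached = max_random_reads_per_sec(8, 100, ios_per_read=7, cache_hit_ratio=0.9)
    print("raid-z, no cache hits:       %.0f reads/s" % raidz)
    print("one disk per read, no cache: %.0f reads/s" % one_disk)
    print("raid-z, 90%% cache hits:      %.0f reads/s" % raidz_cached)
```

Under these assumptions, the wide group without cache hits delivers roughly the random read rate of a single disk, while a high cache hit ratio largely hides the penalty.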
Interestingly, the same problem does not affect Raid-Z streaming
performance and won't affect workloads that commonly benefit from
caching. That said, both random and streaming performance can be
improved, and we are looking at a number of different ways to improve
on this situation. To better understand Raid-Z, see one of my very
first ZFS entries on this topic: Raid-Z

CPU consumption, scalability and benchmarking
This is an area we will need to study further. With today's very
capable multicore systems, there are many workloads that won't suffer
from the CPU consumption of ZFS. Most systems do not run 100% CPU
bound (being more generally constrained by disk, networks or
application scalability), and the user-visible latency of operations
is not strongly impacted by extra cycles spent in, say, ZFS
checksumming.
However, this view breaks down when it comes to system benchmarking.
Many benchmarks I encounter (the most crafted ones to boot) end up as
host CPU efficiency benchmarks: how many operations can I do on this
system, given large amounts of disk and network resources, while
preserving some level X of response time? The answer to this question
is purely the inverse of the cycles spent per operation.
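That inverse relationship can be written down directly. The Python sketch below uses hypothetical core counts, clock rates and cycle costs, and it ignores the response-time constraint and any scalability limit; it only illustrates why extra cycles per operation translate straight into a lower benchmark score once the CPU is the bottleneck.

```python
def max_ops_per_sec(cores, clock_hz, cycles_per_op):
    """Upper bound on throughput when the host CPU is the only bottleneck."""
    return cores * clock_hz / cycles_per_op


if __name__ == "__main__":
    # Hypothetical host: 8 cores at 2 GHz.
    base = max_ops_per_sec(8, 2_000_000_000, 100_000)
    # Same host, with a hypothetical 20,000 extra cycles per operation.
    heavier = max_ops_per_sec(8, 2_000_000_000, 120_000)
    print("baseline:          %.0f ops/s" % base)
    print("with extra cycles: %.0f ops/s" % heavier)
```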
This concern is more relevant when the CPU cycles spent managing
direct-attach storage and the filesystem are in direct competition with
cycles spent in the application. This is also why database
benchmarking is often associated with using raw devices, a practice much
less encountered in common deployments.
Root causing scalability limits and efficiency problems is just part of the never-ending performance optimisation
of filesystems.

Direct I/O
Direct I/O has been a great enabler of database performance in other
filesystems. The problem for me is that Direct I/O is a group of
improvements, each with its own contribution to the end result. Some
want the concurrent writes, some want to avoid a copy, some want to
avoid double caching, and some don't know but see performance gains when
it is turned on (some also see a degradation). I note that concurrent writes
have never been a problem in ZFS and that the extra copy used when
managing a cache is generally cheap considering common DB rates of
access. Achieving greater CPU efficiency is certainly a valid goal,
and we need to look into what is impacting this in common DB
workloads. In the meantime, ZFS in OpenSolaris got a new feature to
manage the cacheability of data in the ZFS ARC. The per-filesystem
"primarycache" property will allow users to decide whether blocks should
actually linger in the ARC cache or just be transient. This will
allow DBs deployed on ZFS to avoid any form of double caching that
might have occurred in the past.
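As a small illustration, the Python sketch below only builds the `zfs set` command line for the "primarycache" property, which accepts the values all, none and metadata. The helper name and the dataset name `tank/db` are hypothetical, and which value suits a given database is something to validate by measurement rather than take from this example.

```python
VALID_MODES = ("all", "none", "metadata")


def primarycache_command(dataset, mode):
    """Build the argument list for setting primarycache on one dataset."""
    if mode not in VALID_MODES:
        raise ValueError("primarycache must be one of %s" % ", ".join(VALID_MODES))
    return ["zfs", "set", "primarycache=%s" % mode, dataset]


if __name__ == "__main__":
    # "tank/db" is a made-up dataset name used only for illustration.
    cmd = primarycache_command("tank/db", "metadata")
    print(" ".join(cmd))
```

The returned list could be handed to `subprocess.run` on a host that has the zfs command available; the sketch stops at printing it so that it stays safe to run anywhere.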

ZFS performance is and will be a moving target for some time to
come. Solaris 10 Update 6, with its new write throttle, will be a
significant change, and OpenSolaris offers additional
advantages. But generally, just be skeptical of any performance issue that is
not root caused: the problem might not be where you expect it.