Virtually All The Time - Jeff's blog
https://blogs.oracle.com/jsavit/

<h1>So much news, so little time...</h1>
<p><i>Mon, 27 Dec 2010</i></p>
<p>
It's a <b>long</b> time since I've posted here, and so many things have happened.
I've been remiss in not keeping up, but my excuse is that it's been an eventful (and productive)
time since my last blog and I've been busy with Real Work.
<p>
Well, what's happened in my interest areas? Lots of big things! In no specific order:
<ul>
<li><b>The new SPARC T3 chip and servers</b> for the chip multi-threading T-series. It <b>doubles</b> compute density and throughput compared to the prior generation, with up to 512 CPU threads in a 4-socket server.
<li><b>The SPARC64 VII+ chip</b> for the M-series systems, which emphasize single-thread performance. The SPARC64 VII+ has up to a 3.0GHz clock and L2 cache doubled to 12MB. This provides <b>performance substantially more than twice</b> that of the original M-series processors.
<li><b>New core multipliers</b> for these products: 0.25 for the T3 and 0.5 for the SPARC64 VII+. Combined with the performance increases mentioned above, this makes these servers exceptionally competitive platforms for running Oracle software products (and in general).
<li><b>New world record performance</b> results on SPARC, and announcement of the SPARC Supercluster Architecture.
<li><b>New Exadata models and announcement of the Exalogic Elastic Cloud</b> system.
Exadata has special sauce in hardware and software (e.g., InfiniBand connectivity, SSD storage, hybrid columnar compression)
that provide tremendous performance running Oracle RAC.
Exalogic provides similar benefits for the Java middleware tier.
What I like about these products is not just that they have tremendous performance,
but that they are engineered for scale and integrated so they arrive at the loading dock ready to be deployed, instead of the 6-month science project one often sees when different parts and pieces from multiple vendors have to be assembled on site.
<li><b>A new version of Logical Domains, now renamed Oracle VM Server for SPARC</b>. Among the new features is the ability to add and remove RAM on a running guest domain without disruption. I don't think there are many virtualization technologies with this ability, which provides operational flexibility under changing load conditions. (Adding and removing CPUs on a running domain has been possible since at least 2006.)
<li><b>New versions of VirtualBox</b>, with support for live migration, multiple-CPU guests, more graphics acceleration and remote display support, and improved exploitation of Intel and AMD virtualization performance extensions.
<li><b>Announcement of Oracle Solaris 11, and delivery of Oracle Solaris 11 Express 2010.11</b> (which I am now running). It has lots of powerful features, which I plan to discuss in future posts. Among them are:
<ul compact>
<li>Image Packaging System (IPS) which replaces the aging SVR4 packaging system
<li>Automated Installer (AI) which is the modernized network install system
<li>enhancements to ZFS, including ZFS encryption and deduplication
<li>network virtualization
<li>Solaris 10 Containers, which let you migrate and run virtualized Solaris 10 systems under Solaris 11.
</ul>
</ul>
<P>
These items have been described in official announcements, tech documents,
and in excellent blog entries by my colleagues:
<a href="http://blogs.sun.com/JeffV/">Jeff Victor</a>,
<a href="http://blogs.sun.com/scottdickson/">Scott Dickson</a>,
<a href="http://blogs.sun.com/stw/">Steffen Weiberle</a>,
<a href="http://blogs.sun.com/bobn/">Bob Netherton</a>, as well as many others.
I won't duplicate their content - instead, I recommend you visit the above links.
</p>
<h2>Enough news updates, Jeff. How about some actual content, huh?</h2>
<p>
While I haven't been blogging, I have been using some of the new technologies, and I'll start by describing how I upgraded my own systems to Oracle Solaris 11 Express
(S11E for short). One machine I simply "blew away" with a fresh install from CD media, and I did fresh installs within logical domains as well. For my daily-use desktop and laptop systems I upgraded from OpenSolaris, using the instructions in the release notes. I strongly favor reversible system changes - regardless
of OS or vendor - so you can fall back to a "last known good" configuration.
I took this route for my own protection (in case I ran into trouble on the upgraded system), and also so I could illustrate the process for others.
<p>
My systems were running an internal build of the OpenSolaris code base, build 142,
and I was operating my own software repository in my lab,
so my commands may be slightly different from somebody using the
public OpenSolaris build at 134b level.
<p>
First, I built and booted into a fresh boot environment (BE) without making any software changes.
Since alternate boot environments are so "cheap" to create and use,
it's a good idea to make a new one before messing about with the system.
This lets you easily fall back to the last known "production" environment in case of any difficulty.
The few seconds it takes to do a <code>beadm create</code> and the tiny disk space consumed
are well worth the protection a new boot environment provides.
<P>
Below I have text captured before and just after rebooting the new environment.
In this case, I was also replacing an internal-use network package with a new one
(not illustrated below)
so there really was a change between BE opensolaris-3 and the "prior production environment" opensolaris-2.
Note that in some examples I do this work from a non-root userid. You don't
have to be 'root' (with all that implies) if you use role based access controls.
<pre>
away ~ $ pfexec beadm create opensolaris-3
away ~ $ pfexec beadm activate opensolaris-3
away ~ $ beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
opensolaris - - 7.17M static 2010-08-27 14:55
opensolaris-1 - - 9.63M static 2010-08-28 08:43
opensolaris-2 N / 33.5K static 2010-11-03 17:18
opensolaris-3 R - 6.22G static 2010-11-17 16:54
away ~ $ pfexec init 6
... a brief pause to reboot and login again ...
away ~ $ beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
opensolaris - - 7.17M static 2010-08-27 14:55
opensolaris-1 - - 9.63M static 2010-08-28 08:43
opensolaris-2 - - 8.11M static 2010-11-03 17:18
opensolaris-3 NR / 6.23G static 2010-11-17 16:54
</pre>
<p>
The "N" flag indicates which boot environment is active <u>n</u>ow,
and "R" indicates which will become active on the next <u>r</u>eboot.
If necessary, I could safely fall back by activating an older BE.
You can see that inactive boot environments take up very little disk space. The disk
footprint represents the <i>difference</i> in contents between boot environments.
<p>
I then followed the steps in the release guide for upgrading to Oracle Solaris 11 Express 2010.11.
The first step is to change repository publishers from my
personal lab (at the build 142 level) to the official repository (at 151a level).
I removed "contributions" repositories I decided I didn't need any more.
I also pointed to an OpenSolaris repository to see if there was anything
I needed for the current image. Unsurprisingly there wasn't, since I was already at a later level. The tool prevents down-revving software components.
(I also admit to simply playing with the publisher settings just to see what happens. Note that the raw IP address is a machine in my lab.)
<pre>
root@away:~# pkg publisher
PUBLISHER TYPE STATUS URI
opensolaris.org (preferred) origin online http://192.168.1.3/
contrib.opensolaris.org origin online http://pkg.opensolaris.org/contrib/
software-packages.org (disabled) origin online http://ips.software-packages.org/
root@away:~# pkg set-publisher -P -O http://pkg.opensolaris.org/release/ opensolaris.org
root@away:~# pkg publisher
PUBLISHER TYPE STATUS URI
opensolaris.org (preferred) origin online http://pkg.opensolaris.org/release/
contrib.opensolaris.org origin online http://pkg.opensolaris.org/contrib/
software-packages.org (disabled) origin online http://ips.software-packages.org/
root@away:~# pkg image-update
No updates available for this image.
root@away:~# pkg set-publisher --non-sticky opensolaris.org
root@away:~# pkg set-publisher --non-sticky contrib
root@away:~# pkg publisher
PUBLISHER TYPE STATUS URI
opensolaris.org (non-sticky, preferred) origin online http://pkg.opensolaris.org/release/
contrib.opensolaris.org (non-sticky) origin online http://pkg.opensolaris.org/contrib/
software-packages.org (disabled) origin online http://ips.software-packages.org/
root@away:~# pkg set-publisher -P -g http://pkg.oracle.com/solaris/release/ solaris
root@away:~# pkg publisher
PUBLISHER TYPE STATUS URI
solaris (preferred) origin online http://pkg.oracle.com/solaris/release/
opensolaris.org (non-sticky) origin online http://pkg.opensolaris.org/release/
contrib.opensolaris.org (non-sticky) origin online http://pkg.opensolaris.org/contrib/
software-packages.org (disabled) origin online http://ips.software-packages.org/
root@away:~# pkg unset-publisher contrib
root@away:~# pkg unset-publisher software-packages.org
root@away:~# pkg unset-publisher opensolaris.org
root@away:~# pkg publisher
PUBLISHER TYPE STATUS URI
solaris (preferred) origin online http://pkg.oracle.com/solaris/release/
</pre>
<p>
Then I started the upgrade, which downloaded new package contents and put them in a new BE.
</p>
<pre>
root@away:~# pkg image-update --accept -f
------------------------------------------------------------
Package: pkg://solaris/consolidation/osnet/osnet-incorporation@0.5.11,5.11-0.151.0.1:20101104T230646Z
License: usr/src/pkg/license_files/lic_OTN
Oracle Technology Network Developer License Agreement
Oracle Solaris, Oracle Solaris Cluster and Oracle Solaris Express
... many other lines with license and export control stuff...
DOWNLOAD PKGS FILES XFER (MB)
Completed 985/985 27598/27598 708.6/708.6
PHASE ACTIONS
Removal Phase 10438/10438
Install Phase 18266/18266
Update Phase 31125/31125
A clone of opensolaris-3 exists and has been updated and activated.
On the next boot the Boot Environment opensolaris-4 will be mounted on '/'.
Reboot when ready to switch to this updated BE.
Deleting content cache
---------------------------------------------------------------------------
NOTE: Please review release notes posted at:
http://docs.sun.com/doc/821-1479
---------------------------------------------------------------------------
</pre>
<P>
The update is now complete, in a brand new boot environment.
The new BE's name, <code>opensolaris-4</code>, is misleading
(derived, I suppose, from the current BE name),
but it contains Oracle Solaris 11 Express.
<pre>
root@away:~# beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
opensolaris - - 7.17M static 2010-08-27 14:55
opensolaris-1 - - 9.63M static 2010-08-28 08:43
opensolaris-2 - - 8.11M static 2010-11-03 17:18
opensolaris-3 N / 9.62M static 2010-11-17 16:54
opensolaris-4 R - 8.34G static 2010-11-17 19:01
root@away:~# init 6
updating //platform/i86pc/boot_archive
updating //platform/i86pc/amd64/boot_archive
</pre>
<P>
After rebooting into Oracle Solaris 11 Express, I create a
new BE called <code>s11-2</code>. That's not strictly necessary, but
I like to keep a pristine copy of the original environment,
both as a fallback if a future problem appears
and simply for comparison purposes.
Note that the new <code>zfs diff</code> command makes it easy to compare
snapshots and filesystems (a quick sketch follows the listing below).
<pre>
away ~ $ pfexec beadm create s11-2
away ~ $ beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
opensolaris - - 7.17M static 2010-08-27 14:55
opensolaris-1 - - 9.63M static 2010-08-28 08:43
opensolaris-2 - - 8.11M static 2010-11-03 17:18
opensolaris-3 - - 13.86M static 2010-11-17 16:54
opensolaris-4 NR / 8.39G static 2010-11-17 19:01
s11-2 - - 95.0K static 2010-11-18 13:34
away ~ $ pfexec beadm activate s11-2
</pre>
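<p>
As a quick sketch of <code>zfs diff</code> (the snapshot name is invented; BE datasets live under <code>rpool/ROOT</code>):
<pre>
away ~ $ pfexec zfs snapshot rpool/ROOT/s11-2@before-tinkering
away ~ $ pfexec zfs diff rpool/ROOT/s11-2@before-tinkering rpool/ROOT/s11-2
</pre>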
<p>
At this point I simply booted into <code>s11-2</code>, and once there
renamed the <code>opensolaris-4</code> BE
to <code>s11-1</code> so the name indicates what it contains.
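<p>
The rename itself is a one-liner:
<pre>
away ~ $ pfexec beadm rename opensolaris-4 s11-1
</pre>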
<p>
That's it - this machine is now running Oracle Solaris 11 Express.
If needed (and I haven't needed to), I can reactivate an older boot environment with OpenSolaris.
<P>
Which reminds me: with later versions of Solaris you often get a later version of the ZFS pool on-disk
format. ZFS data is upwards compatible, but you may see messages like the following:
<pre>
away ~ $ zpool status rpool
pool: rpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub repaired 0 in 2h51m with 0 errors on Fri Dec 24 19:16:38 2010
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c7d0s0 ONLINE 0 0 0
errors: No known data errors
</pre>
<P>
You can check the version of a current pool and see what the OS supports with the following commands:
<pre>
away ~ $ zpool upgrade
This system is currently running ZFS pool version 31.
The following pools are out of date, and can be upgraded. After being
upgraded, these pools will no longer be accessible by older software versions.
VER POOL
--- ------------
26 rpool
Use 'zpool upgrade -v' for a list of available versions and their associated
features.
away ~ $ zpool upgrade -v
This system is currently running ZFS pool version 31.
The following versions are supported:
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Deduplication
22 Received properties
23 Slim ZIL
24 System attributes
25 Improved scrub stats
26 Improved snapshot deletion performance
27 Improved snapshot creation performance
28 Multiple vdev replacements
29 RAID-Z/mirror hybrid allocator
30 Encryption
31 Improved 'zfs list' performance
For more information on a particular version, including supported releases,
see the ZFS Administration Guide.
</pre>
<P>
Upgrading to S11E added pool versions 27 through 31.
Version 30 - encryption - is the one I really want to play with!
However, the system runs just fine with previously existing features,
and you can still fall back to the older version of Solaris
that understands the older ZFS pool version.
When you are <u>sure</u> you will never need to fall back to the older OS version
you can issue the <code>zpool upgrade</code> command to enable new functionality,
and use <code>beadm destroy</code> to remove obsolete boot environments.
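<p>
I haven't taken that step yet, but when I do, the cleanup is short. Here's a sketch, using BE names from the listings above (the last line shows how I'd try out version 30 encryption - the dataset name is invented, and the command prompts for a passphrase):
<pre>
away ~ $ pfexec zpool upgrade rpool        # one-way: older OS versions can no longer import the pool
away ~ $ pfexec beadm destroy opensolaris  # repeat for each obsolete BE
away ~ $ pfexec zfs create -o encryption=on rpool/vault
</pre>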
<h3>Summary</h3>
A lot of things have happened since my last post - this entry describes just a few of the highlights, and shows how to smoothly upgrade an existing OpenSolaris system to Oracle Solaris 11 Express. I recommend that people download and try out S11E to get familiar with the many new features. Later, I'll blog on some of the new functionality delivered with this OS release.

<h1>Response to "A comparison of virtualization features of HP-UX, Solaris and AIX"</h1>
<p><i>Thu, 25 Mar 2010</i></p>
<h2>Oh no, not again...</h2>
<p>
I've been able to avoid refuting FUD and misinformation for quite a while
(and stick to my preferred topic: blogging on technology), but I just got an
e-mail pointing me to Ken Milberg's
<a href="http://www.ibm.com/developerworks/aix/library/au-aixvirtualization/index.html?ca=dgr-twtrComparVirtdth-AIX&S_TACT=105AGY83&S_CMP=TWDW">A
comparison of virtualization features of HP-UX, Solaris and AIX</a> on IBM's website, and have to respond.
</p>
<p>
Just as in his previous article, to which I responded 3 years ago in
<a href="https://blogs.oracle.com/jsavit/entry/response_to_ibm_sun_and">Response to IBM, Sun and HP: Comparing UNIX Virtualization Offerings</a>,
he "surveys" the Unix vendors virtualization technologies and
makes observations about different virtualization products. Many of his comments are simply wrong.
As before, I'm not going to comment on his remarks about HP, for the simple reason that I don't know HP technology well enough, but
I will correct mistaken remarks about Sun (now Oracle) capabilities.
Note: his article is updated as of March 23, 2010, so it <i>should</i> be up to date in content.
<h2>Some miscellaneous bits first</h2>
<p>
Mr. Milberg says "Sun is also claiming features such as predictive self-healing, which has long been available on the System p."
It's not a "claim" - it has been a delivered feature of Solaris 10 for several years. We can decommission parts before they fail, when they
show signs of getting sick, as well as recover from errors.
I don't believe AIX has anything comparable.
Perhaps he should take a closer look to understand what Solaris does - whether on SPARC, Intel,
or AMD (something the "p-only" AIX cannot do).
</p>
<p>He also refers to VirtualBox with a reasonable description - but omits the fact that this is a capability
that we at Oracle have that IBM lacks - a desktop virtualization product that runs on the most popular chipset in the world.
To anybody who wants to run multiple virtual machines on their PC or laptop, don't hesitate: go to
<a href="http://www.virtualbox.org/">http://www.virtualbox.org/</a> and download a copy. It runs on Oracle Solaris,
Linux and Windows, and can host guests running those operating systems, BSD, and even OS/2!
<h2>Virtualization, point by point</h2>
<p>
First, the names: even before the Oracle acquisition, Sun had stopped calling its virtualization products "xVM".
Now, of course, they are part of the Oracle virtualization product set and are named accordingly - Logical Domains is
now "Oracle VM Server for SPARC."
This is something an article updated 2 days ago should reflect.
But that's a minor item when there are more serious factual errors to deal with.
<p>
He spends some effort disparaging Logical Domains, which I correct in the table below. Note that he insists on calling things "partitions" when they're
not: that's an IBM-specific term which doesn't apply elsewhere.
<p>
<table border=1>
<tr><td><b>Milberg claim</b></td><td><b>Correction</b></td></tr>
<tr><td> </td><td> </td></tr>
<tr>
<td valign="top">"Scalability - Only eight CPUs and 64 GB RAM on one machine"</td>
<td valign="top">Wrong by a mile: a T5440 goes up to 256 CPU threads on 32 cores on 4 chips, and goes up to 512GB of RAM</td>
</tr>
<tr>
<td valign="top">"Server-line - Only low-end Sparc servers are supported"</td>
<td valign="top">Wrong: Machines like a T5440 or T5240 are nobody's "low-end" machine</td>
</tr>
<tr>
<td valign="top">"Limited micro-partitioning - Four partitions on T1, 8 on T2"</td>
<td valign="top">Boy, oh boy is this wrong. A T1-based server (no longer sold) could go up to 32.
T2-based servers can go up to 64.
On the T5x40 servers based on the T2 plus chip, you can have up to 128 domains.
</td>
</tr>
<tr>
<td valign="top">"No dynamic allocation between partitions"</td>
<td valign="top">Wrong again: you can transfer CPUs, cryptographic accelerators, and I/O assets non-disruptively. Using the free resource manager in LDoms 1.3, you can even move CPU capacity between domains automatically, based on resource requirements.
</td>
</tr>
</table>
<p>
None of the statements he makes above is correct - he is wrong about fundamental platform capabilities.
Contrary to having "many inherent flaws", LDoms is a popular <b>no extra cost</b> feature of Oracle's Sun SPARC T-series product line
that compares very favorably to the expensive and less flexible virtualization options available on IBM POWER.
<p>
Besides these fundamental errors, Milberg misses some crucial points.
He forgets to mention the massive license fees you have to pay to use virtualization on POWER:
POWER7 license pricing for AIX, PowerVM and SWMA (the required Software Maintenance) is extremely expensive. AIX 6.1 on a POWER 780 8/32 is
about $130K, and PowerVM Enterprise Edition another $89K. That is a pretty hefty price - exclusive of maintenance!
Did I mention that Logical Domains, aka "Oracle VM Server for SPARC", comes at no cost at all?
<p>
Further, to do crypto at hardware speed on POWER, you have to buy crypto accelerator device$.
Also, while Milberg makes a number of comments about mobility, he forgets to mention that domains can be moved between T-series servers
without a reboot. That's another no-added-cost feature.
</p>
<p>
Finally, he makes a number of comments about Oracle Solaris Containers including the (fairly accurate) comment that we "had it and IBM did not."
Well, that's only half the story: IBM and its proxies like Ken spent several years disparaging Containers - until they
imitated them! :-)
<h2>Summary</h2>
<p>
Ken Milberg's article claims to be a comparison of virtualization technologies, but is marketing posing as analysis, and is full of fatal
errors.
<p>
The article trots out the old chestnut "IBM has a 40-plus year history of virtualization. No other vendor can come close to making this claim.",
which would be interesting if any of the POWER technologies were based on VM/370... but they're not.
He pulls out a second chestnut he's used before,
"They offer one virtualization strategy, PowerVM, unlike the myriad of solutions available from Sun or HP",
which is odd in an article that names several virtualization technologies available on POWER (PowerVM and WPARs) and
also refers to the (completely unrelated) mainframe virtualization technologies.
In reality, IBM offers multiple virtualization strategies - which isn't a bad thing (so do we), but it contradicts Ken's claim.
Unfortunately for IBM, they don't have their own products on x86 servers, so their solutions depend on 3rd parties,
while Oracle has Oracle VM, Oracle VirtualBox, and of course Solaris Containers - providing a complete virtualization portfolio.
<p>
In short, his article is merely a pitch, and is replete with errors.
Readers who want an accurate comparison of virtualization technologies will have to go elsewhere.

<h1>A new look at an old SA practice: separating /var from /</h1>
<p><i>Sun, 21 Mar 2010</i></p>
<h2>An old school SA practice...</h2>
<p>
This is probably the geekiest blog title I've used - but today's blog is a short look at two variations on the old sysadmin practice
of separating <code>/var</code> from <code>/</code>, inspired by recent "how do I do this?" calls.
<h2>Why do it? How was it done before?</h2>
<p>
This was traditionally done to ensure that growing space consumption in <code>/var</code>, perhaps caused by core, log or package files,
didn't exhaust space needed by critical parts of your file system. This could happen when some program kept dumping core or generating log entries;
running out of space would then add insult to injury by causing other failures.
<p>
Several techniques can prevent such problems. One method is to use <code>coreadm</code> to put core files somewhere else,
and to use <code>logadm</code> and <code>/etc/logadm.conf</code> to rotate log files on a schedule consistent with your disk space and retention policy (quick example below). But the biggest hammer, and the most complete solution, was to keep <code>/var</code> in a separate file system by giving
it a dedicated UFS file system on its own disk slice. That way, even if something ran amok and filled <code>/var</code>, it had no effect on
other file systems.
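<p>
For reference, the <code>coreadm</code>/<code>logadm</code> route looks roughly like this - the log name and limits are illustrative, so adjust them to your own retention policy:
<pre>
# coreadm -g /var/cores/core.%f.%p -e global   # collect global core dumps in one directory
# logadm -w /var/log/myapp.log -C 8 -s 10m     # rotate at 10MB, keeping 8 old copies
</pre>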
<p>
The disadvantage, of course, is the hassle of creating and sizing separate disk slices. You had to plan how many slices you
needed and how big they were, and if you got them wrong it was really inconvenient to change them.
You might have one slice and file system too big, wasting space you really needed in another slice, but reallocating space was a drag.
Having storage allocated into little islands was a real time-waster, especially on the itty-bitty disk drive capacities we used to live with.
</p>
ZFS, and in this case ZFS boot, pretty much eliminated this inconvenience - as I'll discuss in a moment.
<h2>But now, an old joke...</h2>
<p>
Before I go into the two examples that came up, a classic joke from mathematics or science class.
</p>
<p>
The professor is in the front of the classroom and writes an equation on the blackboard
(I'm picturing the professor I had when studying Fourier transforms in EE class, but I won't try to do his accent.)
Pointing to it, he tells the class, "As you can see, this theorem is clearly trivial."
<p>
Turning back to the blackboard he pauses for a moment, puts his hand on his chin and says "Hmmm.... just a moment."
Now he starts working on the equation's derivation, covering blackboard after blackboard with equations -
everything from &#945; to &#969;. He fills all the blackboards in the classroom, mumbles "excuse me, I'll be right back,"
and then goes into an adjacent empty classroom to use its blackboards.
<p>
Twenty minutes pass. Finally, the professor returns to the classroom.
He beams at the students with a big smile and says "I was right. It <i>is</i> trivial!"
<p>
I think this may be relevant to the rest of the post! :-)
<h2>The trivial case with ZFS root file system</h2>
<p>
I'll start with the straightforward case. I was contacted by a long-time friend
(who has exceptional knowledge of Solaris and other operating systems, but is new to ZFS)
who wanted to restrict <code>/var</code>
for a fresh installation of Solaris 10 he had just done. He used ZFS boot, selected the option that allocates a
separate ZFS dataset for <code>/var</code>, and wanted to know if there was an easy way to control its size.
</p>
I thought that this should be easy with a ZFS quota, but to be sure, I brought up a new instance of Solaris 10 under
<a href="http://www.virtualbox.org/">VirtualBox</a> to run through the steps and get the right ZFS dataset name.
I allocated the separate <code>/var</code> (it's an option you specify during install), and after installation completed
I logged in and issued the following commands:
<pre>
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
rpool 15.9G 4.16G 11.7G 26% ONLINE -
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 4.62G 11.0G 34K /rpool
rpool/ROOT 3.12G 11.0G 21K legacy
rpool/ROOT/s10x_u8wos_08a 3.12G 11.0G 3.06G /
rpool/ROOT/s10x_u8wos_08a/var 65.6M 11.0G 65.6M /var
rpool/dump 1.00G 11.0G 1.00G -
rpool/export 265K 11.0G 23K /export
rpool/export/home 242K 11.0G 242K /export/home
rpool/swap 512M 11.5G 42.0M -
</pre>
<p>
Right - all I should need to do is set a quota on <code>rpool/ROOT/s10x_u8wos_08a/var</code>, so let's do that.
I picked a quota slightly larger than the amount of space already consumed so I could easily test filling it up by
creating dummy files with random data. I did that once to make sure I didn't mess up the syntax, and once more in earnest
to exceed the quota:
<pre>
# zfs set quota=80m rpool/ROOT/s10x_u8wos_08a/var
# zfs get quota rpool/ROOT/s10x_u8wos_08a/var
NAME PROPERTY VALUE SOURCE
rpool/ROOT/s10x_u8wos_08a/var quota 80M local
# dd if=/dev/urandom of=/var/XX1 bs=1024 count=10000
10000+0 records in
10000+0 records out
# zfs list rpool/ROOT/s10x_u8wos_08a/var
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/s10x_u8wos_08a/var 75.3M 4.67M 75.3M /var
# dd if=/dev/urandom of=/var/XX2 bs=1024 count=10000
write: Disc quota exceeded
4737+0 records in
4737+0 records out
#
# ls -l XX*
-rw-r--r-- 1 root root 10240000 Mar 17 14:15 XX1
-rw-r--r-- 1 root root 4849664 Mar 17 14:16 XX2
# zfs list rpool/ROOT/s10x_u8wos_08a/var
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/s10x_u8wos_08a/var 80.1M 0 80.1M /var
</pre>
<p>
Mission accomplished: the second file was cut off at the quota allocated to this ZFS dataset, as required.
The only oddity (in my opinion) is the spelling "Disc" instead of "Disk" in the message
<code>write: Disc quota exceeded</code>.
So, if I'm building a Solaris system and want to keep <code>/var</code> from exhausting disk space, all I need is one command
to set the quota. Sweet.
<h2>A less trivial case, with zones</h2>
<p>
Shortly after the preceding example, I was contacted by a customer who wanted to do something similar
to control <code>/var</code> within Solaris Containers.
He tried to create the zone with <code>/var</code> defined as a delegated ZFS dataset
using legacy mounts, but there's a chicken-and-egg problem: parts of the zone's file system
must be mounted before the zone can boot, and a dataset needed that early can't be delegated to the zone.
Instead, I created a ZFS dataset and loopback-mounted it as the zone's <code>/var</code>:
<pre>
# zfs create rpool/zones/vartest
# zfs list rpool/zones/vartest
# cat varzone.cfg
create
set zonepath=/zones/varzone
set autoboot=false
add net
set physical=e1000g0
set address=192.168.56.164
end
add fs
set dir=/var
set special=/zones/vartest
set type=lofs
end
add inherit-pkg-dir
set dir=/opt
end
verify
commit
# zonecfg -z varzone -f varzone.cfg
# zoneadm -z varzone install
A ZFS file system has been created for this zone.
Preparing to install zone &lt;varzone&gt;.
Creating list of files to copy from the global zone.
Copying &lt;2899&gt; files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize &lt;1062&gt; packages on the zone.
Initialized &lt;1062&gt; packages on zone.
Zone &lt;varzone&gt; is initialized.
Installation of &lt;2&gt; packages was skipped.
The file &lt;/zones/varzone/root/var/sadm/system/logs/install_log&gt; contains a log of the zone installation.
</pre>
<p>
So far so good. After booting the zone without incident, I set a quota and filled it up.
(Note: this is a much bigger <code>/var</code> because I'm building a zone in a Solaris instance with
a bunch of additional software in <code>/var/sadm/pkg</code>.)
<pre>
# zfs list rpool/zones/vartest
NAME USED AVAIL REFER MOUNTPOINT
rpool/zones/vartest 274M 9.31G 274M /zones/vartest
# zfs set quota=300m rpool/zones/vartest
</pre>
<p>
Within the zone, I exhaust allocated space using the same method as before:
<pre>
# dd if=/dev/urandom of=/var/xx1 bs=1024 count=100000
write: Disc quota exceeded
26369+0 records in
26369+0 records out
</pre>
<p>
So, I was able to create a separate <code>/var</code> for the zone, and manage its space independently from the zone's root.
<b>WARNING:</b> I do not know if this is a supported or recommended procedure, even though it seems to work.
My recommendation is that it's more important to impose a quota on the zone's ZFS-based zone root, in order to control
its total accumulation of disk space. That protects other zones and other applications that may be using the same ZFS pool.
</p>
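<p>
Concretely, that recommendation is a one-liner, assuming the zone root landed in a <code>rpool/zones/varzone</code> dataset (the name follows the <code>zonepath</code> used above):
<pre>
# zfs set quota=4g rpool/zones/varzone     # cap the zone's total disk footprint
</pre>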
<h2>Conclusions</h2>
<p>
Separating <code>/var</code> was especially important with the small boot disk capacities we had to work with in Ye Olde Days, and perhaps became less
important with the large disks we have now.
However, it becomes important again with the arrival of relatively low-capacity Solid State Disk (SSD)
boot drives, used for fast local boot with low power consumption, and with virtual environments in which a
single Oracle Solaris instance might host many containers, each with its own <code>/var</code> and pattern of space consumption.
<p>
So, maybe this is a useful Old School idea that has new, and slightly different, relevance today.

<h1>Logical Domains 1.3 Released</h1>
<p><i>Thu, 21 Jan 2010</i></p>
While this may be overlooked with all the excitement over the impending Oracle acquisition
(I'll wait till the ink is dry to comment, if I do - though I share the excitement
and enthusiasm expressed by many of my colleagues for the opportunity this represents),
<b>innovation continues</b> and
Sun just announced the latest enhancements to Logical Domains, with version 1.3.
<h2>Super-fast review of Logical Domains (LDoms)</h2>
LDoms is a virtual machine capability for Sun's Chip Multithreading (CMT) servers.
It permits as many as 128 domains (virtual machines) on a single server at <i>no extra cost</i>.
LDoms exploits the "many CPUs" nature of CMT servers for efficient implementation of virtual machines,
without the overhead commonly seen in VM systems. Instead of timeslicing CPUs among many "guests" (which creates overhead),
each domain gets its own CPUs, which it can use at native speed. Domains can also use advanced features
like the hardware cryptographic accelerators that are standard with the CMT servers.
<h2>New Features</h2>
Among the new features are:
<ul>
<li><b>Domain Migration enhancements:</b>
<ul>
<li>You can now migrate domains that have a cryptographic accelerator (a restriction removed).
<li>Multi-threaded memory compression speeds migration. Memory contents are compressed before being encrypted
and transmitted to the target system - an 80% speedup compared to the prior release.
Processing is multi-threaded, takes advantage of
the CPU threads in the control domain, and exploits the cryptographic accelerator.
You wouldn't want memory contents of a guest domain (with passwords and other private data) to be transmitted in the clear, would you? As Liam mentions in his comment below, memory contents are <i>always</i> encrypted, but it's much faster with the hardware accelerator.
<li>Non-interactive ("automated") migration. The target password can be stored in a root-access-only file so migration
runs without interactive prompts (see the sketch after this list).
</ul>
<li><b>CPU Dynamic Resource Management</b> - this is my favorite, and I'll discuss below
<li><b>Link-based IPMP</b>
Previously you couldn't do link-based IP Multipathing failure detection in a guest domain using a virtual switch
(probe-based failure detection worked, however).
If the physical NIC's connection failed, that status wasn't passed
to the virtual network devices connected to the virtual switch associated with that NIC:
the link from virtual NIC to virtual switch was still intact, so the guest couldn't tell that the downstream connection wasn't.
You can now specify the <code>link-prop=phys-state</code> option on the virtual device to pass the physical NIC's link state through to the virtual one for failover (also sketched after this list).
<li><b>Crypto Dynamic Reconfiguration (DR)</b>
Guests with crypto accelerators can now have CPUs dynamically added and removed, provided
they are running Solaris 10 10/09 or later (another restriction removed).
<li><b>Boot domain from disk bigger than 1TB</b>
<li><b>Ability to change guest hostid</b> You can use the
<code>host-id</code> and <code>mac-addr</code> properties of the <code>ldm set-domain</code> command to change a guest's hostid.
</ul>
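<p>
Here's a sketch of the two items I flagged above, with invented domain, host, and file names - see the <code>ldm</code> man page for the authoritative syntax:
<pre>
# <b>ldm migrate-domain -p /root/target-passwd ldom1 target-host</b>   # non-interactive migration
# <b>ldm set-vnet link-prop=phys-state vnet0 ldom1</b>   # pass physical link state to the virtual device
</pre>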
<p>
There are other changes, plus bug fixes and performance improvements, but the above are the highlights.
There is one important restriction: LDoms 1.3 is for T2 and T2+ based systems: the T5x40 and T5x20 servers and blades.
Older, T1-based systems such as T1000 and T2000 can continue to use LDoms 1.2.
<H2>Dynamic Resource Management</H2>
<p>
This is my favorite addition...
Before explaining it, a little more review of logical domains. Instead of assigning CPU shares or weights
as is done with traditional hypervisors or the Solaris Fair Share Scheduler, you adjust the CPU capacity of
a domain by assigning it more or fewer CPUs.
This is consistent with the CPU-rich design of CMT servers - with so many addressable CPUs, you simply don't have
to timeslice CPUs to share the physical processor. You can assign them directly to the guest domain.
Each domain is given some number of CPU threads that belong to it, and to it alone.
<p>
Logical Domains has supported dynamic reconfiguration from the outset: you adjust CPU capacity for a domain by
adding and removing CPUs, using commands like:
<pre>
# <b>ldm set-vcpu 16 mydomain</b> # set the number of CPUs for 'mydomain'
# <b>ldm add-vcpu 8 mydomain</b> # give it some more CPUs for a spike in load
# <b>ldm rm-vcpu 8 mydomain</b> # take them back - <b>set-vcpu</b> would have worked too
</pre>
<p>
It is easy to put these commands in a script, perhaps initiated by <code>cron</code>.
It has also always been possible to parse the output of <code>ldm list -p</code> to see the CPU utilization of each domain
and adjust CPU counts accordingly.
A mere "SMOP" (Small Matter Of Programming), eh? But it takes a fair bit of work to do this properly!
<h3>LDoms Dynamic Resource Management</h3>
LDoms 1.3 provides a policy-based resource manager that automatically adds
or removes CPUs from a running domain based on its utilization and relative priority.
Policies can be prioritized to ensure important domains get preferential access to resources.
Policies can also be enabled or disabled manually or based on time of day for different prime shift and off-hours policies.
For example, one domain may have the highest resource needs and priority during the day time, while a domain running
batch work may be more resource-intensive at night.
<p>
Policy rules specify the number of CPUs that a domain has, bounded by minimum and maximum values and driven by its utilization:
<ul>
<li> The number of CPUs is adjusted between <b>vcpu-min</b> and <b>vcpu-max</b> based on <b>util-upper</b> and <b>util-lower</b> CPU busy percentages (all of these variables are property values associated with the policy)
<li>If CPU utilization exceeds <b>util-upper</b>, virtual CPUs are added to the domain until utilization drops into range or <b>vcpu-max</b> is reached
<li>If utilization drops below <b>util-lower</b>, virtual CPUs are removed from the domain until utilization rises into range or <b>vcpu-min</b> is reached
<li>If <b>vcpu-min</b> is reached, no more virtual CPUs can be dynamically removed. If <b>vcpu-max</b> is reached, no more virtual CPUs can be dynamically added (manual changes to the number of CPUs can still be done using the <code>ldm</code> commands shown above)
<li>Multiple policies can be in effect, and are optionally controlled by <b>tod-begin</b> and <b>tod-end</b> (Time Of Day) values
</ul>
The resource manager includes
ramp-up (attack) and ramp-down (decay) controls to adjust response to workload changes, specifying the
number of CPUs to add or remove based on changes in utilization, and how quickly the resource manager responds.
<p>
Resource management is disabled in elastic power management mode, in which CPUs are powered down when unused to reduce power consumption.
<p>
The following is an example of a command creating a policy:
<pre>
# <b>ldm add-policy tod-begin=09:00 tod-end=18:00 util-lower=25 util-upper=75 \
vcpu-min=2 vcpu-max=16 attack=1 decay=1 priority=1 name=high-usage ldom1</b>
</pre>
This policy controls the number of CPUs for domain <b>ldom1</b>, is named <b>high-usage</b> and is in effect between 9am and 6pm.
The lower and upper CPU utilization settings are 25% and 75% CPU busy.
The number of CPUs is adjusted between 2 and 16: one CPU is added or removed at a time (the attack and decay values).
For example, if the CPU utilization is above 75%, a CPU is added unless ldom1 already has 16 CPUs.
<p>
This provides flexible and powerful dynamic CPU resource management for Logical Domains. I expect there will be future
enhancements, possibly for other resource categories.
<h2>Summary</h2>
Logical Domains 1.3 provides a new level of functional capability, representing the continued investment and enhancement
of this flexible and powerful virtualization capability.
For further information on the latest update, see
<a href="http://www.sun.com/servers/coolthreads/ldoms/index.jsp">www.sun.com/ldoms</a>
<h1>Deduplication now in ZFS</h1>
<p><i>Sun, 6 Dec 2009</i></p>
<p>
I've been waiting for this ever since it was announced - and deduplication is now available in ZFS (hence this hastily written blog...).
The basic concept is that when data is written to a ZFS filesystem with dedup turned on, ZFS stores only the blocks that are <i>unique</i> within the ZFS pool, rather than storing redundant copies of identical data.
See <a href="http://blogs.sun.com/bonwick/entry/zfs_dedup">Jeff Bonwick's blog</a> for more information on concepts and implementation.
<h2>Getting started with ZFS dedup</h2>
<p>
If you're running OpenSolaris 2009.06 and are pointing to the hot-off-the-presses repository you can upgrade by
using <tt>System -&gt; Package Manager -&gt; File -&gt; Updates</tt>, or use the CLI example shown in
<a href="http://blogs.sun.com/pomah/entry/opensolaris_build_128_now_availble">Roman Ivanov's blog</a>.
Some painless downloading and a reboot, and you have access to the bits that contain ZFS dedup.
<pre>
$ <b>uname -a</b>
SunOS suzhou 5.11 snv_128a i86pc i386 i86pc Solaris
$ <b>cat /etc/release</b>
OpenSolaris Development snv_128a X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 23 November 2009
</pre>
<p>
Now, I'm a rather paranoid guy, and the machine in question is my main desktop, so I'm going
to experiment on a ZFS pool in a ramdisk, rather than the pool where I keep all my data. At least for today!
Here, I create a 100MB ramdisk, put a ZFS pool in it, create a ZFS dataset inside,
turn on dedup (and compression) at the pool's top-level dataset so everything inherits it,
and hand the new dataset to my normal, non-root userid.
<p><b>Note:</b> the error message below arises because you don't turn on dedup at the pool level - you turn it on at the ZFS dataset level. This is despite the (possibly confusing) fact that candidate blocks for deduplication come from
the entire ZFS pool (not just the dataset), and the space savings due to deduplication are reported at the pool level.
So, in this case, the zpool command thinks I'm asking about the deduplication ratio, which is a read-only attribute.
After that error, I turn dedup on for the ZFS dataset at the top of the pool. Out of habit, I turn on compression too.
<pre>
# <b>ramdiskadm -a dimmdisk 100m</b>
/dev/ramdisk/dimmdisk
# <b>zpool create dimmpool /dev/ramdisk/dimmdisk</b>
# <b>zfs create dimmpool/fast</b>
# <b>zpool set dedup=on dimmpool</b>
cannot set property for 'dimmpool': 'dedup' is readonly
# <b>zfs set dedup=on dimmpool</b>
# <b>zfs set compression=on dimmpool</b>
# <b>chown savit /dimmpool/fast</b>
</pre>
Now, I'll populate that ZFS dataset with a few directories and copy some files into the first of them:
<pre>
$ <b>mkdir /dimmpool/fast/v1</b>
$ <b>mkdir /dimmpool/fast/v2</b>
$ <b>mkdir /dimmpool/fast/v3</b>
$ <b>mkdir /dimmpool/fast/v4</b>
$ <b>mkdir /dimmpool/fast/v5</b>
$ <b>cp *gz /dimmpool/fast/v1</b>
$ <b>zpool list dimmpool</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
dimmpool 95.5M 5.32M 90.2M 5% 1.00x ONLINE -
</pre>
No duplicate data so far, which isn't a surprise.
Now for the real test: let's copy the <i>same</i> data into different directories
and see what happens:
<pre>
$ <b>cp *gz /dimmpool/fast/v2</b>
$ <b>cp *gz /dimmpool/fast/v3</b>
$ <b>cp *gz /dimmpool/fast/v4</b>
$ <b>cp *gz /dimmpool/fast/v5</b>
$ <b>zpool list dimmpool</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
dimmpool 95.5M 6.55M 89.0M 6% 5.00x ONLINE -
</pre>
<p>
<b>Wow!</b> I have 5 copies of the same data and ZFS trimmed away the excess copies when storing on disk.
<h2>The mystery of the growing disk</h2>
One surprise with dedup is how traditional tools like <tt>df</tt> respond. They are obviously unaware of deduplication, but have to somehow cope with the idea that the same filesystem is storing more (user-accessible) bytes. This is very well explained in Joerg's blog at <a href="http://www.c0t0d0s0.org/index.php?url=archives/6168-df-considered-problematic.html">df considered problematic</a> (and a tip of the hat to Craig for provoking me to comment on this).
<p>
To illustrate this, I'll create a small ZFS pool backed by a file (a file on ZFS, which is perfectly valid). I'm going to fill it with multiple copies of the same CD image (a Ubuntu 9.10 ISO), which neatly get deduped away. As I do this, look at the output of the <tt>df</tt> command:
<pre>
# <b>mkfile 1g /var/tmp/TEMP</b>
# <b>zpool create temp /var/tmp/TEMP</b>
# <b>zfs set dedup=on temp; zfs set compression=on temp</b>
# <b>zpool list temp</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 141K 1016M 0% 1.00x ONLINE -
# <b>cp ubuntu-9.10-desktop-i386.iso /temp</b>
# <b>zpool list temp</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 685M 331M 67% 1.00x ONLINE -
# <b>df -h /temp</b>
Filesystem Size Used Avail Use% Mounted on
temp <font color="red">983M</font> 683M 300M 70% /temp
# <b>cp ubuntu-9.10-desktop-i386.iso /temp/iso2</b>
# <b>df -h /temp</b>
Filesystem Size Used Avail Use% Mounted on
temp <font color="red">1.7G</font> 1.4G 295M 83% /temp
# <b>cp ubuntu-9.10-desktop-i386.iso /temp/iso3</b>
# <b>df -h /temp</b>
Filesystem Size Used Avail Use% Mounted on
temp <font color="red">2.3G</font> 2.1G 289M 88% /temp
# <b>cp ubuntu-9.10-desktop-i386.iso /temp/iso4</b>
# <b>zpool list temp</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 692M 324M 68% 4.00x ONLINE -
# <b>df -h /temp</b>
Filesystem Size Used Avail Use% Mounted on
temp <font color="red">3.0G</font> 2.7G 278M 91% /temp
</pre>
<p>
Notice that the ZFS pool shows the deduplication ratio, and <tt>df</tt> acts as if the disk is getting bigger - growing in this case to 3GB. Otherwise, how could it explain 2.7GB of data in a filesystem whose original size was only 983M?
<h2>A bigger test</h2>
Well, the above tests are probably "best case scenarios". Let's try something bigger, and use a real disk.
<p>
I'll use a spare ZFS pool called 'tpool'.
Since this is a ZFS pool I've had for a while, I
have to upgrade the pool to the latest on-disk format, which I show here.
Note which level (21, in listing below) provides deduplication.
For a change of pace, I'll do this via <tt>pfexec</tt> from my userid, instead of being <tt>root</tt>.
<pre>
$ <b>zpool list tpool</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 87.1G 649G 11% 1.00x ONLINE -
$ <b>pfexec zpool status tpool</b>
pool: tpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
tpool ONLINE 0 0 0
c18t0d0p2 ONLINE 0 0 0
errors: No known data errors
$ <b>zpool upgrade -v</b>
This system is currently running ZFS pool version 22.
The following versions are supported:
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
<b>21 Deduplication</b>
22 Received properties
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.
$ <b>pfexec zpool upgrade tpool</b>
This system is currently running ZFS pool version 22.
Successfully upgraded 'tpool' from version 16 to version 22
</pre>
Now I can turn dedup on. As I mentioned before, you turn it on at the ZFS dataset level.
Here, I want all ZFS datasets in this pool to inherit this property, so I set it at the topmost level.
<pre>
$ <b>pfexec zfs set dedup=on tpool</b>
$ <b>zfs get dedup tpool</b>
NAME PROPERTY VALUE SOURCE
tpool dedup on local
$ <b>zpool get dedupratio tpool</b>
NAME PROPERTY VALUE SOURCE
tpool dedupratio 1.00x -
</pre>
Now I'll copy about 28GB worth of data - note how the free disk space is only reduced by 5GB!
<pre>
$ <b>zpool list tpool</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 206G 530G 28% 1.02x ONLINE -
$ <b>cp -rp /usbpool/Music /tpool/</b>
$ <b>zfs list tpool/Music</b>
NAME USED AVAIL REFER MOUNTPOINT
tpool/Music 27.9G 513G 27.9G /tpool/Music
$ <b>zpool list tpool</b>
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 211G 525G 28% 1.20x ONLINE -
</pre>
ZFS found that the written data was a duplicate of existing data, and deduplicated it on the fly. Only one copy of the 28GB occupies space on disk.
<h2>Putting this in context</h2>
<p>
Now, let's understand this correctly - <i>somewhere</i> in the <tt>tpool</tt> pool I already had the same 28GB of music files. With a small amount of effort I could have found the original copy (I know where I tend to put things) and stored only new or changed data. But that's not the point!
<p>
Imagine if this were a file store serving hundreds of users. They obviously can't search through one another's directories and see if the exact same data was already stored by someone else (and somehow know that the data would be retained unchanged, for exactly as long as they themselves needed it). Consider the case where a group of employees receive the same e-mail attachments and store them in their private directories: without deduplication the same data is stored many times; with deduplication it is stored only once.
<p>
ZFS deduplication works at the block level, not on an individual file basis, so it is possible that "mostly the same" files can enjoy benefits of deduplication and reduced disk space consumption. Imagine even a single user working with a series of medical image files from CAT scans, or different versions of an animated film. The different files might be almost identical in contents, and potentially the majority of disk blocks could be the same - and stored only once instead of storing the same contents multiple times.
<h2>Caveats and Considerations</h2>
<p>
There are important considerations to keep in mind, so please don't blindly turn on deduplication. First, dedup is available in OpenSolaris development builds (snv_128 and later, as shown above) and in Solaris 11 Express - it is not available in Solaris 10.
<p>
Second, deduplication requires a great deal of RAM (or an SSD-based L2ARC) to store the ZFS dedup table - the metadata that contains the block signatures and describes ownership. If you don't have enough RAM, then file operations (especially writes and in particular, file deletion) will be <i>extremely slow</i>. On a large server class machine this may not be a problem, but be very cautious about deploying it on a desktop class system with a few GB of RAM. See <a href="http://blogs.sun.com/roch/entry/dedup_performance_considerations1">Roch Bourbonnais' blog on dedup performance</a> for technical details and results of sizing experiments.
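<p>
If you want a sense of the dedup table before committing a production pool, <code>zdb</code> can help - a sketch only, since <code>zdb</code> output is unstable by design and should be treated as exploratory:
<pre>
$ <b>pfexec zdb -D tpool</b>     # summarize the dedup table (DDT) of a pool with dedup enabled
$ <b>pfexec zdb -S tpool</b>     # simulate dedup on existing data to estimate the achievable ratio
</pre>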
<p>
Given those considerations - when would you use deduplication? Consider it when you have business processes that create a lot of duplicate data on disk, hosted on a server with suitable RAM and I/O. Then you can get substantial space benefits while maintaining good performance.
<p>
This is not quite the same advice that I give for ZFS compression: I recommend turning default ZFS compression on in almost all cases, since the CPU cost is small and the savings in disk space can be valuable (in fact, the reduced disk space can decrease overall CPU cost, since fewer physical I/Os are needed) - very little downside risk is involved. Deduplication is more like deciding whether to turn on ZFS gzip-9 compression, which provides greater potential space savings at a much higher CPU cost. In general, evaluate your data and your server configuration before turning on deduplication.
<h2>Conclusions</h2>
<p>
This is a powerful new feature for ZFS - deduplication as a native capability of the most advanced file system, without add-on license fees and with a simple user interface. It is currently a feature of Solaris 11 Express, not Solaris 10; users who want it now can run Solaris 11 Express and obtain formal support.
<p>
However, you don't have to upgrade to Solaris 11 Express to get benefits from deduplication (and other features introduced in OpenSolaris.) Sun leverages Solaris 11 in the Sun Storage 7000 Unified Storage Systems (a bit wordy, that name) storage appliances. This provides a fully-supported way to gain the benefits of our rapid introduction of advanced features, providing unique solutions for the systems and storage markets.
<h1>Hercules goes commercial!</h1>
<p><i>Fri, 2 Oct 2009</i></p>
<p>
There's a news report
at <a href="http://www.theregister.co.uk/2009/09/25/turbohercules_goes_commercial/">The Register</a>
that the <a href="http://www.hercules-390.org/">open source Hercules mainframe emulator</a> will be used in a commercial offering
called <a href="http://www.turbohercules.com/">TurboHercules</a>.
<p>
Hercules, which I've referred to <a href="http://blogs.sun.com/jsavit/entry/dtrace_and_the_mighty_hercules">elsewhere on this blog</a>,
is an open source mainframe emulator, originally intended for hobbyists.
It performs software emulation of IBM mainframes, capable of running old and new operating systems for that platform.
It does <i>not</i> provide an OS. Instead, it provides a virtual mainframe that you can boot (I should say "IPL") an OS on.
<p>
Hercules has picked up a wide user base - mostly current and former mainframe systems programmers, including
IBM customers and (apparently) IBM employees too.
All can have the geeky fun of running their own copy of
MVS or VM on their PC, or run one of the mainframe Linux variants.
So, you can run your very own mainframe datacenter, sort of.
<h3>What Operating Systems can you run?</h3>
<p>
<i>Technically</i> it's possible for Hercules to run current IBM operating systems like z/OS,
but my understanding is that <i>legally</i> you can only run very old IBM operating systems from 20+ years ago
(OS/360, DOS/360, MVS 3.8, VM/370),
and current <i>non-</i>IBM operating systems - specifically Linux.
(The OpenSolaris port to mainframe doesn't run on Hercules for technical reasons.
It uses z/VM paravirtualization APIs that Hercules doesn't implement yet. That's unfortunate, but running as a guest makes development easier.)
<p>
IBM only licenses their operating systems for their own hardware, so you can't get licenses to run z/OS and z/VM on Hercules.
Clearly there is interest in doing so.
Every once in a while somebody pops up on the Hercules mailing list asking for instructions on how to obtain a current IBM OS for Hercules
and the response from list members is "go contact IBM, not us. We don't encourage/help anyone using unlicensed software".
<h2>Can Hercules be used for "production" work? Is it up to the job?</h2>
<p>
The first question is whether it's compatible enough with the Real Thing, and the answer clearly is "yes". There's very good evidence that it's architecturally compatible, with reports of people running current operating systems. A lot of z/Linux development is done on
Hercules, and you can certainly run the major distros there.
<p>
Does Hercules have the horsepower to run small to moderate sized mainframe applications?
Definitely: some hobbyists have reported running
the equivalent of well over 100 MIPS on stock personal computers you can buy anywhere.
The <a hre"http://www.hercules-390.org/hercfaq.htm">FAQ page</a> on the Hercules web site
<a href="http://www.hercules-390.org/hercfaq.html#3.02">reports over 300 MIPS</a> and 500 I/Os per second.
That's a bit old, so I imagine that with current Nehalem or Shanghai processors you could do more, especially with
server-class I/O. That's more than enough to run the workload of small mainframes (or a small portion of a current mainframe's
workload) at a far lower cost than the Real Iron. Consider that a low-end "z10 BC" mainframe introduced in Q4 2008 starts at under 30 MIPS
and costs about $100K exclusive of disk or memory. Hercules would be overkill for replacing such a small system, and could do it at a tiny fraction of the cost.
<p>
For nostalgic comparison: IBM used to have a thing called the XT/370, which used a co-processor in a PC/XT to implement a subset
of the System/370 architecture and came with a weird single-user dialect of VM/CMS. I had one of these things and it was tremendous
fun, even if not a barn-burner in performance at 0.1 MIPS! (I mostly used it for a project porting
<a href="http://www.modula2.org/modula-2.php">Modula-2</a> to the 370). There were later varieties of this product that got as much as 7 MIPS
of performance, which was once considered pretty powerful - about the power of a 3083. Now we can emulate many times that performance in
software. Moore's Law in action.
<h2>So, what is TurboHercules?</h2>
<p>
The <a href="http://www.turbohercules.com/">TurboHercules web site</a> says what their mission is
in Roger Bowler's <a href="http://www.turbohercules.com/news/permalink/welcome-to-turbohercules/">welcome page</a>.
Roger Bowler is the original author of Hercules (which of course has had a number of developers in its 10-year history as an open source project),
and he says that the idea is to fit into the niche once occupied by "OEMs" like Amdahl and PSI,
but for "ancillary mainframe workloads".
<p>
The proposed way to get around the OS licensing issue is described
in a <a href="http://www.turbohercules.com/news/permalink/random-thoughts-from-share-denver/">page by co-founder Tom Lehmann</a>,
where he refers to institutions (especially governmental ones) that require a second mainframe for legally mandated disaster recovery,
but don't have the budget for it.
He says that "<i>the IBM license can be transferred to an alternate machine in the event that the original machine is inoperable</i>"
and that "<i>the cost of the TurboHercules based machine is well within their budget while still providing the same level of service as the original mainframe.</i>"
<p>
So, if I understand this correctly, TurboHercules proposes to offer such a service, based on Hercules technology and with consulting to provide a turn-key disaster recovery
environment for mainframes. That's a very interesting twist on the OEM business, and maybe a way to avoid the problems faced by PSI
and other companies formerly in this market.
<h2>Will it fly?</h2>
<p>
First, I have to say that (a) I'm not involved with this in any way, (b) I have no inside knowledge of this,
and (c) I Am Not A Lawyer. Roger Bowler is a modern hero of programming and a lot of people benefit from his talent and hard work.
If he can find a way to monetize the mainframe emulator he created, that would be very cool.
<p>
Certainly there have been mainframe emulators before - it's a rich target because of mainframes' high prices and so-so
performance. There was once a good number of vendors competing against IBM in this space.
Vendors doing this in the past for commercial purposes have run into hot water due to licensing and IP issues.
<p>
What's different about Hercules is that it's a well-established Open Source project, which makes it hard to kill.
The TurboHercules people have explicitly condemned the illicit use of IBM software, and have made a point
of saying that the Hercules project is completely independent of TurboHercules.
The disaster recovery angle for licensing may make this commercial venture possible. Maybe it could someday lead
to competition and better pricing in the mainframe market.
It sure is a money saver for customers compared to paying for pricey Blue tin.
Competition helps drive prices down and innovation up - monopolies don't like that.
<p>
Will it fly? I honestly don't know. There are obvious legal risks: IBM doesn't like anybody selling mainframe "workalikes",
as you can see from the recent litigation and, years ago, the fierce combat with Amdahl. I also wonder if there isn't a
market for used or older mainframes, or mainframe hosting services, that would compete with this. Or, IBM could offer low-end machines for
peanuts to prevent a customer from going this way. They could be willing to sacrifice margin (which they surely have on mainframes) to keep
a customer from straying from the fold. Hint to any customer: have a printed copy of the TurboHercules press release next time your IBM
sales rep comes to call :-) (This once was called the <a href="http://www.cbronline.com/news/the_amdahl_coffee_mug_effect">Amdahl coffee mug</a> trick.)
<p>
A lot depends on what IBM does. <a href="http://clientservernews.com/?p=261">Maureen O'Gara adds some tart comments</a> to
TurboHercules' press release (which said it "hopes to benefit from IBM's long-standing support of open source software"), questioning
whether IBM will live up to its rhetoric about openness when its own IP might be de-monetized.
IBM hasn't been particularly committed to "open" when it comes to customers running on lower-cost alternatives to mainframes - that's why
you can't even <i>pay</i> them to license their OSes on non-IBM boxes or Hercules. Maybe they'll permit that now - only time will tell.
Perhaps the CIO magazine title at
<a href="http://www.cio.co.uk/article/3202956/mainframe-computing-is-set-for-a-rebirth/?pn=2">Mainframe computing is set for a rebirth</a>
is correct after all - but not in the way that article's authors anticipated. :-)
<p>
If I'm all wrong - just do the usual thing and post a comment. I'm interested in opinions.
https://blogs.oracle.com/jsavit/entry/zfs_live_upgrade_and_flashZFS, Live Upgrade and Flash Archive - happy together at lastjsavithttps://blogs.oracle.com/jsavit/entry/zfs_live_upgrade_and_flash
Fri, 25 Sep 2009 18:58:29 +0000SunarchiveflashlivesolarisupgradezfsFlash Archive is a really handy Solaris feature for cloning Solaris systems. Unfortunately, until recently it didn't work with Solaris 10 systems that leveraged
ZFS boot - but now it's available. Here's my experience making use of this.<p>
Nothing really controversial today - just some pleasant experiences with relatively
recent enhancements to Solaris.
<h2>ZFS Rocks</h2>
<p>
One of the really great features of Solaris 10 is ZFS,
described <a href="http://www.oracle.com/technetwork/server-storage/solaris11/technologies/zfs-338092.html">here</a>,
<a href="http://www.opensolaris.org/os/community/zfs/">here</a>,
and many places, such as <a href="https://blogs.oracle.com">http://blogs.oracle.com/</a>.
ZFS really is a tremendous advance in filesystems - an elegant solution providing long-needed improvements for data integrity, usability, and performance. (Repeat after me: "No more <i>fsck</i>. No more <i>fsck</i>. No more <i>fsck</i>.")
<h2>ZFS Boot and Live Upgrade</h2>
<p>
Since last fall, with <b>Solaris 10 10/08</b>, it's been possible to use ZFS as a boot file system, too.
If ever there's a filesystem you want to have immunized against failures, it's the boot environment.
Now, you can use a mirrored ZFS boot environment to protect against media failures, enable compression to save disk space, and take snapshots to preserve a point-in-time image of your system as a safeguard against "Ooops!" - among other benefits.
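<p>
For example, turning an existing root pool into a mirror is a one-liner, plus boot blocks on the new disk (a sketch - the device names are illustrative, and the <code>installboot</code> step shown is the SPARC variant):
<pre>
# zpool attach rpool c0t0d0s0 c0t1d0s0
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0
</pre>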
<p>
You can also use ZFS with Live Upgrade, which lets you update system software on an
alternate boot environment (ABE) separate from the currently-running system.
This lets you safely apply system changes, but with UFS filesystems you have
the aggravation of having to plan for and allocate a disk slice for each ABE.
<p>
With a ZFS boot filesystem, you just specify the name of the ZFS pool when you create the ABE.
Live Upgrade then takes a ZFS snapshot and builds the environment
on a clone of it. Because ZFS uses a "copy on write" architecture,
you don't waste disk space on duplicate copies of unchanged disk contents in
each boot environment. Only changed blocks are stored. The result is that you save a tremendous amount of
disk space, creating a boot environment is much faster, and you can fit many, many more of them on the same disk.
<p>
For example, see the creation of an alternate boot environment below. It only took a few minutes, and the disk footprint
is just a few hundred KB for metadata, instead of several GB.
<pre>
# lucreate -c s10u7 -n s10u7patched -p rpool
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named &lt;s10u7&gt;.
Creating initial configuration for primary boot environment &lt;s10u7&gt;.
The device &lt;/dev/dsk/c0t0d0s0&gt; is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name &lt;s10u7&gt; PBE Boot Device &lt;/dev/dsk/c0t0d0s0&gt;.
Comparing source boot environment &lt;s10u7&gt; file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment &lt;s10u7patched&gt;.
Source boot environment is &lt;s10u7&gt;.
Creating boot environment &lt;s10u7patched&gt;.
Cloning file systems from boot environment &lt;s10u7&gt; to create boot environment &lt;s10u7patched&gt;.
Creating snapshot for &lt;rpool/ROOT/s10s_u7wos_08&gt; on &lt;rpool/ROOT/s10s_u7wos_08@s10u7patched&gt;.
Creating clone for &lt;rpool/ROOT/s10s_u7wos_08@s10u7patched&gt; on &lt;rpool/ROOT/s10u7patched&gt;.
Setting canmount=noauto for &lt;/&gt; in zone &lt;global&gt; on &lt;rpool/ROOT/s10u7patched&gt;.
Population of boot environment &lt;s10u7patched&gt; successful.
Creation of boot environment &lt;s10u7patched&gt; successful.
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u7 yes yes yes no -
s10u7patched yes no no yes -
gilbert / # zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 20.2G 55.1G 94K /rpool
rpool/ROOT 3.38G 55.1G 18K legacy
rpool/ROOT/s10s_u7wos_08 3.38G 55.1G 3.38G /
rpool/ROOT/s10s_u7wos_08@s10u7patched 69.5K - 3.38G -
rpool/ROOT/s10u7patched 111K 55.1G 3.38G /
...snip for brevity...
</pre>
<p>
Note the tiny disk space needed for the snapshot (the object with the "@" in its name) and the new boot environment.
Now, I can go off and apply patches or install software to the alternate boot environment
without affecting the running environment.
When I'm done, I activate the alternate boot environment, reboot, and I'm running in the updated world.
If I'm unhappy with the results, or if this was just a test period for the new OS level, I can fall back
easily and safely to the unaltered original software environment.
<p>
Did I mention that you can also put the <i>zoneroot</i> of a Solaris Container on ZFS now?
That leverages ZFS snapshots too, so cloning a zone (complete with customization) only takes a few seconds, and a new
Container has a tiny disk footprint.
Life is good.
<p>Parenthetically, I just have to say that after years working in datacenters, I've always been queasy at the
thought of modifying the One Good Copy of the OS on a system that you know is working. If new system code is bad,
or Murphy's Law hits, you have a lot of trouble on your hands. Live Upgrade is a Good Thing.
<h2>Something's missing - where's the flash?</h2>
<p>
Solaris administrators have long enjoyed the "Flash Archive" feature to create many cloned environments from a
single "golden image". An administrator could configure a system, install software and fixes, test it, then save all or part of its contents in a
"flash archive" (usually called a <i>"flar"</i>), and then use Jumpstart to install as many machines with that system image as needed.
That makes provisioning identical machines an almost completely hands-off activity.
You can also make "differential flash archives" that only include differences (files added, removed, and changed) from a base (full) archive. With that, you can install a system from a full archive, and then quickly customize for individual servers or apply additional changes. This saves a lot of disk space, as you don't need to keep full archives for each variation. This is very nicely described at Joerg Moellenkamp's excellent <a href="http://www.c0t0d0s0.org/archives/4581-Less-Known-Solaris-features-Jumpstart-Enterprise-Toolkit-Part-4-Jumpstart-FLASH.html">Less Known Solaris Features - Jumpstart</a> pages.
<p>
Unfortunately, Flash could not be used in ZFS boot environments until recently. If you tried to use it you would get an error message
saying (essentially) that it wasn't supported. The functionality was so useful that my colleague Scott Dickson came up
with ingenious ways to have a Flash-like capability with ZFS boot. He described that in blog entries:
<a href="http://blogs.oracle.com/scottdickson/entry/flashless_system_cloning_with_zfs">Flashless System Cloning with ZFS</a>
and
<a href="http://blogs.oracle.com/scottdickson/entry/a_much_better_way_to">A Much Better Way to use Flash and ZFS Boot</a>.
The first blog describes how to get the effects of a flash install, and the second is a very clever way to use a custom
jumpstart profile to provision a ZFS boot environment from a flash archive. Very elegant, and solved the problem till
an official solution arrived.
<h2>Now it can be done directly</h2>
<p>
Fortunately, there is now official support for Flash Archives when using ZFS boot. You have to apply particular patches to
enable this feature:
<ul>
<li>SPARC:
<ul compact>
<li>119534-15: fixes to the /usr/sbin/flarcreate and /usr/sbin/flar commands
<li>124630-26: updates to the install software
</ul>
<li>x86:
<ul>
<li>119535-15: fixes to the /usr/sbin/flarcreate and /usr/sbin/flar commands
<li>124631-27: updates to the install software
</ul>
</ul>
<p>
Once you have this software installed you're able to proceed in a more direct manner:
<pre>
# flarcreate -n s10u7patched /export/home/flar/s10u7patched.flar
Full Flash
Checking integrity...
Integrity OK.
Running precreation scripts...
Precreation scripts done.
Determining the size of the archive...
The archive will be approximately 16.97GB.
Creating the archive...
Archive creation complete.
Running postcreation scripts...
Postcreation scripts done.
Running pre-exit scripts...
Pre-exit scripts done.
</pre>
<p>
That's a pretty enormous flash archive, isn't it? I was making a flash archive that contained a software repository, and multiple copies of Solaris virtual disks for Logical Domains (each of them taking up a few GB). I later did a custom jumpstart using the
following profile, and I was off to the races:
<pre>
install_type flash_install
archive_location nfs 192.168.2.4:/export/home/flar/s10u7patched.flar
partitioning explicit
pool rpool auto auto auto c0t0d0s0
</pre>
<p>Voila! Once I jumpstarted the target machine, it inhaled the flash archive and built the new system image
without me having to type anything.
<p>
Here it is in all the gory detail:
<pre>
System identification complete.
Starting Solaris installation program...
Searching for JumpStart directory...
Using rules.ok from 192.168.2.4:/jumpstart.
Checking rules.ok file...
Using profile: prescott_prof
Executing JumpStart preinstall phase...
Searching for SolStart directory...
Checking rules.ok file...
Using begin script: install_begin
Using finish script: patch_finish
Executing SolStart preinstall phase...
Executing begin script "install_begin"...
Begin script install_begin execution completed.
WARNING: Flash install: Pool name &lt;rpool&gt; specified in profile will be ignored
Processing profile
- Opening Flash archive
- Validating Flash archive
- Selecting all disks
- Configuring boot device
- Configuring / (c0t0d0s0)
Verifying disk configuration
Verifying space allocation
Preparing system for Flash install
Configuring disk (c0t0d0)
- Creating Solaris disk label (VTOC)
- Creating pool rpool
Beginning Flash archive processing
Predeployment processing
16 blocks
32 blocks
16 blocks
No local customization defined
Extracting archive: s10u7patched
Extracted 0.00 MB ( 0% of 17386.52 MB archive)
Extracted 1.00 MB ( 0% of 17386.52 MB archive)
Extracted 2.00 MB ( 0% of 17386.52 MB archive)
...
Extracted 17386.52 MB (100% of 17386.52 MB archive)
Extraction complete
Postdeployment processing
No local customization defined
- Finishing post-flash pool setup for pool rpool
- Creating swap zvol for pool rpool
- Creating dump zvol for pool rpool
Customizing system files
- Mount points table (/etc/vfstab)
- Network host addresses (/etc/hosts)
- Environment variables (/etc/default/init)
Cleaning devices
Customizing system devices
- Physical devices (/devices)
- Logical devices (/dev)
Installing boot information
- Installing boot blocks (c0t0d0s0)
- Installing boot blocks (/dev/rdsk/c0t0d0s0)
Installation log location
- /a/var/sadm/system/logs/install_log (before reboot)
- /var/sadm/system/logs/install_log (after reboot)
Flash installation complete
Executing JumpStart postinstall phase...
The begin script log 'begin.log'
is located in /var/sadm/system/logs after reboot.
Creating boot_archive for /a
updating /a/platform/sun4v/boot_archive
15+0 records in
15+0 records out
syncing file systems... done
rebooting...
</pre>
<p>
Pretty geeky stuff, huh? Note that while you use the <code>-x</code> option to exclude UFS directories from the flash archive, you use the <code>-D</code> option to exclude ZFS datasets.
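<p>
For example, following the <code>-D</code> usage described above, you could keep a large dataset out of the archive like this (a sketch - the excluded dataset and archive names are illustrative):
<pre>
# flarcreate -n s10u7trim -D rpool/export /export/home/flar/s10u7trim.flar
</pre>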
<h2>Life is good and flashy</h2>
<p>
If I wanted to, I could replicate this system on as many servers as I have, without hands-on for each. I recently worked with a customer deploying 100 new Sun servers, and this capability is a life saver for knocking out cookie-cutter systems.
Humans are just not good at doing the same thing many times in a row - we don't do "identical" and "fast" very well.
Flash solves that problem for Solaris installs, and it now works with ZFS boot too, for the best of both worlds.
<p>
I should mention: Flash and direct system installation are very Old School. That's okay by me - I'm Old School, too. But that still doesn't scale as much as we'd like and can still be labor- and skill-intensive. Automation is the way to go for discovery, mass provisioning, monitoring and management. For that, I recommend you take a look at <a href="https://blogs.oracle.com/opscenter/entry/ops_center_blog">Ops Center</a>.
<p>
<b>An update:</b> Solaris 10 10/09 is available now, and includes the patches needed for this support. Another edit Dec 4, 2012: clean up URLs (content otherwise unchanged).https://blogs.oracle.com/jsavit/entry/virtual_smp_in_virtualbox_3Virtual SMP in VirtualBox 3.0jsavithttps://blogs.oracle.com/jsavit/entry/virtual_smp_in_virtualbox_3
Wed, 1 Jul 2009 11:01:31 +0000SunguesthypervisormachinessmpvirtualvirtualboxVirtualBox 3.0 introduces support for multiprocessor virtual machines. This is an exciting development! Some thoughts on when to apply this, and recommendations for how to use it with optimal performance.So many things to blog about, so little time, but this topic is so big I just had to blog on it! VirtualBox 3.0 is out now, and supports multiprocessor guest virtual machines. See the <a href="http://www.virtualbox.org/wiki/Changelog">changelog</a>, and go <a href="http://download.virtualbox.org/virtualbox/3.0.0/">here</a> to download the binaries.
<p>
This is a really big step. Now you can host guests with up to 32 virtual CPUs on machines with VT-x or AMD-V. Think of all the work that had to be done to provide architecturally consistent CPU and memory in the virtual machine while dispatching virtual CPUs. This affects locking and atomic memory semantics, CPU scheduling to prevent starvation - a lot of things to think about and implement. I may wind up eating my words, but 32 virtual CPUs ought to be enough to handle most people for a very long time.
<h2>Use cases for multiprocessor guests</h2>
<p>
We also have things to think about when using it. For starters, "why would you do this?" The most obvious answer is "when you have work in a virtual machine that requires more than one CPU's power". Well, that should be pretty clear, and I'm going to suggest how to make use of this while avoiding some potential pitfalls. Also, multi-CPU guests can be used to test applications to make sure they operate correctly in SMP environments, handling mutual exclusion and locking correctly, and avoiding race and deadlock conditions that might not show up in a uniprocessor environment - and that the application scales well in such an environment, distributing load across the (virtual) CPUs visible to the application running in the guest. Yes, the timing and performance characteristics will be different from "real" iron, but this can be used to filter out bugs and gain insight. Finally, this is an excellent testbed for trying out procedures that would otherwise require dedicated, large, multi-CPU machines: for example, testing out how to configure zones with dedicated CPUs, or manually creating dynamic resource pools of CPUs. You can't do that on a 1-CPU virtual machine!
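<p>
For instance, that zones experiment might look like this inside a multi-CPU guest (standard zonecfg syntax - the zone name and path are placeholders):
<pre>
# zonecfg -z testzone
zonecfg:testzone> create
zonecfg:testzone> set zonepath=/zones/testzone
zonecfg:testzone> add dedicated-cpu
zonecfg:testzone:dedicated-cpu> set ncpus=2
zonecfg:testzone:dedicated-cpu> end
zonecfg:testzone> verify
zonecfg:testzone> commit
zonecfg:testzone> exit
</pre>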
<h2>Consolidation and multiple virtual CPUs</h2>
The obvious use case is for guests whose workload is big enough to require more than one CPU. VirtualBox makes it possible to do this now, making it more suitable for production workloads. Consider a modern server with multiple quad-core CPUs - this lets a single guest drive more work at the same time. Or, you might have multiple guests with different peak periods, each occasionally using substantial CPU, but with non-overlapping peak periods that permit significant server consolidation. With Solaris as the host OS, you can even run VirtualBox within zones or projects, permitting fine grain control on the amount of CPU power to give to each guest. That makes a very powerful consolidation platform.
<h2>Oversubscribing has to be watched carefully!</h2>
<p>
As a general rule, you do not want to schedule more <b>active</b> virtual CPUs than you have physical CPUs to run them on, as that just adds overhead without adding capacity. Yes, there are exceptions, such as when you want to have a single-threaded application only be able to suck up a single (virtual) CPU's capacity while leaving other CPUs for other guest applications. But, that's a rather blunt instrument, and you should really use a resource manager to control the CPU allocated to the guest's applications. Not always possible if the guest has only primitive resource management! Certainly, use the Solaris resource manager when Solaris is the guest - it's the better mechanism.
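<p>
In VirtualBox terms, sizing a guest is a one-liner (the VM name is a placeholder):
<pre>
$ VBoxManage modifyvm "solaris-guest" --cpus 2
</pre>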
<p>
So, some suggestions: don't define virtual machines with so many CPUs that the number of actively CPU-busy virtual CPUs exceeds the physical CPUs on the box. In fact, you may wish to keep one or more physical CPUs unused by virtual machines if the computer is multiple-use (that is, it also runs work at "first level" - maybe it's your desktop computer). Again, if your host is Solaris, you can also constrain VirtualBox resource consumption via the Solaris Resource Manager. As a specific rule of thumb: if a guest's workload fits in a single CPU, then give it only one CPU. Best performance will probably be achieved by giving a guest the smallest number of CPUs that do the job, since that reduces internal overhead within the guest OS, and overhead in the hypervisor as well. But, if you need more CPU power, VirtualBox now makes that possible as well.https://blogs.oracle.com/jsavit/entry/sherlock_holmes_and_the_adventureSherlock Holmes and The Adventure of the Odd Permissionsjsavithttps://blogs.oracle.com/jsavit/entry/sherlock_holmes_and_the_adventure
Mon, 11 May 2009 10:08:35 +0000SunaclfilepermissionszfsA mystery in which a user can create directories and files on ZFS, but cannot remove them! <h2>"Come, Watson, come!" he cried. "The game is afoot. Not a word! Into your clothes and come!" </h2>
<p>
Well, it wasn't quite as dramatic as that, and it wasn't a "three pipe problem",
but a little while ago I was handed a puzzler with an unexpected result with ZFS file systems.
Specifically, a non-root user created a directory on ZFS, put a file and a subdirectory in it, and then
was unable to remove it. Doing the same with UFS worked as expected.
This violates the <a href="http://en.wikipedia.org/wiki/Principle_of_least_astonishment">Principle of Least Astonishment</a>!
<pre>
tank/fs1> mkdir temp <font color="green"><i>create a directory</i></font>
tank/fs1> cd temp
tank/fs1/temp> touch a <font color="green"><i>put a file in it</i></font>
tank/fs1/temp> mkdir bak <font color="green"><i>and a subdirectory</i></font>
tank/fs1/temp> cd ..
tank/fs1> ls -lad temp <font color="green"><i>yup, there it is</i></font>
drwxr-xr-x 3 joeuser99 smile 4 Apr 15 15:46 temp
tank/fs1> rm -rf temp <font color="green"><i>try to remove it</i></font>
tank/fs1> ls -lad temp <font color="green"><i>huh? why is it still there</i></font>
drwxr-xr-x 3 joeuser99 smile 3 Apr 15 15:46 temp
tank/fs1> cd temp <font color="green"><i>let's get closer</i></font>
tank/fs1/temp> ls -la <font color="green"><i>the file is gone, but subdir remains</i></font>
total 9
drwxr-xr-x 3 joeuser99 smile 3 Apr 15 15:46 .
drwxr-x--- 4 joeuser99 smile 9 Apr 15 15:46 ..
drwxr-xr-x 2 joeuser99 smile 2 Apr 15 15:46 bak
tank/fs1/temp> cd ..
tank/fs1> chmod -R 777 temp <font color="green"><i>brute force is always fun</i></font>
tank/fs1> ls -ald temp
drwxrwxrwx 3 joeuser99 smile 3 Apr 15 15:46 temp
tank/fs1> rm -rf temp <font color="green"><i>remove it</i></font>
tank/fs1> ls -lad temp <font color="green"><i>it doesn't want to be removed</i></font>
drwxrwxrwx 3 joeuser99 smile 3 Apr 15 15:46 temp
tank/fs1> cd temp
tank/fs1/temp> ls -la <font color="green"><i>Blimey! Same as before</i></font>
total 9
drwxrwxrwx 3 joeuser99 smile 3 Apr 15 15:46 .
drwxr-x--- 4 joeuser99 smile 9 Apr 15 15:46 ..
drwxrwxrwx 2 joeuser99 smile 2 Apr 15 15:46 bak
</pre>
<p>
Well, that makes no sense - if you create a directory or file, you should be able to remove it, right? Not necessarily!
<p>
ZFS works differently from traditional UFS: ZFS uses a pure ACL model, unlike
UFS which either has ACL settings or permission bits.
If you're used to traditional Unix file permissions (nicely described
<a href="http://en.wikipedia.org/wiki/File_system_permissions">here</a>
and <a href="http://mason.gmu.edu/~montecin/UNIXpermiss.htm">here</a>, as well as hundreds of other places)
things work the way you're used to, but when you're using
<a href="http://en.wikipedia.org/wiki/Access_control_list">Access Control Lists</a>
the permission to <i>write</i> to a directory doesn't necessarily imply
the permission to <i>remove</i> objects placed in it.
<p>
Specifically, there are separate access privileges:
<code>add_file</code> (permission to add a new file to a directory),
<code>add_subdirectory</code> (permission to create a subdirectory),
<code>delete</code> (permission to delete a file),
and
<code>delete_child</code> (permission to delete a file or directory within a directory).
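<p>
On Solaris you can inspect and adjust these with the ACL-aware <code>ls</code> and <code>chmod</code>. A quick sketch, reusing the directory from the example above (the exact ACL entries on your system will differ):
<pre>
tank/fs1> ls -dv temp <font color="green"><i>show the directory's full ACL, one entry per line</i></font>
tank/fs1> chmod A+user:joeuser99:delete_child:allow temp <font color="green"><i>explicitly grant delete_child</i></font>
</pre>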
<h2>"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."</h2>
<p>
In the user's situation, things worked as expected when the ZFS pool had the default settings of:
<pre>
aclmode groupmask default
aclinherit restricted default
</pre>
<p>
but not when it had:
<pre>
aclmode passthrough local
aclinherit passthrough local
</pre>
So, the question here is: what were the ACL settings and permissions on the parent filesystem?
Unfortunately, the data was removed by brute force so I never got the settings that caused the
unexpected results, and I wasn't able to duplicate this using default privileges and ACL settings
with either combination of <code>aclmode</code> and <code>aclinherit</code>.
I imagine that somewhere there were non-default ACL settings that specifically granted the <code>add_*</code>
permissions without the corresponding <code>delete_*</code> permissions, but I can't know for sure without data.
<h2>Where to get more data</h2>
Clearly, changing the default ACL settings and how they are passed via inheritance can have
surprising effects.
For a very good discussion on how ACLs work with ZFS
see Mark Shellenbaum's 2005 blog <a href="http://blogs.sun.com/marks/entry/zfs_acls">ZFS ACLs</a>, which has a clear
explanation and useful examples to demonstrate how ACLs work.
For the reference document, see
<a href="http://docs.sun.com/app/docs/doc/819-5461/ftyxi?a=view">Chapter 8 Using ACLs to Protect ZFS Files</a>
in the <a href="http://docs.sun.com/app/docs/doc/819-5461">Solaris ZFS Administration Guide</a>.
<p>
Apologies to <a href="http://en.wikiquote.org/wiki/Arthur_Conan_Doyle">Arthur Conan Doyle</a> for lifting
<a href="http://en.wikipedia.org/wiki/Sherlock_Holmes">Sherlock Holmes</a>
quotes from
<a href="http://en.wikipedia.org/wiki/The_Red-Headed_League">The Red-Headed League</a>,
<a href="http://en.wikipedia.org/wiki/The_Adventure_of_the_Abbey_Grange">The Adventure of the Abbey Grange</a>,
and
<a href="http://en.wikipedia.org/wiki/The_Adventure_of_the_Copper_Beeches">The Adventure of the Copper Beeches</a>.https://blogs.oracle.com/jsavit/entry/dtrace_and_the_mighty_herculesDTrace and the mighty Herculesjsavithttps://blogs.oracle.com/jsavit/entry/dtrace_and_the_mighty_hercules
Sun, 22 Feb 2009 19:02:38 +0000SundtraceherculessolarisDTrace is one of the really powerful features of Solaris, providing fantastic observability into application and system behavior. Here, an example of using DTrace to provide instrumentation data into an emulator program I occasionally use.<h1>DTrace and the mighty Hercules</h1>
<p>
One of the advanced features of Solaris is <a href="http://opensolaris.org/os/community/dtrace/">DTrace</a>,
a powerful tool for troubleshooting system or application problems, very frequently used for performance issues.
It gives you a powerful tool for looking inside a system or application - in a non-intrusive way - for the best
observability of any OS I know. In fact, a company I visited just a few weeks ago is very interested in moving from
Linux to Solaris because of the observability that DTrace provides, along with other Solaris features
designed for production work in the enterprise.
<!-- a brief list if not too distracting for the article:
<a href="http://www.sun.com/bigadmin/content/zones/">Solaris Containers and resource manager</a>
<a href="http://opensolaris.org/os/community/zfs/">ZFS</a>,
<a href="http://opensolaris.org/os/community/smf/">service management</a> and
<a href="http://opensolaris.org/os/community/fm/">fault management</a> capabilities.
Call me crazy if you want, but I think "production work" requires an OS with
observability, a resource manager, automation to respond to software and hardware failures (without a reboot!)...
-->
<p>
Anyhow, that got me thinking about using DTrace to look at a workload I sometimes run for
hobby purposes, the <a href="http://www.hercules-390.org/">Hercules System/370, ESA/390, and z/Architecture Emulator</a>.
This lets me run a simulated mainframe, on which I can boot ("IPL") an operating system like Linux and ancient
versions of MVS or VM. (Technically it's possible to run current mainframe operating systems, but they are
proprietary and not made available for download or licensed operation).
<p>
So, I booted up Hercules, which I ported to Solaris by making a few minor changes (incorporated in the Hercules 3.06 download)
to see what I could learn about where time is spent when Hercules is running.
<h2>First the stethoscope, then the MRI</h2>
<p>
An important thing about DTrace is that it's not the first place to start when you're trying to understand the performance
of a system. I like to say that it's like an MRI or CAT scan in medical diagnosis. You don't start with the
advanced tool for a deep "look inside" - first you start with the general tools (like a stethoscope) to get an overall picture of what is going on.
<p>
So, the first thing I did was run some standard "*stat" commands while running Hercules:
<pre>
$ vmstat 3
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s1 s2 in sy cs us sy id
0 0 0 2728320 1004208 54 333 0 0 0 0 54 8 -1 -1 -1 1147 3458 1700 9 2 89
0 0 0 1867328 129140 6 26 0 0 0 0 0 36 0 0 0 4280 23438 8437 53 6 41
1 0 0 1834364 96368 0 0 0 0 0 0 0 163 0 0 0 4314 20460 8153 52 6 42
0 0 0 1846368 108384 0 0 0 0 0 0 0 58 0 0 0 8277 20873 8582 52 7 41
0 0 0 1852356 114376 0 0 0 0 0 0 0 0 0 0 0 3629 20029 7300 54 4 42
0 0 0 1848048 110092 0 0 0 0 0 0 0 57 0 0 0 4501 24527 9419 53 6 41
0 0 0 1823660 88100 50 2613 0 0 0 0 0 61 0 0 0 3856 21135 7355 62 9 29
^C
$ mpstat 3
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 165 0 8 602 126 877 66 48 20 1 1771 9 2 0 89
1 168 0 28 563 242 853 66 48 20 1 1769 10 2 0 88
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 9 0 1 2322 368 3990 338 143 62 0 10859 55 8 0 37
1 3 0 21 2026 305 4478 573 142 51 0 11870 50 8 0 42
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 5449 390 4138 438 150 98 0 9052 63 6 0 31
1 116 0 3187 2097 317 4313 384 151 68 0 12272 42 8 0 50
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 5 0 0 2038 165 4323 535 164 122 0 11398 52 6 0 42
1 0 0 19 1992 400 4078 438 157 90 0 11289 59 6 0 36
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 205 2475 479 4313 443 130 67 0 11991 50 9 0 41
1 0 0 223 2208 276 4293 487 135 85 0 10710 55 8 0 37
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 1919 336 3847 502 143 114 0 10301 64 6 0 29
1 181 0 19 1927 392 4269 662 143 115 0 11507 56 7 0 3
</pre>
<p>
Okay, I'm not short on memory, am driving some CPU, and I'm spending most of my CPU time
in user state, not in the kernel.
<h2>Who is making system calls?</h2>
<p>
Despite the fact we're not spending most time in the kernel,
we are doing a substantial number of system calls per second (<code>syscl</code>), so for fun let's see
which ones they are - DTrace comes in here.
<p>
First, let's see which programs are issuing system calls; there's other
stuff going on and we shouldn't assume it's Hercules. I'm using one of the most frequently seen
DTrace one-liners, which "fires" on any system call, and creates an <b>aggregate</b> of number of system calls,
indexed by the
name of the program (<code>execname</code> built-in DTrace variable) issuing them.
The following command traces each probe that matches any system call (the <code>syscall</code> provider),
and creates a table of counts of system calls indexed by executable program name. By default, the action
of stopping the command (the Control-C) dumps aggregates to the terminal.
(Note: I'm running DTrace from the <code>root</code> userid, but that's not necessary - you can
use role-based authorizations to assign the privileges needed to run DTrace to a regular userid.
This is my workstation, so I just used root).
<pre>
# dtrace -n 'syscall::: { @[execname] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C
inetd 1
gnome-volume-man 3
iiimd 3
mdnsd 5
devfsadm 7
nwam-manager 15
sendmail 19
... lines omitted for brevity ...
dtrace 3170
evolution 3213
firefox-bin 3447
hercules 3951
gnome-terminal 6035
rhythmbox 7628
</pre>
<h2>Which system calls is Hercules making?</h2>
Ah, hah - playing music while I work is the leading cause of system calls!
Well, let's disregard that, and focus on <i>only</i> the system calls
coming from Hercules, and see which calls they are. Another well-known one liner,
this time using a <b>predicate</b> that restricts observation only to calls issued
by Hercules' binary.
Let's index the calls by the name of the system call itself; I did this while running
an application that reads from a simulated tape (the tape image is actually a disk file):
<pre>
# dtrace -n 'syscall::: /execname == "hercules" / { @[probefunc] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C
gtime 16
pollsys 16
ioctl 18
nanosleep 1623
yield 6077
write 12380
llseek 62482
lwp_park 105148
read 116566
</pre>
Well, that makes a lot of sense: I'm reading from a disk file, and sure enough, there are lots of <code>read</code> system calls.
We can even use DTrace to see which file it is: when <code>read()</code>, <code>write()</code> or <code>poll()</code> are frequent system calls,
aggregate on arg0 to the call, which is the file descriptor number, and then look up which file via <code>pfiles</code>.
<pre>
# dtrace -n 'syscall::read:entry /execname == "hercules"/ { @[arg0] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C
34 2
35 48440
</pre>
The left column is the file descriptor number, and the second column is the number of times it was seen, so the
vast majority of <code>read</code> system calls was for file descriptor number 35.
Using pfiles I can see which file it was:
<pre>
# pfiles 6376
6376: /rpool/hercules/misc/hercules-3.06/.libs/hercules -f hercules.cnf
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0620 dev:315,0 ino:820483781 uid:103812 gid:7 rdev:24,4
O_RDWR
/dev/pts/4
1: S_IFIFO mode:0000 dev:312,0 ino:2943 uid:103812 gid:10 size:0
O_RDWR
...
35: S_IFREG mode:0644 dev:182,65562 ino:53251 uid:103812 gid:10 size:1443627008
O_RDWR|O_LARGEFILE
/rpool/sirius/phase7/phase7.awstape
...
</pre>
Now I know which system call was dominant, and which file it was for (the simulated tape file).
<h2>Which part of the program was making those system calls?</h2>
But wait, there's more! (just like on late night TV ads). What if I also want
to know which functions <i>inside</i> the executable were doing the system calls? Well, DTrace has access to both kernel
and user stack, so we can issue:
<pre>
# dtrace -n 'syscall::read:entry /execname == "hercules" / { @[ustack(5)] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C
libc.so.1`__read+0x7
libhercd.so`cckd_read+0x78
2
libc.so.1`__read+0x7
hdt3270.so`console_connection_handler+0x2f9
28
libc.so.1`__read+0x7
libhercu.so`logger_thread+0x170
0x632e7265
1002
libc.so.1`__read+0x7
hdt3420.so`readhdr_awstape+0x53
58414
libc.so.1`__read+0x7
hdt3420.so`read_awstape+0x120
58414
</pre>
Now I know exactly which functions (and offsets within them) were the most frequent issuers
of the <code>read</code>. No surprise there: the dominant system call was for reads, and the reads
were issued from the piece of code that handles reading from simulated tape drives. In this case
I could have made an easy guess about what was going on, but DTrace made it possible to instrument
this, in a completely unmodified application.
<h2>Looking inside a program's entry points</h2>
<p>But this has more implications: DTrace can see the objects within the executable, using the
<code>pid</code> provider, so I can get even more detailed information. First, let's get a list
of the probe points that are exposed when the program is running. In this case, I know that the process
has process id (pid) 6376 (in later examples the pid may differ), so I can expose its contents this way:
<pre>
# dtrace -l -n pid6376:::entry|more
ID PROVIDER MODULE FUNCTION NAME
78992 pid6376 hercules _start entry
78993 pid6376 hercules __fsr entry
78994 pid6376 hercules _mcount entry
78995 pid6376 hercules main entry
...
... snip for brevity
...
79467 pid6376 libherc.so set_or_reset_console_mode entry
79468 pid6376 libherc.so set_screen_pos entry
79469 pid6376 libherc.so erase_to_eol entry
79470 pid6376 libherc.so clear_screen entry
79471 pid6376 libherc.so set_screen_color entry
79472 pid6376 libherc.so translate_keystroke entry
79473 pid6376 libherc.so console_beep entry
...
... snip for brevity
...
79484 pid6376 libherc.so configure_cpu entry
79485 pid6376 libherc.so deconfigure_cpu entry
79486 pid6376 libherc.so get_devblk entry
79487 pid6376 libherc.so ret_devblk entry
79488 pid6376 libherc.so find_device_by_devnum entry
79489 pid6376 libherc.so attach_device entry
79490 pid6376 libherc.so detach_devblk entry
...
... snip for brevity
...
79650 pid6376 libherc.so s370_branch_on_condition entry
79651 pid6376 libherc.so s370_branch_on_count_register entry
79652 pid6376 libherc.so s370_branch_on_count entry
79653 pid6376 libherc.so s370_branch_on_index_high entry
79654 pid6376 libherc.so s370_branch_on_index_low_or_equal entry
79655 pid6376 libherc.so s370_compare_register entry
79656 pid6376 libherc.so s370_compare_logical_register entry
... snip
</pre>
<p>Hey, I can see the C functions that implement service routines, and the ones
that implement the simulated instructions!
<h2>How many times do I invoke each function in the program?</h2>
Well, if I can do that, then I
should be able to see how often they are invoked, and how much time is spent in them.
So, I could issue a command that shows me when a given instruction is emulated, like:
<pre>
dtrace -n 'pid6376:libherc.so:s390_set_storage_key_extended:entry { trace(probefunc);}'
</pre>
but that will run off the screen REAL FAST, as there will be zillions of times any given instruction
is issued. Aggregates are our friend again, so instead I issue (note: this was done on a different day, with a different pid):
<pre>
# dtrace -n 'pid19284:::entry { @a[probefunc] = count(); }'
dtrace: description 'pid19284:::entry ' matched 11135 probes
^C
...
... I snipped out lines for low-frequency functions
...
s390_shift_left_double 341
s390_external_interrupt 343
s390_perform_external_interrupt 343
s390_store_characters_under_mask 346
...
......................I snipped a bunch of lines.....................
...
s390_compare_logical_register 11769
memcpy 12616
s390_exclusive_or_register 12627
s390_branch_and_link 13483
s390_branch_and_save_register 14161
s390_subtract_register 14351
s390_move_immediate 14601
...
......................I snipped a bunch of lines.....................
...
s390_move_character 40638
s390_load_and_test_register 44384
s390_branch_on_condition_register 47965
s390_compare_logical_immediate 49598
sigon 49928
ptt_pthread_trace 73559
s390_compare_logical 76252
s390_compare_register 83654
s390_compare_logical_character 91370
s390_load_register 96943
s390_test_under_mask 132845
s390_load_address 158997
s390_insert_characters_under_mask 168542
s390_store 196300
s390_load 309268
s390_branch_on_condition 725825
</pre>
<p>
Wow - what this tells me is that Branch on Condition (<code>BC</code>) and Load (<code>L</code>)
are the most frequently seen instructions. This will mean something to you if you program in
mainframe assembly language, otherwise not. It gives a good hint that anything that speeds
up processing of these instructions will make a difference.
<h2>Quantifying actual time spent in functions</h2>
<p>
However, we can refine that: the above gives you the number of times you invoked the most frequently used functions,
but not how much time was spent in them. So, let's use a different script that remembers the entry and exit times
of the probe points, and creates aggregates of both number of times they're invoked, and a histogram of the times
inside those functions. This is too complicated for a one-liner at the terminal, so I'm using a script I found
(IIRC, in the <a href="http://opensolaris.org/os/community/dtrace/dtracetoolkit/">DTrace Toolkit</a>).
What this does is remember the high-precision timestamp on entry to each function to compute the elapsed
time spent in it, as well as the number of entries and exits.
<table border=1>
<tr><td>
<pre>
#!/usr/sbin/dtrace -s
pid$1:::entry
{
self->ts[self->stack++] = timestamp;
}
pid$1:::return
/ self->ts[self->stack - 1] /
{
this->elapsed = timestamp - self->ts[--self->stack];
@[probefunc] = count();
@a[probefunc] = quantize(this->elapsed);
self->ts[self->stack] = 0;
}
</pre>
</td></tr>
<tr><td><b>The quantify.d DTrace script</b></td></tr>
</table>
<p>
Notice that the pid is passed as an argument to the script.
Running this against pid 21080 gives the following output (with low frequency and low elapsed time entries removed):
<pre>
# dtrace -s quantify.d 21080
... output snipped ...
s390_compare_register 51374
s390_subtract_logical_register 52798
s390_compare 56810
s390_load_and_test_register 59370
s390_compare_logical 68099
s390_and 71375
s390_branch_on_condition_register 72790
ptt_pthread_trace 76804
s390_load_register 135094
s390_add_logical_register 139985
s390_store 199027
s390_test_under_mask 234316
s390_load_address 293952
s390_load 393249
s390_branch_on_condition 746037
</pre>
The first aggregate dumped showed frequency of invocation by function call, and that's similar to other tests.
<code>TM</code>, <code>LA</code>, <code>L</code>, and <code>BC</code> are indeed frequent instructions.
<p>
But the aggregate based on time shows interestingly different results!
For exposition purposes I removed output for function calls related to sleeping or waiting on a condition, as they would be expected to have
high elapsed times (a later test could be used to distinguish between elapsed times spent waiting for
work to do or I/O to complete - versus time spent waiting for a lock to be released).
<p>
Instead, what is interesting is that while <code>s390_branch_on_condition</code> does get invoked a lot,
it doesn't have high elapsed times - the left column represents time in nanoseconds, and this is a fast
instruction in emulation. Other instructions, and a number of support routines, got invoked a lot
less frequently, but represented disproportionate time for the number of invocations!
<pre>
s390_branch_on_condition
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 727692
2048 | 3758
4096 |@ 14337
8192 | 103
16384 | 37
32768 | 39
65536 | 26
131072 | 40
262144 | 4
524288 | 1
1048576 | 0
s390_execute_b2xx
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@ 1635
8192 |@@@@@@@@@@@@ 1282
16384 | 14
32768 | 1
65536 |@@@@@@@@@ 990
131072 | 7
262144 | 7
524288 |@ 150
1048576 | 22
2097152 | 1
4194304 | 0
8388608 | 0
16777216 | 6
33554432 |@ 141
67108864 | 13
134217728 | 0
sprintf
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 40
16384 |@@@@@@@@@ 14
32768 | 0
65536 | 0
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 |@@@@@ 8
2147483648 | 0
s390_insert_characters_under_mask
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29837
2048 |@ 844
4096 |@ 621
8192 | 5
16384 | 5
32768 | 8
65536 |@ 675
131072 |@ 749
262144 |@ 601
524288 |@ 569
1048576 |@ 706
2097152 |@ 1547
4194304 |@@ 2278
8388608 |@ 756
16777216 | 402
33554432 |@ 866
67108864 |@@@ 2776
134217728 | 0
</pre>
<p>
In this example, the long durations for some invocations of
<code>s390_insert_characters_under_mask</code> make it disproportionately more expensive than
its frequency of use would indicate. Optimizing the "tail" of long duration calls, and for that
matter reducing the number of calls to <code>sprintf</code> could make a difference in performance.
<h2>Summary</h2>
In this (slightly contrived) example, I've shown a little bit about how DTrace can be used for performance measurement.
We were able to use standard Solaris tools to see how busy the system was, and then used DTrace to get counts of system calls
by application and by function. After that, we figured out which functions in the application were making system calls,
which file was involved in a filesystem call, and then we got the frequency counts and durations of the "hot" entry points
in the application. This illustrates performance observability of an application not otherwise set up for instrumentation.
No special compiler options were used, no code was changed, but we were able to see where we were spending our time.
<p>
This just scratches the surface of what can be done - it's possible to get a <b>lot more information</b>, such as flow of
control from function to function, just as an example. With a little investment of effort
(especially if you borrow existing DTrace scripts available on the 'net!) you can get incredible insight into the
behavior of your applications or entire Solaris systems. There's really nothing remotely like this elsewhere, and
I can't recommend it strongly enough.
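<p>
For instance, the flow-of-control view mentioned above is mostly just the <code>-F</code> (flow indent) flag plus entry and return probes - a sketch against the Hercules pid used earlier (expect a firehose of output):
<pre>
# dtrace -F -n 'pid21080:libherc.so::entry' -n 'pid21080:libherc.so::return'
</pre>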
<p>
For a great reference for learning more, go to the
<a href="http://docs.oracle.com/cd/E18752_01/html/819-5488/index.html">DTrace User Guide</a>
or <a href="http://opensolaris.org/os/community/dtrace/">OpenSolaris DTrace page</a>.https://blogs.oracle.com/jsavit/entry/sirius_opensolaris_on_system_zsirius - OpenSolaris on System z - an updatejsavithttps://blogs.oracle.com/jsavit/entry/sirius_opensolaris_on_system_z
Mon, 16 Feb 2009 16:53:49 +0000SunopensolarissiriussystemzWork has continued on "sirius" prototype port of OpenSolaris in the last few months. Here's some news on how the project is going, and experiences with the code.<h1>OpenSolaris on System z - an update</h1>
<p>
Some months ago I
<a href="http://blogs.sun.com/jsavit/entry/opensolaris_on_system_z_finally">reported</a>
on the prototype port of OpenSolaris to the IBM mainframe under z/VM ("sirius").
Some things have transpired since then, and I figure it's time for an update (also, some people asked me off-line, so why answer separately?)
<h2>Why we're helping - it's a Community thing</h2>
<p>
First, let me make Sun's position clear, as there's been a little confusion.
We are obviously <b>not advocating</b> that people do more
computing on mainframes! (This doesn't come as a surprise, right guys? :-) )
Remember, <i>we make our own computers</i> which we believe are
far superior computing platforms, hosting the fully-supported, complete implementations of Solaris and OpenSolaris
with a massive installed base and ISV partner product portfolio.
We have our own highly effective, scalable, widely adopted virtualization, providing the best possible
platforms for virtualized, consolidated Solaris operation.
Solaris also runs on <a href="http://www.sun.com/bigadmin/hcl/">many</a>
SPARC, Intel and AMD systems from other vendors, insulating customers
from lock-in or inflated prices.
<p>
Instead, our purpose is to <b>grow the OpenSolaris community</b>, and if that includes
people who are traditionally mainframe customers, that's fine too. Maybe those folks will get experience
with Solaris using excess capacity on their mainframes, see the light :-) and then deploy
Solaris on SPARC and x86/x64 platforms, where the implementations are complete, fully supported, and widely deployed.
<p>
To that end, we've loaned SPARC server hardware to <a href="http://www.sinenomine.net/">SNA</a> for their cross-compile port
work, set up an <a href="http://www.opensolaris.org/os/project/systemz/">OpenSolaris project page</a>
and contributed time installing and testing the prototype and providing bug reports
and feedback. The last part is where I come in (in addition to having made introductions within Sun.)
<p>
<b>Note:</b> You can get started with OpenSolaris right now on your own Intel or AMD desktop or laptop computer:
download it at <a href="http://www.opensolaris.com/get/index.jsp">http://www.opensolaris.com/get/index.jsp</a> right now!
You can run it from a LiveCD image so you don't have to overlay the OS you have installed on your computer,
or you can run it in a virtual machine: download VirtualBox from the same URL above, and have your cake and eat it too!
<h2>Status update - things fixed and pending</h2>
<p>
So, that out of the way, some notes on the prototype.
A lot of work has been done since my last post on this topic, though the implementation is still far from complete.
<p>
Last time, for example, there were no <code>man</code> or <code>prstat</code> commands, both of which are now available.
There was no <code>hostid</code> command, but one has now been written (it didn't make the image I'm now testing, but I'm told it will
be on the next one).
Problems that made <code>getconf -a</code> and 64-bit <code>ls</code> fail have been resolved.
There was a bug where you couldn't change your password - that's been resolved too.
<p>
One step in the right direction relates to the <a href="http://www.sun.com/software/solaris/virtualization.jsp">Solaris Containers</a>
(frequently referred to as "zones")
feature of Solaris 10 and OpenSolaris (also see <a href="http://en.wikipedia.org/wiki/Solaris_Containers">Wikipedia</a>).
Back in autumn, attempting to create a zone produced the following error:
<pre>
# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
Segmentation Fault (core dumped)
</pre>
The good news is that <code>zonecfg</code> now apparently works. Unfortunately
zone installation fails with a complaint that a script related to Solaris
<a href="http://www.sun.com/software/solaris/liveupgrade/">Live Upgrade (LU)</a> is missing,
saying <code>/usr/lib/lu/lucreatezone: not found</code>.
This is confusing: LU doesn't actually exist per-se on OpenSolaris, and that file <i>shouldn't</i> be present! OpenSolaris
makes use of <a href="http://dlc.sun.com/osol/docs/content/2008.11/snapupgrade/snap3.html">a completely different way
to manage boot environments</a>, so there is an issue about implementing Solaris Containers in the way that is compatible with OpenSolaris
boot and package system - both of which have yet to be ported. Still, that's a step in the right direction, and one that I'm sure will
be straightened out in due time.
<p>
This does open the big issue of differences between OpenSolaris and Solaris 10. For example, the prototype boots off
the <a href="http://en.wikipedia.org/wiki/Unix_File_System">Unix File System (UFS)</a>, whereas OpenSolaris boots off
<a href="http://opensolaris.org/os/community/zfs/">ZFS</a> (see also <a href="http://en.wikipedia.org/wiki/ZFS">ZFS at Wikipedia</a>).
This is a substantial difference, because OpenSolaris uses ZFS snapshots and clones for managing boot environments. This entire
infrastructure is part of what makes OpenSolaris, and needs to be implemented.
<p>
A number of important features are still missing:
<a href="http://opensolaris.org/os/community/dtrace/">DTrace</a> (see also <a href="http://en.wikipedia.org/wiki/DTrace">DTrace at Wikipedia</a>)
for example, is not implemented. That's a key feature of Solaris 10 and OpenSolaris, and is even available on FreeBSD and Mac OS X.
There's a deep need for a package/patch management system - currently there's no way to apply a patch, though the
<a href="http://opensolaris.org/os/project/pkg/">Image Packaging System</a> is now going to be ported over.
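<p>
If you haven't seen DTrace, here's the flavor of one-liner you can run today on Solaris 10 or OpenSolaris
on other platforms - and that this port can't do yet. It counts system calls by process name until you
press Ctrl-C:
<pre>
# dtrace -n 'syscall:::entry { @[execname] = count(); }'
</pre>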
<p>
One of the obvious things you do with a prototype system is you debug and diagnose things.
Annoyingly, if you use commands like <code>pstack</code> or <code>pfiles</code>
to diagnose what a program is up to, the tool can die or kill the target process, as that set of tools
isn't quite working yet. (Guess what happened when I ran <code>pstack</code> against <code>sshd</code> while logged in via ssh! Once was all it took to teach me not to do that again.)
More frustrating, using one of those commands to look at a <code>core</code> file can make the tool crash and overlay the core
file with its own dump. Ow.
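<p>
Until those tools stabilize, a defensive habit (a suggested workaround, not a fix) is to work on a copy of
any core file, and optionally tell the system to give new dumps unique names so nothing gets clobbered:
<pre>
$ cp core /var/tmp/core.saved      # examine a copy, never the original
$ pstack /var/tmp/core.saved
# coreadm -i /var/tmp/core.%f.%p   # as root: unique name per program and pid
</pre>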
<p>
Other bits and pieces are missing:
If you issue the <code>format</code> command, it still says "No disks found" (even though there are definitely disks available and mounted). The <code>kstat</code> and <code>prtconf -vp</code> commands don't work.
<p>
In general, there are things you bump into now and then.
This should be <b>expected</b> - this is a <b>prototype</b>, and that's what prototypes are like. The developer is outstanding, but this
is a Herculean task (there's a pun there - I'll come back to Hercules in a moment). It's been my job to look for errata, so it shouldn't be a surprise that I find them.
<h2>In the queue now</h2>
<p>
In my latest testing I turned up new problems - or at the very least, newly discovered ones.
<code>ping</code> didn't work from a non-root userid - a problem that was quickly solved:
<pre>
$ ping 10.80.63.130
ping: socket Permission denied
$ ppriv -De ping 10.80.63.130
ping[104270]: missing privilege "net_icmpaccess" (euid = 103812, syscall = 230) for "devpolicy" needed at common_specvp+0x5e
ping: socket Permission denied
</pre>
Adding the setuid bit to the binary fixed that problem. Ain't
<a href="http://www.google.com/search?q=solaris+privilege+bracketing&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a">Solaris privilege bracketing</a> neat?
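<p>
For completeness, the fix was a one-liner, run as root (paths as on stock Solaris). The more surgical
alternative - granting just the missing privilege to a user rather than making the binary setuid - is
sketched on the second line, though I haven't tried that route on the prototype ("jeff" is a placeholder
username):
<pre>
# chmod u+s /usr/sbin/ping                           # restore the setuid bit
# usermod -K defaultpriv=basic,net_icmpaccess jeff   # or: grant the privilege per-user
</pre>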
<p>
Right now, there are a few new problems that have been head-scratchers: one is that programs like
<code>ping</code> or <code>ftp</code> dump core if I use DNS name resolution (they work fine
if I just put in the target IP address). Previously I hadn't set up <code>/etc/nsswitch.conf</code> to use DNS,
so the issue never arose.
This provoked a problem where <code>sshd</code> would refuse to accept new
sessions: any previously established session was fine, but attempts to establish new connections failed.
Eventually (smack forehead!) I figured out that the forked instance of the ssh daemon was dying while trying to
do a reverse-DNS lookup of the client, so I turned off DNS resolution altogether.
This is important, so I'll be looking into this again soon.
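<p>
For anyone following along, the toggle in question is the <code>hosts</code> line in
<code>/etc/nsswitch.conf</code> - this is all it takes to flip between the crashing and working
configurations:
<pre>
hosts: files dns    # DNS enabled: ping/ftp/sshd currently dump core
hosts: files        # DNS disabled: the workaround for now
</pre>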
<p>
Around this time I started having
recursive crashes in the <a href="http://opensolaris.org/os/community/smf/">Service Management Facility (SMF)</a> daemons;
every time I looked, <code>svc.configd</code> and <code>svc.startd</code> were running and chewing up as much z9 CPU time as they could get.
Back in the summer I was getting error messages about a corrupt SMF repository - maybe that error persists and has just bitten me.
Odder still, I was getting crashes while rebooting the system!
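<p>
On a stock Solaris system, the first diagnostic steps would look like the following - whether they behave
on the prototype is, of course, exactly what's in question:
<pre>
$ svcs -xv                         # explain any services in maintenance
$ prstat 1 1                       # confirm svc.startd/svc.configd are spinning
# /lib/svc/bin/restore_repository  # as root: recover a corrupt SMF repository from backup
</pre>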
<p>
So, for the sake of starting afresh, I'm reloading the test system from the disk restore image
(you install it by doing an image restore of a disk volume and manually configuring the network identity).
That way, we can start debugging again (or see whether the above problems recur at all) from a system where the tracks
haven't been muddied.
<p>
<b>Please take note:</b> As I said before, this is a <b>prototype</b>. Errors and odd situations are what happen in the laboratory
when working with experimental software - in this case, a port of a complex operating environment. Please keep your expectations
consistent with reality, and don't expect this software to have the properties of the production-grade Solaris OS, which has the
benefit of development teams, test organizations, QA groups, and years of baking in.
<h2>Speaking of Hercules</h2>
<p>
I previously alluded to the <a href="http://www.hercules-390.org/">Hercules System/370, ESA/390, and z/Architecture Emulator</a>.
Early in the sirius project I advocated that sirius run only under VM, as that simplifies the porting effort.
<p>
A consequence of that, which I regret, is that it makes it impossible to run sirius on commodity, developer personal computers. Hercules
provides a virtual mainframe, but sirius makes use of hypervisor functions provided by VM. Fortunately, the Hercules community is looking into
fixing this - either by modifying the sirius port code so it can run without the hypervisor functions, or by adding them to the Hercules emulator.
That will make it possible for a lot more people to use this prototype port, as not everybody has access to a mainframe! That will help grow the OpenSolaris community, which is why Sun is participating.
<p>
(It may be possible in future to run OpenSolaris in a Hercules virtual mainframe, running inside a Solaris Container underneath OpenSolaris or Solaris 10, running inside a VirtualBox or VMware Virtual Machine. Hm, maybe at that time we can bring up Hercules underneath sirius, for
an arbitrary number of recursive levels of virtualization!)
https://blogs.oracle.com/jsavit/entry/opensolaris_on_system_z_finallyOpenSolaris on System z - finally goes publicjsavithttps://blogs.oracle.com/jsavit/entry/opensolaris_on_system_z_finally
Fri, 17 Oct 2008 13:37:32 +0000SunopensolarissystemzThe initial version of OpenSolaris on z (also called "sirius") is now "out in the open" - there's a bit of buzz about it. I've been involved with this project for several years, and now it's something we can talk about.It took a long time, but the binary and source distributions for the initial version of OpenSolaris on z (also called "sirius") are now "out in the open". The port has been done by <a href="http://www.sinenomine.net/">Sine Nomine Associates</a> (SNA), mostly by Neale Ferguson, with some support from IBM (they have an interest in finally having a good OS on their hardware :-) ) and Sun.
<p>
I've been involved in this too, but I've kept mum till now. I've been in conversation with SNA's David Boyes and Neale Ferguson since 2005 when they started on this, and have been testing since February of this year. I made a few introductions to Sun people, helped arrange the workstation SNA uses for the port, suggested some design choices (64-bit mode only, and leverage the VM hypervisor for ease of implementation).
<p>
Next week, I'll be presenting on this subject at the <a href="http://vm.marist.edu/~mvmua">Metropolitan VM User Association (MVMUA)</a>, which I used to be President of when I was a customer. I have long roots in this community. I'll be presenting on Sun, Solaris, OpenSolaris and Sun virtualization in general (this group is primarily IBM customers) for background, and then discuss OpenSolaris on z. I'll describe testing and performance (oh yes indeed, I did some performance comparisons between z and other platforms. Heh, heh.) Should make for <i>quite</i> an interesting session.
<p>
If you're interested, now's the time to go have a look. The OpenSolaris project page is at <a href="http://www.opensolaris.org/os/project/systemz/">OpenSolaris Project: Systemz</a>, and the binary distribution and documentation are at SNA's <a href="http://distribution.sinenomine.net/opensolaris/">download page</a>.
<p>
If you want to run it, you'll need specific hardware: an IBM z9 or z10 mainframe running current z/VM. Sorry, but it won't run under the popular <a href="http://en.wikipedia.org/wiki/Hercules_emulator">Hercules mainframe emulator</a>. Not yet, at least. When that's available this project will be accessible to a far wider community than the people who happen to have a z.
<p>
For a little excitement, this topic has already come up on <a href="http://tech.slashdot.org/tech/08/10/17/1552248.shtml">Slashdot</a>, where you can see the kind of back-and-forth Slashdot is known for!
<p>
Oh, in case you're wondering why it's called "sirius": there is an <a href="http://opensolaris.org/os/community/power_pc/">ongoing port of Solaris to PowerPC</a> (they need a good OS too!) called "Polaris". Neale is from Australia, and Sirius makes a good alternative name: it keeps the symmetry as a Southern Hemisphere counterpart to Polaris, and it was the name of the flagship of the British First Fleet to Oz. There, some history and astronomy too!
https://blogs.oracle.com/jsavit/entry/respone_to_joe_temple_jobResponse to Joe Temple's blog on my blog...jsavithttps://blogs.oracle.com/jsavit/entry/respone_to_joe_temple_job
Fri, 19 Sep 2008 18:07:29 +0000SunbenchmarkingibmmainframeperformancesolarissunIBM made an announcement that made invidious and inaccurate comparisons to our products. I blogged about why it was misleading. Other people blogged about my blog and said things that were even more misleading. Read my response here.My attention was called to <a href="http://mainframe.typepad.com/blog/2008/07/response-to-j-1.html">this blog page</a> containing Joe Temple's (of IBM) continued argument with me and the points I raised on my blog. I tried to respond by adding a comment on <i>his</i> blog (as I published his comments on my blog), but he declined to return the favor of publishing my response. That's a shame, don't you think?
<p>
I would prefer to ignore this and let sleeping dogs snooze. We've made our points and neither of us is going to convince the other. We probably won't convince anybody else who already has a position staked out. However, a lot of what Mr. Temple said about Sun products, and about IBM System z, is wrong. I dislike misrepresentation of facts, and especially misrepresentation of what I said, so I'm going to pick up the sword at least this one more time.
<p>
So, I'm going to respond to his blog. I won't recap my dismembering of the phony comparisons that accompanied the z10 announcement, as you can find that on my blog at
<a href="http://blogs.sun.com/jsavit/entry/the_ten_percent_solution">Ten Percent Solution</a> and
<a href="http://blogs.sun.com/jsavit/entry/no_there_isn_t_a">No, there isn't a Santa Claus</a>. Especially read the latter, because that is where Mr. Temple and I previously conversed, and I extended him the courtesy of saying "Joe Temple is an IBM Distinguished Engineer, and in my opinion a person who has earned respect". All the more reason for my disappointment at seeing his blog. Consequently, I feel compelled to respond to some of Mr. Temple's distortions.
<p>
First: He says of me "he compares the LSPR scaling ratios to Industry benchmark results on UNIX SMPs." I'll be blunt: <b>Joe knows I did the exact opposite</b>. See <a href="http://blogs.sun.com/jsavit/entry/the_ten_percent_solution">Ten Percent Solution</a> where on March 18, 2008, I said of LSPR <cite>"What I object to is it being used as a marketing tool in an official IBM announcement to extrapolate performance for comparison to a completely different platform based on a workload that isn't even the same as the LSPR benchmark workload."</cite> and where I described the <cite>"legitimate purpose IBM has for LSPR: same-platform-family capacity planning. Anything else should be marked with disclaimers that admit 'this is only an estimate'. For cross-platform comparisons, there are the standard open benchmarks, which IBM refuses to publish for System z."</cite> Joe read the material I quote here long before his post saying I said the opposite.
<p>
Let's go to the beginning of this. My blog responded to IBM's <a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">February announcement of the z10</a> and a related <a href="http://www-128.ibm.com/developerworks/blogs/page/benchmarking?entry=back_to_the_future_with">IBM blog</a> that had the text <cite>"384.5 RPEs is approximately equivalent to the number of z10 RPEs at 90% when you use 20 RPEs equal to 1 MIP where MIPS are based on the LSPR curve for the z10."</cite> It was IBM using LSPR to compare z to Unix benchmarks, and it was me saying why that was baseless. Joe accuses me of what I clearly was refuting, on a blog he read months before his blog. I do not like being misrepresented in this manner.
<p>
The IBM announcement originally had a footnote 3 that had "justification" for their claims, and the blog elaborated on it. When it turned out that this fuzzy math was based on inappropriate use of a 3rd party's tools, both the announcement page and the blog were <b>airbrushed</b> to remove those claims. I wish I had the foresight to have printed or saved those pages, but the text I quoted has been removed. Dear Reader - feel free to speculate why.
<p>
Fortunately, the Internet has cached copies of the IBM press release before it was censored. See the <a href="http://findarticles.com/p/articles/mi_pwwi/is_200802/ai_n24330674/print?tag=artBody;col1">original text</a> or use <a href="http://tinyurl.com/5u3emk">http://tinyurl.com/5u3emk</a>. That page includes the removed text: <cite>"\*3 Source: On Line Transaction Processing Relative Processing Estimates (OLTP-RPEs): Derivation of 760 Sun X2100 2.8 Opteron processor cores with average OLTP-RPEs per Ideas International of 3,845 RPEs and available utilization of 10% and 20 RPEs equating to 1 MIPS compared to 26 z10 EC IFLs and an average utilization of 90%."</cite> So, there is IBM using RPEs inappropriately. The IBM blog used a mythical "LSPR ratio" (there is no such thing - there are lots of ratios - one for each combination of benchmark and platform combination) to extrapolate from z9 to z10, though, unfortunately nobody kept a copy of the above blog page. I'll use Joe's wording, it is IBM that "compares the LSPR scaling ratios to Industry benchmark results on UNIX SMPs." <b>Not me</b>.
<p>
By the way, the basis of IBM's claim of replacing 1,500 servers was that 1,375 of them were essentially idle. With that trick, why not claim you can replace 15,000 or 150,000, or 1,500,000 if they're powered off! See the <a href="http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage?entry=yes_jon_there_is_a">IBM blog entry</a> or for convenience, <a href="http://tinyurl.com/5lxzrn">http://tinyurl.com/5lxzrn</a> for IBM blogger Tony Pearson saying <cite>"<b>125 Backup machines running idle</b> ready for active failover in case a production machine fails. <b>1250 machines for test, development and quality assurance, running at 5 percent</b> average utilization"</cite> (bold fonted for emphasis) I think we can agree that any contemporary machine can replace a large number of machines doing essentially nothing. <b>It just costs much more if you do that on z</b>.
<p>
Joe says that Sun disparages TPC-C because we don't do it well. Not so, IBM says the same. Read the IBM document at <a href="ftp://ftp.software.ibm.com/eserver/benchmarks/wp_TPC-E_Benchmark_022307.pdf">ftp://ftp.software.ibm.com/eserver/benchmarks/wp_TPC-E_Benchmark_022307.pdf</a> which describes TPC-C as <cite>"an aging benchmark losing relevance"</cite>, and lists its deficiencies for current computers and workloads. So, IBM also acknowledges that TPC-C is broken (except for the cases where they still use it - not on IBM z, of course - to convince the credulous).
<p>
Frankly, I've argued that we should blow away the numbers on TPC-C (I believe our current servers could handily do this), and then hold up the trophy for best TPC-C and say "we won the record", and then say outright it's bogus without people making spurious claims that we only say so because we can't do it. The counter argument, which has won so far, is that we've long made a statement of principle that TPC-C is broken, and it would be misleading and a distraction to then go out and publish new TPC-C results. Unlike some other vendors who both disparage it (see above) and also run it, depending on which audience they have handy.
<p>
What is comical about this is that Joe implies that Sun doesn't run one particular benchmark because we're unable to "keep up", while defending IBM not publishing ANY standard benchmark on System z. This is called "chutzpah". I can't make up my mind whether z-advocates truly believe they deserve a free pass and are exempt from proving performance and price/performance, or if this is a cynical ploy to avoid publishing z's price/performance.
<p>
In fact, Sun has many world record performance results on its servers. See for example <a href="http://www.spec.org/jAppServer2004/results/jAppServer2004.html"> a Java app server benchmark</a> (IBM is there, but not for z, of course). Same with <a href="http://www.spec.org/web2005/results/web2005.html">web serving</a>. Or <a href="http://www.sybase.com/detail?id=1056945">the world's largest data warehouse</a> on Sun and Sybase. Or <a href="http://www.oracle.com/corporate/press/2008_jul/sap-sd.html">world record with SAP using Oracle and Sun</a>. Click <a href="http://www.sun.com/servers/sparcenterprise/benchmarks/index.html">here</a> for many more. There are lots of them.
<p>
In sharp contrast, <b>IBM refuses to publish performance results on System z in a way that would permit direct platform comparison</b>. Despite the handwaving and chaff Joe spreads about cache coherency (much of which is fanciful and wrong), many of the standard benchmarks are indeed very good predictors for application performance: such as web serving, file serving, Java application servers and databases. There is nothing mystical about it. These are exactly the workloads IBM wants you to run on z without offering any public evidence that z performs them adequately. Need I add that the industry standard benchmarks vary widely in terms of cache coherence and threadedness? IBM publishes none of them for z.
<p>
IBM does run workloads on z similar to some industry benchmarks, but only shows performance relative to other z models. See the <a href="http://www-03.ibm.com/systems/z/advantages/management/lspr/lsprwork.html">LSPR page</a> where WASDB (on z/OS) and its WASDB/L variant (on z/Linux) form a Java application server benchmark <cite>"written to open Web and Java Enterprise APIs, making the WASDB application portable across J2EE-compliant application servers."</cite> It's very much like the Java app server benchmarks Joe says are invalid and so different from what you run on a mainframe. Except it <i>does</i> run on a mainframe, and IBM says it <i>is</i> a valid predictor of performance, contrary to what Joe says. What they <i>won't</i> do is make it easy for you to do a direct comparison with any other platform. Think about it. Also take some time to look at the actual benchmark descriptions: just as with the open benchmarks, they have different levels of parallelism and cache interference. To describe the mainframe workloads as being on one side of parallel nirvana, purgatory or hell (see Temple's remarks on his blog, or on the copy below) is nonsensical, as the LSPR studies include all three. IBM also doesn't run the z/OS LSPR studies under z/VM, so his references to virtualization overhead are irrelevant: IBM reports the non-virtualized results, and they still scale poorly.
<p>
I do not understand Joe's obsession with simplistic characterizations of the parallelism and scalability of our processors and industry-standard benchmarks, which have a wide range of properties. Labelling them as a group having a single characteristic for scalability, NUMA sensitivity, parallelism, etc, is simply wrong. An <a href="http://www.sun.com/servers/index.jsp?cat=Sun%20Fire%20High-End%20Servers&tab=3">M-series enterprise server</a> is very different in performance characteristics and other features from a <a href="http://www.sun.com/servers/coolthreads/overview/">coolthreads chip-multithreading server</a>. To lump them together as identical is silly. A T5220, for example, has uniform memory latency - it's not NUMA in the least. Not so with an E25K or an M8000, which do have NUMA properties. Sun recognizes that different workloads have different processor requirements, and makes <i>SPARC and Solaris binary compatible systems that can handle all of them</i> so you can trivially move your application to the compatible system that runs it best. Unlike with IBM, where by Joe's own words (see below), you have to switch platform architectures and (consequently) all the platform software if an application turns out not to be a good performance match for z.
<p>
Joe is mistaken in how he characterizes the different processor families. One difference is that Sun's enterprise servers increase cache and I/O bandwidth as processors are added, while IBM's don't. As you enable CPs in an IBM z9 or z10 "book" you decrease the cache available to each processor, since a "book" contains a fixed amount of cache, regardless of CPUs. Not <i>too</i> bad for z/OS because it uses shared address spaces (eg: LPA and CSA) for multiple jobs, but a bigger problem for Linux under z/VM which has to discard cache and TLB contents on every dispatch. (Joe must have missed Sun's E25K product, which, like z, also had a large L3 cache. That's hardly unique to IBM). Anyway, the cache limitations may explain IBM z's sublinear scalability shown in all the LSPR tests, and the fact that IBM doesn't publish LSPR for more than 32 CPUs in a single OS instance. See <a href="http://www-03.ibm.com/systems/z/advantages/management/lspr/Systemz10zOS18SI.html">IBM LSPR report</a> and notice how doubling the number of CPUs doesn't double the measured throughput and performance. It's sometimes stated that IBM doesn't do full-system tests of a single OS instance on z for reasons of cost, but that's obviously not the case: IBM runs <a href="http://www-03.ibm.com/systems/z/advantages/management/lspr/Systemz10zOS18MI.html">LSPR tests on fully-configured z systems</a> but they need multiple OS instances to drive the boxes and still can't get near linear scale.
<p>
Joe also neglected to respond to my blog's mention of negative scale on IBM mainframes (which he has already seen): <a href="http://www.vm.ibm.com/perf/reports/zvm/html/24way.html#FIG24WETR">a 16 way z990 doing only 1,322 web transactions per second, and a 24-way doing under 1,000</a>. That's right - <b>Adding CPUs resulted in lower performance!</b>. So much for the claims of scalability for IBM mainframe. Crikey! Any contemporary 1RU x86 or SPARC server will dramatically outperform that at a tiny fraction of the cost, floor space, software licenses, staff, and environmentals. Talk about "inconvenient facts". Instead of misapplying Gunther's book, I suggest Joe read <cite>"IBM Mainframes: Architecture and Design, 2nd ed"</cite> (Prasad and Savit). Or my VM performance and internals books, for that matter.
<p>
So, I urge you to simply ignore all this hand-waving about cache and NUMA properties and the supposed magic capabilities on IBM's z line. It's completely inaccurate, in the cases where it isn't irrelevant. It's just chaff to distract people from looking at z's disastrously bad price/performance.
<p>
Joe also disingenuously claims IBM doesn't enjoy monopoly pricing for z. Of course it does. If you want to run z/OS applications, you have nowhere to run them except on IBM processors using IBM's z/OS, CICS, VTAM, etcetera (it's not just the hardware price). IBM has the luxury of charging high margins because there are no Amdahl or Hitachi systems to compete against, nor alternative sources for the software I just listed, and because they know rehosting an application takes a substantial amount of effort. I have hands-on experience with the excellent Unikix application suite, now owned by <a href="http://www.clerity.com">Clerity Inc</a>. It absolutely can rehost a z/OS batch + CICS application on open systems without rewriting. I've seen successful migrations with a replacement TCO a mere 10% to 15% of the mainframe's TCO (that's TCO, folks, not acquisition cost). But it requires a good deal of effort. The barriers to exit from the mainframe are much, much higher than between Unix dialects. That's the whole point of Open Systems, and why you shouldn't check into a z-motel that's hard to check out of.
<p>
So, yes. Mainframes are proprietary and have monopoly pricing. Just try to buy a z/OS equivalent or a system to run it on from anyone else. See for example <a href="http://www.theinquirer.net/gb/inquirer/news/2008/07/03/ibm-flirt-antitrust-buying-main">IBM shuts up competitor PSI by buying it - The INQUIRER</a>. In contrast, you can run Solaris on SPARC machines from us and from a partner+competitor, and you can run Solaris on x86 servers from dozens of vendors - including IBM.
<p>
Which reminds me: Joe distorts the agreement between IBM and Sun with his statement <cite>"Finally, Sun itself recognized System z and zVM as "the premier virtualization platform" when Sun and IBM jointly announced support of Open Solaris on IBM hardware."</cite> We did no such thing: the statement was support for Solaris on IBM's x86 (Intel) servers, not on "System z and zVM". Surely Joe knows this. See <a href="http://www.sun.com/aboutsun/pr/2007-08/sunflash.20070816.1.xml">IBM Expands Support for the Solaris OS on x86 Systems</a>. By the way, I've been involved with the OpenSolaris on z project since its inception - I obtained the workstation the non-IBM developers used for the port, personally installed it on z/VM, and it's not finished yet, either. It does give me the chance to do like-to-like performance comparisons of SPARC, Intel, and IBM System z (same app, same OS), and it substantiates everything I've been saying here.
<p>
The last paragraph illustrates one of the things that most disturbs me about this sorry episode. I've witnessed a shameless willingness from certain quarters to "just make things up". I've seen claims that you could use z/OS features when running z/Linux under z/VM, and completely imaginary ratios between different computers and their own, and claims that people (like me) said the exact opposite of what they actually said. It's really a sad thing, and makes me doubt the many years I spent as a loyal IBM customer and (alas) fan.
<p>
Jeff
<hr>
Just to be on the safe side, since some things I've referred to have disappeared from the sites that originally hosted them, here is a copy of the blog article I am responding to. Can't be too careful these days. As he notes, my material (following the "Posted by" line) is mingled with his. His text is in blue italics, just as on his blog.
<p>
<blockquote>
<font size="-2">
<h3>Response to Jeff Savit Blog</h3>
<p>As part of the announcement of z10 IBM made some marketing claims about the large number of distributed Intel servers that&nbsp; could be consolidated with zVM on a z10.&nbsp; The example cited used Sun rack optimized servers with&nbsp; Intel Architecture CPUs.&nbsp; Sun Blogger Jeff Savit objected strenuosly to the claims mainly because of the low utilization assumed on the Sun machines that the claims compared to.&nbsp; You can read it here: </p>
<p><span style="font-size: 0.8em;"><a href="http://blogs.sun.com/jsavit/entry/no_there_isn_t_a">http://blogs.sun.com/jsavit/entry/no_there_isn_t_a</a></span>I responded, he responded.&nbsp; When I was out of pocket&nbsp; for awhile and did not respond soon enough and his blog cut off replies on that thread.&nbsp; I am putting my latest response here.&nbsp; Thanks to Mainframe blog for providing the venue to do so.&nbsp; My latest responses to Jeff are in <em><span style="color: #0000ff;">blue italics.</span></em></p>
<p class="comment-details">Posted by <strong>Joe Temple</strong> on June 24, 2008 at 11:28 AM EDT <a class="entrypermalink" title="comment permalink" href="http://blogs.sun.com/jsavit/entry/no_there_isn_t_a#comment-1214321283000">#</a> </p>
<p><a id="comment-1214520066000" name="comment-1214520066000"></a></p>
<div class="comment odd" id="comment11"><p>This format is very difficult for parry and riposte, but let's try. I would like to use different colors, but I can't (AFAIK) put in HTML markup to permit that. So: Joe's stuff verbatim within brackets, and each of his sections starts with a quote of a sentence of mine (which I identify, within quotes) for context. Each stanza identified by name and employer (this is Jeff speaking):</p>
<p>Joe(IBM): [[[Jeff, your post is rather long and rather than build a point by point discussion too long for a single comment I will put up several comments. Starting with the moral of the story: There are several: • quoting Jeff: &quot;Use open, standard benchmarks, such as those from SPEC and TPC.&quot;</p>
<p>Better to use your own. They have not been hyper tuned and specifically designed for. They have a better chance of representing reality. But be careful not to measure wall clock time on “hello world” or lap tops will beat servers every time.]]]&nbsp; </p>
<p>Jeff(Sun): In a perfect world, every customer would have the opportunity to test their applications on a wide variety of hardware platforms to see how they perform. But they don't, and they rely on open standard benchmarks to give them some information about how the platforms would perform. Or, they do have applications they could benchmark, but they're non-portable, or run solely on a single CPU (making all non-uniprocessor results worthless), or otherwise have poor scalability or any of a hundred other problems. Imagine comparing IBM processors based on the speed of somebody writing to tape with a blocksize of 80 bytes! Even if they get a useful result, the next customer doesn't benefit at all and has to start from scratch. It's not trivial to make good benchmarks that aren't flawed in some way. That's why the benchmark organizations exist - to provide benchmarks that characterize performance and give a level playing field for all vendors. IBM, Sun, and others are active in them - our employers must think they have value. Obviously there is &quot;benchmarketing&quot; and misuse of benchmarks. THAT is what I'm railing against. Hence, my following bullet that says &quot;read and understand&quot;. But frankly, benchmarks Specweb/specwebssl/Specjvm, the SPEC fileserver benchmarks, and benchmarks like TPC.org's TPC-E provide representative characterization of system performance (with sad exceptions like TPC-C, which is broken and obsolete, but IBM still uses for POWER). <span style="color: #3366ff;"><em>The characterization of</em> TPC-C <em>as &quot;old and broken&quot;&nbsp; may have something to do with Sun's inability to keep up on that benchmark.&nbsp; One of the characteristics of TPC-C that none of the other benchmarks has is that it has at least some &quot;non local&quot; accesses in the transactions.&nbsp; Sun's problem with this is that such accesses defeat the strong NUMA characteristic of their large machines.&nbsp; One of the results of this&nbsp; is that all machines scale worse on TPC-C than on the benchmarks Jeff cites. Since Sun is very dependent on scaling a large number of engines to get large machine capacity close to IBM's machines they are highly susceptible to this.&nbsp; &nbsp;The effect&nbsp; is&nbsp; exacerbated by NUMA (non uniform memory access).&nbsp; That is, a flat SMP structure will mitigate this.&nbsp; &nbsp;The mainframe community's problem with TPC-C is that the non-local traffic is all balanced and a low percentage of the load.&nbsp; As a result TPC-C still runs best on a machine with a hard affinity switch set and does not drive enough cache coherence traffic to defeat numa structures.&nbsp; When workload runs this way it does not gain any advantage from z's schedulers or shared cache or flat design. 
Think of TPC-C as a fence.&nbsp; There is workload on Sun's side and there is workload on the mainframe side of TPC-C.&nbsp; All the Industry Standard Benchmarks sit on Sun's side and scale more linearly than TPC-C.&nbsp; For workloads that are large enough to need scale that run on the Sun side of the TPC-C fence, IBM sells System p and System x.&nbsp; When you consolidate disparate loads the Industry Standard benchmarks do not represent the load and&nbsp; with enough &quot;mixing&quot;&nbsp; the&nbsp; composite workload will eventually move to the mainframe side of the TPC-C fence.&nbsp; See Neil Gunther's <u>Guerilla Capacity Planning</u>, for a discussion of contention and coherence traffic and their effect on scale.&nbsp; Particularly&nbsp; read chapter 8, to get an idea about how the benchmarks lead to overestimation of scaling capability.&nbsp; &nbsp; </em></span>A lot of people have worked very hard to make them be as good as they are. IBM uses these benchmarks all the time - with the notable exception of System z.&nbsp; <em><span style="color: #3366ff;">System z is designed&nbsp; to run workloads with non uniform memory access patterns, randomly variable loads, and much more serialization and cache migration than occurs in the standard benchmarks , where strong affinity hurts, rather than enhances throughput. It is the only machine designed that way (Large shared L3 and only 4 way NUMA on 64 processors). Also, the standard benchmarks are generally used for &quot;benchmarketing&quot;.&nbsp; As a result the hard work involved is not purely driven by the noble effort by technical folks that Jeff portrays, but rather by practical business needs, including the need to show throughput and scale in the best possible light.&nbsp; </span></em>That's the point, isn't it. It works in a monopoly priced marketplace where it doesn't have to compete on price/performance,&nbsp; as it does with its x86 and POWER products. Where else are you going to run CICS, IMS, and JES2?&nbsp; <span style="color: #3366ff;"><em>There are alternatives to System z on all workloads, it is matter of migration costs v benefits of moving.&nbsp; Many applications have moved off CICs and IMS to UNIX )and Windows over the years. Sun has whole marketing programs to encourage migration.&nbsp; In fact a large fraction of UNIX/Windows loads do work that was once done on mainframes.&nbsp; As result the mainframe must compete.&nbsp; &nbsp;Similar costs are incurred moving work from any UNIX (Solaris, HPUX,&nbsp; AIX, Linux to zOS. 
Or moving from UNIX to Windows.&nbsp; </em></span><span style="color: #3366ff;"><em>The other part of the barrier is the difference in machine structure.&nbsp; This barrier is workload dependent.&nbsp; Usually, when considering two platforms for a given piece of work one of the machine structures will be a better fit.&nbsp; &nbsp;When moving work in the direction favored by the machine structure difference the case can be made to pay for the migration..&nbsp; This is what all verndors do.&nbsp; Greg Pfister (<u>In Search of Clusters)</u>, suggests that there are three basic categories of work.&nbsp; Parallel Hell, Paralle Nirvana, and Parallel Purgatory.&nbsp; I would suggest that there are three types of machines optimized for these environments (Blades in Nirvana, Large UNIX machines in Purgatory, and Mainframes in Hell)&nbsp; To the extent that workload is in parallel hell, the barrier to movement off the mainframe will be quite high.&nbsp; &nbsp;Similarly attempts to run purgatory or nirvana loads on the mainframe will run in to price and scaling issues. IBM asserts that consolidation of disparate workloads using virtualization will drive the composite workload toward parallel hell, where the mainframe has advantages due to its design features, mature hypervisors and machine structure. </em></span></p>
<p>To the second observation about wall clock time on trivial applications: yes, obviously.</p>
<p>Joe(IBM): [[[quoting Jeff: •&quot;Read and understand what they measure, instead of just accepting them uncritically.&quot;<br />Yes, particularly understand that the industry standard benchmarks run with low enough variability and low thread interaction that it makes sense to turn on a hard affinity scheduler. Your workload probably does not work this way.]]]&nbsp; </p>
<p>Jeff(Sun): I'm not sure what's intended by that. Are you claiming that benchmarks should be run against systems without fully loading them to see what they can achieve at max loads? Hmm. Anyway, see below my comments about low variability and low thread count - which applies nicely to IBM's LSPR.]]]&nbsp; &nbsp;<em><span style="color: #3366ff;">I guess I am claiming that the industry benchmarks basically represent parallel nirvana and parallel purgatory.&nbsp; I am asserting that mixing workload under single OS or virtualizing servers within an SMP drives platforms toward parallel hell.&nbsp; The near linear scaling of the industry standard loads on machines optimized for them will not be achieved on mixed and virtualized workloads.&nbsp; In part this because sharing the hardware across multiple applications will lead to more cache reloads and migrations than occur in the benchmarks.&nbsp; &nbsp;I see Jeff's reference&nbsp; to LSPR as a red herring for two reasons.&nbsp; While LSPR has not been applied across the industry,&nbsp; the values it contains have been used to do capacity planning rather than marketing. The loads for which this planning is done are usually a combination of virtualized images each either running mixed and workload managed&nbsp; under zOS or&nbsp; VM and zLinux.&nbsp; &nbsp;This could not be done successfully if&nbsp; the scalability were as idealized as the Industry standard benchmarks.&nbsp; &nbsp;Second, I do not suggest that LSPR is the answer, but rather that the current benchmarks do not sufficiently represent the workloads in question (mixed/virtualized) for Jeff to make the claim that z does not scale as he did elswhere in the blog entry.&nbsp; Basically,&nbsp; to draw his conclusion he compares the LSPR scaling ratios to Industry benchmark results on UNIX SMPs. This is not&nbsp; a good comparison. </span></em></p>
<p>Joe(IBM): [[[quoting Jeff: •&quot;Get the price-tag associated with the system used to run the benchmark.&quot; Better to understand your total costs including admin, power, cooling, floorspace, outages, licensing, etc.&quot;</p>
<p>Jeff(Sun): That's what I meant. <span style="color: #3366ff;">Great.&nbsp; Because the hardware price difference that Sun usually talks about is only a small percentage of total cost.&nbsp; The share of total cost represented by hardware price shrinks every year.</span></p>
<p>Joe(IBM): [[[quoting Jeff: • Relate benchmarks to reality. Nobody buys computers to run Dhrystone.&quot; Only performance engineers run benchmarks for a living.]]]</p>
<p>Jeff(Sun): Sounds like a dog's life, eh? OTOH, they don't have users...</p>
<p>Joe(IBM): [[[quoting Jeff: •&quot;Don't permit games like &quot;assume the other guy's system is barely loaded while ours is maxed out&quot;. That distorts price/performance dishonestly.&quot; Understand what your utilization story is by measuring it. Don’t permit games in which hypertuned benchmarks with little or no load variability and low thread interaction represent your virtualized or consolidated workload. Understand the differences in utilization saturation design points in your IT infrastructure and what drives them.&quot;]]] </p>
<p>Jeff(Sun): Your comment has nothing to do with what I'm describing. What I'm talking about is the dishonest attempt to make expensive products look competitive by proposing that they be run at 90% utilization, while the opposition is stipulated to be at 10%, and claim magic technology (like WLM, which z/Linux can't use) to permit higher utilization and claim better cost per unit of work on your own kit. That's nothing more than a trick to make mainframes look only 1/9th as expensive as they are. Imagine comparing EPA mileage between two cars by spilling 90% of the gas out of the competitor's tank before starting. As far as &quot;no load variability and low thread interaction&quot;, I suggest you take a good look at IBM's LSPR. See <a href="http://www-03.ibm.com/servers/eserver/zseries/lspr/lsprwork.html" rel="nofollow">http://www-03.ibm.com/servers/eserver/zseries/lspr/lsprwork.html</a> which describes long running batch jobs (NO thread interaction at all) on systems run 100% busy (NO load variability). The IMS, CICS (mostly a single address space, remember), and WAS workloads in LSPR should not be assumed to be different in this regard either. This doesn't make LSPR evil: it is not - it's very useful for comparisons within the same platform family. But consider SPECjAppserver, which has interactions between web container, JSP/servlet, EJB container, database, JMS messaging layer, and transaction management - many in different thread and process contexts. I suggest you reconsider your characterization about thread interaction. Complaints about thread interaction and variability of load are misplaced and misleading.&nbsp; <span style="color: #3366ff;">The comparison <em>of zLinux /VM at high utilization with highly distributed solution at low utiliation is valid, and well founded on both data&nbsp; and system theory.&nbsp; &nbsp;You could make similar comparisons of&nbsp; consolidated&nbsp; Virtualized UNIX v&nbsp; distributed Unix,, VMware v Distirbuted Intel.&nbsp; Any cross comparison of virtualized v distributed servers&nbsp; will be leveraged mainly by utilization rather than by raw&nbsp; performance as measured by benchmarks.&nbsp; Thus the comparison Jeff complains about as dishonest does in fact represent what happens when consolidating existing servers using virtualization.&nbsp; &nbsp;My second point is that in making comparisons between consolidated mixed worklload solutions that industry benchmarks are not represetative of the relative capacity or the saturation design point for each of the&nbsp; systems in question.&nbsp; There is no current benchmark to use for these comparisons.&nbsp; This includes LSPR, Suns Mvalues, rPerfs,&nbsp; as well as the industry benchmarks.&nbsp; None of them works.&nbsp; Each vendor asserts leverage for consolidation based on their own empirical results, or perceived strengths in terms of machine design.&nbsp; &nbsp;&nbsp; I am saying that the scaling of these types of workloads is&nbsp; less linear that the industry benchmark results and that&nbsp; some of the things z leverages to do LSPR well&nbsp; will&nbsp; apply in this environment as well. </em></span>Joe(IBM): [[[quoting Jeff: •&quot;Don't compare the brand-new machine to the competitor's 2 year old machine&quot; Understand what the vintage of your machine population is. 
When you embark on a consolidation or virtualization project compare alternative consolidated solutions, but understand that the relative capacity of mixed workload solutions is not represented by any of the existing industry standard benchmarks.]]]&nbsp; </p>
<p>Jeff(Sun): We're talking at mixed purposes. What I mean is that one vendor's 2008 product tends to look a lot better than the competition's 2002 box, making invidious comparisons easy. Moore's Law has marched on.<em><span style="color: #3366ff;">&nbsp; The truth is that when you do a consolidation you usually deal with a range of servers some of which are 4 or 5 years old.&nbsp; 2 year old&nbsp; vintage is probably farirly representative.&nbsp; In any case Moore's law does not improve utilization of distributed boxes unless you consolidate work in the process of upgrading. Unless a consolidation is done the utilization will drop when you replace old servers with new servers.&nbsp; For the consolidation to occur within a single application, the application has to span multiple old servers in capacity.&nbsp; Server farms are full of applications which do not use a single modern engine efficiently let alone a full multicore server.&nbsp; &nbsp;Jeff's main argument is with the utilization comparison.&nbsp; &nbsp;The utilization of distributed servers, including HP's, Sun's and IBM's, is&nbsp; very often quite low.&nbsp; It is possible to consolidate a lot of low utilized servers on a larger machine. The mainframe has a long term lead in the ability to do this, that includes hardware design characteristics (Cache/Memory Nest), specific scheduling capability in hypervisors (PR/SM and VM), and hardware features (SIE).&nbsp; &nbsp;How many two year old low utilized servers&nbsp; running disparate work can an M9000 consolidate?&nbsp; &nbsp;</span></em> </p>
<p>Joe(IBM): [[[quoting Jeff: • &quot;Insist that your vendors provide open benchmarks and not just make stuff up.&quot; <br />Get underneath benchmarketing and really understand what vendor data is telling you. Relate benchmark results to design characteristics. Characterize your workloads. (Greg Pfister's In Search of Clusters and Neil Guther's Guerilla Capacity Planning suggest taxonomies for doing so.) Understand how fundamental design attributes are featured or masked by benchmark loads. Understand that ultimately standard benchmarks are “made up” loads that scale well. Learn to derate claims appropriately, by knowing your own situation. (Neil Gunther's Guerilla Capacity Planning suggests a method for doing so)]]]</p>
<p>Jeff(Sun): This is not the &quot;making stuff up&quot; that I was referring to. I was referring to misuse of benchmarks in the z10 announcement, which IBM was required to redact from the announcement web page and the blogs that linked to it. I'm not arguing against synthetic benchmarks that honestly try to mimic reality, I'm arguing against attempts to game the system that I discussed in my &quot;Ten Percent Solution&quot; blog entry.&nbsp; <em><span style="color: #3366ff;">I have explained the comparison made for the z10 announcement above.&nbsp; &nbsp;Jeff objects to the utilzation coparison which is legitimate. In fact when servers are running at low utilization most of them are doing nothing most of the time.&nbsp; That is the central argument for virtualization which is generally accepted in the industry.&nbsp; I am also pointing out that Industry Standard Benchmarks are not created in purely noble attempt to uncover the truth about capacity.&nbsp; In fact they are generally defined in a way that supports the distributed processing, scale out. client server camp of solution design, which is why they scale so well.&nbsp; &nbsp;Think about it.&nbsp; The industry standard committees each vendor has a vote.&nbsp; System z represents 1/4 of IBM's vote.&nbsp; &nbsp;Do you think there will ever be an industry standard benchmark which represents loads that do well on its machine structure?&nbsp; The benchmarks and their machines have evolved together.&nbsp; They can represent loads from single application codes that are cluster or numa concious.&nbsp; &nbsp;What happens to all of those optimizations when workloads are stacked and the data doesn't remain in cache or must migrate from cache to cache?&nbsp; The point is that relevance and validity of&nbsp; either side of this argument is highly workload dependent.&nbsp; &nbsp;The local situation will govern most cases.&nbsp; Neither an industry benchmark result nor a single consolidation scenario&nbsp; is more valid than the other.&nbsp; </span></em></p>
<p>Joe(IBM): [[[quoting Jeff: &quot;Be suspicious!&quot;Be aware of your own biases. Most marketing hype is preaching to the choir. Do not trust “near linear scaling” claims. Measure your situation. Don’t accept the assertion that the lowest hardware price leads to the lowest cost solution. Pay attention to your costs, and don’t mask business priorities with flat service levels. Be aware of your chargeback policies and their effects. Work to adjust when those effects distort true value and costs.&quot;]]] </p>
<p>Jeff(Sun): With this I cannot disagree. That's exactly what I have been discussing in my blog entries: unsubstantiated claims of &quot;near linear scaling&quot; to permit 1,500 servers to be consolidated onto a single z (well, the trick here is to stipulate that 1,250 of the 1,500 do no work!) <em><span style="color: #3366ff;">By definition servers running at low utilization are doing nothing most of the time.</span></em>or to ignore service levels (see my &quot;Don't keep your users hostage&quot; entry). <em><span style="color: #3366ff;">Actually virtualization&nbsp; of servers&nbsp; on shared hardware can improve service levels by improving latency of interconnects.&nbsp; </span></em>I'll also add &quot;beware of the 'sunk cost fallacy'&quot;: you shouldn't throw more money into using a too-expensive product that has excess capacity because you've already sunk costs there.&nbsp; <em><span style="color: #3366ff;">Actually, adding workload to an existing large server can be the most effiicent thing to do in terms of power, cooling, floorspace, people, deployment, and time to market, even if the price of the processor hardware is higher.&nbsp; These efficiencies and the need for them is locally driven.&nbsp; In general there may or may not be a &quot;sunk cost fallacy&quot; .&nbsp; In fact&nbsp; you should also be aware of the &quot;hardware price bargain fallacy&quot;.&nbsp; Finally, Sun itself recognized System z and zVM as &quot;the premier virtualization platform&quot; when Sun and IBM jointly announced support of Open Solaris on IBM hardware.</span></em></p></div>
</font>
</blockquote>
<p>
(end of quoted material)
https://blogs.oracle.com/jsavit/entry/no_there_isn_t_aNo, there isn't a Santa Clausjsavithttps://blogs.oracle.com/jsavit/entry/no_there_isn_t_a
Sun, 15 Jun 2008 09:01:48 +0000Suncapacitydatabasesibmlsproltpperformancerpesolarisx2100x86I really don't like to use this blog for refuting competitor exaggerations and FUD. I'd really rather spend the infrequent time I spend on this blog talking about Sun technology.
There's so much great new stuff in Solaris, in our servers and storage products, and software stack, that it's
just a nuisance to have to refute silly attacks on us.
<p>
But, once again into the fray. Read on to see the latest fuzzy mathI really don't like to use this blog for refuting IBM exaggerations about mainframes. That's the world I come from, and I have a lot of respect and nostalgia for it, but I'm too frequently drawn into pointing out distortions in press releases or marketing FUD. I'd really rather spend the infrequent time I spend on this blog talking about Sun technology.
There's so much great new stuff in Solaris, in our servers and storage products, and software stack, that it's
just a nuisance to have to refute silly attacks on us.
<p>
But, once again into the fray. I receive <a href="http://www.mainframe-exec.com/">Mainframe Executive</a> magazine, and the
May/June issue's closing column by Jon Toigo of <a href="http://www.toigopartners.com/portal/">Toigo Partners</a> had some incorrect statements that I just <i>had</i> to correct, saying:
<p>
<ul>
<li>that LPARs support up to 1,500 virtual environments. Actual maximum is 60.
<li>that z/VM and z/Linux make use of IBM's Sysplex, Workload Manager (WLM), and Intelligent Resource Director (IRD). No, those are only useable in a z/OS environment, which doesn't host virtual machines.
That's z/VM's job, and not only does it not have those features (frequently claimed to have magic properties!)
but it also has substantial costs that affect the cost-per-server figures that Toigo cited.
<li>that z/Linux can use a feature called DFHSM to reduce disk space needs. No, that also is a z/OS-only feature.
<li>that VMware systems (why were no others mentioned?) can only support about 20 guests on a high end server.
Too low by a factor of 4 or 5 (Sun
virtualization technologies like Solaris Containers and Logical Domains were not mentioned, alas).
Besides, as I mentioned in my
<a href="http://blogs.sun.com/jsavit/entry/don_t_keep_your_users">Don't Keep Your Users Hostage</a>
blog, user counts without reference to service levels are the wrong way to think about capacity.
</ul>
If you stipulate that another platform can run only 1/4 of the work
it can actually run, omit the very substantial costs on the other platform (z), believe grossly exaggerated
claims about its capabilities, and fail to mention features of other platforms that provide
comparable or superior capabilities that z cannot match (VMotion, anyone?),
well, you're going to be a few orders of magnitude off.
<p>
I don't mean to pick on Mr. Toigo. I e-mailed him, and he said that he wanted to be accurate, and
would contact IBM to verify facts. You can't ask for more than that from a journalist. I don't know if he'll come to see
the light regarding the exaggerations I point out in the
<a href="http://blogs.sun.com/jsavit/entry/the_ten_percent_solution">Ten Percent Solution</a>
(he is after all writing for a mainframe publication), but at least we can straighten out errors of indisputable fact
- stuff you can look up in vendor manuals.
<P>
(This confusion about system features is very common, even among IBMers, because so many
people think that "mainframe" implies "z/OS function set", when z/OS is only one of several operating systems that
run on z. <b>When you are <u>not</u> running an operating system, you don't get to use its features</b> - for good or bad!).
<p>
All this was inspired by IBM's recent claims, which I have refuted at length on this blog. I won't repeat my points
in full, because the material is
<a href="http://blogs.sun.com/jsavit/entry/the_ten_percent_solution">here</a> and
<a href="http://blogs.sun.com/jsavit/entry/how_i_found_out_that">here</a>,
but IBM makes the absurd claim that customers run database servers at 10% busy,
and through the magic wand of a few buzzwords, you can run <i>any</i> collection of workloads at 90% CPU utilization,
and somehow you can only do this using
features that only exist on z. All complete rubbish, including mistakes about which IBM products have which features.
<b>It's silly that I have to correct IBM employees about IBM technology</b>.
<p>
Here's the comment
I sent to the <a href="http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage?entry=yes_jon_there_is_a">blog</a>
of IBMer Tony Pearson:
<hr>
<blockquote><font color="blue">
I'm sorry you didn't take the opportunity to challenge my blog, cited as "some might question, dispute or challenge this ten percent". That would have been a good time to expose errors, if they exist, in my refutation of IBM claims.
<p>
However, I see errors of fact in your blog:
<p>
(1) You say WLM and IRD make it possible to run mainframes at 90% utilization. This is impossible: z/VM and z/Linux do not implement these z/OS-only functions. See, for example
<a href="http://publib.boulder.ibm.com/infocenter/eserver/v1r1/en_US/index.htm?info/veicinfo/eicartechzseries.htm">http://publib.boulder.ibm.com/infocenter/eserver/v1r1/en_US/index.htm?info/veicinfo/eicartechzseries.htm</a>
or
<a href="http://www-03.ibm.com/servers/eserver/zseries/zos/wlm/">http://www-03.ibm.com/servers/eserver/zseries/zos/wlm/</a>
<p>
(2) David Boyes' Test Plan Charlie ran no workload other than booting OS images. It cannot be used to extrapolate capacity for doing any actual work. It also used Linux kernel customizations to reduce overhead that you could not use "in real life".
<p>
(3) You say you can define z/VM LPARs in a Sysplex. Sysplex is a System z feature only available with z/OS, so what you suggest is impossible. You cannot use Sysplex for coordinating times or recovery with z/VM or z/Linux. z/VM only supports guest Sysplex within a single z/VM instance, and only for z/OS guests.
<p>
(4) Actual cost per IFL is $125,000, not $100,000, and that doesn't count the cost for RAM.
<p>
You are right in suggesting that you would have to add up actual software and hardware costs of both platforms for a fair comparison. I've done so, and even using IBM's "10% solution on x86, 90% busy on z" argument that I dispute, and the server counts at
<a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">http://www-03.ibm.com/press/us/en/pressrelease/23592.wss</a>,
IBM z cost about 7 times as much as the distributed solution IBM compares it to.
<p>
Your figures disagree with the IBM press announcement, which had claimed that 26 IFLs had sufficient capacity to do the job of 760 x86 cores (which is 380 servers, not 1,500). The page
<a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">http://www-03.ibm.com/press/us/en/pressrelease/23592.wss</a>
used to have a footnote number 3 with the math, which has now been removed. In your analysis, a 64-CPU z10 E64 would be needed. That costs about $26 million, excluding RAM, disks, and software licenses. That is over 14 times more expensive than 1,500 x2100s (the Sun price includes the RAM and pre-installed OS). If the CPUs are configured as IFLs, then they cost $125,000 each, totaling $8 million. With the minimum RAM configuration of 160GB, it still costs 5.38 times as much as the x2100s.
<p>
I will address several other errors and points of contention in my blog. The most important mistake, though, is the implication that it is hard to consolidate or virtualize servers on x86 (or SPARC) servers at high utilization, or simply to share assets among production, test, and disaster recovery for reduced costs. Nobody need run at 10% busy, or pay a high premium to get higher utilization. If the x86 servers are managed for higher utilization, far fewer will be needed, and the price difference will be even higher.
<p>
(end here)
</font>
</blockquote>
<p>
That's the comment I placed on the blog. It will be interesting to see what response it generates, if any.
It's notable to me that IBM removed the "justification" in footnote number 3 that used to exist
on the <a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">announcement page</a> I referred to above.
Also interesting, in my previous blog
<a href="<a href="http://www-128.ibm.com/developerworks/blogs/page/benchmarking?entry=back_to_the_future_with#comments">I linked to another IBM blog</a> which had the "basis" of their claims. <b>That blog has also had the content I referred to expurgated!</b>
Curiouser and curiouser!
<p>
There are other mistakes in the blog: there is no rule of thumb saying you can reduce the capacity needed
to run consolidated workloads by 15%. I'm interested in learning where <i>that</i> came from.
I should mention again that IBM's LSPR figures don't run the RPE benchmark anyway - they run proprietary IBM benchmarks,
so all of the IBM projections are specious reasoning, comparing their servers running one workload to other servers
running different workloads.
<p>
I did enjoy the part where he talks about the poor scalability of System z. That part was accurate. Sun's high-end SPARC servers
not only are "bigger" in every capacity metric than the z10, but they also are much better at vertical scale - we don't suffer
from the problem he describes. That's why IBM doesn't publish LSPR figures for systems above 32 CPUs, and truncates results
for "Single Image" capacity before that. They just don't scale as well.
That's another reason not to believe these projections: they assume they understand the scalability of the application
as more CPUs are added! The original IBM press release used fuzzy math based on linear scale - which System z doesn't achieve
(as the IBM blog says).
<p>
I guess I should mention that Sun uses NPIV too, and that Sun also has hierarchical storage management capabilities.
ZFS, a free feature of the Solaris 10 operating environment, also provides on-disk data compression to reduce disk space needs.
<h2>The 90% fallacy</h2>
<p>
In fact, the whole "90% utilization" premise is completely flawed, regardless of vendor.
<p>
Let's reason this through: if you have 90% average utilization, you probably have periods higher than 90%
and periods lower, unless you are lucky enough to have absolutely static, predictable resource demand. That
rarely happens, especially with interactive or on-line systems - which is exactly the kind of workload under discussion.
<p>
You do not want to run on-line systems at close to 100% busy due to queueing effects that would
raise response times, regardless of platform. Queueing theory applies to everybody!
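<p>
To make the queueing point concrete, here is a minimal sketch (using the classic M/M/1 single-server
model - a simplification I'm choosing for illustration, not a measurement of any particular platform)
of how response time stretches as utilization climbs:
<blockquote><pre>
# Toy M/M/1 queueing model: mean response time = service time / (1 - utilization).
# Illustrative only; real systems differ in detail, but the shape of the curve holds.
def stretch_factor(utilization):
    return 1.0 / (1.0 - utilization)

for u in (0.50, 0.70, 0.90, 0.95, 0.99):
    print("%2.0f%% busy: response time is %5.1fx the service time" % (u * 100, stretch_factor(u)))
</pre></blockquote>
At 50% busy, requests take twice their service time; at 90% they take ten times; at 99% a hundred
times. That knee in the curve is why nobody runs interactive systems near saturation, on any platform.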
<p>
The only way you can sustain that with a single workload is if that workload has predictable characteristics
and runs on a platform with enough capacity to serve it.
<b>No workload manager in the world helps you run
on systems that don't have enough capacity to meet service level objectives.</b>
Workload managers shift resources between workloads to meet service level objectives.
When there is only one application, or insufficient capacity to meet service levels, there's nothing
a workload manager can do.
<p>
On the other hand, when you are consolidating workloads, then you can run at extremely high utilization levels
<b>only if the different workloads have different service levels and priorities</b>
which would permit you to starve the lower-priority workloads while giving preferential service
to the high-priority workloads. If the aggregate capacity requirements of the high-priority
workloads are less than the server capacity, and you're willing to starve low priority workloads
(we sometimes call these "cycle soakers", since they soak up whatever capacity is left over)
then you can run your systems at or near 100% busy.
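<p>
As a toy illustration of that (the numbers are invented for the example, not drawn from any real system):
give the high-priority work whatever it demands, and let the cycle soaker absorb the remainder.
<blockquote><pre>
# Invented numbers: a resource manager always satisfies the high-priority
# workload first; a low-priority "cycle soaker" absorbs whatever is left.
CAPACITY = 100                                  # arbitrary capacity units
high_priority_demand = [55, 80, 60, 95, 40]     # varies from interval to interval

for demand in high_priority_demand:
    hp = min(demand, CAPACITY)                  # important work is never starved
    soaker = CAPACITY - hp                      # leftovers feed the soaker
    print("high-priority %3d + soaker %3d = 100%% busy" % (hp, soaker))
</pre></blockquote>
The box stays at 100% busy only because there's a workload nobody minds starving; take the soaker
away, and the same box must be sized for the high-priority peaks.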
<p>
There's absolutely no magic to this, and nothing that makes it possible on one platform and
impossible on another, as long as you have a resource manager: Solaris Resource Manager on Solaris
(there is no technical obstacle to running Solaris systems at high utilization -
Sun runs compute farms for circuit design close to 100% busy for months at a time),
System Resource Manager on z/VM (same initials!), or Workload Manager (WLM) on z/OS. With one of those, you can
run flat out (Linux doesn't seem to have anything suitable). But you also require a workload you can
shed or run slower if needed, and a way to throttle it while running more important work.
(The real difference here is that mainframe systems have a tradition of being run fully loaded,
due to the high acquisition costs, while distributed systems have less economic pressure,
and frequently are purchased by individual lines of business who didn't want to share!)
<p>
Even this is an over-simplification, as "capacity"
consists of many facets (CPU, I/O bandwidth, I/O operations per second, network bandwidth and latency)
- many applications, such as the OLTP example touted by IBM, are more likely to be I/O bound than
CPU bound. Throwing more CPU capacity at such systems is just a way to go idle sooner; they naturally run
I/O bound no matter what you do!
<p>
The moral of the story: don't believe the hype. High utilization isn't the answer to all questions.
(The right question is "how do I minimize the cost of computing - acquisition, operations,
energy, real estate, staff, and license costs - while maintaining service levels and meeting business needs?")
High utilization is helpful, of course - idle machines are an expense. But you can only get
"close to 100%" with fortunate combinations of workload and business priorities.
<p>
Finally, there is no magic wand on mainframes that makes it possible to run them at 90% busy with any old workload.
And <b>nobody</b> needs to run their production databases at 10% busy on distributed systems.
Customers frequently stack production, test, development, QA, and disaster recovery on the same machines to reduce
server counts.
If you choose to manage to higher utilization, there is no reason to run with the 1,500 servers
outlined in Pearson's blog: 125 production machines at 70% busy might reasonably be left alone, but the 125 backup servers could easily be consolidated with the grossly-too-many 1,250 test machines running 5% busy. Nobody needs to set up so many
almost-idle machines for test, development, and QA.
https://blogs.oracle.com/jsavit/entry/zones_in_opensolaris_livecd_insideZones in OpenSolaris LiveCD inside VirtualBox under OpenSolaris!jsavithttps://blogs.oracle.com/jsavit/entry/zones_in_opensolaris_livecd_inside
Mon, 2 Jun 2008 10:25:32 +0000SuncontainerslivecdopensolarispkgvirtualboxzonesCan you use the OpenSolaris LiveCD to test features like Solaris Containers? Can you do it even when you run it in a virtual machine? Yes, and it's very helpful, so read on...I've been working with several Sun virtualization techniques - Solaris Containers, Logical Domains,
and the upcoming Sun xVM Server, and recently started using <a href="http://virtualbox.org/">VirtualBox</a>
as one of my primary tools.
As my colleague <a href="http://blogs.sun.com/bobn/">Bob Netherton</a> says "It rawks!".
It's free, runs on Solaris, Linux, Mac OS X, and Windows - and has been downloaded 5 million times.
With a little effort,
described in <a href="http://blogs.sun.com/jimlaurent/entry/importing_solaris_vmdk_image_into">Jim Laurent's blog</a>,
you can even import virtual machine VMDK images from VMware.
<h2>We all live in a virtual machine, virtual machine, virtual machine (to the tune of "Yellow Submarine")</h2>
<p>
I've used VirtualBox to bring up Windows and Ubuntu Linux under Solaris for fun:
<p>
<img src="https://blogs.oracle.com/jsavit/resource/Screenshot-3-sm.png" alt="Three's a crowd">
<br>Three's a crowd: Windows XP, Solaris 10, and OpenSolaris all booted
at the same time underneath OpenSolaris<br>
<br>
<p>However, my usual purpose is to practice using Solaris features and gain experience with ones that are new to me.
Now I have a desktop (an Acer M3100) and laptop (Toshiba Tecra M9)
with enough CPU power and (especially important) RAM to run multiple guest operating systems.
VirtualBox makes it really easy.
It's safe, non-destructive to my normal work environments, and something I can do while multitasking
with normal desktop activity (web, e-mail, presentations, occasionally hacking at a program, listening to music through tinny speakers). And the operating system I boot on the bare iron is OpenSolaris. Life is good.
<h2>Can you use this with OpenSolaris live CD?</h2>
<p>
In this case, I was inspired by another colleague,
<a href="http://blogs.sun.com/JeffV/">Jeff Victor</a> who wondered in an
e-mail whether the OpenSolaris' <a href="http://www.opensolaris.com/get/index.html">Live CD</a>
feature could be used to demonstrate Solaris Containers, aka zones.
You can boot off the OpenSolaris CD
(you can even boot OpenSolaris off a <a href="http://blogs.sun.com/uejio/entry/experience_with_opensolaris_developer_preview">USB stick</a>)
and use it as an OS (eg: fire up a browser and do other work) rather than just as an installer.
That would be helpful, because you could then demonstrate and test zones, or gain experience with them, without having
to fully install Solaris - either on a spare real machine or in a guest.
Just boot off the OpenSolaris CD, and play with some of its features.
Afterwards, either boot the real computer back into your standard environment, or shut the virtual machine down.
<p>
But would it work? Only one way to find out.
I had already installed OpenSolaris in a VirtualBox guest machine, and the guest even has the CD image <b>.iso</b> file
attached to it, so let's fire it up.
<p>
<img src="https://blogs.oracle.com/jsavit/resource/Screenshot-VB-list-sm.png" alt="list of guests">
<br>VirtualBox list of guests
<p> I let it boot off the virtual CD image to load the Live CD.
Soon it gives me a Grub screen that asks me what to do.
<p>
<img src="https://blogs.oracle.com/jsavit/resource/Screenshot-OpenSolaris-Grub-sm.png" alt="OpenSolaris Live CD Grub screen">
<br>OpenSolaris Live CD Grub screen
<p>I don't want to boot up the already-installed OpenSolaris image, so I select the first entry instead of the last.
After a little while I have the graphical desktop up, and I can open a terminal window, and look around.
<p>
<img src="https://blogs.oracle.com/jsavit/resource/Screenshot-OpenSolaris-term1-sm.png" alt="OpenSolaris graphical desktop">
<p>
At this point I can add a zone interactively.
For convenience, I use <code>pfexec</code> to put myself into my preferred shell and work as root
(it's <i>my</i> OS instance and I'll do what I like!), and then start a zone install.
Oops, at first, I tried to add it to a <code>/zones</code> directory
but that failed right away due to running out of space.
Duh. So, let's try that again in a reasonable place - there's space under <code>/jack</code>, the
home directory for the default live CD user login.
<blockquote><pre>
jack@opensolaris:~$ pfexec bash
jack@opensolaris:~# zoneadm list -civ
ID NAME STATUS PATH BRAND IP
0 global running / native shared
jack@opensolaris:~# mkdir /jack/zones
jack@opensolaris:~# zonecfg -z live
live: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:live> create
zonecfg:live> set zonepath=/jack/zones/live
zonecfg:live> add net
zonecfg:live:net> set physical=pcn0
zonecfg:live:net> set address=10.0.2.160
zonecfg:live:net> end
zonecfg:live> verify
zonecfg:live> commit
zonecfg:live> exit
jack@opensolaris:~# zoneadm list -civ
ID NAME STATUS PATH BRAND IP
0 global running / native shared
- live configured /jack/zones/live ipkg shared
</pre></blockquote>
<p>
That seems okay - with an interesting new type of brand. Dan Price describes this in
his <a href="http://blogs.sun.com/dp/category/Solaris">blog</a>.
At this time, zones in OpenSolaris are a little different from those in regular Solaris 10.
Specifically, zones have a new <i>ipkg</i> brand, because the zone commands' native-brand behavior assumes
SVR4 packaging, while OpenSolaris uses the new <a href="http://www.opensolaris.org/os/project/pkg/">Image Packaging System</a>.
OpenSolaris zones are not sparse root, and their contents are obtained from a network repository.
<p>
<blockquote><pre>
jack@opensolaris:~# zoneadm -z live install
WARNING: /jack/zones/live is on a temporary file system.
Image: Preparing at /jack/zones/live/root ... done.
Catalog: Retrieving from http://pkg.opensolaris.org:80/ ... done.
Installing: (output follows)
DOWNLOAD PKGS FILES XFER (MB)
Completed 49/49 7634/7634 206.85/206.85
PHASE ACTIONS
Install Phase 12602/12602
Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=681
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=741
Done: Installation completed in 337.728 seconds.
Next Steps: Boot the zone, then log into the zone console
(zlogin -C) to complete the configuration process
jack@opensolaris:~#
</pre></blockquote>
<p>
I like the warning message - little does it know that any writeable file system mounted to the live CD is transient!
Despite having to fetch its contents from a repository out on the 'net
(and while running in a virtual machine), the whole process took about 5 minutes.
Dan also mentions working to keep the default zone image rather small - and it only took up about 230MB.
<p>
<h2>A zone, inside a VirtualBox virtual machine, under OpenSolaris</h2>
Now let's boot it up and prove the whole thing works. I do that, answer the usual configuration questions, and here I have:
<p>
<blockquote><pre>
live console login: root
Password:
Jun 1 12:43:55 live login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.11 snv_86 January 2008
-bash-3.2# zonename
live
-bash-3.2# ifconfig -a
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
pcn0:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 10.0.2.160 netmask ff000000 broadcast 10.255.255.255
</pre></blockquote>
<p>All seems normal, and after a few minutes I shut the zone down.
<h2>Clone that, will ya?</h2>
<p>Just for fun I cloned that zone: I configured a new zone with <code>zonecfg -z liveclone</code>
and then populated it via <code>zoneadm -z liveclone clone live</code>. Interestingly, the clone took only a moment, despite having to copy the file system. The ramdisk file image is UFS, not ZFS, so it can't use a ZFS clone, but I imagine copying ramdisk-based
filesystems must be fast!
<p>
<h2>Summary</h2>
<p>
The answer is <i>yes</i>: you can demo or learn Solaris Containers, as well as other OpenSolaris features, while running the live CD, and you can do that from within a VirtualBox virtual machine (or under VMware, I imagine). With the new packaging concept, in which the install media is kept smaller and new bits are obtained from a repository, you're dependent on having access to the Internet in order to download the software needed to create the zone. It all happens transparently - I didn't have to issue any commands to do that - but you won't get far without it. If that's not a problem, you can easily bring up OpenSolaris and play - whether in a virtual machine or a real one.
<p>
One other note: for a long time in my prior life I was spoiled by the ability to test out the OS I was working on by bringing it up in a virtual machine. Too many of the things I did in my work life involved changing the behavior of the OS I was working on - and I liked working with a net (and while my users were doing their mission-critical work, instead of on weekends and late night shifts). Having a safe sandbox to try things out, add and test software, and even crash things on purpose was a great aid to productivity and safe computing. It's really great to have that again.
https://blogs.oracle.com/jsavit/entry/how_i_found_out_thatHow I found out that I might not existjsavithttps://blogs.oracle.com/jsavit/entry/how_i_found_out_that
Wed, 28 May 2008 14:44:19 +0000Sunibmlinuxmainframeperformancepricesolarissunz/linuxLast weekend I got an e-mail from an acquaintance of mine who pointed me to an industry web site where my blog (in particular the mainframe discussion) had been linked - but it said I might be a pseudonym! Read on to see how I got my name back :-) and learned a few things in the bargain.<h2>My name in lights - or is it?</h2>
<P>
I got an e-mail from an old acquaintance of mine, David Jones, a z/VM and Linux expert at V/Soft, Inc (<a href="http://www.vsoft-software.com">http://www.vsoft-software.com</a>).
He sent me to the valuable <a href="http://www.tech-news.com/">Technology News</a> website
at <a href="http://www.tech-news.com/">http://www.tech-news.com/</a>, where there was an
article called <a href="http://www.tech-news.com/another/ap200804.html">Bears Turns</a> (For those unfamiliar with it: <a href="http://www.tech-news.com/">Technology News</a> is full of information and insight on mainframes - recommended reading)
<p>
This article covers a lot of ground: financial institutions, the economy, the <a href="http://en.wikipedia.org/wiki/Glass_steagall">Glass-Steagall Act</a>, and IBM (and competitors like Sun) and System z. It says a number of interesting and insightful things (rather than quote them, I suggest you just <a href="http://www.tech-news.com/another/ap200804.html">go there</a>). The article also mentions this blog. Hooray, my name in lights.
<h2>Jeff learns two things</h2>
I now learned that "Savit" in Hindustani means "Sun", so an observer might think my name is a pseudonym. No such luck - I'm really me, for good or for ill.
<p>
The article pointed out a mistake of mine - one I made on purpose. I didn't include the price of mainframe RAM, so I understated the cost of running on mainframes.
That was deliberate: at the time I didn't know the price, so I chose to leave it out of the calculations in my previous blog entries. I would rather err on the side of being conservative than inflate the price of a competitor's product. I think that's fair and ethical.
<h2>Re-work the numbers</h2>
I looked around and got an estimate, so I can revise my previous figures (which the article cited above correctly said understate IBM costs) as follows - a short script after the list reproduces the arithmetic:
<ul>
<li>IBM claimed 26 IFLs would do the work of 760 cores on Sun AMD servers (in their dreams!). Well, 26 times $125,000 is <b>$3,250,000</b>.
<li>I compared to the Sun servers, including RAM and disk, at <b>$451,820</b>. <b><i>IBM costs 7 times as much even using their bogus capacity and utilization assumption</i></b>.
<li>But, the RAM would be about $6K per GB (down from $8K on z9. Still many times higher than we charge on our servers! sun.com says $255 for 4GB on an AMD server; $440 for 4GB on a T5240; $4K for 4GB on an M4000/5000. Different features and prices, but all a lot less than on IBM z). A minimum-configuration z10 model E26 has 64GB of RAM (according to the z10 EC Tech Intro document). So, at minimum, that's an additional cost of $384,000, raising the total cost to $3,634,000. <b>Now IBM is 8x more expensive - using the minimum/cheapest IBM configuration</b>.
<li>I don't think that's remotely enough memory to do the job of 380 machines, so you might want to consider what it would cost with a lot more RAM. This scenario replaces 380GB of RAM with only 64GB of RAM, so I think you'll thrash terribly: OLTP applications are memory intensive, and at the proposed 90% busy you can't count on idle workloads being paged out to disk. If you give it equivalent RAM the cost would be 380GB times $6K,
or $2,280,000. Yikes. <b>Now the System z CPU+memory cost is $5,530,000, 12 times more expensive than Sun.</b>
<li>But I don't believe the premise in the first place: the z doesn't have remotely the compute power needed to do the job, and they fudged the numbers by proposing it replace almost-idle database servers while it ran flat out. That's just silly. If we run our servers at the same utilization IBM proposes for itself, the Sun solution would be $51,127. Now the ratio (fairly computed) is over <b>70 times more expensive on IBM</b>
using the minimum/cheapest RAM configuration, and <b>108 times more expensive on IBM</b> using equivalent RAM.
</ul>
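<p>
For anyone who wants to check the arithmetic, here's a minimal script reproducing the figures
above (all of them are the estimates quoted in this post, not authoritative price-list numbers):
<blockquote><pre>
# Back-of-the-envelope figures from this post; estimates, not price lists.
IFL_PRICE    = 125000
Z_CPU        = 26 * IFL_PRICE          # $3,250,000
Z_RAM_PER_GB = 6000
Z_MIN_RAM    = 64 * Z_RAM_PER_GB       # minimum z10 E26 config: $384,000
Z_EQUIV_RAM  = 380 * Z_RAM_PER_GB      # 1GB per replaced x2100: $2,280,000
SUN_380      = 451820                  # 380 x2100s, IBM's "10% busy" premise
SUN_43       = 51127                   # 43 x2100s when both sides run 90% busy

print("IBM premise, min RAM:   %.1fx" % ((Z_CPU + Z_MIN_RAM) / float(SUN_380)))
print("fair premise, min RAM:  %.1fx" % ((Z_CPU + Z_MIN_RAM) / float(SUN_43)))
print("fair premise, full RAM: %.1fx" % ((Z_CPU + Z_EQUIV_RAM) / float(SUN_43)))
# prints roughly 8.0x, 71.1x, and 108.2x
</pre></blockquote>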
<p>Okay, it's only a game. What difference does it make whether the mainframe is 7x or 8x more expensive (with bogus utilization) or 63x to 108x more expensive (with equivalent utilization)? It's just a heck of a lot more expensive, and you have to rely on the unproven, untested, unverifiable benchmarketing that says it can do the job in the first place.
<p>Question to any enterprising reader: what would be the price for the mainframe disk? I don't know,
and I'd like to make the competition a little more fair by including for IBM what I factored in for Sun. :-)
https://blogs.oracle.com/jsavit/entry/don_t_keep_your_usersDon't Keep Your Users Hostage!jsavithttps://blogs.oracle.com/jsavit/entry/don_t_keep_your_users
Mon, 7 Apr 2008 17:40:44 +0000SunperformanceresponsetimeYears ago I was taught a formative lesson from response time pioneer Walter Doherty. I asked a (slightly silly) question, and was taught the importance of focussing on service levels. See what I learned here.In recent blogs, I've gone on at length about misusing performance numbers to make expensive things look competitive
(just <i>imagine</i> the thought process behind saying "we don't use distributed benchmarks on z"
- as if there was anything "distributed" about database or OLTP applications! -
while trying to convince customers to move their "distributed" applications to z)
and comparing systems deliberately run at high utilization (due to capital costs) with others deliberately not run at high utilization, and chirping "Our systems can run at high utilization!" Sigh.
In this post, however, I want to discuss a formative lesson I got on this subject many years ago. It bears relevance to this, too.
<h2>An ancient conflict</h2>
<p>
About 25 years ago, I was involved in a battle in the long-running war between MVS/TSO and VM/CMS. For people outside the mainframe world this may be incomprehensible (just as mainframers will shake their heads over Unixers waging "BSD vs. SysV" or "vi vs. emacs" wars - for the latter, they'll just say "they both stink!"). I was definitely in the VM camp (which is where I spent much of my career), despite having done internals work in MVS.
My boss asked me to call a famous IBMer, Walter Doherty, and ask him how many more CMS users we could put on a given machine, compared to TSO (both were timesharing environments, TSO being the "time sharing option" of MVS - or, as we VMers called it,
"<b>T</b>he <b>S</b>low <b>O</b>ne"). The idea was to collect expert testimony backing our preferred system.
<h2>The Economic Value of Rapid Response Time</h2>
<p>
I have to interject that Walt Doherty is notable for being among the first to recognise and quantify the importance of good response time in interactive systems. His paper "The Economic Value of Rapid Response Time" (1982, IBM document GE20-0752; you can see it <a href="http://204.146.134.18/devpages/jelliott/evrrt.html">here</a> on Jim Elliott's site) broke new ground in showing that response time with low latency - ideally, close to the limits of human perception at 0.1 seconds - and with little variability, promoted higher productivity for system users, and that this could be measured to be more valuable in financial terms than the cost of the computer assets needed to deliver the response time. This was a revelation in those long-lost times when the relatively small number of people using computers (in the early PC days, and often on oversubscribed minicomputers or mainframes) tolerated long or inconsistent response times.
<h2>Don't keep your users hostage</h2>
<p>
Somehow I managed to get Mr. Doherty's telephone number and posed the question, confident that he would give me a range of ratios between CMS and TSO - something like "between 2.5 and 5 times as many users".
Imagine my surprise when he responded by bluntly saying <b>"Your question is wrong."</b> He went on to say "If all you do is count users without managing the response times and service you're providing for them, you'll just give them bad response and
<b>keep them hostage - you've got their data, and you won't let them get at it</b>."
This was a revelation to me.
<h2>The proper measure of systems</h2>
<p>
The important lesson Doherty taught me (eventually he took pity on me, and told me that indeed, CMS could support
2.5 to 5 times as many users as TSO could at comparable service levels) was that you <b>must</b>
manage to service levels - not merely to how many users you can get to log on to your system
and then condemn to watching the hourglass...
<h2>Thanks to a pioneer</h2>
This post is in part a public thanks to a pioneer. He made a lasting contribution to the field, and he taught me a valuable
lesson. Provide rapid (subsecond) response time to your users when feasible, and with low variability
(uneven response time - sometimes fast, sometimes slow - is very disruptive to users).
And don't measure your systems by "how many users" you can cram onto them.
This lesson influenced me in a large way, and I'm grateful. I'm also adamant on the distinction,
which you may have seen in my recent posts!
<p>
<h2>On a sad note</h2>
<p>
An additional note of respect, for a great pioneer who has just passed.
Bob Tomasulo was the 1997 Eckert-Mauchly Award recipient for the ingenious
algorithm which enabled out-of-order execution processors to be implemented.
Tomasulo's algorithm was first used in the floating point processor of the
IBM 360/91. His work was a great contribution to high performance computing
that influenced many later computer architects.
https://blogs.oracle.com/jsavit/entry/the_ten_percent_solutionThe Ten Percent Solutionjsavithttps://blogs.oracle.com/jsavit/entry/the_ten_percent_solution
Mon, 3 Mar 2008 20:00:10 +0000Sunibmlinuxmainframeperformancesolarissunz/vmz10How do you make a slow, expensive platform look like it's competitive?
This is the problem facing IBM with the announcement of their new System z10 mainframe. Read this article to see through deceptive benchmarketing<b>The Ten Percent Solution</b>
<p>
How do you make a slow, expensive platform look like it's competitive with any other platform?
This can be difficult, especially if you don't have (or don't dare have) public open benchmarks such as
those at <a href="http://www.spec.org/">SPEC</a> (Standard Performance Evaluation Corporation)
or <a href="http://www.tpc.org/">TPC</a> (Transaction Processing Performance Council).
This is the problem facing <a href="http://www.ibm.com">IBM</a> with the announcement of their new
<a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">System z10 mainframe</a>.
How do you make costly, proprietary hardware look like it can compete on a price/performance basis?
<p>
One way is to play tricky games with numbers.
Unfortunately, that's what it seems IBM has done in its z10 announcement. If you look
at <a href="http://www-03.ibm.com/press/us/en/pressrelease/23592.wss">http://www-03.ibm.com/press/us/en/pressrelease/23592.wss</a>
you will see the following text
<blockquote>
The z10 also is the equivalent of nearly 1,500 x86 servers, with up to an 85% smaller footprint, and up to 85% lower energy costs. The new z10 can consolidate x86 software licenses at up to a 30-to-1 ratio. \*3
</blockquote>
and a pointer to footnote 3:
<blockquote>
3 Source: On Line Transaction Processing Relative Processing Estimates (OLTP-RPEs): Derivation of 760 Sun X2100 2.8 Opteron processor cores with average OLTP-RPEs per Ideas International of 3,845 RPEs and available utilization of 10% and 20 RPEs equating to 1 MIPS compared to 26 z10 EC IFLs and an average utilization of 90%.
</blockquote>
<p>
This even comes up on an IBM blog at <a href="http://www-128.ibm.com/developerworks/blogs/page/benchmarking?entry=back_to_the_future_with#comments">http://www-128.ibm.com/developerworks/blogs/page/benchmarking?entry=back_to_the_future_with#comments</a>, where a reader asks how the calculations work, and is told:
<blockquote>
The 30 to 1 claim is based on 760 X2100 cores to 26 z10. The 760 to 26 is based on 3845 RPEs at 10% = 384.5 RPEs is approximately equivalent to the number of z10 RPEs at 90% when you use 20 RPEs equal to 1 MIP where MIPS are based on the LSPR curve for the z10.
</blockquote>
<p>
<b>Games People Play</b>
<p>
When I saw this, I thought "what kind of nonsense is this?" The whole idea of benchmarks is to determine the
capacity of a system under load - not to say "assume 100% of the capital and operational costs of the server,
and then only use 1/10th of it".
That's crazy. Or crazy like a fox.
(Not to mention, there's no such thing as a "MIP". All true mainframe performance guys know that).
<p>
Let's illustrate how this works: they take a consultancy's benchmark (instead of an open one like those at SPEC and TPC),
and then specify that the competitor's machines (that's ours, by the way) should be evaluated at 10% CPU busy,
while they specify that their own machines are 90% busy. With this "just because we say so" trick they use
9 times as much of their machine as they do of ours. Play games like this, and you can make yourself look a lot better than you are.
On top of that, they base their figures on extrapolating an IBM-only benchmark (the "LSPR" referred to) to estimate
the difference between the z9 and z10. That's a really shaky limb to go out on. There's not a lot of science here.
<p>
<b>Work the numbers</b>
<p>
Let's do some math and convert that into prices. Assume for the moment that
the price of a z10 IFL is the same as the $125,000 per IFL on the z9.
It's not easy to tell, since IBM isn't exactly transparent
about pricing and I couldn't find it on the site. So, 26 times $125,000 is $3,250,000.
They say 760 x2100 CPU cores, which I can price on
<a href="http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse?CategoryName=HID1907504065&CategoryDomainName=Sun_NorthAmerica-Sun_Store_US-SunCatalog">sun.com</a>
at $1,189 for 2 cores and 1GB of RAM. I can add 500GB of disk for $359, but let's hold that for a moment.
Now, using IBM's "ten percent solution", 760 cores requires 380 servers at $1,189, for a cost of $451,820.
Hey, that's not bad. <b>Even with the bogus 10% trick, Sun still has a 7 to 1 price/performance advantage
over the IBM z10 mainframe</b>.
Fantastic, huh? Even with that little deceptive trick, the z10 is trailing in the dust.
You <i>did</i> notice that IBM didn't actually put prices on their web pages, even with the 10% trick, right?
<p>
Now let's make it more realistic :-)
<p>
Database servers are just as likely to be 90% CPU busy on one platform as on another - that's just a question of not overprovisioning.
So let's work the numbers <b>fairly</b> so the x2100 systems are no more overprovisioned than the z10.
Instead of 760 cores at 10% busy, that would be 84.4 cores at 90%, which I'll round up to 86. Two cores per server: 43 servers. A rack.
So, a set of Sun x2100s provisioned for actual capacity is 43 servers times $1,189 for a total of $51,127.
<b>That leaves Sun with a 63 to 1 price/performance advantage over the z10</b>.
That's actually pretty consistent with mainframe-to-SPARC and mainframe-to-x64 conversions I've been involved in.
Do you wonder why I got out of the mainframe side of computing after becoming an expert there? This is part of the reason.
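<p>
Here's that arithmetic as a tiny script, so you can check it yourself (prices as quoted above;
the 10% and 90% utilization figures are IBM's own):
<blockquote><pre>
import math

# IBM's premise: 760 x2100 cores at 10% busy equal 26 IFLs at 90% busy.
X2100_PRICE = 1189                    # 2 cores + 1GB RAM, per shop.sun.com
IFL_PRICE   = 125000                  # assuming z10 IFLs price like z9 IFLs

print("26 IFLs:               $%d" % (26 * IFL_PRICE))      # $3,250,000
print("380 x2100s (10%% busy): $%d" % (380 * X2100_PRICE))   # $451,820

# Equalize the assumption: run the x2100s at 90% busy too.
fair_cores   = 760 * 0.10 / 0.90                # 84.4 cores
fair_servers = int(math.ceil(fair_cores / 2))   # 43 two-core servers
print("%d x2100s (90%% busy):   $%d" % (fair_servers, fair_servers * X2100_PRICE))
</pre></blockquote>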
<p>
And it gets worse (for IBM, that is). I didn't include the disk drives, which will be much more
expensive on the mainframe side. Nor did I include the operating systems:
For the mainframe you will need 26 CPU licenses of z/VM, which will be about $20,000 per CPU, for a total of
$520,000. There's a formula for this, but that's close enough for a quick estimate, and excludes subsequent maintenance. We provide Solaris
"right to use" for free, and I assure you our software maintenance is a heck of a lot less.
Plus, on the mainframe, you'll license Linux at a cost of $15,000 or more per CPU
for a one-year license
(see <a href="http://www.novell.com/licensing/price/index.html">Novell</a>
and <a href="https://www.redhat.com/apps/store/server/">Red Hat</a> pricing)
for another $390,000 per year.
<p>
Your first year cost for the IBM solution is $3,250,000 for the server (which by the way, just gave
you a single point of failure!), $520,000 for z/VM, and an annual $390,000 for Linux.
Ours is roughly two orders of magnitude less expensive, starting with $51K for the servers, and it continues that way.
<p>
<b>Advance disclaimer</b>
<p>
It's late and I may have made a few mistakes. This is all
"back of the envelope" calculation. But, when there's orders-of-magnitude differences in platform price,
a few percent error one way or the other really doesn't matter.
If I made a mistake - it's happened! - I apologize for it in advance, and will correct as needed.
<p>
<b>Integrity in what you say</b>
<p>
Let me make this clear: I made my living for years as a mainframe system performance expert.
I ran mission critical, performance-sensitive systems, consulted with major companies,
wrote books, and taught classes on mainframe performance. I stay current.
I've run mainframe Linux, and taught a performance class on that, too.
I know what is a fair comparison and what is "benchmarketing" and trying to game the system.
If the numbers worked out differently, I would report them and let the chips (no pun intended) fall where they may.
<p>
Which reminds me. I Am Not A Lawyer, but I have to imagine that IDEAS (the consultancy whose benchmark IBM used)
specifies contracted terms and conditions that state how their benchmarks can be reported and used.
I expect that there's a contract stipulating that RPE is a proprietary metric
and restricted in use. IANAL, but I would be unsurprised if
IBM used it without IDEAS' permission, and may be in violation of IDEAS copyright or usage terms.
<p>
<b>The moral of the story</b>
<p>
There are several:
<ul>
<li>Use open, standard benchmarks, such as those from SPEC and TPC.
<li>Read and understand what they measure, instead of just accepting them uncritically.
<li>Get the price-tag associated with the system used to run the benchmark.
<li>Relate benchmarks to reality. Nobody buys computers to run Dhrystone.
<li>Don't permit games like "assume the other guy's system is barely loaded while ours is maxed out". That distorts price/performance dishonestly.
<li>Don't compare your brand-new machine to the competitor's 2-year-old machine.
<li><b>Insist</b> that your vendors provide open benchmarks and not just make stuff up.
<li>Be suspicious!
</ul>
https://blogs.oracle.com/jsavit/entry/as_i_was_sayingAs I was sayingjsavithttps://blogs.oracle.com/jsavit/entry/as_i_was_saying
Mon, 3 Mar 2008 17:39:28 +0000Sundomainslogicalsolarisvirtualizatiovirtualizationvmwarexvmz/vmAs I was sayingAs I was saying before I was interrupted... No, nothing as memorable as the (apocryphal?) <a href="http://news.bbc.co.uk/2/hi/uk_news/magazine/5054802.stm">BBC story</a>
or <a href="http://www.amazon.com/Jack-Paar-Was-Saying-More/dp/B00000G3AN">Jack Paar</a>.
(I think that phrase was also memorably used by a British journalist - Peregrine Worsthorne? - but for the life of me I can't remember. Anyone know this?)
I've just been busy with Real Life, and in fact Virtual Life, having a lot of fun with different virtualization technologies.
<p>
Among other things, I've run OpenSolaris and Ubuntu Linux on VMware, Sun xVM Server and VirtualBox on Intel and AMD machines,
and brought up Solaris on Logical Domains on T2000 machines. All great stuff that I plan to relate in upcoming blog entries.
I've even started to bring up OpenSolaris in a virtual machine under z/VM, with the port-in-progress for that platform
(It's really fun to be logged onto a VM system again, although it feels kind of retro now.
I even did some <a href="http://en.wikipedia.org/wiki/Hartmann_pipeline">CMS Pipelines</a> the other day.
It's like getting back on a bicycle.) Anyhow, I plan to blog some "compare and contrast" of hypervisors, old and new, and tips
on how to use them.https://blogs.oracle.com/jsavit/entry/response_to_ibm_sun_andResponse to: IBM, Sun and HP: Comparing UNIX Virtualization Offeringsjsavithttps://blogs.oracle.com/jsavit/entry/response_to_ibm_sun_and
Mon, 26 Feb 2007 22:12:07 +0000SunKen Milberg, a technical editor for IBM Systems Magazine, Open Systems edition, wrote an article comparing and contrasting virtualization technologies from IBM, Sun, and HP. You might not expect an impartial analysis from an IBM publication, but you might hope. Unfortunately, he got a lot of things wrong, both about Sun technology and IBM's. I'll go over some of the mistakes and misleading comments.In his article <a href="http://cl.exactt.net/?ffcb10-fe35157274660d79751073-fde61576726601787c1d7773-ff2513727c6c-fec5157277600c74-fe21177772640d747d1473">
IBM, Sun and HP: Comparing UNIX Virtualization Offerings</a>, Ken Milberg of IBM Systems Magazine makes a number of observations about the different vendors' virtualization products, many of which are wrong.
<p>
Like Jim Laurent, who has already <a href="https://blogs.oracle.com/jimlaurent/entry/response_to_ibm_sun_and">posted observations</a> on this article, I'll confine myself to the comments on Sun technology. Well, actually, I'll make some points about the IBM technology comments he makes too. Somebody better qualified than me on HP products will have to address any possible inaccuracies there.
<p>
The article starts with a brief compliment about Solaris, but then goes into a digression about volume mirroring. As Jim points out, the author could have used ZFS (certainly for the data components of the storage) which he already had mentioned, and "hundreds of commands" sounds like an exaggeration to me as well. Perhaps he could have done something as prosaic as a script? In any case, ZFS sounds like it would have been ideal for easy management of non-root filesystem data, and there are ways to make managing root filesystems a low-pain activity too.
<p>
The author then goes on to talk about Solaris Containers (otherwise called zones), saying <i>"this method requires all partitions have the same OS and patch levels. Their virtualization essentially virtualizes an OS environment more-so than hardware. In fact, they don't emulate any of the underlying hardware."</i> First of all, they're not partitions - a revealing remark that shows an IBM-product frame of mind. Yes, they do virtualize OS abstractions instead of hardware - that's the great thing about them. The preceding posts on this blog have gone into (perhaps tedious) descriptions of how complex and expensive it is to virtualize hardware details. Solaris Containers bypass all that to provide light-weight private computing environments with essentially no overhead at all. It's not the solution for all problems - what is? - but it provides an excellent, efficient way of addressing server sprawl. System p virtualization doesn't. Here's a pop quiz: how many 8GB instances of AIX can you have on a 32GB server? (There's a trick here, obviously, but even the "obvious" answer of 4 isn't a big solution for server sprawl - I'll mention the actual answer some other time).
<p>
He makes several other observations, only some of which are on target: <i>OS level must be exactly the same across all containers</i>. Well, that's not completely true: containers share the same kernel, but you can have different software levels for everything else. More important: that's actually a best practice you want to encourage. If this were the halcyon days of mainframe timesharing with VM/CMS - each user in his or her own virtual machine - you would (or should!) see dozens or hundreds of virtual machines with the same OS level. Otherwise you get a chaos of different patch levels to manage. The small footprint in the Containers model lets you have dozens, hundreds, or even thousands of private environments on the same Solaris server. <i>One kernel fault will bring down every container...There's also limited security isolation as a result of a single kernel across containers. What that means is one breach will impact every container in the OS image.</i> - well, that's only partially true, and somewhat misleading. A kernel breach would affect all environments - just as it would with IBM's z/VM or VMware's products. However, a bad guy getting root privileges in one container gets no access at all to other containers. Nada, zip, zilch, zero. So, his observation is misleading.
<p>
He goes on to say <i>"From a licensing perspective, one must also be aware ISVs will charge on a per CPU basis across all containers in the single image, even though they may need only a part of the OS image capacity"</i>. Well, that varies from ISV to ISV, as Jim Laurent pointed out. His comments on ISV pricing on the basis of the whole system are basically wrong.
<p>
He goes further astray with <i>"Sun containers also can't share I/O, which is not a good thing."</i> He's absolutely wrong here - this is a howler. By default, Solaris Containers are set up in 'sparse root' mode in which most of its file systems are shared. You can mix and match any combination of shared and dedicated storage assets for containers, at a volume, partition, slice, or directory level. Or (and this is a trick I like) set up a loopback filesystem within a file. Solaris Containers have far more flexible I/O sharing. Ditto for non-disk assets of all types. Just RTFM.
<p>
He complains of the management interface: <i>"one must type endlessly from the command line and, when you type so many commands, you can make mistakes. You could use the Solaris Container Manager, though I suspect you may have similar problems that I had with the Solaris Management Console in configuring storage resources."</i> That's wrong on at least two levels: first, you can put most information in configuration files as input to zonecfg and related commands, so you do not have to type endlessly from the command line. In fact, for automation and scripting you want (in my opinion) a CLI and config files, to free you from the tyranny of having to point and click over and over. Moreover, he simply ignores the Sun GUI for containers. It's very odd journalistic behavior to complain about a GUI deficit in the same paragraph where he mentions the GUI and admits he never tried it.
<p>
He goes on to mischaracterize Sun's efforts with Xen. There's no big secret here - the development is clearly there on OpenSolaris.org. That's one of the consequences of being open! He could even download the bits that are available for testing.
<p>
He then gets totally confused with the idea that Windows could run on the SPARC architecture. He works with IBM, so he should be familiar with the idea that there are multiple product lines. Our SPARC systems run Solaris and some Linux distros; our x64/86 systems run Solaris, Linux, and Windows. Not a difficult concept to grasp, and certainly easier than the mix of systems IBM has. VMware, of course, currently runs on x64/86. Xen is emerging on the same platform. Logical Domains will be on the T1000/T2000. That's a pretty rich portfolio providing customer choice on popular platforms.
<p>He goes on to say <i>" they continue to want to be all things to everyone. With up to 5 separate ways to virtualize, they'll continue to confuse most people. IBM has one consistent strategy and method of virtualization. They also have only one hardware platform for their POWER5 technology."</i> The last sentence is a marvel of circular reasoning. There's only one hardware platform for one of their hardware platforms? Well, duh. IBM certainly does not have a consistent virtualization strategy. LPAR on System p is unlike LPAR on System z, which is unlike z/VM on System z (though the latter two at least share some DNA). Unfortunately for IBM, they have no operating system of their own on the most popular computing platform on the earth (Intel/AMD's x64/86) - so they have no virtualization story there either. They used to have a dialect of AIX that ran there, but they killed it long ago.
<p>
Which brings me to Mr. Milberg's closing comments:
<ul>
<li><i>A 39 year history of virtualization, offering a very mature technology.</i> That's totally wrong. The form of virtualization in System p is totally unlike the virtual machine technology from IBM's history. All that trap-and-emulate stuff I described? System p doesn't do it, nor most of the other technical aspects of VM. In fact, the System p style of CPU partitioning was first done by Amdahl (Multiple Domain Facility) before IBM was forced to respond and come out with LPAR.
<li><i>Capped and uncapped partitions. Allowing users to take advantage of unused clock cycles via a shared processor pool is an innovation that no one else has. HP requires a workload manager system similar to PLM, while Sun has nothing.</i> - that's absurd. Sun has dynamic resource pools and the Solaris Resource Manager that both (in different ways) let customers distribute CPU resources with a fine degree of control. More RTFM is needed.
<li><i>SMT - Only IBM has it</i> - Not only does the relevance of this escape me, but it's painting a deficiency as a benefit. Sun has its Chip Multi Threading (CMT), which is a far more advanced processor design.
<li><i>Dedicated or Shared I/O on a virtual partition - Only IBM has it</i>. Very wrong. Solaris Containers provide both.
<li><i>IBM has only one virtualization strategy (APV)</i> - maybe for THIS hardware platform (which seems to have multiple OS strategies - is it to be AIX or Linux now?), but IBM certainly has more than one virtualization strategy. There will be a lot of shocked IBM employees in the Endicott labs who are working on the descendant of the actual 39-year old technology.
<li><i>One hardware platform for AIX and Linux partitions (POWER5)</i> - this actually documents IBM reneging on a promise. Linux, obviously, runs on many platforms (which has little to do with IBM), but some of us remember when AIX was also supposed to run on mainframe (AIX/370, AIX/ESA) and on Intel, but IBM reneged on those promises and canceled those projects. So, sure there's only one hardware platform LEFT for AIX, and a vanishingly small percentage of Linux runs there.
</ul>
<p>
By contrast, Solaris runs both on SPARC - scaling from single-CPU machines to massive supercomputers - and on the Intel/AMD architecture, the world's volume leader, and has outstanding virtualization capabilities on both. Give it a try! Back at Jim Laurent's <a href="https://blogs.oracle.com/jimlaurent/entry/response_to_ibm_sun_and">blog entry</a> on this article, he closes with reasons for trying out Solaris. Don't believe any of us - you can see for yourself.https://blogs.oracle.com/jsavit/entry/two_questions_1_5_answersTwo questions, 1.5 answers...jsavithttps://blogs.oracle.com/jsavit/entry/two_questions_1_5_answers
Fri, 2 Feb 2007 15:08:56 +0000SunTwo questions (and a partial answer) about when LDoms will be available and whether it requires spending money. I can answer <b>part</b> of that!After my last blog entry, I received two questions, which I can partially answer:
<p>
<ul>
<li>"Do you have any information when the LDOM firmware for the T2000 will be made available?"
<br>Well, yes I do, but it hasn't been officially announced to the world, so I don't think I'm allowed to even hint. I'm really sorry, as I'm itching to...
<li>"When will the Logical Domain Manager be available outside of Sun? Is it proper to read it as "a separate free download"?"
<br>For the first part of this: same answer as to the first question. To the second: my understanding is that the logical domains support (firmware, new packages, etc) will all be available as free downloads. What I've heard all along is that if you have a T1000/T2000 server, you do <i>not</i> have to spend any extra money or do a box swap, in order to get logical domains! Download the firmware and new packages, install them, follow the docs we'll make available, and you're off in LDom land.
</ul>
https://blogs.oracle.com/jsavit/entry/and_now_for_something_virtuallyAnd now for something virtually different...jsavithttps://blogs.oracle.com/jsavit/entry/and_now_for_something_virtually
Sat, 27 Jan 2007 09:10:32 +0000SunThe previous entries describe some of the complex, slow and tricky things needed to implement virtual machines the "traditional" way. This entry gives an overview of how Logical Domains work and align with the CMT architecture, and why they are a completely different approach for low-overhead virtualization.Okay, so the title is a homage to Monty Python... But what if we had a completely different approach to providing virtual environments on a computer system? The previous entries on this blog describe the tricky bits - complexity and overhead - involved in creating virtual machines. These problems have been addressed with pretty darned good success through heroic programming, but what if we could avoid some of the issues entirely with a new approach?
<p>
Traditional virtual machines timeslice physical CPUs among multiple virtual machines, intercepting instructions that change system state or do I/O, and emulating them as needed. This is based on the historical design of computer systems where physical CPUs are relatively rare and expensive (hence must be time-multiplexed), and that state-changing events for one virtual machine must not affect others (hence must run without full machine access privileges, and require trapping and emulation of such functions). As I've been outlining, this is complicated and expensive. Even simple timeslicing between virtual machines can cost hundreds of clock cycles, because cache and TLB contents have to be discarded.
<P>
The T1 chip in Sun's "Niagara"-based systems (T1000, T2000, and others to come) turns the assumption of expensive/rare CPUs upside down. This processor's Chip Multi Threading (CMT) design provides up to 32 logical CPUs ("strands") in a 1 or 2 rack unit, low-cost server. Now, CPU strands are plentiful and cheap. Instead of timeslicing a few CPUs between VMs, just give each virtual machine one or more dedicated logical CPUs for its own use. That is the basis of Logical Domains (LDoms): every domain has its own assigned CPUs (roughly 3% granularity of the entire box CPU count) which can be dynamically added to or removed from a Solaris instance. Each domain also has its complement of disk, network, and cryptographic assets. Everything is assigned by a <i>control domain</i>, and virtual network and disk I/O is provided by bridged access through <i>service domains</i>.
<p>
This gives us several important benefits right away: since each domain has its own logical CPUs, it can change its state (such as enable or disable interrupts) without having to cause a trap and emulation. After all, it owns the CPU and its interrupt mask all by itself. That can save thousands of context switches per second. Second, since each CPU strand has its own private context in hardware, the T1000/T2000 can switch between domains in a single clock cycle, not the several hundred needed for most virtual machines.
<p>
Such a strand switch typically happens when a domain references memory that is not currently in cache. Fetching contents from RAM to the processor (on all vendors' processors, not just this one!) can take many clock cycles, during which a logical CPU stalls execution of the single instruction causing the cache miss. On most existing CPUs this is dead time - but by switching to another CPU strand on the same physical core, the T1000/T2000 lets another logical CPU continue instruction processing during it. This is the essence of CMT's "Throughput Computing" that makes the T1 chip so powerful.
<p>
Next time, some more information on how LDoms works and is used.https://blogs.oracle.com/jsavit/entry/trickiest_virtual_bitsTrickiest (virtual) bits - more on why virtual machines are hard to do welljsavithttps://blogs.oracle.com/jsavit/entry/trickiest_virtual_bits
Sun, 24 Dec 2006 14:37:55 +0000SunSome parts of making virtual machines work are especially tricky. Read here for discussion of several of the most difficult aspects of virtualization, and how they are handled on different architectures.Some parts of implementing virtual machines seem to always be difficult, even across completely different computer architectures. These are worth a little investigation:
<ul>
<li>Privileged operations in general
<li>I/O instructions
<li>Timer management
<li>Sharing CPU, and going to sleep the right way
<li>Virtual memory management
</ul>
<h2>Privileged operations in general</h2>
<p>We've already discussed the general issue: privileged operations executed from a virtual machine's OS generally require a trap to the hypervisor followed by emulation, at a cost of at least two context switches, which burns instruction counts and displaces cache and TLB contents.
<p>
What I didn't mention previously is that some architectures (x86 in particular) are not well suited for virtualization because there are instructions that <i>should</i> be intercepted by the hypervisor, but actually fail silently when run outside the right mode (aka ring). For example, the POPF instruction controls whether interrupts are masked off or enabled, but its change to the interrupt flag is silently ignored when not in supervisor state. An OS can't run properly if interrupts are enabled when it needs them disabled, or vice-versa, so this is a problem that has to be solved. VMware handles this cleverly by scanning for offending binary code sequences in the virtual machine and replacing them so the guest traps out when this type of intervention is needed. Xen handles this by modifying the guest OS so such instruction sequences aren't needed in the first place - an elegant alternative when OS source code is available. z/VM and LPAR on the mainframe don't have this particular problem, and also address the cost of guest privileged operations by pushing some of the work into microcode - it still has a cost, but less than when processed in software. VM's CMS timesharing environment reduces privop handling costs by using paravirtualization for many of its operations - the word "paravirtualization" is new, I suppose, but the technique has been around since the 1970s.
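<p>
To make the POPF hole concrete, here's a tiny user-mode experiment of my own (a sketch for x86 with gcc, not production code): it tries to clear the interrupt flag with POPF and shows that the CPU neither honors the change nor traps - exactly the silent failure that defeats pure trap-and-emulate.
<pre>
#include &lt;stdio.h&gt;

#define IF_BIT 0x200UL  /* bit 9 of EFLAGS/RFLAGS: the interrupt-enable flag */

static unsigned long read_flags(void) {
    unsigned long f;
    __asm__ volatile("pushf\n\tpop %0" : "=r"(f));
    return f;
}

int main(void) {
    unsigned long f = read_flags();
    printf("IF before POPF attempt: %lu\n", (f & IF_BIT) ? 1UL : 0UL);

    /* Try to mask interrupts by clearing IF with POPF.  At user privilege
     * the CPU silently discards the IF change instead of trapping, so a
     * trap-and-emulate hypervisor never gets a chance to intervene. */
    f &= ~IF_BIT;
    __asm__ volatile("push %0\n\tpopf" : : "r"(f) : "cc");

    printf("IF after POPF attempt:  %lu\n",
           (read_flags() & IF_BIT) ? 1UL : 0UL);   /* still 1 */
    return 0;
}
</pre>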
<h2>I/O instructions</h2>
Well, that's nice for the general question of privileged operations, and the above especially addresses the context-switch aspects, but some privileged instructions have such complex semantics that they are always problematic. Enabling or disabling interrupts is just flipping a bit in a descriptor - converting a virtual I/O operation into a real one is a lot more involved.
<p>
Consider the flow of doing I/O from a virtual machine:
<ul>
<li>Guest application performs an I/O operation. The system call might itself cause an intercept that traps to the hypervisor, which then hands it back to the guest OS.
<li>Depending on the OS, the application's buffers may or may not yet be pinned (fixed) into memory where they can be used for DMA transfers, so the guest OS may have to copy application buffers to kernel RAM or otherwise lock virtual memory pages into RAM for the duration of the I/O operation.
<li>The OS issues the physical I/O instruction for that computer's architecture.
<li>The hypervisor intercepts the I/O instruction (another context switch)
<li>The I/O instruction is checked for validity: make sure disk seeks don't go outside the bounds of a virtual disk, buffers map into the address space of the guest, and I/O is to a device present in the guest's virtual configuration. This takes some work.
<li>Guest virtual memory is pinned into real memory, possibly even copied from disk if those locations were paged/swapped out.
<li>The I/O instruction is issued on the real hardware, with the buffer and I/O device addresses mapped onto the real device. If the I/O is bridged by software in the hypervisor (for example, a virtual disk or network provided by the hypervisor) then an even more complex flow of events starts, and the I/O is converted into a request to a virtual device service.
<li>Eventually the I/O completes, and interrupt status and buffers are returned to the guest OS.
</ul>
The point here is that all of this takes a lot of CPU processing and involves a lot of complexity. This is an area in which performance takes a substantial hit (especially with bridged I/O), and it is often a source of errors. Mapping the idiosyncratic behavior of many I/O devices and buses back to a virtual machine is tricky business.
<h2>Timer management</h2>
Often overlooked, timers can be difficult to maintain accurately. Consider an OS written with the assumption that it owns the real machine and that real ("wall clock") time proceeds in a uniform manner, possibly driven by a hardware-provided metronome to keep time. An OS might expect a clock interrupt every 10ms and suitably adjust its time-of-day clock. Well, when running as a guest, an OS has no way of knowing that in between two of its instructions, the hypervisor chose to run another guest for a few milliseconds. Wall clock time proceeds while "on CPU" time doesn't, so the time-of-day clock in the guest gets further and further away from actual time. The lack of repeatable timing can be solved, but this is architecture dependent, and clock skew can be alarming. One solution is shown on mainframes, which have architected instructions for wall-clock and CPU time. There is still clock skew (the hypervisor might not deliver a simulated clock cycle exactly on time, for example), but it's not as tough a problem as in some other places.
<p>
Actually, simply handling clock interrupts can be a cause of massive overhead. Consider a Linux guest on VMware or z/VM that receives a timer interrupt every 10ms, as was the case in 2.4 kernels (this is the so-called "jiffy" event). A dedicated PC can easily handle 100 clock interrupts per second, but this can snowball in virtual machines. Imagine if you have 1,000 guests each taking 100 clock interrupts per second (with a pair of context switches for each one). Experience several years ago on mainframe Linux, where people were trying to drive very high numbers of guests to compensate for the platform's price, showed that large numbers of Linux guests could saturate expensive systems just processing jiffies, even when they were otherwise idle! Yes, even idle Linux guests had enough overhead to swamp CPUs. Some people tested these systems with the HZ value changed to make the clock interrupts far less frequent - but then the guests became unresponsive to events. There eventually was a fix - the Linux kernel was changed to use a different time-keeping algorithm on this architecture - but this was only the most crippling timer event that had to be muzzled.
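<p>
To put rough, illustrative (not measured) numbers on it: 1,000 guests times 100 clock interrupts per second is 100,000 hypervisor intercepts per second, each with its pair of context switches costing thousands of cycles - on the order of several hundred million clock cycles per second burned on timekeeping alone, even if every guest is otherwise idle.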
<h2>Sharing CPU, and going to sleep the right way</h2>
Any OS that is trained to believe that all the CPU cycles on a box belong to it might be an inhospitable guest, soaking up all the CPU it can with background activities. A generally unsolved problem (that is, one not solved to anyone's particular satisfaction) is how to handle CPU priorities when a high priority guest (say, one running your database or transaction processor) needs to run some low priority work (maintenance tasks like RPM management on Linux, backups, and other tasks that can run at low priority).
<p>One of two things typically happens: either the virtual machine's priority is so high that it starves other guests of the chance to run their work (even when it is only running non-critical work), or it absorbs so much capacity running the low priority work that it has used up its share, and has nothing left when it needs to run its serious work again. A good solution to this was designed in the late 1970s by Robert Cowles, then of Cornell University, and now at SLAC. He wrote an interface by which a guest signaled to the hypervisor the importance of the work it was about to do, and the hypervisor ran it at a lower priority, relative to the other guests, if it was low priority work from a high priority guest. This was done as open source modifications - I don't know of any commercial offerings that adopted this architecture.
<p>The worst case is when operating systems run "idle loops" when they have nothing to do. On real machines this makes some sense, because there is nothing else on the machine - but in a virtual machine it means that CPU cycles are burnt by idle guests when there is real work to be done elsewhere on the machine. The only answer for this is for guest operating systems to yield control and enter a wait state, enabled for interrupts that signal new work has arrived.
<h2>Virtual memory management</h2>
Finally, there's the problem of memory management. Systems like VMware and z/VM let you overcommit RAM. That is, the size of installed RAM can be much smaller than the combined memory sizes of the virtual machines that are running: working sets are kept resident in memory, while unused pages are saved on disk and fetched as needed.
<p>This is the same thing that virtual memory managers do in standard OSes, with two important exceptions. First, operating systems have crummy locality of reference. They are designed to use all the RAM available on the box they run on - for buffers, for application binaries, and so on. This makes complete sense when run on the real machine (otherwise the RAM would be wasted), but is counterproductive when there may be dozens of virtual machines all contending for the same RAM. In general, if you tell a guest OS that it has a virtual machine size of 512MB, it will use pretty close to 512MB.
<p>The second exception is that nested memory managers don't play nicely together. The usual idea is to attempt to provide an "LRU" (Least Recently Used) policy, where the least recently used ("oldest") memory is displaced onto swap or page disk when there's a real memory shortage for other virtual memory locations. But when a virtual memory system (the guest OS, whether it be Linux, z/OS, Solaris, or Windows) runs under a hypervisor that is also a virtual memory system (VMware, z/VM, etc.), you can get double paging: the guest needs to page or swap out to disk from RAM in order to make room for a new request, but the hypervisor has already paged out that location. So, you need to do a page read just to get the contents that will be written out again and overwritten in memory. Ow.
<p>VMware addresses the general problem via a balloon technique: the hypervisor tells the guest when it's under memory pressure, and the guest obligingly uses less (the balloon is a buffer of RAM that isn't actually referred to) and reduces its memory footprint. This, and a similar feature from IBM for mainframe VM ("cooperative memory management"), help reduce the pain. What people do in practice is buy additional memory to prevent these problems in the first place, but unfortunately that raises the cost of virtualization solutions.
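<p>
As an illustration of the balloon idea, here's a toy guest-side sketch of my own (not VMware's or IBM's actual driver code): inflating the balloon takes pages away from the guest's own memory manager so the hypervisor could reuse them elsewhere.
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Guest-side balloon sketch: "inflating" grabs pages the guest's own
 * memory manager then cannot use for anything else.  A real driver
 * would pin the pages and hand their frame numbers to the hypervisor
 * so other guests could use the underlying RAM. */
static char *balloon;
static size_t balloon_bytes;

static void inflate(size_t bytes) {
    balloon = malloc(bytes);
    if (balloon == NULL) return;
    memset(balloon, 0, bytes);   /* touch the pages so they're really backed */
    balloon_bytes = bytes;
    printf("balloon inflated by %zu MB\n", bytes / (1024 * 1024));
}

static void deflate(void) {
    free(balloon);
    balloon = NULL;
    printf("balloon deflated; %zu MB returned to the guest\n",
           balloon_bytes / (1024 * 1024));
    balloon_bytes = 0;
}

int main(void) {
    inflate(64UL * 1024 * 1024);  /* host signaled pressure: give back 64 MB */
    /* ... the guest now runs with a smaller effective footprint ... */
    deflate();                    /* pressure eased: RAM goes back */
    return 0;
}
</pre>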
<h2>It's hard out there</h2>
I hope that this closing entry for the year demonstrates why doing virtual machines is tricky business, with lots of technical problems that must be solved to make them work at all, and many more to make them work efficiently. Next year, I'll talk about why it frequently doesn't matter, and about some completely different approaches that make some of the problems simply disappear.
<p>
In the meantime, happy holidays and New Year to all!https://blogs.oracle.com/jsavit/entry/why_virtual_machines_are_hardWhy virtual machines are hardjsavithttps://blogs.oracle.com/jsavit/entry/why_virtual_machines_are_hard
Tue, 5 Dec 2006 20:06:31 +0000SunIn my entry "Virtual history", I mentioned that some aspects of virtual machines were hard to do efficiently, and then hand-waved over the details. I'd like to spend a few minutes (it's late...) mentioning just a few of the things to be considered. Let's delve into that some now.Let's review the issues of system state. The hypervisor ("host") runs in the processor's supervisor state (sometimes referred to as "ring 0" or other jargon) in which it has unrestricted access to the architected instruction set of that platform (be it x86, System/370, etcetera). Virtual machines ("guests") run in <i>user</i> or <i>problem</i> state, in which only a subset of instructions is permitted. In particular, instructions that alter address translation (virtual memory mapping), perform I/O, or enable or disable interrupts are forbidden, and generate an exception or trap when executed.
<p>
So, let's review what happens when a guest is running: it goes about its business until it executes one of these supervisor instructions. This generates a trap (in some contexts called an <i>intercept</i> or <i>privop exception</i>) that causes a context switch to the hypervisor. The hypervisor saves the current state, and then inspects the "virtual" state of the guest system: if the guest was running its operating system and was in virtual supervisor state, the privop is emulated (more on that later), and eventually control is returned to the virtual machine (if it's still the highest priority non-blocked virtual machine).
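<p>
To make that flow concrete, here's a toy dispatch loop of my own, using a made-up instruction set (no real hypervisor is anywhere near this simple): the guest runs until it hits a privileged op, which traps to the host for emulation, and control then returns to the guest.
<pre>
#include &lt;stdio.h&gt;

/* A made-up four-instruction "architecture", for illustration only. */
typedef enum { OP_ADD, OP_ENABLE_INTS, OP_DISABLE_INTS, OP_HALT } opcode_t;

typedef struct {
    int pc;            /* guest's next-instruction location         */
    int ints_enabled;  /* the guest's *virtual* interrupt mask      */
    int supervisor;    /* is the guest in virtual supervisor state? */
} vcpu_t;

/* Host side: emulate a privileged op after the trap. */
static void emulate_privop(vcpu_t *g, opcode_t op) {
    if (!g->supervisor) {
        /* A guest *application* issued the privop: reflect a program
         * exception into the guest, just as a real machine would. */
        printf("  reflecting program exception to guest OS\n");
        return;
    }
    if (op == OP_ENABLE_INTS)  g->ints_enabled = 1;
    if (op == OP_DISABLE_INTS) g->ints_enabled = 0;
}

int main(void) {
    opcode_t program[] = { OP_ADD, OP_DISABLE_INTS, OP_ADD,
                           OP_ENABLE_INTS, OP_HALT };
    vcpu_t guest = { 0, 1, 1 };  /* pc=0, interrupts on, supervisor state */

    for (;;) {
        opcode_t op = program[guest.pc++];
        if (op == OP_HALT) break;
        if (op == OP_ADD) continue;      /* unprivileged: runs natively */
        /* Privileged op: hardware traps, context-switching to the host... */
        printf("trap at pc=%d; host emulates privop\n", guest.pc - 1);
        emulate_privop(&guest, op);
        /* ...and a second context switch resumes the guest. */
    }
    printf("guest's virtual interrupt mask ends up: %d\n", guest.ints_enabled);
    return 0;
}
</pre>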
<p>
If the guest was running an application (say, a Linux or Solaris virtual machine under VMware running a user-land process), then this is a program exception that must be handled by the guest operating system, just as would happen on a real (non-virtual) machine if an application issued such an instruction. The exception is <i>reflected</i> to the guest virtual machine: its registers and next-instruction location are set up exactly as would be the case if a misbehaving application did this when no virtual machines are involved.
<p>
Every time a virtual machine executes a privileged operation, then, we have at least two context switches to deal with: switching from guest to host, and then back again. Several hundred or several thousand instructions are executed, and the clock cycle cost is higher still: we must discard translation lookaside buffer (TLB) entries, as the two environments operate with different virtual-to-real memory mappings, and we probably touch enough code and data operands to displace the L1 cache contents (and perhaps L2 cache as well).
<p>
This describes the simple cost of state switching, typically thousands of clocks per intercept, exclusive of the CPU time the hypervisor spends emulating the guest's instructions. This process happens every time the guest does I/O, context switches between its processes, sets or responds to a timer or I/O interrupt, or one of its applications executes a system call. Thousands of clock cycles spent each time, at a frequency that can add up to thousands of times per second.
<p>
Next time: the three levels of memory needed when a virtual memory guest runs under a hypervisor, and the magic of the shadow page table. G'nite. https://blogs.oracle.com/jsavit/entry/a_slight_digression_with_logicalA slight digression with Logical Domains and Apachejsavithttps://blogs.oracle.com/jsavit/entry/a_slight_digression_with_logical
Thu, 16 Nov 2006 18:51:56 +0000SunBefore I carry on describing architectures for virtual machines and other forms of virtualization, a related anecdote about driving the Apache web server under Logical Domains (LDoms)Before I carry on describing architectures for virtual machines and other forms of virtualization, a related anecdote:
<p>
Right now, I'm at a Sun internal conference for USA technical staff, and I presented sessions on virtualization, with a demo of the Logical Domains (LDoms) capability coming soon on our T1000 and T2000 servers. In some ways they use the opposite method from traditional hypervisors, which work by multiplexing a CPU via time-slicing among virtual machines. That's something I'll discuss later.
<p>
For my session, I wanted to demonstrate Apache in one domain being driven by a client program in another domain - a simple test that demonstrates inter-domain network connectivity. For this demo, I used 'ab' - the Apache benchmark tool that lets you repeatedly hammer a specified URL <i>N</i> times, with <i>C</i> concurrent requests. I've used this before, as Apache is easy to set up (of course, 'ab' can be used with other web servers) and is pre-installed with Solaris 10. I used 'perfbar' and 'cpubar' tools to draw nice CPU utilization charts in graphical format, since eye-candy is fun for everyone, and these tools show system loads very dramatically in real time.
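<p>
For reference, the sort of invocation I used looks something like this (the hostname, URL and counts here are just examples):
<pre>
$ ab -n 10000 -c 50 http://webserver-domain/index.html
</pre>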
<p>
What happened when I tested this surprised me: I was getting really high latencies (and hence, low hits/second), yet at the same time the CPU utilization in both the client instance of Solaris and the Solaris running Apache was extremely low. If anything was doing work, it was the service domain providing virtualized disks, which showed a little bit of CPU activity. Hmmm, why was this happening? The logical domain running the web server looked almost completely idle!
<p>
I suppose I could have used DTrace to find out where the applications were spending their time - there are a number of scripts that would have been helpful here, such as running profiles and seeing scheduler state, system calls, or stacks. Instead, I resorted to traditional, low-tech methods. What solved the problem was the Apache error log, which had zillions of error messages (warnings, actually) about an unreachable network. Well, that didn't make any sense to me (the web hits actually did proceed, even if not at the speed I expected).
<p>
I then did something that has become a habit: I cut and pasted the error message into Google (why look up an error message myself when it can look it up - with commentary - for me?). Lo and behold, there were several hits on this, including one I found in under a minute that essentially said "In <i>httpd.conf</i> replace <i>Listen 80</i> with <i>Listen 0.0.0.0:80</i>".
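<p>
In other words, the suggested change to <i>httpd.conf</i> was simply:
<pre>
# before
Listen 80

# after
Listen 0.0.0.0:80
</pre>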
<p>
That really shouldn't be necessary, but let's give it a try. I did, and all of a sudden the test took off: thousands of web hits flashed by, and perfbar and cpubar showed pretty CPU load charts in the logical domains they were supposed to (primarily, in the server's domain). Appropriate oohs and aahs from the audience - very gratifying. I then went to another instance of Solaris with an unmodified httpd.conf, and was able to reproduce the original effect - which pretty much showed that the rate of web hits was gated by the rate at which Apache could write error messages to disk! This also explained the slight CPU load in the service domain, as it was doing the disk writes.
<p>
So, a nice little demonstration worked, and illustrated a few more points than I originally had planned for it by showing low-latency, high throughput web serving across domains (the web server is also available off that server box, of course, so I wrote a tiny CGI script that displays the number of CPUs the web server thinks it has), and the effects of disk writes when network access was expected. Logical domains can be expanded or shrunk on the fly by simple commands, so I removed virtual CPUs from the web server domain, re-ran the test (remaining CPUs get saturated), put the virtual CPUs back (the web server now has ample CPU capacity and returns web hits faster). A nice example of changing resource distribution under logical domains.
<p>
Now, of course, I have a little exploration to do when I have a few free minutes. Why does Apache require this (supposedly unnecessary) alteration to its <i>Listen</i> directive? Does this happen only with an 'ab' command, or does it occur from a real browser as well? Does this happen in non-LDom situations (ditto with Solaris Containers versus the global zone, for symmetry)? That will be easy enough to determine in a test, and I will probably use DTrace to dig deeper to see what Apache binds to with both the original and altered <i>Listen</i>. There might be a bug to file somewhere, but first I need to find out who is doing what. That sounds like a fun exercise to do when I get home. https://blogs.oracle.com/jsavit/entry/virtual_historyVirtual historyjsavithttps://blogs.oracle.com/jsavit/entry/virtual_history
Sat, 11 Nov 2006 19:44:53 +0000SunA little historical perspective on different forms of virtualization, leading (eventually) to a contrast with Logical Domains and Solaris Containers (zones)Well, I've been very remiss about posting to the blog. That's very bad of me, but in my defense I have to say I've been working with the stuff I'll be writing about - specifically, Logical Domains (LDoms) and Solaris Containers (Zones) - along with my day job.
<p>
First, some background on virtual machines Old School Style - "trap and emulate".
<p>
<b>In the beginning</b>, say about 1966, virtual machines emerged on one of the IBM mainframes of the time (the 360/67), with a hypervisor called <b>CP/67</b>, which became VM/370 and the successor virtual machine products for mainframes (the definitive history of this product line is Melinda Varian's <a href="http://www.princeton.edu/~melinda/">VM and the VM Community: Past, Present, and Future</a>). This family of hypervisors - also called virtual machine monitors (VMMs) - influenced other product lines, such as VMware on x64/x86.
<p>
These systems use their platforms' architectural support for <i>privileged state</i> (OS level) and <i>unprivileged state</i> (application program level) to let a hypervisor, also called a <i>host</i>, run multiple virtual machines (called <i>guests</i>). The hypervisor time-slices between guests just as a conventional multiprogramming OS time-slices between processes. The difference is that operating systems execute privileged instructions - things that change memory mapping for virtual memory, perform I/O, and so forth. Since the guests run in unprivileged state, such an instruction causes a program exception - a trap - which causes a context switch to the hypervisor. The hypervisor then figures out what the guest was trying to do (issue I/O, enable or disable interrupt masks, flush cache, switch address spaces, whatever), and then emulates that instruction on behalf of the guest. Periodically an event happens on the real machine that the hypervisor "reflects" to the guest: a timer or I/O interrupt, for example. All this makes it possible to give the guest OS the illusion that it has its own machine. Neat, huh?
<p>
Well, yes, this is neat - very elegant, in fact - but I've just hand-waved over tremendous complexity that is resolved only via Heroic Tricky Programming in each hypervisor. Before I discuss some of these complexities, I'm going to mention a few of the constraints that shaped how these traditional hypervisors work. First and foremost: CPU power was limited, and highly granular sharing was mandatory. Most machines were uniprocessors (SMP systems come into the picture much later), so the only way to meaningfully run virtual machines was to time-slice between them, even if the cost of a context switch turned out to be high. The second major constraint was that RAM was also scarce, and the only way to meaningfully fit virtual machines into RAM was to demand-page them, keeping only their working sets in RAM with other parts paged out (swapped out) to disk. This turns out to be problematic too, as operating systems have very poor locality of reference (why should they have good locality of reference? They're written under the assumption that all of the RAM visible to them should be used), and thus working sets tend towards the virtual machine's defined memory size. An honorable exception is the timesharing shell, VM/CMS: a single-user (per virtual machine) simple OS, it has a small memory footprint, and benefits from the fact that pages can be recycled if necessary during user think times between interactions.
<p>
So, what did that lead to? It defined families of virtual machine systems in which operating systems were time-sliced on a small number of CPUs ("1" not being too small a number), with actively used pages maintained in RAM via demand paging (I'm simplifying here a bit, but not in a way that distorts how this works). When the guest issues a privileged operation ("privop") the hardware traps out to the hypervisor: trap and emulate.
<p>
Well, that's what we had for many years, and it worked pretty well. However, it also could incur tremendous overhead. In my next installment I'm going to discuss some of the difficult things I blithely dismissed a few sentences ago:
<ul>
<li>context switch overhead between virtual machines, and between processes in a virtual machine.
<li>3rd level storage addressing and shadow page tables
<li>the difficulty of emulating privileged operations in general, and I/O emulation in particular. Plus: the specific problem of trapping privileged operations in x64 and x86 architecture.
<li>the problem of having nested CPU and memory resource managers that don't talk to one another.
</ul>
<p>
These are very difficult issues to solve well, as I'll describe. In fact, some issues now encountered in the VMware world are well known to veterans of the VM/370 world, and never satisfactorily solved in either (or solved better then than now: one VM expert wryly commented to me that in reinventing the wheel some solutions aren't quite as round as they should be!). Now that the constraining assumptions on CPU and memory resources have been overtaken by Moore's Law, there may be a completely different way of handling the same question. I'll discuss that in an installment or two from now.
https://blogs.oracle.com/jsavit/entry/it_s_a_virtual_worldIt's a virtual world out therejsavithttps://blogs.oracle.com/jsavit/entry/it_s_a_virtual_world
Thu, 26 Oct 2006 13:17:02 +0000SunI've finally been tempted to start posting blog entries. Yet Another Form Of Electronically Mediated Communication, but a good one.
Most of my posts - at least, the computer-related ones - will be related to server virtualization. These are interesting times, as a formerly niche technology has become mainstream. As somebody who happily lived for a long time in that niche (mostly in the context of mainframe VM) it's interesting to see a large number of people discover how useful virtualization can be, along with some of the pitfalls you can run into. In a world where there are virtual machines or partitioning technologies on almost every platform (VMware, Xen, Solaris Containers, LDoms, xPars, Virtual Iron and others), you can see people rediscovering issues like nested memory managers (hint: LRU no longer works the way you think), virtual timer management, and other aspects of virtualized environments.