So, I use OpenSolaris as the native operating system on most of my boxes; other OSs run in VMs. My choice of OpenSolaris was driven by the availability of DTrace, one of the greatest tools for system and program analysis ever created. By running OpenSolaris I also got ZFS, Oracle's über file system. I never really cared about ZFS, at least not until I missed it. ZFS integrates all the different storage layers in one system - RAID controller, logical volume manager, POSIX file system layer, ... Having that integrated is really nice and eases management. Now I don't change my disks that often and the file system silently runs underneath. From time to time I looked into my auto snapshots to restore some stuff, and I got used to snapshotting my VMs (running on ZFS-powered "virtual" zvol devices) before updating them - which over time became a habit I didn't really think about.

Then I got myself a netbook, a cheap but current ASUS EeePC. On that system I chose to install Ubuntu - which was troublesome enough (I had to compile my own manually patched wireless driver) that I didn't bother to try OpenSolaris. It works like a charm, even without ZFS. Some time after I configured the netbook a new Ubuntu release came out, and since then I've been in trouble: I read on too many sites that things broke with this release, so I don't dare to update the system. On my OpenSolaris boxes updating to a new release, even to a dev build, is a no-brainer: the packaging system automatically creates a ZFS snapshot and configures the boot loader so that the old as well as the new system can be booted. So I can click the update button, reboot, and either it works (the typical case) or I can revert. Really nice.

Now back to Ubuntu: If I press the button and something goes wrong I have to reinstall the system (or use a backup) which I don't want. I just want to use the netbook as a mobile browser, presentation system etc. There are other systems I use to play/experiment with...

At the recent PHP Barcamp Salzburg we got into a discussion about ZFS, too. There was talk about the auto snapshotting, and one claim was "well, I won't need it, I have everything in version control and I know what I delete". That might be true, but once you have ZFS you change the way you operate - and you don't have the whole system in version control. It's great to be able to clone a VM in less than a second to play with some stuff. It's cool to be able to enable compression with one short shell command. It's fantastic to have a fully checksummed file system with RAID-Z. How did we ever live in the old days? It's nice to be aware of the luxury I'm used to.
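Those operations really are one-liners. A minimal sketch, assuming a pool named tank with hypothetical dataset names:

```shell
# Snapshot a VM's zvol before an update (instantaneous, copy-on-write)
zfs snapshot tank/vms/webvm@before-update

# If the update goes wrong, revert to the snapshot:
zfs rollback tank/vms/webvm@before-update

# Enabling compression really is one short command:
zfs set compression=on tank/data
```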

P.S. This blog is running on ZFS, too - of course. It gave a good feeling to be able to revert during today's update, too.

Ok, so this site (and some other stuff) is now running on OpenSolaris. The previous article was mostly a test entry for me to see whether the DNS update had propagated, but as some people wonder why I'm using this system that "fails while trying to copy Linux", I decided to discuss some of the reasons in more detail.

Some people already know that my main system meanwhile runs OpenSolaris. The reason there is DTrace - a great way to see what the system is doing, from the kernel, over userspace programs, into a VM like the JVM or PHP's Zend VM - which is a big help while debugging and developing applications. Even though DTrace is meant for such analysis on live machines, it wasn't the main reason for this choice on the server. For the server I actually didn't plan a change; ok, the old Linux box wasn't maintained well, but it worked well enough for the few things it does. But then David came along with the idea to share a server, so I started thinking about dropping the old contract and getting a new machine for us both - and possibly some other friends. And there we find the actual reasons for the OS choice:

Zones

So we were planning to share a box. As both of us do Web/PHP-related work, it was clear that each of us might need special versions and configurations of some software components which would then conflict with each other. Additionally I want to be able to do a killall apache in case I configured something wrong, and I don't want the others to be affected too much while I configure my web servers the way I need them. The obvious solution these days? - Virtualization.

Now virtualization comes in many flavors. The simple one most people know is desktop virtualization: you take software like VirtualBox, which runs as a regular userspace application and holds a complete operating system stack. In there one has a kernel of the virtualized system which thinks it's running directly on physical hardware. The big benefit is that one can run any operating system in the VM, but there are negative effects in areas like disk buffers (the virtualized and the host kernel buffer independently), overall process scheduling (the VM is scheduled by the host and then schedules itself again), or syscalls (an application running in the VM does a syscall to the VM's kernel, which then calls a hypervisor-provided hardware emulation function, which then triggers a syscall on the host system).

Another approach is operating system virtualization like Solaris Zones. Here the operating system itself handles the virtualization. With zones this works in a way where one has a single kernel and multiple userland instances. So there is one kernel with one scheduler (ok, Solaris allows using different schedulers and so on - let's ignore this and look at the default) and one disk I/O layer. Inside a zone one has a zone-specific userland with its own service management, its own network device (more on this below), its own user database (/etc/passwd, LDAP, ...) and so on. But at the syscall interface it all runs on one kernel, which also means that all processes are handled equally by the kernel (unless configured otherwise).

The result of using Solaris Zones is a lightweight isolation of independent userland environments. Now, as said, the virtualization has its boundary at the syscall layer, so the userland has to be Solaris - one might think. But that's not true: there are Branded Zones which emulate another syscall interface, so one can run a Linux userland on a Solaris kernel and Linux-only apps benefit from stuff like ZFS and DTrace - but that's not relevant for me here.

So to summarize: Zones are great for lightweight isolation (and other stuff).

Crossbow

Now I mentioned that each zone can have its own network interface assigned. This is nice if you have a box with many network devices - but a typical server you get as a cheap root server usually has just one. Traditionally you can assign multiple IPs to that one device and then share it over multiple zones. That works but is inconvenient: you can't really check the status (which device/zone is producing how much traffic?) or add bandwidth limitations (I want to be able to reduce one zone's bandwidth in case an article is slashdotted, without digging too deep into everything, to keep the other parts of the system running). Additionally, IP addresses are limited and I don't want all zones to be publicly accessible - for instance my MySQL zone shouldn't be reachable from the outside.

Now Crossbow - that's the name of the Solaris network virtualization layer introduced with OpenSolaris 2009.06 - always was a "so what" thing for me, until I started using it. Well yes, you can create virtual switches and virtual network interfaces. So what? Well, combined with zones I can achieve what I described in the paragraph above.
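A minimal sketch of such a setup - the link names (internal0, vnic0, vnic1) are placeholders:

```shell
# Create a virtual switch (etherstub) for the internal network
dladm create-etherstub internal0

# Create two virtual NICs attached to it - one per zone
dladm create-vnic -l internal0 vnic0
dladm create-vnic -l internal0 vnic1
```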

That's all that's needed to create an internal ethernet with two devices. The next step is to assign them to zones and configure IP for this network. In my current setup I have a zone for this web site and one zone for the MySQL server. The MySQL zone has a vnic for the internal network; the web zone has two vnics - one for the internal network and a second configured on top of the physical network device, so it can talk to the outside using its own public IP address. For limiting resources there's the flowadm tool, which gives simple access to network resource limits and service priorities (ssh connections get a higher priority so the system can still be controlled when the network is busy).
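Capping a zone's bandwidth or prioritizing ssh with flowadm might look like this - the flow names and the 50 Mbit/s limit are made up for illustration:

```shell
# Cap all TCP traffic on the web zone's vnic at 50 Mbit/s
flowadm add-flow -l vnic1 -a transport=tcp webflow
flowadm set-flowprop -p maxbw=50M webflow

# Give ssh traffic on that link a higher priority
flowadm add-flow -l vnic1 -a transport=tcp,local_port=22 sshflow
flowadm set-flowprop -p priority=high sshflow
```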

And even for me, who tries to stay above the TCP layer, this is quite trivial to set up.

ZFS

Now one of the most cited features of Solaris is the ZFS file system. ZFS is more than just a file system - it's a combination of volume manager, RAID controller and other related things. The key feature for me is snapshotting: ZFS uses a copy-on-write mechanism, so creating a snapshot has barely any cost in itself. Only when data is changed is a new block written, while the old one is kept untouched; a snapshot therefore costs only the space the difference needs. Additionally this allows clones, so one gets a copy of a directory tree that costs space only as data changes - that's of special interest with zones. As said, each zone is its own userspace system. By using ZFS clones they share the same blocks on disk. Really useful. In the next version this will get even better thanks to deduplication in ZFS ...
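Setting up a new zone as a clone might look like this - a sketch assuming a pool named tank and a pristine zone template dataset (the names are hypothetical):

```shell
# Snapshot a pristine zone template and clone it for a new zone;
# the clone initially shares all blocks with the template
zfs snapshot tank/zones/template@gold
zfs clone tank/zones/template@gold tank/zones/webzone

# The clone consumes space only for blocks that later diverge
zfs list -o name,used,refer tank/zones/webzone
```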

Problems

Coming from Linux there are - of course - different problems. As I've been using OpenSolaris on other boxes for some time now, I'm used to many of the administration tools, but I learn new things every time I work on the system.

A bit more problematic is that the main OpenSolaris package repository doesn't offer as much software as typical Linux distributions, though most software can be found as packages in other repositories, too. This is a bit annoying, but given the repository's visible growth and access to the features mentioned above, it's no big problem - especially on a server, where most of the relevant tools exist for Solaris, too.

Oh, and for the German speakers: David and I discussed some experience while installing the server in the latest HELDENFunk podcast.

So, this website moved. It isn't a citizen of a Linux box anymore but runs inside a zone on an OpenSolaris host. The only non-default software powering this server that I compiled myself is a current svn snapshot of PHP 5.3.2-dev. Let's see if I can keep this system clean or whether it becomes as messy as the old Linux box. For now I'm happy about the isolation using zones, snapshots with ZFS before playing around, and DTrace in case something goes wrong.

Sun recently introduced their Amber Road storage systems, which act as a storage appliance with a web interface for configuration and analytics. Thanks to the power of DTrace combined with an AJAX web interface, you can really see what's going on. Brendan Gregg, the engineer famous for screaming at his storage systems, recently published a new blog posting giving some insights into the heat maps created by these systems. But Sun engineers wouldn't be Sun engineers if they didn't have weird ideas. So what about sending text messages to the analytics software? - Scroll to the bottom of Brendan's blog to see what I mean.

Angelo recently showed an easy way to dump SQL queries using DTrace. While reading the article I felt that some important information was missing: the name of the user executing the query and the selected database. So I sat down for a few minutes and tried to collect that data.

For the database name I found a quite simple solution: it is passed as a parameter to MySQL's check_user() function, so we can easily add a thread-local variable to keep that name. A simple script for that:
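The original script wasn't preserved in this copy; a sketch along the lines described above might look like this. The argument positions depend on the MySQL version, so treating the database name as arg4 of check_user() and the query text as arg2 of dispatch_command() are assumptions:

```d
#!/usr/sbin/dtrace -s
/* Remember the database name passed to check_user()
   (assumption: it is the fifth argument, arg4) */
pid$target::*check_user*:entry
{
    self->db = copyinstr(arg4);
}

/* On command dispatch, print database and query text
   (assumption: the query string is the third argument, arg2) */
pid$target::*dispatch_command*:entry
{
    printf("db=%s query=%s\n", self->db, copyinstr(arg2));
}
```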

Getting the username is a bit harder. For the pid provider we need a function where it is passed as a parameter, ideally as a char*, but there's no such function that is reliably called. (For instance there are functions for checking max connections per user, where everything we need is passed in a usable way, but they're only called when a limit is set.) The only way would be to take the user_connect property of the THD object passed to dispatch_command and then access the username (and hostname) from there. But getting that to work from within DTrace is quite some work. I have prepared some scripts doing this kind of thing with simple C structures for the second part of my DTrace article series, which is ready in my head and waiting to be typed. So in theory it should be possible - anybody want to try?

Over the past few weeks I annoyed my environment by praising DTrace whenever possible. Yesterday, during a break at Barcamp Munich, I gave Wolfram a short introduction on his Mac and decided to put some of that stuff here:

DTrace is a toolkit available on Solaris (Solaris 10 or OpenSolaris), recent Mac OS versions and FreeBSD that is mightier than tools like truss or strace but has far less impact. DTrace allows you to hook into points in the system (called "probes") and run some analysis there.

I guess this all works best by showing an example first: PHP wraps the system's memory allocation in a function called _emalloc (which is in turn wrapped by a CPP macro called emalloc), so it might be interesting to see how often that function is called. For that we can use a D script (D being the DTrace scripting language, not Digital Mars's D) like this:
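The script itself wasn't preserved in this copy; a minimal version matching the description would be:

```d
#!/usr/sbin/dtrace -s
/* Fire whenever the traced process enters PHP's allocator */
pid$target::_emalloc:entry
{
    printf("_emalloc called\n");
}
```

It can be run against a fresh PHP process with something like `dtrace -s emalloc.d -c 'php test.php'` (the script and PHP file names are placeholders).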

We can now simply run that script and tell DTrace to start a PHP interpreter executing a PHP script. DTrace will then modify the running program in memory so that the message is printed whenever the process with PID $target enters the function _emalloc. $target is a special variable referring to a process started by DTrace using -c, or to a PID provided using -p.

That's nice but not really useful yet - we'd at least like to know the size of the allocated memory area, which is the first parameter to _emalloc. The pid provider helps us by exposing the functions' parameters as D variables, so we can simply change our action to print that value:
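The changed action, with arg0 holding _emalloc's first parameter:

```d
/* arg0 is _emalloc's first parameter: the requested size */
pid$target::_emalloc:entry
{
    printf("allocating %d bytes\n", arg0);
}
```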

The output is quite long and still rather useless. To make use of this information we at least need some aggregation, but DTrace helps there, too, so let's create an aggregation variable collecting the data in a usable way:
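Using DTrace's built-in quantize() aggregating function, which buckets the values into power-of-two ranges:

```d
/* Collect allocation sizes in a power-of-two histogram;
   DTrace prints the aggregation when the script exits */
pid$target::_emalloc:entry
{
    @sizes = quantize(arg0);
}
```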

This tells us that the most common allocation size is between 9 and 16 bytes and that the largest allocation is somewhere between 65536 and 131072 bytes.

For a deeper analysis we can now add a predicate to our probe so the action triggers only for that large allocation. Such predicates are written between slashes, between the probe name and the action. Additionally I'm adding a ustack() call to the action, which prints the process's userspace backtrace -- C level, not PHP space.
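The probe with predicate and backtrace - the 65536 threshold matches the largest bucket found in the histogram above:

```d
/* Only fire for the large allocation seen in the histogram */
pid$target::_emalloc:entry
/arg0 >= 65536/
{
    printf("allocating %d bytes\n", arg0);
    ustack();
}
```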

So we see we're in the startup of PHP allocating some space on its stack. One question now might be about the cost of an _emalloc call; one important factor there is syscalls to the operating system. As DTrace is made for observing the whole system, this can be done quite easily using the syscall provider. We might now use syscall:::entry as a probe to be triggered on every syscall, but that would be quite a lot. As we're only interested in syscalls made from within _emalloc, we'll use a thread-local variable as a flag and check that flag in the predicate:
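A sketch of that flag technique (the variable name inmalloc is arbitrary):

```d
/* Set a flag while the thread is inside _emalloc ... */
pid$target::_emalloc:entry  { self->inmalloc = 1; }
pid$target::_emalloc:return { self->inmalloc = 0; }

/* ... and count only syscalls made while the flag is set */
syscall:::entry
/self->inmalloc/
{
    @calls[probefunc] = count();
}
```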

So we're calling brk twice. brk is the syscall to "change the amount of space allocated for the calling process's data segment", which is exactly what we expect - but why is it called twice? Adding a ustack() call to the syscall action can tell us where it happens; using the source, this can then probably be optimized. That's left as an exercise for the interested reader.

In summary: no need to change the code, and lots of information. I plan to write an additional article showing how to get interesting facts system-wide - not only for a specific process but for all running ones - which is especially interesting when searching for a problem on production systems (DTrace is made to be used on production systems!) or for problems related to concurrent processes/threads.

DTrace is a damn cool debugging tool, unfortunately only available for Solaris and different BSD flavors. If you want to learn about it, watch the quite entertaining video (well, I guess you have to be a true geek to be entertained ...) of Bryan Cantrill's talk.

The reason for me writing this is that I had some problems with the PID provider and wanted to note the solution for myself: