Wednesday, October 19, 2011

I work at Joyent – a cloud computing company – doing performance analysis of small to large cloud environments. Most of our systems have DTrace: the original implementation of Dynamic Tracing, which we use for monitoring and performance analysis. But we also have some Linux, which I’ve been analyzing in the following situations:

I would prefer to log in to the KVM guests and run DTrace, but it’s not available on those Linux systems. (In the future, we may have a way to reach into the guest using DTrace from the host to trace its internals, be it Linux or Windows.) In the meantime, I need to do some serious performance analysis on Linux (including kernel code-path latency) so I’m using SystemTap.

Why I’m blogging about it

While using SystemTap, I’ve been keeping notes of what I’ve been doing, including what worked and what didn’t, and how I fixed problems. It’s proven a handy reference.
In some recent threads about DTrace, people have asked how it compares to SystemTap – with some commenting that they haven’t had time to study either. I’ve been encouraged to post about my experiences, which is easy to do from my notes. I could (and probably should) get some of these into the SystemTap bug database, too.
What I’m sharing are various experiences of using SystemTap, including times where I made mistakes because I didn’t know what I was doing. I’ll begin with a narrative about tracing disk I/O, which connects together various experiences. After that it gets a little discontiguous, clipped from various notes. I’ll try to keep this technical, positive, and constructive. Perhaps this will help the SystemTap project itself, to see what a user (with a strong Dynamic Tracing and kernel engineering background) struggles with.

Who’s using my disk I/O?

There are plenty of examples out there for using DTrace (including posts on my blog), but not so many for SystemTap. A good one I’ve seen is titled SystemTap: It’s like dtrace for linux, yo., which begins with “Who’s using my disk I/O?”. The author writes:

One of my favorite (now retired) interview questions for UNIX administrators is “iostat shows a lot of write activity to a device. How do you determine what program is responsible for writing the most data to this device?”. Before SystemTap, the only way to answer this under Linux was to install kernel counter patches, or try to parse block_dump. Using the disktop script from the SystemTap Scripts & Tools page, it’s trivial to find this kind of information out.

That’s a great question!
Getting a quick sense of who is using the disks is a problem I’ve dealt with for over a decade, and it isn’t answered well by standard tools. It’s getting more important as systems get larger and more crowded, and have more applications sharing the same set of disks.
This makes for a great example of Dynamic and Static tracing, which can be used to examine disk I/O events in whatever detail is desired. By stepping through this example, I’ll be able to discuss and share various experiences of using SystemTap.

disktop

disktop.stp is on the SystemTap page. It’s also shown as the first example in section 4.2 Disk of the Red Hat Enterprise Linux documentation for SystemTap.
I’ll run it as the #1 example, just as the previous author did. First, I need to install SystemTap.

Patching /usr/share/systemtap headers for the build system (on your own)

Power cycling the system if the kernel freezes

Extra manual steps (eg, running the little script at the end of this page)

It takes me about an hour each time I do this.
You may not need all these steps, or you may need more. The SystemTap engineers have said it’s much easier on Red Hat Enterprise Linux (I’ve yet to try).
I’d recommend keeping a text file of SystemTap install (and runtime) failures, including screen output and commands tried. I’ve found it a great timesaver: now when I hit failures I can grep my text file to see if I’ve debugged that one before, and what commands I used.
Also, If you gather enough info for install failures, you’ll be able to file SystemTap bugs so that the product can get better over time.

Filing Bugs

The bug tracking system is at Red Hat, and requires a login (easy to get). There is a good page on the SystemTap wiki on How To Report Bugs, which lists information that can be helpful to gather. Extra info is requested for these types of problems:

The most unfortunate problems occur when systemtap compiles your probe script properly, but then the machine dies.

These aren’t all necessary, and if the process seems onerous I take it you can just chat to the engineers in #systemtap in irc.freenode.net instead. I’m told they are friendly. :-)

Bug Searching

When you go to file a bug, search to see if it’s already there. I search for both open and closed bugs, as I may have hit a bug that’s recently been closed (fixed), but not on the version of SystemTap I’m running. In which case, I’ll still find the bug, notice that the fix date was recent, and then go refresh to the latest SystemTap. Here’s an example search:
I tried filtering out really old bugs to reduce the list, but that excluded at least one crash that I’ve hit recently: #2725 “function(“*”) probes sometimes crash & burn” (filed over 5 years ago, last updated over 2 years ago).

Manual Patching

As an example of the sort of issue you may hit, this is what happened on my last install (Ubuntu 11.4):

This one-liner counts system calls, which I’m running to check that everything is installed and working. It failed with the “build-id mismatch” error highlighted above, which I hadn’t seen before.
This wasn’t in the Red Hat bug database, but some Internet searching found a useful post that covered this error, suggesting to manually patch the /usr/share/systemtap/runtime/sym.c file. The patch was:

But this patch had already been applied. I checked the C code in the runtime directory, and started wondering if the “mismatch” was due to running an older kernel but with a newer patched sym.c. So, as an experiment, I rolled back the patch. When I ran the one-liner again, the system hung hard, requiring a power cycle. My mistake! I rolled back all my changes.
Some digging later led me to try upgrading the kernel to resolve the mismatch. That worked!

Specifically, the system has hung. It’s not responding to keystrokes via either SSH or the console, and it’s not responding to ping. The only way out seems to be power cycling the system. It has a system controller (DRAC) allowing me to power cycle it remotely.
This was a similar hang to what I saw earlier when I modified the runtime/sym.c files, so perhaps that wasn’t my mistake after all. I don’t know – this will need careful analysis. Fortunately I have a good place to start: the SystemTap bugs checklist mentioned earlier. I’m currently following one of the steps: getting kdump configured and enabled so that I can capture kernel crash dumps. I can’t be certain what went wrong until I take a look; it’s possible that the crash was in a different kernel subsystem which SystemTap happened to trigger (ie, not the SystemTap code itself).

Safe use on production systems

That wasn’t the only system I’ve had that has crashed when running SystemTap. The comparison table on the SystemTap wiki has a row for “safe use on production systems”. At the time of writing, the SystemTap column has “yes”.
To write this post, my system has hung when running SystemTap more than six times (I lost count), requiring a power cycle each time. While this system seems more brittle than others on which I’ve run SystemTap (I usually have to run more than just disktop.stp to hit a crash), I still don’t feel safe using SystemTap in production.
In the past seven years, I’ve used DTrace in production thousands of times, on hundreds of different systems. It’s running continuously, in production, to monitor every system in the new Joyent datacenters. It can also be configured to run continuously to monitor Oracle’s ZFS Storage Appliance. I’ve never known it to cause system crashes or hangs.
DTrace’s safety record was a product of the kernel engineering culture at Sun Microsystems, and of Bryan Cantrill: father of Dynamic Tracing and DTrace. Bryan wrote about DTrace Safety in 2005, explaining the risks and principle of production safety very clearly:

DTrace must not be able to accidentally induce system failure. It is our strict adherence to this principle that allows DTrace to be used with confidence on production systems.

This was also the aim of Solaris in general, as described in the Developing Solaris article, which was also written by Bryan. Here is an excerpt:

You should assume that once you putback your change, the rest of the world will be running your code in production. More specifically, if you happen to work in MPK17, within three weeks of putback, your change will be running on the building server that everyone in MPK17 depends on. Should your change cause an outage during the middle of the day, some 750 people will be out of commission for the order of an hour. Conservatively, every such outage costs Sun $30,000 in lost time — and depending on the exact nature of who needed their file system, calendar or mail and for what exactly, it could cost much, much more.

I’ve included this to describe the background from which DTrace was born. I won’t talk at length about SystemTap’s safety, since I don’t understand that very well yet. Any of the following may be true:

The Linux kernel may be more crash-prone due to other subsystems, not SystemTap.

The Linux kernel may have fundamentally different behaviour to Solaris, Mac OS X, and FreeBSD, such that avoiding system crashes may be inherently more difficult.

The SystemTap architecture may introduce additional risk.

My experience is uncommon.

Right now, all I know is that running SystemTap may crash or hang the systems I’m on. I don’t know exactly why (crash dump analysis takes time), but I don’t think I’m alone either.

Most of the Time

Fortunately, I think most of the time disktop.stp doesn’t crash your system. Some of the time I just got this second error (but no system crash):

The kernel read fault address changes each time. This seems to be exacerbated by running load, which I’m doing so that we have something to see with disktop. By trying again numerous times, I finally got disktop to run long enough to collect a screenshot.

Neat! You can see the process ID, command name, disk device, direction and bytes – directly associating the process with the disk activity.
UPDATE: As pointed out in a comment on this post, the WARNINGs seen above are actually from a bug in the code (typo). For a write-only workload, the bug would cause the summary and header lines to not be printed (was that workload originally tested?). For this workload (reads present) the bug just means that those warnings are printed, but the output is all intact. It’s still a little annoying to have warnings printed out (from what is a core script example), and I’m guessing that the warning code predated this script (which is from 2007).
The output shows that it’s all those dd processes that are hammering the disks with reads. I could tune my filesystem to better cache those reads in DRAM, or just buy more DRAM.

Uh-oh…

The summary line above said that the disks were averaging 5003498 KB/sec, which is 4.8 GB/sec. This system has four 7200 RPM disks presented via hardware RAID as a single device. I’m pretty sure they can’t do 4.8 GB/sec combined.
According to iostat, the disks are actually doing this:

That’s right – nothing.
My dd workload has been running for so long it’s now just hitting out of DRAM cache, yet disktop.stp claims that they are still reading Gbytes from disk. And I’m pretty sure that the point of disktop.stp was to measure disk activity. From the SystemTap wiki:

disktop.stp : List the top ten activities of reading/writing disk every 5 seconds

Oh, no. Both are tracing at the Virtual File System (VFS) level, not the disk level!

Operating Systems 101

There is some code that tries to “skip read from cache” based on device information. VFS is supposed to abstract the underlying file systems and provide a well-defined interface (originally created by Sun Microsystems in 1985 for SunOS 2.0, to allow UFS and NFS to coexist).

Storage device details are not part of the VFS read interface, so to examine them in VFS read context is immediately suspicious. Perhaps Linux does it anyway (what some may call a “layer violation”).
What’s also suspicious is the lines of code in the wiki version:

That walks the file inode (d_inode), to the file system superblock (struct super_block *i_sb), to the device info (dev_t *s_dev). So far that’s just checking that the file resides on a file system that has a backing device.
For the disktop.stp logic to work, the second line will need to do the “skip read from cache” logic. It calls __find_bdevname, also in the vfs tapset, which takes that device info and returns the device name:

/*
We don't want to have to do a bdevname() call every time
we want a devname, so we'll hash them here.
*/
/* XXX: Is this hashing really that helpful? The call to bdevname()
* isn't very involved... */
global __devnames
function __find_bdevname:string(dev:long, bdev:long)
{
if (dev in __devnames)
return __devnames[dev]
else
return __devnames[dev] = bdevname(bdev)
}

(I’m tempted to join the comment conversation at the top, adding, tounge-in-cheek: “/* XXX: Have you heard of SystemTap? I bet that can measure bdevname() cost. ;-) */”.)
__find_bdevname() calls bdevname(), which is declared in the Linux kernel source (fs/partitions/check.c):

disk_name() just does the formatting. So I don’t see how the “skip read from cache” logic works by examining the code. It also doesn’t pass a sanity check (as previously mentioned), nor work in practise (as previously shown). It looks busted, and has been that way since the script was written in 2007.

Does it Matter?

Performance tools don’t need to be perfect to be useful. If it steers you in the right direction, you apply a fix, and the fix works, it may not matter much that the numbers were out by 5% along the way. Just bear in mind that the tool has an error margin.
Here, disktop.stp is measuring I/O at the VFS level instead of the disk level. Does that matter? Most disk I/O begins life as VFS I/O anyway (either directly or indirectly), so the numbers could be pretty close.
To get an idea of the difference, I measured the I/O at both the VFS and disk level on several production servers. Since I want production data, I won’t use SystemTap (for risk of system crash). Instead, I’ll pick production servers that have DTrace. The script, vfs.d, just traces reads to start with:

fop_read() is the Solaris VFS interface for read. Disk I/O is traced via the io:::start probe. To keep this simple, I didn’t track context properly for the disk I/O, so some execnames may be misidentified as the kernel (asynchronous prefetch). Which is fine, as I’m really running it to compare the byte counts.
Here’s the first production server I tried, tracing for about ten seconds:

memcached is pretty busy, reading over 200 Mbytes, but that’s returning entirely from filesystem DRAM cache or from another VFS fronted subsystem (perhaps socket reads over sockfs; more DTrace can confirm). mysqld was the only application that missed out of the DRAM cache, reading about 600 Kbytes from disk. This ratio wasn’t dissimilar to the other production servers.
disktop.stp, reporting VFS-based numbers, would have claimed about 300 Mbytes of disk reads. In reality it was about 0.6 Mbytes. That’s out by 500x!
In some situations, this may be a worse problem than having the system crash. With crashes, you know it happened, and can take action immediately to avoid it. Here the disktop tool appears to work, but is providing bad data. You may act upon bad data without realising it for months or years.
When a tool reports bad data, the best case is when it’s so bad it’s obviously wrong. If I did run disktop.stp on the above server and had it report memcached at over 200 Mbytes, I’d immediately know something was wrong: the disks aren’t doing that (as we know from iostat) and it doesn’t make sense for memcached. Worst case is where the numbers are subtlely wrong, so that it’s not obvious.

Avoiding This

Metrics are easy. Correct metrics are hard.
Dynamic Tracing makes it easy to produce numbers for all sorts of activity, but that doesn’t mean they are right. I learned this years ago in creating the DTraceToolkit, which felt like the TestingToolkit as most of my time was spent testing the scripts (and I’m very grateful for the time others donated to help with this testing).
Here are a few suggestions:Buy a pocket calculator. There are always little sums you can perform as a form of checking. Did the per second throughput exceed maximum bus, disk or network throughput? This is what alerted me to the issue with disktop.stp: more throughput than made sense.If you can’t code the workload, you can’t code the script. Get good at writing mini tools that apply a known workload, which you can then double check with what your SystemTap script reports. I’d recommend C, and sticking to syscalls (to avoid library buffering) where possible, so that the workload applied is exactly what you coded. I’ve had cases were people have complained that coding (or applying) a known workload is too hard: that may be a good sign that you shouldn’t attempt the script! Writing the workload will often reveal subtle details that should be handled correctly by your observability tool. It’s a very good idea, wherever possible, to try to write the workload.Chase down small discrepancies. If your SystemTap script reports close to the known workload, but not quite, it’s usually worth chasing down why. I’ve discovered many behaviours this way, which may be small discrepancies for the test workload but larger in production. Reasons can include tracing the common path through code (aka “fast path”) but not the edge cases (“slow path”). Sometimes what you try will never make sense, and you discover that the system is much more complex than you thought.
So, I’m not really surprised that the disktop.stp script is broken – I’ve seen a lot of this sort of mistake before. I am a little surprised that it doesn’t appear to have been noticed and fixed since 2007, though.

Another thing…

The output of disktop is pretty familiar to me, having written iotop for the DTraceToolkit years earlier:

Apart from the visual similarities, both tools default to a 5 second interval. This wouldn’t be the first time that my work has inspired SystemTap.
If disktop was inspired by my iotop, it’d be nice if the script mentioned that somewhere (as Red Hat consulting did). I certainly mentioned where I got the idea for iotop:

It’s not necessary, it’s just a nice thing to do. You never know, one day you may meet the guy!

Did I get the job?

At the start of this article I mentioned a Linux interview question: “Who’s using my disk I/O?”, which was now considered trivial to answer thanks to disktop.
If I’m being interviewed for that job, and dived into an explanation of VFS and why all the official documentation was wrong, would I get the job?

iostat-scsi.stp

While disktop.stp is top of the disk sections of SystemTap documention, it’s not #1 on the SystemTap Scripts & Tools page. That is iostat-scsi.stp.

Which should be useful, since the iostat(1) tool on current versions of Linux don’t include all devices, such as tape drives. Fetching this sort of basic observability is a cakewalk for Dynamic Tracing.
Running it:

That’s just the top – the whole script is 90 lines long. It looks pretty complicated, and runs different code depending on the kernel version. I suppose it’s listed first because of the “by category” ordering of the SystemTap page. Due to its complexity – not to mention the 40 lines of errors when I tried to run it – it certainly doesn’t look like a good script example to place first.

USAGE

As part of debugging this script, it would be helpful if it had a header to describe version dependancies and usage. Like the following, for example:

Instead of 40 lines of errors.
I shouldn’t complain, since this was from a wiki page where anyone could contribute anything. But it’s also the first script listed, and doesn’t work as I’d expect a “script” to work. I sometimes wonder if I’m the only person trying these things out.

What was that -g?

SystemTap supports a guru mode where script safety features such as code and data memory reference protection are removed. Guru mode is set by passing the -g option to the stap command. When in guru mode, the translator accepts C code enclosed between “%{” and “%}” markers in the script file. The embedded C code is transcribed verbatim, without analysis, in sequence, into generated C code.

!!! So the first script example is running in guru mode so that it can execute arbitrary C code in kernel context without safety features. Code that varies between kernel versions:

So if you run this on a new kernel version that isn’t handled by the code – which seems very possible given the special casing it already does for kernel versions – then you are running possibly incorrect code with safety features disabled!
And this is the first example on the SystemTap Scripts & Tools page.

vfsrlat.stp

Here is an example experience of things going (mostly) well with SystemTap. I was interested in the distribution of latency for cached reads on different Linux file systems, especially to see if there were any latency outliers (occasional instances of high latency, which can be caused by lock contention during file system events, for example). I wrote a SystemTap script that does this: vfsrlat.stp.

Script

This can be traced a few different ways. It could trace at the system call level, the VFS level, or in the file systems themselves. I picked VFS, to capture different file system types from one script. A logarithmic histogram is printed to summarize the latency (in nanoseconds), and each file system type is printed separately.

This was a little tricky to write. I switched between the vfs tapset probe vfs.read and the kernel function probe kernel.function(“vfs_read”), looking for an easy way to get the file system type. I didn’t find it, nor did I find any example SystemTap scripts where anyone else had done it either.
So I pulled up the Linux kernel source for VFS and mounts and started reading. I figured out that I could get it by digging from the inode to the superblock, and then the file system type name. Maybe there is an easier way – maybe I missed a tapset function (I looked, but I know this particular system has an older version of SystemTap and its tapsets). This worked for me anyway.

Screenshot

The script traced for several seconds while a single threaded read workload hit a cached file. A few file system types are truncated from this screenshot to keep it short:

Great!
For this test, the ext4 file system latency is very fast and consistent. This also caught other file system types, including proc, showing their latency distributions. This is pretty interesting.
If you haven’t read these type of graphs before, the left column (“value”) shows the time ranges in nanoseconds. The right column (“count”) shows the number of occurrences at that range. The middle part shows an ASCII graph so that you quickly get an idea of the distribution.
As an example of reading the above output: The ext4 graph above (last) showed 16,321 reads that were between 8,192 and 16,383 nanoseconds (in other words: 16 thousand reads between 8 and 16 microseconds). The slowest read was in the 65 to 131 us range. Most of the I/O were consistantly fast: a distribution I like to see.

25% wider ASCII graphs

This example lets me describe some favourable features for SytemTap vs DTrace:

The “devpts” graph has a “~” where lines were compressed, keeping more detail on the screen.

While not shown here, the same statistic variable can be reprocessed via different actions other than @hist_log, which saves declaring multiple aggregations.

Oh? It’s complaining about a symbol from the aux_syscalls.stp tapset. But I didn’t edit that file, and I couldn’t see what was wrong by browsing the code.
It’s also asking for another “–vp 01″ option. But I didn’t use one in the first place.
Trying the same one-liner again:

Huh? Now it’s complaining about “translation”. The one-liner seems fine – I don’t understand this error message, or the previous one. And it’s still asking for another “–vp” option, this time “001″, even though I didn’t use that option.
Trying again:

Now it’s complaining of an “empty aggregate”, an error message which finally makes sense: that aggregate is empty since there wasn’t activity to trace.
The “-v” output explained why I got the previous error messages: they were happening during the other passes by SystemTap to build up the program, and I was hitting Ctrl-C too early. But I still don’t know what the error messages actually mean. Suggesting that there were errors in a tapset file that I didn’t modify – simply because I hit Ctrl-C too soon – is a seriously confusing error message.
After this experience, I use “-v” all the time.

2. Another –vp

I still don’t understand the “–vp” error message, though. It even happens with:

Ok, I am not adding any more “–vp 00001″s – this is enough to read just to debug Hello World!
Reading through the above output, I can see “Certificate db /etc/systemtap/staprun permissions too loose”. Maybe that’s the problem?

I should ask the SystemTap engineers about this one…
I have noticed that if scripts call the “exit()” action, they don’t produce this error. Is every script supposed to call exit() at some point? From the SystemTap Beginners Guide:

SystemTap scripts continue to run until the exit() function executes. If the users wants to stop the execution of the script, it can interrupted manually with Ctrl+C.

It doesn’t say I need the exit() on Ctrl-C. Neither the Beginners Guide nor the Language Refernce explain the –vp, either.

3. Slow startup time

Some of the previous errors occurred when I hit Ctrl-C too soon, interrupting the SystemTap build process. It was easy to do, since for the time that one-liner finally did work, I hit Ctrl-C as soon as the script began running and got:

real 0m29.311s
user 0m23.620s
sys 0m1.450s

Which is almost 30 seconds – just for the one-liner to begin tracing.
The “-v” output includes times for each stage:

The longest stage is the compilation stage, taking over 22 seconds.
DTrace usually starts up in a few seconds, and in 30 seconds I may be on my third one-liner. With SystemTap, I’m still waiting for the first to gather data. The startup time needs to be greatly improved.

4. No start message

A minor irritation is that without the “-v” option, there is no message to indicate that SystemTap has begun tracing. And since, as just mentioned, it can take up to 30 seconds, you can spend a long time wondering if anything is happening yet. A number of times I’ve hit Ctrl-C only to find it hadn’t began tracing yet, and I had to start all over again (a reminder to use “-v”).
DTrace has a default message, which is supressed when you use the “quiet” option. If I do use the quiet option, I always try to print a begin message of some sort, either “Tracing… Hit Ctrl-C to end” or column headers. That lets the end user know that tracing has begun.

5. Multi user issues

With DTrace, multiple users can usually run the same one-liners or scripts at the same time. Here’s what happened when I tried with SystemTap:

Not only did it not allow two users to run Hello World, but it’s also example of another confusing error message: “not a zombie systemtap module”?
It appears that systemtap takes a hash (md5?) of the script, and uses that as a unique identifier. You’d think the identifier would include a process ID, but I’m guessing it doesn’t as part of the SystemTap cache system (under ~/.systemtap) that caches previously built modules to reduce runtime. I’m guessing (hoping) that you disable the cache system, multiple users can run the same script. Which reminds me:

6. Stale cache

Sometimes SystemTap gets into a totally weird state where scripts and one-liners that worked a moment ago are all failing. The remedy is usually to delete the entire ~/.systemtap directory, which contains cached modules.
And, not to belabor the point, but the error messages for this issue are also totally confusing. The following:

error: ‘struct hrtimer’ has no member named ‘expires’

Wasted two hours of my life. I picked through the runtime C code to comprehend what it meant. The ~/.systemtap directory cache was stale.

And I don’t know what that stap module is for. There’s no stap processes running right now.

8. function(“*”) crashes

This bug was filed over 5 years ago, and last updated over 2 years ago:Bug 2725 “function(“*”) probes sometimes crash & burn”
It’s not clear if this is inherently difficult to fix with SystemTap, or not considered a priority to fix. With DTrace, I use this functionality often. It’s limiting to have to keep avoiding it in SystemTap due to this known bug.
As an example: last Friday I was asked to help with a network issue, where a server had lost connectivity on one of its 10 GbE interfaces. Various possibilities had already been checked, including swapping the optics and fibre with known-to-be-good ones, without success. The driver was checked from the usual OS ways, dumping various error counters and messages, and it seemed fine. The question I wanted to answer: is there a bug in the driver? Is it doing something abnormal, or does it have an abnormal internal state that isn’t visible via the OS tools?
I hadn’t DTraced this driver in a while, and certainly didn’t remember what functions it used. Listing them showed there were hundreds available, but that’s not much use unless they are being called so that I can trace them. I used wildcarding to see which of the hundreds were being called, which turned out to be only several, and then traced them to reveal internal state. I did this in parallel with a working server, and all the extra details I was digging out proved to be the same. I then moved to other areas of the TCP/IP stack by increasing the wildcard to eventually match everything as a list of canditates. But I wasn’t finding a smoking gun, making the theory of a bug in the driver looking less likely. More physical analysis in the datacenter found that the optics in the network switch (not the server) were dead! (Yes, this might have been a better example if Dynamic Tracing actually found the issue directly; instead, it’s an example of Dynamic Tracing finding the absence of an issue).
With SystemTap and this bug, I couldn’t do wildcarding as part of this analysis, as it leads to system crashes. While one network interface was down, others were still up, and this was production. Instead, I’d read through the list of available probes and use a trial-and-error approach: tracing one to see if it was being used, then moving on to the next one. With SystemTap startup times being slow (as mentioned earlier) and hundreds of functions available, this could have taken over an hour, instead of the minutes it did take. This was also a production emergency where other company experts were pulled in to help: so this wouldn’t have just wasted my time – it’d waste theirs, too.

9. no profile-997

While I’m on the topic of probes that don’t work, here is another that I use frequently – almost every time I do DTrace analysis. This samples kernel stack traces at 997 Hertz, to get a quick idea of where the CPU time is spent in the kernel:

I ran this on an older system and caught a hot codepath I’m quite familiar with, caused by memory capping software (rcapd) scanning memory pages. I remember going from this sampled stack to Dynamic Tracing by picking some of those functions and tracing entry to return time: useful data that helped us comprehend the magnitude of this issue in production, which was later fixed (thanks Jerry!).
Apart from sampling kernel stacks, by changing “stack()” to “ustack()” (usually with a pid or execname predicate) this can sample application user-level stacks too. Many more customizations are possible.
The 997 Hertz is a deliberate choice so that the sampling doesn’t run in lockstep with a kernel activity, resulting in over-samples or under-samples. I’ll sometimes use 1234 Hertz, too.
Unfortunately, this didn’t work for me with SystemTap:

It seems that backtrace() doesn’t work with arbirtrary Hertz. Even if it did work, it looks like the timer.hz probe only fires on one CPU, instead of profiling across them all as desired.
The timer.profile probe does fire on all CPUs, and backtrace() does work. However, it only runs at 100 Hertz:

SystemTap was created after DTrace, adding to the syntax the term “probe” and forcing the predeclaration of variables. This hasn’t improved usability for me. I don’t know why they wouldn’t just support the DTrace syntax, or at least improve upon it. Another baffling change was to lowercase the “BEGIN” probe to be “begin”, contrary to not only DTrace but also to awk (and Perl) from where it originated. Was it necessary to sacrifice usability just to be different?
The rest of the above one-liner was long to make the stack traces readable. Without the end action, the default output of SystemTap is:

These are just stack frame addresses, which are not human readable. By default, DTrace prints the stack trace aggregation with human readable translations. As a one-liner I use often, this reduces typing and improves useability.
I’ve found that SystemTap often requires more typing to achieve the same result as DTrace. There are some exceptions where SystemTap is shorter, which reminds me of the next topic:

11. Non-intuitive

Here’s an example where SystemTap is slightly less typing than DTrace:

This is a basic one-liner, tracing the time for the read() system call. Let me try to describe what each of these do:With DTrace: the entry to the read() syscall is traced and a timestamp is saved into the self->s variable, associated with the current thread. The return for the read() syscall is also traced, which checks for the presence of the self->s variable to ensure the entry was seen and the timestamp recorded. If present, a delta time is calculated and saved into a quantize aggregation. The self->s variable is then set to zero, clearing that variable.With SystemTap: the return to the read() syscall is traced, and the delta time between the current timestamp and the timestamp from the entry to the read() syscall is saved into the “s” statistic. The entry timestamp was retrieved by passing an action into an @entry() clause, which automatically instantiates an entry probe to execute the action, and returns the value. Storage for the @entry() clause result is implicitly created, checked on the return probe (I think), and then freed after use (I think). This is a shortcut to avoid explicitly probing “syscall.read.entry” and the timestamp variable. The end probe uses the @hist_log() action to print a logarithmic historagram.
Dynamic Tracing is hard to learn. The implicit behaviour of the @entry() action makes this more difficult to follow than DTrace’s simple explicit version. It also complicates an important task when using and reusing these scripts: scanning the probes to gauge the overhead. Single probe scripts have much less overhead than pairs of probes that allocate and free a variable between them. This difference is clear with the DTrace version, but not the SystemTap version.
This is just a minor point, since the solution should be to simply not use this shortcut.

12. Incomplete documentation

I mentioned “I think” a couple of times in my description above for @entry. The @entry() action is not explained in any of the documentation I’ve searched, including the reference manual:
While there are a lot of reference docs for SystemTap, this isn’t the first time I’ve found something missing. It’s also something that has some implicit behaviour, for which I’d really like to read the documentation. It is used in the example scripts.

13. Root escalation

This did get fixed in SystemTap version 1.6, so make sure you are on that version:CVE-2011-2502 systemtap: insufficient security check when loading uprobes kernel module.
Which provided a root escalation from any user in the stapusr group via a one-liner.
I installed systemtap via apt-get a few days ago, and got version 1.3 (other times I compile from the latest source). So there may still be some vulnerable versions out there (although, Debian do backport security fixes, so their 1.3 may have this fix – I don’t know).

Must stop…

I could go on, but I’ve already spent more time than I should on this, and yet I’ve only scratched the surface. There are many more scripts and tapsets and functionality I haven’t mentioned yet, including user-land tracing. Exploring (and wrestling) with these could become a day job for the next several months.
I’d summarize my SystemTap experience so far as “painful”, due to:

installation: not a default install, and not easy to get running on the systems I tried

I am using SystemTap to solve real issues (including analyzing performance differences between OSes that do and don’t have DTrace), so it has brought some amount of value, but at the cost of countless hours of work.
By detailing these experiences I’ve probably helped the SystemTap project in some way (similar to the feedback I sent the DTrace engineers during its development), and maybe everything I’ve listed will be fixed in a year or two from now. But, the future could also be very different.

Past, Present, Future

The SystemTap project has been running for over six years, with contributions from about one hundred engineers. They’ve completed a lot of work on an incredibly difficult engineering problem – Dynamic Tracing – and made a good effort.
The years ahead may be very different though, now that Oracle has announced that they are porting DTrace to Linux. Their current version may be an even worse experience than SystemTap. However, it’s early days. What’s interesting is that they are the first Linux vendor to believe that it is both possible and should be done. Based on that precedent, other Linux vendors may follow suit.
It’s possible that SystemTap is, and has always been, on the wrong trajectory. The slow startup times caused by compilation – an architectural decision by SystemTap – may never be completely fixed. Bryan understands this better than I do from his work on DTrace, and wrote recently about the SystemTap compilation stages:

It would be incorrect to say that we considered this and dismissed it for DTrace, because never for a moment would we have considered something so obviously asinine. I elaborated on our thinking in http://dtrace.org/blogs/bmc/2005/07/19/dtrace-safety/; while that post never mentions SystemTap by name, it is very much aimed at what I perceived to be the reckless design decisions that they had made.

When DTrace was released in 2005 as open source, I remember telling Sun sales folk that they probably had three months of competitive advantage before Linux would have the code up and running. They wern’t too happy about that, and there was some resentment at Jonathan for open sourcing the crown jewels of Solaris 10. Little did we know what would happen in the years to come.
I’ll still be using SystemTap in the short term future while DTrace does not exist on the Linux systems I have, in the lab. My next tasks are to get kdump to collect kernel crash dumps, try other Linux distributions, upgrade to the latest SystemTap, and take better notes (version numbers of everything).

Conclusion

Dynamic plus Static Tracing – with the ability to write custom tools that summarize data in-kernel and are safe for production use – is the greatest performance technology of my lifetime, and should be remembered as one of the greatest inventions in computing. The scope, depth and utility of this observability is mind blowing.
In this article I explored the SystemTap version of this technology, and ran into various issues including system crashes. In summary, I would not recommend the current version of SystemTap for production use at all, at least in the production environments that I’m familiar with. I also discovered that a key example from the documentation, disktop.stp, doesn’t make sense – raising questions on how well SystemTap is being tested or used. I have mentioned various other issues, including confusing error messages and slow startup times, that make the tool difficult to use.
I’m particularly worried about people getting the wrong impression of Dynamic Tracing based on SystemTap. SystemTap is not DTrace, and looks as if it’ll be a while yet before it is even close. That is, if it is even on the right trajectory.
During the past seven years, DTrace has become an essential part of my day job in performance analysis, and it’s unthinkable to return to a time without it. I’ve hoped that should my career move to Linux (anything’s possible), I’d learn the SystemTap syntax quickly, and be able to carry on with the Dynamic Tracing methodology that I’ve been perfecting for years. That now seems much harder than I hoped.
I’m expecting a response from the SystemTap developers will be: “it’s better on Red Hat Enterprise Linux”. This may certainly be true. But if I’m going to switch OSes to do Dynamic Tracing, then I’m switching to an OS with DTrace.