Using strace to debug application errors

Recently I inherited ownership of an SVN server which was misbehaving. Determining why it wasn't working correctly involved a few hours of testing, careful thought, and caffeine. Eventually I got it working correctly using the often-overlooked tool strace.

strace is a common tool found on many GNU/Linux systems, including Debian. Put simply, strace is a "system call tracer" - which is where it gets its name.

Using strace, as root, you can monitor the system calls made by any process on your system. This can be enormously helpful when you have a misbehaving program.

If you don't have it installed you can install it easily with apt-get (or aptitude):
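On Debian and its derivatives the package is simply called strace:

```shell
# install the strace package (run as root)
apt-get install strace
```

aptitude users can run `aptitude install strace` instead.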

I've snipped the loading of several library files, but even with that truncated output you can see a lot of activity taking place, such as:

The opening of files.

The return codes of things which failed.

The writing of text to the console.

Each line of output corresponds to the invocation of a particular kernel system call. These system calls are the primitives used by the C runtime, and are the kernel's interface to userspace.

For example, the "open" system call is used to open files. We can see several details of how the uptime binary operates, because we can see which local files it opens for reading and writing. Some of these calls fail and others succeed:

It tries to access /etc/ld.so.nohwcap - which fails.

It tries to access /etc/ld.so.preload - which also fails.

It opens /proc/loadavg which succeeds and returns the filehandle "4".

It writes to filehandle 1 the string "19:37...." which is the output of the program.
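If you only care about a few calls, strace's -e option can restrict the trace to a named set of system calls. A sketch, using uptime as the traced command (the exact call names in use can vary with the C library and architecture):

```shell
# trace only the open/read/write calls made by uptime
strace -e trace=open,read,write uptime
```

This cuts out the noise of memory-management and library-loading calls.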

Because the output can be quite unwieldy when dumped to the console, it is common to redirect it to a file via the -o option. For example, to record the system calls made by the ls command in the file /tmp/ls-trace we can execute:

strace -o /tmp/ls-trace /bin/ls

Another common option is "-p", which takes a process ID. This allows you to attach to a running program and trace the system calls it makes. This is useful in the case of long-running daemons which you cannot restart easily - or which only need to be monitored very rarely.
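For example, to attach to a daemon by its process ID (1234 here is just a placeholder) and log its calls until you press Ctrl-C:

```shell
# attach to an already-running process; 1234 is a placeholder PID
strace -o /tmp/daemon-trace -p 1234
```

You need to be root (or the owner of the process) for the attach to succeed.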

For more details of the available options please see the manpage by running "man strace".

Returning to the Subversion problem, we'll see how useful strace was to me. The basic problem was simple enough:

Randomly, SVN clients would "hang", and once they did so nobody could access the SVN server again - not just the hung client; all clients in the office would be unable to use SVN.

Rebooting the server "fixed" it, but not permanently.

Initially I tried looking for logfiles, but there weren't any.

Once I had excluded that I tried to see how things were set up. The basic installation seemed simple enough. Clients had SSH access, and the SVN client appeared to essentially run:

ssh username@svnRepository svnserve

This svnserve process was responsible for actually carrying out the operations, and for some reason it was hanging. Since svnserve had no useful debugging options that I could see, I resorted to a hack.

Rather than asking the clients to adjust their setup to add logging, or verbosity (which might not have shown the problem anyway) I moved the svnserve binary to a safe name:

# mv /usr/bin/svnserve /usr/bin/svnserve.real

Once that was done I wrote a small shell script which would run the real binary under strace, so I could try to see why it was hanging or stalling:

This script, when made executable, meant that clients could still connect and run the svnserve command they expected to use - but that now I'd have a strace logfile generated in /tmp for each invocation.

Note that there is no quoting of the arguments to the child process - something that wasn't an issue here.

After a few clients had connected I could immediately see the problem as each logfile ended with the lines:

...
...
open("/dev/random", O_RDONLY) = 5
read(5,

Here we see that the client attempted to open the file /dev/random to read some random data - this open succeeded, and the client was given the filehandle number 5. Unfortunately the read on this filehandle never completed. (We can see that it stalled because the read call didn't return.)

/dev/random stalls if there isn't enough entropy available to generate a random number stream. This explains why rebooting the host fixed the problem temporarily - since rebooting would cause this pool of entropy to fill as the system started back up.

A real solution to this problem would be to investigate why there wasn't enough entropy available, but as a simple "hotfix" to allow things to work it was sufficient to link /dev/random to /dev/urandom - which doesn't stall when entropy isn't available.
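On this particular box the hotfix amounted to something like the following, run as root (keeping the original device node around in case you want to restore it):

```shell
# point /dev/random at the non-blocking urandom pool
mv /dev/random /dev/random.orig
ln -s /dev/urandom /dev/random
```

Undoing it is just a matter of removing the symlink and moving the original back.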

In many cases this might introduce security weaknesses, especially if you need a high-quality random number stream, but for this particular server I don't believe there are any significant problems.

Tracking down this problem and creating a fix would have been almost impossible without the ability to use strace or a debugger - and using strace is much simpler.

The idea of using a shell script as a wrapper around commands which you cannot modify is also a neat trick to remember. I first saw it used by a colleague a couple of years ago and thought it was an inspired solution at the time, and still do now.

(Just remember to put things back the way they were when you're done!)

I have a related issue of my own around entropy. One of my Xen domains, based on etch, occasionally has its sshd die. Last time, I ran it in the foreground in debug mode to try to figure out why, and I saw a fatal error about there not being enough entropy in /dev/random.

Digging about a bit, I see that /dev/random gets entropy from the mouse, keyboard, display, audio, and disk subsystems. I think that since a Xen domain has no real disk subsystem it may not get anything from there. It also has none of the other devices, so I can see why it might run low.

None of my other domains do this, but they are all sarge. Sure enough, etch has a much more up to date openssh. Something may have changed in openssh to make it more paranoid about entropy.

I would be interested in any possible solutions for this that do not involve installing a non-Debian package or recompiling a Debian package.

egd doesn't appear to be packaged for Debian, and even if it were I think openssh would need to be recompiled to use it.

I suspect the entropy pool is updated on receipt of hardware interrupts, and I don't think a domU kernel is going to get those when it doesn't have any real hardware. Just because it has an eth0 and a /dev/sda doesn't mean it has an Ethernet card or a SCSI controller - disable all the Ethernet and SCSI drivers and it still works.

I think the last time I used strace in a useful manner was when I'd tried to install a bunch of perl modules from one host to another - and I'd used rsync.

Once the transfer was over it turned out that none of the modules using a shared library would work - and it was my fault for not telling rsync to preserve permissions. Somehow I'd failed to spot this for around 40 minutes, despite looking at the .so files a lot.

You can also add the -ff option when using the -o one. This makes strace generate one file per process spawned by the command you analyse. It's simpler than working out in your head who's doing what by their PID.
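A sketch of combining the two options (the traced pipeline is arbitrary):

```shell
# writes /tmp/trace.<pid> for the shell and for each child it spawns
strace -ff -o /tmp/trace /bin/sh -c 'ls / | wc -l'
```

You end up with one cleanly separated trace file per process instead of interleaved output.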

strace has to be one of my favourite commands. I can't recommend the book 'Advanced Programming in the Unix Environment', by W. R. Stevens, highly enough for very in depth documentation on the behaviour and idiosyncrasies of the various Unix system calls. My usual usage would be something along the lines of:
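Judging by the examples that follow, presumably something like (the PID being whatever process you're interested in):

```shell
# -T: show time spent in each syscall, -f: follow forks, -v: unabbreviated output
strace -Tfv -p 12431
```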

However, strace -Tfv -p 12431 just hangs on "read(3, ". If I run strace -Tfv -p 12568, I see a few read()s every other second or so. If I run strace -Tfv -p 13258, I see much more, with "[pid ] ..." prepended to the child processes' lines.

From the 'man strace' documentation, it seems that -f and/or -F will follow children; however, this doesn't seem to be the case. Any ideas?

I wonder if this is a glitch with using the backtick operator in perl vs. system?