Tuesday, December 18, 2007

A fairly common request I get when helping to debug executables, or other programs that need to be "trussed" (that is, have their system calls traced, which is a huge generalization ;), is to "please explain what all the error output means." Sometimes, knowing what those errors mean hands you the answer to the problem on a plate.

Solaris' truss (available on Linux as strace or xtrace - although with slightly different options and output) is an excellent debugging tool, and you can use it as simply or as elaborately as you're comfortable with. End users of Unix systems (the client or office staff) generally don't want to deal with it at all.

Although it isn't always true, the error message that immediately precedes a program's crash is often a significant help in determining the root cause of the problem. Just knowing that, for instance, the program tried to write to an output file and got an error indicating that there's no space left on the device (or partition) right before it crashed can solve the case then and there.

To that end, I threw together today's script. We'll delve deeper into using truss (and also xtrace/strace) in a completely separate post. This script serves a very limited purpose, but it can be a helpful first step if you've got a lot going on (or if you believe - like I do - that most problems aren't as complicated as we can make them ;). It takes as its arguments the program you would run truss against, along with that program's own arguments. So, if you would normally run the following (we're keeping the truss simple here, with no options, which isn't the case in the script):

/usr/bin/truss /usr/bin/myprogram -f myconfig

With the script, you'd simply run (I'll name it error_detail.sh for now):

./error_detail.sh /usr/bin/myprogram -f myconfig

The script essentially strips the output of truss down to the lines that contain system errors (ENOENT, EIO, and so on) and then prints those lines, each followed by the literal description of the error (extracted from the contents of Solaris' /usr/include/sys/errno.h).

e.g. EPERM translates into "Operation Not Permitted"
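
To make the idea concrete, here is a minimal sketch of that filter-and-lookup step. It is not the actual error_detail.sh, just an illustration built on two assumptions: that truss marks a failed call with "Err#<number> <NAME>" on the line, and that the header defines each error as "#define NAME number /* description */". The function name describe_errors is made up for this example, and it expects an awk that understands -v (nawk or similar).

```shell
#!/bin/sh
# Hypothetical sketch, not the original error_detail.sh.
#
# describe_errors HEADER
#   Reads truss-style output on stdin, keeps only the lines that contain
#   "Err#<number>", and prints each one followed by the comment text found
#   for that error name in HEADER (assumed to look like sys/errno.h).
describe_errors() {
    header=$1
    awk -v header="$header" '
        BEGIN {
            # Build a name -> description table from lines such as:
            #   #define ENOENT 2 /* No such file or directory */
            while ((getline line < header) > 0) {
                if (line ~ /^#define[ \t]+E[A-Z0-9]+[ \t]+[0-9]+[ \t]*\/\*/) {
                    split(line, f, /[ \t]+/)
                    desc = line
                    sub(/^[^*]*\*[ \t]*/, "", desc)  # drop everything through the opening comment marker
                    sub(/[ \t]*\*\/.*$/, "", desc)   # drop the closing comment marker
                    text[f[2]] = desc
                }
            }
            close(header)
        }
        /Err#[0-9]+/ {
            # The error name is assumed to be the field right after Err#N.
            name = ""
            for (i = 1; i < NF; i++)
                if ($i ~ /^Err#[0-9]+$/) name = $(i + 1)
            print
            if (name in text)
                print "    " name ": " text[name]
            else
                print "    " name ": (no description found)"
        }
    '
}
```

Since truss writes its trace to standard error by default, one way to feed this function would be something like: /usr/bin/truss /usr/bin/myprogram -f myconfig 2>&1 | describe_errors /usr/include/sys/errno.h. That mixes the program's own output in with the trace, which is harmless here because only the Err# lines survive the filter.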

Hopefully, this will help you get to those simple conclusions a little faster (and also remind you of what some of the more obscure system errors actually mean ;)