I keep being surprised how few people are aware of all the things they can use strace for. It's always one of the first debug tools I pull out, because it's usually available on the Linux systems I run, and it can be used to troubleshoot such a wide variety of problems.

What is strace?

Strace is quite simply a tool that traces the execution of system calls. In its simplest form it can trace the execution of a binary from start to end, and output a line of text with the name of the system call, the arguments and the return value for every system call over the lifetime of the process.
But it can do a lot more:

It can filter based on the specific system call or groups of system calls

It can profile the use of system calls by tallying up the number of times a specific system call is used, and the time taken, and the number of successes and errors.

It traces signals sent to the process.

It can attach to any running process by pid.

If you've used other Unix systems, this is similar to "truss". Another (much more comprehensive) is Sun's Dtrace.

How to use it

This is just scratching the surface, and in no particular order of importance:

1) Find out which config files a program reads on startup

Ever tried figuring out why some program doesn't read the config file you thought it should? Had to wrestle with custom compiled or distro-specific binaries that read their config from what you consider the "wrong" location?
The naive approach:

The same approach work for a lot of other things. Have multiple versions of a library installed at different paths and wonder exactly which actually gets loaded? etc.

2) Why does this program not open my file?

Ever run into a program that silently refuse to read a file it doesn't have read access to, but you only figured out after swearing for ages because you thought it didn't actually find the file? Well, you already know what to do:

$ strace -e open,access 2>&1 | grep your-filename

Look for an open() or access() syscall that fails

3) What is that process doing RIGHT NOW?

Ever had a process suddenly hog lots of CPU? Or had a process seem to be hanging?
Then you find the pid, and do this:

Ah. So in this case it's hanging in a call to futex(). Incidentally in this case it doesn't tell us all that much - hanging on a futex can be caused by a lot of things (a futex is a locking mechanism in the Linux kernel). The above is from a normally working but idle Apache child process that's just waiting to be handed a request.
But "strace -p" is highly useful because it removes a lot of guesswork, and often removes the need for restarting an app with more extensive logging (or even recompile it).

4) What is taking time?

You can always recompile an app with profiling turned on, and for accurate information, especially about what parts of your own code that is taking time that is what you should do. But often it is tremendously useful to be able to just quickly attach strace to a process to see what it's currently spending time on, especially to diagnose problems. Is that 90% CPU use because it's actually doing real work, or is something spinning out of control.
Here's what you do:

After you've started strace with -c -p you just wait for as long as you care to, and then exit with ctrl-c. Strace will spit out profiling data as above.
In this case, it's an idle Postgres "postmaster" process that's spending most of it's time quietly waiting in select(). In this case it's calling getppid() and time() in between each select() call, which is a fairly standard event loop.
You can also run this "start to finish", here with "ls":

Pretty much what you'd expect, it spents most of it's time in two calls to read the directory entries (only two since it was run on a small directory).

5) Why the **** can't I connect to that server?

Debugging why some process isn't connecting to a remote server can be exceedingly frustrating. DNS can fail, connect can hang, the server might send something unexpected back etc. You can use tcpdump to analyze a lot of that, and that too is a very nice tool, but a lot of the time strace will give you less chatter, simply because it will only ever return data related to the syscalls generated by "your" process. If you're trying to figure out what one of hundreds of running processes connecting to the same database server does for example (where picking out the right connection with tcpdump is a nightmare), strace makes life a lot easier.
This is an example of a trace of "nc" connecting to www.news.com on port 80 without any problems:

So what happens here?
Notice the connection attempts to /var/run/nscd/socket? They mean nc first tries to connect to NSCD - the Name Service Cache Daemon - which is usually used in setups that rely on NIS, YP, LDAP or similar directory protocols for name lookups. In this case the connects fails.
It then moves on to DNS (DNS is port 53, hence the "sin_port=htons(53)" in the following connect. You can see it then does a "sendto()" call, sending a DNS packet that contains www.news.com. It then reads back a packet. For whatever reason it tries three times, the last with a slightly different request. My best guess why in this case is that www.news.com is a CNAME (an "alias"), and the multiple requests may just be an artifact of how nc deals with that.
Then in the end, it finally issues a connect() to the IP it found. Notice it returns EINPROGRESS. That means the connect was non-blocking - nc wants to go on processing. It then calls select(), which succeeds when the connection was successful.
Try adding "read" and "write" to the list of syscalls given to strace and enter a string when connected, and you'll get something like this:

This shows it reading "test" + linefeed from standard in, and writing it back out to the network connection, then calling poll() to wait for a reply, reading the reply from the network connection and writing it to standard out. Everything seems to be working right.

Other ideas?

I'd love to hear from you if you use strace in particularly creative ways. E-mail me ([email protected]) or post comments.

I'm a Norwegian technologist who has been living in London since 2000. I have a son, Tristan, born in 2009.

I've got development and management experience from a number of startups as well as more established companies, after co-founding my first company at age 19. I currently split my time between being the Technical Director of Nudge CRM Ltd, and running my consulting business, primarily related to development and devops. (More...)