Top 10 DTrace scripts for Mac OS X

Since version 10.5 “Leopard”, Mac OS X has had DTrace, a tool used for performance analysis and troubleshooting. It provides data for Apple’s Instruments tool, as well as a collection of command line tools that are implemented as DTrace scripts. I’m familiar with the latter as I wrote the originals for theDTraceToolkit, which Apple then customized and enhanced for Mac OS X where they are shipped by default (great!). I use them regularly to answer this question:

why is my MacBook slow?

I work in an office where everyone has MacBook Pros, and “why is my MacBook slow?” is a common question. Applications can become slow or unresponsive while waiting for CPU work, memory requests or disk I/O to complete.

For people who try to ignore the slowdown, the question can become:

why is my MacBook fan so loud?

Standard performance analysis tools like Activity Monitor and top(1) (and any third-party tools based on the same foundation) can’t tell you some key information about activity on your system, such as how much CPU consumption is caused by short-lived processes, or which processes are causing disk I/O. DTrace, however, can see (just about) everything.

In this post, I’ll cover the top ten Mac OS X DTrace scripts that I use for figuring out why laptops are slow or why applications are misbehaving. Most of these scripts are already installed, a few are from the new DTrace book.

How to use DTrace

If you’ve never run a DTrace script before or even used the command line, here’s a basic walkthrough:

Open Terminal

Run the “Terminal” application. You can find it here in Finder:

You can also type “terminal” in Spotlight (the magnifying glass in the top right corner of your Mac’s screen), which should find it. I usually drag it to my Dock so it’s easy to find later:

Customize Terminal

When you first run Terminal, it’ll probably look like this:

I find the default font small and hard to read. The size of the window in terms of characters is also small (80 columns by 24 rows), presumably to pay homage to original Unix terminals of that size. (Why Unix? Mac OS X is Unix under the hood: the Darwin kernel).

You can adjust the font to your liking in Terminal->Preferences->Text. I use Monaco 13pt, with “Antialias Text” on. Under the “Window” tab is the default size, 80×24, which you can also increase later by clicking and dragging the bottom right corner of the terminal. For example, here’s my screen as I write this blog post (in a terminal-based text editor).

Running top

While DTrace can see everything, there are some things already covered by easy-to-use (and easy-to-type) tools, like top(1). Now that you have Terminal running, type “top -o cpu” and hit enter, which will refresh the screen showing top running processes. Type “q” to exit the top program. The output will look something like this:

Shell

1

2

3

4

5

6

7

8

Brendan-2:~brendan$<b>top-ocpu</b>

[...]

PID COMMAND%CPU TIME#TH #WQ #PORT #MREGS RPRVT RSHRD RSIZE

67254-firefox-bin98.804:53:4562/117439935+374M+78M637M+

86550top15.300:28.601/1024331748K264K2324K

1068Terminal13.217:38:1951358132637M67M61M+

61501-Google Chrom7.913:40.896193-556-45M-73M77M-

[...]

I’ve truncated the header block to just include the columns. Look at the “%CPU” column to see which processes are making the CPUs busy, and the “RSIZE” column to see who is consuming main memory. The busiest (CPU) process will be at the top, since we sorted on cpu (“-o cpu”). The top was “firefox-bin” (Mozilla Firefox) at 98.8% CPU, which is in terms of a single processor (this has two). If firefox stayed that high you could look for the responsible tab and close it down, or restart Firefox.

After top, I turn to DTrace.

Running DTrace

DTrace requires admin privileges, so to use it you’ll usually need to type in a password to authenticate, provided you have administrator access (if you aren’t sure you do, click here to see how to check).

You can run DTrace by prefixing your DTrace commands with “sudo”, which will prompt for the password the first time around (but not for some time after that). When a DTrace command is running, you usually type Ctrl-C to end it. Here’s an example:

Scripts

The scripts follow. A note on style: if the script ends with “.d”, it’s a basic DTrace script. If it doesn’t, then it’s a shell-wrapped script that provides command line options. Use “-h” to list them.

And if you’d like to learn DTrace, I’d recommend starting by reading the basic ones.

1. iosnoop

This “traces” disk I/O execution live. Each time a disk I/O completes, a line of output is printed to summarize it, including process name and filename details:

This lets you instantly find out which applications are using the disk, and what files they are reading or writing to. Disk I/O is typically slow (for non-SSD disks), so an application calling frequent disk I/O (a dozen per second or more) may run slowly as it waits for the disk I/O to complete.

In that output I caught Google Chrome reading from a cache file (“data_2″), and writing to cookie files (“Cookies-journal” and “Cookies”). TweetDeck also wrote to database files (“td_26_brendangregg.db” and “td_26_brendangregg.db-journal”). The “??” is where the path information ends for iosnoop (if you are a darwin programmer and want to take a swing at improving that, see the fi_pathname translator in /usr/lib/dtrace/io.d).

The “-h” option lists options (don’t need “sudo” for this):

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

Brendan-2:~brendan$<b>iosnoop-h</b>

USAGE:iosnoop[-a|-A|-DeghiNostv][-ddevice][-ffilename]

[-mmount_point][-nname][-pPID]

iosnoop# default output

-a# print all data (mostly)

-A# dump all data, space delimited

-D# print time delta, us (elapsed)

-e# print device name

-g# print command arguments

-i# print device instance

-N# print major and minor numbers

-o# print disk delta time, us

-s# print start time, us

-t# print completion time, us

-v# print completion time, string

-ddevice# instance name to snoop

-ffilename# snoop this file only

-mmount_point# this FS only

-nname# this process name only

-pPID# this PID only

eg,

iosnoop-v# human readable timestamps

iosnoop-N# print major and minor numbers

iosnoop-m/# snoop events on filesystem / only

For tricky performance issues I often use “-stoD” to get start and end timestamps for each I/O in microseconds, and a couple of different types of I/O time calculations.

In hindsight, I should have called it diskiosnoop, since “io” could refer to different locations in the kernel I/O stack.

2. hfsslower.d

The reason is that applications rarely request disk I/O directly, rather, they access a file system that does disk I/O on their behalf. To increase performance, the file system will usually try to cache as much file data as possible in main memory (DRAM). The application may (by some notion) think that it’s doing disk I/O, but it’s actually reading from very fast DRAM, thanks to the file system. Writes can also buffer in DRAM and write to disk later, which also speeds up application performance.

The hfsslower.d script measures I/O before it is processed by the HFS+ file system (Apple’s current default file system). iosnoop measures I/O after the file system, and only if it reaches disk:

A couple of points:

hfsslower.d will see a lot more I/O than iosnoop, as it includes file system cache hits.

hfsslower.d better reflects application performance, as it measures the same latency that the application directly suffered.

The hfsslower.d script is from the DTrace book, and can be found here. To run it, you’ll need to create a text file containing the script (or pull it from the DTrace book tarball), and make the file executable from Terminal by running “chmod 755 hfsslower.d”.

This script takes an argument which is the minimum number of milliseconds to show I/O for. Here’s tracing 1ms HFS+ I/O and slower:

If you use the argument “0″, it will trace everything. If I’m chasing down slow I/O, I’ll often use an argument of “10″ for I/O slower than 10 milliseconds.

At this point you may think: if you just care about slow I/O, then just use iosnoop. That works to a point, but there can be slow I/O caused by something other than disks (file system lock contention, for example). The other advantage of the hfsslower.d script is that the measured latency matches the application pain suffered, whereas at the disk level you can only assume a correlation.

3. execsnoop

This traces the execution of new processes. This is great at identifying short-lived processes that may be caused by misbehaving applications and can slow down your system. These short-lived processes are usually too quick to be picked up by standard monitoring tools like the Activity Monitor or top(1).

To demonstrate this tool, here’s what happens when you type “man ls”:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

Brendan-2:~brendan$<b>sudo execsnoop-v</b>

STRTIME UID PID PPID ARGS

2011Sep2820:19:1806723466312man

2011Sep2820:19:1806723466312man

2011Sep2820:19:1806723567234sh

2011Sep2820:19:1806723567234gzip

2011Sep2820:19:1906723667234sh

2011Sep2820:19:1806723466312man

2011Sep2820:19:1906723667234gzip

2011Sep2820:19:1906723466312man

2011Sep2820:19:1906724467243less

2011Sep2820:19:1906724267239sh

2011Sep2820:19:1906724267239gzip

2011Sep2820:19:1906724067238tbl

2011Sep2820:19:1906724167238groff

2011Sep2820:19:1906724567241troff

2011Sep2820:19:1906724667241grotty

This prints a line for each new process that is executed, in a rolling output similar to the previous tools. The fields are: STRTIME = (string) timestamp, UID = user ID, PID = process ID, PPID = parent process ID, ARGS = process name (should be process + arguments, but that doesn’t yet work on Mac OS X; if you want to debug, see pr_psargs in /usr/lib/dtrace/proc.d).

Here’s what happens when I turn AirPort (wifi) off, then on:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

Brendan-2:~brendan$<b>sudo execsnoop-v</b>

STRTIME UID PID PPID ARGS

2011Sep2820:28:000672041airportd

2011Sep2820:28:0150367205140fontworker

2011Sep2820:28:010672061mDNSResponderHel

2011Sep2820:28:060491autofsd

2011Sep2820:28:1006720837ManagedClient

2011Sep2820:28:3150367209140fontworker

2011Sep2820:28:430672101airportd

2011Sep2820:28:490672111mDNSResponderHel

2011Sep2820:28:4950367212140fontworker

2011Sep2820:28:510672141kerberosautoconf

2011Sep2820:28:510672151kerberosautoconf

2011Sep2820:28:560491autofsd

2011Sep2820:29:0106721937ManagedClient

2011Sep2820:29:040672201ocspd

2011Sep2820:29:165036722154079Google Chrome He

2011Sep2820:29:1950367222140fontworker

Near the end I opened up a new tab in Mozilla Firefox and another in Google Chrome. You can see a fundamental difference: Chrome creates a new process for that tab, Firefox doesn’t.

execsnoop has various options, use -h to list them all. I used -v above, to print the time.

4. opensnoop

This traces file opens and prints various details, including the time and error code when using “-ve”. I usually use it to look for failed opens, which can be a sign of misconfigured applications. Discovering their config files and resource files can also be useful.

The files opened to performed this can be seen by the “SystemUIServer” process, beginning with a Preferences file and then some PDFs. Huh? PDFs? What are those PDFs?

That’s bizarre. I didn’t know that those tiny icons were implemented as PDFs!

5. dtruss

The previous two tools, opensnoop and execsnoop, operate by tracing specific system calls. A system call (or “syscall”) is what an application performs to request the operating system kernel to perform privileged work, including process creation, file operations, and other I/O (eg, disk or network I/O). Syscalls are a great target for analysis with DTrace, since examining them often provides a pretty good picture of what an application is trying to do. They can also provide useful details including byte counts, file and process names, error codes, and latency.

The dtruss tool traces all types of system calls, which is very useful for general debugging, especially since Mac OS X doesn’t come with a standard syscall tracer (like Linux’s “strace” or Solaris’s “truss”). One advantage of dtruss over those other tools is that dtruss can trace multiple processes at the same time, matching on the process name “-n”. For example, tracing Firefox via its process name “firefox-bin”:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

Brendan-2:~brendan$<b>sudo dtruss-nfirefox-bin</b>

PID/THRD SYSCALL(args)=return

67254/0x720286:write_nocancel(0x7,"8\0",0x1)=10

67254/0x720286:lseek(0x24,0x75500,0x0)=4805120

67254/0x720286:read_nocancel(0x24,"\0",0x200)=5120

67254/0x720286:write_nocancel(0x7,"8\0",0x1)=10

67254/0x720286:lseek(0x24,0x73B00,0x0)=4738560

67254/0x720286:read_nocancel(0x24,"\0",0x200)=5120

67254/0x746a81:select_nocancel(0x3D,0xB04909B8,0xB0490938)=10

67254/0x746a81:read_nocancel(0x6,"8\0",0x400)=10

67254/0x746a81:recvfrom_nocancel(0x36,0xB0490C3F,0x1)=-1Err#35

67254/0x746a81:select_nocancel(0x3D,0xB0490998,0xB0490918)=10

67254/0x746a81:sendto_nocancel(0x36,0x20A51008,0x292)=6580

67254/0x720286:write_nocancel(0x7,"8\0",0x1)=10

67254/0x720286:lseek(0x24,0x74000,0x0)=4751360

[...]

The output will be many pages, as applications commonly make frequent system calls.

dtruss can also launch and trace a program. Here the humble “ls -l” command was traced:

dtruss supports various options, including the printing of system call timing for use when analyzing performance.

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

Brendan-2:~brendan$<b>dtruss-h</b>

USAGE:dtruss[-acdefholLs][-tsyscall]{-pPID|-nname|command}

-pPID# examine this PID

-nname# examine this process name

-tsyscall# examine this syscall only

-a# print all details

-c# print syscall counts

-d# print relative times (us)

-e# print elapsed times (us)

-f# follow children

-l# force printing pid/lwpid

-o# print on cpu times

-s# print stack backtraces

-L# don't print pid/lwpid

-bbufsize# dynamic variable buf size

eg,

dtruss df-h# run and examine "df -h"

dtruss-p1871# examine PID 1871

dtruss-ntar# examine all processes called "tar"

dtruss-ftest.sh# run test.sh and follow children

I’ve used the “-e” option many times to figure out what’s slowing down an application. It shows the elapsed time for the system call in microseconds:

Shell

1

2

3

4

5

6

7

8

Brendan-2:~brendan$<b>dtruss-els-lhfsslower.d</b>

-rwxr-xr-x1brendan staff1152Jan282011hfsslower.d

ELAPSD SYSCALL(args)=return

245mmap(0x10000B000,0x2000,0x5,0x12,0x3,0x7FFF00000001)=0xB0000

11mmap(0x10000D000,0x1000,0x3,0x12,0x3,0x7FFF00000001)=0xD0000

10mmap(0x10000E000,0x1F10,0x1,0x12,0x3,0x7FFF00000001)=0xE0000

31open("/dev/dtracehelper\0",0x2,0x7FFF5FC45370)=30

40__sysctl(0x7FFF5FBFD5B0,0x2,0x7FFF5FBFD5DC)=00

I then look down the “ELAPSD” column for the largest times.

A dtruss FAQ is: why is the output in the wrong order?

As with other DTrace “tracing” style tools (that print lines of output as they occur) the output can be shuffled slightly due to CPU buffering on multi-CPU systems. The “-d” option for relative times (since program start) is useful for cases where the output order is important. You can then post-sort by that column so that the dtruss output is in the correct order. (For example, by redirecting the output to a file, then using the sort(1) command on that column.)

Ryan Dahl (creator of node.js) had been using dtruss so frequently recently that he made some enhancements and posted a newer version on github here.

6. soconnect_mac.d

This script is from the DTrace book, and traces outbound TCP connections along with details:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

Brendan-2:~brendan$<b>sudo./dtbook/Chap6/soconnect_mac.d</b>

PID PROCESS FAM ADDRESS PORT LAT(us)RESULT

45343firefox-bin270.32.100.166443735Inprogress

65002TweetDeck2199.59.148.2018094Inprogress

65002TweetDeck2208.93.137.538089Inprogress

54079Google Chrome297.74.215.1438076Inprogress

45343firefox-bin297.74.215.14380752Inprogress

45343firefox-bin297.74.215.1438088Inprogress

45343firefox-bin297.74.215.1438085Inprogress

45343firefox-bin297.74.215.1438039Inprogress

45343firefox-bin297.74.215.1438022Inprogress

45343firefox-bin297.74.215.1438020Inprogress

27ntpd217.151.16.2012324Success

[...]

It’s a quick way to find out what applications are connecting to whom on the Internet. The script is here.

Many connections these days are “non-blocking”, so the latency appears low and has the result “In progress”. The actual connection completes sometime later (the timing can be traced using some more DTrace, when desired).

There is a companion tool, soaccept_mac.d, which shows inbound TCP connections. If you ran both regularly, you may catch something performing networking that shouldn’t be (e.g., spyware, virus, …).

7. errinfo

This tool provides a summary of which system calls were failing, showing the process name, error code, and short description of the error:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

Brendan-2:~brendan$<b>sudo errinfo-c</b>

Tracing...Hit Ctrl-Ctoend.

^C

EXEC SYSCALL ERR COUNT DESC

TweetDeck lstat6421No such fileordirectory

TweetDeck select91Bad filedescriptor

TweetDeck stat6421No such fileordirectory

cupsd unlink21No such fileordirectory

firefox-bin connect_nocancel361Operation now inprogress

launchd mkdir171Fileexists

launchd open_nocancel21No such fileordirectory

[...]

Google Chrome035

Google Chrome035

mdworker getattrlist224No such fileordirectory

mdworker mkdir1324Permission denied

TweetDeck access233No such fileordirectory

TweetDeck read3535Resource temporarily unavailable

mDNSResponder recvmsg3538Resource temporarily unavailable

mdworker stat64247No such fileordirectory

gpg-agent read_nocancel3560Resource temporarily unavailable

thnuclnt ioctl2560Inappropriate ioctl fordevice

ntpd __pthread_canceled2269Invalid argument

ntpd sigreturn-269

ntpd sigsuspend469Interrupted system call

Google Chrome00

activitymonitor proc_info31495No such process

This can be another quick way to track down failing or misconfigured applications. Note that (usually) most system call errors are not a problem: it can be normal for system calls to fail, and the calling application handles that condition correctly. errinfo is particularly useful when the application didn’t handle the failure correctly, and the system calls are silently failing.

8. bitesize.d

This is a simple DTrace script that characterizes the disk I/O workload, showing a distribution of the size of the I/O in bytes along with the application name:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

Brendan-2:~brendan$<b>sudo bitesize.d</b>

Tracing...Hit Ctrl-Ctoend.

^C

PID CMD

31502vim\0

value-------------Distribution-------------count

2048|0

4096|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@4

8192|0

16384|@@@@@@@@1

32768|0

54079Google Chrome\0

value-------------Distribution-------------count

2048|0

4096|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@18

8192|@@@@@@@4

16384|0

86143bsdtar\0

value-------------Distribution-------------count

2048|0

4096|@@@19

8192|@@@@23

16384|@@12

32768|@@@@@@@@@@@@@@@@@@@@@@@151

65536|@@@@25

131072|@@10

262144|@@10

524288|@8

1048576|3

2097152|0

The “value” column shows the minimum size of the I/O in bytes, and the “count” column shows the number of I/O in that range: between the minimum size and the next size shown.

To follow an example: the “bsdtar” command shown above performed 151 disk I/Os that were between 32768 bytes and 65535 bytes in size (32K to 64K), which was where most of its I/O fell. The text plot in the middle provides a visualization for the distribution of the events, which shows a spike that corresponds to this 32K to 64K range. Google Chrome did a couple of dozen I/O between 4 and 16 Kbytes.

Disk I/O is an expensive operation, so it’s usually preferable for it to be larger in size to improve the throughput to the device.

9. iotop

This presents the same data as iosnoop, but in a summarized way similar to top(1). It’s handy when disk I/O is so frequent that iosnoop is too verbose, and you want a high level summary of which process is rattling the disks. Here I’ve used “-CP” to not clear the screen and provide a rolling output (-C), and to show disk busy percentages (-P):

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

Brendan-2:~brendan$<b>sudo iotop-CP1</b>

Tracing...Please wait.

2011Sep2719:33:11,load:0.74,disk_r:0KB,disk_w:57344KB

UID PID PPID CMD DEVICE MAJ MIND%I/O

06666166312cp??142W98

2011Sep2719:33:12,load:0.74,disk_r:4KB,disk_w:55752KB

UID PID PPID CMD DEVICE MAJ MIND%I/O

06666166312cp??142W93

2011Sep2719:33:13,load:0.68,disk_r:60KB,disk_w:54464KB

UID PID PPID CMD DEVICE MAJ MIND%I/O

06666166312cp??142R11

06666166312cp??142W88

2011Sep2719:33:14,load:0.68,disk_r:2560KB,disk_w:32928KB

UID PID PPID CMD DEVICE MAJ MIND%I/O

50365002140TweetDeck??142W0

010launchd??142W1

06666166312cp??142R11

06666166312cp??142W71

The “??” is for where some details aren’t yet available to DTrace, for the same reason as was described for the iosnoop script (if you are a programmer, you can see what needs to be fixed in the /usr/lib/dtrace/io.d file).

If you get “dynvardrops”, sorry: back when I wrote this (6 years ago), I never saw a disk workload heavy enough to cause these. A remedy can be to increase the dynvarsize tunable by adding the following line:

Shell

1

#pragma D option dynvarsize=16m

near this one:

Shell

1

#pragma D option quiet

in /usr/bin/iotop. You may want to make a backup of iotop first (iotop.orig) or copy it somewhere else to run it.

Interesting! While tracing I saved the file I was editing in vim, which is seen in the last four lines. This tells me that vim is creating and removing temporary files as part of the save process.

Also note that it looks like TweetDeck created the file twice before removing it (two VNOP_CREATEs followed by a VNOP_REMOVE). This isn’t the correct order, which can be seen by examining the TIME(ms) column. A side-effect of DTrace’s negligible performance impact design is that output can be slightly shuffled due to the way it collects data from per-CPU buffers. I often include a TIME column like that one, not just for the usefulness of knowing time, but also as a means to post sort the output.

Other Scripts

There are other tools ready-to-go in Mac OS X. If you run the “man -k dtrace” command in a Terminal (or “apropos dtrace”), you’ll see the following output:

And More…

DTrace isn’t just about running scripts; you can write your own custom scripts, run one-liners, and use higher-level tools that use DTrace behind the scenes, like Apple’s Instruments and Joyent’s Cloud Analytics. And it’s for more than just your MacBook: if you are using servers that have DTrace available, you can use it to diagnose their performance and issues too, including tracing kernel and application code. I regularly use it to see how a MySQL database interacts with a kernel file system, for example.