May 6, 2011

I stumbled across this little tutorial I wrote back in the mists of time, probably around 1996 or 1997. And it was based on a tutorial I wrote at Visix, probably in 1993 during one of our Optimization Parties. It describes how to read the output of gprof, a profiling tool available on most unix systems. It’s even still there on Mac OS X. So you kids with your fancy Shark and Instruments, here’s what some of your elders used.

gprof is not a GNU tool, even though it has the leading “g”. That “g” probably stands for “call Graph” profiler. You’ll need to check your system’s documentation (e.g. man gprof) for exact instructions on getting gprof to work, but usually it just involves compiling and linking with -pg, running your program, and doing gprof gmon.out > oopack.

Here’s a 300K sample of output from gprof on the Dec Alpha if you want to take a look at it. This particular report is from a run of AOLServer 2.2.1 which involved fetching index.html 53,623 times. The links that follow go to anchors in that 300K sample. What was I wanting to profile? I wanted a gut check to make sure that life in the server was sane, and if there were any obvious bottlenecks that maybe I could address if I had the time. The test was to fetch index.html over and over again. In this case, around 53,000 times

I go to the flat profile section when I first start looking at gprof output. The big time consumers are usually pretty obvious. You’ll notice that each function has a [number] after it. You can search on that number throughout the file to see who called that function and what functions that function calls. Emacs incremental search is really nice for bouncing around the file like this.

Out of 21.05 seconds of total clock time, Ns_DStringNAppend consumed about 4 seconds, or about 18% of the time in and of itself. It was called 13 million times.

MakePath consumed one and a half seconds itself, and its children consumed three and a half seconds. At least one individual call to this consumed 0.01, and at least one individual call took a total of 0.03 seconds in MakePath and its children.

Handy tip – the function numbers in brackets are approximately sorted by time consumption, so a function with a [low number] will generally be more interesting than one with a [high number].

Now that you know that Ns_DStringNAppend is called a bunch of times, this could be a useful target for optimization, I’d look at its entry in the call graph section.

Before doing that, just for illustration, take a look at AllocateCa [33] since it has all of the interesting pieces of the call graph in a more compact size:

The entries above AllocateCa [33] are the functions that call AllocateCa. The entries below that are the functions that AllocateCa calls. There are two numbers separated by a slash: the first number is the number of calls that the function has made, while the second number is the total number of invocations of that function.

In other words, for 160866/321890 Ns_DStringVarAppend [30], AllocateCa called Ns_DStringVarAppend 160866 times. Across all of AOLServer, Ns_DStringVarAppend was called 321890 times.

Similarly, for 53622/160866 Ns_CacheNewEntry [62], means that Ns_CacheNewEntry called AllocateCa 53622 times, and AllocateCa was called 160866 times total.

So, just by looking at this snippet, you know that the three Ns_Cache functions each call AllocateCa about once per serving of index.html, and that AllocateCa makes a single call to Ns_DStringVarAppend, Ns_DStringFree, etc… each time. What’s also interesting to note is that someone is calling Ns_DStringFree more than Ns_DStringInit. This may be (or may not) be a bug in AOLServer. You can go see Ns_DStringInit and Ns_DStringFree yourself and track down who the culprit is.

The floating “3.0” on the left is the percent of total time that the function consumed. The two columns of numbers are the amount of time (in seconds) that the function consumed itself (AllocateCa took 0.11 seconds of time total to run its own code) and the amount of time in the function’s children (0.53 seconds were spent in its children)

Getting back to real analysis of DStringNAppend, you can see that MakePath made 50% of the Ns_DStringNAppend calls. Since you know that there were 53623 fetches of index.html, that means that for each page, MakePath was called twice, and for each call to MakePath, Ns_DStringNAppend was called 64 times.

If one call to MakePath could be elided (since it’s getting called twice), or if fewer than 64 Ns_DStringNAppends could be done per call, we could see a performance boost.

Just browsing the gprof output can be an illuminating exercise. If you have a gut feeling that a particular function is a hot spot (say, Ns_LockMutex [43]), you can see the call graph for that function, see if it’s consuming lots of time, or if it’s being called a whole bunch. Here it was called 1,341,534 times, or about 25 times per page serve. Maybe that’s high. Maybe that’s right. Sometimes a suspected culprit isn’t there, or you find a surprising time waster.

Because this sample gprof output was done on a Dec Alpha system, there was some suckage involved, such as no explicit time recorded for system calls. So we don’t know if, for example, select() blocked for a long time on each call.

April 3, 2011

Mike Ash’s recent Friday Q&A about signals mentioned SIGWINCH, the hearing of which always sends me down memory lane. My first professional bug was centered around SIGWINCH. By “professional bug”, I mean a bug that someone paid me actual money to fix during a period of employment.

I went to work for a company called Visix straight out of college in the early 90’s, which at the time sold a product called Looking Glass, a file browser much like the Macintosh Finder but for Unix. Eventually Looking Glass would become the Caldera Linux desktop. Looking Glass supported the major graphical windowing systems of the time: X11, Intergraph’s Environ V, and Sun’s SunView. The image at the top of this posting is the only screen shot I could find of the version of Looking Glass I worked on running on SunView. Notice the awesome desktop widgets at the top. That was typical SunView style, so Looking Glass was pure awesome eye candy in comparison.

I was hired for the tech support team, and our duties were phone support (typically debugging network configurations and X server font paths) and porting Looking Glass to other platforms. Being the Lo Mein on the totem pole I got given the old platform nobody wanted to touch any more: SunView.

SunOS 4.1.X had just come out, and Looking Glass would hang randomly. It worked fine on 4.0.3. My job was to find and fix this hang. This was my first introduction to a lot of things: C, unix systems, windowing systems, navigating large code bases, conditional compilation, debuggers, vendor documentation that wasn’t from Apple, working in a company, and so on. Luckily the SunView version didn’t sell terribly well any more because everyone was moving to X11, but there were a couple of customers bitten by this problem.

So what is SunView? SunView is a windowing system: different programs run displaying graphical output into a window. Nowadays that’s commonplace, but back when SunView came out it was pretty cool. SunView was one of the earlier windowing systems,so it had a bunch of peculiarities: the biggest was that each window on the screen was represented by an honest-to-god kernel device. /dev/wnd5 is a window, as would be /dev/wnd12. There were a finite number of these window devices, so once the system ran out of windows you couldn’t open any more.

There was a definite assumption of “one window to one process” in SunView. Your window was your only playground. Looking Glass was different because it could open multiple windows. Because of the finite number of windows available system-wide, we had to create the alert that said “You can’t open any more windows because you’re out of windows” at launch time, thereby consuming a precious window resource, and hide it offscreen. It was the only way we could reliably tell users why they couldn’t open any more windows. Glad I wasn’t the one that had to make this work in the first place. I was just fixing Legacy Code.

The other peculiarity is that you never got window events. Even in the 1.0 version of the Macintosh toolbox you could easily figure out if the user dragged the window, or resized it, or changed its stacking order. In SunView you just got a signal. SIGWINCH, for WINdow CHange, and hence the memory-lane trigger. The user moved a window? SIGWINCH. The user resized it? SIGWINCH. The user changed the z-order? SIGWINCH.

With just one window that’s not too bad. Just query your only window for its current size. For us, though, we had to cache every window’s location, size, and stacking order. Upon receipt of a SIGWINCH we would walk all of our windows and compare the new values to the cached version. If something interesting changed we would need to do the work of laying out the window’s contents.

So, back to my bug. It took me a solid month to fix. All this time I thought I was a failure and was worried I’d get fired. That would be embarrassing. It took so long to fix because it was part time work in amongst my other responsibilities, and also because it was difficult to reproduce. Spastic clicking and dragging could make it lock up, but not reliably. Using the debugger was pointless – a 4 meg Sun 3/50 swapped for two hours as dbx tried to load Looking Glass. I ended up using a lot of caveman debugging.

The application event architecture we used is shown right up there. Each window had an event queue (remember that one window to one process assumption) that held all of the mouse and keyboard events. Upon receipt of new events (I forget if we got a signal for that, or if some file descriptor became readable), we would walk our windows: read each event, handle it, then move on to the next window.

I was getting some printouts, though, showing an window receiving mouse-downs and mouse-drags, but no mouse-up. Occasionally I would see a mouse-up, with no mouse-downs. Ah-ha! The mouse-up was being delivered to the wrong window’s event queue, probably due to some race condition down in the system that didn’t notice the current window changed during the drag. The fix was easy once I found it : just merge the events from all the windows first, and then process them. Happiness and light.

It was then I learned how expensive malloc is. I malloc’d and free’d event structures, but performance was dog-slow, especially during mouse drags. Caching the structures made life fast again.

Memories like these make me so happy with the cool tech we get to play with these days.