Teething Troubles: Assigning blame for pagefaults

I have been able to get precise file-backed page fault logging (SystemTap on Linux, xperf on Windows) for a while. It is incredibly useful to see exactly how Firefox is being loaded from disk. From there I deduce what is causing the IO, try to make improvements, and measure whether I accomplished anything.

Unfortunately, a mere IO log requires a lot of pondering about why IO is happening. It would be so much easier if one could just get a report of every IO operation plus an application backtrace to easily identify the cause. I was having trouble figuring out why some of my optimizations were not having the impact I expected, so I embarked on adding a backtrace to my log.

XPerf Fail

I fed xperf my Firefox symbols hoping this would plop stack traces next to my faults, but no such luck. It records backtraces in just about every probe, except for the “hard faults” probe I care about. I wonder if a custom perf probe could log what I want.

Perf Fail

Some prominent kernel hackers have long been complaining about OProfile/SystemTap/NIH performance monitoring tools. They finally produced a perf tool (it’s like they tried to make it hard to google for; it does not have a proper website, and real men read the source and skim LKML archives) to do profiling the Linux way (tm). I might be wrong, but so far it appears to be a functional equivalent of Microsoft’s xperf minus the nice UI.

It turned out that my Fedora 2.6.32 perf implementation is too buggy to even log pagefaults. Apparently this works in the current Linus kernels. I’m not completely sure, but it looks like even if perf pagefault logging worked, it would be pretty neutered: it does not appear to be able to log file offsets or stacks next to pagefaults.

I think perf could be fixed to log IO and accompanying userspace backtraces. There are some talented folks contributing to it. However, I think that the pre-canned analysis model sucks. It is useful for building sophisticated versions of top, but when you really need to dig into what’s causing a particular issue, it really sucks to be restricted to what the developers foresaw as useful.

SystemTap

As awesome as the kernel side of SystemTap is, I keep running into userspace bugs and limitations. Getting userspace stacks for the large collection of large libraries that Firefox relies on has been a SystemTap-bug-finding affair. I can occasionally get useful userspace stacks for userspace probes, but apparently recording a userspace stack from a kernel probe is a hard problem that SystemTap devs haven’t fully addressed yet.

Luckily, SystemTap provides a uaddr() function which appears to get correct addresses from my kernel probes (which is way more than the other tools offer). Unfortunately, usymname() fails to resolve those addresses.

As a workaround, Jim Blandy suggested turning off address-space randomization so I can log uaddr() and resolve the values in gdb. I’ve been manually printing this with gdb’s “p/a <addr>” command until recently.

I still don’t have backtraces, but at least now I have the name of the function that’s causing trouble. This is surprisingly useful already. One can now easily tell how much of startup is being wasted on relocations (dlopen() in a prelinked binary!). Another obvious one is the harm of single-page COW faults to zero .bss (memset entries in the log). Turns out sprinkling initializers all over the binary is a bad idea. Looks like there are significant performance wins to be had with a bit of ‘easy’ compiler/linker hacking.
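Once every fault has a function name attached, ranking the offenders is a few lines of Python. This is only a sketch: `summarize_faults` is a made-up helper, and it assumes the log has already been reduced to one resolved symbol name per fault.

```python
from collections import Counter

def summarize_faults(symbols):
    """Return (symbol, fault_count, share_of_all_faults) rows, most frequent first.

    `symbols` is assumed to be one resolved function name per logged page fault.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]

if __name__ == "__main__":
    # Illustrative input only, not real measurements.
    for name, n, share in summarize_faults(["memset", "dlopen", "memset"]):
        print("%-20s %5d %6.1f%%" % (name, n, share * 100))
```

A table like this makes it easy to see whether, say, memset or dlopen dominates the fault log before and after a change.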

All of the above problems are really obvious and would’ve been fixed a long time ago if it was easier to get at this information. Unfortunately, there is still a lot of room for improvement in developer tools.

Update:

Sounds like I can use addr2line instead of gdb.
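Here is a rough sketch of how the logged addresses could be resolved in one batch with addr2line instead of typing them into gdb one at a time. The helper names are my own invention. It assumes the addresses are already in the form addr2line expects for the given binary (for a shared library that generally means relative to its load base), and that address-space randomization is off so those values are stable between runs.

```python
import subprocess

def build_addr2line_cmd(binary, addresses):
    """Build an addr2line command line that resolves many addresses in one run."""
    # -e selects the binary, -f prints function names, -C demangles C++ names.
    return ["addr2line", "-f", "-C", "-e", binary] + [hex(a) for a in addresses]

def parse_addr2line_output(addresses, output):
    """Pair each address with the (function, file:line) lines addr2line printed."""
    lines = output.splitlines()
    # With -f, addr2line prints two lines per address: function, then file:line.
    pairs = zip(lines[0::2], lines[1::2])
    return dict(zip(addresses, pairs))

def resolve(binary, addresses):
    """Run addr2line (must be installed) and return {address: (function, location)}."""
    cmd = build_addr2line_cmd(binary, addresses)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_addr2line_output(addresses, out)
```

Calling `resolve()` with the path to a library and a list of logged addresses should give the same names that gdb’s p/a prints, without the manual step.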

This entry was posted on Monday, May 24th, 2010 at 10:52 am and is filed under Uncategorized.

I guess one of the problems in that tool space is that most people are already busy enough getting data on their own code and fixing it that they don’t want to go and contribute to yet another project to improve those tools.

I’m probably far too late to the party (was on vacation), but I have a python script that I use for my systemtap fun that deals with the address space randomization by reading /proc/PID/maps (having been told PID) and nm and some caching to do what you want in realtime as a form of stream processing. I also use it to perform some mangling like removing the vtable crap.
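The /proc/PID/maps part of that approach might look roughly like the sketch below. It is not the commenter’s actual script. The names `parse_maps`, `locate` and `read_maps` are hypothetical, and the nm lookup and caching are left out. It only turns a runtime address into a (file, offset) pair, and it assumes the file offset is close enough to the address the symbol table uses, which is typical for ordinary shared libraries but not guaranteed.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start end offset path")

def parse_maps(text):
    """Parse the contents of /proc/PID/maps, keeping only file-backed mappings."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no pathname column
        addr_range, _perms, offset, _dev, _inode, path = fields
        if not path.startswith("/"):
            continue  # pseudo-entries such as [heap], [stack], [vdso]
        start, end = (int(x, 16) for x in addr_range.split("-"))
        mappings.append(Mapping(start, end, int(offset, 16), path.strip()))
    return mappings

def locate(mappings, addr):
    """Return (path, offset into the file) for a runtime address, or None."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m.path, addr - m.start + m.offset
    return None

def read_maps(pid):
    """Read and parse the live mappings of a running process (Linux only)."""
    with open("/proc/%d/maps" % pid) as f:
        return parse_maps(f.read())
```

Because the mappings are read from the running process, this works even with address-space randomization left on.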