Xperf is a powerful (and free) Windows profiler that provides amazing insights into what is constraining the performance of Windows programs. Xperf includes a sampling profiler and a very capable viewer of the sampled call stacks (see below for UIforETW instructions).

Good news! As of summer 2016 WPA includes the ability to display flame graphs natively! This is far more flexible and faster than the cumbersome export/generate process discussed here. See this article for details, and stop reading this article.

However the numerical and text-based nature of the xperf stack viewer is sometimes not ideal. Sometimes a graphical summary would be better. In this post I show how to create Flame Graphs (like the one to the right) that graphically visualize xperf sampling data.

The normal way to explore the sampling data in an xperf capture is to use a summary table with the Stack column to the left of the orange grouping bar. This makes it easy to drill down along the hottest stack.

Unfortunately, while this does nicely highlight the stack which is the largest contributor to CPU consumption, the main highlighted stack only contains 27 samples out of a total of 637 on this thread. The siblings contribute another 23 samples, but it’s still hard to see the big picture. Such is the nature of displaying data primarily through text and numbers.

Flame Graphs are an alternate way of displaying call stacks, popularized by Brendan Gregg on his blog, in posts such as this one.

Any type of weighted call stack can be converted to a Flame Graph so I decided to write a Python script that would convert ETW traces to collapsed stack files that could be fed to Brendan’s flamegraph.pl script. A Flame Graph of the sampling data above gives an alternate way of visualizing what is going on:

The vertical axis is the call stack, and the horizontal axis represents how much time was spent along each call stack.

Creating flame graphs from ETW data is easier than ever before with UIforETW. Starting with v1.19 the UIforETW releases contain basic flame graph support – just select a trace and select Scripts-> Create flame graph from the context menu. Okay, don’t forget to make sure that python and perl are installed and Brendan’s flamegraph.pl script is in the UIforETW bin directory. Additional features (selecting what to flame graph) may come later.

It is immediately viscerally obvious that the majority of the time is spent along two call stacks – the largest one rooted in XapiWorkerThread::ProcessSyncTask and a smaller one rooted in XapiWorkerThread::ProcessASyncTask. The full name of the second one isn’t visible in the screen shot above, but when you view the VisualStudio.svg file then when you hover over an entry it shows the full name. Firefox, Chrome, and IE 10 can all display the .svg files.

How does it work?

The collapsed stack format that is used as an input to flamegraph.pl is easy enough to understand. Each line represents a call stack with individual entries separated by semicolons, and each line ends with a space and a number representing the count or weight of the line. This text file can even be used with grep to do some basic exploring of the call stacks, although the length of the lines usually makes this unwieldy.

The script to convert xperf traces to collapsed stack files has been rewritten to use wpaexporter which makes the process much simpler and cleaner. The export process is slightly faster, and the processing of the exported data is far faster as there is an order of magnitude less data to parse.

The next section of this post is now obsolete and is only retained for historical interest. The new script is clean enough that, with the wpaexporter post linked above, it should be easy to understand.

To convert xperf files to the collapsed stack format I made use of a few ‘lightly documented’ and unsupported features of xperf.

You can get some hints about how they work by typing xperf -help processing. None of the specific processing commands gave me what I wanted so I used the generic dumper command. The syntax for this, expressed in Python, is:

The arguments are the input ETL file, the output text file, and the beginning and ending of the time range in microseconds.

The format of the output is only documented through text lines at the top of the output file which list the names of the fields for each of the types of output data. The first type of data that we care about is SampledProfile:

Unfortunately these blocks of data are disjoint and there is no documentation on how to associate them. It appears that the only way to do this is to use the common information which is the TimeStamp and ThreadID. So, my Python parser scans the xperf results looking for SampledProfile and Stack data. When it sees a SampledProfile event it says “Hey, this TimeStamp and ThreadID is associated with this Process Name(PID). Watch for it”. When a blob of stack data shows up the script uses the TimeStamp and ThreadID to associate the data with a process. It’s crude, and doesn’t produce identical results to the WPA UI, but it appears to be close enough to be useful.

The rest of the script is just data manipulation – dictionary lookups, sorting, spindling, and other boring details.

I currently have my Python script configured to summarize just the busiest thread as collapsed call stack files, but that can easily be configured, eventually through UIforETW.

Summary tables versus Flame Graphs

When all of the CPU consumption is along a single stack then drilling down into the WPA summary table is straightforward. However if the samples are spread out over a couple of call stacks then it can take a while to drill down and find the significant parts of both call stacks. If there are a half-dozen significant call stacks then summary tables are quite unwieldy.

In the example below, from Outlook when my entire computer was having a bad day, there were four thousand samples on the main thread, almost all of them in ClearSendMessages. However they are on many different call stacks. In the screen shot below we can see two of the dozen or so call stacks, accounting for fewer than a thousand of the four thousand samples. It took several minutes of exploring to realize that this one function was accounting for almost all of the CPU time:

The Flame Graph makes it obvious. The many irrelevant functions are squeezed horizontally and are easy to ignore. The expensive functions are wide horizontal lines and you can instantly look at the top of the widest columns and realize that the same function is always at the top:

The three wide areas at the top are all ClearSendMessages, and so is the one at the left, and this is plainly obvious, instead of being hidden amongst a sea of text. You can see more details in the original Outlook.svg file.

Kernel callbacks

Initial versions of this graph had severely truncated call stacks due to a rarely seen detail of how call stack collection happens during kernel callbacks. Stack walking cannot cross the kernel boundary (they use different stacks, and may even have different pointer sizes) so the collected stacks are cut off at the kernel boundaries. The xperf trace viewers hide this by merging in the other portions of the stack, and in order to get the dumper action to do this you have to specify the -stacktimeshifting option (thanks to Kyle in the comments). Since WindowProcs are kernel callbacks it means that this call stack coalescing is needed more frequently than you might think.

It appears that stack collection on Linux using perf behaves identically in the kernel, with [VDSO] entries showing up as a separate stack root.

Summary

Flame Graphs are clearly a useful alternate way of visualizing stack data. If this sort of graphical summary was built in to WPA it would make some performance issues easier to understand. Currently if you want to use Flame Graphs you have to find an area of interest where CPU consumption is the issue and then run the scripts, passing the time range along to them.

The most likely use cases I can see for this is doing reports – visual summaries of performance problems found – or helping in finding regressions. I can certainly imagine a nightly performance test which creates Flame Graphs for key parts of a process so that the causes of regressions can be seen visually.

Other things that would need to be done to make these truly useful would include:

Create scripts for dumping idle time stacks, perhaps in the same graph as the CPU consumption but in a different color

Integrate Flame Graphs into WPA to make it easy to get a graph of the current time range being explored

In most cases the ability to interactively explore the xperf trace data in WPA – zooming in and out and seeing many types of data on the timeline – will outweigh the benefits of generating Flame Graphs outside of WPA. However I anticipate using them on some investigations in order to get a better high-level overview of what is happening, and to more easily share findings with those who aren’t as accustomed to reading xperf’s call stacks.

I hope that some day Microsoft integrates this type of visualization, either directly or though allowing plugins to the WPA UI.Done!

Share this:

Like this:

LikeLoading...

Related

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs.
I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8.

…then you get each stack printed on a single line, comma-separated. Much easier to process. (There may be issues if stack frames contain commas, not sure what the behavior would be in that case.)

Second, you can assume that stacks are printed right after the event they’re logged for. XPerf restores or provides as much associativity as it can, deterministically, because ETW doesn’t associate these stacks with the events any more than through them having the same timestamp and CPU ID. XPerf virtualizes the thread ID through context switch events, I believe.

Third, the reason why stacks are logged in separate blocks (and, by default, printed in separate blocks) is because you’re looking at stacks on events logged from elevated IRQL, in which case the stackwalker cannot traverse pageable stacks so ETW queues a DPC/APC to finish the job. It’s more efficient to log these in separate events and it also buys you some ad-hoc stack compaction: if multiple kernel events are logged before a thread returns to user-mode ETW will neither walk nor log identical user stacks.

Fourth, interleaving of 32-bit and 64-bit frames is a touchy subject but it isn’t the reason for the splitting of things above. Note that xperf will not be able to properly interleave 32-bit and 64-bit stacks; it’ll print the entire 64-bit stack first, then the 32-bit stack. This generally works out just fine for 32-bit apps unless you’re looking at kernel upcalls with 32-bit code calling 64-bit code which calls back to 32-bit code. The stacks just end up looking odd and you have to sort it out by hand.

Finally, the reason why xperf.exe’s dumper defaults to non-“timeshifting” stack dumping is that in order to perform said “timeshifting” xperf needs to store all the stacks of the trace before xperf can dump it and that can be very expensive, memory-wise. The default configuration of the trace dumper is designed to operate with a minimal working set that doesn’t have to store much of anything about the trace so that xperf can still dump many-GB traces to text. (It still needs to take two passes to facilitate process/thread/image name and symbol resolution but the amount of data that is stored for that is small.)

1) I’ll try -target machine. Some call stacks definitely contain commas such as “CRefCounted1<IRefCounted,CRefCountServiceBase<1,CRefMT> >::Release” so I’ll have to test to see whether it handles them rationally.

2) I’m not sure what you mean by “stacks are printed right after the event they’re logged for”. At the very least I have often seen twelve SampledProfile events (one per hardware thread) followed by twelve stacks. I thought there was other interleaving going on as well. Then again, I haven’t revisited this with -stacktimeshifting enabled — I just added that last night. I do agree that the stacks aren’t delayed for arbitrarily long, but it also wasn’t obvious how long the delays could be.

3) That makes sense.

4) Isn’t 32-bit code calling 64-bit code which calls back to 32-bit code quite common? Most applications are still 32-bit and most Windows installs are 64-bit so it seems like every WindowProc hits this situation. The Outlook trace is the first one I’ve examined that has this situation and since I don’t have symbols for Outlook.exe I can’t tell for sure how well -stacktimeshifting fixed it. Luckily in games we don’t tend to do a lot of work in WindowProcs so it probably won’t affect us much.

5) It would be nice if the processing commands were better documented. Having to manually specify -stacktimeshifting is fine, but it remains frustrating that this option, like so much of xperf’s awesomeness, is essentially undocumented. I would never have guessed that it was helpful if Kyle hadn’t mentioned it. Searching for “xperf summary table documentation” is amusing, and “xperf stacktimeshifting” is worse.

This is great work! Thanks for the post and suggestions. I want Flame Graphs whenever I’m on Windows now (would be great if this was in WPA).

Something that’s clear to me from this post, is that this can also help a novice get to know the internals of software quickly. I’m not familiar with the target software here, but I have everything on one screen (mental picture), and can explore with the mouse (SVG) to follow the functions, understand their interactions, and see where the majority of work is done for this workload. It’s CPU samples only, but it’s a big start.

Using these for non-regression testing should help identify CPU regressions quickly. I’ve done this by having SVGs in different browser tabs and flicking between them, and looking for any growing mountains. Robert Mustacchi was working on a way to diff Flame Graphs, which could be useful for such testing, which he will blog about soon (http://dtrace.org/blogs/rm/; not there yet, but his blog does have other Flame Graph examples in the meantime). Zoomable Flame Graphs is another feature that might be worth developing.

Having Flame Graphs show memory allocations should be straightforward, and useful: instead of stacks and sample counts, it could be stacks and byte counts, and the Flame Graph code should handle it. Can also use “-title Flame Graph: Memory Bytes” to set the title (feature recently added).

I think what is really needed is being able to zoom in on the profile data in WPA (changing the time range and thread of interest) and having a Flame Graph for the new area of interest display immediately. Zooming in on the Flame Graph itself is less critical, but would still be nice, especially when looking at memory reports.

The main extra addition I’d like to see to flamegraph.pl is a way to have two or more sets of data with different color themes. I’d like to show CPU consumption in red, and CPU idle time in blue. I’m sure I can extract stacks for both and I think putting them on the same graph would work better if they were visibly separated.

A problem may be apparent: for multi-threaded applications, I find there is usually so much “cold” off-CPU time that the “hot” on-CPU time gets horizontally compressed in the visualization.

Solutions? One could be zoomable Flame Graphs, although once you start zooming you can loose track of the big picture (including being able to visually compare frame lengths with higher level functions). Perhaps there is a solution to that problem (dynamic visual legend). Another could be to show hot and cold graphs separately in the same SVG, both taking the full width, and providing details (a visual legend) to show the relative spans of each. I need to hack at it and find something that works.

I did this Hot Cold Graph using two mechanisms, both in DTrace: sampling stacks for the on-CPU time (coarse, but low overhead); and tracing kernel scheduler block events, with timing details and stack trace when the thread runs again. Tracing gives millisecond totals; sampling was converted to milliseconds by multiplying by the rate (rough). I mentioned this method in a recent talk as the “Stack Profile Method”: which differs from traditional profiling as we capture both on-CPU and off-CPU stacks (http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/).

My solution to the multi-threaded question is apparent in my xperf Flame Graphs — each one covers just a single thread. That felt ‘right’, and matches how I typically explore xperf sampling and idle-time data — I almost always group by both process and thread. Showing just a single thread means that idle plus execution equals elapsed.

Is a version of flamegraph.pl that supports the two different colors available? I could try hacking your Perl script to add support but Perl and I are not good friends.

I would render those graphs using logarithmic time. It will compress the relative differences, but you can still visualize them. Might be hard to dynamically pick a “good” power factor for the data, but it should be possible. Or make that a custom user option, or just do this – http://xkcd.com/1162/

BTW, you talk about identifying which thread readied another thread using heuristics — but usually this should not be necessary. You can record the ReadyThread which, in the case of SetEvent, LeaveCriticalSection, and other readying events will give definitive evidence as to which thread readied another thread.

For using context switches to calculate how long a thread is running I believe that the trick is that the context switch events record when a thread starts running. You then have to watch to see when *another* thread starts running on the same CPU in order to know when the first thread stopped running. To be really accurate you should also watch for time lost to interrupts and DPCs.

Yes I definitely want to use the ReadyThread event. But I was seeing that for longer periods of time threads did ready my thread from processes that did definitely have nothing to do with my current work. I assume that an arbitrary thread can ready me because when I am not giving up my quantum by e.g waiting I can run until my time slice quantum is used up. Later when I am scheduled again the last active thread did ready me.
When I do use context switche events it is also easy to visualize the wait time but as I said it is perhaps not terribly useful. For the graph I have found that reversing the notation gives much better results by printing the method name first and then the class+namespace and module later.
ComanyDll!Company.Infrastructure.Algo.Class.Method(int arg1, IInterface *pInt) would become
Method(int arg1, IInterface *pInt) Company.Infrastructure.Algo!ComanyDll
That way you can immediately see which methods are involved which is the most interesting thing anyway. Another useful thing is to throw away from the call stack everything that does not belong to my code. That gives much better insight what my code is causing to the system.

I believe that you are correct that you can be ‘readied’ by any thread if you were swapped out rather than waiting on an event. I think the same thing can happen when you resume after a call to Sleep(). It would be nice if it was better documented.

Throwing away data worries me — I have seen many problems where the issue was in somebody else’s code, and leaving gaps in the call stack makes them less logical.

I actually only use flame graphs occasionally. They are handy for visually summarizing how code is behaving, but in 95% of cases I use WPA alone to explore traces. Because I am looking at custom events, GPU usage, which window is active, disk I/O, how much idle CPU time there is, what my process is doing, and where my process is waiting, anything less than the full WPA interface means that I might be missing the true cause.

A huge part of the trace analysis process for me is finding the right region to look at, based on the data in many graphs, and until I do that I can’t possibly create a flame graph. And once I have located the region of interest I often don’t need to create a flame graph. I really want flame graphs to be displayed instantly, within the WPA interface, along with the other rich data.

That is, I only create a flame graph of CPU consumption when I already know that CPU consumption is the issue, and when I need a different view of the CPU consumption data. With flame graphs for idle time I would only create them when I already knew that CPU idle time was the issue, and only if I needed a different view.

But, perhaps your code will make creating them enough easier that I will create them more often.

Flame graphs are gorgeous, couldn’t be prettier, but forgive me, I’m a contrarian. My question is, if there’s a lucrative speedup lurking in the code, can it hide from the flame graph? If it can hide from the flame, does it matter? I’m afraid the answer is Yes, and Yes.

I would not advocate using flame graphs as the only way to explore performance data. My usual tool is ETW, including the excellent Windows Performance Analyzer. It lets me see expandable/explorable call stacks of sampled data. I can also find a suspicious function and get a view of all call stacks that that function shows up in, for functions that are expensive because they are called from many places.

The data can also be viewed without call stacks which is an alternate way to find expensive functions that are called from many places. You can have a flat list of functions, you can group them by module, thread, process, etc., and you can even see what addresses within a function the samples hit at.

And, because I record all context switches and because WPA has robust viewing of this data I can also see where all threads of all processes go idle, including the stacks they are waiting on, who is readying them, and what functions frequently end up waiting (perhaps on different call stacks).

And, unlike the technique of breaking in to the debugger (which is fine, but too low of a sampling rate for many purposes), I can ask customers to record ETW traces and then analyze them on my machine, and see cross-process interactions of bugs that I can’t even reproduce. I have fixed *dozens* of bugs that only occur on customer machines.

Flame graphs are not the ultimate solution, but they provide a useful visualization that can sometimes make a problem apparent. That’s enough to justify their existence. However I tend to spend more time using WPA’s other visualizations as they are more dynamic and avoid many of the limitations that you mention.

ETW looks great and is full of features. I can’t tell from the doc if it 1) allows one to take samples mainly during subjective slowness, 2) allows samples on random wall-clock time, not CPU time, 3) retains line-level or instruction-level resolution, not just function, at every level on a stack sample, 4) lets you relate each point on the stack back to its source code, so you can tell exactly why that moment in time was spent, 5) lets you actually see individual stack samples, so you can bypass all the summaries such as stack counting, sorting, and hot-path extraction that speedups can easily hide from.

I’m sure you’ve seen all my explanations of why high sampling rates are not good – namely that you pay for measurement precision by missing speedups. Also that software can contain several problems, and if you miss finding even one of them you pay a steep price – getting 4 out of 5 is not good enough. (Some people, I’m sure, are happy to be told they have no problems, but if they have competition they can’t be so careless.)

You have a good point about finding remote problems. I have worked mainly in terms of software development. These ideas could be extended to situations like that, but it’s not my day job 🙂

ETW is designed such that you can leave it running – tracing to circular memory buffers – 24×7. I have done that for months at a time. Then, where there is some slowness (subjective, detected by software, whatever) I can save the buffers, typically covering the last 20-60 seconds. The samples are done at a fixed rate. I usually set it to ~1.1 KHz to avoid aliasing with 1 KHz behaviors. You can crank it up to 8 KHz.

The exact address of every sample is recorded. Typically a call stack is recorded. The viewer is extremely flexible and you can always export the raw data so yes you can see individual samples/stacks. The default viewer doesn’t take you back to source code – just to functions and modules and addresses – but you could write an address to source mapper easily enough.

I don’t understand why high sampling rates aren’t good. Not a complete solution, sure, but still pretty useful. With the ETW sampler running at 8 KHz the interference is still very low (< 5% overhead) and I have enough samples that if a single frame of a video game runs long (say, 30 ms) then I have enough samples to figure out why.

And if the problem is not CPU consumption but CPU idle, having call stacks for every context switch is pretty amazing.

It is useful to understand that ETW is a system for recording lots of data. This includes but is not limited to CPU samples, context switches, disk I/O, memory allocations, file I/O, virtualAlloc, registry accesses, custom data, and more, and call stacks can optionally be recorded on any of these.

Windows Performance Analyzer is a very flexible tool for analyzing ETW traces. It can answer more questions than most other tools I have seen because it is very configurable. And, you can export the raw data to do custom visualizers, so you aren't constrained to using WPA.

It is unfortunate that the whole toolset is so poorly documented and hard to learn, but I've had fun trying to fix that. Well worth trying if you work on Windows. It doesn't do everything that perf does, but it does a lot of things better.

8khz x .030s = 240 samples. Too many to look at, you gotta summarize them, and that’s how speedups hide from you. The only way you know you didn’t find it is to try another approach and find it. If you’ve got one taking, say, 30% of time, the number of samples you need to examine, to see it twice, is 2/0.3 = 6.67 samples on average. Examining samples finds any speedup that summaries find, plus speedups that they don’t. So if you find yourself actually examining particular samples, drawn at random from those 240, then I think you’re being most effective.

We’ll just have to disagree. I think that looking at too few samples will fail in many cases as it may give you an incorrect sense of where time is spent. This is particularly true when the CPU bottleneck varies over time.

Another handy feature of WPA is that when you highlight a sample (address, function, module, call stack, whatever) it highlights all the time ranges where matching samples occur. This makes it trivial to see patterns that manual inspection will miss (Is this location constantly hit? Is it hit every second? Every frame?).

I think you find random breaking to be better than the alternatives because you haven’t tried great alternatives. A great profiler should let you visualize the data in many different ways, and a great programmer will use the right view for the particular problem they are facing.