A call-by-call look at context switch logging with the Event Tracing for Windows API.

In response to last week’s post, I received the following e-mail:

“

I might be a little late for a question about your recent post, but just in case I'm not: Do you have any strategies for, or stories about, dealing with an external library that you couldn't get rid of that violated some (or all) of these API design guidelines? It's a vague question, but I'm really just asking about any past experience as a user of an API that really sticks out in your mind.

- Michael Bartnett

This reminded me that I’d always wanted to go write down the steps necessary to use a bad API, just to highlight how terrible it can be for the programmer. I don’t think people who make APIs really appreciate how important it is to get them right, and how much unnecessary work their mistakes can cause for hundreds, thousands, sometimes even millions of other programmers. So I felt like it was important to spend an article walking through an API and showing just how much unnecessary work an API can manufacture.

It’d probably be a nice column on its own — a weekly dissection of a bad API. But since I don’t have time for something like that, if I was only going to dissect one API, the most important question was, which API should I choose?

It’s a great time in the history of computing to be writing an article about bad APIs (which is another way of saying it’s a terrible time to actually have to program for a living). There’s so many bad APIs out there, I could have picked one at random and been very likely to find enough problems to fill a 3000-word article. But if I was only going to pick apart one specific operation in one API, it seemed only right to try to pick the worst API I’d ever actually used.

Now there are a lot of APIs out there that routinely turn in top-ranking efforts for the “worst API” leaderboard. CSS, for example, can probably claim half the spots on the top 10 for any year in which there’s a new version. DirectShow, while it was still a going concern, certainly dominated the rankings for its era. And in the modern age, newcomers like the Android SDK are showing real potential with development environments so convoluted that the quality of the APIs when called from actual C++ code are the last thing you’ll worry about when trying to ship something with them.

But when I thought long and hard about who the all-time heavyweight bad API champion was, there was one clear winner: Event Tracing for Windows.

Event Tracing for Windows is an API that does something very simple: it allows any component of the system (including end-user software) to announce “events” which any other component can then “consume”. It is a logging system, and it is used to record performance and debugging information by everything from the kernel upwards.

Now, normally, a game developer would have no reason to use the Event Tracing for Windows API directly. You can use tools like PerfMon to view logged information about your game, like how much working set it was using or how much disk I/O it did. But there is one specific thing that directly accessing Event Tracing gives you that you can’t get anywhere else: context switch timing.

Yes, if you have any relatively recent version of Windows (like 7 or 8), the kernel will log all thread context switches, and using the CPU timestamp included in those events, you can actually correlate them with your own in-game profiling. This is incredibly useful information to have, and is the kind of thing you often only get from console hardware. It’s the reason tools like RAD’s Telemetry can show you when your running threads were interrupted and had to wait for system threads to do work, something that can often be critical to debugging weird performance problems.

So far, the API is sounding pretty good. I mean, context switch timing is very vaulable information, so even if the API was a little janky, it’d still be pretty great, right?

Before we take a look at the actual Event Tracing for Windows API, I want to walk the walk here and do exactly what I said to do in last week’s lecture: write the usage code first. Whenever you evaluate an API, or create a new one, you must always, always, ALWAYS start by writing some code as if you were a user trying to do the thing that the API is supposed to do. This is the only way to get a nice, clean perspective on how the API would work if it had no constraints on it whatsoever. If it was “magical”, as it were. And then, once you have that, you can move forward and start thinking about the practical problems, and what the best way is for you to get to something implementable.

So, if I were a programmer, with no knowledge of the Event Tracing for Windows API, how would I want to get a list of context switches? Well, two methods come to mind.

The most straightforward approach would be something like this:

// At the beginning of the programetw_event_trace Trace = ETWBeginTrace();ETWAddEventType(Trace, ETWType_ContextSwitch);

That’s one way to do it. Very simple, trivial to understand, pretty hard to mess up. Someone stepping into this with the debugger would be able to see exactly what was going on, and you’d be able to tell pretty easily if you’d done something wrong.

However, I could imagine a scenario where performance-critical code would not want to pay the cost of the copy from the kernel’s buffer to your buffer, which this API requires (ETWGetEvents must copy the events from some OS-internal buffer, since it has to get them from somewhere). So a slightly more complex version would be to get some mapped memory back from the API that you use as a reading buffer:

// At the beginning of the programetw_event_trace Trace = ETWBeginTrace(4096*sizeof(etw_event));ETWAddEventType(Trace, ETWType_ContextSwitch);

All I have done here is changed the return mechanism from a copy to a ranged pointer. In ETWBeginTrace, the user now passes in the number of events they want to buffer at maximum, and the kernel reserves room in the user’s address space for that many events. It then writes directly into that memory if it can, avoiding unnecessary copies. When the user calls ETWBeginEventRead(), a begin and end pointer are returned that span some part of the event memory. Since it will be treated as a circular buffer, the caller is expected to loop on in case there are two ranges (a “head” and “tail”) that need to be returned. I included an end call, since certain methods of implementation might require the kernel to know what part of the buffer the user is looking at, so it can avoid writing into memory that is actively being read. I don’t really know that this sort of thing would be necessary, but if you wanted to cover your bases and give the kernel the maximum implementation flexibility, this definitely supports more implementations than the ETWGetEvents() version.

If one were so inclined, one could even support both retrieval methods with the same API just by still allowing the ETWGetEvents() call. Also, to complete the API with some error reporting, it might also be nice to have something like:

bool ETWLastGetEventsOverflowed(event_trace Trace);

to allow you to check after each ETWGetEvents() whether too many events had occurred since the last check, and the kernel was forced to throw some away.

To each his own, but I suspect that most of the programmers I know wouldn’t have a lot of complaints with my API proposal as written. Everyone has their own taste, so I’m sure they would each tweak something to be more to their liking, but I doubt anyone would say it was horrible. It’s all pretty straightforward, and I suspect most programmers could integrate it into their code trivially without really thinking too much about it.

The reason the API is so straightforward is not because I employed a sophisticated set of API design practices to finesse my way to a good API. Quite the contrary. The API is simple because the problem it’s solving is trivial. It’s essentially the simplest API problem you can have in a system: how to move data from one place to another. It’s a glorified memcpy().

But it is precisely the simplicity of the problem that allows the Event Tracing for Windows API to really shine. Even though all it has to do is move memory from one place to another, it manages to involve almost every kind of complexity you can see in an API.

I don’t know how anyone is supposed to actually learn how to use the Event Tracing for Windows API. Maybe there are good examples floating around that I just never found. I had to piece the usage code together over the course of many hours of experimentation, pulling from various snippets of documentation. Each time I figured out another step of the process I thought, “Wait, seriously?” And each time Microsoft implicitly replied, “Seriously.”

Having me tell you how to call the API does take some away from the sheer awe of the experience, so I will say, if you want the full monty, stop reading now and go try to get context switch timestamps on your own. I can assure you it will be hours of fun and excitement. Those who’d rather save time at the expense of a day full of facepalm moments, read on.

OK, here we go. The equivalent to my proposed ETWBeginTrace() is Microsoft’s StartTrace() call. At first glance, it seems innocent enough:

A cursory glance at this mass of data reveals a bit of oddity: why are there members like “EventsLost” and “BuffersWritten”? The reason is because, instead of making different structures for the different operations you might do on a trace, Microsoft grouped the API functions into a few groups, and all the functions in each group share one amalgamated structure for their parameters. So instead of the user getting a clear idea of what goes into and comes out of a function by looking at its parameters, they must instead rely entirely on the MSDN documentation for each API, and hope that it correctly enumerates which members of the giant parameter structure are used for each call, and whether those members go into or come out of the function.

Of course, because there’s so many different ways it can be used, and given the possibility of future APIs using it, too, Microsoft requires that you clear this gigantic beast to zero before you use it:

EVENT_TRACE_PROPERTIES SessionProperties = {0};

For StartTrace(), if we just want to get the data back directly and aren’t trying to log to a file, we need to fill out some members. These two make some sense:

The EnableFlags says what we wanted to get. We want context switches, so we set that flag. Now, what happens when you have more than 32 types of events coming from a single provider, I don’t know, but I guess they weren’t particularly concerned with that eventuality. I was, which is why I went with the enum-and-function-call approach in my proposal, since it supports four billion event types, but hey, “32 event types should be enough for everyone”, so Microsoft went with a 32-bit flags field. Not a huge deal, but this is definitely the kind of short-term thinking that leads to things like unnecessarily duplicated functions with “Ex” appended to their names.

The LogFileMode just says whether we wanted to get events back directly, or whether we just want the kernel to write them to disk. Since they are such disparate operations, I would have broken those two things up into different function calls, but, hey, we’ve already got one big giant structure for everything, might as well throw it all in there.

Things get a little weirder with this field:

SessionProperties.Wnode.Flags = WNODE_FLAG_TRACED_GUID;

According to the documentation, that’s the only value you’re allowed to have. So it’s just busywork for you to do. Again, not a huge deal, because maybe they’re trying to plan for future expansion or something (heaven help us). But we are already getting nasty data coupling here, where the function call and the structure contents are effectively redundant.

Things continue to sour as we come to this field:

SessionProperties.Wnode.ClientContext = 1;

What does that mean? Well, the cryptically named “ClientContext” actually refers to the type of timestamp you want your events to have. “TimestampType” might have been a little bit more descriptive, but whatever. The real fun is the bare “1” value that’s just sitting on the right.

It turns out that there’s actually a set of enumerated values to which you can set ClientContext, but Microsoft didn’t ever give them symbolic names. So you’re just supposed to read the documentation and remember that 1 means the timestamps come from QueryPerformanceCounter, 2 means the “system time”, and 3 means the CPU cycle count.

In case it is not obvious, there are a number of reasons why public APIs should never do something like this. Internally, I will occasionally do something like this, in situations where a local piece of code just wants to, say, use -1 and -2 for some kind of convoluted special cases in an indexing scheme. But for an API that ships to literally millions of developers, you always want to define your constants.

First, it makes the code readable. Nobody knows what a ClientContext of “1” is, but a ClientContext of USE_QUERY_PERFORMANCE_COUNTER_TIMESTAMPS would be crystal clear. Second, it makes the code searchable. Nobody can grep a codebase for “1”, but they can trivially grep it for USE_QUERY_PERFORMANCE_COUNTER_TIMESTAMPS. And although you might think “well, no big deal, I’ll just grep for ClientContext = 1”, remember that more complex usage of this API could involve variables, like “. . .ClientContext = TimestampType;”. Third, it makes the code not compile on future versions of the SDK where things have changed. For example, if the platform maintainers decided to deprecate the use of USE_QUERY_PERFORMANCE_COUNTER_TIMESTAMPS, they could remove the #define for it and make it be USE_QUERY_PERFORMANCE_COUNTER_TIMESTAMPS_DEPRECATED. Now all the old code that uses that value will get a compile error on the new SDK, and the programmer will go look at the new documentation and see what they should be using instead.

Etc., etc., etc.

Perhaps the most annoying field we have to fill out is this one:

SessionProperties.Wnode.Guid = SystemTraceControlGuid;

The GUID says who you’re trying to trace. In our case, we’re trying to get data from the kernel logger, and SystemTraceControlGuid is the globally defined GUID that refers to this service. It could have been named better, sure, but that’s a minor issue compared to the fact that, if you were to actually try to compile this line of code, you’d find that SystemTraceControlGuid was an undefined symbol at link time.

This happens, of course, because the GUIDs are so large that Microsoft apparently couldn’t figure out any way to embed them in the headers (I can think of several possible ways, but I guess they didn’t like any of them), so instead they force you to pick one file in your project where the Windows headers will inject the actual GUID definitions. To signal that you want this to happen, you have to do something like this:

So now you have to be careful about where you do this, perhaps by making a new file in your project where the GUIDs will live so everyone can reference them, or some other such nonsense, so you don’t ever do it twice.

But whatever, we’re almost done filling out the structure. All we have to do is deal with the SessionName parameter, which we should be able to just pass as a string, right? Since it’s the name of the session, I guess maybe just do:

But alas, that is not the way things work. It turns out that, even though you already pass a GUID in the SessionProperties that specifies that the kernel is the source of the events, you must also pass the pre-defined constant KERNEL_LOGGER_NAME as the session name. Why? Well, it’s because of a little secret surprise that I’ll save for a bit, just so you can savor the suspense of it all.

It turns out that, although the SessionName string is passed as the second parameter, that’s really just a “convenience” feature. Really, the SessionName is required to be embedded directly into the SessionProperties, but since Microsoft didn’t want to limit the maximum length of the name, they decided to just go ahead and pack it in after the EVENT_TRACE_PROPERTIES structure. So really, you can’t do this:

Yes, that’s right, every user of the Event Tracing for Windows API has to do the arithmetic and layout of the packed structure format themselves. I have absolutely no idea why the name has to be bundled in this way, but surely if you wanted everyone to do this, you should have supplied a utility macro or function that would do the right thing for the user and keep them out of the business of understanding your weird data packing requirements.

But hey, at least you don’t have to copy the name in there yourself! The convention Microsoft adopted for this API was that the StartTrace() function will copy the name into the structure for you, since after all, it’s passed as the second parameter.

Well, it’s a nice gesture, but it doesn’t pan out in practice. It turns out that forcing the SessionName to be KERNEL_LOGGER_NAME wasn’t superfluous with the GUID after all, and that is the secret surprise to which I alluded. The real reason that it has to be KERNEL_LOGGER_NAME is because Windows only allows you to have one session in the system — total — that reads events from the SystemTraceControlGuid. Other GUIDs can be read by multiple sessions, but not SystemTraceControlGuid. So really what is happening when you pass KERNEL_LOGGER_NAME is that you are saying you want the one unique session that can exist in the system at any given time with the GUID of SystemTraceControlGuid. If someone else already started that session, then your attempt to start it will fail.

It gets better. The session is global to the operating system and doesn’t automatically close on the termination of the process that started it. So if you write code that calls StartTrace(), but you have a bug in that code somewhere — anywhere — and your program crashes, the KERNEL_LOGGER_NAME session is still running! When you go to re-run your program, perhaps after fixing the bug, your attempt to StartTrace() will fail with ERROR_ALREADY_EXISTS.

So, basically, StartTrace(), which is the call that helpfully copies the SessionName into the struct for you, is rarely the first call you would make anyway. What you’re much more likely to do is call this:

This shuts down any existing session, so that your subsequent call to StartTrace() will succeed. But of course, ControlTrace() doesn’t copy the name in like StartTrace() does, which means that, in practice, you have to do it yourself, since StartTrace() gets called after ControlTrace()!

It’s crazy, but the ramifications of all this are crazier still. If you think about what it means to have only one possible trace that is connected to the kernel logger, you realize quickly that there are security issues at play. If some other process had called StartTrace(), and they were the ones that were using the kernel logger, how does the system know that it’s OK for our process to come in and stop that trace so we can start it back up again with our settings?

The answer, hilariously, is that it doesn’t! In fact, it’s a complete free-for-all — may the best process win! Whoever calls StartTrace() last, well, that’s who gets to configure the trace.

Well, not quite. Obviously, you don’t want just any old process to be able to steal the kernel logger away from other processes. So Microsoft decided that the best course of action would be to simply prohibit all processes from accessing the kernel logger at all, unless they’ve been specifically granted administrator privileges.

That’s right — I am not exaggerating. If you simply want to receive a list of the context switches, even if only for your process, it must be running with full administrator privileges. The whole right-click-run-as-administrator nine yards. If you don’t, and you just run your process normally, your StartTrace will fail with insufficient privileges. (In theory, you may also be able to add your user to the “Performance Log Users” group and avoid having to run the process as administrator, but I’m sad to say that, at the moment, I can’t actually remember if that works for kernel logger connections or only other types of traces. . .)

Amazing, yes? To do the equivalent of what should have been two trivial function calls (ETWBeginTrace() / ETWAddEventType()), we’ve done memory arithmetic, we’ve allocated memory, we’ve cleared memory, we’ve computed offsets, we’ve copied strings, we’ve filled out structures, used not one but two global GUID-style constants, forced a #define manhandling of the preprocessor #includes, and required that the user start our process with full administrator privileges.

I know what you’re thinking. You’re thinking that the section after “Starting the Trace” should cover getting the events from the trace, right? Nonsense! This is the Event Tracing for Windows API, people! Starting a trace doesn’t start a trace! It just half-starts a trace! If you want to start the trace for real, everyone knows you have to start it, then open it. . . with the OpenTrace() function:

TRACEHANDLE OpenTrace(EVENT_TRACE_LOGFILE *Logfile);

What does this function do? Well, it turns out that a “started” trace is just a trace that is collecting events. It doesn’t actually provide any way to ever get those events. If you want to get them, you need to open the trace up with OpenTrace().

So to call OpenTrace(), we need an EVENT_TRACE_LOGFILE. Of course, we’re not actually making a log file, we just want to get the events, so it’s a little odd that we’d be filling out something called “LOGFILE”. But much like StartTrace(), OpenTrace() is part of a family of functions that all share the same parameter structure, so really, the fact that it’s inappropriately named for our purposes is one of the more minor annoyances.

If you got nervous when you saw the word “Callback” appear in that structure, I’m right there with you. Getting events should be a simple matter of asking for them and receiving the memory. There should be absolutely no need for a callback.

But, moving on, EVENT_TRACE_LOGFILE is one of those giant grab-bag structures, so Microsoft asks that you clear it first so any unused parts will be zero:

EVENT_TRACE_LOGFILE LogFile = {0};

Since the OpenTrace() function doesn’t actually take any handles, we need to give it some way to find the trace we “started” before. It turns out this is done by string-matching, so we pass the session name yet again:

LogFile.LoggerName = KERNEL_LOGGER_NAME;

Strangely, this time we don’t have to copy it to the end of the structure. Why? Who knows! Variety is the spice of life, I am told. Microsoft wants your life to be spicy.

Next, we have to fill in the tracing mode, which is actually a set of mode flags:

PROCESS_TRACE_MODE_REAL_TIME, as far as I can tell, is a completely redundant flag, because if you don’t specify a log file name, then I’m not sure how you’d receive events otherwise. PROCESS_TRACE_MODE_EVENT_RECORD is a compatibility flag that tells Windows you want to use the new EventRecordCallback, as opposed to the older EventCallback (yes, believe it or not, this fine API has actually gone through multiple revisions!) And the PROCESS_TRACE_MODE_RAW_TIMESTAMP flag tells Windows not to override the setting for the ClientContext passed to the StartTrace() call. I guess the idea here is that the person who started the trace might have used a timestamp format other than “2”, and if you want “2”, then you can always for it to be “2” when you actually get the events. If you always wanted “1” or “3”, well. . . no such luck.

Finally, we need to point the API at a function we write that will collect the events:

LogFile.EventRecordCallback = CaseyEventRecordCallback;

And then we’re finally ready for the big call:

TRACEHANDLE ConsumerHandle = OpenTrace(&LogFile);

Now, before we move on to actually trying to get some actual events (perish the thought!), I’d like you to stop for a minute and marvel at the pure brilliance of StartTrace() and OpenTrace(). These are two APIs that are part of the same system. They both generate a new TRACEHANDLE. They can both fail. They both take the session name. But, for each of these things, they do it completely differently. Completely!

StartTrace() returns a ULONG error code, and takes a pointer to space for the handle it returns. OpenTrace() returns the handle directly, but sets it to INVALID_HANDLE_VALUE if there was an error. StartTrace() takes the session name as a parameter, and forces you to leave space in its parameter structure so it can copy it in. OpenTrace() takes a pointer to the session name in its parameter structure, but doesn’t require it to be copied anywhere.

We’ve worked so hard and come so far, it would be nice to finally get an actual context switch event, wouldn’t it? To get one, we’ll of course need to implement the actual callback we pointed OpenTrace() at:

Mercifully, this is just a bunch of structure accesses, and there’s nothing particularly unusual going on. The event comes in, we pull the data out of it, and then we could do whatever we wanted to do with it. But, when does this callback get called?

Well, Windows only calls it back when you attempt to “process” events in your open trace using this API:

Using this function, you can pass an open trace handle, and Windows will spin through all the events it has on that handle, calling your callback for each one. So we’re done, right? We just call this function every frame of our game, or some other appropriate interval, and we harvest the events with the callback?

I’m afraid not, my friends, because here is the brown icing on the poopcake, that savory sauce that makes the Event Tracing for Windows API more grainy and aromatic that its contemporaries: ProcessTrace is defined to never return.

Yes, you read that correctly. The definition of ProcessTrace(), for a real-time trace, is that it dispatches any events it finds currently logged on the trace, and then just blocks until more come in. It will block forever, or until the trace is manually closed with CloseTrace(). This means the only way to actually get the events and continue running your process is to create an entirely new thread to do nothing but block on ProcessTrace()!

You think I’m joking, but I’m completely serious. You first have to make a thread stub that blocks forever on ProcessTrace():

So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user. I’d never seen anything like it before, and I’ve never seen anything like it since. Add that to the littany of what-not-to-dos that comprise the actual API calls themselves, and I hope everyone will agree that the Event Tracing for Windows API is in a class of bad API entirely by itself.

You can nit-pick this API by examining it with the principles I outlined last week (well, ten years ago). For example, you can point out that a required thread and a callback for a simple memory transfer is a major red flag under the Flow Control principle. You can point out that there is no granularity at all, as you must call the calls in exactly the way that I described. You can point out that nearly all the data and all the functions are coupled in a dozen different ways, including bizarre ones like the requirement that the SessionName be KERNEL_LOGGER_NAME if the GUID is SystemTraceControlGuid.

But really, the most important takeaway from a bad API like this is “always write the usage code first”. In the lecture, I said it was the first and second rule of API design, and I wasn’t kidding. Without thinking about pricinples at all, or getting bogged down in details, just the simple exercise of writing down what the API should look like was all that was really necesssary to see all the places the Microsoft version fell down. If you go back and contrast the two, it’s immediately apparent how overcomplicated, error-prone, and unnecessary the Microsoft version is using only your programmer’s intuition.