8/25/2010

08-25-10 - ProfileParser

I realized a long time ago that
the work I did for my AllocParser could basically be applied as-is
to parsing profiles. It's the same thing, you have a counter (bytes or clocks), you want to see
who counted up that counter, get hierarchical info, etc.

Also Sean and I talked long ago about how to do the lowest impact possible manually-instrumented profiler with full
information. Basically you just record the trace and don't do any processing on it during recording.
All you have to do is :

where ptr is some big global array of U64's , and we will later use the stringized label as a unique id
to merge traces. Once
your app is done sampling, you have this big array of pushes and pops, you can then parse that to figure
out all the hierarichical timing information. In practice you would want to use this with a scoper class
to do the push/pop for you, like :

Very nice! (obviously the pop marker doesn't have to provide the label, but it's handy for consistency
checking).

(if you want to get fancy, the profiler push/pop array should really be write-combining memory, and the stores
should be movnti's on x64 (*), that way the profiler push/pop wouldn't even pollute your cache, which makes it
very low impact indeed)

(* BTW movnti is exposed as _mm_stream_si64 ; unfortunately, this does not exist in VC2005,
you need 2008; the fact that they took away inline asm and then failed to expose all the
instructions in intrinsics was a huge "fuck you" to all low-level developers; it was terrible in the early days,
they've caught up a bit more with each rev ) (note that movntq is not the same, movnti comes from normal registers).

So I did this, and made my AllocParser able to parse that kind of input and turn it into AllocRecords.
(the normal mode of AllocParser is to handle stack traces of memory allocations).

So now I can make tabview output for either top-down or bottom-up hierarchies, and also graphiz
output like :
lz_profile.svg .

There are a few minor things to clean up, like I'd like to be able to show either seconds or clocks,
I'd like to be able to divide clocks by the total # of bytes to make it a clocks per byte, and if
a node only has a self time, I'd like to delete the redundant +self node.

Another option is to show the hierarchy in a totally different graphviz style, using the boxes-within-boxes method.
I tried that before for allocs and it was hideous, but it might be okay for timing. If I do that, then I could
combine this with a global timeline to show thread profiles over time, and then use the arrows to connect
dependencies.

26 comments:

We do the exact same thing. We define big thread local arrays and have nearly identical macros. At the end of the program we dump the arrays to a disk file and have a Java program that post processes them. We're going to generate time line plots using gnuplot so you can see load balancing problems.

Our motivation is that all of the linux profilers are awful at communicating in a reasonable way how much time you spend in the kernel processing a libc call.

Dealing with all the different profile builds on all the platforms is a nightmare IMO. The XbTrace stuff for Pix is probably the best of all worlds but even that is a bit of a hassle because it's awfully slow, but generally it's nice to just be able to flip on "DO_PROFILE" and get info on all my platforms and then view the info in a consistent way.

It's an open question to me whether streaming the data out is better than immediately encoding into info with a custom stack like iprof does, since the long stream is going to touch a lot of memory and cost performance that way.

"oh nevermind, you mentioned the write-combining possibility. although that doesn't really work unless you write out a whole cacheline? or something."

Yeah, it will hold the partially written CL in the store buffer for a while, I'm not sure exactly what triggers it to give up and have to fetch the line.

BTW it would be nice if the PC had a mechanism like dcbz ; and even dcbz isn't exactly what you want. You want to be able to say "I'm streaming into some buffer and will write the bytes sequentially, and I don't care what the contents are in the region after my write pointer".

if it needs a whole cache-line (if it can't communicate with memory in smaller amounts), then you're kind of fucked, because i'm pretty sure that stuff just sits in the write buffers so it won't last very long (not long enough to finish out the cache).

in indie game jam 0, I tried to make the sprite rendering go through Nvidia's VAR extension which required writing to write combining memory, and if I tried to construct the data on the fly it just always failed. the only way I was able to hit the presumably optimal performance was to write to a small (1K?) buffer, and then copy that buffer directly to the WC memory. (I say presumably optimal, because that got me to performance parity with doing it the naive OpenGL immediate way (multiple function calls per vertex), which Nvidia had already themselves optimized to do the same thing (I suspect).)

Of course machines are different now and the write-combining stuff has had time to be tweaked and improved, but I can't imagine it's more than the write buffers.

I'm no expert on this shit anymore, but I believe that while a partial write is slower, it still doesn't pollute your caches, so it's good in that sense.

Also you can always try having a little cache-line-sized temp buffer to accumulate the push/pops and then flush it out to the main buffer using WC on that granularity. This just steals one cache line which is still significantly lower impact than "iprof" style.

Also modern Intel processors have 4 or more store buffers, not 1 like the old days.

BTW I'm still a big fan of iprof style mini-profilers for when you want realtime in-game info. It's nice to be able to just pull up your HUD and get some numbers. In cblib I support both methods and you can toggle each on and off with a #define.

Quoting from the IA-32 manuals on write combining (Vol 3A, System Programming Guide): "Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. [..] The size and structure of the WC buffer is not architecturally defined. For the Intel Core 2 Duo [etc.] processors; the WC buffer is made up of several 64-byte WC buffers."Plus a lot of notice about how WC memory is explicitly weakly ordered and you have to call sfence if you want the data to reliably show up in memory. No details on how buffer allocation works or when they're flushed. Anyway, WC buffers are separate from regular store buffers, so data can be sitting there for quite some time even when there's a lot of other memory accesses going on (which is good for this application).

Even if you're not writing full cache lines, WC is a win in this case because the memory is never going to be read unless you actually do a read on it (Fallback for non-full-cacheline WC writebacks is multiple partial write transactions, not fetch whole cache line - apply changes - write whole cache line).

Previous processors had less WC buffers (sometimes only 1), and some architectures have more stringent requirements on write-combined writes (e.g. "must be strictly sequential access with no holes").

"You want to be able to say "I'm streaming into some buffer and will write the bytes sequentially, and I don't care what the contents are in the region after my write pointer"."One thing that *ahem* certain platforms have is a mechanism called the "write gather pipe". You set the destination address once, then write whatever you want to a specific MMIO address and it will just output everything sequentially in cache line-sized transactions without ever reading memory or polluting the cache. It also packs accesses tightly, i.e. a byte write will only advance the output pointer by 1 byte and there's no penalty when writes don't end up being naturally aligned or cross cache lines. It's the awesome way to do queuing.

"I tried to make the sprite rendering go through Nvidia's VAR extension which required writing to write combining memory, and if I tried to construct the data on the fly it just always failed."Did you check the disassembly? When accessing WC memory, it's absolutely crucial that you write the whole buffer without holes, and you really want the accesses to be all sequential (on older processors with just 1 WC buffer, non-sequential writes are as bad as holes, since they effectively amount to the same thing).

In C/C++, that means you basically have to make your writes volatile, which sucks since it's way too strong, or alternatively insert a compiler write barrier after every store, which also sucks since it's so messy.

If an extra memcpy makes things go faster, you can practically count on the write sequence in the generate code being fucked up.

Quoting from the IA-32 manuals on write combining (Vol 3A, System Programming Guide): "Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. [..] The size and structure of the WC buffer is not architecturally defined. For the Intel Core 2 Duo [etc.] processors; the WC buffer is made up of several 64-byte WC buffers."Plus a lot of notice about how WC memory is explicitly weakly ordered and you have to call sfence if you want the data to reliably show up in memory. No details on how buffer allocation works or when they're flushed. Anyway, WC buffers are separate from regular store buffers, so data can be sitting there for quite some time even when there's a lot of other memory accesses going on (which is good for this application).

Even if you're not writing full cache lines, WC is a win in this case because the memory is never going to be read unless you actually do a read on it (Fallback for non-full-cacheline WC writebacks is multiple partial write transactions, not fetch whole cache line - apply changes - write whole cache line).

Previous processors had less WC buffers (sometimes only 1), and some architectures have more stringent requirements on write-combined writes (e.g. "must be strictly sequential access with no holes").

"You want to be able to say "I'm streaming into some buffer and will write the bytes sequentially, and I don't care what the contents are in the region after my write pointer"."One thing that *ahem* certain platforms have is a mechanism called the "write gather pipe". You set the destination address once, then write whatever you want to a specific MMIO address and it will just output everything sequentially in cache line-sized transactions without ever reading memory or polluting the cache. It also packs accesses tightly, i.e. a byte write will only advance the output pointer by 1 byte and there's no penalty when writes don't end up being naturally aligned or cross cache lines. It's the awesome way to do queuing.

"I tried to make the sprite rendering go through Nvidia's VAR extension which required writing to write combining memory, and if I tried to construct the data on the fly it just always failed."Did you check the disassembly? When accessing WC memory, it's absolutely crucial that you write the whole buffer without holes, and you really want the accesses to be all sequential (on older processors with just 1 WC buffer, non-sequential writes are as bad as holes, since they effectively amount to the same thing).

In C/C++, that means you basically have to make your writes volatile, which sucks since it's way too strong, or alternatively insert a compiler write barrier after every store, which also sucks since it's so messy.

If an extra memcpy makes things go faster, you can practically count on the write sequence in the generate code being fucked up.

Since I've never figured out how to make iprof-style work efficiently threaded, and then write-to-a-buffer can just write out thread IDs, then you're probably right that storing to a small (cache-line-sized) buffer and flushing with non-temporal stores is going to be the lowest impact.

If you then stream that out to another process, i.e. the telemetry model, and you have free cpu cores so that process can run without interfering with the thing you're profiling, that probably is the performance sweet spot. (So long as it's designed the level of fine-grained profiling that iprof was.)

aren't we conflating two different things, the write-combining memory type and the non-temporal streaming stores? i mean, I can believe they might use the same hardware mechanisms, but the rules may not be the same.

(I know the streaming, non-cache-polluting store was something I pitched to Intel when they came by Looking Glass one time, so I like to pretend I'd had some effect when it showed up many years later, but I imagine it was pretty obvious.)

ryg, I imagine I looked at the asm, but who knows now. I don't remember if I did the volatile stuff or not. I think I got the impression that it was doing too much other work on the way to generating the data--enough work that it had to spill registers, which meant that it had to do other writes, and I think I assumed that either those writes were interfering, or I was just doing it "too slow", or something. It didn't really matter because once I did the memcpy() and found than the original naive glVertex() implementation, I figured I wasn't leaving much performance on the floor (yes, the memcpy is overhead, but not that much), so I killed it (IIRC).

"aren't we conflating two different things, the write-combining memory type and the non-temporal streaming stores? i mean, I can believe they might use the same hardware mechanisms, but the rules may not be the same."Intels manuals explicitly specify that a non-temporal write is treated exactly as if it went to write-combined memory. Quoting again:

"If a program specifies a non-temporal store with one of these instructions and the destination region is mapped as cacheable memory (write back [WB], write through [WT] or WC memory type), the processor will do the following:* If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.* The non-temporal data is written to memory with WC semantics."

I loved the lightweight profiler idea so much I decided to rip out our shitty existing profiler and implement something similar, although thread safety became an issue and I didn't want to have a big ass array for every thread (ie. TLS), so I implemented the basic idea like this-

The other way is to have a std::vector per thread in TLS. Then you take the time before you do the extend, and add that as a special marker that gets subtracted out in the post-process.

That way if the vector has to do a malloc to grow it doesn't show up in the timing at all. Obviously it will distort things a bit even if that timing is subtracted, but this will be lower impact on average than the atomic add.

Yeah I thought about doing it that way, but I really wanted to avoid multiple buckets (vectors) per thread, since I want to be able to specify exactly up-front how much memory will be reserved for the profile buffer (I'm not anti dynamic mem allocations, but try to avoid it as much as possible, even with our custom allocators which are designed to handle this kind of behaviour). I guess the overhead goes away once your buffers reach their maxima, but you still have a branch in there to check the size.

I also didn't want to introduce the overhead of malloc/new to the profiler, since parent begin/end tags will still contain the overhead of child tags, even with the malloc before the start-tag/after the end-tag.

That all said, I'm probably splitting hairs over a miniscule performance loss.

Reading comprehension FTW, I misread your reply, ok subtracting the time difference from all parent tags wouldn't pollute the timings, but I guess you have the overhead of calling QPC twice or whatever timing function you're using. Again I think we're splitting hairs :)

Atomic add? You should definitely benchmark that to compare performance under contention. I seem to remember ~40 cycles for the rdtsc time stamp into a thread local array and as many as 700 cycles for a contended cache line access (depends on the pair of cores). I would expect that you would do better with a conditional branch...

Pre-allocate one known-size big long array, then have each thread keep sub-arrays in it. Each thread does TLS to allocate from its sub-array, and if it exhausts its sub-array, it atomic adds to allocate a new chunk from the big array and starts using that as its own subarray.

You end up with fixed-size overhead per thread (the size of one chunk) instead of the arbitrary overhead of per-thread arrays, and reduced contention compared to atomic add, and even though you've got the per chunk overhead you probably actually fit more data in the same sized big array because you don't have to write out thread IDs on every operation, just once per chunk.