This is one step of a population count routine, which folds pairs of bits together into two-bit counts. (Yeah, I know this can be done better with subtraction, but popcount isn't the subject here.) Run this through VC10, and you get this:

Unnecessary moves blah blah blah... you've heard it here before. Then again, let's take a closer look. Why did the compiler emit the MOVDQA XMM3, XMM2 instruction? Hmm, it's because it did the shift next, but it still needed to keep "x" around for the second operation. And how about that PAND that follows? Well, it couldn't modify "mask," so it copied that too. Waaaiit a minute, it's just doing everything exactly the way I told it. That might be OK if x86 used three-argument form instructions, but since x86 is two-argument, that kinda sucks. What about if we rewrote the routine this way:

Well, that looks a bit better. It appears that Visual C++ is unable to take advantage of the fact that the binary operations used here are commutative, which means that the efficiency of the code generated can differ significantly based on the order of the arguments even though the result is the same. The upside is that you can swap around arguments to get better code; the downside is that you're doing what the code generator should be doing. Interestingly, based on some experiments it looks like the code generator can do this for scalar operations, so something didn't get hooked up or extended to the intrinsics portion.
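The listings for this routine aren't reproduced above, so here is a rough sketch of the kind of thing being discussed, assuming SSE2 intrinsics from `<emmintrin.h>`. The function names, the 0x55 mask and the exact expression shapes are assumptions for illustration, not the original code. Both versions compute the same result and differ only in the order of the operands to the commutative AND operations; which one a given compiler turns into fewer moves is something to check in the disassembly rather than take on faith.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical reconstruction: fold adjacent bits into two-bit counts.
// Ordering A: shift first, mask second, with x and mask as the left operands.
__m128i fold_pairs_a(__m128i x, __m128i mask) {
    return _mm_add_epi8(_mm_and_si128(_mm_srli_epi16(x, 1), mask),
                        _mm_and_si128(x, mask));
}

// Ordering B: the same math with the AND operands swapped around.
// AND and ADD are commutative, so the result is bit-for-bit identical.
__m128i fold_pairs_b(__m128i x, __m128i mask) {
    return _mm_add_epi8(_mm_and_si128(mask, x),
                        _mm_and_si128(mask, _mm_srli_epi16(x, 1)));
}

// Scalar reference for a single byte, assuming mask = 0x55 in every byte.
// The 16-bit shift in the vector code leaks a bit across byte boundaries,
// but it lands in bit 7, which the 0x55 mask clears.
inline std::uint8_t fold_pairs_scalar(std::uint8_t b) {
    return static_cast<std::uint8_t>((b & 0x55) + ((b >> 1) & 0x55));
}
```

Compiling both with the optimizer on and diffing the generated assembly is a quick way to see whether argument order matters for your compiler version.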

Anyway, if you've got extra moves showing up in the disassembly when using intrinsics, try shaking the expression tree a bit and see if some of the moves fall out.

I needed to install a third-party video player recently to diagnose a problem with paletted video, only to discover that it was really, fatally broken in that regard. Okay, I can't give too much crap for that, because I've broken paletted video plenty of times in VirtualDub. However, this is the first time that I've seen a decoder broken not only such that it uses the wrong stride to decode the video, but that the stride used depends on the size of the window. At that point I decided that getting a paletted video stream to work in this player was useless, and decided to uninstall it.

That's when I found out how much damage this player had done to my system.

You see, this player is so awesome that it automatically decided to silently register itself as the default player for ALL video types, including AVI, MPEG, and ASF. Hey, it plays Flash video too, so why not take SWF? People store MPEG video in DAT files, so let's take that too, since nobody would ever use .DAT for anything else, right? And while we're at it, we'll take the .AVS Avisynth extension, because obviously if you're using an Avisynth script it's because you just want to play the result. The File menu in Explorer is a bit lonely too, so we'll add half a dozen menu entries just for whatever you'd want to do with this wonderful player.

Okay, I've been through this before... just reassociate the files with the One True Player(tm) (i.e. Media Player Classic) and go on. Or not. You see, this player also decided to register all new file types in Explorer, changing every single multimedia file type to use its own icon and label, so that instead of "Video file" for .AVI, it would show up as FOO - Video File, even if the type was changed back to use a different player than FOO. Which made me very unhappy, as I then had to use Registry Editor to manually fix each and every file type that had been farked up by this stupid player application, thus ensuring that this player stays permanently on my Do Not Install shiatlist.

A few days ago I discovered that some prototype DirectShow-based code I had was suddenly taking a lot longer to open files. By a lot longer, I mean up to a minute -- at full CPU. As you might imagine, this was pretty irritating, especially since not only was it running at full CPU, but it was doing something that made the entire system performance especially suck during that time. Great.

A bit of digging with the mighty F12 profiler -- actually, I guess it was Ctrl+Break, since I was using CDB -- revealed it to be the DirectShow filter graph "intelligent connect" code. Specifically, it was taking an abnormally long time to connect the audio sample grabber. "Intelligent connect" in DirectShow refers to the way in which the filter graph manager will automatically find a sequence of intermediate filters to connect two filters together whenever a direct connection isn't possible. For instance, trying to connect a renderer that wants uncompressed video to a compressed video source will result in a video decoder being stuck in between. As you might imagine, this is both handy and hazardous, the latter coming into play when the filter graph comes up with some horror like MJPEG Compressor + MJPEG Decompressor to do a color conversion. I had suspected that at first, but inspection of the resulting filter graph via GraphEdit's remote connect function didn't show anything unusual.

Some more investigation with the debugger revealed that a lot of time was being spent in creating and destroying DirectDraw surfaces, which some filter was using as part of its media type check -- not a great idea, considering how expensive it is and how often media type queries happen. For a moment, I had thought maybe some application I had installed recently had added a ton of slow or broken filters, which I'd have to hunt down and then uninstall. The situation was pretty bad too, because the filter graph manager was recursing a lot and trying some pretty deep chains of filters. Then it dawned on me... why was the filter graph manager trying so many video filters to connect an audio filter? Shouldn't it know that it already had an audio stream, and that only audio filters should be checked? Unless....

I checked the connection code again, and it turned out that I wasn't trying to connect an audio pin, but rather a source type pin. That meant that the intelligent connect code had to figure out both the demultiplex and decoder filters for the intermediate connections. Then, after checking the sample grabber code, I had a light bulb moment. It turns out that I hadn't reimplemented the EnumMediaTypes() code on the sample grabber's input pin, so it was returning no media type structures. That meant that the filter graph manager was trying to establish a connection with the following media type information:

The sample grabber did check the media type in the query function, so it only accepted audio connections. However, the filter graph manager had no way to know this since EnumMediaTypes() returned nothing, so the only way it could find a connection was to do a brute force search through all possible combinations of filters that would make Dijkstra proud. And when you have M filters that can be combined into chains up to N long, that is on the order of M^N candidate chains, so the result unsurprisingly is a whole lot of CPU time spent trying connections. So I reimplemented EnumMediaTypes() to return a single entry with the media type set properly, and suddenly load time dropped to sub-second range.
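To get a feel for how fast that blows up, here is a toy model in C++. It is emphatically not how the DirectShow filter graph manager works internally (the real thing deals with merit values, registered pin types and actual connection attempts); `Kind` and `count_chains` are made-up names, and each filter is assumed to handle just one kind of data. It only counts how many chains a naive depth-limited search would try when the target pin advertises no type versus when it advertises one.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a media type: each hypothetical filter handles one kind.
enum class Kind { Audio, Video };

// Count the chains of length 1..depthLeft a naive search would try.
// wanted == nullptr models a pin that enumerates no media types, so every
// filter is a candidate at every level. Otherwise filters of the wrong kind
// are skipped before any deeper chains are explored.
std::uint64_t count_chains(const std::vector<Kind>& filters,
                           const Kind* wanted, int depthLeft) {
    if (depthLeft == 0) return 0;
    std::uint64_t tried = 0;
    for (Kind k : filters) {
        if (wanted && k != *wanted) continue;  // pruned by the advertised type
        tried += 1 + count_chains(filters, wanted, depthLeft - 1);
    }
    return tried;
}
```

With 3 audio and 7 video filters and a depth limit of 4, the untyped search tries 10 + 100 + 1000 + 10000 = 11110 chains, while the search that knows it wants audio tries 3 + 9 + 27 + 81 = 120.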

Moral of the story? Make sure your filter isn't being too ambiguous with its reported connection requirements.

Recently I had to implement a low-pass audio filter in software. A low-pass filter is so named because it passes low frequencies while muting high ones, similar to what you'd get by turning treble all the way down on a stereo. Low-pass filters have a number of uses, the particular use in this case being to prevent aliasing in a subsequent resampling pass.

There are many ways to implement a low-pass filter, but the method that I used was a finite impulse response (FIR) filter. FIR filters have a few advantages, such as simplicity of implementation in software and ease of making linear-phase filters. The cutoff frequency was fairly high, so the FIR filter kernel didn't need that many taps -- a 15-tap symmetric filter was enough.
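As a rough illustration of what a 15-tap symmetric FIR low-pass can look like, here is a minimal C++ sketch. The design method (a Hamming-windowed sinc), the cutoff value and the names `design_lowpass` and `fir_symmetric` are all assumptions for the example and not necessarily what was actually used. It does show the two properties mentioned above: the kernel is symmetric, which is what gives linear phase, and the symmetry lets each pair of mirrored samples share one multiply.

```cpp
#include <cmath>
#include <vector>

// Design an odd-length low-pass kernel as a Hamming-windowed sinc, normalized
// to unity gain at DC. cutoff is a fraction of the sample rate (0 to 0.5).
// Assumes taps is odd and at least 3.
std::vector<double> design_lowpass(int taps, double cutoff) {
    const double pi = 3.14159265358979323846;
    const int mid = taps / 2;
    std::vector<double> h(taps);
    double sum = 0.0;
    for (int n = 0; n < taps; ++n) {
        const int k = n - mid;  // offset from the center tap
        const double sinc = (k == 0) ? 2.0 * cutoff
                                     : std::sin(2.0 * pi * cutoff * k) / (pi * k);
        const double window = 0.54 - 0.46 * std::cos(2.0 * pi * n / (taps - 1));
        h[n] = sinc * window;
        sum += h[n];
    }
    for (double& c : h) c /= sum;  // normalize so the taps sum to 1
    return h;
}

// Apply a symmetric kernel centered on each sample (zero delay, with the
// input treated as zero outside its range). Because h[mid - k] == h[mid + k],
// mirrored samples are added first and multiplied once.
std::vector<double> fir_symmetric(const std::vector<double>& x,
                                  const std::vector<double>& h) {
    const int mid = static_cast<int>(h.size()) / 2;
    const int len = static_cast<int>(x.size());
    auto at = [&](int i) { return (i >= 0 && i < len) ? x[i] : 0.0; };
    std::vector<double> y(x.size());
    for (int n = 0; n < len; ++n) {
        double acc = h[mid] * at(n);
        for (int k = 1; k <= mid; ++k)
            acc += h[mid - k] * (at(n - k) + at(n + k));
        y[n] = acc;
    }
    return y;
}
```

Calling `design_lowpass(15, 0.2)` gives a kernel that passes a constant signal unchanged and knocks an alternating +1/-1 signal down to nearly nothing, which is about the simplest sanity check for a low-pass filter.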