Stuff

In the previous entry, a commenter asked if SIMD intrinsics are worthwhile in VS2008.

Truth be told, I didn't try them, because Microsoft only has a skeleton crew (person?) on the C++ compiler for VS2008, and they're not even taking most bug fixes, much less a feature addition or optimization like improving SIMD code generation. The rest of the compiler team is busy rewriting the compiler for Orcas+N. As such, I don't really expect any change in intrinsics compared to VS2005 SP1, which in turn is just VS2005 RTM + some new kernel mode intrinsics. I do have some experience working with intrinsics in other venues, though, so I can at least tell you my experiences with VS2005.

The first problem you'll run into with intrinsics is that they require alignment. If you construct all of your SIMD code to use unaligned loads and stores, your performance will be pathetic. The heap alignment on Win32 is only 4 bytes, though, so you need to use _aligned_malloc(), with its associated space penalties, or switch to a custom allocator. The compiler does handle alignment of sub-objects for you, and in theory it does for stack objects as well, but my experience is that VC8 is buggy with regard to returning aligned objects and frequently gets it wrong. Fortunately, x86 gives you a clear exception when this occurs; some platforms instead helpfully align the pointer for you by zeroing LSBs of the address, which leads to some nice heap corruption bugs. If you're interoperating with .NET, you're in for some annoyance because the CLR knows jack about alignment. STL can also give you problems if its allocators aren't alignment-savvy; I think VC8's implementation might be problematic in this regard.
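If you go the custom-allocator route, the standard trick is to over-allocate and stash the raw pointer just below the aligned block; _aligned_malloc() does essentially this under the hood. A minimal sketch (the names aligned_malloc/aligned_free are mine, not CRT functions):

```cpp
#include <cstdlib>
#include <cstdint>

// Over-allocate, round up to the requested alignment, and stash the
// original malloc() pointer in the slot just below the aligned block
// so it can be recovered at free time.
void* aligned_malloc(size_t size, size_t align) {
    void* raw = std::malloc(size + align + sizeof(void*));
    if (!raw)
        return nullptr;
    uintptr_t p = (uintptr_t)raw + sizeof(void*);
    uintptr_t aligned = (p + align - 1) & ~(uintptr_t)(align - 1);
    ((void**)aligned)[-1] = raw;    // remember the real allocation
    return (void*)aligned;
}

void aligned_free(void* p) {
    if (p)
        std::free(((void**)p)[-1]);
}
```

The space penalty mentioned above is visible right in the code: up to align + sizeof(void*) wasted bytes per allocation.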

The second problem is MMX, or more specifically, the prohibition on mixing x87 and MMX. This isn't a performance issue -- you will actually get incorrect results if you mix the two without appropriate [F]EMMS instructions, because the FPU will start spitting out NaNs when it notices its register stack is full. VC7 had some severe bugs where the optimizer rearranged floating point calculations around _mm_empty() or __asm { emms } statements, which made it nearly impossible to safely use MMX intrinsics. I think these were fixed in VC8, but then you have the problem of when to do it. The last thing you want to do is call EMMS at the end of each and every function in a library, because performance will be dreadful, and trying to document which ones use MMX and forcing the client to figure out where to put the barriers is really bad too. And if you think MMX is dead, do consider that unless you have SSE2, it's really hard to efficiently handle integers, even if you just want to convert them to and from floats (well, unless you only want to do one at a time and only 32-bit integers).
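To make the placement problem concrete, here's a sketch of the per-function-EMMS approach, assuming standard MMX intrinsics from <mmintrin.h> (the helper name is hypothetical):

```cpp
#include <mmintrin.h>

// Hypothetical library routine: packed 16-bit add via MMX. The caller
// never sees MMX state, so the EMMS has to go here at the boundary --
// and paying for EMMS in every small function like this is exactly
// what kills performance.
int mmx_add_low_word(short a, short b) {
    __m64 sum = _mm_add_pi16(_mm_set1_pi16(a), _mm_set1_pi16(b));
    int result = _mm_cvtsi64_si32(sum) & 0xFFFF;
    _mm_empty();    // EMMS: without it, later x87 math sees a full
                    // register stack and starts producing NaNs
    return result;
}
```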

The third problem is the ABI. More specifically, the x86 ABI wasn't designed with SIMD in mind, so it has none of the features that would help. The stack isn't aligned, so the compiler has to generate code to create an aligned stack frame -- although I've heard that LTCG can help in this regard by eliminating this in nested calls. Perhaps more annoying is that there is no convention for preserving SSE registers or passing floats in SSE registers, so the compiler tends to bounce values out to memory and possibly through the x87 stack, even if /arch:SSE is used. This is especially distressing if you're writing a math library -- which you would think is a natural use for SSE intrinsics -- until you discover that the vector and float portions of the compiler don't talk too well to each other.

The fourth problem that I have with VC's intrinsics is that I sometimes find them harder to use -- x = _m_paddw(x, y) isn't much better than PADDW x, y, and I find the _mm_add_epi32() style particularly ugly. I've seen intrinsics code that looked like it was just translated line-by-line from assembly code, which basically just meant it was slower and uglier. They get more usable if you wrap them in operators, but then you end up with lots of function calls that impede debugging and make your debug builds suck. And isn't it supposed to be the compiler's job to wrap instructions in a higher level form??
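For the record, the operator-wrapping pattern looks something like this -- a minimal sketch of the common idiom, not any actual library's classes:

```cpp
#include <xmmintrin.h>

// Thin operator wrapper over SSE intrinsics. With inlining this costs
// nothing in an optimized build; in an unoptimized debug build, every
// '+' and '*' below becomes a real function call.
struct vec4 {
    __m128 v;
    explicit vec4(__m128 m) : v(m) {}
    vec4(float x) : v(_mm_set1_ps(x)) {}
    float x() const { float f; _mm_store_ss(&f, v); return f; }
};

inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }
```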

I should note that the x64 versions of Windows avoid a number of these issues, as the platform is guaranteed to support SSE2 and the ABI was designed with that in mind. However, with x64 being very poorly supported and Microsoft trying its best to drive it into the ground with stupidity like the signed driver requirement in Vista x64, I've almost written it off entirely.

Truth be told, I'd love to ditch assembly and use intrinsics, but I find it hard to tolerate these flaws. SIMD makes the most difference in code that is performance critical and that means it's also the code that can least tolerate flaws in the compiler's output. I also tend to run into non-SIMD issues whenever I consider the switch, because there are a lot of missing scalar intrinsics. For instance, in a lot of my scaling code I use 32:32 fixed point, where the 32-bit halves are joined by the carry flag and thus I can use the upper half directly without needing shift ops. C++ doesn't have support for the carry flag and VC++'s __int64 code generation sucks (why would you change <<32 into *2^32???). Extended-precision arithmetic is also very difficult to do with the provided intrinsics, to the point that I had to write a silly three-line assembly routine in an .asm file just to do MulDiv64() on x64. It seems like any new scalar intrinsics are being added just for the NT kernel team and not really for anyone else -- the new intrinsics that were added in VS2005 SP1, for instance, are essentially useless in user mode.
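The 32:32 layout itself is trivial to express in C++ (names below are my own); what can't be expressed is the part that matters, namely forcing the compiler to keep the halves in two registers joined by add/adc so the integer part is readable for free:

```cpp
#include <cstdint>

// 32:32 fixed point: integer part in the high 32 bits, fraction in the
// low 32. A scaler advances by a fractional step per output sample and
// reads the integer part directly as the source index. On x86, the
// ideal codegen is an add/adc pair with the high half usable as-is --
// exactly what you can't reliably coax out of __int64 math.
struct Fixed3232 {
    uint64_t v;
    uint32_t intPart() const { return (uint32_t)(v >> 32); }
};

inline Fixed3232 advance(Fixed3232 pos, uint64_t step) {
    return Fixed3232{ pos.v + step };
}
```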

As a side note, when I tried Intel C++ 6.0, it did generate very nicely optimized MMX code, but it also bloated code by about 30%. In the end, I gave up supporting it because I was tired of tracking down compiler-induced bugs like thrown exception objects being destroyed twice and misgenerated STL code. I haven't tried GCC yet... it would probably land somewhere between VC++ and Intel C++ codegen-wise, and more stably than Intel C++. Sadly, it's also hands-down the most annoying compiler on the planet.

I downloaded and installed Visual Studio 2008 Beta 2 "Orcas" a couple of days ago, and as expected, there isn't a whole lot new for C++ programmers. In fact, it looks the same as VS2005. The main new feature is file-level multiprocessor builds instead of project-level, which I didn't try out (VPC running on single core CPU), and oh, you can't build executables for any Win9x platforms. I don't know if I'd recommend upgrading, but on the other hand, less is likely to break compared to earlier upgrades.

VirtualDub 1.7.2 mostly compiles without issues on VS2008b2 after converting the solution to VS9 format. There was one line that broke. Pointer sizing was tightened slightly in MASM 9, and if you happen to have an explicit qword size on a memory operand in an MMX unpack low instruction:

punpcklbw mm0, qword ptr [eax]

...it will now fail to assemble, whereas this was fine in MASM 8 (FDBK294468). The solution is simply to use dword ptr instead, which assembles fine on both MASM 8 and MASM 9. The inline assembler in the x86 compiler still accepts either.

Strictly speaking, the new behavior is correct -- the MMX forms of punpcklbw/wd/dq are unusual in that they take a 32-bit memory argument like MOVD. I think some Intel publications occasionally got this wrong and said these instructions took m64 instead of m32, although the current manuals are right. I first saw this difference called out in an AMD optimization guide, and it's significant with regard to misalignment penalties and page faults. You can thus safely pick up a misaligned quadword as follows:

movd mm0, dword ptr [eax]
punpckldq mm0, dword ptr [eax+4]

Note that this isn't necessarily faster than an unaligned read, because modern x86 CPUs generally only impose an alignment penalty if you cross an L1 cache line. Also, the MMX high unpack versions (punpckhbw/wd/dq) and the SSE2 integer unpack instructions still do full 64-bit and 128-bit reads, respectively.

Incidentally, the new assembler also appears to accept the new SSE4.1 opcodes, such as PMOVZXBW, if you're so inclined.

It's amazing how much faster computers have become. Sure, they can never be fast enough for some purposes -- like, uh, video processing -- but for others, it's getting a bit ridiculous.

A linear feedback shift register is a type of sequence generator that is useful for quickly generating pseudo-random sequences of numbers. It's not a great generator, but because it only requires shifts and XORs, it's very fast. It also easily generates a maximal non-repeating sequence of size 2^N-1, which means it's also useful for generating exact coverage patterns without duplicates or dropouts, especially since the sequence is generated on the fly and isn't stored. It can be used for real-time image dissolves and static noise generation... on a 1MHz Apple II! I haven't used it in VirtualDub yet, but there are a few places where I could, such as if I needed a random dither for audio sample conversion from 16-bit to 8-bit (although I'd probably try error diffusion first).

Anyway, today I needed a 32-bit LFSR generator. An LFSR generator basically shifts new bits in that are XORs of a series of taps on the shift register, and the position of those taps is critical: if they aren't correct, the generator won't produce the maximum possible sequence. I'm too lazy to actually look up a list of primitive polynomials, though, so I usually just try a bunch of tap combinations until I get one that works. So I picked four taps and just let a test app measure the length of the sequence. It then dawned on me that in less than a minute, the computer had run through all four billion bits in the sequence. And I didn't even optimize the algorithm.
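The brute-force test really is just a handful of lines. This sketch uses the Galois form -- shift, then conditionally XOR in the whole tap mask -- which has the same period as the feed-in form described above, and counts steps until the state repeats:

```cpp
#include <cstdint>

// One step of a Galois LFSR: shift right, and XOR in the tap mask
// whenever a 1 bit falls out.
uint32_t lfsr_step(uint32_t state, uint32_t taps) {
    uint32_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= taps;
    return state;
}

// Brute-force the sequence length: run until we return to the seed.
// A maximal N-bit LFSR comes back after exactly 2^N - 1 steps.
uint64_t lfsr_period(uint32_t seed, uint32_t taps) {
    uint32_t s = seed;
    uint64_t n = 0;
    do {
        s = lfsr_step(s, taps);
        ++n;
    } while (s != seed);
    return n;
}
```

With a known-maximal 16-bit mask like 0xB400 this comes back with 65535 instantly; the 32-bit case is the same loop, just the minute or so of CPU time described above.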

Basically, computers have gotten fast enough that sometimes it's better just to brute force test all inputs to an algorithm... it's easy, it's harder to screw up, and a passing result from one is fairly convincing. And it's lazy. I like it when companies like Intel and NVIDIA help me be lazy. I look forward to my next upgrade, which will probably include some dual core and DX10 goodness and help me be more lazy.

(Sharp readers will note that since I was able to run through the 32-bit sequence in less than a minute, I probably need a longer generator, and I can't really brute force test a 64-bit LFSR... but hey, at least it'll still be damn fast.)

If you look closely, you'll notice there's a big units problem here -- I've computed audio bit rate using SI units (1k = 1000), but video byte rate using binary units (1k = 1024), and labeled both as kbps. That's incorrect and confusing, but that's what Windows XP Explorer does. In this case, the actual video rate is closer to 400 kbps (1 kbps = 1000 bits/second). The other thing that's really bogus here is that the video rate is being computed according to the size of the entire file, which includes headers, audio, and whatever junk may be at the end. I did a test of concatenating an AVI file to itself back-to-back, and the reported video rate doubled. Oops.
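To spell out the arithmetic behind the unit mixup (hypothetical numbers, helper names mine):

```cpp
#include <cstdint>

// Bit rate computed both ways. Explorer mixed them: SI units for the
// audio rate, binary units for the video rate, both labeled "kbps".
double si_kbps(uint64_t bytes, double seconds) {
    return bytes * 8.0 / seconds / 1000.0;   // 1 kbps = 1000 bits/sec
}

double binary_kbps(uint64_t bytes, double seconds) {
    return bytes * 8.0 / seconds / 1024.0;   // 1 "kbps" = 1024 bits/sec
}
```

For a stream around 400 SI kbps, the binary figure comes out about 2.4% lower, which is exactly the kind of quiet discrepancy that makes the labels misleading.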

So, basically, ignore whatever Windows XP Explorer says.

That brings us to Windows Vista. For some AVI files, it does display reasonable bitrates, but for files written by VirtualDub, it still displays strange values. I'll spare you all of the experiments I did and just give you the actual algorithm used by Windows Vista Explorer:

I had originally intended to post a massive pair of blog entries about bicubic filtering, but I committed the sin of Exiting Without Saving, so I guess I'll wing it instead.

Bicubic filtering is generally the next step up from bilinear filtering; it's not nearly as big of an improvement as bilinear is from point sampling (a.k.a. no filtering), but it's noticeably better if you are doing enlargement or gradient determination. In terms of implementing it on hardware, all decent 3D graphics hardware supports bilinear filtering, but none of them support bicubic. (Well, ATI's R600 is touted as doing so, but I haven't heard any information about how you'd actually use it, so as far as I'm concerned it doesn't exist.) That means bicubic has to be emulated, and as usual, the cheaper the implementation, the better.

Bicubic filtering, or rather, cardinal spline bicubic filtering, is a 4x4 (16 tap) filter composed of two 1-D 4 tap cubic filters. Well, actually only for interpolation, but that's still a very useful case. The main difficulty in implementing this in a shader is that the filter kernel contains two negative lobes, which correspond to the outer two taps on the discrete filter. GPU Gems 2 has a way to implement B-spline bicubic via the bilinear filter hardware, but that relies on all of the taps being positive. NVIDIA also has a direct 2D implementation in bicubic.fx in their SDK, but it's absurdly expensive, at ~50 clocks per pixel -- pretty close to not being real-time.
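For reference, here's the 1-D kernel in question, using the common Catmull-Rom case (a cardinal spline with a = -1/2). Note that the weights on the outer two taps go negative, which is exactly what breaks the all-positive bilinear trick:

```cpp
// 1-D Catmull-Rom cubic interpolation across four taps p0..p3,
// with t in [0,1) positioned between p1 and p2. The w0 and w3
// weights are the negative lobes; all four always sum to 1.
float cubic4(float p0, float p1, float p2, float p3, float t) {
    float t2 = t * t;
    float t3 = t2 * t;
    float w0 = -0.5f*t3 + 1.0f*t2 - 0.5f*t;
    float w1 =  1.5f*t3 - 2.5f*t2 + 1.0f;
    float w2 = -1.5f*t3 + 2.0f*t2 + 0.5f*t;
    float w3 =  0.5f*t3 - 0.5f*t2;
    return w0*p0 + w1*p1 + w2*p2 + w3*p3;
}
```

The 2-D (bicubic) version applies this horizontally four times and then once vertically across the results, which is where the 16-tap count comes from.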

Something I've been experimenting with a lot lately is field-based display of video on a computer screen. I've written about this before, but to review, regular analog video isn't composed of a series of sequential frames, but alternating half-frames called fields at twice the rate. That is, instead of updating the whole screen at 30Hz, even and odd scanlines update at 60Hz. This is called interlacing, and is an attempt to increase the quality of video by getting some of the benefits of higher resolution and smoother motion.

It's also a gigantic pain.

Frequently, such video is displayed on a computer screen by simply displaying pairs of fields as frames at 30Hz. The most common objectionable result of this is intermittent combing caused by pairing fields that don't match, which can be resolved by deinterlacing. If you're dealing with film-rate material that has been upsampled from 24 fps, this may look fine. However, for material that's actually been recorded at 60 fields/second, the difference can still be significant and the 30Hz output will be considerably less fluid. The right thing to do is to upsample the video from 60 fields/second to 60 frames/second. This isn't easy, nor is there one "right" way to do it, but the result is worth the effort... and it takes a lot of effort, at least on Windows.
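The simplest version of that upsampling is "bob": each field becomes its own full frame, with the scanlines it lacks interpolated from the field lines above and below. A bare-bones single-channel sketch (names are mine; it ignores the half-line vertical offset between fields, which a real implementation must compensate for):

```cpp
#include <vector>
#include <cstdint>

// Bob deinterlace: expand one field (the even or odd lines of a frame)
// into a full frame. Lines belonging to this field are copied through;
// the missing lines are the average of the field lines above and below,
// clamped at the top and bottom edges.
std::vector<uint8_t> bobField(const std::vector<uint8_t>& frame,
                              int width, int height, bool topField) {
    std::vector<uint8_t> out(frame.size());
    int start = topField ? 0 : 1;

    // Copy the lines this field actually contains.
    for (int y = start; y < height; y += 2)
        for (int x = 0; x < width; ++x)
            out[y*width + x] = frame[y*width + x];

    // Interpolate the lines belonging to the other field.
    for (int y = 1 - start; y < height; y += 2)
        for (int x = 0; x < width; ++x) {
            int above = (y > 0) ? y - 1 : y + 1;
            int below = (y < height - 1) ? y + 1 : y - 1;
            int avg = (frame[above*width + x] + frame[below*width + x] + 1) / 2;
            out[y*width + x] = (uint8_t)avg;
        }
    return out;
}
```

Run once per field, this turns 60 fields/second into 60 frames/second; the quality ceiling comes from the purely vertical interpolation, which is why motion-adaptive and motion-compensated methods exist.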