This is the division approximation code for the temporal smoother filter in VirtualDub. It essentially computes:

result.rgb = color.rgb * div_table[sum >> 48];

The crash was an access violation, indicating a bad pointer. Problem number one: the way the code's structured, the table index can never exceed 128. Problem number two:

EAX = 00800000

Extracting the top 16 bits of a 64-bit unsigned quantity gave a value bigger than 0x10000. That's... not possible.
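To make the impossibility concrete, here's the invariant in code (tableIndex is my name for it, not VirtualDub's):

```cpp
#include <cstdint>

// The table index comes from the top 16 bits of a 64-bit sum, so it
// mathematically cannot reach 0x10000 -- let alone EAX's 0x00800000.
inline uint32_t tableIndex(uint64_t sum) {
    return (uint32_t)(sum >> 48);   // always in [0, 0xFFFF]
}
```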

I couldn't figure out how this could happen, so I wrote back to the user asking if the crash was reproducible. As it turned out, he'd already diagnosed the problem: bad RAM. My guess is that the OS did a context switch in the middle of these instructions, giving EAX the opportunity to be dumped to memory and corrupted there. Sometimes the impossible does actually happen... well, at least when hardware failure is involved.

Why are the "even field first" and "odd field first" labels in VirtualDub reversed?

Well, actually, they're not, depending on how you label scan lines.

The even field in VirtualDub corresponds to the field that has the upper set of scan lines, and the odd field is the one with the lower set. The reason is that internally VirtualDub numbers scan lines so that the top scan line is scan line 0. That means the even field consists of lines 0, 2, 4... and the odd field of lines 1, 3, 5..., so the even field sits higher than the odd field. If you number scan lines starting from 1, though, this is backwards and thus confusing. When this issue was raised, I did some searching around and didn't find a clear consensus on scan line numbering, so the plan is to abandon even/odd terminology in the UI and just use top/bottom instead across the board.
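In code, with 0-based numbering, the convention looks like this (isInEvenField is a made-up illustration, not a VirtualDub function):

```cpp
// Scan line 0 is the topmost line, so the even field (lines 0, 2, 4...)
// is the spatially upper field and the odd field (lines 1, 3, 5...) is
// the lower one.
inline bool isInEvenField(int scanLine) {
    return (scanLine & 1) == 0;
}
```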

If you're confused about the field order of a clip, the best way to check it is to use bob deinterlacing to double it to field rate and then see whether you get motion juddering. The setting that gives you smooth output is the correct one. You can do this in VirtualDub via the "bob doubler" filter in 1.8.0+, or the "deinterlace" filter in 1.9.2+. Unfortunately, there are a few places where I've goofed the field order settings: the bob doubler had this backwards until 1.8.2, and I've just been informed that it's currently backwards in the new IVTC filter in 1.9.x. I'm working on making sure everything's correct for 1.9.3.

I've always thought that hash tables were well named, because often when you see how people have used them you wonder what they were smoking at the time. Often the problem revolves around the mistaken notion that swapping a binary search tree for a hashed container bestows some sort of magical constant-time lookup behavior, but sometimes it's more subtle. One such case involves the choice of hash function.

The hash function for a hashed container converts a key to a bucket index with the intent of trying to distribute data items as evenly as possible. Given a decent distribution for input values, the hash function for an integral key can be as simple as just using the integer value itself, with the container then applying a modulus operation to wrap it within the bucket count. Anyone who's gone down this route, however, then discovers the problem of trying to do this for a key that is of floating point type. Usually the first thing they try is something like this:

size_t operator()(float value) const { return (size_t)(value * 100); }

Unfortunately, this is usually fairly slow due to the poor performance of the float-to-int conversion. There's also slightly worse behavior around zero, due to truncation toward zero instead of toward negative infinity.
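The zero problem is easy to see concretely with an int cast (a signed type, since converting a negative float to size_t is itself undefined behavior); naiveBucket is just the hash above under a name of my choosing. Every value in (-0.01, 0.01) lands in bucket 0, making that bucket twice as wide as its neighbors:

```cpp
// The multiply-and-truncate hash from above, with an int cast so the
// sign behavior is visible. Truncation toward zero folds the whole
// range (-0.01, 0.01) into bucket 0.
inline int naiveBucket(float value) {
    return (int)(value * 100);
}
```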

At this point, the inclination is probably to just give up and either deal with it or use a different container. Others go "aha!" and use this hack instead:
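Something along these lines; this is a reconstruction assuming a 32-bit unsigned type and IEEE single-precision float, wrapped in a hypothetical FloatHash functor:

```cpp
#include <cstddef>

struct FloatHash {
    // Reinterpret the float's bit pattern as an integer: fast, but
    // non-portable and a violation of strict aliasing rules.
    size_t operator()(float value) const {
        return *(const unsigned int *)&value;
    }
};
```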

This code uses the bit pattern of the float as the hash value. Yeah, it's non-portable. It's also got problems with the aliasing rules of the C language. In the not-so-unusual case where you can depend on a 32-bit integral type and IEEE single-precision floating point, though, it's a really neat and fast trick. And, sadly, it's also wrong. If you've done this or thought about it, don't feel bad. The .NET Framework team almost made this mistake, too.
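In case the flaw isn't obvious: +0.0f and -0.0f compare equal, so they're the same key, yet their bit patterns differ in the sign bit, so a bit-pattern hash can put them in different buckets. A quick check, using memcpy to read the bits without the aliasing problem:

```cpp
#include <cstdint>
#include <cstring>

// Extract a float's bit pattern without violating aliasing rules.
inline uint32_t bitsOf(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```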

After experimenting with pixel shaders for video acceleration, I decided to give NVIDIA's CUDA a shot to see how it performed on the same hardware. As I've noted before, Direct3D 9 can be used for video acceleration, but it has performance problems in specific areas. CUDA uses a more general programming model and also provides more direct access to the video driver, making it potentially more interesting for video acceleration.

The first thing I have to say about CUDA is that it's fairly easy to use. Kernels are written in C using the special nvcc precompiler, which allows you to mix host and device code; the precompiler automatically splits the two apart, compiling the host code with VC++ or GCC and the device code separately, then linking the two together. This means you can compile a CUDA application from the command line as easily as a standard Hello World application. Invoking a kernel on the device is also easy:

Filter<<<M, N>>>(dy, py, dx, px, 1024, 1024);

This special syntax looks just like a function call, with the added fields in angle brackets describing the grid and block dimensions for threading purposes. The call is asynchronous, so it's easy to do the right thing: queue up kernel calls on the device and wait only when you need to do data transfers. Those are also easy, because CUDA has device analogs for the malloc(), free(), memcpy(), and memset() operations. The API is also nicely partitioned, with the runtime API being suitable for writing CUDA applications directly and the explicit driver API being better for integration into frameworks. You can even use PTX assembly language if you need to check the compiler's output or generate code for CUDA directly.
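Putting those pieces together, a minimal host-side round trip looks roughly like this; Scale is a placeholder kernel of my own invention, not the Filter kernel above:

```cuda
#include <cuda_runtime.h>

// Trivial placeholder kernel: scale a buffer by one half.
__global__ void Scale(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i] * 0.5f;
}

void runScale(float *hostDst, const float *hostSrc, int n) {
    float *devSrc, *devDst;
    cudaMalloc(&devSrc, n * sizeof(float));      // device analog of malloc()
    cudaMalloc(&devDst, n * sizeof(float));
    cudaMemcpy(devSrc, hostSrc, n * sizeof(float), cudaMemcpyHostToDevice);

    // Asynchronous launch: 256 threads per block, enough blocks to cover n.
    Scale<<<(n + 255) / 256, 256>>>(devDst, devSrc, n);

    // The device-to-host copy waits for the queued kernel to finish.
    cudaMemcpy(hostDst, devDst, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devSrc);                            // device analog of free()
    cudaFree(devDst);
}
```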

My first attempt at CUDA didn't give stellar performance, but the result was reasonable: