Stuff

I spent some time today trying to get some Direct3D 9 code to switch the display into 50Hz refresh so it could display PAL video. Displaying PAL video on a 60Hz display doesn't look very good because of the difference in frame rates -- you get a beat at the difference between the rates, 10Hz in this case -- and so it's better to change the display mode to match. Problem is, a 50Hz mode wasn't showing up in the mode list for my monitor, so I opened the NVIDIA control panel and added a custom 50Hz mode. The test ran fine and the monitor happily switched into a PAL-compatible refresh rate.

Then I tried the Direct3D 9 code, and it refused to do the mode switch.

While searching around for some AVX docs, I happened to find a blog post on Intel's website describing how to optimize an image processing routine. The gist of the article was that you could get big gains just by throwing some VC++ compiler switches such as /arch:SSE2 or /arch:AVX to tell the compiler to use vector instructions. Presto, your code magically gets faster with less than an hour of work and without having to modify the algorithm!

Of course, my next thought was: "Yeah, until QA gives you an A-class bug the next day saying that the code now crashes on an Athlon XP or Core i7."

The documentation for the Visual C++ /arch compiler switch is labeled "Minimum CPU Architecture," but it should probably emphasize the ramifications of this switch. If you use this switch, your code will crash on any CPU that doesn't support the required instruction set. Unlike the Intel compiler, which has options to auto-dispatch to different code paths depending on the available instruction set, the VC++ compiler will simply blindly generate code for the target CPU. Therefore, you can also reinterpret the switches as follows:

/arch:SSE: Generate code that crashes on an Athlon.

/arch:SSE2: Generate code that crashes on an Athlon XP.

/arch:AVX: Generate code that crashes on a Core i7.

This is not to say that the /arch switch is bad, as the compiler does actually generate faster code when it can use vector instructions. The problem is that unless you can absolutely guarantee that your EXE or DLL will never run on a CPU lower than the specified tier, you can't use those switches. Okay, so /arch:SSE is probably pretty safe at this point, and you may be able to justify /arch:SSE2. You'd be insane to throw /arch:AVX on your whole app unless you really want to require a Sandy Bridge or Bulldozer CPU (only one of which has shipped as of today).
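The auto-dispatch the Intel compiler does for you has to be done by hand with VC++, and it starts with a CPUID check. A minimal sketch of the detection half (the helper names are mine; note that a complete AVX check must additionally verify OSXSAVE and query XGETBV to confirm the OS preserves YMM state, which is omitted here for brevity):

```cpp
#include <cstdint>

#ifdef _MSC_VER
#include <intrin.h>
static void cpuid1(uint32_t regs[4]) {
    __cpuid(reinterpret_cast<int *>(regs), 1);
}
#else
#include <cpuid.h>
static void cpuid1(uint32_t regs[4]) {
    __get_cpuid(1, &regs[0], &regs[1], &regs[2], &regs[3]);
}
#endif

// CPUID function 1: EDX bit 26 = SSE2, ECX bit 28 = AVX.
// Caution: the AVX bit alone is not sufficient -- you must also
// check OSXSAVE (ECX bit 27) and XGETBV/XCR0 before using YMM.
bool hasSSE2() {
    uint32_t r[4];
    cpuid1(r);
    return (r[3] & (1u << 26)) != 0;
}

bool hasAVX() {
    uint32_t r[4];
    cpuid1(r);
    return (r[2] & (1u << 28)) != 0;
}
```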

What about compiling only some of your code that way? You can pull this off if you build multiple DLLs or EXEs and switch them based on the architecture, at the cost of additional deployment and testing hassle. Compiling different modules within the same DLL or EXE with different /arch settings, though, is dangerous. Take this function:

float foo(float x, float y) {
    return std::min(x, y);
}

Do a little #define foo magic and #include this from a few .cpps with different /arch settings, and you can extrude out x87/SSE/SSE2/AVX versions from the same file. There's only one small problem: the call to the std::min() function. std::min is a template and in the VC++ compilation model it is compiled with each .cpp file that instantiates it, meaning that each of the platform modules compiles its own version of the std::min template specialized for x87/SSE/SSE2/AVX. Where this goes wrong is when the linker collapses all of the COMDAT records and discards all but one instantiation of std::min<float>(). You don't know or control which one it picks because they're supposed to be the same. When I tested this locally, it picked the AVX version and the program crashed on my Core i7 laptop. Oops.
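The #define trick compresses to something like the following (file and macro names are hypothetical; in the real setup each expansion lives in its own .cpp compiled with a different /arch setting, rather than side by side in one file):

```cpp
#include <algorithm>

// foo_kernel.inl -- shared body, included once per arch module:
//     float FOO_NAME(float x, float y) {
//         return std::min(x, y);
//     }
//
// foo_sse2.cpp (compiled with /arch:SSE2):
//     #define FOO_NAME foo_SSE2
//     #include "foo_kernel.inl"

// Single-file simulation of the same stamping:
#define MAKE_FOO(name) float name(float x, float y) { return std::min(x, y); }

MAKE_FOO(foo_x87)
MAKE_FOO(foo_SSE2)
MAKE_FOO(foo_AVX)
```

Each stamped function is distinct and safe; it's the shared std::min<float> instantiation behind them that the linker is free to fold down to a single arbitrary copy.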

What this means is that linking in modules with mixed /arch settings is broken unless you take special care not to use any inline or template functions within the arch-dependent modules, which excludes a substantial portion of the C++ standard library.
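If you do split kernels by /arch within one binary, one way to stay out of trouble is to restrict the arch modules to plain free functions (no templates, no inline library calls) and select between them once at startup through a function pointer. A sketch, with the detection stubbed out and all names mine:

```cpp
// In each arch-specific module, only plain free functions, so no
// shared template instantiations exist to be COMDAT-folded:
//     float foo_x87(float x, float y);   // compiled without /arch
//     float foo_SSE2(float x, float y);  // compiled with /arch:SSE2

// Single-file stand-ins for those modules:
float foo_x87(float x, float y)  { return x < y ? x : y; }
float foo_SSE2(float x, float y) { return x < y ? x : y; }

// Detection stub -- a real build would do a CPUID check here.
static bool cpuHasSSE2() { return false; }

// Dispatch decided once; hot code calls through the pointer and
// never executes instructions the CPU might not have.
static float (*foo)(float, float) = cpuHasSSE2() ? foo_SSE2 : foo_x87;
```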

In conclusion, enabling enhanced instruction sets isn't something you can just do in an hour even if it's just a drop-down option in your project settings. You need to understand the full ramifications of the change and determine whether it also involves changes to your program's minimum required system specifications or the way you need to organize and build the affected code.

A limitation of 32-bit code is that it lives in a 32-bit address space, which limits directly addressable memory to 2-4GB and typically usable memory to as low as 1.5GB. Well, I've found that this is sometimes a benefit. You see, a Core i7 CPU can run code really fast, and if you've got a piece of code stuck in a loop allocating memory, that means it can allocate memory really fast too -- so fast that it sends the system into swap death within seconds. Sometimes with a lightweight debugger like WinDbg/CDB or with command-line utilities like taskkill.exe you can nuke the process quickly, but with a heavyweight IDE like Visual Studio it can take minutes for the debugger to respond and for the system to recover. If you have enough system memory, however, the 32-bit process will run out of address space and crash out before it can do damage. It's funny to see a process instantly hit the 1.5GB brick wall in Task Manager and then break in the debugger.

(You might ask why I have a paging file at all. Well, certain features of Windows require a paging file because it's a convenient way to do critical low-level I/O, such as writing a kernel dump file. It also increases the chances of you being able to catch the offender before the system really absolutely runs out of memory, at which time Bad Things Happen(tm).)

There are a few file formats that have really well thought out documentation, such as PNG. Then there's the archive format ARC, which I was trying to write a decoder for. The great thing about search engines on the Internet is their ability to magnify the popularity of a single document called "The ARC archive file format" that doesn't actually tell you most of what you need to know about the format, such as the bitstream ordering for bit packing.

Now, there is source code available for ARC extraction, but using source code isn't always the best idea. First, there can be legal issues if the licensing or origin of the code turns out to be unfavorable, but another problem is that source code often isn't very good documentation and often still needs to be reverse engineered a bit to extract the essential bits... assuming it's even comprehensible. As such, I often avoid using source code as a reference if I can. Fortunately, the compression methods used here aren't crazy like arithmetic encoding and can be eyeballed in a hex editor for verification, so with some elbow grease the rest can be determined from existing archives and decoders.
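The bitstream-ordering question above comes down to whether codes are packed starting from the low or the high bit of each byte, and it's exactly the sort of thing you end up confirming against real archives. For reference, here's a generic LSB-first bit reader in the style used by compress(1)-like LZW streams, one of the two orderings you'd try (the struct and its names are mine, not from any ARC decoder):

```cpp
#include <cstddef>
#include <cstdint>

// LSB-first bit reader: codes are packed starting at the low bit of
// each byte, with later bytes supplying higher-order bits.
struct BitReaderLSB {
    const uint8_t *src;
    size_t len;
    size_t pos = 0;    // next byte to consume
    uint32_t acc = 0;  // bit accumulator, low bits leave first
    int bits = 0;      // valid bit count in accumulator

    BitReaderLSB(const uint8_t *s, size_t n) : src(s), len(n) {}

    // Read n bits (n <= 24); returns -1 at end of stream.
    int32_t read(int n) {
        while (bits < n) {
            if (pos >= len)
                return -1;
            acc |= (uint32_t)src[pos++] << bits;
            bits += 8;
        }
        int32_t v = (int32_t)(acc & ((1u << n) - 1));
        acc >>= n;
        bits -= n;
        return v;
    }
};
```

Flipping the packing convention (MSB-first) changes only how the accumulator is filled and drained, which is why a wrong guess decodes to plausible-looking garbage rather than failing loudly.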