Stuff

I've recently been reading the specs for the SSE4 extensions that Intel is adding to Penryn, the successor to their Core 2 Duo CPU. It looks very interesting and more extensive than the small additions made in Prescott (SSE3; mainly horizontal add/subtract) and Core 2 Duo (SSSE3; mainly absolute value and double-width align). One thing stood out to me immediately, which is that Intel is re-introducing implicit special registers into SSE -- specifically, the variable blend instructions use XMM0 as an implicit third argument, since the instruction format normally only accommodates two-address opcodes. Argh. Okay, I can deal with that, since there are already precedents in IMUL, SHLD/SHRD, and MASKMOVQ. But there was something else that stood out.

They're finally giving us dot product instructions.

This may not seem special, but after writing graphics vertex shaders, the inflexible programming model of SSE seems very constraining. A dot product, or inner product, computes the sum of the component-wise products of two vectors -- that is, for (ax, ay, az) and (bx, by, bz), the dot product result is ax*bx + ay*by + az*bz. This is useful for a number of geometric and signal processing operations, but one very common use is transformation of a vector by a matrix for 3D graphics. Now, on a GPU, both multiply-add (mad) and dot product (dp2/dp3/dp4) are one-clock instructions. This means that you have complete freedom to choose either row-major or column-major storage for transform matrices, because the only change is whether you use a series of mads or dp4s to do the transformation. In SSE code, though, matrix transformations are more awkward, meaning that the matrix layout is constrained by the more efficient path in the assembly code. If you're storing row-major matrices, then you splat the vectors and do multiply-adds; if you're storing column-major, then you multiply and do horizontal adds, or junk the routine and make it row-major because you don't have SSE3. The new DPPS and DPPD instructions allow the CPU to more closely mimic the GPU algorithm, which is nicer from a hand-coding standpoint, and much nicer to a vertex shader emulator.
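To make the layout question concrete, here's the row-major path modeled in scalar C: splat each component of v, multiply it against one matrix row, and accumulate -- the shape a SHUFPS/MULPS/ADDPS sequence takes. The function name is made up for illustration; this just mirrors the dataflow, not any intrinsic.

```c
/* Scalar model of the row-major SSE path: for each component of v,
   splat it (shufps), multiply it against one matrix row (mulps), and
   accumulate into the result (addps). transform_row_major is a
   hypothetical helper name, not an actual intrinsic. */
void transform_row_major(const float row[4][4], const float v[4], float out[4])
{
    for (int j = 0; j < 4; ++j)
        out[j] = 0.0f;

    for (int i = 0; i < 4; ++i) {
        float s = v[i];                 /* splat: shufps xmm, xmm, imm */
        for (int j = 0; j < 4; ++j)
            out[j] += s * row[i][j];    /* mulps + addps */
    }
}
```

No horizontal operations are needed anywhere in this path, which is why row-major layout is the comfortable choice when all you have is SSE/SSE2.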

The new DPPS and DPPD instructions aren't just nice, though -- they're also unusually flexible. I had expected DPPS and DPPD to just be straight 4-D and 2-D dot products that right-align or splat the result, but they're a lot more featureful than that due to an immediate argument. You can mask out some or all components in the dot product as well as some or all of the destination components. For instance, you can compute (ax*bx + az*bz) and store it into Y and W of the output, with X and Z being zero. Among all of the possibilities, this allows DPPS to compute 2-D, 3-D, and 4-D dot products. Cool!
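Modeled in scalar C, my reading of the spec is that the high nibble of the immediate masks which products enter the sum and the low nibble masks which destination lanes receive it. dpps_model is a made-up helper name, not the real intrinsic:

```c
#include <stdint.h>

/* Scalar model of DPPS: imm8 bits 4-7 select which lane products are
   summed; imm8 bits 0-3 select which destination lanes receive the
   sum, with the remaining lanes zeroed. */
void dpps_model(const float a[4], const float b[4], uint8_t imm8, float out[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10u << i))        /* source mask (high nibble) */
            sum += a[i] * b[i];
    for (int i = 0; i < 4; ++i)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;   /* dest mask */
}
```

The (ax*bx + az*bz) into Y and W example above is imm8 = 0x5A -- source mask 0101, destination mask 1010 -- and a 3-D dot product broadcast to all four lanes is 0x7F.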

When I thought about it some more, though, it wasn't as impressive as it first seemed:

You don't necessarily save on instructions in a matrix transform.

You don't necessarily save on instructions with the destination mask.

Let's take the matrix transform first -- specifically, a 4x4 transform with column-major layout. The components of the result vector from transforming a vector v by such a matrix M are the dot product of v by each of the rows of M, so in SSE3 this takes four MULPS instructions and three HADDPS instructions, with all four dot products happening in parallel in the HADDPSes. With DPPS, you can do the dot products directly... but you still have to merge the results using three ADDPS or ORPS instructions, since DPPS only allows you to place zeroes with the destination mask, not merge the result with another vector. That means that in the end, you're still at seven instructions. In fact, I don't think it's possible to do it in fewer than seven instructions, no matter what instructions you use, as long as you are constrained to two-address opcodes. No matter what, you need four instructions to compute intermediate results, and three to merge those together (since you always eliminate one intermediate per merge instruction). You do win in the 4x3 case by one instruction, though... but that assumes that DPPS is as cheap as MULPS, which may not be the case.
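For reference, here's the SSE3 seven-instruction pattern modeled in scalar C: four component-wise multiplies, then three horizontal adds that funnel all four dot products into one register. The helper names are invented; haddps_model just mirrors the HADDPS result layout.

```c
/* Scalar model of MULPS: component-wise multiply. */
static void mulps_model(const float a[4], const float b[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] * b[i];
}

/* Scalar model of SSE3 HADDPS: the result is
   {a0+a1, a2+a3, b0+b1, b2+b3} -- adjacent pairs from each operand. */
static void haddps_model(const float a[4], const float b[4], float r[4])
{
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}

/* Seven-op transform: out[j] = dot(v, m[j]) for each of the four
   stored vectors m[j] -- 4 MULPS + 3 HADDPS. */
void transform_mul_hadd(const float m[4][4], const float v[4], float out[4])
{
    float p0[4], p1[4], p2[4], p3[4], h0[4], h1[4];
    mulps_model(v, m[0], p0);
    mulps_model(v, m[1], p1);
    mulps_model(v, m[2], p2);
    mulps_model(v, m[3], p3);
    haddps_model(p0, p1, h0);   /* halves of dot0 and dot1 */
    haddps_model(p2, p3, h1);   /* halves of dot2 and dot3 */
    haddps_model(h0, h1, out);  /* {dot0, dot1, dot2, dot3} */
}
```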

The destination mask is neat, but you probably still need to merge something other than zero into those masked channels. For instance, if you're only writing to XYZ, you probably want 1 in W and not 0. Thing is, all of the cases I could think of that used the destination mask followed by ADDPS/ORPS were just as easily handled by BLENDPS, which selectively copies channels using an immediate mask. You don't save instruction count here, so the remaining benefit would be if ADDPS/ORPS were cheaper than BLENDPS, which could be the case. In a vertex shader, just about every instruction has a destination write mask, so this blend operation is free -- you can just have successive dp4 instructions write into r0.x, r0.y, r0.z, and r0.w. Unfortunately, we're still not there yet in x86.
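BLENDPS modeled the same way: the low four immediate bits pick lanes from the source operand, and a clear bit keeps the destination lane. This is how you'd patch a 1 into W after a masked dot product. As before, blendps_model is a made-up name for a scalar sketch, not the intrinsic itself.

```c
#include <stdint.h>

/* Scalar model of BLENDPS: for each lane i, imm8 bit i selects the
   source lane; a clear bit keeps the destination lane. */
void blendps_model(float dst[4], const float src[4], uint8_t imm8)
{
    for (int i = 0; i < 4; ++i)
        if (imm8 & (1u << i))
            dst[i] = src[i];
}
```

For example, blendps_model(result, ones, 0x8) writes 1.0 into W only -- the same single extra instruction you'd otherwise spend on an ADDPS or ORPS after DPPS's zeroing mask.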

By the way, the instruction encodings are getting a bit ridiculous: DPPS is 66 0F 3A 40 /r ib, for a total of six bytes. I'm sure glad I didn't hardcode my disassembler to one- and two-byte opcodes. This makes PowerPC machine code look compact, too.

It's worth noting that I haven't tried SSE4, not even with the software emulator that Intel supposedly has in the Penryn SDK, so I haven't really pounded on the instructions yet. The packed move-with-zero-extension instructions alone could probably significantly shorten a number of VirtualDub's image processing routines, since I spend a lot of time unpacking 32-bit pixels into 64-bit intermediates -- assuming, of course, PMOVZXBW isn't just immediately split into load and unpack uops by the instruction decoder. The thing is, even though SSE4 is more significant to what I do than SSE3 and SSSE3 were, it still isn't anywhere near the magnitude of MMX and SSE, which added major new functionality to the ISA. It also doesn't help with the people who don't have SSE4... heck, I still assume that some people don't have MMX, and in commercial software you can't generally assume more than SSE right now unless you're really aiming for the high end.
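For what it's worth, PMOVZXBW's job modeled in scalar C: eight packed bytes zero-extended to eight 16-bit words, i.e. two 32-bit pixels widened to 16-bit-per-channel intermediates, replacing the classic zeroed-register-plus-PUNPCKLBW pair. The helper name is hypothetical.

```c
#include <stdint.h>

/* Scalar model of PMOVZXBW: zero-extend eight packed bytes (say, two
   32-bit BGRA pixels) to eight 16-bit words. The classic pre-SSE4
   equivalent is a PXOR to clear a register plus a PUNPCKLBW. */
void pmovzxbw_model(const uint8_t src[8], uint16_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];    /* high byte of each word ends up zero */
}
```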

I've been struggling with video display issues in VirtualDub under Windows Vista for a while now, as some of you may know. I hit a couple of snags during the beta, one of which was due to a DirectDraw implementation issue in the OS that was fixed in RTM. 1.6.17 works decently well in Vista, fortunately. However, as I've optimized and reworked the display code in 1.7.x, I'm finding that I'm hitting a lot of weird issues in Windows Vista again that I wasn't seeing in Windows XP. I spent part of last weekend fighting these again in another fit of frustration over things not displaying when they should.

When I see the exact same issues on two machines running Vista, one with an NVIDIA card and one with an ATI card, I'm inclined to believe it's Microsoft's fault.

The first problem, which I've mentioned before, has to do with DirectDraw hardware video overlays -- these are essentially secondary displays that are composited on top of the main one in the video scanout hardware itself. Yeah, yeah, Microsoft's been saying that video overlays are outdated... but they're the only widely available way to do hardware YCbCr color conversion for commonly used formats without requiring 3D pixel shaders of some sort. You'd be hard pressed to find a system out there with a resolution greater than 800x600 that doesn't support a YUY2 overlay. Well, the problem is that Vista will happily let you create a hardware overlay surface, populate it, and show it -- without actually displaying anything. Your program thinks it's happily displaying video, when the user is actually seeing green, magenta, or whatever you use for your colorkey. Lame. I worked around this in VirtualDub 1.7 by calling DwmIsCompositionEnabled() if it is available, and forcibly disabling overlays if the DWM is compositing.

The second problem is more insidious. For various reasons, I'm moving the display code to a separate thread in 1.7.2, and this is exposing a lot of weird threading issues in Windows, like the HTTRANSPARENT issue I mentioned earlier. Well, another problem I hit is that the DWM doesn't seem to consistently update its composition tree when you have a child 3D window in another thread -- you can call Present() in Direct3D or SwapBuffers() in OpenGL, and nothing shows up. In fact, you get junk from underneath the window. I beat my head against the desk for hours trying to figure this out, and came to the following conclusions:

That the DWM eliminates flicker is BS. It can sometimes make it worse, because changes in the window hierarchy that were ordinarily atomic can now be visible. Formerly, if you deleted a child window and created a new one in its place, you were OK because the message pump didn't run in between and WM_PAINT couldn't be sent to any windows on the thread to cause flicker. Now the DWM can rebuild the composition tree in between and add flicker to a previously solid display.

Calling GetDC() causes a Direct3D-based window to stop updating. Okay, sure, I can understand not wanting to support mixed GDI and D3D rendering at the same time on the same window, but it'd be nice to have some better sort of fallback mechanism than just ignoring all of my Present() calls -- at least show something, even if it's slow and flickery. Oh, and it's really hard to figure out if a window is visible without an HDC (I use RectVisible() for this). I ended up calling GetDCEx(hwnd, NULL, 0) on the parent window instead.

You can call BeginPaint() safely, although I suspect you have to be really careful about what you do with the HDC.

Doing a FillRect() every frame makes the D3D rendering show up... not that I'm about to ship that solution.

WS_EX_COMPOSITED and WS_EX_LAYERED do squat.

The solution I finally came up with was to call SetWindowPos() with the SWP_FRAMECHANGED flag after the first call to Present() or SwapBuffers(). This seems like an utterly bogus solution, and I see a frame of garbage whenever the D3D or OpenGL minidriver reinitializes, but in the absence of any better solution or any diagnostics to determine what's really going wrong, it's the best I can do. Sigh.

I think the most astonishing part to me is how Microsoft can form a movement to get applications "Vista compatible" -- when in reality what they've done is broken a lot of apps and asked the vendors to pick up the pieces. Sure, some apps were doing really broken things, but I'm just trying to use Direct3D to display video....