Stuff

I've gotten to a stable enough point that I feel comfortable in revealing what I've been working on lately, which is GPU acceleration for video filters in VirtualDub. This is something I've been wanting to try for a while. I hacked a video filter to do it a while back, but it had the severe problems of (a) only supporting RGB32, and (b) being forced to upload and download immediately around each instance. The work I've been doing in the past year to support YCbCr processing and to decouple the video filters from each other cleaned up the filter system enough that I could actually put in GPU acceleration without significantly increasing the entropy of the code base.

There are two problems with the current implementation.

The first problem is the API that it uses, which is Direct3D 9. I chose Direct3D 9 as the baseline API for several reasons:

It's the API I'm most familiar with, by far.

The debug runtime is much more thorough than what I've had available with other APIs.

PIX and NVPerfHUD are free.

It runs on just about any modern video card.

Shaders have well-defined profiles, are portable between graphics card vendors, and use standardized byte code.

On top of this is a 3D portability layer, then the filter acceleration layer (VDXA). The API for the low level layer is designed so that it could be retargeted to Direct3D 9Ex, D3D10 and OpenGL; the VDXA layer is much more heavily restricted in feature set, but also adds easier-to-use 2D abstractions on top. The filter system in turn has been extended so that it inserts filters as necessary to upload or download frames from the accelerator and can initiate RGB<->YUV conversions on the graphics device. So far, so good...

...except for getting data back off the video card.

There are only two ways to download non-trivial quantities of data from the video card in Direct3D 9, which are (1) GetRenderTargetData() and (2) lock and copy. In terms of speed, the two methods are slow and pathetically slow, respectively. GetRenderTargetData() is by far the preferred method nowadays as it is decently well optimized to copy down 500MB/sec+ on any decent graphics card. The problem is that it is impossible to keep the CPU and GPU smoothly running in parallel if you use it, because it blocks the CPU until the GPU completes all outstanding commands. The result is that you spend far more time blocking on the GPU than actually doing the download and your effective throughput drops. The popular suggestion is to double-buffer render target and readback surface pairs, and as far as I can tell this doesn't help because you'll still stall on any new commands that are issued even if they go to a different render target. This means that the only way to keep the GPU busy is to sit on it with the CPU until it becomes idle, issue a single readback, and then immediately issue more commands. That sucks, and to circumvent it I'm going to have to implement another back end to see if another platform API is faster at readbacks.

The other problem is that even after loading up enough filters to ensure that readback and scheduling are not the bottlenecks, I still can't get the GPU to actually beat the CPU.

I currently have six filters accelerated: invert, deinterlace (yadif), resize, blur, blur more, and warp sharp. At full load, five out of the six are faster on the CPU by about 20-30%, and I cheated on warp sharp by implementing bilinear sampling on the GPU instead of bicubic. Part of the reason is that the CPU has less of a disadvantage on these algorithms: when dealing with 8-bit data using SSE2 it has 2-4x the bandwidth that it has with 32-bit float data, since the narrower data types have 2-4x more parallelism in 128-bit registers. The GPU's texture cache also isn't as advantageous when the algorithm simply walks regularly over the source buffers. Finally, the systems I have for testing are a bit lopsided in terms of GPU vs. CPU power. For instance, take the back-of-the-envelope calculations for the secondary system:

There are, of course, a ton of caveats in these numbers, such as memory bandwidth and the relationship between theoretical peak ops and pixel throughput. The Quadro, for instance, is only about half as fast as the GeForce in real-world benchmarks. Still, it's plausible that the CPU isn't at a disadvantage here, particularly when you consider the extra overhead in uploading and downloading frames and that some fraction of the GPU power is already used for display. I need to try a faster video card, but I don't really need one for anything else, and more importantly, I no longer have a working desktop. But then again, I could also get a faster CPU... or more cores.

The lesson here appears to be that it isn't necessarily a given that the GPU will beat the CPU, even if you're doing something that seems GPU-friendly like image processing, and particularly if you're on a laptop where the GPUs tend to be a bit underpowered. That probably explains why we haven't seen a huge explosion of GPU-accelerated apps yet, although they do exist and are increasing in number.

Templates are a feature in C++ where you can create functions and class types that are parameterized on other types and values. An example is the min() function. Without templates, your choices in C++ would be a macro, which has problems with side effects; a single function, which locks you down to a single type; or multiple overloads, which drives you nuts. Templates allow you to declare a min() that works with any type that has a less-than predicate without having to write all of the variants explicitly.
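As a quick illustration of the trade-off, here is a minimal sketch comparing the macro approach with a template. The names `MIN_MACRO` and `min_of` are made up for this example and are not from any library.

```cpp
// Macro version: each argument is pasted in textually, so an argument
// with side effects may be evaluated twice.
#define MIN_MACRO(a, b) ((a) < (b) ? (a) : (b))

// Template version (hypothetical name): works for any type with operator<,
// and each argument is evaluated exactly once.
template<class T>
const T& min_of(const T& a, const T& b) {
    return b < a ? b : a;
}
```

With `int i = 0;`, `MIN_MACRO(i++, 10)` increments `i` twice and yields 1, whereas `min_of(i++, 10)` increments it once and yields 0. The same `min_of` also works unchanged for `double` or any other type that defines less-than.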

The problem with templates is that they're (a) awful to use and (b) very powerful. The C++ template syntax is horrible, with all sorts of notorious problems with angle brackets and typename and other issues, and anyone who used VC6 still shudders at the mention of STL errors. The thing is, I still like templates, because they're compile-time and extremely versatile. The last time I had to use generics in C#, I ran into so many limitations that I really wished I'd had templates instead. C# generics are a mixture of both compile-time and run-time instantiation, which means they're more constrained. In particular, the inability to use constructors with parameters or to cast to the generic type is crippling, especially if you're working with enums. In C++, you can pretty much do anything with T that you could do with an explicit type.

Function template instantiation is one of the areas that I have the most problem with. The idea is simple: you specify the function you want, and the compiler finds the best template to fit. In reality, you're in a role-playing game where you wish for the function you want and the compiler GM does whatever it can to do precisely what you ask and still screw you, like implicitly convert a double through type bool. I got burned by this tonight when I tried to port VirtualDub to Visual Studio 2010 beta 1. I had expected this to be quick since everything just worked in the CTP, but with beta 1 it took hours due to several nasty bugs in the project system. The first time I was able to run the program it asserted before it even opened the main window. The problem was in this code:

int nItems = std::min<int>(mMaxCount, s.length());

mMaxCount was 4, s.length() was 2, and I ended up with min(4, 2) == 4. WTF?

First, I should note the reason for the explicit call. I often end up with situations where I need to do a min() or a max() against mixed signed and unsigned types, and usually I know that the value ranges are such that it's OK to force to one type, such as if I've already done some clamping. To do this, I force the template type. Well, it turns out that specifying min<int>() doesn't do what I had expected. It doesn't force a call to the version of min() with one template parameter of type int -- it forces a call to any template with int as the first type parameter. This used to be OK because std::min() only had one overload that took two parameters, so no other template could match. However, VS2010 beta 1 adds this evil overload:

template<class T, class Pred>inline const T& min(const T&, Pred);

Why you would ever want a min() that takes a single value and a predicate is beyond me. However, since I was calling min() with an int and an unsigned int, the compiler decided that min<int, unsigned>(int, unsigned) was a better match than min<int>(int, int). The odd result is that 2 got turned into an ignored predicate and min(4, 2) == 4. Joy. I hacked the build into working by writing my own min() and max() and doing a massive Replace In Files.
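The overload resolution trap can be reproduced without any particular standard library. The sketch below is an assumption-laden stand-in: `demo::min` imitates the two overloads described above (the second one simply ignores its "predicate"), and `min_as` is a hypothetical replacement in the spirit of writing your own min(), not the actual code used in VirtualDub.

```cpp
namespace demo {
    // The classic two-argument form.
    template<class T>
    const T& min(const T& a, const T& b) {
        return b < a ? b : a;
    }

    // Stand-in for the extra overload: one value plus an ignored "predicate".
    template<class T, class Pred>
    const T& min(const T& a, Pred) {
        return a;
    }
}

// Hypothetical replacement: there is only one template with this name,
// so forcing T converts both arguments to T before comparing.
template<class T>
T min_as(T a, T b) {
    return b < a ? b : a;
}
```

Calling `demo::min<int>(4, 2u)` only pins the first template parameter. The second overload then deduces `Pred = unsigned`, which is an exact match for the second argument, so it beats the overload that needs an unsigned-to-int conversion and the call returns 4. `min_as<int>(4, 2u)` has no competing overload and returns 2.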

One of the problems with Windows Presentation Foundation (WPF) has been its lack of text hinting. Text hinting adjusts the position of glyph curves to better match the grid, resulting in better legibility because stems appear sharper than if they were positioned between pixels. WPF omitted text hinting in order to attain resolution independence, but this resulted in illegible text at small point sizes, such as that commonly used in UIs. If you've ever taken text and resized it in an image editor, you're seeing a similar problem.

Visual Studio 2010 is switching to WPF for its text editor and parts of its UI, which led to some concerns about text readability. Some work has been done with WPF 4.0 to add back in hinting support, which I can say seems to have worked in beta 1:

These screenshots were taken with the default Consolas font and with the Lucida Console 10pt font I use with VS2005, on Windows 7 RC. It looks fine. I suspected that maybe VS2010 was still using WinForms, but I verified that there's no WinForms container window in Spy++, and you can also tell by the caret, which curiously changes width slightly as it moves. I wouldn't have thought that the caret would require hinting as well, but I can say now that it definitely looks weird having the caret antialiased. I also have to say that, unlike the old Courier New default that I hated passionately, the new Consolas default looks pretty good. It takes more space vertically than Lucida Console, but it takes less space horizontally and looks sharper. I personally like ClearType, though; those who don't are a bit stuck, as I don't believe you can turn it off in VS2010.

Unfortunately, while the text editor looks good, the menus are a different story:

Ewwww. Apparently I'm not the only one to notice this, based on the forums. I don't think I could stand it if the text editor looked like that. Text hinting is good.

Side note: I wanted to post whether VirtualDub compiled with VS2010 out of the box, but I can't figure out how to integrate the DirectX SDK. The VC++ Directories option seems to have gone missing with the switch to MSBuild, and I can only find it as a project setting now. The last thing I want is to bake a system-specific path into a project file. Hmmm.

I've been working on rewriting video algorithms in pixel shaders for 3D acceleration lately, and one of the sticking points I hit was Edge-Based Line Averaging (ELA) interpolation.

ELA is a common spatial interpolation algorithm for deinterlacing and works by trying several angles around the desired point and averaging between the points with the lowest absolute difference. The angles are chosen to be regular steps in sample location, i.e. (x+n, y-1) and (x-n, y+1) for n being small integers. This produces reasonable output for cases where a temporal or motion-based estimation is not available. The specific variant I'm dealing with is part of the Yadif deinterlacing algorithm, which checks three horizontally adjacent pixels for each angle and only picks the farthest two if the intermediate angle is a better match as well. In other words:
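A rough C++ sketch of that selection logic, written from the description above rather than copied from any filter source, could look like the following. The names `ela_score` and `ela_interpolate` and the `above`/`below` line pointers are made up for illustration, and the small bias toward the vertical direction is an assumption modeled on the public Yadif code.

```cpp
#include <cstdint>
#include <cstdlib>

// Sum of absolute differences over three horizontally adjacent pixels,
// comparing the line above shifted by +j against the line below shifted by -j.
static int ela_score(const std::uint8_t* above, const std::uint8_t* below,
                     int x, int j) {
    return std::abs(above[x - 1 + j] - below[x - 1 - j])
         + std::abs(above[x + j]     - below[x - j])
         + std::abs(above[x + 1 + j] - below[x + 1 - j]);
}

// Interpolates the missing pixel at column x between two known lines.
// The caller must guarantee three valid pixels on each side of x.
int ela_interpolate(const std::uint8_t* above, const std::uint8_t* below, int x) {
    // Start with the straight vertical average, slightly favored (assumption).
    int best_score = ela_score(above, below, x, 0) - 1;
    int best_pred = (above[x] + below[x]) >> 1;

    for (int dir = -1; dir <= 1; dir += 2) {
        for (int n = 1; n <= 2; ++n) {
            const int j = dir * n;
            const int score = ela_score(above, below, x, j);
            // The farther angle is only tried if the nearer one improved the match.
            if (score >= best_score)
                break;
            best_score = score;
            best_pred = (above[x + j] + below[x - j]) >> 1;
        }
    }
    return best_pred;
}
```

Each call touches seven pixels on each of the two source lines, which is where the figure of 14 fetches per output pixel comes from.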

This should be a relatively simple translation to a pixel shader -- convert each source pixel access to a texture sample. Not. It turns out that under the Direct3D ps_2_0 profile, which is what I need to target, there aren't enough temporary registers to run this algorithm. In order to run the algorithm, at least 14 source pixel fetches need to be done, and there are only 12 temp registers in ps_2_0. The HLSL compiler valiantly tries to squeeze everything in and fails. Nuts.

There is an important caveat to this implementation tack, which is that I had source pixels mapped in AoS (array of structures) format, i.e. a single pixel held YCbCr components and an unused alpha channel. The CPU implementation of this algorithm, at least the way I wrote it in VirtualDub 1.9.1+, uses SoA (structure of arrays) orientation for speed. SoA arranges the data as planes of identical components, so instead of mixing components together you fetch a bunch of Y values across multiple pixels, a bunch of Cb values, and a bunch of Cr values, etc. I decided to try this in the pixel shader, since texture fetches were my main bottleneck. It looked something like this:
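To make the two layouts concrete on the CPU side, here is a minimal C++ sketch. The type and function names (`PixelAoS`, `PlanesSoA`, `split_planes`) are hypothetical and are not taken from VirtualDub or from the shader.

```cpp
#include <cstdint>
#include <vector>

// AoS: one element per pixel, components interleaved (alpha unused).
struct PixelAoS {
    std::uint8_t y, cb, cr, a;
};

// SoA: one contiguous plane per component.
struct PlanesSoA {
    std::vector<std::uint8_t> y, cb, cr;
};

// Splits interleaved pixels into separate Y, Cb and Cr planes.
PlanesSoA split_planes(const std::vector<PixelAoS>& pixels) {
    PlanesSoA out;
    out.y.reserve(pixels.size());
    out.cb.reserve(pixels.size());
    out.cr.reserve(pixels.size());
    for (const PixelAoS& p : pixels) {
        out.y.push_back(p.y);
        out.cb.push_back(p.cb);
        out.cr.push_back(p.cr);
    }
    return out;
}
```

With the planar form, a 16-byte SSE2 load picks up 16 consecutive Y values, where the interleaved form would only deliver four pixels' worth of mixed components.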

Switching to SoA in a pixel shader nullifies some of the advantages of the GPU, since the GPU doesn't use as long vectors as fixed-point hardware (4x vs. SSE2's 16x), and because some GPU hardware doesn't directly support the swizzles you need to emulate a shift. It also largely nullifies the advantage of having texture samplers since you can no longer address the source by individual samples. Well, it turns out in this case that the extra swizzling made the situation even worse than in the AoS case, because the compiler didn't even get halfway down the shader before it gave up.

The main lesson here is that sampling textures can quickly become a bottleneck in the ps_2_0 profile. Just because you have 32 texture sampling instructions available doesn't mean you can use them. I've thought about switching to a higher pixel shader profile, like ps_2_a/b, but there are reasons I want to try to stay with ps_2_0, the main ones being the wide platform availability, the hard resource limits, and the omission of flow control and gradient constructs.

In the end, I had to split the ELA shader into two passes, one which just wrote out offsets to a temporary buffer and another pass that did the interpolation. It works, but the GPU version is only able to attain about 40 fps, whereas the CPU version can hit 60 fps with less than 100% of one core. I guess that mainly speaks to my lopsided computer spec more than anything else. That having been said, it kind of calls into question the "GPUs are much faster" idea. I have no doubt this would run tremendously faster on a GeForce 8800GTX, but it seems that there are plenty of GPUs out there where using the GPU isn't a guaranteed win over the CPU, even for algorithms that are fairly parallelizable.

A couple of days ago, I looked into XML as a possibility for an exchange format for a program I was working on. Using an off-the-shelf XML parser wasn't an option, so a relatively simple format was needed. XML seemed like a relatively good fit due to its hierarchical tag-based nature, and if I was going to use a simple text-based format, using a ubiquitous one seemed to be a good idea. I've acquired a bit of a distaste for XML over the years, primarily from seeing people convert 10MB of binary data to 100MB of XML for parsing in an interpreted language. For a simple file with a few data items, though, it makes a lot of sense.

The first set of warning bells went off when I pulled down the XML 1.0 standard from the W3C and discovered it was 35 pages long. W3C standards don't seem to be organized well in general, since they delve immediately into details without giving a good overview first. Well, I could deal with that -- I've survived ISO standard documents before, and these aren't that bad. Much of the standard deals with document type declarations (DTDs) and validation, which could be omitted.

That is, until I discovered the horrors of the internal DTD subset.

The internal DTD subset allows you to embed the DTD directly into the document. That's fine, and since it's wrapped in <!DOCTYPE> then in theory it should be easily skippable. Well, it would be, were it not for two little problems called character entities and attribute value defaults:

If you load this XML into a web browser like Firefox or Internet Explorer, you'll see the effects of the DTD, which is to introduce a mode attribute into the text tag, and to expand the &bar; character entity. These two features have a number of annoying consequences:

All XML parsers, including non-validating ones, must parse the internal DTD subset. This means that an alternate tag parsing path must be introduced since the DTD doesn't follow the same attribute=value format that the rest of XML uses.

The internal DTD subset cannot be ignored, since it can change the interpretation of the data.

Character entities can now expand to arbitrary lengths. This prohibits in-place conversion and requires dynamic memory allocation. Even more fun is the possibility of nested expansion, which leads to the billion laughs attack.

XML parsers must both parse elements and interpret them, due to the need to inject attribute defaults.
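To show why entity expansion rules out in-place conversion, here is a minimal C++ sketch of recursive expansion. It is not a conforming XML parser: `expand_entities`, the entity table, and the depth limit of 16 are all assumptions made for illustration, and predefined entities such as &amp; are not handled.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Expands &name; references using a table of user-defined entities.
// Replacement text may itself contain references, so expansion recurses.
std::string expand_entities(const std::string& text,
                            const std::map<std::string, std::string>& entities,
                            int depth = 0) {
    if (depth > 16)
        throw std::runtime_error("entity nesting too deep");

    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t amp = text.find('&', pos);
        const std::size_t semi =
            (amp == std::string::npos) ? std::string::npos : text.find(';', amp + 1);
        if (semi == std::string::npos) {
            out.append(text, pos, std::string::npos);  // no complete reference left
            break;
        }
        out.append(text, pos, amp - pos);
        const std::string name = text.substr(amp + 1, semi - amp - 1);
        const auto it = entities.find(name);
        if (it != entities.end())
            out += expand_entities(it->second, entities, depth + 1);
        else
            out.append(text, amp, semi - amp + 1);     // leave unknown references alone
        pos = semi + 1;
    }
    return out;
}
```

With entities a = "x", b = "&a;&a;" and c = "&b;&b;", the three-character reference &c; expands to "xxxx". Each extra level doubles the output, which is the mechanism behind the billion laughs attack and the reason the output buffer has to be allocated dynamically.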

I really wonder how much benefit there was in including user-defined character entities and attribute defaults in the XML standard. It seems to me that if these two features had been omitted, there could have been a clear delineation in the standard between DTD/validation and data, and the core non-validating part could have been made much simpler.

I finally have some time to flush out some of the stuff I've had going on lately, and since all my latest VirtualDub work went out with 1.9.2, it's time for something else. A while ago I posted the first version of Altirra, my Atari 8-bit computer emulator. Due to some encouragement from some friendly people on AtariAge, I went back and fixed a whole bunch of emulation bugs in the emulator, and the result is version 1.1:

As proof of its worthiness of release, I present my in-emulator triumph over evil in The Last Starfighter:

The primary improvements are a lot of undocumented features implemented in GTIA, a couple of fixes to POKEY, and several tricky but important timing fixes in ANTIC. The display code now also does aspect ratio locking (a feature that I need to integrate back to VirtualDub).

Working on an emulator for an old computer system is basically working to an inflexible, and unfortunately incomplete, spec. I have access to the official Atari Hardware and OS reference manuals, which describe all hardware registers and kernel calls in the system. You can write an emulator that exactly conforms to those specifications and the result is that you will be able to run maybe half of the software that was released. The other half won't work because it depends on undocumented behavior such as when the display list interrupt bit resets (scan line 248) and the exact number of clock cycles it takes to enter the vertical blank routine (47-53 for XL OS). The spec is unrelenting and frozen -- unless you have a time machine, there is no way to change it. Therefore, you will implement exactly what the hardware does, period, and anything less is a bug. It gives you some appreciation for when you do have leeway to get specs changed to ease implementation.

There is one feature that I wanted to get into this version that I couldn't, which is artifacting. Artifacting is a phenomenon of composite video where unfiltered high frequencies in luminance are misinterpreted by the TV as colors. In NTSC Ataris, the high-resolution modes output pixels at a rate exactly twice the color subcarrier, which means that alternating luminance values produce colors instead of patterns. NTSC is designed with frequency interleaving to mitigate this somewhat, but the Atari produces non-interlaced video with a constant color subcarrier phase, defeating this. The way I tried to simulate this was brute-force, simply converting the video screen to a virtual 14.318MHz (4*Fsc) signal and then separating it out to Y/I/Q with FIR low pass and high pass filters. Well, the result kind of sucked, with an exceptionally blurry luma output and visible rippling in artifacted areas. The result on a real Atari with my LCD monitor's composite input looks significantly better, which leads me to believe there's something I'm missing, possibly a non-linear separation algorithm.
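The basic effect can be shown with a toy model. The sketch below is not the FIR-based separation described above; it is a crude quadrature correlation, assuming a signal sampled at four times the color subcarrier (4*Fsc) so that each hi-res pixel covers two samples. All names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Chroma {
    int in_phase;
    int quadrature;
};

// Each hi-res pixel lasts half a subcarrier cycle, i.e. two samples at 4*Fsc.
std::vector<int> pixels_to_samples(const std::vector<int>& pixels) {
    std::vector<int> samples;
    for (int p : pixels) {
        samples.push_back(p);
        samples.push_back(p);
    }
    return samples;
}

// Correlates the signal with the cosine and sine of the subcarrier.
// At 4*Fsc those waveforms sample to exactly {1,0,-1,0} and {0,1,0,-1}.
Chroma demodulate(const std::vector<int>& samples) {
    static const int cos4[4] = {1, 0, -1, 0};
    static const int sin4[4] = {0, 1, 0, -1};
    Chroma c{0, 0};
    for (std::size_t n = 0; n < samples.size(); ++n) {
        c.in_phase   += samples[n] * cos4[n & 3];
        c.quadrature += samples[n] * sin4[n & 3];
    }
    return c;
}
```

A solid run of lit pixels correlates to zero, so it reads as pure luminance. The alternating pattern 1,0,1,0 gives (2, 2) and the shifted pattern 0,1,0,1 gives (-2, -2): the same black-and-white stripes decode as opposite hues depending on which pixel columns are lit.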

I spent some time downloading and installing Windows 7 RC today. I installed the beta a while ago, but didn't do too much with it since I had only installed it on a VM, which invalidated any comparisons in graphics performance and prevented any access to 3D acceleration. Well, this time I took advantage of the uber-cool install to VHD feature to install and run Windows 7 without having to repartition my hard drive, so that I could run it directly on real hardware. I get the feeling that I'm pretty far behind in noticing some of the improvements in Windows 7, but hey... better late than never.

Now, to understand where I'm coming from, I've had both 32-bit XP and 64-bit Vista installed on dual-boot for some time, but I almost never boot into Vista 64 unless I need to test some 64-bit code. The reason is that Vista 64 runs like complete crap. CPU and memory aren't a problem, because I have a 2.5GHz Core 2 and 2GB of RAM. The problem is my somewhat mediocre NVIDIA Quadro NVS 140M, which is 8400GS based, compounded by my 1920x1200 display. It runs like a slug with Aero Glass enabled, and is still laggy and runs at low frame rates with it off. The result is that Visual Studio 2005 is noticeably more sluggish on Vista 64 than in XP, and thus I spend all my time in XP instead.

The main reason that Vista performs worse is that, despite all the hype about the new desktop hardware acceleration, the WDDM 1.0 display driver model introduced in Vista actually forces all GDI and DirectDraw rendering to occur in software on the CPU. Since most UI in Windows is based off one of those two APIs -- including Win32 UI and .NET's GDI+ based UI -- this results in significantly slower UI performance. Well, unfortunately, it doesn't seem that Windows 7 runs Aero Glass any better, and in fact on this system VirtualDub has trouble breaking around 25 fps due to the low desktop composition frame rate. It runs noticeably better with composition disabled however: stretching video to 1920x1200 through DirectDraw now takes about 0-2% of the CPU instead of 6-10%.

The reason for this, as it turns out, is that the new WDDM 1.1 DDI in Windows 7 adds several GDI optimizations, including reducing lock contention and reintroducing some hardware acceleration. The DDI documentation for GDI hardware acceleration reveals six primitive operations that are supported for acceleration: alpha blend, BitBlt, ClearType blend (text rendering), solid color rect fill, StretchBlt, and transparent blit (color key). That covers a lot, but the one notable omission is line draw. In theory it would be possible to accelerate lines by rasterizing them to spans, and it'd be trivial to do so for the common cases of horizontal and vertical lines, but I don't know if Windows 7 actually does this. In any case, reintroduction of basic GDI acceleration is welcome. This also appears to extend to DirectDraw blits, although those are still not bilinearly filtered as they usually are with XP drivers.

Another interesting change is that WDDM 1.1 reintroduces support for DirectDraw hardware video overlays. In Vista, attempting to use a hardware overlay either resulted in the overlay calls silently failing or desktop composition being forced off. It appears that overlays work again on Windows 7. You can't test this with current versions of VirtualDub since I added code to force them off on Windows Vista or later due to the poor OS behavior, but you can see them in action if you get an old version of VirtualDub like 1.6.0. They don't work quite like on XP, though, because for some reason they don't update except when a GDI repaint occurs. I'm not sure if it's worth investigating this since any video card that supports WDDM 1.1 supports DirectX 9, and at that point it's better just to use Direct3D.

I have noticed a couple of occasional weirdnesses in the USER layer in Windows 7 that I haven't pinned down yet. In the beta at high DPI, one specific item in VirtualDub's display pane context menu bizarrely shows up in a different font. Altirra also occasionally shows a vertically displaced window caption in one of its tool windows after a full-screen mode switch. I suppose it's possible that these are video driver related, as the WDDM 1.1 drivers are still very new.

In any case, since I don't intend to buy a new computer any time soon, it's looking like I might be leapfrogging Vista to Windows 7 as my main OS, now that some of the 2D performance gaffes are getting addressed.

1.9.2 is now out as a new experimental release. This version mainly solidifies the changes to the filter system made in 1.9.1, with reworked allocators to further reduce memory usage and caching improvements for better performance.

Several users have asked me why the "run as job" option was moved from the save dialogs to submenu options in 1.9.x. The reason why I removed it is that I was getting a lot of reports of people unable to write any files with VirtualDub, which turned out to be due to unintentionally leaving that flag on. Unfortunately, this made those batch commands a bit more annoying to get to. 1.9.1 added a keyboard shortcut for batching "save as AVI" as a stopgap. 1.9.2 now allows reconfiguration of all keyboard shortcuts for commands on the menu, which allows all of the remaining batch commands to be accessed quickly. This turned out to be easier to implement than I had expected, with the exceptions being that (a) storing the shortcuts in both a forward and backward compatible way was tricky, and (b) I had to write my own hot key control as the Win32 one has some silly hardcoded restrictions on what keys can be accepted.

I've been writing a custom tree view control to work around a couple of serious limitations of the Win32 tree view control. The main one is that, as I've noted in the past, it has quadratic time performance if you add nodes at the end of a branch, due to unoptimized linked list usage. Typically I work around this by adding nodes in reverse at the head, which executes in linear time instead, but this time I need to incrementally extend the tree. You might think that using the hInsertAfter of TVINSERTSTRUCT would fix this, but sadly, no: the tree view control implements this too by walking the child list from the head. The only way I can think of to work around this is to use virtual text nodes and rotate the text entries while inserting dummy entries at the top, but that's hackier than I like and there are other features that are missing too (columns).
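The quadratic behavior is easy to model with a bare singly linked list. The sketch below is only a model of the access pattern, not the tree view control's actual implementation, and all names in it are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    Node* next = nullptr;
};

// Appends at the tail by walking from the head, the way a list without a
// tail pointer has to. Returns how many existing nodes were visited.
std::size_t append_by_walking(Node*& head, Node* node) {
    std::size_t visited = 0;
    Node** link = &head;
    while (*link) {
        link = &(*link)->next;
        ++visited;
    }
    *link = node;
    return visited;
}

// Inserts at the head: constant time, no walking at all.
void push_front(Node*& head, Node* node) {
    node->next = head;
    head = node;
}

// Total nodes visited while appending n nodes one at a time.
std::size_t total_walk_steps(std::size_t n) {
    std::vector<Node> storage(n);
    Node* head = nullptr;
    std::size_t total = 0;
    for (Node& node : storage)
        total += append_by_walking(head, &node);
    return total;
}
```

Appending n nodes this way visits 0 + 1 + ... + (n-1) = n(n-1)/2 nodes in total, so 1,000 appends cost 499,500 visits, while 1,000 calls to `push_front` cost none. That difference is why inserting in reverse at the head runs in linear time.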

I got as far as getting the node structure set up and started on rendering, only to hit on a problem: the text rendering was slooooowwww. Specifically, ExtTextOut() was taking a long time to run, and display updates were below 10 fps. Well, the problem was that due to some code transformations, I was accidentally printing out a newline (\n) at the end of the string, and that somehow causes ExtTextOut() to drop to an extremely slow code path. To give you an idea of just how horribly slow this is, here's a chart of ExtTextOut() performance when control characters are added:

This chart shows how long it takes to execute 200 calls to ExtTextOut() with a 255 character string, consisting of a mix of A and ^A characters. The font is a fixed-width font, so the fill load is the same. Notice how the slope of the line increases. Notice also that when the string consisted solely of ^A characters, it took four seconds to draw 200 lines of text. That's almost as slow as the .NET DataGridView control. I omitted the line for the second test that uses Bs instead of ^As because it's so much faster it doesn't even show up, topping out at 12ms. I'm guessing this problem is caused by font substitution kicking in, but whatever's going on, it's disastrous: on this system, just including a single control character cuts text rendering rate by at least a fourth, and by the time you get to thirty control characters, it takes more than ten times longer to render the string. I should note that there definitely seems to be a system-dependent aspect here, as this was on a 2.5GHz Core 2 with an NVIDIA Quadro NVS 140M, and the test ran a lot faster on a 1.4GHz Pentium M with an ATI RADEON Mobility. The text with control characters still ran at less than one-fourth speed, though, so it's still a really bad case.

Moral of the story: make sure ExtTextOut() receives only printable characters.
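One way to enforce that is to scrub strings before they reach the drawing call. This is a minimal sketch for narrow strings, assuming the default "C" locale. `strip_control_chars` is a made-up helper, not part of Win32, and wide strings would need the `std::iswcntrl` equivalent.

```cpp
#include <cctype>
#include <string>

// Returns a copy of s with control characters (such as \n or ^A) removed,
// so that only printable text is handed to the text drawing call.
std::string strip_control_chars(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (unsigned char ch : s) {
        if (!std::iscntrl(ch))
            out.push_back(static_cast<char>(ch));
    }
    return out;
}
```

Passing "abc\n" through this returns "abc", which would have avoided the stray newline that triggered the slow path here.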