An application of mine uses DirectDraw to draw video frames on multiple screens. Until recently, the visualization pipeline used a group of off-screen YUV surfaces that, at the end of the process, are blitted into the primary surface. This is the code that creates the primary surface:

The last blit does not even perform any rescaling, as the image in the off-screen surface already has the dimensions of the DestRect in the primary surface. It only copies data from the off-screen surface to the primary surface and converts it from YUV to the color space of the primary surface (these days, RGB32 is definitely a safe bet).

This visualization pipeline ran fine for years, up to and including the nVidia 28x driver series. Starting with the nVidia 29x driver series, and still with the latest 30x versions, performance dropped dramatically: blits were so slow that they dragged down the whole system. So I started benchmarking the various steps of the visualization pipeline, and it turned out that the last step, the humble blit described above, was about 100 times slower with the newer drivers than with the older ones! Performance was even poorer when blitting to a secondary monitor, where drawing a single image to the screen took many milliseconds. Even worse, GPU usage was really high even when drawing only a few video streams at the same time, so the GPU was clearly the bottleneck, which it should not be, as there are no complex operations going on.

The solution is to switch the off-screen surfaces from YUV to RGB32, so the declaration of the surfaces becomes:

the CPU must perform a YUV -> RGB32 conversion while copying data to video memory

After benchmarking, both downsides turn out to be quite minor, thanks to the speed of system-memory-to-video-memory copies and to the highly optimized YUV->RGB32 color-space conversions available in the Intel IPP library. Summing up, the performance of the RGB32 pipeline on the 30x driver series is on a par with that of the YUV pipeline on the 28x driver series, and definitely within the required performance boundaries.
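To make the CPU-side conversion step more concrete, here is a minimal scalar sketch of a YUV -> RGB32 conversion. It is not the Intel IPP routine used in the real pipeline and it is nowhere near as fast. It assumes a packed YUY2 source (bytes Y0 U Y1 V per pixel pair), BT.601 studio-range coefficients, and a little-endian X8R8G8B8 destination. The article does not state the actual YUV format, so all of these are assumptions, and the names `yuvToRgb32` and `yuy2ToRgb32` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Clamp an intermediate value to the 0..255 range of an 8-bit channel.
inline std::uint32_t clamp8(int v) {
    return static_cast<std::uint32_t>(std::min(255, std::max(0, v)));
}

// Convert one YUV sample to a 0x00RRGGBB pixel.
// Assumption: BT.601 studio range (Y in 16..235, U and V centered on 128),
// using the common 8-bit fixed-point approximation.
// Note: >> on a negative int is an arithmetic shift on mainstream compilers
// (and guaranteed since C++20); negative results are clamped to 0 anyway.
inline std::uint32_t yuvToRgb32(int y, int u, int v) {
    const int c = y - 16;
    const int d = u - 128;
    const int e = v - 128;
    const std::uint32_t r = clamp8((298 * c + 409 * e + 128) >> 8);
    const std::uint32_t g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
    const std::uint32_t b = clamp8((298 * c + 516 * d + 128) >> 8);
    return (r << 16) | (g << 8) | b;
}

// Convert a packed YUY2 frame into an RGB32 buffer, row by row.
// srcPitch and dstPitch are row strides in bytes; width must be even.
// dst can be any writable buffer, for example the memory of a locked surface.
inline void yuy2ToRgb32(const std::uint8_t* src, std::size_t srcPitch,
                        std::uint8_t* dst, std::size_t dstPitch,
                        int width, int height) {
    for (int row = 0; row < height; ++row) {
        const std::uint8_t* s = src + static_cast<std::size_t>(row) * srcPitch;
        std::uint8_t* out = dst + static_cast<std::size_t>(row) * dstPitch;
        for (int x = 0; x < width; x += 2) {
            // Two pixels share one U and one V sample.
            const int y0 = s[0], u = s[1], y1 = s[2], v = s[3];
            const std::uint32_t p0 = yuvToRgb32(y0, u, v);
            const std::uint32_t p1 = yuvToRgb32(y1, u, v);
            std::memcpy(out, &p0, sizeof p0);
            std::memcpy(out + 4, &p1, sizeof p1);
            s += 4;
            out += 8;
        }
    }
}
```

In a pipeline like the one described here, a routine of this kind would write straight into the RGB32 off-screen surface, so the conversion and the copy to video memory happen in a single pass.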
