GPU ENCODING IN EXPRESSION ENCODER 4 PRO SP2

Introduction

This document complements the original Expression Encoder 4 SP1 GPU Encoding white paper and discusses the changes that SP2 introduced on this subject. The latest service pack includes a major revision of the CUDA H.264 encoder as well as the introduction of Intel Sandy Bridge QSV support, both of which are discussed in depth below.

General GPU Encoding Recommendations

To Maximize Performance

As before, we recommend matching the GPU to the video processing load. Below are basic guidelines for optimizing performance when using a GPU to encode. Using proper GPU hardware and a conservative "GPU Stream" setting can reduce the chances of maxing out GPU resources. Another way to reduce the GPU memory footprint is to lower the "Number of Reference Frames". GPU-Z is a useful freeware tool for monitoring GPU load and memory usage.

To help calculate the appropriate number of GPU streams for a specific scenario, we also created a simple performance tool. More information is available here.

Finally, since available GPU memory is more important than ever, we recommend closing any other applications to minimize GPU memory usage outside of Encoder.

To Maximize Quality

Consider using 1-pass constrained VBR over CBR.

Make the "Buffer Window" longer. Note that this will impact latency.

Use pre-processing settings (i.e., de-interlacing and resizing methods) that are appropriate for the given source. For example, using the wrong de-interlacing method can introduce weave artifacts, which are hard to encode and will significantly impact overall output quality.

Consider removing any letterboxing introduced by the source and/or the encode settings. Letterboxing introduces hard, high-contrast edges that are difficult to encode and will also impact output quality.

Second Generation of CUDA GPU Encoding

In collaboration with our partners at MainConcept, a lot of work was done on the CUDA H.264 encoder to optimize GPU performance, improve output quality, and add new encode functionality such as true CBR and HRD support. These changes have significantly impacted its behavior, especially in terms of performance:

GPU memory requirements have doubled with the same settings when compared to SP1.

GPU load has more than doubled too.

CPU load has been reduced when encoding with CUDA.

In other words, you will experience better quality and less CPU load, at the cost of higher GPU load and some overall performance loss.

Finally, because the CUDA encoder doesn't support 2-pass encodings, they have been disabled. Please note that while Encoder will revert to software encoding if a 2-pass encode is selected, the new 1-pass H.264 VBR encoding options are fully supported in both CUDA and QSV-based encoding.

Recommendations

Because of the new behavior, we now recommend using only higher-end CUDA GPUs for this feature. A rough suggestion would be to get CUDA hardware with a minimum of 200 CUDA cores per HD stream to be encoded via CUDA and at least 1 GB of video memory (2 GB or more preferred). This means that for encoding a single HD stream on the GPU, at least a GeForce GTX 550 Ti or Quadro 2000 GPU is recommended, while encoding 2 or more HD streams on the GPU would require at least a GeForce GTX 570, Quadro 6000 or Tesla C2050 card. As in Encoder SP1, multiple video cards can be used to speed up Smooth Streaming encodes. Note that we also changed the default number of GPU streams to 2, which can be tweaked to take full advantage of the hardware used.
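The rule of thumb above can be turned into a quick sizing estimate. The sketch below is only an illustration: the function name and the way the two figures are combined are assumptions, and the 200-cores-per-stream and 1 GB values are the rough guidance from this section rather than hard limits.

```python
# Rough sizing helper based on the guidance in this section (not an official tool).
CORES_PER_HD_STREAM = 200   # rough suggestion from this paper
MIN_VIDEO_MEMORY_GB = 1.0   # recommended minimum; 2 GB or more is preferred


def estimate_hd_streams(cuda_cores: int, video_memory_gb: float) -> int:
    """Return a rough count of HD streams a CUDA card might handle.

    Hypothetical helper: returns 0 when the card is below the memory minimum,
    otherwise one stream per full block of CORES_PER_HD_STREAM cores.
    """
    if video_memory_gb < MIN_VIDEO_MEMORY_GB:
        return 0
    return cuda_cores // CORES_PER_HD_STREAM


if __name__ == "__main__":
    # Example figures are made up for illustration, not real card specifications.
    print(estimate_hd_streams(cuda_cores=480, video_memory_gb=1.25))
```

Treat the result as a starting point for the "GPU Stream" setting and confirm it by monitoring actual GPU load and memory usage.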

Since the GPU is taking more of the load, matching the CPU may not be as important for single stream encodes. Because most current CUDA GPUs can't take more than 2-3 HD streams at a time, a powerful CPU is still required for HD Smooth Streaming encodes.

Finally, make sure to install the latest Nvidia drivers available from Nvidia's website here. Note that SP2 requires a GPU driver version higher than 267.91.
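As an illustration of the driver requirement, the following sketch compares a driver version string against the 267.91 threshold. How the installed version is obtained is left out, and the function names are hypothetical; only the "higher than 267.91" rule comes from this paper.

```python
def parse_version(text: str) -> tuple:
    """Split a dotted version string such as '267.91' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


# SP2 needs a driver version strictly higher than this one.
MINIMUM_EXCLUSIVE = parse_version("267.91")


def cuda_driver_ok(installed_version: str) -> bool:
    """Return True when the installed driver is newer than 267.91."""
    return parse_version(installed_version) > MINIMUM_EXCLUSIVE


if __name__ == "__main__":
    # Version strings below are sample inputs only.
    for version in ("266.58", "267.91", "285.62"):
        print(version, cuda_driver_ok(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when the version parts differ in length.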

New Intel Sandy Bridge QSV GPU Encoding Support

What is QSV?

Intel Quick Sync Video (QSV) is a GPU acceleration technology available on some Intel processors that can dramatically increase the speed of transcoding media files. It is available on most 2nd generation i3, i5 and i7 processors (aka "Sandy Bridge"), as well as some newer Xeon CPUs, like the E3-1285. More information about QSV is available on Intel's website. Please note that the new Sandy Bridge E CPU line (example: i7-3960X) does not have an embedded GPU, and thus does not support QSV.

QSV encoding support has been integrated in Encoder 4 Pro SP2, which can be used in a similar fashion as CUDA encoding to speed up MP4 and H.264-based Smooth Streaming encode operations.

Usage

QSV GPU encoding works the same way as CUDA encoding, as explained in the Encoder SP1 GPU Encoding white paper. Because the two GPU-based H.264 encoders behave very differently, Expression Encoder 4 Pro SP2 provides a way to control the number of streams independently for CUDA and QSV, which default to 2 and 8 respectively (see below). As with CUDA, QSV encoding is turned on by default in the Encoder application and turned off in the SDK.

As for the SDK, the QSV device is enumerated in H264EncodeDevices.Devices along with the CUDA devices. A new H264EncodeDevice property, DeviceType of type H264DeviceType, provides a way to differentiate between the two GPU types. H264EncodeDevices.MaxNumberGPUStream continues to control the number of streams to be encoded by CUDA, while H264EncodeDevices.MaxNumberIntelHDGraphicsStream controls the streams encoded by QSV.
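To make the relationship between device type and stream limit concrete, here is a small Python model of the idea. It is not the SDK itself (the SDK is a .NET API): the classes, the enum member names and the helper functions below are all hypothetical stand-ins, and only the two default limits (2 and 8) come from this paper.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceType(Enum):
    # Placeholder members: the real H264DeviceType values are not listed here.
    CUDA = "cuda"
    QSV = "qsv"


@dataclass
class Device:
    name: str
    device_type: DeviceType


# Stand-ins for MaxNumberGPUStream and MaxNumberIntelHDGraphicsStream,
# set to the SP2 application defaults mentioned above.
STREAM_LIMITS = {DeviceType.CUDA: 2, DeviceType.QSV: 8}


def group_by_type(devices):
    """Group enumerated device names by GPU type."""
    groups = {kind: [] for kind in DeviceType}
    for device in devices:
        groups[device.device_type].append(device.name)
    return groups


def stream_limit(device):
    """Return the stream limit that applies to this device's GPU type."""
    return STREAM_LIMITS[device.device_type]


if __name__ == "__main__":
    devices = [
        Device("discrete GPU", DeviceType.CUDA),
        Device("integrated GPU", DeviceType.QSV),
    ]
    print(group_by_type(devices))
    for device in devices:
        print(device.name, stream_limit(device))
```

The point of the model is simply that one enumeration contains both kinds of device, and each kind is governed by its own limit.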

Multi GPU Encoding

Both CUDA and QSV encoding can be used together, either on one multi-stream encode or on multiple instances of Encoder, which may provide even better performance results. When encoding multiple streams on a PC with both technologies, the streams will be allocated in a snake pattern starting with the Intel GPU (1-2-2-1, 1-2-3-3-2-1) from the highest resolution stream down.
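The snake pattern can be sketched in a few lines. This is a simplified illustration, not Encoder's actual scheduler: device 1 stands for the Intel GPU, the function name is hypothetical, per-device stream limits are ignored, and repeating the pattern once every device has been visited twice is an assumption.

```python
def snake_allocation(num_streams: int, num_devices: int) -> list:
    """Return a 1-based device number for each stream, highest resolution first."""
    if num_devices < 1:
        raise ValueError("at least one device is required")
    forward = list(range(1, num_devices + 1))
    # One full pass down and back, e.g. [1, 2, 2, 1] for two devices.
    cycle = forward + forward[::-1]
    return [cycle[index % len(cycle)] for index in range(num_streams)]


if __name__ == "__main__":
    print(snake_allocation(4, 2))
    print(snake_allocation(6, 3))
```

Alternating direction in this way keeps the devices balanced by resolution: with two devices and four streams, the first device receives the highest and the lowest resolution stream, and the second device receives the two in between.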

It's worth mentioning that, when used with a discrete GPU, Intel QSV requires the Intel GPU to be enabled as the default video adapter in the BIOS, and the Intel GPU needs to be active and connected to a monitor. Note that LucidLogix Virtu technology, bundled with some Sandy Bridge PCs, enables virtualization of the discrete GPU through the Intel GPU, allowing both QSV and CUDA to be used with only one output. More information is available on LucidLogix's website.

Limitations

Like the CUDA implementation, the QSV implementation supports only a subset of encode settings. For instance, the following encoding features are not supported:

Reference B-frames

Adaptive B-frames

Multi pass encoding (2-pass encodes will use software encoding)

QSV encoding is not supported on XP.

An active Intel HD Graphics GPU is required to be accessible. This means that:

The PC's motherboard needs to support the Intel Accelerator GPU (most 1155-H67 and 1155-Z68 motherboards do, but no 1155-P67 motherboards do).

The GPU needs to be connected to a screen and be activated.

Accessing the PC via Remote Access, VM, or via a service isn't supported.

If the PC is locked or accessed via Remote Desktop, the next encode operation will fail to use the GPU since it will not be available.

As with CUDA, QSV encoding is an approximation of what software-only encoding can provide. Output quality differences, especially at low bitrates, should be expected. Because of this, we recommend using the software H.264 encoder if the highest quality results are a priority over higher performance.

Recommendations

The integrated Intel Accelerator 3000 GPU, found in high-end mobile CPUs and in the K line of 2nd generation i5 and i7 based desktops as well as on some Xeon and i3 CPUs, is highly recommended. See the Intel website for more details.

Because memory is shared, we highly recommend having a minimum of 4 GB for single stream encoding and 8 GB or more for Smooth Streaming and/or parallel encoding. Running out of system memory will greatly affect performance.

It is highly recommended to install the very latest drivers directly from the Intel website to ensure best stability and reliability.

While the QSV encoding can take a large load of concurrent streams, it is worth noting that a blend of hardware and software encoding will likely result in better performance in most Smooth Streaming cases.

Known Issues

A few issues were found in the recent sets of Intel drivers currently available. While all of them have been investigated and resolved, some of the fixes didn't make it into the latest drivers, but will be available in the next driver version (ETA early 2012). Here is the short list:

When using either Intel's driver 8.15.10.2361 or 8.15.10.2509, about 5% of H.264 Smooth Streaming encodes using QSV will hang at the start of the encode or may contain a corrupted stream. This issue was fixed in the driver 8.15.10.2559.

When using Intel's driver 8.15.10.2361 and encoding to QVBR in H.264 Main or High profile using QSV, output will be corrupted. The issue was fixed in the driver 8.15.10.2509.

Windows 8 Beta Users: Intel's current Windows 8 drivers do not yet support QSV, but will in the future.
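The two driver-specific issues listed above can be captured in a small lookup table, for example to warn before starting a QSV encode. The sketch below is an illustration only: the table and function names are hypothetical, how the installed driver version is detected is left out, and drivers not mentioned in this paper are simply reported as having no known issues.

```python
# Known QSV issues per Intel driver version, as listed in this section.
KNOWN_QSV_ISSUES = {
    "8.15.10.2361": [
        "About 5% of H.264 Smooth Streaming encodes may hang at the start "
        "or contain a corrupted stream (fixed in 8.15.10.2559).",
        "QVBR output in H.264 Main or High profile is corrupted "
        "(fixed in 8.15.10.2509).",
    ],
    "8.15.10.2509": [
        "About 5% of H.264 Smooth Streaming encodes may hang at the start "
        "or contain a corrupted stream (fixed in 8.15.10.2559).",
    ],
}


def known_qsv_issues(driver_version: str) -> list:
    """Return the issues documented here for the given Intel driver version."""
    return KNOWN_QSV_ISSUES.get(driver_version.strip(), [])


if __name__ == "__main__":
    for version in ("8.15.10.2361", "8.15.10.2509", "8.15.10.2559"):
        issues = known_qsv_issues(version)
        print(version, "->", len(issues), "known issue(s)")
        for issue in issues:
            print("  ", issue)
```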

Other GPU Acceleration Technologies

OpenCL and DirectCompute technologies were evaluated but deemed not ready for integration in Expression Encoder 4 Pro SP2. We are planning to continue monitoring them and investigating potential integration in the future.
