As promised, today we’ll talk about video decoding. We will review the most important operations a decoder has to perform, and for each one see what kind of speed boost we can expect from shader-based video decoding. Because video decoding is a complex process and one blog post can hardly be thorough, I’ll provide related links in each chapter in case you wish to start your own research on a particular subject.

Let’s begin with a quick overview of the most important operations of a VP8 video decoding process :

1/ Colors coding

Colors are usually represented in the 24-bit RGB scheme (8 bits for each of the 3 components). It is the “native format” of most screens and, most importantly, of the human eye.

But when it comes to video compression, a different scheme is used. It’s called YCbCr, and you may know it as “YUV”. It is not a different colorspace than RGB but rather a different way to write it. The fact is that the human eye is more sensitive to differences in luminosity than to differences in color. Like an RGB pixel, a YCbCr pixel has 3 components: Y is the “luma” component, Cb and Cr the “chroma” blue and red. All the components of the same type in a picture are called a plane. The trick is to store the two chroma planes at quarter resolution and the luma plane at full resolution. This is called “chroma sub-sampling”, and the VP8 codec only supports the “4:2:0” mode.

So with 4:2:0 sub-sampling, 50% of the components are removed overall, for an almost imperceptible drop in quality!
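To see the saving concretely, here is a quick sketch (assuming even frame dimensions) of the sample counts for a 4:2:0 frame:

```python
def yuv420_plane_sizes(width, height):
    """Sample counts of the Y, Cb and Cr planes of a 4:2:0 frame.
    Each chroma plane is half the width and half the height of the
    luma plane, hence a quarter of its samples."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma, chroma, chroma

y, cb, cr = yuv420_plane_sizes(1920, 1080)
# 3 full planes in RGB would be 3 * 2073600 samples; here we store 1.5x instead
print((y + cb + cr) / (3 * y))  # 0.5, i.e. half the components are gone
```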

2/ Block based coding

Most of the computations that we are going to study don’t work on the whole picture; instead, video encoders divide the picture into smaller blocks of data, which are easier and above all faster to process.

First we have the “macroblocks”, which are 16×16 pixels, used by the inter-prediction process. Then we have the “blocks” of 4×4 pixels, used by the intra-prediction process. In 4:2:0 mode, each macroblock includes 16 4×4 blocks of luma components and 8 4×4 blocks of chroma components (4 for chroma blue, 4 for chroma red).

In this post I’ll mostly talk about the processing applied to the 4×4 luma blocks in order to save time, without going too deep into the details.

3/ Entropy decoding

The first process in “decoding order” is the entropy decoder. Let’s quote Wikipedia:

In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium.

One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol by the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of the probability. Therefore, the most common symbols use the shortest codes.
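VP8 itself uses arithmetic coding rather than a prefix code (more on that just below), but the prefix-free idea from the quote is easy to demonstrate with Huffman’s algorithm; a small illustrative sketch:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code from symbol frequencies (Huffman's
    algorithm): repeatedly merge the two least frequent subtrees."""
    seq = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(seq), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c0.items()}
        merged.update({s: "1" + bits for s, bits in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(seq), merged))
    return heap[0][2]

codes = huffman_code({"a": 50, "b": 25, "c": 15, "d": 10})
# the most common symbol ("a") ends up with the shortest codeword
print(sorted(codes.items()))
```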

The VP8 codec uses an entropy coder called the “Boolean Entropy Decoder”, a form of arithmetic coding which uses only two symbols: 0 and 1. It is faster than typical arithmetic coding and allows a purely integer implementation.

The only thing we need to know is that entropy decoding is a very linear process and cannot be parallelized. The symbols are decompressed one by one (the bitstream is processed bit by bit in the case of arithmetic coding, although I/O is buffered) and each decoded symbol updates the probability context of the decoder. Lose one bit, lose one frame.

Conclusion? The entropy decoding stays on the CPU. What we need to do, though, is decode the bitstream content for one frame in one pass, instead of starting other processes as soon as enough data is available. That way we will have all the data needed to start the other processes in parallel.

4/ Inverse transformation

The DCT, short for Discrete Cosine Transform, is the operation at the heart of today’s video compression formats. The DCT is a reversible and lossless operation, meaning that the decoder can exactly undo what the encoder did. It does not directly compress data by itself, but reorganizes the coefficients of a block by frequency, with the low frequencies at the beginning of the block and the high frequencies at the end.

Note that the first coefficient of the block is called the “DC” coefficient and represents the average of all the values of the block. The other coefficients are called “AC”.

The VP8 codec uses a 2-D DCT-II on each 4×4 block of luma pixels in the frame.

The DC coefficients are then gathered in groups of 16 (one DC coefficient from each of the 16 4×4 blocks of a macroblock) and passed through a secondary WHT (Walsh-Hadamard transform).

As the DCT and WHT transformations are basically matrix calculus, they fit well inside shaders. Moreover, the blocks of data to process are independent of each other, which makes parallelization easy: the same DCT is applied to every block of the picture. But there is a major drawback that we will see in chapter 6.
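To illustrate the matrix form, here is a toy 4×4 Walsh-Hadamard round trip in plain Python (the real VP8 transforms use fixed-point butterfly implementations, so this is only a sketch of the principle):

```python
# 4x4 Walsh-Hadamard matrix: symmetric, with orthogonal rows (H * H = 4I)
H = [[1,  1,  1,  1],
     [1,  1, -1, -1],
     [1, -1, -1,  1],
     [1, -1,  1, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def wht_forward(block):
    # 2-D transform = row transform then column transform: H * X * H
    return matmul(matmul(H, block), H)

def wht_inverse(coeffs):
    # applying H on both sides again yields 16 * X, hence the division
    y = matmul(matmul(H, coeffs), H)
    return [[v // 16 for v in row] for row in y]

block = [[5, 3, 0, 1], [2, 2, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
assert wht_inverse(wht_forward(block)) == block  # lossless round trip
```

Note that `wht_forward(block)[0][0]` is the sum of all 16 samples, which is exactly the “DC” coefficient described above.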

5/ Inverse quantization

The quantization stage reduces the amount of information by dividing each transformed block of coefficients (a 2-D matrix) by a “quantization matrix” and then by a “quantization parameter” (used to adjust the final quality of the block), shrinking the range of values each coefficient can take. All near-zero values are zeroed.

Because this makes the values fall into a narrower range, this allows entropy coding to express the values more compactly.
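A minimal sketch of that idea with a single scalar quantizer (VP8 actually uses separate DC and AC quantizer values per block type, so the constants here are purely illustrative):

```python
def quantize(coeffs, q):
    """Encoder side: divide and truncate toward zero, so near-zero
    coefficients collapse to 0 and the rest fall into a narrow range."""
    return [int(c / q) for c in coeffs]

def dequantize(levels, q):
    """Decoder side: multiply back; the rounding error is the loss."""
    return [l * q for l in levels]

coeffs = [-75, 42, 9, -3, 1, 0, 0, 2]
levels = quantize(coeffs, 8)        # [-9, 5, 1, 0, 0, 0, 0, 0]
approx = dequantize(levels, 8)      # [-72, 40, 8, 0, 0, 0, 0, 0]
```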

Here is a quick overview of the quantity of data in a block before and after transformation + quantization :

A simple way to gain a lot of speed using shaders is to process several 4×4 blocks of data simultaneously, as there is no dependency between blocks (within the limits of available shader units), and to process 4 values at a time inside each block.

6/ Intra-prediction

Intra-prediction uses spatial redundancy inside a frame to reduce the amount of data necessary to encode adjacent blocks. It is a major compression feature, able to greatly reduce the size of a single picture. VP8 has 10 available intra-prediction schemes, which consist of propagating the edges of the adjacent block(s) across another block, as seen in the following picture:
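To make one of those modes concrete, here is a sketch of the simplest one, DC prediction, for a 4×4 block (real VP8 DC prediction also handles blocks that have no neighbours above or to the left, which is ignored here):

```python
def predict_dc_4x4(above, left):
    """Predict a whole 4x4 block as a single flat value: the rounded
    average of the 4 reconstructed pixels above and the 4 to the left."""
    dc = (sum(above) + sum(left) + 4) // 8  # +4 rounds to nearest
    return [[dc] * 4 for _ in range(4)]

# the encoder then only has to store the (small) difference between
# this prediction and the actual pixels
pred = predict_dc_4x4([10, 12, 14, 16], [11, 11, 13, 13])
```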

The major problem of intra-prediction is that it reads the edges of adjacent reconstructed blocks. To reconstruct a block, the transformation, quantization and intra-prediction steps must already have been performed on that block.

As intra-prediction sits in the middle of the operations needed to reconstruct a block, it means that we cannot process these blocks in parallel, because each block can depend on the blocks to its left, above, above-left and above-right.

So basically, each time we want to intra-predict a block we have to wait for the result of the previous intra-prediction. This suppresses the gain of doing parallel block transformation + quantization, because we would need to wait for the intra-prediction results to reconstruct the blocks anyway.

Once a set of data is loaded, it is better to perform all of the operations on that set while the data is still “hot” in the CPU cache, avoiding cache misses and expensive memory reads.

In a GPU implementation, we would try to do that :

entropy decode everything in a frame

upload data to the GPU memory

transform all the blocks (in parallel)

dequantize all the blocks (in parallel)

intra-predict all the blocks

reconstruct all the blocks (in parallel)

So what to do with the intra-prediction stage ?

When you start several intra-prediction processes at different places in the picture, ignoring the dependency between blocks is not an option. Make one mistake during the decoding, and that mistake will propagate:

Another obstacle is that, as each block uses a different intra-prediction scheme, we would need to set up a shader program for each scheme. Each of these shader programs would have a list of blocks it can handle, and would have to wait for the reconstructed adjacent blocks of each entry in its list to become available before performing its task. But doing so (that is, doing linear processing on a parallel processor) could be counterproductive performance-wise.

The other solution would be to download the transformed and dequantized data (after parallel processing) to main memory, do the intra-prediction on the CPU, and re-upload the data for the reconstruction stage and further processing.

So far, my decision is to leave all the processing before block reconstruction to the CPU, and to concentrate on the parallelization of the heavier stages. The SIMD CPU optimizations (MMX, SSE…) will be reintroduced.

7/ Inter-prediction (motion compensation)

Inter-prediction, or motion compensation, is a practical way to exploit the temporal redundancy between frames. As two consecutive frames are very likely to be similar, each macroblock of the picture can simply be copied from another picture instead of being recoded. If needed, the copied macroblock can be slightly moved and interpolated to match the destination picture. This process can be quite intensive, up to 70% of CPU time, depending of course on the amount of inter-predicted blocks in the frame.

The VP8 specification allows up to 3 frames to be used as reference.

If the reference frames are already loaded into the GPU memory, recomposing a frame by copying macroblocks between GPU textures is faster than doing the same thing across CPU memory.
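In its simplest, whole-pixel form, motion compensation is just a displaced block copy. A sketch (VP8 motion vectors actually have sub-pixel precision and need interpolation filters, as discussed next):

```python
def mc_copy_block(ref, bx, by, mv_x, mv_y, size=16):
    """Copy a size x size block from reference plane `ref` (a list of
    rows), displaced by the whole-pixel motion vector (mv_x, mv_y)."""
    sx, sy = bx + mv_x, by + mv_y
    return [ref[sy + r][sx:sx + size] for r in range(size)]

# toy 8x8 reference plane where pixel (x, y) holds 10*y + x
ref = [[10 * y + x for x in range(8)] for y in range(8)]
block = mc_copy_block(ref, 2, 2, 1, -1, size=2)  # [[13, 14], [23, 24]]
```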

Another advantage is that an interpolation filter must sometimes be applied to the copied block of texture, and texture filtering is one thing a GPU does fast, or at least faster than a CPU.

8/ Loop filter

As discussed earlier in this post, most of the processing during video decoding is done on small blocks of data. One side effect is that discontinuities can appear between blocks when they are compressed with slightly different parameters.

To counter this “blocking effect”, a “loop filter”, or “deblocking filter” must be applied on the final picture. This filter is usually applied as a post processing filter for low bitrate videos.

In VP8, as in H.264, the loop filter is a mandatory feature, allowing the encoder to lower the bitrate by 10%, with the loss regained in quality by applying the loop filter in both the encoding and decoding paths. The only problem is that the loop filter is the most expensive operation of the VP8 decoding process, from 30% up to 60% of the total decoding time, especially with high-definition videos.

VP8 has two filter modes, a low-complexity “simple filter” and a high-complexity “normal filter” (used in almost every video). These modes can be switched on a per-frame basis. The filter is somewhat adaptive, as some filtering parameters can be adjusted for each macroblock.

The goal is to smooth the rough edges between blocks, on the three planes (Y, Cb and Cr) for the normal filter, and only on the Y plane for the simple filter. The process is not exactly the same on block and macroblock boundaries: since a 16×16 macroblock can pack parameters specific to its group of 16 4×4 blocks, bigger differences can appear at macroblock edges, so the adjacent block edges within a macroblock may be concatenated and processed at once in their entirety.

Edges are filtered in this order:

left macroblock edge (16 pixels wide)

each left edge (4 pixels wide) for every 16 blocks inside the macroblock

top macroblock edge (16 pixels wide)

each top edge (4 pixels wide) for every 16 blocks inside the macroblock

For each pixel position along an edge, two or three pixels adjacent to either side of the edge (perpendicular to the edge orientation) are examined, and possibly modified if the difference between the values on either side falls below a certain threshold.
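A toy version of that test-and-adjust step for one pixel position, loosely modeled on the VP8 simple filter (the real filter uses clamped fixed-point arithmetic and per-macroblock strengths, so treat the constants as illustrative):

```python
def filter_edge_pixels(p1, p0, q0, q1, limit):
    """p1, p0 sit on one side of the edge, q0, q1 on the other.
    A large step is a real image edge and is kept; a small step is a
    blocking artifact, so the two pixels nearest the edge are nudged
    toward each other."""
    if abs(p0 - q0) * 2 + abs(p1 - q1) // 2 > limit:
        return p0, q0                        # real edge: leave untouched
    a = (3 * (q0 - p0) + (p1 - q1)) >> 3     # filtered delta
    return p0 + a, q0 - a

assert filter_edge_pixels(100, 100, 108, 108, 20) == (102, 106)  # smoothed
assert filter_edge_pixels(50, 50, 200, 200, 20) == (50, 200)     # kept
```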

The deblocking process reads and then writes back the same values, but shader programs usually read from and write to different textures. The idea would be to create a new texture holding only delta values for each pixel, and then to compose this “delta texture” with the “frame texture”.

This VP8 deblocking process is linear: each macroblock must be fully processed before starting the next one. In order to parallelize it, it should be safe to start several deblocking processes at different macroblocks (chosen depending on their filter strength), at the expense of some particular edges being left unfiltered.

9/ Color space conversion

Color space conversion is the last process in “decoding order”. As we’ve seen in chapter 1, video formats don’t store their pixels in RGB, but in a subsampled YCbCr format.

To convert a pixel between the two color models, VP8 uses ITU Recommendation 601, which defines, among other things, how to convert an analog RGB signal to a digital YCbCr signal. From there, an equation to convert digital YCbCr to digital RGB can be derived:

Where R, G and B are the 3 components of a 24-bit RGB color space, and Y, Cb and Cr are the 3 components of a 24-bit YCbCr color space. The prime denotes a gamma-corrected component.

The YCbCr to RGB conversion can be intense on the CPU: for each pixel, 9 divisions, 13 multiplications and 11 additions/subtractions. With some rounding and factorisation we can find a faster implementation, with still 7 multiplications, 7 bit shifts and 7 additions/subtractions:
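A shift-based integer version in that spirit looks like this, using the common 8-bit fixed-point BT.601 approximations (e.g. 1.596 ≈ 409/256); the exact constants and operation counts of any given decoder may differ:

```python
def ycbcr_to_rgb(y, cb, cr):
    """Studio-swing BT.601 YCbCr -> 8-bit RGB using integer multiplies
    and shifts instead of divisions."""
    c, d, e = y - 16, cb - 128, cr - 128
    clip = lambda v: max(0, min(255, v))
    r = clip((298 * c + 409 * e + 128) >> 8)
    g = clip((298 * c - 100 * d - 208 * e + 128) >> 8)
    b = clip((298 * c + 516 * d + 128) >> 8)
    return r, g, b

assert ycbcr_to_rgb(16, 128, 128) == (0, 0, 0)         # studio black
assert ycbcr_to_rgb(235, 128, 128) == (255, 255, 255)  # studio white
```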

Once again, to do that operation faster in a shader unit, convert 4 components at a time and start several conversion processes simultaneously at different places in the frame.

10/ Conclusion

Video codecs are not conceived with decoding performance as the primary goal; the primary goal is to save as much data as possible for transmission. Video decoding is not really a “parallel job”. Some operations are linear by nature, and their handling by a GPU has to be carefully thought out. The raw power of a GPU is only the sum of the raw power of its many computing units; when an operation cannot be parallelized, a GPU will end up being slower than a CPU at the same task.

The hard thing when it comes to GPU decoding is to properly identify which parts of the decoding can be parallelized and which parts cannot. The key is to do the maximum number of operations inside the GPU memory, with shader programs working directly on that memory, and to have as little interaction as possible between the CPU and the GPU. In the best-case scenario, a shader outputs its results directly to another shader’s inputs. Uploading data to the GPU memory or scheduling shader programs introduces latency which could become a net performance penalty if done too frequently.

If you have any questions regarding the content of this post, ask them in the comments and I’ll do my best to answer.

11/ Summer of Code conclusion

As you may know, the Google Summer of Code 2011 ends today.

I did not finish my project and only completed half of it. The “client side” of the VP8 VDPAU implementation is working and is currently being reviewed by the libvdpau maintainers. The mplayer/mplayer2 and ffmpeg/libav patches are also in shape and depending on the feedback on the libvdpau patch they will be sent upstream for review.

The Gallium3D decoder implementation inside Mesa is up and running, tested on the r600g and softpipe drivers, but is only a slimmed-down port of the libvpx decoder. Currently the only hardware-accelerated part is the color space conversion stage, mainly thanks to the work of Younes Manton and Christian König on the G3DVL interface and the VDPAU state tracker.

My work here is not done, and the ideas expressed in this post have yet to be realized. I still have some free time, and as I started working late on this project, it’s only fair that I dedicate this time to moving forward with the decoder implementation.

I’d like to thank everyone who has been involved in this project, starting with my mentor Jakob Bornecrantz, who has been very patient with all my questions these past months; Younes Manton and Christian König, who have been working on the current G3DVL interface and video decoders; Matt Dew for helping GSoC students since day one; and thanks to everyone else!

Special thanks to Christian König and AMD for providing hardware for the project!

Finally, a big thanks to Google for giving students around the world an awesome learning opportunity with the Summer of Code program, and a great way to start working inside a free software community.

My GSoC project status (Tue, 16 Aug 2011)

Hi!
It’s been a long time since I last blogged about my Google Summer of Code project, and I’m sorry about that. But today I have some good news!
I have successfully set up a VP8 decoder inside Mesa, built from Google’s libvpx. Let’s walk through the different parts of this new decoding stack.

We need a VP8 video to begin with. As the current decoder is based on libvpx, the official VP8 Codec SDK, all of the VP8 functionality is supported, and so every WebM file can be used. That’s an important point, and one of the reasons why I initially decided to start from an existing decoder rather than write a new one from scratch: I can progressively replace key computations with shaders while having a solid base known to work in every situation.

libvdpau

To link together a video player and a video decoder through the VDPAU API, the first component used is the libvdpau. Formerly part of the NVIDIA video driver, it is now a standalone package allowing third party implementation of the API into video player or video decoder.
This library has been patched to support the VP8 codec and to handle a VdpPictureInfoVP8 structure. That structure is filled by a VDPAU video player with information contained in each VP8 frame header and then passed to a VDPAU decoder along with the bitstream buffer.

mplayer/mplayer2 and ffmpeg/libav

Then, we need a media player. I have patched mplayer (and its recent fork mplayer2) together with ffmpeg (and its recent fork libav) to support VP8 decoding through VDPAU. The patches for mplayer and mplayer2 are almost the same, but the actual VDPAU implementation is different, with an advantage to mplayer2. The ffmpeg/libav patches are identical.

To launch mplayer, the following arguments must be used:

mplayer -vo vdpau -vc ffvp8vdpau myvideofile.webm

-vo vdpau tells mplayer to initialize a VDPAU video output

-vc ffvp8vdpau tells mplayer to use the ffvp8vdpau video codec, a slightly modified version of the native VP8 codec of ffmpeg/libav

So basically, mplayer uses libvdpau to find an available decoder, initializes it, creates a surface that is gonna be filled by the decoder, and then, after each decoded frame, draws the surface on screen.

As mplayer/mplayer2 relies on ffmpeg/libav to do the frame decoding itself, a hook has to be added into the frame decoding process to bypass the regular decoder and send the data to the VDPAU decoder.

Mesa

The “big part” of this GSoC is obviously the VP8 decoder implementation living inside Mesa. This piece of code is gonna be identified as a “device” by libvdpau. While most of the time a device is a hardware driver, the Mesa decoder just registers itself as an available video decoder. We know that this device is gonna be called by ffmpeg/libav for every frame to decode, with these arguments:

The content of the frame header (with various information like frame size, type, …)

The bitstream buffer, which contains the compressed data representing exactly one frame

Up to 3 “reference frames”, which are already decoded frames used for motion compensation

With that, the decoder can do its job and decode frames. When a frame is ready to be drawn on screen, the decoder must load that frame into the surface provided by the VDPAU video output created by mplayer.

The first step was to create a new decoder stub, based on the existing g3dvl interface (where all the Gallium3D video decoding work takes place), and add it to the VDPAU state tracker to advertise VP8 decoding capabilities.

The second step was to plug a working decoder into these new function stubs.

I used the official libvpx (close to the 0.9.7 version) and stripped out several functionalities. The goal, of course, is to have a lightweight standalone decoder. Examples of removed code are the VP8 encoder, multi-threading (which was actually counterproductive in the decoder part), the libvpx API, CPU run-time detection, the frame scaling functionality, a custom memory manager, etc.

And it worked !

This new decoder can be used by the patched mplayer, through the VDPAU Gallium3D state tracker, to decode VP8 video. Seeking, pausing, and everything works as intended.

Summary

Let’s see how our different components interact:

So far only mplayer can use the VP8 decoder inside Mesa, but implementations can be provided for other popular media players like VLC and, who knows, the Flash media player?! Similarly, the patched version of mplayer can only use the Mesa VP8 decoder, but only because there is no other VDPAU VP8 implementation in the wild right now.

What’s left

The work on libvdpau and mplayer/ffmpeg is done and only needs to be reviewed.

Right now the VP8 decoder is a pure C implementation running on the CPU (only color space conversion is done by the GPU, tested with r600g), and it is about 3 times slower than the regular libvpx (mostly because all the CPU SIMD code has been removed). Ultimately, different parts of the decoder are going to be rewritten to fit GPU decoding, and I don’t intend to port code from libvpx anymore.

Next week, a more technical post where we will see which decoding operations are the biggest CPU time eaters, and what we are gonna do about it.

The goal of my Google Summer of Code project is to provide a generic way of decoding VP8 videos using hardware acceleration. This “hardware acceleration” is going to be provided by modern graphics cards and their “streaming processors”, “shader units”, “CUDA cores” or whatever names they may take depending on the vendor’s marketing department. Basically, they are small processors built inside GPUs that can process a lot of data simultaneously.

When I say a “generic way to use hardware acceleration”, I mean that a lot of media players should be able to benefit from it, and a lot of graphic cards should be able to provide it.

Decoding back-end

First, a hardware-accelerated decoder has to be built. Because writing even a purely CPU-based video decoder is a heavy task (in itself longer than the time allocated by the Google Summer of Code program), an existing decoder is going to be used first. Then, the heaviest computational tasks are going to be progressively rewritten to use shaders.

The libvpx library is going to be used and built inside the Mesa 3D Graphics Library. That should allow video card drivers using Mesa (r300g, r600g and nouveau are targeted) to be shipped with VP8 hardware decoding support. libvpx is the VP8 reference implementation, supported by Google, and has a BSD-style license compatible with Mesa’s MIT license, which makes it a great candidate for inclusion.

In order to be used, the decoder located within Mesa must advertise its capabilities through a Gallium3D state tracker supporting the VDPAU API.

API

So an API to make the link between the video decoder and a media player is needed. When a media player starts decoding a video, it must check whether your system is capable of hardware decoding, and if so, bypass its regular CPU-based decoder to use the GPU-based one. Several APIs can help with that; let’s review some of them:

XvMC (X-Video Motion Compensation)

Built as an X.Org extension, and based on the even older X-Video extension, XvMC allows media players to offload a limited number of operations (motion compensation/inter-frame prediction and iDCT/inverse transformation) to capable GPUs. XvMC’s design is quite old and was not thought out for recent video formats.
Its primary targets are MPEG-1/2 videos.

VA API (Video Acceleration API)

Originally designed by Intel, VA API’s main motivation was to supersede XvMC with a new design and much-extended capabilities. In addition to motion compensation and inverse transformation, VA API can also handle the deblocking filter, intra-frame prediction and bitstream processing. Like XvMC, VA API only exposes particular chunks of data and their associated processing to the GPU, so a lot of the video decoding logic stays inside the regular CPU-based decoder. Another particularity of VA API is that it handles video encoding as well as video decoding.
Its primary targets are H.264, VC-1, MPEG-2 and MPEG-4 videos. VA API is currently implemented by the Intel Linux driver.

VDPAU (Video Decode and Presentation API for Unix)

VDPAU was designed by NVIDIA to offload video decoding and post-processing effects from the CPU. A media player has to start the decoding, but then passes large portions of the bitstream to VDPAU and gets back fully decoded frames. As almost all of the decoding process can be offloaded, VDPAU allows great flexibility in the implementation of a GPU-based decoder.
Its primary targets are H.264, VC-1, MPEG-2 and MPEG-4 videos. The libvdpau library needs to be slightly patched in order to support VP8 decoding. VDPAU is currently implemented by the NVIDIA closed-source driver, available for the Linux, FreeBSD and Solaris operating systems. VDPAU is well supported by media players, which is why it has been chosen.

Last but not least, you’ll need a media player. The media player loads a given video file, parses its container to gather some information about the file’s content (video length, definition, audio and video codecs, etc.), then launches the decoding process and finally draws the decoded pictures onto the screen.

Today, most of the available media players don’t do media decoding themselves, but instead use libraries dedicated to this task. The most widely used library is ffmpeg, along with its recent fork libav. We can also mention GStreamer and xine-lib. These libraries are available on a wide range of operating systems and architectures, and can decode pretty much every video or audio file format available in the wild (and that is a lot).

I made the choice to add VP8 VDPAU support to ffmpeg/libav. They are well-known libraries, already support VDPAU for other video formats, and are used, among others, by the famous VLC media player and MPlayer.

I have just finished my exams, and so my academic year is now over. Time to start working full time on my Google Summer of Code project. The purpose of this blog is to keep you posted about my progress. I will write it in English, so I apologize in advance for any errors that may slip into my sentences.

The goal of this project is to write a Gallium3D state tracker capable of hardware-accelerated VP8 video decoding through the VDPAU API. This would allow every graphics card with a Gallium3D driver (the primary targets are the r300g, r600g and Nouveau drivers) to decode VP8 videos, and every VDPAU-enabled multimedia application to play them.

Hardware acceleration will be built upon the graphics card’s shader units, able to take care of some heavy computations like motion compensation, intra-prediction, the iDCT and the deblocking filter.

More information about the project can be found on this page, and a brief presentation of myself and contact information can be found on that page.

Feel free to contact me at any time if you have any questions or comments about this work !