VVC: The Next-Next Generation Codec

So it’s happening. After their previous work on H.264/AVC and H.265/HEVC, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) have again joined forces to create another video codec, named Versatile Video Coding (VVC).

The preliminary version of VVC already outperforms HEVC significantly, and the goal for the final version is to reduce the bitrate by another 40-50% compared to HEVC at the same visual quality. In this blog post I want to explain a little bit about VVC: where it is going, what some key technical novelties are, and what the current state of development is. The blog post is based on my presentation at Demuxed 2018:

Evolution or Revolution?

Since development of VVC started from the basis of HEVC, the first question to ask about VVC is: Is it an evolution of the technologies that were used in the former coding standards, or is it a really new and revolutionary way of compressing video?

Answer: It’s more or less an evolution of the basic building blocks that were already used in HEVC and various other codecs before.

It is still a hybrid, block based video coding standard

Most technologies are based on HEVC and are further refined and improved

But there are also a lot of new coding tools which have not been seen before in the context of video coding

VVC development timeline

Right after the standardization of HEVC, the research labs which were involved did not close down but continued their research. Some techniques which had not been included in HEVC were further studied and enhanced, and completely new techniques were developed. In October 2015 the Joint Video Exploration Team (JVET) was formed to explore how these new techniques could be combined into a new video coding standard. Then, in March 2017, a Call for Evidence (CfE) was issued.

The CfE and CfP steps are how MPEG operates for all new coding standards. Basically, MPEG hears that there is interest in starting work on a new standard and asks for tangible evidence that the new technology is better than what we have now, to justify investing effort in the project. The response was very positive: JEM-based software was tested and found to perform very well, and this success led to the next step, a Call for Proposals (CfP).

This is basically a request for a common basis on which to get started: essentially the first piece of software. The responses to the CfP were evaluated in April 2018, and the common basis was established. From that basis the new standardization effort was started, and a name was decided: Versatile Video Coding. It is meant to be very versatile and address all video needs, from low resolutions and low bitrates to high resolutions and high bitrates, HDR, 360° omnidirectional video and so on. The first version is planned to be finished in October 2020.

What’s new in VVC?

Intra Prediction

Some of you may have heard about intra prediction. Intra prediction is the technique of predicting a block using what has already been decoded in the neighboring blocks of the same frame. There is definitely some evolutionary stuff going on here.

There are now 65 intra prediction directions, where HEVC had 33.

Rectangular blocks can now be predicted (HEVC only allowed square blocks)

Blocks larger than HEVC’s 32×32 maximum are now available for intra prediction

There are new prediction modes, such as position dependent prediction combination (PDPC)

Chroma can now be predicted from the co-located luma samples (cross-component linear model, CCLM)

Luma and chroma blocks can have different block sizes using a separate tree for the chroma components

All of these improvements are basically logical evolutions of the HEVC intra prediction model.
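To make the CCLM idea above a bit more concrete, here is a minimal Python sketch. The real standard derives the linear parameters with an integer min/max scheme and operates on reconstructed, possibly downsampled luma; the function and variable names here are my own:

```python
def cclm_predict(luma_block, neighbor_luma, neighbor_chroma):
    """Sketch of cross-component linear model (CCLM) prediction.

    Each chroma sample is predicted as a linear function of the
    co-located reconstructed luma sample: chroma = a * luma + b.
    The parameters a and b are derived from already-decoded
    neighboring luma/chroma sample pairs, so no side information
    needs to be transmitted for them.
    """
    # Derive a and b from the neighbors with the min and max luma
    # values (similar in spirit to the derivation in the standard).
    pairs = sorted(zip(neighbor_luma, neighbor_chroma))
    (l_min, c_min), (l_max, c_max) = pairs[0], pairs[-1]
    a = (c_max - c_min) / (l_max - l_min)
    b = c_min - a * l_min
    return [[a * s + b for s in row] for row in luma_block]
```

Because a and b are recomputed from decoded neighbors at both encoder and decoder, the model adapts per block essentially for free.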

Inter Prediction

Inter prediction also took some evolutionary steps forward from HEVC. In inter prediction, a block is predicted from frames in the reference picture buffer using a technique called motion compensated prediction. Basically, a rectangular region from a reference picture is copied and moved to the position of the current block. For each block, a motion vector indicates which area of the reference picture to copy and move.

One important part of motion compensation is how the decoder obtains the prediction information (the motion vector and which picture in the reference buffer to use). Since writing this information directly into the bitstream would cost a lot of bitrate, prediction schemes are used here as well. For example, if the block to the left moves a certain way, it is highly likely that the current block moves in a similar way. So it makes sense not to transmit another motion vector, but to reuse the motion vector from the left block and only signal a modification. While HEVC already had some powerful prediction schemes in place, these have been further refined and now allow for even better prediction of the motion information.
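The basic mechanism can be sketched in a few lines of Python (a deliberately simplified model with only one predictor, the left neighbor; real codecs build a candidate list from several neighbors):

```python
def encode_mv(current_mv, left_mv):
    """Predict the current block's motion vector from the left
    neighbor and transmit only the (usually small) difference."""
    return (current_mv[0] - left_mv[0], current_mv[1] - left_mv[1])

def decode_mv(mvd, left_mv):
    """Reconstruct the motion vector as predictor + difference."""
    return (left_mv[0] + mvd[0], left_mv[1] + mvd[1])
```

Because neighboring blocks usually move similarly, the transmitted difference is small and cheap to entropy-code, which is where the bitrate saving comes from.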

A technique which was new in HEVC is temporal motion vector prediction. Here, the motion information for the current block is not predicted from the neighboring blocks but from a block in the temporal past or future (from the reference picture buffer). In HEVC, only one motion vector could be copied this way. In VVC, this prediction mode can now also copy the partitioning of the referenced block as well as multiple motion vectors and prediction modes.

Predictions from the neighborhood and temporal predictions can now be combined.

Overlapped Block Motion Compensation (OBMC) has also been included. This technique overlaps the predictions of neighboring blocks at their edges and smooths over them to avoid the sharp discontinuities which typically occur in inter prediction. A similar technique is also included in AV1.
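The core of OBMC is a distance-weighted blend of two prediction signals near a block edge. A minimal sketch (the weights and the rows affected are my own illustrative choices, not the standard's):

```python
def obmc_blend(own_pred, neighbor_pred, weights):
    """Sketch of OBMC near the top edge of a block: blend the
    prediction made with the current block's motion vector with
    the prediction made using the top neighbor's motion vector,
    with the neighbor's weight fading out row by row."""
    return [
        [(1 - w) * a + w * b for a, b in zip(row_own, row_nb)]
        for row_own, row_nb, w in zip(own_pred, neighbor_pred, weights)
    ]
```

Rows further from the edge get a smaller neighbor weight, so the blend smoothly hands over from the neighbor's motion to the block's own motion.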

Other evolutionary improvements

While the arithmetic coding engine is still based on the CABAC coder from AVC, it has been enhanced and is now even faster. It has, of course, also been adapted to the current coding scheme, i.e. to its symbols and their probabilities.

Transformation: Where there used to be essentially one transform (the DCT in HEVC), there are now four different separable transforms (DCT/DST variants). There is also more advanced coding of the coefficients, as well as an enhanced rate-distortion-optimized quantization scheme called Dependent Scalar Quantization.

In-loop filtering has also been enhanced further. Adaptive Loop Filtering (ALF) was proposed for HEVC but did not get into the standard. Since then it has been further refined and enhanced and has now made its way into VVC.

Palette coding is well known from image coding and from other video coding standards, such as the screen content coding extensions to HEVC, or AV1. This technique helps most for computer-generated (rendered) screen content. For VVC, however, it will be part of the “normal” version and not of an extension.

Block Partitioning

While the previous points were all just evolutionary steps forward, there are also some tools which are really new and “revolutionary” in the scope of video coding. One of them is the block partitioning. As you might know, a video frame is split into blocks, those blocks are split into smaller blocks, and the prediction and transformation are performed on the smallest blocks in this hierarchy. How each block is split is signaled in the bitstream.

In HEVC, one tree structure allowed each square block to be split recursively into four square sub-blocks. When a block was not split any further using this tree, a limited set of rectangular predictions was allowed. That was it. In VVC, there are now multiple possible splits, embedded in a multi-type tree structure. This is a really cool idea. There are basically two splitting stages, one after the other. In the first stage, we do a quad-tree split (as in HEVC). In the second stage, each block can be split horizontally or vertically into 2 (binary) or 3 (ternary) parts. This stage is recursive, so each rectangular block can again be split into 2 or 3 parts horizontally or vertically. This approach makes the encoder much more flexible in adapting to the input content, but at the same time it also makes video encoding much more complex. An example of the block partitioning is shown here:

Block partitioning in VVC
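The split types described above can be enumerated in a small Python sketch (the mode names are my own; the 1:2:1 ratio for the ternary split matches VVC):

```python
def split(w, h, mode):
    """Sketch of VVC block splitting: a quad-tree split into four
    squares, followed by recursive binary (1:1) or ternary (1:2:1)
    splits in either direction. Returns the child block sizes."""
    if mode == "quad":      # first-stage split (as in HEVC)
        return [(w // 2, h // 2)] * 4
    if mode == "bt_hor":    # binary split, horizontal
        return [(w, h // 2)] * 2
    if mode == "bt_ver":    # binary split, vertical
        return [(w // 2, h)] * 2
    if mode == "tt_hor":    # ternary split, 1:2:1, horizontal
        return [(w, h // 4), (w, h // 2), (w, h // 4)]
    if mode == "tt_ver":    # ternary split, 1:2:1, vertical
        return [(w // 4, h), (w // 2, h), (w // 4, h)]
    raise ValueError(mode)
```

Since the binary and ternary splits recurse, the encoder has to evaluate a much larger space of partitionings than HEVC's pure quad-tree, which is exactly where the extra encoder complexity comes from.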

Affine motion

Classical motion compensation only works on two-dimensional rectangular regions: you just copy an area and move it to another position. In real video, however, not a lot of things move in only 2D across the screen. Objects often scale, rotate or transform, so it makes sense to map these more complex motions into the video codec as well. This is called affine motion.

The idea is to extend the classical model, which only allows for translational motion, while keeping the computational complexity at a minimum. The additional types of motion which are enabled are scaling, rotation, shape changes and shearing. The full model now has 6 degrees of freedom (DOF), compared to the 2 degrees of freedom of the conventional translational motion scheme. The complexity is kept low by applying the affine motion model not per pixel but per block of 4×4 pixels. Each 4×4 block uses normal translational motion compensation, while the motion information per 4×4 block is calculated using the affine motion model.
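A minimal sketch of the 6-parameter model: the motion vector at a position inside the block is interpolated from three control-point motion vectors at the block corners (names and the floating-point arithmetic are my own simplifications; the standard works with fixed-point precision):

```python
def affine_mv(x, y, w, h, mv0, mv1, mv2):
    """Sketch of the 6-parameter affine motion model: the motion
    vector at position (x, y) inside a w x h block is interpolated
    from the control-point motion vectors at the top-left (mv0),
    top-right (mv1) and bottom-left (mv2) corners. In VVC this is
    evaluated once per 4x4 sub-block, not per pixel."""
    mvx = mv0[0] + (mv1[0] - mv0[0]) * x / w + (mv2[0] - mv0[0]) * y / h
    mvy = mv0[1] + (mv1[1] - mv0[1]) * x / w + (mv2[1] - mv0[1]) * y / h
    return (mvx, mvy)
```

Evaluating this once per 4×4 sub-block (at its center) and then running ordinary translational motion compensation per sub-block is what keeps the scheme affordable.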

Again, this allows for much more flexibility at the encoder side but at the same time is also very complex.

Decoder side search

Searching for motion or prediction information based on pixel information at the decoder side is an idea that has been around for a long time but has never made it into a standard, because it moves some of the complexity from the encoder to the decoder. While it has often been shown in the past that this can really improve coding efficiency, it was always considered infeasible to add so much complexity to the decoder.

The idea in VVC is as follows: with normal bi-prediction, we predict from two references and copy information to the current block. Using that prediction as a template, we search the reference pictures that we already have for a better motion compensated block. Then we redo the prediction with the updated motion information and obtain a better prediction of the current block. In order to keep the complexity of the search at the decoder to a minimum, only 8 positions around the block are searched in each reference. This is a tradeoff between the additional decoder complexity and the coding efficiency.
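The refinement step can be sketched as a tiny search in Python (a simplified single-reference model with sum-of-absolute-differences matching; function names and the cost metric are my own):

```python
def refine_mv(ref, template, mv, block_w, block_h):
    """Sketch of decoder-side motion vector refinement: starting
    from the signalled motion vector, test the 8 surrounding
    integer positions and keep the one whose reference block best
    matches the template (sum of absolute differences, SAD)."""
    def sad(dx, dy):
        x0, y0 = mv[0] + dx, mv[1] + dy
        return sum(
            abs(ref[y0 + j][x0 + i] - template[j][i])
            for j in range(block_h) for i in range(block_w)
        )
    # The signalled position plus its 8 integer neighbors.
    candidates = [(0, 0)] + [
        (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    ]
    best = min(candidates, key=lambda d: sad(*d))
    return (mv[0] + best[0], mv[1] + best[1])
```

Since both encoder and decoder run the identical search on identical reconstructed data, the improved motion vector never has to be transmitted.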

Still Under Discussion

There are still a lot of tools under discussion. One of these topics is non-rectangular partitioning. Normally, all coding blocks are rectangular, while edges in natural images usually are not. Therefore, encoders tend to select very small blocks around edges in order to predict them efficiently. With more flexible prediction shapes, such as diagonal splits, the encoder could use bigger blocks at edges, which would be much more efficient. Again, this adds quite some complexity at the encoder side and creates new questions about how to integrate it with other techniques. Multiple different options are under discussion here.

There are many other ideas that are still being considered and many of them are a long way from being set in stone. Some of these include:

Secondary Transform

Motion Compensation

More complex affine motion options

Local illumination compensation

Coding of motion information

Current Picture Referencing

Decoder side motion search

High Level Syntax

Tiles / Slices

Reference Picture Signaling

and the list goes on …

Coding Performance

The big question that everyone wants answered is: what is the performance of VVC?

Below is a comparison of VTM 2 and HM 16.19 at a very preliminary stage. The values are BD-rates based on PSNR. The majority of the tools mentioned above are not yet enabled, so these results are expected to improve considerably from what we see today.

Three different configurations are tested, but the one we are most interested in is Random Access, which is the most performant configuration and is tailored to VOD and streaming applications. Currently we are seeing a bitrate reduction of 23% at the same PSNR. It has also already been shown that when subjective tests are performed, the gains are much higher. The set goal is above 35%, so let’s see how that improves until October 2020.

At the same time, the complexity of the encoder and decoder does not increase too much compared to HEVC. While only encoding and decoding times are measured, they show that the decoding time is only at 120% of HEVC’s, so real-time decoding is unlikely to be a problem. The encoder complexity increases by a factor of about 4, which is also quite manageable. When more of the tools which are currently under discussion are adopted, it is quite likely that these values will increase further. However, the joint team always keeps a close eye on these numbers, so a huge hike is unlikely. Just to put it into perspective: the encoding time of the reference AV1 encoder compared to HEVC is closer to 2000%.

Licensing

Since the standardization is still ongoing and it is not yet finalized which tools and techniques will be in the final version of VVC, the licensing situation is still mostly unknown. The one thing we do know for sure is that VVC will be a patent-encumbered video coding standard and will not be free of charge. The major problem with HEVC is that there is more than one license pool, plus some companies that do not license their IP at all, which has made the situation very difficult to manage. For the sake of everyone who wants to make good use of VVC, we are hoping for a better solution this time. There is an industry forum that is trying to work towards one, but right now there is only so much that can be done.

Conclusion

There is no doubt that VVC has some exciting potential and is already showing some interesting results. Two years is a long time, and there is a lot of opportunity to further increase the coding performance, as well as perfect some of the new tools that are already adopted into the new standard. It will be exciting to check in again in 6 to 12 months to see how the codec is developing.