Video Coding Principles
The most widely used video standards employ block-based processing. Each macroblock (MB) typically contains four 8x8 luminance blocks and two 8x8 chrominance blocks (for the 4:2:0 chroma format). Video coding is based on the principles of motion-compensated prediction (MC) [1],
transform coding and quantization, and entropy coding. Figure 2 shows a typical motion-compensation-based video codec. In motion compensation, compression is achieved by predicting each macroblock of pixels in a frame of video from a similar region of a recently
coded ("reference") video frame. For example, background areas often stay the same from one frame to the next and do not need to be retransmitted in each frame. Motion estimation (ME) is the process of determining, for each MB in the current frame, the 16x16 region of the
reference frame that is most similar to it. ME is usually the most computationally intensive function in video compression. Information on the relative location of the most similar region for each block in the current frame (the "motion vector") is transmitted to the decoder.
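
To make the ME step concrete, here is a minimal sketch of exhaustive ("full search") block matching using the sum of absolute differences (SAD) as the matching cost. The function names, the ±15-pixel search window, and the choice of SAD are illustrative assumptions, not requirements of any particular standard.

```python
import numpy as np

def sad(block_a, block_b):
    # Sum of absolute differences: the matching cost used here (an
    # illustrative choice; real encoders may use SAD, SSD, or SATD).
    return np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum()

def full_search_me(current, reference, mb_y, mb_x, search_range=15, mb=16):
    """Find the motion vector for the 16x16 macroblock at (mb_y, mb_x)
    by exhaustively testing every candidate in a +/- search_range window."""
    h, w = reference.shape
    target = current[mb_y:mb_y + mb, mb_x:mb_x + mb]
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = mb_y + dy, mb_x + dx
            if y < 0 or x < 0 or y + mb > h or x + mb > w:
                continue  # candidate falls outside the reference frame
            cost = sad(target, reference[y:y + mb, x:x + mb])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost
```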

The residual after MC is divided into 8x8 blocks, each encoded using a combination of transform coding, quantization and variable-length coding. Transform coding, such as the discrete cosine transform (DCT), exploits spatial redundancy in the residual signal. Quantization removes perceptual redundancy and reduces the amount of data required to encode the residual. Variable-length coding exploits the statistical nature of the residual coefficients. The process of redundancy removal via MC is reversed in the decoder: the predicted data from the reference frame is combined with the decoded residual data to reconstruct a representation of the original video frame.
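
As an illustration of the transform and quantization stages, the sketch below applies an orthonormal 8x8 DCT to a residual block and quantizes the result with a single uniform step size. Real codecs use fixed-point approximations and more elaborate quantizers; the qstep value here is an arbitrary assumption.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix; the forward DCT is C @ X @ C.T.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0, :] = np.sqrt(1.0 / n)
    return C

C = dct_matrix()

def encode_block(residual, qstep=16):
    coeffs = C @ residual @ C.T                   # 8x8 spatial -> 64 frequency coefficients
    return np.round(coeffs / qstep).astype(int)   # uniform quantization (the lossy step)

def decode_block(levels, qstep=16):
    coeffs = levels * qstep                       # inverse quantization
    return C.T @ coeffs @ C                       # inverse DCT back to the spatial domain
```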

Figure 2: Standard Motion Compensated Video Coding

In a video codec, an individual frame may be encoded using one of three modes: I, P or B (see Figure 3). Frames referred to as Intra (I) frames are encoded independently, without reference to any other frame (no motion compensation). Frames coded
using MC with a previous frame as reference (forward prediction) are referred to as Predicted (P) frames.

B frames, or bi-directionally predicted frames, are predicted both from past frames and from frames slated to appear after the current frame. A benefit of B frames is the ability to match a background area that was occluded in the previous frame but can be found in a subsequent frame using backward prediction. Bi-directional prediction can also reduce noise by averaging the forward and backward predictions.

Leveraging this feature in encoders requires additional processing, since ME has to be performed for both forward and backward prediction, which can effectively double the motion estimation computational requirements. Additional memory is also needed at both encoder and decoder to store two reference frames. B frame tools require a more complex data flow, since frames are decoded out of order with respect to how they are captured and need to be displayed. This increases latency and is therefore not suitable for some delay-sensitive real-time applications. Until H.264, B frames were never themselves used as references for predicting other frames, which allows useful trade-offs in some applications. For example, B frames can be skipped in low-frame-rate applications without impacting the decoding of future I and P frames.
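
The averaging at the heart of bi-directional prediction can be sketched in a few lines. The rounding convention below is one common choice, not a rule mandated by any specific standard.

```python
import numpy as np

def bidirectional_predict(forward_block, backward_block):
    # Average the forward (past reference) and backward (future reference)
    # predictions. The simple rounded (a + b) / 2 shown here ignores the
    # weighted-prediction variants that later codecs add.
    a = forward_block.astype(np.int32)
    b = backward_block.astype(np.int32)
    return ((a + b + 1) >> 1).astype(np.uint8)
```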

Figure 3: An illustration of inter-frame prediction in I, P and B Frames

Legacy Video Coding Standards

H.261
H.261 [2], defined by the ITU, was the first major video compression standard. It was targeted at two-way video conferencing applications and was designed for ISDN networks, which offer bitrates in multiples of 64 kbps (the standard is often described as "p×64", covering roughly 64 kbps to 2 Mbps). H.261 supports resolutions of 352x288 (CIF) and 176x144 (QCIF) with 4:2:0 chrominance sub-sampling. Complexity was also kept low, since videophones require simultaneous real-time encoding and decoding. Due to its focus on two-way video, which is delay-sensitive, H.261 allows only I and P frames and no B frames.

H.261 uses a block-based DCT for transform coding of the residual. The DCT maps each 8x8 block of pixels to the frequency domain, producing 64 frequency components (the first coefficient is referred to as DC and the rest as AC). To quantize the DCT coefficients, H.261 uses a fixed linear quantization across all the AC coefficients. The quantized coefficients are subject to run-length coding, which represents them as runs of zero coefficients each followed by a non-zero coefficient level, with a final end-of-block (EOB) code after the last non-zero value. Finally, variable-length (Huffman) coding converts the run-level pairs into variable-length codes (VLCs), with the bit lengths optimized for the typical probability distribution.
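
The sketch below shows how a quantized 8x8 block might be scanned in zigzag order and converted into run-level pairs plus an EOB marker. It illustrates the idea only and does not reproduce H.261's actual code tables.

```python
def zigzag_indices(n=8):
    # Zigzag scan order: coefficients are visited roughly from low to high
    # frequency, so trailing zeros cluster at the end of the scan.
    return sorted(((y, x) for y in range(n) for x in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 == 0 else p[0]))

def run_level_encode(levels):
    """Convert a quantized 8x8 coefficient block into (run, level) pairs
    plus an end-of-block marker, in the style of H.261 run-length coding."""
    scan = [levels[y][x] for (y, x) in zigzag_indices()]
    pairs, run = [], 0
    for v in scan[1:]:              # DC (scan[0]) is typically coded separately
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))  # run of zeros followed by a non-zero level
            run = 0
    pairs.append("EOB")             # all remaining coefficients are zero
    return pairs
```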

Standard block-based coding produces visible blocking artifacts. In H.261, this is mitigated by a loop filtering technique: a simple 2D FIR filter applied at block edges to smooth out quantization effects in the reference frame. It must be applied in a bit-exact fashion in both the encoder and the decoder.
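
A minimal separable [1, 2, 1]/4 smoothing filter, in the spirit of H.261's loop filter, is sketched below; the standard's exact edge-pixel rules are not reproduced.

```python
import numpy as np

def loop_filter_block(block):
    """Separable [1, 2, 1]/4 FIR smoothing over an 8x8 block. Must be
    applied identically at encoder and decoder to stay bit-exact."""
    b = block.astype(np.int32)
    # Horizontal pass: interior pixels only, edge pixels copied unfiltered.
    h = b.copy()
    h[:, 1:-1] = (b[:, :-2] + 2 * b[:, 1:-1] + b[:, 2:] + 2) >> 2
    # Vertical pass on the horizontally filtered result.
    v = h.copy()
    v[1:-1, :] = (h[:-2, :] + 2 * h[1:-1, :] + h[2:, :] + 2) >> 2
    return v.astype(np.uint8)
```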

MPEG-1
MPEG-1 [3] was the first video compression algorithm developed by the ISO. The driving application was storage and retrieval of moving pictures and audio on digital media such as Video CDs, using SIF resolution (352x240 at 29.97 fps or 352x288 at 25 fps) at about 1.15 Mbps. MPEG-1 is similar to H.261, but encoders typically require more processing power to handle the heavier motion found in movie content compared with typical video telephony.

Compared to H.261, MPEG-1 allows B frames. It also uses adaptive perceptual quantization: a separate quantization step size is applied to each frequency bin, shaping the quantization noise to match human visual perception. MPEG-1 supports only progressive video; as a result, an effort was started on a new standard, MPEG-2, to support both progressive and interlaced video at higher resolutions and higher bitrates.
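
Per-frequency quantization can be sketched with a weighting matrix whose step size grows with spatial frequency, so that high-frequency coefficients (less visible to the eye) are quantized more coarsely. The simple ramp below is purely an illustrative assumption; MPEG-1 actually defines a standardized default intra matrix and lets encoders transmit a custom one.

```python
import numpy as np

# Illustrative perceptual weighting matrix: step size grows with frequency.
u = np.arange(8)
quant_matrix = 8 + 4 * (u[:, None] + u[None, :])   # steps from 8 up to 64

def quantize_intra(coeffs, qscale=1):
    # Per-bin step = matrix entry scaled by a macroblock-level qscale.
    return np.round(coeffs / (quant_matrix * qscale)).astype(int)

def dequantize_intra(levels, qscale=1):
    return levels * (quant_matrix * qscale)
```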

MPEG-2/H.262
MPEG-2 [4] was developed targeting digital television and soon became the most successful video compression standard to date. MPEG-2 addresses both standard progressive video (where a video sequence consists of a succession of frames, each captured at regularly spaced time instants) and interlaced video, which is popular in the television world. In interlaced video, two sets of alternate rows of pixels (each called a field) are captured and displayed alternately. Until recently, this approach was particularly suited to the physics of most TV displays on the market. MPEG-2 supports standard television resolutions, including interlaced 720x480 at 60 fields per second for NTSC (used in the US and Japan) and interlaced 720x576 at 50 fields per second for PAL (used in Europe and other countries).
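
The relationship between a frame and its two fields can be shown in a couple of lines; the top-field-first convention assumed here varies by source material.

```python
def split_fields(frame):
    """Split a frame into its two interlaced fields: here the top field
    holds the even rows and the bottom field the odd rows. Each field is
    captured and displayed at a separate time instant in interlaced video."""
    top_field = frame[0::2, :]
    bottom_field = frame[1::2, :]
    return top_field, bottom_field
```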

MPEG-2 builds on MPEG-1 with extensions to support interlaced video and much wider motion compensation search ranges. Because higher-resolution video is an important application, the wider search ranges greatly increase the motion estimation performance requirement relative to the earlier standards; encoders taking full advantage of the wider search range and the higher resolution require significantly more processing than H.261 and MPEG-1. Interlaced coding tools in MPEG-2 include motion compensation supporting both field- and frame-based prediction, and support for both field- and frame-based DCT/IDCT. MPEG-2 performs well at compression ratios around 30:1. The quality achieved with MPEG-2 at 4-8 Mbps was acceptable for consumer video applications, and it was soon deployed in applications including digital satellite, digital cable, DVD and, lately, high-definition TV.

In addition, MPEG-2 adds scalable video coding tools to support multi-layer video coding, namely temporal scalability, spatial scalability, SNR scalability and data partitioning. Although profiles were defined in MPEG-2 for scalable video applications, Main Profile, which supports single-layer coding, is the only MPEG-2 profile widely deployed in the mass market today; it is often referred to as simply MPEG-2. The processing requirements for MPEG-2 decoding were initially very high for general-purpose processors and even DSPs. Optimized fixed-function MPEG-2 decoders were developed and became inexpensive over time due to the high volumes. MPEG-2 proved that the availability of cost-effective silicon solutions is a key ingredient for the success and deployment of video codec standards.

H.263
H.263 [5] was developed after H.261 with a focus on enabling better quality at even lower bitrates. One of the important targets was video over ordinary telephone modems at 28.8 kbps. The target resolutions ranged from SQCIF (128x96) to CIF (352x288). The basic techniques are similar to H.261, with a few differences.

Motion vectors in H.263 are allowed to point to half-pixel positions ("half-pel") in either direction, with the reference picture interpolated to the higher resolution (a sketch of half-pel interpolation follows the list below). This leads to better MC accuracy and higher compression ratios. Larger ranges were also allowed for the MVs. A host of new options were provided for different scenarios, including:

Four motion vectors: One motion vector for each 8x8 block rather than one motion vector for the entire MB.

3D VLC: Huffman coding that combines an end-of-block (EOB) indicator with each run-level pair. This feature specifically targets low bitrates, where there are often only one or two coded coefficients per block.
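
Here is the half-pel interpolation sketch referred to above: bilinear averaging of the surrounding integer pixels, which is the classic approach. H.263's exact rounding rules and picture-boundary handling are not reproduced.

```python
import numpy as np

def half_pel_sample(ref, y2, x2):
    """Sample the reference picture at half-pel coordinates (y2, x2 are in
    half-pixel units). Bounds checks are omitted in this sketch."""
    y, x = y2 // 2, x2 // 2
    fy, fx = y2 % 2, x2 % 2
    p = ref.astype(np.int32)
    a = p[y, x]
    b = p[y, x + fx]
    c = p[y + fy, x]
    d = p[y + fy, x + fx]
    return (a + b + c + d + 2) >> 2   # rounded average of the 4 neighbors
```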

However, despite these techniques, adequate video quality over ordinary phone lines proved very difficult to achieve, and videophones over standard modems remain a challenge today. Since H.263 generally offered improved efficiency over H.261, it became the preferred algorithm for video conferencing, with H.261 support still required for compatibility with older systems. H.263 expanded over time: H.263+ and H.263++ added optional annexes supporting compression improvements and features for robustness over packet networks. H.263 and its annexes formed the core for many of the coding tools in MPEG-4.

MPEG-4
MPEG-4 [6] was initiated by the ISO as a follow-on to the success of MPEG-2. Some of the early objectives were increased error robustness to support wireless networks, better support for low-bitrate applications, and a variety of new tools to support merging graphic objects with video. Most of the graphics features have not yet gained significant traction in products, and implementations have focused primarily on the improved low-bitrate compression and error resiliency. New compression tools include:

Context-Adaptive Intra DCT DC/AC Prediction: Allows the DC/AC DCT coefficients to be predicted from the neighboring block either to the left of or above the current block, with the direction chosen adaptively (a sketch of the DC part follows the list below).

Extended dynamic range of quantized AC coefficients: Extends the range from [-127, 127] in H.263 to [-2047, 2047] to support high-fidelity video.
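
The DC part of adaptive prediction can be sketched as follows, in the spirit of MPEG-4's gradient rule: compare the DC gradients around the current block and predict from the direction with the smaller change. Boundary handling and the AC-coefficient part of the tool are omitted.

```python
def predict_dc(dc_left, dc_above_left, dc_above):
    """Choose the DC predictor adaptively from the neighboring blocks."""
    if abs(dc_left - dc_above_left) < abs(dc_above_left - dc_above):
        return dc_above   # horizontal gradient dominates -> predict vertically
    return dc_left        # otherwise predict from the left neighbor

# Only the prediction residual (actual DC minus predicted DC) is coded.
```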

Error resiliency features added to support recovery in packet loss conditions include:

Slice Resynchronization: Establishes slices within images that allow quicker resynchronization after an error has occurred. Unlike in MPEG-2, MPEG-4 packet sizes are decoupled from the number of bits used to represent an MB. As a result, resynchronization points can be placed at roughly equal intervals in the bitstream, irrespective of the amount of information per MB.

Data Partitioning: A mode that allows partitioning the data within a video packet into a motion part and a DCT data part, separated by a unique motion boundary marker. This allows more stringent checks on the validity of motion vector data and gives the decoder better visibility into where an error occurred, so that all of the motion data need not be discarded when an error is found.

Reversible VLC: VLC code tables designed to allow decoding backwards as well as forwards. When an error is encountered, it is possible to sync at the next slice or start code and work backwards to the point where the error occurred (a toy illustration follows the list below).

New prediction (NEWPRED): Mainly designed for fast error recovery in real-time applications, where the decoder uses a reverse channel to request additional information from the encoder in the event of packet losses.
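
The toy reversible VLC below uses palindromic codewords, which form a code that is both prefix-free and suffix-free, so the bitstream can be parsed from either end. The codeword assignment is purely illustrative and is not the actual MPEG-4 RVLC table.

```python
RVLC = {"0": "a", "11": "b", "101": "c", "1001": "d"}  # palindromic codewords

def decode(bits, backwards=False):
    if backwards:
        bits = bits[::-1]   # palindromes match the same table when reversed
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in RVLC:
            symbols.append(RVLC[word])
            word = ""
    if backwards:
        symbols.reverse()   # restore forward symbol order
    return symbols

bits = "0111010"   # encodes a, b, c, a
assert decode(bits) == decode(bits, backwards=True) == ["a", "b", "c", "a"]
```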

The MPEG-4 Advanced Simple Profile (ASP) starts from the Simple Profile and adds B frames and interlaced tools (for Level 4 and up) similar to those in MPEG-2. It also adds quarter-pel motion compensation and an option for global motion compensation. ASP requires significantly more processing performance than the Simple Profile and has higher complexity and coding efficiency than MPEG-2.

MPEG-4 was used initially in Internet streaming and was adopted, for example, by Apple's QuickTime player. MPEG-4 Simple Profile is now finding widespread application in mobile streaming. MPEG-4 ASP forms the foundation of the popular proprietary DivX codec.

Tools vs. compression gains
When we review the techniques introduced in the video codec field through H.261, MPEG-1, MPEG-2 and H.263, we observe that a few basic techniques have provided most of the compression gains. Figure 4 illustrates these techniques and their relative effectiveness.
Motion compensation (both integer and half-pel) clearly stands out compared with tools such as four motion vectors and quarter-pel motion compensation.