A standard video compression algorithm has three main parts:

Motion Compensated Prediction (MCP)

Transform Coding (DWT/DCT)

Group of Pictures (GOP)

The most fundamental procedure in video compression is
Motion-Compensated Prediction. What does this mean? Video
sequences show a high degree of correlation from frame to
frame. One compression strategy is to take the difference
between adjacent frames and store this value. If the two frames
are similar, the difference frame contains only a small
amount of information. This technique is known as Predictive
Coding and is commonly used in signal compression. In video
signals there is a catch: imagine an action sequence with
fast-moving cars and quick camera movements. In these sequences,
adjacent frames may be wildly different. If neighboring frames
are very different, the difference frame may be large and
contain more information than the original frames themselves!
To overcome this we use ‘Motion-Compensated’ Prediction.
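
As a minimal illustration of the basic difference-frame idea, here
is a sketch in Python, assuming two grayscale frames given as 8-bit
NumPy arrays of the same shape:

    import numpy as np

    # A minimal sketch of predictive (difference) coding, assuming
    # two grayscale frames given as uint8 NumPy arrays.
    def difference_frame(prev_frame, curr_frame):
        # Work in a signed type so negative differences survive.
        return curr_frame.astype(np.int16) - prev_frame.astype(np.int16)

When the frames are similar, most entries of the difference frame
are near zero, which an entropy coder compresses far better than
raw pixel values.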

MCP is a refinement of predictive coding. Rather than taking
the difference frame directly, we use the motion of objects
in the scene to produce a better prediction, and thus a smaller
difference frame.

How do we estimate the motion in the scene?

Motion Estimation

In general, a scene contains multiple moving objects, and the
motion of each object can be characterized from frame to frame.
For example, in a movie of a car driving across the screen,
each frame shows the same car shifted with respect to the
previous frame. This shift can be calculated and characterized
by a Motion Vector. We can then extend this idea to all objects
in the scene.

However, determining where all the objects are in a scene
is extremely complex. A simple, but non-ideal, solution is
to partition each frame into non-overlapping uniform square
blocks. This type of motion estimation is called Block Matching.

The blocks in the first frame, called the anchor frame, are
compared to the blocks in the second frame, called the target
frame. Motion Vectors can then be calculated for each block
to see where each block from the anchor frame ends up in the
target frame.
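
The sketch below illustrates block matching with an exhaustive
search, using the sum of absolute differences (SAD) as the match
cost. It assumes grayscale frames as 2-D NumPy arrays; the block
size and search range are illustrative choices, not values mandated
by any standard.

    import numpy as np

    def block_match(anchor, target, block=16, search=8):
        h, w = anchor.shape
        vectors = {}
        for by in range(0, h - block + 1, block):
            for bx in range(0, w - block + 1, block):
                ref = anchor[by:by + block, bx:bx + block].astype(np.int32)
                best, best_mv = np.inf, (0, 0)
                # Try every displacement within the search window.
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y, x = by + dy, bx + dx
                        if 0 <= y <= h - block and 0 <= x <= w - block:
                            cand = target[y:y + block, x:x + block].astype(np.int32)
                            sad = np.abs(ref - cand).sum()
                            if sad < best:
                                best, best_mv = sad, (dy, dx)
                vectors[(by, bx)] = best_mv
        return vectors

Each entry of the returned dictionary maps a block's top-left corner
in the anchor frame to the displacement that best matches it in the
target frame.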

The following is an example of two frames from a video sequence.
Note that the train is moving from right to left, the ball
is rolling, the gimbals are spinning, and the camera is slowly
panning upwards.

The following figure shows the motion vectors generated between
these two frames. Note the areas of motion. Also note that motion
vectors appear even in areas of constant color. This is because
every block in such an area constitutes a perfect match, so the
search cannot single out the true displacement. However, this
is not a problem when reconstructing the frame.

The following figure shows the target frame predicted from
the anchor frame and the motion vectors.

When a target frame is reconstructed using
the anchor frame and motion vectors, the reconstruction is
not perfect. To compensate for these errors, an error
frame is generated at the encoder. The error frame is the
difference between the actual target frame and the reconstructed
target frame. At the decoder, the error frame is added to the
reconstructed frame.
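
A minimal sketch of this residual step, assuming predicted was
built from the anchor frame and the motion vectors (for example,
with the block_match sketch above) and actual is the true target
frame:

    import numpy as np

    def encode_residual(actual, predicted):
        # Error frame computed at the encoder (signed to keep negatives).
        return actual.astype(np.int16) - predicted.astype(np.int16)

    def decode_frame(predicted, error):
        # The decoder adds the error frame back to its own prediction.
        return np.clip(predicted.astype(np.int16) + error, 0, 255).astype(np.uint8)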

So why are these error frames and motion vectors desirable?
The reason is that motion-compensated prediction produces error
frames that compress extremely well.

So how are these error frames compressed? The answer is transform
coding.

All mainstream encoders use the Discrete Cosine Transform (DCT)
to perform transform coding. The DCT maps a time-domain signal
to a frequency-domain representation. We can compress the
frequency-domain spectrum by discarding its low-magnitude
coefficients.

However, the DCT has several drawbacks. Computing the DCT is
expensive, and the cost of a naive implementation grows
quadratically with signal size, so calculating the DCT of an
entire video frame takes an unacceptable amount of time. The
practical solution is to partition the frame into small blocks
and apply the DCT to each block independently. However, this
leads to a degradation in picture quality, visible as blocking
artifacts at the block boundaries.
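
The following sketch shows block-based transform coding with
SciPy's dctn/idctn, assuming a grayscale frame whose dimensions
are multiples of the block size; the low-frequency "keep" region
is an illustrative choice.

    import numpy as np
    from scipy.fft import dctn, idctn

    def block_dct_roundtrip(frame, block=8, keep=4):
        out = np.zeros(frame.shape, dtype=np.float64)
        for y in range(0, frame.shape[0], block):
            for x in range(0, frame.shape[1], block):
                tile = frame[y:y + block, x:x + block].astype(np.float64)
                coeffs = dctn(tile, norm='ortho')
                # Zero everything outside the low-frequency corner.
                mask = np.zeros((block, block))
                mask[:keep, :keep] = 1
                out[y:y + block, x:x + block] = idctn(coeffs * mask, norm='ortho')
        return out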

The Discrete Wavelet Transform (DWT) offers a better solution.
The DWT is another transform that maps time-domain signals
to frequency-domain representations, but it has a distinct
advantage: the DWT can, in essence, be computed by applying a
cascade of digital filters, which is fast. This allows us to
apply the DWT to an entire signal without taking a significant
performance hit. By analyzing the whole signal, the DWT captures
more information than the block-based DCT and can produce better
results.
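
For instance, one level of the Haar DWT reduces to an averaging
(low-pass) filter and a differencing (high-pass) filter, each
followed by downsampling by two. A minimal sketch, assuming an
even-length 1-D signal:

    import numpy as np

    def haar_dwt_level(signal):
        x = np.asarray(signal, dtype=np.float64)
        approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # low-pass branch
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # high-pass branch
        return approx, detail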

So what does the DWT look like when applied to an image?

The figure below shows an image of a seagull. The DWT separates
the image’s high frequency components from the rest of the
image, resizes the remaining parts and rearranges them to
form a new ‘transformed’ image.

This image shows one step of the 2-dimensional DWT. The image
is separated into four subimages. The bottom-left, bottom-right,
and top-right subimages show the high-frequency detail of the
image. The top-left quadrant contains the low-frequency, or
lower-detail, portion of the image; we can see that most of the
information is in this portion. We can achieve compression by
removing data in the high-detail areas. As you can see, if we
retain only the top-left image, we drop information without
distorting the image in a noticeable fashion.
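
A hedged sketch of this step using the PyWavelets library (the
'haar' wavelet is an illustrative choice):

    import numpy as np
    import pywt

    def keep_approximation(image):
        # One 2-D DWT level yields the approximation (top-left)
        # subband plus three detail subbands (horizontal, vertical,
        # diagonal).
        cA, (cH, cV, cD) = pywt.dwt2(image.astype(np.float64), 'haar')
        # Zero the detail subbands before inverting the transform.
        zeros = np.zeros_like(cA)
        return pywt.idwt2((cA, (zeros, zeros, zeros)), 'haar')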

To perform lossy compression, all small components in the
transformed image can be set to zero and run-length encoded.
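
A minimal sketch of this thresholding and run-length step (the
threshold value is an illustrative parameter):

    import numpy as np

    def threshold_and_rle(coeffs, threshold=5.0):
        flat = coeffs.flatten()   # flatten() returns a copy
        flat[np.abs(flat) < threshold] = 0.0
        runs = []                 # list of (value, run_length) pairs
        i = 0
        while i < len(flat):
            j = i
            while j < len(flat) and flat[j] == flat[i]:
                j += 1
            runs.append((flat[i], j - i))
            i = j
        return runs

Because the thresholded coefficients contain long runs of zeros,
the list of (value, run length) pairs is typically much shorter
than the original data.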

What is the general structure of compressed and uncompressed
video sequences?

In general, there are three different types of frames. They
are called I, P, and B frames.

I frames are essentially the main anchor frames. No motion
estimation is performed to generate these frames; they are
transform coded directly to ensure a high-quality reconstruction.
This matters because all following frames are predicted from the
I frame, and any error in the I frame will propagate through
the rest of the group.

P frames are predicted using MCP from the preceding I or
P frame. The resulting error frame is then transformed and
compressed. Both the error frame and the set of motion vectors
are stored to file.

B frames are encoded much like P frames, except that the prediction
is done from a combination of a previous I or P frame and
a future I or P frame. The two predictions are then averaged
to represent the current frame. This is called bi-directional
prediction. The prediction from future frames is needed
to capture new objects that may appear in the video in the
middle of the group of pictures.
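
A minimal sketch of the averaging step, assuming mc_past and
mc_future are the motion-compensated predictions taken from the
past and future reference frames:

    import numpy as np

    def bidirectional_predict(mc_past, mc_future):
        # Average the forward and backward predictions.
        return (mc_past.astype(np.float64) + mc_future.astype(np.float64)) / 2.0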

Why is the ordering important?

When predicting frames from previously encoded frames, any
errors in those earlier frames will degrade the reconstruction
of the current frame. This error propagation can be controlled
by using a specific frame ordering known as the group of pictures.
A general group of pictures structure is shown below.
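
As a complement to the figure, the sketch below reorders an
illustrative display-order GOP into a plausible encoding order:
each I or P reference must be coded before the B frames that
depend on it.

    display_order = list("IBBPBBP")

    encode_order = []
    pending_b = []
    for frame in display_order:
        if frame == "B":
            pending_b.append(frame)         # hold B frames until their future reference
        else:
            encode_order.append(frame)      # code the I/P reference first...
            encode_order.extend(pending_b)  # ...then the held B frames
            pending_b = []
    encode_order.extend(pending_b)

    print("display:", "".join(display_order))  # IBBPBBP
    print("encode: ", "".join(encode_order))   # IPBBPBB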