MPEG-FAQ: multimedia compression [2/9]

This is the summary about the ISO video and audio formats MPEG 1, 2 and 4

Archive-name: mpeg-faq/part2
Last-modified: 1996/06/02
Version: v 4.1 96/06/02
Posting-Frequency: bimonthly
perceptual audio codecs. If you need more information about the Noise-to-
Mask-Ratio (NMR) technology, feel free to contact nmr@iis.fhg.de.
Q: O.K., back to these listening tests. Come on, tell me some results.
A: Well, for details you should study one of those AES papers or MPEG
documents listed above. The main result is that for low bitrates (64 kbps
per channel or below), Layer-3 always scored significantly better than
Layer-2. Another important conclusion is the draft recommendation of the
task group TG 10/2 within the ITU-R. It recommends the use of low bit-
rate audio coding schemes for digital sound-broadcasting applications
(doc. BS.1115).
Q: Very interesting! Tell me more about this recommendation!
A: The task group TG 10/2 concluded its work in October 93. The draft
recommendation defines three fields of broadcast applications:
- distribution and contribution links (20 kHz bandwidth, no audible
impairments with up to 5 cascaded codecs)
Recommendation: Layer-2 with 180 kbps per channel
- emission (20 kHz bandwidth)
Recommendation: Layer-2 with 128 kbps per channel
- commentary links (15 kHz bandwidth)
Recommendation: Layer-3 with 60 kbps for monophonic and 120 kbps
for stereophonic signals
Q: I see. Medium bitrates - Layer-2, low bitrates - Layer-3. What about a
bitrate of 96 kbps per channel that seems to be "somewhere in between"
Layer-2 and Layer-3 domains?
A: Interesting question. In fact, a total bitrate of 192 kbps for stereo music is
useful for real applications, e.g. emission via satellite channels. The ITU-R
required that emission codecs should score at least 4.0 on the CCIR
impairment scale, even for the most critical material. At 128 kbps per
channel, Dolby's AC-2, Layer-2 and Layer-3 fulfilled this requirement.
Finally, Layer-2 got the recommendation mainly because of its
"commonality with the distribution and contribution application".
Further tests for emission were performed at 192 kbps joint-stereo coding.
Layer-3 clearly met the requirements, Layer-2 fulfilled them only
marginally, with doubts remaining during further tests with cascaded
codecs in 1993. In the end, the task group decided to pronounce no
recommendation for emission at 192 kbps.
Q: Someone told me that in the ITU-R tests, there was some trouble with
Layer-3, specifically on male voice in the German language. Still, Layer-3
got the recommendation for "commentary links". Can you explain that?
A: Yes. For commentary links, the quality requirements for speech were to be
equivalent to 14-bit linear PCM, and for music, some perceptible
impairments were to be tolerated. In the test in 1992, Layer-3 was the only
codec that fulfilled these requirements, and by a wide margin (e.g. overall
monophonic, Layer-3 scored 3.6 in contrast to Layer-2 at 2.05 - and for male
German speech, Layer-3 scored 4.4 in contrast to Layer-2 at 2.4).
Further tests were performed in 1993 using headphones. They showed that
MPEG-1 Layer-3 with monophonic speech (the test item is German male
voice) at 60 kbps did not fully meet the quality requirements. The ITU
decided to recommend Layer-3 and to include a temporary footnote that
will be removed as soon as an improved Layer-3 codec fulfills the
requirements completely, i.e. even with that well-known critical male
German speech item (for many other speech items, Layer-3 has no trouble
at all).
Q: O.K., a Layer-2 codec at low bitrates may sound poor today, but couldn't
that be improved in the future? I guess you just told me before that the
encoder is not fixed in the standard.
A: Good thinking! As the sound quality mainly depends on the encoder
implementation, it is true that there is no such thing as a "Layer-N"-
quality. So we definitely only know the performance of the reference
codecs used during the international tests. Who knows what will happen in
the future? What we do know now, is:
Today, in MPEG-1 and MPEG-2, Layer-3 provides the best sound quality
at low bitrates, by far better than Layer-2.
Tomorrow, both Layers may improve. Layer-2 has been designed as a
trade-off between quality and complexity, so the bitstream format allows
only limited innovations. In contrast, even the current reference Layer-3-
codec does not exploit all of the powerful mechanisms inside the Layer-3
bitstream format.
Q: What other topics do I have to keep in mind? Tell me about the complexity
of Layer-3.
A: O.K. First, we have to distinguish between decoder and encoder, as the
workload is distributed asymmetrically between them, i.e. the encoder
needs much more computation power than the decoder.
For a stereo Layer-3-decoder, you may either use a DSP (e.g. one
DSP56002 from Motorola) or an "ASIC", like the masc-programmed DSP
chip MAS 3503 C from Intermetall, ITT. Some rough requirements are:
- computation power: around 12 MIPS
- data ROM: 2.5 Kwords
- data RAM: 4.5 Kwords
- program ROM: 2 to 4 Kwords
- word length: at least 20 bits
Intermetall (ITT) estimated an overhead of around 30 % chip area for
adding the necessary Layer-3 modules to a Layer-2-decoder. So you need
not worry too much about decoder complexity.
For a stereo Layer-3-encoder achieving reference quality, our current real-
time implementations use two DSP32C (AT&T) and one DSP56002. With
the advent of the 21060 (Analog Devices), even a single-chip stereo
encoder comes into view.
Q: Quality, complexity - what about the codec delay?
A: Well, the standard gives some figures of the theoretical minimum delay:
Layer-1: 19 ms (<50 ms)
Layer-2: 35 ms (100 ms)
Layer-3: 59 ms (150 ms)
The practical values are significantly above that. As they depend on the
implementation, exact figures are hard to give. So the figures in brackets
are just rough rule-of-thumb values - real codecs may show significantly
higher values.
Q: For some applications, a very short delay is of critical importance: e.g. in a
feedback link, a reporter can only talk intelligibly if the overall delay is
below around 10 ms. Here, do I have to forget about MPEG audio at all?
A: Not necessarily. In this application, broadcasters may use "N-1" switches
in the studio to overcome this problem - or they may use equipment with
appropriate echo-cancellers.
But with many applications, these delay figures are small enough to
present no extra problem. At least, if one can accept a Layer-2 delay, one
can most likely also accept the higher Layer-3 delay.
Q: Someone told me that, with Layer-3, the codec delay would depend on the
actual audio signal, varying over the time. Is this really true?
A: No. The codec delay does not depend on the audio signal. With all Layers,
the delay depends on the actual implementation used in a specific codec, so
different codecs may have different delays. Furthermore, the delay depends
on the actual sample rate and bitrate of your codec.
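One way to see why the sample rate matters: a codec cannot emit a frame before it has collected all samples of that frame. The following is only a rough sketch, not taken from the standard's delay figures. It assumes the MPEG-1 frame lengths of 384 samples per channel for Layer-1 and 1152 for Layer-2 and Layer-3, and it looks at the frame duration only, which is just one contribution to the total codec delay. The function name is made up for this illustration.

```python
# Samples per frame and channel in MPEG-1 audio (assumption stated above).
SAMPLES_PER_FRAME = {1: 384, 2: 1152, 3: 1152}

def frame_duration_ms(layer, sample_rate_hz):
    """Duration of one audio frame in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME[layer] / sample_rate_hz

for fs in (32000, 44100, 48000):
    row = ", ".join(
        f"Layer-{layer}: {frame_duration_ms(layer, fs):5.1f} ms"
        for layer in (1, 2, 3)
    )
    print(f"{fs} Hz -> {row}")
```

A lower sample rate stretches every frame in time, so the same implementation shows a longer delay at 32 kHz than at 48 kHz.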
Q: All in all, you sound as if anybody should use Layer-3 for low bitrates.
Why on earth do some vendors still offer only Layer-2 equipment for these
applications?
A: Well, maybe because they started to design and develop their systems
rather early, e.g. in 1990. As Layer-2 is identical with MUSICAM, it has
been available since the summer of 1990 at the latest. In that year, Layer-3
development started and could be successfully finished at the end of 1991.
So, for a certain time, vendors could only exploit the already existing part
of the new MPEG standard.
Now the situation has changed. All Layers are available, the standard is
completed, and new systems may capitalize on the full features of MPEG
audio.
4. Products
Q: What are the main fields of application for Layer-3?
A: Simply put: all applications that need high-quality sound at very low
bitrates to store or transmit music signals. Some examples are:
- high-quality music links via ISDN phone lines (basic rate)
- sound broadcasting via low bitrate satellite channels
- music distribution in computer networks with low demands for channel
bandwidth and memory capacity
- music memories for solid state recorders based on ROM chips
Q: What kind of Layer-3 products are already available?
A: An increasing number of applications benefit from the advanced features
of MPEG audio Layer-3. Here is a list of companies that currently sell
Layer-3 products. For further information, please contact these companies
directly.
Layer-3 Codecs for Telecommunication:
- AETA, 361 Avenue du Gal de Gaulle (*)
F-92140 Clamart, France
Fax: +33-1-4136-1213 (Mr. Fric)
(*) products announced for 1995
- Dialog 4 System Engineering GmbH, Monreposstr. 57
D-71634 Ludwigsburg, Germany
Fax: +49-7141-22667 (Mr. Burkhardtsmaier)
- PKI Philips Kommunikations Industrie, Thurn-und-Taxis-Str. 14
D-90411 Nuernberg, Germany
Fax: +49-911-526-3795 (Mr. Konrad)
- Telos Systems, 2101 Superior Avenue
Cleveland, OH 44114, USA
Fax: +1-216-241-4103 (Mr. Church)
Speech Announcement Systems:
- Meister Electronic GmbH, Koelner Str. 37
D-51149 Koeln, Germany
Fax: +49-2203-1701-30 (Mr. Seifert)
PC Cards (Hardware and/or Software):
- Dialog 4 System Engineering GmbH, Monreposstr. 57
D-71634 Ludwigsburg, Germany
Fax: +49-7141-22667 (Mr. Burkhardtsmaier)
- Proton Data, Marrensdamm 12 b
D-24944 Flensburg, Germany
Fax: +49-461-38169 (Mr. Nissen)
Layer-3-Decoder-Chips:
- ITT Intermetall GmbH, Hans-Bunte-Str. 19
D-79108 Freiburg, Germany
Fax: +49-761-517-2395 (Mrs. Mayer)
Layer-3 Shareware Encoder/Decoder:
- Mailbox System Nuernberg (MSN), Innerer Kleinreuther Weg 21
D-90408 Nuernberg, Germany
Fax: +49-911-9933661 (Mr. Hanft)
Shareware (version 1.50) is available for:
- IBM-PCs or Compatibles with MS-DOS:
L3ENC.EXE and L3DEC.EXE should work on practically
any PC with 386 type CPU or better. For the encoder, a
486DX33 or better is recommended.
On a 486DX2/66 the current shareware decoder performs in
1:3 real-time, and the shareware encoder in 1:14 real-time
(with stereo signals sampled with 44.1 kHz).
- Sun workstations:
On a SPARC station 10, the decoder works in real time, the
encoder performs in 1:5 real-time.
For more information, refer to chapter 6.
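To get a feeling for the real-time figures quoted above, here is a small sketch. It assumes that "1:N real-time" means N seconds of computation per second of audio, which is how the figures read in context; the function name and the 4-minute track length are arbitrary choices for the example.

```python
def processing_minutes(audio_minutes, realtime_factor):
    """Time needed to process audio when the codec runs at 1:realtime_factor."""
    return audio_minutes * realtime_factor

# Figures quoted above for a 486DX2/66 (stereo, 44.1 kHz).
DECODER_FACTOR = 3
ENCODER_FACTOR = 14

track = 4.0  # minutes, an arbitrary example length
print("decode:", processing_minutes(track, DECODER_FACTOR), "minutes")
print("encode:", processing_minutes(track, ENCODER_FACTOR), "minutes")
```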
5. Support by Fraunhofer-IIS
Q: I understand that Fraunhofer-IIS has been the main developer of MPEG
audio Layer-3. What can they do for me?
A: The Fraunhofer-IIS focusses on applied research. Its engineers have
profound expertise in real-time implementations of signal-processing
algorithms, especially of Layer-3. The IIS may support a specific Layer-3
application in various ways:
- detailed information
- technical consulting
- advanced C sources for encoder and decoder
- training-on-the-job
- research and development projects on contract basis.
For more information, feel free to contact:
- Fraunhofer-IIS, Weichselgarten 3
D-91058 Erlangen, Germany
Fax: +49-9131-776-399 (Mr. Popp)
Q: What are the latest audio demonstrations disclosed by Fraunhofer-IIS?
A: At the Tonmeistertagung 11.94 in Karlsruhe, Germany, the IIS
demonstrated:
- real-time Layer-3 decoder software (mono, 32 kHz fs) including sound
output on ProAudioSpectrum running on a 486DX2/66
- playback of Layer-3 stereo files from a CD-ROM that has been produced
by Intermetall and contains Layer-3 data of up to 15 h of stereo music
(among others, all Beethoven symphonies); the decoder is a small board
that is connected to the parallel printer port. It mainly carries 3 chips: a
PLD as data interface, the MAS 3503 C stereo decoder chip, and the
ASCO Digital-Analog-Converter. The board has two cinch adapters that
allow a very simple connection to the usual stereo amplifier.
- music-from-silicon demonstration by using the standard 1 Mbyte
EPROMs to store 1.5 minutes of CD-like quality stereo music
- music link (with around 6 kHz bandwidth) via V.34 modem at 28.8 kbps
and one analog phone line
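The storage figures in these demonstrations imply a bitrate, which can be estimated as follows. This is a back-of-the-envelope sketch only: it assumes that "1 Mbyte" means 2**20 bytes and that the CD-ROM holds about 650 million bytes. Neither capacity is stated in the text, and the helper name is made up.

```python
def implied_bitrate_kbps(capacity_bytes, duration_s):
    """Average bitrate needed to fit duration_s of audio into capacity_bytes."""
    return capacity_bytes * 8 / duration_s / 1000.0

# 1 Mbyte EPROM holding 1.5 minutes of stereo music (capacity is an assumption).
eprom = implied_bitrate_kbps(2**20, 1.5 * 60)

# CD-ROM holding up to 15 hours of stereo music (650e6 bytes is an assumption).
cdrom = implied_bitrate_kbps(650 * 10**6, 15 * 3600)

print(f"EPROM demo : about {eprom:.0f} kbps")
print(f"CD-ROM demo: about {cdrom:.0f} kbps")
```

Both estimates land a little below 100 kbps for stereo, i.e. in the low-bitrate range discussed earlier.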
6. Shareware Information
The Layer 3 Shareware is copyright Fraunhofer - IIS 1994,1995.
The shareware packages are available:
- via anonymous ftp from fhginfo.fhg.de (153.96.1.4)
You may download our Layer-3 audio software package from the directory
/pub/layer3. You will find the following files:
For IBM PCs:
l3v150d1.txt a short description of the files found in l3v150.zip
l3v150d1.zip encoder, decoder and documentation
l3v150d2.txt a short description of the files found in l3v150n.zip
l3v150d2.zip sample bitstreams
For SUN workstations:
l3v150.sun.txt short description of the files found in
l3v100.sun.tar.gz
l3v150.sun.tar.gz encoder, decoder and documentation
l3v150bit.sun.txt short description of the files found in
l3v150bit.sun.tar.gz
l3v150bit.sun.tar.gz sample bitstreams
- via direct modem download (up to 14.400 bps)
Modem telephone number : +49 911 9933662 Name: FHG
Packet switching network: (0) 262 45 9110 10290 Name: FHG
(For the telephone number, replace "+" with your appropriate international
dial prefix, e.g. "011" for the USA.)
Follow the menus as desired.
- via shipment of diskettes (only including registration)
You may order a diskette directly from:
Mailbox System Nuernberg (MSN)
Hanft & Hartmann
Innerer Kleinreuther Weg 21
D-90408 Nuernberg, Germany
Please note: MSN will only ship a diskette after the registration fee has
been paid. The registration fee is 85 Deutsche Mark (about 50
US$) (plus sales tax, if applicable) for one copy of the package. The
preferred method of payment is via credit card. Currently, MSN accepts
VISA, Master Card / Eurocard / Access credit cards. For details see the file
REGISTER.TXT found in the shareware package.
You may reach MSN also via Internet: msn@iis.fhg.de
or via Fax: +49 911 9933661
or via BBS: +49 911 9933662 Name: FHG
or via X25: 0262 45 9110 10290 Name: FHG
(e.g. in USA, please replace "+" with "011")
- via email
You may get our shareware also by a direct request to msn@iis.fhg.de. In
this case, the shareware is split into about 30 small uuencoded parts...
SOFTWARE: MPEG Audio Layer 3 Shareware Codec and Windows Realtime Player
----------------------------------------------------------------
MPEG Audio Codec and Windows REALTIME Player from Fraunhofer IIS
----------------------------------------------------------------
Fraunhofer IIS announces l3enc/l3dec V2.00 and WinPlay3 V1.00.
For high quality audio compression, the shareware l3enc/l3dec V2.00
package is available for Linux, SUN, NeXT and DOS on
<URL:ftp://ftp.fhg.de/pub/layer3>
Versions for SGI and HP will follow soon.
The shareware package for DOS
<URL:ftp://ftp.fhg.de/pub/layer3/l3v200d1.zip>
includes a demo version of WinPlay3, a Windows MPEG Audio Layer 3
realtime-player.
With MPEG Audio Layer 3 you can get a 12:1 compression with a CD like
quality.
Instead of 12 MByte / minute (stereo 44.1 kHz) you only need about
1 Mbyte / minute!
More information can be found on
<URL:ftp://ftp.fhg.de/pub/layer3/MPEG_Audio_L3_FAQ.html>
or contact <URL:mailto:layer3@iis.fhg.de>
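The data volumes in the announcement above can be checked with a few lines. The sketch assumes 16-bit linear PCM for the uncompressed stereo signal (the announcement does not state the word length). Under that assumption the uncompressed rate is about 10.6 MByte per minute, so the 12 MByte figure is a generous round number, and 12:1 compression gives a bit less than 1 MByte per minute.

```python
def pcm_mbyte_per_minute(sample_rate_hz=44100, channels=2, bits=16):
    """Uncompressed PCM data volume in MByte (10**6 bytes) per minute."""
    bytes_per_second = sample_rate_hz * channels * bits // 8
    return bytes_per_second * 60 / 1e6

pcm = pcm_mbyte_per_minute()
compressed = pcm / 12  # the 12:1 ratio quoted in the announcement

print(f"PCM       : {pcm:.2f} MByte/minute")
print(f"12:1 coded: {compressed:.2f} MByte/minute")
```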
Harald Popp
Audio & Multimedia ("Music is the *BEST*" - F. Zappa)
Fraunhofer-IIS-A, Weichselgarten 3, D-91058 Erlangen, Germany
Phone: +49-9131-776-340
Fax: +49-9131-776-399
email: popp@iis.fhg.de
P.S.: Look out for planetoid #3834!
-------------------------------------------------------------------------------
~Subject: What is MPEG-1+ ?
This was a little mail-talk between harti@harti.de (Stefan Hartmann)
and hgordon@system.xingtech.com.
Q: What is MPEG-1+ ?
It's MPEG-1 at MPEG-2 (CCIR) resolution. It may be used
for TV set-top boxes in broadcasting or video-on-demand projects
to enhance the picture quality.
Q: I see. Is this a new standard ?
No. MPEG-1 allows frame sizes of up to roughly 4000x4000 pixels, but
that is usually not used.
Q: So what's different ?
I understand that the effective resolution is approximately 550 x 480.
Typical datarates are 3.5Mbps - 5.5Mbps (sports programming and perhaps
movies are higher).
Q: Is the video quality lower than with real MPEG-2 movies ?
The quality is better than cable TV, and in my area, we don't have cable.
They de-interlace and compress the full frames. My understanding is that
this is about 5%-10% less efficient than taking advantage of MPEG-2
interfield motion vectors.
Q: If the fields are deinterlaced, do you see interlace artifacts, where a
moving object has already moved further in one field than in the other
field ?
The receiver probably outputs the picture to the TV set interlaced again,
so it does not produce the interlace artifacts seen on PCs with live video
windows displaying both fields....
Q: Can you record this anyhow on a VCR ? Does the SAT-Receiver have a
video- output, so you can record movies to tape ?
You should be able to record to tape, though they may have some record
blocking hardware which has to be overcome with video stabilizing
hardware.
Q: What kind of realtime encoders do they use at the broadcast station ?
CLI (Compression Labs) is the manufacturer, using C-Cube chipsets (10
CL-4000's per MPEG-1+ encoder).
Q: Is there any written info about this MPEG-1 Plus technology available on
the net ?
Not that I'm aware of. Maybe C-Cube has a Web site.
[So it's up to you, dear reader, to find more and to tell me where it is ;o) ]
Frank Gadegast, phade@powerweb.de
-------------------------------------------------------------------------------
~Subject: What is MPEG-2?
MPEG-2 FAQ
version 3.7 (May 11, 1995)
by Chad Fogg (cfogg@chromatic.com)
The MPEG (Moving Pictures Experts Group) committee began its life in
late 1988 by the hand of Leonardo Chiariglione and Hiroshi Yasuda with
the immediate goal of standardizing video and audio for compact discs.
Over the next few years, participation amassed from international
technical experts in the areas of Video, Audio, and Systems, reaching
over 200 participants by 1992.
By the end of the third year (1990), a syntax emerged, which when
applied to code SIF video and compact disc audio sample rates at a
combined coded bitrate of 1.5 Mbit/sec, approximated the perceptual
quality of consumer video tape (VHS). After demonstrations proved that
the syntax was generic enough to be applied to bit rates and sample
rates far higher than the original primary target application, a second
phase (MPEG-2) was initiated within the committee to define a syntax
for efficient representation of broadcast video. Efficient
representation of interlaced (broadcast) video signals was more
challenging than the progressive (non-interlaced) signals coded by
MPEG-1. Similarly, MPEG-1 audio was capable of only directly
representing two channels of sound. MPEG-2 would introduce a scheme to
decorrelate multichannel discrete surround sound audio.
Need for a third phase (MPEG-3) was anticipated in 1991 for High
Definition Television, although it was later discovered by late 1992
and 1993 that the MPEG-2 syntax simply scaled with the bit rate,
obviating the third phase. MPEG-4 was launched in late 1992 to explore
the requirements of a more diverse set of applications, while finding a
more efficient means of coding low bit rate/low sample rate video and
audio signals.
Today, MPEG (video and systems) is the exclusive syntax of the United
States Grand Alliance HDTV specification, the European Digital Video
Broadcasting Group, and the high density compact disc (led by rivals
Sony/Philips and Toshiba).
What is MPEG video syntax ?
MPEG video syntax provides an efficient way to represent image
sequences in the form of more compact coded data. The language of the
coded bits is the syntax. For example, a few tokens can represent an
entire block of 64 samples. MPEG also describes a decoding
(reconstruction) process where the coded bits are mapped from the
compact representation into the original, raw format of the image
sequence. For example, a flag in the coded bitstream signals whether
the following bits are to be decoded with a DCT algorithm or with a
prediction algorithm. The algorithms comprising the decoding process
are regulated by the semantics defined by MPEG. This syntax can be
applied to exploit common video characteristics such as spatial
redundancy, temporal redundancy, uniform motion, spatial masking, etc.
MPEG Myths
A brief summary of myths.
1. Compression Ratios over 100:1
Articles in the press and marketing literature will often make the
claim that MPEG can achieve high quality video with compression ratios
over 100:1. These figures often include the oversampling factors in
the source video. In reality, the coded sample rate specified in an
MPEG image sequence is usually not much larger than 30 times the
specified bit rate. Pre-compression through subsampling is chiefly
responsible for 3 digit ratios for all video coding methods, including
those of the non-MPEG variety.
2. MPEG-1 is 352x240
Both MPEG-1 and MPEG-2 video syntax can be applied at a wide range of
bitrates and sample rates. The MPEG-1 that most people are familiar
with has parameters of 30 SIF pictures (352 pixels x 240 lines) per
second and a bitrate less than 1.86 megabits/sec, a combination
known as "Constrained Parameters Bitstreams". This popular
interoperability point is promoted by Compact Disc Video (White Book).
In fact, it is syntactically possible to encode picture dimensions as
high as 4095 x 4095 and bitrates up to 100 Mbit/sec. With the advent
of the MPEG-2 specification, the most popular combinations have
coagulated into Levels, which are described later in this text. The
two most common are affectionately known as SIF (e.g. 352 pixels x 240
lines x 30 frames/sec), or Low Level, and CCIR 601 (e.g. 720
pixels/line x 480 lines x 30 frames/sec), or Main Level.
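The "Constrained Parameters" point can be expressed as a simple check. The limits below are the commonly cited ones (768 x 576 maximum size, 396 macroblocks per picture, 9900 macroblocks per second, 30 pictures per second, 1.856 Mbit/sec); only the bitrate limit appears in the text above, so treat the others as assumptions to verify against the standard. The function names are made up for this sketch.

```python
import math

def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover a picture."""
    return math.ceil(width / 16) * math.ceil(height / 16)

def is_constrained(width, height, fps, bitrate_bps):
    """True if the parameters fit the (assumed) constrained-parameters limits."""
    mbs = macroblocks(width, height)
    return (
        width <= 768 and height <= 576
        and fps <= 30
        and mbs <= 396              # macroblocks per picture
        and mbs * fps <= 9900       # macroblocks per second
        and bitrate_bps <= 1_856_000
    )

print(is_constrained(352, 240, 30, 1_150_000))   # SIF, 30 pictures/sec
print(is_constrained(352, 288, 25, 1_150_000))   # SIF, 25 pictures/sec
print(is_constrained(720, 480, 30, 5_000_000))   # CCIR 601 size
```

The two SIF variants sit exactly on the macroblock-rate limit, while the CCIR 601 size is far outside it even though the MPEG-1 syntax itself can describe it.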
3. Motion Compensation displaces macroblocks from previous pictures
Macroblock predictions are formed out of arbitrary 16x16 pixel (or 16x8
in MPEG-2) areas from previously reconstructed pictures. There are no
boundaries which limit the location of a macroblock prediction within
the previous picture, other than the edges of the picture.
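A minimal sketch of this point: the prediction for a macroblock is simply a 16x16 window cut out of the reference picture at an arbitrary position. The sketch uses full-pixel displacements only and plain nested lists for the picture; the function name, the picture size, and the sample values are invented for the illustration, and real decoders also handle sub-pixel vectors.

```python
MB = 16  # macroblock size in pixels

def predict_macroblock(ref, mb_x, mb_y, mv_x, mv_y):
    """Return the 16x16 prediction for the macroblock at (mb_x, mb_y),
    displaced by the full-pixel motion vector (mv_x, mv_y)."""
    height, width = len(ref), len(ref[0])
    x0, y0 = mb_x + mv_x, mb_y + mv_y
    # The only restriction: the window must stay inside the picture.
    if not (0 <= x0 <= width - MB and 0 <= y0 <= height - MB):
        raise ValueError("prediction area crosses the picture edge")
    return [row[x0:x0 + MB] for row in ref[y0:y0 + MB]]

# Synthetic 48x32 reference picture: sample value = x + 100 * y.
ref = [[x + 100 * y for x in range(48)] for y in range(32)]

# The window starts at (21, 3), which is not on the 16x16 macroblock grid.
block = predict_macroblock(ref, 16, 0, 5, 3)
print(len(block), len(block[0]), block[0][0])
```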
4. Display picture size is the same as the coded picture size
In MPEG, the display picture size and frame rate may differ from the
size (resolution) and frame rate encoded into the bitstream. For
example, a regular pattern of pictures in a source image sequence may
be dropped (decimated), and then each picture may itself be filtered
and subsampled prior to encoding. Upon reconstruction, the picture may
be interpolated and upsampled back to the source size and frame rate.
In fact, the three fundamental phases (Source Rate, Coded Rate, and
Display Rate) may differ by several parameters. The MPEG syntax can
separately describe Coded and Display Rates through sequence_headers,
but the Source Rate is known only by the encoder.
5. Picture coding types (I, P, B) all consist of the same macroblock types.
All macroblocks within an I picture must be coded Intra (like a
baseline JPEG picture). However, macroblocks within a P picture may
either be coded as Intra or Non-intra (temporally predicted from a
previously reconstructed picture). Finally, macroblocks within the B
picture can be independently selected as either Intra, Forward
predicted, Backward predicted, or both forward and backward
(Interpolated) predicted. The macroblock header contains an element,
called macroblock_type, which can flip these modes on and off like
switches. macroblock_type is possibly the single most powerful element
in the whole of video syntax. Picture types (I, P, and B) merely enable
macroblock modes by widening the scope of the semantics. The component
switches are:
1. Intra or Non-intra
2. Forward temporally predicted (motion_forward)
3. Backward temporally predicted (motion_backward)
(2+3 in combination represent "Interpolated")
4. conditional replenishment (macroblock_pattern).
5. adaptation in quantization (macroblock_quantizer).
6. temporally predicted without motion compensation
The first 5 switches are mostly orthogonal (the 6th is derived from the
1st and 2nd in P pictures, and does not exist in B pictures). Some
switches are non-applicable in the presence of others. For example, in
an Intra macroblock, all 6 blocks by definition contain DCT data,
therefore there is no need to signal either the macroblock_pattern or
any of the temporal prediction switches. Likewise, when there is no
coded prediction error information in a Non-intra macroblock, the
macroblock_quantizer signal would have no meaning.
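The switch rules above can be written down as a small validity check. This is a simplified reading of the text, not the variable length code tables of the standard, which remain the authority. The class and function names are made up; the one extra assumption is that a Non-intra macroblock in a P picture with neither motion nor coded pattern is not a macroblock type of its own.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MacroblockType:
    intra: bool = False
    motion_forward: bool = False
    motion_backward: bool = False
    pattern: bool = False   # macroblock_pattern
    quant: bool = False     # macroblock_quantizer

def is_allowed(mb, picture_type):
    """Check a switch combination against the rules described above."""
    if mb.intra:
        # Intra: all blocks carry DCT data, so no prediction or pattern switch.
        return not (mb.motion_forward or mb.motion_backward or mb.pattern)
    if picture_type == "I":
        return False                      # I pictures are Intra only
    if mb.quant and not mb.pattern:
        return False                      # no coded error, nothing to quantize
    if picture_type == "P":
        # Forward prediction only; without motion this is switch 6 and
        # needs a coded pattern (assumption stated in the lead-in).
        return not mb.motion_backward and (mb.motion_forward or mb.pattern)
    # B pictures: forward, backward, or both (Interpolated).
    return mb.motion_forward or mb.motion_backward

def count_allowed(picture_type):
    """Number of switch combinations accepted for one picture type."""
    return sum(
        is_allowed(MacroblockType(*bits), picture_type)
        for bits in product((False, True), repeat=5)
    )

for ptype in "IPB":
    print(ptype, count_allowed(ptype))
```

The counts grow from I to P to B, which is the sense in which picture types "merely enable macroblock modes by widening the scope of the semantics".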
6. Sequence structure is fixed to a specific I,P,B frame pattern.
A sequence may consist of almost any pattern of I, P, and B pictures
(there are a few minor semantic restrictions on their placement). It
is common in industrial practice to have a fixed pattern (e.g.
IBBPBBPBBPBBPBB), however, more advanced encoders will attempt to
optimize the placement of the three picture types according to local
sequence characteristics in the context of more global
characteristics. Each picture type carries a penalty when coupled with
the statistics of a particular picture (temporal masking, occlusion,
motion activity, etc.).
The variable length codes of the macroblock_type switch provide a
direct clue, but it is the full scope of semantics of each picture type
that spells out the costs and benefits. For example, if the image sequence
changes little from frame-to-frame, it is sensible to code more B
pictures than P. Since B pictures by definition are never fed back
into the prediction loop (i.e. not used as prediction for future
pictures), bits spent on the picture are wasted in a sense (B pictures
are like temporal spackle). Application requirements also govern
picture type placement: random access points, mismatch/drift reduction,
channel hopping, program indexing, and error recovery & concealment.
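One practical consequence of picture type placement, sketched here as an illustration rather than taken from the text: since a B picture may be predicted from a later picture, that later I or P picture has to reach the decoder first, so the coded order differs from the display order. The function name and the labelling scheme are made up, and trailing B pictures are simply appended, whereas in a real stream they would follow the next I or P picture.

```python
def coded_order(display_pattern):
    """Reorder a display-order pattern such as 'IBBPBBP' so that every
    B picture comes after the reference picture that follows it."""
    out, pending_b = [], []
    for index, ptype in enumerate(display_pattern):
        label = f"{ptype}{index}"          # e.g. 'P3' = P picture shown 4th
        if ptype == "B":
            pending_b.append(label)        # wait for the next reference
        else:
            out.append(label)              # I or P goes out first ...
            out.extend(pending_b)          # ... then the B pictures before it
            pending_b = []
    out.extend(pending_b)                  # simplification, see lead-in
    return out

print(" ".join(coded_order("IBBPBBP")))
```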
The 6 Steps to Claiming Bogously High Compression Ratios:
MPEG video is often quoted as achieving compression ratios over 100:1,
when in reality the sweet spot rests between 8:1 and 30:1.
Here's how the fabled greater than 100:1 reduction ratio is derived for
the popular Compact Disc Video (White Book) bitrate of 1.15 Mbit/sec.
Step 1. Start with the oversampled rate
Most MPEG video sources originate at a higher sample rate than the
"target" sample rate encoded into the final MPEG bitstream. The most
popular studio signal, known canonically as D-1 or CCIR 601 digital
video, is coded at 270 Mbit/sec.
The constant, 270 Mbit/sec, can be derived as follows:
Luminance (Y): 858 samples/line x 525 lines/frame x 30 frames/sec x
10 bits/sample ~= 135 Mbit/sec
R-Y (Cr): 429 samples/line x 525 lines/frame x 30 frames/sec x
10 bits/sample ~= 68 Mbit/sec
B-Y (Cb): 429 samples/line x 525 lines/frame x 30 frames/sec x
10 bits/sample ~= 68 Mbit/sec
Total: 27 million samples/sec x 10 bits/sample = 270 Mbit/sec.
So, our compression ratio is: 270/1.15... an amazing 235:1 !!
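The Step 1 arithmetic can be reproduced in a few lines. The sketch uses the nominal 30 frames/sec from the text, which gives slightly more than 270 Mbit/sec; with the exact NTSC frame rate of 30000/1001 the luminance rate is exactly 13.5 million samples/sec and the total exactly 270 Mbit/sec. Variable names are arbitrary.

```python
LINES_PER_FRAME = 525
BITS_PER_SAMPLE = 10
SAMPLES_PER_LINE = {"Y": 858, "Cr": 429, "Cb": 429}  # full raster incl. blanking
TARGET_BITRATE = 1.15e6                              # bit/sec, White Book video

def raster_rate(frames_per_sec):
    """Total bit rate of the full raster in bit/sec."""
    samples_per_frame = sum(SAMPLES_PER_LINE.values()) * LINES_PER_FRAME
    return samples_per_frame * frames_per_sec * BITS_PER_SAMPLE

nominal = raster_rate(30)
exact = raster_rate(30000 / 1001)

print(f"nominal 30 fps : {nominal / 1e6:.2f} Mbit/sec")
print(f"30000/1001 fps : {exact / 1e6:.2f} Mbit/sec")
print(f"ratio          : {nominal / TARGET_BITRATE:.0f}:1")
```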
Step 2. Include blanking intervals
Only 720 out of the 858 luminance samples per line contain active
picture information. In fact, the debate over the true number of
active samples is the cause of many hair-pulling cat-fights at TV
engineering seminars and conventions, so it is safer to say that the
number lies somewhere between 704 and 720. Likewise, only 480 lines
out of the 525 lines contain active picture information. Again, the
actual number is somewhere between 480 and 496. For the purposes of
MPEG-1's and MPEG-2's famous conformance points (Constrained Parameters
Bitstreams and Main Level, respectively), the number shall be 704
samples x 480 lines for luminance, and 352 samples x 480 lines for each
of the two chrominance pictures. Recomputing the source rate, we arrive
at:
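(luminance)
720 samples/line x 480 lines x 30 fps x 10 bits/sample ~= 104 Mbit/sec
(chrominance)
2 components x 360 samples/line x 480 lines x 30 fps x 10 bits/sample
~= 104 Mbit/sec
Total: ~ 207 Mbit/sec
(These figures use 720 luminance and 360 chrominance samples per line. With
704 and 352 samples, each term is about 101 Mbit/sec and the total about
203 Mbit/sec.)
The ratio (207/1.15) is now only 180:1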
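The 207 Mbit/sec total corresponds to 720 luminance samples per line. The short sketch below compares both ends of the 704 to 720 range mentioned above, assuming the 4:2:2 structure of the studio signal (two chrominance components at half the luminance width). The function name is made up.

```python
def active_rate_mbit(luma_samples, lines=480, fps=30, bits=10):
    """Bit rate of the active picture area in Mbit/sec (4:2:2 assumed)."""
    chroma_samples = luma_samples // 2            # per chrominance component
    samples_per_line = luma_samples + 2 * chroma_samples
    return samples_per_line * lines * fps * bits / 1e6

for width in (704, 720):
    rate = active_rate_mbit(width)
    print(f"{width} samples/line: {rate:.1f} Mbit/sec, ratio {rate / 1.15:.0f}:1")
```

The choice between 704 and 720 changes the ratio by only a few points, so the argument of this step does not depend on it.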
Step 3. Include higher bits/sample
The MPEG sample precision is 8 bits. Studio equipment often quantizes
samples with 10 bits of accuracy. The 2-bit improvement to the dynamic
range is considered useful for suppressing noise in multi-generation
video.
The ratio is now only 180 * (8/10 ), or 144:1
Step 4. Include higher chroma ratio
The famous CCIR-601 studio signal represents the chroma signals (Cb, Cr)
with half the horizontal sample density as the luminance signal, but
with full vertical resolution. This particular ratio of subsampled
components is known as 4:2:2. However, MPEG-1 and MPEG-2 Main Profile
specify the exclusive use of the 4:2:0 format, deemed sufficient for
consumer applications, where both chrominance signals have exactly half
the horizontal and vertical resolution as luminance (the MPEG Studio
Profile, however, centers around the 4:2:2 macroblock structure). Seen
from the perspective of pixels being comprised of samples from multiple
components, the 4:2:2 signal can be expressed as having an average of 2
samples per pixel (1 for Y, 0.5 for Cb, and 0.5 for Cr). Thanks to the
reduction in the vertical direction (resulting in a 352 x 240
chrominance frame), the 4:2:0 signal would, in effect, have an average
of 1.5 samples per pixel (1 for Y, and 0.25 for Cb and Cr each). Our
source video bit rate may now be recomputed as:
720 pixels x 480 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel
= 124 Mbit/sec
... and the ratio is now 108:1.
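Steps 3 and 4 both fit one formula: pixels per second times bits per sample times average samples per pixel. The following sketch recomputes the figures with that formula; the table of samples per pixel follows the explanation above (4:4:4 is added only for comparison), and the names are arbitrary.

```python
SAMPLES_PER_PIXEL = {"4:4:4": 3.0, "4:2:2": 2.0, "4:2:0": 1.5}
TARGET_MBIT = 1.15  # Mbit/sec

def source_rate_mbit(width, height, fps, bits, chroma):
    """Uncompressed bit rate in Mbit/sec for a given picture format."""
    return width * height * fps * bits * SAMPLES_PER_PIXEL[chroma] / 1e6

step2 = source_rate_mbit(720, 480, 30, 10, "4:2:2")  # active picture, 10 bit
step3 = source_rate_mbit(720, 480, 30, 8, "4:2:2")   # reduced to 8 bit
step4 = source_rate_mbit(720, 480, 30, 8, "4:2:0")   # chroma halved vertically

for name, rate in (("Step 2", step2), ("Step 3", step3), ("Step 4", step4)):
    print(f"{name}: {rate:6.1f} Mbit/sec, ratio {rate / TARGET_MBIT:.0f}:1")
```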
Step 5. Include pre-subsampled image size
As a final act of pre-compression, the CCIR 601 frame is converted to
the SIF frame by a subsampling of 2:1 in both the horizontal and
vertical directions.... or 4:1 overall. Quality horizontal subsampling
can be achieved by the application of a simple FIR filter (7 or 4 taps,
for example), and vertical subsampling by either dropping every other
field (in effect, dropping every other line) or again by an FIR filter
(regulated by an interfield motion detection algorithm). Our ratio now
becomes:
352 pixels x 240 lines x 30 fps x 8 bits/sample x 1.5 samples/pixel
~= 30 Mbit/sec !!
.. and the ratio is now only 26:1
Thus, the true A/B comparison should be between the source sequence at
the 30 Mbit/sec stage, the actual specified sample rate in the MPEG
bitstream, and the reconstructed sequence produced from the 1.15
Mbit/sec coded bitstream.
Step 6. Don't forget the 3:2 pulldown
A majority of high-end programs originates from film. Most of the
movies encoded onto Compact Disc Video were captured and reproduced
at 24 frames/sec. So, in such an image sequence, 6 out of the 30
frames every second are in fact redundant and need not be coded into
the MPEG bitstream, leading to the shocking discovery that the actual
source bit rate has really been 24 Mbit/sec all along, and the
compression ratio a mere 21:1 !!! Even at the seemingly modest 20:1
ratio, discrepancies will appear between the 24 Mbit/sec source
sequence and the reconstructed sequence. Only conservative ratios in
the neighborhood of 8:1 have demonstrated true transparency for
sequences with complex spatial-temporal characteristics (i.e. rapid,
divergent motion and sharp edges, textures, etc.). However, if the
video is carefully encoded by means of pre-processing and intelligent
distribution of bits, higher ratios can be made to appear at least
artifact-free.
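To see how the ratio shrinks from stage to stage, the numbers can be
recomputed in one place. A small sketch, assuming 8 bits per sample,
4:2:0 sampling (1.5 samples per pixel) and the 1.15 Mbit/sec coded rate
used in this walkthrough; the helper name is made up for illustration.

```python
CODED_RATE = 1.15e6  # bit/sec, coded video rate used in this walkthrough

def raw_rate(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    """Uncompressed bit rate in bit/sec."""
    return width * height * fps * bits_per_sample * samples_per_pixel

stages = [
    ("CCIR 601, 4:2:0, 30 fps", raw_rate(720, 480, 30)),
    ("SIF, 4:2:0, 30 fps", raw_rate(352, 240, 30)),
    ("SIF, 4:2:0, 24 fps (film)", raw_rate(352, 240, 24)),
]

for name, rate in stages:
    print(f"{name:28s} {rate / 1e6:6.1f} Mbit/sec  {rate / CODED_RATE:5.1f}:1")
```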
What are the parts of the MPEG document?
The MPEG-1 specification (official title: ISO/IEC 11172 Information
technology -- Coding of moving pictures and associated audio for digital
storage media at up to about 1.5 Mbit/s, Copyright 1993) consists of
five parts. Each document is a part of the ISO/IEC number 11172. The
first three parts reached International Standard in 1993. Part 4
reached IS in 1994. In mid 1995, Part 5 will go IS.
Part 1---Systems: The first part of the MPEG standard has two primary
purposes: 1) a syntax for transporting packets of audio and video
bitstreams over digital channels and storage media (DSM), and 2) a
syntax for synchronizing video and audio streams.
Part 2---Video: describes syntax (header and bitstream elements) and
semantics (algorithms telling what to do with the bits). Video breaks
the image sequence into a series of nested layers, each containing a
finer granularity of sample clusters (sequence, picture, slice,
macroblock, block, sample/coefficient). At each layer, algorithms are
made available which can be used in combination to achieve efficient
compression. The syntax also provides a number of different means for
assisting decoders in synchronization, random access, buffer
regulation, and error recovery. The highest layer, sequence, defines
the frame rate and picture pixel dimensions for the encoded image
sequence.
Part 3---Audio: describes syntax and semantics for three classes of
compression methods. Known as Layers I, II, and III, the classes trade
increased syntax and coding complexity for improved coding efficiency
at lower bitrates. Layer II is the industrial favorite, applied
almost exclusively in satellite broadcasting (Hughes DSS) and compact
disc video (White Book). Layer I has similarities in terms of
complexity, efficiency, and syntax to the Sony MiniDisc and the Philips
Digital Compact Cassette (DCC). Layer III has found a home in ISDN,
satellite, and Internet audio applications. The sweet spots for the
three layers are 384 kbit/sec (DCC), 224 kbit/sec (CD Video, DSS), and
128 kbit/sec (ISDN/Internet), respectively.
Part 4---Conformance: (circa 1992) defines the meaning of MPEG
conformance for all three parts (Systems, Video, and Audio), and
provides two sets of test guidelines for determining compliance in
bitstreams and decoders. MPEG does not directly address encoder
compliance.
Part 5---Software Simulation: Contains an example ANSI C language
software encoder and compliant decoder for video and audio. An
example systems codec is also provided which can multiplex and
demultiplex separate video and audio elementary streams contained in
computer data files.
As of March 1995, the MPEG-2 volume consists of a total of 9 parts
under ISO/IEC 13818. Part 2 was jointly developed with the ITU-T,
where it is known as recommendation H.262. The full title is:
Information Technology--Generic Coding of Moving Pictures and
Associated Audio. ISO/IEC 13818. The first five parts are organized in
the same fashion as MPEG-1 (Systems, Video, Audio, Conformance, and
Software). The four additional parts are listed below:
Part 6 Digital Storage Medium Command and Control (DSM-CC): provides a
syntax for controlling VCR-style playback and random-access of
bitstreams encoded onto digital storage media such as compact disc.
Playback commands include Still frame, Fast Forward, Advance, Goto.
Part 7 Non-Backwards Compatible Audio (NBC): addresses the need for a
new syntax to efficiently decorrelate discrete multichannel surround
sound audio. By contrast, MPEG-2 audio (13818-3) attempts to code the
surround channels as ancillary data to the MPEG-1
backwards-compatible Left and Right channels. This allows existing
MPEG-1 decoders to parse and decode only the two primary channels while
ignoring the side channels (parse to /dev/null). This is analogous to
the Base Layer concept in MPEG-2 Scalable video. NBC candidates include
non-compatible syntaxes such as Dolby AC-3. The final document is not
expected until 1996.
Part 8 10-bit video extension. Introduced in late 1994, this
extension to the video part (13818-2) describes the syntax and
semantics for the coded representation of video with 10 bits of sample
precision. The primary application is studio video (distribution,
editing, archiving). Methods have been investigated by Kodak and
Tektronix which employ Spatial scalability, where the 8-bit signal
becomes the Base Layer, and the 2-bit differential signal is coded as
an Enhancement Layer. The final document is not expected until 1997 or
1998. [Part 8 will be withdrawn]
[Figure: mpeg2lay.gif]
[Figure: mpeg2la2.gif]
Part 9 Real-time Interface (RTI): defines a syntax for video on demand
control signals between set-top boxes and head-end servers.
What is the evolution of an MPEG/ISO document?
In chronological order:
Abbr.  ISO/Committee notation            Author's notation
-----  --------------------------------  --------------------------
-      Problem (unofficial first stage)  barroom witticism or dare
NI     New work Item                     Napkin Item
NP     New Proposal                      Need Permission
WD     Working Draft                     We're Drunk
CD     Committee Draft                   Calendar Deadlock
DIS    Draft International Standard      Doesn't Include Substance
IS     International Standard            Induced patent Statements
Introductory paper to MPEG?
Didier Le Gall, "MPEG: A Video Compression Standard for Multimedia
Applications," Communications of the ACM, April 1991, Vol.34, No.4, pp.
47-58
MPEG in periodicals?
The following journals and conferences have been known to contain
information relating to MPEG:
IEEE Transactions on Consumer Electronics
IEEE Transactions on Broadcasting
IEEE Transactions on Circuits and Systems for Video Technology
Advanced Electronic Imaging
Electronic Engineering Times (EE Times)
IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP)
International Broadcasting Convention (IBC)
Society of Motion Pictures and Television Engineers Journal (SMPTE)
SPIE conference on Visual Communications and Image Processing
MPEG Book?
Several MPEG books are under development.
An MPEG book will be produced by the same team behind the JPEG book:
Joan Mitchell and Bill Pennebaker, along with Didier Le Gall. It is
expected to be a tutorial on MPEG-1 video and some MPEG-2 video. Van
Nostrand Reinhold in 1995.
A book, in the Japanese language, has already been published (ISBN:
4-7561-0247-6). The title is MPEG, from ASCII publishing.
Keith Jack's second edition of Video Demystified, to be published in
August 1995, will feature a large chapter on MPEG video. Information:
ftp://ftp.pub.netcom/pub/kj/kjack/
MPEG is a DCT based scheme?
The DCT and Huffman algorithms receive the most press coverage (e.g.
"MPEG is a DCT based scheme with Huffman coding"), but are in fact less
significant when compared to the variety of coding modes signaled to
the decoder as context-dependent side information. The MPEG-1 and
MPEG-2 IDCT has the same definition as in H.261, H.263, and JPEG.
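For reference, the 8x8 inverse DCT shared by these standards is usually
written as

  f(x,y) = 1/4 * sum over u,v of C(u) C(v) F(u,v)
           * cos((2x+1) u pi / 16) * cos((2y+1) v pi / 16)

with C(0) = 1/sqrt(2) and C(k) = 1 otherwise. The sketch below
evaluates that definition directly. It is deliberately slow and says
nothing about the accuracy requirements a conforming decoder must
meet; the function names are illustrative.

```python
import math

N = 8

def c(k):
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def idct_8x8(coeffs):
    """coeffs[v][u] -> samples[y][x], straight from the definition."""
    samples = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            total = 0.0
            for v in range(N):
                for u in range(N):
                    total += (c(u) * c(v) * coeffs[v][u]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            samples[y][x] = total / 4.0
    return samples

# A block with only a DC coefficient reconstructs to a flat block.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
flat_block = idct_8x8(dc_only)
```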
What are constant and variable bitrate streams?
Constant bitrate streams are buffer regulated to allow continuous
transfer of coded data across a constant rate channel without causing
an overflow or underflow in a buffer on the receiving end. It is the
responsibility of the encoder's Rate Control stage to generate
bitstreams which prevent buffer overflow and underflow. Constant
bit rate encoding can be modeled as a reservoir: variable sized coded
pictures flow into the bit reservoir, but the reservoir is drained at a
constant rate into the communications channel. The most challenging
aspect of a constant rate encoder is, yes, to maintain constant channel
rate (without overflowing or underflowing a buffer of a fixed depth)
while maintaining constant perceptual picture quality.
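The reservoir picture can be turned into a toy simulation. This is a
simplified encoder-side model written for illustration only, not the
Video Buffer Verifier defined by the standard; the picture sizes,
channel rate and buffer depths below are invented numbers.

```python
def simulate_reservoir(picture_bits, channel_rate, fps, buffer_size):
    """Track reservoir fullness; report pictures that overflow or drain it."""
    drain_per_picture = channel_rate / fps
    fullness = 0.0
    events = []
    for i, bits in enumerate(picture_bits):
        fullness += bits                      # coded picture flows in
        if fullness > buffer_size:
            events.append((i, "overflow"))
            fullness = buffer_size
        fullness -= drain_per_picture         # channel drains at constant rate
        if fullness < 0:
            events.append((i, "underflow"))   # nothing left to send
            fullness = 0.0
    return fullness, events

# One large picture followed by smaller ones; the average matches the channel.
pictures = [100000, 20000, 20000, 40000, 20000, 40000]
deep = simulate_reservoir(pictures, 1_200_000, 30, buffer_size=327680)
shallow = simulate_reservoir(pictures, 1_200_000, 30, buffer_size=80000)
print(deep)
print(shallow)
```

With the deep reservoir the variable picture sizes are absorbed. With
the shallow one the first large picture overflows, and the bits lost
there later show up as an underflow.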
In the simplest form, variable rate bitstreams do not obey any buffer
rules, but will maintain constant picture quality. Constant picture
quality is easiest to achieve by holding the macroblock quantizer step
size constant (e.g. level 16 of 31). In its most advanced form, a
variable bitrate stream may be more difficult to generate than
constant bitrate streams. In advanced variable bitrate streams, the
instantaneous bit rate (piece-wise bit rate) may be controlled by
factors such as: 1. local activity measured against activity over
large time intervals (e.g. the full span of a movie), or 2.
instantaneous bandwidth availability of a communications channel.
Summary of bitstream types
Bitrate type            Applications
----------------------  ------------------------------------------------
constant-rate           fixed-rate communications channels like the
                        original Compact Disc, digital video tape,
                        single channel-per-carrier broadcast signal,
                        hard disk storage
simple variable-rate    software decoders where the bitstream buffer
                        (VBV) is the storage medium itself (very
                        large). Macroblock quantization scale is
                        typically held constant over a large number of
                        macroblocks.
complex variable-rate   statistical multiplexing (multiple channels per
                        carrier broadcast signals), compact discs and
                        hard disks where the servo mechanisms can be
                        controlled to increase or decrease the channel
                        delivery rate, networked video where overall
                        channel rate is constant but demand is variably
                        shared by multiple users, bitstreams which
                        achieve average rates over very long time
                        averages
What is statistical multiplexing ?
Progressive explanation:
In the simplest coded bitstream, a PCM (Pulse Code Modulated) digital
signal, all samples have an equal number of bits. Bit distribution in a
PCM image sequence is therefore not only uniform within a picture
(bits distributed along zero dimensions), but is also uniform across
the full sequence of pictures.
Audio coding algorithms such as MPEG-1's Layer I and II are capable of
distributing bits over a one dimensional space, spanned by a frame. In
Layer II, for example, an audio channel coded at a bitrate of 128
kbit/sec and sample rate of 44.1 kHz will have frames (which consist of
1152 subband coefficients each) coded with approximately 3344 bits
(128000 x 1152 / 44100). Some subbands will receive more bits than
others.
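The frame size follows directly from the bitrate, the sample rate and
the number of samples per frame. A one-function check, assuming a
channel coded at 128 kbit/sec with 1152 samples per frame (the function
name is illustrative):

```python
def bits_per_audio_frame(bitrate, sample_rate, samples_per_frame=1152):
    """Average number of coded bits available to one audio frame."""
    return bitrate * samples_per_frame / sample_rate

frame_bits = bits_per_audio_frame(128_000, 44_100)
print(round(frame_bits), "bits, about", round(frame_bits / 8), "bytes")
```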
In block-based still image compression methods which employ 2-D
transform coding methods, bits are distributed over a 2 dimensional
space (horizontal and vertical) within the block. Further, blocks
throughout the picture may contain a varying number of bits as a
result, for example, of adaptive quantization. For example, background
sky may contain an average of only 50 bits per block, whereas complex
areas containing flowers or text may contain more than 200 bits per
block. In the typical adaptive quantization scheme, more bits are
allocated to perceptually more complex areas in the picture. The
quantization stepsizes can be selected against an overall picture
normalization constant, to achieve a target bit rate for the whole
picture. An encoder which generates coded image sequences comprised of
independently coded still pictures, such as JPEG Motion video or MPEG
Intra picture sequences, will typically generate coded pictures of
equal bit size.
MPEG non-intra coding introduces the concept of the distribution of
bits across multiple pictures, augmenting the distribution space to 3
dimensions. Bits are now allocated to more complex pictures in the
image sequence, normalized by the target bit size of the group of
pictures, while at a lower layer, bits within a picture are still
distributed according to more complex areas within the picture. Yet in
most applications, especially those of the Constant Bitrate class, a
restriction is placed in the encoder which guarantees that after a
period of time, e.g. 0.25 seconds, the coded bitstream achieves a
constant rate (in MPEG, the Video Buffer Verifier regulates the
variable-to-constant rate mapping). The mapping of an inherently
variable bitrate coded signal to a constant rate allows consistent
delivery of the program over a fixed-rate communications channel.
Statistical multiplexing takes the bit distribution model to 4
dimensions: horizontal, vertical, temporal, and program axis. The 4th
dimension is enabled by the practice of multiplexing multiple programs
(each, for example, with respective video and audio bitstreams) on a
common data carrier. In Hughes' DSS system, a single data carrier
is modulated with a payload capacity of 23 Mbit/sec, but a typical
program will be transported at an average bit rate of 6 Mbit/sec. In
the 4-D model, bits may be distributed according to the relative
complexity of each program against the complexities of the other
programs of the common data carrier. For example, a program undergoing
a rapid scene change will be assigned the highest bit allocation
priority, whereas the program with a near-motionless scene will receive
the lowest priority, or fewest bits.
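One very simple way to picture the program axis is a proportional
split of the carrier payload. The sketch below is an assumption made
for illustration, not a description of how DSS or any real multiplexer
allocates bits; the complexity scores and the minimum rate per program
are invented.

```python
def allocate(total_rate, complexities, min_rate=0.0):
    """Split total_rate among programs in proportion to their complexity."""
    spare = total_rate - min_rate * len(complexities)
    total_complexity = sum(complexities.values())
    return {name: min_rate + spare * score / total_complexity
            for name, score in complexities.items()}

# Hypothetical complexity scores for four programs sharing one carrier.
scores = {"scene change": 5, "news": 2, "movie": 2, "near-still": 1}
rates = allocate(23e6, scores, min_rate=2e6)
for name, rate in rates.items():
    print(f"{name:12s} {rate / 1e6:4.1f} Mbit/sec")
```

The program in the middle of a scene change receives the largest share
and the near-motionless one the smallest, while the shares always add
up to the carrier payload.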
How does MPEG achieve compression?
Here are some typical statistical conditions addressed by specific
syntax and semantic tools:
1. Spatial correlation: transform coding with 8x8 DCT.
2. Human Visual Response---less acuity for higher spatial frequencies:
lossy scalar quantization of the DCT coefficients.
3. Correlation across wide areas of the picture: prediction of the DC
coefficient in the 8x8 DCT block.
4. Statistically more likely coded bitstream elements/tokens: variable
length coding of macroblock_address_increment, macroblock_type,
coded_block_pattern, motion vector prediction error magnitude, DC
coefficient prediction error magnitude.
5. Quantized blocks with sparse quantized matrix of DCT coefficients:
end_of_block token (variable length symbol).
6. Spatial masking: macroblock quantization scale factor.
7. Local coding adapted to overall picture perception (content
dependent coding): macroblock quantization scale factor.
8. Adaptation to local picture characteristics: block based coding,
macroblock_type, adaptive quantization.
9. Constant stepsizes in adaptive quantization: new quantization scale
factor signaled only by special macroblock_type codes. (adaptive
quantization scale not transmitted by default).
10. Temporal redundancy: forward, backwards macroblock_type and motion
vectors at macroblock (16x16) granularity.
11. Perceptual coding of macroblock temporal prediction error: adaptive
quantization and quantization of DCT transform coefficients (same
mechanism as Intra blocks).
12. Low quantized macroblock prediction error: No prediction error for
the macroblock may be signaled within macroblock_type. This is the
macroblock_pattern switch.
13. Finer granularity coding of macroblock prediction error: Each of
the blocks within a macroblock may be coded or not coded. Selective
on/off coding of each block is achieved with the separate
coded_block_pattern variable-length symbol, which is present in the
macroblock only if the macroblock_pattern switch has been set.
14. Uniform motion vector fields (smooth optical flow fields):
prediction of motion vectors.
15. Occlusion: forwards or backwards temporal prediction in B
pictures. Example: an object becomes temporarily obscured by another
object within an image sequence. As a result, there may be an area of
samples in a previous picture (forward reference/prediction picture)
which has similar energy to a macroblock in the current picture (thus
it is a good prediction), but no areas within a future picture
(backward reference) are similar enough. Therefore only forwards
prediction would be selected by macroblock type of the current
macroblock. Likewise, a good prediction may only be found in a future
picture, but not in the past. In most cases, the object, or
correlation area, will be present in both forward and backward
references. macroblock_type can select the best of the three
combinations.
16. Sub-sample temporal prediction accuracy: bi-linearly interpolated
(filtered) "half-pel" block predictions. Real world motion
displacements of objects (correlation areas) from picture-to-picture do
not fall on integer pel boundaries, but on irrational positions. Half-pel
interpolation attempts to extract the true object to within one order
of approximation, often improving compression efficiency by at least 1
dB.
17. Limited motion activity in P pictures: skipped macroblocks. A
macroblock is skipped when the motion vector is zero for both the
horizontal and vertical vector components, and no quantized prediction
error for the current macroblock is present. Skipped macroblocks are
the most desirable element in the bitstream since they consume no
bits, except for a slight increase in the bits of the next non-skipped
macroblock.
18. Co-planar motion within B pictures: skipped macroblocks. A
macroblock is skipped when the motion vector is the same as the
previous macroblock's, and no quantized prediction error for the
current macroblock is present.
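As an illustration of item 16, a half-pel prediction can be formed by
averaging the two or four nearest integer-pel samples with upward
rounding. The sketch assumes motion vectors given in half-pel units,
non-negative positions, and a reference picture stored as a list of
rows that is large enough to cover the block; the function name is
made up.

```python
def predict_block(ref, x2, y2, size=16):
    """Fetch a size x size prediction at half-pel position (x2/2, y2/2)."""
    x, y = x2 >> 1, y2 >> 1          # integer-pel part of the position
    hx, hy = x2 & 1, y2 & 1          # 1 when a half-pel offset is present
    block = []
    for j in range(size):
        row = []
        for i in range(size):
            a = ref[y + j][x + i]
            b = ref[y + j][x + i + hx]
            c = ref[y + j + hy][x + i]
            d = ref[y + j + hy][x + i + hx]
            # Reduces to a, (a+b+1)//2, (a+c+1)//2 or (a+b+c+d+2)//4.
            row.append((a + b + c + d + 2) // 4)
        block.append(row)
    return block

ref = [[0, 10, 20],
       [40, 50, 60],
       [80, 90, 100]]
print(predict_block(ref, 1, 0, size=2))   # half-pel horizontally
print(predict_block(ref, 1, 1, size=2))   # half-pel in both directions
```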
What is the difference between MPEG-1 and MPEG-2 syntax?
Section D.9 of ISO/IEC 13818-2 is an informative piece of text
describing the differences between MPEG-1 and MPEG-2 video syntax. The
following is a little more informal.
Sequence layer:
MPEG-2 can represent interlaced or progressive video sequences,
whereas MPEG-1 is strictly meant for progressive sequences since the
target application was Compact Disc video coded at 1.2 Mbit/sec.
MPEG-2 changed the meaning behind the aspect_ratio_information
variable, while significantly reducing the number of defined aspect
ratios in the table. In MPEG-2, aspect_ratio_information refers to the
overall display aspect ratio (e.g. 4:3, 16:9), whereas in MPEG-1, the
ratio refers to the particular pixel. The reduction in the entries of
the aspect ratio table also helps interoperability by limiting the
number of possible modes to a practical set, much like frame_rate_code
limits the number of display frame rates that can be represented.
Optional picture header variables called display_horizontal_size and
display_vertical_size can be used to code unusual display sizes.
frame_rate_code in MPEG-2 refers to the intended display rate, whereas
in MPEG-1 it referred to the coded frame rate. In film source video,
there are often 24 coded frames per second. Prior to bitstream
coding, a good encoder will eliminate the redundant 6 frames or 12
fields from a 30 frame/sec video signal which encapsulates an
inherently 24 frame/sec video source. The MPEG decoder or display
device will then repeat frames or fields to recreate or synthesize the
30 frame/sec display rate. In MPEG-1, the decoder could only infer the
intended frame rate, or derive it based on the Systems layer time
stamps. MPEG-2 provides specific picture header variables called
repeat_first_field and top_field_first which explicitly signal which
frames or fields are to be repeated, and how many times.
To address the concern of software decoders which may operate at rates
lower or different than the common television rates, two new variables
in MPEG-2 called frame_rate_extension_d and frame_rate_extension_n can
be combined with frame_rate_code to specify a much wider variety of
display frame rates. However, in the current set of defined Profiles
and levels, these two variables are not allowed to change the value
specified by frame_rate_code. Future extensions or Profiles of MPEG
may enable them.
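As far as I can tell from 13818-2, the two extension variables scale
the rate selected by frame_rate_code as
frame_rate_value * (frame_rate_extension_n + 1) /
(frame_rate_extension_d + 1). The sketch below follows that reading;
treat the table and the formula as a paraphrase of the standard, not a
quotation, and the function name as illustrative.

```python
from fractions import Fraction

# frame_rate_code -> frame_rate_value (frames per second)
FRAME_RATE_VALUE = {
    1: Fraction(24000, 1001), 2: Fraction(24), 3: Fraction(25),
    4: Fraction(30000, 1001), 5: Fraction(30), 6: Fraction(50),
    7: Fraction(60000, 1001), 8: Fraction(60),
}

def display_frame_rate(frame_rate_code, extension_n=0, extension_d=0):
    """Display rate implied by frame_rate_code and the two extensions."""
    base = FRAME_RATE_VALUE[frame_rate_code]
    return base * (extension_n + 1) / (extension_d + 1)

print(float(display_frame_rate(4)))        # NTSC rate, extensions left at zero
print(display_frame_rate(5, 0, 1))         # 30 Hz halved, e.g. for a slow decoder
```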
In interlaced sequences, the coded macroblock height (mb_height) of a
picture must be a multiple of 32 pixels, while the width, like MPEG-1,
is a coded multiple of 16 pixels. A discrepancy between the coded
width and height of a picture and the variables horizontal_size and
vertical_size, respectively, occurs when either variable is not an
integer multiple of macroblocks. All pixels must be coded within
macroblocks, since there cannot be such a thing as fractional
macroblocks. Never intended for display, these overhang pixels or
lines exist along the right and bottom edges of the coded picture. The
sample values within these trims can be arbitrary, but they can affect
the values of samples within the current picture, and especially future
coded pictures. In the current picture, pixels which reside within
the same 8x8 block as the overhang pixels are affected by the ripples of
DCT quantization error. In future coded pictures, their energy can
propagate anywhere within an image sequence as a result of motion
compensated prediction. An encoder should fill in values which are
easy to code, and should probably avoid creating motion vectors which
would cause the Motion Compensated Prediction stage to extract samples
from these areas. The application should probably select
horizontal_size and vertical_size that are already multiples of 16 (or
32 in the vertical case of interlaced sequences) to begin with.
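The rounding rule can be written out as a short sketch: widths round up
to a multiple of 16, and heights round up to a multiple of 16 for
progressive sequences or 32 for interlaced ones. The function name and
the 720x486 example are illustrative.

```python
def coded_size(horizontal_size, vertical_size, progressive_sequence):
    """Return the coded (width, height) in pixels after macroblock rounding."""
    mb_width = (horizontal_size + 15) // 16
    if progressive_sequence:
        mb_height = (vertical_size + 15) // 16
    else:
        # Each field must itself contain a whole number of macroblock rows.
        mb_height = 2 * ((vertical_size + 31) // 32)
    return 16 * mb_width, 16 * mb_height

for progressive in (True, False):
    width, height = coded_size(720, 486, progressive)
    print(progressive, width, height, "overhang lines:", height - 486)
```

For a 486-line picture the overhang is 10 lines when coded as a
progressive sequence and 26 lines when coded as an interlaced one,
which is the kind of waste the advice above is meant to avoid.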
Group of Pictures:
The concept of the Group of Pictures layer does not exist in MPEG-2.
It is an optional header useful only for establishing a SMPTE time code
or for indicating that certain B pictures at the beginning of an edited
sequence comprise a broken_link. This occurs when a forward reference
frame (previous in time to the current picture) which the current B
picture requires for prediction has been removed from the bitstream by
an editing process. In MPEG-1, the Group of Pictures header is mandatory,
and must follow a sequence header.
Picture layer:
In MPEG-2, a frame may be coded progressively or interlaced, signaled
by the progressive_frame variable. In interlaced frames
(progressive_frame==0), frames may then be coded as either a frame
picture (picture_structure==frame) or as two separately coded field
pictures (picture_structure==top_field or
picture_structure==bottom_field). Progressive frames are a logical
choice for video material which originated from film, where all pixels
are integrated or captured at the same time instant. Most electronic
cameras today capture pictures in two separate stages: a top field
consisting of all odd lines of the picture is captured at nearly the
same time instant, followed by a bottom field of all even lines. Frame
pictures provide the option of coding each macroblock locally as either
field or frame. An encoder may choose field pictures to save memory
storage or reduce the end-to-end encoder-decoder delay by one field
period.
There is no longer such a thing as a D picture in MPEG-2 syntax.
However, Main Profile @ Main Level MPEG-2 decoders, for example, are
still required to decode D pictures at Main Level (e.g. 720x480x30
Hz). The usefulness of D pictures, a concept from the year 1990, had
evaporated by the time MPEG-2 solidified in 1993.
repeat_first_field was introduced in MPEG-2 to signal that a field or
frame from the current frame is to be repeated for purposes of frame
rate conversion (as in the 30 Hz display vs. 24 Hz coded example
above). On average in a 24 frame/sec coded sequence, every other coded
frame would signal the repeat_first_field flag. Thus the 24 frame/sec
(or 48 field/sec) coded sequence would become a 30 frame/sec (60
field/sec) display sequence. This process has been known for decades
as 3:2 Pulldown. Most movies seen on NTSC displays since the advent of
television have been displayed this way. Only within the past decade
has it become possible to interpolate motion to create 30 truly unique
frames from the original 24. Since the repeat_first_field flag is
independently determined in every frame structured picture, the actual
pattern can be irregular (it doesn't have to be every other frame
literally). An irregularity would occur during a scene cut, for
example.
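The regular case can be sketched as follows: every other coded frame
sets repeat_first_field, so it is displayed for three fields, not two,
and the field parity that starts the next frame flips. This models
only the regular pattern (no scene cuts), and the function name is
made up for illustration.

```python
def pulldown_fields(num_frames):
    """Expand coded film frames into (frame_index, field_parity) pairs."""
    fields = []
    top_first = True
    for n in range(num_frames):
        repeat_first_field = (n % 2 == 0)
        if top_first:
            first, second = "top", "bottom"
        else:
            first, second = "bottom", "top"
        fields += [(n, first), (n, second)]
        if repeat_first_field:
            fields.append((n, first))   # show the first field again
            top_first = not top_first   # next frame starts on the other parity
    return fields

one_second = pulldown_fields(24)
print(len(one_second), "fields from 24 coded frames")
print(pulldown_fields(4))
```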
Slice:
To aid implementations which break the decoding process into parallel
operations along horizontal strips within the same picture, MPEG-2
introduced a general semantic mandatory requirement that all
macroblock rows must start and end with at least one slice. Since a
slice commences with a start code, it can be identified by
inexpensively parsing through the bitstream along byte boundaries.
Before, an implementation might have had to parse all the variable
length tokens between each slice (thereby completing a significant
stage of decoding process in advance) to know the exact position of
each macroblock within the bitstream. In MPEG-1, it was possible to
code a picture with only a single slice. Naturally, the mandatory
slice per macroblock row restriction also facilitates error recovery.
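Locating slices by their start codes needs nothing more than a byte
search. A minimal sketch, assuming a video elementary stream held in
memory: start codes are the byte-aligned prefix 00 00 01 followed by
one code byte, and code values 01 through AF (hex) denote slices, the
value giving the slice's vertical position. The function name and the
toy stream are invented for illustration.

```python
START_CODE_PREFIX = b"\x00\x00\x01"

def find_slices(data):
    """Return (byte_offset, slice_vertical_position) for each slice start code."""
    slices = []
    pos = data.find(START_CODE_PREFIX)
    while pos != -1 and pos + 3 < len(data):
        code = data[pos + 3]
        if 0x01 <= code <= 0xAF:        # slice start codes
            slices.append((pos, code))
        pos = data.find(START_CODE_PREFIX, pos + 3)
    return slices

# Toy stream: a picture start code, then two slices with dummy payloads.
stream = (b"\x00\x00\x01\x00" + b"\xAA" * 4 +
          b"\x00\x00\x01\x01" + b"\x55" * 5 +
          b"\x00\x00\x01\x02" + b"\x55" * 3)
print(find_slices(stream))
```

No variable length decoding is involved, which is why a parallel
decoder can hand each macroblock row to a separate worker cheaply.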
MPEG-2 also added the concept of the slice_id. This optional 6-bit
element signals which picture a particular slice belongs to. In badly
mangled bitstreams, the location of the picture headers could become
garbled. slice_id allows a decoder to place a slice in the proper
location within a sequence. Other elements in the slice header, such
as slice_vertical_position, and the macroblock_address_increment of the
first macroblock in the slice uniquely identify the exact macroblock
position of the slice within the picture. Thus within a window of 64
pictures, a lost slice can find its way.
Macroblock:
Motion vectors are now always represented along a half-pel grid. The
usefulness of an integer-pel grid (option in MPEG-1) diminished with
practice. An intrinsic half-pel accuracy can encourage use by encoders
for the significant coding gain which half-pel interpolation offers.
In both MPEG-1 and MPEG-2, the dynamic range of motion vectors is
specified on a picture basis. A set of pictures corresponding to a
rapid motion scene may need a motion vector range of up to +/- 64
integer pixels. A slower moving interval of pictures may need only a
+/- 16 range. Due to the syntax by which motion vectors are signaled in
a bitstream, pictures with little motion would suffer unnecessary bit
overhead in describing motion vectors in a coordinate system
established for a much wider range. MPEG-1's f_code picture header
element prescribed a radius shared by horizontal and vertical motion
vector components alike. It later became practice in industry to have a
greater horizontal search range (motion vector radius) than vertical,
since motion tends to be more prominent across the screen than up or
down (vertical). Secondly, a decoder has a limited frame buffer size
in which to store both the current picture under decoding and the set
of pictures (forward, backward) used for prediction (reference) by
subsequent pictures. A decoder can write over the pixels of the oldest
reference picture as soon as it no longer is needed by subsequent
pictures for prediction. A restricted vertical motion vector range
creates a sliding window, which starts at the top of the reference
picture and moves down as the macroblocks in the current picture are
decoded in raster order. The moment a strip of pixels passes outside
this window, they have ended their life in the MPEG decoding loop. As
a result of all this, MPEG-2 split f_code into separate horizontal and
vertical range specifiers (f_code[][0] for horizontal, and f_code[][1]
for vertical), and placed greater restrictions on the maximum vertical
range than on the horizontal range. In Main Level frame pictures, this
range is [-128,+127.5] vertically, and [-1024,+1023.5]
horizontally. In field pictures, the vertical range is restricted to
[-64,+63.5].
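The ranges quoted above are consistent with each f_code step doubling
the range, starting from 16 half-pel steps either side of zero. A
small sketch of that relationship, with an illustrative function name;
the mapping is inferred from the figures in the text and should be
checked against the standard before relying on it.

```python
def mv_range_pels(f_code):
    """Motion vector range in pels for one f_code value (half-pel grid)."""
    f = 1 << (f_code - 1)
    low_half_pel = -16 * f
    high_half_pel = 16 * f - 1
    return low_half_pel / 2, high_half_pel / 2

for f_code in (4, 5, 8):
    print(f_code, mv_range_pels(f_code))
```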
Macroblock stuffing is now illegal in MPEG-2. The original intent
behind stuffing in MPEG-1 was to provide a means for finer rate control
adjustment at the macroblock layer. Since no self-respecting encoder
would waste bits on such an element (it does not contribute to the
refinement of the reconstructed video signal), and since this unlimited
loop of stuffing variable length codes represents a significant headache
for hardware implementations which have a fixed window of time in which
to parse and decode a macroblock in a pipeline, the element was
eliminated in January 1993 from the MPEG-2 syntax. Some feel that
macroblock stuffing was beneficial since it permitted macroblocks to be
coded along byte boundaries. A good compromise could have been a
limited number of stuffs per macroblock. If stuffing is needed for
purposes of rate control, an encoder can pad extra zero bytes before
the start code of the next slice. If stuffing is required in the last
row of macroblocks of the picture, the picture start code of the next
picture can be padded with an arbitrary number of bytes. If the
picture happens to be the last in the sequence, the sequence_end_code
can be stuffed with zero bytes.
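Zero-byte padding of this kind is easy to picture on a byte-aligned
buffer. The sketch below assumes the coded unit (a slice, say) is
already held as whole bytes and that a rate controller has chosen a
target size; both the function name and the example bytes are invented
for illustration.

```python
def pad_before_start_code(coded_unit, target_bytes):
    """Append zero bytes so that coded_unit occupies at least target_bytes."""
    shortfall = max(0, target_bytes - len(coded_unit))
    return coded_unit + b"\x00" * shortfall

slice_a = b"\x00\x00\x01\x01" + b"\x5a" * 6      # 10 bytes of dummy slice data
next_slice = b"\x00\x00\x01\x02" + b"\x5a" * 6

padded = pad_before_start_code(slice_a, 16)
stream = padded + next_slice
print(len(slice_a), "->", len(padded), "bytes")
```

The extra zeros simply lengthen the run of zero bytes in front of the
next start code prefix, so a decoder scanning for 00 00 01 still finds
the following slice.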
The dct_type flag in both Intra and non-Intra coded macroblocks of
frame structured pictures signals that the reconstructed samples output
by the IDCT stage shall be organized in field or frame order. This
flag provides an encoder with a sort of poor man's motion_type by
adapting to the interparity (i.e. interfield) characteristics of the
macroblock without signaling a need for motion vectors via the
macroblock_type variable. dct_type plays an essential role in Intra
frame pictures by organizing lines of a common parity together when
there is significant interfield motion within the macroblock. This
increases the decorrelation efficiency of the DCT stage. For
non-intra macroblocks, dct_type organizes the 16 lines (... luminance,
8 lines chrominance) of the macroblock prediction error. In combination
with motion_type, the meaning....
dct_type
motion_format
interpretation
frame
Intra coded
block data is frame correlated
field
Intra coded
block data is more strongly correlated along lines of
opposite parity