
Abstract:

A method is disclosed for the decoding and encoding of a block-based
video bit-stream such as MPEG2, H.264-AVC, VC1, or VP6 using a system
containing one or more high speed sequential processors, a homogeneous
array of software configurable general purpose parallel processors, and a
high speed memory system to transfer data between processors or processor
sets. This disclosure includes a method for load balancing between the
two sets of processors.

Claims:

1. A system for decompressing a compressed video data stream, the system
comprising: a sequential processor array comprising at least one
sequential processor and arranged to receive a video data input stream,
the sequential processor array configured to recover macroblock
coefficient data and macroblock metadata from the video data input
stream; a parallel processor array PPA comprising a plurality of parallel
multi-processors, the parallel processor array configured to process the
recovered macroblock coefficient data and macroblock metadata to recover
pixels; a
data bus interconnecting the sequential processor array and the parallel
processor array PPA; a first RAM memory coupled to the sequential
processor array to store the macroblock coefficient data and the
macroblock metadata; and a second RAM memory coupled to the parallel
processor array PPA to store pixel data output from the parallel
processor array.

2. The system of claim 1, wherein the second RAM is arranged for storing
PPA program instructions and macroblock coefficient data, macroblock
metadata, reference frame data and output video frame data.

3. The system of claim 1, wherein the multi-processors comprise SIMD
processors.

4. The system of claim 1, wherein: the first RAM and the second RAM are
realized in a single shared RAM; and the PPA has access to the shared RAM
to read the stored macroblock coefficient data and macroblock metadata.

5. The system of claim 1, wherein the system is implemented in a single
semiconductor integrated circuit.

6. The system of claim 1, wherein: the sequential processor array
comprises a first semiconductor integrated circuit; and the parallel
processor array comprises a second semiconductor integrated circuit.

8. A method for decompressing a variable length inter-block dependent
compressed video input data stream wherein the input data stream
comprises macroblocks, the method comprising: in a first main process
using a first processing array, wherein the first processing array
comprises a computer program execution device: decompressing a first
video frame of the input data stream; producing a first independent
variable length coefficient data set responsive to the first video frame
data; producing a first fixed-size metadata data set responsive to the
first video frame data; and storing the first independent variable length
coefficient data set and the first fixed-size metadata data set in
memory; and in a second main process using a second processing array that
is different than the first processing array, executing substantially in
parallel with the first main process: decompressing the macroblocks of
the input data stream by decoding the stored first independent variable
length coefficient data set using the stored first fixed-size metadata
data set and using at least one previously-stored reference frame, so as
to generate a desired output video frame.

9. The method of claim 8, further comprising: using a central processing
unit (CPU) component of a computing device, recovering macroblock
coefficient data and macroblock metadata from the video data input
stream; providing the recovered macroblock coefficient data and
macroblock metadata from the CPU component to a Graphics Processing Unit
(GPU) component of the computing device over an internal bus of the
computing device; and using the GPU component, processing the recovered
macroblock coefficient data and macroblock metadata from the CPU
component to recover pixels.

10. The method of claim 9, wherein providing the recovered macroblock
coefficient data and macroblock metadata from the CPU component to the
GPU component of the computing device further comprises transmitting the
recovered macroblock coefficient data and macroblock metadata over a bus
that interconnects two separate semiconductor integrated circuits.

11. The method of claim 8, further comprising: commencing execution of
the first main process on a second frame of input data following
completion of said storing of the first video frame data to cause the
first main process to decompress at least one frame ahead of the second
main process.

12. The method of claim 11, wherein the second main process decodes
substantially all of the macroblocks of the current frame concurrently.

13. The method of claim 11, wherein said storing the first coefficient
data set and the first fixed-size metadata data set includes storing the
data in a shared memory space accessible to the array of
multi-processors.

14. The method of claim 11, further comprising: encoding the frequency
coefficient data using a selected run length coding; wherein the
coefficient data represents the residual data generated from predicting a
next macroblock of data.

15. The method of claim 11, wherein the first fixed-size metadata data
set contains properties of each macroblock that instruct the second
processing array for decoding each macroblock.

16. The method of claim 15, wherein the properties of each macroblock
(stored in the first fixed-size metadata data set) that instruct the
multi-processors in the PPA for decoding each macroblock include at least
coded block pattern, prediction modes, quantization parameter, and motion
vectors.

17. A method for decompressing a variable length inter-block dependent
encoded video input data stream wherein the input data stream comprises
macroblocks, the method comprising: in a first main process using a first
processing array, wherein the first processing array comprises a computer
program execution device: decompressing a first video frame of the input
data stream; producing a first macroblock coefficient data set responsive
to the first video frame data; producing a first macroblock metadata data
set responsive to the first video frame data; and storing the first
macroblock coefficient data set and the first macroblock metadata data
set; and in a second main process using a second processing array that is
different than the first processing array, executing substantially in
parallel with the first main process: decompressing the macroblocks of a
previous frame of the same input data stream.

18. The method of claim 17, further comprising, after storing the first
macroblock coefficient data set and the first macroblock metadata data
set, beginning decompressing a second video frame of the input data
stream.

19. The method of claim 17, further comprising: using a central
processing unit (CPU) component of a computing device, recovering the
first macroblock coefficient data set and the first macroblock metadata
set from the video data input stream; providing the recovered macroblock
coefficient data set and macroblock metadata set from the CPU component
to a Graphics Processing Unit (GPU) component of the computing device
over an internal bus of the computing device; and using the GPU
component, processing the recovered macroblock coefficient data set and
macroblock metadata set from the CPU component to recover pixels.

20. The method of claim 19, wherein providing the recovered macroblock
coefficient data and macroblock metadata from the CPU component to the
GPU component of the computing device further comprises transmitting the
recovered macroblock coefficient data set and macroblock metadata set
over a bus that interconnects two separate semiconductor integrated
circuits.

Description:

RELATED APPLICATIONS

[0001] This application is a continuation of U.S. Non-Provisional Ser. No.
12/058,636 filed Mar. 28, 2008, entitled "VIDEO ENCODING AND DECODING
USING PARALLEL PROCESSORS", which claims priority from U.S. Provisional
application No. 61/002,972 filed Nov. 13, 2007, entitled "METHOD FOR
DECODING OR ENCODING VIDEO USING ONE OR MORE SEQUENTIAL PROCESSORS AND A
GROUP OR GROUPS OF PARALLEL SIMD PROCESSORS AND LOAD BALANCING TO ACHIEVE
OPTIMAL EFFICIENCY", both of which are incorporated by reference herein
in their entirety.

[0003] This invention pertains to methods and apparatus for decoding or
encoding video data using one or more sequential processors together with
a group or groups of parallel general purpose SIMD processors.

BACKGROUND OF THE INVENTION

[0004] Encoding and decoding systems and methods for MPEG and other
block-based video bit-stream data are now widely known. The fundamentals
are well summarized in U.S. Pat. No. 6,870,883 ("Iwata"), incorporated
herein by this reference. Iwata discloses a three-processor system and
method for video encoding and decoding, to achieve a modicum of
parallelism and improved performance over a strictly sequential solution.

[0005] Block based video compression standards such as MPEG2, H.264, and
VC1 are difficult to decode or encode in parallel using parallel
processors due to the interdependency of bits or blocks of the video
frame. It is also difficult to maximize the performance by keeping all
processors as busy as possible due to differing requirements of the
processors.

[0006] One property of video is that for any given block of pixels (e.g.
macroblock) in the video frame, there is a high correlation to
neighboring blocks. Video compression technologies take advantage of this
through the use of prediction. When the video is encoded, the encoder
predicts block properties based on neighboring blocks and then encodes
the difference (residual) from the prediction. The video decoder computes
the same prediction and adds the residual to the prediction to decompress
the video. Since only residuals to the predictions are sent, the amount
of information sent between the encoder and the decoder is compressed.
One drawback to having block properties predicted based on neighboring
blocks is that if a neighboring block contains an error, for example due
to interference during a broadcast, then all subsequent blocks will also
contain errors, causing an entire frame of video to be corrupted. For
this reason, these video compression standards contain a notion of a
slice.
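
The predict-and-add-residual round trip described above can be sketched
as follows; the previous-value predictor, the sample numbers, and the
function names are illustrative stand-ins, not any standard's actual
predictor:

```python
# Sketch of prediction + residual coding for a 1-D row of block values.
# The predictor (copy the previous reconstructed value) is a hypothetical
# stand-in for the neighbor-based predictors the standards actually use.

def encode_residuals(values):
    """Encoder: predict each value from the previous one, send differences."""
    residuals = []
    prev = 0  # predictors reset, as at the start of a slice
    for v in values:
        residuals.append(v - prev)
        prev = v
    return residuals

def decode_residuals(residuals):
    """Decoder: compute the same prediction and add the residual back."""
    values = []
    prev = 0
    for r in residuals:
        v = prev + r
        values.append(v)
        prev = v
    return values

row = [100, 102, 101, 105, 104]
res = encode_residuals(row)   # [100, 2, -1, 4, -1] -- small values compress well
assert decode_residuals(res) == row
```

Note how a single wrong residual would corrupt every subsequent
reconstructed value, which is exactly the error-propagation problem that
slices bound.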

[0007] A "slice" of video data contains a set of blocks that can be
decoded without any other neighboring block information (from outside the
slice). At each slice, the predictors are reset, trading off compression
efficiency for error resilience. The majority of encoded MPEG2 content,
for example, uses one slice per line of blocks. If an error is introduced
in any given block, the system can recover on the next line of blocks.

[0008] Two other properties of video that allow it to be compressed are
these: high frequency information can be discarded without the human
vision system detecting a noticeable change in the results; and, motion
tends to be localized to certain areas of the picture. Video compression
standards take advantage of these two properties through quantization and
motion estimation/motion compensation, respectively.

[0009] Finally, to further compress the video data, a lossless variable
length encoding scheme is used in video compression technologies. These
methods may even use a context adaptive algorithm causing further
dependency on data previously encoded or decoded in the data stream.
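
A minimal run-length sketch of the idea follows; real codecs pair run
lengths with variable-length, often context-adaptive, entropy codes, so
the (run, value) pair format below is a simplification for illustration:

```python
# Minimal run-length sketch for quantized coefficient data, where long
# runs of zeros dominate after quantization.

def rle_encode(coeffs):
    """Emit a (zero_run, value) pair for each nonzero coefficient."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs              # trailing zeros are implied, not coded

def rle_decode(pairs, length):
    """Expand the pairs back to a coefficient list of the given length."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))  # restore trailing zeros
    return out

coeffs = [12, 0, 0, -3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pairs = rle_encode(coeffs)    # [(0, 12), (2, -3), (3, 1)]
assert rle_decode(pairs, len(coeffs)) == coeffs
```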

[0010] Some known solutions utilize multiple sequential processors or
arrays of processors connected by a network such as Ethernet or high
speed memory interface. These solutions suffer in efficiency from an
insufficient number of processors and from memory bandwidth/latency
limits in sending data to all the processors.

[0011] Other approaches to parallelizing video decoding and encoding have
been proposed, such as that disclosed in U.S. Pat. No. 6,870,883, which
describes a system for decoding video using multiple processors. That
system requires the computationally expensive ability to transfer data to
and from processors at each macroblock.

[0012] The need remains for improvements in methods and systems for video
data processing to improve throughput while maintaining video quality and
controlling costs.

SUMMARY OF THE INVENTION

[0013] In general, the present disclosure concerns improvements in video
encoding/decoding technology. In one embodiment, improvements can be
achieved by using two different processor systems, namely a Sequential
Processor Array ("SPA") and a Parallel Processor Array ("PPA"). The SPA
and PPA encode/decode a video stream in a predefined, coordinated manner.
In one illustrative system, in a decoder, a sequential processor array is
provided comprising at least one general purpose sequential processor and
arranged to receive a video data input stream; and a general purpose
parallel processor array is provided. A data bus interconnects the
sequential processor array and the parallel processor array. A first
memory is coupled to the sequential processor array to store SPA program
instructions, macroblock coefficient data and macroblock metadata
produced by the SPA from incoming video data. A second memory is coupled
to the parallel processor array and is arranged for storing PPA program
instructions and macroblock coefficient data, macroblock metadata,
reference frame data and output video frame data.

[0014] Additional aspects and advantages of this invention will be
apparent from the following detailed description of preferred
embodiments, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a simplified block diagram illustrating a multi-processor
video decompression/compression system architecture consistent with the
present invention.

[0016] FIG. 2 is a simplified flow diagram illustrating a method of
parallel variable length decode of encoded video data using an array of
general purpose sequential processors.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0017] With the advent of general purpose multi-core processors from Intel
or AMD, which have 2, 4, or 8 processors, and massively
multi-processor systems such as NVIDIA's G80 GPU which, as of this
writing, contains up to 128 SIMD (single instruction, multiple data)
processors, a relatively inexpensive commodity desktop PC can provide a
massive amount of processing power. What is needed, and is described
herein, are methods for utilizing systems that include sequential and
parallel processors so as to greatly enhance the speed and efficiency of
decoding and decompression of block-based video data. The decompressed
video frames can then be displayed on a computer or television screen, or
used in further video processing such as image enhancement, scaling, or
encoding for re-transmission.

[0018] Our invention in various embodiments takes advantage of slices and
other independent portions of the video to greatly increase the coding
speed and efficiency. In one presently preferred embodiment, our methods
can be used in a system of the general type illustrated in FIG. 1,
containing a sequential processor array ("SPA" 101) and a Parallel
Processor Array ("PPA" 102). The SPA contains one or more high
performance general purpose sequential processing units that are designed
to execute sequential instructions on sequential data. The PPA contains
one or more groups of homogeneous general purpose SIMD multiprocessors
107 that are designed to operate on highly parallel problems where many
processors can work in parallel. The SPA and PPA each have access to one
or more physical RAMs (Random Access Memory) 103 and 104, respectively,
and are connected together by a high speed bi-directional data and
communication bus 105.

[0019] Each multiprocessor 107 contains one or more SIMD (Single
Instruction Multiple Data) processors, and also contains a memory cache
(illustrated as RAM but may be other types of cache) 115 allowing fast
access and communication between each SIMD processor in the
multiprocessor. There is also, in the illustrated embodiment, a random
access memory (RAM 104), shared by all multiprocessors in the array 102,
that stores the video frames, macroblock coefficient data, macroblock
metadata, and multiprocessor instructions. There is a PPA sequencer and
memory arbiter 106 to automatically and efficiently select processors to
execute a set of instructions 114. Each multiprocessor can process
batches of instructions and one batch is executed after the other. The
scheduler selects batches of instructions for each multiprocessor. If,
and when, a multi-processor is instructed to wait for memory or a
synchronization event, the scheduler will swap in new instructions to
execute on that processor.

Decode Method

[0020] One aspect of the present invention involves using two methods or
processes in parallel to efficiently apply processing resources to decode
or encode block-based video data. We use the term "parallel" to refer to
processes that generally run concurrently, in a coordinated fashion. We
do not use the term to require a strict step by step, or clock by clock
parallelism. The following description is for decoding, but it can be
applied to encoding in a similar manner as further explained below.

[0021] The first of the pair of methods we will call Parallel Variable
Length Decode or "PVLD." As the name implies, this method applies
parallel processing to the variable-length decoding aspect of video
decoding. It decompresses a video frame of a variable length inter-block
dependent encoded stream 116, and produces an independent variable length
macroblock coefficient data buffer 110 and a fixed size macroblock
metadata buffer 111. This data, for one frame in a preferred embodiment,
is then copied to the PPA's RAM memory 104 through the high speed
communication bus 105. In an alternative embodiment, a memory address can
be sent to the PPA 107, for example in the case of a single shared RAM
device (not shown).

[0022] The second process of the pair we will call Parallel Block Decode
or "PBD." The PBD process decompresses each macroblock by decoding the
run-length compressed coefficient data using the metadata and using
stored reference frames 112. The output of this method is the desired
output video frame 113. As soon as the data is copied to the PPA, the SPA
can start on the next frame; thus the first method, PVLD, in a preferred
embodiment is always decompressing one frame ahead of the second method,
the PBD. Since both methods run in parallel and both of these processes
make use of many processors (in arrays 101 and 102, respectively), the
speed and efficiency of decoding an entire video stream are greatly
improved compared to prior solutions.

[0023] Referring again to FIG. 1, the coefficient data buffer (110 and
117) contains a run length encoded version of the frequency coefficients
representing the residual data from the prediction, and the metadata
buffer contains other properties of each macroblock that instruct the
multiprocessors in the PPA how to decode each macroblock. Buffer 110
contains the coefficient data, or is accumulating that data, for a Frame
"n+1" when the PPA buffer 117 is working on decoding the coefficient data
from the last (i.e., the next preceding) Frame n. As noted, the SPA
starts on the next frame of data as soon as it stores a completed frame
of coefficient data in the buffer 117 for the PPA to conduct PBD. In this
embodiment, there is no harm if the block decode in the PPA temporarily
falls behind, as the next frame data can wait in the buffer. However, the
PPA need not wait for a full frame of data to begin processing macroblocks.
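
The frame-ahead behavior described above can be sketched as a two-stage
pipeline; the thread/queue structure, stage and function names, and the
one-slot buffer depth are illustrative assumptions, not the SPA/PPA
hardware:

```python
# Sketch of the frame-ahead pipeline: a PVLD stage (standing in for the
# SPA) feeds parsed frame data through a buffer to a PBD stage (standing
# in for the PPA). A bounded queue keeps PVLD at most one frame ahead.

import queue
import threading

def pvld_stage(frames, buf):
    """Parse each frame and hand the result to the PBD stage."""
    for n, frame in enumerate(frames):
        parsed = (f"coeffs+metadata:{frame}", n)  # placeholder for VLC decode
        buf.put(parsed)                 # blocks only if PBD falls behind
    buf.put(None)                       # end-of-stream marker

def pbd_stage(buf, decoded):
    """Consume parsed frames and decode them (placeholder work)."""
    while True:
        item = buf.get()
        if item is None:
            break
        decoded.append(item[1])         # placeholder for block decode

frames = ["f0", "f1", "f2"]
buf = queue.Queue(maxsize=1)            # PVLD stays about one frame ahead
decoded = []
t = threading.Thread(target=pbd_stage, args=(buf, decoded))
t.start()
pvld_stage(frames, buf)
t.join()
assert decoded == [0, 1, 2]
```

The bounded queue mirrors the buffer in FIG. 1: if the block decode
temporarily falls behind, the parsed frame simply waits; if it runs ahead,
it blocks until the next frame is available.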

Processing Slices of Macroblock Data

[0024] As discussed in the background section, slices can be decoded
independently of other slices. Slices also contain blocks that are
dependent on other blocks in the slice and are best decoded sequentially;
therefore, in a preferred embodiment, each slice is decoded using a
sequential processor 108, but more than one slice can be decoded in
parallel using a group of sequential processors 101. Each sequential
processor 108 in the SPA decodes an assigned slice, and outputs the
independent coefficients and metadata into another array for use in the
second process (PBD). If there are not enough sequential processors for
all slices of a frame, slices may be assigned, for example in a
round-robin fashion, until all slices are decoded.

[0025] Slices of data are, however, variable in byte length due to the
nature of the variable length encoding and the amount of compression
achieved by prediction for each slice. To accommodate this, a process
is illustrated in FIG. 2 to pre-process the data in such a way that a
frame of video can be decoded in parallel using multiple sequential
processors. Item 201 shows the variable-sized slices packed in a buffer. This
buffer contains the encoded bits from the video stream with all slices
packed together. The data is pre-processed by finding the point in the
buffer where each slice begins and the pointers for each slice are stored
in an index array shown in 202. This index array is then read by each
processor in the SPA (203) to find the location of the slice that each
processor is responsible for decoding. Once the set of macroblocks in
each SPA processor's array has been VLC decoded to coefficients and
metadata, the resulting (RLE compressed) coefficients and metadata for
each block in a slice are stored in arrays (204 and 205, respectively,
and 117 and 118, respectively). Another index table is used to tell each processor
in the PPA where each macroblock is located in the coefficient buffer
(204). In the case of this invention, the index table is stored at the
top of the coefficient buffer for convenience. Each processor in the PPA
then reads the address offset for the macroblock data that it is
responsible for decoding from this table as shown in (117).
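
The index-array pre-processing of FIG. 2 can be sketched as follows; the
3-byte start code and the function names are hypothetical stand-ins for
the actual bitstream syntax:

```python
# Sketch of the FIG. 2 pre-processing: scan the packed buffer for slice
# start codes, record each slice's byte offset in an index array (the
# "202" array), and let worker i decode slice i from its offset.

START_CODE = b"\x00\x00\x01"  # hypothetical slice start code

def build_slice_index(buf):
    """Return the byte offset of each slice start in the packed buffer."""
    index, pos = [], 0
    while True:
        pos = buf.find(START_CODE, pos)
        if pos < 0:
            return index
        index.append(pos)
        pos += len(START_CODE)

def slice_spans(buf, index):
    """Pair each offset with the start of the next slice (or end of buffer)."""
    ends = index[1:] + [len(buf)]
    return list(zip(index, ends))

buf = START_CODE + b"sliceA" + START_CODE + b"sliceB" + START_CODE + b"sl"
idx = build_slice_index(buf)
assert idx == [0, 9, 18]
payloads = [buf[s + len(START_CODE):e] for s, e in slice_spans(buf, idx)]
assert payloads == [b"sliceA", b"sliceB", b"sl"]
```

Each sequential processor would index this array with its assigned slice
number (round-robin if slices outnumber processors) and decode only the
bytes of its own span.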

[0026] Once all the slices have been decoded, the decompressed slice data
is sent to the PPA for the PBD and decoding of the next frame of slices
can be started on the SPA. Since each macroblock in the PBD is
independent of other macroblocks, many more processors can be applied to
decompressing all of the blocks in parallel.

[0027] Each processor in a multiprocessor communicates through a RAM that
is local to the group of processors. Each processor's function depends on
the macroblock decoding phases.

[0028] In some cases, such as high bit rate video decoding or encoding,
some of the sequential decoding or encoding in the PVLD could be
offloaded to the PPA. In some embodiments, where this capability is
implemented, the decision depends on which phase of the codec is the
bottleneck in the system. A methodology for making this determination for
load balancing is described in the next section.

Load Balancing to Optimize Throughput

[0029] To properly load balance the system using the PPA and the SPA,
the system must calculate the theoretical performance (for example, in frames
per second) of the different processor load distributions using various
inputs and some pre-calibrated constants. The following is a sample
calculation.

[0030] Let:

Ns = # of processors in the SPA
Np = # of processors in the PPA
Cs = clock rate of one of the processors in the SPA (assume all have the
same clock rate)
Cp = clock rate of one of the processors in the PPA (assume all have the
same clock rate)
Cts = available clock rate of the SPA array = Cs*MIN(Ns, # slices in the
video)
Ctp = available clock rate of the PPA array = Cp*MIN(Np, # slices in the
video)
B = bits per frame of a video stream (initial value set to avg
bitrate/FPS and then continuously refined by analyzing previously
processed frames and frame type)
P = total pixels in a video frame
T = transfer rate of the high speed bus
Ks = SPA processor clocks per bit for a given system, found
experimentally or by calibration, and may differ for I, P or B frames
Kp = PPA processor clocks per bit for a given system, found
experimentally or by calibration, and may differ for I, P or B frames
Kpp = PPA processor clocks per pixel for a given system, found
experimentally or by calibration, and may differ for I, P or B frames

[0031] First, the theoretical time for VLC decode or encode in the SPA is
calculated using this equation:

Tvs = B*Ks/Cts

The corresponding PPA calculation is:

Tvp = B*Kp/Ctp

[0032] The transfer time is calculated by the equation Tt = B/T for both
the more compressed VLC representation and the metadata/coefficient
representation of the frame; B changes depending on which representation
is transferred.

[0033] The pixel processing time is calculated using Kpp and a
recomputed Ctp:

Ctp = Cp*MIN(Np, # macroblocks in the frame)

Tpp = P*Kpp/Ctp

[0034] The total FPS is then defined by:

1/(Tvs+Tt+MAX(Tpp-Tvs,0)) when running the PPA and SPA in parallel; or

1/(Tvp+Tt+Tpp) when offloading the VLC processing to the PPA.

[0035] These two values are compared and the proper load balancing is
chosen based on the better theoretical performance.
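
The comparison in paragraphs [0031] through [0035] can be sketched as a
single function; every numeric input below is a made-up calibration value
for illustration only:

```python
# Sketch of the load-balancing calculation: estimate FPS for the
# split SPA/PPA pipeline versus offloading VLC to the PPA, then pick
# the better option. All inputs are illustrative, not calibrated values.

def fps_estimates(B_vlc, B_coef, P, T, Ns, Np, Cs, Cp,
                  Ks, Kp, Kpp, n_slices, n_macroblocks):
    Cts = Cs * min(Ns, n_slices)           # SPA clocks available for VLC
    Ctp_vlc = Cp * min(Np, n_slices)       # PPA clocks available for VLC
    Ctp_pix = Cp * min(Np, n_macroblocks)  # PPA clocks for pixel work

    Tvs = B_vlc * Ks / Cts                 # VLC decode time on the SPA
    Tvp = B_vlc * Kp / Ctp_vlc             # VLC decode time on the PPA
    Tpp = P * Kpp / Ctp_pix                # pixel processing time on the PPA

    Tt_coef = B_coef / T                   # transfer of coefficient/metadata form
    Tt_vlc = B_vlc / T                     # transfer of the compressed VLC form

    fps_split = 1.0 / (Tvs + Tt_coef + max(Tpp - Tvs, 0.0))
    fps_ppa = 1.0 / (Tvp + Tt_vlc + Tpp)
    return fps_split, fps_ppa

split, ppa = fps_estimates(B_vlc=2e6, B_coef=6e6, P=2e6, T=4e9,
                           Ns=4, Np=16, Cs=3e9, Cp=1.5e9,
                           Ks=20, Kp=30, Kpp=10,
                           n_slices=68, n_macroblocks=8160)
use_ppa_for_vlc = ppa > split   # choose the better theoretical throughput
```

With these made-up constants the PPA path wins, matching the scenario the
tables describe where a wide but slower PPA beats a narrow SPA on VLC
decode; with other constants the split pipeline would win instead.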

[0036] A calculation of this type can be run on every frame, and the
variables B, Ks, Kp, and Kpp can be refined based on actual versus
calculated frame processing times. B preferably is constantly updated
based on historical values and the frame type (such as I, P or B
frames). K may also be adjusted based on the historical values of real
versus theoretical performance.

Tables 1, 2 and 3 Below Show Examples of Sample Results

[0037] This example shows the difference between processing the VLC
decode using the PPA versus the SPA, and why decoding the VLC step on the
16 processor PPA can achieve higher overall performance than a 4
processor SPA despite each PPA processor having a much smaller
clocks/second value and a longer transfer time per frame. Processing the
VLC on the PPA achieves a
74 frames per second overall performance where the SPA achieves a 60
frames per second overall performance. In this case, the system would
execute the VLC decode on the PPA instead of the SPA. A new clock/bit
measurement and clock/pixel measurement may then be calculated to
determine how the next frame will be decoded.

[0038] The encoding of video is accomplished in a similar way, but in
reverse. The video frames are first placed into the PPA's RAM memory 104
through a second file decoding process, or a memory copy from a capture
device such as a camera. The PPA then executes various pixel processes of
an encoder, resulting in coefficients. These processes include intra and
inter prediction, mode selection, motion estimation, motion compensation,
DCT and IDCT, and quantization and inverse quantization.

[0039] The resulting coefficients and metadata are then placed into
arrays similar to 204 and 205 for further processing by the SPA. The SPA
then takes the coefficient data and metadata and encodes them using a
variable length coding (VLC) process, resulting in a video stream.

[0040] If there are multiple slices in the picture, the SPA can process
the slices in parallel, resulting in higher overall performance.

[0041] It will be obvious to those having skill in the art that many
changes may be made to the details of the above-described embodiments
without departing from the underlying principles of the invention. The
scope of the present invention should, therefore, be determined only by
the following claims.