### Section 5: Overview of the Decoding Process {#h-05-00}

A VP8 decoder needs to maintain four YUV frame buffers whose
resolutions are at least equal to that of the encoded image. These
buffers hold the current frame being reconstructed, the immediately
previous reconstructed frame, the most recent golden frame, and the
most recent altref frame.

Most implementations will wish to "pad" these buffers with
"invisible" pixels that extend a moderate number of pixels beyond all
four edges of the visible image. This simplifies interframe
prediction by allowing all (or most) prediction blocks -- which are
_not_ guaranteed to lie within the visible area of a prior frame -- to
address usable image data.

Regardless of the amount of padding chosen, the invisible rows above
(or below) the image are filled with copies of the top (or bottom)
row of the image; the invisible columns to the left (or right) of the
image are filled with copies of the leftmost (or rightmost) visible
column; and the four invisible corners are filled with copies of the
corresponding visible corner pixels. The use of these prediction
buffers (and suggested sizes for the _halo_) will be elaborated on in
the discussion of motion vectors, interframe prediction, and sub-
pixel interpolation later in this document.

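
The padding rule above can be sketched as a simple edge-replication
routine. The following is a minimal illustration assuming an 8-bit
plane with `stride` bytes per row and an arbitrary `BORDER` size; all
names are illustrative, not part of the format:

```c
#include <string.h>
#include <stdint.h>

#define BORDER 32   /* illustrative halo size */

/* Extend a plane's visible area into the surrounding halo by
   replicating edge pixels.  `buf` points at the top-left visible
   pixel of a plane that was allocated with BORDER extra pixels on
   every side. */
static void extend_borders(uint8_t *buf, int stride, int width, int height)
{
    int i;
    /* Left and right halos: replicate the edge pixel of each row. */
    for (i = 0; i < height; i++) {
        memset(buf + i * stride - BORDER, buf[i * stride], BORDER);
        memset(buf + i * stride + width, buf[i * stride + width - 1], BORDER);
    }
    /* Top and bottom halos (including the four corners): copy the
       already-extended first and last rows outward. */
    for (i = 1; i <= BORDER; i++) {
        memcpy(buf - i * stride - BORDER, buf - BORDER, width + 2 * BORDER);
        memcpy(buf + (height - 1 + i) * stride - BORDER,
               buf + (height - 1) * stride - BORDER, width + 2 * BORDER);
    }
}
```

Extending the rows after the columns means the corner pixels fall out
of the row copies for free, matching the corner rule stated above.
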
As will be seen in the description of the frame header, the image
dimensions are specified (and can change) with every key frame.
These buffers (and any other data structures whose size depends on
the size of the image) should be allocated (or re-allocated)
immediately after the dimensions are decoded.

Leaving most of the details for later elaboration, the following is
an outline of the decoding process.

First, the frame header (the beginning of the first data partition)
is decoded. Altering or augmenting the maintained state of the
decoder, this provides the context in which the per-macroblock data
can be interpreted.

The macroblock data occurs (and must be processed) in raster-scan
order. This data comes in two or more parts. The first (_prediction_
or _mode_) part comes in the remainder of the first data partition.
The other parts comprise the data partition(s) for the DCT/WHT
coefficients of the residue signal. For each macroblock, the
prediction data must be processed before the residue.

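
The sequencing just described can be summarized in outline form. The
sketch below uses stub functions purely to show the ordering of the
steps elaborated in the rest of this section; every name is
illustrative rather than part of the format:

```c
#include <string.h>

/* Each step records its name so the ordering is observable; a real
   decoder would do the corresponding work instead. */
static char order[256];
static void step(const char *name) { strcat(order, name); strcat(order, " "); }

static void decode_frame_header(void)       { step("header"); }
static void decode_mode_data(int mb)        { (void)mb; step("modes"); }
static void decode_residue(int mb)          { (void)mb; step("residue"); }
static void predict_and_reconstruct(int mb) { (void)mb; step("recon"); }
static void loop_filter_frame(void)         { step("loopfilter"); }
static void update_reference_frames(void)   { step("refs"); }
static void extend_frame_borders(void)      { step("halo"); }
static void swap_current_and_last(void)     { step("swap"); }

static void decode_frame(int num_macroblocks)
{
    decode_frame_header();            /* start of first data partition */
    for (int mb = 0; mb < num_macroblocks; mb++) {
        decode_mode_data(mb);         /* prediction/mode data first... */
        decode_residue(mb);           /* ...then the DCT/WHT residue   */
        predict_and_reconstruct(mb);  /* prediction + residue          */
    }
    loop_filter_frame();              /* applied to the whole frame    */
    update_reference_frames();        /* golden / altref replacement   */
    extend_frame_borders();           /* refill the halo               */
    swap_current_and_last();          /* current <-> last              */
}
```
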
Each macroblock is predicted using one (and only one) of four
possible frames. All macroblocks in a key frame, and all _intra-coded_
macroblocks in an interframe, are predicted using the already-decoded
macroblocks in the current frame. Macroblocks in an interframe may
also be predicted using the previous frame, the golden frame, or the
altref frame. Such macroblocks are said to be _inter-coded_.

The purpose of prediction is to use already-constructed image data to
approximate the portion of the original image being reconstructed.
The effect of any of the prediction modes is then to write a
macroblock-sized prediction buffer containing this approximation.

Regardless of the prediction method, the residue DCT signal is
decoded, dequantized, reverse-transformed, and added to the
prediction buffer to produce the (almost final) reconstruction value
of the macroblock, which is stored in the correct position of the
current frame buffer.

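
This reconstruction step amounts to a saturating add of the
inverse-transformed residue onto the prediction. A minimal sketch for
one 4x4 block follows; the function names are illustrative:

```c
#include <stdint.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* Add a 4x4 block of inverse-transformed residue to the prediction
   and store the clamped result into the current frame buffer. */
static void add_residue_4x4(uint8_t *recon, int stride,
                            const uint8_t *pred, int pred_stride,
                            const int16_t residue[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            recon[r * stride + c] =
                clamp255(pred[r * pred_stride + c] + residue[r * 4 + c]);
}
```
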
The residue signal consists of 24 (sixteen Y, four U, and four V) 4x4
quantized and losslessly compressed DCT transforms approximating the
difference between the original macroblock in the uncompressed source
and the prediction buffer. For most prediction modes, the 0th
coefficients of the sixteen Y subblocks are expressed via a 25th WHT
of the second-order virtual Y2 subblock discussed above.

_Intra-prediction_ exploits the spatial coherence of frames. The 16x16
luma (Y) and 8x8 chroma (UV) components are predicted independently
of each other using one of four simple means of pixel propagation,
starting from the already-reconstructed (16-pixel-long luma, 8-pixel-
long chroma) row above, and column to the left of, the current
macroblock. The four methods are:

1. Copying the row from above throughout the prediction buffer.
2. Copying the column from the left throughout the prediction
buffer.
3. Copying the average value of the row and column throughout the
prediction buffer.
4. Extrapolation from the row and column using the (fixed) second
difference (horizontal and vertical) from the upper left corner.

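
The four methods above can be sketched for the 16x16 luma case as
follows. Here `above` is the reconstructed 16-pixel row over the
macroblock, `left` the 16-pixel column to its left, and `corner` the
pixel above and to the left; the mode names and the particular DC
rounding shown are illustrative:

```c
#include <stdint.h>

typedef enum { PRED_V, PRED_H, PRED_DC, PRED_TM } intra_mode;

static uint8_t clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

static void predict_16x16(uint8_t pred[16][16], intra_mode mode,
                          const uint8_t above[16], const uint8_t left[16],
                          uint8_t corner)
{
    int r, c, sum = 0;
    switch (mode) {
    case PRED_V:                        /* 1: copy the row from above   */
        for (r = 0; r < 16; r++)
            for (c = 0; c < 16; c++) pred[r][c] = above[c];
        break;
    case PRED_H:                        /* 2: copy the column from left */
        for (r = 0; r < 16; r++)
            for (c = 0; c < 16; c++) pred[r][c] = left[r];
        break;
    case PRED_DC:                       /* 3: average of row and column */
        for (c = 0; c < 16; c++) sum += above[c] + left[c];
        for (r = 0; r < 16; r++)
            for (c = 0; c < 16; c++) pred[r][c] = (uint8_t)((sum + 16) >> 5);
        break;
    case PRED_TM:                       /* 4: second-difference         */
        for (r = 0; r < 16; r++)        /*    extrapolation:            */
            for (c = 0; c < 16; c++)    /*    left + above - corner     */
                pred[r][c] = clamp255(left[r] + above[c] - corner);
        break;
    }
}
```
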
Additionally, the sixteen Y subblocks may be predicted independently
of each other using one of ten different _modes_, four of which are 4x4
analogs of those described above, augmented with six "diagonal"
prediction methods. Among all the modes, intra or inter, there are
exactly two for which the residue signal does not use the Y2 block to
encode the DC portion of the sixteen 4x4 Y subblock DCTs: this
"independent Y subblock" intra mode, and the inter mode in which each
Y subblock may carry its own motion vector. The independent Y
subblock mode has no effect on the 8x8 chroma prediction.

_Inter-prediction_ exploits the temporal coherence between nearby
frames. Except for the choice of the prediction frame itself, there
is no difference between inter-prediction based on the previous frame
and that based on the golden frame or altref frame.

Inter-prediction is conceptually very simple. While, for reasons of
efficiency, there are several methods of encoding the relationship
between the current macroblock and corresponding sections of the
prediction frame, ultimately each of the sixteen Y subblocks is
related to a 4x4 subblock of the prediction frame, whose position in
that frame differs from the current subblock position by a (usually
small) displacement. These two-dimensional displacements are called
_motion vectors_.

The motion vectors used by VP8 have quarter-pixel precision.
Prediction of a subblock using a motion vector that happens to have
integer (whole number) components is very easy: The 4x4 block of
pixels from the displaced block in the previous, golden, or altref
frame is simply copied into the correct position of the current
macroblock's prediction buffer.

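
The whole-pixel case is indeed a plain copy; a sketch with
illustrative names:

```c
#include <string.h>
#include <stdint.h>

/* Inter-prediction with a whole-pixel motion vector: the displaced
   4x4 block is copied from the reference plane (previous, golden, or
   altref) straight into the prediction buffer. */
static void copy_4x4_whole_pel(uint8_t *pred, int pred_stride,
                               const uint8_t *ref, int ref_stride,
                               int mv_row, int mv_col) /* whole pixels */
{
    const uint8_t *src = ref + mv_row * ref_stride + mv_col;
    for (int r = 0; r < 4; r++)
        memcpy(pred + r * pred_stride, src + r * ref_stride, 4);
}
```

Thanks to the halo described earlier, `src` may legally point a short
distance outside the visible image.
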
Fractional displacements are conceptually and implementationally more
complex. They require the inference (or synthesis) of sample values
that, strictly speaking, do not exist. This is one of the most basic
problems in signal processing, and readers conversant with that
subject will see that the approach taken by VP8 provides a good
balance of robustness, accuracy, and efficiency.

Leaving the details for the implementation discussion below, the
pixel interpolation is calculated by applying a kernel filter (using
reasonable-precision integer math) three pixels on either side, both
horizontally and vertically, of the pixel to be synthesized. The
resulting 4x4 block of synthetic pixels is then copied into position
exactly as in the case of integer displacements.

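
As a concrete illustration, applying one such kernel in one dimension
looks like the following. The taps shown are the half-pixel set from
the specification's subpixel filter table; a full decoder applies a
kernel like this horizontally and then vertically, selected by the
fractional parts of the motion vector, and the function name here is
illustrative:

```c
#include <stdint.h>

static uint8_t clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* Synthesize the pixel halfway between p[0] and p[1]; the six taps
   cover p[-2] through p[3], i.e., three integer pixels on either side
   of the fractional position. */
static uint8_t sixtap_halfpel(const uint8_t *p)
{
    static const int taps[6] = { 3, -16, 77, 77, -16, 3 };  /* sums to 128 */
    int sum = 0;
    for (int i = 0; i < 6; i++)
        sum += taps[i] * p[i - 2];
    return clamp255((sum + 64) >> 7);   /* rounded division by 128 */
}
```
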
Each of the eight chroma subblocks is handled similarly. Their
motion vectors are never specified explicitly; instead, the motion
vector for each chroma subblock is calculated by averaging the
vectors of the four Y subblocks that occupy the same area of the
frame. Since chroma pixels have twice the diameter (and four times
the area) of luma pixels, the calculated chroma motion vectors have
1/8-pixel resolution, but the procedure for copying or generating
pixels for each subblock is essentially identical to that done in the
luma plane.

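
The chroma motion-vector derivation can be sketched as follows. The
exact rounding used by a conforming decoder is defined later in this
document; a simple rounded average is shown here, with illustrative
names:

```c
/* Motion vectors in quarter-pel luma units.  Because the chroma plane
   has half the resolution, the averaged value addresses chroma with
   1/8-pel precision. */
typedef struct { int row, col; } mv_t;

/* Average of four components, rounding halves away from zero
   (illustrative; see the motion vector section for the exact rule). */
static int avg4(int a, int b, int c, int d)
{
    int sum = a + b + c + d;
    return (sum + (sum >= 0 ? 2 : -2)) / 4;
}

/* Derive a chroma subblock's vector from the four Y subblock vectors
   covering the same area of the frame. */
static mv_t chroma_mv(mv_t y0, mv_t y1, mv_t y2, mv_t y3)
{
    mv_t m;
    m.row = avg4(y0.row, y1.row, y2.row, y3.row);
    m.col = avg4(y0.col, y1.col, y2.col, y3.col);
    return m;
}
```
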
After all the macroblocks have been generated (predicted and
corrected with the DCT/WHT residue), a filtering step (the _loop
filter_) is applied to the entire frame. The purpose of the loop
filter is to reduce blocking artifacts at the boundaries between
macroblocks and between subblocks of the macroblocks. The term "loop
filter" is used because this filter is part of the "coding loop";
that is, it affects the reconstructed frame buffers that are used to
predict ensuing frames. This is distinguished from the
postprocessing filters discussed earlier, which affect only the
viewed video and do not "feed into" subsequent frames.

Next, if signaled in the data, the current frame may replace the
golden frame prediction buffer and/or the altref frame buffer.

The halos of the frame buffers are next filled as specified above.

Finally, at least as far as decoding is concerned, the (references
to) the "current" and "last" frame buffers should be exchanged in
preparation for the next frame.

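
This end-of-frame bookkeeping is naturally expressed as exchanges of
references (indices or pointers) rather than pixel copies. A sketch
with illustrative names; note that a real decoder must also ensure a
buffer still referenced by the golden or altref slot is not later
overwritten, e.g., via reference counting:

```c
/* Roles held by the decoder, as indices into a pool of frame buffers. */
typedef struct {
    int current, last, golden, altref;
} ref_indices;

static void end_of_frame_update(ref_indices *r,
                                int refresh_golden, int refresh_altref)
{
    /* If signaled, the just-decoded frame also becomes the golden
       and/or altref reference. */
    if (refresh_golden) r->golden = r->current;
    if (refresh_altref) r->altref = r->current;

    /* Exchange "current" and "last" so the just-decoded frame
       predicts the next one. */
    int tmp = r->current;
    r->current = r->last;
    r->last = tmp;
}
```
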
Various processes may be required (or desired) before viewing the
generated frame. As discussed in the frame dimension information
below, truncation and/or upscaling of the frame may be required.
Some playback systems may require a different frame format (RGB,
YUY2, etc.). Finally, as mentioned in the Introduction, further
postprocessing or filtering of the image prior to viewing may be
desired. Since the primary purpose of this document is a decoding
specification, the postprocessing is not specified in this document.

While the basic ideas of prediction and correction used by VP8 are
straightforward, many of the details are quite complex. The
management of probabilities is particularly elaborate. Not only do
the various modes of intra-prediction and motion vector specification
have associated probabilities, but they, together with the coding of
DCT coefficients and motion vectors, often base these probabilities
on a variety of contextual information (calculated from what has been
decoded so far), as well as on explicit modification via the frame
header.

The "top-level" of decoding and frame reconstruction is implemented
in the reference decoder file `dixie.c` (Section 20.4).

This concludes our summary of decoding and reconstruction; we
continue by discussing the individual aspects in more depth.

A reasonable "divide and conquer" approach to implementation of a
decoder is to begin by decoding streams composed exclusively of key
frames. After that works reliably, interframe handling can be added
more easily than if complete functionality were attempted
immediately. In accordance with this, we first discuss components
needed to decode key frames (most of which are also used in the
decoding of interframes) and conclude with topics exclusive to
interframes.