Revision as of 13:08, 31 August 2012

Daala is the current working name of a next generation video codec— to be renamed once someone insists on something better. So far the best proposed alternative is PatentCake.

For now the purposes of this page is to collect notes about things which have been discussed in informal public IRC discussion about the next generation initiative. Participants in these discussions have included Timothy Terriberry, Jason Garrett-Glaser, Loren Merritt, Ben Schwartz, Greg Maxwell, and others.

Techniques

The discussed overall structure so far has been a variable size lapped-DCT block based codec with lapping done via pre/post filtering with a specially structured (lifting) linear phase transform along the edges along with overlapped block motion compensation and the expected trimmings. The lapping can be optimized for energy compaction and other useful properties, including invert-ability, and yields excellent results with efficient finite precision math.

Other components which have been discussed include:

Techniques applicable to all frame types

Multisymbol arithmetic coding

Timothy has some trial code showing speed-up proportional to the number of bits coded at once. (ec_test.c)

Mode prediction using the previously decoded data, e.g. coding the mode using a probability function derived from trained predictors on the surrounding blocks.

This will be terrible for robustness but may significantly reduce signalling overhead, allowing many more modes, and provide continuous adaptation between signalling free and fully signalled modes.

Explore legendre polynomial basis transforms instead of DCT

May have better perceptual properties and/or result in 'less compromised' efficient implementations.

Techniques applicable to inter frames

Using x264 as a test-bed Jason and Loren demonstrated 15% rate/distortion improvements from using 10-bit intermediaries and references, estimated as being 1/3rd from quality calculation in the 10-bit space, 1/3rd from the higher precision references, and 1/3rd from higher intermediate precision in calculations (e.g. MC filter processing).

Increased reference precision competes for memory with increased number of references. The improvements demonstrated appear to be a greater win than increasing the reference count once there are four references or so.

Super-resolution techniques for motion-compensation references have been discussed— in particular it appears that the half-pel location is where intelligent filtering matters the most so staged computation could be effectively used to allow more expensive filtering at that level.

Edge-directed interpolation techniques might be effectively applied to increase motion compensation accuracy, but most of the techniques known to be very effective are too slow.

Speculation has been offered that a significant part of MC inaccuracy may be due to blending in a physically incorrect (gamma-corrected) space, though no real conclusions were made. Academic papers on motion compensation accuracy seem to have ignored this issue.

Possible compromise: the video reference structure contains a backbone that can be decoded at only N bits of depth (e.g. 10), and higher precisions are only supported outside of this reference chain.

Precision by truncation: decode is performed twice on each frame, identically, at low and high precision. The only difference between them is the bit-depth of the transform, or possibly of the transform and MC filters. Only low-precision outputs can be referenced by subsequent frames. Useful if high-precision content is still worth watching at low precision.

Precision by gamma: decode is performed once at low precision as normal. Then the output frame is converted to linear-light at high precision, after which another layer of residuals is added. The second layer can be permitted to reference previous high-precision frames... tricky to use both sets of references though. Useful if high precision is used for storing linear data, but people still want to watch it on "low-end" hardware.

Some high end digital cameras are operating jpeg-derivatives in a special mode that keeps the image in the native linear RGB bayer format in order to avoid lossy/slow demosaicing on the camera. In particular this allows white balancing in post without excessive loss. Probably out of scope for Daala itself.

Bayer, 4:2:0, 4:2:2, and Interlacing are all special cases of a more general pattern in which the output frames are decimated/subsampled in a regular fashion. All such subsamplings could be supported by a unified framework in which the video is always stored with all planes fully sampled, with a header indicating the recommended subsampling for display. In such cases, the encoder can regard the transform as highly overcomplete, and simply ignore unneeded coefficients (presumably by leaving high frequency residuals coded as zero). This structure would in effect turn the codec into a motion-compensated interpolating/deinterlacing filter. Whether this approach is sensible presumably depends in part on how the transform is structured. It would be especially easy if the transform's highest-frequencies were coded by a wavelet-like layer.

Lossless intra-ability: The ability to losslessly rewrite any frame as an intra frame (perhaps with significant bitrate overhead) in order to make frame accurate cuts possible.

Or best handled by making sure that containers have working pre-roll, but presumably common GOP sizes will be greater than the number of references so even if losslessly reencoding the references is expensive it may be cheaper than pre-roll. Do both?

Can be had for 'free' if lossless is supported, plus the right header flags to restuff the references from lossless copies in a packed hidden frame.

Use of explicitly (rather than staged) super-resolution and/or deeper references may make this functionality unattractive due to increased overhead.

Internal overlays which could be swapped without re-encoding? (e.g. advertising, station ID). Could also be automatically generated by a Sufficiently Advanced™ encoder to improve efficiencies for static sprites over moving backgrounds.

Complicates making the complexity bounded. No Sufficiently Advanced™ encoder likely to ever exist. But perhaps the station id/advertising uses fully justify this.

Could be done externally to the video codec, but if so it's no likely to be useful for anyone ever.

A secondary reference implementation in OpenCL, maintained throughout development, to make sure that the codec is GPU-friendly and can be done efficiently using OpenCL primitives.

SWAR-friendly arithmetic. For example, choosing transform coefficients so that no intermediate product overflows 16 bits (tricky for signed values) can sometimes enable (e.g.) 4 parallel operations in one uint64_t. This can allow a pure C reference implementation to run faster, which is valuable for initial adoption and ports to new platforms.

Parametric decode-side blur.

Symmetrical blur in regions that are smooth on scales longer than the block size. Could be signaled or derived from observed DC values.

Motion blur so that moving objects are blurred along the motion vector. May require coding a shutter speed parameter (0..1 as a fraction of the inter-frame interval).