2/10/2010

02-10-10 - Some little image notes

1. Code stream structure implies a perceptual model. Often we'll say that uniform quantization is optimal for RMSE but is not optimal
for perceptual quality. We think of JPEG-style quantization matrices that crush high frequencies as being better for human-visual perceptual
quality. I want to note and remind myself that the coding structure by itself targets perceptual quality even if you are using
uniform quantizers. (Obviously there are gross ways this is true, such as if you subsample chroma, but I'm not talking about that.)

1.A. One way is just with coding order. In something like a DCT with zig-zag scan, we are assuming there will be more zeros in the high frequencies.
Then when you use something like an RLE coder or End of Block codes, or even just a context coder that will correlate zeros to zeros, the result
is that you will want to crush values in the high frequencies when you do RDO or TQ (rate distortion optimization and trellis quantization).
This is sort of subtle and important; RDO and TQ will pretty much always kill high frequency detail, not because you told them anything about the HVS
or any weighting, but just because that is where they can get the most rate back for a given increase in distortion - and this is just because of the way
the code structure is organized (in concert with the statistics of the data). The same thing happens with wavelet coders and something
like a zerotree - the coding structure is not only capturing correlation, it's also implying that we think high frequencies are less important and
thus where you should crush things. These are perceptual coders.
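Here's a minimal sketch of that effect. Everything in it is an assumption for illustration: the code lengths in `block_rate` are made up (they are not JPEG's Huffman tables), the names are hypothetical, and the greedy pass is a stand-in for real RDO or trellis quantization. The quantizer is uniform and the distortion is plain squared error, so nothing in it knows about the HVS.

```python
def quantize(coeffs, q):
    """Uniform quantizer: one step size for every frequency, no HVS weighting."""
    return [int(round(c / q)) for c in coeffs]

def block_rate(levels):
    """Toy rate in bits for a zig-zag ordered block (made-up code lengths).

    Each nonzero costs 2 bits + its preceding zero run + 2 bits per magnitude bit.
    The trailing zeros cost nothing beyond a 2-bit End of Block code.
    """
    bits = 2  # End of Block
    run = 0
    for v in levels:
        if v == 0:
            run += 1
        else:
            bits += 2 + run + 2 * abs(v).bit_length()
            run = 0
    return bits

def distortion(coeffs, levels, q):
    """Plain squared error, identical weight for every coefficient."""
    return sum((c - l * q) ** 2 for c, l in zip(coeffs, levels))

def zeroing_saving(levels, i):
    """Bits saved by zeroing levels[i] alone."""
    trial = list(levels)
    trial[i] = 0
    return block_rate(levels) - block_rate(trial)

def rdo_zero_pass(coeffs, q, lam):
    """Greedy pass from the last coefficient back: zero a level if J = D + lam*R drops."""
    levels = quantize(coeffs, q)
    for i in reversed(range(len(levels))):
        if levels[i] == 0:
            continue
        trial = list(levels)
        trial[i] = 0
        j_keep = distortion(coeffs, levels, q) + lam * block_rate(levels)
        j_zero = distortion(coeffs, trial, q) + lam * block_rate(trial)
        if j_zero < j_keep:
            levels = trial
    return levels

# Coefficients in zig-zag order; the three 5.0 values are identical in magnitude.
coeffs = [83.0, -29.0, 5.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0]
levels = quantize(coeffs, 8)
savings = [zeroing_saving(levels, i) for i in (2, 5, 9)]
kept = rdo_zero_pass(coeffs, 8, 3.0)
```

Zeroing any one of the three small coefficients costs the same squared error, but in this toy model the last one gives back 7 bits while the earlier two give back only 3 each, because killing an interior value just lengthens the next run. With lambda = 3 the pass zeros index 9, then index 5 (which has become the new last nonzero), and keeps index 2. The tail of the scan gets eaten purely because of the code structure.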

1.B. Any coder that makes decisions using a distortion metric (such as any Lagrangian RD based coder) is making perceptual decisions according to
that distortion metric. Even if the sub-modes are not overtly "perceptual", if the decision is based on some distortion other than MSE you can
have a very perceptual coder.
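A small sketch of what that means in practice, assuming a generic Lagrangian mode decision J = D + lambda*R. The `texture_aware` metric is a hypothetical stand-in I made up for "some distortion other than MSE" (it is not a real perceptual metric), and the two candidate modes and their bit costs are invented numbers.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def variance(a):
    m = sum(a) / len(a)
    return sum((x - m) ** 2 for x in a) / len(a)

def texture_aware(a, b):
    """Hypothetical metric: MSE plus a penalty for changing the block variance."""
    return mse(a, b) + abs(variance(a) - variance(b))

def choose_mode(original, candidates, lam, dist):
    """candidates: list of (name, reconstruction, rate_bits).

    Returns the name of the candidate minimising J = D + lam * R.
    """
    return min(candidates, key=lambda c: dist(original, c[1]) + lam * c[2])[0]

original = [10, 14, 10, 14, 10, 14, 10, 14]
candidates = [
    ("flat", [12] * 8, 4),                                  # cheap, texture wiped out
    ("textured", [11, 15, 11, 15, 11, 15, 11, 15], 8),      # costs more bits, keeps texture
]

pick_mse = choose_mode(original, candidates, 1.0, mse)
pick_texture = choose_mode(original, candidates, 1.0, texture_aware)
```

The modes themselves are identical in both runs; only the metric plugged into the decision changes, and that alone flips the choice from "flat" to "textured".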

2. Chroma. It's widely just assumed that "chroma is less important" and that "subsampling is a good way to capture this". I think that those
contentions are a bit off. What is true is that subsampling chroma is *okay* on *most* images, and it gives you a nice speedup
and sometimes a memory use reduction (half as many samples to code). But if you don't care about speed or memory use, it's not at all
clear that you should be subsampling chroma for human visual perceptual gain.
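The "half as many samples" figure can be checked with a quick count, assuming the subsampling in question is 2x in both directions on both chroma planes (4:2:0); the function name and the 640x480 frame size are just for illustration.

```python
def samples_per_frame(width, height, chroma_div_x=1, chroma_div_y=1):
    """Total samples for one luma plane plus two chroma planes."""
    luma = width * height
    chroma = 2 * (width // chroma_div_x) * (height // chroma_div_y)
    return luma + chroma

full = samples_per_frame(640, 480)        # no subsampling: 3 full planes
sub = samples_per_frame(640, 480, 2, 2)   # chroma halved in x and y
```

Three full planes become one full plane plus two quarter planes, so 3.0 units of samples drop to 1.5.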

It is true that we see high frequencies of chroma worse than we see high frequencies of luma. But we are still pretty good at locating a
hard edge, for example. What is true is that a half-tone printed image in red or blue will appear similar to the original at a closer distance
than one in green.

One funny thing with JPEG for example is that the quantization matrices are already smacking the fuck out of the high frequencies, and then they
do it even harder for chroma. It's also worth noting that there are two major ways you can address the importance of chroma : one is by killing
high frequencies in some way (quantization matrices or subsampling) - the other is how fine the DC value of the chroma should be; eg. how should
the chroma planes be scaled vs. the luma plane (this is equivalent to asking - should the quantizers be the same?).
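The equivalence between scaling the chroma planes and changing the chroma quantizer can be sketched like this. It assumes a plain uniform quantizer with round-half-up and uses exact rational arithmetic so the comparison is not muddied by floating point; the function names and sample values are hypothetical.

```python
from fractions import Fraction

def quantize(x, step):
    """Uniform quantizer with round-half-up, in exact rational arithmetic."""
    return int((Fraction(x) / Fraction(step) + Fraction(1, 2)) // 1)

def quantize_scaled_plane(chroma, scale, step):
    """Scale the chroma samples first, then quantize with the shared step."""
    return [quantize(Fraction(scale) * c, step) for c in chroma]

def quantize_with_scaled_step(chroma, scale, step):
    """Leave the chroma samples alone and divide the step by the scale instead."""
    return [quantize(c, Fraction(step) / Fraction(scale)) for c in chroma]

chroma = [-37, -5, 0, 3, 18, 100]
a = quantize_scaled_plane(chroma, Fraction(1, 2), 8)
b = quantize_with_scaled_step(chroma, Fraction(1, 2), 8)
```

Both paths produce the same levels, since (s*c)/q and c/(q/s) are the same number. Halving the chroma plane before a shared quantizer is the same as giving chroma a quantizer step twice as large.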

1 comment:

On 1): Another interesting thing in the same vein is the choice of binarization scheme when you're using binary arithmetic coding. The obvious effect is that it determines your prior model as long as you don't have sufficient statistics (and that way you could do "perceptual preconditioning"); but there are also secondary effects, because the binarization also makes some contexts a lot easier to capture than others (context within a single binarized symbol is trivial to include). Together with the prior expectation and RD optimization, this forms a positive feedback loop (you tend to favor small motion vectors because they initially take fewer bits, so they're estimated as more likely, so they take even fewer bits, and so on) - it's self-reinforcing in a way. That's not just in the beginning - at least some of the bits are going to be pretty random, so the shorter binarizations tend to be cheaper to code than the long ones. Whatever binarization you pick actually biases your coding choices towards its own distribution, even after the model has been trained on your data.
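A toy version of that feedback loop, under loud assumptions: unary binarization, one count-based adaptive probability per bin position, ideal arithmetic coding cost of -log2(p) per bit, and an encoder that repeatedly chooses between two candidate values by J = D + lambda*R. None of this is any real codec's model; the names and numbers are hypothetical.

```python
import math

class AdaptiveBit:
    """Count-based adaptive probability for one binary context; starts at 50/50."""
    def __init__(self):
        self.counts = [1, 1]

    def cost(self, bit):
        return -math.log2(self.counts[bit] / sum(self.counts))

    def update(self, bit):
        self.counts[bit] += 1

def unary(value):
    """Unary binarization: n ones followed by a terminating zero."""
    return [1] * value + [0]

class UnaryModel:
    """One adaptive context per bin position of the unary string."""
    def __init__(self, max_value):
        self.bins = [AdaptiveBit() for _ in range(max_value + 1)]

    def cost(self, value):
        return sum(self.bins[i].cost(b) for i, b in enumerate(unary(value)))

    def update(self, value):
        for i, b in enumerate(unary(value)):
            self.bins[i].update(b)

def rd_feedback(candidates, lam, rounds, max_value=4):
    """candidates: list of (value, distortion).

    Each round picks the candidate minimising D + lam * bits, records the
    pre-update cost of every candidate, then trains the model on the pick.
    """
    model = UnaryModel(max_value)
    history = []
    for _ in range(rounds):
        value, _d = min(candidates, key=lambda c: c[1] + lam * model.cost(c[0]))
        history.append((value, [model.cost(v) for v, _ in candidates]))
        model.update(value)
    return history

# Value 2 has the lower distortion, value 0 has the shorter binarization.
history = rd_feedback([(0, 10.0), (2, 8.5)], lam=1.0, rounds=10)
```

Before any statistics exist, the costs are just the binarization lengths (1 bit for value 0, 3 bits for value 2), so the untrained prior decides the first pick. Every pick of 0 then makes 0 cheaper and 2 more expensive, so the gap only widens even though 2 is the lower-distortion choice throughout.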