4/07/2010

04-07-10 - Video

Blurg, the complexity wheel turns. In the end, all the issues with video come down to two huge fundamental problems :

1. Lack of a true distortion metric. That is, we make decisions to optimize for some D, but that D is not really what
humans perceive as quality. So we try to bias the coder to make the right kind of error in a black
art hacky way.

2. Inability to do full non-greedy optimization. eg. on each coding decision we try to do a local greedy decision and hope
that is close to globally optimal, but in fact our decisions do have major effects on the future in numerous and complex ways.
So we try to account for how current decisions might affect the future using ugly
heuristics.

These two major issues underlie all the difficulties and hacks in video coding, and they are unfortunately nigh intractable.
Because of these issues, you get really annoying spurious results in coding. Some of the annoying shit I've seen :

A. I greatly improve my R/D optimizer. Global J goes up !! (lower J is better, it should have gone down).
WTF happened !? On any one block, my R/D optimizer now has
much more ability to make decisions and reach a J minimum on that block. The problem is that the local greedy optimization
is taking my code stream to weird places that then hurt later blocks in ways I am not accounting for.

B. I introduce a new block type. I observe that the R/D chooser picks it reasonably often and global J goes down. All signs
indicate this is good for coding. No, visual quality goes down! Urg. This can come from any number of problems, maybe the
new block type has artifacts that are visually annoying. One that I have run into that's a bother is just that certain block
types will have their J minimum on the R/D curve at very different places - the result of that is a lot of quality variation
across the frame, which is visually annoying. eg. the block type might be good in a strict numerical sense, but its optimum
point is at much higher or much lower quality than your other block types, which makes it stand out.

C. I completely screw up a block type, quality goes UP ! eg. I introduce a bug or some major coding inefficiency so a certain
block type really sucks. But global quality is better, WTF. Well this can happen if that block type was actually bad.
For one thing, block types can actually be bad for global J even if they are good for greedy local J, because they produce
output that is not good as a future mocomp source, or even simply because they are redundant with other block types and are a
waste of code space. A more complex problem which I ran into is that a broken block type can change the amount of bits allocated
to various parts of the video, and that can randomly give you better bit allocation, which can make quality go up even though you
broke your coder a bit. Most specifically, I broke my Intra ("I") block (no mocomp) coder, which caused more bits to go to I-like frames, which actually
improved quality.

D. I improve my movec finder, so I'm more able to find truly optimal movecs in an R/D sense (eg. find the movec that actually
optimizes J on the current block). Global J goes up (gets worse). The problem here is that optimizing the current movec can make that
movec very weird - eg. make the movec far from the "true motion". That then hurts future coding greatly.
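To make case A concrete, here's a toy sketch. All the mode names and J numbers are made up for illustration; nothing here comes from a real coder. Block 1's cost depends on what block 0 chose, which is exactly the part a greedy chooser ignores. Giving the greedy chooser an extra mode makes block 0 better and the total worse :

```python
from itertools import product

# Hypothetical numbers. J0[m] is the local J of block 0 in mode m.
# J1[(m0, m1)] is the J of block 1 in mode m1 given block 0 used m0
# (block 0 is block 1's prediction source, so its choice matters).
J0 = {"plain": 10.0, "tricky": 8.0}
J1 = {
    ("plain", "plain"): 10.0,
    ("plain", "tricky"): 9.0,
    ("tricky", "plain"): 14.0,
    ("tricky", "tricky"): 13.0,
}

def total_j(m0, m1):
    return J0[m0] + J1[(m0, m1)]

def greedy(modes):
    """Pick the best mode for block 0 alone, then the best for block 1."""
    m0 = min(modes, key=lambda m: J0[m])
    m1 = min(modes, key=lambda m: J1[(m0, m)])
    return (m0, m1), total_j(m0, m1)

def exhaustive(modes):
    """Try every pair of modes and keep the lowest total J."""
    best = min(product(modes, repeat=2), key=lambda ms: total_j(*ms))
    return best, total_j(*best)

print(greedy(["plain"]))                # only the old mode available
print(greedy(["plain", "tricky"]))      # "improved" local optimizer
print(exhaustive(["plain", "tricky"]))  # true global optimum
```

With these numbers the greedy total goes from 20 to 21 when the extra mode is allowed, while the exhaustive search finds 19. A real coder can't afford the exhaustive search, which is the whole problem.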

In most cases these problems can be patched with hacks and heuristics. The goal of hacks and heuristics is basically to
try to address the first two issues. Going back to the numbering of the two issues, what the hacks do is :

1. Try to force distortion to be "good distortion". Forbid too much quality variation between neighboring blocks. Forbid
block mode decisions that you somehow decide are "ugly distortion" even if they optimize J. Try to tweak your D metric to
make visual quality better. Note that the D tweakage here is a pretty nasty black art - you are NOT actually trying to make
a D that approximates a true human visual D, you are trying to make a D under which your codec will make decisions that produce
good global output.

2. To account for the greedy/non-greedy problem, you try to bias the greedy decisions towards things that you guess will be
good for the future. This guess might be based on actual future data from a previous run. Basically you decide not to
make the choice that is locally optimal if you have reason to believe it will hurt too much in the future. This is largely
based on intuition and familiarity with the codec.

Now I'll mention a few random particular issues, but really these themes occur again and again.

I. Very simple block modes. Most coders have something like a "direct block copy" mode, or even a "solid single color",
eg. DIRECT or "skip" or whatever. These types of blocks are generally quite high distortion and very low rate. The
problem occurs when your lambda is sort of near the threshold for whether to prefer these blocks or not. Oddly the
alternate choice mode might have much higher rate and much lower distortion. The result is that a bunch of very similar
blocks near each other in an image might semi-randomly select between the high quality and low quality modes (which
happen to have very similar J's at the current lambda). This is obviously ugly. Furthermore, there's a non-greedy
optimization issue with these types of block modes. If we compare two choices that have similar J, one is a skip type
block with high distortion, another is some detailed block mode - the skip type is bad for information conveyance to the
future. That is, it doesn't add anything useful for future blocks to refer to. It just copies existing pixels (or even
wipes some information out in some cases).
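Here's a small sketch of the flip-flopping in item I. It assumes the common Lagrangian form J = D + lambda*R (the post doesn't pin down a convention), and the distortion and rate numbers are invented so that the two modes sit right at the threshold :

```python
def rd_cost(distortion, rate, lam):
    # Assumed convention: J = D + lambda * R, lower is better.
    return distortion + lam * rate

def choose_mode(candidates, lam):
    """candidates maps a mode name to (distortion, rate); return the min-J name."""
    return min(candidates, key=lambda m: rd_cost(*candidates[m], lam))

LAMBDA = 2.0

# Four nearly identical neighbouring blocks (hypothetical numbers).
# Only the skip distortion differs, by a fraction of a percent.
blocks = [
    {"skip": (99.0, 1), "detail": (20.0, 41)},
    {"skip": (101.0, 1), "detail": (20.0, 41)},
    {"skip": (99.5, 1), "detail": (20.0, 41)},
    {"skip": (100.5, 1), "detail": (20.0, 41)},
]

modes = [choose_mode(b, LAMBDA) for b in blocks]
shown_distortion = [b[m][0] for b, m in zip(blocks, modes)]
print(modes)
print(shown_distortion)
```

The J values differ by about 1 in 100 between the two modes, but the distortion the viewer actually sees alternates between roughly 100 and 20 from one block to the next.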

II. Gradual changes need to be sent gradually. That is, if there is some part of the video which is slowly steadily
changing, such as a slow cross fade, or very slow scale/rotate type motion, or whatever - you really need to send it as
such. If you make a greedy best J decision, at low bit rate you will sometimes decide to send zero delta, zero delta,
for a while because the difference is so small, and then it becomes too big where you need to correct it and you send a
big delta. You've turned the gradual shift into a stutter and pop. Of course the decision to make a big correction
won't happen all across the frame at the same time, so you'll see blocks speckle and move in waves. Very ugly.
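The stutter in item II falls straight out of a greedy J test. A minimal sketch, again assuming J = D + lambda*R with squared error as D; the bit costs and lambda are made-up illustrative values :

```python
def greedy_fade(true_values, lam=2.0, zero_bits=1, delta_bits=8):
    """Track a slowly changing value with a per-frame greedy J decision.

    Each frame either sends "zero delta" (cheap, keeps the old value)
    or sends an exact correction (expensive, zero distortion).
    Returns the reconstructed value for every frame.
    """
    recon = 0
    out = []
    for target in true_values:
        err = target - recon
        j_zero = err * err + lam * zero_bits  # keep the stale value
        j_fix = 0 + lam * delta_bits          # pay for a full correction
        if j_fix < j_zero:
            recon = target
        out.append(recon)
    return out

# A slow fade: the true value rises by 1 every frame.
print(greedy_fade(range(1, 9)))
```

The input is a smooth ramp 1, 2, 3, ... but the output holds at 0 for three frames, jumps to 4, holds, then jumps to 8. Different blocks would hit their jump on different frames, which is the speckle.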

III. Rigid translations need to be preserved. The eye is very sensitive to rigid translations. If you just let the movec
chooser optimize for J or D it can screw this up. One reason is that very small motions or movements of monotonous
objects might slip to movec = 0 for code size purposes. That is, rather than send the correct small movec, it might decide
that J is better by incorrectly sending a zero delta movec with a higher distortion. Another reason is that the actual
best pixel match might not correspond to the motion, you can get anomalies, especially on sliding patterned or semi-patterned
objects like grass. In these cases, it actually looks better to use the true motion movec even if it has larger numerical
distortion D to do so. Furthermore there is another greedy/non-greedy issue. Sometimes some non-true-motion movec might
give you the best J on the current block by reaching out and grabbing some random pixels that match really well. But
that screws up your motion field for the future. That movec will be used to condition predictions of future movecs. So say
you have some big slowly translating field - if everyone picks nice true motion movecs they will also be coherent, but if
people just get to pick the best match for themselves greedily, they will be a big mess and not predict each other. That
movec might also be used by the next frame, the previous B frame, etc.

IV. Full pel vs. half/quarter/sub-pel is a tricky issue. Sub-pel movecs often win in a strict SSD sense; this is partly
because when the match is imperfect, sub-pel movecs act to sort of average two guesses together; they produce a blurred
prediction, which is optimal under L2 norm. There are some problems with this though; sub-pel movecs act to blur the
image, they can stand out visually as blurrier bits; they also act to "destroy information" in the same way that simple
block modes do. Full pel movecs have the advantage of giving you straight pixel copies, so there is no blur or destruction
of information. But full pel movecs can have their own problem if the true motion is subpel - they can produce wiggling.
eg. if an area should really have movecs around 0.5 , you might make some blocks where the movec is +0 and some where it is +1.
The result is a visible dilation and contraction that wiggles along, rather than a perfect rigid motion.
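The "averaging two guesses wins under L2" point in item IV can be checked with a few numbers. This sketch models a half-pel prediction as the plain average of two full-pel candidates, which is a simplification (real interpolation filters are typically longer), and the pixel values are invented :

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length pixel rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

target = [10, 20, 30, 40]

# Two imperfect full-pel matches whose errors land on different pixels.
guess_a = [14, 20, 26, 40]  # errors +4, 0, -4, 0
guess_b = [10, 24, 30, 36]  # errors 0, +4, 0, -4

# Stand-in for a sub-pel prediction: average the two candidates.
blend = [(x + y) / 2 for x, y in zip(guess_a, guess_b)]

print(ssd(target, guess_a), ssd(target, guess_b), ssd(target, blend))
```

Each full-pel guess has SSD 32 and the blend has SSD 16, so the blend wins numerically. But the blend is wrong on every pixel by a smaller amount, while each full-pel guess has two pixels exactly right. That's the blur and the "destroyed information".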

V. A good example of all this evil is the movec search in x264. They observed that allowing very large movec search ranges
actually decreases quality (vs. more local incremental searches). In theory if your movec chooser is using the right
criterion, this should not be - more choices should never hurt, it should simply not choose them if they are worse.
Their problem is twofold - 1. their movec chooser is obviously not perfect in that it doesn't account for current cost
completely correctly, 2. of course it doesn't account for all the effects on the future. The result is that using some
heuristic seed spots for the search which you believe are good coherent movecs for various reasons, and then doing small
local searches actually gives you better global quality. This is a case where using "broken" code gives better results.
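For reference, the two search shapes being compared in item V look roughly like this. This is a minimal sketch of the general idea only, not x264's actual search; the function names, the pure-SAD criterion, and the synthetic frames are all assumptions for illustration :

```python
def sad(cur, ref, bx, by, mv, size):
    """Sum of absolute differences for the block at (bx, by) in `cur`
    against the block displaced by mv = (dx, dy) in `ref`."""
    dx, dy = mv
    return sum(
        abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
        for y in range(size)
        for x in range(size)
    )

def in_bounds(ref, bx, by, mv, size):
    dx, dy = mv
    return (0 <= bx + dx and bx + dx + size <= len(ref[0])
            and 0 <= by + dy and by + dy + size <= len(ref))

def best_of(cur, ref, bx, by, size, candidates):
    """Return (cost, mv) for the cheapest in-bounds candidate, or None."""
    best = None
    for mv in candidates:
        if not in_bounds(ref, bx, by, mv, size):
            continue
        cost = sad(cur, ref, bx, by, mv, size)
        if best is None or cost < best[0]:
            best = (cost, mv)
    return best

def window(center, radius):
    """All movecs within `radius` of `center`, in raster order."""
    cx, cy = center
    return [(cx + dx, cy + dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]

def full_search(cur, ref, bx, by, size, radius):
    """Exhaustive search over one big window around movec (0, 0)."""
    return best_of(cur, ref, bx, by, size, window((0, 0), radius))

def seeded_search(cur, ref, bx, by, size, seeds, radius=1):
    """Small local searches around a few heuristic seed movecs."""
    cands = [mv for seed in seeds for mv in window(seed, radius)]
    return best_of(cur, ref, bx, by, size, cands)

# Tiny synthetic frames: `cur` is `ref` shifted left by one pixel,
# so the true movec for interior blocks is (1, 0).
ref = [[x * x + 3 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]

print(full_search(cur, ref, 3, 3, 2, radius=3))
print(seeded_search(cur, ref, 3, 3, 2, seeds=[(0, 0)]))
```

On clean synthetic data like this both searches land on the same movec. The difference only shows up on real video, where the big window lets the chooser reach far-away pixels that match well numerically but have nothing to do with the motion.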

In fact it is a general pattern that using very accurate local decisions often hurts global quality, and often using
some tweaked heuristic is better. eg. instead of using true code cost R in your J decision, you make some functional
fit to the norms of the residuals; you then tweak that fit to optimize global quality - not to fit R. The result is
that the fit can wind up compensating for the greedy/non-greedy and other weird factors, and the approximation can
actually be better than the more accurate local criterion.

4 comments:

One thing to consider is that not only do the optimization processes only improve locally, the metrics are actually unable to measure anything else.

Common distortion metrics for images are basically sums of error measurements over small windows. For all SSD-based metrics, the windows are as small as they can be - single pixels. SSIM uses larger windows with some overlap, but it's still the case that very visible global consistency violations (e.g. an edge being interrupted in a single block) are only detected (and hence only influence measured distortion) in a small number of those windows.

Even without a real understanding of the HVS, it's possible to get out of that "local distortion ghetto" simply by aggregating local measures with other means than just plain summation.

Case in point: the skip/no-skip decision you mentioned. Instead of using R + lambda*D, minimize R + lambda*D + lambda2*||grad(D)|| (using the spatial "gradient", in this case distortion differences between neighboring blocks). The higher lambda2 gets, the less attractive isolated blocks with much higher distortion than their surroundings become. Of course you can also get fancy and allow arbitrary distortion along boundaries (which are also determined during the minimization) - then you need to add a penalty term for the length of the boundary as well (effectively clamping the gradient term at some maximum value per block). There's tons of ways to do this kind of stuff.

Adding this kind of term makes global optimization hairy, but it's quite easy to look at a small local change and see whether it improves the modified metric or not. In short, an actual implementation boils down to a couple of rules that "look" heuristic (and in fact might be identical to heuristics you're already using) but actually work towards locally improving a more complete distortion metric.

I think this is the way forward - imperfectly minimizing a more complete cost function is a far better theoretical model to work from than perfectly minimizing a simple function and then doing local changes to "make it look good" without any real justification from your model.
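A minimal sketch of the "look at a small local change" test described in this comment. It uses the comment's R + lambda*D form, simplifies the frame to a single row of blocks, and takes ||grad(D)|| to be the sum of absolute distortion differences between adjacent blocks; those simplifications and all the numbers are assumptions for illustration :

```python
def modified_cost(rates, dists, lam, lam2):
    """R + lam*D + lam2*||grad(D)|| over a 1D row of blocks.

    ||grad(D)|| is taken as the L1 norm of neighbour differences (assumed).
    """
    grad = sum(abs(a - b) for a, b in zip(dists, dists[1:]))
    return sum(rates) + lam * sum(dists) + lam2 * grad

def improves(rates, dists, i, new_rate, new_dist, lam, lam2):
    """True if switching block i to (new_rate, new_dist) lowers the cost."""
    r2, d2 = list(rates), list(dists)
    r2[i], d2[i] = new_rate, new_dist
    return modified_cost(r2, d2, lam, lam2) < modified_cost(rates, dists, lam, lam2)

# Five detailed blocks (rate 40, distortion 20); consider turning the
# middle one into a skip block (rate 1, distortion 100).
rates = [40] * 5
dists = [20] * 5

print(improves(rates, dists, 2, 1, 100, lam=0.45, lam2=0.0))  # plain R/D
print(improves(rates, dists, 2, 1, 100, lam=0.45, lam2=0.1))  # with gradient term
```

With lambda2 = 0 the isolated skip block is accepted; with lambda2 = 0.1 the same change is rejected because it would stick out from its neighbors.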

"minimize R + lambda*D + lambda2*||grad(D)|| instead (using the spatial "gradient", in this case distortion differences between neighboring blocks). The higher lambda2 gets, the less attractive isolated blocks with much higher distortion than their surroundings become."

Don't squared distortion metrics already achieve this, encouraging a relatively level amount of error throughout the image, and without introducing a second parameter to optimize on? (The problem being when we find that sums of absolutes seem to produce better results than squares. Maybe we should be raising to an intermediate power?)

"Don't squared distortion metrics already achieve this, encouraging a relatively level amount of error throughout the image, and without introducing a second parameter to optimize on?"

They reduce overall error variance, but they don't care at all about spatial variation. For a squared distortion metric, a block with error 100 is a block with error 100 no matter where it appears in the image; if you include the gradient term, it's suddenly a lot cheaper to place that block in a neighborhood with average error around 90 than it would be to place it in a neighborhood with blocks of average error around 20.

I agree that introducing an extra parameter is problematic; in practice you'll want to choose lambda2 as some function of lambda and only optimize for one parameter.

However, that's not my main point; the idea is that you can view the whole process of "do RD optimization, then apply some heuristics" as "optimize simplified version of target function to get an initial solution, then iteratively perform local changes to optimize full target function". If you can write a heuristic as a term in the target function (even if somewhat unwieldy), you're in good shape: doing so makes it much clearer what the model is, and there might even be a way to optimize it directly. The other way round is just as interesting - any term that is difficult to incorporate in a full optimization round but easy to evaluate "incrementally" (i.e. "does this change reduce overall error"?) translates to a local post-optimization test in the code. But unlike arbitrary heuristics, in this case you know that it's actually improving some well-defined target function. That's much better than just knowing that "this video looks a bit better when I turn this on".

(also BTW the gradient of D term is most interesting if it also includes temporal gradient)

One problem is that the improved target function that you want to optimize is itself just a heuristic. Even in the grad D case you get into "what should I tweak lambda2 to be so it looks good". The most extreme case is x264 where the SATD target function is a very strange heuristic that sort of happens to work out well due to the structure of the coder.

I also think that the quality goal is just too complex to express and use in a practical way. But in that case even if you can't actually use it in your R/D optimizer you could use it as a verification pass to make sure you actually are optimizing something concrete.