1/12/2010

01-12-10 - Lagrange Rate Control Part 2

Okay, so we've talked a bit about lagrange coding decisions. What I haven't mentioned yet is that we're implicitly talking about
rate control for video coders. "Rate control" just means deciding how many bits to put in each frame. That's just a coding decision.
If the frames were independent (eg. as in Motion JPEG - no mocomp) then we know that lagrange multiplier R/D decisions would in fact
be the optimal way to allocate bits to frames. Thus, if we ignore the frame-dependence issue, and if we pretend that all distortions
are equal - lagrange rate control should be optimal.

What does lagrange rate control mean for video? It means you pick a single global lambda for the whole video. This lambda tells you
how much a bit of rate should be worth in terms of distortion gain. Any bit which doesn't give you at least lambda worth of distortion gain
will not be coded. (we assume bits are of monotonically decreasing value - the first bit is the most important, the more bits you send the
less value they have). On each frame of video you want to code the frame to maximize J. The biggest single control parameter to do this is
the quantizer. The quantizer will largely control the size of the frame. So you dial Q to maximize J on that frame, this sets a frame size.
(maximizing J will also affect movec choices, macroblock mode choices, etc).

Frames of different content will wind up getting different Q's and very different sizes. So this is very far from constant bit rate or
constant quantizer. What it is is "constant bit value". That is, all frames are of the size where adding one more bit does not help by
lambda or more. Harder to code (noisy, fast motion) frames will thus be coded with more error, because it takes more bits to get the same
amount of gain in that type of frame. Easy to code (smooth, flat) frames will be coded with much less error. Whether or not this is good
perceptually is unclear, but it's how you get optimal D for a given R, assuming our D choice is what we want.

Ideally you want to use a single global lambda for your whole video. In practice that might not be possible. Usually the user wants to
specify a certain bit rate, either because they actually need to meet a bit rate maximum (eg. for DVD streaming), or because they want
to specify a maximum total size (eg. for fitting on your game's ship DVD), or simply because that's an intuitive way of specifying "quality"
that people are familiar with from MP3 audio and such. So your goal is to hit a rate. To do that with a single global lambda, you would have
to try various lambdas, search them up and down, re-encode the whole video each time. You could use binary search (or maybe interpolation
search), but this is still a lot of re-encodings of the whole video to try to hit rate. (* more on this later)

Aside : specifying lambda is really how people should encode videos for distribution as downloads via torrents or whatever. When I'm storing
a bunch of videos on my hard disk, the limiting factor is my total disk size and the total download time - I don't need to limit how big
each individual movie is. What I want is for the bits to go where they help me most. That's exactly what lambda does for you. It makes no
sense that I have some 700 MB half hour show that would look just fine in 400 MB , while I have some other 700 MB show that looks like shit and
could really use some more bits. Lambda is the right way to allocate hard drive bytes for maximum total viewing quality.

Okay. The funny thing is that I can't find anyone else on the web or in papers talking about lagrange video rate control. It's such an
obvious thing that I expected it to be the standard way, but it's not.

What do other people do? The de-facto standard seems to be what x264 and FFMPEG do, which I'll try to roughly outline (though I can't say
I get all the details since the only documentation is the very messy source code). Their good mode is two pass, so I'll only talk about that.

The primary thing they do is pick a size for each frame somehow, and then try to hit that size. To hit that frame size, they search QP
(the quantization parameter) a bit. Specifically, they only search QP in the local neighborhood of the previous QP because they want to limit
QP variation between frames (the range of search is a command line parameter - in fact almost everything in this is a command line parameter
so I'll stop saying that). When they choose a QP, there's a heuristic formula for H264 which specifies a lambda for lagrange decisions that
corresponds to that QP. Note that this lambda is only used for inside-the-frame coding decisions, not for choosing QP or global rate allocation.
Also note that the lambda-QP relationship is not necessarily optimal; it's a formula (there are a bunch of H264 papers about
making good lambda-QP functional fits and searches). They also do additional funny things like run a blurring pass on QP to smooth out variation;
presumably this is a heuristic for perceptual purposes.

So the main issue is how do they pick this frame size to target? So far as I can tell it's a mess of heuristics. For each frame they have
a "complexity" measure C. On the first pass C is computed from entropy of the delta or some such thing, raised to the 0.8 power (purely
heuristic I believe). The C's are then munged by some fudge factors (that can be set on the command line) - I frame sizes are multiplied by
a factor > 1 that makes them bigger, B frame sizes are multiplied by a factor < 1 that makes them smaller. Once all the "complexities" are
chosen, they are all scaled by (target total size) / (total complexity) to make them add up to the total target size. This sets the desired
size for each frame.

Note that this heuristic method has many of the same qualitative properties as full lagrangian allocation - that is, more complex frames will get
more bytes than simpler frames, but not *enough* more bytes to give them the same error, so more complex frames will have larger errors than
simpler frames. However, quantitatively there's no guarantee that it's doing the same thing.

So far as I can tell the lagrange method is just better (I'm a little concerned about this because it seems so vastly, obviously better
that it disturbs me that not everyone is doing it). Ignoring real world issues we've glossed over, the next big problem is the fact
that we have to do this search for lambda, so we'll talk about that next time.

Daiz: at the same bitrate or even at higher bitrates, fades ended up worse-looking with mbtree on than with mbtree off. It's a shame since the mbtree encodes look better everywhere else.

DS: This seems to be inherent in the algorithm and I'm not entirely sure how to resolve it... it was a problem in RDRC as well, in the exact same way. ... MB-tree lowers the quality on things that aren't referenced much in the future. Without weighted prediction, fades are nearly all intra blocks...

This seems almost like a bug... if you view it as "steal bits from other things to increase the quality of source blocks" it makes sense, but presumably putting extra bits into the source blocks actually reduces the number of bits you need later to hit the target quality.

I guess at really low bit rates where you just copy blocks without making ANY fixup at all then you're just stealing bits.

"Yes, qualitatively, other frames are improved with mb-tree, but the trend seems to be fades are significantly worse. It's almost as if bitrate is redistributed from fades/low luminance frames to other frames"

"mbtree can definitely be beneficial in some scenarios, but it almost always comes with signicantly worse fades and dark scenes. Perhaps the default --qcomp value isn't optimal, but increasing it will lower the mbtree effect, and basically we are back to square one. What i am seeing is a sort of "tradeoff." Some frames are improved at the expense of others. But the "expense" is quite severe in my opinion, at least with default qcomp. I'm looking for more balance."

This isn't really surprising, it's one of those consequences of using a "D" measure that doesn't match the HVS ; mb-tree and lagrange rate control and all that will move bits around to minimize D , which usually means not giving many bits to low luma stuff. That's normally correct, but I guess is occasionally terrible.

I'll write more about this in part 4 some day.

"but presumably putting extra bits into the source blocks actually reduces the number of bits you need later to hit the target quality."

Not necessarily, you *hope* that putting more bits in source blocks helps overall, but you can't afford to test every bit movement, so you just use some heuristics that usually help.

This is probably either obvious or useless, but -- for rate control, would it make sense to encode a few frames ahead, while keeping intermediate data structures, and then if a rate adjustment has to be made, re-use the partial work to more cheaply re-encode the same few frames with an adjusted quality?