Clearly there is a lot of win to be had from good chroma sampling. Let's talk a bit about what's going on.

(BTW the ffmpeg results don't directly compare to my_y4m, they're just there for reference and as a sanity check;
for example, I use symmetric-centered (JPEG style) 420 subsampling and I think they use offset MPEG-style 420
subsampling; I'm also still not sure whether they are in 16-235 or 0-255, while I am definitely in 0-255.)

First of all, just in terms of basic filtering, you might do your operation like this :
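Sketched in code, that straightforward pipeline might look like this (a hedged sketch: the function names, the full-range JPEG-style coefficients, and the plain 2x2 box filters are my choices for illustration, not from the post):

```python
# Naive pipeline: RGB -> YCbCr, filter CbCr down 2x, filter it back up,
# YCbCr -> RGB.  Full-range JPEG/BT.601 coefficients assumed.

def rgb_to_ycbcr(r, g, b):
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.4187 * g - 0.0813 * b + 128.0
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.34414 * (cb - 128.0) - 0.71414 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b

def box_down2(plane):
    # average each 2x2 block into one low-res sample
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) * 0.25
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def box_up2(plane):
    # replicate each low-res sample over its 2x2 block
    out = []
    for row in plane:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out
```

Swapping `box_down2` / `box_up2` for fancier kernels is the "try various filters" experiment discussed next.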

So, you might think that's a reasonable way to go, and try various filters. I did experiments on this
before (see
this blog post
and in particular the comment at the end). I found that fancy filters don't really help much, that the
best thing is just doing bilinear reconstruction and a special optimized downsample filter that I call
"down for bilinear up".

But once you think about it, there's no reason that we should follow this particular process for making our
YUV. In particular, downsampling in chroma space does some very weird things that you might not expect.
A lot of the weirdness comes from the fact that the RGB we care about is clamped in the range [0-255].
And our advantage is that we have Y at higher resolution.

The writing on the Twibright page is very weird, so I'll paraphrase briefly. One of the problems with chroma subsampling is that it's
not light linear. eg. averaging Cb and Cr does not produce a resulting color which is the average of what your
eye will see. Instead of subsampling CbCr , you should instead solve for the YCbCr which produces the light-linear
color that you would see for the average of those 2x2 chromas. The easiest way to do this is just to subsample CbCr
in some way, and then instead of computing Y from the original RGB, you turn it into a solve to find the Y,
given CbCr, that produces the best result.
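For one pixel that solve has a neat closed form, since Y enters each of R, G, B with coefficient 1. A hedged sketch (my construction, assuming full-range JPEG-style YCbCr):

```python
# With CbCr fixed, each of R,G,B decodes to Y plus a known chroma term, so
# the least-squares Y is just the mean of (target - chroma_term) over the
# three channels.  Names and coefficients are my illustrative choices.

def chroma_terms(cb, cr):
    # per-channel contribution of the (already subsampled) chroma
    return (1.402 * (cr - 128.0),
            -0.34414 * (cb - 128.0) - 0.71414 * (cr - 128.0),
            1.772 * (cb - 128.0))

def best_y_for_chroma(target_rgb, cb, cr):
    # least-squares optimal Y for this pixel, given its (fixed) chroma
    terms = chroma_terms(cb, cr)
    return sum(t - k for t, k in zip(target_rgb, terms)) / 3.0
```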

The next good idea from the "Twibright" page is just to abandon the idea of computing the YCbCr from a matrix in
general. We know what the decoder will do to upsample, so instead we just take the idea that our encoder should
output the coefficients which will make the best result after upsampling. In the "Hyperluma" algorithms, he
sets his goal as preserving constant actual luma ( luma is not "Y" , it's actual perceived brightness). Basically
he does the chroma subsample using RGB, and then, given the low-res chroma and the original high-res RGB, solves
for the Y that gives you the correct luma.

The "Luminaplex" algorithm takes this one step further and does a brute force optimization. In the end with compression
if you want absolute optimality it always comes down to this - wiggle the discrete values, see what the output is,
and take the best. (we saw this with DXT1 as well).
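A minimal sketch of that wiggle-and-measure idea for a single pixel's Y (my illustration, not the actual Luminaplex code; it assumes full-range JPEG-style YCbCr and a clamping decoder):

```python
# Brute force over the discrete Y values: decode each candidate through the
# (clamping) decoder and keep the one with the least squared RGB error.

def decode_px(y, cb, cr):
    # what the decoder produces for one pixel, clamped to [0,255]
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.34414 * (cb - 128.0) - 0.71414 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return tuple(min(255.0, max(0.0, c)) for c in (r, g, b))

def best_discrete_y(target_rgb, cb, cr):
    # scan all 256 integer Y values; for real use you'd only wiggle around
    # the continuous solution, but the principle is the same
    def err(y):
        out = decode_px(float(y), cb, cr)
        return sum((o - t) ** 2 for o, t in zip(out, target_rgb))
    return min(range(256), key=err)
```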

On the Twibright page he claims that he can implement Luminaplex for box upsampling and still get great results. I
found that to not be true on most real images, you need at least bilinear upsampling. To solve for the optimal YCbCr
for a given RGB target image, you have a large coupled linear system. That is, for each pixel, you have to consider
the current CbCr and also the neighbors, and for those neighbors, you have to consider their neighbors. This is
a sparse semi-diagonal linear system. In particular, you are solving a linear system like this :

find x to minimize | A x - b |
b = the target RGB values
(b has 3 * # of pixels values)
x = the YCbCr values for the output
w*h of Y, (w*h)/4 each of Cb and Cr
A = the YCbCr -> RGB matrix plus upsample filter
3*w*h rows
(3/2)*w*h columns
for each R and B row, there are 5 non-zero entries
(1 for Y and 4 for Cr or Cb)
for each G row there are 9 non-zero entries
(1 for Y and 4 for Cr and 4 for Cb)

you can solve this reasonably easily with a standard sparse matrix linear solver. Note that in this form we are directly solving
for minimum RGB RMSE , but you can of course solve for other metrics (that are linear transformations of RGB anyway).
The fours in the number of terms come from the four-tap bilinear upsampling filter; bigger filters have more non-zero terms and so make the matrix
much fatter in memory and slower to solve.
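To make the sparsity pattern concrete, here's a sketch (my construction, not code from the post) of how one R row of A could be assembled for 2x bilinear chroma upsampling with JPEG-centered siting; the column layout, tap placement, and edge clamping are illustrative assumptions:

```python
# Column layout assumed: [Y: w*h][Cb: (w/2)*(h/2)][Cr: (w/2)*(h/2)]

def chroma_taps(x, y, cw, ch):
    # the 4 bilinear taps into the half-res chroma plane for high-res pixel
    # (x,y), JPEG-style centered siting; taps that fall off the border are
    # clamped (and merge, so border rows have fewer distinct non-zeros)
    fx, fy = x / 2.0 - 0.25, y / 2.0 - 0.25
    x0, y0 = int(fx // 1.0), int(fy // 1.0)
    ax, ay = fx - x0, fy - y0
    taps = {}
    for dy, wy in ((0, 1.0 - ay), (1, ay)):
        for dx, wx in ((0, 1.0 - ax), (1, ax)):
            cx = min(max(x0 + dx, 0), cw - 1)
            cy = min(max(y0 + dy, 0), ch - 1)
            taps[cy * cw + cx] = taps.get(cy * cw + cx, 0.0) + wx * wy
    return taps

def r_row(x, y, w, h):
    # sparse R row for pixel (x,y): R = Y + 1.402*(Cr - 128), so one Y entry
    # plus the 4 Cr taps -- 5 non-zeros in the interior, as stated above
    cw, ch = w // 2, h // 2
    row = {y * w + x: 1.0}                        # the Y column
    for col, wgt in chroma_taps(x, y, cw, ch).items():
        row[w * h + cw * ch + col] = 1.402 * wgt  # the Cr block
    return row
```

The G row is built the same way with both the Cb and Cr taps, giving the 9 non-zeros.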

If you implement this you get "solve lsqr for linear up" in the results above. Note that this has the problem
mentioned in the last post. I actually want to solve for discrete and clamped YCbCr, but it's too hard, so I just
solve the equation as if they were continuous, and then round to the nearest int and clamp to 0-255. To improve
this, I actually re-find Y from chroma after the round-and-clamp. The loss from going discrete is this bit :

float solution rmse : 1.6250
rmse : 1.7400

I believe that this is as good as you can do for a decoder which operates in the simple linear way. That is, up to
now we have assumed that we have a generic decoder, so we can use these techniques to make optimal baseline-compatible
JPEGs or H264 videos or whatever. But there are things we can do on the decode side as well, so let's look at that.

Glenn makes the realization that many of the YCbCr produced by normal subsampling are actually impossible. That is, they
can't come from any RGB in [0-255]. When you see an impossible YCbCr, you know it was caused by downsampling, which means
you know that some chroma was put on your pixel that should have been on a neighbor's pixel. For the moment let's
pretend we have box sampling. In a 2x2 block we have some black pixels and some red pixels. When you box downsample you
will get the average of the red chroma (call it Cr = 1) and the black chroma (Cr = 0). When you box upsample you will
get Cr = 0.5 over all four pixels. But now you have a pixel with Y = 0 and Cr = 0.5 ; the only way to make a zero Y but
with some red chroma would be for it to have negative G and/or B. So this must be a mistake - when we see Y = 0 and Cr = 0.5,
we know that the chroma on this pixel must have "spilled" onto us from our neighbor incorrectly. To fix it, we just
take our unwanted Cr and push it over to our neighbor, and we get a perfect result - the Y = 0 pixel gets Cr = 0 and is black,
and the Cr = 0.5 red pixel gets pushed up to Cr = 1.

Glenn works out how much chroma a pixel can hold for a given Y. One way to think about this is to think of the RGB->YCbCr as a
rotation (+ shear and scale, but you can think of it as rotation for our purposes). You've taken the RGB axial box and have put
a new box around it in a rotated space. To completely cover the range of the original box, we have to use a much bigger box
in this new space. The result is a large amount of empty space in the YCbCr box which did not come from the original RGB box.
Handling this better is a general problem for all compressors with color conversions - we often code YCbCr as if they have
full range, but in fact after we have seen Y we know that the range for CbCr might be much smaller.

There's another way of getting the same result, which is to use the fact that we know our Y is more reliable than our CbCr.
That is, use your YCbCr to reproduce RGB. Now see if the RGB are all in [0-255] ; if they are, you are fine. If not, you
have to clamp them. Now recompute Y from RGB (something like 0.2R + 0.7G + 0.1B). Because of the clamping, this will now
be wrong, eg. not match the transmitted Y. So what we are doing is ensuring that the Y of the output RGB is equal to the Y
we transmitted. To achieve that, we adjust CbCr so that we are not clamping the RGB.
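One way to sketch that decoder-side fix (my construction, not Glenn's actual code): if the decoded RGB leaves [0,255], shrink the chroma toward neutral until it fits. Because the chroma terms cancel out of the luma sum, this keeps the luma of the output equal to the transmitted Y:

```python
# Scale CbCr toward neutral (128) until the decoded RGB is in gamut.
# Bisection on the scale factor is the lazy-but-robust choice here.

def decode(y, cb, cr):
    return (y + 1.402 * (cr - 128.0),
            y - 0.34414 * (cb - 128.0) - 0.71414 * (cr - 128.0),
            y + 1.772 * (cb - 128.0))

def in_gamut(rgb, eps=1e-6):
    return all(-eps <= c <= 255.0 + eps for c in rgb)

def decode_luma_preserving(y, cb, cr):
    if in_gamut(decode(y, cb, cr)):
        return decode(y, cb, cr)
    lo, hi = 0.0, 1.0  # bisect on the chroma scale factor
    for _ in range(30):
        s = 0.5 * (lo + hi)
        scaled = (128.0 + s * (cb - 128.0), 128.0 + s * (cr - 128.0))
        if in_gamut(decode(y, *scaled)):
            lo = s
        else:
            hi = s
    return decode(y, 128.0 + lo * (cb - 128.0), 128.0 + lo * (cr - 128.0))
```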

The main limitation of Glenn's method is that it only helps when you are pushing pixels into illegal values. eg. the black next
to red example above was helped enormously, but if it was instead grey next to red, then no illegal value would have been made
and we would have done nothing. (eg. on my_soup it only gave us 3.1129 -> 3.0603)

The other problem with Glenn's method is that it is rather slow in the decoder, too slow for something like video (but certainly okay
for a JPEG loader in Photoshop or something like that).

There are some simplifications/approximations of Glenn's method which would probably be viable in video.

One is to compute an approximate "chroma capacity" based on Y, and then for each 2x2 box upsample, instead of putting the chroma onto
each pixel with equal weight, you do it weighted by the chroma capacity. Chroma capacity is a triangle function of Y, something like
min( Y, 255-Y ). So a 2x2 box upsample adjusted for capacity is just :
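In code, that capacity-weighted 2x2 box upsample might look like this (a sketch under my reading of the method; normalizing so the block average still equals the transmitted chroma sample is my assumption):

```python
# Each pixel of the 2x2 block takes a share of the block's chroma deviation
# proportional to its capacity min(Y, 255-Y).

def capacity(y):
    return min(y, 255.0 - y)

def box_up_weighted(ys, c):
    # ys: the 4 high-res Y values of one 2x2 block; c: its one chroma sample
    caps = [capacity(y) for y in ys]
    total = sum(caps)
    if total == 0.0:
        return [c] * 4  # all-black/all-white block: nothing can hold chroma
    dev = c - 128.0
    return [128.0 + dev * 4.0 * cap / total for cap in caps]
```

On the red-and-black example: black pixels (capacity 0) get neutral chroma, and the red pixels get the chroma pushed back up toward full strength.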

(this is similar to his "proportion" method but much simplified). On the synthetic case of the red and black quad, this produces the
same results as the more expensive method. On real images it's slightly worse.

Another approach to accelerating Glenn's method would be to just go ahead and do the YCbCr->RGB on each pixel, and then when you do
the clamp (which you must do anyway), use that branch to spill to neighbors, and compute the spill amount directly from how far
your RGB is pushed out of [0..255] , eg. if B is -4 , then (-4 / Cb_to_B) worth of Cb goes onto the neighbor.
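A sketch of just that overflow-to-chroma conversion (hypothetical helper names; the constants are the standard JPEG-style Cb-to-B and Cr-to-R gains):

```python
# How much chroma deviation belongs on a neighbor, computed from how far the
# decoded R and B land outside [0,255].  eg. B = -4 means (-4 / 1.772) worth
# of Cb deviation should be spilled.

CB_TO_B = 1.772
CR_TO_R = 1.402

def spill_amounts(r, g, b):
    def overflow(c):
        if c < 0.0:
            return c           # undershoot: too much chroma one way
        if c > 255.0:
            return c - 255.0   # overshoot: too much the other way
        return 0.0
    return overflow(b) / CB_TO_B, overflow(r) / CR_TO_R
```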

I've only implemented the "spill" method for box sampling, but you can do it for any type of upsampling filter. It's a little
awkward though, as you have to implement your upsampler in a sort of backwards way; rather than iterating on the high res pixels
and sampling from the subsampled CbCr plane with some filter and accumulating into each output pixel only once, instead you
need to iterate over the low res subsampled CbCr and create the filter output and add it into various target pixels.

There's one final method to look at which we've not implemented yet.
Glenn's method is of course a form of "luma aided chroma upsample", but it's not what I'm usually referring to when I say that.
What we usually mean is using the luma edge information to select the chroma upfilter. That is, rather than always doing bilinear chroma
upsample or bicubic or sinc or whatever, you do a decision tree on the luma pixels and choose one of several filters which have various
shapes. This is actually a variant of the old "super resolution" problem. We have a low res signal and wish to make the best possible
high res output, possibly with the help of some side information. In this case we have the high res Y plane as side information which
we believe is well correlated; in many of the "super resolution" papers in the literature what they have is previous video frames,
but the techniques are quite similar. I've seen some papers on this but no implementation (though I hear some TV's actually have a
hacky version of this called "warpscan" or something like that).

Training filters for various edge scenarios is relatively straightforward, so I might do that in the next few days.

BTW one scary issue for all these fancy filters is if you don't control the decoder or know its exact spec. In particular, TVs
and such are doing fancy chroma upsampling of their own now, which means your manipulations could make things worse.

Also something I have observed with chroma sampling is that minimizing any simple analytic metric like RMSE can lead to ringing.
This is a very generic issue in lossy compression (it's roughly the same reason that you get ringing in wavelets). In order to reduce
a very large single pixel error, the encoder will apply something like a sinc filter shape to that pixel. That might cut the
error at that pixel from 100 to 50, which is a huge win, but it also adds a low magnitude ringing around that pixel (maybe magnitude 5
or so). In RMSE terms this is a big win, but visually it's very bad to create that ringing around the single bad pixel, better to
just leave its big error alone. (nonlinear error metrics might help this, but nonlinear error metrics don't lead to simple to solve
linear matrix equations)

2 comments:

We know what the decoder will do to upsample, so instead we just take the idea that our encoder should output the coefficients which will make the best result after upsampling.

I've long argued this as an approach to consider for mipmap generation: we know they're reconstructed with bilerp, so optimize the downsample of the mipmaps around reconstructing as best as possible after bilerp.

Of course, we don't actually display texture mipmaps upsampled this way, so it's maybe not actually a good argument (instead they are better off with the bilerp resampling being a good approximation to a good-looking downsampling/blur of the source data, in which case we're back where we started).

Also, I should mention I actually *tried* this, using RMSE, and observed the ringing problem you reported.

I tried augmenting the linear programming problem with monotonicity constraints:

If x_0 < x_1 < x_2 < x_3, then require downsampled x_01 < x_23, but I don't recall ever getting any good results from this. Also monotonicity may not be good enough; maybe you want to make sure that if a gradient is smooth it stays smooth, not just monotonic; e.g. if the derivative (finite difference) over six samples is monotonically increasing, then so should the derivative (finite difference) over the "corresponding" downsampled ones. (Using sloppy pixel-is-a-little-square-box-filter terminology.)