6/16/2009

06-16-09 - Inverse Box Sampling

A while ago I posed this problem to the world :

Say you are given the box-downsampled version of a signal (I may use "image" and "signal" interchangeably cuz I'm
sloppy). Box-downsampled means groups of N values in the original have been replaced by the average of that group
and then downsampled N:1. You wish to find an image at the same resolution as the source which, if box-downsampled
by N, exactly reproduces the low resolution signal you were given. This high resolution image you produce should be "smooth" and
close to the expected original signal.

An example of this is if you're given a low mip and you wish to create a higher mip such that downsampling again
would exactly reproduce the low mip you were given. The particular case I mainly care about is when you are given
the DC coefficients of a JPEG, which are the averages on 8x8 blocks, and you wish to produce a high res image which
has the exact same average on 8x8 blocks.

Obviously this is an under-constrained problem (for N > 1) because I haven't clearly spelled out "smooth" etc.
There are an infinity of signals that when downsampled produce the same low resolution version. Ideally I'd like
to have a way to upsample with a parameter for smoothness vs. ringing that I could play with.
(if you're picky, I can constrain the problem precisely : the correlation of the output image and the original source image
should be maximized over the space of all real-world source images (e.g. over the space of all images that
exist on the internet)).

Anyway, after trying a whole bunch of heuristic approaches which all failed (though Sean's iterative approach is actually
pretty good), I found the mathemagical solution, and I thought it was interesting, so here we go.

First of all, let's get clear on what "box downsample" means in a form we can use in math.

You have an original signal f(t) . We're going to pretend it's continuous because it's easier.

To make the "box downsample" what you do is apply a convolution with a rectangle that's N wide. Since I'm treating t as continuous
I'll just choose coordinates where N = 1. That is, "high res" pixels are 1/N apart in t, and "low res" pixels are 1 apart.

Convolution { f , g } (t) = Integral{ ds * f(s) * g(t - s) }

The convolution with rect gives you a smoothed signal, but it's still continuous. To get the samples of the low res image, you
multiply this by "comb". comb is a sum of dirac delta functions at all the integer coordinates.

F(t) = Convolve{ rect , f(t) }

low res = comb * F(t)

low res = Sum[n] L_n * delta_n , with L_n = F(n)

Okay ? We now have a series of low res coefficients L_n just at the integers.
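Here's a minimal discrete sketch of box downsampling (the N and the signal values are made up for illustration) :

```python
# Box downsample: replace each group of N samples by its average,
# then keep one value per group (N:1 decimation).
def box_downsample(f, N):
    assert len(f) % N == 0
    return [sum(f[i:i + N]) / N for i in range(0, len(f), N)]

f = [1.0, 3.0, 2.0, 6.0, 4.0, 4.0, 0.0, 8.0]
print(box_downsample(f, 4))  # [3.0, 4.0]
```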

This is what is given to us in our problem. We wish to try to guess what "f" was - the original high res signal. Well, now that
we've written it this way, it's obvious ! We just have to undo the comb filtering and undo the convolution with rect !

First, to undo the comb filter - we know the answer to that. We are given discrete samples L_n and we wish to reproduce the smooth
signal F that they came from. That's just Shannon sampling theorem reconstruction. The smooth reconstruction is made by
multiplying each sample by a translated sinc :

F(t) = Sum[n] L_n * sinc( t - n )

where sinc(x) = sin( pi x ) / ( pi x ) ; it's 1.0 at x = 0 and 0.0 at all other integer x's and it oscillates around a lot.
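In code, the Shannon reconstruction F(t) = Sum[n] L_n * sinc( t - n ) looks like this (sample values are made up; real use would have to deal with the finite extent of the data) :

```python
import math

def sinc(x):
    # sin(pi x) / (pi x) , with the limiting value 1 at x = 0
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def F(t, L):
    # Shannon reconstruction : each sample scales a translated sinc
    return sum(Ln * sinc(t - n) for n, Ln in enumerate(L))

L = [0.0, 1.0, 0.5]
print(F(1.0, L))  # 1.0 : F interpolates the samples exactly at the integers
```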

So this F(t) is our reconstruction of the rect-filtered original - not the original. We need to undo the rect filter. To do
that we rely on the Convolution Theorem : Convolution in Fourier domain is just multiplication. That is :

Fou{ Convolution { f , g } } = Fou{ f } * Fou{ g }

So in our case :

Fou{ F } = Fou{ Convolution { f , rect } } = Fou{ f } * Fou{ rect }

Fou{ f } = Fou{ F } / Fou{ rect }
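The Convolution Theorem is easy to sanity-check numerically; here's a sketch of the discrete (circular) analogue with a brute-force DFT on made-up sequences :

```python
import cmath

def dft(x):
    # brute-force discrete Fourier transform
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def circ_conv(f, g):
    # circular convolution, the discrete counterpart of the integral above
    N = len(f)
    return [sum(f[m] * g[(n - m) % N] for m in range(N)) for n in range(N)]

f = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 0.5, 0.0, 0.0]
lhs = dft(circ_conv(f, g))                     # Fou{ Convolution { f , g } }
rhs = [a * b for a, b in zip(dft(f), dft(g))]  # Fou{ f } * Fou{ g }
print(all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs)))  # True
```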

Recall F(t) = Sum[n] L_n * sinc( t - n ) , so :

Fou{ f } = Sum[n] L_n * Fou{ sinc( t - n ) } / Fou{ rect }

Now we need some Fourier transform knowledge. The easiest way for me to find this stuff is just to do the integrals myself. Integrals
are really fun and easy. I won't copy them here because it sucks in ASCII so I'll leave it as an exercise to the reader. You can
easily figure out the Fourier translation principle :

Fou{ sinc( t - n ) } = e^(-2 pi i n v) * Fou{ sinc( t ) }

As well as the Fourier sinc / rect symmetry :

Fou{ rect(t) } = sinc( v )

Fou{ sinc(t) } = rect( v )

All that means for us :

Fou{ f } = Sum[n] L_n * e^(-2 pi i n v) * rect(v) / sinc(v)

So we have the Fourier transform of our signal and all that's left is to do the inverse transform !

f(t) = Sum[n] L_n * Fou^-1{ e^(-2 pi i n v) * rect(v) / sinc(v) }

because of course constants pull out of the integral. Again you can easily prove a Fourier translation principle : the e^(-2 pi i n v) term
just acts to translate t by n, so we have :

f(t) = Sum[n] L_n * h(t - n)

h(t) = Fou^-1{ rect(v) / sinc(v) }

First of all, let's stop and see what we have here. h(t) is a function centered on zero and symmetric around zero - it's a reconstruction
shape. Our final output signal, f(t), is just the original low res coefficients multiplied by this h(t) shape translated to each integer
point n. That should make a lot of sense.

What is h exactly? Well, again we just go ahead and do the Fourier integral. The thing is, "rect" just acts to truncate the infinite range
of the integral down to [-1/2, 1/2] , so :

h(t) = Integral[-1/2,1/2] { dv e^(2 pi i t v) / sinc(v) }

Since sinc is symmetric around zero, take the two halves of the range around zero and add them together, which collapses the
complex exponential into a cosine :

h(t) = 2 * Integral[0,1/2] { dv cos( 2 pi t v ) / sinc(v) }

And.. we're stuck. This is an integral function; it's a pretty neat form, it sure smells like some kind of Bessel function or something
like that, but I can't find this exact form in my math books. (if anyone knows what this is, help me out).
(actually I think it's a type of elliptic integral).

One thing we can do with h(t) is prove that it is in fact exactly what we want. It has the box-unit property :

Integral[n - 1/2, n + 1/2] { dt h(t) } = 1 if n = 0 , 0 for all other integer n

That is, the 1.0 wide window box filter of h(t) centered on integers is exactly 1.0 on its own unit interval, and 0 on others. In
other words, h(t) reconstructs its own DC perfectly and doesn't affect any others.
(prove this by just going ahead and doing the integral; you should get sin( n * pi ) / (n * pi ) ).
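Both the numerical integration of h and the box-unit property can be checked with a crude midpoint rule (the step counts are arbitrary; note that 1/sinc(v) = pi v / sin( pi v )) :

```python
import math

def inv_sinc(v):
    # 1 / sinc(v) = pi v / sin(pi v) , with the limiting value 1 at v = 0
    return 1.0 if v == 0.0 else (math.pi * v) / math.sin(math.pi * v)

def h(t, steps=1000):
    # h(t) = 2 * Integral[0,1/2] cos(2 pi t v) / sinc(v) dv , midpoint rule
    dv = 0.5 / steps
    return 2.0 * dv * sum(math.cos(2.0 * math.pi * t * (i + 0.5) * dv)
                          * inv_sinc((i + 0.5) * dv) for i in range(steps))

def box_filter_h(n, steps=200):
    # 1.0-wide box filter of h centered on integer n
    dt = 1.0 / steps
    return dt * sum(h(n - 0.5 + (i + 0.5) * dt) for i in range(steps))

print(abs(box_filter_h(0) - 1.0) < 1e-3)  # True : reproduces its own DC
print(abs(box_filter_h(1)) < 1e-3)        # True : doesn't affect neighbors
print(h(0.0) > 1.0)                       # True : central value exceeds 1
```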

While I can't find a way to simplify h(t) , I can just numerically integrate it. (Plot of h(t) omitted.)
It sort of looks like sinc, but it isn't. The value at 0 is > 1. The central peak is taller relative to
the side peaks than sinc's, and the first negative lobes are deeper than sinc's. It actually reminds me
of the appearance of a wavelet.

Anyway, this is all very amusing and it actually "works" in the sense that if you blow up a low-res image using
this h(t) basis shape, it does in fact make a high res image that is smooth and upon box-downsampling exactly
reproduces the low-res original.

It is, however, not actually useful. For one thing, it's computationally ridiculous. Of course you would
precompute the h(t) and store it in a table, but even then, the reach of h(t) is infinite, and it doesn't
get small until very large t (beyond the edges of any real image), so in practice every output pixel must be
a weighted sum of every single DC value in the low res image. Even without that problem, it's useless
because it's just too ringy on real data. Looking at the shape of h it should be obvious it will ring
like crazy.

I believe these problems basically go back to the issue of using the ideal Shannon reconstruction when I
did the step of "undoing the comb". By using the sinc to reproduce I doomed myself to non-local effect and
ringing. The next obvious question is - can you do something other than sinc there? Why yes you can,
though you have to be careful.

Say we go back to the very beginning and make this reconstruction :

F(t) = Sum[n] L_n * B( t - n )

We're making F(t) which is our reconstruction of the smooth box-filter of the original. Now B(t) is some reconstruction
basis function (before we used sinc). In order to be a reconstruction, B(t) must be 1.0 at t = 0, and 0.0 at all other
integer t. Okay.

If we run through the math with general B, we get :

again :

f(t) = Sum[n] L_n * h(t - n)

but with :

h(t) = Fou^-1{ Fou{ B } / sinc(v) }

For example :

If B(t) = "triangle" , then F(t) is just the linear interpolation of the L_n

Fou{ triangle } = sinc^2(v)

h(t) = Fou^-1{ sinc^2(v) / sinc(v) } = Fou^-1{ sinc(v) } = rect(t)

Our basis functions are rects ! This is the reconstruction where each L_n is just held constant over its own
DC domain. If you think about it, that should be obvious : take the L_n, make them constant on
each domain, then run a rectangle convolution over that - as you slide the rectangle window along, you get linear
interpolation, which is our F(t).
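A quick numeric check of this case (L values made up) : hold each L_n constant over its unit cell, then the box filter average over [n - 1/2, n + 1/2] returns L_n exactly :

```python
import math

# With B = triangle, h = rect : the reconstruction is piecewise constant,
# each L_n held over its own unit cell around integer n.
def f_const(t, L):
    return L[int(math.floor(t + 0.5))]  # nearest-integer cell

L = [3.0, 4.0, 2.0]
steps = 1000
# box filter : average of f_const over [1 - 1/2, 1 + 1/2]
avg = sum(f_const(0.5 + (i + 0.5) / steps, L) for i in range(steps)) / steps
print(avg)  # 4.0 : exactly L[1]
```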

That's not useful, but maybe some other B(t) is. In particular I think the best line of approach is for B(t) to be
some kind of windowed sinc. Perhaps a Gaussian-windowed sinc. Any real world window I can think of leads to a Fourier
transform of B(t) that's too complex to do analytically, which means our only approach to finding h is to do a
double numerical integration, which is rather a disastrous thing to do, even for precomputing a table.

So I guess that's the next step, though I think this whole approach is a practical dead end and is now just a scientific
curiosity. I must say it was a lot of fun to actually bust out pencil and paper and do some math and real thinking. I really miss it.

9 comments:

I solved for both a full-width (sinc-like) thing which should match the iterated one, and then a simplified bilerp with a tunable parameter and a very local filter support (only need 3x3 DC values to do a given 8x8 block as 4 4x4 bilerps).

Note that you can get a tuning factor by just having two solutions and weighting between them.

My writeup offers a "half-bilerp" approach for 3x3 taps on the DC; the best "half-bilerp" approach for 1x1 taps is the standard blocky DC reconstruction. You could weight between those, but obviously you'd have block artifacts.

So the thing might be to solve for the best approach using 5x5 taps on the DC, and then try a tuning factor between the 3x3 tap and the 5x5 tap solutions.

Like a lot of image processing/computer vision problems, this one is underdetermined in an "interesting" way, and different ways of regularizing it correspond to very different algorithms.

Your approach (deconvolution) is at one extreme, with no particular assumption of how a "typical" reconstructed image should look, but very strong assumptions in the model (linearity of the reconstruction process, bandlimitedness of the box-filtered signal so deconvolution can be applied). The other extreme would be just having a huge library of "typical" images and trying to find matching patches subject to some constraints so the transitions work, like in this paper. This is equivalent to learning a statistical model of typical images and using the maximum a posteriori estimate with the downsampled image being your observation, and the main problem is that there are just too many "plausible" images to ever learn a representative sample.

...which is where slightly stronger model assumptions would help. Add some basic color correction, for example; it needs to have a very limited range since otherwise it'd introduce too much ambiguity when trying to match the lowres image. Example-driven texture synthesis algorithms could be killer here, but this is basically the inverse problem - given this repeating structure, what is the most likely pattern to have generated it, and where have I seen it before?

"The other extreme would be just having a huge library of "typical" images and trying to find matching patches subject to some constraints to the transitions work"

Yes, I've written and worked on that problem as well. It's slightly different because they generally don't require that downsampling reproduces the original low res. It's usually called "super resolution" and there are now a lot of papers on it (better than that one).

It's definitely something on my "wish list" that I would love to work on if I had a bunch of free time. I was working on it a bit at the end of my unemployment but I got distracted by the job search and whatnot. It's a very hard problem.

BTW I believe that any linear reconstruction approach to these kinds of problems is wrong and doomed to suck.

"BTW I believe that any linear reconstruction approach to these kinds of problems is wrong and doomed to suck."

Maybe, but everything else seems way too expensive to actually compute, so I focused on linear just because it seems computationally feasible, and better than DC-nearest-neighbor. Are there actually other options on the table? You appeared to write all your attempts off as too expensive.

Also, I suspect any solution to this problem is either going to be ringy or blocky; my guess is it's inherent to the problem. This applies to this solution, to hats, etc. Basically, you're trying to pack a certain amount of energy into the block; either it's well-distributed over the block (blocky), or you try to keep the edges nice and then it's overconcentrated in the center (ringy).

Also, the iterative solution seemed pretty reasonable to me, was it actually too ringy? My mexican hat is just a truncation of the iterative solution with the necessary cleanup to make the DC work, so I'm not sure how much worse it would be.

Second, if every 8x8 block is, in the end, independent (the only hard constraint is the average DC), then you can use different solutions for different blocks. So if the deconvolution approach is too slow, but the mexican hat is too ringy, then if you end up in a place where 'oh well, i'll just use nearest neighbor aka DC replication', you could still do mexican hat, have a 'ring detector' (color falls outside of convex hull of nearby elements) and substitute the constant DC blocks in those places.
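A sketch of that "ring detector" fallback (the names are mine, and I've simplified color to a single channel; the 1D analogue of "outside the convex hull" is just outside the min/max range of the nearby DC values) :

```python
def rings(pixel, nearby_dc):
    # flag reconstructed values that overshoot past all nearby DC values
    return pixel < min(nearby_dc) or pixel > max(nearby_dc)

def reconstruct_pixel(fancy_value, block_dc, nearby_dc):
    # fall back to flat DC replication wherever the fancy filter rings
    return block_dc if rings(fancy_value, nearby_dc) else fancy_value

print(reconstruct_pixel(5.0, 2.0, [1.0, 2.0, 3.0]))  # 2.0 : overshoot, use DC
print(reconstruct_pixel(2.5, 2.0, [1.0, 2.0, 3.0]))  # 2.5 : fine, keep it
```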

I don't mean "linear" as in whether the basis function is piecewise linear or not - I mean linear in the "linear operator" sense - *any* basis function based approach will not be awesome.

The simplest way to get a "Non-linear" approach is to have something like a few different linear solutions, and pick some blend between them that's based on the local neighborhood. For example one obvious thing to do which I tried is to run an edge detector on the DC image and do different things when you see hard edges vs. smooth areas.