
02-13-11 - JPEG Decoding

I'm working on a JPEG decoder as a side project. It's a nice small way for me to test a
bunch of ideas on perceptual metrics and decode post-filters in a constrained scenario (the constraint being
baseline JPEG encoding).

I also think it's sort of a travesty that there is no good mainstream JPEG decoder. This stuff has been in
the research literature since 1995 (correction : actually much earlier, but there's been very modern good stuff
since '95 ; eg. the original deblocker suggested in the JPEG standard is no good by modern standards).

There are a few levels of good JPEG decoding :

0. Before even getting into any post-filtering you can do things like laplacian-expected dequantization -
reconstructing each coefficient at the expected value of a Laplacian distribution over its quantization bin -
instead of dequantizing to the bin center.
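For a Laplacian model the expected value within a bin has a closed form. A minimal sketch in Python (the scale parameter b is just an argument here; in practice you'd estimate it per subband):

```python
import math

def laplacian_expected_dequant(q_index, Q, b):
    """Reconstruct a quantized DCT coefficient at the expected value of a
    Laplacian distribution (scale b) over its quantization bin, instead of
    at the bin center q_index * Q."""
    if q_index == 0:
        return 0.0
    s = 1 if q_index > 0 else -1
    # the bin covers [a, c] on the positive axis
    a = (abs(q_index) - 0.5) * Q
    c = (abs(q_index) + 0.5) * Q
    # E[x | a < x < c] for a pdf proportional to exp(-x/b)
    ea, ec = math.exp(-a / b), math.exp(-c / b)
    e = ((a + b) * ea - (c + b) * ec) / (ea - ec)
    return s * e
```

Because the density decays away from zero, the expectation always lands below the bin center, pulling coefficients slightly toward zero.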

1. Realtime post-filtering. eg. for typical display in viewers, web browsers, etc. Here at the very least
you should be doing some simple deblocking filter. H264 and its derivatives all use one, so it's only fair.
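To illustrate the flavor of such a filter, here's a 1D sketch (this is not the actual H264 filter; the threshold and the tap weights are placeholder choices of mine):

```python
def deblock_row(row, Q, block=8):
    """One-dimensional deblocking pass over a row of pixels.  At each block
    boundary, if the step looks like a quantization artifact (small relative
    to the quantizer Q) rather than a real edge, redistribute it smoothly;
    large steps are left alone."""
    out = list(row)
    for b in range(block, len(row) - 1, block):
        p1, p0, q0, q1 = out[b - 2], out[b - 1], out[b], out[b + 1]
        step = q0 - p0
        if abs(step) < 2 * Q:          # artifact-sized step, not a real edge
            out[b - 1] = p0 + step / 4.0
            out[b]     = q0 - step / 4.0
            out[b - 2] = p1 + step / 8.0
            out[b + 1] = q1 - step / 8.0
    return out
```

Note the threshold is proportional to Q, which matters for the reasons discussed further below.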

2. Improved post-filtering and deringing. Various better filters exist; most are variants of bilateral
filters with selective strengths - stronger at block boundaries and in ringing-likely areas (the pixels
within a few steps of a very strong edge).

3. Maximum-a-posteriori image reconstruction given the knowledge of the JPEG-compressed data stream. This
is the ultimate, and I believe that in the next 10 years all image processing will move towards this technique
(eg. for super-resolution, deconvolution (aka unblur), Bayer de-mosaicing, etc etc). Basically the idea is
you have an a-priori probability model of which images are likely, P(I), and you simply find the I that maximizes
P(I) subject to the constraint jpeg_compress(I) = known_data. This is a very large modern topic that I have only
begun to scratch the surface of.
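The constraint half of this is easy to state: any candidate image consistent with the stream must have DCT coefficients inside the known quantization bins, so MAP/POCS-style decoders alternate a prior-driven smoothing step with a projection back onto that constraint set. A sketch of the per-coefficient projection (assuming the usual uniform bins):

```python
def project_to_quant_bin(coef, q_index, Q):
    """Project one DCT coefficient of a candidate image back into the
    quantization bin implied by the compressed data: any reconstruction
    consistent with the JPEG stream must lie in [(q-0.5)*Q, (q+0.5)*Q]."""
    lo = (q_index - 0.5) * Q
    hi = (q_index + 0.5) * Q
    return min(max(coef, lo), hi)
```

The full decoder loop would be: smooth the image according to the prior, DCT each block, clamp every coefficient with this projection, IDCT, and repeat until it converges.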

It's shameful and sort of bizarre that we don't even have #1 (*). Obviously you want different levels of processing for
different applications. For viewers (eg. web browsers) you might do #1, but for loading to edit (eg. in Photoshop or whatever)
you should obviously spend a lot of time doing the best decompress you can. For example if I get a JPEG out
of my digital camera and I want to adjust levels and print it, you better give me a #2 or #3 decoder!

(* : an aside : I believe you can blame this on the success of the IJG project. There's sort of an unfortunate
thing that happens where there is a good open source library available to do a certain task - everybody just
uses that library and doesn't solve the problem themselves. Generally that's great, it saves developers a lot of time,
but when that library stagnates or fails to adopt the latest techniques, it means that entire branch of code
development can stall. Of course the other problem is the market dominance of Photoshop, which has long
been the bane of all who care about image quality and well-implemented basic loaders and filters.)

So I've read a ton of papers on this topic over the last few weeks. A few notes :

"Blocking Artifact Detection and Reduction in Compressed Data". They work to minimize the MSDS difference,
that is to equalize the average pixel steps across block edges and inside blocks. They do a bunch of good
math, and come up with a formula for how to smooth each DCT coefficient given its neighbors in the same
subband. Unfortunately all this work is total shit, because their fundamental idea - forming a linear combination
using only neighbors within the same subband - is completely bogus. Consider the most basic
situation: you have zero AC's, so you have flat DC blocks everywhere. The right thing to do is to
compute the AC(0,1) and AC(1,0) coefficients from the delta of the neighboring DC levels. That is, you correct one
subband from the neighbors in *other* subbands - not in the same subband.
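A sketch of that correction: assume the block is locally a linear ramp consistent with the neighboring blocks' mean levels (their centers are 16 pixels apart, hence the /16), and read AC(1,0) off the DCT of that ramp. The linear-ramp model is a simplifying assumption of mine; the old smoothing suggestion in the JPEG standard does something similar with a quadratic surface fit:

```python
import math

def dct2_coef(block, u, v):
    """One coefficient of the standard 2D DCT-II of an 8x8 block."""
    cu = math.sqrt(0.5) if u == 0 else 1.0
    cv = math.sqrt(0.5) if v == 0 else 1.0
    s = 0.0
    for y in range(8):
        for x in range(8):
            s += (block[y][x]
                  * math.cos((2 * x + 1) * u * math.pi / 16)
                  * math.cos((2 * y + 1) * v * math.pi / 16))
    return 0.25 * cu * cv * s

def predict_ac10(dc_left, dc_right):
    """Predict AC(1,0) of a block whose stored AC's are all zero, from the
    mean levels of its horizontal neighbors: assume the image is locally a
    linear ramp matching those levels, and take the DCT of that ramp."""
    slope = (dc_right - dc_left) / 16.0   # per-pixel slope implied by neighbors
    ramp = [[(x - 3.5) * slope for x in range(8)] for _ in range(8)]
    return dct2_coef(ramp, 1, 0)
```

The same construction with a vertical ramp and (u,v) = (0,1) gives the other coefficient.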

Another common and obviously wrong fault that I've seen in several papers is using thresholds that are not
scaled by the quantizer. eg. many of the filters are basically bilateral filters. It's manifestly obvious that
the bilateral pixel sigma should be proportional to the quantizer: the errors created by quantization are
proportional to the quantizer, therefore the pixel steps that you should correct with your filter should be
proportional to the quantizer. One paper uses a fixed pixel sigma of 15, which is obviously tweaked for a certain
quality level, and will over-smooth high quality images and under-smooth very low quality images.
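Concretely, the scaling looks like this (1D sketch; the spatial sigma and the constant k are placeholder choices - the only point being made is that the range sigma tracks Q):

```python
import math

def bilateral_1d(row, Q, spatial_sigma=1.5, k=1.0, radius=2):
    """Bilateral filter over a row of pixels with the range (pixel-value)
    sigma tied to the quantizer: sigma_r = k * Q.  Small Q -> edges are
    preserved; large Q -> quantization-sized steps get smoothed."""
    sigma_r = k * Q
    out = []
    for i in range(len(row)):
        wsum = vsum = 0.0
        for d in range(-radius, radius + 1):
            j = min(max(i + d, 0), len(row) - 1)   # clamp at the borders
            w = (math.exp(-(d * d) / (2 * spatial_sigma ** 2))
                 * math.exp(-((row[j] - row[i]) ** 2) / (2 * sigma_r ** 2)))
            wsum += w
            vsum += w * row[j]
        out.append(vsum / wsum)
    return out
```

With a fixed sigma like 15 instead of k*Q, the same step of 10 pixels is treated identically whether it's a real edge in a quality-95 image or an obvious block artifact in a quality-20 image.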

The most intriguing paper from a purely mathematical curiosity perspective is
"Enhancement of JPEG-compressed images by re-application of JPEG" by Aria Nosratinia.

Nosratinia's method is beautifully simple to describe :

Take your base decoded image
For all 64 shifts of 0-7 pixels in the X & Y directions :
  At every 8x8 grid position starting at that shift :
    Apply the DCT, JPEG quantization, dequantization, and IDCT
Average the 64 images
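A runnable sketch of that loop (pure Python and slow; it assumes a flat quantizer Q in place of the full JPEG quantization matrix, and leaves the edge pixels of shifted grids untouched instead of handling partial blocks):

```python
import math

# orthonormal 8-point DCT-II basis: row u, column x
C = [[(math.sqrt(0.5) if u == 0 else 1.0) * 0.5
      * math.cos((2 * x + 1) * u * math.pi / 16)
      for x in range(8)] for u in range(8)]

def dct8(v):  return [sum(C[u][x] * v[x] for x in range(8)) for u in range(8)]
def idct8(c): return [sum(C[u][x] * c[u] for u in range(8)) for x in range(8)]

def dct2(b):
    t = [dct8(row) for row in b]                     # transform rows
    t = [dct8(list(col)) for col in zip(*t)]         # then columns
    return [list(col) for col in zip(*t)]

def idct2(b):
    t = [idct8(row) for row in b]
    t = [idct8(list(col)) for col in zip(*t)]
    return [list(col) for col in zip(*t)]

def nosratinia(img, Q):
    """Average, over all 64 lattice shifts, of re-applying JPEG's
    transform + quantize + dequantize to the decoded image."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]
    for sy in range(8):
        for sx in range(8):
            out = [list(row) for row in img]
            for by in range(sy, h - 7, 8):
                for bx in range(sx, w - 7, 8):
                    blk = [[float(img[by + y][bx + x]) for x in range(8)]
                           for y in range(8)]
                    coef = [[Q * round(c / Q) for c in row] for row in dct2(blk)]
                    blk = idct2(coef)
                    for y in range(8):
                        for x in range(8):
                            out[by + y][bx + x] = blk[y][x]
            for y in range(h):
                for x in range(w):
                    acc[y][x] += out[y][x] / 64.0
    return acc
```

A sanity check on the structure: an image whose blocks already quantize exactly (eg. a constant image) passes through unchanged, since every shift reproduces it.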

That's it. The results are good but not great. But it's sort of weird and amazing that it does as well
as it does. It's not as good at smoothing blocking artifacts as a dedicated deblocker, and it doesn't
totally remove ringing artifacts, but it does a decent job of both. On the plus side, it preserves
contrast better than some more aggressive filters.

Why does Nosratinia work? My intuition says that what it's doing is equalizing the AC quantization across
all lattice shifts. That is, in normal JPEG if you look at the 8x8 grid at shift (0,0), you will find the AC's
are quantized in a certain way - there's very little high frequency energy, and what there is occurs only in
certain big steps - but if you step off to a different lattice shift (like (2,3)), you will see unquantized
frequencies, and a lot more low frequency AC energy due to picking up the DC steps. What Nosratinia does is
remove that difference, so that all lattice shifts of the output image have the same AC histogram. It's quite
an amusing thing.

One classic paper that was way ahead of its time implemented a type 3 (MAP) decoder back in 1995 :
"Improved image decompression for reduced transform coding artifacts" by O'Rourke & Stevenson. Unfortunately
I can't get this paper because it is only available behind the IEEE paywall.

I refuse to give the IEEE or ACM any money, and I call on all of you to do the same. Furthermore, if you are
an author I encourage you to make your papers available for free, and what's more, to refuse to publish in
any journal which does not give you all rights to your own work. I encourage everyone to boycott the IEEE,
the ACM, and all universities which do not support the freedom of research.

1 comment:

My guess is that it was due to a combination of two factors: IJG, and the fact that the spec wasn't freely available. It was available "cheap", but not over the internet, which meant people like me never had a spec to even think about writing one. I only wrote the decoder in stb_image because I could finally find it for free -- the w3c republished it for some reason as part of their web standards.

I can't say for sure this affected lots of people, but it mattered for me, and it seems like it would make sense for a lot of other open-source-y itch-scratching people.