The original goal with Overmix was to reduce compression noise by averaging frames, and this works quite well. However, video compression is quite complex, and simply averaging pixel values fails to take that into consideration. Since it really is that complex, I will start by considering the case where each frame is simply compressed as a JPEG image. (There actually happens to be a video format called Motion JPEG which does just that.) Video is usually encoded using motion compensation techniques to reduce redundancy between frames, so this is a major simplification. I am also using the same synthetic data from the de-mosaic post (which does not contain sub-pixel movement), compressing each frame to JPEG at quality 25.

Before we begin we need to understand how the images were compressed. There are actually many good articles about JPEG compression written in an easy to understand manner, see for example Wikipedia’s example or the article on WhyDoMath.
As a quick overview: JPEG segments the image into 8×8 blocks which are then processed individually. This is done by transforming each block into the frequency domain using the Discrete cosine transform (DCT). Informally, this separates details (high frequencies) from the basic structure (low frequencies). The quality is then reduced using a quantisation table, which determines how precisely each coefficient (think of it as a pixel) in the DCT is preserved. For lower quality settings, the quantisation table lowers the quality of the high frequencies more. The same table is used for all 8×8 blocks. After the quantisation step, the values are rearranged in a specific pattern and losslessly compressed.
Line-art tends to suffer from this compression, since the sharp edge of a line requires precise high-frequency information to reproduce.
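To make the quantisation step concrete, here is a minimal sketch of how a single DCT coefficient is degraded. The coefficient and table entry are made-up values for illustration, not taken from a real JPEG quantisation table:

```python
def quantise(coeff, q):
    """Quantise one DCT coefficient using quantisation table entry q."""
    return round(coeff / q)

def dequantise(level, q):
    """What the decoder gets back: only a multiple of q survives."""
    return level * q

# A large table entry (low quality for that frequency) loses precision:
level = quantise(37.0, 40)        # -> 1, the value stored in the JPEG file
restored = dequantise(level, 40)  # -> 40, not the original 37.0
```

Averaging many frames whose blocks cover different areas of the image is what lets us get back below this rounding error.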

The idea for minimizing the JPEG compression artifacts is to estimate an image which best explains our observed compressed JPEG images. The images are offset relative to each other, so the 8×8 blocks will cover different areas in each image most of the time.
To make the initial estimate we use the average of each pixel value, like the original naive method. Then we try to improve our estimate in an iterative manner. This is done by going through all our JPEG-compressed frames and trying to recover the details lost after quantisation. For each 8×8 block in the compressed frame, we apply the DCT to the equivalent 8×8 area in the estimated image. Then, for each coefficient, we check whether it quantises to the same value as the one stored in the compressed frame. If so, we use the non-quantised coefficient to replace the one in the compressed frame. The idea is that we slowly rebuild the compressed frame to its former glory, using the accumulated information from the other frames to estimate more precise coefficients which, when compressed with the same settings, would still produce our original compressed frame.
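The per-coefficient consistency check described above could look like the following sketch. It works on already-transformed 8×8 blocks (flattened to lists), so the DCT itself and the handling of image offsets are left out, and the function names are my own:

```python
def refine_block(estimate_coeffs, frame_levels, qtable):
    """One refinement step for a single 8x8 block.

    estimate_coeffs: DCT coefficients of the estimate's matching area
    frame_levels:    quantised levels stored in the compressed frame
    qtable:          quantisation table used when compressing the frame
    """
    refined = []
    for est, level, q in zip(estimate_coeffs, frame_levels, qtable):
        if round(est / q) == level:
            # The estimate would compress to the same stored value, so its
            # extra precision is consistent with the observed frame: keep it.
            refined.append(est)
        else:
            # Otherwise fall back to what the frame actually tells us.
            refined.append(level * q)
    return refined
```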

The figure shows a comparison at 2x magnification. From left to right: a single compressed frame, the average of all the frames pixel by pixel, the JPEG estimation method just explained after 300 iterations, and the original image the compressed frames were generated from.

The most significant improvement comes from the averaging; most of the obnoxious artifacts are gone and the image appears quite smooth. However it is quite blurry compared to the original and has a lot of ringing, which is especially noticeable on the hand (second row).
The estimation improves the result a little: the edges are slightly sharper and it brings out some of the details, for example the button in the first row and the black stuff in the third row. However it still has ringing, and even though that is reduced, the image is a bit noisier, making it less pleasing visually.

The problem here is that the quantisation is quite significant, and there are many possible images which could explain our compressed frames. Our frames simply do not contain enough redundancy to narrow it down. The only way to improve the estimation is to have prior knowledge about our image.

My quick and dirty approach was to use the noise reduction feature of waifu2x. It is a learning-based approach, and waifu2x has been specially trained on anime/line-art images. Each compressed frame was put through waifu2x on High noise reduction, and this was used as the initial estimate for each image; the coefficients were then updated as usual. Afterwards the algorithm was run like before, and the result can be seen below:

From left to right: Single frame after being processed by waifu2x, average of the frames processed by waifu2x, the estimation without using waifu2x (unchanged from the previous comparison), and finally the estimation when using waifu2x.

The sharpness in the new estimate is about the same, but both ringing and especially noise have been reduced. It has a bit more noise than the averaged image, but notice that the details in the black stuff (third row) in the averaged image are even worse than in the old average (see the first comparison). This is likely because it is not line-art, so waifu2x does not give great results there. Nevertheless, the estimation still managed to bring it back: no matter how bad our estimate is, the method still forces the resulting image to conform to the input images.

But two aspects make this a “quick and dirty” addition. The prior knowledge is only applied to the initial estimate and ignored in the following iterations, which might be the reason it is slightly more noisy. It still worked here, but there is no guarantee. Secondly, waifu2x is trained using presets and ignores how the image was actually compressed.

Note: Due to the nature of censored content, I’m showcasing synthetic data in this post.

Japan has its crazy censorship laws, so most hentai anime use mosaic censoring to hide the important bits. Of course, with my work on Overmix I’m interested in camera pans that can be stitched, and those can look like the following:
What is interesting, however, is that while it might not be too apparent at first, the mosaic censoring changes from frame to frame. Instead of censoring the image which is panned, a specific area is marked as “dirty” and censoring is applied to each frame separately. Looking even closer, we can see that the mosaics are arranged in a grid which is relative to the viewport (i.e. fixed) and not the panned image (which is moving).
So depending on how the panned image is offset from the viewport, each tile in the mosaic will cover a different area in different frames. If we take all our frames and find the average for each pixel, we can see a vast improvement already:
It is however quite blurry… But super-resolution has taught us that if we have several images of the same motif, where each image carries slightly different information, we can exploit that to increase the resolution. So let’s try to do better.

Before we can start, we need to know how the mosaic censoring works to finish our degradation model. Contrary to what I thought, it is not done by averaging all pixels inside each tile. Instead it is done by picking a single pixel (which I assume is the center one) and using that as the color of the entire tile.
The approach is therefore really simple. We create a high-resolution (HR) grid, and for each frame we plot the tile color into the HR grid at the pixel located at the center of the tile. When we have done that for all the frames, we end up with the following:
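A sketch of that plotting step, assuming we know each frame's offset into the panned image and that the tile colour was sampled at the tile centre. The frames are given as sparse dicts purely for illustration; this is not Overmix's actual representation:

```python
def plot_mosaic_samples(frames, offsets, tile, hr_w, hr_h):
    """Scatter each mosaic tile's colour into a high-resolution grid.

    frames:  per-frame dicts mapping (tile_x, tile_y) -> colour
    offsets: per-frame (dx, dy) of the frame inside the panned image
    tile:    tile size of the mosaic in pixels
    """
    hr = [[None] * hr_w for _ in range(hr_h)]
    for frame, (dx, dy) in zip(frames, offsets):
        for (tx, ty), colour in frame.items():
            # Centre of this tile, mapped into panned-image coordinates
            x = tx * tile + tile // 2 + dx
            y = ty * tile + tile // 2 + dy
            if 0 <= x < hr_w and 0 <= y < hr_h:
                hr[y][x] = colour
    return hr  # None marks pixels no frame told us anything about
```

The `None` entries are exactly the black pixels discussed next, which have to be filled in by interpolation.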
There are a lot of black pixels; these are pixels which were not covered by any of our frames, pixels we know nothing about. There is no way of extracting them from our source images, so we need to fake them using interpolation. I used the “Inpaint – diffusion” filter in G’MIC and the result isn’t too bad:
Those “missing pixels” fully control how well the decensoring will work. If the viewport only pans vertically, we will only have vertical strips of known pixels, and thus cannot increase the resolution in the horizontal direction. If we have movement in both dimensions it tends to work pretty well, as you can see in the overview image below:

I’m getting close to having a Super-Resolution implementation, but I took a small detour applying what I have learned to a slightly easier problem: extracting overlaid images. Here is an example:
The character is looking into a glass cage, and the reflection is created by adding the face as a semi-transparent layer. We also have the background image, i.e. the image without the overlay:
A simple way of trying to extract the overlaid image is to subtract the background image, and indeed we can somewhat improve the situation:
Another approach is to estimate the overlaid image which, when composited on top of the background image, will reproduce the merged image.
This is done by starting with a semi-random estimate and then iteratively improving it. (The initial estimate is just the merged image in this case.)
For each iteration we take our estimate and overlay it on the background. If our estimate is off, the result will obviously differ from our merged image, so we find the difference between the two and use that difference to improve our estimate. After doing enough iterations, we end up with the following:
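A single-channel, 1-D sketch of this loop, assuming simple alpha compositing with a known, constant alpha (the actual compositing mode and alpha used in the anime are unknown, and the step size is my own choice):

```python
def extract_overlay(merged, background, alpha, steps=50, rate=0.5):
    """Iteratively estimate a semi-transparent overlay.

    Assumed model: merged = alpha*overlay + (1-alpha)*background.
    We start from the merged image and nudge the estimate by the
    residual between the observed and the simulated merged image.
    """
    estimate = merged[:]
    for _ in range(steps):
        for i, (m, b) in enumerate(zip(merged, background)):
            composited = alpha * estimate[i] + (1 - alpha) * b
            estimate[i] += rate * (m - composited)  # residual feedback
    return estimate
```

With `merged = [0.5]`, `background = [0.0]` and alpha 0.5, the estimate converges towards the true overlay value 1.0.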
While not perfect, this is quite a bit better. I still haven’t added regularization, which stabilizes the image and thus could improve it, but I’m not quite sure how it affects the image in this context.

Super-Resolution works in a similar fashion; instead of overlaying an image, it downscales a high-resolution estimate and compares it against the low-resolution images. It just so happens that I never implemented downscaling…
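For comparison, the downscale-and-compare idea can be sketched in 1-D as iterative back-projection. This assumes integer sample shifts and box downscaling, and is only an illustration of the principle, not any particular implementation:

```python
def downscale(hr, factor):
    """Box-downscale a 1-D signal by averaging each block of samples."""
    return [sum(hr[i:i + factor]) / factor for i in range(0, len(hr), factor)]

def super_resolve(lr_images, shifts, factor, steps=100, rate=0.5):
    """Iterative back-projection sketch in 1-D.

    Each LR image is assumed to be the HR signal shifted by an integer
    sub-sample amount and then box-downscaled. The residual of each LR
    sample is spread back onto the HR samples that produced it.
    """
    n = len(lr_images[0]) * factor
    hr = [0.0] * n
    for _ in range(steps):
        for lr, shift in zip(lr_images, shifts):
            shifted = [hr[(i + shift) % n] for i in range(n)]
            simulated = downscale(shifted, factor)
            for j, (obs, sim) in enumerate(zip(lr, simulated)):
                residual = rate * (obs - sim)
                for k in range(factor):  # spread back over the block
                    hr[(j * factor + k + shift) % n] += residual
    return hr
```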

While looking at OpenCV, I managed to find a working implementation of one of the papers I was interested in understanding. It is in Japanese, but you can find the code with some comments in English here: http://opencv.jp/opencv2-x-samples/usage_of_sparsemat_2_superresolution
A small warning if you try to run it: it is very memory intensive. With a 1600×1200 image it used 10 GB of RAM on my system. It also crashes if your image’s dimensions are not a multiple of the resolution enhancement factor.

All tests are done with 16 low-resolution images, increasing the resolution 4 times. The image below shows the result for the best case, where the images are positioned evenly. The left image is one of the 16 Low Resolution (LR) images, the right is the original, and the middle is the Super Resolution result after 180 iterations:

There are some ringing artifacts around the high-contrast edges, but notice how it manages to slightly bring out the lines in the eye, even though they look completely flat in the LR image.

Below is the same, but with the input images having been degraded by noise and errors. While it does lose a little bit of detail, the results are still fairly good, with less noise than the input images.

The last test uses random sub-pixel displacements instead of optimal ones. The optimal case is shown to the left for comparison. It is clear that the method loses its effectiveness, as the image becomes more blocky.

My plan is to use this implementation as an aid to understand the parts of the article I don’t fully understand. I would like to try this out with DVD anime sources, but this method (or at least this implementation) just wouldn’t work with 150+ images. You can wait for a slow algorithm to terminate, but memory is more of a hard limit. However, the method allows separate blur/scaling matrices for each LR image, so you could probably reduce the memory usage by keeping them all equal.

Most stitches are mostly static, with perhaps a little mouth movement. Some however do contain significant movement, and Overmix was never intended to merge this into a single image in a sensible fashion.

This is because I have yet to see a program which can do this perfectly. While I have seen several produce something which looks decent at first, they usually have issues, commonly lines not connecting properly in places and straight lines ending up curved.

For me, it needs to be perfect. If I need to manually fix it up, or even redo it from scratch, not much is gained. The goal with Overmix was always to reach a level I would not be able to reach without its assistance. Thus there is no reason to pursue a silver-bullet solution, especially if it gives worse results than doing it manually.

Instead, the approach I have taken is to detect which images belong to which movement. For example, if an arm is moving, we want to figure out which images are those where the arm has yet to start moving, which are those where the arm has stopped moving, and all those in between. In other words, we end up with one group of images for each frame the animator drew.

These groups can be combined individually without any issues, and the set of resulting images can be merged manually. And since we might have reduced 100 video frames to just 10 animated frames, each group still contains several video frames, so we can take advantage of the nice denoising and debanding properties of Overmix, which will improve the final quality of the stitch.

Cyclic movement

One interesting use-case which can be fully automated is cyclic movement, i.e. movement which ends the way it started, and continues to loop. This is often people doing repetitive motions such as waving goodbye.

Of course the real benefit comes when the viewport is moving, as that would be cumbersome to do manually. The following example was a slow pan-up over the course of 89 frames, where the wind is rustling the character’s clothes and hair, reduced to 22 frames:

Notice how the top part of some frames is missing, as the scene ended before the top had been in view for every animation frame. The same was the case for the bottom, but since it contained no movement, any frame could fill in the missing information.

(The animation can be downloaded in FullHD resolution APNG here (98 MiB).)

Algorithm

The main difficulty is making a distinction between noise and movement. (Noise can be compression artifacts, but also other things such as TV logos, etc.) A few methods were tried, but the best and simplest of them takes advantage of the fact that most Japanese animation reduces cost by using a lower frame rate. This is typically 3 video frames for each animated frame, though it can vary throughout the animation!

The idea is to compare each frame with the previous one. Since there are usually 3 consecutive video frames without animation, this gives a low difference. But as soon as we hit a frame which contains the next part of the animation, a high difference appears, causing a spike on the graph. Doing this for every frame gives a result like this:
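The spike detection boils down to a difference measure between consecutive frames. A minimal sketch, using mean absolute difference on frames given as flat pixel lists (Overmix's actual metric may differ):

```python
def frame_differences(frames):
    """Mean absolute difference between each frame and the previous one.

    On limited animation (~3 video frames per drawing) this stays low,
    then spikes whenever a new drawing appears.
    """
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, cur))
        diffs.append(total / len(cur))
    return diffs
```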

Using this, we can determine a noise threshold by drawing a line (shown in purple) which intersects as many blue lines as possible. While this has mostly been tested on cyclic movement, it works surprisingly well.

The ever-returning issue of sub-pixel alignment strikes back though. When the stitch contains movement in both directions, the sub-pixel misalignment can cause the difference to become large enough to cause issues. This could easily be avoided by simply using sub-pixel alignment, but as of now that is quite a bit slower in Overmix.

Once the threshold has been determined, the images are separated into groups based on it. If the difference between the last image in a group and the next image is below the threshold, it is added to that group. If it cannot be added to any group, a new group containing that image is created. This is done for all the images.
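The grouping rule can be sketched as follows; the difference function is passed in as a parameter, since the actual metric is not shown here:

```python
def group_frames(frames, threshold, difference):
    """Group frames into one group per drawn animation frame.

    A frame joins the first group whose last member differs from it by
    less than the threshold; otherwise it starts a new group.
    """
    groups = []
    for frame in frames:
        for group in groups:
            if difference(group[-1], frame) < threshold:
                group.append(frame)
                break
        else:
            groups.append([frame])
    return groups
```

Note how, for cyclic movement, a later frame can rejoin an earlier group instead of creating a new one.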

Further work

Notice that the file size of the APNG image is nearly 100 MB. This is because each of the 22 images is rendered independently and thus results in 22 completely different images. But the background is the same in each and every frame, which means we are not taking advantage of the information about the background found in the other frames. By detecting which parts of the frames are consistent and which differ when rendering, we could both improve quality and reduce file size.

Aligning the resulting frames can be tricky when there is a lot of movement in a cyclic animation, because the images bear only a little resemblance to each other. Even when this does work, sub-pixel alignment and rendering are more important than usual, since otherwise the ±0.5 pixel error will show up as a shaky animation. I have an idea for how to solve the alignment issue, but my math knowledge is currently too lacking to actually implement it.

Automatic aligning

While zooming and rotation are not supported (yet at least), horizontal and vertical movement is detected rather reliably and can be done to sub-pixel precision. Sub-pixel precision is rather slow and memory intensive though, as it works by image upscaling. It still needs some work when aligning images with transparent regions, and I also believe it can be made quite a bit faster.

De-noising

Noise is removed by averaging all frames and works very well, even on Blu-ray rips. However, since I have not been able to render while taking sub-pixel alignment into account, it blurs the image ever so slightly.

10-bit raw input and de-telecine

I have made a frame dumper to grab the raw YUV data from the video streams, to ensure no quality loss happens between the video file and Overmix. This actually showed that VLC outputs in pretty low quality, and appears to have some trouble with Hi10p content and color spaces. Most noticeable however is that VLC uses nearest-neighbor interpolation for chroma upsampling, which looks especially bad with reds. Having direct access to the chroma channels also opens up the possibility of doing chroma restoration using super resolution.

I have made several tools for my custom “dump” format, including the video dumper, a Windows shell thumbnailer (Vista/7/8), a Qt5 image plugin and a compressor to reduce the file size. They can be found in their separate repository here: https://github.com/spillerrec/dump-tools

De-telecine has also been added, as it is necessary in order to work with the interlaced MPEG-2 transport streams anime is usually broadcast in.

Deconvolution

Using deconvolution I have found that I’m able to make the resulting images much sharper, at the expense of a little bit of noise. It appears that anime is significantly blurred, perhaps because of the way it has been rendered, or perhaps intentionally. More on this later…

TV-logo detection and removal

It works decently, as shown in a recent post, and can also remove other static content such as credits. I have tried to do it fully automatically, but it doesn’t work well enough compared to the manual process. Perhaps some Otsu thresholding combined with dilation would work?

I intend to do further work with this method to see if I can get it to work with moving items, and to do the reverse: separating out the moving item. This is interesting as some panning scenes move the background independently of the foreground (to fake depth of field).

Steam removal

A proof of concept showed that if the steam is moving, we can choose the parts from each frame which contain the least amount of steam. It doesn’t remove the steam completely, but it can make a significant difference. The current implementation however handles the colors incorrectly, which is fixable, but not something I care to do unless requested…

Animation

Some scenes contain a repeating animation while doing vertical/horizontal movement. H-titles especially seem to have much of this, but it can also be found as mouth movement and the like. While still in its early stages, I have successfully managed to separate the animation into its individual frames and stitch those, using a manual global threshold.

I doubt it would currently work with minor animations, such as changes to the mouth; noise would probably mess it up. So I’m considering investigating other ways of calculating the difference, perhaps using an edge-detected image instead, or computing local differences.

One annoying artifact of averaging all frames is that static content, such as TV channel logos, will appear in the final image for each frame:

Similarly, frames sometimes contain a small black border which also ruins the image. My old way of thinking was that I should make a filter which ignored largely inconsistent pixels. But really, all we need to do is to crop the image…

It is the same deal with logos: what we want is to ignore the area which contains the logo. That means we have to select that area, which is pretty easy to do manually. But what about doing it automatically?

Logo detection, first attempt

The logo is positioned in exactly the same place on each and every frame. The pixels of the logo thus have consistent values, so the idea is that if we find the difference between frames, any pixel whose difference is below a certain threshold is part of the logo.

It doesn’t work that well, because the assumption that all other parts are inconsistent doesn’t always hold. Notice how in the above image the whole right side has about the same color. Furthermore, the logo isn’t always consistent, as it is usually a bit transparent and changes depending on the background.

Below is an example of this method (top) and the final method (bottom). Notice how the logo looks dark, yet the thresholded version to the right captured nearly none of it. While it is not very noticeable, the far right side is actually about as dark as the logo, so using any threshold higher than the one used here would include incorrect areas.

Working logo detection

The idea with this method is that instead of trying to find consistency in the frames, we estimate the original image and try to find the inconsistencies. The original image doesn’t contain the TV logo, so the inconsistency should be high where the logo is. For each frame, we find the difference between that frame and the expected image. To ignore inconsistencies such as compression artifacts and the like, we then look for the consistencies across all the frames’ differences. A graphical overview of this method is given below:
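A 1-D sketch of this pipeline, using the per-pixel average as the estimate of the original image. The frame offsets into the stitched image are assumed to be known from alignment, and frames are flat lists of brightness values for illustration:

```python
def detect_static_overlay(frames, offsets, stitch_len):
    """Score screen-fixed content (a TV logo) across panning frames.

    frames:  equally sized frames, in screen coordinates
    offsets: position of each frame inside the stitched image
    Returns a per-screen-pixel score; high where the logo sits.
    """
    w = len(frames[0])
    # Estimate of the original: average everything covering each stitch pixel
    acc = [0.0] * stitch_len
    cnt = [0] * stitch_len
    for frame, off in zip(frames, offsets):
        for i, p in enumerate(frame):
            acc[off + i] += p
            cnt[off + i] += 1
    estimate = [a / c if c else 0.0 for a, c in zip(acc, cnt)]
    # Average the per-frame differences in screen coordinates: only
    # disagreements appearing in every frame (the logo) stay strong.
    score = [0.0] * w
    for frame, off in zip(frames, offsets):
        for i, p in enumerate(frame):
            score[i] += abs(p - estimate[off + i])
    return [s / len(frames) for s in score]
```

Thresholding the returned score would then give the logo mask.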

The old average method is actually a decent estimate of the original image. While the logo does appear, it is quite a bit dimmer than in the individual frames, as it is averaged with frames where the logo does not appear at that spot. As seen in the “Static stitched difference” image, it works well. However it is far from perfect, as there are some bright areas right around the logo. That is the afterimage of the logos in the estimate; if we could come up with a better estimate, it would disappear.

But that is exactly what we can do now, since if we ignore the area we just found, the image will appear without logos. While our estimate here was good enough, see this example of a credits text we would like to detect and remove: (click on the image to see it in full size)

The first detection (to the left) was far from good enough, but if we make a new estimate of the original image based on that detection, we get closer to the truth and thus a better detection. We can continue iterating until the detection is stable.

Not quite automatic yet

The thresholds have been selected manually, but I believe it might be possible to make this work nearly automatically. If, instead of thresholding at the end of each detection, we keep it in grayscale, we could use it as the probability of that pixel being correct. The less likely a pixel is to be correct, the less it will affect the estimate in the presence of other pixels, hopefully resulting in a better estimate. It will most likely require more iterations to get a good estimate though, and we might still have to do a final thresholding manually.

I could test it; I have just not yet invested the time needed to do so…

Further work

A few pixels in the mask are missing, which could affect the image negatively. Using dilation here might be a good idea.

In case some areas are completely covered by credits or the like in all frames, there exist algorithms to guess the missing pixels. Could be interesting to mess around with.

Also, it does not currently appear to work correctly on the chroma channels. In the credits example, it is still possible to slightly see the effect of the credits, but most of that is from the chroma channels. I’m not quite sure why the credits still affect the chroma channels, but I guess it is related to the subsampling. Probably nothing too major.

But I’m getting close to understanding how the image alignment issues appear, and I want to try to see if I can solve that first.

Well, Overmix is here with a dehumidifier to solve your problem. Too damp? Run it once and watch as your surroundings become clearer.

Your local hot spring before:

and after:

Can’t get enough of Singin’ in the Rain? Don’t worry, just put it in reverse and experience the downpour.

Normal rainy day:

The real deal:

This is another multi-frame approach, and really just as simple as taking the average. Since the steam lightens the image, all you have to do is take the darkest pixel at each position. (In other words, the lighter a pixel is, the more likely it is to be steam.) Since the steam is moving, this way you use the least steamy parts of each frame to get a stitched image with the smallest amount of steam.
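As code, the whole method is a per-pixel minimum (or maximum, for the reverse effect) over the aligned frames; this minimal sketch works on single-channel frames given as flat lists:

```python
def remove_steam(frames):
    """Per-pixel minimum over aligned frames: keep the least steamy sample."""
    return [min(samples) for samples in zip(*frames)]

def enhance_steam(frames):
    """Per-pixel maximum: the opposite, maximising steam or rain."""
    return [max(samples) for samples in zip(*frames)]
```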

If we do the opposite and take the brightest pixel, we can increase the amount of steam. That is not really that interesting, but the second example shows how we can use this to bring out features that would otherwise be treated as noise. We could also combine it with the averaging approach using a range, to deal with real noise, but I did this for fun so I didn’t go that far.

While this is a fairly simple method, it highlights that we can use multiple frames not just to improve quality, but also to analyze and manipulate the image. I have several neat ideas I want to try out, but more about those when I have something working.

There are two colorspaces commonly used in video today, defined in Rec. 601 and Rec. 709 respectively. Simply speaking, Rec. 601 is mainly used for analog sources, while Rec. 709 is mainly for HD TV and BD.
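The practical difference lies in the luma coefficients used when converting between Y'CbCr and R'G'B'. A full-range sketch, with studio-swing and gamma handling deliberately left out:

```python
def yuv_to_rgb(y, cb, cr, kr, kb):
    """Convert one full-range Y'CbCr sample to R'G'B' (Cb/Cr centred at 0).

    kr/kb are the colorspace's luma coefficients:
    Rec. 601: kr=0.299, kb=0.114    Rec. 709: kr=0.2126, kb=0.0722
    """
    kg = 1.0 - kr - kb
    r = y + 2.0 * (1.0 - kr) * cr
    b = y + 2.0 * (1.0 - kb) * cb
    g = y - (2.0 * kr * (1.0 - kr) / kg) * cr \
          - (2.0 * kb * (1.0 - kb) / kg) * cb
    return r, g, b
```

Feeding the same chroma values through both coefficient sets gives visibly different colors, which is exactly the shift seen when a player picks the wrong standard.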

So how does VLC handle this? It assumes everything is Rec. 601, and you get something like this:

The bottom left is from a DVD and the top right is from the BD release. In comparison, here is how it looks in Overmix, using Rec. 601 for the DVD release and Rec. 709 for the BD release:

VLC also seems to ignore the gamma difference between Rec. 601/709 and sRGB, and it handles 10-bit content in a way that reduces color accuracy to worse than 8-bit sources. Behold, the histogram from a Hi10p source:

Free stuff might be nice, but this is what you get…

EDIT: I messed up the studio-swing removal in Overmix (which is now fixed), so the colors were slightly off. It was consistent between Rec. 601/709 so the comparison still holds. Overmix might be nice, but this is what you get…

Just five months later… Here are some early results using artificial data.

Using Wikimedia Commons’ “picture of the day” for October 31st, 2013 by Diego Delso (CC BY-SA 3.0), I created LR (Low Resolution) images which were 4 times smaller in each direction. Each LR image had its own offset, and to have one LR image for every possible offset, 16 images were created.

To detect the sub-pixel alignment afterwards, the images were upscaled to 4x their size and ordinary pixel-based alignment was used. The upscaled versions were only used for the alignment and discarded afterwards. The final image was then rendered at 4x resolution using cubic interpolation, taking the sub-pixel alignment into account. Lastly, the image was deconvolved in GIMP using the G’MIC plugin to remove blur. The results are shown below:

The left side shows the LR image (upscaled using nearest-neighbor interpolation) and the original image respectively. The right side shows the SR (Super Resolution) results using different interpolation methods. Both are cubic; the top uses Mitchell and the bottom uses Spline. In simple terms, Spline is more blurry than Mitchell but has fewer blocking artifacts. Mitchell is usually a pretty good choice (as it is a compromise between several other cubic interpolation methods), but the blocking is quite noticeable here. Using Spline avoids that, and since we attempt to remove blur afterwards it works pretty well. However, do notice that Mitchell recovers slightly more detail in the windows to the right.

But while Mitchell often does appear slightly sharper, it tends to mess up more often, which can clearly be seen on the “The power of” building to the left. The windows are strangely mixed into each other, while they are perfectly aligned when using Spline.

Conclusion

The results are much better than the LR images, however it is more a magnification of 2x than the optimal 4x. And to make matters worse, this was generated from optimal data without blur or noise.

However, this is the simplest way of doing SR, and I believe other methods give better results. Next I want to try the Fourier-based approach, which is also one of the early SR methods. It should give pretty good results, but it is not used much anymore because it does not work for rotated or skewed images.

Using artificial data has really shown me why I have had so little success with it so far. I’m mainly working with anime screenshots and the amount of detail which can be restored is probably not that much. My goal is actually more to avoid blurriness that happens when they are not aligned perfectly. Thus while it should have been obvious, lesson learned, do not test on data which you are not sure whether will give an result or not… What I did gain from this is that anime tends to be rather blurry and that image deconvolution can help a lot. When I understand this blurriness in detail I will probably write more about it though.