Most stitches are largely static, with perhaps a little mouth movement. Some, however, contain significant movement, and Overmix was never intended to merge such movement into a single image in a sensible fashion.

This is because I have yet to see a program which can do this perfectly. While I have seen several that produce something which looks decent at first glance, they usually have several issues. Common ones are lines that do not connect properly in places and straight lines that end up curved.

For me, it needs to be perfect. If I need to manually fix it up, or even redo it from scratch, not much is gained. The goal with Overmix was always to reach a level I would not be able to reach without its assistance. Thus there is no reason to pursue a silver-bullet solution, especially if it gives worse results than doing it manually.

Instead, the approach I have taken is to detect which images belong to which part of the movement. For example, if an arm is moving, we want to figure out which images show the arm before it has started moving, which show it after it has stopped, and which show each step in between. In other words, we end up with a group of images for each frame the animator drew.

These groups can be combined individually without any issues, and the set of resulting images can then be merged manually. Even though that last step is manual, we still benefit: since we might have reduced 100 video frames to 10 animated frames, each animated frame is combined from several video frames, so we can take advantage of the nice denoising and debanding properties of Overmix, which improves the final quality of the stitch.

Cyclic movement

One interesting use case which can be fully automated is cyclic movement, i.e. movement which ends the way it started and continues to loop. This is often a character doing a repetitive motion, such as waving goodbye.

Of course, the real benefit comes when the viewport is moving, as that would be cumbersome to handle manually. The following example was a slow pan up over the course of 89 frames, in which the wind rustles the character’s clothes and hair, reduced to 22 frames:

Notice how the top part of some frames is missing, as the scene ended before the top part had been in view for each frame. The same was the case for the bottom, but since it contained no movement, any frame could fill in the missing information.

(The animation can be downloaded in FullHD resolution APNG here (98 MiB).)

Algorithm

The main difficulty is distinguishing between noise and movement. (Noise can be compression artifacts, but also other things such as TV logos.) A few methods were tried, but the best and simplest of them takes advantage of the fact that most Japanese animation reduces its animation cost by using a lower frame rate. This is typically 3 video frames for each animated frame, though the rate can vary throughout the animation!

The idea is to compute the difference between each frame and the previous one. Since there are usually 3 consecutive frames without animation, this gives a low difference most of the time. But as soon as it hits a frame which contains the next part of the animation, the difference becomes high, causing a spike on the graph. Doing this for every frame gives a result like this:
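As a rough illustration of the frame-to-frame comparison, here is a minimal Python sketch. It is not Overmix's actual code: the function names are made up, frames are assumed to be already aligned and given as flat lists of grayscale values, and the mean absolute difference is just one plausible choice of metric.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def consecutive_differences(frames):
    """Difference between each frame and the one before it."""
    return [mean_abs_diff(prev, cur) for prev, cur in zip(frames, frames[1:])]


# Toy data: two drawings, each held for 3 video frames with slight noise.
frames = [
    [10, 10, 200, 200], [11, 10, 199, 200], [10, 9, 200, 201],  # first drawing
    [10, 200, 200, 10], [10, 201, 200, 11], [9, 200, 199, 10],  # second drawing
]
diffs = consecutive_differences(frames)
print(diffs)  # [0.5, 1.0, 95.5, 0.5, 1.0]
```

The single large value marks the video frame where the animation advances to the next drawing; the small values are noise between frames showing the same drawing.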

Using this, we can determine a noise threshold by drawing a line (shown in purple) which intersects as many blue lines as possible. While this is mostly tested on cyclic movement, it works surprisingly well.
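One possible reading of "a line which intersects as many blue lines as possible" is a horizontal line that crosses the largest number of segments of the difference graph. The Python sketch below follows that reading. The function names, the choice of candidate levels (midpoints between distinct difference values), and the tie-breaking rule are all assumptions made for illustration, not a description of what Overmix does.

```python
def count_crossings(diffs, level):
    """Count graph segments (diffs[i] -> diffs[i+1]) crossed by a horizontal line at `level`."""
    return sum(
        1 for a, b in zip(diffs, diffs[1:])
        if min(a, b) < level < max(a, b)
    )


def find_threshold(diffs):
    """Pick the level crossing the most segments; ties go to the highest level."""
    values = sorted(set(diffs))
    if len(values) < 2:
        raise ValueError("need at least two distinct differences")
    # Only levels strictly between two neighbouring values can change the count.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return max(candidates, key=lambda level: (count_crossings(diffs, level), level))


# Toy difference graph: a spike on every third frame, small noise in between.
diffs = [0.5, 0.6, 40.0, 0.4, 0.9, 38.0, 0.7, 0.5, 42.0, 0.8, 0.6]
threshold = find_threshold(diffs)
print(threshold, count_crossings(diffs, threshold))
```

In this toy data a line placed inside the noise band can cross just as many segments as one placed between the noise and the spikes, which is why the sketch breaks ties toward the higher level.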

The ever-returning issue of sub-pixel alignment strikes back, though. When the stitch contains movement in both directions, sub-pixel misalignment can make the difference large enough to cause issues. This can be avoided by using sub-pixel alignment, but as of now that is quite a bit slower in Overmix.

Once the threshold has been determined, the images are separated into groups based on it. Each image is compared against the last image of each existing group, and if the difference is below the threshold, the image is added to that group. If it could not be added to any group, a new group containing just that image is created. This is done for all the images.
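The grouping step can be sketched in Python as follows. This is an illustration under assumptions, not Overmix's implementation: the names are hypothetical, frames are flat lists of aligned grayscale values, groups are checked in creation order, and the first matching group wins.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def group_frames(frames, threshold, difference=mean_abs_diff):
    """Return groups of frame indices, one group per distinct drawing."""
    groups = []
    for index, frame in enumerate(frames):
        for group in groups:
            # Compare against the last image added to this group.
            if difference(frames[group[-1]], frame) < threshold:
                group.append(index)
                break
        else:
            # No existing group matched, so this image starts a new one.
            groups.append([index])
    return groups


# Toy cyclic movement: drawing A, then drawing B, then back to drawing A.
frames = [
    [10, 10, 200, 200], [11, 10, 199, 200],  # A
    [10, 200, 200, 10], [10, 201, 200, 11],  # B
    [10, 9, 200, 201], [9, 10, 200, 200],    # A again
]
groups = group_frames(frames, threshold=20)
print(groups)  # [[0, 1, 4, 5], [2, 3]]
```

Because every existing group is checked, frames that return to an earlier drawing rejoin that drawing's group, which is what lets cyclic movement collapse into a small set of animated frames.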

Further work

Notice that the file size of the APNG image is nearly 100 MB. This is because each of the 22 images is rendered independently of the others, which results in 22 completely different images. But the background is the same in each and every frame, so we are not taking advantage of the information about the background found in the other frames. Thus, by detecting which parts of the frames are consistent and which differ when rendering, we can both improve quality and reduce file size.

Aligning the resulting frames can be tricky when there is a lot of movement in a cyclic animation, because the images bear only a little resemblance to each other. Even when this does work, sub-pixel alignment and rendering are more important than usual, since otherwise the ±0.5 pixel error will show up as a shaky animation. I have an idea for how to solve the alignment issue, but my math knowledge is currently too limited to actually implement it.