Reproducing Japanese Anime Styles With CartoonGAN AI

From Hayao Miyazaki’s Spirited Away to Satoshi Kon’s Paprika, Japanese anime has made it okay for adults everywhere to enjoy cartoons again. Now, a team of researchers from Tsinghua University and Cardiff University has introduced CartoonGAN, an AI-powered technique that simulates the styles of Japanese anime maestri from snapshots of real-world scenery.

A real-world train station scene (left) transformed to a cartoon-style picture (right).

Existing transformation methods based on non-photorealistic rendering (NPR) or convolutional neural networks (CNNs) are either time-consuming or impractical, as they require paired images for model training. Moreover, these methods do not produce satisfactory cartoonization results, because (1) different cartoon styles have unique characteristics involving high-level simplification and abstraction, and (2) cartoon images tend to have clear edges, smooth color shading and relatively simple textures, which present challenges for the texture-descriptor-based loss functions used in existing methods.

CartoonGAN is a GAN framework that enables style translation between two unpaired datasets. It is composed of two CNNs: a Generator that maps input images to the cartoon manifold, and a Discriminator that judges whether an image comes from the target manifold or is synthetic. Residual blocks are introduced to simplify the training process.

To avoid slow convergence and obtain high-quality stylization, dedicated semantic content loss and edge-promoting adversarial loss functions and an initialization phase are integrated into this cartoonization architecture. The content loss is defined using the ℓ1 sparse regularization (instead of the ℓ2 norm) of VGG (Visual Geometry Group) feature maps between the input photo and the generated cartoon image.
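The difference between the two norms can be illustrated with a minimal sketch. It assumes the VGG feature maps have already been extracted and flattened into plain lists of numbers; the function names and the sample values are hypothetical and are not taken from the paper's code.

```python
def l1_content_loss(photo_features, cartoon_features):
    """Mean absolute difference between two equally sized feature vectors."""
    if len(photo_features) != len(cartoon_features) or not photo_features:
        raise ValueError("feature vectors must be non-empty and equally long")
    total = sum(abs(p - c) for p, c in zip(photo_features, cartoon_features))
    return total / len(photo_features)


def l2_content_loss(photo_features, cartoon_features):
    """Mean squared difference, shown only for comparison with the l1 version."""
    if len(photo_features) != len(cartoon_features) or not photo_features:
        raise ValueError("feature vectors must be non-empty and equally long")
    total = sum((p - c) ** 2 for p, c in zip(photo_features, cartoon_features))
    return total / len(photo_features)


# Hypothetical flattened VGG activations for a photo and its cartoonized version.
# A single feature changes strongly, as a heavily stylized region might.
photo = [0.0, 1.0, 2.0, 3.0]
cartoon = [0.0, 1.0, 2.0, 7.0]

print(l1_content_loss(photo, cartoon))  # 1.0
print(l2_content_loss(photo, cartoon))  # 4.0
```

In this toy case the squared penalty is four times the absolute one, which suggests why an ℓ1-style loss is more tolerant of the large local changes that cartoon stylization introduces.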

An example of a Makoto Shinkai stylization shows the importance of each component in CartoonGAN: the initialization phase helps the model converge quickly to reconstruct the target manifold; sparse regularization copes with style differences between cartoon images and real-world photos while retaining the original content; and the adversarial loss function creates the clear edges.
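The edge-promoting idea can be sketched roughly as follows. The sketch assumes, based on the paper's description, that the discriminator sees three kinds of images: genuine cartoons, copies of those cartoons with their edges smoothed away, and generator output. Only the first group should be accepted. The function name and the sample probabilities are hypothetical.

```python
import math


def discriminator_objective(d_cartoon, d_edge_smoothed, d_generated):
    """Value the discriminator tries to maximize.

    Each argument is a list of discriminator outputs in (0, 1), read as the
    probability that an image is a genuine cartoon.
    """
    def mean(values):
        return sum(values) / len(values)

    real_term = mean([math.log(p) for p in d_cartoon])              # accept real cartoons
    edge_term = mean([math.log(1.0 - p) for p in d_edge_smoothed])  # reject blurred-edge cartoons
    fake_term = mean([math.log(1.0 - p) for p in d_generated])      # reject generator output
    return real_term + edge_term + fake_term


# A discriminator that rejects blurred-edge cartoons scores higher
# than one that is fooled by them.
sharp_eyed = discriminator_objective([0.9], [0.1], [0.1])
fooled = discriminator_objective([0.9], [0.9], [0.1])
print(sharp_eyed > fooled)  # True
```

Because blurred-edge cartoons count as fakes, the generator can only fool such a discriminator by producing images with clear edges.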

Both real-world photos and cartoon images are used for model training, while the test data contains only real-world pictures. All training images are resized to 256×256 pixels. Researchers downloaded 6,153 real-world pictures from Flickr, 5,402 of which were for training and the rest for testing. A total of 14,704 cartoon images from popular anime artists Makoto Shinkai, Mamoru Hosoda, Hayao Miyazaki, and Satoshi Kon were used for model training.
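The photo split works out to 751 test images. A minimal sketch of such a split is shown below; it assumes a seeded random shuffle, which the article does not confirm, and the function name is hypothetical.

```python
import random


def split_photos(photo_ids, n_train, seed=0):
    """Shuffle the photo IDs reproducibly and split them into train and test lists."""
    shuffled = list(photo_ids)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to each set
    return shuffled[:n_train], shuffled[n_train:]


# Counts reported above: 6,153 Flickr photos, 5,402 of them used for training.
train_ids, test_ids = split_photos(range(6153), n_train=5402)
print(len(train_ids), len(test_ids))  # 5402 751
```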

Because neural style transfer (NST) uses only a single stylization reference image for model training, it cannot deeply learn a particular anime style, especially when there are significant content differences between the stylization reference image and the input images. Improvements can be seen when more training data is introduced. However, even when a large collection of training data is used, stylization inconsistencies may appear between regions within the image.

Although the identity loss function in the upgraded CycleGAN + L_identity model better preserves the content of the input photo, the model is still unable to reproduce Makoto Shinkai’s or Hayao Miyazaki’s artistic styles as accurately as CartoonGAN does. Moreover, with a processing time of 1617.69 s, CartoonGAN is 33 percent faster than CycleGAN and 50 percent faster than CycleGAN + L_identity.

The paper’s authors say they will focus on improving cartoon portrait stylization for human faces in their future research, while exploring applications for other image synthesis tasks with designed loss functions. The team also plans to extend the CartoonGAN method to video stylization by adding sequential constraints to the training process.