Computer-Made Japanese Letters with a Variational Autoencoder

It’s been a while since my last blog post; I’ve been busy with other things in my life. I have been working on this project for a while now, and now that it is finally done I can share it with you.

Besides my passion for Machine Learning and AI algorithms in general, I have another, not very common hobby: the Japanese language. I have been studying it for a while now and can even get to technical words in the ML field (機械学習 and ディープラーニング), although I still have a long way to go. With this said, I thought to myself: why not join my two biggest passions together and build a cool project?

I decided to design a computer algorithm that can reproduce Japanese letters (specifically Hiragana and Katakana – ひらがなとカタカナ) using a Variational Autoencoder.

The database I used in this project is the “ETL Character Database”. The letters are organized in a very unusual way, so make sure to read the instructions on how to handle the different databases (ETL 1–9).

The first part we will cover is the preprocessing of our data. As I said, the database is not very friendly to data scientists (although massive projects are, of course, in a whole different category). Let’s open the dataset:

In this part I am opening a single character from the database (using the ETL-4 database only at the moment). The code I am using is taken from here, with a few tweaks concerning the execution (e.g. bitstring is not compatible with TensorFlow on my system, so I had to split the work into two notebooks, one for preprocessing and the other for the model itself). As you can see, the letter needs serious preprocessing (cropping, filtering out the noise, and stronger greyscale contrast) before the character can be recognized (which is 小, by the way, meaning “small”).
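The full reading code lives in the bitstring-based snippet linked above, but the core trick is worth sketching: ETL4 stores pixels at 4-bit depth, two pixels per byte, so the raw image bytes have to be unpacked into a proper greyscale array. The nibble order and dimensions below are my assumptions; check the ETL format documentation against your files.

```python
import numpy as np

# Assumed ETL4 sample dimensions, matching the resolution mentioned later.
IMG_H, IMG_W = 76, 72

def unpack_4bit_image(raw, height=IMG_H, width=IMG_W):
    """Unpack 4-bit-per-pixel bytes (two pixels per byte, high nibble
    first - an assumption) into a (height, width) uint8 greyscale array."""
    data = np.frombuffer(raw, dtype=np.uint8)
    pixels = np.empty(data.size * 2, dtype=np.uint8)
    pixels[0::2] = data >> 4     # first pixel of each byte: high nibble
    pixels[1::2] = data & 0x0F   # second pixel: low nibble
    # map the 0..15 range up to 0..255 (17 * 15 = 255) for normal display
    return (pixels[: height * width] * 17).reshape(height, width)
```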

This function creates the dataset itself and stores it in a numpy array for convenience. There are 6113 pictures (greyscale) with a resolution of 76×72 pixels. Let’s set up a function that cleans our dataset. I implemented a simple Gaussian blur, then thresholding (Otsu’s histogram method) with a “TOZERO” binarization in order to preserve the stroke-pressure grey scale. I did this hoping to get better results later on, when we create the Japanese letters.
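As a sketch of that cleaning step: the notebook uses OpenCV’s GaussianBlur followed by threshold with the OTSU and TOZERO flags; the pure-NumPy version below illustrates the same pipeline and is not the original code.

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: pick the threshold that maximizes the
    between-class variance of the greyscale histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = img.size
    weighted_sum = np.dot(np.arange(256), hist)
    w_b = sum_b = 0.0
    best_t, best_var = 0, 0.0
    for t in range(256):
        w_b += hist[t]              # background weight up to t
        w_f = total - w_b           # foreground weight above t
        if w_b == 0 or w_f == 0:
            continue
        sum_b += t * hist[t]
        m_b, m_f = sum_b / w_b, (weighted_sum - sum_b) / w_f
        between_var = w_b * w_f * (m_b - m_f) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def clean_image(img):
    """Gaussian blur, then TOZERO binarization at Otsu's threshold:
    pixels below the threshold become zero, the rest keep their grey
    value, preserving the stroke-pressure information."""
    kernel = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0  # separable 5-tap Gaussian
    padded = np.pad(img.astype(float), 2, mode="edge")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)
    blurred = blurred.astype(np.uint8)
    t = otsu_threshold(blurred)
    return np.where(blurred > t, blurred, 0).astype(np.uint8)
```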

First, I will explain in a nutshell the concept of the VAE in order to shed some light for those of you who are not familiar with this architecture. For a more in-depth and elaborate explanation you can try this page. It gives a very thorough explanation of the relationship between probabilistic graphical models and deep learning concepts.

Autoencoders are in great use in many fields of data science: they can be used for compressing feature vectors, anomaly detection, etc. The approach is unsupervised. The main idea of the autoencoder is to encode the data (labeled x in the graph) into a smaller-dimension vector and then try to decode it back into a reconstruction x’ of the original. The main difference between the AE (Autoencoder) and the VAE is that in the VAE the middle layer is treated as a normal distribution (every node represents its own normal distribution). How do we achieve that? Good question.

Two main things are different from the AE:

The loss function used in the model contains two elements. The first is the reconstruction loss, the same as in the AE, which trains the network to reconstruct the data. The second element is the KL (Kullback–Leibler) divergence loss. The KL divergence measures how different two distributions are. It has some unique characteristics, like asymmetry, a direct relation to the Fisher information metric, and more. We use this loss to force the layer between the encoder and the decoder to capture a distribution close to a normal distribution; the KL divergence works as a regularizer.
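For a diagonal Gaussian with mean μ and variance σ² measured against a standard normal, this KL term has a closed form. A minimal NumPy sketch for illustration, parameterized by log σ² as is usual in VAE code:

```python
import numpy as np

def kl_divergence(mu, log_sigma_sq):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions:
    -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + log_sigma_sq - mu**2 - np.exp(log_sigma_sq), axis=-1)
```

It is zero exactly when the encoder outputs a standard normal (μ = 0, σ = 1) and grows as the latent distribution drifts away from it, which is what lets it act as a regularizer.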

The second difference is the reparameterization trick. Since the network learns its parameters using our trusty old backpropagation algorithm, it needs to differentiate through the layers. If we’re using sampling (as you can see on the right side of the graph) and we want to take the derivative of a function of our sampled variable with respect to our parameters, we have a problem, since our variable is a random variable. The reparameterization trick solves this problem (and I urge you to read the links I mentioned before!).
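In code the trick looks like this: instead of sampling z directly, we sample a parameter-free ε ~ N(0, I) and compute z deterministically from μ and σ, so gradients can flow through μ and log σ². A NumPy sketch; in the TensorFlow model the same line appears with tensors:

```python
import numpy as np

def reparameterize(mu, log_sigma_sq, rng=None):
    """z = mu + sigma * eps with eps ~ N(0, I); the randomness lives
    entirely in eps, so z is a differentiable function of mu and
    log_sigma_sq."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_sigma_sq) * eps
```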

So you’re asking: how will the network synthesize its own Japanese letters, then?

What we’re going to do is this: after we have trained the network on our dataset and made sure we are satisfied with the reconstruction, we will sample variables from a standard normal distribution, feed them as input to the decoder only, and watch the output of the network, which is essentially new, since we don’t have a designated input besides random variables sampled from a normal distribution! Nice, isn’t it?
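As a sketch of that generation step (the `decoder` argument is a stand-in for the trained decoder half of the network; 12 latent units matches the value used later in the post):

```python
import numpy as np

def generate_letters(decoder, n_samples, latent_dim=12, seed=0):
    """Sample z ~ N(0, I) and run it through the decoder only; with no
    real input, the outputs are brand-new letters shaped entirely by the
    learned latent space."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))
    return decoder(z)
```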

So our goal here is to train the network with the dataset we preprocessed beforehand in order to create new handwritten Japanese letters that are not part of the dataset but based on it.

I separated the preprocessing and the model into two separate notebooks, since the “bitstring” package and TensorFlow weren’t compatible for some reason on my rig. We save the data we preprocessed using:

The batch size is set to 32; as the activation function we have a leaky ReLU (other activations would probably do in this simple model) with the negative-side slope set to 0.3 (totally arbitrary). Our dataset is made of greyscale images of 72×72. Dropout helped to avoid isolated activated neurons and overfitting. The kernel size is a not-too-big 4, and the other hyperparameters are standard. The number of latent units (the sampled layer) is 12, settled on after several failed optimization attempts.
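For reference, the hyperparameters above collected in one place (the dictionary and its key names are my own summary, not code from the notebook):

```python
# Hyperparameters described in the text; the key names are my own labels.
HPARAMS = {
    "batch_size": 32,
    "leaky_relu_slope": 0.3,   # negative-side slope of the leaky ReLU activation
    "image_shape": (72, 72),   # greyscale input images
    "conv_kernel_size": 4,     # "not too big" convolution kernel
    "latent_units": 12,        # size of the sampled layer
}
```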

The z variable holds all the hidden units, composed of a mean plus a standard deviation multiplied by an epsilon sampled from a normal distribution (this is the reparameterization trick mentioned before).

The most important part of this piece of code is the loss function definition. As we discussed before, it has two parts: the image loss (MSE/L2 loss for this simple image) and the latent loss for the KL divergence. The Adam optimizer, learning rate, and all the other settings are quite standard.
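Putting the two parts together, the objective is the L2 reconstruction loss plus the KL latent loss; a NumPy sketch of the quantity being minimized (the TensorFlow version in the notebook computes the same expression on tensors):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_sigma_sq):
    """Total VAE loss: L2 image reconstruction plus the KL latent
    regularizer, averaged over the batch (first axis)."""
    # sum the squared pixel error over every non-batch axis
    image_loss = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim)))
    latent_loss = -0.5 * np.sum(
        1.0 + log_sigma_sq - mu**2 - np.exp(log_sigma_sq), axis=-1
    )
    return float(np.mean(image_loss + latent_loss))
```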