
Old Dutch spelling to new Dutch spelling with TensorFlow

On February 10, 2017, the 27th edition of the Computational Linguistics in the Netherlands conference will be held in Leuven. As part of this conference the organisers run a “shared task”. This year, the task is a competition in “translating” the bible from old Dutch to new Dutch. As a small experiment I tried to transform each word from old Dutch to new Dutch using a neural network. Below are my results (adapted from an IPython notebook that can be found here: https://github.com/rmeertens/Old-dutch-to-new-dutch-tensorflow).

In 1637 the first translation of the bible into Dutch from the original Hebrew, Aramaic, and Greek was published (https://en.wikipedia.org/wiki/Statenvertaling). As the Dutch language changed over the years (under the influence of Spain and other countries that conquered us), the bible was translated into newer and newer Dutch.

The difference between the 1637 and 1888 versions is easy to spot:
1637: De Aerde nu was woest ende ledich, ende duysternisse was op den afgront: ende de Geest Godts sweefde op de Wateren.
1888: De aarde nu was woest en ledig, en duisternis was op den afgrond; en de Geest Gods zweefde op de wateren.
If you are Dutch you can probably read the 1637 version, although you will need some time to find the right words. The letters ae (aerde) changed to aa (aarde), ch (ledich) changed to g (ledig), and some words that ended in a t (afgront) now end in a d (afgrond). Even the word Godt changed to God in just 250 years.

As it takes historians a long time to read old texts in their original form, we would like to make them a bit more readable. Interestingly, the Google search bar already understands what you mean thanks to its autocorrect.
How you can implement an autocorrect like Google’s was written down by Peter Norvig in this excellent post: http://norvig.com/spell-correct.html.

Using this autocorrect approach you can build a dictionary of words from old Dutch to new Dutch (if you would like to see this in a blog post, please contact me). What I wanted to do was build such a dictionary from a limited number of words, and use a neural network to generalise the conversion from old Dutch to new Dutch.
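As a taste of Norvig’s approach, here is a minimal sketch. It is my own simplified variant, not Norvig’s exact code: it only considers deletions, replacements, and insertions, while his full version also handles transpositions and ranks candidates by word frequency.

```python
# Minimal Norvig-style corrector: propose all strings one edit away
# from a word and keep those that appear in a modern word list.
def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word, known_words):
    candidates = [w for w in edits1(word) if w in known_words]
    return candidates[0] if candidates else word

modern = {"god", "aarde"}
print(correct("godt", modern))  # prints "god": one deletion away
```

Running each old Dutch word through such a corrector against a modern word list is one way to bootstrap the old-to-new dictionary used below.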

Preparation

I automatically created a dictionary with the old Dutch and new Dutch versions of 20,852 words. Whether this is enough for deep neural networks is something we will find out at the end of this project. Adding more data is difficult, as the only aligned old-new data I have is the bible, with 37,235 lines of text.
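I won’t reproduce the dictionary construction here, but harvesting word pairs from the aligned lines could look roughly like this. This is purely illustrative; the function name and the similarity threshold are my own assumptions, not the notebook’s code.

```python
# Illustrative sketch: pair up words from aligned old/new lines when
# both lines have the same number of tokens, and keep pairs whose
# spellings are close enough to plausibly be the same word.
import difflib

def build_word_pairs(old_lines, new_lines, min_ratio=0.7):
    pairs = {}
    for old, new in zip(old_lines, new_lines):
        old_words, new_words = old.lower().split(), new.lower().split()
        if len(old_words) != len(new_words):
            continue  # alignment too uncertain, skip the line
        for o, n in zip(old_words, new_words):
            if difflib.SequenceMatcher(None, o, n).ratio() >= min_ratio:
                pairs[o] = n
    return pairs

pairs = build_word_pairs(
    ["De Aerde nu was woest"],
    ["De aarde nu was woest"])
print(pairs)  # {'de': 'de', 'aerde': 'aarde', 'nu': 'nu', 'was': 'was', 'woest': 'woest'}
```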

The network

My plan is to use a recurrent neural network (https://en.wikipedia.org/wiki/Recurrent_neural_network) encoder that reads all characters of a word, and a recurrent neural network decoder that generates characters. The data should be preprocessed with this idea in mind. This means setting a maximum length for the words we want to transform, plus some other tricks I will discuss later.

Data preprocessing

As mentioned above, I would like to use a sequence-to-sequence approach. This approach requires inputs of a fixed length. Words longer than that length were discarded in the data-reading step; now we pad the words that are too short.
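A minimal sketch of this filtering and padding step. The `~` padding symbol and the helper name are assumptions of mine; the notebook itself appears to rely on Keras’ `pad_sequences` for the same job.

```python
MAX_LENGTH_WORD = 10
PAD = "~"  # assumed padding symbol, stripped again at output time

def pad_word(word, max_length=MAX_LENGTH_WORD):
    """Left-pad a word to a fixed length; return None if it is too long."""
    if len(word) > max_length:
        return None  # such words are discarded in the data-reading step
    return PAD * (max_length - len(word)) + word

print(pad_word("aerde"))         # prints "~~~~~aerde"
print(pad_word("duysternisse"))  # prints None: longer than 10 characters
```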

Another important step is creating a train set and a test set. We only show the network examples from the train set; at the end I will manually evaluate some examples from the test set and discuss what the network learned. During training we feed the network small batches of data. Because the data splitter is random, we get a different train set every run.
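A random splitter along these lines would produce the behaviour described above. This is a sketch with assumed names, not the notebook’s exact code; passing a seed would make the split reproducible.

```python
import random

def train_test_split(pairs, test_fraction=0.1, seed=None):
    """Randomly split (old, new) word pairs into train and test sets."""
    pairs = list(pairs)
    rng = random.Random(seed)
    rng.shuffle(pairs)  # a different shuffle, and thus split, each run unless seeded
    cut = int(len(pairs) * test_fraction)
    return pairs[cut:], pairs[:cut]

data = [("aerde", "aarde"), ("ledich", "ledig"),
        ("afgront", "afgrond"), ("godt", "god")]
train, test = train_test_split(data, test_fraction=0.25)
print(len(train), len(test))  # prints "3 1"
```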

Implementing the network

Time to implement everything in TensorFlow. We use the embedding_attention_seq2seq function. This function:

embeds our characters

has an encoder which returns a sequence of outputs

has an attention model which uses this sequence to generate output characters

```python
import numpy as np
import tensorflow as tf  # uses the TensorFlow 0.x seq2seq API

batch_size = 64
memory_dim = 256
embedding_dim = 32

enc_input = [tf.placeholder(tf.int32, shape=(None,)) for i in range(MAX_LENGTH_WORD)]
dec_output = [tf.placeholder(tf.int32, shape=(None,)) for t in range(MAX_LENGTH_WORD)]
weights = [tf.ones_like(labels_t, dtype=tf.float32) for labels_t in enc_input]

dec_inp = ([tf.zeros_like(enc_input[0], dtype=np.int32)] +
           [dec_output[t] for t in range(MAX_LENGTH_WORD - 1)])
empty_dec_inp = [tf.zeros_like(enc_input[0], dtype=np.int32, name="empty_dec_input")
                 for t in range(MAX_LENGTH_WORD)]

cell = tf.nn.rnn_cell.GRUCell(memory_dim)

# Create a train version of the encoder-decoder, and a test version
# which does not feed the previous input
with tf.variable_scope("decoder1") as scope:
    outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(
        enc_input, dec_inp, cell, max_features, max_features,
        embedding_dim, feed_previous=False)
with tf.variable_scope("decoder1", reuse=True) as scope:
    runtime_outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(
        enc_input, empty_dec_inp, cell, max_features, max_features,
        embedding_dim, feed_previous=True)

loss = tf.nn.seq2seq.sequence_loss(outputs, dec_output, weights, max_features)
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)

# Init everything
sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
```

Training

Time for training! I will show the network 64,000 words, which means each word is seen around 2-3 times. Every 100 batches I will print the loss to see how well the network is doing.
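A quick sanity check on these numbers, in plain Python:

```python
# 64,000 words shown in batches of 64, over a dictionary of 20,852 pairs
batch_size = 64
total_words_shown = 64000
dictionary_size = 20852

num_batches = total_words_shown // batch_size
passes_over_data = total_words_shown / dictionary_size

print(num_batches)                 # prints 1000, so 10 loss printouts
print(round(passes_over_data, 1))  # prints 3.1: roughly 3 passes over the data
```

In practice the train set is a bit smaller than the full dictionary (long words are discarded and the test set is held out), so each training word is seen slightly more often than this estimate suggests.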

Training analysis

Looks like we are learning something! The loss goes down during the first 500 steps; after that it barely decreases. This is plausible: natural language is difficult, and the rules mapping old Dutch to new Dutch are not always consistent. Without an additional dictionary to verify the network’s solutions, I think it will be difficult to train a perfect network.

Now it’s test time! Let’s input some words the network has not seen before and see what rules it learned.

```python
def get_reversed_max_string_logits(logits):
    string_logits = logits[::-1]
    concatenated_string = ""
    for logit in string_logits:
        val_here = np.argmax(logit)
        concatenated_string += feature_list[val_here]
    return concatenated_string

def print_out(out):
    out = list(zip(*out))
    out = out[:10]  # only show the first 10 samples
    for index, string_logits in enumerate(out):
        print("input: ", end='')
        print_vector(Xin[index])
        print("expected: ", end='')
        expected = Yin[index][::-1]
        print_vector(expected)
        output = get_reversed_max_string_logits(string_logits)
        print("output: " + output)
        print("==============")

# Now run a small test to see what our network does with words
RANDOM_TESTSIZE = 5
Xin, Yin = get_random_reversed_dataset(Xtest, Ytest, RANDOM_TESTSIZE)
Xin_transposed = np.array(Xin).T
Yin_transposed = np.array(Yin).T
feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}
out = sess.run(runtime_outputs, feed_dict)
print_out(out)

def translate_single_word(word):
    Xin = [get_vector_from_string(word)]
    Xin = sequence.pad_sequences(Xin, maxlen=MAX_LENGTH_WORD)
    Xin_transposed = np.array(Xin).T
    feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}
    out = sess.run(runtime_outputs, feed_dict)
    return get_reversed_max_string_logits(out)

interesting_words = ["aerde", "duyster", "salfde", "ontstondt",
                     "tusschen", "wacker", "voorraet", "gevreeset", "cleopatra"]
for word in interesting_words:
    print(word + " becomes: " + translate_single_word(word).replace("~", ""))
```

Test analysis

Looks like our network learned simple rules such as ae -> aa, uy -> ui, and s -> z at the start of a word. There are also difficult cases, such as tusschen (tussen). Some words (voorraet) are hard, as in modern Dutch the final t changed to a d (this is not a hard rule; you simply have to learn it).

Translating “cleopatra” is an interesting case. As it is a name, you don’t want to change it. The network can’t know this, and simply renames her to “kleopatra”.

Conclusion

Using a neural network to go from old Dutch to new Dutch has been mildly successful. Some words are “translated” correctly, while others unfortunately are mistranslated. It has been a fun experiment, and it is interesting to see which rules were “easy” for the network to learn, and which were difficult.

If you want to toy around with this model, or have any questions, please let me know!