(Part 1) Machine Translation with Deep Neural Networks

Language and intelligence

Even if intelligence is hard to define, one thing is clear: language and intelligence are intimately linked. Language is a tool we use not only to communicate with other people; it also reflects the way our brain abstracts and conceptualises the world.

That doesn’t mean people cannot understand concepts which do not exist in their own language. You can think of speech as a way to pave “thought motorways,” or mental patterns that are used frequently and therefore easily accessible to make sense of concepts.

Here’s an example: certain languages have no egocentric directions, i.e. directions that put the speaker at the centre. The Guugu Yimidhirr – an Aboriginal community speaking an eponymous language – always define directions relative to compass points and never relative to their own perspective. They don’t take a left at the junction; they turn west. Even so, the Guugu Yimidhirr are superb navigators, whether in the Australian outback or in urban street canyons. A child learning the language has to grasp the much harder concept of compass directions – a process that takes multiple years in the case of the Guugu Yimidhirr.

That’s why language is an important topic in Artificial Intelligence, with its own area of research called Natural Language Processing (NLP). The field is full of challenges, ranging from speech recognition and speech synthesis to the analysis of grammatical structures and semantics – NLP covers practically every aspect of language. Many NLP problems, though, are far from trivial, and despite great strides – think Siri, Alexa & Co. – we still can’t chat with a computer about anything and everything. Machines are still not capable of capturing the meaning of a text without error, and they still sound a bit tinny.

A short history of NLP

Machine translation systems were already a subject of research in the early years of AI. As in other AI disciplines, texts back then were processed using rigid, human-made rules. In 1954, the Georgetown-IBM experiment – a program whose translation procedure was composed of just six grammatical rules – managed to translate more than 60 sentences from Russian into English. These early successes inspired great optimism and opened the floodgates of research investment.

Experts underestimated the complexity of the problem, though, and the originally promising approaches could never catch up to the expectations these early successes had raised. With little to show for their efforts, most of the research funding dried up and NLP had to endure an “AI winter.”

Things began to look up when the original, rigid rule-based systems were gradually replaced by statistical learning methods. Stochastic models such as Hidden Markov Models (HMMs) led to great leaps forward and found their way into many commercially successful products.

Fast forward to today, and we’re in the middle of a Deep Learning Explosion. Experts are busy developing NLP solutions which significantly outperform almost all non-neural systems. The backend of Google Translate, for instance, was replaced by a Deep Learning System in September 2016.

Deep Machine Translation

Machine translation consists of complex subtasks that occur in almost every NLP problem and that long remained unsolved. For a translation, it doesn’t suffice to simply swap the words of one language for those of another: the meaning behind the words needs to be captured and then repackaged in suitable words of the target language.

Today, Deep Learning methods based on neural networks are used in NLP. Their advantage lies in extracting the meaning – and then the translation – directly from raw data, without manual pre-processing. This strategy is called end-to-end, but it only works properly when the relevant features of the data are learned as well. Previously, these features were defined by hand by experts. Nowadays, deep neural networks are the method of choice, deducing a language model directly from a large data set. Let’s have a look at how modern neural translation software, similar to that used by Google Translate, is constructed.

A modern neural translation system consists of three key components. First, the input data – the sequence of letters in a text – needs to be transformed into numbers. Then, the input sequence is translated into an internal representation, which is finally decoded into the output language. All parts of the system are optimised to work together (hence “end-to-end”) with the backpropagation algorithm. This requires that each individual component be differentiable. The fundamentals of how neural networks learn are explained in more detail here.
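The three stages can be sketched in a few lines of Python. This is a deliberately simplified toy, not Google Translate’s architecture: the dimensions, the random weights, and the functions `embed`, `encode` and `decode` are all illustrative assumptions, and a real system would learn the weight matrices jointly via backpropagation rather than leave them random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
VOCAB_SIZE = 8      # symbols in the source language
EMBED_DIM = 4       # length of each embedding vector
HIDDEN_DIM = 5      # size of the internal representation
TARGET_VOCAB = 6    # symbols in the target language

# Randomly initialised parameters; in a real end-to-end system
# these would be trained jointly with backpropagation.
W_embed = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
W_enc = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))
W_dec = rng.normal(size=(HIDDEN_DIM, TARGET_VOCAB))

def embed(token_ids):
    """Step 1: turn discrete symbols into vectors."""
    return W_embed[token_ids]                      # (seq_len, EMBED_DIM)

def encode(embedded):
    """Step 2: compress the sequence into an internal representation."""
    return np.tanh(embedded @ W_enc).mean(axis=0)  # (HIDDEN_DIM,)

def decode(state, out_len):
    """Step 3: produce output-language symbols from the representation."""
    logits = state @ W_dec                         # (TARGET_VOCAB,)
    # A real decoder generates one token at a time; here we just
    # pick the most likely target symbol repeatedly.
    return [int(np.argmax(logits))] * out_len

source = [3, 1, 4]                                 # a "sentence" of token ids
translation = decode(encode(embed(source)), out_len=2)
```

Each step is a differentiable operation on arrays, which is exactly what makes joint, end-to-end training possible.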

Embedding input data

Unlike image and audio data, text consists of discrete symbols (think letters or words), which can’t readily be fed into a neural network. That’s why the first part of machine translation calls for creating a numerical input. Of course, each symbol could simply be assigned one fixed number and a sentence transformed into a string of numbers. It’s more common, though, to use a list of numbers per symbol – a vector representation, or so-called embedding. Today, it’s common practice to represent the input at the word level – mainly because the entire system trains and runs faster this way. A word represented as a vector can be understood as a point in a high-dimensional space. This has clear advantages over fixed identification numbers: similar words (however similarity is defined) can cluster close to each other, so the representation itself encodes meaning very early on.
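The “similar words sit close together” idea can be made concrete with cosine similarity. The three-dimensional vectors below are purely hypothetical stand-ins – real learned embeddings have hundreds of dimensions – but they show how vector geometry, unlike a fixed ID number, lets us compare words.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means "pointing the same way", 0.0 means unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar words end up closer in the vector space.
sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

With plain identification numbers, “cat = 7” and “dog = 3912” carry no such relationship.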

Words appear in different forms in many languages: verbs are conjugated and nouns declined, yet all forms should share the same representation, since they mean the same thing. To take pressure off the neural model and spare it from having to learn this mapping, systems often apply a so-called lemmatisation of words – a pre-processing step in which each word is replaced by its base form.
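A minimal lemmatisation sketch might use nothing more than a lookup table. Production systems rely on full morphological analysers; the table entries and the `lemmatise` helper below are purely illustrative assumptions.

```python
# Hand-written lookup table mapping word forms to their base form.
# Purely illustrative -- real lemmatisers cover whole morphologies.
LEMMAS = {
    "ran": "run", "running": "run", "runs": "run",
    "mice": "mouse",
    "better": "good",
}

def lemmatise(tokens):
    # Replace each word by its base form; unknown words pass through.
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

print(lemmatise("The mice ran".split()))
# -> ['the', 'mouse', 'run']
```

After this step, “ran”, “running” and “runs” all map to the same embedding, so the network no longer has to learn that they mean the same.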

Naïve vector representations of discrete symbols are so-called one-hot vectors. Here, each symbol (e.g. each word) is replaced by a vector whose length equals the size of the vocabulary. This vector consists only of zeros and a single one, whose position uniquely identifies the symbol in the vocabulary. If a training dataset contains 100,000 words, for instance, each word is replaced by a vector of length 100,000. The vector for the word “yes” might look like [1, 0, 0, …], the word “no” like [0, 1, 0, …], and so on.
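One-hot encoding takes only a few lines. The three-word vocabulary below is a toy stand-in for the 100,000-word vocabulary in the example above.

```python
import numpy as np

# A toy vocabulary; a real training set may contain 100,000+ words.
vocab = ["yes", "no", "maybe"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))   # a vector of zeros...
    vec[index[word]] = 1.0       # ...with a single one at the word's position
    return vec

print(one_hot("yes"))   # [1. 0. 0.]
print(one_hot("no"))    # [0. 1. 0.]
```

Note how wasteful this is: each vector stores one useful bit in `len(vocab)` numbers – which is exactly what the embedding step below addresses.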

This type of representation is very inefficient, but matrix multiplication can turn one-hot vectors into much shorter vectors that conform to our word embedding. If our embedding is to have a length of 100 elements and our vocabulary contains 100,000 words, this mapping can be represented by a matrix with 100 rows and 100,000 columns. But what should this matrix look like? Since matrix multiplication is differentiable, we can train the embedding as part of the complete system (LINK to article about gradient descent and backpropagation).
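The multiplication has a neat interpretation: multiplying the embedding matrix by a one-hot vector simply selects one of its columns. The dimensions below are small stand-ins (10 words, 4-element embeddings, random entries) for the 100,000 × 100 example above; in a trained system the matrix entries are learned, not random.

```python
import numpy as np

rng = np.random.default_rng(42)

VOCAB_SIZE = 10   # stand-in for 100,000
EMBED_DIM = 4     # stand-in for 100

# The embedding matrix: EMBED_DIM rows, VOCAB_SIZE columns.
# In practice its entries are learned via backpropagation.
E = rng.normal(size=(EMBED_DIM, VOCAB_SIZE))

word_id = 3
one_hot = np.zeros(VOCAB_SIZE)
one_hot[word_id] = 1.0

# Matrix x one-hot vector = the column of E belonging to that word.
embedding = E @ one_hot
```

Because column selection is just matrix multiplication, gradients flow through it, and the embedding can be optimised together with the rest of the network.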

The meaning of a word depends on its context: an address you send your mail to is not the same as an address delivered in front of a crowd. Read how an AI masters this difference in the second part.