Introduction

Question answering has received more attention as large search engines have largely mastered general information retrieval and are starting to cover more edge cases. Question answering is one of those edge cases, because it can involve a lot of syntactic nuance that isn’t captured by standard information retrieval models such as LDA or LSI. Hypothetically, deep learning models would be better suited to this type of task because of their ability to capture higher-order syntax. Two papers, “Applying deep learning to answer selection: a study and an open task” (Feng et al. 2015) and “LSTM-based deep learning models for non-factoid answer selection” (Tan et al. 2016), are recent examples which have applied deep learning to question-answering tasks with good results.

Feng et al. used an in-house Java framework for their work, and Tan et al. built their model entirely in Theano. Personally, I am a lot lazier than them, and I don’t understand CNNs very well, so I would like to use an existing framework to build one of their models and see if I can get similar results. Keras is a really popular one that has support for everything we might need to put the model together.

Installing Keras

See the instructions here on how to install Keras. The simple route is to install using pip, e.g.

sudo pip install --upgrade keras

There are some important features that might not be available without the most recent version. pip installs the latest published release, which can lag behind the development version, so it might be helpful to install from source. This is actually pretty straightforward! Just change to the directory where you want your source code to be, clone the Keras repository with Git, and run its setup script from inside the clone.

One benefit of this is that if you want to add a custom layer, you can add it to the Keras installation and be able to use it across different projects. Even better, you could fork the project and clone your own fork, although this gets into areas of Git beyond my understanding.

The example scripts that come with Keras are pretty interesting to play around with. It is really cool how easy it is to get one of these set up! With Keras, a high-level model design can be quickly implemented.

Word Embeddings

Ok! Let’s dive in. The first challenge that you might think of when designing a language model is what the units of the language might be. A reasonable dataset might have around 20000 distinct words after lemmatization. If each word is represented as a one-hot vector and the average sentence is 40 words long, then you’re left with a 20000 x 40 matrix just to represent one sentence, which is 3.2 megabytes if each entry is stored in 32 bits. This obviously doesn’t scale, so the first step in developing a good language model is to figure out how to reduce the number of dimensions required to represent a word.

One popular method of doing this is using word2vec. word2vec is a way of embedding words in a vector space so that words that are semantically similar are near each other. There are some interesting consequences of doing this, like being able to do word addition and subtraction:

king - man + woman = queen

In Keras, trainable word embeddings are available as an Embedding layer (the layer learns its vectors along with the rest of the model rather than running word2vec itself). This layer takes as input an (n_batches, sentence_length) dimensional matrix of integers, where each integer is the index of a word in the vocabulary, and outputs an (n_batches, sentence_length, n_embedding_dims) dimensional matrix, where the last dimension is the word embedding.
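To make the shapes concrete, here is a minimal NumPy sketch of what an embedding lookup does. It is not the Keras implementation; the sizes, the random matrix, and the word indices are all made up for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_words, n_embedding_dims = 20000, 100
rng = np.random.RandomState(0)

# One row per word in the vocabulary; in a real model these rows are trained.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(n_words, n_embedding_dims))

# A batch of 2 sentences, each 4 words long, stored as word indices.
sentences = np.array([[12, 7, 431, 9],
                      [3, 3, 19999, 0]], dtype='int32')

# The lookup is plain integer indexing: every index is replaced by its row.
embedded = embedding_matrix[sentences]
print(embedded.shape)  # (2, 4, 100)
```

The same word index always maps to the same row, which is why the two occurrences of word 3 in the second sentence get identical vectors.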

There are two advantages to this. The first is space: Instead of 3.2 megabytes, a 40 word sentence embedded in 100 dimensions would only take 16 kilobytes, which is much more reasonable. More importantly, word embeddings give the model a hint at the meaning of each word, so it will converge more quickly. There are significantly fewer parameters which have to be jostled around, and parameters are sort of tied together in a sensible way so that they jostle in the right direction.
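The space figures above can be checked with a few lines of arithmetic. This sketch assumes, as the text does, 32-bit values, a 20000-word vocabulary, 40-word sentences and 100 embedding dimensions.

```python
BYTES_PER_VALUE = 4  # 32 bits per stored number

def sentence_bytes(sentence_length, dims_per_word):
    """Bytes needed to store one sentence as a dense matrix."""
    return sentence_length * dims_per_word * BYTES_PER_VALUE

one_hot = sentence_bytes(40, 20000)   # one dimension per vocabulary word
embedded = sentence_bytes(40, 100)    # 100-dimensional embeddings

print(one_hot / 1e6, 'MB')    # 3.2 MB
print(embedded / 1e3, 'KB')   # 16.0 KB
print(one_hot // embedded)    # 200
```

The embedded sentence is 200 times smaller, which is simply the ratio of the vocabulary size to the embedding size.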

Let’s try this out! We can train a recurrent neural network to predict some dummy data and examine the embedding layer for each vector. This model takes a sentence like “sam is red” or “sarah not green” and predicts what color the person is. It is a very simple example, but it will illustrate what the Embedding layer is doing, and also illustrate how we can turn a bunch of sentences into vectors of indices by building a dictionary.

import itertools
import numpy as np

sentences = '''
sam is red
hannah not red
hannah is green
bob is green
bob not red
sam not green
sarah is red
sarah not green'''.strip().split('\n')
is_green = np.asarray([[0, 1, 1, 1, 1, 0, 0, 0]], dtype='int32').T

lemma = lambda x: x.strip().lower().split(' ')
sentences_lemmatized = [lemma(sentence) for sentence in sentences]
words = set(itertools.chain(*sentences_lemmatized))
# the vocabulary: sam, hannah, bob, sarah, is, not, red, green (in arbitrary order)

# dictionaries for converting words to integers and vice versa
word2idx = dict((v, i) for i, v in enumerate(words))
idx2word = list(words)

# convert the sentences to a numpy array of word indices
to_idx = lambda x: [word2idx[word] for word in x]
sentences_idx = [to_idx(sentence) for sentence in sentences_lemmatized]
sentences_array = np.asarray(sentences_idx, dtype='int32')

# parameters for the model
sentence_maxlen = 3
n_words = len(words)
n_embed_dims = 3

# put together a model to predict
# (SimpleRNN is the plain recurrent layer in Keras; there is no layer that can be called as RNN(1))
from keras.layers import Input, Embedding, SimpleRNN
from keras.models import Model

input_sentence = Input(shape=(sentence_maxlen,), dtype='int32')
input_embedding = Embedding(n_words, n_embed_dims)(input_sentence)
color_prediction = SimpleRNN(1)(input_embedding)

predict_green = Model(input=[input_sentence], output=[color_prediction])
predict_green.compile(optimizer='sgd', loss='binary_crossentropy')

# fit the model to predict what color each person is
predict_green.fit([sentences_array], [is_green], nb_epoch=5000, verbose=1)
embeddings = predict_green.layers[1].W.get_value()

# print out the embedding vector associated with each word
for i in range(n_words):
    print('{}: {}'.format(idx2word[i], embeddings[i]))

The embedding layer embeds the words into 3 dimensions. A sample of the vectors it produces is seen below. As predicted, the model learns useful word embeddings.

Each category is grouped in the 3-dimensional vector space. The network learned each of these categories from how each word was used; Sarah and Sam are the red people, while Bob and Hannah are the green people. However, it did not differentiate well between not, is, red, and green, because those words weren’t needed for the decision task: in this data, the name alone determines the color.

Recurrent Neural Networks

As the Keras examples illustrate, there are different philosophies on deep language modeling. Feng et al. did a bunch of benchmarks with convolutional networks, and ended up with some impressive results. Tan et al. used recurrent networks with some different parameters. I’ll focus on recurrent neural networks first (What do pirates call neural networks? Arrrgh NNs). I’ll assume some familiarity with both recurrent and convolutional neural networks. Andrej Karpathy’s blog discusses recurrent neural networks in detail. Here is an image from that post which explains the core concept:

Vanilla

The basic RNN architecture is essentially a feed-forward neural network that is stretched out over a bunch of time steps and has its intermediate output added to the next input step. This idea can be expressed as an update equation for each input step:

hidden = sigmoid(dot(input_vector, W) + dot(prev_hidden, U) + bias)

Note that dot indicates vector-matrix multiplication. Multiplying a vector of dimensions <m> by a matrix of dimensions <m, n> can be done with dot(<m>, <m, n>) and yields a vector of dimensions <n>. This is consistent with its usage in Theano and Keras. In the update equation, we multiply each input_vector by our input weights W, multiply the prev_hidden vector by our hidden weights U, and add a bias, before passing the sum to the activation function sigmoid. To get the many to one behavior in the image, we can grab the last hidden state and use that as our output. To get the one to many behavior, we can pass one input vector and then just pass a bunch of zero vectors to get as many hidden states as we want.
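The update described above can be sketched in NumPy. This is an illustration only, not the Keras or Theano code; the sizes and the random weights are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(input_vector, prev_hidden, W, U, bias):
    # hidden = sigmoid(dot(input_vector, W) + dot(prev_hidden, U) + bias)
    return sigmoid(np.dot(input_vector, W) + np.dot(prev_hidden, U) + bias)

def run_rnn(inputs, W, U, bias):
    """Return the hidden state after every time step."""
    hidden = np.zeros(U.shape[0])
    states = []
    for input_vector in inputs:
        hidden = rnn_step(input_vector, hidden, W, U, bias)
        states.append(hidden)
    return np.array(states)

rng = np.random.RandomState(0)
n_in, n_hidden = 3, 5  # hypothetical sizes
W = rng.normal(size=(n_in, n_hidden))
U = rng.normal(size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)

# many to one: feed a 4-step sequence and keep only the last hidden state
sequence = rng.normal(size=(4, n_in))
last_state = run_rnn(sequence, W, U, bias)[-1]

# one to many: one real input followed by zero vectors
one_input = np.vstack([sequence[:1], np.zeros((3, n_in))])
all_states = run_rnn(one_input, W, U, bias)
print(last_state.shape, all_states.shape)  # (5,) (4, 5)
```

Even with zero inputs the later states keep changing, because each one still depends on the previous hidden state through U.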

LSTM

If the RNN gets really long, then we run into a lot of difficulty training the model. The effect of something early in the sequence on the end result is very small relative to later components, so it is hard to use that information in updating the weights. To solve this, several methods have been proposed, and two have been implemented in Keras. The first is the Long Short-Term Memory (LSTM) unit, which was proposed by Hochreiter and Schmidhuber 1997. This model uses a second hidden state which stores information from further back in the sequence, allowing that information to have a stronger effect on the end result. The update equations for this model are:

Note that * indicates element-wise multiplication. This is consistent with its usage in Theano and Keras. First, there are a bunch more parameters to train; not only do we have input-to-hidden and hidden-to-hidden weight matrices for each gate, but we also have an accompanying candidate_state. The candidate state is like a second hidden state that transfers information to and from the hidden state. It is like a safety deposit box for putting things in and taking things out.
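For reference, here is a NumPy sketch of one step of the standard LSTM formulation. The gate names and the params dictionary are hypothetical, and the notation may differ from the equations this post originally showed; what the text calls candidate_state is carried here as state (usually called the cell state in the literature).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, prev_hidden, prev_state, params):
    """One step of a standard LSTM; params maps gate name -> (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return activation(np.dot(x, W) + np.dot(prev_hidden, U) + b)

    input_gate = gate('input', sigmoid)     # how much new information to store
    forget_gate = gate('forget', sigmoid)   # how much old state to keep
    output_gate = gate('output', sigmoid)   # how much state to expose
    proposal = gate('proposal', np.tanh)    # new information to store
    state = forget_gate * prev_state + input_gate * proposal
    hidden = output_gate * np.tanh(state)
    return hidden, state

rng = np.random.RandomState(0)
n_in, n_hidden = 3, 4  # hypothetical sizes
params = {name: (rng.normal(size=(n_in, n_hidden)),
                 rng.normal(size=(n_hidden, n_hidden)),
                 np.zeros(n_hidden))
          for name in ('input', 'forget', 'output', 'proposal')}

hidden = state = np.zeros(n_hidden)
for x in rng.normal(size=(6, n_in)):
    hidden, state = lstm_step(x, hidden, state, params)
print(hidden.shape, state.shape)  # (4,) (4,)
```

The state is only ever scaled by the forget gate and added to, which is what lets information from early steps survive to the end of the sequence.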

GRU

The second model is the Gated Recurrent Unit (GRU), which was proposed by Cho et al. 2014. The equations for this model are as follows:

In this model, there is an update_gate which controls how much of the previous hidden state to carry over to the new hidden state, and a reset_gate which controls how much of the previous hidden state is used when computing the proposed new state. This allows potentially long-term dependencies to be propagated through the network.
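A NumPy sketch of one GRU step, following the formulation in Cho et al. as I understand it, is below. The parameter layout and the random_params helper are hypothetical, and some implementations swap the roles of update_gate and 1 - update_gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, prev_hidden, params):
    """One GRU step; params holds (W, U, b) for update gate, reset gate, proposal."""
    (W_z, U_z, b_z), (W_r, U_r, b_r), (W_h, U_h, b_h) = params
    update_gate = sigmoid(np.dot(x, W_z) + np.dot(prev_hidden, U_z) + b_z)
    reset_gate = sigmoid(np.dot(x, W_r) + np.dot(prev_hidden, U_r) + b_r)
    # the reset gate decides how much of the old state feeds the proposal
    proposal = np.tanh(np.dot(x, W_h) + np.dot(reset_gate * prev_hidden, U_h) + b_h)
    # the update gate decides how much of the old state is carried over
    return update_gate * prev_hidden + (1.0 - update_gate) * proposal

rng = np.random.RandomState(0)
n_in, n_hidden = 3, 4  # hypothetical sizes

def random_params(update_bias=0.0):
    make = lambda b: (rng.normal(size=(n_in, n_hidden)),
                      rng.normal(size=(n_hidden, n_hidden)),
                      np.full(n_hidden, b))
    return [make(update_bias), make(0.0), make(0.0)]

prev_hidden = rng.uniform(-1, 1, size=n_hidden)
x = rng.normal(size=n_in)

mixed = gru_step(x, prev_hidden, random_params())
carried = gru_step(x, prev_hidden, random_params(update_bias=50.0))
print(np.abs(carried - prev_hidden).max())  # practically zero
```

With a large bias the update gate saturates near 1 and the previous hidden state passes through almost untouched, which is the mechanism that carries long-term dependencies.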

My implementations of these models in Theano, as well as optimizers for training them, can be found in this GitHub repository.

RNN Example: Predicting Dummy Data

Now that we’ve seen the equations, let’s see how Keras implementations compare on some sample data.

The results will vary from trial to trial. RNNs are exceptionally difficult to train. However, in general, a model that can take advantage of long-term dependencies will have a much easier time seeing how two sequences are different.

Attentional RNNs

It isn’t strictly important to understand the RNN part before looking at this part, but it will help everything make more sense. The next component of language modeling, which was the focus of the Tan paper, is the Attentional RNN. The essential components of this model are described in “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (Xu et al. 2016). I’ll try to hash it out in this blog post a little bit and look at how to build it in Keras.

Lambda Layer

First, let’s look at how to make a custom layer in Keras. There are a couple options. One is the Lambda layer, which does a specified operation. An example of this could be a layer that doubles the value it is passed:

This doubles our input data. Note that there are no trainable weights anywhere in this model, so it couldn’t actually learn anything. What if we wanted to multiply our input vector by some trainable scalar that predicts the output vector? In this case, we will have to write our own layer.

Building a Custom Layer Example

Let’s jump right in and write a layer that learns to multiply an input by a scalar value and produce an output.

from keras.engine import Layer
from keras import initializations

# our layer will take input shape (nb_samples, 1)
class MultiplicationLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('glorot_uniform')
        super(MultiplicationLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # each sample should be a scalar
        assert len(input_shape) == 2 and input_shape[1] == 1
        self.multiplicand = self.init(input_shape[1:], name='multiplicand')

        # let Keras know that we want to train the multiplicand
        self.trainable_weights = [self.multiplicand]

    def get_output_shape_for(self, input_shape):
        # we're doing a scalar multiply, so we don't change the input shape
        assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
        return input_shape

    def call(self, x, mask=None):
        # this is called during MultiplicationLayer()(input)
        return x * self.multiplicand

# test the model
from keras.layers import Input
from keras.models import Model

# input is a single scalar
input = Input(shape=(1,), dtype='int32')
multiply = MultiplicationLayer()(input)

model = Model(input=[input], output=[multiply])
model.compile(optimizer='sgd', loss='mse')

import numpy as np
input_data = np.arange(10)
output_data = 3 * input_data

model.fit([input_data], [output_data], nb_epoch=10)
print(model.layers[1].multiplicand.get_value())
# should be close to 3

There we go! We have a complete model. We could change it around to make it fancier, like adding a broadcastable dimension to the multiplicand so that the layer could be passed a vector of numbers instead of just a scalar. Let’s look closer at how we built the multiplication layer:

def __init__(self, **kwargs):
    self.init = initializations.get('glorot_uniform')
    super(MultiplicationLayer, self).__init__(**kwargs)

First, we make a weight initializer that we can use later to get weights. glorot_uniform is just a particular way to initialize weights. We then call the __init__ method of the super class.

def build(self, input_shape):
    # each sample should be a scalar
    assert len(input_shape) == 2 and input_shape[1] == 1
    self.multiplicand = self.init(input_shape[1:], name='multiplicand')

    # let Keras know that we want to train the multiplicand
    self.trainable_weights = [self.multiplicand]

This method specifies the components of the model, for when we build it. The only component we need is the scalar to multiply by, so we initialize a new tensor by calling self.init, the initializer we created in the __init__ method.

def get_output_shape_for(self, input_shape):
    # we're doing a scalar multiply, so we don't change the input shape
    assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
    return input_shape

This method tells the builder what the output shape of this layer will be given its input shape. Since our layer just does a scalar multiply, it doesn’t change the output shape from the input shape. For example, scalar multiplying the input [1, 2, 3] of dimensions <3, 1> by a scalar factor of 2 gives the output [2, 4, 6], which has the same dimensions <3, 1>.

def call(self, x, mask=None):
    # this is called during MultiplicationLayer()(input)
    return x * self.multiplicand

This is the bread and butter of the layer, where we actually perform the operation. We specify that the output of this layer is the input x multiplied element-wise by our multiplicand tensor. Note that this method takes a while to run, because whatever backend we use (for example, Theano) has to put together the tensors in the right way. Because that step is slow, it is good practice to add assert checks in the build and get_output_shape_for methods, so that shape mistakes fail quickly instead of surfacing after the backend has done its work.

Characterizing the Attentional LSTM

Now that we’ve got an idea of how to build a custom layer, let’s look at the specifications for an attentional LSTM. Following Tan et al., we can augment our LSTM equations from earlier to include an attentional component. The attentional component requires some attention vector attention_vec.

The new equations are the last three, which correspond to equations 9, 10 and 11 from the paper (approximately reproduced below, using different notation).

The attention parameter is a function of the current hidden state and the attention vector mixed together. Each is first put through a matrix, summed and put through an activation function to get an attention state, which is then put through another transformation to get an attention parameter. The attention parameter then re-updates the hidden state. Supposedly, this is conceptually similar to TF-IDF weighting, where the model learns to weight particular states at particular times.
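The three attention steps described above can be sketched in NumPy over a whole sequence of hidden states. This is a sketch under assumptions, not the layer from this post: the weight names U_a, U_m and u_s are hypothetical, tanh is assumed as the activation, and normalizing the attention parameter with a softmax over time steps is my reading of the paper.

```python
import numpy as np

def attend(hidden_states, attention_vec, U_a, U_m, u_s):
    """Reweight hidden states by their relevance to attention_vec.

    hidden_states: (n_steps, n_hidden); attention_vec: (n_attention,)
    """
    # attention state: each input goes through a matrix, then sum and tanh
    m = np.tanh(np.dot(hidden_states, U_a) + np.dot(attention_vec, U_m))
    # attention parameter: one scalar per time step, normalized over time
    scores = np.dot(m, u_s)
    s = np.exp(scores - scores.max())
    s /= s.sum()
    # re-update every hidden state with its attention parameter
    return hidden_states * s[:, None], s

rng = np.random.RandomState(0)
n_steps, n_hidden, n_attention, n_mix = 6, 4, 5, 3  # hypothetical sizes
hidden_states = rng.normal(size=(n_steps, n_hidden))
attention_vec = rng.normal(size=n_attention)
U_a = rng.normal(size=(n_hidden, n_mix))
U_m = rng.normal(size=(n_attention, n_mix))
u_s = rng.normal(size=n_mix)

weighted, s = attend(hidden_states, attention_vec, U_a, U_m, u_s)
print(weighted.shape, s.shape)  # (6, 4) (6,)
```

A layer that computes attention inside the recurrent step function only sees one time step at a time, so it cannot normalize across the sequence this way and has to use an unnormalized weight instead.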

Building an Attentional LSTM Example

Now that we have all the components for an Attentional LSTM, let’s see the code for how we could implement this. The attentional component can be tacked onto the LSTM code that already exists.

Let’s look at what each function is doing individually. Note that this builds heavily upon the already-existing LSTM implementation.

from keras import backend as K
from keras.layers import LSTM

We will create a subclass (does python even do subclasses?) of the LSTM implementation that Keras already provides. The Keras backend is either Theano or TensorFlow, depending on the settings specified in ~/.keras/keras.json (the default is Theano). This backend lets us use Theano-type functions such as K.zeros, which creates a tensor of zeros, to initialize our model.

We initialize the layer by passing it the number of hidden units output_dim and the layer to use as the attention vector attention_vec. The __init__ function is identical to the __init__ function for the LSTM layer except for the attention vector, so we just reuse it here.

def build(self, input_shape):

I won’t reproduce everything here, but essentially this method initializes all of the weight matrices we need for the attentional component, after calling the LSTM.build method to initialize the LSTM weight matrices.

This method is used by the LSTM superclass to define components outside of the step function, so that they don’t need to be recomputed every time step. In our case, the attentional vector doesn’t need to be recomputed every time step, so we define it as a constant (we then grab it in the step function using attention = states[4]).

Convolutional Neural Networks

I will add something here when I actually understand how these work at a sufficient level. Basically, with language modeling, a common strategy is to apply a ton (on the order of 1000) of convolutional filters to the embedding layer, followed by a 1-max pooling function, and call it a day. It actually works stupidly well for question answering (see Feng et al. for benchmarks). In the meantime, I will dump some code here that might be helpful to pore over.
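As a rough illustration of the strategy (many filters over the embedded sentence, then 1-max pooling), here is a NumPy sketch. It is not the model code from this post; the filter width of 2, the tanh activation and all sizes are assumptions made for the example.

```python
import numpy as np

def conv_1max(embedded, filters, bias):
    """embedded: (sentence_length, n_dims); filters: (n_filters, width, n_dims)."""
    n_filters, width, _ = filters.shape
    n_windows = embedded.shape[0] - width + 1
    # every window of `width` consecutive word vectors
    windows = np.stack([embedded[i:i + width] for i in range(n_windows)])
    # response of every filter to every window: (n_windows, n_filters)
    responses = np.tensordot(windows, filters, axes=([1, 2], [1, 2])) + bias
    feature_maps = np.tanh(responses)
    # 1-max pooling: keep only the strongest response of each filter
    return feature_maps.max(axis=0)

rng = np.random.RandomState(0)
sentence_length, n_dims, n_filters, width = 40, 100, 1000, 2  # hypothetical
embedded = rng.normal(size=(sentence_length, n_dims))
filters = 0.1 * rng.normal(size=(n_filters, width, n_dims))
bias = np.zeros(n_filters)

sentence_vector = conv_1max(embedded, filters, bias)
print(sentence_vector.shape)  # (1000,)
```

Whatever the sentence length, the result is one value per filter, so questions and answers of different lengths end up as vectors of the same size.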

Similarity Metrics

The basic idea with question answering is to embed questions and answers as vectors, so that the question vector is close in vector space to the answer vector. For example, with the Attentional RNN, we take the question vector and use it as an input for generating the answer vector. A common approach is to then rank answer vectors according to their cosine similarities with the question vector. This doesn’t follow the conventional neural network architecture, and takes some manipulation to achieve in Keras. To use equations, what we would like to do is:

best answer = argmax(cos(question, answers))
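A minimal NumPy sketch of that ranking step is below. The question and answer vectors are made up; in the real model they would come from the network.

```python
import numpy as np

def cosine(question, answers):
    """Cosine similarity between one question vector and each answer row."""
    norms = np.linalg.norm(question) * np.linalg.norm(answers, axis=1)
    return np.dot(answers, question) / norms

question = np.array([1.0, 2.0, 0.0])
answers = np.array([[0.0, 1.0, 5.0],      # mostly unrelated
                    [2.0, 4.1, 0.1],      # nearly parallel to the question
                    [-1.0, -2.0, 0.0]])   # pointing the opposite way

scores = cosine(question, answers)
best_answer = int(np.argmax(scores))
print(best_answer)  # 1
```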

Training is generally done by minimizing hinge loss. In this case, we want the cosine similarity for the correct answer to go up, and the cosine similarity for an incorrect answer to go down. The loss function can be formulated as:

loss = max(0, margin - cos(question, good_answer) + cos(question, bad_answer))

Note that for implementations, having a loss of exactly zero can be troublesome, so a small value like 1e-6 is generally preferable as the lower bound instead. The loss is at this lower bound when the difference between the cosine similarities of the good and bad answers is greater than the constant margin we defined. In practice, the margins generally range from 0.001 to 0.2. If we want to use something besides cosine similarity, we can reformulate this as

loss = max(0, margin - sim(question, good_answer) + sim(question, bad_answer))

where sim is our similarity metric. Hinge loss works well for this application, as opposed to something like mean squared error, because we don’t want our question vectors to be orthogonal with the bad answer vectors, we just want the bad answer vectors to be a good distance away.
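Here is the hinge loss as a small Python function, using the 1e-6 floor mentioned above. The function name, the default margin of 0.1 and the similarity values are picked for illustration only.

```python
def hinge_loss(sim_good, sim_bad, margin=0.1, floor=1e-6):
    """Margin loss on the similarities of a good and a bad answer."""
    return max(floor, margin - sim_good + sim_bad)

print(hinge_loss(0.9, 0.2))    # 1e-06: good answer wins by more than the margin
print(hinge_loss(0.5, 0.45))   # about 0.05: good answer wins, but inside the margin
print(hinge_loss(0.3, 0.6))    # about 0.4: bad answer is ranked higher
```

Once the good answer beats the bad one by more than the margin, the loss stops pushing them apart, so the bad answer only has to be far enough away rather than orthogonal.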

Cosine Similarity Example: Rotational Matrix

First, let’s look at how to do cosine similarity within the constraints of Keras. Fortunately, Keras has an implementation of cosine similarity, as a mode argument to the merge layer. This is done with:

If we pass it two inputs of dimensions (a, b, c), it will calculate the cosine similarity of the c dimension (specified using dot_axes) and give an output of dimensions (a, b). However, because we might eventually want to implement other types of similarities besides cosine similarity, let’s look at how this can be done by passing a lambda function to merge.

We define a function similarity which we will use to compute the similarity of the inputs passed to the merge layer. Note that when we do this, we also have to pass an output_shape which tells Keras what shape the output will be after we do this operation (hopefully in the future this shape will be inferred, but it is still an open issue in the GitHub group).
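The shape bookkeeping can be checked outside Keras. The NumPy sketch below is not the function passed to merge; it only shows that reducing over the last axis turns two (a, b, c) inputs into one (a, b) output, which is the shape that output_shape has to report. The function name and the eps guard are my own additions.

```python
import numpy as np

def cosine_last_axis(x, y, eps=1e-8):
    """Cosine similarity along the last axis of two equally shaped arrays."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    return dot / np.maximum(norms, eps)  # eps guards against zero vectors

rng = np.random.RandomState(0)
a, b, c = 2, 3, 4
x = rng.normal(size=(a, b, c))
y = rng.normal(size=(a, b, c))

similarities = cosine_last_axis(x, y)
print(similarities.shape)  # (2, 3)
```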

A cool example might be to see if we can learn a rotation matrix. A rotation matrix in Euclidean space is a matrix which rotates a vector by a certain angle around the origin. In two dimensions, it is defined as a function of theta, the angle to rotate by:

[[cos(theta), -sin(theta)] [sin(theta), cos(theta)]]
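For concreteness, here is the standard two-dimensional rotation matrix in NumPy, applied to a unit vector. The helper name rotation_matrix is just for this example.

```python
import numpy as np

def rotation_matrix(theta):
    """Counter-clockwise rotation by theta radians (column-vector convention)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

R = rotation_matrix(np.pi / 2)  # 90 degrees
rotated = np.dot(R, np.array([1.0, 0.0]))
print(np.round(rotated, 3))  # [0. 1.]
```

Rotating the x unit vector by 90 degrees gives the y unit vector, and applying the same matrix four times returns any vector to where it started.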

We can learn this matrix really simply with the right dataset and one dense layer, that is:

A Dense layer with linear activation is the exact same as a matrix multiplication. We give it two input dimensions and two output dimensions. After training this model, the printed weight matrix is:

[[-0.00603954, -0.99370766]
 [ 0.99173903,  0.0078686 ]]

which is close to the rotation matrix for an angle of 90 degrees. Let’s try this again, but with cosine similarity. This will require some manipulation. In the previous example, we had a clearly defined input, a, and output, b, and our model was designed to perform a transformation on a to predict b. In this example, we have two inputs, a and b, and we will perform a transformation on a to make it close to b. As an output, we get the similarity of the two vectors, so we need to train our model to make this similarity high by providing it a bunch of 1’s.

This looks a bit like a rotation matrix, but the scaling seems off. Cosine similarity is indifferent to the magnitude of vectors, so the weight matrix ends up not being a rotation matrix so much as a rotation-and-skew matrix. It is interesting to think about why this network learned this particular matrix.
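The indifference to magnitude is easy to check numerically. The NumPy sketch below covers only the simplest case, a uniformly scaled 90 degree rotation, and uses made-up data rather than the trained weights: every scale reaches a cosine similarity of 1, so training on cosine similarity has no reason to prefer the exact rotation matrix.

```python
import numpy as np

def cosine(x, y):
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    return np.sum(x * y, axis=-1) / norms

rotate_90 = np.array([[0.0, -1.0],
                      [1.0,  0.0]])

rng = np.random.RandomState(0)
a = rng.normal(size=(100, 2))
b = np.dot(a, rotate_90)  # targets: every row of a rotated by 90 degrees

for scale in (0.1, 1.0, 7.5):
    predicted = np.dot(a, scale * rotate_90)
    # the worst-case similarity is 1 (up to rounding) for every scale
    print(scale, cosine(predicted, b).min())
```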

Below, a unit square (blue) is multiplied by the first matrix to get the orange square, and by the second matrix to get the yellow square.

Other Similarity Metrics

Feng et al. provided a list of similarities along with their benchmarks for a CNN architecture. Some of these similarities, along with their implementations in Keras, are reproduced below. They rely on these helper functions:
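As a stand-in for the Keras versions, here are NumPy sketches of several of these similarities as I understand them from the paper; check the formulas against the paper before relying on them. The helper names and the default values of gamma, c and d are placeholders, not the tuned values.

```python
import numpy as np

# helper functions: both reduce over the last axis
def dot(x, y):
    return np.sum(x * y, axis=-1)

def l2_distance(x, y):
    return np.linalg.norm(x - y, axis=-1)

def polynomial(x, y, gamma=0.5, c=1.0, d=2):
    return (gamma * dot(x, y) + c) ** d

def sigmoid_similarity(x, y, gamma=0.5, c=1.0):
    return np.tanh(gamma * dot(x, y) + c)

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * l2_distance(x, y) ** 2)

def euclidean(x, y):
    return 1.0 / (1.0 + l2_distance(x, y))

def logistic(x, y, gamma=1.0, c=1.0):
    return 1.0 / (1.0 + np.exp(-gamma * (dot(x, y) + c)))

def gesd(x, y, gamma=1.0, c=1.0):
    # geometric mean of Euclidean and sigmoid dot product: multiply the parts
    return euclidean(x, y) * logistic(x, y, gamma, c)

def aesd(x, y, gamma=1.0, c=1.0):
    # arithmetic mean of Euclidean and sigmoid dot product: average the parts
    return 0.5 * euclidean(x, y) + 0.5 * logistic(x, y, gamma, c)

question = np.array([1.0, 2.0, 3.0])
answer = np.array([1.0, 2.5, 2.0])
for metric in (polynomial, sigmoid_similarity, rbf, euclidean, gesd, aesd):
    print(metric.__name__, metric(question, answer))
```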

InsuranceQA Model Example

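A framework for designing and testing models can be found in this GitHub repo. This model achieved relatively good marks for Top-1 Accuracy (how often the model ranked a ground-truth answer highest out of 500 candidates) and Mean Reciprocal Rank (MRR), which is defined as

MRR = (1 / |Q|) * sum over questions q in Q of (1 / rank_q)

where rank_q is the rank of the highest-ranked ground-truth answer for question q.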
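Both metrics are easy to compute once you know, for each question, the rank of its best-placed correct answer. A small sketch, with made-up ranks:

```python
def top1_accuracy(ranks):
    """Fraction of questions whose correct answer was ranked first."""
    return sum(1 for rank in ranks if rank == 1) / float(len(ranks))

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all questions."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)

# hypothetical ranks of the best-placed correct answer for four questions
ranks = [1, 3, 1, 2]
print(top1_accuracy(ranks))         # 0.5
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 1 + 1/2) / 4, about 0.708
```

MRR is never lower than Top-1 Accuracy, because a correct answer in second or third place still earns partial credit.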

The results after learning the training set are summarized in the following table.

Test set    Top-1 Accuracy    Mean Reciprocal Rank
Test 1      0.4933            0.6189
Test 2      0.4606            0.5968
Dev         0.4700            0.6088

For comparison, the best model from Feng et al. achieved an accuracy of 0.653 on Test 1, and the model in Tan et al. achieved an accuracy of 0.681 on Test 1. This model isn’t exceptional, but it works pretty well for how simple it is. It outperforms the baseline bag of words model, and performs on par with the Metzler-Bendersky IR model introduced in “Learning concept importance using a weighted dependence model” (Bendersky and Metzler, 2010). Here’s how we build it in Keras:

The code is kind of awkward without the context, so I would recommend checking out the repository to see how it works.

Closing Remarks

Hopefully this demonstrates that Keras is powerful and flexible enough to allow for quick and creative implementations of networks. This post follows my final project for my Information Retrieval class, the code for which can be seen here. I think this code makes more sense in the context of this post.