Deep Learning Simplified

Recurrent Neural Network (RNN) – Part 5: Custom Cells

In this post, we will explore the idea of creating our own custom RNN cells. But first, we will take a closer look at the simple RNN and then at more complicated units such as the LSTM and GRU. We will also analyze the TensorFlow code for these units and draw from it to eventually create our own custom cells. I will be using images from one of the best posts out there on RNNs/LSTMs, by Chris Olah. I highly urge you to read that post; I will be reiterating a lot of its material here, but I will move rather quickly and focus more on the tf code. I will be referring back to this code in a future post on applying layer normalization to these RNN architectures, which can be found here.

Basic RNNs:

With traditional RNNs, the main issue is that we cannot adequately learn long term dependencies because the operations that we repeat at each cell unit for each input are static. If you think back to the basic RNN cell, the operations all involve the single tanh operation.
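To make the "single tanh operation" concrete, here is a minimal NumPy sketch of one basic RNN step unrolled over a few time steps. The function name rnn_step, the sizes N, D, H and the random weights are all illustrative assumptions, not taken from the TensorFlow code.

```python
import numpy as np

def rnn_step(x, h_prev, W, b):
    """One basic RNN step: h_t = tanh([h_{t-1}, x_t] W + b)."""
    concat = np.concatenate([h_prev, x], axis=1)  # [N, H + D]
    return np.tanh(concat @ W + b)                # [N, H]

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # batch size, input size, hidden size (illustrative)
W = rng.normal(scale=0.1, size=(H + D, H))
b = np.zeros(H)

h = np.zeros((N, H))
for x_t in rng.normal(size=(5, N, D)):  # 5 time steps of random inputs
    h = rnn_step(x_t, h, W, b)          # the same W and b are reused at every step
```

Note that h is both the state carried to the next step and the output of the cell; there is nothing here that decides what to keep and what to discard.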

This architecture is suitable for inputs where the solutions are based on short term dependencies, but if we wish to utilize long term memory efficiently to predict the right targets, we will need an RNN cell unit that is more robust. Cue the LSTM.

Long Short Term Memory Networks (LSTMs):

The architecture of the LSTM allows us to have long term information control at the expense of more operations. Our traditional RNNs had one output which served as both the hidden state representation and the output from the cell.

There is an absence of information control with this basic architecture that prevents us from holding on to useful information for many steps down the line. The LSTM, instead, has two different types of outputs. We still have the traditional state output, which acts as both the hidden state representation and the cell’s output, but the cell also outputs a cell state C. Here is the LSTM in all its glory; time to break it down into pieces.

Forget gate:

The very first gate is the forget gate. This gate allows us to selectively pass information on to the determination of the cell state. I will break down the notation below once, and you can reapply it to all the other gates as well.

And of course to implement this, you could follow something like tf’s _linear function. But the main idea is that we are applying this sigmoid operation to a linear combination of the input and the previous hidden state. But what exactly is applying this sigmoid operation doing? Recall that sigmoid outputs values in the range [0, 1], and here we are applying it to a matrix of shape [N X H], which means we will produce N X H values with sigmoid applied to them. If the sigmoid operation results in 0, then that hidden value is nullified, and if it is 1, we completely let that value be used. Anything in between allows parts of the information to go through. This is a nice way to control the information that is flowing through, by effectively blocking and selectively passing parts of the inputs to the cell.
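As a rough illustration of the gating described above, the following NumPy sketch computes a forget gate and applies it elementwise. It assumes, following Chris Olah's formulation, that the gate multiplies the previous cell state C_{t-1}; the names forget_gate, W_f, b_f and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x, h_prev, W_f, b_f):
    """f_t = sigmoid([h_{t-1}, x_t] W_f + b_f), one value in (0, 1) per hidden unit."""
    concat = np.concatenate([h_prev, x], axis=1)  # [N, H + D]
    return sigmoid(concat @ W_f + b_f)            # [N, H]

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # batch size, input size, hidden size (illustrative)
x = rng.normal(size=(N, D))
h_prev = rng.normal(size=(N, H))
c_prev = rng.normal(size=(N, H))
W_f = rng.normal(scale=0.1, size=(H + D, H))
b_f = np.zeros(H)

f = forget_gate(x, h_prev, W_f, b_f)
kept = f * c_prev  # values near 0 block an entry, values near 1 let it through
```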

This forget gate, however, is only the first operation that we do to ultimately calculate our cell state. The next operation involves the input gate.

Input gate:

The input gate takes in our input X and the previous hidden state and computes two operations. First it selectively allows parts of the inputs to pass through with a sigmoid gate and then we multiply it by the tanh of the inputs.

What the tanh is doing here is a bit different from the sigmoid operation. Recall that tanh changes our inputs into the range [-1, 1]. This essentially changes the underlying representation of our inputs with this nonlinearity. This is the exact same step as what we were doing with the basic RNN cell. But now we take the product of these two values and add it to the value from the forget gate to calculate our cell state.

These operations with the forget and input gate can be translated to the fact that we keep parts of the old cell state (C_{t-1}) and keep parts of the new transformed (tanh) cell state C~_t. These weights are trained with our data to learn exactly how much information to keep and how to perform the correct transformation.
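The forget and input gates can be put together in a short NumPy sketch of the cell state update, C_t = f_t * C_{t-1} + i_t * C~_t. The function name new_cell_state, the parameter dictionary p and the bias values used in the second call are assumptions chosen only to illustrate the behaviour.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def new_cell_state(x, h_prev, c_prev, p):
    """C_t = f_t * C_{t-1} + i_t * C~_t for one time step."""
    concat = np.concatenate([h_prev, x], axis=1)     # [N, H + D]
    f = sigmoid(concat @ p["W_f"] + p["b_f"])        # forget gate
    i = sigmoid(concat @ p["W_i"] + p["b_i"])        # input gate
    c_tilde = np.tanh(concat @ p["W_c"] + p["b_c"])  # candidate cell state
    return f * c_prev + i * c_tilde

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # batch size, input size, hidden size (illustrative)
p = {f"W_{g}": rng.normal(scale=0.1, size=(H + D, H)) for g in "fic"}
p.update({f"b_{g}": np.zeros(H) for g in "fic"})
x = rng.normal(size=(N, D))
h_prev = rng.normal(size=(N, H))
c_prev = rng.normal(size=(N, H))

c_new = new_cell_state(x, h_prev, c_prev, p)

# Extreme case: forget gate pushed open, input gate pushed shut.
p_keep = dict(p, b_f=np.full(H, 20.0), b_i=np.full(H, -20.0))
c_kept = new_cell_state(x, h_prev, c_prev, p_keep)  # almost identical to c_prev
```

The second call shows why this helps with long term dependencies: when the gates saturate this way, the old cell state is carried forward essentially unchanged, whatever the current input is.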

Output gate:

The last gate is the output gate and it uses the input, previous hidden state and the new cell state to determine the new hidden state representation.

This operation again involves the selective information barrier sigmoid, which is multiplied with the tanh of the cell state. Note that this tanh operation is not a neural network layer, unlike the tanh operation in the input gate. It simply applies the hyperbolic tangent to the cell state, without any weights. We are merely forcing the cell state’s [N X H] values to be in the range [-1, 1].
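A minimal NumPy sketch of this last step, assuming the new cell state has already been computed. The names output_gate_step, W_o and b_o are hypothetical; the point is that the tanh on the cell state has no weights of its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate_step(x, h_prev, c_new, W_o, b_o):
    """h_t = o_t * tanh(C_t), with o_t = sigmoid([h_{t-1}, x_t] W_o + b_o)."""
    concat = np.concatenate([h_prev, x], axis=1)  # [N, H + D]
    o = sigmoid(concat @ W_o + b_o)               # learned gate
    return o * np.tanh(c_new)                     # plain tanh, no weights involved

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # batch size, input size, hidden size (illustrative)
x = rng.normal(size=(N, D))
h_prev = rng.normal(size=(N, H))
c_new = rng.normal(scale=3.0, size=(N, H))  # cell state values can fall outside [-1, 1]
W_o = rng.normal(scale=0.1, size=(H + D, H))
b_o = np.zeros(H)

h_new = output_gate_step(x, h_prev, c_new, W_o, b_o)
```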

Variations:

There are literally hundreds of variations of RNN cells, so I suggest checking out Chris Olah’s blog again for more information. A few noteworthy ones he discussed were the peephole model (allow all gates to see the cell state available at that point in time, i.e. C_{t-1}, or C_t if it is already calculated) and coupled cell states (only update when we forget and forget when we update). But the current rival to the LSTM, which is heavily based on the LSTM and is rapidly growing in use, is the Gated Recurrent Unit (GRU).

Gated Recurrent Unit (GRU):

The main idea behind the GRU is that it combines the forget and input gates into one update gate.
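One way to picture the combined gate is as a single blend between the old hidden state and a new candidate, written here in Chris Olah's notation as h_t = (1 - z) * h_{t-1} + z * h~_t. The function name gru_blend and the toy values are assumptions for illustration only.

```python
import numpy as np

def gru_blend(h_prev, h_candidate, z):
    """Update-gate blend: h_t = (1 - z) * h_prev + z * h_candidate."""
    return (1.0 - z) * h_prev + z * h_candidate

h_prev = np.array([[1.0, -1.0]])
h_cand = np.array([[0.5, 0.5]])

keep_old = gru_blend(h_prev, h_cand, z=np.zeros((1, 2)))     # z = 0: keep the old state
take_new = gru_blend(h_prev, h_cand, z=np.ones((1, 2)))      # z = 1: take the candidate
halfway = gru_blend(h_prev, h_cand, z=np.full((1, 2), 0.5))  # z = 0.5: average the two
```

Because the two weights always sum to 1, whatever is forgotten from the old state is exactly what gets replaced by the candidate, which is what lets one gate do the job of two.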

Empirically, the GRU’s performance on most tasks is on par with the LSTM’s, and it is also computationally less expensive. These tradeoffs are the reason behind its surging popularity.

Tensorflow Native Implementations:

Now we will take a look at the official TensorFlow code for the GRU unit, and we will mostly focus on the function calls, inputs and outputs. From here, we will replicate the structure to create our own unique cells. If you’re interested in the other cells available, you can find them all at this link. We will just focus on the GRU because its performance is as good as the LSTM’s in most cases and it is significantly less complex.

The GRUCell class starts with the __init__ function, which defines the number of units and the activation function it will use. This activation function is usually tanh; the sigmoid activations for the gates, on the other hand, are fixed, since the [0, 1] range is what allows us to control the information flow. Then we have two properties that both return self._num_units when invoked. And finally, we have our __call__ function, which is what processes the input and churns out the new hidden state. Recall that the GRU does not have a cell state like the LSTM.
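The overall shape of such a class can be sketched in plain Python and NumPy. This is not the TensorFlow class itself: ToyCell, its input_size and seed arguments, and the eager weight creation are hypothetical stand-ins, and returning an (output, new_state) pair from __call__ is an assumption about the cell interface.

```python
import numpy as np

class ToyCell:
    """A basic tanh cell laid out like the structure described above."""

    def __init__(self, num_units, input_size, activation=np.tanh, seed=0):
        self._num_units = num_units
        self._activation = activation
        rng = np.random.default_rng(seed)
        # Weights are created eagerly here; this is a simplification.
        self._W = rng.normal(scale=0.1, size=(input_size + num_units, num_units))
        self._b = np.zeros(num_units)

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def __call__(self, inputs, state):
        concat = np.concatenate([inputs, state], axis=1)  # [N, D + H]
        new_h = self._activation(concat @ self._W + self._b)
        return new_h, new_h  # output and new state are the same tensor

cell = ToyCell(num_units=4, input_size=3)
state = np.zeros((2, cell.state_size))
output, state = cell(np.ones((2, 3)), state)
```

A custom cell would keep this outer structure and change only what happens inside __call__.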

First, we compute r and u (u = z in colah’s notation above). Instead of computing them separately, we merge the weights, do a single operation with 2*num_units outputs and then split the result in two; the call in the code has the form split(dim, num_splits, value). Then we apply our sigmoid activation to the values to selectively control the information flow. Then we calculate the candidate c and use it to calculate our new hidden state representation. You may notice that the order for calculating new_h is switched compared to the notation above; either way works fine, because the weights will train accordingly.
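The steps just described can be sketched in NumPy as follows. This mirrors the description rather than reproducing the TensorFlow source: the function name gru_call, the explicit weight arguments and the bias value of 1.0 for the gates are assumptions, and np.split stands in for the split call mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_call(inputs, state, W_gates, b_gates, W_cand, b_cand, activation=np.tanh):
    """One GRU step with the r and u gates computed in a single merged matmul."""
    concat = np.concatenate([inputs, state], axis=1)  # [N, D + H]
    gates = concat @ W_gates + b_gates                # [N, 2 * H]
    r, u = np.split(gates, 2, axis=1)                 # two [N, H] halves
    r, u = sigmoid(r), sigmoid(u)                     # reset and update gates
    cand_in = np.concatenate([inputs, r * state], axis=1)
    c = activation(cand_in @ W_cand + b_cand)         # candidate state
    new_h = u * state + (1.0 - u) * c                 # u keeps old state, (1 - u) takes c
    return new_h, new_h

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # batch size, input size, hidden size (illustrative)
W_gates = rng.normal(scale=0.1, size=(D + H, 2 * H))
b_gates = np.ones(2 * H)  # assumed gate bias, chosen for illustration
W_cand = rng.normal(scale=0.1, size=(D + H, H))
b_cand = np.zeros(H)

h = np.zeros((N, H))
for x_t in rng.normal(size=(5, N, D)):  # 5 time steps of random inputs
    out, h = gru_call(x_t, h, W_gates, b_gates, W_cand, b_cand)
```

Swapping u and (1 - u) in the new_h line gives the other ordering; the learned gate weights would simply end up producing the complementary values.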

The code for all of the other cells looks very similar to this, so you will easily be able to interpret it.