Subscribe to this blog

Follow by Email

Search This Blog

The Restricted Boltzmann Machine (RBM) and Dreams

This write up is less of a puzzle and more a tutorial. Without further ado, here you go.

The Restricted Boltzmann Machine (RBM) has become increasingly popular of late after its success in the Netflix prize competition and other competitions. Most of the inventive work behind RBMs was done by Geoffrey Hinton. In particular the training of RBMs using an algorithm called "Contrastive Divergence" (CD). CD is very similar to gradient descent. A good consequence of the CD is its ability to "dream". Of the various machine learning methods out there, the RBM is the only one which has this capacity baked in implicitly.

The RBM is an energy based model. If we have a training data where we have two classes to classify, the RBM during its training phase, tries to assign a lower "energy" to a class than other classes. It achieves this by introducing an architecture of visible and hidden units. This is shown in the figure below.

There are also bias units but I've excluded them from the diagram above to demonstrate the uniqueness of RBMs. Notice, the visible units do not have any links amongst them. The hidden units also do not have connections amongst them. This is where the "Restricted" in name RBM comes from. They are restricted in whom they are connected with. The architecture, in that sense, is fixed. Each visible unit connects to all hidden units and each hidden unit connects to all visible units. Now onto to the next logical question. What are these units? The visible units are simply floating/integer value numbers. They can be represented by any vector or matrix of numbers. They are the gateway through which we enter training examples and make predictions/dreams etc. The hidden units are like neurons in a typical neural network. They aggregate all the data through the visible units. What about those connecting lines? They represent weights by which each each visible unit is connected to a hidden unit. The hidden units are sigmoid functions which take the form
$$
P(x) = \frac{1}{1 + e^{-x}}
$$
If \(P(x)\) is \(\gt 0.5\) then you can think of these units as "lighting up".Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)

Training & Learning:

In order to train an RBM we follow the CD algorithm. Note, RBMs are known to work best with binary input features and if we want to do classification of some sort. You could still do regression or deal with real valued features, but that is beyond the scope of this write up. For more information you can read up this excellent write up by Geoffrey Hinton. The CD algorithm works by initializing the weights to random values. Next we give the visible units a training example by turning on (setting to 1) the visible units which have the feature and compute how the hidden units light up. Each of the hidden units get a flow of values come in from each visible unit and they are aggregated by their corresponding weights. For the \(i^{th}\) hidden unit, if \(w_{ij}\) represents all the weights pointing towards it, the total value coming into it would be
$$
\sum_{j}^{N}w_{ij}v_{j}
$$
where \(N\) is the number of visible units. This causes some of the hidden units to light up. This is what gets called as "positive phase" (it doesn't matter if it is called negative phase, its just a name). Next, using the lit up hidden units, the visible units are recreated. This step is crucial to understand. Also note that the first time this is done, the visible units most likely will not look anything like the set you just gave it. However, based on how different the visible units look to the ones that you just gave it, the weights are adjusted. The degree to which this is done is decided by the learning rate, typically set to some small number. This method is very similar to gradient ascent. The smaller you set the learning rate at, the better the learning and the longer it would take to eventually finish. The entire process is repeated for the next training example and then the following one and so on until all examples are completed. All of this counts as one iteration. You do several hundreds of these iterations until the visible units that are generated by the RBM during the negative phase closely resemble the input ones during the positive phase.

Example

The following is an example implementation of the RBM done in R. I've tried to make it slightly more efficient by using R's built in apply functions This is by no means complete but should hopefully serve as an inspiration for someone to build out a package for this method in R and put that on CRAN.

Before getting into the example implementation, here is an example dataset I have created. It has just two classes of examples. They are several instances of an arrow pointing up and an arrow pointing down.

An example up arrow would look like the following. Notice, the ascii art :)

The above code spends most of its time in the training phase. Also included are two example cases. The number of iterations are hard wired into this code segment at 2000 for simplicity. Three hidden units are chosen but you can change that by changing the variable num.hiddenstates. x.test1 and x.test2 are two matrices which have these example cases put in. The output of the hidden node values while under the two test cases are also written out. When you run this code, notice that the value of a1 for both cases will always be different when viewed in combination. Due to the random nature of the hidden units you would not get the same numbers for the hidden units every time you run this code. This is important to understand and note. A typical (and a good) way RBMs are used is to use this output as input to another classifier (a neural network or decision tree say) which in turn does the prediction. Such architectures are called "deep learning" architectures.

Dreaming

The final leg of the RBM code does a cool trick. It can "dream"! This simply means that the hidden units are used to regenerate the visible units. To seed the dreaming process we give it one example from the training set (in the code I give it the test set itself) and see what units in the hidden layer get activated. Then the negative phase of the CD algorithm is run to regenerate the visible units. This can be done as many different times as one pleases. I've set it to dream up just two cases each of an ASCII art up-arrow. Here is how it showed up on my machine.

[1,]000010000[2,]000111000[3,]001101100[4,]011111110[5,]111011011

[1,]000010000[2,]001111000[3,]001001100[4,]011111110[5,]111111111

Hopefully this write up was helpful in gaining a better understanding of this nice machine learning algorithm. If you are looking to learn the art of machine learning here are some good books to own

Discovering Statistics Using R
This is a good book if you are new to statistics & probability while simultaneously getting started with a programming language. The book supports R and is written in a casual humorous way making it an easy read. Great for beginners. Some of the data on the companion website could be missing.

Linear Algebra (Dover Books on Mathematics)
An excellent book to own if you are looking to get into, or want to understand linear algebra. Please keep in mind that you need to have some basic mathematical background before you can use this book.

Linear Algebra Done Right (Undergraduate Texts in Mathematics)
A great book that exposes the method of proof as it used in Linear Algebra. This book is not for the beginner though. You do need some prior knowledge of the basics at least. It would be a good add-on to an existing course you are doing in Linear Algebra.

Follow @ProbabilityPuzIf you are looking to learn time series analysis, the following are some of the best books in time series analysis.

Introductory Time Series with R (Use R!)
This is good book to get one started on time series. A nice aspect of this book is that it has examples in R and some of the data is part of standard R packages which makes good introductory material for learning the R language too. That said this is not exactly a graduate level book, and some of the data links in the book may not be valid.

Econometrics
A great book if you are in an economics stream or want to get into it. The nice thing in the book is it tries to bring out a oneness in all the methods used. Econ majors need to be up-to speed on the grounding mathematics for time series analysis to use this book. Outside of those prerequisites, this is one of the best books on econometrics and time series analysis.