Preface

This is an online companion for the paper An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation by Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra and Yoshua Bengio, to appear in the proceedings of the International Conference on Machine Learning (2007). A camera-ready version of the paper is also available for download. This document provides additional information regarding the generation of the datasets, details of our experiments, and downloadable versions of the datasets that we used. It overlaps partly with the paper itself, but it is meant to be used in conjunction with the paper in order to get a deeper (pun somewhat intended) understanding of our experiments.

Introduction

Recently, several learning algorithms relying on models with deep architectures have been proposed. Though they have demonstrated impressive performance, to date, they have only been evaluated on relatively simple problems such as digit recognition in a controlled environment, for which many machine learning algorithms already report reasonable results. Here, we present a series of experiments which indicate that these models show promise in solving harder learning problems that exhibit many factors of variation. These models are compared with well established algorithms such as Support Vector Machines and single-layer feed-forward neural networks.

Description of the algorithms

We define a shallow model as a model with very few layers of composition, e.g. linear models, one-hidden-layer neural networks and kernel SVMs.

Deep architecture models, on the other hand, compute their output by composing many computational units, where the number of required units does not grow exponentially with characteristics of the problem such as the number of factors of variation or the number of inputs. These units are generally organized in layers, so that many levels of computation can be composed. Conversely, a shallow model can only approximate such functions appropriately with a large (i.e. possibly exponential) number of computational units. In other words, deep models are able to implement very complex functions with relatively few parameters.

Here are the deep architecture models that we considered in our paper. More details about the corresponding training algorithms can be found in the paper and pseudo-codes can be found in the appendix of the technical report for Greedy Layer-Wise Training of Deep Networks.

Deep Belief Networks (DBN): Hinton et al. (2006) introduced this generative model with arbitrarily many layers of stochastic neurons and developed a training algorithm for it based on a greedy layer-wise generative learning procedure. The training strategy for such networks has been presented and analyzed by Bengio et al. (2007), who concluded that greedy unsupervised learning is a key ingredient in finding a solution to the problem of training deep networks. While the lower layers of a DBN extract “low-level features” from the input x, the upper layers are supposed to represent more “abstract” concepts that explain the input observation x.

Stacked Autoassociators: As demonstrated by Bengio et al. (2007), the idea of successively extracting non-linear features that “explain” variations of the features at the previous level can be applied not only to RBMs but also to autoassociators. An autoassociator is simply a model (usually a one-hidden-layer neural network) trained to reproduce its input by forcing the computations to flow through a “bottleneck” representation.
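A minimal autoassociator of this kind can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the networks used in the paper: the sigmoid non-linearity, layer sizes, learning rate and epoch count are all arbitrary choices for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoassociator(X, n_hidden=8, lr=0.5, epochs=200, seed=0):
    """Train a one-hidden-layer autoassociator on X by full-batch gradient
    descent on the squared reconstruction error. The hidden layer is the
    "bottleneck" representation mentioned above."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)                # bottleneck code
        xhat = sigmoid(h @ W2 + b2)             # reconstruction of X
        # Backpropagate the squared error through both sigmoid layers.
        d_out = (xhat - X) * xhat * (1 - xhat)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ d_out) / len(X)
        b2 -= lr * d_out.mean(0)
        W1 -= lr * (X.T @ d_hid) / len(X)
        b1 -= lr * d_hid.mean(0)
    return W1, b1, W2, b2
```

Stacking then amounts to training a second autoassociator of the same form on the hidden codes h produced by the first, and so on for each additional layer.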

In the experimental results section, we compare these models with shallow architectures such as single-layer feed-forward networks and Support Vector Machines.

Description of the datasets

In order to study the capacity of these algorithms to scale to learning problems with many factors of variation, we have generated datasets where we can identify some of these factors of variation explicitly. We focused on vision problems, mostly because they are easier to generate and analyze. In all cases, the classification problem has a balanced class distribution.

Variations on MNIST

In one series of experiments, we construct new datasets by adding additional factors of variation to the MNIST images.

The generative process used to generate the datasets is the following:

Introducing multiple factors of variation leads to the following benchmarks:

mnist-rot: the digits were rotated by an angle generated uniformly between 0 and 2π radians. Thus the factors of variation are the rotation angle and the factors of variation already contained in MNIST, such as handwriting style;

mnist-back-rand: a random background was inserted in the digit image. Each pixel value of the background was generated uniformly between 0 and 255;

mnist-back-image: a patch from a black and white image was used as the background for the digit image. The patches were extracted randomly from a set of 20 images downloaded from the internet. Patches which had low pixel variance (i.e. contained little texture) were ignored;

mnist-rot-back-image: the perturbations used in mnist-rot and mnist-back-image were combined.
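As an illustration, the random-background insertion used for mnist-back-rand can be sketched as follows. The compositing rule (substituting a background pixel wherever the digit pixel is at or below a threshold) is an assumption for the sketch; the actual generation scripts may differ in detail.

```python
import numpy as np

def add_random_background(digit, threshold=0, seed=None):
    """Insert a uniform random background behind a 28x28 digit image.

    Pixel values are assumed to lie in [0, 255]. Wherever the digit pixel
    is at or below `threshold`, a background value drawn uniformly from
    {0, ..., 255} is substituted; digit pixels are kept as-is."""
    rng = np.random.default_rng(seed)
    background = rng.integers(0, 256, size=digit.shape)
    return np.where(digit > threshold, digit, background)
```

mnist-back-image follows the same compositing idea, with the random values replaced by a patch cropped from one of the 20 source images.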

Discrimination between tall and wide rectangles.

In this task, a learning algorithm needs to recognize whether a rectangle contained in an image has a larger width or height. The rectangle can be situated anywhere in the 28 x 28 pixel image. We generated two datasets for this problem:

rectangles: the pixels corresponding to the border of the rectangle have a value of 255, the rest are 0. The height and width of the rectangles were sampled uniformly, but samples were rejected when their difference was smaller than 3 pixels. The top left corner of the rectangles was also sampled uniformly, with the constraint that the whole rectangle fits in the image.

rectangles-image: the border and inside of the rectangles correspond to an image patch, and a background patch is also sampled. The image patches are extracted from one of the 20 images used by mnist-back-image. Sampling of the rectangles is essentially the same as for rectangles, but the area covered by the rectangles was constrained to be between 25% and 75% of the total image, the height and width of the rectangles were forced to be at least 10 pixels, and their difference was forced to be at least 5 pixels.
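The rejection sampling described for rectangles can be sketched as follows; the exact sampling ranges for height and width are illustrative assumptions.

```python
import numpy as np

def sample_rectangle(size=28, min_diff=3, seed=None):
    """Sample one example in the spirit of the `rectangles` dataset.

    Height and width are drawn uniformly and redrawn while their difference
    is smaller than `min_diff` pixels; the top-left corner is drawn uniformly
    under the constraint that the rectangle fits in the image. Returns the
    image (border pixels set to 255) and the label (1 if taller than wide)."""
    rng = np.random.default_rng(seed)
    while True:
        h, w = rng.integers(1, size + 1, size=2)
        if abs(int(h) - int(w)) >= min_diff:
            break
    top = rng.integers(0, size - h + 1)
    left = rng.integers(0, size - w + 1)
    img = np.zeros((size, size), dtype=int)
    img[top, left:left + w] = 255           # top border
    img[top + h - 1, left:left + w] = 255   # bottom border
    img[top:top + h, left] = 255            # left border
    img[top:top + h, left + w - 1] = 255    # right border
    return img, int(h > w)
```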

Recognition of convex sets

The convex sets consist of a single convex region with pixels of value 255. Candidate convex images were constructed by taking the intersection of a number of half-planes whose locations and orientations were chosen uniformly at random. The number of intersecting half-planes was also sampled randomly, according to a geometric distribution with parameter 0.195. A candidate convex image was rejected if there were fewer than 19 pixels in the convex region.

Candidate non-convex images were constructed by taking the union of a random number of convex sets generated as above, but with the number of half-planes sampled from a geometric distribution with parameter 0.07 and with a minimum number of 10
pixels. The number of convex sets was sampled uniformly from 2 to 4. The candidate non-convex images were then tested by checking a convexity condition for every pair of pixels in the non-convex set. Those sets that failed the convexity test were added to the dataset.

The parameters for generating the convex and non-convex sets were balanced to ensure that the mean number of pixels of value 255 is the same in the two datasets.
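The half-plane construction for the convex images can be sketched like this. Only the geometric parameter and the 19-pixel rejection threshold come from the description above; the parameterization of the half-planes is an assumption for the sketch.

```python
import numpy as np

def sample_convex_image(size=28, geom_p=0.195, min_pixels=19, seed=None):
    """Generate one candidate convex image by intersecting random half-planes.

    The number of half-planes is geometric with parameter `geom_p`; each
    half-plane has a uniformly random normal direction and passes through a
    uniformly random point of the image. Candidates whose convex region has
    fewer than `min_pixels` pixels are rejected and redrawn."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:size, 0:size]
    while True:
        n_planes = rng.geometric(geom_p)
        mask = np.ones((size, size), dtype=bool)
        for _ in range(n_planes):
            theta = rng.uniform(0, 2 * np.pi)      # normal direction
            cx, cy = rng.uniform(0, size, size=2)  # point on the boundary
            mask &= (xs - cx) * np.cos(theta) + (ys - cy) * np.sin(theta) <= 0
        if mask.sum() >= min_pixels:
            return np.where(mask, 255, 0)
```

Since an intersection of half-planes is always convex, only the pixel-count rejection is needed here; the non-convex candidates additionally require the pairwise convexity test described above.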

Impact of background pixel correlation

In order to explore the space of learning problems standing between mnist-back-rand and mnist-back-image, we set up an experiment where we could vary the amount of background pixel correlation. We are hence assuming that background correlation is the main characteristic that distinguishes mnist-back-image from mnist-back-rand.

Correlated pixel noise was sampled from a zero-mean multivariate Gaussian distribution of dimension equal to the number of pixels: x ~ N(0, Σ). The covariance matrix Σ was defined as a convex combination of an identity matrix and a Gaussian kernel function. Representing the position of the i-th pixel with the vector p_i, we have:

    Σ_ij = γ δ_ij + (1 − γ) exp(−‖p_i − p_j‖² / σ²)

with kernel bandwidth σ, where δ_ij is 1 when i = j and 0 otherwise. The Gaussian kernel induces a neighborhood correlation structure among pixels, such that nearby pixels are more correlated than pixels further apart. For each sample x from N(0, Σ), the pixel values (ranging from 0 to 1) were determined by passing the elements of x through an error function:

    u_i = ½ (1 + erf(x_i / √2))

We generated six datasets with varying degrees of neighborhood correlation by setting the mixture weight γ to six evenly spaced values between 0 and 1. The marginal distribution of each pixel is uniform(0, 1) for every value of γ.
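The noise generator can be sketched as follows. The convention for which term the mixture weight γ multiplies and the bandwidth value are illustrative assumptions, and a small image size keeps the covariance matrix manageable.

```python
import math
import numpy as np

def correlated_noise(size=8, gamma=0.5, bandwidth=2.0, seed=None):
    """Sample one image of correlated background noise.

    Sigma = gamma * I + (1 - gamma) * K, with K a Gaussian kernel over the
    pixel positions, so each marginal is N(0, 1); mapping each element
    through the Gaussian CDF (an error function) makes every pixel
    marginally uniform on (0, 1)."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:size, 0:size]
    pos = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    sq = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / bandwidth ** 2)               # Gaussian kernel matrix
    Sigma = gamma * np.eye(size * size) + (1 - gamma) * K
    z = rng.multivariate_normal(np.zeros(size * size), Sigma)
    u = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    return u.reshape(size, size)
```

With γ = 1 the covariance is the identity and the pixels are independent, as in mnist-back-rand; decreasing γ increases the neighborhood correlation.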

Downloadable datasets

All the datasets are provided as zip archives. Each archive contains two files: a training (and validation) set and a test set. We used the last 2000 examples of the training sets as validation sets in all cases except rectangles (where we used the last 200), and, in the case of SVMs, retrained the models on the entire set after choosing the optimal parameters on these validation sets. Data is stored as one example per row, with space-separated features. There are 784 features per example (for the 28*28 images), corresponding to the first 784 columns of each row. The last column is the label, which is 0 to 9 for the MNIST variations and 1 or 0 for the rectangles, rectangles-image and convex datasets.
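Reading one of these files is straightforward; a minimal loader, assuming NumPy and a plain-text file in the format just described, might look like this:

```python
import numpy as np

def load_split(path):
    """Load one dataset file: space-separated text, one example per row,
    784 pixel features followed by the class label in the last column."""
    data = np.loadtxt(path, ndmin=2)
    X, y = data[:, :784], data[:, 784].astype(int)
    return X, y
```

The feature rows can then be reshaped with `X.reshape(-1, 28, 28)` to recover the images.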

IMPORTANT New versions of the datasets containing rotations have been generated. There was an issue in the previous versions with the way rotated digits were generated, which increased the range of values a digit pixel could have. For instance, this issue made it easier to discern digits from the image background in the MNIST rotated+back-image dataset. New results for these datasets have been generated and are reported along with the other benchmark results.

Attached is an archive that contains the scripts needed to generate the datasets, along with a README. Please contact us should you have any questions about the scripts.

Experimental setup and details

We conducted experiments with two deep architecture models: a 3-hidden-layer Deep Belief Network (denoted DBN-3) and a 3-hidden-layer Stacked Autoassociator Network (denoted SAA-3). In order to compare their performance, we also trained:

a standard single-hidden-layer feed-forward neural network (denoted NNet), to measure the improvement provided by the additional layers and the unsupervised initialization used in DBN-3 and SAA-3;

a single-hidden-layer Deep Belief Network (denoted DBN-1), to measure the improvement provided by the additional layers used in DBN-3 and SAA-3 only;

a Support Vector Machine classifier with Gaussian and polynomial kernels, which are popular reference points for classification models.

In all cases, model selection was performed using a validation set. For NNet, the best combination of number of hidden units (from 25 to 700), stochastic gradient descent learning rate (from 0.0001 to 0.1) and decrease constant (from 0 to ), and weight decay penalization (from 0 to ) was selected using a grid search.

For DBN-3 and SAA-3, because these models can require more than a day to train, we could not perform a full grid search in the space of hyper-parameters. For both models, the number of hidden units per layer must be chosen, in addition to all other optimization parameters (learning rates for the unsupervised and supervised phases, stopping criteria of the unsupervised phase, etc.). We instead chose an approximate search procedure, described below, that we believed would find a reasonable optimum.

The hyper-parameter search procedure we used alternates between fixing a neural network architecture and searching for good optimization hyper-parameters, similarly to coordinate descent. More time would usually be spent on finding good optimization parameters, given empirical evidence we found indicating that the choice of the optimization hyper-parameters (mostly the learning rates) has much more influence on the obtained performance than the size of the network. We used the same procedure to find the hyper-parameters for DBN-1, which are the same as those of DBN-3 except for the sizes of the second and third hidden layers. We also allowed ourselves to test much larger first-hidden-layer sizes, in order to make the comparison between DBN-1 and DBN-3 fairer.

We usually started by testing a relatively small architecture (between 500 and 700 units in the first and second hidden layers, and between 1000 and 2000 hidden units in the last layer). Given the results obtained on the validation set (compared to those of NNet, for instance) after selecting appropriate optimization parameters, we would then consider growing the number of units in all layers simultaneously. The biggest networks we eventually tested had up to 3000, 4000 and 6000 hidden units in the first, second and third hidden layers respectively.

As for the optimization hyper-parameters, we would proceed by first trying a few combinations of values for the stochastic gradient descent learning rates of the supervised and unsupervised phases (usually between 0.1 and 0.0001). We would then refine the choice of tested values for these hyper-parameters. The first trials would simply give us a trend on the validation set error for these parameters (is a change in the hyper-parameter making things worse or better?), and we would then use that information to select appropriate additional trials. One could choose to use learning rate adaptation techniques (e.g. slowly decreasing the learning rate or using momentum), but we did not find these techniques to be crucial.

For all neural networks, we used early stopping based on the error of the model on the validation set: if the best validation error did not improve for 5 consecutive epochs, training was stopped. As for the stopping criterion of the unsupervised phase, with SAA-3 we stopped greedily training a layer when the autoassociator reconstruction cost did not improve by more than 1% on the training set after an epoch, for a training set of 10000 samples, or by more than 10% for a training set of 1000 samples. With DBN-3, we did not use early stopping, because the RBM training criterion is not tractable. Instead, we tested 50 or 100 unsupervised learning epochs for each layer and selected the best choice based on the final accuracy of the model on the validation set.
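The early-stopping rule for the supervised phase can be written as a generic skeleton. The training and validation callbacks here are caller-supplied placeholders, not the lab's actual training loop:

```python
def train_with_early_stopping(train_epoch, valid_error, patience=5,
                              max_epochs=200):
    """Run training epochs until the best validation error has not improved
    for `patience` consecutive epochs (or `max_epochs` is reached).

    `train_epoch()` runs one epoch of training; `valid_error()` returns the
    current error on the validation set. Returns the best validation error."""
    best, since_best = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        err = valid_error()
        if err < best:
            best, since_best = err, 0   # new best: reset the patience counter
        else:
            since_best += 1
            if since_best >= patience:  # no improvement for `patience` epochs
                break
    return best
```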

The experiments with the NNet, DBN-1, DBN-3 and SAA-3 models were conducted using the PLearn library, an Open Source C++ library for machine learning which was developed and is actively used in our lab.

In the case of SVMs with Gaussian kernels, we performed a two-stage grid search for the width of the kernel and the soft-margin parameter. In the first stage, we searched through a coarse logarithmic grid over both parameters. In the second stage, we performed a more fine-grained search in the vicinity of the tuple that gave the best validation error. In the case of the polynomial kernel, the strategy was the same, except that we searched through all possible degrees of the polynomial up to 20 (with no fine-grained search on this parameter, obviously).
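The two-stage search can be sketched generically. The grid shapes and the refinement window of one coarse step around the winner are illustrative assumptions:

```python
from itertools import product
import numpy as np

def two_stage_grid_search(valid_error, coarse):
    """Coarse-to-fine grid search over two hyper-parameters, e.g. an SVM's
    kernel width and soft-margin parameter C.

    `valid_error(a, b)` returns the validation error for one tuple; `coarse`
    is a list of candidate values, assumed logarithmically spaced, used for
    both parameters. Stage two re-searches a finer log-grid bracketing the
    coarse winner within one decade on each side."""
    # Stage 1: coarse logarithmic grid over both parameters.
    best = min(product(coarse, coarse), key=lambda t: valid_error(*t))
    # Stage 2: finer grid in the vicinity of the stage-1 winner.
    fine_axes = [np.geomspace(v / 10, v * 10, 9) for v in best]
    return min(product(*fine_axes), key=lambda t: valid_error(*t))
```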

Throughout the experiments we used the publicly available library libSVM, version 2.83 (note that the library has since been updated; the results reported below are those obtained with version 2.83).

Results

The confidence intervals on the mean test error are computed using the following formula:

    err ± z · √(err (1 − err) / n)

where err is the estimated test error, n is the number of test examples, α is 0.05, and z = Φ⁻¹(1 − α/2), with Φ⁻¹ being the inverse of the zero-mean unit-variance Gaussian CDF, evaluated at 1 − α/2.

For convenience, we provide two tables of results, one being the transpose of the other. The test errors with (*) are the lowest for a given dataset (or whose margins overlap with the confidence margin of the lowest).

Classification error of SVM RBF, SAA-3 and DBN-3 on MNIST examples with progressively less pixel correlation in the background.

Discussion of results

There are several conclusions which can be drawn from these results:

Taken together, the deep architecture models show the best overall performance. Seven times out of 8, either DBN-3 or SAA-3 is among the best performing models (within the confidence intervals).

Four times out of 8, the best accuracy is obtained with a deep architecture model (either DBN-3 or SAA-3). This is especially notable in three cases, mnist-back-rand, mnist-back-image and mnist-rot-back-image, where they perform better by a large margin.

The improvement provided by deep architecture models is most notable for factors of variation related to background, especially in the case of random background, where DBN-3 almost reaches its performance on mnist-basic. It seems, however, that not all invariances can be learned as easily; rotation is one example, where the deep architectures do not outperform SVMs.

Even though SAA-3 and DBN-3 provide consistent improvement over NNet, these models are still sensitive to hyper-parameter selection. This might explain the surprising similarity of the results for SAA-3 on mnist-back-image and mnist-rot-back-image, even though the former corresponds to an easier learning problem than the latter.

It can be seen that, as the amount of background pixel correlation increases, the classification performance of all three algorithms degrades. This indicates that, as the factors of variation become more complex in their interaction with the input space, the relative advantage brought by DBN-3 and SAA-3 diminishes. This observation is concerning, and implies that learning algorithms such as DBN-3 and SAA-3 will eventually need to be adapted in order to scale to harder, potentially "real life" problems.

Conclusions

We presented a series of experiments which show that deep architecture models tend to outperform shallow models such as SVMs and single-layer feed-forward neural networks. We also analyzed the relationships between the performance of these learning algorithms and certain properties of the problems that we considered. In particular, we provided empirical evidence that these techniques compare favorably to other state-of-the-art learning algorithms on learning problems with many factors of variation, but only up to a certain point, where the data distribution becomes too complex and computational constraints become an important issue.

Further reading

This web page originated from the work presented in the following paper:

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation

This technical report and this book chapter present the philosophy behind deep architecture models and motivate them in the context of Artificial Intelligence, and the technical report explains Restricted Boltzmann Machines, Contrastive Divergence, and Deep Belief Networks in a tutorial fashion.