Deep/Convolutional GMMs

Convolutional GMM with 32 components trained on 16×16 patches. 50 EM iterations were run, and the log likelihood increased steadily throughout. I trained on 200,000 patches randomly selected from BSDS; each patch was zero-meaned, and flat patches with standard deviation below 0.04 were discarded.
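The patch pipeline can be sketched as follows, using numpy and synthetic images as a stand-in for BSDS; `extract_patches` and its parameters are illustrative, not the actual code:

```python
import numpy as np

def extract_patches(images, patch_size=16, n_patches=1000, std_thresh=0.04, seed=0):
    """Randomly sample patches, zero-mean each one, and discard flat
    patches (standard deviation below std_thresh), as described above."""
    rng = np.random.default_rng(seed)
    patches = []
    while len(patches) < n_patches:
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch_size + 1)
        x = rng.integers(img.shape[1] - patch_size + 1)
        p = img[y:y + patch_size, x:x + patch_size].astype(float)
        p = p - p.mean()                  # zero-mean each patch
        if p.std() >= std_thresh:         # drop near-flat patches
            patches.append(p.ravel())
    return np.array(patches)

# Illustration on synthetic images (stand-ins for BSDS):
images = np.random.default_rng(1).random((10, 64, 64))
X = extract_patches(images, n_patches=200)
```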

Each row shows the ten 16×16 patches from the dataset that have the maximal posterior probability on that particular filter (i.e. component of the GMM).
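The top-patch selection can be sketched with scikit-learn's `GaussianMixture` standing in for the EM implementation, on synthetic data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))          # stand-in for the patch dataset

gmm = GaussianMixture(n_components=4, covariance_type='full',
                      random_state=0).fit(X)

# Posterior p(c | x) for every patch and component.
post = gmm.predict_proba(X)             # shape (n_patches, n_components)

# For each component, the 10 patches with maximal posterior on it.
top10 = np.argsort(post, axis=0)[-10:][::-1].T   # shape (n_components, 10)
```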

The 11th column in each row shows the power spectrum of the corresponding filter. To interpret the power spectra, note that the filter must cancel out the patch's Fourier spectrum for the negative exponential (i.e. the probability) of that patch under that component to be high. This is why the power spectra are “dark” on the inside and “bright” on the outside, the inverse of what we would expect. As noted in the writeup above, we can only determine the power spectrum of the filter, since the Gaussian is insensitive to phase.
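The phase insensitivity can be checked numerically: for a stationary (circulant) precision matrix, the Gaussian exponent $x^\top P x$ depends only on the signal's power spectrum weighted by the filter's spectrum, with phase dropping out. A 1-D toy check (the matrix here is a random symmetric circulant, not a learned precision, and a real precision would also need to be positive definite):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.random(n)
c = (c + np.roll(c[::-1], 1)) / 2      # symmetrize so c[k] == c[(n-k) % n]

# Circulant matrix generated by c; it is diagonalized by the DFT.
P = np.array([np.roll(c, i) for i in range(n)])
lam = np.fft.fft(c).real               # its eigenvalues: the DFT of the kernel

x = rng.normal(size=n)
X = np.fft.fft(x)

quad = x @ P @ x                       # the quantity in the Gaussian exponent
spectral = (lam * np.abs(X) ** 2).sum() / n   # same thing in the Fourier domain:
                                              # only |X| enters, never the phase
```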

Here are 24 components trained on 8×8 patches. For both sets of filters and patches, there are significant overlaps in the power spectra of different filters.

I also trained 32 components on 28×28 MNIST digits (using the 60K training digits, with no data augmentation or other preprocessing). These results (on the 10K test set) are also interesting: there’s definitely a lot of redundancy in the components (e.g. for the digit 1), and it would be cool to “reallocate” that model capacity to other regions of the spectrum:

Here’s the MNIST model trained with 128 components:

Next, I trained a full (brute-force) GMM on MNIST 28×28 images; 32 components for 10 iterations. Here are the top 10 samples for each component (some of the components have no samples assigned to them in the test set). It definitely seems to be better than the convolutional model:

Does training on different subsets of the data (e.g. different classes) give models that are different enough to be used to discriminate between classes?
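A minimal sketch of that idea: one GMM per class, compared as a Bayes classifier. This uses scikit-learn's `GaussianMixture` on synthetic two-class data; the data, dimensions, and component counts are all illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic "classes", standing in for digit subsets.
X0 = rng.normal(loc=0.0, size=(300, 8))
X1 = rng.normal(loc=2.0, size=(300, 8))

# One GMM per class, trained only on that class's data.
g0 = GaussianMixture(n_components=2, random_state=0).fit(X0)
g1 = GaussianMixture(n_components=2, random_state=0).fit(X1)

# Classify by comparing per-class log likelihoods (equal priors assumed).
X_test = np.vstack([rng.normal(0.0, size=(50, 8)),
                    rng.normal(2.0, size=(50, 8))])
pred = (g1.score_samples(X_test) > g0.score_samples(X_test)).astype(int)
accuracy = (pred == np.repeat([0, 1], 50)).mean()
```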

How can we reduce the overlap between the filters so that there is more discriminative power? And can the filters be made more narrow-band (is this even necessary)?

Generate samples from these models to see what they look like.
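With scikit-learn this is a single call: `sample` picks a component according to its mixing weight, then draws from that component's Gaussian (synthetic training data stands in for the patches here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))       # stand-in for the training patches

gmm = GaussianMixture(n_components=8, random_state=0).fit(X)

# Ancestral sampling: component by mixing weight, then its Gaussian.
samples, labels = gmm.sample(100)    # samples: (100, 16); labels: component ids
```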

How can we reconstruct the filter phases, and is this even important? If we take an input image and simply filter it with these zero-phase filters, the output phase is the same as the input phase, so no phase information is lost.

Let’s re-train a 32 component convolutional model and look at the top samples per component along with the filter:

Now we take the outputs of this first layer and train on the 32 log-probability components, i.e. a dataset of size 60000 × 32. The log probs are preprocessed by setting each sample’s mean to 0 and standard deviation to 1. We then train a full-covariance GMM with 32 components (with zero-mean components) on this second layer. Finally, for each second-layer component, the samples that respond most strongly to it are shown below:
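A sketch of this two-layer pipeline, with scikit-learn's `GaussianMixture` as a stand-in, smaller sizes than the real 60000 × 32 run, and no zero-mean constraint on the second-layer components:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))                 # stand-in for the image vectors

layer1 = GaussianMixture(n_components=8, random_state=0).fit(X)

# Per-component log densities log p(x | c): one column per first-layer component.
logp = np.stack([multivariate_normal.logpdf(X, m, c)
                 for m, c in zip(layer1.means_, layer1.covariances_)], axis=1)

# Standardize each SAMPLE (row) to zero mean and unit standard deviation.
feats = (logp - logp.mean(axis=1, keepdims=True)) / logp.std(axis=1, keepdims=True)

# Full-covariance second-layer GMM on the log-probability features.
layer2 = GaussianMixture(n_components=8, covariance_type='full',
                         random_state=0).fit(feats)
post2 = layer2.predict_proba(feats)    # used to pick top samples per component
```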

Same as above, but with non-zero GMM means and without normalizing each sample to zero mean and unit standard deviation:

We trained a first-layer convolutional model with 32 components on MNIST with filter size 28×28. The digit assignments to each component, along with the filter power spectra, are shown below:

Then we generated a training set as follows: for each digit in the training set, we computed the probability vector $p(x|c)$, where $c$ ranges over the GMM components, normalized to sum to 1. This gives a dataset of size 60000 × 32 in which each row is non-negative and sums to 1. We then train a second GMM with 17 components on this input, using deterministic EM iterations on the first-layer outputs. The resulting top digits for each component are:
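This normalization is just the per-component likelihoods renormalized within each sample. A sketch, again with `GaussianMixture` and reduced sizes (the real run uses 60000 × 32 inputs and 17 second-layer components):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                 # stand-in for the MNIST vectors

layer1 = GaussianMixture(n_components=8, random_state=0).fit(X)

# log p(x | c), then normalize across components so each row is a
# non-negative vector summing to 1 (done in log space for stability).
logp = np.stack([multivariate_normal.logpdf(X, m, c)
                 for m, c in zip(layer1.means_, layer1.covariances_)], axis=1)
probs = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))

# Second-layer GMM trained directly on these probability vectors.
layer2 = GaussianMixture(n_components=4, random_state=0).fit(probs)
```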

Then we repeated the above process to train a third layer with 10 components. We actually started training with 16 components, but 6 were pruned during the EM iterations. The resulting top digits per component are as follows:

It’s interesting that the distribution seems much more “concentrated”. On the other hand, we see that some digits span two components, while other digits tend to be subsumed. Would a different training mechanism fit these distributions better? Could we also do top-down training, taking the output from the top layer and using it to refine the EM training of the lower layers?