In the previous post we introduced Boltzmann machines and the infeasibility of computing the gradient of its log-partition function $\nabla_{\theta}\log{Z}$. To this end, we explored one strategy for its approximation: Gibbs sampling. Gibbs sampling is a viable alternative because the expression for this gradient simplifies to an expectation over the model distribution, which can be approximated with Monte Carlo samples.

In this post, we'll highlight the imperfections of even this approach, then present some preferable alternatives.

SML starts from a simple premise: initialize the chain at a point already close to the model's current distribution, reducing or perhaps eliminating the cost of burn-in altogether. Granting this, at which sample do we initialize the chain?

In SML, we simply initialize at the terminal value of the previous chain (i.e. the one we manufactured to compute the gradients of the previous mini-batch). As long as the model has not changed significantly since, i.e. as long as the previous parameter update (gradient step) was not too large, this sample should exist in a region of high probability under the current model.
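Schematically, one gradient step of SML then looks like the sketch below. Everything here is a placeholder rather than code from this post: `gibbs_sweep` stands for one full Gibbs sweep over the units, `log_likelihood_gradient_step` for the parameter update, and `n_gibbs_sweeps`, `n_units`, `model`, and `trainloader` are assumed to exist.

```python
import torch

# state of the persistent chain, carried across mini-batches
persistent_state = torch.zeros(n_units)

for batch in trainloader:
    # resume the chain where the previous mini-batch left it; no fresh burn-in,
    # since the model has only moved by one small gradient step since then
    for _ in range(n_gibbs_sweeps):
        persistent_state = gibbs_sweep(model, persistent_state)

    # positive phase from the data, negative phase from the persistent chain
    log_likelihood_gradient_step(model, batch, persistent_state)
```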

Per the expression for the full log-likelihood gradient, e.g. $\nabla_{w_{i, j}}\log{\mathcal{L}} = \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} [x_i x_j] - \mathop{\mathbb{E}}_{x \sim p_{\text{model}}} [x_i x_j]$, the negative phase works to "reduce the probability of the points in which the model strongly, yet wrongly, believes".[^1] Since we approximate this term at each parameter update with samples roughly from the current model's true distribution, we do not encroach on this foundational task.
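Concretely, both expectations can be estimated with simple averages: the positive phase from a mini-batch of data, the negative phase from whatever model samples we have on hand. A minimal sketch (my own, not code from the post), with configurations stored row-wise:

```python
import torch

def weight_gradient_estimate(data_batch, model_samples):
    """Monte Carlo estimate of E_data[x_i x_j] - E_model[x_i x_j] for every weight w_ij.

    :param data_batch: (n_data, n_units) tensor of observed configurations
    :param model_samples: (n_model, n_units) tensor of samples from the model
    """
    positive_phase = data_batch.t() @ data_batch / data_batch.shape[0]
    negative_phase = model_samples.t() @ model_samples / model_samples.shape[0]
    return positive_phase - negative_phase
```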

With no guarantee that the data distribution resembles the model distribution, we may systematically fail to sample, and thereafter "suppress," points that are incorrectly likely under the latter (since they do not appear in the former!). The result is the growth of aptly named "spurious modes" in our model.[^1]

Immediately, we remark that if we optimize this model with maximum likelihood, our algorithm will trivially make $c$ arbitrarily negative. In other words, the quickest way to increase the log-likelihood on the left-hand side is simply to decrease $c$.
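To make the degeneracy explicit (in the parameterization where $c$ plays the role of the log-partition function; my notation, not necessarily the post's):

$$\log{p_{\text{model}}(x)} = \log{\tilde{p}_{\text{model}}(x; \theta)} - c \quad\Longrightarrow\quad \frac{\partial}{\partial c}\log{p_{\text{model}}(x)} = -1,$$

so gradient ascent on the likelihood, with $\theta$ held fixed, drives $c \to -\infty$ regardless of the data.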

This approach builds a classifier that discriminates between samples generated from the model distribution and samples from a noise distribution, trained only on samples from the latter. (Clearly, this will not make for an effective classifier.)

To train this classifier, we note that the equation asks us to maximize the likelihood of the noise samples under the noise distribution—where the noise distribution itself has no actual parameters we intend to train!

As a solution, we expand our expectation to one over both noise samples and data samples. In doing so, in predicting $p_{\text{joint}}(y = 1\vert x) = 1 - p_{\text{joint}}(y = 0\vert x)$, we'll be maximizing the likelihood of the data under the model.
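A minimal sketch of this expanded objective, assuming hypothetical `model_log_prob` and `noise_log_prob` callables (the model's log-likelihood including the learned $c$, and the noise distribution's log-likelihood, respectively), with `k` noise samples drawn per data sample:

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(data_batch, noise_batch, model_log_prob, noise_log_prob, k):
    """Binary cross-entropy over data samples (y = 1) and noise samples (y = 0)."""
    def logits(x):
        # log-odds that x came from the model rather than the noise distribution
        return model_log_prob(x) - noise_log_prob(x) - math.log(k)

    log_p_data_is_real = F.logsigmoid(logits(data_batch))     # log p(y=1 | x), x ~ data
    log_p_noise_is_fake = F.logsigmoid(-logits(noise_batch))  # log p(y=0 | x), x ~ noise

    # maximizing both terms is equivalent to minimizing this negative mean
    return -(log_p_data_is_real.mean() + k * log_p_noise_is_fake.mean())
```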

We briefly alluded to the fact that our noise distribution has no trainable parameters. However, there is nothing stopping us from evolving this distribution: giving it trainable parameters, then updating those parameters so that it generates increasingly "optimal" samples.

Of course, we would have to design what "optimal" means. One interesting approach is called Adversarial Contrastive Estimation, wherein the authors adapt the noise distribution to generate increasingly "harder negative examples, which forces the main model to learn a better representation of the data."[^2]

Below, in contrast to the previous post, I offer a vectorized implementation of the Boltzmann energy function.

This said, the code is still imperfect, especially the line in which I iterate through data points individually to compute the joint likelihood.

Finally, in Model._H, I divide by 1000 to get this thing to train. The following is only a toy exercise (like many of my posts); I did not spend much time tuning parameters.


import numpy as np
import torch
import torch.nn as nn


def sigmoid(z):
    return 1 / (1 + torch.exp(-z))


class Model(nn.Module):

    def __init__(self, n_units, seed=42):
        super().__init__()
        torch.manual_seed(seed)
        self.params = nn.Linear(n_units, n_units, bias=True)
        torch.nn.init.xavier_uniform_(self.params.weight)
        self.c = nn.Parameter(torch.FloatTensor([1.]))
        self.diagonal_mask = (~torch.eye(n_units).byte()).float()
        self.n_units = n_units

    def forward(self, x, log=False):
        return self._likelihood(x, log=log)

    def _unnormalized_likelihood(self, x):
        return torch.exp(self._H(x))

    def _H(self, x):
        # energy: sum of w_ij * x_i * x_j over unique pairs (i != j), plus the biases
        H = (self.diagonal_mask * torch.triu(self.params.weight * torch.ger(x, x))).sum() + self.params.bias.sum()
        return H / 1000  # scale down so exp(H) stays finite during training

    def _likelihood(self, x, log=False):
        """
        :param x: a vector of shape (n_units,) or (n, n_units), where the latter is a matrix
            of multiple data points for which to compute the joint likelihood
        :return: the likelihood, or log-likelihood if `log=True`
        """
        if not (self.n_units in x.size() and len(x.size()) in (1, 2)):
            raise ValueError('Please pass 1 or more points of `n_units` dimensions')

        # compute unnormalized likelihoods
        multiple_samples = len(x.shape) == 2
        if multiple_samples:
            likelihood = [self._unnormalized_likelihood(point) for point in x]
        else:
            likelihood = [self._unnormalized_likelihood(x)]

        if log:
            return torch.stack([torch.log(lik) - torch.log(self.c) for lik in likelihood])
        else:
            return torch.stack([lik / self.c for lik in likelihood])

    def sample(self, n_samples=100, init_sample=None, burn_in=25, every_n=10) -> np.array:
        if burn_in > n_samples:
            n_samples += burn_in
        if init_sample is None:
            init_sample = torch.zeros_like(self.params.bias)
        samples = [init_sample]

        def _gibbs_step(sample, i):
            # conditional for unit i given the current values of all other units
            z = sum([self.params.weight[i, j] * sample[j] for j in range(len(sample)) if j != i]) + self.params.bias[i]
            p = sigmoid(z)
            return p

        for _ in range(n_samples):
            sample = list(samples[-1])  # make copy
            for i, _ in enumerate(sample):
                sample[i] = _gibbs_step(sample=sample, i=i)
            samples.append(torch.Tensor(sample))

        return [sample for i, sample in enumerate(samples[burn_in:]) if i % every_n == 0]

Now, let's train a model using noise contrastive estimation. For our noise distribution, we'll start with a diagonal multivariate Gaussian, from which we can sample, and whose likelihood we can evaluate (as of PyTorch 0.4!).
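For instance, a diagonal Gaussian built from `torch.distributions` gives us both of these operations. A minimal sketch, where the mean, standard deviation, and sizes below are placeholder values rather than the ones used in this post:

```python
import torch
from torch.distributions import MultivariateNormal

n_units, BATCH_SIZE = 784, 64  # placeholder sizes
mu = torch.zeros(n_units)      # placeholder mean
sigma = torch.ones(n_units)    # placeholder per-dimension standard deviation

noise_dist = MultivariateNormal(loc=mu, covariance_matrix=torch.diag(sigma ** 2))

noise_samples = noise_dist.sample((BATCH_SIZE,))      # draw a batch of noise samples
noise_log_probs = noise_dist.log_prob(noise_samples)  # evaluate their log-likelihoods
```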

noiseset = dset.mnist.MNIST(root='data/mnist', download=True, transform=transform)
noiseloader = torch.utils.data.DataLoader(noiseset, batch_size=BATCH_SIZE * 10, shuffle=True)

# get some random training images
dataiter = iter(noiseloader)
images, labels = dataiter.next()

# show images
imshow(torchvision.utils.make_grid(images))

# define model
model = Model(n_units)

# define classifier
k = 10
classifier = lambda X: 1 - sigmoid(torch.Tensor([k]).log() - model(X, log=True).squeeze())

optimizer = torch.optim.Adam(model.parameters(), lr=.1)  # i had to change this learning rate to get this to train

# train
train_model(classifier, optimizer, trainloader, noiseloader, n_batches=100)
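`train_model` itself isn't shown above; one plausible shape for such a loop, assuming each loader yields `(images, labels)` batches, with data batches as the positive class and noise batches as the negative class:

```python
import torch

def train_model(classifier, optimizer, trainloader, noiseloader, n_batches=100):
    """Hypothetical training loop: maximize log p(y=1 | x) on data batches
    and log p(y=0 | x) on noise batches."""
    for batch_idx, ((data, _), (noise, _)) in enumerate(zip(trainloader, noiseloader)):
        if batch_idx == n_batches:
            break

        data = data.view(data.shape[0], -1)    # flatten images to (batch, n_units)
        noise = noise.view(noise.shape[0], -1)

        p_data = classifier(data)    # predicted p(y=1 | x) for data samples
        p_noise = classifier(noise)  # predicted p(y=1 | x) for noise samples

        loss = -(torch.log(p_data).mean() + torch.log(1 - p_noise).mean())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```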

In this post, we discussed four additional strategies for both speeding up, as well as outright avoiding, the computation of the gradient of the log-partition function $\nabla_{\theta}\log{Z}$.

While we only presented toy models here, these strategies see successful application in larger undirected graphical models, as well as directed conditional models for $p(y\vert x)$. One key example of the latter is a language model; though the partition function is a sum over distinct values of $y$ (labels) instead of configurations of $x$ (inputs), it can still be intractable to compute! This is because there are as many distinct values of $y$ as there are tokens in the given language's vocabulary, which can reach hundreds of thousands or even millions.
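Written out for a conditional model with a per-label score $s_{\theta}(x, y)$ (generic notation, not taken from this post), the culprit is the denominator:

$$p(y \mid x) = \frac{\exp\left(s_{\theta}(x, y)\right)}{\sum_{y' \in V} \exp\left(s_{\theta}(x, y')\right)},$$

which must be summed over every word $y'$ in the vocabulary $V$ at every training step; estimators like NCE let us sidestep exactly this sum.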

Thanks for reading.
