We use the same toy data from
David Ha's blog post, where he explains mixture density networks (MDNs). It is an inverse problem where
for every input $x_n$ there are multiple outputs $y_n$.

In [3]:

def build_toy_dataset(N):
  y_data = np.random.uniform(-10.5, 10.5, N)
  r_data = np.random.normal(size=N)  # random noise
  x_data = np.sin(0.75 * y_data) * 7.0 + y_data * 0.5 + r_data * 1.0
  x_data = x_data.reshape((N, 1))
  return train_test_split(x_data, y_data, random_state=42)


ed.set_seed(42)

N = 5000  # number of data points
D = 1  # number of features
K = 20  # number of mixture components

X_train, X_test, y_train, y_test = build_toy_dataset(N)
print("Size of features in training data: {}".format(X_train.shape))
print("Size of output in training data: {}".format(y_train.shape))
print("Size of features in test data: {}".format(X_test.shape))
print("Size of output in test data: {}".format(y_test.shape))

sns.regplot(X_train, y_train, fit_reg=False)
plt.show()

Size of features in training data: (3750, 1)
Size of output in training data: (3750,)
Size of features in test data: (1250, 1)
Size of output in test data: (1250,)

We define TensorFlow placeholders, which will be used to manually feed batches of data during inference. This is one of many ways to train models with data in Edward.
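
For example, they might be defined as follows (a minimal sketch matching the shapes used by the model code below; X_ph holds the features and y_ph the outputs):

X_ph = tf.placeholder(tf.float32, [None, D])
y_ph = tf.placeholder(tf.float32, [None])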

We use a mixture of 20 normal distributions parameterized by a
feedforward network. That is, the membership probabilities and
per-component mean and standard deviation are given by the output of a
feedforward network.

We leverage TensorFlow Slim to construct neural networks. We specify
a network with two hidden layers of 15 units each.

In [5]:

def neural_network(X):
  """loc, scale, logits = NN(x; theta)"""
  # 2 hidden layers with 15 hidden units
  hidden1 = slim.fully_connected(X, 15)
  hidden2 = slim.fully_connected(hidden1, 15)
  locs = slim.fully_connected(hidden2, K, activation_fn=None)
  scales = slim.fully_connected(hidden2, K, activation_fn=tf.exp)
  logits = slim.fully_connected(hidden2, K, activation_fn=None)
  return locs, scales, logits


locs, scales, logits = neural_network(X_ph)
cat = Categorical(logits=logits)
components = [Normal(loc=loc, scale=scale) for loc, scale
              in zip(tf.unstack(tf.transpose(locs)),
                     tf.unstack(tf.transpose(scales)))]
y = Mixture(cat=cat, components=components, value=tf.zeros_like(y_ph))
# Note: A bug exists in Mixture which prevents samples from it to have
# a shape of [None]. For now fix it using the value argument, as
# sampling is not necessary for MAP estimation anyways.

Note that we use the Mixture random variable. It collapses
out the membership assignments for each data point and makes the model
differentiable with respect to all its parameters. It takes a
Categorical random variable as input—denoting the probability for each
cluster assignment—as well as components, which is a list of
individual distributions to mix over.

In [6]:

# There are no latent variables to infer. Thus inference is concerned
# with only training model parameters, which are baked into how we
# specify the neural networks.
inference = ed.MAP(data={y: y_ph})
optimizer = tf.train.AdamOptimizer(5e-3)
inference.initialize(optimizer=optimizer, var_list=tf.trainable_variables())

Here, we will manually control the inference and how data is passed
into it at each step.
Initialize the algorithm and the TensorFlow variables.

In [7]:

sess = ed.get_session()
tf.global_variables_initializer().run()

Now we train the MDN by calling inference.update(), passing
in the data. The quantity inference.loss is the
loss function (negative log-likelihood) at that step of inference. We
also report the loss on the test data by evaluating
inference.loss in the session, feeding the test data to the TensorFlow
placeholders instead of the training data.
We keep track of the losses under train_loss and test_loss.
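
A loop along the following lines would do this (a sketch; n_epoch is a choice made here, and info_dict is the dictionary returned by inference.update()):

n_epoch = 1000
train_loss = np.zeros(n_epoch)
test_loss = np.zeros(n_epoch)
for i in range(n_epoch):
  # One update step on the training data; the returned dict holds the loss.
  info_dict = inference.update(feed_dict={X_ph: X_train, y_ph: y_train})
  train_loss[i] = info_dict['loss']
  # Evaluate the same loss tensor on the held-out test data.
  test_loss[i] = sess.run(inference.loss,
                          feed_dict={X_ph: X_test, y_ph: y_test})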

Note that a common failure mode when training MDNs is that an individual
mixture component collapses to a single point. This forces the standard
deviation of the normal to be close to 0, which produces NaN values.
We can prevent this by thresholding the standard deviation if desired.
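
For example, one could bound the scales inside neural_network (an illustrative variant, not part of the model as specified above; the 1e-3 floor is an arbitrary choice):

# Variant of the scales layer in neural_network: floor the per-component
# standard deviations at 1e-3 so no component can collapse to a point.
scales = tf.maximum(
    slim.fully_connected(hidden2, K, activation_fn=tf.exp), 1e-3)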

After training for a number of iterations, we extract the predictions
we are interested in from the model: the predicted mixture weights,
cluster means, and cluster standard deviations.

To do this, we fetch their values from the session, feeding the test data
X_test to the placeholder X_ph.
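
Concretely, this might look like the following (a sketch; pred_weights, pred_means, and pred_std are names introduced here, and the softmax turns the unnormalized logits into mixture weights):

pred_weights, pred_means, pred_std = sess.run(
    [tf.nn.softmax(logits), locs, scales], feed_dict={X_ph: X_test})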

Let's plot the log-likelihood of the training and test data as
a function of the training epoch. The quantity inference.loss
is the total negative log-likelihood over the data, not the loss per data
point. Below we plot the per-data-point log-likelihood by dividing by the
size of the train and test data, respectively.
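
Assuming the train_loss and test_loss arrays from the loop sketched earlier, the plot might be produced as follows (inference.loss is the negative log-likelihood, hence the sign flip):

plt.plot(np.arange(n_epoch), -train_loss / len(X_train), label='Train')
plt.plot(np.arange(n_epoch), -test_loss / len(X_test), label='Test')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Log-likelihood per data point')
plt.show()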

Let's look at how individual examples perform. Note that as this is an
inverse problem we can't get the answer exactly right, but we can hope that
the truth lies in an area where the model assigns high probability.
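
One way to produce such a plot (a sketch; plot_pred_density is a hypothetical helper, and scipy.stats.norm is assumed to be available):

from scipy.stats import norm

def plot_pred_density(i, ax):
  # Hypothetical helper: overlay the predicted mixture density for test
  # point i with the true output y_test[i].
  grid = np.linspace(-12, 12, 500)
  density = np.zeros_like(grid)
  for w, mu, sigma in zip(pred_weights[i], pred_means[i], pred_std[i]):
    density += w * norm.pdf(grid, loc=mu, scale=sigma)
  ax.plot(grid, density, color='blue')
  ax.axvline(x=y_test[i], color='grey')

fig, ax = plt.subplots()
plot_pred_density(0, ax)
plt.show()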

In this plot, the truth is the vertical grey line while the blue line
is the prediction of the mixture density network. As you can see, we
didn't do too badly.