Running algorithms that require the full data set for each update
can be expensive when the data is large. To scale inference,
we can do batch training: the model is trained using
only a subsample of the data at a time.

Simulate $N$ training examples and a fixed number of test examples.
Each example is a pair of inputs $\mathbf{x}_n\in\mathbb{R}^{10}$ and
outputs $y_n\in\mathbb{R}$. The outputs depend linearly on the inputs,
with normally distributed noise.
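
The data-simulating function used below is not shown in this excerpt;
a minimal sketch consistent with the description, along with the
imports used below, might look as follows (the noise scale
noise_std=0.1 is our assumption):

import edward as ed
import numpy as np

def build_toy_dataset(N, w, noise_std=0.1):
  """Simulate N pairs (x_n, y_n) with y_n = x_n^T w + normal noise."""
  D = len(w)
  x = np.random.randn(N, D)
  y = np.dot(x, w) + np.random.normal(0, noise_std, size=N)
  return x, y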

We also define a helper function to select the next batch of data
points from the full set of examples. It is a Python generator: it
keeps track of the current position within each array and returns the
next batch each time next() is called, wrapping around once the data
is exhausted. We will draw batches from data during inference.

def generator(arrays, batch_size):
  """Generate batches, one with respect to each array's first axis."""
  starts = [0] * len(arrays)  # pointers to where we are in iteration
  while True:
    batches = []
    for i, array in enumerate(arrays):
      start = starts[i]
      stop = start + batch_size
      diff = stop - array.shape[0]
      if diff <= 0:
        batch = array[start:stop]
        starts[i] += batch_size
      else:
        batch = np.concatenate((array[start:], array[:diff]))
        starts[i] = diff
      batches.append(batch)
    yield batches


ed.set_seed(42)

N = 10000  # size of training data
M = 128    # batch size during training
D = 10     # number of features

w_true = np.ones(D) * 5
X_train, y_train = build_toy_dataset(N, w_true)
X_test, y_test = build_toy_dataset(235, w_true)

data = generator([X_train, y_train], M)

The latent variables are the linear model's weights $\mathbf{w}$ and
intercept $b$, also known as the bias.
Assume $\sigma_w^2,\sigma_b^2$ are known prior variances and $\sigma_y^2$ is a
known likelihood variance. The mean of the likelihood is given by a
linear transformation of the inputs $\mathbf{x}_n$.
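
Written out, the model is

$$p(\mathbf{w}) = \text{Normal}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2\mathbf{I}),$$

$$p(b) = \text{Normal}(b \mid 0, \sigma_b^2),$$

$$p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) = \prod_{n=1}^N \text{Normal}(y_n \mid \mathbf{x}_n^\top\mathbf{w} + b, \sigma_y^2),$$

where $\mathbf{X}=(\mathbf{x}_1,\ldots,\mathbf{x}_N)^\top$ is the matrix of inputs.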

Here, we define a placeholder X. During inference, we feed batches
of data into this placeholder. To enable training with batches of
varying size, we don't fix the number of rows for X and y.
(Alternatively, we could fix it to the batch size if always training
and testing with a fixed size.)
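
A minimal sketch of the model in Edward follows; the placeholder name
y_ph for the batch of observed outputs is our own, and unit variances
are assumed for concreteness.

import edward as ed
import tensorflow as tf
from edward.models import Normal

X = tf.placeholder(tf.float32, [None, D])  # rows unspecified to allow varying batch sizes
y_ph = tf.placeholder(tf.float32, [None])  # observed outputs for the current batch

w = Normal(loc=tf.zeros(D), scale=tf.ones(D))  # prior on weights, sigma_w = 1
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))  # prior on intercept, sigma_b = 1
y = Normal(loc=ed.dot(X, w) + b, scale=1.0)    # likelihood mean is linear in X, sigma_y = 1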

Run variational inference, minimizing the $\text{KL}(q\|p)$
divergence. We use $5$ latent variable samples for computing
black box stochastic gradients in the algorithm.
(For more details, see the
$\text{KL}(q\|p)$ tutorial.)

For batch training, we will iterate over the number of batches and
feed them to the respective placeholder. We set the number of
iterations to be equal to the number of batches times the number of
epochs (full passes over the data set).
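
Concretely, a sketch of the inference setup and training loop,
assuming a fully factorized normal variational family over
$\mathbf{w}$ and $b$ (the choice of 5 epochs is arbitrary):

n_batch = int(N / M)
n_epoch = 5

qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

inference = ed.KLqp({w: qw, b: qb}, data={y: y_ph})
inference.initialize(n_iter=n_batch * n_epoch, n_samples=5, scale={y: N / M})
tf.global_variables_initializer().run()

for _ in range(inference.n_iter):
  X_batch, y_batch = next(data)  # fetch the next batch from the generator
  info_dict = inference.update({X: X_batch, y_ph: y_batch})
  inference.print_progress(info_dict)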

When initializing inference, note we scale $y$ by $N/M$, so it is as if the
algorithm had seen $N/M$ times as many data points per iteration.
Algorithmically, this scales all computation involving $y$ by
$N/M$, such as scaling the log-likelihood in a variational method's
objective. (Statistically, this prevents inference from being dominated by the prior.)

The loop construction makes training very flexible. For example, we
can also try running many updates for each batch.
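
For instance, a variant of the loop above that makes several updates
on each batch might look as follows (5 updates per batch is an
arbitrary choice; the inference would then be initialized with
n_iter=n_batch * n_epoch * 5):

for _ in range(n_epoch):
  for _ in range(n_batch):
    X_batch, y_batch = next(data)
    for _ in range(5):  # several gradient updates on the same batch
      info_dict = inference.update({X: X_batch, y_ph: y_batch})
      inference.print_progress(info_dict)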

In general, make sure that the total number of training iterations is
specified correctly when initializing inference. An incorrect
number of training iterations can have unintended consequences; for example,
ed.KLqp uses an internal counter to appropriately decay its optimizer's
learning rate step size.

Note also that the loss value reported while running the
algorithm corresponds to the objective computed on the current
batch, not on the total data set. We can instead report
the loss over the total data set by summing info_dict['loss']
across the batches of each epoch.
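
A sketch of that bookkeeping, accumulating the per-batch objective
values within each epoch:

for epoch in range(n_epoch):
  total_loss = 0.0
  for _ in range(n_batch):
    X_batch, y_batch = next(data)
    info_dict = inference.update({X: X_batch, y_ph: y_batch})
    total_loss += info_dict['loss']  # objective on the current batch
  print("Epoch {}: loss summed over all batches = {:.3f}".format(epoch, total_loss))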

Only certain algorithms support batch training, such as
MAP, KLqp, and SGLD. Also, above we
illustrated batch training for models with only global latent variables,
which are variables shared across all data points.
For more complex strategies, see the
inference data subsampling API.
