Accurate and Diverse Sampling of Sequences based on a
“Best of Many” Sample Objective

Abstract

For autonomous agents to successfully operate in the real world, anticipation of future events and states of their environment is a key competence. This problem has been formalized as a sequence extrapolation problem, where a number of observations are used to predict the sequence into the future. Real-world scenarios demand a model of uncertainty of such predictions, as predictions become increasingly uncertain – in particular on long time horizons. While impressive results have been shown on point estimates, scenarios that induce multi-modal distributions over future sequences remain challenging. Our work addresses these challenges in a Gaussian Latent Variable model for sequence prediction. Our core contribution is a “Best of Many” sample objective that leads to more accurate and more diverse predictions that better capture the true variations in real-world sequence data. Beyond our analysis of improved model fit, our models also empirically outperform prior work on three diverse tasks ranging from traffic scenes to weather data.

1 Introduction

Predicting the future is important in many scenarios ranging from autonomous driving to precipitation forecasting. Many of these tasks can be formulated as sequence prediction problems: given a past sequence of events, probable future outcomes are to be predicted.

Recurrent Neural Networks (RNNs), especially LSTM formulations, are state-of-the-art models for sequence prediction tasks [2, 23, 6, 22]. These approaches predict only point estimates. However, many sequence prediction problems are only partially observed or stochastic in nature, and hence the distribution of future sequences can be highly multi-modal. Consider the task of predicting future pedestrian trajectories. In many cases, we do not have any information about the intentions of the pedestrians in the scene. A pedestrian, after walking over a zebra crossing, might decide to turn either left or right. A point estimate in such a situation would be highly unrealistic. Therefore, in order to incorporate the uncertainty of future outcomes, we are interested in structured predictions. Structured prediction in this context implies learning a one-to-many mapping of a given fixed sequence to plausible future sequences [19]. This leads to more realistic predictions and enables probabilistic inference.

Recent work [14] has proposed deep conditional generative models with Gaussian latent variables for structured sequence prediction. The Conditional Variational Auto-Encoder (CVAE) framework [19] is used in [14] for learning of the Gaussian Latent Variables. We identify two key limitations of this CVAE framework. First, the currently used objectives hinder learning of diverse samples due to a marginalization over multi-modal futures. Second, a mismatch in latent variable distribution between training and testing leads to errors in model fitting. We overcome both challenges which results in more accurate and diverse samples – better capturing the true variations in data.
Our main contributions are:

1. We propose a novel “best of many” sample objective;
2. We analyze the benefits of our “best of many” sample objective analytically, and show an improved fit of latent variables on models trained with this novel objective compared to prior approaches;
3. We show for the first time that this modeling paradigm extends to full-frame image sequences with diverse multi-modal futures;
4. We demonstrate improved accuracy as well as diversity of the generated samples on three diverse tasks: MNIST stroke completion, the Stanford Drone Dataset, and HKO weather data. On all three datasets we consistently outperform the state of the art and baselines.

2 Related Work

Structured Output Prediction. Stochastic feed-forward neural networks (SFNN) [21] model multi-modal conditional distributions through binary stochastic hidden variables. During training, multiple samples are drawn and weighted according to importance weights. However, because the latent variables are binary, SFNNs are hard to train on large datasets. There have been several efforts to make training more efficient for binary latent variables [16, 8, 15, 13]. However, not all tasks can be efficiently modelled with binary hidden variables. In [19], Gaussian hidden variables are considered, where the re-parameterization trick can be used for learning on large datasets using stochastic optimization. Inspired by this technique, we model Gaussian hidden variables for structured sequence prediction tasks.

Variational Autoencoders. Variational learning has enabled learning of deep directed graphical models with Gaussian latent variables on large datasets [11, 10, 9]. Model training is made possible through stochastic optimization by the use of a variational lower bound of the data log-likelihood and the re-parameterization trick. In [3], a tighter lower bound on the data log-likelihood is introduced and multiple samples are used during training, weighted according to importance weights. They show empirically that their IWAE framework can learn richer latent space representations. However, these models do not consider conditional distributions for structured output prediction. Conditional variational auto-encoders (CVAE) [19] extend the VAE framework of [11] to model conditional distributions for structured output prediction by introducing the CVAE objective, which maximizes a lower bound on the conditional data log-likelihood. The CVAE framework has been used for a variety of tasks. Examples include generation of likely future frames given a single frame of a video [24], diverse images of clothed people conditioned on their silhouette [12], and trajectories of basketball players using pictorial representations [1]. However, the gap between the training and test latent variable distributions cannot be fully closed by the CVAE objective function. We consider a new multi-sample objective which relaxes the constraints on the recognition network by encouraging diverse sample generation, and thus leads to a better match between the training and test latent variable distributions.

Recurrent Neural Networks. Recurrent Neural Networks (RNNs) are state-of-the-art methods for a variety of sequence learning tasks [7, 20]. In this work, we focus on sequence-to-sequence regression tasks, in particular trajectory prediction and image sequence prediction. RNNs have been used for pedestrian trajectory prediction. In [2], trajectories of multiple people in a scene are jointly modelled in a social context. However, even though the distributions of pedestrian trajectories are highly multimodal (with diverse futures), only one mean estimate is modelled. [14] jointly models multiple future pedestrian trajectories using a recurrent CVAE sampling module. Generated samples are refined and ranked using image and social context features. While our trajectory prediction model is similar to the sampling module of [14], we focus on improving the sampling module through our novel multi-sample objective function. Convolutional RNNs [22] have been used for image sequence prediction. Examples include robotic arm movement prediction [6] and precipitation now-casting [22, 18]. In this work, we extend the model of [22] for structured sequence prediction by conditioning predictions on Gaussian latent variables. Furthermore, we show that optimization using our novel multi-sample objective leads to improved results over the standard CVAE objective.

3 Structured Sequence Prediction

We begin with an overview of deep conditional generative models with Gaussian latent variables and the CVAE framework with the corresponding objective [19] used for training. Then, we introduce our novel “best-of-many” samples objective function. Thereafter, we introduce the conditional generative models which serve as the test bed for our novel objective. We first describe our model for structured trajectory prediction, which is similar to the sampling module of [14], and then consider extensions which additionally condition on visual input and generate full image sequences.

Figure 2: Conditional generative models.

We consider deep conditional generative models of the form shown in Figure 2. Given an input sequence x, a latent variable ^z is drawn from the conditional distribution p(z|x) (assumed Gaussian). The output sequence ^y is then sampled from the distribution pθ(y|x,z) of our conditional generative model parameterized by θ. The latent variable z enables one-to-many mapping and the learning of multiple modes of the true posterior distribution p(y|x). In practice, the simplifying assumption is made that z is independent of x and p(z|x) is N(0,I). Next, we discuss the training of such models.

3.1 Conditional Variational Auto-encoder Based Training Objective

We would like to maximize the data log-likelihood pθ(y∣x). To estimate the data log-likelihood of our model pθ, one possibility is to perform Monte-Carlo sampling of the latent variable z. For T samples, this leads to the following estimate,

$$\hat{L}_{MC} = \log\left(\frac{1}{T}\sum_{i=1}^{T} p_\theta(y \mid \hat{z}_i, x)\right),\qquad \hat{z}_i \sim \mathcal{N}(0, I). \tag{1}$$

This estimate is unbiased but has high variance [15]. We would underestimate the log-likelihood for some samples and overestimate for others, especially if T is small. This would in turn lead to high variance weight updates.
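The Monte-Carlo estimate above can be sketched in a few lines. The following is an illustrative NumPy implementation; the function name, latent dimensionality, and the log-likelihood callback `log_p_y_given_z` are our own assumptions for this sketch, not part of the paper. We use the log-sum-exp trick for numerical stability:

```python
import numpy as np

def mc_log_likelihood_estimate(log_p_y_given_z, T=10, z_dim=2, seed=0):
    """Monte-Carlo estimate of log p(y|x), Eq. (1):
    log( (1/T) * sum_i p(y|z_i, x) ) with z_i ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((T, z_dim))            # T latent samples from the prior
    log_ps = np.array([log_p_y_given_z(z) for z in zs])
    # log-sum-exp trick: avoids underflow when p(y|z_i, x) is tiny
    m = log_ps.max()
    return m + np.log(np.mean(np.exp(log_ps - m)))
```

As the text notes, this estimator is unbiased but has high variance when T is small.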

We can reduce the variance of updates by estimating the log-likelihood through importance sampling during training. As described in [19], we can sample the latent variables z from a recognition network qϕ using the re-parameterization trick [11]. The data log-likelihood is,

$$\log\left(p_\theta(y \mid x)\right) = \log\left(\int p_\theta(y \mid z, x)\,\frac{p(z \mid x)}{q_\phi(z \mid x, y)}\,q_\phi(z \mid x, y)\,dz\right). \tag{2}$$

The integral in (2) is computationally intractable. In [19], a variational lower bound of the data log-likelihood (2) is derived, which can be estimated empirically using Monte-Carlo integration (also used in [14]),

$$\hat{L}_{CVAE} = \frac{1}{T}\sum_{i=1}^{T} \log p_\theta(y \mid \hat{z}_i, x) - D_{KL}\left(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\right),\qquad \hat{z}_i \sim q_\phi(z \mid x, y). \tag{3}$$

The lower bound in (3) weights all samples (^zi) equally, so they must all ascribe high probability to the data point (x,y). This introduces a strong constraint on the recognition network qϕ. Therefore, the model is forced to trade off between a good estimate of the data log-likelihood and the KL divergence between the training and test latent variable distributions. One possibility to close the gap introduced between the training and test pipelines, as described in [19], is to use a hybrid objective of the form (1−α)^LMC+α^LCVAE. Although such a hybrid objective has shown modest improvement in performance in certain cases, we could not observe any significant improvement over the standard CVAE objective in our structured sequence prediction tasks. In the following, we derive our novel “best-of-many-samples” objective, which on the one hand encourages sample diversity and on the other hand aims to close the gap between the training and testing pipelines.
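The empirical CVAE bound in (3) can be sketched as follows. This is a minimal NumPy illustration under the usual diagonal-Gaussian assumption for qϕ; the helper names are ours, and a real model would backpropagate through the samples:

```python
import numpy as np

def cvae_objective(log_p_y_given_z, mu, sigma, T=10, seed=0):
    """Empirical CVAE lower bound, Eq. (3):
    (1/T) sum_i log p(y|z_i, x) - KL(q_phi || N(0, I)),
    with z_i = mu + sigma * eps_i (re-parameterization trick)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((T, mu.shape[0]))
    zs = mu + sigma * eps                       # T samples from q_phi
    recon = float(np.mean([log_p_y_given_z(z) for z in zs]))
    # Closed-form KL between a diagonal Gaussian and the standard normal
    kl = 0.5 * float(np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma)))
    return recon - kl
```

Every sample enters the average, so each ^zi must explain (x, y) well; this is exactly the strong constraint on qϕ discussed above.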

(a) Our model for structured trajectory prediction.

(b) Our model for structured image sequence prediction.

Figure 3: Our model architectures. The recognition networks are only available during training.

3.2 Best of Many Samples Objective

Here, we propose our objective which, unlike (3), does not weight each sample equally. Consider the functions f1(z) = p(z|x)/qϕ(z|x,y) and f2(z) = pθ(y|z,x) × qϕ(z|x,y) in (2). We cannot evaluate f2(z) directly for Monte-Carlo samples. Notice, however, that both f1(z) and f2(z) are continuous and positive. As qϕ(z|x,y) is normally distributed, the integral above can be very well approximated on a large enough bounded interval [a,b]. Therefore, we can use the First Mean Value Theorem of Integration [4] to separate the functions f1(z) and f2(z) in (2): there exists z′ ∈ (a,b) such that

$$\log\left(p_\theta(y \mid x)\right) \approx \log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) + \log\left(\frac{p(z' \mid x)}{q_\phi(z' \mid x, y)}\right), \tag{4}$$

which we can lower bound by taking the minimum of the second term over (a,b),

$$\log\left(p_\theta(y \mid x)\right) \gtrsim \log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) + \min_{z \in (a,b)} \log\left(\frac{p(z \mid x)}{q_\phi(z \mid x, y)}\right). \tag{5}$$

We can estimate the first term on the right of (5) using Monte-Carlo integration. The minimum in the second term on the right of (5) is difficult to estimate, therefore we approximate it by the KL divergence over the full distribution. The KL divergence heavily penalizes qϕ(z|x,y) when it is high for low values of p(z|x) (which leads to a low value of the ratio of the distributions). This leads to the following “many-sample” objective (more details in the supplementary material),

$$\hat{L}_{MS} = \log\left(\frac{1}{T}\sum_{i=1}^{T} p_\theta(y \mid \hat{z}_i, x)\right) - D_{KL}\left(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\right),\qquad \hat{z}_i \sim q_\phi(z \mid x, y). \tag{6}$$

Compared to the CVAE objective (3), the recognition network qϕ now has multiple chances to draw samples with high posterior probability pθ(y∣z,x). This encourages diversity in the generated samples. Furthermore, the data log-likelihood estimate in this objective is tighter, as ^LMS ≥ ^LCVAE follows from Jensen's inequality. Therefore, this bound loosens the constraints on the recognition network qϕ and allows it to more closely match the latent variable distribution p(z|x). However, as we focus on regression tasks, probabilities are of the form e−MSE(^y,y). In practice, the Log-Average term can therefore cause numerical instabilities due to limited machine precision in representing the probability e−MSE(^y,y). Therefore, we use a “Best of Many Samples” approximation ^LBMS of (6). We can pull the constant 1/T term outside the average in (6) and approximate the sum with the maximum,

$$\hat{L}_{BMS} = \max_{i}\,\log p_\theta(y \mid \hat{z}_i, x) - D_{KL}\left(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\right),\qquad \hat{z}_i \sim q_\phi(z \mid x, y). \tag{7}$$

Similar to (6), this objective encourages diversity and loosens the constraints on the recognition network qϕ, as only the best sample is considered. During training, pθ initially assigns low probability to the data for all samples ^zi. The log(T) difference between (6) and (7) would then be dominated by the low data log-likelihood. Later on, as both objectives promote diversity, the Log-Average term in (6) would be dominated by one term in the average. Therefore, (6) would be well approximated by the maximum of the terms in the average. Furthermore, (7) avoids numerical stability issues.
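The change from the CVAE objective is small in code: only the best sample contributes to the data term. A hedged NumPy sketch, using the same illustrative helpers as before (not the paper's implementation):

```python
import numpy as np

def bms_objective(log_p_y_given_z, mu, sigma, T=10, seed=0):
    """'Best of Many Samples' objective: replace the sample average by
    the maximum, max_i log p(y|z_i, x) - KL(q_phi || N(0, I))."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((T, mu.shape[0]))
    zs = mu + sigma * eps                       # T samples from q_phi
    best = max(float(log_p_y_given_z(z)) for z in zs)
    kl = 0.5 * float(np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma)))
    return best - kl
```

Working with max_i log p, rather than the log of an average of raw probabilities, sidesteps the underflow issue of the Log-Average term for likelihoods of the form e^(−MSE).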

Figure 4: Diverse samples drawn from our LSTM-BMS model trained using the ^LBMS objective, clustered using k-means. The number of clusters is set manually to the number of expected digits based on the initial stroke.

Figure 5: Top 10% of samples drawn from the LSTM-BMS model (magenta) and the LSTM-CVAE model (yellow), with the groundtruth in blue.

3.3 Model Architectures for Structured Sequence Prediction

We base our model architectures on RNN encoder-decoders. We use LSTM formulations for structured trajectory prediction tasks (Figure 3(a)) and Convolutional LSTM (Conv-LSTM) formulations for structured image sequence prediction tasks (Figure 3(b)). During training, we use an LSTM recognition network for trajectory prediction (Figure 3(a)) and a Conv-LSTM recognition network for image sequence prediction (Figure 3(b)). Note that, as we make the simplifying assumption that z is independent of x, the recognition networks are conditioned only on y.

Model for Structured Trajectory Prediction. Our model for structured trajectory prediction (see Figure 3(a)) is similar to the sampling module of [14]. The input sequence x is processed using an embedding layer to extract features, and the embedded sequence is read by the encoder LSTM. The encoder LSTM produces a summary vector v, which is its internal state after reading the input sequence x. The decoder LSTM is conditioned on the summary vector v and additionally on a sample of the latent variable z. The decoder LSTM is unrolled in time and a prediction is generated by a linear transformation of its output. Therefore, the predicted sequence at a certain time-step ^yt is conditioned on the output at the previous time-step, the summary vector v and the latent variable z. As the summary v is deterministic given x, we have,

$$p_\theta(y \mid x, z) = \prod_{t} p_\theta(y_t \mid y_{1:t-1}, v, z).$$

Conditioning the predicted sequence at all time-steps upon a single sample of z enables z to capture global characteristics (e.g. speed and direction of motion) of the future sequence and generation of temporally consistent sample sequences ^y.
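To make this conditioning concrete, the decoder loop can be sketched as follows. This is a deliberately simplified linear recurrence standing in for the LSTM decoder, with illustrative weights `W`, `b`; the actual layer configuration is given in the appendix tables:

```python
import numpy as np

def decode_trajectory(v, z, steps, W, b):
    """Sketch of the decoder loop: every predicted step is a function of
    the previous output, the summary v, and a SINGLE latent sample z,
    so z can capture global characteristics of the whole future sequence."""
    y_prev = np.zeros(2)                        # 2-D position output
    out = []
    for _ in range(steps):
        inp = np.concatenate([y_prev, v, z])    # condition on v and z at every step
        y_prev = W @ inp + b                    # linear stand-in for the LSTM cell
        out.append(y_prev)
    return np.stack(out)                        # (steps, 2) predicted trajectory
```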

Extension with Visual Input. In the case of dynamic agents, e.g. pedestrians in traffic scenes, the future trajectory is highly dependent upon the environment, e.g. the layout of the streets. Therefore, additionally conditioning samples on sensory input (e.g. visuals of the environment) enables more accurate sample generation. We use a CNN to extract a summary of a visual observation of the scene. This visual summary is given as input to the decoder LSTM, ensuring that the generated samples are additionally conditioned on the visual input.

Model for Structured Image Sequence Prediction. If the sequence (x,y) in question consists of images, e.g. frames of a video, the trajectory prediction model of Figure 3(a) cannot exploit the spatial structure of the image sequence. More specifically, consider a pixel y^{t+1}_{i,j} at time-step t+1 of the image sequence y. The pixel value at time-step t+1 depends only upon the pixel y^t_{i,j} and a certain neighbourhood around it. Furthermore, spatially neighbouring pixels are correlated. This spatial structure can be exploited by using Convolutional LSTMs [22] as RNN encoder-decoders. Conv-LSTMs retain spatial information by considering the hidden states h and cell states c as 3D tensors: the cell and hidden states are composed of vectors c^t_{i,j}, h^t_{i,j} corresponding to each spatial position. New cell states, hidden states and outputs are computed using convolutional operations. Therefore, the new cell states c^{t+1}_{i,j} and hidden states h^{t+1}_{i,j} depend only upon a local spatial neighbourhood of c^t_{i,j}, h^t_{i,j}, thus preserving spatial information.

We propose conditional generative models with Conv-LSTMs for structured image sequence prediction (Figure 3(b)). The encoder and decoder consist of two stacked Conv-LSTMs for feature aggregation. As before, the output is conditioned on a latent variable z to model multiple modes of the conditional distribution p(y∣x). The future states of neighbouring pixels are highly correlated, yet spatially distant parts of the image sequences can evolve independently. To take into account this spatial structure, we consider latent variables z which are 3D tensors. As detailed in Figure 3(b), the input image sequence x is processed using a convolutional embedding layer. The Conv-LSTM reads the embedded input sequence and produces a 3D tensor v as the summary. The 3D summary v and the latent variable z are given as input to the Conv-LSTM decoder at every time-step. The cell state, hidden state and output at a certain spatial position, c^t_{i,j}, h^t_{i,j}, y^t_{i,j}, are conditioned on a sub-tensor z_{i,j} of the latent tensor z. Spatially neighbouring cell states, hidden states (and thus outputs) are therefore conditioned on spatially neighbouring sub-tensors z_{i,j}. This, coupled with the spatial information preserving property of Conv-LSTMs detailed above, enables z to capture spatial location specific characteristics of the future image sequence and allows for modeling the correlation of future states of spatially neighbouring pixels. This ensures spatial consistency of sampled output sequences ^y. Furthermore, as in the fully connected case, conditioning the full output sequence sample ^y on a single sample of z ensures temporal consistency.

4 Experiments

We evaluate our models both on synthetic and real data. We choose sequence datasets which display multimodality. In particular, we evaluate on pen strokes from the MNIST sequence data [5] (which can be seen as trajectories in a constrained space), human trajectories from the Stanford Drone data [17] and radar echo image sequences from HKO [22]. All models were trained using the ADAM optimizer, with a batch size of 32 for trajectory data and 4 for the radar echo data. All experiments were conducted on a single Nvidia M40 GPU with 12GB memory. For models trained using the ^LCVAE and ^LBMS objectives, we use T={10,10,5} samples during training on the MNIST Sequence, Stanford Drone, and HKO datasets respectively.

(a) Diverse samples drawn from our LSTM-BMS model trained using the ^LBMS objective, color-coded after clustering using k-means with four clusters.

(b) Top 10% of samples drawn from the LSTM-BMS model (magenta) and the LSTM-CVAE model (yellow), with the groundtruth in blue.

Figure 7: Qualitative evaluation on the Stanford Drone dataset.

4.1 MNIST Sequence

The MNIST sequence dataset consists of pen strokes which closely approximate the skeletons of the digits in the MNIST dataset. We focus on the stroke completion task: given an initial stroke, the distribution of possible completions is highly multimodal. The digits 0, 3, 2 and 8 can share the same initial stroke, with multiple writing styles for each digit; similarly for the digits 0 and 6.

We fix the length of the initial stroke sequence at 10. We use the trajectory prediction model from Figure 3(a) and train it using the ^LBMS objective (LSTM-BMS). We compare it against the following baselines:

1. The trajectory prediction model from Figure 3(a) trained using the ^LMC objective (LSTM-MC);
2. The trajectory prediction model from Figure 3(a) trained using the ^LCVAE objective (LSTM-CVAE).

We use the negative conditional log-likelihood metric (CLL) and report the results in Table 1. We use T=100 samples to estimate the CLL.

We observe that our LSTM-BMS model achieves the best CLL, meaning that it fits the data distribution best. Furthermore, the latent variables sampled from our recognition network qϕ(z∣x,y) during training better match the true distribution p(z∣x) used during testing. This can be seen in Figure 6, which shows the KL divergence DKL(qϕ(z∣x,y)∥p(z∣x)) during training for recognition networks trained with the ^LBMS objective versus the ^LCVAE objective. The KL divergence of the recognition network trained with ^LBMS is substantially lower, thus reducing the mismatch in the latent variable z between the training and testing pipelines.

We show qualitative examples of samples generated by the LSTM-BMS model in Figure 4, with T=100 samples per test example. The initial conditioning stroke is shown in white. The samples drawn are diverse and clearly multimodal. We cluster the generated samples using k-means for better visualization; the number of clusters is set manually to the number of expected digits based on the initial stroke. In particular, our model generates samples corresponding to the digits 2, 3, 0 (1st example), 0, 6 (2nd example) and so on.
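The clustering used for visualization is plain k-means over flattened sample trajectories. A minimal NumPy version with greedy farthest-point initialization (an assumption of ours; the paper does not specify the initialization) could look like this:

```python
import numpy as np

def kmeans(samples, k, iters=20):
    """Cluster (N, D) sample vectors into k groups for display purposes."""
    # Greedy farthest-point initialization, then standard Lloyd iterations
    centers = [samples[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(samples - c, axis=1) for c in centers], axis=0)
        centers.append(samples[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        d = np.linalg.norm(samples[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)               # assign each sample to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = samples[labels == j].mean(axis=0)
    return labels, centers
```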

We compare the accuracy of samples generated by our LSTM-BMS model versus the LSTM-CVAE model in Figure 5. We display the mean of the oracle top 10% of samples (closest in Euclidean distance to the groundtruth) generated by both models. Comparing the results, we see that using the ^LBMS objective leads to the generation of more accurate samples.
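The oracle top-10% selection can be stated precisely in a few lines; the following sketch (function name and shapes are our own, assuming samples and groundtruth can be flattened to vectors) shows how we read this metric:

```python
import numpy as np

def oracle_top_fraction(samples, groundtruth, frac=0.1):
    """Return the fraction of samples closest to the groundtruth in
    Euclidean distance (an 'oracle' selection: it peeks at the target)."""
    flat = samples.reshape(len(samples), -1)
    d = np.linalg.norm(flat - groundtruth.reshape(1, -1), axis=1)
    k = max(1, int(len(samples) * frac))
    return samples[np.argsort(d)[:k]]           # the k best samples
```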

4.2 Stanford Drone

The Stanford Drone dataset consists of overhead videos of traffic scenes. Trajectories of various dynamic agents including Pedestrians and Bikers are annotated. The paths of such agents are determined by various factors including the intention of the agent, paths of other agents and the layout of the scene. Thus, the trajectories are highly multi-modal. As in [17, 14], we predict the trajectories of these agents 4.8 seconds into the future conditioned on the past 2.4 seconds. We use the same dataset split as in [14]. We encode trajectories as relative displacement from the initial position. The trajectory at each time-step can be seen as the velocity of the agent.
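One common way to implement such an encoding is via per-step displacements, which double as per-frame velocities. The sketch below is our illustration of one plausible variant, since the exact encoding is not spelled out in the text:

```python
import numpy as np

def to_relative(traj):
    """Encode an (L, 2) trajectory as per-step displacements (velocities)."""
    return np.diff(np.asarray(traj, dtype=float), axis=0)

def to_absolute(rel, start):
    """Invert the encoding: cumulative sum of displacements plus the start."""
    return np.vstack([start, start + np.cumsum(rel, axis=0)])
```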

We consider the extension of our trajectory prediction model (Figure 3(a)) discussed in subsection 3.3, conditioned on the last visual observation from the overhead camera. We use a 6-layer CNN to extract visual features (see supplementary material). We train this model with the ^LBMS objective and compare it to:

1. A vanilla LSTM encoder-decoder regression model with and without visual observation (LSTM);

We report the results in Table 2. We report the CLL metric and the Euclidean distance in pixels between the true trajectory and the oracle top 10% of generated samples at 1, 2, 3 and 4 seconds into the future, at 1/5 resolution (as in [14]). Our LSTM-BMS model again performs best both with respect to the Euclidean distance and the CLL metric. This again demonstrates that using the ^LBMS objective enables us to better fit the groundtruth data distribution and to generate more accurate samples. The performance advantage with respect to DESIRE-SI-IT4 [14] is due to:

We show qualitative examples of generated samples (T=10) in Figure 7. We color code the generated samples using k-means with four clusters. The qualitative examples display high plausibility and diversity. They follow the layout of the scene, the location of roads, vegetation, vehicles etc. We qualitatively compare the accuracy of samples generated by our LSTM-BMS model versus the LSTM-CVAE model in Figure 7. We see that the oracle top 10% of samples generated using the ^LBMS objective are more accurate and thus more representative of the groundtruth.

4.3 Radar Echo

The Radar Echo dataset [22] consists of weather radar intensity images from 97 rainy days over Hong Kong from 2011 to 2013. The weather evolves due to a variety of factors which are difficult to identify using only the radar images, leading to varied and multimodal futures. Each sequence consists of 20 frames of resolution 100×100, recorded at intervals of 6 minutes. We use the same dataset split as [22] and predict the next 15 images given the previous 5 images.

We compare our image sequence prediction model in Figure 3(b) trained with the ^LBMS objective (Conv-LSTM-BMS) to one trained with the ^LCVAE objective (Conv-LSTM-CVAE). We additionally compare to the Conv-LSTM model of [22]. In addition to the CLL metric (calculated per image sequence), we use the following precipitation nowcasting metrics from [22]:

1. Rainfall mean squared error (Rainfall-MSE);
2. Critical success index (CSI);
3. False alarm rate (FAR);
4. Probability of detection (POD);
5. Correlation.

For fair comparison, we estimate these metrics using T=1 random samples from the Conv-LSTM-CVAE and Conv-LSTM-BMS models.

We report the results in Table 3. Both the Conv-LSTM-CVAE and Conv-LSTM-BMS models perform better compared to [22]. This is due to the use of embedding layers for feature extraction and the use of 2×2 max pooling between the two Conv-LSTM layers for feature aggregation (compared to no embedding layers or pooling in [22]).
Furthermore, the superior CLL of the Conv-LSTM-BMS model demonstrates its ability to fit the data distribution better. We show qualitative examples in Figure 9 at t+5, t+10 and t+15. We generate T=50 samples and show the sample closest to the groundtruth (Best), the mean of all the samples and the per-pixel variance of the samples. The qualitative examples demonstrate that our model produces highly accurate and diverse samples.

5 Conclusion

We have presented a novel “best of many” sample objective for Gaussian latent variable models and shown its advantages for learning conditional models on multi-modal distributions. Our analysis shows that the learnt latent representation is indeed better matched between training and test time, which in turn leads to more accurate samples. We show the benefits of our model on trajectory as well as image sequence prediction using three diverse datasets: MNIST strokes, Stanford Drone and HKO weather. Our proposed approach consistently outperforms baselines and the state of the art in all these scenarios.

Acknowledgments We would like to thank Francesco Croce for his comments and feedback.


Appendix

Appendix A Additional Details of our “Best of Many” Sample Objective

Here we provide additional details of our “Best of Many” samples objective and include additional qualitative results.
We begin with the formal statement of the First Mean Value Theorem of Integration [4]. The First Mean Value Theorem of Integration states that, if f1:[a,b]→R is continuous and f2 is an integrable function that does not change sign on [a,b], then ∃z′∈(a,b) such that,

$$\int_a^b f_1(z)\,f_2(z)\,dz = f_1(z')\int_a^b f_2(z)\,dz. \tag{S1}$$

The data log-likelihood, Equation (2) in the main paper, estimated using importance sampling with a recognition network qϕ, is given by,

$$\log\left(p_\theta(y \mid x)\right) = \log\left(\int p_\theta(y \mid z, x)\,\frac{p(z \mid x)}{q_\phi(z \mid x, y)}\,q_\phi(z \mid x, y)\,dz\right). \tag{S2}$$

We apply the First Mean Value Theorem of Integration to derive Equation (4) in the main paper, which is,

$$\log\left(p_\theta(y \mid x)\right) \approx \log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) + \log\left(\frac{p(z' \mid x)}{q_\phi(z' \mid x, y)}\right),\quad z' \in (a, b). \tag{S3}$$

To do this, we set f1(z) = p(z|x)/qϕ(z|x,y) and f2(z) = pθ(y|z,x) × qϕ(z|x,y) (from the data log-likelihood in (S2)). The integral in (S2) can be very well approximated on a large enough bounded interval [a,b]. This leads to,

$$\log\left(p_\theta(y \mid x)\right) \approx \log\left(f_1(z')\right) + \log\left(\int_a^b f_2(z)\,dz\right),\quad z' \in (a, b), \tag{S4}$$

$$\geq \min_{z \in (a,b)} \log\left(f_1(z)\right) + \log\left(\int_a^b f_2(z)\,dz\right). \tag{S5}$$

However, as mentioned in the main paper, the minimum in (S5) is difficult to estimate. Therefore, we use the following approximation. From (S3), we know that there exists z′ ∈ (a,b) which lower bounds the data log-likelihood. To maximize this data log-likelihood, we would like to maximize log(f1(z′)). However, as we do not know z′, we instead choose to maximize it for a set of N points in (a,b),

$$\frac{1}{N}\sum_{i=1}^{N} \log\left(f_1(z'_i)\right),\quad z'_i \in (a, b). \tag{S6}$$

As the values of both p and qϕ are bounded above by 1, the value of the function f1(z′i) = p(z′i|x)/qϕ(z′i|x,y) is likely to be low when p is low and qϕ is high. Therefore, to give more importance to such points z′i, we weight each point by qϕ(z′i|x,y),

$$\sum_{i=1}^{N} q_\phi(z'_i \mid x, y)\,\log\left(f_1(z'_i)\right). \tag{S7}$$

Combining this weighted sum with the data term from (S5) yields

$$\log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) - \sum_{i=1}^{N} q_\phi(z'_i \mid x, y)\,\log\left(\frac{q_\phi(z'_i \mid x, y)}{p(z'_i \mid x)}\right). \tag{S8}$$

If we choose a sufficiently large set of points z′i∈(a,b), we can collect the terms in the second part of (S8) and replace them with a single integral,

$$\log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) - \int_a^b q_\phi(z \mid x, y)\,\log\left(\frac{q_\phi(z \mid x, y)}{p(z \mid x)}\right)dz. \tag{S9}$$

The second integral in (S9) is the KL divergence between the two distributions qϕ(z|x,y) and p(z|x),

$$\log\left(\int_a^b p_\theta(y \mid z, x)\,q_\phi(z \mid x, y)\,dz\right) - D_{KL}\left(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\right). \tag{S10}$$

We can estimate the data log-likelihood term in (S10) using Monte-Carlo integration. This leads to the “Many Sample” objective from the main paper,

$$\hat{L}_{MS} = \log\left(\frac{1}{T}\sum_{i=1}^{T} p_\theta(y \mid \hat{z}_i, x)\right) - D_{KL}\left(q_\phi(z \mid x, y)\,\|\,p(z \mid x)\right),\qquad \hat{z}_i \sim q_\phi(z \mid x, y). \tag{S11}$$

As mentioned in the main paper, we use the re-parameterization trick [11] to sample from our recognition network qϕ. The recognition network predicts the mean and standard deviation (μ, σ) of the Gaussian distribution qϕ = N(μ, σ) from which the latent variable z is sampled. Thus, we can directly use the predicted μ, σ to estimate the KL divergence, as in [11].
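The closed-form KL term used here is the standard one for a diagonal Gaussian against the N(0, I) prior; for reference, a minimal sketch (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form, as in [11]:
    0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )."""
    return 0.5 * float(np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))
```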

Approximating the data log-likelihood term in the first part of (S11) as shown in the main paper, leads to our “Best of Many” sample objective.

Appendix B Additional Details of our Models

Here, we include details of each layer of our models.

B.1 Model for Structured Trajectory Prediction

We provide the details of our structured trajectory prediction model in Table 4, followed by the details of the recognition network (qϕ) in Table 5. We refer to fully connected layers as Dense, and Size refers to the number of neurons in a layer.

| Layer | Type | Size | Activation | Input | Output |
|---|---|---|---|---|---|
| In1 | Input | – | – | x | EMB1 |
| EMB1 | Dense | 32 | ReLU | In1 | LSTMenc |
| LSTMenc | LSTM | 48 | tanh | EMB1 | EMB2 |
| EMB2 | Dense | 64 | ReLU | {LSTMenc, qϕ} | LSTMdec |
| LSTMdec | LSTM | 48 | tanh | EMB2 | Out1 |
| Out1 | Dense | 2 | – | LSTMdec | ^y |

Table 4: Details of our model for Structured Trajectory Prediction. The details of the recognition network qϕ used during training follow in Table 5.

| Layer | Type | Size | Activation | Input | Output |
|---|---|---|---|---|---|
| In2 | Input | – | – | y | EMB3 |
| EMB3 | Dense | 64 | ReLU | In2 | LSTMrec |
| LSTMrec | LSTM | 128 | tanh | EMB3 | {D1, D2} |
| D1 | Dense | 64 | – | LSTMrec | μ |
| D2 | Dense | 64 | – | LSTMrec | σ |

Table 5: Details of the recognition network used during training of our model for Structured Trajectory Prediction.

B.2 Extension with Visual Input

This model is similar to the model for Structured Trajectory Prediction, except that the LSTMdec is additionally conditioned on the output of a CNN encoder. The details are in Table 6 and Table 7. We use the same recognition network as described previously in subsection B.1.

Table 8: Details of the recognition network used during training of our extended Structured Trajectory Prediction model with Visual Input.

B.3 Model for Structured Image Sequence Prediction

We provide the details of our structured image sequence prediction model in Table 9, followed by the details of the recognition network (qϕ) in Table 10. In contrast to the model for structured trajectory prediction, we use Convolutional LSTMs and convolutional embedding layers.

Appendix C Additional Results

We show additional qualitative results on the HKO dataset in Figure 9 at t+5, t+10 and t+15. We generate T=50 samples and show the sample closest to the groundtruth (Best), the mean of all the samples and the per-pixel variance of the samples. As in the main paper, the qualitative examples demonstrate that our model produces samples which are close to the groundtruth (comparing the Best sample with the groundtruth) and diverse (comparing the difference between the mean of the samples and the Best sample).