Abstract

Being able to predict what may happen in the future requires an in-depth understanding of the physical and causal rules that govern the world. A model that can do so has a number of appealing applications, from robotic planning to representation learning. However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging: the ambiguous nature of the problem can cause a naively designed model to average together possible futures into a single, blurry prediction. Recently, this has been addressed by two distinct approaches: (a) variational latent variable models that explicitly model the underlying stochasticity and (b) adversarially trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, and a standard adversarially trained model underutilizes latent variables and fails to produce diverse predictions. We show that these distinct methods are in fact complementary. Combining the two produces predictions that look more realistic to human raters and better cover the range of possible futures. Our method outperforms prior and concurrent work in both respects.

Try the Stochastic Adversarial Video Prediction (SAVP) Model

Example Results

We show qualitative results of the video predictions achieved by our SAVP method, our GAN and VAE variants, and other approaches. SV2P is prior work from Babaeizadeh et al. 2017, while SVG is concurrent work from Denton & Fergus 2018. For the stochastic models, we draw 100 samples and, unless otherwise labeled, show the one most similar to the ground truth video (the "best" sample). Yellow indicates predicted frames, and white indicates the frames the model is conditioned on. We also show that our model can predict several hundred frames into the future despite only being trained to predict 10 future frames.
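
The best-of-100 selection described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for these results: the text does not say which similarity measure is used, so PSNR is an assumption here, and the names `psnr` and `best_of_n` as well as the random stand-in videos are hypothetical.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio for pixel values in [0, max_val].

    PSNR is an assumed similarity measure, chosen only for illustration.
    """
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical videos
    return 10.0 * np.log10(max_val ** 2 / mse)


def best_of_n(samples, ground_truth, similarity=psnr):
    """Pick the sampled video most similar to the ground truth.

    samples:      array of shape (N, T, H, W, C), N sampled predictions
    ground_truth: array of shape (T, H, W, C)
    Returns (index of the best sample, its similarity score).
    """
    scores = np.array([similarity(s, ground_truth) for s in samples])
    best = int(np.argmax(scores))
    return best, float(scores[best])


# Stand-in data: 100 noisy copies of a random 10-frame "video".
# A real evaluation would draw these samples from a stochastic model.
rng = np.random.default_rng(0)
ground_truth = rng.random((10, 8, 8, 3))
noise = 0.1 * rng.standard_normal((100,) + ground_truth.shape)
samples = np.clip(ground_truth + noise, 0.0, 1.0)

best_index, best_score = best_of_n(samples, ground_truth)
print(f"best sample: {best_index}, PSNR: {best_score:.2f} dB")
```

Because the best sample is chosen using the ground truth, this protocol measures whether a model's samples cover the true future. It does not measure the quality of a typical sample.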

Acknowledgments

We thank Emily Denton for providing pre-trained models and extensive and timely assistance with reproducing the SVG results, and Mohammad Babaeizadeh for providing data for comparisons with SV2P. This research was supported in part by the Army Research Office through the MAST program, the National Science Foundation through IIS-1651843 and IIS-1614653, and hardware donations from NVIDIA. Alex Lee and Chelsea Finn were also supported by the NSF GRFP. Richard Zhang was partially supported by the Adobe Research Fellowship.