Cloudera Fast Forward Labs is a machine intelligence research company.

Realistic Video Generation

Generative Adversarial Networks (GANs) wowed the world in 2014 with their
ability to generate what we considered, at the time, to be realistic images.
While these images were quite low resolution, researchers kept refining the
methods to increase image quality and even to apply the approach to other types
of data, such as text and sound.

However, until recently there has been little success in generating realistic
videos. The main problem is temporal consistency: while we can be forgiving of
a single frame and find some interpretation for its unrealistic regions, we are
adept at spotting inconsistencies in how a video progresses from frame to
frame.

For example, we can accept a strange-looking texture in the background of an
image as simply a strange-looking background. However, if that background
changes randomly from frame to frame in a video, we immediately discount the
video. It is exactly this lack of temporal consistency that has plagued
researchers trying to apply GANs to videos: each frame seemed realistic taken on
its own, but when the frames were assembled into a video, considerable
inconsistencies ruined any illusion of realism. This limited the reuse of
models that had succeeded at generating individual images, and forced
researchers to come up with new methods to deal with the temporal nature of
videos.

Recently, researchers at NVIDIA and MIT have come up with a new type of
GAN, vid2vid, which primarily addresses this problem by explicitly
modeling how things are moving within the video, in order to continue
this motion in future frames. This is done by estimating the optical flow
between frames, a classic computer vision technique for describing how pixels
move from one frame to the next, which simply had not been incorporated into
such a model until now. (In addition, they follow previous work that uses a
multi-resolution approach for generating high resolution images.)
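
To make the role of optical flow concrete, here is a minimal sketch in plain
Python. It is not the vid2vid implementation. It assumes the flow field is
already given (in a real model it would be estimated or predicted), uses
nearest-neighbour sampling on tiny grayscale frames stored as nested lists, and
the function names warp_frame and blend, as well as the mask that marks pixels
the flow cannot explain, are hypothetical choices made for illustration only.

```python
def warp_frame(prev_frame, flow):
    """Predict the next frame by moving pixels of prev_frame along the flow.

    Assumption for this sketch: flow[y][x] = (dx, dy) means the pixel now at
    (x, y) was located at (x - dx, y - dy) in the previous frame.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup, clamped so we never read off the image.
            src_x = min(max(int(round(x - dx)), 0), width - 1)
            src_y = min(max(int(round(y - dy)), 0), height - 1)
            warped[y][x] = prev_frame[src_y][src_x]
    return warped


def blend(warped, generated, mask):
    """Combine the warped frame with freshly generated pixels.

    mask is 0 where the warped pixel can be trusted and 1 where new content
    is needed (for example, regions that have just come into view).
    """
    return [
        [(1 - m) * w + m * g for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(warped, generated, mask)
    ]


# A 2x4 frame with one bright pixel, and a flow that shifts everything
# one pixel to the right.
prev_frame = [[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
flow = [[(1, 0)] * 4 for _ in range(2)]
warped = warp_frame(prev_frame, flow)

# Stand-in for pixels a generator would synthesise; the left column is
# newly revealed, so the mask asks for generated content there.
generated = [[0.5] * 4 for _ in range(2)]
mask = [[1, 0, 0, 0],
        [1, 0, 0, 0]]
next_frame = blend(warped, generated, mask)
```

Because most of the new frame is copied from the previous one along the flow,
the background cannot change arbitrarily between frames; only the masked
regions depend on newly generated pixels.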

The results are quite staggering (we highly recommend watching their release
video). With the model you can create dashboard camera footage from the initial
segmentation frame, which lets you change the type and shape of objects in the
frame by simply drawing in the corresponding color. It’s even possible to create
realistic-looking dance videos from pose information. Comparing the new method
side by side with previous methods gives a real sense of how important this
additional temporal information is for producing realistic results.

These high quality results are exciting and represent groundbreaking work in
the field of video generation. The vid2vid model itself is immediately
applicable, with uses ranging from generating synthetic training data to
creative projects.

Even more interesting is how the field as a whole will learn from this
research and start finding ways to incorporate other classic algorithms into
neural networks. Just as conv-nets explicitly encoded our two-dimensional
understanding of images into models so that they could learn to work with that
data more quickly and accurately, this method explicitly encodes our
understanding of how the frames of a video flow from one to another (although
this was much trickier to do than in the conv-net case!). We’re interested in
seeing what other algorithms will be incorporated into neural networks in this
way and what capabilities the resulting models will have.