Abstract

This work studies the problem of modeling non-linear visual processes
by learning linear generative models from observed sequences.
We propose a joint learning framework, combining a Linear Dynamic System and a Variational Autoencoder
with convolutional layers.
After discussing several conditions for linearizing neural networks,
we propose an architecture that allows Variational
Autoencoders to simultaneously learn the non-linear observation mapping
as well as the linear state transition from a sequence of observed
frames. The proposed framework is demonstrated experimentally in
three series of synthesis experiments.

While classification of images and videos with Convolutional Neural Networks (CNNs) is becoming an established practice, unsupervised learning and generative modeling remain challenging problems in deep learning.
A successful construction of a generative model of a visual process
makes it possible to generate sequences of video frames
whose appearance as well as dynamics approximately resemble the original training process without copying it. This procedure is typically referred to as video generation [1, 2] or video synthesis [3].
More technically, this means that in addition to a suitable probability model for the individual frames, a probabilistic description of the frame-to-frame transition is also necessary. Analysis and reproduction of visual processes simplify considerably if this transition can be assumed to be a linear function. For instance, linear transformations are easily invertible, and by means of spectral analysis it can be studied how successive applications of the same transformation behave in the long term.
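The long-term behavior mentioned above can be illustrated with a small numerical sketch (not from the paper; the matrix is an arbitrary example): if all eigenvalues of a linear transition lie inside the unit circle, repeated application contracts every state, and the transition is trivially invertible when the matrix has full rank.

```python
import numpy as np

# Illustrative transition matrix with spectral radius < 1 (an assumption,
# not a quantity from the paper).
A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
eigvals = np.linalg.eigvals(A)
assert np.all(np.abs(eigvals) < 1.0)  # spectral radius < 1

# Successive applications of A drive the state toward zero.
h = np.array([1.0, -1.0])
for _ in range(200):
    h = A @ h
assert np.linalg.norm(h) < 1e-6

# Invertibility: the previous state is recovered exactly.
h0 = np.array([0.3, 0.7])
h1 = A @ h0
assert np.allclose(np.linalg.solve(A, h1), h0)
```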

Unfortunately, most frame transitions in real-world visual processes are unlikely to be linear functions. Nevertheless, unsupervised learning has produced many approaches for fitting linear transition models to real-world processes, for instance by
using linear low-rank [4], or sparse approximations of the frames [5],
or applying the kernel trick to them [6].

The success of Generative Adversarial Networks (GAN) [7] and
Variational Autoencoders (VAE) [8] has led to an increased
interest in deep generative learning, and it seems natural to apply such techniques to sequential processes.
We approach this idea from the perspective of linearization in order to keep the model as simple as possible.
Just as physicists transform non-linear differential equations into linear ones by means of an appropriate change of variables, our approach is to learn a latent representation of visual processes such that the latent state-to-state transition can be described by a linear model. To this end, we jointly learn a non-linear observation function and a linear state-transition function by means of a modified VAE.

Similarly to our work, the authors of [9] combine Linear Dynamic Systems (LDSs) with VAEs.
However, the focus of their work is on control rather than on synthesis. Furthermore, their model is locally linear and the transition distribution is modeled in the variational bound, whereas we model it as a separate layer. This is also the main difference to
the work in [10], where VAEs are combined with linear dynamic models for forecasting images in video sequences, and to [11], in which VAEs are used as Kalman Filters.

The work [12] deals with linearizing transformations under uncertainty via neural networks. It resembles this work in that it also focuses on representation learning rather than on a particular application. However, unlike our work, it does not employ VAEs. Theoretical groundwork regarding learned visual transformations has been laid in [13, 14, 15] and [16]. More generally, the synthesis of video dynamics by means of neural networks has been discussed, among others, in [17] and [18].

Finally, the core contribution of this work is a combination of neural networks with Markov processes. This has been the subject of many works in the recent past. For a broad overview of results in this field, the reader is referred to Chapter 20 of [19].

3.1 Dynamic Systems

Dynamic textures [4] have popularized LDSs in the modeling of visual processes. Typically, an LDS has the following form

\[
h_{t+1} = A h_t + v_t, \qquad y_t = \bar{y} + C h_t + w_t, \tag{1}
\]

where $h_t \in \mathbb{R}^n$ is the low-dimensional state-space variable at time $t$, $A \in \mathbb{R}^{n \times n}$ the state-transition matrix, $y_t \in \mathbb{R}^d$ the observation at time $t$, and $C \in \mathbb{R}^{d \times n}$ the observation matrix. The vector $\bar{y} \in \mathbb{R}^d$ represents a constant offset in the observation space. The noise terms $v_t$ and $w_t$ are modeled as zero-mean i.i.d. Gaussian noise and are independent of $h_t$.
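A minimal simulation of the LDS in Eq. (1) can be sketched as follows; the dimensions, matrices, offset, and noise scales are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 4, 16, 100                 # state dim, observation dim, length
A = 0.95 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable transition
C = rng.standard_normal((d, n))      # observation matrix
y_bar = rng.standard_normal(d)       # constant observation offset

h = np.zeros(n)
observations = []
for t in range(T):
    h = A @ h + 0.1 * rng.standard_normal(n)          # h_{t+1} = A h_t + v_t
    y = y_bar + C @ h + 0.01 * rng.standard_normal(d) # y_t = y_bar + C h_t + w_t
    observations.append(y)

Y = np.stack(observations)
assert Y.shape == (T, d)
```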

The simplicity of the state transition in the model (1)
enables straightforward prediction, analysis, and synthesis
of observations.
Since real-world visual processes are often highly non-linear, it is
therefore of great interest to find a model that linearizes the underlying
process, so that in some latent state-space representation
the state transition is both linear and Gaussian,
as depicted in Eq. (1).
Specifically, this work focuses on the following non-linear dynamic system model,
i.e., a linear state transition and a non-linear observation mapping

\[
h_{t+1} = A h_t + v_t, \qquad y_t = C(h_t) + w_t, \tag{2}
\]

where $C : \mathbb{R}^n \to \Omega \subset \mathbb{R}^d$ is assumed to be non-linear in the rest of the paper.
For algorithmic reasons, we assume that $w_t$ is drawn from an isotropic Gaussian distribution, i.e.,

\[
w_t \sim \mathcal{N}(\,\cdot \mid 0, \sigma_w^2 I\,). \tag{3}
\]

Note that the model in Eq. (2) is not unique with respect to
changes of basis in the state space [20].
Specifically, let $P \in \mathbb{R}^{n \times n}$ be a full-rank matrix. Given one visual process described by (2) with
$v_t \sim \mathcal{N}(\,\cdot \mid 0, \Sigma_v\,)$,
one can define an equivalent system via the substitutions

\[
h_t \to P^{-1} h_t, \quad A \to P^{-1} A P, \quad \Sigma_v \to P^{-1} \Sigma_v P^{-\top}, \quad C \to C \circ P. \tag{6}
\]

If $C$ is implemented via a neural network, we can ensure that it accounts for a possible change of basis.
Therefore, without loss of generality, we propose the following assumption on the latent samples $h_t$.

Assumption 1.

The latent states are standard normally distributed, i.e., $h_t \sim \mathcal{N}(\,\cdot \mid 0, I\,)$.

In order to make sure that the latent states $h_t$ remain Gaussian
in sequential synthesis scenarios with $\mathbb{E}[h_{t+1} h_{t+1}^\top] = I$,
we just need to ensure that the process noise is zero-mean and has the covariance matrix $\Sigma_v = I - A A^\top$, since then $\mathbb{E}[h_{t+1} h_{t+1}^\top] = A A^\top + \Sigma_v = I$.
◊
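The covariance condition above can be verified numerically. The following sketch assumes a transition matrix $A$ with spectral norm below one, so that $I - A A^\top$ is a valid (positive semi-definite) covariance; it propagates the state covariance forward and checks that it stays at the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
A *= 0.9 / np.linalg.norm(A, 2)    # ensure spectral norm < 1
Sigma_v = np.eye(n) - A @ A.T      # process-noise covariance I - A A^T

# Propagate the state covariance: Sigma_{t+1} = A Sigma_t A^T + Sigma_v.
Sigma = np.eye(n)                  # h_0 ~ N(0, I)
for _ in range(50):
    Sigma = A @ Sigma @ A.T + Sigma_v
    assert np.allclose(Sigma, np.eye(n))  # stays the identity at every step
```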

3.2 Linearizability of Non-linear Visual Transformations

The purpose of this subsection is to justify, from a conceptual point of view, our aim to learn a linear state-transition model (2), and to provide cues on how to choose a neural network architecture for linearizing visual transformations.

In the model description (2), the observation mapping
is modeled by a non-linear function $C : \mathbb{R}^n \to \Omega$.
In what follows, we aim to show the feasibility of the linear model for the
state transitions.
Let us consider a visual transformation of observations in $\Omega$
given by $\varphi : \Omega \to \Omega$.
Such transformations are in general very difficult to model, and hence
it is very unlikely that one can find a global observation mapping $C$ such that the
respective transformation in the latent space $\mathbb{R}^n$ can
be exactly modeled by a linear transition.
We therefore first formalize the notion of local linearization of a non-linear
self map.

Definition 1.

Let $\Omega \subset \mathbb{R}^d$, let $\varphi : \Omega \to \Omega$ be a continuous self map,
and let $\Gamma : \Omega \to \mathbb{R}^n$ be a local diffeomorphism at all $y \in \Omega$.
The map $\Gamma$ is said to be a local linearizer
of $\varphi$ at $y^* \in \Omega$ if there exists a
matrix $\Phi \in \mathbb{R}^{n \times n}$ such that, with $h^* := \Gamma(y^*)$, the following equality holds:

\[
\lim_{\|h^*\| \to 0} \frac{\bigl\|(\Gamma \circ \varphi \circ \Gamma^{-1})(h^*) - \Phi h^*\bigr\|}{\|h^*\|} = 0. \tag{9}
\]

Here, the map $\Gamma$ behaves as a chart of the data manifold $\Omega$.
If $\Gamma(y^*) = 0$ and $y^*$ is a fixed point of $\varphi$,
then the map $\varphi$ is obviously linearized by $\Gamma$ at $y^*$.
In general, the map $\varphi$ cannot be guaranteed to have a fixed point.
Nevertheless, motivated by Brouwer's fixed-point theorem, we propose the following
assumption to ensure local linearizability of $\varphi$.
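A toy scalar illustration of Definition 1 (our own example, not from the paper): the map $\varphi(y) = y + y^2$ has the fixed point $y^* = 0$, and with the identity chart $\Gamma = \mathrm{id}$ the derivative $\Phi = \varphi'(0) = 1$ makes the relative linearization error vanish as the state approaches the fixed point.

```python
import numpy as np

# phi(y) = y + y^2 has fixed point y* = 0; with Gamma = id, the best
# linear model at 0 is Phi = phi'(0) = 1.
phi = lambda y: y + y**2
Phi = 1.0

# The relative error |phi(h) - Phi*h| / |h| equals |h| here and
# therefore shrinks as h -> 0, as demanded by Eq. (9).
errors = [abs(phi(h) - Phi * h) / abs(h) for h in (0.1, 0.01, 0.001)]

assert errors[0] > errors[1] > errors[2]
assert errors[2] < 1e-2
```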

Assumption 2.

The set $\Omega \subset \mathbb{R}^d$ is compact and convex, and
$\varphi : \Omega \to \Omega$ is a continuous self map, so that,
by Brouwer's fixed-point theorem, $\varphi$ has at least
one fixed point $y^*$.

This assumption is easily justified by applications in image/video
processing, where images lie in some hyper-cube, e.g. $[0, 255]^d$.
It bears some resemblance to control theory, where linearization of
non-linear dynamical systems can be carried out around equilibrium points [21].
The following proposition thus assumes the existence of $y^*$ in order to characterize neural networks that locally linearize transformations.

Proposition 1.

Let $\Omega \subset \mathbb{R}^d$, let $\varphi : \Omega \to \Omega$ be a continuous self map,
and let $\Gamma : \Omega \to \mathbb{R}^n$ be a local diffeomorphism at all $y \in \Omega$.
If $y^*$ is a fixed point of $\varphi$, then
the map

\[
\Gamma' : \mathbb{R}^n \to \Omega, \qquad h \mapsto \Gamma^{-1}\bigl(h - \Gamma(y^*)\bigr) \tag{10}
\]

locally linearizes $\varphi$.

Proof.

Since $\Gamma$ is a diffeomorphism, $\phi := \Gamma \circ \varphi \circ \Gamma^{-1}$ is differentiable. We denote by $\Phi$ the Jacobian matrix of $\phi$ at $\Gamma(y^*)$, and Taylor's theorem yields

\[
\lim_{\|h - \Gamma(y^*)\| \to 0} \frac{\bigl\|\phi(h) - \phi(\Gamma(y^*)) - \Phi\,(h - \Gamma(y^*))\bigr\|}{\|h - \Gamma(y^*)\|} = 0. \tag{11}
\]

Knowing that $y^*$ is a fixed point of $\varphi$, we have $\phi(\Gamma(y^*)) = \Gamma(\varphi(y^*)) = \Gamma(y^*)$, and substituting this into (11) shows that the shifted chart from (10) linearizes $\varphi$ around $y^*$.

The error term in (11) is driven by the curvature of ϕ around Γ(y∗). Incidentally, the authors of [12] also achieve linearization by penalizing curvature.

Essentially, Proposition 1 suggests including a bias in
the first layer of a linearizing neural network that tries to implement $C := \Gamma^{-1}$.
In general, however, this is not enough to achieve a low linearization error globally. In fact, a neural network consisting of one single affine layer already suffices to locally linearize an appropriate transformation $\varphi$.

3.3 Linearization via CNNs

In this subsection, we discuss several additional heuristics for linearization
and argue that convolutional layers are a suitable choice.

We start by observing that CNNs are capable of representing data in a way that is almost invariant to certain classes of transformations $\varphi : \Omega \to \Omega$ of the data [22, 23]. In other words, a transformation applied to a data sample does not greatly displace its representation.
For illustration purposes, let $y \in \Omega$ denote an image depicting an object,
and $\varphi(y) \in \Omega$ an image depicting the same object, deformed by applying certain forces to it. Due to the curse of dimensionality, the application of $\varphi$ can lead to a significant displacement of the pixel representation in Euclidean space. However, the analysis of simplified CNNs with fixed filter weights and absolute-value or ReLU activation functions, so-called Scattering transforms, has shown that it is possible to find a representation $\Gamma : \Omega \to \mathbb{R}^n$ that is contracting with respect to spatial deformations of images $y \in \Omega$. More specifically, in [22] a deformation $\varphi$ is described as a warping of spatial coordinates. For such deformations, a bound $\epsilon(\varphi)$ was derived such that

\[
\|\Gamma(y) - \Gamma(\varphi(y))\| \le \epsilon(\varphi)\, \|y\| \tag{15}
\]

holds if $\Gamma$ is implemented by a Scattering transform. Even though the discussion in [22] is limited to deformations, it is generally assumed in [24] that approximately invariant representations with respect to much broader classes of transformations can be learned by CNNs. The smaller the contraction constant $\epsilon(\varphi)$, the more regularity is introduced into the data with respect to the linearizability of $\varphi$.
To see this, we introduce the minimal expected linearization error $q$, which measures how well a transformation $\varphi$ can be modeled by multiplication with a matrix $A$:

\[
q(\Gamma, \varphi) = \min_{A \in \mathbb{R}^{n \times n}} \mathbb{E}_Y\bigl[\|A\,\Gamma(Y) - \Gamma(\varphi(Y))\|^2\bigr]. \tag{16}
\]

Choosing $A = I$ in (16) immediately yields the inequality

\[
q(\Gamma, \varphi) \le \mathbb{E}_Y\bigl[\|\Gamma(Y) - \Gamma(\varphi(Y))\|^2\bigr] \le \epsilon^2(\varphi)\, \mathbb{E}_Y\bigl[\|Y\|^2\bigr]. \tag{17}
\]

Note, however, that the measure $q$ does not account for how much $\Gamma$ expands or shrinks its input. The contraction constant $\epsilon(\varphi)$, for instance, was derived for approximately norm-preserving functions $\Gamma$ [22].
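The minimum in (16) can be estimated from samples by ordinary least squares. The following sketch uses synthetic stand-ins for the samples of $\Gamma(Y)$ and $\Gamma(\varphi(Y))$, constructed so that linearizability holds exactly by design; the matrices and dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 500

# Construct the pair so that Gamma(phi(y)) = M @ Gamma(y) holds exactly:
# the minimal linearization error q should then be (numerically) zero.
M = rng.standard_normal((n, n))
G = rng.standard_normal((N, n))    # stands in for samples of Gamma(Y)
G_next = G @ M.T                   # stands in for samples of Gamma(phi(Y))

# q(Gamma, phi) = min_A E ||A Gamma(Y) - Gamma(phi(Y))||^2, solved by
# least squares over the sample matrix (row-stacking gives X = A^T).
X, *_ = np.linalg.lstsq(G, G_next, rcond=None)
A_hat = X.T
q = np.mean(np.sum((G @ A_hat.T - G_next) ** 2, axis=1))

assert q < 1e-10
assert np.allclose(A_hat, M)
```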

To summarize, we can hope to linearize broad classes of transformations given the right neural network architecture. In particular, the preceding discussion suggests autoencoders with (almost everywhere) differentiable activation functions to account for the diffeomorphism property in Proposition 1, and input layers with bias to account for (10). Due to the contraction properties expected from CNNs, it seems natural to employ convolutional layers and ReLU activations for both the encoder and the decoder.

So far, we have discussed heuristics for the choice of architecture, but an advantage of neural networks is that we can make design goals like linearizability explicit by formulating an appropriate loss function. We tackle this problem from a stochastic perspective by constraining the joint probability distribution of succeeding samples $h_t, h_{t+1}$ in the latent space.

4.1 Variational Autoencoders: Review

According to Assumption 1, the observation mapping $C$ transforms a standard normal distribution into the observation distribution $p_Y$, where for a latent sample $h$, the expected observation is $C(h)$ and the corresponding conditional probability distribution $p_{Y \mid H = h}$ is given by the noise model (3).

Conveniently, VAEs provide a framework to do just that. Let $X$ be a standard normally distributed random variable. Given a set $\{y_1, \ldots, y_T\}$ of realizations of a random variable $Y$ with distribution $p_Y$, the objective of the VAE is to maximize the log-likelihood function

\[
L(y) = \ln p_Y(y) \tag{18}
\]

by learning a parametrized function $f_\theta$ that approximately transforms $X$ to $Y$. In accordance with (3), we fix the following assumption:

\[
p_{Y \mid X = x} = \mathcal{N}\bigl(\,\cdot \mid f_\theta(x), \sigma_w^2 I\,\bigr). \tag{19}
\]

Then, taking the expectation over $X$ yields

\[
L(y, \theta) = \ln \mathbb{E}_X\bigl[\mathcal{N}(y \mid f_\theta(X), \sigma_w^2 I)\bigr]. \tag{20}
\]

The parameter θ should thus maximize the term EY[L(Y,θ)].

However, directly maximizing the expected value of (20) by standard Monte Carlo methods is computationally infeasible [25]. Fortunately, variational inference provides a lower bound on the likelihood function that can be optimized by stochastic gradient descent. Let $g_y : \mathbb{R}^n \to \mathbb{R}^n$ be a parametrized, measurable function that maps from and to the codomain of $X$. Let the random variable

\[
Z := g_y(X) \tag{21}
\]

have the probability density function pZ. Let us consider the expression

\[
E(y) = \ln p_Y(y) - D\bigl(p_Z \,\|\, p_{X \mid Y = y}\bigr), \tag{22}
\]

with D denoting the Kullback-Leibler Divergence (KLD).
Since the KLD is always nonnegative, the following inequality holds:

\[
E(y) \le L(y). \tag{23}
\]

We can rewrite the KLD as an expected value. Since $y$ is not a random variable, it is not affected by the expectation,
and we reformulate (22) as

It is then straightforward to compute the gradient of the squared norm in $\mathbb{E}_{Z,Y}[\|Y - f_\theta(Z)\|^2]$.
To estimate the expected value, we draw one sample $y$ from $p_Y$ and several samples from $p_Z$ by applying $g_{y,\vartheta}$ to samples of standard normal noise. This is known as the reparametrization trick.
The KLD term in $\mathbb{E}_Y[D(p_Z \,\|\, p_X)]$ is slightly more involved.
Since the distribution $p_Z$ depends on $g_{y,\vartheta}$, in order to make the task computationally tractable,
$g_{y,\vartheta}$ is modeled as an affine function of the form

\[
g_{y,\vartheta}(x) = \operatorname{diag}\bigl(a(y, \vartheta)\bigr)\, x + b(y, \vartheta). \tag{26}
\]

In technical terms, this means the encoder part of Fig. 1 is a subnetwork that maps a training sample $y \in \mathbb{R}^d$ to the two vectors $a \in \mathbb{R}^n$ and $b \in \mathbb{R}^n$. The random variable $Z$ is thus described by the distribution

\[
Z \sim \mathcal{N}\bigl(\,\cdot \mid b(y, \vartheta), \operatorname{diag}(a(y, \vartheta))^2\,\bigr).
\]

In this way, stochastic gradient descent on (25) can be applied via backpropagation.
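The reparametrization trick for the affine encoder (26) can be sketched as follows. The encoder outputs $a$ and $b$ are illustrative values here, and the closed-form KLD used is the standard VAE expression for a diagonal Gaussian against a standard normal (the paper's own Eq. (28) is not reproduced in this excerpt).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
# Illustrative encoder outputs for one sample y: scale a and shift b.
a = rng.uniform(0.5, 1.5, size=n)
b = rng.standard_normal(n)

# Reparametrization trick: z = diag(a) x + b with x ~ N(0, I), so that
# Z ~ N(b, diag(a)^2) and gradients can flow through a and b.
x = rng.standard_normal((1000, n))
z = a * x + b
assert z.shape == (1000, n)

# Standard closed-form KLD of N(b, diag(a)^2) from N(0, I):
kld = 0.5 * np.sum(a**2 + b**2 - 1.0 - np.log(a**2))
assert kld >= 0.0
```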

4.2 Markov Assumptions

We want to model a sequential, stochastic visual process (2) such that $C$ is performed by the decoder part of a VAE. Let us assume that we are given a sequence $\{y_1, \ldots, y_T\} \subset \mathbb{R}^d$ of vectorized video frames. Here, $C$ is carried out by a neural network $C_\eta$ described by the trainable parameter tuple $\eta$. If we neglect the temporal order of the frames, we can in principle train a VAE to generate frames similar to $y_t$, because the latent variables $h_t$ follow the standard normal distribution. However, crucial to synthesizing a visual process is not only the capability to create still-image frames, but also the capability to create them according to a temporal model. First and foremost, this requires a way to infer $A$ in addition to $C_\eta$. The easiest approach is to first learn $C_\eta$ by training the VAE and then infer $A$ via squared-error minimization as

\[
\operatorname*{arg\,min}_{\tilde{A} \in \mathbb{R}^{n \times n}} \; \sum_{t=1}^{T-1} \bigl\|h_{t+1} - \tilde{A} h_t\bigr\|_2^2. \tag{29}
\]

Such an approach clearly has the advantage of simplicity, but given the high capacity of trainable neural networks, it is more elegant to learn $C_\eta$ and $A$ simultaneously. By doing so, we force the latent variables to fit a linear transition model already during training, instead of fitting a linear state-transition model to a sequence of previously learned latent variables.
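The second stage of the two-stage baseline in (29) reduces to a least-squares problem over the latent sequence. A sketch with synthetic stand-ins for the learned latent states (dimensions, noise level, and the ground-truth transition are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 6, 2000
A_true = rng.standard_normal((n, n))
A_true *= 0.8 / np.linalg.norm(A_true, 2)  # stable ground-truth transition

# Synthetic latent trajectory standing in for states learned by the VAE.
H = np.zeros((T, n))
H[0] = rng.standard_normal(n)
for t in range(T - 1):
    H[t + 1] = A_true @ H[t] + 0.01 * rng.standard_normal(n)

# argmin_A sum_t ||h_{t+1} - A h_t||^2 solved by least squares
# (row-stacking: H[:-1] @ X ~= H[1:] with X = A^T).
X, *_ = np.linalg.lstsq(H[:-1], H[1:], rcond=None)
A_hat = X.T
assert np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true) < 0.25
```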

The temporal model at hand is a first-order Markov process. First, the data needs to be adapted to the problem. We thus formulate our problem setting as a generative model for $\{s_1, \ldots, s_{T-1}\}$, where each observation

\[
s_t = \begin{bmatrix} y_t \\ y_{t+1} \end{bmatrix} \tag{30}
\]

contains two succeeding frames.

A sample $s_t$ is a realization of the random variable $S \in \mathbb{R}^{2d}$ and is composed of two subvectors with the same statistical properties. This means that the distributions of the upper and the lower half subvectors of $S$, i.e., of the current and the predicted frame, must be identical. Specifically, following the discussion of Section 4.1, we assume that $S$ is driven by a latent variable $\tilde{H} \in \mathbb{R}^{2n}$ and that the conditional distribution $p_{S \mid \tilde{H}}$ has the form

\[
p_{S \mid \tilde{H} = [h_-^\top\, h_+^\top]^\top}(s) = \mathcal{N}\Bigl(s \,\Big|\, \bigl[C_\eta(h_-)^\top \; C_\eta(h_+)^\top\bigr]^\top,\; \sigma_w^2 I\Bigr), \tag{31}
\]

where $\sigma_w^2$ denotes the variance of the observation noise $w_t$ in (2). The subvectors $h_-, h_+ \in \mathbb{R}^n$ stand for the latent variables, i.e., the state-space vectors, belonging to the upper and lower halves of $s$. As agreed before, their marginal distributions are standard normal. However, their joint distribution is not, since the choice of $h_+$ depends on $h_-$. In fact, from the previous section, we can deduce the joint probability distribution as

\[
p_{\tilde{H}}\!\left(\begin{bmatrix} h_- \\ h_+ \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} h_- \\ h_+ \end{bmatrix} \,\middle|\, 0,\; \begin{bmatrix} I & A^\top \\ A & I \end{bmatrix}\right). \tag{32}
\]

This contradicts the premise of the VAE, which models latent variables by standard normal distributions. However, if we adapt the model of the classical VAE such that the decoder part in Fig. 1 can be fed with samples drawn from the distribution (32), and make the parameter $A$ trainable, we can simultaneously learn the observation and the dynamic state transition of a visual process.

4.3 A Dynamic VAE

In this section, we propose a neural network architecture that produces samples similar to $s_t$ from realizations of the distribution (32). We achieve this by modeling the linear dynamics with an additional layer between the latent-space layer and the decoder. Let us refer to such a layer as the dynamic layer and to the architecture in its entirety as a Dynamic VAE. The purpose of the dynamic layer is to map the random variable $X \in \mathbb{R}^{2n}$, which has a standard normal distribution, to a random variable $\tilde{H}$ with the distribution indicated in (32). Let us denote by $X_-, X_+ \in \mathbb{R}^n$ the upper and lower halves of $X \in \mathbb{R}^{2n}$. Then such a mapping can be achieved by a function $h : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ of the form

\[
\begin{bmatrix} h_- \\ h_+ \end{bmatrix} = h\!\left(\begin{bmatrix} x_- \\ x_+ \end{bmatrix}\right) = \begin{bmatrix} x_- \\ A x_- + B x_+ \end{bmatrix}, \tag{33}
\]

where B is a matrix such that BB⊤=I−AA⊤. Fig. 2 depicts the resulting architecture.
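The dynamic layer in Eq. (33) can be sketched as follows. One valid choice for $B$ with $B B^\top = I - A A^\top$ is a Cholesky factor, which assumes the spectral norm of $A$ is below one so that $I - A A^\top$ is positive definite; the matrix $A$ here is an arbitrary illustrative example rather than a trained parameter.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
A *= 0.7 / np.linalg.norm(A, 2)   # ensures I - A A^T is positive definite

# One valid choice for B with B B^T = I - A A^T: a Cholesky factor.
B = np.linalg.cholesky(np.eye(n) - A @ A.T)
assert np.allclose(B @ B.T, np.eye(n) - A @ A.T)

def dynamic_layer(x_minus, x_plus):
    """Map standard normal noise (x_-, x_+) to (h_-, h_+) as in Eq. (33)."""
    h_minus = x_minus
    h_plus = A @ x_minus + B @ x_plus
    return h_minus, h_plus

h_m, h_p = dynamic_layer(rng.standard_normal(n), rng.standard_normal(n))
assert h_m.shape == h_p.shape == (n,)

# Both halves are marginally standard normal: Cov(h_+) = A A^T + B B^T = I.
assert np.allclose(A @ A.T + B @ B.T, np.eye(n))
```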

Figure 2: VAE with dynamic layer. (The encoder $g_{s,\vartheta}(\cdot)$ maps $s$ to standard normal noise $x_-, x_+ \sim \mathcal{N}(\,\cdot \mid 0, I\,)$; the dynamic layer computes $h_- = x_-$ and $h_+ = A x_- + B x_+$; the decoder $f_\theta(\cdot)$ with $\theta = (\eta, A, B)$ applies $C_\eta(\cdot)$ to both halves.)

In order to guarantee stationarity, we need to ensure that condition (8) is satisfied. This can be done by including a regularizer. The loss function for the Dynamic VAE parameters $\theta = (\eta, A, B)$ and $\vartheta$ is thus defined as

\[
c(\theta, \vartheta) = \frac{1}{2 \sigma_w^2}\, \mathbb{E}_{Z,S}\bigl[\|S - f_\theta(Z)\|^2\bigr] + \mathbb{E}_S\bigl[D(p_Z \,\|\, p_X)\bigr] + \lambda\, \bigl\|I - A A^\top - B B^\top\bigr\|_F^2, \tag{34}
\]

where $\lambda > 0$ should be chosen large enough to keep the regularizer $\|I - A A^\top - B B^\top\|_F^2$ close to $0$.
The KLD term depends on $\vartheta$ via (28); note, however, that $y$ is to be replaced by $s$ in this context.

5.1 Overview

The experiments treated three different kinds of visual processes. In each experiment, the Dynamic VAE was trained on a sequence of frames. Afterwards, each sequence was generated from the trained model. Latent states of dimension $n = 10$ were synthesized according to the rule

\[
h_{t+1} = A h_t + B v_t, \qquad y_t = C_\eta(h_t), \tag{35}
\]

where the $v_t$ were drawn from a standard normal distribution and the initial state $h_0$ was inferred from the expected value of the conditional latent distribution of a test frame pair $s_0$. The frame pair $s_0$ was excluded from the training set in order to improve the significance of the experimental outcome with respect to how well the model generalizes.
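The synthesis rule (35) can be sketched as a simple generation loop. The decoder here is a hypothetical placeholder standing in for the trained network $C_\eta$, and $A$, $B$ are illustrative matrices rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, T = 10, 64, 30               # n = 10 as in the experiments; d, T illustrative
A = rng.standard_normal((n, n))
A *= 0.8 / np.linalg.norm(A, 2)
B = np.linalg.cholesky(np.eye(n) - A @ A.T)

W = rng.standard_normal((d, n))    # weights of a placeholder decoder
C_eta = lambda h: np.tanh(W @ h)   # stands in for the trained decoder C_eta

h = rng.standard_normal(n)         # stands in for the inferred initial state h_0
frames = []
for t in range(T):
    h = A @ h + B @ rng.standard_normal(n)  # h_{t+1} = A h_t + B v_t
    frames.append(C_eta(h))                 # y_t = C_eta(h_t)

assert np.stack(frames).shape == (T, d)
```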

The observation network $C_\eta$ was implemented via a fully-connected layer followed by three convolutional layers with ReLU activations and nearest-neighbor upsampling. The number of channels was decreased by a factor of four with each layer, so that the number of pixels in each hidden layer remains roughly unchanged. The encoder mirrored this structure, with the same number of convolutional layers, ReLU activations, an increasing number of channels, and max-pooling layers. The same filter size was used for each convolutional layer. All experiments were implemented in Python 3.6 with PyTorch 0.1.12 on CUDA 8.0. The choice of parameters for each experiment is described in
Table 1.
The code is publicly available [26].

Experiment   σ_w²    λ       Filter size
MNIST        1.5     100.0   3×3
UCLA-50      0.31    100.0   4×4
NORB         5.0     100.0   4×4

Table 1: Parameter settings

Evaluating generative models is particularly challenging. This is due to the very nature of the problem, which demands measuring the similarity between the probability distribution underlying the training data and the probability distribution that generated the test data. Neither of the two is available in closed form; both can only be estimated from a limited number of samples in a very high-dimensional space. It is thus an established practice to evaluate generative models by visual inspection of the generated samples [19]. However, it is important to consider overfitting, which can lead to seemingly very realistic samples. The following experiments have the purpose of demonstrating the principal capability of the proposed method to infer a linear model from a highly non-linear process by generating sequences from Gaussian noise. We therefore acknowledge that our choice of hyperparameters is possibly suboptimal and that architectures optimized for a specific task could lead to visually more appealing results. Due to space constraints, only a few experimental outcomes are shown in each subsection. The supplementary material to this paper contains synthesis results for each performed experiment.

5.2 Learning to Count

In the first series of experiments, we trained our architecture with sequences of images from the MNIST data set, using one sequence per experiment. The aim was to learn a generative, sequential model that can produce repeating sequences of digits. For instance, in the first experiment, the frame transition to be linearized mapped a 1 to a 2, a 2 to a 3, a 3 to a 4, and a 4 back to a 1. Each training sequence contained 7999 MNIST image pairs.
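A toy illustration of why such cyclic digit sequences admit an exactly linear latent transition (our own example, not the paper's construction): a $k$-periodic sequence can be modeled by any matrix with $A^k = I$, e.g. a planar rotation by $2\pi/k$.

```python
import numpy as np

# A 4-periodic sequence such as 1,2,3,4,1,2,... admits an exact linear
# latent transition A with A^4 = I, e.g. a rotation by 90 degrees.
theta = np.pi / 2
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

h = np.array([1.0, 0.0])           # latent code of the first symbol
states = [h]
for _ in range(4):
    h = A @ h
    states.append(h)

assert np.allclose(np.linalg.matrix_power(A, 4), np.eye(2))
assert np.allclose(states[4], states[0])   # the cycle closes after 4 steps
```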

Fig. 3 visualizes the synthesis of the sequences 12341234… and 67896789… in comparison to the result of a purely linear model as described in [4]. The Dynamic VAE did well in synthesizing digit sequences of length 4 or shorter. Longer sequences were more challenging, as Fig. 4 shows. While some sequences, like 34567… and 45678…, could be learned sufficiently well, other sequences, like 12345…, appeared to yield non-stationary systems, or, like 23456…, were too unpredictable for the Dynamic VAE.

5.3 Dynamic Textures

The second series of experiments focused on the synthesis of dynamic textures. In each experiment, the Dynamic VAE was trained with one class of dynamic texture from the cropped UCLA-50 database [27, 28].

Fig. 5 depicts the synthesis results for the dynamic texture wfalls-c. In general, we observed that the synthesis of predictable sequences, e.g. oscillations or cyclic phenomena, produces realistic results. Chaotic textures yielded some frames that looked artificial. This could be observed, for instance, in the synthesis of the candle dynamic texture.

5.4 Rotating Objects

The Small NORB [29] dataset consists of pictures of different miniature objects taken under varying lighting conditions, elevation angles, and azimuthal angles. One object at a time was used for training. We trained our model to linearize a counterclockwise azimuthal rotation by 20°. Since the Small NORB dataset contains little variability apart from the intentional one, we decided to exclude one configuration of lighting conditions and elevation angle from the training data and to use the contained sequence of azimuthal positions as ground truth for our experiment. Generally, the rotation could be reproduced well by the linear state-transition model, except for category 1, which contains human figures.
Fig. 6 depicts the Dynamic VAE synthesis of a rotating horse compared to a linear synthesis [4].

The synthesized angle is slightly larger than 20°, since the columns are not aligned. The model seems to be confused by diametrical angles. For instance, at certain positions in Fig. 6, it becomes indeterminable whether the horse faces towards or away from the observer, which leads to a skip of 180° in the following frames.

This work presented an approach to infer linear models of visual processes by means of Variational Autoencoders. To this end, the classical VAE model was modified to include an additional layer that models the latent dynamics of the visual process. The capability of the proposed model was demonstrated in three series of synthesis experiments.
Additionally, the aim of this work was to develop a notion of linearizability and of its implications for the choice of neural network architectures. While this yielded first conceptual results, we acknowledge that the theoretical analysis of this matter has room for improvement.
In future work, we therefore plan to gain further insights into the theoretical concept of linearizability, but also to improve the architecture to handle more complex data.