It has long been conjectured that hypothesis spaces suitable for data that is
compositional in nature, such as text or images, may be more efficiently
represented with deep hierarchical networks than with shallow ones. Despite the
vast empirical evidence supporting this belief, theoretical justifications to
date are limited. In particular, they do not account for the locality, sharing
and pooling constructs of convolutional networks, the most successful deep
learning architecture to date. In this work we derive a deep network
architecture based on arithmetic circuits that inherently employs locality,
sharing and pooling. An equivalence between the networks and hierarchical
tensor factorizations is established. We show that a shallow network
corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to
Hierarchical Tucker decomposition. Using tools from measure theory and matrix
algebra, we prove that, besides a negligible set, all functions that can be
implemented by a deep network of polynomial size require exponential size in
order to be realized (or even approximated) by a shallow network. Since
log-space computation transforms our networks into SimNets, the result applies
directly to a deep learning architecture demonstrating promising empirical
performance. The construction and theory developed in this paper shed new light
on various practices and ideas employed by the deep learning community.
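
As a hedged illustration of the stated equivalence (the notation below is ours
and need not match the paper's exactly), the functions realized by such
networks can be written as
\[ h_y(x_1,\dots,x_N) \;=\; \sum_{d_1,\dots,d_N} \mathcal{A}^y_{d_1,\dots,d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i), \]
where a shallow network corresponds to a CP decomposition of the coefficient
tensor,
\[ \mathcal{A}^y \;=\; \sum_{z=1}^{Z} a^y_z \, \mathbf{a}^{z,1} \otimes \mathbf{a}^{z,2} \otimes \cdots \otimes \mathbf{a}^{z,N}, \]
while a deep network constructs \(\mathcal{A}^y\) recursively, pairing modes
two at a time as in the Hierarchical Tucker format; the number of terms \(Z\)
plays the role of network width.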

Convolutional rectifier networks, i.e., convolutional neural networks with
rectified linear activation and max or average pooling, are the cornerstone of
modern deep learning. However, despite their wide use and success, our
theoretical understanding of the expressive properties that drive these
networks is partial at best. On the other hand, we have a much firmer grasp of
these issues in the world of arithmetic circuits. Specifically, it is known
that convolutional arithmetic circuits possess the property of "complete depth
efficiency", meaning that besides a negligible set, all functions that can be
implemented by a deep network of polynomial size, require exponential size in
order to be implemented (or even approximated) by a shallow network. In this
paper we describe a construction based on generalized tensor decompositions,
that transforms convolutional arithmetic circuits into convolutional rectifier
networks. We then use mathematical tools from the world of arithmetic
circuits to prove new results. First, we show that convolutional rectifier
networks are universal with max pooling but not with average pooling. Second,
and more importantly, we show that depth efficiency is weaker with
convolutional rectifier networks than it is with convolutional arithmetic
circuits. This leads us to believe that developing effective methods for
training convolutional arithmetic circuits, thereby fulfilling their expressive
potential, may give rise to a deep learning architecture that is provably
superior to convolutional rectifier networks but has so far been overlooked by
practitioners.
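
A minimal sketch of the idea behind such a transformation (our paraphrase,
stated under the assumption that activation and pooling can be folded into a
single binary operator): replace the multiplication underlying the tensor
product by an activation-pooling operator
\[ \rho_{\sigma,P}(a,b) \;=\; P\big(\sigma(a),\sigma(b)\big). \]
With linear activation \(\sigma(a)=a\) and product pooling this reduces to
\(\rho(a,b)=ab\), recovering a convolutional arithmetic circuit; with ReLU
activation \(\sigma(a)=\max\{a,0\}\) and max pooling it becomes
\(\rho(a,b)=\max\{a,b,0\}\), yielding a convolutional rectifier network.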

Our formal understanding of the inductive bias that drives the success of
convolutional networks on computer vision tasks is limited. In particular, it
is unclear what makes hypothesis spaces born from convolution and pooling
operations so suitable for natural images. In this paper we study the ability
of convolutional arithmetic circuits to model correlations among regions of
their input. Correlations are formalized through the notion of separation rank,
which, for a given input partition, measures how far a function is from being
separable. We show that a polynomially sized deep network supports
exponentially high separation ranks for certain input partitions, while being
limited to polynomial separation ranks for others. The network's pooling
geometry effectively determines which input partitions are favored, and thus
as a means for controlling the inductive bias. Contiguous pooling windows as
commonly employed in practice favor interleaved partitions over coarse ones,
orienting the inductive bias towards the statistics of natural images. In
addition to analyzing deep networks, we show that shallow ones support only
linear separation ranks, and thereby gain insight into the benefit of the
functions brought forth by depth: they are able to efficiently model strong
correlation under favored partitions of the input.
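
For concreteness, the separation rank admits the following standard
formulation (stated here in our own notation): for a partition of the input
variables into disjoint sets \(A\) and \(B\),
\[ \mathrm{sep}(f;A,B) \;=\; \min\Big\{ R \,:\, f(\mathbf{x}) = \sum_{r=1}^{R} g_r(\mathbf{x}_A)\, h_r(\mathbf{x}_B) \Big\}, \]
so \(\mathrm{sep}(f;A,B)=1\) exactly when \(f\) is separable with respect to
\((A,B)\), and higher values indicate stronger modeled correlation between the
two sides of the partition.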