Accepted Papers

Title:

MCMC for Hierarchical Semi-Markov
Conditional Random Fields

Abstract:

Deep architectures such as hierarchical
semi-Markov models are an important class of
models for nested sequential data. However,
inference can be expensive for problems with
arbitrary sequence length and depth. In this
contribution, we propose a new approximation
technique that has the potential to achieve
sub-cubic time complexity in both length and
depth, at the cost of a controllable loss of
quality. The idea is based on two well-known
methods: Gibbs sampling and Rao-Blackwellisation.
We provide a simulation-based evaluation of the
quality of the Rao-Blackwellised Gibbs sampler
(RBGS) with respect to run time and sequence
length.
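
The Rao-Blackwellisation idea can be illustrated
on a toy two-component mixture (a minimal
sketch, not the paper's hierarchical semi-Markov
model): the plain Monte Carlo estimator averages
sampled values of X, while the Rao-Blackwellised
estimator averages the closed-form conditional
mean E[X | Z], which typically has lower
variance.

```python
import math
import random

random.seed(0)

# Toy mixture: Z ~ Bernoulli(0.5), X | Z ~ Normal(mu[Z], 1).
# True marginal mean: E[X] = 0.5 * 0 + 0.5 * 2 = 1.0.
mu = {0: 0.0, 1: 2.0}

def gibbs(n_iters):
    x = 0.0
    plain, rb = [], []
    for _ in range(n_iters):
        # Sample Z | X from its exact posterior.
        w0 = math.exp(-0.5 * (x - mu[0]) ** 2)
        w1 = math.exp(-0.5 * (x - mu[1]) ** 2)
        z = 1 if random.random() < w1 / (w0 + w1) else 0
        # Sample X | Z.
        x = random.gauss(mu[z], 1.0)
        plain.append(x)   # plain estimate uses the sampled x
        rb.append(mu[z])  # Rao-Blackwellised: closed-form E[X | Z]
    return sum(plain) / n_iters, sum(rb) / n_iters

plain_est, rb_est = gibbs(20000)
```

Both estimators converge to the true mean, but
the Rao-Blackwellised average replaces a noisy
sample with an exact conditional expectation,
which is the variance-reduction effect the
abstract exploits.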

We recently proposed deep-structured
conditional random fields (CRFs) for sequential
labeling and classification. The core
of this model is its deep structure and its
discriminative nature. This paper outlines the
learning strategies and algorithms we have
developed for the deep-structured CRFs, with a
focus on the new strategy that combines the
layer-wise unsupervised pre-training using
entropy-based multi-objective optimization and
the conditional likelihood-based
back-propagation fine tuning, as inspired by the
recent development in learning deep belief
networks.

As recently shown, generative models for
sequential data based on directed graphs of
Restricted Boltzmann Machines (RBMs) can
accurately model high-dimensional sequences. In
these models, temporal dependencies in the input
are discovered by either buffering previous
visible variables or by recurrent connections of
the hidden variables. Here we propose a
modification of these models, the Temporal
Reservoir Machine (TRM). It utilizes a recurrent
artificial neural network (ANN) for integrating
information from the input over
time. This information is then fed into an RBM at
each time step. To avoid difficulties of
recurrent network learning, the ANN remains
untrained and hence can be thought of as a
random feature extractor. Using the architecture
of multi-layer RBMs (Deep Belief Networks), the
TRMs can be used as a building block for complex
hierarchical models. This approach unifies
RBM-based approaches for sequential data
modeling and the Echo State Network, a powerful
approach for black-box system identification.
The TRM is tested on a spoken digits task under
noisy conditions, and competitive performance
compared to previous models is observed.

Author Names:

Benjamin Schrauwen*, Ghent University
Lars Buesing, Graz University of Technology
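
The untrained reservoir at the heart of the TRM
can be sketched as a fixed random recurrent
network whose state integrates the input over
time (a minimal illustration; the sizes and the
weight scaling are made-up assumptions, not the
authors' configuration):

```python
import math
import random

random.seed(0)

N_IN, N_RES = 1, 20  # illustrative input and reservoir sizes

# Fixed random weights: the reservoir is never trained.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)]
        for _ in range(N_RES)]
# Small recurrent weights so the memory of past inputs fades
# (a crude stand-in for enforcing the echo-state property).
w_res = [[0.1 * random.uniform(-1.0, 1.0) for _ in range(N_RES)]
         for _ in range(N_RES)]

def run_reservoir(inputs):
    """Return the reservoir state after each 1-d input step."""
    h = [0.0] * N_RES
    states = []
    for u in inputs:
        h = [math.tanh(sum(w_res[i][j] * h[j] for j in range(N_RES))
                       + w_in[i][0] * u)
             for i in range(N_RES)]
        states.append(h)
    return states

states = run_reservoir([0.0, 1.0, 0.0, 0.0])
# The states at t = 2 and t = 3 still carry a fading trace of the
# pulse at t = 1; such states would be fed to an RBM at each step.
```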

We propose the use of competitive learning in
deep networks for understanding sequential data.
Hierarchies of competitive learning algorithms
have been found in the brain [1] and their use
in deep vision networks has been validated [2].
The algorithm is simple to comprehend and yet
provides fast, sparse learning. To understand
temporal patterns we use the depth of the
network and delay blocks to encode time. The
delayed feedback from higher layers provides
meaningful predictions to lower layers. We
evaluate a multi-factor network design by using
it to predict frames in movies it has never seen
before. At this task our system outperforms the
prediction of the Recurrent Temporal Restricted
Boltzmann Machine [3] on novel frame changes.

Author Names:

Robert Gens*, University of Washington
Pedro Domingos, University of Washington
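
Winner-take-all competitive learning, the
building block used above, can be sketched on
toy 2-d data (the cluster data and parameters
are hypothetical, not the authors' multi-factor
network):

```python
import random

random.seed(1)

def competitive_learning(data, n_units=2, lr=0.1, epochs=50):
    """Winner-take-all competitive learning: only the closest
    prototype (the winner) moves toward each input."""
    protos = [x[:] for x in random.sample(data, n_units)]
    for _ in range(epochs):
        for x in data:
            # The unit with the smallest squared distance wins.
            win = min(range(n_units),
                      key=lambda k: sum((p - v) ** 2
                                        for p, v in zip(protos[k], x)))
            # Move only the winner toward the input.
            protos[win] = [p + lr * (v - p)
                           for p, v in zip(protos[win], x)]
    return protos

# Two well-separated clusters; each unit settles near one center.
data = ([[random.gauss(0.0, 0.1), random.gauss(0.0, 0.1)]
         for _ in range(20)]
        + [[random.gauss(5.0, 0.1), random.gauss(5.0, 0.1)]
           for _ in range(20)])
protos = competitive_learning(data)
```

The updates are fast and produce a sparse code:
each input activates (and adapts) exactly one
unit.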

A key challenge in the design of scalable deep
learning architectures is to efficiently
capture spatiotemporal dependencies in a
modality-independent framework. This paper presents a
novel discriminative deep learning architecture,
which relies on an identical cortical circuit
populating the hierarchical structure. Belief
states formed across the hierarchy intrinsically
capture sequences of patterns, rather than
static patterns, thereby facilitating the
embedding of temporal dependencies. At the core
of the adaptation mechanism are two learned
constructs, one of which relies on a fast and
stable incremental clustering. Moreover, the
proposed methodology does not require
layer-by-layer training and lends itself
naturally to massively-parallel processing
platforms. A simple test case demonstrates the
validity of the architecture and learning
algorithm. The system can be efficiently applied
to various modalities, including those
associated with complex visual and audio
information representation.

Author Names:

Itamar Arel*, University of Tennessee
Derek Rose, University of Tennessee
Tom Karnowski, University of Tennessee
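
An incremental clustering step of the kind
mentioned above can be sketched as a
running-mean centroid update (a simplified 1-d
illustration; the distance threshold and update
rule are assumptions, not the authors'
construct):

```python
def incremental_cluster(stream, threshold=1.0):
    """Online clustering: assign each point to the nearest centroid
    if it is within `threshold`, else open a new cluster. Centroids
    are running means, updated in O(1) per point."""
    centroids, counts = [], []
    for x in stream:
        if centroids:
            k = min(range(len(centroids)),
                    key=lambda i: abs(centroids[i] - x))
            if abs(centroids[k] - x) <= threshold:
                counts[k] += 1
                centroids[k] += (x - centroids[k]) / counts[k]
                continue
        centroids.append(x)
        counts.append(1)
    return centroids

cents = incremental_cluster([0.1, 0.2, 5.0, 5.2, 0.0, 4.9])
# Two clusters emerge, with centroids near 0.1 and 5.0.
```

Because each point touches only one centroid,
the update is stable and embarrassingly
parallel across nodes of a hierarchy.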

Hidden Markov Models (HMMs) have been the
state-of-the-art technique for acoustic
modeling despite their unrealistic independence
assumptions and the very limited
representational capacity of their hidden
states. There are many proposals in the research
community for deeper models that are capable of
modeling the many types of variability present
in the speech generation process. Deep Belief
Networks (DBNs) have recently proved to be very
effective for a variety of machine learning
problems and this paper applies DBNs to acoustic
modeling. On the standard TIMIT corpus, DBNs
consistently outperform other techniques, and
the best DBN achieves a phone error rate (PER)
of 23.0% on the TIMIT core test set.

Author Names:

Abdel-rahman Mohamed*, University of Toronto
George Dahl, University of Toronto
Geoffrey Hinton, University of Toronto
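
The layer-wise RBM training underlying DBNs can
be sketched with one-step contrastive divergence
(CD-1) on a toy binary dataset (a minimal
pure-Python illustration, not the paper's
acoustic-modeling setup):

```python
import math
import random

random.seed(0)

N_VIS, N_HID = 4, 2  # toy sizes; acoustic RBMs are far larger
W = [[random.gauss(0, 0.1) for _ in range(N_HID)] for _ in range(N_VIS)]
b_vis, b_hid = [0.0] * N_VIS, [0.0] * N_HID

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hid_probs(v):
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(N_VIS)))
            for j in range(N_HID)]

def vis_probs(h):
    return [sigmoid(b_vis[i] + sum(h[j] * W[i][j] for j in range(N_HID)))
            for i in range(N_VIS)]

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence step: data phase minus a
    one-step reconstruction phase."""
    ph0 = hid_probs(v0)
    h0 = [1.0 if random.random() < p else 0.0 for p in ph0]
    v1 = [1.0 if random.random() < p else 0.0 for p in vis_probs(h0)]
    ph1 = hid_probs(v1)
    for i in range(N_VIS):
        for j in range(N_HID):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_vis[i] += lr * (v0[i] - v1[i])
    for j in range(N_HID):
        b_hid[j] += lr * (ph0[j] - ph1[j])

data = [[1, 1, 0, 0], [0, 0, 1, 1]]

def recon_error():
    return sum(sum((a - b) ** 2
                   for a, b in zip(v, vis_probs(hid_probs(v))))
               for v in data)

before = recon_error()
for _ in range(500):
    for v in data:
        cd1_update(v)
after = recon_error()  # reconstruction error drops as the RBM learns
```

In a DBN, trained RBMs like this one are
stacked: the hidden activities of one layer
become the training data for the next.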

Recently, Poon and Domingos (2009) developed
the first approach for unsupervised semantic
parsing, the USP system. They applied it to
extracting a knowledge base from biomedical
abstracts for question answering and showed that
it substantially outperforms state-of-the-art
systems such as TextRunner and DIRT. In this
paper, we show that USP can be viewed as
learning a deep network for semantic parsing.
The hidden units in the network represent
clusters of meaning expressions, whereas the
visible units represent dependency trees of
input sentences. USP starts with a network where
each atomic expression has its own cluster, and
learns the final architecture by incrementally
combining hidden units to abstract away
syntactic and lexical variations of the same
meaning. USP can be naturally generalized to a
new approach for deep learning based on
structure search; we discuss the implications of
this.

We propose a non-linear graphical model for
structured prediction. It combines the power of
deep networks to extract high level features
with the graphical framework of Markov networks,
yielding a powerful and scalable model that we
apply to signal labeling tasks.

The overall objective function of a MAP-based
language model (LM) adaptation technique is
implicitly a composition of two objective
functions: the first objective is concerned
with the maximum likelihood estimation of the
model parameters from the in-domain data,
while the second objective is concerned
with an appropriate representation of prior
information obtained from a general purpose
corpus. In this paper, we separate these
individual objective functions, which are at
least partially conflicting, and take a
multi-objective programming (MOP) approach to LM
adaptation. The resulting MOP problem is solved
in an iterative manner such that each objective
is optimized one after another with constraints
on the others. When solved this way, the target
LM is in the form of a log-linear interpolation
of component LMs. In our preliminary experiments
with bigram LMs, the proposed approach slightly
outperformed linear interpolation. In our
ongoing work with trigram LMs, we expect the
proposed approach to outperform linear
interpolation in terms of both perplexity and
the automatic speech recognition word error
rate.
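
The log-linear interpolation that the target LM
takes above can be sketched as follows (toy
unigram models with made-up probabilities over a
shared vocabulary; a real system would use
n-gram LMs and weights tuned under the MOP
constraints):

```python
import math

# Hypothetical unigram LMs over a shared three-word vocabulary.
p_in = {"the": 0.4, "market": 0.4, "cat": 0.2}   # in-domain LM
p_gen = {"the": 0.5, "market": 0.1, "cat": 0.4}  # general-purpose LM

def log_linear_interpolate(lms, weights):
    """Log-linear interpolation: p(w) is proportional to
    prod_k p_k(w) ** lambda_k, renormalised over the vocabulary."""
    vocab = set().union(*lms)
    scores = {w: math.exp(sum(lam * math.log(lm[w])
                              for lm, lam in zip(lms, weights)))
              for w in vocab}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

p = log_linear_interpolate([p_in, p_gen], [0.7, 0.3])
# The result is a proper distribution that leans toward the more
# heavily weighted in-domain model.
```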

Empirical results have shown that many spoken
language identification systems based on
hand-coded features perform poorly on small
speech samples where a human would be
successful. A hypothesis for this low
performance is that the set of extracted
features is insufficient. A deep architecture
that learns features automatically is
implemented and evaluated on several datasets.

In recent years, deep learning approaches have
gained significant interest as a way of building
hierarchical representations from unlabeled
data. However, to our knowledge, these deep
learning approaches have not been extensively
studied for auditory data. In this paper, we
apply convolutional deep belief networks to
audio data and empirically evaluate them on
various audio classification tasks. In the case
of speech data, we show that the learned
features correspond to phones/phonemes. In
addition, our feature representations learned
from unlabeled audio data show very good
performance for multiple audio classification
tasks. We hope that this paper will inspire more
research on deep learning approaches applied to
a wide range of audio recognition tasks.
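
The convolution-plus-pooling computation at the
heart of convolutional deep belief networks can
be sketched in one dimension, as it would apply
along the time axis of audio (a minimal
illustration; the filter values are made up,
whereas in a CDBN they would be learned):

```python
def conv1d(signal, kernel):
    """Valid-mode 1-d filtering: slide the filter along the signal,
    as a convolutional layer does along the time axis."""
    n = len(kernel)
    return [sum(signal[t + i] * kernel[i] for i in range(n))
            for t in range(len(signal) - n + 1)]

def max_pool(feature_map, width):
    """Non-overlapping max pooling, giving local shift invariance."""
    return [max(feature_map[t:t + width])
            for t in range(0, len(feature_map), width)]

# A difference filter responds wherever the signal steps upward.
fm = conv1d([0, 0, 1, 1, 2, 2], [-1, 1])
pooled = max_pool(fm, 2)
```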