Software

Torch7 is a machine learning library which aims at including state-of-the-art algorithms.

Torch7
is the latest version of Torch. It provides a Matlab-like
environment for state-of-the-art machine learning algorithms. It
is easy to use and provides a very efficient
implementation, thanks to an easy and
fast scripting language (Lua) and an underlying C
implementation. It is distributed under
a BSD
license.

Torch5 was the
previous official version. Torch7 is built
over Torch5, bringing more flexibility in Tensor types, as
well as many optimizations (including SSE, OpenMP or CUDA).

Torch3 was written
completely in C++. While it has been used in many
projects, I myself always found it too complicated. It also lacked
documentation.

Other versions like Torch4 have also been
written. Torch4 was developed in
Objective C. While being simpler than Torch3, it did
not spread because sheep prefer complicated languages
like C++.

SVMTorch was written while I was a PhD student, and it was efficient at the time. I
would now recommend using
LIBSVM,
as SVMTorch has not been updated in a long while... or the
new Torch 5 software which includes
efficient SVMs.

SENNA is software distributed under a non-commercial license, which
outputs a host of Natural Language Processing (NLP) predictions:
part-of-speech (POS) tags, chunking (CHK), named entity recognition
(NER) and semantic role labeling (SRL).

SENNA is fast because it uses a simple architecture, self-contained
because it does not rely on the output of existing NLP systems, and
accurate because it offers state-of-the-art or near state-of-the-art
performance.

SENNA is written in ANSI C, with about 2500 lines of code. It requires
about 150MB of RAM and should run on any IEEE floating point computer.

Publications

2016

In this paper, we propose a novel approach for weakly-supervised word
recognition. Most state of the art automatic speech recognition systems are
based on frame-level labels obtained through forced alignments or through a
sequential loss. Recently, weakly-supervised trained models have been
proposed in vision, that can learn which part of the input is relevant for
classifying a given pattern. Our system is composed of a convolutional
neural network and a temporal score aggregation mechanism. For each
sentence, it is trained using as supervision only some of the words (most
frequent) that are present in a given sentence, without knowing their order
or quantity. We show that our proposed system is able to jointly classify
and localise words. We also evaluate the system on a keyword spotting task,
and show that it can yield performance similar to a strong supervised HMM/GMM
baseline.

This paper aims to classify and locate objects accurately and efficiently,
without using bounding box annotations. It is challenging as objects in the
wild could appear at arbitrary locations and in different scales. In this
paper, we propose a novel classification architecture ProNet based on
convolutional neural networks. It uses computationally efficient neural
networks to propose image regions that are likely to contain objects, and
applies more powerful but slower networks on the proposed regions. The
basic building block is a multi-scale fully-convolutional network which
assigns object confidence scores to boxes at different locations and
scales. We show that such networks can be trained effectively using
image-level annotations, and can be connected into cascades or trees for
efficient object classification. ProNet significantly outperforms the previous
state of the art on the PASCAL VOC 2012 and MS COCO datasets for
object classification and point-based localization.

Torch 7 is a scientific computing platform that supports both CPU and GPU
computation, has a lightweight wrapper in a simple scripting language, and
provides fast implementations of common algebraic operations. It has become
one of the main frameworks for research in (deep) machine learning. Torch
does not, however, provide abstractions and boilerplate code for
machine-learning experiments. As a result, researchers repeatedly
re-implement experimentation logic that is not interoperable. We
introduce Torchnet: an open-source framework that provides abstractions and
boilerplate logic for machine learning. It encourages modular programming
and code re-use, which reduces the chance of bugs, and it makes it
straightforward to use asynchronous data loading and efficient multi-GPU
computations. Torchnet is written in pure Lua, which makes it easy to
install on any architecture with a Torch installation. We envision Torchnet
becoming a platform to which the community contributes via plugins.

Recent works in Natural Language Processing (NLP) using neural networks
have focused on learning dense word representations to perform
classification tasks. When dealing with phrase prediction problems, it is
common practice to use special tagging schemes to identify segment
boundaries. This allows these tasks to be expressed as common word tagging
problems. In this paper, we propose to learn fixed-size representations for
arbitrarily sized chunks. We introduce a model that takes advantage of such
representations to perform phrase tagging by directly identifying and
classifying phrases. We evaluate our approach on the task of multiword
expression (MWE) tagging and show that our model outperforms the
state-of-the-art model for this task.

Morphologically rich languages (MRL) are languages in which much of the
structural information is contained at the word level, leading to a high level
of word-form variation. Historically, syntactic parsing has been mainly
tackled using generative models. These models assume input features to be
conditionally independent, making it difficult to incorporate arbitrary
features. In this paper, we investigate the greedy discriminative parser
described in (Legrand and Collobert, 2015), which relies on word
embeddings, in the context of MRL. We propose to learn morphological
embeddings and propagate morphological information through the tree using a
recursive composition procedure. Experiments show that such embeddings can
dramatically improve the average performance on different
languages. Moreover, it yields state-of-the-art performance for a majority
of languages.

We present a simple neural network for word alignment that builds source
and target word window representations to compute alignment scores for
sentence pairs. To enable unsupervised training, we use an aggregation
operation that summarizes the alignment scores for a given target word. A
soft-margin objective increases scores for true target words while
decreasing scores for target words that are not present. Compared to the
popular Fast Align model, our approach improves alignment accuracy by 7 AER
on English-Czech, by 6 AER on Romanian-English and by 1.7 AER on
English-French alignment.

In this work we propose to augment feedforward nets for object segmentation
with a novel top-down refinement approach. The resulting bottom-up/top-down
architecture is capable of efficiently generating high-fidelity object
masks. Similarly to skip connections, our approach leverages features at
all layers of the net. Unlike them, our approach does not attempt to output
independent predictions at each layer. Instead, we first output a coarse
‘mask encoding’ in a feedforward pass, then refine this mask encoding in a
top-down pass utilizing features at successively lower layers.

2015

In this paper, we propose a new way to generate object proposals,
introducing an approach based on a discriminative convolutional
network. Our model is trained jointly with two objectives: given an image
patch, the first part of the system outputs a class-agnostic segmentation
mask, while the second part of the system outputs the likelihood of the
patch being centered on a full object. At test time, the model is
efficiently applied on the whole test image and generates a set of
segmentation masks, each of them being assigned a corresponding object
likelihood score.

We are interested in inferring object segmentation by leveraging only
object class information, and by considering only minimal priors on the
object segmentation task. This problem could be viewed as a kind of weakly
supervised segmentation task, and naturally fits the Multiple Instance
Learning (MIL) framework: every training image is known to have (or not) at
least one pixel corresponding to the image class label, and the
segmentation task can be rewritten as inferring the pixels belonging to the
class of the object (given one image, and its object class). We propose a
Convolutional Neural Network-based model, which is constrained during
training to put more weight on pixels which are important for classifying
the image. We show that at test time, the model has learned to discriminate
the right pixels well enough, such that it performs very well on an
existing segmentation benchmark, by adding only a few smoothing priors. Our
system is trained using a subset of the Imagenet dataset and the
segmentation experiments are performed on the challenging Pascal VOC
dataset (with no fine-tuning of the model on Pascal VOC). Our model beats
the state-of-the-art results on the weakly supervised object segmentation task
by a large margin. We also compare the performance of our model with state
of the art fully-supervised segmentation approaches.

The bag-of-words (BOW) model is the common approach for classifying documents,
where words are used as features for training a classifier. This generally
involves a huge number of features. Some techniques, such as Latent Semantic
Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize
documents in a lower dimension with the least semantic information loss.
Some semantic information is nevertheless always lost, since only words are considered.
Instead, we aim at using information coming from n-grams to overcome
this limitation, while remaining in a low-dimension space. Many approaches, such
as the Skip-gram model, provide good word vector representations very quickly.
We propose to average these representations to obtain representations of n-grams.
All n-grams are thus embedded in the same semantic space. A K-means clustering
can then group them into semantic concepts. The number of features is therefore
dramatically reduced and documents can be represented as bags of semantic
concepts. We show that this model outperforms LSA and LDA on a sentiment
classification task, and yields results similar to a traditional BOW model with
far fewer features.
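
As a rough illustration of the pipeline described in this abstract (average word vectors into n-gram vectors, cluster them with K-means, then count concepts per document), here is a minimal pure-Python sketch. The toy 2-d word vectors, the initial centroids and all function names are hypothetical, chosen only for the example; this is not the implementation used in the paper, which relies on Skip-gram vectors trained on a large corpus.

```python
from collections import Counter

# Toy 2-d word vectors (hypothetical values standing in for Skip-gram vectors).
WORD_VECS = {
    "good": (1.0, 0.0), "great": (0.9, 0.1), "movie": (0.6, 0.4),
    "bad": (0.0, 1.0), "awful": (0.1, 0.9),
}

def ngrams(tokens, n_max=2):
    """All n-grams of length 1..n_max, as tuples of tokens."""
    return [tuple(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def embed(ngram):
    """Represent an n-gram by the average of its word vectors."""
    vecs = [WORD_VECS[w] for w in ngram]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the closest centroid."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cen
                     for g, cen in zip(groups, centroids)]
    return centroids

def bag_of_concepts(tokens, centroids):
    """Count how many n-grams of the document fall into each cluster."""
    counts = Counter(nearest(embed(g), centroids) for g in ngrams(tokens))
    return [counts.get(k, 0) for k in range(len(centroids))]

corpus = [["good", "movie"], ["great", "movie"],
          ["bad", "movie"], ["awful", "movie"]]
points = [embed(g) for doc in corpus for g in ngrams(doc)]
centroids = kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
features = [bag_of_concepts(doc, centroids) for doc in corpus]
```

Each document ends up with one count per concept, so the feature dimension equals the number of clusters rather than the vocabulary size.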

Generating a novel textual description of an image is an interesting
problem that connects computer vision and natural language processing. In
this paper, we present a simple model that is able to generate descriptive
sentences given a sample image. This model has a strong focus on the syntax
of the descriptions. We train a purely bilinear model that learns a metric
between an image representation (generated from a previously trained
Convolutional Neural Network) and phrases that are used to describe
them. The system is then able to infer phrases from a given image sample.
Based on caption syntax statistics, we propose a simple language model that
can produce relevant descriptions for a given test image using the phrases
inferred. Our approach, which is considerably simpler than state-of-the-art
models, achieves comparable results in two popular datasets for the task:
Flickr30k and the recently proposed Microsoft COCO.

D. Palaz, M. Magimai-Doss and R. Collobert. Analysis of CNN-based Speech Recognition System using Raw Speech as Input. In 16th Annual Conference of the International Speech Communication Association (Interspeech), 2015.

Automatic speech recognition systems typically model the relationship
between the acoustic speech signal and the phones in two separate steps:
feature extraction and classifier training. In our recent works, we have
shown that, in the framework of convolutional neural networks (CNN), the
relationship between the raw speech signal and the phones can be directly
modeled and that ASR systems competitive with the standard approach can be built. In
this paper, we first analyze and show that, between the first two
convolutional layers, the CNN learns (in part) and models the
phone-specific spectral envelope information of 24 ms of speech. Given that,
we show that the CNN-based approach yields ASR trends similar to standard
short-term spectral based ASR systems under mismatched (noisy) conditions,
with the CNN-based approach being more robust.

This paper introduces a greedy parser based on neural networks, which leverages
a new compositional sub-tree representation. The greedy parser and the compositional
procedure are jointly trained, and depend tightly on each other. The
composition procedure outputs a vector representation which summarizes sub-trees both syntactically
(parsing tags) and semantically (words). Composition and tagging
are achieved over continuous (word or tag) representations, and recurrent neural
networks. We reach F1 performance on par with well-known existing parsers,
while having the advantage of speed, thanks to the greedy nature of the parser. We
provide a fully functional implementation of the method described in this paper.

State-of-the-art automatic speech recognition systems model the
relationship between acoustic speech signal and phone classes in two
stages, namely, extraction of spectral-based features based on prior
knowledge followed by training of an acoustic model, typically an artificial
neural network (ANN). In our recent work, it was shown that Convolutional
Neural Networks (CNNs) can model phone classes from raw acoustic speech
signal, reaching performance on par with other existing feature-based
approaches. This paper extends the CNN-based approach to a large vocabulary
speech recognition task. More precisely, we compare the CNN-based approach
against the conventional ANN-based approach on the Wall Street Journal
corpus. Our studies show that the CNN-based approach achieves better
performance than the conventional ANN-based approach with as many
parameters. We also show that the features learned from raw speech by the
CNN-based approach could generalize across different databases.

Recent works on word representations mostly rely on predictive
models. Distributed word representations (aka word embeddings) are trained
to optimally predict the contexts in which the corresponding words tend to
appear. Such models have succeeded in capturing word similarities as well
as semantic and syntactic regularities. Instead, we aim at reviving
interest in a model based on counts. We present a systematic study of the
use of the Hellinger distance to extract semantic representations from the
word co-occurrence statistics of large text corpora. We show that this
distance gives good performance on word similarity and analogy tasks, with
a proper type and size of context, and a dimensionality reduction based on
a stochastic low-rank approximation. Besides being both simple and
intuitive, this method also provides an encoding function which can be used
to infer unseen words or phrases. This becomes a clear advantage compared
to predictive models which must train these new words.

2014

State-of-the-art phoneme sequence recognition systems are based on the hybrid
hidden Markov model/artificial neural networks (HMM/ANN) framework. In this
framework, the local classifier, ANN, is typically trained using the Viterbi
expectation-maximization algorithm, which involves two separate steps:
phoneme sequence segmentation and training of the ANN. In this paper, we
propose a CRF based phoneme sequence recognition approach that
simultaneously infers the phoneme segmentation and classifies the phoneme
sequence. More specifically, the phoneme sequence recognition system
consists of a local classifier ANN followed by a conditional random field
(CRF) whose parameters are trained jointly, using a cost function that
discriminates the true phoneme sequence against all competing sequences. In
order to efficiently train such a system we introduce a novel CRF based
segmentation using an acyclic graph. We study the viability of the proposed
approach on the TIMIT phoneme recognition task. Our studies show that the
proposed approach is capable of achieving performance similar to standard
hybrid HMM/ANN and ANN/CRF systems where the ANN is trained with manual
segmentation.

The goal of the scene labeling task is to assign a class label to each
pixel in an image. To ensure a good visual coherence and a high class
accuracy, it is essential for a model to capture long range (pixel) label
dependencies in images. In a feed-forward architecture, this can be
achieved simply by considering a sufficiently large input context patch,
around each pixel to be labeled. We propose an approach that consists of a
recurrent convolutional neural network which allows us to consider a large
input context while limiting the capacity of the model. Contrary to most
standard approaches, our method does not rely on any segmentation technique
nor any task-specific features. The system is trained in an end-to-end
manner over raw pixels, and models complex spatial dependencies with low
inference cost. As the context size increases with the built-in recurrence,
the system identifies and corrects its own errors. Our approach yields
state-of-the-art performance on both the Stanford Background Dataset and
the SIFT Flow Dataset, while remaining very fast at test time.

J. Legrand and R. Collobert. Recurrent Greedy Parsing with Neural Networks. In Proceedings of the European Conference on Machine Learning, Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2014.

In this paper, we propose a bottom-up greedy and purely discriminative
syntactic parsing approach that relies only on a few simple features. The
core of the architecture is a simple neural network architecture, trained
with an objective function similar to a Conditional Random Field. This
parser leverages continuous word vector representations to model the
conditional distributions of context-aware syntactic rules. The learned
distribution rules are naturally smoothed, thanks to the continuous nature
of the input features and the model. Generalization accuracy compares very
well with the existing generative or discriminative (non-reranking) parsers
(despite the greedy nature of our approach), and prediction speed is very
fast.

@inproceedings{legrand:2014,
title = {Recurrent Greedy Parsing with Neural Networks},
author = {J. Legrand and R. Collobert},
booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)},
year = {2014}
}

R. Lebret and R. Collobert. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 482-490, Association for Computational Linguistics, 2014.

Word embeddings resulting from neural language models have been shown to be
a great asset for a large variety of NLP tasks. However, such architecture
might be difficult and time-consuming to train. Instead, we propose to
drastically simplify the word embeddings computation through a Hellinger
PCA of the word co-occurrence matrix. We compare those new word embeddings
with some well-known embeddings on named entity recognition and movie
review tasks and show that we can reach similar or even better
performance. Although deep learning is not really necessary for generating
good word embeddings, we show that it can provide an easy way to adapt
embeddings to specific tasks.
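
To make the idea of a Hellinger PCA more concrete, here is a minimal pure-Python sketch, under stated assumptions. Each row of a word co-occurrence count matrix is normalised into a distribution and square-rooted, so that Euclidean distance between rows is proportional to the Hellinger distance; a principal component is then extracted by power iteration. The toy counts, the single retained component and all function names are hypothetical, and a real system would use a proper SVD on a much larger matrix.

```python
import math

def sqrt_distributions(counts):
    """Row-normalise co-occurrence counts, then take element-wise square roots."""
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([math.sqrt(c / total) for c in row])
    return rows

def hellinger(p_sqrt, q_sqrt):
    """Hellinger distance between two distributions given as square-rooted rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_sqrt, q_sqrt)) / 2.0)

def first_component(rows, n_iter=100):
    """Leading principal direction of the centred rows, by power iteration."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centred = [[x - m for x, m in zip(r, mean)] for r in rows]
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(n_iter):
        proj = [sum(x * c for x, c in zip(r, v)) for r in centred]   # X v
        w = [sum(p * r[j] for p, r in zip(proj, centred)) for j in range(dim)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, direction):
    """1-d embedding of a word: coordinate along the principal direction."""
    return sum((x - m) * c for x, m, c in zip(row, mean, direction))

# Hypothetical counts of three words against four context words.
counts = [
    [8, 6, 1, 1],   # "cat"
    [7, 7, 1, 1],   # "dog"
    [1, 0, 9, 6],   # "car"
]
rows = sqrt_distributions(counts)
mean, direction = first_component(rows)
coords = [project(r, mean, direction) for r in rows]
```

In this toy setting the two words with similar context distributions receive nearby coordinates, while the third lands far away.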

2013

In hybrid hidden Markov model/artificial neural networks (HMM/ANN)
automatic speech recognition (ASR) systems, the phoneme class conditional
probabilities are estimated by first extracting acoustic features from the
speech signal based on prior knowledge, such as speech perception and/or
speech production knowledge, and then modeling the acoustic features
with an ANN. Recent advances in machine learning techniques, more
specifically in the field of image processing and text processing, have
shown that such a divide-and-conquer strategy (i.e., separating feature
extraction and modeling steps) may not be necessary. Motivated by these
studies, in the framework of convolutional neural networks (CNNs), this
paper investigates a novel approach, where the input to the ANN is the raw
speech signal and the output is phoneme class conditional probability
estimates. On the TIMIT phoneme recognition task, we study different ANN
architectures to show the benefit of CNNs and compare the proposed approach
against the conventional approach, where the spectral-based MFCC feature is
extracted and modeled by a multilayer perceptron. Our studies show that the
proposed approach can yield comparable or better phoneme recognition
performance when compared to the conventional approach. It indicates that
CNNs can learn features relevant for phoneme classification automatically
from the raw speech signal.

This paper proposes a method for learning to rank over network data. The
ranking is performed with respect to a query object which can be part of
the network or outside it. The ranking method makes use of the features
of the nodes as well as the existing links between them. First, a
neighbors-aware ranker is trained using a large margin pairwise loss
function. The neighbors-aware ranker uses target neighbors' scores in addition
to objects' content, and therefore the scoring is consistent in every
neighborhood. Then, collective inference is performed using an iterative
ranking algorithm, which propagates the results of rankers over the
network. By formulating link prediction as a ranking problem, the method
is tested on several networks, with papers/citations and
webpages/hyperlinks. The results show that the proposed algorithm, which
uses both the attributes of the nodes and the structure of the links,
outperforms several other methods: a content-only ranker, a link-only
one, a random walk method, a relational topic model, and a method based
on the weighted number of common neighbors. In addition, the propagation
algorithm improves results even when the query object is not part of the
network, and scales efficiently to large networks.
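
The abstract mentions a large margin pairwise loss without giving its form. A common choice is the pairwise hinge loss, sketched below as an assumption rather than as the exact loss of the paper: a relevant object should score at least `margin` higher than an irrelevant one. The function names, the toy scores and the default margin of 1.0 are hypothetical.

```python
def pairwise_hinge(score_pos, score_neg, margin=1.0):
    """Zero when the relevant item beats the irrelevant one by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)

def ranking_loss(scores, relevant, margin=1.0):
    """Average pairwise hinge loss over all (relevant, irrelevant) pairs."""
    pos = [s for item, s in scores.items() if item in relevant]
    neg = [s for item, s in scores.items() if item not in relevant]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(pairwise_hinge(p, n, margin) for p, n in pairs) / len(pairs)

# Hypothetical ranker scores for three candidate nodes given one query.
scores = {"a": 2.0, "b": 0.5, "c": 1.5}
loss = ranking_loss(scores, relevant={"a"})
```

Here the pair (a, b) is already separated by more than the margin and contributes nothing, while (a, c) is separated by only 0.5 and is penalised.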

2012

Neural networks and machine learning algorithms in general require a
flexible environment where new algorithm prototypes and experiments
can be set up as quickly as possible with the best possible computational
performance. To that end, we provide a new framework called Torch7,
that is especially suited to achieve both of these competing
goals. Torch7 is a versatile numeric computing framework and machine
learning library that extends Lua, a very lightweight and powerful
programming language. Its goal is to provide a flexible
environment to design, train and deploy learning machines. Flexibility
is obtained via Lua, an extremely lightweight scripting language. High
performance is obtained via efficient OpenMP/SSE and CUDA
implementations of low-level numeric routines. Torch7 can also easily
be interfaced to third-party software thanks to Lua’s light C
interface.

We show how nonlinear embedding algorithms popular for use with
"shallow" semi-supervised learning techniques such as kernel methods
can be easily applied to deep multi-layer architectures, either as a
regularizer at the output layer, or on each layer of the
architecture. This trick provides a simple alternative to existing
approaches to deep learning whilst yielding competitive error rates
compared to those methods, and existing shallow semi-supervised
techniques.

We propose a unified neural network architecture and learning
algorithm that can be applied to various natural language
processing tasks including part-of-speech tagging, chunking,
named entity recognition, and semantic role labeling. This
versatility is achieved by trying to avoid task-specific
engineering and therefore disregarding a lot of prior
knowledge. Instead of exploiting man-made input features
carefully optimized for each task, our system learns internal
representations on the basis of vast amounts of mostly
unlabeled training data. This work is then used as a basis for
building a freely available tagging system with good
performance and minimal computational requirements.

We propose a new fast purely discriminative algorithm for
natural language parsing, based on a “deep” recurrent
convolutional graph transformer network (GTN). Assuming a
decomposition of a parse tree into a stack of “levels”, the
network predicts a level of the tree taking into account
predictions of previous levels. Using only a few basic text
features which leverage word representations from Collobert
and Weston (2008), we show similar performance (in F1 score)
to existing pure discriminative parsers and existing
“benchmark” parsers (like Collins parser, probabilistic
context-free grammars based), with a huge speed advantage.

I apologize for the incorrect F1 score I first reported for
Carreras et al.'s parser (90.5% instead of 91.1%). I confused WSJ
sections 23 and 24 performance. Thanks to Michael Collins who
reported the bug.

Many Knowledge Bases (KBs) are now readily available and
encompass colossal quantities of information thanks to either
a long-term funding effort (e.g. WordNet, OpenCyc) or a
collaborative process (e.g. Freebase, DBpedia). However, each
of them is based on a different rigorous symbolic framework
which makes it hard to use their data in other systems. It is
unfortunate because such rich structured knowledge might lead
to a huge leap forward in many other areas of AI like natural
language processing (word-sense disambiguation, natural
language understanding, ...), vision (scene
classification, image semantic annotation, ...) or
collaborative filtering. In this paper, we present a learning
process based on an innovative neural network architecture
designed to embed any of these symbolic representations into
a more flexible continuous vector space in which the original
knowledge is kept and enhanced. These learnt embeddings would
allow data from any KB to be easily used in recent machine
learning methods for prediction and information retrieval. We
illustrate our method on WordNet and Freebase and also
present a way to adapt it to knowledge extraction from raw
text.

2010

Bio-relation extraction (bRE), an important goal in bio-text
mining, involves subtasks identifying relationships between
bio-entities in text at multiple levels, e.g., at the article,
sentence or relation level. A key limitation of current bRE
systems is that they are restricted by the availability of
annotated corpora. In this work we introduce a semi-supervised
approach that can tackle multi-level bRE via string comparisons
with mismatches in the string kernel framework. Our string kernel
implements an abstraction step, which groups similar words to
generate more abstract entities, which can be learnt with
unlabeled data. Specifically, two unsupervised models are proposed
to capture contextual (local or global) semantic similarities
between words from a large unannotated corpus. This
Abstraction-augmented String Kernel (ASK) allows for better
generalization of patterns learned from annotated data and
provides a unified framework for solving bRE with multiple degrees
of detail. ASK shows effective improvements over classic string
kernels on four datasets and achieves state-of-the-art bRE
performance without the need for complex linguistic features.

We present a general framework and learning algorithm for the task
of concept labeling: each word in a given sentence has to be
tagged with the unique physical entity (e.g. person, object or
location) or abstract concept it refers to. Our method allows
both world knowledge and linguistic information to be used during
learning and prediction. We show experimentally that we can learn
to use world knowledge to resolve ambiguities in language, such as
word senses or reference resolution, without the use of
handcrafted rules or features.

We study the standard retrieval task of ranking a fixed set of
items given a previously unseen query and pose it as the
half-transductive ranking problem. Transductive representations
(where the vector representation of each example is learned) allow
the generation of highly nonlinear embeddings that capture the
characteristics of object relationships without relying on a
specific choice of features, and require only relatively simple
optimization. Unfortunately, they have no direct out-of-sample
extension. Inductive approaches on the other hand allow for the
representation of unknown queries. We describe algorithms for this
setting which have the advantages of both transductive and
inductive approaches, and can be applied in unsupervised (either
reconstruction-based or graph-based) and supervised ranking
setups. We show empirically that our methods give strong
performance on all three tasks.

We present a general framework and learning algorithm for the task
of concept labeling: each word in a given sentence has to be
tagged with the unique physical entity (e.g. person, object or
location) or abstract concept it refers to. Our method allows both
world knowledge and linguistic information to be used during
learning and prediction. We show experimentally that we can handle
natural language and learn to use world knowledge to resolve
ambiguities in language, such as word senses or coreference,
without the use of hand-crafted rules or features.

We study the standard retrieval task of ranking a fixed set of
documents given a previously unseen query and pose it as the
half-transductive ranking problem. The task is partly transductive
as the document set is fixed. Existing transductive approaches are
natural non-linear methods for this set, but have no direct
out-of-sample extension. Functional approaches, on the other hand,
can be applied to the unseen queries, but fail to exploit the
availability of the document set in its full extent. This work
introduces a half-transductive approach to benefit from the
advantages of both transductive and functional approaches and shows
its empirical advantage in supervised ranking setups.

We present a class of nonlinear (polynomial) models that are
discriminatively trained to directly map from the word content in
a query-document or document-document pair to a ranking
score. Dealing with polynomial models on word features is
computationally challenging. We propose a low-rank (but diagonal
preserving) representation of our polynomial models to induce
feasible memory and computation requirements. We provide an
empirical study on retrieval tasks based on Wikipedia documents,
where we obtain state-of-the-art performance while providing
realistically scalable methods.
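
To illustrate why a low-rank but diagonal-preserving representation keeps memory and computation feasible, here is a minimal sketch for the degree-2 case only. It assumes the score has the form q^T W d with W = U^T V + I, which is one natural low-rank-plus-identity parameterization; the abstract does not spell out the exact form, so treat this as an assumption. All names and toy values are hypothetical. With a vocabulary of size D and rank N, the dense W needs D x D weights, whereas U and V together need only 2 x N x D.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(matrix, vec):
    """Multiply an N x D matrix (list of rows) by a D-dimensional vector."""
    return [dot(row, vec) for row in matrix]

def score_low_rank(q, d, U, V):
    """q^T (U^T V + I) d, computed as (U q) . (V d) + q . d without forming W."""
    return dot(project(U, q), project(V, d)) + dot(q, d)

def score_dense(q, d, U, V):
    """Same score, computed the expensive way by expanding every W[i][j]."""
    size = len(q)
    total = 0.0
    for i in range(size):
        for j in range(size):
            w_ij = sum(U[k][i] * V[k][j] for k in range(len(U)))
            if i == j:
                w_ij += 1.0  # identity term: preserves the diagonal
            total += q[i] * w_ij * d[j]
    return total

# Toy bag-of-words vectors over a 4-word vocabulary, rank N = 2.
q = [1.0, 0.0, 2.0, 0.0]
d = [0.0, 1.0, 1.0, 0.0]
U = [[0.5, -1.0, 0.0, 2.0], [1.0, 0.0, 0.5, 0.0]]
V = [[1.0, 0.5, 0.0, 0.0], [0.0, 2.0, -1.0, 1.0]]

s_fast = score_low_rank(q, d, U, V)
s_full = score_dense(q, d, U, V)
```

The identity term means that exact word matches between query and document always contribute to the score, while the low-rank part captures interactions between different words.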

This tutorial will describe recent advances in deep learning
techniques for Natural Language Processing (NLP). Traditional NLP
approaches favour shallow systems, possibly cascaded, with
adequate hand-crafted features. In contrast, we are interested in
end-to-end architectures: these systems include several feature
layers, with increasing abstraction at each layer. Compared to
shallow systems, these feature layers are learnt for the task of
interest, and do not require any engineering. We will show how
neural networks are naturally well suited for end-to-end learning
in NLP tasks. We will study multi-tasking different tasks, new
semi-supervised learning techniques adapted to these deep
architectures, and review end-to-end structured output
learning. Finally, we will highlight how some of these advances
can be applied to other fields of research, like computer vision,
as well.

Typical information extraction (IE) systems can be seen as tasks
assigning labels to words in a natural language sequence. The
performance is restricted by the availability of labeled words. To
tackle this issue, we propose a semi-supervised approach to
improve the sequence labeling procedure in IE through a class of
algorithms with self-learned features (SLF). A supervised
classifier can be trained with annotated text sequences and used
to classify each word in a large set of unannotated sentences. By
averaging predicted labels over all cases in the unlabeled corpus,
SLF training builds class label distribution patterns for each
word (or word attribute) in the dictionary and re-trains the
current model iteratively adding these distributions as extra word
features. Basic SLF models how likely a word could be assigned to
target class types. Several extensions are proposed, such as
learning words’ class boundary distributions. SLF exhibits robust
and scalable behaviour and is easy to tune. We applied this
approach on four classical IE tasks: named entity recognition
(German and English), part-of-speech tagging (English) and one
gene name recognition corpus. Experimental results show effective
improvements over the supervised baselines on all tasks. In
addition, when compared with the closely related self-training
idea, this approach shows favorable advantages.

In this article we present Supervised Semantic Indexing (SSI)
which defines a class of nonlinear (quadratic) models that are
discriminatively trained to directly map from the word content in
a query-document or document-document pair to a ranking
score. Like Latent Semantic Indexing (LSI), our models take
account of correlations between words (synonymy,
polysemy). However, unlike LSI our models are trained from a
supervised signal directly on the ranking task of interest, which
we argue is the reason for our superior results. As the query and
target texts are modeled separately, our approach is easily
generalized to different retrieval tasks, such as cross-language
retrieval or online advertising placement. Dealing with models on
all pairs of words features is computationally challenging. We
propose several improvements to our basic model for addressing
this issue, including low rank (but diagonal preserving)
representations, correlated feature hashing (CFH) and
sparsification. We provide an empirical study of all these methods
on retrieval tasks based on Wikipedia documents as well as an
Internet advertisement task. We obtain state-of-the-art
performance while providing realistically scalable methods.
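
As a rough illustration of the low-rank (but diagonal preserving) idea, the sketch below scores a query-document pair as the plain word overlap q.d plus a low-rank term (Uq).(Vd), which corresponds to a weight matrix of the form U^T V + I. The function name `ssi_score`, the dense-list vectors and the toy matrices U and V are assumptions made for this example; it is not the implementation used in the article.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def ssi_score(q, d, U, V):
    """Score q^T (U^T V + I) d without ever building the full word-by-word matrix.

    The identity part keeps the diagonal (exact word matches);
    the low-rank part models correlations between different words.
    """
    return dot(q, d) + dot(matvec(U, q), matvec(V, d))


# Toy setup (hypothetical): vocabulary of 4 words, rank 2.
U = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
V = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]

query = [1.0, 0.0, 0.0, 0.0]          # contains word 0 only
doc_same = [1.0, 0.0, 0.0, 0.0]       # shares word 0 with the query
doc_related = [0.0, 0.0, 1.0, 0.0]    # contains word 2, tied to word 0 by U and V
doc_unrelated = [0.0, 0.0, 0.0, 1.0]  # contains word 3, unrelated

for name, doc in [("same", doc_same), ("related", doc_related), ("unrelated", doc_unrelated)]:
    print(name, ssi_score(query, doc, U, V))
```

With a vocabulary of D words and rank N, this stores 2*N*D parameters instead of D*D, which is what makes the memory requirement feasible.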

To reduce the increasing amount of time spent on literature search
in the life sciences, several methods for automated knowledge
extraction have been developed. Co-occurrence based approaches can
deal with large text corpora like MEDLINE in an acceptable time
but are not able to extract any specific type of semantic
relation. Semantic relation extraction methods based on syntax
trees, on the other hand, are computationally expensive and the
interpretation of the generated trees is difficult. Several
natural language processing (NLP) approaches for the biomedical
domain exist focusing specifically on the detection of a limited
set of relation types. For systems biology, generic approaches for
the detection of a multitude of relation types which in addition
are able to process large text corpora are needed but the number
of systems meeting both requirements is very limited. We introduce
the use of SENNA (‘Semantic Extraction using a Neural Network
Architecture’), a fast and accurate neural network based Semantic
Role Labeling (SRL) program, for the large scale extraction of
semantic relations from the biomedical literature. A comparison of
processing times of SENNA and other SRL systems or syntactical
parsers used in the biomedical domain revealed that SENNA is the
fastest Proposition Bank (PropBank) conforming SRL program
currently available. 89 million biomedical sentences were tagged
with SENNA on a 100 node cluster within three days. The accuracy
of the presented relation extraction approach was evaluated on two
test sets of annotated sentences resulting in precision/recall
values of 0.71/0.43. We show that the accuracy as well as
processing speed of the proposed semantic relation extraction
approach is sufficient for its large scale application on
biomedical text. The proposed approach is highly generalizable
regarding the supported relation types and appears to be
especially suited for general-purpose, broad-scale text mining
systems. The presented approach bridges the gap between fast,
cooccurrence-based approaches lacking semantic relations and
highly specialized and computationally demanding NLP approaches.

We describe a novel simple and highly scalable semi-supervised
method called Word-Class Distribution Learning (WCDL), and apply
it to the task of information extraction (IE) by utilizing unlabeled
sentences to improve supervised classification methods. WCDL
iteratively builds class label distributions for each word in the
dictionary by averaging predicted labels over all cases in the
unlabeled corpus, and re-training a base classifier adding these
distributions as word features. In contrast, traditional
self-training or co-training methods add self-labeled examples
(rather than features) which can degrade performance due to
incestuous learning bias. WCDL exhibits robust behavior, and has
no difficult parameters to tune. We applied our method on German
and English named entity recognition (NER) tasks. WCDL shows
improvements over self-training, multi-task semi-supervision or
supervision alone, in particular yielding a state-of-the-art 75.72
F1 score on the German NER task.
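
The feature-building step described above can be sketched as follows: given the base classifier's predicted label distribution for every word occurrence in the unlabeled corpus, average them per dictionary word. The function name `word_class_distributions`, the two-label setup and the toy predictions are hypothetical; re-training the classifier with these features and iterating is left out.

```python
from collections import defaultdict


def word_class_distributions(predictions, num_classes):
    """Average predicted label distributions per word over an unlabeled corpus.

    predictions: iterable of (word, probs) pairs, one per word occurrence,
    where probs is the base classifier's distribution over num_classes labels.
    """
    sums = defaultdict(lambda: [0.0] * num_classes)
    counts = defaultdict(int)
    for word, probs in predictions:
        counts[word] += 1
        for k, p in enumerate(probs):
            sums[word][k] += p
    # One averaged distribution per dictionary word; these become extra word features.
    return {w: [s / counts[w] for s in sums[w]] for w in sums}


# Hypothetical classifier outputs for two labels (index 0 = "other", 1 = "location").
toy_predictions = [
    ("paris", [0.0, 1.0]),
    ("paris", [0.5, 0.5]),
    ("the", [1.0, 0.0]),
]
features = word_class_distributions(toy_predictions, num_classes=2)
print(features["paris"])  # [0.25, 0.75]
```

Because the unlabeled data only contributes averaged features rather than self-labeled training examples, a single wrong prediction has a limited effect on what the classifier sees at the next iteration.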

In this article we propose Supervised Semantic Indexing (SSI), an
algorithm that is trained on (query, document) pairs of text
documents to predict the quality of their match. Like Latent
Semantic Indexing (LSI), our models take account of correlations
between words (synonymy, polysemy). However, unlike LSI our models
are trained with a supervised signal directly on the ranking task
of interest, which we argue is the reason for our superior
results. As the query and target texts are modeled separately, our
approach is easily generalized to different retrieval tasks, such
as online advertising placement. Dealing with models on all pairs
of words features is computationally challenging. We propose
several improvements to our basic model for addressing this issue,
including low rank (but diagonal preserving) representations, and
correlated feature hashing (CFH). We provide an empirical study of
all these methods on retrieval tasks based on Wikipedia documents
as well as an Internet advertisement task. We obtain
state-of-the-art performance while providing realistically
scalable methods.

This work proposes a learning method for deep architectures that
takes advantage of sequential data, in particular from the
temporal coherence that naturally exists in unlabeled video
recordings. That is, two successive frames are likely to contain
the same object or objects. This coherence is used as a
supervisory signal over the unlabeled data, and is used to improve
the performance on a supervised task of interest. We demonstrate
the effectiveness of this method on some pose invariant object and
face recognition tasks.
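
The abstract does not spell out the loss, so the following is only a sketch under an assumed contrastive form: representations of two successive frames are pulled together, while representations of two non-successive frames are pushed at least a margin apart. The names `coherence_loss` and `margin`, and the choice of the L1 distance, are assumptions for illustration.

```python
def l1_distance(a, b):
    """L1 distance between two representation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def coherence_loss(z1, z2, consecutive, margin=1.0):
    """Assumed temporal coherence penalty on two frame representations.

    consecutive=True : frames are neighbours in the video, so penalise distance.
    consecutive=False: frames are unrelated, so penalise being closer than margin.
    """
    dist = l1_distance(z1, z2)
    if consecutive:
        return dist
    return max(0.0, margin - dist)


# Toy representations (hypothetical network outputs).
frame_t = [0.0, 0.0]
frame_t_plus_1 = [0.5, 0.0]
far_frame = [2.0, 0.0]

print(coherence_loss(frame_t, frame_t_plus_1, consecutive=True))   # 0.5
print(coherence_loss(frame_t, far_frame, consecutive=False))       # 0.0
```

In training, such a term would be minimised on unlabeled video alongside the usual supervised loss on the labeled task.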

Humans and animals learn much better when the examples are not
randomly presented but organized in a meaningful order which
illustrates gradually more concepts, and more complex ones. Here,
we formalize such training strategies in the context of machine
learning, and call them curriculum learning. In the context of
recent research studying the difficulty of training in the
presence of non-convex training criteria (for deep deterministic
and stochastic neural networks), we explore curriculum learning in
various set-ups. The experiments show that significant
improvements in generalization can be achieved by using a
particular curriculum, i.e., the selection and order of training
examples. We hypothesize that curriculum learning has both an
effect on the speed of convergence of the training process to a
minimum and, in the case of non-convex criteria, on the quality of
the local minima obtained: curriculum learning can be seen as a
particular form of continuation method (a general strategy for
global optimization of non-convex functions).
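
A curriculum in the sense above (a selection and order of training examples) can be sketched as a sequence of growing training sets, easiest examples first. The difficulty measure used here (sentence length) and the names `curriculum_stages` and `num_stages` are assumptions chosen for the example, not the set-ups studied in the paper.

```python
def curriculum_stages(examples, difficulty, num_stages):
    """Return num_stages training sets of increasing difficulty.

    Stage k contains the easiest k/num_stages fraction of the examples,
    so the last stage is the full training set.
    """
    ordered = sorted(examples, key=difficulty)
    stages = []
    for s in range(1, num_stages + 1):
        # Ceiling division so that every example is included by the last stage.
        cutoff = (len(ordered) * s + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages


# Toy data: treat shorter sentences as easier (an assumption).
sentences = ["a b c d", "a", "a b", "a b c"]
stages = curriculum_stages(sentences, difficulty=lambda s: len(s.split()), num_stages=2)
for i, stage in enumerate(stages, start=1):
    print("stage", i, stage)
```

A model would then be trained on each stage in turn, starting the next stage from the parameters reached on the previous one.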

We present a class of models that are discriminatively trained to
directly map from the word content in a query-document or
document-document pair to a ranking score. Like Latent Semantic
Indexing (LSI), our models take account of correlations between
words (synonymy, polysemy). However, unlike LSI our models are
trained with a supervised signal directly on the task of interest,
which we argue is the reason for our superior results. We provide
an empirical study on Wikipedia documents, using the links to
define document-document or query-document pairs, where we obtain
state-of-the-art performance using our method.

2008

Torch provides a Matlab-like environment for state-of-the-art machine
learning algorithms. It is easy to use and very efficient, thanks to a
simple-yet-powerful fast scripting language (Lua), and an underlying C/C++
implementation. Torch is easily extensible and has been shown to scale to
very large applications.

We show how the regularizer of Transductive Support Vector Machines (TSVM)
can be trained by stochastic gradient descent for linear models and
multi-layer architectures. The resulting methods can be trained
online, have vastly superior training and testing speed to existing TSVM
algorithms, can encode prior knowledge in the network architecture, and
obtain competitive error rates. We then go on to propose a natural
generalization of the TSVM loss function that takes into account
neighborhood and manifold information directly, unifying the two-stage Low
Density Separation method into a single criterion, and leading to
state-of-the-art results.
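
To make the stochastic gradient view concrete, here is a minimal sketch for a linear model. It assumes the TSVM term on an unlabeled point is the symmetric hinge max(0, 1 - |f(x)|), which pushes unlabeled points away from the decision boundary; the function names and the learning rate are hypothetical, and the balancing constraint and labeled-data term are omitted.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def unlabeled_hinge(score):
    """Symmetric hinge: zero once the point is at least 1 away from the boundary."""
    return max(0.0, 1.0 - abs(score))


def sgd_step_unlabeled(w, x, lr=0.1):
    """One subgradient step on unlabeled_hinge(w.x) for a linear model."""
    score = dot(w, x)
    if abs(score) >= 1.0:
        return list(w)  # already outside the margin: no update
    sign = 1.0 if score >= 0 else -1.0  # side the model currently leans towards
    return [wi + lr * sign * xi for wi, xi in zip(w, x)]


w = [0.5, 0.0]
x = [1.0, 0.0]
new_w = sgd_step_unlabeled(w, x)
print(unlabeled_hinge(dot(w, x)), unlabeled_hinge(dot(new_w, x)))
```

Each step only touches one example, which is what allows online training and carries over to multi-layer architectures by backpropagating the same subgradient.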

We describe a single convolutional neural network architecture that, given
a sentence, outputs a host of language processing predictions:
part-of-speech tags, chunks, named entity tags, semantic roles,
semantically similar words and the likelihood that the sentence makes sense
(grammatically and semantically) using a language model. The entire
network is trained jointly on all these tasks using weight-sharing, an
instance of multitask learning. All the tasks use labeled data except
the language model which is learnt from unlabeled text and represents a
novel form of semi-supervised learning for the shared tasks. We show how
both multitask learning and semi-supervised learning improve the
generalization of the shared tasks, resulting in state-of-the-art
performance.

We show how nonlinear embedding algorithms popular for use with shallow
semi-supervised learning techniques such as kernel methods can be applied
to deep multi-layer architectures, either as a regularizer at the output
layer, or on each layer of the architecture. This provides a simple
alternative to existing approaches to deep learning whilst yielding
competitive error rates compared to those methods, and existing shallow
semi-supervised techniques.

2007

R. Collobert and J. Weston. Fast Semantic Extraction Using a Novel Neural Network Architecture. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 560-567, June 2007.

We describe a novel neural network architecture for the problem of semantic
role labeling. Many current solutions are complicated, consist of several
stages and handbuilt features, and are too slow to be applied as part of
real applications that require such semantic labels, partly because of
their use of a syntactic parser (Pradhan et al., 2004; Gildea and Jurafsky,
2002). Our method instead learns a direct mapping from source sentence to
semantic tags for a given predicate without the aid of a parser or a
chunker. Our resulting system obtains accuracies comparable to the current
state-of-the-art at a fraction of the computational cost.

In this paper we study a new framework introduced by Vapnik (1998; 2006)
that is an alternative capacity concept to the large margin approach. In
the particular case of binary classification, we are given a set of labeled
examples, and a collection of [...] leverage the Universum by maximizing the number
of observed contradictions, and show experimentally that this approach
delivers accuracy improvements over using labeled data alone.

We show how the Concave-Convex Procedure can be applied to Transductive SVMs, which
traditionally require solving a combinatorial search problem. This provides for the first
time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the
utility of our approach. Software is available at
http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html.

Convex learning algorithms, such as Support Vector Machines (SVMs), are
often seen as highly desirable because they offer strong practical
properties and are amenable to theoretical analysis. However, in this work
we show how non-convexity can provide scalability advantages over
convexity. We show how concave-convex programming can be applied to produce
(i) faster SVMs where training errors are no longer support vectors, and
(ii) much faster Transductive SVMs.
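
The abstract does not name the loss involved; one loss that fits the description is the ramp loss, sketched below as an assumption. It can be written as the difference of two hinge losses, i.e. a convex part plus a concave part, which is the form concave-convex programming needs. The parameter `s` and the function names are chosen for this example.

```python
def hinge(z, s=1.0):
    """Hinge loss with offset s, as a function of the margin z = y * f(x)."""
    return max(0.0, s - z)


def ramp(z, s=-1.0):
    """Ramp loss as a difference of two hinge losses (requires s < 1).

    hinge(z, 1.0) is convex, -hinge(z, s) is concave.
    For z <= s the loss is flat at 1 - s, so its slope there is zero.
    """
    return hinge(z, 1.0) - hinge(z, s)


for z in [2.0, 0.0, -5.0, -100.0]:
    print(z, hinge(z), ramp(z))
```

Badly misclassified examples sit on the flat part of the ramp, so they stop contributing to the solution, whereas with the plain hinge loss every training error remains a support vector.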

2004

This thesis aims to address machine learning in general, with a particular
focus on large models and large databases. After introducing the learning
problem in a formal way, we first review several important machine learning
algorithms, particularly Multi Layer Perceptrons, Mixture of Experts and
Support Vector Machines. We then present a training method for Support
Vector Machines, adapted to reasonably large datasets. However the training
of such a model is still intractable on very large databases. We thus
propose a divide and conquer approach based on a kind of Mixture of Experts
in order to break up the training problem into small pieces, while keeping
good generalization performance. This mixture model can be applied to any
kind of existing machine learning algorithm. Even though it performs well
in practice the major drawback of this algorithm is the number of
hyper-parameters to tune, which makes it difficult to use. We thus prefer
afterward to focus on training improvements for Multi Layer Perceptrons,
which are easier to tune, and more suitable than Support Vector Machines
for large databases. We finally show that the margin idea introduced with
Support Vector Machines can be applied to a certain class of Multi Layer
Perceptrons, which leads to a fast algorithm with powerful generalization
performance.

R. Collobert and S. Bengio. Links Between Perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML, 2004.

We propose to study links between three important classification
algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector
Machines (SVMs). We first study ways to control the capacity of Perceptrons
(mainly regularization parameters and early stopping), using the margin
idea introduced with SVMs. After showing that under simple conditions a
Perceptron is equivalent to an SVM, we show it can be computationally
expensive in time to train an SVM (and thus a Perceptron) with stochastic
gradient descent, mainly because of the margin maximization term in the
cost function. We then show that if we remove this margin maximization
term, the learning rate or the use of early stopping can still control the
margin. These ideas are extended afterward to the case of MLPs. Moreover,
under some assumptions it also appears that MLPs are a kind of mixture of
SVMs, maximizing the margin in the hidden layer space. Finally, we present
a very simple MLP based on the previous findings, which yields better
performances in generalization and speed than the other models.
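
As a sketch of the setting discussed above, the code below trains a linear model with stochastic gradient descent on the hinge loss only, with no margin maximization (weight decay) term; the learning rate and the number of epochs are then the only handles on the solution. The function name, the toy data and the hyper-parameter values are assumptions for illustration, not the experimental setup of the paper.

```python
def train_margin_perceptron(data, lr=0.1, epochs=20):
    """SGD on the hinge loss max(0, 1 - y * (w.x + b)) without a regularizer.

    data: list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):  # stopping early here is one way to limit capacity
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:  # inside the margin: take a subgradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


# Linearly separable toy problem (hypothetical).
toy_data = [
    ([2.0, 0.0], 1),
    ([1.5, 0.5], 1),
    ([-2.0, 0.0], -1),
    ([-1.5, -0.5], -1),
]
w, b = train_margin_perceptron(toy_data)
margins = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in toy_data]
print(w, b, margins)
```

Replacing the threshold 1.0 by 0.0 in the update test gives the classical Perceptron rule, which is the link to SVMs that the paper builds on.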

Neural networks with the right criterion (like a hinge loss) work well,
with better scaling properties than SVMs...
Also, each neuron in the hidden layer of a neural network acts
interestingly as a kind of SVM, on a subset of the training set.

Several second-order optimization methods for gradient descent algorithms
have been proposed over the years, but they usually need to compute the
inverse of the Hessian of the cost function (or an approximation of this
inverse) during training. In most cases, this leads to an O(n^2) cost in
time and space per iteration, where n is the number of parameters, which
is prohibitive for large n. We propose instead a study of the Hessian
before training. Based on a second order analysis, we show that a
block-diagonal Hessian yields an easier optimization problem than a full
Hessian. We also show that the condition of block-diagonality in common
machine learning models can be achieved by simply selecting an appropriate
training criterion. Finally, we propose a version of the SVM criterion
applied to MLPs, which verifies the aspects highlighted in this second
order analysis, but also yields very good generalization performance in
practice, taking advantage of the margin effect. Several empirical
comparisons on two benchmark datasets are given to illustrate this
approach.

Probably because neural networks were studied in the past on very small
databases, many people believe neural networks overfit easily. I would
correct this: if not well tuned (like an SVM having a Gaussian kernel with a
small variance!) neural networks do overfit. But in fact, in many cases,
they are hard to train.
We show here that the choice of the architecture itself has an impact on
the optimization.
In particular we show that the margin criterion used in SVMs is well suited
for neural network optimization: with the hinge loss, the Hessian is better
conditioned than with classical losses like Mean Squared Error.

A challenge for statistical learning is to deal with large data sets,
e.g. in data mining. The training time of ordinary Support Vector
Machines is at least quadratic, which raises a serious research challenge
if we want to deal with data sets of millions of examples. We propose a
"hard parallelizable mixture" methodology which yields significantly
reduced training time through modularization and parallelization: the
training data is iteratively partitioned by a "gater" model in such a way
that it becomes easy to learn an "expert" model separately in each region
of the partition. A probabilistic extension and the use of a set of
generative models allows representing the gater so that all pieces of the
model are locally trained. For SVMs, time complexity appears empirically
to locally grow linearly with the number of examples, while
generalization performance can be enhanced. For the probabilistic version
of the algorithm, the iterative algorithm provably goes down in a cost
function that is an upper bound on the negative log-likelihood.

The aim was to use a divide-and-conquer method to break up the SVM
complexity and solve large scale classification tasks. While these mixtures
do work, they are unfortunately quite difficult to tune, because of the
additional hyper-parameters involved in the architecture.
This paper was originally presented at the
International Workshop on Pattern Recognition with Support Vector Machines (SVM'2002).
The original paper, with fewer experiments and
without probabilistic mixtures, has been published in NIPS.
A variant, including more experiments than the NIPS version
has been published in Neural Computation.

We present an overview of recent research at IDIAP on speech & face based
biometric authentication. This paper covers user-customised passwords,
adaptation techniques, confidence measures (for use in fusion of audio &
visual scores), face verification in difficult image conditions, as well as
other related research issues. We also overview the open source Torch
library, which has aided in the implementation of the above mentioned
techniques.

2002

Many scientific communities have expressed a growing interest in machine
learning algorithms recently, mainly due to the generally good results they
provide, compared to traditional statistical or AI approaches. However,
these machine learning algorithms are often complex to implement and to use
properly and efficiently. We thus present in this paper a new machine
learning software library in which most state-of-the-art algorithms have
already been implemented and are available in a unified framework, in order
for scientists to be able to use them, compare them, and even extend them
for their own purposes. More interestingly, this library is freely
available under a BSD license and can be retrieved from the web by
everyone.

This presented the first version of the Torch machine learning library. Several versions
have been developed since then, culminating with Torch5,
the official last version.

Support Vector Machines (SVMs) are currently the state-of-the-art models
for many classification problems but they suffer from the complexity of
their training algorithm which is at least quadratic with respect to the
number of examples. Hence, it is hopeless to try to solve real-life
problems having more than a few hundred thousand examples with
SVMs. The present paper proposes a new mixture of SVMs that can be easily
implemented in parallel and where each SVM is trained on a small subset of
the whole dataset. Experiments on a large benchmark dataset (Forest) as
well as a difficult speech database, yielded significant time improvement
(time complexity appears empirically to locally grow linearly with the
number of examples). In addition, and that is a surprise, a significant
improvement in generalization was observed on Forest.

This is our first paper on Mixture of SVMs. The aim was to use a
divide-and-conquer method to break up the SVM complexity and solve large
scale classification tasks. While these mixtures do work, they are
unfortunately quite difficult to tune, because of the additional
hyper-parameters involved in the architecture.
A variant of this paper, with more experiments,
has been published in Neural Computation.
An extended version, including more experiments
and probabilistic mixtures has been published in IJPRAI and presented at SVM'2002.

Support Vector Machines (SVMs) are currently the state-of-the-art models
for many classification problems but they suffer from the complexity of
their training algorithm which is at least quadratic with respect to the
number of examples. Hence, it is hopeless to try to solve real-life
problems having more than a few hundred thousand examples with
SVMs. The present paper proposes a new mixture of SVMs that can be easily
implemented in parallel and where each SVM is trained on a small subset of
the whole dataset. Experiments on a large benchmark dataset (Forest)
yielded significant time improvement (time complexity appears empirically
to locally grow linearly with the number of examples). In addition, and
that is a surprise, a significant improvement in generalization was
observed.

The aim was to use a divide-and-conquer method to break up the SVM
complexity and solve large scale classification tasks. While these mixtures
do work, they are unfortunately quite difficult to tune, because of the
additional hyper-parameters involved in the architecture.
The original paper, with fewer experiments, has
been published in NIPS.
An extended version, including more experiments
and probabilistic mixtures has been published in IJPRAI and presented at SVM'2002.

2001

Support Vector Machines (SVMs) for regression problems are trained by
solving a quadratic optimization problem which needs on the order of l^2
memory and time resources to solve, where l is the number of
training examples. In this paper, we propose a decomposition algorithm,
SVMTorch (available at
http://www.idiap.ch/learning/SVMTorch.html),
which is similar to SVM-Light proposed by Joachims (1999) for
classification problems, but adapted to regression problems. With this
algorithm, one can now efficiently solve large-scale regression problems
(more than 20000 examples). Comparisons with Nodelib, another publicly
available SVM algorithm for large-scale regression problems from Flake and
Lawrence (2000) yielded significant time improvements. Finally, based on a
recent paper from Lin (2000), we show that a convergence proof exists for
our algorithm.

Our contribution extends Joachims' ideas to the regression SVM
problem. Though nowadays it may seem obvious, curiously it was not the
technique used to train regression SVMs at the time we proposed this
extension.