February 13-17, 2012, Prague, Czech Republic

Invited Speakers

Since Frege, formal semanticists have focused on the interface between syntax
and semantics and on the role played by grammatical words in guiding the
composition of content words. In brief, the meaning of a word can be an object
(e.g., a proper name), a set of objects (e.g., a noun), or a set of sets
(e.g., quantifying determiners); sets are represented as functions, and function
application and abstraction are employed to carry out meaning assembly guided by
the syntactic structure. The FS view on the meaning of words has been found
unsatisfactory by those who care about empirical analyses: FS models do not
handle the richness of lexical meaning, and they are typically not accompanied
by learning methods. The empirical view has led to the development of a
framework known as Distributional Semantic Models (DSMs). In brief, the
meaning of content words is approximated by vectors that summarize
their distribution in large text corpora.
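As a minimal illustration of this set-based view, sets can be modelled directly
and a quantifying determiner like "every" becomes a higher-order function; the
toy entities below are invented for the example:

```python
# A toy model-theoretic fragment: sets of entities stand in for word
# meanings, and a quantifying determiner is a higher-order function.
student = {"ann", "bob"}            # [[student]]: a set of entities
sleeps  = {"ann", "bob", "carol"}   # [[sleeps]]: a set of entities

def every(P):
    # [[every]]: takes a restrictor set P and returns a function that
    # takes a scope set Q and yields True iff P is a subset of Q.
    return lambda Q: P <= Q

# Function application guided by syntactic structure:
print(every(student)(sleeps))   # True: every student sleeps
```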

Recently, these two research trends have converged into Distributional
Compositionality: a number of works have investigated how to
incorporate compositionality into DSMs, in order to construct vectorial
representations for linguistic constituents above the word (e.g.,
Baroni and Zamparelli, 2010; Clarke et al., 2011; Erk et al., 2010;
Grefenstette and Sadrzadeh, 2011; Mitchell and Lapata, 2010; Thater et
al., 2010). Moreover, grammatical words, and in particular logical
words (quantifying determiners, coordination, negation), which have long
been considered part of the formal semantics realm only,
have become of interest within the DSM framework too.
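As a sketch of the two simplest composition operations studied in this
literature, the additive and multiplicative models of Mitchell and Lapata
(2010), with toy vectors invented for the example:

```python
import numpy as np

# Toy distributional vectors; in practice these would be co-occurrence
# statistics gathered from a large corpus.
red = np.array([0.1, 0.9, 0.3])
car = np.array([0.8, 0.2, 0.5])

# Two simple composition functions for the phrase "red car":
additive       = red + car   # vector addition
multiplicative = red * car   # component-wise multiplication

print(additive)         # [0.9 1.1 0.8]
print(multiplicative)   # [0.08 0.18 0.15]
```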

DSMs have been evaluated on different semantic tasks. For instance,
they have proved very successful at modeling a wide range of
lexico-semantic phenomena by geometric methods applied to the
distributional space (Turney and Pantel, 2010). Sentential entailment
has been the starting point of the logical view on natural language,
but it is also a valid task for any theory which aims to capture the
semantics of a language. Hence, predicting entailment is a good
test-bed for Compositional DSMs too.
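The prototypical geometric method here is cosine similarity between
distributional vectors; a minimal sketch, with the vectors again invented
for the example:

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, 0.0 means orthogonal (unrelated distributions).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dog = np.array([0.7, 0.3, 0.2])
cat = np.array([0.6, 0.4, 0.1])
print(cosine(dog, cat))   # close to 1: distributionally similar words
```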

In the course, after a brief general introduction to FS and DS models,
I will present various state-of-the-art approaches to composition in
DS, discuss the ways in which they have been evaluated, and point to
ongoing work within this line of research.

Jan Hajič, Jakub Mlynář:
The MALACH Project: Research and Access to the Memories of Holocaust Survivors

The Shoah Foundation Institute's
(University of Southern California,
Los Angeles, USA) archive of the memories ("testimonies")
of Holocaust survivors will be described, together with the technology used
for the creation, indexing (cataloguing) and search of the archive.
The "Malach" project, which ran from 2002 to 2007, attempted
to develop technology for automatic indexing of and access to the archive;
its technological achievements will be described.
At present, new speech and language translation technologies are being
developed to allow for more sophisticated and broader search possibilities
in this and similar huge audio and video archives. These technologies will
also be presented. The archive itself (or rather, the
Access Point
to the archive, located in the same building) will be shown
to interested students
later in the week.

DeepDict
is a lexical database with a graphical interface, built from large text corpora annotated
with Constraint Grammar dependency links as well as various morphological,
syntactic and semantic tags. It allows the user to view collocational
and frequency information for typical lexically governed constructions,
such as "vote + on + proposal / amendment / report / resolution ...." or
"ride / breed / frighten / tame + horse".

The lecture series will discuss not only (a) DeepDict's uses and
perspectives, but also (b) the annotation system and (c) the Constraint
Grammar parsing technology behind it, dedicating one session to each of
these issues.

Since DeepDict is available for 10 different languages, it will be
possible to accommodate participants' individual language interests to a
certain degree. Also, given the modular and language-independent
architecture of Constraint Grammar systems, workshop-style discussion of
individual project ideas is strongly encouraged.

Further, interest-based reading: Eckhard Bick's publications page, with
articles on various aspects of Constraint Grammar and its applications,
targeting a range of languages:
http://beta.visl.sdu.dk/Artikeloversigt.html

Suppose you have one billion tweets. How do you process
and manage this vast amount of information? These three classes will
discuss background ideas associated with Big Data and will give an
overview of techniques for dealing with it: MapReduce, randomised
algorithms (fingerprinting, Bloom filters, Locality Sensitive
Hashing) and streaming. Examples from natural language processing
will be used. Technical aspects will be kept to a minimum, and where
possible everything will be explained from scratch.

Outline:

Lecture One: Big Data, Economics and Obstacles

This class will look at the problems and challenges associated with
processing massive amounts of data using commodity machines (i.e., cloud
computing). We will touch upon questions of trust and economics, as well
as those aspects of Big Data which make it hard to deal with.

Lecture Two: MapReduce and Hadoop

Given the background material presented in Lecture One, Lecture Two will
give an overview of one popular way to solve problems using large numbers
of unreliable machines. This class will introduce Hadoop (the open-source
implementation of MapReduce) and the MapReduce programming model, discuss
efficiency concerns, and end with a critique.
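As a toy illustration of the MapReduce programming model, word counting
sketched in plain Python; a real job would run distributed under Hadoop,
while this merely mimics the map, shuffle and reduce phases on one machine:

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every token in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group emitted values by key, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one key.
    return key, sum(values)

docs = ["big data big ideas", "big machines"]
emitted = [pair for doc in docs for pair in map_phase(doc)]
print([reduce_phase(k, vs) for k, vs in shuffle(emitted).items()])
# [('big', 3), ('data', 1), ('ideas', 1), ('machines', 1)]
```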

Lecture Three: Randomised Algorithms

Sometimes we need to deal with problems that are just too large for our
machines. Randomised algorithms allow us to tackle such problems and
can be amazingly fast (or compact). However, unlike conventional
approaches, they can make mistakes. This class will show how two
problems in natural language processing -- representing large language
models and finding breaking news in Twitter -- can be solved using
Bloom filters and Locality Sensitive Hashing.
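A minimal Bloom filter sketch (the size and number of hash functions are
picked arbitrarily for the example): lookups never yield false negatives
but may yield false positives, which is the price of the compactness the
class will discuss:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("breaking news")
print("breaking news" in bf, "old news" in bf)   # True False (almost surely)
```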

Speech is an increasingly important medium of interaction with computer
systems. With the advent of mobile applications, this is becoming even
more important, as people find it easier to talk to their phones than to
type on them. This lecture series will discuss how to build systems
which interact via speech, called spoken dialogue systems, using
statistical techniques. Topics covered will include supervised learning
of shallow semantics and dialogue acts from text, reinforcement learning
of dialogue strategies, Markov Decision Processes (MDPs) and Partially
Observable Markov Decision Processes (POMDPs).
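To give a flavour of the MDP view of dialogue, below is value iteration on a
tiny invented dialogue MDP; all states, actions, probabilities and rewards
are made up for this sketch, whereas a real system would estimate them from
interaction data:

```python
# A tiny invented dialogue MDP; everything here is illustrative only.
states = ["need_info", "has_info", "done"]

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "need_info": {"ask":     [(0.8, "has_info", 0.0), (0.2, "need_info", -1.0)],
                  "confirm": [(1.0, "need_info", -1.0)]},
    "has_info":  {"ask":     [(1.0, "has_info", -1.0)],
                  "confirm": [(0.9, "done", 10.0), (0.1, "need_info", -1.0)]},
    "done":      {},  # terminal state
}

gamma = 0.95                  # discount factor
V = {s: 0.0 for s in states}  # state values, initially zero

def q(s, a, V):
    # Expected discounted return of taking action a in state s.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

for _ in range(100):          # value iteration sweeps
    V = {s: max((q(s, a, V) for a in transitions[s]), default=0.0)
         for s in states}

# The learned dialogue strategy: the best action in each state.
policy = {s: max(transitions[s], key=lambda a: q(s, a, V))
          for s in states if transitions[s]}
print(policy)   # {'need_info': 'ask', 'has_info': 'confirm'}
```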

Luke S. Zettlemoyer and Michael Collins (2009). Learning
Context-Dependent Mappings from Sentences to Logical Form. In
Proceedings of the Joint Conference of the Association for Computational
Linguistics and the International Joint Conference on Natural Language
Processing (ACL-IJCNLP).

Research in machine translation (MT) is far from a stable state.
Many competing paradigms are being examined, many specific setups sound
plausible, and many hybrid methods are possible. The purpose of the course
is to get acquainted with yet another experiment management tool (eman)
and apply it to experiments with (factored) phrase-based MT. We will
build on top of the outputs created in the labs of 'Natural Language
Processing with Treex' (Martin Popel).

General experience with experimenting in a Unix environment is expected.
(Unlike, e.g., the Experiment Management System distributed with Moses,
eman is versatile and task-independent; it is only the specific set of
'seed scripts' that covers the Moses training and evaluation pipeline.)

Exposure to phrase-based MT and exploration of some relevant parameters.

Moses Manual ... especially Section 4 Background
http://www.statmt.org/moses/manual/manual.pdf
You may want to get Moses running on your laptop, but you will be
provided with Unix servers that have Moses pre-installed.

Treex
is a highly modular, multi-purpose, multi-lingual, easily extendable
Natural Language Processing framework.
A number of NLP tools are already integrated in Treex, such as
morphological taggers, lemmatizers, named entity recognizers,
dependency parsers, constituency parsers and various kinds of
dictionaries. Treex allows storing all data in a rich XML-based format
as well as in several other popular formats (CoNLL, Penn MRG), which
simplifies data interchange with other frameworks. Treex is tightly
coupled with the tree editor TrEd,
which allows easy visualization of syntactic structures. Treex is
language-universal, supports processing of multilingual parallel data,
and facilitates distributed processing on a computer cluster. One of
the most sophisticated applications developed in Treex is the
deep-syntactic machine translation system TectoMT.