Currently, I am working at the IULA in Barcelona, Spain on a
number of projects related to corpus linguistics, with a focus on
specialized languages and neologisms.

Open Source Lexical Information Network

Between May 2004 and May 2008 I worked at the ILTEC in Lisbon, Portugal on a
number of related tools, which together are intended to form OSLIN:
an open source lexical information network. The heart of this network is
MorDebe, a morphological database system developed in a
language-independent way and currently filled with over 125,000 Portuguese
lemmas and around 1.5 million word forms. MorDebe is used together with
NeoTrack for the semi-automatic detection of neologisms in online
newspapers.

Automatic extraction of semantic
relations from corpora using linguistic markers

In the first two semesters
of 2003 I worked at the ERSS in Toulouse, France in the field of
computational terminology, on the automatic extraction of semantic
relations from corpora. The methodology was first described by Hearst
(1992): use patterns of text (often called linguistic markers) to find
implicit and explicit mentions of semantic relations in a text corpus.
The basic idea is best made clear with a simple example: if a text
contains the sentence "This rod is best for greylings and other
trouts.", it implicitly claims that greylings are trouts. By finding all
such implicitly expressed relations in a corpus, one can build a
(partial) ontology.
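As an illustration (not the actual ERSS tooling), the "X and other Y" marker can be sketched in a few lines of Python; the pattern and function names here are my own:

```python
import re

# Hypothetical sketch of one Hearst-style linguistic marker:
# "X and other Y" implies that X is a kind of Y.
PATTERN = re.compile(r"(\w+) and other (\w+)")

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found via the 'and other' marker."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

pairs = extract_hyponyms("This rod is best for greylings and other trouts.")
print(pairs)  # → [('greylings', 'trouts')]
```

Real systems use many more markers (e.g. "such as", "including") and match over lemmas and part-of-speech tags rather than raw strings, which is where a tagged corpus comes in.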

As part of this project, I wrote a multilingual
concordancer for annotated corpora (YakwaSI), based on the
Yakwa concordancer by Ludovic Tanguy. YakwaSI can search aligned corpora
for strings of words, lemmata, and syntactic categories, based on a
POS-tagged corpus, currently using either Cordial or TreeTagger.

The Application of Formal Concept Analysis to a
Multilingual Lexical Database

The topic of my
thesis was the application of FCA to Multilingual Lexical Databases
(see the SIMuLLDA home
page). A brief description of its research question:

There are a
lot of different bilingual dictionaries available in the world. Still,
there is not a bilingual dictionary for every pair of languages. If you
consider two 'minor' languages like Malay/Indonesian and Hungarian,
there is only a very slim chance that you will find a dictionary
translating between these two languages. This is not surprising: there
are several thousand different languages, so full coverage would require
many millions of bilingual dictionaries.

The way to also get bilingual
dictionaries between every pair of 'minor' languages, say
Malay/Indonesian and Hungarian, is not to create them by hand (since
that would take far too much time), but to construct a Multilingual
Lexical Database (MLLD), which contains many languages and which can be
used to generate a bilingual dictionary between any pair of them. If all
languages used the same notions, merely expressed by different words,
such an MLLD would hardly be problematic. However, different languages
often use different words, for instance because one language makes more
subtle distinctions than another; whereas Hungarian has only one word
for RICE (rizs), Indonesian has four: padi for
rice as it grows in the field, gabah for rice that has been
harvested but not processed, beras for rice that has been husked
and hulled, and nasi for cooked rice.

In order to get the
MLLD to do what you want (automatically create bilingual dictionaries),
you need a system that is powerful enough to deal with all these
subtle differences. The purpose of the thesis is to test whether a
logical framework called Formal Concept Analysis is powerful enough to
function as the structural core of such an MLLD.
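To give a feel for how FCA handles the rice example, here is a toy Python sketch that enumerates the formal concepts of a small formal context; the attribute names are my own illustrative choices, not those used in the thesis:

```python
from itertools import chain, combinations

# A toy formal context for the RICE example: objects (Indonesian words)
# mapped to illustrative attributes (chosen for this sketch).
context = {
    "padi":  {"rice"},
    "gabah": {"rice", "harvested"},
    "beras": {"rice", "harvested", "husked"},
    "nasi":  {"rice", "harvested", "husked", "cooked"},
}

def extent(attrs):
    """Objects that have all of the given attributes."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by all of the given objects."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set(chain(*context.values()))

def concepts():
    """Enumerate all formal concepts (extent, intent) by brute force:
    close every attribute subset and collect the distinct results."""
    attrs = sorted(set(chain(*context.values())))
    found = set()
    for r in range(len(attrs) + 1):
        for combo in combinations(attrs, r):
            e = extent(set(combo))
            found.add((frozenset(e), frozenset(intent(e))))
    return found

for e, i in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

Because the four Indonesian words form a chain of increasingly specific attributes, this context yields four formal concepts, ordered in a line from the most general (everything that is rice) to the most specific (nasi, cooked rice); the Hungarian rizs would attach to the top of that chain.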