I've got a job at Google in London!
It is a research position in Rob Clark's Text-To-Speech group. The research will be about text-to-speech and natural language understanding, and in particular about how the latter can help improve the former.

My paper "Byte-level Machine Reading across Morphologically Varied Languages" with Llion Jones and Daniel Hewlett of Google Research is accepted to the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), in New Orleans.
This paper is based on the research I did during my internship at Google Research in Mountain View, California.

I was asked to join the program committee of the 2018 edition of The Web Conference (27th edition of the former WWW conference), in Lyon, France.
Yes, that is right: WWW 2018 has been rebranded as The Web Conference this year.

I have been invited to join the Program Committee of KDD 2017, a premier interdisciplinary conference bringing together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data, which will be held in Halifax, Nova Scotia, Canada, August 13-17, 2017.

Organising BNAIC 2016 was a lot of fun. I debuted as a session chair in the Natural Language Processing session, and I was also the Demo chair on the organising committee. We had a very nice demo session, I think, with "Autonomous Robot Soccer Matches" by Caitlin Lagrand et al. winning the BNAIC SKBS Demo Award.

Great stuff!! My full paper Siamese CBOW: Optimizing Word Embeddings for Sentence Representations, which I wrote together with Alexey Borisov and Maarten de Rijke, has been accepted for ACL 2016, which will be held in Berlin.

We present the Siamese Continuous Bag of Words (Siamese CBOW) model, a neural network for efficient estimation of high-quality sentence embeddings. Averaging the embeddings of words in a sentence has proven to be a surprisingly successful and efficient way of obtaining sentence embeddings. However, word embeddings trained with the methods currently available are not optimized for the task of sentence representation, and are thus likely to be suboptimal. Siamese CBOW handles this problem by training word embeddings directly for the purpose of being averaged. The underlying neural network learns word embeddings by predicting, from a sentence representation, its surrounding sentences. We show the robustness of the Siamese CBOW model by evaluating it on 20 datasets stemming from a wide variety of sources.
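For the curious, here is what the gist of the objective looks like in code. This is a minimal PyTorch sketch of my own, not the implementation used in the paper: sentence embeddings are plain averages of word embeddings, and the loss softmax-normalises the cosine similarities between a sentence and a set of candidate sentences, of which the true neighbouring sentences are the positive examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCBOW(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        # One embedding matrix, shared by all branches of the Siamese network.
        self.emb = nn.Embedding(vocab_size, dim)

    def embed_sentence(self, token_ids):
        # A sentence embedding is simply the average of its word embeddings,
        # so the word embeddings are optimised directly for being averaged.
        return self.emb(token_ids).mean(dim=1)  # (batch, dim)

    def forward(self, sentence, positives, negatives):
        # sentence: (batch, len); positives/negatives: (batch, k, len),
        # all padded to the same length (padding is ignored here for brevity).
        s = F.normalize(self.embed_sentence(sentence), dim=-1)
        cands = torch.cat([positives, negatives], dim=1)
        b, n, l = cands.shape
        c = F.normalize(
            self.embed_sentence(cands.view(b * n, l)).view(b, n, -1), dim=-1)
        # Cosine similarity between the sentence and every candidate,
        # softmax-normalised across candidates; the loss pushes probability
        # mass towards the true surrounding sentences (the positives).
        logits = torch.bmm(c, s.unsqueeze(-1)).squeeze(-1)  # (batch, n)
        log_p = F.log_softmax(logits, dim=-1)
        return -log_p[:, :positives.size(1)].mean()
```

Minimising this loss with any standard optimiser, over triples of a sentence, its neighbouring sentences, and randomly sampled negative sentences, yields word embeddings tuned for averaging.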

I am quite thrilled and honoured by this... I was interviewed by the New Scientist.
The interview is titled Will computers ever be able to understand language? (in Dutch). It is about my research on sentence similarity, and also a bit about the state of affairs in natural language processing in general.

Today, Agnes van Belle, an AI master's student I supervised, graduated.
She wrote a nice thesis called Historical Document Retrieval with Corpus-derived Rewrite Rules.
Spelling changes often occur gradually (even when they are government-imposed), and the thesis shows that this continuum of gradual change can be exploited when doing query expansion for historical document retrieval.
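As a toy illustration of what such rewrite rules do (the rules below are invented for the example, not the corpus-derived rules from the thesis): query terms are expanded with the candidate historical spelling variants the rules generate.

```python
# A toy illustration only: these Dutch rewrite rules are made up for the
# example; in the thesis the rules are derived from the corpus itself.
REWRITE_RULES = [
    ("s", "sch"),   # modern -> historical, e.g. "vis" -> "visch"
    ("aa", "ae"),   # e.g. "kaas" -> "kaes"
    ("ij", "y"),    # e.g. "tijd" -> "tyd"
]

def expand_query(terms):
    """Expand each query term with candidate historical spelling variants."""
    expanded = set(terms)
    for term in terms:
        for modern, historical in REWRITE_RULES:
            if modern in term:
                expanded.add(term.replace(modern, historical))
    return sorted(expanded)

print(expand_query(["vis", "tijd"]))  # ['tijd', 'tyd', 'vis', 'visch']
```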

Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length.
We investigate whether determining short text similarity is possible using only semantic features (where by semantic we mean pertaining to a representation of meaning) rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity.
We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts.
We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
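To make the recipe concrete, here is a simplified sketch of the general idea (the meta-features in the paper are richer than these two, the feature design below is my own, and the random forest merely stands in for whatever supervised learner one prefers): for each embedding set we compare the mean vectors of the two texts and histogram all pairwise word-word similarities, then train on the resulting feature vectors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pair_features(text_a, text_b, embedding_sets, n_bins=5):
    """text_a/text_b: lists of tokens; embedding_sets: list of dicts
    mapping a token to a vector. Any number of sets can be plugged in."""
    feats = []
    for emb in embedding_sets:
        va = [emb[t] for t in text_a if t in emb]
        vb = [emb[t] for t in text_b if t in emb]
        if not va or not vb:
            feats.extend([0.0] * (1 + n_bins))
            continue
        # Feature 1: similarity of the two mean vectors.
        feats.append(cosine(np.mean(va, axis=0), np.mean(vb, axis=0)))
        # Features 2..n: histogram of all pairwise word-word similarities,
        # capturing how word meanings align between the two texts.
        sims = [cosine(x, y) for x in va for y in vb]
        hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
        feats.extend(hist / len(sims))
    return np.array(feats)

def train(pairs, labels, embedding_sets):
    # X holds feature vectors for labelled short text pairs, labels the
    # gold similarity scores; the trained model then scores unseen pairs.
    X = np.vstack([pair_features(a, b, embedding_sets) for a, b in pairs])
    return RandomForestRegressor(n_estimators=100).fit(X, labels)
```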

Word meanings change over time. Detecting shifts in meaning for particular words has been the focus of much research recently. We address the complementary problem of monitoring shifts in vocabulary over time. That is, given a small seed set of words, we are interested in monitoring which terms are used over time to refer to the underlying concept denoted by the seed words.
In this paper, we propose an algorithm for monitoring shifts in vocabulary over time, given a small set of seed terms. We use distributional semantic methods to infer a series of semantic spaces over time from a large body of time-stamped unstructured textual documents. We construct semantic networks of terms based on their representation in those semantic spaces and use graph-based measures to calculate saliency of terms. Based on these graph-based measures we produce ranked lists of terms that represent the concept underlying the initial seed terms over time as final output.
As the task of monitoring shifting vocabularies over time for an ad hoc set of seed words is, to the best of our knowledge, a new one, we construct our own evaluation set. Our main contributions are the introduction of the task of ad hoc monitoring of vocabulary shifts over time, the description of an algorithm for tracking shifting vocabularies over time given a small set of seed words, and a systematic evaluation of results over a substantial period of time (over four decades). Additionally, we make our newly constructed evaluation set publicly available.
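In code, the pipeline could look roughly like this. It is a sketch under stated assumptions, not our actual implementation: I use gensim's word2vec (4.x API) for the per-period semantic spaces and PageRank as the graph-based saliency measure; the measures used in the paper may differ.

```python
import networkx as nx
from gensim.models import Word2Vec

def vocabulary_over_time(slices, seeds, topn=10, top_terms=20):
    """slices: {period: list of tokenised sentences};
    seeds: small list of seed terms. Returns {period: ranked terms}."""
    results = {}
    for period, sentences in slices.items():
        # One semantic space, inferred per time slice.
        model = Word2Vec(sentences, vector_size=100, min_count=5, epochs=5)
        # Build a semantic network: seed terms plus their nearest
        # neighbours, with cosine similarity as edge weight.
        g = nx.Graph()
        for seed in seeds:
            if seed not in model.wv:
                continue
            for term, sim in model.wv.most_similar(seed, topn=topn):
                g.add_edge(seed, term, weight=sim)
                # Neighbours of neighbours give a denser graph.
                for t2, s2 in model.wv.most_similar(term, topn=topn):
                    g.add_edge(term, t2, weight=s2)
        # Graph-based saliency: rank terms by weighted PageRank.
        saliency = nx.pagerank(g, weight="weight")
        ranked = sorted(saliency, key=saliency.get, reverse=True)
        results[period] = ranked[:top_terms]
    return results
```

The per-period ranked lists are then the vocabulary that represents the seed concept in each slice, so a shift shows up as terms entering and leaving the lists over time.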

The NLeSC PathFinder grant proposal that I co-wrote has been accepted. In the proposal we describe a system for monitoring shifts in vocabulary over time. For example, in the 1950s people used to say automobile where nowadays everyone would use the word car. It is the same concept, but the vocabulary has changed. Another nice example is the Dutch word propaganda, which in the 1950s referred to commercial activities like advertising, where nowadays one would use the word reclame.

The algorithms I developed for monitoring changes in vocabulary over time will be implemented in a tool, used by digital humanities scholars, that opens up a corpus of digitized historical Dutch newspapers covering the last four centuries.

The abstract Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse (1890-1990) Using Word Embeddings, which I co-wrote with Melvin Wevers and Pim Huijnen, has been accepted for Digital Humanities 2015 (DH2015) in Sydney, Australia.

Last year, I participated in the Cumulative Citation Recommendation (CCR) task of the Knowledge Base Acceleration (KBA) track of the Text REtrieval Conference, TREC 2013. This is the notebook paper describing our approach.

Today I presented my work on "Time-Aware Chi-squared for Document Filtering over Time" at CLIN24 in Leiden. This is largely the same presentation I gave earlier at the TAIA workshop at SIGIR 2013 in Dublin and at TREC 2013 in Gaithersburg. Just in case anyone is interested, here are the slides.