Deep Learning for Social Sensing from Tweets

Abstract

Distributional Semantic Models (DSM) that represent words as vectors of weights over a high-dimensional feature space have proved very effective in representing semantic or syntactic word similarity. For certain tasks, however, it is important to represent contrasting aspects such as polarity, opposite senses or idiomatic use of words. We present a method for computing discriminative word embeddings that can be used in sentiment classification or any other task where one needs to discriminate between contrasting semantic aspects. We present an experiment in the identification of reports on natural disasters in tweets by means of these embeddings.

Full text

Distributional Semantic Models (DSM) that represent words as vectors of weights over a high-dimensional feature space (Hinton et al., 1986) have proved very effective in representing semantic or syntactic aspects of the lexicon. Incorporating such representations has improved many natural language processing tasks. They also reduce the burden of feature selection, since these models can be learned from plain text through unsupervised techniques.

Traditional embeddings are created from large collections of unannotated documents through unsupervised learning, for example by building a neural language model (Collobert et al., 2011; Mikolov et al., 2013) or through Hellinger PCA (Lebret and Collobert, 2013). These embeddings are suitable for representing semantic or syntactic similarity, which can be measured through the Euclidean distance in the embedding space. They are not appropriate, though, for representing semantic dissimilarity, since for example antonyms end up at close distance in the embedding space.
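The problem can be illustrated with a small sketch using toy vectors (the values below are illustrative stand-ins, not learned embeddings): because antonyms like "good" and "bad" occur in the same contexts, co-occurrence-based training tends to assign them similar vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings (illustrative values, not learned ones):
# "good" and "bad" share contexts ("a good movie" / "a bad movie"),
# so distributional training tends to place them close together.
emb = {
    "good":  np.array([0.9, 0.8, 0.1, 0.2]),
    "bad":   np.array([0.8, 0.9, 0.2, 0.1]),
    "table": np.array([0.1, 0.2, 0.9, 0.7]),
}

print(cosine(emb["good"], emb["bad"]))    # high: antonyms look similar
print(cosine(emb["good"], emb["table"]))  # low: unrelated word
```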

In this paper we explore a technique for building discriminative word embeddings, which incorporate semantic aspects that are not directly obtainable from textual collocations. In particular, such embeddings can be useful in sentiment classification in order to learn vector representations where words of opposite polarity are distant from each other.

For creating the embeddings, we used DeepNL1, a library for building NLP applications based on a deep learning architecture. DeepNL provides two methods for building embeddings: one based on a neural language model, as proposed by Collobert et al. (2011), and one based on a spectral method, as proposed by Lebret and Collobert (2013).

The neural language model can be hard to train and the process is often quite time consuming, since several iterations are required over the whole training set. Some researchers provide precomputed embeddings for English2.

Mikolov et al. (2013) developed an alternative solution for computing word embeddings, which significantly reduces the computational cost and can also exploit concurrency through the Asynchronous Stochastic Gradient Descent algorithm. An optimistic approach to matrix updates is also exploited to avoid synchronization costs.

The authors published single-machine multithreaded C++ code for computing the word vectors3. A reimplementation of the algorithm in Python, with core computations in C, is included in the Gensim library (Řehůřek and Sojka, 2010).

Lebret and Collobert (2013) have shown that embeddings can be efficiently computed from word co-occurrence counts, applying Principal Component Analysis (PCA) to reduce dimensionality while optimizing the Hellinger similarity distance.
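The idea can be sketched in a few lines of numpy (toy random counts stand in for real corpus statistics): row-normalize the co-occurrence counts into distributions, take element-wise square roots so that Euclidean distance between rows corresponds to Hellinger distance, then reduce dimensionality with PCA via SVD.

```python
import numpy as np

# Sketch of Hellinger PCA embeddings (after Lebret & Collobert, 2013).
rng = np.random.default_rng(0)
counts = rng.integers(0, 20, size=(50, 200)).astype(float)  # toy counts

probs = counts / counts.sum(axis=1, keepdims=True)  # P(context | word)
root = np.sqrt(probs)                               # Hellinger mapping

# PCA on the square-rooted distributions via SVD of the centered matrix.
centered = root - root.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 10
embeddings = U[:, :k] * S[:k]                       # 50 words x 10 dims

print(embeddings.shape)  # (50, 10)
```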

Levy and Goldberg (2014) have shown similarly that the skip-gram model by Mikolov et al. (2013) can be interpreted as implicitly factorizing a word-context matrix, whose values are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant.
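Concretely, the matrix in question is PMI(w, c) − log k, where k is the number of negative samples; the sketch below builds it from a toy count matrix (in practice only the positive part, shifted PPMI, is kept and factorized):

```python
import numpy as np

# Toy word-context counts; rows are words, columns are contexts.
counts = np.array([[10., 2., 0.],
                   [ 3., 8., 1.],
                   [ 0., 1., 9.]])
total = counts.sum()
p_wc = counts / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

k = 5  # number of negative samples in skip-gram training
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))    # -inf where counts are zero
shifted_pmi = pmi - np.log(k)

# In practice the positive part (shifted PPMI) is kept and factorized:
sppmi = np.maximum(shifted_pmi, 0.0)
print(sppmi.shape)
```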

For certain tasks, such as sentiment analysis, semantic similarity is not appropriate, since antonyms end up at close distance in the embeddings space. One needs to learn a vector representation where words of opposite polarity are distant.

Tang et al. (2013) propose an approach for learning sentiment-specific word embeddings by incorporating supervised knowledge of polarity in the loss function of the learning algorithm. The original hinge loss function in the algorithm by Collobert et al. (2011) is:

ℒCW(x, xc) = max(0, 1 − fθ(x) + fθ(xc))

where x is an ngram and xc is the same ngram corrupted by replacing the target word with a randomly chosen one, and fθ(.) is the feature function computed by the neural network with parameters θ. The sentiment-specific network outputs a vector of two dimensions: one for modeling the generic syntactic/semantic aspects of words and the second for modeling polarity.

A second loss function is introduced as objective for minimization:

ℒSS(x, xc) = max(0, 1 − δs(x) fθ(x)1 + δs(x) fθ(xc)1)

where the subscript in fθ(x)1 refers to the second element of the vector and δs(x) is an indicator function reflecting the sentiment polarity of a sentence, whose value is 1 if the sentiment polarity of x is positive and -1 if it is negative.
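The two ranking losses can be sketched numerically as follows; the plain functions below are toy stand-ins for the neural network outputs, with f(x)[0] the syntactic/semantic score and f(x)[1] the polarity score:

```python
# Toy sketch of the two hinge losses in Tang et al.'s sentiment-specific
# embedding model. f_x and f_xc are the two-dimensional network outputs
# for an observed ngram and its corrupted version; delta is +1 for a
# positive sentence and -1 for a negative one.
def loss_cw(f_x, f_xc):
    # ranking loss on the syntactic/semantic output: the true ngram
    # should outscore the corrupted one by a margin of 1
    return max(0.0, 1.0 - f_x[0] + f_xc[0])

def loss_ss(f_x, f_xc, delta):
    # ranking loss on the polarity output, with sign flipped by delta
    return max(0.0, 1.0 - delta * f_x[1] + delta * f_xc[1])

f_x, f_xc = [2.0, 1.5], [0.5, -0.4]   # scores for true / corrupted ngram
print(loss_cw(f_x, f_xc))             # 0.0: margin satisfied
print(loss_ss(f_x, f_xc, delta=+1))   # 0.0 for a positive sentence
print(loss_ss(f_x, f_xc, delta=-1))   # penalized under the wrong polarity
```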

The DeepNL library provides a training algorithm for discriminative word embeddings that performs gradient descent using an adaptive learning rate according to the AdaGrad method. The algorithm requires a training set consisting of documents annotated with their discriminative value, for example a corpus of tweets with their sentiment polarity, or in general documents with multiple class tags. The algorithm builds embeddings for both unigrams and ngrams at the same time, performing variations on a training sentence by replacing not just a single word but a sequence of words with either another word or another ngram.
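The AdaGrad scheme itself is compact: each parameter's step is scaled by the inverse square root of its accumulated squared gradients. A minimal sketch on a toy objective (this is the generic update rule, not DeepNL's actual code):

```python
import numpy as np

def adagrad_step(params, grad, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update; `cache` accumulates squared gradients."""
    cache += grad ** 2
    params -= lr * grad / (np.sqrt(cache) + eps)
    return params, cache

# Toy objective: minimize ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(1000):
    w, cache = adagrad_step(w, 2 * w, cache)
print(w)  # should approach [0, 0]
```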

We tested the use of discriminative word embeddings in the task of social sensing, i.e. of detecting specific signals from social media. In particular we explored the ability to monitor and alert about emergencies caused by natural disasters. We used the Social Sensing4 corpus, which consists of 5,642 tweets about natural catastrophic events like earthquakes or floods. To obtain a balanced training set, we combined this corpus with a set of 23,507 generic tweets. The combined corpus, consisting of 29,149 tweets, was randomly split into a training, development and test set consisting respectively of 23,850, 2,649 and 2,650 tweets.
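The corpus assembly and random split described above can be sketched as follows (the tweet strings are placeholders for the real data):

```python
import random

# 5,642 disaster tweets plus 23,507 generic tweets, shuffled and split
# into train/dev/test sets of 23,850 / 2,649 / 2,650 tweets.
disaster = [("disaster tweet %d" % i, 1) for i in range(5642)]
generic = [("generic tweet %d" % i, 0) for i in range(23507)]

corpus = disaster + generic
random.seed(7)
random.shuffle(corpus)

train, dev, test = corpus[:23850], corpus[23850:26499], corpus[26499:]
print(len(train), len(dev), len(test))  # 23850 2649 2650
```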

Most sentiment analysis systems exploit a specialized lexicon (Rosenthal et al., 2014; Rosenthal et al., 2015). We built a lexicon of words related to or indicative of disasters, by using the Italian Word Embeddings interface5. Starting from a seed set of a few specialized words, we produced a lexicon of 292 words (including words with a hashtag).
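The expansion step amounts to repeatedly adding the nearest neighbours of each seed word in embedding space. A hedged sketch, where the random vectors are hypothetical stand-ins for the vectors behind the Italian Word Embeddings interface:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["terremoto", "alluvione", "scossa", "crollo", "pasta", "cinema"]
emb = {w: rng.normal(size=8) for w in vocab}
# Nudge disaster words toward a shared direction so they cluster,
# mimicking what real embeddings learn from corpus statistics.
for w in ["terremoto", "alluvione", "scossa", "crollo"]:
    emb[w] += 3.0

def nearest(word, topn=2):
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = [(v, cos(emb[word], emb[v])) for v in vocab if v != word]
    return [w for w, _ in sorted(scores, key=lambda t: -t[1])[:topn]]

lexicon = {"terremoto", "alluvione"}          # seed set
for seed in list(lexicon):
    lexicon.update(nearest(seed))             # expand with neighbours
print(sorted(lexicon))
```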

For detecting tweets reporting about natural disasters, we exploit an SVM classifier, which uses as continuous features the word embeddings created from the text of the Italian Wikipedia. Additionally, a set of discrete features is used, similar to those of the top-scoring system in task 10 of SemEval 2014 on Sentiment Analysis in Twitter (Mohammad et al., 2014). These features are summarized in the following table:
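A hedged sketch of such a setup: each tweet is represented by the average of its word embeddings (continuous features) concatenated with a few discrete features such as hashtag and lexicon-match counts. The embedding matrix, vocabulary and feature choices below are toy stand-ins, not the actual Wikipedia embeddings or SemEval feature set.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
vocab = {"terremoto": 0, "alluvione": 1, "ciao": 2, "cinema": 3}
E = rng.normal(size=(4, 16))                 # toy embedding matrix
lexicon = {"terremoto", "alluvione"}         # toy disaster lexicon

def featurize(tweet):
    words = tweet.lower().split()
    idx = [vocab[w.lstrip("#")] for w in words if w.lstrip("#") in vocab]
    avg = E[idx].mean(axis=0) if idx else np.zeros(E.shape[1])
    discrete = [
        sum(w.startswith("#") for w in words),          # hashtag count
        sum(w.lstrip("#") in lexicon for w in words),   # lexicon hits
    ]
    return np.concatenate([avg, discrete])

tweets = ["terremoto a Roma", "#alluvione in città",
          "ciao andiamo al cinema", "stasera cinema ciao"]
labels = [1, 1, 0, 0]                        # 1 = disaster report
X = np.stack([featurize(t) for t in tweets])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```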

Social sensing is a rapidly growing research field; however, it is difficult to compare our work with others, since the data sets used are different.

The only experiment performed on the same data set is described in (Cresci et al., 2015), which focuses on distinguishing whether damage was reported, rather than just reporting a disaster. Sixteen experiments were carried out, using four subsets of the corpus for training, corresponding to four disaster events, and testing on either different events (cross-event) or same/different disaster types (in-domain, out-domain). F1 scores in detecting non-relevant tweets ranged between 19% and 28% for cross-event and out-domain, and reached 73% in one of the in-domain tests.

We have presented the notion of discriminative word embeddings, designed to cope with semantic dissimilarity in tasks like sentiment analysis or multiclass classification.

As an example of the effectiveness of this type of embeddings in other applications, we have explored their use in detecting tweets reporting alerts or notices about natural disasters.

Our approach consists in training a classifier on a corpus of annotated tweets, using discriminative embeddings as features instead of the manually crafted features or dictionaries typically employed in tweet classification tasks such as sentiment analysis.

In the future, we plan to explore the use of a convolutional network classifier, also provided by DeepNL, without any additional features, as Severyn and Moschitti (2015) did for the SemEval 2015 task on Sentiment Analysis in Twitter.
