Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news

This paper investigates how topic coverage and sentiment differ over time between Twitter and news media.

Method

The analysis is based on a corpus of 16,189 news articles and 7,106,297 tweets, assembled from queries for "Ebola" and "Ebola virus". Preprocessing comprised stopword removal, lemmatization, tokenization, and part-of-speech tagging.
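As a rough illustration of such a preprocessing step, the following minimal Python sketch tokenizes a text, drops stopwords, and maps tokens to lemmas. It is not the pipeline used in the paper: the stopword list, the lemma lookup table, and the function name `preprocess` are hypothetical stand-ins chosen for illustration, and part-of-speech tagging is omitted because it would normally require a trained tagger from an NLP library.

```python
import re

# Assumption: a tiny illustrative stopword list, not the list used in the study.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are",
             "to", "has", "have", "been"}

# Assumption: a toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"cases": "case", "reported": "report",
          "viruses": "virus", "spreading": "spread"}

# Lowercase alphabetic tokens, optionally with an internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Tokenize, remove stopwords, and lemmatize a single document."""
    tokens = TOKEN_RE.findall(text.lower())              # tokenization
    content = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [LEMMAS.get(t, t) for t in content]           # lemmatization (lookup)


if __name__ == "__main__":
    sample = "New cases of the Ebola virus have been reported in Liberia."
    print(preprocess(sample))
```

Applied to a corpus of this size, the lookup tables would be replaced by a full stopword list and a proper lemmatizer; the structure of the function would stay the same.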

The following four techniques were applied to the corpus:
vocabulary control, i.e. replacing consumers' expressions with experts' jargon using the Consumer Health Vocabulary (CHV)