News and stories from data mining research in Aalto

Our recent paper titled ‘The Effect of Collective Attention on Controversial Debates on Social Media’ (arXiv link) won the best student paper award at the 9th ACM Web Science conference held in Troy, New York.

The paper studies the evolution of long-lived controversial debates on Twitter, i.e., discussions on topics such as ‘gun control’ or ‘abortion’ that reveal a split of opinion between people who support different sides of the argument.

The main goal of this work is to study dynamic aspects of controversial debates — in particular: (i) whether controversy around the debates has increased over time; and (ii) whether controversy increases or decreases when major associated events occur.

Data

The dataset consists of a 1% sample of all tweets generated between September 2011 and September 2016, as published by Twitter and stored on the Internet Archive (link). For the purposes of the study, we focus on subsets of tweets related to major controversial topics in the USA, including Obamacare, Abortion, and Gun Control.

Measuring Controversy

For each topic in the study, we measure the controversy surrounding the topic for each day spanned by the dataset. To do so, we employ the Random Walk Controversy (RWC) method we developed in earlier work [1]. The RWC score essentially quantifies the degree to which the retweet network of a given topic and day is polarized: the higher the RWC score, the higher the controversy around the topic. For more details on the RWC score, we refer the interested reader to the full paper [1].
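As a rough illustration of the idea behind a random-walk score of this kind, the sketch below estimates it by simulation. It assumes the definition given in [1], RWC = P_XX · P_YY − P_YX · P_XY, where P_AB is the probability that a walk which ended on side B started on side A. The function name `rwc_score`, its parameters, and the toy graph are hypothetical. The two sides of the debate and their high-degree "hub" users are simply passed in here, whereas the actual method derives them from the retweet network. Treat it as a sketch under those assumptions, not as the implementation used in the paper.

```python
import random


def rwc_score(adj, side_x, side_y, hubs_x, hubs_y,
              walks=2000, max_steps=1000, seed=0):
    """Monte Carlo estimate of a random-walk controversy score.

    adj maps each user to a non-empty list of neighbours in the retweet graph.
    side_x / side_y are the two sides of the debate (assumed to be given).
    hubs_x / hubs_y are high-degree users on each side, where walks stop.
    """
    rng = random.Random(seed)
    hub_side = {h: "X" for h in hubs_x}
    hub_side.update({h: "Y" for h in hubs_y})
    # counts[(start, end)]: walks that started on side `start`, ended on `end`
    counts = {(s, e): 0 for s in "XY" for e in "XY"}

    for start_side, nodes in (("X", sorted(side_x)), ("Y", sorted(side_y))):
        for _walk in range(walks):
            node = rng.choice(nodes)
            for _step in range(max_steps):
                node = rng.choice(adj[node])
                if node in hub_side:
                    counts[(start_side, hub_side[node])] += 1
                    break  # the walk ends at the first hub it reaches

    def p(start, end):
        # Pr[walk started on `start` | walk ended on `end`]
        total = counts[("X", end)] + counts[("Y", end)]
        return counts[(start, end)] / total if total else 0.0

    return p("X", "X") * p("Y", "Y") - p("Y", "X") * p("X", "Y")


# Toy example (made up): two groups of users that never retweet each other.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2],
       4: [5, 6], 5: [4, 6], 6: [4, 5]}
score = rwc_score(adj, {1, 2, 3}, {4, 5, 6}, hubs_x={1}, hubs_y={4})
# Fully separated sides give the maximum score of 1.0.
```

Because the same number of walks starts from each side, the conditional probabilities can be read directly off the counts. On a well-mixed graph, walks end on either side regardless of where they start, and the score moves toward zero.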

Controversy over Time

Having obtained a controversy score for each topic and day in the dataset, we can now ask whether controversy has increased over the five years covered in the dataset.

The answer to this question is shown in the plot below. The X-axis of the plot spans time at daily granularity, from September 2011 to September 2016; and the Y-axis spans values of the RWC score.

As we see from the figure, even though RWC appears to fluctuate over time, there is no clear trend for increasing or decreasing controversy over time.

Controversy and Collective Attention

Even so, we wish to better understand the fluctuations of controversy over time. Our hypothesis is that the level of controversy around a controversial topic increases or decreases with the collective attention attracted by the topic. In plain terms, we hypothesize that, when a controversial debate is making headlines, the level of controversy around it increases.

To test that hypothesis, we follow two steps.

Firstly, we quantify the collective attention to a topic on a given day as the number of users who post a tweet about the topic on that day. As we see in the figure below, this level of attention coincides well with the occurrence of important events related to the topics.
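A minimal sketch of this attention measure, assuming the tweets of one topic are available as `(user_id, day)` pairs. The record layout, the function name `daily_attention`, and the sample data are hypothetical:

```python
from collections import defaultdict
from datetime import date


def daily_attention(tweets):
    """Map each day to the number of distinct users who tweeted on that day."""
    users_by_day = defaultdict(set)
    for user_id, day in tweets:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


# Made-up records for a single topic
sample = [
    ("alice", date(2016, 6, 12)),
    ("bob", date(2016, 6, 12)),
    ("alice", date(2016, 6, 12)),  # second tweet by the same user, counted once
    ("carol", date(2016, 6, 13)),
]
attention = daily_attention(sample)
```

Counting users rather than tweets means that one very active account does not inflate the attention of a day.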

Secondly, we juxtapose the RWC score with collective attention, both measured at daily granularity. The results are shown in the figure below. Larger values on the X-axis of the plots correspond to higher levels of collective attention, and larger values on the Y-axis correspond to higher RWC scores.

The figures reveal a clear trend: the higher the level of collective attention on a controversial topic, the larger the controversy as measured by the RWC score.
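One simple way to put a number on a trend like this is a rank correlation between the two daily series. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman, the function names, and the sample values are assumptions made for illustration; they are not the analysis or the data from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up daily values for one topic
attention = [120, 340, 90, 1500, 800]
rwc = [0.31, 0.42, 0.28, 0.71, 0.55]
rho = spearman(attention, rwc)  # both series rise together in this toy data
```

A rank-based coefficient is a reasonable fit when attention spans several orders of magnitude, since it depends only on the ordering of the days and not on the raw counts.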

It is important to note that this trend was not observed for non-controversial topics.

Other Measures and Future Work

In addition to the discussion above, the full paper studies the behavior of other network- and content-based measures over time.

With this work, we dived deeper into the study of controversial debates and the complex interactions they encompass. In future work, we plan to study the interplay between controversy and echo chamber phenomena.

Lately, several people have expressed concern about high levels of polarization in society. For example, the World Economic Forum’s report on global risks lists increasing societal polarization as a threat, and others have suggested that social media might be contributing to this phenomenon.

In a recent paper, published at the Tenth International Conference on Web Search and Data Mining (WSDM 2017), we build algorithmic techniques to mitigate the rising polarization by connecting people with opposing views – and evaluate them on Twitter.

This blog post is a summary of our work published at ACM CIKM. The project is about automatically profiling the skills of users by analyzing their personal communication data. We framed this as a prediction problem: given the messages of a user, predict the skills of that user. As a training set, we made use of the Stack Exchange dataset, which is freely available here. There are many Stack Exchange websites, such as stackoverflow, cs, datascience, physics, history, and so on. This dataset covers a diverse set of skills and will be automatically updated if new technologies come to the fore.

Our recent paper on ‘Social media image analysis for public health’ will appear as a short paper in CHI 2016. The question we ask in this paper is whether images uploaded to social media can be used to predict public health variables and lifestyle diseases, such as obesity, diabetes, depression, etc.

Lifestyle diseases are of major concern in the developed world. NYTimes estimates that, in addition to costing almost a trillion dollars, lifestyle diseases kill more people than contagious diseases. With the ubiquitous use of social-media platforms in recent years, it has never been easier to collect and analyze the lifestyle choices of large populations. For this reason, social-media data has been used in the past to study or monitor public health.

Controversies are everywhere on social media. Studying and understanding the structure and evolution of these controversies is an important area of research. Though there have been previous attempts to study controversy on social media, they are either too domain-specific (e.g., politics) or require labeled data in advance.

To address these shortcomings, in our recent WSDM 2016 paper, we designed a fully automatic way to detect ad-hoc controversial issues in the wild, with no prior information or domain knowledge. We represent a topic of discussion with a conversation graph. In this graph, vertices represent people and edges represent conversation activity, such as posts, comments, mentions, or endorsements. Our goal is to examine whether there are distinguishable patterns in the way conversations are shaped during a controversial event.
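To make the representation concrete, here is a minimal sketch of how such a conversation graph could be assembled from interaction records. The `(source, target, kind)` record layout, the function name, and the choice of an undirected graph weighted by interaction count are assumptions made for illustration; the paper may construct its graphs differently.

```python
from collections import defaultdict


def build_conversation_graph(interactions):
    """Build an undirected graph whose edge weights count interactions."""
    graph = defaultdict(lambda: defaultdict(int))
    for source, target, _kind in interactions:
        if source == target:
            continue  # ignore users interacting with themselves
        graph[source][target] += 1
        graph[target][source] += 1
    return graph


# Made-up interaction records for one topic
interactions = [
    ("ann", "bob", "retweet"),
    ("bob", "ann", "mention"),
    ("cat", "ann", "retweet"),
]
graph = build_conversation_graph(interactions)
```

Here `graph["ann"]` holds the neighbours of a user together with the number of interactions shared with each of them, which is enough to start looking at how the conversation splits into groups.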

It is a measure that tells us how central one set of nodes (let’s call it C) is with respect to another set of nodes (let’s call it Q) in a graph. As an example, consider the graph shown in the figure below. In that graph, we use color to indicate the two sets of nodes — Q is shown in red and C is shown in blue.

I recently attended the CIKM conference in Melbourne to present our paper on facility location in map-reduce and Giraph. In this post, I will give a brief summary of some of the talks I attended. As CIKM is a very large conference, with 166 papers accepted this year, this list is merely a random sample of the accepted papers.