Bringing tools for media monitoring to the public: outcomes of the ODYCCEUS summer school on ‘Democracy in the Age of Big Data and AI’ (FETFX Future Tech Week contribution)

Introduction

Social science is
undergoing a revolution. The rise of social media, smart cities, digital
archives, quantified selving, and personal agents is providing an unprecedented
amount of data about the social relations and behaviors of very large groups of
people, creating an opportunity to a more empirically grounded social theory.
But this opportunity can only be realized by using powerful tools: for
analyzing textual data, for finding patterns in data, and for visualizing those
patterns in a way that brings out their meaning. These tools increasingly rely
on complex systems science and Artificial Intelligence.

At the same time we see that collective decision-making in contemporary societies is also undergoing profound change. Social media, Big Data, and AI are already having a profound impact on political and social processes. They have empowered ‘citizen-driven’ political movements such as the Arab spring or the French ‘Gilets Jaunes’ protests. They also provide new means to strategically influence public opinion as it was seen before the the Brexit refendum and the US presidential elections in 2016. Similarly, they could lead to rapid polarizations in political debates, such as those concerning migration or identity, and accelerate the spreading of fake news or hate speech.

Building (social) media observatories

The ODYCCEUS summer school mixed theoretical lectures with practical hands-on sessions and ateliers to examine how novel tools of computational social science can help us understand these phenomena and possibly facilitate future democratic decision processes. It introduces social scientists and media researchers to the latest methods, tools, and techniques and introduces AI researchers and complex systems scientists to the approaches and issues of social science so that they can come up with new tools or refine existing ones.

More specifically, participants learned how they could build Opinion Observatories that tap into social media to collect information about how certain actors are trying to manipulate political opinion in elections, how fake news gets fabricated and spreads, or how opinions get polarized and shift. To this end, participants engaged in a series of ateliers guided by tutors, each of which addressed a case study. Case studies considered a specific contentious issue using specific data sources.

In this workshop, participants used data from the news website of the Guardian and Penelope components to develop a ‘science tracker’: a pipeline to reveal and track references to scientific publications and research institutions in news articles and comments related to climate change. Participants have thus built components to trace the relationships between scientific references and articles and those in news website commentaries, the occurrence of scientific references in articles over time, etc. As such, they have laid a basis for a further exploration of the role of scientific literature within the climate change debate.

The participants of the workshop decided to
focus on a very specific discourse, which took place in the shadow of the big
Brexit topic dominating UK politics: The
debate about badger cullings in the UK due to the high number of cases of
bovine tubercolosis among cattle. The illness is partly spread by the badger
population in the UK and Department for Environment, Food and Rural Affairs has
hence initiated culling of badgers to reduce the reservoir of infection in
wildlife.

Two data sources were used in the workshop:
Twitter posts of the week of the Summer School, and all parliamentary speeches
by the UK House of Commons since 2016. The aim was to produce an analysis
allowing and making use of both close and distant reading to gain a
comprehensive view of the debate.

Starting off with relatively close reading of
the culling debate in the UK parliament, a new component was introduced and
applied to the speeches: Statement graphs. There, sentences including the same
words are linked to each other, which yields an overview about the language and
topic structure of speeches and can be used to visualize differences and
similarities in the use of expressions and words of different parties and politicians,
or at different times.

The participants then analyzed the language of
the different parties in the UK parliament, and found that the two major
parties of the UK talk differently about the cullings: Labour politicians in a
more detailed and emotional way, Tories in more abstract and rational terms.

Twitter was used to get an impression about the discourse on social media. Traffic on Twitter was accelerated by the announcement of extending the culling to eleven new areas on September 11. Analysis of the retweet network of posts including ‘badger‘ and ‘cull‘ showed that, to a big extent, only the voices opposing the culling measures were present and shared on Twitter. Statements in favour of the cullings were largely missing. An interesting follow-up question hence arose: How do we deal with data that is not there?

The results showed that even for a ‘minor‘ political topic (compared to and overshadowed by the very heated Brexit debate going on at the same time), the methods and pipelines the participants used and developed provided valuable insights on different scales of data aggregation and visualization, from close to distant reading. This is what should be pursued further in the on-going development of Penelope. The simple visualization techniques of the statement and retweet graphs turned out to provide a valuable systematic perspective on the data. Also more qualitatively oriented researchers can profit from those tools, that we plan to provide to Penelope soon.

Case 3: Hate and excitable speech (tutored by the University of Amsterdam)

This
group’s project was aimed at mapping and tracing the spread of extreme speech
and hate speech across online platforms. This involved a number of datasets
across the one-year interval surrounding the election of Donald Trump as US
president in November 2016. Two datasets were of particular interest here, one
sourced from 4chan, an anonymous and ephemeral web forum that over the past few
years has become increasingly prominent as a breeding ground of far right
conspiracy theories and discourse. The other consisted of all comments left by
users on Breitbart, a more traditional right-wing news site which emerged as an
important supporter of Donald Trump’s campaign in 2016.

One strand
of the research focused on a semantic analysis of these datasets. Using a
word2vec analysis based on an expert list of known ‘antagonistic’ phrases on
the platform – phrases used to identify either the ingroup (e.g. 4chan and its
users) or the outgroup (those outside the forum’s subculture, ‘normies’ in
4chan vernacular), the full dataset could then be processed to see how
particular phrases or topics could be mapped on this ingroup-versus-outgroup
axis. This gave coherent results for a range of topics, from news outlets
(where the right-wing Fox News ranked consistently ‘closer’ to the right-wing
datasets than neutral or left-wing outlets) to political candidates (where
Donald Trump was, unsurprisingly, closer, while Democrat Bernie Sanders was the
furthest outward), providing an empirical and quantified impression of the self-described
political identity of these ‘exciteable’ discussion spaces.

A second
strand of research focused on tracing of one particular ‘phrasal meme’, the far
right conspiracy theory of ‘white genocide’, which suggests that the ‘white
race’ is under constant threat from immigration and miscegenation. By first
subsetting all posts from the 4chan and Breitbart datasets containing this
phrase, and then looking for ‘neologisms’ – that is, words not found in other
language spaces – we formed an overview of topics discussed in relation to
white genocide in these spaces. This produced two interesting insights. First,
the semantic context of white genocide, as defined by these neologisms, is
remarkably different between the spaces; while on 4chan the discussion is clearly
focused on traditional antisemitic theories, Breitbart is quite specifically
concerned with the ‘kalergi plan’, a related conspiracy theory. This shows how
a current far right talking point can manifest differently in different online
discussion spaces. Secondly, the discussion of ‘white genocide’ on especially
Breitbart seemed to be completely divorced from the articles published by the
outlet, suggesting that the comments section on the site is not so much a place
to react to whatever it publishes, but rather a more general right-wing
discussion forum where people simply continue an on-going ideological
discussion.

Major parts of this research were conducted using 4CAT, a Penelope-based research tool, which allowed for a rapid iterative approach to data analysis. In this way, it also served as a testbed for these technologies, and demonstrated that through the use of a unified interface on top of existing analytical pathways, researchers from various backgrounds can effectively engage with large datasets.

The idea of “migrant crisis” has been criticized by migrations studies as an artefact produced by the public sphere. Nonetheless, this idea has spread worldly. This article proposes to study the salience of news related to migrations in a corpus of European press during a fiveyears period (2014-2018). When it comes to the comparison between different sources, between different languages and national contexts, methodological obstacles arise. Salience analysis should take in charge the social complexity by a cross-dimensional perspective. First, we propose a methodology for building a multi-sources, multinational and multi-languages corpusand to identify the topic. Then, the analysis reveals that the salience of migration in the press is not as global as supposed. Important differences of coverage exist. Thanks to an online platform, observations of salience patterns can be made by crossing time variables (period and span) with source attributes (language, editorial scope, location).