Information nutrition labels

NewsScan implements six information nutrition labels: source popularity, article popularity, ease of reading, sentiment, objectivity, and political bias. Ease of reading, sentiment, and objectivity were proposed by Fuhr et al. (2018); we propose three further labels: source popularity, article popularity, and political bias. Similar to food nutrition labels, information nutrition labels aim to give the reader a basis for judging the reliability of an article's content. The credibility nutrition label proposed by Fuhr et al. (2018), for instance, indicates to the reader whether, e.g., the source an article comes from is credible. However, the credibility label already entails a judgment: it aggregates several pieces of information and draws a conclusion from them. We think that instead of handing the reader such a judgment, the user may be better informed when we provide the information that could serve as a basis for computing, e.g., the credibility label. The three proposed labels serve this purpose, i.e., they provide enough detail to enable the user to make an informed judgment about an article's content.

In the following, we describe the nutrition labels currently implemented within the plugin.

Source popularity

The label source popularity encompasses two dimensions: the reputation of the news source and its influence.

The reputation of a source is analyzed using the Web of Trust score. This score is computed by a browser extension that helps people make informed decisions about whether to trust a website. It is based on a crowdsourcing approach that collects ratings and reviews from a global community of millions of users who rate and comment on websites based on their personal experiences.

The influence of a source is computed using Alexa Rank, Google PageRank, and popularity on Twitter.

Alexa Rank is a ranking system maintained by Alexa.com (a subsidiary of Amazon) that audits and publishes the frequency of visits to various websites. The Alexa rank is the geometric mean of reach and page views, averaged over a period of three months.
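The geometric-mean combination can be sketched as follows; the two input metrics are named after the description above, though Alexa's exact inputs and normalization are proprietary:

```python
import math

def alexa_style_score(reach, page_views):
    """Combine two traffic metrics via their geometric mean, as the
    Alexa rank combines reach and page views (illustrative sketch;
    the real inputs are already averaged over three months)."""
    return math.sqrt(reach * page_views)
```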

Google PageRank is a link analysis algorithm that assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set.

Twitter Popularity is calculated as an average of the scores for the following two metrics:

Followers Count: This gives the number of users that follow a source.

Listed Count: This indicates the number of topical lists the source is a member of. It is based on users' activity in adding or removing the source from their customized lists. The higher it is, the more diverse the source is.

An overall source popularity score shown to the user is calculated by averaging these four metrics. When the icon card is flipped, the user can also get detailed information about each of the above scores.
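A minimal sketch of the two averaging steps described above, assuming each sub-score has already been normalized to a 0–100 scale upstream (the function names are ours, not the plugin's):

```python
def twitter_popularity(followers_score, listed_score):
    """Twitter popularity: the average of the followers-count score
    and the listed-count score."""
    return (followers_score + listed_score) / 2

def source_popularity(wot, alexa, pagerank, twitter):
    """Overall source popularity: the plain average of the four
    normalized sub-scores (Web of Trust, Alexa Rank, PageRank,
    Twitter popularity)."""
    return (wot + alexa + pagerank + twitter) / 4
```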

Article popularity

Popularity = a log(bx + 1)

where x is the average number of tweets per hour, so that the article popularity is 0 when x is 0. The most popular article we found had around 23 tweets per hour during its peak 24 hours. This is used as a reference value, i.e., an article must have this many tweets to reach a score of 100. The logarithmic function is used to scale the output properly: an article with five tweets per hour is still relatively popular, even though that is just a fraction of the reference value. Choosing a very small value for b makes the function close to linear, which causes even relatively popular articles to have low scores; a larger b makes the function more curved. If b is too large, however, any article with a decent number of tweets will have a score very close to 100. b is chosen empirically to be 1 so that the scores are distributed well between 0 and 100 over a variety of typical news articles, and a is set to 73 to give the reference article a score of 100.
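The calibration above can be checked directly. A base-10 logarithm is assumed here, since with a = 73 and b = 1 the reference article (23 tweets per hour) then lands almost exactly at 100:

```python
import math

A = 73  # calibrated so the reference article (~23 tweets/hour) scores ~100
B = 1   # chosen empirically; controls the curvature of the scaling

def article_popularity(tweets_per_hour):
    """Popularity = a * log10(b*x + 1), clipped to at most 100.
    Returns 0 when there are no tweets, since log10(1) = 0."""
    score = A * math.log10(B * tweets_per_hour + 1)
    return min(score, 100.0)
```

With these constants, five tweets per hour still yields a score near 57, matching the intuition that such an article is relatively popular.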

Ease of reading

As described by Schwarm and Ostendorf (2005), the readability level characterizes the educational level a reader needs to understand a text. This topic has been researched since the 1930s, and several automatic approaches have been proposed to determine the readability level of an input text (Vajjala and Meurers, 2013; Xia et al., 2016; Schwarm and Ostendorf, 2005). The core concept in these studies is to use machine learning along with feature engineering covering lexical, structural, and heuristic-based features. We followed this concept and used a Random Forest with features inspired by earlier studies. This approach achieved 73% accuracy on a data set of texts written by students in Cambridge English examinations (Xia et al., 2016). The classifier predicts five levels of readability ranging from A2 (easy) to C2 (difficult) (Xia et al., 2016). We map these values to percentages so that A2 becomes 100% (easy to read) and C2 becomes 20% (difficult to read); see Table 1.

Table 1: Levels of readability

Text Level    Value
A2            100%
B1             80%
B2             60%
C1             40%
C2             20%
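The mapping in Table 1 amounts to a simple lookup from the predicted CEFR level to the percentage shown in the plugin (the names here are ours, for illustration):

```python
# Mapping from predicted readability level (Table 1) to the
# percentage displayed: A2 = easiest (100%), C2 = hardest (20%).
READABILITY_VALUE = {"A2": 100, "B1": 80, "B2": 60, "C1": 40, "C2": 20}

def ease_of_reading(predicted_level):
    """Convert the classifier's CEFR-level output to a 0-100 score."""
    return READABILITY_VALUE[predicted_level]
```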

Sentiment

A text containing sentiment is written in an emotional style. To determine the sentiment value of an article, our algorithm uses the pattern3.en library (Hayden and de Smet). In this library, every word is assigned a sentiment value in the range [-1, 1]. If a word expresses intense positive emotion (e.g., happy, amazing), it is given a high positive value; likewise, a term indicating intense negative emotion (e.g., bad, disgusting) is assigned a high negative value. A word carrying no emotion (e.g., the, you, house) has a value near zero. First, the algorithm calculates the sentiment value of every sentence by averaging the absolute sentiment values of its words. After that, the overall sentiment value of the whole news article is calculated by taking the average over the sentences and multiplying it by 100.
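The two averaging steps can be sketched as follows. The tiny lexicon below is a stand-in for pattern3.en's word-level sentiment values, which are what the plugin actually uses:

```python
# Toy lexicon of word-level sentiment values in [-1, 1]; illustrative
# only -- the plugin looks these up via pattern3.en.
LEXICON = {"amazing": 0.9, "happy": 0.8, "bad": -0.7, "disgusting": -0.9}

def sentence_sentiment(sentence):
    """Average the ABSOLUTE sentiment values of the words in a
    sentence; words missing from the lexicon count as neutral (0)."""
    words = sentence.lower().split()
    return sum(abs(LEXICON.get(w, 0.0)) for w in words) / len(words)

def article_sentiment(sentences):
    """Article-level score: mean over sentence scores, scaled to 0-100."""
    return 100 * sum(sentence_sentiment(s) for s in sentences) / len(sentences)
```

Taking absolute values means a strongly negative article scores as high as a strongly positive one; the label measures emotional intensity, not polarity.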

Objectivity

A text is objective when it is written from a neutral rather than a personal perspective. Phrases like "in my opinion" or "I think" are used by authors to reflect their individual thoughts, beliefs, and attitudes. The process of determining the objectivity of a text is similar to that of calculating the sentiment value. The aforementioned pattern3.en library (Hayden and de Smet) also includes a subjectivity value for every word, so we use it to obtain an objectivity score for articles. Values range from 0 to 1, with a value near 0 indicating objectivity and a value near 1 indicating subjectivity. The overall subjectivity score of an article is calculated as the average over all sentences. However, since we want to measure the objectivity rather than the subjectivity of a text, the values need to be inverted:

Objectivity = 1 − Subjectivity

This score is normalized by multiplying it by 100 to attain a consistent score range for all labels.
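Putting the inversion and normalization together, assuming the per-sentence subjectivity values have already been obtained (e.g., from pattern3.en):

```python
def objectivity_score(sentence_subjectivities):
    """Average the per-sentence subjectivity values (0 = objective,
    1 = subjective), invert via Objectivity = 1 - Subjectivity, and
    scale to the plugin's common 0-100 range."""
    avg = sum(sentence_subjectivities) / len(sentence_subjectivities)
    return 100 * (1 - avg)
```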

Political bias

Bias measures the degree to which an article is written from a one-sided perspective that pushes readers toward a specific viewpoint without considering opposing arguments. For calculating political bias we followed Fairbanks et al. (2018) and used two classes that represent different political orientations: conservatism (sources biased towards the right) and liberalism (sources biased towards the left). The authors also argue that the content of an article is a strong discriminant for distinguishing between biased and non-biased articles. Following them, we built a content-based model for predicting political bias in news articles. To achieve that, a logistic regression classifier is trained on a dataset containing articles from The Global Database of Events, Language, and Tone (The GDELT Project). This database monitors the world's broadcast news in over 100 languages and provides a computing platform; however, it does not contain any information about political bias. To obtain bias labels for the articles, we crawled the required information from Media Bias Fact Check, which contains human-annotated fact checks for various source domains. For our articles we thus have left-biased, right-biased, and neutral labels. We use a simple bag-of-words representation as features for our logistic regression model. As the label values in our plugin are all shown in a range from 0 to 100%, the label's landing page shows 0% when the article has no political bias and 100% otherwise, regardless of whether the article is biased to the left or to the right. When the label's card is flipped, the reader can see whether the article has a left or right political bias.
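The training setup can be sketched with scikit-learn, assuming that library as a stand-in for the plugin's actual implementation. The three-sentence corpus below is illustrative only, not GDELT data, and the labels mimic those crawled from Media Bias Fact Check:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in corpus with left/right/neutral labels; in the plugin, the
# texts come from GDELT and the labels from Media Bias Fact Check.
texts = [
    "the government must expand social welfare programs",
    "tax cuts and deregulation will grow the economy",
    "the council met on tuesday to discuss the budget",
]
labels = ["left", "right", "neutral"]

# Bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

def bias_percentage(predicted_label):
    """Landing-page value: 0% for neutral articles, 100% for any
    biased article regardless of direction (left vs. right is only
    shown on the flipped card)."""
    return 0 if predicted_label == "neutral" else 100
```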