Journalistic Data Analysis

Month: November 2015

Even though it is unintentional, scientists are misled by their own biases. I previously discussed one of the biggest, the confirmation bias, but that is not the only bias people have. In this blog post, I tell you something about the opportunistic bias and the publication bias.

Jamie DeCoster: “The opportunistic bias occurs when the reported relations are stronger or otherwise more supportive of the researcher’s theories than they would be without the exploratory process.”

The opportunistic bias occurs when researchers examine multiple analyses before deciding which one to actually use. This selection process makes it more likely to find significant results and large effect sizes, because you can pick the analysis that most favors your prediction or theory. According to DeCoster and Sparks, several different procedures can shift your results towards significance: you can create an opportunistic bias by searching for the most favorable way of transforming variables, by measuring a large collection of variables and reporting only the desirable results, and by examining the same hypothesis with different analyses, methods, or subgroups of participants. Another possibility is scrutinizing undesirable findings more closely than desirable ones (e.g. double-checking only the unexpected finding). Michèle Nuijten also mentioned several such practices, along with the rates at which professors admitted to them. Below you’ll find the three most frequent procedures:

Failing to report all of the study’s dependent measures (63.4%);

Deciding whether to collect more data after looking at the results and their significance (55.9%);

Selectively reporting the studies that worked (45.8%).

As a consequence of the opportunistic bias, type I errors are easily made. Also, p-values can no longer be interpreted as they should be, because the actual probability of finding a significant result is much higher than the nominal significance level suggests. Hence, opportunistic bias can produce a significant effect even when no effect is there. These wrongly drawn conclusions are absorbed into the general view in the literature (as people and researchers read biased articles) and they systematically influence meta-analyses.
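To get a feeling for how quickly type I errors pile up, here is a minimal sketch in Python. It assumes that the analyses are independent and that no true effect exists; analyses run on the same data set are usually correlated, so the real inflation will differ, but the direction is the same. The function name is my own and does not come from DeCoster and Sparks.

```python
def familywise_error_rate(alpha: float, n_analyses: int) -> float:
    """Chance of at least one false positive across n independent
    analyses when there is no true effect (simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_analyses


# Compare the nominal 5% level with what happens when you try more analyses.
for k in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} analyses -> chance of a false positive: {rate:.3f}")
```

Under these assumptions, trying ten analyses and reporting the one that "worked" already gives roughly a 40% chance of a significant result, even though each individual test still claims a 5% error rate.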

Why do we want these significant results so badly? Why do we transform data in a favorable way or run different analyses to get them? One of the causes of the opportunistic bias is closely related to the publication bias.

Michèle Nuijten: “Publication bias is putting the non-significant results in the closet and publish the significant results in the journals.”

With publication bias, the whole view in the literature gets distorted: when only articles with significant results are reported, researchers are pushed to publish (only) significant results, which further encourages the opportunistic bias. Our view of the world changes and the (scientific) knowledge we have is not as objective as it should be. This could be dangerous in the field of medicine, for instance. Publication bias is noticeable not only in the scientific world, but also in journalism. For journalists it is important to create remarkable and sensational stories in order to get people to read a blog or newspaper and to get paid by their bosses. In this way, incorrect and biased information is encouraged (even more) in our society, resulting in a misguided worldview.

The problem is clear: the motivation of the opportunistic bias is closely related to the publication bias. What can we do about it?

[1] Researchers must write reliable articles with as few (publication and opportunistic) biases as possible. They should also make use of preregistration by means of the website OSF (which does not allow you to change anything once it has been posted). When referring to other articles or previous theories, they should be cautious. To judge whether an article is adequate, researchers could look for bad and good signs.

Bad signs:

Statistical errors;

A lot of p-values just below .05;

Post hoc explanations of covariates;

Removing outliers without doing a sensitivity check;

Vague and inaccurate language in the method section;

Degrees of freedom that don’t match the sample size.

Good signs:

High power or large sample size;

Preregistered hypotheses, method and analysis plan;

Openness (share data, analyses, material online);

Replication with high power and preregistration;

Meta-analysis of different studies (test for publication bias).
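One of the bad signs above, degrees of freedom that don’t match the sample size, is easy to check yourself. The sketch below is only an illustration under stated assumptions: it covers a standard independent-samples t-test (equal variances), a paired t-test, and the error term of a one-way ANOVA. The function names and test labels are hypothetical, and Welch’s t-test is left out because its degrees of freedom are usually not a whole number.

```python
def expected_df(test: str, n_total: int, n_groups: int = 2) -> int:
    """Degrees of freedom you would expect from the reported sample size."""
    if test == "independent_t":
        # Two groups, equal variances assumed: total N minus 2.
        return n_total - 2
    if test == "paired_t":
        # Here n_total is the number of pairs.
        return n_total - 1
    if test == "one_way_anova_error":
        # Error (within-groups) degrees of freedom.
        return n_total - n_groups
    raise ValueError(f"unknown test: {test}")


def df_matches(reported_df: int, test: str, n_total: int, n_groups: int = 2) -> bool:
    """True when the reported df agrees with the reported sample size."""
    return reported_df == expected_df(test, n_total, n_groups)


# A paper reports N = 40 and t(35) for an independent-samples t-test.
print(df_matches(35, "independent_t", 40))
```

A mismatch is a reason to look more closely, not proof of wrongdoing: it may simply mean that participants were excluded without this being described clearly in the method section.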

[2] What could journals do? They could create a more rigorous and thorough reporting standard (e.g. reporting the intended and the actual sample size, describing all variables, and mentioning which analyses were pre-specified and which were actually done). In addition, journals could require increased disclosure (e.g. researchers have to keep a log of all performed analyses and procedures). In my opinion, journals should also publish non-significant results, because these are results too. They can do this by accepting or rejecting research proposals on the basis of their theory, described method and proposed analysis. Once a proposal is accepted, the journal would agree to publish the study regardless of whether the results are significant. I do think that this approach needs some additional requirements to uphold the quality of research papers, though.

[3] What can be done by journalists? Journalists should be cautious when referring to an article. In my opinion, a lot of journalists are not doing this; they are sensational rather than subtle. An example of this can be found in the news report “Even Casually Smoking Marijuana Can Change Your Brain, Study Says” in the Washington Post. The study they refer to solely indicated that there were differences in the brains of casual pot users compared to nonusers; it did not claim that these differences were caused by marijuana use (the study’s design could not even show causality). In this sense, journalists should be critical and more skeptical, and not just write something that is exciting.

Big data is a hot topic at the moment, but what is ‘big data’ exactly? According to Lewis and Westlund, big data refers to data sets that are too large for standard computer memory and software to process. By analyzing big data you can reveal patterns, trends, and associations. Big data gives us the opportunity to integrate information from different sources and recognize/predict various patterns. These patterns are usually related to human interactions and behavior. With this, big data could be very important for the implementation of marketing activities.

For instance, imagine you are the marketing manager of the Dutch supermarket chain Albert Heijn. Albert Heijn makes use of the “Bonus-card”, a card with which you receive discounts on products that are in the “Bonus”. Every time your card gets scanned, Albert Heijn receives information about your consumer behavior and interactions (e.g. you come every day to buy at least one Tony’s chocolate bar). So Albert Heijn knows what you are going to eat this day or week. By means of the Bonus-card they can effectively design marketing campaigns (e.g. hamsterweken) because they have insight into you as a customer. Almost all regular customers of Albert Heijn have a Bonus-card, so it is a smart way to gather data and, subsequently, predict purchase behavior.

In the example above I wrote about data collection in a physical environment, but the same happens (probably more easily and more often) in the non-physical environment, such as the Internet. The Internet has numerous ways to retrieve data (cookies, Google, YouTube, social media interactions, location-based services etc.). Some ballpark figures of real-time data online (20/11/2015 15.00):

Videos viewed today on YouTube: 6,027,000,000

Photos uploaded on Instagram: 162,630,000

Google searches today: 2,787,425,000

Facebook active users: 1,499,923,000

Blog posts written today: 2,577,000

Tweets sent today: 571,955,000

Data can be used for commercial purposes. For instance, when Google knows that you love Tony’s chocolate and you repeatedly purchased it online at Albertheijn.nl, you might receive banners or advertisements of both Albert Heijn and Tony’s chocolate when surfing online. In some respects, both in the offline environment (i.e., Bonus-card) and in the online environment (i.e., www.albertheijn.nl) we are being followed, or some would say stalked. A logical follow-up question would be: Is this ethical?

To answer this question, I would like to give special attention to the social medium Facebook, as it is very popular and it retrieves lots of data. Facebook collects data on the basis of your own activities on Facebook (e.g. posting a picture of you and Tony), but also when other networks or people deliver information about you. Hence, your privacy depends on your friends on Facebook. In addition, Facebook assembles data concerning payments, device usage and data from websites that collaborate with Facebook, such as advertisers. This is what they say on their ‘privacy page’. But what about image recognition and extracting data from images? Eric Postma mentioned that faces, emotions and objects can be recognized in images. If Facebook starts using this technique, more questions could be raised with regard to privacy and ethics.

But that is not the only thing they do; Facebook carries out research with Facebook users as (unknowing) participants. Kramer, Guillory and Hancock conducted a study on emotional contagion through Facebook. They showed that when positive posts were suppressed in people’s news feeds, fewer positive expressions were posted. When negative expressions were reduced, the opposite pattern occurred. An interesting study, but (again) questions arise about ethics, as participants were not aware of their participation.

All in all, lots of data is assembled through Facebook, but are Facebook users aware of this? In my opinion, people should be made aware of the activities of Facebook: the image processing, their research studies, and all other data-gathering activities. Smith, Szongott, Henne and von Voigt also seem to agree with this. By creating awareness, people get a real choice to unsubscribe if they feel their privacy is invaded. Although I think that most people are aware of the fact that information posted online can be traced back to you, some people are not aware of this (e.g. children). Also, as I mentioned earlier, your privacy depends on your friends.

To conclude, big data creates opportunities for marketers but it also raises questions about privacy and ethical dilemmas. Both Albert Heijn and Facebook retrieve data and use it for their own commercial purposes, but they do it differently. Now my question to you is “Would you be more comfortable with the Albert Heijn approach or with the Facebook approach?”

How someone’s judgement is shaped by personal opinions and feelings not only influences the way one tells a story, but also how someone finds a story. In this blogpost the latter has the main focus as I have tried to find a common problem in finding data to support stories and a common pitfall in finding story ideas with data analysis.

Every day stories are told in the news; sometimes a terrorist attack is the main issue and often political activities feature in news articles. Galtung and Ruge provided insight into the predictable pattern of news articles. They provided a list of different factors, such as the incorporation of unexpected facts, and they predicted that the higher an event scores on the list, the more likely it is to appear in news articles. One of the factors was ‘threshold’. With this, they meant that the greater the intensity and the more casualties involved, the more likely the event will be reported in the news.

In a different review, Harcup and O’Neill showed that this list is open to subjectivity. They reported, among other things, that the threshold is open to subjective interpretation, as someone might think that 20 deaths in ten road accidents is worse than five deaths in one rail crash. They also stated that news is more likely to be news if it is culturally similar. For instance, if the same accident happens in a country that shares your cultural values as opposed to a country that doesn’t, the former would be more likely to be newsworthy. Moreover, Harcup and O’Neill reported that a news agency’s own agenda could also play a role in selecting articles due to commercial interests. In sum, subjectivity plays a role in the process of selecting a news article. However, subjectivity already has an influence at an earlier stage of data journalism.

In order to report a story, a news reporter has to do a data analysis. Following Paul Bradshaw’s inverted pyramid of Data Journalism, a news reporter goes through several processing stages in order to establish his story. In sequence, these are: compile, clean, context and combine. When all stages have been completed, a story can be built and communicated to the world. However, different things can go wrong in the process of finding data to support stories and in the process of finding story ideas with data analysis.

What can go wrong in the process of finding data to support stories?

In the compiling phase, one of the most important steps in journalism, news reporters search for data and judge whether data is relevant or not. In this stage, a common pitfall is the urge to look for information that is consistent with one’s own expectations, beliefs or hypothesis, a phenomenon called the confirmation bias. As a result, news reporters might be more inclined to seek datasets that confirm their expectations and to interpret data in a way that is consistent with their own beliefs. This affects both the data analysis process and, indirectly, the news article itself. For instance, suppose you believe in democracy and its benefits. When writing an article, you only search with key terms that are consistent with your existing belief, such as “positive effects democracy”. The image below gives an illustration of the confirmation bias. Thus, there is a clear difference between building a case to justify a conclusion that was already drawn and impartially evaluating evidence in order to come to an unbiased conclusion. Motivational factors behind the confirmation bias might be that impartially evaluating evidence is more time consuming, carries cognitive costs (e.g. generating new ideas), and is not good for one’s self-esteem.

What can go wrong in the process of finding story ideas with data analysis?

Another phase in which biases often occur is the combining phase. In this phase, journalists look for patterns and for ways to combine data. A common pitfall in this stage is that journalists see an illusory correlation: they perceive a correlation between variables (often events, behaviors or people) even when no such correlation exists. This bias is related to the confirmation bias, because prior expectations play a dominant part in this process. For instance, you (still) believe that democracy is best, so you try to show that democratic countries have fewer murder cases than countries with a dictatorship. When you don’t find a pattern, you think that you might have chosen the wrong variable, so you look further for a variable such as ‘life satisfaction’. You find that there is a significant relationship between life satisfaction and democracy. However, you do not take into account that wealth could be a confound or mediator. So, you’re left with an illusory correlation based on your own subjectivity.
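The democracy example can be made concrete with a small simulation. This is a sketch on made-up data, not an analysis of real countries: I assume that a hypothetical ‘wealth’ score drives both the ‘democracy’ score and the ‘life satisfaction’ score, and that democracy has no direct effect on satisfaction at all. All variable names and numbers are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_countries = 2000       # unrealistically large, to keep noise small

# Simulated data: wealth influences both other variables,
# and democracy has no direct effect on satisfaction.
wealth = [rng.gauss(0, 1) for _ in range(n_countries)]
democracy = [w + rng.gauss(0, 1) for w in wealth]
satisfaction = [w + rng.gauss(0, 1) for w in wealth]

raw = pearson(democracy, satisfaction)
adjusted = partial_corr(democracy, satisfaction, wealth)
print(f"raw correlation:        {raw:.2f}")
print(f"controlling for wealth: {adjusted:.2f}")
```

In this simulated setup the raw correlation between democracy and life satisfaction is clearly positive, while the correlation that controls for wealth is close to zero. A journalist who only looks at the first number would report a relationship that the data, by construction, do not support.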

Subjectivity and its consequences

Writing news reports on the basis of subjectivity has implications. Among others, Wanta, Golan, and Lee did research on media influence on perceptions of foreign nations. Their results showed that participants perceived a country more negatively when it received more negative media coverage. Positive media coverage, on the other hand, had no influence on public perceptions. So when a news reporter is negative about, for example, a minority group and he is led by his confirmation bias and/or an illusory correlation, his news report is rather subjective and more negative about that minority group than needed. This affects readers and it might enhance negative stereotyping.

Thus, subjectivity plays a role in different facets of writing news reports and it has effects on news readers. The two most important phenomena are the confirmation bias and the illusory correlation, which are present in the first and last stage of the inverted pyramid of Data Journalism. So, please be aware of these fallacies when writing a news article! You can do this by being objective and critical. Don’t search only with key terms that are consistent with your own beliefs, but be open-minded and also search with their antonyms. When finding patterns in data, take into account that there could be confounding variables or mediators/moderators that affect the outcome.