Ethics in data journalism: accuracy

The following is the first in a series of extracts from a draft book chapter on ethics in data journalism. This is a work in progress, so if you have examples of ethical dilemmas, best practice, or guidance, I’d be happy to include them with an acknowledgement.

Data journalism ethics: accuracy

Probably the most basic ethical consideration in data journalism is the need to be accurate, and to provide proper context to the stories that we tell. That can influence how we analyse the data, how we report on data stories, and how we publish the data itself.

In late 2012, for example, data journalist Nils Mulvad finally got his hands on veterinary prescriptions data that he had spent seven years fighting for. But he decided not to publish the data when he realised that it was full of errors.

“Publishing it directly would have meant identifying specific doctors for doing bad things which they hadn’t done. So we only used the data to point us to possible bad examples from which to start the journalistic research and verify information.”

Similarly, when Mulvad approached school leaders about data released by the Danish ministry of education, it turned out grades had been miscalculated. Getting a response on data before publishing is a vital step in checking its accuracy, and the more independent the responding party, the better.

“In traditional journalism we spend significant resources ‘lawyering’ every name and associated assertion in a controversial story… and for those names that we don’t lawyer, we just leave them out of the story. That’s a tradeoff that probably applies in some way when it involves a bulk list of names.”

Sometimes, however, the size or nature of the data makes verifying every row impossible, but there is still an ethical imperative to publish the data in order to ‘hold it to account’. In those cases, Mulvad illustrates how he weighs up competing ethical principles:

“We need to judge if publication is more important than protecting individuals for potential errors in them by the authority which has provided them. We decided in [a story about EU farm subsidies] that, in the absence of alternative methods, it was more important to display the data as a way to investigate who received the money, and for what, after some general cleaning and checking. If someone then asks to be removed from the database we do it – in very few cases – where there is evidence that they need to hide their address or something like that.”

Where possible errors are highlighted, Mulvad says, he contacts the authorities responsible and asks whether they can be corrected.

In some cases the errors are so serious that the data should be withdrawn completely. The Texas Tribune decided to do so when it began receiving complaints that its prisoner database incorrectly showed some offenders as having been convicted of a different crime: sexually assaulting a child.

“My husband couldn’t understand why people were taunting him and calling him a child molester,” a woman named Lori Wallace-Wilson said.
“[We] analyzed the information associated with other prisoners … and discovered that there were more than 300 inmates who, like Wilson, had been assigned [crime codes] by the agency that corresponded with aggravated sexual assault of a child — even though the description of their actual offenses included no reference to a minor.
“Initially, we decided simply to remove those codes from our site for those offenders. But then we learned offenders with other types of convictions were also being coded incorrectly [by the Texas Department of Criminal Justice].”

What is important in this case is not just the principle of accuracy. Because the inaccurate data was official data, ethical principles of serving the public good in highlighting the inaccuracies need to be considered. Further investigation confirmed that the codes were not double-checked for accuracy by public officials, and as a result of reporting this the Texas Department of Criminal Justice decided to review the records of the inmates with incorrectly entered data.

Gathering and reporting data

When it comes to gathering data, guidance on political polls provides a useful precedent. The BBC’s guidelines, for example, specify that when commissioning polls “the methodology and the data, as well as the accuracy of the language, must stand up to the most searching public scrutiny.”

In reporting polls, the importance of context is emphasised, including identifying the organisation that conducted the poll, the questions, method and sample size, and the broader trend of all polls or a particular pollster. Language should “say polls ‘suggest’ and ‘indicate’, but never ‘prove’ or ‘show’” (BBC, 2013) and “draw attention to events which may have had a significant effect on public opinion since it was done”. Doubts about sources should be reflected in reporting, or in the decision to report on a poll at all.

Reporting the margin of error is particularly important: this is the range within which the ‘true’ figures are likely to sit. The principle does not just apply to voting polls: in 2011 the BBC headlined a story “UK unemployment total on the rise” but should have mentioned that the ‘rise’ was well within the margin of error, meaning that the true figures could actually have shown no change, or even a drop in joblessness.
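
As a rough illustration of the arithmetic, the sketch below uses the standard approximation for the 95% margin of error of a proportion from a simple random sample. The function names and figures are hypothetical, and official statistics such as unemployment totals publish their own sampling variability, which should be used in preference to this simplified formula.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def change_within_margin(change, margin):
    """True if a reported change is no larger than the margin of error."""
    return abs(change) <= margin


# Hypothetical poll: 50% support among 1,000 respondents.
moe = margin_of_error(0.5, 1000)
print(f"Margin of error: +/- {moe * 100:.1f} percentage points")

# A hypothetical 2-point 'rise' since the last poll sits inside that margin,
# so the wording should not present it as a definite increase.
print(change_within_margin(0.02, moe))
```

The margin on a change between two samples is wider than the margin on either sample alone, so a change that fails even this simple check is a clear signal to soften the language.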

When gathering data yourself, it is important to be accurate in your language and explicit about your methodology.

In this they are clear about how they define the term “civilian”, and “child” – terms which could be taken for granted, but which are not always clear. One of the reasons for low official ‘civilian’ casualty counts, for example, was that, according to the New York Times, the US “in effect counts all military-age males in a strike zone as combatants, according to several administration officials, unless there is explicit intelligence posthumously proving them innocent.”

Context is king

Adding context is a vital part of the data journalism process: absolute figures must be put into the context of the size of the local population, historical patterns, and even differing demographics. Trends must be checked against changes in boundaries or data collection and classification methods.
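
One common way of putting absolute figures into the context of population size is to convert them to a rate. The following is a minimal sketch with hypothetical area names and figures; the function name is an assumption made for illustration.

```python
def rate_per_100k(count, population):
    """Convert an absolute count into a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical figures, for illustration only.
areas = {
    "Area A": {"incidents": 500, "population": 1_000_000},
    "Area B": {"incidents": 120, "population": 80_000},
}

for name, figures in areas.items():
    rate = rate_per_100k(figures["incidents"], figures["population"])
    print(f"{name}: {figures['incidents']} incidents, {rate:.0f} per 100,000")
```

In this made-up example the area with far fewer incidents has the higher rate, which is the kind of reversal that a headline based on absolute figures would miss.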

Data visualisation needs to be equally clear: charts and tables, for example, should generally have a baseline of zero to avoid misrepresenting changes as being more severe than they are, and timescales should be chosen to represent long-term trends rather than to mislead by starting from, or selecting, an all-time low or high.
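
The effect of a non-zero baseline can be shown with simple arithmetic. The sketch below compares how much taller one bar is drawn than another when the axis starts at zero and when it is truncated; the function name and the values are hypothetical.

```python
def visual_ratio(earlier, later, baseline=0.0):
    """Ratio of two bar heights as drawn when the axis starts at `baseline`."""
    if earlier <= baseline:
        raise ValueError("baseline must be below the earlier value")
    return (later - baseline) / (earlier - baseline)


# Hypothetical values: a rise from 95 to 100.
print(visual_ratio(95, 100))               # axis starts at zero
print(visual_ratio(95, 100, baseline=90))  # axis truncated at 90
```

With the axis starting at zero the second bar is drawn about 5% taller than the first; with the axis starting at 90 it is drawn twice as tall, although the underlying figures are the same.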

Clarity can also be undermined by optical illusions created by the way we interpret shapes and lines. For example: in a bubble chart or node diagram two circles which are the same size will look different if one is surrounded by smaller circles and the other by bigger ones (see Fung, 2012 and Skau, 2013 – images below).

How circles can appear to be different sizes based on context – image from Junk Charts

Illusions in data visualisation – image from Visual.ly

Journalists using data also need to be careful not to make false comparisons. Reporter Mike Stucka, for example, talks about a story where school test scores included scores from a private mental hospital:

“These were released, for administrative purposes, along with the regular local school system’s scores. Those kids are few in number, can come from anywhere, aren’t going to be getting a substantial amount of education there because of time, and are rather distracted. I’ve got no problem deleting that “school” entirely.”

Personalisation and interactivity

New forms of storytelling involving interactivity and personalisation present a challenge for providing context: if we hand over some control to the user, how do we ensure they receive that contextual information? Does a calculator that tells the user they are better off because of the latest budget announcement lead them to believe that it is a ‘good budget’? The answer is that wider context must be embedded or integrated at the point when it is needed.

Personalisation also raises issues around the collection of user data. In the creation of their budget calculator, for example, the BBC’s Features team made sure that calculations using user-inputted data such as earnings were performed by the user’s computer (“client-side”), not on BBC servers (“server-side”).

Surveys and scepticism

In data journalism, surveys can be a potential source either of ‘easy data’ or of public service fact-checking.

Survey-based journalism is particularly vulnerable to misrepresentation, especially when results are published without scepticism. Ben Goldacre, for example, writes about a BBC article which reported “Six in 10 people support a new power station at Hinkley in Somerset” (BBC, 2010). The story was based on a survey carried out for the energy company EDF, and the question generating those results was preceded by leading statements and a series of ‘set-up’ questions on the creation of local jobs.

The BBC’s guidance recommends that “appropriate scepticism” be exercised when reporting the results of surveys and that “a description of the methodology used” be included where necessary, including the number of respondents; “percentages should only be used with caution and when contextualised”.

As a result it is important that the journalist dealing with surveys always request access to the original data, including all the questions asked. If that information is not forthcoming then the truth of the claims cannot be established and the journalist will need to take the decision not to publish.

If the survey is already in the public domain, however, the ethical decision then concerns whether the journalist should report on the resistance to publishing more details or data, and on criticisms of the methods used.

Scepticism is also required when gathering data yourself through surveys or ‘crowdsourcing’ (a process of inviting public involvement in data-gathering). Depending on how results might be used, your system may need a check on unusual patterns of behaviour which may suggest people are ‘gaming’ the system; or a ‘second check’ mechanism where some submissions are vetted by other users; or even require some proof to be submitted alongside the data.
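
A check on unusual patterns can be very simple. The sketch below flags any contributor who has made more submissions than a chosen threshold, so that those entries can be reviewed by a person before they are used; the field names, threshold and data are all hypothetical and would depend on the project.

```python
from collections import Counter


def flag_heavy_submitters(submissions, max_per_submitter=3):
    """Return submitter IDs with more submissions than the threshold.

    `submissions` is a list of dicts with a 'submitter' key. The field name
    and the threshold are hypothetical and would depend on the project.
    """
    counts = Counter(entry["submitter"] for entry in submissions)
    return sorted(s for s, n in counts.items() if n > max_per_submitter)


# Hypothetical submissions, for illustration only.
submissions = (
    [{"submitter": "user-1", "value": 10}] * 2
    + [{"submitter": "user-2", "value": 99}] * 5
)
print(flag_heavy_submitters(submissions))
```

A flag of this kind only suggests where to look: a high volume of submissions may be entirely legitimate, so the decision to exclude data should rest with a person rather than with the rule.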

More broadly, journalists should adhere to the ethical principle of transparency in attributing sources (see Friend and Singer, 2007) and linking to the full data where possible, with the exceptions detailed above and in the section on protecting sources below.

Predictions

Predictions are one type of data where the conflict between the principle of accuracy and minimising harm comes to the fore. New York Times data journalist Nate Silver, in his book The Signal and the Noise (2013), explains how the publication of predictions can be either self-fulfilling or self-cancelling, as well as dangerous to publicise:

“If you can’t make a good prediction, it is very often harmful to pretend that you can. I suspect that epidemiologists, and others in the medical community, understand this because of their adherence to the Hippocratic oath. Primum non nocere: First, do no harm.” (Silver, 2013, p230)

Journalists should be especially wary when other predictions do not indicate the same thing, as can happen in political voting intention polls (Silver implies that Rick Santorum’s lead in a single poll may have led voters to switch allegiance).