FAQ: Data journalism, scraping and Help Me Investigate

We call it data-driven journalism (DDJ) nowadays, and we used to call it computer-assisted reporting. Did only the name change or has there been a more profound change in this set of skills and methods?

I talk a bit about this in the data journalism chapter of the Online Journalism Handbook. I think it’s a qualitative and quantitative change: CAR was primarily about using spreadsheets and databases on datasets, locally on your computer.

DDJ is about the shift to datasets and tools that are available via the network: it includes the automation of processes (e.g. scraping, querying APIs), and it expands ‘data’ beyond spreadsheets to include a much wider range of digitised information: connections, images, audio, video, text. It also includes the shift from using CAR for newsgathering to using DDJ techniques for the actual communication of the story: from interactive databases to live visualisation, data-driven tools and apps, and so on.

With the ubiquity of new tools, such as tools for data visualisation, should every journalist be a data journalist nowadays? Can every journalist adopt these skills?

Every journalist should look at ways they can incorporate DDJ techniques in their work – that might be automating part of their newsgathering (for example setting up, combining, filtering and publishing RSS feeds) but more widely it should be a recognition of what information in their field is digitised.

That used to be a minor aspect of reporting – occasional reports or statistical releases. But now it’s a regular and central part of everything from sport and fashion to politics and crime. You could possibly justify ignoring that in the past but to do so now will look increasingly ignorant and lazy. The expectation is higher that journalists justify their roles with something more than filling space.

Talking about visualisations, is visual representation of information getting more important than the actual textual narration online?

I don’t think so. I think it’s more important than it has been, simply because there’s an increased recognition that some people prefer to communicate and consume visually. But it’s not an either/or: visualisation has strengths and weaknesses, as does text, and each will work better for different stories.

You recently published the book “Scraping for Journalists”. Scraping is a new term in the Balkans, not widespread among journalists. So could you tell us what scraping is in DDJ and a bit more about your book – it seems to be a manual, what exactly can we learn from it?

Scraping is the process of gathering and storing data from online sources. That might be a single table on a webpage, or it might be information from hundreds of webpages, the results of a database search, or dozens of spreadsheets, or thousands of PDF reports.
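To make the simplest case concrete, here is a minimal sketch in Python of scraping a single table from a webpage and storing it as CSV. It is only an illustration, not the approach taken in the book: the sample HTML, the council names and the figures are all invented, and the names TableScraper, scrape_table and to_csv are hypothetical. A real scraper would first download the page (for example with urllib.request.urlopen) instead of using an inline string.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Store the scraped rows as CSV text, ready to save or open in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample page, standing in for a downloaded webpage.
SAMPLE_HTML = """
<table>
  <tr><th>Council</th><th>Spend</th></tr>
  <tr><td>Northtown</td><td>1200</td></tr>
  <tr><td>Southville</td><td>800</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

The same pattern scales to the larger cases: loop over hundreds of page addresses, scrape each one, and append the rows to a single file.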

It’s a very important way for journalists to be able to ask questions quickly without having to spend days or weeks manually printing off and poring through documents. When governments make it harder to ask questions, scraping pulls down some of those barriers.

The book teaches you how to write a very basic scraper in five minutes using Google Docs, and then takes you through more and more powerful scrapers using a range of free tools to solve a number of typical problems that journalists face. I really wrote it because I realised that journalists were trying to learn programming in the same way that they learned journalism – when actually what’s really important is not learning programming languages, but problem-solving techniques. I wanted a book that didn’t finish with the last page.

Do you cover DDJ in your university classes? What do you focus on – concrete skills and tools or changes DDJ is bringing to the media landscape or….?

Yes – I teach data journalism both on the MA in Online Journalism at Birmingham City University that I lead, and the MA courses at City University London, where I’m a visiting professor. I focus on some core techniques – such as spreadsheet tips and visualisation principles – along with those problem-solving techniques I mentioned earlier: where to look for solutions and the importance of engaging with online communities. I also talk about the context: how different parts of the media are using these data skills, both editorially and commercially.

Any particular tools in the DDJ toolbox you would recommend to beginners? And how should a self-taught journalist start learning DDJ skills?

I always advise students and trainees to start with the stories they’re reporting, rather than particular skills. A sports reporter will need different skills to an education correspondent, or a business reporter. Start with a simple question you have and see if you can find the data to answer it – or look for a simple clean dataset in your field and see what simple stories you can find in it: who’s top and bottom? Where does the money go? Then put the data aside and pick up the telephone.

Some skills are more likely to come in useful than others: using advanced search operators to find spreadsheets and reports, for example. Pivot tables and advanced filters in spreadsheets. Knowing about making FOI requests might be useful to some; scraping for others.
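For readers who have not met a pivot table, the idea can be sketched in a few lines of Python: group the rows by one column, total another, then rank the results to answer "who's top and bottom?" and "where does the money go?". This is only an illustration under stated assumptions. The spending records are invented, and the names pivot_sum and top_and_bottom are hypothetical; a spreadsheet pivot table does the same job without any code.

```python
from collections import defaultdict

# Invented sample records, standing in for a small, clean spending dataset.
records = [
    {"department": "Housing", "supplier": "Acme", "amount": 500},
    {"department": "Housing", "supplier": "Bolt", "amount": 300},
    {"department": "Transport", "supplier": "Acme", "amount": 1200},
    {"department": "Parks", "supplier": "Cedar", "amount": 150},
    {"department": "Parks", "supplier": "Bolt", "amount": 50},
]


def pivot_sum(rows, key, value):
    """Total the `value` column for each distinct entry in the `key` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def top_and_bottom(totals):
    """Return the (name, total) pairs with the largest and smallest totals."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[0], ranked[-1]


by_department = pivot_sum(records, "department", "amount")
top, bottom = top_and_bottom(by_department)
print("Totals:", by_department)
print("Top:", top)
print("Bottom:", bottom)
```

Changing the key from "department" to "supplier" asks a different question of the same data, which is the main attraction of pivot tables.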

There are some good books for the big view, but mostly beginners should get into the habit of searching online for answers to individual problems – and having conversations in online communities like NICAR, the Wobbing (European FOI) mailing list, the Scraperwiki mailing list, and so on.

How would you sell the need to invest resources in teaching journalists DDJ skills to a disinterested newsroom management?

Don’t fall for the myth that data journalism is ‘resource intensive’. It can save your staff time and money. It can lead to stories that are stickier, gather more user data, and are more appealing to advertisers. It can lead to new commercial opportunities and new revenue streams. It can differentiate your content from the commodity news that is rapidly depreciating in value.

You are a founder of the website HelpMeInvestigate. Tell us about the website.

HelpMeInvestigate.com was set up in 2009 to explore “crowdsourcing” investigative journalism – in other words, collaborating with members of the public to do public interest investigations. The project is completely voluntary and has no paid staff. Investigations range from simple questions to longform investigations – the most recent was an investigation into the allocation of Olympic torchbearer places, which led to coverage in The Guardian, Independent, Daily Mail, BBC radio, local newspapers across the UK, and even a German newspaper. The fruits of the investigation were published as a longform ebook – 8,000 Holes: How the 2012 Olympic Torch Relay Lost Its Way – in the final week of the Olympic torch relay, which was incredible to be able to do. The book is free, by the way, but readers can also make a donation to the Brittle Bone Society.