6 open source tools for data journalism | Opensource.com

When I was in journalism school back in the late 1980s, gathering data for a story usually involved hours of poring over printed documents or microfiche.

A lot has changed since then. While printed resources are still useful, more and more information is available to journalists on the web. That’s helped fuel a boom in what’s come to be known as data journalism. At its most basic, data journalism is the act of finding and telling stories using data—like census data, crime statistics, demographics, and more.

There are a number of powerful and expensive tools that enable journalists to gather, clean, analyze, and visualize data for their stories. But many smaller or struggling news organizations, let alone independent journalists, just don’t have the budget for those tools. That doesn’t mean they’re out in the cold.

There are a number of solid open source tools for data journalists that do the job both efficiently and impressively. This article looks at six tools that can help data journalists get the information that they need.

Grabbing the data

Journalists can download much of the data they find on the web as spreadsheets, CSV files, or PDFs. But there’s a lot of information that’s embedded in web pages. Instead of manually copying and pasting that information, a trick just about every data journalist uses is scraping. Scraping is the act of using an automated tool to grab information embedded in a web page, often in the form of an HTML table.
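To make the idea concrete, here is a minimal sketch of scraping an HTML table using only Python's standard library. It is not how any particular tool works internally; the class name `TableScraper`, the helper `scrape_table`, and the sample page are all made up for illustration, and a real page would usually need more careful handling (nested tables, merged cells, and so on).

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Neighborhood</th><th>Burglaries</th></tr>
  <tr><td>Downtown</td><td>42</td></tr>
  <tr><td>Riverside</td><td> 17 </td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

For a live page, the HTML string could come from `urllib.request.urlopen(url).read().decode(...)` instead of the sample text, assuming the site's terms allow automated access.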

If you, or someone in your organization, is of a technical bent, then Scrapy might be the tool for you. Written in Python, Scrapy is a scraping framework, driven from the command line, that can quickly extract structured data from web pages. Scrapy is a bit challenging to install and set up, but once it’s up and running you can take advantage of a number of useful features. Python-savvy programmers can also quickly extend those features.

Spreadsheets are one of the basic tools of the data journalist. In the open source world, LibreOffice Calc is the most widely used spreadsheet editor. Calc isn’t just for viewing and manipulating data. By taking advantage of its Web Page Query import filter, you can point Calc to a web page containing data in tables and grab one or all of the tables on the page. While it’s not as fast or efficient as Scrapy, Calc gets the job done nicely.

Dealing with PDFs

Whether by accident or by design, a lot of data on the web is locked in PDF files. Many of those PDFs can contain useful information. If you’ve done any work with PDFs, you know that getting data out of them can be a chore.

That’s where DocHive, a tool developed by the Raleigh Public Record for extracting data from PDFs, comes in. DocHive works with PDFs created from scanned documents. It analyzes the PDF, separates it into smaller pieces, and then uses optical character recognition to read the text and inject the text into a CSV file. Read more about DocHive in this article.

Tabula is similar to DocHive. It’s designed to grab tabular information in a PDF and convert it to a CSV file or a Microsoft Excel spreadsheet. All you need to do is find a table in the PDF, select the table, and let Tabula do the rest. It’s fast and efficient.

Cleaning your data

Often, the data you’ll grab may contain spelling and formatting errors or problems with character encoding. That makes the data inconsistent and unreliable, and makes cleaning the data essential.

If you have a small data set, one that consists of a few hundred rows of information, then you can use LibreOffice Calc and your eyeballs to do the cleanup. But if you have larger data sets, doing the job manually will be a long, slow, inefficient process.

Instead, turn to OpenRefine. It automates the process of manipulating and cleaning your data. OpenRefine can sort your data, automatically find duplicate entries, and reorder your data. The real power of OpenRefine comes from facets. Facets are like filters in spreadsheets that let you zoom in on specific rows of data. You can use facets to ferret out blank cells and duplicate data, as well as see how often certain values appear in the data.
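The idea behind facets can be sketched in a few lines of Python. This is only an illustration of the concept, not OpenRefine's implementation; the sample records and the function names `text_facet`, `blank_rows`, and `duplicate_rows` are hypothetical.

```python
from collections import Counter

# Hypothetical records, as they might look after a messy import.
records = [
    {"city": "Raleigh", "agency": "Police"},
    {"city": "raleigh ", "agency": "Police"},
    {"city": "Durham", "agency": ""},
    {"city": "Raleigh", "agency": "Police"},
    {"city": "Durham", "agency": "Fire"},
]


def text_facet(rows, column):
    """Count how often each raw value appears in a column."""
    return Counter(row[column] for row in rows)


def blank_rows(rows, column):
    """Return the indexes of rows whose value is empty or only whitespace."""
    return [i for i, row in enumerate(rows) if not row[column].strip()]


def duplicate_rows(rows):
    """Return the indexes of rows that exactly repeat an earlier row."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


city_facet = text_facet(records, "city")
for value, count in city_facet.most_common():
    print(repr(value), count)

print("blank agency rows:", blank_rows(records, "agency"))
print("duplicate rows:", duplicate_rows(records))
```

Listing the raw values side by side is what makes inconsistencies such as "Raleigh" and "raleigh " stand out, which is the kind of problem a facet is meant to surface.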

Visualizing your data

Having the data and writing a story with it is all well and good. A good graphic based on that data can be a boon when trying to summarize, communicate, and understand data. That explains the popularity of infographics on the web and in print.

You don’t need to be a graphic design wizard to create an effective visualization. If your needs aren’t too complex, Datawrapper can do the job. It’s an online tool that breaks creating a visualization into four steps: copy data from a spreadsheet, describe your data, choose the type of image you want, then generate the graphic. You don’t get a wide range of image types with Datawrapper, but the process couldn’t be easier.

Obviously, this isn’t an exhaustive list of open source data journalism tools. But the tools discussed in this article provide a solid platform for a journalism organization on a budget, or even an intrepid freelancer, to use data to generate story ideas and to back those stories up.

3 Comments

Thanks for sharing these resources, Scott. I was in journalism school in the late 90s/early 2000s and took a class called "Internet Journalism." We learned basics on how to search for information online, pre-Google. We were told to do all of our searches on ixquick.com. Later, when I became a city hall reporter, I used to have to trek downtown to city hall to pick up the city council agenda every other week. These days the full agendas are online, including supporting materials, not to mention a wealth of other data from city departments. I have mad respect for journalists who reported pre-Internet days, especially on data-driven stories. It's amazing how far we've come and how much has changed in such a short time.

Ginny, times definitely have changed. When I was in J-school in the 80s, an investigative reporter visited my class and described his typical day: head over to the hall of records (or wherever) first thing in the morning, comb through documents, take a break for lunch, comb through more documents. With a few breaks in between and attempts to ferret out sources. And more documents ...

But, as Nicolas Kayser-Bril pointed out, journalists should be extremely careful before reusing a dataset that was proactively published by a government. Or by anyone else, for that matter.

Writer. Technology coach. Soldier of fortune. Ocelot wrangler. Husband and father. Blogger. Collector of pottery. Scott is a few of these things. He's also a long-time user of free/open source software who extensively writes and blogs on the subject.

The opinions expressed on this website are those of each author, not of the author's employer or of Red Hat.

Opensource.com aspires to publish all content under a Creative Commons license but may not be able to do so in all cases. You are responsible for ensuring that you have the necessary permission to reuse any work on this site. Red Hat and the Shadowman logo are trademarks of Red Hat, Inc., registered in the United States and other countries.