Working with tabular data

Working with tabular data

At its simplest, tabular data is data that is stored in rows and columns (hence the name “tabular” i.e. in tables), either in a flat file or a database, and is usually comprised of simple alphanumeric values. CSV/TSV, JSON, XLS(X), and XML are some of the more common formats you’ll find tabular data in, though unfortunately it does still often appear in non machine-readable formats like PDF and DOC and most first be extracted and cleaned before being used.

Converting between data formats

There’s a good chance that you’re going to want to convert your data from the format you’ve found into something a little more modern and useful (like JSON). Mr. Data Converter is a simple web-based tool for coverting from Excel, CSV, and TSV to JSON, HTML, MySQL, PHP, Python, Ruby, and more.

For batching up the conversion process to run across many datasets consider the dataconvert command-line tool developed by OKFN Labs for converting from CSV, XLS(X), and JSON to CSV.

Cleaning your data

If your data has had humans involved in entering it then it’s probably fully of all sorts of small variations in how the data have been entered that you’ll need to clean up before it becomes machine-readable. Fortunately, there are a couple of great tools.

OpenRefine (formerly Google Refine) is a powerful desktop tool for cleaning messy data, transforming it between different tabular data formats, and even integrates with web services via some simple connectors so you can, for example, geocode a bunch if addresses using Google direclty in OpenRefine. Check out the School of Data’s simple tutorial on using OpenRefine to see it in action.

If out-of-the-box tools aren’t cutting it and you need to dive into code take a look back at some of the Python libraries, like pandas, that we recommended in Converting between data formats. If you’re feeling brave take a look at dedupe, which leverages machine-learning to perform de-duplication and cleansing of data.

And if all else fails you can always fall back to reliable command-line tools like grep, awk, and sed combined with regular expressions. If you need to upskill your regex foo Debuggex and Regexpr should set you on the right path.

Analysing tabular data

So you’ve got a nice clean dataset and now you want to do some analysis on it to understand if reality matches your hypothesis!

Spreadsheets

Sometimes the simplest tools are the best and a spreadsheet is all you need – Excel, afte rall, is the world’s most widely used IDE!

When it comes to sharing your analysis with the world check out Knitr, for quick and easy report generation, googleVis for making R and Google Charts talk nicely, and Shiny for a full-blown web app framework for R to turn your awesome analyses into a shiny interactive web app (such as this demo).