Data Formats

Data comes in a thousand and one formats, some friendlier than others. Let’s review a few!

API

An API (application programming interface) is a way for computers to communicate with one another. For us, this generally means sharing data. We’ll be coding up Python scripts to talk to and request data from machines around the world, from Twitter to the United States government.
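As a rough sketch of what "requesting data" looks like in Python, the snippet below builds a request URL and defines a helper that would fetch and decode a JSON response. The endpoint `https://api.example.com/cities`, its parameters, and the function names `build_url` and `fetch_json` are all made up for illustration; a real API will document its own URLs, parameters, and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Attach query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Request a URL and decode the JSON body into Python objects."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/cities", {"state": "NY", "limit": 5})
print(url)

# A real request would look like this. It is commented out because
# the endpoint above does not exist:
# data = fetch_json(url)
```

Most APIs work along these lines: you describe what you want in the URL, and the server sends back structured data for your script to read.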

CSV

Sample CSV

city,population
New York,8406000
Los Angeles,3884000
Richmond,214114

Comma-separated values (CSV) is one of the most common formats for data. A CSV is a quick export away from Excel or Google Spreadsheets, and you’ll find yourself working from CSVs more often than not.
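A minimal sketch of reading a CSV like the sample with Python's built-in `csv` module. The data is held in a string here so the example runs on its own; with a real file you would pass an opened file object to `csv.DictReader` instead. The variable names are illustrative.

```python
import csv
import io

sample = """city,population
New York,8406000
Los Angeles,3884000
Richmond,214114
"""

# DictReader uses the header row as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(sample)))

# Every CSV value arrives as a string, so convert numbers explicitly.
for row in rows:
    row["population"] = int(row["population"])

# Find the row with the largest population.
largest = max(rows, key=lambda row: row["population"])
print(largest["city"])
```

The explicit `int()` conversion matters: a CSV has no notion of data types, so comparing the raw population strings would sort them alphabetically rather than numerically.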

Although “comma-separated” is in the name, a CSV can also use tabs, pipes, or any other character as a field delimiter (a tab-separated file is often called a TSV instead).
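To illustrate, Python's `csv` module accepts a `delimiter` argument, so the same reader handles pipe- and tab-separated text. The two small strings below are invented sample data.

```python
import csv
import io

# The same two rows, once pipe-delimited and once tab-delimited.
pipe_text = "city|population\nRichmond|214114\n"
tab_text = "city\tpopulation\nRichmond\t214114\n"

pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

# Both parse to identical lists of fields.
print(pipe_rows == tab_rows)
```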

JSON

JSON example

{
  "state": "Tennessee",
  "presidents": [
    {"name": "Andrew Jackson", "term": [1829, 1837]},
    {"name": "James K. Polk", "term": [1845, 1849]},
    {"name": "Andrew Johnson", "term": [1865, 1869]}
  ]
}

JSON stands for JavaScript Object Notation, and it’s a slightly more complicated format than a CSV. It can contain lists, numbers, sub-items, and all sorts of complexities that are great for expressing the nuance of real-world data. Data from APIs is often formatted as JSON.
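Here is a short sketch of loading that kind of nested data with Python's built-in `json` module. Strict JSON requires double quotes around keys, so the sample is written that way here. The variable names are illustrative.

```python
import json

text = """
{
  "state": "Tennessee",
  "presidents": [
    {"name": "Andrew Jackson", "term": [1829, 1837]},
    {"name": "James K. Polk", "term": [1845, 1849]},
    {"name": "Andrew Johnson", "term": [1865, 1869]}
  ]
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(text)

# Reach into the nested structure with ordinary indexing.
names = [president["name"] for president in data["presidents"]]

# Each "term" is a [start_year, end_year] list.
years_in_office = {
    president["name"]: president["term"][1] - president["term"][0]
    for president in data["presidents"]
}

print(data["state"], names, years_in_office)
```

Unlike a CSV, the numbers come back as real integers and the list of presidents stays nested under its state, with no extra parsing required.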

Shapefiles

Shapefiles are by far the most common format for geographic data. City council districts, state boundaries, and the nearest wifi spots can often be found as shapefiles. You can import them into software like QGIS or convert them to geography-friendly JSON.

GeoJSON and TopoJSON

GeoJSON example

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.970947265625,
          40.81380923056958
        ]
      }
    }
  ]
}

GeoJSON and TopoJSON are both specially formatted JSON files that contain geographic information.
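Because GeoJSON is ordinary JSON, the same `json` module can read it. The sketch below pulls the point out of a feature collection like the example. One detail worth knowing is that GeoJSON lists coordinates as longitude first, then latitude. The variable names are illustrative.

```python
import json

geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Point",
        "coordinates": [-73.970947265625, 40.81380923056958]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)

# Collect (latitude, longitude) pairs for every Point feature.
points = []
for feature in collection["features"]:
    geometry = feature["geometry"]
    if geometry["type"] == "Point":
        # GeoJSON order is [longitude, latitude].
        longitude, latitude = geometry["coordinates"]
        points.append((latitude, longitude))

print(points)
```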

The Lede Program

An intensive, post-bac certification program designed to equip journalists and storytellers of all kinds with the computational skills needed to turn data into narrative, from Columbia’s Graduate School of Journalism and Department of Computer Science.