Nothing too groundbreaking here if you’re a coder, but this was an interesting step in my coding development.

This weekend, I pounded away at the DocumentCloud API, experimenting with how we might integrate it into our continually developing workflow at PBS News.

This taught me a) the DocumentCloud folks have some of the clearest documentation I’ve ever seen, b) the DC folks are really responsive to requests (they added created and modified dates to the API less than two hours after I asked), c) don’t make test documents public, especially now that anyone can search the public docs, and d) APIs are an awesome tool!
As data mungers, we deal with completely unstructured data a lot of the time (I’m in the middle of a MASSIVE project dealing with exactly that right now), so it’s always a pleasure when there’s an API, short for “Application Programming Interface,” which at least returns data in a fairly consistent pattern.

APIs are showing up all over journalism. My former colleagues at the LAT have one for LA neighborhoods, and both USA Today and the New York Times have several, covering beats from politics (NYT’s Congress API) to literature (USAT’s Bestsellers API). They come in formats that are much easier to parse programmatically than a lot of the other stuff we have to deal with. And while it’s important that we structure our content as data, we should also pay attention to structuring the information we engage with as data, so we get back to the “reporting” part of database reporting. It’s something I’m trying to spend a lot more time on.

Fortunately, API skills serve you well whether you’re talking about newsroom content or external information, so it’s a useful skill to have.

Oh, and if you want to avoid the programming part altogether, my mentor Derek Willis can help you bring it directly into Excel. (How many entries do we think I got through without citing Derek? Not many. I’ll try again starting next post. Prediction: It won’t last.)

But even so, the overarching point is that using programming (my language of choice is Python) can help you, yes you, have the flexibility to pick information out of the API and bring it into a CSV, or a database, with the structure you impose. So, hopefully, you can start to imagine the fun I had with pulling in, displaying and remixing the documents we’ll be using to tell stories at PBS News.
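To make that concrete, here's a minimal sketch of the idea: pulling a few fields out of an API's JSON response and writing them to a CSV with columns you choose. The response and field names here are made up for illustration; a real script would fetch the JSON over the Web rather than from a string.

```python
import csv
import json

# A hypothetical API response -- in real life this would come back from a
# URL request; the document fields here are invented for the example.
response = json.loads("""
{"documents": [
    {"title": "Budget Memo", "created_at": "2011-02-01", "pages": 12},
    {"title": "Hearing Transcript", "created_at": "2011-02-03", "pages": 45}
]}
""")

# You impose the structure: pick only the columns you care about.
with open('documents.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'created_at'])
    for doc in response['documents']:
        writer.writerow([doc['title'], doc['created_at']])
```

The point is that the API hands you a nested data structure, and you decide which pieces become columns in your flat file or database table.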

Wait, there’s more! You know what’s even better? I’m part of an online journalism group that’s learning Python together. It’s a really great mixture, because I’m learning a lot from people who’ve been doing this much longer and better than I, but I’m also able to help people who are new to Python — where I was a year ago. And I love giving back to the community whenever I am able.

So, when one journalist asked how to get a JSON response from the Twitter API to go into a CSV, I was happy to mock something up, based on the DocCloud code I had put together, and share away. It occurs to me this might help you make the most of structured data from Twitter, DocumentCloud, or any of the JSON-ish APIs out there, so I’ll post it here as well. Let me know how to make it better, and if you have any questions.

# To put this into a CSV: csv code adapted from this recipe
# (http://www.palewire.com/posts/2009/03/03/django-recipe-dump-your-queryset-out-as-a-csv-file/)
# by Ben Welsh at the LAT, who helped me do my first work with APIs

# IMPORTS
# Make Python understand how to read things on the Internet
import urllib2
# Make Python understand that the stuff in a page on the Internet is JSON
import json
# Make Python understand CSVs
import csv
# Make Python know how to take a break, so we don't hammer the API and exceed the rate limit
from time import sleep

# Tell the computer where to put the CSV
outfile_path = '/path/to/my/file.csv'

# Open it up; the 'w' means we will write to it
writer = csv.writer(open(outfile_path, 'w'))

# Create a list with headings for our columns
headers = ['user', 'date', 'tweet_text']

# Write the row of headings to our CSV file
writer.writerow(headers)

# GET JSON AND PARSE IT INTO A DICTIONARY
# We need a loop because we have to do this for every JSON file we grab

# Set a counter telling us how many times we've gone through the loop;
# this is the first time, so we'll set it at 1
i = 1

# Loop through pages of JSON returned; 100 is an arbitrary limit
while i < 100:
    # Print out what number loop we are on, which will make it easier
    # to track down problems when they appear
    print i
    # Create the URL of the JSON file we want. We search for 'egypt', want English tweets,
    # and set the number of tweets per JSON file to the max of 100,
    # so we have to do as little looping as possible
    url = urllib2.Request('http://search.twitter.com/search.json?q=egypt&lang=en&rpp=100&page=' + str(i))
    # Use the JSON library to turn this file into a Pythonic data structure
    parsed_json = json.load(urllib2.urlopen(url))
    # Now you have a giant dictionary.
    # Type parsed_json at the interpreter to get a better look at it.
    # You'll see the bulk of the content is contained inside the value
    # that goes with the key, or label, 'results'.
    # Refer to results as an index: just like list[1] refers to the second item in a list,
    # dict['results'] refers to the values associated with the key 'results'.
    print parsed_json
    # Run through each item in results, and jump to an item in that dictionary,
    # ex: the text of the tweet
    for tweet in parsed_json['results']:
        # Initialize the row
        row = []
        # Add every 'cell' to the row list, identifying the item just like an index in a list
        row.append(tweet['from_user'].encode('utf-8'))
        row.append(tweet['created_at'].encode('utf-8'))
        row.append(tweet['text'].encode('utf-8'))
        # Once you have all the cells in there, write the row to your CSV
        writer.writerow(row)
    # Increment our loop counter; now we're on the next time through the loop
    i = i + 1
    # Tell Python to rest for 5 seconds, so we don't exceed our rate limit
    sleep(5)

Hi, I think your code is exactly what I need. I tried it; it didn’t give me an error, but it also didn’t write to the CSV file. I was able to load the JSON file, but after that, nothing. I mean, even before that,

writer.writerow(headers) didn’t write anything to the CSV file. HELP please!

This sounds like there’s an issue with the path to your CSV file. Sometimes this doesn’t throw an error; the script just tries to write to a file that doesn’t exist. Double-check that your path is correct. If you still have issues, feel free to follow up by sending me your script and CSV file via email at meminkoff AT gmail DOT com, and I’ll see if I can diagnose what’s going on.

Hi, I emailed you my code a couple of days ago. I don’t know if you got it, but I think I figured out how to make it work. I was just wondering if you had an idea for a way to output from_user in sorted order, by how many times a certain user tweeted about Egypt?
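A sketch of one way to do that, assuming you collect the from_user values into a list as the script runs (the user names below are made up): Python's collections.Counter tallies the values and sorts them by frequency for you.

```python
from collections import Counter

# Hypothetical list of from_user values pulled out of the API results
users = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']

# Counter tallies each user; most_common() returns (user, count) pairs
# sorted from most tweets to fewest
counts = Counter(users)
top_users = counts.most_common()
```

Each entry in top_users is a (user, count) tuple, so you can write them straight out to a CSV in ranked order.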

Excellent! Big thanks!!! I’ve been having problems with this for two days. I finally see the light! There is a slight problem with some of the tweets: since they include a \n in their shortened URLs, Python thinks it’s a new line… any ideas on how to overcome this problem?
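One possible fix for that, sketched here with a made-up tweet: flatten the whitespace in each tweet’s text before writing it, so embedded newlines can’t start a new row in the CSV.

```python
# A hypothetical tweet with a literal newline embedded in it
text = 'Protests continue\nhttp://bit.ly/example'

# str.split() with no argument splits on any whitespace (including \n),
# and ' '.join() stitches the pieces back together with single spaces
clean = ' '.join(text.split())
```

You would apply this to tweet['text'] right before appending it to the row.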

Hi Michelle!
I’m trying to use your code to parse JSON from Facebook.
I’ve found that it works until I get a post without one of the elements.
It’s like the case of a tweet without the “from_user” tag. When that occurs, I get a “KeyError” and the code stops.
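A sketch of one way around that KeyError, using a hypothetical post that is missing a key: dict.get() returns a default value instead of raising an exception when the key is absent.

```python
# A hypothetical post missing the 'from_user' key, like the case described above
tweet = {'text': 'hello from Cairo'}

# dict.get() returns the second argument instead of raising KeyError
user = tweet.get('from_user', '')
text = tweet.get('text', '')
```

Swapping tweet['from_user'] for tweet.get('from_user', '') in the loop would let the script keep going and leave that cell blank.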

Just replace this line: url = urllib2.Request('http://search.twitter.com/search.json?q=egypt&lang=en&rpp=100&page=' + str(i))
with the address or URL you used to grab that page. If the JSON is just on your computer, you’ll need to put it on the Web somewhere, and then use that address instead. Everything else should be similar.

Whose ramblings are these?

Hi! I'm Michelle, a passionate journalist-programmer on an ongoing quest to combine data, technology and writing in the service of telling stories about our world.

My current playground for experimenting is the Associated Press' Washington, DC bureau, where I work as an Interactive Producer, focusing on the data journalism I live, breathe and love.

I've worked as a Data Producer at PBS and a data developer/journo intern at the LA Times' Data Desk. I'm also a proud alum of Medill, Northwestern's journalism school, and originally from the Chicago area.