A blog of life, love, and Lisp

Collecting real-time Twitter data with the Streaming API

Inspired by Neal Caren’s excellent series on Big Data collection and analysis with Python, I want to work on a set of tutorials for some basic collection and analysis as well.

I’m drawing on some of my previous “tworkshops,” which are meant to take people from zero knowledge to knowing their way around basic analysis of Twitter data, with the potential for parallel processing in systems like Hadoop MapReduce.

Let’s start with the basics of what the data look like and how to access it.

Accessing the Twitter API

Researchers and others who want large, publicly available Twitter datasets get them through Twitter’s API. API stands for Application Programming Interface, and many services that want to start a developer community around their product release one. Facebook has an API that is somewhat restrictive, while Klout has an API to let you automatically look up Klout scores and all their different facets.

The Twitter API has two different flavors: RESTful and Streaming. The RESTful API is useful for getting things like lists of followers and those who follow a particular user, and is what most Twitter clients are built off of. We are not going to deal with the RESTful API right now, but you can find more information on it here: https://dev.twitter.com/docs/api. Right now we are going to focus on the Streaming API (more info here: https://dev.twitter.com/docs/streaming-api). The Streaming API works by making a request for a specific type of data — filtered by keyword, user, geographic area, or a random sample — and then keeping the connection open as long as there are no errors in the connection.

For my own purposes, I’ve been using the tweepy package to access the Streaming API. I’ve incorporated two changes in my own fork that have worked well for me on both Linux and OSX systems: https://github.com/raynach/tweepy

Understanding Twitter Data

Once you’ve connected to the Twitter API, whether via the RESTful API or the Streaming API, you’re going to start getting a bunch of data back. The data you get back will be encoded in JSON, or JavaScript Object Notation. JSON is a way to encode complicated information in a platform-independent way. It could be considered the lingua franca of information exchange on the Internet. When you click a snazzy Web 2.0 button on Facebook or Amazon and the page produces a lightbox (a box that hovers above a page without leaving the page you’re on now), there was probably some JSON involved.

JSON is a simple and elegant way to encode complex data structures. When a tweet comes back from the API, this is what it looks like (with a little bit of beautifying):

Let’s move our focus now to the actual elements of the tweet. Most of the keys, that is, the words on the left of the colon, are self-explanatory. The most important ones are “text”, “entities”, and “user”. “Text” is the text of the tweet. “Entities” holds the user mentions, hashtags, and links used in the tweet, separated out for easy access. “User” contains a lot of information on the user, from the URL of their profile image to the date they joined Twitter.
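As a minimal sketch of reading those three keys, the snippet below parses a heavily abbreviated, made-up tweet with Python’s standard json module. The sample values (user names, ID, text) are invented for illustration, and a real tweet carries many more fields than shown here.

```python
import json

# A heavily abbreviated, invented tweet; real tweets have many more keys.
raw = '''
{
  "id_str": "1",
  "text": "Watching the debate with @example_friend #election",
  "entities": {
    "hashtags": [{"text": "election"}],
    "user_mentions": [{"screen_name": "example_friend"}],
    "urls": []
  },
  "user": {"screen_name": "example_user", "followers_count": 42}
}
'''

# json.loads turns the JSON string into nested Python dicts and lists.
tweet = json.loads(raw)

text = tweet["text"]
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
author = tweet["user"]["screen_name"]

print(author, hashtags, mentions)
```

Once a tweet is a plain dictionary like this, pulling out only the fields you care about is a matter of ordinary key lookups.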

Now that you see what data you get with a tweet, you can envision interesting types of analysis that can emerge by analyzing a whole lot of them.

A Disclaimer on Collecting Tweets

Unfortunately, you do not have carte blanche to share the tweets you collect. Twitter restricts publicly releasing datasets according to their API Terms of Service (https://dev.twitter.com/terms/api-terms). This is unfortunate for collaboration when colleagues have collected unique datasets. However, you can share derivative analysis from tweets, such as content analysis and aggregate statistics.

Collecting Data

Let’s get to it. The first step is to get a copy of tweepy (either by checking out the repository or just downloading it) and install it.

The next thing to do is to create an instance of a tweepy StreamListener to handle the incoming data. The way that I have mine set up is that I start a new file for every 20,000 tweets, tagged with a prefix and a timestamp. I also keep another file open for the list of status IDs that have been deleted, which are handled differently than other tweet data. I call this file slistener.py.
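The file-rotation idea can be sketched without tweepy at all. The class below is not the original slistener.py; it is a hypothetical helper (the name RotatingTweetWriter, the out_dir default, and the numeric suffix in the file name are all my assumptions) that starts a new file after a fixed number of tweets and tags each file with a prefix and a timestamp. It leaves out the separate file of deleted status IDs.

```python
import os
import time


class RotatingTweetWriter:
    """Write one raw JSON tweet per line, rotating files every max_tweets."""

    def __init__(self, prefix, out_dir="data", max_tweets=20000):
        self.prefix = prefix
        self.out_dir = out_dir
        self.max_tweets = max_tweets
        self.file_index = 0
        self.output = None
        # Create the output directory up front so open() does not fail
        # with "No such file or directory".
        os.makedirs(self.out_dir, exist_ok=True)
        self._open_new_file()

    def _open_new_file(self):
        if self.output is not None:
            self.output.close()
        stamp = time.strftime("%Y%m%d-%H%M%S")
        # The index suffix (an addition for this sketch) keeps names unique
        # even if two files are started within the same second.
        name = "%s.%s.%03d.json" % (self.prefix, stamp, self.file_index)
        self.path = os.path.join(self.out_dir, name)
        self.output = open(self.path, "a")
        self.file_index += 1
        self.counter = 0

    def write(self, raw_json):
        # Rotate before writing once the current file is full.
        if self.counter >= self.max_tweets:
            self._open_new_file()
        self.output.write(raw_json.strip() + "\n")
        self.counter += 1

    def close(self):
        self.output.close()
```

A stream listener would call write() with the raw JSON string of each incoming status and close() when the stream ends.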

Next, we need the script that does the collecting itself. I call this file streaming.py. You can collect on users, keywords, or specific locations defined by bounding boxes. The API documentation has more information on this. For now, let’s just track some popular keywords — obama and romney (keywords are case-insensitive).
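For location-based collection, the part that is easy to get wrong is the order of the coordinates. As I understand the Streaming API documentation, each bounding box is given as the southwest corner followed by the northeast corner, with longitude before latitude. The helpers below are hypothetical (bounding_box and in_box are names I made up, not part of tweepy), and the San Francisco coordinates are approximate.

```python
def bounding_box(sw_lon, sw_lat, ne_lon, ne_lat):
    """Return a box as [sw_lon, sw_lat, ne_lon, ne_lat], longitude first."""
    # This simple check rejects boxes that cross the 180th meridian.
    if sw_lon > ne_lon or sw_lat > ne_lat:
        raise ValueError("southwest corner must come before northeast corner")
    return [sw_lon, sw_lat, ne_lon, ne_lat]


def in_box(lon, lat, box):
    """Check locally whether a (lon, lat) point falls inside a box."""
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat


# Rough box around San Francisco (approximate coordinates).
san_francisco = bounding_box(-122.75, 36.8, -121.75, 37.8)

print(san_francisco)
print(in_box(-122.42, 37.77, san_francisco))
```

The flat four-number list is the shape a locations filter expects, and in_box can be used afterwards to double-check the coordinates of the tweets that come back.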

According to the Twitter API, if you’re tracking by user, you get: tweets
created by the user, tweets which were retweeted by the user, replies to any
tweet created by the user, retweets of any tweet created by the user, and
“manual” replies to the user created without using Twitter’s “reply” button.

So yes.

Logesh

The code is working fine, but I am not able to convert the JSON to XML or import it into MySQL. It is throwing an error.

Hi, I found your code while trying to figure out twitter OAUTH and how to collect tweets. However, it seems that I can’t create .json file. I get the following error: No such file or directory: ‘../data/test.20131211-130817.json’ (or whatever the timestamp is). Is there something specific that I can do to fix this? Thanks!

Excellent tutorial! Many thanks for this. I am collecting some location-based data at the moment. Actually, I only need some information out of the whole tweet (i.e. timestamp, user ID and location). What would be a way to modify the script so only those attributes are saved, rather than everything, to save disk space?

I see, but bounding the USA I could do that for mainland USA. I was just wondering if there’s a better way than bounding to get the whole USA. You may be able to bound the mainland USA except hawaii and alaska, which will be fine. I was wondering if one can just use the US WOE ID but wasn’t sure how that can be done using tweepy. If not, then I’ll do some bounding box math.

There’s a field in the Twitter documentation which encodes for language, I believe. And you could check if the string matched Obama using str.find or whatever. And you can change the self.counter >= 20000 line to exit after reaching 300.

Jenny Gnil

Thanks! Can I restrict the output so that I only get the tweet text with the user name?
I tried this but I get an error (it’s the error of the main class: print “error!”)
my code:

Hrm — maybe you should try removing the try / except in the main class and see what the actual error is.

Jenny Gnil

It’s the same error (the error from the main class) and I tried to use an additional method for the output but i got the same error. Do I have to return something that the method on_status will be closed or do you know another reason for the error?

hello alex, you have posted very nice tutorial
i wanted to know how to track word in last 5-min tweets

lsk26

Hi Alex,
I am running into some issues with the SListener. It used to work super smoothly, but currently I am getting errors whenever the self.counter limit is reached (so technically a new JSON file should be created). I keep getting an “error!” message. Do you have any idea what this could be about? I kept your scripts largely unchanged, so I do not expect it is related to my changes in the code. Would be super grateful for your response! Many thanks

I’ve been running into an error at the same point. After editing the error handling in main() in streaming.py, I’m getting an IOError, and a further edit reveals “no such file or directory”.

MONIKA BANSAL

I have created a small program for streaming Twitter data in a specified date range. Earlier it was running perfectly, but now it is not creating any file and it shows count 0.

import tweepy
import csv

access_token = "3922189213-dojmvufY0yVqdMt8BJEm4dXefP3BhQVhkD"
access_token_secret = "MGUrD5y4bTPxtgbcP96lsSOv202XFivVJCQqaMj"
consumer_key = "CQGnx5DY5DgdNRnb74Xgk"
consumer_secret = "5otBreM8LDVnKTnJEtCc1ISMFrpp7V8mi8vGRKrX2P6"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Open/Create a file to append data
textFile = open('fetched_tweets_baaghi4.txt', 'a')

#Use csv Writer
#csvWriter = csv.writer(csvFile)

count=0;
for tweet in tweepy.Cursor(api.search,q="#Baaghi",lang="en",
                           since_id="2016-03-16",until="2016-04-15",).items():
    print (tweet.created_at, ascii(tweet.text))
    count=count+1
    #csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])
    textFile.write(ascii(tweet.text)+'\n')

print(count)

earlier data was getting streamed, file was also getting created but now it shows count zero, no data in file and program stops without any error.
tried whole process in other pc too but no result.
urgently need help what could be the problem.