In this example, we use the IPython notebook to mine data from Twitter with the Twython library. Once we have fetched the raw stream for a specific query, we will first do some basic word frequency analysis on the results using Python's built-in dictionaries, and then we will use the excellent NetworkX library developed at Los Alamos National Laboratory to look at the results as a network and understand some of its properties.

Using NetworkX, we aim to answer the following questions: for a given query, which words tend to appear together in tweets, and what global pattern of relationships between these words emerges from the entire set of results?

Obviously the analysis of text corpora of this kind is a complex topic at the intersection of natural language processing, graph theory and statistics, and here we do not pretend to provide exhaustive coverage of it. Rather, we want to show you how, with a small amount of easy-to-write code, it is possible to do a few non-trivial things based on real-time data from the Twitter stream. Hopefully this will serve as a good starting point; for further reading, you can find in-depth discussions of analyzing social network data in Python in the book Mining the Social Web.

Here we define which query we want to perform, as well as which words we want to filter out from our analysis because they appear very commonly and we're not interested in them.

Typically you want to run the query once, and after seeing what comes out, fine-tune the removal list, as which words are 'noise' is fairly query-specific (and also changes over time, depending on what's happening out there on Twitter):

In [7]:

query = "big data"

words_to_remove = """with some your just have from it's /via &amp; that they your there this"""

This is the cell that actually fetches data from Twitter. We limit the search to at most the first 30 pages of results (typically Twitter stops returning results before that).

In [8]:

n_pages = 30
results = []
retweets = []

for page in range(1, n_pages + 1):
    search = twitter.search(q=query + ' lang:en', page=str(page))
    res = search['results']
    if not res:
        print 'Stopping at page:', page
        break
    for t in res:
        if t['text'].startswith('RT '):
            retweets.append(t)
        else:
            results.append(t)

tweets = [t['text'] for t in results]

# Quick summary
print 'query:   ', query
print 'results: ', len(results)
print 'retweets:', len(retweets)
print 'Variable `tweets` has a list of all the tweet texts'
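The basic word frequency pass mentioned at the start can be sketched with a plain dict. This is a hedged illustration, not the notebook's actual code: `word_freq` is a hypothetical helper, the `< 3` length cutoff is an arbitrary choice, and it is applied here to a tiny made-up sample rather than the live `tweets` list:

```python
# Noise words to skip, mirroring the words_to_remove string above.
STOP = set("""with some your just have from it's /via &amp; that
they your there this""".split())

def word_freq(texts):
    # Accumulate counts in a builtin dict, skipping noise words
    # and very short tokens (an arbitrary illustrative cutoff).
    freq = {}
    for text in texts:
        for word in text.lower().split():
            if word in STOP or len(word) < 3:
                continue
            freq[word] = freq.get(word, 0) + 1
    return freq

sample = ["Big data is big", "data science loves big data"]
freq = word_freq(sample)  # {'big': 3, 'data': 3, 'science': 1, 'loves': 1}
```

Sorting `freq.items()` by count then gives the most frequent words for the query.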

An interesting question to ask is: which pairs of words co-occur in the same tweets? We can find these relations and use them to construct a graph, which we can then analyze with NetworkX and plot with Matplotlib.

We limit the graph to have at most n_nodes (for the most frequent words) just to keep the visualization easier to read.
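The pairing logic described above can be sketched as follows. This is a hedged illustration under stated assumptions: `cooccurrence_pairs` and the sample tweets are invented for the example, and the `n_nodes` cap on the most frequent words mirrors the limit described in the text:

```python
import itertools
from collections import Counter

def cooccurrence_pairs(texts, n_nodes=100):
    # Count word frequencies, keep only the n_nodes most common words,
    # then count how often each pair of kept words co-occurs in a tweet.
    counts = Counter(w for text in texts for w in text.lower().split())
    top = {w for w, _ in counts.most_common(n_nodes)}
    pairs = Counter()
    for text in texts:
        words = sorted(set(text.lower().split()) & top)
        pairs.update(itertools.combinations(words, 2))
    return pairs

pairs = cooccurrence_pairs(["big data rocks", "big data analytics"], n_nodes=3)
# ('big', 'data') co-occurs in both sample tweets, so its count is 2.
```

The weighted pairs map directly onto a NetworkX graph via `g.add_edge(a, b, weight=w)`, after which the graph can be analyzed and drawn with Matplotlib.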

An interesting summary of the graph structure can be obtained by ranking nodes based on a centrality measure. NetworkX offers several centrality measures; in this case we look at the Eigenvector Centrality:
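To make the measure concrete, here is a toy power-iteration sketch of what eigenvector centrality computes; the graph and node names are invented for illustration, and in the notebook itself this is a single call to `nx.eigenvector_centrality(graph)`:

```python
import math

def eigenvector_centrality(adj, iters=100):
    # Power iteration: a node is central if it is connected
    # to other central nodes.
    x = {n: 1.0 for n in adj}
    for _ in range(iters):
        # Iterate on A + I (the self term avoids oscillation on
        # bipartite graphs); the dominant eigenvector is unchanged.
        new = {n: x[n] + sum(x[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in new.values()))
        x = {n: v / norm for n, v in new.items()}
    return x

# 'data' sits at the center of this toy graph, so it should rank highest.
adj = {
    'big': ['data', 'rocks'],
    'data': ['big', 'rocks', 'analytics'],
    'rocks': ['big', 'data'],
    'analytics': ['data'],
}
cent = eigenvector_centrality(adj)
```

Ranking `cent.items()` by value then gives the most central words of the co-occurrence network.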