Mining the Social Web, 1st Edition - The Tweet, The Whole Tweet, and Nothing But the Tweet (Chapter 5)

If you only have 10 seconds...

Twitter's new API will prevent you from running much of the code from Mining the Social Web, and this IPython Notebook shows you how to roll with the changes and adapt as painlessly as possible until an updated printing is available. In particular, it shows you how to authenticate before executing any of the API requests illustrated in this chapter and how to use the new search API, among other things. It is highly recommended that you read the IPython Notebook for Chapter 1 before attempting the examples in this chapter if you haven't already. One of the examples also presumes that you've run an example from Chapter 4 and stored some data in Redis that is recycled into this chapter.

If you have a couple of minutes...

Twitter is officially retiring v1.0 of their API as of March 2013, with v1.1 of the API becoming the new status quo. There are a few fundamental differences that social web miners should consider (see Twitter's blog at https://dev.twitter.com/blog/changes-coming-to-twitter-api and https://dev.twitter.com/docs/api/1.1/overview). The changes most likely to affect an existing workflow are that authentication is now mandatory for all requests, that rate limiting is now applied on a per-resource basis (as opposed to an overall rate limit based on a fixed number of requests per unit time), that various platform objects have changed (for the better), and that search semantics have changed to a "pageless" approach. All in all, the v1.1 API looks much cleaner and more consistent, and it should be a good thing longer-term, although it may cause interim pains for folks migrating to it.
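The "pageless" search semantics mean that each response carries a `search_metadata.next_results` query string rather than a page number to increment. As a rough sketch of how that cursor works (the sample string below is hypothetical, but it mirrors the documented format), turning it back into keyword arguments for the next request is a one-liner:

```python
# A hypothetical "next_results" value as it appears in a v1.1 search
# response's search_metadata field
next_results = '?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1'

def parse_next_results(next_results):
    """Turn the pageless cursor string into keyword args for the next search call."""
    # Strip the leading '?' and split the query string into key=value pairs
    return dict(kv.split('=') for kv in next_results[1:].split('&'))

kwargs = parse_next_results(next_results)
# kwargs can then be passed straight back into the search endpoint,
# e.g. t.search.tweets(**kwargs), to fetch the next "page" of results
```

Note that the values stay URL-encoded strings (e.g. `%23freebandnames` for `#freebandnames`), which is fine because they're handed straight back to the API.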

The latest printing of Mining the Social Web (2012-02-22, Third release) reflects v1.0 of the API, and this document is intended to provide readers with updated examples from Chapter 5 of the book until a new printing provides updates.

Unlike the IPython Notebook for Chapter 1, there is no filler in this notebook at this time. See the Chapter 1 notebook for a good introduction to using the Twitter API and all that it entails.

As a reader of my book, I want you to know that I'm committed to helping you in any way that I can, so please reach out on Facebook at https://www.facebook.com/MiningTheSocialWeb or on Twitter at http://twitter.com/SocialWebMining if you have any questions or concerns in the meanwhile. I'd also love your feedback on whether or not you think that IPython Notebook is a good tool for tinkering with the source code for the book, because I'm strongly considering it as a supplement for each chapter.

You will need to set your PYTHONPATH environment variable to point to the 'python_code' folder of the GitHub source code when launching this notebook, or some of the examples won't work because they import utility code that's located there

Note that this notebook doesn't repeatedly redefine a connection to the Twitter API. It creates a connection one time and reuses it throughout the remainder of the examples in the notebook

Arguments that are typically passed in through the command line are hardcoded in the examples for convenience. CLI arguments are typically in ALL_CAPS, so they're easy to spot and change as needed

For simplicity, examples that harvest data are limited to small numbers so that it's easier to experiment with this notebook (given that @timoreilly, the principal subject of the examples, has vast numbers of followers)

The parenthetical file names at the end of the captions for the examples correspond to files in the 'python_code' folder of the GitHub repository

Just like you'd learn from reading the book, you'll need to have a CouchDB server running because several of the examples in this chapter store and fetch data from it

The package twitter_text that is illustrated in some examples for extracting "tweet entities" is no longer necessary because the v1.1 API provides tweet entities, but the code still reflects it for compatibility with the current discussion in the book
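In other words, with the v1.1 API there's nothing left to parse: the mentions, hashtags, and URLs arrive pre-extracted in the response's entities hash. A minimal sketch against a hypothetical (but representatively structured) tweet shows how directly they can be consumed:

```python
# A trimmed, hypothetical tweet as the v1.1 API would return it, with the
# entities hash included by default (field names follow Twitter's docs;
# the values here are made up for illustration)
tweet = {
    'text': 'Excited about #opengov -- thanks @timoreilly http://t.co/example',
    'entities': {
        'user_mentions': [{'screen_name': 'timoreilly', 'indices': [34, 45]}],
        'hashtags': [{'text': 'opengov', 'indices': [14, 22]}],
        'urls': [{'url': 'http://t.co/example', 'indices': [46, 65]}],
    },
}

# No twitter_text extraction pass needed -- just read the fields
mentions = ['@' + um['screen_name'] for um in tweet['entities']['user_mentions']]
hashtags = ['#' + ht['text'] for ht in tweet['entities']['hashtags']]
urls = [u['url'] for u in tweet['entities']['urls']]
```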

Example 5-2. Extracting tweet entities with a little help from the twitter_text package (the_tweet__extract_tweet_entities.py)

In [ ]:

#################################################################################
# NOTE: The opt-in "include_entities" flag can be passed in as a keyword
# argument to t.statuses.show to have Twitter's API extract the entities
# instead of using the getEntities function as described in this example, like so:
#
# tweet = t.statuses.show(id=TWEET_ID, include_entities=1)
#
# This is a case in point of Twitter's API constantly evolving to make the lives
# of developers easier. Their API slowly evolved quite a bit over the course of
# 2010 as Mining the Social Web was being written, and will no doubt continue
# to evolve and obsolete additional examples. Still, however, not all Twitter
# APIs provide an opt-in parameter for extracting tweet entities (as of early
# January 2011 anyway), and it is likely the case that you'll need to perform
# this work manually for historical or archived data that was collected prior
# to mid- to late-2010 unless 3rd party data providers perform the work for you.
#################################################################################

import sys
import json
import twitter_text  # easy_install twitter-text-py
import twitter
from twitter__login import login

# Get a tweet id by clicking on a status' "Details" link right off of twitter.com.
# For example, http://twitter.com/#!/timoreilly/status/17386521699024896

TWEET_ID = '17386521699024896'  # XXX: IPython Notebook cannot prompt for input

def getEntities(tweet):

    # Now extract various entities from it and build up a familiar structure

    extractor = twitter_text.Extractor(tweet['text'])

    # Note that the production Twitter API contains a few additional fields in
    # the entities hash that would require additional API calls to resolve

    entities = {}
    entities['user_mentions'] = []
    for um in extractor.extract_mentioned_screen_names_with_indices():
        entities['user_mentions'].append(um)

    entities['hashtags'] = []
    for ht in extractor.extract_hashtags_with_indices():

        # Massage field name to match production twitter api

        ht['text'] = ht['hashtag']
        del ht['hashtag']
        entities['hashtags'].append(ht)

    entities['urls'] = []
    for url in extractor.extract_urls_with_indices():
        entities['urls'].append(url)

    return entities

# Fetch a tweet using an API method of your choice and mixin the entities.
# Note that the v1.1 API requires an authenticated connection.

t = login()
tweet = t.statuses.show(id=TWEET_ID)
tweet['entities'] = getEntities(tweet)

print json.dumps(tweet, indent=4)

Example 5-3. Harvesting tweets from a user or public timeline (the_tweet__harvest_timeline.py)

In [ ]:

import sys
import time
import twitter
import couchdb
from couchdb.design import ViewDefinition
from twitter__login import login
from twitter__util import makeTwitterRequest
from twitter__util import getNextQueryMaxIdParam

TIMELINE_NAME = 'user'  # XXX: IPython Notebook cannot prompt for input
MAX_PAGES = 2  # XXX: IPython Notebook cannot prompt for input
USER = 'timoreilly'  # XXX: IPython Notebook cannot prompt for input

KW = {  # For the Twitter API call
    'count': 200,
    'trim_user': 'true',
    'include_rts': 'true',
    'since_id': 1,
    }

if TIMELINE_NAME == 'user':
    KW['screen_name'] = USER
if TIMELINE_NAME == 'home' and MAX_PAGES > 4:
    MAX_PAGES = 4
if TIMELINE_NAME == 'user' and MAX_PAGES > 16:
    MAX_PAGES = 16

t = login()

# Establish a connection to a CouchDB database

server = couchdb.Server('http://localhost:5984')
DB = 'tweets-%s-timeline' % (TIMELINE_NAME, )

if USER:
    DB = '%s-%s' % (DB, USER)

try:
    db = server.create(DB)
except couchdb.http.PreconditionFailed, e:

    # Already exists, so append to it, keeping in mind that duplicates could occur

    db = server[DB]

    # Try to avoid appending duplicate data into the system by only retrieving tweets
    # newer than the ones already in the system. A trivial mapper/reducer combination
    # allows us to pull out the max tweet id which guards against duplicates for the
    # home and user timelines. This is best practice for the Twitter v1.1 API.
    # See https://dev.twitter.com/docs/working-with-timelines

    def idMapper(doc):
        yield (None, doc['id'])

    def maxFindingReducer(keys, values, rereduce):
        return max(values)

    view = ViewDefinition('index', 'max_tweet_id', idMapper, maxFindingReducer,
                          language='python')
    view.sync(db)

    KW['since_id'] = int([_id for _id in db.view('index/max_tweet_id')][0].value)

api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
tweets = makeTwitterRequest(api_call, **KW)
db.update(tweets, all_or_nothing=True)
print 'Fetched %i tweets' % len(tweets)

page_num = 1
while page_num < MAX_PAGES and len(tweets) > 0:

    # Necessary for traversing the timeline in Twitter's v1.1 API.
    # See https://dev.twitter.com/docs/working-with-timelines

    KW['max_id'] = getNextQueryMaxIdParam(tweets)

    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(api_call, **KW)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1
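The `getNextQueryMaxIdParam` helper imported from `twitter__util` isn't reproduced in this notebook; a minimal equivalent, following the `max_id` technique described in Twitter's working-with-timelines documentation, might look like this (a sketch, not the repository's exact implementation):

```python
def get_next_query_max_id_param(tweets):
    """Return a max_id value for the next timeline request: one less than
    the smallest id in the current batch, so the next batch picks up where
    this one left off without re-fetching its oldest tweet."""
    return min(t['id'] for t in tweets) - 1

# A hypothetical batch of tweets, newest first, as a timeline call returns them
batch = [{'id': 103}, {'id': 102}, {'id': 101}]
next_max_id = get_next_query_max_id_param(batch)  # 100
```

Subtracting one matters because `max_id` is inclusive: passing `min(ids)` unchanged would return the oldest tweet of the previous batch again as the first result of the next one.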

In [ ]:

# Note: The Twitter v1.1 API includes tweet entities by default, so the use of the
# twitter_text package for parsing out tweet entities in this chapter is no longer
# relevant, but included for continuity with the text of the book.

import sys
import couchdb
from couchdb.design import ViewDefinition
from prettytable import PrettyTable

DB = 'tweets-user-timeline-timoreilly'  # XXX: IPython Notebook cannot prompt for input

server = couchdb.Server('http://localhost:5984')
db = server[DB]

FREQ_THRESHOLD = 3  # XXX: IPython Notebook cannot prompt for input

# Map entities in tweets to the docs that they appear in

def entityCountMapper(doc):
    if not doc.get('entities'):
        import twitter_text

        def getEntities(tweet):

            # Now extract various entities from it and build up a familiar structure

            extractor = twitter_text.Extractor(tweet['text'])

            # Note that the production Twitter API contains a few additional fields in
            # the entities hash that would require additional API calls to resolve

            entities = {}
            entities['user_mentions'] = []
            for um in extractor.extract_mentioned_screen_names_with_indices():
                entities['user_mentions'].append(um)

            entities['hashtags'] = []
            for ht in extractor.extract_hashtags_with_indices():

                # Massage field name to match production twitter api

                ht['text'] = ht['hashtag']
                del ht['hashtag']
                entities['hashtags'].append(ht)

            entities['urls'] = []
            for url in extractor.extract_urls_with_indices():
                entities['urls'].append(url)

            return entities

        doc['entities'] = getEntities(doc)

    if doc['entities'].get('user_mentions'):
        for user_mention in doc['entities']['user_mentions']:
            yield ('@' + user_mention['screen_name'].lower(), [doc['_id'], doc['id']])
    if doc['entities'].get('hashtags'):
        for hashtag in doc['entities']['hashtags']:
            yield ('#' + hashtag['text'], [doc['_id'], doc['id']])
    if doc['entities'].get('urls'):
        for url in doc['entities']['urls']:
            yield (url['url'], [doc['_id'], doc['id']])

def summingReducer(keys, values, rereduce):
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index', 'entity_count_by_doc', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

# Print out a nicely formatted table. Sorting by value in the client is cheap and easy
# if you're dealing with hundreds or low thousands of tweets

entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/entity_count_by_doc', group=True)],
                        key=lambda x: x[1], reverse=True)

field_names = ['Entity', 'Count']
pt = PrettyTable(field_names=field_names)
pt.align = 'l'

for (entity, freq) in entities_freqs:
    if freq > FREQ_THRESHOLD:
        pt.add_row([entity, freq])

print pt

In [ ]:

import json
import redis
import couchdb
import sys
from twitter__util import getRedisIdByScreenName
from twitter__util import getRedisIdByUserId

SCREEN_NAME = 'timoreilly'  # XXX: IPython Notebook cannot prompt for input
THRESHOLD = 15  # XXX: IPython Notebook cannot prompt for input

# Connect using default settings for localhost

r = redis.Redis()

# Compute screen_names for friends

friend_ids = r.smembers(getRedisIdByScreenName(SCREEN_NAME, 'friend_ids'))
friend_screen_names = []
for friend_id in friend_ids:
    try:
        friend_screen_names.append(json.loads(r.get(getRedisIdByUserId(friend_id,
                                   'info.json')))['screen_name'].lower())
    except TypeError, e:
        continue  # not locally available in Redis - look it up or skip it

# Pull the list of (entity, frequency) tuples from CouchDB

server = couchdb.Server('http://localhost:5984')
db = server['tweets-user-timeline-' + SCREEN_NAME]

entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/entity_count_by_doc', group=True)],
                        key=lambda x: x[1])

# Keep only user entities with sufficient frequencies

user_entities = [(ef[0])[1:] for ef in entities_freqs
                 if ef[0][0] == '@' and ef[1] >= THRESHOLD]

# Do a set comparison

entities_who_are_friends = \
    set(user_entities).intersection(set(friend_screen_names))

entities_who_are_not_friends = \
    set(user_entities).difference(entities_who_are_friends)

print 'Number of user entities in tweets: %s' % (len(user_entities), )
print 'Number of user entities in tweets who are friends: %s' \
    % (len(entities_who_are_friends), )
for e in entities_who_are_friends:
    print '\t' + e
print 'Number of user entities in tweets who are not friends: %s' \
    % (len(entities_who_are_not_friends), )
for e in entities_who_are_not_friends:
    print '\t' + e

In [ ]:

import sys
import httplib
from urllib import quote
import json
import couchdb
from twitter__login import login
from twitter__util import makeTwitterRequest

DB = 'tweets-user-timeline-timoreilly'  # XXX: IPython Notebook cannot prompt for input
USER = 'n2vip'  # XXX: IPython Notebook cannot prompt for input

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print >>sys.stderr, """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

# Query by term

try:
    conn = httplib.HTTPConnection('localhost', 5984)
    conn.request('GET', '/%s/_fti/_design/lucene/by_text?q=%s' % (DB, quote(USER)))
    response = conn.getresponse()
    if response.status == 200:
        response_body = json.loads(response.read())
    else:
        print >>sys.stderr, 'An error occurred fetching the response: %s %s' \
            % (response.status, response.reason)
        sys.exit(1)
finally:
    conn.close()

doc_ids = [row['id'] for row in response_body['rows']]

# Pull the tweets from CouchDB

tweets = [db.get(doc_id) for doc_id in doc_ids]

# Mine out the in_reply_to_status_id_str fields and fetch those tweets as a batch request

conversation = sorted([(tweet['_id'], int(tweet['in_reply_to_status_id_str']))
                      for tweet in tweets
                      if tweet['in_reply_to_status_id_str'] is not None],
                      key=lambda x: x[1])

min_conversation_id = min([int(i[1]) for i in conversation if i[1] is not None])
max_conversation_id = max([int(i[1]) for i in conversation if i[1] is not None])

# Pull tweets from the other user using the user timeline API to minimize API expenses...

t = login()

reply_tweets = []
results = []
page = 1
while True:
    results = makeTwitterRequest(t.statuses.user_timeline,
        count=200,
        # Per <http://dev.twitter.com/doc/get/statuses/user_timeline>, some
        # caveats apply with the oldest id you can fetch using "since_id"
        since_id=min_conversation_id,
        max_id=max_conversation_id,
        skip_users='true',
        screen_name=USER,
        page=page)
    reply_tweets += results
    page += 1
    if len(results) == 0:
        break

# During testing, it was observed that some tweets may not resolve or possibly
# even come back with null id values -- possibly a temporary fluke. Workaround.

missing_tweets = []
for (doc_id, in_reply_to_id) in conversation:
    try:
        print [rt for rt in reply_tweets if rt['id'] == in_reply_to_id][0]['text']
    except Exception, e:
        print >>sys.stderr, 'Refetching <<tweet %s>>' % (in_reply_to_id, )
        results = makeTwitterRequest(t.statuses.show, id=in_reply_to_id)
        print results['text']

    # These tweets are already on hand
    print db.get(doc_id)['text']
    print

Example 5-11. Counting the number of times Twitterers have been retweeted by someone (the_tweet__count_retweets_of_other_users.py)

In [ ]:

# Note: As pointed out in the text, there are now additional/better ways to process retweets
# as the Twitter API has evolved. In particular, take a look at the retweet_count field of the
# status object. See https://dev.twitter.com/docs/platform-objects/tweets. However, the technique
# illustrated in this code is still relevant as some Twitter clients may not follow best practices
# and still use the "RT" or "via" conventions to tweet as opposed to using the Twitter API to issue
# a retweet.

import sys
import couchdb
from couchdb.design import ViewDefinition
from prettytable import PrettyTable

DB = 'tweets-user-timeline-timoreilly'  # XXX: IPython Notebook cannot prompt for input
FREQ_THRESHOLD = 3  # XXX: IPython Notebook cannot prompt for input

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

# Map entities in tweets to the docs that they appear in

def entityCountMapper(doc):
    if doc.get('text'):
        import re
        m = re.search(r"(RT|via)((?:\b\W*@\w+)+)", doc['text'])
        if m:
            entities = m.groups()[1].split()
            for entity in entities:
                yield (entity.lower(), [doc['_id'], doc['id']])
        else:
            yield ('@', [doc['_id'], doc['id']])

def summingReducer(keys, values, rereduce):
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index', 'retweet_entity_count_by_doc', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

# Sorting by value in the client is cheap and easy
# if you're dealing with hundreds or low thousands of tweets

entities_freqs = sorted([(row.key, row.value) for row in
                        db.view('index/retweet_entity_count_by_doc', group=True)],
                        key=lambda x: x[1], reverse=True)

field_names = ['Entity', 'Count']
pt = PrettyTable(field_names=field_names)
pt.align = 'l'

for (entity, freq) in entities_freqs:
    if freq > FREQ_THRESHOLD and entity != '@':
        pt.add_row([entity, freq])

print pt
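The RT/via regular expression at the heart of the mapper above can be exercised standalone. A small sketch (the sample tweet texts are made up) shows exactly which screen names it credits:

```python
import re

# The same informal retweet convention targeted by entityCountMapper above
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)")

def extract_rt_authors(text):
    """Return the screen names credited via 'RT @user' or 'via @user', lowercased."""
    m = RT_PATTERN.search(text)
    if not m:
        return []
    # Group 2 holds the run of @mentions immediately following 'RT' or 'via'
    return [entity.lower() for entity in m.groups()[1].split()]

print(extract_rt_authors('RT @timoreilly: Great read on open data'))  # ['@timoreilly']
print(extract_rt_authors('Worth a look (via @dhh @TimOReilly)'))      # ['@dhh', '@timoreilly']
print(extract_rt_authors('No retweet convention here'))               # []
```

Lowercasing matters because screen names are case-insensitive on Twitter, so counts for '@TimOReilly' and '@timoreilly' should be merged.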

Example 5-13. Counting hashtag entities in tweets (the_tweet__avg_hashtags_per_tweet.py)

In [ ]:

import sys
import couchdb
from couchdb.design import ViewDefinition

DB = 'tweets-user-timeline-timoreilly'  # XXX: IPython Notebook cannot prompt for input

try:
    server = couchdb.Server('http://localhost:5984')
    db = server[DB]
except couchdb.http.ResourceNotFound, e:
    print """CouchDB database '%s' not found. Please check that the
database exists and try again.""" % DB
    sys.exit(1)

# Emit the number of hashtags in a document

def entityCountMapper(doc):
    if not doc.get('entities'):
        import twitter_text

        def getEntities(tweet):

            # Now extract various entities from it and build up a familiar structure

            extractor = twitter_text.Extractor(tweet['text'])

            # Note that the production Twitter API contains a few additional fields in
            # the entities hash that would require additional API calls to resolve

            entities = {}
            entities['user_mentions'] = []
            for um in extractor.extract_mentioned_screen_names_with_indices():
                entities['user_mentions'].append(um)

            entities['hashtags'] = []
            for ht in extractor.extract_hashtags_with_indices():

                # Massage field name to match production twitter api

                ht['text'] = ht['hashtag']
                del ht['hashtag']
                entities['hashtags'].append(ht)

            entities['urls'] = []
            for url in extractor.extract_urls_with_indices():
                entities['urls'].append(url)

            return entities

        doc['entities'] = getEntities(doc)

    if doc['entities'].get('hashtags'):
        yield (None, len(doc['entities']['hashtags']))

def summingReducer(keys, values, rereduce):
    return sum(values)

view = ViewDefinition('index', 'count_hashtags', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

num_hashtags = [row for row in db.view('index/count_hashtags')][0].value

# Now, count the total number of tweets that aren't direct replies

def entityCountMapper(doc):
    if doc.get('text')[0] == '@':
        yield (None, 0)
    else:
        yield (None, 1)

view = ViewDefinition('index', 'num_docs', entityCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)

num_docs = [row for row in db.view('index/num_docs')][0].value

# Finally, compute the average

print 'Avg number of hashtags per tweet for %s: %s' % \
    (DB.split('-')[-1], 1.0 * num_hashtags / num_docs, )

Example 5-14. Harvesting tweets for a given query (the_tweet__search.py)

In [ ]:

import sys
import twitter
import couchdb
from couchdb.design import ViewDefinition
from twitter__util import makeTwitterRequest
from twitter__login import login

Q = 'OpenGov'  # XXX: IPython Notebook cannot accept input
MAX_PAGES = 5

server = couchdb.Server('http://localhost:5984')
DB = 'search-%s' % (Q.lower().replace('#', '').replace('@', ''), )

t = login()

search_results = t.search.tweets(q=Q, count=100)
tweets = search_results['statuses']

for _ in range(MAX_PAGES - 1):  # Get more pages
    next_results = search_results['search_metadata']['next_results']

    # Create a dictionary from the query string params

    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = t.search.tweets(**kwargs)
    tweets += search_results['statuses']

    if len(search_results['statuses']) == 0:
        break

    print 'Fetched %i tweets so far' % (len(tweets), )

# Store the data

try:
    db = server.create(DB)
except couchdb.http.PreconditionFailed, e:

    # Already exists, so append to it (but be mindful of appending duplicates with
    # repeat searches.) The refresh_url in the search_metadata or streaming API
    # might also be appropriate to use here.

    db = server[DB]

db.update(tweets, all_or_nothing=True)
print 'Done. Stored data to CouchDB - http://localhost:5984/_utils/database.html?%s' % (DB, )