It's Data. Realtime, Social, Contextual Data.

I repeat: Real-time. Data. Sh*tloads of it.

A service unlike any we've ever had before - yes, very different from, and much faster than, the traditional web. We've seen it used as a tool for social activism, a platform for change, and a real-time news stream in a way that old-school media just can't match.

The thing about drinking from the firehose is, well, just that: overwhelming data and no easy way to capture it all in a format that you can actually use. There is simply too much shit happening every second to utilize it properly.

So, if you want to do any real trending, data mining, post-processing, or even simple analysis on it - you really need to be making a historical record of what you're looking for and storing it in a usable way. The Twitter search function only goes back so far (at least from what I've found), so for the vast majority of data (aka anything you'd want to do some long-term trending on) - when it's gone, it's gone (for easy public consumption, at least).

Anyone who has been reading this blog over the past month or so has seen my Metallica stuff - and yes, honestly, I wrote this script in order to add some data to that project. I wanted to chart mentions of the band's name and member names, analyze them at a later date, and try to draw some conclusions (as well as make some pretty visualizations).

But none of that is possible without the actual DATA first.
So, let's go get some "tweet harvesting" started.

Disclaimer: This is pretty quick and dirty - but for 99% of cases it's perfect (possibly even overkill). Want to make it better? Go ahead, man! You are the master of your own destiny.

So the question is, "How do we grab stuff from the timeline in the first place?" My answer, as it always is - Python, the world's best Swiss army knife for data.

Enter my trusty Python 2.7 (32-bit, so as not to break all sorts of modules). If you don't already have it, you can download it here. They have flavors for whatever your favorite OS is, but in this case I'm going to be using Windows and SQL Server. If you use OS X or Linux, you'd have to rewrite the SQL statements and connect syntax to use MySQL (which isn't a big deal - the rest would work fine as-is).

Obviously, regular licensed versions of SQL Server work fine too, as does SQL Server 2012 (Express or not) - but for the sake of actually getting it to work the first time (and not being on an RC release), I recommend 2008 R2.

When you DO install it, just make sure that you use "mixed-mode" authentication - since we're going to use a hardcoded user - none of that pesky "Windows Authentication" shit - AND install it as the "Default Instance", NOT a "Named Instance". You'll see what I mean as it walks you through the process.

Setting up your SQL Server table structure

Now that we've got a fresh and spanky SQL Server installation, we need to set up some database tables to hold all our Tweets. "A table to hold my tweets"... hmm, sounds like a sad new-wave children's story... Anyways...

First we need to make a new database. Log into your installation using the SQL Server Management Studio app (as the 'sa' user, who you created a password for earlier), right-click on the "Databases" folder and go to "New Database". Name it whatever you want, just remember that you'll have to change my script to match. The defaults are fine during creation.

Once it's created, click "New Query" in the upper left bar.

Go to the query window and type "use [TweetHarvester]" (or [PinkyWinky], or whatever the hell you called your database)

Now we are ready to make some tables, here is the structure that I used for my script.

We are making 3 tables here:

A "Tweet Bank" table - the real, permanent home for the harvested tweets

A "Temp Tweet" table exactly like the "real" table, in order to easily (okay, lazily) filter out dupes

A Log table in order to capture what has happened on each run. Because, let's face it - stats are fun
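The table structure itself didn't survive in this copy of the post, but the column names can be read straight off the insert statements below. Here's a rough DDL sketch - the column types and sizes are my guesses, not necessarily the originals, so tweak to taste. The script only ever reads batchid from the log table, so the other TweetLog columns are hypothetical:

```sql
-- the "real" tweet table
create table TweetBank (
    tweet_id       varchar(50),
    tweet_datetime datetime,
    tweet_keyword  varchar(100),
    tweet          nvarchar(500),
    tweeter        varchar(100),
    lang           varchar(10)
    -- readers mention a geo column in the original layout; if you add one,
    -- the "select *" de-dupe insert later needs an explicit column list
);

-- identical twin, used as a staging area for de-duping
create table TweetBankTemp (
    tweet_id       varchar(50),
    tweet_datetime datetime,
    tweet_keyword  varchar(100),
    tweet          nvarchar(500),
    tweeter        varchar(100),
    lang           varchar(10)
);

-- per-run stats; only batchid is confirmed by the script, the rest is a guess
create table TweetLog (
    batchid        int,
    log_keyword    varchar(100),
    tweet_count    int,
    log_datetime   datetime default getdate()
);
```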

Before we go any further, we need to figure out what we're going to track...

In my case, as I said earlier, I'm tracking Metallica tweets (which includes the names of all past and present members), so my script has an array of all of those keywords and will iterate over them one at a time.

import sys
import pymssql

# connect to our database and create a cursor to do some work
conn = pymssql.connect(host='localhost', user='notsa', password='notsayly', database='TallicaTweets')
cur = conn.cursor()

# my aforementioned harvest list - use as many as you want,
# they will all be separated in the database by keyword
harvest_list = ['metallica', 'james hetfield', 'lars ulrich', 'kirk hammett',
                'rob trujillo', 'jason newsted', 'cliff burton']

# grabbing the last "batch id", if it exists, so we
# can make log entries that make SOME sense
cur.execute("select max(isnull(batchid,0)) from tweetlog")
batch_id_cur = cur.fetchall()
# updated 3-8-2012
if batch_id_cur[0][0] is None:
    batch_id = 0
else:
    batch_id = batch_id_cur[0][0] + 1

# let's try to put each tweet in our temp table for now
# (tweet and tweet_keyword come from the keyword/result loops in the full script)
try:
    cur.execute("""insert into TweetBankTemp (tweet_id, tweet_datetime, tweet_keyword, tweet, tweeter, lang)
        values ('""" + str(tweet['id_str'].encode('utf-8').replace("'", "''").replace(';', '')) + """',
        cast(substring('""" + str(tweet['created_at'].encode('utf-8')) + """',5,21) as datetime),
        '""" + str(tweet_keyword) + """',
        '""" + str(tweet['text'].encode('utf-8').replace("'", "''").replace(';', '')) + """',
        '""" + str(tweet['from_user'].encode('utf-8').replace("'", "''").replace(';', '')) + """',
        '""" + str(tweet['iso_language_code'].encode('utf-8').replace("'", "''").replace(';', '')) + """'
        ) """)
except:
    # just in case something heinous happens we don't lose
    # ALL the tweets in this run, just the error one
    print "############### Unexpected error:", sys.exc_info()[0], "##################################"
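Since the inserts are built by plain string concatenation, those chained replace() calls are the only sanitizing going on. Here's the same logic pulled out as a standalone helper (the function name is mine), in case you want to test it in isolation:

```python
def sanitize(value):
    # escape single quotes for T-SQL and strip semicolons,
    # exactly like the chained replace() calls in the insert above
    return value.replace("'", "''").replace(';', '')

print(sanitize("Rock 'n' roll"))  # -> Rock ''n'' roll
```

Worth knowing: pymssql also supports parameterized queries (cur.execute(sql, params) with %s placeholders), which sidesteps hand-escaping entirely if you ever extend this.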

# take all the tweets that we DIDN'T already have
# and put them in the REAL tweet table
cur.execute("""insert into tweetbank (tweet_id, tweet_datetime, tweet_keyword, tweet, tweeter, lang)
    select * from tweetbanktemp where tweet_id NOT in
    (select distinct tweet_id from tweetbank)""")

# take all THESE out of the temp table so they don't
# interfere with the next keyword
cur.execute("""delete from tweetbanktemp where tweet_keyword = '""" + str(tweet_keyword) + """'""")
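That cast(substring(created_at, 5, 21) as datetime) bit is doing the date surgery SQL-side: the old search API hands back timestamps like "Tue, 06 Mar 2012 18:11:12 +0000", and characters 5 through 25 are a string SQL Server can cast directly. The same trim in pure Python, if you'd rather store a pre-parsed value (the sample timestamp is my own):

```python
from datetime import datetime

# search-API style timestamp (invented example)
created_at = "Tue, 06 Mar 2012 18:11:12 +0000"

# substring(created_at, 5, 21) in T-SQL == created_at[4:25] in Python (1- vs 0-indexed)
trimmed = created_at[4:25].strip()        # '06 Mar 2012 18:11:12'
parsed = datetime.strptime(trimmed, '%d %b %Y %H:%M:%S')
print(parsed)  # -> 2012-03-06 18:11:12
```

Heads-up: newer API payloads use the REST-style format ("Tue Mar 06 18:11:12 +0000 2012"), where that substring no longer lines up - which is exactly the time-section breakage some commenters ran into.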

Disclaimer #2: I take no responsibility if this code ends up eating your grandmother's birthday photos, makes your Shih Tzu sick, or causes you to forget about those Jethro Tull tickets you've been sitting on. It probably won't do most of those things, but still!

This probably could have been done MUCH more efficiently - but I'm all about results, and for a project like this, quick wins. Set it up to run every 5 minutes in the background using Task Scheduler and suddenly we're off to the races gathering tweets.

Pic of my "TweetLog" Table.

Pic of my "TweetBank" table.

What do we DO with this data? Well, that'll have to wait a bit while I gather it - but I sense some Tableau in the future...

Would love feedback, especially if you got it working (or if you didn't, maybe I can help).

Looks great - will try it in the next few days. Problem is I am a beginner coder and using OS X. Can you elaborate on the changes needed? Also, can you get the followed and follower stats using this method?

http://ryrobes.com/ Ryan Robitaille

Heya, so sorry I missed this comment last week.

You should have no problem using the OSX version of Python, one of the only changes you’ll have to make will be to use a local installation of MySQL to save the data to.

I’m not sure, offhand if the follower and followed stats are in the data object that I pull all the fields from, but I’m pretty sure that this twitter module has ways of getting that too.

Let me know how it goes!

http://ryrobes.com/ Ryan Robitaille

Anyone having any issues with this, please let me know - I'm sure we can figure them out. It's way too awesome of a data source to get forgotten (after x days).

http://40limon.es dariomartinezb

Hello Ryan,

I'm wondering if you could update your script to use the streaming API instead of the search API. The rpp limitation is driving me nuts, as I can't get past 100 tweets per keyword on any given batch run. I would like to grab, say, 1500 tweets per keyword/run, but I haven't been able to do so.

Any ideas?

jeff

Yes, I am having issues. I love what you did and want to do something similar for a project. I used your code as a test run and did everything exactly the same. The code created tables in SQL Server. When I try running the Python code, I get an error that I can't connect to the interface. I am running Windows 7 64-bit, but installed Python 2.7.3 32-bit and the modules in 32-bit. Tried uninstalling antivirus and disabling Windows Firewall but get the same error. Any suggestions?

http://ryrobes.com/ Ryan Robitaille

hey Jeff, can you post your error? Sounds like you might have some funkiness with the pymssql modules (they can be weird and there are seemingly 10 diff versions floating around out there).

Jeff

Ryan, sorry for the delay - your response did not come through to my email. If I use "server='localhost\SQLEXPRESS'", I get this:

I've messed with firewall configs and can't figure it out. I can connect with ODBC, so I think it is definitely within pymssql. Any suggestions?

http://ryrobes.com/ Ryan Robitaille

Give it a try without the database part. Most newer versions of pymssql shouldn't have a problem with named instances like the localhost\SQLEXPRESS one above.

Jeff

Ryan, how would you get the lat long (location) from the tweets? Would you mind showing a tweak in your code to do so?

Dirk

I was wondering what the line is for adding the geo information to the table - we create a table with geo as a field, but in the insert statement there is no geo information.

For the rest I got it working on a mac os x, needed to change some parts. If you are interested I can send the changes and you can add them to this page.

http://40limon.es dariomartinezb

I also am wondering about the geo field not being used. I’m plain stupid when it comes to python (this is actually my first practical use of it). I’m trying to put the geo thing to work but to no avail yet. Also, I’m trying to capture some other fields… I guess I’ll need to grab a Python for Dummies or something.

But still, this is a great script and solves a lot of things for me :) Kudos to Ryan! You got me interested in Python, and being myself a marketing guy, that’s quite something. :)

CAM77

I’m brand new to python and trying to get this example to work. I’ve got the database set up with the appropriate tables, and the python modules installed, but when I run the python script, the shell opens and displays a flashing cursor for about 15 seconds then closes without any error messages, but the tables are not populated with any data. Any ideas what I am doing wrong?

CAM77

Update: Tried running in PythonWin and this is the error message displayed in the interactive window:

Hello, I have never used Python before but will learn, because I'd like to harvest some Twitter data for a project. I am using Mountain Lion and I have trouble with the easy_install instructions - when I try to run the setuptools shell as indicated, I get a SyntaxError: invalid syntax message with setuptools highlighted in red. I am so sorry my question is elementary - I was wondering if you could advise me on how to properly install all the necessary tools. Thank you, Kristina

http://ryrobes.com/ Ryan Robitaille

No worries! I’m going to run one on OSX since many people seem to be using it. It should MOSTLY be the same, but the install will likely be different.

allan

Hi, thanks for the post. I was wondering how I can download the tweet coordinates together with the other information to a PostgreSQL server.

http://twitter.com/OllyOsborne Olly Osborne

Will this be able to run on Mountain Lion on Mac?

http://ryrobes.com/ Ryan Robitaille

Should work the same – I know it works on Linux but I haven’t tried it on OSX. I know the modules are available. Give it a shot and let me know what happens, maybe I can help!

Raymond Lou

Hi Ryan, thanks for the interesting writeup. I’m a beginner learning Python and was wondering how difficult it would be to implement a similar process for Weibo, the Chinese version of Twitter. As far as I can tell, it doesn’t have an API like Twython.

Hi Ryan – what function does “results” in the line (for tweet in search_results["results"]: ) exactly serve/do? It’s giving me a keyword error on it and I’m not sure why it’s even there in the first place.

3: Adjust these two variables:
['from_user'] to ['user']['screen_name'] (in two places)
['iso_language_code'] to ['metadata']['iso_language_code']

4: Time - you must re-do the SQL for that time section: either change the table to accept varchar and just throw in created_at as-is, or parse it into a new format and convert it properly

I think that is a list of all the things I changed once you get to this error. I had to use pyodbc too.
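To make those two field renames concrete, a small fallback lookup can handle either payload shape (the sample dicts below are invented for illustration):

```python
def get_user(tweet):
    # old search API used 'from_user'; newer payloads nest it under 'user' -> 'screen_name'
    if 'from_user' in tweet:
        return tweet['from_user']
    return tweet['user']['screen_name']

def get_lang(tweet):
    # 'iso_language_code' moved under 'metadata' in newer payloads
    if 'iso_language_code' in tweet:
        return tweet['iso_language_code']
    return tweet['metadata']['iso_language_code']

old_style = {'from_user': 'papa_het', 'iso_language_code': 'en'}   # invented sample
new_style = {'user': {'screen_name': 'papa_het'},
             'metadata': {'iso_language_code': 'en'}}              # invented sample

print(get_user(new_style))  # -> papa_het
```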

Jim

Hi Kris, the Python code runs and stores some empty(ish) values in the tweetlog. However, in the Python output the tweets are successfully returned, but "Unexpected error:" is displayed above each one. Did you have this issue?

paceman61

Due to changes in Twitter's auth model, I included the following before the first for loop (being very verbose with parameters):

Hey Ryan, I get the error: “Something went wrong during the installation.” When I try to run the script for the ez_setup.py.

Any idea why this is?
Thanks.

Murphy

Also, to specify the error more clearly: all I am doing is opening the ez_setup.py file (which opens in a new browser window). Then I copy the code that is in it and run it in Python. That's when it gives me the error. All of the text it gives me is actually this: