However, I have a few complaints. First, Twitter API authentication
needs to be set up for each archive. This task, which requires a few
steps in Google Spreadsheet, is not trivial. Second, I often ended up
with hundreds or even thousands of duplicates, which may push the file
size past Google Spreadsheet’s size limit. Third, the getTweets()
function, powered by Google Code (I think), sometimes fails for unknown
reasons. Although I still greatly enjoy TAGS, for these reasons I have
long been wishing for an easier and more reliable way to archive a
theoretically unlimited number of tweets. So when I read a post about
*archiving tweets with MongoDB* (part 1 & part 2) by Julian Hillebrand
today, I couldn’t stop myself from trying it out.

Julian’s code works! But after a bit of playing and tweaking, the
solution I ended up with became distinctive enough to deserve an
independent post. So below is what I did, which I think contains some
improvements over the original solution.

Set up MongoDB
--------------

You can download MongoDB here and install it on your computer, as
Julian suggested. Or you can use a MongoDB hosting service in the
cloud, like MongoLab. I used the free subscription on MongoLab. It was
very easy to set up (believe me, this was my first time playing with
MongoDB), and you don’t need to install anything else (like the
NetBeans MongoDB plugin). Plus, you might want to keep the archiving
running while your laptop is off, so it’s better to use something in
the cloud. So I strongly recommend MongoLab. If you choose to follow
this path, there are roughly three steps:

1. Register for a MongoLab free subscription
2. Create a new database, say “twitter-mongo”
3. Click into the database and add a database user; record the
   username/password for later use (they go into the connection URI
   shown below)
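If you want to sanity-check the connection before touching any tweet
code, a minimal sketch with the MongoDB Java driver could look like
this (the URI is a placeholder; MongoLab shows the real one on your
database’s home page):

```java
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;

public class MongoLabCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URI: copy the real one from your MongoLab database page
        MongoClientURI uri = new MongoClientURI(
                "mongodb://dbuser:dbpassword@ds012345.mongolab.com:12345/twitter-mongo");
        MongoClient client = new MongoClient(uri);
        DB db = client.getDB("twitter-mongo");
        // Lists the collections in the database; empty until tweets arrive
        System.out.println("Connected; collections: " + db.getCollectionNames());
        client.close();
    }
}
```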

It will definitely help to read MongoDB’s manual, but I honestly only
got as far as the first page.

Set up the Java project and run the code
----------------------------------------

Java code is used to retrieve tweets through the Twitter API and save
them into MongoDB. Julian explained how his code works in detail. I
made some tweaks so that the settings are more visible and easier to
modify. My code is posted in this gist. Download TwitterLoop.java and
do the following few things:

- Add the Twitter and MongoDB client libraries the code depends on to
  your project
- Fill in your Twitter API credentials in the settings section at the
  top of the file
- Fill in your MongoDB connection settings, including the
  username/password recorded earlier

Then you should be able to run the Java file directly and start
collecting tweets. Each time you run the file, you will need to type
in the search keyword (e.g. “#mri13”). The program will then retrieve
100 tweets containing that keyword from Twitter every 60 seconds (both
numbers can be customized in the Java file) and put the new ones into
MongoDB. Theoretically you can keep such Java instances running
forever. (As Julian mentioned, there should be a better way to do this
loop, for example using the streaming API.)
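Stripped down to its core, the collect-and-store cycle looks something
like the sketch below. This is an illustration rather than the gist
verbatim: the constant names and document fields are mine, and I
deduplicate by upserting on the tweet id (assuming Twitter4J and a
2.x-era MongoDB Java driver):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

import java.util.Scanner;

public class TweetCollector {

    // Placeholder URI; use the one from your MongoLab database page
    static final String MONGO_URI =
            "mongodb://dbuser:dbpassword@ds012345.mongolab.com:12345/twitter-mongo";
    static final int TWEETS_PER_QUERY = 100; // tweets fetched per request
    static final int INTERVAL_SECONDS = 60;  // pause between requests

    public static void main(String[] args) throws Exception {
        System.out.print("Search keyword (e.g. #mri13): ");
        String keyword = new Scanner(System.in).nextLine().trim();

        // Twitter API authentication (fill in your own credentials)
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("...")
                .setOAuthConsumerSecret("...")
                .setOAuthAccessToken("...")
                .setOAuthAccessTokenSecret("...");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // One MongoDB collection per search keyword
        MongoClient client = new MongoClient(new MongoClientURI(MONGO_URI));
        DBCollection tweets = client.getDB("twitter-mongo").getCollection(keyword);

        while (true) {
            Query query = new Query(keyword);
            query.setCount(TWEETS_PER_QUERY);
            QueryResult result = twitter.search(query);

            for (Status s : result.getTweets()) {
                // The tweet id doubles as MongoDB's _id, so a re-fetched
                // tweet overwrites its earlier copy instead of duplicating it
                BasicDBObject doc = new BasicDBObject("_id", s.getId())
                        .append("user", s.getUser().getScreenName())
                        .append("text", s.getText())
                        .append("created_at", s.getCreatedAt());
                tweets.update(new BasicDBObject("_id", s.getId()), doc, true, false);
            }
            System.out.println("Stored a batch of "
                    + result.getTweets().size() + " tweets");
            Thread.sleep(INTERVAL_SECONDS * 1000L);
        }
    }
}
```

Keying the documents on the tweet id is what prevents the pile-up of
duplicates that plagued the spreadsheet approach.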

If you run the Java file twice for two different Twitter archives, say
“#mri13” and “#edtechchat”, two MongoDB *collections* will be created,
one for each name.

Retrieve tweets in R from MongoDB for analysis
----------------------------------------------

After tweets are collected in MongoDB, querying data in R becomes very
straightforward.
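Whatever client you use, the queries themselves are ordinary MongoDB
finds and counts. For reference, reading an archive back out through
the same Java driver looks roughly like this (the field names follow
the illustrative collector sketch above, and “somebody” is a made-up
screen name):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;

public class TweetReader {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient(new MongoClientURI(
                "mongodb://dbuser:dbpassword@ds012345.mongolab.com:12345/twitter-mongo"));
        DBCollection tweets = client.getDB("twitter-mongo").getCollection("#mri13");

        // How many tweets have been archived so far
        System.out.println("Archived tweets: " + tweets.count());

        // e.g. all archived tweets by one user, newest first
        DBCursor cursor = tweets.find(new BasicDBObject("user", "somebody"))
                                .sort(new BasicDBObject("created_at", -1));
        for (DBObject doc : cursor) {
            System.out.println(doc.get("created_at") + "  " + doc.get("text"));
        }
        client.close();
    }
}
```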

You can try the same queries with another collection, say
“#edtechchat”.

Conclusions
-----------

This solution, which brings together MongoDB, Java, and R, seems to me
a proof of concept of a reliable and scalable way to automatically
archive and analyze tweets. Here, Java could easily be replaced by
Python or R. It might evolve into a nice toolbox for hacking Twitter
data. I believe there is a lot of fun ahead.