Demo of Using twitter-hashtag-analytics to Analyze Tweets from #LAK13

2013-02-26

Building on Ben Marwick, Martin Hawksey and Tony Hirst's work on analyzing
tweets with R, I started an R project for tweet analysis, namely
twitter-hashtag-analytics. This project is hosted on GitHub and welcomes
anyone who's interested to contribute. It is my very first attempt to write a
package in R, so I admit its capabilities are still limited and its structure
may not be properly planned. Any advice will be highly appreciated.

This demo, drafted with knitr, aims to show the functionality of
twitter-hashtag-analytics and is also available on GitHub. It will evolve
along with this project.

Data Preparation

Before starting to analyze tweets, we will first load a few source files
(libraries) in this project.

# check working directory
getwd()
# note that knitr automatically sets the working directory to where the Rmd
# file is, so if you wish to run code line by line, you should setwd() manually
# setwd('/home/bodong/src/r/twitter-analytics/twitter-hashtag-analytics')
# load source files
source("get_tweets.R")
source("munge_tweets.R")
source("utilities.R")

Then we can retrieve a Twitter hashtag dataset by searching through the
Twitter API. Two other methods of retrieving tweets implemented in this
project so far are retrieving from Google Spreadsheet archives (see here)
and reading directly from a CSV file.

# get tweets by search
# this function is defined in get_tweets.R
df <- GetTweetsBySearch('#LAK13')
# save or load data (so you can reuse data rather than search all the time)
save(df, file="./data/df.Rda")
# load("./data/df.Rda")

Because tweet information retrieved through twitteR is somewhat limited
(see its reference manual, p. 11), we need to extract user information, such
as reply_to_user and retweet_from_user, manually from each tweet. Also,
because the names of metadata fields in twitteR differ considerably from
those used in the official Twitter API, the PreprocessTweets function in
munge_tweets.R renames some tweet attributes. Moreover, PreprocessTweets
trims URLs from tweet text and puts them in a new column named links.
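
A minimal sketch of this step (assuming PreprocessTweets takes the data
frame returned above and returns the cleaned version):

# preprocess tweets: extract user info, rename attributes, trim urls
# (a sketch; PreprocessTweets is assumed to take and return a data frame)
df <- PreprocessTweets(df)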

Start from Easy Stuff: Count Things

Count tweets, retweets (by), and replies (to) for each user

Regular statuses, retweets, and replies are the three main types of tweets
we analyze. The GetTweetCountTable function counts, for each user, the total
number of tweets sent, the number of times the user has been retweeted by
others, and the number of replies the user has received.
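
For example (a sketch; GetTweetCountTable is assumed to take the
preprocessed data frame and return a table of per-user counts):

# count tweets, retweets, and replies per user
# (the exact signature of GetTweetCountTable is assumed here)
count.table <- GetTweetCountTable(df)
head(count.table)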

Social Network Analysis (SNA)

Visualize social networks

An archived tweet dataset contains retweeting and replying as the two
main types of links among users. Some studies also look into following
relations, which would require further queries to Twitter, so in this demo
we focus on retweeting and replying links.

The CreateSNADataFrame function in social_analysis.R provides an
easy way to create a data frame containing all edges of the requested
social network. With these edges, we can easily create an SNA graph
and visualize it with packages like igraph and sna, as sketched below.
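
For instance, a retweet network could be built and plotted like this (a
sketch; the arguments passed to CreateSNADataFrame are assumptions):

# build a directed retweet graph with igraph
# (the from/to column names given to CreateSNADataFrame are assumptions)
source("social_analysis.R")
library(igraph)
edges <- CreateSNADataFrame(df, from = "screen_name", to = "retweet_from_user")
g <- graph.data.frame(edges, directed = TRUE)
plot(g, vertex.size = 4, vertex.label = NA, edge.arrow.size = 0.2)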

Basic SNA measures

We can further compute some basic SNA measures. For instance, the density of
this network is 0.027, the reciprocity of users in the network is 0.9488,
and the degree centralization of this network is 0.2425. These measures are
calculated as shown below.
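
The computation might look like the following sketch with igraph (the
demo's original code is not reproduced here):

# basic SNA measures on the retweet graph g built above
graph.density(g)                          # network density
reciprocity(g)                            # reciprocity
centralization.degree(g)$centralization   # degree centralization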

We have detected 7 communities in this network. The largest community
contains 44.231% of all users in this dataset.
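
The community detection step might look like this sketch, using igraph's
walktrap algorithm (the algorithm actually used here is an assumption):

# detect communities on the undirected version of the graph
comm <- walktrap.community(as.undirected(g))
length(comm)                  # number of communities detected
max(sizes(comm)) / vcount(g)  # share of users in the largest community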

Univariate Conditional Uniform Graph Tests

In network analysis, people run various tests to check whether certain
properties of a network are unusual, that is, unlikely to arise by chance.
We can do such tests, namely conditional uniform graph tests, through the
cug.test function in the sna package. Further information about these tests
can be found here.
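
A sketch of such a test on network density, conditioning on network size
(the choice of test statistic here is illustrative):

# convert the igraph graph to an adjacency matrix for the sna package
library(sna)
adj <- as.matrix(get.adjacency(g))
# test whether the observed density is unusual for a network of this size
cug.test(adj, gden, cmode = "size")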

Semantic Analysis

Create a word cloud

This task first uses ConstructCorpus in semantic_analysis.R to
create a text corpus, and then uses MakeWordCloud to make a word
cloud. Please note that ConstructCorpus provides a number of options,
such as whether to remove hashtags (#tag) or user mentions (@user)
embedded in tweets.
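
A sketch of this step (the option names passed to ConstructCorpus are
assumptions):

# build a corpus from tweet text and visualize it as a word cloud
# (removeTags / removeUsers are assumed option names)
source("semantic_analysis.R")
corpus <- ConstructCorpus(df$text, removeTags = TRUE, removeUsers = TRUE)
MakeWordCloud(corpus)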

Next we are going to create a term-document matrix for some quick
similarity computation.
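
For instance, with the tm package (a sketch, assuming ConstructCorpus
returns a tm corpus; the keyword "learning" is illustrative):

library(tm)
# build a term-document matrix and drop very sparse terms
td.matrix <- TermDocumentMatrix(corpus)
td.matrix <- removeSparseTerms(td.matrix, sparse = 0.99)
# find terms correlated with an illustrative keyword
findAssocs(td.matrix, "learning", 0.3)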

For more advanced similarity computation among documents and terms, I am
considering adding Latent Semantic Analysis (LSA) capability into this
package in the future.

Topic modelling with Latent Dirichlet Allocation (LDA)

With the sparse term-document matrix created above, we can use the
TrainLDAModel function in semantic_analysis.R to train an LDA model.
(Note: I don't fully understand all of the steps in TrainLDAModel, which
was refactored from Ben Marwick's repo, so please help check it if you
understand LDA.) This step may take a while, depending on the size of the
dataset.
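
A sketch of this step (the signature of TrainLDAModel and the number of
topics are assumptions):

# train an LDA model; k = 8 topics is an arbitrary illustrative choice
lda.model <- TrainLDAModel(td.matrix, k = 8)
# if the returned object is a topicmodels LDA fit, top terms per topic
# could be inspected with: terms(lda.model, 10)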

Sentiment Analysis

This project implements three methods of analyzing the sentiment of tweets
(one of them, which depends on ViralHeat, is not working). Let's try the
ScoreSentiment function in sentiment_analysis.R, implemented based on this
post.
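
A sketch of scoring tweets (ScoreSentiment's signature is assumed):

# score the sentiment of each tweet's text
# (a sketch; ScoreSentiment is assumed to take a character vector)
source("sentiment_analysis.R")
scores <- ScoreSentiment(df$text)
summary(scores)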

Update: Since I have received a lot of emails about this post, I want to
point out that I have converted most of the work here into a Shiny app; you
can find an updated version of the code in the "twitterytics-shiny" GitHub
repo.