UFC 189: A Twitter Stream Animation

I’ve been on a kick lately of learning how to collect, process, and analyze data from Twitter. I’m an MMA fight fan, so I set up a stream from the Twitter API on Saturday, July 11th during UFC 189. The data I’ll be using throughout the rest of this blog post covers the end of the Rory MacDonald vs. Robbie Lawler fight (rounds 4-5) and the complete Chad Mendes vs. Conor McGregor bout. The end result of the analysis is an animation along the timeline of the fights showing tweet counts per minute, the most frequently tweeted words, and major moments during the event.

Overview of the Data

The dataset consists of 131,148 tweets, all containing the search term “UFC189”. The tweets were tokenized and cleaned to remove hashtags, URLs, usernames, and stop words, including the search term itself; the NLTK module in Python was used for this task. Once preprocessing was complete, the dataset was processed in one-minute windows to identify the 25 most frequent words in each minute. In addition, I manually created a timeline of important events, such as the beginning and end of each fight and the end of each round, based on the tweets of BloodyElbow.com and MMAJunkie.com.
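The actual preprocessing used NLTK, which isn’t shown in the post. As a rough illustration of the idea, here is a stdlib-only Python sketch of the same pipeline: strip usernames, hashtags, and URLs, drop stop words (the stop list below is a tiny hypothetical stand-in), and count tokens per one-minute window.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list; the real pipeline used NLTK's stop words
# plus the search term itself.
STOP_WORDS = {"the", "a", "rt", "ufc189"}

def clean_tokens(text):
    """Remove URLs, @usernames, and #hashtags, then tokenize."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    return [t for t in re.findall(r"[a-z']+", text) if t not in STOP_WORDS]

def top_tokens_per_minute(tweets, n=25):
    """tweets: iterable of (minute_key, text) pairs.
    Returns the n most frequent tokens per one-minute window."""
    windows = defaultdict(Counter)
    for minute, text in tweets:
        windows[minute].update(clean_tokens(text))
    return {minute: counts.most_common(n) for minute, counts in windows.items()}
```

This produces, per minute, the same kind of (token, count) table that the `ds` dataframe below holds.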

Visualizing the Data

The end goal of this blog post is to create an animation. However, the animation will be built from static images, which I will create using the ggplot2, wordcloud, grid, and gridBase packages. To start things off, I’m going to import my data.

# load the packages used below
library(lubridate)  # mdy_hm(), floor_date()
library(dplyr)      # group_by(), summarise(), left_join()

# event timeline:
# a list of important events along with the
# hour:minute they occurred
event.timeline <- read.csv("data/eventTimeline.csv")
event.timeline$created_at <- mdy_hm(event.timeline$created_at)

# get tweet count per minute
tweets.per.minute <- tweets %>%
  group_by(hour_minute) %>%
  summarise(tweet_frequency = n())

# join the timeline events to the
# tweet frequency count for each minute
tweets.per.minute <- left_join(tweets.per.minute,
                               event.timeline,
                               by = c("hour_minute" = "created_at"))
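For readers more at home in Python, the same group-by count and left join can be sketched with the standard library (the minute keys and event labels here are made up for illustration):

```python
from collections import Counter

def tweets_per_minute(minute_keys):
    """Count tweets per minute, mirroring group_by() + summarise(n())."""
    return Counter(minute_keys)

def join_events(counts, events):
    """Left-join event annotations onto the per-minute counts,
    mirroring left_join(tweets.per.minute, event.timeline).
    Minutes with no event get None, like an NA after a left join."""
    return {minute: (n, events.get(minute)) for minute, n in counts.items()}
```

As with dplyr's `left_join`, every minute is kept; only minutes present in the event timeline pick up an annotation.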

Below is a short description of the dataframes used:

ds: This dataframe contains three columns: token, count, and time. It holds the 25 most frequent tokens for each minute, along with the frequency of each token in that window.

The timeline of tweet count per minute, annotated with major events, is created using ggplot2. The word cloud is created with the wordcloud package. Stacking the plots on top of one another in a single window requires the grid and gridBase packages. Below is the print_stacked_plots function, which does the following:

Create the base timeline plot, which has the hour and minute on the x-axis and the number of tweets during that minute on the y-axis.

For each unique minute in the data:

Create a word cloud.

Create a grid in which to place the word cloud and timeline.

Highlight current time on timeline and annotate if a major event is occurring.

Save the plot to a file.

# print word clouds stacked
# on top of the timeline
print_stacked_plots <- function(ds, tweets.per.minute) {

  for (i in seq_along(time.intervals)) {

    # open a png device for this frame
    # (the filename scheme here is illustrative)
    png(sprintf("data/plots/frame_%03d.png", i),
        width = 800, height = 800)

    # set up the grid space to
    # stack the word cloud over the ggplot plot
    plot.new()
    vps <- baseViewports()
    pushViewport(vps$figure)
    # set the margins for the second chart to the same as the first
    pushViewport(plotViewport(margins = wc.plot.margins))

    # the word cloud for this minute is drawn here;
    # the drawing function is shown later in the post

    # select the tweet frequency for
    # the hour:minute during this iteration
    current.tweet <- tweets.per.minute[
      tweets.per.minute$hour_minute ==
        floor_date(time.intervals[i], unit = "minute"), ]

    # highlight the current point in time
    # on the tweet frequency timeline
    p <- frequency.timeline.plot +
      geom_point(data = current.tweet,
                 aes(x = hour_minute - atlantic.time.offset,
                     y = tweet_frequency),
                 size = 4, color = "#756bb1") +
      scale_x_datetime(labels = c("11:45", "12:00", "12:15",
                                  "12:30", "12:45", "1:00", "1:15"))

    # check if any annotation of a major event is present
    # if an event is present add it to the plot
    if (!is.na(current.tweet$tweet)) {
      p <- p + geom_text(data = current.tweet,
                         aes(x = hour_minute - atlantic.time.offset,
                             y = tweet_frequency,
                             label = tweet),
                         fontface = "bold", hjust = 0, size = 7)
    }

    # draw the assembled figure and close the device
    grid.draw(ggplotGrob(p))
    dev.off()
  }
}

The functions for drawing the word cloud and base timeline tweet frequency plot can be found below.

The word cloud was set up to only show the 15 most frequent words and all words were set to the same orientation. The settings were tuned in this manner to ensure consistency between plots. The event timeline is a basic ggplot2 line plot. I also use the fte_theme to help style the plot. Below is the plot at the beginning of the Mendes/McGregor fight.

Animating the Timeline

The end result of running the print_stacked_plots function is 83 individual image files, which are then converted into a single gif using the animation package. In my example I will also be using the im.convert function, which is a wrapper for ImageMagick. You will need to download and install ImageMagick to use im.convert.

## set options for gif
# one second delay between images
oopt <- ani.options(interval = 1)

# convert images to gif
im.convert(files = list.files("data/plots/", full.names = TRUE),
           output = "wc_animation.gif")
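One detail worth noting: list.files() returns filenames in lexicographic order, so with 83 frames an unpadded numbering scheme would put frame_10 before frame_2 and scramble the animation. A small hypothetical Python helper shows the zero-padded naming that avoids this (the path and prefix are illustrative, not from the original post):

```python
def frame_filename(index, prefix="data/plots/frame", ext="png"):
    """Zero-pad the frame number so that lexicographic order
    (what a directory listing returns) matches chronological order."""
    return f"{prefix}_{index:03d}.{ext}"
```

With three digits of padding, a plain sorted listing of up to 999 frames is already in playback order.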

The resulting animation is shown below. In the animation you will see Lawler’s and MacDonald’s names as the most prominent at the beginning. Once the walkouts start for the McGregor/Mendes fight, “Sinead” comes up as a frequent word, as Sinéad O’Connor performed “Foggy Dew” for the McGregor walkout. Once the McGregor/Mendes fight starts, their names are the most prominent. After the fight has finished, words such as “featherweight”, “McGregor”, and “champion” become most prominent.