Which world leaders are twitter bots?

Set-up

Given that I do quite like twitter, I thought it would be a good idea to write about R’s interface to the twitter API: rtweet. We can grab the package in the usual way. We’re also going to need the tidyverse for the analysis, rvest for some initial webscraping of twitter names, lubridate for some date manipulation and stringr for some minor text mining.

Getting the tweets

So, I could just write the names of twitter’s 10 most followed world leaders, but what would be the fun in that? We’re going to scrape them from twiplomacy using rvest and a chrome extension called selector gadget:

The string inside html_nodes() is gathered using selector gadget. See this great tutorial on rvest, and for more on selector gadget read vignette("selectorgadget"). Tabs (\t) and linebreaks (\n) are removed with str_replace_all() from the stringr package.
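To make the scrape-then-clean pattern concrete, here is a minimal sketch that runs on an inline HTML snippet rather than the live page. The HTML, the ".handle" selector and the handles themselves are all made up for illustration; the real page needs whatever selector gadget reports.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the scraped page (not the real twiplomacy markup)
page = read_html('<ul>
<li class="handle">\n\t@leader_one\n</li>
<li class="handle">\n\t@leader_two\n</li>
</ul>')

handles = page %>%
  html_nodes(".handle") %>%            # made-up CSS selector
  html_text() %>%                      # text still contains tabs and linebreaks
  str_replace_all("[\t\n]", "")        # strip tabs and linebreaks

handles
```

On the real page the only things that should need changing are the input to read_html() and the selector string.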

Now we can collect the twitter data using rtweet. We can use the function lookup_users() to grab basic user info such as number of tweets, friends, favourites and followers. Analysing all 50 leaders at once would be a pain, so we’re only going to take the top 10 (WARNING: this could take a while).

We only want the columns of interest (name, followers_count, friends_count, statuses_count and favourites_count) and then we want the data in long format. To do this we’re going to use select() and gather().
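As a rough sketch of that select() and gather() step, here it is on a toy data frame. The counts and account names are invented, and lead_r_toy is only a stand-in for whatever lookup_users() returned.

```r
library(dplyr)
library(tidyr)

# Made-up numbers for two hypothetical accounts; not real data
lead_r_toy = tibble(
  screen_name      = c("leader_one", "leader_two"),
  name             = c("Leader One", "Leader Two"),
  followers_count  = c(1000, 2500),
  friends_count    = c(50, 10),
  statuses_count   = c(300, 4000),
  favourites_count = c(5, 120)
)

lead_r_long = lead_r_toy %>%
  # keep only the columns of interest
  select(name, followers_count, friends_count, statuses_count, favourites_count) %>%
  # one row per leader per count, ready for faceted plotting
  gather(key = "variable", value = "value", -name)

lead_r_long
```

Each leader now has one row per count type, which is the shape ggplot2 likes for facets.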

Notice Donald trumps everyone in the followers and status area (from what I hear he’s quite a prolific tweeter), however Sushma Swaraj and Narendra Modi trump everyone when it comes to favourites and friends respectively.

Now, we’re going to use the function get_timelines() to retrieve the last 2000 tweets by each leader. Again this may take a while!

lead_r_tl = get_timelines(lead_r, n = 2000)

Unfortunately get_timelines() only gives us their twitter handle and doesn’t return their actual name. So I’m going to use select() and left_join() to add the column of names to make for easier reading on the upcoming graphs.
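A minimal sketch of that join on toy data looks something like this. I’m assuming both tables share a screen_name column to join on; users_toy and timeline_toy are invented stand-ins for the user info and the timelines.

```r
library(dplyr)

# Toy stand-in for the user info (handle plus real name); values are made up
users_toy = tibble(
  screen_name     = c("leader_one", "leader_two"),
  name            = c("Leader One", "Leader Two"),
  followers_count = c(1000, 2500)
)

# Toy stand-in for the timelines, which only carry the handle
timeline_toy = tibble(
  screen_name = c("leader_one", "leader_one", "leader_two"),
  text        = c("first tweet", "second tweet", "third tweet")
)

# Keep just the handle and name, then attach the name to every tweet
timeline_named = left_join(
  timeline_toy,
  select(users_toy, screen_name, name),
  by = "screen_name"
)

timeline_named
```

Selecting only screen_name and name before joining stops the follower counts and friends from being duplicated onto every tweet.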

Either world leaders really love iPhones or their social media / security teams do. Probably the latter. I can hear you all asking: which source is more likely to give a world leader more retweets and favourites? To find out, we’re going to summarise each source by its mean number of retweets and favourites and then gather the data into a long format for plotting.
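Here is a small sketch of the summarise-then-gather step on invented tweets. The column names source, retweet_count and favorite_count are my assumption about what the timeline data contains, and the numbers are made up.

```r
library(dplyr)
library(tidyr)

# Made-up tweets; column names are assumed, not taken from real output
tweets_toy = tibble(
  source         = c("Twitter for iPhone", "Twitter for iPhone", "Twitter Web Client"),
  retweet_count  = c(10, 30, 4),
  favorite_count = c(100, 300, 40)
)

source_long = tweets_toy %>%
  group_by(source) %>%
  # mean retweets and favourites per source
  summarise(rt_mean = mean(retweet_count), fav_mean = mean(favorite_count)) %>%
  # long format: one row per source per measure
  gather(key = "measure", value = "mean_count", -source)

source_long
```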

Naturally this leads me to the question of which leader, over their previous 2000 tweets, has the most overall retweets and favourites, and who has the highest average number of retweets and favourites?

What about the mean retweets and favourites per month? ts_plot() provides us with a quick way to turn the data into a time series plot. However, this wouldn’t work for me, so I’m doing it the dplyr way. I’m going to make a monthly time series, so first we need to aggregate our data into months. The function rollback(), from lubridate, is fantastic for this. With roll_to_first = TRUE and preserve_hms = FALSE, it will roll a date back to the first day of that month whilst also getting rid of the time information.
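To show what rollback() is doing, here is a sketch on three invented tweets. The created_at, retweet_count and favorite_count column names are assumptions about the timeline data. Note that with its default arguments rollback() goes to the last day of the previous month and keeps the time, so both arguments are set explicitly.

```r
library(dplyr)
library(lubridate)

# Made-up tweets for one hypothetical leader
tweets_toy = tibble(
  name           = c("Leader One", "Leader One", "Leader One"),
  created_at     = ymd_hms(c("2018-01-05 10:15:00",
                             "2018-01-20 23:59:59",
                             "2018-02-03 08:00:00")),
  retweet_count  = c(10, 30, 4),
  favorite_count = c(100, 300, 40)
)

monthly = tweets_toy %>%
  # first day of the tweet's month, with the time set to midnight
  mutate(month = rollback(created_at, roll_to_first = TRUE, preserve_hms = FALSE)) %>%
  group_by(name, month) %>%
  summarise(rt_mean = mean(retweet_count), fav_mean = mean(favorite_count)) %>%
  ungroup()

monthly
```

The two January tweets collapse into a single row dated the 1st of January, and the February tweet gets its own row.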

We now have two columns, fav_mean and rt_mean, that hold the mean number of favourites and retweets for each leader in each month. We can use select() and gather() to select the variables we want and then turn this into long data for plotting.

The botornot package’s only function, botornot(), works on either given user names or the output of the get_timelines() function from rtweet. To keep in line with the rest of the blog, we’re going to use the output we’ve already created from get_timelines(), stored in lead_r_tl.

bot = botornot(lead_r_tl) %>%
  arrange(prob_bot)

For a clearer look at the probabilities I’m going to plot them with their actual names instead of the screen names.
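A rough sketch of how that plot could be put together is below, using invented probabilities. I’m assuming the bot table has a screen_name column alongside prob_bot; bot_toy and names_toy are made-up stand-ins, and the plot styling is just one option.

```r
library(dplyr)
library(ggplot2)

# Made-up probabilities; screen_name as the key column is an assumption
bot_toy = tibble(
  screen_name = c("leader_two", "leader_one"),
  prob_bot    = c(0.87, 0.12)
)

# Hypothetical lookup from handle to real name
names_toy = tibble(
  screen_name = c("leader_one", "leader_two"),
  name        = c("Leader One", "Leader Two")
)

bot_named = bot_toy %>%
  left_join(names_toy, by = "screen_name") %>%
  arrange(prob_bot)

# Horizontal bars, ordered by probability, labelled with real names
p = ggplot(bot_named, aes(x = reorder(name, prob_bot), y = prob_bot)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Probability of being a bot")
```

Printing p draws the chart; reorder() keeps the bars sorted by probability rather than alphabetically.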