CULTURE OF INSIGHT

Data Visualisation Blog

Scraping & Swarming: A Visual Exploration of Facebook Data in R

TL;DR

If you’re looking for a tool to scrape all the posts with a link in a facebook page/group, and have the data presented to you in a searchable, filterable table, then check out the shiny app I made for this purpose by clicking on the image below (very niche market, I know).

If, however, that’s not why you’re here and you’d like to look at some interesting ways of visualising social media data (or any kind of events-over-time data), please read on.

Some Thoughts, Observations & Concerns

The amount of data being generated by the social media giants is now unimaginably vast, and the potential for abuse of that data is quite concerning - see here - but I’m not here to bore you with my musings on the state of cyber surveillance, Trump, Brexit, etc. We won’t be doing anything sinister like that today. Instead of leveraging this data to manipulate others, let’s see if we can use some of our own facebook data for a more reflective (existential) analysis, and see what we learn.

For this example I’m going to be using the data from a private group I share with some friends, where we post links to music we think is worth listening to, but the same approach can be applied to any facebook page. It’s particularly useful if you run a public page with lots of activity and engagement that you want to make sense of, or if you’re just a raging narcissist and want to know what time of day to change your profile picture to yield the most likes.

We’re going to scrape the data with the ever-so-useful Rfacebook package built by Pablo Barbera (nice one, Pablo). Then we’ll use ggplot2 for our visualisations, with a bit of added interactivity from plotly to round things off.
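For anyone following along, here’s the full set of packages used below. Note that lubridate and the wider tidyverse for the data wrangling are my assumptions rather than an explicit list from the original workflow:

library(Rfacebook)   # facebook Graph API wrapper
library(tidyverse)   # dplyr, tidyr & ggplot2 for wrangling and plotting
library(lubridate)   # date/time handling
library(hrbrthemes)  # theme_ipsum_rc
library(ggbeeswarm)  # geom_quasirandom
library(plotly)      # ggplotly interactivity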

On y va!

Scraping The Data

Now, to access the facebook API you need to head over to facebook’s developer site, create your own ‘app’ (not as painful as it sounds), then save the token it gives you as a variable for ease of use later.

Once that’s done, save your group_id as a variable (normally a 15-digit number that comes after the /groups/ part of the URL) and we’re good to go.

side-note: you can try this with public facebook pages/groups also. Just paste in the page or group ID.

token <- 'XXXXXXXXXXXXXXXXXXXXX'     # access token from facebook's developer site
group_id <- 'XXXXXXXXXXXXXXXXXX'     # numeric ID from the group's URL

Now let’s build a function that will do all the hard graft for us and output a tidy dataframe with only the data we’re interested in. There’s a niche element for me, in that I want to scrape the metadata from any links within each post to get the name of that link. In most cases this will be the title of a youtube video, which will most likely be a song title. This gives me a fast way of knowing what songs have been posted in the group without having to follow every URL.

It does require an extra bit of leg-work in the function, as Rfacebook’s getGroup function doesn’t return link titles, only URLs. But if you don’t need this info you can skip it, and your life is a lot easier.

The main mutations we’ll make to the dataset are all date/time related. To explore various relationships between group posts and time, it’s useful to aggregate up to broader time categories. Facebook gives us the date/time of each post to the exact second; we’ll round this datetime to minute, hour, weekday and month. As R can’t create a purely ‘time’ class, for our Minute and Hour variables we’ll create a datetime but set the date to the same day for every post. This squashes all times into a 1-day period and allows for better post/time analysis.

Here’s the full scraping function. The limit argument dictates how many posts facebook will return data for. Without it you get a fairly pathetic 25 posts, but one of my motivations for doing this was to get instant access to historical posts going back to the start of the group in late 2015. Scrolling and scrolling and scrolling on facebook to get to where you want to be is a very tedious process, so set the limit high and the function will keep scraping until there are no posts left to scrape.
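The original function isn’t reproduced here, but a minimal sketch of the approach, assuming Rfacebook’s getGroup() and the lubridate mutations described above, might look like this (the link-title step is left as a comment because it needs a custom helper):

scrape_group <- function(group_id, token, limit = 5000) {
  # pull up to `limit` posts from the group feed
  posts <- getGroup(group_id, token, n = limit)

  posts %>%
    as_tibble() %>%
    mutate(
      datetime = ymd_hms(created_time),          # exact posting time
      month    = floor_date(datetime, "month"),  # for monthly aggregation
      weekday  = wday(datetime, label = TRUE),   # day of the week
      # R has no pure 'time' class, so fix the date to the same day for
      # every post and keep the clock time - this squashes all posts
      # into one 24-hour period for the time-of-day charts
      hour     = update(floor_date(datetime, "hour"),
                        year = 2017, month = 1, mday = 1),
      minute   = update(floor_date(datetime, "minute"),
                        year = 2017, month = 1, mday = 1)
      # link titles aren't returned by getGroup(), so a custom helper
      # that follows each URL and reads the page <title> would go here,
      # e.g. link_title = map_chr(link, scrape_link_title)  # hypothetical
    ) %>%
    select(from_name, message, link, datetime, month, weekday,
           hour, minute, likes_count, comments_count)
}

group_data <- scrape_group(group_id, token, limit = 5000)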

Time to visualise!

I’m going to explore various relationships in the data with some different chart types, then refine things down to one (maybe two) charts that give me all the information that is most useful to me, making every pixel count (can you hear me, Edward Tufte?).

All charts use hrbrmstr’s glorious theme_ipsum_rc from the hrbrthemes package, because Roboto Condensed is life.

Let’s get the ball rolling with a stacked area chart of posts per month split by group member…

To save myself and my friends any embarrassment over things like pitiful post:likes ratios, I’ve anonymised the names of the group members.
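The chart itself was built with ggplot2; a minimal sketch, assuming the group_data frame from the scraping function above, goes something like this:

posts_per_month <- group_data %>%
  count(month, from_name) %>%
  complete(month, from_name, fill = list(n = 0))  # fill gaps so areas stack cleanly

ggplot(posts_per_month, aes(month, n, fill = from_name)) +
  geom_area(alpha = 0.8) +     # stacked by default
  theme_ipsum_rc() +
  labs(title = "Posts per Month by Group Member",
       x = NULL, y = "Posts", fill = "Member")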

Now let’s use the perennially sought-after facebook Like as a measurement of success, along with the number of comments on each post. You may have noticed from the first chart that Member4 joined the party late. Did his addition have any effect on the average engagement each month?
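A hedged sketch of how that engagement chart could be put together (the original may differ in detail):

engagement <- group_data %>%
  group_by(month) %>%
  summarise(`Avg Likes`    = mean(likes_count),
            `Avg Comments` = mean(comments_count)) %>%
  pivot_longer(-month, names_to = "metric", values_to = "avg")

ggplot(engagement, aes(month, avg, colour = metric)) +
  geom_line() +
  theme_ipsum_rc() +
  labs(title = "Average Engagement per Post by Month",
       x = NULL, y = "Average per Post", colour = NULL)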

With the number of people in the group increasing by 33%, we can see a bit of a hike in average comments per post, which then returns to a similar level. Sadly, the average number of likes has fallen since his arrival. quantity != quality

Focusing in on member performance, let’s use the count of posts per month by group member to see how consistent each member is, looking at the minimum and maximum number of monthly posts.
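The original chart type isn’t shown here, so a point-range over each member’s monthly counts is my stand-in sketch:

consistency <- group_data %>%
  count(from_name, month) %>%          # posts per member per month
  group_by(from_name) %>%
  summarise(min_posts    = min(n),
            median_posts = median(n),
            max_posts    = max(n))

ggplot(consistency, aes(reorder(from_name, median_posts), median_posts)) +
  geom_pointrange(aes(ymin = min_posts, ymax = max_posts)) +
  coord_flip() +
  theme_ipsum_rc() +
  labs(title = "Monthly Post Counts: Min, Median & Max per Member",
       x = NULL, y = "Posts per Month")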

Swarm Those Posts

That covers most of the things I’d be interested to explore in this dataset, but let’s see if we can build a chart that incorporates several of them into one graphic.

If you’ve seen my last blog post, you’ll know I have a bit of a thing for colourful dot-density charts/maps. Using ggbeeswarm’s geom_quasirandom, we can represent each post as a dot and show the density of posts over time. The quasirandom plotting offsets points within categories to reduce overplotting.

I’m also going to use plotly’s ggplotly wrapper here to add a bit of interactivity to the plot. There’s a lot of debate in the data vis world about interactivity and its benefits (or lack thereof), but in this instance it allows us to add tooltip information for each post, as well as zooming functionality to focus in on a specific period of time, which is great for digging into areas of high density.

I’ve mapped the alpha aesthetic to Likes, which makes it easy for us to identify the posts that have performed best.
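Putting that together, a sketch of the swarm chart, assuming member on the x-axis and posting date on the y-axis (the text aesthetic feeds the plotly tooltip; ggplot2 will warn that it’s an unknown aesthetic, which is safe to ignore here):

p <- ggplot(group_data,
            aes(from_name, datetime,
                colour = from_name,
                alpha  = likes_count,
                text   = paste0(likes_count, " likes, ",
                                comments_count, " comments"))) +
  geom_quasirandom(size = 2) +
  scale_alpha(range = c(0.3, 1)) +   # best-performing posts stand out
  theme_ipsum_rc() +
  theme(legend.position = "none") +
  labs(title = "Every Post, Swarmed Over Time", x = NULL, y = NULL)

ggplotly(p, tooltip = "text")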

The pièce de résistance of this chart would be having the name of the link (song title) in the tooltip, but as this blog post will be read by around 5-10 (million) people, I couldn’t possibly reveal the back catalogue of music that we have meticulously curated to an audience that vast.

I do have it included in my personal version, however, and it’s a lot of fun identifying the posts with the most likes via the alpha level, zooming into clustered areas, and using the tooltip to see what each song is - give it a try yourself!

Finally, let’s create the same chart but with a more micro timescale, squashing all posts into a 1-week period.
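Reusing the hour/minute variables we squashed into a single day earlier, weekday goes on one axis and time-of-day on the other - again a sketch under the same assumptions as above:

p_week <- ggplot(group_data,
                 aes(weekday, minute,
                     colour = from_name,
                     alpha  = likes_count,
                     text   = paste0(likes_count, " likes"))) +
  geom_quasirandom(size = 2) +
  scale_alpha(range = c(0.3, 1)) +
  theme_ipsum_rc() +
  theme(legend.position = "none") +
  labs(title = "Posts Squashed into One Week",
       x = NULL, y = "Time of Day")

ggplotly(p_week, tooltip = "text")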
