Fun With Facebook

I am often surprised by which of my Facebook posts are the most liked and by who likes what. I wondered: are there any interesting patterns there? Could I visualize them?

My next question (as always) was: could I get the data? Thanks to the rise of the API economy I could. Companies have discovered it’s profitable to make their private data public. IT departments are splitting into private and public-facing sides. Public APIs with user-friendly consoles make it ever easier to slurp up data from almost anywhere – and small amounts of slurping are often free.

Facebook’s Graph API Explorer console

This is great news for quantified self projects like this one. Facebook’s public API console is called Graph API Explorer. You hit a button to get a temporary authentication token, then construct queries by pointing and clicking. Here is the query that I used to retrieve my posts and the people who liked each one:

The result comes back as a JSON file. JSON (JavaScript Object Notation) is the lingua franca of the public data world. The data it conveys is often called “unstructured”, but it would be more accurate to say “flexibly structured”. JSON data can have quite elaborate hierarchical structures but with attributes that are sometimes there and sometimes not.

There are a few caveats, as I discovered. You have to pull down the data in reasonably sized chunks and then paste those chunks together. The data is curated but not pristine; there were a few duplicate IDs and missing quotation marks that had to be cleaned up. JSON is fragile: if even one comma is out of place in a 2 MB text file, havoc ensues. And some data is missing by design. Facebook no longer allows you to query friend lists (even your own), so I had to scrape that list the old-fashioned way. Several of my friends’ likes were missing because they had opted out of all data collection.
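For readers who would rather script the pasting-together step, here is a minimal Python sketch. It is not how I did it, and it rests on assumptions: each chunk is a JSON text whose posts sit in a top-level "data" list, and each post carries an "id" field. The function name merge_chunks and the sample chunks are made up for illustration.

```python
import json

def merge_chunks(chunks):
    """Combine several JSON chunks into one list of posts, skipping duplicate IDs.

    Assumes each chunk is a JSON object with a "data" list of posts,
    and that every post has an "id" field (both are assumptions).
    """
    seen = set()
    posts = []
    for text in chunks:
        # json.loads raises json.JSONDecodeError if the chunk is malformed,
        # for example when a comma or quotation mark is missing.
        page = json.loads(text)
        for post in page.get("data", []):
            if post["id"] in seen:
                continue  # duplicate ID: keep only the first copy
            seen.add(post["id"])
            posts.append(post)
    return posts

# Hypothetical chunks that overlap on post "2".
chunk_a = '{"data": [{"id": "1", "message": "hello"}, {"id": "2"}]}'
chunk_b = '{"data": [{"id": "2"}, {"id": "3", "message": "bye"}]}'
merged = merge_chunks([chunk_a, chunk_b])
```

Parsing each chunk separately has one practical advantage: when a chunk is broken, the error points at that chunk rather than at a position somewhere inside one huge file.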

NodeBox network converts a JSON query into a set of structured CSV files

The main challenge in visualizing data like this is converting flexible JSON structures to more predictable lists and tables. To do this I developed a NodeBox network that starts with a JSON query at the top, adds in some additional information from other sources, sorts and filters and merges it, then spits out a half-dozen clean CSV files of structured data ready to plot. I had to write a custom CSV output node in Python but other than that no coding was required and the network can be reused for future queries.
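The same flattening idea can be sketched in plain Python. This is not the NodeBox network, just an illustration of turning nested posts into two flat tables, one row per post and one row per like. The field names ("likes", "data", "created_time", "message", "name") are assumptions about the shape of the query result, and flatten and to_csv are hypothetical helper names.

```python
import csv
import io

def flatten(posts):
    """Split nested post records into a posts table and a likes table."""
    post_rows, like_rows = [], []
    for post in posts:
        post_rows.append({
            "post_id": post["id"],
            # Optional attributes: fall back to an empty string when absent.
            "created_time": post.get("created_time", ""),
            "message": post.get("message", ""),
        })
        # A post with no likes may have no "likes" key at all.
        for liker in post.get("likes", {}).get("data", []):
            like_rows.append({
                "post_id": post["id"],
                "liker_id": liker["id"],
                "liker_name": liker.get("name", ""),
            })
    return post_rows, like_rows

def to_csv(rows, fieldnames):
    """Render a list of dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample: one post with two likes, one post with none.
sample = [
    {"id": "p1", "created_time": "2015-03-14T21:05:00+0000",
     "likes": {"data": [{"id": "u1", "name": "Ann"}, {"id": "u2"}]}},
    {"id": "p2", "message": "no likes yet"},
]
post_rows, like_rows = flatten(sample)
likes_csv = to_csv(like_rows, ["post_id", "liker_id", "liker_name"])
```

The "sometimes there and sometimes not" attributes are handled with defaults, so every row ends up with the same columns.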

I now had a fair amount of data to work with: 904 posts written over 9 years with 7500 likes from 144 friends and 311 strangers. How could I turn this into something visual?

My initial vision had to give way to reality

My initial vague idea was to represent friends as balloons – the more each liked me the bigger the balloon – with strings that would somehow connect to the posts they liked, colored by topic and arranged on a timeline like beads. But initial conceptions must give way to reality; with visualization the design has to follow the data.

Showing posts in a grid is more manageable and revealing

The first inconvenient truth was that a timeline of 904 posts over nine years, with some posts months apart and others minutes apart, was just too long to draw as a single line. So I made it two-dimensional: a grid with months on the x axis and hours on the y. This had the additional advantage of showing what time of day my posts appeared.
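Mapping a post onto such a grid comes down to a little date arithmetic. Here is a rough Python sketch, assuming timestamps arrive as ISO-style strings with a numeric offset such as "2015-03-14T21:05:00+0000" (an assumption about the data) and that the grid starts at a chosen year and month. The function name grid_cell is made up. The hour is taken as written in the timestamp, so you would need to convert to local time first if the data is in UTC.

```python
from datetime import datetime

def grid_cell(timestamp, start_year, start_month):
    """Return (column, row) for a post: months since the start, and hour of day."""
    t = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")
    column = (t.year - start_year) * 12 + (t.month - start_month)  # x: month index
    row = t.hour                                                   # y: 0 to 23
    return column, row

# A post from March 2015 on a grid that starts in January 2014.
cell = grid_cell("2015-03-14T21:05:00+0000", 2014, 1)
```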

My friends evolved from a forest to a line of jellyfish

When I first tried plotting my friend balloons, the multiple strings for friends with many likes looked more like tree trunks, so my balloons became a forest. But it was much more revealing to plot the friends by when we first connected, and to let the strings diverge right away. The final result was something more like jellyfish.

It’s hard to draw thousands of overlapping lines without making a mess

Linking my jellyfish friends to my post grid was the next challenge. These links form what’s called a Sankey diagram. Sankey diagrams rarely have more than a few dozen overlapping links of varying thickness; it’s hard to draw thousands of links without creating a big black mess. After much study I found I could make beautiful and revealing links by keeping them thin, partially translucent, and by aggregating the grid connections for each month down to a fine point.

The final step was to assign colors based on the topic of each post: green for personal posts, blue for general interest, red for political, and orange for work-related. This was the only data I could not automate; I had to assign topics to each of the 904 posts by hand. Once the post bubbles were colored, the lines leading to each post could also be colored, and the jellyfish could be colored as well based on their proportion of topic preferences.
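One plausible way to color a jellyfish by its mix of topics is a weighted average of the four topic colors. The Python sketch below shows that idea only. The RGB values, the topic keys, the gray fallback, and the function name blend are all assumptions for illustration, not the exact scheme used in my network.

```python
# Assumed RGB values for the four topic colors.
TOPIC_COLORS = {
    "personal": (0, 128, 0),     # green
    "general": (0, 0, 255),      # blue
    "political": (255, 0, 0),    # red
    "work": (255, 165, 0),       # orange
}

def blend(topic_counts):
    """Average the topic colors, weighted by how many likes fall in each topic."""
    total = sum(topic_counts.values())
    if total == 0:
        return (128, 128, 128)  # no likes at all: fall back to neutral gray
    return tuple(
        round(sum(TOPIC_COLORS[topic][i] * n for topic, n in topic_counts.items()) / total)
        for i in range(3)
    )

# A friend who liked three personal posts and one general-interest post.
color = blend({"personal": 3, "general": 1})
```

A friend who only likes one topic comes out in that topic's pure color, while a friend with mixed tastes lands somewhere in between.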

To see the end result and what I learned from it, stay tuned for part 2: Who Likes Me?