One common criticism of the otherwise excellent ggplot2 is that it doesn’t come with network visualisation capability. Network vis is so popular at the moment that this feels like a big omission; but network data is also quite distinctive in structure (and the layout algorithms would need implementing), so I can see why it hasn’t been integrated.

Moritz Marbach has a great post explaining how to easily get ggplot2 up and running with network data. It was still one of the top hits on Google when I checked recently for a project. However, the post is from 2011 and is getting a little dated – it uses the sna package rather than igraph (which seems to be becoming a standard for network science), and it contains a few deprecated ggplot2 commands. So I thought I’d post an updated version of the code here.

As Marbach explains, the secret to getting ggplot2 to draw networks is quite simple: get a network analysis package to give you a list of nodes and edges, plus node layout information as a series of X,Y coordinates. Then you can simply plot the nodes with geom_point and the edges with geom_segment. Put together it looks something like this:
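A minimal sketch of the approach using igraph – the example graph, the layout function, and the column names are my choices for illustration, not necessarily Marbach’s originals:

```r
library(igraph)
library(ggplot2)

# Build a small random graph and compute a force-directed layout
set.seed(42)
g <- erdos.renyi.game(50, p = 0.08)
layout <- layout_with_fr(g)  # Fruchterman-Reingold: one X,Y row per node

nodes <- data.frame(x = layout[, 1], y = layout[, 2])

# For each edge, look up the coordinates of its two endpoint nodes
el <- as.data.frame(as_edgelist(g, names = FALSE))
edges <- data.frame(x    = nodes$x[el$V1], y    = nodes$y[el$V1],
                    xend = nodes$x[el$V2], yend = nodes$y[el$V2])

# Edges first (so nodes draw on top), then points; drop axes entirely
ggplot() +
  geom_segment(data = edges,
               aes(x = x, y = y, xend = xend, yend = yend),
               colour = "grey50") +
  geom_point(data = nodes, aes(x = x, y = y)) +
  theme_void()
```

The key move is the endpoint lookup: igraph hands back edges as pairs of node indices, and indexing the node coordinate vectors with those pairs gives geom_segment the start and end positions it needs.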

I found this code buried in an old Google group discussion and thought I would repost it. As with everything ggplot-wise, hat tip to the incredible Hadley Wickham.

Often it’s nice to break down a scatter plot by a third variable, especially a categorical one. However, if you have lots of categories, the space occupied by the different groups isn’t always straightforward to see:
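A sketch of the kind of plot I mean – here `df`, `X`, `Y`, and `group` are placeholders for your own data frame and variables:

```r
library(ggplot2)

# Colour the points by a categorical third variable; with many groups
# the regions they occupy overlap and are hard to distinguish
ggplot(df, aes(x = X, y = Y, colour = group)) +
  geom_point()
```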

A bounding box around the outermost points in each group (a ‘convex hull’) makes the groups much easier to see. To compute the hulls for ggplot you can use the following code:

Here X and Y are the names of your X and Y variables on the scatter plot. With ‘hulls’ in place you can simply add the following layer to your ggplot call (with a bit of alpha so you can still see the points):

geom_polygon(data = hulls, alpha = 0.2)

Is it better? On the one hand, you can clearly see the groups. On the other, outliers may give a distorted impression of what is going on. Something that trimmed outliers before drawing the hulls might work better.

Types are a perennial headache in social science computing. One of the reasons I like Perl is that it is so tolerant of variables changing type dynamically according to context – that saves a lot of time when scripting and is also much easier to explain to students. It’s pretty clear which of these is easier to explain to a beginner:
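The sort of contrast I have in mind might look like this (my reconstruction for illustration, with the Perl side shown in comments):

```r
# In Perl the string is coerced to a number automatically:
#   my $n = "5" + 1;   # 6
#
# In R the same operation is an error, and you must convert explicitly:
"5" + 1              # Error: non-numeric argument to binary operator
as.numeric("5") + 1  # works: 6
```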

Yesterday, however, I ran into an even more annoying typing problem in R. I needed to export a large dataset (600,000 obs x 100 vars) which I only had in .RData format. Writing it out in one go with write.table() quickly hits R’s upper memory buffer. So I set up a simple loop to divide up the file on the basis of area_id, a variable with around 40 unique values:
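A sketch of the sort of loop I mean – `data` stands for the data frame loaded from the .RData file, and the file-naming scheme is my own guess:

```r
# Write one file per area_id value rather than the whole table at once
for (country in unique(data$area_id)) {
  chunk <- subset(data, area_id == country)
  write.table(chunk,
              file = paste0("export_", country, ".txt"),
              sep = "\t", row.names = FALSE)
}
```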

What do I get? 40 files with nothing in them. subset() clearly isn’t working. A closer look at the last value country took turned this up:

What’s happening? R doesn’t consider country, or even country[1], equal to 6. But when I assign country[1] to another variable (without making any explicit attempt to change types), everything works. It’s not really clear to me why that should be. But this sort of typing difficulty is one of the things that puts beginners off, and I think it’s a particular shame in R, since this language should be oriented towards the needs of small-script writers.
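For what it’s worth, a common culprit in situations like this is factor semantics: a numeric-looking variable stored as a factor carries internal integer codes that differ from its printed values. A small illustration of the trap (my example, not the original session):

```r
f <- factor(c(6, 2, 9))
as.numeric(f)                # 2 1 3 -- the internal level codes, not the values!
as.numeric(as.character(f))  # 6 2 9 -- converting via character recovers the values
```

Whether this is exactly what bit the export loop above I can’t say, but it produces the same flavour of silently failing comparisons.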