Exploring the CRAN social network

A few months ago, I published a post in which I tried to map the dependency relationships between R packages. Today I want to do something similar with package contributors. My idea is to reconstruct a social graph in which each node is a person (presumably a developer) and two people are connected by an edge if they have collaborated on the same package. That way, I can explore the CRAN social network!

This post is also a bit special for me. It's the very first time I'm using dplyr and the tidyverse. I used to write my code in base R, but the amazing work of Thomas Lin Pedersen on tidygraph and ggraph convinced me to take the plunge. And it was a lot of fun!

First, we load the packages and the data. In the first version of the project, I used the XML file of each package, which I retrieved from CRAN with wget and then read and parsed with xml2. Since R 3.4.0, things are much, much simpler thanks to the function tools::CRAN_package_db().
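As a minimal sketch of this loading step (not the exact code of the post): the helper name `get_pkg_authors` is hypothetical, and the choice to keep only the `Package` and `Author` columns is an assumption made for illustration.

```r
# Keep only the fields needed to build the contributor list.
# `db` defaults to the live CRAN metadata, which requires internet access.
# The helper name and the column selection are illustrative assumptions.
get_pkg_authors <- function(db = tools::CRAN_package_db()) {
  db <- as.data.frame(db, stringsAsFactors = FALSE)
  db[, c("Package", "Author"), drop = FALSE]
}

# Usage (downloads the metadata of every CRAN package):
# pkg_authors <- get_pkg_authors()
# head(pkg_authors)
```

Selecting columns by name with base R indexing sidesteps any trouble that the raw metadata table might cause in a dplyr pipeline.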

That's all we need to get the full list of contributors, but it comes in a very messy form. R provides a structured system for describing people, which should be used in the Authors@R field of the DESCRIPTION file, but most people use a simple character string.
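To illustrate the difference between the two forms, here is a small sketch using the base `person()` constructor. The names and the e-mail address are made up for the example.

```r
# Structured form, as it could appear in an Authors@R field
# (the people below are hypothetical)
authors <- c(
  person("Jane", "Doe", email = "jane@example.org", role = c("aut", "cre")),
  person("John", "Smith", role = "ctb")
)

# Names can be extracted without any string parsing
full_names <- format(authors, include = c("given", "family"))
print(full_names)

# The same information as a free-text Author field has to be parsed by hand
free_text <- "Jane Doe [aut, cre], John Smith [ctb]"
```

With the structured form, given name, family name and role are separate fields. With the free-text form, they have to be recovered with regular expressions.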

The big cleanup

I combined a series of regular expressions and string manipulations to extract every name. It was the first time I used stringr and, I have to say, I don't miss gsub, grepl, etc. I only miss the support for some PCRE features…
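The actual cleanup involves many more special cases, but the general idea can be sketched as follows. The function `clean_authors` and the four patterns are assumptions chosen for illustration, not the rules used in the post.

```r
library(stringr)

# Split a free-text Author field into individual names.
# Hypothetical helper: the patterns cover only a few common cases.
clean_authors <- function(x) {
  x <- str_replace_all(x, "\\[[^\\]]*\\]", "")      # drop role tags such as [aut, cre]
  x <- str_replace_all(x, "\\([^)]*\\)", "")        # drop parenthetical notes
  x <- str_replace_all(x, "<[^>]*>", "")            # drop e-mail addresses
  x <- str_replace_all(x, "\\s+and\\s+|;|&", ",")   # unify the separators
  parts <- str_squish(str_split(x, ",")[[1]])       # split, then trim whitespace
  parts[parts != ""]
}

cleaned <- clean_authors(
  "Jane Doe [aut, cre] <jane@example.org>, John Smith (ORCID) and Ann Lee"
)
print(cleaned)
```

Removing the bracketed parts before splitting matters, because role tags such as `[aut, cre]` contain commas themselves.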

At this point we have a list of 13668 (unique) names, and this number is increasing almost every day. The R community is huge! Of course there are still some errors. And we can find surprising contributors:

aut_list$Name[4922]

# [1] "Her Majesty the Queen in Right of Canada"

The Social Network

The next step is to produce a two-column matrix describing all the connections of the network (an edge list). An edge list can be turned into a graph object with the igraph package. Finally, we can convert the igraph graph into a tidy graph so that we can use the API provided by the tidygraph package, for example to filter nodes or edges, or to select only the main component.
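The three steps above can be sketched on a toy data set. The table `pkg_aut` and its column names are hypothetical, and building the pairs with `combn()` is only one possible approach.

```r
library(igraph)
library(tidygraph)
library(dplyr)

# Hypothetical long table: one row per (package, contributor) pair
pkg_aut <- data.frame(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgC", "pkgC"),
  Name    = c("Ann", "Bob", "Cid", "Bob", "Dan", "Eve", "Fay"),
  stringsAsFactors = FALSE
)

# 1. Edge list: every pair of co-authors within each package
pairs <- lapply(split(pkg_aut$Name, pkg_aut$Package), function(a) {
  if (length(a) < 2) return(NULL)   # a single author yields no edge
  t(combn(sort(a), 2))              # one row per pair
})
edge_list <- unique(do.call(rbind, pairs))

# 2. igraph object built from the two-column character matrix
g <- graph_from_edgelist(edge_list, directed = FALSE)

# 3. Tidy graph, then keep only the largest connected component
tg <- as_tbl_graph(g)
main <- tg %>%
  activate(nodes) %>%
  mutate(comp = group_components()) %>%   # components are numbered by decreasing size
  filter(comp == 1)
```

Sorting the names inside each package before calling `combn()` ensures that a pair always appears in the same order, so `unique()` removes collaborations repeated across packages.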

Focus on the core network

The complete graph makes a nice painting to hang on the wall of your living room, but it is hard to say anything about it. So, we will reduce the data and focus on the contributors involved in more than 4 packages.
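One way to express this filter with dplyr is sketched below. The table `pkg_aut` and the column name `n_pkg` are hypothetical.

```r
library(dplyr)

# Hypothetical long table: one row per (package, contributor) pair
pkg_aut <- data.frame(
  Package = c(paste0("pkg", 1:5), "pkg1", "pkg2", "pkg3"),
  Name    = c(rep("Ann", 5), "Bob", "Bob", "Cid"),
  stringsAsFactors = FALSE
)

# Count the distinct packages per contributor and keep those with more than 4
core <- pkg_aut %>%
  distinct(Package, Name) %>%
  count(Name, name = "n_pkg") %>%
  filter(n_pkg > 4)

print(core)

# With a tidy graph `tg` whose nodes carry a `name` attribute, the same
# selection could then be applied to the nodes (assumption, not run here):
# tg %>% activate(nodes) %>% filter(name %in% core$Name)
```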

If you know R and its community a bit, you will certainly recognize some names. The challenge is to interpret this classification. It is not a good thing to put people in boxes, but it is necessary to make nice and colorful visualizations! So, I'll give it a try. To me, the first group includes people from the early days who contributed to the major packages that constitute the heart of R. The second group corresponds more to a second generation, associated with RStudio products, who have been particularly prolific in recent years. Of course, the two groups are strongly connected.

How can we label these two groups? After long consideration, I chose to go for "The Ancients" and "The Moderns", without any value judgment! This is a reference to an old debate in French literature. This dubious comparison would certainly be more appropriate to designate the ongoing base vs. tidyverse debate, but… not today!

… there are 254 edges connecting the Ancients and the Moderns! But I think this is a nice illustration of how social graphs keep the imprint of history and past events. The next step would be to look at the graph dynamics over time. Maybe an idea for a future post…
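For reference, a count of the edges connecting two groups, like the one quoted above, can be obtained along the following lines. The toy graph and the group assignment are invented for the example and say nothing about how the groups were actually computed.

```r
library(igraph)

# Toy graph with named vertices (hypothetical data)
g <- graph_from_literal(A-B, B-C, A-C, C-D, D-E)

# Hypothetical group membership, indexed by vertex name
grp <- c(A = "Ancients", B = "Ancients", C = "Ancients",
         D = "Moderns",  E = "Moderns")

# An edge crosses the groups when its two endpoints have different labels
el <- as_edgelist(g)
n_cross <- sum(grp[el[, 1]] != grp[el[, 2]])
print(n_cross)
```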

To leave a comment for the author, please follow the link and comment on their blog: R_EN – Piece of K.