Extracting Social Network Graphs from DNC Emails

Data analysis doesn't need big expensive tools. Simple utilities and a little creativity can extract interesting relationships from datasets too. Here, find out how you could do this yourself, using the recent DNC email leaks as a sample set.

Yesterday I crawled the dataset and processed it, extracting two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate the data sets yourself. If you don't know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will analyze the networks and include them in Konect.

First, We Have the Temporal Graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another, together with a timestamp of when it was sent. If a person puts n recipients in CC, n edges are added to the graph.
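As a minimal sketch of this edge construction, assume each email has already been parsed into a sender, a list of recipients, and a Unix timestamp. The `emails` records and the `temporal_edges` helper are hypothetical and not taken from the original source code. The constant weight of 1 in the third position is also an assumption, chosen so that the timestamp ends up in the fourth column.

```python
# Hypothetical parsed emails: (sender, recipients, unix_timestamp).
emails = [
    ("alice@example.org", ["bob@example.org", "carol@example.org"], 1462060800),
    ("bob@example.org", ["alice@example.org"], 1462064400),
]


def temporal_edges(emails):
    """Return one directed edge (sender, recipient, weight, timestamp) per recipient."""
    edges = []
    for sender, recipients, timestamp in emails:
        for recipient in recipients:
            # An email with n recipients contributes n edges with the same timestamp.
            edges.append((sender, recipient, 1, timestamp))
    return edges


edges = temporal_edges(emails)
for edge in edges:
    # Write one whitespace-separated line per edge.
    print(*edge)
```

Repeated emails between the same two people simply produce repeated edges, which is how multi-edges arise in this representation.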

The data is currently not sorted by the fourth column (the timestamp), but this can easily be done. Clearly, an email network is directed and can have multi-edges, since the same person can email the same recipient many times.
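Sorting by the fourth column could look like the following sketch. It assumes whitespace-separated lines in a "source target weight timestamp" layout. The sample `lines` and the `sort_by_timestamp` helper are made up for illustration.

```python
# Hypothetical edge lines, assumed layout: source target weight timestamp.
lines = [
    "1 2 1 1462064400",
    "2 3 1 1462060800",
    "1 3 1 1462062600",
    "1 2 1 1462060800",
]


def sort_by_timestamp(lines):
    """Sort edge lines by the integer value in the fourth column."""
    # sorted() is stable, so edges with equal timestamps keep their input order.
    return sorted(lines, key=lambda line: int(line.split()[3]))


sorted_lines = sort_by_timestamp(lines)
for line in sorted_lines:
    print(line)
```

Duplicate lines are kept rather than merged, so multi-edges survive the sort.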

Second, We Have the Weighted Co-recipient Network

Looking at the data, I discovered that many emails have more than one recipient, so I thought it would be interesting to count how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.
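One way to compute such co-recipient weights is sketched below. It assumes the recipient list of each email is already available as a list of names. The `recipient_lists` sample and the `co_recipient_weights` helper are hypothetical, and treating the network as undirected is a choice made for this sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical recipient lists, one per email.
recipient_lists = [
    ["alice", "bob", "carol"],
    ["bob", "alice"],
    ["dave"],
]


def co_recipient_weights(recipient_lists):
    """Count how often each unordered pair of people shares a recipient list."""
    weights = Counter()
    for recipients in recipient_lists:
        # Deduplicate and sort so (a, b) and (b, a) map to the same edge.
        for pair in combinations(sorted(set(recipients)), 2):
            weights[pair] += 1
    return weights


weights = co_recipient_weights(recipient_lists)
for (a, b), weight in sorted(weights.items()):
    print(a, b, weight)
```

Emails with a single recipient contribute no edges, and an email with n distinct recipients contributes n(n-1)/2 pairs.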

At first glance, the data looks pretty natural. In the following, I provide a diagram showing the rank-frequency plot of senders and receivers. One can see that some people are far more active than others. Also, the recipient curve lies above the sender curve, which makes sense since every mail has exactly one sender but at least one recipient.
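The numbers behind a rank-frequency plot can be derived roughly as follows. This is only a sketch under the assumption that each email is available as a sender plus a recipient list. The `emails` sample and the `rank_frequency` helper are hypothetical.

```python
from collections import Counter

# Hypothetical parsed emails: (sender, recipients).
emails = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
]


def rank_frequency(names):
    """Return counts sorted in descending order; index 0 is rank 1."""
    return sorted(Counter(names).values(), reverse=True)


# Each email counts once for its sender and once for every recipient.
sender_curve = rank_frequency(sender for sender, _ in emails)
recipient_curve = rank_frequency(
    recipient for _, recipients in emails for recipient in recipients
)

print("senders:", sender_curve)
print("recipients:", recipient_curve)
```

Plotting each list against its rank (1, 2, 3, ...), typically on logarithmic axes, gives one curve for senders and one for recipients.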

You can also see the rank co-occurrence count diagram of the co-occurrence network. When the ranks are above 2000, the usual network structure picture changes a little bit. I have no plausible explanation for this; maybe it is because the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.