SNA with R: Loading your network data in statnet

We are interested in Social Network Analysis (SNA) using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. Here we use the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages.

The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.

We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries. In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:

Here a record indicates that the customer identified as src called (type=call) the customer dest at the given time start and the call lasted duration seconds. In general, there will be (many) more attributes describing the transaction which are represented by the …. In a Financial Services example, the records may be money transfers between accounts.
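To make the record layout concrete, here is a made-up extract parsed in R. The column names follow the description above; the identifiers, timestamps, and durations are invented purely for illustration, and the additional attributes (the …) are left out.

```r
# Made-up CDR extract: every value below is invented for illustration.
cdr_text <- "src,dest,type,start,duration
1001,1002,call,2009-06-01 09:15:00,62
1001,1003,call,2009-06-01 09:20:30,185
1002,1001,call,2009-06-01 10:02:10,47
1001,1002,call,2009-06-02 18:45:00,301"

# One row per call; keep strings as plain character vectors.
cdr <- read.csv(text = cdr_text, stringsAsFactors = FALSE)

# Parse the start time so it can be compared and sorted later.
cdr$start <- as.POSIXct(cdr$start, tz = "UTC")

str(cdr)
```

Note that customer 1001 calls customer 1002 twice, which is exactly the repeated-contact situation the network representation has to cope with.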

Implementation in the network class

In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges. That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.

In the terminology of the network class, that means that our network will be directed (calls and money transfers have a direction from one person to another) and will need to allow multiple edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).
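As a minimal sketch of what these two properties look like in code, the following builds a tiny directed network that allows multiple edges. The three customers and four calls are hypothetical, and the vertices are assumed to be numbered 1 to 3 already. Naming the `tail` (caller) and `head` (called party) arguments explicitly avoids any reliance on column order.

```r
library(network)

# Hypothetical calls between three customers, already numbered 1..3.
callers <- c(1, 1, 2, 1)   # edge tails: who made the call
callees <- c(2, 3, 1, 2)   # edge heads: who received the call

# Empty network with 3 vertices: directed, and multiple edges allowed.
net <- network.initialize(3, directed = TRUE, multiple = TRUE)

# Add one edge per call; the repeated 1 -> 2 call becomes a second edge.
net <- add.edges(net, tail = callers, head = callees)

network.size(net)       # number of vertices
network.edgecount(net)  # number of edges (one per call)
is.directed(net)
is.multiplex(net)
```

This is only one way to construct the object; it is not necessarily how the data is loaded later in this post.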

We could consider dropping the multiple attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute. We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).
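For illustration, the collapsed representation could be prepared in base R along these lines. This is a sketch on made-up data, and the column names `n_calls` and `total_duration` are our own choice rather than anything the network package prescribes.

```r
# Made-up calls: customer 1 calls customer 2 twice.
cdr <- data.frame(
  src      = c(1, 1, 2, 1),
  dest     = c(2, 3, 1, 2),
  duration = c(62, 185, 47, 301)
)

# Number of calls per ordered (src, dest) pair.
n_calls <- aggregate(duration ~ src + dest, data = cdr, FUN = length)
names(n_calls)[3] <- "n_calls"

# Total call duration per ordered (src, dest) pair.
total <- aggregate(duration ~ src + dest, data = cdr, FUN = sum)
names(total)[3] <- "total_duration"

# One row per directed pair, with both summaries as candidate edge attributes.
edges <- merge(n_calls, total, by = c("src", "dest"))
edges
```

The four calls collapse to three rows, and the individual call times are gone, which is the loss of information mentioned above.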

Mapping customer identifiers to network vertex numbers

One thing the documentation seems to forget to tell you is that when you import your data, your vertex identifiers (in our case customer or account numbers) must be replaced by vertex numbers, and that this numbering must be sequential and start from 1. Being used to environments where vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while. The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:

For the discussion that follows, we will assume that you have changed your identifiers externally to R.
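If you would rather do the renumbering inside R, one possible approach in base R is to build a lookup vector of all distinct identifiers and use `match` to translate them. The identifiers below are made up, and the column names `src_v` and `dest_v` are hypothetical.

```r
# Made-up transactions with arbitrary customer identifiers.
cdr <- data.frame(
  src  = c("A1007", "A1007", "B2210", "A1007"),
  dest = c("B2210", "C0042", "A1007", "B2210"),
  stringsAsFactors = FALSE
)

# Every identifier appearing as source or destination, once each.
ids <- sort(unique(c(cdr$src, cdr$dest)))

# The position in `ids` becomes the vertex number: sequential, starting at 1.
cdr$src_v  <- match(cdr$src, ids)
cdr$dest_v <- match(cdr$dest, ids)

# ids[k] recovers the original identifier of vertex k.
cdr
```

Keeping `ids` around lets you translate results back to customer or account numbers after the analysis.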

Loading the data

The good news is that our data is essentially in a format that the network package calls edge list and which it can import directly.

I say “essentially” because, as far as we could tell with the version we used, the package expects the destination to come before the source, which seems backwards to me. But assume we have our data in a file cdr.csv like this (we only have calls here):

OK, that’s a lot of warnings, but it basically worked. We have figured out how to load our network data into the network package in R.

Performance

We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks. We change our CDR simulator to emit the destination before the source, just as network likes it, and let it run.

The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:

We can potentially save some time and memory by explicitly not performing the edge check (again, the documentation frustrates us and is silent on the defaults for the network call we used above). We try this for our next file, which has 51,316,641 lines of CDR data (again for 100k customers); this approach also saves us some column swapping:

Our attempted optimization did not seem to matter and this network is too big for the machine and the network package. Building the network was painful as I was working on the workstation at the same time. The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.

51 million might be a reasonable size data set for some Financial Services applications but it is clearly a trivial number of records for Telecommunications. I’ll need to do some more digging around.

Does anybody have any SNA benchmarks? I like the KXEN implementation for its simplicity and speed so I might get a copy and try it out. Any R performance experts who could make suggestions in the comments? How big are your networks?

CYBAEA

CYBAEA are value and growth architects for the data economy. We are passionate about value creation and delivering commercial results. We help organizations identify and act upon opportunities in Customer Value Management (CVM), Customer Experience (CX) and Advocacy, and Innovation and Growth.