zondag 22 mei 2016

Data visualization : Network analysis with Graphviz and NodeXL

Introduction

We are witnessing the growing importance of data visualization in Business Intelligence and -Business Analytics. Data visualization is one of the important aspects of Analytics process. When doing Analytics you need to investigate the data for certain patterns and when you get a basic idea of the data and the scope of the job you formulate certain hypothesizes and you try to prove them or falsify them.

Now, I'm receiving education about data visualization and one the assignments is visualizing network related information. So I picked a data set and visualized that in a set of network diagrams. Technically that is easy to do but getting insights from this network diagram is a bit more difficult. This blogpost is about different kind of aspects of data visualization. I'll describe the process of designing and tweaking a data visualization of a dolphins social network data set.

Dolphins

For the assignment and for this blogpost I decided to use a dataset from the UCI Network Data Repository, specifically a dolphins dataset. This is a dataset of an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. You can find more information about this dataset here and at the Cornell University Library.

Dataset

The download of the dataset from the University of California provides a couple of files. The one I used for this assignment is the .gml file and the .png which is shown below in this paragraph. The nodes in the network are the names of the dolphins while the links show relationships between the nodes.

To understand networks and their participants, we evaluate the location of actors in the network. The location of the dolphins is an important element in the diagram and we can ask ourselves the following questions: Who are the leaders? Who are the bridges (perhaps not always the leader)? Who are the isolated ones? Where are the clusters? Who is in the middle and who is on the sides of the diagram? As said, the first diagram I found was this one ? But can we create a more insightful diagram?

Although this diagram provides some information about the social interactions, it's a bit difficult to see the most 'popular' dolphins and the outliers. Is it possible to analyze the data and provide more insights in the data by adding distinctness between the nodes and the vertices?

This is the paraphrase of the research I've found on the webpage of Cornell University Library :

"Social animals have to take into consideration the behaviour of conspecifics when making decisions to go by their daily lives. These decisions affect their fitness and there is therefore an evolutionary pressure to try making the right choices. In many instances individuals will make their own choices and the behavior of the group will be a democratic integration of all decisions. However, in some instances it can be advantageous to follow the choice of a few individuals in the group if they have more information regarding the situation that has arisen. Here I provide early evidence that decisions about shifts in activity states in a population of bottlenose dolphin follow such a decision making process. This unshared consensus is mediated by a non-vocal signal which can be communicated globally within the dolphin school. These signals are emitted by individuals that tend to have more information about the behaviour of potential competitors because of their position in the social network. I hypothesise that this decision making process emerged from the social structure of the population and the need to maintain mixed-sex schools."

Some key concepts

There are a couple of key concepts important: degrees, betweenness and closeness in social network analysis. I'll explain them in more detail here.

Degree
Social network researchers measure network activity for a node by using the concept of degrees. This is the number of ingoing and outgoing links/vertices. This is just counting the numbers of relations. In case of the dolphins, the dolphins with the highest degrees are the leaders, as you would say when using common wisdom. But you can discuss that by asking, how is the a high degree dolphin connected to others? Is the specific high degree dolphin connected to other high degree dolphins or does she/he connect the loners too? Is the dolphin in a high degree cluster or does she/he involve others too. And can we visualize that in a complex diagram?

Betweenness
Another aspect is the betweenness of nodes (or dolphins if you like). Although a certain node can have a high degree of connections, the location of the node in the network could be isolated (within a cluster). A node can be in between of certain clusters and then the node is an important role as a connector between clusters. So this node is important but it is also a weak point in the network because if the node breaks down, the clusters are not connected anymore. Without the in between node, signals from one cluster to the other will be cut off. So, this node is also very important for the (social) cohesion of the network.

Closeness
Closeness is also an important metric of node. Closeness tells you something about the ability to access the all the nodes in the network. These nodes can access the other nodes more quickly than others can do

Eigenvector (variant of closeness)
Eigenvector (also called eigencentrality) is a measure of the influence of a node in a network. Eigen vector assigns a certain weight to nodes in the network and nodes that have a high score are more important and are more important to other nodes that are connected to that node. Examples of this metric are PageRank of Google and the Katz centrality.

Network analysis with Graphviz

For this blogpost I've used a couple of tools and the most important ones are Graphviz and NodeXL (add in for Excel). Off course the first problem is that the format for graphviz is not the format of the file that comes along with the dolphins data set. So the first step to execute is transforming the file into a Graphviz structure (it is called DOT language).

It is quite some manual work to convert the gml file into a DOT language structure and I think it's doable until 100 nodes and depending on the number of vertices. The next step was showing the graph with Graphviz and that was a bit of disappointment when I saw this.

Later, I read there are a couple of programs that transforms the DOT file into a specific network diagram depending on the different algorithms.

The next step was calculating the numbers of connections to the nodes and add colors to the nodes with a colorscheme. I counted the number of connections in the source file in excel

And defined the color in excel and converted it into a DOT color scheme:

Well, that is still a lot of jumble of vertices and nodes. Still not very insightful, is it? I found out that there are all kind of algorithms that transforms the DOT file into all kind of diagrams. the first one I tried was "Neato". As it seems DOT draws hierarchical layouts of directed graphs while Neato creates "spring model" layouts. Neato seems better suited for undirected graphs. Still, neato and DOT are compatible meaning that they can use the same input file and command line options.

Now as you may see, there is some overlap of the nodes and the vertices. That is not very helpful for clear insights of the analysis. What you can see in the diagram is that the centrality becomes more clearer now. The red nodes are centered in the middle and the darkblue nodes are positioned at the edges of the diagram.

Now let's try to get a clearer picture of the network. Thus without overlapping. I've added overlap = False to the .gv file.

graph {
overlap = false;
...
...

The next step is to execute the following command again and see what the result is.

This is the output of the Neato command. The blue nodes with the least number of vertices are on the edges of the diagram and red/orange nodes are more centered in the middle of the diagram and that is what we want.

Looking at the diagram we can see some patterns. There seems a highly connected cluster (top) and a lesser connected cluster on the bottom of the diagram. There are some nodes that seems to function as bridges between the clusters.

Let's see if we can tweak this diagram a bit further and see what is possible and create a better visual experience. Let's try the setting "splines = true" and add this to the .gv DOT file. This way the lines are not overlap the nodes

Visually, we could say that there seems two clusters, one high activity and one less activity cluster and there seems to be bridges between them.

twopi

Another program I tried was twopi (two pi?) and it draws Radial layouts. Nodes are placed on concentric circles depending their distance from a given root node. Basically, one node is chosen as the center and put at the origin. The remaining nodes are placed on a sequence of concentric circles centered about the origin, each a fixed radial distance from the previous circle. All nodes distance 1 from the center are placed on the first circle; all nodes distance 1 from a node on the first circle are placed on the second circle; and so forth.

And this is the resulting diagram. Indeed It seems a circular diagram:

Circo

Circo draws graphs using a circular layout (see Six and Tollis, GD ’99 and ALENEX ’99, and Kaufmann and Wiese, GD ’02.) The tool identifies biconnected components and draws the nodes of the component on a circle. The block-cutpoint tree is then laid out using a recursive radial algorithm. Edge crossings within a circle are minimized by placing as many edges on the circle’s perimeter as possible. In particular, if the component is outerplanar, the component will have a planar layout. If a node belongs to multiple non-trivial biconnected components, the layout puts the node in one of them. By default, this is the first non-trivial component found in the search from the root component.

Gstart

Neato has unnecessary edge crossings, or has missed an obvious chance to make a much nicer layout. Neato and all similar virtual physical model algorithms rely on heuristic solutions of optimization problems. The better the solution, the longer it takes to find. Unfortunately, it is also possible for these heuristics to get stuck in local minima. Also, it is heavily influenced by the initial position of the nodes. It is quite possible that if you run neato again, but with a different random seed value, or more iterations, you'll get a better layout. In particular, note that there are no guarantees that neato will produce a planar layout of a planar graph, or expose all or most of a graph's symmetries.

Gepsilon

Further analysis with NodeXL

After googling a bit I found out that Excel has a very interesting add in, NodeXL. NodeXL is a very nice free open-source template for Microsoft Excel 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs. With NodeXL, you can enter a network edge list in a worksheet, click a button and see your graph, all in the familiar environment of the Excel window.

And what makes interesting it calculates all kind of measures for you, like the degree, Betweenness, closeness, eigenvector, PageRank and the clustering coefficient. So all the manual work I've done does the NodeXL for you. Why didn't I find this tool earlier!

Now, if I interpret this numbers of in betweenness right, the higher the number of in betweenness, the more important the node is in the network and that would be SN89, Web, Beescratch, SN100 and Jet. The closeness nodes SN89, SN100, SN4, Grin and Kringel.

Next steps

Now, I spend only a couple of hours with Network analysis, but in order to get the full potential you have to spend much more time in the this interesting topic. For instance you could remove the low betweenness in relation with the centrality of nodes. The nodes that are highly connected but are not used as bridges. This jumble up the picture and if you remove them you can see the lines of communication much better. I used this technique also with processmining. In processmining you analyze the processes with the data and I thing I remembered was that you can analyze resources. You can investigate the throughput of cases through a process and looking at the resources and how are the doing. Useful for identifying bottlenecks

Conclusion

Visualizations will become more and more important and hopefully the BI/BA tools like PowerBI will add these kind of visualizations to their tooling because it gives more insights. But, the area of network analysis is huge and full of algorithms, mathematical formulas, concepts, etc. In order to understand the numbers and the diagrams you have to know the why and the how of the diagrams and why and how they are drawn in order to draw (right) conclusions.