Clustering in Weighted Networks

A fundamental measure that has long received attention in both theoretical and empirical research is the clustering coefficient. This measure assesses the degree to which nodes tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties (Feld, 1981; Friedkin, 1984; Heider, 1946; Holland and Leinhardt, 1970; Louch, 2000; Snijders, 2001; Snijders et al., 2006; Watts and Strogatz, 1998). In real-world networks, the likelihood that two nodes sharing a common neighbour are themselves tied tends to be greater than the average probability of a tie randomly established between two nodes (Holland and Leinhardt, 1971; Wasserman and Faust, 1994; Watts and Strogatz, 1998).

Traditionally, two versions of the clustering coefficient have been used for testing the tendency of nodes to cluster together into tightly knit groups: the global clustering coefficient (Doreian, 1969) and the local clustering coefficient (Watts and Strogatz, 1998). The global version was designed to give an overall indication of the clustering in the network, whereas the local version gives an indication of the embeddedness of single nodes.

The Global Clustering Coefficient

The global clustering coefficient is based on triplets of nodes. A triplet is three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centred on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 x triangles) over the total number of triplets (both open and closed). The first attempt to measure it was made by Luce and Perry (1949). This measure gives an indication of the clustering in the whole network (global), and can be applied to both undirected and directed networks (often called transitivity, see Wasserman and Faust, 1994, page 243). However, it cannot be applied to weighted networks.

For this sample network, the binary global clustering coefficient would be 3 over 9, or 0.33. The closed triplets are B→A←C; A→B←C; A→C←B; whereas the open triplets are A→B←D; A→B←E; C→B←D; C→B←E; D→B←E; B→E←F. As can be seen from the sample network, the strongest ties are inside the triangle. This is not captured by the binary coefficient as the weights are not considered.
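The count above can be reproduced with a short sketch. It is written in plain Python rather than R and is not part of tnet; the tie list is read off the triplets listed above, and the variable names are only illustrative.

```python
from itertools import combinations

# Ties of the sample network (undirected; weights are ignored here).
ties = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"), ("E", "F")]

# Build a neighbour set for every node.
neighbours = {}
for i, j in ties:
    neighbours.setdefault(i, set()).add(j)
    neighbours.setdefault(j, set()).add(i)

closed = 0
total = 0
# A triplet is a centre node plus an unordered pair of its neighbours.
for centre, nbrs in neighbours.items():
    for a, b in combinations(sorted(nbrs), 2):
        total += 1
        if b in neighbours[a]:  # the two outer nodes are tied: closed triplet
            closed += 1

binary_global = closed / total
print(closed, total, round(binary_global, 2))  # 3 9 0.33
```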

A generalisation of the global clustering coefficient to weighted networks was proposed by Opsahl and Panzarasa (2009). This generalisation requires a triplet value to be defined, and then divides the total value of closed triplets by the total value of all triplets. The generalised clustering coefficient produces the same result as the binary global clustering coefficient when it is applied to a binary network. This is because all triplets have the same value, irrespective of the method used to calculate triplet values. In addition, the generalised coefficient has the same properties as the binary coefficient, C. Moreover, if weights are randomly assigned in the network, the weighted clustering coefficient is equal to the binary coefficient.

Four methods were suggested for defining a triplet value: the arithmetic mean, geometric mean, maximum, and minimum of the two tie weights that make up the triplet. The arithmetic mean is simply the average of the tie weights. This method does not account for differences between the two tie weights, so a single extreme weight can have a major impact on the triplet value. The geometric mean of the weights attached to the two ties overcomes some of these sensitivity issues, as a triplet made up of a tie with a low value and a tie with a high value will have a lower value than if the arithmetic mean were used. In addition, it is possible to use two extreme methods to deal with differences in tie weights. The maximum method takes the highest value of the two weights, and makes a triplet with a strong tie and a weak tie equal to a triplet with two strong ties. Conversely, the minimum method takes the lowest value of the two weights, and makes triplets with a strong tie and a weak tie equal to triplets with two weak ties. The table below highlights the differences between the methods of defining the triplet value (adapted from Chapter 2 of my thesis).
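To make the four methods concrete, here is a minimal Python sketch (not tnet code). The short labels and the two example triplets, one with equal weights (2, 2) and one with unequal weights (1, 3), are assumptions chosen purely for illustration.

```python
from math import sqrt

# The four triplet-value methods, each taking the two tie weights of a triplet.
TRIPLET_VALUE = {
    "am": lambda w1, w2: (w1 + w2) / 2,  # arithmetic mean
    "gm": lambda w1, w2: sqrt(w1 * w2),  # geometric mean
    "ma": max,                           # maximum
    "mi": min,                           # minimum
}

# Hypothetical triplets: balanced weights (2, 2) versus unbalanced weights (1, 3).
for name, method in TRIPLET_VALUE.items():
    print(name, round(method(2, 2), 2), round(method(1, 3), 2))
```

All four methods agree on the balanced triplet. On the unbalanced one, the arithmetic mean still returns 2, the geometric mean drops below 2, and the maximum and minimum return the stronger and weaker weight respectively.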

It is vital to use an appropriate method for defining the value of a triplet, as this affects the outcome of the coefficient. The method should be chosen based on the research question as well as the way in which the strength of the ties is operationalised into weights. For example, in a network where the weights correspond to the level of flow, and a weak tie would act as a bottleneck, the minimum method might be the most appropriate to use.

To highlight a potential use of the binary and generalised global clustering coefficients, it is possible to see whether stronger triplets are more likely to be closed than weaker ones. This can be done by comparing the weighted version (C_ω) to the binary version calculated on a binary network where all ties with a weight greater than 0 are set to present (C). In so doing, the two coefficients are calculated on the same network structure. The only difference is that the former takes the weights of ties into consideration. If the ratio between the weighted clustering coefficient and the binary coefficient is higher than 1, it can be argued that triplets made up of strong ties are more likely to be closed than triplets made up of weak ties (C_ω / C > 1). Conversely, it can be argued that triplets made up of weak ties are more likely to be closed than those made up of strong ties if the ratio is less than 1.
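The comparison can be sketched for the sample network as follows. This is plain Python, not tnet. The tie weights (A-B = 4, A-C = 2, B-C = 4, B-D = 1, B-E = 2, E-F = 1) are an assumption inferred from the triplet values quoted in the discussion further down this page, and `global_clustering` is a hypothetical helper name.

```python
from itertools import combinations

# Sample network as undirected weighted ties (weights inferred, see above).
ties = {("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 4,
        ("B", "D"): 1, ("B", "E"): 2, ("E", "F"): 1}

# weight[i][j] holds the weight of the tie between i and j, in both directions.
weight = {}
for (i, j), w in ties.items():
    weight.setdefault(i, {})[j] = w
    weight.setdefault(j, {})[i] = w


def global_clustering(value):
    """Total value of closed triplets divided by the total value of all triplets."""
    closed = total = 0.0
    for centre, nbrs in weight.items():
        for a, b in combinations(sorted(nbrs), 2):
            v = value(nbrs[a], nbrs[b])
            total += v
            if b in weight[a]:  # outer nodes are tied: closed triplet
                closed += v
    return closed / total


binary = global_clustering(lambda w1, w2: 1)                # every triplet counts as 1
weighted = global_clustering(lambda w1, w2: (w1 + w2) / 2)  # arithmetic mean
ratio = weighted / binary
print(round(binary, 4), round(weighted, 4), round(ratio, 4))
```

With these weights the ratio is above 1, in line with the observation that the strongest ties of the sample network sit inside the triangle.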

To empirically investigate whether strong ties are more likely to be part of closed triplets than weak ties, I have used the tnet networks. For simplicity, the arithmetic mean method is used for defining triplet values.

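| # | Network | Nodes | Ties | Binary coefficient (C) | Weighted coefficient (C_ω) | Ratio (C_ω / C) |
|---|---------|-------|------|------------------------|----------------------------|-----------------|
| 1 | Freeman’s EIES network (time 1) | 48 | 695 | 0.7627 | 0.7702 | 1.0099 |
| 2 | Freeman’s EIES network (time 2) | 48 | 830 | 0.8131 | 0.8214 | 1.0102 |
| 3 | Freeman’s EIES network (messages) | 32 | 460 | 0.6386 | 0.7378 | 1.1555 |
| 4 | Consulting (advice) | 46 | 879 | 0.6932 | 0.7130 | 1.0285 |
| 5 | Consulting (value) | 46 | 858 | 0.6764 | 0.6852 | 1.0131 |
| 6 | Research team (advice) | 77 | 2228 | 0.6848 | 0.7127 | 1.0408 |
| 7 | Research team (awareness) | 77 | 2326 | 0.6785 | 0.6957 | 1.0253 |
| 8 | C.elegans’ neural network | 306 | 2345 | 0.1818 | 0.2364 | 1.3009 |
| 9 | US airport network | 500 | 2980 | 0.3514 | 0.4765 | 1.3562 |
| 10 | Newman’s scientific collaboration network | 16730 | 47594 | 0.3596 | 0.3178 | 0.8838 |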

As the above table shows, 9 of the 10 networks have a weighted clustering coefficient that is higher than the traditional clustering coefficient. This suggests that triangles are, as suggested by Simmel (1923[1950]), generally made up of strong ties. The tenth network, which shows the opposite result, is different from the other nine networks. It is a one-mode projection of a two-mode network, whereas the first nine networks are native one-mode networks. This feature could explain the observed result, as the one-mode projection is constructed in a way that penalises the weight of ties within triangles or, more specifically, large fully connected cliques (Newman, 2001). See the page on projecting two-mode networks.

Local Clustering Coefficient

The local clustering coefficient is based on ego network density or local density (Scott, 2000; Uzzi and Spiro, 2005; Watts and Strogatz, 1998). For a node, this is the number of ties present between the node’s neighbours divided by the total number of possible ties between them. Therefore, the outcome ranges between 0 and 1: 0 if no ties exist between the neighbours, and 1 if all possible ties exist.

For the sample network, nodes A and C would get a score of 1 since all possible ties among their neighbours are present, whereas node E would get a score of 0 since none of the possible ties among its neighbours are present. Nodes D and F would not get a value since the number of possible ties among their neighbours is 0 (if a node has fewer than two neighbours, the coefficient is undefined). For node B, one out of six possible ties is present, so the coefficient is 1/6 or 0.1667.
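These node-level scores can be checked with a small sketch in plain Python (not tnet code). The helper name `local_clustering` is hypothetical, and undefined values are represented as NaN.

```python
from itertools import combinations

# Ties of the sample network (undirected; weights are ignored here).
ties = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"), ("E", "F")]

neighbours = {}
for i, j in ties:
    neighbours.setdefault(i, set()).add(j)
    neighbours.setdefault(j, set()).add(i)


def local_clustering(node):
    """Ties present among a node's neighbours over the ties possible among them."""
    nbrs = neighbours[node]
    possible = len(nbrs) * (len(nbrs) - 1) // 2
    if possible == 0:
        return float("nan")  # fewer than two neighbours: undefined
    present = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in neighbours[a])
    return present / possible


local = {node: local_clustering(node) for node in sorted(neighbours)}
for node, score in local.items():
    print(node, round(score, 4))
```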

The main advantage of this version of the clustering coefficient is that a score is assigned to each node (local measure). This enables researchers to study associations between the coefficient and other nodal properties (e.g. Panzarasa et al., 2009) and perform regression analyses with the observations being the nodes of a network (e.g. Uzzi and Lancaster, 2004). However, this version of the clustering coefficient suffers from three major limitations (see Opsahl and Panzarasa, 2009, for a longer discussion). First, its outcome does not take into consideration the weight of the ties in the network. Second, the local clustering coefficient cannot be calculated on directed networks. Third, a negative correlation with degree is often found in real-world networks. This is due to the fact that it is “easier” for a node with two neighbours to get a score of 1 (only one tie is needed) than for a node with 10 neighbours (45 ties must be present).

Recently, there have been a number of attempts to extend the local clustering coefficient to the case of weighted networks (Barrat et al., 2004; Lopez-Fernandez et al., 2004; Onnela et al., 2005; Zhang and Horvath, 2005). I will focus on the measure proposed by Barrat et al. (2004). They proposed a generalisation of the local clustering coefficient to weighted networks by taking the weights explicitly into account. First, they assigned a triplet value to each triplet in the network based on the arithmetic mean (normal average). Then, for each node, they summed the value of the closed triplets that were centred on the node and divided it by the total value of all triplets centred on the node. Similarly to the weighted global clustering coefficient, triplet values do not have to be the arithmetic mean, but could also be the geometric mean, maximum, or minimum of the two tie weights.

For the above sample network, all the nodes, except node B, would get the same outcome as in the binary analysis. Nodes A and C get a value of 1 because all their neighbours are connected, whereas node E gets a value of 0 because none of its neighbours are connected. Nodes D and F do not have a value as they have fewer than two neighbours. Node B does not get the same value as it has more than two neighbours, and is the centre of both open and closed triplets. Its weighted local clustering coefficient is between 0.18 and 0.36, depending on the method, while the binary one is 0.17. The increased weighted coefficient is a reflection of node B’s strong ties being directed towards neighbours that are themselves connected. The table below highlights the local clustering coefficients.

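Local clustering coefficient by node: weighted versions using each triplet-value method, and the binary version.

| Node | Arithmetic mean | Geometric mean | Maximum method | Minimum method | Binary version |
|------|-----------------|----------------|----------------|----------------|----------------|
| 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 0.24 | 0.27 | 0.18 | 0.36 | 0.17 |
| 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 4 | NaN | NaN | NaN | NaN | NaN |
| 5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 6 | NaN | NaN | NaN | NaN | NaN |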
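The values in this table can be approximated with the following sketch in plain Python (not tnet code). The tie weights are an assumption inferred from the triplet values quoted in the discussion further down this page, nodes are labelled A to F rather than 1 to 6, and `weighted_local_clustering` is a hypothetical helper.

```python
from itertools import combinations
from math import sqrt

# Sample network as undirected weighted ties (weights inferred, see above).
ties = {("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 4,
        ("B", "D"): 1, ("B", "E"): 2, ("E", "F"): 1}

weight = {}
for (i, j), w in ties.items():
    weight.setdefault(i, {})[j] = w
    weight.setdefault(j, {})[i] = w

methods = {
    "am": lambda w1, w2: (w1 + w2) / 2,  # arithmetic mean
    "gm": lambda w1, w2: sqrt(w1 * w2),  # geometric mean
    "ma": max,                           # maximum
    "mi": min,                           # minimum
    "bi": lambda w1, w2: 1,              # binary: every triplet counts as 1
}


def weighted_local_clustering(node, value):
    """Value of closed triplets centred on node over the value of all such triplets."""
    nbrs = weight[node]
    closed = total = 0.0
    for a, b in combinations(sorted(nbrs), 2):
        v = value(nbrs[a], nbrs[b])
        total += v
        if b in weight[a]:
            closed += v
    return closed / total if total else float("nan")


for node in sorted(weight):
    print(node, [round(weighted_local_clustering(node, m), 2) for m in methods.values()])
```

Only node B differs across methods, because it is the only node at the centre of both open and closed triplets.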

Want to test it with your data?

The clustering coefficients can be calculated using tnet. First, you need to download and install tnet in R. Then, you need to create an edgelist of your network (see the data structures in tnet for weighted one-mode networks). The commands below show how the edgelist for the sample network here can manually be entered, and how to calculate the clustering coefficients.

Depending on the size of your network, these R functions might be a bit slow. I have also programmed these functions in c++, which makes them much faster. Send me an email if you would like the source/compiled code of the c++ functions.


Hi Tore,
I have a question about tnet clustering coefficient function.
My data is weighted one-mode network (edge list).
I tried to get clustering coefficient score for this network, but I kept getting the following error messages:

Error in clustering_local_w(data) : Network is not undirected!
Measure is not defined from directed networks.
In addition: Warning message:
In as.tnet(net, type = “weighted one-mode tnet”) :
There were self-loops in the edgelist, these were removed
Can you point out what I am doing wrong?
Thank you!

I apply the function to small weighted networks. I normally keep the network data in a matrix format. I have converted the matrixes to edgelists before I applied the global clustering coefficients function but I noticed that the results are the same as when I apply the function directly to the data in the matrix format. Is there any special reason why I should keep converting the data?

All tnet functions can be applied to edgelists and matrices. Specifically, the as.tnet-function is run first by the functions if it hasn’t already been run manually. This function makes a number of assumptions based on the properties of the supplied data object:

if it has 3 columns, it is assumed to be “weighted one-mode tnet”

if it has 2 columns, it is assumed to be “binary two-mode tnet”

if it has 4 columns, it is assumed to be “longitudinal tnet”

if it has more than 4 columns and the same number of rows as columns, it is transformed into an edgelist and assumed to be “weighted one-mode tnet”

if it has more than 4 columns, not the same number of rows as columns, and consists of only 0s and 1s, it is transformed into an edgelist and assumed to be “binary two-mode tnet”

if it has more than 4 columns, not the same number of rows as columns, and consists of not only 0s and 1s, it is transformed into an edgelist and assumed to be “weighted two-mode tnet”
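The rules listed above can be restated as a small decision function. This is only a Python paraphrase of the list, not the as.tnet source; `guess_tnet_type` and its parameters are hypothetical names.

```python
def guess_tnet_type(n_rows, n_cols, only_zeros_and_ones=False):
    """Return the network type implied by the rules listed above (illustrative only)."""
    if n_cols == 3:
        return "weighted one-mode tnet"
    if n_cols == 2:
        return "binary two-mode tnet"
    if n_cols == 4:
        return "longitudinal tnet"
    if n_cols > 4:
        if n_rows == n_cols:
            return "weighted one-mode tnet"  # square matrix, converted to an edgelist
        if only_zeros_and_ones:
            return "binary two-mode tnet"
        return "weighted two-mode tnet"
    raise ValueError("no rule listed for this shape")


print(guess_tnet_type(12, 3))   # three-column edgelist
print(guess_tnet_type(20, 20))  # square matrix
```

Note that the column count is checked before anything else, which is one reason for setting the type explicitly when the data could be read in more than one way.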

As good practice, I would always recommend running as.tnet and specifying the type-parameter manually after loading your network. For example:

Thank you for your quick reply and a very useful answer! Since I work with square matrices tnet transforms them to weighted one-mode tnet. I will check the type-parameter to be sure.

Since I work with small networks I want to keep track of different nodes. The fact that nodes are given a number when the matrix is converted to an edgelist makes this more difficult. Is there an easy way to get node ID’s from a matrix to the edgelist?

I am not entirely sure what you mean by node ids as this information is not contained in the matrix of ties except for row/column numbers. The as.tnet function takes the row/column number as the id by using which applied to a matrix with arr.ind equal to TRUE (line 40 in the as.tnet-function: net <- cbind(which(net > 0, arr.ind = TRUE), net[net > 0]) ).

If you would like consistent id numbers, I suggest explicitly creating edgelists with the numbers.

Hi Tore,
I have a question about the output of clustering_w. The result is supposed to be in a range between 0 and 1, right? I’m applying this to an undirected network represented by a symmetric matrix of 44 vertices and weighted edges. All the values are non-negative.
Using the command:

There are multiple ways of defining the overall level of clustering in a network. The global clustering coefficient, which is defined as 3x triangles over triplets, has been around for a long time (all the way back to Luce and Perry, 1949). In their 1998 paper, Watts and Strogatz introduced the local clustering coefficient, which is a node-level metric (studying the number of ties among neighbors was not a new idea, see Granovetter, 1973, Coleman, 1988, and Burt, 1992). To get an overall coefficient, Watts and Strogatz took a simple average of the node-level scores. By using a simple average, you get a different answer than the global clustering coefficient if the nodes do not all have the same number of ties. To get the same answer, you can use a weighted average where each node is weighted by the number of triplets they are the center of, n*(n-1) (see code below).
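As an illustration of this point, the sketch below compares the simple and the triplet-weighted average of the binary local coefficients on the sample network from this page. It is plain Python, not the original R code, and the variable names are illustrative.

```python
from itertools import combinations

# Ties of the sample network (undirected; weights are ignored here).
ties = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("B", "E"), ("E", "F")]

neighbours = {}
for i, j in ties:
    neighbours.setdefault(i, set()).add(j)
    neighbours.setdefault(j, set()).add(i)

local = {}     # binary local clustering coefficient per node
triplets = {}  # node weight n*(n-1), proportional to the triplets centred on it
for node, nbrs in neighbours.items():
    n = len(nbrs)
    if n < 2:
        continue  # coefficient undefined, node contributes no triplets
    present = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in neighbours[a])
    local[node] = present / (n * (n - 1) / 2)
    triplets[node] = n * (n - 1)

simple_average = sum(local.values()) / len(local)
weighted_average = sum(local[v] * triplets[v] for v in local) / sum(triplets.values())
print(round(simple_average, 4), round(weighted_average, 4))
```

The weighted average equals the global coefficient of 3/9 for this network, whereas the simple average is noticeably higher because the low score of node B counts only once.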

I have a geographic distance matrix (distance of one household to another). I have taken the inverse distance, so households located closer to each other have stronger weights. All the households have the same number of connections. I am interested in calculating network level measures, instead of individual nodes. I have 100 20×20 (dimension) matrices. I have tried calculating the global clustering coefficient for each matrix after I have transformed each matrix to an edge list. However, the global clustering coefficient I get for all the matrices are 1, which is not what I would expect because of the different weights for each matrix.

Also, would you be able to recommend other network level measures? I know you have discussed some of the issues of calculating somehow aggregate measures, but I was just wondering maybe it may not be a big issue if all households are connected to the same number of households anyway and that the network structure is fixed (household location)? Thanks in advance.

Hi Tore,
Ohh, that was what I was guessing. Thanks for confirming.
About the weighted rich-club suggestion, I did see the R commands you posted. I also read through your paper and I think it is a good measure for what I am looking for. I followed what you have posted (which may be a wrong application for me), but I think I may be doing something wrong and hence not getting the weighted rich-club coefficient. Here is the command I used:

where edge_b2 is an edgelist for one of my neighborhoods. Is that how I do it? Cause I'm not getting the answer. Sorry if this seems like an easy question but I couldn't figure it out as of now. Thanks so much!

I’m a master’s student at Lisbon University, Faculty of Human Kinetics, and the master’s is in High Performance Training. I’m using clustering coefficients and graph densities. I would like you to send me the PDF documents for the compilation of my thesis.

Thank you so much for making tnet available. I’m an assistant professor of political science, and working on a project on the relationship between international trade and democracy. Since bilateral trade has different impacts on the two parties involved (depending on their size of economy), my data is directed and weighted. Is there any way to calculate local clustering coefficient in this case?

The local clustering coefficient is only defined for undirected networks. I would look into the structural holes theory by Ron Burt. Specifically, the constraint metric might be the appropriate one for you.

The denominator in the binary local clustering coefficient is the number of triplets centered on the focal node (B). These triplets are A-B-C, A-B-D, A-B-E, C-B-D, C-B-E, and D-B-E. In undirected networks, it can also be calculated as N*(N-1)/2, where N is the number of neighbors.

Thank you so much for the comments. I was also trying to use Burt’s c as a proxy of local clustering of trade. Yet, there are still two issues. First, in Burt’s conceptualization, the network seems undirected as well. So it might be problematic to use it in the case of bilateral trade. Second, the trade network generally is a complete network. So there are few, if any, structural holes in the network.

Recently, I’ve read an article on the subject. How do you think about their approach? Thank you so much!

Burt’s constraint metric is created for directed networks (see Burt, 1992, chapter 2); however, this metric or any local clustering coefficient becomes irrelevant if you have a complete network. You may want to set a cut-off and remove ties below this.

Hi Tore! Thanks for your website. It’s been useful for me because I’m working on city networks with these tools. I have a question about the maximum and minimum methods for the global clustering coefficient. I think that, in tnet in R, the maximum corresponds to the minimum method and vice versa, because I computed the example presented on this page manually and found that, with the maximum method, the global clustering coefficient is 0.375, while with the minimum method it is 0.5. Regards, Juan

The maximum and minimum methods for the global clustering coefficient do not refer to the maximum or minimum value of the coefficient. They are methods for calculating the triplet value. See the table with the two black-and-white triplets. It is important to recognize that the triplet method affects both the numerator and the denominator of the coefficient. The minimum method produces a lower numerator and denominator than the maximum one (except if all tie weights are equal, as in a binary network).

You might also think about the different values in terms of the value assigned to heterogeneous triplets (i.e., the two tie weights are different). The greater the difference between the tie weights, the greater the impact of changing the method. I would recommend that you hack the code for the clustering coefficient to return the numerator and denominator separately. The code below will print these values.
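A standalone way to inspect the numerator and denominator is sketched below in plain Python. It is not the modified tnet function referred to above; the tie weights of the sample network are inferred from the triplet values quoted elsewhere in this discussion, and `numerator_denominator` is a hypothetical helper.

```python
from itertools import combinations
from math import sqrt

# Sample network as undirected weighted ties (weights inferred, see above).
ties = {("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 4,
        ("B", "D"): 1, ("B", "E"): 2, ("E", "F"): 1}

weight = {}
for (i, j), w in ties.items():
    weight.setdefault(i, {})[j] = w
    weight.setdefault(j, {})[i] = w

methods = {
    "am": lambda w1, w2: (w1 + w2) / 2,  # arithmetic mean
    "gm": lambda w1, w2: sqrt(w1 * w2),  # geometric mean
    "ma": max,                           # maximum
    "mi": min,                           # minimum
}


def numerator_denominator(value):
    """Return (total value of closed triplets, total value of all triplets)."""
    closed = total = 0.0
    for centre, nbrs in weight.items():
        for a, b in combinations(sorted(nbrs), 2):
            v = value(nbrs[a], nbrs[b])
            total += v
            if b in weight[a]:
                closed += v
    return closed, total


for name, method in methods.items():
    num, den = numerator_denominator(method)
    print(name, round(num, 2), round(den, 2), round(num / den, 4))
```

For the maximum method this gives 12 over 32 (0.375) and for the minimum method 8 over 16 (0.5): the minimum method lowers both parts of the ratio, but the denominator falls proportionally more because the open triplets contain the weakest ties.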

Your website is very useful to learn the characteristics of weighted networks. Thanks a lot for your effort.

I do not quite understand how the weighted global clustering coefficient is calculated, specifically, how the weights are taken into account. In the original paper, you calculate the C_ω,gm of two sampled networks (Figure 1 (a) and (b)), the results are 0.44 and 0.23 respectively. Can you please explain how the results are calculated?

I have a network (denoted as NET) consisting of a number of fully connected clusters, i.e., all nodes are connected to one another in the clusters. The weights are normalized, so the values of weights are between 0 and 1. I obtain clustering_w(NET) = 1. I’m wondering how to interpret the result. This would be obvious if NET was a binary network. Since a weight will be added when calculating the weighted clustering coefficient, I thought the result should not be 1 any more. Can you please guide me here?

The clustering coefficient will always be 1 for fully connected networks. This is because all triplets are closed, and thus the triplet values (however calculated) of all triplets are part of the numerator. As the denominator is the sum of all triplet values, the numerator is the same as the denominator, hence a ratio of 1.
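A quick check of this argument, sketched in plain Python (not tnet code): the four-node network and its weights between 0 and 1 are hypothetical, chosen only so that every pair of nodes is tied.

```python
from itertools import combinations
from math import sqrt

# Hypothetical fully connected network: all six pairs of four nodes are tied.
nodes = ["a", "b", "c", "d"]
ties = dict(zip(combinations(nodes, 2), [0.1, 0.9, 0.5, 0.3, 1.0, 0.7]))

weight = {}
for (i, j), w in ties.items():
    weight.setdefault(i, {})[j] = w
    weight.setdefault(j, {})[i] = w

methods = {
    "am": lambda w1, w2: (w1 + w2) / 2,  # arithmetic mean
    "gm": lambda w1, w2: sqrt(w1 * w2),  # geometric mean
    "ma": max,                           # maximum
    "mi": min,                           # minimum
}

results = {}
for name, value in methods.items():
    closed = total = 0.0
    for centre, nbrs in weight.items():
        for a, b in combinations(sorted(nbrs), 2):
            v = value(nbrs[a], nbrs[b])
            total += v
            if b in weight[a]:  # always true here: every triplet is closed
                closed += v
    results[name] = closed / total

print(results)
```

Every method returns 1 because each triplet value enters both the numerator and the denominator.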

I tried to calculate the clustering coefficients of the two sampled networks (Figure 1 (a) and (b)), but got slightly different results. I am wondering if I do not understand the algorithm correctly. The following are my calculations.

I have two questions about the reply given above to Hang. First, how is the value of the numerator for the geometric mean 9.66?
Second, can you give one example for the minimum and maximum methods as you have given for AM and GM? I’ll always remain thankful. Regards,

There are three closed triplets in the above example among nodes A, B, and C. These triplets have the following values on their ties {4,2}, {2,4}, and {4,4}. The geometric mean method for computing triplet values takes the square root of the product of the tie weights (e.g., sqrt(4*2)=2.83). As such, the closed triplets in the above example have the following triplet values using the GM method: 2.83, 2.83, and 4. The sum of these values is 9.66, which is the numerator.

The minimum method simply takes the minimum tie weight. In the above example, this would be: 2, 2, 4, 1, 1, 2, 2, 1, and 1. The sum of the closed ones (the first three) is 8, and the sum of all of them is 16. Hence, the clustering coefficient is 8/16 or 0.5 when using the minimum method. The maximum method is identical, but takes the maximum of the two tie weights instead.

I am currently conducting research on a database with undirected and weighted networks, and would greatly appreciate it if I can use your generalized clustering coefficient. As mentioned in the publication, could I possibly receive a copy of the Matlab code for doing so?

Also, is there no way to calculate the clustering coefficient for a complete graph besides setting a threshold for the edge weights and dichotomizing the network?

This post is very useful. Thanks very much! I was wondering if there was a possibility to calculate a global clustering coefficient based on a weighted network with your package tnet, and considering the sum of the weighted edges (of triplets) instead of the arithmetic mean, geometric mean, maximum and minimum value.

Glad you’re finding it useful. If you calculate the clustering coefficient with the arithmetic mean and with the sum of triplet weights, you will get the same outcome due to it being a ratio. The difference is that both the numerator and denominator will be doubled in the latter case.
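This can be verified on the sample network from this page with a short sketch in plain Python (not tnet code). The tie weights are inferred from the triplet values quoted earlier in this discussion, and `numerator_denominator` is a hypothetical helper.

```python
from itertools import combinations

# Sample network as undirected weighted ties (weights inferred, see above).
ties = {("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 4,
        ("B", "D"): 1, ("B", "E"): 2, ("E", "F"): 1}

weight = {}
for (i, j), w in ties.items():
    weight.setdefault(i, {})[j] = w
    weight.setdefault(j, {})[i] = w


def numerator_denominator(value):
    """Return (total value of closed triplets, total value of all triplets)."""
    closed = total = 0.0
    for centre, nbrs in weight.items():
        for a, b in combinations(sorted(nbrs), 2):
            v = value(nbrs[a], nbrs[b])
            total += v
            if b in weight[a]:
                closed += v
    return closed, total


am_num, am_den = numerator_denominator(lambda w1, w2: (w1 + w2) / 2)  # arithmetic mean
sum_num, sum_den = numerator_denominator(lambda w1, w2: w1 + w2)      # sum of weights
print(am_num, am_den, round(am_num / am_den, 4))
print(sum_num, sum_den, round(sum_num / sum_den, 4))
```

The sum-based numerator and denominator are exactly twice the arithmetic-mean ones, so the two coefficients coincide.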