Abstract

When we represent real-world systems as networks, the directions of links often convey valuable information. Finding module structures that respect link directions is one of the most important tasks for analysing directed networks. Although many notions of a directed module have been proposed, no consensus has been reached. This lack of consensus results partly because there might exist distinct types of modules in a single directed network, whereas most previous studies focused on an independent criterion for modules. To address this issue, we propose a generic notion of the so-called truss structures in directed networks. Our definition of truss is able to extract two distinct types of trusses, named the cycle truss and the flow truss, from a unified framework. By applying the method for finding trusses to empirical networks obtained from a wide range of research fields, we find that most real networks contain both cycle and flow trusses. In addition, the abundance of (and the overlap between) the two types of trusses may be useful to characterize module structures in a wide variety of empirical networks. Our findings shed light on the importance of simultaneously considering different types of modules in directed networks.

1. Introduction

Analysis methods developed in network science provide us with useful tools for investigating and characterizing the kinds of network structures observed in real-world systems [1]. Standard techniques in network science include characterizing global properties of networks, measuring centralities of nodes and links, and classifying nodes into groups [2,3]. Finding relevant subgroups of nodes, often called communities or modules, is a fundamental problem. This problem is referred to as the community detection problem [4,5], which has been studied in different disciplines, including computer science, statistics and statistical physics. Although there is no unifying definition of a community, information regarding communities in networks gives us a guide to summarize large-scale networks [6], to predict the existence of links [7] and to reveal functional organization in real networks [8,9].

Within the community detection problem for real-world networks, the direction of links plays a crucial role. Although the majority of previous notions of a community assume undirected networks as their target, several of them are explicitly designed for directed networks. Examples include the extensions of graph conductance [10–12], a generalization of the modularity function [13] and the map equation [6] (see reference [14] for a comprehensive review on the community detection problem in directed networks). These definitions of communities often successfully detect communities that satisfy their criteria. Nevertheless, it remains an open problem as to how to choose a notion of modules when we are given directed network data. For undirected networks, the local density of links within a subgroup of nodes is arguably a suitable criterion for a module, regardless of the details of the definitions [5]. In contrast, for directed networks, the directionality of links can alter the module structure, even when we observe the same link density in two subgroups of nodes. The choice of algorithms crucially depends on what types of modules we expect to find. In addition, some definitions of communities and their associated algorithms are known to fail to detect certain types of module structure [14,15]; the impact of this drawback is not clear until we analyse the network. Therefore, a generic notion of directed modules is necessary for understanding the nature of module structures in real-world directed networks.

To address this issue, in this paper, we propose the notions of module structures in directed networks. The following observations underlie the core concept of our work. First, there could be different types of modules within a single directed network. For example, one part of a network can be an all-to-all connected module, whereas another part can form a layered structure [14]. Second, it is not necessary to divide the entire network into modules. For example, the World Wide Web network is well known to exhibit the so-called bow-tie structure [16,17]. This fact implies that the different parts of a directed network may not be regarded as modules to an equal degree. Instead of partition of networks into modules, extraction of modules from the network should be considered. Previous studies often ignore these observations: they aim at partitioning the entire network into modules based on a single objective function. Therefore, we propose two distinct types of modules, called the cycle truss and the flow truss, using a unified framework, and an algorithm for finding them. Our definition of trusses relies on pattern matching and local agglomeration of directed triangles (i.e. connected subgraphs composed of three nodes and three links).

We apply the proposed algorithm for finding trusses to a variety of empirical networks to verify its practicality. First, we observe that the extracted trusses seem to capture meaningful subgroups of nodes in networks with node label data. Second, because our method simultaneously detects the two types of modules, we can use them as features for classifying different networks. Empirical networks obtained from the same categories (e.g. social or biological) tend to show a similar degree of abundance of the two types of trusses. In addition, the overlap between the two types of trusses captures another kind of similarity between networks in a given category. Our findings demonstrate the importance of simultaneously considering different types of modules in directed networks to understand the common properties underlying the organization of modules.

2. Methods

2.1 Definitions of cycle and flow trusses

In this paper, we assume that the focal network is directed and simple, i.e. there are no self-loops and no multiple links between any pair of nodes in the same direction. Bidirectional connections between two nodes are possible: links from node i to node j and from j to i may coexist. We also assume that links are unweighted. First, we define cycle and flow triangles as the elements of cycle and flow trusses, respectively. A cycle triangle is a connected subgraph composed of three nodes all of which have out-degree equal to one, namely a directed cycle composed of three nodes (figure 1a). A flow triangle is a connected subgraph composed of three nodes that have out-degrees equal to zero, one and two (figure 1b). A flow triangle is also called a feedforward loop [18,19]. Next, we define the cycle and flow trusses by generalizing the k-truss, originally defined for undirected networks [20]. A cycle (flow) k-truss is defined as a maximal connected subgraph of a network in which every link is involved in at least k cycle (flow) triangles within the subgraph1 (see figure 1c for an example). Free parameter k takes a non-negative integer value in 0≤k≤dmax−1, where dmax denotes the maximum node degree in the network, and k controls the extent to which triangles are overlapped within a truss. It should be noted that a cycle (flow) truss may contain flow (cycle) triangles. There are several relationships between the cycle and flow trusses and other notions of directed subgraphs presented in previous studies, and we will describe this point in the Discussion.

Definitions of cycle and flow trusses. (a) A cycle and (b) flow triangles. (c) The cycle (blue) and flow (red) k=2-trusses in an example network. The vertical arrow coloured purple at the centre represents the link that belongs to both of the cycle and flow 2-trusses.

The definitions of the cycle and flow k-trusses satisfy the requirements described in the Introduction, as we can see in the example shown in figure 1c. To be more precise, these definitions enable us to find two distinct types of modules using the unified framework. In addition, this method extracts modules from a network, instead of partitioning the entire network into modules. The algorithm for finding cycle and flow k-trusses in a given network is a modified version of that for an undirected truss [21]. The details of the algorithm are presented with pseudo-codes in the electronic supplementary material.

The definitions of the trusses lead to their basic properties as follows: first, there can be multiple cycle (flow) k-trusses in a network. Second, cycle and flow (k+a)-trusses are, if they exist, subgraphs of cycle and flow k-trusses, respectively (a=1,2,…). Third, the node sets of two k-trusses of the same type can overlap, but their link sets must be disjoint because of the maximal property in the definition. Fourth, the complete graph with k nodes in which all node pairs are connected for both directions is both a cycle (k−2)-truss and a flow 3(k−2)-truss at the same time (k≥3).

We assign a link from nodes i to j with the truss number ki→j defined by
ki→j≡max{k∣(i→j)∈Ek},2.1
where Ek is the set of links involved in k-trusses. We denote the truss number for cycle and flow trusses by kci→j and kfi→j, respectively. We use the superscripts ‘c’ and ‘f’ to represent the variables related to the cycle and flow trusses throughout this paper. The truss number indicates the extent of agglomeration of triangles around a link. For example, in the network shown in figure 1c, link (i1→j1) has (kc,kf)=(2,0) and link (i2→j2) has (kc,kf)=(1,2). We are also interested in the maximum values of kci→j and kfi→j over all the links. We call these values the maximum truss numbers; they are denoted by kcmax and kfmax for cycle and flow trusses, respectively. For the network shown in figure 1c, kmaxc=kmaxf=2 holds.

3. Results

3.1 Trusses in empirical networks

We apply the proposed method to empirical network data that are assigned with predefined node labels so as to demonstrate that cycle and flow trusses can extract meaningful modules. For this purpose, we use two networks obtained from different fields: the neural network of Caenorhabditis elegans (C. elegans) [22] and the network between words collected via word association experiments, so-called the Edinburgh Associative Thesaurus (EAT) [23].

3.1.1 Neural network of Caenorhabditis elegans

The C. elegans neural network comprises 279 nodes, which correspond to neurons, and 2990 links between the nodes. A chemical synapse between two neurons is represented as a directed link, and an electrical junction as two directed links in both directions between the node pair [22]. Each neuron is assigned a unique name and additional information such as soma positions in the worm’s body and functional categories (i.e. sensory neuron, interneuron or motor neuron), which allows us to interpret the neuronal functions of the extracted trusses.

In figure 2, the resulting k-trusses are depicted. We focus on the cycle kcmax=3- and flow kfmax=9-trusses, which are the most cohesive trusses in the network. In this case, the cycle 3-truss is a single strongly connected component, and the flow 9-trusses is a single weakly connected component (i.e. any pair of nodes in each of the trusses is connected if we discard the link direction). When we map the cycle and flow trusses in the entire network (figure 2a), neither of these trusses is localized in any particular part of the worm body, but instead, spans almost the entire range of the body (from head to tail). We can see that a small number of nodes bridges most of the triangles in the trusses. The cycle 3-truss (figure 2b) consists of four command interneurons relevant for locomotion (AVAL, AVAR, AVBR and PVCR) and three motor neurons in the ventral cord of the worm (VA08, VA09 and VB09). Here, we follow the description of each neuron in reference [24]. These seven neurons in the cycle 3-truss are tightly connected to each other; however, certain node pairs have a link in only one direction. On the other hand, the flow 9-truss (figure 2c) consists of 13 interneurons relevant to locomotion (AVAL, AVAR, AVBL, AVBR, AVDL, AVDR, AVEL, AVER, AVJL, AVJR, PVCL, PVCR and SABVR) and seven motor neurons in the dorsal cord of the worm (DA1, DB03–06 and DVA), except for one in the ventral cord (AS01). Although we need further explanation by biological experts as to why these neurons constitute the cycle and flow trusses, the trusses seem to represent some functional modules of neurons. The cycle and flow trusses overlap with each other and have four command interneurons AVAL, AVAR, AVBR and PVCR in common. This fact arises logically, as the command interneurons related to locomotion should play a central role in mediating the motor neurons in the ventral and dorsal cords [22]. This example of the C. elegans neural network demonstrates the ability of our proposed method to extract different types of cohesive modules from networks. In addition, the overlap between the two types of trusses can shed light on the importance of nodes that bridge different modules.

Cycle and flow trusses in the C. elegans neural network. We set k to kcmax=3 and kfmax=9 for the cycle and flow trusses, respectively. The links are coloured with blue (in the cycle 3-truss), red (in the flow 9-truss), purple (in both) and grey (remainder). (a) The whole picture of the C. elegans neural network. The nodes are ordered according to the soma position in the worm body (head to tail from left to right and from top to bottom). A group of nodes composing a circle have the same position. (b) The cycle 3- and (c) the flow 9-trusses. The node labels indicate the names of neurons.

3.1.2 Word network of the Edinburgh Associative Thesaurus

Our second example is the graph representation of the EAT, which was collected through word association experiments with subjects [23]. A directed link from nodes (i.e. words) i to j represents the associative relationship between the two: for subjects, word j comes to mind when they are shown word i as a stimulus. After the aggregation of the results of word association experiments for many subjects with different stimuli, the EAT network contains 23 219 nodes and 325 029 links between them.

In figure 3, the resulting k-trusses are depicted. We show the cycle kcmax=2- and flow kfmax=8-trusses. Unlike the results of the C. elegans neural network (figure 2), there are multiple disjoint cycle and flow trusses with kcmax=2 and kfmax=8. We can see that each of the trusses consists of words related to a topic, for example, religion, emotion, health and poem. All six of the cycle 2-trusses composed of five nodes are fully connected, in which all node pairs have links in both directions. Each of the cycle 2-trusses related to emotion, health and colour strongly overlap with one of the flow 8-trusses (top centre and bottom right of figure 3). Only the flow 8-truss related to liquor, the largest one (bottom left of figure 3), is less overlapped with the cycle trusses than the other flow 8-trusses are. We can intuitively explain the difference between the cycle and flow trusses in this example network as follows. The words constituting a cycle truss have an equal relationship with each other such that the experimental subjects tend to recall all words based on each word. By contrast, the words constituting a flow truss have a hierarchical relationship such that some words remind the subjects of other words but the converse rarely occurs. If we discard the link direction, then we cannot distinguish the modules of the cycle and flow trusses. Therefore, this example demonstrates that the link direction plays an important role in finding modules.

Cycle and flow trusses in the EAT network. We set k to kcmax=2 and kfmax=8 for the cycle and flow trusses, respectively. The links are coloured with blue (in the cycle 2-trusses), red (in the flow 8-trusses) and purple (in both). The node labels indicate the corresponding words.

3.2 Classification of networks based on truss number distributions

We can use the truss number statistics to classify various networks.2 In the following, we demonstrate the classification of networks based on the truss number statistics in two ways. First, we separately quantify the abundance of cycle and flow trussses and use them as two features. Second, we quantify the overlap between the cycle and flow trusses with large k values.

Intuitively, a network is more cycle (flow) truss oriented if the links tend to have larger cycle (flow) truss numbers. Note that a large truss number implies the agglomeration (and abundance) of triangles. To quantify how much a network is truss oriented, we define a measure D as
D≡1K∑k=0K(Frand(k)−Forig(k)),3.1
where F(k) is the complementary cumulative distribution of truss numbers defined by F(k)≡∑k′=0kf(k′) and f(k′) (in the sum) is the frequency distribution of the truss number. In equation (3.1), the subscripts ‘orig’ and ‘rand’ represent the distributions for the original and randomized networks, respectively. We randomize the original network by rewiring links in a uniformly random manner while retaining the in- and out-degrees of all nodes (i.e. the configuration model for directed networks [25]). The range of the sum over k is determined by K. Here, we choose K≡min{k∣Forig(k)>0.9∧Frand(k)>0.9}. We do not assume K=kmax, because kmax might be sensitive to noise in the network data. The measure D takes a value in [−1,1]; a large positive value of D represents that the links in the original network tend to have a larger truss number than those in the randomized networks. We denote the measure D for the cycle and flow truss numbers by Dc and Df, respectively. In figure 4a,b, we plot the truss number distributions fc(k) and ff(k) of the C. elegans neural network and the randomized networks. The kcmax and kfmax values are larger for the original network than those for the randomized networks. The proportions of links with large k values are also larger for the original than the randomized networks. For this network, we obtain (Dc,Df)=(0.122,0.329). Therefore, the C. elegans network is inclined to have more flow trusses than cycle trusses, which agrees well with our intuition based on figure 4a,b.

Distributions of the truss numbers and the D measure. Distributions of (a) kc and (b) kf for the original network and 100 randomized networks of the C. elegans neural network. (c) Scatter plot of Dc and Df for (main panel) empirical networks of the 12 categories and for (inset) the metabolic networks.

The measure D allows us to compare various networks of different sizes in terms of truss tendency. In figure 4c, we show the scatter plot of Dc and Df for the empirical networks. As we can see, the networks of certain categories such as airport, circuit, citation and food webs loosely fall into the similar positions on the plane. The points are located around the diagonal, because networks with larger numbers of triangles tend to have larger D values. This phenomenon occurs because our randomization procedure does not conserve the number of triangles in the original network; the randomization tends to destroy triangles, and consequently, destroy the truss structure. This reasoning is supported by the observation shown in electronic supplementary material, figure S1. The elements of the first principle component of the plot shown in figure 4c exhibit a strong positive correlation with the clustering coefficient [26] after discarding the link directions. Nevertheless, figure 4c provides the information regarding network topology beyond simply the count of the triangles. For example, there are several distinguishable classes of network (such as citation and circuit networks) in which either cycle or flow trusses are dominant. The neural, airport and web networks contain both types of trusses. The metabolic networks tend to have few cycle or flow trusses when compared to those in the randomized networks. These observations suggest the usefulness of the cycle and flow trusses to characterize different types of directed networks.

The randomization method that we used above generally destroys triangles, and randomization conserving the number of triangles is desirable to distinguish the abundance and agglomeration of triangles. Such a randomization method is proposed in reference [27] which conserves the number of focal motif counts, whereas it is basically a rejection sampling and computationally demanding and feasible only for small networks. We applied this randomization method to the food web networks so as to retain both the number of the cycle triangles and that of the flow triangles, and performed the same set of analyses of the D measure (see electronic supplementary material, figure S2 for the resulting plot). The result indicates that the food web networks are more flow-truss oriented than cycle-truss oriented, which is consistent with the conclusion based on the randomization method without conserving the number of triangles (figure 4).

While we separately considered the properties related to the cycle and flow trusses so far, the overlap between the two types of trusses can be another characteristic of the networks as we observed in the example networks (figures 2 and 3). In particular, we are interested in the overlap between highly cohesive cycle and flow trusses with large k values. To analyse the overlap, we plot the joint frequency distribution of truss numbers (kc,kf) for four example networks, i.e. the C. elegans neural network, the EAT network, the USairport 2010 network [28] and the web-Google network [29] (figure 5). In these plots, a cell at (kc,kf) indicates the proportion of links with these truss numbers. These plots indicate the unique characteristics of the different networks. In the C. elegans neural network (figure 5a), the links with kc=3 have only 7≤kf≤9=kmaxf. This property suggests that the size of the cycle kcmax-truss is smaller than that of the flow kfmax-truss and a large section of the cycle truss overlaps with the flow truss, as we observed in figure 2. By contrast, for the EAT network (figure 5b), the majority of links have relatively small truss numbers, as kc=0 and 0≤kf≤3. Thus, the cohesive cycle and flow trusses are not strongly overlapped. The plots for the USairport 2010 (figure 5c) and web-Google (figure 5d) networks look similar, such that we can see the coloured cells along the diagonal of slope equal to three. This observation suggests that in these networks there are complete subgraphs within which node pairs are connected in both directions. This situation may correspond to airports within local regions and web pages under the same directories.

To quantify the overlap between the cohesive cycle and flow trusses in a network, we define
R≡|{e∈E∣(kec>kmedc)∧(kef>kmedf)}||{e∈E∣(kec>kmedc)∨(kef>kmedf)}|,3.2
where E is the set of all links and kcmed and kfmed are the median values of fc(k) and ff(k) (indicated by the dashed lines in figure 5), respectively. The measure R characterizes the proportion of links with large kcandkf values among those with large kcorkf values. The measure R takes a value in [0,1]; a large R value indicates a strong overlap between the cohesive cycle and flow trusses. For the four networks shown in figure 5, we obtain R=0.531, 0.383, 0.983 and 0.540 for the C. elegans neural network, the EAT network, the USairport 2010 network and the web-Google network, respectively.

In figures 6 and 7, we plot the R values for the empirical networks (i.e. the same set that we used in figure 4, except for the metabolic networks). First, we can see that the networks of some categories have the R values close to the extreme cases, i.e. 0 or 1. For the airport networks, the R values for the three networks are almost equal to unity. This result follows logically, because the reciprocity of links, defined by the double of the number of bidirectionally adjacent node pairs divided by the total number of links, are large: 0.972, 1 and 0.781 for the openflights, USairport500 and USairport_2010 networks, respectively (see electronic supplementary material, table S1). Therefore, any triplet of nodes is likely to constitute a cycle triangle if it composes a flow triangle and vice versa. For the circuit, citation, gene regulatory, P2P and software networks, the R values are close to zero because these types of networks have huge gaps between the number of cycle and flow triangles (electronic supplementary material, tables S1 and S2). The three circuit networks do not contain any flow triangles. For the citation, gene regulatory, P2P, and software networks, the number of cycle triangles is much smaller than that of flow triangles.

Overlap measure R between the cycle and flow trusses for the gene, neural, P2P, software, web and word networks.

The following, neural, web and word networks tend to have R values greater than 0.5, although fluctuations within a category are large. The food webs tend to have R values less than 0.5. The results for the metabolic networks are shown in electronic supplementary material, figure S3; 166 out of 172 networks have R values in [0.4,0.6] (the mean R value ± the standard deviation is equal to 0.491±0.0481). These observations may indicate the usefulness of the R value to characterize the tendency of module organization for different categories of networks.

4. Discussion

In this paper, we proposed the cycle and flow k-trusses in order to extract two distinct types of cohesive modules from directed networks. We also developed an efficient algorithm for computing these trusses and defined the measures used to quantify the module organization in a network based on the truss properties. Applications of our method to a wide variety of empirical networks illustrated that most empirical networks contain either type of trusses or both of them. We investigated the extracted trusses for several networks with the given node labels and found that the trusses seem to capture relevant subgroups of nodes. We also found that the abundance of (and the overlap between) cycle and flow trusses helps us to classify empirical networks obtained from different fields. These findings suggest the importance of exploring different types of modules in directed networks. We believe that our method will be a useful tool for investigating module structure in directed networks.

It is worth noting several relationships between the cycle and flow trusses and other notions of directed subgraphs presented in previous studies. Detecting directed triangles that are significantly over-presented is a key idea of motif analysis [18,19]. The original notion of the motif often focuses on subgraphs with a small number of nodes (e.g. three or four). Previous studies [27,30–32] considered the generalization of motifs by aggregating the motifs sharing links so as to construct functional modules larger than single motifs. In particular, the generalization of the feedforward loop motif (called the flow triangle in this paper) described in references [27,30–32] is an example of the flow 1-truss in our definition. Another related notion is the directed k-clique [33]. A directed k-clique is a subgraph with k nodes and k(k−1)/2 links, in which the k nodes have a linear ordering and all nodes have directed links to all the lower-rank nodes. A directed k-clique module is defined by a union of adjacent directed k-cliques. Two directed k-cliques are said to be adjacent if the two have k−1 nodes in common. A directed k-clique module is a flow (k−2)-truss (k≥3); however, the converse does not always hold true. Therefore, the cohesiveness of the flow k-trusses is between those of the generalization of feedforward loop motif and the directed k-clique modules. To the best of our knowledge, a cycle k-truss does not exactly correspond to any of the previous notions. Based on the definition, a cycle truss is a strongly connected component (i.e. any node in the cycle truss is connected to all the other nodes via directed links).

Recently, a graph-partitioning method based on the so-called motif conductance has been proposed [34]. This method focuses on a given motif (e.g. the cycle or flow triangle) and splits a network into two parts so as to minimize the ratio of the number of the motifs crossing the two parts to the number of the motifs contained by either part (taking the minimum value for the two parts). Repetitive application of the method is expected to result in a partition of the network in which each part contains many of the focal motifs. Although the aim of this method is different from our method, it would be interesting to see the overlap and difference between the results of the two methods. As an example, we applied the motif conductance method3 to the C. elegans neural network (§3.1.1.). The method returned four non-trivial components for the cycle triangle and two for the flow triangles. The node set of the four components for the cycle triangle has no intersection with the node set of the cycle 3-truss (figure 2b). The node set of a component for the flow truss contains the node set of the flow 9-truss (figure 2c). In this sense, the motif conductance method for the flow triangle and the truss method are consistent for this example. The cycle 3-truss is not captured by the motif conductance method, maybe because the method prefers increasing the number of focal motifs inside of the two parts to decreasing the number of motifs crossing the partition. Further analysis on comparison between the two methods is future work.

Our definition of trusses in this paper assumes that the focal network is directed and unweighted. However, the importance of link weight in many systems has been suggested in previous work [35]. This point is a clear limitation of the present method and a suitable generalization for weighted networks is warranted. Although we determined the existence of truss structures in empirical networks, the origins and functional roles of these trusses are not yet well understood. Functional roles of the generalized motifs in biological networks have been investigated [27,30–32]. A similar investigation on the functionality of trusses would be the next step and require the knowledge of field experts and well-documented network data with link annotations such as gene ontology databases. Finally, dynamical models of the growth processes of directed networks that yield truss structures will be potential future work, providing further understanding of the organization of modules in directed networks.

Data accessibility

The network we used in this study is all publicly available online. Please see the URL references shown in Results. The codes we used are available at https://github.com/yyoshida/k-truss.

Authors' contributions

T.T. and Y.Y. conceived and designed the research. Y.Y. performed the coding. T.T. and Y.Y. analysed the data, discussed the results and wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Funding

Y.Y. acknowledges the financial support through JST, ERATO, Kawarabayashi Large Graph Project, JSPS grants in aid for Young Scientists (B) (no. 26730009), and MEXT Grant-in-Aid for Scientific Research on Innovative Areas (no. 24106003).

Acknowledgements

The authors thank Naoki Masuda for insightful comments.

Footnotes

↵1 In the original definition for undirected networks [20], the k-truss is the maximal subgraph in which every link is involved in (k−2) triangles, not k triangles. We modify the definition in order to make the truss number of links involved in no triangles zero.

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.