Abstract

Statistics on networks have become vital to the study of relational data drawn from areas including bibliometrics, fraud detection, bioinformatics, and the Internet. Calculating many of the most important measures—such as betweenness centrality, closeness centrality, and graph diameter—requires identifying short paths in these networks. However, finding these short paths can be intractable for even moderate-size networks. We introduce the concept of a network structure index (NSI), a composition of (1) a set of annotations on every node in the network and (2) a function that uses the annotations to estimate graph distance between pairs of nodes. We present several varieties of NSIs, examine their time and space complexity, and analyze their performance on synthetic and real data sets. We show that creating an NSI for a given network enables extremely efficient and accurate estimation of a wide variety of network statistics on that network.

Citations

...icate pathologically poor search performance. We evaluate the NSIs from Section 2 on synthetic graphs of 10,000 nodes generated using three models: random networks as defined by Erdős and Rényi =-=[5]-=-, rewired lattices defined by Watts and Strogatz [24], and the Forest Fire graph model recently introduced by Leskovec [14]. (See Appendix A for more detail on the network generation procedures.) In F...

...ortunately, this strategy performs rather poorly in practice. Many of today’s “small-world” data sets are characterized by small diameters due to the existence of “short cut” links in the graph [11]=-=[24]-=-. As a result, a found path that passes through a landmark often forms two sides of a triangle, resulting in artificially long paths. 2.4 ZONES The ZONE NSI utilizes multiple dimensions, where each di...

... Unfortunately, this strategy performs rather poorly in practice. Many of today’s “small-world” data sets are characterized by small diameters due to the existence of “short cut” links in the graph =-=[11]-=-[24]. As a result, a found path that passes through a landmark often forms two sides of a triangle, resulting in artificially long paths. 2.4 ZONES The ZONE NSI utilizes multiple dimensions, where eac...

...nness centrality—the proportion of all shortest paths in the network that run through a given node—and closeness centrality—the average distance from the given node to every other node in the network =-=[8]-=-. For example, centrality measures can help evaluate whether Mr. Bacon lies near the center of the Hollywood universe or Marc Maier Knowledge Discovery Laboratory Department of Computer Science Univer...
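As a minimal sketch of the closeness measure described above (the average distance from a given node to every other node), the following assumes an unweighted graph stored as an adjacency dictionary. The function name `average_distance` and the choice to average only over reachable nodes are illustrative assumptions, not taken from the paper.

```python
from collections import deque


def average_distance(graph, source):
    """Average hop distance from `source` to every other reachable node.

    `graph` maps each node to an iterable of its neighbors (assumed format).
    Unreachable nodes are ignored, which is an assumption of this sketch.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else 0.0


# Path graph 0 - 1 - 2: the middle node is closest to the others on average.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
print(average_distance(path_graph, 0))
print(average_distance(path_graph, 1))
```

Computing this exactly for every node requires one breadth-first search per node, which is the cost that motivates estimating distances instead.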

...I does not perform well in practice, as we show in Section 3. 2.3 LANDMARKS Previous work in network path finding has utilized a system of network landmarks to efficiently navigate graph structure [3]=-=[16]-=-. With this technique, we randomly designate a small number of nodes in the network to serve as navigational beacons. Then, we annotate nodes in the graph by flooding out from each landmark and record...
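The landmark annotation step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is an unweighted adjacency dictionary, and the distance estimate (the shortest route through any shared landmark) is one plausible reading of the excerpt. All function names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def build_landmark_annotations(graph, num_landmarks, seed=0):
    """Pick random landmarks and annotate each node with its distance to them."""
    rng = random.Random(seed)
    landmarks = rng.sample(sorted(graph), num_landmarks)
    annotations = {node: {} for node in graph}
    for landmark in landmarks:
        # "Flooding out" from the landmark is a breadth-first search here.
        for node, d in bfs_distances(graph, landmark).items():
            annotations[node][landmark] = d
    return annotations


def estimate_distance(annotations, u, v):
    """Assumed estimator: length of the best path through a shared landmark."""
    shared = annotations[u].keys() & annotations[v].keys()
    if not shared:
        return float("inf")
    return min(annotations[u][l] + annotations[v][l] for l in shared)


# Ring of six nodes with a single landmark.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
annotations = build_landmark_annotations(ring, num_landmarks=1)
print(estimate_distance(annotations, 0, 3))
```

By the triangle inequality, an estimate of this form can never be shorter than the true distance, so any error is an overestimate.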

...t of Computer Science University of Massachusetts Amherst jensen@cs.umass.edu whether he is near the periphery. Several researchers have used such measures to construct statistical models of networks =-=[9]-=-[15]. Recent work in knowledge discovery has begun to study very large networks, often comprising millions of nodes. Given networks of this size, even the most efficient algorithms for calculating net...

...s generated using three models: random networks as defined by Erdős and Rényi [5], rewired lattices defined by Watts and Strogatz [24], and the Forest Fire graph model recently introduced by Leskovec =-=[14]-=-. (See Appendix A for more detail on the network generation procedures.) In Figure 5, we compare the performance of DEGREE, LANDMARK, ZONE, and DTZ when implemented with increasing numbers of dimensio...

...able. For example, the most efficient known algorithms for calculating betweenness centrality and closeness centrality are O(ne + n² log n), where n and e are the number of nodes and edges in the graph =-=[2]-=-. Ad hoc calculations that use basic path finding can have even higher complexity, as they require bidirectional breadth-first search. Figure 1: The average number of nodes explored by bidirectional b...

...Is is provided by previous work that has shown that path finding can be surprisingly efficient in a network that exhibits homophily, the tendency of neighboring nodes to have similar attribute values =-=[1]-=-. Unfortunately, many networks do not “naturally” have attributes that exhibit homophily. However, we can synthetically generate and annotate any arbitrary graph with such an attribute and use it for ...

...e.g., actors that sit between winners of Academy Awards for best picture and the IMDb’s “Bottom 100,” the worst 100 movies as voted by users of the Internet Movie Database). 5. RELATED WORK Kleinberg =-=[10]-=-[11] demonstrates the notion of similarity-based navigation in small-world networks. He demonstrates how the presence of network homophily can provide a gradient that guides search using local informa...

...s/Internet community as a basis for determining network latency between hosts on the Internet. Most of the Internet coordinate approaches attempt to minimize network latency through extensions of GNP =-=[22]-=-[19][18]. Kleinberg provides a theoretical analysis and framework of all beacon-based strategies, such as GNP and others [12]. This mostly describes the effectiveness of triangulation (determining pos...

... beacon-based approaches. Other strategies in the Internet domain have attempted to create network overlay structures, such as a rings-based approach that does not rely on selection of landmark nodes.=-=[26]-=- This concept has recently been explored theoretically as a technique for distance estimation and nearest neighbor searches by Slivkins [20] and Krauthgamer [13]. However, it is unclear how accurately...

...et community as a basis for determining network latency between hosts on the Internet. Most of the Internet coordinate approaches attempt to minimize network latency through extensions of GNP [22][19]=-=[18]-=-. Kleinberg provides a theoretical analysis and framework of all beacon-based strategies, such as GNP and others [12]. This mostly describes the effectiveness of triangulation (determining positions o...

...sence of network homophily can provide a gradient that guides search using local information. Watts investigated a similar approach by constructing a hierarchical model from which to derive homophily.=-=[23]-=- In this paper, we present methods for creating such homophily in domains that may lack local information. We detail a number of ways in which this information can be obtained for both synthetic and r...

...ternet community as a basis for determining network latency between hosts on the Internet. Most of the Internet coordinate approaches attempt to minimize network latency through extensions of GNP [22]=-=[19]-=-[18]. Kleinberg provides a theoretical analysis and framework of all beacon-based strategies, such as GNP and others [12]. This mostly describes the effectiveness of triangulation (determining positio...

...ot rely on selection of landmark nodes.[26] This concept has recently been explored theoretically as a technique for distance estimation and nearest neighbor searches by Slivkins [20] and Krauthgamer =-=[13]-=-. However, it is unclear how accurately any of these strategies perform on domains other than the Internet or for the purposes of approximating network statistics. Additionally, our current work focus...

...hors have pioneered work in this area by identifying efficient methods for finding connection subgraphs (sets of short paths between nodes) and for approximating the size of the neighborhood of a node.=-=[6]-=-[17] NSIs may provide an alternative way of representing much of the information needed for both of these tasks. 7. ACKNOWLEDGEMENTS This research is supported by Lawrence Livermore National Laboratory...

...s have pioneered work in this area by identifying efficient methods for finding connection subgraphs (sets of short paths between nodes) and for approximating the size of the neighborhood of a node.[6]=-=[17]-=- NSIs may provide an alternative way of representing much of the information needed for both of these tasks. 7. ACKNOWLEDGEMENTS This research is supported by Lawrence Livermore National Laboratory and...

...e approaches attempt to minimize network latency through extensions of GNP [22][19][18]. Kleinberg provides a theoretical analysis and framework of all beacon-based strategies, such as GNP and others =-=[12]-=-. This mostly describes the effectiveness of triangulation (determining positions of uncertain nodes) in beacon-based approaches. Other strategies in the Internet domain have attempted to create netwo...

...f Computer Science University of Massachusetts Amherst jensen@cs.umass.edu whether he is near the periphery. Several researchers have used such measures to construct statistical models of networks [9]=-=[15]-=-. Recent work in knowledge discovery has begun to study very large networks, often comprising millions of nodes. Given networks of this size, even the most efficient algorithms for calculating network...

... approach that does not rely on selection of landmark nodes.[26] This concept has recently been explored theoretically as a technique for distance estimation and nearest neighbor searches by Slivkins =-=[20]-=- and Krauthgamer [13]. However, it is unclear how accurately any of these strategies perform on domains other than the Internet or for the purposes of approximating network statistics. Additionally, o...

...into this table. While this strategy yields optimal results when searching for paths, in many cases it may be infeasible in terms of annotation complexity: the Floyd-Warshall algorithm runs in O(n³) =-=[7]-=-, while more complex approaches using fast matrix multiplication can reduce the exponent to 2.376 [4]. Furthermore, APSP requires O(n²) to store the distances themselves. Although APSP may seem triv...
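To make the cost of the all-pairs shortest path (APSP) table concrete, here is a minimal sketch of the Floyd-Warshall algorithm on an unweighted, undirected graph. The function name and the edge-list input format are assumptions for illustration. The triple loop gives the O(n³) running time and the n-by-n table gives the O(n²) storage mentioned above.

```python
import math


def floyd_warshall(num_nodes, edges):
    """Return an n x n table of hop distances for nodes 0..num_nodes-1.

    `edges` is a list of (u, v) pairs treated as undirected, unit-length
    links (an assumption of this sketch).
    """
    dist = [[math.inf] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        dist[i][i] = 0
    for u, v in edges:
        dist[u][v] = 1
        dist[v][u] = 1
    # Allow each node k in turn to act as an intermediate stop.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


# Path 0 - 1 - 2 - 3 plus an isolated node 4.
table = floyd_warshall(5, [(0, 1), (1, 2), (2, 3)])
print(table[0][3])
print(table[0][4])
```

For a graph of 10,000 nodes, the table alone holds 100 million entries, which illustrates why exact APSP annotation may be infeasible at scale.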

...date NSIs when nodes and links are added to the network so that dynamic graphs can be successfully indexed. Finally, we are investigating how to apply our own recent developments in network searching =-=[21]-=- to more effectively use NSI annotations to find short paths. We are actively exploring additional applications of network structure indices. Two of the most promising directions are finding connectio...

... NSI does not perform well in practice, as we show in Section 3. 2.3 LANDMARKS Previous work in network path finding has utilized a system of network landmarks to efficiently navigate graph structure =-=[3]-=-[16]. With this technique, we randomly designate a small number of nodes in the network to serve as navigational beacons. Then, we annotate nodes in the graph by flooding out from each landmark and re...

...t may be infeasible in terms of annotation complexity: the Floyd-Warshall algorithm runs in O(n³) [7], while more complex approaches using fast matrix multiplication can reduce the exponent to 2.376 =-=[4]-=-. Furthermore, APSP requires O(n²) to store the distances themselves. Although APSP may seem trivial, the use of structure indices is a general approach, not specific to a single implementation or a...