The mathematics of networks

M. E. J. Newman
Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI

In much of economic theory it is assumed that economic agents interact, directly or indirectly, with all others, or at least that they have the opportunity to do so in order to achieve a desired outcome for themselves. In reality, as common sense tells us, things are quite different. Traders in a market have preferred trading partners, perhaps because of an established history of trust, or simply for convenience. Buyers and sellers have preferred suppliers and customers. Consumers have preferred brands and outlets. And most individuals limit their interactions, economic or otherwise, to a select circle of partners or acquaintances. In many cases partners are chosen not on economic grounds but for social reasons: individuals tend overwhelmingly to deal with others who revolve in the same circles as they do, socially, intellectually or culturally. The patterns of connections between agents form a social network (Fig. 1) and it is intuitively clear that the structure of such networks must affect the pattern of economic transactions, not to mention essentially every other type of social interaction amongst human beings. Any theory of interaction that ignores these networks is necessarily incomplete, and may in fact be missing some important and crucial phenomena. In the last few decades, therefore, researchers have conducted extensive investigations of networks in economics, mathematics, sociology and a number of other fields, in an effort to understand and explain network effects.

[Figure 1: An example of a social network, in this case of collaborative links. The nodes (squares) represent people and the edges (lines) social ties between them.]

The study of social (and other) networks has three primary components. First, empirical studies of networks probe network structure using a variety of techniques such as interviews, questionnaires, direct observation of individuals, use of archival records, and specialist tools like snowball sampling and ego-centred studies. The goal of such studies is to create a picture of the connections between individuals, of the type shown in Fig. 1. Since there are many different kinds of possible connections between people (business relationships, personal relationships, and so forth), studies must be designed appropriately to measure the particular connections of interest to the experimenter.

Second, once one has empirical data on a network, one can answer questions about the community the network represents using mathematical or statistical analyses. This is the domain of classical social network analysis, which focuses on issues such as: Who are the most central members of a network and who are the most peripheral? Which people have most influence over others? Does the community break down into smaller groups and if so what are they? Which connections are most crucial to the functioning of a group?

And third, building on the insights obtained from observational data and its quantitative analysis, one can create models, such as mathematical models or computer models, of processes taking place in networked systems: the interactions of traders, for example, or the diffusion of information or innovations through a community. Modelling work of this type allows us to make predictions about the behaviour of a community as a function of the parameters affecting the system.

This article reviews the mathematical techniques involved in the second and third of these three components: the quantitative analysis of network data and the mathematical modelling of networked systems. Necessarily this review is short. Much more substantial coverage can be found in the many books and review articles in the field [1-8].

Let us begin with some simple definitions. A network (also called a graph in the mathematics literature) is made up of points, usually called nodes or vertices, and lines connecting them, usually called edges. Mathematically, a network can be represented by a matrix called the adjacency matrix A, which in the simplest case is an n x n symmetric matrix, where n is the number of vertices in the network. The adjacency matrix has elements

    A_{ij} = 1 if there is an edge between vertices i and j,
    A_{ij} = 0 otherwise.    (1)

The matrix is symmetric since if there is an edge between i and j then clearly there is also an edge between j and i. Thus A_{ij} = A_{ji}. In some networks the edges are weighted, meaning that some edges represent stronger connections than others, in which case the nonzero elements of the adjacency matrix can be generalized to values other than unity to represent stronger and weaker connections.
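The adjacency matrix of Eq. (1) is straightforward to construct in code. The following is a minimal sketch for a small, illustrative undirected network (the edge list here is an arbitrary example, not taken from the article):

```python
# A small undirected network on n = 4 vertices, given as a list of edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Build the n x n adjacency matrix of Eq. (1): A[i][j] = 1 iff i and j
# are joined by an edge, 0 otherwise.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1  # undirected network: the matrix is symmetric, A_ij = A_ji

# Sanity check: symmetry holds for every pair of vertices.
assert all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
```

For a weighted network one would simply store the edge weights in place of the 1s, and for a directed network the symmetrizing line would be dropped.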
Another variant is the directed network, in which edges point in a particular direction between two vertices. For instance,

in a network of cash sales between buyers and sellers the directions of edges might represent the direction of the flow of goods (or conversely of money) between individuals. Directed networks can be represented by an asymmetric adjacency matrix in which A_{ij} = 1 implies the existence (conventionally) of an edge pointing from j to i (note the direction), which will in general be independent of the existence of an edge from i to j. Networks may also have multiedges (repeated edges between the same pair of vertices), self-edges (edges connecting a vertex to itself), hyperedges (edges that connect more than two vertices together) and many other features. We here concentrate however primarily on the simplest networks having undirected, unweighted single edges between pairs of vertices.

Turning to the analysis of network data, we start by looking at centrality measures, which are some of the most fundamental and frequently used measures of network structure. Centrality measures address the question: Who is the most important or central person in this network? There are many answers to this question, depending on what we mean by important. Perhaps the simplest of centrality measures is degree centrality, also called simply degree. The degree of a vertex in a network is the number of edges attached to it. In mathematical terms, the degree k_i of a vertex i is

    k_i = \sum_{j=1}^{n} A_{ij}.    (2)

Though simple, degree is often a highly effective measure of the influence or importance of a node: in many social settings people with more connections tend to have more power. A more sophisticated version of the same idea is the so-called eigenvector centrality. Where degree centrality gives a simple count of the number of connections a vertex has, eigenvector centrality acknowledges that not all connections are equal. In general, connections to people who are themselves influential will lend a person more influence than connections to less influential people.
If we denote the centrality of vertex i by x_i, then we can allow for this effect by making x_i

proportional to the average of the centralities of i's network neighbours:

    x_i = (1/λ) \sum_{j=1}^{n} A_{ij} x_j,    (3)

where λ is a constant. Defining the vector of centralities x = (x_1, x_2, ...), we can rewrite this equation in matrix form as

    λx = A x,    (4)

and hence we see that x is an eigenvector of the adjacency matrix with eigenvalue λ. Assuming that we wish the centralities to be non-negative, it can be shown (using the Perron-Frobenius theorem) that λ must be the largest eigenvalue of the adjacency matrix and x the corresponding eigenvector. The eigenvector centrality defined in this way accords each vertex a centrality that depends both on the number and the quality of its connections: having a large number of connections still counts for something, but a vertex with a smaller number of high-quality contacts may outrank one with a larger number of mediocre contacts. Eigenvector centrality turns out to be a revealing measure in many situations. For example, a variant of eigenvector centrality is employed by the well-known Web search engine Google to rank Web pages, and works well in that context.

Two other useful centrality measures are closeness centrality and betweenness centrality. Both are based on the concept of network paths. A path in a network is a sequence of vertices traversed by following edges from one to another across the network. A geodesic path is the shortest path, in terms of number of edges traversed, between a specified pair of vertices. (Geodesic paths need not be unique; there is no reason why there should not be two paths that tie for the title of shortest.) The closeness centrality of vertex i is the mean geodesic distance (i.e., the mean length of a geodesic path) from vertex i to every other vertex. Closeness centrality is lower for vertices that are more central in the sense of having a shorter network distance on average to other vertices.
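Degree centrality (Eq. 2) is just a row sum of the adjacency matrix, and the eigenvector centrality of Eqs. (3) and (4) can be computed by power iteration, which converges to the leading eigenvector for a connected, non-bipartite network. A minimal sketch, using an arbitrary example network:

```python
# Example adjacency matrix: triangle 0-1-2 plus a pendant vertex 3 on 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(A)

# Degree centrality, Eq. (2): k_i = sum_j A_ij.
degree = [sum(row) for row in A]

# Eigenvector centrality, Eq. (4): repeatedly apply A to a positive start
# vector and renormalize; the iterate converges to the leading eigenvector.
x = [1.0] * n
for _ in range(200):
    x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm = max(x)
    x = [xi / norm for xi in x]
```

Vertex 2 has both the highest degree and the highest eigenvector centrality here; on larger networks the two rankings can disagree, which is the point of the more refined measure.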
(Some writers define closeness centrality to be the reciprocal of the average, so that higher numbers indicate greater centrality. Also, some vertices may not be reachable from vertex i: two vertices

can lie in separate components of a network, with no connection between the components at all. In this case closeness as defined above is not well defined. The usual solution to this problem is simply to define closeness to be the average geodesic distance to all reachable vertices, excluding those to which no path exists.)

The betweenness centrality of vertex i is the fraction of geodesic paths between other vertices that i falls on. That is, we find the shortest path (or paths) between every pair of vertices, and ask on what fraction of those paths vertex i lies. Betweenness is a crude measure of the control i exerts over the flow of information (or any other commodity) between others. If we imagine information flowing between individuals in the network and always taking the shortest possible path, then betweenness centrality measures the fraction of that information that will flow through i on its way to wherever it is going. In many social contexts a vertex with high betweenness will exert substantial influence by virtue not of being in the middle of the network (although it may be) but of lying between other vertices in this way. It is in most cases only an approximation to assume that information flows along geodesic paths; normally it will not, and variations of betweenness centrality such as flow betweenness and random-walk betweenness have been proposed to allow for this. In many practical cases, however, the simple (geodesic path) betweenness centrality gives quite informative answers.

The study of shortest paths on networks also leads to another interesting network concept, the small-world effect. It is found that in most networks the mean geodesic distance between vertex pairs is small compared to the size of the network as a whole.
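Geodesic distances on an unweighted network are computed by breadth-first search, from which closeness centrality follows directly. A sketch for a small connected example network (stored as an adjacency list, an assumption of this illustration):

```python
from collections import deque

# Example network: triangle 0-1-2 with pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def geodesic_distances(source):
    """Shortest-path (geodesic) distance from `source` to every reachable
    vertex, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

n = len(adj)
# Closeness centrality as in the text: mean geodesic distance to the other
# n - 1 vertices, so a LOWER value means a MORE central vertex.
closeness = {v: sum(geodesic_distances(v).values()) / (n - 1) for v in adj}
```

For a network with separate components, the dictionary returned by `geodesic_distances` simply omits unreachable vertices, which implements the usual fix of averaging over reachable vertices only.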
In a famous experiment conducted in the 1960s, the psychologist Stanley Milgram asked participants to get a message to a specified target person elsewhere in the country by passing it from one acquaintance to another, stepwise through the population. Milgram's remarkable finding that the typical message passed through just six people on its journey between (roughly) randomly chosen initial and final individuals has been immortalized in popular culture in the phrase "six degrees of separation", which was the title of

a 1990 Broadway play by John Guare in which one of the characters discusses the small-world effect. Since Milgram's experiment, the small-world effect has been confirmed experimentally in many other networks, both social and nonsocial.

Other network properties that have attracted the attention of researchers in recent years include network transitivity or clustering (the tendency for triangles of connections to appear frequently in networks; in common parlance, "the friend of my friend is also my friend"), vertex similarity (the extent to which two given vertices do or do not occupy similar positions in the network), communities or groups within networks and methods for their detection, and crucially, the distribution of vertex degrees, a topic discussed in more detail below.

Turning to models of networks and of the behaviour of networked systems, perhaps the simplest useful model of a network (and one of the oldest) is the Bernoulli random graph, often called just the random graph for short [9-11]. In this model one takes a certain number of vertices n and creates edges between them with independent probability p for each vertex pair. When p is small there are only a few edges in the network, and most vertices exist in isolation or in small groups of connected vertices. Conversely, for large p almost every possible edge is present between the \binom{n}{2} possible vertex pairs, and all or almost all of the vertices join together in a single large connected group. One might imagine that for intermediate values of p the sizes of groups would just grow smoothly from small to large, but this is not the case. It is found instead that there is a phase transition at the special value p = 1/n, above which a "giant component" forms: a group of connected vertices occupying a fixed fraction of the whole network, i.e., with size varying as n. For values of p less than this, only small groups of vertices exist, of a typical size that is independent of n.
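The random graph and its giant component are easy to explore by simulation. The sketch below (parameters n and p are illustrative choices, with p = 3/n safely above the 1/n transition) generates one instance and measures its largest connected component:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
n = 1000
p = 3.0 / n     # mean degree z = (n-1)p is about 3, above the transition

# Bernoulli random graph: each of the C(n, 2) possible edges is present
# independently with probability p.
adj = {v: [] for v in range(n)}
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            adj[i].append(j)
            adj[j].append(i)

def component_size(start):
    """Size of the connected component containing `start` (depth-first)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)

largest = max(component_size(v) for v in range(n))
```

With mean degree around 3 the giant component should occupy the large majority of the network; rerunning with p below 1/n instead leaves only components of size O(log n).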
Many real-world networks show behaviour reminiscent of this model, with a large component of connected vertices filling a sizable fraction of the entire network, the remaining vertices falling in much smaller components that are unconnected to the rest of the network. The random graph has a major shortcoming, however: the distribution of the degrees of the

vertices is quite unlike that seen in most real-world networks. The fraction p_k of vertices in a random graph having degree k is given by the binomial distribution, which becomes Poisson in the limit of large n:

    p_k = \binom{n-1}{k} p^k (1-p)^{n-1-k} ≃ \frac{z^k e^{-z}}{k!},    (5)

where z = (n-1)p is the mean degree. Empirical observations of real networks, social and otherwise, show that most have highly non-Poisson distributions of degree, often heavily right-skewed with a fat tail of vertices having unusually high degree [6, 7]. These high-degree nodes or "hubs" in the tail can, it turns out, have a substantial effect on the behaviour of a networked system.

To allow for these non-Poisson degree distributions, one can generalize the random graph, specifying a particular, arbitrary degree distribution p_k and then forming a graph that has that distribution but is otherwise random. A simple algorithm for doing this is to choose the degrees of the n vertices from the specified distribution, draw each vertex with the appropriate number of "stubs" of edges emerging from it, and then pick stubs in pairs uniformly at random and connect them to create complete edges. The resulting model network (or more properly the ensemble of such networks) is called the configuration model.

The configuration model also shows a phase transition, similar to that of the Bernoulli random graph, at which a giant component forms. To see this, consider a set of connected vertices and consider the boundary vertices that are immediate neighbours of that set. Let us grow our set by adding the boundary vertices to it one by one. When we add one boundary vertex to our set the number of boundary vertices goes down by 1. However, the number of boundary vertices also increases by the number of new neighbours of the vertex added, which is one less than the degree k of that vertex. Thus the total change in the number of boundary vertices is -1 + (k - 1) = k - 2.
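The stub-matching construction of the configuration model can be sketched in a few lines. The degree sequence below is an arbitrary example (its sum must be even so that the stubs pair up exactly):

```python
import random
from collections import Counter

random.seed(2)
degrees = [1, 2, 2, 3, 3, 1]        # example degree sequence; sum is even
assert sum(degrees) % 2 == 0

# One stub per half-edge, labelled by the vertex it emerges from.
stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
random.shuffle(stubs)

# Join consecutive stubs into edges. Multiedges and self-edges can occur;
# for large networks they are rare and are usually tolerated or discarded.
edges = [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

# Every vertex ends up with exactly its prescribed degree.
realized = Counter()
for i, j in edges:
    realized[i] += 1
    realized[j] += 1
```

Repeating the shuffle-and-pair step with fresh randomness samples a new member of the configuration-model ensemble with the same degree sequence.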
However, the probability of a particular vertex being a boundary vertex is proportional to k, since there are k times as many edges by which a vertex of degree k could be connected to our set as a vertex of degree 1. Thus the average change in the number of boundary vertices when we add one

vertex to our set is a weighted average \sum_i k_i (k_i - 2) / \sum_j k_j = \sum_i k_i (k_i - 2) / (nz), where z is again the mean degree. If this quantity is less than zero, then the number of boundary vertices dwindles as our set grows bigger and will in the end reach zero, so that the set will stop growing. Thus in this regime all connected sets of vertices are of finite size. If on the other hand this number is greater than zero then the number of boundary vertices will grow without limit, and hence the size of our set of connected vertices is limited only by the size of the network. Thus a giant component exists in the network if and only if

    \langle k^2 \rangle - 2 \langle k \rangle > 0,    (6)

where \langle k \rangle = z = n^{-1} \sum_i k_i is the mean degree and \langle k^2 \rangle = n^{-1} \sum_i k_i^2 is the mean-square degree.

The occurrence here of the mean-square degree is a phenomenon that appears over and over in the mathematics of networks. Another context in which it appears is in the spread of information (or anything else) over a network. Taking a simple model of the spread of an idea (or a rumour or a disease), imagine that each person who has heard the idea communicates it with independent probability r to each of his or her friends. If the person's degree is k then there are k - 1 friends to communicate the idea to, not counting the one from whom they heard it in the first place, so the expected number who hear it is r(k - 1). Performing the weighted average over vertices again, the average number of people a person passes the idea on to, also called the basic reproductive number R_0, is

    R_0 = r \frac{\sum_i k_i (k_i - 1)}{\sum_i k_i} = r \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle}.    (7)

If R_0 is greater than 1, then the number of people hearing the idea grows as it gets passed around and it will take off exponentially. If R_0 is less than 1 then the idea will die. Again, we have a phase transition, or "tipping point", for the spread of the idea: it spreads if and only if

    r > \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}.    (8)
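Both criteria reduce to two moments of the degree sequence, so they can be checked directly from data. A sketch, using an arbitrary example degree sequence with a mildly fat tail:

```python
# Example degree sequence (illustrative, not from the article).
degrees = [1, 1, 2, 2, 3, 3, 4, 8]

n = len(degrees)
mean_k = sum(degrees) / n                  # <k>, the mean degree
mean_k2 = sum(k * k for k in degrees) / n  # <k^2>, the mean-square degree

# Eq. (6): a giant component exists iff <k^2> - 2<k> > 0.
has_giant_component = mean_k2 - 2 * mean_k > 0

# Eq. (8): an idea spreads iff the transmission probability r exceeds
# <k> / (<k^2> - <k>).
spreading_threshold = mean_k / (mean_k2 - mean_k)
```

Note how the single high-degree vertex (degree 8) inflates the mean-square degree and so pulls the spreading threshold down, exactly the hub effect discussed in the text.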

The simple understanding behind the appearance of the mean-square degree in this expression is the following. If a person with high degree hears the idea they can spread it to many others, because they have so many friends. However, such a person is also more likely to hear the idea in the first place, because they have so many friends to hear it from. Thus, the degree enters twice into the process: a person with degree 10 is 10^2 = 100 times more efficacious at spreading the idea than a person with degree 1.

The appearance of the mean-square degree in expressions like (6) and (8) can have substantial effects. Of particular interest are networks whose degree distributions have fat tails. It is possible for such networks to have very large values of \langle k^2 \rangle, in the hundreds or thousands, so that, for example, the right-hand side of Eq. (8) is very small. This means that the probability of each individual person spreading an idea (or rumour or disease) need not be large for it still to spread through the whole community.

Another important class of network models is the class of generative models: models that posit a quantitative mechanism or mechanisms by which a network forms, usually in an effort to explain how the observed structure of the network arises. The best known example of such a model is the cumulative advantage or preferential attachment model [12, 13], which aims to explain the fat-tailed degree distributions mentioned above. In its simplest form this model envisages a network that grows by the steady addition of vertices, one at a time. Many networks, such as the World Wide Web and citation networks, grow this way; it is a matter of current debate whether the model applies to social networks as well. Each vertex is added with a certain number m of edges emerging from it, whose other ends connect to preexisting vertices with probability proportional to those vertices' current degree.
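The growth process just described can be simulated directly. In the sketch below, keeping one list entry per edge endpoint makes a uniform draw from that list automatically degree-biased; the network size, m, and the distinct-targets rule are choices made for this illustration:

```python
import random

random.seed(3)
m = 2
degrees = [m, m]                 # start from two vertices joined by m edges
endpoints = [0] * m + [1] * m    # one list entry per edge endpoint

for new in range(2, 500):        # add vertices 2..499, one at a time
    targets = set()
    while len(targets) < m:      # m distinct, degree-biased targets
        targets.add(random.choice(endpoints))
    degrees.append(m)            # the new vertex arrives with m edges
    for t in targets:
        degrees[t] += 1
        endpoints += [new, t]    # record both endpoints of the new edge

# Preferential attachment produces hubs: the maximum degree ends up far
# above the mean degree (which stays close to 2m).
max_degree = max(degrees)
mean_degree = sum(degrees) / len(degrees)
```

Requiring distinct targets is a common, slight simplification of the pure model; it prevents multiedges without changing the character of the resulting fat-tailed degree distribution.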
That is, the higher the current degree of a vertex, the more likely that vertex is to acquire new edges when the graph grows. This kind of rich-get-richer phenomenon is plausible in many network contexts and is known to generate Pareto degree distributions. Using a rate-equation method [12, 14, 15] it is straightforward to show that in the limit of large network

size the degree distribution obeys:

    p_k = \frac{2m(m+1)}{k(k+1)(k+2)}.    (9)

This distribution has a tail going as p_k ~ k^{-3} in the large-k limit, which is strongly reminiscent of the degree distributions seen particularly in citation networks and also in the World Wide Web. Generative models of this type have been a source of considerable interest in recent years and have been much extended beyond the simple ideas described here by a number of authors [6, 7].

Concepts such as those appearing in this article can be developed a great deal further and lead to a variety of useful, and in some cases surprising, results about the function of networked systems. More details can be found in the references.

References

[1] S. Wasserman and K. Faust, Social Network Analysis. Cambridge University Press, Cambridge (1994).
[2] J. Scott, Social Network Analysis: A Handbook. Sage, London, 2nd edition (2000).
[3] D. B. West, Introduction to Graph Theory. Prentice Hall, Upper Saddle River, NJ (1996).
[4] F. Harary, Graph Theory. Perseus, Cambridge, MA (1995).
[5] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Upper Saddle River, NJ (1993).
[6] S. N. Dorogovtsev and J. F. F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford (2003).
[7] R. Albert and A.-L. Barabási, Statistical mechanics of complex networks. Rev. Mod. Phys. 74 (2002).
