4.5 A Case Study: Small World Phenomenon

This section under major construction.

Small world phenomenon.

The mathematical model that we use for studying the nature of
pairwise connections among entities is known as the graph.
Some graphs exhibit a specific property known as the
small-world phenomenon.
You may be familiar with this property, which is sometimes known as
six degrees of separation.
It is the basic idea that, even though each of us has relatively
few acquaintances, there is a relatively short chain of acquaintances
(the six degrees of separation) separating us from one another.

Graphs.

A graph is comprised of a set of vertices and a set of edges.
Each edge represents a connection between two vertices.
The list below suggests the diverse range of systems where
graphs are appropriate starting points for understanding structure.

GRAPH

VERTICES

EDGES

Communication

telephones, computers

fiber optic cable

Circuits

gates, registers, processors

wires

Mechanical

joints

rods, beams, springs

Hydraulic

reservoirs, pumping stations

pipelines

Financial

stocks, currency

transactions

Transportation

street intersections, airports

highways, air routes

Scheduling

tasks

precedence constraints

Software systems

functions

function calls

Internet

web pages

hyperlinks

Games

board positions

legal moves

Social networks

people, actors, terrorists

friendships, movie casts, associations

Protein interaction networks

proteins

protein-protein interactions

Genetic regulatory networks

genes

regulatory interactions

Neural networks

neurons

synapses

Infectious disease

people

infections

Electrical power grid

transmission stations

cable

Chemical compounds

molecules

chemical bonds

Graph data type.

The following API supports graph processing:

Graph implementation.

Program Graph.java implements this API.
Its internal representation is a symbol table of sets: the keys are vertices
and the values are the sets of neighbors
(the vertices adjacent to the key)
This representation uses the two data types
ST.java and SET.java
that we introduced in
Section 4.4.
Our implementation has three important properties:

Clients can efficiently iterate through the graph vertices.

Clients can efficiently iterate through a vertex's neighbors.

Space usage is proportional to the number of edges.

Performer-movie graphs.

The performer-movie graph includes a vertex for each performer and movie
and an edge between a performer and a movie if the performer appeared in the movie.

We provide a number of sample data files (encoded using UTF-8 for compatibility
with In).
Each line in a file consists of a movie, followed by a list of peformers
that appeared in the movie, separated by the / delimiter.

Program IndexGraph.java
takes a query which is the name of a movie (in which case it
prints the list of performers that appear in that movie) or the name
of a performer (in which case it prints the lst of movies
in which that performer has appeared).

Shortest paths in graphs.

Given two vertices in a graph, a path is a sequence
of edges connecting them. A shortest path is one with minimal length over
all such paths (there typically are multiple shortest paths).

We use the following API.

PathFinder.java implements the API
using a classic algorithm known as breadth-first search.

Degrees of separation.

One of the classic applications of shortest-paths algorithms is to find the
degrees of separation of individuals in social networks.
To fix ideas, we discuss this application in terms of a recently popularized pastime
known as the Kevin Bacon game, which uses the movie-performer
graph that we just considered.
Kevin Bacon is a prolific actor who appeared in many movies.
We assign every performer who has appeared in a movie
a Bacon number:
Bacon himself is 0, any performer who has been in the same cast
as Bacon has a Kevin Bacon number of 1, any other performer (except Bacon) who
has been in the same cast as a performer whose number is 1 has a Kevin Bacon
number of 2, and so forth. For example, Meryl Streep has a Kevin Bacon number
of 1 because she appeared in The River Wild with Kevin Bacon.
Nicole Kidman's number is 2: although she did not appear in any movie with Kevin Bacon,
she was in Cold Mountain with Donald Sutherland, and Sutherland appeared in Animal
House with Kevin Bacon.

Given the name of a performer, the simplest version of
the game is to find some alternating sequence of movies and performers that lead
back to Kevin Bacon.
Remarkably, PathFinder.java
is precisely the program that you need to find a shortest path that establishes
the Bacon number of any performer in movies.txt.

The six degrees of separation rule suggests that most actors will have a Kevin
Bacon number of 6 or less. In fact, most have a Kevin Bacon number of 3 or less!

Q + A

Q.
What is formally meant by a small world network?

A.
High local clustering and short global separation.
Local clustering means that two vertices are more likely
to be adjacent if they share a neighbor.
Short global separation means that
all (or most) pairs of vertices are connected by "short" paths.
Here short means the length of chain scales logarithmic (or polylogarithmic)
with the number of vertices.
For example, empirically, any two documents on the Web are around 19
clicks away (directed) on average.
Empirical estimate = .3 + 2.06 log10 N, where N = 1 billion.
If web grows by factor of 1,000, only 21 clicks according to model.

Q.
Other small world applications.

A.
The small world phenomenon has numerous applications in the natural and social
sciences.
The domain of the Kevin Bacon is actor-movie relationships, but we can
ask the same questions with mathematicians and papers
(Erdös number),
musicians and rock groups, baseball players and teams,
or people and online social network groups.
Sociology: looking for a job via connections,
marketing products or ideas, formation and spread of fame and fads,
train of thought followed in a conversation, and
defining representative-ness of political bodies.
Science: synchronization of neurons, analysis of World Wide Web,
secure peer-to-peer networks, routing algorithms for ad hoc wireless networks,
design of electrical power grids, processing of information in the human brain,
phase transitions in coupled Kuramoto oscillators,
spread of infectious diseases and computer viruses, and
evolution of cooperation in multi-player iterated Prisoner's Dilemma.

A.
Another important application of breadth first search is crawling the web.
Or routing packets on the Internet.
An important generalization is when each edge has an associated distance
or time and the goal is to find a path that minimizes the sum of the edge
lengths. More sophisticated algorithm known as Dijkstra's algorithm
solves that problem in linearithmic time.
Basis for map routing applications on the web and in GPS-equipped
automobiles.

Exercises

Wikipedia graph - a nice dataset.

Add methods vertices() and edges()
to Graph that returns the number of
vertices and edges in the graph, respectively.

Add a boolean method exists(String v) that returns true
if v is a vertex in the graph, and false otherwise.

Add a boolean method exists(String v, String w) that
returns true if v-w is an edge in the graph, and false otherwise.

Suppose you use a stack instead of a queue when running breadth first
in PathFinder.java.
Does it still compute Kevin Bacon numbers correctly?

Suppose you use a queue instead of a stack when forming the shortest
path in path(). What is the effect?

Add a method isReachable(v) to
PathFinder.java that returns
true if v can reach s via some path, and false
otherwise.

The degree of a vertex is the number of incident edges.
Add a method degree() to Graph
that takes a Vertex
as input and returns the degree of that vertex.
Use this method to find the actor node in the Kevin Bacon graph
with the highest degree, i.e., the actor that has appeared in the most
movies.

Write a program MovieStats.java
that reads in a movie data set and prints out the number of distinct actors
and movies.

Modify Bacon.java so that
the user types two actors (one per line) and the program prints a
shortest chain between the two actors.

Create a copy constructor for Graph that takes
as input a graph G, and creates and initializes a new, independent,
copy of the graph. Any changes you make to G should not affect the
newly created graph.

Array of PathFinder's to save away previous queries?

Modify Graph.java to include a method iterator()
that returns an Iterator to the graph's vertices.
Also, make Graph implement the
Iterable<String> interface so that you can
use it with foreach to iterate over the graph's vertices.

public Iterator<String> iterator() {
return st.iterator();
}

True or false. At some point during breadth first search the queue can contain
two vertices, one whose distance from the source is 7 and one whose distance is 9.
Answer: False. The queue can contain vertices of at most two distinct
distances d and d+1. Breadth first search examines the vertices in increasing
order of distance from the source. When examining a vertex at distance d,
only vertices of distance d+1 can be enqueued.

Creative Exercises

Histogram.
Write a program Histogram.java to
print a histogram of Kevin Bacon numbers, indicating how many actors have
a Bacon number of 0, 1, 2, 3, etc.

Large Bacon numbers.
Find the actors in cast.txt with the largest, but finite, Kevin Bacon number.
Answer: here are a few actors whose number is 8:
Ozra Esmaili, Gabriela Pereyra, Yamil Turk, Alejandro Zain.

Actor graph.
An alternate way to compute Kevin Bacon numbers is to
build a graph where there is a vertex for each actor (but not for
each movie). Instead, two actors are connected
by an edge if they appear in a movie together. Calculate Kevin Bacon
numbers by running BFS on the actor graph. Compare the running time
versus the algorithm described in the text. Explain why this approach
is so much slower (and more difficult to reconstructor the movies
along the path).

Connected components.
Given an undirected graph, a connected component is
a maximal set of vertices that are mutually reachable.
Write a program CCFinder.java that computes the connected
components of a graph. Include a constructor that takes a Graph
as an argument and computes all of the connected components using
breadth first search. Also, include a method isReachable(v, w)
that returns true if v and w are in the same connected component,
and false otherwise.
Also include a method components() that returns the
number of of connected components.
Note: need a way to iterate over the vertices in the
symbol table. Then run BFS from each vertex that hasn't yet been
visited.

Flood fill / image procesing.
Blob = collection of neighboring pixels of the same color.
How many blobs in picture?
Given N-by-N image of color pixels, identify all clusters
of pixels of the same color that are connected.
Find connected components.

Word ladder.
Write a program WordLadder.java
that takes two 5 letter strings as command-line arguments, and reads in
a list of 5 letter words
from standard input, and prints out a shortest
(or word ladder
connecting the two strings (if it exists).
This word game was invented by Lewis Carroll and was orignally known as a doublet.
Two words can be connected in a word ladder chain if they differ
in exactly one letter.
As an example, the following word ladder connects green and brown.

All paths.
Given an undirected graph and two vertices s and t,
write a program AllPaths.java
that can be used to count or print out all simple paths from s to t.
A simple path does not revisit any vertex more than once.
In two-dimensional grids, such paths are referred to as self-avoiding walks.
It is a fundamental problem in in statistical physics and theoretical chemistry,
e.g., to model the spatial arrangement of linear polymer molecules in a solution.
Warning: there might be exponentially paths, so don't run this on
large graphs.
Reference.

Garbage collection.
Reference counting vs. mark-and-sweep.
Automatic memory management in languages like Java is a
challenging problem. Allocating memory is easy, but discovering
when a program is finished with memory (and reclaiming it) is more difficult.
Reference counting: doesn't work with circular linked structure.
Mark-and-sweep algorithm. Root = local variables and static variables.
Run DFS from roots, marking all variables references from roots, and so on.
Then, make second pass: free all unmarked objects and unmark
all marked objects. Or a copying GC would then move all of the
marked objects to a single memory area.
Uses one extra bit per object.
JVM must pause while garbage collection occurs. Fragments memory.

Directed graph.
Create a data type Digraph.java for
directed graph. It is almost identical to our Graph ADT,
except that when we insert an edge v-w, we only insert w on the adjacently
list of v, but not v onto the adjacency list of w.
Include a method outdegree(String v).

Shortest path in a digraph.
Write a directed version of PathFinder that finds the shortest path
from s to every other vertex in a directed graph.
Hint: breadth search first works in digraphs.

Transitive closure.
Write an ADT TransitiveClosure.java whose constructor
takes a Digraph as an argument, and whose method
isReachable(v, w)true if w is reachable
from v along a directed path in G,
and false otherwise.
Note: need a way to iterate over the vertices in the
symbol table. Then run BFS from each vertex. Might need to
store an V-by-V table or compute queries on the fly.

Topological order and cycle detection.
Given a directed graph, determine if there is a directed cycle. If so, print
one out. Otherwise print out a topological order.
Given a list of pairs of variables of the form x < y, find
an ordering that is consistent with the partial order. For example,
if x < z, y < z, z < v, and v < w, then x < y < z < v < w
is one possible solution. Print that the input is inconsistent
if there is a cycle, e.g., x < y, y < z, z < x.

Cover time.
Given a connected, undirected graph G, a random walk moves
from vertex v to w with probability equal to 1 / degree(v)
if (v, w) is an edge (and 0 otherwise).
Write a program to simulate how many steps
it takes to visit every vertex in the graph.
Find an infinite sequence graph where the cover time
grows proportional to V, proportional to V^2.
Can you find one where it grows proportional to V^3 or 2^V?

Center of the Hollywood universe.
We can measure how good of a center that Kevin Bacon is by computing
their Hollywood
number or average path length. The Hollywood number of Kevin Bacon
is the average Bacon number of all the actors (in its connected
component). The Hollywood number
of another actor is computed the same way, but we make them be the
source instead of Kevin Bacon. Compute Kevin Bacon's Hollywood number
and find an actor and actress with better Hollywood numbers
than Kevin Bacon. Find the actor or actress (in the same connected
component as Kevin Bacon) with the worst Hollywood number.

Diameter.
The distance between two vertices is the length of the shortest
path between them.
The eccentricity of a vertex is the greatest distance
between it and any other vertex.
The diameter of a graph is the greatest distance
between any two vertices (the maximum eccentricity of any vertex).
Write a program Diameter.java to compute
the eccentricity of a vertex and the diameter of a graph.
Use it to find the diameter of the Kevin Bacon graph.

Parallel edges.
Modify Graph to allow parallel edges, i.e., two
or more edges connecting the same pair of vertices.
Hint: use a Queue for the
adjacency list instead of a SET.

Erdos-Renyi graph model.
Graph on V vertices. Edge possible edge is included
independently with probability p. Verify the following
properties.

Connectivity threshholds.
If p < 1/n and n is large, then most of
connected components are small, with largest having
O(log n) vertices. If p > 1/n then there is almost
surely a giant component containing Theta(n) vertices.
If p < ln n / n, the graph is disconnected
with high probability; if p > ln n / n the graph
is connected with high probability.

Distribution of degrees.
Distribution of degrees follows a binomial distribution.
Most vertices have similar degrees.
Called exponential since the probability
that a vertex is connected to k other vertices
decreases exponentially in k.

No hubs.
Maximum degree if p is a constant is O(log n).

Not many triangles.
Random graph model doesn't capture social networks.
Typically the neighbors of a vertex are also neighbors
of each other.

Small diameter.
If p > log n / n, then diameter is O(log n).

Power law of web links.
(Micahel Mitzenmacher)
The indegrees and outdegrees of World Wide Web obey a power law.
Can be modeled by
preferred attachment process.
Suppose that each web page has exactly one outgoing link.
Each page is created one at a time, starting with a single page that
points to itself. With probability p < 1,
it links to one of the existing pages, chosen uniformly at random.
With probability 1-p it links to an existing page
with probability proportional to the number of incoming
links of that page. This rule reflects the common tendency for new
web pages to point to popular pages.
Write a program to simulate this process and plot a
histogram of the number of incoming links.
Answer: you should observe that the fraction of pages
with indegree k is proportional to k^(-1 / (1 - p)).

Web Exercises

All shortest pairs.
Write a program AllShortestPaths.java
that builds a graph from a file,
and reads source-destination requests from standard input and prints a shortest
path in the graph from the source to the destination.

Faster word ladders.
To speed things up (if the word list is very large),
don't write a nested loop to try all pairs of
words to see if they are adjacent. To handle 5 letter words,
first sort the word list. Words that only differ in their last letter
will appear consecutively in the sorted list. Sort 4 more times,
but cyclically shift the letters one position to the right so
that words that differ in the ith letter will appear consecutively
in one of the sorted lists.

Try out this approach using a larger
word list with words of different sizes. Two words of different
lengths are neighbors if the smaller word is the same as the bigger
word, minus the last letter, e.g., brow and brown.

Checkers.
Extend the rules of checkers to an N-by-N checkerboard.
Show how to determine whether a checker can become in king
in the current move. (Use BFS or DFS.)
Show how to determine whether black has a winning move.
(Find an Euler path.)

Combinational circuits.
Determining the truth value of a combinational circuit given
its inputs is a graph reachability problem (on a directed acyclic
graph).

Hex.
The game of Hex is played on a trapezoidal grid of hexagons....

Describe how to detect whether white or black has one the game
using breadth first search.

Prove that the game cannot end in a tie.
Hint: consider the set of cells reachable from
the left side of the board.

Prove that the first player in Hex can always win if they
play optimally.
Hint: if the second player had a winning strategy,
you could choose a random cell initially, and then just copy
the second players winning strategy. This is called
strategy stealing.

Explain the design flaw in the following implementation
of a ring graph.

The client can create a ring graph on V vertices, then
invoke the addEdge() method, which would result
in something other than a ring graph.

Topological robustness.
"Scale-free networks are robust to random damage, but
vulernable to malicious attack."
In a power law distributed small world network, deletion
of random node rarely has much effect on length of shortest
paths (since most shortest paths are routed through hubs). This is
in constrast to random network.
However, small world networks can catastrophically fail if the
hub is deleted or attacked (whereas random networks have no
central point of failure).
Random 95% of Internet links (6209 vertices an 24401 links)
can be removed and graph stays connected, but
destroying 2.3% of hubs would disconnect it.
Scale-free networks are also vulnerable to spreading viruses.
We should immunize the hubs.
Barabasi hypothesized that biological systems may be evolutionarily predisposed
to prefer small world networks since they are more robust to
perturbations (mutation, viral infection).
Viruses do adapt to attack the hubs.
Internet will still function if we randomly delete 80% of vertices.
Scale-free networks are topologically robust if gamma <= 3.
"promiscuous individuals are the vulnerable nodes to target in
safe-sex campaigns." Liljeros.

Power laws and scale-free networks.

Create a histogram of the the degrees of vertices in a graph
and look at distribution.
Power laws model scale invariance observed in many
natural phenomena, including physics, sociology, geography, and economics.
Direct result of self-organization.
US Road network is approximately random network with bell-shaped
connectivity distribution;
US airport network is scale-free with many hubs.
Think hub = airline hub.

WWW, Internet, food webs, social webs, pairwise interactions
between proteins in human body.
Probability than an actor has k neighbors follows a
power law-tail for large k: p(k) ~ k^(-2.3).
Web graph (in-degrees and out-degrees) obey a power
law.
Probability that k documents point to a given web page
~ k^(-2.1).

Generative models.

Reference: powerlaws.media.mit.edu.
Traditionally, scientists have modeled graphs using completed
ordered graphs (a grid or crystal lattice) or completely random graphs.
Most real world networks seem fall in between the two extremes.
People meet friends via existing friends, so the networks have
many local connections and highly clustered.
If person moves to a different city or joins a
group, they form long-range connections.

Such networks grow over time by the gradual addition of vertices and
edges with preferential connectivity.
New web page likely to connect to well-established web pages.
Barabasi model: rich get richer.
Start at time t = 1 with two vertices linked by d parallel edges.
At every time t >= 2, add a new vertex with d edges using
preferential attachment.
Probability that new vertex will be connected to vertex i
is ki / sum (kj), where ki is the connectivity of vertex i
(indegree for directed graphs).
Add edges one at a time, with subsequent edges doing preferential
attachment using updated degrees.
(Has property that for general average outdegree d, can
run d = 1 model for dn steps and collapse vertices
kd, kd - 1, ..., (k-1)d + 1 to make vertex k.)
Fraction of vertices with degree k converges to
2d(d+1)/(k(k+1)(k+2)) as n gets large. In other
words, a power law proportional with exponent 3,
where d only influences constant.
reference.
Variants lead to exponents between 2 and infinity.
(Exercise: simulate and compute an estimate.)
As a result, hubs are formed.

Lovasz. Fixed preferential attachment graph.
Fix n vertices with self-loops and m steps. In step i,
choose two random vertices independently with probabilityd
proportional to degree and connect them.
Random graph model of Erdos and Renyi not model many large networks
that arise in practice.

Kleinberg model.
No way for participants in Watts-Strogatz model
to find short paths in a decentralized
But Milgram experiment also had striking algorithmic
component - individuals can find short paths!
Kleinberg proposed making distribution of shortcuts obey
a power law, with probability proportional to the dth power
of the distance (in d dimensions).
Each vertex has one long-range neighbor.
Uniform over all distance
scales (same number of links at distances 1-10 as 10-100
or 100-1000. Greedy algorithm (forward message along
edge that brings it as close to target as possible in
terms of lattice distance) leads to log^2 N length paths.
The only exponent that works is d.