15.
PageRank sucks!
(or NOT?)
The Hollywood graph we used contains 2,000,000
nodes. Most of them are completely unknown!
PageRank is not singling out the best actors...
...but still it is not pointing to random individuals, is it?

24.
The glass is half full
Link Analysis is good, but you cannot expect
any centrality index to work on any network
Probably PageRank is scarcely useful on Hollywood,
but maybe other measures would work like a charm
Betweenness? Closeness? Katz? ...
A problem here: some indices are computationally
unfeasible on large networks!

32.
My point, today
By contract, I have to convince you that networks
contain a great deal of information useful for IR
But you have to use them in a proper way (i.e., to
compute suitable indices)
And more often than not this calls for new algorithms
because of their size (and sometimes density)

36.
Centrality in social sciences
a historical account
First works by Bavelas at MIT (1948)
This sparked countless works (Bavelas 1951; Katz 1953;
Shaw 1954; Beauchamp 1965; Mackenzie 1966; Burgess
1969; Anthonisse 1971; Czapiel 1974...) that Freeman
(1979) tried to summarize concluding that:
several measures are often only vaguely
related to the intuitive ideas they purport to
index, and many are so complex that it is
difficult or impossible to discover what, if
anything, they are measuring

43.
The 1990s revival
Link Analysis Ranking
With the advent of search engines, there was a strong
revamp of centrality (LAR in this context)
New scenarios
• graphs are mainly directed
• they are huge
• new attention to efficiency
LAR is the ungrateful reincarnation of Centrality

62.
A tale of three tribes
Indices based on degree
Indices based on the number of paths or shortest paths
(geodesics) passing through a vertex
Indices based on distances from the vertex to the other
vertices
Let me call these distance-based indices geometric

65.
Three dogs strive for a bone, and
the fourth runs away with it...
The advent of Link Analysis pushed for a third
(winning) tribe: spectral indices
Some of the geometric indices also have an equivalent
spectral definition

88.
3) The distance tribe
a new member
Give a warm welcome to Harmonic centrality:
The denormalized reciprocal of the harmonic mean of all
distances (even ∞)
Inspired by the use of the harmonic mean in
(Marchiori, Latora 2000)
Probably already appeared somewhere (e.g., quoted for
undirected graphs in Tore Opsahl’s blog)

$\mathrm{charm}(x) = \sum_{y \neq x} \frac{1}{d(y, x)}$
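To make the definition concrete, here is a toy-scale Python sketch (the function and graph names are mine, not from the talk); it obtains $d(y, x)$ for every $y$ with a single BFS on the reversed graph, and unreachable nodes contribute $1/\infty = 0$:

```python
from collections import deque

def harmonic_centrality(graph, x):
    """Harmonic centrality of x: sum over y != x of 1/d(y, x),
    where unreachable y contribute 0 (i.e., 1/infinity).
    `graph` maps each node to its list of out-neighbors."""
    # Build the reversed graph: one BFS from x on it yields d(y, x) for all y.
    reverse = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            reverse[v].append(u)
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for w in reverse[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(1.0 / d for y, d in dist.items() if y != x)

# Tiny example: a -> b, a -> c, b -> c
g = {'a': ['b', 'c'], 'b': ['c'], 'c': []}
print(harmonic_centrality(g, 'c'))  # 1/d(a,c) + 1/d(b,c) = 1 + 1 = 2.0
```

Nodes that cannot reach $x$ simply drop out of the sum, which is what lets harmonic centrality cope with ∞ distances where closeness cannot.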

101.
Seeley index
(Seeley 1949)
It is essentially like PageRank with no damping factor;
equivalently, the stable state of the natural random walk
on G:

$c_{\mathrm{Seeley}}(x) = \left( \lim_{t \to \infty} \frac{\mathbf{1}^T}{n}\, G_r^{\,t} \right)_x$

It is obtained as the limit of PageRank when the
damping goes to 1
In general it is a dominant (left) eigenvector of $G_r$,
the row-normalized adjacency matrix
105.
HITS
(Kleinberg 1997)
The idea is to start from the system:

$c_{\mathrm{auth}}(x) = \sum_{y \to x} c_{\mathrm{hub}}(y) \qquad c_{\mathrm{hub}}(x) = \sum_{x \to y} c_{\mathrm{auth}}(y)$

HITS centrality is defined to be the “authoritativeness”
score $c_{\mathrm{auth}}$
It is a dominant eigenvector of $G^T G$
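The mutually reinforcing system above can be run as a small power iteration; a hedged Python sketch (function name and toy graph are my own illustration):

```python
import math

def hits(graph, iterations=50):
    """Iterate the mutual-reinforcement system:
    auth(x) = sum of hub(y) over in-neighbors y -> x,
    hub(x)  = sum of auth(y) over out-neighbors x -> y,
    normalizing each step; auth converges toward a dominant
    eigenvector of G^T G."""
    nodes = list(graph)
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return auth, hub

# b and c both point to d; a points to b and c
g = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
auth, hub = hits(g)
print(max(auth, key=auth.get))  # 'd' gets the top authority score
```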

110.
HITS (and SALSA, etc.)
WARNING: these measures were proposed precisely
for ranking results in hyperlinked collections
They should be applied not to the whole graph, but to a
suitable subgraph derived from the query
How the subgraph is derived matters greatly for
effectiveness (Najork, Gollapudi, Panigrahy 2009)
Not really the central point here, though...

133.
An axiomatic slaughter

Index         Size     Density  Monot.
Degree        only k   yes      yes
Betweenness   only p   no       no
Katz          only k   yes      yes
Closeness     no       no       no
Lin           only k   no       no
Harmonic      yes      yes      yes
PageRank      no       yes      yes
Seeley        no       yes      no
HITS          only k   yes      no
SALSA         no       yes      no

187.
TREC .gov2
150 queries (query title words combined with AND; with
stemming, no stopword elimination)
Generated the result graph using the method described
by Najork et al. 2009 (a variant of Kleinberg’s HITS
graph, taking a in-links and b out-links)
Considered many combinations: here I present only
the cases a=b=0 (i.e., subgraph induced by the result set)
With or without intra-host links

199.
Intra-host links?
Keep them or throw them away?
Most indices get better if you throw them away...
Throwing such links away injects a lot of information,
but apparently harmonic doesn’t need it!
...but harmonic is better (and best of all) with the
whole thing!

270.
Easy but expensive
Each set uses linear space; overall quadratic
Impossible!
But what if we use approximate sets?
Idea: use probabilistic counters, which represent sets but
answer only size queries
Very small! With 40 bits you can count up to 4 billion
with a standard deviation of 6%

274.
Use the Force
HyperBall (http://webgraph.dsi.unimi.it/)
does the job
Fully exploits multicore architectures; uses broadword
microparallelization
Works like a charm on networks with hundreds of
millions of nodes

279.
What is HyperBall
It uses the diffusion idea to compute (at the same
time):
Lin + Closeness + Harmonic + Number of reachable
nodes...
It employs Flajolet’s HyperLogLog counters to store
sets
More on HyperBall in the next few slides...
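The diffusion can be sketched in Python with exact sets standing in for the HyperLogLog counters (so this toy version uses the quadratic space that HyperBall avoids; all names are mine). Each pass unions a node's ball with its successors' balls; the size increment at step t counts the nodes at distance exactly t, which is all harmonic centrality needs. It uses outgoing distances d(x, y); run it on the transposed graph for the incoming-distance variant:

```python
def hyperball_exact(graph, max_t=10):
    """Toy HyperBall: ball[x] = nodes reachable from x in <= t steps.
    Exact sets stand in for HyperLogLog counters, so this is only
    feasible on tiny graphs."""
    ball = {x: {x} for x in graph}
    harmonic = {x: 0.0 for x in graph}
    for t in range(1, max_t + 1):
        changed = False
        nxt = {}
        for x in graph:
            b = set(ball[x])
            for y in graph[x]:
                b |= ball[y]  # set union plays the role of register-wise max
            # |ball_t| - |ball_{t-1}| nodes sit at distance exactly t
            harmonic[x] += (len(b) - len(ball[x])) / t
            nxt[x] = b
            changed = changed or (len(b) != len(ball[x]))
        ball = nxt
        if not changed:
            break
    return harmonic

g = {'a': ['b'], 'b': ['c'], 'c': []}
print(hyperball_exact(g))  # {'a': 1.5, 'b': 1.0, 'c': 0.0}
```

Swapping each exact set for a HyperLogLog counter is what makes this run on billions of nodes: union becomes a register-wise maximum and only approximate sizes survive, which is all the formula above consumes.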

291.
HyperLogLog counters
Instead of actually counting, we observe a statistical
feature of a set (think stream) of elements
The feature: the number of trailing zeroes of the value
of a very good hash function
We keep track of the maximum m (log log n bits!)
The number of distinct elements is ∝ 2^m
Important: the counter of stream AB is simply the
maximum of the counters of A and B!
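A single-register Python sketch of the idea (my own illustration: real HyperLogLog keeps many registers and averages them, so the one-register estimate here is extremely noisy):

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits of n (taken as 0 for n == 0)."""
    return (n & -n).bit_length() - 1 if n else 0

def sketch(items):
    """Keep the maximum number of trailing zeros seen over a good
    hash of each element: an O(log log n)-bit cardinality sketch."""
    m = 0
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], 'big')
        m = max(m, trailing_zeros(h))
    return m

def estimate(m):
    return 2 ** m  # number of distinct elements is proportional to 2^m

a = sketch(range(1000))
b = sketch(range(500, 1500))
# The sketch of the concatenated stream AB is just the max of the sketches:
print(sketch(list(range(1000)) + list(range(500, 1500))) == max(a, b))  # True
```

The max-merge property in the last line is exactly what HyperBall exploits: unioning two balls costs a register-wise maximum, never an element-by-element union.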

295.
Other ideas
We keep track of modifications: we do not maximize
with unmodified counters
Systolic computation: each modified set signals back to
its predecessors that something is going to happen (many
fewer updates!)
Multicore exploitation by decomposition: each task
updates just 1000 counters (almost linear scaling)

300.
Footprint
Scalability: a minimum of 20 bytes per node
On a 2TiB machine, 100 billion nodes
Graph structure is accessed by memory-mapping in a
compressed form (WebGraph)
Pointers to the graph are stored using succinct lists
(Elias-Fano representation)