Graph structure in the Web

Structure: (AltaVista, May 1999).
Total: 203.5 million
WCC: 186.7 million
SCC: 56.5 million
IN: 43.3 million
OUT: 43.1 million
TENDRILS: 43.7 million
DISC: 16.8 million
Links: 1466 million = 7.2 per page.
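The SCC / IN / OUT split above can be illustrated on a toy graph. This is a minimal sketch, assuming the graph fits in memory and that one node known to lie in the giant SCC is supplied. The names `reachable`, `bow_tie` and the toy edges are illustrative and not taken from the study.

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """Return the set of nodes reachable from start by following adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to core_node's SCC."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
        nodes.update((u, v))
    forward = reachable(fwd, core_node)    # pages the core can reach
    backward = reachable(bwd, core_node)   # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,  # tendrils and disconnected pages
    }

# Toy graph: a -> b <-> c -> d, plus an unrelated edge e -> f.
toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d"), ("e", "f")]
parts = bow_tie(toy_edges, "b")
```

Separating TENDRILS from DISC would need one more undirected pass; the sketch lumps them together as OTHER.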

10 million WCC's (weakly connected components) almost all of size 1
(isolated page with no inlinks or outlinks.)

Two interesting points:
1. The seed set for AltaVista must be at least 10 million (actually,
presumably, much larger). Creating a huge seed set almost as important
as crawling.
2. How on earth does AltaVista find URLs for 10 million isolated pages?

Server logs

Links from now dead pages

Mentioned in text or other non-HTML indices.

Readable directories

Initial part of path names

???

Still mysterious.
I hope to hear from Kumar next week.

Other interesting measurements

The diameter of the SCC is at least 28.
The depth of IN is about 475? The depth of OUT is about 430? Very few
pages are anywhere near these depths.
The probability that there is a directed path from page U to page V is about
24%.
The probability that there is an undirected path from page U to page V is about
82%.
If there is a path from U to V, the average directed distance is about 16.15;
the average undirected distance is 6.83.
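One way such path probabilities and average distances could be estimated is by breadth-first search from randomly sampled pages. The following is a sketch under that assumption, not the procedure actually used for the measurements. The function names are hypothetical, and `adj` maps each page to the pages it links to.

```python
import random
from collections import deque

def bfs_distances(adj, start):
    """Directed BFS distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(adj, nodes, samples, rng):
    """Estimate P(path from U to V) and the mean distance over sampled pairs."""
    hits, total_dist = 0, 0
    for _ in range(samples):
        u, v = rng.sample(nodes, 2)      # two distinct random pages
        dist = bfs_distances(adj, u)
        if v in dist:
            hits += 1
            total_dist += dist[v]
    prob = hits / samples
    mean_dist = total_dist / hits if hits else float("nan")
    return prob, mean_dist

def undirected(adj):
    """Add the reverse of every link so that BFS ignores direction."""
    und = {}
    for u, vs in adj.items():
        for v in vs:
            und.setdefault(u, set()).add(v)
            und.setdefault(v, set()).add(u)
    return und
```

Calling `path_stats(adj, nodes, ...)` gives the directed figures; calling it on `undirected(adj)` gives the undirected ones. `nodes` must be a list.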

The connectivity of the WCC is not dependent on the existence of pages
of high in-degree.
If all pages of indegree >= k are removed, the WCC still has size W:

k             1000   100    10     5     4     3
W (millions)   177   165   105    59    41    15
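The experiment behind this table can be sketched on a small edge list: delete every page whose indegree is at least k, then measure the largest weakly connected component that remains. This is an illustrative union-find version; the function name is an assumption, and a real crawl would need an out-of-core implementation.

```python
from collections import Counter

def largest_wcc_without_hubs(edges, k):
    """Size of the largest weakly connected component after deleting
    every node whose indegree is >= k."""
    indegree = Counter(v for _, v in edges)
    nodes = {n for e in edges for n in e}
    kept = {n for n in nodes if indegree[n] < k}

    parent = {n: n for n in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if u in kept and v in kept:
            parent[find(u)] = find(v)       # union, ignoring link direction

    sizes = Counter(find(n) for n in kept)
    return max(sizes.values(), default=0)
```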

Inverse power distribution on the Web

The following Web quantities follow an inverse power distribution:

Number of pages with K inlinks vs K.
Exponent = -2.1, very stably
over many significant subsets of the Web. Same whether we consider all inlinks
or only remote inlinks.

"Zipf's Law, Benford's Law": a readable discussion. Benford's Law is
the even stranger observation that the first digit of numbers that
come up in practice is distributed with frequencies log10(1+1/D).
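As a quick illustration of the Benford frequencies log10(1+1/D), the following sketch tabulates them for D = 1..9. The function name is illustrative.

```python
import math

def benford(d):
    """Benford's Law frequency of leading digit d (1..9)."""
    return math.log10(1 + 1 / d)

frequencies = {d: benford(d) for d in range(1, 10)}
```

The nine values sum to 1, because the terms log10((D+1)/D) telescope to log10(10). A leading 1 is expected about 30% of the time and a leading 9 under 5%.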
X is a random variable.
X.f is a property of X whose value is a positive integer.
In an inverse power distribution
Prob(X.f = i) = C / i^A for A > 1.
(C is a constant equal to 1 over the sum from k=1 to infinity of
1/k^A).

If A <= 1, then the above sum diverges. Have to give an upper
bound M.

If a collection O of N items is chosen according to the above distribution
the expected number of items Q such that Q.f=i is N*C/i^A.
Conversely, O fits the inverse power distribution, if the above condition
holds over O.
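The normalizing constant and the expected counts defined above can be computed numerically. This is a minimal sketch: the infinite sum is approximated by a long partial sum (the `terms` parameter is an assumption of the sketch), and `upper` plays the role of the bound M when A <= 1.

```python
import math

def power_law_constant(a, upper=None, terms=1_000_000):
    """C = 1 / sum_{k=1..M} 1/k^a.
    If upper is None the infinite sum is approximated by `terms` terms,
    which is only meaningful for a > 1."""
    m = upper if upper is not None else terms
    return 1.0 / math.fsum(k ** -a for k in range(1, m + 1))

def expected_counts(n, a, values, upper=None):
    """Expected number of items with X.f = i, for each i in values: N*C/i^a."""
    c = power_law_constant(a, upper)
    return {i: n * c / i ** a for i in values}
```

For A = 2 the sum is pi^2/6, so C is about 0.608.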

Zipf distribution

Let O be a bag of items of size N over a space W. For w in W, let count(w,O)
be the number of times w occurs in O. Let w_1, w_2, ... be the
items in W sorted by decreasing value of count(w,O) (break ties arbitrarily).
We define the rank of w_i in O to be i. That
is, the most common item has rank 1; the second most common has rank 2;
etc. O follows the Zipf distribution if
count(w_i,O) = N*C/rank(w_i,O)^A for some A > 1.

As before, if A <= 1, then W has to be finite, of size L.
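The rank construction in the definition can be made concrete with a small sketch that turns a bag of items into a rank / count table. The function name and the sample sentence are illustrative.

```python
from collections import Counter

def rank_counts(bag):
    """Return [(rank, item, count)] sorted by decreasing count."""
    counts = Counter(bag)
    ordered = counts.most_common()   # ties keep first-seen order, i.e. arbitrary
    return [(rank, item, count)
            for rank, (item, count) in enumerate(ordered, start=1)]

words = "the cat and the dog and the bird".split()
table = rank_counts(words)
```

Plotting count against rank from such a table, on log-log axes, is the usual check for a Zipf distribution.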

Note: The power distribution is often called the "power law" and
the Zipf distribution generally called "Zipf's law", but the only
"law" is that a lot of things follow this distribution. They are
also called "long tail" or "fat tail" distributions because the
probability of extreme
values decreases much more slowly than with other distributions such
as the exponential or the normal.

Things that follow the Zipf distribution

Frequency of words in a corpus of text. This is one of the main reasons
that learning from corpora is so hard. Having seen a training corpus
of N words, the probability that the next word you see is one you have
already seen goes up only like log(N).

Population of cities.

Incomes

Numbers of papers published by scientists

Distribution of species among genera

Access statistics for web pages.

Number of times users at a single site access particular pages.

In many applications, the exponent "A" is quite close to 1, which
is the "standard" Zipf distribution. As indicated above,
there is in principle a large qualitative difference between an exponent that
is greater than 1, which can accommodate an infinite range of values
and an exponent less than or equal to 1, where the tail must ultimately
depart from the power law.

If A <= 3 in an inverse power distribution, then the variance (and the
standard deviation)
is infinite; if A <= 2, even the mean is infinite. Conversely, if you're
doing statistical analysis of something
and you come up with a strangely large standard deviation, you might consider
whether you're looking at an inverse power distribution.

To find the optimal inverse power distribution matching a data set,
plot the data set on a log-log graph. If y = C/x^A then
log(y) = log C - A log(x). So the log-log graph of the distribution
is a straight line with slope = -A and y-intercept = log C. You can
find the best fit by doing any standard linear regression (e.g. least
squares).
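The log-log fit can be sketched with ordinary least squares in plain Python. This is an illustration only: it assumes all x and y values are positive (zero counts would have to be dropped first), and the function name is hypothetical. Since log(y) = log C - A log(x), the slope of the fitted line is -A and its intercept is log C.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y).
    Returns (A, C) for the model y = C / x^A."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic data that follows y = 5000 / x^2.1 exactly.
xs = list(range(1, 101))
ys = [5000 / x ** 2.1 for x in xs]
a_hat, c_hat = fit_power_law(xs, ys)
```

On exact synthetic data the fit recovers A and C up to rounding. On real data the sparse, noisy tail tends to dominate a naive fit, so the result should be read with care.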

Connection between the inverse power and the Zipf distribution.

If X.f follows the inverse power distribution, then
count(X.f) follows the Zipf distribution. Argument: count(X.f=i) is
(characteristically) a decreasing function of i; hence, the rank of the
value X.f=i is just i, so it's the same distribution. In fact, the
fit to the Zipf distribution of count as a function of rank is often
better than the fit to the power law distribution, because

by definition, count is a non-increasing function of rank;

the fit tends to be better even beyond that effect.

(If you plot count against X.f, you can turn this into a plot of
count against rank by (a) reordering the points on the x-axis so
as to go in decreasing order by count; (b) packing the points to the
left to eliminate gaps.)
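The two-step conversion from a count-vs-value plot to a count-vs-rank plot can be sketched as follows. The function name and the sample points are made up for illustration.

```python
def to_rank_plot(points):
    """points: list of (value, count) pairs.
    Returns a list of (rank, count) pairs."""
    # (a) reorder by decreasing count
    counts = sorted((count for _, count in points), reverse=True)
    # (b) pack to the left: ranks are consecutive integers, so gaps vanish
    return list(enumerate(counts, start=1))

points = [(1, 50), (2, 20), (3, 25), (7, 4), (10, 6)]
ranked = to_rank_plot(points)
```

By construction the result is non-increasing in rank, which is the first reason the Zipf fit tends to look cleaner.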

Two directions for inverse power distribution

The inverse power law can apply over a collection of sets in either of two
directions:

1. The size of the Kth largest set is about C/K^A.

2. Let Q(K) be the collection of sets of size K.
Then |Q(K)| is about C/K^A.

E.g. with (1) we would plot the Kth largest indegree vs. K; with (2)
we would plot the number of nodes of indegree K against K.

The two directions
are actually the same rule (under reasonable smoothness and monotonicity
assumptions); if a distribution satisfies (1) with exponent
A then it satisfies (2) with exponent 1+1/A, and vice versa.
However, with approximate data (i.e. any actual data)

the quality of fit may not be the same;

the best fit exponents to the two graphs may not exactly satisfy the
above relation

power laws, like other distributions, often break down at one or both
extremes.

Word frequency is an example of the first direction (count
of the Kth most common word vs. K); inlink count is an example of the second
(rank 1 = the most common number of inlinks, rather
than the page with the most inlinks).
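The relation between the exponents in the two directions, A and 1+1/A, can be recovered with a short continuous approximation. This is a heuristic sketch under the smoothness and monotonicity assumptions mentioned above, not a rigorous proof; s(K) and K(s) are notation introduced here.

```latex
% Direction (1): the K-th largest set has size
s(K) = C\,K^{-A}.
% Invert: the number of sets of size at least s is
K(s) = \left(\frac{C}{s}\right)^{1/A}.
% Direction (2): the number of sets of size about s is the density
\left|\frac{dK}{ds}\right| = \frac{C^{1/A}}{A}\; s^{-(1 + 1/A)},
% i.e. a power law in s with exponent 1 + 1/A.
```

For example, A = 1 in direction (1) corresponds to exponent 2 in direction (2). An indegree exponent of 2.1 in direction (2) corresponds to a rank exponent of 1/1.1, about 0.91.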

Why do you get power law distributions?

Stochastic model. Suppose that at every time step, a new page and a new link are
created. With probability P, the link points to a page chosen at random
uniformly; with probability 1-P the link points to a page chosen by random
choice weighted by the number of inlinks. Then for large values of
I, the distribution of nodes with I inlinks follows a power-law distribution
with exponent (2-P)/(1-P). (Herbert Simon)
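The stochastic model above can be simulated at small scale. This is a minimal sketch under stated assumptions: the process starts from a single page, the link target is always chosen among pages that existed before the current step, and the source of the link is ignored because only indegree matters. The names `simulate_simon` and `degree_histogram` are illustrative.

```python
import random
from collections import Counter

def simulate_simon(steps, p, seed=0):
    """Each step adds one new page and one new link.
    With probability p the link target is a uniformly random existing page;
    otherwise it is chosen with probability proportional to current indegree."""
    rng = random.Random(seed)
    indegree = [0]        # indegree[v] for each page; start with one page
    targets = []          # one entry per link: the page it points to
    for _ in range(steps):
        if not targets or rng.random() < p:
            v = rng.randrange(len(indegree))   # uniform choice
        else:
            v = rng.choice(targets)            # weighted by number of inlinks
        indegree[v] += 1
        targets.append(v)
        indegree.append(0)                     # the newly created page
    return indegree

def degree_histogram(indegree):
    """Map indegree value -> number of pages with that indegree."""
    return Counter(indegree)
```

Picking a uniformly random entry of `targets` selects a page with probability proportional to its indegree, so no explicit weights are needed. With P = 0.5 the predicted exponent is (2-0.5)/(1-0.5) = 3, which could be compared against a log-log plot of the histogram for a long run.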

By contrast, here are two different stochastic models that lead to very
different distributions. Let N be the number of pages = 203 million
and L be the number of links = 1.46 billion.

Method 1

for I := 1 to L {
    choose a random page U;
    choose a random page V;
    place a link from U to V;
}

Number of inlinks follows a binomial distribution. Prob. that there exists
any page with 100 inlinks is about 10^-62. (Note: these and
the next figure were done on the back of an envelope and could easily
be off by a factor of 100 or so, but they give the right idea.)
Prob. that there exists any page with 200 inlinks is about 10^-192.

Method 2

for each page V do {
    loop {
        flip a coin with weight 7.2 for heads, 1 for tails;
        if tails exitloop;
        choose a random page U;
        place a link from U to V;
    }
}

Total number of inlinks will be very nearly L, to within about 0.0025%.
Number of inlinks follows an exponential distribution. Expected number of
pages with 100 inlinks = about 200. Probability that there exists any
page with 200 inlinks = about 2*10^-4.
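Both methods can be tried at small scale to see how thin their tails are. This sketch assumes 10,000 pages instead of 203 million and tracks only the indegree of each page; the source page U does not affect indegree, so it is not drawn. Function names are illustrative.

```python
import random
from collections import Counter

def method1(n_pages, n_links, rng):
    """Method 1: every link picks its target page uniformly at random."""
    indegree = Counter()
    for _ in range(n_links):
        indegree[rng.randrange(n_pages)] += 1   # only the target V matters here
    return [indegree[v] for v in range(n_pages)]

def method2(n_pages, mean_links, rng):
    """Method 2: each page keeps receiving links until the coin shows tails."""
    p_heads = mean_links / (mean_links + 1)     # weight 7.2 : 1 in the notes
    indegree = []
    for _ in range(n_pages):
        k = 0
        while rng.random() < p_heads:
            k += 1
        indegree.append(k)
    return indegree

rng = random.Random(0)
d1 = method1(10_000, 72_000, rng)
d2 = method2(10_000, 7.2, rng)
```

Comparing `max(d1)` with `max(d2)` shows the second method reaching noticeably larger indegrees, though still nowhere near what a power law produces.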

Contrast actual value, predicted by power law, of about 30,000 pages
(? eyeballed from small graph) with 100 inlinks and about 8000 pages
with 200 inlinks. The point is that "ordinary" random processes don't
give power laws, or anything close to them.
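The order-of-magnitude estimates for the two methods can be recomputed directly. This is a sketch under assumptions: Method 1 is approximated by a Poisson distribution with mean 7.2, Method 2 by a geometric distribution with the same mean, and everything is kept in log10 to avoid underflow. The notes warn that their own figures are back-of-envelope, so the outputs need not match them exactly; only the contrast with the power law matters.

```python
import math

N_PAGES = 203.5e6   # pages in the crawl
MEAN = 7.2          # mean inlinks per page

def log10_expected_method1(k):
    """log10 of the expected number of pages with exactly k inlinks,
    using a Poisson approximation to the binomial of Method 1."""
    log_p = -MEAN + k * math.log(MEAN) - math.lgamma(k + 1)
    return (math.log(N_PAGES) + log_p) / math.log(10)

def log10_expected_method2(k):
    """Same quantity for the geometric distribution of Method 2."""
    p_heads = MEAN / (MEAN + 1)
    log_p = math.log(1 - p_heads) + k * math.log(p_heads)
    return (math.log(N_PAGES) + log_p) / math.log(10)

def log10_power_law_ratio(k1, k2, exponent=2.1):
    """log10 of count(k2) / count(k1) under a power law with this exponent."""
    return -exponent * math.log10(k2 / k1)
```

Under a power law with exponent 2.1, doubling the indegree divides the count by only about 4. Under either random model the count collapses by many orders of magnitude.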

Of course this is not a plausible model of the Web, so there's a small
industry in constructing more plausible stochastic models of the Web.
No one (as far as I have found) has yet constructed a model which
is known to give all the structural features of the Web.

Multiplicative model. If you have a large collection of identically distributed
independent random variables, and you multiply them, then you are just
adding the logs, so the sum of the logs follows a normal curve, so you
get what's called a log-normal distribution, which looks a lot like a power
law in the middle range.
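The multiplicative argument can be checked numerically: the log of a product is the sum of the logs, so logs of products of many independent factors should look roughly normal. This is a sketch; the uniform(0.5, 1.5) factor distribution is an arbitrary assumption chosen for illustration.

```python
import math
import random

def log_of_product(factors):
    """The log of a product equals the sum of the logs."""
    return sum(math.log(f) for f in factors)

def sample_products(n_samples, n_factors, seed=0):
    """Draw products of i.i.d. positive factors and return their logs."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_samples):
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        logs.append(log_of_product(factors))
    return logs

logs = sample_products(2000, 50)
mean_log = sum(logs) / len(logs)
```

A histogram of `logs` should be close to a bell curve, which is what makes the products themselves log-normal.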

Information theoretic model (Mandelbrot). Can show that a power-law
distribution of word frequencies maximizes information over cost.

Significance of these Web measurements

Well, that's a little harder. We've seen the implication for crawlers;
need immense seed set, techniques for finding URLs without links.
Also:

Support for efficient algorithms, or explanation of why algorithms
such as PageRank run as rapidly as they do. Problems that would be
intractable on a random graph (e.g. finding small cliques) may be
tractable due to structure.

Improved characterization for browsing and ranking.

Data compression. Whenever you understand structure, you can
use that to compress.

Data mining.

Self-Similarity

Made measurements over various subsets of the Web. Generally, the
results were that many structural properties of the Web apply
to significant subsets as well.

Subsets

Keyword sets: All pages that include specified keyword(s).

Baseball

Golf

Math

MP3

Restaurant

Baseball Yankees

Golf Tiger Woods

Math Geometry

MP3 Napster

Restaurant Sushi

IBM INTRANET

100 web sites

Geographic location: Web sites that have references to geographic
locations between Denver on the east, Nikolski, Alaska on the west, Vancouver
on the north, and Brownsville, Texas on the south.

7 random collections of websites: STREAM1 ... STREAM7, each with
about 6 million pages.

Some others, but these are the most interesting.

Some Results

Different measurements for different subgraphs: Why?

STREAM1 ... 7:
Remarkable consistency as regards:

Links per node.

Expansion factor (between 2.01 and 2.06).

Indegree exponent (between 2.06 and 2.13)

Outdegree exponent (between 2.12 and 2.32)

SCC exponent (between 2.11 and 2.16)

WCC exponent (between 2.25 and 2.32)

WCC/nodes (between 0.69 and 0.72)

SCC/WCC (between 0.22 and 0.24)

IN/WCC (between 0.19 and 0.23)

OUT/WCC (between 0.23 and 0.24)

K_{5,7} factor: size of the set divided by the number of nodes
that are in a bipartite K_{5,7} graph (between 43.5 and 50.1)

Single keyword sets: Sizes between 336,500 (baseball) and 831,700 (math).

Bipartite graphs (hubs and authorities) in the Web

Technical issues: Exclude aliases, copies, near-copies, nepotistic cores
(hubs from same website). There are lots of copies of Yahoo pages.

Algorithmics:

1. Prune all potential hubs with out-degree < i and all
authorities with indegree < j. Iterate. (As we have seen, eliminates
most of the pages on Web.)

2. "Inclusion/exclusion:" To check whether a hub H of out-degree i
is in a K_{i,j}, you (a) collect all the i authorities it
points to; (b) compute the intersection of all their sets of tails of
inlinks. If this intersection has q elements then H is in a K_{i,q}
graph, which can be pruned. If q < j, then prune H and propagate.
This step finds most of the bipartite graphs.
3. The remaining graph is now small enough for a search for the remaining
graphs to be tractable.
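Steps 1 and 2 above can be sketched on a small set of (hub, authority) links. This is an illustration under assumptions, not the implementation used in the study: the function names are hypothetical, i is assumed to be at least 1, and the "propagate" part of step 2 is left to the caller, who would rerun step 1 after removing a hub.

```python
from collections import defaultdict

def prune_by_degree(edges, i, j):
    """Step 1: repeatedly drop hubs with out-degree < i and authorities with
    indegree < j until nothing changes. edges holds (hub, authority) pairs."""
    edges = set(edges)
    while True:
        out_deg, in_deg = defaultdict(int), defaultdict(int)
        for h, a in edges:
            out_deg[h] += 1
            in_deg[a] += 1
        kept = {(h, a) for h, a in edges
                if out_deg[h] >= i and in_deg[a] >= j}
        if kept == edges:
            return edges
        edges = kept

def core_for_hub(edges, hub, i, j):
    """Step 2 for one hub of out-degree exactly i: intersect the sets of pages
    linking to each of its i authorities. Returns (hubs, authorities) forming
    a K_{i,q} with q >= j. Returns None when the test does not apply or the
    intersection has fewer than j elements."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for h, a in edges:
        out_nbrs[h].add(a)
        in_nbrs[a].add(h)
    authorities = out_nbrs[hub]
    if len(authorities) != i:
        return None                     # only hubs of out-degree exactly i
    common = set.intersection(*(in_nbrs[a] for a in authorities))
    if len(common) < j:
        return None                     # q < j: the caller should prune hub
    return common, authorities
```

The intersection is cheap precisely because step 1 has already removed most pages, so the neighbour sets that remain are small.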

Results:
About 200,000 bipartite graphs; almost all are coherent in terms of
topics. Many unknown to Yahoo. Hubs in a community last longer than
average Web pages.