On Google's PageRank

Anyone who runs a website knows that a successful site has to have a high Google PageRank. Unfortunately, though, Google updates the publicly visible PRs only every few months. Moreover, the PR is given only in discrete increments, so it is hard to tell whether a site improved when the actual increase averages less than a full PR point. Since I was curious about the PR of my own site, I thought I should do something about it. And I did.

The PR is almost entirely about incoming links and their quality. Thus, an accurate PageRank determination requires detailed knowledge of the PageRank of each and every back-linked page. It would therefore be a formidable task to calculate the PageRank of a high-profile site such as cnn.com.

Statistically, however, sites with many incoming links tend to have a similar distribution of link quality (many low-quality links, fewer high-quality links, and so forth). So in principle, we could try to estimate the PageRank using just the number of inbound links to a site. Moreover, since the PageRank is logarithmic in the number of inbound links, we could, for example, approximate it as:

$$ PR \approx f(N_{in}) \approx a \log_{10}(N_{in}) $$

This approximation is already not too bad, but one can do somewhat better, as we'll see below.
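As a minimal sketch, the approximation above can be written out directly. The coefficient `a` here is an assumption for illustration; the fit described later in the post gives a slope of roughly 1.1, i.e., a logarithm base of about 8.

```python
import math

# Assumed coefficient for illustration; the post's linear fit finds ~1.12.
a = 1.1

def estimate_pr(n_inbound):
    """Rough PageRank estimate from the number of inbound links alone."""
    if n_inbound < 1:
        return 0.0
    return a * math.log10(n_inbound)

# A site with ~100,000 inbound links would be estimated at PR ~ 5.5.
print(round(estimate_pr(100_000), 1))
```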

Figure 1 - Actual PageRank vs. inbound site links, for 120 random sites. The horizontal scale is logarithmic. The nearly linear trend implies that the Google PR is logarithmic as well. The linear fit gives a base of about 8.

I took about 120 random websites (all those in my bookmarks file) and plotted the actual PR as a function of log10(Nin). One can see from this graph that the slope varies, so the PR is not exactly linear in log10(Nin). This could arise for several reasons. For example, Google's algorithm may not be exactly logarithmic, or the assumption that the average quality of the links does not depend on the quality of the site may be an oversimplification.

The linear fit has a slope of 1.12, which implies a one-unit PR increase for every factor-of-7.8 increase in the number of inbound links. By comparing the fit to the actual PR data, one finds that the standard deviation is 1.2 PR units, and that about 82% of the sites have a predicted PR that is either correct or within +/-1 PR unit of the actual value.
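The linear fit itself is a simple least-squares line in log space. The sketch below shows the idea with a hand-rolled fit; the (inbound links, PR) pairs are made-up placeholders, not the actual 120-site data set.

```python
import math

# Placeholder (inbound links, actual PR) pairs for illustration only.
data = [(30, 2), (200, 3), (1_500, 3), (12_000, 5),
        (90_000, 5), (700_000, 7), (5_000_000, 8)]

xs = [math.log10(n) for n, _ in data]
ys = [pr for _, pr in data]
n = len(data)

# Ordinary least-squares line: PR ~ intercept + slope * log10(N_in).
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# A slope `a` means one PR unit per factor of 10**(1/a) in inbound
# links; a slope of 1.12 corresponds to a base of ~7.8.
base = 10 ** (1 / slope)
print(f"slope={slope:.2f}, base={base:.1f}")

# Predicted PR for a hypothetical site with 50,000 inbound links:
pr_est = intercept + slope * math.log10(50_000)
print(f"predicted PR: {pr_est:.1f}")
```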

Figure 2 - Histogram of the predicted PR minus the actual PR for 120 random sites. In just over 50% of the cases, the predicted PR is the same as the real one, while in another 40% it is within one PR unit of the real value.
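The accuracy summary behind such a histogram is straightforward to compute: given predicted and actual PR values (placeholders below, not the real 120-site data), count how often the prediction is exact or off by at most one PR unit.

```python
# Placeholder predicted/actual PR values for illustration only.
actual    = [2, 3, 3, 4, 5, 5, 6, 7, 7, 8]
predicted = [2, 2, 3, 4, 6, 5, 6, 9, 6, 8]

# Fraction of exact predictions, and fraction within +/-1 PR unit.
diffs = [p - a for p, a in zip(predicted, actual)]
exact = sum(1 for d in diffs if d == 0) / len(diffs)
within_one = sum(1 for d in diffs if abs(d) <= 1) / len(diffs)
print(f"exact: {exact:.0%}, within +/-1: {within_one:.0%}")
```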

To improve the fit, we can use a higher-order polynomial in log10(Nin) and also bring in other available data, such as the number of pages in a site, the back-links as seen by different search engines (which are sensitive to pages of different quality) and the number of internal links within a site. I will spare you the ugly-looking fitting formula (it has 8 different terms).

The standard deviation obtained with the improved fit is only 0.85 PR units, and 91% of the sites have a predicted PR that is either correct or within +/-1 PR unit. It is probably impossible to obtain a notably better fit with the data I used; to improve the PR prediction further, one would in principle require more data, such as the actual PR of the back-linked pages.
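A multi-term fit of this kind can be sketched as a small linear least-squares problem. The feature set and all the numbers below are assumptions for illustration (the post's actual 8-term formula is not reproduced): a quadratic term in log10(Nin) stands in for the higher-order polynomial, and made-up page and internal-link counts stand in for the extra data.

```python
import math

# Placeholder rows: (inbound links, pages in site, internal links, actual PR).
sites = [
    (30, 15, 120, 2), (200, 40, 500, 3), (1_500, 90, 2_000, 4),
    (12_000, 300, 9_000, 5), (90_000, 1_000, 40_000, 6),
    (700_000, 5_000, 200_000, 7), (5_000_000, 20_000, 900_000, 8),
]

def features(n_in, pages, internal):
    # Constant term plus logs of each count; the quadratic log term
    # plays the role of the higher-order polynomial in log10(Nin).
    x = math.log10(n_in)
    return [1.0, x, x * x, math.log10(pages), math.log10(internal)]

X = [features(n, p, i) for n, p, i, _ in sites]
y = [pr for _, _, _, pr in sites]

# Solve the normal equations (X^T X) b = X^T y by Gaussian elimination
# with partial pivoting (pure Python, no external libraries).
k = len(X[0])
A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, k):
        f = A[r][col] / A[col][col]
        for c in range(col, k):
            A[r][c] -= f * A[col][c]
        b[r] -= f * b[col]
coef = [0.0] * k
for r in range(k - 1, -1, -1):
    coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]

predicted = [sum(c * f for c, f in zip(coef, row)) for row in X]
```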