January 26, 2005

More arithmetic problems at Google

Jean Véronis has some further observations and speculations about Google counts. He discovers another experimental situation in which there's apparently a large bias that increases with the size of the counts. I saw something similar earlier in the ratio of {X OR X} to {X} counts; Jean finds a systematic and increasing error in comparing counts in English pages to counts in all pages. As Jean points out, in both experiments there's a sort of phase change at counts of around 0.5×10⁹.

So to the earlier piece of advice ("be careful about counts much greater than 100K") we can add another one ("pay no attention at all to counts above about 500M"). Or if you care about counts, use Yahoo, which (at least for the experimental situations examined) doesn't show these weird errors.

I don't entirely agree with Jean's characterization of the situation, however. He seems to feel that if "the real index on which the data centers are operating [may] be much smaller, and in such a case an extrapolation would be done", then this would constitute "faking" the counts. I think this is quite unfair, because I don't see how it could be any other way. There's simply no way that Google -- or Yahoo or any other search provider -- could service every query involving more than one search term by doing a full intersection of result sets that might number in the hundreds of millions or even billions. They will often have no choice but to "do a prefix and then extrapolate", as my correspondent from Google put it. The issue is how accurate the extrapolation is. And at the moment, Google's extrapolation clearly sucks.
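To make "do a prefix and then extrapolate" concrete, here's a minimal sketch of the general idea -- intersect only a prefix of one sorted postings list against the other, then scale the observed hit rate up to the full list. This is a hypothetical illustration, not Google's actual method; the function name, the prefix size, and the assumption that the prefix is representative are all mine.

```python
# Hypothetical sketch of "do a prefix and then extrapolate" -- NOT Google's
# actual algorithm. Estimates |A ∩ B| from a prefix of the first list.

def estimated_intersection_count(a, b, prefix=1000):
    """Estimate the intersection size of two postings lists of doc ids.

    Accuracy depends on the prefix being representative of the whole
    list -- exactly the assumption that can fail badly in practice.
    """
    if not a or not b:
        return 0
    b_set = set(b)                       # full second list, for membership tests
    sample = a[:min(prefix, len(a))]     # the "prefix" of the first list
    hits = sum(1 for doc_id in sample if doc_id in b_set)
    # Extrapolate: assume the prefix's hit rate holds for the rest of `a`.
    return round(hits / len(sample) * len(a))
```

With a = even doc ids below 100,000 and b = multiples of 3, the true intersection (multiples of 6) has 16,667 members, and the prefix estimate comes out at 16,700 -- close, but only because the ids happen to be uniformly spaced. A skewed id distribution would give a correspondingly skewed estimate, which is presumably where large errors can creep in.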

At least this is true in the experimental circumstances that Jean and I have tested, and for cases where large sets of results are involved. Given these findings, my belief in Google's counts (for queries involving more than one term or other search condition) starts at about 0.9 when sets of about 100K pages are being combined, and falls monotonically to zero as the set size increases to 500M.
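That belief curve is only pinned down at its two endpoints; one way to render it (a sketch under my own assumption of linear decay on a log scale of set size -- nothing in the text specifies the shape) would be:

```python
import math

# Hypothetical rendering of the belief curve described above: 0.9 at 100K
# pages, falling monotonically to 0 at 500M. The log-linear decay is an
# assumption; only the endpoints and monotonicity come from the text.

def count_belief(set_size, lo=1e5, hi=5e8, start=0.9):
    if set_size <= lo:
        return start
    if set_size >= hi:
        return 0.0
    # Fraction of the way from lo to hi, measured on a log scale.
    frac = math.log(set_size / lo) / math.log(hi / lo)
    return start * (1.0 - frac)
```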

I doubt that this matters at all to most of Google's users, who want the relevant pages, not the counts. But I'm sure that Google's (smart, honest and dedicated) researchers and programmers will fix the problem anyhow.