This is kind of funny, as I was talking to a chap at Google who "joked" that maybe Yahoo! had just counted all the URLs in their DB before de-duping them. Now, I see John's been talking to GOOG and they're "officially baffled".

I spent an hour or so on the phone with a group of Google folks, and they shared a lot of information about how they measure index size, how they deal with issues of duplicate URLs and documents, and why they are baffled by Yahoo's claim.

[...]

"Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn't support the 19.2 (billion page) claim and we're confused by that."

I've got to say that I find 20bn a hard figure to swallow also, but Google's comments do strike a certain "sour grapes" chord at the same time.

Now, the question is, are Yahoo! stuffing socks down their trousers, or is it really a whopper?

Spider all domains; multiply the result by 1.8 for sites that don't redirect non-WWW to WWW; multiply that by 2 to account for all the sites that allow strings like www.example.com?foo; have the PR department intern slip a decimal point, and voila: 19.2 billion docs.
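For what it's worth, the tongue-in-cheek arithmetic above can be made to land right on the announced figure. A minimal sketch, where the starting crawl size is invented purely so the joke math works out:

```python
# Tongue-in-cheek reconstruction of the "19.2 billion" arithmetic.
# The starting figure is made up solely to make the joke land.
crawled = 533_000_000   # hypothetical raw crawl size
crawled *= 1.8          # sites that don't redirect non-WWW to WWW
crawled *= 2            # querystring variants like www.example.com?foo
crawled *= 10           # the intern's slipped decimal point
print(f"{crawled / 1e9:.1f} billion docs")  # → 19.2 billion docs
```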

Seriously, I am amazed that Y is claiming that number. I have a site that has been online since March 26th, and has 4400 pages indexed by G, 1005 indexed by MSN, and only the home page indexed in Y. Wouldn't they have had to hammer any and all sites they came across over the past several months to reach this volume? Yet there has certainly been no increase in spidering that would have foreshadowed this announcement.

Remember the month or two before G announced their new index? Everyone's logs were hammered with gBot tracks.

NOTE: I am not using the quote marks shown below in the queries. These links go to FIND ALL queries, not EXACT FIND queries. I do not compete for any of these terms on any of the Web sites that I control or assist people with. These queries are, from my point of view, random.

all of which don't exist, never have, and never will. Since I have a mod_rewrite rule going, any file or folder that doesn't exist automatically shows the site map. So if you typed in .com/seomike-rules/ you'd get a page. As I discover these made-up URLs I add rules to trigger 404s, yet they still exist in the Yahoo index... odd.
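The setup described there might look something like this in an Apache .htaccess file. This is a hypothetical sketch — the sitemap filename and the specific rules are assumptions, not the poster's actual config:

```apache
RewriteEngine On

# Later additions go first: as made-up URLs turn up in the index,
# specific rules force a real 404 for them.
RewriteRule ^seomike-rules/?$ - [R=404,L]

# Catch-all: any request for a file or directory that doesn't exist
# gets rewritten to the site map instead of returning a 404.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^.*$ /sitemap.html [L]
```

Note that unless the catch-all is changed, every other invented URL still returns a 200 with the sitemap, which is exactly what lets a crawler treat them as real pages.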

MM's queries just prove that
1. Google doesn't index garbage.
2. Both engines guess the number of matches.
3. The real answer is 42.

It is not possible to estimate the size of a SE index by querying anything. Period.

Although I doubt it, Yahoo may have crawled 20 billion pages in infinite loops on session IDs and other unproductive cycles. Probably every fetched 'page' (plus all embedded objects) got a UUID assigned. Then counting the UUIDs gives that useless number. I bet that only a fraction of those crawled pages made it into the index. Google seems to count indexed pages that can appear in searches, but their published number of searchable pages isn't accurate either.
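The distinction that comment draws — counting every fetch versus counting unique documents — can be sketched like this. The URLs and page contents are invented for illustration; a session-ID loop serves the same page under a thousand different URLs:

```python
import hashlib
import uuid

# Simulated crawl: a session-ID loop serves identical content
# under many distinct URLs.
fetches = [
    (f"http://example.com/page?sid={n}", "<html>same content</html>")
    for n in range(1000)
]

# Counting by UUID: every fetch gets a fresh ID, so the total is
# just the number of fetches, infinite loops and all.
fetch_ids = [uuid.uuid4() for _ in fetches]
print(len(fetch_ids))   # 1000 "documents"

# Counting unique documents: hash the content and dedupe.
unique = {hashlib.sha1(body.encode()).hexdigest() for _, body in fetches}
print(len(unique))      # 1 actual document
```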

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."