Comparing the number of results that different search engines report for those common words didn’t tell us anything about the relative sizes of their indexes, for a number of reasons.

One is that the numbers of results shown are rough estimates only. It’s also possible that the way those estimates are calculated differs considerably from one search engine to another. Some of the pages listed among those results are likely duplicate pages at different URLs, or may have contained misspellings of the words. Some of the words may also be abbreviations or acronyms (such as “it” being an abbreviation for information technology).

Some pages also show up as relevant for a particular search query without actually including that term on the page itself. For example, the Adobe Reader download page has ranked at the top of search results for the term “click here” on Google for years, without that phrase appearing on that page. So many links using those words as anchor text pointing to the page have been enough for the page to show up in search results for the term.

As I noted in that post, it might be possible to get a more realistic look at the relative sizes of search engine indexes by looking at the number of search results for terms that are rare, rather than looking at the most frequently appearing words. Cuil’s CEO and founder, Tom Costello, recently described using that technique in a post about Bing on his blog (no longer available), telling us that “Bing is now around 20% the size of Google.”

I don’t have access to an advanced web crawler like the CEO of Cuil might, to identify a large number of “rare” terms. I’m also using a very small sample size, but I wanted to take a look at a few “very rare” English words, to see how frequently they appeared at the search engines.

I identified a number of English words that appear in fewer than 1,000 search results at Google Caffeine, Google, Yahoo, Bing, Ask, and Cuil by looking at the Phrontistery’s Compendium of Lost Words and searching for those terms. Since these search engines will only show the first 1,000 results for a query, staying under that limit makes it possible to see all the URLs for the terms, to use actual numbers rather than estimates, and to see whether the words actually appear on the pages listed. If I had a much larger sample size, I would feel comfortable saying that the following table gives us a much better idea of the relative sizes of the indexes for the search engines that I’ve included.
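For anyone who wants to try the same comparison with their own counts, here is a rough sketch of how the numbers could be turned into relative index sizes. It is only one possible way to do it, not necessarily the calculation Tom Costello used: for each rare word it divides each engine’s result count by the count at a baseline engine, then takes the median of those ratios. The function name, the engine names, and the counts are all placeholders made up for illustration; they are not results from this post.

```python
from statistics import median


def relative_index_sizes(counts, baseline):
    """Estimate each engine's index size relative to a baseline engine.

    counts:   {word: {engine: number_of_results}} (hand-collected counts)
    baseline: name of the engine that every other engine is compared to
    Returns   {engine: median of per-word ratios against the baseline}
    """
    ratios = {}
    for word, per_engine in counts.items():
        base = per_engine.get(baseline, 0)
        if base == 0:
            # The ratio is undefined when the baseline has no results.
            continue
        for engine, n in per_engine.items():
            ratios.setdefault(engine, []).append(n / base)
    # The median is less sensitive than the mean to one odd word.
    return {engine: median(values) for engine, values in ratios.items()}


# Placeholder numbers for illustration only, NOT data from this post.
sample_counts = {
    "word_a": {"engine_x": 40, "engine_y": 10},
    "word_b": {"engine_x": 100, "engine_y": 20},
    "word_c": {"engine_x": 50, "engine_y": 5},
}

sizes = relative_index_sizes(sample_counts, baseline="engine_x")
print(sizes)
```

With only a handful of words the result is very noisy, which is the same sample-size caveat that applies to the table in this post.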

Here are some very rare English words, and the number of times that they actually appear in search results at different search engines (not counting duplicate pages and “substantially similar” results).

Originally I thought that maybe Caffeine might have more recent results (perhaps due to massive crawling seen by Googlebot pre-Caffeine launch), but the results reported either mean Caffeine date data is wonky, or Caffeine perhaps doesn’t have an historic index to speak of:

Curiously, I thought that last query stood out due to the larger variance between vanilla Google and caffeinated Google. The query [vicambulate] from my location is a bit different:

If you click through to the second page, the number of results actually drops. Google still shows estimates on the front page, even when there are fewer than 1,000 results. For this post, I didn’t click through to include the “substantially similar” search results, though I could have and maybe I should have. Regardless, the actual number of results listed is still much smaller than the estimates that are shown on the first page of the search results.

One of the reasons for this post, and my previous post on common words, was to see if there were some differences between the present Google and the Google Caffeine update. It’s interesting that when we look at the number of results for very common words, Google is showing us a good number more results in the estimate, but when we look at very rare words, the Caffeine estimates tend to be just a little less.

Nice post Bill, but of course this is likely a very noisy proxy as it assumes that all the SEs crawl and index in the same way (I’m thinking that some SEs may crawl wide, while others deep).

Thanks. I think regardless of how search engines crawl sites (deep or wide or focused), I was more interested in seeing if I could learn something about the size of the search engines’ indexes. Of course, we don’t know what they might be filtering out of results that might limit what we see, or other factors that might influence those numbers. But it’s still interesting to look at, and if Tom Costello is correct about the utility of this method, looking at rare words may give us some insight into the different sizes of search engine indexes.

Now I’ll have to go and learn what all of those words mean.

I was tempted to include the definitions here, but felt that it was better to link to the source where I found them, as repayment for making those rare words easier for me to find. 🙂

Using rare words can produce bizarre results, due to the different ways in which engines produce their search layers. You may want to consider a more balanced way of sampling. There are some papers about this topic.

The Indexable Web is more than 11.5 billion pages, 2005

K.Bharat and A.Broder, A technique for measuring the relative size and overlap of public web search engines [WWW1998]

S.Lawrence and C.L. Giles, Accessibility of information on the web [Nature 400:107-109, 1999]

You’re welcome. If you find this post interesting, I suggest you dig into the papers that Antonio mentioned in his comment. I provided links to them in my comment, which is right above this one. He’s a co-author of the paper at the first link.

Thanks. The compendium of lost words, where I found these words, and which I linked to in the post has some pretty interesting terms. Interestingly, one of the criteria that was used to consider a word as a “lost word” on that site was:

The word may not appear in its proper English context on any readily accessible web page.

Using the Google search engine, I have ensured that none of these words appear on any English-language web page in its proper context. Many of these words turn up no hits whatsoever. Others occur only as part of long alphabetic word lists that lack definitions…

It looks as though some of these “lost words” are becoming found again.

Thank you very much. I appreciate your commenting on this topic, especially considering your research in the area. The approach of using rare words is too simple a method for determining something such as the size of search indexes. I’ve seen the papers that you mention, and I believe that Tom Costello even referred to a couple of them in his post at Cuil.

They are worth looking at for anyone who might want to learn more about how difficult it might be to estimate the size of a search engine’s index.

Andrew Tomkins had a copy of “Estimating corpus size via queries” on his site, but the link appears to be broken. He did discuss the topic in a presentation, which is available here: Estimating corpus size via queries.

Very good point. If you visit many of the pages that are listed for these rare words, there are many empty search results pages otherwise filled with ads, as well as spam pages. There are also very few pages where people are actually writing about these words, or using them as actual parts of sentences.

It is very much possible that there is some filtering of search results going on at some of the search engines I’ve included, which makes their numbers lower than at other search engines.

Thanks for this post. Actually, there is a useful SEO trick I got from this. I experimented on my site by putting some rare words into the content, which gave me good PR right after the first Google PR update. No other search engine gives as much importance to rare words as Google.

The PageRank of a page shouldn’t rely upon the content of that page, regardless of whether you are using rare words or not. Would Google give a page a higher query-independent ranking (not PageRank, but maybe something else) if that page had one or more words on it that appeared very infrequently on the Web? I don’t know, but it’s something to think about.