Why Google Can’t Count Results Properly

It’s a long-standing complaint of mine with Google. You can do a search, then repeat the same search and “subtract” a word from your original set, and Google will return more matches, not fewer. It shouldn’t happen. But it does, and here’s why.

Subtracting Gives You More?

Consider a search for cars, which reports that there are 546,000,000 pages that match that word.

Now consider a search for cars -used, which should find all the pages that are relevant to the word cars (a set of 546 million) and then subtract any pages that have the word “used” on them. The figure should be less than 546 million. But it’s not. Instead, the set increases by 115 million matches, to 661,000,000 in total.
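The expectation here is plain set arithmetic, and a tiny sketch makes it concrete. The four-page index below is made up for illustration, and `matching` is a hypothetical helper, not anything Google exposes. With exact counting, subtracting a word (or adding a required one) can only shrink the set or leave it the same size.

```python
# Toy index: page id -> words on that page (made-up data for illustration).
pages = {
    1: {"cars", "used", "sale"},
    2: {"cars", "new"},
    3: {"cars", "racing"},
    4: {"trucks", "used"},
}


def matching(word):
    """Return the ids of every page that contains the word (exact count)."""
    return {page_id for page_id, words in pages.items() if word in words}


cars = matching("cars")                      # pages about cars
cars_minus_used = cars - matching("used")    # the query: cars -used
cars_and_new = cars & matching("new")        # the query: cars new

# Both refined queries are subsets of the original, so they cannot be larger.
assert cars_minus_used <= cars
assert cars_and_new <= cars

print(len(cars), len(cars_minus_used), len(cars_and_new))
```

If the reported counts were exact, they would have to obey the same rule. The fact that they don't is the first hint that they aren't exact counts at all.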

A Long Standing Problem

It makes no sense. It’s irked me for years. It was the top item on my “25 Things I Hate About Google” story that I wrote in 2006:

1) Web search counts that make no sense.

“Why do search engines lie?” has Robert Scoble recently poking at this, on how the reported counts don’t always match reality. Heck, try class two contributions with “about” 59,800,000 matches. But then you find that only 879 are considered non-duplicates! Meanwhile, mars landing sites gives 1,050,000 matches while mars landing sites earth gives nearly double that amount, 1,840,000 listings. It shouldn’t. Adding that extra word should give you a subset of the original query. It should come back with less results, not more.

But then again, if you are going to put out a number, perhaps it should be accurate?

Here we are more than five years later, and the problem still hasn’t been fixed. And today, it helped spark an article about whether Google is changing its counts as part of a conspiracy to bury “enemies of Google.”

Uh, no. Along with a number of SEOs, the head of Google’s spam team Matt Cutts stepped in to clarify why this happens.

As to why the query [A B -C] can return more estimated results than [A B], that’s easy to explain. The query [A B -C] causes us to go deeper through our posting lists looking for matches, which can lead to more accurate (and larger) results estimates. Other things can cause us to go deeper in finding matches, such as clicking deeper in search results. Results estimates can also vary based on which data centers or indices your query hits, as well as what language you’re searching in. It certainly has nothing to do with whether you’re a “possible enemy of Google,” as you put it.

We try to be very clear that our results estimates are just that–estimates. In theory we could spend cycles on that aspect of our system, but in practice we have a lot of other things to work on, and more accurate results estimates is lower on the list than lots of other things.

Let’s translate that to the cars query above. When I searched for cars, Google did a fast look and found it had somewhere around 546 million matching pages for that word, which can also include pages that don’t actually contain the word but have synonyms of it, along with pages that don’t have the word but are relevant because people link to them with the word “cars” in the hyperlinks.

When I searched for [cars -used], Google effectively thought harder about what I asked. It’s sort of like when someone may ask you a question that you know the answer to off the top of your head. Google gets asked about “cars” all the time — and it has that set of answers sitting on the tip of its memory tongue, so to speak, ready to spit out without much thought going into it.

But for the harder query, Google goes “Hmm, let me dig around.” And it discovers that it has even more pages about “cars” out there than it thought it had originally. That gives it a larger set of “cars” pages that, even when the word “used” is removed, still comes out bigger than the original “cars” set.
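To see how "digging deeper" can push an estimate up, here is a toy model in Python. It is not Google's algorithm, which isn't public; it simply assumes an engine that scans part of a list of documents and scales up what it finds. The index size, the `estimate` function, the scan depths, and the uneven spread of the word "cars" are all assumptions chosen to show the effect.

```python
N = 1000  # size of the toy index (assumption)


def words_for(doc_id):
    """Made-up documents: 'cars' is rarer in the first 100 than further down."""
    words = set()
    has_cars = (doc_id % 4 == 0) if doc_id < 100 else (doc_id % 2 == 0)
    if has_cars:
        words.add("cars")
    if doc_id % 10 == 0:
        words.add("used")
    return words


docs = [words_for(i) for i in range(N)]


def estimate(include, exclude=None, depth=100):
    """Scan only the first `depth` documents, then scale up to the whole index."""
    hits = sum(
        1
        for words in docs[:depth]
        if include in words and (exclude is None or exclude not in words)
    )
    return hits * N // depth


quick = estimate("cars", depth=100)                   # easy query, shallow look
deeper = estimate("cars", exclude="used", depth=N)    # harder query, full scan
actual = estimate("cars", depth=N)                    # exact count of "cars"

print(quick, deeper, actual)
```

In this toy setup the shallow estimate for cars comes out below the deeper figure for cars -used, even though the exact count for cars is higher than both. The true sets still nest the way they should; only the quick guess is off.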

Plus, there are other factors that Cutts mentions. Google has a lot of data centers out there, giant copies of its search index spread out across the world. Imagine a library that has identical branches across the world. Technically, they’re “mirrors” of each other. In reality, each branch might be missing a few books here and there for a variety of reasons. That can lead to different results.

On top of all this, counting results is hard. Google cared more about being accurate years and years ago, when it was dealing with only millions of pages — and it was still trying to prove what a “big boy” search engine it was in indexing so much content. But the days of “bigger is better” are long gone.

Google has tens of billions of pages stored these days (the exact number isn’t given out). So does its nearest competitor, Bing. Neither gains much of an advantage on many searches simply by having more pages than the other. That’s because many pages on the web are junk: unhelpful, unoriginal content that adds no value to searches.

People have talked about removing results estimates altogether. I’m not a fan of that. Still useful, even if noisy.

It’s useful to have inaccurate figures? Again from Cutts:

They’re not meaningless. [A B -C] has been known for longtemps, but estimates are stateless across queries; not worth trouble.

OK, how about a disclaimer next to the results count? Or make them into a link that leads to a disclaimer?

Not worth the pixel real estate on serps [search engine results pages] and annoying every user on earth. Better to debunk the yearly conspiracy theory.

I disagree. I mean, it’s equally possible that showing results counts under the search box is also annoying or at least a waste of pixel real estate, especially when they do indeed feel meaningless.

But more important, those counts aren’t just used by conspiracy theorists (last year’s big one with counts was over Climategate). They’re used by…

News outlets, such as Fox News, which cited them as proof that the BBC was anti-American (“BBC anti-American” resulted in 47,200 hits, you see. Of course, at that time, “bush anti-american” brought up 351,000 hits, underscoring how you could prove or disprove anything with those counts)

Courts, including the US Supreme Court. In 2004, the US Solicitor General cited Google’s count for “free porn” as evidence of how much porn was increasing on the web.

I see little reason to retain results counts. I love data, but it should be accurate data — and these numbers are anything but. Time to retire them, Google. Or after at least five years of the “it’s not our highest priority” mantra, finally make it a priority.

The Twitter Chatter

By the way, the whole conspiracy thing sparked quite a multi-person conversation on Twitter. Using Storify, I pulled together some of the comments: