You should do an analysis on the 10-result data as though they were 7-result pages. In other words, it could be that two algorithms are in use, one for 7-result pages and one for 10-result pages. But it could also be that two algorithms are used on all pages: one for the top n results and another for the rest.
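
To make the idea concrete, here's a rough sketch of the split analysis I have in mind. The data file, the column names, and even the n = 7 cutoff are assumptions for illustration, not anything from the actual data set:

    import csv
    from statistics import mean

    N_TOP = 7  # assumed cutoff separating the "top" block from the rest

    def split_ctr(path):
        """Compare a metric (here, CTR) for the top-n slots vs the remaining slots."""
        top, rest = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                pos = int(row["position"])   # 1-based rank within the page (assumed column)
                ctr = float(row["ctr"])      # click-through rate for that slot (assumed column)
                (top if pos <= N_TOP else rest).append(ctr)
        return mean(top), mean(rest)

    if __name__ == "__main__":
        top_ctr, rest_ctr = split_ctr("serp_results.csv")  # hypothetical export
        print(f"mean CTR, positions 1-{N_TOP}: {top_ctr:.3f}")
        print(f"mean CTR, positions {N_TOP + 1}+: {rest_ctr:.3f}")

If positions 1-7 on the 10-result pages behave like the 7-result pages while positions 8-10 behave differently, that would point to the second explanation: one algorithm for the top n, another for the rest.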

It wouldn't surprise me at all to learn that this is the case and that the insight stays buried if we don't look for it. In fact, as I look at my own site's behavior on the SERPs, I've seen evidence that there is some funny business going on between pages that rank very high and pages that rank much lower.

This is a hard problem, to be sure. The description of what's going on makes it sound like the engineers designing the system may not have built many large distributed systems. In particular, the very batchy-sounding, all-or-nothing start-over after a hard disk failure is scary for a system of this size.

These kinds of systems need to be seriously decentralized and they need to expect that failure will be frequent and commonplace. Read, for example, the accounts of how Netflix went to Amazon.
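
To make the contrast with the all-or-nothing restart concrete, here's a toy sketch of the kind of structure I mean; the shard layout and the checkpoint file are purely illustrative, not how any particular system does it:

    import json
    import os

    CHECKPOINT = "done_shards.json"  # illustrative record of finished work

    def load_done():
        """Return the set of shard ids that finished before any earlier failure."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def mark_done(done, shard_id):
        done.add(shard_id)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)

    def run(shard_ids, process):
        """Process shards independently; a crash costs only the unfinished shards."""
        done = load_done()
        for shard_id in shard_ids:
            if shard_id in done:
                continue            # already finished; skip on re-run after a failure
            process(shard_id)       # may fail, but it takes down only this shard
            mark_done(done, shard_id)

The details don't matter; the point is that a disk failure should cost you one shard's worth of work, not the whole run.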

I recommend getting in touch with some folks who've done seriously large-scale cloud systems and can lend some advice. They're out there, and compared to the expense of being late or of disappointing customers, they're pretty cheap. It's too late to use their advice on this go-round, but you're going to continue to see pain in future go-rounds if you don't get some help.

First, as was mentioned, it is hard as heck to develop algorithms that sift the high-quality content from the spammy chaff that makes up most of the pages out there.

But let's assume they did have the miracle algorithm. They may very well choose not to use it, for profit reasons. Using better algorithms can hurt them in two ways. The first is cost: better algorithms require more processing to run. At Google's volumes, they are going to prefer "pretty good" algorithms that are cheap to compute over "awesome" algorithms that are extremely expensive to compute.
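
A back-of-the-envelope calculation shows why that constant factor matters so much at their scale. Every number below is invented for illustration--query volume, CPU time per query, and cost per CPU-hour are assumptions, not Google's actual figures:

    QUERIES_PER_DAY = 3_000_000_000   # assumed query volume
    COST_PER_CPU_HOUR = 0.05          # assumed dollars per CPU-hour

    def daily_cost(cpu_seconds_per_query):
        cpu_hours = QUERIES_PER_DAY * cpu_seconds_per_query / 3600
        return cpu_hours * COST_PER_CPU_HOUR

    cheap = daily_cost(0.05)   # "pretty good" ranking pass
    fancy = daily_cost(0.50)   # "awesome" pass that costs 10x per query
    print(f"cheap: ${cheap:,.0f}/day   fancy: ${fancy:,.0f}/day")
    print(f"extra cost of the fancy pass: ${(fancy - cheap) * 365:,.0f}/year")

Whatever the real numbers are, a 10x per-query cost multiplies straight through billions of queries, so "pretty good but cheap" wins a lot of arguments.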

The second way is subtler and comes down to how search traffic is monetized. Start with the obvious: the spammy sites monetize with AdWords, and consider how badly Google really wants to put a stop to that. At the more subtle end, I look at Bing and my own site, for example. Bing delivers better-matched visitors to my site--they have lower bounce rates and they are more likely to convert when they arrive at landing pages. I take that to mean Bing is using better algorithms, so visitors arrive to find something closer to what they're searching for. Just one problem--I get far fewer visitors from Bing, far fewer even than the relative market shares of Bing vs Google would suggest. So I look at those incoming volumes and wonder whether Google wouldn't rather deliver a lot more visitors, even if they're not quite as good a match (e.g., a little higher bounce rates, a little worse conversions), just because many sites will be so attracted to the volumes.
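
Here's the site owner's side of that trade-off as a rough sketch; the visitor counts, conversion rates, and dollar value per conversion are made-up numbers, not measurements from my site or anyone else's:

    def traffic_value(visitors, conversion_rate, value_per_conversion=50.0):
        """Expected value a site sees from a traffic source."""
        return visitors * conversion_rate * value_per_conversion

    # high volume, weaker match (Google-like in this story)
    high_volume = traffic_value(visitors=10_000, conversion_rate=0.010)
    # lower volume, better match (Bing-like in this story)
    better_match = traffic_value(visitors=1_000, conversion_rate=0.015)

    print(f"high-volume source:  ${high_volume:,.0f}")
    print(f"better-match source: ${better_match:,.0f}")

Even with a noticeably better match per visitor, the smaller source loses on total value, which is exactly the pull toward volume I'm describing.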