Google doesn't seem to get blogs

by Michael S. Kaplan, published on 2006/08/05 10:24 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/05/689506.aspx

No, this post is not making the point that Chris Wetherell did and folks picked up on a couple of years ago, like here. Let me explain....

So I found myself IM'ing with Melanie Spiller as I have been doing intermittently since that dinner last month, and thinking about how refreshing (and entertaining!) it is every once in a while to have a whole new source of information. Like she pointed out a bumper sticker she had seen that she thought I might enjoy:

What if the hokey pokey is what it's really all about?

That is freaking hilarious, in my opinion. Perhaps you don't agree, and I won't argue with you since I have no sense of humor, really. But if I were ever going to put a bumper sticker on my car, that would be it. I even wrote it on my white board and Kieran, who happened to be passing by, agreed that it was pretty awesome.

So what does this have to do with Google not getting blogs? :-)

Well, I did that ubiquitous thing that is going to make Google lose their trademark someday just like Xerox might -- I google'd this phrase. And the results were surprising to me:

1 - 12 of about 173? WTF?

Clearly Google is smart enough to recognize that there is a pattern, but not smart enough to identify that it might be due to the modern equivalent of pages with frames that happen to share the same frame text -- not smart enough to point out which links have the repeats. The subsidiary info that every blog page might have? It can't call out why they might be the same, why it might have lumped them all together?

This is not proof that Google doesn't get blogs, by the way. If anything it is proof that they do get blogs, at least in the sense that they can see patterns and such. So what am I rambling on about?

Well, I took another phrase, one that appears in the disclaimer text of my own blog:

It is counting every page. And every month link in the archives on every page. And every category link on every page. And so on.

If you scroll to the end of the results, it is eventually smart enough to see something is going on, avoids the recursive freaking hole it has dug for itself, and actually stops at around 1,000. Which is still about 995 more than it ought to show; even some of the dimmest children would realize what is going on and be a bit smarter about how they describe things.

And if you search on actual content like the title of a post or text inside of a post, you see a different problem -- it is actually looking at every RSS link off of every page, too. And indexing all of those as well. Fools like me who actually aggregate full posts are punished the most here, and Google will provide links to each of those pages, too.

We are impressed with Rainman in particular and with some of the capabilities of the more talented Idiot Savants in general. But we eventually get over that and realize that the first word in that title is Idiot and that what is widely believed to be the most talented searching algorithm could perhaps become a bit smarter. I'd find it much more impressive than throwing half a million servers at the problem, were someone to ask me....

Not to further cast asparagus, but search.msn.com, for example, stops after only ten results across five different domains. Not such an idiot, msn is, huh? :-)

It's the web that needs to get smarter here, not (just) Google. Actually providing dozens, hundreds or thousands of URLs on your site which have the same content mixed together in different ways is a goof. As you've noticed, Google can smush search results for those URLs, but the long term solution is better use of transclusions.

Imagine how many people each day download two, three, even more separate copies of the text of your latest post, with different surroundings, as they surf around - and then cache all this redundant text. Google's doing the same thing (as are all the other search engines). A minimal, completely automatic use of transclusions in managed content sites (like blogs) fixes all of this, and delivers better performance for ordinary users. A more ambitious plan could deliver citation transclusion, meaning that when someone famous (like say Jamie Zawinski) links someone else's insightful comments on LISP, Google search results for LISP find the original comment first, not Jamie's blog quoting an excerpt - because Google would actually be able to figure out which is the original.

The following is from the Washington Post Style Invitational contest that asked readers to submit "instructions" for something (anything), but written in the style of a famous person. The winning entry was The Hokey Pokey (as written by William Shakespeare).

O proud left foot, that ventures quick within
Then soon upon a backward journey lithe.
Anon, once more the gesture, then begin:
Command sinistral pedestal to writhe.
Commence thou then the fervid Hokey-Poke,
A mad gyration, hips in wanton swirl.
To spin! A wilde release from Heavens yoke.
Blessed dervish! Surely canst go, girl.
The Hoke, the poke -- banish now thy doubt
Verily, I say, 'tis what it's all about.
-- by "William Shakespeare"

I noticed stupid search engine behavior before; I didn't look much at others, but Google is a pretty bad offender here. It has a tendency to direct people to the front page of my blog, or one of the later pages (?page=2) ... those are, by their very nature, pretty dynamic (well, not so much in recent times, but still). Also the front page (or category pages, etc.) tends to aggregate a bunch of keywords belonging to different posts, which Google helpfully sees as belonging to a single relevant page. So one could actually land on my site with a search like "postscript batch array". I never talked about those things together, only separately. Yuck.

I never really found out how to teach them only to consider individual articles or pages while ignoring everything that aggregates more than one of them.
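One blunt way to do this is a robots.txt file that disallows the aggregate views while leaving individual post URLs crawlable. Here is a minimal sketch, assuming hypothetical archive/category/feed paths -- your blog software's actual URL layout will differ, and note that the * wildcard in Disallow is a Google extension rather than part of the original robots.txt convention:

```text
User-agent: *
# Keep crawlers off the paginated front page (?page=2 and friends)
Disallow: /*?page=
# Keep them off the monthly-archive and category aggregate pages
Disallow: /archive/
Disallow: /category/
# And off the per-page RSS/Atom feed URLs that get indexed as duplicates
Disallow: /feed/
```

Individual post pages stay indexable because nothing matches them, so the only results left to return are the canonical ones. It is a hack, though: you lose any ranking juice the aggregate pages had, which is why the transclusion idea above would be the nicer long-term fix.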