Thursday, December 28, 2006

One of the most impressive features in Windows Vista ... is instant search.

Anyone who's struggled with the lousy search functionality in Windows XP or previous Windows versions will be happy to hear that the Vista version is fantastic, delivering near-instantaneous search results while providing the types of advanced features that power users will simply drool over.

Throughout Windows Vista, you will see various search points, all of which are context sensitive.

[For example,] in the right side of the Start Menu ... you can search your ... documents and other data files. As you type a search query in the windows search box, search results begin appearing immediately. The speed at which this happens is pretty impressive ... You can [also] search for applications, ... IE Favorites, email, and other items directly from the Start Menu.

The opportunity for third party desktop search applications like Google Desktop Search only existed because Windows XP desktop search was so pitifully slow.

As I said before, the moment Microsoft corrects this flaw, this opportunity will evaporate, as will the numerous also-ran desktop search apps. It appears Microsoft has finally fixed desktop search in Windows.

Paul's review goes on to say that "Microsoft will work to make instant search more pervasive in [the] future". Integration of search into Windows has long been expected as part of the search war. From a NYT article:

"Search will not be a destination, but it will become a utility" that is more and more "woven into the fabric of all kinds of computing experiences," said Kevin Johnson, co-president of Microsoft's platforms and services division.

Search is a very pervasive thing. You want to search the Web, you want to search your corporate network, you want to search your local machine, and sometimes you want search to work against multiples of those things.

Now that Microsoft has fixed desktop search, they will integrate search throughout Windows and Windows applications. The easiest and most obvious option for searching will be the search box sitting right in front of you. That box will be powered not by Google or Yahoo, but by Microsoft.

One analyst's survey doesn't make a trend. But a Global Equities research analyst said this week that he found "many Vista owners that once used Google's desktop search feature have switched to Microsoft's" desktop search which is built into Windows Vista.

Wednesday, December 27, 2006

Google apparently now makes $0.20 per search from advertising revenue, according to Caris & Co. analyst Tim Boyd as quoted in the BusinessWeek article, "Why Yahoo's Panama Won't Be Enough".

Using data on total search queries, released by comScore, Caris & Co. analyst Tim Boyd estimates that Yahoo made on average between 10 cents and 11 cents per search in 2006, bringing in a total of $1.61 billion for the first nine months of the year.

Google, meanwhile, makes between 19 cents and 21 cents per search. As a result, it made an estimated $4.99 billion during the same period.

The BusinessWeek article also has some interesting tidbits on Yahoo's Panama, the difficulty of monetizing non-search page views, and the potential of behavioral targeted advertising to improve targeting on non-search page views.

On the topic of early efforts at advertising targeted to past behavior, Barry Schwartz's post, "How Microsoft's Behavioral Targeting Works" at Search Engine Land has a nice excerpt from a recent WSJ article on how Microsoft's adCenter does coarse-grained behavioral targeting.

I was a lifeguard at Rinconada Pool in Palo Alto when I was a teenager. No, not very geeky, but it was a long time ago. I assure you that any coolness I once had is now gone.

I occasionally brew beer, but I am not very good at it. I once tried to brew a batch of Russian Imperial Stout, a very heavy beer, and bottled too early. 44 of 50 bottles exploded with enough force to embed shards of glass in a nearby wall.

As an undergrad, I had an odd double major in Computer Science and Political Science. Geeking out on political economy and game theory is great, but there is precious little overlap between that and computers, so I spent most of college with my nose buried in books.

I used to mess around with artificial life. It was more fun than useful, but I did help develop two simulations that were used in undergrad classrooms, the LEE Project and an iterated prisoner's dilemma simulation (PDF).

When I was a grad student, I got a black lab and named her Pavlova. If you think that is funny, you, like me, probably are a geek.

Monday, December 18, 2006

Randy Shoup and Dan Pritchett gave a talk on scaling eBay, "The eBay Architecture", at SD Forum 2006. The slides are available (PDF).

The parallels with Amazon are remarkable. Like Amazon, eBay started with a two-tiered architecture. Like Amazon, they split the website into a cluster in the late 1990's, followed soon after by partitioning the databases.

Like Amazon, they soon encountered poor performance and difficulty compiling their massive, monolithic binary (150M for eBay, Randy and Dan say). Like Amazon, they started a major rewrite of their monolithic binary around 2001, eventually building a services architecture on top of partitioned databases.

They even built their own search engine because "no off-the-shelf search engine met [their] needs." Amazon did that as well.

It is interesting that their new architecture basically gives up on transactional databases. They say eBay has "absolutely no client side transactions", "no distributed transactions", and "auto-commit for [the] vast majority of DB writes". Instead, they apparently use "careful ordering of DB operations". It sounds like mistakes happen in this system, because they mention running "asynchronous recovery events" and "reconciliation batch" jobs, which, I assume, means asynchronous processes run over the database repairing inconsistencies.
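To make that pattern concrete, here is a toy sketch of what "careful ordering of DB operations" with auto-commit writes plus an asynchronous reconciliation job might look like. Everything here is invented for illustration -- two dicts stand in for partitioned databases, and the item/bid schema is hypothetical -- this is not eBay's actual design:

```python
# Two dicts stand in for separately partitioned databases.
items_db = {}   # partition 1: item records
bids_db = {}    # partition 2: bid records referencing items

def place_bid(bid_id, item_id, amount):
    # Careful ordering: write the referenced row first and the referencing
    # row next, so a crash in between leaves a repairable state rather
    # than a dangling reference. Each write would be its own auto-commit
    # statement; no cross-partition transaction wraps them.
    items_db.setdefault(item_id, {"high_bid": 0})
    bids_db[bid_id] = {"item": item_id, "amount": amount}
    if amount > items_db[item_id]["high_bid"]:
        items_db[item_id]["high_bid"] = amount

def reconcile():
    # Asynchronous recovery: scan for inconsistencies (an item whose
    # recorded high bid no longer matches its bids) and repair them.
    repaired = 0
    for item_id, item in items_db.items():
        best = max((b["amount"] for b in bids_db.values()
                    if b["item"] == item_id), default=0)
        if item["high_bid"] != best:
            item["high_bid"] = best
            repaired += 1
    return repaired
```

The key trade-off: individual writes can be left half-applied after a failure, but the reconciliation job eventually converges the data, which is exactly the "asynchronous recovery events" style described in the talk.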

In all, a very interesting talk for anyone who is working or wants to work on big websites and big data. As Tim Bray said, "This ought to be required reading for everyone in this business whose title contains the words 'Web' or 'Architect'."

See also Dan Pritchett's weblog post, "You Scaled Your What?", where he mentions his talk and these slides at the end.

See also some other interesting commentary ([1][2][3][4]) on this talk.

Thursday, December 14, 2006

I liked Marshall's framework for the spam problem. He talked about spam as an externality, like pollution, and proposed a solution based on the Coase theorem that attempts to give people "property rights over their attention."

We propose an "Attention Bond," allowing recipients to define a price that senders must risk to deliver the initial message.

Requiring attention bonds creates an attention market ... to price this scarce resource. In this market, screening mechanisms shift the burden of message classification from recipients to senders, who know message content ... In certain limited cases, this leads to greater welfare than use of even "perfect" filters.

I was mostly interested in the theory discussed in the talk, but Marshall did propose an application for trying to eliminate spam. The basic idea is a whitelist system where senders not on your whitelist have to post a micropayment bond ($.01 - $.05) for you to receive the message. If you determine the message is spam, you seize the bond.
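As a toy illustration of those mechanics -- not Marshall's actual proposal or code; the class name, prices, and methods are all invented -- the whitelist-or-bond decision might be sketched like this:

```python
class AttentionBondInbox:
    """Hypothetical sketch: unknown senders must post a small bond."""

    def __init__(self, bond_price=0.02):
        self.whitelist = set()
        self.bond_price = bond_price
        self.held_bonds = {}   # message_id -> (sender, amount at risk)
        self.inbox = []

    def deliver(self, message_id, sender, body, bond=0.0):
        if sender in self.whitelist:
            self.inbox.append((message_id, sender, body))
            return True
        if bond >= self.bond_price:
            # Bond is held in escrow until the recipient judges the message.
            self.held_bonds[message_id] = (sender, bond)
            self.inbox.append((message_id, sender, body))
            return True
        return False  # unknown sender, no bond posted: rejected

    def mark_spam(self, message_id):
        # Recipient seizes the bond; the sender loses the posted amount.
        sender, amount = self.held_bonds.pop(message_id)
        return amount

    def mark_ok(self, message_id):
        # Bond is returned, and the sender is whitelisted for future mail.
        sender, amount = self.held_bonds.pop(message_id)
        self.whitelist.add(sender)
        return 0.0
```

Note how the sketch quietly assumes `sender` is a reliable identity -- which is exactly the problem discussed below.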

While I enjoyed the talk, the proposed solution has problems. The biggest problem I see is that there is a quiet assumption that e-mail senders can be identified.

Yes, if you implement a strong identification system over e-mail, you can implement all kinds of promising anti-spam solutions. However, as security guru Bruce Schneier said, "These solutions generally involve re-engineering the Internet, something that is not done lightly."

Marshall addressed other criticisms near the end of his talk, including how the system would deal with honeypots and botnets, but, I think, also may have oversimplified the challenges there.

For example, Marshall claimed that marketers would be careful who they send e-mail to, so someone who sets up a honeypot to seize "attention bonds" would not get much business. But, I suspect enterprising people would not just set up one honeypot, but billions of them, each of which has a forged identity behind it made to look as attractive as possible to marketers. True, we may not have much sympathy for e-mail marketers, but this may threaten to ruin e-mail marketing completely, which would create opposition from the business community to this system.

Slashdot has a post on Marshall Van Alstyne's work, including some snarky comments ([1][2][3]) in the discussion.

Wednesday, December 13, 2006

There are some interesting tidbits on personalization in this July 2006 AlwaysOn panel, "What Is the Data Telling Us?", with Peter Norvig (Google), Jim Lanzone (Ask), Usama Fayyad (Yahoo), and Michael Yavonditte (Quigo).

The panel moderator, Bambi Francisco, focused on privacy issues at the beginning, and the panelists appeared a little reluctant to talk. Usama Fayyad started off early by saying:

Knowing what people do collectively or in segments of special interests gives you a lot of very interesting information and a lot of leverage in terms of product and making things more relevant, including making advertising more relevant, and makes a better service.

A bit generic, but it is a good framing of the problem. We are trying to use aggregate data to make search and advertising more relevant and useful.

Bambi continued poking at the privacy issue, sparking Peter Norvig to say that Google really does not need or want to know everything about you. As Peter explained, building up some uber profile of everything you have ever done is less important than focusing on your recent history:

What's important is not you as an individual, but it's the role you are playing at the moment. When you are looking for one particular piece of information, I don't want to know about you so much as I want to know about all the other people in the same situation and what they did then.

And I'd rather know about what is your history for the last five minutes as you try to solve this problem than know about your history for the last five years.

Exactly right. What matters is your current mission, what you are trying to do right now. We can help by paying attention to what you are doing right now and helping you get it done.

Jim Lanzone chimed in around here, both talking about how users will not do a lot of up-front work in search and expanding on Peter's point about helping people with the problem they are currently trying to solve:

Most users are actually very lazy. While some high end users might use products that require tagging, the vast majority of people won't.

The behavior they will use is to iterate on a search engine. That one white box is just so easy for them to put in whatever is in the top of their head ... then the average searcher will review a result page in 5 seconds or less ... they get clues and then they will iterate their search.

That's why the average search session will have 3 or 4 searches ... That is part of the game for them, is finding a clue, iterating their search, getting more specific, and then finding what they need.

It's not worth their time to sit there and toggle a bunch of things in advance of their query, to then hopefully get a better result. It just saves them time to start going.

At this point, Bambi seemed to shift focus a bit and ask a bunch of questions about personalization and recommendations. Again, Bambi was not getting a lot of answers, but most of the answers she did get were fairly negative toward the idea of personalized search.

For example, Usama said, "You really can't read the searcher's mind," a statement that reminded me of a quote from former A9 CEO Udi Manber: "People will learn to use search better but have to invest the thinking -- we are not in the mind reading business." I was surprised to see Peter echo this point, saying something to the effect that Google would have to be clairvoyant to guess user intent given a search of a couple keywords.

I think both of these statements miss the point of personalized search. The idea is not to do something with nothing. That would be magic, mind reading. No, the idea behind search personalization is to add data about what a searcher has done -- especially what a searcher just did -- to refine the current search.

If the couple keywords in a search are too vague, looking back at a searcher's history may help disambiguate it. If a searcher is iterating and not finding what they want, paying attention to what they just did and did not find can help us narrow down on what they might need.
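As a sketch of that idea -- and only a sketch, with invented scoring that is nothing like a production ranker -- re-ranking results using the last few queries in the session might look like:

```python
from collections import Counter

def rerank(results, session_queries, weight=0.5):
    """Boost results whose titles match terms from recent session queries.

    results: list of (doc_title, base_score) pairs.
    session_queries: the searcher's last few queries in this session.
    The scoring here is made up for illustration only.
    """
    recent_terms = Counter(
        term for q in session_queries for term in q.lower().split())

    def score(item):
        title, base = item
        # Add a bonus for each title term the searcher recently typed.
        overlap = sum(recent_terms[t] for t in title.lower().split())
        return base + weight * overlap

    return sorted(results, key=score, reverse=True)
```

The point is the shape of the computation, not the particulars: a vague query like "jaguar" gets disambiguated not by mind reading, but by the perfectly ordinary signal of what the searcher did five minutes ago.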

The entire talk is good fun, worth watching. Usama is focused on Yahoo Answers and social search. Jim talks mostly about search experience and making search easy. Peter adds clarity on a few points and has a few amusing anecdotes. Do not miss Peter's joke around 53:23 in the video about a haiku he found of some searches in the logs, "a story of ... frustration and release", very funny.

Friday, December 08, 2006

Last year, like many others, I made a bunch of predictions for what would happen in 2006. It is time to look back and see how many I got right.

The press will attack Google, GOOG will drop: False. The press has been more critical of Google, poking at it occasionally on management, privacy issues, and the YouTube deal. But, there has been no major disillusionment or scandal, and the stock price has only gone higher.

Yahoo bets on community, buys more community startups, gets little benefit: True. Yahoo has bet heavily on community and social search (Yahoo Answers, My Web, del.icio.us), but success with these in the mainstream has been mixed. Yahoo acquired del.icio.us and Jumpcut. Yahoo has disappointed investors with their performance, leading to a major reorg recently.

Microsoft launches unsuccessful AdSense competitor: False. Microsoft launched adCenter, an AdWords competitor, but has not launched an AdSense competitor yet. (I originally scored this prediction as true; see the update below.)

Microsoft will abandon Windows Live: False. What I meant by this prediction is that Microsoft could not maintain both the MSN and the Live brand, so they would choose MSN over an expensive effort to build a new Live brand. But that was wrong too. Microsoft is not abandoning the MSN brand or the Live brand; they are trying to build both brands, creating much confusion.

Mainstream will like tagging images and videos, but not documents: Mostly true. My Web 2.0, del.icio.us, and other apps for tagging documents do not seem to be attracting large audiences. Tagging images on Flickr and videos on YouTube seems reasonably popular, though, even for images and videos, it is not clear that large mainstream audiences widely have embraced the effort required to label things with tags.

Tagging sites will be assaulted by spam: False, at least at the level I was predicting. I thought Technorati, del.icio.us, and Flickr would be flooded with spammers labeling ads and other crap with arbitrary tags, hoping to attract clicks. Technorati and del.icio.us show some spam, but are not "assaulted" by an "influx of crap".

A spam robot will attack Wikipedia: False, but the part of this prediction that said that Wikipedia will "shut off anonymous edits and place other controls on changes" was at least partially true. As Nick Carr said, "the administrators adopted an 'official policy' of what they called ... 'semi-protection' to prevent 'vandals' ... from messing with their open encyclopedia." Moreover, as Eric Goldman argues, major spam attacks on Wikipedia may just be a matter of time.

Yahoo and MSN launch blog search, Technorati and Feedster lose share, Google Blog Search dominates: Mostly true (I originally scored this mostly false). Yahoo and MSN still do not have a separate blog search, but Ask did launch one. I first wrote that Feedster was suffering but Technorati was doing surprisingly well against Google Blog Search; in fact, Feedster and Technorati are both suffering, and Google is dominating.

An impressive and more ambitious version of Google Q&A: False, at least not yet. I was expecting to see something really cool here, a product of the massive processing power of the Google cluster, but it did not happen. Though, I have to say, hints of good things to come seem to keep popping up in Peter Norvig's talks. Maybe this is just a matter of time.

A VC-fueled bubble around personalization: False. There has been interest and some funding for startups doing personalization and recommendations, but not at the absurd, frothy level I expected.

Google News adds recommendations, MSN/Yahoo experiment with personalization, all three expand in targeted advertising: Mostly true. Google News does have a widget that recommends news based on your reading history. AOL launched news recommendations in My AOL. Yahoo and MSN are both doing early experiments with behavioral targeted advertising, but have not done much elsewhere with implicit personalization.

Hype about mashups and APIs will fade: Mostly false (I originally scored this simply false). If anything, the hype seems to be increasing, and I have not seen much evidence that people are disillusioned yet with the restrictions or lack of uptime guarantees on APIs. But there are signs of a growing backlash; see the updates at the bottom of this post.

eBay's business slows, eBay makes other acquisitions to acquire growth: Mostly true. eBay's growth has slowed. The expensive Skype deal seems to have tempered eBay's interests in additional acquisitions, but they did do a $2M acquisition of Meetup, $48M acquisition of Tradera, and a deal with Google.

Well, not such a good track record. About a third true or mostly true. Maybe I should put away my crystal ball?

Update: As John K pointed out in the comments, Microsoft adCenter is an AdWords competitor, not an AdSense competitor. Microsoft has not yet launched an AdSense competitor. Sorry, my mistake.

Update: I may have judged too soon on the lack of a backlash against APIs. Google very recently pulled their web search API, causing Dare Obasanjo to say:

One thing that is slowly becoming clear is that providers of data services would rather provide you their data in ways they can explicitly monetize (e.g. driving traffic to their social bookmarking site or showing their search ads) instead of letting you drain their resources for free no matter how much geek cred it gets them.

I keep hearing people talk as if companies are creating web services because they just dream of setting all their data free. Sorry, folks, that isn't the reason.

Companies offer web services to get free ideas, exploit free R&D, and discover promising talent. That's why the APIs are crippled with restrictions like no more than N hits a day, no commercial use, and no uptime or quality guarantees. They offer the APIs so people can build clever toys, the best of which the company will grab -- thank you very much -- and develop further on their own.

Update: I guess I also judged too soon when I said Technorati is doing surprisingly well against Google Blog Search. According to a Dec 28 article from Hitwise, the combined traffic of blogsearch.google.com and search.blogger.com is now about twice that of Technorati.

Update: Maybe I was just too early on the VC frenzy around personalization. VC Fred Wilson predicts that "the implicit web is going to start taking off in 2007" where the "implicit web", as Fred defines it, is using clickstream and other implicit information about preferences to do recommendations and personalization. Perhaps the frenzy will be in 2007, not 2006.

There should be an alternative to one-size-fits-all RSS feeds for busy sites.

Too many high-volume sites assume everyone wants to read every post. That's wishful thinking. Some readers may want 5+ posts a day from your site, but what about moderate fans who only want 5 posts a week? Or casual fans who want a mere 5 posts a month? These people just want a glass of water, yet sites insist on pointing a firehose at them.

Matt goes on to quote the frustration of Khoi Vinh at his feed reader:

I've collected so damn many RSS feeds that, when I sit down in front of the application, it's almost as difficult a challenge as having no feed reader whatsoever. With dozens and dozens of subscriptions, each filled with dozens of unread posts, I often don't even know where to start.

Matt also quotes an older Wired article that nicely states the problem:

I want to solve the question of "I don't have any time and I subscribe to 500 feeds. I just got off the plane. What do I need to read?"

Current RSS readers merely reformat XML for display. That isn't enough. Feed readers need to filter and prioritize. Show me what matters. Help me find what I need.

Matt's post focuses on issues for people with hundreds of feeds in their feed reader -- a common problem for us geeks -- but I think the problem is much broader than that.

Not only do most people not want to read every post from various feeds, but most people do not want to go through the hassle of tracking down and subscribing to individual feeds in the first place. XML is for geeks, not something that should be exposed to readers. Most people just want to read news. Next generation feed readers should hide the magic of locating content.

Overall, feed readers need to do a better job of focusing on scarce attention. Readers have limited time. Feed readers should be helping readers focus, filter, and prioritize. Feed readers should throw out the crap, surface the gems, and help people manage the flood of information coming at them.
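A crude sketch of the kind of prioritization I mean -- with a deliberately simple, invented interest score based on nothing more than which feeds a reader has clicked through to before:

```python
from collections import Counter

def prioritize(items, click_history):
    """Rank unread feed items by a toy interest score.

    items: list of (feed_name, title) pairs, the unread queue.
    click_history: list of feed names the reader clicked through before.
    A real feed reader would use far richer signals; this only
    illustrates ordering by attention rather than by arrival time.
    """
    feed_clicks = Counter(click_history)
    total = sum(feed_clicks.values()) or 1

    def score(item):
        feed, _title = item
        # Fraction of past clicks that went to this feed.
        return feed_clicks[feed] / total

    return sorted(items, key=score, reverse=True)
```

Even something this simple changes the experience: instead of dozens of feeds demanding equal attention, the items most likely to matter float to the top.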

Thursday, December 07, 2006

When I worked at Amazon, a lot of effort went into recognizing that two items in the catalog were actually the same item. That was called item authority.

I was recently browsing around in YouTube and I noticed how bad the site is about dealing with multiple copies of the same content. For example, on Weird Al's video, "White & Nerdy", look at the related videos.

The first four are all copies of the same video. They are not "related"; they are the same video.

Of the first ten videos in that list, only three are unique. The others are all duplicates.

This problem is not unique to YouTube. On Google Video, "White & Nerdy", eight of the top ten "related" videos are identical copies of the Weird Al music video.

The point of showing me related content is to help me discover new and interesting content. Showing identical copies of the same video I just watched is not useful to me.

What is useful is helping me find other interesting videos. At a minimum, you could screen out duplicates and then show other Weird Al videos; that would be useful, if a bit obvious. Alternatively, you could show videos that interest people who liked "White & Nerdy", using other customers' actions to help me find interesting content.
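As a sketch of that minimum step, screening out duplicates: if each video had a content fingerprint -- the set-of-hashed-features fingerprint below is a made-up stand-in for real audio/video signatures -- filtering copies out of the related list might look like this:

```python
def dedupe_related(watched_fingerprint, candidates, threshold=0.9):
    """Drop 'related' videos that are really copies of what was watched.

    watched_fingerprint: set of hashed content features for the video
    just watched (an invented stand-in for a real signature).
    candidates: list of (video_id, fingerprint, relatedness) tuples.
    Candidates too similar to the watched video, or to an already-kept
    candidate, are dropped; survivors are returned best-first.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    kept, seen = [], [watched_fingerprint]
    for vid, fp, rel in sorted(candidates, key=lambda c: -c[2]):
        if any(jaccard(fp, s) >= threshold for s in seen):
            continue  # same content as the watched video or a kept one
        kept.append((vid, rel))
        seen.append(fp)
    return kept
```

Real duplicate detection for video is much harder than set overlap, of course, but the structure -- compare against what was just watched and against everything already recommended -- is the part that seems to be missing.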

Crawling the world's information is not enough. You need to make that information useful. You must help people find relevant information, help people find the information they need.

Wednesday, December 06, 2006

Dubious Internet marketers are planting stories, paying people to promote items, and otherwise trying to manipulate rankings on Digg and other so-called social-media sites like Reddit and Delicious.

Some marketers offer "content generation services," where they sell stories to Web sites for the sole purpose of getting them submitted to Digg and other sites.

Companies charge as much as $15,000 to get content up on Digg, said [ACS CTO] Neil Patel ... If a story becomes popular on Digg and generates links back to a marketer's Web site, that site may rise in search engine results and will not have to spend money on search advertising, he said.

Another way to get Web links to a suspicious site is to get inside help from users at a social-media site. For instance, spammers have tried to infiltrate Digg to build up reputations and promote stories for marketers, experts say.

Other scammers are trying other ways to buy votes. A site dubbed "User/Submitter" purports to pay people 50 cents for digging three stories and charges $20 for each story submitted to the site, plus $1 for every vote it gets. The Spike the Vote Web site boasts that it is a "bulletproof way to cheat Digg" and offers a point system for Digg users to submit and dig stories. And Friendly Vote bills itself as an "online resource for Web masters" to improve their marketing on sites like Digg and Delicious.

These problems with Digg were predictable. Getting to the top of Digg now guarantees a flood of traffic to the featured link. With that kind of reward on the table, people will fight to win placement by any means necessary.

It was not always this way. When Digg was just used by a small group of early adopters, there was little incentive to mess with the system. The gains from bad behavior were low, so everyone played nice.

Now that Digg is starting to attract a large mainstream audience, Digg will be fighting a long and probably losing battle against attempts to manipulate the system for personal gain.

There seems to be a repeating pattern with Web 2.0 sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience.

[CTO] Farzad Nazem will be pressed to speed development of the company's delayed next-generation advertising software.

Analysts believe one of the chief reasons behind Yahoo's current woes is its failure to deploy software that could prioritize the placement of the most lucrative ads on its Internet properties.

During the past year, Yahoo's financial performance has repeatedly disappointed investors who have sent its stock price plunging more than 35 percent.

See also my earlier post, "Yahoo's troubles", where I said, "The business is advertising ... To fail to compete on advertising is to fail."

See also Om Malik's harsher comments on the reorg. By the way, I kind of like the mission Om gave for Yahoo at the end of his post, to organize all "relevant information". That captures the importance of helping people focus their attention on useful information rather than making all information accessible.

Monday, December 04, 2006

I had the great pleasure of giving a talk today on practical issues in personalization and recommendations for the Data Mining (CS345) class at Stanford taught by Anand Rajaraman and Jeff Ullman.

The slides from my talk are available in two versions. The first version is the talk I actually gave; make sure to read the notes pages for the slides, or it will be difficult to follow. The second version is done in a very different style and should be easier to follow without me blabbing away in front of you.

It was very fun giving this talk. The students were clever, thoughtful, and enthusiastic. I was pleased to get a chance to talk again with Anand, a sharp former colleague from Amazon.com. And, I was overjoyed to meet Jeff Ullman who, for many of us computer science geeks, is a legend because of his seminal work and books.

I hope the talk was as fun for those in the audience as it was from the podium. Thanks again, Anand, for inviting me to speak.

See also my earlier post about the excellent lecture notes that are publicly available for this data mining class. If you are working (or just dabbling) in this field, they are well worth your time to review.

Update: Matt Wyndowe, who was sitting in the class, posted some thoughts on my talk. Thanks, Matt!

Saturday, December 02, 2006

Jon Fine at BusinessWeek reports on the ongoing saga of dealing with copyright issues following the Google-YouTube deal.

From the article:

Google and YouTube are dangling nine-figure sums in front of major programming and network players -- that is, the Time Warners, News Corps, and NBC Universals of the world. Google calls these monies licensing fees, according to executives who've been involved in the discussions.

But some of them characterize the subtext like this: Don't sue us over copyrights. Take this (substantial) payment, and trust us to figure out how we'll all make serious money once we get advertising and revenue sharing worked out.

If you're a network, you can't ignore YouTube's reach ... But if you're a network ... your copyrights, and insisting on your programming's premium value, underpin the entire business model. To complicate matters, no publicly traded media company today is in a position simply to dismiss, say, $100 million. Such a sum far exceeds what any single broadcast network can extract from the online world.

I have to admit, I was surprised lawsuits were not filed immediately after the GooTube deal closed. This appears to explain why. It is not that Google thinks YouTube has no infringing content (as Eric Schmidt absurdly claimed), but that the studios are dazzled by shiny hordes of Google boodle.

It does not seem like a good position for Google. They essentially are saying: We know YouTube is illegal, but here's a huge bribe if you ignore it for now. Those are going to be expensive deals for Google; the studios know all the leverage is on their side.

And, perhaps it is naive of me to think Googlers actually believe in it, but it is hard for me to see how this fits into Google's "do no evil" philosophy. Even if you think current IP laws need to be changed, pushing that forward by brazenly violating copyright law seems to cede the high ground.

See also my previous post, "YouTube is not Googly". That post originally was written before the deal had been announced, but the updates at the bottom have additional links and comments on Google's efforts to buy off YouTube lawsuits.

See also Om Malik's post where he said, "It is the distraction factor ... The copyright issues and all those other problems are going to strain google where it is weakest - management and control."