Saturday, June 30, 2007

Ron Kohavi, Randal Henne, and Dan Sommerfield from Microsoft have a paper on A/B testing, "Practical Guide to Controlled Experiments on the Web" (PDF), at the upcoming KDD 2007 conference.

Ronny Kohavi was at Amazon.com as Director of Personalization and Data Mining for about two years (Sept 2003 - June 2005). The paper contains some mentions of Amazon's A/B testing framework (which was developed in the 1990s, but has been continuously refined since then) and other useful information on running experiments on a live website.

Some excerpts from the paper:

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called ... A/B tests.

The authors of this paper were involved in many experiments at Amazon, Microsoft, Dupont, and NASA. The culture of experimentation at Amazon, where data trumps intuition, and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively.

Controlled experiments provide a methodology to reliably evaluate ideas ... Most organizations have many ideas, but the return-on-investment (ROI) for many may be unclear ... A live experiment goes a long way in providing guidance as to the value of the idea.

Many theoretical techniques seem well suited for practical use and yet require significant ingenuity to apply them to messy real world environments. Controlled experiments are no exception. Having run a large number of online experiments, we now share several practical lessons:

A Treatment might provide a worse user experience because of its performance ... because it is slower ... Compute the minimum sample size needed for the experiment ... We recommend that 50% of users see each of the variants in an A/B test ... A small [win] ... may not outweigh the cost of maintaining the feature ... Running frequent experiments and using experimental results as major input to company decisions and product planning can have a dramatic impact on company culture.
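The excerpt's advice to "compute the minimum sample size needed for the experiment" can be sketched with the standard normal-approximation formula for comparing two conversion rates. This is a generic statistics sketch, not the paper's exact recipe; the function name and default z-values (95% confidence, 80% power) are my own:

```python
import math

def min_sample_size(p, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per variant to detect an absolute
    change of `delta` in a baseline conversion rate `p`, at 95%
    confidence and 80% power (normal-approximation formula)."""
    variance = p * (1 - p)
    return math.ceil(2 * ((z_alpha + z_beta) ** 2) * variance / delta ** 2)

# Detecting a 0.5% absolute lift on a 5% baseline conversion rate
# requires roughly 30k users per variant:
n = min_sample_size(0.05, 0.005)
```

Note how quickly the required sample grows as the effect you want to detect shrinks, which is why the paper's advice to send 50% of users to each variant matters: splitting traffic unevenly only lengthens the experiment.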

Amusingly, details from a couple of posts on this weblog are quoted at a couple of points in the paper.

Ronny also gave a talk (PDF) at eBay Research Labs earlier this month that covered similar material.

See also Dare Obasanjo's post on Ronny's paper and his eBay Labs talk.

See also "Front Line Internet Analytics at Amazon.com" (PDF), a 2004 talk by Ronny Kohavi and former Amazon.com Personalization Director Matt Round that has more details on Amazon.com's A/B testing framework.

Update: It appears that all three of the authors of the paper -- Ron Kohavi, Randal Henne, and Dan Sommerfield -- were at Amazon.com. Randal Henne was at Amazon Apr 2003 - May 2006. Dan Sommerfield was there Dec 2003 - Jul 2006.

The other talks may show up on Google Video in a few days. If they do, I will update this post with links to them.

If you are interested in this conference, you might also be interested in a few ([1][2][3][4][5]) of my many older posts on large scale computing at Google that include links to other papers, presentations, and videos.

Update: Some other notes ([1][2]) on the conference from Robin Harris at StorageMojo. [via Ewan Silver].

Update: Torsten Curst posted notes as well and was particularly impressed with the Amazon talk. As François Schiettecatte mentioned, a paper with at least some of the material from that talk, "Dynamo: Amazon's highly available key-value store", will be presented at the upcoming ACM Symposium on Operating System Principles.

It's kind of astonishing: Windows users had to wait nearly a quarter century, until Windows Vista, for an OS with really good search features.

Windows XP Search may be the worst of all, with an interface that's as patronizing as it is sluggish and confusing.

See also my Mar 2005 post, "Desktop search should not exist", where I said, "The opportunity for third-party desktop search apps [only] exists because the Microsoft Windows file search is pitifully weak."

See also my Nov 2006 post, "Is desktop search over?", about the desktop search that finally works in Windows Vista.

Monday, June 25, 2007

I just got back from Foo Camp, Tim O'Reilly's "Friends Of O'Reilly" conference.

It was an interesting event, pretty much as described, a self-organized, somewhat chaotic blend of "people who're doing interesting works in fields such as web services, data visualization and search, open source programming, computer security, hardware hacking, GPS, alternative energy, and all manner of emerging technologies" who sat down to chat, debate, "share their works-in-progress, show off the latest tech toys and hardware hacks, and tackle challenging problems together."

Most incredible was the diverse group of attendees, ranging from university professors to tech gurus to venture capitalists to goofy little startups. In addition to the various tech celebrities -- Larry Page, Caterina Fake, Paul Graham, Ray Ozzie, Kevin Rose, to name just a few -- there were even folks such as Wes Boyd, founder of MoveOn.org.

Part of the experience is the opportunity to bump into random people and explore various ideas. Joe McCarthy and I discussed innovation at startups versus innovation in research groups. I talked to Paul Kedrosky about trying to use information on the web for hedge funds. I had an extended discussion with Wes Boyd about the future of journalism. Steve Yegge convinced me not to hate Javascript quite so much. Udi Manber and I discussed his departure from A9. Nat Torkington and I talked about environmental and energy policy. Mark Atwood and I argued about the short-term prospects for utility computing. And, there were many more casual conversations on many more topics.

The many talks, most of which took the form of a discussion rather than a lecture, were remarkable as well. Sadly, there were often three or four talks I wanted to attend in the same time slot -- so much to see, so little time -- but I was able to attend and enjoy many.

For example, Mez Naam gave a fun, SciFi-like talk on what happens as 3D printers become cheaper, smaller, higher quality, and widely adopted for manufacturing. In the near term, we may see some goods reduced to information -- all you need is the blueprint for what to print to make your very own iPhone -- which could cause serious disruptions in some industries and much confusion for intellectual property laws. In the much more speculative longer term, Mez asked, what might happen if people can create drugs, even pathogens, at their desktop with cheap hardware?

Researchers Marti Hearst, Martin Wattenberg, Fernanda Viegas, and Jeffrey Heer talked about data visualization, focusing on demos of Many Eyes and Sense.us. The talk explored how easy data visualization and sharing tools help people collaborate and learn from data. A very cool idea was the ability not only to comment on the graphs, but also to draw on the graphs and refer to other graphs, facilitating discussion and exploration. Marti Hearst also briefly discussed tag clouds, ending with the thought-provoking conclusion that tag clouds are intended not as a particularly useful method of conveying and summarizing information, but as a means of socializing among people.

Researcher Andrea Thomaz from the MIT Media Lab showed off videos of Leonardo, a robot designed with gestures that naturally appeal to and are easily interpreted by people.

Researcher Neil Halelamien from CalTech discussed how placing a rapidly fluctuating magnetic field at the back of someone's head can stimulate neurons on the surface of the brain and create some unusual (and temporary) visual effects involving replay of images just seen.

I sat in on a conversation with Stephen Hsu and several other folks working on computer security that came to the rather dismal conclusion that not only can we expect severe, large scale botnet attacks in the near future, but also we can expect a future where most computers have some low level of infection by malware (much like the human body has a continuous, low-level infection by viruses and bacteria). Some of our discussion is similar to what appeared yesterday in the NYT, "When Computers Attack", which quotes one of the Foo campers, Ross Stapleton-Gray, at one point.

There was a discussion of the book Paradox of Choice -- which argues that more choice can make it difficult to take action and that overoptimizing choices makes people unhappy -- led by H.B. Siegel and including Flickr founders Caterina Fake and Stewart Butterfield. The discussion focused mostly on personal experiences, but some interesting meta questions about the economic rationality of optimization -- cost of time for gathering information versus the cost of a (usually only moderately) sub-optimal choice -- were also raised.

Toby Segaran led a discussion about wisdom of the crowds and hive mind that, at one point, dived into fun questions of whether hive mind communities suffer from tyranny of the majority and end up fracturing at a certain size. Digg came up several times as an example of wisdom of the crowds, tyranny of the majority, and a hive mind that might fracture.

Peter Norvig gave a great version of his talk on the advantages of big data for solving many types of machine learning problems. The machine translation examples are particularly compelling. If you want to check it out, the talk was similar, though not identical, to some of Peter's talks I linked to in an older post.

Finally, I very much enjoyed a session with Erick Wilhelm, Dennis Cramey, and Will Carter talking about location-aware gaming. The basic idea is to have the virtual game world overlap with the real world. Initially, this has taken the form of games where the real world is used for navigation -- moving in the real world moves you in the virtual world, but the virtual world is otherwise separated from the real world -- but there was some fun talk about how the worlds could be blended further. What I would really like to see here is a game where you are essentially someone different in the real world (e.g. a secret agent) and interact with others in the game through your device and through the real world (tasks, information drops, puzzles). It would be like the cell phone is your access into a different persona, but that persona exists both in the real and virtual world.

In all, a very interesting and unusual experience. I am still not sure how I managed to get invited, but it was great to get a chance to go.

Wednesday, June 20, 2007

The current leader has a 7.7% improvement. While each step in closing the remaining 2.3% gap to win the $1M prize will be harder and harder, it is remarkable how far the entrants have come.

I have to admit that I have spent a fair amount of time playing with the contest and the Netflix data. It is quite a bit of fun.

Other than some initial attempts at very simple things like predicting averages or modified averages to get a baseline, most of my attempts were either using variants of traditional collaborative filtering (finding similar users) or item-to-item collaborative filtering (finding similar items). I did get modest improvements, but nothing like the performance of the current leaders. I also played with some simple clustering; my results on that were quite poor.
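For readers unfamiliar with item-to-item collaborative filtering, here is a minimal sketch of the textbook cosine-similarity version -- not Amazon's or Netflix's actual implementation, and the tiny ratings data is made up:

```python
from collections import defaultdict
from math import sqrt

# Made-up toy data: user -> {item: star rating}
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"B": 2, "C": 5},
}

def item_vectors(ratings):
    """Invert user->item ratings into item->user rating vectors."""
    items = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items[item][user] = r
    return items

def cosine(v1, v2):
    """Cosine similarity between two items' rating vectors,
    with the dot product taken over users who rated both."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    norm1 = sqrt(sum(r * r for r in v1.values()))
    norm2 = sqrt(sum(r * r for r in v2.values()))
    return dot / (norm1 * norm2)

items = item_vectors(ratings)
sims = {(a, b): cosine(items[a], items[b])
        for a in items for b in items if a < b}
```

A recommender then suggests, for each item a user rated highly, the items most similar to it. The sparsity problem mentioned below shows up immediately: items sharing few or no raters get unreliable or zero similarities.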

As it turns out, Netflix currently uses a type of item-to-item collaborative filtering for their recommendations, as Netflix's VP of Recommendation Systems Jim Bennet described in a Sept 2006 talk (PDF).

I suspect substantial improvements, then, will require something quite different from what Netflix is already doing. This likely rules out item-to-item collaborative filtering. Experimenting with other techniques, I suspect, will be more likely to bear fruit.

I also suspect that additional data may be useful on this problem. While the creators of the Netflix contest apparently did not expect this much progress this fast, it may be the case that it simply is impossible to get the full 10% improvement using the ratings data alone. In my analyses, data simply seemed too sparse in some areas to make any predictions, and supplementing with another data set seemed like the most promising way to fill in the gaps.

It was very fun working on the contest. As much as I would like to keep plugging away, I am starting to feel constrained by my available hardware (4+ year old Linux boxes), and dropping cash on servers just to keep playing seems excessive. Moreover, if I am going to spend more time and money on this kind of thing, it really should be on Findory. Too bad, it is a cool data set.

By the way, I do want to mention one thing about the structure of the contest. I think the problem statement is a little off given the business needs of Netflix. In particular, the contest requires a recommender system which can predict movies you will hate or be lukewarm to, not just the movies you will like.

A system that predicts only what you will like, often referred to as TopN recommendations, is what most businesses want from recommendations. They want to surface interesting products to customers, helping customers discover products they have never seen. In Netflix's case, they mostly should want to help you discover movies to add to your rental queue.

You might think that a system that is the best at the Netflix contest problem would be the best at the TopN problem, but that is not the case. To see that, take a system that perfectly predicts the TopN next items you will want, but makes mistakes when you are lukewarm to a product. The RMSE of that recommender would be poor in the Netflix contest (because of inaccuracies in predicting ~3 star items), despite the obvious value of the recommender.
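To make that concrete, here is a toy sketch (all numbers hypothetical) of a recommender that nails the 5-star items -- the ones that matter for TopN discovery -- yet scores poorly on the contest's RMSE metric because it guesses badly on lukewarm items:

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical recommender: perfect on the 5-star items it would
# surface in a TopN list, but it predicts 4 stars for everything
# the user is actually lukewarm or cold on.
pairs = [(5, 5), (5, 5), (3, 4), (2, 4), (1, 4)]
error = rmse(pairs)  # ~1.67, driven entirely by the low-star items
```

Despite being ideal for surfacing recommendations, this recommender's RMSE of roughly 1.67 would place it far behind even the simple predict-the-average baselines in the contest.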

Looking at the recent traffic data for Findory.com, I was surprised to see traffic spiking.

In fact, including all traffic, Findory.com is up to 26M page views per month, about 10 page views per second on average.

That's odd, I thought. Findory's advertising revenue and third party analyses from sites like Alexa both show slow but steady declines at Findory.com, not a traffic spike. What is keeping Findory's web servers so darn busy?

Turns out that the vast majority (in excess of 95%) of these page views are various forms of robots, mostly hitting Findory's free APIs.

Those page views are not people. They generate no revenue directly. They have little to no value to Findory.

In fact, I suspect that most of these API accesses are being used for various forms of weblog spam. For example, I suspect some are accessing Findory content, stripping all the links out, then placing AdSense ads or link farm links next to that content. Ah, spam, wonderful spam.

I never have been particularly idealistic when it comes to APIs. I tend to take a cynical view on the motivations of companies that offer free APIs.

I also have suspected that most people using APIs seek short term profits, not innovation or building something substantial. While it is just one data point, Findory's experience appears to confirm that view.

Not only [has Yahoo] not executed on a vision of social search, but that they have bungled the communities that they have purchased and actually done more harm to the company than good.

Almost everyone I've talked to about Yahoo has expressed to me that the company is a wreck. That people are unhappy. That executives are leaving. That bureaucracy reigns supreme and that almost nothing can get done in the current environment.

Page wanted ... to license the PageRank invention and get some royalties while he went back to his academic work. Unfortunately, licensing proved difficult. Only one search engine company made an offer, and it was more of a token offer.

"They (Page and fellow Google co-founder Sergey Brin) got frustrated so they decided to start a company," [Luis Mejia, a senior associate in the Office of Technology Licensing at Stanford University] said.

Larry and Sergey .... began calling on potential partners who might want to license a search technology better than any then available ... They had little interest in building a company of their own.

Among those they called on was friend and Yahoo! founder David Filo. Filo agreed that their technology was solid, but encouraged Larry and Sergey to grow the service themselves ... "When it's fully developed and scalable," he told them, "let's talk again."

Others were less interested in Google, as it was now known. One portal CEO told them, "As long as we're 80 percent as good as our competitors, that's good enough. Our users don't really care about search."

Rejected, frustrated, but not willing to let a good idea die, Larry and Sergey created Google, Inc.

Cord Blomquist at the Competitive Enterprise Institute cleanly explains the appeal of personalization for advertising:

All of this data is not being used to create an Orwellian dystopia, but an online ad revolution.

The Web's most annoying feature, the ubiquitous banner ad, may be forever changed. Fewer in number and more useful, the ads of tomorrow will hone in on users' real needs and wants.

Future searches will know if a query for "ring" should present ads for engagement rings, Lord of the Rings, or Saturn's rings. Ads that appear alongside searches will become a resource, instead of a nuisance, thanks to more intelligently assessing users' intentions. This will make our back-link powered, dumb search of today seem, well...dumb.

Saturday, June 09, 2007

The future of the web is about personalisation. Where search was dominant, now the web is about 'me.' It's about weaving the web together in a way that is smart and personalised for the user.

Personalization can help people discover information they would not find on their own, but it is important not to overstate the impact.

Especially in the context Tapan is discussing -- Tapan leads the team running the Yahoo home page -- discovery is important. People need help navigating Yahoo and finding new content on Yahoo. A Yahoo home page that learns from what you do on Yahoo and helps you get what you need faster would be helpful.

But personalization does not replace search. For people who know what they want, the best thing we can do is get out of the way. When people are actively and explicitly searching, when they are on a mission, it is not the time to distract them.

At Amazon, the majority of people came to the Amazon.com home page, then searched. For those people, we mostly got out of their way, showing them search results and some other helpful information strongly related to their search. However, another group of people came to Amazon without such a sense of purpose. For these people, the personalization was key, helping tailor the home page to focus their attention on a selection of Amazon's massive catalog based on their past interests, a view into Amazon created just for them.

Even if search remains dominant, even if we mostly want to get out of the way when people actively search, personalization can still help. Many times, searchers cannot find what they want when they search. They need help expressing their intent. The search engine needs additional information to understand their intent. By learning from what each person and others have done and found, personalization can help better understand intent and help people get what they need faster.

But, in any case, personalization does not mean search is going away. People often know what they want and want it now. In those cases, we should give it to them. Search is and will remain dominant.

On a somewhat different topic, as the Times UK article describes, many interpreted Tapan's words as Yahoo giving up on core search. I do not have much to add to that, but I do want to point out that this is not the first time a high level Yahoo executive has demonstrated a lack of competitive fire for core search.

Sunday, June 03, 2007

An article in the NYT business section today by Saul Hansell, "Google Keeps Tweaking Its Search Engine", has intriguing details on Google's ranking algorithm from discussions with Googlers Amit Singhal and Udi Manber.

Some excerpts:

When it comes to the search engine — which has many thousands of interlocking equations — [Google] has to double-check the engineers' independent work with objective, quantitative rigor to ensure that new formulas don't do more harm than good.

Recently, a search for "French Revolution" returned too many sites about the recent French presidential election campaign -- in which candidates opined on various policy revolutions -- rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.

Typing the phrase "teak patio Palo Alto" didn't return a local store called the Teak Patio .... Mr. Singhal's group [wrote] a new mathematical formula to handle queries for hometown shops.

Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them ... [Singhal's] team's solution: a mathematical model that tries to determine when users want new information and when they don't. (And yes, like all Google initiatives, it had a name: QDF, for "query deserves freshness.")

I found this surprising. Google manually comes up with tweaks to its search engine that only apply to a small percentage of queries, tests the tweaks, and then tosses them into the relevance rank?

The problem with these manual tweaks is that they rapidly become unwieldy. As you add hundreds or thousands of these hand-coded rules, they start to interact in unpredictable ways. When evaluating a new rule, it becomes unclear if performance of that rule might be improved by tweaks to the rule, tweaks to other rules, or removing other rules that have now been subsumed.

It appears Google has hit this problem head on. The "many thousands of interlocking equations" require a "closely guarded internal [program] called Debug" that attempts to explain whether the rules are doing "more harm than good."

Frankly, I thought Google was beyond this. Rather than piling hack upon hack, I thought Google's relevance rank was a giant, self-optimizing system, constantly learning and testing to determine automatically what works best.

How would this work? With a search engine the size of Google's, every search query can be treated as an experiment, every interaction as an opportunity to learn and adapt.

In each query, a few of the results would be different each time. Each time, the search engine is making a prediction on the impact (usually an anticipated slight negative impact) of making this change. Wrong predictions are surprises, opportunities to learn, and are grouped with other wrong predictions until the engine can generalize and attempt a broader tweak to the algorithm. Those broader tweaks are automatically tested, integrated if they work, and the cycle repeats.

For example, on the query [teak patio palo alto], experiments may show the Teak Patio store in Palo Alto getting unexpectedly high clickthrough. Another query, [garden stone seattle], shows similar problems in experiments. In clustering, both queries are classified as local. Both show modest purchase intent. The clicked URLs have been classified as local businesses. On the known data, the most general rule consistent with these results -- boosting local businesses when a query is classified as local with purchase intent -- appears to give a lift. The rule is tested live on a percentage of users, results match predictions, and the system adds the new rule for all users. The process repeats.
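This loop is, of course, pure speculation on my part about how such a self-optimizing system might work, but the skeleton can be sketched in a few lines (all names, classifications, thresholds, and numbers here are hypothetical):

```python
def surprises(experiments, threshold=0.1):
    """Flag experiments where observed clickthrough beat the engine's
    prediction by more than `threshold` -- the wrong predictions that
    feed rule generalization."""
    return [e for e in experiments
            if e["observed_ctr"] - e["predicted_ctr"] > threshold]

def generalize(flagged):
    """Propose the most general candidate rule consistent with the
    flagged queries: fire on the classifications they all share."""
    shared = set.intersection(*(set(e["classes"]) for e in flagged))
    return {"boost": "local_business", "when_query_has": shared}

# Two hypothetical experiments with surprising clickthrough:
experiments = [
    {"query": "teak patio palo alto",
     "classes": {"local", "purchase_intent"},
     "predicted_ctr": 0.02, "observed_ctr": 0.21},
    {"query": "garden stone seattle",
     "classes": {"local", "purchase_intent"},
     "predicted_ctr": 0.03, "observed_ctr": 0.18},
]

rule = generalize(surprises(experiments))
# The candidate rule would then be tested live on a slice of users
# and kept only if results match its predictions.
```

The real work, obviously, is in the parts this sketch waves away: classifying queries and URLs at scale, and testing billions of candidate rules simultaneously without them interacting badly.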

Similarly, a group of queries that match known names (e.g. celebrities) may show that links classified as sites are appearing too low in rankings. In automated testing, a general version of this rule performs poorly unless another rule with strong overlap is removed; a more specific rule performs well but applies less frequently. The older rule is removed, the newer general rule put in place.

While this process does result in specific tweaks to the engine like the manual approach, it does not rely on someone manually finding the rule. Unexpected tweaks to relevance rank may arise from the data. Moreover, a self-optimizing relevance rank does not rely on someone manually coming back to rules to maintain or debug them over time.

This approach would require massive computational power -- a huge infrastructure classifying and clustering queries and urls into hierarchies, a framework for testing billions of changes and tweaks simultaneously and generalizing from the results -- but I thought Google had that power already.

Perhaps this merely shows how much further there is to go in search. As Larry Page recently said, "We're probably only 5 per cent of the way there."

Update: Five weeks later, there are a few more tidbits about Google's experiments and their tweaking of their relevance rank in an interview of Udi Manber by Eric Enge. Some excerpts:

We run literally thousands of experiments a year and pick the ones that score well.

We have projects that their sole purpose is to reduce complexity. A team may go and work for two months on a new simpler sub-algorithm. If it performs the same as the previous algorithm, but it's simpler, that will be a big win.

Overall, we have to be very careful that the complexity of the algorithm does not exceed what we can maintain.