Monday, January 28, 2008

Puppin et al. have a 2007 paper, "Load-Balancing and Caching for Collection Selection Architectures" (PDF) with a curious idea for optimizing large scale distributed search engines they call incremental caching.

The basic concept is to do a bit of additional work each time the cache is accessed, adding the results from that additional work to the cache. In this way, cached search results improve each time they are accessed.

From the paper:

[We] discuss a novel class of caching strategies that we call incremental caching.

When a query is submitted to our system, its results are looked for in the cache. In the case of a hit, some results previously retrieved from a subset of servers will be available in the cache. The incremental cache however will try to poll more servers for each subsequent hit, and will update the top-scoring results stored in the cache.

Over time, the cached queries will get perfect coverage because results from all the servers will be available. This is true, in particular, for common or bursty queries: the system will show great performance in answering them.
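The core loop is easy to sketch. Below is a minimal, illustrative Python version; the names (`IncrementalCache`, `servers_per_hit`) are my own, not from the paper. On every access, it polls a few more servers and merges their results into the cached top-k.

```python
import heapq

class IncrementalCache:
    def __init__(self, servers, servers_per_hit=2, k=10):
        self.servers = servers            # callables: query -> [(score, doc)]
        self.servers_per_hit = servers_per_hit
        self.k = k
        self.cache = {}                   # query -> (servers polled, results)

    def search(self, query):
        polled, results = self.cache.get(query, (0, []))
        if polled < len(self.servers):
            # On each access, hit or miss, poll a few more servers and
            # fold their results into the cached top-k.
            batch = self.servers[polled:polled + self.servers_per_hit]
            for server in batch:
                results = heapq.nlargest(self.k, results + server(query))
            polled += len(batch)
            self.cache[query] = (polled, results)
        return results
```

After enough hits, a popular query reaches full coverage and further accesses are pure cache reads, which is exactly the behavior the paper describes for common or bursty queries.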

Seems like this makes things a lot more complicated, however, with users of the cache having to do a lot more work to figure out what to do after a cache hit, the cache suffering a lot more write contention, and debugging the system (determining why you got the results you did) becoming much more difficult.

Moreover, it is not clear to me that this offers much value over simpler schemes, such as a cache that combines a static cache computed from access patterns with a smaller normal run-time cache to catch new or bursty behavior.

Even so, it seems valuable for things like federated search where the data sources are not under your control, each data source access is expensive, and there may be limits on the number of data sources you reasonably can query simultaneously. In that case, more traditional caching might cache the results from each of the data sources independently, but incremental caching probably would allow a much more compact and efficient cache.

Please see also my Jan 2007 post, "Yahoo Research on distributed web search", that discusses another 2007 paper, "Challenges in Distributed Information Retrieval" (PDF). That paper shares authors in common with this one and also discusses some ideas around distributed search including caching.

Update: About a year later, Puppin et al. published another paper, "Tuning the Capacity of Search Engines: Load-driven Routing and Incremental Caching to Reduce and Balance the Load" (ACM), that proposes another scheme, one that is quite clever, that learns from searcher behavior (the query-click graph) to cluster index chunks, maximize the likelihood of cache hits, and adapt to load. Nice work there with several good ideas.

Saturday, January 26, 2008

Microsoft Researchers Benjamin Livshits and Emre Kiciman have the fun idea in their recent paper, "Doloto: Code Splitting for Network-Bound Web 2.0 Applications" (PDF), of automatically optimizing the downloading of Javascript code for large Web 2.0 applications in a way that minimizes the delay before users can interact with the website.

The basic concept is simple but clever. Look at the code to see what Javascript functions are called immediately and which are called only after a user interacts with the website. Then, replace the functions that are not needed immediately with stubs that get the Javascript function body on demand. Finally, as network bandwidth becomes available, preemptively load the stubbed functions' code in the order of the likelihood that it will be needed quickly.

From the paper:

[We] rewrite existing application JavaScript code based on a given access profile to split the existing code base into small stubs that are transferred eagerly and the rest of the code that is transferred either on-demand or in the background using a prefetch queue.

When a new JavaScript file is received from the server on the client, we let the browser execute it normally. This involves running the top-level code that is contained in the file and creating stubs for top-level functions contained therein.

When a function stub is hit at runtime, if there is no locally cached function closure, download the function code using helper function blocking download, apply eval to it, ... cache the resulting function closure locally ... [and then] apply the locally cached closure and return the result.

When the application has finished its initialization and a timer is hit, fetch the next cluster [of Javascript code] from the server and save functions contained in it on the client.
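The runtime behavior in the excerpt can be sketched in a few lines. This is a rough Python analogue, not the actual Doloto rewriter, which operates on JavaScript in the browser; `fetch_source` here stands in for downloading a function body over the network.

```python
def make_stub(name, fetch_source):
    cache = {}
    def stub(*args, **kwargs):
        if name not in cache:
            # First call: download the function body, compile it, and
            # cache the resulting function so later calls are direct.
            namespace = {}
            exec(fetch_source(name), namespace)
            cache[name] = namespace[name]
        return cache[name](*args, **kwargs)
    return stub
```

The stub is tiny and transferred eagerly; the function body only crosses the network on first use (or earlier, if a background prefetcher warms the cache first).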

The paper motivates this optimizer with some pretty remarkable statistics on how big Web 2.0 applications are becoming. The paper says, for example, Pageflakes is over 1M gzipped, GMail nearly 900k, and Live Maps 500k. By automatically splitting the code and data into pieces that are needed immediately and those that are not, "the time to download and begin interacting with large applications is reduced by 20-40%."

As Web applications add more and more functionality to try to match their offline competition, their code will become bigger and bigger, and techniques like Doloto are sure to be necessary.

On a related note, if you have not seen Yahoo Steve Souders' work on exceptional performance for Web 2.0 sites, you definitely should. What makes Steve's work so compelling is that it correctly focuses on the user experience -- the delay a user sees when trying to view a web page -- and not the server-side. Because of that, much of his advice and the YSlow tool look for ways to reduce the size of the download and number of connections needed to render a web page.

Saturday, January 19, 2008

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

1. A giant step backwards in the programming paradigm for large-scale data-intensive applications

2. A sub-optimal implementation, in that it uses brute force instead of indexing

3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

4. Missing most of the features that are routinely included in current DBMS

5. Incompatible with all of the tools DBMS users have come to depend on

The comments on the post are enjoyable and useful. Many rightfully point out that it might not be fair to compare a system like MapReduce to a full database. DeWitt and Stonebraker do partially address this, though, by not just limiting their criticism to GFS, but also going after BigTable.

The most compelling part of the post for me is their argument that some algorithms require random access to data, something that is not well supported by GFS, and it is not always easy or efficient to restructure those algorithms primarily to do sequential scans.

However, as the slides from one of Google's talks on MapReduce say, "MapReduce isn't the greatest at iterated computation, but still helps run the 'heavy lifting'" (slide 33, lecture 5). And those slides in both lectures 4 and 5 give examples of how iterated algorithms like clustering and PageRank can be implemented in a MapReduce framework.

Moreover, from the heavy usage of MapReduce inside of Google, it is clear that the class of computations that can be done in MapReduce reasonably efficiently (both in programmer time and computation) is quite substantial. It seems hard to argue that MapReduce is not supporting many of Google's needs for large scale computation.

Perhaps if the argument is that Google's needs are specialized and that others may find their computations more difficult to implement in a MapReduce framework, DeWitt and Stonebraker have a stronger point.

Update: I finally got around to reading Stonebraker et al., "The End of an Architectural Era" (PDF) from VLDB 2007. The article is mostly about Stonebraker's H-store database, but I cannot help but notice that many of the points the authors make against RDBMSs would seem to undermine the new claim that MapReduce is "a giant step backwards."

For example, Stonebraker et al. write, "RDBMSs can be beaten handily ... [by] a specialized engine." They predict that "the next decade will bring domination by shared-nothing computer systems ... [and] DBMS should be optimized for this configuration .... We are heading toward a world ... [of] specialized engines and the death of the 'one size fits all' legacy systems." Finally, they write "SQL is not the answer" and argue for a system where the computation is done on the nodes near the data.

While the authors also make some more specific statements that do not apply quite as well to GFS/MapReduce/BigTable -- for example, that we will have "a grid of systems with main memory storage, built-in high availability, no user stalls, and useful transaction work in under 1 millisecond" -- I do not understand why they do not see MapReduce as a wonderful example of a specialized data store that runs over a massive shared-nothing system and keeps computation near the data.

Update: Stonebraker and DeWitt follow up in a new post, "MapReduce II". The new post can be summarized by this line from it: "We believe it is possible to build a version of MapReduce with more functionality and better performance." That almost certainly is true, but no one has yet, at least not at the scale Google is processing.

Eyetracking studies are really expensive, but mouse tracking can be done just by dropping some Javascript into a page. How closely can we approximate the results from a costly eyetracking study just using cheap and easily implemented mouse tracking?

From the paper:

Eye tracking can provide insights into users’ behaviour while using the search results page, but eye tracking equipment is expensive and can only be used for studies where the user is physically present. The equipment also requires calibration, adding overhead to studies.

In contrast, the coordinates of mouse movements on a web page can be collected accurately and easily, in a way that is transparent to the user. This means that it can be used in studies involving a number of participants working simultaneously, or remotely by client-side implementations – greatly increasing the volume and variety of data available.

To capture mouse movements, we used ... a piece of Javascript code at the top of every Google search results page visited. This Javascript code captured the user’s mouse coordinates (and the ID of the HTML element the mouse was over) at 100 millisecond intervals, and submitted the gathered data, with timestamps, into a MySQL database every 2 seconds (or when the user left the Google search results page).

The work is just exploratory -- concluding only that mouse tracking "definitely [shows] potential as a way to estimate which results ... the user has considered before deciding where to click" -- but does have interesting results on the relationship between mouse and eye movements. It also reports on common mouse moving patterns they saw searchers do when considering on which search result to click.

The goal of this kind of work is to automatically learn what makes a good relevance rank for web search, certainly fun and useful stuff. In this case, the learning is from hand-labeled training data, but there have been other attempts to learn from clickstream data.

After reviewing the older RankNet paper, I said I was surprised to see a neural network used and that it was hard to tell how effective RankNet is because it was not compared to other approaches. This 2007 tech report fixes that, evaluating the newer LambdaRank NNet against three other non-NNet approaches, but unfortunately reports (in Section 6.2) that the NNet had relatively poor performance.

One thing I am curious about in the paper is whether what they are optimizing for is really well correlated with user satisfaction. It would seem to me that their particular definition of discounted cumulative gain (as described in Section 2) would underweight the value of the first search result and overweight the value of later results relative to other possible definitions, especially for results below the fold on the page. It is worth pointing out that their measure differs from the metric used by Huffman and Hochster which much more heavily weighs the first few results (especially for navigational queries).
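For concreteness, here is one common formulation of discounted cumulative gain in Python; the tech report's exact gain and discount functions may differ from this sketch.

```python
import math

def dcg(relevances, top=10):
    # Gain (2^rel - 1), discounted by log2(rank + 1): result 1 keeps
    # full weight, result 2 about 0.63, result 10 about 0.29.
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:top], start=1))
```

Note how gradual that discount curve is; a metric tuned to navigational behavior, like Huffman and Hochster's, concentrates far more of the weight on the first result.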

By the way, I have to admit, I found this 2007 tech report quite difficult to read. It may have just been me but, if you decide to attack it, you probably should be prepared to put in some effort.

If you are interested in this, you might also be interested in my June 2006 post, "Beyond PageRank, RankNet, and fRank", which looks at some similar work by different researchers at MSR.

Googlers Scott Huffman and Michael Hochster have a SIGIR 2007 paper, "How Well does Result Relevance Predict Session Satisfaction?" (ACM), that attempts to find a simple metric for people's satisfaction when they search.

The paper looks at "session satisfaction", not query satisfaction, because the goal is to determine how satisfied people are when trying to accomplish a task using the search engine. In some cases, people make many queries before getting the answer they need or before they give up.

The metric they end up with is a combination of the relevance of the first three results for the first query in the session that treats navigational and non-navigational queries differently (placing much more value on the first result for navigational queries). They also got a small boost by considering how many pages (which they call "events") were viewed in the session.
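Purely to illustrate the shape of such a metric, here is a hypothetical sketch in Python. The weights are placeholders of my own, not the fitted values from the paper.

```python
def predicted_satisfaction(rel_top3, navigational, events_viewed=0):
    # Navigational queries lean almost entirely on the first result;
    # other queries spread the weight across the top three.
    if navigational:
        weights = (0.8, 0.1, 0.1)
    else:
        weights = (0.5, 0.3, 0.2)
    score = sum(w * r for w, r in zip(weights, rel_top3))
    # Small boost from session length ("events" viewed), as in the paper.
    return score + 0.05 * events_viewed
```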

Don't miss the discussion in the paper of other things they tried (Section 4). Particularly interesting is that only navigational queries appear to require special handling and that they never tried using the last query in the session rather than the first (though they mention they would like to).

One thing I would have liked to see more of was a discussion of why they only consider the first three results. It was my understanding that a common behavior for searchers is to look at the top 1-3 results, then quickly skim the remainder. If true, I wonder if this suggests that the relevance of the top couple results is most important, but that the other results should still be considered in some form.

Finally, I would love to see a version of this using clickstream data instead of manually labeled relevance for the search results. Clickstream data is easily available. It would be tremendously useful to have a good proxy for session satisfaction that uses clicks instead of other data.

Thursday, January 10, 2008

Live Labs is an applied research group affiliated with Microsoft Research and MSN. The group has the enjoyable goal of not only trying to solve hard problems with broad impact, but also getting useful research work out the door and into products so it can help as many people as possible as quickly as possible.

Live Labs is led by Gary Flake, the former head of Yahoo Research. It is a fairly new group, formed only two years ago. Gary wrote a manifesto that has more information about Live Labs.

Google has published slides and video from a course taught to interns during the summer of 2007.

The slides are well worth reading whether you are new to these topics or consider yourself an old pro. They cover introductory topics in distributed systems, motivate and describe MapReduce, and then discuss how to implement the PageRank algorithm and a variant on K-means clustering in parallel on a large cluster using MapReduce.
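To give a flavor of how an iterated algorithm like PageRank maps onto the model, here is a toy, single-process rendering of one iteration in map/reduce style. This is an illustration of the idea, not the implementation from the slides, and it ignores details like dangling pages.

```python
from collections import defaultdict

def pagerank_iteration(links, ranks, damping=0.85):
    n = len(links)
    # Map: each page emits a share of its rank to each of its out-links.
    emitted = defaultdict(list)
    for page, outs in links.items():
        for out in outs:
            emitted[out].append(ranks[page] / len(outs))
    # Reduce: each page sums the shares it received.
    return {page: (1 - damping) / n + damping * sum(emitted.get(page, []))
            for page in links}
```

Each iteration is one MapReduce job; you run the job repeatedly until the ranks converge, which is exactly the "iterated computation" pattern the slides discuss.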

It is a great introduction to MapReduce, but what I found most interesting was the numbers they cite on usage of MapReduce at Google.

Jeff and Sanjay report that, on average, 100k MapReduce jobs are executed every day, processing more than 20 petabytes of data per day.

More than 10k distinct MapReduce programs have been implemented. In the month of September 2007, 11,081 machine years of computation were used for 2.2M MapReduce jobs. On average, these jobs used 400 machines and completed their tasks in 395 seconds.

What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.

By the way, there is another little tidbit in the paper about Google's machine configuration that might be of interest. They describe the machines as dual processor boxes with gigabit ethernet and 4-8G of memory. The question of how much memory Google has per box in its cluster has come up a few times, including in my previous posts, "Four petabytes in memory?" and "Power, performance, and Google".

Update: Let me also point out Professor David Patterson's comments on MapReduce in the preceding article in the Jan 2008 issue of CACM. David said, "The beauty of MapReduce is that any programmer can understand it, and its power comes from being able to harness thousands of computers behind that simple interface. When paired with the distributed Google File System to deliver data, programmers can write simple functions that do amazing things."

Friday, January 04, 2008

Early January is the time we see many predictions for 2008. I have not played this game since 2006, but I want to chime in this year.

I am only going to make one prediction, but one with broad impact. We will see a dot-com crash in 2008. It will be more prolonged and deeper than the crash of 2000.

The crash will be driven by a recession and prolonged slow growth in the US. Global investment capital will flee to quality, ending the speculative dumping of cash on Web 2.0 startups.

Venture capital firms will seek to limit their losses by forcing many of their portfolio companies to liquidate or seek a buyout. Buyout prospects will be poor, however, as the cash-rich companies find themselves in a buyer's market and let those seeking a savior come face-to-face with the spectre of bankruptcy before finally buying up the assets on the cheap.

Startups that manage to get cash before the bubble collapses will have a cash hoard, but will find little opportunity to rest on it. Most startups will find their revenue models were unrealistic and will rapidly have to seek change. Many will jump over to advertising, but the advertising market will have constricted. Bigger businesses will seek to drive out the new entrants, and online advertising will become a cutthroat business with little profit to be found. Other startups may shift toward licensing and development deals for bigger companies, but will find their investors impatient now that the promised $500M startup has become a $10M company.

The big players will not be immune from this contagion. Google, in particular, will find its one-trick pony lame, with the advertising market suddenly stagnant or contracting and substantial new competition. The desperate competition with dwindling opportunity will drive profits in online advertising to near zero. Google and Yahoo will find their available cash dropping and will do substantial layoffs.

Unfortunately, this scenario has privacy implications as well. Much like we saw after the 2000 crash, it is likely that those with little to lose will attempt scary new forms of advertising. The Web will become polluted with spyware, intrusiveness, and horrible annoyances. None of this will work, of course, and there will be lawsuits and new privacy legislation, but we will have to endure it while it lasts.

It is a dire scenario, but one that looks much like what we saw after 2000. That was a much smaller crash without the fuel from broader problems in the US economy, but we still had investment capital shut off for a few years, most startups shut down, and the remaining startups shift business models. We also saw a dramatic rise in pop-up advertising and spyware.

The crash of 2008 will be similar to 2000 but deeper. We all will have to weather the storm.

Thursday, January 03, 2008

Findory was a personalized news site. The site launched in January 2004 and shut down November 2007.

A reader first coming to Findory would see a normal front page of news, the popular and important news stories of the day. When someone read articles on the site, Findory learned what stories interested that reader and changed the news that was featured to match that reader's interests. In this way, Findory built each reader a personalized front page of news.

Below is a screenshot of an example personalized Findory home page. Articles marked with a sunburst icon are personalized, picked specifically for this reader based on this person's reading history.


Findory's personalization used a type of hybrid collaborative filtering algorithm that recommended articles based on a combination of content similarity and articles that tended to interest other Findory users with similar tastes.

One way to think of this is that, when a person found and read an interesting article on Findory, that article would be shared with any other Findory readers who likely would be interested. Likewise, that person would benefit from interesting articles other Findory readers found. All this sharing of articles was done implicitly and anonymously without any effort from readers by Findory's recommendation engine.

Findory's news recommendations were unusual in that they were primarily based on user behavior (what articles other readers had found), worked from very little data (starting after a single click on Findory), worked in real-time (changed immediately when someone read an article), required no set-up or configuration (worked just by watching articles read), and did not require readers to identify themselves (no login necessary).
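To illustrate the behavior-based half of such a recommender, here is a hedged sketch of simple item-to-item co-occurrence ("readers of this article also read") in Python. Findory's actual hybrid algorithm also used content similarity and was certainly more sophisticated than this.

```python
from collections import defaultdict

class ArticleRecommender:
    def __init__(self):
        self.history = defaultdict(set)              # reader -> articles read
        self.cooccur = defaultdict(lambda: defaultdict(int))

    def record_click(self, reader, article):
        # Update co-occurrence counts immediately, so recommendations
        # can change after a single click, with no login or setup.
        for prior in self.history[reader]:
            self.cooccur[prior][article] += 1
            self.cooccur[article][prior] += 1
        self.history[reader].add(article)

    def recommend(self, reader, n=5):
        scores = defaultdict(int)
        for article in self.history[reader]:
            for other, count in self.cooccur[article].items():
                if other not in self.history[reader]:
                    scores[other] += count
        return sorted(scores, key=scores.get, reverse=True)[:n]
```

Note that the sharing is implicit and anonymous: a reader's clicks only ever surface as aggregate co-occurrence counts.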

Findory's primary product was in news, but the broader goal of Findory was to personalize information. Toward that end, Findory had alpha features that would recommend videos, podcasts, feeds, advertisements, and web search results.

Video, podcast, and feed recommendations worked much like the news recommendations. The advertisement recommendations were an unusual form of fine-grained personalized advertising that attempted to target advertisements based not only on the content of the page, but also a person's reading history on Findory. The web search was an unusual form of fine-grained personalized web search that modified Google search results to feature items clicked on by searchers with similar search behavior (recommendations) or that were clicked on by this specific searcher in the past (re-finding).

At its height, Findory was a popular website with over 100k unique visitors and 5M page views per month. Findory was well reviewed and received press coverage in the Wall Street Journal, Forbes, Time Magazine, PC World, The Times, Spiegel, Seattle PI, Seattle Times, Puget Sound Business Journal, KPLU, Slate, and elsewhere.

More information and more details on Findory's history can be found in my many previous posts on Findory.

Wednesday, January 02, 2008

Andrei Broder from Yahoo Research will be giving a talk on "Computational Advertising" next week (Thursday, Jan 10) at University of Washington. Video of the talk will be available live and archived a few days afterward.

The talk looks like a great one. Excerpts from the description:

Computational advertising ... [attempts to] find the "best match" between a given user in a given context and a suitable advertisement.

The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this challenge leads to a variety of massive optimization and search problems, with complicated constraints.

This talk will give an introduction to this area and give a "taste" of some recent results.

From the way Andrei is framing the problem -- matching advertisements not only to content, but also to what we know about each user -- Andrei clearly is talking about personalized advertising.

Personalized advertising is a tremendous computational challenge. Traditional contextual advertising matches ads to static content. We only have to do the match infrequently, then we can show a selection of the ads that we think will work well for a given piece of content to everyone who views that content.

With personalized advertising, we match ads to content and each user's interests, and then show different ads for each user. Like with all personalization, caching no longer works. Each user sees a different page. With personalized advertising, targeting ads now means we have to find matches in real-time for each page view and each user.
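A tiny sketch of why caching breaks: once the score depends on the user's history, the ranking has to be recomputed per request. All names here are hypothetical, and real ad targeting is of course far richer than term overlap.

```python
def rank_ads(ads, page_terms, user_terms, n=3, history_weight=0.5):
    # Score each ad by its match to the page content plus a weighted
    # match to this user's interests; different users see different ads
    # on the same page, so no single ranking can be cached.
    def score(ad):
        content_match = len(ad["terms"] & page_terms)
        user_match = len(ad["terms"] & user_terms)
        return content_match + history_weight * user_match
    return sorted(ads, key=score, reverse=True)[:n]
```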

On a related note, Yahoo Researcher Omid Madani and ex-Yahoo Researcher Dennis DeCoste had an interesting short paper back in 2005, "Contextual Recommender Problems" (PDF), that has some more thoughts on this problem. As I wrote in an older post, Omid and Dennis treated personalized advertising as a recommendations problem and proposed a few methods of attacking the problem.

By the way, this talk seems to be a shift for Andrei away from a more general problem of personalized information -- which he called "information supply" -- toward focusing on the more specific task of personalized advertising.

Update: It appears this talk will not be archived. It will be broadcast live. If you want to see it remotely, you will have to watch it at the time of the talk.

Update: It was an interesting talk, but, frankly, a bit disappointing in its lack of depth.

Andrei spent most of the talk describing the state of online advertising today, including market size and how targeted advertising works. He touched on some of the more interesting and harder problems, but only touched on them, and only very briefly.

For example, on one slide, Andrei criticized Google AdSense for showing ads for Libby shoes on an article about Dick Cheney and Scooter Libby, saying that the match is spurious. But, Andrei did not say what would be a better ad to show for that news article. In response to my question later, Andrei did say that perhaps no ad is appropriate in that case, but he did not expand on this to talk about how to detect, in general, when it might be undesirable to show ads because of lack of value and commercial intent. When I asked a follow-up question after the talk, he expanded briefly into ideas around personalized advertising -- showing ads that might interest this user based on this person's history rather than ads targeting the current content -- and an advertising engine that explores and attempts to learn what ads might be effective, but not in any depth.

For another example, on one slide, Andrei drew a parallel between web search and advertising search, arguing that both can be seen as searches for information, but pointed out that advertisements are a smaller database of smaller documents and that the relevance rank of a search for ads depends on the bids. He did not discuss the issue that web search in some ways is an easier problem, though, in that the results are more easily cached. Web search relevance rank is static over substantial periods of time, but advertisements are not because the relevance of ads depends not only on keyword matching, but also on bids, competing bids, budgets, and clickthrough rates, all of which can vary rapidly.

For a third example, Andrei briefly mentioned using a user profile for personalized advertising, but only touched the surface of what that profile should contain, how it should be used, where it should be stored, in what cases personalized advertising is likely to outperform unpersonalized advertising, and how trying to show different ads to different people massively increases the computation necessary for ad targeting.

Overall, there was an unfortunate lack of detail on how to solve the "best match" massive optimization challenge of online advertising under all its complicated constraints. Andrei did hint at one point that all the search giants are reluctant to talk about these details, but it is too bad that we were not able to explore the fun issues in more depth.
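To make the earlier point about ad ranking concrete, here is a deliberately simplified sketch of expected-revenue ranking, bid times predicted clickthrough rate, with a crude budget check. Real systems juggle far more constraints, and every input here (bids, competing bids, budgets, clickthrough rates) can change rapidly, which is why ad results are so much harder to cache than web results.

```python
def rank_by_expected_revenue(candidates):
    # Exclude ads whose budget is exhausted, then rank by bid * CTR,
    # a rough proxy for expected revenue per impression.
    eligible = [ad for ad in candidates if ad["budget_left"] > 0]
    return sorted(eligible, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)
```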

Update: Andrei gave a version of his talk at WWW 2008. During the question and answer time, I asked him to expand on what is the "best match" for an advertisement given a user and a context.

He described three major categories of utility: advertiser utility, user utility, and publisher utility. In response to further questions, he suggested that advertiser utility is complicated by the fact that ad agencies may have different incentives than the advertiser and by branding effects. He also pointed out that publisher utility is not as simple as just revenue because of publisher branding issues (e.g. the New York Times will not accept ads for pornography).

As for user utility, he suggested that it is complicated and difficult to measure, but that clickthrough rate may be one proxy for it.

Andrei did not expand on how these utility functions could or should be combined or how to deal with conflicts between them.

On a related note, Saul Hansell has some "New Questions for a New Year" with some harsh thoughts on the search giants, including Google's inability to "create a significant advertising business for any format other than text ads", Yahoo's failure to become "the best company to work for" which is leaving them with nothing but "a site that is just an old habit in need of changing", and Microsoft's need to go "through MSN and Windows Live with an honest assessment of their business prospects" and determine "how many of the new initiatives" are "rational" investments. As for MySpace, Saul snipes, "Does anyone care anymore?"