Tuesday, December 30, 2008

Danny Sullivan is an insightful writer, long-time watcher of the search industry, and founder of Search Engine Watch, Search Engine Land, and the popular Search Engine Strategies (SES) conference. His thoughts are well worth reading.

Monday, December 29, 2008

Amazon CTO Werner Vogels posted a copy of his recent ACM Queue article, "Eventually Consistent - Revisited". It is a nice overview of the trade-offs in large scale distributed databases and focuses on availability and consistency.

An extended excerpt:

Database systems of the late '70s ... [tried] to achieve distribution transparency -- that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.

In the mid-'90s, with the rise of larger Internet systems ... people began to consider the idea that availability was perhaps the most important property ... but they were struggling with what it should be traded off against. Eric Brewer ... presented the CAP theorem, which states that of three properties of shared-data systems -- data consistency, system availability, and tolerance to network partition -- only two can be achieved at any given time .... Relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write ... If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write ... There is a range of applications that can handle slightly stale data, and they are served well under this model.

[In] weak consistency ... The system does not guarantee that subsequent accesses will return the updated value. Eventual consistency ... is a specific form of weak consistency [where] the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value ... The most popular system that implements eventual consistency is DNS (Domain Name System).

[In] read-your-writes [eventual] consistency ... [a] process ... after it has updated a data item, always accesses the updated value ... Session [eventual] consistency ... is a practical version of [read-your-writes consistency] ... where ... as long as [a] session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.

As Werner points out, session consistency is good enough for many web applications. When I make a change to the database, I should see it on subsequent reads, but others who look often do not need to see the latest value right away. And most apps are happy if this promise is violated in rare cases, as long as we acknowledge it explicitly by terminating the session; that way, the app can establish a new session and either decide to wait for eventual consistency of any past written data or take the risk of a consistency violation.

Session consistency also has the advantage of being easy to implement. As long as a client reads and writes from the same replica in the cluster for the duration of the session, you have session consistency. If that node goes down, you terminate the session and force the client to start a new session on a replica that is up.
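A minimal sketch of that sticky-routing idea (the replica and session classes here are invented for illustration, not any particular system's API):

```python
import random

class SessionError(Exception):
    """Raised when the session's replica dies; the caller must start a new session."""

class Replica:
    def __init__(self):
        self.alive = True
        self.data = {}

class Session:
    """Pins all reads and writes to one replica, giving session consistency."""
    def __init__(self, replicas):
        self.replica = random.choice([r for r in replicas if r.alive])

    def _check(self):
        if not self.replica.alive:
            raise SessionError("replica down; start a new session")

    def read(self, key):
        self._check()
        return self.replica.data.get(key)

    def write(self, key, value):
        self._check()
        # Replication to the other replicas would happen asynchronously.
        self.replica.data[key] = value
```

Within one session, a read after a write always sees the write, because both hit the same replica; losing the replica surfaces as a terminated session rather than a silent consistency violation.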

Werner did not talk about it, but some implementations of session consistency can cause headaches if many clients are updating the same data and care what the previous values were. The simplest example is a counter: two clients with sessions on different replicas both try to increment a value i and end up with i+1 in the database rather than i+2. However, there are ways to deal with this kind of data. For example, just for the data that needs it, we can use multiversioning while sending writes to all replicas, or force all read-write sessions to the same replica. Moreover, a surprisingly large amount of application data does not have this issue because there is only one writer, there are only inserts and deletes rather than updates, or the updates do not depend on previous values.
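The lost-update race is easy to see in a toy simulation (the two dicts stand in for two replicas; the reconciliation rule is a made-up last-writer-wins):

```python
# Two clients with sessions on different replicas both increment a counter.
replica_a = {"i": 0}
replica_b = {"i": 0}

# Both clients read i == 0 from their own replica before either write propagates.
a_val = replica_a["i"]
b_val = replica_b["i"]

# Both write back read-value + 1.
replica_a["i"] = a_val + 1
replica_b["i"] = b_val + 1

# When the replicas reconcile (here, take the larger value), i is 1, not 2:
# one of the two increments has been silently lost.
merged = max(replica_a["i"], replica_b["i"])
assert merged == 1
```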

Please see also Werner's older post, "Amazon's Dynamo", which, in the full version of their SOSP 2007 paper at the bottom of his post, describes the data storage system that apparently is behind Amazon S3 and Amazon's shopping cart.

[An] attacker may cleverly decorate portions of such a third-party UI to make it appear as if they belong to his site instead, and then trick his visitors into interacting with this mashup. If successful, clicks would be directed to the attacked domain, rather than attacker's page -- and may result in undesirable and unintentional actions being taken in the context of victim's account.

[For example,] the attacker may also opt for showing the entire UI of the targeted application in a large <IFRAME>, but then cover portions of this container with opaque <DIV> or <IFRAME> elements placed on top ... [Or] the attacker may simply opt for hiding the target UI underneath his own, and reveal it only milliseconds before the anticipated user click, not giving the victim enough time to notice the switch, or react in any way.

Tuesday, December 09, 2008

Yahoo ... is planning to .... launch in beta relatively soon with half a dozen small applications running in a sidebar inside the Yahoo mail client (Evite is one of the services that is said to be building a nano-app for this new Yahoo Mail-as-a-platform). Users' address books would act as a social graph, essentially turning Yahoo Mail into the basis of a whole new social networking experience.

The only way for Yahoo or Google to challenge the social networking incumbents like Facebook [is] to leverage their email infrastructure ... With relationship buckets pre-defined by the address book, which contains everything from web-based addresses to geo-local data (physical address) to mobile numbers, email clients are already rich with the very data set that Facebook [has].

I liked this idea back when Om talked about it last year and still like it now.

The address book is essentially a social network. Not only does it have friend relationships, but we can also determine the importance of those relationships, the weights of the social connections. Oddly, surprisingly little has been done with that information in e-mail clients.

Perhaps it is fear of breaking something that so many people use and depend on, but e-mail clients have largely stood still over the last several years while social networking applications nibbled away at their market and mind share. What experimentation has occurred seems stuck in startups and research (e.g. Xobni or Inner Circle).

Meanwhile, there seems to be a trend where social networks are creeping toward e-mail clients. For example, Facebook has added limited e-mail functionality within their site as well as Twitter-like microblogging. These features seem intended to make communication dwell within Facebook.com rather than inside e-mail and IM clients.

I admit I am a bit outside of the demographic for heavy social network users, but, from what I can tell, the primary use of social networks is for communication, perhaps with the twist of being focused primarily on dating and entertainment. It makes me wonder if social networks really are a different app from communication apps like e-mail clients or just a different facet of the same idea.

If they are nearly the same, I would expect Yahoo will do much better from implementing social networking features in Yahoo Mail than from attempting to create a new social network such as Yahoo 360. Something similar probably could be said for Orkut and GMail.

Monday, December 01, 2008

Deepak Agarwal and many others from Yahoo Research have a paper at the upcoming NIPS 2008 conference, "Online Models for Content Optimization", with a fun peek inside a system at Yahoo that automatically tests and optimizes which content to show to their readers.

It is not made entirely clear which pages at Yahoo use the system, but the paper says that it is "deployed on a major Internet portal and selects articles to serve to hundreds of millions of user visits per day."

The system picks which article to show in a slot from 16 potential candidates where the pool of candidates are picked by editors and change rapidly. The system seeks to optimize the clickthrough rate in the slot. The problem is made more difficult by the way the clickthrough rate on a given article changes rapidly as the article ages and as the audience coming to Yahoo changes over the course of a day, which means the system needs to adapt rapidly to new information.

The paper describes a few variations of algorithms that do explore/exploit by showing the article that performed best recently while constantly testing the other articles to see if they might perform better.
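The paper's algorithms are more sophisticated, but the basic explore/exploit loop can be sketched as an epsilon-greedy bandit (the smoothing and epsilon value here are illustrative choices, not the paper's):

```python
import random

def pick_article(stats, epsilon=0.1):
    """stats maps article -> [clicks, views]. Mostly exploit the article
    with the best observed CTR, but explore a random one epsilon of the time,
    which lets the system react as articles age and the audience shifts."""
    if random.random() < epsilon:
        return random.choice(list(stats))
    # Exploit: highest observed clickthrough rate, smoothed to avoid 0/0
    # for freshly added candidates.
    return max(stats, key=lambda a: (stats[a][0] + 1) / (stats[a][1] + 2))

def record(stats, article, clicked):
    """Feed the observed impression back so the estimates adapt rapidly."""
    stats[article][1] += 1
    if clicked:
        stats[article][0] += 1
```

Because the candidate pool changes rapidly, the constant trickle of exploration is what keeps the system from locking onto a stale winner.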

The result was that the CTR increased by 30-60% over editors manually selecting the content that was shown. Curiously, their attempt to show different content to different user segments (a coarse-grained version of personalization) did not generate additional gains, but they say this is almost certainly due to the very small pool of candidate articles (only sixteen articles) from which the algorithm was allowed to pick.

One amusing tidbit in the paper is how they describe the culture clash that occurred between maintaining the control the editors were used to and giving the algorithms freedom to pick the content users really seem to want.

I remember similar issues at Amazon way back when we first started using algorithms to pick content rather than editors. It is hard for editors to give up control even to the collective voice of people voting on what they want. While I always have been sympathetic to the need for editorial voice, if it is forcing content on users that they do not actually want, it is important to understand its full cost.

Thursday, November 20, 2008

There has been more talk lately, it seems to me, on moving away from stateless search where each search is independent and toward a search engine that pays attention to your previous searches when it tries to help you find the information you seek.

Which makes a paper by Rosie Jones and Kristina Klinkner from Yahoo Research at CIKM 2008, "Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs" (PDF), that much more relevant.

Rosie and Kristina looked at how to accurately determine when a searcher stops working on one task and starts looking for something new. The standard technique people have used in the past for finding task boundaries is to simply assume that all searches within a fixed period of time are part of the same task. But, in their experiments, they find that "timeouts, whatever their length, are of limited utility in identifying task boundaries, achieving a maximum precision of only 70%."

Looking at the Yahoo query logs more closely to explain this low accuracy, they find some surprises, such as the high number of searchers that work on multiple tasks simultaneously, even interleaving the searches corresponding to one task with the searches for another.

So, when the simple stuff fails, what do most people do? Think up a bunch of features and train a classifier. And, there you go, that's what Rosie and Kristina did. They trained a classifier using a set of features that combined characteristics of searcher behavior (e.g. people searching for [tribeca salon] after [new york hairdresser]) with characteristics of the queries (e.g. lexically similar or return similar results from a search engine), eventually achieving much higher accuracy rates on finding task boundaries.

As the authors say, being able to accurately segment tasks could improve our ability to evaluate search engines. In particular, we could seek to minimize the amount of time needed by searchers "to satisfy an information need or fulfill a more complex objective" rather than just looking at click and usage data for one query at a time. Judging search engines by how well they help people get things done is something that, in my opinion, is long overdue.

Please see also my earlier post, "Tasks, not search, at DEMOfall2008", where Head of Yahoo Research Prabhakar Raghavan said that people really don't want to search; what they really want is to fulfill their tasks and get things done.

Wednesday, November 19, 2008

You have to love research work that takes some absurdly simple idea and shows that it works much better than anyone would have guessed.

Steve Webb, James Caverlee, and Calton Pu had one of these papers at CIKM 2008, "Predicting Web Spam with HTTP Session Information" (PDF). They said, everyone else seems to think we need the content of a web page to see if it is spam. I wonder how far we can get just from the HTTP headers?

Turns out surprisingly far. From the paper:

In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.

It appears that web spammers tend to use specific IP ranges and put unusual gunk into their headers (e.g. "X-Powered-By" and "Link" fields), which makes it fairly easy to pick them out just from their headers. As one person suggested during the Q&A for the talk, spammers probably would quickly correct these oversights if it became important, but you still have to give credit to the authors for this cute and remarkably effective idea.
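The flavor of the approach can be sketched with a hand-written rule standing in for the paper's trained classifier (the suspicious fields come from the paper's observations, but the scoring and threshold here are made up):

```python
def looks_like_spam(headers):
    """Score HTTP response headers before fetching the body; if the score
    is high enough, close the connection and skip the content entirely.
    A real system would learn these signals rather than hard-code them."""
    score = 0
    # Header fields found disproportionately often on spam hosts.
    for field in ("X-Powered-By", "Link"):
        if field in headers:
            score += 1
    # A missing standard field can also be a weak signal.
    if "Server" not in headers:
        score += 1
    return score >= 2
```

The appeal is the cost profile: the headers arrive before the body, so a rejected page costs almost no bandwidth or storage.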

If you might enjoy another example of making remarkable progress using a simple idea, please see my earlier post, "Clever method of near duplicate detection". It summarizes a paper about uniquely identifying the most important content in a page by looking around the least important words on the page, the stop words.

Monday, November 17, 2008

Googlers Diane Lambert and Daryl Pregibon had a paper at AdKDD 2008, "Online Effects of Offline Ads" (PDF), with a fun look at how far we can get measuring the impact of offline advertising by increases in search queries or website visits.

Some excerpts:

One measure of offline ad effectiveness is an increase in brand related online activity .... As people spend more time on the web, [the] steps toward purchase increasingly include searching for the advertiser's brand or visiting the advertiser's websites, even if the ad campaign was in an offline medium such as print, radio, or TV.

There are two obvious strategies for estimating the [gain] ... We can assume that the past is like the present and use daily outcomes before the campaign ran ... [but] the "before" number of visits ... is not necessarily a good estimate ... if interest in the product is expected to change over time even if no ad campaign is run. For example, if an ad is more likely to be run when product interest is high, then comparing counts-after to counts-before overstates the effect of the campaign.

Alternatively, we could estimate the [gain] ... by the outcome in control DMAs, which are markets in which the ad did not appear ... One problem, though, is that the advertiser may be more likely to advertise in DMAs where the interest in the product is likely to be high.

The paper goes on to detail the technique used and the ability of the technique to detect changes in very noisy traffic data.
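The intuition behind the control-DMA comparison can be sketched as a difference-in-differences estimate (the visit counts below are made up):

```python
def lift_estimate(test_before, test_after, control_before, control_after):
    """Estimate the campaign's effect on daily visits. The control DMAs
    capture the trend that would have happened without the ad, so we
    subtract their relative change from the test DMAs' relative change."""
    test_change = test_after / test_before
    control_change = control_after / control_before
    return test_change - control_change

# Test DMAs went from 1000 to 1300 daily visits while control DMAs went
# from 1000 to 1100, so roughly a 20% lift is attributable to the campaign.
lift = lift_estimate(1000, 1300, 1000, 1100)
```

This only works if the controls are comparable, which is exactly the problem the paper flags: advertisers tend to advertise where interest is already expected to be high.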

This paper is part of a larger group of fun and remarkably successful work that tries to predict offline trends from online behavior. One example that recently received much press attention is Google Flu Trends, which uses searches for flu-related terms to predict real flu outbreaks. Another example is recent work out of Yahoo Research and Cornell to find the physical location of objects of interest from a combination of search terms and geolocation data.

Friday, November 14, 2008

Andrei Broder and a large crew from Yahoo Research had a paper at CIKM 2008, "To Swing or not to Swing: Learning when (not) to Advertise" (PDF), that is a joy to see for those of us that are hoping to make advertising more useful and less annoying.

The paper starts by motivating the idea of sometimes not showing ads:

In Web advertising it is acceptable, and occasionally even desirable, not to show any [ads] if no "good" [ads] are available.

If no ads are relevant to the user's interests, then showing irrelevant ads should be avoided since they impair the user experience [and] ... may drive users away or "train" them to ignore ads.

The paper looks at a couple of approaches to deciding when to show ads, one based on a simple threshold on the relevance score produced by Yahoo's ad ranking system, the other training a more specialized classifier on a long list of features.
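The simpler of the two, the relevance threshold, might look like this (the scoring function and cutoff are placeholders, not Yahoo's):

```python
def ads_to_show(candidate_ads, relevance, threshold=0.6):
    """Return only ads whose relevance score clears the cutoff.
    An empty list means showing no ads at all for this query, which
    the paper argues beats showing irrelevant ones."""
    return [ad for ad in candidate_ads if relevance(ad) >= threshold]
```

The classifier-based variant replaces the single score with many features, but the decision it makes is the same: sometimes the right number of ads is zero.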

An unfortunate flaw in the paper is that the system was evaluated using a manually labeled set of relevant and irrelevant ads. As the paper itself says, it would have been better to consider expected revenue and user utility, preferably using data from actual Yahoo users. But, with the exception of a brief mention of "preliminary experiments ... using click-through data" that they "are unable to include ... due to space constraints", they leave the question of the revenue and user satisfaction impact of not showing ads to future work.

Thursday, November 13, 2008

Baidu Chief Scientist William Chang gave an industry day talk at CIKM 2008 on the next generation of search where he predicted an industry-wide move toward personalized web search.

William first described two earlier generations of search, then said the upcoming third generation of search would be the "internet as a matching network". He expected to see an integration of search and recommendations, where recommendations are used to find related concepts, entities, phrases, documents, and sources.

As part of this, he expected to see a renewed interest in the diversity of search results -- he described it as relevance versus user satisfaction -- that appeared to be going down the path of information exploration and search as a dialogue to better help searchers with the task behind the search phrase.

William also briefly mentioned personalized spidering. While he did not elaborate, I would guess he meant software agents gathering, synthesizing, and summarizing information from several deep web sources to satisfy a complicated task.

This talk was one of the ones recorded by videolectures.net and should appear there in a week or so.

Please see also my earlier post, "Rakesh Agrawal at CIKM 2008", that summarizes Rakesh's predictions about a new interest in diversity and discovery in web search.

Wednesday, November 12, 2008

In two insightful posts, "The Omnigoogle" and "The cloud's Chrome lining", Nick Carr cleanly summarizes how Google benefits from the growth of the Web and why expanding Web use makes sense as a large part of their business strategy.

Some excerpts:

[Google] knows that its future ... hinges on the continued rapid expansion of the usefulness of the Internet, which in turn hinges on the continued rapid expansion of the capabilities of web apps, which in turn hinges on rapid improvements in the workings of web browsers .... [Chrome's] real goal ... is to upgrade the capabilities of all browsers.

The way Google makes money is straightforward: It brokers and publishes advertisements through digital media. More than 99 percent of its sales have come from the fees it charges advertisers for using its network to get their messages out on the Internet.

For Google, literally everything that happens on the Internet is a complement to its main business. The more things that people and companies do online, the more ads they see and the more money Google makes.

Just as Google controls the central money-making engine of the Internet economy (the search engine), Microsoft controlled the central money-making engine of the personal computer economy (the PC operating system).

In the PC world, Microsoft had nearly as many complements as Google now has in the Internet world, and Microsoft, too, expanded into a vast number of software and other PC-related businesses - not necessarily to make money directly but to expand PC usage. Microsoft didn't take a cut of every dollar spent in the PC economy, but it took a cut of a lot of them.

In the same way, Google takes a cut of many of the dollars that flow through the Net economy. The goal, then, is to keep expanding the economy.

Tuesday, November 11, 2008

Filip Radlinski, Madhu Kurup, and Thorsten Joachims had a paper at CIKM 2008, "How Does Clickthrough Data Reflect Retrieval Quality?" (PDF), with a surprising result on learning to rank using click data.

Specifically, they found that, instead of testing two search rankers in a normal A/B test (e.g. 50% of users see ranker A, 50% see ranker B), showing all searchers an interleaved combination of the two possible search result orderings makes it much easier to see which ranker people prefer. The primary explanation the authors give for this is that interleaving the results gives searchers the easier task of expressing a relative preference between the two rankers.

Some excerpts from the paper:

Unlike expert judgments, usage data ... such as clicks, query reformulations, and response times ... can be collected at essentially zero cost, is available in real time, and reflects the value of the users, not those of judges far removed from the users' context. The key problem with retrieval evaluation based on usage data is its proper interpretation.

We explored and contrasted two possible approaches to retrieval evaluation based on implicit feedback, namely absolute metrics and paired comparison tests ... None of the absolute metrics gave reliable results for the sample size collected in our study. In contrast, both paired comparison algorithms ... gave consistent and mostly significant results.

Paired comparison tests are one of the central experiment designs used in sensory analysis. When testing a perceptual quality of an item (e.g. taste, sound) ... absolute (Likert scale) evaluations are difficult to make. Instead, subjects are presented with two or more alternatives and asked ... which of the two they prefer.

This work proposes a method for presenting the results from two [rankers] so that clicks indicate a user's preference between the two. [Unlike] absolute metrics ... paired comparison tests do not assume that observable user behavior changes with retrieval quality on some absolute scale, but merely that users can identify the preferred alternative in direct comparison.
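A sketch of team-draft interleaving, one of the paper's paired-comparison algorithms (simplified; ties and click attribution are handled more carefully in the paper):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings like captains picking teams: each round a coin
    flip decides which ranker drafts first, and each drafts its highest
    result not yet in the merged list. Clicks on team A's picks versus
    team B's picks then give a relative preference between the rankers."""
    merged, team_a, team_b = [], set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(merged) < len(all_docs):
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if rng.random() < 0.5:
            order.reverse()
        for ranking, team in order:
            pick = next((d for d in ranking if d not in merged), None)
            if pick is not None:
                merged.append(pick)
                team.add(pick)
    return merged, team_a, team_b
```

Every searcher sees one blended list, so each click is a direct vote in a side-by-side comparison rather than an absolute judgment, which is why far fewer observations are needed than in a standard A/B test.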

Please see also my older post, "Actively learning to rank", which summarizes some earlier very interesting work by Filip and Thorsten.

Update: Filip had a nice update to this work in a SIGIR 2010 paper, "Comparing the Sensitivity of Information Retrieval Metrics". Particularly notable is that only about 10x as many clicks as explicit judgments are required to detect small changes in relevance. Since click data is much easier and cheaper to acquire than explicit relevance judgments, this is another point in favor of using online measures of relevance rather than the older technique of asking judges (often a lot of judges) to compare the results.

Monday, November 10, 2008

Philipp Lenssen points to a video of a DEMOfall08 panel on "Where the Web is Going" that included Peter Norvig from Google and Prabhakar Raghavan from Yahoo.

What is notable in the talk is that (starting at 12:32) Prabhakar and Peter agree that, rather than supporting only one search at a time, search engines will soon focus on helping people get a bigger task done.

Prabhakar says:

I think the next step that faces us ... is divining the intent of what people are doing [and] fulfilling their tasks, getting what they need to get done.

People intrinsically don't want to search. People don't come to work every day saying I need to search ... They want to run their lives.

The notion of a [mere] retrieval engine as the ultimate [tool] is incredibly limiting. We have to get much further along to task completion and fulfillment.

The current world of search engines [are] stateless ... In the act of completing a task -- booking a vacation, finding a job -- you spend hours and days start to end. In the process, you make repeated invocations of search engines ... And all the while the search engine has no state about you. It doesn't recognize that you are in the midst of a task.

What does it take to recognize an intent and synthesize an experience that satisfies an intent? So, if it is a vacation you are planning, it should say ... here is a package I recommend that is based on your budget, the fact that you have two kids, that you don't want to go to too many museums. That's the future we have to get to.

Peter says:

We have to get a lot farther in saying what is it that the user means both in ... tasks where there is a clear intent ... and [even] more so in cases where it is exploratory.

We see this all the time that the user has some area he wants to figure out -- let's say a medical problem -- and the user starts out by kind of floundering around not sure what to talk about and then he reads some document and then he starts to learn the lingo. And, now they say, I don't say funny red splotch, I use this medical term and now I'm on the right track.

We have to accelerate that process ... not make the user do all the work.

There is an interesting shift here from a model where each search is independent to one where a search engine may expect searchers to do multiple searches when trying to accomplish their tasks.

That new model could take the form of search as a dialogue (a back-and-forth with the search engine focused on helping you understand what information is out there), personalized search (re-ranking your results based on your past actions, interests, and goals), or recommender systems (helping you discover interesting things you might not know exist using what people like you found interesting). Most likely, I would expect, it would require a combination of all three.

Thursday, November 06, 2008

Yahoo Fellow, VP, and computational advertising guru Andrei Broder gave a talk at CIKM 2008 on "The Evolving Computational Advertising Landscape" with some notable details that were missing from his previous talks on this topic.

Specifically, Andrei described "the recommender system connection" with advertising where we want "continuous CTR feedback" for each (query, ad) pair to allow us to learn the "best match between a given user in a given context and a suitable advertisement". He said this was "closest to a recommender system" because, to overcome sparse data and get the best match, we need to find similar ads, pages, and users.

At this point, Andrei offered a bit of a tease that an upcoming (and not yet available) paper that has Yehuda Koren as a co-author will talk more on this topic of treating advertising as a recommender problem. Yehuda Koren recently joined Yahoo Research and is one of the members of the top-ranked team in the Netflix recommender contest.

Andrei continued on the theme of personalization and recommendations for advertising, talking briefly about personalized ad targeting and saying that he thought short-term history (i.e. the last few queries) likely would be more useful than long-term history (i.e. a subject-interest profile).

Andrei also talked about several other topics that, while covered in his older talks, are quite interesting. He contrasted computational advertising with classical advertising, saying that the former uses billions of ads and venues, has a liquid market, and is personalizable, measurable, and low cost. He described the four actors in an advertising market -- advertisers, ad agencies, publishers, and users -- and said that advertising engines have the difficult task of optimizing the four separate and possibly conflicting utility functions of these actors. He talked about the ideal of "advertising as a form of information" rather than as an annoyance, the key there being making it as useful and relevant as possible. And he spent some time on mobile advertising, talking about the very interesting but slightly scary possibilities of using precise locations of individuals and groups over time to do "instant couponing" to nearby stores (where what counts as nearby is determined by your current speed and whether that makes it clear you are in a car), to recognize which stores are open and which are popular, to predict lifestyle choices and interests of individuals and groups, and to make recommendations.

This talk was one of the ones recorded by videolectures.net and should appear there in a week or so.

Wednesday, November 05, 2008

Googler Peter Norvig gave a talk at industry day at CIKM 2008 that, despite my fascination with all things Peter Norvig, almost frightened me off by including the phrase "the Ultimate Agile Development Tool" in its title.

The talk redeemed itself in the first couple minutes, citing Steve Yegge's "Good Agile, Bad Agile" and making it clear that Peter more meant being agile than Agile.

His core point was that "code is a liability". Relying on data over code as much as possible allows simpler code that is more flexible, adaptive, and robust.

In one of several examples, Peter put up a slide showing an excerpt from a rule-based spelling corrector. The snippet of code, which was just part of a much larger program, contained a set of case and if statements representing rules for spelling correction in English that was nearly impossible to understand, let alone verify. He then put up a slide containing a few-line Python program for statistical spelling correction that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also can easily be used in different languages.
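A compressed variant of that statistical corrector (simplified from the version Peter has published elsewhere; a real one would build `counts` from a large corpus rather than a handful of words):

```python
from collections import Counter

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)

def correct(word, counts):
    """Pick the most frequent known word among the candidates:
    the word itself, else anything one edit away, else give up."""
    candidates = ({word} & counts.keys()) or (edits1(word) & counts.keys()) or {word}
    return max(candidates, key=lambda w: counts[w])
```

The language-specific part is just the data: swap in word counts from French documents and the same dozen lines correct French, which is exactly the "data over code" point of the talk.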

For another example, Peter pulled from Jing et al, "Canonical Image Selection from the Web" (ACM), which uses a clever representation of the features of an image, a huge image database, and clustering of images with similar features to find the most representative image of, for example, the Mona Lisa on a search for [mona lisa].

Peter went on to say that more data seems to help on many problems more than complicated algorithms do. More data can hit diminishing returns at some point, but that point seems to be fairly far out for many problems, so keeping it simple while processing as much data as possible often seems to work best. Google's work in statistical machine translation works this way, he said, primarily using the correlations discovered between words in different languages in a training set of 18B documents.

The talk was one of the ones recorded by videolectures.net and should appear there in a week. If you cannot wait, the CIKM talk was similar to Peter's startup school talk from a few months ago, so you could use that as a substitute.

Friday, October 31, 2008

The talk started with a bit of a pat on the back for those working on search, with Bruce saying that "search is everywhere" -- not just the web, but enterprise, desktop, product catalogs, and many other places -- and, despite hard problems of noisy data and vocabulary mismatch between what searchers actually want and how they express that to a search engine, search seems to do a pretty good job.

But then, Bruce took everyone to task, saying that current search is nothing like the "vision of the future" that was anticipated decades back. We still are nowhere near a software agent that can understand and fulfill complex information needs like a human expert would. Bruce said the "hard problems remain very hard" and that search really only works well when searchers are easily able to translate their goals into "those little keywords".

To get to that vision, Bruce argued that we need to be "evolutionary, not revolutionary". Keep chipping away at the problem. Bruce suggested long queries as a particularly promising "next step toward the vision of the future", saying that long queries work well "for people [not] for search engines", and discussed a few approaches using techniques similar to statistical machine translation.

It may have been that he ran short on time, but I was disappointed that Bruce did not spend more time talking about how to make progress toward the grand vision of search. A bit of this was addressed in the questions at the end. One person asked about whether search should be more of a dialogue between the searcher and the search engine. Another asked about user interface innovations that might be necessary. But, in general, it would have been interesting to hear more about what new paths Bruce considers promising and which of the techniques currently used he considers to be dead ends.

On a side note, there was a curious contrast between Bruce's approach of "evolutionary not revolutionary" and the "Mountains or the Street Lamp" talk during industry day. In that talk, Chris Burges argued that we should primarily focus on very hard problems we have no idea how to solve -- climb the mountains -- not twiddles to existing techniques -- look around nearby in the light under the street lamp.

Tuesday, October 28, 2008

Rakesh Agrawal from Microsoft Research gave a keynote talk yesterday morning at CIKM 2008 on Humane Data Mining. Much of the talk was on the potential of data mining in health care, but I am going to highlight only the part on web search, particularly the talk's notable focus on serendipity, discovery, and diversity in web search.

When talking about web search, Rakesh first mentioned making orthogonal query suggestions to support better discovery. The idea here is that searchers may not always know exactly what they want. The initial query may just be a starting point to explore the information, learn what is out there, and figure out what they really want to do. Suggesting queries that are related but a bit further afield than simple refinements may help people who don't know quite what they need to get to what they need.

Rakesh then talked briefly about result diversification. This is particularly important on ambiguous queries, but also is important for ambiguous tasks, where a searcher doesn't quite know what he wants and needs more information about what information is out there. Rakesh mentioned the long tail of search results as part of improving diversity. He seemed surprisingly pessimistic about the ability of recommender system approaches to surface items from the tail, either in products or in search, but did not elaborate.

Finally, learning from click data came up repeatedly, once in learning to classify queries using similarities in click behavior, again in creating implicit judgments as a supplement or replacement for explicit human judges, and finally when talking about a virtuous cycle between search and data where better search results attract more data on how people use the search results which lets us improve the results which gives us more data.

The talk was filmed by the good people at videolectures.net and should be available there in a couple weeks.

Friday, October 24, 2008

Yahoo Chief Scientist Jan Pedersen recently wrote a short position paper, "Making Sense of Search Result Pages" (PDF), that has some interesting tidbits in it. Specifically, it advocates for click-based methods for evaluating search result quality and mentions using toolbar data to see what people are doing after leaving the search result page.

Some extended excerpts:

Search engine result pages are presented hundreds of millions of times a day, yet it is not well understood what makes a particular page better from a consumer's perspective. For example, search engines spend large amounts of capital to make search-page loading latencies low, but how fast is fast enough or why fast is better is largely a subject of anecdote.

Much of the contradiction comes from imposing an optimization criterion ... such as discounted cumulative gain (DCG) ... that does not account for perceptual phenomena. Users rapidly scan search result pages ... and presentations optimized for easy consumption and efficient scanning will be perceived as more relevant.

The process Yahoo! search uses to design, validate, and optimize a new search feature includes ... an online test of the feature ... [using] proxy measures for the desired behaviors that can be measured in the user feedback logs.

Search engine query logs only reflect a small slice of user behavior -- actions taken on the search results page. A more complete picture would include the entire click stream; search result page clicks as well as offsite follow-on actions.

This sort of data is available from a subset of toolbar users -- those that opt into having their click stream tracked. Yahoo! has just begun to collect this sort of data, although competing search engines have collected it for some time.

We expect to derive much better indicators of user satisfaction by considering the actions post-click. For example, if the user exits the clicked-through page rapidly then one can infer that the information need was not satisfied by that page.
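That post-click inference can be sketched as a simple dwell-time rule; the 30-second threshold and the records below are my own illustrative assumptions, not Yahoo's:

```python
# Hypothetical click-stream records: (query, clicked_url, dwell_seconds).
clicks = [
    ("cheap flights", "flights.example/a", 4),
    ("cheap flights", "flights.example/b", 95),
]

def satisfied(dwell_seconds, threshold=30):
    # A rapid exit from the clicked-through page suggests the
    # information need was not satisfied by that page.
    return dwell_seconds >= threshold

for query, url, dwell in clicks:
    print(url, "satisfied" if satisfied(dwell) else "not satisfied")
```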

Thursday, October 23, 2008

Bill Zeller and Ed Felten have an interesting paper, "Cross-Site Request Forgeries: Exploitation and Prevention" (PDF), that looks at exploiting the implicit authentication in browsers to take actions on the user's behalf using img tags or Javascript.

The most dramatic of the attacks allowed the attacker to take all the money from someone's ING Direct account just by visiting a web page. The attack sent POST requests off to ING Direct using Javascript, so they appear to come from the victim's browser. The POST requests quickly and quietly cause the victim's browser to create a new account by transferring money from their existing account, add the attacker as a valid payee on the new account, then transfer the funds to the attacker's account. Danger, Will Robinson.
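The standard defense the paper discusses is validating a secret, session-bound token on state-changing requests, so an attacker's page cannot forge a valid POST. A minimal sketch, with all names my own, might look like:

```python
import hashlib
import hmac
import secrets

# Server-side secret; in practice this would persist across requests.
SERVER_KEY = secrets.token_bytes(32)

def csrf_token(session_id):
    # Token is bound to the session, so a cross-site attacker who cannot
    # read the victim's pages cannot compute it.
    return hmac.new(SERVER_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def is_valid_post(session_id, submitted_token):
    # Constant-time comparison avoids leaking the token via timing.
    return hmac.compare_digest(csrf_token(session_id), submitted_token)

sid = "session-123"
token = csrf_token(sid)          # embedded in the legitimate form
print(is_valid_post(sid, token))     # legitimate submission
print(is_valid_post(sid, "forged"))  # cross-site forged request
```

A forged request from another site carries the victim's cookies but not the token, so it is rejected.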

Wednesday, October 22, 2008

A wave of job losses has started to spread across California’s Silicon Valley as the trademark optimism of the region’s technology start-ups has turned to pessimism amid the financial market rout.

The rapid reversal in mood has reawakened memories of the dotcom bust in 2001.

They also say that "Sequoia Capital ... [recently] greeted [entrepreneurs] with a presentation that began with a slide showing a gravestone and the words 'RIP good times' and were told to treat every dollar they spent as though it was their last."

Tuesday, October 21, 2008

A paper at WebKDD 2008, "Exploring the Impact of Profile Injection Attacks in Social Tagging Systems" (PDF), by Maryam Ramezani, JJ Sandvig, Runa Bhaumik, and Bamshad Mobasher claims that social tagging systems like del.icio.us are easy to attack and manipulate.

The paper looks at two types of attacks on social tagging systems, one that attempts to make a document show up on tag searches where it otherwise would not show up, another that attempts to promote a document by associating it with other documents.

Some excerpts:

The goal of our research is to answer questions such as ... How many malicious users can a tagging system tolerate before results significantly degrade? How much effort and knowledge is needed by an attacker?

We describe two attack types in detail and study their impact on the system .... The goal of an overload attack, as the name implies, is to overload a tag context with a target resource so that the system correlates the tag and the resource highly ... thereby increasing traffic to the target resource ... The goal of a piggyback attack is for a target resource to ride the success of another resource ... such that they appear similar.

Our results show that tagging systems are quite vulnerable to attack ... A goal-oriented attack which targets a specific user group can easily be injected into the system ... Low frequency URLs are vulnerable to piggyback attack as well as popular and focused overload attacks. High frequency URLs ... are [still] vulnerable to overload attacks.

The paper goes on to describe a few methods of attacking social tagging systems that require creating remarkably few fake accounts, as few as 0.03% of the total accounts in the system.
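A toy simulation shows why so few fake accounts can suffice in an overload attack; the data, names, and numbers below are purely illustrative, not from the paper:

```python
from collections import Counter

# Toy tagging system: (user, tag, url) annotations.
annotations = [
    ("u1", "python", "docs.python.org"),
    ("u2", "python", "docs.python.org"),
    ("u3", "python", "realsite.example"),
]

def top_for_tag(annotations, tag):
    # Rank resources for a tag by how many users paired them.
    counts = Counter(url for _, t, url in annotations if t == tag)
    return counts.most_common(1)[0][0]

print(top_for_tag(annotations, "python"))  # docs.python.org

# An overload attack: a handful of fake profiles all pair the target
# URL with the popular tag, enough to overtake the legitimate leader.
attack = [(f"fake{i}", "python", "spam.example") for i in range(3)]
print(top_for_tag(annotations + attack, "python"))  # spam.example
```

With per-tag annotation counts this low in the long tail, only a few injected profiles move a target to the top, which matches the paper's point about low-frequency URLs being most vulnerable.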

Frankly, I have been surprised not to see more attacks on tagging systems. It may be the case that most of these sites lack a large, mainstream audience, so the profit motive is still not sufficiently high to motivate persistent attacks.

Monday, October 20, 2008

In a post by David Sarno at the LA Times, David describes Digg CEO Jay Adelson as using their new $28.7M round of funding to push "a renewed shift in personalizing content for individual users."

Instead of showing users the most popular stories, [Digg] would make guesses about what they'd like based on information mined from the giant demographic veins of social networks. This approach would essentially turn every user into a big Venn diagram of interests, and send them stories to match.

Adelson said Digg had not yet deployed local views of the content, but that it was in the planning stages. "We do believe the implicit groupings of users and interests that we use in the recommendation engine will certainly play a role in the future of Digg and how we can address localities and topics."

Now, now, don't be put off by the frighteningly dull title. The paper is a fascinating look at whether people doing web advertising appear to be acting consistently and rationally in their bidding.

To summarize, in their data, advertisers do not appear to be bidding rationally or consistently.

Bidders very often have inconsistencies in their bidding on keywords over time that violate the ROI-maximizing strategy. The problem was most severe for advertisers that attempted to bid on many keywords. Only 30% of second-price auction bidders who bid on more than 25 keywords managed to keep their bids consistent over time. Only 19% of those bidders managed to maximize their ROI.

It looks like advertisers quite easily become confused by all the options given to them when bidding. 52% of the second price bidders they examined simply gave up and submitted essentially the same bid across all their keywords even if those keywords might have different value to them.

As Auerbach et al. say, it may be the case that advertisers lack "the resources or sophistication to track each keyword separately" or may "not have an accurate assessment of their true values per click on different keywords".

But this brings into question the entire ad auction model. In sponsored search auctions, we assume that advertisers are rational, able to manage bids across many keywords, and able to accurately predict their conversion rates from clicks to actions.

More work should be done here. This paper's analysis was done over small data sets. But, if this result is confirmed, then, as the authors say, a simpler auction system, one with "improved bidding agents or a different market design", may result in more efficient outcomes than one that assumes advertisers have unbounded rationality.

Wednesday, October 15, 2008

Googler Bryan Horling recently was on a panel with Danny Sullivan at SMX and talked about personalized search. A few people ([1][2][3]) posted notes on the session.

Not too much there, but one interesting tidbit is the way Google is thinking about personalization coming from three data sources, localization data (IP address or information in the history that indicates location), short-term history (specific information from immediately preceding searches), and long-term history (broad category interests and preferences summarized from months of history).

A couple examples were offered as well, such as a search for [jordans] showing the furniture store rather than Michael Jordan if the immediately preceding search was for [ethan allan], a search for [galaxy] showing LA Galaxy sports sites higher in the rankings if the searcher has a long-term history of looking at sports, and favoring web sites the searcher has seen in the past. Curiously, none of these examples worked as described when I tried them just now, but it is still interesting to think about it.

What I like best about what Bryan described is that the personalization is subtle, only doing minor reorderings. It uses the tidbits of additional information about your intent in your history to make it just a little bit quicker to find what you probably are seeking. It's a nice, low risk approach to experimenting with personalization, making only small changes that are likely to be helpful.
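That kind of subtle reordering can be sketched as a small score boost for results matching long-term interests; the weights, categories, and scores below are my own illustrative assumptions, not Google's:

```python
def rerank(results, interests, boost=0.05):
    """Reorder (url, base_score, category) tuples, nudging results that
    match the searcher's long-term interests. The boost is deliberately
    small, so it only reorders near-ties, never overrides strong
    relevance differences."""
    return sorted(
        results,
        key=lambda r: r[1] + (boost if r[2] in interests else 0.0),
        reverse=True,
    )

# A near-tie on the ambiguous query [jordans]:
results = [
    ("jordans-furniture.example", 0.62, "shopping"),
    ("nba.example/jordan", 0.60, "sports"),
]
print(rerank(results, {"sports"})[0][0])  # sports fan sees the sports page
print(rerank(results, set())[0][0])       # default order otherwise
```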

Tuesday, October 14, 2008

Danny Sullivan at Search Engine Land posts an insightful review of SearchPerks, Microsoft's new incentive program for Live Search, that includes a history of the rather dismal track record of other attempts to pay people to use a search engine.

The talk is a "collection of problems we think are interesting/difficult" and, since it is coming from Jeff, has a heavy bias toward infrastructure problems.

The talk starts with energy efficiency in large scale clusters. Jeff pointed out that most work on power optimization is on laptops, not servers, but servers in a cluster have drastically different power optimization needs. In particular, laptops optimize power by shutting down completely, but servers are often at 20%-30% utilization on CPU, memory, and disk, and it would be nice to have them use only 20-30% power in this state rather than about 80%.

At this point, I was wondering why they didn't just shut down part of their cluster to get the utilization of the remaining servers closer to 80% or so. Ed Lazowska apparently was wondering the same thing since, moments later, he asked why can't Google just use smarter scheduling to compress the workload in the cluster (and, presumably, then put the now idle part into a low power mode). Jeff said that didn't work because it would impact responsiveness due to locality issues. Jeff's answer was vague and I am still somewhat unclear on what Jeff meant, but, thinking about it, I suspect what he wants is to use all the memory across all the boxes in the entire cluster, have a box respond immediately, but still use a lot less power when executing no-ops than the 50% of peak an idle box currently uses. So, keeping all the memory on the cluster immediately accessible so we can maximize how much data we can keep in memory seems like it is a big part of what makes this a challenging problem.

Next, Jeff talked about the OS. He pointed out that the "design point of the original version of Linux is pretty far removed from [our] very large data centers" and wondered if an operating system would be designed differently if it was specifically made for running in a cluster of 10k+ machines. He gave a few examples such as not really needing paging to disk but maybe wanting remote paging to the memory of other machines, adapting the network stack to microsecond network distances between machines, and changing the security model to focus on isolating apps running on the same box to guarantee performance.

Moving up a level again, Jeff described wanting a consistent framework for thinking about and building distributed applications, including the consistency of their data.

Up one more level, Jeff wanted people to start thinking of having very large scale systems of 10M machines split into 1k different locations and how these would deal with consistency, availability, latency, failure modes, and adaptively minimizing costs (especially power costs).

Finally, Jeff briefly mentioned very large scale information extraction, speech processing, image and video processing, and machine learning, mostly talking about scale, but also giving a few examples such as moving beyond N-grams to handle non-local dependencies between words and Google's efforts to understand the semi-structured data in tables in web pages and data hidden behind forms on the Web.

Coming away from the talk, the biggest points for me were the considerable interest in reducing costs (especially reducing power costs), the suggestion that the Google cluster may eventually contain 10M machines at 1k locations, and the call to action for researchers on distributed systems and databases to think orders of magnitude bigger than they often are, not about running on hundreds of machines in one location, but hundreds of thousands of machines across many locations.

The talk is available for download in a variety of formats. Light, enjoyable, and worth watching if you are interested in large scale computing.

Friday, October 10, 2008

The paper has a nice overview of the goal of Games with a Purpose (GWAP), which is to produce useful output from the work done in games, and a good survey of some of the games available at gwap.com and the useful data they output.

If you've seen the GWAP work before, what is new and interesting about the article is the general framework they describe for building these types of games. In particular, the authors describe four generic types of guessing games, give examples of each class of games, and help guide those that are thinking of building their own games. In addition, Luis and Laura give a fair bit of advice on how to make the games enjoyable and challenging, how to prevent cheating, and techniques for mixing and matching human and computer players.

If you haven't seen GWAP before, go over to gwap.com and try a few games. My favorite is Verbosity, and Tag a Tune is good fun. I also think Tag a Tune is impressive as a demonstration of how games that label audio and video can still be quite fun even though they take a lot more time to play.

Thursday, October 09, 2008

In a post titled "Ad Perfect", Googler Susan Wojcicki describes targeting ads as matching a deep understanding of a user's intent, much like personalized search.

Some key excerpts:

Advertising should deliver the right information to the right person at the right time ... Our goal is always to show people the best ads, the ones that are the most relevant, timely, and useful .... We need to understand exactly what people are looking for, then give them exactly the information they want.

When a person is looking for a specific item ... the best ads will give more specific information, like where to buy the item.

In other cases, ads can help you learn about something you didn't know you wanted ... [and to] discover something [you] didn't know existed.

One way to make ads better would be to customize them based on factors like a person's location or preferences.

It [also] needs to be very easy and quick for anyone to create good ads ... to measure [and learn] how effective they are .... [and then] to show them only to people for whom they are useful.

What strikes me about this is how much this sounds like treating advertising as a recommendation problem. We need to learn what someone wants, taking into account the current context and long-term interests, and then help them discover interesting things they might not otherwise have known existed.

It appears to be a big shift away from mass market advertising and toward personalized advertising. This vision no longer has us targeting ads to people in general, but to each individual's intent, preferences, and context.

Wednesday, October 08, 2008

The Large Scale Recommender Systems and the Netflix Prize workshop was recently held at KDD 2008. I was not able to attend, but I still wanted to highlight a few of the papers from and related to the workshop.

Gavin Potter, the famous guy in a garage, had a short paper in the workshop, "Putting the collaborator back into collaborative filtering" (PDF). This paper has a fascinating discussion of how not assuming rationality and consistency when people rate movies and instead looking for patterns in people's biases can yield remarkable gains in accuracy. Some excerpts:

When [rating movies] ... a user is being asked to perform two separate tasks.

First, they are being asked to estimate their preferences for a particular item. Second, they are being asked to translate that preference into a score.

There is a significant issue ... that the scoring system, therefore, only produces an indirect estimate of the true preference of the user .... Different users are translating their preferences into scores using different scoring functions.

[For example, people] use the rating system in different ways -- some reserving the highest score only for films that they regard as truly exceptional, others using the score for films they simply enjoy .... Some users [have] only small differences in preferences of the films they have rated, and others [have] large differences .... Incorporation of a scoring function calibrated for an individual user can lead to an improvement in results.

[Another] powerful [model] we found was to include the impact of the date of the rating. It seems intuitively plausible that a user would allocate different scores depending on the mood they were in on the date of the rating.
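One simple stand-in for the per-user scoring function Gavin describes is to standardize each user's ratings before modeling; this z-scoring sketch is my own illustration, not his actual method:

```python
from statistics import mean, pstdev

# Raw 1..5 ratings from two users with very different scoring habits.
ratings = {
    "harsh_critic": {"A": 1, "B": 2, "C": 5},
    "easy_grader":  {"A": 4, "B": 4, "C": 5},
}

def standardize(user_ratings):
    """Express each rating relative to the user's own mean and spread,
    a crude calibration of that user's personal scoring function."""
    vals = list(user_ratings.values())
    mu = mean(vals)
    sigma = pstdev(vals) or 1.0  # avoid division by zero for flat raters
    return {item: (r - mu) / sigma for item, r in user_ratings.items()}

for user, r in ratings.items():
    print(user, standardize(r))
```

After standardizing, a 4 from the easy grader and a 2 from the harsh critic both read as "around this user's average", which is the kind of per-user calibration the paper argues improves results.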

Gavin has done quite well in the Netflix Prize; at the time of writing, he was in eighth place with an impressive score of .8684.

Gavin's paper is a light and easy read. Definitely worthwhile. Gavin's work forces us to challenge our common assumption that people are objective when providing ratings, instead suggesting that it is quite important to detect biases and moods when people rate on a 1..5 scale.

Another paper given in the workshop that I found interesting was Takacs et al, "Investigation of Various Matrix Factorization Methods for Large Recommender Systems" (PDF). In addition to the very nice summary of matrix factorization (MF) methods, the paper at least begins to address the practical issue of handling online updates to ratings, offering "an incremental variant of MF that efficiently handles new users/ratings, which is crucial in a real-life recommender system."

Finally, a third paper that was presented in the main KDD session, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model" (PDF), by Yehuda Koren from the leading Bellkor team, is an excellent discussion of combining different popular approaches to the Netflix contest. Some extended excerpts:

The two more successful approaches to CF are latent factor models, which directly profile both users and products, and neighborhood models, which analyze similarities between products or users.

Neighborhood models ... [compute] the relationship between items or ... users. An item-oriented approach evaluates the preferences of a user to an item based on ratings of similar items by the same user. In a sense, these methods transform users to the item space by viewing them as baskets of rated items .... Neighborhood models are most effective at detecting very localized relationships.

Latent factor models such as ... SVD ... [transform] both items and users to the same latent factor space ... [and] tries to explain ratings by characterizing both products and users on factors automatically inferred from user feedback. For example, when products are movies, factors might measure obvious dimensions such as comedy vs. drama, amount of action, or orientation toward children ... [as well as] less well defined dimensions such as depth of character development or "quirkiness" .... Latent factor models are generally effective at estimating overall structure that relates simultaneously to most or all users. However, these models are poor at detecting strong associations among a small set of closely related items, precisely where neighborhood models do best.

In this work, we suggest a combined model that improves prediction accuracy by capitalizing on the advantages of both neighborhood and latent factor approaches.

Like Gavin, Yehuda describes how to compensate for "systematic tendencies for some users to give higher ratings than others". Yehuda also discusses how implicit data on what users chose to rate and did not rate can be used for improving accuracy. And, Yehuda even addresses some of the differences between what works well in the Netflix Prize and what is necessary in a practical recommender system, talking about top-K recommendations, handling new users and new ratings without a full re-train of the model, using implicit feedback such as purchases and page views, and explaining the recommendations. Definitely worth the read.
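For readers new to these models, here is a bare-bones latent factor model trained by stochastic gradient descent. It is a minimal sketch of the SVD-style component only, without the biases, implicit feedback, or neighborhood terms of Yehuda's full combined model:

```python
import random

random.seed(0)

# Toy (user, item, rating) data on a 1..5 scale.
ratings = [
    (0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5),
]
n_users, n_items, k = 3, 3, 2  # k latent factors per user and item

# Small random initialization of user (P) and item (Q) factor vectors.
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    # Predicted rating is the dot product of user and item factors.
    return sum(P[u][f] * Q[i][f] for f in range(k))

lr, reg = 0.01, 0.02  # learning rate and regularization strength
for _ in range(500):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

print(round(predict(0, 0), 2))  # close to the observed rating of 5
```

The incremental update Takacs et al. describe falls out naturally here: for a new user, only that user's row of P needs training, with Q held fixed.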

Finally, let me mention that the Netflix Prize is about to give its second progress prize. The current leader narrowed the gap between the grand prize and the last progress prize by more than half over the last year. In conversations at the conference, people alternated between being hopeful that the remaining gap could be closed with a few new clever insights and despairing over the recent slow progress, questioning whether the grand prize is possible to win.

Monday, September 15, 2008

Michael Schwarz from Yahoo Research gave an invited talk at KDD 2008 on "Internet Advertising and Optimal Auction Design".

Much of the talk was background on first and second price auctions, but a particularly good discussion came at the end when Michael was talking about optimal setting of reserve prices for auctions.

A reserve price is the minimum bid for an advertiser to win an auction. Michael discussed how setting a reserve price correctly can significantly improve advertising revenue.

He did not cover the assumptions behind this claim in his talk, but details are available from his 2007 paper, "Optimal Auction Design in a Multi-unit Environment" (PDF). Briefly, setting the reserve price carefully can substantially increase advertising revenue when there are only a few bidders and when most bidders value a click well above the normal reserve price (e.g. most advertisers get $1 of value from a click but the default reserve price is $.05).

Michael also described a non-obvious effect where higher reserve prices indirectly impact top bidders because of price pressure from the now higher lower bids. Again, this primarily is an issue of the reserve price substituting for lack of competition in the bidding, but still a useful thing to note.
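A quick Monte Carlo sketch illustrates the revenue claim for a single-slot second-price auction with few bidders; the uniform value distribution and all numbers here are my assumptions for illustration, not from Michael's paper:

```python
import random

random.seed(1)

def revenue(n_bidders, reserve, trials=20000):
    """Average revenue of a sealed-bid second-price auction with a
    reserve price, assuming truthful bids drawn uniformly from [0, 1]."""
    total = 0.0
    for _ in range(trials):
        bids = sorted((random.uniform(0, 1) for _ in range(n_bidders)),
                      reverse=True)
        if bids[0] >= reserve:
            # Winner pays the larger of the reserve and the second bid.
            second = bids[1] if n_bidders > 1 else 0.0
            total += max(reserve, second)
    return total / trials

for r in (0.0, 0.25, 0.5):
    print(f"reserve={r:.2f}  revenue={revenue(2, r):.3f}")
```

With only two bidders, a well-chosen reserve substitutes for the missing competition and noticeably raises average revenue; with many bidders, the second-highest bid usually exceeds the reserve anyway and the effect shrinks.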

After the talk, I chatted a bit with Michael about optimizing the reserve price for things other than immediate revenue. For example, we might want to optimize reserve price for user satisfaction, setting the reserve price at a level that effectively represents a high cost of user attention and filters out ads that users do not find worthwhile. Or, we might want to set our reserve price lower in the long tail of advertising keywords, making it cheaper (and bearing some of the risk) for advertisers as they try to gather information about how well their ads might work on more obscure keywords.

By the way, if you liked this post, you might also like one of Michael's other papers, "Internet Advertising and the Generalized Second Price Auction: Selling Billions of Dollars Worth of Keywords" (PDF). It is a fun paper that describes how current GSP advertising auctions do not actually force truth telling. The problem is rooted in the way there are multiple positions available and, if the lower positions still generate a good number of clicks, it can be advantageous for advertisers to offer lower bids.
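A small worked example, with illustrative numbers reconstructed in the spirit of the paper, shows how shading a bid can pay off in GSP when the lower position still gets nearly as many clicks:

```python
# Two ad positions with nearly equal click volumes, and two competing
# bids. All numbers are illustrative, not taken from the paper.
positions = [200, 199]   # clicks per period for slots 1 and 2
other_bids = [4.0, 2.0]  # the competitors' per-click bids

def payoff(my_bid, my_value=10.0):
    """Per-period profit under GSP: each slot pays the next bid down."""
    bids = sorted(other_bids + [my_bid], reverse=True)
    slot = bids.index(my_bid)  # 0 = top position
    if slot >= len(positions):
        return 0.0
    price = bids[slot + 1] if slot + 1 < len(bids) else 0.0
    return (my_value - price) * positions[slot]

print(payoff(10.0))  # truthful: top slot at $4/click -> (10-4)*200 = 1200
print(payoff(3.0))   # shading:  slot 2 at $2/click  -> (10-2)*199 = 1592
```

Bidding below true value drops the advertiser one position but cuts the per-click price in half, and since the second slot gets almost as many clicks, profit goes up, which is exactly why GSP is not truthful.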

Wednesday, September 03, 2008

One question that was repeatedly presented to the panel was how to make money off social networks. Underlying this question, and at least once stated explicitly, was the issue of whether social networks really matter or are just the latest flash in the pan, the latest bubble, and soon to fade if people find the riches they deliver fall short of their absurdly hyped expectations.

The best answer to this question came only at the very end, sneaking in when the session was already over, from panel chair Andrew Tompkins. Andrew said that he thought that what was most likely is that most money would be made from understanding an individual in a network and their immediate neighborhood. He expected that, as the space matured, the more theoretical work that we see now that looks at global patterns and trends across different types of social networks would continue, but the emphasis would shift to understanding each person in the network, using their behavior and their immediate neighbors to help people find and discover things they want.

Another question that came up and went largely unanswered was, are social networks really application specific? That is, are the patterns we see in one network (e.g. mobile) distinct from what we see in another (e.g. LinkedIn) because people use these tools differently? There was some evidence for this in one of the conference papers where there was a rather unusual distribution (PDF) reported in the contact lists of Sprint mobile customers because of how people use their mobile devices. The general issue here is that there is a different meaning of a link between people -- A friend? A colleague? Someone I e-mailed once? -- in each of these applications, so it is not clear that conclusions we draw from analyzing one social network generalize to others.

And this leads to another question, are social networks just a feature? That is, are social networks just a part of an application, not a thing worth anything by itself? For example, is our e-mail contact list only important in the context of e-mail? Is Facebook just a tool for communication, like IM, but not really best described as a social network? This came up a bit during the Q&A with the panel when someone suggested that perhaps we were just using one hammer -- graph analysis -- on every problem when the graph might not actually best capture what is important, the underlying activity in the application.

Though it was not discussed at all during the panel, two of the panelists focused their work on privacy, which made me wish I had asked, do we really have firm evidence that people care about privacy in these social applications? Instead, it seems people say they care about privacy but then freely give away private information when asked ([1]). From what I can tell, we do not really know to what extent privacy really matters to social networking application users, do we?

In the end, it seems many questions remain, some of them quite basic, on where social networks are going. At this point, it is not even quite clear we know precisely what social networks are, much less whether they have an independent future.

Saturday, August 30, 2008

Jitendra Malik from UC Berkeley gave an enjoyable and often quite amusing invited talk at KDD 2008 on "The Future of Image Search" where he argued that "shape-based object recognition is the key."

The talk started with Jitendra saying image search does not work well. To back this claim, he showed screen shots of searches on Google Image Search and Flickr for [monkey] and highlighted the false positives.

Jitendra then claimed that neither better analysis of the text around images nor more tagging will solve this problem because these techniques are missing the semantic component of images, that is, the shapes, concepts, and categories represented by the objects in the image.

Arguing that we need to move "from pixels to perception", Jitendra pushed for "category recognition" for objects in images where "objects have parts" and form a "partonomy", a hierarchical grouping of object parts. He said current attempts at this have at most 100 parts in the partonomy, but it appears humans commonly use 2-3 orders of magnitude more, 30k+ parts, when doing object recognition.

A common theme running through the talk was having a baseline model of a part, then being able to deform the part to match most variations. For example, he showed some famous examples from the 1917 text "On Growth and Form", then talked about how to do these transformations to find the closest matching generic part for objects in an image.

Near the end of the talk, Jitendra contrasted the techniques used for face detection, which he called "a solved problem", with the techniques he thought would be necessary for general object and part recognition. He argued that the approach used in face detection -- sliding various-sized windows across the image looking for face-like patterns -- would be both too computationally expensive and have too high a false positive rate when scaled to 30k objects/parts. Jitendra said something new would have to be created to deal with large-scale object recognition.
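To get a feel for the cost Jitendra was pointing at, here is a toy count of the windows a sliding-window detector evaluates over an image pyramid (the window size, stride, and scale factor are invented defaults, not numbers from the talk):

```python
def num_windows(width, height, win=24, stride=4, scale=1.25, min_side=24):
    """Count the detection windows a sliding-window scan evaluates
    across an image pyramid, where each level shrinks by `scale`."""
    total = 0
    w, h = width, height
    while w >= min_side and h >= min_side:
        # Windows per level: one per stride step in each dimension.
        total += max(0, (w - win) // stride + 1) * max(0, (h - win) // stride + 1)
        w, h = int(w / scale), int(h / scale)
    return total
```

Even at these modest settings, a 640x480 image yields tens of thousands of windows per category; multiplying that by 30k categories suggests why brute-force scanning does not scale.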

Exactly what that something new is remains unclear, but Jitendra seemed to me to be pushing for a system that is biologically inspired, perhaps quickly coming up with rough candidate sets of interpretations of parts of the image, discounting interpretations that violate prior knowledge about which objects tend to occur near each other, then repeating at additional levels of detail until a steady state is reached.

Doing object recognition quickly, reliably, and effectively to find meaning in images remains a big, hairy, unsolved research problem, and probably will be for some time, but, if Jitendra is correct, it is the only way to make significant progress in image search.

Friday, August 22, 2008

Daniel Abadi, Samuel Madden, and Nabil Hachem had a paper at SIGMOD 2008, "Column-Stores vs. Row-Stores: How Different Are They Really?" (PDF), with a fascinating discussion of what optimizations could be implemented in a traditional row store database to make it behave like a column store.

The paper attempts to answer the question of whether there "is something fundamental about the way column-oriented DBMSs are internally architected" that yields performance gains or whether the column-oriented optimizations can be mimicked in a row-store design.

Specifically, the authors look at taking a row-store and vertically partitioning the rows, indexing all the columns, creating materialized views of the data (just the needed columns for a query already precomputed), compressing the data, and a couple of variations on late materialization when joining columns, which significantly impacts performance.
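As a rough illustration of the vertical-partitioning idea (the schema and data are invented, and SQLite stands in for the row-store), each column becomes its own narrow (rid, value) table, so a query touching only a couple of columns reads far less data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A conventional row-store table.
cur.execute("CREATE TABLE sales (rid INTEGER PRIMARY KEY, "
            "product TEXT, region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, "widget", "east", 10.0),
                 (2, "gadget", "west", 25.0),
                 (3, "widget", "west", 7.5)])

# Vertical partitioning: one narrow (rid, value) table per column.
for col in ("product", "region", "amount"):
    cur.execute(f"CREATE TABLE sales_{col} AS SELECT rid, {col} FROM sales")

# A two-column query now joins only the partitions it needs, on rid.
cur.execute("""
    SELECT p.product, SUM(a.amount)
    FROM sales_product p JOIN sales_amount a ON p.rid = a.rid
    GROUP BY p.product
""")
print(cur.fetchall())
```

The paper's point is that the joins on rid shown here are exactly where their row-store fell down: it insisted on hash joins where a column-store would use cheap merges of position-ordered data.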

Despite strong statements in the abstract and introduction that "it is not possible for a row-store to achieve some of the performance advantages of a column-store" and that "there is in fact something fundamental about the design of column-store systems that makes them better suited to data-warehousing workloads", the discussion and conclusion further into the paper are far murkier. For example, the conclusion states that it "is not that simulating a column-store in a row-store is impossible", just that attempting to fully simulate a column-store is difficult in "today's row store systems".

The murkiness seems to come from the difficulty the authors had forcing a row-store to do some of the query plan optimizations a column-store does once they had jiggered all the underlying data to simulate a column-store.

For example, when discussing "index-only plans" (which basically means throwing indexes on all the columns of a row-store), the row-store used slow hash-joins to combine indexed columns and they "couldn't find a way to force [the system] to defer these joins until later in the plan, which would have made the performance of this approach closer to vertical partitioning".

Later, the authors say that they couldn't "trick" the row-store into storing columns "in sorted order and then using a merge join", so it instead did "relatively expensive hash joins". They say the problems are not fundamental and that "there is no reason why a row-store cannot store tuple headers separately, use virtual record-ids to join data, and maintain heap files in guaranteed position order", just that they were not able to force the row-store to do these things in their simulation.

In the end, the paper only concludes that:

A successful column-oriented simulation [in a row-store] will require some important system improvements, such as virtual record-ids, reduced tuple overhead, fast merge joins of sorted data, run-length encoding across multiple tuples ... operating directly on compressed data ... and late materialization ... some of ... [which] have been implemented or proposed to be implemented in various row-stores.

That seems like a much weaker conclusion than what the introduction promised. The conclusion appears to be that a row-store can perform as well as a column-store on data warehouse workloads, just that it looks awkward and difficult to make all the optimizations necessary for it to do so.

What strikes me as remarkable about the system is its rather casual treatment of writes. As far as I can tell, a write is only sent to memory on one box, not written to disk, not even written to multiple replicas in memory. That seems fine for log or click data, but, for the kind of data Facebook deals with, it seems a little surprising to not see a requirement for multiple replicas to get the write in memory before the app is told that the write succeeded.

The code for Cassandra is open source and there is a wiki that adds a few tidbits to the SIGMOD slides. Note that HBase is also open source and also is modeled after Google's Bigtable; HBase is layered on top of Hadoop and sponsored heavily by Yahoo.

Please see also James Hamilton, who posts a summary of the slides and brief commentary, and Dare Obasanjo, who offers more detailed commentary on Cassandra.

If you are interested in all things Bigtable, you might also enjoy Phil Bernstein's post, "Google Megastore", that summarizes a Google talk at SIGMOD 2008 on a storage system they built on top of Bigtable that adds transactions and additional indexes, among other things.

Update: Avinash Lakshman swings by in the comments and clarifies that Cassandra does have a commit log, so writes do go to disk on at least one machine immediately. A write first updates the commit log, then updates the memory tables, and, finally, in batch some time later, goes out to all the tables on disk.
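That write path can be sketched as a toy key-value store (all names and the flush threshold here are invented; real Cassandra is far more involved): append to a commit log on disk first, then update the in-memory table, and only later flush to the data files in batch.

```python
import os
import tempfile

class ToyStore:
    """Toy sketch of a commit-log-first write path (not real Cassandra)."""

    def __init__(self, dirname, flush_every=2):
        self.log_path = os.path.join(dirname, "commit.log")
        self.data_path = os.path.join(dirname, "data.txt")
        self.memtable = {}
        self.flush_every = flush_every

    def write(self, key, value):
        # 1. Durability first: append to the commit log and fsync.
        with open(self.log_path, "a") as log:
            log.write(f"{key}={value}\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Then update the in-memory table.
        self.memtable[key] = value
        # 3. Some time later, flush the memory table to disk in batch.
        if len(self.memtable) >= self.flush_every:
            with open(self.data_path, "a") as data:
                for k, v in sorted(self.memtable.items()):
                    data.write(f"{k}={v}\n")
            self.memtable.clear()

tmp = tempfile.mkdtemp()
store = ToyStore(tmp, flush_every=2)
store.write("user:1", "alice")
store.write("user:2", "bob")  # second write triggers the batch flush
```

The key property is that a crash after step 1 loses nothing: the commit log can be replayed to rebuild the memory table.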

Thursday, August 07, 2008

Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University have a cute idea in their SIGIR 2008 paper, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They focus near duplicate detection on the important parts of a web page by using the next few words after a stop word as a signature.

An extended excerpt:

The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements. The branding elements tend to be deliberately replicated across the pages of individual sites, creating a "noise level" of similarity among pages of the same site.

SpotSigs provides a robust scheme for extracting characteristic signatures from Web documents, thus aiming to filter natural-language text passages out of noisy Web page components ... [It] is much more efficient, easier to implement, and less error-prone than ... layout analysis .... [and] we are able to show very competitive runtimes to the fastest known but more error-prone schemes like I-Match and Locality Sensitive Hashing (LSH).

The points (or "spots") in the page at which spot signatures are generated are all the locations where ... anchor words [occur]. We call the anchor words antecedents, which are typically chosen to be frequent within the corpus. The most obvious, largely domain-independent choices ... are stopwords like is, the, do, have, etc. ... which are distributed widely and uniformly within any snippet of natural-language text.

A spot signature ... consists of a chain of words that follow an antecedent word .... Spot signatures ... [focus on] the natural-language passages of Web documents and skip over advertisements, banners, and navigational components.

The paper gives an example of generating spot signatures where a piece of text like "at a rally to kick off a weeklong campaign" produces two spot signatures: "a:rally:kick" and "a:weeklong:campaign".
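Following that description, here is a toy implementation that reproduces the example (the stopword and antecedent sets are abbreviated guesses for illustration, not the paper's actual lists):

```python
# Toy SpotSigs-style signature extraction. Antecedents are anchor
# stopwords; each signature is the antecedent plus the next few
# non-stopword words. Word lists here are illustrative only.
STOPWORDS = {"a", "an", "the", "at", "to", "off", "is", "of", "and"}
ANTECEDENTS = {"a"}

def spot_signatures(text, chain_len=2):
    """Return spot signatures: antecedent + chain of following words,
    skipping stopwords within the chain."""
    words = text.lower().split()
    sigs = []
    for i, w in enumerate(words):
        if w in ANTECEDENTS:
            chain = [x for x in words[i + 1:] if x not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.append(":".join([w] + chain))
    return sigs
```

Running this on the paper's example sentence yields the two signatures quoted above; ads and navigation chrome, which contain little natural-language text, generate few or no signatures.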

What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution. Rather than taking on the brutal task of tearing apart the page layout to discard ads, navigation, and other goo, they just noted that the most important part of the page tends to be natural language text. By starting all their signatures at stopwords, they naturally focus the algorithm on the important parts of the page. Very cool.

If you can't get enough of this topic, please see also my previous post, "Detecting near duplicates in big data", that discusses a couple papers out of Google that look at this problem.

Update: Martin Theobald posted a detailed summary of the paper on the Stanford InfoBlog.

Wednesday, August 06, 2008

Liu et al. from Microsoft Research Asia had the best student paper at SIGIR 2008, "BrowseRank: Letting Web Users Vote for Page Importance" (PDF), that builds a "user browsing graph" of web pages where "edges represent real transitions between web pages by users."

An excerpt:

The user browsing graph can more precisely represent the web surfer's random walk process, and thus is more useful for calculating page importance. The more visits of the page made by users and the longer time periods spent by the users on the page, the more likely the page is important ... We can leverage hundreds of millions of users' implicit voting on page importance.

Some websites like adobe.com are ranked very high by PageRank ... [because] Adobe.com has millions of inlinks for Acrobat Reader and Flash Player downloads. However, web users do not really visit such websites very frequently and they should not be regarded [as] more important than the websites on which users spend much more time (like myspace.com and facebook.com).

BrowseRank can successfully push many spam websites into the tail buckets, and the number of spam websites in the top buckets in BrowseRank is smaller than PageRank or TrustRank.

Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks.

One issue that came up in discussions after the paper was presented is whether BrowseRank gets something much different from a smoothed version of visit counts.

Let's say all the transition probabilities between web pages are set by how people actually move across the web, all the starting points on the web are set by where people actually start, and then you simulate random walks. If all this is done correctly, your random walkers should move like people do on the web and visit where they visit on the web. So, after all that simulating, it seems like what you get should be quite close to visit counts.

This is a bit of an oversimplification. BrowseRank uses time spent on a page in addition to visit counts and its Markov model of user behavior ends up smoothing the raw visit count data. Nevertheless, the paper does not compare BrowseRank to a simple ranking based on visit counts, so the question still appears to be open.
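To make the thought experiment concrete, here is a toy sketch (the graph and transition probabilities are invented): power-iterating a random walk whose transitions are estimated from observed browsing converges to a stationary distribution, which is the smoothed analogue of visit frequencies.

```python
# Transition probabilities estimated from (hypothetical) observed
# browsing: from each page, the fraction of next-clicks to each page.
transitions = {
    "home": {"news": 0.6, "mail": 0.4},
    "news": {"home": 0.5, "mail": 0.5},
    "mail": {"home": 1.0},
}

def stationary(transitions, steps=200):
    """Power-iterate the walk to (approximately) its stationary distribution."""
    pages = list(transitions)
    dist = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        nxt = {p: 0.0 for p in pages}
        for p, prob in dist.items():
            for q, w in transitions[p].items():
                nxt[q] += prob * w
        dist = nxt
    return dist
```

If the transitions and starting points faithfully mirror real behavior, the resulting distribution should track where people actually spend their visits, which is exactly the open question about how much BrowseRank adds over raw visit counts.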

Tuesday, August 05, 2008

A SIGIR 2008 paper out of Yahoo Research, "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines" (ACM page), looks at how performance optimizations to a search engine can impact each other. In particular, it looks at how caching the results of search queries impacts the query load that needs to be served from the search index, and therefore changes the effectiveness of index pruning, which attempts to serve some queries out of a reduced index.

An extended excerpt:

Typically, search engines use caching to store previously computed query results, or to speed up the access to posting lists of popular terms.

[A] results cache is a fixed-capacity temporary storage of previously computed top-k query results. It returns the stored top-k results if a query is a hit or reports a miss otherwise.

Results caching is an attractive technique because there are efficient implementations, it enables good hit rates, and it can process queries in constant time. ... One important issue with caches of query results is that their hit rates are bounded ... [by] the large fraction of infrequent and singleton queries, even very large caches cannot achieve hit rates beyond 50-70%, independent of the cache size.

To overcome this limitation, a system can make use of posting list caching or/and employ a pruned version of the index, which is typically much smaller than the full index and therefore requires fewer resources to be implemented. Without affecting the quality of query results, such a static pruned index is capable of processing a certain fraction of queries thus further decreasing the query rate that reaches the main index.

[A] pruned index is a smaller version of the main index, which is stored on a separate set of servers. Such a static pruned index resolves a query and returns the response that is equivalent to what the full index would produce or reports a miss otherwise. We consider pruned index organizations that contain either shorter lists (document pruning), fewer lists (term pruning), or combine both techniques. Thus, the pruned index is typically much smaller than the main index.

The original stream of queries contains a large fraction of non-discriminative queries that usually consist of few frequent terms (e.g., navigational queries). Since frequent terms tend to be popular, those queries are likely to repeat in the query log and therefore are typical hits in the results cache. At the same time, these queries would be good candidates to be processed with the document pruned index due to their large result set size. Therefore, document pruning does not perform well anymore when the results cache is included in the architecture.

We observed that the results cache significantly changes the characteristics of the query stream: the queries that are misses in the results cache match approximately two orders of magnitude fewer documents on average than the original queries. However, the results cache has little effect on the distribution of query terms.

These observations have important implications for implementations of index pruning: the results cache only slightly affects the performance of term pruning, while document pruning becomes less effective, because it targets the same queries that are already handled by the results cache.

It is an excellent point that one optimization can impact another, so analyzing performance and configuration twiddles in isolation of the other optimizations is likely to give incorrect results. The results in this paper could be problematic for past work on index pruning since it indicates that the query streams used in their evaluations may not be representative of reality.
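The results cache the paper describes can be sketched as a fixed-capacity LRU map from query to top-k results (a toy version; implementation details are mine, not the paper's):

```python
from collections import OrderedDict

class ResultsCache:
    """Toy fixed-capacity LRU cache of previously computed top-k results."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.lookups = 0

    def get(self, query):
        self.lookups += 1
        if query in self.cache:
            self.hits += 1
            self.cache.move_to_end(query)  # mark as recently used
            return self.cache[query]
        return None  # miss: caller answers from the pruned or full index

    def put(self, query, results):
        self.cache[query] = results
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```

Note how the hit-rate bound in the excerpt falls out of this design: a singleton query can never be a hit no matter how large the capacity, and the misses that flow through to the index are, by construction, the rarer and more discriminative queries.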

Monday, August 04, 2008

Jaime Teevan, Sue Dumais, and Dan Liebling had a paper at SIGIR 2008, "To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent" (PDF), that looks at how and when different people want different things from the same search query.

An excerpt:

For some queries, everyone ... is looking for the same thing. For other queries, different people want very different results.

We characterize queries using features of the query, the results returned for the query, and people's interaction history with the query ... Using these features we build predictive models to identify queries that can benefit from personalized ranking.

We found that several click-based measures (click entropy and potential for personalization curves) reliably indicate when different people will find different results relevant for the same query .... We [also] found that features of the query string alone were able to help us predict variation in clicks.

Click entropy is a measure of variation in the clicks on the search results. Potential for personalization is a measure of how well any one ordering of results can match the ordering each individual searcher would most prefer. The query features that worked the best for predicting ambiguity of the query were query length, the number of query suggestions offered for the query, and whether the query contained a url fragment.
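Click entropy, so described, is just the Shannon entropy of a query's click distribution over results; a minimal sketch (the click counts in the test are invented):

```python
import math

def click_entropy(clicks):
    """Entropy (in bits) of the distribution of clicks over results.
    0.0 means everyone clicks the same result; higher values mean
    different people click different results for the same query."""
    total = sum(clicks.values())
    return -sum((c / total) * math.log2(c / total)
                for c in clicks.values() if c > 0)
```

A navigational query where everyone clicks the top result scores 0.0, while an ambiguous query with clicks spread evenly over two results scores 1.0 bit, signaling potential benefit from personalization.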

One tidbit I found interesting in the paper was that average click position alone was a strong predictor of ambiguity of the query and was well correlated with both click entropy and potential for personalization. That's convenient. Average click position seems like it should be much easier to calculate accurately when facing sparse data.