Addressing and Identifying Privacy Leakage from Query Logs: An Accountability Approach
Oshani Seneviratne, Daniel Weitzner, and Lalana Kagal, Massachusetts Institute of Technology, United States

The use of query logs for studying search engines and for improving information retrieval on the Web is invaluable. This use, however, is currently restricted because it might sacrifice user privacy and expose a significant amount of private and identifying information. We believe efforts to address information policy issues such as online privacy have been overly dominated by access restriction and privacy-preserving algorithms such as anonymization, generalization, and perturbation. An alternative is to emphasize the design of systems that provide greater information accountability, as judged against rules governing appropriate use, rather than information security and access restriction. The goal of this proposal is to develop a technical proof-of-concept that can serve as the basis for monitoring privacy leakage in various query log research contexts.

The goal of this study is to build a powerful prediction model for estimating the likelihood that a user will act in response to a specific advertisement. This work builds on previous research by designing and constructing a strong model capable of handling complex interactions between attributes, yet still explicit and comprehensible for its users (both the search engine and the advertisers). The researchers hope to achieve this goal by using an ensemble of decision rules. Assuming the built model is comprehensive, one can use it not only for making general recommendations for improving the quality of ads, but also for making specific recommendations for advertisers—recommendations that are sensitive to the context of their particular ad and the potential queries matching this ad.

An Ad Auction Trading Agent Competition
Michael P. Wellman and Patrick R. Jordan, University of Michigan, United States

The annual Trading Agent Competition (TAC) has provided a forum for researchers to test and compare ideas for autonomous bidding by software agents. Since 2000, the TAC series has featured market games based on scenarios from travel shopping, supply chain management, and market design. For 2009, we developed a new game in the domain of search advertising. In the TAC Ad Auctions game (TAC/AA), agents representing advertisers bid for search ad placement over a range of interrelated keyword combinations. A back-end search-user model translates placement over each simulated day to impressions, clicks, and conversions, yielding revenue for the advertiser. The overall advertiser objective is to maximize profit over the simulated campaign horizon. In this talk, we explain the objectives and design of TAC/AA, and anticipate the results from the tournament to be held in July.

An Examination of Language Use in Online Dating Profiles
Meenakshi Nagarajan, Knoesis, Wright State University, United States; Marti A. Hearst, School of Information, University of California, United States

This work contributes to the study of self-presentation in online dating systems. The larger goal is to understand effects of self-presentation via free-text on perceived attractiveness. Motivating question: What are the similarities and differences between how men and women self-present in online dating profiles?

Collaborative Personalized Advertising
Yi Zhang, University of California Santa Cruz, United States

How to recommend advertisements relevant to a specific query issued by a specific user in a specific context is an important and challenging research problem. Even a small improvement in accuracy could lead to a big benefit for a search engine company. On the other hand, getting information when needed is one of the most desirable things for search engine users. The proposed project will tackle this challenge based on Personalized Collaborative Advertising. The goal of the proposed project is to advance the fundamental theory, investigate long term and short term practical techniques, and develop efficient tools for building a personalized advertising agent.

Computational Analysis of Perfect-Information Position Auctions
David Robert Martin Thompson and Kevin Leyton-Brown, University of British Columbia, Canada

Position auctions were widely used by search engines to sell keyword advertising before being well understood (and, indeed, studied) theoretically. To date, theorists have made significant progress—for example, showing that a given auction is efficient or revenue-dominates a benchmark auction, such as Vickrey-Clarke-Groves (VCG). This paper augments that line of work, relying on computational equilibrium analysis. By computing Nash equilibria and calculating their expected revenue and social welfare, we can quantitatively answer questions that theoretical methods have not. Broadly, the questions we answer are:

How often do the theoretically predicted "good" (in other words, efficient, high-revenue) equilibria of Generalized Second-Price (GSP) occur?

In models where GSP is known to be inefficient, how much welfare does it waste?

We also use our data to examine the larger question of whether GSP is a good choice, compared with the alternatives.
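To make the object of analysis concrete, the allocation and pricing rule of a rank-by-bid GSP auction can be sketched as follows. The function names and the simple scoring rule are illustrative assumptions for exposition, not the paper's implementation (which also considers weighted scoring rules).

```python
# Minimal sketch of a Generalized Second-Price (GSP) position auction
# with rank-by-bid scoring; illustrative only.

def gsp_outcome(bids, ctrs):
    """bids: {advertiser: bid}; ctrs: per-position click-through rates.

    Returns (advertiser, position, price-per-click) triples: each winner
    pays the bid of the advertiser ranked just below, the defining
    feature of GSP.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    outcome = []
    for pos in range(min(len(ctrs), len(ranked))):
        advertiser, _ = ranked[pos]
        # The lowest-ranked winner pays the highest losing bid (or 0).
        price = ranked[pos + 1][1] if pos + 1 < len(ranked) else 0.0
        outcome.append((advertiser, pos, price))
    return outcome

print(gsp_outcome({"a": 3.0, "b": 2.0, "c": 1.0}, [0.3, 0.1]))
# a wins slot 0 paying 2.0 per click; b wins slot 1 paying 1.0
```

Equilibrium analysis then asks which bid profiles are stable under this pricing rule, and how their revenue and welfare compare to VCG.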

This prototype system demonstrates the technology of automatically generating display ads and transforming between different formats. Given text descriptions and an optional logo image, the system automatically suggests relevant image creatives and generates the corresponding display ads with optimal layouts.

Empirical Study of the Impact of Privacy Information in Search Results
Janice Tsai, Serge Egelman, Lorrie Cranor, and Alessandro Acquisti, Carnegie Mellon University, United States

Numerous studies have found that Internet users are concerned about online privacy and worried about what e-commerce Web sites might do with their personal data (Ackerman, et al., 1999; P&AB, 2005; Turow, et al., 2005). However, as privacy policies are difficult and time consuming to read and understand (Hochhauser, 2003; Jensen and Potts, 2004; Antón, et al., 2004; Milne and Culnan, 2004), few users make the necessary effort to read them (Privacy Leadership Initiative, 2001), let alone seek out the Web sites that have the best privacy policies. Recently, the authors of this proposal have developed a search service that facilitates the comparison of sites, based on their privacy policies, and gathered preliminary laboratory evidence suggesting that simpler privacy comparison information can affect users’ online behavior. To examine whether privacy indicators in search results also influence browsing behavior outside the laboratory, we conducted a field study in which we recruited participants to use a search engine with privacy indicators instead of their usual search engine. We collected more than 15,000 search queries from 460 participants over a 10-month period. We found that when search results were annotated with privacy indicators, participants were more likely to visit sites with high privacy ratings than sites with low or no privacy ratings, even when the former appeared towards the bottom of the search results page. Sites with low privacy ratings did not have significantly different visitation rates than sites without any privacy ratings.

Today's applications like search and social networks, although powerful, are limited by the shortcomings of their underlying data stores. These information stores are application-tuned, siloed, and not configured for a deep understanding of the content. In this talk, we discuss some ideas on evolving the underlying store into a potent knowledge base. We envision how an evolved Web might look in the future: an intelligent knowledge store that integrates information with inference, and a platform that can unleash powerful new applications with ease. We outline some practical challenges that must be solved by the community to propel this evolution.

It is widely believed that the value of acquiring a slot in a sponsored search list (that comes along with the organic links in a search engine’s result page) highly depends on who else is shown in the other sponsored positions. To capture such externality effects, we consider a model of keyword advertising where bidders participate in a Generalized Second Price (GSP) auction and users perform ordered search (they browse from the top to the bottom of the sponsored list and make their clicking decisions slot by slot).

Our contribution is twofold: First, we use impression and click data from Microsoft Live to estimate the ordered search model. With these estimates in hand, we are able to assess how the click-through rate of an ad is affected by the user’s click history and by the other competing links. Further, we compare the clicking predictions of our ordered search model to those of the most widely used model of user behavior: the separable click-through rate model. Second, we study complete information Nash equilibria of the GSP under different scoring rules.
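The contrast between the two click models can be sketched roughly as follows. The parameter values and the simplified fixed continuation probability are illustrative assumptions, not the estimated model, which conditions on click history and competing links.

```python
# Hedged sketch of two click models: the separable model
# (click probability = ad relevance x position effect) versus a
# top-down ordered-search model where the user may stop scanning
# after each slot. Parameters are illustrative.

def separable_ctr(relevance, position_effect):
    # Click probability factorizes; no dependence on other ads shown.
    return [r * p for r, p in zip(relevance, position_effect)]

def cascade_ctr(relevance, continue_prob):
    # A slot is clicked only if the user scanned down to it; after each
    # slot the user continues downward with probability continue_prob.
    examined = 1.0
    probs = []
    for r in relevance:
        probs.append(examined * r)
        examined *= continue_prob  # chance of scanning past this slot
    return probs

rel = [0.4, 0.4, 0.4]
print(separable_ctr(rel, [1.0, 0.6, 0.3]))  # decay from position effects
print(cascade_ctr(rel, 0.5))                # decay from examination odds
```

In the ordered-search view, an ad's click-through rate depends on which ads sit above it, which is exactly the externality the separable model cannot express.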

We characterize the efficient and revenue-maximizing complete information Nash equilibrium (under any scoring rule) and show that such an equilibrium can be implemented with any set of advertisers if and only if a particular weighting rule that combines click-through rates and continuation probabilities is used. Interestingly, this is the same ranking rule derived in previous work for solving the efficient allocation problem. On the negative side, we show that there is no scoring rule that implements an efficient equilibrium with VCG payments (VCG equilibrium) for all profiles of valuations and search parameters. This result extends previous work that argues the rank-by-revenue GSP does not possess a VCG equilibrium.

This rich media technology concept was built on top of Silverlight technology and the Audience Intelligence platform, and provides a non-intrusive and customized consumer experience. GAZE is a unique gadget that extracts keywords and contextually links them to advertisements. Unlike other rich media features, GAZE non-intrusively blends advertising with news, celebrity timelines, access to music, videos and photos, and social networking. For example, a user might hover over the name "Madonna" on a celebrity gossip site and GAZE would allow that user to see everything from a history of Madonna to what dress she wore at the Grammy Awards, which would provide an opportunity for an advertiser to place a relevant retail advertisement. The content comes from our owned and operated assets, including MSN Video, MSN Entertainment, MSN Music, and Live Search.

There are many instances when a person may want to find a gift for someone, or when the system should recommend a gift to a consumer. Examples include when a user knows about a friend's upcoming birthday, or when the system knows that a user has received his Christmas bonus check.

In these instances, we want to match a gift to that particular user. To do so, we need to understand what kinds of gifts might interest the user, so we take the user's interests into account. Often, however, the stated interests provide an incomplete picture. Just because a user is interested in basketball does not mean that basketball is his only interest: he may be interested in other sports, or in sporting-event tickets for which he has never indicated any interest. To build a more complete picture, we also utilize a person's demographic information, such as age, gender, and location. This information allows us to examine people with similar profiles, backgrounds, and interests, and to tailor a unique and relevant gift to the user.

The key is to return a set of unique and relevant gifts, not just popular items. For popular items, we could simply return what everyone in aggregate wants. Instead, we want to generate gifts that other groups may not find appealing or that this particular person will find more appealing than others will.

Today's search technologies rely heavily on users' textual queries to identify their information needs. Yet users have a wide variety of information needs. To cope with this, this research proposes a goal-driven information retrieval framework, in which the retrieval processes and combination strategies are influenced by automatically learning users' information goals (in other words, types of user needs or retrieval tasks) and by estimating the probability of relevance between user information needs and documents. The research intends to establish a sequential approach to learning relevance by applying recent advances in Bayesian machine learning: explicitly modeling users' various information goals and, as a result, having the final algorithm respond to different information goals through a weighted average of different retrieval models. By accounting for the dependency of information needs among users, we will correlate users through hierarchical Bayesian methods.
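The weighted-average idea can be illustrated with a minimal sketch: the final document score mixes per-goal retrieval models, weighted by the inferred probability of each information goal. The goal names, features, and weights below are hypothetical.

```python
# Illustrative sketch of scoring documents as a mixture of retrieval
# models weighted by inferred goal probabilities; not the actual
# Bayesian framework proposed in the research.

def blended_score(doc, goal_probs, models):
    """goal_probs: {goal: P(goal | query)}; models: {goal: scorer(doc)}."""
    return sum(p * models[goal](doc) for goal, p in goal_probs.items())

# Hypothetical goals and per-goal scorers over made-up document features.
models = {
    "navigational": lambda d: d["url_match"],
    "exploratory":  lambda d: d["topic_cover"],
}
doc = {"url_match": 0.9, "topic_cover": 0.2}
print(blended_score(doc, {"navigational": 0.7, "exploratory": 0.3}, models))
```

If the query looks mostly navigational, the URL-matching model dominates the final score; a mostly exploratory query shifts weight toward topical coverage.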

Search auctions have become a dominant source of revenue generation on the Internet. Such auctions have typically used per-click bidding and pricing. We propose the use of hybrid auctions where an advertiser can make a per-impression as well as a per-click bid, and the auctioneer then chooses one of the two as the pricing mechanism. We assume that the advertiser and the auctioneer both have separate beliefs (called priors) on the click-probability of an advertisement. We first prove that the hybrid auction is truthful, assuming that the advertisers are risk-neutral. We then show that this auction is superior to the existing per-click auction in multiple ways:

It takes into account the risk characteristics of the advertisers.

For obscure keywords, the auctioneer is unlikely to have a very sharp prior on the click-probabilities. In such situations, the hybrid auction can result in significantly higher revenue.

An advertiser who believes that its click-probability is much higher than the auctioneer's estimate can use per-impression bids to correct the auctioneer's prior without incurring any extra cost.

The hybrid auction can allow the advertiser and auctioneer to implement complex dynamic programming strategies. As Internet commerce matures, we need more sophisticated pricing models to exploit all of the information held by each of the participants. We believe that hybrid auctions could be an important step in this direction.
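A minimal sketch of how an auctioneer might score hybrid bids, assuming a simple rule that takes the better of the per-impression bid and the prior-weighted per-click bid; this is an illustration of the idea, not the paper's truthful mechanism.

```python
# Hedged sketch of hybrid-auction scoring: each advertiser submits a
# per-impression bid and a per-click bid; the auctioneer values each
# bid pair per impression using its own prior on click probability.
# The winner-selection rule here is a deliberate simplification.

def expected_value(per_impression_bid, per_click_bid, prior_ctr):
    # The per-click bid is worth prior_ctr * bid per impression under
    # the auctioneer's belief; take whichever option pays more.
    return max(per_impression_bid, per_click_bid * prior_ctr)

def pick_winner(bids, prior_ctrs):
    """bids: {adv: (per_impression, per_click)}; prior_ctrs: {adv: p}."""
    return max(bids, key=lambda a: expected_value(*bids[a], prior_ctrs[a]))

bids = {"a": (0.05, 0.0), "b": (0.0, 1.0)}
priors = {"a": 0.1, "b": 0.02}
print(pick_winner(bids, priors))  # a's 0.05 beats b's 1.0 * 0.02
```

The sketch also shows the correction channel described above: an advertiser who believes its click probability exceeds the auctioneer's prior can shift weight to the per-impression bid.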

Intent detection is one of the crucial long-standing goals of information access. This research proposes to automatically identify the user queries and behavior patterns associated with commercial intent. Specifically, the researchers will develop a taxonomy of commercial intent to classify user information needs along dimensions such as immediate versus near-term purchase, or purchase versus research. This taxonomy will be developed using a combination of user behavior mining, user surveys, and longitudinal tracking of user purchasing behavior. The researchers will also develop techniques to automatically infer information needs from user actions such as queries, result and ad click-through, and browsing behavior, in order to classify a session into the corresponding commercial intent category. The general intent inference algorithms developed will also be applicable to other (non-commercial) intent detection tasks, such as detection of scholarly research intent, health information seeking, and market research. Finally, the researchers will develop and evaluate the effectiveness of these techniques to improve practical applications such as ad ranking and content personalization. The findings and the resulting algorithms will help advertisers automatically create more appropriate and relevant ad content, rank ads better by matching the tone, focus, and content of the ads with the user's current intent, and contribute to the general understanding of user intent inference and Web search behavior modeling.

Integrating the Deep Web with the Shallow Web
Mike Stonebraker, Massachusetts Institute of Technology, United States

It is generally accepted that the deep Web is significantly bigger than the shallow Web. We define the deep Web as that information that is available only by filling information into Web forms. Current search systems are getting very good at answering queries directed to the shallow Web. However, queries that can be answered only from the deep Web, such as “What is the Amtrak fare from Boston to New York?” or “What is the flight status of UA 179?” typically require a user to know the URL of the site with the information. He must then access that site and fill in the required information. Giving such a query to a search engine instead is likely to yield only frustration to the user. The purpose of this research is to build a software system that can integrate the deep Web with the shallow Web, via semantics.

Learning to Advertise in Sponsored Search: Relevance Ranking, Diversity, and Beyond
Ping Li, Cornell University, United States

The goal of this research is to learn ad relevance and ranking in sponsored search in order to improve user experience. The researchers propose (1) learning ad ranking from click-through data using state-of-the-art algorithms currently applied in (regular) Web search ranking, including support vector machines (SVM) and ensembles of boosted trees; (2) incorporating pair-wise preference information from click-through history; and (3) taking advantage of the huge volume of unlabeled (undisplayed) data under a semi-supervised learning paradigm, including graph learning and collaborative filtering. The final goal of this research is to develop principled techniques to guide the positioning of ads so as to enhance the probability of (at least) one click on sponsored links per displayed ad section. All tasks lead to very large-scale machine learning problems, challenging and interesting in their own right. The researchers expect to achieve part of these goals in a one-year time frame and to continue the investigation afterward using Microsoft data.

Modeling Trust Influence and Bias in Social Media
Anupam Joshi and Tim Finin, University of Maryland, Baltimore County, United States

Web-based social media systems such as blogs, wikis, and forums are an important new way to publish information, engage in discussions, and form communities on the Internet. Since approximately one-third to one-half of all new Web content is generated by social media systems, their reach and impact are significant. This research seeks a better understanding of the following questions: How can blog communities be identified based on a combination of topic, bias, and underlying beliefs? Which authors and blogs are most influential within a given community? From where do particular beliefs or ideas originate, and how do they spread? What are the most trustworthy sources of information about a particular topic? What opinions and beliefs characterize a community, and how do these opinions change?

Monetizing User Activity on Social Networks – Challenges and Experiences
Meena Nagarajan and Amit Sheth, Kno.e.sis Center, Wright State University, United States

Current advertising approaches to monetizing user content on social networks are profile-based contextual advertisements, demographic-based ads, or a combination of the two. While profile information might be useful for launching product campaigns and micro-targeting customers, it does not necessarily reflect current interests or purchase intents. We posit that in addition to using profile information, ad programs should also generate profile ads from user activity, or footprints, left on public venues (forums, discussion sites, marketplaces, and so on) on social networking sites.

While the intuition is rather straightforward, there are some challenges that need to be addressed before such content can be used for monetization. We will present our first attempt at solving two of these problems:

Identifying user intentions: We observed that, unlike in Web search, the presence of certain entity types does not accurately classify a post's intent. The presence of a product name does not immediately imply navigational or transactional intent, as it can appear with several user intentions: "I am thinking of getting X" (transactional), "I like my new X" (information sharing), and "what do you think about X?" (information seeking). Among the many footprints users leave on a site, it is important to identify those with high monetization potential, in order to generate profile ads that users are more likely to click.

Informal content and off-topic noise: A characteristic of communities, both online and offline, is the shared understanding and context in which they operate. Use of slang and variations of entity names is commonplace. Failing to spot such keywords in posts means fewer matched ad impressions. Additionally, due to the interactional nature of social networking platforms, when users share information they are typically sharing an experience or event. The main message is overloaded with off-topic information that needs to be discarded before it is used by ad programs.

We present our initial approach along with preliminary results from user studies that confirm that user-generated content is ideally suited for monetization. Specifically, using data from MySpace and Facebook, we show that 52 percent of ad impressions generated using keywords from our system were targeted, compared to the 30 percent of relevant impressions generated without our system. We also discuss next steps and present our ongoing work in the general area of analyzing user-generated content, which complements the above topic of investigation.

Determining query intent in an evolving user session is critical in achieving effective ad generation for sponsored search. This requires identifying and exploiting the “query funnel.” We explore several features of this problem using AdCenter data:

Use of click-through rates in ad placement, using sequential query-ad-click probability from data

Search on unstructured data on the Web has received a lot of attention. However, there are numerous additional sources, including (a) diverse personal data: e-mail, desktop files, contacts, (b) organizational data, often from disconnected islands in the organization: LDAP directory data, HR and other ERP data, and (c) Web data that is relevant to the organization, such as bibliographic data for a research organization, or stock market data for an investment fund. Today it is possible to query each of these information silos independently; however, one cannot readily exploit rich connections between them. This research looks at (a) bridging diverse data sources, by adding semi-structured annotations to them and then creating probabilistic connections between the annotations; (b) developing query models that can search across data from multiple such sources by aggregating possibly uncertain information from multiple sources to get higher confidence; (c) developing efficient algorithms for executing queries under the models developed.

Search Logs as Information Footprints: Turning Search Logs into a Multi-Resolution Topic Map to Support Collaborative Surfing
ChengXiang Zhai, Xuanhui Wang, and Kevin Chang, University of Illinois at Urbana-Champaign, United States

We view search logs as information footprints left by users navigating the information space and propose a way to organize these footprints into a multi-resolution topic map. The map makes it possible for users to navigate flexibly in the information space by following the footprints left by other users. As new users use the map for navigation, they leave more footprints, which can then be used to enrich and refine the map dynamically and continuously for the benefit of future users. Thus, by turning search logs into a topic map, we can establish a sustainable infrastructure that helps users surf the information space in a collaborative manner. Preliminary experimental results show that the topic map is effective in helping users satisfy exploratory information needs.

Search Query Disambiguation from Short Sessions using Markov Logic
Raymond J. Mooney and Lily Mihalkova, University of Texas at Austin, United States

Web search queries tend to be short and ambiguous. Several projects have explored personalized search-query disambiguation; however, existing work relies almost exclusively on extensive log data in which each user's search activities are recorded over long periods of time. Such approaches raise privacy concerns and require some mechanism for long-term user tracking. We present an approach to query disambiguation that bases its predictions only on a short glimpse of a user's search activity from a single session. We use Markov Logic Networks, a recently introduced approach to statistical relational learning, to exploit the relations between the current search session and previous sessions in order to predict the probability that a given user will click on a particular result. We present experimental results on MSN search logs that demonstrate the ability of our approach to reorder search results effectively, based on session context.

We provide a comprehensive study of the structure and dynamics of online advertising markets, based mostly on techniques from the emerging discipline of complex systems analysis. First, we look at how the display rank of a URL link influences its click frequency, for both sponsored search and organic search. Second, we study the market structure that emerges from these queries, especially the market share distribution of different advertisers. We show that the sponsored search market is highly concentrated, with fewer than five percent of all advertisers receiving more than two-thirds of the clicks in the market.

Furthermore, we show that both the number of ad impressions and the number of clicks follow power law distributions of approximately the same coefficient. However, we find this result does not hold for the distribution of clicks per rank position, which shows considerable variance, most likely due to the way advertisers divide their budgets across different keywords.
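A power-law coefficient of this kind can be estimated, for instance, by a least-squares fit on log-log rank/frequency data. The sketch below uses synthetic data and a deliberately crude estimator; it is an illustration of the concept, not the study's methodology.

```python
# Estimate the exponent alpha of a power-law rank/frequency
# distribution, freq ~ rank^(-alpha), via least squares on log-log
# values. Synthetic data; crude but deterministic.
import math

def powerlaw_exponent(values):
    """Fit log(freq) = c - alpha * log(rank); return alpha."""
    ranked = sorted(values, reverse=True)
    pts = [(math.log(r + 1), math.log(v)) for r, v in enumerate(ranked)]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return -slope  # alpha is the negated log-log slope

# Clicks drawn exactly from freq ~ rank^-2, so alpha should be ~2.
clicks = [1000 / (r ** 2) for r in range(1, 50)]
print(round(powerlaw_exponent(clicks), 2))  # → 2.0
```

Comparing the fitted exponents of the impression and click distributions is one way to check the "approximately the same coefficient" observation on real log data.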

Finally, we turn our attention to how such sponsored search data could be used to provide decision support tools for bidding for combinations of keywords. We provide a method to visualize keywords of interest in graphical form, as well as a method to partition these graphs to obtain and discover desirable subsets of search terms.

The Topic Explorer is a browser prototype that proactively retrieves Web pages with content related to the document being browsed and makes them available to users. The system projects the browsed document onto a large concept space, and then employs the results of Web searches for the most important extracted concepts in a re-ranking scheme that targets both relevance and diversity.
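Re-ranking that balances relevance against diversity is often done with a greedy scheme such as Maximal Marginal Relevance (MMR). The sketch below is a generic illustration of that idea, not the Topic Explorer's actual algorithm; the relevance scores and similarity function are made up.

```python
# Hedged sketch of greedy MMR re-ranking: each pick trades off
# relevance against similarity to the results already selected.

def mmr(candidates, relevance, sim, lam=0.7, k=3):
    """lam weights relevance; (1 - lam) penalizes redundancy."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

rel = {"a": 0.9, "b": 0.85, "c": 0.5}   # hypothetical relevance scores
sim = lambda x, y: 0.95 if {x, y} == {"a", "b"} else 0.1  # a, b near-duplicates
print(mmr(["a", "b", "c"], rel, sim))
# the less relevant but novel "c" is promoted above the near-duplicate "b"
```

With lam near 1 the scheme reduces to plain relevance ranking; lowering lam pushes novel results up the list.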

This research attempts to address the problems and limitations of topic- and keyword-based ad-matching technology and move toward a pragmatics-based approach. For ad placement, it is crucial to understand the activities, tasks, and states that are pertinent to the authors or readers in order to infer what services or products would be of interest and in need. The main goal of this research is to develop a novel ad-matching method that penetrates the needs and intents of potential beneficiaries of ads, moving one step beyond topical similarities. To infer hidden entities that need to be mapped to ads, common sense knowledge about essential relations is mined from the Web and used in conjunction with an existing large-scale common sense knowledge base constructed by the general public.

Two-sided Combinatorial Auction Mechanism for Sponsored Search
Richard Shang and Roumen Vragov, Baruch College – The City University of New York, United States; David Porter, Chapman University, United States; Vernon Smith, George Mason University, United States

This proposal uses past theoretical advances in consumer search and auction theory to propose a two-sided combinatorial auction mechanism for sponsored search ads. Existing auction mechanisms either assume constant user behavior and analyze advertisers' bidding patterns, or assume static bids and analyze the utility and attractiveness of the ads for each user. This research will run experiments with economically motivated human subjects, using parameters taken from data available in Microsoft adCenter logs, and test the mechanism along two dimensions: advertiser bidding behavior and ad ranking methodology.

Understanding Change on the Web and the Desktop
Susan Dumais, Jaime Teevan, Dan Liebling, and Richard Hughes, Microsoft Research

From the Web to the desktop, people work with dynamic, ever-changing collections of information. Yet the tools used to view and search these collections—browsers and search engines—focus only on a single snapshot of the information. We will present analyses of how Web content changes over time, how people revisit Web pages over time, and how revisitation patterns are influenced by user intent and changes in content. We will also demonstrate several prototypes that we have developed to support people in understanding how information with which they interact changes over time.

The long-term goal of this research is to link multiple related stories together temporally and automatically construct a cohesive narrative that is easily accessible to the user and facilitates in-depth study of news on any topic. Current newsreaders are highly driven by recency of information: most of the user's attention is directed toward events of the last few hours or even minutes. This phenomenon leads to a grasp of current events that is broad but shallow. Many events, however, are better interpreted in the context of previous related events. Understanding which past stories give the best context for an event is difficult, requiring many subtle judgments about relevance, entity identity, and so on. By aggregating information in query logs, this research aims to reconstruct links between related stories.

Successful information search requires a joint effort from both syntactic matching provided by current search engines and semantic matching performed by human users. Word-based syntactic matching schemes work well for tasks such as home page finding or simple fact finding; but they are less effective in supporting exploratory search tasks such as learning and investigation. One way to overcome this limitation of syntactic matching is to capture the search journeys of other users with the same or semantically similar query, and then use them as a roadmap to provide an intuitive guidance to support exploratory search.

The research presented here looks at the utilization of query semantics learned from query logs to:

Increase the diversity of a search result; and

Devise new interfaces that display a search result to support exploratory search.

Web-scale Semantic Social Mash-Ups with Provenance
Harry Halpin and Henry Thompson, University of Edinburgh, United Kingdom

As the Web grows ever larger and increasing amounts of data are available on the Web, how can users access and combine data from multiple sources to discover specific information about a particular thing such as a person, place, organization, or even abstract concept?

In this project, we focused on determining how an ordinary user could find a relevant Semantic Web URI for abstract concepts and physical entities (like people and places) by taking advantage of their ordinary behavior in using hypertext search engines. In contrast to traditional approaches, in which the meaning of a Semantic Web URI is defined formally in terms of logical inference or the intended meaning of its owner, this allows the meaning of a URI to be defined socially, by taking advantage of the implicit social semantics of search engines. Users can then find and retrieve URIs using a paradigm they are already familiar with: keyword-based search that lets them click on hypertext Web pages. The approach is empirically evaluated on queries for entities (like people and places) and abstract concepts selected from the Microsoft Live Search engine. In our experiment, our approach gives users a very high (almost 90 percent) chance of finding a relevant Semantic Web URI to use in their mash-up, a large increase over other approaches. We then show how a straightforward provenance framework can be used to evaluate and track the semantics in these mash-ups. This lets ordinary users, not only experts, create data with semantics and share it with other users, building the Semantic Web on top of hypertext search engines.

What Makes Them Click: Empirical Analysis of Consumer Demand for Search Advertising
Przemyslaw Jeziorski and Ilya Segal, Stanford University, United States

We study users' response to sponsored-search advertising using a data set from Microsoft's Live AdCenter, distributed in the "Beyond Search" initiative. We estimate a structural model of utility-maximizing users, which quantifies "user experience" by using the economic approach of "revealed preference" and allows us to predict user responses to alternative ad placement policies. In the model, each user chooses a sequential clicking strategy to maximize his or her expected utility under incomplete information about the relevance of ads. We estimate the substitutability of ads in users' utility function, the fixed effects of different ads and positions, users' uncertainty about ads' relevance, and the heterogeneity of users. We find substantial substitutability of ads in users' payoffs, which generates large negative externalities: we predict 50 percent more clicks in a hypothetical world in which each ad faces no competition. Our numerical simulations of alternative ad placement policies find that user-optimal matching increases consumer welfare by 25 percent and CTR-optimal matching increases CTR by nearly 50 percent. Moreover, if we allow for user-level ad targeting, the consumer surplus gain amounts to 60 percent. Finally, we estimate that user welfare could be raised by nearly 15 percent if users had full information about the relevance of ads to them.