Jeff's Search Engine Caffè

Tuesday, February 3

I'm happy to announce that Google Research is releasing the largest collection of entity-linked data ever made publicly available. The dataset can be used for a wide variety of information retrieval and information extraction tasks. The Freebase Annotations of the TREC KBA Stream Corpus 2014 (FAKBA1) contains over 9.4 billion entity annotations from over 496 million documents. More details, including a link to download the data, are available at:

This is an important release because entity linking is an expensive process that is difficult for researchers to perform at scale.

The KBA Stream Corpus was designed to help track and filter important updates about entities as they change over time. The goal of KBA is to recommend edits to Wikipedia editors based on incoming streams of news and social media. One of the tasks in the track is the "Cumulative Citation Recommendation" (CCR) task, whose goal is to recommend cite-worthy articles to editors. There is also an extraction task, Streaming Slot Filling, which suggests changes to an entity profile (similar to updating a Wikipedia infobox).

In order to facilitate research in this field, we annotated all of the English documents from the TREC KBA Stream Corpus 2014 (http://trec-kba.org/kba-stream-corpus-2014.shtml) with entity links to Freebase. The entity links are resolved automatically and are imperfect. For each named entity recognized, we provide the mention text, begin and end byte offsets, the Freebase MID, and a confidence score. The dataset also includes manual annotations of the TREC KBA CCR 2014 entity queries (in TSV format) that I performed.
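For illustration, here is a minimal sketch of how you might load the annotations, assuming one tab-separated record per mention with the fields in the order listed above (the column layout is my assumption; check the dataset README for the authoritative format):

```python
# A minimal sketch of loading FAKBA1-style annotations; the exact
# column order below is an assumption for illustration - check the
# dataset README for the authoritative record format.
import csv
from collections import defaultdict

def load_annotations(path):
    """Group entity mentions by stream document id."""
    mentions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            doc_id, mention, begin, end, mid, confidence = row[:6]
            mentions[doc_id].append({
                "mention": mention,
                "begin": int(begin),      # byte offsets into the document
                "end": int(end),
                "mid": mid,               # Freebase MID, e.g. /m/02mjmr
                "confidence": float(confidence),
            })
    return mentions
```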

FAKBA1 has 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase.

Although it's early, the dataset has a variety of possible applications, including:

TAC Knowledge Base Population (KBP) Tasks - The goal is to construct a knowledge base from text, including subtasks such as entity linking. A new tri-lingual track (Spanish, English, and Chinese) is being planned for 2015.

This paper tackles the issue of robustness and examines how systems that achieve gains overall may still significantly hurt many individual queries. The authors present a framework for optimizing both effectiveness and robustness, and for controlling the tradeoff between the two.

Mark D. Smucker (University of Waterloo), Charles L. A. Clarke (University of Waterloo)
In this paper, we introduce a time-biased gain measure, which explicitly accommodates such aspects of the search process... As its primary benefit, the measure allows us to evaluate system performance in human terms, while maintaining the simplicity and repeatability of system-oriented tests. Overall, we aim to achieve a clearer connection between user-oriented studies and system-oriented tests, allowing us to better transfer insights and outcomes from one to the other.
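As a rough illustration of the idea (my own sketch under simplifying assumptions, not the paper's implementation), the gain from each relevant document is discounted by how much time the user has already spent reaching it:

```python
# A rough sketch of the time-biased gain idea, not the paper's exact
# implementation: gain earned at rank k is discounted by the time the
# user has already spent getting there.
import math

def time_biased_gain(gains, times, half_life=224.0):
    """gains[k]: gain of the document at rank k (e.g. 1 if relevant).
    times[k]: estimated seconds the user spends on the result at rank k.
    half_life: seconds at which the discount falls to 0.5 (illustrative)."""
    tbg, elapsed = 0.0, 0.0
    for gain, t in zip(gains, times):
        tbg += gain * math.exp(-elapsed * math.log(2) / half_life)
        elapsed += t  # the user moves on to the next result
    return tbg

print(time_biased_gain([1, 0, 1], [30.0, 10.0, 45.0]))
```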

[We have to listen to the old guys, but we don't have to accept what they say - though this doesn't hold for my talk today]

What is IR?
- IR is about vagueness and imprecision in information systems

Vagueness
- User is not able to precisely specify the object he is looking for
--> "I am looking for a high-end Android smartphone at a reasonable price"
- Typically an iterative retrieval process.
- IR is not restricted to unstructured media

IR vs Databases
-> DB: given a query q, find objects o with o -> q
-> IR: given a query q, find documents d with high values of P(d -> q)
-> DB is a special case of IR! (in a certain sense; see the sketch below)
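A toy sketch of the contrast (my own illustration, not from the talk): a DB query is a hard Boolean filter, while IR ranks documents by a graded score.

```python
# Toy illustration: DB as exact filtering, IR as ranked retrieval.
# The documents and the scoring function are made up for the example.
docs = {
    "d1": "high-end android smartphone at a reasonable price",
    "d2": "cheap feature phone",
    "d3": "android smartphone review",
}

def db_query(predicate):
    # DB: return exactly the objects satisfying the predicate (o -> q).
    return [d for d, text in docs.items() if predicate(text)]

def ir_query(terms):
    # IR: rank documents by an estimate of P(d -> q); here, term overlap.
    scores = {d: sum(t in text for t in terms) / len(terms)
              for d, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(db_query(lambda t: "android" in t and "price" in t))  # ['d1']
print(ir_query(["android", "smartphone", "price"]))
```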

Foundation of DBs
-> Codd's paper on a relational model was classified as Information Retrieval
-> The concept of transactions separated the fields.

The fundamental difference between IR and DB is the handling of the pragmatic level.
DB: the user interacts with the application --> DBMS --> DB
IR: the user interacts with the IR system directly, over the collection
(separation between the management system and the application)

IR as Engineering Science
- Most of us feel that we are engineers. But things are not as simple as they might seem.
-> Example: an IR person in civil engineering.

4 or 5 types of bridges - the Boolean bridge, vector space, language modeling, etc.
-- Build all 5 and see which one stands up.
-- Test the variants in live search.
-- Users in IR blame themselves when they drive over a collapsing (non-optimal) system.
-- There could be serious implications of choosing a non-optimal system.

Why we need IR engineering
-> IR is more than web search
Institutions and companies have
- large varieties of collections
- a broad range of search tasks
[example: searching in the medical domain. A doctor performs a search and then waits 30 minutes for an answer. As engineers, we could work on getting this down to 10 minutes]

Limitations of current knowledge
- Moving in Terra Incognita
- example: Africa. Consider the western world's knowledge of African geography several hundred years ago; the map was very inaccurate and incomplete.
- At best, interpolation is reasonable.
- Extrapolation lacks theoretic foundation
-> But how to define the boundaries of current knowledge?

We should put more focus on the development of theoretical models.
-> each theory is applicable within a well-defined application range

But, what is the application range?
-> defined by the underlying assumptions
-> Are the underlying assumptions of the model valid? For this, we need experiments to validate them.

Experimentation
- Why vs How experiments

Why -> based on a solid theoretical model.
-> performed for validating the model assumptions

How
- based on some ad-hoc model
- focus on the outcome
- no interest in the underlying assumptions

How experiments
-> Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998.
-> TREC-8 ad hoc collection, MAP.
-> It's easy to show improvements, but few beat the best official TREC result.
-> Over 90% of the papers claim improvements that exist only due to poor baselines, and do not beat the best TREC results.
-> Improvements don't add up.

New IR Applications
- Dozens of IR applications (see the SWIRL 2012 workshop)
- Heuristic approaches are valuable for getting started and for comparison, but they are limited in their generality.
-> We don't know how far we can generalize the method.

Conclusion: Possible Steps
-> Encourage theoretical research of the 'why' type, e.g. by having a separate conference track for these papers.
-> Define and enforce strict evaluation standards so that meta-studies are possible.
-> Set up repositories for standardized benchmarks.

Questions

Nick Belkin -> Where do the assumptions underlying the theory come from? Where do we get evidence? How would you approach that?
-> A: Without any hypothesis, the observations are useless. We need a back and forth between theory and experimentation.

DB and IR
-> Can they be united? DB is part of IR. IR is part of DB. [the issue is bringing the people together]

David Hawking
-> Have we hit the limit of our engineering capability? What are the biggest opportunities for significant progress?
A: We perhaps cannot improve the classical ad hoc setting much further. We need to know more about the user and their task. Smartphone example: your phone knows where you are and what time it is when you are looking for a Chinese restaurant (including its opening hours). We need to study the hard tasks for knowledge workers and integrate more deeply into their applications.

Friday, August 10

If you're planning some sightseeing, or a place to catch up over a beer or meal with friends and colleagues, here are some ideas.

Portland has been made famous by the "Portlandia" series. It's a quirky, young, outdoorsy, hipster, organic, crunchy kind of town, where "young people go to retire". It's ground zero for the burgeoning craft beer, coffee, and micro-distilling movements in America. It has been called "Beervana" because of its plethora of outstanding breweries, bars, and pubs.

To cut to the chase, here is my Food Map of Portland on Google maps. Below are some of my sources and raw research notes.

Note: If you arrive in Portland early and you like food, be sure to check out the Bite of Oregon food festival taking place on Saturday and Sunday.

Le Pigeon (think Paris! foie gras profiteroles!) and their new place, Little Bird (another French bistro)

Coffee Shops
Portland is known for having some of the best coffee in the country. Here are some of the best places to try a cup.
Stumptown (several locations)
Ristretto Roasters
Coava Coffee
(the business area near the conference is a bit of a coffee & restaurant wasteland, so plan on venturing north into the heart of downtown)

Pok Pok - Thai street food (get the drinking vinegar); famous for its fish sauce wings. Be prepared to wait, as there is always a long line (think 1 to 1.5 hours; go across the street and wait at the Whiskey Soda Lounge). There is a new downtown restaurant, Ping, from the same owners, which was just named one of GQ's ten best new restaurants. It's reasonably priced, casual food without crazy lines.

CloudSearch is a fully managed search service, based on Amazon's A9 search infrastructure, that provides near-real-time, faceted, scalable search. The index is stored in memory for fast search and updates.

Dynamic Scaling

What makes the A9 offering particularly interesting is its ability to dynamically scale. The architecture of A9's search system, with shards and replicas, is a common and well-understood model. What makes Amazon's offering unique is the ability to easily scale your search cluster: A9 will automatically add (and remove) search instances and index partitions as the index grows and shrinks, and it will dynamically add and remove replicas in response to changes in search request traffic. The exact technical details have not yet been clearly described.

Right now, there is a limit of 50 search instances. An extra-large search instance can handle approximately 8 million 1 KB documents, so the assumption appears to be that documents are quite small (e.g. product records). To put it in perspective, a rough rule of thumb for web documents is approximately 10 KB each. That translates into roughly 800k web documents per server * 50 servers = 40 million web documents. This is not for building large-scale web search, yet. However, it should be more than enough for most enterprise e-commerce and site-search applications.
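As a sanity check on that back-of-the-envelope estimate (my own arithmetic based on the numbers above):

```python
# Back-of-the-envelope capacity estimate for the 50-instance limit.
instance_capacity_bytes = 8_000_000 * 1_000  # ~8M documents of ~1 KB each
web_doc_bytes = 10_000                       # rule of thumb: ~10 KB per web doc
web_docs_per_instance = instance_capacity_bytes // web_doc_bytes  # 800,000
max_instances = 50
print(web_docs_per_instance * max_instances)  # 40,000,000 web documents
```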

The real value added by the search engine is in the ranking of results.

Ranking
The control over search-index ranking is rudimentary, with a few basic knobs. You can add stopwords, perform stemming, and add synonyms. This is very basic stuff. How you might make more interesting (and important) IR ranking changes is vague. From the article:

Rank expressions are mathematical functions that you can use to change how search results are ranked. By default, documents are ranked by a text relevance score that takes into account the proximity of the search terms and the frequency of those terms within a document. You can use rank expressions to include other factors in the ranking. For example, if you have a numeric field in your domain called 'popularity,' you can define a rank expression that combines popularity with the default text relevance score to rank relevant popular documents higher in your search results.

This indicates that it is possible to boost documents. However, it is unclear how the underlying text search works, and whether you can boost individual important fields (e.g. name, description).
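To make the quoted description concrete, here is a hedged sketch of the kind of combined score a rank expression might compute; this is plain Python illustrating the idea, not CloudSearch's actual expression syntax:

```python
import math

# Combine the default text relevance score with a per-document
# 'popularity' field, in the spirit of the quoted documentation.
# The log transform and weight are arbitrary choices for illustration.
def combined_rank(text_relevance, popularity):
    return text_relevance + 100.0 * math.log10(max(popularity, 1))

print(combined_rank(text_relevance=350.0, popularity=12000))
```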

For more details on the more advanced query processing needed to make search work in practice, read the post: Query Rewriting in Search Engines from Hugh Williams at EBay. In order to employ these methods, you need log data, which brings me to my next point.

- get your toes wet with a few areas: 1) linked data, and 2) semantic markup

- 1) linked data - all articles get categorized using a controlled vocabulary (strong IDs tied to all docs). BUT there is no context for what those IDs mean, e.g. that Barack Obama is the president of the United States, or that Kansas City is the capital... you need to link in external data to add new understanding (see the sketch after the examples below).

-- e.g. find all articles in A1, P1 that mention presidents of the United States

-- e.g. find all articles that occur near Park Slope, Brooklyn
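A toy sketch of that kind of linked-data join (the data is made up; only /m/02mjmr is a real Freebase MID, for Barack Obama):

```python
# Toy linked-data join: article annotations give entity ids; an external
# KB tells us which ids denote US presidents. Data here is illustrative.
ARTICLE_ENTITIES = {
    "article-1": {"/m/02mjmr"},        # Barack Obama
    "article-2": {"/m/0hypothetical"}, # some other entity
}
US_PRESIDENTS = {"/m/02mjmr"}          # ids typed as presidents in the KB

def articles_mentioning(entity_ids):
    return [doc for doc, ents in ARTICLE_ENTITIES.items() if ents & entity_ids]

print(articles_mentioning(US_PRESIDENTS))  # -> ['article-1']
```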

2) semantic markup (RDFa, microformats, rich snippets). They use the rNews vocabulary as part of schema.org.

Wlodek Zadrozny (IBM Watson)

- what are the open problems in QA

- Trying to detect relations that occur in the retrieved candidate passages (relevant to the question)

- Then it scores and ranks the candidate answers. Some of this uses RDF data. Confidences are important because wrong answers are penalized (see the sketch below).

- Is there an answer? (Google wins by giving people documents and presenting many possible answers)
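A quick sketch (my own, with illustrative numbers) of why calibrated confidence matters when wrong answers are penalized, Jeopardy!-style:

```python
# Expected value of answering vs. passing when a wrong answer costs
# as much as a right answer earns (Jeopardy!-style scoring).
def should_answer(confidence, value=1.0):
    expected = confidence * value - (1 - confidence) * value
    return expected > 0  # answer only when confidence > 0.5

print(should_answer(0.7))  # True
print(should_answer(0.4))  # False
```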

Evan - the real-time metadata is needed for the website. They use a rule-based information extraction system which suggests terms they might want to use as tags. Then the librarians review the producers' tags.

About Me

I work in Google Research on Knowledge Discovery. I received my PhD at the CIIR, UMass Amherst, see my research page. I was a contestant on Season 2 of MasterChef on Fox. For my culinary research read my cooking blog, Cooking PhD.
You can reach me at jeffdalton104-at-hotmail-dot-com or JeffD on Twitter.