Wednesday, August 08, 2007

Using Wikipedia to disambiguate names

Silviu Cucerzan at Microsoft Research recently published a paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data" (PDF), that is a great example of using the high-quality data available in Wikipedia to solve a difficult search problem, in this case, distinguishing between different meanings of the same name.

From the paper:

When ... scaling entity tracking to ... the Web, resolving semantic ambiguity becomes of central importance, as many surface forms turn out to be ambiguous.

For example, the surface form "Texas" is used to refer to more than twenty different named entities in Wikipedia.

In the context "former Texas quarterback James Street", Texas refers to the University of Texas at Austin; in the context "in 2000, Texas released a greatest hits album", Texas refers to the British pop band; in the context "Texas borders Oklahoma on the north", it refers to the U.S. state; while in the context "the characters in Texas include both real and fictional explorers", [it] ... refers to ... [a] novel.

Silviu cleverly uses the high-quality, semi-structured data available from Wikipedia for this task. In addition to pages describing different entities where contextual clues can be extracted (example), Wikipedia contains redirects for different surface forms of the same entity (example), list pages that categorize names (example), and disambiguation pages that show many of the different entities for a surface form (example).

Wikipedia contains much more than unstructured text. Exploiting the semi-structured data -- the redirect, list, and disambiguation pages -- gives this work its power.
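To make the idea concrete, here is a minimal sketch (my own illustration, not the paper's actual algorithm) of the core disambiguation step: given the words around an ambiguous mention, pick the candidate entity whose known context terms overlap most. The candidate lists here are hypothetical, standing in for what could be mined from a Wikipedia disambiguation page and each entity's article text.

```python
# A toy version of context-based entity disambiguation. The candidate
# term lists are invented for illustration; in the paper, this kind of
# data is extracted from Wikipedia pages, redirects, and lists.

def disambiguate(context_words, candidates):
    """Pick the candidate entity whose context terms overlap most
    with the words surrounding the ambiguous mention."""
    context = set(w.lower() for w in context_words)
    best, best_score = None, -1
    for entity, entity_terms in candidates.items():
        score = len(context & set(t.lower() for t in entity_terms))
        if score > best_score:
            best, best_score = entity, score
    return best

# Hypothetical context terms for three of the "Texas" entities.
candidates = {
    "Texas (U.S. state)": ["oklahoma", "borders", "austin", "state"],
    "Texas (band)": ["album", "released", "hits", "pop"],
    "University of Texas at Austin": ["quarterback", "football", "campus"],
}

print(disambiguate("former Texas quarterback James Street".split(), candidates))
# → University of Texas at Austin
```

The paper's actual system uses richer context vectors and category information, but the flavor is the same: the semi-structured Wikipedia data supplies the candidate entities and their contexts essentially for free.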

For a quick overview of one of the many ways this kind of named entity data could be applied, do not miss the screenshot in Figure 3 on page 6 of the paper. It shows a prototype that annotates a web page with pop-ups for the proper names to disambiguate their meaning.


As fun as this paper is, what really excites me is that this is one of many recent research projects that are cleverly using Wikipedia to attack challenging problems. There is little doubt that, deep in the Wikipedia pages, there is much buried treasure, if we can just figure out how to look.

4 comments:

I am surprised that people haven't looked at Wikipedia and collaborative filtering yet. Wikipedia offers millions of editor profiles. The sky is the limit on what one can do with these profiles: find related articles, find related users, recommend articles for editing to specific users, find related phrases, you name it.

Recommendation technologies can accelerate and improve content creation in Wikipedia. There will always be articles in need of editing, and by using the editor profiles we can match these articles to the most suitable editors.

There are so many articles in need of improvement that remain poor quality only because the right people to improve them don't know they are there. Wikipedia is a classic example of the Long Tail, and the Long Tail can be leveraged through recommendations and personalization.
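The commenter's idea can be sketched very simply (this is my own toy illustration of collaborative filtering over edit histories, not anything Wikipedia actually runs): recommend articles to an editor based on what editors with overlapping edit histories have worked on.

```python
# A toy collaborative-filtering sketch: score unseen articles by how much
# edit history the target editor shares with the editors who touched them.
from collections import Counter

def recommend(editor, edits):
    """edits maps editor name -> set of article titles they have edited."""
    mine = edits[editor]
    scores = Counter()
    for other, theirs in edits.items():
        if other == editor:
            continue
        overlap = len(mine & theirs)
        if overlap:
            for article in theirs - mine:
                scores[article] += overlap  # weight by shared history
    return [article for article, _ in scores.most_common()]

# Hypothetical edit histories.
edits = {
    "alice": {"Texas", "Oklahoma", "Austin"},
    "bob": {"Texas", "Oklahoma", "Red River"},
    "carol": {"Jazz", "Blues"},
}
print(recommend("alice", edits))
# → ['Red River']
```

A real system would need to handle millions of editors and articles, but the matching problem is the same item-to-item shape that e-commerce recommenders have already solved at scale.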

Good find. I was on the program committee for this working group at WWW2007 this year, and we got a lot of really interesting submissions in the same vein. See http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-249/ for the list of papers. The first entry on the list did similar mining of Wikipedia, and I think another of the ones that were selected did as well.

The interesting aspect of this is not the Wikipedia part, but the fact that for the last 5+ years human-edited content was considered not scalable for search engine purposes, hence the demise of Yahoo Directory and LookSmart. And now, through the proxy of Wikipedia, we are back to looking at how humans do a much better job at certain tasks.

PS: I think I actually worked with Silviu on a phrasal speller for MSN Search circa 2003.