Streaming Fact Extraction for Wikipedia Entities at Web-Scale

Wikipedia.org (WP) is the largest and most popular general reference work on the Internet. Presently, there is a considerable time lag between the publication of an event and its citation in WP. The median time lag for a sample of about 60K web pages cited by WP articles in the living people category is over a year (Frank et al., 2013). Moreover, discovering facts across the Internet that are relevant to, and cite-worthy for, a given WP entity is quite challenging.

Consider an example sentence: “Boris Berezovsky, who made his fortune in Russia in the 1990s, passed away March 2013.” We observe that there are two persons named Boris Berezovsky in Wikipedia: one a businessman and the other a pianist. Any extraction needs to determine which of them the sentence refers to (a task known as entity resolution). Then, we match the sentence against a list of topics and find a match for the topic DateOfDeath, whose value in the sentence is March 2013.

Table 1: The set of possible slot names for each entity type.

In this work, we introduce an efficient system that extracts facts for a given set of WP entities. Fact extraction is the task of matching each sentence to the {subject, verb, adverbial/complement} sentence structure. The subject represents the WP entity, the verb is the relation type (slot) as in Table 1, and the adverbial/complement represents the value of the associated slot. In our example, the entity is Boris Berezovsky and the slot we extract is DateOfDeath with a slot value of March 2013. The resulting extraction, consisting of an entity, a slot name and a slot value, is a fact.

Our system is built with a pipeline-style architecture, depicted in Figure 1. Its three logical components are described in the sections entitled Model, for entity resolution; Wikipedia Citation, to annotate cite-worthy documents; and Slot Filling, to generate the actual slot values.

Model. Using regular expressions, we extract the bold phrases in the initial paragraph of the WP entity page as aliases. We then generate alternative written forms of each alias (e.g. ‘Boris Berezovsky’ can also be written as ‘Berezovsky, Boris’). Next, we iterate over the documents in the stream and filter out all documents that do not explicitly contain a string matching one of the entity names or aliases.
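The alias step can be illustrated with a minimal Python sketch. It is not the authors' Java implementation. It assumes that bold phrases appear as triple-apostrophe wiki markup and that the only alternative form generated is 'Last, First'; the function names `name_variants`, `extract_aliases` and `mentions_entity` are hypothetical.

```python
import re

def name_variants(alias):
    """Return the alias plus a 'Last, First' form for multi-token names."""
    tokens = alias.split()
    variants = {alias}
    if len(tokens) >= 2:
        # 'Boris Berezovsky' -> 'Berezovsky, Boris'
        variants.add(f"{tokens[-1]}, {' '.join(tokens[:-1])}")
    return variants

def extract_aliases(first_paragraph):
    """Collect bold phrases (assumed to be '''...''' markup) and their variants."""
    aliases = set()
    for phrase in re.findall(r"'''(.+?)'''", first_paragraph):
        aliases |= name_variants(phrase.strip())
    return aliases

def mentions_entity(document_text, aliases):
    """True if the document contains any alias as an exact substring."""
    return any(alias in document_text for alias in aliases)

# Illustrative input only; not taken from the real WP page.
paragraph = "'''Boris Berezovsky''' was a businessman."
aliases = extract_aliases(paragraph)
stream = ["Berezovsky, Boris died in 2013.", "An unrelated article."]
kept = [doc for doc in stream if mentions_entity(doc, aliases)]
```

Exact substring matching keeps the filter cheap, at the cost of missing spellings that are not in the alias list.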

Wikipedia Citation. The corpus comes in the form of chunk files, each of which contains thousands of documents. It is processed by a two-layer filter system consisting of a Document Chunk Filter and a Document Filter. The purpose of these filters is to reduce I/O cost while generating slot values for the various entities. The Document Chunk Filter removes the chunk files that do not contain a mention of any of the desired entities, and the Document Filter removes the documents that do not contain a mention of an entity.
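A rough Python sketch of the two-layer filter follows. It is an illustration under simplifying assumptions, not the system's actual code: each chunk is modeled as an in-memory list of document strings (the real chunk file format is not described here), mentions are detected by exact substring match, and all function names are hypothetical.

```python
def contains_any(text, aliases):
    """True if any alias occurs in the text as an exact substring."""
    return any(alias in text for alias in aliases)

def chunk_filter(chunks, aliases):
    """Document Chunk Filter: keep only chunks that mention some entity."""
    for chunk in chunks:
        if any(contains_any(doc, aliases) for doc in chunk):
            yield chunk

def document_filter(chunk, aliases):
    """Document Filter: keep only documents that mention some entity."""
    for doc in chunk:
        if contains_any(doc, aliases):
            yield doc

def filter_corpus(chunks, aliases):
    """Apply the chunk-level filter first, then the document-level filter."""
    for chunk in chunk_filter(chunks, aliases):
        yield from document_filter(chunk, aliases)

aliases = {"Boris Berezovsky", "Berezovsky, Boris"}
chunks = [
    ["An unrelated article.", "Another unrelated article."],
    ["Boris Berezovsky passed away March 2013.", "Unrelated text."],
]
surviving = list(filter_corpus(chunks, aliases))
```

Because both layers are generators, chunks are consumed lazily, so a chunk that fails the first filter is never passed to later stages.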

Slot Filling. We extract fact values from sentences according to a list of patterns. We define a slot value extraction pattern as a tuple of five values ⟨p1, p2, p3, p4, p5⟩, where p1 represents the type of entity from the set {FACILITY, ORGANIZATION, PERSON}, p2 represents a slot name from Table 1, and p3 is the pattern content, a string found in the sentence that identifies a slot name. The pattern evaluator uses the direction (left or right) given in p4 to explore the sentence. The final element p5 represents the type of the slot value. An example pattern would therefore be ⟨PER, DateOfDeath, passed away, right, NP⟩.
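The following Python sketch shows one way such a five-element pattern might be evaluated. It is an assumption-laden illustration rather than the system's evaluator: candidate phrases are supplied as pre-tagged spans (the hypothetical `Span` type stands in for the output of whatever chunker or tagger is used), and the evaluator simply picks the nearest span of type p5 on the side given by p4.

```python
from collections import namedtuple

# Fields mirror <p1, p2, p3, p4, p5> from the text.
Pattern = namedtuple("Pattern", "entity_type slot_name content direction value_type")
# Hypothetical pre-tagged phrase: its text, its tag, and its start offset.
Span = namedtuple("Span", "text tag start")

def apply_pattern(pattern, sentence, spans):
    """Return the nearest span of the wanted type on the wanted side, or None."""
    pos = sentence.find(pattern.content)
    if pos < 0:
        return None  # pattern content does not occur in this sentence
    end = pos + len(pattern.content)
    typed = [s for s in spans if s.tag == pattern.value_type]
    if pattern.direction == "right":
        candidates = [s for s in typed if s.start >= end]
        return min(candidates, key=lambda s: s.start, default=None)
    candidates = [s for s in typed if s.start + len(s.text) <= pos]
    return max(candidates, key=lambda s: s.start, default=None)

sentence = ("Boris Berezovsky, who made his fortune in Russia in the 1990s, "
            "passed away March 2013.")
spans = [
    Span("Boris Berezovsky", "NP", sentence.find("Boris Berezovsky")),
    Span("March 2013", "NP", sentence.find("March 2013")),
]
pattern = Pattern("PER", "DateOfDeath", "passed away", "right", "NP")
match = apply_pattern(pattern, sentence, spans)
fact = ("Boris Berezovsky", pattern.slot_name, match.text)
```

Here `fact` pairs the entity with the slot name and the extracted value, matching the entity, slot name, slot value form of a fact described earlier.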

Inference and constraints. The output contains many duplicate entries. Duplicates can be present within a window of rows; we use a window size of two, meaning we only compare adjacent rows (of extractions). Two rows are duplicates if they contain exactly the same extraction, if they have the same slot name and a similar slot value, or if the extractions for a particular slot type come from the same sentence. New slots can be deduced from existing slots by defining inference rules. For example, two slots for the task are “FounderOf” and “FoundedBy”. A safe assumption is that these slot names are inverses of each other: one holds between an entity and a slot value exactly when the other holds with the roles reversed. Therefore, we can express the rule “X FounderOf Y” ↔ “Y FoundedBy X”, where X and Y are single unique entities. Additionally, we found that the slot name “Contact Meet PlaceTime” could be inferred as “Contact Meet Entity” if the entity was a FAC and the extracted sentence contained an additional ORG/FAC tag. Further rules can be defined in the same way.
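As a hedged illustration, the adjacent-row deduplication and the FounderOf/FoundedBy rule could be sketched in Python as below. The text does not define "similar", so a string-similarity ratio with a threshold of 0.8 is assumed here; the `Row` layout, the placeholder entity names and the function names are likewise hypothetical.

```python
from collections import namedtuple
from difflib import SequenceMatcher

# Hypothetical row layout: one extraction plus the id of its source sentence.
Row = namedtuple("Row", "entity slot value sentence")

def similar(a, b, threshold=0.8):
    """Assumed similarity test: case-insensitive ratio above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def is_duplicate(prev, cur):
    """Apply the three duplicate criteria to two adjacent rows."""
    if prev == cur:
        return True  # exactly the same extraction
    if prev.slot != cur.slot:
        return False
    # Same slot name: similar value, or both taken from the same sentence.
    return similar(prev.value, cur.value) or prev.sentence == cur.sentence

def dedupe_adjacent(rows):
    """Window of two: each row is compared only with the row just before it."""
    kept, prev = [], None
    for row in rows:
        if prev is None or not is_duplicate(prev, row):
            kept.append(row)
        prev = row
    return kept

INVERSE = {"FounderOf": "FoundedBy", "FoundedBy": "FounderOf"}

def infer_inverse(rows):
    """'X FounderOf Y' <-> 'Y FoundedBy X': swap entity and value."""
    return [Row(r.value, INVERSE[r.slot], r.entity, r.sentence)
            for r in rows if r.slot in INVERSE]

rows = [
    Row("X", "FounderOf", "Y Corp", 1),
    Row("X", "FounderOf", "Y Corp.", 2),   # near-identical value, adjacent row
    Row("X", "DateOfDeath", "March 2013", 3),
]
deduped = dedupe_adjacent(rows)
inferred = infer_inverse(deduped)
```

Comparing only adjacent rows keeps deduplication to a single pass, but it relies on duplicates appearing next to each other in the output.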

Our system was developed in Java, and we tested our techniques on the large-scale KBA corpus. A more detailed write-up of this system was published at the 2014 FLAIRS Conference. Our paper will be released shortly.