Article Structure

Abstract

We present WiBi, an approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories.

Introduction

Knowledge has unquestionably become a key component of current intelligent systems in many fields of Artificial Intelligence.

WiBi: A Wikipedia Bitaxonomy

We induce a Wikipedia bitaxonomy, i.e., a taxonomy of pages and categories, in 3 phases:

Phase 1: Inducing the Page Taxonomy

The goal of the first phase is to induce a taxonomy of Wikipedia pages.

Phase 2: Inducing the Bitaxonomy

The page taxonomy built in Section 3 will serve as a stable, pivotal input to the second phase, the aim of which is to build our bitaxonomy, that is, a taxonomy of pages and categories.

Phase 3: Category taxonomy refinement

As the final phase, we refine and enrich the category taxonomy.

Related Work

Although the extraction of taxonomies from machine-readable dictionaries was already being studied in the early 1970s (Calzolari et al., 1973), pioneering work on large amounts of data only appeared in the 1990s (Hearst, 1992; Ide and Veronis, 1993).

Comparative Evaluation

7.1 Experimental Setup

Conclusions

In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed-point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements.

Topics

hypernym

Appears in 79 sentences as: Hypernym (3) hypernym (58) hypernyms (37)

In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project

However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.

Page 1, “Introduction”

Creation of the initial page taxonomy: we first create a taxonomy for the Wikipedia pages by parsing textual definitions, extracting the hypernym(s) and disambiguating them according to the page inventory.

Page 2, “WiBi: A Wikipedia Bitaxonomy”

At each iteration, the links in the page taxonomy are used to identify category hypernyms and, conversely, the new category hypernyms are used to identify more page hypernyms.

Page 2, “WiBi: A Wikipedia Bitaxonomy”

For each p ∈ P our aim is to identify the most suitable generalization p_h ∈ P so that we can create the edge (p, p_h) and add it to E. For instance, given the page APPLE, which represents the fruit meaning of apple, we want to determine that its hypernym is FRUIT and add the hypernym edge connecting the two pages (i.e., E := E ∪ {(APPLE, FRUIT)}).
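The edge update above can be sketched as a small directed-graph operation (a toy illustration: only the sets P and E and the example edge come from the snippet; the helper name is hypothetical):

```python
# Toy sketch of the page taxonomy as a directed graph: P is the set of
# Wikipedia pages, E the set of hypernym edges (p, p_h).
P = {"APPLE", "FRUIT", "FOOD"}
E = set()

def add_hypernym_edge(page, hypernym, pages, edges):
    """Add the edge (page, hypernym) when both endpoints are known pages."""
    if page in pages and hypernym in pages:
        edges.add((page, hypernym))

# E := E ∪ {(APPLE, FRUIT)}, as in the example above
add_hypernym_edge("APPLE", "FRUIT", P, E)
```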

Page 2, “Phase 1: Inducing the Page Taxonomy”

3.1 Syntactic step: hypernym extraction

Page 2, “Phase 1: Inducing the Page Taxonomy”

In the syntactic step, for each page p ∈ P, we extract zero, one or more hypernym lemmas, that is, we output potentially ambiguous hypernyms for the page.

Page 2, “Phase 1: Inducing the Page Taxonomy”

The first assumption, which follows the Wikipedia guidelines and is validated in the literature (Navigli and Velardi, 2010; Navigli and Ponzetto, 2012), is that the first sentence of each Wikipedia page p provides a textual definition for the concept represented by p. The second assumption we build upon is the idea that a lexical taxonomy can be obtained by extracting hypernyms from textual definitions.

Page 2, “Phase 1: Inducing the Page Taxonomy”

To extract hypernym lemmas, we draw on the notion of copula, that is, the relation between the complement of a copular verb and the copular verb itself.

Page 2, “Phase 1: Inducing the Page Taxonomy”

The noun involved in the copula relation is actress and thus it is taken as the page’s hypernym lemma.

To cope with this problem we use a list of stop-words. When such a term is extracted as a hypernym, we replace it with the rightmost noun of the first following noun sequence (e.g., deity in the above example).
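Putting the copula step and the stop-word fallback together, a pure-Python sketch might look as follows (the token and dependency encoding, the stop-word list, and the function name are all invented for illustration; the paper relies on a real dependency parser):

```python
# Hypothetical stop-words too generic to serve as hypernyms.
STOP_HYPERNYMS = {"one", "kind", "type", "sort", "name"}

def extract_hypernym_lemma(tokens, deps):
    """tokens: list of (index, lemma, pos); deps: list of (head, rel, dep).

    Returns the complement of the copular verb; for a stop-word, falls
    back to the rightmost noun of the first following noun sequence.
    """
    # The copula relation links the copular verb (dep) to its complement (head).
    comp = next((head for head, rel, dep in deps if rel == "cop"), None)
    if comp is None:
        return None
    lemma = tokens[comp][1]
    if lemma not in STOP_HYPERNYMS:
        return lemma
    # Stop-word fallback, e.g. "one of the deities" -> "deity".
    run = []
    for index, lem, pos in tokens:
        if index <= comp:
            continue
        if pos == "NOUN":
            run.append(lem)
        elif run:
            break  # first noun sequence has ended
    return run[-1] if run else None
```

For a definition such as "Hera is one of the deities", the copular complement one is a stop-word, so the fallback yields deity.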

WordNet

Appears in 13 sentences as: WordNet (15) WordNet’s (1)

In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project

However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.

Page 1, “Introduction”

(2005) provide a general vector-based method which, however, is incapable of linking pages which do not have a WordNet counterpart.

Page 1, “Introduction”

Higher coverage is provided by de Melo and Weikum (2010) thanks to the use of a set of effective heuristics, however, the approach also draws on WordNet and sense frequency information.

Page 1, “Introduction”

In this paper we address the task of taxonomizing Wikipedia in a way that is fully independent of other existing resources such as WordNet.

Page 1, “Introduction”

However, these methods do not link terms to existing knowledge resources such as WordNet, whereas those that explicitly link do so by adding new leaves to the existing taxonomy instead of acquiring wide-coverage taxonomies from scratch (Pantel and Ravichandran, 2004; Snow et al., 2006).

However, the categories are linked to the first, i.e., most frequent, sense of the category head in WordNet, involving only leaf categories in the linking.

Page 7, “Related Work”

Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia’s flat entity-focused hierarchy.2

Page 7, “Related Work”

Since WordNet’s average height is 8.07 we deem WiBi to be the resource structurally closest to WordNet.

Page 7, “Related Work”

As regards recall, we note that in two cases (i.e., DBpedia returning page super-types from its upper taxonomy, YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories.

Page 8, “Comparative Evaluation”

MENTA seems to be the closest resource to ours, however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages.

iteratively

In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project

Finally, to capture multiple hypernyms, we iteratively follow the conj_and and conj_or relations starting from the initially extracted hypernym.
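The conjunction-following step can be sketched as a simple graph traversal (only the relation names conj_and and conj_or come from the snippet; the dict-based dependency encoding and function name are invented for illustration):

```python
# Sketch of collecting coordinated hypernyms by following conj_and /
# conj_or links transitively from the initially extracted hypernym.
def collect_hypernyms(first, conj_edges):
    """conj_edges maps a word to the words it is coordinated with."""
    hypernyms, frontier, seen = [], [first], set()
    while frontier:
        h = frontier.pop()
        if h in seen:
            continue
        seen.add(h)
        hypernyms.append(h)
        # Visit every word linked to h via conj_and / conj_or.
        frontier.extend(conj_edges.get(h, []))
    return hypernyms
```

For a definition like "... is a singer, songwriter and actress", chained conjunction links from singer recover all three hypernym lemmas.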

Page 2, “Phase 1: Inducing the Page Taxonomy”

In the following we describe the core algorithm of our approach, which iteratively and mutually populates and refines the edge sets E(Tp) and E (To).
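At a high level, the mutual fixed-point loop can be sketched as follows (the two linker callbacks stand in for the paper's actual heuristics and are purely illustrative):

```python
# Fixed-point sketch: keep transferring edges between the page taxonomy
# (page_edges) and the category taxonomy (cat_edges) until neither side
# gains a new edge.
def induce_bitaxonomy(page_edges, cat_edges, page_linker, cat_linker):
    changed = True
    while changed:
        changed = False
        # Page edges suggest category hypernyms...
        for edge in cat_linker(page_edges, cat_edges):
            if edge not in cat_edges:
                cat_edges.add(edge)
                changed = True
        # ...and the new category edges suggest more page hypernyms.
        for edge in page_linker(page_edges, cat_edges):
            if edge not in page_edges:
                page_edges.add(edge)
                changed = True
    return page_edges, cat_edges
```

The loop terminates because each iteration either adds at least one edge to a finite edge set or leaves both sets unchanged.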

Page 4, “Phase 2: Inducing the Bitaxonomy”

Figure 4b shows the performance trend as the algorithm iteratively covers more and more categories.

Page 6, “Phase 3: Category taxonomy refinement”

Our work differs from the others in at least three respects: first, in marked contrast to most other resources, but similarly to WikiNet and WikiTaxonomy, our resource is self-contained and does not depend on other resources such as WordNet; second, we address the taxonomization task on both sides, i.e., pages and categories, by providing an algorithm which mutually and iteratively transfers knowledge from one side of the bitaxonomy to the other; third, we provide a wide coverage bitaxonomy closer in structure and granularity to a manual WordNet-like taxonomy, in contrast, for example, to DBpedia’s flat entity-focused hierarchy.2

Page 7, “Related Work”

In this paper we have presented WiBi, an automatic 3-phase approach to the construction of a bitaxonomy for the English Wikipedia, i.e., a full-fledged, integrated page and category taxonomy: first, using a set of high-precision linkers, the page taxonomy is populated; next, a fixed-point algorithm populates the category taxonomy while enriching the page taxonomy iteratively; finally, the category taxonomy undergoes structural refinements.

It was established by selecting the combination, among all possible permutations, which maximized precision on a tuning set of 100 randomly sampled pages, disjoint from our page dataset.
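The tuning step amounts to an exhaustive search over linker orderings (a minimal sketch; the linkers, the tuning data and the precision function are all placeholders, not the paper's implementation):

```python
from itertools import permutations

# Try every ordering of the linkers and keep the one with the highest
# precision on the held-out tuning set.
def best_linker_order(linkers, tuning_set, precision):
    return max(permutations(linkers),
               key=lambda order: precision(order, tuning_set))
```

With a handful of linkers the number of permutations stays small, so the exhaustive search is cheap.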

Page 4, “Phase 1: Inducing the Page Taxonomy”

Category taxonomy quality To estimate the quality of the category taxonomy, we randomly sampled 1,000 categories and, for each of them, we manually associated the super-categories which were deemed to be appropriate hypernyms.
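The resulting quality estimate reduces to straightforward precision bookkeeping against the manual annotations (a toy sketch; the category names and function name are invented):

```python
# Precision of predicted super-categories against gold annotations:
# predicted and gold map each sampled category to a set of super-categories.
def category_precision(predicted, gold):
    correct = total = 0
    for cat, supers in predicted.items():
        total += len(supers)
        correct += len(supers & gold.get(cat, set()))
    return correct / total if total else 0.0
```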

synsets

Appears in 3 sentences as: synset (1) synsets (3)

In Two Is Bigger (and Better) Than One: the Wikipedia Bitaxonomy Project

However, unlike the case with smaller manually-curated resources such as WordNet (Fellbaum, 1998), in many large automatically-created resources the taxonomical information is either missing, mixed across resources, e.g., linking Wikipedia categories to WordNet synsets as in YAGO, or coarse-grained, as in DBpedia whose hypernyms link to a small upper taxonomy.

Page 1, “Introduction”

As regards recall, we note that in two cases (i.e., DBpedia returning page super-types from its upper taxonomy, YAGO linking categories to WordNet synsets) the generalizations are neither pages nor categories and that MENTA returns heterogeneous hypernyms as mixed sets of WordNet synsets, Wikipedia pages and categories.

Page 8, “Comparative Evaluation”

MENTA seems to be the closest resource to ours, however, we remark that the hypernyms output by MENTA are very heterogeneous: 48% of answers are represented by a WordNet synset, 37% by Wikipedia categories and 15% are Wikipedia pages.