Posted: October 15, 2008

Research Shows Natural Fit between Wikipedia and Semantic Web

A popular earlier entry on this AI3 blog was “99 Wikipedia Sources Aiding the Semantic Web”. Each academic paper or research article in that compilation used Wikipedia as the basis for semantic Web-related research. Many of you suggested additions to that listing. Thanks!

Wikipedia continues to be an effective and unique source for many information extraction and semantic Web purposes. Recently, I needed to update my own research and found that many valuable new papers have been added to the literature.

I thus decided to make a compilation of such papers a permanent feature, which I’ve named SWEETpedia, and to update it periodically. You can now find the most recent version under the permanent SWEETpedia page link.

Hint, hint: Check out this link to see the 163 Wikipedia research sources!

NOTE: If you know of a paper that I’ve overlooked, please suggest it as a comment to this posting and I will add it to the next update.

Status of Wikipedia

Meanwhile, a complementary technical report, Mining Meaning from Wikipedia[1], was just released by the University of Waikato in New Zealand. It is a fantastic resource for anyone in this field.

For starters, it summarizes the size and status of the English-version Wikipedia with a more discerning eye than usual:

Categories                                     390,000

Articles and related pages                   5,460,000
    redirects                                2,970,000
    disambiguation pages                       110,000
    lists and stubs                            620,000
    bona-fide articles                       1,760,000

Templates                                      174,000
    infoboxes                                    9,000
    other                                      165,000

Links
    between articles                        62,000,000
    between category and subcategory           740,000
    between category and article             7,270,000
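The figures above appear to nest: the page types listed under "Articles and related pages" add up to its total, and likewise for "Templates". The following is only a minimal Python sketch of that arithmetic. The numbers are copied from the table; the grouping into totals and parts, and the names `WIKIPEDIA_STATS`, `parts_match_total`, and `share`, are my own illustrative assumptions rather than anything taken from the report.

```python
# Figures copied from the table above (English-version Wikipedia).
# ASSUMPTION: the sub-rows are treated as a breakdown of the row above them,
# which is inferred from how the numbers add up, not stated in the report.
WIKIPEDIA_STATS = {
    "Articles and related pages": {
        "total": 5_460_000,
        "parts": {
            "redirects": 2_970_000,
            "disambiguation pages": 110_000,
            "lists and stubs": 620_000,
            "bona-fide articles": 1_760_000,
        },
    },
    "Templates": {
        "total": 174_000,
        "parts": {"infoboxes": 9_000, "other": 165_000},
    },
}


def parts_match_total(group: dict) -> bool:
    """Return True when the sub-rows of a group sum to its stated total."""
    return sum(group["parts"].values()) == group["total"]


def share(group: dict, part: str) -> float:
    """Return the fraction of a group's total contributed by one sub-row."""
    return group["parts"][part] / group["total"]


if __name__ == "__main__":
    for name, group in WIKIPEDIA_STATS.items():
        print(f"{name}: parts sum to total -> {parts_match_total(group)}")
    articles = WIKIPEDIA_STATS["Articles and related pages"]
    print(f"bona-fide share: {share(articles, 'bona-fide articles'):.1%}")
```

Read this way, only about a third of the 5.46 million article-space pages are bona-fide articles, while redirects alone account for more than half.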

The size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks. Further, the more than 250 language versions of Wikipedia also make it a great resource for multi-lingual and translation studies.

Growth of SWEETpedia

In the eight months since I posted the original listing of semantic Web-related research papers using Wikipedia, the listing, now named SWEETpedia, has grown by about 65%. There are now 63 new papers, bringing the total to 163.

Of course, these are not the only academic papers published about or using Wikipedia. The SWEETpedia listing is specifically limited to structure, term, or semantic extractions from Wikipedia. Other research, on topics such as update frequency, collaboration, growth, or comparisons with standard encyclopedias, may be found under Wikipedia’s own listing of academic studies.

This graph indicates the growth in the use of Wikipedia as a source for semantic Web research. It is hard to tell whether the effort is plateauing; it is too early to draw that conclusion from the apparent slight dip in 2008.

For example, the current SWEETpedia listing adds another 35% to the 2007 listings in the earlier records. It is likely that many 2008 papers will similarly not be discovered until later in 2009. Many of the venues at which these papers get presented can be somewhat obscure, and new researchers keep entering the field.

However, we can conclude that Wikipedia is assuming a role in semantic Web and natural language research never before seen for other frameworks.

Kinds of Semantic Web-related Research

As noted, the new 82-page technical report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia[1], is now the must-have reference for all things related to the use of Wikipedia for semantic Web and natural language research.

Olena and her co-authors, Catherine Legg, David Milne and Ian Witten, have each published extensively in this field and were among the earliest researchers to tap into the wealth of Wikipedia.

They first note the many uses to which Wikipedia is now being put:

Wikipedia as an encyclopedia: the standard use familiar to the general public

Wikipedia as a corpus: large text collections for testing and modeling NLP tasks

Wikipedia as a thesaurus: equivalent and hierarchical relationships between terms and related or synonymous terms

Wikipedia as a database: the extraction and codification of structure and structural relationships

Wikipedia as an ontology: the formal expression of relationships in semantic Web and logical constructs, and

Wikipedia as a network structure: relationship analysis and mining through Wikipedia’s representation as a network graph.

These types of uses then enable the authors to place various research efforts and papers into context. They do so via four major clusters of relevant tasks related to language processing and the semantic Web:

There are many interesting observations throughout this report. There are also useful links to related tools, supporting and annotated datasets, and key researchers in the field.

I highly recommend this report as the essential starting point for anyone first getting into these research topics. Many of the newly added references in the SWEETpedia listing arose from this report. Reading it provides useful grounding for knowing where to look for specific papers in a given task area.

Though clearly the authors have their own perspectives and research emphases, they do an admirable job of being complete and even-handed in their coverage. Basic review reports such as this play an important role in helping to focus new research and make it productive.

Author: Mike Bergman
