Posted: February 18, 2008

99 Wikipedia Sources Aiding the Semantic Web

Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.

Our work on the UMBEL lightweight reference subject concept structure has stated since the project’s announcement in July 2007 that Wikipedia is a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe for every drop of research I could find on the use of Wikipedia for the semantic Web, information extraction, categorization and related issues.

Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)

Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below are on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings, though with lesser degrees of completeness.

The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for gory details). As of early 2008, there were about 2.25 million articles in English, and versions in 256 languages and variants.

Download access to the full knowledge base has enabled the development of notable core references to the Linked Data aspects of the semantic Web such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So, one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.

But much broader data mining, text mining and analysis is also being conducted against Wikipedia, and that work is currently defining the state of the art in these areas, too:

Ontology development and categorization

Word sense disambiguation

Named entity recognition

Named entity disambiguation

Semantic relatedness and relations.

These objectives, in turn, rely on mining and extracting various kinds of structure within Wikipedia, each for its own purposes:

Subject — Category suggestion (phrase marked in bold or in first paragraph)

Section heading — Category suggestions

Article links
    Context — Related terms; co-occurrences
    Label — Synonyms; spelling variations; related terms
    Target — Link graph; related terms
    LinksTo — Category suggestion
    LinkedBy — Category suggestion

Categories
    Category — Category suggestion
    Contained articles — Semantically related terms (siblings)
    Hierarchy — Hyponymic and meronymic relations between terms

Disambiguation pages
    Article links — Sense inventory

Infobox Templates
    Name —
    Item — Category suggestion; entity suggestion

Lists
    Hyponyms

These are some of the specific uses that are included in the 99 resources listed below.
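
To make a few of these structures concrete, here is a minimal Python sketch of pulling the bolded subject, the link targets and labels, and the category tags out of raw article markup. It is only an illustration under stated assumptions: the sample wikitext is made up, the function names (extract_structure, label_variants) are hypothetical, and the regular expressions ignore templates, nested links and the other wikitext complications that the systems in the listed papers would have to handle.

```python
import re
from collections import defaultdict

# Hypothetical wikitext snippet, made up purely for illustration.
SAMPLE_WIKITEXT = """'''Leipzig''' is a city in [[Saxony]], [[Germany|the Federal Republic of Germany]].
It hosts the [[Leipzig University|University of Leipzig]].
[[Category:Cities in Saxony]]
[[Category:University towns in Germany]]
"""

# [[Target]] or [[Target|Label]]; nested brackets are deliberately not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# '''bold phrase''', used here as a stand-in for the article subject.
BOLD_RE = re.compile(r"'''(.+?)'''")


def extract_structure(wikitext):
    """Return the subject, (target, label) link pairs and category names."""
    subject_match = BOLD_RE.search(wikitext)
    subject = subject_match.group(1) if subject_match else None

    links, categories = [], []
    for target, label in LINK_RE.findall(wikitext):
        target = target.strip()
        if target.startswith("Category:"):
            # Category tags become category suggestions.
            categories.append(target[len("Category:"):])
        else:
            # An unpiped link uses its target as the label.
            links.append((target, label.strip() or target))
    return {"subject": subject, "links": links, "categories": categories}


def label_variants(links):
    """Group differing link labels by target as candidate synonyms."""
    variants = defaultdict(set)
    for target, label in links:
        if label != target:
            variants[target].add(label)
    return dict(variants)


if __name__ == "__main__":
    structure = extract_structure(SAMPLE_WIKITEXT)
    print(structure["subject"])
    print(structure["categories"])
    print(label_variants(structure["links"]))
```

On the sample text, the labels grouped under Germany and Leipzig University are the kind of synonym and spelling-variation candidates noted above, while the (target, label) pairs are the raw material for a link graph.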

This is an exciting (and, for almost all of us just a few years back, unanticipated) use of the Web in socially relevant and contextual knowledge and research. I’m sure such a listing one year from now will be double in size or larger!

BTW, suggestions for new or overlooked entries are very much welcomed! 🙂

Sören Auer and Jens Lehmann, 2007. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content, in The Semantic Web: Research and Applications, pages 503-517, 2007. See http://www.eswc2007.org/pdf/eswc07-auer.pdf.

Somnath Banerjee, Krishnan Ramanathan and Ajay Gupta, 2007. Clustering Short Texts using Wikipedia, poster in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 787-788.

Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2007. Automatising the Learning of Lexical Patterns: an Application to the Enrichment of WordNet by Extracting Semantic Relationships from Wikipedia. See http://nets.ii.uam.es/publications/nlp/dke07.pdf.

Ryuichiro Higashinaka, Kohji Dohsaka and Hideki Isozaki, 2007. Learning to Rank Definitions to Generate Quizzes for Interactive Information Presentation, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. See pages 117-120 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf.

Simone Paolo Ponzetto and Michael Strube, 2007c. An API for Measuring the Relatedness of Words in Wikipedia, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. See pages 49-52 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf.

Anne-Marie Vercoustre, Jovan Pehcevski and James A. Thom, 2007. Using Wikipedia Categories and Links in Entity Ranking, in Pre-proceedings of the Sixth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2007), Dec 17, 2007. See http://hal.inria.fr/docs/00/19/24/89/PDF/inex07.pdf.

Author: Mike Bergman

12 thoughts on “99 Wikipedia Sources Aiding the Semantic Web”

Hi, I would like to suggest one of our projects to make it a three digit number!

Bests,

Sebastian

Sebastian Blohm, Philipp Cimiano, Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction, in Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 18-29. Springer, Warsaw, Poland, September 2007.

I’m one of the authors. What you have done is quite helpful work for other Wikipedia researchers, including me. Thank you very much.

By the way, our system named “Wikipedia Thesaurus,” a huge scale association thesaurus constructed by mining Wikipedia, is available on the WWW to prove the capability of Wikipedia Mining. (The method is described in paper #57).

Thank you for this great bibliography! I have collected quite a few papers about Wikipedia myself (here), for my master’s thesis (see here). Have a look if you like – there’s quite some overlap, and the top-most 20 are the ones I imported from your list just now, so scroll down a bit.

By the way: it would be nice to have your list in a machine-readable form, like BibTeX, so entries can easily be transferred into places like CiteULike or BibSonomy.

So you are brightbyte; I have seen the reference many times before. 🙂 Thank you for the compliment.

I agree with your idea about machine-readable form. In fact, I have been playing around at times with things like Zotero, Connotea and the ones you mention. What I am really looking for is an easy online place to put the thousands of references I track (Wikipedia is only one example), but one that also gives me easy re-formatting, etc. Initial finds I keep in an internal wiki, but structured entry is not the easiest.

I probably already have encountered the solution, but have not taken it the last step. Any ideas?

First, let me clarify that my thesis is work in progress. More work than progress, lately… I probably shouldn’t be posting on the web right now 🙂

Anyway, I have myself been frustrated quite a bit about the state of the art of online bibliography systems. After ranting about it, some interesting discussion developed, involving, among others, Jakob Voss (who is not only one of the authors on your list, but also a professional bibliographer, involved with Wikipedia Deutschland and, it seems, with Zotero). Have a look and join in (account available on request but not needed for comments).

We have parsed the English Wikipedia into a well-structured XML document (21 Gb in size) and loaded it into the Sedna XML database (http://modis.ispras.ru/sedna). With the WikiXMLDB demo you can run predefined queries or your own XQuery queries via a Web interface.