Wiktionary RDF extraction

Currently available languages: English, German, French, Russian
In the works: Greek, Vietnamese
Need data from other languages? Help us create wrappers for other language editions (if you know regex, XML and Wiktionary, an initial wrapper can be created in less than one day).

Intro

Wiktionary, the free dictionary, is another project of the Wikimedia Foundation from which DBpedia extracts structured RDF data. Just like Wikipedia, Wiktionary comes in many languages, such as the English Wiktionary (http://en.wiktionary.org) and the German Wiktionary (http://de.wiktionary.org). However, each of these independent sites contains entries in many languages. For the French word deux, there is one entry in the English Wiktionary (http://en.wiktionary.org/wiki/deux) and another entry in the German Wiktionary (http://de.wiktionary.org/wiki/deux). For a word such as in, a single wiki page in the English Wiktionary (http://en.wiktionary.org/wiki/in) contains sections for 24 different languages that use this word. (This is quite different from how Wikipedia handles disambiguation pages.) The exact structure of the entries differs between the language editions of Wiktionary, and slightly also between language entries within each site, so that Danish and Dutch entries in the English Wiktionary may use different kinds of wiki templates.

We aim to provide an open-source framework (based on DBpedia) to extract semantic lexical resources (an ontology about language use) from Wiktionary. The data currently includes language, part of speech, senses, definitions, synonyms, taxonomies (hyponyms, hyperonyms, synonyms, antonyms) and translations for each lexical word. The main focus is on flexibility (towards the loose schema) and configurability (towards differing language editions of Wiktionary). The configuration is done in an XML file which encodes language mappings and parse templates containing placeholders. The goal is to allow the addition of languages by configuration alone, without the need for programming skills and without altering the Scala source code. This enables quick and cheap adoption or adaptation to new usage scenarios. Due to its semantic richness, the extracted data can be automatically transformed into the Lemon model or into simpler domain-specific formats – for example a CSV representation of the translations, which can be loaded into a relational database. By offering a Linked Data service, we hope to extend DBpedia's central role in the LOD infrastructure to the world of Open Linguistics. The RDF dump currently contains 100 million triples.

'An animal, member of the genus `Canis` (probably descended from the common wolf) that has been domesticated for thousands of years; occurs in many breeds. Scientific name: `Canis lupus familiaris`.'@en

wr:dog-English-Noun-1  rdf:type  wt:Sense
wr:dog-English-Noun-1  rdf:type  lemon:LexicalSense

Notice the nested schema (lexical entry -> language usage -> PoS usage -> senses -> properties). This hierarchy represents the sections of the Wiktionary article. No normalization or transformation takes place. Different language editions of Wiktionary may use different schemas, and you could call that inconsistent. But we believe it is in fact better to keep the data "as is" – consolidation efforts should be undertaken on the Wiktionary side.
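As an illustrative sketch of that nesting in Turtle (the property names hasLangUsage, hasPoSUsage and hasSense are invented here; only the resource URI pattern follows the example above):

```turtle
# Illustrative only: the wt: property names are hypothetical,
# the URI pattern follows wr:dog-English-Noun-1 above.
wr:dog                 wt:hasLangUsage  wr:dog-English .
wr:dog-English         wt:hasPoSUsage   wr:dog-English-Noun .
wr:dog-English-Noun    wt:hasSense      wr:dog-English-Noun-1 .
wr:dog-English-Noun-1  rdf:type         wt:Sense .
```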

Usage Scenarios

Reference, annotation – annotate corpora with unique identifiers; you then get all information from Wiktionary via Linked Data.

Disambiguation – for a given lexical word (a sequence of characters) one can look up its possible usages across languages and its possible meanings (each meaning should have a definition and an example sentence). The definition can help to determine which meaning the author intended (by comparing the context to the possible definitions, etc.).

Synset reduction – for a given word one can look up synonyms and replace the word by a deterministically chosen representative of its synset.

Translation – the provided translations are defined on meanings, which gives you context-aware translation.
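The disambiguation scenario can be sketched with a toy overlap heuristic (a simplified Lesk-style comparison; the sense identifiers and definitions below are made up for illustration, not real extractor output):

```python
# Pick the sense whose definition shares the most words with the context.
def disambiguate(context, senses):
    ctx = set(context.lower().split())
    return max(senses, key=lambda s: len(ctx & set(senses[s].lower().split())))

# Hypothetical sense data in the style of the identifiers shown above.
senses = {
    "dog-English-Noun-1": "an animal domesticated for thousands of years",
    "dog-English-Noun-2": "to follow a person closely and persistently",
}
print(disambiguate("the animal was domesticated", senses))  # → dog-English-Noun-1
```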

Wiktionary2RDF – Live

We will soon provide a live version of the ontology that reflects changes to the wiki within seconds. This should encourage users of the RDF data to contribute to Wiktionary: if you are unhappy with the data quality of the automatically generated ontology, you simply edit the wiki and improve the guideline compliance there, and you get your high-quality semantic data right away.

Contributor – add new Wiktionary

The main goal of this approach is to be extensible to new Wiktionary language editions by non-programmers.
This is done by supplying a new configuration file, for example "config-en.xml". To create this config, follow these steps:

Step 1: Get a copy of the software

Install git and Maven

Check out the repository

git clone https://github.com/dbpedia/extraction-framework.git dbpedia

Build

cd dbpedia && mvn install

Step 2: Download the Wiktionary XML dump

If you just want to get started, you can use the example files in wiktionary/sample-xml-dumps. If you want the real data,
go to the Wikimedia dump archive and look for the latest dump of the targeted language. Download the article dump file "[...]-pages-articles.xml" and put it into a directory structure like the one in wiktionary/sample-xml-dumps.

Step 3: Create the configuration

Look at existing configs to get started; complete documentation will be available soon.
For debugging, you can also test the configuration with single pages: MediaWiki can easily export pages in the dump format – just put "Special:Export/" before the page name. The extractor will use the file with the latest modification timestamp in the "wiktionaryDump" folder.
After changing the config, run the following in the wiktionary directory:

mvn scala:run

The extracted triples can be found in the output folder.
To find out what went wrong, you can also increase the log level (to 4), pipe the debug output to a file, and try to trace the error. The output is very verbose and hard to read, but it is the best way to find out what is happening.

When you are satisfied with the data, you can load it into Virtuoso (use the isql console).

Developer – improve the framework

I will try to explain the internals here, but they are quite complex and some details are omitted. If you cannot follow, ask on the mailing list or wait for my thesis (first week of August 2012).

The idea is somewhat different from DBpedia (although we use its framework):
instead of infoboxes and very specific extractors, we tried to build a meta-extractor that is declarative rather than imperative in nature.
The rationale is that although some scrapers exist, none of them can parse more than 3-5 languages.
So we encode the language-specific characteristics of each Wiktionary in a machine-readable format (e.g. "config-de.xml") and create a generic extractor that interprets the config.
Top-down, these properties are:

Entry Layout (EL)

e.g. in the German Wiktionary, a given page has the structure:

Lexical Entity
languages it occurs in
part of speech it is used as
different senses/meanings
properties like
synonyms,
example sentence,
pronunciation,
translations, etc.

In the English Wiktionary there is an etymology section after the language. The schema used can differ between language editions of Wiktionary, so we came up with a simple encoding scheme to represent the expected schema for each language. We configure the EL with nested XML nodes named "block" (the top-level node is named "page"). Each block can contain three things:
further blocks (to represent the hierarchy),
an indicator template – if it is encountered on the page, this block starts,
declarations on how to parse the page at this level (see below).
More about the EL: http://en.wiktionary.org/wiki/[..]try_layout_explained
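A hedged sketch of what such an EL declaration could look like (the node and template contents here are invented for illustration; see an existing config such as config-de.xml for the actual syntax):

```xml
<!-- hypothetical fragment; the real node names are defined by the framework -->
<page>
  <block name="language">
    <indicator>== $word ({{Sprache|$language}}) ==</indicator>
    <block name="pos">
      <indicator>=== {{Wortart|$pos}} ===</indicator>
      <!-- parse templates for this level go here (see below) -->
    </block>
  </block>
</page>
```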

Wiki Templates

Now we come to the core of the extraction. We built an engine that can
match a given Wiktionary page against several "extraction templates" (ETs).
Consider this dummy page (written in German – to show the effect of language mappings):

We can start with this wikitext snippet and transform it into a generic schema definition by introducing placeholders (a.k.a. variables) and control symbols that indicate possible repetition of parts (like the regex "(ab)*" matches "ababab"). The engine then fills the placeholders with information scraped from the page (in other words, it binds the variables). The configuration contains declarations on what to do with the bound variables – often "use it as the literal object of predicate x", but also arbitrary transformations such as "format to a URI sprintf-style" or "hand the bindings to a static method on class y that returns triples". As an example, take the example page above and a config like this:

There are several things to notice: the nested structure of the block nodes (implying the schema); the indicator nodes, which contain templates that, when encountered on the page, imply that the following content belongs to that block; the normal template nodes, which contain information on how to parse the page using placeholders and how to emit triples, again with these placeholders; and (in the templates) the control structures, which look like regular expressions, although we only allow ()*, ()+ and ()?.
We hope you agree that this approach is fairly generic and declarative rather than imperative, thus enabling the quick creation of new configs.
For the definition template, the engine will find a set of bindings:

definition -> "eine Farbe"
definition -> "umweltfreundlich"

and then generates triples according to the resultTemplates, which again contain the placeholders.
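The binding step can be sketched in a few lines of Python (the real engine is written in Scala; the definition-line format and the predicate below are invented for illustration):

```python
import re

# Hypothetical definition-line format ":[1] text"; the ()* repetition in the
# config corresponds to matching the line template repeatedly.
line_template = r":\[\d+\] (?P<definition>.+)"
page = ":[1] eine Farbe\n:[2] umweltfreundlich\n"

# bind the "definition" variable once per repetition
bindings = [m.group("definition") for m in re.finditer(line_template, page)]

# emit triples from a result template that reuses the placeholder
result_template = '<wr:word-German-Adjective-{i}> wt:hasMeaning "{definition}"@de .'
triples = [result_template.format(i=i + 1, definition=d)
           for i, d in enumerate(bindings)]
print(bindings)  # → ['eine Farbe', 'umweltfreundlich']
```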

Our prototype recognizes the EL and thus gives information about the language and PoS usages of all
words in the Wiktionary, and has ETs for definitions, hyphenation, synonyms (yielding a community-based wordnet), translations, hyponyms, hyperonyms and example sentences.

Mapping

a mapping from language-specific tokens to a global vocabulary. In contrast to the DBpedia mappings, we do not map property labels to property URIs; instead, we map any token or value to a shared vocabulary (not to URIs directly). This concerns the labels occurring within the EL: their meaning has to be translated to a shared, global vocabulary.
In this wiki snippet there are two tokens: "Englisch" and "Adjektiv". They
indicate that the following section is about an English word whose part
of speech is adjective. Obviously we need a mapping from these tokens to a
shared vocabulary. Such a mapping is easy to write and is part of the configuration.
A nice addition would be a better ontological backing of this vocabulary:
some ontology of POS tags (GOLD?) and language families (ISO 639-3?) – we
should discuss what to use there.
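Such a token mapping could look roughly like this in the configuration (a hypothetical fragment; the element names are invented, only the token-to-vocabulary idea comes from the text):

```xml
<!-- hypothetical syntax: map language-specific tokens to the shared vocabulary -->
<mappings>
  <mapping from="Englisch" to="English"/>
  <mapping from="Adjektiv" to="Adjective"/>
</mappings>
```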

Use Case: Relational View on Translations

A requirement for a related project was the transformation of translations into a relational schema.
This is achieved with a SPARQL query that interprets the graph model into a row-based schema and retrieves the rows.
The result is saved to a CSV file – which can be loaded into an SQL DBMS with the built-in LOAD DATA INFILE command.
There are two scripts to accomplish that functionality:
to-csv.php – a PHP shell script that queries the public SPARQL endpoint of wiktionary.dbpedia.org, retrieves 10,000 word pairs per request, and transforms and writes them into a CSV file.
The columns are:
sword (the lexical source word), slang (the language of the word), spos (the part of speech of the word), ssense (a sense of the word), tword (the target word), tlang (the language of the target word).
An example entry could be:
"abändern","German","Verb","to change, to alter","change","English"
The extraction process takes a few hours. There are no configuration options, but the script can easily be changed.
Optionally, you should remove duplicate lines with uniq.
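The query run by to-csv.php could have roughly this shape (a sketch only; all predicate names are invented, just the six result columns and the 10,000-row paging come from the text):

```sparql
# hypothetical predicates; paging advances OFFSET by 10000 per request
SELECT ?sword ?slang ?spos ?ssense ?tword ?tlang
WHERE {
  ?sense wt:hasWord ?sword ;
         wt:hasLanguage ?slang ;
         wt:hasPoS ?spos ;
         wt:hasMeaning ?ssense ;
         wt:hasTranslation ?t .
  ?t wt:hasWord ?tword ;
     wt:hasLanguage ?tlang .
}
LIMIT 10000 OFFSET 0
```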

load-csv.sql – a trivial SQL script that creates a table with the columns explained above and inserts the CSV with a LOAD DATA command.
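A minimal sketch of what load-csv.sql might contain (column types and the CSV file name are assumptions; the real script defines the actual ones):

```sql
-- hypothetical sketch; the real load-csv.sql defines the actual types
CREATE TABLE translations (
  sword  VARCHAR(255),  -- lexical source word
  slang  VARCHAR(64),   -- language of the source word
  spos   VARCHAR(64),   -- part of speech of the source word
  ssense TEXT,          -- sense of the source word
  tword  VARCHAR(255),  -- target word
  tlang  VARCHAR(64)    -- language of the target word
);
LOAD DATA INFILE 'translations.csv' INTO TABLE translations
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n';
```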

Vocabulary: Lemon Model

The dataset also contains triples using the lemon vocabulary.
This is achieved by postprocessing the raw dataset with a recursive algorithm that "presses" the hierarchical schema into a flat one.
For example