
Just recently, OpenLink, a DBpedia Association member and hosting specialist, released the DBpedia Usage Report, a periodic report on the DBpedia SPARQL endpoint and associated Linked Data deployment.

The report not only gives some historical insight into DBpedia’s usage, such as the number of visits and hits per day, but in particular shows statistics collected between October 2016 and December 2017. It covers more than a year of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/.

Before we highlight a few aspects of DBpedia’s usage, we would like to thank OpenLink for the continuous hosting of the DBpedia endpoint and the creation of this report.

The graph shows the average number of hits/requests per day that were made to the DBpedia service during each of the releases.

The graph shows the average number of unique visits per day made to the DBpedia service during each of the datasets.

Speaking of which, as you can see in the following tables, there has been a massive increase in the number of hits coinciding with the DBpedia 2015-10 release on April 1st, 2016.

This boost can be attributed to intensive promotion of DBpedia via community meetings, communication with various partners in the Linked Data community and a social media presence, all aimed at increasing backlinks.

Since then, not only has the number of hits increased, but DBpedia has also provided better data quality. We are constantly working on improving accessibility, data quality and stability of the SPARQL endpoint. Kudos to OpenLink for maintaining the technical baseline for DBpedia.

This release took us longer than expected. We had to deal with multiple issues and included new data. Most notable is the addition of the NIF annotation datasets for each language, recording the whole wiki text, its basic structure (sections, titles, paragraphs, etc.) and the included text links. We hope that researchers and developers working on NLP-related tasks will find this addition most rewarding. The DBpedia Open Text Extraction Challenge (next deadline Mon 17 July for SEMANTiCS 2017) was introduced to instigate new fact extraction based on these datasets.

We want to thank everyone who has contributed to this release by adding mappings, new datasets, extractors or issue reports, helping us to increase coverage and correctness of the released data. We also thank the European Commission and the ALIGNED H2020 project for funding and general support.


Statistics

Altogether the DBpedia 2016-10 release consists of 13 billion (2016-04: 11.5 billion) pieces of information (RDF triples) out of which 1.7 billion (2016-04: 1.6 billion) were extracted from the English edition of Wikipedia, 6.6 billion (2016-04: 6 billion) were extracted from other language editions and 4.8 billion (2016-04: 4 billion) from Wikipedia Commons and Wikidata.

In addition, adding the large NIF datasets for each language edition (see details below) increased the number of triples further by over 9 billion, bringing the overall count up to 23 billion triples.
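
As a quick sanity check on the figures above, the following minimal Python sketch adds up the rounded counts quoted in the text. The variable names are ours, and because the published numbers are rounded, the sums can only approximate the stated totals.

```python
# Rounded triple counts quoted in the text, in units of 100 million triples.
# Integers are used so that the sums are not affected by floating-point noise.
CORE_2016_10 = {
    "english_wikipedia": 17,         # 1.7 billion
    "other_language_editions": 66,   # 6.6 billion
    "commons_and_wikidata": 48,      # 4.8 billion
}

# The text only says "over 9 billion" for the NIF datasets, so this is a lower bound.
NIF_LOWER_BOUND = 90

core_units = sum(CORE_2016_10.values())
core_total = core_units / 10                               # in billions
lower_bound_with_nif = (core_units + NIF_LOWER_BOUND) / 10  # in billions

print(f"core datasets: about {core_total} billion triples")
print(f"with NIF datasets: at least {lower_bound_with_nif} billion triples")
```

The core sum of roughly 13 billion matches the stated figure, and a lower bound of about 22 billion is consistent with the stated overall count of 23 billion, given that the NIF datasets add more than 9 billion triples.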

Changes

The NLP Interchange Format (NIF) aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. To extend the versatility of DBpedia and to support many NLP-related tasks, we decided to extract the complete human-readable text of any Wikipedia page (‘nif_context’), annotated with NIF tags. For this first iteration, we restricted the extent of the annotations to the structural text elements directly inferable from the HTML (‘nif_page_structure’). In addition, all contained text links are recorded in a dedicated dataset (‘nif_text_links’).
The DBpedia Association started the Open Text Extraction Challenge on the basis of these datasets. With this effort, we aim to spur knowledge extraction from Wikipedia article texts in order to dramatically broaden and deepen the amount of structured DBpedia/Wikipedia data, and to provide a platform for benchmarking various extraction tools.
If you want to participate with your own NLP extraction engine, the next deadline for SEMANTiCS 2017 is July 17th.
We included an example of these structures in section five of the download page of this release.
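
To give a rough idea of what offset-based link annotation looks like, here is a minimal Python sketch that locates link anchors in a context string and records their character offsets. It is an illustration only and not the DBpedia extraction code: the `LinkAnnotation` class, the `annotate_links` function and the sample sentence are hypothetical, and the NIF property names in the comments are mentioned only as the concepts these fields are meant to mirror.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkAnnotation:
    anchor_of: str    # surface text of the link (cf. nif:anchorOf)
    begin_index: int  # offset of the first character (cf. nif:beginIndex)
    end_index: int    # offset just past the last character (cf. nif:endIndex)
    target: str       # resource the link points to


def annotate_links(context, links):
    """Find each (anchor, target) pair in the context, scanning left to right."""
    annotations = []
    cursor = 0
    for anchor, target in links:
        begin = context.find(anchor, cursor)
        if begin == -1:
            continue  # anchor text not present after the cursor; skip it
        end = begin + len(anchor)
        annotations.append(LinkAnnotation(anchor, begin, end, target))
        cursor = end  # later links are searched for after this one
    return annotations


# Hypothetical sample data, standing in for a page text and its links.
context = "Leipzig is a city in Saxony, Germany."
links = [
    ("Saxony", "http://dbpedia.org/resource/Saxony"),
    ("Germany", "http://dbpedia.org/resource/Germany"),
]
annotations = annotate_links(context, links)

for a in annotations:
    print(a.begin_index, a.end_index, a.anchor_of, a.target)
```

Storing offsets against one shared context string, rather than copying text into every annotation, means that any number of annotation layers can refer to the same text without duplicating it.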

A considerable amount of work has been done to streamline the extraction process of DBpedia, converting many of the extraction tasks into an ETL setting (using SPARK). We are working in concert with the Semantic Web Company to further enhance these results by introducing a workflow management environment to increase the frequency of our releases.

In case you missed it, what we changed in the previous release (2016-04)

We added a new extractor for citation data that provides two files:

citation links: linking resources to citations

citation data: trying to get additional data from citations. This is quite an interesting dataset, but we need help to clean it up.

In addition to the datasets normalised to English DBpedia (en-uris), we now provide normalised datasets based on the DBpedia Wikidata (DBw) datasets (wkd-uris). These sorted datasets will be the foundation for the upcoming fusion process with Wikidata. The DBw-based URIs will be the only ones provided in subsequent releases.

We now filter out triples from the Raw Infobox Extractor that are already mapped. For example, “<x> dbo:birthPlace <z>” and “<x> dbp:birthPlace|dbp:placeOfBirth|… <z>” no longer appear together for the same resource. The filtered raw triples are now moved to the “infobox-properties-mapped” datasets and are not loaded on the main endpoint. See issue 22 for more details.
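
The idea behind this filter can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the extraction framework's actual logic: we assume that a raw triple counts as "already mapped" when a mapped triple with the same subject and object exists, and the function name `split_raw_triples` and the placeholder triples are hypothetical.

```python
def split_raw_triples(mapped, raw):
    """Split raw infobox triples into those to keep and those already mapped.

    Assumption: a raw triple is redundant if some mapped triple shares its
    subject and object, whatever the predicate is.
    """
    mapped_pairs = {(s, o) for s, _, o in mapped}
    kept, moved = [], []
    for triple in raw:
        s, _, o = triple
        if (s, o) in mapped_pairs:
            moved.append(triple)  # would go to "infobox-properties-mapped"
        else:
            kept.append(triple)   # stays in the raw infobox properties
    return kept, moved


# Placeholder triples in the spirit of the example in the text.
mapped = [("<x>", "dbo:birthPlace", "<z>")]
raw = [
    ("<x>", "dbp:birthPlace", "<z>"),
    ("<x>", "dbp:placeOfBirth", "<z>"),
    ("<x>", "dbp:nickname", "<y>"),
]

kept, moved = split_raw_triples(mapped, raw)
print("kept:", kept)
print("moved:", moved)
```

In this sketch, the two raw birth place triples end up in `moved`, while the unrelated `dbp:nickname` triple stays in `kept`.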

Major improvements in our citation extraction. See here for more details.

We incorporated the statistical distribution approach of Heiko Paulheim in creating type statements automatically and providing them as additional datasets (instance_types_sdtyped_dbo).

Upcoming Changes

DBpedia Fusion: We finally started working again on fusing DBpedia language editions. Johannes Frey is taking the lead in this project. The next release will feature intermediate results.

Id Management: Closely related to the DBpedia Fusion project is our effort to introduce our own Id/IRI management, to become independent of Wikimedia-created IRIs. This will not entail changing our domain or entity naming regime, but will provide the possibility of adding entities of any source or scope.

RML Integration: Wouter Maroy has already provided the necessary groundwork on GitHub for switching the mappings wiki to an RML-based approach. Last week, Wouter started working exclusively on implementing the Git-based wiki and the conversion of existing mappings. We are looking forward to the results of this process.

Further development of SPARK Integration and workflow-based DBpedia extraction, to increase the release frequency.

New Datasets

SDTypes: We extended the coverage of the automatically created type statements (instance_types_sdtyped_dbo) to English, German and Dutch.

Extensions: In the extension folder (2016-10/ext) we provide two new datasets (both are to be considered experimental):

DBpedia World Facts: This dataset is authored by the DBpedia Association itself. It lists all countries, all currencies in use and (most) languages spoken in the world, as well as how these concepts relate to each other (spoken in, primary language etc.) and useful properties like ISO codes (ontology diagram). This dataset extends the very useful LEXVO dataset with facts from DBpedia and the CIA Factbook. Please report any errors or suggestions regarding this dataset to Markus.

JRC-Alternative-Names: This resource is a link-based complementary repository of spelling variants for person and organisation names. The data is multilingual and contains up to hundreds of variations per entity. It was extracted from the analysis of news reports by the Europe Media Monitor (EMM) as available on JRC-Names.

Community

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2016-04 ontology encompasses:

The editor community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 2016-10 extraction, we used a total of 5887 template mappings (DBpedia 2015-10: 5800 mappings). The top language, gauged by the number of mappings, is Dutch (648 mappings), followed by the English community (606 mappings).

SpringerNature for offering a co-internship to a bright student and developing a closer relation to DBpedia on multiple issues, as well as links to their SciGraph subjects.

Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that provides 5-Star Linked Open Data publication and SPARQL Query Services.

OpenLink Software (http://www.openlinksw.com/) collectively for providing the SPARQL Query Services and Linked Open Data publishing infrastructure for DBpedia in addition to their continuous infrastructure support.