
Ever wondered what the top subjects / predicates / objects are in DBpedia?

I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD.

Turns out this is actually more difficult than I expected, mostly because quad stores don’t optimize for such queries. This means that you can’t just ask a SPARQL endpoint (not even your local one) to give you the top subjects, predicates or objects with a query like this:

SELECT ?n (COUNT(*) AS ?c) WHERE {
  ?n ?p ?o .
} GROUP BY ?n ORDER BY DESC(?c) LIMIT 10

Try it yourself here if you don’t believe me… (I set it to time out after 15 seconds; it will return a dangerously nonsensical result if you’re not aware that you might get partial answers.)

Some Rant

So this led me to the fascinating conclusion that our beloved RDF query language doesn’t even allow us to answer simple questions such as “which node is most often used as a subject / predicate / object?” (we’re talking to a single SPARQL endpoint here, don’t even try dragging me into an open/closed world assumption discussion…).

So, all is great, let’s just not ask those evil questions…

… said no (computer) scientist ever.

So let’s get our hands dirty and use some unix tool magic…

Working with Dumps in NT Format

Luckily, I already had all the dumps laid out locally as described here, and luckily again, they are in N-Triples format.

N-Triples is a line based format, which means we have exactly one triple per line. I don’t exactly know whom to thank for this, but should you ever read this (wait, why are you reading my blog?) THANK YOU. It means that neither subject nor predicate nor object can contain (unescaped) newlines. And this means that you can actually quite sanely sort and parse .nt files with standard unix tools that have been optimized by generations of smart people.
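For illustration, a single such line looks like this (subject, predicate and object separated by single spaces, terminated by “ .”):

<http://dbpedia.org/resource/Germany> <http://www.w3.org/2000/01/rdf-schema#label> "Germany"@en .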

A Word about Sort Orders and Locales

Sort orders depend on your locale! This means that files sorted with a locale such as en_US.UTF-8 are not properly sorted for someone with a locale such as de_DE.UTF-8. Hence it’s wise to always run this in a shell before working with sort:

export LC_ALL=C

It resets your locale to a classic C byte-wise one, having the nice side effect that it’s faster as well.

Deduplication

First, it turns out the DBpedia dumps actually contain quite an astonishing amount of duplicate triples. This isn’t a problem when they are loaded into a quad store, as each triple will only count once there, but for counting occurrences the way we’re about to, it is.

To separate them out let’s do the following: we pick up all the dump files that are loaded into our endpoint with pv, a handy little tool similar to cat that also shows a nice progress bar. Then we decompress with zcat, remove comments from the files with grep and then call sort. We tell sort to use a ton of RAM (32 GB), but not even that is enough for the > 80 GB of decompressed dumps, so we need temp files. We can direct sort to put them onto an SSD instead of the default /tmp, and we can also compress those temp files on the fly. lzop is a very fast compression tool and the perfect fit for this (not compressing the temp files actually degrades performance, even at the 300 MB/s write speed of the SSD!). After this we use tee to multiplex our stream into two channels: one goes through plain uniq and is gzipped with pigz (like gzip but parallel, as gzipping > 80 GB becomes quite the bottleneck here otherwise) into dbpedia_uniq.nt.gz; the other goes through uniq -c -d, which only counts the duplicate lines, and is gzipped into dbpedia_dups.txt.gz (single-threaded gzip is OK here, as it’s not sooo big).
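In shell terms the whole thing looks roughly like this (a sketch: the dump path, the temp dir on the SSD and the output names are assumptions matching the description above; the process substitution needs bash):

export LC_ALL=C                     # byte-wise sort order, see above
pv /usr/local/data/datasets/dbpedia/2014/*/*.nt.gz \
  | zcat \
  | grep -v '^#' \
  | sort -S 32G -T /ssd/tmp --compress-program=lzop \
  | tee >(uniq | pigz > dbpedia_uniq.nt.gz) \
  | uniq -c -d \
  | gzip > dbpedia_dups.txt.gz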

Getting S,P,O Counts

OK, now let’s count the subject, predicate and object occurrences.
Subjects, predicates and objects are delimited by a single space (“ ”); everything after the second space we simply count as part of the object (so the final “ .” is counted into the object).
Similar to the above pipeline, we use tee again to multiplex the stream into three pipelines for subject, predicate and object counts.
Each of them is mostly based on cut: first to get the fields (-f1 for subject, -f2 for predicate, -f3- for object), then to limit very long strings to their first 1024 chars. While the truncation actually introduces some false positive matches for long literals, it’s probably safe for URIs, and it reduces sort times and file sizes for the object chunk a lot. If you want very accurate counts you should re-run without the cut -c-1024 parts.
Afterwards, in each pipeline the occurrences of a node in the s, p, o positions are sorted and counted with uniq -c, then gzipped with pigz.
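Again as a sketch (file names and sort buffer sizes are assumptions; drop the cut -c-1024 parts for exact counts):

export LC_ALL=C
zcat dbpedia_uniq.nt.gz \
  | tee >(cut -d' ' -f1  | cut -c-1024 | sort -S 8G | uniq -c | pigz > counts_s.txt.gz) \
        >(cut -d' ' -f2  | cut -c-1024 | sort -S 8G | uniq -c | pigz > counts_p.txt.gz) \
  | cut -d' ' -f3- | cut -c-1024 | sort -S 8G | uniq -c | pigz > counts_o.txt.gz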

Observations (subjects):

The top subjects are clearly dominated by list-like resources. Very big “normal” articles such as those of countries like dbpedia:United_States (1375 occurrences as subject) or dbpedia:Germany (1331 occurrences as subject) only show up at ranks 1518 and 1673, respectively. Scrolling through the top subject counts, the ratio of “List” to non-“List” resources slowly seems to equalize around 1000 occurrences (rank 3800+), but even among subjects that “only” occur ~500 times (rank 21000+) there still seem to be about 1/4 “Lists”.

Observations (objects):

The object counts are dominated at the top, with an order of magnitude difference, by foaf:Document and “en”. The non-negative-integer typed “1” follows, an order of magnitude ahead of the plain “0” and “1”. In between, a lot of very useful types follow, and we can see that we have a lot of information about physical things, people, concepts and places. It’s also nice to see http://wikidata.dbpedia.org/resource/Q5 right under foaf:Person, even though the URI doesn’t resolve anymore(?).

Conclusion

We’ve seen that it’s sadly not possible to get basic top-degree counts for big datasets via SPARQL, as the endpoints don’t seem to be optimized for this kind of query. I hope this changes in the future, as it’s quite useful to know degree distributions for all kinds of queries. Especially in the machine learning sector it seems quite essential to know whether you’re dealing with a “normal” node or one of the exceptional top nodes that are several orders of magnitude bigger than the rest.

Hope you enjoyed. Feedback welcome, as always.

Further reading

Thanks for all the feedback I got on this post. There are somewhat similar works that you might be interested in:

http://dbtrends.aksw.org/ calculates some stats similar to this post (and some more), but at the moment sadly only for DBpedia 3.9 and without stats about literals and http://dbpedia.org/resource/Category:* resources.

So you’re the guy who is allowed to set up a local DBpedia mirror or, more generally, a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trials and errors. If anything goes wrong (or everything works fine), feel free to leave a comment below.

Versions of this guide

There are three older versions of this guide:

Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1

May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)

A further update to DBpedia 3.9 and Virtuoso 7.0.0 (included further down this page)

In this step by step guide I’ll tell you how to install a local Linked Data mirror of DBpedia 2014, hosting a combination of the regular English and (as an example) the i18n German datasets, adding up to over half a billion triples. If this isn’t enough, you can also follow the links to the Freebase, DBLP, Yago, Umbel and Schema.org datasets / vocabularies, adding up to over 3.5 billion triples.

Let’s jump in.

Used Versions

DBpedia 2014

Virtuoso OpenSource 7.1.0

Ubuntu 14.04 LTS

Prerequisites

A strong machine with root access and enough RAM: We used a VM with 4 cores and 32 GB of RAM for DBpedia only. If you intend to also load Freebase and other datasets, I recommend at least 64 GB of RAM (we actually ended up using a 16-core, 256 GB RAM server). For installing I recommend more than 128 GB of free HD space for DBpedia alone, 256 GB if you want to load Freebase as well, especially for downloading and repacking the datasets, as well as for the growing database file when importing (mine grew to 50 GB for DBpedia and 180 GB with Freebase).

Let’s go

Download and install virtuoso

Go and download Virtuoso Open Source from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.1.0 as in this guide, or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:
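A sketch of this step, assuming the v7.1.0 tarball name and the usual autotools build (the --with-layout=debian flag is an assumption here; it makes the paths used below, like /var/lib/virtuoso/db/, match):

tar -xzf virtuoso-opensource-7.1.0.tar.gz
cd virtuoso-opensource-7.1.0
./configure --with-layout=debian    # layout flag is an assumption, see above
make
sudo make install

Then adjust /var/lib/virtuoso/db/virtuoso.ini as follows: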

# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1

The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 2014 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these, the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic, as articles which only exist in the German Wikipedia for example won’t be extracted. To solve this problem the i18n versions exist; they create IRIs of the form de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

Downloading the DBpedia dump files & Repacking

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format as follows. If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the importing procedure was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files): file access becomes a bottleneck if you have a couple of cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see below, I do the repacking per folder in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repacking process took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/2014
cd /usr/local/data/datasets/dbpedia/2014
wget -r -nc -nH --cut-dirs=1 -np -l1 -A '*.nt.bz2' -A '*.owl' -R '*unredirected*' http://downloads.dbpedia.org/2014/{en/,de/,links/,dbpedia_2014.owl}

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
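The repacking itself can look like this (a sketch; the folder names match the wget call above, adjust as needed):

cd /usr/local/data/datasets/dbpedia/2014
for dir in en de links; do
  ( cd "$dir" && for f in *.nt.bz2; do
      bzcat "$f" | gzip > "${f%.bz2}.gz" && rm "$f"
    done ) &   # one background job per folder
done
wait           # wait for all folders to finish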

Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_2014.owl file (first step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir('dir','*.*','graph');.
If you manually want to add some files, use ld_add('file','graph');.
See the VirtBulkRDFLoaderScript file for details.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track of (and easily reproduce) what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/2014/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/2014/.

Also be aware that if you load certain parts of the dumps into different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets), only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

One more thing (thanks to Romain): In order for the DBpedia.vad package (which is installed at the end) to work correctly, the dbpedia_2014.owl file needs to be imported into graph http://dbpedia.org/resource/classes#.

Note: In the following I will assume that your virtuoso isql command is called isql. If you’re lacking such a command it might be called isql-vt, but this usually means you installed Virtuoso using some other method than described here.
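The registration itself goes along these lines (a sketch mirroring the 3.9 guide further down this page; the importedGraphs subfolder names are assumptions following the graph split described above):

isql # enter virtuoso sql mode

-- first the ontology (see the note above about the classes graph):
ld_add('/usr/local/data/datasets/dbpedia/2014/dbpedia_2014.owl', 'http://dbpedia.org/resource/classes#');
-- then the per-graph directories (subfolder names are assumptions):
ld_dir_all('/usr/local/data/datasets/dbpedia/2014/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/2014/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/2014/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/2014/importedGraphs/pagelinks.de.dbpedia.org', '*.*', 'http://pagelinks.de.dbpedia.org');
-- check which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
EXIT;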

OK, now comes the fun (and long) part: about 1.5 hours for DBpedia alone (the new Virtuoso 7 is cool!), plus ~3 hours for Freebase… After we registered the files to be added, let’s finally start the process. Fire up screen if you didn’t already. (For more detailed metering than below, see the VirtTipsAndTricksGuideLDMeterUtility.)

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- depending on the amount of CPUs and your IO performance you can run
-- more rdf_loader_run(); commands in other isql sessions which will
-- speed up the import process.
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit WORK;
checkpoint;
EXIT;

After this:
Take a look into the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You may still use the dump, but it will be incomplete: any error aborts the loading of the corresponding file and continues with the next one, so you only get the part of that file up to the place where the error occurred. (Should you find errors you can’t fix, please leave a comment.)

Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor: http://your-server:8890

I just found this aged post in my drafts folder, maybe someone will still like it…

So you’re the guy who is allowed to set up a local DBpedia mirror or, more generally, a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trials and errors. If anything goes wrong, feel free to leave me a comment below.

Versions of this guide

There are two older versions of this guide:

Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1

May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)

With the recent release of Virtuoso 7 (way faster, thanks to OpenLink!) and DBpedia 3.9, I again felt the urge to update this guide, as a couple of things have changed.

In this step by step guide I’ll tell you how to install a local Linked Data mirror of DBpedia 3.9, hosting a combination of the regular English and (as an example) the i18n German datasets, adding up to nearly half a billion triples.

Let’s jump in.

Used Versions

DBpedia 3.9 + 3.9-i18n dataset

Virtuoso OpenSource 7.0.0

Ubuntu 12.04 LTS

Prerequisites

A strong machine with root access and enough RAM: We use a VM with 4 cores and 32 GB of RAM. For installing I recommend more than 128 GB of free HD space, especially for downloading and repacking the datasets, as well as for the growing database file when importing (mine grew to 41 GB).

Let’s go

Download and install virtuoso

Go and download Virtuoso Open Source from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.0.0 as in this guide, or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:

# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1

The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 3.9 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these, the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic, as articles which only exist in the German Wikipedia for example won’t be extracted. To solve this problem the i18n versions exist; they create IRIs of the form de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

As if this isn’t confusing enough, there is another trap: if you were to download the .ttl files, you’d suddenly have all statements associated with the IRIs, even for the standard DBpedia (unlike the online endpoint). The only reason I can think of for this inconsistency is that at some point the actual inconsistency of URIs in EN vs IRIs in everything else will be resolved. For now these files are most certainly not what you want, so use the .nt files!

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

Downloading the DBpedia dump files & Repacking

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German and French DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format (also see previous section with remarks about the TTL files!). If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the importing procedure was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files): file access becomes a bottleneck if you have 4 cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see I do the repacking per folder in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repacking process took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/3.9
cd /usr/local/data/datasets/dbpedia/3.9
wget -r -nc -nH --cut-dirs=1 -np -l1 -A '*.nt.bz2' -A '*.owl' -R '*unredirected*' http://downloads.dbpedia.org/3.9/{en/,de/,fr/,links/,wikidata/,dbpedia_3.9.owl}

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.

Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_3.9.owl file (last step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir.
If you manually want to add some files, use ld_add.
See the VirtBulkRDFLoaderScript file for args to pass.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track of what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/3.9/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/3.9/.

Also be aware that if you load certain parts of the dumps into different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets), only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

Note: in the following I will assume that your virtuoso isql command is called isql. If you’re lacking such a command it might be called isql-vt, but this usually means you installed Virtuoso using some other method than described here.

isql # enter virtuoso sql mode

-- we are in sql mode now
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/dbpedia.org','*.*','http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/de.dbpedia.org','*.*','http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/pagelinks.dbpedia.org','*.*','http://pagelinks.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/pagelinks.de.dbpedia.org','*.*','http://pagelinks.de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/topicalconcepts.dbpedia.org','*.*','http://topicalconcepts.dbpedia.org');
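-- and as the last step, register the ontology file itself (the target graph is
-- an assumption following the 2014 guide above):
ld_add('/usr/local/data/datasets/dbpedia/3.9/dbpedia_3.9.owl', 'http://dbpedia.org/resource/classes#');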

-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;

OK, now comes the fun (and long) part: about 1.5 hours (the new Virtuoso 7 is cool!)… We registered the files to be added, now let’s finally start the process. Fire up screen if you didn’t already.

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
-- when I visited the virtuoso conductor during the first run...)
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit WORK;
checkpoint;
EXIT;

After this:
Take a look into the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You may still use the dump, but it will be incomplete: any error aborts the loading of the corresponding file and continues with the next one, so you only get the part of that file up to the place where the error occurred. (Should you find errors you can’t fix in the way described above, please leave a comment.)

Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor: http://your-server:8890

Unlike with the previous updates, so many things have changed that I decided to put this into a separate post instead of continuing to update the old one, which would make it more and more complicated.
Two of the most severe changes are that Virtuoso 6.1.5+ includes a setting making the importer more robust (so much of the pre-processing of the dumps isn’t needed anymore), and that DBpedia 3.7 now also provides internationalized versions, causing a couple of problems / inconsistencies.

In this step by step guide I’ll tell you how to install a local mirror of DBpedia 3.7, hosting a combination of the regular English and the i18n German datasets, adding up to nearly half a billion triples!!!
Let’s jump in.

Versions

DBpedia 3.7 + 3.7-i18n dataset

Virtuoso OpenSource 6.1.5+

Prerequisites

A strong machine with root access and enough RAM: We used a VM with 4 cores and 32 GB of RAM. For installing I recommend more than 128 GB of free HD space, especially for downloading and repacking the datasets, as well as for the growing database file when importing (mine grew to 45 GB).

# note: virtuoso ignores lines starting with whitespace
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1

The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

Downloading the DBpedia dump files and a word about problems / inconsistencies in them

DBpedia 3.7 is split into two separate datasets: one standard version and one i18n version. The standard version mints URIs by going through all English Wikipedia articles. For all of these, the cross-language links are used to extract corresponding labels for the en URIs. This is problematic, as articles which only exist in the German Wikipedia for example won’t be extracted. To solve this problem the i18n version exists and creates IRIs of the form de.dbpedia.org for every article in the German Wikipedia. There also are interlinking datasets providing owl:sameAs links between the new IRIs and the ones in corresponding other datasets. Note that the i18n IDs for concepts are IRIs, while the ones in the English Wikipedia are URIs. Also, even though the i18n dataset includes all languages, only the Greek (el), German (de) and Russian (ru) Wikipedia have minted their own IRIs. The others are broken… they use URIs starting with http://dbpedia.org but are linked to their corresponding language codes in the interlanguage links (e.g., the French interlanguage links falsely point to fr.dbpedia.org). So it’s a mess! If you have a cleaned version of the datasets let us know, or just wait for DBpedia 3.8 as we all do.

Besides that, the el, de and ru i18n files ending in .nt.gz are actually not valid NT files, because the IRIs in them are UTF-8 encoded. After finding this out I simply renamed all the German files to .n3.gz; since NT is a subset of Turtle (TTL), which in turn is a subset of N3, and Virtuoso actually uses a TTL parser for all of them anyhow, I guess the renaming wasn’t all that important for Virtuoso. Still, I had a bad feeling about having files with wrong endings flying around.
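The rename was something along these lines (a sketch; run in the folder holding the German dumps, the glob is an assumption):

for f in *_de.nt.gz; do
  mv "$f" "${f%.nt.gz}.n3.gz"   # keep the name, swap the format suffix
done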

We have decided that we only needed the German and English files in (NT) format. If you need something different do so (and maybe report back if there were problems and how you solved them). If you decide to download the all-languages tar then make sure to exclude the NQ files from the later importing steps. One simple way to do this is to move everything you don’t want to import out of the directory. Also don’t forget to import the dbpedia_3.*.owl file (last step in the script below)!
Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the importing procedure was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files): file access becomes a bottleneck if you have 4 cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see I do the en and de repacking in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repacking process took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/3.7/3.7/en
cd /usr/local/data/datasets/dbpedia/3.7/3.7/en
wget -r -np -nd -nc -A '*.nt.bz2' http://downloads.dbpedia.org/3.7/en/

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.

Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. As mentioned above: If you use this method make sure that only files reside in the given subtree that you really want to import.
If you only want one directory’s files to be added (non recursive) use ld_dir.
If you manually want to add some files, use ld_add.
See the VirtBulkRDFLoaderScript file for args to pass.

Be warned that it might be a bad idea to import the normal and i18n datasets into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates. In order to keep track of what was selected and imported into which graph (see Note 2 below), we linked (ln -s) the files from the English (orig) and German (i18n) datasets into a directory structure beneath /usr/local/data/datasets/dbpedia/3.7/importedGraphs/ and imported from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want, just import /usr/local/data/datasets/dbpedia/3.7/.

Note: in the following I will assume that your virtuoso isql command is called isql. If you’re lacking such a command it might be called isql-vt.
Note 2: in our case we actually decided not to import all the files into just one graph but instead used separate graphs for en and de as well as for the pagelinks, infoboxprops, extlinks and interlanguage_links dumps. Be warned though that only a certain amount of triples from the http://dbpedia.org graph will be shown in case you visit the local pages with your browser.

isql # enter virtuoso sql mode

-- we are in sql mode now
ld_dir_all('/usr/local/data/datasets/dbpedia/3.7/importedGraphs/dbpedia.org','*.*','http://dbpedia.org');
-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;

OK, now comes the fun (and long) part: about 7 hours… We registered the files to be added, now let’s finally start the process. Fire up screen (see comment) if you didn’t already.

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
-- when I visited the virtuoso conductor during the first run...)
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit WORK;
checkpoint;
EXIT;

After this:
Take a look into the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You may still use the dump, but it will be incomplete: any error aborts the loading of the corresponding file and continues with the next one, so you only get the part of that file up to the place where the error occurred. (Should you find errors you can’t fix in the way described above, please leave a comment.)

Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor: http://your-server:8890

login: dba
pw: dba

Go to System Admin / Packages. Install the dbpedia and rdf_mappers packages (takes about 5 minutes).

Testing your local mirror
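A simple first test is counting all the triples in the store; a sketch of such a query (run via isql, using Virtuoso’s SPARQL-in-SQL prefix):

isql # enter virtuoso sql mode

SPARQL SELECT COUNT(*) WHERE { ?s ?p ?o . };
EXIT;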

This might take about 15 minutes and then returns 437,768,995. Subsequent queries are a lot faster (if you find another, preferably automatic, way to warm up the caches, please leave me a note).
I also like this query showing all the graphs and how many triples are in them:
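A sketch of such a query (again via isql; standard SPARQL 1.1 aggregates):

SPARQL SELECT ?g (COUNT(*) AS ?c) WHERE { GRAPH ?g { ?s ?p ?o . } } GROUP BY ?g ORDER BY DESC(?c);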

Yay, done
As always, feel free to leave comments to tell us about your problems or how happy you are :D.

Thanks

Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an opensource version of their DB; especially to Hugh Williams and Patrick van Kleef for helping me out with a couple of problems in the newer version.
