A how-to guide for creating a Linked Data site

“That sounds fantastic,” you say, “…but how do I create a Linked Data site?”

In this article, I will try to address this by walking you through the whole process: from zero to “look at my Linked Data bling”. Well, at least one way of going about it. I will conveniently dodge the questions of what constitutes a Linked Data site, which technologies are involved, or how things ought to be built. However, many would agree that TimBL's Design Issues on Linked Data is the authoritative outline.

For our purposes, we will assume that you have a basic understanding of the RDF data model, have seen a few triple statements in one of the serializations, know what SPARQL is for, can read some PHP and HTML code, are not too shy at the Linux (Ubuntu/Debian, as far as these examples go) command line, and have Apache and PHP ready to go in your environment.

Unfortunately there are lots of steps at the moment, but I will do my best to make it painless. Bear with the Linked Data community as improvements are made on a daily basis. I’ll do my best to keep the tutorial steps as up to date as possible, but don’t be shocked if something is slightly off. Please let me know, and I’ll update it here. With that out of the way, let’s dive in.

What’s on the menu?

Setup a SPARQL server to store and query our data, and import RDF triples into that store

Install a bunch of tools which will interact with the queried data

Create templates to output stuff from our RDF store

Setting up Fuseki

We will use Fuseki as our SPARQL server. If you wish to use a different server (see also SPARQL Query Engines), you can skip this part of the how-to. Before we get Fuseki, let’s make sure the essentials to run it are in place:
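On Ubuntu/Debian, something along these lines should do. The package names are assumptions for that era (Java, Maven, and Subversion are what the build needs), and the checkout URL is the development trunk used throughout this how-to:

```shell
# Build prerequisites (package names may differ on your distribution)
sudo apt-get install openjdk-6-jdk maven2 subversion

# Check out the Fuseki development source to the location used in this how-to
sudo svn co https://svn.apache.org/repos/asf/jena/trunk/jena-fuseki/ /usr/lib/fuseki
cd /usr/lib/fuseki

# Build it
mvn install
```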

Note: If you get build errors of some sort (after all, this is the development version), you could either try to roll back to some version that builds without errors, or instead use the official Fuseki builds which are more stable.

Let’s configure the way we want to run Fuseki.

Although the following depends on your needs, it is worth pointing out an example custom configuration for your query results. For instance, if we want DESCRIBE queries to look for the resource as the subject (the default) as well as the object of the triple, there is a class that does that for us. Keep in mind that such queries are a bit more resource intensive, so only enable the following if you are sure you need it. In any case, it can be used by copying the required file over:

Update the package namespace to org.openjena.fuseki (instead of dev) in /usr/lib/fuseki/src/main/java/org/openjena/fuseki/BackwardForwardDescribeFactory.java:

package org.openjena.fuseki;

We have one more change to make, and that's in /usr/lib/fuseki/tdb.ttl, which configures the TDB settings for the Fuseki server. In here, we use the same namespace as earlier, i.e., org.openjena.fuseki for BackwardForwardDescribeFactory. Additionally, we can uncomment tdb:unionDefaultGraph true ; to treat all graphs in the dataset as one default graph. We can of course still refer to graphs individually.
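To give an idea, the relevant parts of tdb.ttl look roughly like this. The prefix URIs are the Jena assembler vocabulary of that era, while the <#dataset> name and the "DB" location are illustrative:

```turtle
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .

# Load the custom DESCRIBE handler; the class registers itself when loaded
[] ja:loadClass "org.openjena.fuseki.BackwardForwardDescribeFactory" .

# The TDB dataset; uncomment tdb:unionDefaultGraph to expose all named
# graphs as one default graph
<#dataset> a tdb:DatasetTDB ;
    tdb:location "DB" ;
    # tdb:unionDefaultGraph true ;
    .
```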

A minor note here about the dataset name. It is currently set to dataset by default, but you can change this easily from here.

Let’s factor-in our changes by rebuilding:

mvn install

By default, Fuseki starts in read-only mode:

./fuseki-server --desc tdb.ttl /dataset

This is good because if we make our SPARQL endpoint public, we have one less worry about access controls. Whenever we need to update (write), we can simply restart the server by adding in the --update option. By the way, the server runs on http://localhost:3030/ by default.

Now for some quick tests to make sure everything is up and running okay. Let’s import some RDF triples into the store.
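For example, with the s-put script that ships with Fuseki. This is a sketch: it assumes the server was started with --update, and that books.ttl sits in the current directory; the reachability check just avoids a confusing error when the server is down:

```shell
# Import books.ttl into the default graph of the dataset named "dataset"
ENDPOINT="http://localhost:3030/dataset"
if curl -s --max-time 2 -o /dev/null "$ENDPOINT/query?query=ASK%7B%7D"; then
  ./s-put "$ENDPOINT/data" default books.ttl   # PUT: replaces the target graph
  MSG="imported books.ttl into the default graph"
else
  MSG="Fuseki does not appear to be running at $ENDPOINT"
fi
echo "$MSG"
```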

That simply imports a Turtle file named books.ttl into the default graph of the dataset named dataset. Remember that every time we put triples into the store with the same graph name, the existing graph is deleted before the new one is inserted. If you wish to add triples to an existing graph, use the s-post command instead.

Note here that the service has the SPARQL endpoint URI at http://localhost:3030/dataset/query. If you wish to offer a public SPARQL endpoint e.g., http://example.org/sparql, then you might want to use a reverse proxy in Apache.

A little note here about the system architecture that Fuseki runs on. Fuseki uses TDB (which handles the RDF storage and querying) out of the box, but it can be configured to use SDB as well. This is not something you have to worry about right now unless you have particular performance needs to address. For the time being, just know that TDB performs well if you are doing a lot of reads.

Even though TDB can run on both 32-bit and 64-bit Java Virtual Machines, 64-bit is highly recommended, with a minimum of 1GB of RAM for reasonable performance on a small number of triples. In production environments, you should dedicate a lot more RAM. More memory certainly helps Java, especially during SPARQL Updates (writes) as opposed to queries (reads). For importing significant amounts of data, I recommend upping the Java max memory to at least 8GB, even if only temporarily during the import. That should be fine for, say, 10-50 million triples. For anything above that, consider 16GB+ of RAM on that machine.

If you have a large dataset (say, greater than 100k triples), and you want to perform DESCRIBE queries in both directions (resource as subject as well as object), then run Fuseki on a 64-bit machine. For anything small-scale, like a personal blog or testing on your local machine, 32-bit with 1GB of RAM is sufficient – famous last words.

By default, Fuseki server (fuseki-server) runs Java with -Xmx1200M. Simply use -Xmx8192M instead if you want to assign ~8GB of RAM.

If you are importing a lot of data, it is a good idea to do it via TDB's command-line tools as opposed to going through Fuseki. The TDB tooling works in a similar way to Fuseki's. For instance, to see the --help options available for each TDB command, type java tdb. and press Tab to get a list in your shell. Make sure to have the following CLASSPATH set up:

/usr/lib/fuseki/target/jena-fuseki-{version}-SNAPSHOT-server.jar

Change {version} to the version you are currently using.
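For instance (the version number here is illustrative; substitute whatever your build produced under target/):

```shell
# Put the Fuseki server jar on the classpath so the tdb.* tools resolve
VERSION="0.2.1"   # illustrative; use your build's version
export CLASSPATH="/usr/lib/fuseki/target/jena-fuseki-${VERSION}-SNAPSHOT-server.jar:${CLASSPATH:-}"
echo "$CLASSPATH"
```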

So now take a look at java tdb.tdbloader --help. The import that we did earlier with s-put can also be done as follows:

java tdb.tdbloader --desc=/usr/lib/fuseki/tdb.ttl default books.ttl

Note that the Fuseki service does not need to be running for you to work with TDB. If you have Fuseki running while doing updates directly with TDB (not a good idea, but you could get away with it), you should stop the service and start it again to see the changes.

Logging requests to Fuseki server is done through Apache’s log4j. There should be a log4j.properties file at the root of your Fuseki install.

Creating a public SPARQL endpoint

If you wish to create a public SPARQL endpoint i.e., allowing your data to be queried (and maybe even updated), here are a few steps you could take.

If our site is at http://site/, then http://site/sparql could be our public SPARQL endpoint. That means that we can take requests from that endpoint and pass them to http://localhost:3030/dataset/query. To accomplish this, all we have to do is set up a reverse proxy with Apache. Add the following to your Apache configuration:
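Something along these lines; this is a sketch assuming mod_rewrite and mod_proxy_http are enabled, so adjust to taste:

```apache
RewriteEngine on

# A bare /sparql request: internal rewrite to the SPARQL query form
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^/sparql$ /usr/lib/fuseki/pages/sparql.html [L]

# A /sparql request carrying a query: proxy it through to Fuseki
RewriteCond %{QUERY_STRING} ^query=
RewriteRule ^/sparql$ http://localhost:3030/dataset/query [P,L]
```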

The first rewrite rule is simple. It just takes a request like http://site/sparql and does an internal rewrite to load the SPARQL query form at /usr/lib/fuseki/pages/sparql.html. If you wish to make changes to this form e.g., like adding PREFIXes, you could edit that file.

The RewriteCond and RewriteRule handle the proxy bit we need. When the SPARQL query form is submitted, it does a GET request. You might be wondering why a form is doing a GET request when it could be a POST. One reason is that the resulting GET URI can be used as a dereferenceable resource. It has its limitations, e.g., the number of characters, so you could change it to POST if you expect other people to make long requests. In any case, when the form is submitted, the request URI will be something like http://site/sparql?query=...&output=xml&stylesheet=xml-to-html-links.xsl. Hence, the RewriteCond takes the request with the query and sends it off to http://localhost:3030/dataset/query, which is the same endpoint we'll use for Linked Data Pages in the next section.

Alright, so, are you all Fusekied out? Good. Me too. Let’s move on to something else.

Setting up Linked Data Pages

You can place the libraries anywhere you like, but I find it convenient to have them all under /var/www/lib/. The Linked Data Pages package is what we’ll work with, and it requires the following libraries:

In this how-to we have used Fuseki as our SPARQL server; however, as mentioned earlier, the Linked Data Pages framework can use any other SPARQL server for its baseline dataset. Therefore, you can hook up either a local or a remote SPARQL endpoint for it to work with.

Setting up our site

At this point we’ll assume that you have a site enabled at /var/www/site/. But, just to make sure you avoid running into problems with your server not permitting access, use a simple configuration like the following in /etc/apache2/sites-available/site.conf:
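A minimal example, using the Apache 2.2 syntax of the time (on Apache 2.4, the access directives become Require all granted):

```apache
<VirtualHost *:80>
    ServerName site
    DocumentRoot /var/www/site

    <Directory /var/www/site>
        Options FollowSymLinks
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>
```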

The Linked Data Pages package comes with an installation script to setup the directories and SPARQL endpoint URI. We’ll copy that file over and go from there.

cp /var/www/lib/linked-data-pages/install.php /var/www/site/

We need to make sure that this installation script (temporarily) has write access to the site directory:

chmod a+w /var/www/site

Now we simply load http://site/install.php (replace site with whatever host is pointing to /var/www/site/) in our browser. Enter the form values that correspond to the locations where we installed the libraries earlier. If you went ahead with the defaults in this how-to, then the following is what we want:
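The exact field labels come from the install form itself, but with the defaults used in this how-to, the values amount to:

```
Site directory:    /var/www/site/
Library directory: /var/www/lib/
SPARQL endpoint:   http://localhost:3030/dataset/query
```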

That is it! When you submit this form, you should be ready to go. If you now load http://site/ in your browser, you should see the default homepage. If you have your host set to something other than site, then you need to revisit /var/www/site/config.php and change the value at:

$config['site']['server'] = 'site';

Since site corresponds to http://site.

While we are here, we can update the following as well:

$config['site']['name'] = 'My Linked Data site';

That simply sets the name of the site as it appears in the page title, address, etc.

The following is used to set the path of our site if it is somewhere other than the base, e.g., /foo in http://site/foo:

$config['site']['path'] = '';

We can set the theme here too, where default points to /var/www/site/theme/default:

$config['site']['theme'] = 'default';

If you have your own theme, simply copy your theme files into a directory under /var/www/site/theme/.

And finally, we can specify the site logo file, where logo_latc.png is at /var/www/site/theme/default/images/logo_latc.png:

$config['site']['logo'] = 'logo_latc.png';

Creating templates

We can finally get down to doing cool stuff. Let’s say we want to create a template that renders a FOAF profile of a person. We’ll first import the RDF data into our store, create a query to get it out, and finally create a template where we can process the data and render it back out to the user.
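Assuming the FOAF data sits in a Turtle file called people.ttl (a hypothetical name), the import is the same as before, this time into a named graph:

```shell
./s-put http://localhost:3030/dataset/data http://site/graph/people people.ttl
```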

Note here that http://site/graph/people is the name of the graph where we’ve put our people data. Note also that since the default graph is the union of all named graphs, we don’t need to use GRAPH <http://site/graph/people> in our SPARQL queries if we don’t want to.

If you would like to use different names for your datasets, simply update tdb.ttl and run the fuseki server using that dataset name.

It is now time to create a template where we can process this data. Before we do that, however, it is important to give an overview of how entities are managed. This framework uses entity sets, with a unique id identifying each set. Each entity set contains the following information:

Path

The path value is used for URI pattern matching, in order to identify which entity set to initiate.

Query

The query value is sent directly to the SPARQL endpoint when the entity path matches.

Template

The template value specifies the template to load when the entity path matches.

Let’s start by writing our SPARQL query, and for that we’ll head over to /var/www/site/config.php and add the following:
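It could look something like the following; the exact array layout should be checked against the shipped config.php, the part that matters being the people key:

```php
/* SPARQL query with key 'people': grab everything in the people graph */
$config['query']['people'] = '
    CONSTRUCT { ?s ?p ?o }
    WHERE {
        GRAPH <http://site/graph/people> { ?s ?p ?o }
    }';
```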

This is our SPARQL query with the key people. All we are doing here is constructing an RDF graph result of all the triples in the named graph http://site/graph/people. Now, we'll tie this query key to our entity set id site_people:

$config['entity']['site_people']['query'] = 'people';

Similarly, we set the URI path, so that everything is initiated when we visit http://site/people:
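Following the same pattern as the query assignment above; the template filename here is hypothetical:

```php
$config['entity']['site_people']['path']     = '/people';
$config['entity']['site_people']['template'] = 'page.people.html'; /* hypothetical name */
```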

This is a pretty simple template which we can reuse. It simply renders the resulting triples in a table. The most noteworthy line here is:

$triples = $this->getTriples('http://csarven.ca/#i');

The getTriples function gets all the triples from our SPARQL query result and places them in a multi-dimensional array. In this example, we are getting all the triples with subject http://csarven.ca/#i. It should give us the same triples we imported into our RDF store. But we can also get only the triples that match a pattern given as parameters (subject, property, object), e.g.,
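For instance, assuming getTriples accepts optional property (and object) arguments:

```php
$triples = $this->getTriples('http://csarven.ca/#i', 'http://xmlns.com/foaf/0.1/knows');
```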

That would get us all the triples with subject http://csarven.ca/#i and property http://xmlns.com/foaf/0.1/knows. An alias for this is the getValue function, where we can use QNames in the property position:

$this->getValue('http://csarven.ca/#i', 'foaf:knows');

You can define more QNames in $config['prefixes'] in /var/www/site/config.php. See also /var/www/lib/linked-data-pages/README for more uses of getTriples, like wildcards.

For complex templating, that is, if you’d like to do more data processing with PHP, you can dive into the SITE_Template class in /var/www/classes/SITE_Template.php instead of writing your functions directly inside the HTML templates. The functions in there can be called directly from your templates.

If you are curious about internals of Linked Data Pages, see this article section.

Conclusions

If you have made it this far, congratulations! But don't stop here; take your site even further by building useful interactions for your consumers. Consider the following items:

Create data visualisations to help your users to get insights into the data that you are publishing.

The setup outlined in this article is more or less used at 270a.info. One of the goals of Linked Data Pages is to have a framework where a Linked Data site can be created with minimal “development” work. This framework relies heavily on Paget, which in turn relies heavily on Moriarty and ARC2. Even though Paget has some quirks, with Moriarty and ARC2 it worked out quite well and got me at least 80% of the way there. There are probably a few more things I could iron out (i.e., fix bugs, not reinvent) once I address the finer details of Paget and Moriarty. Here is my to-do list for Linked Data Pages:

An additional administration user interface for site configuration, instead of having to edit config files directly.

Improve templating by adding more common functions.

Integration of data visualisations for common data dimensions.

Let's wrap this up here. All feedback is most welcome. Let me know how all this works out for you, especially if you would like me to clarify anything further. Happy Linking and stuff =)

1 interaction

Hi Christopher. Thanks for mentioning Graphite. I came across Graphite a while back and found it pretty neat. I'd like to take a look at it again at some point.

I didn't go too deeply into Linked Data Pages' features in this article because it was intended to be a quick summary of how to get something up and running. Perhaps an update is in order.

Like Graphite, LDP comes with a bunch of function calls that let you easily dissect whatever is in the query result and have it ready in the templates. Generally speaking, helper functions which do more than common data manipulation tend to be domain specific. Similarly, that line of development (or thinking) results in something along the lines of Fresnel, which I consider to be quite useful as well.

Personally, what I find really handy in LDP is to be able to trigger a SPARQL query, and an accompanying template based on the requested URL pattern.

The tutorial is very helpful. You might want to add to your tutorial a hint on two issues that sidetracked me for a long time: (1) If you find it impossible to run ./s-put or other ./s commands in your fuseki directory, make sure that they are enabled for execution using, e.g., chmod 744 s-* (2) On the LDP install.php page, you'll get examples that have a trailing / at the end of directory paths. Be certain *not* to terminate a path with that /

Also, I am uncertain what steps a person would follow to set up a reverse proxy for this particular example. Could you provide something more on this?

I made it all the way to the install.php script provided by LDP, but the resultant index.php file renders blank in my browser. So I'm a bit stuck, wondering if the answer is in the reverse proxy.

Thank you for such a comprehensive guide.
I am facing a problem with getting Fuseki.
This link is not working:
svn co https://svn.apache.org/repos/asf/jena/trunk/jena-fuseki/
Can you suggest the alternative? I'd be thankful to you.

Dear Sarven,
Does this article still represent best practice for setting up a POD? Is it better to use one of the Solid servers in development? Hope to see more public-facing documentation of how to be an early dogfooder, a la Indiewebcamp - e.g., a rolling guide to publishing Linked Research both with and without a POD.

Thanks and keep up the inspiring work, - Myles

1 interaction

This article offers one way where you can use an RDF store/SPARQL endpoint, and a templating system. You can contrast this with a relational DB setup like with MySQL. From that point of view, this article tends to take a generic approach to publishing Linked Data with the "social web" use case; publishing a personal web page for instance. Nothing more. But, it doesn't exclude other possibilities. This is still classic Web architecture. Still good.

So, I think that for a social web, Solid is more appropriate here if you want to plug and play with everything else. For example, being able to annotate/reply to someone's article and have that note be stored in your own dataspace, meanwhile letting the author of the article know about your annotation (by sending a notification(s) to the article's inbox(es)). They can then decide to display them or not.

Aside i.e., increasingly off topic: Both have a strong emphasis on being able to conduct follow-your-nose (FYN) type of exploration that's human and machine-readable. What's important is the application of the RDF language (the particular syntax is a minor point). For prose content, e.g., blog posts, annotations etc, I think HTML+RDFa is great because it gives the publisher the ability to be as specific as they want to be about their "statements", and the consumer to discover that information using a uniform mechanism. In the end, the same mechanism allows the consumer to integrate any data (statistical, geo, social web, media, health, climate...) from anywhere on the Web. See also: http://lod-cloud.net/

You may have already read this, but check out the article on dokieli, or the code at https://github.com/linkeddata/dokieli. It doesn't give you step-by-step instructions on how to set up a pod and use it. Documentation on Solid pods is in progress, and I will write another article which covers the Solid/dokieli approach, similar to this article here.

1 interaction

Sarven, thank you for a layered and thoughtful reply. Your description of the difference between your 2011 LD site architecture described here and Solid emphasises the similarities, with a neat summary of why this approach is important. This is good, but I have to say that, for me, the greater impact of you and your colleagues' recent work is in what differentiates Solid from this methods paper. As your colleagues have written elsewhere, the capacity of Solid to replace the silos is critical. This paper provides a valid LDP personal server which "doesn't exclude other possibilities", but Solid seems to be managing to have an opinion on which of these possibilities to include, in order to break through usability barriers. I feel this is exactly what both the indieweb and scientific (self-)publishing desperately need.

In addition to your answer, a clarification of the trade-offs between using hosted Solid PODs such as databox, and running a personal Solid server, would be helpful. I look forward to your Solid POD tutorial.

Hi Sarven,
I have been trying to follow your tutorial to create a public SPARQL endpoint for my data. But I am having some issues. Could you please help me solving those?
1- I have written a configuration file for TDB to start a query service. Then I started fuseki with ./fuseki-server --config=config.ttl
The terminal shows
INFO Dataset path = /books
INFO Fuseki 1.3.0 2015-07-25T17:11:28+0000
INFO Started 2016/05/31 05:30:57 EDT on port 3030

I am able to query my dataset in the browser or terminal with the following URI: http://localhost:3030/books/query?query=ASK{}
But what I want is a SPARQL endpoint UI for this dataset. If I use http://localhost.me:3030/sparql.tpl
It displays the fuseki query interface with empty dataset.
If I start it with http://localhost.me:3030/sparql
It gives error Error 404: Service Description: /sparql
and If I use http://localhost.me:3030/books/query
It gives
Error 404: Service Description: /books/query

How can I solve this problem and start my public sparql endpoint with pre-loaded dataset??
I'll appreciate your response a lot. Regards.