September 01, 2009

This is a blog entry I promised some folks a few months back - I'm finally getting around to it.

It is another in my series of "What can you do with an RDF data set that someone gives you?" In this case, someone asked me if there was a way to get rid of the instance data in an RDF data set, leaving just the 'ontology' (where by 'ontology' in this context we mean all the schema stuff, e.g., the RDFS).

Now, when someone wants to create a set of ontologies and data, I usually recommend that they develop these things separately. Create your schema in one file, your data in another. You can merge them together easily enough with just a couple of owl:imports. One option is to have the data file import the appropriate schema information (a typical pattern when re-using a schema, e.g., SKOS). Alternatively, you can build a file that imports both your schema and your data. This can be useful if you want to manage different data files with different schema information.

But what if someone gives you a file that hasn't been factored like this? What can you do about it?

One option, if you are a TopBraid Composer user, is to just remove all the instances. Holger gave an example of this in a very particular context in his blog. That method will work for other ontologies as well, not just the Semantic XML example that he gives.

Another way to do this is to use SPARQL. With SPARQL, you can select the instance information, and separate it out. There are a number of ways to approach this, depending a bit on the details of the RDF file you are working with.

Let's suppose someone has sent you an OWL file, with a bunch of owl:Classes, and some connections between them. With SPARQL you can select for all the classes easily enough:

SELECT ?class WHERE {?class a owl:Class}

Now the members of that class can be found pretty easily too:

SELECT ?member WHERE {?class a owl:Class. ?member a ?class}

So - how do we get rid of all the information about these members? Easy - match all the triples about them:
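A minimal sketch of such a query, assuming the standard owl: namespace and using CONSTRUCT so that the result is itself a set of triples (a SELECT over ?member ?p ?o would list the same matches as a table):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Every triple whose subject is a member of some owl:Class.
# The type triple (?member a ?class) is matched by ?member ?p ?o as well.
CONSTRUCT { ?member ?p ?o }
WHERE {
  ?class a owl:Class .
  ?member a ?class .
  ?member ?p ?o .
}
```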

This gives us all the triples about the instances (including the type triples!). You can save these in a file on their own. Using TopBraid Composer, you can do this in the SPARQL tab by pulling down the context menu option, "Export results to file ...". If you are coding in Java, you can use the Jena API to create a new OntModel with these triples in it.

But how do we get just the schema? Well, if we use the ARQ SPARQL extensions (which are pretty sure to get into the recommendation soon), this is easy; just delete these triples:
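In the update syntax that was later standardized as SPARQL 1.1 Update, which may differ in detail from the ARQ extension syntax of the time, such a deletion could be sketched as follows (again assuming the standard owl: namespace):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Remove every triple whose subject is a member of some owl:Class.
DELETE { ?member ?p ?o }
WHERE {
  ?class a owl:Class .
  ?member a ?class .
  ?member ?p ?o .
}
```

Note that this is an update, not a query: it changes the graph it runs against, so it is safest to run it on a copy of the file.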

The stuff that's left is the schema. In Composer, you can just File>Save As, and put this where you like.

OWL DL's insistence that members of classes not be classes themselves (or properties) does a lot of work for us here - we know that we didn't 'accidentally' get any classes or properties in with our instance data. The situation gets a bit more complex if you can't count on this separation, but the principle is the same - select the triples that make up the part of the file you are interested in, and save it separately.

March 25, 2009

We produced a webinar a couple weeks back that shows off some of the cool things we can do to help people cope with creating and managing SPARQL queries. There were some technical difficulties on the day of the webinar itself that were disappointing, but I understand that the recording (linked here) came out well (I can't listen to it - you know how hard it is to listen to your own voice).

Since then I have shown the query generation stuff, in particular, live to a few audiences - most of them - well, all of them so far - have been pretty excited about it. I find that even as an experienced SPARQLer, I use the automated generation quite a lot myself.

August 07, 2008

From time to time, someone will give me access to an RDF data set for me to 'have a look at'. One of the advantages of how RDF works is that it is possible to query a dataset without knowing anything about the data set at the outset. There are some simple queries that you can get started with to show how this works. As an example, let's check out the dbpedia (query web page available at http://www.dbpedia.org/sparql). When I first learned about this, Orri Erling just gave me a link; he told me nothing about the dataset.

The dbpedia page starts out with a simple sample query:

SELECT DISTINCT ?Concept WHERE {[] a ?Concept}

So let's start by running that. It is a bit of an advanced query, since the query graph includes a blank node; if you aren't comfortable with blank nodes in queries, think of

SELECT DISTINCT ?Concept WHERE {?x a ?Concept}

instead.

This gives us all classes that have any members. There are a lot of these, maybe even too many. But we can get a feeling for the sort of thing that dbpedia talks about.

Another useful first query is

SELECT DISTINCT ?p WHERE {?s ?p ?o}

This gives you all the properties that are used in this data set.

Those are starting points - but let's go a bit further. Suppose we had a class that we were interested in. For example, when I ran the default query, one of the answers on the first page was http://dbpedia.org/class/yago/Airline102690270. So perhaps we can learn about airlines.

So, let's see the airlines that dbpedia knows about. I make a new query, based on what I learned in the previous one.

SELECT ?air WHERE {?air a <http://dbpedia.org/class/yago/Airline102690270>}

I get a lot of answers, including http://dbpedia.org/resource/Delta_AirElite_Business_Jets

Well - now what does dbpedia know about this Delta subsidiary? We can find out with a query like this:
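Such a query, sketched here from the pattern discussed further down, simply asks for every property and value attached to the resource:

```sparql
# All properties of the resource, together with their values.
SELECT ?p ?o
WHERE {
  <http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o .
}
```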

We can continue in this way in a number of directions; find other places where certain airlines have headquarters, find other things that have US headquarters, etc.

What is special about RDF / SPARQL that allowed this to happen? There are a few things here - we were able to query the schema using the same query language as we did for the data. The pattern

{?x a ?Concept}

returns the set of (nonempty) classes in the data set - a schema-level result. If this were a relational database, this would be akin to querying to find out the tables in the database.

We can even mix schema and data in the same query. For instance, the pattern

{<http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o}

tells us all the properties that are used to describe Delta AirElite Business Jets, as well as the values of those properties. This is like querying for the columns in a table that are filled in for a particular row, along with the values in those cells.

This is a real sense in which an RDF store is 'self-describing' - there is no need to know about traditional metadata (schemas) before exploring a data set.

September 25, 2007

It is a bit embarrassing when I teach SPARQL to someone with a background in SQL. Once they figure out how it works, they start to appreciate it as a powerful language for representing just what information you want from a graph. Then they start asking about some of the commonly used query operations from SQL - things like aggregators and grouping operations like COUNT, SUM, MAX, MIN and AVG - none of which SPARQL currently provides.

The SPARQL specification at least has some guidance for how to do negation in SPARQL, though this idiom is a bit more difficult than simply being able to have NOT as a keyword.

As part of a recent exercise, I wondered whether it was possible to use a similar trick to define MAX and MIN in SPARQL. I was surprised to find that it is possible.

Suppose we have data on the members of a prominent family, where we have the year of birth represented in triples of the form
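For instance, assuming a hypothetical `:` namespace, a hypothetical `:Person` class and `:family-name` property alongside `:birth-year`, the data and a query for the oldest family member might look like the following sketch, which uses the OPTIONAL / !bound negation idiom:

```sparql
PREFIX : <http://example.org/family#>   # hypothetical namespace

# Data is assumed to contain triples such as:
#   :JohnFKennedy :birth-year 1917 .

SELECT ?kennedy ?year
WHERE {
  # Any family member for whom a birth year is known.
  ?kennedy a :Person .
  ?kennedy :family-name "Kennedy" .
  ?kennedy :birth-year ?year .

  # Try to find another family member born earlier.
  OPTIONAL {
    ?older a :Person .
    ?older :family-name "Kennedy" .
    ?older :birth-year ?olderYear .
    FILTER (?olderYear < ?year)
  }

  # Keep only those for whom no earlier-born member was found.
  FILTER (!bound(?older))
}
```

The class and property names other than :birth-year are placeholders; any three patterns that pick out a family member with a known birth year would serve the same purpose.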

How does this work? The first three triples match any member of the family for which a birth year is known.

The pattern inside the OPTIONAL clause also matches a family member, and gets their birth year. We use the variable name ?older for this person because of the FILTER clause; we retain only the bindings of ?older that have an earlier birth year than ?kennedy.

Now here's how we get a max out of this: what happens if we can't find anyone with an earlier birth year? Then all matches to the pattern inside the OPTIONAL braces will be filtered out, and no bindings will remain for ?older.

Back outside the OPTIONAL, we filter based on the binding of ?older; if ?older is not bound, then we didn't find anyone with an earlier birth year.

Who is the person for whom nobody else has an earlier birth year? That's the oldest, of course.

Left as an exercise for the reader:

Suppose we also have :death-year represented in the same way as :birth-year, but there is no triple in case the person is still living. How do we modify this query to find the oldest living family member? Or the youngest dead kennedy?

What happens if the oldest (youngest) is not unique? What would you expect to happen? What does this query do?