Users who have contributed to this file

What is RDF and what is it good for?

Joshua Tauberer

Last revised January 2008

This is an introduction to RDF (“Resource Description
Framework”), which is the standard for encoding metadata and other knowledge on
the Semantic Web. In the Semantic Web, computer applications make use of
structured information spread in a distributed and decentralized way throughout
the current web. RDF is an abstract model, a way to break down knowledge
into discrete pieces, and while it is most popularly known for its RDF/XML
syntax, RDF can be stored in a variety of formats. This article discusses
the abstract RDF model, two concrete serialization formats, how RDF is used
and how it differs from plain XML, higher-level RDF semantics, best practices
for deployment, and querying RDF data sources.

Translations

The first chapter (Really Quick Introduction) has been translated into:

This document was originally written in October 2005.
In July 2006 it was revised and extended with material
from my xml.com
article "What is RDF". In January 2008 it was revised
with more on N3 and RDF/XML and extended with the new sections Linked Data for the Web and
Querying Semantic Web Databases

Really Quick Intro to RDF

This section is a really brief introduction to Resource Description Framework (RDF). You might also be interested
in...

My more in-depth guide, which you'll find after this quick part wraps up.

RDF is a method for expressing knowledge in a decentralized world and
is the foundation of the Semantic Web, in which computer applications
make use of distributed, structured information spread throughout the
Web. Just to get it out of the way, RDF isn't strictly an XML format,
it's not just about metadata, it has little to do with RSS, and it's not
as complicated as you think.

The Big Picture

RDF is a general method to decompose any type of knowledge into small pieces, with some rules about the semantics, or
meaning, of those pieces. The point is to have a method so simple that it can express any fact, and yet so
structured that computer applications can do useful things with it. Here's some RDF:

If you know XML, here's a brief comparison. Like RDF, XML also is designed to be simple and
general-purpose. XML can be abstracted beyond its written brackets-and-slashes notation to something more
abstract, a "DOM" for tree-structured data. Similarly, RDF isn't just about how you write it. It's about
representing network- or graph-structured information. You can write RDF in XML, and many
people do. Here's what it might look like:

But you don't have to use XML. I don't. The first format above, called N3, is just as good.

What really sets RDF apart from XML and other things is that RDF is designed
to represent knowledge in a distributed world. This means RDF is
particularly concerned with meaning. Everything at all mentioned in RDF means something, whether a reference
to something concrete in the world, an abstract concept, or a fact. Standards built on RDF
describe logical inferences between facts and how to search for facts in a large database of RDF knowledge.

What makes RDF suited for distributed knowledge is that RDF applications can
put together RDF files posted by different people around the Internet and easily learn from them new things
that no single document asserted. It does this in two ways, first by linking documents together by the common
vocabularies they use, and second by allowing any document to use any vocabulary. This flexibility is fairly
unique to RDF.

This RDF document defines what it means to be an aunt, in terms of two other relations.
You could imagine an application putting this document together with the first RDF document to determine
that :rebecca is :john's aunt. What makes this work is that the names of entities
are global. That is, when :john and :hasFather are used in one document, applications can assume
they have the same meaning in any other RDF document with the same @prefix.

You want to integrate data from different sources without custom programming.

You want to offer your data for re-use by other parties* You want to decentralize data in a way that no single party "owns" all the data.

You want to do something fancy with large amounts of data (browse,
query, match, input, extract, ...), so you develop (or re-use) a generic
tool that allows you to do this on top of the RDF data model (which has
the advantage of not being tied to a proprietary data
storage/representation technology, like a database dialect).

RDF Defined

RDF can be defined in three simple rules:

A fact is expressed as a triple of the form (Subject, Predicate, Object). It's like a little English sentence.

Subjects, predicates, and objects are names for entities, whether concrete or abstract, in the real world.
Names are either 1) global and refer to the same entity in any RDF document in which they appear, or
2) local, and the entity it refers to cannot be directly refered to outside of the RDF document.

Objects can also be text values, called literal values.

You've seen facts already. Each line below was a fact:

:john a :Person .
:john :hasMother :susan .
...

Names come in two types. Global names, which have the same meaning everywhere, are always Uniform
Resource Identifiers (URIs). URIs can have the same syntax or format as website addresses, so you will see
RDF files that contain URIs like http://www.w3.org/1999/02/22-rdf-syntax-ns#type, where that URI is
the global name for some entity. The fact that it looks like a web address is totally incidental. There may or
may not be an actual website at that address, and it doesn't matter. There are other types of URIs besides
http:-type URIs. URNs are a subtype of URI used for things like identifying books by their ISBN number, e.g.urn:isbn:0143034650. TAGs are a general-purpose type of URI. They look like
tag:govtrack.us,2005:congress/senators/frist. URIs are used as global names because they provide a
way to break down the space of all possible names into units that have obvious owners. URIs that start with
http://www.rdfabout.net/ are implicitly controlled by me.

This point is important and needs repeating: Whatever their form, URIs you see in RDF documents are merely
verbose names for entities, nothing more. Forget that it has anything to do with the web.

Since URIs can be quite long, in various RDF notations they're usually abbreviated using the concept of
namespaces from XML. That's what the colons are doing in :john, :hasMother, and the
other entities in the example. The colons indicate the name is an abbreviated form. In these
cases, the names were http://www.example.org/#john, http://www.example.org/#hasMother, etc.

When written out, URIs are generally enclosed in brackets to distinguish them from namespaced-abbreviated names.

Literal values allow text to be included in RDF. This is used heavily when RDF is used for metadata:

RDF As A Graph

There are two complementary ways of looking at RDF information. The first is as a set of statements, like above. Each statement represents a fact. The second way is as a graph.

A graph is basically a network. Graphs consist of nodes interconnected by edges. In the Internet, for instance, the nodes are the computers, and the edges are the ethernet wires connecting them. In RDF, the nodes are names (not actual entities) and the edges are statements. Here's an example:

Each arrow or edge is a RDF statement. The name at the start of the arrow is the statement's subject, the name at the end of the arrow is the statement's object, and the name that labels the arrow is the predicate.
RDF as a graph expresses exactly the same information as RDF written out as triples, but the graph form makes it easier for us humans to see structure in the data.

A Quick Example

So how is RDF useful? It's the technology for the job when you want to mesh together distributed
information.

Here's a scenario where distributed information makes a lot of sense: a database of products from multiple
vendors and reviews of those products from multiple reviewers. No one vendor is going to want to be
responsible for maintaining a central database for this project, especially since it will contain information
for competing products and negative reviews. Likewise, no one reviewer may have the resources to keep such a
database up to date.

RDF is particularly suited for this project. Each vendor and reviewer will publish a file in RDF on their
own websites. The vendors will choose URIs for their products, and the reviewers will use those URIs when
composing their reviews. Vendors don't need to agree on a common naming scheme for products, and reviewers
aren't tied to a vendor-controlled data format. RDF allows the vendors and reviewers to agree on what they
need to agree on, without forcing anyone to use one particular vocabulary.

It's an open question just how an application will retrieve these files, but I'll put that aside. Once an
application has these files, it has enough information to relate products to reviews to prices, and even to
vendor-specific information like vendor1:partno. What you should take away from this
example is how unconstraining RDF is, while still allowing applications to immediately be able to relate
information together.

The vendors and reviewers didn't have to agree on much to make this happen. They had to agree to use RDF,
but they didn't have to agree on any specific data format or even on specific URIs. Crucially, they
didn't have to enumerate everything any vendor would want to include about their products, and they
can't lock out reviewers from posting related information.

Another way to look at this from the standpoint of interoperability. Vendor 1's format is entirely
interoperable with anyone else's format, even though Vendor 1 didn't hash out a common format with anyone.
When someone comes along and wants to be interoperable with Vendor 1's information, they don't need a new
format, they just need to choose the right subjects, predicates, and objects.

Conclusion

If you thought RDF was complicated, I hope you see now that it
doesn't have to be. RDF is easy to write, flexible, and unconstraining.
It makes it easy to model knowledge and to mesh distributed knowledge sources.

Why we need a new standard for the Semantic Web

On the Semantic Web, computers do the browsing for us. The “SemWeb” enables computers to seek out knowledge distributed throughout the Web, mesh it, and then take action based on it. To use an analogy, the current Web is a decentralized platform for distributed presentations while the SemWeb is a decentralized platform for distributed knowledge. RDF is the W3C standard for encoding knowledge.

There of course is knowledge on the current Web, but it's off limits to computers. Consider a Wikipedia page, which might convey a lot of information to the human reader, but to the computer displaying the page all it sees is presentation markup. To the extent that computers make sense of HTML, images, Flash, etc., it's almost always for the purpose of creating a presentation for the end-user. The real content, the knowledge the files are conveying to the human, is opaque to the computer.

What is meant by “semantic” in the Semantic Web is not that computers are going to understand the meaning of anything, but that the logical pieces of meaning can be mechanically manipulated by a machine to useful ends.

So now imagine a new Web where the real content can be manipulated by computers. For now, picture it as a web of databases. One “semantic” website publishes a database about a product line, with products and descriptions, while another publishes a database of product reviews. A third site for a retailer publishes a database of products in stock. What standards would make it easier to write an application to mesh distributed databases together, so that a computer could use the three data sources together to help an end-user make better purchasing decisions?

There's nothing stopping anyone from writing a program now to do those sorts of things, in just the same way that nothing stopped anyone from exchanging data before we had XML. But standards facilitate building applications, especially in a decentralized system. Here are some of the things we would want a standard about distributed knowledge to consider:

Files on the Semantic Web need to be able to express information flexibly. Life can't be neatly packed into tables, as in relational databases, or hierarchies, as in XML. The information about movies and TV shows contained in the graph below is really best expressed as a graph:

Knowledge as a Graph

Of course, we can't be drawing our way through the Semantic Web, so instead we will need a tabular notation for these graphs. Compare the table below to the figure above. Each row represents an arrow (an “edge”) in the figure. The first column has the name of the “node” at the start of the edge. The second column has the label of the edge itself (the kind of edge). The third column has the name of the node at the end of the arrow.

Start Node

Edge Label

End Node

vincent_donofrio

starred_in

law_&_order_ci

law_&_order_ci

is_a

tv_show

the_thirteenth_floor

similar_plot_as

the_matrix

...

Whether we represent the graph as a picture or in a table, we're talking about the same thing. Both describe what is abstractly called a graph. More on this later.

Files on the Semantic Web need to be able to relate to each other. A file about product prices posted by a vendor and a file with product reviews posted independently by a consumer need to have a way of indicating that they are talking about the same products. Just using product names isn't enough. Two products might exist in the world both called “The Super Duper 3000,” and we want to eliminate ambiguity from the SemWeb so that computers can process the information with certainty. The SemWeb needs globally unique identifiers that can be assigned in a decentralized way.

We will use vocabularies for making assertions about things, but these vocabularies must be able to be mixed together. A vocabulary about TV shows developed by TV aficionados and a vocabulary about movies independently developed by movie connoisseurs must be able to be used together in the same file, to talk about the same things, for instance to assert that an actor has appeared in both TV shows and movies.

These are some of the requirements that RDF, Resource Description Framework, provides a standard for, as we'll see in the next section. Before getting too abstract, here are actual RDF examples of the information from the graph above, first in the Notation 3 format, which closely follows the tabular encoding of the underlying graph:

RDF was originally created in 1999 as a standard on top of XML for encoding metadata — literally, data about data. Metadata is of course things like who authored a Web page, what date a blog entry was published, etc., information that is in some sense secondary to some other content already on the regular Web. Since then, and perhaps even after the updated RDF spec in 2004, the scope of RDF has really evolved into something greater. The most exiting uses of RDF aren't in encoding information about Web resources, but information about and relations between things in the real world: people, places, concepts, etc.

Introducing RDF

Unless you know Resource Description Framework (RDF) well, it's
probably best if you try to forget what you already know about it as you
read the rest of this section. RDF
exists at the intersection of a few different technologies, so it's easy
to be lead into thinking that it is merely a particular XML data format or a
tool for blog feeds. Forget what you know. Here is RDF from the
beginning.

RDF is a general method to decompose knowledge into small pieces, with some
rules about the semantics, or meaning, of those pieces. The
point is to have a method so simple that it can express any fact, and
yet so structured that computer applications can do useful things with knowledge
expressed in RDF. I say "method" in particular, rather than format, because one
can write down those pieces in any number of ways and still preserve the
original information and structure, just like how one can express the same meaning in
different human languages or implement the same data structure in
multiple ways.

In some ways, RDF can be compared to XML. XML also is designed to be
simple and applicable to any type of data. XML is also more than a file
format. It is a foundation for dealing with hierarchical, self-contained
documents, whether they be stored on disk in the usual brackets-and-slashes format,
or held in memory and accessed through a DOM API.

What sets RDF apart from XML is that RDF is designed to represent
knowledge in a distributed world. That RDF is designed for
knowledge, and not data, means RDF is particularly concerned with
meaning. Everything at all mentioned in RDF means something. It may be
a reference to something in the world, like a person or movie, or it may
be an abstract concept, like the state of being friends with someone
else. And by putting three such entities together, the RDF standard says
how to arrive at a fact. The meaning of the triple “(John, Bob, the state of being
friends)” might be that John and Bob are friends. By putting a lot of facts
together, one arrives at some form of knowledge. Standards built on top of
RDF, including RDFS and OWL, add to RDF semantics for drawing logical
inferences from data.

For comparison, XML itself is not very much concerned with meaning.
XML nodes don't need to be associated with particular concepts, and
the XML standard doesn't indicate how to derive a fact from a document.
For instance, if you were presented with a few XML documents whose root nodes
were in a foreign language you don't understand, you couldn't do anything
useful with the documents but display them. RDF documents with nodes
you can't understand could still actually be usefully processed because RDF specifies
some basic level of meaning. Now, this isn't to say that you couldn't
develop your own standard on top of XML that says how to derive
the set of facts in an XML document, but you'll find you've probably
just reinvented something like RDF.

The second key aspect of RDF is that it works well for distributed
information. That is, RDF applications can put together RDF files
posted by different people around the Internet and easily learn from them new
things that no single document asserted. It does this in two ways,
first by linking documents together by the common vocabularies they use,
and second by allowing any document to use any vocabulary. This allows
enormous flexibility in expressing facts about a wide range of things,
drawing on information from a wide range of sources.

Triples of knowledge

RDF provides a general, flexible method to decompose any knowledge into small pieces, called triples, with some rules about the semantics (meaning) of those pieces.

The foundation is breaking knowledge down into basically what's called a labeled, directed graph, if you know the terminology.

Each edge in the graph represents a fact, or a relation between two things. The edge in the figure above from the node vincent_donofrio labeled starred_in to the node the_thirteenth_floor represents the fact that actor Vincent D'Onofrio starred in the movie “The Thirteenth Floor.” A fact represented this way has three parts: a subject, a predicate (i.e. verb), and an object. The subject is what's at the start of the edge, the predicate is the type of edge (its label), and the object is what's at the end of the edge.

The abstract RDF model: Statements

RDF is nothing more than a general method to decompose information into
pieces. The emphasis is on general here because the same method can be
used for any type of information. And the method is this: Express
information as a list of statements in the form SUBJECT PREDICATE
OBJECT. The subject and object are names for two things in the world,
and the predicate is the name of a relation between the two. You can
think of predicates as verbs.

Here's how I would break down information about my apartment into RDF
statements:

SUBJECT

PREDICATE

OBJECT

I

own

my_apartment

my_apartment

has

my_computer

my_apartment

has

my_bed

my_apartment

is_in

Philadelphia

These four lines express four facts. Each line is called a statement
or triple.

The subjects, predicates, and objects in RDF are always simple
names for things: concrete things, like my_apartment, or abstract concepts,
like has. These names don't have internal structure or significance
of their own. They're like proper names or variables. It doesn't matter
what name you choose for anything, as long as you use it consistently
throughout.

Names in RDF statements are said to refer to or denote things in
the world. The things that names denote are called resources (dating back
to RDF's use for metadata for web resources),
nodes (from graph terminology), or entities. These terms are
generally all synonymous. For instance, the name
my_apartment denotes my actual apartment, which is an entity in the
real world. The distinction between names and the entities they denote
is minute but important because two names can be used to refer to the
same entity.

Predicates are always relations between two things. Own is a
relation between an owner and an 'ownee'; has is a relation between the
container and the thing contained; is_in is the inverse relation,
between the contained and the container. In RDF, the order of the subject
and object is very important.

The next aspect of RDF almost goes without saying, but I want to put
everything down in print: If someone refers to something as X in one place
and X is used in another place, the two X's refer to the same entity.
When I wrote my_apartment in the first line, it's the
same apartment that I meant when I wrote it in the other three
lines.

The rules so far already get us a lot farther than you might realize. Given this
table of statements, I can write a simple program that can answer
questions like "who own my_apartment" and "my_apartment has what." The
question itself is in the form of an RDF statement, except the program
will consider wh-words like who and what to be wild-cards. A simple question-answering
program can compare the question to each row in the table. Each
matching row is an answer. Here's the pseudocode:

The computer doesn't need to know what has actually means in English for
this to be useful. That is, it's left up to the application writer to
choose appropriate names for things (e.g. my_apartment) and to use the
right predicates (own, has). RDF tools are ignorant of what these names
mean, but they can still usefully process the information. (I'll get to
more useful things later.)

URIs to Name Resources

RDF is meant to be published on the Internet, and so the
names I used above have a problem. I shouldn't name something
my_apartment because someone else might use the name my_apartment for
their apartment too. Following from the last fact about RDF, RDF tools
would think the two instances of my_apartment referred to the same thing
in the real world, whereas in fact they were intended to refer to two
different apartments. The last aspect of RDF is that names must be
global, in the sense that you must not choose a name that someone else
might conceivably also use to refer to something different. Formally,
names for subjects, predicates, and objects must be Uniform Resource
Identifiers (URIs). (Technically, names can be Internationalized
Resource Identifiers (IRIs) but the distinction is not important.)

URIs can have the same syntax or format as website addresses, so you will
see RDF files that contain URIs like http://www.w3.org/1999/02/22-rdf-syntax-ns#type, where that URI is the
global name for some entity. This happens to be the URI for the concept
of "is a" (if you recognize it).

Now, in the SemWeb world, URIs are treated in a somewhat inconsistent
way, so bear with me here. On the one hand, URIs are supposed to be opaque.
The fact that a URI looks like a web address is totally
incidental. There may or may not be an actual website at that address,
and it doesn't matter. There are other types of URIs besides http:-type
URIs. URNs are a subtype of URI used for things like identifying books
by their ISBN number, e.g. urn:isbn:0143034650. TAGs are a general-purpose
type of URI. They look like tag:govtrack.us,2005:congress/senators/frist.
Not all URIs name web pages. That's the difference between a URI and a URL.
URLs are just those URIs that name things on the web that can be retrieved, aka "dereferenced".

So the convention goes: whatever their form, URIs you see in RDF
documents are merely verbose names for entities, nothing more.
Well, at least, that's how people felt about URIs till around 2007.

Starting in recent years there actually has been an expectation
that if you create an http: URI — or, any dereferencable
URI (a URL) — that you actually put something at that address so that RDF
clients can access that page and get some information. Here's the bottom
line: As for what a URI means in a document, what the URI is
simply doesn't matter, but when you use dereferencable URIs, there
may be an expectation that you put something on the web at that address.
We will return to this in the section about Linked Data.

URIs are used as global names because they provide a way to break down
the space of all possible names into units that have obvious owners.
URIs that start with http://www.govtrack.us/ are implicitly controlled
by me, or whoever is running the website at that address. By
convention, if there's an obvious owner for a URI, no one but that owner
will "mint" a new resource with that URI. This prevents name clashes.
If you create a URI in the space of URIs that you control, you can rest
assured no one will use the same URI to denote something else.
(Of course, someone might use your URIs in a way that you would not
appreciate, but this is a subject for another article.)

Since URIs can be quite long, in various RDF notations they're
usually abbreviated using the concept of namespaces from XML. As in
XML, a namespace is generally declared at the top of an RDF document and
then used in abbreviated form later on. Let's say I've declared the
abbreviation taubz for the URI http://razor.occams.info/index.html#.
In many RDF notations, I can then abbreviate URIs like
http://razor.occams.info/index.html#my_apartment by replacing the namespace
URI exactly as it is given in the declaration with the abbreviation and
a colon, in this case simply as taubz:my_apartment. The precise
rules for namespacing depend on the RDF serialization syntax being
used.

Importantly, namespaces have no significant status in RDF. They are merely
a tool to abbreviate long URIs.

I might re-write the table about my apartment as it is below, replacing
the simple names I first used above with abritrary URIs:

The table above is just an informal table representing the graph
of information that exists at an abstract level, which could just
as well be described by the figure below. We will talk more
about standard ways of actually writing out RDF later on.

RDF as a Graph

Wrapping It Up So Far

And that's RDF. Everything else in the Semantic Web builds on those
three rules, repeated here to hammer home the simplicity of the system:

A fact is expressed as a triple of the form (Subject, Predicate, Object).

Subjects, predicates, and objects are given as names for entities, whether
concrete or abstract, in the real world.

Names are in the format of URIs, which are opaque and global.

These concepts form most of the abstract RDF model for encoding knowledge. It's analogous to the common API that most XML libraries provide. If it weren't for us curious humans always peeking into files, the actual format of XML wouldn't matter so much as long as we had our appendChild, setAttribute, etc. Of course, we do need a common file format for exchanging data, and in fact there are two for RDF, which we look at later.

Blank Nodes and Literal Values

There is actually a bit more to RDF than the three rules above. So
far I've described three types of things in RDF: resources (things or
concepts) that exist in the real world, global names for resources (i.e. URIs), and
RDF statements (triples, or rows in a table). There are two more things.

Literals

The first new thing is the literal value. Literal values
are raw text that can be used instead of objects in RDF triples.
Unlike names (i.e. URIs) which are stand-ins for things in the real world,
literal values are just raw text data inserted into the graph.
Literal values could be used to relate people to their names,
books to their ISBN numbers, etc.:

Blank/Anonymous Nodes

Then there are anonymous nodes, blank nodes, or
bnodes. These terms are all synonymous. The words anonymous or blank are meant to indicate
that these are nodes in a graph without a name, either because the
author of the document doesn't know or doesn't want to or need to provide a name.
In a sense, this is like saying “John is friends with someone, but
I'm not telling who.” When we say these nodes are nameless, keep in mind two
things. First, the real-world thing that the node denotes is not inherently
nameless. John's friend, in the example, has a name, after all. Second,
when we say nameless here, we are refering to the concept of naming things
with URIs. Actual blank nodes in documents may be given “local” identifiers
so that they may be referred to multiple times within a document. It is
only that these local identifiers are explicitly not global, and have no
meaning outside of the document in which they occur.

If you're familiar with formal semantics, blank nodes can often be thought
of as existentially bound variables.

Here's one way literal values and anonymous nodes are used. One literal
value in the example is "Joshua Tauberer", and the anonymous
or blank node is _:anon123.

To distinguish between URIs, namespaced names (abbreviated URIs),
anonymous nodes, and literal values, I used the following common convention:

Full URIs are enclosed in angle brackets.

Namespaced names are written plainly, but their colons give them away.

Anonymous nodes are written like namespaced names, but in the reserved "_" namespace with an arbitrary local name after the colon.

Literal values are enclosed in quotation marks.

You should take a moment to try to visualize what graph is described by the
table. Picture arrows between nodes.

There is one blank node in this example, :anon123. What we know
about this resource is that it is the author of <urn:isbn:0143034650>
and it has the name _Lawrence Lessig. Because no global name is used
for this resource, we can't really be sure who we're talking about here. And,
if we wanted to say more about whatever is denoted by _:anon123,
we would have to do it in this very RDF document because we would have no
way to refer to this particular Lawrence Lessig outside of the document.

More on Literals: Language Tags and Datatypes

Literal values can be optionally adorned with one of two pieces
of metadata. The first is a language tag, to specify what language
the raw text is written in. The language tag should be viewed as
a vestige of how RDF was used in the early days. Today it is an
ugly hack. You may see “ "chat"@en ”, the literal value
“chat” with an English language tag, or “ "chat"@fr ”,
the same with the French language tag.

Alternatively, a literal value can be tagged with a URI indicating a datatype.
The datatype indicates how to interpret the raw text, such as as a number,
a URI, a date or time, etc. Datatypes can be any URI, although the datatypes
defined in XML Schema are used by convention.
The notation for datatypes is often the literal value in quotes followed
by two carets, followed by the datatype URI (possibly abbreviated):

Datatypes

"1234" This is an untyped literal value. No datatype.
"1234"^^xsd:integer This is a typed literal value using a namespace.
"1234"^^<http://www.w3.org/2001/XMLSchema#integer> The same with the full datatype URI.

Datatypes are a bit tricky. Let's think of the datatype for floating-point numbers.
At an abstract level, the floating-point numbers themselves are different from the text
we use to represent them on paper. For instance, the text “5.1” represents
the number 5.1, but so does “5.1000” and “05.10”. Here there
are multiple textual representations — what are called lexical representations —
for the same value. A datatype tells us how to map lexical representations
to values, and vice versa.

The semantics of RDF takes language tags and datatypes into account.
This means two things. First, a literal value without either a language
tag or datatype is different from a literal with a language tag is different
from a literal with a datatype. These four statements say four different
things and none can be inferred from the others:

So, an untyped literal with or without a language tag is not the same as a typed literal.
The second part of the semantics of literals is that two typed literals
that appear different may be the same if their datatype maps their
lexical representations to the same value. The following statements are
equivalent (at least for an RDF application that has been given the
semantics of the XSD datatypes):

Datatype Semantics

#john ex:age "10"^^xsd:float
#john ex:age "10.000"^^xsd:float

These mean John's age is 10. That is, the textual representation of the
number is besides the point and is not part of the meaning encoded by
the triples. Note that if the float datatype were not specified, the
triples would not be inherently equivalent, and the textual representation
of the 10 would be maintained as part of the information content.

More on Blank Nodes: Some Caveats

Unlike the rule for URIs stating that they are global, local identifiers
used to name blank nodes are explicitly not global. A local bnode
identifier used in two separate documents can refer to two things. Still,
however, the identifier itself is arbitrary, and the actual identifier
used in any particular case is not a part of the information content of
the document.

Anonymous nodes are often used to avoid having to assign
people URIs, as in the example above. They're also often used in representing more complex relations:

Here the anonymous node was used as an intermediate step in the
relation between me and the parts of my name. The node represents my
name in a structured way, rather than using a single opaque literal value
"Joshua Ian Tauberer". RDF only allows binary relations, so it's
necessary to express many-way relations using intermediate nodes, and
these nodes are often anonymous.

Reading and Writing RDF

In this section, we describe two standard ways of writing out RDF.
Because RDF is the abstract graph-like model, we call these
written formats serialization syntaxes for RDF.

<a name="Notation 3"> </a>

Notation 3

Notation 3 (“N3”), or the subset called Turtle, is a de facto standard for writing out RDF.
It is not a W3C standard, but it is widely deployed, commonly used in email discussions between
SemWeb developers, and the most important RDF notation to understand because it most clearly
captures the abstract graph.

Here is an example, and it should be mostly clear what triples are encoded
by this document.

This is meant to express the geographic latitude and longitude coordinates of Princeton University,
that it has a department, and that that department is named "Department of Computer Science".

In N3 and Turtle, statements are just written out as the subject URI (in brackets or abbreviated with
namespaces), followed by the predicate URI, followed by the object URI or literal value, followed by a period.
Also, namespaces are declared at the top with the @prefix directive. Full URIs are reassembled
from the abbreviated notation, like geo:lat, just by concatenating the corresponding
namespace with the second part of the abbreviation, i.e. http://www.w3.org/2003/01/geo/wgs84_pos#

The second way uses square brackets, and is recursive. Square brackets
represent a blank node, and inside the brackets you can put predicate-object pairs
separated by semicolons to encode properties of that blank node. For example:

The two blank node N3 examples encode precisely the same information
as in the previous example.
Keep in mind that it's not the form of the N3 document that ever matters.
Whether someone uses commas, semicolons, brackets, or underscores is always besides the point. What matters
is what triples are encoded by the document.

The questions you should be asking yourself are: What are the standardized rules for interpreting the nodes and attributes in this document as an abstract RDF graph, and what are the standardized rules for producing a document in this form given an abstract RDF graph?

In an RDF/XML document there are two types of XML nodes: 1) resource XML nodes and 2) property XML nodes. Resource XML nodes are the subjects and objects of statements, and they usually are rdf:Description tags that have an rdf:about attribute on them giving the URI of the resource they represent. In this example, the rdf:Description nodes are the resource nodes.

Resource XML nodes contain within them property XML nodes (and nothing else). Each property XML node represents a single statement. The subject of the statement is the outer resource XML node that contains the property. There are four statements in this example, the first three with the subject http://www.princeton.edu and the fourth with the subject http://www.cs.princeton.edu. The URIs of the predicates in the four statements are (abbreviated) geo:lat, geo:long, edu:hasDept, and dc:title.

Statements can have either resources or literal values as their objects, as discussed earlier in this article.
To put a literal value as the object of a property XML node, the value simply goes inside the XML element. See how "40.35" and "-74.66" are used above.

Using a resource as an object can be done in two ways. The first way is illustrated in the example, by using a rdf:resource attribute in which you put the URI of the object. You can describe properties of that object elsewhere, as is done above in a separate rdf:Description resource XML node at the top-level of the document.

The second way is to tuck the rdf:Description element right inside the property XML node and leave off rdf:resource. The example above is equivalently written as:

From the specification we are told how to take the XML document above and get out of it this table of statements. You can check that you understand the format by comparing this
document with the first N3 example and the explanation in the previous section. The two documents
encode exactly the same triples.

We'll discuss just one syntactic shortcut here because it is
so commonly used. Consider this example:

This example contains just a single triple. The rdf:type
predicate is used to say what type of thing a resource is. Here, the
point is to say that the resource denoted by
http://www.princeton.edu is a university. Because rdf:type
is so common, rather than creating a property XML node for it, it
can be abbreviated by replacing the rdf:Description tag
with the type:

These two documents are completely equivalent. So, when something
other than rdf:Description is the name of a resource XML
node, then there is a rdf:type triple encoded.
rdf:Description is not itself a type. It is a syntactic placeholder. It does not
translate into a triple.

It is possible to use blank nodes in RDF/XML as well, and there are
many syntactic shortcuts that you will see commonly used.
You will have to
consult the specifications to see these things, as it's time we move on.

Use a Validator!

Triples are the bread and butter of RDF. When applications use RDF in XML or N3 format, they see the triples. The hierarchical structure of the XML and the order of the XML nodes is lost in the table of triples, which means that, like whitespace, it was not a part of the information meant to be encoded in the RDF.
With XML and N3, the particular syntactic or formatting choices are always besides the point.
Make sure you understand the triples encoded in any document!

When in doubt, you can use an online validator to convert RDF in one
format to another more triple-oriented format. Converting RDF/XML to
N3 or, better yet, N-Triples is a good way to get a clear view of
what triples are encoded by the XML.

The W3C maintains a RDF validator
which converts an RDF document that you paste into a text box into a table
of RDF triples. This is good also to check your RDF syntax.

There is also a validator on this site
which can convert back and forth between N3 and RDF/XML. You can use
this to check your RDF/XML or N3 syntax, and to double-check what
triples are encoded in N3 documents.

Distributed Information

So far, RDF might seem unnecessarily complex. For all of the examples
so far, the information could be written in plain XML just the same.
Taking a single document in isolation,
this is true. But RDF excels when every document is thought
of as part of the bigger picture, part of a huge graph of
knowledge spread throughout the Internet. In this "Semantic
Web," every RDF statement posted somewhere on the net contributes
a small piece to the greater body of knowledge.

But, not every piece of knowledge in the Semantic Web is
automatically interpretable. Applications have to be told what
to do with particular predicates, or else the RDF is just URI salad.
In the next section, I tackle this further.

Choosing the Right Predicates

As you've seen so far, one can use RDF to model any type of knowledge
without having to use any centrally approved notions. If no one
has coined a URI for something you want to describe, you can create
your own URI for it. This goes for not just subjects and objects but
predicates as well. In the examples above, I made up all of the URIs,
and it's still perfectly valid RDF.

The trouble is that if I make up all of my own URIs, my RDF document
has no meaning to anyone else unless I explain what each URI is intended
to denote or mean. Two RDF documents with no URIs in common have no information
that can be interrelated. But, two documents that have some URIs in
common are talking about some of the same things. Someone else might
want to publish information about Law & Order. By using the same URI for
the television show in the two documents, RDF tools will be able to recognize
that the two documents are describing the same thing.

Here is an example of using RDF to describe books. Let's say,
hypothetically, that the Library of Congress posted an RDF list of books
and Amazon.com did the same.

RDF from the Library of Congress (hypothetical)

<urn:isbn:0143034650> dc:title "Free Culture : The Nature and Future of Creativity"
<urn:isbn:0613917472> dc:title "Code and Other Laws of Cyberspace"
<urn:isbn:B00005U7WO> dc:title "The Future of Ideas"

The URIs for the books (urn:isbn:...)
are what tie the two files together. An RDF application using these
files would be able to report that "The Future of Ideas" is "$9.95" at
Amazon because both the title and the price are related to a common
resource, denoted by the URI urn:isbn:B00005U7WO. If the files did not
have the same URI for that book, nothing would indicate that the titles
went with particular prices.

It's also important to use the same predicates others are using when
the predicates you want already exist. This allows existing
applications to make use of your information, without needing to be
modified by developers to recognize your own URIs. For instance, if you're
describing documents, you should use the existing Dublin Core (DC) title and
description predicates so RDF applications that already use those
predicates will be able to extract the titles and descriptions from your
data. If you're describing people, use existing Friend of a Friend (FOAF) predicates so that
FOAF-based applications will be able to take advantage of the
information.

I used dc:title in my books example above because the predicate already
existed to convey the information I wanted to put into the files: a
relation between a book and its title. No standard predicate exists for
giving the Amazon.com price of a book, so I made one up. (Remember
amazon:price is an abbreviated form of a full URI, but I've left out the
declaration of amazon: for the sake of brevity.)

Be careful of respecting the meanings of existing predicates, though.
RDF applications expect the dc:title predicate to relate a document to
its title. You wouldn't want to use it to relate a telephone to its
phone number, for instance, because existing RDF applications that get
a hold of your data will use it in ways that simply won't make sense.

Meshing Information: A Real-World Example

Most of the time in today's world you get information from one place.
For instance, a single database might contain the information for an
entire product line. There isn't much on the web in the way of
distributed information because it's usually hard to put information
together from multiple sources, each of which may have its own data
format and conventions.

Here's a scenario where distributed information makes a lot of sense: a
database of products from multiple vendors and reviews of those products
from multiple reviewers. No one vendor is going to want to be
responsible for maintaining a central database for this project,
especially since it will contain information for competing products and
negative reviews. Likewise, no one reviewer may have the resources to
keep such a database up to date. How can this become a reality?

RDF is particularly suited for this project. Each vendor and reviewer
will publish a file in RDF on their own websites. The vendors will
choose URIs for their products, and the reviewers will use those URIs
when composing their reviews. Vendors don't need to agree on a common
naming scheme for products, and reviewers aren't tied to a
vendor-controlled data format. RDF allows the vendors and reviewers to
agree on what they need to agree on, without forcing anyone to use one
particular vocabulary.

It's an open question just how an application will retrieve these files,
but I'll put that aside. Once an application has these files, it has
enough information to relate products to reviews to prices, and even to
vendor-specific information like vendor1:partno and
vendor2:warranty_code. What you should take away from this example is
how unconstraining RDF is, while still allowing applications to
immediately be able to relate information together.

And, RDF applications don't need to know about the nature of the data in
these files to be able to make use of it. If an application already knows what
the dc:title and dc:description predicates are for,
and nothing else, then it is at least able to present the titles and
reviews of the four products. Note that the presence of predicates
the application doesn't understand, like review:rating, doesn't
impact the application at all. It can simply ignore it without worrying
that it has misunderstood the rest of the data.

In addition, the vendors and reviewers did not have to agree on much to
make this happen. They had to agree to use RDF, but they didn't have to
agree on any specific data format or even on specific URIs. It helps
that they agreed on URIs for indicating titles and prices, although even
that wasn't strictly necessary. But, crucially, they didn't have to
enumerate everything any vendor would want to include about their
products. When a vendor needed something that wasn't already agreed on
(product numbers and warranty codes), they were able to create a new
predicate without disrupting any existing systems. Likewise, the
reviewers aren't tied to a vendor-controlled vocabulary. Reviewers were
free to add their own relations, such as a ratings, to their RDF files.

Another way to look at this from the standpoint of interoperability.
Vendor 1's format is entirely interoperable with anyone else's format,
even though Vendor 1 didn't hash out a common format with anyone. When
someone comes along and wants to be interoperable with Vendor 1's
information, they don't need a new format, they just need to choose the
right subjects, predicates, and objects.

Is this any better than XML? I'll take a look at that later on.

Comparing RDF with XML

Earlier I showed how RDF could be used to create a
decentralized database for product information and reviews. Here is how
a similar system would be accomplished using non-RDF XML.

First, though, I'll look at how a single vendor would approach this
alone. The vendor, if it was so inclined, might publish an XML file
with a node for each product, and within that a node for its name and
some vendor-specific information.

What can be done with this file? An application to display this
information would have to be specifically programmed to know that
<product> nodes are for products, with titles in the title
attribute, etc. And, if a reviewer wanted to post a review XML file,
the only way to relate reviews to products would be by name. Two
vendors might have products with the same name, so vendors would have to
use IDs of some sort to keep their products separate.

The first problem arises. Vendors will need to come together to
establish a product ID system so that IDs are unique within the local ID
space of this vendor consortium. RDF solves this problem by requiring
that all IDs be globally unique, and by using URIs for IDs, allowing
individuals to create IDs in a local space that they control.

With IDs in the XML files, reviewers will be able to identify
products in their review files, but applications still won't be able to
relate products to reviews. IDs aren't enough. The applications
themselves have to be told where to find the IDs. In Vendor 1's file,
it's in the product node's id attribute. In the review files,
the same attribute is to give an identifier to each review. To connect
the review to the product, another attribute is used. Even if the
vendors and reviewers agreed where to put the IDs, the application still
needs to know where it is. RDF solves this problem by making everything
a global ID (except literals), so basically anything the RDF application sees is
an ID that means something.

The vendors and reviewers next have to decide what constitutes a
valid product or review XML file, and how the nodes of these files
should be interpreted by software. If these files are defined by a DTD
or Schema, the files will not be extensible. Before adding anything new
into these files, such as vendor-specific information, all of the
vendors and reviewers will need to agree to a DTD or Schema change.

On the other hand, the vendors and reviewers can go without a DTD or Schema.
There are no rules for what elements go where, which provides the
flexibility that the vendors and reviewers need. But, then there must be
some guide to what the elements of the XML files mean, and thus a
central authority for deciding these things. Unless, the vendors and
reviewers use namespaces to allow them to develop their own
vocabularies.

I could go on, but you should see now that XML isn't particularly suited for
distributed, extensible information unless that XML looks a lot
like RDF!

RDF about RDF

So far I've shown how RDF can be used to describe the relationships
between entities in the world. RDF can be used at a higher level, too,
to describe RDF predicates and classes of resources. Ontologies, schemas,
and vocabularies, which all mean roughly the same thing, are RDF
information about... other RDF information.

RDF ontologies play a vaguely similar role as XML Document Type
Definitions and XML Schema. But they are as different as they are the
same. DTDs and XML Schema specify what constitutes a valid document.
They don't indicate how a document should be interpreted, and they only
restrict the set of elements that can be used in any given file. RDF
ontologies, which are themselves written in RDF, provide relations
between higher-level things, entirely for the purpose of indicating to
applications how some information should be interpreted. RDF
ontologies also don't restrict at all which predicates are valid where.
Any statement is valid anywhere, as before.

RDF Schema (RDFS)
and Web Ontology Language (OWL) define
a few classes and predicates that are, merely by convention,
used to provide higher-level descriptions of data.

The classes and predicates below are some of the standard tools that
are available to you when you write ontologies. For some examples
of complete ontologies, including the standard RDF, RDFS, and OWL
ontologies, see SchemaWeb.

RDF Schema (RDFS)

RDF Schema (RDFS) introduces the notion of a class. A class is a
type of thing. For instance, you and I are members of the class Person.
Computers are members of the class Machine. Person and Machine are
classes. That is to say, they are themselves members of the type Class. The first
higher-level predicate is the rdf:type predicate. (rdf
is the usual namespace abbreviation for
http://www.w3.org/1999/02/22-rdf-syntax-ns#.) It relates an
entity to another entity that denotes the class of the entity. The
purpose of this predicate is to indicate what kind of thing a resource
is. But, as with anything else in RDF, the choice of class is either by
convention or arbitrary.

To add class information into the vendor files from a few sections ago,
a vendor would simply add this:

Adding Type Information

vendor1:productX rdf:type general:Product

As with choosing predicates, it's helpful to choose URIs for classes
that are used by others. Agreement among different parties for classes,
and for the other things in this section, is very important.

One interesting class is rdf:Property. Any entity
used as a predicate is an rdf:Property. Thus, from the
examples above, we can conclude:

To be explicit, one might include these statements in an RDF
ontology describing the predicates used in the data.

Other RDFS predicates are used to provide even more information about
predicates. The rdfs:domain and rdfs:range predicates
relate a predicate to the class of resources that can
serve as the subject or object of the predicate, respectively. Here's an example:

These statements say that the subjects of vendor2:warranty_code
are things typed as general:Product and the objects of this
predicate are literals (raw text). That's true. Recall:

warranty_code

vendor2:product1 vendor2:warranty_code "None."

vendor2:product1 is a product, and "None." is a
literal value.

Specifying domains and ranges for predicates
serves two purposes. First, it allows applications to make inferences
from statements about the types of things. If it sees something that is
the subject of vendor2:warranty_code, it can infer that it is
a general:Product. Second, these specifications serve
as documentation of a vocabulary for people. The RDF itself is used
to indicate how predicates should be used.

Two RDFS predicates are used to give relations between classes
and predicates. The rdfs:subClassOf relation indicates
that one class is a sub-class of another. For instance, the class
Mammal is a sub-class of the class Animal. Anything true of
the Animal class is also true of the Mammal class, and applications
are able to make such inferences once this predicate is present. The
rdfs:subPropertyOf is similar, but for predicates. For
example, the friend predicate is a sub-property of the knows
predicate. Any friend is someone you know.

Web Ontology Language (OWL)

Web Ontology Language (OWL) defines more classes that let RDF authors
define more of the meaning of their predicates within RDF. Four
classes of predicates defined by OWL include:
owl:SymmetricProperty, owl:TransitiveProperty,
owl:FunctionalProperty, and owl:InverseFunctionalProperty.
(The OWL namespace is http://www.w3.org/2002/07/owl#.)
Each of these classes is rdf:subClassOf rdf:Property.

Applications can use these classes, by convention, to make inferences
about data. You would use these classes in an ontology like this:

Defining amazon:price

amazon:price rdf:type owl:FunctionalProperty

Because these classes are defined in the OWL ontology as being sub-classes
of rdf:Property, applications can infer the following:

Defining amazon:price

amazon:price rdf:type rdf:Property

That's the same statement as earlier. So, when you use a sub-class
in place of the 'parent' class, you're being strictly more informative.
Anything the application knew before it still knows (if it has inferencing
capabilities and knows the OWL ontology), and it knows more
because the sub-class is more specific.

OWL symmetric properties tell applications that the
following inference is valid. If the application sees the statement
S P O, and if P is typed as a symmetric property, then O P S is also
true. For instance, we think of the has-friend relation are being symmetric.
If you're my friend (ME HAS_FRIEND YOU), I'm your friend (YOU HAS_FRIEND ME).

OWL transitive properties work like this. If the application sees
the statements X P Y and Y P Z, and if P is typed as a transitive
property, then X P Z is also true. rdfs:subClassOf is a transitive
relation. If Mammal is a sub-class of Animal and Animal is a sub-class
of Organism, then Mammal is a sub-class of Organism.

OWL functional and inverse-functional properties indicate how many
times a property can be used for a given subject or object. A functional
property is one that has at most one value for any particular subject. An
example is the hasBirthday relation between a person and his or
her birthday. Everyone has just one birthday, so for any given subject
(person), there can be just one object (birthday). But, the
owns relation between an owner and ownee is not functional.
People can own more than one thing.

Inverse functional properties do the same in reverse. For any
object, there is only one subject for a particular inverse functional
property. The has_ISBN relation is inverse functional. For any ISBN,
there is only one book that has that ISBN. The has_ISBN relation may
not be functional. Can a book have more than one ISBN number?

Functional and inverse-functional properties can be used by
applications to infer things like two entities denote the same
thing. For instance, take the following input:

If this data is consistent, then the fact that ex:isbn is
marked as an inverse-functional property lets the application conclude
book:a and book:b denote the very same book. They
have the same ISBN, and since the ISBN relation was marked as inverse-functional,
then the two subjects must denote the same book. Recall
that two names can refer to the same thing.

Linked Data for the Web

Earlier I left two points to be discussed later. The first was
a remark that when using http:-type URIs, there are expectations
that something actually exists at that web URL. The second was the
open question about how Semantic Web clients are supposed to find
RDF data on the web. A new Semantic Web community movement under
the name Linked Data seeks to provide some answers to these questions.

The notion of Linked Data is to bring the concept and benefits of
hyperlinking between HTML documents on the World Wide Web to RDF
documents on the Semantic Web. The core principle is that http:-type URIs
should be used for RDF resources, so that RDF documents can
exist at those locations describing the resources. When those documents
mention other resources, if they have http:-type URIs then
SemWeb clients can jump from document to document finding more information
as it goes.

To take an example: I have minted the URI http://www.rdfabout.com/rdf/usgov/geo/us/ny
to represent the state of New York in the United States. If you visit that URL
you will get back an RDF document describing New York. And it refers to some
other resources, which happen to have http:-type URIs that you could
retrieve to get documents describing the other resources. Don't confuse
the documents you get back with the resources named by the URIs themselves. There's no guarantee that
the document you get back will even mention the resource named by that address
(though that would certainly defeat the purpose).

Terminology

I chose to use a http:-type URI
so that it is “dereferencable”. Dereferencable is a term in the World Wide
Web, not the Semantic Web, which means a URI that is a URL, or in other
words a URI with the http: scheme (among others) that specifies
how to fetch a document at that address. The tag: and urn:
schemes do not specify how to find documents with those types of URLs,
so such URIs are not dereferencable. You can't put them in your browser
and get back a document. URIs that you can put in your browser are
dereferencable.

The distinction above between document — what you can get back from
a browser — and resource — something named by a URI — highlights some
common terminology people use. An “information resource” is
something that can be transmitted electronically. Documents, such as web
pages and RDF/XML documents, images, and binary files are all information
resources. “Non-information resources” are those resources
that can be named by a URI but which cannot be transmitted electronically.
Human beings, abstract concepts, etc. are non-information resources. As we've seen,
both information and non-information resources can be named with URIs.
However, browsers can only display information resources. They can display
representations of non-information resources (such as pictures of
people), but they are (by definition) incapable of displaying a non-information
resource itself.

Linked Data Under the Hood

Using HTTP GET

According to web architecture standards, the HTTP 200 OK response
to requests is to be used for URIs denoting information resources only. So when
visiting the URI above for New York, a non-information resource, the
web server at the other end first sends back a HTTP 303 See Other response,
i.e. a redirect.
This indicates that the URI is not something the web server can provide
directly, because it cannot transmit a non-information resource over
the wire. It sends back instead a URL directing the user agent to
an information resource, with the implication that information about
the original URI can be found in the document at that URL.

If you use Linux, you can observe this with the curl
or wget command-line tools. Running curl as
shown below prints out the redirect browsers get when going
to the URI for New York.

If you look carefully, you'll see that in this case the redirect
happens to take you to what looks like a dynamically-generated URL.
More on this URL later. Often, however, you will be redirected
to a static RDF/XML document (with a .rdf extension, for instance).

After following the redirect, an RDF/XML document that
describes New York is returned.

You may also want to try wget with the -S option
to view the HTTP response headers:

Using wget to follow the linked data

$ wget -S -O /dev/null http://www.rdfabout.com/rdf/usgov/geo/us/ny

Content Negotiation and Link tags

The HTTP protocol allows for different response documents depending on
some aspects of the request. In particular, the requesting HTTP client
can specify in the HTTP Accept header the MIME type of what it wants, say an HTML (text/html)
page, as is usually the case, or something else. This allows the server to overload its
response, and this is called content negotiation.

A web-browsing client accessing an RDF URI might want an HTML
representation of a resource to display to the user (some textual description
of what the URI represents), whereas an RDF client may want an RDF/XML
description of the resource. In this case, the client specifies
Accept: application/rdf+xml. We think of the HTML page and
the RDF/XML document as two information-resource representations
of the same non-information resource (i.e. what the URI denotes).

HTTP GET's on RDF URIs, as just discussed, is one way for RDF clients
to find triples about a particular resource. And, content negotiation
allows there to be both an HTML page and an RDF document at the same URL.
Another way to associate a URI with an HTML page to an RDF document is
to use the HTML link tag. This tag will help RDF clients
discover related RDF information from a page the user may already be
browsing.

In this example, the file moreinfo.rdf is expected to be
an RDF/XML document, hopefully either describing the URI of the page on
which the link tag is found, or describing a resource
of primary importance on the HTML page.

Querying Semantic Web Databases

A Semantic Web or RDF database, also called a triple store
because it stores triples (which, if you recall, is another word
for an RDF statement), is generally any repository of RDF statements
that supports some form of querying operation. This is important, of
course, because once you have a document, you have to ask
what is inside.

The process of asking abstracts over how the
triples were originally stored, such as in XML or Notation 3.
As a result, triple stores often cannot tell you what order
statements appeared in in the original document, what namespace,
prefix, and local name were used for any given resource, or what
local name was used to identify blank nodes within documents.
These are all incidental artifacts of serializing RDF that are
not a part of the information content of the documents, after all,
and are accordingly meant to be forgotten.

What a triple store can tell you is what statements
are inside (in an unordered fashion). Usually a triple store
can answer the question of whether a particular statement
that you provide is contained in the store, and some will
be able to provide a set of statements that match a given
pattern.

Triple stores can roughly be divided into two types.
On the one hand, there are pure, low-level triple stores that
maintain a repository of triples and answer questions about
the existence of triples in the repository. On the other hand,
there are triple stores that can answer more complicated
questions that rely on the semantics of RDFS and OWL (described above),
or on other rules of logic provided specially. We will discuss
the former type only.

As in the land of RDBMSs and XML, there is for RDF also a
standard query language, and it is called SPARQL. The lexicographic
similarity to SQL is not an accident. SPARQL is a query language
for RDF modeled roughly after SQL, although they are quite different.
Further, the SPARQL standard is only concerned with asking queries.
No part of the SPARQL standard addresses adding statements into a
triple store or otherwise modifying the contents of a triple store.
Triple stores of course may support these operations, but not through
the SPARQL standard.

As with the introductions of Notation 3 and RDF/XML above, this
section is meant to provide only the very basics of the SPARQL
language.

Let's start with some example data set in a hypothetical
RDF repository:

The SPARQL language has four types of queries: SELECT, ASK,
DESCRIBE, and CONSTRUCT.

The SELECT query is most similar to SQL in that you provide
a query and you get back a tabular response. However, the similarity
pretty much ends there. In SPARQL's SELECT query, you are asking
to find resources. For instance, a query might ask for
which resources "own" "my apartment".
That would be to say, which resources x can be found in
statements in the repository of the form x ex:own taubz:my_apartment.
The answer, according to the example repository data above, would
be just taubz:me. Or, which resources are in my apartment?
That is, which resources are the objects x of statements
like taubz:my_apartnemtn ex:contains x.
This differs from SQL, in which each row of results is a row from a table.
Here, each row is a resource.

SPARQL, like SQL, has a WHERE clause. However, the WHERE clause
in SPARQL is a graph pattern. A graph pattern is like a graph, in Notation 3 format,
except that besides URIs, bnodes, and literals, you are also permitted
to use variables. Variables are placeholders, precisely like the
x above. The variables represent the resources you are looking for.
The second question above would be written in SPARQL as:

A few things are of note. First, the name of the variable
?what is incidental, as are names in general in RDF,
except that it must begin with a question mark.
Second, while some things look a lot like N3, such as the graph
pattern syntax, other parts do not, such as the PREFIX declarations.
Lastly, the output of a SPARQL query is, abstractly, a table. How
you access the table depends on the SPARQL engine you use. However,
there is a standard XML output format. We won't look at that format
here, however.

Graph patterns can contain more that one statement. This has the
effect of either 1) filtering the resources that might be returned,
or 2) querying based on longer paths through the repository data.
For instance, to ask what is in my apartment that I own,
that is, to filter the results to contain only the
resources that I own, add a second statement to the graph pattern:

Each resource returned must be able to substitute into
all occurrences of the variable, ?what, so that
each statement in the graph pattern exists in the repository.
The resource taubz:friends_junk is ruled out because
when we substitute that resource into the graph pattern, yielding

one of those statements, namely the second one, does not
exist in the repository.

A query to find resources contained in any object I own
(my computer which is contained in my apartment, my friend's junk
which is also contained in my apartment, and my pens and pencils
which are contained in my desk)
would use a graph pattern that has a longer path through
resources in the repository. We can rephrase this query
as a conjunction of two statements, as before, but this time
using two variables. This is: For any pair of resources x
and y that we can find, such that
A) taubz:me ex:own y and B) y
ex:contains x, report resource x. Here is
the SPARQL query and expected output:

To push this example further, rather than asking
just for the list of objects, we can ask for the pairs
of ?container and ?what values that
go together. The only change is adding the ?container variable
to the SELECT line:

Each row is verified by substituting for each variable
in the graph pattern the corresponding value listed in
that row, one row at a time. As before,
if after substituting all of the values for the variables
all of the statements in the graph pattern occur in the
repository, the row is included in the result output.

Before we conclude, here is some more terminology.
Within any particular result row, a variable is said to
be "bound" by the value for that variable in that
row. In the first row in the table for example 4 above,
the variable ?container is bound by taubz:my_apartment.

Until a future revision of this article, we won't
go into SPARQL any further.