Jan Wielemaker
University of Amsterdam/VU University Amsterdam
The Netherlands
E-mail: J.Wielemaker@vu.nl

Abstract

This document describes the SWI-Prolog semweb package. The
core of this package is an efficient main-memory based RDF store that is
tightly connected to Prolog. Additional libraries provide reading and
writing RDF/XML and Turtle data, caching loaded RDF documents and
persistent storage. This package is the core of a ready-to-run platform
for developing Semantic Web applications named
ClioPatria,
which is distributed separately. The SWI-Prolog RDF store is among the
most memory-efficient main-memory stores for RDF.1

1 http://cliopatria.swi-prolog.org/help/source/doc/home/vnc/prolog/src/ClioPatria/web/help/memusage.txt

Version 3 of the RDF library enhances concurrent use of the
library by allowing for lock-free reading and writing using short-held
locks. It provides a Prolog-compatible logical update view on the
triple store and isolation using transactions and snapshots.
This version of the library provides near real-time modification and
querying of RDF graphs, making it particularly interesting for handling
streaming RDF and graph manipulation tasks.

The core of the SWI-Prolog package semweb is an
efficient main-memory RDF store written in C that is tightly integrated
with Prolog. It provides a fully logical predicate rdf/3
to query the RDF store efficiently by using multiple (currently 9)
indexes. In addition, SWI-Prolog provides libraries for reading and
writing RDF/XML and Turtle and a library that provides persistency using
a combination of efficient binary snapshots and journals.

Below, we describe a few usage scenarios that guided the current
design of this Prolog-based RDF store.

Application prototyping platform

Bundled with ClioPatria,
the store is an efficient platform for prototyping a wide range of
semantic web applications. Prolog, connected to the main-memory based
store, is a productive platform for writing application logic that can be
made available through the SPARQL endpoint of ClioPatria, through an
application-specific API (typically based on JSON or XML) or as an
HTML-based end-user application. Prolog is more versatile than SPARQL,
allows composing the logic from small building blocks and does not suffer
from the object-relational impedance mismatch.

Data integration

The SWI-Prolog store is optimized for entailment on the
rdfs:subPropertyOf relation. The rdfs:subPropertyOf
relation is crucial for integrating data from multiple sources while
preserving the original richness of the sources because integration can
be achieved by defining the native properties as sub-properties of
properties from a unifying schema such as Dublin Core.
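
For example, a minimal sketch of this integration style (the ex1 and
ex2 prefixes and IRIs are illustrative; dc is a predefined prefix):

:- use_module(library(semweb/rdf_db)).
:- rdf_register_prefix(ex1, 'http://example.org/source1#').
:- rdf_register_prefix(ex2, 'http://example.org/source2#').

% Declare the native title properties as sub-properties of dc:title.
:- rdf_assert(ex1:title, rdfs:subPropertyOf, dc:title).
:- rdf_assert(ex2:name, rdfs:subPropertyOf, dc:title).

% rdf_has/3 now finds titles regardless of the native property used.
title_of(Work, Title) :-
    rdf_has(Work, dc:title, literal(Title)).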

Dynamic data

This RDF store is one of the few stores that is primarily based on
backward reasoning. The big advantage of backward reasoning is
that it can deal with changes to the database much more easily because it
does not have to propagate the consequences. Backward
reasoning reduces storage requirements. The price is more reasoning
during querying. In many scenarios, the extra reasoning over a
main-memory store will outperform fetching precomputed results from
external storage.

Prototyping reasoning systems

Reasoning systems, not necessarily limited to entailment reasoning,
can be prototyped efficiently on the Prolog based store. This includes
`what-if' reasoning, which is supported by snapshot and
transaction isolation. These features, together with the
concurrent loading capabilities, make the platform well equipped to
collect relevant data from large external stores for intensive
reasoning. Finally, the
TIPC
package can be used to create networks of cooperating RDF based agents.

Streaming RDF

Transactions, snapshots, concurrent modifications and the database
monitoring facilities (see rdf_monitor/2)
make the platform well suited for prototyping systems that deal with
streaming RDF data.

Depending on the OS and further application restrictions, the
SWI-Prolog RDF store scales to about 15 million triples on 32-bit
hardware. On 64-bit hardware, the scalability is limited by the amount
of physical memory, allowing for approximately 4 million triples
per gigabyte. The other limiting factor for practical use is the time
required to load data and/or restore the database from the persistent
file backup. Performance depends highly on hardware, concurrent
performance and whether or not the data is spread over multiple (named)
graphs that can be loaded in parallel. Restoring over 20 million
triples per minute is feasible on medium hardware (Intel i7/2600 running
Ubuntu 12.10).

The current `semweb' package provides two sets of interface
predicates. The original set is described in section
3.3. The new API is described in section
3.1. The original API was designed when RDF was not yet standardised
and did not yet support data types and language indicators. The new API
is designed from the RDF 1.1 specification, introducing consistent
naming and access to literals using the value space. The new
API is currently defined on top of the old API, so both APIs can be
mixed in a single application.

True if an RDF triple <S,P,O>
exists, optionally in the graph G. The object O is
either a resource (atom) or one of the terms listed below. The described
types apply for the case where O is unbound. If O
is instantiated it is converted according to the rules described with rdf_assert/3.

Triples consist of the following three terms:

Blank nodes are encoded by atoms that start with `_:`.

IRIs appear in two notations:

Full IRIs are encoded by atoms that do not start with `_:`.
Specifically, an IRI term is not required to follow the IRI standard
grammar.

Abbreviated IRI notation that allows IRI prefix aliases that are
registered by rdf_register_prefix/[2,3] to be used. Their notation is Alias:Local,
where Alias and Local are atoms. Each abbreviated IRI is expanded by the
system to a full IRI.

Literals appear in two notations:

String@Lang A language-tagged string, where String is a Prolog
string and Lang is an atom.

Value^^Type A type-qualified literal. For
unknown types, Value is a Prolog string. If the type is known, the Prolog
representations from the table below are used.

Datatype IRI            Prolog term
xsd:float               float
xsd:double              float
xsd:decimal             float (1)
xsd:integer             integer
XSD integer sub-types   integer
xsd:boolean             true or false
xsd:date                date(Y,M,D)
xsd:dateTime            date_time(Y,M,D,HH,MM,SS) (2,3)
xsd:gDay                integer
xsd:gMonth              integer
xsd:gMonthDay           month_day(M,D)
xsd:gYear               integer
xsd:gYearMonth          year_month(Y,M)
xsd:time                time(HH,MM,SS) (2)

Notes:

(1) The current implementation of xsd:decimal
values as floats is formally incorrect. Future versions of SWI-Prolog
may introduce decimal as a subtype of rational.

(2) SS fields denote the number of seconds. This
can either be an integer or a float.

(3) The date_time structure can have a 7th
field that denotes the timezone offset in seconds as an integer.

There is a fine distinction in how duplicate statements are handled
in rdf/[3,4]: backtracking over rdf/3
will never return duplicate triples that appear in multiple graphs. rdf/4
will return such duplicate triples, because their graph term differs.

S

is the subject term. It is either a blank
node or IRI.

P

is the predicate term. It is always an
IRI.

O

is the object term. It is either a
literal, a blank node or IRI (except for true and false
that denote the values of datatype XSD boolean).
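
For example, assuming library(semweb/rdf11) is loaded and the foaf and
ex prefixes are registered (both illustrative here), literal objects are
queried directly in their Prolog representation:

% All English language-tagged names
?- rdf(S, foaf:name, Name@en).

% All ages as Prolog integers
?- rdf(S, ex:age, Age^^xsd:integer).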

Similar to rdf/3 and rdf/4,
but P matches all predicates that are defined as an
rdfs:subPropertyOf of P. This predicate also recognises the
predicate properties inverse_of and
symmetric. See rdf_set_predicate/2.

True when O can be reached from S using the
transitive closure of P. The predicate uses (the internals
of) rdf_has/3 and thus matches
both rdfs:subPropertyOf and the inverse_of and
symmetric predicate properties. The version rdf_reachable/5
maximizes the steps considered and returns the number of steps taken.

If both S and O are given, these predicates are semidet.
The number of steps D is minimal because the implementation
uses breadth-first search.

Formulate constraints on RDF terms, notably literals. These are intended
to be used as illustrated below. RDF constraints are pure: they may be
placed before, after or inside a graph pattern and, provided the code
contains no commit operations (!, ->), the
semantics of the goal remains the same. Preferably, constraints are
placed before the graph pattern as they often help the RDF
database to exploit its literal indexes. In the example below, the
database can choose between using the subject and/or predicate hash or
the ordered literal table.

{ Date >= "2000-01-01"^^xsd:date },
rdf(S, P, Date)

The following constraints are currently defined:

>, >=, ==, =<, <

The comparison operators are defined between numbers (of any recognised
type), typed literals of the same type and langStrings of the same
language.

prefix(String, Pattern)

substring(String, Pattern)

word(String, Pattern)

like(String, Pattern)

icase(String, Pattern)

Text matching operators that act on both typed literals and langStrings.

lang_matches(Term, Pattern)

Demands a full RDF term (Text@Lang) or a plain Lang term to
match the language pattern Pattern.

The predicates rdf_where/1
and {}/1 are identical. The
rdf_where/1 variant is provided
to avoid ambiguity in applications where {}/1 is used for other
purposes. Note that it is also possible to write rdf11:{...}.

True if O is a currently known object, i.e., it appears in
the object position of some visible triple. If O is ground, it is
pre-processed as the object argument of rdf_assert/3
and the predicate is semidet.

True if Term appears in the RDF database. Term is
either an iri, literal or blank node and may appear in any position of
any triple. If Term is ground, it is pre-processed as the
object argument of rdf_assert/3
and the predicate is semidet.

True when Lexical is the lexical form for the literal Literal.
Lexical is of one of the forms below. The ntriples
serialization is obtained by transforming String into a proper ntriples
string using double quotes and escaping where needed and turning Type
into a proper IRI reference.

Assert a new triple. If O is a literal, certain Prolog terms
are translated to typed RDF literals. These conversions are described
with rdf_canonical_literal/2.

If a type is provided using Value^^Type
syntax, additional conversions are performed. All types accept either an
atom or Prolog string holding a valid RDF lexical value for the type and
xsd:float and xsd:double accept a Prolog integer.
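
For example (ex is an illustrative prefix; the comments show the stored
literals):

?- rdf_assert(ex:alice, ex:age, 23).        % stored as 23^^xsd:integer
?- rdf_assert(ex:alice, ex:height, 1.68).   % stored as 1.68^^xsd:double
?- rdf_assert(ex:alice, ex:name, "Alice"@en).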

True when Length is the number of cells in RDFList.
Note that a list cell may have multiple rdf:rest triples, which makes
this predicate non-deterministic. This predicate does not check whether
the list cells have associated values (rdf:first). The list must end in
rdf:nil.

Create an RDF list from the given Prolog List. PrologList
must be a proper Prolog list and all members of the list must be
acceptable as object for rdf_assert/3.
If RDFList is unbound and
PrologList is not empty, rdf_create_bnode/1
is used to create
RDFList.

Implementation of the conventional human interpretation of RDF 1.1
containers.

RDF containers are open enumeration structures as opposed to RDF
collections or RDF lists which are closed enumeration structures. The
same resource may appear in a container more than once. A container may
be contained in itself.

True when Alt is an instance of rdf:Alt with
first member
Default and remaining members Others.

Notice that this construct adds no machine-processable semantics but
is conventionally used to indicate to a human reader that the numerical
ordering of the container membership properties of Container is intended
to only be relevant in distinguishing between the first and all
non-first members.

Default denotes the default option to take when choosing
one of the alternatives contained in Container. Others
denotes the non-default options that can be chosen from.

True when Bag is an instance of rdf:Bag and Set is the set of
values related through container membership properties to Bag.

Notice that this construct adds no machine-processable semantics but
is conventionally used to indicate to a human reader that the numerical
ordering of the container membership properties of Container is intended
to not be significant.

True when Seq is an instance of rdf:Seq and List
is a list of associated values, ordered according to the container
membership property used.

Notice that this construct adds no machine-processable semantics but
is conventionally used to indicate to a human reader that the numerical
ordering of the container membership properties of Container is intended
to be significant.

True when List is the list of objects attached to Container
using a container membership property (rdf:_0, rdf:_1, ...). If multiple
objects are connected to the Container using the same
membership property, this predicate selects one value
non-deterministically.

The central module of the RDF infrastructure is library(semweb/rdf_db).
It provides storage and indexed querying of RDF triples. RDF data is
stored as quintuples. The first three elements denote the RDF triple.
The extra Graph and Line elements provide information
about the origin of the triple.

The actual storage is provided by the foreign language (C)
module. Using a dedicated C-based implementation we can reduce memory
usage and improve indexing capabilities, for example by providing a
dedicated index to support entailment over rdfs:subPropertyOf.
Currently the following indexes are provided (S=subject, P=predicate,
O=object, G=graph):

S, P, O, SP, PO, SPO, G, SG, PG

Predicates connected by rdfs:subPropertyOf are combined in a predicate
cloud. The system ensures that all predicates in the cloud share
the same hash. The cloud maintains a 2-dimensional array that expresses
the closure of all rdfs:subPropertyOf relations. This index supports rdf_has/3
to query a property and all its children efficiently.

Additional indexes for predicates, resources and graphs allow
enumerating these objects without duplicates. For example, using rdf_resource/1
we enumerate all resources in the database only once, while enumeration
using e.g., (rdf(R,_,_);rdf(_,_,R)) normally produces many
duplicate answers.

Elementary query for triples. Subject and Predicate
are atoms representing the fully qualified URL of the resource. Object
is either an atom representing a resource or literal(Value)
if the object is a literal value. If a value of the form
NameSpaceID:LocalName is provided it is expanded to a ground atom using expand_goal/2.
This implies you can use this construct in compiled code without paying
a performance penalty. Literal values take one of the following forms:

Atom

If the value is a simple atom it is the textual representation of a
string literal without explicit type or language qualifier.

lang(LangID, Atom)

Atom represents the text of a string literal qualified with
the given language.

type(TypeID, Value)

Used for attributes qualified with rdf:datatype=TypeID. The Value is either the textual
representation or a natural Prolog representation. See the option
convert_typed_literal(:Convertor) of the parser. The storage layer
provides efficient handling of atoms, integers (64-bit) and floats
(native C doubles). All other data is represented as a Prolog record.

For literal querying purposes, Object can be of the form
literal(+Query, -Value), where Query is one of the terms
below. If the Query takes a literal argument and the value has a numeric
type numerical comparison is performed.

plain(+Text)

Perform exact match and demand the language or type qualifiers to match.
This query is fully indexed.

icase(+Text)

Perform a full but case-insensitive match. This query is fully indexed.

exact(+Text)

Same as icase(Text). Backward compatibility.

substring(+Text)

Match any literal that contains Text as a case-insensitive
substring. The query is not indexed on Object.

word(+Text)

Match any literal that contains Text delimited by a
non-alphanumeric character, the start or the end of the string. The
query is not indexed on Object.

prefix(+Text)

Match any literal that starts with Text. This call is
intended for completion. The query is indexed using the skip list of
literals.

ge(+Literal)

Match any literal that is equal to or larger than Literal in the
ordered set of literals.

gt(+Literal)

Match any literal that is larger than Literal in the ordered
set of literals.

eq(+Literal)

Match any literal that is equal to Literal in the ordered set
of literals.

le(+Literal)

Match any literal that is equal to or smaller than Literal in
the ordered set of literals.

lt(+Literal)

Match any literal that is smaller than Literal in the ordered
set of literals.

between(+Literal1, +Literal2)

Match any literal that is between Literal1 and Literal2
in the ordered set of literals. This may include both Literal1
and
Literal2.

like(+Pattern)

Match any literal that matches Pattern case insensitively,
where the `*' character in Pattern matches zero or more
characters.

Backtracking never returns duplicate triples. Duplicates can be
retrieved using rdf/4. The predicate rdf/3
raises a type-error if called with improper arguments. If rdf/3
is called with a term literal(_) as Subject or Predicate,
it fails silently. This allows graph-matching goals like
rdf(S,P,O),rdf(O,P2,O2) to proceed without
errors.

Succeeds if the triple rdf(Subject, Predicate, Object) is
true exploiting the rdfs:subPropertyOf predicate as well as inverse
predicates declared using rdf_set_predicate/2
with the
inverse_of property.

Same as rdf_has/3, but RealPredicate
is unified to the actual predicate that makes this relation true. RealPredicate
must be
Predicate or an rdfs:subPropertyOf Predicate. If
an inverse match is found, RealPredicate is the term inverse_of(Pred).

Is true if Object can be reached from Subject
following the transitive predicate Predicate or a
sub-property thereof, while respecting the symmetric(true) or inverse_of(P2)
properties.

If used with either Subject or Object unbound,
it first returns the origin, followed by the reachable nodes in
breadth-first search order. The implementation internally looks one
solution ahead and succeeds deterministically on the last solution. This
predicate never generates the same node twice and is robust against
cycles in the transitive relation.

With all arguments instantiated, it succeeds deterministically if a
path can be found from Subject to Object.
Searching starts at Subject, assuming the branching factor is
normally lower. A call with both Subject and Object
unbound raises an instantiation error. The following example generates
all subclasses of rdfs:Resource:
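
?- rdf_reachable(X, rdfs:subClassOf, rdfs:'Resource').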

Same as rdf_reachable/3, but
in addition, MaxD limits the number of edges expanded and D
is unified with the `distance' between
Subject and Object. Distance 0 means Subject
and Object are the same resource. MaxD can be the
constant infinite to impose no distance-limit.

The predicates below enumerate the basic objects of the RDF store.
Most of these predicates also enumerate objects that are not associated
to any currently visible triple. Objects are retained as long as they
are visible in active queries or snapshots. After that, some are
reclaimed by the RDF garbage collector, while others are never
reclaimed.

True when Resource is a resource used as a subject or object
in a triple.

This predicate is primarily intended as a way to process all
resources without processing resources twice. The user must be aware
that some of the returned resources may not appear in any
visible triple.

True when Predicate is a currently known predicate.
Predicates are created if a triple is created that uses this predicate
or a property of the predicate is set using rdf_set_predicate/2.
The predicate may (no longer) have triples associated with it.

Note that resources that have rdf:type rdf:Property
are not automatically included in the result-set of this predicate,
while all resources that appear as the second argument of a
triple are included.

True when Literal is a currently known literal. Enumerates
each unique literal exactly once. Note that it is possible that the
literal only appears in already deleted triples. Deleted triples may be
locked due to active queries, transactions or snapshots or may not yet
be reclaimed by the garbage collector.

The predicates below modify the RDF store directly. In addition, data
may be loaded using rdf_load/2 or
by restoring a persistent database using rdf_attach_db/2.
Modifications follow the Prolog logical update view semantics,
which implies that modifications remain invisible to already running
queries. Further isolation can be achieved using
rdf_transaction/3.

Assert a new triple into the database. This is equivalent to
rdf_assert/4 using Graph user. Subject
and Predicate are resources. Object is either a
resource or a term literal(Value). See rdf/3
for an explanation of Value for typed and language qualified literals.
All arguments are subject to name-space expansion. Complete duplicates
(including the same graph and `line' and with a compatible `lifespan')
are not added to the database.

Run Goal in an RDF transaction. Compared to the ACID model,
RDF transactions have the following properties:

Modifications inside the transaction become atomically visible
to the outside world if Goal succeeds, and remain invisible if Goal
fails or throws an exception. I.e., the atomicity property is fully
supported.

Consistency is not guaranteed. Later versions may implement
consistency constraints that will be checked serialized just before the
actual commit of a transaction.

Concurrently executing transactions do not influence each other.
I.e., the isolation property is fully supported.

Durability can be activated by loading
library(semweb/rdf_persistency).

Processed options are:

snapshot(+Snapshot)

Execute Goal using the state of the RDF store as stored in
Snapshot. See rdf_snapshot/1. Snapshot
can also be the atom true, which implies that an anonymous
snapshot is created at the current state of the store. Modifications due
to executing Goal are only visible to Goal.
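
For example, a minimal sketch of `what-if' reasoning (what_if is a
hypothetical wrapper):

% Run Goal against an isolated view of the store; modifications made
% by Goal are visible inside the transaction only and never become
% globally visible.
what_if(Goal) :-
    rdf_transaction(Goal, what_if, [snapshot(true)]).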

Take a snapshot of the current state of the RDF store. Later, goals may
be executed in the context of the database at this moment using rdf_transaction/3
with the snapshot option. A snapshot created outside a
transaction exists until it is deleted. Snapshots taken inside a
transaction can only be used inside this transaction.

True if Id is the identifier of a transaction in the context
of which this call is executed. If Id is not instantiated,
backtracking yields transaction identifiers starting with the innermost
nested transaction. Transaction identifier terms are not copied, need
not be ground and can be instantiated during the transaction.

Tests if a resource is a blank node (i.e. is an anonymous resource). A
blank node is represented as an atom that starts with _:.
For backward compatibility reasons, __ is also considered to
be a blank node.

The RDF library can read and write triples in RDF/XML and a
proprietary binary format. There is a plugin interface defined to
support additional formats. The library(semweb/turtle) uses
this plugin API to support loading Turtle files using rdf_load/2.

How to handle equivalent blank nodes. If share (default),
equivalent blank nodes are shared in the same resource.

base_uri(+URI)

URI that is used for rdf:about="" and other RDF constructs
that are relative to the base uri. Default is the source URL.

concurrent(+Jobs)

If FileOrList is a list of files, process the input files
using Jobs threads concurrently. Default is the minimum of
the number of cores and the number of inputs. Higher values can be
useful when loading inputs from (slow) network connections. Using 1
(one) does not use separate worker threads.

format(+Format)

Specify the source format explicitly. Normally this is deduced from the
filename extension or the mime-type. The core library understands the
formats xml (RDF/XML) and triples (internal quick load and cache
format). Plugins, such as library(semweb/turtle) extend the
set of recognised extensions.

graph(?Graph)

Named graph in which to load the data. It is not allowed to load
two sources into the same named graph. If Graph is unbound,
it is unified to the graph into which the data is loaded. The default
graph is a file:// URL when loading a file or, if the specification is
a URL, its normalized version without the optional #fragment.

if(Condition)

When to load the file. One of true, changed
(default) or
not_loaded.

modified(-Modified)

Unify Modified with one of not_modified, cached(File),
last_modified(Stamp) or unknown.

cache(Bool)

If false, do not use or create a cache file.

register_namespaces(Bool)

If true (default false), register xmlns
namespace declarations or Turtle @prefix prefixes using
rdf_register_prefix/3
if there is no conflict.

silent(+Bool)

If true, the message reporting completion is printed using
level silent. Otherwise the level is informational.
See also print_message/2.

prefixes(-Prefixes)

Returns the prefixes defined in the source data file as a list of pairs.

Other options are forwarded to process_rdf/3.
By default,
rdf_load/2 only loads RDF/XML
from files. It can be extended to load data from other formats and
locations using plugins. The full set of plugins relevant to support
different formats and locations is below:

Reload all loaded files that have been modified since the last time they
were loaded.

Partial save

Sometimes it is necessary to make more arbitrary selections of
material to be saved or exchange RDF descriptions over an open network
link. The predicates in this section provide for this. Character
encoding issues are derived from the encoding of the Stream,
providing support for
utf8, iso_latin_1 and ascii.

Save XML document header, doctype and open the RDF environment. This
predicate also sets up the namespace notation.

Save an RDF header, with the XML header, DOCTYPE, ENTITY and opening
the rdf:RDF element with appropriate namespace declarations. It uses the
primitives from section 3.5 to generate the required namespaces and
desired short-name. Options is one of:

graph(+URI)

Only search for namespaces used in triples that belong to the given
named graph.

namespaces(+List)

Where List is a list of namespace abbreviations. With this
option, the expensive search for all namespaces that may be used by your
data is omitted. The namespaces rdf and rdfs
are added to the provided List. If a namespace is not
declared, the resource is emitted in non-abbreviated form.

Loading and saving RDF/XML is relatively slow. For this reason we
designed a binary format that is more compact, avoids the complications
of the RDF parser and avoids repetitive lookup of (URL) identifiers.
Especially the speed improvement of about 25 times is worthwhile when
loading large databases. These predicates are used for caching by
rdf_load/2 under certain
conditions as well as for maintaining persistent snapshots of the
database using
library(semweb/rdf_persistency).

Many RDF stores turned triples into quadruples. This store is no
exception, initially using the 4th argument to store the filename from
which the triple was loaded. Currently, the 4th argument is the RDF
named graph. A named graph maintains some properties, notably to
track origin, changes and modified state.

Literal values are ordered and indexed using a skip list. The
aim of this index is threefold.

Unlike hash tables, an ordered index such as this skip list allows for
efficient prefix and range matching. Prefix matching is useful in
interactive applications to provide feedback while typing, such as
auto-completion.

Having a table of unique literals we generate creation and
destruction events (see rdf_monitor/2).
These events can be used to maintain additional indexing on literals,
such as `by word'. See library(semweb/litindex).

As string literal matching is most frequently used for searching
purposes, the match is executed case-insensitive and after removal of
diacritics. Case matching and diacritics removal is based on Unicode
character properties and independent from the current locale. Case
conversion is based on the `simple uppercase mapping' defined by Unicode
and diacritic removal on the `decomposition type'. The approach is
lightweight, but somewhat simple-minded for some languages. The tables
are generated for Unicode characters up to 0x7fff. For more information,
please check the source code of the mapping-table generator
unicode_map.pl available in the sources of this package.

Currently the total order of literals is first based on the type of
literal, using the ordering numeric < string <
term. Numeric values (integer and float) are ordered by value;
integers precede floats if they represent the same value. Strings are
sorted alphabetically after case-mapping and diacritic removal as
described above. If they compare equal, uppercase precedes lowercase and
diacritics are ordered on their Unicode value. If they still compare
equal, literals without any qualifier precede literals with a type
qualifier, which precede literals with a language qualifier. Same
qualifiers (both type or both language) are sorted alphabetically.

The ordered tree is used for indexed execution of
literal(prefix(Prefix), Literal) as well as literal(like(Like), Literal)
if Like does not start with a `*'. Note that results of queries
that use the tree index are returned in alphabetical order.

The predicates below form an experimental interface to provide more
reasoning inside the kernel of the rdf_db engine. Note that symmetric,
inverse_of and transitive are not yet
supported by the rest of the engine. Also note that there is no relation
to defined RDF properties. Properties that have no triples are not
reported by this predicate, while predicates that are involved in
triples do not need to be defined as an instance of rdf:Property.

True if this predicate is transitive. This predicate is currently not
used. It might be used to make rdf_has/3
imply
rdf_reachable/3 for
transitive predicates.

triples(Triples)

Unify Triples with the number of existing triples using this
predicate as second argument. Reporting the number of triples is
intended to support query optimization.

rdf_subject_branch_factor(-Float)

Unify Float with the average number of triples associated
with each unique value for the subject-side of this relation. If there
are no triples the value 0.0 is returned. This value is cached with the
predicate and recomputed only after substantial changes to the triple
set associated to this relation. This property is intended for path
optimisation when solving conjunctions of rdf/3
goals.

rdf_object_branch_factor(-Float)

Unify Float with the average number of triples associated
with each unique value for the object-side of this relation. In addition
to the comments with the subject_branch_factor property, uniqueness of
the object value is computed from the hash key rather than the actual
values.

rdfs_subject_branch_factor(-Float)

Same as rdf_subject_branch_factor, but also considering
triples of `subPropertyOf' this relation. See also rdf_has/3.

rdfs_object_branch_factor(-Float)

Same as rdf_object_branch_factor, but also considering
triples of `subPropertyOf' this relation. See also rdf_has/3.

Prolog code often contains references to constant resources with a
known
prefix (also known as XML namespaces). For example,
http://www.w3.org/2000/01/rdf-schema#Class refers to the
most general notion of an RDFS class. Readability and maintainability
concerns call for abstraction here. The RDF database maintains a
table of known prefixes. This table can be queried using rdf_current_ns/2
and can be extended using rdf_register_ns/3.
The prefix database is used to expand prefix:local terms
that appear as arguments to calls which are known to accept a resource.
This expansion is achieved by the Prolog preprocessor using expand_goal/2.
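
For example (the ex prefix and IRI are illustrative):

:- use_module(library(semweb/rdf_db)).
:- rdf_register_prefix(ex, 'http://example.org/vocab#').

% ex:'Person' below is expanded at compile time to the atom
% 'http://example.org/vocab#Person'.
person(P) :-
    rdf(P, rdf:type, ex:'Person').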

Query predefined prefixes and prefixes defined with
rdf_register_prefix/2
and local prefixes defined with
rdf_prefix/2. If Alias is
unbound and one URI is the prefix of another, the longest is
returned first. This allows turning a resource into a prefix/local
couple using the simple enumeration below. See rdf_global_id/2.

force(Boolean)

If true, replace an existing namespace alias. Please note that
replacing a namespace is dangerous, as namespaces affect preprocessing.
Make sure all code that depends on a namespace is compiled after
changing the registration.

keep(Boolean)

If true and Alias is already defined, keep the original
binding for Prefix and succeed silently.

Without options, an attempt to redefine an alias raises a permission
error.

Convert between Prefix:Local and full IRI (an atom). If IRISpec
is an atom, it is simply unified with IRI. This predicate
fails silently if IRI is an RDF literal.

Note that this predicate is a meta-predicate on its output argument.
This is necessary to get the module context while the first argument may
be of the form (:)/2. The above mode description is correct, but should
be interpreted as (?,?).

Errors

existence_error(rdf_prefix, Prefix)

See also

- rdf_equal/2 provides a compile
time alternative
- The rdf_meta/1 directive asks for
compile time expansion of arguments.

bug

Error handling is incomplete. In its current implementation the same
code is used for compile-time expansion and to facilitate runtime
conversion and checking. These use cases have different requirements.

Same as rdf_global_id/2, but
intended for dealing with the object part of a triple, in particular the
type for typed literals. Note that the predicate is a meta-predicate on
the output argument. This is necessary to get the module context while
the first argument may be of the form (:)/2.

Performs rdf_global_id/2 on
prefixed IRIs and rdf_global_object/2
on RDF literals, by recursively analysing the term. Note that the
predicate is a meta-predicate on the output argument. This is necessary
to get the module context while the first argument may be of the form
(:)/2.

Terms of the form Prefix:Local that appear in TermIn
for which
Prefix is not defined are not replaced. Unlike rdf_global_id/2
and
rdf_global_object/2, no
error is raised.

Namespace handling for custom predicates

If we implement a new predicate based on one of the predicates of the
semweb libraries that expand namespaces, namespace expansion is not
automatically available to it. Consider the following code, computing
the number of distinct objects for a certain property on a certain
subject.
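
A sketch of such a predicate (cardinality/3 is an illustrative name;
the rdf_meta/1 declaration makes callers expand its first two
arguments at compile time):

:- rdf_meta
    cardinality(r, r, -).

cardinality(S, P, Count) :-
    (   setof(O, rdf_has(S, P, O), Os)
    ->  length(Os, Count)
    ;   Count = 0
    ).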

In addition to expanding calls, rdf_meta/1
also causes expansion of
clause heads for predicates that match a declaration. This is
typically used to write Prolog statements about resources. The following
example produces three clauses with expanded (single-atom) arguments:
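
For example (assuming the skos prefix is registered with
rdf_register_prefix/2; rdfs is predefined):

:- rdf_meta
    label_predicate(r).

label_predicate(rdfs:label).
label_predicate(skos:prefLabel).
label_predicate(skos:altLabel).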

True when Generation is the current generation of the
database. Each modification to the database increments the generation.
It can be used to check the validity of cached results deduced from the
database. Committing a non-empty transaction increments the generation
by one.

When inside a transaction, Generation is unified to a term
TransactionStartGen + InsideTransactionGen. E.g., 4+3
means that the transaction was started at generation 4 of the global
database and we have created 3 new generations inside the transaction.
Note that this choice of representation allows for comparing generations
using Prolog arithmetic. Comparing a generation in one transaction with
a generation in another transaction is meaningless.

Return the number of alternatives as indicated by the database internal
hashed indexing. This is a rough measure for the number of alternatives
we can expect for an rdf_has/3
call using the given three arguments. When called with three variables,
the total number of triples is returned. This estimate is used in query
optimisation. See also rdf_predicate_property/2
and
rdf_statistics/1 for
additional information to help optimizers.

Total number of triples in the database. This is the number of asserted
triples minus the number of retracted ones. The number of visible
triples in a particular context may be different due to visibility rules
defined by the logical update view and transaction isolation.

resources(-Count)

Number of resources that appear as subject or object in a triple. See rdf_resource/1.

True if Lang matches Pattern. This implements XML
language matching conforming to RFC 4647. Both Lang and Pattern
are dash-separated strings of identifiers or (for Pattern)
the wildcard *. Identifiers are matched case-insensitively and a *
matches any number of identifiers. A short pattern is the same as *.

Remove all triples from the RDF database and reset all its statistics.

bug

This predicate checks for active queries, but this check is not properly
synchronized and therefore the use of this predicate is unsafe in
multi-threaded contexts. It is mainly used to run functionality tests
that need to start with an empty database.

Storing RDF triples in main memory provides much better performance
than using external databases. Unfortunately, although memory is fairly
cheap these days, main memory is severely limited when compared to
disks. Memory usage breaks down into the following categories. Rough
estimates of the memory usage are given for 64-bit systems. 32-bit
systems use slightly more than half these amounts.

Actually storing the triples. A triple is stored in a C struct of
144 bytes. This struct holds the quintuple, some bookkeeping
information and the 10 next-pointers for the (at most) 10 hash tables.

The bucket array for the hashes. Each bucket maintains a
head and a tail pointer, as well as a count of the number
of entries. The bucket array is allocated if a particular index is
created, which implies the first query that requires the index. Each
bucket requires 24 bytes.

Bucket arrays are resized if necessary. Old triples remain at their
original location. This implies that a query may need to scan multiple
buckets. The garbage collector may relocate old indexed triples. It does
so by copying the old triple. The old triple is later reclaimed by GC.
Reindexed triples will be reused, but many reindexed triples may result
in significant memory fragmentation.

Resources are maintained in a separate table to support
rdf_resource/1. A resource
requires approximately 32 bytes.

Identical literals are shared (see rdf_current_literal/1)
and stored in a skip list. A literal requires approximately 40
bytes, excluding the atom used for the lexical representation.

Resources are stored in the Prolog atom-table. Atoms with the
average length of a resource require approximately 88 bytes.

The hash parameters can be controlled with rdf_set/1.
Applications that are tight on memory and for which the query
characteristics are more or less known can optimize performance and
memory by fixing the hash tables. By fixing the hash tables we can
tailor them to the frequent query patterns, avoid the need to
check multiple hash buckets (see above) and avoid memory
fragmentation due to optimizing triples for resized hashes.

Set properties for a triple index. Hash is one of s,
p, sp, o, po, spo, g, sg
or pg. Parameter is one of:

size

Value defines the number of entries in the hash-table.
Value is rounded down to a power of 2. After setting
the size explicitly, auto-sizing for this table is disabled. Setting the
size smaller than the current size results in a permission_error
exception.

average_chain_len

Set maximum average collision number for the hash.

optimize_threshold

Related to resizing hash tables. If 0, all triples are moved to the new
size by the garbage collector. If more than zero, those of the last Value
resize steps remain at their current location. Leaving cells at their
current location reduces memory fragmentation but slows down access.
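
For example, a sketch that fixes the subject and subject-predicate
indexes for a store expected to hold a few million triples (the sizes
are illustrative):

?- rdf_set(hash(s, size, 1048576)),
   rdf_set(hash(sp, size, 1048576)).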

The garbage collector

The RDF store has a garbage collector that runs in a separate thread
named __rdf_GC. The garbage collector removes the following objects:

Triples that have died before the generation of the last still
active query.

Entailment matrices for rdfs:subPropertyOf relations
that are related to old queries.

In addition, the garbage collector reindexes triples associated to
the hash-tables before the table was resized. The most recent resize
operation leads to the largest number of triples that require
reindexing, while the oldest resize operation causes the largest
slowdown. The parameter optimize_threshold controlled by rdf_set/1
can be used to determine the number of most recent resize operations for
which triples will not be reindexed. The default is 2.

Normally, the garbage collector does its job in the background at a
low priority. The predicate rdf_gc/0
can be used to reclaim all garbage and optimize all indexes.

Warming up the database

The RDF store performs many operations lazily or in background
threads. For maximum performance, perform the following steps:

Load all the data without doing queries or retracting data in
between. This avoids creating the indexes and therefore the need to
resize them.

Perform each of the indexed queries. The following call performs
this. Note that it is irrelevant whether or not the query succeeds.
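
A sketch of such a warm-up predicate; s, p, o and g are arbitrary
constants, as only the instantiation pattern matters:

warm_indexes :-
    ignore(rdf(s, _, _)),
    ignore(rdf(_, p, _)),
    ignore(rdf(_, _, o)),
    ignore(rdf(s, p, _)),
    ignore(rdf(_, p, o)),
    ignore(rdf(s, p, o)),
    ignore(rdf(_, _, _, g)),
    ignore(rdf(s, _, _, g)),
    ignore(rdf(_, p, _, g)).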

Duplicate administration is initialized in the background after the
first call that returns a significant amount of duplicates. Creating the
administration can be forced by calling rdf_update_duplicates/0.

Run the RDF-DB garbage collector until no garbage is left and all tables
are fully optimized. Under normal operation a separate thread with
identifier __rdf_GC performs garbage collection as long as
it is considered `useful'.

Using rdf_gc/0 should only be
needed to ensure a fully clean database for analysis purposes such as
leak detection.

Update the duplicate administration of the RDF store. This marks every
triple that is potentially a duplicate of another as duplicate. Being
potentially a duplicate means that subject, predicate and object are
equivalent and the life-times of the two triples overlap.

The duplicates marks are used to reduce the administrative load of
avoiding duplicate answers. Normally, the duplicates are marked using a
background thread that is started on the first query that produces a
substantial amount of duplicates.

The predicate rdf_monitor/2
allows registration of call-backs with the RDF store. These call-backs
are typically used to keep other databases in sync with the RDF store.
For example,
library(semweb/rdf_persistency) monitors the RDF
store for maintaining a persistent copy in a set of files and
library(semweb/rdf_litindex) uses added and
deleted literal values to maintain a full-text index of literals.

Goal is called for modifications of the database. It is
called with a single argument that describes the modification. Defined
events are:

assert(+S, +P, +O, +DB)

A triple has been asserted.

retract(+S, +P, +O, +DB)

A triple has been deleted.

update(+S, +P, +O, +DB, +Action)

A triple has been updated.

new_literal(+Literal)

A new literal has been created. Literal is the argument of
literal(Arg) of the triple's object. This event is
introduced in version 2.5.0 of this library.

old_literal(+Literal)

The literal Literal is no longer used by any triple.

transaction(+BeginOrEnd, +Id)

Mark begin or end of the commit of a transaction started by
rdf_transaction/2. BeginOrEnd
is begin(Nesting) or
end(Nesting). Nesting expresses the nesting
level of transactions, starting at `0' for a toplevel transaction. Id
is the second argument of rdf_transaction/2.
The following transaction Ids are pre-defined by the library:

parse(Id)

A file is loaded using rdf_load/2. Id
is one of file(Path) or stream(Stream).

load(+BeginOrEnd, +Spec)

Mark begin or end of rdf_load_db/1
or a load through rdf_load/2
from a cached file. Spec is currently defined as file(Path).

rehash(+BeginOrEnd)

Marks begin/end of a re-hash due to required re-indexing or garbage
collection.

Mask is a list of events this monitor is interested in.
Default (empty list) is to report all events. Otherwise each element is
of the form +Event or -Event to include or exclude monitoring for
certain events. The event-names are the functor names of the events
described above. The special name all refers to all events
and
assert(load) to assert events originating from rdf_load_db/1.
As loading triples using rdf_load_db/1
is very fast, monitoring this at the triple level may seriously harm
performance.

This predicate is intended to maintain derived data, such as a
journal, information for undo, additional indexing in literals,
etc. There is no way to remove registered monitors. If this is required,
one should register a monitor that maintains a dynamic list of
subscribers, like the XPCE broadcast library. A second subscription of
the same hook predicate only re-assigns the mask.

The monitor hooks are called in the order of registration and in the
same thread that issued the database manipulation. To process all
changes in one thread they should be sent to a thread message queue. For
all updating events, the monitor is called while the calling thread has
a write lock on the RDF store. This implies that these events are
processed strictly synchronously, even if modifications originate from
multiple threads. In particular, the transaction begin,
... updates ..., end sequence is never interleaved with
other events. The same holds for load and parse.
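
For example, a minimal sketch that prints every asserted triple
(log_assert is an illustrative name):

:- use_module(library(semweb/rdf_db)).

log_assert(assert(S, P, O, Graph)) :-
    format('Asserted ~p in graph ~p~n', [rdf(S, P, O), Graph]).

% Report assert events only.
:- rdf_monitor(log_assert, [-all, +assert]).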

This RDF low-level module has been created after two years of
experimenting with a plain Prolog-based module and a brief evaluation of
a second-generation pure Prolog implementation. The aim was to be able
to handle up to about 5 million triples on standard (notebook) hardware
and deal efficiently with subPropertyOf, which was
identified as a crucial feature of RDFS to realise fusion of different
data-sets.

The following issues have been identified and are not solved in a
suitable manner.

subPropertyOf of subPropertyOf

is not supported.

Equivalence

Similar to subPropertyOf, it is likely to be profitable to
handle resource identity efficiently. The current system has no support
for it.

The library(rdf_db) module provides several hooks for
extending its functionality. Database updates can be monitored and acted
upon through the features described in section
3.4. The predicate rdf_load/2
can be hooked to deal with different formats such as Turtle,
different input sources (e.g., HTTP) and different strategies for caching
results.

The hooks below are used to add new RDF file formats and sources from
which to load data to the library. They are used by the modules
described below and distributed with the package. Please examine the
source-code if you want to add new formats or locations.

Open an input. Input is one of file(+Name),
stream(+Stream) or url(Protocol, URL). If this
hook succeeds, the RDF will be read from Stream using rdf_load_stream/3.
Otherwise the default open functionality for file and stream is used.

Gather information on Input. Modified is the last
modification time of the source as a POSIX time-stamp (see time_file/2).
Format is the RDF format of the file. See rdf_file_type/2
for details. It is allowed to leave the output variables unbound.
Ultimately the default modified time is `0' and the format is assumed to
be
xml.

True if Format is the default RDF file format for files with
the given extension. Extension is lowercase and without a
'.'. E.g. owl. Format is either a built-in
format (xml or triples) or a format understood
by the rdf_load_stream/3
hook.

This
module uses the library(zlib) library to load compressed
files on the fly. The extension of the file must be .gz.
The file format is deduced by the extension after stripping the .gz
extension. E.g. rdf_load('file.rdf.gz').

This module allows for rdf_load('http://...').
It exploits the library library(http/http_open.pl). The
format of the URL is determined from the mime-type returned by the
server if this is one of
text/rdf+xml, application/x-turtle or
application/turtle. As RDF mime-types are not yet widely
supported, the plugin uses the extension of the URL if the claimed
mime-type is not one of the above. In addition, it recognises
text/html and application/xhtml+xml, scanning
the XML content for embedded RDF.

The library library(semweb/rdf_cache) defines the
caching strategy for triple sources. When using large RDF sources,
caching triples greatly speeds up loading RDF documents. The cache library
implements two caching strategies that are controlled by rdf_set_cache_options/1.

Local caching This approach applies to files only. Triples are
cached in a sub-directory of the directory holding the source. This
directory is called .cache (_cache on
Windows). If the cache option create_local_directory is true,
a cache directory is created if possible.

Global caching This approach applies to all sources, except
for unnamed streams. Triples are cached in the directory defined by the
cache option global_directory.

When loading an RDF file, the system scans the configured cache files
unless cache(false) is specified as an option to rdf_load/2
or caching is disabled. If caching is enabled but no cache exists, the
system will try to create a cache file. First it will try to do this
locally. On failure it will try the configured global cache.
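
For example, a sketch that enables a global cache directory (the path
is illustrative):

?- rdf_set_cache_options([ enabled(true),
                           global_directory('/var/cache/rdf'),
                           create_global_directory(true)
                         ]).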

The library library(semweb/rdf_litindex.pl) exploits the
primitives of section 4.5.1 and the
NLP package to provide indexing on words inside literal constants. It
also allows for fuzzy matching using stemming and `sounds-like' based on
the double metaphone algorithm of the NLP package.

Find literals (without type or language specification) that satisfy
Spec. The required indices are created as needed and kept
up-to-date using hooks registered with rdf_monitor/2.
Numerical indexing is currently limited to integers in the range ±2^30
(±2^62 on 64-bit platforms). Spec is defined
as:

and(Spec1, Spec2)

Intersection of both specifications.

or(Spec1, Spec2)

Union of both specifications.

not(Spec)

Negation of Spec. After translation of the full specification
to
Disjunctive Normal Form (DNF), negations are only allowed
inside a conjunction with at least one positive literal.

case(Word)

Matches all literals containing the word Word, doing the
match case insensitive and after removing diacritics.

stem(Like)

Matches all literals containing at least one word that has the same stem
as Like using the Porter stem algorithm. See NLP package for
details.

sounds(Like)

Matches all literals containing at least one word that `sounds like'
Like using the double metaphone algorithm. See NLP package
for details.

prefix(Prefix)

Matches all literals containing at least one word that starts with
Prefix, discarding diacritics and case.

between(Low, High)

Matches all literals containing an integer token in the range
Low..High, including the boundaries.

ge(Low)

Matches all literals containing an integer token with value
Low or higher.

le(High)

Matches all literals containing an integer token with value
High or lower.

Token

Matches all literals containing the given token. See tokenize_atom/2
of the NLP package for details.
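
For example, to find all literals containing a word that sounds like
`rembrandt' together with a word that starts with `rij':

?- rdf_find_literals(and(sounds(rembrandt), prefix(rij)), Literals).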

Uses the same database as rdf_find_literals/2
to find possible expansions of Spec, i.e. which words `sound
like', `have prefix', etc. Spec is a compound expression as
in rdf_find_literals/2.
Expansions is unified to a list of terms sounds(Like,
Words), stem(Like, Words) or prefix(Prefix,
Words). On compound expressions, only combinations that provide
literals are returned. Below is an example after loading the ULAN
(Unified List of Artist Names, Getty Foundation) database, showing all
words that sound like `rembrandt' and appear together in a literal with
the word `Rijn'. Finding this result from the 228,710 literals contained
in ULAN requires 0.54 milliseconds (AMD 1600+).

Tokenize a literal, returning a list of atoms and integers in the range
-1073741824 ... 1073741823. As tokenization is in general
domain and task-dependent this predicate first calls the hook
rdf_litindex:tokenization(Literal, -Tokens). On failure it
calls tokenize_atom/2
from the NLP package and deletes the following: atoms of length 1,
floats, integers that are out of range and the English words and, an, or, of,
on, in, this and the.
Deletion first calls the hook rdf_litindex:exclude_from_index(token, X).

`Literal maps' provide a relation between literal values, intended to
create additional indexes on literals. The current implementation can
only deal with integers and atoms (string literals). A literal map
maintains an ordered set of keys. The ordering uses the same
rules as described in section 4.5.
Each key is associated with an ordered set of values. Literal
map objects can be shared between threads, using a locking strategy that
allows for multiple concurrent readers.

Typically, this module is used together with rdf_monitor/2
on the channels new_literal and old_literal to
maintain an index of words that appear in a literal. Further abstraction
using Porter stemming or Metaphone can be used to create additional
search indices. These can map either directly to the literal values, or
indirectly to the plain word map. The SWI-Prolog NLP package provides
complementary building blocks, such as a tokenizer, Porter stem and
Double Metaphone.
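
For example, a minimal sketch of a word map (the map primitives are
assumed to be available via library(semweb/rdf_db); keys and values are
illustrative):

?- rdf_new_literal_map(Map),
   rdf_insert_literal_map(Map, rembrandt, 'Rembrandt van Rijn'),
   rdf_insert_literal_map(Map, rijn, 'Rembrandt van Rijn'),
   rdf_find_literal_map(Map, [rembrandt, rijn], Literals).
Literals = ['Rembrandt van Rijn'].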

Destroy a literal map. After this call, further use of the Map
handle is illegal. Additional synchronisation is needed if maps that are
shared between threads are destroyed to guarantee the handle is no
longer used. In some scenarios rdf_reset_literal_map/1
provides a safe alternative.

As rdf_insert_literal_map/3.
In addition, if Key is a new key in
Map, unify KeyCount with the number of keys in Map.
This serves two purposes: derived maps, such as the stem and metaphone
maps, need to know about new keys, and it avoids additional foreign
calls for tracking progress in rdf_litindex.pl.

Unify ValueList with an ordered set of values associated to
all keys from KeyList. Each key in KeyList is
either an atom, an integer or a term not(Key). If not-terms
are provided, there must be at least one positive keyword. The
negations are tested after establishing the positive matches.

Succeeds if Key is a key in the map and unifies Answer
with the number of values associated with the key. This provides a fast
existence test without fetching the possibly large associated value
set as with rdf_find_literal_map/3.

prefix(+Prefix)

Unify Answer with an ordered set of all keys that have the
given prefix. Prefix must be an atom. This call is intended
for auto-completion in user interfaces.

ge(+Min)

Unify Answer with all keys that are larger or equal to the
integer Min.

le(+Max)

Unify Answer with all keys that are smaller or equal to the
integer Max.

The library(semweb/rdf_persistency)
provides reliable persistent storage for the RDF data. The store uses a
directory with files for each source (see rdf_source/1)
present in the database. Each source is represented by two files, one in
binary format (see rdf_save_db/2)
representing the base state and one represented as Prolog terms
representing the changes made since the base state. The latter is called
the journal.

Attach Directory as the persistent database. If Directory
does not exist, it is created. Otherwise all sources defined in the
directory are loaded into the RDF database. Loading a source means
loading the base state (if any) and replaying the journal (if any). The
current implementation does not synchronise triples that are in the
store before attaching a database. They are not removed from the
database, nor added to the persistent store. Different merging options
may be supported through the Options argument later.
Currently defined options are:

concurrency(+PosInt)

Number of threads used to reload databases and journals from the files
in Directory. Default is the number of physical CPUs
determined by the Prolog flag cpu_count, or 1 (one) on
systems where this number is unknown. See also concurrent/3.

max_open_journals(+PosInt)

The library maintains a pool of open journal files. This option
specifies the size of this pool. The default is 10. Raising the option
can make sense if many writes occur on many different named graphs. The
value can be lowered for scenarios where write operations are very
infrequent.

log_nested_transactions(+Boolean)

If true, nested log transactions are added to the
journal information. By default (false), no log-term is
added for nested transactions.

The database is locked against concurrent access using a file
lock in Directory. An attempt to attach to a
locked database raises a permission_error exception. The
error context contains a term rdf_locked(Args), where Args
is a list containing time(Stamp) and pid(PID).
The error can be caught by the application; otherwise a diagnostic
message is printed.
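A minimal sketch of attaching a persistent store, assuming a local
directory RDF-store that may be created in the working directory:

    :- use_module(library(semweb/rdf_persistency)).

    :- initialization
           rdf_attach_db('RDF-store',
                         [ concurrency(2),
                           max_open_journals(20)
                         ]).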

Change the persistency of a named database (4th argument of rdf/4).
By default all databases are persistent. Using false, the
journal and snapshot for the database are deleted and further changes to
triples associated with DB are not recorded. If Bool
is true a snapshot is created for the current state and
further modifications are monitored. Switching persistency does not
affect the triples in the in-memory RDF database.

Flush dirty journals. With the option min_size(KB) only
journals larger than KB Kbytes are merged with the base
state. Flushing a journal takes the following steps, ensuring a stable
state can be recovered at any moment.

1. Save the current database in a new file using the extension .new.

2. On success, delete the journal.

3. On success, atomically move the .new file over the base
state.

Note that journals are not merged automatically for two
reasons. First of all, some applications may decide never to merge as
the journal contains a complete changelog of the database.
Second, merging large databases can be slow and the application may wish
to schedule such actions at quiet times or scheduled maintenance
periods.
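For example, to merge only journals that have grown beyond 256 Kbytes
with their base state (a sketch):

    ?- rdf_flush_journals([min_size(256)]).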

The above predicates suffice for most applications. The predicates in
this section provide access to the journal files and the base state
files and are intended to provide additional services, such as reasoning
about the journals, loaded files, etc.3A
library, library(rdf_history), is under development
that exploits these features to support wiki-style editing of RDF.

Using rdf_transaction(Goal, log(Message)), we can add
additional records that enrich the journal of the affected databases
with Message and some additional bookkeeping information. Such a
transaction adds a term
begin(Id, Nest, Time, Message) before the change operations
on each affected database and end(Id, Nest, Affected) after
the change operations. Below is an example call and the content of the
journal file mydb.jrn. A full explanation of the terms that
appear in the journal is in the description of rdf_journal_file/2.
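The sketch below is illustrative; the graph name mydb, the triple and
the time stamp are invented.

    ?- rdf_transaction(rdf_assert(s, p, o, mydb),
                       log(by(jan))).

    % mydb.jrn afterwards contains (illustrative values):
    begin(1, 0, 1307038396.23, by(jan)).
    assert(s, p, o).
    end(1, 0, []).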

Using rdf_transaction(Goal, log(Message, DB)), where DB
is an atom denoting a (possibly empty) named graph, the system
guarantees that a non-empty transaction will leave a (possibly empty)
transaction record in DB. This feature assumes named graphs that are
named after the user making the changes. Even if a user action does not
affect the user's own graph, such as deleting a triple from another
graph, we still find a record of all actions performed by that user in
the journal of the user's graph.

True if
File is the absolute file name of the journal for an existing named
graph DB. A journal file contains a sequence of Prolog terms of the
following format.4Future versions
of this library may use an XML-based language-neutral format.

start(Attributes)

Journal has been opened. Currently Attributes contains a term time(Stamp).

end(Attributes)

Journal was closed. Currently Attributes contains a term time(Stamp).

assert(Subject, Predicate, Object)

A triple {Subject, Predicate, Object} was added to the database.

assert(Subject, Predicate, Object, Line)

A triple {Subject, Predicate, Object} was added to the database with
given Line context.

retract(Subject, Predicate, Object)

A triple {Subject, Predicate, Object} was deleted from the database.
Note that an rdf_retractall/3
call can retract multiple triples. Each of them have a record in the
journal. This allows for `undo'.

begin(Id, Nest, Time, Message)

Added before the changes in each database affected by a transaction with
transaction identifier log(Message). Id is an
integer counting the logged transactions to this database. Numbers are
increasing and designed for binary search within the journal file.
Nest is the nesting level, where `0' is a toplevel
transaction.
Time is a time-stamp, currently using float notation with two
fractional digits. Message is the term provided by the user
as argument of the log(Message) transaction.

end(Id, Nest, Others)

Added after the changes in each database affected by a transaction with
transaction identifier log(Message). Id and Nest
match the begin-term. Others gives a list of other databases
affected by this transaction and the Id of these records. The
terms in this list have the format DB:Id.

Convert between DB (see rdf_source/1)
and the base name of the file used for storing information on this
database. The full file is located in the directory described by rdf_current_db/1
and has the extension
.trp for the base state and .jrn for the
journal.

This module implements the Turtle language for representing the RDF
triple model as defined by Dave Beckett from the Institute for Learning
and Research Technology, University of Bristol, and later standardized
by the W3C RDF working group.

This module acts as a plugin to rdf_load/2,
for processing files with one of the extensions .ttl or .n3.

Read a stream or file into a set of triples or, when faced with
TriG input, quadruples of the format

rdf(Subject, Predicate, Object [, Graph])

The representation is consistent with the SWI-Prolog RDF/XML and
ntriples parsers. Provided options are:

base_uri(+BaseURI)

Initial base URI. Defaults to file://<file>
for loading files.

anon_prefix(+Prefix)

Blank nodes are generated as <Prefix>1, <Prefix>2,
etc. If Prefix is not an atom blank nodes are generated as
node(1), node(2), ...

format(+Format)

One of auto (default), turtle or trig.
The auto mode switches to TriG format if there is a
{ before the first triple. Finally, if the format is
explicitly stated as turtle and the file appears to be a
TriG file, a warning is printed and the data is loaded while ignoring
the graphs.

resources(URIorIRI)

Officially, Turtle resources are IRIs. Quite a few applications however
send URIs. By default we do URI->IRI mapping because
this rarely causes errors. To force strictly conforming mode, pass iri.
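For example, under the assumption of a local file mydata.ttl:

    ?- rdf_read_turtle('mydata.ttl', Triples,
                       [ base_uri('http://example.org/mydata') ]),
       length(Triples, Count).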

Streaming Turtle parser. The predicate rdf_process_turtle/3
processes Turtle data from Input, calling OnObject
with a list of triples for every Turtle statement found in Input. OnObject
is called as below, where ListOfTriples is a list of
rdf(S,P,O) terms for a normal Turtle file or rdf(S,P,O,G)
terms if the GRAPH keyword is used to associate a set of
triples in the document with a particular graph. The Graph
argument provides the default graph for storing the triples and Line
is the line number where the statement started.
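A sketch of a handler that reports progress per statement; the file
name is an assumption and the handler's second argument is the
Graph:Line pair described above.

    count_triples(Triples, _Graph:Line) :-
        length(Triples, N),
        format('line ~d: ~d triple(s)~n', [Line, N]).

    ?- rdf_process_turtle('mydata.ttl', count_triples, []).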

The option expand allows for serializing alternative
graph representations. It is called through call/5,
where the first argument is the expand-option, followed by S,P,O,G. G is
the graph-option (which is by default a variable). This notably allows
for writing RDF graphs represented as rdf(S,P,O) using the
following code fragment:
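The fragment below sketches such a hook, assuming the graph is held in a
Prolog list RDF of rdf(S,P,O) terms; the helper save_graph/2 is our own
name.

    triple_in(RDF, S, P, O, _G) :-
        member(rdf(S, P, O), RDF).

    save_graph(Out, RDF) :-
        rdf_save_turtle(Out, [ expand(triple_in(RDF)) ]).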

The library(semweb/rdf_ntriples) provides a fast reader
for the RDF N-Triples and N-Quads format. N-Triples is a simple format,
originally used to support the W3C RDF test suites. The current format
has been extended and is a subset of the Turtle format (see
library(semweb/turtle)).

The API of this library is almost identical to library(semweb/turtle).
This module provides a plugin into rdf_load/2,
making this predicate support the formats ntriples and nquads.

Read the next triple from Stream as Triple. Stream
must have UTF-8 encoding.

Triple

is a term triple(Subject,Predicate,Object).
Arguments follow the normal conventions of the RDF libraries. NodeID
elements are mapped to node(Id). If end-of-file is reached, Triple
is unified with
end_of_file.
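The predicate is naturally driven from a read loop. Below is a minimal
sketch; the helper predicate names are our own.

    :- use_module(library(semweb/rdf_ntriples)).

    %!  read_triples(+File, -Triples) is det.
    %   Read all triples from an N-Triples file.
    read_triples(File, Triples) :-
        setup_call_cleanup(
            open(File, read, In, [encoding(utf8)]),
            read_stream_triples(In, Triples),
            close(In)).

    read_stream_triples(In, Triples) :-
        read_ntriple(In, Triple),
        (   Triple == end_of_file
        ->  Triples = []
        ;   Triples = [Triple|Rest],
            read_stream_triples(In, Rest)
        ).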

Read the next quad from Stream as Quad. Stream
must have UTF-8 encoding.

Quad

is a term quad(Subject,Predicate,Object,Graph).
Arguments follow the normal conventions of the RDF libraries. NodeID
elements are mapped to node(Id). If end-of-file is reached, Quad
is unified with
end_of_file.

This module implements extraction of RDFa triples from parsed XML or
HTML documents. It has two interfaces: read_rdfa/3
to read triples from some input (stream, file, URL) and xml_rdfa/3
to extract triples from an HTML or XML document that is already parsed
with load_html/3 or
load_xml/3.

True when Triples is a list of rdf(S,P,O)
triples extracted from
Input. Input is either a stream, a file name, a
URL referencing a file name or a URL that is valid for http_open/3. Options
are passed to open/4, http_open/3
and xml_rdfa/3. If no base is
provided in Options, a base is deduced from Input.
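For example, extracting the RDFa triples from a web page (the URL is
illustrative):

    ?- read_rdfa('http://example.org/page.html', Triples, []).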

The library(semweb/rdfs)
library adds interpretation of the triple store in terms of concepts
from RDF-Schema (RDFS). There are two ways to provide support for more
high-level languages in RDF. One is to view such languages as a set of
entailment rules. In this model the rdfs library would provide a
predicate rdfs/3 with the same functionality as rdf/3,
operating on the union of the raw graph and the triples that can be
derived by applying the RDFS entailment rules.

Alternatively, RDFS provides a view on the RDF store in terms of
individuals, classes, properties, etc., and we can provide predicates
that query the database with this view in mind. This is the approach
taken in the library(semweb/rdfs.pl) library, providing
calls like
rdfs_individual_of(?Resource, ?Class).5The
SeRQL language is based on querying the deductive closure of the triple
set. The SWI-Prolog SeRQL library provides entailment modules
that take the approach outlined above.

True if SubProperty is equal to Property or Property
can be reached from SubProperty following the
rdfs:subPropertyOf relation. It can be used to test as well
as to generate sub-properties or super-properties. Note that the
commonly used semantics of this predicate is wired into rdf_has/[3,4].
Bug: the current implementation cannot deal with cycles. Bug: the
current implementation cannot deal with predicates that are an
rdfs:subPropertyOf of rdfs:subPropertyOf, such as owl:samePropertyAs.

True if SubClass is equal to Class or Class
can be reached from SubClass following the
rdfs:subClassOf relation. It can be used to test as well as
to generate sub-classes or super-classes. Bug: the current
implementation cannot deal with cycles.

True if Resource is an individual of Class. This
implies
Resource has an rdf:type property that refers to
Class or a sub-class thereof. Can be used to test, to
generate the classes Resource belongs to, or to generate the
individuals described by Class.
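A sketch, assuming a registered ex prefix and a store holding
ex:john rdf:type ex:Student and ex:Student rdfs:subClassOf ex:Person:

    ?- rdfs_individual_of(X, ex:'Person').
    X = 'http://example.org/john'.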

If List is a list of resources, create an RDF list Resource
that reflects these resources. Resource and the sublist
resources are generated with rdf_bnode/1.
The new triples are associated with the database DB.

Complex projects require RDF resources from many locations and
typically wish to load these in different combinations. For example,
loading a small subset of the data for debugging purposes or loading a
different set of files for experimentation. The library library(semweb/rdf_library.pl)
manages sets of RDF files spread over different locations, including
file and network locations. The original version of this library
supported metadata about collections of RDF sources in an RDF file
called Manifest. The current version supports both the
VoID format and the
original format. VoID files (typically named void.ttl) can
use elements from the RDF Manifest vocabulary to support features that
are not supported by VoID.

A manifest file is an RDF file, often in
Turtle
format, that provides meta-data about RDF resources. Often, a manifest
will describe RDF files in the current directory, but it can also
describe RDF resources at arbitrary URL locations. The RDF schema for
RDF library meta-data can be found in rdf_library.ttl. The
namespace for the RDF library format is defined as
http://www.swi-prolog.org/rdf/library/
and abbreviated as
lib.

The schema defines three root classes: lib:Namespace, lib:Ontology
and lib:Virtual, which we describe below.

Ontologies imported. If rdf_load_library/2
is used to load this ontology, the ontologies referenced here are loaded
as well. There are two subProperties: lib:schema and lib:instances with
the obvious meaning.

Virtual ontologies do not refer to an RDF resource themselves. They only
import other resources. For example the W3C WordNet manifest defines wn-basic
and wn-full as virtual resources. The lib:Virtual resource
is used as a second rdf:type:
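For example (the file name is illustrative):

    <wn-basic>
        a lib:Ontology , lib:Virtual ;
        lib:imports <wnbasic.rdfs> .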

Defines a URL to be a namespace. The definition provides the preferred
mnemonic and can be referenced in the lib:providesNamespace and
lib:usesNamespace properties. The rdf_load_library/2
predicate registers encountered namespace mnemonics with rdf-db using
rdf_register_ns/2.
Typically namespace declarations use @prefix declarations, e.g.:
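For instance, declaring the rdfs mnemonic:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .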

The VoID vocabulary aims at
resolving the same problem as the Manifest files described here. In
addition, the
VANN vocabulary
provides information about preferred namespace prefixes. The RDF
library manager can deal with VoID files. The following relations apply:

VoID Dataset and Linkset are similar to
lib:Ontology, but a VoID resource is always
Virtual. I.e., the VoID URI itself never refers to an RDF
document.

The owl:imports and its lib specializations are
replaced by void:subset (referring to another VoID dataset)
and void:dataDump (referring to a concrete document).

A description of the dataset is given using dcterms:description
rather than rdfs:comment.

The RDF library recognises lib:source, lib:baseURI
and lib:Cloudnode, which have no equivalent in VoID.

The RDF library recognises vann:preferredNamespacePrefix
and
vann:preferredNamespaceUri as alternatives to its
proprietary way for defining prefixes. The domain of these predicates is
unclear. The library recognises them regardless of the domain. Note that
the range of vann:preferredNamespaceUri is a literal.
A disadvantage of that is that the Turtle prefix declaration cannot be
reused.

Currently, the RDF metadata is not stored in the RDF
database. It is processed by low-level primitives that do not
perform RDFS reasoning. In particular, this means that
rdfs:subPropertyOf and rdfs:subClassOf cannot be used to specialise the
RDF meta vocabulary.

Load meta-data on RDF repositories from FileOrDirectory. If
the argument is a directory, this directory is processed recursively
and, for each directory, a file named void.ttl,
Manifest.ttl or Manifest.rdf is loaded (in
this order of preference).

Declared namespaces are added to the rdf-db namespace list.
Encountered ontologies are added to a private database of
rdf_list_library.pl. Each ontology is given an
identifier, derived from the basename of the URL without the
extension. Thus, using the declaration below, the identifier of the
declared ontology is wn-basic.
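The declaration below is hypothetical; the title and file name are
assumptions.

    <wn-basic.rdf>
        a lib:Ontology ;
        dcterms:title "Basic WordNet" .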

List the available resources in the library. Currently only lists
resources that have a dcterms:title property. See section
9.2 for an example.

It is possible for the initial set of manifests to refer to RDF files
that are not covered by a manifest. If such a reference is encountered
while loading or listing a library, the library manager will look for a
manifest file in the directory holding the referenced RDF file and load
this manifest. If a manifest is found that covers the referenced file,
the directives found in the manifest will be followed. Otherwise the RDF
resource is simply loaded using the current defaults.

Lists the resources that will be loaded if Id is handed to
rdf_load_library/2.
See rdf_attach_library/1
for how ontology identifiers are generated. In addition it checks the
existence of each resource to help debugging library dependencies.
Before doing its work,
rdf_list_library/2
reloads manifests that have changed since they were loaded the last
time. For HTTP resources it uses the HEAD method to verify existence and
last modification time of resources.

Typically, a project will use a single file using the same format as
a manifest file that defines alternative configurations that can be
loaded. This file is loaded at program startup using
rdf_attach_library/1.
Users can now list the available libraries using rdf_list_library/0
and rdf_list_library/1:
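A minimal sketch, assuming the initial manifest is stored in
ontologies/void.ttl:

    ?- rdf_attach_library('ontologies/void.ttl'),
       rdf_list_library.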

Now we can list a specific category using rdf_list_library/1.
Note this loads two additional manifests referenced by resources
encountered in
ec-mappings. If a resource does not exist, it is flagged
using
[NOT FOUND].

Resources and manifests are located either on the local filesystem or
on a network resource. The initial manifest can also be loaded from a
file or a URL. This defines the initial base URL of the
document. The base URL can be overruled using the Turtle @base
directive. Other documents can be referenced relative to this base URL
by exploiting Turtle's URI expansion rules. Turtle resources can be
specified in three ways: as absolute URLs (e.g. <http://www.example.com/rdf/ontology.rdf>),
as URLs relative to the base (e.g. <../rdf/ontology.rdf>)
or following a
prefix (e.g. prefix:ontology).

The prefix notation is powerful because we can define multiple prefixes
and define resources relative to them. Unfortunately, prefixes can only
be defined as absolute URLs or as URLs relative to the base URL.
Notably, they cannot be defined relative to other prefixes. In addition,
a prefix can only be followed by a Qname, which excludes . and /.

Easily relocatable manifests must define all resources relative to
the base URL. Relocation is automatic if the manifest remains in the
same hierarchy as the resources it references. If the manifest is copied
elsewhere (i.e. for creating a local version) it can use @base to refer
to the resource hierarchy. We can point to directories holding manifest
files using @prefix declarations. There, we can reference
Virtual resources using prefix:name. Here is an example, where
we first give some lines from the initial manifest, followed by the
definition of the virtual RDFS resource.
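The sketch below is hypothetical; the prefix, locations and resource
names are assumptions.

    # From the initial manifest: point a prefix at a directory that
    # holds another manifest and import a virtual resource from there.
    @prefix w3c: <http://www.example.org/w3c/> .

    <everything>
        a lib:Ontology , lib:Virtual ;
        lib:imports w3c:rdfs .

    # In the manifest found at the w3c: location, the virtual RDFS
    # resource is defined relative to that manifest's base URL.
    <rdfs>
        a lib:Ontology , lib:Virtual ;
        lib:imports <rdf-schema.rdf> .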

In this section we provide skeleton code for filling the RDF database
from a password protected HTTP repository. The first line loads the
application. Next we include modules that enable us to manage the RDF
library, RDF database caching and HTTP connections. Then we setup the
HTTP authentication, enable caching of processed RDF files and load the
initial manifest. Finally, load_data/0
loads all our RDF data.
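A sketch following the steps described above; the application name,
URLs, credentials and library identifier are assumptions.

    :- use_module(application).            % load the application

    % Modules for managing the RDF library, caching and HTTP.
    :- use_module(library(semweb/rdf_library)).
    :- use_module(library(semweb/rdf_cache)).
    :- use_module(library(http/http_open)).

    % Setup HTTP authentication for the protected repository.
    :- http_set_authorization('http://www.example.com/rdf',
                              basic(john, secret)).

    % Enable caching of processed RDF files.
    :- rdf_set_cache_options([ global_directory('RDF-Cache'),
                               create_global_directory(true)
                             ]).

    % Load the initial manifest.
    :- rdf_attach_library('http://www.example.com/rdf/Manifest.ttl').

    % Load all our RDF data.
    load_data :-
        rdf_load_library(my_ontology, []).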

Execute a SPARQL query on an HTTP SPARQL endpoint. Query is
an atom that denotes the query. Result is unified to a term
rdf(S,P,O) for CONSTRUCT and DESCRIBE
queries, row(...) for
SELECT queries and true or false
for ASK queries.
Options are

Variables that are unbound in SPARQL (e.g., due to SPARQL optional
clauses) are bound in Prolog to the atom '$null$'.

endpoint(+URL)

May be used as alternative to Scheme, Host, Port and Path to specify the
endpoint in a single option.

host(+Host)

port(+Port)

path(+Path)

scheme(+Scheme)

The above four options set the location of the server.

search(+ListOfParams)

Provide additional query parameters, such as the graph.

variable_names(-ListOfNames)

Unifies ListOfNames with a list of atoms that describe the
names of the variables in a SELECT query.

Remaining options are passed to http_open/3.
The defaults for Host, Port and Path can be set using sparql_set_server/1.
The initial default for port is 80 and path is /sparql/.

For example, the ClioPatria server understands the parameter
entailment. The code below queries for all triples using
rdfs entailment.
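A sketch of such a query; the host and port are assumptions.

    ?- sparql_query('select * where { ?s ?p ?o }',
                    Row,
                    [ host(localhost),
                      port(3020),
                      search([entailment=rdfs])
                    ]).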

This library provides predicates that compare RDF graphs. The current
version only provides one predicate: rdf_equal_graphs/3
verifies that two graphs are identical after proper labeling of the
blank nodes.

Future versions of this library may contain more advanced operations,
such as diffing two graphs.
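A sketch of what a call might look like, assuming the pattern
rdf_equal_graphs(+GraphA, +GraphB, -Substitution) with the graphs given
as lists of rdf(S,P,O) terms; the blank node notation is also an
assumption.

    ?- rdf_equal_graphs([ rdf(a, p, '_:x') ],
                        [ rdf(a, p, '_:y') ],
                        Substitution).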

This module defines rules for user:portray/1
to help tracing and debugging RDF resources by printing them in a more
concise representation and optionally adding a comment from the label
field to help the user interpret the URL. The main predicates are
rdf_portray_as/1 and rdf_portray_lang/1. To be done:

- Define an alternate predicate to use for providing a comment.
- Use rdf:type if there is no meaningful label?
- Smarter guess whether or not the local identifier might be meaningful
to the user without a comment, i.e. does it look `word-like'?

The core
infrastructure for storing and querying RDF is provided by this package,
which is distributed as a core package with SWI-Prolog.
ClioPatria
provides a comprehensive server infrastructure on top of the semweb
and
http packages. ClioPatria provides a SPARQL 1.1 endpoint,
linked open data (LOD) support, user management, a web interface and an
extension infrastructure for programming (semantic) web applications.

Thea
provides access to OWL ontologies at the level of the abstract syntax.
It can interact with an external DL reasoner using DIG.

RDF-DB version 3 is a major redesign of the SWI-Prolog RDF
infrastructure. Nevertheless, version 3 is almost perfectly upward
compatible with version 2. Below are some issues to take into
consideration when upgrading.

Version 2 did not allow for modifications while read operations
were in progress, for example due to an open choice point. As a
consequence, operations that both queried and modified the database had
to be wrapped in a transaction or the modifications had to be buffered
as Prolog data structures. In both cases, the RDF store was not modified
during the query phase. In version 3, modifications are allowed
while read operations are in progress and follow the Prolog logical
update view semantics. This is different from using a transaction in
version 2, where the view for all read operations was frozen at the
start of the transaction. In version 3, every read operation sees
the store frozen at the moment that the operation was started.

We illustrate the difference by writing a forward entailment rule
that adds a sibling relation. In version 2, we
could perform this operation using one of the following approaches:
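Both sketches below assume a registered my: prefix and are illustrations
rather than the exact original examples.

    % (a) Buffer the modifications as Prolog data structures:
    add_siblings_v2a :-
        findall(S-O,
                ( rdf(S, my:parent, P),
                  rdf(O, my:parent, P),
                  S \== O
                ),
                Pairs),
        forall(member(S-O, Pairs),
               rdf_assert(S, my:sibling, O)).

    % (b) Wrap query and modification in a transaction, which freezes
    % the view for the enclosed read operations:
    add_siblings_v2b :-
        rdf_transaction(
            forall(( rdf(S, my:parent, P),
                     rdf(O, my:parent, P),
                     S \== O
                   ),
                   rdf_assert(S, my:sibling, O))).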

In version 3, we can write this in the natural Prolog
style below. In itself, this may not seem a big advantage because
wrapping such operations in a transaction is often good style anyway.
The story changes with more complicated control structures that combine
iterations with steps that depend on triples asserted in previous steps.
Such scenarios can be programmed naturally in the current version.
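A sketch of the version-3 idiom, under the same assumptions: the logical
update view guarantees that the rdf/3 enumeration is unaffected by the
asserts.

    add_siblings :-
        forall(( rdf(S, my:parent, P),
                 rdf(O, my:parent, P),
                 S \== O
               ),
               rdf_assert(S, my:sibling, O)).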

Generations inside a transaction are represented as
BaseGeneration+TransactionGeneration, where
BaseGeneration is the global generation at which the transaction
started and TransactionGeneration expresses the generation
within the transaction. Generation counting has changed as well. In
particular, committing a transaction advances the global generation by
only one.

rdf_unload/1 now only accepts a source location and deletes the
associated graph using rdf_unload_graph/1.

Acknowledgements

This research was supported by the following projects: MIA and
MultimediaN project (www.multimedian.nl) funded through the BSIK
programme of the Dutch Government, the FP-6 project HOPS of the European
Commission, the COMBINE project supported by the ONR Global NICOP grant
N62909-11-1-7060 and the Dutch national program COMMIT.