Parsing syntaxes to RDF Triples
Introduction
The typical sequence of operations to parse is to create a parser
object, set various callbacks and features, start the parsing, send
some syntax content to the parser object, finish the parsing and
destroy the parser object.
Several parts of this process are optional, including actually
using the triple results. Parsing without using them is useful as a
syntax check.
Create the Parser object
The parser can be created directly from a known name such as
rdfxml for the W3C Recommendation RDF/XML syntax:
raptor_parser* rdf_parser;
rdf_parser = raptor_new_parser("rdfxml");
or the name can be discovered from an enumeration
as discussed in Querying Functionality.
The parser can also be created by identifying the syntax by a
URI, specifying the syntax by a MIME Type, providing an identifier for
the content such as a filename or URI string, or giving some initial
content bytes that can be used to guess the syntax.
Using the
raptor_new_parser_for_content()
function, all of these can be given as optional parameters, using NULL
or 0 for undefined parameters. The constructor will then use as much of
this information as possible.
raptor_parser* rdf_parser;
Create a parser that reads the MIME Type for RDF/XML,
application/rdf+xml:
rdf_parser = raptor_new_parser_for_content(NULL, "application/rdf+xml", NULL, 0, NULL);
Create a parser that can read a syntax identified by the URI
for Turtle http://www.dajobe.org/2004/01/turtle/,
which has no registered MIME Type at this date:
syntax_uri = raptor_new_uri((const unsigned char*)"http://www.dajobe.org/2004/01/turtle/");
rdf_parser = raptor_new_parser_for_content(syntax_uri, NULL, NULL, 0, NULL);
Create a parser that recognises the identifier foo.rss:
rdf_parser = raptor_new_parser_for_content(NULL, NULL, NULL, 0, (const unsigned char*)"foo.rss");
Create a parser that recognises the content in buffer:
rdf_parser = raptor_new_parser_for_content(NULL, NULL, buffer, len, NULL);
Any of the constructor calls can return NULL if no matching
parser could be found, or the construction failed in another way.
Parser features
There are several options that can be set on parsers, called
features. The exact list of features can be
found via
Querying Functionality
or in the API reference for
raptor_set_feature(). (This should properly be called raptor_parser_set_feature(), as
it only applies to raptor_parser objects.)
Features are integer enumerations of the
raptor_feature enum and have values
that are either integers (often acting as booleans) or strings.
The two functions that set features are:
/* Set an integer (or boolean) valued feature */
raptor_set_feature(rdf_parser, feature, 1);
/* Set a string valued feature */
raptor_set_feature_string(rdf_parser, feature, "abc");
There are also two corresponding functions for reading the values of parser
features:
raptor_get_feature()
and
raptor_get_feature_string()
which take the feature enumeration parameter and return the integer or string
value respectively.
Set RDF triple callback handler
The main reason to parse a syntax is to get RDF triples
returned. This is done by a callback function, which is called
with two parameters: a user data pointer and the triple itself.
The handler is set with
raptor_set_statement_handler()
as follows:
void
triples_handler(void* user_data, const raptor_statement* triple)
{
/* do something with the triple */
}
raptor_set_statement_handler(rdf_parser, user_data, triples_handler);
Setting a handler function for triples is optional. Parsing without
one still has some uses, such as just counting triples or validating a syntax.
Set fatal error, error and warning handlers
There are several other callback handlers that can be set
on parsers. These can be set at any time before parsing starts.
Errors and warnings from parsing can be returned with functions
that all take a callback of type
raptor_message_handler
and signature:
void
message_handler(void *user_data, raptor_locator* locator,
const char *message)
{
/* do something with the message */
}
The handler receives the user data given, the associated location
information as a raptor_locator
and the error/warning message itself. The locator
structure contains full information on the details of where in the
file or URI the message occurred.
The fatal error, error and warning handlers are all set with
similar functions that take a handler as follows:
raptor_set_fatal_error_handler(rdf_parser, user_data, fatal_handler);
raptor_set_error_handler(rdf_parser, user_data, error_handler);
raptor_set_warning_handler(rdf_parser, user_data, warning_handler);
The program will terminate
with abort() if the fatal error handler returns.
Set the identifier creator handler
Identifiers are created in some parsers by generating them
automatically or via hints given in a syntax. Raptor can customise this
process using a user-supplied identifier handler function.
For example, in RDF/XML, generated blank node identifiers and
those specified with rdf:nodeID are passed through this
process. Setting a handler allows the identifier generation mechanism to be
fully replaced. A lighter alternative is to use
raptor_set_default_generate_id_parameters()
to adjust the default algorithm for generated identifiers.
The handler is set as follows:
raptor_generate_id_handler id_handler;
raptor_set_generate_id_handler(rdf_parser, user_data, id_handler);
The id_handler takes the following signature:
unsigned char*
generate_id_handler(void* user_data, raptor_genid_type type,
unsigned char* user_id) {
/* return a new generated ID based on user_id (optional) */
}
where the
raptor_genid_type
provides extra information on the identifier being created and
user_id is an optional user-supplied identifier,
such as the value of an rdf:nodeID in RDF/XML.
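As an illustration of what the body of such a handler might do, here is a minimal sketch in standard C. The helper make_node_id(), its prefix and its counter are hypothetical names and not part of Raptor. It builds either a sequential identifier or one derived from the user-supplied identifier. A real handler would cast between char* and unsigned char*, and the text above does not say which allocator the returned string must come from or who owns user_id, so check the API reference before relying on malloc() here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a Raptor function).
 * Returns a newly allocated identifier, or NULL if allocation fails.
 * - With a user_id:    "<prefix>_<user_id>", counter left unchanged.
 * - Without a user_id: "<prefix><counter>", and the counter is advanced.
 * prefix and counter must not be NULL. */
static char*
make_node_id(const char *prefix, unsigned int *counter, const char *user_id)
{
  size_t len;
  char *id;

  if(user_id) {
    /* prefix + '_' + user_id + terminating NUL */
    len = strlen(prefix) + 1 + strlen(user_id) + 1;
    id = (char*)malloc(len);
    if(id)
      snprintf(id, len, "%s_%s", prefix, user_id);
  } else {
    /* 3 decimal digits per byte is a safe upper bound for the counter */
    len = strlen(prefix) + 3 * sizeof(unsigned int) + 1;
    id = (char*)malloc(len);
    if(id)
      snprintf(id, len, "%s%u", prefix, (*counter)++);
  }
  return id;
}
```

Inside a generate_id_handler, the counter could be kept in the structure passed as user_data, so that each parser has its own sequence.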
Set namespace declared handler
Raptor can report when namespace prefixes/URIs are declared
during the parsing of a syntax such as XML, RDF/XML or Turtle.
A handler function can be set to receive these declarations using
the namespace handler method.
raptor_namespace_handler namespaces_handler;
raptor_set_namespace_handler(rdf_parser, user_data, namespaces_handler);
The namespaces_handler takes the following signature:
void
namespaces_handler(void* user_data, raptor_namespace *nspace) {
/* do something with the namespace declaration */
}
This may be called multiple times with the same namespace,
if the namespace is declared inside different XML sub-trees.
Set the parsing strictness
raptor_set_parser_strict()
allows setting of the parser strictness flag. The default is lax parsing,
which accepts older or deprecated syntax forms but may generate a warning. Setting
the flag to non-0 (true) will cause parser errors to be generated in these cases.
Provide syntax content to parse
The operation of turning syntax into RDF triples has several
alternatives, from functions that do most of the work starting from a
URI to functions that allow passing in data buffers.
Parsing and MIME Types
The MIME Type of the retrieved content is not used to choose
a parser unless the parser is of type guess.
The guess parser will send an Accept: header
for all known parser syntax MIME Types (if a URI request is made)
and, based on the response, including the identifiers used,
pick the appropriate parser to execute. See
raptor_guess_parser_name()
for a full discussion of the inputs to the guessing.
Parse the content from a URI (raptor_parse_uri())
The URI is resolved and the content read from it and passed to
the parser:
raptor_parse_uri(rdf_parser, uri, base_uri);
The base_uri is optional (can be
NULL) and will default to the
uri.
Parse the content of a URI using an existing WWW connection (raptor_parse_uri_with_connection())
The URI is resolved using an existing WWW connection (for
example a libcurl CURL handle) to allow any existing
WWW configuration to be reused. See
raptor_www_new_with_connection()
for full details of how this works. The content is then read from the
result of resolving the URI:
raptor_parse_uri_with_connection(rdf_parser, uri, base_uri, connection);
The base_uri is optional (can be
NULL) and will default to the
uri.
Parse the content of a C FILE* (raptor_parse_file_stream())
Parsing can read from a C STDIO file handle:
stream = fopen(filename, "rb");
if(stream) {
  raptor_parse_file_stream(rdf_parser, stream, filename, base_uri);
  fclose(stream);
}
This function can take an optional filename, which
is used in locator error messages.
The base_uri may be required by some parsers
and if NULL will cause the parsing to fail.
Parse the content of a file URI (raptor_parse_file())
Parsing can read from a URI known to be a file: URI:
raptor_parse_file(rdf_parser, file_uri, base_uri);
This function requires that the file_uri is
a file URI, that is
raptor_uri_uri_string_is_file_uri(raptor_uri_as_string(file_uri))
must be true.
The base_uri may be required by some parsers
and if NULL will cause the parsing to fail.
Parse chunks of syntax content provided by the application (raptor_start_parse() and raptor_parse_chunk())
raptor_start_parse(rdf_parser, base_uri);
while(/* not finished getting content */) {
  unsigned char *buffer;
  size_t buffer_len;
  /* obtain some syntax content in buffer of size buffer_len bytes */
  raptor_parse_chunk(rdf_parser, buffer, buffer_len, 0);
}
raptor_parse_chunk(rdf_parser, NULL, 0, 1); /* no data and is_end = 1 */
The base_uri argument to
raptor_start_parse()
may be required by some parsers
and if NULL will cause the parsing to fail.
On the last
raptor_parse_chunk()
call, or after the loop is ended, the is_end
parameter must be set to non-0. Content can be passed with the
final call. If no content is present at the end (such as in
some kind of end of file situation), then a 0-length
buffer_len or NULL buffer can be used.
The minimal case is an entire parse in one chunk as follows:
raptor_start_parse(rdf_parser, base_uri);
raptor_parse_chunk(rdf_parser, buffer, buffer_len, 1); /* is_end = 1 */
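The chunk loop above leaves "obtain some syntax content" open. One way to fill it in, sketched here in standard C only, is to read a FILE* in fixed-size pieces and finish with an empty call that sets is_end. The chunk_sink type, feed_stream_in_chunks() and counting_sink() are hypothetical names used for illustration. In a Raptor program the sink would be a small wrapper that calls raptor_parse_chunk() on the parser passed as ctx, after raptor_start_parse() has been called.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sink type: the same argument shape as raptor_parse_chunk(),
 * with a void* context in place of the parser. Returns non-0 on failure. */
typedef int (*chunk_sink)(void *ctx, const unsigned char *buf,
                          size_t len, int is_end);

/* Read `in` in fixed-size pieces and hand each to `sink` with is_end = 0,
 * then make one final empty call with is_end = 1.
 * Returns 0 on success, non-0 if reading or the sink failed. */
static int
feed_stream_in_chunks(FILE *in, chunk_sink sink, void *ctx)
{
  unsigned char buffer[4096];
  size_t len;

  while((len = fread(buffer, 1, sizeof(buffer), in)) > 0) {
    if(sink(ctx, buffer, len, 0))
      return 1;
  }
  if(ferror(in))
    return 1;

  /* no data and is_end = 1 */
  return sink(ctx, NULL, 0, 1);
}

/* Example sink that only records what it was given. */
struct chunk_stats {
  size_t total_bytes;
  int calls;
  int end_seen;
};

static int
counting_sink(void *ctx, const unsigned char *buf, size_t len, int is_end)
{
  struct chunk_stats *stats = (struct chunk_stats*)ctx;
  (void)buf; /* content is ignored in this example */
  stats->total_bytes += len;
  stats->calls++;
  if(is_end)
    stats->end_seen = 1;
  return 0;
}
```

Keeping the final is_end call separate from the read loop means the loop does not need to know in advance which read will be the last one.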
Restrict parser network access
Parsing can cause network requests to be performed, especially
if a URI is given as an argument such as with
raptor_parse_uri().
There may also be indirect requests, such as with the
GRDDL parser, which retrieves URIs depending on the results of
initial parse requests. An application may want to prevent some
of these URIs from being fetched, or to filter them, and this can be done in
three ways.
Filtering parser network requests with feature RAPTOR_FEATURE_NO_NET
The parser feature
RAPTOR_FEATURE_NO_NET
can be set with
raptor_set_feature()
and forbids all network requests. There is no customisation of
this approach.
rdf_parser = raptor_new_parser("rdfxml");
raptor_set_feature(rdf_parser, RAPTOR_FEATURE_NO_NET, 1);
Filtering parser network requests with raptor_www_set_uri_filter()
The
raptor_www_set_uri_filter()
function allows setting of a filtering function to operate on all URIs
retrieved by a WWW connection. This connection can be used in
parsing when the retrieval is operated by hand.
void write_bytes_handler(raptor_www* www, void *user_data,
                         const void *ptr, size_t size, size_t nmemb)
{
  raptor_parser* rdf_parser = (raptor_parser*)user_data;
  /* pass each block of retrieved bytes to the parser */
  raptor_parse_chunk(rdf_parser, (unsigned char*)ptr, size*nmemb, 0);
}
int uri_filter(void* filter_user_data, raptor_uri* uri) {
  /* return non-0 to forbid the request */
  return 0; /* 0 allows the request */
}
int main(int argc, char *argv[]) {
  ...
  rdf_parser = raptor_new_parser("rdfxml");
  www = raptor_new_www();
  /* filter all URI requests */
  raptor_www_set_uri_filter(www, uri_filter, filter_user_data);
  /* make WWW write bytes to parser */
  raptor_www_set_write_bytes_handler(www, write_bytes_handler, rdf_parser);
  raptor_start_parse(rdf_parser, uri);
  raptor_www_fetch(www, uri);
  /* tell the parser that we are done */
  raptor_parse_chunk(rdf_parser, NULL, 0, 1);
  raptor_www_free(www);
  raptor_free_parser(rdf_parser);
  ...
}
Filtering parser network requests with raptor_parser_set_uri_filter()
The
raptor_parser_set_uri_filter()
function allows setting of a filtering function to operate on all URIs that
the parser sees. This operates on the internal raptor_www object
used inside parsing to retrieve URIs, similar to that described in
the previous section.
int uri_filter(void* filter_user_data, raptor_uri* uri) {
  /* return non-0 to forbid the request */
  return 0; /* 0 allows the request */
}
rdf_parser = raptor_new_parser("rdfxml");
raptor_parser_set_uri_filter(rdf_parser, uri_filter, filter_user_data);
/* parse content as normal */
raptor_parse_uri(rdf_parser, uri, base_uri);
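The filter bodies above are left empty. As one possible policy, the following standard C sketch forbids every URI that does not start with an allowed prefix. The helper forbid_unless_prefix() is a hypothetical name, not a Raptor function. It assumes that a uri_filter would obtain the URI string with raptor_uri_as_string() and receive the allowed prefix through filter_user_data.

```c
#include <string.h>

/* Hypothetical helper (not a Raptor function).
 * Follows the uri_filter convention: returns non-0 to forbid the
 * request and 0 to allow it. A missing string is always forbidden. */
static int
forbid_unless_prefix(const char *uri_string, const char *allowed_prefix)
{
  if(!uri_string || !allowed_prefix)
    return 1;
  return strncmp(uri_string, allowed_prefix, strlen(allowed_prefix)) != 0;
}

/* Possible use inside a filter (assumption, shown as a comment only):
 *
 *   int uri_filter(void* filter_user_data, raptor_uri* uri) {
 *     return forbid_unless_prefix((const char*)raptor_uri_as_string(uri),
 *                                 (const char*)filter_user_data);
 *   }
 */
```

The prefix should end with a slash, for example "http://example.org/", so that a host name that merely begins with the same characters is not accepted by accident.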
Querying parser static information
These methods return information about the constructed parser
implementation corresponding to the information available
via raptor_syntaxes_enumerate()
for all parsers.
raptor_get_name() returns the parser syntax name,
raptor_get_label()
returns the long label for the parser and
raptor_get_mime_type()
returns the primary MIME Type for the parser (there may be others that the parser
will accept, but this is the main one).
raptor_parser_get_accept_header()
returns a string that would be sent in an HTTP
request Accept: header for the syntaxes accepted by this
parser only.
Querying parser run-time information
raptor_get_locator()
returns the raptor_locator
for the current position in the input stream. The locator
structure contains full information on the details of where in the
file or URI the current parser has reached.
Aborting parsing
raptor_parse_abort()
allows the current parsing to be aborted, at which point no further
triples will be passed to callbacks and the parser will attempt to
return control to the application. This is most useful when called
inside a handler function, which allows the application to decide to stop
an active parse.
Destroy the parser
To tidy up, delete the parser object as follows:
raptor_free_parser(rdf_parser);
Parsing example code
rdfprint.c: Parse an RDF/XML file and print the triples
Compile it like this:
$ gcc -o rdfprint rdfprint.c `raptor-config --cflags` `raptor-config --libs`
and run it on an RDF file as:
$ ./rdfprint raptor.rdf
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://usefulinc.com/ns/doap#Project> .
_:genid1 <http://usefulinc.com/ns/doap#name> "Raptor" .
_:genid1 <http://usefulinc.com/ns/doap#homepage> <http://librdf.org/raptor/> .
...