This page discusses the design of URIs used in converting governmental datasets to RDF. We briefly review the current practice of converting datasets using the old style of hash-based URIs, and then detail the new style of slash-based URIs and the various types of data that use them in the process of converting and enhancing governmental datasets.

Old Style

The original TWC data-gov project used a conversion tool that was developed very quickly, with the intention of rapidly converting the data.gov datasets into RDF. The URIs created by this tool weren't terribly friendly for a number of reasons. Here's an example:

While this approach has helped us produce a great number of triples, there are several issues that make it less than ideal going forward. By grouping all data into a single RDF file (which would contain all the triples for dataset 353), it becomes difficult to interact with information relating to just one instance without downloading the entire dataset. Also, the hash-based URIs make serving alternative representations of the data (e.g. HTML or different RDF serialization formats) all but impossible.
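
The difficulty with hash-based URIs comes from how HTTP clients handle the fragment: everything after the "#" is dropped before the request is sent, so the server cannot tell which instance was asked for. The sketch below illustrates this with Python's standard `urllib.parse.urldefrag`. Both URIs are hypothetical stand-ins on `example.org`, not URIs produced by the original conversion tool.

```python
from urllib.parse import urldefrag

# Hypothetical hash-based URI: one RDF document holds every entry of dataset 353.
hash_uri = "http://example.org/data/dataset_353.rdf#entry1"

# A client strips the fragment before making the request, so the server
# only ever sees the URL of the whole document.
document_url, fragment = urldefrag(hash_uri)
print(document_url)  # http://example.org/data/dataset_353.rdf
print(fragment)      # entry1

# Hypothetical slash-based URI: there is no fragment, so the full path
# reaches the server and it can respond for this one resource.
slash_uri = "http://example.org/dataset/353/entry1"
slash_document_url, slash_fragment = urldefrag(slash_uri)
print(slash_document_url == slash_uri)  # True
```

Because the server sees the full slash-based path, it is in a position to return HTML or a particular RDF serialization for that single resource.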

New Style

We've been developing a new system for converting data.gov datasets that takes to heart some of the lessons learned in our early conversions. Many of the decisions about what URIs look like in our new system are based on the idea of an initial "raw" conversion of a dataset, followed by an iterative process of enhancing the data. The raw conversion is meant to be just as easy as our previous system, requiring little to no specific knowledge of a particular dataset's domain or modeling. Designing the system to allow a raw dataset to be enhanced, however, leads us to make better decisions about what the output of a raw conversion should look like.

We'll start out by describing a dataset. Each dataset has an identifier (e.g. '1530') and a version (this can be any identifying string, but here we use a date such as '2009-10-08'). The version changes whenever the underlying data for a dataset changes. With this information, we can now construct a URI for the dataset:

Again, we will be converting each row of a dataset CSV file to an instance, and assigning property values to the instances for each column. If we don't know anything about the data in the dataset, we will create a URI for a row based on its row number. Therefore, row 1 of dataset 1530 would be assigned the URI:

However, if we know enough about the dataset to know that each row represents a FOIA request, and that the "Request ID" column contains a unique value for each row (a primary key), row 1 would instead be assigned the URI:
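
As a rough sketch of the two row-naming strategies described above, the following Python builds a dataset URI from an identifier and version, then names a row either by its row number or by a primary key value. The base URI, the `/version/` path segment, the `row_` and `request_` prefixes, and the request ID `09-0042` are all assumptions made for illustration; they are not taken from the project's actual URIs or data.

```python
from typing import Optional
from urllib.parse import quote

# Hypothetical base URI and path layout; not the project's actual URIs.
BASE = "http://example.org"


def dataset_uri(dataset_id: str, version: str) -> str:
    """Name one version of a dataset from its identifier and version string."""
    return f"{BASE}/dataset/{quote(dataset_id, safe='')}/version/{quote(version, safe='')}"


def row_uri(dataset: str, row_number: int,
            kind: Optional[str] = None, key: Optional[str] = None) -> str:
    """Name a row by primary key when one is known, else by row number."""
    if kind is not None and key is not None:
        # We know what a row represents and which column identifies it.
        return f"{dataset}/{kind}_{quote(key, safe='')}"
    # Nothing is known about the data: fall back to the row's position.
    return f"{dataset}/row_{row_number}"


foia = dataset_uri("1530", "2009-10-08")
anonymous_row = row_uri(foia, 1)
keyed_row = row_uri(foia, 1, kind="request", key="09-0042")  # hypothetical Request ID

print(anonymous_row)
print(keyed_row)
```

The keyed form does not depend on where the row happens to sit in the CSV file, which is one reason to prefer it once a primary key column has been identified.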

Note that property URIs exist "within" the dataset scope (/dataset/1530/), but outside any particular dataset version (2009-10-08). Upon a first, "raw" conversion, all properties also have "/raw/" in their URI. All "raw" property values are plain literals, taken directly from the underlying CSV file. As the dataset is enhanced, the property values may change to better represent the underlying data. For example, if dataset 1530 is enhanced from its "raw" form to the first enhancement by transforming the "Received Date" values from plain literals to xsd:date datatyped literals, the new property URI becomes

Every time the dataset is enhanced, all the properties in the dataset are moved into a new enhancement namespace. While this results in more triples being created every time a dataset is enhanced, it allows existing applications and queries over the published data to continue to work as expected because any particular property URI will always be used with the same value types (plain literals, datatyped literals, URIs, etc.). We expect that datasets will only go through a small number of enhancements before they stabilize as useful, semantically-enhanced datasets.
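
The idea that each enhancement gets its own property namespace, so that a given property URI is always paired with one kind of value, can be sketched as follows. Only the `/dataset/1530/` scope and the `/raw/` segment come from the text above; the base URI, the `/vocab/` and `/enhancement/N/` segments, the lowercased local names, and the `MM/DD/YYYY` input date format are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional, Tuple

BASE = "http://example.org"  # hypothetical base URI
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"

# A literal is modeled here as (lexical form, datatype URI or None).
Literal = Tuple[str, Optional[str]]


def property_uri(dataset_id: str, column: str, enhancement: Optional[int] = None) -> str:
    """Property URIs sit inside the dataset scope but outside any version."""
    layer = "raw" if enhancement is None else f"enhancement/{enhancement}"
    local_name = column.strip().lower().replace(" ", "_")
    return f"{BASE}/dataset/{dataset_id}/vocab/{layer}/{local_name}"


def raw_value(cell: str) -> Literal:
    """Raw conversion: the cell text becomes a plain literal, unchanged."""
    return (cell, None)


def enhanced_date(cell: str, fmt: str = "%m/%d/%Y") -> Literal:
    """First enhancement: the same cell becomes an xsd:date typed literal."""
    return (datetime.strptime(cell, fmt).date().isoformat(), XSD_DATE)


raw_prop = property_uri("1530", "Received Date")
enh_prop = property_uri("1530", "Received Date", enhancement=1)

cell = "10/08/2009"  # hypothetical cell value
raw_triple_tail = (raw_prop, raw_value(cell))
enh_triple_tail = (enh_prop, enhanced_date(cell))
```

A query written against `raw_prop` keeps seeing plain literals after the enhancement is published, while new queries can opt in to `enh_prop` and rely on typed dates.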

Property values (cells in the original CSV) can be promoted from string literals to resources. When this happens, the promoted values can optionally be given an rdf:type using a newly created (dataset-local) rdfs:Class (this class can be mapped to an existing, external class in a later enhancement). For example, the "Requester Name" field in dataset 1530 represents people. The values in this field can be promoted to resources either in a value space scoped to the "Requester Name" property or, if we know there are other fields that map to the same values, in a dataset-scoped value space. Examples of these two resource promotions are:

If we asserted during such a resource promotion that these resources were of type "Person" (as is required when defining a dataset-scoped value space), then we would also end up with each of these resources having an rdf:type of
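
The difference between the two value spaces can be sketched in Python. In a property-scoped value space the same string in two different columns yields two different resources; in a dataset-scoped value space it yields one. The base URI, the `/value-of/`, `/typed/` and `/vocab/` path segments, the "Contact Name" column, and the name "Jane Doe" are all hypothetical and chosen only to make the contrast visible.

```python
from typing import Optional
from urllib.parse import quote

BASE = "http://example.org"  # hypothetical base URI


def slug(text: str) -> str:
    """Turn a cell value or column name into a URI-safe path segment."""
    return quote(text.strip().replace(" ", "_"), safe="")


def promoted_uri(dataset_id: str, value: str,
                 column: Optional[str] = None, type_name: Optional[str] = None) -> str:
    """Promote a cell value to a resource URI in one of two value spaces."""
    scope = f"{BASE}/dataset/{dataset_id}"
    if column is not None:
        # Property-scoped: the column name is part of the resource's identity.
        return f"{scope}/value-of/{slug(column.lower())}/{slug(value)}"
    if type_name is None:
        raise ValueError("a dataset-scoped value space needs a type name")
    # Dataset-scoped: any column holding this value maps to the same resource.
    return f"{scope}/typed/{slug(type_name.lower())}/{slug(value)}"


def class_uri(dataset_id: str, type_name: str) -> str:
    """URI of the dataset-local class used as the rdf:type of promoted values."""
    return f"{BASE}/dataset/{dataset_id}/vocab/{slug(type_name)}"


by_requester = promoted_uri("1530", "Jane Doe", column="Requester Name")
by_contact = promoted_uri("1530", "Jane Doe", column="Contact Name")
shared = promoted_uri("1530", "Jane Doe", type_name="Person")
person_class = class_uri("1530", "Person")
```

Here `by_requester` and `by_contact` are distinct URIs, whereas `shared` is the single URI that both columns would use under dataset scoping.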

Discussion

The URI design discussed above makes some assumptions about how governmental datasets are produced, converted to RDF, and published. Currently this design is influenced by the data.gov approach of aggregating and publishing governmental datasets in bulk. Ideally, the design and ownership of URIs would rest with the owners of each dataset, allowing them to make informed decisions based on relevant domain information, such as knowledge of underlying dataset version changes and which fields, if any, uniquely identify a row (primary key) or value (e.g. are two rows with the same "requester name" value referencing the same person, or two different people with the same name?). In such a situation, some of the URI design assumptions made here would be affected. Understanding these assumptions and how they might be affected by a change in dataset publisher is important, but beyond the scope of this document.

A good deal of work has gone into similar issues surrounding the data.gov.uk datasets. Within the UK effort, URIs are designed "both to encourage those that definitively own reference data to make it available for re-use, and to give those that have data that could be linked, the confidence to re-use a URI set that is not under their direct control."

Recent Developments

[25 August 2010]

We have made some great progress throughout the summer. The following three pages provide more in-depth descriptions of our work to design, implement, and adopt a new URI naming scheme. We just finished leading a Mashathon with representatives of many federal agencies, and our two-day experience "in the trenches" has reassured us that we are on the right track.

URI design for RDF conversion of CSV-based data - lists enhancement parameters that can describe how tabular data should be interpreted and cast into Linked Data-friendly RDF. Examples are provided with snippets of input, the (RDF) enhancement parameter, and the output.

Csv2rdf4lod - describes (and provides a pointer to download) some automation infrastructure to ease the conversion process. While a user-friendly interface is very much desirable, the system currently uses Unix shell scripts to invoke a Java JAR.