OGD Metadata

Each dataset published at Data.gov has a web page showing its
metadata (e.g., who published it, a summary of its content, where to
download it, and where to find additional information for understanding the
dataset). For example, we can collect some metadata (see
below) for "Dataset 1623" from its Data.gov URL http://www.data.gov/details/1623.

OGD Raw Data Files

An important mission of OGD portals is to help citizens download
raw data files. The web pages for Data.gov datasets have a
dedicated "Download Information" section that lists the available raw data
in various formats, including XML, CSV,
and XLS (Excel format). For example, Dataset 1623 has a raw data file
in XLS format (which can be opened with Microsoft Excel or similar tools), downloadable
at http://www.data.gov/download/1623/xls.

The following image shows a fragment of the raw data (accessed on Sep
17, 2010). It is easy to see that the raw data is essentially a table
listing total OMHA claims received by region, state, and fiscal year.

As shown in the figure above, the raw data is not normalized for machine consumption:

The table header is not on the first row.

Values in column 1 are missing for the sake of visual
abbreviation (e.g., "Mid-West" should apply to everything from
"Connecticut" through "West Virginia").

Values in column 2 refer to commonly recognized entities that
are already mentioned in many other datasets in a variety of
slightly different ways (e.g., "MD" and "24"
instead of "Maryland").

Columns 3 through 7 contain integers (_not_ the characters
"1", ",", "0", "2", and "9"), but the integers are written with
commas as thousands separators.
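
The first, second, and fourth issues above can be handled mechanically. The following is a minimal Python sketch, not the TWC LOGD converter itself: the function name `normalize_rows`, the `header_index` parameter, and the sample rows are all made up for illustration and only mimic the layout described above. It skips the rows above the header, fills blank column-1 cells with the region above them, and parses comma-formatted integers. Reconciling state identifiers (the third issue) is not attempted.

```python
import csv
import io

def normalize_rows(rows, header_index):
    """Turn a presentation-style table into one dict per data row.

    rows         -- list of lists of cell strings, as read from CSV
    header_index -- index of the row that holds the column names
    """
    header = [name.strip().lower() for name in rows[header_index]]
    records = []
    last_region = ""
    for row in rows[header_index + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank spacer rows
        cells = [cell.strip() for cell in row]
        # Fill down: a blank first column inherits the region above it.
        if cells[0]:
            last_region = cells[0]
        else:
            cells[0] = last_region
        # Columns 3 onward are assumed to hold integers written with
        # thousands commas; a blank cell here would raise ValueError.
        for i in range(2, len(cells)):
            cells[i] = int(cells[i].replace(",", ""))
        records.append(dict(zip(header, cells)))
    return records

# Made-up sample that only mimics the layout, not the real dataset.
sample = io.StringIO(
    "Example title row,,\n"
    "Region,State,FY 2009\n"
    'Mid-Atlantic,DE,"1,029"\n'
    ',MD,"12"\n'
)
records = normalize_rows(list(csv.reader(sample)), header_index=1)
```

Passing the header position explicitly keeps the sketch simple; a real converter would more likely read it from a per-dataset configuration.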

LOGD RDF Data

In this tutorial we focus on government data organized as
tables.
Although human users can easily read tabular data, cleanup
and format conversion are needed before machines can consume
it.
The TWC LOGD RDF data for a dataset is created by several automated (or
semi-automated) conversion processes. Note that users need to assign a
version identifier to the RDF data because the conversion is performed on a
snapshot of the raw data taken at a certain time. The LOGD RDF data is
available either as downloadable "dump files" or by dereferencing an
HTTP URI.
For example, the zipped RDF dump file for Dataset 1623 (version 2010-Sept-17) is available at this link.

Note: to unzip the file on Windows, see http://www.gzip.org/.
The downloaded RDF data is encoded in Turtle syntax. The
dump file concatenates the results of several conversion processes, so we
explain only a few essential fragments of the data file.
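
As an alternative to a separate unzip tool, the dump can be inspected directly from a script. The sketch below assumes the dump is gzip-compressed (suggested by the gzip.org pointer above, but not confirmed here) and UTF-8 encoded; the function name `read_first_lines` is our own.

```python
import gzip

def read_first_lines(path, count=5):
    """Return the first `count` lines of a gzip-compressed text file."""
    lines = []
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= count:
                break
    return lines
```

For example, `read_first_lines("dump.ttl.gz", 25)` would return the top of the file, where Turtle prefix declarations normally appear; the file name is a placeholder for wherever the dump was saved.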

Namespace Declaration

The namespace declarations allow URIs to be abbreviated as QNames. Below is the content of lines 2, 13, 23, 24, and 25 in the dump file. Each line declares a prefix together with its corresponding XML namespace.
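
To make the abbreviation mechanism concrete, here is a small Python sketch of how a prefix table maps QNames to full URIs and back. It is not part of the LOGD tooling. The "raw" namespace is inferred from the expansion of "raw:region" given later in this tutorial, "rdfs" is the standard RDF Schema namespace, and the helper names `expand` and `shorten` are our own.

```python
# Prefix table in the spirit of Turtle @prefix declarations.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "raw": "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/",
}

def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'raw:region' into a full URI."""
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local

def shorten(uri, prefixes=PREFIXES):
    """Abbreviate a URI using the longest matching namespace, if any."""
    matches = [(ns, p) for p, ns in prefixes.items() if uri.startswith(ns)]
    if not matches:
        return uri  # no declared namespace applies; leave the URI as is
    namespace, prefix = max(matches, key=lambda m: len(m[0]))
    return f"{prefix}:{uri[len(namespace):]}"
```

A real Turtle parser does this expansion while reading the file; the sketch only shows why the declarations must precede any use of a prefix.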

Property Definitions

The property definitions are used to (i) preserve the original text
of each table header field name; and (ii) add descriptions
contributed during the data conversion process. Below is the content of
lines 5589 to 5592 in the dump file. The "rdfs:label" declares a
human-readable label for the property "raw:region", which expands to the
URI "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/region".

The entire first line corresponds to one RDF triple meaning
"the record 'ds1623:thing_8' is referenced by the version of the
dataset". This triple is added automatically by the TWC LOGD converter.

The second line corresponds to one RDF triple meaning "the
record 'ds1623:thing_8' is related to a region named 'Mid-Atlantic'".
This triple is associated with the 1st column ("region") of the raw data
and the corresponding cell ("Mid-Atlantic") in the 8th row and the 1st
column. Note that the "region" of "ds1623:thing_9" is an empty
string.

The numbers after "raw:total" are still encoded as comma-separated strings.
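
The observations above can be mimicked with a short Python sketch of a "raw" style conversion. This is an illustration under our own assumptions, not the TWC LOGD converter: the function `raw_triples`, the rule for deriving property names from headers, and the cell values are hypothetical, and the automatically added provenance triple is omitted.

```python
def raw_triples(row_number, header, cells):
    """Yield (subject, predicate, object) tuples for one table row.

    Every cell becomes a plain string literal, including empty cells
    and numbers that still carry thousands commas.
    """
    subject = f"ds1623:thing_{row_number}"
    for name, cell in zip(header, cells):
        # Assumed naming rule: lower-case the header, spaces to underscores.
        predicate = "raw:" + name.strip().lower().replace(" ", "_")
        yield (subject, predicate, cell)

header = ["Region", "State", "Total"]
# Illustrative cell values, not copied from the dataset.
triples = list(raw_triples(8, header, ["Mid-Atlantic", "DE", "1,029"]))
triples += list(raw_triples(9, header, ["", "MD", "12"]))
```

Nothing is interpreted at this stage: the empty region of row 9 and the string "1,029" pass through unchanged, which is exactly what a later enhancement step has to repair.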

RDF Data Generated from Enhancement Conversion

An "enhancement" conversion converts the raw data (in CSV format)
into an RDF representation based on a manually created configuration
file. Below is the content of lines 82-100 (generated by an enhancement
conversion) in the RDF dump file. The RDF data encodes the first two
records of the data table, corresponding to the 8th and 9th rows in the raw
data (Excel file).

On the first line, the URI of the record is the same as the one in
the raw conversion. This allows enhanced descriptions to be added
incrementally to the existing descriptions.

On the second line, a new RDF property "e1:region" has been
created in addition to "raw:region". Note that the ranges of the two
RDF properties are different.

On the second line, a new RDF resource
"value_of_region:Mid-Atlantic" is promoted from the original literal
string in the raw data. By assigning the named entity (a region in this
case) a unique URI, users can later add more descriptions or links to
the entity, e.g., a link to the Wikipedia article on the Mid-Atlantic states.

On the third line, the number "12" is now annotated with a
datatype. This not only helps users better understand the meaning of
the data, but also supports triple stores' aggregation functions (e.g., sum),
which cannot be applied to plain literals.
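
The two enhancements described above, promoting a literal to a resource and typing a number, can be sketched in Python as follows. The namespace `VALUE_OF_REGION` is a placeholder (the real one is declared in the dump file and is not reproduced in this tutorial), the helper names are our own, and the totals are illustrative.

```python
from urllib.parse import quote

# Placeholder namespace for promoted region resources (an assumption).
VALUE_OF_REGION = "http://example.org/value-of/region/"

def promote(label, namespace=VALUE_OF_REGION):
    """Mint a URI for a named entity that was a plain string in raw data."""
    return namespace + quote(label.strip().replace(" ", "_"))

def as_integer(text):
    """Parse a count such as '1,029' into an int by dropping the commas."""
    return int(text.replace(",", ""))

raw_totals = ["1,029", "12"]  # illustrative values
typed_totals = [as_integer(t) for t in raw_totals]
grand_total = sum(typed_totals)  # summing is only meaningful after typing
region_uri = promote("Mid-Atlantic")
```

Once the region has a URI of its own, further statements about it can be attached to that URI instead of being repeated as strings on every record.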
