Abstract

This document defines the procedures and rules to be applied when converting tabular data into RDF. Tabular data may be complemented with metadata annotations that describe its structure, the meaning of its content and how it may form part of a collection of interrelated tabular data. This document specifies the effect of this metadata on the resulting RDF.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This document aims to satisfy the RDF variant of the mapping recommendation.

This document has been reviewed by W3C Members, by software developers, and by other W3C
groups and interested parties, and is endorsed by the Director as a W3C Recommendation.
It is a stable document and may be used as reference material or cited from another
document. W3C's role in making the Recommendation is to draw attention to the
specification and to promote its widespread deployment. This enhances the functionality
and interoperability of the Web.

1. Introduction

This document describes the processing of tabular data to create an RDF subject-predicate-object triples [rdf11-concepts]. Since RDF is an abstract syntax, these triples MAY be serialized in a concrete RDF syntax such as N-Triples [n-triples], Turtle [turtle], RDFa [rdfa-primer], JSON-LD [json-ld], or TriG [trig]. The RDF serializations offered by a conversion application is implementation defined.

There is no requirement on conversion applications to check the semantic consistency of the data during the conversion, nor to validate the triples against RDF schema. Downstream applications SHOULD be aware of the potential for inconsistencies and take appropriate action.

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples,
and notes in this specification are non-normative. Everything else in this specification is
normative.

The key words MAY, MUST, and SHOULD are
to be interpreted as described in [RFC2119].

A document reference (normative or informative) is enclosed in square brackets and links to the references section.

Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on the whether they are in a normative or informative section, respectively.

Example 1

Examples are in light khaki boxes, with khaki left border, and with a
numbered "Example" header in khaki. Examples are always informative.
The content of the example is in monospace font and may be syntax colored.

A node is defined in [rdf11-concepts] as a subject or an object of an RDF triple. When in subject position, it can be either a blank node or identified with a URL; when in object position, it can be a blank node, a literal, or identified with a URL.

If a CSV file is originally encoded as UTF-8, it should not go through Unicode normalization during parsing, nor in conversion to RDF. This can result in RDF literals that are not in Normal Form C as they should be according to [rdf11-concepts].

A datatype's format annotation is irrelevant to the conversion procedure defined in this specification; the cell value has already been parsed from the contents of the cell according to the format annotation.

According to [rdf11-concepts] language tags cannot be combined with other xsd datatypes. If a cell has any datatype other than string, the value of langMUST be ignored. Also, all literals have a datatype; however, specific serializations, like [turtle], MAY provide a special syntax for literals with datatype xsd:string or rdf:langString.

5. Inclusion of provenance information

This section is non-normative.

In addition to the namespaces defined above, the following namespace is used in this section:

prov:

http://www.w3.org/ns/prov#

Conversion applications MAY include provenance information in the RDF output describing how and when the output was created; e.g., using terms from the PROV Ontology [prov-o]. Information that may be of interest to downstream applications includes:

the source tabular data file;

the metadata description file(s) used;

when the conversion to RDF occurred; and

the conversion application used.

In order to facilitate providing such information, this specification introduces two instances of prov:Role:

csvw:csvEncodedTabularData

Defines the role of the source tabular data file.

csvw:tabularMetadata

Defines the role of the metadata description file.

An illustrative example of provenance information is provided below in [turtle] syntax, the conversion application used is identified as http://example.org/my-csv2rdf-application:

Conversion applications may have other means to create annotated tables, e.g. through some application specific APIs. In such cases the exact format for non-core annotations or notes may be different. Specifications for such annotation processes should specify how these annotations are converted into RDF.

If value is an array, generate RDF by running this algorithm using subject, property using each array member as value.

If value is an object containing @value, create an RDF Literal lit using the string value of @value and language from @language, or datatype from @type if present, expanding @type as necessary using the procedure outlined for property, and emit the following triple:

For every value of @type, either a term defined in the [csvw-context], a prefixed name, or an absolute URL; establish a new node Ti by expanding the value to an absolute URL by replacing a term with the URI from the term definition in [csvw-context] or a prefixed name with its expanded value. For each Ti, emit the following triple:

7. Examples

In addition to the namespaces defined above, the examples provided here make use of the following namespaces:

dc:

http://purl.org/dc/terms/

foaf:

http://xmlns.com/foaf/0.1/

oa:

http://www.w3.org/ns/oa#

org:

http://www.w3.org/ns/org#

schema:

http://schema.org/

Furthermore, these examples also make use of the Turtle syntax @base declaration (as defined in [turtle]). Where a single tabular data file is used in the example, the @base declaration is set to the URL of that tabular data file.

Note

Each of the examples expresses more complex conversions - it is recommended that readers of this specification work through the examples in sequential order.

7.1 Simple example

This example comprises a single annotated table containing information attributes about countries; country code, position (latitude, longitude) and name. Whilst the input tabular data file, published at http://example.org/countries.csv, includes a header line, no further metadata annotations are given. The tabular data file is provided below:

The about URL annotation has not been set for cells in table T ({ "url": "http://example.org/countries.csv"}); cells in a given row where about URL has not been specified are assumed to refer to the same subject. This unspecified subject is treated as a blank node.

Given that the property URL is null for cells in table T ({ "url": "http://example.org/countries.csv"}), the property URL defaults to the URI Template (see [RFC6570]) #{[column-name]}, where [column-name] is the value of the name annotation of the column associated with the cell. For example, the value of the property URL annotation for all cells in columnC1 ("name": "countryCode") is http://example.org/countries.csv#countryCode.

The type of both table and group of tables objects is explicitly stated; csvw:TableGroup and csvw:Table respectively.

The csvw:url property provides reference to the original tabular data file and to specific rows therein - noting the need to escape the Turtle-syntax reserved character = (U+003D) within the fragment identifier.

7.2 Example with single table and rich annotations

This example is based on Use Case #11 - City of Palo Alto Tree Data and comprises a single annotated table describing an inventory of tree maintenance operations. The input tabular data file, published at http://example.org/tree-ops-ext.csv, and the associated metadata description http://example.org/tree-ops-ext.csv-metadata.json are provided below:

{"@context":["http://www.w3.org/ns/csvw",{"@language":"en"}],"@id":"http://example.org/tree-ops-ext","url":"tree-ops-ext.csv","dc:title":"Tree Operations","dcat:keyword":["tree","street","maintenance"],"dc:publisher":[{"schema:name":"Example Municipality","schema:url":{"@id":"http://example.org"}}],"dc:license":{"@id":"http://opendefinition.org/licenses/cc-by/"},"dc:modified":{"@value":"2010-12-31","@type":"xsd:date"},"notes":[{"@type":"oa:Annotation","oa:hasTarget":{"@id":"http://example.org/tree-ops-ext"},"oa:hasBody":{"@type":"oa:EmbeddedContent","rdf:value":"This is a very interesting comment about the table; it's a table!","dc:format":{"@value":"text/plain"}}}],"dialect":{"trim":true},"tableSchema":{"columns":[{"name":"GID","titles":["GID","Generic Identifier"],"dc:description":"An identifier for the operation on a tree.","datatype":"string","required":true,"suppressOutput":true},{"name":"on_street","titles":"On Street","dc:description":"The street that the tree is on.","datatype":"string"},{"name":"species","titles":"Species","dc:description":"The species of the tree.","datatype":"string"},{"name":"trim_cycle","titles":"Trim Cycle","dc:description":"The operation performed on the tree.","datatype":"string","lang":"en"},{"name":"dbh","titles":"Diameter at Breast Ht","dc:description":"Diameter at Breast Height (DBH) of the tree (in feet), measured 4.5ft above ground.","datatype":"integer"},{"name":"inventory_date","titles":"Inventory Date","dc:description":"The date of the operation that was performed.","datatype":{"base":"date","format":"M/d/yyyy"}},{"name":"comments","titles":"Comments","dc:description":"Supplementary comments relating to the operation or tree.","datatype":"string","separator":";"},{"name":"protected","titles":"Protected","dc:description":"Indication (YES / NO) whether the tree is subject to a protection order.","datatype":{"base":"boolean","format":"YES|NO"},"default":"NO"},{"name":"kml","titles":"KML","dc:description":"KML-encoded description of tree location.","datatype":"xml"}],"primaryKey":"GID","aboutUrl":"http://example.org/tree-ops-ext#gid-{GID}"}}

The datatype annotation is set on columnsC5, C6, C8 and C9 ({ "name": "dbh"}, { "name": "inventory_date" }, { "name": "protected" } and { "name": "kml" }); as integer, date, boolean and xml respectively. The datatype property is inherited by all cells in each of those columns, therefore the RDF output for those cells includes the appropriate datatype IRI.

CellC3.7 (rowR3; column, { "name": "comments" }) contains an unordered sequence of values; the set of values are included as a simple set of triples as opposed to an instance of rdf:List as the ordered annotation has defaulted to false.

As the metadata description file http://example.org/tree-ops-ext.csv-metadata.json defines a default language within the context ("@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}]), all non-core annotations of type string (e.g. dc:title, dcat:keyword, dc:publisher, dc:license and dc:modified) are expressed in the RDF output using the the appropriate language tag.

7.3 Example with single table and using virtual columns to produce multiple subjects per row

This example uses a single annotated table describing a listing of music events. Each row from the tabular data file corresponds to three resources; the music event itself, the location where that event occurs and the offer to sell tickets for that event. The goal is to convert the CSV content into schema.org markup that a search engine such as Google can use to index music events. Details of how Google expects this information to be structured can be found here.

The input tabular data file, published at http://example.org/events-listing.csv, and the associated metadata description http://example.org/events-listing.csv-metadata.json are provided below:

The CSV to RDF translation is limited to providing one statement, or triple, per column in the table. The target schema.org markup requires 10 statements to describe each event. As the base tabular data file contains 5 columns, an additional 5 virtual columns have been added in order to provide for the full complement of statements—including the relationships between the 3 resources (event, location, and offer) described by each row of the table. Note that the virtual annotation is set to true for these virtual columns.

Furthermore, note that no attempt is made to reconcile between locations or offers that may be associated with more than one event; every row in the table will create both a new location resource and offer resource in addition to the event resource. If considered necessary, applications such as OpenRefine may be used to identify and reconcile duplicate location resources once the RDF output has been generated.

The annotated table generated from parsing the tabular data file and associated metadata is shown below and provides the basis for the conversion to RDF.

Annotations for the resulting tableT, with 10 columns and 2 rows, are shown below:

The resources described by each row are explicitly defined using the about URL annotation this case three resources per row (event, location, and offer). The relationship between the row and each subject resource is asserted using the csvw:describes property; e.g. for rowR1, a blank node, we state the following triples:

This example is based on Use Case #4 - Publication of public sector roles and salaries and uses four annotated tables published as a group of tables. Information about senior roles and junior roles within a government department or organization are published in CSV format by each department. These are validated against a centrally published schema to ensure that all the data published by departments is consistent. Additionally, lists of organizations and professions are also published centrally, providing controlled vocabularies against which departmental submissions are validated.

Information published about junior and senior roles provides summary information for each post within the government department or organization. Whilst the junior role information is anonymous, providing only an indication of the number of full-time-equivalent (FTE) staff occupying a given post, the senior role information specifies the named individual occupying each post. As such, each row from the tabular data file describing senior roles corresponds to two resources; the post and the person occupying that post.

This example is concerned only with converting the information provided by each government department or organization rather than the centrally published information listing organizations and professions.

This example makes extensive use of the example.org domain. As described in [RFC6761], this domain is used for illustrative examples within documentation. In reality, the resources described here with the URL path /gov.uk would be centrally published by the UK Government at, say, the domain data.gov.uk.

Given that these resources are centrally published with an aspiration for reuse, the schema descriptions have been factored out into separate resources. As such, the top-level metadata description resource metadata.json simply provides the list of tables and binds each of them to the appropriate schema that is defined elsewhere.

Finally, note that because the centrally published metadata descriptions are intended to be reused across many government departments and organizations, extra consideration has been given to defining URIs for the person and post resources defined in each row of the senior roles tabular data and subsequently referenced from the junior roles tabular data. To ensure that naming clashes are avoided, the unique reference for the organization to which the person or post belongs has been included in a path segment of the identifier. For example, the URI template propertyaboutUrl used to identify the senior post is specified as http://example.org/organization/{organizationRef}/post/{ref}, thus yielding the URI http://example.org/organization/hefce.ac.uk/post/90115 for the post described in the first row of the senior roles tabular data.

The group of tables generated from parsing the tabular data files and associated metadata is shown below and provides the basis for the conversion to RDF.

In this example, output for the centrally published lists of organizations and professions, tables Ta and Tb (http://example.org/gov.uk/data/organizations.csv and http://example.org/gov.uk/data/professions.csv respectively), are not required; only information from the departmental submissions is to be translated to RDF. Note the suppress output annotation on this table.

ColumnCc8, with the virtual annotation specified as true, is used to relate the person resource, whose name is provided in columnCc2, to the associated post resource within the current row of tableTc ({ "url": "http://example.org/senior-roles.csv" }).

Output for tablesTa and Tb ({ "url": "http://example.org/gov.uk/data/organizations.csv" } and { "url": "http://example.org/gov.uk/data/professions.csv" }) are not included as the suppress output annotation is true.

ColumnsCc5 and Cd1 ({ "name": "reportsTo" } and { "name": "reportsToSenior" }) use the about URL, property URL and value URL annotations to assert the relationship between the post described by a given row and the senior post to which it reports.

Similarly, columnsCc7 and Cd8 (both with { "name": "organizationRef" }) use the about URL, property URL and value URL annotations to assert the relationship between the post described by a given row and the organization to which it belongs.

Finally, note that two resources are created for each row within tableTc ({ "url": "http://example.org/senior-roles.csv" }): the person and the post they occupy. The relationship between these resources is specified via virtualcolumnCc8 ({ "name": "post_holder" }) using the about URL, property URL and value URL annotations.

Table groupG was explicitly defined, but has not been explicitly identified; the table group and table resources are treated as blank nodes.

The person and post resources described by each row of tableTc ({ "url": "http://example.org/senior-roles.csv"}) are explcitly defined using the aboutUrl property; therefore, say, for rowRc1 we state the following triples:

subject

Rc1

predicate

csvw:describes

object

<http://example.org/organization/hefce.ac.uk/post/90115>

subject

Rc1

predicate

csvw:describes

object

<http://example.org/organization/hefce.ac.uk/person/1>

Conversely, the aboutUrl property has not been defined for resources described by each row of tableTd ({ "url": "http://example.org/junior-roles.csv"}); therefore blank nodes are used, e.g. for rowRd1 we state the following triple:

subject

Rd1

predicate

csvw:describes

object

_:d8b8e40c-8c74-458b-99f7-64d1cf5c65f2

A. Relationship to RDB Direct Mapping

This section is non-normative.

The "Direct Mapping of Relational Data to RDF" W3C Recommendation [rdb-direct-mapping] defines a simple transformation (referred to as direct mapping) from a relational representation of data to RDF. The direct mapping takes as input a relational database (data and schema), and generates an RDF graph called the direct graph. Tables in a relational databases bear a strong resemblance to tabular data as defined in [tabular-data-model]; this section highlights the similarities and differences between the direct mapping and the tabular data mapping defined by this document. The following statements summarize the relationships:

where http://foo.example.org/CSV/People is the URL of the CSV file exported from the relational database.

Comparing the two conversion results:

The tabular data mapping does not have the information that the first column provides unique identifiers for the rows (i.e., that it is a primary key); consequently, a blank node must be used for the common row subject.

The tabular data mapping does not have information on the data types and, therefore, cannot presume that the first and the third columns contain integers.

It is, however, possible to add annotation to the tabular data so that the two graphs would be semantically equivalent. Indeed, consider the following metadata using the definitions of the [tabular-metadata] specification:

The metadata adds annotations on datatypes, adds the RDF typing triple explicitly (using a virtual column), and uses the aboutUrl URL template property to provide the common subject. Essentially, the metadata provides the information that the direct mapping retrieves from the table schema.

To generate the correct object URI in the highlighted statement of the direct graph (above) the processor has to:

Extract the values of columns deptName and deptCity for the current row in the People table in order to determine the value of the compound unique key to the Departments table for that row.

Find the associated row in the Departments table that matches that compound unique key value and determine the subject for that row.

This can be done because the direct mapping processor has simultaneous access to several tables within the same relational database. It is therefore straightforward to access all the tables in parallel and establish the necessary relationships to generate the triples.

However, this combination cannot be handled by the tabular data mapping. The situation for tabular data is indeed different: tables are typically generated from single and, potentially, very large CSV files, meaning that a tabular data mapping processor cannot be expected to handle several tables in parallel. That is the reason why the [tabular-data-model] does not include features that would require such parallel access. As a consequence, the output of the direct mapping for such tables cannot be reproduced by the tabular data mapping.