A large percentage of the data published on the Web is tabular data, commonly published as
comma separated values (CSV) files. The CSV on the Web Working Group aims to specify
technologies that provide greater interoperability for data dependent applications on the
Web when working with tabular datasets comprising single or multiple files in CSV, or a
similar, format.

This document lists the use cases compiled by the Working Group that are considered
representative of how tabular data is commonly used within data dependent applications. The
use cases observe existing common practice undertaken when working with tabular data, often
illustrating shortcomings or limitations of existing formats or technologies. This document
also provides a set of requirements derived from these use cases that have been used to
guide the specification design.

This is a draft document which may be merged into another document or eventually make its
way into being a standalone Working Draft.

Introduction

A large percentage of the data published on the Web is tabular data, commonly published as
comma separated values (CSV) files. CSV files may be of a significant size but they can be
generated and manipulated easily, and there is a significant body of software available to
handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork’s Numbers, or
OpenOffice.org), as well as numerous other applications, can produce and consume these files.
However, although these tools make conversion to CSV easy, it is resisted by some publishers
because CSV is a much less rich format that cannot express important details that the
publishers want to convey, such as annotations, the meaning of identifier codes, etc.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel);
un-extensible (e.g. CSV/TSV); or they assume the use of particular technologies (e.g. SQL
dumps). None of these formats allow developers to pull in multiple data sets, manipulate,
visualize and combine them in flexible ways. Other information relevant to these datasets,
such as access rights and provenance, is not easy to find. CSV is a very useful and simple
format, but to unlock the data and make it portable to environments other than the one in
which it was created, there needs to be a means of encoding and associating relevant
metadata.

To address these issues, the CSV on the Web Working Group seeks to provide metadata
vocabularies for describing tabular data, together with mechanisms for transforming that
data into other formats such as JSON, XML and RDF.

In order to determine the scope of, and elicit the requirements for, this extended
CSV format (CSV+), a set of use cases has been compiled. Each use case provides a narrative
describing how a representative user works with tabular data to achieve their goal,
supported, where possible, with example datasets. The use cases observe existing common
practice undertaken when working with tabular data, often illustrating shortcomings or
limitations of existing formats or technologies. It is anticipated that the additional
metadata provided within the CSV+ format, when coupled with metadata-aware tools, will
simplify how users work with tabular data. As a result, the use cases seek to identify where
user effort may be reduced.

A set of requirements, used to guide the development of the CSV+ specification, has been
derived from the compiled use cases.

Use Cases

The use cases below describe many applications of tabular data. Whilst there are many
different variations of tabular data, all the examples conform to the definition of
tabular data defined in the Model for Tabular Data and Metadata on the Web [[!tabular-data-model]]:

Tabular data is data that is structured into rows, each of which
contains information about some thing. Each row contains the same number of fields
(although some of these fields may be empty), which provide values of properties of
the thing described by the row. In tabular data, fields within the same column provide
values for the same property of the thing described by the particular row.

In selecting the use cases we have reviewed a number of row oriented data
formats that, at first glance, appear to be tabular data. However, closer inspection
indicates that one or other of the characteristics of tabular data were not present.
For example, the HL7 format,
from the health informatics domain, defines a separate schema for each row (known as
a "segment" in that format), which means that HL7 messages do not have a regular
number of columns for each row.

Use Case #1 - Digital preservation of government records

(Contributed by Adam Retter; supplemental information about use of XML provided by Liam Quin)

The laws of England and Wales place obligations upon departments and The National Archives (TNA) for the collection,
disposal and preservation of records. Government departments are obliged under the Public Records Act 1958, sections 3, 4 and 5, to select, transfer, preserve and make
available those records that have been defined as public records. These obligations apply
to records in all formats and media, including paper and digital records. Details
concerning the selection and transfer of records can be found here.

Departments transferring records to TNA must catalogue or list the selected records
according to The National Archives' defined cataloguing principles and standards.
Cataloguing is the process of writing a description, or Transcription of Records,
for the records being transferred. Once each Transcription of Records is added to the
Records Catalogue, records can be subsequently discovered and accessed using the supplied
descriptions and titles.

TNA specifies what information should be provided within a Transcription of Records and
how that information should be formatted. A number of formats and syntaxes are supported,
including RDF. However, the predominant format used for the exchange of Transcriptions of
Records is CSV as the government departments providing the Records lack either the
technology or resources to provide metadata in the XML and RDF formats preferred by the
TNA.

A CSV-encoded Transcription of Records typically describes a set of Records, often
organised within a hierarchy. As a result, it is necessary to describe the
interrelationships between Records within a single CSV file.

Each row within a CSV file relates to a particular Record and is allocated a unique
identifier. This unique identifier behaves as a primary key for the Record within the
scope of the CSV file and is used when referencing that Record from within other Record
transcriptions. The unique identifier is unique within the scope of the datafile; in
order for the Record to be referenced from outside this datafile, the local identifier
must be mapped to a globally unique identifier such as a URI.

Upon receipt by TNA, each of the Transcriptions of Records is validated against the (set
of) centrally published data definition(s); it is essential that received CSV metadata
comply with these specifications to ensure efficient and error free ingest into the
Records Catalogue.

The validation applied is dependent on the type of entity described in each row. Entity type
is specified in a specific column (e.g. type).

The data definition file, or CSV Schema, used by the CSV Validation Tool effectively
forms the basis of a formal contract between TNA and supplying organisations. For more
information on the CSV Validation Tool and CSV Schema developed by TNA please refer to the
online
documentation.

Following validation, the CSV-encoded Transcriptions of Records are transformed into RDF
for insertion into the triple store that underpins the Records Catalogue. The CSV is initially
transformed into an interim XML format using XSLT and then processed further using a mix
of XSLT, Java and Scala to create RDF/XML. The CSV files do
not include all the information required to undertake the transformation, e.g. defining
which RDF properties are to be used when creating triples for the data value in each cell.
As a result, bespoke software has been created by TNA to supply the necessary additional
information during the CSV to RDF transformation process. The availability of generic
mechanisms to transform CSV to RDF would reduce the burden of effort within TNA when
working with CSV files.
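
A minimal sketch of such a generic mechanism is given below. It is illustrative only: the
column names, property URIs and base URI are assumptions made for the example (they do not
reflect TNA's actual schema), and Python's standard csv module is used to emit N-Triples:

import csv

# Hypothetical mapping from CSV column names to RDF property URIs; in practice this
# information would come from metadata supplied alongside the CSV file.
PROPERTY_MAP = {
    "title":       "http://example.org/terms/title",
    "description": "http://example.org/terms/description",
    "parent_id":   "http://example.org/terms/parentRecord",
}
BASE_URI = "http://example.org/record/"   # assumed namespace for Records

def csv_to_ntriples(path, id_column="identifier"):
    """Yield one N-Triples line per non-empty mapped cell in the CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            subject = "<%s%s>" % (BASE_URI, row[id_column])
            for column, prop in PROPERTY_MAP.items():
                value = row.get(column, "").strip()
                if not value:
                    continue                                   # empty cells produce no triple
                if column == "parent_id":
                    obj = "<%s%s>" % (BASE_URI, value)         # reference to another Record
                else:
                    obj = '"%s"' % value.replace('"', '\\"')   # plain literal
                yield "%s <%s> %s ." % (subject, prop, obj)

for triple in csv_to_ntriples("transcription_of_records.csv"):
    print(triple)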

In this particular case, RDF is the target format for the conversion of the CSV-encoded
Transcriptions of Records. However, the conversion of CSV to XML (in this case used as
an interim conversion step) is illustrative of a common data conversion workflow.

The transformation outlined above is typical of common practice in that it uses a
freely-available XSLT transformation or XQuery parser (in this case
Andrew Welch's CSV to
XML converter in XSLT 2.0) which is then modified to meet the specific usage requirements.

The resulting XML document can then be further transformed using XSLT to create
XHTML documentation - perhaps including charts such as histograms to present summary data.

Use Case #2 - Publication of National Statistics

(Contributed by Jeni Tennison)

The Office for National Statistics (ONS) is the UK’s
largest independent producer of official statistics and is the recognised national
statistical institute for the UK. It is responsible for collecting and publishing
statistics related to the economy, population and society at national, regional and local
levels.

Sets of statistics are typically grouped together into datasets comprising collections of
related tabular data. Within their underlying information systems, ONS maintains a clear
separation between the statistical data itself and the metadata required for interpretation.
ONS classify the metadata into two categories:

The ONS Data Explorer presents the user with a list of available datasets. A user may choose
to browse through the entire list or filter that list by topic. To enable the user to determine
whether or not a dataset meets their need, summary information is available for each dataset.

Once the required dataset has been selected, the user is prompted to choose how they
would like the statistical data to be aggregated. In the case of QS601EW Economic
activity, the user is required to choose between the two mutually exclusive
geography types: 2011 Administrative Hierarchy and 2011 Westminster Parliamentary
Constituency Hierarchy. Effectively, the QS601EW Economic activity
dataset is partitioned into two separate tables for publication.

The user is also provided with an option to sub-select only the elements of the
dataset that they deem pertinent for their needs. In the case of QS601EW Economic
activity the user may select data from up to 200 geographic areas within the
dataset to create a data subset that meets their needs. The data subset may be viewed
on-line (presented as an HTML table) or downloaded in CSV or Microsoft Excel formats.

An example extract of data for England and Wales in CSV form is provided below.
The data subset is provided as a compressed file containing both a CSV formatted data file
and a complementary html file containing the reference metadata. White space has been added
for clarity. File =
CSV_QS601EW2011WARDH_151277.zip

Correct interpretation of the statistics requires additional qualification or
awareness of context. To achieve this the complementary html file includes supplementary information and
annotations pertinent to the data published in the accompanying CSV file. Annotation or references may
be applied to:

Use Case #3 - Creation of consolidated global land surface temperature climate databank

Climate change and global warming have become one of the most pressing environmental
concerns in society today. Crucial to predicting future change is an understanding of
the world's historical climate, with long duration instrumental records of climate being
central to that goal. Whilst there is an abundance of data recording the climate at
locations the world over, the scrutiny under which climate science is put means that much
of this data remains unused, leading to a paucity of data in some regions with which to
verify our understanding of climate change.

To achieve this goal, climate datasets, known as “decks”, are gathered from participating
organisations and merged into a combined dataset using a scientifically peer reviewed method which assesses the data records for inclusion against a variety of
criteria.

Given the need for openness and transparency in creating the databank, it is essential
that the provenance of the source data is clear. Original source data, particularly for
records captured prior to the mid-twentieth century, may be in hard-copy form. In order to
incorporate the widest possible scope of source data, the International Surface
Temperature Initiative is supported by data rescue
activities to digitise hard copy records.

Stage 2: data converted into a common format and with provenance and version control
information appended

Stage 3: merged collation of stage 2 data within a single consolidated dataset

The Stage 1 data is typically provided in tabular form - the most common variant is
white-space delimited ASCII files. Each data deck comprises multiple files which are
packaged as a compressed tar ball (.tar.gz). Included within the compressed
tar ball package, and provided alongside, is a read-me file providing unstructured
supplementary information. Summary information is often embedded at the top of each
file.

The Ugandan Stage 1 data deck appears to be comprised of two discrete datasets, each
partitioned into a sub-directory within the tar ball: uganda-raw and
uganda-bestguess. Each sub-directory includes a Microsoft Word document
providing supplementary information about the provenance of the dataset; of particular
note is that uganda-raw is collated from 9 source datasets whilst
uganda-bestguess provides what is considered by the data publisher to be
the best set of values with duplicate values discarded.

Dataset uganda-raw is split into 96 discrete files, each providing maximum,
minimum or mean monthly air temperature for one of the 32 weather observation stations
(sites) included in the data set. Similarly, dataset uganda-bestguess is
partitioned into discrete files; in this case just 3 files, each of which provides maximum,
minimum or mean monthly air temperature data for all sites. The mapping from data file to
data sub-set is described in the Microsoft Word document.

summary information pertinent to the “data rows” is included at the beginning of the
data file

row, column and cell value interpretation is informed by accompanying Microsoft Word
document; human intervention is required to unambiguously determine semantics, e.g. the
meaning of each column, the unit of measurement

the observed property is defined as “Tmax”; there is no reference to an authoritative
definition describing that property

there is no header line providing column names

the year and month (column 1) is expressed as a decimal value; e.g. 1901.04 –
equivalent to January, 1901 (see the parsing sketch after this list)

multiple temperature values (“replicates”) are provided for each row; one from each of
the sources defined in the header, e.g. BEA (British East Africa),
GHCNv3G, ColArchive, GSOD and
NCARds512
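
As a simple illustration of the interpretation currently left to a human reader, the sketch
below decodes the decimal year-and-month value into a calendar year and month. It assumes that
the fractional part maps linearly onto the twelve months (so that 1901.04 falls in January
1901); this assumption is exactly the kind of information that machine-readable metadata could
make explicit:

def decode_year_month(value):
    """Split a decimal date such as 1901.04 into (year, month).

    Assumes the fractional part encodes the month as a linear fraction of the
    year, i.e. January is roughly .04 and December roughly .96.
    """
    year = int(value)
    month = int(round((value - year) * 12 + 0.5))
    return year, month

print(decode_year_month(1901.04))   # (1901, 1)  - January 1901
print(decode_year_month(1901.96))   # (1901, 12) - December 1901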

Many of the characteristics concerning the “raw” file are exhibited here too.
Additionally, we see that:

the delimiter is now tab (U+0009)

metadata is entirely missing from this file, requiring human intervention to combine
the filename token (tmx) with supplementary information in the accompanying
Microsoft Word document to determine the semantics

Because of the heterogeneity of the Stage 1 data decks, bespoke data processing programs
were required for each data deck consuming valuable effort and resource in simple data
pre-processing. If the semantics, structure and other supplementary metadata pertinent to
the Stage 1 data decks had been machine readable, then this data homogenisation stage
could have been avoided altogether. Data provenance is crucial to this initiative,
therefore it would be beneficial to be able to associate the supplementary metadata
without needing to edit the original data files.

The data pre-processing tools created to parse each Stage 1 data deck into the standard
Stage 2 format and the merge process to create the consolidated Stage 3 data set were
written using the software most familiar to the participating scientists: Fortran 95. The
merge software source code is available online. It is worth noting that this sector of the scientific community also
commonly uses IDL and is
gradually adopting Python as the default software
language choice.

The resulting merged dataset is published in several formats – including tabular text.
The GHCN-format merged dataset comprises several files: merged data and withheld
data (e.g. those data that did not meet the merge criteria) each with an associated
“inventory” file.

The columns are: station identifier, latitude,
longitude, altitude (m) and station name. The
data is fixed format rather than delimited.

Similarly, a snippet of the merged data itself is provided. Given that the original
.dat file is a largely unmanageable 422.6 MB in size, a subset is provided.
File = merged.monthly.stage3.v1.0.0-beta4.snip

The columns are: station identifier, year, quantity
kind and the quantity values for months January to December in that year. Again,
the data is fixed format rather than delimited.

Here we see the station identifier REC41011874 being used as a foreign key
to refer to the observing station details; in this case Entebbe Airport. Once again, there
is no metadata provided within the file to describe how to interpret each of the data
values.
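
Because the data is fixed format rather than delimited, a consumer must know the character
positions of each field before the file can be parsed. The sketch below illustrates this for
the inventory file; the column positions and file name are assumptions made for illustration
rather than values taken from a published specification:

# Assumed (illustrative) character positions for the fixed-width inventory file:
# station identifier, latitude, longitude, altitude (m), station name.
FIELDS = [
    ("station_id", 0, 11),
    ("latitude",   12, 20),
    ("longitude",  21, 30),
    ("altitude_m", 31, 37),
    ("name",       38, 68),
]

def parse_inventory_line(line):
    """Slice one fixed-width line into a dict of named fields."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}

with open("merged.monthly.stage3.inv", encoding="ascii") as f:   # file name is illustrative
    for line in f:
        record = parse_inventory_line(line)
        print(record["station_id"], record["name"])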

The resulting merged dataset provides time series of how the observed climate has changed
over a long duration at approximately 32000 locations around the globe. Such instrumental
climate records provide a basis for climate research. However, it is well known that these
climate records are usually affected by inhomogeneities (artificial shifts) due to changes
in the measurement conditions (e.g. relocation, modification or recalibration
of the instrument etc.). As these artificial shifts often have the same magnitude as the
climate signal, such as long-term variations, trends or cycles, a direct analysis of the
raw time-series data can lead to wrong conclusions about climate change.

Statistical homogenisation procedures are used to detect and correct these artificial shifts.
Once detected, the raw time-series data is annotated to indicate the presence of artificial
shifts in the data, details of the homogenisation procedure undertaken and, where possible,
the reasons for those shifts.

Use Case #4 - Publication of public sector roles and salaries

The collection of public sector roles and salaries information is managed by the
Cabinet Office and subsequently
published via the UK Government data portal at data.gov.uk.

In order to ensure a consistent return from submitting departments and agencies, the
Cabinet Office mandated that each response conform to a data definition schema, which is described within a narrative PDF document. Each submission comprises a
pair of CSV files - one for senior roles and another for junior roles.

The submission for senior roles from the Higher Education Funding Council for England (HEFCE) is provided
below to illustrate. White space has been added for clarity. File = HEFCE_organogram_senior_data_31032011.csv

double quote escaping of text cells including the delimiter character (comma)

Within the senior role CSV the cell Post Unique Reference provides
a primary key within the data file for each row. In addition, it provides a
unique identifier for the entity described within a given row. In order for the
entity to be referenced from outside this datafile, the local identifier
must be mapped to a globally unique identifier such as a URI.

This unique identifier is referenced both from within the senior post dataset,
Reports to Senior Post, and within the junior post dataset, Reporting
Senior Post in order to determine the relationships within the organisational
structure.

The public sector roles and salaries information is published at
data.gov.uk using an interactive "Organogram Viewer" widget implemented using JavaScript.
The HEFCE data can be visualized
here. For convenience, a screenshot is provided below.

Screenshot of Organogram Viewer web application showing HEFCE data

In order to create this visualization, each pair of tabular datasets was transformed
into RDF and uploaded into a triple store exposing a SPARQL end-point, which the
interactive widget then queries to acquire the necessary data. An example of the derived RDF
is provided in file
HEFCE_organogram_31032011.rdf.

The transformation from CSV to RDF required bespoke software, supplementing the content
in the CSV files with additional information such as the RDF properties for each column.
The need to create and maintain bespoke software incurs costs that may be avoided through
use of a generic CSV-to-RDF transformation mechanism.

Use Case #5 - Publication of property transaction data

(Contributed by Andy Seaborne)

The Land Registry is the
government department with responsibility to register the ownership of land and property
within England and Wales. Once land or property is entered into the Land Register,
any ownership changes, mortgages or leases affecting that land or property are recorded.

Their
Price paid data, dating from 1995 and consisting of more than 18.5 million records,
tracks the residential property sales in England and Wales that are lodged for registration.
This dataset is one of the most reliable sources of house price information in England and Wales.

Residential property transaction details are extracted from a data warehouse system
and collated into a tabular dataset for each month.
The current monthly dataset is available online in both .txt and
.csv formats. Snippets of data for January 2014 are provided below. White space
has been added for clarity.

There seems to be little difference between the two formats with the exception that all
cells within the .csv file are enclosed within a pair of double quotes ("").

The header row is absent. Information regarding the meaning of each column and the
abbreviations used within the dataset are provided in a complementary
FAQ document.
The column headings are provided below along with some supplemental detail:

Transaction unique identifier

Price - sale price stated on the Transfer deed

Date of Transfer - date when the sale was completed, as stated on the Transfer deed

Each row, or record, within the tabular dataset describes a property transaction. The
Transaction unique identifier column provides a unique identifier for that
property transaction. Given that transactions may be amended, this identifier cannot
be treated as a primary key for rows within the dataset as the identifier may occur
more than once. In order for the
property transaction to be referenced from outside this dataset, the local identifier
must be mapped to a globally unique identifier such as a URI.

Each transaction record makes use of predefined category codes as outlined above; e.g.
Duration may be F (freehold) or L (leasehold). Furthermore,
geographic descriptors are commonly used. Whilst there is no attempt to
link these descriptors to specific geographic identifiers, such a linkage is likely
to provide additional utility when aggregating transaction data by location or region for further
analysis. At present there is no standardised mechanism to associate the category codes,
provided as plain text, or geographic identifiers with their authoritative definitions.

The collated monthly transaction dataset is used as the basis for updating the Land Registry's
information systems; in this case the data is persisted as RDF triples within a triple store.
A SPARQL end-point
and accompanying data definitions are provided
by the Land Registry allowing users to query the content of the triple store.

In order to update the triple store, the monthly transaction dataset is converted into RDF. The
value of the Record status cell for a given row informs the update process: add, update or
delete. Bespoke software has been created by the Land Registry to perform the transformation from CSV to RDF.
The transformation requires supplementary information not present in the CSV, such as the RDF
properties for each column specified in the
data definitions. The need to create and maintain bespoke software incurs costs that may
be avoided through use of a generic CSV-to-RDF transformation mechanism.
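
A sketch of how the Record status cell might drive the update process is given below. It is
illustrative only: the published file has no header row, so the sketch assumes the record
status is the final column, the single-letter codes A, C and D are assumptions, and the
add/update/delete operations are left as stubs standing in for the triple store update logic:

import csv

def apply_monthly_update(path, add, update, delete):
    """Dispatch each transaction row to add/update/delete based on its
    Record status value (assumed here to be the final column)."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):                 # no header row in the published file
            transaction_id, status = row[0], row[-1].strip().upper()
            if status == "A":                     # assumed code for an addition
                add(row)
            elif status == "C":                   # assumed code for a change
                update(row)
            elif status == "D":                   # assumed code for a deletion
                delete(transaction_id)
            else:
                raise ValueError("Unrecognised record status: %r" % status)

# Stub handlers standing in for the triple store update logic.
apply_monthly_update("pp-monthly-update.csv",
                     add=lambda row: print("add", row[0]),
                     update=lambda row: print("update", row[0]),
                     delete=lambda tid: print("delete", tid))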

The monthly transaction dataset contains on the order of 100,000 records;
any transformation will need to scale accordingly.

In parallel to providing access via the
SPARQL end-point, the Land Registry also provides aggregated sets of transaction data. Data is
available as a single file containing all transactions since 1995, or partitioned by year.
Given that the complete dataset is approaching 3GB in size, the annual partitions provide a
far more manageable method to download the property transaction data. However, each annual
partition is only a subset of the complete dataset. It is important to be able to both make
assertions about the complete dataset (e.g. publication date, license etc.) and to be
able to understand how an annual partition relates to the complete dataset and other partitions.

Use Case #6 - Journal Article Solr Search Results

(Contributed by Alf Eaton)

When performing literature searches researchers need to retain a persisted collection of
journal articles of interest in a local database compiled from on-line publication websites.
In this use case a researcher wants to retain a local personal journal article publication
database based on the search results from the Public Library
of Science (PLOS). PLOS is a nonprofit open access scientific publishing project aimed at creating
a library of open access journals and other scientific literature under an open content license.

In general this use case also illustrates the utility of CSV as a convenient exchange format for pushing
tabular data between software components:

making it easier to interpret the data on subsequent ingest

being able to work with manageable chunks of a tabular data set (e.g. only subsets of the tabular dataset
are ever materialised in a single CSV file, and we often want to know how that subset fits within the larger
whole).

The PLOS website features a Solr index search
engine (Live Search) which can return query results in
XML,
JSON
or in a more concise CSV format.
The output from the CSV Live Search is illustrated below:

Versions of the search results provided at time of writing are available locally in
XML,
JSON and
CSV formats for reference.

A significant difference between the CSV formatted results and those of JSON
and XML is the absence of information about how the set of results provided in the HTTP response fit within
the complete set of results that match the Live Search request. The information provided
in the JSON and XML search results states both the total number of "hits" for the Live
Search request and the start index within the complete set (zero for the example provided
here as the ?start={offset} query parameter is absent from the request).

Other common methods of splitting up large datasets into manageable chunks include
partitioning by time (e.g. all the records added to a dataset in a given day may be
exported in a CSV file). Such partitioning allows regular updates to be shared. However,
in order to recombine those time-based partitions into the complete set, one needs to know
the datetime range for which that dataset partition is valid. Such information should be
available within a CSV metadata description.

To be useful to a user maintaining a local publication database, PLOS One search results need to be returned in an organized and
consistent tabular format. This includes:

mapping search criteria cells to columns returned in the search results

ordering the columns to match the order of the search criteria cells.

Lastly, because the researcher may use different search criteria, the header row plays an important role
later for the researcher wanting to combine multiple literature searches into their database.
The researcher will use the header column names returned in the first row as a way to identify
each column type.

Search results returned in a tabular format can contain cell values that are organized in data structures,
also known as microformats. In the example above, the publication_date and authors list represent two
microformats that are expressed in a recognizable pattern that can be parsed by software or
by the human reader. In the case of the author column, microformats provide the advantage of being
able to store a single author's name or multiple authors' names separated by a comma delimiter.
Because each author cell value is surrounded by quotes a parser can choose to ignore the
data structure or address it.
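
The sketch below shows how a consumer might unpack such a cell: the csv module handles the
surrounding quotes, after which the author list can be split on its internal comma delimiter.
The file name and the column name author_display are assumptions for illustration; the actual
name is whatever appears in the header row:

import csv

with open("plos_search_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):                # header row supplies the column names
        # The author cell is a quoted micro format: one or more names separated by commas.
        authors = [name.strip() for name in row["author_display"].split(",")]
        print(len(authors), "author(s):", authors)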

Furthermore, note that the values of the title_display column contain markup. Whilst
these values may be treated as pure text, it provides an example of how structure or
syntax may be embedded within a cell.

Use Case #7 - Reliability Analyses of Police Open Data

(Contributed by Davide Ceolin)

Several Web sources expose datasets about UK crime statistics.
These datasets vary in format (e.g. maps vs. CSV files), timeliness, aggregation level, etc.
Before being published on the Web, these data are processed to preserve the privacy of the people
involved, but again the processing policy varies from source to source.

Every month, the UK Home Office publishes (via data.police.uk) CSV files that report crime
counts, aggregated on a geographical basis (per address or police neighbourhood) and by crime type.
Before publishing, data are smoothed, that is, grouped in predefined areas and assigned to the
midpoint of each area. Each area has to contain a minimum number of physical addresses. The goal
of this procedure is to prevent the reconstruction of the identity of the people involved in the
crimes.

Over time, the policies adopted for preprocessing these data have changed, but data previously
published have not been recomputed. Therefore, datasets about different months present relevant
differences in terms of crime types reported and geographical aggregation (e.g. initially, each
geographical area for aggregation had to include at least 12 physical addresses. Later, this
limit was lowered to 8).

These policies introduce a controlled error in the data for privacy reasons, but the changes
in the policies mean that different datasets adhere to the real data to differing degrees, i.e.
they present different reliability levels. Previous work provided two procedures for measuring
and comparing the reliability of the datasets, but in order to automate and improve these procedures,
it is crucial to understand the meaning of the columns, the relationships between columns, and how the
data rows have been computed.

For instance, here is a snippet from a dataset about crimes that happened in Hampshire in April 2012:

In order to properly handle the columns,
it is crucial to understand the type of the data contained therein. Given the context, knowing
this information would reveal an important part of the column meaning (e.g. to identify dates).

Also, it is important to understand the precise semantics of each column.
This is relevant for two reasons. First, to identify relations between columns (e.g. some crime types
are siblings, while others are less semantically related). Second, to identify semantic relations between
columns in heterogeneous datasets (e.g. a column in one dataset may correspond to the sum of two or more
columns in others).

Lastly, datasets with different row numbers are the result of different smoothing procedures. Therefore, it would
be important to trace and access their provenance, in order to facilitate their comparison.

Use Case #8 - Analyzing Scientific Spreadsheets

(Contributed by Alf Eaton, Davide Ceolin, Martine de Vos)

A paper published
in Nature Immunology in December 2012 compared changes in expression of a range of genes in
response to treatment with two different cytokines. The results were published in the paper as graphic figures,
and the raw data was presented in the form of
supplementary spreadsheets, as Excel files (local copy).

Having both the paper and the results at their disposal, a scientist may wish to reproduce the experiment, check whether the results
they obtain coincide with those published, and compare those results with others provided by different studies of the same
issues.

Because of the size of the datasets and of the complexity of the computations, it could be necessary to perform such
analyses and comparisons by means of properly defined software, typically by means of an R, Python or Matlab script.
Such software would require as input the data contained in the
Excel file. However, it would be difficult to write a parser to extract the information, for the reasons described below.

To clarify the issues related to the spreadsheet parsing and analysis, we first present an example extracted from it.
The example below shows a CSV encoding of the original Excel spreadsheet converted using Microsoft Excel 2007. White space
has been added to aid clarity. (file = ni.2449-S3.csv)

As we can see from the example, the table contains several columns of data that are measurements of gene expression in cells after
treatment with two concentrations of two cytokines, measured after two periods of time, presented as both actual values and fold
change. This can be represented in a table, but needs 3 levels of headings and several merged cells. In fact, the first row is the title of the table,
the second to fourth rows are the table headers.

We also see that the first column gene_name provides a unique identifier for
the gene described in each row, with the second column symbol providing a
human readable notation for each gene - albeit a scientific human! It is necessary to
determine which column, if any, provides the unique identifier for the entity which
each row describes. In order for the gene to be referenced from outside the datafile,
e.g. to reconcile the information in this table with other information about the gene,
the local identifier must be mapped to a globally unique identifier such as a URI.

The first column contains a GenBank identifier for each gene, with the column name "gene_name". The GenBank identifier
provides a local identifier for each gene. This local identifier, e.g. “NM_008638”, can be converted to a fully qualified URI by
adding a URI prefix, e.g. “http://www.ncbi.nlm.nih.gov/nuccore/NM_008638” allowing the gene to be uniquely and unambiguously identified.
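
A sketch of that mapping is shown below; the URI pattern follows the example given above and
would, in practice, be declared once in metadata rather than hard-coded:

NUCCORE_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/"

def genbank_uri(local_id):
    """Expand a GenBank accession such as "NM_008638" into a globally
    unique URI by prepending the nuccore URI prefix."""
    return NUCCORE_PREFIX + local_id.strip()

print(genbank_uri("NM_008638"))
# http://www.ncbi.nlm.nih.gov/nuccore/NM_008638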

The second column contains the standard symbol for each gene, labelled as "symbol".
These appear to be HUGO gene nomenclature symbols, but as there's no mapping it's hard to be sure which namespace these symbols
are from.

As this spreadsheet was published as supplemental data for a journal article, there is little description
of what the columns represent, even as text. There is a column labelled as "Cont", which has no description anywhere, but is
presumably the background level of expression for each gene.

Half of the cells represent measurements, but the details of what those measurements are can only be found
in the article text. The other half of the cells represent the change in expression over the background level. It is difficult to
tell the difference without annotation that describes the relationship between the cells (or understanding of the nested headings).
In this particular spreadsheet, only the values are published, and not the formulae that were used to calculate the derived values.
The units of each cell are "expression levels relative to the expression level of a constant gene, Rpl7", described in the text of
the methods section of the full article.

The heading rows contain details of the treatment that each cell received,
e.g. "4 hour, IL2_1nM". It would be useful to be able to make this machine readable (i.e. to represent treatment with 1nM IL-2 for
4 hours).

All the details of the experiment (which cells were used, how they were treated, when they were
measured) are described in the methods section of the article. To be able to compare data between multiple experiments, a parser
would also need to be able to understand all these parameters that may have affected the outcome of the experiment.

Use Case #9 - Chemical Imaging

(Contributed by Mathew Thomas)

Chemical imaging experimental work makes use of CSV formats to record its measurements. In this use
case two examples are shown to depict scans from a mass spectrometer and corresponding FTIR corrected files
that are saved into a CSV format automatically.

Mass Spectrometric Imaging (MSI) allows the generation of 2D ion density maps that help visualize
molecules present in sections of tissues and cells. The combination of spatial resolution and mass
resolution results in very large and complex data sets. The following is generated using the software
Decon Tools, a tool to de-isotope MS spectra and to detect features from MS data using isotopic signatures
of expected compounds, available freely at omins.pnnl.gov. The raw files generated by the mass spec
instrument are read in and the processed output files are saved as CSV files for each line.

Fourier transform infrared (FTIR) spectroscopy is a measurement technique whereby spectra are collected based on measurements
of the coherence of a radiative source, using time-domain or space-domain measurements of the electromagnetic
radiation or other type of radiation.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process
their experimental results:

making it easier for data to be loaded into a spreadsheet to examine results

being able to edit or select a portion of results to be plotted

making it possible to combine all scans to examine full 2D composite image.

The key characteristics are:

The CSV uses a fixed number of cells per row

First row provides header cell tags, although the FTIR header begins with a comma

All values are comma separated, but they can be delimited by tabs as well.

Because the data is being collected from an instrument some of the columns
represent measurement values taken during the experiment.

Below is an example of the FTIR data. The files from the instrument are baseline corrected, normalized and saved as CSV files automatically. Column 1 represents the
wavelength number or range and the remaining columns represent different formations such as bound EPS (extracellular polymeric substance), loose EPS, Shewanella, etc.
An example (5 of 3161 rows) is shown below:

Use Case #10 - OpenSpending Data

(Contributed by Stasinos Konstantopoulos)

The OpenSpending and Budgit platforms
provide many useful datasets giving figures for the national budgets and spending of several countries. A journalist wishing to investigate
public spending can use these data as a basis for their research, and possibly compare them against different sources.
Similarly, a politician interested in developing new policies for development can, for instance, combine these data with those from the
World Bank to identify correlations and, possibly, dependencies to leverage.

Nevertheless, such uses of these datasets are potentially undermined by the following obstacles.

There are whole collections of datasets where a single currency is implied for all amounts given. See, for example, how all
Slovenian Budget Datasets implicitly give amounts in Euros. Given that Slovenia joined the Eurozone in 2007,
the currency in use has changed relatively recently. How do we know if a given table expresses currency amounts in “tolar” or “Euro”?

In order to be able to compare and combine these data with those provided by other sources like the
World Bank,
in an automatic manner, it would be necessary to explicitly define the currency of each column. Given that
the currency will be uniform for a specific table, the currency metadata may be indicated once for the entire table.

Similar issues are also present in the Uganda Budget and Aid to Uganda, 2003-2007 file,
where there are four columns related to the amount. Of these, "amount" (Ugandan Shillings implied)
and "amount_dollars" (USD implied) are mandatory. The currency of these columns is implicit, and moreover, as explained in the
complementary information, the Ugandan Shillings amount is computed by converting
the Dollars amount using a ratio determined on a yearly basis (e.g. 2003/4: 1 USD = 1.847 UGX). Since this ratio varies from year to year,
and still corresponds to an approximation of the yearly value of the exchange rate, in order to properly use these data, it would
be preferable to know how these were obtained, or where to find such information.

Again in the Uganda Budget and Aid to Uganda, 2003-2007 file, if a row represents a
donation, then the values for the "amount_donor" (the amount in the donor's original currency) and "donorcurrency" (the donor's currency name) columns of
that row are reported.
Otherwise, the corresponding values are set to "0", to indicate that the row does not represent a donation and that the only relevant amounts for that row
are reported in the "amount" and "amount_dollars" column. To make these files machine-understandable, it is necessary to make this coding explicit.
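
The sketch below shows the kind of special-case handling a consumer currently has to hard-code
for this convention. The column names are those described above, the file name is an assumption,
and the treatment of "0" as a not-a-donation marker is precisely the implicit coding that ought
to be declared in metadata instead:

import csv

with open("uganda_budget_and_aid_2003_2007.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # "0" in both donor columns is an implicit marker meaning "this row is not a
        # donation", not a genuine zero amount.
        if row["amount_donor"] == "0" and row["donorcurrency"] == "0":
            amount, currency = row["amount_dollars"], "USD"    # implied currency
        else:
            amount, currency = row["amount_donor"], row["donorcurrency"]
        print(amount, currency)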

Use Case #11 - City of Palo Alto Tree Data

(Contributed by Eric Stephan)

The City of Palo Alto,
California Urban Forest Section is responsible for maintaining and tracking
the city's public trees and urban forest. In a W3C Data on the Web Best Practices (DWBP) use case
discussion with Jonathan Reichental, City of Palo Alto CIO, he brought to the working group's attention
a Tree Inventory maintained by the city in a
spreadsheet
form using Google Fusion. This use case illustrates the use of tabular data to represent
geophysical tree locations,
also provided in Google Map form where the user can point and click on trees to look up row
information about the tree.

The complete CSV file of Palo Alto tree data is available locally
- but please note that it is approximately 18MB in size.

Google Fusion allows a user to download the tree data either from a filtered view or the entire spreadsheet.
The exported spreadsheet is in an organized and consistent tabular format. This includes:

mapping spreadsheet cells to columns in the CSV file.

ordering the CSV columns to match the order of the spreadsheet columns.

The CSV file provides a primary key for each row (column GID), a unique identifier
for each tree (column Tree ID), accounts for missing data, and lists characteristics
describing the condition of the tree in the comments cell using a micro syntax to delimit the
characteristics list. The spreadsheet also provides geo coordinate information pinpointing each
inventoried tree.

In order for information about a given tree to be reconciled with information about the same
tree originating from other sources, the local identifier for that tree must be mapped to a
globally unique identifier such as a URI.

Also note that in row 6, a series of statements describing the condition of the tree and other
important information are provided in the comments cell. These statements are
delimited using the semi-colon ";" character.
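
A sketch of parsing that micro syntax is given below; the file name and the column names
Comments and Tree ID are assumptions based on the description above:

import csv

with open("palo_alto_trees.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # The comments cell packs several statements about the tree into one value,
        # delimited by semi-colons.
        statements = [s.strip() for s in row["Comments"].split(";") if s.strip()]
        if statements:
            print(row["Tree ID"], statements)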

Use Case #12 - Chemical Structures

(Contributed by Eric Stephan)

The purpose of this use case is to illustrate how 3-D molecular structures such as the Protein Data Bank and
XYZ formats are conveyed in tabular formats. These files may be archived for use in informatics analysis or as part
of an input deck for experimental simulation. Scientific communities rely heavily on tabular formats such
as these to conduct their research and share each other's results in platform independent formats.

The Protein Data Bank (pdb) file format is a tabular format describing the three dimensional structures of molecules
held in the Protein Data Bank. The pdb format accordingly provides for description and annotation of protein and
nucleic acid structures including atomic coordinates, observed sidechain rotamers, secondary structure assignments,
as well as atomic connectivity.

The XYZ file format is a chemical file format. There is no formal standard and several variations exist, but a typical
XYZ format specifies the molecule geometry by giving the number of atoms with Cartesian coordinates that will be read
on the first line, a comment on the second, and the lines of atomic coordinates in the following lines.

In general this use case also illustrates the utility of CSV as a means for scientists to collect and process
their experimental results:

making it easier for data to be loaded into a spreadsheet to examine results

being able to edit or select a portion of results to be plotted

making it possible to combine all scans to examine full 2D composite image.

The key characteristics of the XYZ format are:

CSV contains two header rows, the first row containing the number of atoms in molecule (number of rows in data block).
The second is a comment line.

Each row in the data block uses a fixed number of cells (atom name followed by x, y, z coordinates).
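
A minimal reader for the XYZ layout described above might look like the sketch below; it is
illustrative only, given that no formal standard exists:

def read_xyz(path):
    """Read an XYZ file: atom count, comment line, then one row per atom."""
    with open(path, encoding="ascii") as f:
        natoms = int(f.readline())           # first header row: number of atoms
        comment = f.readline().strip()       # second header row: free-text comment
        atoms = []
        for _ in range(natoms):
            name, x, y, z = f.readline().split()[:4]
            atoms.append((name, float(x), float(y), float(z)))
    return comment, atoms

comment, atoms = read_xyz("molecule.xyz")    # file name is illustrative
print(comment, "-", len(atoms), "atoms")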

Use Case #13 - Representing Entities and Facts Extracted From Text

(Contributed by Tim Finin)

The US National Institute of Standards and Technology (NIST) has run various conferences on
extracting information from text centered around challenge problems. Participants submit the output
of their systems on an evaluation dataset to NIST for scoring, typically in a tab-separated format.

The 2013 NIST Cold Start Knowledge Base Population Task, for example, asks participants to extract
facts from text and to represent these as triples along with associated metadata that include provenance
and certainty values. A line in the submission format consists of a triple (subject-predicate-object) and, for some
predicates, provenance information. Provenance includes a document ID and, depending on the predicate, one
or three pairs of string offsets within the document. For predicates that are relations, an optional second
set of provenance values can be provided. Each line can also have an optional float as a final column to represent
a certainty measure.

The following lines show examples of possible triples of varying length. In the second line, D00124 is the ID of a document
and the strings like 283-286 refer to strings in a document using the offsets of the first and last characters.
The final floating point value on some lines is the optional certainty value.

The submission format does not require that each line have the same number of columns. The expected provenance
information for a triple depends on the predicate. For example, “type” typically has no provenance, “mention” has
a document ID and offset pair, and domain predicates like “per:age” have one or two provenance records each of which
has a document ID and three offset pairs.

The file format exemplified above raises a number of issues, described as follows.

Each row is intended to describe an entity (e.g. the subject of the triple, “:e4”). The unique identifier for that
entity is provided in the first column. In order for information about this entity to be reconciled with
information from other sources about the same entity, the local identifier needs to be mapped to a globally
unique identifier such as a URI.

Entities “:e4”, “:e7” and “:e9” appear to be (foreign key) references to other entities described in this or in external tables. Likewise, also the
identifiers “D00124” and “D00101” are ambiguous identifiers. It would be useful to identify the resources that these references represent.

Moreover, “per” appears to be a term from a controlled vocabulary. How do we know which controlled vocabulary it is a member of and what its authoritative
definition is?

The identifiers used for the entities (“:e4”, “:e7” and “:e9”), as well as those used for the predicates (e.g. “type”, “mention”,
“per:siblings” etc.), are ambiguous local identifiers. How can one make the identifier an unambiguous URI?
A similar requirement regards the provenance annotations. These are composed of a document identifier (e.g. “D00124”) and offset ranges
(e.g. “180-181”). Offset ranges are clearly valid only in the context of the preceding document identifier. The interesting assertion about provenance
is the reference (document plus offset range). Thus we might
want to give the reference a unique identifier comprising the document ID and offset range (e.g. D00124#180-181).

Besides the entities, the table also presents some values. Some of these are strings (e.g. “10”, “Bart”), some of them are probably floating point values (e.g. “0.9”).
It would be useful to have an explicit syntactic type definition for these values.

Entity “:e4” is the subject of many rows, meaning that many rows can be combined to make a composite set of statements about this entity.

Moreover, a single row in the table comprises a triple (subject-predicate-object), one or more provenance references and an optional certainty measure.
The provenance references have been normalised for compactness (e.g. so they fit on a single row). However, each provenance statement has the same target triple so one
could unbundle the composite row into multiple simple statements that have a regular number of columns (see the two equivalent examples below).

Lastly, since we have already observed that rows comprise triples, that there are frequent references to externally
defined vocabularies, that values are defined as text (literals), and that triples are also composed of entities
for which we aim to obtain a URI (as described above), it may be useful to be able to convert such a table into RDF.
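
A sketch of reading such variable-length lines is given below. It follows the description
above: the first three columns form the triple, an optional final column that parses as a
float is taken as the certainty measure, and any remaining columns are treated as provenance:

def parse_submission_line(line):
    """Split one tab-separated submission line into (triple, provenance, certainty)."""
    fields = line.rstrip("\n").split("\t")
    certainty = None
    if len(fields) > 3:
        # An optional final column holds a floating point certainty measure.
        try:
            certainty = float(fields[-1])
            fields = fields[:-1]
        except ValueError:
            pass
    subject, predicate, obj = fields[0], fields[1], fields[2]
    provenance = fields[3:]                  # zero or more provenance columns
    return (subject, predicate, obj), provenance, certainty

print(parse_submission_line(":e4\tper:age\t\"10\"\tD00124\t283-286\t0.9"))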

Use Case #15 - Intelligently Previewing CSV files

(Contributed by Jeni Tennison)

All of the data repositories based on the CKAN software, such
as data.gov.uk, data.gov, and many
others, use JSON as the representation of the data when providing a preview of CSV data within a browser.
Server side pre-processing of the CSV files is performed to try and determine column
types, clean the data and transform the CSV-encoded data to JSON in order to provide the preview. JSON has many
features which make it ideal for delivering a preview of the data, originally in CSV format,
to the browser.

JavaScript is a hard dependency for interacting with data in the browser and as such
JSON was used as the serialization format because it was the most appropriate format for
delivering those data. As the object notation for JavaScript, JSON is natively understood
by JavaScript, so it is possible to use the data without any external dependencies.
The values in the data delivered map directly to common JavaScript types, and libraries for
processing and generating JSON, with appropriate type conversion, are widely available for
many programming languages.

Beyond basic knowledge of how to work with JSON, there is no further burden on the user
to understand complex semantics around how the data should be interpreted. The user of the
data can be assured that the data is correctly encoded as UTF-8 and it is easily queryable
using common patterns used in everyday Javascript. None of the encoding and
serialization flaws with CSV are apparent, although badly structured CSV files will be
mirrored in the JSON.

When providing the in-browser previews of CSV-formatted data, the utility of the preview application
is limited because the server-side processing of the CSV is not always able to determine
the data types (e.g. date-time) associated with data columns. As a result it is not possible
for the in-browser preview to offer functions such as sorting rows by date.

The header line here comes below an empty row, and there is metadata about the table in the row above the empty row. The preview code manages to
identify the headers from the CSV, and displays the metadata as the value in the first cell of the first row.

It would be good if the preview could recognise that the Date column contains a date and that the Amount in Sterling column contains a number,
so that it could offer options to filter/sort these by date/numerically.
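
A sketch of the kind of column type sniffing involved is given below; the date formats tried
and the file name are assumptions for illustration, and this guesswork is exactly what explicit
column metadata would remove:

import csv
from datetime import datetime

DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")        # assumed formats, for illustration

def guess_type(value):
    """Guess whether a single cell value is a number, a date or plain text."""
    try:
        float(value.replace(",", ""))          # e.g. "1,200.50"
        return "number"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    return "string"

def guess_column_types(path):
    """A column is given a type only if every non-empty cell agrees on it."""
    column_types = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for name, value in row.items():
                if value:
                    column_types.setdefault(name, set()).add(guess_type(value))
    return {name: types.pop() if len(types) == 1 else "string"
            for name, types in column_types.items()}

print(guess_column_types("spend-over-25k.csv"))   # file name is illustrative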

Moreover, some of the values reported may refer to external definitions (from dictionaries or other sources). It would be useful to know where it is
possible to find such resources, to be able to properly handle and visualize the data, by linking to them.

Use Case #16 - NetCDF and CDL

NetCDF is a set of binary data formats, programming interfaces, and software libraries that help read and write scientific data files.
NetCDF provides scientists a means to share measured or simulated experiments with one another across the web. What makes
NetCDF useful is its ability to be self describing and to provide a means for scientists to rely on an existing data model
as opposed to needing to write their own. The classic NetCDF data model consists of variables, dimensions, and attributes.
This way of thinking about data was introduced with the very first NetCDF release, and is still the core of all NetCDF files.

Among the tools available to the NetCDF community are two in particular: ncdump and ncgen. The ncdump tool is used
by scientists wanting to inspect variables and attributes (metadata) contained in a NetCDF file. It also
can provide a full text extraction of data, including blocks of tabular data represented by variables.
While NetCDF files are typically written by a software client, it is possible to generate NetCDF files using
ncgen and ncgen3 from a text format. The ncgen tool parses the text file and stores it in a binary format.

Both ncdump and ncgen rely on a text format to represent the NetCDF file called network Common Data form
Language (CDL). The CDL syntax as shown below contains annotation along with blocks of data denoted by the
"data:" key. For the results to be legible for visual inspection the measurement data is written as delimited
blocks of scalar values. As shown in the example below CDL supports multiple variables or blocks of data.
The blocks of data while delimited need to be thought of as a vector or single column of tabular data
wrapped around to the next line in a similar way that characters can be wrapped around in a single cell block
of a spreadsheet to make the spreadsheet more visually appealing to the user.

The next example shows a small subset of a data block taken from an actual NetCDF file.
Again, the delimited blocks of data should be thought of as a vector, or single column,
of tabular data wrapped around onto the next line.
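
The sketch below illustrates re-assembling one such wrapped data block into a single column of
values. It assumes the simple "name = value, value, ... ;" layout of a CDL data section and is
not a full CDL parser; the variable name and values are invented for the example:

import re

def read_cdl_variable(cdl_text, name):
    """Collect the comma separated values of one variable from the data:
    section of a CDL dump, however many lines they are wrapped across."""
    data_section = cdl_text.split("data:", 1)[1]
    match = re.search(r"\b%s\s*=\s*(.*?);" % re.escape(name), data_section, re.DOTALL)
    if match is None:
        raise KeyError(name)
    return [v.strip() for v in match.group(1).replace("\n", " ").split(",") if v.strip()]

cdl = """
data:
 temperature = 21.4, 21.6, 21.9,
     22.3, 22.8 ;
"""
print(read_cdl_variable(cdl, "temperature"))
# ['21.4', '21.6', '21.9', '22.3', '22.8']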

Use Case #17 - Canonical mapping of CSV

(Contributed by David Booth and Jeremy Tandy)

CSV is by far the commonest format within which open data is published, and is thus
typical of the data that application developers need to work with.

However, an object / object graph serialisation (of open data) is easier to consume within
software applications. For example, web applications (using HTML5 & Javascript) require
no extra libraries to work with data in JSON format. Similarly, RDF-encoded data from
multiple sources can be simply combined or merged using SPARQL queries once persisted
within a triple store.

Release data quickly, and then work to make sure that it is available in open standard
formats, including linked data formats.

The open data principles recognise that the additional utility to be gained from publishing in
linked data formats must be balanced against the additional effort incurred by the
data publisher to do so and the resulting delay to publication of the data. Data publishers
are required to release data quickly - which means making the data available in
a format convenient for them such as CSV dumps from databases or spreadsheets.

One of the hindrances to publishing in linked data formats is the difficulty in
determining the ontology or vocabulary (e.g. the classes, predicates, namespaces and
other usage patterns) that should be used to describe the data. Whilst it is
only reasonable to assume that a data publisher best knows the intended meaning of their
data, they cannot be expected to determine the ontology or vocabulary most applicable
to a consuming application!

Furthermore, in lieu of agreed de facto standard vocabularies or ontologies for a given
application domain, it is highly likely that disparate applications will conform to different data
models. How should the data publisher choose which of the available vocabularies or
ontologies to use when publishing (if indeed they are aware of those applications at all)!

In order to assist data publishers in providing data in linked data formats without
the need to determine ontologies or vocabularies, it is necessary to separate the syntactic
mapping (e.g. changing format from CSV to JSON) from the semantic mapping
(e.g. defining the transformations required to achieve semantic alignment with a target
data model).

As a result of such separation, it will be possible to establish a canonical
transformation from CSV conforming to the
core tabular data model [[!tabular-data-model]]
to an object graph serialisation such as JSON.

This use case assumes that JSON is the target serialisation for application developers,
given the general utility of that format. However, by considering JSON-LD [[json-ld]], it becomes
trivial to map CSV-encoded tabular data via JSON into a canonical RDF model. This enables
CSV-encoded tabular data to be published in linked data formats, as required by open data
principle 9, at no extra effort to the data publisher, since standard mechanisms are available
for a data user to transform the data from CSV to RDF.

Applications may automatically determine the data type (e.g. date-time, number) associated
with cells in a CSV file by parsing the data values. However, this is prone to
mistakes where data resembles something else, which is especially
prevalent for dates. For example, 1/4 is often interpreted as 1 April
rather than 0.25. In such situations, it is beneficial if guidance can be given to the
transformation process indicating the data type for given columns.
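A minimal sketch of such a canonical mapping, with an optional per-column datatype hint, is shown below; the column names, file name and hint format are illustrative assumptions rather than part of any specification:

import csv
import json

# Minimal sketch of a "canonical" CSV-to-JSON mapping: one JSON object per row,
# keyed by the column headings. An optional per-column datatype hint guides
# parsing instead of relying on error-prone automatic detection (e.g. "1/4").
def csv_to_json(path, datatypes=None):
    datatypes = datatypes or {}
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            obj = {}
            for column, value in record.items():
                cast = datatypes.get(column, str)   # default: leave as a string
                obj[column] = cast(value)
            rows.append(obj)
    return json.dumps(rows, indent=2)

# Hypothetical usage: declare that the "amount" column holds numbers.
# print(csv_to_json("payments.csv", datatypes={"amount": float}))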

Provision of CSV data coupled with a canonical mapping provides significant utility by itself. However,
there is nothing stopping a data publisher from adding annotation defining data semantics once, say,
an appropriate de facto standard vocabulary has been agreed within the community of use. Similarly, a
data consumer may wish to work directly with the canonical mapping and ignore any semantic
annotations provided by the publisher.

The challenge mandates the use of Linked Open Data resources in the recommendations.

An effective manner to satisfy this requirement is to make use of undirected semantic paths.
An undirected semantic path is a sequence of entities (subject or object) and properties that link two items, for instance:

{Book1 property1 Object1 property2 Book2}

This sequence results from considering the triples (subject-predicate-object) in a given Linked Open Data resource (e.g. DBpedia),
independently of their direction, such that the starting and the ending entities are the desired items and that the subject (or object) of
a triple is the object (or subject) of the following triple.
For example, the sequence above may result from the following triples:

Book1 property1 Object1
Book2 property2 Object1

Undirected semantic paths are classified according to their length. For a fixed length, one can extract all the undirected semantic paths of that length
that link two items within a Linked Open Data resource by running a set of SPARQL queries.
A set of queries is necessary because an undirected semantic path actually corresponds to the union of a set of directed semantic paths, while the source
stores its data as directed triples (subject-predicate-object).

The number of queries that must be run in order to obtain all the undirected semantic paths that link two items is exponential in the
length of the path itself (2^n). Because of the complexity of this task and of the latency it may introduce,
it is useful to cache these results.

CSV is a good candidate for caching undirected semantic paths because of its ease of use, sharing and reuse. However, there are some open issues
related to this.
First, since paths may have a variable number of components, one might want to represent each path in a single cell,
while still being able to separate the path elements when necessary.

For example, in this file, undirected semantic paths are grouped by means
of double quotes, and path components are separated by commas. The starting and ending elements of the undirected semantic paths (Book1 and Book2) are represented
in two separate columns by means of the book identifiers used in the challenge (see the example below).

Second, the size of these caching files may be considerable. For example, the size of the file described above is ~2GB, which may imply prohibitive
loading times, especially when making only a limited number of recommendations.

Since rows are sorted according to the starting and the ending book of the undirected semantic path, all the undirected semantic paths that link two books
are present in a region of the table formed by consecutive rows.

Given an annotation of such regions indicating which books they describe, one would be able to select just the "slice" of
the file needed to make a recommendation, without having to load it in its entirety.
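Both points can be illustrated with a minimal sketch, assuming column names start, end and path and the comma-separated path microsyntax described above:

import csv

# Minimal sketch: stream the (assumed) path-cache CSV and keep only the rows
# for one pair of books. Column names "start", "end" and "path" are assumptions;
# the "path" cell holds a comma-separated list of path components inside quotes.
def paths_between(path_to_file, book_a, book_b):
    selected = []
    with open(path_to_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["start"] == book_a and row["end"] == book_b:
                # Split the single-cell microsyntax into its path components.
                selected.append([part.strip() for part in row["path"].split(",")])
            elif selected:
                # Rows are sorted by (start, end), so once we leave the region
                # of interest there is nothing more to read.
                break
    return selected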

Use Case #19 - Supporting Right to Left (RTL) Directionality

(Contributed by Yakov Shafranovich)

Writing systems affect the way in which information is displayed. In some cases, these writing
systems affect the order in which characters are displayed. Latin-based languages display text
left-to-right across a page (LTR). Languages such as Arabic and Hebrew are written in scripts
whose dominant direction is right-to-left (RTL) when displayed; however, when the text includes
non-native words or numbers, the displayed result is actually bidirectional.

Irrespective of the LTR or RTL display of characters in a given language, data is serialised
such that the bytes are ordered in a single sequential order.

Content published in Hebrew and Arabic provides examples of RTL display behaviour.

Tabular data originating from countries where vertical writing is the norm
(e.g. China, Japan) appears to be published with rows and columns as defined in [[RFC4180]] (i.e.
each horizontal line in the data file conveys a row of data, with the first line optionally
providing a header with column names). Rows are published using a left-to-right topology.

Readers should be aware that both the right-to-left text direction and the cursive nature of Arabic text
have been explicitly overridden in the example above in order to display each individual character in
sequential left-to-right order.

The directionality of the content as displayed does not affect the logical structure of the tabular data;
i.e. the cell at index zero is followed by the cell at index 1, and then index 2 etc.

However, without awareness of the directionality of the content, an application may display data in
a way that is unintuitive for an RTL reader. For example, viewing the CSV file using
LibreOffice Calc (tested using version 3 configured with an English (UK) locale) demonstrates the
challenge of rendering the content correctly: the content is incorrectly rendered, with cells
progressing from left-to-right although, on the positive side, the Arabic text
within a given field runs from right-to-left. Similar behaviour is observed in Microsoft Office Excel 2007.

By contrast, the simple TextWrangler text editor is not aware that the overall
direction is right-to-left, but does apply the Unicode bidirectional algorithm such that lines starting with an
Arabic character have a base direction of right-to-left. However, as a result, the numeric digits are also displayed
right-to-left, which is incorrect.

It is clear that a mechanism needs to be provided such that one can explicitly declare the directionality
which applies when parsing and rendering the content of CSV files.

From Unicode version 6.3 onwards, the Unicode Standard contains new control codes (RLI, LRI, FSI, PDI)
to enable authors to express isolation at the same time as direction in inline bidirectional text. The
Unicode Consortium recommends that isolation be used as the default for all future inline bidirectional
text embeddings. To use these new control codes, however, it will be necessary to wait until the browsers
support them. The new control codes are:

RLI (RIGHT-TO-LEFT ISOLATE) U+2067 to set direction right-to-left

LRI (LEFT-TO-RIGHT ISOLATE) U+2066 to set direction left-to-right

FSI (FIRST STRONG ISOLATE) U+2068 to set direction according to the first strong character

PDI (POP DIRECTIONAL ISOLATE) U+2069 to terminate the range set by RLI, LRI or FSI

More information on setting the directionality of text without markup can be found here.
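As an illustration, a rendering application could wrap each cell value in one of these isolates before concatenating cells for display. The following minimal sketch uses the code points listed above; the function and sample values are illustrative:

# Minimal sketch: wrap a cell value in a Unicode isolate so that its direction
# cannot leak into neighbouring cells when a row is concatenated for display.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(cell_value, direction="auto"):
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + cell_value + PDI

# Hypothetical usage: numeric cells stay LTR even within an RTL row.
row = [isolate("مرحبا", "rtl"), isolate("42", "ltr")]
print(" , ".join(row))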

Base64-encoded binary values may be included. These are designated by setting
the initial cell value to "\#".

A number of escape sequences for special characters are supported; e.g.
"\\" (backslash within a string),
"\s" (semicolon within a string - not a cell or list item delimiter),
"\n" (newline within a string) and
"\t" (tab within a string) etc.

These special characters don't affect the parsing of the data but are further
examples of the use of microsyntax within cells.

Use Case #21 - Publication of Biodiversity Information from GBIF using the Darwin Core Archive Standard

(Contributed by Tim Robertson, GBIF, and Jeremy Tandy)

A citizen scientist investigating biodiversity in the Parque Nacional de Sierra Nevada,
Spain, aims to create a compelling web application that combines biodiversity
information with other environmental factors - displaying this information on a map and
as summary statistics.

The ignoreHeaderLines attribute can be used to skip lines containing column
headings or preamble comments.

In this particular case, the tabular data file is packaged within the zip file, and
is referenced locally. However, the DwC-A specification also supports annotation of remote
tabular data files, and thus does not require any modification of the source datafiles
themselves.

Although not present in this example, DwC-A also supports the ability to specify
a property-value pair that is applied to every row in the tabular data file, or,
in the case of sparse data, for that property-value pair to be added where the property
is absent from the data file (e.g. providing a default value for a property).

Whilst the DwC-A format is embedded in many software platforms, including web-based tools,
none of these seem to fit the needs of the citizen scientist. They want, where possible, to use existing
JavaScript libraries such as Leaflet, an open-source
JavaScript library for interactive maps, to simplify their web development
effort.

Leaflet has good support for GeoJSON, a JSON format
for encoding a variety of geographic data structures.

In the absence of standard tooling, the citizen scientist needs to write a custom parser
to convert the tab delimited data into GeoJSON. An example GeoJSON object resulting from
this transformation is provided below.
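A minimal sketch of such a custom parser is shown below; it assumes the occurrence file carries the Darwin Core column headings decimalLatitude and decimalLongitude, with all other columns copied into the feature properties:

import csv
import json

# Minimal sketch: convert a tab-delimited Darwin Core occurrence file into a
# GeoJSON FeatureCollection. The column names used here (decimalLatitude,
# decimalLongitude) are assumptions about the source file.
def dwc_to_geojson(path):
    features = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            features.append({
                "type": "Feature",
                "geometry": {
                    "type": "Point",
                    "coordinates": [float(row["decimalLongitude"]),
                                    float(row["decimalLatitude"])],
                },
                # Remaining columns become feature properties as-is.
                "properties": {k: v for k, v in row.items()
                               if k not in ("decimalLatitude", "decimalLongitude")},
            })
    return json.dumps({"type": "FeatureCollection", "features": features}, indent=2)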

The citizen scientist notes that many of the terms in a given row are drawn from controlled
vocabularies: geographic names and taxonomies. For the application, they want to be able to
refer to the authoritative definitions for these controlled vocabularies, say, to provide
easy access for users of the application to the definitions of scientific terms such as "Pinophyta".

Thinking to the future of their application, our citizen scientist anticipates the need
to aggregate data across multiple datasets; each of which might use different column headings
depending on who compiled the tabular dataset. Furthermore, how can one be sure they are
comparing things of equivalent type?

To remedy this, they want to use the definitions from the metadata file
meta.xml. The easiest approach to achieve this is to modify their parser
to export [[json-ld]] and transform the tabular data into RDF that can be easily
reconciled.

The @type of the entity is taken from the rowType attribute
within the metadata file.

The amendment of the GeoJSON specification to include JSON-LD is a work in progress at the time
of writing. Details can be found on the GeoJSON GitHub.

It is the hope of the DwC-A format specification authors that the availability
of a general metadata vocabulary for describing CSV files, or indeed any tabular text
datasets, will mean that DwC-A can be deprecated. This would allow the biodiversity
community, and initiatives such as GBIF, to spend their efforts developing tools that
support the generic standard rather than their own domain-specific conventions and
specifications, thus increasing the accessibility of biodiversity data.

To achieve this goal, it is essential that the key characteristics of the DwC-A format
can be adequately described, thus enabling the general metadata vocabulary to be adopted
without needing to modify the existing DwC-A encoded data holdings.

Use Case #22 - Making sense of other people's data

spendnetwork.com harvests spending data from multiple UK local and central government CSV files.
It adds new metadata and annotations to the data and cross-links suppliers to OpenCorporates and,
elsewhere, is beginning to map transaction types to different categories of spending.

A snippet of data from East Sussex County Council indicating payments over £500 for the second financial quarter of 2011 is provided below for illustration.
White space has been added for clarity. The full data file for that period (saved in CSV format from Microsoft Excel 2007) is provided here:
ESCC-payment-data-Q2281011.csv

This data is augmented by spendnetwork.com and presented in a Web page.
The web page for East Sussex County Council is illustrated in the figure below:

Payments over £500 for East Sussex County Council July-Sept 2011, illustrated by spendnetwork

Notice the Linked Data column that links to
OpenCorporates data on MCL Transport Consultants Ltd.
If we follow the 'more' link we see many more cells that spendnetwork would like to include (see
the figure below). Where data is available
from the original spreadsheet it has been included.

Payment transaction details, illustrated by spendnetwork

The schema here is defined by a third party (spendnetwork.com) to make sense of the original data within their own model
(only some of which is shown here; spendnetwork.com also tries to categorize transactions, and more). This model exists independently of
the multiple source datasets and entails a mechanism for reusers to link to the original data from the metadata.
Published metadata can be seen variously as feedback, advertising, enrichment or annotations. Such information could help the publisher to
improve the quality of the original source; however, for the community at large, it reduces the need for repetition of the work done to make
sense of the data and facilitates a network effect. It may also be the case that
the metadata creator is better able to put the original data into a wider context with more accuracy and commitment than the original publisher.

Another (similar) scenario is LG-Inform. This harvests government statistics from multiple sources,
many in CSV format, calculates rates, percentages, trends etc., and packages them as a set of performance metrics/measures. Again, it
would be very useful for the original publisher to know, through metadata, that their source has been defined and used (potentially alongside
someone else's data) in this way.

The analysis of datasets undertaken by both spendnetwork.com and LG-Inform to make sense of other people's tabular data is time-consuming
work. Making that metadata available is a potential help to the original data publisher as well as other would-be reusers of it.

Use Case #23 - Collating humanitarian information for crisis response

(Contributed by Tim Davies)

During a crisis response, information managers within the humanitarian community face
a significant challenge in trying to collate data regarding humanitarian needs and response
activities conducted by a large number of humanitarian actors. The schemas for these data
sets are generally not standardized across different actors nor are the mechanisms for
sharing the data. In the best case, this results in a significant delay between the collection
of data and the formulation of that data into a common operational picture. In the worst case,
information is simply not shared at all, leaving gaps in the understanding of the field situation.

The Humanitarian eXchange Language
(HXL) project seeks to address this concern; enabling information from diverse parties to be
collated into a single "Humanitarian Data Registry". Supporting tools are provided to assist
participants in a given response initiative in finding information within this registry to
meet their needs.

The HXL standard is designed to be a common publishing format for humanitarian data.
A key design principle of the HXL project is that the data publishers are able to continue
publication of their data using their existing systems. Unsurprisingly, data publishers
often provide their data in tabular formats such as CSV, having exported the content from
spreadsheet applications. As a result, the HXL standard is entirely based on tabular data.

During their engagement with the humanitarian response community, the HXL project team
have identified two major concerns when working with tabular data:

Tabular data needs to be created, read by and exchanged between people speaking
different languages. Many of these are basic spreadsheet users who find it
far easier to use data with natural and clear language in the column headings.
Having the column headings in their own language makes creating and interpreting
the data a lot easier.

Tabular data needs to be created that contains literal values in multiple languages.
For example, the name of a town in English, French and Arabic. The total number of
languages that the data might be expressed in cannot be easily determined in advance,
and it should be possible for a data manager to introduce a new language variant of
a column easily.

To address these issues, the HXL project have developed a number of
conventions
for publishing tabular data in CSV format.

Instead of relying on matching text strings, a
data dictionary
provides a short (3-character) identifier for each field type. These field
codes are provided on the first row of the CSV file. The second row of the CSV file
provides a textual column name - as determined in the natural language of the data
publisher. The field code is normative, whilst the textual column name is merely
informative.
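A minimal sketch of a reader for this convention is shown below; the two-row header layout follows the description above, while the processing itself is an illustrative assumption:

import csv

# Minimal sketch: read an HXL-style CSV in which the first row carries the
# normative 3-character field codes and the second row carries informative,
# language-specific column labels. Records are keyed by field code so that
# the human-readable labels can vary without affecting processing.
def read_hxl(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        field_codes = next(reader)      # normative identifiers, e.g. "010"
        labels = next(reader)           # informative labels in any language
        records = [dict(zip(field_codes, row)) for row in reader]
    return labels, records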

The HXL tabular data conventions
also propose an alternative in which the 3-character field code is prepended to the textual
column name; e.g. 100 Location name.

The data dictionary provides both a semantic meaning for the column and the data
type for values, as illustrated in the snippet below:

Code, Title,Data type, Description, Example
000, Date reported, Date, The date that this data record was released.,2014-06-23
001,Start date applicable, Date, The date when the data in this report was first applicable.,2014-05-01
002, End date applicable, Date, The date when the data in this report was last applicable.,2014-05-31
010, Source description, Text,A free-text description of the source organisation for this data record., OXFAM GB
011, Source id, Code, A unique identifier for the source organisation for this data record., 5120
{snip}

For multilingual content, this document proposes that the field-identifiers allow
an optional "/" followed by an ISO 639 language code. The language code applies to the
content contained within that column and may also apply to the column label.

Where multiple data-values for a given field code are provided in a single
row, the field code is repeated - as illustrated in the example below that provides
geocodes for multiple locations pertaining to the subject of the record.

Finally, note that humanitarian data often includes codes to reference some authoritative
term, definition or other resource; e.g. the Province code 60400000. In order
to establish links between data sets, these codes must be reconciled with their official definitions.

The HXL proof of concept project
(from 2012-2013) developed an RDF/OWL vocabulary
to describe the resources and properties pertinent to this domain. Whilst the current implementation of
HXL builds on a tabular data model, rather than a graph data model, the HXL project are reviewing the
need to provide an RDF export format. Publication of HXL data as RDF allows that information to be
augmented with additional triples to, say, provide additional content or establish links between
data sets. Exporting HXL as RDF would greatly benefit from the availability of standard tooling.

Use Case #24 - Expressing a hierarchy within occupational listings

Our user intends to analyze the current state of the job market using information gleaned
from job postings that are published using schema.org markup.

schema.org defines a schema for a listing that describes a
job opening within an organization: JobPosting.

One of the things our user wants to do is to organise the job postings into categories
based on the occupationalCategory
property of each JobPosting.

The occupationalCategory property is used to categorize the described job. The
O*NET-SOC Taxonomy is schema.org's recommended
controlled vocabulary for the occupational categories.

The schema.org documentation notes that the value of the occupationalCategory property
should include both the textual label and the formal code from the O*NET-SOC Taxonomy. The
taxonomy itself is available as a CSV file; a snippet is shown below:

O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
{snip}
15-1199.00,"Computer Occupations, All Other",All computer occupations not listed separately.
15-1199.01,Software Quality Assurance Engineers and Testers,Develop and execute software test plans in order to identify software problems and their causes.
15-1199.02,Computer Systems Engineers/Architects,"Design and develop solutions to complex applications problems, system administration issues, or network concerns. Perform systems management and integration functions."
15-1199.03,Web Administrators,"Manage web environment design, deployment, development and maintenance activities. Perform testing and quality assurance of web sites and web applications."
15-1199.04,Geospatial Information Scientists and Technologists,"Research or develop geospatial technologies. May produce databases, perform applications programming, or coordinate projects. May specialize in areas such as agriculture, mining, health care, retail trade, urban planning, or military intelligence."
15-1199.05,Geographic Information Systems Technicians,"Assist scientists, technologists, or related professionals in building, maintaining, modifying, or using geographic information systems (GIS) databases. May also perform some custom application development or provide user support."
15-1199.06,Database Architects,"Design strategies for enterprise database systems and set standards for operations, programming, and security. Design and construct large relational databases. Integrate new systems with existing warehouse structure and refine system performance and functionality."
15-1199.07,Data Warehousing Specialists,"Design, model, or implement corporate data warehousing activities. Program and configure warehouses of database information and provide support to warehouse users."
15-1199.08,Business Intelligence Analysts,Produce financial and market intelligence by querying data repositories and generating periodic reports. Devise methods for identifying data patterns and trends in available information sources.
15-1199.09,Information Technology Project Managers,"Plan, initiate, and manage information technology (IT) projects. Lead and guide the work of technical staff. Serve as liaison between business and technical aspects of projects. Plan project stages and assess business implications for each stage. Monitor progress to assure deadlines, standards, and cost targets are met."
15-1199.10,Search Marketing Strategists,"Employ search marketing tactics to increase visibility and engagement with content, products, or services in Internet-enabled devices or interfaces. Examine search query behaviors on general or specialty search engines or other Internet-based content. Analyze research, data, or technology to understand user intent and measure outcomes for ongoing optimization."
15-1199.11,Video Game Designers,"Design core features of video games. Specify innovative game and role-play mechanics, story lines, and character biographies. Create and maintain design documentation. Guide and collaborate with production staff to produce games as designed."
15-1199.12,Document Management Specialists,"Implement and administer enterprise-wide document management systems and related procedures that allow organizations to capture, store, retrieve, share, and destroy electronic records and documents."
{snip}

The CSV file follows the specification outlined in [[RFC4180]] - including the use of
pairs of double quotes ("") to escape cells that themselves contain commas.

Also note that each row provides a unique identifier for the occupation it describes, given
in the O*NET-SOC 2010 Code column. As it is unique for every row, this code can be considered
the primary key for the listing; it also serves as the unique identifier for the occupation itself.

Closer inspection of the O*NET-SOC 2010 code illustrates the hierarchical classification
within the taxonomy. The first six digits are based on the
Standard Occupational Classification (SOC)
code from the US Bureau of Labor Statistics, with further subcategorization thereafter where necessary.
The first and second digits represent the major group; the third digit represents the minor group;
the fourth and fifth digits represent the broad occupation; and the sixth digit represents the
detailed occupation.

Each of the five levels used across the occupation classification schemes are assigned
to a particular OWL class - each of which is a sub-class of
skos:Concept:

From SOC -

Major Group: ex:SOC-MajorGroup

Minor Group: ex:SOC-MinorGroup

Broad Group: ex:SOC-BroadGroup

Detailed Occupation: ex:SOC-DetailedOccupation

From O*NET-SOC -

ex:ONETSOC-Occupation

The SOC taxonomy contains four different types of entities, and so requires several different
passes to extract each of those from the CSV file. Depending on which kind of entity is being
extracted, a different column provides the unique identifier for the entity. Data in a given
row is only processed if the value for the cell designated as the unique identifier is not blank.
For example, if the Detailed Occupation column is designated as providing the
unique identifier (e.g. to extract entities of type ex:SOC-DetailedOccupation),
then the only rows to be processed in the snippet below would be "Financial Specialists, All Other",
"Computer and Information Research Scientists" and "Computer Occupations, All Other". All other rows
would be ignored.

The hierarchy in the SOC structure is implied by inheritance from
the preceding row(s). For example, the row describing SOC minor group "Computer Occupations"
(Minor Group = 15-1100, above) has an empty cell value for the Major Group column.
The value for the SOC major group is provided by the preceding row. In the case of SOC detailed
occupation "Computer Occupations, All Other" (Detailed Occupation = 15-1199),
the value for the Major Group column is provided 20 lines previously, where a value
in that column was most recently given. The example snippet below illustrates what the CSV would
look like if the inherited cell values were present:

It is difficult to programmatically describe how the inherited values should be implemented.
It is not as simple as inferring the value for a blank cell from the most recent preceding row
in which a non-blank value was provided for that column. For example, the last row in the example
above describing "Mathematical Science Occupations" does not inherit the values from the
Broad Group and Detailed Occupation columns in the preceding row because
it describes a new level in the hierarchy.

However, given that the SOC code is a string value with regular structure that reflects
the position of a given concept within the hierarchy, it is possible to determine the
identifier of each of the broader concepts by parsing the identifier string. For example,
the regular expression /^(\d{2})-(\d{2})(\d)\d$/ could be used to split the
identifier for a detailed occupation code into its constituent parts from which the
identifiers for the associated broader concepts could be constructed.
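The following sketch applies that regular expression to reconstruct the identifiers of the broader SOC concepts; the shape of the reconstructed codes (nn-0000, nn-nn00, nn-nnn0) follows the hierarchy described above:

import re

# Minimal sketch: derive the identifiers of the broader SOC concepts from a
# detailed occupation code using the regular expression given above.
def broader_soc_codes(detailed_code):
    match = re.match(r"^(\d{2})-(\d{2})(\d)\d$", detailed_code)
    if match is None:
        raise ValueError(f"not a detailed SOC code: {detailed_code}")
    major, minor, broad = match.groups()
    return {
        "major": f"{major}-0000",
        "minor": f"{major}-{minor}00",
        "broad": f"{major}-{minor}{broad}0",
    }

print(broader_soc_codes("15-1199"))
# {'major': '15-0000', 'minor': '15-1100', 'broad': '15-1190'}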

The same kind of processing applies to the O*NET-SOC taxonomy; in this case also extracting
a description for the occupation. There is also an additional complication: where an
O*NET-SOC code ends in ".00", that occupation is a direct mapping to the
occupation defined in the SOC taxonomy. For example, the O*NET-SOC code 15-1199.00
refers to the same occupation category as the SOC code 15-1199:
"Computer Occupations, All Other".

To handle this complication, we need to use conditional processing.

If the final two digits of the O*NET-SOC code are "00", then:

the entity is of type ex:SOC-DetailedOccupation;

the unique identifier and notation for the concept comprises only the six numerical
digits of the O*NET-SOC 2010 Code cell value (e.g. in the form
nn-nnnn); and

A snippet of the final SKOS concept scheme, expressed in RDF using Turtle [[turtle]] syntax, resulting
from transformation of the O*NET-SOC and SOC taxonomies into RDF is provided below. Ideally, all
duplicate triples will be removed - such as the skos:prefLabel
property for concept ex:15-1190 which would be provided by both the O*NET-SOC and SOC CSV files.

Once the SKOS concept scheme has been defined, it is possible for our user to group
job postings by SOC Major Group, SOC Minor Group, SOC Broad Group, SOC Detailed Occupation
and O*NET-SOC Occupation to provide summary statistics about the job market.

For example, we can use the SKOS concept scheme to group job postings for "Web
Administrators" (code 15-1199.03) as follows:

15-0000 "Computer and Mathematical Occupations" (SOC major group)

15-1100 "Computer Occupations" (SOC minor group)

15-1190 "Miscellaneous Computer Occupations" (SOC broad occupation)

15-1199 "Computer Occupations, All Other" (SOC detailed occupation)

15-1199.03 "Web Administrators"

Use Case #25 - Consistent publication of local authority data

Open data and transparency are foundational elements within the UK Government's approach to improve public service. The Local Government Association (LGA) promotes open and transparent local government to meet local needs and demands; to innovate and transform services leading to improvements and efficiencies, to drive local economic growth and to empower citizen and community groups to choose or run services and shape neighbourhoods.

As part of this initiative, the LGA is working to put local authority data into the public realm in ways that provide real benefits to citizens, business, councils and the wider data community. The LGA provides a web portal to help identify open data published by UK local authorities and encourage standardisation of local open data; enabling data consumers to browse through datasets published by local authorities across the UK and providing guidance and tools to data publishers to drive consistent practice in publication.

Data is typically published in CSV format.

An illustrative example is provided for data describing public toilets. The portal lists datasets of information about public toilets provided by more than 70 local authorities. In order to ensure consistent publication of data about public toilets the LGA provides both guidance documentation and a machine-readable schema against which datasets may be validated using on-line tools.

The public toilets CSV schema has 32 (mandated or optional) fields. The validator tool allows columns to appear in any order, matching the column order to the schema based on the title in the column header. Furthermore, CSV files containing additional columns, such as SecureDisposalofSharps specified within the public toilet dataset for Bath and North East Somerset (as shown below), are also considered valid. Additional columns are included where one or more local authorities have specific requirements to include additional information to satisfy local needs. Such additional columns are not supported using formal 'extensions' of the schema as the organisational and administrative burden of doing so was considered too great.

ExtractDate,OrganisationURI,OrganisationLabel,ServiceTypeURI,ServiceTypeLabel,LocationText,StreetAddress,LocalityAddress,TownAddress,Postcode,GeoAreaWardURI,GeoAreaWardLabel,UPRN,CoordinateReferenceSystem,GeoX,GeoY,GeoPointLicensingURL,Category,AccessibleCategory,BabyChange,SecureDisposalofSharps,OpeningHours,ManagingBy,ChargeAmount,Notes
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,CHARLOTTE STREET ENTRANCE,CHARLOTTE STREET,KINGSMEAD,BATH,BA1 2NE,http://statistics.data.gov.uk/id/statistical-geography/E05001949,Kingsmead,10001147066,OSGB36,374661,165006,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,24 Hours ,BANES COUNCIL AND HEALTHMATIC,0.2,
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,ALICE PARK,GLOUCESTER ROAD,LAMBRIDGE,BATH,BA1 7BL,http://statistics.data.gov.uk/id/statistical-geography/E05001950,Lambridge,10001146447,OSGB36,376350,166593,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,06:00-21:00,BANES COUNCIL AND HEALTHMATIC,0.2,
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,HENRIETTA PARK,HENRIETTA ROAD,ABBEY,BATH,BA2 6LU,http://statistics.data.gov.uk/id/statistical-geography/E05001935,Abbey,10001147120,OSGB36,375338,165170,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,FALSE,Female and male,Winter & Su 10:00-16:00 | Other times: 08:00-18:00,BANES COUNCIL AND HEALTHMATIC,0,Scheduled for improvement Autumn 2014
15/09/2014,http://opendatacommunities.org/id/unitary-authority/bath-and-north-east-somerset,Bath and North East Somerset,http://id.esd.org.uk/service/579,Public Toilets,SHAFTESBURY ROAD,SHAFTESBURY ROAD,OLDFIELD ,BATH,BA2 3LH,http://statistics.data.gov.uk/id/statistical-geography/E05001958,Oldfield,10001147060,OSGB36,373809,164268,http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/,Female and male,Female and male,TRUE,TRUE,24 Hours ,BANES COUNCIL AND HEALTHMATIC,0.2,
{snip}

Ability to determine that a CSV should be rendered using RTL column ordering.

It shall be possible to declare whether a given tabular data file should be rendered with column
order direction Right-to-Left (RTL); e.g. the first column on the far right, with subsequent columns
displayed to the left of the preceding column.

A "RTL aware" application should use the RTL declaration to determine how to display the
a given data file. Left-to-Right (LTR) rendering shall be the default behaviour (in absence of
any such declaration).

The directionality of the content does not affect the logical structure of the tabular data;
i.e. the cell at index zero is followed by the cell at index 1, and then index 2 etc. As
a result, parsing of RTL tabular data is anticipated to be identical to LTR content.

Requirements relating to applications

R-CsvValidation

Ability to validate a CSV for conformance with a specified metadata definition

The content of a CSV often needs to be validated for conformance against a
specification. A specification may be expressed in machine-readable format as defined in
the Metadata Vocabulary for Tabular Data [[!tabular-metadata]].

Validation shall assess conformance against structural definitions such as number of
columns and the datatype for a given column. Further validation needs are to be
determined. It is anticipated that validation may vary based on row-specific attributes
such as the type of entity described in that row.
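As an illustration only, the following sketch validates a CSV file against a simplified, assumed structural definition (expected column names and per-column datatypes); the shape of the metadata dictionary is hypothetical and is not the vocabulary defined in [[!tabular-metadata]]:

import csv

# Minimal sketch: structural validation of a CSV file against an assumed,
# simplified metadata definition (column names and per-column datatypes).
def validate(path, metadata):
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != metadata["columns"]:
            errors.append(f"unexpected columns: {reader.fieldnames}")
        for line, row in enumerate(reader, start=2):
            for column, datatype in metadata.get("datatypes", {}).items():
                try:
                    datatype(row[column])
                except (KeyError, ValueError):
                    errors.append(f"line {line}: {column!r} does not hold a valid value")
    return errors

# Hypothetical metadata definition for a two-column file.
metadata = {"columns": ["id", "amount"], "datatypes": {"amount": float}}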

Standardised CSV to RDF transformation mechanisms mitigate the need for bespoke
transformation software to be developed by CSV data consumers, thus simplifying the
exploitation of CSV data. Local identifiers for the entity described in a given row or
used to reference some other entity need to be converted to URIs. RDF properties
(or property paths) need to be determined to relate the entity described
within a given row to the corresponding data values for that row. Where available,
the type of a data value should be incorporated in the resulting RDF. Built-in types
defined in RDF 1.1 [[rdf11-concepts]] (e.g.
xsd:dateTime,
xsd:integer
etc.) and types defined in other RDF vocabularies / OWL ontologies (e.g. geo:wktLiteral,
GeoSPARQL [[geosparql]] section 8.5.1 RDFS Datatypes refers) shall be supported.
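A minimal sketch of such a transformation, using the rdflib Python library with illustrative column names, namespace and property mappings, might look like this:

import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Minimal sketch: convert CSV rows into RDF triples. The namespace, column
# names and property/datatype mappings below are illustrative assumptions.
EX = Namespace("http://example.org/")

def csv_to_rdf(path):
    g = Graph()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Map the local identifier in the "id" column to a URI.
            subject = EX["observation/" + row["id"]]
            g.add((subject, RDF.type, EX.Observation))
            # Typed literals: the declared column datatype is carried into RDF.
            g.add((subject, EX.observedAt,
                   Literal(row["timestamp"], datatype=XSD.dateTime)))
            g.add((subject, EX.value,
                   Literal(row["value"], datatype=XSD.decimal)))
    return g.serialize(format="turtle")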

A CSV conforming with the
core tabular data model [[!tabular-data-model]], yet lacking
any annotation that defines rich semantics for that data, shall be able to be transformed into
an object / object graph serialisation such as JSON, XML or RDF using systematic rules - a "canonical"
mapping.

The canonical mapping should provide automatic scoping of local identifiers (e.g. conversion to
URI), identification of primary keys and detection of data types.

Ability to publish metadata independently from the tabular data resource it describes

Commonly, tabular datasets are published without the supplementary metadata that enables a third party to
correctly interpret the published information. An independent party - in addition to the data publisher -
shall be able to publish metadata about such a dataset, thus enabling a community of users to benefit from
the efforts of that third party to understand that dataset.

When annotating tabular data, it should be possible for one to define within the metadata
a property-value pair that is repeated for every row in the tabular dataset; for example,
the location ID for a set of weather observations, or the dataset ID for a set of
biodiversity observations.

In the case of sparsely populated data, this property-value pair must be applied as a
default only where that property is absent from the data.

As an illustration, the Darwin Core Archive standard
provides the ability to specify such a property value pair within its metadata description
file meta.xml.

Ability to apply conditional processing based on the value of a specific cell

When transforming CSV content into XML, JSON or RDF it shall be possible to vary the transformation
of the information in a particular row based on the values within a cell, or element within a cell,
contained within that row.

To vary the transformation based on an element within a cell, the value of that cell must be
well structured. See CellMicrosyntax for more information.

Ability to identify comment lines within a CSV file and skip over them during parsing, format conversion or other processing

A tabular datafile may include comment lines. It shall be possible to declare how to recognize
a comment line within the data (e.g. by specifying a sequence of characters that are found
at the beginning of every comment line).

Comment lines shall not be treated as data when parsing, converting or processing the
CSV file. During format conversion, the application may try to include the comment in the
conversion.

Non-functional requirements

R-ZeroEditAdditionOfSupplementaryMetadata

Ability to add supplementary metadata to an existing CSV file without
requiring modification of that file

It may not be possible for a tabular data file to be modified to include the supplementary
metadata required to adequately describe the content of the data file. For example, the
data may be published by a third party or the user may be constrained in their workflow
by choice of tools that do not support or even recognize the supplementary metadata.

It shall be possible to provide annotations about a given tabular data file without
requiring that file to be modified in any way; "zero-edit" addition.

The following requirements pertain to describing and parsing microsyntax:

to document microsyntax so that humans can understand
what it is conveying; e.g. to provide human-readable annotation

to validate the cell values to ensure they conform to the expected microsyntax

to label the value as being in a particular microsyntax when converting into
JSON/XML/RDF; e.g. marking an XML value as an XMLLiteral or a datetime
value as xsd:dateTime

to process the microsyntax into an appropriate data structure when converting
into JSON/XML/RDF

The ability to declare that a column within a tabular data file carries values of a
particular type, and the potential validation of the cell against the declared type,
is covered in R-SyntacticTypeDefinition
and is not discussed further here.

We can consider cell values with microsyntax to be annotated strings. The annotation (which might
include a definition of the format of the string - such as defining the delimiter used for a list)
can be used to validate the string and (in some cases) convert it into a suitable value or data
structure.

Microsyntax, therefore, requires manipulation of the text if processed. Typically, this will
relate to conversion of lists into multiple-valued entries, but may also include reformatting of
text to convert between formats (e.g. to convert a datetime value to a date, or locale dates to
ISO 8601 compliant syntax).

We assume that lists of values within a given cell all refer to items of the same type -
such as a list of authors for a journal article.
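For example, a processor might expand a list-valued cell into multiple values using a delimiter declared in the cell's annotation; a minimal sketch, in which the delimiter and sample value are assumptions:

# Minimal sketch: expand a list-valued cell into multiple values using a
# delimiter declared in the cell's annotation. The delimiter and the sample
# cell value below are illustrative assumptions.
def expand_list_cell(cell_value, delimiter=";"):
    return [item.strip() for item in cell_value.split(delimiter) if item.strip()]

print(expand_list_cell("Smith, J.; Jones, A.; Zhang, W."))
# ['Smith, J.', 'Jones, A.', 'Zhang, W.']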

At this time, there is no expectation for CSV parsers to be able to take cells with embedded
structure (e.g. XML, JSON, WKT etc.) or arbitrarily semi-structured text and convert them into
JSON/XML/RDF.

Validation of the embedded structure is assumed to be limited to validation of
syntax rather than structure (e.g. ensuring that what is declared as XML is valid XML).
No attempt should be made to, say, validate XML content against a XML Schema Document. Please refer
to R-SyntacticTypeDefinition for more details on validation
of data types.

Ability to determine the primary key for rows within a tabular data file

It shall be possible to uniquely identify every row within a tabular data file. The
default behaviour for uniquely identifying rows is to use the row number. However, some
datasets already include a unique identifier for each row in the dataset. In such
cases, it shall be possible to declare which column provides the primary key.

To interpret data in a given row of a CSV file, one may need to be able to refer to
information provided in supplementary CSV files or elsewhere within the same CSV file;
e.g. using a foreign key type reference. The cross-referenced CSV files may, or may
not, be packaged together.

Annotations and supplementary information may be literal values or references to a
remote resource. The presence of annotations or supplementary information must not
adversely impact parsing of the tabular data (e.g. the annotations and supplementary
information must be logically separate).

This requirement refers to provision of human-readable annotation providing
additional context to a group of tables, table, column, row, cell or other region
within a table. For example, the
publication of national statistics use case
adds a number of such annotations to a table.

CSV files make frequent use of code values when describing data. Examples include:
geographic regions, status codes and category codes. In some cases, names are used
as a unique identifier for a resource (e.g. a company name within a transaction audit).
It is difficult to interpret the
tabular data without an unambiguous definition of the code values or (local) identifiers used.

It must be possible to unambiguously associate the notation used within a CSV file
with the appropriate external definition.

Ability to assert how a single CSV file is a facet or subset of a larger
dataset

A large tabular dataset may be split into several files for publication; perhaps to ensure that
each file is a manageable size or to publish the updates to a dataset during the (re-)publishing cycle. It shall
be possible to declare that each of the files is part of the larger dataset and to describe what
content can be found within each file in order to allow users to rapidly find the particular file
containing the information they are interested in.

Ability to declare syntactic type for cells within a specified column.

Whilst it is possible to automatically detect the type of data (e.g. date, number) in a
given cell, this can be error prone. For example, the date April 1st if written
as 1/4 may be interpreted as a decimal fraction.

It shall be possible to declare the data type for the cells in a given column of a
tabular data file. Only one data type can be declared for a given column.

An application may still attempt to automatically detect the data type for a given
cell. However, the explicit declaration shall always take precedence.

The data type declaration will typically be used to declare that a column contains
integers, floating point numbers or text. However, it may be used to assert that a cell
contains, say, embedded
XML content (rdf:XMLLiteral),
datetime values (xsd:dateTime) or
geometry expressed as well-known-text (geo:wktLiteral,
GeoSPARQL [[geosparql]] section 8.5.1 RDFS Datatypes refers).

Each row in a tabular data set describes a given resource or entity. The properties
for that entity are described in the cells of that row. All the cells in a given column
are anticipated to provide the same property.

It shall be possible to declare the semantic relationship between the entity that a
given row describes and a cell in a given column.

The following example of an occupational listing
illustrates how a row of tabular data can be mapped to equivalent content expressed in RDF (Turtle).

To express semantics in a machine readable form, RDF seems the appropriate choice.
Furthermore, best practice indicates that one should adopt common and widely adopted
patterns (e.g. RDF vocabularies, OWL ontologies) when publishing data to enable a
wide audience to consume and understand the data. Existing (de facto) standard
patterns may add complexity when defining the semantics associated with a particular
row such that a single RDF predicate is insufficient.

For example, to express a quantity value using QUDT
we use an instance of qudt:QuantityValue to relate the numerical value
with the quantity kind (e.g. air temperature) and unit of measurement (e.g.
Celsius). Thus the semantics needed for a column containing temperature values might
be: qudt:value/qudt:numericValue - more akin to an LDPath expression.

Furthermore, use of OWL axioms when defining a sub-property of
qudt:value would allow the quantity type and unit of measurement to
be inferred, with the column semantics then being specified as
ex:temperature_Cel/qudt:numericValue.

Ability to declare a "missing value" token and, optionally, a reason for the
value to be missing

Significant amounts of existing tabular text data include values such as
-999. Typically, these are outside the normal expected range of values
and are meant to indicate that the value for that cell is missing. Automated parsing of
CSV files needs to recognise such missing value tokens and behave accordingly.
Furthermore, it is often useful for a data publisher to declare why a value
is missing; e.g. withheld or aboveMeasurementRange.

Ability to map cell values within a given column into corresponding URI

Tabular data often makes use of local identifiers to uniquely identify an entity described within a tabular data file
or to reference an entity described in the same data file or elsewhere (e.g. reference data, code lists, etc.).
The local identifier will often be unique within a particular scope (e.g. a code list or data set), but cannot
be guaranteed to be globally unique. In order to make these local identifiers globally unique (e.g. so that
the entity described by a row in a tabular data file can be referred to from an external source, or to establish
links between the tabular data and the related reference data) it is necessary to map those local identifiers
to URIs.

It shall be possible to declare how local identifiers used within a column of a particular dataset
can be mapped to their respective URIs. Typically, this may be achieved by concatenating the local identifier with
a prefix - although more complex mappings are anticipated, such as removal of "special characters" that are not
permitted in URIs (as defined in [[RFC3986]]) or CURIEs [[curie]].

Furthermore, where the local identifier is part of a controlled vocabulary, code list or thesaurus, it should
be possible to specify the URI for the controlled vocabulary within which the local identifier is defined.
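A minimal sketch of such a mapping, assuming a simple prefix-concatenation rule and an illustrative base URI, is shown below:

from urllib.parse import quote

# Minimal sketch: map a local identifier to a URI by concatenating it with a
# declared prefix, percent-encoding any characters not permitted in URIs.
# The base URI below is an illustrative assumption.
def to_uri(local_id, base="http://example.org/code-list/"):
    return base + quote(str(local_id), safe="")

print(to_uri("60400000"))        # http://example.org/code-list/60400000
print(to_uri("MCL Transport"))   # spaces become %20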

Ability to group multiple data tables into a single package for
publication

When publishing sets of related data tables, it shall be possible to provide annotation for the
group of related tables. Annotation concerning a group of tables may include summary
information about the composite dataset (or "group") that the individual tabular datasets belong to,
such as the license under which the dataset is made available.

The implication is that the group shall be identified as an entity
in its own right, thus enabling assertions to be made about that group. The relationship
between the group and the associated tabular datasets will need to be made explicit.

Furthermore, where appropriate, it shall be possible to describe the interrelationships
between the tabular datasets within the group.

The tabular datasets comprising a group need not be hosted at the same URL. As such,
a group does not necessarily need to be published as a single package (e.g. as a zip) - although we
note that this is a common method of publication.

Ability for a metadata description to explicitly cite the tabular dataset it describes

Metadata resources may be published independently from the tabular dataset(s) they describe; e.g. a third
party may publish metadata in their own domain that describes how they have interpreted the data for their
application or community. In such a case, the relationship between the metadata and data resources cannot
be inferred - it must be stated explicitly.

Such a link between metadata and data resources should be discoverable, thus enabling a data publisher to
determine who is referring to their data leading to the data publisher gaining a better understanding
of their user community.

Ability to provide multiple values of a given property for a single entity described within a tabular data file

It is commonplace for a tabular data file to provide multiple values of a given property
for a single entity. This may be achieved in a number of ways.

First, multiple rows may be used to describe the same entity, each such row using the same
unique identifier for the entity. For example, a country, identified using its
two-letter country code,
may have more than one name, with each name provided on a separate row.

Ability to transform data that is published in a normalized form into tabular data.

Textual data may be published in a normalized form, often improving human readability by
reducing the number of lines in the data file. As a result, such a normalized data file will
no longer be regular, as additional information is included in each row (e.g. the number of columns will vary
because more cells are provided for some rows).

The motivating use case is an example
where we have a CSV file that is not well-formed - in this particular case, the number of
columns varies row by row and therefore does not conform to the
model for tabular data [[!tabular-data-model]].

The ability to transform a data file into a tabular data file is a necessary
prerequisite for any subsequent transformation. That said, such a transformation is outside the
scope of this working group as it requires parsing a data file with arbitrary structure.

Such pre-processing to create a tabular data file from a given structure is
likely to be reasonably simple for a programmer to implement, but it cannot be generalised.