Abstract

Many national, regional and local governments, as well as other organizations inside and outside of the public sector, collect numeric data and aggregate this data into statistics. There is a need to publish these statistics in a standardised, machine-readable way on the web, so that they can be freely integrated and reused in consuming applications. This document is a collection of use cases for a standard vocabulary to publish statistics as Linked Data.

Status of This Document

...

Introduction

Publishing statistics is challenging for the following reasons:

Representing observations and measurements requires more complex modelling, as discussed by Martin Fowler (Fowler, Martin (1997). Analysis Patterns: Reusable Object Models. Addison-Wesley. ISBN 0201895420.): Recording a statistic simply as an attribute of an object (e.g., the fact that a person weighs 185 pounds) fails to represent important concepts such as quantity, measurement, and observation.

Quantity comprises the information necessary to interpret the value, e.g., the unit and arithmetical and comparative operations; humans and machines can then appropriately visualize such quantities or convert between different quantities.

A Measurement separates a quantity from the actual event at which it was collected; a measurement assigns a quantity to a specific phenomenon type (e.g., strength). Also, a measurement can record metadata such as who did the measurement (person), and when was it done (time).

Observations, eventually, abstract from measurements only recording numeric quantities. An Observation can also assign a category observation (e.g., blood group A) to an observation.

Figure demonstrates this relationship. Even though the intended vocabulary may not comply with this modelling, the figure should illustrate the complexity of modelling observations.
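The distinction between quantity, measurement, and observation can be made more tangible with a small sketch. The following Python classes are only an illustration of the concepts described above; the class and field names, the single supported unit conversion, and the placeholder identifiers are assumptions made for this example and are taken neither from Fowler nor from the intended vocabulary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Quantity:
    """A number together with the unit needed to interpret it."""
    amount: float
    unit: str

    def to(self, unit: str) -> "Quantity":
        # Only one illustrative conversion is supported in this sketch;
        # any other unit pair raises a KeyError.
        factors = {("lb", "kg"): 0.45359237}
        if unit == self.unit:
            return self
        return Quantity(self.amount * factors[(self.unit, unit)], unit)


@dataclass(frozen=True)
class Measurement:
    """Assigns a quantity to a phenomenon type, with who/when metadata."""
    subject: str
    phenomenon_type: str
    quantity: Quantity
    measured_by: str
    measured_on: date


@dataclass(frozen=True)
class CategoryObservation:
    """A non-numeric observation, e.g., blood group A."""
    subject: str
    phenomenon_type: str
    category: str


# Placeholder subject and observer identifiers (hypothetical).
weight = Measurement("person-1", "weight", Quantity(185, "lb"),
                     "observer-1", date(2013, 1, 8))
blood = CategoryObservation("person-1", "blood group", "A")

print(weight.quantity.to("kg"))
```

Recording the weight as a bare attribute (185) would lose the unit, the observer, and the date, which is the shortcoming the modelling above is meant to avoid.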

The ISO standard for exchanging and sharing statistical data and metadata among organizations is Statistical Data and Metadata eXchange (SDMX). Since this standard has proven applicable in many contexts, we adopt the multidimensional model that underlies SDMX and intend the standard vocabulary to be compatible with SDMX.

The multidimensional model of SDMX employs observations with measures that depend on dimensions and dimension members and are further contextualized by attributes; it thereby caters for the modelling complexities described by Fowler.

Note:
* XXX: Should we take specific providers of statistics into account?
** E.g., LODstats http://aksw.org/projects/LODStats

Terminology

Statistics is the study of the collection, organization, analysis, and interpretation of data. (Statistics. Wikipedia, http://en.wikipedia.org/wiki/Statistics, last visited at Jan 8 2013). Statistics comprise statistical data.

The basic structure of statistical data is a multidimensional table (also called a data cube) (SDMX User Guide Version 2009.1, http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf, last visited Jan 8 2013.), i.e., a set of observed values organized along a group of dimensions, together with associated metadata. If aggregated we refer to statistical data as "macro-data" whereas if not, we refer to "micro-data".

Source data is data from datastores such as RDBs or spreadsheets that acts as a source for the Linked Data publishing process.

A publisher is a person or organization that exposes source data as Linked Data on the Web.

A consumer is a person or agent that uses Linked Data from the Web.

A format is machine-readable if it is amenable to automated processing by a machine, as opposed to presentation to a human user.

Aim of this document

The aim of this document is to present use cases (rather than general scenarios) that would benefit from a standard vocabulary to represent statistics as Linked Data. These use cases will be used to derive and justify requirements for a specification of such a standard vocabulary, and later to evaluate the suitability of the vocabulary to fulfil the requirements. Use cases do not necessarily need to be implemented; their main aim is to serve as a "design decision FAQ" that ensures requirements for the vocabulary are derived systematically rather than in an ad-hoc way, and that brings together the vocabulary's specification and use cases.

Use cases

This section presents use cases that would be enabled by the existence of a standard vocabulary for the representation of statistics as Linked Data.

SDMX Web Dissemination Use Case

Since we have adopted the multidimensional model that underlies SDMX, we also adopt the "Web Dissemination Use Case" (SDMX 2.1 User Guide Version. Version 0.1 - 19/09/2012. http://sdmx.org/wp-content/uploads/2012/11/SDMX_2-1_User_Guide_draft_0-1.pdf. Last visited on Jan 8 2013.) which is the prime use case for SDMX since it is an increasingly popular use of SDMX and enables organisations to build a self-updating dissemination system.

The Web Dissemination Use Case contains three actors, a structural metadata web service (registry) that collects metadata about statistical data in a registration fashion, a data web service (publisher) that publishes statistical data and its metadata as registered in the structural metadata web service, and a data consumption application (consumer) that first discovers data from the registry, then queries data from the corresponding publisher of selected data, and then visualises the data.

Abstracted from the SDMX specificities, this use case contains the following processes, also illustrated in a process flow diagram by SDMX and in more detail described as follows:

A data web service (publisher) registers statistical data in a registry, and provides statistical data from a database and metadata from a metadata repository for consumers. For that the publisher creates database tables (see 1 in figure), and loads statistical data in a database and metadata in a metadata repository.

A consumer discovers data from a registry (3) and creates a query to the publisher for selected statistical data (4).

The publisher translates the query to a query to its database (5) as well as metadata repository (6) and returns the statistical data and metadata.

The consumer visualises the returned statistical data and metadata.
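The interplay of the three actors can be sketched as follows. This is a minimal in-memory illustration, not an SDMX implementation: the class names Registry and Publisher, their methods, the dataset identifier, and all values are assumptions invented for this example.

```python
class Registry:
    """Structural metadata service: maps dataset ids to (publisher, metadata)."""

    def __init__(self):
        self._entries = {}

    def register(self, dataset_id, publisher, metadata):
        self._entries[dataset_id] = (publisher, metadata)

    def discover(self, keyword):
        # Return (dataset id, publisher) pairs whose title mentions the keyword.
        return [(dataset_id, publisher)
                for dataset_id, (publisher, metadata) in self._entries.items()
                if keyword.lower() in metadata["title"].lower()]


class Publisher:
    """Data web service backed by a database and a metadata repository."""

    def __init__(self, registry):
        self._registry = registry
        self._database = {}   # dataset id -> list of observations
        self._metadata = {}   # dataset id -> descriptive metadata

    def load(self, dataset_id, observations, metadata):
        self._database[dataset_id] = observations
        self._metadata[dataset_id] = metadata
        self._registry.register(dataset_id, self, metadata)

    def query(self, dataset_id, **fixed):
        # Translate the consumer's query into a lookup on the local stores.
        rows = [obs for obs in self._database[dataset_id]
                if all(obs[dim] == value for dim, value in fixed.items())]
        return rows, self._metadata[dataset_id]


# Publisher side: load data and register it (values are invented).
registry = Registry()
publisher = Publisher(registry)
publisher.load("dataset-1", [
    {"geo": "AA", "year": 2011, "value": 1.0},
    {"geo": "BB", "year": 2011, "value": 2.0},
], {"title": "Unemployment rate", "unit": "percent"})

# Consumer side: discover, query, and "visualise" (here: print).
dataset_id, source = registry.discover("unemployment")[0]
rows, metadata = source.query(dataset_id, geo="AA")
for row in rows:
    print(metadata["title"], row["geo"], row["year"], row["value"], metadata["unit"])
```

The consumer never needs to know the publisher in advance; it only needs the registry, which is what makes the dissemination system self-updating.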

The SDMX Web Dissemination Use Case can be further concretised by the following sub-use cases:

Publisher Use Case: Combined Online Information System (COINS)

More and more organizations want to publish statistics on the web, for reasons such as increasing transparency and trust. Although in the ideal case published data can be understood by both humans and machines, data is often simply published as CSV, PDF, XLS etc., lacking elaborate metadata, which makes free usage and analysis difficult.

Therefore, the goal in this use case is to use a machine-readable and application-independent description of common statistics with use of open standards, to foster usage and innovation on the published data.

In the Combined Online Information System (COINS), HM Treasury, the principal custodian of financial data for the UK government, released previously restricted financial information about government spending.

Benefits: According to the COINS as Linked Data project, the reasons for publishing COINS as Linked Data are threefold:

using open standard representation makes it easier to work with the data with available technologies and promises innovative third-party tools and usages

individual transactions and groups of transactions are given an identity, and so can be referenced by web address (URL), to allow them to be discussed, annotated, or listed as source data for articles or visualizations

cross-links between linked-data datasets allow for much richer exploration of related datasets

The COINS data has a hypercube structure. It describes financial transactions using seven independent dimensions (time, data-type, department etc.) and one dependent measure (value). Also, it allows thirty-three attributes that may further describe each transaction. For further information, see the "COINS as Linked Data" project website.
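To illustrate what a hypercube structure that is described separately from the data could look like, consider the following sketch. It is not the real COINS structure: it uses only two dimensions and one attribute instead of seven and thirty-three, and all names, code lists, and values are invented for this example.

```python
# Hypothetical structure definition, kept separate from the transactions.
structure = {
    "dimensions": {
        "time": {"2008-09", "2009-10"},
        "department": {"dept-a", "dept-b"},
    },
    "attributes": {"status": {"final", "provisional"}},
    "measure": "value",
}


def validate(observation, structure):
    """Return a list of problems found when checking one observation."""
    problems = []
    for name, permitted in structure["dimensions"].items():
        # Every dimension is mandatory and must use a permitted value.
        if name not in observation:
            problems.append(f"missing dimension {name!r}")
        elif observation[name] not in permitted:
            problems.append(f"{name}={observation[name]!r} not permitted")
    for name, permitted in structure["attributes"].items():
        # Attributes are optional, but if present must use a permitted value.
        if name in observation and observation[name] not in permitted:
            problems.append(f"{name}={observation[name]!r} not permitted")
    if structure["measure"] not in observation:
        problems.append("missing measure")
    return problems


transaction = {"time": "2008-09", "department": "dept-a", "value": 1.0}
print(validate(transaction, structure))
```

An application can inspect such a structure, which is small, before deciding whether to download the transactions, which may be large.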

COINS is an example of one of the more complex statistical datasets being published via data.gov.uk.

Part of the complexity of COINS arises from the nature of the data being released.

The published COINS datasets cover expenditure related to five different years (2005–06 to 2009–10). The actual COINS database at HM Treasury is updated daily. In principle at least, multiple snapshots of the COINS data could be released through the year.

The COINS use case leads to the following challenges:

The actual data and its hypercube structure are to be represented separately so that an application first can examine the structure before deciding to download the actual data, i.e., the transactions. The hypercube structure also defines for each dimension and attribute a range of permitted values that are to be represented.

An access or query interface to the COINS data, e.g., via a SPARQL endpoint or the linked data API, is planned. Queries that are expected to be interesting are: "spending for one department", "total spending by department", and "retrieving all data for a given observation".

Also, the publisher favours a representation that is both as self-descriptive as possible, i.e., others can link to and download fully-described individual transactions, and as compact as possible, i.e., information is not unnecessarily repeated.

Moreover, the publisher is thinking about the possible benefit of publishing slices of the data, e.g., datasets that fix all dimensions but the time dimension. For instance, such slices could be particularly interesting for visualisations or comments. However, depending on the number of dimensions, the number of possible slices can become large, which makes it difficult to select all interesting slices.

An important benefit of linked data is that we are able to annotate data, at a fine-grained level of detail, to record information about the data itself. This includes where it came from – the provenance of the data – but could include annotations from reviewers, links to other useful resources, etc. Being able to trust that data is correct and reliable is a central value for government-published data, so recording provenance is a key requirement for the COINS data.

A challenge also is the size of the data, especially since it is updated regularly. Five data files already contain between 3.3 and 4.9 million rows of data.
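The slicing challenge mentioned above (fixing all dimensions but the time dimension) can be sketched as follows. The dimension names and members are invented for this example; the real COINS code lists are much larger. With, say, six fixed dimensions of ten members each, the same enumeration would already yield 10**6 candidate slices.

```python
from itertools import product

# Hypothetical dimension members.
members = {
    "time": ["2007-08", "2008-09", "2009-10"],
    "department": ["dept-a", "dept-b"],
    "data_type": ["plan", "outturn"],
}


def slice_keys(members, free):
    """Enumerate every way of fixing all dimensions except `free`."""
    fixed = [dim for dim in members if dim != free]
    return [dict(zip(fixed, combo))
            for combo in product(*(members[dim] for dim in fixed))]


def take_slice(observations, fixed):
    """Select the observations that match all fixed dimension values."""
    return [obs for obs in observations
            if all(obs[dim] == value for dim, value in fixed.items())]


# One invented observation per combination of dimension members.
observations = [
    {"time": t, "department": d, "data_type": k, "value": 1.0}
    for t, d, k in product(*members.values())
]

keys = slice_keys(members, free="time")
time_series = take_slice(observations, keys[0])
print(len(keys), "possible time-series slices;",
      len(time_series), "observations in the first one")
```

The number of slices is the product of the member counts of the fixed dimensions, which is why guidance on selecting useful slices matters.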

Requirements:

The use case is fulfilled if the standard will be a Linked Data vocabulary for encoding statistical data that has a hypercube structure and as such can describe common statistics in a machine-readable and application-independent way.

There should be a consensus on the issue of flattening or abbreviating data; one suggestion is to author data without the duplication, but have the data publication tools "flatten" the compact representation into standalone observations during the publication process.

It is somewhat unclear at this point which slices through the data will be useful to (COINS-RDF) users. Guidance in selecting and creating useful slices might be necessary.
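The "flatten" suggestion in the requirements above can be sketched in a few lines. The layout of the compact representation (a block of shared components plus abbreviated observations) is an assumption made for this example, as are all names and values.

```python
# Compact authoring form: shared components are stated once (hypothetical layout).
compact = {
    "shared": {"department": "dept-a", "data_type": "outturn"},
    "observations": [
        {"time": "2008-09", "value": 10},
        {"time": "2009-10", "value": 12},
    ],
}


def flatten(compact):
    """Copy the shared components into every observation so each stands alone."""
    return [{**compact["shared"], **obs} for obs in compact["observations"]]


for observation in flatten(compact):
    print(observation)
```

Authoring stays compact, while each published observation is fully described and can be linked to and downloaded on its own.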

Publisher Use Case: Publishing Excel Spreadsheets as Linked Data

Not only in government, there is a need to publish considerable amounts of statistical data to be consumed in various (also unexpected) application scenarios. Typically, Microsoft Excel sheets are made available for download. Those Excel sheets contain single spreadsheets with several multidimensional data tables, having a name and notes, as well as column values, row values, and cell values.

Benefits: The goal in this use case is to publish spreadsheet information in a machine-readable format on the web, e.g., so that crawlers can find spreadsheets that use a certain column value. The published data should represent and make available for queries the most important information in the spreadsheets, e.g., rows, columns, and cell values.

For instance, in the CEDA_R and Data2Semantics projects publishing and harmonizing Dutch historical census data (from 1795 onwards) is a goal. These censuses are now only available as Excel spreadsheets (obtained by data entry) that closely mimic the way in which the data was originally published and shall be published as Linked Data.

Challenges in this use case:

All context and so all meaning of the measurement point is expressed by means of dimensions. The pure number is the star of an ego-network of attributes or dimensions. In an RDF representation it is then easily possible to define hierarchical relationships between the dimensions (that can be exemplified further) as well as to map different attributes across different value points. This way a harmonization among variables is performed around the measurement points themselves.

In historical research, until now, harmonization across datasets is performed by hand, and in subsequent iterations of a database: it is very hard to trace back the provenance of decisions made during the harmonization procedure.

Combining Data Cube with SKOS to allow for cross-location and cross-time historical analysis

These challenges may seem to be particular to the field of historical research, but in fact apply to government information at large. Government is not a single body that publishes information at a single point in time. Government consists of multiple (altering) bodies, scattered across multiple levels, jurisdictions and areas. Publishing government information in a consistent, integrated manner requires exactly the type of harmonization required in this use case.

Excel sheets provide much flexibility in arranging information. It may be necessary to limit this flexibility to allow automatic transformation.

There are many spreadsheets.

Semi-structured information, e.g., notes about lineage of data cells, may not be possible to be formalized.

Another concrete example is the Stats2RDF [1] project that intends to publish biomedical statistical data that is represented as Excel sheets. Here, Excel files are first translated into CSV and then translated into RDF.

Publisher Use Case: Publishing hierarchically structured data from StatsWales and Open Data Communities

This multidimensional data contains for each fact a time dimension with one level year and a location dimension with levels Unitary Authority, Government Office Region, Country, and ALL.

The unit of measure is 1000 households.

In this use case, one wants to publish not only a dataset on the bottom-most level, i.e., the number of households at each Unitary Authority in each year, but also datasets on more aggregated levels.

For instance, in order to publish a dataset with the number of households at each Government Office Region per year, one needs to aggregate the measure of each fact having the same Government Office Region using the SUM function.

Importantly, one would like to maintain the relationship between the resulting datasets, i.e., the levels and aggregation functions.
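The aggregation step and the relationship that should be maintained can be sketched as follows. The hierarchy, identifiers, and household numbers are invented for this example, and the dictionary used to record the relationship between the datasets is only one conceivable shape, not a proposal for the vocabulary.

```python
from collections import defaultdict

# Hypothetical hierarchy: Unitary Authority -> Government Office Region.
parent_region = {"ua-1": "region-x", "ua-2": "region-x", "ua-3": "region-y"}

# Facts at the bottom level; values are invented, in units of 1000 households.
facts = [
    {"year": 2009, "unitary_authority": "ua-1", "households": 40},
    {"year": 2009, "unitary_authority": "ua-2", "households": 50},
    {"year": 2009, "unitary_authority": "ua-3", "households": 30},
]


def roll_up(facts, hierarchy):
    """SUM the measure of all facts that share the same year and parent region."""
    totals = defaultdict(int)
    for fact in facts:
        key = (fact["year"], hierarchy[fact["unitary_authority"]])
        totals[key] += fact["households"]
    return dict(totals)


# Keep the level and aggregation function alongside the derived dataset.
regional = {
    "level": "Government Office Region",
    "derived_from_level": "Unitary Authority",
    "aggregation": "SUM",
    "observations": roll_up(facts, parent_region),
}
print(regional["observations"])
```

Without the recorded level and aggregation function, a consumer could not tell that the regional figures are sums of the Unitary Authority figures.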

Note that this use case needs more than a simple selection (or "dice" in an OLAP context), such as a qb:Slice in which one fixes the time period and the measure.

Publisher Use Case: Publishing slices of data about UK Bathing Water Quality

As part of their work with data.gov.uk and the UK Location Programme, Epimorphics Ltd have been working to pilot the publication of both current and historic bathing water quality information from the UK Environment Agency (http://www.environment-agency.gov.uk/) as Linked Data.

The UK has a number of areas, typically beaches, that are designated as bathing waters where people routinely enter the water. The Environment Agency monitors and reports on the quality of the water at these bathing waters.

The Environment Agency's data can be thought of as structured in 3 groups:

There is basic reference data describing the bathing waters and sampling points

There is a data set "Annual Compliance Assessment Dataset" giving the rating for each bathing water for each year it has been monitored

There is a data set "In-Season Sample Assessment Dataset" giving the detailed weekly sampling results for each bathing water

The most important dimensions of the data are bathing water, sampling point, and compliance classification.

Challenges:

Observations may exhibit a number of attributes, e.g., whether there was an abnormal weather exception.

Relevant slices of both datasets are to be created:

Annual Compliance Assessment Dataset: all the observations for a specific sampling point, all the observations for a specific year.

In-Season Sample Assessment Dataset: samples for a given sampling point, samples for a given week, samples for a given year, samples for a given year and sampling point, latest samples for each sampling point.
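Some of the slices listed above can be sketched as simple selections over the samples. The record layout, identifiers, and dates below are invented for this example and do not reflect the Environment Agency's actual data.

```python
from datetime import date

# Invented in-season samples with one attribute (abnormal weather exception).
samples = [
    {"sampling_point": "sp-1", "date": date(2011, 8, 1), "abnormal_weather": False},
    {"sampling_point": "sp-1", "date": date(2012, 6, 4), "abnormal_weather": True},
    {"sampling_point": "sp-2", "date": date(2012, 5, 28), "abnormal_weather": False},
]


def in_season_slice(samples, year=None, sampling_point=None):
    """Samples for a given year, a given sampling point, or both."""
    return [s for s in samples
            if (year is None or s["date"].year == year)
            and (sampling_point is None or s["sampling_point"] == sampling_point)]


def latest_per_sampling_point(samples):
    """The most recent sample for each sampling point."""
    latest = {}
    for sample in samples:
        best = latest.get(sample["sampling_point"])
        if best is None or sample["date"] > best["date"]:
            latest[sample["sampling_point"]] = sample
    return latest


print(in_season_slice(samples, year=2012, sampling_point="sp-1"))
print(latest_per_sampling_point(samples))
```

Note that the "latest samples" slice is not defined by fixing dimension values; its content changes whenever a new sample arrives.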

Existing Work (optional): Semantic Sensor Network ontology (SSN) [2] already provides a way to publish sensor information. SSN data provides statistical Linked Data and grounds its data to the domain, e.g., sensors that collect observations (e.g., sensors measuring average of temperature over location and time). A number of organizations, particularly in the Climate and Meteorological area already have some commitment to the OGC "Observations and Measurements" (O&M) logical data model, also published as ISO 19156. Are there any statements about compatibility and interoperability between O&M and Data Cube that can be made to give guidance to such organizations?

Publisher Use Case: Eurostat SDMX as Linked Data

As mentioned already, the ISO standard for exchanging and sharing statistical data and metadata among organizations is Statistical Data and Metadata eXchange (SDMX). Since this standard has proven applicable in many contexts, we adopt the multidimensional model that underlies SDMX and intend the standard vocabulary to be compatible with SDMX.

Therefore, in this use case we intend to explain the benefit and challenges of publishing SDMX data as Linked Data.

As one of the main adopters of SDMX, Eurostat (http://epp.eurostat.ec.europa.eu/) publishes large amounts of European statistics coming from a data warehouse as SDMX and other formats on the web.

Eurostat also provides an interface to browse and explore the datasets. However, linking such multidimensional data to related data sets and concepts would require download of interesting datasets and manual integration.

The goal here is to improve integration with other datasets; Eurostat data should be published on the web in a machine-readable format so that it can be linked with other datasets and freely consumed by applications.

Any Eurostat dataset contains a varying set of dimensions (e.g., date, geo, obs_status, sex, unit) as well as measures (generic value, content is specified by dataset, e.g., GDP per capita in PPS, Total population, Employment rate by sex).

Benefits

Possible implementation of ETL pipelines based on Linked Data technologies (e.g., LDSpider) to load the data into a data warehouse for analysis.

Allows contextual information to be attached to statistics during the interpretation process.

Allows single observations from the data to be reused.

Allows linking to information from other data sources, e.g., for the geo-spatial dimension.

Challenges

New Eurostat datasets are added regularly to Eurostat. The Linked Data representation should automatically provide access to the most up-to-date data.

How to match elements of the geo-spatial dimension to elements of other data sources, e.g., NUTS, NACE.

There is a large number of Eurostat datasets, each possibly containing a large number of columns (dimensions) and rows (observations). Eurostat publishes more than 5200 datasets, which, when converted into RDF require more than 350GB of disk space yielding a dataspace with some 8 billion triples.

Provide a useful interface for browsing and visualising the data. One problem is that the data sets have too high a dimensionality to be displayed directly. Instead, one could visualise slices of time series data. However, for that, one would need to either fix most other dimensions (e.g., sex) or aggregate over them (e.g., via average). The selection of useful slices from the large number of possible slices is a challenge.

Each dimension used by a dataset has a range of permitted values that ought to be represented.

Not dealt with

One possible application would run validation checks over Eurostat data. The intended standard vocabulary is to publish the Eurostat data as-is and is not intended to represent information for validation (similar to business rules).

Requirements

Observations should be able to have their own URIs.
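One conceivable way to give each observation its own URI is to derive it from the dataset identifier and the dimension values. The following is only a sketch: the base URI, the dataset identifier, and the URI pattern are assumptions made for this example and do not describe how Eurostat data is actually identified.

```python
from urllib.parse import quote

BASE = "http://example.org/data/"  # hypothetical base URI


def observation_uri(dataset_id, dimension_values):
    """Build a stable URI from the dataset id and the dimension values."""
    # Sort dimensions by name so the same observation always gets the same URI.
    parts = [f"{quote(dim, safe='')}={quote(str(value), safe='')}"
             for dim, value in sorted(dimension_values.items())]
    return BASE + quote(dataset_id, safe="") + "/" + ";".join(parts)


print(observation_uri("dataset-1", {"geo": "AA", "date": "2011", "sex": "T"}))
```

Because the URI depends only on the dimension values, an updated measure value for the same observation keeps its identity across weekly conversions.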

Updates to the data

Eurostat - Linked Data pulls in changes from the original Eurostat dataset on a weekly basis; the conversion process runs every Saturday at noon, taking into account new datasets along with updates to existing datasets.

In several applications, relationships between statistical data need to be represented.

The goal of this use case is to describe provenance, transformations, and versioning around statistical data, so that the history of statistics published on the web becomes clear. This may also relate to the issue of having relationships between datasets published.

For instance, the COINS project (http://data.gov.uk/resources/coins) has at least four perspectives on what they mean by “COINS” data: the abstract notion of “all of COINS”, the data for a particular year, the version of the data for a particular year released on a given date, and the constituent graphs which hold both the authoritative data translated from HMT’s own sources and additional supplementary information which they derive from the data, for example by cross-linking to other datasets.

Another specific use case is that the Welsh Assembly government publishes a variety of population datasets broken down in different ways. For many uses, the population broken down by some category (e.g. ethnicity) is expressed as a percentage. Separate datasets give the actual counts per category and aggregate counts. In such cases it is common to talk about the denominator (often DENOM) which is the aggregate count against which the percentages can be interpreted.
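The relationship between counts, percentages, and the denominator is simple arithmetic, sketched below. The category names and counts are invented for this example.

```python
# Invented counts per category.
counts = {"category-a": 150, "category-b": 50}

# The denominator (DENOM) is the aggregate count.
denominator = sum(counts.values())

# Percentages are the counts expressed relative to the denominator.
percentages = {category: 100 * count / denominator
               for category, count in counts.items()}


def count_from_percentage(percentage, denominator):
    """Recover the actual count from a percentage and its denominator."""
    return percentage * denominator / 100


print(denominator, percentages)
print(count_from_percentage(percentages["category-a"], denominator))
```

A percentage published without its denominator cannot be turned back into a count, which is why the link between the datasets needs to be explicit.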

Here, numbers from a sustainability report have been created by a number of transformations to statistical data. Different numbers (e.g., 600 for year 2009 and 503 for year 2010) might have been created differently, leading to different reliabilities to compare both numbers.

Benefits

Making transparent the transformation a dataset has been exposed to. Increases trust in the data.

Should Data Cube support explicit declaration of such relationships either between separated qb:DataSets or between measures with a single qb:DataSet (e.g. ex:populationCount and ex:populationPercent)?

If so should that be scoped to simple, common relationships like DENOM or allow expression of arbitrary mathematical relations?

Data that is published on the Web is typically visualized by transforming it manually into CSV or Excel and then creating a visualization on top of these formats using Excel, Tableau, RapidMiner, Rattle, Weka etc.

This use case shall demonstrate how statistical data published on the web can be visualized inside a webpage with little effort, without using commercial or highly complex tools.

An example scenario is environmental research done within the SMART research project (http://www.iwrm-smart.org/). Here, statistics about environmental aspects (e.g., measurements about the climate in the Lower Jordan Valley) shall be visualized for scientists and decision makers. Statistics should also be possible to be integrated and displayed together. The data is available as XML files on the web. On a separate website, specific parts of the data shall be queried and visualized in simple charts, e.g., line diagrams.

Figure shows the wanted display of an environmental measure over time for three regions in the lower Jordan valley; displayed inside a web page:

Figure shows the same measures in a pivot table. Here, the aggregate COUNT of measures per cell is given.
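A pivot table with the aggregate COUNT per cell, as described above, can be computed with a few lines of code. The regions, years, and values are invented for this example and are not taken from the SMART project data.

```python
from collections import Counter

# Invented measurements of one environmental measure.
measurements = [
    {"region": "region-1", "year": 2009, "value": 0.5},
    {"region": "region-1", "year": 2009, "value": 0.7},
    {"region": "region-1", "year": 2010, "value": 0.6},
    {"region": "region-2", "year": 2010, "value": 0.9},
]

# COUNT of measurements per (region, year) cell; missing cells count as 0.
counts = Counter((m["region"], m["year"]) for m in measurements)
regions = sorted({m["region"] for m in measurements})
years = sorted({m["year"] for m in measurements})
pivot = [[counts[(region, year)] for year in years] for region in regions]

# Print regions as rows and years as columns.
print("region".ljust(10) + "".join(str(year).rjust(6) for year in years))
for region, row in zip(regions, pivot):
    print(region.ljust(10) + "".join(str(cell).rjust(6) for cell in row))
```

This only works if the published data identifies its dimensions (here region and year) explicitly, which relates to the structuring challenge of this use case.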

Challenges of this use case are:

The difficulties lie in structuring the data appropriately so that the specific information can be queried.

Also, data shall be published with potential integration in mind. Therefore, e.g., units of measurement need to be represented.

Integration becomes much more difficult if publishers use different measures or dimensions.

Problems and Limitations:

Unanticipated Uses (optional): -

Existing Work(optional): -

Consumer Use Case: Visualising published statistical data in Google Public Data Explorer

Google Public Data Explorer (GPDE - http://code.google.com/apis/publicdata/) provides an easy possibility to visualize and explore statistical data. Data needs to be in the Dataset Publishing Language (DSPL - https://developers.google.com/public-data/overview) to be uploaded to the data explorer. A DSPL dataset is a bundle that contains an XML file, the schema, and a set of CSV files, the actual data. Google provides a tutorial to create a DSPL dataset from your data, e.g., in CSV. This requires a good understanding of XML, as well as a good understanding of the data that shall be visualized and explored.

In this use case, the goal is to take statistical data published on the web and to transform it into DSPL for visualization and exploration with as little effort as possible.

An example is Eurostat data about the unemployment rate, downloaded from the web.

Benefits

If a standard Linked Data vocabulary is used, visualising and exploring new data that already is represented using this vocabulary can easily be done using GPDE.

Datasets can first be integrated using Linked Data technology and then analysed using GPDE.

Challenges of this use case are:

There are different possible approaches, each having advantages and disadvantages: 1) A consumer C downloads this data into a triple store; SPARQL queries on this data can be used to transform the data into DSPL, which is then uploaded and visualized using GPDE. 2) Alternatively, one or more XSLT transformations on the RDF/XML transform the data into DSPL.

The technical challenges for the consumer here lie in knowing where to download what data and how to get it transformed into DSPL without knowing the data.
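Whichever approach is chosen, the last step is to serialise query results as CSV files for the DSPL bundle. The following sketch covers only that step, under stated assumptions: the observations are assumed to be already available as dictionaries (e.g., from query results), the column names and values are invented, and the XML file that a DSPL dataset also requires is not produced here.

```python
import csv
import io


def to_csv(observations, columns):
    """Serialise observations as CSV text with a header row.

    The last column is treated as the measure; rows are sorted by the
    remaining (dimension) columns to give a stable output.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    dimension_columns = columns[:-1]
    for obs in sorted(observations,
                      key=lambda o: [o[c] for c in dimension_columns]):
        writer.writerow({c: obs[c] for c in columns})
    return buffer.getvalue()


# Invented observations, as they might come back from a query.
observations = [
    {"geo": "AA", "year": 2011, "unemployment_rate": 2.0},
    {"geo": "AA", "year": 2010, "unemployment_rate": 1.0},
]
print(to_csv(observations, ["geo", "year", "unemployment_rate"]))
```

If the published data follows a standard vocabulary, the list of columns could be derived from the dataset's structure description instead of being known in advance by the consumer.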

Unanticipated Uses (optional): DSPL is representative for using statistical data published on the web in available tools for analysis. Similar tools that may be automatically covered are: Weka (arff data format), Tableau, SPSS, STATA, PC-Axis etc.

Existing Work (optional): -

Consumer Use Case: Analysing published statistical data with common OLAP systems

Online Analytical Processing (OLAP) is an analysis method on multidimensional data. It is an explorative analysis method that allows users to interactively view the data from different angles (rotate, select) or at different granularities (drill-down, roll-up), and to filter it for specific information (slice, dice).

OLAP systems are commonly used in industry to analyse statistical data on a regular basis. They first use ETL pipelines to Extract, Transform, and Load relevant data for efficient storage and querying in a data warehouse, and then provide interfaces to issue OLAP queries on the data.

The goal in this use case is to allow analysis of published statistical data with common OLAP systems (Kämpgen, B., & Harth, A. (2011). Transforming Statistical Linked Data for Use in OLAP Systems. I-Semantics 2011. Retrieved from http://www.aifb.kit.edu/web/Inproceedings3211)

For that a multidimensional model of the data needs to be generated. A multidimensional model consists of facts summarised in data cubes. Facts exhibit measures depending on members of dimensions. Members of dimensions can be further structured along hierarchies of levels.

An example scenario of this use case is the Financial Information Observation System (FIOS) (Andreas Harth, Sean O'Riain, Benedikt Kämpgen. Submission XBRL Challenge 2011. http://xbrl.us/research/appdev/Pages/275.aspx), where XBRL data provided by the SEC on the web is to be re-published as Linked Data and made analysable for stakeholders in a web-based OLAP client Saiku.

The following figure shows an example of using FIOS. Here, for three different companies, cost of goods sold as disclosed in XBRL documents are analysed. As cell values either the number of disclosures or - if only one available - the actual number in USD is given:

OLAP frontends are intuitive, interactive, explorative, and fast, and their interfaces are well known to many people in industry.

OLAP functionality is provided by many tools that may be reused.

Challenges

ETL pipeline needs to automatically populate a data warehouse. Common OLAP systems use relational databases with a star schema.

A problem lies in the strict separation between queries for the structure of data (metadata queries), and queries for actual aggregated values (OLAP operations).

Another problem lies in defining Data Cubes without greater insight in the data beforehand.

Depending on the expressivity of the OLAP queries (e.g., aggregation functions, hierarchies, ordering), performance plays an important role.
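The star schema mentioned in the challenges above can be illustrated with an in-memory SQLite database. This is a minimal sketch and not the FIOS implementation: the table and column names, the companies, and all values are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table referencing two dimension tables.
conn.executescript("""
CREATE TABLE dim_company (company_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact_disclosure (
    company_id INTEGER REFERENCES dim_company(company_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    cost_of_goods_sold REAL
);
""")

# "Load" step of a (hypothetical) ETL pipeline.
conn.executemany("INSERT INTO dim_company VALUES (?, ?)",
                 [(1, "company-a"), (2, "company-b")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, 2010, 1), (2, 2010, 2)])
conn.executemany("INSERT INTO fact_disclosure VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (1, 2, 110.0), (2, 1, 50.0)])

# Roll-up from quarter to year per company: number of disclosures and total.
rows = conn.execute("""
SELECT c.name, t.year, COUNT(*) AS disclosures, SUM(f.cost_of_goods_sold) AS total
FROM fact_disclosure f
JOIN dim_company c ON c.company_id = f.company_id
JOIN dim_time t ON t.time_id = f.time_id
GROUP BY c.name, t.year
ORDER BY c.name, t.year
""").fetchall()

for row in rows:
    print(row)
```

To populate such tables automatically, the ETL pipeline needs to know dimensions, levels, and measures up front, which is why queries for the structure of the data have to be answerable separately from queries for the values.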

Registry Use Case: Registering published statistical data in data catalogs

After statistics have been published as Linked Data, the question remains how to communicate the publication and let users find the statistics. There are catalogs to register datasets, e.g., CKAN, datacite.org [3], da|ra [4], and Pangea [5]. Those catalogs require specific configurations to register statistical data.

The goal of this use case is to demonstrate how to expose and distribute statistics after modeling using the standard vocabulary. For instance, to allow automatic registration of statistical data in such catalogs, for finding and evaluating datasets. To solve this issue, it should be possible to transform the published statistical data into formats that can be used by data catalogs.

Use Case Scenario:

Note:
XXX: Find specific use case or ask how other publishers of QB data have dealt with this issue Maybe relation to DCAT?

Problems and Limitations: -

Unanticipated Uses (optional): If data catalogs contain statistics, they do not expose those using Linked Data but for instance using CSV or HTML (Pangea [6]). It could also be a use case to publish such data using QB.

Name: The Wiki page URL should be of the form "Use_Case_Name", where Name is a short name by which we can refer to the use case in discussions. The Wiki page URL can act as a URI identifier for the use case.

Person: The person responsible for maintaining the correctness/completeness of this use case. Most obviously, this would be the creator.

Dimension: The primary dimension which this use case illustrates, and secondary dimensions which the use case also illustrates.

Background and Current Practice: Where this use case takes place in a specific domain, and so requires some prior information to understand, this section is used to describe that domain. As far as possible, please put explanation of the domain in here, to keep the scenario as short as possible. If this scenario is best illustrated by showing how applying technology could replace current existing practice, then this section can be used to describe the current practice. This section can also be used to document statistical data within the use case.

Goal: Two short statements stating (1) what is achieved in the scenario without reference to RDF Data Cube vocabulary, and (2) how we use the RDF Data Cube vocabulary to achieve this goal.

Use Case Scenario: The use case scenario itself, described as a story in which actors interact with systems, each other etc. It should show who is using the standard vocabulary for publishing statistics as Linked Data and for what purpose. Please mark the key steps which show requirements in italics.

Problems and Limitations: The key to why a use case is important often lies in what problem would occur if it was not achieved, or what problem makes it hard to achieve. This section lists reasons why this scenario is or may be difficult to achieve, including pre-requisites which may not be met, technological obstacles etc. Important: Please explicitly list here the technical challenges (with regard to statistical data) made apparent by this use case. This will aid in creating a roadmap to overcome those challenges.

Unanticipated Uses (optional): The scenario above describes a particular case of using technology. However, by allowing this scenario to take place, the technology allows for other use cases. This section captures unanticipated uses of the same system apparent in the use case scenario.

Existing Work (optional): This section is used to refer to existing technologies or approaches which achieve the use case.

Requirements

The use cases presented in the previous section give rise to the following requirements for a standard representation of statistics.

Requirements are cross-linked with the use cases that motivate them.

Requirements are similarly categorized as deriving from publishing or consuming use cases.

Publishing use cases

Machine-readable and application-independent representation of statistics

It should be possible to add abstraction, multiple levels of description, summaries of statistics.

(UC 1-4)

Representing statistics from various resources

It should be possible to translate statistics from various resource data into QB.

QB should be very general and should be usable for other data sets such as survey data, spreadsheets and OLAP data cubes.

The kinds of statistics described are simple CSV tables (UC 1), Excel spreadsheets (UC 2) and more complex SDMX data (UC 2) about government statistics or other public-domain relevant data.

Communicating, exposing statistics on the web

It should become clear how to make statistical data available on the web, including how to expose it and how to distribute it.

(UC 5)

Coverage of typical statistics metadata

It should be possible to add metainformation to statistics as found in typical statistics or statistics catalogs.

(UC 1-5)

Expressing hierarchies

It should be possible to express hierarchies on Dimensions of statistics.

Some of this requirement is met by the work on ISO Extension to SKOS [7].

(UC 3, 9)
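To make the notion of a hierarchy on a dimension more concrete, the sketch below stores a parent link for each member and derives the ancestors and the level of a member from those links. The region codes and the function names are hypothetical, and a plain dictionary stands in for whatever representation the vocabulary would eventually use.

```python
# Hypothetical region codes; each member points to its parent (broader) member.
BROADER = {
    "cardiff": "wales",
    "swansea": "wales",
    "wales": "uk",
    "england": "uk",
}


def ancestors(member, broader=BROADER):
    """Return the chain of parents from the given member up to the root."""
    chain = []
    while member in broader:
        member = broader[member]
        chain.append(member)
    return chain


def level(member, broader=BROADER):
    """Depth of a member in the hierarchy; the root has level 0."""
    return len(ancestors(member, broader))


print(ancestors("cardiff"))  # ['wales', 'uk']
print(level("wales"))        # 1
```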

Expressing aggregation relationships in Data Cube

This requires some way to represent aggregation functions.

This requires information about

levels

hierarchies

relationships between members of a dimension

aggregation functions of a measure

Some of this requirement is met by the work on ISO Extension to SKOS [8].

(UC 0, 1, 2, 3, 9)

Possibly, it would be good to be able to define several aggregation functions for the same measure.
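The sketch below illustrates how these pieces of information could fit together: a relationship between the members of a dimension, and several aggregation functions that may be applied to the same measure. All member names, function names and values are made up for illustration, and the code makes no claim about how the vocabulary itself would represent them.

```python
from collections import defaultdict

# Hypothetical dimension members and their parents in the hierarchy.
PARENT = {"cardiff": "wales", "swansea": "wales"}

# Hypothetical aggregation functions that could be declared for one measure.
AGGREGATORS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
}


def roll_up(observations, parent, how):
    """Aggregate (member, value) pairs to the parent level using `how`."""
    groups = defaultdict(list)
    for member, value in observations:
        groups[parent[member]].append(value)
    aggregate = AGGREGATORS[how]
    return {member: aggregate(values) for member, values in groups.items()}


observations = [("cardiff", 300), ("swansea", 200)]  # made-up values
print(roll_up(observations, PARENT, "sum"))  # {'wales': 500}
print(roll_up(observations, PARENT, "avg"))  # {'wales': 250.0}
```

Which function is meaningful depends on the measure: summing is appropriate for counts, whereas a percentage would generally not be summed across members.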

Scale - how to publish large amounts of statistical data

Publishers that are restricted by the size of the statistics they publish should have ways to reduce the size or remove redundant information.

Scalability issues can arise both with people's effort and with the performance of applications.

(UC 1-4)
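One way to remove redundant information is to state values that are shared by all observations only once, at a higher level. The following sketch, with hypothetical keys and made-up values, splits a list of observations into the shared part and a compact remainder, and shows that the full observations can be restored.

```python
def hoist_shared(observations):
    """Split observations into (shared, compact).

    `shared` holds the key/value pairs that are identical in every
    observation; `compact` holds each observation without those pairs.
    """
    if not observations:
        return {}, []
    first = observations[0]
    shared = {
        key: value
        for key, value in first.items()
        if all(key in obs and obs[key] == value for obs in observations)
    }
    compact = [
        {k: v for k, v in obs.items() if k not in shared} for obs in observations
    ]
    return shared, compact


def expand(shared, compact):
    """Inverse of hoist_shared: restore the fully spelled-out observations."""
    return [{**shared, **obs} for obs in compact]


full = [
    {"region": "cardiff", "year": 2010, "unit": "persons", "value": 300},
    {"region": "swansea", "year": 2010, "unit": "persons", "value": 200},
]
shared, compact = hoist_shared(full)
print(shared)   # {'year': 2010, 'unit': 'persons'}
print(compact)  # only region and value remain per observation
```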

Compliance-levels or criteria for well-formedness

The formal RDF Data Cube vocabulary expresses few formal semantic constraints. Furthermore, in RDF, the omission of otherwise-expected properties on resources does not lead to any formal inconsistencies.

However, to build reliable software to process Data Cubes, data consumers need to know what assumptions they can make about a dataset purporting to be a Data Cube.

Specific areas which may need explicit clarification in the well-formedness criteria include (but may not be limited to):

use of abbreviated data layout based on attachment levels

use of qb:Slice (completeness, requirements for an explicit qb:SliceKey?)

avoiding mixing two approaches to handling multiple-measures

optional triples (e.g. type triples)

(UC 1-11)
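The kind of assumption a consumer may want to rely on can be sketched as a simple check over observations. The two rules used here (every observation gives a value for each declared dimension, and no two observations share the same dimension values) are assumptions chosen for illustration; they are not well-formedness criteria defined by this document.

```python
def check_observations(dimensions, observations):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    seen = set()
    for index, obs in enumerate(observations):
        # Rule 1 (assumed): each declared dimension must have a value.
        missing = [d for d in dimensions if d not in obs]
        if missing:
            problems.append(f"observation {index}: missing {', '.join(missing)}")
            continue
        # Rule 2 (assumed): dimension values identify an observation uniquely.
        key = tuple(obs[d] for d in dimensions)
        if key in seen:
            problems.append(f"observation {index}: duplicate dimension values")
        seen.add(key)
    return problems


dimensions = ["region", "year"]  # hypothetical dimension names
observations = [
    {"region": "cardiff", "year": 2010, "value": 300},
    {"region": "cardiff", "value": 310},                # no year
    {"region": "cardiff", "year": 2010, "value": 305},  # repeats the first key
]
for problem in check_observations(dimensions, observations):
    print(problem)
```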

Declaring relations between Cubes

In some situations statistical data sets are used to derive further datasets. Should Data Cube be able to explicitly convey these relationships?

A simple specific use case is that the Welsh Assembly government publishes a variety of population datasets broken down in different ways. For many uses, population broken down by some category (e.g. ethnicity) is expressed as a percentage. Separate datasets give the actual counts per category and aggregate counts. In such cases it is common to talk about the denominator (often DENOM), which is the aggregate count against which the percentages can be interpreted.

Should Data Cube support explicit declaration of such relationships, either between separate qb:DataSets or between measures within a single qb:DataSet (e.g. ex:populationCount and ex:populationPercent)?

If so, should that be scoped to simple, common relationships like DENOM or allow expression of arbitrary mathematical relations?
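The DENOM relationship itself is simple arithmetic, as the following sketch shows: a percentage is the category count divided by the aggregate count, and the count can be recovered from the percentage only if the denominator is known. The category names and figures are made up for illustration.

```python
def to_percentages(counts, denominator):
    """Express each category count as a percentage of the aggregate count."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return {
        category: 100.0 * count / denominator
        for category, count in counts.items()
    }


def to_counts(percentages, denominator):
    """Recover category counts from percentages and the denominator."""
    return {
        category: percent * denominator / 100.0
        for category, percent in percentages.items()
    }


counts = {"group_a": 150, "group_b": 50}  # made-up category counts
denom = 200                               # aggregate count (DENOM)

percentages = to_percentages(counts, denom)
print(percentages)                    # {'group_a': 75.0, 'group_b': 25.0}
print(to_counts(percentages, denom))  # {'group_a': 150.0, 'group_b': 50.0}
```

Without an explicit link from the percentage measure to its denominator, a consumer holding only the percentages cannot perform the second step.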