Use Case Notes

A use case describes a scenario that illustrates an experience of publishing and using Data on the Web. Those descriptions may be related to one or more tasks or activities from the Data on the Web Life Cycle.

The information gathered from the use cases should be helpful for identifying the best practices that will guide the publishing and usage of Data on the Web. In general, a best practice will be described at least by a statement and a "how to do it" section, i.e., a discussion of techniques and suggestions on how to implement it (similar to http://www.w3.org/TR/mobile-bp/).

To help identify possible best practices, it is desirable that a use case presents information about the positive aspects/benefits of the experience (similar to http://www.w3.org/TR/vocab-data-cube-use-cases/), but also its negative aspects. These aspects may also be seen as lessons learned and will be helpful for identifying statements of best practices as well as suggestions on how to implement (or not) a given best practice.

Other important information to be included in a use case description concerns the main challenges faced by publishers or developers. Information about challenges will be helpful to identify areas where best practices are necessary. Based on the challenges, a set of requirements may be defined, where a requirement motivates the creation of one or more best practices.

The general description of a use case is given by:

Title:

Contributor:

City, country:

URL:

Overview:

Detailed description: (Each element described in more detail at Use-Case Elements)

Overview: Recife is a beautiful city situated in the Northeast of Brazil and is famous for being one of Brazil's biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data that helps in the understanding and usage of the data. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, they are working on generating data dynamically from the contents of relational databases, so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original sources into the open data format; configuring and installing the open data catalogue tool; and publishing the data and releasing the portal.

Positive aspects:

All datasets are published together with a metadata description. Metadata is described in a very clear way, which facilitates the understanding of the published data.

Depending on the frequency of data updates, some data may be automatically updated every day.

Negative aspects:

Metadata is not described in a machine processable format.

Data is provided in just one format (csv).

Challenges:

Use common vocabularies to facilitate data integration

How to keep different versions of the same dataset?

How to define the "granularity" of the data being published?

How to provide information about the quality of the data?

How to measure the quality of the data?

The following lesson and requirements may be extracted from this use case:

Lessons:

When publishing a dataset, provide metadata in a machine processable format.
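As an illustration of this lesson, the sketch below builds a minimal machine-processable metadata record using DCAT and Dublin Core terms, serialized as JSON-LD. All concrete values (title, dates, URLs) are invented for illustration; a real record would be richer and would follow the publisher's own vocabulary choices.

```python
import json

# A minimal DCAT-style dataset description serialized as JSON-LD.
# All concrete values below are invented for illustration.
dataset_metadata = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Health facilities",
    "dct:description": "Locations and services of public health facilities.",
    "dct:issued": "2014-01-15",
    "dct:accrualPeriodicity": "daily",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
        "dcat:downloadURL": "http://example.org/datasets/health-facilities.csv",
    },
}

# Serializing as JSON-LD makes the same description readable by both
# humans and machines, unlike a free-text metadata page.
serialized = json.dumps(dataset_metadata, indent=2)
print(serialized)
```

A harvester or catalog aggregator could then parse this record automatically instead of scraping a human-oriented description page.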

Use Cases

To add a new use-case, copy the use-case template and complete all of the sections. Use-case elements are optional, depending on information available. If you want to add a challenge or requirement to somebody else's use-case, please add your name in brackets after your update.

Documented Support and Release of Data

Contributor:
Deirdre Lee (based on email by Leigh Dodds)

Overview:
While many cases of Data on the Web may contain metadata about the creation date and last update, the regularity of the release schedule is not always clear. Similarly, how and by whom the dataset is supported should also be made clear in the metadata. These attributes are necessary to improve the reliability of the data so that third-party users can trust the timely delivery of the data, with a follow-up point should there be any issues.

Feedback Loop for Corrections

Overview:
One of the often-quoted advantages of publishing Open Data is improving the quality of the data: many eyes looking at a dataset help spot errors and gaps quicker than a public body could itself. For example, when bus-stop data is published, it may turn out that the official location of a bus-stop is not always accurate, but when this data is mashed up with OSM, the mistake is identified. However, how this 'improved' data is fed back to the public body is not clear. Should there be an automated mechanism for this? How can the improvement be described in a machine-readable format? What is best practice for reincorporating such improvements?

Datasets required for Natural Disaster Management

Contributor:
Deirdre Lee (based on OKF Greece workshop)

Overview:
Many of the datasets that are required for Natural Disaster Management, for example critical infrastructure, utility services and road networks, are not available online because they are also deemed to be datasets that could be exploited in attacks on homeland security.
(will expand on this use-case once slides are available)

Tracking of data usage

Contributor:
Deirdre Lee

Overview:
There are many potential/perceived benefits of Open Data; however, in order to publish data, some initial investment/resources are required from public bodies. When justifying these resources and evaluating the impact of the investment, many Open Data providers express the desire to be able to track how the datasets are being used. However, Open Data by design often requires no registration, explanation or feedback to enable access to and usage of the data. How can data usage be tracked in order to inform the Open Data ecosystem and improve data provision?

Open City Data Pipeline

Overview:
Axel presented the Open City Data Pipeline, which aims to provide an extensible platform to support citizens and city administrators by providing city key performance indicators (KPIs), leveraging Open Data sources.

The assumption behind Open Data is that “added value comes from comparable open datasets being combined”. Axel highlighted that Open Data needs stronger standards to be useful, in particular for industrial uptake. Industrial usage has different requirements from those of app hobbyists or civil society; it is important to think about how Open Data can be used by industry at the time of publication.

They have developed a data pipeline to (semi-)automatically collect and integrate various Open Data sources in different formats.

Another issue is that when data under different licenses are combined, the license terms under which the combined data is available also have to be merged. This interoperability of licenses is a challenge.
[may be out of scope of W3C DWBP, as it is more concerned with legal issues]

Machine-readability of SLAs

Contributor:
Deirdre Lee (based on a number of talks at EDF14)

Overview:
A main focus of publishing data on the web is to facilitate industry reuse for commercial purposes. In order for a commercial body to reuse data on the web, the terms of reuse must be clear. The legal terms of reuse are included in the license, but there are other factors that are important for commercial reuse, e.g. reliability, support, incident recovery, etc. These could be included in an SLA. Is there a standardised, machine-readable approach to SLAs?

Publication of Data via APIs

Contributor:
Deirdre Lee

Overview:
APIs are commonly used to publish data in formats designed for machine consumption, as opposed to the corresponding HTML pages, whose main aim is to deliver content suitable for human consumption. There remain questions around how APIs can best be designed to publish data, and even whether APIs are the most suitable way of publishing data at all (http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/). Could use of HTTP and URIs be sufficient? If the goal is to facilitate machine-readable data, what is best practice?

APIs can be too clunky/rich in their functionality, which may increase the number of calls necessary and the size of data transferred, reducing performance

Collaboration between API providers and users is necessary to agree on 'useful' calls

Could API key agreements restrict the openness of Open Data?

Documentation accompanying APIs can be lacking

What is best practice for publishing streams of real-time data (with/without APIs)?

Each resource should have one URI uniquely identifying it. There can then be different representations of the resource (XML/HTML/JSON/RDF)
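The last point above can be sketched as a small content-negotiation routine: one URI identifies the resource, and the representation served depends on the client's Accept header. The media-type table and the simplified q-value handling below are illustrative assumptions, not a full RFC 7231 implementation.

```python
# Sketch: one URI per resource; the representation is chosen by HTTP
# content negotiation. The supported media types are illustrative.
SUPPORTED = {
    "text/html": "html",
    "application/xml": "xml",
    "application/json": "json",
    "text/turtle": "rdf",
}

def negotiate(accept_header: str, default: str = "html") -> str:
    """Pick a representation from an Accept header.

    Simplified matcher: q-values are parsed, but wildcards and the
    full RFC 7231 precedence rules are omitted.
    """
    candidates = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media_type in SUPPORTED and q > 0:
            candidates.append((q, SUPPORTED[media_type]))
    if not candidates:
        return default
    return max(candidates, key=lambda c: c[0])[1]

# The same URI serves different representations depending on the client:
print(negotiate("application/json"))                   # json
print(negotiate("text/turtle;q=0.9,text/html;q=0.5"))  # rdf
```

With this approach no separate "API URL" is needed per format; a single resource URI stays stable while clients ask for the serialization they can process.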

Potential Requirements:

NYC Open Data Program

Contributor: Steven Adler

Overview: Carole Post was appointed by Mayor Bloomberg as Commissioner of the NYC Department of Information Technology and Telecommunications (DOITT) in 2010 and was the first woman in the city's history to be CIO. She was the architect of NYC's Open Data program, sponsored the Open Data Portal and helped pass the city's Open Data legislation. On March 11, she gave a presentation to the W3C on her experiences changing the city culture and building the Open Data Portal. A recording of her presentation is provided here: Carole Post Webinar - NYC. A copy of her presentation in PDF can be found here: Carole Post Presentation on NYC Open Data

Elements:

Recife Open Data Portal

Contributor: Bernadette Lóscio

Overview: Recife is a beautiful city situated in the Northeast of Brazil and is famous for being one of Brazil's biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data that helps in the understanding and usage of the data. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, they are working on generating data dynamically from the contents of relational databases, so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original sources into the open data format; configuring and installing the open data catalogue tool; and publishing the data and releasing the portal.

Overview: This is a data visualization made in 2012 by Vitor Batista, Léo Tartari and Thiago Bueno for a W3C Brazil Office challenge about data from Rio Grande do Sul (a Brazilian state). The data was released in a .zip package; the original format was .csv. The code and the documentation of the project are in its GitHub repository.

Obligation/motivation: Data that must be provided to the public under a legal obligation, the so-called LAI, the Brazilian Information Access Act, enacted in 2012

Usage:

Quality: not guaranteed data

Size:

Type/format: Tabular data

Rate of change: There are no new releases of data

Data lifespan:

Potential audience:

Positive aspects: the data was in CSV format, but it is now (2014) outdated, and there are no plans for new releases. There is no metadata in it.

Negative aspects: the decision to transform CSV into JSON was based on the necessity of having hierarchical data; the positive point that a CSV structure can be mapped to XML or JSON was considered. CSV only covers the tabular format, while JSON can cover more complex structures.

Challenges: this was not guaranteed data and there was no metadata. There is a sample of the .csv files on location

Dados.gov.br

Contributor: Yaso

Overview: Dados.gov.br is the open data portal of Brazil's Federal Government. The catalog was delivered by open data community developers, led by developers from the team of the Ministry of Planning. CKAN was chosen because it is Free Software and presents a more independent solution for publishing the Federal Government's data catalog on the internet.

Elements: (Each element described in more detail at Use-Case Elements )

Obligation/motivation: Data that must be provided to the public under a legal obligation, the so-called LAI, the Brazilian Information Access Act, enacted in 2012

Usage:

Data that is the basis for services to the public;

Data that has commercial re-use potential.

Quality: Authoritative, clean data, vetted and guaranteed;

Lineage/Derivation: Data came from various publishers. As a catalog, the site has faced several challenges, one of them was to integrate the various technologies and formulas used by publishers to provide datasets in the portal.

Size:

Type/format: Tabular data, text data

Rate of change: There is both fixed data and data with a high rate of change

Data lifespan:

Potential audience:

Technical Challenges:

data integration (lack of vocabs)

collaborative construction of the portal: managing online sprints and balancing public expectations.

Licensing the data in the portal: most of the data in the portal does not have a specific data licence, and different types of licences are applied to the datasets.

ISO GEO Story

Overview:
ISO GEO is a small company managing catalog records of geographic information in XML, conformant to ISO 19139 (a French adaptation of ISO 19115).
An excerpt is available here: http://cl.ly/3A1p0g2U0A2z. They export thousands of such catalogs today, but they need to manage them better.
In their platform, they store the information in a more conventional manner and use this standard to export datasets compliant with INSPIRE interoperability requirements, or via the CSW protocol.
Sometimes they have to enrich their own metadata records with others produced by tools like GeoSource and accessed through an SDI (Spatial Data Infrastructure).

A sample containing 402 metadata records in ISO 19139 is available for public consultation at http://geobretagne.fr/geonetwork/srv/fr/main.home.
They want to be able to integrate all the different implementations of ISO 19139 in different tools into a single framework, to better understand the thousands of metadata records they use in their day-to-day business.
The types of information recorded in each file (see the example at http://www.eurecom.fr/~atemezin/datalift/isogeo/5cb5cbeb-fiche1.xml) are the following: contact info (metadata) [data issued]; spatial representation; reference system info [code space]; spatial resolution; geographic extent of the data; file distribution; data quality; process step, etc.

Dutch basic registers

Overview:
The Netherlands has a set of registers it is looking at opening and exposing as Linked (Open) Data in the context of the project "PiLOD". The registers contain information about buildings, people, businesses and other entities that public bodies may want to refer to in their daily activities. One of them is, for instance, the service of public taxes ("BelastingDienst"), which regularly pulls data from several registers, stores this data in a big Oracle instance and curates it. This costly and time-consuming process could be optimised by providing on-demand access to up-to-date descriptions provided by the register owners.

Capacity: at this point, it cannot be expected that every register owner will take care of publishing their own data. Some of them export what they have to the national open data portal. This data has been used to do some testing with third-party publication by PiLODers, but this is rather sensitive as a long-term strategy (governmental data has to be traceable/trustable as such). The middle-ground solution currently deployed is the PiLOD platform, a (semi-)official platform for publishing register data.

Privacy: some of the register data is personal or may become so when linked to other data (e.g. disambiguating personal data based on addresses). Some registers will need to provide secured access to some of their data to some people only (Linked Data, not Open Data). Others can go along with open data as long as they get a precise log of who is using what.

Revenue: institutions working under mixed gov/non-gov funding generate part of their revenue by selling some of the data they curate. Switching to an open data model will generate a direct loss in revenue that has to be compensated by other means. This does not have to mean closing the data; e.g. a model of open dereferencing plus paid dumps can be considered, as well as other indirect revenue streams.

Potential Requirements:

Wind Characterization Scientific Study

Overview:
This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility (DMF) and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data.

All datasets, instrumentation, and activities are cataloged providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable.

Scientists will be able to use the catalog for faceted browsing, ad-hoc searches, and query by example. For accessing individual datasets, a REST GET interface to the archive will be provided.

Technical Challenges:
For accessing numerous datasets, scientists will access the archive directly using protocols such as sftp, rsync and scp, and access techniques such as HPN-SSH (http://www.psc.edu/index.php/hpn-ssh)

Potential Requirements:

BuildingEye: SME use of public data

Contributor:
Deirdre Lee

Overview:
Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland, local authorities handle planning applications and usually provide some customised views of the data (PDFs, maps, etc.) on their own websites. However, there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However, as no local authority had an Open Data portal, BuildingEye had to directly ask each local authority for its data. It was granted access by some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonised this data for its own system. However, if another SME wanted to use this data, they would have to go through the same process and again ask each local authority for the data.

Digital archiving of Linked Data

Overview:
Taking the concrete example of the digital archive DANS, digital archives have so far been concerned with the preservation of what could be defined as "frozen" datasets. A frozen dataset is a finished, self-contained set of data that does not evolve after it has been constituted. The goal of the preserving institution is to ensure this dataset remains available and readable for as many years as possible. This can, for example, concern an audio record, a digitized image, e-books or database dumps. Consumers of the data are expected to look up specific content based on its associated persistent identifier, download it from the archive and use it. Now comes the question of the preservation of Linked Open Data. In contrast to "frozen" datasets, Linked Data can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute, and one of its design principles indicates that other data producers and consumers should be able to point to the data. When LD publishers stop offering their data (e.g. at the end of a project), taking the LD offline as a dump and putting it in an archive effectively turns it into a frozen dataset, just like SQL dumps and other kinds of databases. The question then arises as to what extent this is an issue.

Technical Challenges:
The archive has to consider whether serving dereferencing for resources found in preserved datasets is required or not, and whether to provide a SPARQL endpoint or not. If data consumers and publishers are fine with having RDF data dumps downloaded from the archive prior to their usage, just like any other digital item so far, the technical challenges could be limited to handling the size of the dumps and taking care of serialisation evolution over time (e.g. from N-Triples to TriG, or from RDF/XML to HDT) as the preference for these formats evolves. Turning a live dataset into a frozen dump also raises the question of scope. Considering that LD items are only part of a much larger graph that gives them meaning through context, the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph that the target dataset is part of.

Potential Requirements:
Decide on the importance of the dereferenceability of resources and the potential implications for domain names and naming of resources. Decide on the scope of the step that will turn a connected sub-graph into an isolated data dump.

The metadata provided on the data portal is very sparse with many fields left empty.

The dataset is itself the result of an analysis (there are only 8 lines in the table), the raw data on which it is based is not cited, let alone made available, and the methods used are not described.

Challenges:

Data Citation - how could Ron Galperin have referred to the source data in the Infographic? (the URI is way too long). QR code? Short PURL?

How could the publisher of the data link to the Infographic as a visualization of it?

In this case, the creator of the underlying data is the same as the creator of the Infographic, but if they were different, how could the data creator discover the Infographic, still less the media report about it?

The methodology used is not explained - making it hard to assess trustworthiness. How can provenance be described?

The metadata is incomplete and does not use a recognized standard vocabulary, making automated discovery and use by anyone other than the data creator difficult.

The Land Portal

Contributor: Carlos Iglesias

Overview: The IFAD Land Portal platform has been completely rebuilt as an Open Data collaborative platform for the land governance community. Among the new features, the Land Portal will provide access to comprehensive and in-depth data: 100+ indicators from 25+ different sources on land governance issues for 200+ countries over the world, as well as a repository of land-related content and documentation. Thanks to the new platform, people can (1) curate and incorporate new data and metadata by means of different data importers, making use of the underlying common data model; (2) search, explore and compare the data across countries and indicators; and (3) consume and reuse the data by different means (i.e. raw data downloads from the data catalog; Linked Data and a SPARQL endpoint from the RDF triplestore; a RESTful API; and a built-in graphic visualization framework)

Radar Parlamentar

Overview: Radar Parlamentar is a web application that illustrates the similarities between political parties based on analysis of the voting data from the Brazilian congress. The similarities are presented in two-dimensional graphics, in which circles represent parties or parliamentarians and the distance between these circles reflects how similarly they vote. There is also a section dedicated to gender issues: how many women are in each party over the years, which themes are most addressed by each gender and party, etc.

Elements:

Domains: Political information, voting records

Obligation/motivation: The Brazilian government began to provide their data in an open format through the Dados.gov.br portal.

Usage: Re-use and exploration of data available in portal Dados.gov.br in another kinds of visualisation.

Quality: Every sort of data, from high quality to unverified (depending on the data provided by the parliamentary houses).

Size: Varies (depending on the data provided by the parliamentary houses).

Type/format: Tag clouds, 2D graphic, matrix display, treemap.

Rate of change: No defined periodicity.

Data lifespan: Not defined.

Potential audience: Brazilian citizens

Technical Challenges:

There are significant differences between data from different parliamentary houses, i.e., they don't use a standard ontology

There is a lack of data about votes in the National Assembly

Data are being released bit by bit

The data release frequency has not been established

There is little data about votes available for certain time periods

Data quality from the City Council is not good; some data are visibly wrong

Potential Requirements:

A feed to notify developers when new data are available

Good filters/searches to avoid many unnecessary requests

Definition of the data update frequency

A standard ontology for all parliamentary houses

Documentation: there is a page in the web application explaining the methodology used.

Uruguay: open data catalogue

Contributor: AGESIC

Overview: The Uruguay open data site holds 85 datasets containing 114 resources, since the first dataset was published in Dec. 2012. The open data initiative prioritizes the “use of data” rather than the “quantity of data”; that is why the catalogue holds 25 applications using dataset resources in some way. It is important for the project to keep the 1:3 ratio between applications and datasets.
Most of the resources are CSV and shapefiles; basically we have a 3-star catalogue, and the reason why we can't go to the next level is the lack of resources (time, human, economic, etc.) at government agencies to implement an open data liberation strategy. So when we are asked about opening data, "keep it simple" is the answer, and CSV is by far the easiest and smartest way to start. Uruguay has an access to public information law but does not have legislation about open data. The open data initiative is led by AGESIC with the support of the open data working group.
OD Working group:
- Intendencia de Montevideo – www.montevideo.gub.uy
- INE – www.ine.gub.uy
- AGEV – www.agev.opp.gub.uy
- FING – UDELAR – www.fing.edu.uy
- D.A.T.A. – www.datauy.org

GS1: GS1 Digital

Overview:
Retailers and Manufacturers / Brand Owners are beginning to understand that there can be benefits to openly publishing structured data about products and product offerings on the web as Linked Open Data.
Some of the initial benefits may be enhanced search listing results (e.g. Google Rich Snippets) that improve the likelihood of consumers choosing such a product or product offer over an alternative product that lacks the enhanced search results. However, the longer term vision is that an ecosystem of new product-related services can be enabled if such data is available. Many of these will be consumer-facing and might be accessed via smartphones and other mobile devices, to help consumers to find the products and product offers that best match their search criteria and personal preferences or needs - and to alert them if a particular product is incompatible with their dietary preferences or other criteria such as ethical / environmental impact considerations - and to suggest an alternative product that may be a more suitable match.

There are at least five main actors in this use case:
Manufacturers / Brand Owners
Retailers
GS1
Search engines, data aggregators and developers of smartphone apps
Accreditation agencies

The figure below provides an overview of some of the kinds of factual claims that might be asserted about a product or product offering and the corresponding parties that have the authority to assert such claims.

1) Manufacturers / Brand Owners
They publish authoritative master data about their products, data that is intrinsic to the product itself. This includes technical specifications, lists of ingredients, allergens, the results of various accreditations (e.g. environmental, ethical), as well as the product category and various attribute-value pairs (about qualitative and quantitative characteristics of the product). Many of the quantitative values will consist of a quantity and a unit of measurement, and for some of these (e.g. nutritional information) it is essential to unambiguously specify the reference quantity, e.g. per product pack, per serving size, per 100g or 100ml of product. Some values should be selected from standardized code lists and expressed using URIs rather than literal text strings, in order to better support multi-lingual applications as well as comparisons between products that share some characteristics in common (it is more reliable to check for exact URI matches of codified values than to check for fuzzy matches of text strings).
Each product carries a globally unambiguous identifier, the Global Trade Item Number (GTIN), which is typically represented as an EAN-13 or UPC-12 linear barcode on the product packaging. A GTIN should point to at most one product; products that are distinct should have distinct GTINs.
The brand owner assigns a GTIN to each product they produce. An HTTP URI representation of a GTIN, issued under the registered domain of a brand owner, can serve as the subject in a graph of factual claims about the product that the brand owner has the authority to make. It can also serve (via HTTP 303 redirection) to retrieve a graph of such data in a preferred representation (e.g. via HTTP content negotiation using the Accept: header).
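A GTIN as described above ends in a check digit computed with the standard GS1 mod-10 algorithm, so a consumer of product data can cheaply reject malformed identifiers (for instance before minting or dereferencing an HTTP URI for a product). A minimal sketch of that validation:

```python
def gtin_check_digit(payload: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits preceding it."""
    total = 0
    # Counting from the digit nearest the check digit, weights alternate 3, 1, 3, ...
    for i, ch in enumerate(reversed(payload)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10

def is_valid_gtin(gtin: str) -> bool:
    """Validate a GTIN-8/12/13/14 using its trailing check digit."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(gtin[:-1]) == int(gtin[-1])

print(is_valid_gtin("4006381333931"))  # True  (a valid EAN-13)
print(is_valid_gtin("4006381333932"))  # False (wrong check digit)
```

Note that a syntactically valid GTIN is not proof that the product exists or that the brand owner issued it; that authority check still relies on resolving the GTIN under the brand owner's registered domain.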

2) Retailers
A retailer has the authority to assert factual information about an offer it makes for a product. This includes information such as price, availability, payment options, delivery/collection options and store locations and should include a reference to the product (identified via its GTIN).
Typically, an online retailer's website will replicate or embed some or all of the authoritative product data from the brand owner or manufacturer. However, this data needs to be accurate and synchronized so that it is up to date (taking into account any recent changes to the product information). Data synchronization mechanisms such as the GDSN (Global Data Synchronization Network) exist for synchronizing master data about products and organizations within a business-to-business / supply chain context, but these mechanisms currently neither make use of Linked Data technology nor publish such data openly on the web.
If the retailer instead only references the graph of authoritative master data about the product published by the brand owner, this in turn relies upon (1) open publishing of that information by the brand owner using Linked Open Data techniques, such that the brand owner’s HTTP URI correctly redirects to a graph of authoritative structured data and (2) confidence that search engines, data aggregators and other consumers of the data will actually follow such HTTP URI references to import that externally referenced data, without disadvantaging a retailer who chooses to reference (rather than embed) product data for which they are not authoritative (with the exception of ‘own brand’ products for which they are the authority).

3) GS1
GS1 http://gs1.org is a global not-for-profit standards development organisation that develops user-driven open standards for improving the efficiency of supply chains. GS1 brings together a community of over 1 million companies who work together to develop a common language for exchanging information about products and supply chain operations. Some of the results of this include the data model for the Global Data Synchronization Network (GDSN) for synchronising details on product, party and price, the GS1 Global Data Dictionary (GDD) and its code lists and the Global Product Classification System (GPC). The GPC is a product classification developed by the GS1 community that enables trading partners to communicate more efficiently and accurately throughout their supply chain activities.

Within the GS1 Digital initiative, the GTIN+ on the Web project is supporting brand owners, manufacturers and retailers as they begin to adopt Linked Open Data technology for sharing structured data about products openly on the web. Although GS1 does not have the product data, nor is it authoritative about either the product master data or the product offers made by retailers, it does have the authority to publish its existing data models, definitions and code lists as a GS1 Linked Data Vocabulary and guidelines that can be used by anyone for describing product details with greater precision and expressive power than can currently be achieved using some of the existing broad web vocabularies (such as schema.org). Work is already in progress to convert many of these from existing open data in formats such as XML to RDF datasets and vocabularies.

4) Search engines, data aggregators and developers of smartphone apps
These are the consumers of product data. They rely on being able to make comparisons between multiple retail offers for the same product (correlated through the GTIN of the product) and to find similar products (e.g. using the Global Product Classification (GPC) and attribute-value pairs). They rely on the available data being correct and up-to-date, especially since they often present the primary user interface to consumers, who will make decisions (to buy, to consume) based on the information presented to them.

5) Accreditation agencies
These are independent neutral third party organizations who verify claims (e.g. ethical or environmental claims) about the product and its production. Examples include organizations such as the Marine Stewardship Council, Soil Association, etc. Each of these organizations has the sole authority to certify whether a product or its production process conforms to a particular claim - and to award the corresponding accreditation to the product. The brand owner / manufacturer and retailer may in turn embed or reference such claims, although the relevant accreditation agency is the authoritative source for such claims.

Elements: (Each element described in more detail at Use-Case Elements )

vision is to enable an ecosystem of new digital apps around product data

the food sector in the EU is already obliged under new food labelling legislation (EU 1169/2011, Article 14) to provide the same amount of information about a food product sold online to consumers as would be available to them from the product packaging if they picked up the product in-store. The legislation does not suggest that Linked Open Data technology should be used to make the same information available in machine-readable form, but there is currently significant investment and effort to upgrade websites to provide accurate and detailed information about food products. The GS1 Digital team consider that, for a relatively small additional effort, these companies could gain tangible benefits (e.g. enhanced search results) from such compliance efforts by using Linked Open Data technology within their web pages.

Usage:

data providing transparency about product characteristics

data used to help consumers make informed choices about which products to buy/consume

Quality: Very important to have trustworthy authoritative data from respective organizations

Rate of change: mostly static data initially - but subject to some variation over time

Data lifespan: data should remain accessible until products are no longer considered to be in circulation; this represents a challenge for deprecated product lines.

Data that is stated authoritatively by one organization might be embedded / referenced in the data asserted by another organization. This raises concerns that embedded data becomes stale if it is inadequately synchronized, and that referenced data is not dereferenced (and therefore not discovered / gathered) by consumers of the data. From a liability perspective, there also needs to be clarity about which organization asserted which factual information - and about which organization has the authority to assert specific factual claims.

An organization (e.g. retailer) might embed authoritative data asserted by another organization (e.g. brand owner) and there is the risk that such embedded information becomes stale if it is not continuously synchronized.

An organization (e.g. retailer) might reference a graph of authoritative data that can be retrieved via an HTTP request to a remote HTTP URI. There is a risk that software or search engines consuming Linked Open Data containing such references may fail to dereference such HTTP URIs and in doing so may fail to gather all of the relevant data.

Organizations are currently faced with a choice of whether to embed machine-readable structured data in their web pages using a block approach (e.g. JSON-LD) or an inline approach (e.g. RDFa, RDFa Lite or Microdata). A block approach (JSON-LD) may be simpler and less brittle than inline annotation, especially as it can be easily decoupled from structural changes to the body of the web page that may happen over time in the redesign of a website. At present, tool support for the three major markup approaches for embedded Linked Open Data (RDFa, JSON-LD, Microdata) is unequal, and some tools may not export or import / ingest all three formats - some tools even fail to extract data from JSON-LD markup created by their own corresponding export tool. There are significant challenges in ensuring that the structured data embedded within a web page is correctly linked to form coherent RDF triples, without dangling nodes that should be connected to the Subject or other nodes.

Only through the provision of best-in-class tools that recognize all three major formats on a completely equal footing can organizations have confidence that they can use any of the three markup formats and verify / validate that their own markup results in the correct RDF triples.
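As an illustration of the block approach discussed above, the sketch below generates a JSON-LD `<script>` element for a product description. The schema.org terms (`Product`, `gtin13`, `Offer`) are standard vocabulary; the product values themselves are made up for illustration:

```python
import json

# Hypothetical product description using schema.org terms; the GTIN and
# offer values are illustrative, not real data.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "gtin13": "5011476100885",
    "name": "Example Wholegrain Cereal",
    "offers": {
        "@type": "Offer",
        "price": "2.49",
        "priceCurrency": "EUR",
    },
}

# A block approach embeds the whole graph in one <script> element, so the
# structured data survives redesigns of the page's visible markup.
script_block = (
    '<script type="application/ld+json">\n'
    + json.dumps(product, indent=2)
    + "\n</script>"
)
print(script_block)
```

The same statements expressed inline with RDFa or Microdata would be interleaved with the page's HTML structure, which is what makes the inline approach more brittle under redesign.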

Potential Requirements:

The ability to determine who asserted various facts - and whether they are the organization that can assert those facts authoritatively.

Where data from other sources is embedded, there is a risk that the embedded data might be stale. It is therefore helpful to indicate which graph of triples is a snapshot in time from data from another source - and to provide a link to the original source, so that the consumer of the data has the opportunity to obtain a fresh version of the live data rather than relying on a potentially stale snapshot graph of data. DWBP could provide guidance about how to indicate which graph of data is a snapshot and where it came from.
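One possible way to indicate a snapshot graph, sketched here with plain string construction rather than an RDF library: annotate the snapshot's graph IRI with PROV terms pointing back to the live source. All IRIs below are hypothetical, and this is one candidate pattern rather than established DWBP guidance:

```python
# Hypothetical graph IRIs: a retailer's stored snapshot and the brand
# owner's live, authoritative source.
snapshot_graph = "https://retailer.example.com/graphs/product-05011476100885"
source_graph = "https://brand.example.com/gtin/05011476100885"

PROV = "http://www.w3.org/ns/prov#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# N-Triples-style statements saying: this graph is a dated snapshot
# derived from the source, so consumers can fetch a fresh version.
triples = [
    f"<{snapshot_graph}> <{PROV}wasDerivedFrom> <{source_graph}> .",
    f'<{snapshot_graph}> <{PROV}generatedAtTime> '
    f'"2014-06-01T00:00:00Z"^^<{XSD}dateTime> .',
]

for t in triples:
    print(t)
```

A consumer seeing `prov:wasDerivedFrom` plus a generation timestamp can decide whether the snapshot is fresh enough or whether to dereference the source IRI instead.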

Consumers of Linked Open Data about products might rely on it for making decisions - not only about purchase but even consumption. If the data about a product is inaccurate or out of date, we might need to provide some guidance about how liability terms and disclaimers can be expressed in Linked Open Data. We’re not suggesting that we define such terms from a legal perspective - but perhaps there is an existing framework, similar to the existing frameworks for expressing various licences for data? If not, perhaps such a framework needs to be developed - but outside of the DWBP group? Licensing generally says what you’re allowed to do with the data, but it says nothing about liability for using the data or making decisions based on it. This area probably needs some clarification, particularly if there is a risk of injury or death (e.g. due to inaccurate information about allergens in a food product).

Tabulae - how to get value out of data

Overview: Tabul.ae is a framework for publishing and visually exploring data that can be used to deploy powerful and easy-to-exploit open data platforms, helping organizations to unleash the potential of their data. The aim is to enable data owners (public organizations) and consumers (citizens and business re-users) to transform the information they manage into added-value knowledge, empowering them to easily create data-centric web applications. These applications are built upon interactive and powerful graphs, and take the shape of interactive charts, dashboards, infographics and reports. Tabulae provides a high degree of assistance in creating these apps and also automates several data visualization tasks (e.g., recognition of geographical entities to automatically generate a map). In addition, the charts and maps are portable outside the platform and can be smartly integrated with any web content, enhancing the reusability of the information.

Take into consideration the different levels of access to data (security and privacy)

Bio2RDF

Contributor: Carlos Laufer

Overview: Bio2RDF [1] is an open source project that uses Semantic Web technologies to enable distributed querying of integrated life sciences data. Since its inception [2], Bio2RDF has made use of the Resource Description Framework (RDF) and RDF Schema (RDFS) to unify the representation of data obtained from diverse biological sources (molecules, enzymes, pathways, diseases, etc.) in heterogeneous formats (e.g. flat files, tab-delimited files, SQL, dataset-specific formats, XML, etc.). Once converted to RDF, this biological data can be queried using the SPARQL Protocol and RDF Query Language (SPARQL), which can be used to federate queries across multiple SPARQL endpoints.
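Federation of the kind Bio2RDF enables uses the SPARQL 1.1 `SERVICE` keyword to pull triples from several remote endpoints in one query. The sketch below only constructs the query string; the endpoint URLs follow the `http://[namespace].bio2rdf.org` pattern mentioned in this use case, but the specific namespaces and predicates are illustrative assumptions, not a guaranteed Bio2RDF schema:

```python
# Build a federated SPARQL query spanning two hypothetical Bio2RDF
# endpoints. Running it would require a SPARQL client and live endpoints;
# here we only assemble and inspect the query text.
endpoints = {
    "kegg": "http://kegg.bio2rdf.org/sparql",
    "ncbigene": "http://ncbigene.bio2rdf.org/sparql",
}

query = f"""
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?gene ?pathway ?title WHERE {{
  SERVICE <{endpoints["kegg"]}> {{
    ?pathway dcterms:title ?title .
  }}
  SERVICE <{endpoints["ncbigene"]}> {{
    ?gene dcterms:isPartOf ?pathway .
  }}
}}
"""
print(query)
```

Each `SERVICE` block is evaluated against its remote endpoint, and the results are joined on the shared `?pathway` variable - which is why unified RDF representations and shared IRIs across datasets matter.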

Elements:

Domains: Biological data

Obligation/motivation:

Biological researchers are often confronted with the inevitable and unenviable task of having to integrate their experimental results with those of others. This task usually involves a tedious manual search and assimilation of often isolated and diverse collections of life sciences data hosted by multiple independent providers, including organizations such as the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), which provide dozens of user-submitted and curated datasets, as well as smaller institutions such as the Donaldson group, which publishes iRefIndex [3], a database of molecular interactions aggregated from 13 data sources. While these mostly isolated silos of biological information occasionally provide links between their records (e.g. UniProt links its entries to hundreds of other databases), they are typically serialized in either HTML tags or in flat-file data dumps that lack the semantic richness required to express the intent of the linkage between data records. With thousands of biological databases and hundreds of thousands, if not millions, of datasets, the ability to find relevant data is hampered by non-standard database interfaces and an enormous number of haphazard data formats [4]. Moreover, metadata about these biological data providers (dataset source information, dataset versioning, licensing information, date of creation, etc.) is often difficult to obtain. Taken together, the inability to easily navigate through available data presents an overwhelming barrier to their reuse.

Usage: Biological research

Quality:

Provenance
Bio2RDF scripts generate provenance records using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Each data item is linked to a provenance object that indicates the source of the data, the time at which the RDF was generated, licensing (if available from the data source provider), the SPARQL endpoint in which the resource can be found, and the downloadable RDF file where the data item is located. Each dataset provenance object has a unique IRI and label based on the dataset name and creation date. The date-specific dataset IRI is linked to a unique dataset IRI using the W3C PROV predicate "wasDerivedFrom" such that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. Each resource in the dataset is linked to the date-specific dataset IRI that is part of the provenance record using the VoID "inDataset" predicate. Other important features of the provenance record include the use of the Dublin Core "creator" term to link a dataset to the script on GitHub that was used to generate it, the VoID predicate "sparqlEndpoint" to point to the dataset SPARQL endpoint, and the VoID predicate "dataDump" to point to the data download URL.
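The provenance pattern described above can be sketched as a small generator of N-Triples-style statements. The IRI shapes, release tag and download URL below are illustrative guesses at the scheme, not the exact Bio2RDF conventions; only the predicates (`prov:wasDerivedFrom`, `void:sparqlEndpoint`, `void:dataDump`) are taken from the description:

```python
from datetime import date

PROV = "http://www.w3.org/ns/prov#"
VOID = "http://rdfs.org/ns/void#"

def provenance_triples(namespace, created):
    """Emit a date-specific dataset IRI linked back to the base dataset
    IRI and to its endpoint and dump, per the pattern described above.
    IRI layout is hypothetical."""
    base = f"http://bio2rdf.org/dataset:{namespace}"
    dated = f"{base}-{created.isoformat()}"
    return [
        f"<{dated}> <{PROV}wasDerivedFrom> <{base}> .",
        f"<{dated}> <{VOID}sparqlEndpoint> <http://{namespace}.bio2rdf.org/sparql> .",
        f"<{dated}> <{VOID}dataDump> <http://download.bio2rdf.org/{namespace}> .",
    ]

for t in provenance_triples("drugbank", date(2014, 7, 1)):
    print(t)
```

Because every dated snapshot points back to the same base dataset IRI via `wasDerivedFrom`, a single query on that base IRI retrieves the provenance records for all creation dates.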

Dataset metrics

total number of triples

number of unique subjects

number of unique predicates

number of unique objects

number of unique types

unique predicate-object links and their frequencies

unique predicate-literal links and their frequencies

unique subject type-predicate-object type links and their frequencies

unique subject type-predicate-literal links and their frequencies

total number of references to a namespace

total number of inter-namespace references

total number of inter-namespace-predicate references
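Several of the metrics listed above reduce to set and frequency counts over the triples of a dataset. A minimal in-memory sketch (the example triples are made up, and a real implementation would stream from an endpoint or dump rather than hold triples in a list):

```python
from collections import Counter

# Toy dataset of (subject, predicate, object) triples for illustration.
triples = [
    ("ex:a", "rdf:type", "ex:Gene"),
    ("ex:a", "ex:partOf", "ex:p1"),
    ("ex:b", "rdf:type", "ex:Gene"),
]

metrics = {
    "total_triples": len(triples),
    "unique_subjects": len({s for s, _, _ in triples}),
    "unique_predicates": len({p for _, p, _ in triples}),
    "unique_objects": len({o for _, _, o in triples}),
    # Frequency of each predicate-object link, as in the metric list above.
    "predicate_object_freq": Counter((p, o) for _, p, o in triples),
}

print(metrics["total_triples"], metrics["unique_subjects"])
```

The namespace-level metrics follow the same shape, grouping by the namespace prefix of each IRI instead of the full term.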

Size:

Nineteen datasets were generated as part of the Bio2RDF Release 2. Several of the datasets are themselves collections of datasets that are now available as one resource. Each dataset has been loaded into a dataset-specific SPARQL endpoint using OpenLink Virtuoso version 6.1.6; the endpoints are available at http://[namespace].bio2rdf.org. All updated Bio2RDF Linked Data and the corresponding Virtuoso DB files are available for download.

Resource Discovery for Extreme Scale Collaboration (RDESC)

Overview:
RDESC's objective is to develop a capability for describing, linking, searching and discovering scientific resources used in collaborative science. To capture the semantics of context, RDESC adopts sets of existing ontologies where possible, such as FOAF, BIBO and schema.org. RDESC also introduced new concepts in order to provide a semantically integrated view of the data. Such concepts have two distinct functions. The first is to preserve semantics of the source that are more specific than what already existed in the ontology. The second is to provide broad categorization of existing concepts as it becomes clear that concepts are forming general groups. These generalizations enable users to work with concepts they understand, rather than needing to understand the semantics of many different systems. RDESC strives to provide a framework lightweight enough to be used as a component in any software system, such as desktop user environments or dashboards, but also scalable to millions of resources.

Ordnance Survey Linked Data

Under an agreement with the UK government, the Ordnance Survey has published a lot of its mapping data as open data, including some pioneering work in Linked Geospatial Data. Doing this has required significant effort and public investment, as has the effort to include semantics in the European Union's INSPIRE data model. In common with just about all organizations, public and private, investment in such an activity requires justification and so, speaking at the Linking Geospatial Data workshop in March 2014, the Ordnance Survey's Peter Parslow said that maintaining the service depends on showing that it is being used.

Server logs only tell you so much, e.g. the number of requests, but they don't show you the quality of the usage, or what the data is being used for. A small number of high quality, high impact uses of the data might very well have more significance than a large number of low quality ones.

Such a desire to know more about what data is being used for is not unique to Ordnance Survey. For example, the equivalent body in Denmark, the Danish Geodata Agency, offers all its data for free but requires you to register and give information about your intended use. Even where data is provided for free, the provider is very likely to want some recognition for their efforts as an encouragement to keep providing it, often in the face of demands for justification from line managers.

At the same time, users of the data need an incentive, other than simple politeness, to recognize the efforts made by data providers. Therefore any vocabulary that describes the use made of a dataset must also help in the discovery of that usage, i.e. in the discovery of the user's own work. Usage in this context means anything from usage within an application to citation in academic research.

Elements

Policy framework including

Sustainability

Impact assessment

Return on investment

Challenges

Assessing use of data without restricting or disincentivizing such use

Incentivizing provision of data about usage

(Existing) Potential Requirements

ProvAvailable

MetadataStandardized

TrackDataUsage

Citable

New

Improve Discoverability – any description of the usage of a dataset must aid the discovery of the application, i.e. it should be in the direct interest of the data user to provide such metadata.

LusTRE: Linked Thesaurus fRamework for Environment

Overview:
LusTRE is a framework that combines existing environmental thesauri to support the management of environmental resources. It treats the heterogeneity in scope and level of abstraction of existing environmental thesauri as an asset when managing environmental data, and it exploits the Linked Data best practices SKOS (Simple Knowledge Organization System) and RDF (Resource Description Framework) to provide a multi-thesaurus solution for the INSPIRE data themes related to nature conservation.

LusTRE is intended to support metadata compilation and data/service discovery according to ISO 19115/19119. The development of LusTRE includes (i) a review of existing environmental thesauri and their characteristics in terms of multilingualism, openness and quality; (ii) the publication of environmental thesauri as Linked Data; (iii) the creation of linksets among the published thesauri as well as well-known thesauri exposed as Linked Data by third parties; and (iv) the exploitation of these linksets to take advantage of thesaurus complementarities in terms of domain specificity and multilingualism.

The quality of thesauri and linksets is an issue that is not necessarily limited to the initial review of thesauri; it should be monitored and promptly documented.

In this respect, a standardised vocabulary for expressing dataset and linkset quality would be recommendable to make the quality assessment of thesauri included in LusTRE accessible. Given the importance of linkset quality in achieving effective cross-walking among thesauri, further services for assessing the quality of linksets are going to be investigated. Such services might be developed by extending the measure proposed in Albertoni et al., 2013, so that linksets among thesauri can be assessed considering their potential when exploiting interlinks for thesaurus complementarities.

LusTRE is currently under development within the EU project eENVplus (CIP-ICT-PSP grant No. 325232); it extends the common thesaurus framework (De Martino et al., 2011) previously resulting from the EU project NatureSDIplus (ECP-2007-GEO-317007).

Elements:

Domains: Geographic information. The thesauri and controlled vocabularies provided within LusTRE are meant to ease the management of geographical data and services.

Linked Data Glossary

Common Questions to consider for Open-Data Use-Cases

Did you have a legislative or regulatory mandate to publish Open Data?

What were the political obstacles you faced to publish Open Data?

Did your citizens expect Open Data?

Did your citizens understand the uses of Open Data?

Did you publish data and information available in other forms (print, web, etc) first?

How did you inventory your data prior to publishing?

Did you classify your data as part of the inventory?

How did you transform printed materials into Open Data?

Does your city certify the quality of the data published, and what steps are involved in certification?

Do you have data traceability and lineage - i.e., do you know where your data came from and who has transformed it?

Can you provide an audit trail of data usage and security prior to publication?

Can you track the utility of the data published?

Are you using URIs to identify data elements?

Do you have a Data Architecture?

What is your Data Governance structure and program?

Do you have a Chief Data Officer and Data Governance Council who make decisions about what to publish and how?

Do you have an Open Data Policy?

Do you do any Open Data Risk Assessments?

Can you compare your Open Data to neighboring cities and regions?

Do you provide any Open Data visualization and analytics on top of your publication portal?

Do you have a common application development framework and cloud hosting environment to maintain Open Data apps?

What legal agreements and frameworks have you developed to protect your citizens and your city from the abuse and misuse of Open Data?

Stories

NYC Council needs modern and inexpensive member services and tools for constituent services

Date: Monday, 23 Feb 2014
From: Noel Hidalgo, Executive Director of BetaNYC
To: NY City Council’s Committee on Rules, Privileges and Elections.
Subject: For a modern 21st Century City, NY Council needs modern and inexpensive member services and tools for constituent services.

Dear Chairman and Committee Member,

Good afternoon. It is a great honor to address you and represent New York City’s technology community - in particular, a rather active group of technologists: the civic hackers.
I am Noel Hidalgo, the Executive Director and co-founder of BetaNYC [1]. With over 1,500 members, BetaNYC’s mission is to build a city powered by the people, for the people, for the 21st Century. Last fall, we published the “People’s Roadmap to a Digital New York City,” where we outline our civic technology values and 30 policy ideas for a progressive digital city [2]. We are a member-driven organization and members of the New York City Transparency Working Group [3], a coalition of good government groups that supported the City’s transformative Open Data Law.

In 2008, BetaNYC got its start by building a small app on top of Twitter. This tool, Twitter Vote Report, was built over the course of several developer days (now called hacknights) and enabled over 11,300 individuals to use a digital and social tool for election protection. [4]

Around the world, apps like this catalyzed our current civic hacking movement. Today, hundreds of thousands of developers, designers, mappers, hackers, and yackers (the policy wonks) volunteer their time to analyze data, build public engagement applications, and use their skills to improve the quality of life of their neighbors. This past weekend, we had Manhattan Borough President Gale Brewer, Councilmember Ben Kallos, Councilmember Mark Levine, a representative from Councilmember Rosie Mendez, and representatives from five Community Boards challenge over 100 civic hackers to prototype 21st Century interfaces to NYC’s open data. [15]

Through this conversation on rules reform, you have an opportunity to continue the pioneering work that a small, talented team of civic hackers and I did WITHIN the New York State Senate.

In 2004, I moved from Boston to work for then-Senator Paterson’s Minority Information Services department. In 2009, I joined the NY State Senate’s first Chief Information Officer’s office. Our team’s mission was to move the State Senate from zero to hero, depoliticize technology, and build open, reusable tools for all.

In the course of four months, we modernized the Senate’s public information portal, leading the way for two years of digital transparency, efficiency, and participation. These initiatives were award-winning and done under the banner of “Open Senate.” From Andrew Hoppin’s blog (Hoppin is the former NY State Senate CIO): [5]

Open Senate is an online “Gov 2.0” program intended to make the Senate one of the most transparent, efficient, and participatory legislative bodies in the nation. Open Senate comprises multiple sub-projects led by the Office of the Chief Information Officer [CIO] in the New York State Senate, ranging from migrating to cost-effective, open-source software solutions, to developing and sharing original web services providing access to government transparency data, to promoting the use of social networks and online citizen engagement.

We did this because we all know how New Yorkers are getting their information. I don’t need to sit here and spout off academic numbers on digital connectivity. One just has to hop into a subway station to see just about everyone on some sort of digital device. For a modern NY City Council with 21st century member services, the Council needs a Chief Information Officer and dedicated staff. The role of this office would be similar to the NY Senate’s CIO: empowered to do everything from migrating to cost-effective, open-source software solutions, to developing and sharing original web services providing access to government transparency data, to promoting the use of social networks and online citizen engagement.

Through this office, the Council would gain an empowered digital and information officer to coordinate the development and enhancement of member and constituent services.

Member services could be improved with the following.

Online and modern digital information tools.

Imagine a council website that you can call your own and include official

Imagine being able to take a constituent issue and automatically file a 311 complaint and monitor the status of the complaint to completion. Imagine being able to send targeted constituent messages and reduce your paper mailings.

Imagine being able to survey your constituents via a mobile app or sms.

Better business and internal technology practices

No matter where you are, from desktops to mobile devices, you could always have access to the Council's internal systems while on the go.

More usable interfaces to legislation

Imagine a simpler interface to Legistar that integrates constituent comments and public feedback.

Palo Alto Open Data Story

On February 17th we heard a use case presentation from Jonathan Reichental, CIO of the City of Palo Alto.
A recording of the use case presentation can be found here: Palo Alto - Open by Default
1. We can explore the use of URI's for Open Data elements and physical things in a city that have multiple data elements
2. Cities are not yet tagging their data with metadata to allow comparability
3. There are not yet mechanisms to allow citizens to improve data completeness
4. Cities have internal processes for assuring data quality including sign-offs from IT and public officials but these activities are not recorded in metadata and provided with the datasets
5. Cities are not tracing origin and lineage
6. Tuples would be a good way to identify relationships between things and data elements, which could allow machine comparability of datasets in an internet of things that open data describes

Palo Alto pledged to be a partner with W3C in our WG, which is a great outcome.

ISO GEO Story

ISO GEO is a company managing catalog records of geographic information in XML, conformant to ISO 19139 (ISO 19139 is a French adaptation of ISO 19115).
An excerpt is here: http://cl.ly/3A1p0g2U0A2z. They export thousands of catalogs like that today, but they need to manage them better.
In their platform, they store the information in a more conventional manner and use this standard to export datasets compliant with INSPIRE interoperability requirements, or via the CSW protocol.
Sometimes they have to enrich their metadata records with others, produced by tools like GeoSource and accessed through an SDI (Spatial Data Infrastructure).

A sample containing 402 metadata records in ISO 19139 is in public consultation at http://geobretagne.fr/geonetwork/srv/fr/main.home.
They want to be able to integrate all the different implementations of ISO 19139 in different tools into a single framework, to better understand the thousands of metadata records they use in their day-to-day business.
The types of information recorded in each file (see the example at http://www.eurecom.fr/~atemezin/datalift/isogeo/5cb5cbeb-fiche1.xml) are the following: contact info (metadata) [data issued]; spatial representation; reference system info [code space]; spatial resolution; geographic extension of the data; file distribution; data quality; process steps, etc.
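Extracting fields of this kind from ISO 19139 records is a matter of namespace-aware XML parsing. The fragment below is a heavily simplified, hypothetical record in the spirit of ISO 19139 (real records nest these elements much more deeply and use additional namespaces); the sketch only shows the general technique:

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified fragment; real ISO 19139 records are far
# richer and nest these elements several levels deep.
record = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:dateStamp><gco:Date>2014-05-12</gco:Date></gmd:dateStamp>
  <gmd:referenceSystemInfo>
    <gco:CharacterString>EPSG:2154</gco:CharacterString>
  </gmd:referenceSystemInfo>
</gmd:MD_Metadata>
"""

ns = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}
root = ET.fromstring(record)

# Namespace-prefixed paths resolve against the mapping above.
date_stamp = root.findtext("gmd:dateStamp/gco:Date", namespaces=ns)
crs = root.findtext("gmd:referenceSystemInfo/gco:CharacterString", namespaces=ns)
print(date_stamp, crs)
```

A single framework over the various ISO 19139 implementations would amount to mapping each tool's element paths onto one common set of extraction rules like these.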

BuildingEye: SME use of public data

Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland, local authorities handle planning applications and usually provide some customised views of the data (PDFs, maps, etc.) on their own websites. However, there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However, as each local authority didn't have an Open Data portal, BuildingEye had to directly ask each local authority for its data. It was granted access by some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonised this data for its own system. However, if another SME wanted to use this data, it would have to go through the same process and again ask each local authority for the data.

Recife Open Data Story

Recife is a beautiful city situated in the Northeast of Brazil and is famous for being one of Brazil’s biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organisations for public use as Open Data. An Open Data Portal was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats, and every dataset has a metadata description, i.e. a description of the data, that helps in the understanding and usage of the data. However, the metadata is not described using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, they are working to have data generated dynamically from the contents of relational databases, so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning Open Data; identifying the sources of data that potential consumers could find useful; extracting and transforming data from the original data sources to the open data format; configuring and installing the open data catalogue tool; and data publication and portal release.

Dutch basic registers

Story: The Netherlands has a set of registers it is looking at opening and exposing as Linked (Open) Data in the context of the project "PiLOD". The registers contain information about buildings, people, businesses and other entities public bodies may want to refer to in their daily activities. One of them is, for instance, the public tax service ("Belastingdienst"), which regularly pulls data from several registers, stores this data in a big Oracle instance and curates it. This costly and time-consuming process could be optimised by providing on-demand access to up-to-date descriptions provided by the register owners.

Capacity: at this point, it cannot be expected that every register owner will take care of publishing its own data. Some of them export what they have to the national open data portal. This data has been used for some testing with third-party publication by PiLOD participants, but this is rather sensitive as a long-term strategy (governmental data has to be traceable and trustable as such). The middle-ground solution currently deployed is the PiLOD platform, a (semi-)official platform for publishing register data.

Privacy: some of the register data is personal, or may become so when linked to other data (e.g. disambiguating personal data based on addresses). Some registers will need to provide secured access to some of their data for authorised people only (Linked Data, but not Open). Others can go along with open data as long as they get a precise log of who is using what.

Revenue: institutions working under mixed governmental/non-governmental funding generate part of their revenue by selling some of the data they curate. Switching to an open data model would generate a direct loss of revenue that has to be compensated by other means. This does not necessarily mean closing the data: a model of open dereferencing plus paid dumps can be considered, as well as other indirect revenue streams.

Wind Characterization Scientific Study

Story: This use case describes a Data Management Facility (DMF) being constructed to support scientific offshore wind energy research for the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in situ instruments located on an offshore platform. This raw data is collected by the DMF and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data.
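The standardization step above can be sketched in miniature. The actual DMF pipeline and its NetCDF tooling are not shown in this use case, so the following uses plain Python records as a stand-in for NetCDF variables; the raw field names, the sensor identifier, and the processing-history structure are all invented for illustration.

```python
from datetime import datetime, timezone

def standardize(raw_reading, history):
    """Convert one raw instrument reading (hypothetical field names) into a
    standardized record, appending a processing-history entry as the DMF would."""
    record = {
        "time": datetime.fromtimestamp(raw_reading["ts"], tz=timezone.utc).isoformat(),
        "wind_speed_ms": round(raw_reading["ws_knots"] * 0.514444, 3),  # knots -> m/s
        "instrument": raw_reading["sensor_id"],
    }
    history.append({"step": "unit_conversion", "input": raw_reading["sensor_id"]})
    return record

history = []
raw = {"ts": 1420070400, "ws_knots": 12.0, "sensor_id": "lidar-01"}
rec = standardize(raw, history)
print(rec)
```

The key point is that every transformation leaves a trace in the processing history, which is what lets the facility account for quality assurance and maintenance work later on.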

All datasets, instrumentation, and activities are cataloged, providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable.

Scientists will be able to use the catalog for faceted browsing, ad-hoc searches, and query by example. For accessing individual datasets, a REST GET interface to the archive will be provided.
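Faceted browsing over such a catalog can be illustrated with a minimal sketch. The facet names and catalog entries below are invented for illustration and are not taken from the DMF:

```python
def faceted_search(records, **facets):
    """Return the records matching every given facet=value pair."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

# Hypothetical catalog entries.
catalog = [
    {"dataset": "wind-profile-2014", "instrument": "lidar", "qc": "checked"},
    {"dataset": "wave-height-2014", "instrument": "buoy", "qc": "raw"},
    {"dataset": "wind-profile-2015", "instrument": "lidar", "qc": "raw"},
]
hits = faceted_search(catalog, instrument="lidar", qc="raw")
print(hits)
```

In a real deployment the facet values would be drawn from the linked open vocabularies mentioned above, so that the same instrument or quality flag is named consistently across datasets.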

Challenges: For accessing numerous datasets, scientists will access the archive directly using other protocols such as sftp, rsync, and scp, and access techniques such as HPN-SSH (http://www.psc.edu/index.php/hpn-ssh).