Guidelines for Collecting Metadata on Linked Datasets in the datahub.io Data Catalog

To keep the LOD cloud diagram up to date, the Linking Open Data community effort has started to collect meta-information about linked datasets on datahub.io, a registry of open data and content packages provided by the Open Knowledge Foundation.

This page explains how dataset publishers, or anyone else who wants a dataset added to the LOD cloud, should describe datasets on datahub.io.

The list of datasets about which we have already collected information can be found here:

Which datasets are included in the LOD cloud diagram?

Data items are accessible via dereferenceable URIs. Offering only a SPARQL endpoint but no dereferenceable URIs is not considered enough for inclusion.

The dataset contains at least 50 RDF links pointing at other datasets, or at least one other dataset contains at least 50 RDF links pointing at it.
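The two inclusion criteria above can be sketched as a small check. This is only an illustration: the `DatasetInfo` record and its field names are hypothetical, and the code assumes one possible reading of the link rule (outgoing links are summed over all target datasets, while incoming links must reach the threshold from a single source dataset).

```python
from dataclasses import dataclass, field

MIN_LINKS = 50  # threshold stated in the guidelines


@dataclass
class DatasetInfo:
    """Hypothetical record; the field names are illustrative only."""
    has_dereferenceable_uris: bool
    # target dataset id -> number of RDF links set by this dataset
    outlinks: dict = field(default_factory=dict)
    # source dataset id -> number of RDF links pointing at this dataset
    inlinks: dict = field(default_factory=dict)


def qualifies_for_lod_cloud(ds: DatasetInfo) -> bool:
    """Apply both criteria; a SPARQL endpoint alone never qualifies."""
    if not ds.has_dereferenceable_uris:
        return False
    # Assumed reading: outgoing links count in total across all targets.
    enough_outlinks = sum(ds.outlinks.values()) >= MIN_LINKS
    # Assumed reading: a single other dataset must contribute the inlinks.
    enough_inlinks = any(n >= MIN_LINKS for n in ds.inlinks.values())
    return enough_outlinks or enough_inlinks
```

For example, `qualifies_for_lod_cloud(DatasetInfo(True, {"dbpedia": 60}))` evaluates to `True`, while the same link counts without dereferenceable URIs do not qualify.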

How do I add a data set to datahub.io or edit an existing data set?

Please register with datahub.io before editing or adding any packages.

Please confirm that your data set does not already exist on datahub.io before adding a new data set.

Add or edit your data set and describe it with the following minimum required information:

name (a unique id)

title

URL

number of triples

links to other data sets.

Please tag newly added data sets with lod.

If you are not aware of any in- or outlinks, tag it with lodcloud.nolinks.

Please provide as much additional information as possible (e.g. SPARQL endpoint, voiD description, license, and the topic of the data set) as described below. This information helps the community to know more about the development state of the Web of Linked Data and is made available via the datahub.io API.
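As a rough aid for checking a description before submitting it, the minimum requirements and the two tagging rules above could be expressed as follows. The dictionary keys (`name`, `title`, `url`, `triples`, `links`) are hypothetical names chosen for this sketch and are not necessarily the field names used by datahub.io.

```python
# Keys are illustrative; "links" is handled separately because an empty
# value is allowed and results in the lodcloud.nolinks tag instead.
REQUIRED_FIELDS = ("name", "title", "url", "triples")


def missing_fields(description: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if description.get(f) in (None, "")]


def minimum_tags(description: dict) -> list:
    """Every new data set gets 'lod'; add 'lodcloud.nolinks' if no links are known."""
    tags = ["lod"]
    if not description.get("links"):
        tags.append("lodcloud.nolinks")
    return tags
```

A description with no known links, such as `{"name": "example-dataset", "title": "Example", "url": "http://example.org/", "triples": 1000, "links": {}}`, has no missing fields and would be tagged with both `lod` and `lodcloud.nolinks`.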

Custom datahub.io fields

Number of RDF links pointing at data set with Data Hub ID xxx (http://thedatahub.org/dataset/xxx). Please provide separate links xxx statements for each data set you are linking to.

20000
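A minimal sketch of producing one such statement per target data set is shown below. Two things here are assumptions rather than facts from this page: the key is written as `links:` followed by the Data Hub ID (the exact separator is not given above), and the key/value dictionaries mirror the way CKAN-style catalogs commonly represent custom fields.

```python
def link_fields(outlinks: dict) -> list:
    """Build one custom field per target data set.

    `outlinks` maps a Data Hub ID to the number of RDF links pointing at it.
    The "links:<id>" key format is an assumption made for this sketch.
    """
    return [
        {"key": f"links:{target}", "value": str(count)}
        for target, count in sorted(outlinks.items())
    ]
```

Calling `link_fields({"dbpedia": 20000})` yields a single field with key `links:dbpedia` and value `"20000"`, matching the example value above.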

Enhanced Information

Please provide the following additional information about your data set. This information helps the community to know more about the development state of the Web of Linked Data and is made available via the datahub.io API.

datahub.io resource links

Links (other than dereferenceable URIs) that enable alternative access to the data set (e.g., via downloads or SPARQL endpoints) should be specified in the Resources section of the CKAN entry form. Please also provide links to the voiD description or Semantic Web Sitemap describing your data set.

| Purpose | Format | Description |
| --- | --- | --- |
| Download page | (none) | Download |
| XML Sitemap | meta/sitemap | XML Sitemap |
| SPARQL endpoint | api/sparql | SPARQL endpoint |
| voiD file | meta/void | voiD description |
| RDF/XML download | application/rdf+xml | Download |
| Turtle download | text/turtle | Download |
| N-Triples download | application/x-ntriples | Download |
| N-Quads download | application/x-nquads | Download |
| RDF Schema | meta/rdf-schema | Download link to the RDF/OWL Schema used by your data set (in addition to having dereferenceable vocabulary URIs) |
| RDF/XML example link | example/rdf+xml | Link to an example data item within your data set (RDF/XML) |
| Turtle example link | example/turtle | Link to an example data item within your data set (Turtle) |
| N-Triples example link | example/ntriples | Link to an example data item within your data set (N-Triples) |
| HTML+RDFa example link | example/rdfa | Link to an example data item within your data set (RDFa) |
| Vocabulary Mappings, e.g., OWL, RDFS, RIF, R2R | mapping/<format> | If your data set uses proprietary vocabulary terms and you know these terms also exist in other vocabularies, you should set owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, and/or rdfs:subPropertyOf links pointing at these terms, or provide mappings expressed as RIF rules or using the R2R Mapping Language. If your mappings can be downloaded as a single file, please provide the link to the download. |
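The format values listed above can be kept in a small lookup so that resource entries are filled in consistently. This is a sketch under assumptions: the short keys such as `sparql` or `void` are hypothetical labels invented here, and the `url`/`format`/`description` dictionary shape is assumed to match the fields of the resource entry form.

```python
# Format strings are copied from the list above; the short keys on the
# left are hypothetical labels used only in this sketch.
RESOURCE_FORMATS = {
    "sitemap": "meta/sitemap",
    "sparql": "api/sparql",
    "void": "meta/void",
    "rdfxml_download": "application/rdf+xml",
    "turtle_download": "text/turtle",
    "ntriples_download": "application/x-ntriples",
    "nquads_download": "application/x-nquads",
    "rdf_schema": "meta/rdf-schema",
    "rdfxml_example": "example/rdf+xml",
    "turtle_example": "example/turtle",
    "ntriples_example": "example/ntriples",
    "rdfa_example": "example/rdfa",
}


def mapping_format(fmt: str) -> str:
    """Format value for a vocabulary mapping file, e.g. mapping/r2r."""
    return f"mapping/{fmt.lower()}"


def make_resource(kind: str, url: str, description: str = "") -> dict:
    """Build one resource entry, rejecting unknown resource kinds."""
    if kind not in RESOURCE_FORMATS:
        raise ValueError(f"unknown resource kind: {kind}")
    return {"url": url, "format": RESOURCE_FORMATS[kind], "description": description}
```

For instance, `make_resource("sparql", "http://example.org/sparql", "SPARQL endpoint")` produces an entry whose format is `api/sparql`; the URL is a placeholder.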

datahub.io tags

Please use the following tags to provide meta-information about your data set.

We will use the topic information to color the LOD cloud later.

Please also list the vocabularies used by your data set so that the community can get an overview of which vocabularies are commonly used on the Web of Linked Data.

Linked Data published on the Web should be as self-describing as possible in order to make it easier for clients to understand and use the data. Important aspects of self-descriptiveness are making vocabulary terms dereferenceable according to the best practices described in Publishing RDF Vocabularies, using terms from common vocabularies, and providing vocabulary mappings for proprietary vocabulary terms. To give the community an overview of which data sets implement these best practices, please tag your data set accordingly.

| Tag | Purpose |
| --- | --- |
| format-<prefix> | A vocabulary used by the data set, e.g., format-skos, format-dc, format-foaf. Use http://prefix.cc/ to find a prefix for a vocabulary. If a vocabulary is not in prefix.cc, then add it there or ignore that vocabulary. |
| no-proprietary-vocab | Indicates that your data set does not use a proprietary vocabulary (defined within your top-level domain). |
| deref-vocab, no-deref-vocab | Indicates whether the proprietary vocabulary terms used by your data set (the ones that are defined within your top-level domain) are dereferenceable according to the best practices for Publishing RDF Vocabularies. |
| | Indicates whether the data set provides provenance meta-information (creator of the data set, creation date, maybe creation method) as document meta-information or via a voiD description, for instance, using the dc:creator or dc:date properties. |
| license-metadata, no-license-metadata | Indicates whether the data set provides licensing meta-information as document meta-information or via a voiD description, for instance, using the dc:rights property. |
| published-by-producer, published-by-third-party | Indicates whether the data set is published by the original data producer or a third party. |
| limited-sparql-endpoint | Indicates that the SPARQL endpoint does not serve the whole data set. |
| lodcloud.nolinks | Dataset has no external RDF links to other datasets. |
| lodcloud.unconnected | Dataset has no external RDF links to or from other datasets. |
| lodcloud.needsinfo | The data provider or dataset homepage does not provide the minimum information (and the information can't be determined from the SPARQL endpoint or downloads). |
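Choosing among the paired tags above is mechanical once a few facts about the data set are known. The following sketch derives a tag list from such facts; the function names and parameters are hypothetical, and only tags that are named on this page are emitted.

```python
def vocabulary_tags(prefixes) -> list:
    """One format-<prefix> tag per vocabulary prefix (as listed on prefix.cc)."""
    return [f"format-{p}" for p in sorted({p.lower() for p in prefixes})]


def practice_tags(uses_proprietary_vocab: bool,
                  vocab_dereferenceable: bool,
                  has_license_metadata: bool,
                  published_by_producer: bool,
                  sparql_is_partial: bool) -> list:
    """Pick one tag from each pair described above."""
    tags = []
    if not uses_proprietary_vocab:
        tags.append("no-proprietary-vocab")
    else:
        # The deref tags only describe proprietary vocabulary terms.
        tags.append("deref-vocab" if vocab_dereferenceable else "no-deref-vocab")
    tags.append("license-metadata" if has_license_metadata else "no-license-metadata")
    tags.append("published-by-producer" if published_by_producer
                else "published-by-third-party")
    if sparql_is_partial:
        tags.append("limited-sparql-endpoint")
    return tags
```

Emitting a deref tag only when a proprietary vocabulary is in use is a design choice of this sketch, based on the tag descriptions referring specifically to proprietary vocabulary terms.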