Use Cases:

As agreed during the break-out session, we would like to collect further use cases as a basis for discussing and validating the approaches proposed for enabling precise citation of arbitrary subsets of dynamic data.

To contribute a use case, please use the template provided below (please COPY the template before filling it in, so that those coming after you also have a template from which to start).

Use Case Name

Institution: where is this use case located? (name, URL)

Scenario:

Domain: what kind of data is being collected, and for which purpose/activity? Who is using the data?

Data Characteristics: how much data is being collected, and at what rate is it growing? Batch ingest or continuous stream? Static data, append-only, or corrections of existing data?

UK Data Archive

Domain: The UK Data Archive leads the Economic and Social Research Council’s (ESRC) flagship national UK Data Service which is a comprehensive resource to support researchers, teachers and policymakers who depend on high-quality social and economic data. We provide a single point of access to a wide range of secondary data including large-scale government surveys, international macrodata, business microdata, qualitative studies and census data from 1971 to 2011. All are backed with extensive support, training and guidance to meet the needs of data users, owners and creators. We promote data sharing to encourage the reuse of data, and provide expertise in developing best practice for research data management.

Data Characteristics: our data are held as ‘collections’ which correspond to studies, a unique piece of fieldwork, or a data compilation (e.g. a digitisation project). Examples include cross-sectional surveys, longitudinal surveys over time, a qualitative research project that yields a set of interviews and images, or a historical database. We have over 6500 collections, and some are major UK survey series or censuses dating back to the 1960s. We prepare a single catalogue record for each ‘collection’, which has about 30 elements; the record is compliant with the Data Documentation Initiative (DDI) and is also exposed via OAI.

Type: numeric, structured, unstructured text

Storage: metadata are stored as XML files and data are held in preservation-friendly formats on a hierarchical preservation system

Access: data are accessed primarily through web download via the catalogue, as zip bundles in specific user-oriented formats per collection. UK Federation authentication is required for most collections, but some are completely open. Standard formats are SPSS, Stata and RTF, with some MP3. We have a few online data browsing systems that enable direct search, browsing and graphing of data held on servers, some with Shibboleth authentication.

Current citation approach: each record is assigned a DataCite DOI when published (of the form 10.5255/UKDA-SN-3314-1). We have a distinct methodology for assigning and versioning DOIs. Basically we distinguish between low- and high-impact changes, with high-impact changes prompting a new DOI (with an increment -1, -2 etc.). The DOI resolves to a jump page which lists the history of changes (see 10.5255/UKDA-SN-7037-3). Access is only provided for the most current version, as changes have often been made due to errors, or updates that make the older versions inadvisable to use. We also have no requests for older versions, as most users are looking for the most up-to-date information. Our low-impact changes, which do not prompt a change of DOI, include correcting typos or other small changes in labels. Higher-impact changes include the addition or removal of a variable or significant new documentation. We use the APA citation format: Office for National Statistics. Social Survey Division and Northern Ireland Statistics and Research Agency. Central Survey Unit, Quarterly Labour Force Survey, January - March, 2012 [computer file]. 3rd Edition. Colchester, Essex: UK Data Archive [distributor], November 2013. SN: 7037, http://dx.doi.org/10.5255/UKDA-SN-7037-3
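The low/high-impact versioning rule described above can be sketched in a few lines. This is a minimal illustration, not UK Data Archive code; the change-type names are invented for the example, and only the DOI-suffix convention (10.5255/UKDA-SN-&lt;study&gt;-&lt;edition&gt;) comes from the text.

```python
# Illustrative sketch: bump the edition suffix of a DataCite DOI of the form
# 10.5255/UKDA-SN-<study>-<edition> only for high-impact changes.
HIGH_IMPACT = {"variable_added", "variable_removed", "new_documentation"}
LOW_IMPACT = {"typo_fix", "label_change"}

def next_doi(doi, change):
    """Return the DOI to use after applying `change` to the dataset."""
    if change in LOW_IMPACT:
        return doi                             # low impact: DOI unchanged
    if change not in HIGH_IMPACT:
        raise ValueError("unknown change type: %s" % change)
    prefix, _, edition = doi.rpartition("-")   # split off the edition number
    return "%s-%d" % (prefix, int(edition) + 1)  # high impact: new edition

print(next_doi("10.5255/UKDA-SN-7037-2", "variable_added"))
# -> 10.5255/UKDA-SN-7037-3
```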
We also run an EPrints data repository (http://reshare-integration.ukdataservice.ac.uk/) for longer-tail research data, which assigns DOIs at the point of publishing, but we have not yet agreed on how to show changes in versions, or whether to allow access to older versions.

For one of our online data access systems, QualiBank (http://discover.ukdataservice.ac.uk/QualiBank), which enables searching and browsing of qualitative data, interviews and open-ended questions, we have introduced object- and paragraph-level citation, again using the APA citation format. We are not using DOIs here, as each object comes from a higher-level UK Data Service collection (already using the DOI method above), but we will consider assigning object-level DOIs in this system. We have structured metadata for each citation, which makes use of system-level GUIDs. When paragraphs are selected from a web page displaying text, the GUIDs are aggregated into a new citation object stored in a live citation database. The process works as follows:

Each text fragment has a persistent GUID (prefixed with "q-" in the QualiBank system). When the user selects one or more fragments in the UI, these fragment GUIDs (and the parent text document GUID) are assembled by JavaScript into an HTTP GET request. This invokes an XQuery on the citation XML database via a RESTful API, passing the text fragment GUIDs plus the parent text document GUID as parameters in the URL. The XQuery:

- Creates a persistent citation identifier and concatenates a citation URL
- Looks up the DOI of the parent dataset of the text document GUID
- Looks up other DDI 2.5 metadata associated with the dataset
- Concatenates readable citation text using the above values
- Inserts an XML citation record into the database (including all of the original text fragment GUIDs)
- Returns a JSON response to the UI, including citation text for the user to cite; this includes the DOI of the dataset and a citation URL that enables a user to return to and highlight the relevant text fragments later

It is important to note that the DOI itself is *not* generated by this process: it is retrieved as a pre-existing identifier for the parent dataset, and (in this system at least) the DOI is *not* the same as the citation identifier.
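The first step of the flow above, assembling the fragment GUIDs into an HTTP GET request, can be sketched as follows. The endpoint path and parameter names here are assumptions for illustration, not the real QualiBank API.

```python
# Hypothetical sketch of the QualiBank citation request: the UI collects the
# parent document GUID and the selected fragment GUIDs into a GET URL that
# invokes the XQuery service. Endpoint and parameter names are invented.
from urllib.parse import urlencode

def citation_request_url(base_url, doc_guid, fragment_guids):
    """Assemble the GET request the UI's JavaScript would send to the
    RESTful XQuery service."""
    params = [("doc", doc_guid)] + [("fragment", g) for g in fragment_guids]
    return "%s?%s" % (base_url, urlencode(params))

url = citation_request_url(
    "https://example.org/qualibank/cite",        # assumed endpoint
    "q-doc-1234",
    ["q-frag-0001", "q-frag-0002"],
)
print(url)
# -> https://example.org/qualibank/cite?doc=q-doc-1234&fragment=q-frag-0001&fragment=q-frag-0002
```

The service would then resolve the GUIDs, build the citation record, and return the citation text and citation URL as JSON, as described above.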

VAMDC - Atomic and Molecular Data

Domain: VAMDC is a worldwide e-infrastructure (built on a European FP7 project) which federates interoperable Atomic and Molecular (A+M) databases. A+M data are of critical importance across a wide range of applications such as astrophysics, atmospheric physics, fusion, plasma and lightning technologies, environmental sciences, health and clinical science (e.g. radiotherapy).

Data Characteristics: VAMDC federates 41 heterogeneous databases (http://portal.vamdc.org/vamdc_portal_test/nodes.seam). In VAMDC jargon, each federated database is a Node. Each partner is in charge of curating its node and decides independently on its growth rate, its ingestion system, and the corrections to apply to already-stored data. The VAMDC infrastructure can therefore grow in two ways: each node can grow (independently), and new nodes can join the federated infrastructure.

Type, storage and access: all the A+M data shared through the VAMDC infrastructure are part of the use case. Regardless of the technology each database uses for storing data (a node may use SQL databases, NoSQL stores or even text files), each node implements the VAMDC access/query protocols and returns results formatted in a standardized XML format called XSAMS. All the standards that the nodes have to satisfy are specified at http://www.vamdc.eu/standards. The nodes are thus accessible in a single, unified way. A web interface is available (http://portal.vamdc.eu/vamdc_portal_test/home.seam) and the infrastructure is also accessible using standalone software. Ad-hoc libraries (Java and Python) are provided for integrating access to VAMDC into third-party software.
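A node query under this model can be sketched as a simple URL construction. The node base URL below is invented, and the exact endpoint layout is an assumption based on the description above; the authoritative definitions are the standards at http://www.vamdc.eu/standards.

```python
# Sketch of a VAMDC-style synchronous query URL asking a node to return
# results in the standardized XSAMS XML format. Node URL and endpoint
# layout are assumptions for illustration.
from urllib.parse import urlencode

def xsams_query_url(node_base, query):
    """Build a query URL against one federated node; every node answers
    the same protocol regardless of its internal storage technology."""
    params = urlencode({"LANG": "VSS2", "FORMAT": "XSAMS", "QUERY": query})
    return "%s/sync?%s" % (node_base, params)

url = xsams_query_url(
    "http://node.example.org/tap",                 # assumed node base URL
    "SELECT ALL WHERE AtomSymbol = 'Fe'",          # illustrative query
)
print(url)
```

Because every node implements the same access protocol, the same function works against any node in the federation; only `node_base` changes.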

Current citation approach: The XML file returned by every node has a special field for storing the references of the paper where the data were first presented. If this information is available in the source database, the user will get the bibliographic information. This approach has some drawbacks: on the one hand, the VAMDC infrastructure itself is not cited; on the other hand, an XSAMS file can contain hundreds of atomic transitions, each related to a different paper, and for a scientist this amount of information may suppress the citation reflex.

Ideal way of citing: We are looking for a citation system that puts the infrastructure at the core of the citation process, perhaps a kind of DOI corresponding to the pair (request, obtained result), which could be cited in articles at the same level as classic scientific papers.

Other aspects: We are clearly in the domain of dynamic data. However, with our distributed node approach, it is very hard to enforce versioning on every federated database, so we are considering some sort of time-stamp. Another important point is the 'scientific workflow' aspect: we provide access to the VAMDC infrastructure from within high-level software. A typical example is SpecView, a Java tool for visualizing spectra: once you have loaded a given spectrum, you can access VAMDC to identify a specific line. In this process the access to VAMDC is very fluid and mostly hidden from the user, who simply gets the information they want. At the end of the process they feel as if they have used a tool (like, e.g., Matlab) and do not think about citing VAMDC. How can we 'propagate' the bibliographic information through a scientific workflow?

LNEC - Infrastructure Sensor Network Data

Domain: A monitoring service for critical infrastructures that uses heterogeneous networks of sensors measuring characteristics of hydroelectric dams. Dams are especially important for energy production, but they are also a risk for the environment; hence they require constant monitoring and reporting. Naturally, dams are built for the long term and cannot easily be repaired, removed or altered. More than 30 different sensor types therefore gather data on a regular basis. They measure and monitor the deformation of the structure, rainfall, water levels inside the dam, temperature, humidity and many other factors that can influence the performance of the dam.

Data Characteristics: Complexity, amount, coverage and data collection frequency vary considerably, as the sources can be either collected manually by employees or in an automated fashion.
All sensor data is collected by LNEC in a central database. A web portal allows researchers, scientific and maintenance staff to generate reports from the data sources that aggregate the measurement data. This data is then used for generating monitoring reports, including tables, graphs and all details required by users to stay informed about the status of the structure during the selected period. Once inserted, the data is static; it arrives either continuously or in batch mode. The database schema includes:
- 31 tables for manual sensors
- 31 tables for readings of manual sensors
- 31 tables for results of manual sensors
- 25 tables for automatic sensors
- 25 tables for results of automatic sensors

Type: Sensor data

Storage: Oracle Database

Access: Custom Web interface that provides forms for end users where they can specify which datasets are relevant for a specific report.

Current citation approach: entire set plus textual description

Ideal way of citing: With each subset selected, users should get a persistent identifier that allows them to retrieve the same data set as used for the respective report.

MBLWHOI Library and BODC

The MBLWHOI Library has focused on data associated with published articles. The BODC has focused on snapshots of specially chosen datasets that are archived using rigorous version management. Both repositories assign DOIs to datasets.

Infrastructure: The MBLWHOI Library uses the DSpace platform. BODC developed their own database.

Million Song Database (MSD)

Domain: The MSD is the largest benchmark dataset currently in use in the music retrieval community. A set of 1 million songs has been identified as a collection. For this set of songs (which cannot be distributed for copyright reasons), several feature sets are made available, capturing different characteristics of the audio (beat, sound texture, ...). Furthermore, annotations are provided as metadata, which in turn are used, e.g., as class labels for classification experiments (e.g. genre classification).

Data Characteristics: slightly fewer than 1 million audio titles described by metadata (not all files were still available at the time of collection). For these, additional features (numeric descriptors, roughly between 10 and 1,500 values per title per feature set) are calculated at irregular intervals, usually resulting in incomplete feature sets (not all features can be extracted from all audio files, e.g. due to length or encoding issues). Furthermore, improved versions of features may be computed, "replacing" old ones if an algorithm has been found to be erroneous. Individual audio files that were truncated during download may also be replaced by corrected versions, with the respective features extracted and added to the database, leading to corrections of earlier data.

Type: numeric values, float, structured

Storage: metadata and features stored in a RDBMS, no timestamping at the moment, versioning performed manually when a set of features is being extracted on a modified set of audio recordings or using a different version of a feature extractor

Current citation approach: researchers refer to the benchmark data by citing the MSD and the actual feature vector file used, providing the URL of the file, usually followed by a verbal description of which subset of the vector file has been used (in most cases researchers use only a subset of the entire vector file, filtering out empty/short audio segments, noise elements, sound samples and so on). A workbench for creating such filtered subsets has been developed and is undergoing final tests, to be released before the end of the year.

Ideal way of citing: Ideally, researchers would be allowed to perform the desired filtering and then, when downloading the resulting vector file, be able to cite the exact subset selected.
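One way to make such a filtered subset citable, sketched below purely as an illustration (the URL, filter fields and record structure are all invented), is to record the filter criteria together with a checksum of the resulting vector file, so the exact subset can later be verified against a re-download.

```python
# Illustrative sketch: a subset citation record combining the source vector
# file URL, the filter criteria applied, and a checksum of the filtered
# result. All names and the structure are assumptions, not MSD tooling.
import hashlib

def subset_citation(vector_file_url, filter_criteria, subset_bytes):
    """Describe a filtered subset verifiably: source + filter + checksum."""
    return {
        "source": vector_file_url,
        "filter": filter_criteria,
        "sha256": hashlib.sha256(subset_bytes).hexdigest(),
    }

record = subset_citation(
    "http://example.org/msd/features/timbre.arff",   # assumed file URL
    {"min_length_sec": 30, "exclude": ["noise", "samples"]},
    b"...filtered vector data...",                   # the filtered file bytes
)
print(record["sha256"])
```

A downloader could then recompute the checksum after applying the recorded filter and confirm it matches the cited subset.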

Earth System Grid Federation (ESGF)

Domain: Earth System modelling data, results of the globally coordinated CMIP experiments also used for the IPCC Assessment Reports (current phase CMIP5, completed; upcoming CMIP6, in planning phase right now).

Data Characteristics: The data format is netCDF, but content is very heterogeneous due to the large number of individual modelling centres and models contributing. Large number of files; volume of ca. 2 PB in total (much more expected in the future), distributed through a global data infrastructure (ESGF). One major hierarchical organization scheme is present; additional identification systems are used in subcomponents (metadata IDs, internal file tracking). The federation consists of nodes with a high degree of sovereignty retained by their local institutions. Data files are frequently re-published as computation and processing errors are fixed by data producers. There is no widely accepted definition of, or criteria for, a "stable version". An existing tool suite and infrastructure are in place for dealing with XML metadata documents.

Type: digital-born simulation data from more than two dozen modelling centers around the globe.

Current citation approach: DOIs are assigned by WDCC to high levels of the hierarchy after data transition into the long-term archive. Citation is only possible for very large sets in the long-term archive. Versioning is addressed by having a fixed deadline (end of project) for artificially stable versions.

Ideal way of citing: Citation along multiple dimensions - some users want to cite a set of variables stretching across multiple simulations, others may cite a specific ensemble of associated simulation runs; current single hierarchy does not represent all user needs, in fact, multiple hierarchies are conceivable. There are also all kinds of possible combinations with temporal slicing, though that comes second to the aforementioned issues.

DOBES Language Archive

Domain: Transcriptions of linguistic events, e.g. audio/video interviews or recordings of natural language use. The data is used by researchers studying a specific language (e.g. to create a grammar or lexicon) or by researchers comparing languages (typology) or studying the link between culture and language.

Data Characteristics: The data is manually created by the researchers and assistants, involving many hours of work (so the growth is rather slow). In the Dobes collection (http://hdl.handle.net/1839/00-0000-0000-0001-305B-C@view) there are currently about 8000 transcription files, most of them in the ELAN annotation format (.eaf, XML-based). Updates to files are most often extensions and added details, sometimes corrections.

Type: The changing data are annotations, relatively small XML files that require a lot of manual work (and expertise) to be created/updated.

Current citation approach: Most researchers cite either:
- a subtree of the complete archive, with a handle that resolves to a web application for browsing through the data
- a web application that visualises the audio/video recording together with the textual transcription

Ideal way of citing: A persistent link (handle) to a "frozen" virtual collection (snapshot at the time of referral), with the possibility to go to the most recent version of all resources.

Global Biodiversity Information Facility (GBIF)

Domain: The GBIF network provides access to the world’s largest aggregated database of standardised species occurrence records. It is used to understand distribution, status and trends in the world’s biodiversity by scientists as well as the general public. For a list of scientific publications citing use of GBIF data, see http://www.gbif.org/mendeley/usecases.

Data Characteristics: There are currently some 430 million occurrence records in the GBIF data cache, covering nearly 1.5 million species and sourced from over 14,000 datasets and 580 data publishers (http://www.gbif.org/occurrence). Content is published by institutions on a dataset-by-dataset basis using one of a handful of protocols, and then periodically harvested into a central index. Harvesting peaks at around 500 records per second and runs either by paging over responses using XML processes, or by a more streamlined batch extract, transfer and load. Data are mutable, and indexing updates records wherever possible; a challenge remains in that not all publishers provide stable record identifiers, so data is periodically deleted.

Storage: Data is stored in publishing institutions in RDBMS / CSV files, transferred as XML / CSV and then processed into a column oriented database (HBase / Hadoop). Data are mutable, and timestamps are in place for most recent edits.

Current citation approach: Data retrieved from the GBIF network should be cited according to the "dataset citation provided by the publisher", as shown on the dataset or occurrence page on the GBIF portal. If the publisher-provided citation is either missing or incomplete, the user should observe the "default citation" given on the dataset or occurrence page. The default citation comprises a list providing the publisher/institution name and dataset name for each data set in the download. This is provided in the download zip file containing the data.

Ideal way of citing: The ideal way of citing presupposes that each data set has a Digital Object Identifier and each record within a data set has a persistent identifier. The GBIF portal could then issue a DOI for citations generated at query time. The citation would be held indefinitely and make use of the dataset DOIs. However, as not all records downloaded will be used in research, publications, etc., a second mechanism is also proposed whereby the GBIF portal allows a user to upload a collection of dataset identifiers, or record identifiers, and construct a citation; the citation would receive a DOI and be kept indefinitely.
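The "default citation" described above, a list of publisher and dataset names shipped with the download, can be assembled with a few lines. The record structure and wording here are assumptions for illustration, not GBIF's exact format.

```python
# Minimal sketch of assembling a GBIF-style default citation from the
# per-dataset publisher/name list included in a download. The record
# structure and sentence template are assumptions.
def default_citation(datasets, access_date):
    """Join one 'Publisher: Dataset' entry per data set in the download."""
    parts = ["%s: %s" % (d["publisher"], d["name"]) for d in datasets]
    return "; ".join(parts) + ". Accessed via GBIF.org on %s." % access_date

print(default_citation(
    [{"publisher": "Museum X", "name": "Bird occurrences"},
     {"publisher": "Herbarium Y", "name": "Plant specimens"}],
    "2014-11-01",
))
```

Under the proposed DOI mechanism, the same record list would instead be registered once, receive a DOI, and be kept indefinitely.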

Other aspects: Deployment of DOIs for data sets; all records have a stable, persistent identifier.

Data Characteristics: Depends on the exact data product discussed. The raw data rate coming off each of two satellites is between 6 and 10 Mbps, and has been for the last 12+ years. The raw data is segmented into 5-minute data files, which are then processed into a variety of Level 1, 2 and 3 data products. Data products, often called data sets, grow over time as data is appended to them. Occasionally the whole time series is reprocessed (the land/snow products are currently on Version 5, with processing of Version 6 about to start).

Type: Should probably start with a single data product

Storage: Data is stored as files available via ftp or through a variety of web services. Metadata about each file is stored as XML, a subset of which is in a database for query access. Metadata for each data set is available from a variety of repositories.

Access: ftp, plus a variety of web services (e.g., OpenSearch).

Current citation approach: Each data set has its own citation; subsets are represented using the ESIP data citation approach.

Ideal way of citing: Hopefully the ESIP approach is good enough, but that hasn't really been tested.

iRODS Data Grids

• Type: The approach is to encapsulate the knowledge needed to interact with a remote data repository in a micro-service (basic function), encapsulate the knowledge needed to parse a data format in a second micro-service, and then chain the micro-services together to create a workflow that retrieves and processes the data types. The workflows can be registered in the iRODS data grid, where they can be shared, re-executed and modified.

• Storage: Files are managed by the iRODS data grid, and stored on local storage systems, tape archives, etc. Files can be versioned, time-stamped and replicated under the control of data management policies. Policies include retention, disposition, time-dependent access controls, metadata extraction and registration, caching, staging, and archiving.

• Current citation approach: Within the data grid, every file is given a logical path name (membership in a logical collection that can span multiple storage systems). Files can be referenced by a persistent identifier (Handle system), logical file name, iRODS URL, iRODS ticket (which enforces access controls), and through queries on descriptive metadata. Citation via persistent identifier is typically reserved for published data. Data that is essential for collaboration on projects is typically identified by logical file name.

• Ideal way of citing: A major challenge in citation is the persistence of the ID resolver. The persistent identifier is only valid as long as the resolver for the identifier is maintained. This implies a national registration number, such as Library of Congress ISBN. However this establishes identity, not location or access control. To find a valid copy, one would do a search on a linked-data catalog. It will be important to separate identity from location and management controls.
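The micro-service chaining described in the Type bullet above can be sketched generically. These are plain Python stand-ins, not the actual iRODS rule language or API: one stub knows how to talk to a remote repository, a second knows how to parse a data format, and a small combinator chains them into a reusable workflow.

```python
# Generic sketch of micro-service chaining (not iRODS code): each
# micro-service is a single function, and a workflow is their composition.
from functools import reduce

def fetch_from_repository(url):
    """Micro-service 1 (stub): knows how to retrieve data from a repository."""
    return "raw-bytes-from:%s" % url

def parse_format(raw):
    """Micro-service 2 (stub): knows how to parse the retrieved data format."""
    return {"parsed": raw}

def chain(*services):
    """Compose micro-services, in order, into one reusable workflow."""
    return lambda x: reduce(lambda acc, service: service(acc), services, x)

workflow = chain(fetch_from_repository, parse_format)
result = workflow("irods://example.org/data/file1")   # hypothetical path
print(result)
```

Registering such a workflow in the grid then amounts to storing the chain so it can be shared, re-executed and modified, as the text describes.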

NERC ECN citing dynamic monitoring data

Domain: Long-term environmental monitoring from automatic and manual recording across the UK

Data Characteristics: Data have been collected for 20 years, at frequencies from hourly through to annual, covering physical, chemical and biotic measures from 50+ sites. Mainly a continuous stream of data, with monthly summaries published via the Web. Occasionally corrections are made for equipment drift etc.

Type: Numerical measurements either from automatic analysis or surveyor recording

Storage: All site data and measurements stored in single RDBMS schema

Access: Mainly a Web interface, including a resolution service for accessing specific site information

Current citation approach: PID for queries used to subset data from the database. Also DataCite DOIs for referencing selected snapshots

Ideal way of citing: A PID that identifies a timestamped subset of the database, so that updates and corrections can be allowed for.
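The timestamped query PID described above can be sketched as follows: normalise the subsetting query, pair it with the execution timestamp, and hash the pair into a stable key that a resolver could map to a PID. All names here are illustrative, and the normalisation shown is deliberately trivial.

```python
# Sketch of a PID for a timestamped database subset: hash of the
# normalised query plus the query execution time. Names are invented.
import hashlib

def query_pid(sql, timestamp):
    """Derive a stable key for (query, timestamp); re-running the same
    query against the timestamped database state yields the same subset."""
    normalised = " ".join(sql.lower().split())    # trivial normalisation
    digest = hashlib.sha256(("%s@%s" % (normalised, timestamp)).encode()).hexdigest()
    return "ecn-query-%s" % digest[:16]           # short, prefix-tagged key

pid = query_pid(
    "SELECT * FROM measurements WHERE site = 'Moor House'",  # hypothetical
    "2014-10-01T12:00:00Z",
)
print(pid)
```

Because the key includes the timestamp, later corrections to the underlying records do not silently change what the PID refers to, which is exactly the property the section asks for.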

Other aspects: Currently, these data are owned by different organisations with differing licensing arrangements

Contact info: John Watkins (jww at ceh dot ac dot uk)

UK Butterfly Monitoring Scheme annual species metrics

Data Characteristics: This dataset provides linear trends, over varying time periods, for the UK Butterfly Monitoring Scheme (UKBMS) Collated Indices of individual butterfly species across the UK. The main statistical values derived from a linear regression (slope, standard error, P-value) are presented for the entire time series for each species (1976 to 2012), for the last 20 years, and for the last decade.

Type: Statistical values of trend derived from citizen-recorded species occurrence data over decades. Trend values are updated each year as new data is processed.
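The trend statistics reported above (slope and standard error from a linear regression; the P-value would additionally need the t-distribution, e.g. from scipy.stats) can be derived as follows. The index values below are invented for illustration, not UKBMS data.

```python
# Minimal sketch of deriving linear-trend statistics from a species'
# Collated Index series. Data values are invented for illustration.
import math

def trend(years, index):
    """Ordinary least squares: return (slope, standard error of slope)."""
    n = len(years)
    mx = sum(years) / n
    my = sum(index) / n
    sxx = sum((x - mx) ** 2 for x in years)
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, index))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, index))
    stderr = math.sqrt(sse / (n - 2) / sxx)       # residual variance / Sxx
    return slope, stderr

years = [2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
index = [1.00, 0.95, 0.97, 0.90, 0.88, 0.85, 0.86, 0.80, 0.78, 0.75]
slope, stderr = trend(years, index)
print("slope=%.4f stderr=%.4f" % (slope, stderr))
```

Since the trend values are recomputed each year as new data arrives, a citation of these metrics needs to pin down which annual release of the Collated Indices the regression was run on.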

Environmental Systems Research Institute (Esri)

Scenario: ArcGIS Online is a cloud-based geospatial content management system for creating interactive web maps and apps that can be shared on desktops, within browsers, or on smartphones or tablets. The platform includes ready-to-use content, featuring authoritative maps, satellite imagery feeds and demographic data on hundreds of topics relating to people, Earth, and life. It also includes web-based geoprocessing services for a range of analytics, apps, and templates to enable immediate productivity.

Domain: See below for what kind of data are available and are being uploaded to ArcGIS Online. Users include research scientists in hydrology, conservation biology, forestry, geology and geophysics, climate science, ocean science, sustainability science, sociology, urban planning, geodesign, landscape architecture, agricultural science, and geographic information science

Data Characteristics: The ArcGIS Online platform currently serves 160 million requests, 200 TB of data, and 2.5 million items (e.g., maps, layers, files, apps, tools) to over 1.6 million users per DAY, and handles 4-5 billion map tile requests per MONTH. Sources for these data include private business data, government services, education data, citizen (free) usage, scientific research and more.

Type: A wide variety of data are involved, including ArcGIS Server services, OGC WMS and WMTS services, GeoRSS files, tile layers, and KML documents. Users can also add features that they created with the ArcGIS.com map viewer and/or features from a delimited text file, GPX file, or shapefile on their desktop.

Access: Web interface with downloads. With the ArcGIS Open Data app, data can be accessed and downloaded in multiple formats including spreadsheet, KML, shapefile, or via the GeoJSON or GeoService APIs.

Current citation approach: Each item in ArcGIS Online has a “home page” serving as its metadata, including a description, access and use constraints, properties and user comments. The user may include citation information here. Examples: http://arcg.is/1Eoa33u for a dataset, or http://esriurl.com/btm for a tool.

Ideal way of citing: We would like to start including DOIs, with timestamps or version information, especially as Esri has joined CrossRef and can now “mint” its own DOIs.

Other aspects: Esri also has a publishing house, Esri Press, and we will begin to upload supplementary material, including data from books, to ArcGIS Online. A book soon to be released, Ocean Solutions, Earth Solutions, is experimenting with the use of DOIs for the chapter text as well as the accompanying supplemental data on the book’s resource web site.