datapack R package overview

Overview

The datapack R package provides an abstraction for collating multiple data
objects of different types, together with metadata describing those objects, into a bundle
that can be transported and loaded using a single composite file. It is primarily meant as
a container to bundle together files for transport to or from DataONE data repositories.

The methods in this package provide a convenient way to load data from common repositories
such as DataONE into the R environment, and to document, serialize, and save data from R to
data repositories.

Create a Single Object

The datapack DataObject class is a wrapper that contains both data and system metadata that describes the data.
The data can be either R raw data or a data file, for example a CSV file. The system metadata includes attributes such as the
object identifier, type, size, checksum, owner, version relationship to other objects, access rules, and other critical metadata.

The following example shows how to create a DataObject from a CSV file:
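A minimal sketch of such a call is given here. It assumes a small CSV written to a temporary file stands in for a real data file; the file contents and the csvFile and myId names are illustrative, and the uuid package is assumed to be installed.

```r
library(datapack)

# Illustrative stand-in for a real data file: a tiny CSV in a temporary location
csvFile <- tempfile(fileext = ".csv")
write.csv(data.frame(site = c("A", "B"), count = c(3, 5)), csvFile, row.names = FALSE)

# Build an identifier in the UUID URN form
myId <- paste("urn:uuid:", uuid::UUIDgenerate(), sep = "")

# Wrap the file in a DataObject; "text/csv" is the format identifier for CSV data
myObj <- new("DataObject", id = myId, format = "text/csv", filename = csvFile)
```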

The DataObject myObj now contains the CSV data as well as the system-level
information describing the file, such as its identifier, type, and checksum.
The getData method can be used to extract the data content of a DataObject.
Using the example DataObject:

rawData <- getData(myObj)

This raw data can be converted back to CSV format using the R commands:
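One way to do this, sketched under the assumption that the raw vector holds the bytes of a text CSV file, is to write the bytes to disk unchanged with base R's writeBin. The rawData value below is an illustrative stand-in so that the snippet runs on its own; in practice it would be the value returned by getData(myObj).

```r
# Stand-in for the bytes returned by getData(myObj)
rawData <- charToRaw("site,count\nA,3\nB,5\n")

# Write the bytes unchanged to a new CSV file
csvOut <- tempfile(fileext = ".csv")
writeBin(rawData, csvOut)

# Inspect the result as text lines
readLines(csvOut)
```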

Alternatively, the CSV data could be converted into a data frame using standard R
functions:

df <- read.csv(textConnection(rawToChar(rawData)))
head(df)

If the data were in a format other than CSV, such as PNG, JPEG, or NetCDF, the
corresponding R packages could be used to handle the object.

Each DataObject has an identifier which can be used to refer to that object, and
is meant to be globally unique so that it can be used in data repositories such
as those from the DataONE federation. To retrieve the identifier associated with
a DataObject:

getIdentifier(myObj)

In this case, the identifier was created in the UUID format, but other identifiers
such as DOIs (Digital Object Identifiers) can also be used.
Each object is also associated with a specific format. To retrieve the format type:

getFormatId(myObj)

All system metadata information for a DataObject can be accessed directly from the SystemMetadata object contained in the DataObject:

str(myObj@sysmeta)

The system metadata contains access policy information for the DataObject that can be used
by the data repository to which the object is uploaded. For example, when a DataObject is
uploaded to a DataONE Member Node, the access policy is applied to the uploaded data
and controls access to the data on the Member Node by DataONE users.

Before the DataObject is uploaded, access can be set so that anyone can read the uploaded data:

myObj <- setPublicAccess(myObj)
myObj@sysmeta@accessPolicy

Individual access rules can also be added one at a time. The access rules are expressed
using the unique identifier for an individual, such as their ORCID identity, or whatever
form the repository supports.
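As a hedged sketch of adding a single rule, the snippet below assumes the installed datapack release provides an addAccessRule method for DataObject that takes a subject and a permission. The ORCID shown is a hypothetical placeholder, and the temporary CSV is only there to make the snippet self-contained.

```r
library(datapack)

# Illustrative DataObject built from a small temporary CSV file
csvFile <- tempfile(fileext = ".csv")
writeLines(c("site,count", "A,3"), csvFile)
myObj <- new("DataObject", format = "text/csv", filename = csvFile)

# Hypothetical subject: replace with a real ORCID or a repository-specific identity
subject <- "https://orcid.org/0000-0002-1825-0097"

# Grant that subject write permission, then inspect the resulting access policy
myObj <- addAccessRule(myObj, subject, "write")
myObj@sysmeta@accessPolicy
```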

The dataone R package can be used to upload or download DataObjects to a DataONE Member Node.
Please see the web page for the dataone R package and its
vignettes for more information.

Create a Collection of Objects

A DataPackage is a container for a set of DataObjects. DataObject is a class that is a proxy for data of any type, including traditional data like CSV, tabular data, and spatial rasters, but also for non-traditional objects like derived data, figures, and scripts in R and Python. A collection of related DataObjects can be placed in a DataPackage and actions can be performed on it, such as serializing the entire collection of objects into a package file, or uploading all package member objects to a data repository.

Figure 1 is a diagram of a typical DataPackage, showing a metadata file that
describes, or documents, the data granules that the package contains.

This example creates a DataPackage with one DataObject containing metadata and two others containing science data:
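A minimal sketch of creating the three DataObjects follows. It assumes placeholder files written to temporary locations stand in for a real EML metadata document and two real data files; the file contents and the file variable names are illustrative only.

```r
library(datapack)

# Placeholder files standing in for a metadata document and two science data files
metadataFile <- tempfile(fileext = ".xml")
writeLines("<placeholder/>", metadataFile)
csvFile <- tempfile(fileext = ".csv")
writeLines(c("site,count", "A,3", "B,5"), csvFile)
filteredFile <- tempfile(fileext = ".csv")
writeLines(c("site,count", "B,5"), filteredFile)

# Metadata object, labeled with an EML format identifier
metadataId <- "metadataId"
metadataObj <- new("DataObject", id = metadataId,
                   format = "eml://ecoinformatics.org/eml-2.1.0",
                   filename = metadataFile)

# Two science data objects
sciId <- "sciId1"
sciObj <- new("DataObject", id = sciId, format = "text/csv", filename = csvFile)
sciId2 <- "sciId2"
sciObj2 <- new("DataObject", id = sciId2, format = "text/csv", filename = filteredFile)
```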

The identifier values used in this example are simple and easily recognizable for demonstration purposes. A more standard unique identifier can be created with the uuid::UUIDgenerate() function:

myid <- paste("urn:uuid:", uuid::UUIDgenerate(), sep = "")
myid

Next a DataPackage is created and the DataObjects are added to it:

dp <- new("DataPackage")
dp <- addData(dp, do = metadataObj)
dp <- addData(dp, do = sciObj)
# The second object will be added in the next section

Information can also be extracted from the DataPackage. To show the identifiers of the DataObjects that are in the package:

getIdentifiers(dp)

To show the number of DataObjects in the package:

getSize(dp)

To extract the data in a DataObject as raw data, ask for the data using the identifier of the DataObject:

sciObjRaw <- getData(dp, sciId)

To get access to the full instance of the DataObject class representing a data object,
use the datapack::getMember function and pass in the identifier of the desired object,
which will return an instance of the DataObject class:

mySciObj <- getMember(dp, sciId)

Relationships Between DataObjects

The relationships between DataObjects in a DataPackage can be recorded in the DataPackage.
For example, a typical relationship is that a DataObject containing a metadata
document in a domain-specific format such as Ecological Metadata Language (EML)
or ISO 19139 geospatial metadata can describe, or document, DataObjects containing
associated science data. Adding relationship information about data package members
may assist a consumer of the package in better understanding the contents of the
package and how to make use of the package.

While the DataPackage can record any type of relationships that are important to
a community, we have provided functions to establish common relationships that
are needed to understand scientific data in the DataONE federation. These include
the following typical provenance properties:

cito:documents: for establishing that a metadata document provides descriptive
information about one or more associated data objects

prov:wasDerivedFrom: for asserting that a derived data object was created using
data from one or more source data objects

prov:used: for asserting that a program (such as an R script), when executed,
used one or more source data objects as inputs

prov:wasGeneratedBy: for asserting that a program (such as an R script), when executed,
generated one or more derived data objects as outputs

Figure 2. A DataPackage with provenance relationships.

Linking a metadata file with one or more data files using cito:documents

The fastest way to add the cito:documents relationship is to include the metadata object when a science data object is added to the package:
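The call can be sketched as follows, assuming addData accepts the metadata object through an mo argument. The mkFile helper and the placeholder file contents are hypothetical and exist only to make the snippet self-contained. In recent datapack releases addData appears to be deprecated in favor of addMember, which is understood to take the same do and mo arguments.

```r
library(datapack)

# Hypothetical helper: write placeholder content to a temporary file
mkFile <- function(lines, ext) {
  f <- tempfile(fileext = ext)
  writeLines(lines, f)
  f
}

metadataObj <- new("DataObject", id = "metadataId",
                   format = "eml://ecoinformatics.org/eml-2.1.0",
                   filename = mkFile("<placeholder/>", ".xml"))
sciObj2 <- new("DataObject", id = "sciId2", format = "text/csv",
               filename = mkFile(c("site,count", "B,5"), ".csv"))

dp <- new("DataPackage")
dp <- addData(dp, do = metadataObj)

# Passing the metadata object as 'mo' records the documents relationship
dp <- addData(dp, do = sciObj2, mo = metadataObj)

# List the relationships now recorded in the package
getRelationships(dp)
```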

In that example, the sciObj2 DataObject is added to the package using the addData call,
and the metadata object metadataObj is passed in to the function as well. This
tells the DataPackage that metadataId cito:documents sciId2. The cito:documents relationship is defined by the Citation Typing Ontology (CiTO).

Asserting data provenance relationships between objects

Relationships that describe the processing history of package members can be added. For example,
a program that performs a modeling calculation might read one or more source data files as inputs,
perform a calculation based on the data read, and then write a data or graphics file
characterizing the results of the model run.

The following example demonstrates how to insert provenance relationships into a DataPackage
for the R program logit-regression.R that reads the source data file binary.csv and
generates the derived image file gre-predicted.png. Using the example DataPackage for which
DataObjects for the program input and output have already been added, we create a
DataObject for the program, and call describeWorkflow to add the necessary provenance
relationships:
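A hedged sketch of these steps is shown below. It builds a fresh DataPackage from placeholder files so that it runs on its own; the placeholder contents and the mkFile helper are hypothetical, and real copies of logit-regression.R, binary.csv, and gre-predicted.png would be used in practice.

```r
library(datapack)

# Hypothetical helper: write placeholder content to a temporary file
mkFile <- function(lines, ext) {
  f <- tempfile(fileext = ext)
  writeLines(lines, f)
  f
}

dp <- new("DataPackage")

# Program input (stands in for binary.csv)
inObj <- new("DataObject", format = "text/csv",
             filename = mkFile(c("admit,gre", "0,380"), ".csv"))
dp <- addData(dp, inObj)

# Program output (stands in for gre-predicted.png)
outObj <- new("DataObject", format = "image/png",
              filename = mkFile("placeholder image content", ".png"))
dp <- addData(dp, outObj)

# The program itself (stands in for logit-regression.R)
progObj <- new("DataObject", format = "application/R",
               filename = mkFile("# placeholder script", ".R"))
dp <- addData(dp, progObj)

# Record which objects the program used and which it generated
dp <- describeWorkflow(dp, sources = inObj, program = progObj, derivations = outObj)
getRelationships(dp)
```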

Note that in this example, the R script had previously been run and generated the image
file before describeWorkflow() was called. The sources and derivations arguments
for describeWorkflow() can be lists of either DataObjects or the identifiers of DataObjects.

Inserting other (arbitrary) relationships

Other types of relationships between DataPackage member DataObjects can be recorded with the insertRelationship method. The main requirement is that each relationship to be described
needs to have a unique URI that is drawn from a controlled vocabulary like the Citation Typing Ontology described above. The cito:documents relationship is the default used by insertRelationship, so the relationship type doesn't need to be specified in this case. For example, with the example DataPackage created above, we can add the cito:documents relationship:
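A minimal sketch of both forms of the call follows: one relying on the default predicate and one naming a predicate explicitly. The identifiers match the simple ones used in this document, and the PROV wasDerivedFrom URI is given as an example of an explicit predicate. The placeholder files and the mkFile helper are hypothetical.

```r
library(datapack)

# Hypothetical helper: write placeholder content to a temporary file
mkFile <- function(lines, ext) {
  f <- tempfile(fileext = ext)
  writeLines(lines, f)
  f
}

dp <- new("DataPackage")
dp <- addData(dp, new("DataObject", id = "metadataId",
                      format = "eml://ecoinformatics.org/eml-2.1.0",
                      filename = mkFile("<placeholder/>", ".xml")))
dp <- addData(dp, new("DataObject", id = "sciId1", format = "text/csv",
                      filename = mkFile(c("site,count", "A,3"), ".csv")))
dp <- addData(dp, new("DataObject", id = "sciId2", format = "text/csv",
                      filename = mkFile(c("site,count", "B,5"), ".csv")))

# No predicate given, so the default cito:documents relationship is recorded
dp <- insertRelationship(dp, subjectID = "metadataId", objectIDs = "sciId1")

# An explicit predicate, given as a full URI from a controlled vocabulary
dp <- insertRelationship(dp, subjectID = "sciId2", objectIDs = "sciId1",
                         predicate = "http://www.w3.org/ns/prov#wasDerivedFrom")

getRelationships(dp)
```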

Describing The Contents of a DataPackage

In order to transport a DataPackage, for example to a data repository, a
description of the contents of the DataPackage is created so that the consumer
of the DataPackage can determine how to extract and process the contents.

A DataPackage can produce a standard description of its members and relationships
which conforms to the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) specification,
which is a widely used standard to describe aggregations of web accessible
resources. This OAI-ORE description is referred to as a resource map.

The serializePackage method creates a Resource Description Framework (RDF)
serialization of a resource map, written to a file in this case, that conforms
to the OAI-ORE specification.

To create a resource map for the example DataPackage:
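A hedged sketch of the call is shown here. It rebuilds a small package from placeholder files so that it runs on its own, and it assumes the uuid package is available for generating the package identifier. The mkFile helper, file contents, and variable names are illustrative.

```r
library(datapack)

# Hypothetical helper: write placeholder content to a temporary file
mkFile <- function(lines, ext) {
  f <- tempfile(fileext = ext)
  writeLines(lines, f)
  f
}

dp <- new("DataPackage")
dp <- addData(dp, new("DataObject", id = "metadataId",
                      format = "eml://ecoinformatics.org/eml-2.1.0",
                      filename = mkFile("<placeholder/>", ".xml")))
dp <- addData(dp, new("DataObject", id = "sciId1", format = "text/csv",
                      filename = mkFile(c("site,count", "A,3"), ".csv")))
dp <- insertRelationship(dp, subjectID = "metadataId", objectIDs = "sciId1")

# Write the resource map to a temporary file under a newly generated package identifier
tf <- tempfile()
packageId <- paste("urn:uuid:", uuid::UUIDgenerate(), sep = "")
serializePackage(dp, file = tf, id = packageId)
```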

This example writes to a tempfile using the default serialization format of
"rdfxml". Also, the URLs for each package member are prepended with the default
value of the DataONE resolve service, which would be the URL that could be used
to access this data object if the package is uploaded to a DataONE Member Node.

A different value to be prepended to each identifier can be specified with the
resolveURI argument. To specify that no value be prepended to the identifier
URLs, specify an empty character string (""):
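A minimal sketch of this variant is given below, again using a small placeholder package so that the snippet is self-contained. The only change from a default serialization is passing an empty string as resolveURI; the mkFile helper and file contents are hypothetical.

```r
library(datapack)

# Hypothetical helper: write placeholder content to a temporary file
mkFile <- function(lines, ext) {
  f <- tempfile(fileext = ext)
  writeLines(lines, f)
  f
}

dp <- new("DataPackage")
dp <- addData(dp, new("DataObject", id = "metadataId",
                      format = "eml://ecoinformatics.org/eml-2.1.0",
                      filename = mkFile("<placeholder/>", ".xml")))
dp <- addData(dp, new("DataObject", id = "sciId1", format = "text/csv",
                      filename = mkFile(c("site,count", "A,3"), ".csv")))
dp <- insertRelationship(dp, subjectID = "metadataId", objectIDs = "sciId1")

tf <- tempfile()
packageId <- paste("urn:uuid:", uuid::UUIDgenerate(), sep = "")

# An empty resolveURI leaves the member identifiers without a URL prefix
serializePackage(dp, file = tf, id = packageId, resolveURI = "")
```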