What’s in the package skeleton structure?

This has created a data package source tree named “mtcars20” (in a temporary directory). For a real use case, you would pick a path on your filesystem where you could then initialize a new GitHub repository for the package.

You should fill out the DESCRIPTION file to describe your data package. It contains a new DataVersion string that will be automatically incremented when the data package is built if the packaged data has changed.

The user-provided code files reside in data-raw. They are executed during the data package build process.

A few words about the YAML config file

A datapackager.yml file is used to configure and control the build process.

Where do I put my raw datasets?

Raw data (provided the size is not prohibitive) can be placed in inst/extdata.

The datapackage_skeleton() API has the raw_data_dir argument, which will copy the contents of raw_data_dir (and its subdirectories) into inst/extdata automatically.
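As a rough sketch of how raw_data_dir might be supplied, the wrapper below forwards a directory of raw files to datapackage_skeleton(). The helper name make_skeleton and its arguments are illustrative assumptions, not part of DataPackageR; the package name follows the mtcars20 example and the object name is a placeholder.

```r
# Hypothetical wrapper around datapackage_skeleton(); the helper name and
# its arguments are assumptions made for this illustration.
make_skeleton <- function(code_files, raw_dir, path = tempdir()) {
  DataPackageR::datapackage_skeleton(
    name = "mtcars20",              # package name used in this example
    path = path,                    # where the source tree is created
    force = TRUE,                   # overwrite an existing skeleton
    code_files = code_files,        # R or Rmd processing scripts
    r_object_names = "my_dataset",  # placeholder name of the object to store
    raw_data_dir = raw_dir          # contents are copied into inst/extdata
  )
}

# Example call (not run):
# make_skeleton("process.Rmd", raw_dir = "path/to/raw/files")
```

Defining the wrapper does not touch the filesystem; the skeleton is only created when the function is called with real script and directory paths.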

In this example we are reading the mtcars data set that is already in memory, rather than from the file system.

An API to read raw data sets from within an R or Rmd processing script.

As stated in the README, in order for your processing scripts to be portable, you should not use absolute paths to files. DataPackageR provides an API to point to the data package root directory and the inst/extdata and data subdirectories. These are useful for constructing portable paths in your code to read files from these locations.

For example, to construct a path to a file named “mydata.csv” located in inst/extdata in your data package source tree, use DataPackageR::project_extdata_path("mydata.csv") in your R or Rmd file.

This would return a path such as /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpiDzFKT/mtcars20/inst/extdata/mydata.csv

Paths to raw data sets that are stored externally (outside the data package source tree) can be constructed relative to project_path().
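The layout these helpers point to can be illustrated with base R alone. The sketch below is not the DataPackageR implementation: pkg_root is a hypothetical stand-in for the package root, and the "shared-data" directory is an invented example of an external location.

```r
# Hypothetical package root for illustration only; in a processing script
# the root would come from the DataPackageR path helpers instead.
pkg_root <- file.path(tempdir(), "mtcars20")
extdata_dir <- file.path(pkg_root, "inst", "extdata")
dir.create(extdata_dir, recursive = TRUE, showWarnings = FALSE)

# Write a small raw file so that the read below has something to load.
csv_path <- file.path(extdata_dir, "mydata.csv")
write.csv(head(mtcars), csv_path)

# Portable read: the path is assembled from the root, never hard-coded.
mydata <- read.csv(csv_path, row.names = 1)

# A file kept outside the source tree, addressed relative to the root.
# "shared-data" is an assumed directory name.
external_path <- file.path(pkg_root, "..", "shared-data", "big.csv")
```

Because every path is derived from a single root, the same script works unchanged when the source tree is moved or cloned elsewhere.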

YAML header metadata for R files and Rmd files.

If your processing scripts are Rmd files, the usual yaml header for rmarkdown documents should be present.

If you have R files, you can still include a yaml header, but it should be commented with #' and placed at the top of your R file. For example, a test R file in the DataPackageR package looks as follows:
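The packaged test file is not reproduced here. A minimal sketch of an R script with a commented yaml header, using placeholder title, author, and date values, might look like this:

```r
#' ---
#' title: Sample report from an R script
#' author: A. N. Author
#' date: 2018-01-01
#' ---

# Ordinary R code follows the commented header.
values <- runif(100)
```

The #' lines are plain comments to the R interpreter, so the script still runs as normal R code.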

Documenting your data set changes in NEWS.md

When you build a package in interactive mode, you will be prompted to input text describing the changes to your data package (one line).

These will appear in the NEWS.md file in the following format:

DataVersion: xx.yy.zz
========
A description of your changes to the package
[The rest of the file]

Why not just use R CMD build?

If the processing script is time consuming or the data set is particularly large, then R CMD build would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. DataPackageR decouples data processing from package building/installation for data consumers.

A log of the build process

DataPackageR uses the futile.logger package to log progress.

If there are errors in the processing, the script will notify you via logging to console and to /private/tmp/Test/inst/extdata/Logfiles/processing.log. Errors should be corrected and the build repeated.

If everything goes smoothly, you will have a new package built in the parent directory.
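A build and a follow-up look at the log could be sketched as below. Both helper names are assumptions for illustration, and the log location assumes the inst/extdata/Logfiles/processing.log layout mentioned above.

```r
# Hypothetical helper: build the data package without installing it.
build_data_package <- function(pkg_dir) {
  DataPackageR::package_build(packageName = pkg_dir, install = FALSE)
}

# Hypothetical helper: return the last n lines of the processing log,
# or an empty character vector if no log has been written yet.
read_build_log <- function(pkg_dir, n = 20) {
  log_file <- file.path(pkg_dir, "inst", "extdata", "Logfiles", "processing.log")
  if (!file.exists(log_file)) {
    return(character())
  }
  tail(readLines(log_file), n)
}

# Example calls (not run):
# build_data_package(file.path(tempdir(), "mtcars20"))
# read_build_log(file.path(tempdir(), "mtcars20"))
```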

Note that the modified configuration needs to be written back to the package source directory in order for the changes to take effect.

The consequence of toggling a file to enabled: no is that it will be skipped when the package is rebuilt, but the data will still be retained in the package, and the documentation will not be altered.

This is useful in situations where we have multiple data sets, and want to re-run one script to update a specific data set, but not the other scripts because they may be too time consuming, for example.
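One way to disable a script and write the change back is sketched here. The helper name disable_script is an assumption; it wraps the yml_find(), yml_disable_compile(), and yml_write() functions from DataPackageR.

```r
# Hypothetical helper: skip one processing script on the next build.
disable_script <- function(pkg_dir, script) {
  # Read the configuration from the package source directory.
  config <- DataPackageR::yml_find(pkg_dir)
  # Mark the script as disabled in the in-memory configuration.
  config <- DataPackageR::yml_disable_compile(config, filenames = script)
  # Write the change back to the source tree so the next build sees it.
  DataPackageR::yml_write(config, path = pkg_dir)
  invisible(config)
}

# Example call (not run):
# disable_script(file.path(tempdir(), "mtcars20"), "script1.Rmd")
```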

Multi-script pipelines.

Some data packages need multi-script pipelines. There are two ways to share data among scripts:

filesystem artifacts

data objects passed to subsequent scripts.

File system artifacts

The yaml configuration property render_root specifies the working directory where scripts will be rendered.

If a script writes files to the working directory, that is where files will appear. These can be read by subsequent scripts.
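The hand-off through the working directory can be imitated in base R. In this sketch a temporary directory stands in for render_root, and the two commented steps stand in for two processing scripts; the file and object names are assumptions.

```r
# A temporary directory stands in for the configured render_root.
render_root <- file.path(tempdir(), "render_root_demo")
dir.create(render_root, showWarnings = FALSE)
old_wd <- setwd(render_root)

# Step 1 (first script): write an artifact relative to the working directory.
saveRDS(subset(mtcars, mpg > 20), "cars_over_20.rds")

# Step 2 (later script): read the artifact from the same working directory.
cars_over_20 <- readRDS("cars_over_20.rds")

# Restore the previous working directory.
setwd(old_wd)
```

Both steps use relative paths only, so they keep working wherever render_root points.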

Passing data objects to subsequent scripts.

A script (e.g., script2.Rmd) running after script1.Rmd can access a stored data object named script1_dataset created by script1.Rmd by calling DataPackageR::datapackager_object_read("script1_dataset").