
VertNet Data Migrator Template

General toolkit for working with VertNet data.
Once customized to an original data source, it converts the original data into Darwin Core ready for upload to an Integrated Publishing Toolkit (IPT) resource. This tool is a template: it must be customized for each specific data set before it can transform that data set into Darwin Core. The tool embodies a generic workflow for transformations to Darwin Core. It also incorporates data cleaning through lookups to authoritative vocabularies, which are "sold" separately; those for VertNet can be found at https://github.com/tucotuco/DwCVocabs.

These tools currently consist of a set of customizable folders and scripts for data processing, with two major parts.
The first includes steps to transform the source data from its original structure and content into Simple Darwin Core and standard extensions. This part must be customized for distinct data sources.
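As a rough illustration of what the customization in the first part involves, the sketch below renames source columns to Darwin Core terms. It is not the migrator's own code: the source column names (cat_no, species, date_collected, lat, lon), the FIELD_MAP table, and the to_simple_dwc function are all hypothetical. Only the Darwin Core term names on the right-hand side are taken from the standard.

```python
import csv
import io

# Hypothetical mapping from one source's column names to Darwin Core terms.
# Each data source would need its own version of this table.
FIELD_MAP = {
    "cat_no": "catalogNumber",
    "species": "scientificName",
    "date_collected": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_simple_dwc(source_rows, field_map=FIELD_MAP):
    """Rename mapped source columns to Darwin Core terms.

    Unmapped source columns are dropped; mapped columns missing from a
    row are emitted as empty strings.
    """
    return [
        {dwc: row.get(src, "") for src, dwc in field_map.items()}
        for row in source_rows
    ]


# Made-up sample data standing in for an original source file.
source_csv = (
    "cat_no,species,lat,lon,notes\n"
    "123,Ctenomys sociabilis,-40.9,-71.1,skin\n"
)
rows = list(csv.DictReader(io.StringIO(source_csv)))
dwc_rows = to_simple_dwc(rows)
```

A real source usually needs more than renaming (splitting or combining fields, for example), which is why this part cannot be generic.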
The second part of the migrator includes steps to analyze and improve the data. Improvements include: (1) removing non-printing characters from data; (2) providing values for Darwin Core fields that are knowable but not explicitly provided in the source data; (3) removing invalid data from non-verbatim fields; and (4) translating values of original terms from the source to standard values using controlled vocabularies. The second part of the migrator does not have to be customized for each data publisher and can be used for any file in the Simple Darwin Core format.
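The four kinds of improvement listed above can be sketched on a single record as follows. This is a minimal illustration under stated assumptions, not the migrator's implementation: the COUNTRY_VOCAB table, the function names, and the specific rules chosen (deriving year from eventDate, blanking out-of-range latitudes) are hypothetical examples of each step.

```python
import unicodedata

# Hypothetical controlled vocabulary: lower-cased source value -> standard value.
COUNTRY_VOCAB = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}


def strip_nonprinting(value):
    """Remove control (Cc) and format (Cf) characters, including tabs and newlines."""
    return "".join(
        ch for ch in value if unicodedata.category(ch) not in ("Cc", "Cf")
    )


def standardize(value, vocab):
    """Return the standard value for a term, or the input if it is not in the vocabulary."""
    return vocab.get(value.strip().lower(), value)


def clean_record(record):
    """Apply the four example improvements to one Simple Darwin Core record."""
    # (1) Remove non-printing characters from every field.
    cleaned = {key: strip_nonprinting(val) for key, val in record.items()}

    # (2) Fill a knowable value: assume year can be read from an ISO 8601 eventDate.
    event_date = cleaned.get("eventDate", "")
    if not cleaned.get("year") and event_date[:4].isdigit():
        cleaned["year"] = event_date[:4]

    # (3) Remove invalid data from a non-verbatim field: latitude outside [-90, 90].
    if "decimalLatitude" in cleaned:
        try:
            valid = -90 <= float(cleaned["decimalLatitude"]) <= 90
        except ValueError:
            valid = False
        if not valid:
            cleaned["decimalLatitude"] = ""

    # (4) Translate a source value to a standard value using the vocabulary.
    if "country" in cleaned:
        cleaned["country"] = standardize(cleaned["country"], COUNTRY_VOCAB)

    return cleaned


# Made-up record with an embedded NUL, a missing year, and an impossible latitude.
example = clean_record({
    "country": "U\x00SA",
    "eventDate": "1999-05-01",
    "year": "",
    "decimalLatitude": "123.4",
})
```

Because the function refers only to Darwin Core field names, the same code could in principle run on any Simple Darwin Core file, which is the property that lets the second part go uncustomized.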