Data quality control using R

Checking points on land

The obistools package has a check_onland() function to check if coordinate pairs are located on land. By default this function uses a web service, but it can optionally work offline (although this is less accurate).

First fetch some Madrepora occurrences using robis:

library(robis)
mad <- occurrence("Madrepora")
leafletmap(mad)

Then run the check_onland() command. By default the function will return a data frame containing all records on land (another option is to return a data frame with errors):

library(obistools)
land <- check_onland(mad)
leafletmap(land)

In some cases it makes sense to apply a buffer when checking for records on land. In this case we add a 1000 m buffer zone:

land_buffer <- check_onland(mad, buffer = 1000)
leafletmap(land_buffer)

As expected, this returns fewer “wrong” records.

Now create a map showing all suspicious records, in orange by default but in red when they are suspicious even with the 1000 m buffer zone:
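
A minimal sketch of the colour assignment behind such a map, using small made-up data frames in place of the real mad, land and land_buffer results. It assumes each record carries a unique id column, and it uses the leaflet package directly for drawing; both are assumptions rather than something this page specifies:

```r
# Toy stand-ins for the data frames from the previous steps; the real `mad`,
# `land` and `land_buffer` come from occurrence() and check_onland().
mad <- data.frame(
  id = c("a", "b", "c", "d"),
  decimalLongitude = c(2.9, 3.1, 4.5, 5.0),
  decimalLatitude = c(51.2, 51.0, 50.8, 50.5),
  stringsAsFactors = FALSE
)
land <- mad[mad$id %in% c("b", "c", "d"), ]
land_buffer <- mad[mad$id %in% c("d"), ]

# Orange for every record flagged on land, red when the record is still
# flagged after applying the buffer.
land$color <- ifelse(land$id %in% land_buffer$id, "red", "orange")

# Build the map only when the leaflet package is installed.
if (requireNamespace("leaflet", quietly = TRUE)) {
  m <- leaflet::leaflet(land)
  m <- leaflet::addTiles(m)
  m <- leaflet::addCircleMarkers(
    m,
    lng = ~decimalLongitude, lat = ~decimalLatitude,
    color = ~color, radius = 4, stroke = FALSE, fillOpacity = 1
  )
  # Type `m` at the console to display the map.
}
```

With real data, drop the three toy data frames and keep the ifelse() line and the leaflet calls.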

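Taxon matching

The obistools package has a match_taxa() function that matches scientific names against the World Register of Marine Species (WoRMS). When prompted, type info to see which names need manual action, y to start manual resolution, or n to skip manual resolution. After selecting y, several options will be presented for each name. Pick a number, or press enter to skip the name: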

After this procedure, you will end up with a data frame containing the matched name, the WoRMS LSID, and the type of match. Add the LSIDs to your source data as scientificNameID.

occurrence$scientificNameID <- names$scientificNameID
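
The assignment above only works when the matching result has exactly one row per occurrence record, in the same order. If the names were deduplicated before matching, a lookup by name is safer. The sketch below uses made-up data; the scientificName column in the matching result and the LSID values are illustrative assumptions, not real WoRMS output:

```r
# Hypothetical source data with a repeated name.
occurrence <- data.frame(
  scientificName = c("Abra alba", "Lanice conchilega", "Abra alba"),
  stringsAsFactors = FALSE
)

# Hypothetical matching result, one row per unique name.
# The LSID values are placeholders, not real AphiaIDs.
names <- data.frame(
  scientificName = c("Abra alba", "Lanice conchilega"),
  scientificNameID = c(
    "urn:lsid:marinespecies.org:taxname:000001",
    "urn:lsid:marinespecies.org:taxname:000002"
  ),
  stringsAsFactors = FALSE
)

# Look up each occurrence name in the matching result, then copy the LSID.
idx <- match(occurrence$scientificName, names$scientificName)
occurrence$scientificNameID <- names$scientificNameID[idx]
```

Names without a match end up as NA, which makes them easy to find afterwards with is.na().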

Checking depth values

The obistools package has a check_depth() function to check if there are any potential problems with the values in the minimumDepthInMeters and maximumDepthInMeters fields. This function uses a web service to fetch bathymetry information from various sources.

First download some occurrences from OBIS:

abrseg <- robis::occurrence("Abra segmentum")

Then use check_depth() with a depthmargin of 10 meters. This returns all records whose depth values are 10 meters or more below the bottom depth returned by the web service:

library(obistools)
problems <- check_depth(abrseg, depthmargin = 10)
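
To make the margin rule concrete, here is a base R sketch of the comparison as described above, applied to made-up values. It is not the obistools implementation; the bottom column is a hypothetical stand-in for the bathymetry returned by the web service, and the exact handling of the boundary case in check_depth() may differ:

```r
# Made-up records: reported maximum depth and a hypothetical bottom depth.
records <- data.frame(
  maximumDepthInMeters = c(20, 55, 61, 300),
  bottom = c(50, 50, 50, 120)
)

depthmargin <- 10

# Flag records that lie 10 meters or more below the bottom depth.
records$flag <- records$maximumDepthInMeters >= records$bottom + depthmargin
problems <- records[records$flag, ]
```

The record at 55 m over a 50 m bottom is tolerated because it falls inside the margin, while the records at 61 m and 300 m are flagged.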

To plot sample depth versus bottom depth, first use lookup_xy() to obtain bathymetry for our points:
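
A minimal sketch of such a plot with base graphics, using made-up depths rather than real service output. The bathymetry column name is an assumption about what lookup_xy() returns:

```r
# Made-up sample depths and bottom depths, both in meters.
depths <- data.frame(
  minimumDepthInMeters = c(5, 12, 40),
  bathymetry = c(10, 15, 20)
)

# Samples reported deeper than the sea floor are suspicious.
too_deep <- depths$minimumDepthInMeters > depths$bathymetry

plot(
  depths$bathymetry, depths$minimumDepthInMeters,
  pch = 19, col = ifelse(too_deep, "red", "black"),
  xlab = "Bottom depth (m)", ylab = "Sample depth (m)"
)
# Points above the dashed 1:1 line are deeper than the bottom.
abline(0, 1, lty = 2)
```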

Checking parentEventIDs in the Event Core

Use check_eventids() to check if all parentEventIDs in the Event Core have a matching eventID. In the example below, first the original data is checked, then an error is introduced, and then the data is checked again:
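
The idea behind this check can be sketched in base R on a made-up Event Core table. The helper missing_parent() is hypothetical and only illustrates the logic; it is not the obistools function:

```r
# Made-up Event Core: a cruise, a station within it, and a sample at the station.
event <- data.frame(
  eventID = c("cruise1", "station1", "sample1"),
  parentEventID = c(NA, "cruise1", "station1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows whose parentEventID has no matching eventID.
missing_parent <- function(event) {
  has_parent <- !is.na(event$parentEventID) & event$parentEventID != ""
  event[has_parent & !(event$parentEventID %in% event$eventID), ]
}

# The original data is consistent, so no rows are returned.
clean_errors <- missing_parent(event)

# Introduce an error and check again.
event$parentEventID[3] <- "station_x"
errors <- missing_parent(event)
```

After the error is introduced, errors contains the single row for sample1.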

Checking core record identifiers in the extension file

Use check_extension_eventids() to check if identifiers in an extension file have matching eventIDs in the Event Core file. Again, in the example first the correct data is checked and then an error is introduced. The function will return a table of errors (if there are any). The field parameter is the name of the identifier column in the extension file:
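
As an illustration of what this check does, the base R sketch below finds extension rows that point to a nonexistent event. The data is made up and orphan_rows() is a hypothetical helper, not part of obistools:

```r
# Made-up Event Core and Occurrence Extension; the extension points to the
# core through its "id" column.
event <- data.frame(
  eventID = c("station1", "station2"),
  stringsAsFactors = FALSE
)
occurrence <- data.frame(
  id = c("station1", "station2", "station2"),
  occurrenceID = c("o1", "o2", "o3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: extension rows whose identifier has no matching eventID.
orphan_rows <- function(event, extension, field) {
  extension[!(extension[[field]] %in% event$eventID), ]
}

# The original data is consistent.
clean_errors <- orphan_rows(event, occurrence, field = "id")

# Introduce an error and check again.
occurrence$id[2] <- "station_x"
errors <- orphan_rows(event, occurrence, field = "id")
```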

Flattening occurrences

Sometimes it’s useful to have a flat occurrence table, i.e. a table of occurrences where all the information contained in the related events has been added. For example, all date and location information may be in the Event Core file and not in the Occurrence Extension, but for checking or analyzing your data you may want to have a table with both the date/location information and the scientific names. The field parameter is the column in the extension file which points to the core table:

# first go back to the clean version of the data
event <- archive$data$event.txt
occurrence <- archive$data$occurrence.txt
mof <- archive$data$extendedmeasurementorfact.txt
flat <- flatten_occurrence(event, occurrence, field = "id")
names(occurrence)
names(flat)
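
For a single level of events, the effect of flattening can be approximated with a plain merge. The sketch below uses made-up tables and is a simplification: it does not inherit fields from parent events higher up the hierarchy, which a full flattening step would need to handle:

```r
# Made-up Event Core with date and location.
event <- data.frame(
  eventID = c("s1", "s2"),
  eventDate = c("2015-06-01", "2015-06-02"),
  decimalLatitude = c(51.2, 51.3),
  decimalLongitude = c(2.9, 3.0),
  stringsAsFactors = FALSE
)

# Made-up Occurrence Extension; "id" points to the core table.
occurrence <- data.frame(
  id = c("s1", "s1", "s2"),
  occurrenceID = c("o1", "o2", "o3"),
  scientificName = c("Abra alba", "Lanice conchilega", "Abra alba"),
  stringsAsFactors = FALSE
)

# Attach the event fields to every occurrence, keeping all occurrences.
flat <- merge(occurrence, event, by.x = "id", by.y = "eventID", all.x = TRUE)
names(flat)
```

Each occurrence row now carries the eventDate and coordinates of its event.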

Visualizing dataset structure

The treeStructure() function in the obistools package generates a simplified event/occurrence tree showing the relationships between the different types (based on type and measurementType) of events and occurrences. Each node in the simplified tree is given a name based on the eventID or occurrenceID of one of the events or occurrences of that node type.

Note that an eventID column is required in the measurements table. In your dataset the extension column pointing to the event records may have another name, so make sure to add an eventID column.
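
A minimal sketch of adding that column, assuming the measurements table points to the event records through a column named id (as in the flattening example). The table is made up, and the argument order in the commented treeStructure() call is an assumption:

```r
# Made-up measurements table whose "id" column points to the event records.
mof <- data.frame(
  id = c("s1", "s2"),
  measurementType = c("temperature", "salinity"),
  stringsAsFactors = FALSE
)

# Add an eventID column only when the table does not have one yet.
if (!"eventID" %in% names(mof)) {
  mof$eventID <- mof$id
}

# With real data, the tree could then be built along these lines
# (assumed argument order: events, occurrences, measurements):
# tree <- obistools::treeStructure(event, occurrence, mof)
```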