Assertive R Programming with assertr

Tony Fischetti

2019-01-22

In data analysis workflows that depend on un-sanitized data sets from external sources, it’s very common that errors in data bring an analysis to a screeching halt. Oftentimes, these errors occur late in the analysis and provide no clear indication of which datum caused the error.

On occasion, the error resulting from bad data won’t even appear to be a data error at all. Still worse, errors in data can pass through an analysis without raising any error, remain undetected, and produce inaccurate results.

The solution to the problem is to provide as much information as you can about how you expect the data to look up front so that any deviation from this expectation can be dealt with immediately. This is what the assertr package tries to make dead simple.

Essentially, assertr provides a suite of functions designed to verify assumptions about data early in an analysis pipeline. This package needn’t be used with the magrittr/dplyr piping mechanism but the examples in this vignette will use them to enhance clarity.

concrete data errors

Let’s say, for example, that R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding.
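A minimal sketch of such a corrupted copy, assuming the only error is a flipped sign on the fifth mpg value (an 8-cylinder car). The corruption itself is an assumption made for illustration; the name our.data matches the examples later in this vignette.

```r
library(dplyr)

# Assumption: the "external" copy differs from mtcars by a single sign flip
our.data <- mtcars
our.data$mpg[5] <- our.data$mpg[5] * -1

# Average miles per gallon for each number of cylinders
avg.by.cyl <- our.data %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))
avg.by.cyl
```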

This indicates that the average miles per gallon for an 8-cylinder car is a lowly 12.43. However, in the correct dataset it’s really just over 15. Data errors like that are extremely easy to miss because they don’t cause an error, and the results look reasonable.

enter assertr

To combat this, we might want to use assertr’s verify function to make sure that mpg is a positive number:
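As a sketch, assuming our.data is mtcars with one mpg value sign-flipped, the check could sit at the top of the pipeline. The tryCatch wrapper is only there so the sketch does not abort a script; it is not part of assertr.

```r
library(dplyr)
library(assertr)

our.data <- mtcars
our.data$mpg[5] <- our.data$mpg[5] * -1   # assumed corruption

result <- tryCatch(
  our.data %>%
    verify(mpg >= 0) %>%                  # halts here on the negative mpg
    group_by(cyl) %>%
    summarise(avg.mpg = mean(mpg)),
  error = function(e) conditionMessage(e) # keep the error message instead
)
result
```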

The verify function takes a data frame (its first argument is provided by the %>% operator), and a logical (boolean) expression. Then, verify evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression’s result are FALSE, verify will raise an error that terminates any further processing of the pipeline.

We could have also written this assertion using assertr’s assert function…
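A sketch of the same check with assert, again assuming a sign-flipped copy of mtcars and using tryCatch only to keep the script alive:

```r
library(dplyr)
library(assertr)

our.data <- mtcars
our.data$mpg[5] <- our.data$mpg[5] * -1   # assumed corruption

result <- tryCatch(
  our.data %>%
    assert(within_bounds(0, Inf), mpg) %>%  # predicate first, then columns
    group_by(cyl) %>%
    summarise(avg.mpg = mean(mpg)),
  error = function(e) conditionMessage(e)
)
result
```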

The assert function takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and assert will raise an error if it finds any violations.

Internally, the assert function uses dplyr’s select function to extract the columns to test the predicate function on. This allows for complex assertions. Let’s say we wanted to make sure that all values in the dataset are greater than zero (except mpg):
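One way this might look, using select-style negation to drop mpg and the include.lower argument of within_bounds to make the lower bound exclusive. Because mtcars contains zeros in the vs and am columns, this particular assertion is expected to fail.

```r
library(dplyr)
library(assertr)

result <- tryCatch(
  mtcars %>%
    # every column except mpg must be strictly greater than zero
    assert(within_bounds(0, Inf, include.lower = FALSE), -mpg),
  error = function(e) conditionMessage(e)
)
result
```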

verify vs. assert

The first noticeable difference between verify and assert is that verify takes an expression, and assert takes a predicate and columns to apply it to. This might make the verify function look more elegant, but there’s an important drawback. verify has to evaluate the entire expression first, and then check if there were any violations. Because of this, verify can’t tell you the offending datum.

One important drawback to assert, and a consequence of its application of the predicate to columns, is that assert can’t confirm assertions about the data structure itself. For example, let’s say we were reading a dataset from disk that we know has more than 100 observations; we could write a check of that assumption like this:

dat <- read.csv("a-data-file.csv")
dat %>% verify(nrow(.) > 100) %>% ....

This is a powerful advantage over assert… but assert has one more advantage of its own that we’ve heretofore ignored.

assertr’s predicates

assertr’s predicates, both built-in and custom, make assert very powerful. The predicates that are built in to assertr are

not_na - checks that an element is not NA

within_bounds - returns a predicate function that checks whether a numeric value falls within the bounds supplied

in_set - returns a predicate function that checks whether an element is a member of the set supplied, and

is_uniq - checks that each element appears only once

We’ve already seen within_bounds in action… let’s use the in_set function to make sure that the am column contains only 0s and 1s (automatic and manual, respectively)…

our.data %>% assert(in_set(0, 1), am) %>% ...

If we were reading a dataset that contained a column representing boroughs of New York City (named BORO), we could verify that there are no misspelled or otherwise unexpected boroughs like so…
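A self-contained sketch of such a check. The small data frame dat is a hypothetical stand-in for the file read from disk, and it deliberately contains one misspelled borough.

```r
library(dplyr)
library(assertr)

boroughs <- c("Bronx", "Manhattan", "Queens", "Brooklyn", "Staten Island")

# Hypothetical stand-in for a data set read from disk
dat <- data.frame(BORO = c("Queens", "Bronx", "Manhatan"),
                  stringsAsFactors = FALSE)

result <- tryCatch(
  dat %>% assert(in_set(boroughs), BORO),   # "Manhatan" is not in the set
  error = function(e) conditionMessage(e)
)
result
```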

custom predicates

A convenient feature of assertr is that it makes the construction of custom predicate functions easy.

In order to make a custom predicate, you only have to specify cases where the predicate should return FALSE. Let’s say that a dataset has an ID column (named ID) that we want to check is not an empty string. We can create a predicate like this:

not.empty.p <- function(x) if (x == "") return(FALSE)

and apply it like this:

read.csv("another-dataset.csv") %>% assert(not.empty.p, ID) %>% ...

Let’s say that the ID column is always a 7-digit number. We can confirm that all the IDs are 7-digits by defining the following predicate:

seven.digit.p <- function(x) nchar(x) == 7

A powerful consequence of this easy creation of predicates is that the assert function lends itself to use with lambda predicates (unnamed predicates that are only used once). The check above might be better written as
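A sketch of the lambda form, using a hypothetical three-row data frame in place of another-dataset.csv. The last ID is too short, so the assertion is expected to fail.

```r
library(dplyr)
library(assertr)

# Hypothetical stand-in for the data set; the last ID has only 3 characters
ids <- data.frame(ID = c("1234567", "7654321", "123"),
                  stringsAsFactors = FALSE)

result <- tryCatch(
  ids %>% assert(function(x) nchar(x) == 7, ID),  # unnamed, single-use predicate
  error = function(e) conditionMessage(e)
)
result
```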

enter insist and predicate ‘generators’

Very often, there is a need to dynamically determine the predicate function to be used based on the vector being checked.

For example, to check to see if every element of a vector is within n standard deviations of the mean, you need to create a within_bounds predicate after dynamically determining the bounds by reading and computing on the vector itself.

To this end, the assert function is no good; it just applies a raw predicate to a vector. We need a function like assert that will apply predicate generators to vectors, return predicates, and then perform assert-like functionality by checking each element of the vectors with its respective custom predicate. This is precisely what insist does.

This is all much simpler than it may sound. Hopefully, the examples will clear up any confusion.

The primary use case for insist is in conjunction with the within_n_sds or within_n_mads predicate generator.

Suppose we wanted to check that every mpg value in the mtcars data set was within 3 standard deviations of the mean before finding the average miles per gallon for each number of engine cylinders. We could write something like this:
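A sketch of that check, followed by a stricter variant at 2 standard deviations. With mtcars, the 3-standard-deviation check passes, while the 2-standard-deviation bound falls below the largest mpg values and so fails. The tryCatch wrapper is only there to keep the script running.

```r
library(dplyr)
library(assertr)

# Passes: every mpg value lies within 3 standard deviations of the mean
passes <- mtcars %>%
  insist(within_n_sds(3), mpg) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

# Stricter bound: the most fuel-efficient cars fall outside 2 standard deviations
stricter <- tryCatch(
  mtcars %>%
    insist(within_n_sds(2), mpg) %>%
    group_by(cyl) %>%
    summarise(avg.mpg = mean(mpg)),
  error = function(e) conditionMessage(e)
)
```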

If we tighten the check to 2 standard deviations, execution of the pipeline is halted. But now we know exactly which data point violated the predicate that within_n_sds(2)(mtcars$mpg) returned.

Now that’s an efficient car!

After the predicate generator, insist takes an arbitrary number of columns just like assert using the syntax of dplyr’s select function. If you wanted to check that everything in mtcars is within 10 standard deviations of the mean (of each column vector), you can do so like this:
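For instance, a range of columns can be selected with select-style syntax. This sketch assumes the untouched mtcars, where the check passes and the data is returned unchanged.

```r
library(dplyr)
library(assertr)

# mpg:carb selects every column of mtcars; each column gets its own bounds
checked <- mtcars %>%
  insist(within_n_sds(10), mpg:carb)
```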

I chose to use within_n_sds in this example because people are familiar with z-scores. However, for most practical purposes, the related predicate generator within_n_mads is more useful.

The problem with within_n_sds is that the mean and standard deviation are so heavily influenced by outliers that their very presence will compromise attempts to identify them using these statistics. In contrast with within_n_sds, within_n_mads uses the robust statistics, median and median absolute deviation, to identify potentially erroneous data points.

For example, the vector <7.4, 7.1, 7.2, 72.1> almost certainly has an erroneous data point, but within_n_sds(2) will fail to detect it.
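This can be checked directly by calling each predicate generator on the vector and then applying the resulting predicate to the suspicious value. The variable names are chosen here for illustration.

```r
library(assertr)

example.vector <- c(7.4, 7.1, 7.2, 72.1)

# Mean and sd are dragged toward the outlier, so 72.1 stays "in bounds"
sd.check <- within_n_sds(2)(example.vector)(72.1)    # TRUE

# Median and MAD are barely affected, so 72.1 is flagged
mad.check <- within_n_mads(2)(example.vector)(72.1)  # FALSE
```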

Can you spot the brazen outlier? You’re certainly not going to find it by checking the distribution of each column! All elements from both columns are within 2 standard deviations of their respective means.
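A small two-column data set of this kind might look as follows. The values are assumed for illustration: y is roughly ten times x in every row but the fifth, yet a per-column check at 2 standard deviations passes.

```r
library(dplyr)
library(assertr)

# Illustrative values; row 5 breaks the "y is about 10 times x" pattern
example.data <- data.frame(
  x = c(8, 9, 6, 5, 9, 5, 6, 7, 8, 9, 6, 5, 5, 6, 7),
  y = c(82, 91, 61, 49, 40, 49, 57, 74, 78, 90, 61, 49, 51, 62, 68)
)

# Column-wise checks see nothing unusual and return the data unchanged
checked <- example.data %>% insist(within_n_sds(2), x, y)
```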

Unless you have a really good eye, the only way you’re going to catch this mistake is by plotting the data set.

plot(example.data$x, example.data$y, xlab="", ylab="")

Ok, so all the ys are roughly 10 times the xs except the outlying data point.

The problem with having to plot data sets to catch anomalies is that it is really hard to visualize four dimensions at once, and it is near impossible with high-dimensional data.

There’s no way of catching this anomaly by looking at each individual column separately; the only way to catch it is to view each row as a complete observation and compare it to the rest.

To this end, assertr provides three functions that take a data frame, and reduce each row into a single value. We’ll call them row reduction functions.

The first one we’ll look at is called maha_dist. It computes the average Mahalanobis distance (kind of like multivariate z-scoring for outlier detection) of each row from the whole data set. The big idea is that in the resultant vector, big/distant values are potentially anomalous entries. Let’s look at the distribution of Mahalanobis distances for this data set…
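A sketch of that inspection, assuming an illustrative two-column data set in which row 5 breaks the pattern. A numeric summary stands in for a histogram here.

```r
library(assertr)

# Illustrative values; row 5 is the anomalous observation
example.data <- data.frame(
  x = c(8, 9, 6, 5, 9, 5, 6, 7, 8, 9, 6, 5, 5, 6, 7),
  y = c(82, 91, 61, 49, 40, 49, 57, 74, 78, 90, 61, 49, 51, 62, 68)
)

dists <- maha_dist(example.data)   # one distance per row
summary(dists)
farthest <- unname(which.max(dists))
farthest                           # index of the most distant row
```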

There’s no question here as to whether there’s an anomalous entry! But how do you check for this sort of thing using assertr constructs?

Well, maha_dist will typically be used with the insist_rows function. insist_rows takes a data frame, a row reduction function, a predicate-generating function, and an arbitrary number of columns to apply the predicate function to. The row reduction function (maha_dist in this case) is applied to the data frame, and returns a value for each row. The predicate-generating function is then applied to the vector returned from the row reduction function and the resultant predicate is applied to each element of that vector. It will raise an error if it finds any violations.

As always, this undoubtedly sounds far more confusing than it really is. Here’s an example of it in use
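A sketch of such a call on an illustrative two-column data set (values assumed, with row 5 as the anomaly). The tryCatch wrapper only keeps the script from aborting when the assertion fails.

```r
library(dplyr)
library(assertr)

example.data <- data.frame(
  x = c(8, 9, 6, 5, 9, 5, 6, 7, 8, 9, 6, 5, 5, 6, 7),
  y = c(82, 91, 61, 49, 40, 49, 57, 74, 78, 90, 61, 49, 51, 62, 68)
)

result <- tryCatch(
  example.data %>%
    # reduce each row with maha_dist, then bound the distances by 3 MADs
    insist_rows(maha_dist, within_n_mads(3), everything()),
  error = function(e) conditionMessage(e)
)
result
```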

Check that out! To be clear, this function is running the supplied data frame through the maha_dist function which returns a value for each row corresponding to its Mahalanobis distance. (The whole data frame is used because we used the everything() selection function from the dplyr package.) Then, within_n_mads(3) computes on that vector and returns a bounds checking predicate. The bounds checking predicate checks to see that all Mahalanobis distances are within 3 median absolute deviations of each other. They are not, and the pipeline errors out. Note that the data.frame of errors that is returned by error report contains the verb used (insist_rows), the row reduction function, the predicate, the column (or columns), the index of the failure and the offending datum.

This is probably the most powerful construct in assertr; it can find a whole lot of nasty errors that would be very difficult to check for by hand.

Part of what makes it so powerful is how flexible maha_dist is. We only used it, so far, on a data frame of numerics, but it can handle all sorts of data frames. To really see it shine, let’s use it on the iris data set, which contains a categorical variable in its right-most column…
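One way to try this out is to mislabel a single flower and see which row ends up farthest from the rest. The mislabeled row (149) and the name mistake are assumptions made for this sketch.

```r
library(assertr)

# Assumption for illustration: row 149 is recorded with the wrong species
mistake <- iris
mistake[149, 5] <- "setosa"

dists <- maha_dist(mistake)   # the factor column is handled alongside the numerics
summary(dists)
which.max(dists)              # row with the largest distance
```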

insist and insist_rows are similar in that they both take predicate generators and not actual predicates. What makes insist_rows different is its usage of a row-reduced data frame.

assert has a row-oriented counterpart, too; it’s called assert_rows. insist is to assert as insist_rows is to assert_rows.

assert_rows works the same as insist_rows, except that instead of using a predicate generator on the row-reduced data frame, it uses a regular-old predicate.

For an example of an assert_rows use case, let’s say that we got a data set (another-dataset.csv) from the web and we don’t want to continue processing the data set if any row contains more than two missing values (NAs). You can use the row reduction function num_row_NAs to reduce all the rows into the number of NAs they contain. Then, a simple bounds checker will suffice for ensuring that no element is higher than 2…
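A self-contained sketch, with a hypothetical all-numeric data frame standing in for another-dataset.csv. Its second row has three missing values, so the assertion is expected to fail.

```r
library(dplyr)
library(assertr)

# Hypothetical stand-in for the downloaded data; row 2 has three NAs
dat <- data.frame(a = c(1, NA, 3),
                  b = c(4, NA, 6),
                  c = c(7, NA, 9),
                  d = c(1.5, 2.5, 3.5))

result <- tryCatch(
  dat %>%
    # count NAs per row, then require each count to be between 0 and 2
    assert_rows(num_row_NAs, within_bounds(0, 2), everything()),
  error = function(e) conditionMessage(e)
)
result
```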

assert_rows can be used for anomaly detection as well. A future version of assertr may contain a cosine distance row reduction function. Since all cosine distances are constrained from -1 to 1, it is easy to use a non-dynamic predicate to disallow certain values.

success and error functions

The behavior of functions like assert, assert_rows, insist, insist_rows, and verify when the assertion passes or fails is configurable via the success_fun and error_fun parameters, respectively.

The success_fun parameter takes a function that takes the data passed to the assertion function as a parameter. You can write your own success handler function, but there are two provided by this package:

success_continue - just returns the data that was passed into the assertion function (this is default)

success_logical - returns TRUE

The error_fun parameter takes a function that receives the errors that were found, along with the data passed to the assertion function. You can write your own error handler function, but there are a few provided by this package:

error_stop - Prints a summary of the errors and halts execution (default)

error_report - Prints all the information available about the errors and halts execution.

error_append - Attaches the errors to a special attribute of data and returns the data. This is chiefly to allow assertr errors to be accumulated in a pipeline so that all assertions can have a chance to be checked and so that all the errors can be displayed at the end of the chain.

error_logical - returns FALSE

just_warn - Prints a summary of the errors but does not halt execution, it just issues a warning.

warn_report - Prints all the information available about the errors but does not halt execution, it just issues a warning.
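A short sketch of a few of these handlers in use, assuming a copy of mtcars with one mpg value sign-flipped. The variable names are chosen here for illustration.

```r
library(dplyr)
library(assertr)

our.data <- mtcars
our.data$mpg[5] <- our.data$mpg[5] * -1   # assumed corruption

# Turn an assertion into a plain TRUE/FALSE test
clean.ok <- mtcars %>%
  verify(mpg > 0, success_fun = success_logical, error_fun = error_logical)
dirty.ok <- our.data %>%
  verify(mpg > 0, success_fun = success_logical, error_fun = error_logical)

# Keep going, but carry the errors along in the "assertr_errors" attribute
tagged <- our.data %>% verify(mpg > 0, error_fun = error_append)
n.errors <- length(attr(tagged, "assertr_errors"))
```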

combining chains of assertions

Let’s say that as part of an automated pipeline that grabs mtcars from an untrusted source and finds the average miles per gallon for each number of engine cylinders, we want to perform the following checks…

that it has the columns “mpg”, “vs”, and “am”

that the dataset contains more than 10 observations

that the column for ‘miles per gallon’ (mpg) is a positive number

that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean, and

that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
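Written in order, the checks listed above might look like this sketch. It runs on the untouched mtcars, where all five checks pass.

```r
library(dplyr)
library(assertr)

avg.by.cyl <- mtcars %>%
  verify(has_all_names("mpg", "vs", "am")) %>%  # required columns exist
  verify(nrow(.) > 10) %>%                      # more than 10 observations
  verify(mpg > 0) %>%                           # mpg is positive
  insist(within_n_sds(4), mpg) %>%              # no mpg beyond 4 sds of the mean
  assert(in_set(0, 1), am, vs) %>%              # am and vs hold only 0s and 1s
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))
avg.by.cyl
```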

In an assertr chain with default options, assert, assert_rows, insist, insist_rows, and verify will stop at the first assertion that yields an error and not go on to process the assertions further down in the chain. For some needs, this is sensible behavior. There are times, however, when we might like to get a report of all assertion violations. For example, one might want to write an R program to download some dataset from the internet and get a detailed report of all deviations from expectation.

The best thing to do for this use case is to use the chain_start and chain_end functions at the beginning and end of a chain of assertr assertions. When chain_start gets called with data, the data gets a special tag that tells the assertr assertions that follow to override their success_fun and error_fun values and replace them with success_continue (which passes the data along if the test passes) and error_append (which we’ve just discussed). After all relevant verifications, chain_end will receive the data (possibly with accumulated error messages attached) and, by default, print a report of all the errors that have been found since the start of the chain.
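A sketch of such a chain, assuming a copy of mtcars with two deliberately introduced errors (a negative mpg and an am value of 2). Every assertion runs, and the report is only produced at chain_end; tryCatch keeps the final halt from aborting the script.

```r
library(dplyr)
library(assertr)

# Assumed corruptions for illustration
questionable <- mtcars
questionable$mpg[5] <- questionable$mpg[5] * -1
questionable$am[1] <- 2

result <- tryCatch(
  questionable %>%
    chain_start() %>%
    verify(nrow(.) > 10) %>%
    verify(mpg > 0) %>%               # fails, but the chain continues
    assert(in_set(0, 1), am, vs) %>%  # fails too; both errors are accumulated
    chain_end() %>%                   # reports all accumulated errors and halts
    group_by(cyl) %>%
    summarise(avg.mpg = mean(mpg)),
  error = function(e) conditionMessage(e)
)
result
```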

Awesome! Now we can add an arbitrary number of assertions, as the need arises, without touching the real logic.

advanced: send email reports using custom error functions

One particularly cool application of assertr is to use it as a data integrity checker for frequently updated data sources. A script can download new data as it becomes available, and then run assertr checks on it. This makes assertr into a sort of “continuous integration” tool (but for data, not code.)

In an unsupervised “continuous integration” environment, you need a way to discover that the assertions failed. In CI-as-a-service in the software world, failed automated checks often send the maintainer an email reporting a botched build; why not bring that functionality to assertr?!

As we learned in the last sections, all assertion verbs in assertr support a custom error function. chain_end similarly supports custom error functions. By default, this is error_stop (or error_report in the case of chain_end) which prints a summary of the errors and halts execution.

You can specify your own, though, to hijack this behavior and redirect flow-of-control wherever you want.

Your custom error function must take, as its first argument, a list of assertr_error S3 objects. The second argument must be the data.frame that the verb is computing on. Every error function must take this because there may be other errors attached to the data.frame’s assertr_errors attribute, left over from previous assertions.

Below we are going to build a function that takes a list of assertr_errors, gets a string representation of the errors and emails it to someone before halting execution. We will use the mailR package to send the mail.
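A sketch of such a function. The send_fun argument is a hypothetical addition that lets the sender be swapped out so the sketch can run without a mail server; by default it points at mailR's send.mail. The addresses are placeholders, the SMTP settings are assumptions, and merging of errors left over on the data is omitted for brevity.

```r
library(dplyr)
library(assertr)

email_me <- function(list_of_errors, data = NULL, ...,
                     send_fun = mailR::send.mail) {  # send_fun: hypothetical hook
  n <- length(list_of_errors)
  preface <- sprintf("There %s %d error%s:",
                     if (n == 1) "is" else "are", n, if (n == 1) "" else "s")
  # print() on each error gives the unabridged report; capture it as text
  details <- capture.output(invisible(lapply(list_of_errors, function(err) {
    cat("\n- ")
    print(err)
  })))
  body <- paste(c(preface, details), collapse = "\n")
  # Placeholder addresses and assumed SMTP settings
  send_fun(from = "assertr@gmail.com", to = "YOU@gmail.com",
           subject = "error from assertr", body = body,
           smtp = list(host.name = "aspmx.l.google.com", port = 25),
           authenticate = FALSE, send = TRUE)
  stop("assertr stopped execution", call. = FALSE)
}

# Dry run with a fake sender that records the message instead of mailing it
sent <- NULL
fake_send <- function(...) sent <<- list(...)

questionable <- mtcars
questionable$mpg[5] <- questionable$mpg[5] * -1   # assumed corruption

result <- tryCatch(
  questionable %>%
    chain_start() %>%
    verify(mpg > 0) %>%
    chain_end(error_fun = function(errors, data = NULL)
      email_me(errors, data, send_fun = fake_send)),
  error = function(e) conditionMessage(e)
)
```

In real use, chain_end(error_fun = email_me) would hand the message to mailR instead of the fake sender.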

(this particular send.mail formulation will only work for gmail recipients; see the mailR documentation for more information)

Now you’ll get notified of all failed assertions via email. Groovy!

advanced: creating your own predicate generators for insist

assertr is built with robustness, correctness, and extensibility in mind. Just like assertr makes it easy to create your own custom predicates, so too does this package make it easy to create your own custom predicate generators.

Okay… so it’s, perhaps, not easy because predicate generators by nature are functions that return functions. But it’s possible!

Let’s say you wanted to create a predicate generator that checks if all elements of a vector are within 3 times the vector’s interquartile range from the median. We need to create a function that looks like this
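A sketch of such a generator, built on within_bounds. The names within_3_iqrs and within_n_iqrs are chosen here for illustration; the second is a generalisation that takes the multiplier as an argument.

```r
library(dplyr)
library(assertr)

# Takes the whole vector, returns a predicate for its individual elements
within_3_iqrs <- function(a_vector) {
  the_median <- median(a_vector)
  the_iqr <- IQR(a_vector)
  within_bounds(the_median - 3 * the_iqr, the_median + 3 * the_iqr)
}

# Generalised form: a function that returns a predicate generator
within_n_iqrs <- function(n, ...) {
  function(a_vector) {
    the_median <- median(a_vector)
    the_iqr <- IQR(a_vector)
    within_bounds(the_median - n * the_iqr, the_median + n * the_iqr, ...)
  }
}

checked  <- mtcars %>% insist(within_3_iqrs, mpg)
checked2 <- mtcars %>% insist(within_n_iqrs(3), mpg)
```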

advanced: programming with assertion functions

These assertion functions use the tidyeval framework. In the past, programming in a tidyverse-like setting was accomplished through standard evaluation versions of verbs, which used functions postfixed with an underscore: insist_ instead of insist, for example. However, when tidyeval was made popular with dplyr 0.7.0, this usage became deprecated, and therefore underscore-postfixed functions are no longer part of assertr.
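As a sketch of what this can look like in practice, a column name held in a string can be turned into a symbol and unquoted with !!. The wrapper check_positive is a hypothetical helper, and the sketch assumes the rlang package is available.

```r
library(dplyr)
library(assertr)

# Hypothetical helper: assert that a column, named by a string, is non-negative
check_positive <- function(data, col_name) {
  assert(data, within_bounds(0, Inf), !!rlang::sym(col_name))
}

checked <- check_positive(mtcars, "mpg")
```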