Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel.

Introduction

It’s been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion between row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).

Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.

In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.

Example

Consider four data scientists who perform the same set of modeling tasks, but happen to record the data differently.

In each case the data scientist was asked to test two decision tree regression models (a and b) on two test-sets (x and y) and record both the model quality on the test sets under two different metrics (AUC and pseudo R-squared). The two models differ in tree depth (in this case model a has depth 5, and model b has depth 3), which is also to be recorded.

Data Scientist 1

Data scientist 1 is an experienced modeler, and records their data as follows:
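
The table itself is not reproduced in this copy. As a concrete stand-in, here is a sketch of a frame with this layout in Python/pandas; the metric values are invented for illustration, and only the column layout matters:

```python
import pandas as pd

# Illustrative denormalized frame: every row carries all the facts.
# Metric values are invented for the example.
d1 = pd.DataFrame({
    "model":   ["a", "a", "b", "b"],   # key column
    "testset": ["x", "y", "x", "y"],   # key column
    "depth":   [5, 5, 3, 3],           # derived: a function of model
    "AUC":     [0.60, 0.50, 0.80, 0.75],  # payload
    "pR2":     [0.20, 0.25, 0.30, 0.25],  # payload
})
print(d1)
```

Together, model and testset form a composite key: each pair occurs exactly once.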

Data Scientist 1 uses what is called a denormalized form. In this form each row contains all of the facts we want ready to go. If we were thinking about "column roles" (a concept we touched on briefly in Section A.3.5 "How to Think in SQL" of Practical Data Science with R, Zumel, Mount; Manning 2014), then we would say the columns model and testset are key columns (together they form a composite key that uniquely identifies rows), the depth column is derived (it is a function of model), and AUC and pR2 are payload columns (they contain data).

Denormalized forms are the most ready for tasks that reason across columns, such as training or evaluating machine learning models.

Data Scientist 2

Data Scientist 2 has data warehousing experience and records their data in a normal form:

The idea is: since depth is a function of the model name, it need not be recorded in a separate column. In a normal form such as the one above, every item of data is written in only one place, so we cannot have inconsistencies such as accidentally entering two different depths for a given model. In this example all our columns are either key or payload.

Data Scientist 2 is not concerned about any difficulty that might arise from this format, as they know they can convert to Data Scientist 1’s format using a join command:
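
The R join itself is not shown in this copy. As an illustrative sketch in Python/pandas (table contents invented to match the running example), the normalized layout and its recovery via a join look roughly like:

```python
import pandas as pd

# Hypothetical normalized tables: each fact about a model is stored once.
models = pd.DataFrame({"model": ["a", "b"], "depth": [5, 3]})
results = pd.DataFrame({
    "model":   ["a", "a", "b", "b"],
    "testset": ["x", "y", "x", "y"],
    "AUC":     [0.60, 0.50, 0.80, 0.75],
    "pR2":     [0.20, 0.25, 0.30, 0.25],
})

# The join re-attaches depth, recovering the denormalized layout.
d1 = results.merge(models, on="model")
print(d1)
```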

Data Scientist 3

Data Scientist 3 prefers a tall (or "thin") format, with one measurement per row. In this form model, testset, and measurement are key columns. depth is still running around as a derived column, and the new value column holds the measurements (which could in principle have different types in different rows!).

Data Scientist 3 is not worried about their form causing problems as they know how to convert into Data Scientist 1’s format with an R command:
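
The R command is not reproduced in this copy. To make the transformation concrete, here is an analogous sketch in Python/pandas (frame contents invented to match the running example):

```python
import pandas as pd

# Hypothetical tall frame in Data Scientist 3's layout:
# one measurement per row, keyed by (model, testset, measurement).
d3 = pd.DataFrame({
    "model":       ["a", "a", "a", "a", "b", "b", "b", "b"],
    "testset":     ["x", "x", "y", "y", "x", "x", "y", "y"],
    "depth":       [5, 5, 5, 5, 3, 3, 3, 3],
    "measurement": ["AUC", "pR2", "AUC", "pR2", "AUC", "pR2", "AUC", "pR2"],
    "value":       [0.60, 0.20, 0.50, 0.25, 0.80, 0.30, 0.75, 0.25],
})

# Move the measurement values into columns (the spread direction).
d1 = (d3.pivot(index=["model", "testset", "depth"],
               columns="measurement", values="value")
        .reset_index())
d1.columns.name = None
print(d1)
```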

The above operation is a bit exotic, and it (and its inverse) already go under a number of different names:

pivot / un-pivot (Microsoft Excel)

pivot / anti-pivot (databases)

crosstab / un-crosstab (databases)

unstack / stack (R)

cast / melt (reshape, reshape2)

spread / gather (tidyr)

"widen" / "narrow" (colloquial)

moveValuesToColumns() and moveValuesToRows() (this writeup)

And we are certainly neglecting other names for the concept. We find none of these particularly evocative (though cheat sheets help), so one purpose of this note is to teach these concepts in terms of the deliberately verbose ad-hoc terms: moveValuesToColumns() and moveValuesToRows().

Note: often the data re-arrangement operation is only exposed as part of a larger aggregating or tabulating operation. Also moveValuesToColumns() is considered the harder transform direction (as it has to group rows to work), so it is often supplied in packages, whereas analysts often use ad-hoc methods for the simpler moveValuesToRows() operation (to be defined next).

Data Scientist 4

Data Scientist 4 picks a form that makes models unique keys, and records the results as:

moveValuesToRows() is (under some restrictions) an inverse of moveValuesToColumns().
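
To illustrate this inverse relationship, here is a hedged Python/pandas sketch. The wide-by-model layout below is our guess at Data Scientist 4’s form (one row per model, one column per testset/metric combination), with invented values:

```python
import pandas as pd

# Hypothetical frame in Data Scientist 4's layout: model is the unique key.
d4 = pd.DataFrame({
    "model": ["a", "b"],
    "depth": [5, 3],
    "AUC_x": [0.60, 0.80],
    "AUC_y": [0.50, 0.75],
    "pR2_x": [0.20, 0.30],
    "pR2_y": [0.25, 0.25],
})

# moveValuesToRows(): melt the measurement columns into rows...
tall = d4.melt(id_vars=["model", "depth"],
               var_name="measurement", value_name="value")

# ...then moveValuesToColumns(): pivot them back, recovering d4.
back = (tall.pivot(index=["model", "depth"],
                   columns="measurement", values="value")
            .reset_index())
back.columns.name = None
print(back)
```

The round trip returns the original frame, demonstrating that (on coordinatized data) the two operations invert each other.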

Although we implement moveValuesToRows() and moveValuesToColumns() as thin wrappers of tidyr’s gather and spread, we find the more verbose naming (and calling interface) more intuitive. So we encourage you to think directly in terms of moveValuesToRows() as moving values to different rows (in the same column), and moveValuesToColumns() as moving values to different columns (in the same row). It will usually be apparent from your problem which of these operations you want to use.

The Theory of Coordinatized Data

When you are working with transformations you look for invariants to keep your bearings. All of the above data share an invariant property we call being coordinatized data. In this case the invariant is so strong that one can think of all of the above examples as being equivalent, and the row/column transformations as merely changes of frame of reference.

Let’s define coordinatized data by working with our examples. In all of the above examples, a value-carrying (or payload) cell or entry can be uniquely named as follows:

From our point of view these keys all name the same data item. The fact that we are interpreting one position as a table name and another as a column name is just convention. We can even write R code that uses these keys on all our scientists’ data without performing any reformatting:

The lookup() procedure was able to treat all these keys and key positions uniformly. This illustrates that what is in tables versus what is in rows versus what is in columns is just an implementation detail. Once we understand that all of these data scientists recorded the same data we should not be surprised we can convert between representations.
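
The R lookup() is not reproduced in this copy. The following Python/pandas sketch (our own hypothetical helper, not the article’s code, with invented values) shows the idea: a cell is addressed purely by its coordinates, and the same abstract coordinates name the same datum in either layout:

```python
import pandas as pd

def lookup(tables, table, row_key, column):
    """Return the unique cell addressed by (table, row_key, column)."""
    d = tables[table]
    mask = pd.Series(True, index=d.index)
    for k, v in row_key.items():
        mask &= (d[k] == v)
    matches = d.loc[mask, column]
    if len(matches) != 1:
        raise ValueError("coordinates must identify exactly one cell")
    return matches.iloc[0]

# The same data in two layouts (values invented for the example).
d1 = pd.DataFrame({
    "model": ["a", "a", "b", "b"], "testset": ["x", "y", "x", "y"],
    "AUC": [0.60, 0.50, 0.80, 0.75], "pR2": [0.20, 0.25, 0.30, 0.25],
})
d3 = d1.melt(id_vars=["model", "testset"],
             var_name="measurement", value_name="value")

# Whether "AUC" acts as a column name or as a key value is mere convention.
v1 = lookup({"d1": d1}, "d1", {"model": "a", "testset": "x"}, "AUC")
v3 = lookup({"d3": d3}, "d3",
            {"model": "a", "testset": "x", "measurement": "AUC"}, "value")
assert v1 == v3
```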

The thing to remember: coordinatized data is in cells, and every cell has unique coordinates. We are going to use this invariant as our enforced precondition before any data transform, which will guarantee our data meets this invariant as a postcondition. I.e., if we restrict ourselves to coordinatized data and exclude wild data, the operations moveValuesToColumns() and moveValuesToRows() become well-behaved and much easier to comprehend. In particular, they are invertible. (In math terms, the operators moveValuesToColumns() and moveValuesToRows() form a groupoid acting on coordinatized data.)

By "wild" data we mean data where cells don’t have unique lookup() addresses. This often happens in data that has repeated measurements. Wild data is simply tamed by adding additional keying columns (such as an arbitrary experiment repetition number). Hygienic data collection practice nearly always produces coordinatized data, or at least data that is easy to coordinatize. Our position is that your data should always be coordinatized; if it’s not, you shouldn’t be working with it yet.

Rows and Columns

Many students are initially surprised that row/column conversions are considered "easy." Thus, it is worth taking a little time to review moving data between rows and columns.

Moving From Columns to Rows ("Thinifying data")

Moving data from columns to rows (i.e., from Scientist 1 to Scientist 3) is easy to demonstrate and explain.

The only hard part of this operation is remembering its name ("gather()") and its arguments. We can remove this inessential difficulty by writing a helper function (to check our preconditions) and a verbose wrapper function (also available as a package from CRAN or Github):
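
The R wrapper itself is not reproduced in this copy. As a sketch of the idea in Python/pandas (the function and argument names below are our own, echoing the article’s verbose naming; data invented):

```python
import pandas as pd

def move_values_to_rows(d, name_for_new_key_column,
                        name_for_new_value_column, columns_to_take_from):
    """Verbose gather/melt wrapper with a coordinatization precondition."""
    # Precondition: the remaining columns must uniquely key each input row,
    # or result cells would lose their unique coordinates.
    key_cols = [c for c in d.columns if c not in columns_to_take_from]
    if d.duplicated(subset=key_cols).any():
        raise ValueError("key columns do not uniquely identify rows")
    return d.melt(id_vars=key_cols,
                  value_vars=columns_to_take_from,
                  var_name=name_for_new_key_column,
                  value_name=name_for_new_value_column)

d1 = pd.DataFrame({
    "model": ["a", "a", "b", "b"], "testset": ["x", "y", "x", "y"],
    "AUC": [0.60, 0.50, 0.80, 0.75], "pR2": [0.20, 0.25, 0.30, 0.25],
})
d3 = move_values_to_rows(d1, "measurement", "value", ["AUC", "pR2"])
print(d3)
```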

In a moveValuesToRows() operation each row of the data frame is torn apart and used to make many rows: each column we take measurements from contributes one new row for each of the original data rows.

The pattern is more obvious if we process each row of d1 independently: every input row expands into one output row per measurement column.

Moving From Rows to Columns ("Widening data")

Moving data from rows to columns (i.e., from Scientist 3 to Scientist 1) is a bit harder to explain, and usually not explained well.

In moving from rows to columns we group a set of rows that go together (match on keys) and then combine them into one row by adding additional columns.

Note: to move data from rows to columns we must know which set of rows go together. That means some set of columns is working as keys, even though this is not emphasized in the spread() calling interface or explanations. For invertible data transforms, we want a set of columns (rowKeyColumns) that define a composite key that uniquely identifies each row of the result. For this to be true, the rowKeyColumns plus the column we are taking value keys from must uniquely identify each row of the input.

To make things easier to understand and remember, we introduce another wrapping function.
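
The R wrapper is not shown in this copy. As a Python/pandas sketch (names and data are our own), note how the caller explicitly declares the rowKeyColumns, letting us check the keying precondition before reshaping:

```python
import pandas as pd

def move_values_to_columns(d, column_to_take_keys_from,
                           column_to_take_values_from, row_key_columns):
    """Verbose spread/pivot wrapper that checks the declared keying."""
    # Input precondition: rowKeyColumns plus the key column must
    # uniquely identify every input row.
    if d.duplicated(subset=row_key_columns + [column_to_take_keys_from]).any():
        raise ValueError("rows are not uniquely keyed; data is not coordinatized")
    res = (d.pivot(index=row_key_columns,
                   columns=column_to_take_keys_from,
                   values=column_to_take_values_from)
             .reset_index())
    res.columns.name = None
    return res

d3 = pd.DataFrame({
    "model":       ["a", "a", "a", "a", "b", "b", "b", "b"],
    "testset":     ["x", "x", "y", "y", "x", "x", "y", "y"],
    "measurement": ["AUC", "pR2", "AUC", "pR2", "AUC", "pR2", "AUC", "pR2"],
    "value":       [0.60, 0.20, 0.50, 0.25, 0.80, 0.30, 0.75, 0.25],
})
d1 = move_values_to_columns(d3, "measurement", "value", ["model", "testset"])
print(d1)
```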

If the structure of our data doesn’t match our expected keying we can have problems. We emphasize that these problems arise from trying to work with non-coordinatized data, and not from the transforms themselves.

Too little keying

If our keys don’t contain enough information to match rows together, we have a problem. Suppose our testset record was damaged or missing, and look at what happens with a direct call to spread:
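
A minimal Python/pandas reproduction of the failure (data invented; pandas pivot plays the role of spread here):

```python
import pandas as pd

# Hypothetical damaged tall frame: the testset column has been lost,
# so (model, measurement) no longer uniquely identifies input rows.
d3_damaged = pd.DataFrame({
    "model":       ["a", "a", "a", "a"],
    "measurement": ["AUC", "pR2", "AUC", "pR2"],  # two AUC rows per model!
    "value":       [0.60, 0.20, 0.50, 0.25],
})

try:
    wide = d3_damaged.pivot(index="model", columns="measurement",
                            values="value")
except ValueError as e:
    wide = None
    print("pivot failed:", e)  # duplicate entries cannot be reshaped
```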

This happens because the precondition is not met: the key columns no longer uniquely identify each row of the input. Catching the error is good, and we emphasize that in our wrapper.

The above issue is often fixed by adding additional columns (such as measurement number or time of measurement).

Too much keying

Columns can also contain too fine a key structure. For example, suppose our data was damaged so that depth is no longer a function of the model id but contains extra detail. In this case a direct call to spread produces a much larger result than intended, because the extra detail prevents rows from matching.
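
A Python/pandas reproduction of this over-keying problem (data invented):

```python
import pandas as pd

# Damaged data: depth is no longer a function of model, so including
# it among the row keys prevents rows that should match from matching.
d3_damaged = pd.DataFrame({
    "model":       ["a", "a", "a", "a"],
    "testset":     ["x", "x", "y", "y"],
    "depth":       [5, 99, 5, 99],                # extra, spurious detail
    "measurement": ["AUC", "pR2", "AUC", "pR2"],
    "value":       [0.60, 0.20, 0.50, 0.25],
})

wide = (d3_damaged.pivot(index=["model", "testset", "depth"],
                         columns="measurement", values="value")
                  .reset_index())
print(wide)  # four rows instead of two, with NaN holes
```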

The frame d3damaged does not match the user’s probable intent: that the columns (model, testset) should uniquely specify row groups, or in other words, they should uniquely identify each row of the result.

In the above case we feel it is good to allow the user to declare intent (hence the extra rowKeyColumns argument) and throw an exception if the data is not structured how the user expects (instead of allowing this data to possibly ruin a longer analysis in some unnoticed manner).

The above issue is usually fixed by one of two solutions (which one is appropriate depends on the situation):

Stricter control (via dplyr::select()) of which columns are in the analysis. In our example, we would select all the columns of d3damaged except depth.

Aggregating or summing out the problematic columns. For example if the problematic column in our example were runtime, which could legitimately vary for the same model and dataset, we could use dplyr::group_by/summarize to create a data frame with columns (model, testset, mean_runtime, measurement, value), so that (model, testset) does uniquely specify row groups.
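
A sketch of the second fix in Python/pandas (column names and values invented): replace the per-row runtime with a per-(model, testset) mean, so (model, testset) again uniquely specifies row groups and the widening succeeds.

```python
import pandas as pd

# Hypothetical frame where runtime legitimately varies row to row.
d3 = pd.DataFrame({
    "model":       ["a", "a", "a", "a"],
    "testset":     ["x", "x", "y", "y"],
    "runtime":     [1.1, 1.3, 0.9, 1.0],
    "measurement": ["AUC", "pR2", "AUC", "pR2"],
    "value":       [0.60, 0.20, 0.50, 0.25],
})

# Summarize the problem column out: mean runtime per (model, testset).
mean_rt = (d3.groupby(["model", "testset"], as_index=False)
             .agg(mean_runtime=("runtime", "mean")))
fixed = d3.drop(columns="runtime").merge(mean_rt, on=["model", "testset"])

# Now the widening succeeds: one row per (model, testset).
wide = (fixed.pivot(index=["model", "testset", "mean_runtime"],
                    columns="measurement", values="value")
             .reset_index())
print(wide)
```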

Conclusion

The concept to remember is: organize your records so data cells have unique consistent abstract coordinates. For coordinatized data the actual arrangement of data into tables, rows, and columns is an implementation detail or optimization that does not significantly change what the data means.

For coordinatized data different layouts of rows and columns are demonstrably equivalent. We document and maintain this equivalence by asking the analyst to describe their presumed keying structure to our methods, which then use this documentation to infer intent and check preconditions on the transforms.

It pays to think fluidly in terms of coordinatized data and delay any format conversions until you actually need them. You will eventually need transforms as most data processing steps have a preferred format. For example, machine learning training usually requires a denormalized form.

We feel the methods moveValuesToRows() and moveValuesToColumns() are easier to learn and remember than abstract terms such as "stack/unstack", "melt/cast", or "gather/spread" and thus are a good way to teach. Perhaps they are even a good way to document (and confirm) your intent in your own projects.