32 Hierarchical data

32.1 Introduction

This chapter belongs in wrangle: it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON. However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration. Now you have those tools under your belt, you can learn how to work with hierarchical data.


As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists. There are three common sources of such data:

JSON and XML

The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:

You can extract deeply nested elements in a single call by supplying a character vector to the map functions.

You can remove a level of the hierarchy with the flatten functions.

You can flip levels of the hierarchy with the transpose function.

32.1.1 Prerequisites

This chapter focusses mostly on purrr. As well as the tools for iteration that you’ve already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data.

library(purrr)

32.2 Initial exploration

Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I’ve previously downloaded a list of GitHub issues related to this book and saved it as issues.json. Now I’m going to load it into a list with jsonlite. By default fromJSON() tries to be helpful and simplifies the structure a little for you. Here I’m going to show you how to do it with purrr, so I set simplifyVector = FALSE:
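Here is a minimal sketch of the difference that simplifyVector makes. The JSON string below is a hypothetical stand-in for issues.json (the field names and values are invented for illustration), so treat it as a demonstration of the argument rather than the real data.

```r
library(jsonlite)

# Hypothetical stand-in for the contents of issues.json
json <- '[
  {"number": 1, "title": "Typo", "user": {"login": "alice"}},
  {"number": 2, "title": "Broken link", "user": {"login": "bob"}}
]'

# Default behaviour: an array of objects is simplified to a data frame
simplified <- fromJSON(json)
class(simplified)

# With simplifyVector = FALSE you get plain nested lists instead
issues_demo <- fromJSON(json, simplifyVector = FALSE)
class(issues_demo)
length(issues_demo)

# Each issue is itself a list, so nested fields are reached with [[ and $
issues_demo[[1]]$user$login
```

To read from a file instead, you would pass the path (for example "issues.json") as the first argument.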

You might be tempted to use str() on this data. Unfortunately, str() is not designed for lists that are both deep and wide, and you’ll tend to get overwhelmed by the output. A better strategy is to pull the list apart piece by piece.

First, figure out how many elements are in the list, take a look at one, and then check that they all have the same structure. In this case there are eight elements, and the first element is another list.
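Those three steps might look something like the sketch below. It uses a small hypothetical list, issues_demo, in place of the real issues data, so the counts you see will differ from the eight elements described above.

```r
library(purrr)

# Hypothetical stand-in for the parsed issues list
issues_demo <- list(
  list(number = 1L, title = "Typo", user = list(login = "alice")),
  list(number = 2L, title = "Broken link", user = list(login = "bob"))
)

# 1. How many elements are there?
length(issues_demo)

# 2. Look at one element, but only at its top level
str(issues_demo[[1]], max.level = 1)

# 3. Do all elements have the same number of fields and the same names?
n_fields <- map_int(issues_demo, length)
n_fields

field_names <- map(issues_demo, names)
same_names <- all(map_lgl(field_names, function(nm) identical(nm, field_names[[1]])))
same_names
```

Limiting str() with max.level keeps the output manageable even when each element is itself deeply nested.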

What happens if that path is missing in some of the elements? For example, let’s try to extract the HTML URL of the pull request:

issues %>% map_chr(c("pull_request", "html_url"))
#> Result 4 must be a single string, not NULL of length 0

Unfortunately that doesn’t work. Whenever you see an error from purrr complaining about the “type” of the result, it’s because it’s trying to shove it into a simple vector (here a character). You can diagnose the problem more easily if you use map():
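As a rough illustration, here is what that diagnosis could look like on a small hypothetical list in which only one element has a pull_request entry. The data and the example.com URL are invented. Using .default to supply a fallback value is my assumption about the most natural way to get back to a character vector.

```r
library(purrr)

# Hypothetical data: only the second element has a pull_request entry
issues_demo <- list(
  list(number = 1L),
  list(number = 2L, pull_request = list(html_url = "https://example.com/pull/2"))
)

# map() always returns a list, so the missing paths show up as NULL
urls <- map(issues_demo, c("pull_request", "html_url"))
str(urls)

# Which elements are missing the path?
missing_path <- map_lgl(urls, is.null)
missing_path

# Supplying a fallback lets map_chr() build a character vector again
urls_chr <- map_chr(
  issues_demo,
  c("pull_request", "html_url"),
  .default = NA_character_
)
urls_chr
```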

(You might wonder why returning a missing value for absent elements isn’t the default, since it’s so useful. Well, if it were the default, you’d never get an error message if you had a typo in the names. You’d just get a vector of missing values. That would be annoying to debug because it’s a silent failure.)

32.4 Removing a level of hierarchy

As well as indexing deeply into a hierarchy, it’s sometimes useful to flatten it. That’s the job of the flatten family of functions: flatten(), flatten_lgl(), flatten_int(), flatten_dbl(), and flatten_chr(). In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
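A minimal sketch of that sequence, using a small made-up list. Note that recent purrr releases mark the flatten family as deprecated, so you may see a warning when running it.

```r
library(purrr)

# A list of lists of double vectors
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)

# Remove one level: now a flat list of double vectors
y <- flatten(x)
str(y)

# Remove the last level: now a plain double vector
z <- flatten_dbl(y)
z
```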

Whenever I get confused about a sequence of flattening operations, I’ll often draw a diagram like this to help me understand what’s going on.

Base R has unlist(), but I recommend avoiding it for the same reason I recommend avoiding sapply(): it always succeeds. Even if your data structure accidentally changes, unlist() will continue to work, silently returning the wrong type of output. This tends to create problems that are frustrating to debug.
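To make the risk concrete, here is a small sketch with two made-up lists, one of which has accidentally picked up a string. The tryCatch() wrapper is only there so the script keeps running; I expect flatten_dbl() to refuse the bad input rather than coerce it, but check the behaviour in your own purrr version.

```r
library(purrr)

good <- list(1, 2, 3)
bad <- list(1, "2", 3)  # one element has accidentally become a string

# unlist() succeeds in both cases, but the output type changes silently
good_type <- typeof(unlist(good))
bad_type <- typeof(unlist(bad))
good_type
bad_type

# The typed flatten function is expected to complain instead of coercing
tryCatch(
  flatten_dbl(bad),
  error = function(e) message("flatten_dbl() failed: ", conditionMessage(e))
)
```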

32.5 Switching levels in the hierarchy

Other times the hierarchy feels “inside out”. You can use transpose() to flip the first and second levels of a list:

You’ll see an example of this in the next section, as transpose() is particularly useful in conjunction with adverbs like safely() and quietly().

It’s called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: x[i, j] is the same as t(x)[j, i]. It’s the same idea when transposing a list, but the subsetting looks a little different: x[[i]][[j]] is equivalent to transpose(x)[[j]][[i]]. Similarly, a transpose is its own inverse so transpose(transpose(x)) is equal to x.
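You can check both properties on a small example. The list below is invented purely for illustration.

```r
library(purrr)

# Two "rows", each with the same two named fields
x <- list(
  list(a = 1, b = "x"),
  list(a = 2, b = "y")
)

# After transposing, the outer level is the field and the inner level is the row
tx <- transpose(x)
str(tx)

# Subsetting with the indices swapped reaches the same value
same_value <- identical(x[[2]][["a"]], tx[["a"]][[2]])
same_value

# Transposing twice should bring you back to where you started
round_trip <- all.equal(transpose(tx), x)
round_trip
```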

Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R’s column-based format. transpose() makes it easy to switch between the two:
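As a sketch of that round trip, suppose an API returned three records, represented here by a hypothetical rows list. The field names x and y are invented, and collecting the columns with map_dbl() and map_chr() is just one way to simplify them.

```r
library(purrr)

# Row-based format: one list per record, as a JSON API might return it
rows <- list(
  list(x = 1, y = "a"),
  list(x = 2, y = "b"),
  list(x = 3, y = "c")
)

# Flip to column-based format: one list per field
cols <- transpose(rows)
str(cols)

# Simplify each column to an atomic vector, as a data frame would store it
cols_vec <- list(
  x = map_dbl(cols$x, identity),
  y = map_chr(cols$y, identity)
)
str(cols_vec)

# And back again: columns to rows
rows_again <- transpose(cols_vec)
str(rows_again[[1]])
```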