Hey pretty

by Danielle Navarro, 12 May 2019

Well, it’s 3 a.m
I’m out here riding again
Through the wicked winding streets of my world
I make a wrong turn break it now, I’m too far gone
I got a siren on my tail and that ain’t the fine I’m lookin’ for

A little over a year ago I started a project that I referred to as 100 Days of CRAN, in which my goal was to quickly try out as many new R packages as possible, in the hope of broadening my horizons as a programmer. It’s a strange thing to look back on. Not surprisingly, the main thing I accomplished was learning a little bit about many things, and didn’t quite grasp the details in most cases. For instance, when I look at my initial post on blogdown it’s clear I didn’t know what I was doing.

And that’s absolutely fine.

When learning a new thing you’re always going to be a bit out of your depth, and in a lot of cases a shallow grasp of a lot of things is the kind of fine I’m looking for. I don’t have a lot of need for leaflet or sparklines, for instance, and I’m content not to learn more about those at the moment. On the other hand, I’ve been spending more of my time working in blogdown and bookdown and I do want to learn more about literate programming techniques. Similarly, I think my skills at package development could be improved. Having recently decided that I want to allocate some of my time to writing a proper revision to Learning Statistics with R, I don’t want to dive in on the writing until I’ve spent time thinking carefully about the scaffolding and infrastructure required to write the book the way I think it should be written.

When I first wrote the book, I was tightly constrained by the expectation that I would teach a very conventional psychological research methods class, only using R rather than SPSS. My lecture notes necessarily reflected what I had to teach, not what I think would be best for new students or for data analysts in the real world.

In the years that have followed – I started writing way back in 2010 – a lot of things have changed in R and in my thinking. The advent of the tidyverse has shifted the culture and focus of R in a lot of ways, and the toolkit surrounding reproducible research has become so much more accessible. When I wrote the original lsr I created it using package.skeleton(), wrote my own .Rd files by hand, and didn’t use any version control whatsoever. Also the code is terrible. The book and package are both artifacts of a different era.

Over the last six months or so, I’ve been teaching myself how to use a modern software development toolkit in R: devtools, usethis, roxygen2, testthat, and so on. I’ve switched to git and GitHub as the method for version control and sharing. I document my work using R Markdown, blogdown, and bookdown. It’s starting to pay off, I think. I’ve switched this blog to the slumdown theme that I wrote largely from scratch – it’s a work in progress, but there is some documentation here – and I feel much more secure about the code that I’ve written because I’ve taken a deeper dive on these topics.

It’s not only software development and documentation where my thinking has shifted. My data analysis methods have changed too: my data wrangling has moved to dplyr, my data visualisation to ggplot2, and my programming style has become more oriented towards purrr. Even my thinking about statistical inference has changed a lot since 2010. For example, I no longer think of Bayes factors as particularly reliable (I never really believed much in p-values), and I’ve started wondering more about the relationship between science and statistics. I’ve been using preregistration tools more than I used to, but I’ve come to think there is a lack of clarity about what preregistration is actually for.

All of this has led me to a dilemma about how I should approach any revision to the book, which I’ll try to illustrate indirectly.

A slumdown parable

I see a stairway so I follow it down
Into the belly of a whale where my secrets echo all around
You know me now, but to do better than that
You’ve got to follow me boy
I’m tryin’ to show you where I’m at

One thing that strikes me as difficult about data analysis is that most data sets of interest are the outcome of a data-generating process that is much more complicated than any of the simplified models used to approximate it. I’ve often argued that we as scientists and statisticians need to be wary of putting too much faith in our rather fragile tools. That was the main point of my cute little “Devil and the Deep Blue Sea” paper, and my “Personal Essay on Bayes Factors”, and my “Nothing Works and Everything Sucks” talk… I am nothing if not consistent in my worry.

To try to make the point in an oblique fashion, take a second to consider this image.

It’s not the most exciting scatterplot in the world, I’ll admit. We have something plotted as the horizontal variable, something else plotted as the vertical variable, and it looks like there is a positive linear relationship between them. If you’re like me, that’s the only thing you see here, right?

Now consider the mechanics (not statistics) of how that plot came to exist on this page. I’m writing the text in R Markdown, a hybrid tool that takes blocks of markdown, raw HTML, LaTeX, R code (or code from other languages) and YAML headers, and stitches them together into HTML output by way of some dark knitr/pandoc magic that I swear only Yihui Xie truly understands. More precisely, my R Markdown document is rendered using blogdown, so the output is assembled into a static website using the Hugo framework. The raw HTML is not the final thing one sees on the page, however, because the page calls external JavaScript libraries that apply a number of changes to the visual appearance of the document (e.g., syntax highlighting is done with highlight.js). There is a long chain of dependencies between “Danielle types some words” and “those words appear on the page”.

Even within any one part of this chain, there are complicated dependencies. The scatterplot above is drawn using the ggplot2 package, which constructs a plot object that is then rendered using the grid graphics system. When coding interactively, the output is typically drawn to the plot pane in RStudio, but on this website it is instead passed to the png graphics device. Each of these systems makes some assumptions about what the user is trying to do, but these assumptions are usually reasonable so one rarely notices.

However… nothing in life is simple, and these different components are not stitched together perfectly. If one looks closely it’s not too hard to find the loose threads, and this is the point at which it becomes critical that you really understand your tools. So let’s pull at the threads. We’ll start by loading some data and packages:
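In outline, the setup chunk looks something like this (the package choices and the exact `here::here()` path are my reconstruction, not a verbatim copy):

```r
library(tidyverse)  # dplyr, ggplot2, readr, and friends
library(here)       # project-relative file paths

# the data file lives in the static directory of the blogdown project
bridges <- read_csv(here("static", "data", "brownian_bridges.csv"))
```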

There’s already a hidden complexity lurking here. The URL for the data set in question is:

https://djnavarro.net/data/brownian_bridges.csv

so why does my call to here::here() refer to the static directory at all? Shouldn’t the file be located here:

https://djnavarro.net/static/data/brownian_bridges.csv

You’d think so, but it is not. The “problem” is that the file paths on my machine that blogdown uses to initially render the post to HTML are not the same as the file paths where Hugo ends up moving files. My code here refers to the file structure of my blogdown project, not the file structure of my webpage. So there’s one dangling thread already.

But let’s continue. The background on the page is very dark, so the white dots in my picture stand out nicely. That’s a good design principle for data visualisation: make the data stand out, and have the other elements of the plot meld seamlessly into the background. So let’s start by creating the plot in ggplot2.
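The plotting code itself is short; a sketch of it (the variable names `Horizontal` and `Vertical` are placeholders for whatever the data set actually contains):

```r
p <- ggplot(bridges, aes(x = Horizontal, y = Vertical)) +
  geom_point(size = 4, colour = "white") +  # large white dots
  coord_equal()                             # fix the aspect ratio of the plot
```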

Looking at this code it’s fairly obvious what we should expect. We should obtain some kind of scatterplot with large white dots, and indeed we do…

plot(p)

… but wow, that is a bit of an eyesore and the data are almost impossible to see.

To anyone who knows R, this is hardly a surprise. Although my webpage uses a colour scheme that has a purple background, ggplot2 knows nothing of my website so it supplies its own defaults, namely theme_grey(). My colour scheme on the page attaches to the slum theme for Hugo, and Hugo and ggplot2 do not directly speak to one another. As it happens though, I wrote the slum theme myself, and did so with the assumption that I’d want R to have easy access to things like the colour schemes. The colour palette for this page is specified in the palette_kunoichi.css file, whose contents are quite simple
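To give a sense of what “quite simple” means, the file is just a handful of one-line CSS rules in this style (the class names and hex values below are illustrative placeholders, not the real kunoichi palette):

```css
/* palette_kunoichi.css -- illustrative sketch only */
.pagecolour { color: #562457; }
.maintext   { color: #ffffff; }
.faded      { color: #ba68c8; }
.highlight  { color: #ffa726; }
```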

There is a line in my YAML header to this page that specifies which palette file to use, and blogdown dutifully passes this information along to Hugo, with the end result being that the webpage adheres to this colour scheme:

colourscheme: "css/palette_kunoichi.css"

What about the R side? Because the colour scheme is exposed through such a simple file, it is very easy to write a wrapper function that pulls it into R and formats it as a named vector of colours:
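A sketch of such a wrapper (the function name `slum_palette`, the file path, and the CSS-parsing details here are assumptions, not the real slumdown source):

```r
slum_palette <- function(name) {
  # locate the palette file shipped with the slum Hugo theme
  path <- here::here("themes", "slum", "static", "css",
                     paste0("palette_", name, ".css"))
  css <- readLines(path)

  # keep only the lines that define a class, e.g. ".pagecolour { color: #562457; }"
  rules <- css[grepl("^\\.", css)]

  # extract the class names and the hex colour values
  labels  <- sub("^\\.(\\w+).*$", "\\1", rules)
  colours <- regmatches(rules, regexpr("#[0-9a-fA-F]{6}", rules))

  # return a named vector of colours
  names(colours) <- labels
  colours
}
```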

There are even convenience functions to let me generate new colour palettes and find the file paths to the existing ones (slum_palette_create and slum_palette_paths), and these are partly documented here. Better yet, there is a theme_slum() function that automatically generates a ggplot theme to match a palette, so now I can do this and then everything will be…

p + theme_slum("kunoichi")

… huh. Not quite as I expected.

When I first encountered this problem I scoured the internet looking for some obscure ggplot2 theme element that I’d missed, and came up with nothing. Eventually, it dawned on me that the problem is not ggplot2 at all; it’s the graphics device. The theme_slum() function aligns the colour scheme in ggplot2 with the CSS on the website, but the R graphics device still doesn’t know anything about what I’m doing. Most of the time this issue doesn’t arise because the ggplot2 output “covers” the whole of the image, but there are exceptions.

Any time the generated image has a different aspect ratio from the ggplot image – in this case, the ggplot image is constrained by the coord_equal() function whereas the generated png file is constrained by the knitr defaults – the graphics device fills in its own background colour. So we need to fix that too. The png() function has an argument bg that specifies the background colour of the image, so it would be really easy to set that to purple… or at least it would be if I were calling png() myself. But I am not: knitr is doing that for me, so I have to work out how to pass arguments to the graphics device from inside my R Markdown document. Thankfully Yihui Xie is a genius and there is a mechanism for doing this. So after a little searching I worked it out and wrote a simple wrapper function:
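That mechanism is the dev.args chunk option, which knitr forwards to the graphics device. So the wrapper only needs to look up the page background colour and set that option; roughly (the palette-entry name "pagecolour" is an assumption on my part):

```r
slum_plot_background <- function(name) {
  # look up the page background colour for this palette
  bg <- slum_palette(name)["pagecolour"]

  # tell knitr to pass bg to the png() device for subsequent chunks
  knitr::opts_chunk$set(dev.args = list(bg = unname(bg)))
}
```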

slum_plot_background("kunoichi")

And now that I’ve done that, my plot works flawlessly:

p + theme_slum("kunoichi")

Of course, this could still be improved – I’m not entirely happy with this colour palette to be perfectly honest – but importantly the plot now looks like part of the page, and the result is something that looks seamless. It “just works”.

What does this have to do with data analysis?

Now I’ve got a mind full of wicked designs
I’ve got a non-stop hole in my head
Imagination
I’m in a building that has two thousand floors
And when they all fall down
I think you know it’s you they’re falling for

I’ll be honest. Most of that little story was me showing off what I’ve learned about how the different moving parts of a blogdown post interact with each other, and how tricky it can be to make them work nicely together. But there’s a serious point to this too. If you look back to my original blogdown post from a year ago you can see I was pretty intimidated by the complexity of the tool I was using, and oscillating wildly between two competing intuitions:

Intuition 1: This is terrifying so I should mindlessly follow the instructions in the desperate hope that the rules will keep me from doing something dangerous.

Intuition 2: This is too complex. Burn everything to the ground, simplify everything to the smallest possible version of the thing I want to do so that I can actually see what I’m messing with.

I think both of these are reasonable intuitions, and both strategies have their place in the world. But I managed to break a lot of things about this site by “meddling in the affairs of wizards”, and so I’ve had to rebuild the whole thing from scratch in order to make the site do what I want. And to do that, I had to teach myself a lot of new things. I had to learn a lot of web programming, I had to learn the software development tools in R, and I had to write a lot of tests to check my work because I made a lot of mistakes as I went.

Real world data analysis has a similar character. Any given data set is the outcome of a complicated process, put together by a human agent who decided to collect it (for reasons unknown), biased by selection and censoring mechanisms you can’t detect, and then analysed using statistical tools that were almost certainly designed to solve a very different problem from the one you actually have. Whenever we analyse data we are always dealing with complicated objects we do not really understand (else why bother to study them?), and so it should come as no surprise that – in my opinion at least – neither the “follow all the rules all the time” nor the “strip it down and simplify” strategy is the best way to approach data analysis. Yet these are the two things we tend to teach people.

My view is that you can do much better if you take the time to learn about the many different moving parts of a data analysis. But that learning process takes a lot of time, and I find there are still so many things to learn. You need substantive domain knowledge about the phenomenon you’re studying, you need context-specific knowledge about the measurement process that was used to obtain the data, and you need technical knowledge about the tools you are using and what problems they can address. This is hard and I don’t know how to do it well myself – so I am constantly bemused by the stories that we tell ourselves about “the scientific method” or “statistical rules”. Who are these people who seem to think they actually do know “the right way to do science (or statistics)” and why do they think that?

For my own part, the more I dig into the details and discover what my tools are doing, the less certain I am about any inferential claims that we make in the scientific literature. All of which leaves me with a novel problem…

100 days of … something

I can’t forget, I am the sole architect
I built the shadows here
I built the growling voice I fear
You add it up but to do better than that
You’ve got to follow me boy
I’m tryin’ to show you where I’m at

The changes in my thinking about science, R and statistics have started showing up across my teaching in quite a few places, but they are not yet reflected in Learning Statistics with R. I am uncertain how much of my current thinking I should impose on a revision. How much of this should a student be shown in a “first course”?

There are so many things I want to add to the book, so many things I want to improve, but I want to be careful not to “fix” the things that aren’t broken. I might indeed have a mind full of wicked designs, but when it comes to this book I am also the sole architect. There are a lot of people – many, many more than I would ever have imagined when I started writing – that are using the original work, and I want to take care to make changes in ways that would be helpful to them.

Which brings me full circle to the 100 Days of CRAN exercise I undertook last year. One thing that I found very helpful about writing blog posts and sharing them on Twitter is that I received a lot of super helpful comments, thoughts and ideas from the broader community. So I’m hoping to start another 100 Days exercise (I haven’t thought of a catchy name for it yet), this time focusing on topics that I think might be relevant to revising the book. I’m not foolish enough to think I can write a blog post every day – especially as I’ve had some health problems lately and am on long-term medical leave – but I do want to get into the habit of writing regularly as a method of clarifying my thoughts about what should go into the book revision.