Martin Johnsson's blog

Meny

Category Archives: data analysis

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse-style (including the %>%s and all) to enjoy tidy data.

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like what you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.

The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Also, the second and third arguments (and they would be the first and second if you piped something into it), are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long form data frame as above:

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for data science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: A longer version expanding on the same topics. I don’t think one should cram more contents in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.

One of my Summer of data science learning points was to play with out of the box prediction tools. So let’s try out a few genomic prediction methods. The code is on GitHub, and the simulated data are on Figshare.

Genomic selection is the happy melding of quantitative and molecular genetics. It means using genetic markers en masse to predict traits and and make breeding decisions. It can give you better accuracy in choosing the right plants or animals to pair, and it can allow you to take shortcuts by DNA testing individuals instead of having to test them or their offspring for the trait. There are a bunch of statistical models that can be used for genomic prediction. Now, the choice of prediction algorithm is probably not the most important part of genomic selection, but bear with me.

First, we need some data. For this example, I used AlphaSim (Faux & al 2016), and the AlphaSim graphical user interface, to simulate a toy breeding population. We simulate 10 chromosomes á 100 cM, with 100 additively acting causal variants and 2000 genetic markers per chromosome. The initial genotypes come from neutral simulations. We run one generation of random mating, then three generations of selection on trait values. Each generation has 1000 individuals, with 25 males and 500 females breeding.

So we’re talking a small-ish population with a lot of relatedness and reproductive skew on the male side. We will use the two first two generations of selection (2000 individuals) to train, and try to predict the breeding values of the fourth generation (1000 individuals). Let’s use two of the typical mixed models used for genomic selection, and two tree methods.

We start by splitting the dataset and centring the genotypes by subtracting the mean of each column. Centring will not change predictions, but it may help with fitting the models (Strandén & Christensen 2011).

Let’s begin with the workhorse of genomic prediction: the linear mixed model where all marker coefficients are drawn from a normal distribution. This works out to be the same as GBLUP, the GCTA model, GREML, … a beloved child has many names. We can fit it with the R package BGLR. If we predict values for the held-out testing generation and compare with the real (simulated) values, it looks like this. The first panel shows a comparison with phenotypes, and the second with breeding values.

This gives us correlations of 0.49 between prediction and phenotype, and 0.77 between prediction and breeding value.

This is a plot of the Markov chain Monte Carlo we use to sample from the model. If a chain behaves well, it is supposed to have converged on the target distribution, and there is supposed to be low autocorrelation. Here is a trace plot of four chains for the marker variance (with the coda package). We try to be responsible Bayesian citizens and run the analysis multiple times, and with four chains we get very similar results from each of them, and a potential scale reduction factor of 1.01 (it should be close to 1 when it works). But the autocorrelation is high, so the chains do not explore the posterior distribution very efficiently.

BGLR can also fit a few of the ”Bayesian alphabet” variants of the mixed model. They put different priors on the distribution of marker coefficients allow for large effect variants. BayesB uses a mixture prior, where a lot of effects are assumed to be zero (Meuwissen, Hayes & Goddard 2001). The way we simulated the dataset is actually close to the BayesB model: a lot of variants have no effect. However, mixture models like BayesB are notoriously difficult to fit — and in this case, it clearly doesn’t work that well. The plots below show chains for two BayesB parameters, with potential scale reduction factors of 1.4 and 1.5. So, even if the model gives us the same accuracy as ridge regression (0.77), we can’t know if this reflects what BayesB could do.

On to the trees! Let’s try Random forest and Bayesian additive regression trees (BART). Regression trees make models as bifurcating trees. Something like the regression variant of: ”If the animal has a beak, check if it has a venomous spur. If it does, say that it’s a platypus. If it doesn’t, check whether it quacks like a duck …” The random forest makes a lot of trees on random subsets of the data, and combines the inferences from them. BART makes a sum of trees. Both a random forest (randomForest package) and a BART model on this dataset (fit with bartMachine package), gives a lower accuracy — 0.66 for random forest and 0.72 for BART. This is not so unexpected, because the strength of tree models seems to lie in capturing non-additive effects. And this dataset, by construction, has purely additive inheritance. Both BART and random forest have hyperparameters that one needs to set. I used package defaults for random forest, values that worked well for Waldmann (2016), but one probably should choose them by cross validation.

Finally, we can use classical quantitative genetics to estimate breeding values from the pedigree and relatives’ trait values. Fitting the so called animal model in two ways (pedigree package and MCMCglmm) give accuracies of 0.59 and 0.60.

So, in summary, we recover the common wisdom that the linear mixed model does the job well. It was more accurate than just pedigree, and a bit better than BART. Of course, the point of this post is not to make a fair comparison of methods. Also, the real magic of genomic selection, presumably, happens on every step of the way. How do you get to that neat individual-by-marker matrix in the first place, how do you deal with missing data and data from different sources, what and when do you measure, what do you do with the predictions … But you knew that already.

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:

One way is to write a function that takes the plot object as input, and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2 like feel:

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and all diets at the same time is just one more mapping in aes(). No looping or custom functions required.

When you exit an R session, you’re faced with the question of whether or not to save your workspace. You should almost never answer yes. Saving your workspace creates an image of your current variables and functions, and saves them to a file called ”.RData”. When you re-open R from that working directory, the workspace will be loaded, and all these things will be available to you again. But you don’t want that, so don’t save your workspace.

Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don’t know about your old selves, dear reader, but if they are anything like mine, don’t save your workspace.

What should one do instead? One should source the script often, ideally from freshly minted R sessions, to make sure to always be working with a script that runs and does what it’s supposed to. Storing a data frame in the workspace can seem comforting, but what happens the day I overwrite it by mistake? Don’t save your workspace.

Yes, I’m exaggerating. When using any modern computer system, we rely on saved information and saved state all the time. And yes, every time a computation takes too much time to reproduce, one should write it to a file to load every time. But I that should be a deliberate choice, worthy of its own save() and load() calls, and certainly not something one does with simple stuff that can be reproduced a the blink of an eye. Put more trust in your script than in your memory, and don’t save your workspace.

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected to one column, which will be called ”value”. The column ”key” will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

A participant in the R course I’m teaching showed me a case where a tbl_df (the new flavour of data frame provided by the tibble package; standard in new RStudio versions) interacts badly with the t.test function. I had not seen this happen before. The reason is this:

Interacting with legacy code
A handful of functions are don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame (tibble announcement on RStudio blog)

I did not know that. The solution, which they found themselves, is to use as.data.frame.

I can see why not dropping to a vector makes sense. I’m sure you’ve at some point expected a data frame and got an ”$ operator is invalid for atomic vectors”. But it’s an unfortunate fact that number of weird little thingamajigs to remember is always strictly increasing as the language evolves. And it’s a bit annoying that the standard RStudio setup breaks code that uses an old stats function, even if it’s in somewhat non-obvious way.