In my previous post, we saw the relationship between ANOVAs and simple linear regression. Though we didn’t see a significant linear relationship between Year and Number of Distinct words in Rihanna songs, one thing that we can do is test whether there is a difference between pre-2010 Rihanna songs and post-2010 Rihanna songs. By grouping (or “binning”) our continuous variable (year), we can turn our simple linear regression into an ANOVA. This can often make things seem simpler: instead of many years, we now have two neat categories. And I did say that ANOVA and linear regression are pretty much the same thing, right?

Well, sorta. Although the linear models used to calculate an ANOVA and a simple linear regression are similar, they are in fact doing something slightly different. It’s common to split continuous variables into groups, but it should be done with reservation, and I'll show you why, so hang tight and put your group-splitting hats on!

Before we begin, I want to remind you that all these linear models are like chainsaws. They take the total distance between data points and the overall mean and divide it into different categories.

If you remember, in both the ANOVA and simple linear regression, we calculate the total Sums of Squares the same way: we take each data point, calculate the difference between it and the overall mean of all the data points, square that difference, and add the squares up. So the total Sums of Squares (or SST) is the same whether we look at Rihanna songs continuously by year or by pre- and post-2010. So where’s the difference?

The difference is in the Sums of Squares “Model” (in regression, we call this Sums of Squares Regression; in ANOVA, we sometimes call this Sums of Squares Group). In either case, our “Model” is how we think X will affect our Y, our variable of interest. In the regression case, we think year will have a continuous linear effect on Y – each unit of X will have an added effect on Y. In the binned ANOVA case, we think that being pre- or post-2010 will have a group effect on Y.

Remember that our regression model predicts the Y value (# of distinct words in a Rihanna song) based on year, and the Sums of Squares Model (SSM) measures how far those predictions fall from the overall mean. In the data we observed, we saw a general downward trend: for each year that passes, there tend to be fewer distinct words in Rihanna songs. Our best guess for the number of distinct words in a Rihanna song is whatever the regression line predicts. The error in our regression model is the distance between the data point and the regression line.

Our ANOVA model bins all the pre-2010 songs together and all the post-2010 songs together, so instead of a general downward trend throughout years, we would simply expect that pre-2010 songs have a higher # of distinct words compared to post-2010 songs. Under this ANOVA model, our best guess for any pre-2010 song would be the pre-2010 group mean, and likewise for any post-2010 song. Everything else is considered error.
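To make the two kinds of "best guess" concrete, here is a minimal Python sketch. The years and word counts are made-up toy numbers (not the real song data), the 2010 cutoff puts 2010 itself in the later group as an assumption, and the helper names are mine.

```python
# Toy data: hypothetical (year, distinct-word count) pairs, NOT the real song data.
years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
words = [81, 75, 74, 73, 67, 66, 61, 59, 58, 51, 51, 46]


def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def group_means(x, y, cutoff=2010):
    """Mean of y for years before the cutoff and for years at or after it."""
    pre = [yi for xi, yi in zip(x, y) if xi < cutoff]
    post = [yi for xi, yi in zip(x, y) if xi >= cutoff]
    return sum(pre) / len(pre), sum(post) / len(post)


intercept, slope = fit_line(years, words)
pre_mean, post_mean = group_means(years, words)


def guess_regression(year):
    # Regression: the best guess shifts with every passing year.
    return intercept + slope * year


def guess_binned(year, cutoff=2010):
    # Binned ANOVA: every song in a group gets that group's mean.
    return pre_mean if year < cutoff else post_mean


for yr in (2005, 2009, 2010, 2016):
    print(yr, round(guess_regression(yr), 1), round(guess_binned(yr), 1))
```

Notice that the binned model gives a 2005 song and a 2009 song exactly the same guess, while the regression line gives them different ones.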

If we truly had a linear relationship between Year and # of distinct words, then running an ANOVA on the two binned year groups will overestimate the proportion of Sums of Squares that is due to “error”. This happens because songs from years far from the middle of their group (say, the earliest pre-2010 songs) will tend to be much farther from the group mean than they would have been from the regression line.

In our Rihanna example, you can see from the ANOVA tables of the continuous and dichotomized models that they have the same Sums of Squares Total, but the Sums of Squares for the Model (Year vs. Pre-2010) is much smaller in the Pre-2010 dichotomized model. If there is a true linear relationship, then the continuous model will have more statistical power than the dichotomized ANOVA.
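The same chainsaw cut can be sketched in a few lines of Python. This uses hypothetical toy data with a built-in downward linear trend rather than the real song data, and it assumes 2010 itself falls in the later group; the point is only to show the same SST being split differently by the two models.

```python
# Toy data (hypothetical, not the real song data): a downward linear trend plus noise.
years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
words = [81, 75, 74, 73, 67, 66, 61, 59, 58, 51, 51, 46]

n = len(words)
grand_mean = sum(words) / n

# Total Sums of Squares: depends only on Y, so it is shared by both models.
sst = sum((w - grand_mean) ** 2 for w in words)

# Continuous model: fitted values come from the least-squares line.
x_bar = sum(years) / n
sxy = sum((x - x_bar) * (w - grand_mean) for x, w in zip(years, words))
sxx = sum((x - x_bar) ** 2 for x in years)
slope = sxy / sxx
fitted_line = [grand_mean + slope * (x - x_bar) for x in years]

# Dichotomized model: fitted values are the pre-/post-2010 group means.
pre = [w for x, w in zip(years, words) if x < 2010]
post = [w for x, w in zip(years, words) if x >= 2010]
pre_mean = sum(pre) / len(pre)
post_mean = sum(post) / len(post)
fitted_bins = [pre_mean if x < 2010 else post_mean for x in years]


def partition(fitted):
    """Return (SS model, SS error) for a list of fitted values."""
    ssm = sum((f - grand_mean) ** 2 for f in fitted)
    sse = sum((w - f) ** 2 for w, f in zip(words, fitted))
    return ssm, sse


ssm_line, sse_line = partition(fitted_line)
ssm_bins, sse_bins = partition(fitted_bins)

print(f"SST (both models): {sst:.1f}")
print(f"continuous:   SSM = {ssm_line:.1f}, SSE = {sse_line:.1f}")
print(f"dichotomized: SSM = {ssm_bins:.1f}, SSE = {sse_bins:.1f}")
```

With a linear trend baked into the toy data, the dichotomized model hands a larger share of the same SST to error, which is the pattern described above.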

In a JMP simulation [1], I used the approximate linear relationship from our BadGirlRiri data to simulate what happens to statistical power when you dichotomize one of your continuous variables. As the amount of error around the true linear relationship gets smaller and smaller, eventually either method will have almost 100% power. When the relationship is that strong, even with the reduction in power due to dichotomization, the model still detects the effect almost every time.

On the other hand, when there is a large amount of variation around the linear relationship, both the dichotomized and the continuous models have trouble detecting any effect due to the disproportionately large error around the relationship. At that point, the error around the model outweighs any linear relationship that might be hiding there (remember, statistical tests measure the amount of variation due to model and compare it to the amount of variation that’s not due to the model) [2].

Most experimenters aim for 80% power as their threshold [3]. This means that if there really is an effect of that size, the experiment will allow our statistical model to detect that effect 80% of the time. A quick review of statistical power will help you remember that there are four main things that can affect your power: sample size, effect size, population variance, and alpha (your cutoff criterion for significance). Unfortunately, your effect size, population variance, and alpha are normally fixed for you, but sample size is the researcher’s plaything. However, cell cultures, human subjects, and radioactive isotopes cost both money and time. So you want the most powerful test you can get with the smallest sample size.

The vertical distance between the red and blue highlighted dots is the average difference in power when you dichotomize vs. leave your variable as continuous. When your continuous model has 80% power (n=20), your dichotomized model has only about 65% power: that’s a drop of 15 percentage points! To have 80% power with your dichotomized model, you’d need more cells, humans, or isotopes, and NSF won't pay for that!

As we would expect, when there is no statistical relationship at all, both have the same 5% chance of falsely signaling the presence of an effect when there is not one. When we define our alpha cut-off criterion, we are choosing a percent of the time that our model will find an effect even when there is not one.
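For readers who want to poke at this themselves, here is a rough standard-library Python sketch of this kind of simulation. It is not the JMP script or the attached code: the slope, the evenly spaced X values, the median split, and the error levels are all hypothetical choices of mine, while n = 20 and 1,000 runs per error level come from footnote [1]. It relies on the fact that a two-group ANOVA is the same test as a regression on a 0/1 group indicator (F equals t squared), so both models can share one helper and one critical value.

```python
import math
import random

N = 20           # sample size per simulated data set (footnote [1])
N_SIMS = 1000    # simulated data sets per error level (footnote [1])
T_CRIT = 2.101   # two-sided 5% critical value of t with N - 2 = 18 df
SLOPE = -1.0     # hypothetical true slope, NOT the value from the original simulation

X = list(range(N))                               # hypothetical evenly spaced "years"
X_BINNED = [0 if x < N // 2 else 1 for x in X]   # median split into two groups


def abs_t(pred, y):
    """|t| for the slope when y is regressed on pred (df = len(y) - 2).

    With a 0/1 predictor, t squared equals the two-group ANOVA F statistic.
    """
    n = len(y)
    p_bar = sum(pred) / n
    y_bar = sum(y) / n
    spp = sum((p - p_bar) ** 2 for p in pred)
    spy = sum((p - p_bar) * (yi - y_bar) for p, yi in zip(pred, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sse = syy - spy ** 2 / spp
    if sse <= 0:
        return math.inf
    return abs(spy / spp) / math.sqrt(sse / (n - 2) / spp)


def power(error_sd, slope=SLOPE, seed=0):
    """Share of simulated data sets where each model rejects at alpha = 0.05."""
    rng = random.Random(seed)
    hits_cont = hits_bin = 0
    for _ in range(N_SIMS):
        y = [slope * x + rng.gauss(0, error_sd) for x in X]
        hits_cont += int(abs_t(X, y) > T_CRIT)
        hits_bin += int(abs_t(X_BINNED, y) > T_CRIT)
    return hits_cont / N_SIMS, hits_bin / N_SIMS


for sd in (2, 9, 40):  # small, moderate, and large error around the line
    cont, binned = power(sd)
    print(f"error sd {sd:>2}: continuous {cont:.2f}, dichotomized {binned:.2f}")
```

Calling `power(9, slope=0.0)` simulates the no-relationship case, where both rejection rates should hover around alpha.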

What have we learned? Thou shalt not split a continuous variable into groups if there is reason to suspect a truly linear relationship. Think about your data and what you expect the relationship to be between your X and Y. If you want to split your continuous variable into groups, make sure that there's a theoretical reason to do so, and don't limit yourself to dichotomous splits. People often split at the median, but that may not make sense for your data. Listen to your data.

[1] n for each simulation is 20; each level of error was simulated 1,000 times. Both continuous models and dichotomized models were run on the same data, and the number of positive responses (significant results) was recorded. More information is in the attached scripts. Python and R versions are available.

[2] It should be noted that as you get arbitrarily large sample sizes, these things won’t happen as quickly. This is because no matter how much error you have, you’re dividing it by n-1; so as n approaches some arbitrarily large number, your MSerror could still be quite small compared to your MSmodel. TL;DR the larger your sample size, the more variation there has to be before you get 0% power for both models.