The Why Axis: small sample sizes and too many slopes

I’m working my way through Picturing the Uncertain World by Howard Wainer, a collection of articles about dealing with uncertainty in statistical thinking and visualization. I could (and might!) write an essay about every article in the book. In this first post, I want to pull out two points that might be useful when analyzing or writing about data.

Picturing the Uncertain World is a series of real-world case studies, often with deeply felt consequences. I’m collapsing a few articles together here, so to illustrate these points I’m going to use a fictional example with absolutely no consequences: swords in the made-up country of Knightlandia. Look for the paragraphs that start with “The bottom line” if that’s all you’re interested in.

1. Group sizes and extreme values

The first chapter in the book is centered around what Wainer calls the most dangerous equation, and what I know as standard error. Specifically, the chapter focuses on the tendency for measurements to be more extreme in small samples than in large samples from the same population.
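For reference, here is the standard error of a sample mean as I usually see it written. The notation is my own shorthand rather than a quote from the book: σ is the standard deviation of individual measurements in the population, and n is the number of measurements in the sample.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

Because n sits under a square root, quadrupling the sample size only halves the standard error, and very small samples produce averages that swing widely.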

Let’s think about some swords. Some swords in Knightlandia are fancy and expensive, and some swords are plain and cheap. Every sword is made by a blacksmith. Some blacksmiths are busy specialists that make a lot of swords, and some only make a few swords a year because they are focusing on plowshares. Blacksmiths in populated areas with lots of knights tend to make more swords than blacksmiths in rural areas with fewer knights.

If a busy city blacksmith gets an order for an unusually inexpensive sword, it isn’t going to have a huge impact on the blacksmith’s average sword price for the year: they’re making dozens of other swords, so the very cheap sword will be balanced out. However, that cheap sword will drive down the average yearly sword price of a rural blacksmith who only makes three swords a year.

Let’s say the king is interested in lowering armory costs, so he commissions a study of average sword prices across all the blacksmiths in Knightlandia. He looks at the ten blacksmiths with the cheapest average sword price, and finds that most of them live in rural areas. Therefore, he decides to move sword production to the countryside in order to cut down on costs.

However, if the king looked at all average sword prices, he would see that rural blacksmiths as a whole don’t make less expensive swords. Their average sword prices just vary a lot more than those of blacksmiths in cities, because averages based on a handful of swords can be skewed more easily. Most of the blacksmiths with the most expensive average sword price probably live in the country, too.

In this case, I’m talking about the average price of all swords that a blacksmith makes, but the same principles apply if we are looking at samples of sword prices. The smaller your sample sizes, the more variation you can expect to see. It’s important to keep this in mind because as data analysts and humans, we love looking at extreme values. The best/worst/cheapest/priciest/prettiest/ugliest of anything is going to catch our eye, and it’s important that we make sure our conclusions aren’t based on group size and random chance.
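To make this concrete, here is a rough simulation sketch. Everything in it is an assumption of mine for illustration: the number of blacksmiths, the swords each one makes per year, and the price distribution are made up, and every sword price is drawn from the same distribution regardless of region. Any difference between rural and city averages therefore comes from group size alone.

```python
import random
import statistics


def simulate_blacksmiths(n_city=800, n_rural=200, city_swords=50,
                         rural_swords=3, mean_price=100.0, sd_price=30.0,
                         seed=0):
    """Return a list of (region, average_price) pairs, one per blacksmith.

    All numbers are hypothetical. Every sword price comes from the same
    normal distribution, so regions differ only in swords per blacksmith.
    """
    rng = random.Random(seed)
    smiths = []
    for region, count, n_swords in (("city", n_city, city_swords),
                                    ("rural", n_rural, rural_swords)):
        for _ in range(count):
            prices = [rng.gauss(mean_price, sd_price) for _ in range(n_swords)]
            smiths.append((region, statistics.fmean(prices)))
    return smiths


def spread_by_region(smiths):
    """Standard deviation of the blacksmiths' average prices, per region."""
    return {
        region: statistics.stdev([avg for r, avg in smiths if r == region])
        for region in ("city", "rural")
    }


def rural_share_of_extremes(smiths, k=10):
    """Count rural blacksmiths among the k cheapest and the k priciest."""
    ranked = sorted(smiths, key=lambda s: s[1])
    cheapest = sum(1 for region, _ in ranked[:k] if region == "rural")
    priciest = sum(1 for region, _ in ranked[-k:] if region == "rural")
    return cheapest, priciest


if __name__ == "__main__":
    smiths = simulate_blacksmiths()
    print("spread of averages:", spread_by_region(smiths))
    print("rural among 10 cheapest / 10 priciest:",
          rural_share_of_extremes(smiths))
```

In a setup like this, rural blacksmiths are only a fifth of the population, yet they should crowd both the cheapest and the priciest ends of the ranking, which is exactly the pattern that could mislead the king.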

The bottom line: if your conclusions are based on statements like, “seven of the ten blacksmiths with the lowest average sword prices are in rural areas, even though most blacksmiths live in cities,” check:

Do groups in this subset tend to be smaller than other groups in the population? In this case, do rural blacksmiths tend to produce fewer swords than blacksmiths in cities and towns?

If so, check both extremes in your data: how many of the blacksmiths with the highest average sword prices are also in rural areas? If many of the blacksmiths with high average sword prices are also in rural areas, these extreme values may be due to low overall sword production by rural blacksmiths.

In general, it is always best practice to look at the entire distribution of your data (in a graphic; please don’t try to read the table!). But if you do catch yourself reasoning from the extremes alone, it’s worth double-checking your data so that you don’t needlessly overwhelm the countryside with sword factories.

2. Visualizing changing circumstances

Wainer shows how graphics are much more efficient than verbal explanations when showing how a Medicare drug plan saves money. In my experience, it’s easy to explain and understand the way variables change when there’s some kind of linear relationship (for instance, as X goes up, Y goes down). However, describing a more complex relationship can feel like juggling bouncy balls, especially when the relationship changes under different conditions. A line graph can cut through that confusion.

For instance, let’s say the king is frustrated by the variation in sword prices and decides to standardize things with a new royal edict. (Feel free to skip to the next paragraph at any time.) By royal edict, all swords with blades less than 2 feet long cost 100 silver pieces, but extra blade length between 2 and 4 feet is 5 silver pieces per inch, extra blade length between 49 inches and 5 feet is 5 silver pieces per inch, and all swords over 5 feet in length cost an additional 2 silver pieces per inch. The manufacture of any sword over 5 feet requires a permit, which requires a bribe of 75 silver pieces to secure.

Now, if a woman in an armored trenchcoat offers you a sword with a 58-inch blade for 410 silver pieces, should you buy it from her, or turn her in to the palace guard and commission a legal sword? Answering that question based on the description above requires a lot of math. I find it much clearer with a visualization, even a rough one:

Draw a line up from the sword length on the X axis and a line across from the asking price on the Y axis. If those two lines meet at a point above the plotted price line, the offer is more expensive than the legal price of the sword. If they meet at a point below the line, it’s less expensive. Under royal edict, a 58-inch sword would cost 400 silver pieces. It’s looking grim for your trenchcoated friend, unless she can offer you a 61-inch blade for the same price.
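If you would rather check an offer with arithmetic than with a ruler, the edict can also be sketched as a piecewise function. This is my own reading of the edict, not anything from the book: the function names are hypothetical, the bands are treated as continuous (inches 25 to 48, then 49 to 60, then anything past 60), and the two per-inch rates are left as arguments.

```python
def edict_price(length_in, mid_rate, long_rate,
                base=100, over_rate=2, permit=75):
    """Legal price in silver pieces for a blade `length_in` inches long.

    mid_rate applies to each inch from 2 to 4 feet, long_rate to each inch
    from 4 to 5 feet. Both are assumptions supplied by the caller.
    """
    price = base
    # Inches 25 through 48 are charged at mid_rate each.
    price += mid_rate * max(0, min(length_in, 48) - 24)
    # Inches 49 through 60 are charged at long_rate each.
    price += long_rate * max(0, min(length_in, 60) - 48)
    if length_in > 60:
        # Past 5 feet: a per-inch surcharge plus the permit bribe.
        price += over_rate * (length_in - 60) + permit
    return price


def compare_offer(length_in, asking_price, **rates):
    """Say whether an asking price is above, below, or on the legal price."""
    legal = edict_price(length_in, **rates)
    if asking_price > legal:
        return "above"
    if asking_price < legal:
        return "below"
    return "on"


if __name__ == "__main__":
    # Rates taken as written in the edict paragraph: 5 and 5 per inch.
    print(edict_price(36, mid_rate=5, long_rate=5))
    print(compare_offer(58, 410, mid_rate=5, long_rate=5))
```

With both rates set to 5, as the edict is written above, a 58-inch blade comes to 270 silver pieces rather than the 400 read off the chart, so the chart was presumably drawn with different rates. That is why the rates are arguments here rather than constants.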

For a less silly example, here is Wainer’s chart of drug prices under a Medicare drug policy:

The bottom line: if the relationship between two variables takes more than two clauses to describe, make a chart.

Stay tuned: there’s more for me to read, and much more to come on Wainer on the blog.