Averages Are Metadata

When we think of metadata, we tend to think of attributes that describe the data. Where do the numbers come from? What do the values in a categorical column mean? Etc. But there is a type of metadata we rarely even recognize as such: values computed from the data. They're often treated as part of the data, but that's a mistake.

Most metadata is so different from the data itself that it's easy to spot: of course the URL pointing to the source of the data is not part of the data! Even the number of records typically looks different than any of the values (it doesn't have a unit like dollars, for one). The line is much less clear, though, when it comes to values that are derived from the data.

The Average In The Data

This is the disconnect that bugged me when I saw this tweet from The New York Times the other day:

Seeing the average included in the chart like this bothers me. It feels wrong. I think there are a few reasons for this. For one, it's ugly. The average has to be dressed up to make clear that this is not a regular data value. But then, why include it as one?

When the data is sorted, including the average makes some sense to see which airlines, in this case, are above or below the average. This is useful especially when the distribution is a power law like here, and there is data that is not shown but that is part of the calculations (passenger volume for each airline in this case, which the number of complaints is normalized by).

The average describes the data. It summarizes it. That means it’s data about the data. It’s really no different than the number of rows in the table or the data dictionary. Just because it’s a number and in the same units as the data values doesn’t make it one.

Including averages and totals in a table is something that strikes me as very much rooted in the thinking that is dominated by printed tables and spreadsheets. When data is released as spreadsheets, it's quite common to include totals and averages. They are provided for convenience or because it may not be possible to compute them from the data provided (because of weighting or other issues). U.S. government data often does this, including the U.S. value in addition to all the states, etc., and there are many other examples.

The problem isn't that these values are included, it's how this is done: as part of the main table, not a separate sheet or file. That makes the average, total, etc., just another value in the table, rather than a piece of clearly delineated metadata. When you create a chart from such a spreadsheet, it will include the average like all the other values. So it’s understandable that charts like the one above are being made.

Clear Distinctions, Clearer Thinking

This becomes problematic when the data is to be used in some more sophisticated analysis than just turning it into a simple chart, where the non-data values will get in the way (and the data needs to be recast in a more database-like structure). A calculated average will now be wrong, because the average that's already in the data distorts it. The same is true for sums, medians, etc.

The reason for including these values, convenience, now becomes an obstacle. And since it's not possible to know beforehand what kinds of analysis will be done on the data, it's impossible to include all the derived values that might be needed.

The way to fix this is to treat derived values as metadata and not throw them into the main table. Spreadsheets have lots of features for this, like tabs. Files can be separated out. Spreadsheets (and in particular CSV files) don't have to exactly mirror the structure of a printed table – in fact, that's usually a terrible way of releasing data.

But what's more, data tables aren't like printed ones. They are the raw material for further processing, not the end result. They need to be readable by machines, not humans. Including averages and other derived values shows a fundamental misunderstanding of this basic fact.

Robert Kosara is Senior Research Scientist at Tableau Software, and formerly Associate Professor of Computer Science. His research focus is the communication of data using visualization. In addition to blogging, Robert also runs and tweets. Read More…

Reader Interactions

Comments

the quick redesign isn’t helpful because it excludes all the things that made the chart readable. the long titles, the unit descriptions, the bar labels, the value labels, the grid labels, etc. if you leave everything that matters out there is suddenly a lot of empty space to put in average lines and yet another label to explain what the line is.

Cool post, Robert. As much as I agree subjectively, I wonder if we can make this idea falsifiable. In a visualization, what kind of imprecision or judgement error or even speed difference would you expect with avg-as-bar vs avg-as-grid-line?

I would expect it to be much easier to see how far away from the average you are with a line than a bar, especially with more bars. Perhaps the more important thing would be confidence, though: can I easily find the average and how sure do I feel about the difference? I’d be surprised if the line wasn’t better at both.