Data-themed articles, essays, and studies

Greater Than…

A paradox of analytics is that its success – or lack thereof – can be determined far upstream, at the sometimes murky headwaters of Do We Really Have The Data For That? and What Are We Really Asking? Given complete data and a well-defined question, a flow of productive answers is more likely, if not inevitable. One reason that productive answers can be elusive is that our data won’t permit the well-defined comparisons that are part of every good data question.

Our question must have a clear comparison – it is the comparison’s definition that largely frames our interpretation of the answers we obtain. Otherwise, we could end up like the guy who was proud of his average score of 9.5, until he learned that was 9.5 out of 100. Or like me one time: after processing some survey results for a client, I was stupefied to learn that respondents didn’t particularly care for a best-selling product, delivering an average rating of 3 out of 10. No one else could understand this either, and we were forwarding explanations ranging from sampling bias, to people considering the product to be a sub-optimal commodity – the product was junk but no one else’s junk was any better. Finally, someone discovered that the 1-to-10 scale for that one survey question was inverted – one being the best, while ten was the worst. So people really did like the stuff, at least a little. Speaking of “like,” I didn’t particularly like that I missed that – file it, I suppose, under should have known, cross-indexed with been there, done that. When we’re asking we’re comparing, even when it doesn’t seem that way. It’s good to state our comparison as explicitly as we can.

Still, it’s ironic that after doing a good job of defining the comparison for our question, the actual comparison can become more difficult (or impossible), with uncertain results. That can be frustrating. It’s just A > B! Why can’t I just compare them? Sometimes we can, but ambiguities often arise. In my view it’s better to prove comparability rather than assume it. And what kind of conclusions will we have if we can’t even make a comparison? None, and that’s OK. If the data have nothing to say, neither should we.

Fortunately, ambiguous comparisons happen for predictable reasons, which I break down like this: data uncertainties, aggregation, metric ambiguities, and system ambiguities. Here’s a little description of each effect.

Data uncertainties. Our input data are not perfect, and the impact of input data error on our answers can be modest, or surprisingly large. Regardless, whenever the difference between two answers is comparable to the total error in those numbers, comparison is ambiguous.

Uncertainty analysis, which considers how input errors propagate to our final answers, can be confused with validation, which tells us whether data transformations have distorted presumably perfect input data. It’s prudent – essential in my view – to determine the sensitivity of our answers to reasonable estimates of input data errors. Even if we’re not sure of our input error rates, uncertainty analysis will tell us if particular answers might be unreliable, or if certain inputs are critical (or irrelevant) to the quality of our answers.

Teams can understandably see uncertainty analysis as tedious and depressing. I take a more sanguine attitude – it’s more like spring cleaning for databases. If particular input data errors strongly impact key answers, that’s a good focal point for an external data audit. If other input data errors are irrelevant to our answers, perhaps that is clutter we can be rid of – great! When certain answers – often involving granular aggregation or many operations on signed numbers – are sensitive to any reasonable input error rate, we can identify those outcomes as meaningless, and alert our users accordingly.

One last thought on data uncertainties: theoretical uncertainty analysis and production databases are pretty much incompatible. To shorten a longish story, the assumptions behind theoretical uncertainty estimates are not guaranteed, or even likely, to be met by a production system. There is, however, one certainty in uncertainty analysis: errors do not “cancel out.” That’s a legend. The way to go is simulation: try systematic and random input errors of, say, one and five percent, and directly examine the impact on outcomes. (A one percent input error rate, particularly for something like categorization, is really quite good. I resist any temptation to assume the data are better than that, unless someone can truly prove it, say by external audit.)
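Here is a minimal sketch of that kind of simulation, in Python. Everything in it is hypothetical: the revenue and cost figures, the `net_margin` outcome, and the choice of a uniform random error model are assumptions made for illustration, not anything drawn from a real system. The outcome is deliberately a small difference between two large sums, the kind of signed arithmetic that tends to amplify input error.

```python
import random

def perturb(values, systematic=0.0, random_frac=0.0, rng=None):
    """Copy of values with a fractional bias plus uniform random error applied."""
    if rng is None:
        rng = random.Random(0)
    return [v * (1 + systematic + rng.uniform(-random_frac, random_frac))
            for v in values]

def net_margin(revenues, costs):
    """Outcome under study: a small difference between two large sums."""
    return sum(revenues) - sum(costs)

def outcome_range(revenues, costs, random_frac, trials=1000, seed=42):
    """Min and max outcome when revenues carry random error of +/- random_frac."""
    rng = random.Random(seed)
    outcomes = [net_margin(perturb(revenues, random_frac=random_frac, rng=rng), costs)
                for _ in range(trials)]
    return min(outcomes), max(outcomes)

# Hypothetical inputs, for illustration only.
revenues = [1000.0, 1200.0, 900.0, 1100.0]
costs = [980.0, 1150.0, 910.0, 1080.0]

baseline = net_margin(revenues, costs)
for frac in (0.01, 0.05):
    biased = net_margin(perturb(revenues, systematic=frac), costs)
    low, high = outcome_range(revenues, costs, random_frac=frac)
    print(f"{frac:.0%} error: baseline {baseline:.0f}, biased {biased:.0f}, "
          f"random range {low:.0f} to {high:.0f}")
```

In this made-up case a one percent systematic error in revenue moves the margin from 80 to 122, which is the sort of result that tells us where an audit would pay off.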

Aggregation. A friend of mine makes a compelling case that invalid reasoning from aggregates is the leading cause of analytics dysfunction. Well it’s up there, for sure. And just possibly, the root of all prejudice.

Statistics has a lot to say about whether the difference between two aggregate metrics – such as means or medians – is significant. But conceptually, the matter is simple: when we have two groups and the total range of values within the groups is comparable to the difference between the groups, we can’t say the two groups are different. Statistical inference can be used to determine the chances that two individuals in different groups exhibit the difference in question – but we cannot say anything about the groups per se in this case.
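A rough sketch of that idea, assuming two small hypothetical groups of scores. The function names and the data are invented for illustration, and this is a back-of-the-envelope check rather than a substitute for a proper significance test. It compares the gap between the group means to the spread within the groups, and counts how often a randomly chosen member of one group actually outscores a randomly chosen member of the other.

```python
from statistics import mean

def gap_to_range_ratio(group_a, group_b):
    """Difference in means divided by the larger within-group range."""
    gap = abs(mean(group_b) - mean(group_a))
    widest = max(max(group_a) - min(group_a), max(group_b) - min(group_b))
    return gap / widest

def prob_b_exceeds_a(group_a, group_b):
    """Fraction of cross-group pairs in which the member of B scores higher."""
    wins = sum(1 for a in group_a for b in group_b if b > a)
    return wins / (len(group_a) * len(group_b))

# Hypothetical scores, for illustration only.
group_a = [50, 55, 60, 65, 70]
group_b = [52, 57, 62, 67, 72]

print(mean(group_b) - mean(group_a))         # gap between the means
print(gap_to_range_ratio(group_a, group_b))  # gap relative to within-group range
print(prob_b_exceeds_a(group_a, group_b))    # chance an individual in B beats one in A
```

Here the means differ by 2 while each group spans 20, and an individual from the "better" group wins only 60 percent of the pairings, not far from a coin toss.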

To my friend’s point, it’s very easy to take two aggregate metrics for groups that are not really distinct, and wrongly apply those metrics to individuals in those groups. So if on average younger people are slightly more productive than older people, I should now hire younger people – they’ll do a better job, and are cheaper to boot. (I’m not making this up – I had a real client trying to prove that.) And from any news outlet we can find all manner of quasi-analytical claptrap related to aggregate-reasoning issues: silly and sometimes dangerous statements about differences across generations, genders, ethnicities, religions. Even if there is a modest difference in a mean or median metric, that is far from guaranteeing a difference amongst typical individuals in different groups.

This kind of reasoning seems ingrained – I believe we humans are hard-wired to the general-to-specific reasoning pattern for survival reasons. If I see a couple of wolves eat my sheep, be assured that I’ll wave off the next wolf I meet. Maybe a full analysis would show there are nice wolves, but I’m not going to take the chance.

A related question is: why do we often ask questions about groups with a nominal difference, but little actual difference? The groups we think about are not usually selected with a particular metric in mind. Neighborhoods are based on geography rather than income; generations are based on age rather than job performance. Groups can also be very dynamic: if the membership of two different related business groups changes regularly, a metric like group performance is less likely to be distinct.

Metric ambiguities. Give me two numbers with the same nominal units (like “dollars”) and I’ll show you someone who has compared them. There are many different kinds of cost, or kinds of value – they are really only comparable if they reference identical systems and calculational bases. I can’t, for example, compare someone’s salary to the fully-loaded cost of another person, but if all my data tables tell me is that each is a “cost,” an inadvertent comparison is very likely. And then there is the time-value of money, which kicks in as soon as we compare monetary amounts at two different times. Not only that – we can only estimate how the value of money changes with time, even historically. So there is an intrinsic error associated with most monetary values. (When was the last time you saw that error field in your database?)

In short, this is a version of the “same or different” problem – we can truly compare metrics that are really identical, not ones that just look identical.

Databases, for all of their utility, aggravate this problem – a key purpose of data models and data tables is to represent different things identically, so that we can process them easily. And that’s great, if the numbers are truly identical in meaning – a trickier thing than it often appears.

Personally, I like to type metrics strongly – it doesn’t completely resolve the problem of inadvertently comparing non-congruent metrics, but it is helpful. So a “cost” might be a “wholesale materials and manufacturing cost (2015)”. That leaves out things like inventory cost, but at least I have a better idea of where I stand.
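One way to read "typing metrics strongly" in code is to carry the basis and reference year along with the number, and refuse any comparison where they differ. The sketch below is only an illustration of that reading: the `Metric` class, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    value: float
    unit: str    # nominal unit, e.g. "USD"
    basis: str   # what the number actually measures
    year: int    # reference year for the monetary value

    def _require_congruent(self, other):
        """Raise unless both metrics share unit, basis, and reference year."""
        mine = (self.unit, self.basis, self.year)
        theirs = (other.unit, other.basis, other.year)
        if mine != theirs:
            raise TypeError(f"not comparable: {mine} vs {theirs}")

    def __gt__(self, other):
        self._require_congruent(other)
        return self.value > other.value

    def __lt__(self, other):
        self._require_congruent(other)
        return self.value < other.value

# Hypothetical values, for illustration only.
basis = "wholesale materials and manufacturing cost"
widget = Metric(12.50, "USD", basis, 2015)
gadget = Metric(9.75, "USD", basis, 2015)
salary = Metric(80000.0, "USD", "salary", 2015)
loaded = Metric(120000.0, "USD", "fully-loaded cost", 2015)

print(widget > gadget)   # same basis and year, so the comparison is allowed
try:
    loaded > salary      # same nominal unit, different basis
except TypeError as err:
    print(err)
```

The check is crude, since it treats any difference in basis or year as fatal, but it turns an inadvertent comparison into a visible error instead of a quiet wrong answer.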

Almost any complex metric can exhibit similar problems. I recently had the opportunity to look at calculations regarding the fuel efficiency for different modes of transport, and their related carbon footprint. Perhaps because people are rightly concerned about their environmental impact (and don’t necessarily have degrees in engineering), they can readily compare things that just don’t line up.

Comparing the carbon footprint of flying to driving? Enjoy it… You get to compare

and that’s before considering the impact of airline routing versus vehicle routing, the number of people in the car, whether you are driving your Prius with a leadfoot, or whether you want to engage the debate about scheduled travel effectively being a sunk carbon cost.

(By the way, it’s a separate post but after considering the various errors and ambiguities, which are at least 25% in each case, an 80% loaded 737 and a Prius getting EPA mileage with one person are roughly equivalent. A Prius with two people? It’s probably a little better than the plane.)

Comparing the carbon footprint of an electric car to a hybrid vehicle like a Prius? The EPA provides a great service with their mileage ratings. But the ratings for electric vehicles are deceptive, for what appear to be historical reasons. After a little detective work, I believe what we’re comparing for purposes of carbon footprinting are these numbers:

These numbers are actually not comparable, because gasoline cannot be converted to electricity with 100% efficiency. So a Tesla’s rating of roughly 90 MPGe cannot be directly compared to 55 MPG for a Prius.

(Also a separate item. I used an estimate from the Energy Information Administration (who also do a great job) of the typical heat rate for converting methane to electricity, allowed for typical transmission and distribution loss, and found that the carbon footprints of a Tesla and a Prius are comparable, assuming that we’re charging the car with methane-generated electricity. However – in principle if the electricity for the Tesla came from clean sources, it might actually be much better than the Prius.)
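To make the energy side of that adjustment concrete, here is a small sketch. The 33.7 kWh per gallon figure is the EPA's convention for MPGe, and 3,412 Btu per kWh is the standard unit conversion. The heat rate of 7,600 Btu/kWh and the 5 percent transmission and distribution loss are placeholder assumptions chosen for illustration; they are not the EIA figures used for the estimate described above. The function name is likewise invented.

```python
KWH_PER_GALLON_EQUIV = 33.7   # EPA convention: energy in one gallon of gasoline
BTU_PER_KWH = 3412.0          # unit conversion

def fuel_basis_mpg(mpge, heat_rate_btu_per_kwh, td_loss):
    """Rescale an MPGe rating to miles per gallon-equivalent of fuel burned
    at the power plant, allowing for plant efficiency and grid losses."""
    plant_efficiency = BTU_PER_KWH / heat_rate_btu_per_kwh
    return mpge * plant_efficiency * (1 - td_loss)

# Placeholder assumptions, for illustration only.
ASSUMED_HEAT_RATE = 7600.0    # Btu of fuel per kWh generated
ASSUMED_TD_LOSS = 0.05        # fraction of generated electricity lost in delivery

plug_kwh_per_mile = KWH_PER_GALLON_EQUIV / 90
adjusted = fuel_basis_mpg(90, ASSUMED_HEAT_RATE, ASSUMED_TD_LOSS)
print(f"At the plug: {plug_kwh_per_mile:.3f} kWh per mile")
print(f"90 MPGe on a fuel-burned basis: about {adjusted:.0f} miles per gallon-equivalent")
```

This is an energy comparison only. A carbon comparison also needs an emission factor for each fuel, which is left out here, so the adjusted figure should not be read as a footprint.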

You get the point – even before we get to uncertainties and aggregations, comparisons are not axiomatic.

System definition. Metrics and comparisons can be ambiguous when we don’t have a record identifier that maps each database record to a unique real-world object. If we’re lucky our database has several representative real-world items for each database id, and we can estimate the error associated with each database id directly. If we’re not lucky, a data custodian has selected a “representative” value for the density of polyethylene, or the sales for everyone named “John Jones,” and we have to estimate the resulting error.
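For the lucky case, a sketch of estimating the error per database id might look like the following. The record layout, the ids, and the density values are hypothetical, and using the sample standard deviation as the error estimate is one assumption among several reasonable choices.

```python
from collections import defaultdict
from statistics import mean, stdev

def spread_by_id(records):
    """Map each database id to (mean value, relative spread).

    records is an iterable of (db_id, measured_value) pairs. The relative
    spread is None when an id has a single item, since no spread can be
    estimated from one measurement.
    """
    groups = defaultdict(list)
    for db_id, value in records:
        groups[db_id].append(value)
    result = {}
    for db_id, values in groups.items():
        center = mean(values)
        relative = stdev(values) / center if len(values) > 1 else None
        result[db_id] = (center, relative)
    return result

# Hypothetical measurements (density in g/cm^3), for illustration only.
records = [
    ("polyethylene", 0.92),
    ("polyethylene", 0.94),
    ("polyethylene", 0.96),
    ("mystery-resin", 1.10),
]

for db_id, (center, relative) in spread_by_id(records).items():
    print(db_id, center, relative)
```

An id that comes back with no spread is the unlucky case: a single "representative" value whose error has to be estimated some other way.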

Whew – that was longer than I expected or intended. To hell with it – it’s a blog, not a book. Nonetheless, allow me to summarize: comparisons are basic to good data questions, but within a data set comparisons could be clear, or ambiguous. We want to sort out which is which. In my view, our best bet is to start with a presumption that comparisons are invalid, and then to prove their validity. If time and resource constraints limit our assessment, we can call out our presumed valid comparisons as assumptions.

One thought on “Greater Than…”

Lack of meaningful units for comparison (like trying to use “dollars” over a time period of many decades, e.g. meaningless without value weighting the dollar) is a reality that makes me want to slap people several times a week as they flog unsupported conclusions at me. Your carbon example is a perfect case of the units being much more complex than people assume.

However, the more insidious problem is the relative accuracy of attributes when you introduce multiple dimensions and measures. Not just lack of validation and cleansing, but the lack of effort to collect and maintain the values in the first place.

For example an inventory item might have hundreds of attributes, but if only eight or nine impact operations, often there is little or no effort to even understand how inaccurate the others are. Strong typing and not-null can make the situation worse (forcing input to supply made up values which can skew the result).

Then some wet-behind-the-ears consultant comes in and produces a flashy visualization showing a correlation that is at best questionable, when they compare a clean attribute with one of unknown accuracy.