A Day in the Life of Explanatory Variables and Confounding Factors

Author: Kirk Borne

Date: 06 May 2015

Copyright: Image appears courtesy of Getty Images

Would we trust an insurance provider who sets motorbike insurance rates based on the sales of sour cream [1]? Or would we schedule our space launches according to the number of doctoral degrees awarded in Sociology [2]?

Probably all of us would agree that this kind of decision-making is unjustified. A specific decision like this appears to be only superficially supported by the evidence, but is there more to the story? Does it go any deeper? What if there exists a hidden causal factor that induces the apparently spurious correlation?

For example, suppose the increase in space launches and the increase in doctoral degrees in Sociology were both related to an increase in government investments in research studies on the sociological impacts of establishing a permanent human colony on the Moon. This case reveals a hidden causal connection in an otherwise strange correlation. The explanatory variable (which is a hidden confounding factor) is the research investment, and the response variables are the space launches and doctoral degrees.
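A small simulation can make this hidden-factor scenario concrete. The sketch below is a toy under stated assumptions, not real data: the variable names, the coefficients (2.0 and 1.5), and the Gaussian noise are all hypothetical. Both response variables are generated from the same hidden investment variable and never from each other, yet they correlate strongly until the investment is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical hidden confounder: research investment (arbitrary units).
investment = rng.normal(size=n)

# Each response depends on the investment plus its own independent noise.
# Neither response depends on the other. Coefficients are assumptions.
launches = 2.0 * investment + rng.normal(size=n)
degrees = 1.5 * investment + rng.normal(size=n)


def residuals(y, x):
    """Return what is left of y after removing a least-squares line in x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation between the two responses (confounder ignored).
raw_corr = np.corrcoef(launches, degrees)[0, 1]

# Partial correlation: correlate what remains of each response
# after the confounder's linear effect has been removed.
partial_corr = np.corrcoef(
    residuals(launches, investment),
    residuals(degrees, investment),
)[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

In this toy setup the raw correlation is large while the partial correlation sits near zero, which is what we would expect when a single explanatory variable drives both responses. In practice, of course, the difficulty is that the confounding factor is often not in the dataset at all.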

What about other cases? What about the evidence that sour cream sales correlate with motorbike accidents [2]? In such cases, shouldn’t we all be pleased to see organizations making evidence-based data-driven objective decisions, especially in this brave new world of exploding data volumes and ubiquitous analytics? So, what kind of world is this?

Welcome to the world of explanatory variables and confounding factors!

Statistical literacy is needed now more than ever (to paraphrase H. G. Wells [3]). This includes awareness of and adherence to common principles of statistical reasoning. For example, we all know that correlation does not imply causation. In addition, we preach about the importance of explanatory variables and the dangers of confounding factors [4]. A short tutorial on these statistical concepts illuminates a safe path to avoid these dangers: “Causal inference can be strengthened if the researcher can argue or demonstrate that the variables are modelled in the appropriate causal sequence, if key effects in a mediation model are not confounded by omitted variables” [5]. Nevertheless, with a new population of data scientists emerging across the planet, statistical danger lurks.

In the era of big data, it is easy to find correlations, patterns, and apparent explanatory variables in data, not only in the signal but even in the noise [6]. It is also tempting to stand aside and allow our analytics-as-a-service software applications to build predictive models autonomously from automatically discovered correlations [7]. This is not a remote possibility but a very real consequence of the fact that automation is one of the recurring mantras recited by everyone (including this author) who is trying to cope with and make discoveries from massive datasets [8].

So what kind of fun could this lead to? There is a collection of humorous and spurious correlations that includes the examples mentioned above (sour cream versus motorbikes, and sociology degrees versus space launches), as well as several others [2]. Here is another example from that collection: the number of swimming pool drownings correlates with the number of films in which Nicolas Cage has appeared. Imagine that the automated predictive analytics software used by a homeowner’s insurance firm discovered this relationship and began pricing insurance accordingly. Quite unexpectedly, homeowners who have a swimming pool on their property might see their rates increase after the release of a new Nicolas Cage movie.
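How easily does an automated search turn up such a relationship? The sketch below is a minimal illustration, assuming 200 made-up "annual statistics" of 10 yearly values each, all drawn as independent random noise (the counts and the seed are arbitrary choices, not taken from any real collection). It simply scans every pair of series and reports the strongest correlation it finds.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 200 unrelated "annual statistics", 10 years each.
# Every series is pure noise, so no real relationship exists anywhere.
n_series, n_years = 200, 10
data = rng.normal(size=(n_series, n_years))

# Correlation matrix over all series (each row is treated as one variable).
corr = np.corrcoef(data)

# Ignore the trivial self-correlations on the diagonal.
np.fill_diagonal(corr, 0.0)

# Locate the most strongly correlated pair, by absolute value.
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
strongest = corr[i, j]
n_pairs = n_series * (n_series - 1) // 2

print(f"pairs examined: {n_pairs}")
print(f"strongest correlation: series {i} vs series {j}, r = {strongest:.3f}")
```

With nearly twenty thousand pairs and only ten points per series, an impressively strong correlation is all but guaranteed to appear somewhere, even though every series is noise. An automated system that keeps only the "best" correlation is selecting for exactly this kind of accident.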

In introductory courses, it is challenging for statistics novices to distinguish and apply these different concepts correctly: causality versus correlation, explanatory versus response variables, and confounding (hidden) factors. Here are some examples of these concepts that I have used to instruct my students and/or to amuse my colleagues:

(a) An automated emergency response system learns from historical records of prior building fires that the largest fires are associated with the largest deployments of emergency response vehicles. A predictive analytics model is then implemented as part of the emergency alert response network. The algorithm deduces that the severity of fires will be reduced if fewer response vehicles and fewer emergency personnel are deployed. That will surely save money and result in smaller fires. Right? No! Obviously, wrong! The explanatory and response variables are reversed in this correlation-inspired model: the size of the fire determines the size of the deployment, not the other way around.

(b) A study of alcohol abuse by students on university campuses found that the greatest rate of abuse occurs at those universities that have an alcohol abuse awareness program for students. Obviously, the existence of these programs leads to greater student awareness of alcohol and consequently leads to greater alcohol abuse problems. Right? No! The antecedent and consequent are incorrectly reversed in this interpretation (which actually was reported this way in the news media after the study was released). The explanatory and response variables are again reversed here: the universities with the greatest alcohol abuse problems are in fact those universities that created the substance abuse awareness programs in response to the existence of abuse problems.

(c) Nearly every data mining textbook mentions the famous story (legend?) of beer and nappies: men who go to the store to buy the nappies will also tend to buy beer. So, does the purchase of nappies cause men to buy beer also? Or vice versa? No, there is a hidden variable (confounding factor) that can explain both events quite independently: that factor is the existence of a crying baby at home (maybe).

(d) A study of galaxy classifications by citizen scientists in the Galaxy Zoo project [9] found that some galaxies were classified (as either a Spiral Galaxy or an Elliptical Galaxy) correctly by over 90% of the volunteers, but many other galaxies received a 50-50 split in the volunteers’ classification decisions. We attempted to build a detailed machine learning model to explain these 50-50 classifications, using the scientific variables provided in the database of astronomical features (database attributes). The trained model was 95% accurate in predicting those galaxies that had 90% correct classifications, but the model was only 5% accurate in predicting which galaxies would receive a 50-50 classification vote. What happened? Even a coin toss (to classify a galaxy as either Elliptical or Spiral) should have 50% accuracy, just based on a purely random selection. So where did the machine learning classification model go wrong? After further study of the models, we determined the following: (1) there must be at least one hidden variable (confounding factor) that was not recorded in the database of attributes but which affects the galaxy’s true classification; and (2) this explanatory variable (though not recorded among the scientific variables) was nevertheless cognitively discernible in the images, even by non-experts. Therefore, with regard to those 50-50 classifications, the omitted explanatory variable is a confounding factor: its absence confounds the classification (response variable), and it is the ultimate key to correct causal inference! [10]
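The omitted-variable effect in example (d) can be illustrated with a toy classifier. To be clear about the assumptions: this is not the Galaxy Zoo data or the model used in that study, and it does not try to reproduce the 5% figure. The data are synthetic, the variable names are hypothetical, and a simple nearest-centroid classifier stands in for whatever model one might actually use. The label is driven by a hidden variable, while the three "recorded" attributes are unrelated to it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_train = 4000, 3000

# Hidden variable that actually drives the class, but is not "recorded".
hidden = rng.normal(size=n)

# Recorded attributes: three columns that carry no information about the class.
recorded = rng.normal(size=(n, 3))

# Binary class label determined by the hidden variable plus a little noise.
label = (hidden + 0.3 * rng.normal(size=n) > 0).astype(int)


def centroid_accuracy(features, labels, n_train):
    """Train a nearest-centroid classifier on the first n_train rows
    and return its accuracy on the remaining rows."""
    train_x, test_x = features[:n_train], features[n_train:]
    train_y, test_y = labels[:n_train], labels[n_train:]
    # One centroid (mean feature vector) per class.
    centroids = np.stack([train_x[train_y == k].mean(axis=0) for k in (0, 1)])
    # Distance from every test point to each centroid; pick the closer one.
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    predictions = dists.argmin(axis=1)
    return (predictions == test_y).mean()


acc_without = centroid_accuracy(recorded, label, n_train)
acc_with = centroid_accuracy(np.column_stack([recorded, hidden]), label, n_train)

print(f"accuracy using recorded attributes only: {acc_without:.3f}")
print(f"accuracy once the hidden variable is added: {acc_with:.3f}")
```

With only the recorded attributes, the classifier does no better than a coin toss; once the hidden variable is supplied, accuracy rises sharply. No amount of model tuning on the recorded attributes can substitute for the variable that was never captured.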

Finally, we note that none of these examples can explain to me why I heard a famous politician say this: “I am appalled that half of the students in this country score below average on their national test scores.” Oh well, what was it that H. G. Wells said about statistical literacy? “It is needed now more than ever!”