In my last post, I looked at the gap that arises when we delegate parts of our thought processes to algorithmic models, rather than incorporating the rules they identify directly into our mental models, like we do with traditional statistics. I described how the idea of model interpretability can make the delegation process smoother by helping to break down the barriers between algorithmic and mental models. An increasing number of research papers these days claim to describe models that are interpretable, or ways of adding a layer of interpretability to existing models, but most of them rely on an implicit, intuitive definition of interpretability, usually one that suits their particular results. It would be nice if there were a canonical notion of what interpretability means. In the next few posts, I plan to explore what such a definition might look like and how you might tell whether a given algorithm is interpretable. In this post, I’ll explore this question from the angle of goals: What should an interpretable model allow you to do that a non-interpretable model can’t?

The research literature identifies many different reasons to want model interpretability, many of which are described in Zack Lipton’s Mythos of Model Interpretability. Zack organizes these into four categories – Trust, Causality, Transferability and Informativeness. (Read the paper for what these mean.) I’m going to suggest a slightly different set of categories based on the question “If I have an interpretable model, what should it allow me to do?” I’m not suggesting that an interpretable model should do all the things below, but it should do at least some of them. Though if you think this scheme leaves something out, please let me know in the comments!

So, here it is: An interpretable model should allow you to…

Identify and Mitigate Bias. All models are biased – both algorithmic models and mental models. In fact, algorithmic models can magnify the bias of our mental models, as Cathy O’Neil has written extensively about. You can never completely eliminate bias, but you can often fix the more egregious forms, or at least choose not to use the models that are biased in unacceptable ways. This is roughly what Zack and others refer to as trust. It seems to be the most common motivation for interpretability described in the literature, probably because it’s of primary interest to the people who are developing the algorithmic model, convincing others to use it, and then writing papers about it.

In fact, the ability to understand the biases in a model can be more important than accuracy for model adoption. For example, the introduction of this paper describes a case where a large healthcare project chose a rule-based model over a more accurate neural network for predicting mortality risk for patients with pneumonia. The decision was made after they discovered a rule in the rule-based system suggesting that having asthma lowered one’s risk of dying from pneumonia. It turned out this was true, but only because pneumonia patients with pre-existing asthma were consistently given more aggressive treatments that led to better outcomes. However, the model would have suggested they required less treatment since they were at lower risk. The group running the project realized that the neural network probably had similar biases, but they had no way of telling what or how bad they were. So they decided to go with the model whose biases they could recognize, and thus mitigate.
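To make the asthma example concrete, here is a minimal sketch of what a rule-based risk model looks like. Everything in it – the rules, the weights, the field names – is invented for illustration and not taken from the actual study. The point is simply that each rule is a visible, inspectable line, so a suspicious rule like the asthma one can be spotted and overridden before the model is deployed:

```python
# Hypothetical rule-based mortality-risk model. The rules and weights
# below are invented for illustration, not taken from the real system.
RULES = [
    ("age over 65", lambda p: p["age"] > 65,   +0.30),
    ("low O2 sat",  lambda p: p["spo2"] < 90,  +0.25),
    ("has asthma",  lambda p: p["has_asthma"], -0.15),  # the suspicious rule
]

def risk_score(patient):
    """Sum rule weights over a small base rate, clipped to [0, 1]."""
    score = 0.05  # illustrative base rate
    for _name, condition, weight in RULES:
        if condition(patient):
            score += weight
    return min(max(score, 0.0), 1.0)

def triggered_rules(patient):
    """List which rules fired -- this is what makes the model auditable."""
    return [name for name, condition, _w in RULES if condition(patient)]

patient = {"age": 70, "spo2": 95, "has_asthma": True}
print(round(risk_score(patient), 2))  # 0.05 + 0.30 - 0.15 = 0.2
print(triggered_rules(patient))       # ['age over 65', 'has asthma']
```

A neural network trained on the same data would encode the same asthma correlation, but there would be no analogue of `triggered_rules` to surface it.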

Recognizing the biases in our mental models is notoriously difficult, but recognizing bias in a black-box algorithmic model is even harder. With our mental models, we can use self-reflection and the ability to recognize new factors to reduce bias. Algorithmic models can train on larger data sets and treat all data points equally, but they can’t self-reflect. The right type of interpretability could allow us to apply “self-reflection” to algorithmic models and get the best of both worlds.

Account for context. This is essentially what I wrote about in my last post, so I won’t go into additional detail here. As I described previously, an algorithmic model can never account for all the factors that will affect the decision that the user finally makes. An interpretable model that helps you understand how the factors that are included in the model led to a prediction should allow you to adjust how you use the prediction based on these additional factors.

Extract Knowledge. Algorithmic models often have a form that’s incompatible with mental models. Mental models are made up of relatively simple causal relationships, augmented by flexible, subconscious intuition. Algorithmic models are essentially probability distributions that measure correlations between rigidly defined values. However, if you look at a series of predictions from a model, the pattern recognition parts of your brain won’t be able to help themselves from trying to extract rules to add to your mental models. The problem is that these patterns may not be real, particularly if the set of examples you look at is biased.

An interpretable model should help you to determine if the patterns that appear to be present in the model are really there, or just artifacts of a biased set of examples. This is similar to identifying bias, as described above, except that here you’re learning from the model rather than evaluating it. With an interpretable model, you should be able to combine the strong pattern recognition and simplification skills of the human mind with the algorithmic model’s ability to learn from massive amounts of data. For example, there’s now a fair amount of research on causal inference in algorithmic models. A causal algorithmic model should be more compatible with mental models that rely heavily on causal reasoning, making it even easier to extract rules that can be incorporated into your mental models. Zack’s Mythos paper includes causality as a motive for interpretability, though this isn’t the only type of rule you might want to extract.
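One common way to do this kind of knowledge extraction is to fit a simple surrogate model to the predictions of the black box, then read rules off the surrogate. The sketch below is deliberately tiny and uses only the standard library: the "black box" and the single-threshold rule family are made up for illustration, and real work would use a decision tree or rule learner over a full feature set. The idea is just that we probe the opaque model with many inputs and search for a simple rule that agrees with it:

```python
import random

# A "black box": an opaque scoring function we can only query.
def black_box(x):
    return 1 if x * x > 25 else 0  # hidden rule: x > 5 on this input range

# Probe the black box on sampled inputs...
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [black_box(x) for x in xs]

# ...and fit the best surrogate rule of the form "predict 1 if x > t".
best_t, best_acc = None, 0.0
for t in sorted(xs):
    acc = sum((x > t) == (y == 1) for x, y in zip(xs, ys)) / len(xs)
    if acc > best_acc:
        best_t, best_acc = t, acc

print(f"extracted rule: x > {best_t:.2f} (agreement {best_acc:.0%})")
```

The extracted threshold lands just below 5, recovering the hidden rule in a form simple enough to fold into a mental model. The agreement score also plays the verification role described above: a rule that only agrees with the black box 60% of the time is probably an artifact, not a pattern.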

Generalize. Algorithmic models are trained on carefully collected datasets to solve narrowly defined problems. Mental models are trained on a fire hose of input and applied to vaguely defined problems that they usually weren’t trained for. If you find an algorithmic model that works well for the problem it was trained on, you may be tempted to apply it to other problems, given how well that seems to work for mental models. In some cases, it might work for algorithmic models too, but in many cases it won’t. An interpretable model should help you determine if and how it can be generalized.

For example, let’s say that the pneumonia risk model described above works out really well and you want to use it to predict risks for other types of lung infections. A good approach to interpretability might tell you that the model relies on properties of pneumonia that are different for other infections. So you’d better create a new model for them. In fact, you could argue that this is what happened in the case described above: They tried to use a model that was trained to predict mortality risk for the problem of predicting which patients required the most care. In this case, they didn’t even realize they were generalizing until they tried to interpret one of the models.
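A cheap first check before generalizing is simply to score the model against data from the new problem. The sketch below is entirely synthetic: a model that matches "condition A" by construction is scored against a related "condition B" whose underlying rule differs, and the drop in agreement is the warning sign. What a synthetic check like this can't tell you – and what an interpretable model could – is why the scores diverge:

```python
import random

# Synthetic labels for two related problems with different underlying rules.
def label_a(x):  # the condition the model was built for
    return 1 if x > 5 else 0

def label_b(x):  # a related condition with a different threshold
    return 1 if x > 7 else 0

def model(x):    # matches condition A by construction
    return 1 if x > 5 else 0

random.seed(1)
xs = [random.uniform(0, 10) for _ in range(1000)]
acc_a = sum(model(x) == label_a(x) for x in xs) / len(xs)
acc_b = sum(model(x) == label_b(x) for x in xs) / len(xs)
print(f"accuracy on A: {acc_a:.2f}, accuracy on B: {acc_b:.2f}")
```

On condition A the model is perfect by construction; on condition B it misclassifies roughly the fifth of inputs that fall between the two thresholds.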

So, there’s my proposal for the main motivations of interpretability: An interpretable model should allow you to identify and mitigate bias, account for context, extract knowledge and generalize. There’s a fair amount of overlap between these, but they capture the different types of motivations I’ve seen in the literature. In my next few posts, I’ll use these motivations as a lens through which we can look for ways to tell when models are interpretable.