Interpretability for Deep Learning Methods

The area to be explored in this project is interpretability. As is discussed by Lipton [1], interpretability can refer to several different things. From the different approaches to interpretability listed in [1], the most relevant for this project is post-hoc interpretability. This means that the goal is to be able to explain the classifications made by the model, rather than to increase the transparency of the model. Interpretability is a desirable property for a model to have as it allows the decisions of an interpretable model to be verified by an end user. Other motivations for interpretability are discussed in [1]. Many state-of-the-art models are not interpretable, for example neural networks are not considered to be interpretable.

In particular, the focus of this project is on using methods to make models interpretable as part of the training process. For example, an interpretable model, such as a decision tree, might be trained alongside an uninterpretable model, such as a deep neural network, allowing the user to see what features the uninterpretable model is using, and to make corrections to the interpretable model that can be fed back into the uninterpretable model. This would hopefully allow improved generalisation, as if the uninterpretable model learnt to use features that were not relevant but happened to be correlated with the classes being predicted, the user would be able to see and correct this.

Another application would be to one-shot or few-shot learning. If the user already had a model that was trained on similar data, for example images, and wanted to train the model to classify an additional class, being able to guide the model as to which features are relevant would reduce the amount of training data required for the model to learn this additional class. For complex models, this would not be easy to achieve by manipulating the input features directly, as in complex models the features used at the final classification step can be complicated non-linear transforms of the original inputs. A clear example of this are deep convolutional neural networks for image classification. In this case, the input features are the individual pixels of the image, but the final classification is based on the features extracted from the image by the kernels of the convolutional layers in the network.

The initial focus of the project will be on developing a method for feeding back changes to the interpretable model into the neural network.