How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

Bidirectional Encoder Representations from Transformers (BERT) reach state-of-the-art results in a variety of Natural Language Processing tasks. However, understanding of their internal functioning is still insufficient and unsatisfactory. In order to better understand BERT and other Transformer-based models, we present a layer-wise analysis of BERT’s hidden states. Unlike previous research, which mainly focuses on explaining Transformer models by their attention weights, we argue that hidden states contain equally valuable information. Specifically, our analysis focuses on models fine-tuned on the task of Question Answering (QA) as an example of a complex downstream task. We inspect how QA models transform token vectors in order to find the correct answer. To this end, we apply a set of general and QA-specific probing tasks that reveal the information stored in each representation layer. Our qualitative analysis of hidden state visualizations provides additional insights into BERT’s reasoning process. Our results show that the transformations within BERT go through phases that are related to traditional pipeline tasks. The system can therefore implicitly incorporate task-specific information into its token representations. Furthermore, our analysis reveals that fine-tuning has little impact on the models’ semantic abilities and that prediction errors can be recognized in the vector representations of even early layers.
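
To make the idea of layer-wise probing concrete, the following is a minimal sketch under toy assumptions and not the setup used in this work. Synthetic vectors stand in for BERT's hidden states, the amount of label signal is made to grow with depth purely for illustration, and a nearest-centroid classifier stands in for a probing classifier. All function names and parameters (`make_layer_states`, `probe_accuracy`, `probe_all_layers`) are hypothetical.

```python
import random


def make_layer_states(layer, num_layers, n, dim, rng):
    """Synthetic stand-in for one layer's token vectors (NOT real BERT output).

    Assumption for illustration only: the label signal grows linearly with depth.
    """
    signal = 2.0 * layer / (num_layers - 1)
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both splits contain both classes
        sign = 1.0 if label == 1 else -1.0
        vec = [sign * signal + rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Fit a nearest-centroid probe on `train` and return its accuracy on `test`."""
    centroids = {
        label: centroid([v for v, y in train if y == label]) for label in (0, 1)
    }
    correct = sum(
        1
        for v, y in test
        if min(centroids, key=lambda c: sq_dist(v, centroids[c])) == y
    )
    return correct / len(test)


def probe_all_layers(num_layers=6, n=200, dim=8, seed=0):
    """Train and evaluate one independent probe per layer."""
    rng = random.Random(seed)
    accuracies = []
    for layer in range(num_layers):
        data = make_layer_states(layer, num_layers, n, dim, rng)
        split = n // 2
        accuracies.append(probe_accuracy(data[:split], data[split:]))
    return accuracies


accuracies = probe_all_layers()
for layer, acc in enumerate(accuracies):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In an actual analysis, the synthetic vectors would be replaced by hidden states extracted from each layer of a fine-tuned model, and the per-layer probe accuracy would indicate how much task-relevant information each layer encodes.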