Obtaining training data for Question Answering (QA) is time-consuming and costly, and existing QA datasets are only available for limited domains and languages. In this talk, we'll explore to what extent high-quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question, and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a named entity mention), outperforming early supervised models.
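To make the idea of unsupervised (context, question, answer) triple generation concrete, here is a minimal toy sketch of a cloze-style transformation: pick a span in a passage, treat it as the answer, and mask it to form a cloze question. The capitalized-span heuristic and the `[MASK]` token are illustrative assumptions for this sketch, not the actual entity recognizer or question-generation model used in the work.

```python
import re

def make_cloze_triples(context, mask_token="[MASK]"):
    """Toy sketch: derive (context, cloze question, answer) triples by
    masking capitalized spans, a crude stand-in for named entities."""
    triples = []
    # Hypothetical heuristic: runs of capitalized words approximate entity mentions.
    for match in re.finditer(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", context):
        answer = match.group(0)
        # Replace the answer span with a mask to form a cloze-style question.
        question = context[:match.start()] + mask_token + context[match.end():]
        triples.append((context, question, answer))
    return triples

triples = make_cloze_triples("Marie Curie discovered polonium.")
# Each triple pairs the original context with a masked question and its answer span.
```

In the full approach, such cloze questions would additionally be translated into natural-language questions before being used to train an extractive QA model.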
We will also explore methods for building cross-lingual Question Answering models that do not require cross-lingual supervision (zero-shot language transfer), as well as the challenge of fairly evaluating their performance across many target languages.