Abstract: This paper addresses the problem of evaluating learning systems in safety-critical
domains such as autonomous driving, where failures can have
catastrophic consequences. We focus on two problems: searching for scenarios
in which learned agents fail, and assessing their probability of failure. The
standard method for agent evaluation in reinforcement learning, Vanilla Monte
Carlo, can miss failures entirely, leading to the deployment of unsafe agents.
We demonstrate that this is an issue for current agents: even devoting as much
compute to evaluation as was used for training is sometimes insufficient to
uncover failures. To address
this shortcoming, we draw upon the rare event probability estimation literature
and propose an adversarial evaluation approach. Our approach focuses evaluation
on adversarially chosen situations, while still providing unbiased estimates of
failure probabilities. The key difficulty lies in identifying these adversarial
situations: since failures are rare, there is little signal to drive
optimization. To solve this, we propose a continuation approach that learns
failure modes in related but less robust agents. Our approach also allows reuse
of data already collected for training the agent. We demonstrate the efficacy
of adversarial evaluation on two standard domains: humanoid control and
simulated driving. Experimental results show that our methods can find
catastrophic failures and estimate failure rates of agents multiple orders of
magnitude faster than standard evaluation schemes, in minutes to hours rather
than days.
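
To make the estimator concrete: the abstract does not name it, but obtaining unbiased failure-probability estimates from adversarially chosen situations is characteristic of importance sampling in the rare-event estimation literature the paper draws on. The following is a minimal, self-contained sketch of that idea on a toy problem; the failure condition, the N(0,1)/N(4,1) distributions, and all function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: vanilla Monte Carlo vs. an importance-sampled
# "adversarial" evaluation of a rare failure event. All specifics here
# (the failure threshold, the Gaussian proposal) are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)

def fails(x):
    """Toy 'agent': it fails on rare extreme initial conditions x > 4.
    Under the true distribution N(0,1) this happens with prob ~3.2e-5."""
    return x > 4.0

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2), used for importance weights."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def vanilla_monte_carlo(n):
    """Sample initial conditions from the true distribution p = N(0,1)
    and count failures directly. With few rollouts this often sees none."""
    x = rng.normal(0.0, 1.0, size=n)
    return fails(x).mean()

def adversarial_is(n, proposal_mean=4.0, proposal_std=1.0):
    """Sample from an adversarial proposal q = N(4,1) concentrated on the
    failure region, then reweight each rollout by p(x)/q(x) so the
    estimate of the failure probability remains unbiased."""
    x = rng.normal(proposal_mean, proposal_std, size=n)
    log_w = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, proposal_mean, proposal_std)
    return (fails(x) * np.exp(log_w)).mean()

n = 10_000
print(f"vanilla MC  ({n} rollouts): {vanilla_monte_carlo(n):.2e}")  # often 0.0
print(f"adversarial ({n} rollouts): {adversarial_is(n):.2e}")       # ~3.2e-5
```

In this toy setting, 10,000 vanilla rollouts yield an expected failure count of only about 0.3, so the naive estimate is frequently exactly zero, whereas the reweighted adversarial estimate spends every rollout near the failure region and recovers the true rate of roughly 3.2e-5, illustrating on a small scale the orders-of-magnitude efficiency gap the abstract reports.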