Many empirical results in reinforcement learning are
based on a very small set of environments. These
results often reflect the best algorithm
parameters found after an ad hoc tuning or
fitting process. We argue that presenting tuned
scores from a small set of environments leads to
method overfitting, wherein results may not
generalize to similar environments. To address this
problem, we advocate empirical evaluations using
generalized domains: parameterized problem
generators that explicitly encode variations in the
environment to which the learner should be robust.
We argue that testing across a set of these
generated problems yields a more meaningful
evaluation of reinforcement learning algorithms.
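
As a rough illustration of what a generalized domain could look like in code, the sketch below defines a parameterized problem generator and scores a policy across many generated problems rather than on a single tuned instance. Everything in it is an assumption made for illustration and is not taken from the paper: the toy chain environment, the names `ChainParams`, `sample_params`, `run_episode`, and `evaluate`, the parameter ranges, and the two fixed policies.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ChainParams:
    """One concrete environment drawn from the (hypothetical) generalized domain."""
    length: int        # number of states in the chain
    slip_prob: float   # probability that a chosen move is reversed


def sample_params(rng: random.Random) -> ChainParams:
    """Problem generator: the ranges encode the variation to be robust to."""
    return ChainParams(length=rng.randint(5, 15),
                       slip_prob=rng.uniform(0.0, 0.3))


def run_episode(params, policy, rng, max_steps=200):
    """Return the episode return: -1 per step until the right end is reached."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        if state == params.length - 1:
            break
        move = policy(state)                 # +1 (right) or -1 (left)
        if rng.random() < params.slip_prob:
            move = -move                     # the environment reverses the move
        state = min(max(state + move, 0), params.length - 1)
        total -= 1.0
    return total


def evaluate(policy, n_problems=50, episodes=20, seed=0):
    """Score a policy on many generated problems; returns (mean, worst, best)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_problems):
        params = sample_params(rng)
        returns = [run_episode(params, policy, rng) for _ in range(episodes)]
        scores.append(mean(returns))
    return mean(scores), min(scores), max(scores)


def always_right(state):
    return 1


def always_left(state):
    return -1


if __name__ == "__main__":
    for name, policy in [("always_right", always_right),
                         ("always_left", always_left)]:
        avg, worst, best = evaluate(policy)
        print(f"{name}: mean={avg:.1f} worst={worst:.1f} best={best:.1f}")
```

Reporting the worst and best per-problem scores alongside the mean is one possible design choice here: it shows how performance spreads over the generated problems instead of collapsing everything into a single number.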