Reinforcement Learning with Multi-Fidelity Simulators

Fig 1. MFRL architecture: a multi-fidelity chain of simulators and learning agents. Agents send exploration heuristics to higher-fidelity agents and learned model parameters to lower-fidelity agents. The environments are related by state mappings ρi and optimism bounds βi. Control switches between learning agents, going up when an optimal policy is found and down when unexplored regions are encountered.

Simulators play a key role as testbeds for robotics control algorithms, but deciding when to use them versus collecting real-world data is often treated as an art. We have developed a framework for efficient reinforcement learning (RL) in a scenario where multiple simulators of a target task, each with a different level of fidelity, are available. Our framework limits the number of samples used in each successively higher-fidelity (and higher-cost) simulator by allowing a learning agent to run trajectories at the lowest-level simulator that will still provide it with useful information. The approach transfers state-action Q-values from lower-fidelity models as heuristics for the "Knows What It Knows" (KWIK) Rmax family of RL algorithms, and is applicable over a wide range of dynamics and reward representations. We prove bounds on the framework's sample complexity and demonstrate it empirically on a remote-controlled car with multiple simulators. The approach allows RL algorithms to find near-optimal policies in a physical robot domain with fewer expensive real-world samples than previous transfer approaches or learning without simulators.
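The transfer mechanism described above can be sketched on a toy problem: Q-values computed in a cheap low-fidelity model, inflated by an optimism bound β, stand in for the uniform Vmax optimism that Rmax normally assigns to unknown state-action pairs, so the higher-fidelity agent only explores where the lower level suggests value. This is a minimal illustrative sketch, not the paper's implementation: the 5-state chain MDP, the β value, and all names (`rmax_run`, `plan`, etc.) are assumptions made for the example.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 5, 2  # toy chain: action 0 = left, 1 = right

def step(s, a, reward_scale=1.0):
    """Deterministic chain: entering the last state pays a reward, then resets."""
    if s == N_STATES - 1:
        return 0, 0.0                       # goal state loops back to the start
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    r = reward_scale if s2 == N_STATES - 1 else 0.0
    return s2, r

def q_values(reward_scale, iters=300):
    """Exact Q-values for one fidelity level via value iteration."""
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                s2, r = step(s, a, reward_scale)
                Q[s][a] = r + GAMMA * V[s2]
    return Q

def plan(counts, model, heuristic, m, iters=100):
    """Value iteration on the learned model; unknown (s,a) pairs keep their
    optimistic heuristic value, which drives exploration toward them."""
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                if counts[s][a] >= m:
                    s2, r = model[(s, a)]
                    Q[s][a] = r + GAMMA * V[s2]
                else:
                    Q[s][a] = heuristic[s][a]
    return Q

def rmax_run(heuristic, n_steps=200, m=1):
    """Rmax-style loop; m=1 suffices because the toy dynamics are deterministic."""
    counts = [[0] * N_ACTIONS for _ in range(N_STATES)]
    model = {}                              # (s, a) -> (s2, r)
    s = 0
    for _ in range(n_steps):
        Q = plan(counts, model, heuristic, m)
        a = max(range(N_ACTIONS), key=lambda a: Q[s][a])
        s2, r = step(s, a)
        counts[s][a] += 1
        model[(s, a)] = (s2, r)
        s = s2
    known = sum(1 for s in range(N_STATES) for a in range(N_ACTIONS)
                if counts[s][a] >= m)
    return known, plan(counts, model, heuristic, m)

# Baseline: uniform optimism, the standard Rmax bound Vmax = Rmax / (1 - gamma).
vmax = 1.0 / (1.0 - GAMMA)
known_plain, _ = rmax_run([[vmax] * N_ACTIONS for _ in range(N_STATES)])

# Transfer: Q-values from a low-fidelity model (here, rewards scaled by 0.9)
# plus a small optimism bound beta serve as the heuristic at the next level.
beta = 0.1
q_low = q_values(reward_scale=0.9)
known_transfer, q_final = rmax_run([[q + beta for q in row] for row in q_low])

print(known_plain, known_transfer)  # transfer leaves fewer (s,a) pairs to explore
```

In this toy run the uniform-Vmax agent must visit every state-action pair before its optimism is resolved, while the transfer agent only explores the pairs the low-fidelity Q-values already mark as promising, yet both recover the optimal "go right" policy.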