Model-based Reinforcement Learning with Neural Network Dynamics

Fig 1. A learned neural network dynamics model enables a hexapod robot to learn
to run and follow desired trajectories, using just 17 minutes of real-world
experience.

Enabling robots to act autonomously in the real world is difficult. Really,
really difficult. Even with expensive robots and teams of world-class
researchers, robots still have difficulty autonomously navigating and
interacting in complex, unstructured environments.

Why are autonomous robots not out in the world among us? Engineering systems
that can cope with all the complexities of our world is hard. From nonlinear
dynamics and partial observability to unpredictable terrain and sensor
malfunctions, robots are particularly susceptible to Murphy’s law: everything
that can go wrong, will go wrong. Instead of fighting Murphy’s law by coding
each possible scenario that our robots may encounter, we could instead choose to
embrace this possibility for failure, and enable our robots to learn from it.
Learning control strategies from experience is advantageous because, unlike
hand-engineered controllers, learned controllers can adapt and improve with more
data. Therefore, when presented with a scenario in which everything does go
wrong, although the robot will still fail, the learned controller will hopefully
correct its mistake the next time it is presented with a similar scenario. In
order to deal with the complexities of tasks in the real world, current
learning-based methods often use deep neural networks, which are powerful but
not data efficient: these trial-and-error learners will most often still
fail a second time, a third time, and often thousands to millions of times.
The sample inefficiency of modern deep reinforcement learning methods is one of
the main bottlenecks to leveraging learning-based methods in the real world.

We have been investigating sample-efficient learning-based approaches with
neural networks for robot control. For complex and contact-rich simulated
robots, as well as real-world robots (Fig. 1), our approach is able to learn
locomotion skills for trajectory following using only minutes of data collected
from the robot randomly acting in the environment. In this blog post, we’ll
provide an overview of our approach and results. More details can be found in
our research papers listed at the bottom of this post, including this paper,
for which code is available here.

Sample efficiency: model-free versus model-based

Learning robotic skills from experience typically falls under the umbrella of
reinforcement learning. Reinforcement learning algorithms can generally be
divided into two categories: model-free, which learn a policy or value function, and
model-based, which learn a dynamics model. While model-free deep reinforcement
learning algorithms are capable of learning a wide range of robotic skills, they
typically suffer from very high sample complexity, often
requiring millions of samples to achieve good performance, and can typically
only learn a single task at a time. Although some prior work has deployed these
model-free algorithms for real-world manipulation tasks, the high sample
complexity and inflexibility of these algorithms have hindered them from being
widely used to learn locomotion skills in the real world.

Model-based reinforcement learning algorithms are generally regarded as being
more sample efficient. However, to achieve good sample efficiency, these
model-based algorithms have conventionally used either relatively simple
function approximators, which fail to generalize well to complex
tasks, or probabilistic dynamics models such as Gaussian processes,
which generalize well but have difficulty with complex and high-dimensional
domains, such as systems with frictional contacts that induce discontinuous
dynamics. Instead, we use medium-sized neural networks to serve as function
approximators that can achieve excellent sample efficiency, while still being
expressive enough for generalization and application to various complex and
high-dimensional locomotion tasks.

Neural Network Dynamics for Model-Based Deep Reinforcement Learning

In our work, we aim to extend the successes that deep neural network models have
seen in other domains into model-based reinforcement learning. Prior efforts to
combine neural networks with model-based RL in recent years have not achieved
the kinds of results that are competitive with simpler models, such as Gaussian
processes. For example, Gu et al. observed that even linear models
achieved better performance for synthetic experience generation, while Heess
et al. saw relatively modest gains from including neural network models
into a model-free learning system. Our approach relies on a few crucial
decisions. First, we use the learned neural network model within a model
predictive control framework, in which the system can iteratively replan and
correct its mistakes. Second, we use a relatively short horizon look-ahead so
that we do not have to rely on the model to make very accurate predictions far
into the future. These two relatively simple design decisions enable our method
to perform a wide variety of locomotion tasks that have not previously been
demonstrated with general-purpose model-based reinforcement learning methods
that operate directly on raw state observations.

A diagram of our model-based reinforcement learning approach is shown in Fig. 2.
We maintain a dataset of trajectories that we iteratively add to, and we use
this dataset to train our dynamics model. The dataset is initialized with random
trajectories. We then perform reinforcement learning by alternating between
training a neural network dynamics model using the dataset, and using a model
predictive controller (MPC) with our learned dynamics model to gather additional
trajectories to aggregate onto the dataset. We discuss these two components
below.
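As a heavily simplified, self-contained illustration of this loop, the sketch below alternates between fitting a dynamics model to an aggregated dataset and collecting new rollouts with a random-shooting planner. Everything here is an illustrative stand-in, not the systems from our experiments: the toy 1-D double-integrator "environment", the linear least-squares model, and all constants and helper names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy 1-D double integrator standing in for the real robot."""
    pos, vel = s
    vel = vel + 0.1 * a
    return np.array([pos + 0.1 * vel, vel])

def rollout(policy, T=25):
    """Collect one trajectory of (s, a, s') transitions."""
    s, traj = np.zeros(2), []
    for _ in range(T):
        a = policy(s)
        s2 = env_step(s, a)
        traj.append((s, a, s2))
        s = s2
    return traj

def fit_model(dataset):
    """Least-squares linear model of the state difference s' - s."""
    X = np.array([[s[0], s[1], a] for s, a, _ in dataset])
    Y = np.array([s2 - s for s, a, s2 in dataset])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda s, a: s + np.array([s[0], s[1], a]) @ W

def mpc_policy(model, goal=1.0, H=10, K=200):
    """Random-shooting planner over the learned model."""
    def policy(s):
        A = rng.uniform(-1, 1, (K, H))
        best_r, best_a = -np.inf, 0.0
        for k in range(K):
            x, r = s, 0.0
            for t in range(H):
                x = model(x, A[k, t])
                r -= (x[0] - goal) ** 2    # track the goal position
            if r > best_r:
                best_r, best_a = r, A[k, 0]
        return best_a
    return policy

# Initialize the dataset with random trajectories, then alternate
# between training the dynamics model and collecting MPC rollouts.
dataset = [tr for _ in range(5) for tr in rollout(lambda s: rng.uniform(-1, 1))]
for _ in range(3):
    model = fit_model(dataset)
    dataset += rollout(mpc_policy(model))
```

On this toy system the aggregated dataset steadily shifts from random transitions toward on-policy MPC transitions, mirroring the alternation in Fig. 2.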

Fig 2. Overview of our model-based reinforcement learning algorithm.

Dynamics Model

We represent our learned dynamics function as a deep neural network,
parameterized by some weights that need to be learned. Our dynamics function
takes as input the current state $s_t$ and action $a_t$, and outputs the
predicted state difference $s_{t+1}-s_t$. The dynamics model itself can be
trained in a supervised learning setting, where collected training data comes in
pairs of inputs $(s_t,a_t)$ and corresponding output labels $s_{t+1}-s_t$.
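To make the supervised setup concrete, here is a minimal numpy sketch (not our actual training code) of a small network that maps $(s_t, a_t)$ to the predicted state difference and is trained by gradient descent on squared error. The layer sizes, learning rate, and hand-derived backward pass are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(state_dim, action_dim, hidden=64):
    """Initialize a one-hidden-layer dynamics network."""
    in_dim = state_dim + action_dim
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, state_dim)),
        "b2": np.zeros(state_dim),
    }

def predict_delta(params, s, a):
    """Predict the state difference s_{t+1} - s_t from (s_t, a_t)."""
    x = np.concatenate([s, a], axis=-1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def train_step(params, S, A, S_next, lr=1e-2):
    """One gradient-descent step on the squared error of predicted
    state differences (gradients are exact up to a constant factor)."""
    target = S_next - S                        # labels are differences
    X = np.concatenate([S, A], axis=1)
    H = np.tanh(X @ params["W1"] + params["b1"])
    pred = H @ params["W2"] + params["b2"]
    err = pred - target
    N = S.shape[0]
    dW2 = H.T @ err / N
    db2 = err.mean(axis=0)
    dH = (err @ params["W2"].T) * (1.0 - H ** 2)   # through tanh
    dW1 = X.T @ dH / N
    db1 = dH.mean(axis=0)
    for k, g in (("W1", dW1), ("b1", db1), ("W2", dW2), ("b2", db2)):
        params[k] -= lr * g
    return float((err ** 2).mean())
```

Predicting the state *difference* rather than the next state keeps the regression targets small and centered, which tends to make training easier.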

Note that the “state” that we refer to above can vary with the agent, and it can
include elements such as center of mass position, center of mass velocity, joint
positions, and other measurable quantities that we choose to include.

Controller

In order to use the learned dynamics model to accomplish a task, we need to
define a reward function that encodes the task. For example, a standard “x_vel”
reward, which measures forward velocity, could encode the task of moving
forward. For the task of trajectory
following, we formulate a reward function that incentivizes staying close to the
trajectory as well as making forward progress along the trajectory.
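As an illustrative sketch (the state layout, helper name, and penalty weight are assumptions, not the exact reward from our experiments), such a trajectory-following reward can combine arc-length progress along a piecewise-linear desired path with a penalty on perpendicular distance from it:

```python
import numpy as np

def trajectory_following_reward(pos, waypoints, lateral_weight=10.0):
    """Score a 2-D position against a piecewise-linear desired path:
    reward arc-length progress along the path, penalize perpendicular
    distance from it. `pos` is (x, y); `waypoints` is an (N, 2) array."""
    dists, progress = [], []
    traveled = 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        seg = b - a
        seg_len = np.linalg.norm(seg)
        # Project pos onto the segment, clipped to its endpoints.
        t = np.clip(np.dot(pos - a, seg) / np.dot(seg, seg), 0.0, 1.0)
        closest = a + t * seg
        dists.append(np.linalg.norm(pos - closest))
        progress.append(traveled + t * seg_len)
        traveled += seg_len
    i = int(np.argmin(dists))              # nearest segment wins
    return progress[i] - lateral_weight * dists[i]
```

The two terms directly mirror the two incentives above: staying close to the trajectory and making forward progress along it.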

Using the learned dynamics model and task reward function, we formulate a
model-based controller. At each time step, the agent plans $H$ steps into the
future by randomly generating $K$ candidate action sequences, using the learned
dynamics model to predict the outcome of those action sequences, and selecting
the sequence corresponding to the highest cumulative reward (Fig. 3). We then
execute only the first action from that sequence, and repeat the full planning
process at the next time step. This replanning makes the approach robust to
inaccuracies in the learned dynamics model.
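The planning step above can be sketched as random-shooting MPC. Everything here (the function name, the batched `dynamics_fn` and `reward_fn` callables, and the bounds and defaults) is a hypothetical stand-in for the learned model and task reward:

```python
import numpy as np

def mpc_first_action(dynamics_fn, reward_fn, state, action_dim,
                     horizon=10, n_candidates=1000,
                     low=-1.0, high=1.0, rng=None):
    """Random-shooting MPC: sample K candidate action sequences of
    length H, roll them through the learned dynamics model in a batch,
    and return the first action of the highest-return sequence.
    dynamics_fn(states, actions) -> next_states (batched over rows);
    reward_fn(states, actions, next_states) -> per-row rewards."""
    if rng is None:
        rng = np.random.default_rng(0)
    K, H = n_candidates, horizon
    actions = rng.uniform(low, high, size=(K, H, action_dim))
    states = np.tile(state, (K, 1))
    returns = np.zeros(K)
    for t in range(H):
        next_states = dynamics_fn(states, actions[:, t])
        returns += reward_fn(states, actions[:, t], next_states)
        states = next_states
    best = int(np.argmax(returns))
    return actions[best, 0]    # execute only this action, then replan
```

In the full controller this function is called once per time step: the chosen action is executed on the robot, the new state is observed, and planning starts over from scratch.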

Fig 3. Illustration of the process of simulating multiple candidate action
sequences using the learned dynamics model, predicting their outcome, and
selecting the best one according to the reward function.

Results

We first evaluated our approach on a variety of MuJoCo agents, including the
swimmer, half-cheetah, and ant. Fig. 4 shows that using our learned dynamics
model and MPC controller, the agents were able to follow paths defined by a set
of sparse waypoints. Furthermore, our approach used only minutes of random
data to train the learned dynamics model, showing its sample efficiency.

Note that with this method, we trained the model only once, but simply by
changing the reward function, we were able to apply the model at runtime to a
variety of different desired trajectories, without a need for separate
task-specific training.

Fig 4: Trajectory following results with ant, swimmer, and half-cheetah. The
dynamics model used by each agent in order to perform these various trajectories
was trained just once, using only randomly collected training data.

What aspects of our approach were important to achieve good performance? We
first looked at the effect of varying the MPC planning horizon H. Fig. 5 shows
that performance suffers if the horizon is too short, possibly due to
unrecoverable greedy behavior. For half-cheetah, performance also suffers if the
horizon is too long, due to inaccuracies in the learned dynamics model. Fig. 6
illustrates our learned dynamics model for a single 100-step prediction, showing
that open-loop predictions for certain state elements eventually diverge from
the ground truth. Therefore, an intermediate planning horizon is best to avoid
greedy behavior while minimizing the detrimental effects of an inaccurate model.

Fig 5: Plot of task performance achieved by controllers using different horizon
values for planning. Too low of a horizon is not good, and neither is too high
of a horizon.

Fig 6: A 100-step forward simulation (open-loop) of the dynamics model, showing
that open-loop predictions for certain state elements eventually diverge from
the ground truth.

We also varied the number of initial random trajectories used to train the
dynamics model. Fig. 7 shows that although more initial training data leads to
higher initial performance, data aggregation allows even runs initialized with
very little data to reach a high final performance level. This
highlights how on-policy data from reinforcement learning can improve sample
efficiency.

Fig 7: Plot of task performance achieved by dynamics models that were trained
using differing amounts of initial random data.

It is worth noting that the final performance of the model-based controller is
still substantially lower than that of a very good model-free learner (when the
model-free learner is trained with thousands of times more experience). This
suboptimal performance is sometimes referred to as “model bias,” and is a known
issue in model-based RL. To address this issue, we also proposed a hybrid
approach that combines model-based and model-free learning to eliminate the
asymptotic bias at convergence, though at the cost of additional experience.
This hybrid approach, along with additional analyses, is described in our
paper.

Learning to run in the real world

Fig 8: The VelociRoACH is 10 cm in length, approximately 30 grams in weight, can
move up to 27 body-lengths per second, and uses two motors to control all six
legs.

Since our model-based reinforcement learning algorithm can learn locomotion
gaits using orders of magnitude less experience than model-free algorithms, it
is possible to evaluate it directly on a real-world robotic platform. In other
work, we studied how this method can learn entirely from real-world experience,
acquiring locomotion gaits for a millirobot (Fig. 8) completely from scratch.

Millirobots are a promising robotic platform for many applications due to their
small size and low manufacturing costs. However, controlling these millirobots
is difficult due to their underactuation, power constraints, and size. While
hand-engineered controllers can sometimes control these millirobots, they often
have difficulties with dynamic maneuvers and complex terrains. We therefore
leveraged our model-based learning technique from above to enable the
VelociRoACH millirobot to do trajectory following. Fig. 9 shows that our
model-based controller can accurately follow trajectories at high speeds, after
having been trained using only 17 minutes of random data.

Fig 9: The VelociRoACH following various desired trajectories, using our
model-based learning approach.

To analyze the model’s generalization capabilities, we gathered data on both
carpet and styrofoam terrain, and we evaluated our approach as shown in Table 1.
As expected, the model-based controller performs best when executed on the same
terrain that it was trained on, indicating that the model incorporates knowledge
of the terrain. However, performance diminishes when the model is trained on
data gathered from both terrains, which likely indicates that more work is
needed to develop algorithms for learning models that are effective across
various task settings. Promisingly, Table 2 shows that performance increases as
more data is used to train the dynamics model, which is an encouraging
indication that our approach will continue to improve over time (unlike
hand-engineered solutions).

Table 1: Trajectory following costs incurred for models trained with different
types of data and for trajectories executed on different surfaces.

Table 2: Trajectory following costs incurred during the use of dynamics
models trained with differing amounts of data.

We hope that these results show the promise of model-based approaches for
sample-efficient robot learning and encourage future research in this area.

We would like to thank Sergey Levine and Ronald Fearing for their feedback.