Introduction

In this blog post we show how the application of curriculum learning can affect the performance of a simple reinforcement learning agent on some target task. We do this by handcrafting source tasks using knowledge of our domain and agent. Our findings show that a curriculum can both positively and negatively affect the performance of an agent on some target task, and that the sequencing of source tasks is significant.

Curriculum Learning

[Figure] "Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last."[1]

Curriculum learning is an area of machine learning in which the goal is to design a sequence of source tasks (a curriculum) for an agent to train on initially, such that the agent's final performance or learning speed on some target task is improved. It is motivated by the desire to apply autonomous agents to increasingly difficult tasks and serves to make such tasks easier to solve.

Domain

We conduct our experiment on a simple grid world domain. The description and visuals below are quoted directly from [2].

The world consists of a room, which can contain 4 types of objects. Keys are items the agent can pick up by moving to them and executing a pickup action. These are used to unlock locks. Each lock in a room is dependent on a set of keys. If the agent is holding the right keys, then moving to a lock and executing an unlock action opens the lock. Pits are obstacles placed throughout the domain. If the agent moves into a pit, the episode is terminated. Finally, beacons are landmarks that are placed on the corners of pits.

The goal of the learning agent is to traverse the world and unlock all the locks. At each time step, the learning agent can move in one of the four cardinal directions, execute a pickup action, or an unlock action. Moving into a wall causes no motion. Successfully picking up a key gives a reward of +500, and successfully unlocking a lock gives a reward of +1000. Falling into a pit terminates the episode with a reward of -200. All other actions receive a constant step penalty of -10.
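The reward structure above can be sketched as a small lookup. This is a hypothetical illustration (the event names and function are our own); only the reward values come from the description above.

```python
# Reward values taken from the domain description; everything else
# (event names, function shape) is an illustrative assumption.
KEY_REWARD = 500      # successfully picking up a key
UNLOCK_REWARD = 1000  # successfully unlocking a lock
PIT_PENALTY = -200    # falling into a pit (also ends the episode)
STEP_PENALTY = -10    # constant penalty for all other actions

def step_reward(event):
    """Map the outcome of a single step to its reward."""
    if event == "picked_up_key":
        return KEY_REWARD
    if event == "unlocked_lock":
        return UNLOCK_REWARD
    if event == "fell_in_pit":
        return PIT_PENALTY
    return STEP_PENALTY  # includes moving, bumping a wall, failed actions
```

Note that the large positive rewards for the key and lock, relative to the small step penalty, are what make reaching those subgoals dominate the agent's learned values.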

Agent

For our experiments we use a simple tabular Q-learning agent with an epsilon-greedy policy (epsilon = 0.1) and a learning rate of 0.01.
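A minimal sketch of such an agent is shown below. The epsilon and learning rate match the values above; the discount factor gamma is an assumption, since the post does not state one, and the class structure is our own.

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Minimal tabular Q-learning with an epsilon-greedy policy.

    epsilon and alpha match the post (0.1 and 0.01); gamma is an
    assumed value, as the post does not specify a discount factor.
    """

    def __init__(self, actions, epsilon=0.1, alpha=0.01, gamma=0.99):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.actions = actions
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma

    def act(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Standard Q-learning update toward the bootstrapped target.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        td_target = r + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (td_target - self.q[(s, a)])
```

Because the Q-table is keyed on whatever state representation it is given, the same agent can be reused across source tasks, which is what curriculum pretraining relies on.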

The agent's observation/state space is implemented as per [2] and described below.

Using localized observations in this manner (measurements relative to the agent), as opposed to absolute measurements such as (x, y) coordinates, allows the agent to transfer knowledge to similar tasks.

Experiment

We conducted an experiment to compare the performance of our agent on a target task when pretrained with various curricula of source tasks versus when untrained.

The target task is described by Figure 1 (a) in the Domain section above. We increase the difficulty of the target task by capping each episode at a maximum of 40 steps.

The various curricula are handcrafted and each contains two source tasks. All source tasks are subsets of the target task, explained in more detail in the subsequent subsections. The methods used to create the source tasks, 'Promising Initialization' and 'Task Dimension Simplification' (among other methods), were adapted from the literature [3] that inspired this work. The agent is trained on each source task for 3500 steps.
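The pretraining procedure can be sketched as a simple loop over source tasks. The environment interface (`reset`/`step`) and agent interface (`act`/`update`) here are assumptions; only the 3500-step budget per source task comes from the text.

```python
# Sketch of curriculum pretraining: run the agent on each source task
# for a fixed step budget before it ever sees the target task.
# The env/agent interfaces are illustrative assumptions.
def pretrain_on_curriculum(agent, source_tasks, steps_per_task=3500):
    for env in source_tasks:
        steps = 0
        while steps < steps_per_task:
            s = env.reset()
            done = False
            while not done and steps < steps_per_task:
                a = agent.act(s)
                s_next, r, done = env.step(a)
                agent.update(s, a, r, s_next)
                s = s_next
                steps += 1
    return agent
```

Because the same agent object carries its Q-table from one source task to the next (and finally to the target task), the ordering of `source_tasks` matters, which is exactly what the curriculum comparisons below probe.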

We average results over 50 episodes.

Curriculum 1

The first curriculum consists of two handcrafted source tasks intended to improve the agent's performance on the target task. In the first source task, the agent starts in a state closer to the key, a state critical to completing the task. In the second source task, the agent starts in a state in which it already holds the key and must navigate to unlock the lock. Additionally, in both source tasks, the dimensionality of the state space has been reduced; each can be interpreted as a 'slice' of the target task grid.

Curriculum 2

This curriculum is identical to the previous (curriculum 1), except that the sequence of the two source tasks has been swapped.

Curriculum 3

This curriculum is similar to the first curriculum in that the first source task initializes the agent closer to the key state, and the second source task initializes the agent having already obtained the key. What distinguishes this curriculum is that it encourages the agent to explore the domain by traversing the right side of the grid as opposed to the left.

Results

The above plot shows the performance of the agent solving the target task when trained with the various curricula versus with no prior training. The curves depicting the performance of the curriculum-trained agents are offset by the number of steps required to train the agent on each curriculum (~7000 steps). Despite this training offset, we observe some interesting results:

1) The agent pretrained with curriculum 1 achieves optimal performance on the target task ~10000 steps (+38%) faster than the agent with no prior training, demonstrating the utility of a good curriculum.

2) The agent pretrained with curriculum 2 also achieves optimal performance on the target task, but more slowly than with curriculum 1. Recall that curriculum 2 comprises the same source tasks as curriculum 1, except in reverse order. This demonstrates the effect of source task sequencing.

3) The agent pretrained with curriculum 3 takes marginally longer to achieve optimal performance on the target task, demonstrating the detriment of a bad curriculum.

Future Work

In this experiment we used a simple tabular Q-learning agent. More complex RL agents employing function approximation over the state space have been shown to generalize knowledge to unseen states. We hypothesize that this property should allow such agents to train on, and apply knowledge from, a wider and possibly more valuable set of source tasks. Additionally, it has been shown that curricula affect different kinds of agents in varying ways. In later work we will explore using a variety of different agents.

Additionally, in this experiment we handcrafted the training curricula. In later work we aim to explore automated methods of constructing source tasks and sequencing them into a curriculum.