Introducing ML-Agents Toolkit v0.2: Curriculum Learning, new environments, and more

The Machine Learning team at Unity is happy to announce the release of a new version of Unity Machine Learning Agents Toolkit – v0.2 Beta! With this release, we are improving the toolkit on all fronts: (1) new features in both the Unity SDK and Python API; (2) new example environments to build on; and (3) improvements to our default Reinforcement Learning algorithm (PPO), along with many bug fixes and other smaller improvements. In this post, we highlight some of the major additions; for a full list, check out the GitHub release notes. You can also visit the GitHub page to download the latest release.

We launched v0.1 Beta over two months ago, and it has been wonderful to see projects and games already being made using the Unity ML-Agents Toolkit, as well as all the helpful community feedback. To inspire more creative use cases in machine learning and beyond with ML-Agents Toolkit, we are excited to announce our very first ML-Agents Community Challenge in this post.

New Continuous Control and Platforming Environments

One of the major requests we received was additional example environments to allow developers a greater variety of baselines from which to start building. We are happy to include four new environments in this release. These environments include two new continuous control environments, plus two platforming environments designed to show off our new Curriculum Learning feature (more on that below).

New Features: Curriculum Learning, Broadcasting, and a more flexible monitor

Curriculum Learning – Our Python API now includes a standardized way of utilizing Curriculum Learning during the training process. For those unfamiliar, Curriculum Learning is a way of training a machine learning model where more difficult aspects of a problem are gradually introduced in such a way that the model is always optimally challenged. Here is a link to the original paper, which introduces the idea formally. More generally, this idea has been around much longer, for it is how we humans typically learn. If you think of any primary school education, there is an ordering of classes and topics. Arithmetic is taught before algebra, for example. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide a scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide a scaffolding for harder tasks in the future.

Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last.

When we think about how Reinforcement Learning actually works, the primary learning signal is a scalar reward received occasionally throughout training. In more complex or difficult tasks, this reward can often be sparse and rarely achieved. For example, imagine a task in which an agent needs to push a block into place to scale a wall and arrive at a goal. The starting point when training an agent to accomplish this task will be a random policy. Under that starting policy the agent will likely run in circles, and will never, or only very rarely, scale the wall properly to receive the reward. If we start with a simpler task, such as moving toward an unobstructed goal, then the agent can easily learn to accomplish the task. From there, we can slowly add to the difficulty of the task by increasing the size of the wall, until the agent can complete the initially near-impossible task of scaling the wall. We are including just such an environment with Unity ML-Agents Toolkit v0.2, called Wall Area.

Demonstration of a curriculum training scenario in which a progressively taller wall obstructs the path to the goal.

To see this in action, observe the two learning curves below. Each displays the reward over time for a brain trained using PPO with the same set of training hyperparameters and data from 32 simultaneous agents. The difference is that the orange line corresponds to a brain trained using the full-height wall version of the task, and the blue line corresponds to a brain trained using a curriculum version of the task. As you can see, without curriculum learning the agent has a lot of difficulty, and after 3 million steps it has still not solved the task. We think that by using well-crafted curricula, agents trained using reinforcement learning will be able to accomplish otherwise much more difficult tasks in much less time.

Two training curves on the Wall Area task. Blue line corresponds to brain trained using curriculum learning. Orange line corresponds to brain trained without curriculum learning. Dotted vertical blue lines correspond to a lesson change in the curriculum for the curriculum training session.

So how does it work? In order to define a curriculum, the first step is to decide which parameters of the environment will vary. In the case of the Wall Area environment, what varies is the height of the wall. We can define this as a reset parameter in the Academy object of our scene, and by doing so it becomes adjustable via the Python API. Rather than adjusting it by hand, we then create a simple JSON file which describes the structure of the curriculum. Within it we can set at what points in the training process our wall height will change, based either on the percentage of training steps which have taken place or on the average reward the agent has recently received. Once these are in place, we simply launch ppo.py using the --curriculum-file flag to point to the JSON file, and PPO will train using Curriculum Learning. Of course, we can then keep track of the current lesson and progress via TensorBoard.

Here’s an example of a JSON file which defines the Curriculum for the Wall Area environment:

{
    "measure": "reward",
    "thresholds": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "min_lesson_length": 2,
    "signal_smoothing": true,
    "parameters":
    {
        "min_wall_height": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
        "max_wall_height": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
    }
}
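To make the role of each field more concrete, here is a minimal Python sketch of how a file like this could drive lesson changes. It is an illustration only, not the toolkit's own code: the class name CurriculumSketch, its methods, the smoothing weights, and the exact advancement rule are all assumptions made for this example. It assumes a lesson advances once the (optionally smoothed) measure exceeds the current threshold and the lesson has lasted longer than min_lesson_length updates. Note that nine thresholds separate ten lessons, which is why each parameter lists ten values.

```python
import json

# The Wall Area curriculum from above, embedded so the sketch is self-contained.
CURRICULUM_JSON = """
{
    "measure": "reward",
    "thresholds": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "min_lesson_length": 2,
    "signal_smoothing": true,
    "parameters": {
        "min_wall_height": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
        "max_wall_height": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
    }
}
"""


class CurriculumSketch:
    """Hypothetical helper for illustration; not part of the ML-Agents API."""

    def __init__(self, config):
        self.config = config
        self.lesson = 0
        self.updates_in_lesson = 0
        self.smoothed = 0.0
        # N thresholds separate N + 1 lessons.
        self.last_lesson = len(config["thresholds"])
        for name, values in config["parameters"].items():
            if len(values) != self.last_lesson + 1:
                raise ValueError(f"{name} needs {self.last_lesson + 1} values")

    def reset_parameters(self):
        """Return the reset parameter values for the current lesson."""
        return {name: values[self.lesson]
                for name, values in self.config["parameters"].items()}

    def update(self, measure_value):
        """Feed in the latest measure (here: mean reward) and maybe advance."""
        if self.config["signal_smoothing"]:
            # Assumed smoothing rule: blend the new value with the running one.
            self.smoothed = 0.25 * self.smoothed + 0.75 * measure_value
            measure_value = self.smoothed
        self.updates_in_lesson += 1
        if (self.lesson < self.last_lesson
                and measure_value > self.config["thresholds"][self.lesson]
                and self.updates_in_lesson > self.config["min_lesson_length"]):
            self.lesson += 1
            self.updates_in_lesson = 0
        return self.reset_parameters()


if __name__ == "__main__":
    curriculum = CurriculumSketch(json.loads(CURRICULUM_JSON))
    # Made-up reward values, purely to exercise the lesson logic.
    for mean_reward in [0.1, 0.4, 0.9, 1.0, 1.0, 1.0]:
        params = curriculum.update(mean_reward)
        print(curriculum.lesson, params)
```

In a real training run, the dictionary returned by reset_parameters() would correspond to the reset parameters defined on the Academy, and the measure would come from the trainer rather than from a hard-coded list.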

For those in the community who have created environments which they have had difficulty getting their agents to solve, we encourage you to try Curriculum Learning out, and we would love to hear your findings.

Broadcasting – The internal, heuristic, and player brains now all include a “Broadcast” feature, which is active by default. When active, the states, actions, and rewards for all agents linked to that brain are accessible from the Python API. This is in contrast to v0.1, where only the external brain could send information to the Python API. This feature can be used to record, analyze, or store information from these brain types in Python. Specifically, this feature makes imitation learning possible, where data from a player, heuristic, or internal brain is used as the supervision signal to train a separate network, either without needing to define a reward function or alongside a reward function to augment the training signal. We think this can provide a new avenue for how game developers think of getting intelligent behavior from their systems. In a future blog post, we plan to walk through this scenario and provide an example project.

Brain Inspector window. In 0.2 the “Broadcast” check-box is added.
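As a rough illustration of the imitation learning idea described above, the following Python sketch fits a tiny policy to recorded state-action pairs using plain supervised learning, with no reward function involved. Everything in it is an assumption made for the example: the demonstration data is invented, the state is a single hypothetical number (a ball's horizontal offset from the platform center), and a one-weight logistic model stands in for the "separate network". How the pairs are actually retrieved from a broadcasting brain through the Python API is not shown here.

```python
import math

# Invented demonstrations: (state, action) pairs as they might be recorded
# from a player brain. State: ball offset from center; action: 0 = tilt left,
# 1 = tilt right. The demonstrator tilts toward the side the ball is on.
demonstrations = [(-1.0, 0), (-0.6, 0), (-0.2, 0), (0.2, 1), (0.6, 1), (1.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_policy(pairs, epochs=500, learning_rate=0.5):
    """Fit a logistic model to the demonstrated actions (behavioral cloning)."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for state, action in pairs:
            predicted = sigmoid(weight * state + bias)
            # Gradient of the cross-entropy loss for one example.
            error = predicted - action
            weight -= learning_rate * error * state
            bias -= learning_rate * error
    return weight, bias


def act(policy, state):
    """Pick the action the cloned policy considers most likely."""
    weight, bias = policy
    return 1 if sigmoid(weight * state + bias) >= 0.5 else 0


if __name__ == "__main__":
    policy = train_policy(demonstrations)
    for state in (-0.8, 0.8):
        print(state, act(policy, state))
```

The supervision signal here is simply the demonstrated action for each state. A reward signal, where available, could be combined with this kind of training rather than replaced by it.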

Flexible Monitor – We have rewritten the Agent Monitor to provide more general usability. Whereas the original Monitor had a fixed set of statistics about an agent which could be displayed, the new Monitor now allows for displaying any desired information related to agents. All you have to do is call Monitor.Log() to display information either on the screen or above an agent within the scene.

Ball balance environment with a variety of training information displayed using the Monitor.

As with any beta release, there will likely be bugs and issues. We encourage you to share feedback with us on the GitHub issues page.

Unity ML-Agents Community Challenge

Last but not least, we are excited to announce that Unity will be hosting an ML-Agents Community Challenge. Whether you’re an expert in Machine Learning or just interested in how ML can be applied to games, this challenge is a great opportunity for you to learn, explore, inspire and get inspired.

We want to see how you apply the new Curriculum Learning method. But we’re not looking for any particular genre or style, so get creative! We’ll send some gifts and surprises to the creators who get the most likes at the end of the challenge.

I’m completely new when it comes to ML, but I do have a question: can an AI trained for a specific situation be used to analyse a player’s behavior and anticipate whether the player is doing fine or not, based on its “trained history” or something like that?

Something like this is definitely possible. There are a number of concrete ways to interpret what you are describing. One I could imagine is that if you trained an agent to play a specific level of a game, you could see how well it does on average (based on some scoring mechanism), and see what parts of the level it explores. You could then collect data from players playing that level, and compare their scores to the agent. More interestingly, you could compare the “policies” for the agent and the player, and see how similarly they play. Just some ideas.

Good to hear that. It would be nice to see some examples or documentation that could help. I’m focusing more on specific game situations, like specific puzzles in a game. Each puzzle would have specific rules and gameplay, so I guess the “knowledge” required would be smaller but more specific. So if I can have a trained AI that knows those rules and how to play, it could “understand” whether the player is performing well or not.

Fantastic work, everyone. Glad to see Unity in this space. Don’t forget OpenCV can now run TensorFlow MODELS!!!! It would be great to get a computer vision demo using OpenCV/TensorFlow inside a Unity environment. Things like a “virtual” line-following robot or the ability for game objects to actually “see” and recognize items in a game would be mind-blowing.