Uber has released a blog post describing Go-Explore, a new Reinforcement Learning (RL) algorithm to deal with hard exploration problems. These problems are characterized by the lack (or sparsity) of external feedback, which makes it difficult for the algorithm to learn how to operate. One of the popular testbeds for RL algorithms is Atari Games, among them Montezuma’s Revenge, a complex game that requires an agent to go through hundreds of actions to win points (for example, discover a key in a future level).

Go-Explore shatters the records of all its predecessors in Montezuma’s Revenge and Pitfall, reaching an average result of 400,000 whereas OpenAI’s previous state-of-the-art algorithm (which was released a month ago) broke the record with a result of 11,347. A summary of the OpenAI technique can be found here.

The Model

The Go-Explore algorithm consists of three important techniques:

Exploration from stored statesThe agent starts by exploring the environment without any prior knowledge about it. When the agent reaches a new state, especially one that rewards him with points, the algorithm stores the state in its archive. This storage is key to the agent’s success since in every new episode (game), the agent continues exploring from a stored state, thus managing to progress to new states over time.

A challenge in using stored states for Montezuma’s Revenge is differentiating between similar states which are similar in essence but not identical. Go-Explore generalizes states by downsampling the resolution to 11×8 pixels, causing many similar states to be treated as equal.

Imitation LearningAnother technique which allows to generalize the model and teach it how to act in new states is imitation learning, i.e. training a neural network to learn a policy using “replays” of successful games. However, rather than using human player demonstrations, as usually done, the model uses games that reached interesting states during the exploration stage.

The algorithm repeats the exploration and imitation learning steps over and over again to improve its results. The policy improves the exploration from stored states, which enables the agent to reach new states that can be used later for the imitation learning. This combination achieves state-of-the-art results in Montezuma’s Revenge with an average score of over 35,000.

Domain KnowledgeGo-Explore shows that adding very basic information about the world (i.e. the game) can greatly improve the performance of the algorithm. In Montezuma’s Revenge, for example, the number of keys the player holds can be extracted from the screen pixels. The chart below shows the average scores of the algorithm with and without domain knowledge.

Conclusion

Go-Explore has surprised many by achieving state-of-the-art results in two hard Atari games. The milestone was achieved by combining several interesting concepts – storing generalized states, learning from the agent’s games, and utilizing domain knowledge.

The creators of Go-Explore have not yet release a paper or code, making it difficult to discover possible pitfalls in the presented technique. One concern is the technique’s ability to handle randomness in the game (e.g. the Noisy-TV problem), an issue which will be further explored as more information regarding Go-Explore is published.

Note: In a new version of the blog post the authors added a discussion regarding the model’s ability to handle randomness. They also evaluated the model in a stochastic environment (by added random sticky actions) – the scores are significantly lower but still state-of-the-art.