Wednesday, January 31, 2018

How to build your own AlphaZero AI using Python and Keras (English)

In March 2016, DeepMind’s AlphaGo beat 18-time world champion Go player Lee Sedol 4–1 in a series watched by over 200 million people. A machine had learnt a super-human strategy for playing Go, a feat previously thought impossible or, at the very least, a decade away from being accomplished.

...on 18th October 2017, DeepMind took a giant leap further.

The paper ‘Mastering the Game of Go without Human Knowledge’ unveiled a new variant of the algorithm, AlphaGo Zero, that had defeated AlphaGo 100–0. Incredibly, it had done so by learning solely through self-play, starting ‘tabula rasa’ (blank slate) and gradually finding strategies that would beat previous incarnations of itself. No longer was a database of human expert games required to build a super-human AI.

A mere 48 days later, on 5th December 2017, DeepMind released another paper, ‘Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm’, showing how AlphaGo Zero could be adapted to beat the world-champion programs Stockfish and Elmo at chess and shogi. The entire learning process, from being shown the games for the first time to becoming the best computer program in the world, had taken under 24 hours.

With this, AlphaZero was born — the general algorithm for getting good at something, quickly, without any prior knowledge of human expert strategy.

...The key is not in any of the components being extremely innovative (although there is definitely some smart new stuff going on), but rather in the formulation of the problem itself. This is not about supervised vs. unsupervised learning. It is not even about the fact that the network learns without human intervention or examples. It is about the fact that AlphaGo Zero learned without any data!

...It turns out that (under some constraints) we don’t need data at all! The only thing that was input into the model was the basic rules of the game, not even complex strategies or known “tricks”. Can you imagine if you could do the same thing in other domains? You specify the rules of the system, you let it generate data and learn from itself...

...Another interesting side effect is that, in fact, the whole AlphaGo Zero system can be seen as a synthetic data generation system. I would be really curious to see how AlphaGo Lee (the previous version) would perform if trained on the data generated by AlphaGo Zero.
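The idea of specifying only the rules and letting the system generate its own data can be sketched in a few lines of Python. This is a hypothetical toy, not DeepMind's code: the game is tic-tac-toe, the function names are made up for illustration, and moves are picked at random as a stand-in for the network-guided search that AlphaGo Zero actually uses. The point is only that every labelled training example comes out of the rules, with no human games involved.

```python
import random

# Hypothetical toy game: tic-tac-toe on a flat list of 9 cells.
# Cell values: 0 = empty, +1 = first player, -1 = second player.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return +1 or -1 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def self_play_game(rng):
    """Play one game with random legal moves and return labelled positions.

    Each example is (board, player_to_move, outcome), where outcome is
    +1 (win), -1 (loss) or 0 (draw) from the point of view of the player
    who was about to move in that position.
    """
    board, player, history = [0] * 9, 1, []
    while True:
        legal = [i for i, cell in enumerate(board) if cell == 0]
        result = winner(board)
        if result != 0 or not legal:
            break
        history.append((tuple(board), player))
        board[rng.choice(legal)] = player  # stand-in for a search-chosen move
        player = -player
    # The final result is only known at the end, so label positions afterwards.
    return [(b, p, result * p) for b, p in history]


def generate_dataset(n_games, seed=0):
    """Generate training examples from the rules alone."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        data.extend(self_play_game(rng))
    return data


dataset = generate_dataset(100)
print(len(dataset), "training examples generated from the rules alone")
```

In a fuller system the random move choice would be replaced by the search, and each example would also record how the search distributed its attention over the moves.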

I will add that, of course, this is not the first time in AI that we have seen approaches that can learn without data. Genetic algorithms or even classical reinforcement learning can do that. The difference is that this is machine learning without data… and, it works, at scale!

...It is interesting to recall that about 10 years ago some were claiming that we did not need smart algorithms and math anymore: “All you need is data”, they said. While data is obviously valuable in many cases, this breakthrough clearly represents a complete change of direction. I am excited to see where it takes us...

...a lot of the ideas in the paper are actually far less complex than previous versions. At its heart lies the following beautifully simple mantra for learning:

Mentally play through possible future scenarios, giving priority to promising paths, whilst also considering how others are most likely to react to your actions and continuing to explore the unknown.

After reaching a state that is unfamiliar, evaluate how favourable you believe the position to be and cascade the score back through previous positions in the mental pathway that led to this point.

After you’ve finished thinking about future possibilities, take the action that you’ve explored the most.

At the end of the game, go back and evaluate where you misjudged the value of the future positions and update your understanding accordingly.

Doesn’t that sound a lot like how you learn to play games? When you play a bad move, it’s either because you misjudged the future value of resulting positions, or you misjudged the likelihood that your opponent would play a certain move, so didn’t think to explore that possibility. These are exactly the two aspects of gameplay that AlphaZero is trained to learn.
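The first three steps of the mantra can be sketched as a small tree search. Everything here is an illustrative assumption rather than the article's implementation: the game is a hypothetical pile of stones where each player removes 1 to 3 and whoever takes the last stone wins, `evaluate` is a stub that returns uniform move probabilities and a neutral score in place of a trained network, and the selection rule is a PUCT-style formula with an arbitrary constant `c_puct`. The fourth step (learning from finished games) is not shown.

```python
import math


class Node:
    """Statistics for one position reached during the search."""

    def __init__(self, prior):
        self.prior = prior      # how likely the evaluator thinks this move is
        self.visits = 0         # how many simulations passed through here
        self.value_sum = 0.0    # sum of scores, from the side to move here
        self.children = {}      # move -> Node

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]


def evaluate(stones):
    """Stub for the neural network: uniform priors and a neutral value."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.0


def select_child(node, c_puct=1.5):
    """Prefer promising and likely moves, with a bonus for rarely tried ones."""
    sqrt_total = math.sqrt(node.visits)

    def score(item):
        _, child = item
        q = -child.mean_value()  # child's value is from the opponent's side
        u = c_puct * child.prior * sqrt_total / (1 + child.visits)
        return q + u

    return max(node.children.items(), key=score)


def simulate(root, stones):
    """Run one mental play-through from the root position."""
    path, node = [root], root
    # Step 1: walk down the tree along the currently most attractive path.
    while node.children:
        move, node = select_child(node)
        stones -= move
        path.append(node)
    # Step 2: score the unfamiliar position at the end of the path.
    if stones == 0:
        value = -1.0  # the side to move has lost: the opponent took the last stone
    else:
        priors, value = evaluate(stones)
        node.children = {m: Node(p) for m, p in priors.items()}
    # Cascade the score back up, flipping the sign for alternating players.
    for visited in reversed(path):
        visited.visits += 1
        visited.value_sum += value
        value = -value


def best_move(stones, n_simulations=200):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        simulate(root, stones)
    # Step 3: play the move that was explored the most.
    return max(root.children, key=lambda m: root.children[m].visits)


print("chosen move with 5 stones:", best_move(5))
```

Choosing the most-visited move rather than the highest-scoring one follows the mantra directly: visit counts accumulate only where the search kept finding reasons to look again.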

...AlphaGo...evaluated the Go board and chose moves using a combination of two methods:

1. Performing “lookahead” search: looking ahead several moves by simulating games, and thus seeing which current move is most likely to lead to a “good” position in the future.

2. Evaluating positions based on an “intuition”, of whether a position is “good” or “bad” — that is, likely to lead to a win or a loss.

...“Monte Carlo Tree Search” or MCTS. At a high level, this method involves initially exploring many possible moves on the board, and then focusing this exploration over time as certain moves are found to be more likely to lead to wins than others...

...DeepMind’s major innovation with AlphaGo was to use deep neural networks to understand the state of the game, and then use this understanding to intelligently guide the search of the MCTS...

...These three tricks are what enabled AlphaGo Zero to achieve its incredible performance that blew away even AlphaGo:

2. Using one neural network — the “Two Headed Monster” that simultaneously learns both which moves “intelligent lookahead” would recommend and which moves are likely to lead to victory — instead of two separate neural networks.

3. Using a more cutting-edge neural network architecture: a “residual” architecture rather than a plain “convolutional” architecture.
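The last two points can be illustrated together with a deliberately tiny network written in plain Python. This is a sketch under several assumptions, not the architecture from the papers: fully connected layers stand in for convolutions, there is a single residual block, the weights are random and untrained, and all names (`TwoHeadedNet`, `residual_block`, `loss`) are made up for illustration. The loss follows the general form described in the AlphaGo Zero paper (squared error on the value plus cross-entropy on the policy), with the regularisation term left out.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights, bias):
    """Add the layer's output to its own input (the skip connection)."""
    fx = relu(dense(x, weights, bias))
    return relu([a + b for a, b in zip(x, fx)])


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoHeadedNet:
    """One shared body feeding a policy head and a value head."""

    def __init__(self, n_inputs, n_hidden, n_moves, seed=0):
        rng = random.Random(seed)
        self.stem = random_layer(rng, n_hidden, n_inputs)
        self.block = random_layer(rng, n_hidden, n_hidden)
        self.policy_head = random_layer(rng, n_moves, n_hidden)
        self.value_head = random_layer(rng, 1, n_hidden)

    def forward(self, position):
        h = relu(dense(position, *self.stem))
        h = residual_block(h, *self.block)                  # shared body
        policy = softmax(dense(h, *self.policy_head))       # move probabilities
        value = math.tanh(dense(h, *self.value_head)[0])    # score in [-1, 1]
        return policy, value


def loss(policy, value, search_probs, outcome):
    """Squared value error plus cross-entropy against the search's move choice."""
    value_error = (outcome - value) ** 2
    policy_error = -sum(t * math.log(p)
                        for t, p in zip(search_probs, policy) if t > 0)
    return value_error + policy_error


net = TwoHeadedNet(n_inputs=9, n_hidden=16, n_moves=9)
policy, value = net.forward([1.0, 0.0, -1.0] * 3)
print("policy sums to", round(sum(policy), 6), "and value is", round(value, 3))
```

Because both heads read the same hidden vector, anything the body learns about a position is available to the move recommendation and the win prediction at once, which is the motivation for using one network instead of two.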