The Monte-Carlo tree search uses the neural network fθ to guide its simulations
(see Figure 2). Each edge (s, a) in the search tree stores a prior probability
P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation
starts from the root state and iteratively selects moves that maximise an upper
confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)) [12, 24],
until a leaf node s′ is encountered. This leaf position is expanded and
evaluated just once by the neural network to generate both prior probabilities
and a value, (P(s′, ·), V(s′)) = fθ(s′).

Figure 1: Self-play reinforcement
learning in AlphaGo Zero.
a The program plays a game s1, ..., sT against itself. In each position st,
a Monte-Carlo tree search (MCTS) αθ is executed (see Figure 2) using the latest
neural network fθ. Moves are selected according to the search probabilities
computed by the MCTS, at ∼ πt. The
terminal position sT is scored according to the rules of the game to compute
the game winner z. b Neural network training in AlphaGo
Zero. The neural network takes the raw board position st as its input, passes it
through many convolutional layers with parameters θ, and outputs both a vector
pt, representing a probability distribution over moves, and a scalar value vt,
representing the probability of the current player winning in position st. The
neural network parameters θ are updated so as to maximise the similarity of the
policy vector pt to the search probabilities πt, and to minimise the error
between the predicted winner vt and the game winner z (see Equation 1). The new
parameters are used in the next iteration of self-play a.
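
The upper-confidence-bound selection rule from the opening paragraph is compact enough to sketch directly. Below is a minimal Python illustration of one in-tree selection step; the text gives U(s, a) only up to proportionality, so the exploration constant c_puct and the square-root total-visit factor follow the PUCT variant cited in [12, 24] and should be read as assumptions rather than the exact formula.

```python
import math

def select_move(edges, c_puct=1.0):
    """Pick the move maximising Q(s, a) + U(s, a) at one tree node.

    `edges` maps each legal action to a dict holding the stored edge
    statistics: P (prior probability), N (visit count), Q (mean action
    value). U follows the assumed PUCT form
    c_puct * P * sqrt(sum_b N(s, b)) / (1 + N(s, a)).
    """
    total_n = sum(e["N"] for e in edges.values())

    def ucb(item):
        _, e = item
        u = c_puct * e["P"] * math.sqrt(total_n) / (1 + e["N"])
        return e["Q"] + u

    best_action, _ = max(edges.items(), key=ucb)
    return best_action

# Example: a node with three candidate moves.
edges = {
    "d4":  {"P": 0.5, "N": 10, "Q": 0.1},
    "q16": {"P": 0.3, "N": 2,  "Q": 0.0},
    "k10": {"P": 0.2, "N": 0,  "Q": 0.0},
}
print(select_move(edges))  # "k10": unvisited moves with non-trivial priors get explored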

2. Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero.
Training started from completely random behaviour and continued without human
intervention for approximately 3 days. Over the course of training, 4.9 million
games of self-play were generated, using 1,600 simulations for each MCTS, which
corresponds to approximately 0.4s thinking time per move. Parameters were
updated from 700,000 mini-batches of 2,048 positions. The neural network
contained 20 residual blocks (see Methods for further details). Figure 3a shows
the performance of AlphaGo Zero during self-play reinforcement learning, as a
function of training time, on an Elo scale [25]. Learning progressed smoothly
throughout training, and did not suffer from the oscillations or catastrophic
forgetting suggested in prior literature [26–28].
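
Equation 1, referenced in the Figure 1 caption, drives these parameter updates: the network is trained to match its policy output to the search probabilities and its value output to the game winner. A minimal NumPy sketch of that combined objective, assuming the form l = (z − v)² − πᵀ log p + c‖θ‖² with an illustrative regularisation constant:

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Combined loss l = (z - v)^2 - pi . log(p) + c * ||theta||^2.

    p:     network move probabilities (vector summing to 1)
    v:     network value prediction, scalar in [-1, 1]
    pi:    MCTS search probabilities (the training target for p)
    z:     game outcome from the current player's perspective, -1 or +1
    theta: flat parameter vector (for L2 regularisation)
    c:     regularisation strength (value assumed for illustration)
    """
    value_term = (z - v) ** 2
    policy_term = -np.dot(pi, np.log(p + 1e-12))  # cross-entropy to search probs
    reg_term = c * np.dot(theta, theta)
    return value_term + policy_term + reg_term

# Toy example with a 3-move position.
p = np.array([0.7, 0.2, 0.1])
pi = np.array([0.6, 0.3, 0.1])
theta = np.zeros(4)
print(alphago_zero_loss(p, v=0.4, pi=pi, z=1.0, theta=theta))
```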

Figure 3: Empirical evaluation
of AlphaGo Zero. a
Performance of self-play reinforcement learning. The plot shows the performance
of each MCTS player αθi from each iteration i of reinforcement learning in
AlphaGo Zero. Elo ratings were computed from evaluation games between different
players, using 0.4 seconds of thinking time per move (see Methods). For
comparison, a similar player trained by supervised learning from human data,
using the KGS data-set, is also shown. b Prediction accuracy on human
professional moves. The plot shows the accuracy of the neural network fθi, at
each iteration of self-play i, in predicting human professional moves from the
GoKifu data-set. The accuracy measures the percentage of positions in which the
neural network assigns the highest probability to the human move. The accuracy
of a neural network trained by supervised learning is also shown. c
Mean-squared error (MSE) on human professional game outcomes. The plot shows
the MSE of the neural network fθi, at each iteration of self-play i, in
predicting the outcome of human professional games from the GoKifu data-set.
The MSE is between the actual outcome z ∈ {−1, +1} and the neural network
value v, scaled by a factor of 1/4 to the range [0, 1]. The MSE of a neural network trained by
supervised learning is also shown.
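
Both figures of merit in this caption pin down easily in code. Since z ∈ {−1, +1} and v ∈ [−1, 1], the squared error (z − v)² lies in [0, 4], so the 1/4 factor maps the MSE onto [0, 1]; move-prediction accuracy is simply top-1 agreement with the human move. A short illustrative sketch:

```python
import numpy as np

def scaled_mse(z, v):
    """MSE between outcomes z in {-1, +1} and values v in [-1, 1],
    scaled by 1/4 so the result lies in [0, 1]."""
    return np.mean((np.asarray(z) - np.asarray(v)) ** 2) / 4.0

def move_prediction_accuracy(probs, human_moves):
    """Fraction of positions where the network's highest-probability
    move coincides with the move the human professional played."""
    preds = np.argmax(probs, axis=1)
    return np.mean(preds == np.asarray(human_moves))

# Toy checks: perfect value predictions give MSE 0; opposite signs give 1.
print(scaled_mse([1, -1], [1.0, -1.0]))   # 0.0
print(scaled_mse([1, -1], [-1.0, 1.0]))   # 1.0
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(move_prediction_accuracy(probs, [0, 2]))  # 0.5: one of two matched
```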

Figure 4: Comparison of
neural network architectures in AlphaGo Zero and AlphaGo Lee.
Comparison of neural network architectures using either separate (“sep”) or
combined policy and value networks (“dual”), and using either convolutional
(“conv”) or residual networks (“res”). The combinations “dual-res” and
“sep-conv” correspond to the neural network architectures used in AlphaGo Zero
and AlphaGo Lee respectively. Each network was trained on a fixed data-set
generated by a previous run of AlphaGo Zero. a Each trained network was
combined with AlphaGo Zero’s search to obtain a different player. Elo ratings
were computed from evaluation games between these different players, using 5
seconds of thinking time per move. b Prediction accuracy on human professional
moves (from the GoKifu data-set) for each network architecture. c Mean-squared
error on human professional game outcomes (from the GoKifu data-set) for each
network architecture.
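
To make the "dual-res" combination concrete, here is a minimal PyTorch sketch of a residual tower with combined policy and value heads. The channel width, block count, and 17 input feature planes are illustrative placeholders, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: conv-BN-ReLU-conv-BN plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.b1(self.c1(x)))
        y = self.b2(self.c2(y))
        return self.relu(x + y)

class DualResNet(nn.Module):
    """Shared residual trunk with combined policy/value heads ("dual-res").
    Sizes here are illustrative, not the paper's configuration."""
    def __init__(self, in_planes=17, ch=64, blocks=4, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU())
        self.trunk = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.policy = nn.Sequential(
            nn.Conv2d(ch, 2, 1), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1))  # moves + pass
        self.value = nn.Sequential(
            nn.Conv2d(ch, 1, 1), nn.Flatten(),
            nn.Linear(board * board, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())  # scalar value in [-1, 1]

    def forward(self, x):
        h = self.trunk(self.stem(x))
        return self.policy(h), self.value(h)

net = DualResNet()
p_logits, v = net(torch.zeros(1, 17, 19, 19))
print(p_logits.shape, v.shape)  # torch.Size([1, 362]) torch.Size([1, 1])
```

The design point of "dual" is visible in the forward pass: both heads read the same trunk features h, so the policy and value objectives regularise a single shared representation.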

3. Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play
training process. This included fundamental elements of human Go knowledge, and
also non-standard strategies beyond the scope of traditional Go knowledge.
Figure 5 shows a timeline indicating when professional joseki (corner
sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately
AlphaGo Zero preferred new joseki variants that were previously unknown (Figure
5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show
several fast self-play games played at different stages of training. Tournament-length
games played at regular intervals throughout training are shown in
Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly
progressed from entirely random moves towards a sophisticated understanding of
Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko
(repeated board situations), yose (endgame), capturing races, sente
(initiative), shape, influence and territory, all discovered from first
principles. Surprisingly, shicho (“ladder” capture sequences that may span the
whole board) – one of the first elements of Go knowledge learned by humans –
was only understood by AlphaGo Zero much later in training.

4. Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance
of AlphaGo Zero using a larger neural network and over a longer duration.
Training again started from completely random behaviour and continued for
approximately 40 days. Over the course of training, 29 million games of
self-play were generated. Parameters were updated from 3.1 million mini-batches
of 2,048 positions each. The neural network contained 40 residual blocks. The
learning curve is shown in Figure 6a. Games played at regular intervals throughout
training are shown in Extended Data Figure 4 and Supplementary Information.

Figure 5: Go knowledge
learned by AlphaGo Zero.
a Five human joseki (common corner sequences) discovered during AlphaGo Zero
training. The associated timestamps indicate the first time each sequence
occurred (taking account of rotation and reflection) during self-play training.
Extended Data Figure 1 provides the frequency of occurrence over training for
each sequence. b Five joseki favoured at different stages of self-play
training. Each displayed corner sequence was played with the greatest
frequency, among all corner sequences, during an iteration of self-play
training. The timestamp of that iteration is indicated on the timeline. At 10
hours a weak corner move was preferred. At 47 hours the 3-3 invasion was most
frequently played. This joseki is also common in human professional play;
however AlphaGo Zero later discovered and preferred a new variation. Extended
Data Figure 2 provides the frequency of occurrence over time for all five
sequences and the new variation. c The first 80 moves of three self-play games
that were played at different stages of training, using 1,600 simulations
(around 0.4s) per search. At 3 hours, the game focuses greedily on capturing
stones, much like a human beginner. At 19 hours, the game exhibits the
fundamentals of life-and-death, influence and territory. At 70 hours, the game
is beautifully balanced, involving multiple battles and a complicated ko fight,
eventually resolving into a half-point win for white. See Supplementary
Information for the full games.

We
evaluated the fully trained AlphaGo Zero using an internal tournament against
AlphaGo Fan, AlphaGo Lee, and several previous Go programs. We also played
games against the strongest existing program, AlphaGo Master – a program based
on the algorithm and architecture presented in this paper but utilising human
data and features (see Methods) – which defeated the strongest human
professional players 60–0 in online games [34] in January 2017. In our
evaluation, all programs were allowed 5 seconds of thinking time per move;
AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs;
AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs respectively.
We also included a player based solely on the raw neural network of AlphaGo
Zero; this player simply selected the move with maximum probability. Figure
6b shows the performance of each program on an Elo scale. The raw neural
network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo
Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739
for AlphaGo Lee and 3,144 for AlphaGo Fan. Finally,
we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game
match with 2-hour time controls. AlphaGo Zero won by 89 games to 11 (see
Extended Data Figure 6 and Supplementary Information).

5. Conclusion

Our
results comprehensively demonstrate that a pure reinforcement learning approach
is fully feasible, even in the most challenging of domains: it is possible to
train to superhuman level, without human examples or guidance, given no
knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement
learning approach requires just a few more hours to train, and achieves much
better asymptotic performance, compared to training on human expert data. Using
this approach,
AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were
trained from human data using handcrafted features, by a large margin.

Figure 6: Performance of
AlphaGo Zero. a
Learning curve for AlphaGo Zero using a larger 40-block residual network over 40
days. The plot shows the performance of each player αθi from each iteration i
of our reinforcement learning algorithm. Elo ratings were computed from
evaluation games between different players, using 0.4 seconds per search (see
Methods). b Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40
days using a 40-block residual neural network. The plot shows the results of a
tournament between: AlphaGo Zero, AlphaGo Master (defeated top human
professionals 60-0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo
Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and
GnuGo. Each program was given 5 seconds of thinking time per move. AlphaGo Zero
and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan
and AlphaGo Lee were distributed over many machines. The raw neural network
from AlphaGo Zero is also included, which directly selects the move a with
maximum probability pa, without using MCTS. Programs were evaluated on an Elo
scale [25]: a 200-point gap corresponds to a 75% probability of winning.
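
The closing Elo claim can be checked directly: under the standard logistic Elo model, the expected score for a player rated d points higher is 1/(1 + 10^(−d/400)), which for d = 200 gives about 0.76, in line with the quoted 75%. A small sketch (the model form is the conventional one; the paper's own calibration may differ slightly):

```python
def elo_win_prob(d):
    """Expected score for the higher-rated player under the standard
    logistic Elo model, given a rating gap of d points."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

print(elo_win_prob(200))          # ~0.76, the "200 points ~ 75%" rule of thumb
print(elo_win_prob(5185 - 4858))  # AlphaGo Zero vs AlphaGo Master rating gap
```

On this model, the 327-point gap between AlphaGo Zero (5,185) and AlphaGo Master (4,858) corresponds to an expected score of roughly 0.87, broadly consistent with the 89–11 head-to-head result reported above, albeit under different time controls.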

Humankind
has accumulated Go knowledge from millions of games played over thousands of
years, collectively distilled into patterns, proverbs and books. In the space
of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much
of this Go knowledge, as well as novel strategies that provide new insights
into the oldest of games.