AlphaGo

Is an AI developed by Google DeepMind that recently became the first machine to beat a top-level human Go player.

AlphaToe

Is an attempt to apply the same techniques used in AlphaGo to Tic-Tac-Toe. Why? I hear you ask. Tic-tac-toe is a very simple game and can be solved using basic minimax.

Because it’s a good platform to experiment with some of the AlphaGo techniques, which, it turns out, work at this scale. Also, the neural networks involved can be trained on my laptop in under an hour, as opposed to the weeks on an array of supercomputers that AlphaGo required.

The project is written in Python using TensorFlow; the GitHub repo is here https://github.com/DanielSlater/AlphaToe and contains code for each step that AlphaGo used in its learning. It also contains code for Connect 4 and the ability to build games of Tic-Tac-Toe on larger boards.

Here is a sneak peek at how it did in the 3×3 game. In this graph it is training as first player and gets to an 85% win rate against a random opponent after 300,000 games.

I will do a longer write-up of this at some point, but in the meantime here is a talk I did about AlphaToe at a recent DataScienceFestival event in London, which gives a broad overview of the project:

When training neural networks there are two things that combine to make life frustrating:

Neural networks can take an insane amount of time to train.

How well a network is able to learn can be hugely affected by the choice of hyperparameters (here this mainly means the number of layers and the number of nodes per layer, but can also include learning rate, activation functions, etc.), and without training a network in full you can only guess at which choices are better.

My current research is around ways to get neural networks to adjust their size automatically, so that if there isn’t sufficient capacity in a network it will in some way determine this and resize itself. So far my success has been (very) limited, but while working on that I thought I would share this paper: Net2Net: Accelerating Learning via Knowledge Transfer, which has a good, simple approach to resizing networks manually while keeping their activations unchanged.

Being able to manually resize a trained network can give big savings in training time, because when searching through hyperparameter options you can start off with a small, partially trained network and see how adding extra hidden nodes or layers affects test results.

This creates the weights and biases for a layer one node wider than the existing one. To increase the size by more nodes, simply do this multiple times (note the finished library on GitHub has the parameter new_layer_size to set exactly how big you want it). The new node is a clone of a random node from the same layer. The original node and its copy then have their outputs to the next layer halved, so that the overall output from the network is unchanged.

How Net2WiderNet extends a layer with two hidden nodes to have three

Unfortunately, if two nodes in the same layer have exactly the same parameters then their activations will always be identical, which means their back-propagated errors will always be identical, so they will update in the same way and their activations will stay the same: you gain nothing by adding the new node. To stop this happening, a small amount of noise is injected into the new node, which gives the pair the potential to move further and further apart as they train.
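The widening step can be sketched in a few lines of NumPy. This is an illustrative sketch, with my own function name and array shapes, not the AlphaToe library’s actual TensorFlow code:

```python
import numpy as np

def net2wider(w_in, b, w_out, noise_std=0.01, rng=None):
    """Widen a hidden layer by one node, Net2WiderNet-style.

    w_in:  (n_inputs, n_hidden) weights into the layer
    b:     (n_hidden,) biases of the layer
    w_out: (n_hidden, n_outputs) weights out of the layer
    """
    rng = rng if rng is not None else np.random.default_rng()
    idx = int(rng.integers(w_in.shape[1]))   # random node to clone

    # clone the chosen node's incoming weights and bias
    new_w_in = np.hstack([w_in, w_in[:, idx:idx + 1]])
    new_b = np.append(b, b[idx])

    # halve the outgoing weights of the original node and its copy so
    # the input to the next layer is unchanged
    new_w_out = np.vstack([w_out, w_out[idx:idx + 1, :]])
    new_w_out[idx, :] /= 2.0
    new_w_out[-1, :] /= 2.0

    # a little noise so the two twins can drift apart during training
    new_w_out[-1, :] += rng.normal(0.0, noise_std, size=w_out.shape[1])
    return new_w_in, new_b, new_w_out
```

Applying this repeatedly widens the layer one node at a time; with zero noise the network’s outputs are exactly preserved, whatever the activation function, since the clone produces the same activation as the original.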

Net2DeeperNet is quite simple: it creates an identity layer, then adds a small amount of noise. This means the network’s activation is only unchanged if the new layer is linear, because otherwise the activation function’s non-linearity will alter the output. So bear in mind that if you have an activation function on your new layer (and you almost certainly will), the network’s output will change and performance will be worse until it has gone through some amount of training.
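Roughly, in NumPy terms, the construction looks like this (a sketch with illustrative names, not the library’s code):

```python
import numpy as np

def net2deeper(width, noise_std=0.01, rng=None):
    """Create an identity layer of the given width, Net2DeeperNet-style.

    Returns (weights, biases) for a new layer that initially passes its
    input through unchanged, plus a little noise to break symmetry.
    As noted above, with a non-linear activation on the new layer the
    network's output is only approximately preserved.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w = np.eye(width) + rng.normal(0.0, noise_std, size=(width, width))
    b = np.zeros(width)
    return w, b
```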

Usage in TensorFlow

This technique could be used in any neural network library/framework, but here is how you might use it in TensorFlow.

In this example we first train a minimal network, with 100 hidden nodes in each of the first and second layers, for 75 epochs. Then we do a grid search over different numbers of hidden nodes, training each candidate for 50 epochs, to see which leads to the best test accuracy.
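The overall shape of that search could look something like the following. This is a framework-agnostic NumPy sketch of the warm-started grid search, with made-up sizes and placeholder training, not the post’s actual TensorFlow code:

```python
import numpy as np

def widen_once(w_in, b, w_out, rng):
    """Add one hidden node by cloning a random one (Net2WiderNet)."""
    idx = int(rng.integers(w_in.shape[1]))
    w_in = np.hstack([w_in, w_in[:, idx:idx + 1]])
    b = np.append(b, b[idx])
    w_out = np.vstack([w_out, w_out[idx:idx + 1, :]])
    w_out[idx, :] /= 2.0    # halve original and copy so output is kept
    w_out[-1, :] /= 2.0
    return w_in, b, w_out

rng = np.random.default_rng(0)
# stand-in for the small network after its initial 75 epochs of training
w_in, b, w_out = (rng.normal(size=(64, 100)),
                  np.zeros(100),
                  rng.normal(size=(100, 10)))

results = {}
for target in (150, 200, 300):           # grid of candidate layer widths
    w1, b1, w2 = w_in.copy(), b.copy(), w_out.copy()
    while w1.shape[1] < target:          # grow the warm-started network
        w1, b1, w2 = widen_once(w1, b1, w2, rng)
    # ...here each candidate would train for 50 more epochs and record
    # its test accuracy; we just record the resulting shapes...
    results[target] = (w1.shape, w2.shape)
```

Because every candidate starts from the same partially trained weights, each 50-epoch run begins from a sensible point instead of from scratch.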

Last week I gave a talk at PyDataLondon 2016, hosted at the Bloomberg offices in central London. If you don’t know anything about PyData, it is a community of Python data science enthusiasts that runs various meetups and conferences across the world. If you’re interested in that sort of thing and they are running something near you, I would highly recommend checking it out.

Below is the YouTube video for my talk, and this is the associated GitHub repo, which includes all the example code.

The complete collection of talks from the conference is here. The standard across the board was very high, but if you only have time to watch a few, here are two of those I saw that you might find interesting.

Bayesian statistics is a fascinating subject with many applications. If you’re trying to understand deep learning, at a certain point research papers such as Auto-Encoding Variational Bayes and Auxiliary Deep Generative Models will stop making any kind of sense unless you have a good understanding of Bayesian statistics (and even if you do, it can still be a struggle). This video works as a good introduction to the subject. His blog is also quite good.

This has a good overview of useful techniques, mostly around computer vision (though they could be applied in other areas), such as computing the saliency of inputs in determining a classification and getting good classifications when there is only limited labelled data.

I’m going to be giving a talk/tutorial at PyDataLondon 2016 on Friday the 6th of May. If you’re in London that weekend I would recommend going; there are going to be lots of interesting talks, and if you do go, please say hi.

My talk is going to be a hands-on session on how to build a Pong-playing AI using Q-learning, step by step. Unfortunately, training the agents even for very simple games still takes ages, and I really wanted to have something training while I do the talk, so I’ve built two little games that I hope should train a bit faster.
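At the heart of the approach is the Q-learning update rule. A minimal tabular sketch of that rule (illustrative only; the talk itself uses a neural network over pixels rather than a table):

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# toy usage: paddle actions in a Pong-like game
q = {}
actions = ("up", "down", "stay")
q_update(q, "s0", "up", 1.0, "s1", actions)
```

In deep Q-learning the table is replaced by a network that maps a (downsized) screen image to one Q-value per action, trained towards the same target.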

This is a version of Pong with some of the visual noise stripped out: no on-screen score, no lines around the board. Also, when you start it you can pass args for the screen width and height, and the gameplay should scale with these. This means you can run it with an 80×80 screen (or even 40×40) and save yourself having to downsize the image when processing.

This is an even kinder game than Pong. There is only the player’s paddle, and you get points just for hitting the other side of the screen. I’ve found that if you fiddle with the parameters you can start to see reasonable performance within an hour of training (results may vary, massively). That said, even after significant training, the kinds of results I see are some way off how well Google DeepMind report doing. Possibly they are using other tricks not reported in the paper, or just lots of hyperparameter tuning, or there are still more bugs in my implementation (entirely possible; if anyone finds any, please submit a fix).

I’ve also checked in some checkpoints of a trained Half Pong player, in case anyone just wants to quickly see it running. Simply run this from the examples directory.