Of course, to play Minecraft well you need to balance local activities - building, mining etc. - with exploration. Another frontier, beyond AlphaGo, is exploration. Monte-Carlo Tree Search (as used in AlphaGo) explores in more limited ways than humans do, argues John Langford [5].

This excellent paper on robotic grasping also caught our attention [7]. A key challenge in this area is adaptability to slightly varying circumstances, such as variations in the objects being grasped and their pose relative to the arm. General solutions to these problems will suddenly make robots far more flexible and applicable to a greater range of tasks.

Last week I also rediscovered this older paper on Hierarchical-Quilted Self-Organizing Maps (HQSOMs) [8]. This is close to our hearts because we originally believed this type of representation was the right approach for AGI. With the success of Deep Convolutional Networks (DCNs) it’s worth looking back and noticing the similarities between the two. While HQSOM is purely unsupervised (a plus; see the comment from Yann LeCun above), DCNs are trained by supervised techniques. However, both methods use small, overlapping, independent units - analogous to biological cortical columns - to classify different patches of the input. The overlapping and independent classifiers lead to robust and distributed representations, which is probably the reason these methods work so well.

We at Project AGI believe that a grid-like “region” of columns employing a “Winner-Take-All” policy [10], with overlapping input receptive fields, can produce a distributed representation. Different regions are then connected together into a tree-like structure (acyclic). The result is a hierarchy. Not only does this resemble the state-of-the-art methods of DCNs, but there’s a lot of biological evidence for this type of representation too. This paper by Rinkus [11] describes columnar features arranged into a hierarchy, with winner-take-all behaviour implemented via local inhibition.

Rinkus says: “Saying only that a group of L2/3 units forms a WTA CM places no a priori constraints on what their tuning functions or receptive fields should look like. This is what gives that functionality a chance of being truly generic, i.e., of applying across all areas and species, regardless of the observed tuning profiles of closely neighboring units.”
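
To make the idea of a region concrete, here is a minimal sketch of a grid of columns with overlapping receptive fields and a winner-take-all policy. It is not our implementation or the one in [8] or [11]: the function name `wta_region`, the patch size, the stride and the nearest-prototype rule for choosing a winner are all assumptions made for illustration.

```python
import numpy as np

def wta_region(image, weights, field=3, stride=2):
    """Run one winner-take-all (WTA) competition per column.

    image:   2-D input array.
    weights: hypothetical prototypes with shape
             (rows, cols, cells, field * field); each column has its own
             independent set of cells.
    Returns a (rows, cols) grid holding the index of the winning cell.
    """
    rows, cols, cells, _ = weights.shape
    winners = np.zeros((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            top, left = r * stride, c * stride
            # stride < field, so neighbouring columns see overlapping patches
            patch = image[top:top + field, left:left + field].ravel()
            # the winner is the cell whose prototype is closest to the patch
            distances = np.linalg.norm(weights[r, c] - patch, axis=1)
            winners[r, c] = int(np.argmin(distances))
    return winners

rng = np.random.default_rng(0)
image = rng.random((7, 7))
# a 7x7 input with 3x3 fields and stride 2 gives a 3x3 grid of columns
weights = rng.random((3, 3, 4, 9))
winners = wta_region(image, weights)
print(winners)
```

Each column reports only one winning cell, but the grid of winners taken together is a distributed code for the whole input. Feeding that grid into another region as its input is one way to build the hierarchy.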

Reinforcement Learning

But unsupervised learning can’t be the only form of learning. We also need to consider consequences, and so we need reinforcement learning to take account of these. As Yann said, it is the “cherry on the cake” (this probably understates the difficulty of the RL component, but right now it seems easier than creating representations).

But regular readers of this blog will remember that we’re obsessed with unfolding or inverting abstract plans into concrete actions. We found a great paper by Manita et al [13] that shows biological evidence for the translation and propagation of an abstract concept into sensory and motor areas, where it can assist with perception. This is the hierarchy in action.

Long Short-Term Memory (LSTM)

One more tack before we finish. Thanks to Jay for this link to NVIDIA’s description of LSTMs [14], an architecture for recurrent neural networks (i.e. the state can depend on the previous state of the cells). It’s a good introduction, but we’re still fans of Monner’s Generalized LSTM [15].
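
To show what "the state can depend on the previous state" means in practice, here is a minimal sketch of one step of a standard LSTM cell in NumPy. It is the common textbook formulation, not Monner's Generalized LSTM and not code from the NVIDIA article; the function name `lstm_step` and the layout of the weight matrix `W` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """Advance a standard LSTM cell by one time step.

    W has shape (4 * hidden, hidden + inputs) and b has shape (4 * hidden,).
    The four row blocks are the input, forget and output gates and the
    candidate values (an assumed layout).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:hidden])                 # input gate
    f = sigmoid(z[hidden:2 * hidden])        # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate
    g = np.tanh(z[3 * hidden:])              # candidate cell values
    c = f * c_prev + i * g                   # new state depends on the old state
    h = o * np.tanh(c)                       # output seen by the rest of the network
    return h, c

rng = np.random.default_rng(0)
inputs, hidden = 3, 5
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(6, inputs)):       # a sequence of six input vectors
    h, c = lstm_step(x, h, c, W, b)
```

The line that computes `c` is the recurrence: the forget gate decides how much of the previous cell state survives, which is what lets the cell hold information over many steps.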

Fun thoughts

Now let’s end with something fun. Wired magazine again, describing watching AlphaGo as our first taste of a superhuman intelligence [16]. Although this is a “narrow” intelligence, not a general one, it has qualities beyond anything we’ve experienced in this domain before. What’s more, watching these machines can make us humans better, without any nasty bio-engineering:

“But as hard as it was for Fan Hui to lose back in October and have the loss reported across the globe—and as hard as it has been to watch Lee Sedol’s struggles—his primary emotion isn’t sadness. As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. “So beautiful,” he says. “So beautiful.”

Why is Go hard?

Go is hard because the search-space of possible moves is so large that tree search and pruning techniques, such as those used to beat humans at Chess, won't work - or at least, they won't work well enough, with a feasible amount of memory, to play Go better than the best humans.
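
A quick back-of-envelope calculation shows the scale of the problem. A game tree with branching factor b and depth d contains roughly b^d move sequences. The figures below are the approximate values quoted in the AlphaGo Nature paper (b ≈ 35, d ≈ 80 for Chess; b ≈ 250, d ≈ 150 for Go); the helper name `log10_tree_size` is ours.

```python
import math

def log10_tree_size(branching, depth):
    """Return log10(branching ** depth), the rough number of move sequences."""
    return depth * math.log10(branching)

chess = log10_tree_size(35, 80)    # approximate figures for Chess
go = log10_tree_size(250, 150)     # approximate figures for Go

print(f"Chess: about 10^{chess:.0f} move sequences")
print(f"Go:    about 10^{go:.0f} move sequences")
print(f"Go's tree is about 10^{go - chess:.0f} times larger")
```

With these estimates Chess comes out at roughly 10^124 sequences and Go at roughly 10^360, so no realistic increase in memory or speed closes the gap by brute force.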

Instead, to play Go well, you need to have "intuition" rather than brute search power: to look at the board and spot local (or gross) patterns that represent opportunities or dangers. And in fact, AlphaGo is able to play in this way. It beat the next best computer algorithm "Pachi" 85% of the time without any tree search - just predicting the best action based on its interpretation of the current state. The authors of the AlphaGo Nature paper say: “During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play.”

How does AlphaGo work?

AlphaGo is trained by both supervised and reinforcement learning. Supervised learning feedback comes from recordings of moves in expert games. However, these recordings are finite in size and, used naively, would lead to overfitting.

Instead, in AlphaGo a Supervised Learning deep neural network learns to model and predict expert behaviour in the recorded games, via conventional deep learning techniques. Then, a reinforcement learning network is used to generate reward data for novel games that AlphaGo plays against itself! This mitigates the limited size of the supervised learning dataset.

Of course, AlphaGo also needs to play better than the best play observed in the training data. To achieve this, the reinforcement learning network is further trained by playing pairs of networks against each other - mixing the pairs up to prevent the policies overfitting to each other. This is a really clever feature because it allows AlphaGo to go beyond its training data.
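
The self-play idea can be sketched on a toy problem. The code below is not AlphaGo: it swaps Go for rock-paper-scissors and the deep network for a three-number softmax policy, and it uses a plain REINFORCE update. The names `self_play`, `PAYOFF` and all the hyperparameters are assumptions. What it keeps is the structure described above, where the current policy plays a randomly chosen earlier snapshot of itself so that it does not overfit to a single opponent.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# PAYOFF[a, b] is the reward for playing a against b
# (0 = rock, 1 = paper, 2 = scissors)
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])

def self_play(iterations=200, games=20, lr=0.1, snapshot_every=20, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros(3)              # the current policy
    pool = [logits.copy()]            # earlier versions of the policy
    for it in range(1, iterations + 1):
        # pick a random earlier version as the opponent
        opponent = pool[rng.integers(len(pool))]
        p, q = softmax(logits), softmax(opponent)
        grad = np.zeros(3)
        for _ in range(games):
            a = rng.choice(3, p=p)
            b = rng.choice(3, p=q)
            reward = PAYOFF[a, b]
            # REINFORCE: reward times the gradient of log p[a] w.r.t. logits
            g = -p.copy()
            g[a] += 1.0
            grad += reward * g
        logits += lr * grad / games
        if it % snapshot_every == 0:
            pool.append(logits.copy())   # add the current policy to the pool
    return logits, pool

logits, pool = self_play()
print(softmax(logits), len(pool))
```

Sampling opponents from the whole pool, rather than always playing the latest version, is the "mixing the pairs up" step: a policy that only beats its most recent self can end up chasing its own tail.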

Note also that the neural networks cannot possibly fully represent a sufficiently deep tree of board outcomes within their limited set of weights. Instead, the network has to learn to represent good and bad situations with limited resources. It has to form its own representation of the most salient features, during training.

The neural networks function without pre-defined rules specific to Go; instead they have learned from training data collected from many thousands of human and simulated games.

Key advances

AlphaGo is an important advance because it is able to make good judgments about play situations based on a lossy interpretation in a finitely-sized deep neural network.

What’s more, AlphaGo wasn’t simply taught to copy human experts - it went further, and improved, by playing against itself.

So, what doesn't it do?

The techniques used in deep neural networks have recently been scaled to work effectively on a wide range of problems. In some subject areas, narrow AIs are reaching superhuman performance. However, it is not clear that these techniques will scale indefinitely. Problems such as vanishing gradients have been pushed back, but not necessarily eliminated.

Much greater scale is needed to get intelligent agents into the real world without them being immediately smashed by cars or stuck in holes. But already, it is time to consider what features or characteristics constitute an artificial general intelligence (AGI), beyond raw intelligence (which AIs now have).

AlphaGo isn't a general intelligence; it's designed specifically to play Go. Sure, it's trained rather than programmed manually, but it was designed for this purpose. The same techniques are likely to generalize to many other problems, but they'll need to be applied thoughtfully and retrained.

AlphaGo isn't an Agent. It doesn't have any sense of self, or intent, and its behaviour is pretty static - its policies would probably work the same way in all similar situations, learning only very slowly. You could say that it doesn't have moods, or other transient biases. Maybe this is a good thing! But this also limits its ability to respond to dynamic situations.

AlphaGo doesn't have any desire to explore, to seek novelty or to try different things. AlphaGo couldn't ever choose to teach itself to play Go because it found it interesting. On the other hand, AlphaGo did teach itself to play Go…

All in all, it's a very exciting time to study artificial intelligence!