Computers used to teach other computers to play Pac-Man, StarCraft

Even incompatible algorithms can share advice, boosting the learning curve.

Computer scientists have developed a number of supervised learning algorithms that give people the chance to help the computer generate the appropriate output for a given task, like determining whether there's a face in a photo. But there's no reason that this process has to be limited to accepting input provided by humans; a well-trained computer should work just as well.

Researchers have now demonstrated that computers can successfully help each other learn an unfamiliar task even when they're using different algorithms. And, to add to the recursion, the tasks they chose for the demonstration were computer games: Pac-Man and StarCraft.

Matthew Taylor, the lead author of the paper describing the work, admitted that "testing algorithms in flashier domains [meaning popular games] is generally more exciting to read about." But games also share features with real-world problems, specifically what Taylor called "sequential decision-making tasks." He pointed to driving, where people constantly make small decisions based on their surroundings and then learn from the consequences, gradually building up these learning experiences into an efficient commute. Many games work in a similar manner. Plus "it's easy to get people to play them," Taylor told Ars, which makes for a useful point of comparison.
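Sequential decision-making tasks like these are usually framed as a loop: an agent observes a state, picks an action, and learns from the reward that follows, step after step. Here's a minimal sketch of that loop on a toy task (all names and numbers here are illustrative, not from the paper):

```python
# Minimal sequential decision-making loop: the agent observes a state,
# picks an action, and collects a reward at every step.
# Toy task: walk along a line of 5 cells to reach a goal on the right.

def run_episode(policy, length=5):
    state, total_reward, history = 0, 0, []
    while state < length:
        action = policy(state)                  # -1 = left, +1 = right
        state = max(0, state + action)
        reward = 10 if state == length else -1  # small cost per step
        total_reward += reward
        history.append((state, reward))
    return total_reward, history

# A "commuter" policy that always moves toward the goal.
total, _ = run_episode(lambda s: 1)
print(total)  # five moves at -1 each, offset by +10 at the goal: 6
```

Driving to work, dodging ghosts, or sniping a Zerg all fit this shape; only the states, actions, and rewards change.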

Computers vs. computer games

For these studies, Taylor's team chose two games that fit the sequential decision-making system. The first is Pac-Man, where each intersection presents a choice that a user makes based on factors like the position of the ghosts and the location of any uneaten dots. The second was a simplified form of StarCraft, with the computer playing a human sniper facing a Zerg on a square course with some cover.

For StarCraft, the game was set up so that, using cover and firing from a distance, the human could take out the Zerg before it was killed. To keep the human from just staying in hiding, the player's score went down over time—the longer it took for the game to be completed, the worse the score. The algorithm evaluated a total of six factors, like distance and relative health, before choosing its next actions.
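The scoring scheme can be sketched as a reward that decays with time, so hiding forever is a losing strategy while a quick kill pays off. This is an illustrative reconstruction, not the paper's exact formula, and the constants are made up:

```python
# Sketch of the StarCraft-style score: a per-step penalty discourages
# the sniper from staying in cover, while killing the Zerg pays a bonus.
# The penalty and bonus values here are illustrative assumptions.

def score(episode_length, zerg_killed, step_penalty=1, kill_bonus=100):
    base = kill_bonus if zerg_killed else 0
    return base - step_penalty * episode_length

print(score(20, True))    # quick kill: 80
print(score(90, True))    # slow kill: 10
print(score(50, False))   # died or timed out while hiding: -50
```

Under a reward like this, the early suicidal rushes make a certain sense: dying fast at least stops the score from bleeding away.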

When self-teaching, the computers would start off avoiding the Zerg. They'd then end up doing a sort of inverse-Zerg-rush, quickly getting themselves killed before their entire score evaporated. Finally, after about 200 training runs, the computers started to get the idea of sniping and began killing the Zerg successfully, though performance was still pretty erratic.

The goal of the Pac-Man exercise was simply a high score. To get there, the computers were given a set of 16 features that described the position of key items relative to the player, like the nearest uneaten dot. A second algorithm relied on only seven features, but these were complex values derived from a working knowledge of the game. As a result, the smaller feature set actually led to a higher score, since it was more efficient at hunting down ghosts when they were vulnerable.
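The difference between the two setups is essentially how the game state gets turned into features: one set is a larger list of raw relative positions, the other a smaller list of quantities derived from knowledge of the game. A rough sketch of the contrast, with entirely illustrative feature choices:

```python
# Two ways to featurize a Pac-Man state, loosely mirroring the paper's
# setup: many raw relative positions vs. a few hand-derived quantities.
# The specific features and names here are assumptions for illustration.

from math import dist

def raw_features(player, dots, ghosts):
    # Larger set: relative positions of the nearest item of each kind.
    feats = []
    for items in (dots, ghosts):
        nearest = min(items, key=lambda p: dist(player, p))
        feats.extend((nearest[0] - player[0], nearest[1] - player[1]))
    return feats

def derived_features(player, dots, ghosts, ghosts_vulnerable):
    # Smaller set: distances plus game knowledge (is it safe to hunt?).
    d_dot = min(dist(player, p) for p in dots)
    d_ghost = min(dist(player, p) for p in ghosts)
    return [d_dot, d_ghost, 1.0 if ghosts_vulnerable else 0.0]

player = (5, 5)
print(raw_features(player, [(5, 7), (9, 9)], [(2, 5)]))   # [0, 2, -3, 0]
print(derived_features(player, [(5, 7)], [(2, 5)], True))  # [2.0, 3.0, 1.0]
```

Fewer but smarter features give the learner less to sort through, which is why the compact set could outscore the raw one.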

Teaching the noobs

Well-trained computer algorithms were then given the chance to tutor some noobs using a variety of teaching styles, all of them put on a strict budget of interventions (mimicking the real world, where you can only offer advice so many times before students get frustrated and walk away). One approach was to give all the advice early, when the student is just beginning to understand the task. A second intervened every few steps, spreading the advice over a longer stretch of training.

A third approach involved the teaching computer only giving advice when it rated the player's situation as being relatively important. Another chimed in whenever the player made a mistake. Finally, the most sophisticated algorithm tried to recapitulate the inner state of its student, guessing when the student was going to make a mistake instead of requiring the student to announce its intended move before a correction could be issued.
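The first four strategies can be sketched as different rules for when to spend one intervention out of a fixed budget. The thresholds and names below are illustrative assumptions, not values from the paper:

```python
# Sketch of advice-budget teaching: each strategy decides when to spend
# one of a fixed number of interventions. Thresholds are illustrative.

class Teacher:
    def __init__(self, budget, strategy):
        self.budget = budget
        self.strategy = strategy

    def maybe_advise(self, step, importance, student_action, best_action):
        if self.budget <= 0:
            return None                               # out of interventions
        advise = {
            "early":      step < 25,                  # front-load all advice
            "spread":     step % 10 == 0,             # every few steps
            "importance": importance > 0.8,           # key moments only
            "mistake":    student_action != best_action,  # correct errors
        }[self.strategy]
        if advise:
            self.budget -= 1
            return best_action
        return None

t = Teacher(budget=2, strategy="mistake")
print(t.maybe_advise(0, 0.5, "left", "right"))   # corrects: 'right'
print(t.maybe_advise(1, 0.5, "right", "right"))  # no mistake: None
print(t.maybe_advise(2, 0.5, "up", "down"))      # corrects: 'down'
print(t.maybe_advise(3, 0.5, "up", "down"))      # budget spent: None
```

The fifth, predictive strategy would instead maintain a model of the student and step in before the mistake happens, which takes more than a one-line rule.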

For roughly the first 35 iterations of StarCraft, there was no real difference between any of the supervised computers and the ones that were figuring it out on their own. But as the self-taught algorithms got hung up on their suicidal approach to ending the game quickly, the algorithms that were being supervised by a teacher started separating from the pack. The clear winner was the mistake-correcting approach, which reached maximum performance after only 100 iterations.

For Pac-Man, an untrained algorithm using the large feature set took about 400 iterations to start generating high scores. The worst of the taught algorithms hit that point after only 200 iterations; the best teaching styles, mistake correction and predictive intervention, managed this in less than 100 tries.

For the algorithms that were given the more focused feature set, it took about 600 iterations for anybody to start hitting high scores. As a result, the teacher that gave all its advice early ended up being nearly useless. And, in this instance, correcting actual mistakes worked much better than trying to predict when a player was likely to screw up.

Mixing algorithms

When both are programs, do students and teachers have to be compatible? Apparently not. When different algorithms were used for the students and teachers, every one of them did better than self-taught algorithms. Impressively, the mistake-predicting algorithm was one of the most effective teachers, even though it was predicting the goofs that would be made by a completely different piece of software.

In fact, the work showed that the students could benefit from, then outdo, their teachers. The researchers set up a situation where the low-scoring Pac-Man software trained the one that used the smaller, more complex feature set. Both the mistake-correcting and predictive approaches to teaching helped the students, even as the students' scores started surpassing anything the teachers could ever accomplish.

The authors also found that different approaches to teaching worked best with different timing. While advising based on the rated importance of a step was most efficient within the first 20 iterations, correcting mistakes was best spread across 100 iterations. So by mixing approaches and tailoring advice based on where things are in the learning process, it may be possible to generate an even more effective teaching algorithm.
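That timing result suggests a simple mixed schedule: importance-based advice during the first stretch of training, mistake correction afterward. A sketch, with cutoffs that are illustrative assumptions rather than values from the paper:

```python
# Sketch of a mixed teaching schedule implied by the timing result:
# importance-based advice early, mistake correction later.
# The cutoff values are illustrative, not taken from the paper.

def pick_strategy(iteration, early_cutoff=20, late_cutoff=100):
    if iteration < early_cutoff:
        return "importance"   # most efficient in the first ~20 runs
    if iteration < late_cutoff:
        return "mistake"      # works best spread across ~100 runs
    return None               # student is on its own afterwards

print([pick_strategy(i) for i in (5, 50, 150)])
# ['importance', 'mistake', None]
```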

Next up, Skynet?

If computers can teach each other to effectively play a game, how far are we from them teaching each other to play global thermonuclear war? "This work is specific to sequential decision-making tasks," Taylor told Ars. "Other general machine learning methods (i.e., supervised, unsupervised, or semi-supervised methods) that aren't used to solve sequential tasks would not benefit from these techniques." So, it's a solution to a somewhat narrow problem set.

This can still be important, though. Taylor pointed out that accelerated learning can be critical in cases where the lifetime of the learning process is limited, perhaps by battery life, or by wear and tear in the case of robots.

He also said that, in many cases, these learning algorithms get hung up in a local minimum, where they find a good solution to a problem and can't find their way to an optimal one. "You can think of learning as a stochastic process," Taylor told Ars. "You don't always reach the same end state, and providing advice can bias the process towards a better outcome."