Singularity Skepticism 3: How to Measure AI Performance

On Thursday I wrote about progress in computer chess, and how a graph of Elo rating (which I called the natural measure of playing skill) versus time showed remarkably consistent linear improvement over several decades. I used this to argue that sometimes exponential improvements in the inputs to AI systems (computer speed and algorithms) lead to less-than-exponential improvement in AI performance.

Readers had various objections to this. Some said that linear improvements in Elo rating should really be seen as exponential improvements in quality; and some said that the arrival of the new AI program AlphaZero (which did not appear in my graph and was not discussed in my post) is a game-changer that invalidates my argument. I’ll address those objections in this post.

First, let’s talk about how we measure AI performance. For chess, I used Elo rating, which is defined so that if Player A has a rating 100 points higher than Player B, we should expect A to collect 64% of the points when playing B. (Winning a game is one point, a drawn game is half a point for each player, and losing gets you zero points.)
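That 64% figure comes from the standard Elo win-expectancy formula; a minimal sketch, assuming the usual base-10, 400-point scaling:

```python
def expected_score(rating_a, rating_b):
    """Expected fraction of points A collects against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point rating advantage predicts roughly 64% of the points.
print(round(expected_score(1600, 1500), 2))  # 0.64
```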

There is an alternative rating system, which I’ll call ExpElo, which turns out to be equivalent to Elo in its predictions. Your ExpElo rating is determined by exponentiating your Elo rating. Where Elo uses the difference of two players’ ratings to predict win percentage, ExpElo uses the ratio of the ratings. Both Elo and ExpElo are equally compelling from an abstract mathematical standpoint, and they are entirely equivalent in their predictions. But where a graph of improvement in Elo is linear, a graph of improvement in ExpElo would be exponential. So is the growth in chess performance linear or exponential?
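The equivalence is easy to verify: define ExpElo as 10 raised to (Elo/400), and the difference formula becomes a ratio formula. A minimal sketch:

```python
def expelo(elo):
    # ExpElo: exponentiate the Elo rating (base 10, 400-point scale).
    return 10 ** (elo / 400.0)

def win_prob_elo(a, b):
    # Elo: prediction from the *difference* of ratings.
    return 1.0 / (1.0 + 10 ** ((b - a) / 400.0))

def win_prob_expelo(a, b):
    # ExpElo: the same prediction from the *ratio* of ratings.
    return expelo(a) / (expelo(a) + expelo(b))

# Identical predictions; but linear Elo growth is exponential ExpElo growth.
ratings = [2400 + 50 * year for year in range(5)]  # +50 Elo per year
growth = [expelo(r) / expelo(ratings[0]) for r in ratings]
```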

Before addressing that question, let’s stop to consider that this situation is not unique to chess. Any linearly growing metric can be rescaled (by exponentiating the metric) to get a new metric that grows exponentially. And any exponentially growing metric can be rescaled (by taking the logarithm) to get a new metric that grows linearly. So for any quantity that is improving, we will always be able to choose between a metric that grows linearly and one that grows exponentially.
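To make the rescaling point concrete, here is a small sketch showing the same series read both ways:

```python
import math

linear = [10.0 * t for t in range(6)]        # a linearly growing metric
rescaled = [math.exp(x) for x in linear]     # exponentiate it: exponential growth
recovered = [math.log(y) for y in rescaled]  # take the log: linear again

# Same underlying data; only the choice of scale differs.
```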

The key question for thinking about AI is: which metric is the most natural measure of what we mean by intelligence on this particular task? For chess, I argue that this is Elo (and not ExpElo). Long before this AI debate, Arpad Elo proposed the Elo system and that was the one adopted by chess officials. The U.S. Chess Federation divides players into skill classes (master, expert, A, B, C, and so on) that are evenly spaced, 200 Elo points wide. For classifying human chess performance, Elo was chosen. So why should we switch to a different metric for thinking about AI?

Now here’s the plot twist: the growth in computer chess rating, whether Elo or ExpElo, is likely to level off soon, because the best computers seem to be approaching perfect play, and you can’t get better than perfect.

In every chess position, there is some move (or moves) that is optimal, in the sense of leading to the best possible game outcome. For an extremely strong player, we might ask what that player’s error rate is: in high-level play, for what fraction of the positions it encounters will it make a non-optimal move?

Suppose a player, Alice, has an error rate of 1%, and suppose (again to simplify the explanation) that a chess game lasts fifty moves for each player. Then in the long run Alice will make a non-optimal move about once every two games, so at least half of her games will be played optimally. This implies that if Alice plays a chess match against God (who always makes optimal moves), Alice will get at least 25% of the points: she will play God evenly in the half of games where she makes all optimal moves, and (worst case) she will lose the games where she errs. And if Alice can score at least 25% against God, then Alice’s Elo rating is no more than 200 points below God’s. The upshot is that there is some rating–the “Rating of God”–that cannot be exceeded, and that is true in both the Elo and ExpElo systems.
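A back-of-the-envelope check of this bound, assuming (for the sketch) that Alice's errors are independent across moves:

```python
import math

def elo_gap(score):
    """Elo deficit implied by an expected score against a stronger opponent."""
    return 400 * math.log10(1 / score - 1)

# Alice errs on 1% of her moves, and plays 50 moves per game.
p_clean = 0.99 ** 50           # probability of an error-free game, about 0.605

# Worst case: draw every error-free game against God, lose all the others.
score_vs_god = 0.5 * p_clean   # about 0.30; the simpler "half" estimate gives 0.25

print(round(elo_gap(0.25)))          # 191 points below God, under the 200 bound
print(round(elo_gap(score_vs_god)))  # about 145 points with the binomial estimate
```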

Clever research by Ken Regan and others has shown that the best chess programs today have fairly low error rates and therefore are approaching the Rating of God. Regan’s research suggests that the RoG is around 3600, which is notable because the best program on my graph, Stockfish, is around 3400, and AlphaZero, the new AI chess player from Google’s DeepMind, may be around 3500. If Regan’s estimate is right, then AlphaZero is playing the majority of its games optimally and would score about 36% against God. The historical growth rate of AI Elo ratings has been about 50 points per year, so it would appear that growth can continue for only a couple of years before leveling off. Whether the growth in chess performance has been linear or exponential so far, it seems likely to flatline within a few years.
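Plugging the post's figures into the Elo win-expectancy formula reproduces these estimates:

```python
def expected_score(rating, opponent):
    # Standard Elo win-expectancy formula (base 10, 400-point scale).
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))

RATING_OF_GOD = 3600  # Regan's estimate, per the post

print(round(expected_score(3500, RATING_OF_GOD), 2))  # AlphaZero vs. God: ~0.36
print(round(expected_score(3400, RATING_OF_GOD), 2))  # Stockfish vs. God: ~0.24
```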

Comments

So, it took thousands of years for humans to start approaching perfect play, and it’s taken, what, 50 years for machines to approach it consistently? And once you’ve got a machine that can meet or beat God a fair percentage of the time, you have an effectively limitless capacity to rapidly make more machines that are equally good.

Why isn’t it explosive if it stops growing up and starts growing out? I mean, if we just decided that only things that could play at Elo 3000+ could live, we’d suddenly have vanishingly few humans and an endless supply of machines, right?

We do have effectively limitless capacity to rapidly make machines that play near-perfect chess. But (1) we are not in fact making those machines, because (2) it’s not clear why we would want to, because (3) that would make very little difference in the nature of human existence.

So yes, near-perfect machine chess is a very impressive intellectual achievement. But no, it doesn’t look at all like a Singularity.

But it augurs toward what happens when we make a machine that is as good as or better than humans at, say, administering a country, or an economy, or a McDonald’s, or any number of other things that basically boil down to game theory. No, a chess genius isn’t a useful thing for humans to have in abundance, but a stock-picking genius is, and it will be duplicated, to the exclusion and eventual extinction of human stock-pickers.

Also, you seem very focused on the difference to the nature of human existence… the thin end of the wedge is AI making incremental improvements over the existence humans can provide for themselves, and the incremental improvement need only be in small areas of speed, efficiency, outcome, or cost. There doesn’t need to be an instantaneous explosion that would look dramatic to humans; there just needs to be a slightly higher slope on their ability to improve themselves generation by generation than the slope of our ability to improve ourselves generation by generation, and they win. The Cambrian Explosion could be seen as the Biological Singularity, and it didn’t happen on a time scale that humans would have perceived as instantaneous or explosive had they been present to witness it, but in hindsight it was an event on either side of which the rules were completely different.

In my view the Technological Singularity already happened, we’re just the first few generations into the littoral zone, but the Challenger Deep awaits. It started with Turing; we now have an insatiable need to improve these machines, and that improvement has led to incipient AI, and it’s only increased our hunger to improve them more, to the point where we build what can build itself. And then that gets copied, indefinitely, because all the snowball needs to become an avalanche is to have momentum that exceeds our own.

So we’ve developed an incredible machine to play chess, and also Space Invaders and Go and now Texas Hold’em. And soon, perhaps, we can develop an impressive machine to drive cars and fly helicopters and planes and eventually fold laundry.

One could imagine that we’d be able to do this almost ad infinitum for quite a range of human tasks.

And once we’ve developed those domain-specific, optimal machines, we could put them all onto a chip, and we’d have — for all practical purposes — something indistinguishable from superhuman AI.

It would be very different from human intelligence, perhaps in the way an airplane flies differently from a bird, but it would still be darn impressive (and likely change the nature of human existence).

It seems to me that the discussion on whether it’s more natural to measure Elo or ExpElo seems to miss the point. Fundamentally, the only thing you have is a set of intelligences that you can order by how good they are. By the nature of that, any numeric measure of goodness is arbitrary, and examining how fast any such metric grows as evidence for any general trend whatsoever seems misguided.

That’s arguably correct for chess, but for many other AI tasks there is a very clear natural measure. For example, image recognition systems can be evaluated in terms of error rate: what fraction of examples they get wrong. Quantitative arguments based on error rate make sense and can be very useful.

Even if these are quantitative metrics, the interpretation raises many questions. For example, take the multiple results on various image recognition tasks from https://www.eff.org/ai/metrics#Vision

Unsurprisingly, most of them look like roughly linear improvements in percentage correct (or, inversely, error rate). In fact, the rate of improvement on many of them appears to slow as they approach higher percentages. But it’s hard to conclude from this that our overall progress in AI vision is linear.

Each of these metrics is bounded, implying that they suffer from a problem similar to chess’s: as we solve the easy cases, only the harder cases remain. Additionally, we get into real-world effects where, for example, at some point a problem has been solved well enough and researchers move on to harder problems.

We could also try to model these things. Shouldn’t we factor in the rate at which previously unsolvable problems become solvable? And we would need to quantify the hardness of each new problem… I think building an exhaustive model quickly becomes intractable.

But even still there is a scale problem, as Sami and others have pointed out, where it is hard to extrapolate intelligence from relative differences in ability. If I get a 50% on a math test and my peer gets a 75%, does that imply that she is 50% smarter than I am? Does it even imply she’s 50% better at math than I am? What exactly does it mean for someone to be 50% better at math? We’re making claims about how a massive human endeavor relates to a phenomenon that we don’t understand.

Fascinating discussion. I’m a chess expert (International Master in correspondence chess), and I have extensively used the various engines (open-source and commercial) ever since they first became available to consumers around 1980. The previously widely-held belief (mentioned by at least one of you above) that the latest engines might be getting close to the God Elo rating seems to be refuted by AlphaZero’s results and new style of play.

How close is current chess engine play to that of God? There is one way to model it, but alas I don’t have the necessary hardware. Endgame tablebases (EGTBs) exist for all possible positions with 7 pieces or fewer on the board. That is God. We could run the engines without EGTBs and see how often they play suboptimal moves. We could even play games with 7 randomly scattered pieces and see which engines play them best. This would produce Elo results. And then we could determine how much worse each Elo level is in terms of the percentage of suboptimal moves.

DeepMind’s marketing hype around AlphaZero has resulted in it being highly overrated.
Demis manipulated the games between AlphaZero and Stockfish to get eye-catching results.
First, Stockfish’s opening and endgame lookup tables, which are an integral part of the program, were disabled.
Second, Stockfish was not allowed to accumulate time saved at the beginning of the game and was forced to make every move in one minute.

We’ll probably need more experimentation to understand fully how to evaluate AlphaZero. I doubt it’s anywhere near as low as 2780, though. And in any case, AlphaZero is interesting scientifically, as a demonstration that reinforcement learning can be very successful for chess.

* Stockfish is a chess-specific program that was developed by human experts in both chess and algorithms over many years. It does a deep search of the solution space by exploiting massive parallelism, by cleverly pruning the game tree, and by leveraging pre-computed tables of openings and end games.

* AlphaZero is a more general purpose solver that was developed by neural network experts. It taught itself the strategy to play chess at superhuman level in a few hours. It does pattern matching by exploiting massive parallelism and by leveraging pre-computed tables of weights that (opaquely) represent its strategy/knowledge.

Stockfish, in a sense, is human intelligence encoded into a system of algorithms. AlphaZero is self-learned (machine) intelligence encoded into an opaque matrix of connection weights. The former is good at chess and the latter is good at pattern matching.

I don’t think you can say that AlphaZero is just the next step in a linear improvement of chess-playing AI. It’s a different approach. If quick, self-learning, massive scale pattern matchers can be useful in domains beyond games, then I think we’ll see that approach expand (dramatically) to other domains, and that, in a sense, is an explosion of AI that could conceivably change society.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.