Frederic Lord wrote a consequential doctoral dissertation at Princeton in 1951. He was already the director of statistical analysis for the Educational Testing Service, which was formed in Princeton in 1947. All the scoring of our SATs, GREs, and numerous other standardized tests have been influenced both by his application of classical test theory and his development in the dissertation of Item Response Theory (IRT).

Today we discuss IRT and issues of scaling that arise in my chess model. The main point is that the problems are ingrained and beautiful observed regularities burnish them rather than fix them.

This post is long, but has other takeaways including how ability in chess identifies with scaling up the perception of value, yet how value may be a detour for training chess programs, and how the presence of logistic curves everywhere doesn’t mean your main quantities of interest will follow them.

Some Item Response Theory

The basic component of IRT is a curve in which is a measure of aptitude or tendency and is the expected test score of somebody described by . Each item—for instance, a single question on a test or a reading of sentiment—has its own curve that looks like one of the following:

Each curve has two main parameters: the placement of its symmetry point on the -axis and its slope at that point. The diagram shows all three curves centered at the origin, so , but this need not be so. Shifting a curve right lowers the expectation for every and represents a question being more difficult; shifting it left represents an easier item. The steeper the slope, the greater discrimination between levels of ability or tendency. A third parameter given equal standing by ETS is guessability. One could expect to score at least 20% on the old SAT (without the present wrong-answer penalty of 0.25) just by random guessing, so the curves might be given a lower asymptote of . Axiomatically this need not shift the expectation at up from 50% to 60%, but that is the effect of the popular logistic formula for the curves:

Our discussion starts with the scale of the -axis. It is not in units of grade-point average or any medicalreading. It presumes that the population has normally distributed around some mean with some standard deviation . The values and shown on the -axis in the figure thus represent the “95%” interval around the mean. When aggregating large samples of test results one can infer this interval from the middle 95% of the scores .

This plus the translation invariance of the curves facilitate putting offerings of different tests (or editions of a test) on a common scoring scale. That’s why you’re not scored on the actual % of SAT or GRE questions you got right. We will, however, find other places in the mechanics of models where absolute values are desired.

The Chess Test

We just posted about exactly this kind of S-shaped curve, where, however, represents the difference in ability of a chess player and one’s opponent. The value still represents the scoring expectation of the player. The curve has slope intended to confer a special meaning to the difference on the standard Elo rating scale of chess ability.

Incidentally, this figure from our previous post shows that it does not make too much difference whether the S-curve is logistic as above (red) or derived from the normal distribution (green); there is a well-known conversion of about between their slope units. Using the logistic version does not countermand the assumption that the population’s ability levels are normally distributed.

The online playing site Chess.com maintains ratings for over 4.7 million players, ten times as many as the World Chess Federation, and it shows a mostly-normal distribution of ratings:

There are issues of skew: the right-hand tail is longer, higher-rated players play more games, and they are less likely to exit the population. The whole 4.7 million are skewed relative to humanity on the whole but one can also say this of SAT- and GRE-taking students. On the whole, the population assumptions of IRT apply.

The presence of an opponent differs from test-taking. There are “solitaire” versions of chess, and more broadly, compilations of chess-puzzle tests such as these by Chess.com. To be sure, the administration of those tests is not standardized. However, the whole “Intrinsic Ratings” idea of my chess model is that we can factor out the opponent by direct analysis of the quality of move choices made by “player ” in games. The administration of games in chess competitions is completely regular and draws consistent full attention from the players.

The Ability-Value Digression

A second appearance of the S-shaped curve makes chess appear even more to conform to IRT. Amir Ban has argued that it is vital to chess programs. But we will see how the conformity is an illusion and how AlphaZero has exposed it as a digression. The curve has the same -axis but a different -axis representing position value rather than player ability. Here is an example from my previous post about these curves:

The -axis represents the advantage or disadvantage for “player ” in so-called centipawn units (here divided by 100 to mesh with the colloquial idea of being “a Pawn ahead” etc.) and the -axis shows the scoring frequency from positions of a given value . The curve has been symmetrized by plotting both for the player to move and for the player not to move, so . The and parameters (conforming to Wikipedia’s usage—note also ) are the same as in IRT. Here represents the frequency with which a player should have been checkmated but the opponent missed it; by symmetry the upper asymptote is not but and represents the frequency of blowing a completely winning game. Note that this is real data from over 100,000 moves in all recorded games where both players were within 10 Elo points of the 2000 level—thus the incredibly good logistic fit has the force of natural law.

Cross-referencing the two curve diagrams and the observation that a superiority of 150 Elo points gives about 70% expectation leads to a meaningful conclusion:

For players in the region of 2000, having 150 points more ability is just like having an extra Pawn in your pocket.

This looks like a perfect correspondence between ability and advantage. But wait—there’s a catch: That’s only valid for players at the Elo 2000 level. The slope , which governs the conversion, changes with the Elo rating . So does : weaker players blow more games. So the above “” is . That’s where the sliding-scale problem enters:

The change in slope when drawn games are removed from the sample—indicative of games like Go and Shogi in which draws are rare—is even more pronounced:

Whereas the 70% prediction from a 150-point rating difference is valid everywhere on the scale, the value of a Pawn slides. Give an extra pawn to a tyro and it matters little. Give it to Magnus Carlsen, and even if you’re his challenger Fabiano Caruana, you may as well start thinking about the next game. The shifting slope is both the main correlate of skill and the conversion factor from the centipawn values given by chess programs. Skill can thus be boiled down to the rate of the conversion—the vividness of perception of value.

Why, then, say the value axis is a digression? Chess programmers put colossal effort into designing their evaluation functions and tuning them in thousands of trial games. Yet the real goal is not to find moves of highest but rather moves giving the best expectation to win the game.

Monte Carlo tree search (MCTS) as employed by AlphaZero bypasses and trains its network by sampling results of self-play to optimize directly. The “which ?” problem disappears because it uses its evolving self as the standard. Not only the public Leela Zero project but the latest “MCTS” release of the commercial Komodo chess program have gone this route. As explained neatly by Bram Cohen, evidently earlier Komodo versions got boxed in to non-optimal minima of the design space. Cutting out the “middleman” avoids creating such holes.

The Sliding Scale Issue

My chess model is purposed not to design a champion computer chess program but to measure flesh-and-blood humans (as hopefully staying apart from champion computer chess programs). So I must grapple with dependence across all ratings . Moreover, the values output by chess programs are the only chess-specific data my model uses.

A key observation I made early on is that the average magnitude of differences between the value of the best move and the value of the played move depends not only on the player’s but also on . The higher in absolute value, the higher are all —markedly so. One might expect a higher from playing conservatively when well ahead (like “prevent defense” in football) and taking risks when well behind, but the data shows a clean affine-linear dependence clear down to . See the quartet of graphs midway through this post. Per evidence here, I treat this as a matter of perception needing correction to make differences less dependent on —and the post shows both that it flattens fairly well and makes tangible improvements.

The correction is, however, artificial, computationally cumbersome, and hard to explain. A more natural scaling seems evident from the last section’s curves: take the difference in expectations rather than raw values. Namely, use

A glance at the logistic curves shows the desired effect of damping differences when is away from and damping more when . The problem, however, is that has to be for some rating level . Which should it be?

The rating of the chess program (at the time or depth employed), say ? This gives a fixed number but uses “inhuman” values to correct human perception.

The player’s own rating ? This makes sense for my large training sets labeled by rating as used for the above charts. But often I am fitting game data of unknown quality to compute an intrinsic rating . Using to get seems circular. Or the fitting may be iterated to reach a fixed point but this is time-consuming on large data and might be unstable on small data.

A one-size-fits-all value in the middle, say ? This too is arbitrary, and while it is in the middle of the World Chess Federation’s human rating range 1000-to-2840 (Carlsen), the population mean is well below it.

I originally had a fourth reason for rejecting this approach: at the time, all these options gave inferior results to my and device. This enhanced my feeling against using a “reference 2000 player” in particular. Now my model has more levers to pull and the logistic-curve ideas are competitive, but still not compelling.

More Sliding

A simpler instance is that I want to measure the amount of challenge a player creates for the opponent. My “intrinsic ” as it stands is primarily a measure of accuracy. It penalizes enterprising strategies, ones that the computer doing my data-gathering sees how to defang but a human opponent usually won’t. Having a “Challenge Created” measure that applies to any position (with the opponent to move) might even incentivize elite players to create more fight on the board.

Since my model already generates probabilities for every possible move by , the metric is well-defined by

namely the expected (scaled) loss of value in the position that was confronted with. Here (as opposed to in the last section) is from my rating-independent metric. But the depend on the rating used for , so is really an ensemble . Again we have the choices:

Taking to be the computer’s rating makes even less sense than before.

Taking to be ‘s own rating in highly relevant for heads-up contests. But it impedes comparing the and measures for players and of different ratings. Values depend markedly on , so we would wind up saying that Carlsen creates only a third as much challenge as a 2000-level player from the same positions .

Fixing enables said comparisons but stretches credulity another way: Why should Carlsen be concerned with how much challenge he creates for a “reference 2000 player”? He can wait for errors to pounce on and generally crush a 2000-level player like a bug.

In work with Tamal Biswas covered here and here, we attempted to define a non-sliding measure in terms of “swings” in values of moves as the search deepens. This is still my desire but has been shackled by model-stability issues I’ve covered there and subsequently.

Doubting the Curves

A related issue comes from my desire to test my model’s projected probabilities in positions that have been reached by many players across the spectrum of ratings. The SAT has no trouble here: the same question is faced by thousands of takers at the same time. But there is no such control in chess, and popular positions become “book” that many players—even amateurs—know. The weaker players have free knowledge of what the masters did—or nowadays of what computers say to do in .

What I can do instead is cluster positions according to similar vectors of values . It is also legit to test my model by clustering the vectors of it generates. The high dimension of , the typical number of legal moves, can be reduced to a smaller by a vector similarity metric that down-weights poor moves. This doesn’t need clustering the whole space of positions and size matters more than tightness of the cluster. Yet despite having millions of data points it has been hard to find good clusters.

I’ve only done tests with , that is, on positions with two reasonable moves and , similarly spaced in value, and all other moves bad. My model’s projections have fared OK in these tests—as could be expected in such simple and numerous cases from how it is fitted to begin with. But a surprise comes from how this is also the simplest test of IRT for chess, considering to be the right answer and and everything else wrong. Thus we can observe a composite item curve for from these positions. And the consistent result is not a sigmoid curve. Rather, it looks like the left half of the logistic curve, as if the inflection point of maximum slope would come around Elo . Thus the only ability level discriminated by the “chess test” is perfection.

So the logistic law of IRT is out for chess. The logistic law of ratings works OK despite caveats here. The logistic law of value, despite being observed with incredible fidelity for each rating level in the above plots, has two more feet of clay. Ideally it should give me value conversion factors for each chess program so that my model could use one set of equations for all—and importantly, so it could pool all of the programs’ move-values together to make more-reliable projections.

But chess programs are not constrained by the law. They can do any post-processing they want of reported move values: so long as the rank order is preserved, nothing changes in the program’s playing behavior. The “calibrations” advertised by the Houdini chess program not only trip on the sliding scale but diverge from my own data for non-blitz chess at any point on it. Similar morphing of values by Komodo evidently causes the anomaly at the end of my earlier post on the “law.”

And second—where the scale slides away completely—the conversions don’t capture the different positioning of programs (and versions of the “same” program) on the landscape they share with human players. An unfortunate new cheating case last week has shown this most definitively. Thus I am resigned to having to re-jigger my equations and re-fit my model on re-run training data (a quarter million CPU core-hours per set, many thanks due to UB CCR) for each major new program release. And I wonder less at the need for continual re-centering of SAT scales.

Open Problems

Can you suggest a general solution to my sliding-scale problems?

I have skirted the issues of SAT and GRE re-scaling per se. The report on re-centering linked just above acknowledges large shifts in the population. One attraction of using chess is that the rating system gives a fixed benchmark and—per my joint-work evidence—has remained remarkably stable for the population at world level. Can the non-sliding standards in chess be leveraged to transfer deductions about distributions to general testing?

A further problem is that we treat both grade points and chess ratings as linear. Raising a C+ to a B- has the same effect on one’s GPA as raising an A- to an A. A 10-player chess tournament needing to raise its average rating by 3 points to reach the next category can get it equally from the bottom player raising 2210 to 2240 as the top player raising 2610 to 2640. Yet the latter lifts seem harder to achieve. Perhaps more aspects of the scale need plumbing before discussing how it slides.