A quick test run of the FactoMineR package for R. This package focuses on multivariate exploratory data analysis, such as Principal Component Analysis (for numerical data) and Correspondence Analysis (for categorical data).

In an earlier blog post I took a look at a large collection of chess games and tried to quantify the “first move” advantage in chess, in terms of ratings. This time I’ll use the same large database of chess games, and look at opening repertoires. A chess opening is a set of moves that a player uses at the start of the game in an attempt to steer the game to positions familiar to the player, and which align with that player’s style and preferences. Such openings have descriptive, often colorful names, like King’s Gambit, Sicilian Poisoned Pawn, or Nimzo-Indian Defense, as well as a standard code, from the Encyclopedia of Chess Openings, like B07, C44 and E80. There are 500 such “ECO” codes, from A00 to E99.

I extracted games from all World Chess Champions, from Steinitz (1866) to Carlsen (2014), and calculated the share of each player’s games that falls under each ECO code. So each player’s opening repertoire is represented as a vector of 500 weights, summing to 1.0. I then used FactoMineR’s PCA() function to extract principal components from this dataset. Together, the first two components account for around 42% of the total variance.
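For readers who want to try something similar, here is a minimal sketch of how this step might look. It is not the code behind the charts in this post: the `games` table is randomly generated toy data, the six player names are only a sample, and `scale.unit = TRUE` (FactoMineR's default) is an assumption, since the scaling choice is not stated above.

```r
library(FactoMineR)

# All 500 ECO codes, A00 through E99
eco_codes <- paste0(rep(LETTERS[1:5], each = 100), sprintf("%02d", 0:99))

# Toy stand-in for the real database: one row per game (hypothetical data)
set.seed(1)
players <- c("Steinitz", "Lasker", "Capablanca", "Fischer", "Kasparov", "Carlsen")
games <- data.frame(
  player = sample(players, 3000, replace = TRUE),
  eco    = factor(sample(eco_codes, 3000, replace = TRUE), levels = eco_codes)
)

# Count games per player and ECO code, then turn each row into proportions
counts     <- table(games$player, games$eco)
repertoire <- as.data.frame.matrix(prop.table(counts, margin = 1))

# Drop ECO codes that nobody played; they carry no information
repertoire <- repertoire[, colSums(repertoire) > 0]

# Run the PCA without drawing the default plots
res <- PCA(repertoire, scale.unit = TRUE, graph = FALSE)

# Eigenvalues and percentage of variance for the first two components
res$eig[1:2, ]
```

The player coordinates on each component are in `res$ind$coord`, and `plot(res, choix = "ind")` draws the players against the first two components.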

Plotting the Champions against these two dimensions shows some intriguing patterns, bringing together players by era:

Further insights can be gleaned by plotting how these two components weight the various openings. To make it easier to read I grouped some of the ECO codes and used descriptive names for the better-known openings. From this we see that the first component appears to distinguish the player’s use of open games (1.e4 e5) in the positive direction versus semi-open and closed games in the negative direction. I’m having a harder time reading a real-world meaning into the second component. Maybe a reader sees something here?
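The grouping step can be sketched roughly as follows. The groups here are hypothetical and cover only a few well-known ECO ranges; the exact grouping used for the chart is not listed in this post. The small `repertoire` matrix is made-up data for illustration.

```r
# Map an ECO code to a descriptive group name (hypothetical grouping);
# codes outside the listed ranges keep their own code as the label
eco_group <- function(code) {
  letter <- substr(code, 1, 1)
  num    <- as.integer(substr(code, 2, 3))
  ifelse(letter == "B" & num >= 20, "Sicilian",
  ifelse(letter == "C" & num <= 19, "French",
  ifelse(letter == "C" & num >= 20, "Open games (1.e4 e5)",
  ifelse(letter == "E" & num >= 60, "King's Indian",
         code))))
}

# Made-up repertoire weights for two players over four ECO codes
repertoire <- matrix(
  c(0.1, 0.2, 0.3, 0.4,
    0.4, 0.3, 0.2, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("PlayerA", "PlayerB"), c("B33", "B90", "C44", "E80"))
)

# Sum the columns that belong to the same group
grouped <- t(rowsum(t(repertoire), eco_group(colnames(repertoire))))
grouped
```

Running `PCA()` on a grouped table like this and calling `plot(res, choix = "var")` on the result draws the openings rather than the players, which is the kind of chart discussed above.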

Something to remember in all of this is that the choice of opening in a game is a result of the moves of both players. Players try to influence the opening, steering the game toward their own strengths and preparation and away from those of their opponents. But neither player has 100% control over the opening, aside from some fringe moves like 1.h4. However, players, especially world-class players, do specialize in certain opening systems, and it is fair to speak of their repertoires.

Update:

The comment from Dana Mackenzie prompted me to try out another feature of FactoMineR, the ability to chart supplementary variables. These are variables that are not used in the underlying PCA calculation but can be shown in the charts, to see how they align with the extracted components. For example, I could add a categorical variable for each player to represent their nationality and then plot that, to see if there are national schools of practice regarding openings. Or, as I’ll do here, add a year variable, the year each player won their world championship, to see how this aligns:

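We can see by the length of the line here that the Year has a strong correlation with these two components, mostly with the first component. The closer a supplementary variable’s line comes to the edge of the unit circle, the better it is represented by the two components shown.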
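Here is a minimal sketch of how a supplementary variable can be passed to `PCA()`. The weights are random toy data rather than the real repertoires, and only six champions are included, with `year` set to the year each first won the title. The `quanti.sup` argument takes the column index of the supplementary quantitative variable; a categorical variable such as nationality would go through `quali.sup` in the same way.

```r
library(FactoMineR)

# Toy stand-in for the repertoire table: 6 players, 8 opening groups
set.seed(42)
weights <- matrix(runif(48), nrow = 6)
weights <- weights / rowSums(weights)   # each row sums to 1
champs  <- data.frame(weights)
rownames(champs) <- c("Steinitz", "Lasker", "Capablanca",
                      "Fischer", "Kasparov", "Carlsen")

# Year each player first won the world championship
champs$year <- c(1886, 1894, 1921, 1972, 1985, 2013)

# The year column is displayed but not used to compute the components
year_col <- which(names(champs) == "year")
res <- PCA(champs, quanti.sup = year_col, graph = FALSE)

# Correlation of the supplementary variable with each component
res$quanti.sup$cor
```

Calling `plot(res, choix = "var")` then draws the supplementary variable alongside the active ones. With random weights the correlations will be weak; the point is only to show where the supplementary column goes.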