The Apache Spark framework is a great way to process huge amounts of data in parallel. The learning curve is gentle, and you’ll be crunching data in no time. At work I have already implemented a batch log processing/analytics application and a real-time streaming application with Spark for n11, and I thought I’d do another fun project with it. This time it’s generating a heat map of a chess board by processing chess games. A heat map of a chess board is an 8×8 table in which each cell represents a square of the board with a color: blue means few moves landed on that square, red means many. It’s an analogy to the warm-cold method of describing distance. We will build one heat map for each piece in chess. The end result will look something like this:

So, one method we could use, and the one I tried first, was to generate a FEN representation of the board after each move and feed that data to Apache Spark to extract how often each piece occupied each square. This sounded reasonable, and it would have been very efficient, because Spark would not need any chess knowledge to compute the heat map; some string parsing would suffice. But it turns out that this method overwhelms the heat map with the pieces’ initial positions. Why? Because the pieces rest on their home squares for quite some time before they move. Let’s look at an example to make this clearer. The initial state of the board as a FEN string is:
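```
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
```

After the opening move 1. e4, it becomes:

```
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
```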

After 1. e4, the pawn will register a hit for the e4 square, but so will all the other pieces for their home squares! This will generate a heat map that says every piece likes its home square best, which isn’t really helpful.

So we need another approach. A better way is to register only the squares a piece actually moves to. This requires parsing the moves in the PGN game and actually playing through the game, so we need a chess library to use in our Spark job. Fortunately I had a nifty little Java port of chess.js handy here that would let me play through the games and get back the squares the pieces moved to. But before we can dive in, we need to clean the PGN files and make them suitable for processing by Spark. Data processing platforms like Spark like to munch data in CSV format: a line of values separated by commas. The values in our file will be the moves of each game, starting from the initial position. Here is an example of what I mean:
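For example, the first moves of a Ruy Lopez would collapse into one line like this (an illustrative line, not taken from the actual dataset):

```
e4,e5,Nf3,Nc6,Bb5,a6,Ba4,Nf6,O-O,Be7
```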

So let’s dive into converting the PGN files. We will use the excellent Chesspresso library to parse them, because I haven’t yet integrated PGN methods into chesslib. The dependency for Chesspresso can be added using:
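The original snippet isn’t reproduced in this excerpt, so the build.sbt lines below are placeholders: the resolver URL, group id, and version are all hypothetical and should be checked against whichever repository actually hosts the artifact.

```scala
// Hypothetical coordinates -- verify against the repository you actually use
resolvers += "chesspresso-repo" at "https://example.com/maven"
libraryDependencies += "com.github.example" % "chesspresso" % "0.9.2"
```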

Thanks to the person who was kind enough to package and upload this to a repository. Meanwhile, I’m not so nice and haven’t uploaded chesslib to a repository, but no problem: we’ll just add it to the /lib directory and the sbt assembly plugin will pick it up from there. So let’s start by parsing the PGN files using Chesspresso:

This is the method that reads the games into a list. Chesspresso streams the games from the file on demand (because a PGN file may contain more than one game). Since we consume the whole stream, this will not work on a PGN file with lots of games in it (around 1M games suffers on my 8 GB machine). If you need to split a PGN file into smaller files, use pgn-extract.
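The original listing isn’t reproduced in this excerpt, so here is a minimal Java sketch of the reading loop, assuming Chesspresso’s PGNReader.parseGame() contract (it parses the next game from the stream and returns null at the end):

```java
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import chesspresso.game.Game;
import chesspresso.pgn.PGNReader;

class PgnLoader {
    // Read every game in the file into memory. This is where a huge PGN
    // file hurts: the whole stream ends up in one list.
    static List<Game> readGames(String path) throws Exception {
        PGNReader reader = new PGNReader(new FileInputStream(path), path);
        List<Game> games = new ArrayList<>();
        Game game;
        while ((game = reader.parseGame()) != null) {
            games.add(game); // next game, parsed lazily from the stream
        }
        return games;
    }
}
```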

This method gets the moves made in the game as a list and concatenates them with commas into the CSV format shown above. Chesspresso will throw an exception on an invalid move, and some PGNs do contain invalid moves.
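The join itself is a one-liner once the SAN moves are in hand; here is a plain-Java sketch (the move list is assumed to be already extracted, with null standing in for a game whose parse failed):

```java
import java.util.List;

class CsvLines {
    // Join one game's SAN moves into a single CSV line.
    // Returns null when there is no move list (e.g. the game failed to parse).
    static String toCsvLine(List<String> sanMoves) {
        if (sanMoves == null) return null;
        return String.join(",", sanMoves);
    }
}
```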

This map can be thought of like this: the keys represent each piece (the colors are irrelevant) and the values are arrays of size 64, one slot for each square on the chess board. When a piece moves to a square, we increment the count in the array.
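Since the post’s listing is omitted in this excerpt, here is a minimal Java sketch of that structure (representing the six piece types by their letters is my assumption, not the original code):

```java
import java.util.HashMap;
import java.util.Map;

class HeatCounts {
    // One 64-slot counter array per piece type; colors are ignored.
    static Map<Character, long[]> emptyCounts() {
        Map<Character, long[]> counts = new HashMap<>();
        for (char piece : new char[] {'P', 'N', 'B', 'R', 'Q', 'K'}) {
            counts.put(piece, new long[64]); // one slot per square, all zero
        }
        return counts;
    }
}
```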

This second map is a mapping from the string (SAN) notation of a square to its index in the array I mentioned above. We need this because chesslib’s internal array representation of a chess board differs from the one we will be using. Now let’s get to crunching those chess moves:
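The exact ordering used in the original code isn’t shown here, so this sketch picks one convention (a1 = 0, b1 = 1, …, h8 = 63, rank-major) just to make the idea concrete:

```java
import java.util.HashMap;
import java.util.Map;

class SquareIndex {
    // Map "a1".."h8" to 0..63 using a rank-major ordering (an assumed convention).
    static Map<String, Integer> build() {
        Map<String, Integer> index = new HashMap<>();
        for (int rank = 0; rank < 8; rank++) {
            for (int file = 0; file < 8; file++) {
                String square = "" + (char) ('a' + file) + (rank + 1);
                index.put(square, rank * 8 + file);
            }
        }
        return index;
    }
}
```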

We start by mapping over each line of our input. We initialise a chesslib instance that we will use to process the moves and reset it to the starting position before each game. Next we split the CSV line into a list of moves, and for each move we have chesslib play it. The result of the move method contains the piece and the square the move was made to. We take that square and increment the count in the array contained in the map. We will need some grouping, so we convert the map into a sequence. Before the grouping, the result will be something like:
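To make the counting step concrete: chesslib’s actual return type isn’t shown in this excerpt, so this stand-in sketch pretends each played move hands back a piece letter plus the destination square’s index, and counts from there:

```java
import java.util.HashMap;
import java.util.Map;

class GameCounter {
    // pieces[i] moved to square squares[i]; build the per-game count map.
    static Map<Character, long[]> countGame(char[] pieces, int[] squares) {
        Map<Character, long[]> counts = new HashMap<>();
        for (int i = 0; i < pieces.length; i++) {
            long[] slots = counts.computeIfAbsent(pieces[i], p -> new long[64]);
            slots[squares[i]]++; // one more visit to this square by this piece
        }
        return counts;
    }
}
```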

Why do we need this? Because we will merge these maps to sum the values in the arrays for each piece. The example above states that the rook was on a1 ten times in game 1 and eleven times in game 2, so the heat map will contain 21 for the rook on a1. When we group this sequence by key using Spark’s groupByKey method, we get the following result:
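The merge itself boils down to an element-wise sum of the per-piece count arrays; here is a plain-Java sketch of that reduction (Spark performs the equivalent across partitions for us):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeatMerge {
    // Sum the per-game count maps: for each piece, add the arrays slot by slot.
    static Map<Character, long[]> merge(List<Map<Character, long[]>> perGame) {
        Map<Character, long[]> total = new HashMap<>();
        for (Map<Character, long[]> game : perGame) {
            for (Map.Entry<Character, long[]> e : game.entrySet()) {
                long[] slots = total.computeIfAbsent(e.getKey(), p -> new long[64]);
                for (int sq = 0; sq < 64; sq++) {
                    slots[sq] += e.getValue()[sq];
                }
            }
        }
        return total;
    }
}
```

With the rook numbers from the text (10 visits to a1 in one game, 11 in another), the merged array holds 21 in the a1 slot.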

This gives us the total number of moves to each square for each piece. Just what we needed for the heat map!
But we still need one more thing before we can start rendering the map, and that is to normalise the values in the arrays to values between 0 and 1. We can do that using the min-max feature normalisation formula:
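The standard min-max form is x′ = (x − min) / (max − min); a small Java sketch of applying it to one piece’s count array (mapping a constant array to all zeros is my choice for the degenerate case):

```java
class Normalize {
    // Min-max feature normalisation: scale each count into [0, 1].
    static double[] normalize(long[] counts) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long c : counts) {
            if (c < min) min = c;
            if (c > max) max = c;
        }
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            // Guard against division by zero when all counts are equal.
            out[i] = (max == min) ? 0.0 : (counts[i] - min) / (double) (max - min);
        }
        return out;
    }
}
```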

The html() function used here is a string template that maps the values between 0 and 1 to an HSL color on a five-color heat map. Check out the code in the repo; it’s too verbose to include in this post.

That’s about it on how to generate a chess board heat map. Here are the results for 1.5M games analyzed: