The script expects a source directory as a parameter and will output a compressed CSV file for each input file into a folder called `out`. This will take some time if there is a large number of games to process.
Now that we have every move from every position, we need to combine and count them. There are numerous ways of doing this (you could use a Spark cluster if you want to get fancy, for example), but I went the really simple way and used the tools at hand: bash tools. If you use the games I used, the total number of rows in your CSV files will be around 570 million, so you need fast tools if you don't want this to take forever. The standard Unix tools are written in C and support running in parallel, so they are relatively fast. The script to combine and count these moves is essentially one big `sort | uniq -c` pipeline.
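A minimal sketch of such a pipeline; the gzip compression, file names, and `sort` tuning values are assumptions you will want to adapt to your own setup:

```bash
#!/bin/bash
# Merge every per-game CSV (one FEN,SAN pair per line), sort so identical
# pairs become adjacent, count them with uniq -c, and rewrite each line
# as FEN,SAN,COUNT.
export LC_ALL=C   # byte-wise collation makes sorting noticeably faster

zcat out/*.csv.gz \
  | sort --parallel=8 -S 24G -T /tmp \
  | uniq -c \
  | sed -E 's/^ *([0-9]+) (.*)$/\2,\1/' \
  > counts.csv

# Optional second sort by the count column, handy for checking the head
# and tail of the result.
sort --parallel=8 -S 24G -t',' -k3,3nr counts.csv -o counts_sorted.csv
```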

The second sort is optional; I just used it to check the head and tail of the file to see if it looked correct. The parameters to `sort` are tuned for my machine with 32GB of RAM and 8 cores, so you may need to adjust them for your environment. The script took around 6-7 hours to complete on my machine, so leave it overnight or go do something else; it will consume a lot of CPU and RAM, and the machine may not be very usable in the meantime.

After the script ends, you will get a file that contains rows like this:
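An illustrative row (the FEN is the starting position and the count is made up):

```
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1,e4,121342
```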

COUNT is the number of times SAN was played from FEN, which is exactly what we were looking for. You may stop here if this suits your needs, but the file in its current state will take up a lot of disk space when you import it into a database. So I went one step further and added a space optimization: a FEN takes an average of 40 bytes per position, but if you store the position in a different way, for example by storing only the positions of the (at most 32) pieces actually on the board, you get a better average. So here is a Groovy script that converts a given FEN string to piece coordinates.
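A minimal sketch of that conversion; the output encoding here (piece letter followed by its 0-63 square index) is just one possible choice, and the actual space savings depend on how compactly you encode the result:

```groovy
// Re-encode the piece-placement field of a FEN as a string of piece letters
// followed by their square index (a1 = 0 ... h8 = 63).
String fenToPieceIndexes(String fen) {
    def placement = fen.tokenize(' ')[0]           // first FEN field: the board
    def sb = new StringBuilder()
    placement.tokenize('/').eachWithIndex { rank, i ->
        int rankNo = 7 - i                         // FEN lists rank 8 first
        int file = 0
        rank.toList().each { ch ->
            if (ch.isNumber()) {
                file += ch.toInteger()             // a digit means that many empty squares
            } else {
                sb << ch << (rankNo * 8 + file)    // piece letter + square index
                file++
            }
        }
    }
    sb.toString()
}

// Read counts.csv (FEN,SAN,COUNT per line) and print the converted rows.
new File('counts.csv').eachLine { line ->
    def (fen, san, count) = line.tokenize(',')
    println "${fenToPieceIndexes(fen)},${san},${count}"
}
```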

This script reads each line of the counts.csv file and outputs a line with the piece indexes on the board. I also filtered out the positions and moves that occurred only once and imported both versions into a MySQL database. The unoptimized version takes around 5GB (without indexes) and the optimized version around 2.1GB, for around 30 million rows.