Tag Cloud

To port an classic chess engine approach with an parallel Alphabeta algorithm like YBWC to an GPU architecutre would take a significant bunch of time, if it is even possible to port all well known computer chess techniques straight forward. And it is questionable if an Elo gain, by more computed nodes per second, is eaten up again by an higher branchingfactor due to an simpler implementation.

Zeta 098 and 097 make use of an Randomized Best First MiniMax Search, but my implementation makes excessive use of Global Memory and scales poorly.

At the very beginning of the project it was clear, that an Monte Carlo Tree Search would fit best for gpus. But until now there is no known engine that could make MCTS work well for Chess.

What is left, except to try to port an classic approach?

I could improve the performance of the BestFist search significantly by switching from GlobalMemory to LocalMemory and i could remove the randomness...another alternative would be to switch to MCAB, Monte Carlo Alphabeta...

The actual conclusion of the current iteration is, that an simple engine, with standard chess programming techniques, can be ported to OpenCL to run on a gpu, but it would take more effort to make the engine competitive in terms of computed nodes per second (speed), heuristics (expert knowledge), and scaling (parallel search algorithm).

Computer Chess, as an computer science topic, evolved over decades, starting in the 40s and 50s, and reached one peak 1997 with the match Deep Blue vs. Kapsarow. Nowadays chess engines are tuned by playing thousands and thousands of games, so to get an chess playing engine running on the gpu and to get an competitive chess playing engine running on the gpu are two different tasks.

One SIMD Unit - One Board To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.

Use of Local Memory* instead of Global Memory The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory ,less Warps (resp. Wavefronts) are in need to hide Global Memory latency.

Hundreds of Work-Groups instead Thousands of Threads YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs.So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.

* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.

To make it short, Zeta v097 and v098 make excessive use of the slower Global Memory, therefore the computed nodes per second scale linear with the memory bandwidth and not with the amount of cores or their clock rates.

SIMT architecture of GPUs GPUs consists of tens to hundreds of SIMD or Vector Units that process multiple threads in multiple Warps or Wavefronts in SIMT fashion.

Memory architecture of GPUs To hide latency of Global Memory (VRAM) GPUs can run multiple Warps or Wavefronts and prefer to do computation by the use of Local or Private Memory. So, the more Work-Items and Work-Groups you run to hide latency, the less Local and Private Memory per thread will be available.

Thousands of threads on GPUs MiniMax search with Alpha-beta pruning performs best serial, not parallel.