What's All This Then?
ChessPT, as the name suggests, is a pre-trained transformer model, trained on the Lichess database1. Since chess games can be expressed unambiguously as a string of sequential moves using Portable Game Notation (PGN), PGNs seem like a good fit for an auto-regressive GPT model.
From the model's perspective, a chess game is just a highly structured sequence of tokens (moves) drawn from a relatively limited vocabulary. Its job is to calculate the move most likely to come next, given the previous sequence of moves. The chosen move is appended to the game's PGN string, which a human player sees rendered as a move on the chess board, and play continues. The key difference between ChessPT and a person is that it makes this prediction purely from patterns in existing data. ChessPT has no ability to calculate subsequent moves the way a human or conventional chess engine would, so its play is entirely probabilistic.
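Mechanically, that loop is very simple. A minimal sketch, where `predict_next_move` is a hypothetical stand-in for the real model's forward pass:

```python
def predict_next_move(moves):
    # Hypothetical stand-in: the real model runs a transformer forward pass
    # over the whole move sequence and returns the most probable next token.
    return {"e2e4": "e7e5"}.get(moves[-1], "g8f6")

def play_turn(game_moves, human_move):
    """Append the human's move, ask the model for its reply, append that too."""
    game_moves.append(human_move)
    reply = predict_next_move(game_moves)
    game_moves.append(reply)
    return game_moves

game = play_turn([], "e2e4")
# game is now ["e2e4", "e7e5"]
```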
Why I Did This
This is a toy project -- while I've been developing novel models for a few years as part of my doctoral research, I haven't needed to do much to make them usable outside the command line and wanted to expand my skills a bit. This website is to give me some practice on the cloud side of things with some basic web and API stuff, and the model itself is to practice data management and model training practices outside my usual wheelhouse. I'm about as good at chess as I am at web design, but I find the problem interesting from a modelling standpoint, and my approach appears to be novel, as far as I can tell2.
What I find most interesting is that the rules of chess should be deducible from a massive dataset of real games, even if not every theoretically possible move actually gets played. The positions of the pieces on the board can be inferred directly from the history of the current game.
The Model
ChessPT is based on the HuggingFace implementation of OpenAI's GPT2. Technical explanations of how transformers work abound online3, but the simplified explanation is that when ChessPT is given a list of moves which make up the current game, each move is "embedded" as a vector representing information about the move in isolation. Each move is then able to extract information from every other move in the game, which it uses to update its knowledge of itself. After doing this over several successive layers, each taking in the previous embeddings for each move and outputting an updated version of each of them, the model uses a fully-connected layer which takes the last move's embedding and produces a probability distribution over the set of all possible moves. The move with the highest probability is chosen as the next move to play, and given to the user.
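The final step above can be illustrated with a toy example: the fully-connected layer yields one score per vocabulary entry, and the highest-scoring entry becomes the move that gets played. The scores here are invented; the real model produces them from its stacked attention layers:

```python
# Toy scores over a tiny slice of the vocabulary (invented for illustration).
logits_for_last_move = {"e2e4": 1.2, "d2d4": 0.9, "g1f3": 2.3}

# Greedy selection: the highest-scoring vocabulary entry is played.
next_move = max(logits_for_last_move, key=logits_for_last_move.get)
print(next_move)  # g1f3
```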
The Vocabulary
ChessPT is just a chess model, trained from scratch. As a result, I can use an almost preposterously tiny vocabulary, since there really are not very many moves which could ever be made in a chess game. You may have noticed the "Estimated Move Probabilities" table does not show moves in conventional PGN format (e.g., e4, Qd7+, O-O-O#, etc.), but rather in Universal Chess Interface (UCI) format, in which each move is a source square and a destination square (e.g., e2e4, d1d7, e1c1, etc.). UCI is useful because a given move in UCI contains as little redundant information as possible, while remaining essentially stateless. Strictly speaking, the "+" in the PGN "Qd7+" is redundant, as the knowledge that a king is now in check is implicit in the rules of the game. PGN can also include some very unusual moves (Bc6e4#, for example), which come up so infrequently in real games that it's hard for the model to learn anything about when they would or wouldn't be possible or appropriate.
UCI does away with this issue and keeps our vocabulary small, with 64² = 4,096 possible moves at first glance. Most of these are guaranteed to be illegal, though, as no piece is capable of making them. It turns out that every possible legal move can be expressed as either a Queen move or a Knight move on an empty board, even unusual moves like castling or en-passant4. This gives us 1,970 possible input tokens representing every possible move which could ever be made on a chess board.
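As a sanity check, the queen-and-knight move set is easy to enumerate directly. A pure-Python sketch (promotion-suffixed UCI tokens such as e7e8q account for the remainder of the vocabulary and are omitted here for brevity):

```python
# All 8 sliding directions a queen can move in, and all 8 knight jumps.
QUEEN_DIRS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def uci(r, c):
    """Convert 0-indexed (rank, file) to a UCI square name like 'e4'."""
    return "abcdefgh"[c] + str(r + 1)

moves = set()
for r in range(8):
    for c in range(8):
        for dr, dc in QUEEN_DIRS:          # slide until the edge of the board
            nr, nc = r + dr, c + dc
            while 0 <= nr < 8 and 0 <= nc < 8:
                moves.add(uci(r, c) + uci(nr, nc))
                nr, nc = nr + dr, nc + dc
        for dr, dc in KNIGHT_JUMPS:        # single knight jumps
            nr, nc = r + dr, c + dc
            if 0 <= nr < 8 and 0 <= nc < 8:
                moves.add(uci(r, c) + uci(nr, nc))

print(len(moves))  # 1792 queen/knight source-destination pairs
```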
We also include some special tokens: [START] as the first token for a game, and [WHITE], [BLACK], and [DRAW] to indicate the game's result at the end. Every game that exists can be expressed using these 1,974 tokens, giving ChessPT its vocabulary. In theory, such a small vocabulary which is used to train a model with such highly structured inputs should make a highly-trained ChessPT model more efficient than fine-tuning a foundational LLM, since the overwhelming majority of text that GPT-4 or Llama are trained on has nothing to do with chess and should5 contribute nothing to a chess bot.
Datasets
Acquiring a dataset for ChessPT was the easiest step in the entire process, since Lichess, the second-biggest chess website, publishes a monthly dataset of every single game played in PGN format, totalling 1.92 TB and 611,374,411 games as of November 2024. Every game is guaranteed to be valid, so we run no risk of illegal moves being played, and they come with plenty of metadata if we want to filter to just long games, just the best players, or even just games where particular openings were played.
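A minimal sketch of that kind of metadata filtering, using the WhiteElo/BlackElo header tags Lichess writes into its PGN dumps. The parsing here is deliberately naive and is not the actual pipeline:

```python
import re

# A tiny two-game PGN dump, invented for illustration.
PGN_SAMPLE = '''[Event "Rated Blitz game"]
[WhiteElo "2105"]
[BlackElo "1987"]

1. e4 e5 2. Nf3 *

[Event "Rated Classical game"]
[WhiteElo "2350"]
[BlackElo "2204"]

1. d4 d5 2. c4 *
'''

def strong_games(pgn_text, min_elo=2000):
    """Keep only games where both players are rated at or above min_elo."""
    # Games in a PGN dump start with an [Event ...] tag after a blank line.
    games = re.split(r'\n\n(?=\[Event )', pgn_text.strip())
    keep = []
    for game in games:
        elos = [int(e) for e in re.findall(r'\[(?:White|Black)Elo "(\d+)"\]', game)]
        if len(elos) == 2 and min(elos) >= min_elo:
            keep.append(game)
    return keep

print(len(strong_games(PGN_SAMPLE)))  # 1 (only the second game qualifies)
```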
A dataset this large, diverse, realistic, and clean is the stuff dreams are made of.
In my initial training runs on my potato of a computer, I trained the model on the smallest month's games (January 2013, with 121,332 total games). The results were less than impressive, but after only 10 epochs on a pretty tiny dataset, I found ChessPT could already consistently play the first 5 or 6 moves of the mainline of common openings like the Spanish, the Fried Liver Attack, or the King's Gambit.
One problem, though, is that the model was quite susceptible to traps. This is pretty understandable, remembering that ChessPT's goal is to predict a realistic next move, rather than a good one, and since most of our dataset is full of pretty bad players playing pretty bad games (and falling for traps), the most realistic option is not usually the best one.
Playing Good Games
This is where our final two tokens come in. We add a pair of [START: WHITE] and [START: BLACK] tokens to the vocabulary, bringing the total to 1,976. During training, we use these to spoil the ending by giving away the ultimate winner of the game. This makes our model equivalent to asking the question: "Given the current game history, and the fact that white will win, what's the most likely next move?"
During play, we set the START token to the colour representing the model so that it tries to play well, rather than just playing realistically. Interestingly, this also implies that setting the START token to the human player's colour would encourage ChessPT to play the worst game it possibly could.
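The token choice during training could be sketched like this. This is a hypothetical illustration; in particular, treating drawn games as keeping the plain [START] token is my assumption, not necessarily what ChessPT does:

```python
def conditioned_sequence(moves, result):
    """Prefix a game's moves with a start token that leaks the winner.

    result: "1-0" (white wins), "0-1" (black wins), or "1/2-1/2" (draw),
    as recorded in standard PGN result markers.
    """
    if result == "1-0":
        start = "[START: WHITE]"
    elif result == "0-1":
        start = "[START: BLACK]"
    else:
        start = "[START]"  # assumption: draws keep the unconditioned token
    return [start] + moves

print(conditioned_sequence(["e2e4", "e7e5"], "1-0"))
# ['[START: WHITE]', 'e2e4', 'e7e5']
```

At inference time, the same function would be called with the model's own colour as the "winner", which is exactly the conditioning trick described above.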
Technical Stuff
We trained the current ChessPT model on an RTX 3070, which isn't much but it's what we've got. This makes training slow, especially for larger and larger chunks of the total Lichess database. It also limits our model's overall size.
We use a move embedding dimensionality of 512 and a context length of 256, since virtually no chess games come anywhere close to 256 moves. The full model comprises 12 layers with 12 heads per layer, for a total of ~87 million parameters.
These size limitations are pretty severe, and under these conditions ChessPT will never be a world-class chess player. The convenient thing about transformers, though, is that if I ever find an A100 sitting on the side of the road, it'll be trivial to scale up. As noted earlier, we use the HuggingFace implementation of a blank GPT-2 model, and the HuggingFace Accelerate library for our GPU.
Future Work
There are quite a few changes I'd like to make, when I have the time. Primarily, I'd like to train the model for longer (and ideally larger, though that's resource dependent) in the hopes that it learns deeper and deeper lines of play.
I intend to experiment with making training more efficient by masking the illegal moves out of the final softmax function. In chess, the probability of selecting an illegal move is always exactly 0, but ChessPT doesn't know that during training (though during inference I do zero out illegal moves, which is why the game doesn't try to play them).
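The inference-time version of that masking can be sketched in pure Python: illegal moves get a logit of negative infinity, so they receive exactly zero probability after the softmax. The vocabulary and logit values here are toy examples; the real logits come from the transformer:

```python
import math

def masked_softmax(logits, legal):
    """logits: dict of move -> raw score; legal: set of legal moves."""
    # Send illegal moves to -inf so exp() maps them to exactly 0.
    masked = {m: (s if m in legal else float("-inf")) for m, s in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {m: math.exp(s - peak) for m, s in masked.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}

probs = masked_softmax(
    {"e2e4": 2.0, "e7e5": 1.5, "e1g1": 3.0},  # model prefers castling...
    legal={"e2e4", "e7e5"},                    # ...but it's illegal here
)
# probs["e1g1"] == 0.0; the remaining mass splits between the legal moves
```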
AlphaZero, one of the most capable ML-based chess bots, uses a Convolutional Neural Network as the backbone of its position evaluation algorithm, but its real strength comes from reinforcement learning via self-play. I posit that it should be possible to replace the CNN with a transformer and achieve similar results. Currently ChessPT is trained purely on historical data, but I'd like to eventually bring RL and self-play into the picture. It could be efficient to pre-train for a few epochs on the Lichess database and then let RL take over. Throw in a Monte Carlo tree search and we've basically got AlphaZero again, but with a transformer at its core.
Finally, I hope to get around to taking a look at the attention weights themselves. I expect that there will be some heads or layers which keep track of individual pieces (e.g., moving a bishop for a second time should attend to the previous bishop move), those which keep track of checks (moves which break a check should attend very, very strongly to the move which put the king in check to begin with), or even castling (the actual castling move should attend to the moves which got the knight, bishop, and/or queen out of the way for the castle to occur).
1. Currently only part of the database
2. Some work has involved fine-tuning a GPT model, but nothing quite like this
3. For my money, 3Blue1Brown's explanation is easily the best, though incomplete at time of writing
4. En-passant is just a single diagonal move, and castling is moving either one or two spaces in a straight line
5. I say "should" because LLMs are spooky and horrifying and we can't currently predict what training data contributes to what outputs