Cloud Infrastructure

Overview

This website is a project for me to get some practice working with cloud services (AWS, in this case). While it started out as a shot at the Cloud Resume Challenge, it's now moved beyond that with the addition of ChessPT, so the infrastructure is mostly designed to support that for now. If I ever decide to extend ChessPT or add new projects, I'll update this page. Almost all of the AWS infrastructure was provisioned using Terraform.

Frontend

The frontend is quite straightforward. This site is a statically hosted HTML site with some fairly spartan CSS and the bare minimum of JavaScript needed to call the ChessPT API and display the board and move probabilities (chess.js for frontend chess logic, chessboard.js for the chess interface, and Chart.js for the move probability display). The files sit in a publicly accessible S3 bucket and are delivered through CloudFront. The domain and TLS certificate are both managed by AWS, and I have a redirect which ensures www and non-www requests both end up here.

Backend

The backend handles the model itself. It's accessed through GET and POST requests to API Gateway, which points to a Lambda function. The ChessPT model is small enough to fit entirely inside Lambda's memory limits. Partly because it's good practice to separate responsibilities and partly because the PyTorch dependency is quite large, I have the Lambda function and all its dependencies packaged as a Docker image, stored in the Elastic Container Registry. Lambda runs the container, which pulls the saved model (and the tokeniser) from a separate, private S3 bucket.
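As a rough sketch of that request path (not the actual ChessPT code), a Lambda handler behind an API Gateway proxy integration might look something like the following. The query parameter names pgn and player, the load_model placeholder and the shape of the response are all assumptions made for illustration; a real load_model would download the weights and tokeniser from the private S3 bucket and load them with PyTorch.

```python
import json

# Cached at module level so warm invocations of the same container
# can reuse the model instead of loading it on every request.
_MODEL = None


def load_model():
    """Hypothetical placeholder for the real loading step.

    Real code would fetch the saved model and tokeniser from S3 and
    load them with PyTorch. This stub just echoes its input.
    """
    def predict(pgn, player):
        return {"pgn": pgn, "player": player, "probabilities": {}}
    return predict


def get_model():
    """Load the model on first use and reuse it afterwards."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def handler(event, context):
    """Entry point for an API Gateway proxy event (GET request)."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    pgn = params.get("pgn")        # assumed parameter name
    player = params.get("player")  # assumed parameter name
    if pgn is None or player is None:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "pgn and player are required"}),
        }
    result = get_model()(pgn, player)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```

Keeping the model in a module-level variable means the expensive load only happens on a cold start, which matters when the model has to come from S3 first.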

The API itself is very simple: it accepts GET requests containing the current game state as a PGN along with which side the AI is playing, and responds with a new PGN including the next move, plus the probabilities of all the moves which could have been chosen. It also accepts POST requests with a new value for the model's 'temperature', which is used to increase the chance that the model selects a move other than the most likely one. Because GPT models are stateless (it's the user's responsibility to keep track of the current game state), we don't need to worry about tracking how the game is going in the backend, which is convenient.
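To make the temperature idea concrete, here is a minimal sketch of one common way to apply it: divide the model's raw scores (logits) by the temperature before the softmax, then sample from the result. This assumes ChessPT works along these lines, which this page doesn't spell out; the function names and the example scores are made up for illustration.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw move scores into probabilities, scaled by temperature.

    logits: dict mapping a move (e.g. "e4") to the model's raw score.
    A higher temperature flattens the distribution, so less likely
    moves get picked more often; a lower one sharpens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {move: score / temperature for move, score in logits.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled.values())
    exps = {move: math.exp(score - top) for move, score in scaled.items()}
    total = sum(exps.values())
    return {move: value / total for move, value in exps.items()}


def sample_move(probabilities, rng=random):
    """Pick one move at random, weighted by its probability."""
    moves = list(probabilities)
    weights = [probabilities[move] for move in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


# Made-up scores for three candidate moves.
example_logits = {"e4": 2.0, "d4": 1.5, "Nf3": 0.5}
cool = apply_temperature(example_logits, 0.5)
warm = apply_temperature(example_logits, 2.0)
```

With these made-up scores, the most likely move (e4) has a higher probability in cool than in warm, which is the effect described above: raising the temperature gives the other moves more of a chance.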

Future Work

There are a good few improvements I could make to this system that would bring it more in line with the bigger cloud ML projects out there. The most obvious, and the first on my list, would be using SageMaker to train the model, though I'd need to be careful not to exceed my budget of the few cents a month needed to keep my domain running. I'd like to do this in the near future, since having the full training pipeline in the cloud as well would be good practice and a single epoch of training takes something like a week on my machine.

Next, storing games which are actually played against ChessPT as potential training data may be on the table, but that would need some kind of cleaning step to make sure people aren't just abusing the API to poison the model.