By 苏剑林 | Nov 11, 2020
I wonder if readers have seen the article by QuantumBit at the beginning of the year titled "The Strongest Writing AI Actually Learned Chess and Composition, Language Model Cross-Border Operations Spark Heated Discussion, Seeking Opponents Online". It mentioned that a netizen trained a GPT-2 model to play International Chess. I have always been thinking, how can such an interesting thing not have a Chinese version? For International Chess, the Chinese counterpart is naturally Chinese Chess (Xiangqi). So, I have been wanting to replicate its results on Chinese Chess. After dragging it out for more than half a year, I finally completed it in the last few days. I would like to share it with everyone here.
Chess Formula: The General never leaves the Nine Palaces; the Advisors remain by his side and never exit the palace. The Bishops fly across the four directions to the four corners; the Knights move in one step and one diagonal jump. The Cannon must skip over one piece to hit another; the Rook moves in straight lines as it pleases. Only the Pawns move just one step; once across the river, they advance and move sideways without a trace.
In fact, by taking a simple look at the QuantumBit article, one can understand the principle of GPT-2 playing chess: it is essentially "memorizing manuals." Simply put, a chess game manual can be represented as a continuous text string. Since GPT-2 is skilled at memorizing text, we can use it to "memorize" human chess manuals. When playing, it can be viewed as "reciting" the next move based on the partial manual that has already been played. Therefore, the task can theoretically be completed using GPT-2.
To accomplish this, we need to understand how computers record chess games. Regarding standard notations, the ICCS notation and the FEN position representation are common. For details, refer to the articles "Chinese Chess Computer Application Specification (II): Move Representation" and "Chinese Chess Computer Application Specification (II): FEN File Format".
In simple terms, ICCS notation represents the chessboard using horizontal and vertical coordinates as shown in the figure below. Each move only needs to record the starting and ending coordinates. For example, "h2e2" means moving the piece originally at coordinate (h, 2) to (e, 2). If the current state is a new game, this corresponds to the move "Cannon Two moves to Five" (炮二平五). In this way, each move only requires 4 characters. A manual of $n$ steps becomes a string of length $4n$. Of course, for model input, it doesn't strictly have to follow this method; for instance, "h2" could be represented by a single ID and "e2" by another ID, meaning each grid point uses one coordinate instead of two. This reduces each move to just two IDs, shortening the sequence length. There is no fixed rule; interested readers can improve this on their own.
Diagram of Chinese Chess ICCS Coordinates
As for the FEN representation, it is used to describe which pieces are on the board and whose turn it is. The manuals modeled in this article are actually full-game manuals, so the model here technically doesn't need FEN (the board state defaults to a starting position). However, to facilitate improvements for interested readers, I will briefly introduce it. The FEN representation records what pieces are in each row. For example, the starting position is represented as "rnbakabnr/9/1c5c1/p1p1p1p1p/9/9/P1P1P1P1P/1C5C1/9/RNBAKABNR w - - 0 1". The red part represents the board state, where lowercase denotes Black and uppercase denotes Red. Different rows are separated by "/". The meanings of the letters are shown in the table below. Thus, "rnbakabnr" indicates the first row contains Black pieces: "Rook, Knight, Bishop, Advisor, General, Advisor, Bishop, Knight, Rook". "9" means the second row has 9 empty squares. "1c5c1" means the third row is "1 empty + 1 Black Cannon + 5 empty + 1 Black Cannon + 1 empty", and so on. The green part indicates whose turn it is: "w" for Red and "b" for Black. The remaining parts are generally less important; interested readers can check the linked sources.
\begin{array}{c} \text{Chinese Chess FEN Notation Meaning Table} \\ {\begin{array}{c|c|c|c|c|c|c|c} \hline \color{red}{R}/\color{black}{r} & \color{red}{N}/\color{black}{n} & \color{red}{B}/\color{black}{b} & \color{red}{A}/\color{black}{a} & \color{red}{K}/\color{black}{k} & \color{red}{C}/\color{black}{c} & \color{red}{P}/\color{black}{p} & \text{Arabic Num } m \\ \hline \color{red}{\text{俥}}/\color{black}{\text{車}} & \color{red}{\text{傌}}/\color{black}{\text{馬}} & \color{red}{\text{相}}/\color{black}{\text{象}} & \color{red}{\text{仕}}/\color{black}{\text{士}} & \color{red}{\text{帥}}/\color{black}{\text{將}} & \color{red}{\text{炮}}/\color{black}{\text{砲}} & \color{red}{\text{兵}}/\color{black}{\text{卒}} & \text{Represents } m \text{ consecutive empty spaces} \\ \hline \end{array}} \end{array}
After seeing the above introduction to notation, we know that whether it is move sequences or board states, they have been converted into strings of text. Since chess reasoning is based on positions and moves as inputs and outputs, chess modeling is theoretically a "text processing" problem! This is the theoretical basis for GPT-2 playing chess. Essentially, GPT is just BERT with a language model Attention Mask. Therefore, we could say this is BERT playing chess or GPT playing chess—both are fine~
There is not much to say about the model principle. Previous articles such as "From Language Models to Seq2Seq: Transformer is Like a Play, It All Depends on the Mask" have introduced it. Related examples include "Conditional Text Generation Based on Conditional Layer Normalization", and "What Grade Level can BERT Reach? Seq2Seq 'Hard-Coding' Primary School Math Word Problems". Readers can check them for reference. The processing in this article is simple: only full-game manuals are kept, the ICCS notation of the manual is treated as a long sentence, and then a language model is trained.
Project Link: https://github.com/bojone/gpt_cchess
The training process uses progressive training, meaning it gradually increases the sequence length rather than using a uniform length from the start. Some articles call this "Curriculum Learning." This approach can improve the model's training speed and convergence (sequences are shorter at the beginning, making training faster and easier to converge). Intuitively, it lets the model learn the "Opening" first, then "Opening + Midgame," and finally "Opening + Midgame + Endgame," gradually increasing the difficulty. BERT weights were loaded before training. Readers might wonder if BERT weights have anything to do with chess manuals. Actually, they don't, but using BERT weights is slightly better than completely random initialization, as it helps convergence a bit faster.
The model script also includes an implementation for interactive play with the model. Readers can experience it themselves. This interactive play uses the python cchess module for auxiliary implementation, for which I express my gratitude. GPT itself is a generative model, but when deciding which move to take, I do not use a generative method (because unconstrained generation might output illegal moves). Instead, I use a scoring method: I generate all legal moves for the current position, input them into the model to score them, and take the move with the highest score. This ensures every move the model outputs is legal, allowing games to continue until a win or loss occurs.
Interactive Playing Demo
Readers might wonder if enumerating all feasible moves would involve a huge amount of calculation. In fact, for any given board state, the number of legal moves is not large. It can be logically proven that it does not exceed 111 (surprising, isn't it? The candidate moves for any step in Chinese Chess do not exceed 111, rather than being an extremely large number). Thus, the batch_size for this step will not exceed 111, which is acceptable.
The derivation is simple: 1 Rook or 1 Cannon has at most 17 possible moves; 2 Rooks and 2 Cannons have at most 68 moves. If all pawns have crossed the river, each pawn has at most 3 moves; 5 pawns have at most 15 moves. 1 Knight has at most 8 moves, so 2 Knights have at most 16. 2 Bishops have at most 6 moves (1 in the center, 1 at the edge: 4+2). 1 Advisor in the center has at most 4 moves (2 Advisors actually block each other). Finally, the General has at most 2 moves. Thus, the total is $68+15+16+6+4+2=111$. For specific board designs, one can refer to the math forum topic "A difficult design problem for Chinese Chess positions."
What everyone probably cares about most is: how strong is the chess level of such a model? I performed a simple test, and the conclusion is roughly: it can play a fairly good opening and has decent adaptability during the opening. However, once it reaches the midgame, its adaptability drops significantly. It is not very sensitive to capturing pieces; if you capture its pieces randomly, it might not respond appropriately. It can be seen that these are weaknesses of purely memorizing manuals. Of course, as mentioned before, every move output is legal, so you can play with it until the game actually ends.
Some readers might ask if the model can improve its chess strength through self-play. Theoretically, it certainly can, but unfortunately, it hasn't been implemented here for two reasons: one is that I didn't want to bother, and the other is the lack of computing power. Additionally, increasing the model size should further improve strength. Note that my results used a Base version (100 million parameters); the netizen mentioned at the beginning who played International Chess used a 1.5 billion parameter GPT-2, which is 15 times larger than ours. Another area for improvement is that in the above modeling, we learned from the entire game manual. Ideally, for better chess strength, we should only learn from the winner's moves and avoid learning from the loser's moves.
Of course, even if these methods bring improvements, they are likely limited. Ultimately, this differs from how we understand the principles of playing chess. We play chess by projecting forward from a board state, but the language model approach mentioned above has no concept of a "position," or rather, its "position" must be determined by all previous moves. For the mid-to-late game, there are too many historical moves, which is asking too much of the model. The improvement method is actually simple: change the input to "position" and the output to "move." As we said earlier, a position can be represented as text using FEN, so this is just a Seq2Seq task.
Beyond that, there are other methods. For example, we could treat every position of the winner as a positive sample and every position of the loser as a negative sample. Then we could train a binary classification model to judge the quality of a position. With this judgment function, we could enumerate every feasible move and choose the optimal one based on the judgment function's result. Since a position can be represented as text, this means we have turned playing chess into a text classification task.
In short, thanks to the powerful text modeling capabilities of the Transformer model, our ideas for modeling chess have become simple and diverse~
This article attempted to use GPT via bert4keras to play Chinese Chess. The main idea is to use the "language model memorizing manuals" method to give the model the ability to predict the next move, and discussed some improvement ideas. Although the method in this article is not the standard way to model chess, it allows us to further experience the power of language models.
Everyone is welcome to report the chess intelligence of the models they've trained. Haha~
Giving you a pair of elephants (Xiang)
Original article address: https://kexue.fm/archives/7877
Citation: Su Jianlin. (Nov. 11, 2020). "When GPT Meets Chinese Chess: Written Articles, Solved Problems, How About a Game of Chess?". [Blog post]. Retrieved from https://kexue.fm/archives/7877