Simon Willison’s Weblog: OK, I can partly explain the LLM chess weirdness now

Source URL: https://simonwillison.net/2024/Nov/21/llm-chess/#atom-everything
Source: Simon Willison’s Weblog
Title: OK, I can partly explain the LLM chess weirdness now

Feedly Summary: OK, I can partly explain the LLM chess weirdness now
Last week Dynomight published Something weird is happening with LLMs and chess pointing out that most LLMs are terrible chess players with the exception of gpt-3.5-turbo-instruct (OpenAI’s last remaining completion as opposed to chat model, which they describe as “Similar capabilities as GPT-3 era models”).
After diving deep into this, Dynomight now has a theory. It’s mainly about completion models vs. chat models – a completion model like gpt-3.5-turbo-instruct naturally outputs good next-move suggestions, but something about reformatting that challenge as a chat conversation dramatically reduces the quality of the results.
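To make the completion vs. chat distinction concrete, here is a minimal sketch using the OpenAI Python SDK; the PGN prefix, prompt wording, and sampling parameters are illustrative assumptions, not taken from Dynomight's experiments:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O"  # illustrative game

# Completion model: the model simply continues the PGN text, the same
# format it saw during pretraining.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=pgn_prefix + " ",
    max_tokens=8,
    temperature=0,
)
print(completion.choices[0].text)

# Chat model: the same task has to be wrapped in a conversation, which is
# the reformatting that appears to hurt move quality.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a chess engine. Reply with the next move only."},
        {"role": "user", "content": f"Game so far: {pgn_prefix}\nNext move:"},
    ],
    max_tokens=8,
    temperature=0,
)
print(chat.choices[0].message.content)
```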
Through extensive prompt engineering, Dynomight got results out of GPT-4o that were almost as good as those from the 3.5 instruct model. The two tricks that had the biggest impact (both are sketched in the example after the list):

Examples. Including just three examples of inputs (with valid chess moves) and expected outputs gave a huge boost in performance.
"Regurgitation" – encouraging the model to repeat the entire sequence of previous moves before outputting the next move, as a way to help it reconstruct its context regarding the state of the board.

No non-OpenAI models have exhibited any talent for chess at all yet. I think that’s explained by the A.2 Chess Puzzles section of OpenAI’s December 2023 paper Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision:

The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining.
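OpenAI hasn’t published that pipeline, but the quoted sentence describes a simple rating filter. A sketch of that kind of filter using the python-chess library (the 1800 threshold is the only detail taken from the paper; everything else is an assumption) might look like this:

```python
import chess.pgn

def strong_games(pgn_path: str, min_elo: int = 1800):
    """Yield the PGN move sequence of games where both players are rated min_elo or higher."""
    with open(pgn_path, encoding="utf-8") as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:
                break  # end of file
            try:
                white = int(game.headers.get("WhiteElo", 0))
                black = int(game.headers.get("BlackElo", 0))
            except ValueError:
                continue  # skip games with missing or malformed ratings ("?")
            if white >= min_elo and black >= min_elo:
                # Export just the move sequence, i.e. the PGN move text mentioned above.
                yield game.board().variation_san(game.mainline_moves())

# Usage: for moves in strong_games("games.pgn"): print(moves)
```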

Via Hacker News
Tags: prompt-engineering, generative-ai, openai, gpt-4, ai, llms, training-data

AI Summary and Description: Yes

Summary: The text discusses the performance discrepancies of large language models (LLMs) in chess, particularly comparing OpenAI’s gpt-3.5-turbo-instruct and GPT-4o against other models. It highlights how the model’s interface (completion vs. chat) affects output quality and shares insights on improving performance through prompt engineering.

Detailed Description: The post centers on the performance of various LLMs when tasked with playing chess. Key points include:

– **Model Performance Differences**: LLMs have generally been noted to struggle with chess, with gpt-3.5-turbo-instruct as the notable exception. The text attributes this to its being a completion model rather than a chat model, which lets it produce better next-move suggestions.

– **Chat vs. Completion Models**: The main theory put forward is that the interface through which the model is driven (completion vs. chat) significantly impacts its performance: a completion model can simply continue a sequence of moves, while reframing the same task as a chat conversation degrades the quality of the output.

– **Prompt Engineering Enhancements**: Dynomight successfully improved the chess performance of GPT-4o by employing two key strategies:
  – **Providing Examples**: Including a few examples of valid inputs and expected outputs drastically improved the model’s performance.
  – **Regurgitation Encouragement**: Encouraging the model to restate the entire sequence of prior moves before making a suggestion helps it reconstruct the current board state.

– **Training Data Considerations**: OpenAI’s approach to training data is also highlighted, specifically the choice to include only higher-rated chess games (Elo 1800+) in the GPT-4 pretraining dataset, which may contribute to the models’ chess ability.

– **Lack of Competence in Other Models**: Aside from OpenAI’s, none of the models assessed so far has demonstrated noteworthy chess skill, indicating a significant gap in current LLM capabilities.

Overall, these insights underline the importance of model interface, prompting strategy, and training data in achieving desired outcomes in applications like chess, with practical implications for how AI models are developed and used in game-playing contexts.

– **Implications for Developers**:
– Importance of selecting the right model type and interface (completion vs. chat) for the use case.
– Need for effective prompt engineering strategies to enhance AI performance.
– Consideration of training data quality and selection criteria in developing robust AI systems.