LLMs Tackle General Intelligence: A Novel Benchmark Using Generated Games
Researchers at the University of California, Berkeley have introduced a new benchmark, called gg-bench, designed to assess the general reasoning capabilities of large language models (LLMs). Unlike traditional static benchmarks, gg-bench is a dynamic data-generating process that creates unique game environments on demand, providing a more robust test of a model’s ability to generalize.
The gg-bench benchmark works in three stages:
- Game Description Generation: An LLM is used to generate natural language descriptions of novel two-player, turn-based strategy games. For example, it might create a game called “Number Duel,” where players try to “capture” their opponent’s numbers by strategically selecting and timing attacks.
- Implementation Generation: The same LLM then translates these game descriptions into executable code, creating a Gym environment for each game. This includes defining the rules, state space, and actions available to the players. The “Number Duel” example would be coded with functions defining valid moves, state updates, and win conditions; a minimal sketch of such an environment appears after this list.
- Reinforcement Learning (RL) Agent Training: RL agents are trained to play these generated games using self-play, allowing them to learn strong strategies within each environment. This creates a challenging opponent against which to test the LLMs; a toy self-play sketch follows the environment example below.
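To make the second stage concrete, here is a minimal sketch of what a generated environment might look like for the hypothetical “Number Duel” game, written against the gymnasium API. The class name, rules, action encoding, and reward convention are illustrative assumptions made for this article, not code actually produced by the gg-bench pipeline.

```python
# Illustrative sketch of a generated game environment (not gg-bench's actual output).
# Assumed rules: each player holds numbers 1-5; playing a number captures the
# opponent's largest strictly smaller number. Three captures win; running out
# of numbers without three captures loses.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class NumberDuelEnv(gym.Env):
    """Two-player, turn-based 'Number Duel' toy game."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(5)           # which held number (1-5) to play
        self.observation_space = spaces.MultiBinary(10)  # remaining numbers for both players

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.hands = [set(range(1, 6)), set(range(1, 6))]  # numbers still held by each player
        self.captured = [0, 0]                              # captures made by each player
        self.current_player = 0
        return self._obs(), {}

    def valid_moves(self):
        """Legal actions (0-indexed numbers still in the current player's hand)."""
        return [n - 1 for n in sorted(self.hands[self.current_player])]

    def step(self, action):
        me, opp = self.current_player, 1 - self.current_player
        number = action + 1
        if number not in self.hands[me]:
            # Illegal move: immediate loss for the mover (a common convention).
            return self._obs(), -1.0, True, False, {}
        self.hands[me].discard(number)
        # Capture the opponent's largest number strictly smaller than ours, if any.
        smaller = [n for n in self.hands[opp] if n < number]
        if smaller:
            self.hands[opp].discard(max(smaller))
            self.captured[me] += 1
        won = self.captured[me] >= 3 or not self.hands[opp]
        lost = not won and not self.hands[me]
        done = won or lost
        reward = 1.0 if won else (-1.0 if lost else 0.0)  # reward is from the mover's perspective
        self.current_player = opp
        return self._obs(), reward, done, False, {}

    def _obs(self):
        obs = np.zeros(10, dtype=np.int8)
        for player in (0, 1):
            for n in self.hands[player]:
                obs[player * 5 + (n - 1)] = 1
        return obs
```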
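The paper trains its opponents with self-play, but the exact training setup is not reproduced here. The toy tabular Monte Carlo self-play loop below, written against the hypothetical environment above, is only meant to illustrate the idea of both sides learning from shared experience.

```python
# Toy self-play skeleton (illustrative only, not the paper's training setup):
# both players share one value table estimating average return per (state, action).
import random
from collections import defaultdict


def train_self_play(env, episodes=10_000, epsilon=0.1):
    value = defaultdict(lambda: [0.0, 0])  # (state, action) -> [total return, visits]

    def choose(state, moves):
        if random.random() < epsilon:
            return random.choice(moves)  # explore
        return max(moves, key=lambda a: value[(state, a)][0] / max(value[(state, a)][1], 1))

    for _ in range(episodes):
        obs, _ = env.reset()
        history = {0: [], 1: []}  # (state, action) pairs per player
        done, reward, player = False, 0.0, env.current_player
        while not done:
            state = tuple(obs)
            action = choose(state, env.valid_moves())
            history[player].append((state, action))
            obs, reward, done, _, _ = env.step(action)
            last_player, player = player, env.current_player
        # The final reward is from the perspective of whoever made the last move.
        for p in (0, 1):
            ret = reward if p == last_player else -reward
            for state, action in history[p]:
                value[(state, action)][0] += ret
                value[(state, action)][1] += 1
    return value
```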
The performance of the LLMs is then evaluated by measuring their win rate against these trained RL agents. The models are provided with the game description, the current game state, and the available moves, and are then tasked with selecting the best move to make.
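This evaluation can be pictured as a simple game loop in which the LLM and the trained agent alternate moves while the harness tracks wins. The sketch below assumes a hypothetical `query_llm` helper and an `rl_agent.select_action` interface; it is not the benchmark’s released evaluation code.

```python
# Sketch of the evaluation loop: the LLM plays against a trained RL agent and we
# record its win rate. `query_llm`, the prompt format, and `rl_agent.select_action`
# are assumptions for illustration, not gg-bench's actual harness.
def evaluate_llm(env, game_description, rl_agent, query_llm, num_games=50):
    wins = 0
    for _ in range(num_games):
        obs, _ = env.reset()
        done, reward = False, 0.0
        llm_player = 0  # the LLM moves first in this sketch
        while not done:
            moves = env.valid_moves()
            if env.current_player == llm_player:
                prompt = (
                    f"{game_description}\n"
                    f"Current state: {obs.tolist()}\n"
                    f"Available moves: {moves}\n"
                    "Reply with the single best move."
                )
                action = query_llm(prompt, allowed=moves)      # hypothetical helper
            else:
                action = rl_agent.select_action(obs, moves)    # hypothetical agent API
            mover = env.current_player
            obs, reward, done, _, _ = env.step(action)
        # The terminal reward is from the perspective of the player who moved last.
        if (mover == llm_player and reward > 0) or (mover != llm_player and reward < 0):
            wins += 1
    return wins / num_games
```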
Early results from gg-bench indicate that even state-of-the-art LLMs struggle with this benchmark. Models like GPT-4o and Claude 3.7 Sonnet achieve win rates of only 7–9%, while reasoning-focused models like o1, o3-mini, and DeepSeek-R1 perform significantly better, achieving win rates of 31–36%. This suggests that explicit reasoning capabilities are crucial for success in these generated game environments.
One of the key advantages of gg-bench is its scalability. As LLMs improve, the benchmark can be regenerated to create more difficult and complex games, preventing it from becoming saturated. The researchers also emphasize the benchmark’s controllability: the game generation prompts can be modified to focus on specific design elements or to adjust the difficulty level.
The researchers have released the generated games, the data generation process, and the evaluation code to foster further research and development in this area. This allows others to explore the benchmark, develop new LLMs capable of excelling in these dynamic environments, and potentially expand the benchmark itself.
gg-bench represents a significant step towards more robust and dynamic benchmarks for evaluating the general intelligence of AI systems. By challenging LLMs not only to understand and follow rules but also to strategize and adapt in novel game environments, gg-bench paves the way for the development of more versatile and capable AI.