Orak: A New Benchmark for LLM Gaming Agents Across Diverse Video Games
A new benchmark called Orak has been introduced to evaluate and train Large Language Model (LLM) agents across a wide range of video games. Unlike existing benchmarks, Orak emphasizes genre diversity, support for crucial agentic modules (such as memory and planning), and the availability of fine-tuning datasets.
The benchmark includes 12 popular video games spanning major genres: action, adventure, strategy, puzzle, role-playing, and simulation. Examples include Street Fighter III, Super Mario, Ace Attorney, Minecraft, and Stardew Valley. This diversity supports a comprehensive assessment of LLM capabilities, such as fine-grained player control (action games), long-term memory (adventure games), and complex reasoning (strategy and puzzle games).
A key innovation of Orak is its plug-and-play interface built on the Model Context Protocol (MCP), which lets different LLMs connect seamlessly to the games and mix and match agentic modules. Each game and each agentic module (for example, a “reflection” module that lets the agent learn from its mistakes) runs as an independent MCP server. Through this interface, the LLM can query the current game state, execute actions guided by its agentic strategies, and receive the game’s feedback.
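To make the plug-and-play idea concrete, here is a minimal sketch of what a game-side MCP server could look like, written against the official MCP Python SDK’s FastMCP helper. The tool names `get_game_state` and `execute_action` and the `GameEnv` wrapper are illustrative assumptions, not the paper’s actual interface.

```python
# Minimal sketch: one game exposed as an MCP server via the official
# MCP Python SDK. Tool names and GameEnv are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("stardew-valley")  # one MCP server per game

class GameEnv:
    """Stand-in for a real game wrapper (hypothetical)."""
    def observe(self) -> str:
        return "day=3 energy=120 location=farm crops=[parsnip:2d]"
    def step(self, action: str) -> str:
        return f"executed '{action}'; energy now 110"

env = GameEnv()

@mcp.tool()
def get_game_state() -> str:
    """Return a text description of the current game state."""
    return env.observe()

@mcp.tool()
def execute_action(action: str) -> str:
    """Apply an action chosen by the LLM and return the game's feedback."""
    return env.step(action)

if __name__ == "__main__":
    mcp.run()  # serve over stdio so any MCP-capable LLM client can connect
```

Because each game and each agentic module sits behind the same protocol, swapping an LLM or adding a module like reflection becomes a matter of connecting a different client or server rather than rewriting integration code.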
Orak also provides a fine-tuning dataset of LLM gameplay trajectories across diverse game genres. This dataset, generated by expert LLMs (e.g., GPT-4), encapsulates “meta-knowledge” on how to effectively use different agentic strategies in various game types. This enables more resource-efficient transfer of skills from larger LLMs to smaller ones.
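The paper does not spell out the released schema here, but a single step of such a trajectory dataset plausibly pairs a game state (plus any agentic-module context) with the expert LLM’s chosen action. The sketch below is a hedged illustration; all field names are assumptions.

```python
# Hedged sketch of one supervised fine-tuning record. Field names are
# illustrative assumptions; the released dataset's schema may differ.
import json

trajectory_step = {
    "game": "street_fighter_iii",
    "agentic_strategy": "reflection",          # which module produced the context
    "state": "round 1, enemy at mid range, own HP 78%, enemy HP 64%",
    "scratchpad": "Last jump-in was punished; prefer pokes at this range.",
    "expert_action": "crouching medium kick",  # label distilled from the expert LLM
}

# Serialized as JSONL, such records become (prompt, completion) pairs for
# fine-tuning a smaller model on the expert LLM's behavior.
print(json.dumps(trajectory_step))
```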
The benchmark offers multiple evaluation dimensions:
- General game score leaderboards: Overall scores across the 12 games, providing an at-a-glance performance metric.
- LLM battle arenas: Competitive scenarios in games like Street Fighter III and StarCraft II, where LLMs compete head-to-head (a minimal sketch of such an arena follows this list).
- In-depth analyses: Assessments of visual input state understanding, agentic strategies, and the effects of fine-tuning.
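For the battle arenas, the basic shape is a round-robin tournament over LLM pairs. The sketch below is a hypothetical illustration, not the benchmark’s actual arena code: the match itself is stubbed with a coin flip, whereas in Orak each match would be played inside Street Fighter III or StarCraft II through the MCP layer.

```python
# Hypothetical round-robin battle arena. play_match is a stub; the real
# benchmark resolves each match by actually playing the game.
import itertools
import random
from collections import Counter

def play_match(llm_a: str, llm_b: str) -> str:
    """Stand-in for a real head-to-head game; returns the winner's name."""
    return random.choice([llm_a, llm_b])

def round_robin(llms: list[str], rounds: int = 10) -> Counter:
    """Play every pair of LLMs `rounds` times and tally wins."""
    wins: Counter = Counter()
    for a, b in itertools.combinations(llms, 2):
        for _ in range(rounds):
            wins[play_match(a, b)] += 1
    return wins

if __name__ == "__main__":
    standings = round_robin(["model-a", "model-b", "model-c"])
    for name, w in standings.most_common():
        print(f"{name}: {w} wins")
```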
Experiments using 12 LLMs revealed that proprietary LLMs generally outperform open-source models, but the performance gap narrows in battle games. Fine-tuning on the provided gameplay trajectories allows smaller LLMs to effectively transfer gameplay knowledge from larger LLMs, even generalizing to unseen game scenarios.
The creators of Orak believe that it not only establishes a foundation for developing effective gaming LLM agents but also serves as a critical benchmark for evaluating general LLMs on realistic, long-horizon decision-making tasks. The code and resources are available at https://github.com/krafton-ai/Orak.