AI Papers Reader

Personalized digests of latest AI research

ByteDance Unveils Lumine: The First AI Agent to Master Hours-Long Missions in 3D Open Worlds

ByteDance Seed has announced the development of Lumine, a new generalist artificial intelligence agent capable of completing complex, hours-long missions in expansive 3D open-world environments in real time. Lumine, described by the researchers as the “first open recipe” for building such agents, demonstrates human-level efficiency in complex video game environments, successfully navigating, fighting, and problem-solving without needing explicit game APIs.

Lumine was primarily trained using the highly popular open-world action RPG Genshin Impact as its testbed. Crucially, the VLM-based (Vision-Language Model) agent interacts with the game exactly like a human player: processing raw screen pixels (1280x720 resolution) and issuing precise keyboard and mouse actions, unifying perception, reasoning, and action in an end-to-end loop.

The key to Lumine’s success in real-time, dynamic environments is its innovative hybrid thinking strategy. Unlike traditional agents that reason at every step—a computationally slow process—Lumine adaptively invokes explicit reasoning, generating an “inner monologue” only when necessary, such as when a plan needs adjusting or a new objective appears. Otherwise, it generates actions directly for efficient, low-latency control.
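The control flow behind this hybrid strategy can be illustrated with a minimal sketch. Everything here is hypothetical: in Lumine the model itself decides when to emit a thought, whereas this toy version uses a hand-written rule (`needs_reasoning`) and made-up observation fields (`new_objective`, `plan_invalid`) purely to show the shape of "reason only when needed, otherwise act directly."

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    thought: Optional[str]  # inner monologue, or None when the agent acts directly
    action: str


def needs_reasoning(observation: dict, current_goal: Optional[str]) -> bool:
    """Hypothetical trigger standing in for the model's own decision."""
    return (
        current_goal is None                              # nothing planned yet
        or observation.get("plan_invalid", False)         # plan needs adjusting
        or observation.get("new_objective") is not None   # new objective appeared
    )


def hybrid_step(observation: dict, current_goal: Optional[str]) -> Tuple[Step, str]:
    """Produce one step: a thought plus an action, or an action alone."""
    if needs_reasoning(observation, current_goal):
        goal = observation.get("new_objective") or current_goal or "explore"
        thought = f"Re-planning: current goal is '{goal}'."
        return Step(thought, f"act_toward:{goal}"), goal
    # Fast path: no explicit reasoning, just emit the next action.
    return Step(None, f"act_toward:{current_goal}"), current_goal


def run(observations: List[dict]) -> List[Step]:
    goal: Optional[str] = None
    steps: List[Step] = []
    for obs in observations:
        step, goal = hybrid_step(obs, goal)
        steps.append(step)
    return steps


if __name__ == "__main__":
    trace = run([{}, {}, {"new_objective": "open chest"}, {}, {"plan_invalid": True}])
    for i, s in enumerate(trace):
        print(i, "THINK" if s.thought else "ACT  ", s.action)
```

In this toy trace, explicit reasoning happens on only three of the five steps; the other two take the low-latency path and emit an action directly.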

To achieve fluid performance in a fast-paced environment, Lumine processes visual input at 5 Hz but outputs actions at a high frequency of 30 Hz using a mechanism called “action chunking.” This means it generates sequences of precise keyboard presses and relative mouse movements (e.g., turning the camera 92 units right while dashing forward with Shift + W), enabling the complex, continuous control required for 3D navigation and combat.
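The arithmetic behind these two rates is worth spelling out: at 5 Hz perception and 30 Hz action output, each observed frame must yield 30 / 5 = 6 actions, each covering roughly 33 ms. The sketch below illustrates this under stated assumptions. The `Action` type, the even split of mouse movement across the chunk, and the scheduling helper are hypothetical and chosen for illustration; the article does not describe how Lumine distributes movement within a chunk.

```python
from dataclasses import dataclass
from typing import List, Tuple

PERCEPTION_HZ = 5   # frames observed per second (from the article)
ACTION_HZ = 30      # actions emitted per second (from the article)
ACTIONS_PER_CHUNK = ACTION_HZ // PERCEPTION_HZ  # 6 actions per observed frame


@dataclass
class Action:
    mouse_dx: int            # relative horizontal mouse movement
    mouse_dy: int            # relative vertical mouse movement
    keys: Tuple[str, ...]    # keys held during this action step


def split_evenly(total: int, parts: int) -> List[int]:
    """Split an integer into `parts` integers that sum to `total`."""
    base, rem = divmod(total, parts)
    return [base + (1 if i < rem else 0) for i in range(parts)]


def make_chunk(total_dx: int, keys: Tuple[str, ...]) -> List[Action]:
    """Assumption: spread the camera turn evenly over the chunk."""
    return [Action(dx, 0, keys) for dx in split_evenly(total_dx, ACTIONS_PER_CHUNK)]


def schedule(chunk: List[Action], frame_start_s: float) -> List[Tuple[float, Action]]:
    """Assign each action in the chunk its start time within the frame window."""
    step = 1.0 / ACTION_HZ
    return [(frame_start_s + i * step, a) for i, a in enumerate(chunk)]


if __name__ == "__main__":
    # The article's example: turn the camera 92 units right while holding Shift + W.
    chunk = make_chunk(92, ("shift", "w"))
    for t, a in schedule(chunk, frame_start_s=0.0):
        print(f"t={t:.3f}s dx={a.mouse_dx} keys={'+'.join(a.keys)}")
```

The design point is that the model is queried only five times per second, while the game still receives input at 30 Hz, so control stays continuous between observations.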

Human-Level Performance and Zero-Shot Transfer

The results are striking. Lumine successfully completed the entire five-hour Mondstadt main storyline in Genshin Impact in just 56 minutes, matching the efficiency of expert human players (average expert time is 53 minutes). This achievement spans a broad spectrum of in-game challenges, including 3D exploration, real-time combat, puzzle-solving (like activating elemental mechanisms), and precise GUI manipulation (such as character revival).

Beyond its training domain, Lumine exhibited remarkable zero-shot generalization—the ability to perform tasks in new games without any fine-tuning.

For instance, the agent completed a 100-minute main storyline mission in Wuthering Waves, a different open-world ARPG. It also completed the full five-hour first chapter of Honkai: Star Rail, a turn-based RPG with a hub-based world design. Together, these results suggest that its core skills, such as 3D navigation and 2D UI manipulation, transfer across distinct genres and interaction dynamics.

The Lumine project, built upon a 7B-parameter VLM and optimized for real-time inference (achieving a 25.3x latency speedup over baseline models), marks a significant step toward general-purpose agents capable of sustained, high-level decision-making in complex digital worlds. Researchers suggest the “recipe” can be scaled further, potentially accelerating applications in quality assurance, game testing, and large-scale usability evaluation.