AI Papers Reader

Personalized digests of the latest AI research


Gemini Robotics: Google DeepMind Unveils Next-Gen AI for Embodied Agents

San Francisco, CA - Google DeepMind has announced a new family of AI models, called Gemini Robotics, designed specifically to bridge the gap between general AI capabilities and real-world robotic applications. This breakthrough promises to bring unprecedented levels of understanding and interaction to robots, paving the way for more capable and versatile automated systems.

Gemini Robotics, outlined in a recently released white paper, is built on top of Google’s powerful Gemini 2.0 foundation model. This approach allows robots to understand their surroundings, interpret complex instructions, and react dynamically to changes in their environment.

“Imagine a robot that can not only navigate your home but also understand that the red tomato needs to be placed into the salad bowl, while the toy should be packed into its designated box,” says a lead researcher on the Gemini Robotics team. “That’s the level of embodied reasoning we’re aiming for.”

The new Gemini Robotics family comprises two core models:

  • Gemini Robotics-ER (Embodied Reasoning): This model enhances spatial and temporal understanding, allowing robots to perceive their environment in three dimensions, predict trajectories, and determine optimal grasp points. For example, given an image of a cluttered desk, Gemini Robotics-ER can identify the stapler, predict a trajectory for a robot arm to reach it, and determine the precise location and angle for a secure grasp.
  • Gemini Robotics: This advanced Vision-Language-Action (VLA) model directly controls robot actions based on visual input and natural language instructions. It enables robots to perform a wide range of complex manipulation tasks, such as folding an origami fox or playing a game of cards. For example, a user could instruct a robot to “pack a lunchbox,” and Gemini Robotics would understand the command, identify the necessary items, and execute the task.
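The closed observe-then-act pattern a VLA model implies can be sketched in a few lines. The sketch below is purely illustrative and assumes nothing about DeepMind's actual API: `StubVLAPolicy`, `Observation`, and `Action` are hypothetical names, and the policy's behavior is hard-coded in place of a real vision-language model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """A camera frame (stubbed as a label here) plus the user's instruction."""
    image: str
    instruction: str


@dataclass
class Action:
    """A low-level arm command: joint deltas plus a gripper state."""
    joint_deltas: List[float]
    gripper_closed: bool


class StubVLAPolicy:
    """Stand-in for a vision-language-action model: maps an observation to
    an action. A real VLA model would run vision and language encoders;
    this stub hard-codes one behavior for illustration."""

    def act(self, obs: Observation) -> Action:
        if "grasp" in obs.instruction:
            return Action(joint_deltas=[0.1, -0.05, 0.0], gripper_closed=True)
        return Action(joint_deltas=[0.0, 0.0, 0.0], gripper_closed=False)


def control_loop(policy: StubVLAPolicy, instruction: str, steps: int = 3) -> List[Action]:
    """Closed loop: re-observe and re-query the policy at every step, so the
    controller can react dynamically to changes in the scene."""
    actions = []
    for t in range(steps):
        obs = Observation(image=f"frame_{t}", instruction=instruction)
        actions.append(policy.act(obs))
    return actions


actions = control_loop(StubVLAPolicy(), "grasp the stapler")
print(actions[0].gripper_closed)  # True
```

Re-querying the policy each step, rather than planning once up front, is what lets such a system react when an object moves mid-task.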

What sets Gemini Robotics apart is its ability to generalize and adapt. The models can be fine-tuned to control completely new robot embodiments, such as a bi-armed industrial robot or a humanoid robot. This adaptability opens up possibilities for a variety of robotic applications, from manufacturing to elder care.

To evaluate the performance of Gemini Robotics, the team introduced a new benchmark called Embodied Reasoning Question Answering (ERQA). ERQA consists of 400 VQA-style questions testing spatial reasoning, action reasoning, trajectory reasoning, and related embodied capabilities. This benchmark will allow the research community to perform standardized evaluation of general embodied reasoning.
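Scoring a VQA-style benchmark of this kind typically reduces to accuracy over question items. The sketch below is a hypothetical scorer, not ERQA's actual harness; the item fields and the `always_first` baseline are illustrative assumptions.

```python
from typing import Callable, Dict, List


def evaluate(items: List[Dict], predict: Callable[[Dict], str]) -> float:
    """Accuracy: the fraction of items the model answers correctly."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)


# Two toy items in the style of embodied-reasoning VQA questions.
items = [
    {"question": "Which object is left of the mug?",
     "choices": ["stapler", "lamp"], "answer": "stapler"},
    {"question": "Is the shelf within the arm's reach?",
     "choices": ["yes", "no"], "answer": "no"},
]

# A trivial baseline that always picks the first choice.
always_first = lambda item: item["choices"][0]
print(evaluate(items, always_first))  # 0.5
```

A fixed answer format and an exact-match scorer like this are what make cross-model comparisons on such a benchmark standardized.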

Importantly, the paper also addresses safety concerns related to AI-powered robots. The team describes safety mitigation frameworks covering both the embodied reasoning and action output modalities, aligned with Google’s AI Principles.

The Gemini Robotics paper marks a step towards creating truly general-purpose robots capable of interacting with the physical world safely and competently. While challenges remain, the advancement in AI-driven robotics could transform industries and enhance human lives in ways previously thought impossible.