EmbodiedEval: A New Benchmark for Multimodal LLMs in Embodied AI
A new benchmark, EmbodiedEval, is pushing the boundaries of how we evaluate multimodal large language models (MLLMs) in embodied AI. Developed by researchers at Tsinghua University, EmbodiedEval moves beyond static image or video-based evaluations, focusing instead on the real-world challenges faced by embodied agents interacting within dynamic 3D environments. The paper detailing this work, published on arXiv, highlights the significant shortcomings of current state-of-the-art MLLMs when tasked with complex, interactive scenarios.
Existing benchmarks often rely on static images or videos, limiting their assessment to non-interactive scenarios. Embodied AI benchmarks, while interactive, are often overly task-specific and lack diversity, failing to provide a comprehensive evaluation of an MLLM’s capabilities. EmbodiedEval tackles these limitations by introducing a rich and diverse benchmark.
A Diverse and Interactive Evaluation
EmbodiedEval comprises 328 distinct tasks across 125 varied 3D scenes. The tasks are grouped into five categories (a sketch of how such a task might be encoded follows the list):
- Navigation: Guiding the agent through a 3D space using natural-language instructions. For example, a task might instruct the agent to “Go to the kitchen, then come back and tell me if there are any extra cups.” This tests the agent’s ability to understand spatial relationships, follow instructions, and perceive its surroundings.
- Object Interaction: Manipulating objects within the environment. A sample task might be “Wash the apple near the sink and place the apple in the fridge,” which requires the agent to identify the objects, understand their properties, and execute the necessary actions.
- Social Interaction: Evaluating the model’s ability to interact with virtual humans through dialogue, instructions, and feedback. A request such as “I’m hungry, could you please buy me a loaf of bread? The bread section is right across from the checkout counter.” demands a nuanced understanding of human communication and of what completing the task entails.
- Attribute Question Answering: Answering questions about object attributes, for example: “Which famous masterpiece is the painting hanging on the wall imitating?”
- Spatial Question Answering: Testing the model’s understanding of spatial relationships and reasoning. A sample task: “Assess whether a new two-foot-long painting can be hung on the wall of the current exhibition hall without blocking other paintings and maintaining an aesthetic arrangement.”
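To make the taxonomy concrete, here is a minimal sketch of how a benchmark task of this kind could be represented. This is not the authors' actual schema; every field name here is illustrative.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    NAVIGATION = "navigation"
    OBJECT_INTERACTION = "object_interaction"
    SOCIAL_INTERACTION = "social_interaction"
    ATTRIBUTE_QA = "attribute_qa"
    SPATIAL_QA = "spatial_qa"

@dataclass
class EmbodiedTask:
    """One evaluation episode: a scene, an instruction, and a step budget."""
    scene_id: str        # which of the 125 3D scenes to load (name is illustrative)
    task_type: TaskType  # one of the five categories above
    instruction: str     # natural-language goal given to the agent
    max_steps: int = 50  # illustrative per-episode step budget

# Example instance mirroring the object-interaction task quoted above.
task = EmbodiedTask(
    scene_id="kitchen_07",
    task_type=TaskType.OBJECT_INTERACTION,
    instruction="Wash the apple near the sink and place the apple in the fridge.",
)
```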
EmbodiedEval’s simulation environment combines synthetic and realistic scenes, giving the benchmark broader coverage than either alone. Its action space is discrete and semantically interpretable, so each step an agent takes can be read, logged, and scored directly.
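A discrete, semantically labeled action space implies a simple evaluation loop like the sketch below. The `env`/`agent` interfaces and action names here are assumptions for illustration, not EmbodiedEval's actual API.

```python
# Illustrative discrete actions: each one is human-readable, so the MLLM
# can select an action by name rather than emit low-level control signals.
ACTIONS = [
    "move_forward", "turn_left", "turn_right",
    "pick_up(apple)", "open(fridge)", "answer('...')",
]

def run_episode(env, agent, max_steps=50):
    """Roll out one task episode and report whether it succeeded."""
    obs = env.reset()  # initial egocentric observation of the 3D scene
    for _ in range(max_steps):
        # The agent picks one action from the discrete set each step.
        action = agent.act(obs, ACTIONS)
        obs, done, success = env.step(action)  # assumed return signature
        if done:
            return success
    return False  # step budget exhausted without completing the task
```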
Significant Shortfalls of Current MLLMs
The researchers evaluated several state-of-the-art MLLMs on EmbodiedEval, including both closed-source models (e.g., GPT-4) and open-source models. The results revealed a significant gap between the performance of MLLMs and humans. Human participants achieved a success rate of 97.26%, highlighting how far current MLLMs are from replicating human-level performance in embodied tasks. Even the best-performing models struggled to attain success rates beyond 32%, often falling considerably short, especially in tasks involving complex interactions or long sequences of actions.
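The headline numbers are per-episode success rates. As a sanity check on the reported human baseline, here is how such a score aggregates; the 319-of-328 split below is inferred from the 97.26% figure and the 328-task count, not stated by the authors.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of episodes completed successfully, as a percentage."""
    return 100.0 * sum(results) / len(results)

# 319 successes out of 328 episodes reproduces the reported human baseline.
print(f"{success_rate([True] * 319 + [False] * 9):.2f}%")  # -> 97.26%
```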
Open Source and Future Directions
EmbodiedEval’s data and simulation framework are publicly available, encouraging further development and improvement in the field. The researchers point out key areas that need improvement in MLLMs, such as more robust spatial reasoning, improved planning capabilities, better handling of long-horizon tasks, and enhanced grounding of language to actions within the environment. EmbodiedEval promises to be a crucial tool in driving progress in embodied AI and pushing the boundaries of MLLM capabilities.