AI Papers Reader

Personalized digests of latest AI research


AI Agents Learn to Multitask: New Model Drastically Cuts Time for Household Chores

Embodied artificial intelligence (AI) agents tasked with completing instructions in real-world environments have long suffered from a fundamental inefficiency: sequential thinking. When asked to “Prepare the kitchen for cooking,” an agent typically waits for one subtask to finish—like a 15-minute microwave cycle—before starting the next, wasting valuable time.

A new paper, “Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution,” introduces a novel approach called Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D). This framework equips embodied agents with scheduling intelligence, allowing them to execute parallel subtasks and utilize waiting periods for maximum efficiency.

The Problem with Sequential Planning

Traditional AI planning models, while capable of generating plausible step-by-step actions, overlook temporal logic and efficiency constraints. For example, a simple sequence of tasks might take 24 minutes if performed sequentially: washing a sink (4 mins), using the microwave to heat food (15 mins), and wiping the counter (5 mins).

The core insight of ORS3D is that many tasks are parallelizable. While the microwave runs for 15 minutes, the agent is free to work on other subtasks.

By leveraging Operations Research (OR) principles—the same logic used for complex industrial scheduling—the system learns to reorder actions. The agent initiates the 15-minute microwave task and, during the waiting period, efficiently completes the non-parallelizable subtasks, such as washing the sink and wiping the counter. The total completion time is drastically reduced from 24 minutes to just 15 minutes.
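The arithmetic behind this speedup can be sketched in a few lines. This is an illustrative simplification, not the paper's actual scheduling formulation: it assumes every parallelizable subtask only needs a negligible moment of the agent's attention to start (e.g., pressing the microwave button), after which the agent is free.

```python
# Illustrative sketch of why parallel scheduling helps. A "parallelizable"
# subtask occupies a device but frees the agent; a "non-parallelizable"
# subtask occupies the agent for its full duration.

def makespan(subtasks):
    """Completion time when the agent starts every parallelizable
    subtask first, then fills the waiting period with the subtasks
    that need its hands. Each subtask is (duration, parallelizable)."""
    device_time = max(
        (d for d, parallel in subtasks if parallel), default=0)
    agent_time = sum(d for d, parallel in subtasks if not parallel)
    # The agent works while devices run, so the schedule finishes
    # when the slower of the two streams finishes.
    return max(device_time, agent_time)

kitchen = [
    (15, True),   # microwave: start it, then walk away
    (4, False),   # wash the sink
    (5, False),   # wipe the counter
]

sequential = sum(d for d, _ in kitchen)   # 24 minutes
parallel = makespan(kitchen)              # 15 minutes
```

With the agent's 9 minutes of hands-on work hidden inside the microwave's 15-minute cycle, the makespan collapses from 24 minutes to 15, exactly the kitchen example above.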

Introducing GRANT and ORS3D-60K

To train agents in this sophisticated multitasking ability, the researchers constructed ORS3D-60K, the largest dataset of its kind. Comprising over 60,000 composite tasks across 4,000 real-world scenes, the dataset is unique in that it integrates language understanding, 3D spatial grounding, and OR scheduling constraints. Critically, for an agent to execute these parallel plans, it must accurately locate the target objects (e.g., identifying the “oblong-shaped sink” and its precise 3D location) at every step—a complex multimodal challenge.

To tackle this, the team developed GRANT (Grounded Task Scheduling Agent), an embodied Multi-modal Large Language Model (MLLM). GRANT uses a simple yet effective Scheduling Token Mechanism (STM) that acts as a bridge, connecting the MLLM’s language capabilities to an external optimization solver.

First, the MLLM identifies which subtasks are parallelizable (like using the microwave) and which are non-parallelizable (like wiping a counter). This information is fed to the optimization solver, which rapidly generates the mathematically optimal schedule. This optimized plan is then injected back into the MLLM, guiding it to generate step-wise action descriptions alongside accurate 3D spatial groundings for the target objects.
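The two-stage flow above can be sketched as follows. Everything here is a stand-in for illustration: the keyword lookup fakes the MLLM's classification (which in GRANT is predicted from language and 3D context), and the greedy ordering fakes the external optimization solver.

```python
# Hypothetical sketch of the classify -> solve -> inject pipeline.
# Function names and logic are illustrative assumptions, not GRANT's API.

def classify_subtasks(instruction_steps):
    # Stand-in for the MLLM: mark which subtasks run on a device
    # while the agent is free. Here, a simple keyword lookup.
    DEVICE_KEYWORDS = {"microwave"}
    return [(step, dur, any(k in step for k in DEVICE_KEYWORDS))
            for step, dur in instruction_steps]

def solve_schedule(classified):
    # Stand-in for the external solver: launch device-run subtasks
    # first, then order the agent-run subtasks to fill the wait
    # (longest first, an arbitrary tie-breaking choice).
    device = [t for t in classified if t[2]]
    agent = sorted((t for t in classified if not t[2]),
                   key=lambda t: -t[1])
    return device + agent

steps = [("use microwave", 15), ("wash sink", 4), ("wipe counter", 5)]
plan = solve_schedule(classify_subtasks(steps))
# The plan starts with the microwave, then fills the wait with agent work.
```

In GRANT, the analogue of `plan` is injected back into the MLLM, which then emits step-wise action descriptions plus the 3D grounding for each target object.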

Experiments on the ORS3D-60K benchmark showed that GRANT achieves a significant improvement in efficiency. Measured by the Time Efficiency (TE) metric—which normalizes performance between the naive sequential baseline and the optimal schedule—GRANT demonstrated a 30.53% gain in task completion time efficiency compared to previous state-of-the-art methods.

This research establishes a new standard for embodied AI, moving agents beyond rigid step-by-step execution and paving the way for truly time-efficient robots capable of managing complex, real-world composite tasks.