Learning Interactive Behavior Models From Casual Longitudinal Videos

A research team from Meta and Carnegie Mellon University has introduced a new framework, called Agent-to-Sim (ATS), for learning interactive behavior models of 3D agents from casually captured video collections. The work, published on arXiv, addresses a key challenge in AI: how to accurately model the behavior of agents, such as humans and animals, in complex real-world environments.

Traditional approaches to learning agent behavior often rely on expensive, studio-based motion capture, which is restrictive and doesn’t capture the full range of natural behaviors. ATS, on the other hand, is designed to learn from casual videos recorded over extended periods, like a month, using a single smartphone. This means it can learn from a much broader range of behaviors and interactions than studio-based approaches.

The key innovation of ATS lies in its ability to reconstruct a complete and persistent 4D representation of the scene from casually captured videos. The reconstruction captures not only the 3D geometry of the scene but also the trajectories of both the agent and the observer (the person filming the videos). This persistent representation is what makes interactive behaviors learnable: it lets the model see how the agent's actions are shaped by the scene and by the observer.
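To make that structure concrete, here is a minimal Python sketch of how such a persistent 4D record might be organized: canonical scene geometry alongside time-stamped agent and observer poses. The class and field names are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch (not the authors' code): one way to organize a persistent
# 4D record of a video collection -- canonical scene geometry plus time-stamped
# agent and observer trajectories. All names here are hypothetical.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FourDScene:
    scene_points: np.ndarray                                        # (P, 3) canonical scene geometry, e.g. a point cloud
    timestamps: list[float] = field(default_factory=list)
    agent_poses: list[np.ndarray] = field(default_factory=list)     # per-frame agent root pose, 4x4 SE(3)
    observer_poses: list[np.ndarray] = field(default_factory=list)  # per-frame camera pose, 4x4 SE(3)
    agent_joints: list[np.ndarray] = field(default_factory=list)    # per-frame body articulation, (J, 3)

    def add_frame(self, t, agent_pose, observer_pose, joints):
        """Append one registered video frame to the persistent 4D record."""
        self.timestamps.append(t)
        self.agent_poses.append(agent_pose)
        self.observer_poses.append(observer_pose)
        self.agent_joints.append(joints)

    def observer_in_agent_frame(self, i):
        """Observer pose expressed in the agent's frame -- the kind of signal
        a behavior model can condition on."""
        return np.linalg.inv(self.agent_poses[i]) @ self.observer_poses[i]
```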

To build this 4D reconstruction, ATS uses a novel coarse-to-fine registration approach that leverages large pretrained image models, such as DINO-v2, to register the camera with respect to canonical spaces of both the agent and the scene. This step aligns the video frames across time and yields a consistent representation of the environment.
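The sketch below illustrates the general coarse-to-fine idea under some assumptions: patch features (e.g., from DINO-v2) have already been extracted for the query frame and precomputed for points of the canonical 3D model, and OpenCV's PnP solvers stand in for whatever optimization the paper actually uses. The input names and the overall function are illustrative, not the authors' implementation.

```python
# A minimal sketch (not the authors' pipeline) of coarse-to-fine camera
# registration against a canonical 3D model. Assumed inputs: patch features for
# the query frame (frame_feats) with their pixel locations (frame_pix),
# precomputed features and positions for canonical 3D points (canon_feats,
# canon_xyz), and the 3x3 camera intrinsics K. All names are hypothetical.
import numpy as np
import cv2

def register_frame(frame_feats, frame_pix, canon_feats, canon_xyz, K):
    # Coarse stage: match image patches to canonical points in feature space,
    # then solve a PnP problem from the resulting 2D-3D correspondences.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = canon_feats / np.linalg.norm(canon_feats, axis=1, keepdims=True)
    nn = (f @ c.T).argmax(axis=1)                 # nearest canonical point per patch
    pts3d = canon_xyz[nn].astype(np.float64)
    pts2d = frame_pix.astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    if not ok:
        return rvec, tvec                         # fall back to the coarse estimate

    # Fine stage: refine the coarse pose by minimizing reprojection error
    # over the inlier correspondences (Levenberg-Marquardt).
    keep = inliers.ravel()
    rvec, tvec = cv2.solvePnPRefineLM(pts3d[keep], pts2d[keep], K, None, rvec, tvec)
    return rvec, tvec                             # camera pose w.r.t. the canonical space
```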

Once the 4D representation is built, ATS trains a generative model of agent behaviors on paired perception-and-motion data. The model is hierarchical: it first predicts a goal, then a path toward that goal, and finally the full body pose along the path, all conditioned on the agent's ego-perception of the scene and on the observer's trajectory. As a result, the generated behaviors respond both to the agent's understanding of the environment and to the observer's actions.
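As a rough illustration of that hierarchy, the toy PyTorch module below predicts a goal, then a path toward it, then per-frame body pose, each stage conditioned on encoded ego-perception and the observer's trajectory. The layer sizes and simple MLP heads are placeholders; the paper's generative model is learned from the paired perception-motion data and is considerably more sophisticated than this sketch.

```python
# Toy illustration of the goal -> path -> body-pose hierarchy, conditioned on
# ego-perception and the observer's trajectory. Dimensions, module names, and
# the MLP heads are placeholders, not the architecture from the paper.
import torch
import torch.nn as nn

class HierarchicalBehaviorModel(nn.Module):
    def __init__(self, percep_dim=256, obs_dim=64, horizon=16, n_joints=24):
        super().__init__()
        self.horizon, self.n_joints = horizon, n_joints
        cond = percep_dim + obs_dim
        self.goal_head = nn.Sequential(nn.Linear(cond, 256), nn.ReLU(), nn.Linear(256, 3))
        self.path_head = nn.Sequential(nn.Linear(cond + 3, 256), nn.ReLU(),
                                       nn.Linear(256, horizon * 3))
        self.pose_head = nn.Sequential(nn.Linear(cond + 3 + horizon * 3, 512), nn.ReLU(),
                                       nn.Linear(512, horizon * n_joints * 3))

    def forward(self, ego_perception, observer_traj):
        cond = torch.cat([ego_perception, observer_traj], dim=-1)
        goal = self.goal_head(cond)                                    # where to go
        path = self.path_head(torch.cat([cond, goal], dim=-1))        # how to get there
        pose = self.pose_head(torch.cat([cond, goal, path], dim=-1))  # how the body moves
        b = cond.shape[0]
        return goal, path.view(b, self.horizon, 3), pose.view(b, self.horizon, self.n_joints, 3)

# Example: condition on an encoded ego-perception vector and the observer's recent motion.
model = HierarchicalBehaviorModel()
goal, path, pose = model(torch.randn(1, 256), torch.randn(1, 64))
```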

The researchers have demonstrated the effectiveness of ATS on a variety of datasets, including cats, dogs, bunnies, and humans. The results show that ATS is able to learn and generate a diverse range of interactive behaviors that are faithful to the real world. For example, the model can generate behaviors such as a cat leaping onto furniture, a dog timidly approaching a user, or a human reacting to the motion of the observer.

ATS has several potential applications in areas such as virtual reality, augmented reality, and robotics. It could be used to create more realistic and engaging virtual agents, to enable robots to better understand and interact with their environments, or to generate realistic simulations for training purposes.

While promising, ATS still has limitations, such as difficulty handling complex scenes with multiple interacting agents, which the researchers plan to address in future work. Even so, ATS represents a significant step forward in agent behavior modeling and has the potential to revolutionize how we design and interact with AI agents in real-world environments.