Robots Learn to Feel: AI Achieves Multi-Sensory Robot Control Through Language
A team of researchers from UC Berkeley has developed FuSe (Finetuning with Heterogeneous Sensors via Language Grounding), a novel approach that allows robots to integrate sensory inputs beyond vision alone to perform complex tasks. Their findings, detailed in a recent paper, show significant improvements in robotic manipulation by using natural language as a common bridge between different sensory modalities.
Current state-of-the-art “generalist” robots typically rely heavily on visual data for training. This severely limits their capabilities in situations where vision is partially or completely obstructed. Imagine a robot trying to find a specific object inside a dark bag. A human would naturally use touch and perhaps even sound to locate the object; existing robots struggle with such scenarios.
FuSe addresses this limitation by cleverly using language as a unifying factor. The researchers train their robot policies on a relatively small dataset that includes visual, tactile (touch), and auditory information, all linked through natural language instructions. For example, the training data might include an image of a robot grasping a soft, red ball, along with the tactile feedback of the ball’s softness and the instruction “Grab the soft red ball.” By linking these heterogeneous modalities through language, the robot learns to associate different sensory experiences with the same semantic meaning.
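To make that data format concrete, here is a minimal, hypothetical sketch of what a single FuSe-style training example could look like. The field names, array shapes, and the `MultimodalExample` class below are illustrative assumptions for this article, not the schema used by the authors.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalExample:
    """One hypothetical training example that ties several sensor
    streams to a single natural-language instruction."""
    image: np.ndarray        # RGB camera frame, e.g. (224, 224, 3)
    tactile: np.ndarray      # touch-sensor reading, e.g. a pressure map
    audio: np.ndarray        # short microphone clip as a raw waveform
    instruction: str         # language that grounds all modalities
    action: np.ndarray       # robot action the policy should imitate

# Placeholder arrays stand in for real sensor data.
example = MultimodalExample(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    tactile=np.zeros((16, 16), dtype=np.float32),
    audio=np.zeros(16000, dtype=np.float32),
    instruction="Grab the soft red ball.",
    action=np.zeros(7, dtype=np.float32),
)
```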
The core of FuSe is a two-pronged approach:
- Multimodal Contrastive Loss: This technique encourages the robot to understand the relationship between different sensory inputs related to the same object or action. For instance, the algorithm learns that the visual appearance of a “squishy” object correlates with its tactile feel.
- Sensory-Grounded Language Generation Loss: This loss function trains the robot to generate natural language descriptions of its sensory experiences. For example, after grasping an object, the robot might generate a caption like, “The object feels smooth and round.” This helps to solidify the link between perception and language understanding. A minimal code sketch of both losses follows this list.
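The sketch below illustrates one common way such objectives are implemented: an InfoNCE-style contrastive term between per-modality embeddings and a standard next-token cross-entropy on the generated captions, added to the usual imitation loss. This is an assumption-laden illustration, not the authors' exact formulation; the function names, loss weights, and the choice of PyTorch are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(vision_emb: torch.Tensor,
                     tactile_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: embeddings of the same grasp from different
    sensors should be close; embeddings from different grasps should not."""
    v = F.normalize(vision_emb, dim=-1)          # (batch, dim)
    t = F.normalize(tactile_emb, dim=-1)         # (batch, dim)
    logits = v @ t.T / temperature               # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric: match vision-to-touch and touch-to-vision.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def caption_loss(decoder_logits: torch.Tensor,
                 caption_tokens: torch.Tensor) -> torch.Tensor:
    """Sensory-grounded generation loss: next-token cross-entropy on
    captions such as "The object feels smooth and round."""
    return F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                           caption_tokens.reshape(-1))

def total_loss(imitation, vision_emb, tactile_emb, decoder_logits, tokens,
               w_contrastive=0.1, w_caption=0.1):
    """Hypothetical combined objective: imitation loss plus the two
    auxiliary terms, with weights that would be tuned in practice."""
    return (imitation
            + w_contrastive * contrastive_loss(vision_emb, tactile_emb)
            + w_caption * caption_loss(decoder_logits, tokens))
```

In a fuller setup, each modality (vision, touch, audio) would have its own encoder producing embeddings in a shared space, and these auxiliary terms would be added to the imitation objective during finetuning.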
The researchers evaluated FuSe on three real-world robotic tasks: tabletop grasping, grasping objects from a bag (a partially observable setting), and button pressing (using sound as a cue). The results showed impressive improvements over baseline methods, with success rates increasing by over 20% in some cases. Specifically, the robot demonstrated significant gains in the challenging bag-grasping task, where visual information is severely limited. FuSe-trained robots also excelled at complex, multimodal instructions such as “Pick the object that feels squishy and looks red,” showcasing their ability to jointly reason across different sensory modalities.
FuSe’s adaptability was also tested on different robotic architectures, including both transformer-based policies and large vision-language-action models. The consistent performance improvements across these architectures highlight the generalizability of the approach.
The work represents a substantial leap forward in developing truly robust and versatile robotic systems. By harnessing the power of language to bridge the gap between diverse sensory modalities, FuSe opens the door for robots to effectively interact with complex, real-world environments where reliance on vision alone is insufficient. The release of the training data and code by the research team further fosters collaboration and accelerates progress in the field of multi-sensory robot control.