Foundation for Research and Technology-Hellas | Apr 2, 2026 | arXiv:2604.02108

Cross-Modal Visuo-Tactile Object Perception

Anirvan Dutta, Simone Tasciotti, Claudia Cusseddu, Ang Li, Panayiota Poirazi, Julijana Gjorgjieva, Etienne Burdet, Patrick van der Smagt, Mohsen Kaboli

AI Summary

The paper introduces the Cross-Modal Latent Filter (CMLF) for estimating physical object properties by fusing vision and tactile sensing in robotic manipulation. CMLF learns a structured latent space of object properties and uses Bayesian inference to integrate sensory evidence over time, enabling bidirectional transfer of cross-modal priors. Real-world experiments demonstrate that CMLF improves the efficiency and robustness of physical property estimation under uncertainty, and exhibits human-like perceptual coupling.

Key Contribution

A new visuo-tactile fusion model learns and infers physical object properties over time, bringing robot perception closer to human perception, including susceptibility to cross-modal illusions.

Abstract

Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.
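The idea of a cross-modal prior refined by evidence over time can be illustrated with a toy example. The sketch below is not the CMLF model: it assumes a single scalar property (a hypothetical stiffness value), Gaussian beliefs, and a standard conjugate Gaussian update, and it shows only one direction of transfer (vision supplies the prior, touch supplies the observations). All names and numbers are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Belief:
    """Gaussian belief over one scalar object property (assumed, e.g. stiffness)."""
    mean: float
    var: float


def fuse(belief: Belief, obs: float, obs_var: float) -> Belief:
    """Conjugate Gaussian update of a belief with one noisy observation."""
    gain = belief.var / (belief.var + obs_var)
    return Belief(
        mean=belief.mean + gain * (obs - belief.mean),
        var=(1.0 - gain) * belief.var,
    )


def filter_property(
    prior: Belief,
    tactile_obs: list[float],
    tactile_var: float,
    process_var: float = 0.0,
) -> list[Belief]:
    """Start from a vision-derived prior and fold in tactile readings one by one."""
    beliefs = []
    belief = prior
    for obs in tactile_obs:
        # Predict step: the property is treated as static, with optional drift.
        belief = Belief(belief.mean, belief.var + process_var)
        # Update step: weigh the new tactile reading against the current belief.
        belief = fuse(belief, obs, tactile_var)
        beliefs.append(belief)
    return beliefs


if __name__ == "__main__":
    # Hypothetical numbers: appearance suggests a softer object than touch reveals.
    vision_prior = Belief(mean=2.0, var=1.0)
    touches = [3.1, 2.9, 3.0, 3.2]
    for step, b in enumerate(
        filter_property(vision_prior, touches, tactile_var=0.25), start=1
    ):
        print(f"touch {step}: mean={b.mean:.3f}, var={b.var:.3f}")
```

In this toy setting, a confident but wrong visual prior (small `var`) keeps pulling the estimate away from the tactile readings for several steps, which is a crude analogue of the cross-modal illusions mentioned in the abstract.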

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
