NVIDIA, McGill | Mar 15, 2026 | arXiv:2603.14604

Tactile Modality Fusion for Vision-Language-Action Models

Charlotte Morissette, Amin Abyaneh, Wei-Di Chang, Anas Houssaini, David Meger, Hsiu-Chin Lin, Jonathan Tremblay, Gregory Dudek

AI Summary

The paper introduces TacFiLM, a lightweight modality-fusion approach using feature-wise linear modulation (FiLM) to integrate tactile signals into vision-language-action (VLA) models. TacFiLM conditions intermediate visual features on pretrained tactile representations via post-training finetuning, avoiding the complexity of token concatenation or large-scale pretraining. Experiments on insertion tasks demonstrate that TacFiLM improves success rate, insertion performance, completion time, and force stability in both in- and out-of-distribution scenarios.
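
For reference, FiLM in its general form applies a per-channel affine transform whose scale and shift are predicted from a conditioning input. The notation below is an assumption for illustration and is not taken from the paper: x_c is channel c of an intermediate visual feature map, z is the pretrained tactile representation, and f and h are learned predictors.

```latex
\mathrm{FiLM}(x_c \mid z) = \gamma_c(z)\, x_c + \beta_c(z),
\qquad \gamma(z) = f(z), \quad \beta(z) = h(z)
```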

Key Contribution

Tactile sensing can be efficiently injected into vision-language-action models via feature-wise linear modulation, boosting robot manipulation performance without the computational overhead of large-scale pretraining.

Abstract

We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
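
The conditioning step described in the abstract can be sketched in a few lines. This is a minimal illustration of generic FiLM, not the authors' implementation: the function names, the shapes, and the choice to parameterise the scale as 1 + delta (so that zero weights leave the visual features unchanged) are all assumptions. Plain Python lists are used to keep the sketch dependency-free; a real model would use a tensor library.

```python
# Minimal FiLM sketch (hypothetical names; not the paper's code).
# A tactile embedding predicts a per-channel scale (gamma) and shift (beta)
# that modulate a visual feature map laid out as [channels][positions].


def linear(weights, bias, vec):
    """Affine map on plain lists: returns weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(visual, tactile, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate `visual` channel-wise, conditioned on `tactile`.

    Assumption: gamma = 1 + delta, so all-zero weights give the identity
    and the pretrained visual pathway is untouched at initialisation.
    """
    gamma = [1.0 + d for d in linear(w_gamma, b_gamma, tactile)]
    beta = linear(w_beta, b_beta, tactile)
    return [[g * x + b for x in channel]
            for channel, g, b in zip(visual, gamma, beta)]


# Toy inputs: 2 visual channels with 3 positions each, 2-d tactile embedding.
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
tactile = [0.5, -1.0]

# Zero-initialised predictors leave the visual features unchanged.
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
identity = film(visual, tactile, zeros_w, zeros_b, zeros_w, zeros_b)

# Non-zero predictors rescale channel 0 and overwrite channel 1 with a shift.
w_gamma = [[2.0, 0.0], [0.0, 1.0]]
w_beta = [[0.0, 0.0], [0.0, -1.0]]
modulated = film(visual, tactile, w_gamma, zeros_b, w_beta, zeros_b)
```

In this sketch the tactile input adds no tokens to the sequence. It changes only per-channel scales and shifts, which is one way to read the abstract's contrast between FiLM and token concatenation.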

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
