The paper introduces TacFiLM, a lightweight modality-fusion approach using feature-wise linear modulation (FiLM) to integrate tactile signals into vision-language-action (VLA) models. TacFiLM conditions intermediate visual features on pretrained tactile representations via post-training finetuning, avoiding the complexity of token concatenation or large-scale pretraining. Experiments on insertion tasks demonstrate that TacFiLM improves success rate, insertion performance, completion time, and force stability in both in- and out-of-distribution scenarios.
Tactile sensing can be efficiently injected into vision-language-action models via feature-wise linear modulation, boosting robot manipulation performance without the computational overhead of large-scale pretraining.
We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. Recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, but these models rely mainly on vision-based perception. Vision alone cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. Recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, yet the heavy computational demands of behavioural models call for more lightweight fusion strategies. To address these challenges, TacFiLM is a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations via feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
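To make the FiLM conditioning step concrete, the following is a minimal NumPy sketch of generic feature-wise linear modulation: a conditioning vector is projected to a per-channel scale (gamma) and shift (beta), which are then applied to the visual features. It is not the authors' implementation. The function name `film`, the tensor shapes, the single linear projection, and the identity initialisation are all assumptions made for illustration; the abstract does not say where in the VLA model the modulation is applied or how the tactile representation is produced.

```python
import numpy as np


def film(visual_feats, tactile_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: scale and shift each visual channel using a conditioning vector.

    visual_feats: (num_tokens, channels) intermediate visual features (assumed shape)
    tactile_emb:  (tactile_dim,) pretrained tactile representation (assumed shape)
    """
    gamma = tactile_emb @ w_gamma + b_gamma  # per-channel scale, shape (channels,)
    beta = tactile_emb @ w_beta + b_beta     # per-channel shift, shape (channels,)
    # Broadcasting applies the same gamma and beta to every visual token.
    return gamma * visual_feats + beta


rng = np.random.default_rng(0)
num_tokens, channels, tactile_dim = 4, 8, 16  # hypothetical sizes
visual = rng.normal(size=(num_tokens, channels))
tactile = rng.normal(size=(tactile_dim,))

# Assumption: identity initialisation (gamma = 1, beta = 0), so the modulation
# starts as a no-op and the pretrained visual pathway is initially unchanged.
w_gamma = np.zeros((tactile_dim, channels))
b_gamma = np.ones(channels)
w_beta = np.zeros((tactile_dim, channels))
b_beta = np.zeros(channels)

out = film(visual, tactile, w_gamma, b_gamma, w_beta, b_beta)
```

Because the modulation only rescales and shifts existing channels, the number of visual tokens passed downstream is unchanged, in contrast to fusion by token concatenation, which lengthens the input sequence.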