The paper introduces FG-CLTP, a fine-grained contrastive language-tactile pretraining framework that incorporates quantitative contact states into vision-language-action (VLA) models for robotic manipulation. The authors build a dataset of over 100k tactile 3D point cloud-language pairs and use discretized numerical tokenization to align quantitative and semantic information. FG-CLTP achieves 95.9% classification accuracy and reduces regression error by 52.6% relative to the state of the art, enabling a 3D-TLA architecture with improved performance in contact-rich manipulation tasks.
Tactile robotic perception gets a boost from a new pretraining method that explicitly encodes contact force, geometry, and orientation, cutting regression error by 52.6%.
Recent advances in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal-axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language-tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model achieves 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the use of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building on this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow-matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines on contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
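The abstract does not spell out the tokenization scheme, but a minimal sketch of discretized numerical tokenization, assuming uniform binning with hypothetical value ranges, bin count, and token names, could look like the following: each continuous contact quantity is clipped to a range, mapped to a bin index, and emitted as a discrete token that can be appended to the paired language description.

```python
# Minimal sketch of discretized numerical tokenization (assumed scheme:
# uniform binning; ranges, bin count, and token names are hypothetical).

FORCE_RANGE_N = (0.0, 10.0)     # contact force magnitude in newtons (assumed range)
ANGLE_RANGE_DEG = (0.0, 180.0)  # principal-axis orientation in degrees (assumed range)
NUM_BINS = 32                   # discretization resolution (assumed)

def discretize(value: float, lo: float, hi: float, num_bins: int = NUM_BINS) -> int:
    """Clip a continuous physical quantity to [lo, hi] and map it to a bin index."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * num_bins), num_bins - 1)

def contact_state_tokens(force_n: float, angle_deg: float) -> list[str]:
    """Render quantitative contact measurements as discrete tokens that can be
    appended to the paired language description before contrastive pretraining."""
    return [
        f"<force_{discretize(force_n, *FORCE_RANGE_N)}>",
        f"<axis_{discretize(angle_deg, *ANGLE_RANGE_DEG)}>",
    ]

print(contact_state_tokens(2.3, 45.0))  # -> ['<force_7>', '<axis_8>']
```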
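The contrastive objective itself is also not detailed in the abstract; CLIP-style pretraining frameworks of this kind typically use a symmetric InfoNCE loss over in-batch pairs. A sketch under that assumption, where pc_emb and txt_emb are outputs of hypothetical point-cloud and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(pc_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of tactile point-cloud / language pairs.
    pc_emb, txt_emb: (B, D) embeddings from the two (hypothetical) encoders."""
    pc_emb = F.normalize(pc_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = pc_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(pc_emb.size(0), device=pc_emb.device)
    # Matched pairs sit on the diagonal; all other in-batch pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```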
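Finally, the abstract only names a flow-matching policy as the action head of 3D-TLA. One common conditional flow-matching training objective, shown here with a hypothetical policy network and observation embedding, regresses the velocity field that transports Gaussian noise to the demonstrated action:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(policy, obs_emb: torch.Tensor,
                       actions: torch.Tensor) -> torch.Tensor:
    """Generic conditional flow-matching objective, not the paper's exact recipe.
    `policy` and `obs_emb` are hypothetical; actions: (B, action_dim)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1, device=actions.device)  # time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions      # linear interpolation path
    target_v = actions - noise                 # constant velocity along the path
    pred_v = policy(x_t, t, obs_emb)           # network predicts the velocity
    return F.mse_loss(pred_v, target_v)
```

At inference, such a policy integrates the learned velocity field from noise to an action over a few steps, conditioned on the fused tactile-language observation embedding.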