TacViT, a Vision Transformer-based architecture, is introduced to address the challenge of generalizing tactile perception across diverse vision-based tactile sensors on multi-fingered hands. By employing global self-attention, TacViT extracts robust features from tactile images, enabling accurate inference of contact properties on novel, unseen sensors without requiring sensor-specific retraining. Experiments on a five-fingered robot hand demonstrate TacViT's superior generalization compared to CNNs, significantly reducing the data collection burden for new tactile sensors.
Stop retraining your tactile models for every new sensor: TacViT uses vision transformers to generalize across diverse tactile sensors without sensor-specific data.
Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets and must be retrained for each new sensor because of differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT's potential to make tactile sensing more scalable and practical for real-world robotic applications.
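To make the idea of global self-attention over a tactile image concrete, the following is a minimal sketch and not the TacViT implementation. It assumes a tiny grayscale image, non-overlapping square patches, a single attention head, and identity query/key/value projections (no learned weights). The function names `split_into_patches` and `global_self_attention` are hypothetical and chosen only for illustration.

```python
import math


def split_into_patches(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections, so this illustrates only the
    all-to-all mixing step, not a trained Vision Transformer layer.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Every patch is scored against every other patch (global context).
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][i] for j in range(len(tokens))) for i in range(dim)]
        )
    return outputs, weights


# Toy 4x4 "tactile image" with made-up intensity values.
image = [[float((r * 4 + c) % 5) for c in range(4)] for r in range(4)]
tokens = split_into_patches(image, 2)
features, attn = global_self_attention(tokens)
```

Each row of `attn` spans all patches, so every output feature mixes information from the whole image rather than from a local receptive field, which is the property that distinguishes self-attention from a convolution in this sketch.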