Dartmouth | Mar 12, 2026 | arXiv:2603.11950

Learning Transferable Sensor Models via Language-Informed Pretraining

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa A. Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew T. Campbell

AI Summary

SLIP (Sensor Language-Informed Pretraining) is a framework for learning language-aligned representations from sensor data that generalize across diverse sensor configurations. It combines contrastive alignment with sensor-conditioned captioning, enabling both discriminative understanding and generative reasoning. Using a pretrained decoder-only language model with cross-attention and a flexible patch-embedder, SLIP reports superior performance in zero-shot transfer, signal captioning, and question answering across 11 datasets: 77.14% average linear-probing accuracy (a 5.93% relative improvement over strong baselines) and 64.83% accuracy on sensor-based question answering.

Key Contribution

Achieves zero-shot sensor understanding across diverse sensor setups by repurposing a pretrained decoder-only language model within a new sensor-language alignment framework.

Abstract

Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.
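
The abstract describes a training objective that pairs contrastive alignment with sensor-conditioned captioning. As a rough illustration only, the sketch below combines a CLIP-style symmetric contrastive loss with a next-token cross-entropy captioning loss in plain Python. The function names, the temperature of 0.07, and the `caption_weight` parameter are assumptions made for this example; the abstract does not specify SLIP's actual loss formulation or hyperparameters.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (sensor, text) embeddings.

    Row i of sensor_emb is assumed to be paired with row i of text_emb.
    The temperature value is an assumption, not taken from the paper.
    """
    s = [normalize(v) for v in sensor_emb]
    t = [normalize(v) for v in text_emb]
    n = len(s)
    # Cosine-similarity logits: logits[i][j] compares sensor i with text j.
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
              for si in s]
    sensor_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_sensor = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (sensor_to_text + text_to_sensor)


def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for one caption.

    token_logits[k] holds the vocabulary logits at position k, and
    target_ids[k] is the index of the reference token at that position.
    """
    total = sum(log_softmax(row)[y] for row, y in zip(token_logits, target_ids))
    return -total / len(target_ids)


def combined_loss(sensor_emb, text_emb, token_logits, target_ids,
                  caption_weight=1.0):
    """Contrastive term plus a weighted captioning term (weight is hypothetical)."""
    return (contrastive_loss(sensor_emb, text_emb)
            + caption_weight * captioning_loss(token_logits, target_ids))
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairs are shuffled. The captioning term equals the log of the vocabulary size when the decoder's logits are uniform.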

Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 52

Year: 2026

Venue: N/A
