This paper introduces Purify-then-Align (PTA), a framework for robust multimodal human sensing that addresses the challenges of missing modalities by mitigating the Representation Gap and Contamination Effect. PTA uses meta-learning to dynamically weight modalities, reducing the influence of noisy inputs, and then employs diffusion-based knowledge distillation to align and refine single-modality features using a purified, cross-modal teacher. Experiments on MM-Fi and XRF55 datasets demonstrate that PTA achieves state-of-the-art performance and enhances the robustness of single-modality models in missing-modality scenarios.
Don't let missing modalities sink your human sensing: PTA leverages meta-learning and knowledge distillation to build surprisingly robust single-modality encoders from noisy multimodal data.
Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked: the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel "Purify-then-Align" framework that resolves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this "Purify-then-Align" strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under a pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.
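The two stages described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the meta-learned weights are stood in for by a plain learnable logit vector, and the diffusion-based distillation is simplified to a feature-matching (MSE) loss against the weighted cross-modal teacher. All names (`pta_losses`, `weight_logits`) are invented for this sketch.

```python
import torch
import torch.nn.functional as F

def pta_losses(student_feats, weight_logits):
    """Sketch of Purify-then-Align (assumed simplification, not the paper's code).

    student_feats: list of (batch, dim) feature tensors, one per modality.
    weight_logits: (num_modalities,) learnable logits; in PTA these would be
                   produced by the meta-learning-driven weighting mechanism.
    """
    # Purify: soft weights that can down-weight noisy, low-contributing modalities.
    w = torch.softmax(weight_logits, dim=0)
    feats = torch.stack(student_feats)                # (M, batch, dim)
    teacher = (w[:, None, None] * feats).sum(dim=0)   # purified cross-modal consensus
    # Align: pull each single-modality student toward the (detached) clean teacher.
    # PTA does this with diffusion-based distillation; MSE is a stand-in here.
    distill_loss = sum(F.mse_loss(f, teacher.detach()) for f in student_feats)
    return teacher, distill_loss

# Toy usage: three modalities, batch of 4, 16-dim features.
feats = [torch.randn(4, 16) for _ in range(3)]
logits = torch.zeros(3, requires_grad=True)           # meta-learned in the paper
teacher, loss = pta_losses(feats, logits)
```

Because the distillation targets are detached, gradients through `loss` train the student encoders (and, in the full method, the weighting mechanism) rather than letting students drag the teacher toward their own noise.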