UIUC | Apr 23, 2026 | arXiv:2604.21926

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang

AI Summary

The paper introduces IMU-to-4D, a framework that repurposes large language models to reconstruct 4D human motion and coarse 3D scene layouts from wearable inertial measurement unit (IMU) data alone. Because it requires no visual input, the approach sidesteps the privacy and energy-efficiency concerns associated with cameras. Experiments across diverse human-scene datasets show that IMU-to-4D produces more coherent and temporally stable reconstructions than state-of-the-art cascaded pipelines.

Key Contribution

Imagine reconstructing detailed human motion and coarse scene layouts using only a smartwatch and earbuds, with no cameras needed.

Abstract

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. The goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art (SoTA) cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 123

Year: 2026

Venue: N/A
