Tsinghua AI | May 5, 2026 | arXiv:2605.03846

SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

Shiyi Chen, Haiyi Liu, Ming Yang, Jiaqi Zhang, Debing Zhang

AI Summary

SigLoMa, a quadrupedal loco-manipulation framework, achieves fully onboard, ego-centric vision-based pick-and-place by introducing Sigma Points, a lightweight geometric representation for exteroception, and an ego-centric Kalman Filter for high-rate state estimation. To improve sample efficiency, the authors use an Active Sampling Curriculum guided by Hint Poses, and they address the robot's visual blind spots with temporal encoding and simulated random-walk drift. Real-world experiments demonstrate that SigLoMa, using only a 5 Hz open-vocabulary detector, achieves dynamic loco-manipulation performance comparable to expert human teleoperation.

Key Contribution

Quadrupedal robots can perform dynamic loco-manipulation in the real world, with performance comparable to expert human teleoperation, using only onboard ego-centric vision and a low-frequency (5 Hz) open-vocabulary detector.

Abstract

Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, ego-centric vision-based pick-and-place framework. At the core of SigLoMa is the introduction of Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an ego-centric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
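The abstract's idea of bridging slow perception and fast control with a Kalman filter can be illustrated with a minimal sketch. This is not the authors' ego-centric filter, whose state and models are not described here; it is a generic 1-D constant-velocity Kalman filter that predicts at an assumed 50 Hz control rate and corrects only when a 5 Hz detection arrives. The class name `ConstantVelocityKF`, the function `simulate`, the noise parameters, and the toy target trajectory are all hypothetical, and the sketch ignores the detector's latency.

```python
import random


class ConstantVelocityKF:
    """1-D Kalman filter over [position, velocity]; illustrative, not SigLoMa's filter."""

    def __init__(self, pos, vel=0.0, pos_var=1.0, vel_var=1.0,
                 accel_noise=0.1, meas_var=1e-4):
        self.p, self.v = pos, vel
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = accel_noise  # white-noise acceleration intensity (assumed)
        self.r = meas_var     # detector position variance (assumed)

    def predict(self, dt):
        """Propagate state and covariance: P <- F P F^T + Q, F = [[1, dt], [0, 1]]."""
        (p00, p01), (p10, p11) = self.P
        self.p += self.v * dt
        q = self.q
        self.P = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + q * dt ** 3 / 3.0,
             p01 + dt * p11 + q * dt ** 2 / 2.0],
            [p10 + dt * p11 + q * dt ** 2 / 2.0,
             p11 + q * dt],
        ]

    def update(self, z):
        """Correct with a position-only measurement z (H = [1, 0])."""
        (p00, p01), (p10, p11) = self.P
        s = p00 + self.r             # innovation variance
        k0, k1 = p00 / s, p10 / s    # Kalman gain
        y = z - self.p               # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]


def simulate(duration=2.0, control_hz=50, detector_hz=5, noise_std=0.0, seed=0):
    """Track a target whose relative position is 2.0 - 0.5 * t metres (toy data)."""
    rng = random.Random(seed)
    dt = 1.0 / control_hz
    steps_per_detection = control_hz // detector_hz
    kf = ConstantVelocityKF(pos=2.0)
    estimates, n_detections = [], 0
    for k in range(1, int(round(duration * control_hz)) + 1):
        kf.predict(dt)                        # runs at every control step
        if k % steps_per_detection == 0:      # a detection arrives at 5 Hz
            true_pos = 2.0 - 0.5 * k * dt
            kf.update(true_pos + rng.gauss(0.0, noise_std))
            n_detections += 1
        estimates.append((kf.p, kf.v))        # estimate available at control rate
    return estimates, n_detections


if __name__ == "__main__":
    est, n = simulate(noise_std=0.01)
    print(f"{len(est)} estimates from {n} detections; final state {est[-1]}")
```

In this sketch the controller receives a state estimate at every control step even though only one step in ten carries a fresh detection, which is the general pattern the abstract describes.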

Computer Vision | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0
Influential citations: 0
References: 44
Year: 2026
Venue: N/A
