Columbia, INRIA, SJTU | Mar 9, 2026 | arXiv:2603.07875

Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy

Haoran Ding, Liang Ma, Yaxun Yang, Wen Yang, Tianyu Liu, Anqing Duan, Xiaodan Liang, Dezhen Song, Yoshihiko Nakamura

AI Summary

This paper introduces a task-aware observation interface (L0 and L1) that canonicalizes visual input for visuomotor policies, enhancing robustness to out-of-distribution appearance changes. The approach uses SAM3 to segment task-relevant entities (target object and robot/gripper) and repaints them with semantic colors on a constant background (L0), optionally injecting monocular depth from Depth Anything 3 (L1). Experiments on RoboMimic, ManiSkill, RLBench, and real-world Franka tasks demonstrate that this interface preserves in-distribution performance while significantly improving robustness to OOD visual shifts across different policy backbones.

Key Contribution

Preprocessing raw RGB images into task-aware semantic-geometric representations before they reach the policy makes visuomotor policies robust to distracting visual variations, without modifying or fine-tuning the policy itself.
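The wiring this implies can be sketched as a thin wrapper around an existing policy. This is an illustrative sketch only, not the authors' code: `with_observation_interface`, `make_canonicalizer`, `toy_policy`, and the single red semantic color are hypothetical names and choices, and `segment` stands in for whatever segmenter supplies a boolean mask.

```python
from typing import Callable

import numpy as np

Image = np.ndarray   # H x W x 3, uint8
Action = np.ndarray


def with_observation_interface(
    policy: Callable[[Image], Action],
    canonicalize: Callable[[Image], Image],
) -> Callable[[Image], Action]:
    """Return a policy that only ever sees canonicalized observations."""
    def wrapped(rgb: Image) -> Action:
        return policy(canonicalize(rgb))
    return wrapped


def make_canonicalizer(segment: Callable[[Image], np.ndarray]) -> Callable[[Image], Image]:
    """Build a canonicalizer from a mask-producing callable (hypothetical stand-in for a segmenter)."""
    def canonicalize(rgb: Image) -> Image:
        out = np.zeros_like(rgb)           # constant background
        out[segment(rgb)] = (255, 0, 0)    # assumed semantic color for the entity
        return out
    return canonicalize


def toy_policy(obs: Image) -> Action:
    """Toy policy whose output depends on every pixel, so raw backgrounds change it."""
    return obs.reshape(-1, 3).mean(axis=0)
```

With a segmenter that returns the same mask for two frames that differ only in background, the wrapped policy produces the same action for both, whereas `toy_policy` applied to the raw frames does not.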

Abstract

Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic-geometric observation (L1) that remains a standard 3-channel, image-like input. We evaluate on RoboMimic (Lift), ManiSkill YCB grasping under clutter, four RLBench tasks under controlled appearance shifts, and two real-world Franka tasks (ReachX and CloseCabinet). Across benchmarks and policy backbones (Flow Matching Policy and SmolVLA), our interface preserves in-distribution performance while substantially improving robustness under OOD visual shifts.
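A minimal NumPy sketch of the two observation levels may help make the construction concrete. It assumes the entity masks and the depth map have already been produced upstream (no SAM3 or Depth Anything 3 calls are shown). The palette, the background color, and in particular the depth rule in `build_l1` are assumptions for illustration: the abstract does not specify how the depth-guided overwrite is computed, so here depth simply scales the brightness of each semantic color.

```python
import numpy as np

# Hypothetical palette and background; the paper's actual colors are not stated in the abstract.
BACKGROUND = (0, 0, 0)
PALETTE = {"target": (255, 0, 0), "robot": (0, 0, 255)}


def build_l0(rgb, masks):
    """L0: repaint each segmented entity with its semantic color on a constant background."""
    h, w = rgb.shape[:2]
    obs = np.empty((h, w, 3), dtype=np.uint8)
    obs[:] = BACKGROUND              # raw pixel values are discarded entirely
    for name, mask in masks.items():
        obs[mask] = PALETTE[name]
    return obs


def build_l1(l0, masks, depth):
    """L1 (assumed rule): scale semantic colors by normalized nearness inside the masks."""
    union = np.zeros(depth.shape, dtype=bool)
    for mask in masks.values():
        union |= mask
    obs = l0.copy()
    if not union.any():
        return obs
    d_min = float(depth[union].min())
    d_max = float(depth[union].max())
    span = max(d_max - d_min, 1e-6)
    nearness = np.clip(1.0 - (depth - d_min) / span, 0.0, 1.0)
    # Keep a brightness floor so far entity pixels stay distinct from the background.
    scale = 0.25 + 0.75 * nearness
    scaled = l0[union].astype(np.float32) * scale[union][:, None]
    obs[union] = scaled.astype(np.uint8)
    return obs
```

Both functions return an `H x W x 3` `uint8` array, so the result can be passed to a policy that expects an ordinary RGB image.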

Computer Vision, Multimodal Models, Robotics & Embodied AI

Year: 2026

Venue: N/A
