The paper introduces Staged Vision-Language Learning (SVLL), a three-stage framework that first decouples spatial grounding from temporal reasoning before introducing sequential action history for embodied task planning. To address limitations of Direct Preference Optimization (DPO), the authors propose Bias-DPO, which explicitly maximizes likelihood on ground-truth actions while penalizing overconfident hallucinations. Experiments on AI2-THOR and real-world robotic deployments demonstrate that SVLL with Bias-DPO outperforms state-of-the-art models, including GPT-4o and Gemini-2.0-flash, in task success rate while reducing physical constraint violations.
Forget end-to-end training and unstable RL: this staged learning approach with a novel Bias-DPO objective lets vision-language models plan physically plausible actions better than GPT-4o.
Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual grounding before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature. By optimizing only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, DPO often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
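The abstract does not give the exact form of the Bias-DPO objective. As a minimal sketch, one plausible reading is the standard DPO preference loss plus an anchoring term that maximizes likelihood on the expert (ground-truth) trajectory; the coefficients `beta` and `lam` below are hypothetical, and all log-probabilities are treated as precomputed scalars for one trajectory pair:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used inside the DPO preference term."""
    return 1.0 / (1.0 + math.exp(-x))


def bias_dpo_loss(
    logp_win: float,       # policy log-prob of the winning (preferred) trajectory
    logp_lose: float,      # policy log-prob of the losing trajectory
    ref_logp_win: float,   # reference-model log-prob of the winning trajectory
    ref_logp_lose: float,  # reference-model log-prob of the losing trajectory
    logp_expert: float,    # policy log-prob of the ground-truth expert actions
    beta: float = 0.1,     # hypothetical DPO temperature
    lam: float = 1.0,      # hypothetical weight on the expert-likelihood anchor
) -> float:
    # Standard DPO: widen the reference-adjusted gap between win and lose.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    dpo_term = -math.log(sigmoid(margin))
    # Anchor: absolute NLL on expert actions, penalizing drift off the
    # expert manifold even when the relative preference gap is already large.
    anchor_term = -logp_expert
    return dpo_term + lam * anchor_term
```

With `lam = 0` this reduces to plain DPO, illustrating the "purely relative" failure mode the paper describes: the loss can be driven down by lowering both trajectories' likelihoods as long as the gap grows, whereas the anchor term keeps absolute likelihood on the ground-truth actions high.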