Tsinghua AIACE Robotics | CUHK | Jun 15, 2026 | arXiv:2606.17043

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li

AI Summary

This paper introduces Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies with online reinforcement learning when each episode yields only a sparse binary outcome. Rather than collapsing that outcome into a single scalar reward, HABC trains separate critic heads for viability and efficiency and balances their contributions with a state-adaptive gate. Intervention-aware credit assignment restricts outcome labels to segments executed by the current policy, so that rollouts mixing autonomous and intervention segments do not receive incorrect credit. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

Key Contribution

Fine-tuning with HABC raises real-robot task success from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% on three tasks by balancing viability and efficiency feedback derived from sparse episode outcomes.

Abstract

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which raises two issues. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
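
The gating and labeling ideas in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the sigmoid form of the gate, the `threshold`, `sharpness`, `beta`, and `max_weight` parameters, the exponential advantage weighting, and the rule that only the trailing policy-executed segment receives the episode outcome are all assumptions chosen for illustration. All function names are hypothetical.

```python
import math


def gate(viability, threshold=0.7, sharpness=10.0):
    """Hypothetical state-adaptive gate g_t in (0, 1).

    Close to 0 when the estimated viability (chance of success) is low,
    close to 1 when it is high. The sigmoid form is an assumption.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def actor_weight(adv_viability, adv_efficiency, viability,
                 beta=1.0, max_weight=20.0):
    """Per-transition weight on a behavior-cloning loss.

    Mixes the two one-step advantages with the gate, then applies a
    clipped exponential (a common advantage-weighting choice, assumed here).
    """
    g = gate(viability)
    advantage = (1.0 - g) * adv_viability + g * adv_efficiency
    return min(math.exp(advantage / beta), max_weight)


def outcome_labels(executed_by_policy, success):
    """Assign the episode outcome only to the trailing policy-executed steps.

    executed_by_policy[t] is True if step t was run by the current policy
    and False if it was an intervention step. Steps at or before the last
    intervention stay unlabeled (None). This is one possible reading of
    intervention-aware credit assignment, not the paper's exact rule.
    """
    labels = [None] * len(executed_by_policy)
    t = len(executed_by_policy) - 1
    while t >= 0 and executed_by_policy[t]:
        labels[t] = 1.0 if success else 0.0
        t -= 1
    return labels


if __name__ == "__main__":
    # Low viability: the viability advantage dominates the weight.
    print(actor_weight(adv_viability=1.0, adv_efficiency=-1.0, viability=0.1))
    # High viability: the efficiency advantage dominates the weight.
    print(actor_weight(adv_viability=1.0, adv_efficiency=-1.0, viability=0.99))
    # Only the steps after the last intervention receive the outcome.
    print(outcome_labels([True, False, True, True], success=True))
```

Under these assumptions, a transition that helps viability but hurts efficiency is up-weighted while success is still uncertain and down-weighted once success is likely, which matches the prioritization the abstract describes.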

Scalable Oversight & Alignment Theory | Training Efficiency & Optimization | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
