Tsinghua AIMar 17, 2026arXiv:2603.16542

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, Zongqing Lu

AI Summary

This paper introduces Posterior-Transition Reweighting (PTR), a novel offline robot policy learning method that addresses the heterogeneity of robot datasets by reweighting training samples based on the attributability of their post-action consequences. PTR uses a transition scorer to estimate a softmax identification posterior over mismatched target indices, and then uses the posterior-to-uniform ratio to define a weight for each sample during supervised regression. Experiments demonstrate that PTR improves conservative offline adaptation to heterogeneous robot data by reallocating credit to more attributable samples.

Key Contribution

Stop averaging over noisy robot data: PTR selectively trusts training samples based on how well their post-action consequences align with learned representations, leading to more robust offline policy learning.

Abstract

Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.

Data Curation & Synthetic Data Robotics & Embodied AI Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References64

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Related Papers