NUS · CUHK · NTU · UNC · May 26, 2026 · arXiv:2605.27095

Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Mingcong Lei, Bo An, Ivor W. Tsang, Yang You

AI Summary

This paper introduces FA-OPD, an adversarial dual on-policy distillation method for learning from demonstrations in embodied control, using a Flow Matching teacher co-trained with an MLP student. FA-OPD leverages two distillation channels: a reward channel that learns an expert-likeness objective for exploration and an action channel that provides dense local targets for exploitation. Experiments across robot navigation, manipulation, and locomotion tasks demonstrate that FA-OPD outperforms strong baselines, especially under noisy or limited demonstrations, showcasing its improved robustness and generalization.

Key Contribution

Flow-based imitation learning can be significantly improved by distilling both rewards and actions on-policy, enabling more robust and generalizable policies, especially with limited or noisy demonstrations.

Abstract

Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose FA-OPD, an adversarial dual on-policy distillation method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations.
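The abstract does not state the loss functions, so the following is only a minimal sketch of how the two channels could be combined on a single student-visited state. Everything in it is an assumption made for illustration: the toy straight-line velocity field, the hypothetical expert action `-state`, the GAIL-style reward `-log(1 - d)` swapped in for the unspecified expert-likeness objective, the policy-gradient surrogate, and the weight `lam`. None of the function names come from the paper.

```python
import math


def teacher_action(state, noise, n_steps=10):
    """Sample a teacher action by Euler-integrating a flow from noise (t=0 to t=1).

    Assumption: a toy straight-line velocity field v(a, t) = (target - a) / (1 - t)
    toward a hypothetical expert action `-state`. A real Flow Matching teacher
    would use a learned network here instead.
    """
    target = [-s for s in state]  # hypothetical expert action for this toy task
    a = list(noise)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h  # t stays below 1, so the division is safe
        a = [ai + h * (gi - ai) / (1.0 - t) for ai, gi in zip(a, target)]
    return a


def action_distillation_loss(student_a, teacher_a):
    """Action channel (assumed form): mean squared error to the teacher action."""
    return sum((x - y) ** 2 for x, y in zip(student_a, teacher_a)) / len(student_a)


def expert_likeness_reward(d):
    """Reward channel (assumed form): GAIL-style reward from a discriminator
    probability d in (0, 1) that a state-action pair is expert-like."""
    return -math.log(1.0 - d)


def dual_loss(log_prob, advantage, student_a, teacher_a, lam=0.5):
    """Assumed coupling: policy-gradient surrogate plus weighted action term."""
    reward_term = -log_prob * advantage
    return reward_term + lam * action_distillation_loss(student_a, teacher_a)


if __name__ == "__main__":
    state = [1.0, -2.0]
    a_teacher = teacher_action(state, noise=[0.3, 0.3])
    a_student = [0.0, 0.0]
    r = expert_likeness_reward(0.5)
    # The reward is used directly as a one-step advantage purely for illustration.
    print(dual_loss(log_prob=-1.0, advantage=r, student_a=a_student, teacher_a=a_teacher))
```

In this sketch the reward term pushes the student toward actions the discriminator scores as expert-like, while the action term pulls it toward the teacher's sample at the same state; raising `lam` weights the anchoring more heavily relative to exploration.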

Inference & Quantization · Robotics & Embodied AI · Training Efficiency & Optimization

