This paper introduces PRIMT, a preference-based reinforcement learning framework that leverages foundation models (FMs) to generate multimodal synthetic feedback and synthesize trajectories, addressing traditional PbRL's reliance on extensive human input and its susceptibility to query ambiguity. PRIMT employs a hierarchical neuro-symbolic fusion strategy that combines large language models and vision-language models for robust behavior evaluation, and it integrates foresight trajectory generation and hindsight trajectory augmentation to improve query clarity and credit assignment. Experiments across locomotion and manipulation tasks demonstrate that PRIMT outperforms FM-based and scripted baselines in learning complex robot behaviors.
PRIMT tackles the data inefficiency of preference-based RL by using foundation models to generate synthetic multimodal feedback and synthesize trajectories, significantly outperforming existing FM-based and scripted baselines.
Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulty of resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy that integrates the complementary strengths of large language models and vision-language models in evaluating robot behaviors, yielding more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on two locomotion and six manipulation tasks across multiple benchmarks, demonstrating superior performance over FM-based and scripted baselines.
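For readers unfamiliar with the reward-learning core the abstract refers to, below is a minimal PyTorch sketch of how a PbRL method in this family could turn pairwise (here, FM-generated) preferences into a learned reward via the standard Bradley-Terry model. All names (`RewardNet`, `fuse_fm_preferences`, `preference_loss`) are illustrative assumptions, not the paper's code: the fusion rule is a toy stand-in for PRIMT's hierarchical neuro-symbolic strategy, and the foresight/hindsight trajectory synthesis and causal auxiliary loss are not shown.

```python
# Sketch of preference-based reward learning with synthetic FM labels.
# Assumed/illustrative throughout; PRIMT's actual implementation is not
# specified in this abstract.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Per-step reward model r(s, a) used for downstream policy learning."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def fuse_fm_preferences(llm_pref: float, vlm_pref: float) -> float:
    """Toy fusion of two FM judgments into one soft preference label.

    llm_pref / vlm_pref: each evaluator's probability that segment A is
    preferred. When the evaluators agree, commit to a hard label; when
    they disagree, fall back to an uncertain soft label. PRIMT's actual
    hierarchical neuro-symbolic fusion is more sophisticated than this.
    """
    if llm_pref >= 0.5 and vlm_pref >= 0.5:
        return 1.0
    if llm_pref < 0.5 and vlm_pref < 0.5:
        return 0.0
    return 0.5  # disagreement -> maximally uncertain label

def preference_loss(reward_net: RewardNet, seg_a, seg_b,
                    pref: float) -> torch.Tensor:
    """Bradley-Terry loss on one pair of trajectory segments.

    seg_a / seg_b: (obs, act) tensors of shape [T, dim] per segment.
    pref: soft label in [0, 1]; 1.0 means segment A is preferred.
    """
    ret_a = reward_net(*seg_a).sum()  # segment return under current r
    ret_b = reward_net(*seg_b).sum()
    # P(A preferred) = softmax over the two segment returns; the loss is
    # cross-entropy against the (possibly soft) synthetic label.
    log_p = torch.log_softmax(torch.stack([ret_a, ret_b]), dim=0)
    return -(pref * log_p[0] + (1.0 - pref) * log_p[1])
```

Under the Bradley-Terry model, a segment's probability of being preferred is the softmax of the two segments' predicted returns, so training reduces to cross-entropy against the fused FM label; a soft label of 0.5 is one simple way to keep the reward model from overcommitting when the two evaluators disagree.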