Tsinghua AI, NTU, The Fin AI, UWA
Feb 19, 2026
arXiv:2602.17259

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang

AI Summary

The paper introduces Future Representation Alignment via Parallel Progressive Expansion (FRAPPE) to improve world modeling in generalist robotic policies by addressing limitations of pixel-level reconstruction and error accumulation in future observation prediction. FRAPPE employs a two-stage fine-tuning strategy, first predicting latent representations of future observations, and then aligning these representations with multiple visual foundation models in parallel. Experiments on RoboTwin and real-world tasks demonstrate that FRAPPE achieves superior performance and generalization in long-horizon and unseen scenarios compared to existing methods.

Key Contribution

By aligning latent representations with multiple visual foundation models, FRAPPE offers a more scalable and data-efficient way to imbue generalist robotic policies with robust world-awareness.

Abstract

Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
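The abstract does not give the alignment objective, so the following is only a minimal sketch of what aligning one predicted future latent with several frozen visual encoders at once could look like. Everything here is an assumption for illustration: the cosine-distance loss, the per-teacher linear projection heads, the uniform averaging over teachers, and the names `teacher_a` and `teacher_b` are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of pred and target."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))


def multi_teacher_alignment_loss(latent, heads, teacher_feats):
    """Average the alignment loss over several teacher encoders.

    latent:        (batch, d_model) predicted latent of the future observation
    heads:         dict name -> (d_model, d_teacher) projection matrix
                   (hypothetical: one linear head per teacher)
    teacher_feats: dict name -> (batch, d_teacher) features that a frozen
                   teacher would extract from the real future frame
    """
    per_teacher = {}
    for name, weight in heads.items():
        projected = latent @ weight  # map the latent into this teacher's space
        per_teacher[name] = cosine_alignment_loss(projected, teacher_feats[name])
    # Assumption: teachers are weighted uniformly.
    total = sum(per_teacher.values()) / len(per_teacher)
    return total, per_teacher


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(4, 8))
    heads = {
        "teacher_a": rng.normal(size=(8, 16)),
        "teacher_b": rng.normal(size=(8, 32)),
    }
    # Random stand-ins for teacher features; real ones would come from encoders.
    teacher_feats = {
        "teacher_a": rng.normal(size=(4, 16)),
        "teacher_b": rng.normal(size=(4, 32)),
    }
    total, per_teacher = multi_teacher_alignment_loss(latent, heads, teacher_feats)
    print(total, per_teacher)
```

Under these assumptions the targets are computed from real future frames during training only, so nothing in the sketch requires feeding a predicted observation back into the policy at inference time, which is the error-accumulation issue the abstract raises.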

Multimodal Models, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
