Feb 24, 2026 · arXiv:2602.20659

Recursive Belief Vision Language Model

Vaidehi Bagaria, Bijo Sebastian, Nirav Patel

AI Summary

The paper introduces RB-VLA (Recursive Belief Vision Language Model), a belief-centric architecture that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions for long-horizon manipulation under partial observability. The VLM is queried once for the high-level task specification, while a belief module tracks task progress, enabling phase-aware, causally grounded control without storing raw observations. On long-horizon manipulation benchmarks, RB-VLA outperforms prior VLAs: compared to π0, it achieves 52.5% higher success on multi-stage pick-and-place and 37.5% higher success on stacking, and it reduces inference latency by up to 5x relative to baselines.

Key Contribution

The paper argues that VLAs struggle with long-horizon manipulation primarily because they lack persistent, action-conditioned state representations, not because of limits in semantic reasoning. RB-VLA addresses this gap with a belief-centric architecture and outperforms prior VLAs on long-horizon benchmarks.

Abstract

Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
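The control loop described in the abstract (one VLM query for intent, a recursively updated fixed-size belief, and a policy conditioned on both) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the dimensions, the `tanh` recurrence, the random weights standing in for learned parameters, the `encode_intent` placeholder for the VLM call, and the linear `policy` that stands in for the diffusion policy.

```python
import math
import random

# Hypothetical sizes chosen for illustration only.
BELIEF_DIM = 8
OBS_DIM = 4
ACT_DIM = 2


def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BeliefTracker:
    """Fixed-size latent state updated from (belief, observation, last action)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_b = make_matrix(BELIEF_DIM, BELIEF_DIM, rng)
        self.w_o = make_matrix(BELIEF_DIM, OBS_DIM, rng)
        self.w_a = make_matrix(BELIEF_DIM, ACT_DIM, rng)
        self.belief = [0.0] * BELIEF_DIM

    def update(self, obs, prev_action):
        # Assumed recurrence: b_t = tanh(W_b b_{t-1} + W_o o_t + W_a a_{t-1}).
        terms = zip(
            matvec(self.w_b, self.belief),
            matvec(self.w_o, obs),
            matvec(self.w_a, prev_action),
        )
        self.belief = [math.tanh(b + o + a) for b, o, a in terms]
        return self.belief


def encode_intent(instruction):
    """Placeholder for the single VLM query; returns a tiny fixed-size vector."""
    words = instruction.split()
    return [len(instruction) % 7 / 7.0, len(words) / 10.0]


def policy(belief, intent, w_pi):
    """Linear stand-in for the diffusion policy, conditioned on belief and intent."""
    return [math.tanh(x) for x in matvec(w_pi, belief + intent)]


def run_episode(instruction, steps=50, seed=0):
    rng = random.Random(seed)
    tracker = BeliefTracker(seed)
    intent = encode_intent(instruction)  # computed once, reused at every step
    w_pi = make_matrix(ACT_DIM, BELIEF_DIM + len(intent), rng)
    action = [0.0] * ACT_DIM
    belief_sizes = []
    for _ in range(steps):
        obs = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]  # fake observation
        belief = tracker.update(obs, action)  # past observations are not stored
        action = policy(belief, intent, w_pi)
        belief_sizes.append(len(belief))
    return action, belief_sizes
```

The point of the sketch is structural: the only state carried between steps is the belief vector and the last action, so memory does not grow with episode length, and the intent vector is computed once rather than at every step.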

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
