MIT CSAIL · Cambridge · Princeton · Mar 5, 2026

From Pixels to Predicates: Learning Symbolic World Models via Pretrained VLMs

Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, T. Lozano-Pérez, L. Kaelbling

AI Summary

This paper introduces a method for learning symbolic world models for long-horizon robotics tasks by leveraging pretrained vision-language models (VLMs) to propose and evaluate visual predicates directly from camera images. The proposed approach uses these predicates and demonstrations to train an abstract symbolic world model via optimization, enabling zero-shot generalization to novel goals through planning. Experiments in simulation and the real world demonstrate the method's ability to generalize across varying visual backgrounds, object arrangements, and goal configurations.

Key Contribution

Rather than relying on hand-engineered features, this approach uses VLMs to learn symbolic representations for robotic planning directly from pixels, enabling zero-shot generalization to new environments and goals.

Abstract

Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with varying visual backgrounds, types, numbers, and arrangements of objects, as well as novel goals and much longer horizons than those seen at training time.
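The test-time step described in the abstract (a symbolic state, a learned abstract model, and search-based planning over low-level skills) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes STRIPS-style operators with preconditions and add/delete effects, uses plain breadth-first search, and replaces the VLM's state description with a hard-coded set of atoms. All predicate, object, and skill names below are hypothetical.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional

Atom = str  # a ground predicate such as "Inside(cup, bin)" (hypothetical name)
State = FrozenSet[Atom]


class Operator(NamedTuple):
    """Assumed STRIPS-style abstraction of one low-level skill."""
    name: str
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    delete_effects: FrozenSet[Atom]


def apply_operator(state: State, op: Operator) -> State:
    """Return the abstract successor state after executing op."""
    return (state - op.delete_effects) | op.add_effects


def plan(init: State, goal: FrozenSet[Atom],
         ops: List[Operator]) -> Optional[List[str]]:
    """Breadth-first search for a shortest skill sequence reaching the goal."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal atom holds in this state
            return path
        for op in ops:
            if op.preconditions <= state:
                nxt = apply_operator(state, op)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [op.name]))
    return None  # goal unreachable under the given operators


# Hard-coded stand-in for the VLM's symbolic description of a camera image.
init_state = frozenset({"OnTable(cup)", "HandEmpty()"})
ops = [
    Operator("Pick(cup)",
             frozenset({"OnTable(cup)", "HandEmpty()"}),
             frozenset({"Holding(cup)"}),
             frozenset({"OnTable(cup)", "HandEmpty()"})),
    Operator("PlaceInBin(cup)",
             frozenset({"Holding(cup)"}),
             frozenset({"Inside(cup, bin)", "HandEmpty()"}),
             frozenset({"Holding(cup)"})),
]
result = plan(init_state, frozenset({"Inside(cup, bin)"}), ops)
print(result)
```

Because the planner only sees sets of atoms, the same operators apply to any goal expressible in the predicates, which is the sense in which a symbolic model supports generalization to novel goals.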

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: IEEE Robotics and Automation Letters
