TiPToP is a modular system that combines pretrained vision foundation models with a Task and Motion Planner (TAMP) to solve multi-step manipulation tasks from RGB images and natural-language instructions. It requires no robot data, can be installed and run on a standard DROID setup in under one hour, and adapts to new embodiments with minimal effort. Across 28 tabletop manipulation tasks in simulation and the real world, TiPToP matches or outperforms a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations.
Zero-shot robotic manipulation is now within reach: TiPToP matches a 350-hour fine-tuned model without *any* robot data.
We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy-to-use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $\pi_{0.5}$-DROID, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io