Stanford HAI | Central South University | Waterloo | Jun 10, 2026 | arXiv:2606.12402

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark W. Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

AI Summary

This paper introduces DIRECT, a routing framework that uses multimodal scene context to allocate test-time compute per prompt in embodied planners, addressing the inefficiency of uniform compute scaling. The authors show that the three scaling axes studied (chain-of-thought depth, model size, and memory history) yield qualitatively distinct capability gains, so scaling them uniformly raises latency, token usage, and FLOPs for uneven returns. Experiments on VLABench and RoboMME, together with validation on a physical Franka arm in a DROID setup, show that the router matches or exceeds a stronger model's success rate at up to 65% lower average latency.

Key Contribution

Naively scaling test-time compute is wasteful; routing it per prompt with DIRECT matches or exceeds a stronger model's success rate at up to 65% lower average latency.

Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.
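
To make the "success-cost Pareto frontier" and "allocate compute per prompt" ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes each planner configuration has a known average latency and success rate, and that some predictor (not shown here) supplies a per-prompt success estimate for each configuration. The names `Config`, `pareto_frontier`, and `route`, and the threshold rule itself, are hypothetical illustrations.

```python
from typing import Dict, List, NamedTuple


class Config(NamedTuple):
    """One fixed planner configuration (hypothetical fields)."""
    name: str
    latency_s: float  # assumed average latency per planning call, in seconds
    success: float    # assumed task success rate in [0, 1]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep configs that no other config beats on both latency and success."""
    frontier = []
    for c in configs:
        dominated = any(
            o.latency_s <= c.latency_s
            and o.success >= c.success
            and (o.latency_s < c.latency_s or o.success > c.success)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    # Sort cheapest first so the trade-off curve reads left to right.
    return sorted(frontier, key=lambda c: c.latency_s)


def route(predicted_success: Dict[str, float],
          latency_s: Dict[str, float],
          target: float) -> str:
    """Pick the cheapest config whose predicted success meets the target.

    Falls back to the config with the highest predicted success when
    none reaches the target. This rule is an assumption for illustration.
    """
    good_enough = [n for n, p in predicted_success.items() if p >= target]
    if good_enough:
        return min(good_enough, key=lambda n: latency_s[n])
    return max(predicted_success, key=lambda n: predicted_success[n])
```

Under this toy rule, an easy prompt is sent to the cheapest adequate configuration and a hard prompt escalates to the strongest one, which is the kind of per-prompt allocation that could lower average latency relative to always running the most expensive configuration.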

Multimodal Models | Robotics & Embodied AI | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
