Search papers, labs, and topics across Lattice.
This paper introduces DIRECT, a routing framework that optimally allocates test-time compute in embodied planners by leveraging multimodal scene context, addressing the inefficiencies of uniform compute scaling. The authors demonstrate that varying compute allocation across different scaling axes鈥攕uch as chain-of-thought depth, model size, and memory history鈥攃an yield distinct and significant improvements in success rates while reducing latency and resource consumption. Experiments conducted on VLABench and a physical Franka arm show that DIRECT achieves comparable or superior performance to stronger models with up to 65% lower average latency, emphasizing the importance of strategic compute allocation in real-world deployments.
Naively scaling test-time compute is wasteful; strategically allocating it with DIRECT can enhance embodied agent performance while slashing latency by up to 65%.
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success--cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at jadee-dao.github.io/direct/.