Tsinghua AI, China University of Mining Technology, Shanghai AI Lab, SJTU, ZJU
Jun 11, 2026
arXiv:2606.13040

RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

Dayu Xia, Yue Shi, Yao Mu, Huiting Ji, Chaofan Ma, Yingjie Zhou, Hua Chen, Yang Liu, Yang Liu, Jiezhang Cao, Guangtao Zhai

AI Summary

This paper introduces RoboProcessBench, a benchmark for evaluating process-aware understanding in vision-language models (VLMs) for robotic manipulation. It addresses a gap in existing evaluations, which do not test fine-grained process understanding, by decomposing that capability into two dimensions, static monitoring and dynamic reasoning, assessed through 12 diagnostic question families. The accompanying corpus contains approximately 58,000 question-answer pairs across 260 manipulation tasks. Evaluation reveals broad limitations in current VLMs' ability to comprehend manipulation processes, while post-training Qwen2.5-VL-7B and InternVL-3-8B on the training split yields consistent gains, so RoboProcessBench serves as both a benchmark and a training resource.

Key Contribution

Current vision-language models struggle with process understanding in robotic manipulation, but targeted post-training on ProcessData-SFT yields consistent gains on local state, motion, progress, and primitive-aware cues.

Abstract

Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success, but also how a manipulation execution is progressing physically and temporally. However, existing evaluations fail to test whether VLMs possess fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes this capability into two complementary dimensions, static monitoring and dynamic reasoning, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains ~58k question-answer pairs across 260 manipulation tasks and is further split into ProcessData-SFT and ProcessData-Eval for post-training and evaluation, respectively. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across the 12 diagnostic task families, suggesting that current models still lack robust process-aware understanding of manipulation executions. With ProcessData-SFT, however, the post-trained Qwen2.5-VL-7B and InternVL-3-8B exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: https://processbench-2026.github.io/RoboProcessBench-Web/
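As an illustration of what reporting results "across 12 diagnostic task families" and two dimensions could look like, the following is a minimal sketch of grouped accuracy over question-answer records. It is not the authors' evaluation code: the record fields (family, dimension, answer, prediction), the family-to-dimension assignments in the toy data, and the case-insensitive exact-match scoring rule are all assumptions made for the example, and the actual ProcessData schema and metric may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    # All field names are hypothetical; the real ProcessData schema may differ.
    family: str      # diagnostic question family, e.g. "contact"
    dimension: str   # "static_monitoring" or "dynamic_reasoning"
    answer: str      # ground-truth answer
    prediction: str  # model output


def accuracy_by(items, key):
    """Return {group: accuracy}, where each item's group is key(item).

    Scoring is case-insensitive exact match (an assumed rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = key(item)
        total[group] += 1
        if item.prediction.strip().lower() == item.answer.strip().lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records; the dimension assigned to each family here is illustrative only.
items = [
    QAItem("contact", "static_monitoring", "yes", "Yes"),
    QAItem("contact", "static_monitoring", "no", "yes"),
    QAItem("temporal_order", "dynamic_reasoning", "B", "B"),
]

per_family = accuracy_by(items, lambda it: it.family)
per_dimension = accuracy_by(items, lambda it: it.dimension)
```

Grouping by family rather than reporting a single pooled score keeps weaknesses in one cue type (for example, contact) from being hidden by strengths in another.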

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
