This paper introduces SC3-Eval, a self-consistent video generation recipe for evaluating robot manipulation policies with action-conditioned video world models. By enforcing forward-inverse dynamics consistency, cross-view consistency, and test-time consistency, SC3-Eval counteracts compounding errors and keeps multiple camera views coherent during policy rollouts. Across seven real-world vision-language-action policies, it achieves a closed-loop Pearson correlation of 0.929 and outperforms three prior video-model-based baselines, and its rollouts reproduce the failure modes that the policies exhibit in the real world.
SC3-Eval achieves a closed-loop Pearson correlation of 0.929 when evaluating robot policies, and its rollouts reproduce the failure modes those policies exhibit in the real world.
Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, three challenges remain: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift that a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and an MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
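The test-time consistency check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-absolute-difference score, the fixed threshold, and the one-dimensional toy "frames" are all assumptions made for illustration. The sketch only shows the general loop: generate frames for an action chunk, recover actions from those frames with an inverse dynamics function, and stop the rollout when the recovered actions disagree too much with the requested ones.

```python
from typing import Callable, List, Sequence, Tuple

Action = Sequence[float]
Frame = Sequence[float]  # toy stand-in for an image observation


def chunk_discrepancy(requested: Sequence[Action], recovered: Sequence[Action]) -> float:
    """Mean absolute difference between requested and recovered actions.

    Assumption: the abstract does not specify the uncertainty score.
    """
    total, count = 0.0, 0
    for req, rec in zip(requested, recovered):
        for x, y in zip(req, rec):
            total += abs(x - y)
            count += 1
    return total / count if count else 0.0


def rollout_with_consistency_check(
    policy: Callable[[Frame], List[Action]],
    forward_model: Callable[[Frame, Sequence[Action]], List[Frame]],
    inverse_model: Callable[[Sequence[Frame]], List[Action]],
    first_frame: Frame,
    max_chunks: int,
    threshold: float,
) -> Tuple[List[Frame], int, bool]:
    """Roll out chunk by chunk; terminate when a chunk looks inconsistent.

    Returns (accepted frames, number of accepted chunks, terminated early?).
    """
    frames: List[Frame] = [first_frame]
    for chunk_idx in range(max_chunks):
        actions = policy(frames[-1])                      # requested action chunk
        new_frames = forward_model(frames[-1], actions)   # generated frames
        # Recover actions from the last accepted frame plus the generated frames.
        recovered = inverse_model([frames[-1]] + list(new_frames))
        if chunk_discrepancy(actions, recovered) > threshold:
            return frames, chunk_idx, True                # drop the drifting chunk
        frames.extend(new_frames)
    return frames, max_chunks, False


# --- Hypothetical toy components, for demonstration only ---
def toy_policy(frame: Frame) -> List[Action]:
    return [[1.0], [1.0]]  # a chunk of two 1-D actions


def make_toy_forward(drift_after_calls: int) -> Callable[[Frame, Sequence[Action]], List[Frame]]:
    calls = {"n": 0}

    def forward(frame: Frame, actions: Sequence[Action]) -> List[Frame]:
        calls["n"] += 1
        # After a few chunks the toy model under-responds to actions (drift).
        gain = 1.0 if calls["n"] <= drift_after_calls else 0.2
        pos, out = frame[0], []
        for action in actions:
            pos = pos + gain * action[0]
            out.append([pos])
        return out

    return forward


def toy_inverse(frames: Sequence[Frame]) -> List[Action]:
    # Recover each action as the difference between consecutive frames.
    return [[b[0] - a[0]] for a, b in zip(frames, frames[1:])]
```

In this toy setup, calling `rollout_with_consistency_check(toy_policy, make_toy_forward(2), toy_inverse, [0.0], max_chunks=5, threshold=0.5)` accepts the first two chunks and terminates on the third, where the generated frames stop tracking the requested actions. A real evaluator would substitute the video model's forward and inverse dynamics modes and would need a calibrated threshold.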