This paper addresses unreliable uncertainty quantification in Vision-Language-Action (VLA) models for robotics, where mean aggregation over a rollout dilutes critical, short-lived uncertainty spikes. The authors propose an approach that combines max-based sliding window pooling, motion-aware stability weighting, and DoF-adaptive calibration to better capture transient risk signals and unstable behaviors. Experiments on the LIBERO benchmark show improved failure prediction accuracy, which makes failure detection more reliable for human-in-the-loop interventions.
Don't let your robot's brief moment of panic get lost in the noise – this new uncertainty method spotlights those critical spikes to predict failures before they happen.
Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs DoF-adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
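To make the dilution argument concrete, here is a minimal sketch comparing rollout-level mean aggregation with one plausible reading of max-based sliding window pooling: average the per-timestep uncertainty inside each window, then take the maximum over windows. The function names, the window length of 5, and the synthetic entropy traces are illustrative assumptions, not the authors' implementation; motion-aware weighting and DoF-adaptive calibration are not shown.

```python
from statistics import fmean


def mean_score(entropy):
    """Baseline: average the per-timestep uncertainty over the whole rollout."""
    return fmean(entropy)


def windowed_max_score(entropy, window=5):
    """Assumed pooling rule: mean within each sliding window, max across windows.

    The window length is a hypothetical hyperparameter chosen for illustration.
    """
    values = list(entropy)
    if len(values) <= window:
        # Rollout shorter than one window: fall back to the plain mean.
        return fmean(values)
    return max(
        fmean(values[i:i + window])
        for i in range(len(values) - window + 1)
    )


# Synthetic traces (made up for this sketch, not data from the paper).
# Failure: low entropy for most timesteps, then a brief spike at the end.
failure_rollout = [0.1] * 95 + [2.0] * 5
# Success: moderate, diffuse entropy with no sharp spike.
success_rollout = [0.3] * 100

for name, trace in [("failure", failure_rollout), ("success", success_rollout)]:
    print(name, round(mean_score(trace), 3), round(windowed_max_score(trace), 3))
```

On these synthetic traces the mean assigns the failing rollout a lower score than the successful one, while the windowed maximum ranks the failing rollout higher, which is the behavior the abstract attributes to preserving transient spikes.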