CMU ML | Jun 10, 2026 | arXiv:2606.12299

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

AI Summary

This paper introduces a framework for steering Vision-Language-Action (VLA) models by interactively searching for language sequences that improve closed-loop task performance. These sequences are distilled into a test-time language feedback policy (LFP), and a conformalized improvement head predicts when steering will help, with the aim of preventing interventions that reduce performance relative to the original instruction in out-of-distribution scenarios. On seen environments, the conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, the authors report strong harmlessness guarantees.

Key Contribution

A conformalized language feedback policy improves the performance of a frozen pre-trained VLA on seen environments by 24.7% in simulation and 65.0% in hardware, with harmlessness guarantees under visual and semantic perturbations.

Abstract

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.
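One way to picture the conformalized gating step is a one-sided split-conformal bound on the improvement head's prediction. The sketch below is an illustration under stated assumptions, not the authors' procedure, which the abstract does not specify. It assumes the head outputs a scalar predicted improvement, that a held-out calibration set of (predicted, observed) improvements is available, and that calibration and test episodes are exchangeable. The names `conformal_margin` and `should_steer` are hypothetical.

```python
import math


def conformal_margin(predicted, observed, alpha):
    """One-sided split-conformal margin (illustrative, not the paper's method).

    predicted: improvement-head outputs on calibration episodes (assumed scalar).
    observed:  measured improvement of steering over the original instruction.
    alpha:     tolerated probability that the lower bound is violated.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need equally sized, non-empty calibration lists")
    # Nonconformity score: how much the head overestimated the improvement.
    scores = sorted(p - o for p, o in zip(predicted, observed))
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite margin exists.
        return math.inf
    return scores[k - 1]


def should_steer(predicted_improvement, margin):
    """Steer only if the conformal lower bound on improvement is positive."""
    return predicted_improvement - margin > 0.0


# Hypothetical calibration data, chosen only to make the arithmetic easy to follow.
calib_predicted = [0.5, 0.25, 0.75, 0.5, 0.25, 0.5, 0.75]
calib_observed = [0.5, 0.0, 0.25, 0.75, 0.25, 0.0, 0.5]
margin = conformal_margin(calib_predicted, calib_observed, alpha=0.25)
```

Under the exchangeability assumption, the observed improvement falls below `predicted - margin` with probability at most `alpha`, so steering only when that lower bound is positive keeps the rate of harmful interventions at or below `alpha`. When the calibration set is too small for the chosen `alpha`, the margin is infinite and the gate always falls back to the original instruction.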

Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
