CAS | Hebei Key Laboratory of Cognitive Intelligence | Hebei University of Technology | University | Jun 7, 2026 | arXiv:2606.08653

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

AI Summary

This paper introduces FiberTune, a training-time objective that mitigates residual visual collapse in vision-language-action (VLA) policies during action-supervised fine-tuning. An online action probe estimates action-predictive feature directions, which are filtered out of intermediate visual-token representations; the remaining residuals are aligned to a frozen visual teacher while their effective rank is regularized. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in six controlled simulation settings and on physical SO-101 pick-place, with gains including +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and an increase in physical task success from 72.7% to 78.1%.

Key Contribution

FiberTune preserves teacher-structured visual residuals during VLA fine-tuning without adding inference-time overhead, yielding gains of up to 10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D.

Abstract

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
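The three steps named in the abstract (filter probe directions, align the residual to a teacher, regularize effective rank) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the probe is taken to be a single linear map with full row rank, teacher features are assumed to share the student's feature dimension, alignment is measured by cosine similarity, and effective rank is computed as the exponential of the entropy of the normalized singular values. The function names and the `rank_weight` parameter are hypothetical.

```python
import numpy as np


def filter_probe_directions(feats, probe_weight):
    """Project features onto the complement of a linear probe's row space.

    feats: (n_tokens, d) visual-token features.
    probe_weight: (action_dim, d) linear action probe (assumed full row rank).
    """
    # Orthonormal basis for the directions the probe uses to predict actions.
    basis, _ = np.linalg.qr(probe_weight.T)  # shape (d, action_dim)
    return feats - (feats @ basis) @ basis.T


def alignment_loss(residual, teacher, eps=1e-8):
    """One minus the mean per-token cosine similarity (assumed alignment measure)."""
    r = residual / (np.linalg.norm(residual, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(r * t, axis=1)))


def effective_rank(x, eps=1e-12):
    """Exponential of the entropy of the normalized singular values."""
    s = np.linalg.svd(x, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))


def residual_objective(feats, probe_weight, teacher, rank_weight=0.1):
    """Hypothetical auxiliary loss: align residuals and reward higher effective rank."""
    residual = filter_probe_directions(feats, probe_weight)
    rank_term = np.log(effective_rank(residual))
    return alignment_loss(residual, teacher) - rank_weight * rank_term
```

In actual training these quantities would need to be computed in an autodiff framework so that gradients reach the policy's visual tokens; the NumPy version only shows what is being measured. Because the filtering is a projection, the residual carries no component along the probe's directions, which is the property the abstract attributes to "probe-filtered residuals".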

Computer Vision, Multimodal Models

Year: 2026

Venue: N/A
