Fudan · JD.com · TJU · Jun 22, 2026 · arXiv:2606.22794

UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models

Lin Sun, Zhiwei Guan, Conglin Wang, Zihong Chen, Jianhai Yu, Zongsheng Li, Boyong He, Tao Sun, Jiale Cao, Lige Liu

AI Summary

This paper introduces UniFS, a unified fast-to-slow hierarchical architecture designed to enhance the efficiency and performance of vision-language-action models by addressing the frequency dilemma inherent in existing systems. By stratifying VLM layers based on update frequency, employing a latent vector inversion mechanism for improved feature interaction, and implementing a multi-level supervision strategy, UniFS enables richer information transfer while preserving temporal context. Experimental results demonstrate that UniFS achieves state-of-the-art performance with a 2.5% improvement in success rate and a 2.1× reduction in inference latency compared to the VLA-Adapter baseline.

Key Contribution

UniFS raises the average LIBERO success rate by 2.5% over the VLA-Adapter baseline while cutting average inference latency from 36.5 ms to 17.8 ms, a 2.1× speedup.

Abstract

Mainstream Fast-Slow dual-system vision-language-action (VLA) models decouple a high-frequency action expert from a low-frequency vision-language model (VLM) for efficiency, yet they face a fundamental frequency dilemma: large update gaps cause semantic drift from stale context, while small gaps erode the intended computational savings. Moreover, because the action expert receives only the VLM's final-layer representation at a single fixed frequency, rich intermediate features are discarded, limiting both information coupling and manipulation precision. Inspired by multi-timescale neural processing in the human brain, we introduce UniFS, a unified fast-to-slow architecture that resolves these challenges through three key designs. First, we stratify the VLM layers into groups with progressively decreasing update frequencies, enabling shallow layers to capture fast-changing dynamics while deeper layers cache stable semantic context. Second, a latent vector inversion mechanism re-routes the interaction order between multi-scale VLM features and the action expert, aligning fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning. Third, a multi-level supervision strategy enforces a coarse-to-fine learning hierarchy across temporal scales. Together, these designs enable richer cross-frequency information transfer within a single backbone, while the low-frequency pathways additionally preserve temporal context across steps. Experiments on LIBERO show that UniFS achieves state-of-the-art performance (98.3% average success rate, a 2.5% gain over the VLA-Adapter baseline) while reducing average inference latency from 36.5 ms to 17.8 ms (a 2.1× speedup). Real-robot experiments on a Franka platform further validate its practical applicability. Code is open-sourced at https://github.com/linsun449/UniFS.
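The first design, layer groups refreshed at progressively lower frequencies, can be illustrated with a toy scheduler. This is only a minimal sketch and not the authors' implementation: the abstract does not state the actual update periods, so the periods `[1, 2, 4]`, the `LayerGroup` class, and the `step` and `run` helpers are all hypothetical, and a scalar function stands in for each group's forward pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LayerGroup:
    """A block of consecutive layers refreshed every `period` control steps."""
    fn: Callable[[float], float]    # stand-in for the group's forward pass
    period: int                     # hypothetical update period (1 = every step)
    cache: Optional[float] = None   # most recently computed output
    calls: int = 0                  # number of real forward passes executed


def step(groups: List[LayerGroup], x: float, t: int) -> List[float]:
    """Run one control step and return each group's fresh or cached output."""
    outputs = []
    h = x
    for g in groups:
        if g.cache is None or t % g.period == 0:
            g.cache = g.fn(h)       # recompute from the current input
            g.calls += 1
        h = g.cache                 # otherwise reuse the cached output
        outputs.append(h)
    return outputs


def run(periods: List[int], num_steps: int) -> List[int]:
    """Count forward passes per group over `num_steps` control steps."""
    groups = [LayerGroup(fn=lambda v: v + 1.0, period=p) for p in periods]
    for t in range(num_steps):
        step(groups, float(t), t)
    return [g.calls for g in groups]


if __name__ == "__main__":
    calls = run([1, 2, 4], num_steps=8)
    print(calls, sum(calls))        # forward passes per group and in total
```

With these assumed periods, eight control steps cost 14 group forward passes instead of the 24 needed when every group runs at every step. `step` returns one output per group, so a downstream consumer could read features from every depth rather than only the final one.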

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
