UCLA | Feb 23, 2026 | arXiv:2602.19449

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu

AI Summary

The paper introduces Codebook RegulAted Fine-Tuning (CRAFT), a method for adapting vision encoders in LVLMs to domain-specific visual tasks by anchoring visual representations to a discrete codebook. CRAFT decouples the vision encoder adaptation from the language model by fine-tuning the encoder to map visual features to a shared, stable token space defined by the codebook. This approach allows the adapted encoder to improve performance across different LVLM architectures that use the same codebook, achieving a 13.51% average gain on 10 domain-specific benchmarks.

Key Contribution

A codebook-anchored adaptation method that decouples vision encoder tuning from the language model, giving LVLMs a 13.51% average gain across 10 domain-specific visual benchmarks.

Abstract

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
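The abstract describes the codebook interface without implementation details, so the following is only a minimal sketch of the general idea of mapping continuous patch features to discrete token ids via nearest-neighbor lookup. It is not the authors' training procedure. The names `nearest_code`, `quantize`, and `CODEBOOK`, the Euclidean distance, and all numeric values are assumptions chosen for illustration.

```python
import math

def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to `feature` (Euclidean)."""
    distances = [math.dist(feature, code) for code in codebook]
    return distances.index(min(distances))

def quantize(features, codebook):
    """Map each continuous patch feature to a discrete token id."""
    return [nearest_code(f, codebook) for f in features]

# Hypothetical frozen codebook with four 2-D entries (toy values, not from the paper).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Toy patch features from a "base" encoder and from an "adapted" encoder
# for the same image (values invented for illustration).
base_features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8)]
adapted_features = [(0.05, 0.1), (0.6, 0.7), (0.1, 0.95)]

base_tokens = quantize(base_features, CODEBOOK)
adapted_tokens = quantize(adapted_features, CODEBOOK)

print(base_tokens)
print(adapted_tokens)
```

In this toy setup the adapted encoder emits a different token for the second patch, yet every id still indexes the same fixed codebook. That is the property the abstract relies on: a language model that already consumes tokens from this codebook needs no re-alignment when the encoder changes.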

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
