The paper introduces Region-to-Image Distillation, a method to improve fine-grained multimodal perception in MLLMs by distilling knowledge from zoomed-in regions to the full image during training. This approach generates high-quality VQA data from micro-cropped regions using strong teacher models and then transfers this region-grounded supervision to a student model. The resulting student model exhibits improved fine-grained perception in a single forward pass, eliminating the need for iterative zooming during inference.
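To make the training-time pipeline concrete, here is a minimal sketch of how region-grounded VQA data could be assembled, assuming a hypothetical region proposer and teacher call (`propose_regions`, `teacher_generate_vqa` are placeholders; the paper's actual region selection and teacher prompting may differ):

```python
# Sketch of the Region-to-Image Distillation data pipeline: the teacher sees the
# zoomed-in crop, but the resulting QA pair is attached to the FULL image so the
# student learns to answer it in a single glance.
from dataclasses import dataclass
from PIL import Image

@dataclass
class VQASample:
    image_path: str   # full image, not the crop
    question: str
    answer: str

def propose_regions(image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Hypothetical region proposer (e.g. a detector or saliency heuristic)
    returning (left, top, right, bottom) boxes of small, detail-rich areas."""
    raise NotImplementedError

def teacher_generate_vqa(crop: Image.Image) -> tuple[str, str]:
    """Hypothetical call to a strong teacher MLLM that writes a question and
    answer grounded in the zoomed-in crop."""
    raise NotImplementedError

def build_distillation_data(image_path: str) -> list[VQASample]:
    image = Image.open(image_path)
    samples = []
    for box in propose_regions(image):
        crop = image.crop(box)                       # zoom in: teacher reads the micro-crop
        question, answer = teacher_generate_vqa(crop)
        # Distill back: supervision is paired with the full image, so the student
        # must recover the same detail without any inference-time zooming.
        samples.append(VQASample(image_path, question, answer))
    return samples
```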
Ditch the slow, iterative zooming during MLLM inference: Region-to-Image Distillation lets you bake those agentic zooming benefits directly into a single forward pass.
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but they incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA examples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
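The dual-view protocol can be pictured as scoring each question twice, once against the full image and once against the ground-truth crop, and reporting the accuracy difference. The sketch below assumes a generic `model_answer(image, question)` wrapper and simple exact-match scoring; ZoomBench's actual fields and scoring may differ.

```python
# Hedged sketch of the dual-view "zooming gap": regional accuracy (model sees the
# crop) minus global accuracy (model sees the full image). A smaller gap means
# better single-glance fine-grained perception.
def zooming_gap(samples: list[dict], model_answer) -> float:
    """samples: dicts with 'image', 'crop', 'question', 'answer' (assumed schema).
    model_answer(image, question) -> str is any MLLM inference wrapper."""
    global_correct = regional_correct = 0
    for s in samples:
        if model_answer(s["image"], s["question"]).strip() == s["answer"]:
            global_correct += 1     # answered correctly from the global view
        if model_answer(s["crop"], s["question"]).strip() == s["answer"]:
            regional_correct += 1   # answered correctly from the zoomed-in view
    return (regional_correct - global_correct) / len(samples)
```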