Feb 16, 2026 | arXiv:2602.14482

TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao

AI Summary

The paper introduces TikArt, an aperture-guided agent that iteratively refines its visual focus using zoom and segmentation actions to improve fine-grained visual reasoning in MLLMs. TikArt employs a Think-Aperture-Observe loop, where the agent alternates between language generation and aperture actions (Zoom and Segment, the latter using SAM2) to extract and verbalize local visual cues. The agent's reasoning policy is optimized using AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum that encourages purposeful aperture use, leading to performance gains on several fine-grained reasoning benchmarks.
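The "GRPO-style" part of the summary can be illustrated with a minimal sketch of group-relative advantage estimation, the idea that GRPO-family methods share: several rollouts are sampled for the same prompt and each reward is normalized against its own group. This is not the paper's AGRPO implementation. The function names, the `eps` constant, and the reward shaping (including `aperture_bonus`) are assumptions made for illustration; the source only says that rewards couple task success with purposeful aperture use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts in its group.

    `rewards` holds one scalar per rollout sampled for the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_reward(task_correct, used_aperture, aperture_bonus=0.1):
    """Hypothetical reward shaping, NOT taken from the paper.

    The aperture bonus only counts when the final answer is also correct,
    so aperture actions that do not help the task earn nothing.
    """
    bonus = aperture_bonus if used_aperture else 0.0
    return float(task_correct) * (1.0 + bonus)


if __name__ == "__main__":
    # Four rollouts for one prompt: (answer correct?, aperture action used?)
    group = [(True, True), (True, False), (False, True), (False, False)]
    rewards = [shaped_reward(c, a) for c, a in group]
    print(group_relative_advantages(rewards))
```

Under this assumed shaping, a correct rollout that used an aperture action receives the largest advantage in its group, while incorrect rollouts fall below the group mean regardless of aperture use.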

Key Contribution

By learning to zoom in on relevant image regions, TikArt reports consistent gains over its Qwen3-VL-8B backbone on fine-grained visual reasoning benchmarks.

Abstract

We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
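The Think-Aperture-Observe loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `policy` object (with `think` and `observe` methods), the action dictionary layout, the `segmenter` callable standing in for SAM2, and the `max_steps` limit are all hypothetical, and images are plain nested lists rather than real pixel tensors.

```python
def zoom(image, box):
    """Zoom action: return the rectangular crop (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def mask_crop(image, mask):
    """Segment action post-processing: crop to the mask's bounding box and
    blank out (set to 0) every pixel that lies outside the mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, m in enumerate(row) if m]
    if not ys:
        return []  # empty mask, nothing to show
    y0, y1 = min(ys), max(ys) + 1
    x0, x1 = min(xs), max(xs) + 1
    return [
        [image[y][x] if mask[y][x] else 0 for x in range(x0, x1)]
        for y in range(y0, y1)
    ]


def run_episode(image, policy, segmenter, max_steps=4):
    """Alternate Think -> Aperture -> Observe until the policy answers.

    `policy.think(image, memory)` returns (thought, action); `action` is a dict
    whose "type" is "zoom", "segment", or "answer" (an assumed format).
    `segmenter(image, prompt)` is a placeholder for a mask model such as SAM2
    and must return a binary mask with the same shape as `image`.
    """
    memory = []  # observations kept as text, the "persistent linguistic memory"
    for _ in range(max_steps):
        _thought, action = policy.think(image, memory)
        if action["type"] == "answer":
            return action["text"], memory
        if action["type"] == "zoom":
            view = zoom(image, action["box"])
        elif action["type"] == "segment":
            view = mask_crop(image, segmenter(image, action["prompt"]))
        else:
            raise ValueError(f"unknown action type: {action['type']}")
        # Every aperture action must be followed by an explicit observation.
        memory.append(policy.observe(view))
    return None, memory  # step budget exhausted without an answer
```

Keeping observations as text in `memory` mirrors the abstract's requirement that each action be followed by an explicit observation, so later reasoning steps can rely on earlier local views without re-encoding them.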

Multimodal Models | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
