The paper introduces TikArt, an aperture-guided agent that iteratively refines its visual focus using zoom and segmentation actions to improve fine-grained visual reasoning in MLLMs. TikArt employs a Think-Aperture-Observe loop, where the agent alternates between language generation and aperture actions (Zoom and Segment, the latter using SAM2) to extract and verbalize local visual cues. The agent's reasoning policy is optimized using AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum that encourages purposeful aperture use, leading to performance gains on several fine-grained reasoning benchmarks.
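The Think-Aperture-Observe loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `zoom`, `run_episode`, `toy_policy`, and `toy_observe` are hypothetical, the image is a toy grid of integers, and the policy and observer are plain callables standing in for the MLLM. Only the Zoom action is shown; a Segment action would slot into the same place but is omitted here because the source does not describe how SAM2 is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


@dataclass
class Step:
    """One policy output: a thought plus either a Zoom action or a final answer."""
    thought: str
    zoom_box: Optional[Box] = None  # set when the policy chooses Zoom
    answer: Optional[str] = None    # set when the policy decides to stop


def zoom(image: Image, box: Box) -> Image:
    """Zoom action: return the rectangular crop, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def run_episode(
    image: Image,
    question: str,
    policy: Callable[[str, List[str]], Step],
    observe: Callable[[Image], str],
    max_steps: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Alternate Think -> Aperture -> Observe until the policy answers."""
    memory: List[str] = []  # linguistic memory of thoughts and observations
    for _ in range(max_steps):
        step = policy(question, memory)              # Think
        memory.append(f"think: {step.thought}")
        if step.answer is not None:
            return step.answer, memory
        if step.zoom_box is None:
            raise ValueError("policy must either answer or choose an action")
        crop = zoom(image, step.zoom_box)            # Aperture
        memory.append(f"observe: {observe(crop)}")   # Observe (always verbalized)
    return None, memory                              # step budget exhausted


# Hypothetical stand-ins for the model, used only to exercise the loop.
def toy_policy(question: str, memory: List[str]) -> Step:
    if not any(m.startswith("observe:") for m in memory):
        return Step(thought="the mark is small; zoom lower right", zoom_box=(2, 2, 4, 4))
    return Step(thought="the crop shows the mark", answer=memory[-1].split(": ", 1)[1])


def toy_observe(crop: Image) -> str:
    return f"brightest pixel is {max(max(row) for row in crop)}"
```

The design point the sketch tries to capture is that every aperture action is followed by a written observation, so later reasoning steps read from text memory rather than from the discarded crop.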
By learning to zoom in on relevant image regions, TikArt delivers consistent gains over its Qwen3-VL-8B backbone on fine-grained visual reasoning benchmarks.
We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
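The abstract describes AGRPO only as "GRPO-style" with rewards that couple task success with purposeful aperture use, so the following is a rough sketch under stated assumptions rather than the paper's algorithm. It shows the group-normalized advantage commonly used in GRPO (each rollout's reward minus the group mean, divided by the group standard deviation) together with a hypothetical reward, `rollout_reward`, in which an aperture bonus is paid only when the answer is also correct. The bonus value, the gating rule, and the use of the population standard deviation are all illustrative choices, not details from the source.

```python
from statistics import mean, pstdev
from typing import List


def rollout_reward(task_correct: bool, used_aperture: bool,
                   aperture_bonus: float = 0.2) -> float:
    """Hypothetical coupled reward: the aperture bonus counts only on success."""
    if not task_correct:
        return 0.0
    return 1.0 + (aperture_bonus if used_aperture else 0.0)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts sampled for the same prompt.
rewards = [
    rollout_reward(task_correct=True, used_aperture=True),
    rollout_reward(task_correct=True, used_aperture=False),
    rollout_reward(task_correct=False, used_aperture=True),
    rollout_reward(task_correct=False, used_aperture=False),
]
advantages = group_advantages(rewards)
```

Under this toy reward, a correct rollout that used an aperture action receives the largest advantage, while using the aperture without answering correctly earns nothing extra, which is one simple way to discourage aimless zooming.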