Tsinghua AIArena | PKU | UCLA | UMD | Jun 11, 2026 | arXiv:2606.13929

Self-Evolving Visual Questioner

Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou

AI Summary

This paper introduces a self-evolving framework for visual question generation that leverages vision-language models (VLMs) to autonomously create and refine questions without external supervision. By using the VLM itself as both the question proposer and filter, the method generates more challenging and diverse visual-centric questions, which are then utilized to train the model in both questioning and answering capacities. Experimental results demonstrate that this approach not only improves the quality of generated questions but also enhances the model's performance as an answerer, outperforming traditional static training methods under the same resource constraints.

Key Contribution

A VLM can autonomously evolve its questioning capabilities without external supervision, producing harder and more diverse questions that also improve its performance as an answerer.

Abstract

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves the quality and expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.
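The propose-then-filter loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`propose`, `score`, `similarity`), the thresholds, and the toy stand-ins for the VLM are all hypothetical, and the abstract does not specify the actual filtering criteria or the training step (which is omitted).

```python
from typing import Callable, List, Tuple


def self_evolve_round(
    images: List[str],
    propose: Callable[[str, int], List[str]],   # hypothetical: VLM as proposer
    score: Callable[[str, str], float],         # hypothetical: VLM as filter
    similarity: Callable[[str, str], float],    # hypothetical diversity measure
    n_candidates: int = 4,
    min_score: float = 0.3,
    max_similarity: float = 0.8,
) -> List[Tuple[str, str]]:
    """One illustrative round: propose candidates per image, keep those the
    filter scores highly enough, and drop near-duplicates of kept questions."""
    kept: List[Tuple[str, str]] = []
    for image in images:
        for question in propose(image, n_candidates):
            # Filter step: discard questions the filter rates too low.
            if score(image, question) < min_score:
                continue
            # Diversity step: discard questions too similar to one already kept.
            if any(similarity(question, q) > max_similarity for _, q in kept):
                continue
            kept.append((image, question))
    return kept


# Toy stand-ins so the sketch runs without a model (assumptions, not a real VLM).
def toy_propose(image: str, n: int) -> List[str]:
    templates = [
        "What is in {img}?",
        "What is in {img}?",  # exact duplicate, should be removed
        "How many objects left of the red one in {img} are partially occluded?",
        "Is it?",             # trivial, should be filtered out
    ]
    return [t.format(img=image) for t in templates[:n]]


def toy_score(image: str, question: str) -> float:
    # Crude proxy: longer questions count as "harder", capped at 1.0.
    return min(len(question.split()) / 10.0, 1.0)


def jaccard(a: str, b: str) -> float:
    # Word-level Jaccard overlap between two questions.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


kept_questions = self_evolve_round(["img_a"], toy_propose, toy_score, jaccard)
```

In a real system the kept (image, question) pairs would presumably feed the questioner and answerer training described above; how that is done is not stated in the abstract, so it is left out of the sketch.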

Data Curation & Synthetic Data | Multimodal Models | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
