AI2 | Paul G. Allen School of Computer Science | Mar 25, 2026 | arXiv:2603.24575

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Qi He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Dan Weld, Ranjay Krishna

AI Summary

The paper introduces VFIG, a family of Vision-Language Models (VLMs) for converting rasterized figures into SVG format, addressing the challenge of lost or inaccessible vector source files. To train these models, the authors created VFIG-DATA, a large-scale dataset of 66K figure-SVG pairs, and employed a coarse-to-fine training curriculum involving supervised fine-tuning and reinforcement learning. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, demonstrating its ability to reconstruct complex figures with high fidelity.

Key Contribution

VFIG, a family of vision-language models, automatically converts rasterized figures into editable SVGs and performs on par with GPT-5.2 on VFIG-BENCH.

Abstract

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
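
As a rough illustration of what the abstract means by "recurring primitives and hierarchical local structures", the sketch below builds a tiny two-box diagram from `<rect>`, `<text>`, and `<line>` primitives grouped under `<g>` elements, using only Python's standard library. It is not the authors' code or data format: the helper names (`make_node`, `build_diagram`, `count_primitives`), the box labels, and the coordinates are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_node(parent, node_id, x, y, label, width=120, height=40):
    """Add one labeled box: a <g> group holding a <rect> and a <text>.

    Hypothetical helper; the same rect+text pattern recurs for every box.
    """
    group = ET.SubElement(parent, "g", {"id": node_id})
    ET.SubElement(group, "rect", {
        "x": str(x), "y": str(y),
        "width": str(width), "height": str(height),
        "fill": "white", "stroke": "black",
    })
    text = ET.SubElement(group, "text", {
        "x": str(x + width / 2), "y": str(y + height / 2),
        "text-anchor": "middle", "dominant-baseline": "middle",
    })
    text.text = label
    return group


def build_diagram():
    """Build a two-box diagram joined by a straight connector line."""
    svg = ET.Element("svg", {"xmlns": SVG_NS, "viewBox": "0 0 320 80"})
    make_node(svg, "node-a", 10, 20, "Encoder")   # illustrative label
    make_node(svg, "node-b", 190, 20, "Decoder")  # illustrative label
    ET.SubElement(svg, "line", {
        "x1": "130", "y1": "40", "x2": "190", "y2": "40", "stroke": "black",
    })
    return svg


def count_primitives(svg):
    """Count elements by tag name to expose the repeated structure."""
    counts = {}
    for element in svg.iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts


svg_text = ET.tostring(build_diagram(), encoding="unicode")
print(svg_text)
print(count_primitives(build_diagram()))
```

Because every shape here is an addressable element rather than a block of pixels, a label or position can be changed by editing a single attribute, which is the editability that is lost when only a rasterized version of a figure survives.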

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
